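Pandas is a Python data analysis library built on NumPy. It is designed for tabular and heterogeneous data,
and it handles labeled business data (Excel, CSV, database tables) better than plain NumPy. It is a core tool for data analysis, data mining, and machine learning.

A simple way to think of it is as **Excel for Python**.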

In [1]:
import pandas as pd  # pd is the conventional alias for pandas
import numpy as np   # pandas depends on NumPy, so the two are usually imported together

**In short: use a Series for one-dimensional data and a DataFrame for two-dimensional tabular data.**

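Series (one-dimensional sequence)
    A Series is a one-dimensional array with an index. It has two parts:

    
    values: the data, typically backed by a NumPy array; it can hold numbers, strings, booleans, or other types
    index: the labels; by default the integers 0, 1, 2, 3, ..., but a custom index can be supplied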
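
The two parts of a Series can be inspected directly. The following is a minimal sketch; the variable name `s` and the labels "a", "b", "c" are made up for illustration.

```python
import pandas as pd

# A small Series with a custom index (labels are illustrative only)
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s.values)        # the underlying array of values
print(type(s.values))  # a NumPy ndarray for plain numeric data
print(s.index)         # the labels attached to each value
print(s.dtype)         # the element type shared by all values
```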


In [2]:
# Option 1: default index
s1 = pd.Series([10, 20, 30, 40, 50])
print(s1)  # the last line of the output shows the dtype of the Series

# Common dtypes include int64, float64, object, and bool

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [3]:

# Option 2: custom index (the labels mean Chinese, Math, English)
s2 = pd.Series([10, 20, 30], index=["语文", "数学", "英语"])
# The data can be a list, tuple, dict, NumPy array, or another Series
print(s2)


语文    10
数学    20
英语    30
dtype: int64


In [4]:
print(s1.loc)
print(s1.loc[0])  # access an element by its label

<pandas.core.indexing._LocIndexer object at 0x77f2132d9a40>
10


In [5]:
s2.loc["语文"] = 20  # assign by label
print(s2.loc["语文"])

s1.iloc[0] = 25  # assign by integer position
print(s1.iloc[0])

20
25
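
With the default index, labels and positions coincide, so `.loc[0]` and `.iloc[0]` return the same element. The difference shows up once the labels are not 0, 1, 2, ... in order. A minimal sketch, using a made-up Series whose integer labels are reversed:

```python
import pandas as pd

# Integer labels deliberately out of order (illustrative data)
s = pd.Series([10, 20, 30], index=[2, 1, 0])

label_value = s.loc[0]      # looks up the label 0
position_value = s.iloc[0]  # looks up the first position

print(label_value)     # 30
print(position_value)  # 10
```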


In [6]:
print(s1[s1 >=20])

0    25
1    20
2    30
3    40
4    50
dtype: int64


In [7]:
# When a Series is built from a dict, the dict keys become the index labels
calories = {"Day 1": 1750, "Day 2": 2100, "Day 3": 1700}
series = pd.Series(calories)
print(series)

print(series.loc["Day 1"])
series.loc["Day 1"] += 500 
print(series.loc["Day 1"]) 

Day 1    1750
Day 2    2100
Day 3    1700
dtype: int64
1750
2250


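DataFrame (two-dimensional table)
    A DataFrame is a collection of Series that share the same index, with each Series stored under a column name.

    Besides that shared row index, a DataFrame has two parts:

    
    data: the values, typically backed by NumPy arrays
    columns: the column labels; by default the integers 0, 1, 2, 3, ..., but custom names can be supplied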
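
These parts can be inspected on any DataFrame. The sketch below uses a made-up two-row table (`df_demo` and its contents are illustrative) and also checks that a single column is a Series sharing the DataFrame's index.

```python
import pandas as pd

# Illustrative data only
df_demo = pd.DataFrame(
    {"Name": ["A", "B"], "Age": [30, 35]},
    index=["r1", "r2"],
)

print(df_demo.index)       # row labels
print(df_demo.columns)     # column labels
print(df_demo.to_numpy())  # the values as a 2-D NumPy array

age = df_demo["Age"]       # one column of the table
print(type(age))                         # a pandas Series
print(age.index.equals(df_demo.index))   # same index as the DataFrame
```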

In [8]:
import pandas as pd

data = {
    "Name": ["Spongebob", "Patrick", "Squidward"],
    "Age": [30, 35, 50]
}

df = pd.DataFrame(data, index=["Employee1", "Employee2", "Employee3"])
print(df)


                Name  Age
Employee1  Spongebob   30
Employee2    Patrick   35
Employee3  Squidward   50


In [9]:
print(df.loc["Employee1"])

Name    Spongebob
Age            30
Name: Employee1, dtype: object


In [10]:
# Add a new column
df["Job"] = ["Data Scientist", "Data Analyst", "Data Engineer"]

print(df)

                Name  Age             Job
Employee1  Spongebob   30  Data Scientist
Employee2    Patrick   35    Data Analyst
Employee3  Squidward   50   Data Engineer


In [11]:
# Add a new row
new_row = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'], index=["Employee4"])  # shows what happens when the column labels do not align
df = pd.concat([df,new_row])
print(df)

                Name   Age             Job    a    b    c
Employee1  Spongebob  30.0  Data Scientist  NaN  NaN  NaN
Employee2    Patrick  35.0    Data Analyst  NaN  NaN  NaN
Employee3  Squidward  50.0   Data Engineer  NaN  NaN  NaN
Employee4        NaN   NaN             NaN  1.0  2.0  3.0
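
For contrast, here is a minimal sketch of the aligned case: when the new row uses the same column labels as the existing table, `pd.concat` appends it without introducing NaN. The table `staff` and the "Sandy" row are hypothetical values chosen for illustration.

```python
import pandas as pd

staff = pd.DataFrame(
    {"Name": ["Spongebob", "Patrick"], "Age": [30, 35]},
    index=["Employee1", "Employee2"],
)

# The new row uses the same column labels, so values line up column by column
new_row = pd.DataFrame([["Sandy", 28]], columns=["Name", "Age"], index=["Employee3"])

staff = pd.concat([staff, new_row])
print(staff)
print(staff.isna().sum().sum())  # total number of missing cells
```

Because no NaN is introduced, the Age column also keeps its integer dtype instead of being converted to float.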


In [12]:
# Filter rows where Age >= 20 (the Employee4 row has NaN in Age, so it is excluded)
df[df["Age"] >= 20]


                Name   Age             Job    a    b    c
Employee1  Spongebob  30.0  Data Scientist  NaN  NaN  NaN
Employee2    Patrick  35.0    Data Analyst  NaN  NaN  NaN
Employee3  Squidward  50.0   Data Engineer  NaN  NaN  NaN


In [13]:
df = pd.read_csv("data.csv", index_col="NOC")  # a column can be used as the index
#df = pd.read_json("data.json")
print(df)  # long DataFrames are truncated to the first and last five rows by default

                      Rank  Gold  Silver  Bronze  Total  Year
NOC                                                          
United States            1    11       7       2     20  1896
Greece                   2    10      18      19     47  1896
Germany                  3     6       5       2     13  1896
France                   4     5       4       2     11  1896
Great Britain            5     2       3       2      7  1896
...                    ...   ...     ...     ...    ...   ...
Qatar                   84     0       0       1      1  2024
Refugee Olympic Team    84     0       0       1      1  2024
Singapore               84     0       0       1      1  2024
Slovakia                84     0       0       1      1  2024
Zambia                  84     0       0       1      1  2024

[1435 rows x 6 columns]


In [14]:
df = pd.read_csv("data.csv")
#df = pd.read_json("data.json")
print(df)  # long DataFrames are truncated to the first and last five rows by default

      Rank                   NOC  Gold  Silver  Bronze  Total  Year
0        1         United States    11       7       2     20  1896
1        2                Greece    10      18      19     47  1896
2        3               Germany     6       5       2     13  1896
3        4                France     5       4       2     11  1896
4        5         Great Britain     2       3       2      7  1896
...    ...                   ...   ...     ...     ...    ...   ...
1430    84                 Qatar     0       0       1      1  2024
1431    84  Refugee Olympic Team     0       0       1      1  2024
1432    84             Singapore     0       0       1      1  2024
1433    84              Slovakia     0       0       1      1  2024
1434    84                Zambia     0       0       1      1  2024

[1435 rows x 7 columns]
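
The effect of `index_col` can be reproduced without the `data.csv` file. The sketch below reads a two-row CSV from an in-memory string via `io.StringIO`; the text in `csv_text` is a small excerpt assumed to match the file's layout.

```python
import io
import pandas as pd

# A tiny stand-in for data.csv (assumed layout: header row, comma-separated)
csv_text = "Rank,NOC,Gold\n1,United States,11\n2,Greece,10\n"

default_idx = pd.read_csv(io.StringIO(csv_text))                 # integer index 0, 1, ...
noc_idx = pd.read_csv(io.StringIO(csv_text), index_col="NOC")    # NOC becomes the index

print(list(default_idx.columns))     # NOC is an ordinary column
print(list(noc_idx.columns))         # NOC is no longer among the columns
print(noc_idx.loc["Greece", "Gold"]) # rows can now be looked up by country name
```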


In [15]:
print(df.to_string())  # print every row

      Rank                               NOC  Gold  Silver  Bronze  Total  Year
0        1                     United States    11       7       2     20  1896
1        2                            Greece    10      18      19     47  1896
2        3                           Germany     6       5       2     13  1896
3        4                            France     5       4       2     11  1896
4        5                     Great Britain     2       3       2      7  1896
5        6                           Hungary     2       1       3      6  1896
6        7                           Austria     2       1       2      5  1896
7        8                         Australia     2       0       0      2  1896
8        9                           Denmark     1       2       3      6  1896
9       10                       Switzerland     1       2       0      3  1896
10      11                        Mixed team     1       0       1      2  1896
11       1                            Fr

In [16]:
# Select a column
print(df['NOC'])  # a single column is a Series

0              United States
1                     Greece
2                    Germany
3                     France
4              Great Britain
                ...         
1430                   Qatar
1431    Refugee Olympic Team
1432               Singapore
1433                Slovakia
1434                  Zambia
Name: NOC, Length: 1435, dtype: object


In [17]:
print(df[['NOC', 'Rank']])  # a list of column names returns a sub-DataFrame

                       NOC  Rank
0            United States     1
1                   Greece     2
2                  Germany     3
3                   France     4
4            Great Britain     5
...                    ...   ...
1430                 Qatar    84
1431  Refugee Olympic Team    84
1432             Singapore    84
1433              Slovakia    84
1434                Zambia    84

[1435 rows x 2 columns]


In [18]:
# Select a row by label
print(df.loc[0])

Rank                  1
NOC       United States
Gold                 11
Silver                7
Bronze                2
Total                20
Year               1896
Name: 0, dtype: object


In [19]:
# Select a row by label and keep only the columns needed
print(df.loc[0,["NOC","Rank"]])

NOC     United States
Rank                1
Name: 0, dtype: object


In [20]:
# Slicing is supported: start 0, stop 11, step 2 (.loc includes the stop label)
print(df.loc[0:11:2,["NOC","Rank"]])

              NOC  Rank
0   United States     1
2         Germany     3
4   Great Britain     5
6         Austria     7
8         Denmark     9
10     Mixed team    11


In [21]:
# Slicing is supported: start 0, stop 11, step 2
print(df.iloc[0:11:2, 0:3])  # note: iloc is position-based and excludes the stop position

    Rank            NOC  Gold
0      1  United States    11
2      3        Germany     6
4      5  Great Britain     2
6      7        Austria     2
8      9        Denmark     1
10    11     Mixed team     1
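
The two slices above print the same rows only because the step of 2 never lands on 11. In general, `.loc` includes the stop label while `.iloc` excludes the stop position. A minimal sketch on a made-up six-row table:

```python
import pandas as pd

nums = pd.DataFrame({"x": range(6)})  # default index 0..5 (illustrative data)

by_label = nums.loc[0:4:2]      # label-based: the stop label 4 is included
by_position = nums.iloc[0:4:2]  # position-based: position 4 is excluded

print(list(by_label.index))     # [0, 2, 4]
print(list(by_position.index))  # [0, 2]
```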


In [22]:
# Filter rows with a boolean condition
major_NOC = df[df['Rank'] <= 1]
print(major_NOC)

      Rank             NOC  Gold  Silver  Bronze  Total  Year
0        1   United States    11       7       2     20  1896
11       1          France    27      39      37    103  1900
32       1   United States    76      78      77    231  1904
45       1   Great Britain    56      51      39    146  1908
64       1   United States    26      19      19     64  1912
82       1   United States    41      27      27     95  1920
104      1   United States    45      27      27     99  1924
131      1   United States    22      18      16     56  1928
164      1  United States     44      36      30    110  1932
192      1         Germany    38      31      32    101  1936
224      1   United States    38      27      19     84  1948
261      1   United States    40      19      17     76  1952
304      1    Soviet Union    37      29      32     98  1956
342      1   Soviet Union     43      29      31    103  1960
386      1   United States    36      26      28     90  1964
427     

In [23]:
# Filter rows by an exact string match
Selected_NOC = df[df['NOC'] == 'Soviet Union']
print(Selected_NOC)

     Rank           NOC  Gold  Silver  Bronze  Total  Year
262     2  Soviet Union    22      30      19     71  1952
304     1  Soviet Union    37      29      32     98  1956
387     2  Soviet Union    30      31      35     96  1964
428     2  Soviet Union    29      32      30     91  1968
471     1  Soviet Union    50      27      22     99  1972
519     1  Soviet Union    49      41      35    125  1976
560     1  Soviet Union    80      69      46    195  1980
643     1  Soviet Union    55      31      46    132  1988


In [24]:
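# Filter on multiple conditions: combine masks with & (and), | (or), ~ (not), and wrap each condition in parentheses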
Selected_NOC = df[(df['NOC'] == 'Soviet Union') & (df['Rank'] == 1)]
print(Selected_NOC)

     Rank           NOC  Gold  Silver  Bronze  Total  Year
304     1  Soviet Union    37      29      32     98  1956
471     1  Soviet Union    50      27      22     99  1972
519     1  Soviet Union    49      41      35    125  1976
560     1  Soviet Union    80      69      46    195  1980
643     1  Soviet Union    55      31      46    132  1988
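
The other combinations follow the same pattern. The sketch below uses a made-up four-row table (`medals`) to show `|`, `~`, and `isin`, the last being a common shorthand for "equals any of these values".

```python
import pandas as pd

# Illustrative data only
medals = pd.DataFrame({
    "NOC": ["A", "B", "C", "A"],
    "Rank": [1, 2, 1, 3],
})

either = medals[(medals["NOC"] == "A") | (medals["Rank"] == 1)]  # OR
not_first = medals[~(medals["Rank"] == 1)]                       # NOT
in_set = medals[medals["NOC"].isin(["B", "C"])]                  # membership test

print(list(either.index))     # [0, 2, 3]
print(list(not_first.index))  # [1, 3]
print(list(in_set.index))     # [1, 2]
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbol operators are used here.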


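Frequently used pandas functions

    df["col"].unique() → list the unique values in a column
    df["col"].value_counts() → count how often each value appears in a column
    df.astype() → change the data type (e.g. df["Score"].astype(float))
    df.drop_duplicates() → drop duplicate rows
    df.isnull() / df.notnull() → detect missing values
    df.replace(old, new) → replace a given value (e.g. df.replace(88, 90))
    df.apply(func) → apply a custom function along columns or rows (advanced)
    df.merge(df2) → join two tables (similar to a SQL join)
    pd.concat([df1, df2]) → concatenate multiple tables
    df.query() → filter with a condition written as a string (e.g. df.query("Score > 80"))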
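
Several of the functions listed above can be tried on a tiny table. This is a minimal sketch; the tables `scores` and `clubs` and all their values are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cid"],
    "Score": [88, 75, 88, 92],
})

unique_names = scores["Name"].unique()        # distinct values of one column
name_counts = scores["Name"].value_counts()   # frequency of each value
deduped = scores.drop_duplicates()            # rows 0 and 2 are identical
as_float = scores["Score"].astype(float)      # int -> float
replaced = scores.replace(88, 90)             # swap every 88 for 90
high = scores.query("Score > 80")             # string-based filter

clubs = pd.DataFrame({"Name": ["Ann", "Bob"], "Club": ["Chess", "Go"]})
merged = scores.merge(clubs, on="Name")       # inner join on Name, so Cid is dropped
stacked = pd.concat([scores, scores], ignore_index=True)  # stack rows

print(list(unique_names))
print(name_counts["Ann"])
print(len(deduped), len(high), len(merged), len(stacked))
```

Each call returns a new object and leaves `scores` unchanged.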

In [25]:
# print(df.mean())  # raises an error in recent pandas versions when non-numeric columns are present
print(df.mean(numeric_only=True))

Rank        30.334495
Gold         4.048084
Silver       4.025784
Bronze       4.384669
Total       12.458537
Year      1981.828571
dtype: float64


In [26]:
# print(df.sum())  # without numeric_only=True, the non-numeric NOC column is included as well (its strings are concatenated)
print(df.sum(numeric_only=True))

Rank        43530
Gold         5809
Silver       5777
Bronze       6292
Total       17878
Year      2843924
dtype: int64


In [27]:
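# print(df.min())  # without numeric_only=True, the non-numeric NOC column is included as well (strings are compared alphabetically)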
print(df.min(numeric_only=True))

Rank         1
Gold         0
Silver       0
Bronze       0
Total        1
Year      1896
dtype: int64


In [28]:
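# print(df.max())  # without numeric_only=True, the non-numeric NOC column is included as well (strings are compared alphabetically)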
print(df.max(numeric_only=True))

Rank        86
Gold        83
Silver      78
Bronze      77
Total      231
Year      2024
dtype: int64


In [29]:
print(df.count())  # count() excludes NaN values

Rank      1435
NOC       1435
Gold      1435
Silver    1435
Bronze    1435
Total     1435
Year      1435
dtype: int64


In [30]:
print(df["Rank"].sum())

43530
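
The separate calls to `mean`, `min`, and `max` can also be combined into one summary table with `agg`. A minimal sketch on a made-up three-row table (`t`); selecting the numeric columns first avoids the non-numeric column entirely.

```python
import pandas as pd

# Illustrative data only
t = pd.DataFrame({
    "NOC": ["A", "B", "C"],
    "Gold": [1, 2, 6],
    "Total": [3, 5, 10],
})

# One row per statistic, one column per numeric field
summary = t[["Gold", "Total"]].agg(["mean", "min", "max"])
print(summary)
```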


In [31]:
groups = df.groupby('NOC')
print(groups)
for group in groups:
    print(group)


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x77f21331b050>
('Afghanistan',       Rank          NOC  Gold  Silver  Bronze  Total  Year
1073    82  Afghanistan     0       0       1      1  2008
1157    79  Afghanistan     0       0       1      1  2012)
('Albania',       Rank      NOC  Gold  Silver  Bronze  Total  Year
1423    80  Albania     0       0       2      2  2024)
('Algeria',       Rank      NOC  Gold  Silver  Bronze  Total  Year
637     42  Algeria     0       0       2      2  1984
728     34  Algeria     1       0       1      2  1992
792     34  Algeria     2       0       1      3  1996
879     42  Algeria     1       1       3      5  2000
1059    68  Algeria     0       1       1      2  2008
1128    50  Algeria     1       0       0      1  2012
1227    63  Algeria     0       2       0      2  2016
1382    39  Algeria     2       0       1      3  2024)
('Argentina',       Rank        NOC  Gold  Silver  Bronze  Total  Year
119     16  Argentina     1     

In [32]:
groups = df.groupby('NOC')
print(groups["Rank"].mean())

NOC
Afghanistan     80.500000
Albania         80.000000
Algeria         46.500000
Argentina       35.368421
Argentina       21.000000
                  ...    
West Germany     4.800000
Yugoslavia      18.538462
Yugoslavia      18.000000
Zambia          62.666667
Zimbabwe        36.666667
Name: Rank, Length: 210, dtype: float64
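
The grouped output above lists Argentina and Yugoslavia twice. One plausible cause, which the source does not confirm, is stray whitespace in some NOC values (an earlier output shows "United States " with a trailing space). The sketch below uses made-up rows to show how such a value splits a group and how `str.strip()` merges it back.

```python
import pandas as pd

# Illustrative rows; the second "Argentina " has a trailing space
raw = pd.DataFrame({
    "NOC": ["Argentina", "Argentina ", "Zambia"],
    "Rank": [30, 20, 60],
})

before = raw.groupby("NOC")["Rank"].mean()  # "Argentina" and "Argentina " are separate groups

raw["NOC"] = raw["NOC"].str.strip()         # remove leading and trailing whitespace
after = raw.groupby("NOC")["Rank"].mean()   # the two Argentina rows now share one group

print(len(before), len(after))
print(after["Argentina"])
```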


In [37]:
print(df)
df_1 = df.drop(columns=["Total"])
print(df_1)

      Rank                   NOC  Gold  Silver  Bronze  Total  Year
0        1         United States    11       7       2     20  1896
1        2                Greece    10      18      19     47  1896
2        3               Germany     6       5       2     13  1896
3        4                France     5       4       2     11  1896
4        5         Great Britain     2       3       2      7  1896
...    ...                   ...   ...     ...     ...    ...   ...
1430    84                 Qatar     0       0       1      1  2024
1431    84  Refugee Olympic Team     0       0       1      1  2024
1432    84             Singapore     0       0       1      1  2024
1433    84              Slovakia     0       0       1      1  2024
1434    84                Zambia     0       0       1      1  2024

[1435 rows x 7 columns]
      Rank                   NOC  Gold  Silver  Bronze  Year
0        1         United States    11       7       2  1896
1        2                Greece    1

In [40]:
df_2 = df.dropna(subset=['Bronze'])  # drop rows whose Bronze value is missing
df_3 = df.fillna({'Bronze': 0})  # fill missing Bronze values with 0
df["NOC"] = df["NOC"].replace({"United States": "USA"})  # replace a value in the NOC column
print(df)

      Rank                   NOC  Gold  Silver  Bronze  Total  Year
0        1                   USA    11       7       2     20  1896
1        2                Greece    10      18      19     47  1896
2        3               Germany     6       5       2     13  1896
3        4                France     5       4       2     11  1896
4        5         Great Britain     2       3       2      7  1896
...    ...                   ...   ...     ...     ...    ...   ...
1430    84                 Qatar     0       0       1      1  2024
1431    84  Refugee Olympic Team     0       0       1      1  2024
1432    84             Singapore     0       0       1      1  2024
1433    84              Slovakia     0       0       1      1  2024
1434    84                Zambia     0       0       1      1  2024

[1435 rows x 7 columns]
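
The medal table appears to have no missing values (every column's `count()` equals the row count), so `dropna` and `fillna` have no visible effect on it. A minimal sketch on a made-up table (`m`) with one missing Bronze value shows what each call does:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value
m = pd.DataFrame({
    "NOC": ["A", "B", "C"],
    "Bronze": [1.0, np.nan, 3.0],
})

dropped = m.dropna(subset=["Bronze"])  # removes the row for "B"
filled = m.fillna({"Bronze": 0})       # keeps the row and stores 0 instead

print(dropped)
print(filled)
print(m["Bronze"].isna().sum())        # the original table is unchanged
```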
