# Pandas 1

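This document is a tutorial on using the Pandas library to process and analyze historical U.S. election data. The key points it covers are:

1. **Importing libraries**: import the required libraries with `import numpy as np` and `import pandas as pd`.

2. **Reading data**: read a CSV file into a Pandas DataFrame with `pd.read_csv("data/elections.csv")`.

3. **Viewing data**:
   - Use `.head()` to view the first few rows.
   - Use `.tail()` to view the last few rows.

4. **Selecting data**:
   - Select data with a row slice, a list of column names, or a single column name.
   - When selecting, you can use `.loc` (label-based indexing) and `.iloc` (integer-position indexing).

5. **Conditional selection**:
   - Use boolean indexing, e.g. `elections[elections["Party"] == "Independent"]`.

6. **Filtering**:
   - Use `.isin()` to keep items that appear in a given list.
   - Use `.str.startswith()` to keep strings that start with a given prefix.
   - Use `.query()` for SQL-like conditional filtering.

7. **Statistics**:
   - Use `.mean()` to compute the average.
   - Use `.max()` to find the maximum.

8. **Utility methods**:
   - `.shape` gives the `(rows, columns)` tuple of a DataFrame; `.size` gives the total number of elements (rows × columns).
   - `.describe()` gives a statistical summary of the data.
   - `.sample()` draws a random sample.
   - `.value_counts()` counts how often each unique value occurs.
   - `.unique()` lists all unique values.
   - `.sort_values()` sorts the data.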

In [1]:
import numpy as np
import pandas as pd

Read the file

In [2]:
elections = pd.read_csv("data/elections.csv")
elections

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


### head

In [12]:
elections.head(5) 

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789


In [13]:
elections.tail(5)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979
181,2020,Howard Hawkins,Green,405035,loss,0.255731


# `[]` takes a single argument in one of the following forms
- a slice of row numbers -- DF
- a list of column labels (in the order given) -- DF
- a single column label -- Series
### `[]` is used more often than `loc` for selecting a single column!


In [42]:
elections[1:3] # a slice of row numbers

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927


In [51]:
elections["Year"].head(5) # a single column label

0    1824
1    1824
2    1828
3    1828
4    1832
Name: Year, dtype: int64

In [52]:
elections[["Year"]].head(5)

Unnamed: 0,Year
0,1824
1,1824
2,1828
3,1828
4,1832


Wrapping the column name in a list returns a DataFrame instead of a Series, so it renders as a nicer table.

In [50]:
elections[["Year","Result","Party"]].tail(5) # a list of column labels, then the last 5 rows

Unnamed: 0,Year,Result,Party
177,2016,loss,Green
178,2020,win,Democratic
179,2020,loss,Republican
180,2020,loss,Libertarian
181,2020,loss,Green


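The column names in the list do not have to follow the DataFrame's own column order: the result uses the order of the list, as with `["Year","Result","Party"]` above.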
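To make the three forms of `[]` concrete, here is a minimal sketch. `demo` is a hypothetical stand-in for `elections`, built by hand from the first five rows shown in these notes so the snippet runs without `data/elections.csv`.

```python
import pandas as pd

# Hypothetical stand-in for `elections`: the first five rows shown in the notes.
demo = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic",
              "National Republican", "Democratic"],
    "Popular vote": [151271, 113142, 642806, 500897, 702735],
    "Result": ["loss", "win", "win", "loss", "win"],
    "%": [57.210122, 42.789878, 56.203927, 43.796073, 54.574789],
})

rows = demo[1:3]                 # slice of row numbers -> DataFrame
cols = demo[["Result", "Year"]]  # list of column labels, in any order -> DataFrame
one = demo["Year"]               # single column label -> Series

print(type(rows).__name__, type(cols).__name__, type(one).__name__)
print(list(cols.columns))  # columns come back in the order of the list
```

The list form keeps whatever order the list has, which is handy for rearranging columns for display.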

# loc and iloc
- loc selects items by label
- iloc selects items by number
- iloc slices are half-open (start included, end excluded), while loc slices include both endpoints
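The difference in slice endpoints is easiest to see side by side. This is a small sketch on a hypothetical `demo` table (three columns of the first five rows shown in these notes), not on the full CSV.

```python
import pandas as pd

# Hypothetical stand-in for `elections`, using rows shown in the notes.
demo = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic",
              "National Republican", "Democratic"],
})

# loc: label slices include BOTH endpoints -> rows 0, 1, 2 and Year..Party
by_label = demo.loc[0:2, "Year":"Party"]

# iloc: position slices exclude the end -> rows 0, 1 and columns 0, 1
by_position = demo.iloc[0:2, 0:2]

print(by_label.shape)     # (3, 3)
print(by_position.shape)  # (2, 2)
```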

## Using loc
It is generally used more often than iloc

In [18]:
elections.loc[0:4,"Year":"Party"]

Unnamed: 0,Year,Candidate,Party
0,1824,Andrew Jackson,Democratic-Republican
1,1824,John Quincy Adams,Democratic-Republican
2,1828,Andrew Jackson,Democratic
3,1828,John Quincy Adams,National Republican
4,1832,Andrew Jackson,Democratic


This selects rows 0 through 4 of the `elections` DataFrame (5 rows in total) and the columns from "Year" through "Party" (3 columns in total).

In [21]:
elections.loc[[15,25,35],["Year","Party","Result"]]

Unnamed: 0,Year,Party,Result
15,1848,Free Soil,loss
25,1860,Southern Democratic,loss
35,1880,Greenback,loss


In [26]:
elections.loc[15,"Year":"Party"] 

Year                     1848
Candidate    Martin Van Buren
Party               Free Soil
Name: 15, dtype: object

When a single row is selected, pandas returns it as a Series rather than a DataFrame.

In [30]:
elections.loc["Year":"Party"] # with no row selector, the slice is applied to the row labels, not the columns

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%


In [29]:
elections.loc[:,"Year":"Party"] # all rows, columns "Year" through "Party"

Unnamed: 0,Year,Candidate,Party
0,1824,Andrew Jackson,Democratic-Republican
1,1824,John Quincy Adams,Democratic-Republican
2,1828,Andrew Jackson,Democratic
3,1828,John Quincy Adams,National Republican
4,1832,Andrew Jackson,Democratic
...,...,...,...
177,2016,Jill Stein,Green
178,2020,Joseph Biden,Democratic
179,2020,Donald Trump,Republican
180,2020,Jo Jorgensen,Libertarian


## iloc

In [31]:
elections.iloc[[1,3,5],[0,1,2]]

Unnamed: 0,Year,Candidate,Party
1,1824,John Quincy Adams,Democratic-Republican
3,1828,John Quincy Adams,National Republican
5,1832,Henry Clay,National Republican


In [33]:
elections.iloc[1:5,0:2]

Unnamed: 0,Year,Candidate
1,1824,John Quincy Adams
2,1828,Andrew Jackson
3,1828,John Quincy Adams
4,1832,Andrew Jackson


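### 1. Note that iloc slices are half-open (end excluded), while loc slices include both endpoints
### 2. Note that a slice does not need to be wrapped in an extra pair of brackets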

In [36]:
elections.iloc[:,0:3] # all the rows and certain columns

Unnamed: 0,Year,Candidate,Party
0,1824,Andrew Jackson,Democratic-Republican
1,1824,John Quincy Adams,Democratic-Republican
2,1828,Andrew Jackson,Democratic
3,1828,John Quincy Adams,National Republican
4,1832,Andrew Jackson,Democratic
...,...,...,...
177,2016,Jill Stein,Green
178,2020,Joseph Biden,Democratic
179,2020,Donald Trump,Republican
180,2020,Jo Jorgensen,Libertarian


# Conditional Selection

## Selecting the candidates whose party is Independent

In [56]:
elections["Party"] == "Independent"

0      False
1      False
2      False
3      False
4      False
       ...  
177    False
178    False
179    False
180    False
181    False
Name: Party, Length: 182, dtype: bool

In [57]:
elections[elections["Party"] == "Independent"] # keeps only the rows where the mask is True

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
121,1976,Eugene McCarthy,Independent,740460,loss,0.911649
130,1980,John B. Anderson,Independent,5719850,loss,6.631143
143,1992,Ross Perot,Independent,19743821,loss,18.956298
161,2004,Ralph Nader,Independent,465151,loss,0.380663
167,2008,Ralph Nader,Independent,739034,loss,0.563842
174,2016,Evan McMullin,Independent,732273,loss,0.539546


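## Selecting candidates who won with less than 47% of the vote
This combines several operators: each comparison is wrapped in parentheses, and the two boolean Series are joined with `&` (element-wise AND).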
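As a sketch of how these operators combine, the snippet below builds the two masks separately and joins them with `&`, `|`, and `~`. `demo` is a hypothetical table made from the first five rows shown in these notes. The parentheses in the one-line form matter because `&` and `|` bind more tightly than `==` and `<` in Python.

```python
import pandas as pd

# Hypothetical stand-in for `elections`, using rows shown in the notes.
demo = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "Result": ["loss", "win", "win", "loss", "win"],
    "%": [57.210122, 42.789878, 56.203927, 43.796073, 54.574789],
})

won = demo["Result"] == "win"   # boolean Series, one entry per row
under_47 = demo["%"] < 47

both = demo[won & under_47]     # element-wise AND
either = demo[won | under_47]   # element-wise OR
not_won = demo[~won]            # element-wise NOT

print(list(both.index))     # [1]
print(list(either.index))   # [1, 2, 3, 4]
print(list(not_won.index))  # [0, 3]
```

Naming the masks first avoids the parentheses entirely and tends to read better once there are more than two conditions.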

In [59]:
elections[(elections["Result"] == "win") & (elections["%"] < 47)]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
20,1856,James Buchanan,Democratic,1835140,win,45.30608
23,1860,Abraham Lincoln,Republican,1855993,win,39.699408
47,1892,Grover Cleveland,Democratic,5553898,win,46.121393
70,1912,Woodrow Wilson,Democratic,6296284,win,41.933422
117,1968,Richard Nixon,Republican,31783783,win,43.565246
140,1992,Bill Clinton,Democratic,44909806,win,43.118485
173,2016,Donald Trump,Republican,62984828,win,46.407862


## Below is a rather repetitive way to write a selection

In [65]:
elections[(elections["Party"] == "Anti-Masonic") | (elections["Party"] == "American")]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
22,1856,Millard Fillmore,American,873053,loss,21.554001
126,1976,Thomas J. Anderson,American,158271,loss,0.194862


## Pandas offers some more concise filtering methods, such as:
- .isin
- .str.startswith
- .query
- .groupby.filter
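The last item, `.groupby.filter`, is listed here but not demonstrated in these notes. A minimal sketch, assuming a hypothetical `demo` table built from the first five rows shown earlier: it keeps every row belonging to a year in which some candidate received more than 55% of the vote. The threshold is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `elections`, using rows shown in the notes.
demo = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "%": [57.210122, 42.789878, 56.203927, 43.796073, 54.574789],
})

# The function receives one sub-DataFrame per year and returns True or False;
# all rows of a group are kept or dropped together.
kept = demo.groupby("Year").filter(lambda group: group["%"].max() > 55)

print(sorted(kept["Year"].unique()))  # [1824, 1828]
```

Unlike a boolean mask, the condition here is evaluated once per group rather than once per row.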

### .isin keeps the rows whose value appears in the given list

In [68]:
parties = ["Anti-Masonic","American","American Independent"]
elections[elections["Party"].isin(parties)]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
22,1856,Millard Fillmore,American,873053,loss,21.554001
115,1968,George Wallace,American Independent,9901118,loss,13.571218
119,1972,John G. Schmitz,American Independent,1100868,loss,1.421524
124,1976,Lester Maddox,American Independent,170274,loss,0.20964
126,1976,Thomas J. Anderson,American,158271,loss,0.194862


### .str.startswith keeps the rows whose string starts with a given prefix

In [72]:
elections[elections["Party"].str.startswith("A")]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
22,1856,Millard Fillmore,American,873053,loss,21.554001
38,1884,Benjamin Butler,Anti-Monopoly,134294,loss,1.335838
115,1968,George Wallace,American Independent,9901118,loss,13.571218
119,1972,John G. Schmitz,American Independent,1100868,loss,1.421524
124,1976,Lester Maddox,American Independent,170274,loss,0.20964
126,1976,Thomas J. Anderson,American,158271,loss,0.194862


### .query is a filtering method that resembles a SQL condition

In [73]:
elections.query('Year >= 2000 and Result == "win"')

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
152,2000,George W. Bush,Republican,50456002,win,47.974666
157,2004,George W. Bush,Republican,62040610,win,50.771824
162,2008,Barack Obama,Democratic,69498516,win,53.02351
168,2012,Barack Obama,Democratic,65915795,win,51.258484
173,2016,Donald Trump,Republican,62984828,win,46.407862
178,2020,Joseph Biden,Democratic,81268924,win,51.311515


In [74]:
parties = ["Republican","Democratic"]
elections.query('Result == "win" and Party not in @parties')

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
11,1840,William Henry Harrison,Whig,1275583,win,53.051213
16,1848,Zachary Taylor,Whig,1360235,win,47.309296
27,1864,Abraham Lincoln,National Union,2211317,win,54.951512


Using query to filter the rows, then picking a specific column

In [8]:
winners = elections.query('Result == "win"')["%"]
winners.head(5)

1     42.789878
2     56.203927
4     54.574789
8     52.272472
11    53.051213
Name: %, dtype: float64

In [9]:
np.mean(winners)

51.711492943

In [11]:
max(winners)

61.34470329

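## Some useful methods
- shape: (number of rows, number of columns)
- size: total number of elements, rows × columns
- describe: provides summary information such as the median, max, and min
- sample: random sampling, usually chained with other methods
- value_counts: counts how often each unique value occurs in a column
- unique: the values with duplicates removed
- sort_values: sorting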
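The relationship between `shape` and `size` can be checked directly. A small sketch on a hypothetical `demo` table (five rows and six columns copied from the top of these notes):

```python
import pandas as pd

# Hypothetical stand-in for `elections`, using rows shown in the notes.
demo = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic",
              "National Republican", "Democratic"],
    "Popular vote": [151271, 113142, 642806, 500897, 702735],
    "Result": ["loss", "win", "win", "loss", "win"],
    "%": [57.210122, 42.789878, 56.203927, 43.796073, 54.574789],
})

n_rows, n_cols = demo.shape  # shape is a (rows, columns) tuple
print(demo.shape)            # (5, 6)
print(demo.size)             # 30
print(demo.size == n_rows * n_cols)  # True
```

The same arithmetic holds for the full table in these notes: a shape of `(182, 6)` gives a size of 182 × 6 = 1092.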

In [12]:
elections.shape

(182, 6)

In [13]:
elections.size

1092

In [15]:
elections.describe()

Unnamed: 0,Year,Popular vote,%
count,182.0,182.0,182.0
mean,1934.087912,12353640.0,27.47035
std,57.048908,19077150.0,22.968034
min,1824.0,100715.0,0.098088
25%,1889.0,387639.5,1.219996
50%,1936.0,1709375.0,37.677893
75%,1988.0,18977750.0,48.354977
max,2020.0,81268920.0,61.344703


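### sample is a convenient method for random sampling. By default it never picks the same row twice, but `replace=True` allows repeats
It is usually combined with other methods, such as query, iloc, etc.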
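Because `sample` is random, its output changes from run to run. A minimal sketch, assuming a hypothetical `demo` table and an arbitrary seed of 0: passing `random_state` makes the draw repeatable, and `replace=True` allows drawing more rows than the table has.

```python
import pandas as pd

# Hypothetical stand-in for `elections`, using rows shown in the notes.
demo = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
})

# Same seed -> same rows on every run (the seed value itself is arbitrary).
first = demo.sample(3, random_state=0)
second = demo.sample(3, random_state=0)
print(first.equals(second))   # True
print(first.index.is_unique)  # True: no row is drawn twice without replacement

# With replacement, the sample may be larger than the table itself.
with_repeats = demo.sample(8, replace=True, random_state=0)
print(len(with_repeats))      # 8
```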

In [18]:
elections.sample(5).iloc[: , 0:2]

Unnamed: 0,Year,Candidate
161,2004,Ralph Nader
115,1968,George Wallace
53,1896,William McKinley
133,1984,Ronald Reagan
15,1848,Martin Van Buren


In [21]:
elections.query('Year == 2000').sample(4,replace = True).iloc[:, 0:2]

Unnamed: 0,Year,Candidate
151,2000,Al Gore
154,2000,Pat Buchanan
151,2000,Al Gore
154,2000,Pat Buchanan


### value_counts()

In [22]:
elections["Candidate"].value_counts() 

Candidate
Norman Thomas         5
Ralph Nader           4
Franklin Roosevelt    4
Eugene V. Debs        4
Andrew Jackson        3
                     ..
Silas C. Swallow      1
Alton B. Parker       1
John G. Woolley       1
Joshua Levering       1
Howard Hawkins        1
Name: count, Length: 132, dtype: int64

### unique()

In [25]:
elections["Party"].unique()

array(['Democratic-Republican', 'Democratic', 'National Republican',
       'Anti-Masonic', 'Whig', 'Free Soil', 'Republican', 'American',
       'Constitutional Union', 'Southern Democratic',
       'Northern Democratic', 'National Union', 'Liberal Republican',
       'Greenback', 'Anti-Monopoly', 'Prohibition', 'Union Labor',
       'Populist', 'National Democratic', 'Socialist', 'Progressive',
       'Farmer–Labor', 'Communist', 'Union', 'Dixiecrat',
       "States' Rights", 'American Independent', 'Independent',
       'Libertarian', 'Citizens', 'New Alliance', 'Taxpayers',
       'Natural Law', 'Green', 'Reform', 'Constitution'], dtype=object)

### sort_values()
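#### Called on a Series, no column argument is needed and the result is a Series
#### Called on a DataFrame, pass the column to sort by and the result is a DF
#### ascending=False sorts in descending order; the default is ascending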
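These three points can be sketched on a small table. `demo` is a hypothetical stand-in built from the first five rows shown in these notes; the multi-column sort at the end is an extra illustration that the notes do not use.

```python
import pandas as pd

# Hypothetical stand-in for `elections`, using rows shown in the notes.
demo = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "%": [57.210122, 42.789878, 56.203927, 43.796073, 54.574789],
})

# Series: no column argument, returns a Series.
shares = demo["%"].sort_values(ascending=False)
print(list(shares.index))  # [0, 2, 4, 3, 1]

# DataFrame: name the column to sort by, returns a DataFrame.
table = demo.sort_values("%", ascending=False)
print(list(table.index))   # [0, 2, 4, 3, 1]

# Several columns: newest year first, then smallest share first within a year.
multi = demo.sort_values(["Year", "%"], ascending=[False, True])
print(list(multi.index))   # [4, 3, 2, 1, 0]
```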

In [30]:
elections["Candidate"].sort_values()

75           Aaron S. Watkins
27            Abraham Lincoln
23            Abraham Lincoln
108           Adlai Stevenson
105           Adlai Stevenson
                ...          
19             Winfield Scott
37     Winfield Scott Hancock
74             Woodrow Wilson
70             Woodrow Wilson
16             Zachary Taylor
Name: Candidate, Length: 182, dtype: object

In [37]:
elections.sort_values("Candidate")

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
75,1920,Aaron S. Watkins,Prohibition,188787,loss,0.708351
27,1864,Abraham Lincoln,National Union,2211317,win,54.951512
23,1860,Abraham Lincoln,Republican,1855993,win,39.699408
108,1956,Adlai Stevenson,Democratic,26028028,loss,42.174464
105,1952,Adlai Stevenson,Democratic,27375090,loss,44.446312
...,...,...,...,...,...,...
19,1852,Winfield Scott,Whig,1386942,loss,44.056548
37,1880,Winfield Scott Hancock,Democratic,4444976,loss,48.278422
74,1916,Woodrow Wilson,Democratic,9126868,win,49.367987
70,1912,Woodrow Wilson,Democratic,6296284,win,41.933422


In [32]:
elections["%"].sort_values()

156     0.098088
141     0.101918
160     0.117542
148     0.118219
165     0.123442
         ...    
133    59.023326
79     60.574501
120    60.907806
91     60.978107
114    61.344703
Name: %, Length: 182, dtype: float64

In [34]:
elections.sort_values("%") # ascending by default; pass ascending=False for descending order

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
156,2004,David Cobb,Green,119859,loss,0.098088
141,1992,Bo Gritz,Populist,106152,loss,0.101918
160,2004,Michael Peroutka,Constitution,143630,loss,0.117542
148,1996,John Hagelin,Natural Law,113670,loss,0.118219
165,2008,Cynthia McKinney,Green,161797,loss,0.123442
...,...,...,...,...,...,...
133,1984,Ronald Reagan,Republican,54455472,win,59.023326
79,1920,Warren Harding,Republican,16144093,win,60.574501
120,1972,Richard Nixon,Republican,47168710,win,60.907806
91,1936,Franklin Roosevelt,Democratic,27752648,win,60.978107


## discussion

**1. We want to select the "Popular vote" column as a pd.Series. Which of the following lines of code will error?** (√ marks the lines that run without error; B, C, and E raise.)
- A) elections['Popular vote']   √
- B) elections.iloc['Popular vote']
- C) elections.loc['Popular vote']
- D) elections.loc[:, 'Popular vote']   √
- E) elections.iloc[:, 'Popular vote']

In [6]:
elections['Popular vote']

0        151271
1        113142
2        642806
3        500897
4        702735
         ...   
177     1457226
178    81268924
179    74216154
180     1865724
181      405035
Name: Popular vote, Length: 182, dtype: int64

In [8]:
elections.iloc['Popular vote'] # iloc must be given integers
# Correct form: elections.iloc[:,3]

TypeError: Cannot index by location index with a non-integer key

In [9]:
elections.loc['Popular vote'] # a single key passed to loc is looked up among the row labels
# Correct form: elections.loc[:, 'Popular vote']

KeyError: 'Popular vote'

In [10]:
elections.loc[:, 'Popular vote']

0        151271
1        113142
2        642806
3        500897
4        702735
         ...   
177     1457226
178    81268924
179    74216154
180     1865724
181      405035
Name: Popular vote, Length: 182, dtype: int64

In [11]:
elections.iloc[:, 'Popular vote'] # iloc must be given integers
# Correct form: elections.iloc[:,3]

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
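The five options can also be checked in one go. This is a sketch on a hypothetical `demo` table (first five rows from these notes); it records the exception class for each option instead of stopping at the first failure. The exact exception types may differ between pandas versions, so no output is shown for them.

```python
import pandas as pd

# Hypothetical stand-in for `elections`, using rows shown in the notes.
demo = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic",
              "National Republican", "Democratic"],
    "Popular vote": [151271, 113142, 642806, 500897, 702735],
})

attempts = {
    "A": lambda: demo["Popular vote"],
    "B": lambda: demo.iloc["Popular vote"],
    "C": lambda: demo.loc["Popular vote"],
    "D": lambda: demo.loc[:, "Popular vote"],
    "E": lambda: demo.iloc[:, "Popular vote"],
}

outcome = {}
for name, attempt in attempts.items():
    try:
        attempt()
        outcome[name] = "ok"
    except Exception as exc:  # record the error class instead of crashing
        outcome[name] = type(exc).__name__

print(outcome)

# The positional equivalent of A and D: "Popular vote" is column number 3.
print(demo.iloc[:, 3].equals(demo["Popular vote"]))  # True
```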

**2. Write one line of Pandas code that returns a pd.DataFrame that only contains election
results from the 1900s.**

In [19]:
elections.query('Year == 1900').loc[:,['Result']]

Unnamed: 0,Result
54,loss
55,loss
56,win
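The answer above keeps only the year 1900 and only the `Result` column. If "the 1900s" is instead read as the whole century (1900 through 1999) with all columns kept, a sketch could look like the following. `demo` is a hypothetical table built from five winners listed earlier in these notes, chosen so that some rows fall outside the range.

```python
import pandas as pd

# Hypothetical stand-in for `elections`, using rows shown in the notes.
demo = pd.DataFrame({
    "Year": [1892, 1912, 1968, 1992, 2000],
    "Candidate": ["Grover Cleveland", "Woodrow Wilson", "Richard Nixon",
                  "Bill Clinton", "George W. Bush"],
    "Result": ["win", "win", "win", "win", "win"],
})

# One line with query ...
century = demo.query("Year >= 1900 and Year < 2000")

# ... or one line with a boolean mask; between() includes both endpoints.
century_mask = demo[demo["Year"].between(1900, 1999)]

print(list(century["Year"]))          # [1912, 1968, 1992]
print(century.equals(century_mask))   # True
```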


**3. Write one line of Pandas code that returns a pd.Series, where the index is the Party,
and the values are how many times that party won an election.**
- Hint: use `value_counts()`.

In [24]:
elections.query('Result == "win"')['Party'].value_counts()

Party
Democratic               23
Republican               23
Whig                      2
Democratic-Republican     1
National Union            1
Name: count, dtype: int64