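- Requirements:
    - Import the files and inspect the raw data
    - Merge the population data with the state abbreviation data
    - Drop the redundant abbreviation column from the merged data
    - Check which columns contain missing data
    - Find which state/region values have NaN in the state column, and deduplicate them
    - Fill in the correct state value for those state/region entries, removing every NaN from the state column
    - Merge in the state area data (areas)
    - The area (sq. mi) column turns out to have missing data; find which rows are affected
    - Drop the rows that contain missing data
    - Select the total population data for 2010
    - Compute the population density of each state
    - Sort and find the five states with the highest population density, using df.sort_values()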

In [1]:
from pandas import Series,DataFrame
import numpy as np
import pandas as pd

- Import the files and inspect the raw data

In [10]:
abb = pd.read_csv('./data/state-abbrevs.csv')
abb.head(2)

Unnamed: 0,state,abbreviation
0,Alabama,AL
1,Alaska,AK


In [4]:
area = pd.read_csv('./data/state-areas.csv')
area

Unnamed: 0,state,area (sq. mi)
0,Alabama,52423
1,Alaska,656425
2,Arizona,114006
3,Arkansas,53182
4,California,163707
5,Colorado,104100
6,Connecticut,5544
7,Delaware,1954
8,Florida,65758
9,Georgia,59441


In [9]:
pop = pd.read_csv('./data/state-population.csv')
pop.head(2)

Unnamed: 0,state/region,ages,year,population
0,AL,under18,2012,1117489.0
1,AL,total,2012,4817528.0


- Merge the population data with the state abbreviation data

In [13]:
pop_abb = pd.merge(pop,abb,how="outer",left_on='state/region',right_on='abbreviation')
pop_abb.head()

Unnamed: 0,state/region,ages,year,population,state,abbreviation
0,AL,under18,2012,1117489.0,Alabama,AL
1,AL,total,2012,4817528.0,Alabama,AL
2,AL,under18,2010,1130966.0,Alabama,AL
3,AL,total,2010,4785570.0,Alabama,AL
4,AL,under18,2011,1125763.0,Alabama,AL
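
A minimal sketch of what the outer merge does when a key has no partner on the other side. The two small frames and their values are made up for illustration, and `indicator=True` is an optional extra (not used in the notebook) that records where each row came from.

```python
import pandas as pd

# Toy stand-ins for the population and abbreviation tables (values are made up)
pop_demo = pd.DataFrame({
    "state/region": ["AL", "AL", "PR"],
    "population": [100.0, 200.0, 50.0],
})
abb_demo = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "abbreviation": ["AL", "AK"],
})

# how="outer" keeps unmatched keys from both sides;
# indicator=True adds a _merge column: both, left_only or right_only
merged_demo = pd.merge(
    pop_demo, abb_demo,
    how="outer",
    left_on="state/region", right_on="abbreviation",
    indicator=True,
)
print(merged_demo)
```

Rows whose key exists on only one side are kept and padded with NaN, which is how the NaN values in the state column examined later can arise.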


- Drop the redundant abbreviation column from the merged data

In [14]:
pop_abb.drop('abbreviation',axis=1,inplace=True)

In [17]:
pop_abb.head()

Unnamed: 0,state/region,ages,year,population,state
0,AL,under18,2012,1117489.0,Alabama
1,AL,total,2012,4817528.0,Alabama
2,AL,under18,2010,1130966.0,Alabama
3,AL,total,2010,4785570.0,Alabama
4,AL,under18,2011,1125763.0,Alabama


- Check which columns contain missing data

In [18]:
# Test every value in the data for null (True = null, False = not null)
pop_abb.isnull().head()

Unnamed: 0,state/region,ages,year,population,state
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False


In [19]:
pop_abb.isnull().any(axis=0)  # the population and state columns contain nulls

state/region    False
ages            False
year            False
population       True
state            True
dtype: bool
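
The column check above can be sketched on a small frame. The data below is made up; `isnull().sum()` is shown as a companion to `any(axis=0)` because it also reports how many values are missing in each column.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the shape of the merged table
demo = pd.DataFrame({
    "state/region": ["AL", "PR", "USA"],
    "population": [100.0, np.nan, 300.0],
    "state": ["Alabama", None, None],
})

has_null = demo.isnull().any(axis=0)   # one boolean per column
null_counts = demo.isnull().sum()      # number of missing values per column

# Keep only the column names flagged True
cols_with_null = has_null[has_null].index.tolist()
print(cols_with_null)
print(null_counts)
```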

- Find which state/region values have NaN in the state column, and deduplicate them

In [21]:
# Locate the rows whose state value is null
pop_abb['state'].isnull()   # False = not null, True = null

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2514     True
2515     True
2516     True
2517     True
2518     True
2519     True
2520     True
2521     True
2522     True
2523     True
2524     True
2525     True
2526     True
2527     True
2528     True
2529     True
2530     True
2531     True
2532     True
2533     True
2534     True
2535     True
2536     True
2537     True
2538     True
2539     True
2540     True
2541     True
2542     True
2543     True
Name: state, Length: 2544, dtype: bool

In [22]:
pop_abb['state'].isnull()   # the boolean Series this returns can be used as a row indexer on pop_abb


In [23]:
pop_abb.loc[pop_abb['state'].isnull()]   # returns a DataFrame holding the rows of pop_abb whose state value is null

Unnamed: 0,state/region,ages,year,population,state
2448,PR,under18,1990,,
2449,PR,total,1990,,
2450,PR,total,1991,,
2451,PR,under18,1991,,
2452,PR,total,1993,,
2453,PR,under18,1993,,
2454,PR,under18,1992,,
2455,PR,total,1992,,
2456,PR,under18,1994,,
2457,PR,total,1994,,


In [25]:
pop_abb.loc[pop_abb['state'].isnull()]['state/region']

2448     PR
2449     PR
2450     PR
2451     PR
2452     PR
2453     PR
2454     PR
2455     PR
2456     PR
2457     PR
2458     PR
2459     PR
2460     PR
2461     PR
2462     PR
2463     PR
2464     PR
2465     PR
2466     PR
2467     PR
2468     PR
2469     PR
2470     PR
2471     PR
2472     PR
2473     PR
2474     PR
2475     PR
2476     PR
2477     PR
       ... 
2514    USA
2515    USA
2516    USA
2517    USA
2518    USA
2519    USA
2520    USA
2521    USA
2522    USA
2523    USA
2524    USA
2525    USA
2526    USA
2527    USA
2528    USA
2529    USA
2530    USA
2531    USA
2532    USA
2533    USA
2534    USA
2535    USA
2536    USA
2537    USA
2538    USA
2539    USA
2540    USA
2541    USA
2542    USA
2543    USA
Name: state/region, Length: 96, dtype: object

In [26]:
# unique() is a Series method that returns the distinct values
pop_abb.loc[pop_abb['state'].isnull()]['state/region'].unique()

array(['PR', 'USA'], dtype=object)
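
As a sketch on made-up data, the same lookup can be written with a single `.loc` call that takes the row mask and the column label together, which avoids the chained `[...][...]` form.

```python
import pandas as pd

# Made-up rows: two keys have no state name
demo = pd.DataFrame({
    "state/region": ["AL", "PR", "PR", "USA"],
    "state": ["Alabama", None, None, None],
})

# Row mask and column label in one .loc call, then deduplicate
missing_keys = demo.loc[demo["state"].isnull(), "state/region"].unique()
print(missing_keys)
```

`unique()` returns the values in order of first appearance, so the result lists each key once.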

- Fill in the correct state value for those state/region entries, removing every NaN from the state column

In [28]:
pop_abb['state/region'] == 'PR'

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2514    False
2515    False
2516    False
2517    False
2518    False
2519    False
2520    False
2521    False
2522    False
2523    False
2524    False
2525    False
2526    False
2527    False
2528    False
2529    False
2530    False
2531    False
2532    False
2533    False
2534    False
2535    False
2536    False
2537    False
2538    False
2539    False
2540    False
2541    False
2542    False
2543    False
Name: state/region, Length: 2544, dtype: bool

In [29]:
pop_abb.loc[pop_abb['state/region'] == 'PR']

Unnamed: 0,state/region,ages,year,population,state
2448,PR,under18,1990,,
2449,PR,total,1990,,
2450,PR,total,1991,,
2451,PR,under18,1991,,
2452,PR,total,1993,,
2453,PR,under18,1993,,
2454,PR,under18,1992,,
2455,PR,total,1992,,
2456,PR,under18,1994,,
2457,PR,total,1994,,


In [30]:
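# Caution: this chained indexing assigns to a temporary copy, so pop_abb itself is left unchanged
# (hence the warning below). A single call modifies pop_abb directly:
#     pop_abb.loc[pop_abb['state/region'] == 'PR', 'state'] = 'Puerto Rico'
pop_abb.loc[pop_abb['state/region'] == 'PR']['state'] = 'Puerto Rico'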

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [None]:
pop_abb['state/region'] == 'USA'

In [None]:
pop_abb.loc[pop_abb['state/region'] == 'USA']

In [31]:
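# As with 'PR' above, this chained form assigns to a temporary copy and leaves pop_abb unchanged;
# a single call such as pop_abb.loc[pop_abb['state/region'] == 'USA', 'state'] = 'United States' would modify it
pop_abb.loc[pop_abb['state/region'] == 'USA']['state'] = 'United States'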

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

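Both assignments above trigger the "set on a copy of a slice" warning. A minimal sketch of the form the warning recommends, on made-up data: the row mask and the column label go into one `.loc[row_indexer, col_indexer]` call, so the assignment reaches the original frame.

```python
import pandas as pd

# Made-up rows that mimic the missing state names
demo = pd.DataFrame({
    "state/region": ["AL", "PR", "USA"],
    "state": ["Alabama", None, None],
})

# One .loc call with (row mask, column label) writes into demo itself
demo.loc[demo["state/region"] == "PR", "state"] = "Puerto Rico"
demo.loc[demo["state/region"] == "USA", "state"] = "United States"

print(demo)
print(demo["state"].isnull().sum())
```

Checking `isnull().sum()` afterwards is a quick way to confirm that the fill took effect.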

- Merge in the state area data (areas)

In [39]:
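# With no key given, merge joins on the only shared column, state.
# Note: the name `all` shadows Python's built-in all() for the rest of the notebook.
all = pd.merge(pop_abb,area,how="outer")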
all.head()

Unnamed: 0,state/region,ages,year,population,state,area (sq. mi)
0,AL,under18,2012.0,1117489.0,Alabama,52423.0
1,AL,total,2012.0,4817528.0,Alabama,52423.0
2,AL,under18,2010.0,1130966.0,Alabama,52423.0
3,AL,total,2010.0,4785570.0,Alabama,52423.0
4,AL,under18,2011.0,1125763.0,Alabama,52423.0


- The area (sq. mi) column turns out to have missing data; find which rows are affected

In [40]:
all.head()

Unnamed: 0,state/region,ages,year,population,state,area (sq. mi)
0,AL,under18,2012.0,1117489.0,Alabama,52423.0
1,AL,total,2012.0,4817528.0,Alabama,52423.0
2,AL,under18,2010.0,1130966.0,Alabama,52423.0
3,AL,total,2010.0,4785570.0,Alabama,52423.0
4,AL,under18,2011.0,1125763.0,Alabama,52423.0


In [41]:
all['area (sq. mi)']  # select this column

0       52423.0
1       52423.0
2       52423.0
3       52423.0
4       52423.0
5       52423.0
6       52423.0
7       52423.0
8       52423.0
9       52423.0
10      52423.0
11      52423.0
12      52423.0
13      52423.0
14      52423.0
15      52423.0
16      52423.0
17      52423.0
18      52423.0
19      52423.0
20      52423.0
21      52423.0
22      52423.0
23      52423.0
24      52423.0
25      52423.0
26      52423.0
27      52423.0
28      52423.0
29      52423.0
         ...   
2515        NaN
2516        NaN
2517        NaN
2518        NaN
2519        NaN
2520        NaN
2521        NaN
2522        NaN
2523        NaN
2524        NaN
2525        NaN
2526        NaN
2527        NaN
2528        NaN
2529        NaN
2530        NaN
2531        NaN
2532        NaN
2533        NaN
2534        NaN
2535        NaN
2536        NaN
2537        NaN
2538        NaN
2539        NaN
2540        NaN
2541        NaN
2542        NaN
2543        NaN
2544     3515.0
Name: area (sq. mi), Length: 2545, dtype: float64

In [42]:
all['area (sq. mi)'].isnull()   # True = null

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2515     True
2516     True
2517     True
2518     True
2519     True
2520     True
2521     True
2522     True
2523     True
2524     True
2525     True
2526     True
2527     True
2528     True
2529     True
2530     True
2531     True
2532     True
2533     True
2534     True
2535     True
2536     True
2537     True
2538     True
2539     True
2540     True
2541     True
2542     True
2543     True
2544    False
Name: area (sq. mi), Length: 2545, dtype: bool

In [43]:
# Keep only the True entries; their index labels are the rows with a missing area
all['area (sq. mi)'].isnull()[all['area (sq. mi)'].isnull()]

2448    True
2449    True
2450    True
2451    True
2452    True
2453    True
2454    True
2455    True
2456    True
2457    True
2458    True
2459    True
2460    True
2461    True
2462    True
2463    True
2464    True
2465    True
2466    True
2467    True
2468    True
2469    True
2470    True
2471    True
2472    True
2473    True
2474    True
2475    True
2476    True
2477    True
        ... 
2514    True
2515    True
2516    True
2517    True
2518    True
2519    True
2520    True
2521    True
2522    True
2523    True
2524    True
2525    True
2526    True
2527    True
2528    True
2529    True
2530    True
2531    True
2532    True
2533    True
2534    True
2535    True
2536    True
2537    True
2538    True
2539    True
2540    True
2541    True
2542    True
2543    True
Name: area (sq. mi), Length: 96, dtype: bool

In [45]:
# Index labels of the rows with a missing area
indexs = all['area (sq. mi)'].isnull()[all['area (sq. mi)'].isnull()].index

- Drop the rows that contain missing data

In [46]:
all.drop(indexs,axis=0,inplace=True)
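
A sketch on made-up data of the two steps above (collect the index labels of the null rows, then drop them), next to `dropna(subset=...)`, which is offered here as an equivalent shortcut rather than as the notebook's method.

```python
import numpy as np
import pandas as pd

# Made-up rows: one area value is missing
demo = pd.DataFrame({
    "state": ["Alabama", None, "Alaska"],
    "area (sq. mi)": [52423.0, np.nan, 656425.0],
})

# Step 1: index labels of the rows whose area is null
null_index = demo[demo["area (sq. mi)"].isnull()].index

# Step 2: drop those rows (returns a new frame; demo is unchanged)
dropped = demo.drop(null_index, axis=0)

# Shortcut: drop rows that are null in the given column
same = demo.dropna(subset=["area (sq. mi)"])

print(dropped.equals(same))
```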

- Select the total population data for 2010

In [48]:
all.head()

Unnamed: 0,state/region,ages,year,population,state,area (sq. mi)
0,AL,under18,2012.0,1117489.0,Alabama,52423.0
1,AL,total,2012.0,4817528.0,Alabama,52423.0
2,AL,under18,2010.0,1130966.0,Alabama,52423.0
3,AL,total,2010.0,4785570.0,Alabama,52423.0
4,AL,under18,2011.0,1125763.0,Alabama,52423.0


In [52]:
# DataFrame.query takes a filter condition as a string; it supports & (and), | (or) and ==
dataOf2010 = all.query('year == 2010 & ages == "total"')
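
For comparison, a sketch on made-up data showing the same filter written as a `query` string and as an explicit boolean mask. In the mask form each comparison needs its own parentheses, because `&` binds more tightly than `==` in plain Python.

```python
import pandas as pd

# Made-up rows with the same columns the filter uses
demo = pd.DataFrame({
    "state/region": ["AL", "AL", "AL", "AK"],
    "ages": ["under18", "total", "total", "total"],
    "year": [2010, 2010, 2011, 2010],
    "population": [10.0, 40.0, 41.0, 7.0],
})

# String form, as used in the notebook
by_query = demo.query('year == 2010 & ages == "total"')

# Equivalent boolean-mask form
by_mask = demo[(demo["year"] == 2010) & (demo["ages"] == "total")]

print(by_query)
```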

- Compute the population density of each state

In [53]:
dataOf2010['population'] / dataOf2010['area (sq. mi)']

3         91.287603
91         1.087509
101       56.214497
189       54.948667
197      228.051342
283       48.493718
293      645.600649
379      460.445752
389     8898.897059
475      286.597129
485      163.409902
570      124.746707
581       18.794338
666      221.687472
677      178.197831
762       54.202751
773       34.745266
858      107.586994
869       87.676099
954       37.509990
965      466.445797
1050     621.815538
1061     102.015794
1146      61.078373
1157      61.321530
1242      86.015622
1253       6.736171
1338      23.654153
1349      24.448796
1434     140.799273
1445    1009.253268
1530      16.982737
1541     356.094135
1626     177.617157
1637       9.537565
1722     257.549634
1733      53.778278
1818      39.001565
1829     275.966651
1914     681.339159
1925     144.854594
2010      10.583512
2021     150.825298
2106      93.987655
2117      32.677188
2202      65.085075
2213     187.622273
2298      94.557817
2309      76.519582
2394      86.851900


In [54]:
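# midu is pinyin for "density". dataOf2010 is a subset of `all`, which triggers the warning below;
# the column is still added to dataOf2010, as the next cell shows.
dataOf2010['midu'] = dataOf2010['population'] / dataOf2010['area (sq. mi)']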

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [55]:
dataOf2010.head()

Unnamed: 0,state/region,ages,year,population,state,area (sq. mi),midu
3,AL,total,2010.0,4785570.0,Alabama,52423.0,91.287603
91,AK,total,2010.0,713868.0,Alaska,656425.0,1.087509
101,AZ,total,2010.0,6408790.0,Arizona,114006.0,56.214497
189,AR,total,2010.0,2922280.0,Arkansas,53182.0,54.948667
197,CA,total,2010.0,37333601.0,California,163707.0,228.051342


- Sort and find the five states with the highest population density, using df.sort_values()

In [57]:
dataOf2010.sort_values(by="midu",axis=0,ascending=False).head(5)

Unnamed: 0,state/region,ages,year,population,state,area (sq. mi),midu
389,DC,total,2010.0,605125.0,District of Columbia,68.0,8898.897059
1445,NJ,total,2010.0,8802707.0,New Jersey,8722.0,1009.253268
1914,RI,total,2010.0,1052669.0,Rhode Island,1545.0,681.339159
293,CT,total,2010.0,3579210.0,Connecticut,5544.0,645.600649
1050,MA,total,2010.0,6563263.0,Massachusetts,10555.0,621.815538
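
A compact sketch of the last two steps, using three population and area pairs taken from the outputs above. `assign` and `nlargest` are alternatives chosen for this illustration (the notebook uses column assignment and `sort_values`), and the column name `density` is likewise an assumption.

```python
import pandas as pd

# Three rows copied from the 2010 outputs above
demo = pd.DataFrame({
    "state": ["Alaska", "District of Columbia", "New Jersey"],
    "population": [713868.0, 605125.0, 8802707.0],
    "area (sq. mi)": [656425.0, 68.0, 8722.0],
})

# assign returns a new frame, so no slice of another frame is written to
with_density = demo.assign(density=demo["population"] / demo["area (sq. mi)"])

# nlargest(n, column) gives the same rows as sort_values(ascending=False).head(n)
top2 = with_density.nlargest(2, "density")
print(top2)
```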
