# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Pandas Joins & Grouping
Week 2 | Day 4

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Join data via concat
- Perform left, right, inner, outer joins, and groupbys

## Concatenation


As we've seen previously, built-in sequences such as strings and lists are concatenated with the '+' operator.

### String concatenation

In [1]:
s1 = "This is a string!" 
s2 = "This is also a string!"

In [2]:
s1 + s2

'This is a string!This is also a string!'

### List concatenation

In [3]:
l1 = ['list_item_a1', 'list_item_a2']
l2 = ['list_item_b1', 'list_item_b2']

In [4]:
# list concat
l1 + l2

['list_item_a1', 'list_item_a2', 'list_item_b1', 'list_item_b2']

## How do we concat pandas Series?

### Not with '+'.

## Remember, '+' adds the Series element by element

In [5]:
import pandas as pd

ps1 = pd.Series([1, 2, 3, 4, 5])
ps2 = pd.Series([10, 20, 30, 40, 50])

In [6]:
ps1 + ps2

0    11
1    22
2    33
3    44
4    55
dtype: int64

Remember the '+' operator on pandas Series is for addition
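
One detail worth keeping in mind is that this addition is matched on index labels, not on position. Here is a minimal sketch, assuming two small made-up Series (`a` and `b` are illustrative names, not part of the lesson data) whose labels only partly overlap:

```python
import pandas as pd

# Two illustrative Series whose index labels only partly overlap
a = pd.Series([1, 2, 3], index=[0, 1, 2])
b = pd.Series([10, 20, 30], index=[1, 2, 3])

# Values are added where the labels match; a label that exists in only
# one Series gives NaN, so the result is upcast to float
total = a + b
print(total)
```

Labels 1 and 2 exist in both Series and are summed, while labels 0 and 3 come back as NaN.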

## So there is another way

```python
pd.concat()
```

We pass the series we want to concat into this.

### Let's see an example

In [14]:
pd.concat([ps1, ps2])

0     1
1     2
2     3
3     4
4     5
0    10
1    20
2    30
3    40
4    50
dtype: int64

Notice two things: 1. we put our series into a list; 2. the indices are the original indices
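
Why do the repeated indices matter? A short sketch, reusing the same two Series, suggests one reason: a label lookup is no longer unambiguous.

```python
import pandas as pd

ps1 = pd.Series([1, 2, 3, 4, 5])
ps2 = pd.Series([10, 20, 30, 40, 50])
combined = pd.concat([ps1, ps2])

# The label 0 now appears twice, so a label lookup returns two values
print(combined.loc[0])

# is_unique reports whether every index label occurs only once
print(combined.index.is_unique)
```

Depending on what you do next, duplicate labels can be harmless or a source of subtle bugs, so it is worth checking for them.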

## How do we fix the index (presuming we want one continuous index)?

In [20]:
# solution

pd.concat([ps1, ps2]).reset_index(drop=True)

0     1
1     2
2     3
3     4
4     5
5    10
6    20
7    30
8    40
9    50
dtype: int64

## What about a DataFrame?

### Let's create a couple...

In [15]:
# frame 1
ab_dict = {'names': ['Alma', 'Anthony', 'Ava', 'Barry', 'Brick', 'Betty'],
          'letter_grades': ['A', 'A', 'A', 'B', 'B', 'B'],
          'number_grades': [100, 95, 93, 88, 87, 89]
          }
ab_students = pd.DataFrame(ab_dict)
ab_students

Unnamed: 0,letter_grades,names,number_grades
0,A,Alma,100
1,A,Anthony,95
2,A,Ava,93
3,B,Barry,88
4,B,Brick,87
5,B,Betty,89


In [16]:
# frame 2
cd_dict = {'names': ['Cam', 'Caroly', 'Cathy', 'David', 'Darius', 'Dipsy'],
          'letter_grades': ['C', 'C', 'C', 'D', 'D', 'D'],
          'number_grades': [79, 79, 76, 69, 66, 68]
          }
cd_students = pd.DataFrame(cd_dict)
cd_students

Unnamed: 0,letter_grades,names,number_grades
0,C,Cam,79
1,C,Caroly,79
2,C,Cathy,76
3,D,David,69
4,D,Darius,66
5,D,Dipsy,68


## Concatenating the two DataFrames

In [17]:
pd.concat([ab_students, cd_students]).reset_index(drop=True)

Unnamed: 0,letter_grades,names,number_grades
0,A,Alma,100
1,A,Anthony,95
2,A,Ava,93
3,B,Barry,88
4,B,Brick,87
5,B,Betty,89
6,C,Cam,79
7,C,Caroly,79
8,C,Cathy,76
9,D,David,69


## Visualization

![](http://i.imgur.com/N5p9y8F.png)

## Can we concat horizontally?

In [35]:
pd.concat([ab_students, cd_students], axis=1)

Unnamed: 0,letter_grades,names,number_grades,letter_grades.1,names.1,number_grades.1
0,A,Alma,100,C,Cam,79
1,A,Anthony,95,C,Caroly,79
2,A,Ava,93,C,Cathy,76
3,B,Barry,88,D,David,69
4,B,Brick,87,D,Darius,66
5,B,Betty,89,D,Dipsy,68


Yes, by passing the 'axis=1' parameter, we concatenate horizontally. <br>
Why would you do this? Mostly for presentation purposes.
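
Keep in mind that horizontal concatenation lines rows up by index label, not by position. The sketch below assumes two tiny made-up frames (`left` and `right` are illustrative names) whose indices only partly overlap:

```python
import pandas as pd

# Illustrative frames: the index labels overlap only at label 1
left = pd.DataFrame({'names': ['Alma', 'Ava']}, index=[0, 1])
right = pd.DataFrame({'grades': [95, 88]}, index=[1, 2])

# Rows are paired by label; labels missing from one side are filled with NaN
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

Only label 1 has values in both columns. The frames in the lesson line up neatly because they share the same default 0 to 5 index.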

## Let's look at another set of data

In [18]:
states_atom = pd.DataFrame({'abbreviation': {0: 'AL', 1: 'AK', 2: 'AZ', 3: 'AR', 4: 'CA', 5: 'CO', 6: 'CT', 7: 'DE', 8: 'FL', 9: 'GA', 10: 'HI', 11: 'ID', 12: 'IL', 13: 'IN', 14: 'IA', 15: 'KS', 16: 'KY', 17: 'LA', 18: 'ME', 19: 'MD', 20: 'MA', 21: 'MI', 22: 'MN', 23: 'MS', 24: 'MO', 25: 'MT'}, 'name': {0: 'Alabama', 1: 'Alaska', 2: 'Arizona', 3: 'Arkansas', 4: 'California', 5: 'Colorado', 6: 'Connecticut', 7: 'Delaware', 8: 'Florida', 9: 'Georgia', 10: 'Hawaii', 11: 'Idaho', 12: 'Illinois', 13: 'Indiana', 14: 'Iowa', 15: 'Kansas', 16: 'Kentucky', 17: 'Louisiana', 18: 'Maine', 19: 'Maryland', 20: 'Massachusetts', 21: 'Michigan', 22: 'Minnesota', 23: 'Mississippi', 24: 'Missouri', 25: 'Montana'}})

In [19]:
states_ntoz = pd.DataFrame({'abbreviation': {0: 'NE', 1: 'NV', 2: 'NH', 3: 'NJ', 4: 'NM', 5: 'NY', 6: 'NC', 7: 'ND', 8: 'OH', 9: 'OK', 10: 'OR', 11: 'PA', 12: 'RI', 13: 'SC', 14: 'SD', 15: 'TN', 16: 'TX', 17: 'UT', 18: 'VT', 19: 'VA', 20: 'WA', 21: 'WV', 22: 'WI', 23: 'WY'}, 'name': {0: 'Nebraska', 1: 'Nevada', 2: 'New Hampshire', 3: 'New Jersey', 4: 'New Mexico', 5: 'New York', 6: 'North Carolina', 7: 'North Dakota', 8: 'Ohio', 9: 'Oklahoma', 10: 'Oregon', 11: 'Pennsylvania', 12: 'Rhode Island', 13: 'South Carolina', 14: 'South Dakota', 15: 'Tennessee', 16: 'Texas', 17: 'Utah', 18: 'Vermont', 19: 'Virginia', 20: 'Washington', 21: 'West Virginia', 22: 'Wisconsin', 23: 'Wyoming'}})

In [20]:
capitals = pd.DataFrame({'capital': {0: 'Montgomery', 1: 'Juneau', 2: 'Phoenix', 3: 'Little Rock', 4: 'Sacramento', 5: 'Denver', 6: 'Hartford', 7: 'Dover', 8: 'Honolulu', 9: 'Tallahassee', 10: 'Atlanta', 11: 'Boise', 12: 'Springfield', 13: 'Indianapolis', 14: 'Des Moines', 15: 'Topeka', 16: 'Frankfort', 17: 'Baton Rouge', 18: 'Augusta', 19: 'Annapolis', 20: 'Boston', 21: 'Lansing', 22: 'St. Paul', 23: 'Jackson', 24: 'Jefferson City', 25: 'Helena', 26: 'Lincoln', 27: 'Carson City', 28: 'Concord', 29: 'Trenton', 30: 'Santa Fe', 31: 'Raleigh', 32: 'Bismarck', 33: 'Albany', 34: 'Columbus', 35: 'Oklahoma City', 36: 'Salem', 37: 'Harrisburg', 38: 'Providence', 39: 'Columbia', 40: 'Pierre', 41: 'Nashville', 42: 'Austin', 43: 'Salt Lake City', 44: 'Montpelier', 45: 'Richmond', 46: 'Olympia', 47: 'Charleston', 48: 'Madison', 49: 'Cheyenne'}, 'latitude': {0: 32.377715999999999, 1: 58.301597999999998, 2: 33.448143000000002, 3: 34.746613000000004, 4: 38.576667999999998, 5: 39.739227, 6: 41.764046, 7: 39.157307000000003, 8: 21.307442000000002, 9: 30.438117999999999, 10: 33.749027000000005, 11: 43.617775000000002, 12: 39.798363000000002, 13: 39.768622999999998, 14: 41.591087000000002, 15: 39.048190999999996, 16: 38.186721999999996, 17: 30.457069000000001, 18: 44.307167, 19: 38.978763999999998, 20: 42.358162, 21: 42.733635, 22: 44.955096999999995, 23: 32.303847999999995, 24: 38.579200999999998, 25: 46.585709000000001, 26: 40.808075000000002, 27: 39.163913999999998, 28: 43.206897999999995, 29: 40.220596, 30: 35.68224, 31: 35.780429999999996, 32: 46.82085, 33: 42.652842999999997, 34: 39.961345999999999, 35: 35.492207000000001, 36: 44.938460999999997, 37: 40.264378000000001, 38: 41.830914, 39: 34.000343000000001, 40: 44.367030999999997, 41: 36.16581, 42: 30.27467, 43: 40.777477000000005, 44: 44.262436000000001, 45: 37.538857, 46: 47.035804999999996, 47: 38.336246000000003, 48: 43.074684000000005, 49: 41.140259}, 'longitude': {0: -86.300567999999998, 1: -134.42021200000002, 2: 
-112.096962, 3: -92.288985999999994, 4: -121.493629, 5: -104.98485600000001, 6: -72.682198, 7: -75.519722000000002, 8: -157.85737599999999, 9: -84.281295999999998, 10: -84.38822900000001, 11: -116.19972199999999, 12: -89.654961, 13: -86.162643000000003, 14: -93.603729000000001, 15: -95.677956000000009, 16: -84.875373999999994, 17: -91.187393, 18: -69.781693000000004, 19: -76.490936000000005, 20: -71.063698000000002, 21: -84.555328000000003, 22: -93.102210999999997, 23: -90.182106000000005, 24: -92.172934999999995, 25: -112.018417, 26: -96.69965400000001, 27: -119.766121, 28: -71.537993999999998, 29: -74.769913000000003, 30: -105.939728, 31: -78.639099000000002, 32: -100.78331800000001, 33: -73.757874000000001, 34: -82.999068999999992, 35: -97.503342000000004, 36: -123.03040300000001, 37: -76.883597999999992, 38: -71.414963, 39: -81.033210999999994, 40: -100.346405, 41: -86.784241000000009, 42: -97.740348999999995, 43: -111.88823700000002, 44: -72.580535999999995, 45: -77.433639999999997, 46: -122.90501399999999, 47: -81.612328000000005, 48: -89.384444999999999, 49: -104.82023599999999}, 'name': {0: 'Alabama', 1: 'Alaska', 2: 'Arizona', 3: 'Arkansas', 4: 'California', 5: 'Colorado', 6: 'Connecticut', 7: 'Delaware', 8: 'Hawaii', 9: 'Florida', 10: 'Georgia', 11: 'Idaho', 12: 'Illinois', 13: 'Indiana', 14: 'Iowa', 15: 'Kansas', 16: 'Kentucky', 17: 'Louisiana', 18: 'Maine', 19: 'Maryland', 20: 'Massachusetts', 21: 'Michigan', 22: 'Minnesota', 23: 'Mississippi', 24: 'Missouri', 25: 'Montana', 26: 'Nebraska', 27: 'Nevada', 28: 'New Hampshire', 29: 'New Jersey', 30: 'New Mexico', 31: 'North Carolina', 32: 'North Dakota', 33: 'New York', 34: 'Ohio', 35: 'Oklahoma', 36: 'Oregon', 37: 'Pennsylvania', 38: 'Rhode Island', 39: 'South Carolina', 40: 'South Dakota', 41: 'Tennessee', 42: 'Texas', 43: 'Utah', 44: 'Vermont', 45: 'Virginia', 46: 'Washington', 47: 'West Virginia', 48: 'Wisconsin', 49: 'Wyoming'}})

In [21]:
states_atom.head(3)

Unnamed: 0,abbreviation,name
0,AL,Alabama
1,AK,Alaska
2,AZ,Arizona


In [22]:
states_ntoz.head(3)

Unnamed: 0,abbreviation,name
0,NE,Nebraska
1,NV,Nevada
2,NH,New Hampshire


In [23]:
capitals.head(3)

Unnamed: 0,capital,latitude,longitude,name
0,Montgomery,32.377716,-86.300568,Alabama
1,Juneau,58.301598,-134.420212,Alaska
2,Phoenix,33.448143,-112.096962,Arizona


## We can now concat

In [26]:
pd.concat([states_atom, states_ntoz], ignore_index=True)

Unnamed: 0,abbreviation,name
0,AL,Alabama
1,AK,Alaska
2,AZ,Arizona
3,AR,Arkansas
4,CA,California
5,CO,Colorado
6,CT,Connecticut
7,DE,Delaware
8,FL,Florida
9,GA,Georgia


To avoid having to reset the index, we can pass in True for the 'ignore_index' parameter
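
As a quick check, the sketch below compares the two approaches on a pair of small made-up frames (`first` and `second` are illustrative names):

```python
import pandas as pd

first = pd.DataFrame({'name': ['Alabama', 'Alaska']})
second = pd.DataFrame({'name': ['Nebraska', 'Nevada']})

# Approach 1: concat, then throw away the old index
via_reset = pd.concat([first, second]).reset_index(drop=True)

# Approach 2: ask concat to build a fresh index directly
via_flag = pd.concat([first, second], ignore_index=True)

print(via_reset.equals(via_flag))
print(list(via_flag.index))
```

Both give a continuous index, so 'ignore_index=True' is simply the shorter way to write it.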

## Horizontally concat

We can add the 'keys' parameter to improve the look with a column header:

In [33]:
r=pd.concat([states_atom, states_ntoz], axis=1, keys=['A to M', 'N to Z'])
r

Unnamed: 0_level_0,A to M,A to M,N to Z,N to Z
Unnamed: 0_level_1,abbreviation,name,abbreviation,name
0,AL,Alabama,NE,Nebraska
1,AK,Alaska,NV,Nevada
2,AZ,Arizona,NH,New Hampshire
3,AR,Arkansas,NJ,New Jersey
4,CA,California,NM,New Mexico
5,CO,Colorado,NY,New York
6,CT,Connecticut,NC,North Carolina
7,DE,Delaware,ND,North Dakota
8,FL,Florida,OH,Ohio
9,GA,Georgia,OK,Oklahoma


In [30]:
r['A to M']['abbreviation']

0     AL
1     AK
2     AZ
3     AR
4     CA
5     CO
6     CT
7     DE
8     FL
9     GA
10    HI
11    ID
12    IL
13    IN
14    IA
15    KS
16    KY
17    LA
18    ME
19    MD
20    MA
21    MI
22    MN
23    MS
24    MO
25    MT
Name: abbreviation, dtype: object
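
The 'keys' argument gives the columns two levels (a MultiIndex). Besides chaining two lookups as above, a column can also be selected with a single tuple. A minimal sketch, assuming cut-down versions of the two state frames (`atom` and `ntoz` are illustrative names):

```python
import pandas as pd

atom = pd.DataFrame({'abbreviation': ['AL', 'AK'], 'name': ['Alabama', 'Alaska']})
ntoz = pd.DataFrame({'abbreviation': ['NE', 'NV'], 'name': ['Nebraska', 'Nevada']})
r = pd.concat([atom, ntoz], axis=1, keys=['A to M', 'N to Z'])

# The columns now have two levels: the key, then the original column name
print(r.columns.nlevels)

# Two equivalent ways to reach the same column
chained = r['A to M']['abbreviation']
by_tuple = r[('A to M', 'abbreviation')]
print(list(chained) == list(by_tuple))
```

The tuple form does the lookup in one step, which tends to be the safer habit when you later want to assign to a column.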

## Exercise

Using the ```capitals``` DataFrame and the ```states_atom``` and ```states_ntoz``` DataFrames:
- vertically concat the states_atom DataFrame with the states_ntoz DataFrame
- save that as a new DataFrame called sc
- now horizontally concat everything in sc with the capitals['capital'] Series
- if you get errors, be mindful of your indices
- can you concat a DataFrame with a Series? Is there a way around it?
- what happens if you add the keys argument? Why is this different from what appeared above when we did this?

In [66]:
sc=pd.concat([states_atom,states_ntoz],ignore_index=True)
states=pd.concat([sc,capitals],axis=1,keys=['sc','capitals'])
states

Unnamed: 0_level_0,sc,sc,capitals,capitals,capitals,capitals
Unnamed: 0_level_1,abbreviation,name,capital,latitude,longitude,name
0,AL,Alabama,Montgomery,32.377716,-86.300568,Alabama
1,AK,Alaska,Juneau,58.301598,-134.420212,Alaska
2,AZ,Arizona,Phoenix,33.448143,-112.096962,Arizona
3,AR,Arkansas,Little Rock,34.746613,-92.288986,Arkansas
4,CA,California,Sacramento,38.576668,-121.493629,California
5,CO,Colorado,Denver,39.739227,-104.984856,Colorado
6,CT,Connecticut,Hartford,41.764046,-72.682198,Connecticut
7,DE,Delaware,Dover,39.157307,-75.519722,Delaware
8,FL,Florida,Honolulu,21.307442,-157.857376,Hawaii
9,GA,Georgia,Tallahassee,30.438118,-84.281296,Florida


In [73]:
#This is much prettier
states=pd.merge(sc,capitals,on='name')
states

Unnamed: 0,abbreviation,name,capital,latitude,longitude
0,AL,Alabama,Montgomery,32.377716,-86.300568
1,AK,Alaska,Juneau,58.301598,-134.420212
2,AZ,Arizona,Phoenix,33.448143,-112.096962
3,AR,Arkansas,Little Rock,34.746613,-92.288986
4,CA,California,Sacramento,38.576668,-121.493629
5,CO,Colorado,Denver,39.739227,-104.984856
6,CT,Connecticut,Hartford,41.764046,-72.682198
7,DE,Delaware,Dover,39.157307,-75.519722
8,FL,Florida,Tallahassee,30.438118,-84.281296
9,GA,Georgia,Atlanta,33.749027,-84.388229


## Now having taught you this, please don't ever do that

### Horizontal matching like that should be done with joins
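
Here is why. In the concat output above, Florida ended up next to Honolulu, because concat pairs rows by index and the two tables list the states in a slightly different order. The sketch below reproduces the problem on a three-row sample (`sc_small` and `caps_small` are illustrative names, with rows taken from the lesson data):

```python
import pandas as pd

sc_small = pd.DataFrame({'abbreviation': ['FL', 'GA', 'HI'],
                         'name': ['Florida', 'Georgia', 'Hawaii']})
caps_small = pd.DataFrame({'capital': ['Honolulu', 'Tallahassee', 'Atlanta'],
                           'name': ['Hawaii', 'Florida', 'Georgia']})

# concat pairs rows by index label, so Florida lands next to Honolulu
glued = pd.concat([sc_small, caps_small['capital']], axis=1)
print(glued)

# merge pairs rows by the value in 'name', so each state gets its own capital
joined = pd.merge(sc_small, caps_small, on='name')
print(joined)
```

The concat version runs without any error, which is exactly what makes it dangerous.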

## Joins

![](http://i.imgur.com/17doCFl.png)

## A look at DataFrame.merge()

![](http://i.imgur.com/0N0UgLo.png)

## Let's join the two DataFrames sc and capitals on the state names

In [74]:
sc.head(3)

Unnamed: 0,abbreviation,name
0,AL,Alabama
1,AK,Alaska
2,AZ,Arizona


In [75]:
capitals.head(3)

Unnamed: 0,capital,latitude,longitude,name
0,Montgomery,32.377716,-86.300568,Alabama
1,Juneau,58.301598,-134.420212,Alaska
2,Phoenix,33.448143,-112.096962,Arizona


## Both DataFrames have a 'name' column and are merged/joined on it

In [144]:
sc.merge(capitals)

Unnamed: 0,abbreviation,name,capital,latitude,longitude
0,AL,Alabama,Montgomery,32.377716,-86.300568
1,AK,Alaska,Juneau,58.301598,-134.420212
2,AZ,Arizona,Phoenix,33.448143,-112.096962
3,AR,Arkansas,Little Rock,34.746613,-92.288986
4,CA,California,Sacramento,38.576668,-121.493629
5,CO,Colorado,Denver,39.739227,-104.984856
6,CT,Connecticut,Hartford,41.764046,-72.682198
7,DE,Delaware,Dover,39.157307,-75.519722
8,FL,Florida,Tallahassee,30.438118,-84.281296
9,GA,Georgia,Atlanta,33.749027,-84.388229


What type of merge occurred?

## Inner Join

![](http://i.imgur.com/3CNHEbV.png)
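
When no options are given, merge() does an inner join on the columns the two DataFrames have in common. The sketch below, using two small made-up frames (`left` and `right` are illustrative names), spells those defaults out and checks that the result is the same:

```python
import pandas as pd

left = pd.DataFrame({'name': ['Alabama', 'Alaska', 'Arizona'],
                     'abbreviation': ['AL', 'AK', 'AZ']})
right = pd.DataFrame({'name': ['Alaska', 'Arizona', 'Arkansas'],
                      'capital': ['Juneau', 'Phoenix', 'Little Rock']})

# Defaults: how='inner', joined on every column name the frames share
implicit = left.merge(right)

# The same join with the defaults written out
explicit = left.merge(right, how='inner', on='name')

print(implicit.equals(explicit))
print(implicit)
```

Only Alaska and Arizona appear in both frames, so only those two rows survive. Writing 'on' explicitly is a good habit, since a shared column you did not expect would otherwise silently become part of the key.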

## Let's see what happens if we only have some overlap

In [76]:
abbrev_0_to_9 = sc.iloc[:10,:]
caps_5_to_14 = capitals.iloc[5:15,:]

In [77]:
abbrev_0_to_9

Unnamed: 0,abbreviation,name
0,AL,Alabama
1,AK,Alaska
2,AZ,Arizona
3,AR,Arkansas
4,CA,California
5,CO,Colorado
6,CT,Connecticut
7,DE,Delaware
8,FL,Florida
9,GA,Georgia


In [78]:
caps_5_to_14

Unnamed: 0,capital,latitude,longitude,name
5,Denver,39.739227,-104.984856,Colorado
6,Hartford,41.764046,-72.682198,Connecticut
7,Dover,39.157307,-75.519722,Delaware
8,Honolulu,21.307442,-157.857376,Hawaii
9,Tallahassee,30.438118,-84.281296,Florida
10,Atlanta,33.749027,-84.388229,Georgia
11,Boise,43.617775,-116.199722,Idaho
12,Springfield,39.798363,-89.654961,Illinois
13,Indianapolis,39.768623,-86.162643,Indiana
14,Des Moines,41.591087,-93.603729,Iowa


## With inner joins, we only get those rows that overlap

In [87]:
# For contrast, how='outer' keeps the non-matching rows too and fills the gaps with NaN
abbrev_0_to_9.merge(caps_5_to_14,how='outer')

Unnamed: 0,abbreviation,name,capital,latitude,longitude
0,AL,Alabama,,,
1,AK,Alaska,,,
2,AZ,Arizona,,,
3,AR,Arkansas,,,
4,CA,California,,,
5,CO,Colorado,Denver,39.739227,-104.984856
6,CT,Connecticut,Hartford,41.764046,-72.682198
7,DE,Delaware,Dover,39.157307,-75.519722
8,FL,Florida,Tallahassee,30.438118,-84.281296
9,GA,Georgia,Atlanta,33.749027,-84.388229


## Syntax option 2: pd.merge(left_df, right_df)

In [196]:
pd.merge(abbrev_0_to_9, caps_5_to_14)

Unnamed: 0,abbreviation,name,capital,latitude,longitude
0,CO,Colorado,Denver,39.739227,-104.984856
1,CT,Connecticut,Hartford,41.764046,-72.682198
2,DE,Delaware,Dover,39.157307,-75.519722
3,FL,Florida,Tallahassee,30.438118,-84.281296
4,GA,Georgia,Atlanta,33.749027,-84.388229


## Exercise

Use the example DataFrames below to do the following:
- inner join the two DataFrames with both syntax types and confirm the results are the same
- using the df.merge() syntax, what is the difference between df1.merge(df2) and df2.merge(df1)?

In [89]:
last_ten = sc.iloc[-10:, :]
last_twenty = capitals.iloc[-20:,:]

In [90]:
last_ten

Unnamed: 0,abbreviation,name
40,SD,South Dakota
41,TN,Tennessee
42,TX,Texas
43,UT,Utah
44,VT,Vermont
45,VA,Virginia
46,WA,Washington
47,WV,West Virginia
48,WI,Wisconsin
49,WY,Wyoming


In [91]:
last_twenty

Unnamed: 0,capital,latitude,longitude,name
30,Santa Fe,35.68224,-105.939728,New Mexico
31,Raleigh,35.78043,-78.639099,North Carolina
32,Bismarck,46.82085,-100.783318,North Dakota
33,Albany,42.652843,-73.757874,New York
34,Columbus,39.961346,-82.999069,Ohio
35,Oklahoma City,35.492207,-97.503342,Oklahoma
36,Salem,44.938461,-123.030403,Oregon
37,Harrisburg,40.264378,-76.883598,Pennsylvania
38,Providence,41.830914,-71.414963,Rhode Island
39,Columbia,34.000343,-81.033211,South Carolina


In [94]:
pd.merge(last_ten,last_twenty,how='inner')

Unnamed: 0,abbreviation,name,capital,latitude,longitude
0,SD,South Dakota,Pierre,44.367031,-100.346405
1,TN,Tennessee,Nashville,36.16581,-86.784241
2,TX,Texas,Austin,30.27467,-97.740349
3,UT,Utah,Salt Lake City,40.777477,-111.888237
4,VT,Vermont,Montpelier,44.262436,-72.580536
5,VA,Virginia,Richmond,37.538857,-77.43364
6,WA,Washington,Olympia,47.035805,-122.905014
7,WV,West Virginia,Charleston,38.336246,-81.612328
8,WI,Wisconsin,Madison,43.074684,-89.384445
9,WY,Wyoming,Cheyenne,41.140259,-104.820236


In [97]:
# The DataFrame that calls .merge() is the left one, so its columns come first in the result.
# With how='outer', rows from either side that have no match are kept, and the columns
# that come from the other DataFrame are filled with NaN for those rows.
last_twenty.merge(last_ten,how='outer')

Unnamed: 0,capital,latitude,longitude,name,abbreviation
0,Santa Fe,35.68224,-105.939728,New Mexico,
1,Raleigh,35.78043,-78.639099,North Carolina,
2,Bismarck,46.82085,-100.783318,North Dakota,
3,Albany,42.652843,-73.757874,New York,
4,Columbus,39.961346,-82.999069,Ohio,
5,Oklahoma City,35.492207,-97.503342,Oklahoma,
6,Salem,44.938461,-123.030403,Oregon,
7,Harrisburg,40.264378,-76.883598,Pennsylvania,
8,Providence,41.830914,-71.414963,Rhode Island,
9,Columbia,34.000343,-81.033211,South Carolina,


In [103]:
# how='left' or how='right' chooses which DataFrame keeps all of its rows; how='right' keeps every row of last_twenty
last_ten.merge(last_twenty,how='right')

Unnamed: 0,abbreviation,name,capital,latitude,longitude
0,SD,South Dakota,Pierre,44.367031,-100.346405
1,TN,Tennessee,Nashville,36.16581,-86.784241
2,TX,Texas,Austin,30.27467,-97.740349
3,UT,Utah,Salt Lake City,40.777477,-111.888237
4,VT,Vermont,Montpelier,44.262436,-72.580536
5,VA,Virginia,Richmond,37.538857,-77.43364
6,WA,Washington,Olympia,47.035805,-122.905014
7,WV,West Virginia,Charleston,38.336246,-81.612328
8,WI,Wisconsin,Madison,43.074684,-89.384445
9,WY,Wyoming,Cheyenne,41.140259,-104.820236


## Left Joins

![](http://i.imgur.com/TEqMpMe.png)

In [203]:
abbrev_0_to_9

Unnamed: 0,abbreviation,name
0,AL,Alabama
1,AK,Alaska
2,AZ,Arizona
3,AR,Arkansas
4,CA,California
5,CO,Colorado
6,CT,Connecticut
7,DE,Delaware
8,FL,Florida
9,GA,Georgia


In [204]:
caps_5_to_14

Unnamed: 0,capital,latitude,longitude,name
5,Denver,39.739227,-104.984856,Colorado
6,Hartford,41.764046,-72.682198,Connecticut
7,Dover,39.157307,-75.519722,Delaware
8,Honolulu,21.307442,-157.857376,Hawaii
9,Tallahassee,30.438118,-84.281296,Florida
10,Atlanta,33.749027,-84.388229,Georgia
11,Boise,43.617775,-116.199722,Idaho
12,Springfield,39.798363,-89.654961,Illinois
13,Indianapolis,39.768623,-86.162643,Indiana
14,Des Moines,41.591087,-93.603729,Iowa


## Left Join of abbrev_0_to_9 and caps_5_to_14

In [205]:
pd.merge(abbrev_0_to_9, caps_5_to_14, how='left')

Unnamed: 0,abbreviation,name,capital,latitude,longitude
0,AL,Alabama,,,
1,AK,Alaska,,,
2,AZ,Arizona,,,
3,AR,Arkansas,,,
4,CA,California,,,
5,CO,Colorado,Denver,39.739227,-104.984856
6,CT,Connecticut,Hartford,41.764046,-72.682198
7,DE,Delaware,Dover,39.157307,-75.519722
8,FL,Florida,Tallahassee,30.438118,-84.281296
9,GA,Georgia,Atlanta,33.749027,-84.388229


## With df.merge() syntax - version 1

In [190]:
abbrev_0_to_9.merge(caps_5_to_14, how='left')

Unnamed: 0,abbreviation,name,capital,latitude,longitude
0,AL,Alabama,,,
1,AK,Alaska,,,
2,AZ,Arizona,,,
3,AR,Arkansas,,,
4,CA,California,,,
5,CO,Colorado,Denver,39.739227,-104.984856
6,CT,Connecticut,Hartford,41.764046,-72.682198
7,DE,Delaware,Dover,39.157307,-75.519722
8,FL,Florida,Tallahassee,30.438118,-84.281296
9,GA,Georgia,Atlanta,33.749027,-84.388229


## With df.merge() syntax - version 2

In [105]:
caps_5_to_14.merge(abbrev_0_to_9, how='left').sort_values('longitude')

Unnamed: 0,capital,latitude,longitude,name,abbreviation
3,Honolulu,21.307442,-157.857376,Hawaii,
6,Boise,43.617775,-116.199722,Idaho,
0,Denver,39.739227,-104.984856,Colorado,CO
9,Des Moines,41.591087,-93.603729,Iowa,
7,Springfield,39.798363,-89.654961,Illinois,
8,Indianapolis,39.768623,-86.162643,Indiana,
5,Atlanta,33.749027,-84.388229,Georgia,GA
4,Tallahassee,30.438118,-84.281296,Florida,FL
2,Dover,39.157307,-75.519722,Delaware,DE
1,Hartford,41.764046,-72.682198,Connecticut,CT


## Right joins

![](http://i.imgur.com/Cne9KHD.png)

In [214]:
pd.merge(abbrev_0_to_9, caps_5_to_14, how='right')

Unnamed: 0,abbreviation,name,capital,latitude,longitude
0,CO,Colorado,Denver,39.739227,-104.984856
1,CT,Connecticut,Hartford,41.764046,-72.682198
2,DE,Delaware,Dover,39.157307,-75.519722
3,FL,Florida,Tallahassee,30.438118,-84.281296
4,GA,Georgia,Atlanta,33.749027,-84.388229
5,,Hawaii,Honolulu,21.307442,-157.857376
6,,Idaho,Boise,43.617775,-116.199722
7,,Illinois,Springfield,39.798363,-89.654961
8,,Indiana,Indianapolis,39.768623,-86.162643
9,,Iowa,Des Moines,41.591087,-93.603729
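
A right join is a left join with the two DataFrames swapped. The sketch below checks this on two tiny made-up frames (`abbrevs` and `caps` are illustrative names); only the column order differs, so the columns are reordered before comparing:

```python
import pandas as pd

abbrevs = pd.DataFrame({'abbreviation': ['FL', 'GA'], 'name': ['Florida', 'Georgia']})
caps = pd.DataFrame({'capital': ['Atlanta', 'Boise'], 'name': ['Georgia', 'Idaho']})

# Keep every row of caps, using a right join...
right_join = pd.merge(abbrevs, caps, how='right')

# ...or swap the arguments and use a left join
left_join = pd.merge(caps, abbrevs, how='left')

# Same rows, different column order: reorder before comparing
print(left_join[right_join.columns].equals(right_join))
```

Because of this, many people stick to left joins and simply choose which DataFrame goes first.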


## Outer join

![](http://i.imgur.com/V7T3iEc.png)

In [250]:
pd.merge(abbrev_0_to_9, caps_5_to_14, how='outer')

Unnamed: 0,abbreviation,name,capital,latitude,longitude
0,AL,Alabama,,,
1,AK,Alaska,,,
2,AZ,Arizona,,,
3,AR,Arkansas,,,
4,CA,California,,,
5,CO,Colorado,Denver,39.739227,-104.984856
6,CT,Connecticut,Hartford,41.764046,-72.682198
7,DE,Delaware,Dover,39.157307,-75.519722
8,FL,Florida,Tallahassee,30.438118,-84.281296
9,GA,Georgia,Atlanta,33.749027,-84.388229
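
With an outer join it is not always obvious which side a row came from. pd.merge() accepts 'indicator=True', which adds a '_merge' column recording this. A minimal sketch on two made-up frames (`left` and `right` are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({'name': ['California', 'Colorado'], 'abbreviation': ['CA', 'CO']})
right = pd.DataFrame({'name': ['Colorado', 'Hawaii'], 'capital': ['Denver', 'Honolulu']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
audit = pd.merge(left, right, how='outer', indicator=True)
print(audit[['name', '_merge']])
```

Filtering on '_merge' is a handy way to list the rows that failed to match.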


## Exercise

Using the two DataFrames below:
- What happens if you use pd.merge() with no arguments other than the two DataFrames? What type of merge is it? What column is the join key?
- Left join the df_short to the df_long
- Do it again but join on the letters rather than the integers. Use the documentation for help if needed.
- Now perform a right join of df_short to df_long
- Finally, perform an outer join

In [106]:
df_long = pd.DataFrame({'int-numbers': [1, 2, 3, 4, 5], 'str-numbers':['one', 'two', 'three', 'four', 'five'], 'lc-letter': ['a', 'b', 'c', 'd', 'e'], 'uc-letter': ['A', 'B', 'C', 'D', 'E']})

In [107]:
df_short = pd.DataFrame({'int-numbers': [1, 2, 3, 26], 'phonetics': ['alpha', 'bravo', 'charlie', 'zulu'], 'letters': ['a', 'b', 'c', 'z']})

In [108]:
pd.merge(df_long,df_short)

Unnamed: 0,int-numbers,lc-letter,str-numbers,uc-letter,letters,phonetics
0,1,a,one,A,a,alpha
1,2,b,two,B,b,bravo
2,3,c,three,C,c,charlie


In [130]:
# Use left_on and right_on when the join key has a different name in each DataFrame
# When the key has the same name in both, on=... is enough
pd.merge(df_short,df_long,how='outer',left_on='letters',right_on='lc-letter')

Unnamed: 0,int-numbers_x,letters,phonetics,int-numbers_y,lc-letter,str-numbers,uc-letter
0,1.0,a,alpha,1.0,a,one,A
1,2.0,b,bravo,2.0,b,two,B
2,3.0,c,charlie,3.0,c,three,C
3,26.0,z,zulu,,,,
4,,,,4.0,d,four,D
5,,,,5.0,e,five,E
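
Two things in that output deserve a note. Both DataFrames have an 'int-numbers' column that is not part of the key, so pandas keeps both and appends '_x' and '_y'. And the integers show up as 1.0, 2.0 and so on because NaN forces the column to float. The sketch below, on cut-down versions of the two frames, uses the 'suffixes' parameter to pick more readable names (the suffix strings are just an illustration):

```python
import pandas as pd

df_long = pd.DataFrame({'int-numbers': [1, 2, 3], 'lc-letter': ['a', 'b', 'c']})
df_short = pd.DataFrame({'int-numbers': [1, 2, 26], 'letters': ['a', 'b', 'z']})

# suffixes replaces the default '_x' / '_y' on clashing column names
merged = pd.merge(df_short, df_long, how='left',
                  left_on='letters', right_on='lc-letter',
                  suffixes=('_short', '_long'))
print(merged.columns.tolist())
print(merged)
```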


## Grouping

In [131]:
# create a single state DataFrame
states = pd.concat([states_atom, states_ntoz],ignore_index=True)
states

Unnamed: 0,abbreviation,name
0,AL,Alabama
1,AK,Alaska
2,AZ,Arizona
3,AR,Arkansas
4,CA,California
5,CO,Colorado
6,CT,Connecticut
7,DE,Delaware
8,FL,Florida
9,GA,Georgia


### Get all the states' first letters

In [132]:
st_letters = states['name'].apply(lambda x: x[0])

### And the length of each name

In [133]:
st_len = states['name'].apply(lambda x: len(x))
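
As an aside, pandas also offers the '.str' accessor for string columns, which can do the same job without a lambda. A minimal sketch on a made-up Series (`names` is an illustrative name):

```python
import pandas as pd

names = pd.Series(['Alabama', 'New York', 'Ohio'])

# .str[0] takes the first character of every string
first_letters = names.str[0]

# .str.len() gives the length of every string
lengths = names.str.len()

print(list(first_letters))
print(list(lengths))
```

Either style works here; '.str' also copes with missing values, where a plain lambda would raise an error.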

## Then make that a single DataFrame

In [134]:
sdf = pd.concat([st_letters, st_len], axis=1, keys=['letter', 'length'])
sdf

Unnamed: 0,letter,length
0,A,7
1,A,6
2,A,7
3,A,8
4,C,10
5,C,8
6,C,11
7,D,8
8,F,7
9,G,7


## Now for the grouping!

In [265]:
sdf.groupby('letter')['length'].mean().to_frame()

Unnamed: 0_level_0,length
letter,Unnamed: 1_level_1
A,7.0
C,9.666667
D,8.0
F,7.0
G,7.0
H,6.0
I,6.0
K,7.0
L,9.0
M,8.625
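
Notice that the group key ('letter') became the index of the result. If you would rather have it back as an ordinary column, reset_index() does that. A minimal sketch on a small made-up version of sdf:

```python
import pandas as pd

sdf = pd.DataFrame({'letter': ['A', 'A', 'C', 'C', 'D'],
                    'length': [7, 6, 10, 8, 8]})

# Default: the group key becomes the index of the result
by_index = sdf.groupby('letter')['length'].mean()
print(by_index)

# reset_index() turns the key back into a regular column
as_columns = by_index.reset_index()
print(as_columns)
```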


## Other methods: .count()

In [276]:
sdf.groupby('letter')['length'].count()

letter
A    4
C    3
D    1
F    1
G    1
H    1
I    4
K    2
L    1
M    8
N    8
O    3
P    1
R    1
S    2
T    2
U    1
V    2
W    4
Name: length, dtype: int64

## Other methods: .nunique() and .unique()

In [271]:
sdf.groupby('letter')['length'].nunique().to_frame().head(3)

Unnamed: 0_level_0,length
letter,Unnamed: 1_level_1
A,3
C,3
D,1


In [277]:
# to_frame accepts an argument that sets the column name
sdf.groupby('letter')['length'].unique().to_frame('unique items').head(3)

Unnamed: 0_level_0,unique items
letter,Unnamed: 1_level_1
A,"[7, 6, 8]"
C,"[10, 8, 11]"
D,[8]


## Groupby with multiple methods

In [298]:
import numpy as np
sdf.groupby('letter')['length'].agg([max, min, np.mean, len]).head(3)

Unnamed: 0_level_0,max,min,mean,len
letter,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,8,6,7.0,4
C,11,8,9.666667,3
D,8,8,8.0,1


## Groupby with lambda functions

In [146]:
import numpy as np
sdf.groupby('letter')['length'].agg([max, min, lambda x: x.nunique(), np.count_nonzero]).head(3)

Unnamed: 0_level_0,max,min,<lambda>,count_nonzero
letter,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,8,6,3,4
C,11,8,3,3
D,8,8,1,1
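
A column called '<lambda>' is not very helpful. One option, assuming a pandas version that supports named aggregation (0.25 or later), is to pass keyword arguments to agg(), where each keyword becomes a column name. The column names below are made up for illustration:

```python
import pandas as pd

sdf = pd.DataFrame({'letter': ['A', 'A', 'A', 'C', 'C'],
                    'length': [7, 6, 7, 10, 8]})

# keyword = output column name, value = the aggregation to apply
summary = sdf.groupby('letter')['length'].agg(
    longest='max',
    shortest='min',
    distinct_lengths='nunique',
)
print(summary)
```

Passing the aggregation as a string such as 'max' also avoids handing Python built-ins to agg().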


## Multiple indices grouping

In [151]:
# get a single state DataFrame
bdf = pd.concat([states_atom, states_ntoz])

# use apply to get the first and last letters and length
bdf['first_letter'] = bdf['name'].apply(lambda x: x[0].upper())
bdf['last_letter'] = bdf['name'].apply(lambda x: x[-1].upper())
bdf['name_length'] = bdf['name'].apply(lambda x: len(x))

bdf.head(3)

Unnamed: 0,abbreviation,name,first_letter,last_letter,name_length
0,AL,Alabama,A,A,7
1,AK,Alaska,A,A,6
2,AZ,Arizona,A,A,7


In [320]:
bdf.groupby(['first_letter', 'last_letter'])['name_length'].nunique().to_frame('uniques').head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,uniques
first_letter,last_letter,Unnamed: 2_level_1
A,A,2
A,S,1
C,A,1
C,O,1
C,T,1
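
Grouping on two columns gives a result with a two-level index. The sketch below, on a small made-up version of bdf, shows two common ways to work with it: selecting one combination with a tuple, and pivoting the inner level into columns with unstack().

```python
import pandas as pd

bdf = pd.DataFrame({'first_letter': ['A', 'A', 'A', 'C'],
                    'last_letter': ['A', 'A', 'S', 'A'],
                    'name_length': [7, 6, 8, 10]})

uniques = bdf.groupby(['first_letter', 'last_letter'])['name_length'].nunique()

# A tuple selects one (first_letter, last_letter) combination
print(uniques.loc[('A', 'A')])

# unstack() moves last_letter into the columns; pairs that never occur become NaN
wide = uniques.unstack()
print(wide)
```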


## Exercise

Use the dataset below of major cities of the world to answer the following:
- How many cities are there in each country? How about Algeria specifically?
- How many cities are there for each country and subcountry together? How about Herat, Afghanistan?
- How many unique subcountries are there in Azerbaijan vs. Austria?
- Bonus: Using only a single groupby line, create a DataFrame that gives the unique first characters of the cities' names
- Double Bonus: Using only a single groupby line, create a DataFrame that gives the unique first and last characters of the cities' names

In [152]:
cities = pd.read_csv('http://data.okfn.org/data/core/world-cities/r/world-cities.csv', encoding='latin-1')

In [175]:
cities.head(2)

Unnamed: 0,name,country,subcountry,geonameid
0,les Escaldes,Andorra,Escaldes-Engordany,3040051
1,Andorra la Vella,Andorra,Andorra la Vella,3041563


In [170]:
cities['country'].unique()

array([u'Andorra', u'United Arab Emirates', u'Afghanistan',
       u'Antigua and Barbuda', u'Anguilla', u'Albania', u'Armenia',
       u'Angola', u'Argentina', u'American Samoa', u'Austria',
       u'Australia', u'Aruba', u'Aland Islands', u'Azerbaijan',
       u'Bosnia and Herzegovina', u'Barbados', u'Bangladesh', u'Belgium',
       u'Burkina Faso', u'Bulgaria', u'Bahrain', u'Burundi', u'Benin',
       u'Saint Barthelemy', u'Bermuda', u'Brunei', u'Bolivia',
       u'Bonaire, Saint Eustatius and Saba ', u'Brazil', u'Bahamas',
       u'Bhutan', u'Botswana', u'Belarus', u'Belize', u'Canada',
       u'Cocos Islands', u'Democratic Republic of the Congo',
       u'Central African Republic', u'Republic of the Congo',
       u'Switzerland', u'Ivory Coast', u'Cook Islands', u'Chile',
       u'Cameroon', u'China', u'Colombia', u'Costa Rica', u'Cuba',
       u'Cape Verde', u'Curacao', u'Christmas Island', u'Cyprus',
       u'Czech Republic', u'Germany', u'Djibouti', u'Denmark', u'Dominica',
    

In [211]:
cities.groupby('country')['name'].nunique()['Algeria']

247

In [208]:
cities.groupby(['country','subcountry'])['name'].nunique()['Afghanistan']['Herat']

5

In [172]:
r=cities.groupby('country')['subcountry'].nunique()
print(r['Azerbaijan'])
print(r['Austria'])

41
8


In [233]:
pd.DataFrame([x[0] for x in [x[0] for x in cities.groupby('name')['name'].unique()]])[0].unique()

array([u"'", u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J',
       u'K', u'L', u'M', u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U',
       u'V', u'W', u'X', u'Y', u'Z', u'e', u'g', u'l', u'm', u'\xc3'], dtype=object)

## Conclusion

We've covered a lot of ground in this lecture:
- How to use pd.concat() to combine Series and DataFrames vertically and horizontally
- How to use pd.merge() to join DataFrames with left, right, inner, and outer
- How to use groupby() to group items by a given key

Next up we have a lab on these same concepts...