## Merging data frames

Merging can be done in two ways: *horizontally* (also called a join) and *vertically* (also called concatenation).

There are 4 types of joins (merges) that can be performed:
1. Outer Join (union)
2. Inner Join (intersection)
3. Left Join 
4. Right Join

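For two data frames A and B, with A passed first (the left one), a left join keeps every row of A and adds data from B only where B has a matching row. A right join does the same with the roles of A and B swapped.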

In [7]:
import pandas as pd

staffs = pd.DataFrame([
    {'Name':'Kelly', 'Role':'Director of HR'},
    {'Name':'Sally', 'Role':'Course liason'},
    {'Name':'James', 'Role':'Grader'},
])

students = pd.DataFrame([
    {'Name':'James', 'School':'Business'},
    {'Name':'Sally', 'School':'Law'},
    {'Name':'Mike', 'School':'Engineering'},
])

staffs = staffs.set_index('Name')
students = students.set_index('Name')
# Both data frames are now indexed by Name, so they share a common index

print(staffs.head())
print(students.head())


                 Role
Name                 
Kelly  Director of HR
Sally   Course liason
James          Grader
            School
Name              
James     Business
Sally          Law
Mike   Engineering


In [8]:
# Outer join
pd.merge(staffs,students,how='outer',left_index=True,right_index=True)

Name,Role,School
James,Grader,Business
Kelly,Director of HR,NaN
Mike,NaN,Engineering
Sally,Course liason,Law


In [9]:
# Inner join
pd.merge(staffs,students,how='inner',left_index=True,right_index=True)

Name,Role,School
Sally,Course liason,Law
James,Grader,Business


In [10]:
# Left join
pd.merge(staffs,students,how='left',left_index=True,right_index=True)

Name,Role,School
Kelly,Director of HR,NaN
Sally,Course liason,Law
James,Grader,Business


In [11]:
# Right join
pd.merge(staffs,students,how='right',left_index=True,right_index=True)

Name,Role,School
James,Grader,Business
Sally,Course liason,Law
Mike,NaN,Engineering
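
To see which side each row of a join came from, `pd.merge` accepts an `indicator` flag. The sketch below rebuilds the two small frames from above (so it runs on its own) and does an outer join. The variable name `merged` is only illustrative.

```python
import pandas as pd

# Same toy data as above, already indexed by Name
staffs = pd.DataFrame(
    {"Role": ["Director of HR", "Course liason", "Grader"]},
    index=pd.Index(["Kelly", "Sally", "James"], name="Name"),
)
students = pd.DataFrame(
    {"School": ["Business", "Law", "Engineering"]},
    index=pd.Index(["James", "Sally", "Mike"], name="Name"),
)

# indicator=True adds a "_merge" column holding
# "left_only", "right_only" or "both" for every row
merged = pd.merge(
    staffs, students,
    how="outer", left_index=True, right_index=True,
    indicator=True,
)
print(merged)
```

Rows marked `both` are exactly the ones an inner join would keep; a left join keeps `both` plus `left_only`, and a right join keeps `both` plus `right_only`.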


We don't need **indices** to join data frames; we can also join on **columns** using the **on** parameter.

In [12]:
staffs = staffs.reset_index()
students = students.reset_index()


pd.merge(staffs,students,how='right',on='Name')

,Name,Role,School
0,Sally,Course liason,Law
1,James,Grader,Business
2,Mike,NaN,Engineering


In [13]:
# conflicting columns
staffs = pd.DataFrame([
    {'Name':'Kelly', 'Role':'Director of HR', 'Location':'Jane Street'},
    {'Name':'Sally', 'Role':'Course liason', 'Location':'Western Ave'},
    {'Name':'James', 'Role':'Grader','Location':'Wilson Ave'},
])

students = pd.DataFrame([
    {'Name':'James', 'School':'Business','Location':'House #22 Woodward Street'},
    {'Name':'Sally', 'School':'Law','Location':'House #2 Jane Street'},
    {'Name':'Mike', 'School':'Engineering','Location':'House #12 183rd Street'},
])


# Here Location in the staffs df is the office location, whereas Location in the students df is the home address.
# pandas resolves this conflict by appending _x (left) and _y (right) to the conflicting column names.


pd.merge(staffs,students,how='left',on='Name')

,Name,Role,Location_x,School,Location_y
0,Kelly,Director of HR,Jane Street,NaN,NaN
1,Sally,Course liason,Western Ave,Law,House #2 Jane Street
2,James,Grader,Wilson Ave,Business,House #22 Woodward Street
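
The default `_x` and `_y` suffixes say little about what each column means. As a small sketch, the `suffixes` parameter of `pd.merge` lets us choose more descriptive ones. The names `_office` and `_home` are picked here purely for illustration, and the frames are a shortened copy of the data above.

```python
import pandas as pd

staffs = pd.DataFrame([
    {'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'Jane Street'},
    {'Name': 'Sally', 'Role': 'Course liason', 'Location': 'Western Ave'},
])
students = pd.DataFrame([
    {'Name': 'Sally', 'School': 'Law', 'Location': 'House #2 Jane Street'},
    {'Name': 'Mike', 'School': 'Engineering', 'Location': 'House #12 183rd Street'},
])

# suffixes=(left, right) replaces the default ('_x', '_y')
merged = pd.merge(staffs, students, how='left', on='Name',
                  suffixes=('_office', '_home'))
print(merged.columns.tolist())
```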


We can also pass a list to the **on** parameter to join on several columns at once, for example ```on=['FirstName','LastName']```
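
A minimal sketch of a multi-column join follows. The first and last names are made up for the example; the point is that a row only matches when every column listed in `on` agrees.

```python
import pandas as pd

# Hypothetical data: two people share the first name James
staffs = pd.DataFrame([
    {'FirstName': 'Sally', 'LastName': 'Brooks', 'Role': 'Course liason'},
    {'FirstName': 'James', 'LastName': 'Wilde', 'Role': 'Grader'},
])
students = pd.DataFrame([
    {'FirstName': 'Sally', 'LastName': 'Brooks', 'School': 'Law'},
    {'FirstName': 'James', 'LastName': 'Hammond', 'School': 'Business'},
])

# Both FirstName and LastName must match for a row to be joined
merged = pd.merge(staffs, students, how='inner', on=['FirstName', 'LastName'])
print(merged)
```

Only Sally Brooks appears in the result, because the two James rows differ in `LastName`.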


#### Merging vertically (concatenating)

We can concatenate data frames vertically with ```pd.concat(frames)```,
where *frames* is the list of data frames to be concatenated. We can also label each input frame by passing keys, as in ```pd.concat(frames,keys=['2001','2002','2003'])```; the keys become the outer level of the resulting index.
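
As a small sketch of vertical concatenation, the three frames below stand in for yearly data. Their contents and the `Score` column are invented for the example.

```python
import pandas as pd

# Hypothetical per-year frames with the same columns
df_2001 = pd.DataFrame({'Name': ['Kelly', 'Sally'], 'Score': [1, 2]})
df_2002 = pd.DataFrame({'Name': ['James'], 'Score': [3]})
df_2003 = pd.DataFrame({'Name': ['Mike'], 'Score': [4]})

frames = [df_2001, df_2002, df_2003]

# keys labels the rows of each input frame in the outer index level
stacked = pd.concat(frames, keys=['2001', '2002', '2003'])
print(stacked)

# Rows that came from one input can be pulled back out by key
print(stacked.loc['2001'])
```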

## Group by in pandas

Sometimes we want to select data based on groups and understand aggregated data at the group level. We have seen that even though pandas allows us to iterate over every row in a DataFrame, it is generally very slow to do so.

Pandas has a **groupby()** function to speed up such tasks. The idea behind the groupby() function is that it takes some DataFrame, splits it into chunks based on some key values, applies a computation to those chunks, then combines the results back together into another DataFrame. In pandas this is referred to as the split-apply-combine pattern.
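
Before turning to the census data, here is a minimal sketch of split-apply-combine on a tiny frame. The column names mirror the census file used below, but the numbers are made up.

```python
import pandas as pd

# Toy data: two states with two counties each (invented populations)
df = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'CENSUS2010POP': [100, 300, 50, 150],
})

# split: one chunk per STNAME value
for state, chunk in df.groupby('STNAME'):
    print(state, len(chunk))

# apply + combine: the mean of each chunk, collected into one Series
avg = df.groupby('STNAME')['CENSUS2010POP'].mean()
print(avg)
```

The loop shows the split step on its own, while the last call lets pandas do all three steps at once.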

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("datasets/census.csv")
df = df[df['SUMLEV'] == 50]
df.head()

,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [3]:
%%timeit -n 3

for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME'] == state).dropna()['CENSUS2010POP'])
    print('Counties in state '+state+' have an avg population of '+str(avg))

Counties in state Alabama have an avg population of 71339.34328358209
Counties in state Alaska have an avg population of 24490.724137931036
Counties in state Arizona have an avg population of 426134.4666666667
Counties in state Arkansas have an avg population of 38878.90666666667
Counties in state California have an avg population of 642309.5862068966
Counties in state Colorado have an avg population of 78581.1875
Counties in state Connecticut have an avg population of 446762.125
Counties in state Delaware have an avg population of 299311.3333333333
Counties in state District of Columbia have an avg population of 601723.0
Counties in state Florida have an avg population of 280616.5671641791
Counties in state Georgia have an avg population of 60928.63522012578
Counties in state Hawaii have an avg population of 272060.2
Counties in state Idaho have an avg population of 35626.86363636364
Counties in state Illinois have an avg population of 125790.50980392157
Counties in state Indiana have

3.19 s ± 87.2 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


In [7]:
%%timeit -n 3

for group,frames in df.groupby('STNAME'):
    # Iterating over a groupby object yields (key, frame) tuples: the first
    # item is the value of the key we grouped by, in this case STNAME, and
    # the second is the sub-DataFrame of rows that have that key.
    frames.head()
    avg = np.average(frames['CENSUS2010POP'])
    print('Counties in state '+group+' have an avg population of '+str(avg))

Counties in state Alabama have an avg population of 71339.34328358209
Counties in state Alaska have an avg population of 24490.724137931036
Counties in state Arizona have an avg population of 426134.4666666667
Counties in state Arkansas have an avg population of 38878.90666666667
Counties in state California have an avg population of 642309.5862068966
Counties in state Colorado have an avg population of 78581.1875
Counties in state Connecticut have an avg population of 446762.125
Counties in state Delaware have an avg population of 299311.3333333333
Counties in state District of Columbia have an avg population of 601723.0
Counties in state Florida have an avg population of 280616.5671641791
Counties in state Georgia have an avg population of 60928.63522012578
Counties in state Hawaii have an avg population of 272060.2
Counties in state Idaho have an avg population of 35626.86363636364
Counties in state Illinois have an avg population of 125790.50980392157
Counties in state Indiana have


In [4]:
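# When groupby() is given a function, it calls it on each index label,
# so we make STNAME the index first.
df = df.set_index('STNAME')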
def set_batch_number(item):
    if item[0]<'M':
        return 0
    if item[0]<'Q':
        return 1
    return 2

for group,frame in df.groupby(set_batch_number):
    print('There are '+str(len(frame))+' records in group ' + str(group) +' for processing')
        

There are 1177 records in group 0 for processing
There are 1134 records in group 1 for processing
There are 831 records in group 2 for processing


In [22]:
# multi-index grouping
df = pd.read_csv('datasets/listings.csv')
df = df.set_index(['cancellation_policy','review_scores_value'])

for group,frame in df.groupby(level=(0,1)):
    print(group)

('flexible', 2.0)
('flexible', 4.0)
('flexible', 5.0)
('flexible', 6.0)
('flexible', 7.0)
('flexible', 8.0)
('flexible', 9.0)
('flexible', 10.0)
('moderate', 2.0)
('moderate', 4.0)
('moderate', 6.0)
('moderate', 7.0)
('moderate', 8.0)
('moderate', 9.0)
('moderate', 10.0)
('strict', 2.0)
('strict', 3.0)
('strict', 4.0)
('strict', 5.0)
('strict', 6.0)
('strict', 7.0)
('strict', 8.0)
('strict', 9.0)
('strict', 10.0)
('super_strict_30', 6.0)
('super_strict_30', 7.0)
('super_strict_30', 8.0)
('super_strict_30', 9.0)
('super_strict_30', 10.0)


In [23]:
def grouping_fun(item):
    if item[1]==10.0:
        return (item[0],"10.0")
    else:
        return (item[0],"not 10.0")

for group,frame in df.groupby(by=grouping_fun):
    print(group)

('flexible', '10.0')
('flexible', 'not 10.0')
('moderate', '10.0')
('moderate', 'not 10.0')
('strict', '10.0')
('strict', 'not 10.0')
('super_strict_30', '10.0')
('super_strict_30', 'not 10.0')


## Aggregation

The most straightforward apply step is aggregation, which is done with the agg() method on the groupby object.

In [24]:
df = df.reset_index()
# np.average does not skip NaN, so a group containing a missing score comes out as NaN
df.groupby('cancellation_policy').agg({'review_scores_value':np.average})

cancellation_policy,review_scores_value
flexible,NaN
moderate,NaN
strict,NaN
super_strict_30,NaN


In [25]:
# np.nanmean ignores NaN values when averaging
df.groupby('cancellation_policy').agg({'review_scores_value':np.nanmean})

cancellation_policy,review_scores_value
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313
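
The two results above differ only in how missing values are handled. The sketch below reproduces that on a tiny made-up frame (the scores are invented) and also shows the string name `'mean'`, which selects pandas' own reducer and skips NaN by default.

```python
import numpy as np
import pandas as pd

# Toy listings: one flexible listing has no review score
df = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [10.0, np.nan, 8.0, 9.0],
})

# np.average propagates NaN
plain = np.average(df['review_scores_value'])

# np.nanmean skips it
skipped = np.nanmean(df['review_scores_value'])

# pandas' built-in 'mean' also skips NaN within each group
means = df.groupby('cancellation_policy').agg({'review_scores_value': 'mean'})
print(plain, skipped)
print(means)
```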


In [26]:
# we can extend this to aggregate multiple functions or multiple cols
df.groupby('cancellation_policy').agg({'review_scores_value':(np.nanmean,np.nanstd), "reviews_per_month":np.nanmean})


cancellation_policy,review_scores_value (nanmean),review_scores_value (nanstd),reviews_per_month (nanmean)
flexible,9.237421,1.096271,1.82921
moderate,9.307398,0.859859,2.391922
strict,9.081441,1.040531,1.873467
super_strict_30,8.537313,0.840785,0.340143


## Transformation
Transformation is different from aggregation. Where agg() returns a single value per column for each group, transform() returns an object that is the same size as the group. Essentially, it broadcasts the function you supply over the grouped DataFrame, returning a new DataFrame. This makes combining the data later much easier.

In [28]:
cols=['cancellation_policy','review_scores_value']
transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.head()

,review_scores_value
0,9.307398
1,9.307398
2,9.307398
3,9.307398
4,9.237421
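
Because the transformed result has one row per original row, it can be attached straight back onto the source frame. A minimal sketch on made-up scores follows; the column names `mean_rating` and `diff_from_mean` are chosen only for this example.

```python
import numpy as np
import pandas as pd

# Toy listings with one missing score
df = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [10.0, 8.0, 7.0, np.nan],
})

# transform returns one value per original row, aligned on the same index
df['mean_rating'] = (
    df.groupby('cancellation_policy')['review_scores_value'].transform('mean')
)

# With the group mean on every row, per-row comparisons are a plain subtraction
df['diff_from_mean'] = (df['review_scores_value'] - df['mean_rating']).abs()
print(df)
```

The row with the missing score still receives its group's mean, while its difference stays NaN.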
