# Merging DataFrames with pandas

As a data scientist, you'll often find that the data you need is not in a single file. It may be spread across a number of **text files, spreadsheets, or databases.** You'll want to import the data of interest as a collection of DataFrames and work out how to combine them to answer your central questions. This course is about combining, or merging, DataFrames, an essential part of any working data scientist's toolbox. You'll hone your pandas skills by learning how to organize, reshape, and aggregate multiple data sets to answer specific questions.


## [Preparing Data](#pd)
### [Reading multiple data files](#rmdf)
   * [Reading DataFrames from multiple files](#rdmf)
   * [Reading DataFrames from multiple files in a loop](#rdfml)
   * [Combining DataFrames from multiple data files](#cdmdf)

### [Reindexing DataFrames](#rd)
   * [Sorting DataFrame with the Index & columns](#sdi)
   * [Reindexing DataFrame from a list](#rdl)
   * [Reindexing using another DataFrame Index](#radi)

### [Arithmetic with Series & DataFrames](#asd)
   * [Adding unaligned DataFrames](#aud)
   * [Broadcasting in arithmetic formulas](#baf)
   * [Computing percentage growth of GDP](#cpg)
   * [Converting currency of stocks](#ccs)
   


## [Concatenating Data](#cd)
* [Appending & concatenating Series](#A&cS)
* [Appending Series with nonunique Indices](#ASwnI)
* [Appending pandas Series](#ApS)
* [Concatenating pandas Series along row axis](#CpSara)
* [Appending & concatenating DataFrames](#A&cD)
* [Appending DataFrames with ignore_index](#ADwi)
* [Concatenating pandas DataFrames along column axis](#CpDaca)
* [Reading multiple files to build a DataFrame](#RmftbaD)
* [Concatenation, keys, & MultiIndexes](#Ck&M)
* [Concatenating vertically to get MultiIndexed rows](#CvtgMr)
* [Slicing MultiIndexed DataFrames](#SMD)
* [Concatenating horizontally to get MultiIndexed columns](#ChtgMc)
* [Concatenating DataFrames from a dict](#CDfad)
* [Outer & inner joins](#O&ij)
* [Concatenating DataFrames with inner join](#CDwij)
* [Resampling & concatenating DataFrames with inner join](#R&cDwij)




<p id ='pd'><p>
## Preparing Data

In [1]:
import pandas as pd

In [5]:
cd jupyternotes/


/Users/satyammishra/Desktop/Datacamp stuff/jupyternotes


<p id ='rmdf'><p>
### Reading multiple data files

<p id ='rdmf'><p>
### Reading DataFrames from multiple files

In [6]:
bronze= pd.read_csv('./data/olympic/Bronze.csv')
gold = pd.read_csv('./data/olympic/Gold.csv')
silver = pd.read_csv('./data/olympic/Silver.csv')
silver.head()

Unnamed: 0,NOC,Country,Total
0,USA,United States,1195.0
1,URS,Soviet Union,627.0
2,GBR,United Kingdom,591.0
3,FRA,France,461.0
4,GER,Germany,350.0


<p id ='rdfml'><p>
### Reading DataFrames from multiple files in a loop

In [7]:
# Create the list of file names: filenames
filenames = ['Gold.csv', 'Silver.csv', 'Bronze.csv']


In [8]:
pwd

'/Users/satyammishra/Desktop/Datacamp stuff/jupyternotes'

In [9]:
cd data/olympic/

/Users/satyammishra/Desktop/Datacamp stuff/jupyternotes/data/olympic


In [10]:
ls

Bronze.csv
Gold.csv
Silver.csv
Summer Olympic medalists 1896 to 2008 - ALL MEDALISTS.tsv
Summer Olympic medalists 1896 to 2008 - EDITIONS.tsv
Summer Olympic medalists 1896 to 2008 - IOC COUNTRY CODES.csv
bronze_top5.csv
gold_top5.csv
silver_top5.csv


In [14]:
dataframes = [pd.read_csv(i) for i in filenames]

In [15]:
# Keep the full gold DataFrame; .head() is only needed for display
gold = dataframes[0]
gold.head()

Unnamed: 0,NOC,Country,Total
0,USA,United States,2088.0
1,URS,Soviet Union,838.0
2,GBR,United Kingdom,498.0
3,FRA,France,378.0
4,GER,Germany,407.0


In [16]:
dataframes[2].head()

Unnamed: 0,NOC,Country,Total
0,USA,United States,1052.0
1,URS,Soviet Union,584.0
2,GBR,United Kingdom,505.0
3,FRA,France,475.0
4,GER,Germany,454.0
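A list comprehension keeps the DataFrames in order, but you have to remember which position holds which medal. As a minimal sketch, the same loop can build a dict keyed by medal name. The in-memory strings below are stand-ins for the three CSV files and assume the `NOC,Country,Total` layout shown in the outputs above.

```python
import io
import pandas as pd

# In-memory stand-ins for Gold.csv, Silver.csv and Bronze.csv (assumption:
# the real files share this NOC,Country,Total layout).
files = {
    'Gold.csv': "NOC,Country,Total\nUSA,United States,2088.0\nURS,Soviet Union,838.0\n",
    'Silver.csv': "NOC,Country,Total\nUSA,United States,1195.0\nURS,Soviet Union,627.0\n",
    'Bronze.csv': "NOC,Country,Total\nUSA,United States,1052.0\nURS,Soviet Union,584.0\n",
}

# Key each DataFrame by its medal name instead of relying on list position.
frames = {
    name.split('.')[0].lower(): pd.read_csv(io.StringIO(text))
    for name, text in files.items()
}

print(frames['silver'])
```

With real files you would pass the path to `pd.read_csv` instead of an `io.StringIO` buffer.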


<p id ='cdmdf'><p>
### Combining DataFrames from multiple data files
In this exercise, you'll combine the three DataFrames from earlier exercises - gold, silver, & bronze - into a single DataFrame called medals. The approach you'll use here is clumsy. Later on in the course, you'll see various powerful methods that are frequently used in practice for concatenating or merging DataFrames.



In [17]:
# Make a copy of gold: medals
medals = gold.copy()
medals.head()


Unnamed: 0,NOC,Country,Total
0,USA,United States,2088.0
1,URS,Soviet Union,838.0
2,GBR,United Kingdom,498.0
3,FRA,France,378.0
4,GER,Germany,407.0


In [18]:
# Rename the columns of medals so that 'Total' becomes 'Gold'
medals.columns = ['NOC', 'Country', 'Gold']
medals.head()

Unnamed: 0,NOC,Country,Gold
0,USA,United States,2088.0
1,URS,Soviet Union,838.0
2,GBR,United Kingdom,498.0
3,FRA,France,378.0
4,GER,Germany,407.0


In [19]:
# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver.Total
medals['Bronze'] = bronze.Total
medals.head()

Unnamed: 0,NOC,Country,Gold,Silver,Bronze
0,USA,United States,2088.0,1195.0,1052.0
1,URS,Soviet Union,838.0,627.0,584.0
2,GBR,United Kingdom,498.0,591.0,505.0
3,FRA,France,378.0,461.0,475.0
4,GER,Germany,407.0,350.0,454.0
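One reason this approach is clumsy: column assignment aligns on the default integer index, so it pairs the right rows only when all three files list the countries in the same order. A hedged sketch of a sturdier variant indexes each frame by `NOC` first, so the alignment happens on country codes. The small frames below reuse values from the outputs above, with the silver rows deliberately shuffled.

```python
import pandas as pd

gold = pd.DataFrame({'NOC': ['USA', 'URS', 'GBR'], 'Total': [2088.0, 838.0, 498.0]})
# Same countries, deliberately listed in a different row order.
silver = pd.DataFrame({'NOC': ['GBR', 'USA', 'URS'], 'Total': [591.0, 1195.0, 627.0]})

medals = gold.set_index('NOC').rename(columns={'Total': 'Gold'})
# Assignment aligns on the NOC index labels, not on row position.
medals['Silver'] = silver.set_index('NOC')['Total']

print(medals)
```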


<p id ='rd'><p>
## Reindexing DataFrames

<p id ='sdi'><p>
### Sorting DataFrame with the Index & columns

In [23]:
cd ..

/Users/satyammishra/Desktop/Datacamp stuff/jupyternotes/data


In [20]:
mydict = {'Month':['Jan','Feb', 'Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],
    'MaxTemperatureF':['68','60','68','84','88','89','91','86','90','84','72','68']
}

In [21]:
weather1 = pd.DataFrame.from_dict(mydict)
weather1 =weather1.set_index('Month')
weather1.head()

Unnamed: 0_level_0,MaxTemperatureF
Month,Unnamed: 1_level_1
Jan,68
Feb,60
Mar,68
Apr,84
May,88


In [22]:
# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()
weather2.head()

Unnamed: 0_level_0,MaxTemperatureF
Month,Unnamed: 1_level_1
Apr,84
Aug,86
Dec,68
Feb,60
Jan,68


In [23]:
# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)
weather3.head()

Unnamed: 0_level_0,MaxTemperatureF
Month,Unnamed: 1_level_1
Sep,90
Oct,84
Nov,72
May,88
Mar,68


In [24]:
# Sort weather1 by the values of 'MaxTemperatureF' in descending order: weather4
# (the values are stored as strings, so they are compared as text)
weather4 = weather1.sort_values('MaxTemperatureF', ascending=False)
weather4.head()

Unnamed: 0_level_0,MaxTemperatureF
Month,Unnamed: 1_level_1
Jul,91
Sep,90
Jun,89
May,88
Aug,86
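Because `MaxTemperatureF` holds strings here, `sort_values` compares text rather than numbers. That happens to give the expected order when every value has two digits, but it would not in general. The sketch below uses a hypothetical reading of `'100'` (not part of the data above) to show the difference, and converts the column with `astype` before sorting.

```python
import pandas as pd

temps = pd.DataFrame(
    {'MaxTemperatureF': ['68', '100', '91']},   # '100' is a hypothetical reading
    index=['Jan', 'Jul', 'Aug'],
)

as_text = temps.sort_values('MaxTemperatureF', ascending=False)
as_number = (temps.astype({'MaxTemperatureF': int})
                  .sort_values('MaxTemperatureF', ascending=False))

print(list(as_text.index))    # text comparison puts '100' last
print(list(as_number.index))  # numeric comparison puts 100 first
```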


<p id ='rdl'><p>
### Reindexing DataFrame from a list

In [25]:
year = ['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec']

In [26]:
# Keep the first four rows of weather2: df
df = weather2[:4]
df

Unnamed: 0_level_0,MaxTemperatureF
Month,Unnamed: 1_level_1
Apr,84
Aug,86
Dec,68
Feb,60


In [27]:
df.reindex(year).ffill()

Unnamed: 0_level_0,MaxTemperatureF
Month,Unnamed: 1_level_1
Jan,
Feb,60.0
Mar,60.0
Apr,84.0
May,84.0
Jun,84.0
Jul,84.0
Aug,86.0
Sep,86.0
Oct,86.0
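To see why `Jan` stays empty while the other months are filled, it can help to split the two steps. This is a small sketch with two of the rows from above: `reindex` inserts `NaN` for labels that are missing, and `ffill` then copies the last valid value downward, so a missing first row has nothing to copy from.

```python
import pandas as pd

df = pd.DataFrame({'MaxTemperatureF': [84, 60]}, index=['Apr', 'Feb'])
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']

reindexed = df.reindex(months)   # missing months become NaN
filled = reindexed.ffill()       # each NaN takes the value from the row above

print(filled)
```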


<p id ='radi'><p>
### Reindexing using another DataFrame Index

In [31]:
cd ..

/Users/satyammishra/Desktop/Datacamp stuff/jupyternotes/data


In [32]:
babynames1881 = pd.read_csv('./baby/names1881.csv', header=None, names=['name', 'gender', 'count'], index_col=['name', 'gender'])
babynames1981 = pd.read_csv('./baby/names1981.csv',header= None, names=['name', 'gender', 'count'], index_col=['name', 'gender'])
print(babynames1881.shape)
babynames1881.head()


(1935, 1)


Unnamed: 0_level_0,Unnamed: 1_level_0,count
name,gender,Unnamed: 2_level_1
Mary,F,6919
Anna,F,2698
Emma,F,2034
Elizabeth,F,1852
Margaret,F,1658


In [33]:
print(babynames1981.shape)


babynames1981.head()

(19455, 1)


Unnamed: 0_level_0,Unnamed: 1_level_0,count
name,gender,Unnamed: 2_level_1
Jennifer,F,57032
Jessica,F,42519
Amanda,F,34370
Sarah,F,28162
Melissa,F,28003


Your job here is to use the DataFrame .reindex() and .dropna() methods to make a DataFrame common_names counting names from 1881 that were still popular in 1981.



In [34]:
common_names = babynames1881.reindex(babynames1981.index)
common_names.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
name,gender,Unnamed: 2_level_1
Jennifer,F,
Jessica,F,7.0
Amanda,F,263.0
Sarah,F,1226.0
Melissa,F,40.0


In [35]:
common_names = common_names.dropna()
print('There are ', common_names.size, 'common names in 1881 and 1981')

There are  1587 common names in 1881 and 1981
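The same reindex-then-dropna pattern can be checked on a handful of rows. This sketch rebuilds a few `(name, gender)` entries from the outputs above by hand; the variable names are only for illustration.

```python
import pandas as pd

idx_1881 = pd.MultiIndex.from_tuples(
    [('Mary', 'F'), ('Sarah', 'F'), ('Jessica', 'F')], names=['name', 'gender'])
idx_1981 = pd.MultiIndex.from_tuples(
    [('Jennifer', 'F'), ('Jessica', 'F'), ('Sarah', 'F')], names=['name', 'gender'])

names_1881 = pd.DataFrame({'count': [6919, 1226, 7]}, index=idx_1881)
names_1981 = pd.DataFrame({'count': [57032, 42519, 28162]}, index=idx_1981)

# Rows absent from 1881 (Jennifer) come back as NaN and are then dropped.
common = names_1881.reindex(names_1981.index).dropna()

print(common)
```

Note that the counts in the result are the 1881 counts; the 1981 frame only supplies the index.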


<p id ='asd'><p>
## Arithmetic with Series & DataFrames

<p id ='aud'><p>
### Adding unaligned DataFrames

In [36]:
feb = {
'Company':['Acme Corporation','Hooli','Mediacore','Vandelay Inc'],
'Units' :[15, 3, 13, 25]
}

In [37]:
jan ={
'Company':['Acme Corporation', 'Hooli', 'Initech', 'Mediacore', 'Streeplex'],
'Units': [19, 17, 20, 10, 3]
}

In [38]:
january = pd.DataFrame.from_dict(jan)
january =january.set_index('Company')
january

Unnamed: 0_level_0,Units
Company,Unnamed: 1_level_1
Acme Corporation,19
Hooli,17
Initech,20
Mediacore,10
Streeplex,3


In [39]:
february = pd.DataFrame.from_dict(feb)
february = february.set_index('Company')
february

Unnamed: 0_level_0,Units
Company,Unnamed: 1_level_1
Acme Corporation,15
Hooli,3
Mediacore,13
Vandelay Inc,25


In [40]:
total = january+february
total

Unnamed: 0_level_0,Units
Company,Unnamed: 1_level_1
Acme Corporation,34.0
Hooli,20.0
Initech,
Mediacore,23.0
Streeplex,
Vandelay Inc,
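The `+` operator returns `NaN` wherever a company appears in only one month. If a missing month should count as zero sales, the `.add()` method with `fill_value=0` is one way to express that. A minimal sketch with a subset of the companies above:

```python
import pandas as pd

january = pd.Series({'Acme Corporation': 19, 'Hooli': 17, 'Initech': 20})
february = pd.Series({'Acme Corporation': 15, 'Hooli': 3, 'Vandelay Inc': 25})

plain = january + february                    # NaN where a company is missing
filled = january.add(february, fill_value=0)  # a missing month counts as zero

print(filled)
```

Whether zero is the right substitute depends on what a missing row means in your data.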


<p id ='baf'><p>
### Broadcasting in arithmetic formulas

In [42]:
weather = pd.read_csv('./pittsburgh2013.csv', index_col='Date', parse_dates=True)
weather.head()

Unnamed: 0_level_0,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,Mean Dew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,Max Sea Level PressureIn,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2013-01-01,32,28,21,30,27,16,100,89,77,30.1,...,10,6,2,10,8,,0.0,8,Snow,277
2013-01-02,25,21,17,14,12,10,77,67,55,30.27,...,10,10,10,14,5,,0.0,4,,272
2013-01-03,32,24,16,19,15,9,77,67,56,30.25,...,10,10,10,17,8,26.0,0.0,3,,229
2013-01-04,30,28,27,21,19,17,75,68,59,30.28,...,10,10,6,23,16,32.0,0.0,4,,250
2013-01-05,34,30,25,23,20,16,75,68,61,30.42,...,10,10,10,16,10,23.0,0.21,5,,221


In [43]:
# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]
temps_f.head()

Unnamed: 0_level_0,Min TemperatureF,Mean TemperatureF,Max TemperatureF
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2013-01-01,21,28,32
2013-01-02,17,21,25
2013-01-03,16,24,32
2013-01-04,27,28,30
2013-01-05,25,30,34


In [44]:
# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9
temps_c.columns = temps_c.columns.str.replace('F', 'C')
temps_c.head()

Unnamed: 0_level_0,Min TemperatureC,Mean TemperatureC,Max TemperatureC
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2013-01-01,-6.111111,-2.222222,0.0
2013-01-02,-8.333333,-6.111111,-3.888889
2013-01-03,-8.888889,-4.444444,0.0
2013-01-04,-2.777778,-2.222222,-1.111111
2013-01-05,-3.888889,-1.111111,1.111111


<p id ='cpg'><p>
### Computing percentage growth of GDP
Your job in this exercise is to compute the yearly percent-change of US GDP (Gross Domestic Product) since 2008.



In [45]:
cd ..

/Users/satyammishra/Desktop/Datacamp stuff/jupyternotes


In [46]:
gdp = pd.read_csv('./data/GDP/gdp_usa.csv', parse_dates = True, index_col ='DATE')
gdp.head()

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
1947-01-01,243.1
1947-04-01,246.3
1947-07-01,250.1
1947-10-01,260.3
1948-01-01,266.2


In [47]:
post2008 = gdp.loc['2008':]
post2008.tail(8)

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
2014-07-01,17569.4
2014-10-01,17692.2
2015-01-01,17783.6
2015-04-01,17998.3
2015-07-01,18141.9
2015-10-01,18222.8
2016-01-01,18281.6
2016-04-01,18436.5


In [48]:
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
yearly

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
2008-12-31,14549.9
2009-12-31,14566.5
2010-12-31,15230.2
2011-12-31,15785.3
2012-12-31,16297.3
2013-12-31,16999.9
2014-12-31,17692.2
2015-12-31,18222.8
2016-12-31,18436.5


In [49]:
# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly['VALUE'].pct_change() * 100
yearly

Unnamed: 0_level_0,VALUE,growth
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1
2008-12-31,14549.9,
2009-12-31,14566.5,0.11409
2010-12-31,15230.2,4.556345
2011-12-31,15785.3,3.644732
2012-12-31,16297.3,3.243524
2013-12-31,16999.9,4.311144
2014-12-31,17692.2,4.072377
2015-12-31,18222.8,2.999062
2016-12-31,18436.5,1.172707
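For reference, `pct_change()` computes (current - previous) / previous for each row. The sketch below reproduces the first two growth figures by hand from the year-end values above, using plain integer years as the index for brevity.

```python
import pandas as pd

yearly = pd.Series([14549.9, 14566.5, 15230.2], index=[2008, 2009, 2010])

# pct_change() is (current - previous) / previous; shift(1) exposes "previous".
manual = (yearly - yearly.shift(1)) / yearly.shift(1) * 100
builtin = yearly.pct_change() * 100

print(manual.round(6))
```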


<p id ='ccs'><p>
### Converting currency of stocks
    
Stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance. Two files are provided: sp500.csv holds the index prices and exchange.csv holds the exchange rates.

Using the daily exchange rate to Pounds Sterling, your task is to convert the prices in both the Open and Close columns from dollars to pounds.

In [50]:
sp500 = pd.read_csv('./data/sp500.csv')
exchange = pd.read_csv('./data/exchange.csv')


In [51]:
sp500.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-01-02,2058.899902,2072.360107,2046.040039,2058.199951,2708700000,2058.199951
1,2015-01-05,2054.439941,2054.439941,2017.339966,2020.579956,3799120000,2020.579956
2,2015-01-06,2022.150024,2030.25,1992.439941,2002.609985,4460110000,2002.609985
3,2015-01-07,2005.550049,2029.609985,2005.550049,2025.900024,3805480000,2025.900024
4,2015-01-08,2030.609985,2064.080078,2030.609985,2062.139893,3934010000,2062.139893


In [52]:
exchange.head()

Unnamed: 0,Date,GBP/USD
0,2015/01/02,0.65101
1,2015/01/05,0.65644
2,2015/01/06,0.65896
3,2015/01/07,0.66344
4,2015/01/08,0.66151


In [53]:
# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]
dollars.head()

Unnamed: 0,Open,Close
0,2058.899902,2058.199951
1,2054.439941,2020.579956
2,2022.150024,2002.609985
3,2005.550049,2025.900024
4,2030.609985,2062.139893


In [54]:
# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'], axis = 'rows')
pounds.head()

Unnamed: 0,Open,Close
0,1340.364425,1339.90875
1,1348.616555,1326.389506
2,1332.51598,1319.639876
3,1330.562125,1344.063112
4,1343.268811,1364.126161
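The multiplication above lines the two frames up by their default integer index, so it is correct only if both files contain the same dates in the same order. A hedged alternative is to index both by date first. The sketch uses two rows from the outputs above, with the exchange rows reversed on purpose; parsing with `pd.to_datetime` also reconciles the `2015-01-02` and `2015/01/02` spellings.

```python
import pandas as pd

sp500 = pd.DataFrame({'Date': ['2015-01-02', '2015-01-05'],
                      'Open': [2058.899902, 2054.439941]})
exchange = pd.DataFrame({'Date': ['2015/01/05', '2015/01/02'],   # reversed on purpose
                         'GBP/USD': [0.65644, 0.65101]})

# Parse both Date columns so the two spellings become the same label.
dollars = sp500.assign(Date=pd.to_datetime(sp500['Date'])).set_index('Date')
rate = exchange.assign(Date=pd.to_datetime(exchange['Date'])).set_index('Date')['GBP/USD']

# Rows are matched by date label, not by position.
pounds = dollars.multiply(rate, axis='rows')

print(pounds)
```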


<p id ='cd'><p>
## Concatenating Data

<p id ='A&cS'><p>
## Appending & concatenating Series

<p id ='ASwnI'><p>
### Appending Series with nonunique Indices

In [55]:
bronze_top=bronze.sort_values(by= 'Total', ascending=False)[:5]
bronze_top = bronze_top.set_index('Country')
bronze_top.head()

Unnamed: 0_level_0,NOC,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
United States,USA,1052.0
Soviet Union,URS,584.0
United Kingdom,GBR,505.0
France,FRA,475.0
Germany,GER,454.0


In [56]:
silver_top = silver.sort_values(by = 'Total', ascending = False)[:5]
silver_top = silver_top.set_index('Country')
silver_top.head()

Unnamed: 0_level_0,NOC,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
United States,USA,1195.0
Soviet Union,URS,627.0
United Kingdom,GBR,591.0
France,FRA,461.0
Italy,ITA,394.0


In [57]:
combined = bronze_top.append(silver_top)
combined.shape
combined.loc['United States']

Unnamed: 0_level_0,NOC,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
United States,USA,1052.0
United States,USA,1195.0


In [58]:
type(combined)

pandas.core.frame.DataFrame
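`DataFrame.append` and `Series.append` were deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the `.append()` cells in this notebook raise an `AttributeError`. `pd.concat` gives the same row-wise result. A minimal sketch with two rows from each table above:

```python
import pandas as pd

bronze_top = pd.DataFrame({'Total': [1052.0, 454.0]}, index=['United States', 'Germany'])
silver_top = pd.DataFrame({'Total': [1195.0, 394.0]}, index=['United States', 'Italy'])

# Row-wise concatenation keeps both 'United States' rows, like .append() did.
combined = pd.concat([bronze_top, silver_top])

print(combined.index.is_unique)           # False
print(combined.loc['United States'])
```

Chained calls such as `a.append(b).append(c)` translate to `pd.concat([a, b, c])`.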

<p id ='ApS'><p>
### Appending pandas Series

In [59]:
# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv('./data/Sales-3/sales-jan-2015.csv', index_col='Date', parse_dates=True)

# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv('./data/Sales-3/sales-feb-2015.csv', index_col='Date', parse_dates=True)

# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv('./data/Sales-3/sales-mar-2015.csv', index_col='Date', parse_dates=True)

In [60]:
jan.head()

Unnamed: 0_level_0,Company,Product,Units
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-01-21 19:13:21,Streeplex,Hardware,11
2015-01-09 05:23:51,Streeplex,Service,8
2015-01-06 17:19:34,Initech,Hardware,17
2015-01-02 09:51:06,Hooli,Hardware,16
2015-01-11 14:51:02,Hooli,Hardware,11


In [61]:
mar.head()

Unnamed: 0_level_0,Company,Product,Units
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-03-22 14:42:25,Mediacore,Software,6
2015-03-12 18:33:06,Initech,Service,19
2015-03-22 03:58:28,Streeplex,Software,8
2015-03-15 00:53:12,Hooli,Hardware,19
2015-03-17 19:25:37,Hooli,Hardware,10


In [62]:
# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

mar_units.shape

(20,)

In [63]:
# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)
quarter1.shape

(60,)

In [64]:
# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])



Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64


In [65]:
# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])


Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64


In [66]:
# Compute & print total sales in quarter1
print(quarter1.sum())

642


<p id ='CpSara'><p>
### Concatenating pandas Series along row axis

In [67]:
quarter1 = pd.concat([month.Units for month in [jan, feb, mar]])
quarter1.shape

(60,)

In [68]:
# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64


In [69]:
type(quarter1)

pandas.core.series.Series

<p id ='A&cD'><p>
## Appending & concatenating DataFrames

<p id ='ADwi'><p>
### Appending DataFrames with ignore_index

In [70]:
names_1881 = pd.read_csv('./data/baby/names1881.csv', header=None, names=['name', 'gender', 'count'])
names_1981 = pd.read_csv('./data/baby/names1981.csv',header= None, names=['name', 'gender', 'count'])
print(names_1881.shape)


(1935, 3)


In [71]:
names_1881.head()

Unnamed: 0,name,gender,count
0,Mary,F,6919
1,Anna,F,2698
2,Emma,F,2034
3,Elizabeth,F,1852
4,Margaret,F,1658


In [72]:
names_1981.head()

Unnamed: 0,name,gender,count
0,Jennifer,F,57032
1,Jessica,F,42519
2,Amanda,F,34370
3,Sarah,F,28162
4,Melissa,F,28003


In [73]:
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981


In [74]:
# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index=True)
# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

(19455, 4)
(1935, 4)
(21390, 4)


In [75]:
# Print all rows that contain the name 'Morgan'
print(combined_names.loc[combined_names['name']=='Morgan'])

         name gender  count  year
1283   Morgan      M     23  1881
2096   Morgan      F   1769  1981
14390  Morgan      M    766  1981
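The effect of `ignore_index=True` is easiest to see side by side. This sketch uses `pd.concat`, which accepts the same argument, and two rows per year taken from the outputs above.

```python
import pandas as pd

names_1881 = pd.DataFrame({'name': ['Mary', 'Morgan'], 'gender': ['F', 'M'],
                           'count': [6919, 23], 'year': 1881})
names_1981 = pd.DataFrame({'name': ['Jennifer', 'Morgan'], 'gender': ['F', 'F'],
                           'count': [57032, 1769], 'year': 1981})

# Without ignore_index the original labels repeat: 0, 1, 0, 1.
kept = pd.concat([names_1881, names_1981])
# With ignore_index the result gets a fresh 0..n-1 index.
renumbered = pd.concat([names_1881, names_1981], ignore_index=True)

print(list(kept.index), list(renumbered.index))
```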


<p id ='CpDaca'><p>
### Concatenating pandas DataFrames along column axis

In [76]:
weather_max = pd.DataFrame.from_dict({
'Month': ['Jan','Apr','Jul','Oct'],
'MaxTempF':['68','89','91','84']
})
weather_max


Unnamed: 0,Month,MaxTempF
0,Jan,68
1,Apr,89
2,Jul,91
3,Oct,84


In [77]:
weather_mean = pd.DataFrame.from_dict({'Month':['Apr','Aug','Dec','Feb','Jan','Jul','Jun','Mar','May','Nov','Oct','Sep'],
'MeanTempF':[53.100000,70.000000,34.935484,28.714286,32.354839,72.870968,70.133333,35.000000,62.612903,39.800000,55.451613,63.766667] 
})
weather_mean

Unnamed: 0,Month,MeanTempF
0,Apr,53.1
1,Aug,70.0
2,Dec,34.935484
3,Feb,28.714286
4,Jan,32.354839
5,Jul,72.870968
6,Jun,70.133333
7,Mar,35.0
8,May,62.612903
9,Nov,39.8


In [78]:
weather = pd.concat([weather_mean, weather_max], axis='columns')
weather

Unnamed: 0,Month,MeanTempF,Month.1,MaxTempF
0,Apr,53.1,Jan,68.0
1,Aug,70.0,Apr,89.0
2,Dec,34.935484,Jul,91.0
3,Feb,28.714286,Oct,84.0
4,Jan,32.354839,,
5,Jul,72.870968,,
6,Jun,70.133333,,
7,Mar,35.0,,
8,May,62.612903,,
9,Nov,39.8,,
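In the result above, April's mean temperature sits next to January's maximum, because both frames carry the default integer index and `concat` pairs rows by index label. Setting `Month` as the index first makes the rows line up by month. A sketch with a few of the values above (the maxima are written as integers here rather than strings):

```python
import pandas as pd

weather_max = pd.DataFrame({'Month': ['Jan', 'Apr'], 'MaxTempF': [68, 89]})
weather_mean = pd.DataFrame({'Month': ['Apr', 'Feb', 'Jan'],
                             'MeanTempF': [53.1, 28.714286, 32.354839]})

# Index both frames by Month so concat pairs rows by label, not by position.
weather = pd.concat(
    [weather_mean.set_index('Month'), weather_max.set_index('Month')],
    axis='columns',
)

print(weather)
```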


<p id ='RmftbaD'><p>
### Reading multiple files to build a DataFrame

In [79]:
medal_types = ['bronze', 'silver', 'gold']
medals = []

In [80]:
for medal in medal_types:

    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal
    
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv('./data/olympic/'+file_name, header=0, index_col='Country', names=columns)

    # Append medal_df to medals
    medals.append(medal_df)

In [81]:
medals[0]

Unnamed: 0_level_0,bronze
Country,Unnamed: 1_level_1
United States,1052.0
Soviet Union,584.0
United Kingdom,505.0
France,475.0
Germany,454.0


In [82]:
medals[1]

Unnamed: 0_level_0,silver
Country,Unnamed: 1_level_1
United States,1195.0
Soviet Union,627.0
United Kingdom,591.0
France,461.0
Italy,394.0


In [83]:
medals1 = pd.concat(medals, axis='columns')




In [84]:
medals1

Unnamed: 0,bronze,silver,gold
France,475.0,461.0,
Germany,454.0,,407.0
Italy,,394.0,460.0
Soviet Union,584.0,627.0,838.0
United Kingdom,505.0,591.0,498.0
United States,1052.0,1195.0,2088.0
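Older pandas releases sorted the row labels when the frames in a column-wise concat had different indexes, and warned that this default would change. Passing `sort` explicitly keeps the result the same across versions. A minimal sketch, assuming you want the alphabetical order shown above:

```python
import pandas as pd

bronze = pd.DataFrame({'bronze': [1052.0, 454.0]}, index=['United States', 'Germany'])
silver = pd.DataFrame({'silver': [1195.0, 394.0]}, index=['United States', 'Italy'])

# sort=True asks for the union of the row labels in sorted order.
medals_sorted_rows = pd.concat([bronze, silver], axis='columns', sort=True)

print(medals_sorted_rows)
```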




<p id ='Ck&M'><p>
## Concatenation, keys, & MultiIndexes

<p id ='CvtgMr'><p>
### Concatenating vertically to get MultiIndexed rows

In [85]:
medals = []

In [86]:
medals

[]

In [87]:
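for medal in medal_types:

    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal

    # Read file_name into a DataFrame: medal_df
    # Note: `columns` is not rebuilt in this loop, so it still holds
    # ['Country', 'gold'] from the previous loop and every frame's value
    # column is labelled 'gold'.
    medal_df = pd.read_csv('./data/olympic/'+file_name, header=0, index_col='Country', names=columns)

    # Append medal_df to medals
    medals.append(medal_df)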

In [88]:
medals[0]

Unnamed: 0_level_0,gold
Country,Unnamed: 1_level_1
United States,1052.0
Soviet Union,584.0
United Kingdom,505.0
France,475.0
Germany,454.0


In [89]:
medals = pd.concat(medals, keys = ['bronze', 'silver', 'gold'])

In [90]:
medals

Unnamed: 0_level_0,Unnamed: 1_level_0,gold
Unnamed: 0_level_1,Country,Unnamed: 2_level_1
bronze,United States,1052.0
bronze,Soviet Union,584.0
bronze,United Kingdom,505.0
bronze,France,475.0
bronze,Germany,454.0
silver,United States,1195.0
silver,Soviet Union,627.0
silver,United Kingdom,591.0
silver,France,461.0
silver,Italy,394.0


<p id ='SMD'><p>
### Slicing MultiIndexed DataFrames

In [91]:
medals_sorted = medals.sort_index(level =0)
medals_sorted

Unnamed: 0_level_0,Unnamed: 1_level_0,gold
Unnamed: 0_level_1,Country,Unnamed: 2_level_1
bronze,France,475.0
bronze,Germany,454.0
bronze,Soviet Union,584.0
bronze,United Kingdom,505.0
bronze,United States,1052.0
gold,Germany,407.0
gold,Italy,460.0
gold,Soviet Union,838.0
gold,United Kingdom,498.0
gold,United States,2088.0


In [92]:
# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])

gold    454.0
Name: (bronze, Germany), dtype: float64


In [93]:
# Print data about silver medals
print(medals_sorted.loc['silver'])

                  gold
Country               
France           461.0
Italy            394.0
Soviet Union     627.0
United Kingdom   591.0
United States   1195.0


In [94]:
# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[(slice(None), 'United Kingdom'), :])

                        gold
       Country              
bronze United Kingdom  505.0
gold   United Kingdom  498.0
silver United Kingdom  591.0


In [95]:
pd.IndexSlice[:,'United Kingdom']

(slice(None, None, None), 'United Kingdom')
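`pd.IndexSlice` is a more readable way to write the tuple of slices that `.loc` expects. The sketch below builds a small two-level frame with `keys=` and selects one country across all medal types. It also shows `.xs()`, which returns the same rows with the country level dropped. The frames are cut-down versions of the tables above.

```python
import pandas as pd

bronze = pd.DataFrame({'Total': [505.0, 454.0]}, index=['United Kingdom', 'Germany'])
silver = pd.DataFrame({'Total': [591.0, 394.0]}, index=['United Kingdom', 'Italy'])

# Sort the MultiIndex so that slicing on the inner level works.
medals = pd.concat([bronze, silver], keys=['bronze', 'silver']).sort_index()

idx = pd.IndexSlice
uk = medals.loc[idx[:, 'United Kingdom'], :]    # every medal type, one country
uk_xs = medals.xs('United Kingdom', level=1)    # same rows, country level dropped

print(uk)
```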

<p id ='ChtgMc'><p>
### Concatenating horizontally to get MultiIndexed columns

In [96]:
hardware = pd.read_csv('./data/Sales-3/feb-sales-Hardware.csv')
hardware

Unnamed: 0,Date,Company,Product,Units
0,2015-02-04 21:52:45,Acme Coporation,Hardware,14
1,2015-02-07 22:58:10,Acme Coporation,Hardware,1
2,2015-02-19 10:59:33,Mediacore,Hardware,16
3,2015-02-02 20:54:49,Mediacore,Hardware,9
4,2015-02-21 20:41:47,Hooli,Hardware,3


In [97]:
software = pd.read_csv('./data/Sales-3/feb-sales-Software.csv')
software

Unnamed: 0,Date,Company,Product,Units
0,2015-02-16 12:09:19,Hooli,Software,10
1,2015-02-03 14:14:18,Initech,Software,13
2,2015-02-02 08:33:01,Hooli,Software,3
3,2015-02-05 01:53:06,Acme Coporation,Software,19
4,2015-02-11 20:03:08,Initech,Software,7
5,2015-02-09 13:09:55,Mediacore,Software,7
6,2015-02-11 22:50:44,Hooli,Software,4
7,2015-02-04 15:36:29,Streeplex,Software,13
8,2015-02-21 05:01:26,Mediacore,Software,3


In [98]:
services = pd.read_csv('./data/Sales-3/feb-sales-Service.csv')
services

Unnamed: 0,Date,Company,Product,Units
0,2015-02-26 08:57:45,Streeplex,Service,4
1,2015-02-25 00:29:00,Initech,Service,10
2,2015-02-09 08:57:30,Streeplex,Service,19
3,2015-02-26 08:58:51,Streeplex,Service,1
4,2015-02-05 22:05:03,Hooli,Service,10
5,2015-02-19 16:02:58,Mediacore,Service,10


In [99]:
dataframes = [hardware, software, services]

In [101]:
# Concatenate dataframes: february
february = pd.concat(dataframes, keys = ['hardware', 'software', 'services'], axis = 1)
february

Unnamed: 0_level_0,hardware,hardware,hardware,hardware,software,software,software,software,services,services,services,services
Unnamed: 0_level_1,Date,Company,Product,Units,Date,Company,Product,Units,Date,Company,Product,Units
0,2015-02-04 21:52:45,Acme Coporation,Hardware,14.0,2015-02-16 12:09:19,Hooli,Software,10,2015-02-26 08:57:45,Streeplex,Service,4.0
1,2015-02-07 22:58:10,Acme Coporation,Hardware,1.0,2015-02-03 14:14:18,Initech,Software,13,2015-02-25 00:29:00,Initech,Service,10.0
2,2015-02-19 10:59:33,Mediacore,Hardware,16.0,2015-02-02 08:33:01,Hooli,Software,3,2015-02-09 08:57:30,Streeplex,Service,19.0
3,2015-02-02 20:54:49,Mediacore,Hardware,9.0,2015-02-05 01:53:06,Acme Coporation,Software,19,2015-02-26 08:58:51,Streeplex,Service,1.0
4,2015-02-21 20:41:47,Hooli,Hardware,3.0,2015-02-11 20:03:08,Initech,Software,7,2015-02-05 22:05:03,Hooli,Service,10.0
5,,,,,2015-02-09 13:09:55,Mediacore,Software,7,2015-02-19 16:02:58,Mediacore,Service,10.0
6,,,,,2015-02-11 22:50:44,Hooli,Software,4,,,,
7,,,,,2015-02-04 15:36:29,Streeplex,Software,13,,,,
8,,,,,2015-02-21 05:01:26,Mediacore,Software,3,,,,


In [102]:
february.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 12 columns):
(hardware, Date)       5 non-null object
(hardware, Company)    5 non-null object
(hardware, Product)    5 non-null object
(hardware, Units)      5 non-null float64
(software, Date)       9 non-null object
(software, Company)    9 non-null object
(software, Product)    9 non-null object
(software, Units)      9 non-null int64
(services, Date)       6 non-null object
(services, Company)    6 non-null object
(services, Product)    6 non-null object
(services, Units)      6 non-null float64
dtypes: float64(2), int64(1), object(9)
memory usage: 944.0+ bytes


<p id ='CDfad'><p>
### Concatenating DataFrames from a dict
    
Your task is to aggregate the sum of all sales over the 'Company' column into a single DataFrame. You'll do this by constructing a dictionary of these DataFrames and then concatenating them.

In [104]:
month_list = [('january', jan), ('february', feb), ('march', mar)]

In [107]:
month_dict = {}

In [111]:
for month_name, month_data in month_list:
    # Sum only the numeric 'Units' column for each company
    month_dict[month_name] = month_data.groupby('Company')[['Units']].sum()

In [113]:
print(month_dict)

{'january':                  Units
Company               
Acme Coporation     76
Hooli               70
Initech             37
Mediacore           15
Streeplex           50, 'february':                  Units
Company               
Acme Coporation     34
Hooli               30
Initech             30
Mediacore           45
Streeplex           37, 'march':                  Units
Company               
Acme Coporation      5
Hooli               37
Initech             68
Mediacore           68
Streeplex           40}


In [114]:
# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)


In [115]:
sales

Unnamed: 0_level_0,Unnamed: 1_level_0,Units
Unnamed: 0_level_1,Company,Unnamed: 2_level_1
february,Acme Coporation,34
february,Hooli,30
february,Initech,30
february,Mediacore,45
february,Streeplex,37
january,Acme Coporation,76
january,Hooli,70
january,Initech,37
january,Mediacore,15
january,Streeplex,50


In [116]:
sales.loc[(slice(None), 'Mediacore'), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,Units
Unnamed: 0_level_1,Company,Unnamed: 2_level_1
february,Mediacore,45
january,Mediacore,15
march,Mediacore,68


In [117]:
sales.loc[pd.IndexSlice[:, 'Mediacore'], :]

Unnamed: 0_level_0,Unnamed: 1_level_0,Units
Unnamed: 0_level_1,Company,Unnamed: 2_level_1
february,Mediacore,45
january,Mediacore,15
march,Mediacore,68
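When `pd.concat` receives a dict, the dict keys become the outer index level, so no `keys=` argument is needed. The order of the outer level can depend on the pandas version (the output above shows it sorted), so the sketch calls `sort_index()` to make the order explicit. Values are taken from the monthly totals above.

```python
import pandas as pd

month_dict = {
    'january': pd.DataFrame({'Units': [70, 15]}, index=['Hooli', 'Mediacore']),
    'february': pd.DataFrame({'Units': [30, 45]}, index=['Hooli', 'Mediacore']),
}

# Dict keys become the outer index level of the result.
sales = pd.concat(month_dict).sort_index()

mediacore = sales.loc[(slice(None), 'Mediacore'), :]
print(mediacore)
```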


<p id ='O&ij'><p>
## Outer & inner joins

<p id ='CDwij'><p>
### Concatenating DataFrames with inner join

In [145]:
gold_top = pd.read_csv('./data/olympic/gold_top5.csv', index_col='Country')

In [146]:
medal_list = [bronze_top, silver_top, gold_top]

In [132]:
del bronze_top['NOC']


In [133]:
del silver_top['NOC']

In [148]:
silver_top

Unnamed: 0_level_0,Total
Country,Unnamed: 1_level_1
United States,1195.0
Soviet Union,627.0
United Kingdom,591.0
France,461.0
Italy,394.0


In [149]:
bronze_top

Unnamed: 0_level_0,Total
Country,Unnamed: 1_level_1
United States,1052.0
Soviet Union,584.0
United Kingdom,505.0
France,475.0
Germany,454.0


In [150]:
gold_top

Unnamed: 0_level_0,Total
Country,Unnamed: 1_level_1
United States,2088.0
Soviet Union,838.0
United Kingdom,498.0
Italy,460.0
Germany,407.0


In [151]:
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'], axis=1, join='inner')
medals

Unnamed: 0_level_0,bronze,silver,gold
Unnamed: 0_level_1,Total,Total,Total
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
United States,1052.0,1195.0,2088.0
Soviet Union,584.0,627.0,838.0
United Kingdom,505.0,591.0,498.0


<p id ='R&cDwij'><p>
### Resampling & concatenating DataFrames with inner join
    
In this exercise, we'll compare the historical 10-year GDP (Gross Domestic Product) growth in the US and in China. The data for the US starts in 1947 and is recorded quarterly; by contrast, the data for China starts in 1960 and is recorded annually.



In [201]:
china = pd.read_csv('./data/GDP/gdp_china.csv', parse_dates=True, index_col = 'Year')
us = pd.read_csv('./data/GDP/gdp_usa.csv', parse_dates=True, index_col='DATE' )


In [202]:
more ./data/GDP/gdp_usa.csv

In [210]:
china.head()

Unnamed: 0_level_0,China
Year,Unnamed: 1_level_1
1960-01-01,59.184116
1961-01-01,49.55705
1962-01-01,46.685179
1963-01-01,50.097303
1964-01-01,59.062255


In [207]:
china.columns = ['China']

In [214]:
us.head()

Unnamed: 0_level_0,US
Year,Unnamed: 1_level_1
1947-01-01,243.1
1947-04-01,246.3
1947-07-01,250.1
1947-10-01,260.3
1948-01-01,266.2


In [213]:
us.index.name = 'Year'

In [208]:
us.columns = ['US']

In [215]:
print(china.shape)
print(us.shape)

(56, 1)
(278, 1)


In [216]:
china_annual = china.resample('A').last().pct_change(10).dropna()
china_annual.shape

(46, 1)

In [217]:
china_annual.head()

Unnamed: 0_level_0,China
Year,Unnamed: 1_level_1
1970-12-31,0.546128
1971-12-31,0.98886
1972-12-31,1.402472
1973-12-31,1.730085
1974-12-31,1.408556


In [218]:
us_annual = us.resample('A').last().pct_change(10).dropna()
us_annual.shape

(60, 1)

In [219]:
us_annual.head()

Unnamed: 0_level_0,US
Year,Unnamed: 1_level_1
1957-12-31,0.827507
1958-12-31,0.782686
1959-12-31,0.953137
1960-12-31,0.689354
1961-12-31,0.630959


In [220]:
gdp = pd.concat([us_annual, china_annual], axis = 1, join = 'inner')
gdp.head()

Unnamed: 0_level_0,US,China
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
1970-12-31,1.017187,0.546128
1971-12-31,1.05227,0.98886
1972-12-31,1.172566,1.402472
1973-12-31,1.258858,1.730085
1974-12-31,1.295246,1.408556


In [221]:
gdp.resample('10A').last()

Unnamed: 0_level_0,US,China
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
1970-12-31,1.017187,0.546128
1980-12-31,1.742556,1.072537
1990-12-31,1.012126,0.89282
2000-12-31,0.738632,2.357522
2010-12-31,0.454332,4.011081
2020-12-31,0.36178,3.789936
