# Concatenating and merging data

## Concatenation

<div class="alert alert-info">
<h3> Your turn</h3>
<ol>
    <li>Create a new <TT>Series</TT> with observations <TT>['C1', 'C2']</TT>.</li>
    <li>Using the previously created <TT>Series</TT> <TT>a</TT> and <TT>b</TT>, concatenate all three objects along the row axis and create a new (unique) index.</li>
    <li>Repeat the previous step, but now concatenate along the column axis. Assign the column names <TT>'Column1'</TT>, <TT>'Column2'</TT>, and <TT>'Column3'</TT>.</li>
</ol>
</div>

### Solution

In [2]:
import pandas as pd 

# Recreate Series a, b:
# Create first series of 3 observations
a = pd.Series(['A1', 'A2', 'A3'])
# Create second series with 5 observations
b = pd.Series([f'B{i}' for i in range(5)])

In [3]:
# Create Series c
c = pd.Series(['C1', 'C2'])

In [4]:
# Concatenate Series a, b, c and reset the index
s = pd.concat((a, b, c)).reset_index(drop=True)
s

0    A1
1    A2
2    A3
3    B0
4    B1
5    B2
6    B3
7    B4
8    C1
9    C2
dtype: object
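
As a possible alternative to calling `reset_index(drop=True)` afterwards, `pd.concat()` accepts the argument `ignore_index=True`, which discards the original labels during concatenation. The following sketch recreates the three `Series` and compares both approaches; the names `s_alt` and `s_ref` are only used for this illustration.

```python
import pandas as pd

a = pd.Series(['A1', 'A2', 'A3'])
b = pd.Series([f'B{i}' for i in range(5)])
c = pd.Series(['C1', 'C2'])

# ignore_index=True drops the original labels and assigns 0, 1, ..., n-1
s_alt = pd.concat((a, b, c), ignore_index=True)

# Reference: concatenate first, then reset the index
s_ref = pd.concat((a, b, c)).reset_index(drop=True)

# Check whether both approaches give the same Series
print(s_alt.equals(s_ref))
```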

In [5]:
# Concatenate along the column axis; keys become the column names
s = pd.concat((a, b, c), axis=1, keys=['Column1', 'Column2', 'Column3'])
s

,Column1,Column2,Column3
0,A1,B0,C1
1,A2,B1,C2
2,A3,B2,
3,,B3,
4,,B4,
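
The empty cells in the output above arise because concatenation along the column axis aligns the objects on their index, and the three `Series` have different lengths. A minimal sketch that makes this explicit by counting the missing values per column (the names `wide` and `n_missing` are introduced here for illustration only):

```python
import pandas as pd

a = pd.Series(['A1', 'A2', 'A3'])
b = pd.Series([f'B{i}' for i in range(5)])
c = pd.Series(['C1', 'C2'])

wide = pd.concat((a, b, c), axis=1, keys=['Column1', 'Column2', 'Column3'])

# The result has as many rows as the union of all index labels
print(wide.shape)

# Shorter Series are padded with NaN; count missing values per column
n_missing = wide.isna().sum()
print(n_missing)
```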


<div class="alert alert-info">
<h3> Your turn</h3>
Use the data files located in the folder <TT>../data/FRED</TT> to perform the following tasks:
<ol>
    <li>Load the data in <TT>FRED_monthly_1950.csv</TT> and <TT>FRED_monthly_1960.csv</TT> into two different DataFrames.
        The files contain monthly macroeconomic time series for the 1950s and 1960s, respectively.
        <br/>
        <i>Hint:</i> Use <TT>pd.read_csv(..., parse_dates=['DATE'])</TT> to automatically parse strings stored in the <TT>DATE</TT> column as dates.
        </li>
    <li>Concatenate these DataFrames along the row dimension to get a total of 240 observations.</li>
    <li>Set the column <TT>DATE</TT> as index for the newly created DataFrame.</li>
</ol>
</div>

### Solution

#### Part (1)

In [83]:
# Path to data folder
DATA_PATH = '../data/FRED'

In [84]:
import pandas as pd

# Load data from the 1950s
df1 = pd.read_csv(f'{DATA_PATH}/FRED_monthly_1950.csv', parse_dates=['DATE'])
df1.head(5)

,DATE,CPI,UNRATE,FEDFUNDS,REALRATE,LFPART
0,1950-01-01,23.5,6.5,,,58.9
1,1950-02-01,23.6,6.4,,,58.9
2,1950-03-01,23.6,6.3,,,58.8
3,1950-04-01,23.6,5.8,,,59.2
4,1950-05-01,23.8,5.5,,,59.1


In [85]:
# Load data from the 1960s
df2 = pd.read_csv(f'{DATA_PATH}/FRED_monthly_1960.csv', parse_dates=['DATE'])
df2.head(5)

,DATE,CPI,UNRATE,FEDFUNDS,REALRATE,LFPART
0,1960-01-01,29.4,5.2,4.0,,59.1
1,1960-02-01,29.4,4.8,4.0,,59.1
2,1960-03-01,29.4,5.4,3.8,,58.5
3,1960-04-01,29.5,5.2,3.9,,59.5
4,1960-05-01,29.6,5.1,3.8,,59.5


#### Part (2)

In [86]:
# Concatenate data sets along the first dimension (rows)
df = pd.concat((df1, df2), axis=0)

In [87]:
# First half contains data from the 1950s
df.head(5)

,DATE,CPI,UNRATE,FEDFUNDS,REALRATE,LFPART
0,1950-01-01,23.5,6.5,,,58.9
1,1950-02-01,23.6,6.4,,,58.9
2,1950-03-01,23.6,6.3,,,58.8
3,1950-04-01,23.6,5.8,,,59.2
4,1950-05-01,23.8,5.5,,,59.1


In [88]:
# Second half contains data from the 1960s
df.tail(5)

,DATE,CPI,UNRATE,FEDFUNDS,REALRATE,LFPART
115,1969-08-01,36.9,3.5,9.2,,60.3
116,1969-09-01,37.1,3.7,9.2,,60.3
117,1969-10-01,37.3,3.7,9.0,,60.4
118,1969-11-01,37.5,3.5,8.8,,60.2
119,1969-12-01,37.7,3.5,9.0,,60.2
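
To check the row count without access to the data files, the concatenation can be sketched with made-up data. The frames `d1` and `d2` below are hypothetical stand-ins for the two decade files, with constant placeholder values. The second frame lists its columns in a different order to illustrate that `pd.concat()` matches columns by name, not by position.

```python
import pandas as pd

# Hypothetical stand-ins for the 1950s and 1960s files (values are made up)
dates_50 = pd.date_range('1950-01-01', '1959-12-01', freq='MS')
dates_60 = pd.date_range('1960-01-01', '1969-12-01', freq='MS')
d1 = pd.DataFrame({'DATE': dates_50, 'CPI': 1.0, 'UNRATE': 2.0})
d2 = pd.DataFrame({'DATE': dates_60, 'UNRATE': 3.0, 'CPI': 4.0})

# Stack the two decades on top of each other
stacked = pd.concat((d1, d2), axis=0)

# 120 months per decade
print(len(stacked))

# Columns are aligned by name even though d2 uses a different order
print(stacked.tail(3))
```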


#### Part (3)

Note that the index of the newly created `DataFrame` is not unique:

In [89]:
# Select rows at index 0: returns 2 (!) different rows
df.loc[0]

,DATE,CPI,UNRATE,FEDFUNDS,REALRATE,LFPART
0,1950-01-01,23.5,6.5,,,58.9
0,1960-01-01,29.4,5.2,4.0,,59.1


In [90]:
# Set DATE as the new (unique!) index
df = df.set_index('DATE')
df.head(10)

DATE,CPI,UNRATE,FEDFUNDS,REALRATE,LFPART
1950-01-01,23.5,6.5,,,58.9
1950-02-01,23.6,6.4,,,58.9
1950-03-01,23.6,6.3,,,58.8
1950-04-01,23.6,5.8,,,59.2
1950-05-01,23.8,5.5,,,59.1
1950-06-01,23.9,5.4,,,59.4
1950-07-01,24.1,5.0,,,59.1
1950-08-01,24.2,4.5,,,59.5
1950-09-01,24.3,4.4,,,59.2
1950-10-01,24.5,4.2,,,59.4
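
Duplicate index labels can also be caught at concatenation time. The sketch below uses the `verify_integrity=True` argument of `pd.concat()`, which raises a `ValueError` if the resulting index contains duplicates. The two small frames reuse the first CPI values shown above; the names `d1`, `d2`, `overlap_detected`, and `by_date` are hypothetical.

```python
import pandas as pd

d1 = pd.DataFrame({
    'DATE': pd.to_datetime(['1950-01-01', '1950-02-01']),
    'CPI': [23.5, 23.6],
})
d2 = pd.DataFrame({
    'DATE': pd.to_datetime(['1960-01-01', '1960-02-01']),
    'CPI': [29.4, 29.4],
})

# With the default integer index, labels 0 and 1 occur in both frames
try:
    pd.concat((d1, d2), verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

# With DATE as index the labels are distinct, so the check passes
by_date = pd.concat(
    (d1.set_index('DATE'), d2.set_index('DATE')),
    verify_integrity=True,
)

print(overlap_detected)
print(by_date.index.is_unique)
```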


***
## Merging and joining data sets

<div class="alert alert-info">
<h3> Your turn</h3>
Use the data files located in the folder <TT>../data/FRED</TT> to perform the following tasks:
<ol>
    <li>Load the data in <TT>CPI.csv</TT> and <TT>GDP.csv</TT> into two different DataFrames.
        The files contain monthly data for the Consumer Price Index (CPI) and quarterly data for GDP, respectively.
        <br/>
        <i>Hint:</i> Use <TT>pd.read_csv(..., parse_dates=['DATE'])</TT> to automatically parse strings stored in the <TT>DATE</TT> column as dates.
        </li>
    <li>Merge the CPI with the GDP time series with 
    <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html"><TT>merge()</TT></a> 
    using a left join (<TT>how='left'</TT>). How many observations does the resulting DataFrame have?</li>
    <li>Merge the CPI with the GDP time series with <TT>merge()</TT> using an inner join (<TT>how='inner'</TT>). How many observations does the resulting DataFrame have,
        and why is this different from the previous case?</li>
</ol>
</div>

### Solution

#### Part (1)

In [91]:
# Path to data folder
DATA_PATH = '../data/FRED'

In [92]:
import pandas as pd 

cpi = pd.read_csv(f'{DATA_PATH}/CPI.csv', parse_dates=['DATE'])
cpi.head(5)

,DATE,CPI
0,1947-01-01,21.5
1,1947-02-01,21.6
2,1947-03-01,22.0
3,1947-04-01,22.0
4,1947-05-01,22.0


In [93]:
gdp = pd.read_csv(f'{DATA_PATH}/GDP.csv', parse_dates=['DATE'])
gdp.head(5)

,DATE,GDP
0,1947-01-01,2182.7
1,1947-04-01,2176.9
2,1947-07-01,2172.4
3,1947-10-01,2206.5
4,1948-01-01,2239.7


#### Part (2)

In [94]:
# Left join: keep every row (date) of the left DataFrame cpi
df = pd.merge(cpi, gdp, on='DATE', how='left')
df.head(10)

,DATE,CPI,GDP
0,1947-01-01,21.5,2182.7
1,1947-02-01,21.6,
2,1947-03-01,22.0,
3,1947-04-01,22.0,2176.9
4,1947-05-01,22.0,
5,1947-06-01,22.1,
6,1947-07-01,22.2,2172.4
7,1947-08-01,22.4,
8,1947-09-01,22.8,
9,1947-10-01,22.9,2206.5


In [95]:
# Number of observations
N = len(df)
print(f'Number of observations with left join: {N:,d}')

Number of observations with left join: 932


#### Part (3)

In [96]:
# Inner join: keep only rows whose date is present in both cpi and gdp
df = pd.merge(cpi, gdp, on='DATE', how='inner')
df.head(10)

,DATE,CPI,GDP
0,1947-01-01,21.5,2182.7
1,1947-04-01,22.0,2176.9
2,1947-07-01,22.2,2172.4
3,1947-10-01,22.9,2206.5
4,1948-01-01,23.7,2239.7
5,1948-04-01,23.8,2276.7
6,1948-07-01,24.4,2289.8
7,1948-10-01,24.3,2292.4
8,1949-01-01,24.0,2260.8
9,1949-04-01,23.9,2253.1


In [97]:
# Number of observations
N = len(df)
print(f'Number of observations with inner join: {N:,d}')

Number of observations with inner join: 310


The inner join keeps only the dates that are present in both `cpi` and `gdp`. Since the GDP data is quarterly, only every third monthly CPI observation has a match, so the merged DataFrame has 310 rows, roughly a third of the 932 rows returned by the left join.
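
The difference between the two join types can be reproduced on a small scale. The sketch below assumes a hypothetical one-year sample with 12 monthly and 4 quarterly observations (the frames `monthly` and `quarterly` and their columns `M` and `Q` are made up) and uses the `indicator=True` argument of `pd.merge()` to record which rows found a match.

```python
import pandas as pd

# Hypothetical sample: 12 monthly and 4 quarterly observations
monthly = pd.DataFrame({
    'DATE': pd.date_range('2000-01-01', periods=12, freq='MS'),
    'M': range(12),
})
quarterly = pd.DataFrame({
    'DATE': pd.date_range('2000-01-01', periods=4, freq='QS'),
    'Q': range(4),
})

# indicator=True adds a column _merge with values left_only / right_only / both
left = pd.merge(monthly, quarterly, on='DATE', how='left', indicator=True)
inner = pd.merge(monthly, quarterly, on='DATE', how='inner')

# Row counts of both merge results
print(len(left), len(inner))

# How many monthly rows found a quarterly match?
print(left['_merge'].value_counts())
```

The rows flagged as `both` in the left join are exactly the rows that survive the inner join.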

<div class="alert alert-info">
<h3> Your turn</h3>
Use the data files located in the folder <TT>../data/FRED</TT> to perform the following tasks:
<ol>
    <li>Load the data in <TT>CPI.csv</TT> and <TT>GDP.csv</TT> into two different DataFrames.
        The files contain monthly data for the Consumer Price Index (CPI) and quarterly data for GDP, respectively.
        <br/>
        <i>Hint:</i> Use <TT>pd.read_csv(..., parse_dates=['DATE'])</TT> to automatically parse strings stored in the <TT>DATE</TT> column as dates.
        </li>
    <li>Set the <TT>DATE</TT> column as the index for each of the two DataFrames.</li>
    <li>Merge the CPI with the GDP time series with 
    <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html"><TT>join()</TT></a>. 
    Do this with both a left and an inner join.</li>
</ol>
</div>

### Solution

#### Part (1)

In [98]:
# Path to data folder
DATA_PATH = '../data/FRED'

In [99]:
import pandas as pd 

cpi = pd.read_csv(f'{DATA_PATH}/CPI.csv', parse_dates=['DATE'])
# Alternatively, we can set the index directly when loading the data
# cpi = pd.read_csv(f'{DATA_PATH}/CPI.csv', parse_dates=['DATE'], index_col='DATE')
cpi.head(5)

,DATE,CPI
0,1947-01-01,21.5
1,1947-02-01,21.6
2,1947-03-01,22.0
3,1947-04-01,22.0
4,1947-05-01,22.0


In [100]:
gdp = pd.read_csv(f'{DATA_PATH}/GDP.csv', parse_dates=['DATE'])
# Alternatively, we can set the index directly when loading the data
# gdp = pd.read_csv(f'{DATA_PATH}/GDP.csv', parse_dates=['DATE'], index_col='DATE')
gdp.head(5)

,DATE,GDP
0,1947-01-01,2182.7
1,1947-04-01,2176.9
2,1947-07-01,2172.4
3,1947-10-01,2206.5
4,1948-01-01,2239.7


#### Part (2)

Because we did not pass `index_col` to `pd.read_csv()` when loading the data, we set the index afterwards.

In [101]:
# Set DATE column as index
cpi = cpi.set_index('DATE')
gdp = gdp.set_index('DATE')
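
Both ways of setting the index can be compared without the data files. The sketch below reads a few CPI rows from an inline string instead of `CPI.csv`; the variable `csv_text` and the use of `io.StringIO` are assumptions made only to keep the example self-contained.

```python
import io
import pandas as pd

# Inline CSV text standing in for CPI.csv (first rows shown above)
csv_text = """DATE,CPI
1947-01-01,21.5
1947-02-01,21.6
1947-03-01,22.0
"""

# Option 1: set the index while reading
at_read = pd.read_csv(
    io.StringIO(csv_text), parse_dates=['DATE'], index_col='DATE'
)

# Option 2: read first, then set the index
after_read = pd.read_csv(
    io.StringIO(csv_text), parse_dates=['DATE']
).set_index('DATE')

# Compare index labels and values of both results
print(list(at_read.index) == list(after_read.index))
print(at_read['CPI'].tolist() == after_read['CPI'].tolist())
```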

#### Part (3)

In [102]:
# Perform left join (the default)
df = cpi.join(gdp)
df.head(10)

DATE,CPI,GDP
1947-01-01,21.5,2182.7
1947-02-01,21.6,
1947-03-01,22.0,
1947-04-01,22.0,2176.9
1947-05-01,22.0,
1947-06-01,22.1,
1947-07-01,22.2,2172.4
1947-08-01,22.4,
1947-09-01,22.8,
1947-10-01,22.9,2206.5


In [103]:
# Perform inner join
df = cpi.join(gdp, how='inner')
df.head(10)

DATE,CPI,GDP
1947-01-01,21.5,2182.7
1947-04-01,22.0,2176.9
1947-07-01,22.2,2172.4
1947-10-01,22.9,2206.5
1948-01-01,23.7,2239.7
1948-04-01,23.8,2276.7
1948-07-01,24.4,2289.8
1948-10-01,24.3,2292.4
1949-01-01,24.0,2260.8
1949-04-01,23.9,2253.1


***
# Dealing with missing values

<div class="alert alert-info">
<h3> Your turn</h3>
Use the data files located in the folder <TT>../data/FRED</TT> to perform the following tasks:
<ol>
    <li>Load the data in <TT>CPI.csv</TT> and <TT>GDP.csv</TT> into two different DataFrames.
        The files contain monthly data for the Consumer Price Index (CPI) and quarterly data for GDP, respectively.
        <br/>
        <i>Hint:</i> Use <TT>pd.read_csv(..., parse_dates=['DATE'])</TT> to automatically parse strings stored in the <TT>DATE</TT> column as dates.
        </li>
    <li>Merge the CPI with the GDP time series with <TT>merge()</TT> using a left join. This creates missing values in the <TT>GDP</TT>
    column.</li>
    <li>Impute the missing GDP values using <a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html"><TT>interpolate()</TT></a> 
    and replace the missing values in column <TT>GDP</TT>.</li>
</ol>
</div>

### Solution

#### Part (1)

In [2]:
# Path to data folder
DATA_PATH = '../data/FRED'

In [3]:
import pandas as pd 

# Load CPI data
cpi = pd.read_csv(f'{DATA_PATH}/CPI.csv', parse_dates=['DATE'])

# Load GDP data
gdp = pd.read_csv(f'{DATA_PATH}/GDP.csv', parse_dates=['DATE'])

#### Part (2)

In [7]:
# Merge CPI and GDP into a single DataFrame, keeping all dates from CPI
df = pd.merge(cpi, gdp, on='DATE', how='left')

# Print first 10 months
df.head(10)

,DATE,CPI,GDP
0,1947-01-01,21.5,2182.7
1,1947-02-01,21.6,
2,1947-03-01,22.0,
3,1947-04-01,22.0,2176.9
4,1947-05-01,22.0,
5,1947-06-01,22.1,
6,1947-07-01,22.2,2172.4
7,1947-08-01,22.4,
8,1947-09-01,22.8,
9,1947-10-01,22.9,2206.5


Since GDP data is available at quarterly frequency, only every third month contains a non-missing value.
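
The pattern of missing values can be checked directly with `isna()` and `notna()`. The sketch below rebuilds the ten rows printed above by hand (the frame `sample` is an illustration, not the full merged data) and lists the months in which GDP is observed.

```python
import pandas as pd

# The first ten rows of the merged data, typed in from the output above
sample = pd.DataFrame({
    'DATE': pd.date_range('1947-01-01', periods=10, freq='MS'),
    'CPI': [21.5, 21.6, 22.0, 22.0, 22.0, 22.1, 22.2, 22.4, 22.8, 22.9],
    'GDP': [2182.7, None, None, 2176.9, None, None, 2172.4, None, None, 2206.5],
})

# Count missing and non-missing GDP observations
n_missing = sample['GDP'].isna().sum()
n_present = sample['GDP'].notna().sum()
print(n_missing, n_present)

# Months (1-12) in which GDP is observed
gdp_months = sample.loc[sample['GDP'].notna(), 'DATE'].dt.month.tolist()
print(gdp_months)
```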

#### Part (3)

In [12]:
# Linearly interpolate missing values
df['GDP'] = df['GDP'].interpolate(method='linear')

# Print first 10 months to confirm that missing values are gone
df.head(10)

,DATE,CPI,GDP
0,1947-01-01,21.5,2182.7
1,1947-02-01,21.6,2180.766667
2,1947-03-01,22.0,2178.833333
3,1947-04-01,22.0,2176.9
4,1947-05-01,22.0,2175.4
5,1947-06-01,22.1,2173.9
6,1947-07-01,22.2,2172.4
7,1947-08-01,22.4,2183.766667
8,1947-09-01,22.8,2195.133333
9,1947-10-01,22.9,2206.5
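
The interpolated values can be verified by hand. With `method='linear'`, pandas ignores the index and treats the observations as equally spaced, so each missing value lies on the straight line between the two neighboring quarterly observations. The sketch below recomputes the first quarter from the values shown above; the names `gdp_q`, `step`, and `manual` are used only for this illustration.

```python
import pandas as pd

# GDP for 1947-01 through 1947-04 before interpolation
gdp_q = pd.Series([2182.7, None, None, 2176.9])

# Linear interpolation as done in the solution
filled = gdp_q.interpolate(method='linear')

# Manual computation: change per month between the two quarterly values
step = (2176.9 - 2182.7) / 3
manual = [2182.7 + k * step for k in range(4)]

print(filled.round(6).tolist())
print([round(v, 6) for v in manual])
```

If the observations were not equally spaced in time, `method='time'` (which requires a `DatetimeIndex`) might be the more appropriate choice.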
