# Combining DataFrames

In this notebook, we will be working with multiple data sources.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### A. Concatenation

Concatenation is simply appending one dataframe to another, either via rows or via columns. Think copy-pasting in Excel.

In [None]:
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), 
                   columns=['a', 'b', 'c'],
                   index=['one', 'two', 'three'])
df2 = pd.DataFrame(np.arange(6).reshape((3, 2)), 
                   columns=['d','e'],
                   index=['three', 'two','one'])
display(df1)
display(df2)

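- Since the two data frames have the same number of rows, it is natural to combine them "horizontally".  
- Note that the concatenation aligns rows by index label, not by row position: row 'one' of `df1` is paired with row 'one' of `df2` even though the rows are listed in a different order.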

In [None]:
pd.concat([df1, df2], axis = 1, sort=False)

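- The argument `axis=1` means expanding along the columns. Setting `axis=0` stacks the data frames vertically instead. Columns are matched by name, so this works best when both data frames share the same columns; any column missing from one data frame is filled with NaN.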

In [None]:
pd.concat([df1, df2], axis = 0, sort=False)
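
As a minimal sketch of the alignment behaviour described above, the toy frames below (hypothetical names `left` and `right`, mirroring `df1` and `df2`) suggest how `axis=1` pairs rows by index label, while `axis=0` stacks rows and leaves NaN where a column is missing. Using `ignore_index=True` is one possible way to renumber the rows after stacking; it is shown here as an option, not as part of the original exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with the same labels in a different order
left = pd.DataFrame(np.arange(9).reshape((3, 3)),
                    columns=['a', 'b', 'c'],
                    index=['one', 'two', 'three'])
right = pd.DataFrame(np.arange(6).reshape((3, 2)),
                     columns=['d', 'e'],
                     index=['three', 'two', 'one'])

# axis=1: rows are matched by index label, not by position
wide = pd.concat([left, right], axis=1)

# axis=0: rows are stacked; columns absent from one frame become NaN
tall = pd.concat([left, right], axis=0)

# ignore_index=True discards the original labels and renumbers the rows
tall_renumbered = pd.concat([left, right], axis=0, ignore_index=True)

print(wide)
print(tall)
print(tall_renumbered)
```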

### Try it!

We have here data from primary and secondary schools. We want to combine these data into one dataframe. Since the headers for both dataframes are the same, we can use concat. 

In [None]:
# Read the primary (elementary) school data
primary = pd.read_csv("depend_publicelementary2015.csv", encoding="latin-1")
primary.head()

In [None]:
# Read the secondary school data
secondary = pd.read_csv("deped_publicsecondary2015.csv", encoding="latin-1")
secondary.head()

In [None]:
# Combine primary and secondary schools

#df_all_schools = 
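
Before trying this on the real files, it may help to see the pattern on a small example. The sketch below uses two made-up frames (`primary_toy` and `secondary_toy`, with hypothetical columns `school_id` and `enrollment`) that share the same headers. The `keys` argument is one way to keep track of which table each row came from; `ignore_index=True` is an alternative when the original row labels carry no meaning.

```python
import pandas as pd

# Hypothetical stand-ins for the primary and secondary school tables
primary_toy = pd.DataFrame({'school_id': [101, 102], 'enrollment': [350, 420]})
secondary_toy = pd.DataFrame({'school_id': [201, 202], 'enrollment': [800, 650]})

# Stack rows; keys= labels each block so the source of every row stays visible
stacked = pd.concat([primary_toy, secondary_toy], keys=['primary', 'secondary'])

# Alternatively, renumber rows from 0 when the original index is not needed
flat = pd.concat([primary_toy, secondary_toy], ignore_index=True)

print(stacked)
print(flat)
```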

### B. Merge
Merging is the most common way to combine data frames. It is more flexible than concat because rows are matched on the values of one or more key columns, much like a join in SQL.

In [None]:
df3 = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   columns=['col1', 'col2', 'col3'])
df4 = pd.DataFrame({'col2': ['x', 'e', 'b', 'z'],
                    'col4': [1, 2, 3, 4],
                    'col5': ['i', 'f', 'e', 'h']})
display(df3)
display(df4)

- Merging uses the **`on`** column as the key for the merge. The code below matches rows on the column 'col2', which is present in both data frames.
- Setting the argument **`how`** to 'inner' makes the merge keep only rows whose key occurs in both data frames.

In [None]:
pd.merge(df3, df4, how='inner', on ='col2')

- The default value of the parameter `how` is 'inner'. The following code performs the same task as above.

In [None]:
pd.merge(df3, df4, on ='col2')

- To keep every row of the left data frame (here df3), set the parameter `how` = 'left'.

In [None]:
pd.merge(df3, df4, how='left', on ='col2')

- To keep all rows from both df3 and df4, set the parameter `how` = 'outer'.

In [None]:
pd.merge(df3, df4, how='outer', on ='col2')
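
To compare the `how` options side by side, the sketch below rebuilds `df3` and `df4` and counts the rows each merge type returns. It also uses `indicator=True`, a standard `pd.merge` argument not used elsewhere in this notebook, which adds a `_merge` column recording whether each row came from the left frame, the right frame, or both. The names `sizes`, `outer` and `counts` are illustrative only.

```python
import pandas as pd

df3 = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   columns=['col1', 'col2', 'col3'])
df4 = pd.DataFrame({'col2': ['x', 'e', 'b', 'z'],
                    'col4': [1, 2, 3, 4],
                    'col5': ['i', 'f', 'e', 'h']})

# Number of rows returned by each merge type on the shared key 'col2'
sizes = {how: len(pd.merge(df3, df4, how=how, on='col2'))
         for how in ['inner', 'left', 'right', 'outer']}
print(sizes)

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
outer = pd.merge(df3, df4, how='outer', on='col2', indicator=True)
counts = outer['_merge'].value_counts()
print(outer)
print(counts)
```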

- If the key column does not have the same name in the two data frames, use `left_on` and `right_on` to indicate which column to use from each.  
- Note that non-key columns sharing the same name in both data frames are kept separately, with the suffixes `_x` (left) and `_y` (right) appended to their names.

In [None]:
pd.merge(df3, df4, left_on='col2', right_on='col5')
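
The suffix behaviour can be checked directly. The sketch below repeats the merge on rebuilt copies of `df3` and `df4` and lists the resulting column names, then shows the `suffixes` argument of `pd.merge` as one way to choose more descriptive names. The suffix strings `_left` and `_right` are arbitrary choices for illustration.

```python
import pandas as pd

df3 = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   columns=['col1', 'col2', 'col3'])
df4 = pd.DataFrame({'col2': ['x', 'e', 'b', 'z'],
                    'col4': [1, 2, 3, 4],
                    'col5': ['i', 'f', 'e', 'h']})

# 'col2' exists in both frames but is the key only on the left side,
# so pandas keeps both copies and renames them col2_x and col2_y
merged = pd.merge(df3, df4, left_on='col2', right_on='col5')
print(list(merged.columns))

# suffixes= replaces the default ('_x', '_y') pair
custom = pd.merge(df3, df4, left_on='col2', right_on='col5',
                  suffixes=('_left', '_right'))
print(list(custom.columns))
```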

### Try it!

Let's try adding variables from `schools.csv` that are not present in the df_all_schools data we prepared. 

In [None]:
# Add data from a different dataset
additional_data = pd.read_csv("schools.csv")
additional_data.head()

Which columns can we add?

In [None]:
data_to_add = additional_data[["ID", "Total_Enro", "Total_Inst", "Rooms_used", "Rooms_unused", "Type_of_Sc"]]

How should we combine them?

In [None]:
# Combine select columns from additional_data to the df_all_schools

#df_all = 

In [None]:
df_all

Let's check the shape of our new dataframe using `df_all.shape`.

You may have noticed that there are lots of missing values in our new dataframe. One way to avoid missing values is to obtain only the schools that are present in both dataframes. Let's try doing that.
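
The difference can be illustrated on toy data before applying it to the school files. In the sketch below, `schools_toy` and `extra_toy` are made-up frames, and the assumption that both share a key column named `ID` is for illustration only; the real key column in `df_all_schools` may be named differently. A left merge keeps unmatched schools and fills their new columns with NaN, whereas an inner merge drops them.

```python
import pandas as pd

# Hypothetical frames; 'ID' as the shared key is an assumption
schools_toy = pd.DataFrame({'ID': [1, 2, 3], 'name': ['A', 'B', 'C']})
extra_toy = pd.DataFrame({'ID': [2, 3, 4], 'Rooms_used': [10, 12, 8]})

# Left merge: every school is kept, unmatched ones get NaN
left_join = pd.merge(schools_toy, extra_toy, how='left', on='ID')

# Inner merge: only schools present in both frames are kept
inner_join = pd.merge(schools_toy, extra_toy, how='inner', on='ID')

print(left_join.shape, inner_join.shape)
print(left_join.isna().sum())
print(inner_join.isna().sum())
```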

In [None]:
# Add data but obtain only common primary key

# Check shape


In [None]:
# Check data frame


### Saving to csv

Now that we have our new clean dataframe, let's save it to a new file.

In [None]:
#df_inner.to_csv("schools_combined.csv")
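
One detail worth knowing when saving: by default `to_csv` also writes the row index as an unnamed first column, which then reappears as `Unnamed: 0` when the file is read back. The sketch below uses a made-up frame `toy` and an in-memory buffer instead of a real file to compare the default with `index=False`. Whether to drop the index depends on whether it carries meaningful labels.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the merged school data
toy = pd.DataFrame({'ID': [1, 2], 'Rooms_used': [10, 12]})

# index=False: only the data columns are written
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Default: the index is written too and comes back as an extra column
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```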