# 4.6 Combining & Exporting Data

In [4]:
import pandas as pd
from os import path

## Methods for Combining Data  

There are two things that make combining data in Python difficult: __1)__ deciding which procedure to use, and __2)__ deciding how to use it.

### Concatenating Data  

Concatenation is a good choice for combining __data sets that have multiple rows and columns of the same length__. Say, for example, that you have two data sets with five columns each that both carry the same kind of information, only with different values. The concatenate function lets you stack these data sets either on top of one another or side by side.  

If you had two data sets with the same columns but different values in them, you’d stack them one on top of the other.  

EXAMPLE:

In [1]:
# Define a dictionary containing January 2020 data
data1 = {'customer_id':['6732', '767', '890', '635'],
    'month':['Jan-20', 'Jan-20', 'Jan-20', 'Jan-20'],
    'purchased_meat':[0, 13, 3, 4],
    'purchased_alcohol':[1, 2, 10, 0],
    'purchased_snacks':[10, 5, 1, 7]}

# Define a dictionary containing February 2020 data
data2 = {'customer_id':['6732', '767', '890', '635'],
    'month':['Feb-20', 'Feb-20', 'Feb-20', 'Feb-20'],
    'purchased_meat':[0, 10, 5, 3],
    'purchased_alcohol':[2, 4, 14, 0],
    'purchased_snacks':[15, 3, 2, 6]}

In [6]:
# Convert each dictionary into a dataframe
df = pd.DataFrame(data1,index=[0, 1, 2, 3])
df_1 = pd.DataFrame(data2,index=[0, 1, 2, 3])

In [7]:
df

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Jan-20,0,1,10
1,767,Jan-20,13,2,5
2,890,Jan-20,3,10,1
3,635,Jan-20,4,0,7


In [8]:
df_1

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Feb-20,0,2,15
1,767,Feb-20,10,4,3
2,890,Feb-20,5,14,2
3,635,Feb-20,3,0,6


You can now concatenate your dataframes.  

First create a list that holds both of them, then pass that list to the pandas function `pd.concat()`.  
The argument `axis = 0` is the default; it tells pandas to perform the operation along axis 0, that is, vertically.  

##### __Remember:__  
0 stands for vertical (index axis)  
1 stands for horizontal (column axis)

In [15]:
frames = [df, df_1]
df_concat = pd.concat(frames, axis=0)

In [16]:
df_concat

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Jan-20,0,1,10
1,767,Jan-20,13,2,5
2,890,Jan-20,3,10,1
3,635,Jan-20,4,0,7
0,6732,Feb-20,0,2,15
1,767,Feb-20,10,4,3
2,890,Feb-20,5,14,2
3,635,Feb-20,3,0,6


##### To recap, the `pd.concat()` function:  

- Is suitable for __rows or columns of the same length__
- Will place dataframes on top of each other by default (`axis = 0`)
- Requires a list as its main argument (this is why the `frames` list was created first in the example above)
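
The recap can be sketched on a smaller scale. The snippet below is a minimal illustration, not part of the original Exercise: the dataframes `jan` and `feb` are made up for the example, and `ignore_index=True` is an optional `pd.concat()` argument that renumbers the rows instead of repeating the original index values.

```python
import pandas as pd

# Two small, hypothetical dataframes with identical columns
jan = pd.DataFrame({'customer_id': ['6732', '767'], 'purchased_meat': [0, 13]})
feb = pd.DataFrame({'customer_id': ['6732', '767'], 'purchased_meat': [0, 10]})

# axis=0: stack vertically; ignore_index=True renumbers the rows 0..3
# instead of repeating the original index (0, 1, 0, 1)
stacked = pd.concat([jan, feb], axis=0, ignore_index=True)

# axis=1: place side by side; rows are aligned on the index
side_by_side = pd.concat([jan, feb], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
```

Note that the side-by-side result contains duplicate column names, which is one reason vertical stacking is usually the better fit for data sets that share the same columns.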

### Joining Data  

The `df.join()` method is typically used only when your index carries meaningful information (rather than simply numbering the rows, as in your Instacart data and the demo data in this Exercise). If you want to learn more about this method, check out this [resource](https://towardsdatascience.com/pandas-join-vs-merge-c365fd4fbf49).
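
As a rough sketch of what an informative index looks like, the example below assumes two hypothetical dataframes, `purchases` and `visits`, that both use `customer_id` as their index. These names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical dataframes whose index is the customer ID
purchases = pd.DataFrame(
    {'purchased_meat': [0, 13, 3]},
    index=pd.Index(['6732', '767', '890'], name='customer_id'),
)
visits = pd.DataFrame(
    {'days_purchased_on': [0, 10]},
    index=pd.Index(['6732', '767'], name='customer_id'),
)

# join() matches rows on the index and performs a left join by default,
# so customer '890' is kept with a missing value for days_purchased_on
joined = purchases.join(visits)
print(joined)
```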

### Merging Data  

The best use cases for the `df.merge()` function are those where the dataframes you want to combine __don’t match in shape__, unlike with the concatenate function.  

You’ll need a key or some kind of _common identifier_ column that brings the two (or more) data sets together.  
While some dataframes may contain common identifier columns, others won’t have any full matches between them at all.  

EXAMPLE:  

Imagine you have multiple data sets with information about social media users, each one containing information about the same person on different social media platforms. If you wanted to combine, say, the Facebook data set with the Instagram data set, you wouldn’t have a full match because Instagram launched after Facebook. It won’t include the same historical data because it simply doesn’t exist.  

__A “full match” refers to having 100 percent of both dataframes in the new combined dataframe.__  

This is why it’s important to know beforehand which information you need to keep and which you can omit.  

This is also where __the importance of the type of _join_ really kicks in.__

##### Inner Join:  

Used to keep only information that’s present in both data sets.  
Use this method only if you’re absolutely sure you don’t need the excess data.  

##### Left Join:  

Used to keep everything from the left dataframe, plus the matching information from the right one (inner join + all the unmatched rows of the left dataframe).  



##### Right Join:  

The mirror image of a left join: keeps everything from the right dataframe, plus the matching information from the left one (inner join + all the unmatched rows of the right dataframe).


##### Full Outer Join:  

Used to keep all information from both dataframes, regardless of whether they match.  
When using this method on data sets of different sizes, you’ll end up with a lot of missing data in your final dataframe.
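
To make the four join types concrete, here is a small sketch using two invented dataframes, `left` and `right`, whose `customer_id` values only partly overlap. It relies on the `how` argument of `merge()`, which is covered later in this section. The data is hypothetical and chosen only so that each join type returns a different result.

```python
import pandas as pd

# '767' and '890' appear in both; '6732' is left-only, '635' is right-only
left = pd.DataFrame({'customer_id': ['6732', '767', '890'],
                     'purchased_meat': [0, 13, 3]})
right = pd.DataFrame({'customer_id': ['767', '890', '635'],
                      'days_purchased_on': [10, 4, 1]})

# Count the rows each join type returns
counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = left.merge(right, on='customer_id', how=how)
    counts[how] = len(result)

print(counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

The inner join keeps only the two shared customers, the left and right joins each add one unmatched customer, and the outer join keeps all four, filling the gaps with missing values.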




In [18]:
# Let's create another dataframe of different shape to test the merge method
data3 = {'customer_id':['6732', '767', '890', '635'],
    'month':['Jan-20', 'Jan-20', 'Jan-20', 'Jan-20'],
    'days_purchased_on':[0, 10, 4, 1]}

df_2 = pd.DataFrame(data3,index=[0, 1, 2, 3])

In [23]:
df

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Jan-20,0,1,10
1,767,Jan-20,13,2,5
2,890,Jan-20,3,10,1
3,635,Jan-20,4,0,7


In [24]:
df_2

Unnamed: 0,customer_id,month,days_purchased_on
0,6732,Jan-20,0
1,767,Jan-20,10
2,890,Jan-20,4
3,635,Jan-20,1


Now we have two dataframes, `df` and `df_2`, with different shapes.  
Similar to SQL, the `merge()` function takes the `on` argument to specify the common column to use to match the two dataframes.

In [25]:
# the on argument tells pandas that the “customer_id” column is the common column between the two
df_merged = df.merge(df_2, on = 'customer_id')

In [20]:
df_merged

Unnamed: 0,customer_id,month_x,purchased_meat,purchased_alcohol,purchased_snacks,month_y,days_purchased_on
0,6732,Jan-20,0,1,10,Jan-20,0
1,767,Jan-20,13,2,5,Jan-20,10
2,890,Jan-20,3,10,1,Jan-20,4
3,635,Jan-20,4,0,7,Jan-20,1


##### ATTENTION:  
There may be times where the dataframes you’re provided with have a common column, but that column has a different name in each dataframe—for instance, `cust_id` in the first and `customer_id` in the second. In this scenario, you’ll need to rename one of the columns before executing a merge.
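
A minimal sketch of the renaming approach follows, using two hypothetical dataframes, `orders` and `visits`, invented for this illustration. As an alternative that avoids renaming, `merge()` also accepts `left_on` and `right_on` arguments to name the key column on each side. Note that this second option keeps both key columns in the result.

```python
import pandas as pd

# Hypothetical dataframes where the key column is named differently
orders = pd.DataFrame({'cust_id': ['6732', '767'], 'purchased_meat': [0, 13]})
visits = pd.DataFrame({'customer_id': ['6732', '767'], 'days_purchased_on': [0, 10]})

# Option 1: rename the column so both dataframes share the key name
renamed = orders.rename(columns={'cust_id': 'customer_id'})
merged_renamed = renamed.merge(visits, on='customer_id')

# Option 2: keep the names and tell pandas which column to use on each side
merged_keys = orders.merge(visits, left_on='cust_id', right_on='customer_id')

print(merged_renamed.columns.tolist())
print(merged_keys.columns.tolist())
```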

Another thing you may notice is two new columns: `month_x` and `month_y`. This is a result of the `month` column existing in both dataframes. Because you didn’t specify it as a key, like you did the `customer_id` column, it’s duplicated in the final dataframe.

In [36]:
# since the month column also matches in both dataframes, we can include it in the on argument using a list
df_merged = df.merge(df_2, on = ['customer_id', 'month'])

In [37]:
df_merged

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on
0,6732,Jan-20,0,1,10,0
1,767,Jan-20,13,2,5,10
2,890,Jan-20,3,10,1,4
3,635,Jan-20,4,0,7,1


When conducting the merge procedure, you’d expect there to be a full match between the two sets.  

For instance, in this case, every `customer_id` in `df` existed in `df_2` allowing for a seamless merge.  
If there isn’t a full match, you may have been given the wrong data sets (a fairly common mistake).  

##### Is it a full match?
A quick and easy way to check for a full match is via the `indicator = True` argument.  
  
When included in your merge, __pandas will create an additional column in the final dataframe__ called `_merge` that indicates the source of the data within that row (this column is technically called a _merge flag_). A value of `both` means the key (or keys) you specified exist in both dataframes, while a value of `left_only` or `right_only` indicates that the key only exists in either the left or right dataframe.

In [42]:
df_merged = df.merge(df_2, on = ['customer_id', 'month'], indicator=True)

In [43]:
df_merged

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on,_merge
0,6732,Jan-20,0,1,10,0,both
1,767,Jan-20,13,2,5,10,both
2,890,Jan-20,3,10,1,4,both
3,635,Jan-20,4,0,7,1,both


If you run a frequency check on this column, you’ll quickly be able to see how many rows in the new dataframe have a value of `both`, `right_only`, and `left_only`.

In [44]:
df_merged.value_counts('_merge')

_merge
both          4
left_only     0
right_only    0
Name: count, dtype: int64
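
For contrast, the sketch below shows what the merge flag looks like when the match is not full. The dataframes `left` and `right` are invented for this illustration. One caveat worth keeping in mind: with the default inner join, unmatched rows are dropped before you can see them, so every row is flagged `both`. The example therefore assumes an outer join for the check.

```python
import pandas as pd

# Hypothetical dataframes with partly overlapping customer IDs
left = pd.DataFrame({'customer_id': ['6732', '767', '890'],
                     'purchased_meat': [0, 13, 3]})
right = pd.DataFrame({'customer_id': ['767', '890', '635'],
                      'days_purchased_on': [10, 4, 1]})

# An outer join keeps unmatched rows, so the flag can show where they came from
check = left.merge(right, on='customer_id', how='outer', indicator=True)
counts = check['_merge'].value_counts()
print(counts)

# Rows that exist in only one of the two dataframes
unmatched = check[check['_merge'] != 'both']
print(unmatched['customer_id'].tolist())
```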

In [49]:
# drop the _merge column
df_merged.drop('_merge', axis = 1, inplace=True)

In [51]:
df_merged

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on
0,6732,Jan-20,0,1,10,0
1,767,Jan-20,13,2,5,10
2,890,Jan-20,3,10,1,4
3,635,Jan-20,4,0,7,1


If you'd like to test a merge without overwriting any dataframe, run it without assigning the result to a variable:

In [54]:
df.merge(df_2, on = ['customer_id', 'month'])

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on
0,6732,Jan-20,0,1,10,0
1,767,Jan-20,13,2,5,10
2,890,Jan-20,3,10,1,4
3,635,Jan-20,4,0,7,1


You can also use the equivalent `pd.merge()` function, which produces the same result; choosing between the two is just a matter of preference:

In [56]:
# this function is accessed directly from the pandas module
pd.merge(df, df_2, on = ['customer_id', 'month'])

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on
0,6732,Jan-20,0,1,10,0
1,767,Jan-20,13,2,5,10
2,890,Jan-20,3,10,1,4
3,635,Jan-20,4,0,7,1


There’s one more argument you should be aware of when using the `pd.merge()` function: the `how` argument. This argument specifies, as the name implies, how you want the dataframes to be merged (the type of join to use), and it can take the values `left`, `right`, `inner` (default), or `outer`.  

In this example, it doesn’t matter which type of join you use because there’s a full match between the key columns of the two dataframes. In cases where this isn’t so, however, you should always consult senior colleagues and stakeholders about which parts of the data to keep before deciding which type of join to use.

In [59]:
pd.merge(df, df_2, on = ['customer_id', 'month'], how = 'inner')

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on
0,6732,Jan-20,0,1,10,0
1,767,Jan-20,13,2,5,10
2,890,Jan-20,3,10,1,4
3,635,Jan-20,4,0,7,1
