# Tutorial 3.5: Pandas Data Concatenation
Python for Data Analytics | Module 3  
Professor James Ng

In [None]:
# SETUP: DO NOT CHANGE
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline
plt.style.use('seaborn')

## Introduction

In this tutorial, we will cover basic ways to combine datasets using the `pd.concat()` function of the *pandas* library. Conceptually, it is related to the `pd.merge()` function that we will cover in the next tutorial: both are ways of combining different datasets.

However, the mechanics of how they perform the combinations are different.

In [None]:
# For this tutorial, we will only need our college_loan_defaults dataset.
college_loan_defaults = pd.read_csv('data-sets/college-loan-default-rates.csv')

## `pd.concat()`
You can think of the `pd.concat()` function as the equivalent of the NumPy `concatenate()` function for `Series` and `DataFrame` objects. Essentially, `pd.concat()` returns a new *Series*/*DataFrame* in which the rows or columns of the objects you pass in have been joined together.
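
As a rough illustration of that analogy, the sketch below concatenates the same values with NumPy and with *pandas*. The two small Series are made up for this example and are not part of the college loan dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the college loan dataset
first = pd.Series([10, 20], index=['a', 'b'])
second = pd.Series([30, 40], index=['c', 'd'])

# NumPy joins the raw values end to end and knows nothing about labels
numpy_result = np.concatenate([first.to_numpy(), second.to_numpy()])

# pandas joins the values too, but carries each index label along
pandas_result = pd.concat([first, second])

print(numpy_result)
print(pandas_result)
```

The values are the same in both results; the difference is that the *pandas* result keeps the index labels of the original objects.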

When using `pd.concat()` with a *DataFrame*, the most basic question is whether you are adding *additional rows* or *additional columns*. We'll run through the function arguments based on concatenating rows and then come back for a look at how we perform column concatenations.

## Concatenating *DataFrame* Rows
To get started, I'll use the `.iloc` attribute to split `college_loan_defaults` into multiple sections of rows, which we will then stitch back together.

In [None]:
# IMPORTANT: Notice that I've included numeric index 1999 in both part_2 and part_3
part_1 = college_loan_defaults.iloc[:1000]
part_2 = college_loan_defaults.iloc[1000:2000]
part_3 = college_loan_defaults.iloc[1999:]

### Basic Usage
Before we start, let's see how many rows are in each of our parts:

In [None]:
print(len(part_1), len(part_2), len(part_3))

In its simplest form, all you need to pass to `pd.concat()` is a list of the *pandas* objects that you want to combine:

In [None]:
# Joining all three parts back together with pd.concat()
# Notice that I've put part 3 first here.
concatenated_dataframe = pd.concat([part_3, part_1, part_2])

# And print out the number of rows
# Should be the sum of the individual parts
len(concatenated_dataframe)

So far, so good. Now, take a look at the `head()` of the resulting *DataFrame*:

In [None]:
# Look, part_3 is now at the head() of our DataFrame.
# pd.concat() does not do any sorting of the indices.
concatenated_dataframe.head()
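
If the original row order matters, one option is to sort by index after concatenating. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for the real dataset) to show that rows come out in the order the objects were listed, and that `sort_index()` puts them back in index order.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({'name': ['A', 'B', 'C', 'D']})
top = toy.iloc[:2]      # index labels 0, 1
bottom = toy.iloc[2:]   # index labels 2, 3

# Rows come out in the order the objects were listed
out_of_order = pd.concat([bottom, top])
print(out_of_order.index.tolist())   # [2, 3, 0, 1]

# sort_index() returns a new object ordered by index label
restored = out_of_order.sort_index()
print(restored.index.tolist())       # [0, 1, 2, 3]
```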

### Handling Duplicate Index Values with `verify_integrity` & `ignore_index` Parameters
As I mentioned earlier, I purposely shared an index value (1999) between the *part_2* and *part_3* objects.
The result of this is that there are actually two rows with index 1999 in our *concatenated_dataframe* object.

In [None]:
concatenated_dataframe.loc[1999]

Gross, now we have to deal with the duplicate index value. 

If you want to keep both entries (often the case if the index value is the same but the rest of the data is different), you can pass `ignore_index=True` to the function. All existing index values will be discarded and a new integer-based index will be created for you.

In [None]:
# Dump the old index and generate a new one
# Notice how the index values now start from 0, even though we
# used `part_2` as the first object in this concat operation
concatenated_dataframe = pd.concat([part_2, part_3, part_1], ignore_index=True)
concatenated_dataframe.head()
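
The other parameter named in this section's heading, `verify_integrity`, takes the opposite approach: rather than replacing the index, it makes `pd.concat()` raise a `ValueError` when the combined index contains duplicates. Here is a minimal sketch on a made-up frame (`toy`, `left`, and `right` are hypothetical names, and the shared label 2 plays the role of index 1999 above).

```python
import pandas as pd

# Hypothetical toy frame; label 2 is shared on purpose
toy = pd.DataFrame({'name': ['A', 'B', 'C', 'D']})
left = toy.iloc[:3]    # index labels 0, 1, 2
right = toy.iloc[2:]   # index labels 2, 3

# verify_integrity=True refuses to build a result with duplicate labels
try:
    pd.concat([left, right], verify_integrity=True)
    overlap_detected = False
except ValueError as error:
    overlap_detected = True
    print('concat refused:', error)

# ignore_index=True keeps every row and renumbers them from 0
renumbered = pd.concat([left, right], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]
```

Which option fits depends on whether a shared index value is an error in your data or just something to tidy up.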

In [None]:
# Now let's see if we still have the data from both entries
# AND ensure they have different index values
mask = concatenated_dataframe['name'] == 'JOHN MARSHALL LAW SCHOOL (THE)'
concatenated_dataframe[mask]

Alright, now we no longer have duplicate index values. **BUT** these two rows are pure duplicates of each other and in situations like this, you almost certainly want to keep just one. 

To check for duplicate rows, use the `duplicated()` method of the *DataFrame*.

To drop duplicate rows, use the `drop_duplicates()` method of the *DataFrame*.
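
As a small sketch of both methods, the example below builds a made-up frame with one fully repeated row (the data and variable names are hypothetical) and then flags and removes the repeat.

```python
import pandas as pd

# Hypothetical toy data with overlapping slices, so row 'B' appears twice
toy = pd.DataFrame({'name': ['A', 'B', 'C'], 'state': ['IL', 'NY', 'CA']})
overlapping = pd.concat([toy.iloc[:2], toy.iloc[1:]], ignore_index=True)

# duplicated() marks every repeat of an earlier row as True
duplicate_mask = overlapping.duplicated()
print(duplicate_mask.tolist())   # [False, False, True, False]

# drop_duplicates() keeps the first copy of each row;
# reset_index(drop=True) closes the gap left in the index
deduplicated = overlapping.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```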

### Handling Column Mismatches with the `join` Parameter
Sometimes you will have two sets of rows that you want to join together, but the sets don't have all of the same columns.

To demonstrate our options here, I'll create a couple of additional small *DataFrame* objects from our college loan dataset.

In [None]:
# DataFrame 1
# Contains the first 5 rows of the original dataset
# But only the name, city, and state columns
name_city_state_columns_only = college_loan_defaults[['name', 'city', 'state']][:5]
name_city_state_columns_only

In [None]:
# DataFrame 2
# Contains the second 5 rows of the original dataset
# But only the name, state, and zipcode columns
name_state_zipcode_columns_only = college_loan_defaults[['name', 'state', 'zipcode']][5:10]
name_state_zipcode_columns_only

We have 2 sets of 5 rows that we want to concatenate together, but they have different columns. Let's see what happens if you don't specify anything with the `join` parameter:

In [None]:
pd.concat([name_city_state_columns_only, name_state_zipcode_columns_only])

You can see that *pandas* combined the row indices together as expected. In addition, it combined all the available column names **and** added `NaN` values for any index/column combination that didn't have a value in the original dataframes.

Alternatively, we can tell *pandas* to drop any column that is not present in all of the objects being concatenated. You can do this by passing a value of `'inner'` to the `join` parameter of the function.

Let's demonstrate how doing so will result in only the shared columns (name, state) appearing in the final dataframe:

In [None]:
pd.concat([name_city_state_columns_only, name_state_zipcode_columns_only], join='inner')
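
To make the difference between the two `join` settings easy to check, here is a compact sketch on made-up frames that mirror the column layout above. The values and the variable names `with_city` and `with_zip` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames that share only the 'name' and 'state' columns
with_city = pd.DataFrame(
    {'name': ['A', 'B'], 'city': ['X', 'Y'], 'state': ['IL', 'NY']})
with_zip = pd.DataFrame(
    {'name': ['C', 'D'], 'state': ['CA', 'TX'], 'zipcode': ['90001', '73301']},
    index=[2, 3])

# Default (join='outer'): keep every column, fill the gaps with NaN
outer = pd.concat([with_city, with_zip])

# join='inner': keep only the columns present in every object
inner = pd.concat([with_city, with_zip], join='inner')

print(outer.columns.tolist())
print(inner.columns.tolist())
print(outer['zipcode'].isna().tolist())   # True where no zipcode existed
```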

## Concatenating *DataFrame* Columns
Now let's go back and see how we can use the `pd.concat()` function to combine two sets of columns with the same index (row) values.

As before, we will start with small *DataFrame* objects. As we begin to work, the results will start out a little dirty, but we will clean them up with our parameters.

In [None]:
# DataFrame 1
# Contains the first 5 rows of the original dataset
# But only the name, city, and state columns

# Doing a reverse sort on the index...
name_city_state_zipcode_columns = college_loan_defaults[['name', 'city', 'state']][:5]
name_city_state_zipcode_columns.sort_index(ascending=False, inplace=True)
name_city_state_zipcode_columns

In [None]:
# DataFrame 2
# Contains the first 7 rows of the original dataset
# But only the three default rate columns

# Again, doing a reverse sort on the index.
name_and_default_rates = college_loan_defaults[
    ['year_1_default_rate', 'year_2_default_rate', 'year_3_default_rate']][:7]
name_and_default_rates.sort_index(ascending=False, inplace=True)
name_and_default_rates

Now let's do a simple concatenation. To add columns, we have to specify the `axis` parameter with a value of `1` or `'columns'` to indicate we are adding columns, not rows.

In [None]:
pd.concat(
    [name_city_state_zipcode_columns, name_and_default_rates], axis='columns')

There are a couple of important things to notice here:
* Unlike when concatenating rows, this time the result came back sorted by index. This can vary between *pandas* versions, so call `.sort_index()` on the result if you rely on the order.
* See how there are a couple of rows with `NaN` values in their first three columns? That's because our `name_and_default_rates` dataframe had two additional rows for which there were no corresponding rows in `name_city_state_zipcode_columns`.

Just as we could specify an *inner join* to drop columns with missing data previously, we can use it here to drop rows with missing values: 

In [None]:
pd.concat(
    [name_city_state_zipcode_columns, name_and_default_rates], 
    axis=1, join='inner')
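
The same outer-versus-inner choice can be sketched on made-up frames whose indexes only partly overlap. The data below is hypothetical (only the `year_1_default_rate` column name is borrowed from the dataset), and `.sort_index()` is called explicitly so that the row order does not depend on how a given *pandas* version orders the combined index.

```python
import pandas as pd

# Hypothetical toy frames in reverse index order; label 2 exists only in `rates`
locations = pd.DataFrame({'city': ['Y', 'X']}, index=[1, 0])
rates = pd.DataFrame({'year_1_default_rate': [3.5, 2.5, 1.5]}, index=[2, 1, 0])

# Default (join='outer'): every index label is kept, gaps become NaN
outer = pd.concat([locations, rates], axis='columns').sort_index()

# join='inner': only index labels present in every object are kept
inner = pd.concat([locations, rates], axis='columns', join='inner').sort_index()

print(outer)
print(inner)
```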

Finally, let's talk about how the `verify_integrity` and `ignore_index` parameters work when concatenating columns.

Let's say that we had included the city column in both dataframes:
* The default behavior of `pd.concat()` would have been to create a new dataframe with 2 "city" columns.
* You could make *pandas* raise a `ValueError` exception by passing `verify_integrity=True` to the function.
* You could also throw out all the column names and replace them with a 0-based sequence of integers by passing `ignore_index=True`. This would result in the values of "city" being duplicated in two columns, but the columns would have different integer "names".
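
The three behaviors in the list above can be sketched as follows. The two frames are made up for the example (`left` and `right` are hypothetical names), and both deliberately include a `city` column.

```python
import pandas as pd

# Hypothetical toy frames that both contain a 'city' column
left = pd.DataFrame({'name': ['A', 'B'], 'city': ['X', 'Y']})
right = pd.DataFrame({'city': ['X', 'Y'], 'zipcode': ['60601', '10001']})

# Default: the result simply has two columns named 'city'
default_result = pd.concat([left, right], axis='columns')
print(default_result.columns.tolist())   # ['name', 'city', 'city', 'zipcode']

# verify_integrity=True: duplicate column labels raise a ValueError
try:
    pd.concat([left, right], axis='columns', verify_integrity=True)
    duplicate_columns_detected = False
except ValueError:
    duplicate_columns_detected = True

# ignore_index=True: column labels are replaced with 0, 1, 2, ...
renumbered = pd.concat([left, right], axis='columns', ignore_index=True)
print(renumbered.columns.tolist())       # [0, 1, 2, 3]
```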

## Concatenating *Series* Objects
Concatenating *Series* objects is really no different from concatenating a single-column *DataFrame*. But, for the sake of completeness, here are some quick examples of `pd.concat()` on a *Series*.

In [None]:
# Grab two sections of the `name` Series
series_index_0_to_5 = college_loan_defaults['name'][:6]
series_index_5_to_10 = college_loan_defaults['name'][5:10]
print(series_index_0_to_5, series_index_5_to_10, sep='\n\n')

In [None]:
# Concatenate them together with default argument values
# Notice the duplicate 5 index
pd.concat([series_index_0_to_5, series_index_5_to_10])

In [None]:
# Ignore the duplicate `5` index with `ignore_index`
# Remember this generates a new numeric index
pd.concat([series_index_0_to_5, series_index_5_to_10], ignore_index=True)

In [None]:
# Or raise ValueError with `verify_integrity`
pd.concat([series_index_0_to_5, series_index_5_to_10], verify_integrity=True)
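
The `axis` parameter applies to *Series* objects as well. As a brief sketch on made-up data (the Series below are hypothetical), stacking two Series gives a longer Series, while placing them side by side gives a *DataFrame* whose column names come from each Series' `name`.

```python
import pandas as pd

# Hypothetical toy Series of different lengths
names = pd.Series(['A', 'B', 'C'], name='name')
cities = pd.Series(['X', 'Y'], name='city')

# Default axis: the values are stacked into one longer Series
stacked = pd.concat([names, cities])
print(type(stacked).__name__, len(stacked))

# axis='columns': each Series becomes a column of a DataFrame,
# and the missing third city is filled with NaN
side_by_side = pd.concat([names, cities], axis='columns')
print(side_by_side)
```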