# Data Manipulation with Pandas

Pandas is the most widely used Python library for data science. It makes it easy to manipulate data so that you can derive better insights and build better machine learning models.

In this notebook, we will look at some intermediate concepts of working with pandas.


## Table of Contents

1. Sorting dataframes
2. Merging dataframes
3. Binning in Pandas

### Loading dataset

***In this notebook we will use the Big Mart Sales data. You can download the data from: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/download/train-file***

In [None]:
import pandas as pd
import numpy as np

# read the dataset
data_BM = pd.read_csv('../input/big-mart-sales/train_v9rqX0R.csv')
# drop the null values
data_BM = data_BM.dropna(how="any")
# view the top results
data_BM.head()
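
Before dropping rows, it can help to see how many values are actually missing. The sketch below uses a tiny made-up frame to show what `dropna(how="any")` removes; the column names `Item_Weight` and `Outlet_Size` and all values are assumptions for illustration, not read from the real file.

```python
import numpy as np
import pandas as pd

# toy frame with made-up values; not the real Big Mart data
toy = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5],
    'Outlet_Size': ['Medium', 'Small', None],
})

# count the missing values in each column
missing_per_column = toy.isna().sum()

# how="any" drops every row that has at least one missing value
cleaned = toy.dropna(how="any")

print(missing_per_column)
print(cleaned)
```

Only the first row survives here, because each of the other two rows has one missing value.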

### 1. Sorting dataframes

A pandas DataFrame has two useful methods for sorting:

- **sort_values()**: to sort pandas data frame by one or more columns
- **sort_index()**: to sort pandas data frame by row index

Each of these methods comes with numerous options, such as sorting in a specific order (ascending or descending), sorting in place, controlling where missing values are placed, and choosing the sorting algorithm.
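
As a small illustration of the last two options, the sketch below sorts a made-up frame (the `year` and `sales` columns and their values are assumptions for illustration) with missing values placed first and with a stable algorithm, so rows with equal keys keep their original relative order.

```python
import numpy as np
import pandas as pd

# toy frame with one missing year and one tie (1999 appears twice)
toy = pd.DataFrame({
    'year': [1999, np.nan, 1985, 1999],
    'sales': [10.0, 20.0, 30.0, 5.0],
})

# na_position='first' puts missing values at the top (the default is 'last');
# kind='mergesort' is a stable sort, so the two 1999 rows keep their order
nulls_first = toy.sort_values(by='year', na_position='first', kind='mergesort')
print(nulls_first)
```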

Suppose you want to sort the DataFrame by `Outlet_Establishment_Year`. For that you use **sort_values()**:

In [None]:
# sort by year
sorted_data = data_BM.sort_values(by='Outlet_Establishment_Year')
# print sorted data
sorted_data[:5]

- `sort_values()` takes several options, including:
    - `ascending`: the default sort order is ascending; pass `False` to sort in descending order.
    - `inplace`: whether to modify the DataFrame in place instead of returning a sorted copy.

In [None]:
# sort in place and descending order
data_BM.sort_values(by='Outlet_Establishment_Year', ascending=False, inplace=True)
data_BM[:5]

You might want to sort a DataFrame based on the values of multiple columns.
To do so, pass the column names as a list to the `by` argument of `sort_values()`.


In [None]:
# read the dataset
data_BM = pd.read_csv('../input/big-mart-sales/train_v9rqX0R.csv')
# drop the null values
data_BM = data_BM.dropna(how="any")

# sort by multiple columns
data_BM.sort_values(by=['Outlet_Establishment_Year', 'Item_Outlet_Sales'], ascending=False)[:5]

- Note that when sorting by multiple columns, `sort_values()` sorts by the first column in the list and uses the second column only to break ties within it.
- We can see the difference by switching the order of the column names in the list.

In [None]:
# changed the order of columns
data_BM.sort_values(by=['Item_Outlet_Sales', 'Outlet_Establishment_Year'], ascending=False, inplace=True)
data_BM[:5]
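
The sort direction can also differ per column: `ascending` accepts a list with one boolean for each entry in `by`. A minimal sketch on made-up data (the `year` and `sales` columns are assumptions for illustration):

```python
import pandas as pd

# toy frame with made-up values
toy = pd.DataFrame({
    'year': [2000, 1990, 2000, 1990],
    'sales': [5, 7, 9, 3],
})

# oldest year first; within each year, highest sales first
mixed = toy.sort_values(by=['year', 'sales'], ascending=[True, False])
print(mixed)
```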

- We can use **sort_index()** to sort a pandas DataFrame by its row index (row labels).
- In this example the row index consists of numbers. Because we sorted the DataFrame by 'Item_Outlet_Sales' and 'Outlet_Establishment_Year' in the previous cell, the row index is now jumbled up.
- Sorting by row index (with the `inplace=True` option) restores the original row order.

In [None]:
# sort by index
data_BM.sort_index(inplace=True)
data_BM[:5]
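
The round trip can be checked on a small made-up frame (the `sales` column and the index labels are assumptions for illustration). The sketch also shows `reset_index(drop=True)`, which is an alternative when you would rather keep the sorted order and simply renumber the rows.

```python
import pandas as pd

# toy frame whose index has gaps, as it would after dropping rows
toy = pd.DataFrame({'sales': [30, 10, 20]}, index=[0, 3, 7])

# sorting by value jumbles the row index
shuffled = toy.sort_values(by='sales')

# sorting by index brings back the original row order
restored = shuffled.sort_index()

# alternative: keep the sorted order but renumber the rows 0, 1, 2
renumbered = shuffled.reset_index(drop=True)

print(shuffled.index.tolist())
print(restored.equals(toy))
```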

### 2. Merging dataframes

- Joining and merging DataFrames is a core step in most data analysis and machine learning tasks.
- It is a skill every Data Analyst or Data Scientist should master, because in almost all cases data comes from multiple sources and files.
- Pandas has two useful functions for combining DataFrames:
    - **concat()**
    - **merge()** 
    
#### Creating dummy data

In [None]:
# create dummy data
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'C': ['C0', 'C1', 'C2', 'C3'],
                     'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])
 

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                     'B': ['B4', 'B5', 'B6', 'B7'],
                     'C': ['C4', 'C5', 'C6', 'C7'],
                     'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])
 

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                     'B': ['B8', 'B9', 'B10', 'B11'],
                     'C': ['C8', 'C9', 'C10', 'C11'],
                     'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])

### a. concat() for combining dataframes
- Suppose you have three DataFrames, df1, df2 and df3, and you want to combine them **"row-wise"** so that they become a single DataFrame, as in the given image:
![](concat_1_a.png)
- You can use **concat()** here. Pass the DataFrames in a list as the argument to `concat()`.

In [None]:
# combine dataframes
result = pd.concat([df1, df2, df3])
result

- pandas also lets you label the rows of each DataFrame with a key during concatenation, so that you know which data came from which DataFrame.
- To do this, pass the additional argument **keys**, a list with one label per DataFrame.

In [None]:
# combine dataframes
result = pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
result

- Specifying the keys also makes it easy to retrieve the data corresponding to a particular DataFrame.
- You can retrieve the data of DataFrame df2, which has the label `y`, by using the `loc` indexer.

In [None]:
# get second dataframe
result.loc['y']
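
Passing `keys` turns the row index into a two-level index, with the key as the outer level. The sketch below shows this on two small made-up frames (the names `first` and `second` are assumptions for illustration) and contrasts it with `ignore_index=True`, which discards the original row labels and renumbers the rows instead.

```python
import pandas as pd

# two toy frames that happen to share the same row labels
first = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
second = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

# keys add an outer index level naming the source frame
labeled = pd.concat([first, second], keys=['x', 'y'])
print(labeled.index.nlevels)

# selecting the outer label returns the rows of that source frame
from_second = labeled.loc['y']

# ignore_index=True drops the old labels and numbers the rows 0..n-1
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())
```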

- When gluing together multiple DataFrames, you have a choice of how to handle the other axes (other than the one being concatenated). This can be done in the following three ways:

    - Take the union of them all, `join='outer'`. This is the default option as it results in zero information loss.
    - Take the intersection, `join='inner'`.
    - Use a specific index by calling `reindex()` on the result. Older pandas versions offered a `join_axes` argument for this, but it was removed in pandas 1.0.

- Here is an example of each of these methods. First, the default `join='outer'` behavior:
![](concat_2.png)

In [None]:
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                        'D': ['D2', 'D3', 'D6', 'D7'],
                        'F': ['F2', 'F3', 'F6', 'F7']},
                       index=[2, 3, 6, 7])
    

result = pd.concat([df1, df4], axis=1, sort=False)
result

- Here is the same thing with `join='inner'`:
![](concat_3.png)

In [None]:
result = pd.concat([df1, df4], axis=1, join='inner')
result

- Lastly, suppose we just wanted to reuse the exact index from the original DataFrame:
![](concat_4.png)

In [None]:
result = pd.concat([df1, df4], axis=1)
result = result.reindex(df1.index)
result

### b. merge() for combining dataframes using SQL like joins

- Another ubiquitous operation related to DataFrames is the merging operation. 
- Two DataFrames might hold different kinds of information about the same entity and be linked by some common feature/column.
- We can use **merge()** to combine such dataframes in pandas.

#### Creating dummy data

In [None]:
# create dummy data
df_a = pd.DataFrame({
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']})

df_b = pd.DataFrame({
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']})

df_c = pd.DataFrame({
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]})

Now these are our dataframes:
![](merge_4.png)

- Let's start with a basic join: we want to combine `df_a` with `df_c` based on the `subject_id` column. When no `how` argument is given, `merge()` performs an inner join.

In [None]:
pd.merge(df_a, df_c, on='subject_id')

- Now that we have done a basic join, let's get into **some common SQL joins.**

#### Merge with outer join

- “Full outer join produces the set of all records in Table A and Table B, with matching records from both sides where available. If there is no match, the missing side will contain null.”
![](merge_8.png)


In [None]:
pd.merge(df_a, df_b, on='subject_id', how='outer')

#### Merge with inner join

- “Inner join produces only the set of records that match in both Table A and Table B.”
![](merge_5.png)

In [None]:
pd.merge(df_a, df_b, on='subject_id', how='inner')

#### Merge with right join

- “Right outer join produces a complete set of records from Table B, with the matching records (where available) in Table A. If there is no match, the left side will contain null.”

![](merge_7.png)

In [None]:
pd.merge(df_a, df_b, on='subject_id', how='right')

#### Merge with left join

- “Left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null.”
![](merge_6.png)

In [None]:
pd.merge(df_a, df_b, on='subject_id', how='left')
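
When both frames contain a non-key column with the same name, as `first_name` and `last_name` do in the joins above, `merge()` keeps both and appends the suffixes `_x` and `_y` by default. The sketch below, on two small made-up frames (`left` and `right` are assumed names for illustration), sets custom suffixes and uses `indicator=True` to record which side each row came from.

```python
import pandas as pd

# toy frames sharing the key column and one overlapping column name
left = pd.DataFrame({'subject_id': ['1', '2', '3'],
                     'first_name': ['Alex', 'Amy', 'Allen']})
right = pd.DataFrame({'subject_id': ['2', '3', '4'],
                      'first_name': ['Billy', 'Brian', 'Bran']})

merged = pd.merge(left, right, on='subject_id', how='outer',
                  suffixes=('_a', '_b'),  # replaces the default _x / _y
                  indicator=True)         # adds a _merge column

# _merge is 'left_only', 'right_only' or 'both' for each row
print(merged)
```

Filtering on the `_merge` column is a quick way to find keys that exist on only one side.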

### Merge OR Concat : Which to use when?

1. After learning both functions in detail, you might be unsure which one to use when.
2. One major difference is that `merge()` combines DataFrames on the basis of values in **common columns**, while `concat()` **appends DataFrames** one below the other (or side by side, depending on whether the `axis` option is set to 0 or 1).
3. The right choice depends on the kind of data you have and the analysis you want to perform.
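
The two functions are often used together. In the sketch below (all frame names and values are made up for illustration), `concat()` stacks two frames that have the same columns, and `merge()` then attaches a column from a third frame by matching on `subject_id`.

```python
import pandas as pd

# two toy frames with the same columns, and one with extra information
people = pd.DataFrame({'subject_id': ['1', '2'], 'name': ['Alex', 'Amy']})
more_people = pd.DataFrame({'subject_id': ['3', '4'], 'name': ['Allen', 'Alice']})
scores = pd.DataFrame({'subject_id': ['1', '2', '3'], 'test_id': [51, 15, 15]})

# concat: same columns, more rows
stacked = pd.concat([people, more_people], ignore_index=True)

# merge: same entities, more columns, matched on the key column
joined = pd.merge(stacked, scores, on='subject_id', how='left')

print(stacked.shape)
print(joined)
```

Subject `4` has no row in `scores`, so its `test_id` is missing after the left join.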

# Binning in Pandas

**When dealing with continuous numeric data, it is often helpful to bin the data into multiple buckets for further analysis. There are several different terms for binning, including bucketing, discrete binning and discretization.
Pandas supports these approaches using the `cut` and `qcut` functions. In this section you will learn how to use `cut` to convert continuous data to a set of discrete buckets.**

In [None]:
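# random ages from 21 to 49; randint excludes the upper bound, so every age falls inside the bins used below
df_ages = pd.DataFrame({'age': np.random.randint(21, 50, 8)})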
df_ages

In [None]:
df_ages['age_bins'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49])
df_ages


In [None]:
df_ages['age_bins'].unique()
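
The intervals produced by `cut` are closed on the right by default, so a bin such as `(20, 29]` excludes 20 and includes 29. Passing `right=False` closes them on the left instead. A minimal sketch, with edge values and bin edges chosen only for illustration:

```python
import pandas as pd

# values sitting exactly on or next to the bin edges
ages = pd.Series([20, 29, 30, 49])
edges = [20, 30, 40, 50]

# default: (20, 30], (30, 40], (40, 50]; 20 falls outside and becomes NaN
right_closed = pd.cut(ages, bins=edges)

# right=False: [20, 30), [30, 40), [40, 50); 30 moves into the second bin
left_closed = pd.cut(ages, bins=edges, right=False)

print(pd.DataFrame({'age': ages,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```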


In [None]:
df_ages['age_by_decade'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49], labels=['20s', '30s', '40s'])

In [None]:
df_ages
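
While `cut` uses bin edges that you choose, the related pandas function `qcut` picks the edges from the data so that each bin holds roughly the same number of rows. The sketch below is an illustration on fixed, made-up ages rather than the random ones above; the labels `Q1` to `Q4` are assumed names.

```python
import pandas as pd

# eight fixed ages so the result is reproducible
ages = pd.Series([21, 25, 30, 34, 38, 42, 45, 49])

# q=4 splits the data at its quartiles, giving four equally sized groups
quartiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

print(quartiles.value_counts().sort_index())
```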