# **Pandas** 🐼

<https://pandas.pydata.org/docs/user_guide/10min.html>

A pandas DataFrame can be easily changed and manipulated. Pandas has helpful functions for handling missing data, performing operations on columns and rows, and transforming data. On top of that, many SQL operations have counterparts in pandas, such as join, merge, filter, and group by.

In [1]:
import pandas as pd

## **NumPy**

NumPy is an open-source Python library that facilitates efficient numerical operations on large quantities of data. There are a few functions that exist in NumPy that we use on pandas DataFrames. For us, the most important part about NumPy is that pandas is built on top of it. So, NumPy is a dependency of Pandas.

NumPy arrays are better suited to numerical work than normal Python lists: every element shares one data type, and operations apply to whole arrays at once. They are called ndarrays since they can have any number (n) of dimensions (d): one-dimensional (a vector), two-dimensional (a matrix), or higher. NumPy arrays allow for fast element access and efficient data manipulation.


In [2]:
import numpy as np

In [3]:
list_1 = [1, 2, 3, 4, 5]
list_2 = [6, 7, 8, 9, 10]
list_3 = [11, 12, 13, 14, 15]
test_arr = np.array([list_1, list_2, list_3])

display(test_arr)

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])
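
As a small sketch of the properties described above, the cell below rebuilds the same values as `test_arr` and inspects its dimensions, shape, and data type. The variable names are illustrative only.

```python
import numpy as np

# Same values as test_arr above, rebuilt so this cell runs on its own.
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])

n_dims = arr.ndim      # number of dimensions
shape = arr.shape      # (rows, columns)
doubled = arr * 2      # arithmetic applies element-wise, no loop needed
second_row = arr[1]    # indexing the first axis returns a 1-D array

# Mixing ints and floats still yields a single dtype for the whole array.
mixed = np.array([1, 2.5, 3])
print(n_dims, shape, mixed.dtype)
```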


## **Series**

Just as the ndarray is the foundation of the NumPy library, the Series is the core object of the pandas library. A pandas Series is very similar to a one-dimensional NumPy array, but it has additional functionality that allows values in the Series to be indexed using labels. A NumPy array does not have the flexibility to do this. This labelling is useful when you are storing pieces of data that have other data associated with them.


In [4]:
test_series = pd.Series(list_2)
test_series = pd.Series(test_arr[1])
test_series

0     6
1     7
2     8
3     9
4    10
dtype: int64
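
The Series above uses the default integer labels. A minimal sketch of label-based indexing, assuming some made-up string labels, might look like this:

```python
import pandas as pd

# Hypothetical labels; any hashable values of matching length would work.
labelled = pd.Series([6, 7, 8, 9, 10], index=["a", "b", "c", "d", "e"])

by_label = labelled["c"]         # label-based lookup
by_position = labelled.iloc[2]   # position-based lookup, like a NumPy array
print(by_label, by_position)
```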

## **DataFrame**

Another important type of object in the pandas library is the DataFrame. This object is similar in form to a matrix as it consists of rows and columns. Both rows and columns can be indexed with integers or string labels. One DataFrame can contain many different data types, but within a column, all values share one data type. A column of a DataFrame is essentially a Series. All columns must have the same number of elements (rows).

The default row indices are 0,1,2..., but these can be changed. For example, they can be set to be the elements in one of the columns of the DataFrame.
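
As a sketch of that last point, `set_index()` turns a column into the row labels. The `stores` table and its column names below are made up for illustration.

```python
import pandas as pd

stores = pd.DataFrame(
    {"store_id": [101, 102, 103], "city": ["San Diego", "Los Angeles", "Sacramento"]}
)

# set_index returns a new DataFrame; the original keeps its 0, 1, 2 index.
by_id = stores.set_index("store_id")
city_102 = by_id.loc[102, "city"]   # look up a row by its new label
print(by_id)
```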


### **Creation** 

You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query. 

#### **Dictionary to DataFrame**

You can pass in a dictionary to pd.DataFrame(). Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error. 

In [5]:
dict_input = {
    "Product ID": [1, 2, 3, 4],
    "Product Name": ["t-shirt", "t-shirt", "skirt", "skirt"],
    "Color": ["blue", "green", "red", "black"],
}

df1 = pd.DataFrame(dict_input)
df1

Unnamed: 0,Product ID,Product Name,Color
0,1,t-shirt,blue
1,2,t-shirt,green
2,3,skirt,red
3,4,skirt,black
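
To see the length requirement in action, the sketch below passes a dictionary whose lists differ in length and catches the resulting error. The exact wording of the message may vary between pandas versions.

```python
import pandas as pd

bad_input = {
    "Product ID": [1, 2, 3, 4],
    "Color": ["blue", "green", "red"],  # one value short
}

try:
    pd.DataFrame(bad_input)
    error_message = None
except ValueError as exc:
    # pandas refuses to build a DataFrame from columns of unequal length
    error_message = str(exc)

print(error_message)
```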


#### **Lists to DataFrame**

You can also add data using lists.

For example, you can pass in a list of lists, where each one represents a row of data. Use the keyword argument columns to pass a list of column names.

In [6]:
list_input = [
    [1, "San Diego", 100],
    [2, "Los Angeles", 120],
    [3, "San Francisco", 90],
    [4, "Sacramento", 115],
]

df2 = pd.DataFrame(list_input, columns=["Store ID", "Location", "Number of Employees"])
df2

Unnamed: 0,Store ID,Location,Number of Employees
0,1,San Diego,100
1,2,Los Angeles,120
2,3,San Francisco,90
3,4,Sacramento,115


#### **External Data**

When you have data in a CSV, you can load it into a DataFrame using `read_csv()`. We can also save data to a CSV, using `to_csv()`.


In [8]:
import os

from src.config import CREDIT_RISK_DATA_DIR

os.chdir(CREDIT_RISK_DATA_DIR)


In [10]:
df = pd.read_csv("credit_risk_dataset.csv")
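
A minimal round trip through `to_csv()` and `read_csv()` could look like the sketch below. It writes to a temporary directory and uses a made-up file name, so it does not depend on the credit risk dataset.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({"Store ID": [1, 2], "Location": ["San Diego", "Los Angeles"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stores.csv")  # hypothetical file name
    # index=False keeps the row labels out of the file; otherwise they come
    # back as an extra "Unnamed: 0" column on the next read.
    original.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored)
```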

#### **Dealing with Multiple Files**

Often, you have the same data separated out into multiple files.

Let’s say that we have a ton of files following the filename structure: `'file1.csv'`, `'file2.csv'`, `'file3.csv'`, and so on. The power of pandas is mainly in being able to manipulate large amounts of structured data. We want to be able to get all of the relevant information into one table so that we can analyze the aggregate data.

We can combine `glob`, a module in Python's standard library for finding files by name, with `pandas` to organize this data better. `glob` does not open the files itself; it uses shell-style wildcard matching to return the matching filenames, which we can then read one by one:

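```python
import glob

import pandas as pd

# glob returns matching paths in arbitrary order, so sort for reproducibility
files = sorted(glob.glob("file*.csv"))
print(files)

# Read each file into its own DataFrame
df_list = []
for filename in files:
    data = pd.read_csv(filename)
    df_list.append(data)

# Stack the DataFrames; ignore_index=True gives a fresh 0..n-1 row index
df = pd.concat(df_list, ignore_index=True)
```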
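
When combining many files it can help to remember where each row came from. The sketch below is one possible approach, not part of the original workflow: it creates two tiny made-up files in a temporary directory and adds a `source` column (a name chosen here for illustration) with `assign()`.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small hypothetical files so the sketch is self-contained.
    for i in (1, 2):
        pd.DataFrame({"value": [i, i * 10]}).to_csv(
            os.path.join(tmp_dir, f"file{i}.csv"), index=False
        )

    files = sorted(glob.glob(os.path.join(tmp_dir, "file*.csv")))
    # assign() adds a column recording which file each row came from.
    frames = [pd.read_csv(f).assign(source=os.path.basename(f)) for f in files]

combined = pd.concat(frames, ignore_index=True)
print(combined)
```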

### **Displaying**

When we analyze data in pandas, we often need to look at the contents of a DataFrame. Both `print()` and `display()` can show a DataFrame, but they behave differently: `print()` writes plain text to standard output, while `display()` produces rich output in interactive environments.

`print()` is the standard Python function for writing to the console or terminal. For a DataFrame, it prints the string representation of the table. Its main drawback is that large DataFrames are truncated: the middle rows are left out so that the output fits in the console.

`display()` is part of the IPython library and is particularly useful in Jupyter notebooks, where it renders the DataFrame as a formatted HTML table. Large DataFrames are truncated here too, with only the first and last few rows shown, but `pd.set_option()` can raise limits such as `display.max_rows` so that more rows are shown.
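
As a sketch of how the row limit works, the cell below changes `display.max_rows` temporarily with `pd.option_context()`; `pd.set_option("display.max_rows", ...)` would make the same change for the rest of the session. It uses `repr()` so that it runs outside a notebook as well, and `long_df` is a made-up example.

```python
import pandas as pd

long_df = pd.DataFrame({"n": range(100)})

# option_context applies the setting only inside the with-block.
with pd.option_context("display.max_rows", 6):
    truncated = repr(long_df)

# None removes the row limit, so every row is rendered.
with pd.option_context("display.max_rows", None):
    full = repr(long_df)

print(truncated)
```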

In [7]:
from IPython.display import display
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"C": [5, 6], "D": [7, 8]})
display(df)  # Shows a nicely formatted HTML table in Colab

Unnamed: 0,A,B
0,1,3
1,2,4


In practice, we often need to inspect more than one DataFrame. With `print()` we typically write a separate call for each DataFrame, whereas `display()` accepts several DataFrames as arguments and renders them all in one call.

In [14]:
from IPython.display import display
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"X": [5, 6], "Y": [7, 8]})
df3 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

display(df1, df2, df3)

print(df1)
print(df2)
print(df3)

Unnamed: 0,A,B
0,1,3
1,2,4


Unnamed: 0,X,Y
0,5,7
1,6,8


Unnamed: 0,A,B,C
0,1,4,7
1,2,5,8
2,3,6,9


   A  B
0  1  3
1  2  4
   X  Y
0  5  7
1  6  8
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9



### **Inspection**

The method `head()` returns the first 5 rows of a DataFrame. To see a different number of rows, pass the argument `n`; for example, `df.head(10)` shows the first 10 rows. The method `info()` summarizes each column: its name, its count of non-null values, and its data type.



In [11]:
df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB


In [17]:
df.columns

Index(['person_age', 'person_income', 'person_home_ownership',
       'person_emp_length', 'loan_intent', 'loan_grade', 'loan_amnt',
       'loan_int_rate', 'loan_status', 'loan_percent_income',
       'cb_person_default_on_file', 'cb_person_cred_hist_length'],
      dtype='object')

In [14]:
df.dtypes

person_age                      int64
person_income                   int64
person_home_ownership          object
person_emp_length             float64
loan_intent                    object
loan_grade                     object
loan_amnt                       int64
loan_int_rate                 float64
loan_status                     int64
loan_percent_income           float64
cb_person_default_on_file      object
cb_person_cred_hist_length      int64
dtype: object

In [16]:
df.shape

(32581, 12)

The `size` attribute in a pandas DataFrame returns the total number of elements in the DataFrame. This is equivalent to the number of rows multiplied by the number of columns.

In [15]:
df.size

390972
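
A quick check of that relationship on a small made-up DataFrame, also showing `head(n)` and its counterpart `tail(n)`:

```python
import pandas as pd

small = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

n_rows, n_cols = small.shape
first_two = small.head(2)   # head takes the number of rows to return
last_row = small.tail(1)    # tail does the same from the end of the table

print(n_rows * n_cols == small.size)
```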

### **Manipulation**

#### **Selecting Columns**

There are two possible syntaxes for selecting all values from a column:

1. Select the column as if you were selecting a value from a dictionary using a key. In our example, we would type `df['person_age']` to select the ages.
2. If the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.), then you can select it using attribute notation: `df.person_age`.

In [18]:
df.person_age

0        22
1        21
2        25
3        23
4        24
         ..
32576    57
32577    54
32578    65
32579    56
32580    66
Name: person_age, Length: 32581, dtype: int64
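
Attribute notation has limits beyond spaces and special characters. The sketch below uses a made-up `products` table to show two cases where bracket notation is the safer choice: a column name containing a space, and a column name that collides with an existing DataFrame method.

```python
import pandas as pd

products = pd.DataFrame({"Product Name": ["t-shirt", "skirt"], "count": [3, 5]})

names = products["Product Name"]   # brackets work for any column name

# "count" is also a DataFrame method, so attribute access returns the method
# rather than the column; bracket access is unambiguous.
counts = products["count"]
is_method = callable(products.count)
print(names.tolist(), counts.tolist(), is_method)
```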

To select two or more columns from a DataFrame, we use a list of the column names. Make sure that you have a double set of brackets `[[]]`, or this command won’t work!

In [19]:
subset = df[["person_age", "loan_grade"]]
subset.head()

Unnamed: 0,person_age,loan_grade
0,22,D
1,21,B
2,25,C
3,23,C
4,24,C
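
The brackets also decide what type comes back. In this sketch on a tiny made-up frame, single brackets return a Series, while a list of names (even a list with one name) returns a DataFrame. Without the inner list, pandas would treat `frame["person_age", "loan_grade"]` as a single tuple key and would typically raise a `KeyError`.

```python
import pandas as pd

frame = pd.DataFrame({"person_age": [22, 21], "loan_grade": ["D", "B"]})

as_series = frame["person_age"]     # single brackets: a Series
as_frame = frame[["person_age"]]    # list of names: a one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```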


#### **Selecting Rows**

DataFrames are zero-indexed, meaning that we start with the 0th row and count up from there. When we select a single row, the result is a Series (just like when we select a single column).

In [20]:
df.iloc[2]

person_age                          25
person_income                     9600
person_home_ownership         MORTGAGE
person_emp_length                  1.0
loan_intent                    MEDICAL
loan_grade                           C
loan_amnt                         5500
loan_int_rate                    12.87
loan_status                          1
loan_percent_income               0.57
cb_person_default_on_file            N
cb_person_cred_hist_length           3
Name: 2, dtype: object

Here are some different ways of selecting multiple rows:

- `df.iloc[3:7]` selects the rows at positions 3 through 6; the slice starts at position 3 and stops before position 7
- `df.iloc[:4]` selects all rows up to, but not including, position 4 (i.e., positions 0, 1, 2, and 3)
- `df.iloc[-3:]` selects the last three rows, from the 3rd-to-last row through the final row


In [21]:
df.iloc[2:5]

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
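
The slicing rules above are easier to verify on a small table. This sketch uses a made-up eight-row DataFrame of letters so that each selected position is easy to read off.

```python
import pandas as pd

letters = pd.DataFrame({"letter": list("abcdefgh")})

middle = letters.iloc[3:7]       # positions 3, 4, 5, 6
first_four = letters.iloc[:4]    # positions 0, 1, 2, 3
last_three = letters.iloc[-3:]   # the final three rows

print(middle["letter"].tolist())
```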


#### **Logical Subsets**

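You can select a subset of a DataFrame by using logical statements. Comparison operators such as `==` (exactly equal), `!=`, `>`, and `<` compare every value in a column and produce a Series of `True`/`False` values; passing that Series to `df[...]` keeps only the rows where it is `True`.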


In [22]:
df[df.person_age > 30].head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
81,144,250000,RENT,4.0,VENTURE,C,4800,13.57,0,0.02,N,3
183,144,200000,MORTGAGE,4.0,EDUCATION,B,6000,11.86,0,0.03,N,2
575,123,80004,RENT,2.0,EDUCATION,B,20400,10.25,0,0.25,N,3
747,123,78000,RENT,7.0,VENTURE,B,20000,,0,0.26,N,4
17833,32,1200000,MORTGAGE,1.0,VENTURE,A,12000,7.51,0,0.01,N,8


You can also combine multiple logical statements, as long as each statement is in parentheses. When combining pandas conditions, `|` means “or” and `&` means “and”; Python's `or` and `and` keywords do not work element-wise on a Series.

In [23]:
df[(df.person_age > 30) & (df.loan_intent == "VENTURE")].head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
81,144,250000,RENT,4.0,VENTURE,C,4800,13.57,0,0.02,N,3
747,123,78000,RENT,7.0,VENTURE,B,20000,,0,0.26,N,4
17833,32,1200000,MORTGAGE,1.0,VENTURE,A,12000,7.51,0,0.01,N,8
17850,34,120000,RENT,17.0,VENTURE,B,35000,10.59,0,0.29,N,6
17869,33,350000,MORTGAGE,0.0,VENTURE,C,10000,14.65,0,0.03,Y,10
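
The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python. One way to sidestep the issue is to name each condition first, as in this sketch on a small made-up `loans` table, which also shows `|` and the negation operator `~`.

```python
import pandas as pd

loans = pd.DataFrame(
    {
        "person_age": [22, 35, 41, 29],
        "loan_intent": ["VENTURE", "MEDICAL", "VENTURE", "PERSONAL"],
    }
)

older = loans["person_age"] > 30
venture = loans["loan_intent"] == "VENTURE"

either = loans[older | venture]   # rows matching at least one condition
both = loans[older & venture]     # rows matching both conditions
not_venture = loans[~venture]     # ~ negates a boolean mask

print(either.index.tolist(), both.index.tolist(), not_venture.index.tolist())
```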


We can use the `isin` method to check whether each value in a column is one of a list of values.

In [24]:
df[df.loan_grade.isin(["C", "B"])].head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
6,26,77100,RENT,8.0,EDUCATION,B,35000,12.42,1,0.45,N,3


#### **Setting Indices**

When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. This is inelegant, and because the labels no longer match row positions, it is easy to confuse label-based selection (`.loc[]`) with position-based selection (`.iloc[]`). We can fix this using the method `reset_index()`. By default, the old indices are moved into a new column called 'index'. Unless you need those values for something special, it’s probably better to use the keyword `drop=True`. If we use the keyword `inplace=True`, we modify our existing DataFrame instead of getting a new one. In recent pandas versions, you can also choose the name of the column that receives the old index by passing it to `names`.

In [26]:
df.reset_index(names="id")

Unnamed: 0,id,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.10,N,2
2,2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
32576,32576,57,53000,MORTGAGE,1.0,PERSONAL,C,5800,13.16,0,0.11,N,30
32577,32577,54,120000,MORTGAGE,4.0,PERSONAL,A,17625,7.49,0,0.15,N,19
32578,32578,65,76000,RENT,3.0,HOMEIMPROVEMENT,B,35000,10.99,1,0.46,N,28
32579,32579,56,150000,MORTGAGE,5.0,PERSONAL,B,15000,11.48,0,0.10,N,26
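
The cell above resets an index that was already consecutive. A minimal sketch of the filtered case, using a small made-up `loans` table, shows the difference between the default behavior and `drop=True`:

```python
import pandas as pd

loans = pd.DataFrame({"person_age": [22, 35, 41, 29]})

filtered = loans[loans["person_age"] > 30]   # keeps original labels 1 and 2
kept = filtered.reset_index()                # old labels move to an "index" column
dropped = filtered.reset_index(drop=True)    # old labels are discarded

print(filtered.index.tolist(), kept.columns.tolist(), dropped.index.tolist())
```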


#### **Adding Columns**

Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.

One way that we can add a new column is by giving a list of the same length as the existing DataFrame.

In [27]:
df["List Column"] = list(range(1, len(df) + 1))

We can also add a new column that is the same for all rows in the DataFrame. 

In [28]:
df["Constant Column"] = True

Finally, you can add a new column by performing a function on the existing columns.

In [30]:
df["Function Column"] = df.person_age * 0.075

#### **Column Operations**

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition. We can use the `apply` function to apply a function to every value in a particular column.

In [33]:
df["Lowercase"] = df.loan_intent.apply(str.lower)

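A lambda function is a way of defining a small anonymous function in a single expression. A lambda can be passed directly to a method such as `apply`, or assigned to a variable for reuse. Lambdas can include conditional logic by using a one-line form of the if statement. For comparison, here is a regular function with an if/else: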

In [34]:
def myfunction(x):
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x

Below is a lambda function that does the same thing:

In [35]:
myfunction = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x

In general, the syntax for a conditional expression in a lambda function is:

```python
lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]
```

In Pandas, we often use lambda functions to perform complex operations on columns. 

In [36]:
df["Lambda Column"] = df.person_age.apply(myfunction)

#### **Row Operations**

We can also operate on multiple columns at once. If we use `apply` without specifying a single column and add the argument `axis=1`, the input to our lambda function will be an entire row, not a column. To access particular values of the row, we use the syntax `row.column_name` or `row['column_name']`.

In [37]:
df["Row Operation Column"] = df.apply(
    lambda row: (
        row["person_age"] + row["person_income"]
        if row["person_age"] > 60
        else row["person_income"] * 100
    ),
    axis=1,
)

Both `np.where()` and `df.apply()` can be used to create new columns based on conditions. However, `np.where()` is almost always faster, because it operates on whole columns at once instead of calling a Python function for every row.

In [38]:
df["Row Operation Column"] = np.where(
    df["person_age"] > 60, df["person_age"] + df["person_income"], df["person_income"] * 100
)
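
To check that the two approaches agree, the sketch below applies both to a two-row made-up `people` table and compares the results. It says nothing about speed; timing the two on a large DataFrame would be the way to confirm the performance difference.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({"person_age": [25, 65], "person_income": [1000, 2000]})

# Row-by-row version: the lambda is called once per row.
via_apply = people.apply(
    lambda row: (
        row["person_age"] + row["person_income"]
        if row["person_age"] > 60
        else row["person_income"] * 100
    ),
    axis=1,
)

# Vectorized version: the condition and both outcomes are whole columns.
via_where = np.where(
    people["person_age"] > 60,
    people["person_age"] + people["person_income"],
    people["person_income"] * 100,
)

print(via_apply.tolist(), via_where.tolist())
```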

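If you have more than two conditions, use `np.select()` instead of `np.where()`. Conditions are checked in order and each row takes the choice of the first condition that is true, so the order of the list matters. In the cell below, every age that is not above 60 already matches `person_age <= 60`, so the third condition (and, for this data, the `default`) is never used: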

In [39]:
conditions = [df["person_age"] > 60, df["person_age"] <= 60, df["person_age"] <= 40]

choices = [
    df["person_age"] + df["person_income"],
    df["person_income"] * 100,
    df["person_income"] * 50,
]

df["Multiple Conditional Column"] = np.select(
    conditions, choices, default=df["person_income"] * 145
)
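
Because `np.select()` takes the first condition that is true, overlapping conditions should usually be listed from most specific to least specific. A minimal sketch with made-up age bands:

```python
import numpy as np
import pandas as pd

ages = pd.Series([30, 50, 70])

# Tightest bound first: a 30-year-old matches both conditions,
# and the first match is the one that is used.
conditions = [ages <= 40, ages <= 60]
choices = ["40 or under", "41 to 60"]

# default covers every row that matches none of the conditions.
bands = np.select(conditions, choices, default="over 60")
print(bands.tolist())
```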

#### **Renaming Columns**

When we get our data from other sources, we often want to change the column names. For example, we might want all of the column names to follow variable name rules, so that we can use `df.column_name` (which tab-completes) rather than `df['column_name']` (which takes up extra space).

You can change all of the column names at once by setting the `.columns` property to a different list. This is great when you need to change all of the column names at once, but be careful! You can easily mislabel columns if you get the ordering wrong. 

In [40]:
new_column_names = [col.lower() for col in df.columns]
df.columns = new_column_names
df.columns

Index(['person_age', 'person_income', 'person_home_ownership',
       'person_emp_length', 'loan_intent', 'loan_grade', 'loan_amnt',
       'loan_int_rate', 'loan_status', 'loan_percent_income',
       'cb_person_default_on_file', 'cb_person_cred_hist_length',
       'list column', 'constant column', 'function column', 'lowercase',
       'lambda column', 'row operation column', 'multiple conditional column'],
      dtype='object')

You also can rename individual columns by using the `.rename` method. Pass a dictionary like the one below to the `columns` keyword argument:

In [41]:
df.rename(columns={"person_age": "age", "person_income": "income"}, inplace=True)
df.columns

Index(['age', 'income', 'person_home_ownership', 'person_emp_length',
       'loan_intent', 'loan_grade', 'loan_amnt', 'loan_int_rate',
       'loan_status', 'loan_percent_income', 'cb_person_default_on_file',
       'cb_person_cred_hist_length', 'list column', 'constant column',
       'function column', 'lowercase', 'lambda column', 'row operation column',
       'multiple conditional column'],
      dtype='object')

Using `rename` with only the `columns` keyword creates a new `DataFrame`, leaving your original `DataFrame` unchanged. That’s why we also passed in the keyword argument `inplace=True`, which edits the original DataFrame. There are several reasons why `.rename` is preferable to `.columns`:
- You can rename just one column
- You can be specific about which column names are getting changed (with `.columns` you can accidentally switch column names if you’re not careful)
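
A short sketch of the behavior without `inplace=True`, using a one-row made-up frame: `rename` returns a new DataFrame and leaves the original alone. Reassigning the result (`frame = frame.rename(...)`) is a common alternative to `inplace=True`.

```python
import pandas as pd

frame = pd.DataFrame({"person_age": [22], "person_income": [59000]})

# Only the listed column is renamed; the result is a new DataFrame.
renamed = frame.rename(columns={"person_age": "age"})

print(list(frame.columns), list(renamed.columns))
```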

### **Aggregation**

#### **Group By**

In general, we use the following syntax to calculate aggregates:

```python
df.groupby('column1').column2.measurement()
```

In [42]:
df.groupby("loan_intent").age.mean()

loan_intent
DEBTCONSOLIDATION    27.606293
EDUCATION            26.588099
HOMEIMPROVEMENT      29.066574
MEDICAL              27.998023
PERSONAL             28.208477
VENTURE              27.568456
Name: age, dtype: float64

As the output above shows, selecting one column from a `groupby` and aggregating it produces a Series, not a DataFrame, with the group names as its index. Usually, we’d prefer that those indices were actually a column. In order to get that, we can use `reset_index()`. This will transform our Series into a DataFrame and move the indices into their own column.

In [43]:
df.groupby("loan_intent").age.mean().reset_index()

Unnamed: 0,loan_intent,age
0,DEBTCONSOLIDATION,27.606293
1,EDUCATION,26.588099
2,HOMEIMPROVEMENT,29.066574
3,MEDICAL,27.998023
4,PERSONAL,28.208477
5,VENTURE,27.568456
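
An alternative to calling `reset_index()` afterwards is to pass `as_index=False` to `groupby`, which keeps the grouping column as a regular column. The sketch below compares the two on a small made-up `loans` table.

```python
import pandas as pd

loans = pd.DataFrame(
    {"loan_intent": ["MEDICAL", "MEDICAL", "VENTURE"], "age": [20, 30, 40]}
)

via_reset = loans.groupby("loan_intent")["age"].mean().reset_index()
via_flag = loans.groupby("loan_intent", as_index=False)["age"].mean()

print(via_flag)
```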


Sometimes, the operation that you want to perform is more complicated than `mean` or `count`. In those cases, you can use the `apply` method and `lambda` functions, just like we did for individual column operations. Note that the input to our `lambda` function will be a Series containing the values of one group.

In [45]:
# np.percentile can calculate any percentile over an array of values
df.groupby("loan_intent").income.apply(lambda x: np.percentile(x, 75)).reset_index()

Unnamed: 0,loan_intent,income
0,DEBTCONSOLIDATION,79000.0
1,EDUCATION,77533.0
2,HOMEIMPROVEMENT,90000.0
3,MEDICAL,72000.0
4,PERSONAL,80000.0
5,VENTURE,80000.0


Sometimes, we want to group by more than one column. We can easily do this by passing a list of column names into the `groupby` method.

In [46]:
df.groupby(["loan_intent", "loan_status"]).age.mean().reset_index()

Unnamed: 0,loan_intent,loan_status,age
0,DEBTCONSOLIDATION,0,27.565019
1,DEBTCONSOLIDATION,1,27.709396
2,EDUCATION,0,26.470797
3,EDUCATION,1,27.152115
4,HOMEIMPROVEMENT,0,29.559309
5,HOMEIMPROVEMENT,1,27.671626
6,MEDICAL,0,28.079326
7,MEDICAL,1,27.77483
8,PERSONAL,0,28.41872
9,PERSONAL,1,27.361566


We can also perform multiple aggregations on a single column.

In [48]:
df.groupby("loan_intent").income.agg(["mean", "median", "std"]).reset_index()

Unnamed: 0,loan_intent,mean,median,std
0,DEBTCONSOLIDATION,66470.876247,55000.0,54214.875503
1,EDUCATION,64135.199132,55000.0,46184.516205
2,HOMEIMPROVEMENT,73549.470458,64000.0,50449.301212
3,MEDICAL,61437.227145,50000.0,51180.281187
4,PERSONAL,67864.141279,55000.0,98095.68179
5,VENTURE,66386.574576,55000.0,55361.708449


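Finally, you can perform multiple aggregations on multiple columns at once by passing a dictionary that maps each column to a list of aggregations. The result has two-level column names.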

In [49]:
df.groupby(["loan_intent", "loan_status"]).agg(
    {"age": ["mean", "median"], "income": ["mean", "median"]}
).reset_index()

Unnamed: 0_level_0,loan_intent,loan_status,age,age,income,income
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,mean,median,mean,median
0,DEBTCONSOLIDATION,0,27.565019,26.0,71588.914293,60000.0
1,DEBTCONSOLIDATION,1,27.709396,26.0,53686.085906,44000.0
2,EDUCATION,0,26.470797,24.0,67745.394796,59178.0
3,EDUCATION,1,27.152115,25.0,46776.364536,39912.0
4,HOMEIMPROVEMENT,0,29.559309,28.0,82085.33521,70278.0
5,HOMEIMPROVEMENT,1,27.671626,26.0,49384.174283,40574.0
6,MEDICAL,0,28.079326,26.0,65422.364494,54000.0
7,MEDICAL,1,27.77483,26.0,50497.152375,44000.0
8,PERSONAL,0,28.41872,26.0,73055.192177,60000.0
9,PERSONAL,1,27.361566,26.0,46953.37796,39000.0
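
Two-level column names can be awkward to work with. One option, sketched here on a small made-up `loans` table, is named aggregation: each keyword passed to `agg` becomes a flat output column name, paired with a `(column, function)` tuple. The output names below are chosen for illustration.

```python
import pandas as pd

loans = pd.DataFrame(
    {
        "loan_intent": ["MEDICAL", "MEDICAL", "VENTURE"],
        "age": [20, 30, 40],
        "income": [100, 300, 500],
    }
)

summary = (
    loans.groupby("loan_intent")
    .agg(age_mean=("age", "mean"), income_median=("income", "median"))
    .reset_index()
)
print(summary)
```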


#### **Pivot Tables**

When we perform a groupby across multiple columns, we often want to change how our data is stored. Reorganizing a table in this way is called pivoting. The new table is called a pivot table.

```python
df.pivot(columns='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')
```

Just like with groupby, the output of a pivot command is a new DataFrame, but the values of the `index` column become its row labels rather than a regular column, so we often follow up with `.reset_index()`.

In [50]:
age_mean_df = df.groupby(["loan_intent", "loan_status"]).age.mean().reset_index()

age_mean_pivot = age_mean_df.pivot(index="loan_intent", columns="loan_status", values="age")
age_mean_pivot

loan_status,0,1
loan_intent,Unnamed: 1_level_1,Unnamed: 2_level_1
DEBTCONSOLIDATION,27.565019,27.709396
EDUCATION,26.470797,27.152115
HOMEIMPROVEMENT,29.559309,27.671626
MEDICAL,28.079326,27.77483
PERSONAL,28.41872,27.361566
VENTURE,27.695402,26.838253


Alternatively, the grouping can be performed in the pivot table itself: `pivot_table()` takes an `aggfunc` argument and aggregates the rows that fall into each cell.

In [52]:
df.pivot_table(index="loan_intent", columns="loan_status", values="age", aggfunc="max")

loan_status,0,1
loan_intent,Unnamed: 1_level_1,Unnamed: 2_level_1
DEBTCONSOLIDATION,60,70
EDUCATION,144,66
HOMEIMPROVEMENT,64,65
MEDICAL,94,70
PERSONAL,144,61
VENTURE,144,60
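
The practical difference between the two is how they treat duplicates. In this sketch on a made-up `loans` table, two rows share the same `loan_intent` and `loan_status`, so `pivot` raises an error while `pivot_table` aggregates them (using the mean when no `aggfunc` is given).

```python
import pandas as pd

loans = pd.DataFrame(
    {
        "loan_intent": ["MEDICAL", "MEDICAL", "VENTURE"],
        "loan_status": [0, 0, 1],
        "age": [20, 30, 40],
    }
)

# Two rows share (MEDICAL, 0), so pivot cannot place them in one cell.
try:
    loans.pivot(index="loan_intent", columns="loan_status", values="age")
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)

# pivot_table aggregates the duplicates instead; empty cells become NaN.
table = loans.pivot_table(index="loan_intent", columns="loan_status", values="age")
print(table)
```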


### **Multiple Tables**

#### **Merging**

The `.merge()` method looks for columns that are common between two DataFrames and then looks for rows where those columns’ values are the same. It then combines the matching rows into a single row in a new table.

In [54]:
pd.merge(
    df, age_mean_df, on=["loan_intent", "loan_status"], how="left", suffixes=("", "_mean")
).head()

Unnamed: 0,age,income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,list column,constant column,function column,lowercase,lambda column,row operation column,multiple conditional column,age_mean
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3,1,True,1.65,personal,22.0,5900000,5900000,27.361566
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2,2,True,1.575,education,21.0,960000,960000,26.470797
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3,3,True,1.875,medical,25.0,960000,960000,27.77483
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2,4,True,1.725,medical,23.0,6550000,6550000,27.77483
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4,5,True,1.8,medical,24.0,5440000,5440000,27.77483
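
The `how` argument controls what happens to rows without a match. The sketch below uses two small made-up tables (`loans` and `rates`) to contrast `how="left"`, which keeps every row of the left table, with `how="inner"`, which keeps only matching rows.

```python
import pandas as pd

loans = pd.DataFrame(
    {"loan_intent": ["MEDICAL", "VENTURE", "PERSONAL"], "amount": [1, 2, 3]}
)
rates = pd.DataFrame({"loan_intent": ["MEDICAL", "VENTURE"], "rate": [0.1, 0.2]})

left = pd.merge(loans, rates, on="loan_intent", how="left")    # keeps all 3 loans
inner = pd.merge(loans, rates, on="loan_intent", how="inner")  # keeps matches only

print(left)
```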


In addition to using `pd.merge()`, each DataFrame has its own `.merge()` method. We generally use this when we are joining more than two DataFrames together because we can “chain” the commands. 

```python
df.merge(age_mean_df).merge(age_mean_df)
```

#### **Specifying Join Columns**

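Sometimes two tables share a column name that means something different in each. In the examples below, an `orders` table and a `customers` table each have an `id` column (an order ID and a customer ID, respectively), while `orders` refers to customers through `customer_id`. A default merge would match on `id` and give wrong results. One way that we could address this problem is to use `.rename()` to rename the columns for our merges.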

```python
pd.merge(
    orders,
    customers.rename(columns={'id': 'customer_id'})
    )
```

If we don’t want to do that, we have another option. We could use the keywords `left_on` and `right_on` to specify which columns we want to perform the merge on.

```python
pd.merge(
    orders,
    customers,
    left_on='customer_id',
    right_on='id')
```

If we use this syntax, we’ll end up with two columns called `id`, one from the first table and one from the second. To tell them apart, pandas appends suffixes and names them `id_x` and `id_y`.

The new column names `id_x` and `id_y` aren’t very helpful for us when we read the table. We can help make them more useful by using the keyword `suffixes`. We can provide a list of suffixes to use instead of `_x` and `_y`.

```python
pd.merge(
    orders,
    customers,
    left_on='customer_id',
    right_on='id',
    suffixes=['_order', '_customer']
)
```
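
To see the resulting column names, here is a runnable sketch with small made-up `orders` and `customers` tables shaped like the ones assumed above:

```python
import pandas as pd

# Hypothetical tables: both have an "id" column with different meanings.
orders = pd.DataFrame({"id": [10, 11], "customer_id": [1, 2]})
customers = pd.DataFrame({"id": [1, 2], "name": ["Ana", "Ben"]})

merged = pd.merge(
    orders,
    customers,
    left_on="customer_id",
    right_on="id",
    suffixes=["_order", "_customer"],
)
print(list(merged.columns))
```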

#### **Concatenation**

Sometimes, a dataset is broken into multiple tables. For instance, data is often split into multiple CSV files so that each download is smaller.

When we need to reconstruct a single DataFrame from multiple smaller DataFrames, we can use the function `pd.concat([df1, df2, df3, ...])`. This works best when all of the DataFrames have the same columns; columns missing from one DataFrame are filled with `NaN` in its rows.

```python
# Concatenate the two menus to form a new menu
menu = pd.concat([bakery, ice_cream])
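```

As a sketch of what happens when the columns do not match exactly, the made-up `ice_cream` table below has an extra `scoops` column. `ignore_index=True` is used so that the combined table gets a fresh row index.

```python
import pandas as pd

# Hypothetical menus; ice_cream has one column that bakery lacks.
bakery = pd.DataFrame({"item": ["croissant"], "price": [3.0]})
ice_cream = pd.DataFrame({"item": ["vanilla"], "price": [2.5], "scoops": [2]})

# The missing "scoops" value for the bakery row becomes NaN.
menu = pd.concat([bakery, ice_cream], ignore_index=True)
print(menu)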
```