# Vehicle Loan Prediction Machine Learning Model

# Chapter 3 - Exploratory Data Analysis

## Lesson 1 - Introduction to EDA



### First things first

Remember to load the libraries and import the cleaned data we created last time.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [None]:
loan_df = pd.read_csv('../data/vehicle_loans_clean.csv', index_col='UNIQUEID')

Use the [df.info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method to remind ourselves which variables we are dealing with.

In [None]:
loan_df.info()

- We still have 40 columns
- Our new columns; ‘AGE’, ‘DISBURSAL_MONTH’, ‘AVERAGE_ACCT_AGE_MONTHS’ and ‘CREDIT_HISTORY_LENGTH_MONTHS’ are shown at the bottom of the list
- Start to think about how different columns might be related to both the target variable and each other

### Unique Values 

- A good starting point for exploratory analysis is to look at the number of unique values in each column
- Pandas [df.nunique](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html) provides a quick and easy way to count column-wise unique values

In [None]:
loan_df.nunique()

Do you notice anything interesting?
- MOBILENO_AVL_FLAG has only one unique value!

Let's look in more detail to be sure 

In [None]:
loan_df['MOBILENO_AVL_FLAG'].value_counts()

- Every row contains the value 1 
- It has no predictive value so we can drop it

In [None]:
loan_df = loan_df.drop(['MOBILENO_AVL_FLAG'], axis = 1)
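The same check can be automated rather than read off the nunique output by eye. The following is a minimal sketch on a small made-up DataFrame (`toy_df` and its values are illustrative assumptions, not the real loan data) that finds every column with a single unique value and drops them in one call.

```python
import pandas as pd

# Toy stand-in for loan_df; the values are invented for illustration
toy_df = pd.DataFrame({
    "MOBILENO_AVL_FLAG": [1, 1, 1, 1],
    "AADHAR_FLAG": [1, 0, 1, 1],
    "AGE": [25, 41, 33, 52],
})

# nunique() returns a Series indexed by column name
unique_counts = toy_df.nunique()

# Keep only the names of columns that have exactly one unique value
constant_cols = unique_counts[unique_counts == 1].index.tolist()
print(constant_cols)

# Drop all constant columns at once
toy_df = toy_df.drop(columns=constant_cols)
print(toy_df.columns.tolist())
```

Note that nunique ignores missing values by default, so a column that mixes one value with NaN would also be reported as having a single unique value. It is worth confirming with value_counts before dropping, as we did above.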

## Lesson 2 -  What's in the IDs?

Since they are near the top of our list of columns, let's take a look at the 6 Id fields.

- BRANCH_ID: Branch where the loan was disbursed
- SUPPLIER_ID: Vehicle Dealer where the loan was disbursed 
- MANUFACTURER_ID: Vehicle manufacturer(Hero, Honda, TVS etc.)
- CURRENT_PINCODE_ID: Current pincode of the customer
- STATE_ID: State of disbursement
- EMPLOYEE_CODE_ID: Employee of the organization who logged the disbursement

In [None]:
loan_df[['SUPPLIER_ID', 'CURRENT_PINCODE_ID', 'EMPLOYEE_CODE_ID', 'BRANCH_ID', 'STATE_ID', 'MANUFACTURER_ID']].sample(10)

These six fields contain numeric data, but really they represent categorical, unordered information. For example, we cannot say things like manufacturer id 1 < 2, or state id 1 = 3 - 2.

Id fields with large numbers of unique values will introduce complexity into our predictive model. Therefore, we will drop them from the dataset. 

In [None]:
loan_df = loan_df.drop(['SUPPLIER_ID', 'CURRENT_PINCODE_ID', 'EMPLOYEE_CODE_ID', 'BRANCH_ID'], axis=1)
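To see which Id fields have a large number of unique values, we can count them per column. This is a rough sketch on invented data; `toy_df` and the `max_unique` cut-off are assumptions made for illustration, and a sensible cut-off for the real data is a judgment call.

```python
import pandas as pd

# Toy stand-in for the Id columns; values are invented for illustration
toy_df = pd.DataFrame({
    "SUPPLIER_ID": [101, 102, 103, 104, 105, 106],
    "STATE_ID": [1, 1, 2, 2, 3, 3],
    "MANUFACTURER_ID": [45, 45, 86, 86, 45, 86],
})

id_cols = ["SUPPLIER_ID", "STATE_ID", "MANUFACTURER_ID"]

# Number of unique values per Id column, largest first
cardinality = toy_df[id_cols].nunique().sort_values(ascending=False)
print(cardinality)

# Hypothetical cut-off: anything above it is treated as high cardinality
max_unique = 3
high_cardinality = cardinality[cardinality > max_unique].index.tolist()
print(high_cardinality)
```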

### A Closer Look 

### EXERCISE 

- Pick one of the two remaining Id columns and write some code to investigate its unique values
- HINT: We did this with ‘LOAN_DEFAULT’ in chapter 2


### SOLUTION

- use [value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) to get manufacturer frequencies 
- include the normalize parameter to look at the percentages 
- plot using [countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html)

In [None]:
print(loan_df['MANUFACTURER_ID'].value_counts())
print(loan_df['MANUFACTURER_ID'].value_counts(normalize=True))
sns.countplot(x="MANUFACTURER_ID", data=loan_df)
plt.show()

### Dig a little deeper

- It is important to understand how a particular variable is spread 
- However, we are really interested in its relationship to the target variable!

### Group By

- Pandas provides a very useful [df.groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function which can be used to group DataFrame rows according to a single column or group of columns
- Similar to the GROUP BY statement in SQL
- Returns a group by object on which we can perform aggregations such as sum, min and max
- We can select subsets of columns to interrogate the data further

### Group by examples

In [None]:
loan_df.groupby('MANUFACTURER_ID')

- The raw output of groupby is not that useful on its own, it is just a GroupBy object
- Let’s try an aggregation

In [None]:
loan_df.groupby('MANUFACTURER_ID').max()

- Ok, now we can see the max value for each column for every ‘MANUFACTURER_ID’
- We can select subsets of the groups and perform operations on them
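As a small illustration of selecting a column from the groups and aggregating it, here is a sketch on made-up data (`toy_df` and its values are assumptions). It also shows the agg method, which applies several aggregations in one call.

```python
import pandas as pd

# Toy stand-in for loan_df; values are invented for illustration
toy_df = pd.DataFrame({
    "MANUFACTURER_ID": [45, 45, 86, 86, 86],
    "DISBURSED_AMOUNT": [50000, 60000, 40000, 55000, 45000],
    "LOAN_DEFAULT": [0, 1, 0, 0, 1],
})

grouped = toy_df.groupby("MANUFACTURER_ID")

# Select one column from the groups, then aggregate it
max_amount = grouped["DISBURSED_AMOUNT"].max()
print(max_amount)

# agg applies several aggregations at once, one output column per function
summary = grouped["DISBURSED_AMOUNT"].agg(["min", "max", "sum"])
print(summary)
```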


In [None]:
loan_df.groupby('MANUFACTURER_ID')['LOAN_DEFAULT'].value_counts()

We can also use [unstack](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html), to give a more readable output

- [unstack](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html) lets us pivot the output of our groupby to give us columns for the unique values of loan default
- The level parameter sets the index level on which to pivot. In our case we want to pivot on LOAN_DEFAULT, which is the last index level in our output, so we set level to -1

In [None]:
loan_df.groupby('MANUFACTURER_ID')['LOAN_DEFAULT'].value_counts().unstack(level=-1)
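To make the effect of the level parameter concrete, here is a sketch on a tiny invented DataFrame (`toy_df` is an assumption for illustration). It shows that level=-1 and the level name give the same result here, and that the fill_value parameter replaces the NaN that appears when a group has no rows for one of the target values.

```python
import pandas as pd

# Toy stand-in for loan_df; manufacturer 120 has no defaults at all
toy_df = pd.DataFrame({
    "MANUFACTURER_ID": [45, 45, 45, 86, 86, 120],
    "LOAN_DEFAULT": [0, 0, 1, 0, 1, 0],
})

# The result has two index levels: MANUFACTURER_ID and LOAN_DEFAULT
counts = toy_df.groupby("MANUFACTURER_ID")["LOAN_DEFAULT"].value_counts()
print(counts)

# Pivot the last index level into columns
wide = counts.unstack(level=-1)

# Referring to the level by name is equivalent in this case
same = counts.unstack(level="LOAN_DEFAULT")

# Missing combinations become NaN unless we supply fill_value
filled = counts.unstack(level=-1, fill_value=0)
print(filled)
```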

### Visualized Groupings

- Now we can start to see how loan defaults are distributed within manufacturer groups
- Remember the normalize parameter from value_counts?

In [None]:
loan_df.groupby('MANUFACTURER_ID')['LOAN_DEFAULT'].value_counts(normalize=True).unstack(level=-1)

- Looks like loans for some manufacturers default at higher rates than others!
- Vehicles from manufacturer 48 defaulted most frequently, *with the exception of manufacturer 153, which had only 12 loans in total, too few to give us solid insight*
- We can use Seaborn [catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html) to visualize the groupings
- We are using catplot rather than countplot as it allows us to group together data with the hue parameter
- x parameter is the main x-axis variable 
- hue is the column we want to create sub-groups on

In [None]:
sns.catplot(data=loan_df,kind='count',x='MANUFACTURER_ID',hue='LOAN_DEFAULT')
plt.show()
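Because LOAN_DEFAULT is coded as 0 and 1, the mean of the column within each group is the default ratio. The sketch below uses that to rank groups by default rate while setting aside groups with very few loans, as we did informally for manufacturer 153. The data and the `min_loans` threshold are invented for illustration.

```python
import pandas as pd

# Toy stand-in for loan_df; manufacturer 153 has only two loans
toy_df = pd.DataFrame({
    "MANUFACTURER_ID": [45] * 6 + [86] * 5 + [153] * 2,
    "LOAN_DEFAULT": [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
})

# Named aggregation: number of loans and default ratio per manufacturer
rates = toy_df.groupby("MANUFACTURER_ID")["LOAN_DEFAULT"].agg(
    n_loans="count",
    default_rate="mean",
)
print(rates)

# Hypothetical threshold: ignore groups too small to be informative
min_loans = 5
reliable = rates[rates["n_loans"] >= min_loans].sort_values(
    "default_rate", ascending=False
)
print(reliable)
```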

## Lesson 3 - Reusable EDA

In the previous lesson, we took a more detailed look at ‘MANUFACTURER_ID’

We still have 4 categorical variables to investigate!
- EMPLOYMENT_TYPE: Employment Type of the customer
- PERFORM_CNS_SCORE_DESCRIPTION: Bureau score description 
- STATE_ID: State of disbursement 
- DISBURSAL_MONTH: The month in which the loan was disbursed

We could copy and paste our steps from the previous lesson

As a general rule of thumb in programming, we want to avoid repeating ourselves

### EXERCISE 

- Write a function to perform the steps from lesson 2 for any column 
- Use this to explore the remaining categorical variables and think about their relationships with the target 


### SOLUTION

- Use print statements to make output readable 

In [None]:
def explore_categorical(col_name):   
    print("{0} Summary".format(col_name))
    print("\n")

    print("{0} Counts".format(col_name))
    print(loan_df[col_name].value_counts())
    print("\n")

    print("{0} Ratio".format(col_name))
    print(loan_df[col_name].value_counts(normalize=True))
    print("\n")

    print("{0} Default Counts".format(col_name))
    print(loan_df.groupby(col_name)['LOAN_DEFAULT'].value_counts().unstack(level=-1))
    print("\n")

    print("{0} Default Ratio".format(col_name))
    print(loan_df.groupby(col_name)['LOAN_DEFAULT'].value_counts(normalize=True).unstack(level=-1))
    print("\n")

    sns.catplot(data=loan_df,kind='count',x=col_name,hue='LOAN_DEFAULT')
    plt.show()

Let's use our explore_categorical function to look at DISBURSAL_MONTH

In [None]:
explore_categorical("DISBURSAL_MONTH")

- The vast majority of loans were disbursed in August, September and October 
- Loans disbursed in October had the highest rate of default ~24%
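One possible variation, sketched below, passes the DataFrame in as an argument and returns the default ratio table instead of printing it, which makes it easy to loop over several columns and keep the results. The function name `default_ratio_table`, the `target` parameter and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def default_ratio_table(df, col_name, target="LOAN_DEFAULT"):
    """Return the ratio of each target value within each category of col_name."""
    return (
        df.groupby(col_name)[target]
        .value_counts(normalize=True)
        .unstack(level=-1, fill_value=0)
    )

# Toy stand-in for loan_df; values are invented for illustration
toy_df = pd.DataFrame({
    "EMPLOYMENT_TYPE": ["Salaried", "Salaried", "Self employed",
                        "Self employed", "Self employed", "Salaried"],
    "DISBURSAL_MONTH": [8, 9, 10, 10, 8, 9],
    "LOAN_DEFAULT": [0, 1, 1, 0, 1, 0],
})

# Build one table per categorical column and keep them in a dict
tables = {
    col: default_ratio_table(toy_df, col)
    for col in ["EMPLOYMENT_TYPE", "DISBURSAL_MONTH"]
}

for col, table in tables.items():
    print(col)
    print(table)
```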

## Lesson 4 - Continuous Variables 

So far in this chapter, we have seen how to investigate categorical data, but we also have a number of continuous variables to deal with!




### Summary Statistics

- The first port of call for exploring continuous variables
- Look at the mean, median, IQR, standard deviation and min/max to get an idea of the range of data and how it is distributed
- Pandas gives us the [describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) function which generates statistical summaries!

In [None]:
loan_df["AGE"].describe()

Some things to note here

- The mean is 33.9
- The median is 32 (the median is smaller than the mean, so the distribution may be slightly right-skewed)
- Max is far bigger than 3rd Q, probably has a right tail
- Min of 17 and Max of 69, these are reasonable so no erroneous outliers 
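The individual statistics can also be computed directly, which is handy when we want to compare them in code. This is a small sketch on an invented series of ages (not the real AGE column) using the pandas mean, median, quantile and skew methods. A positive skew value is consistent with a right tail.

```python
import pandas as pd

# Invented ages with one large value to create a right tail
ages = pd.Series([18, 22, 25, 27, 30, 31, 33, 36, 40, 48, 69], name="AGE")

mean_age = ages.mean()
median_age = ages.median()

# Interquartile range: distance between the 1st and 3rd quartiles
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1

# Sample skewness: positive values indicate a longer right tail
skewness = ages.skew()

print("mean:", mean_age)
print("median:", median_age)
print("IQR:", iqr)
print("skew:", skewness)
```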

### Box Plots and Distributions

- As with most things, summary statistics are often easier to interpret when visualized
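- Luckily, seaborn makes this easy for us with its [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) and [distplot](https://seaborn.pydata.org/generated/seaborn.distplot.html) functions (distplot is deprecated since seaborn 0.11, where [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) draws the same density curve)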

In [None]:
sns.boxplot(x="AGE", data=loan_df)
plt.show()

As suspected, there is a right tail. Now we can use the [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) function from seaborn to look at the distribution. It draws the same density curve as the older distplot with hist=False, which is deprecated in recent seaborn releases.

In [None]:
sns.kdeplot(loan_df['AGE'])
plt.show()

### Grouped Summaries

- Just as we did for the categorical variables, we want to explore the relationship between our continuous variables and the LOAN_DEFAULT column
- Remember [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)? We can combine this with the [describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) function to generate grouped summary statistics!

In [None]:
loan_df.groupby('LOAN_DEFAULT')['AGE'].describe()

Ok, looks like the people who defaulted were generally younger. 

We can use sns boxplot to visualize this

In [None]:
sns.boxplot(x='AGE', y='LOAN_DEFAULT', data=loan_df, orient="h")
plt.show()

The AGE distribution for people who defaulted skews marginally younger!

### EXERCISE 

- Using the steps we have performed to explore ‘AGE’, write a function that can be used to explore other continuous variables
- Pick a few more continuous variables to explore and use your function to investigate them!
- Keep a note of your findings

### SOLUTION

In [None]:
def explore_continuous(col_name):
    #print statistical summary
    print("{0} Summary".format(col_name))
    print("\n")
    print(loan_df[col_name].describe())
    print("\n")

    #Look at boxplot
    sns.boxplot(x=col_name, data=loan_df)
    plt.show()

    #Look at the distribution (kdeplot replaces the deprecated distplot)
    sns.kdeplot(loan_df[col_name])
    plt.show()

    #Now lets look deeper by grouping with the target variable 
    print("{0} Grouped Summary".format(col_name))
    print("\n")
    print(loan_df.groupby('LOAN_DEFAULT')[col_name].describe())

    #look at grouped boxplot 
    sns.boxplot(x=col_name, y='LOAN_DEFAULT', data=loan_df, orient="h")
    plt.show()

Let's use our new function to look at the DISBURSED_AMOUNT column

In [None]:
explore_continuous('DISBURSED_AMOUNT')

Things to note 

- There are some huge outliers here, we will cover techniques for dealing with them in a later lesson 
- Generally, the disbursed amount for defaulted loans was larger, or at least the distribution ranges over larger values 
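To put a number on "huge outliers", one common convention is the 1.5 × IQR rule, which is also the default whisker length in seaborn's boxplot. The sketch below only flags such points on invented amounts (not the real DISBURSED_AMOUNT column). It is one possible rule of thumb, not necessarily the approach used later in the course.

```python
import pandas as pd

# Invented disbursed amounts with one extreme value
amounts = pd.Series(
    [42000, 45000, 47000, 50000, 52000, 55000, 58000, 60000, 250000],
    name="DISBURSED_AMOUNT",
)

q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]

print("bounds:", lower, upper)
print(outliers)
```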

## Lesson 5 - Binary Variables & Conclusion

You may have noticed that our data contains several columns with the suffix _FLAG

- MOBILENO_AVL_FLAG: if Mobile no. was shared by the customer then flagged as 1
- AADHAR_FLAG: if aadhar was shared by the customer then flagged as 1
- PAN_FLAG: if pan was shared by the customer then flagged as 1
- VOTERID_FLAG: if voter id was shared by the customer then flagged as 1
- DRIVING_FLAG: if DL was shared by the customer then flagged as 1
- PASSPORT_FLAG: if passport was shared by the customer then flagged as 1

These are binary or boolean fields where a 1 means that some piece of personal information was provided by the customer and 0 means it was not.
We already dropped the MOBILENO_AVL_FLAG because the value was the same for all rows. 

Essentially these columns can be considered as categoricals so we can use our explore_categorical function to look at them!

Let's have a look at 'AADHAR_FLAG'. An AADHAR number is a 12-digit personal id number provided to residents of India by the government.

In [None]:
explore_categorical('AADHAR_FLAG')

Looks like people who didn't provide their AADHAR number defaulted more frequently at 25.6%!
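For a quick side-by-side view of all the flag columns, we can compute the default rate for each flag value in a loop. This is a sketch on invented data (`toy_df` and its values are assumptions). It relies on LOAN_DEFAULT being coded 0 and 1, so the group mean is the default rate.

```python
import pandas as pd

# Toy stand-in for loan_df; values are invented for illustration
toy_df = pd.DataFrame({
    "AADHAR_FLAG": [1, 1, 1, 0, 0, 1],
    "PAN_FLAG": [0, 0, 1, 0, 1, 0],
    "LOAN_DEFAULT": [0, 0, 1, 1, 0, 0],
})

# Pick out every column whose name ends in _FLAG
flag_cols = [col for col in toy_df.columns if col.endswith("_FLAG")]

# Rows are the flag values (0 and 1), columns are the flag names
default_rates = pd.DataFrame({
    col: toy_df.groupby(col)["LOAN_DEFAULT"].mean() for col in flag_cols
})
print(default_rates)
```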

## Conclusion

- In this chapter, we have demonstrated some techniques to carry out basic exploratory analysis 
- This is only scratching the surface
- Specific techniques used for exploration may be dependent on both the data and its context
- Spend some time now exploring the data further
- Combine these techniques with your own intuition to formulate some hypotheses as to why a particular person might default on their loan
- As always if you have made changes to the data you wish to carry forward, remember to save it!

In [None]:
loan_df.to_csv('../data/vehicle_loans_eda.csv')
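If you want to be sure the saved file loads back as expected, a quick round trip is one way to check. The sketch below writes an invented DataFrame to a temporary directory (the file name `toy_eda.csv` is hypothetical) and reads it back with the same index_col setting we used at the start of the chapter.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for loan_df with UNIQUEID as the index
toy_df = pd.DataFrame(
    {"AGE": [25, 41], "LOAN_DEFAULT": [0, 1]},
    index=pd.Index([1001, 1002], name="UNIQUEID"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_eda.csv")

    # to_csv writes the index as the first column by default
    toy_df.to_csv(path)

    # Restore the index by naming the column that holds it
    reloaded = pd.read_csv(path, index_col="UNIQUEID")

# True when values, dtypes and index all survived the round trip
round_trip_ok = reloaded.equals(toy_df)
print(round_trip_ok)
```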