### Introduction

* Solving this assignment will give you an idea about how real business problems are solved using EDA. In this case study, apart from applying the techniques you have learnt in EDA, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimise the risk of losing money while lending to customers.


### Business Understanding

* You work for a consumer finance company which specialises in lending various types of loans to urban customers. When the company receives a loan application, the company has to make a decision for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:
    * If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company
    * If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company

* The data given below contains information about past loan applicants and whether they ‘defaulted’ or not. **The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.**

* In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

* When a person applies for a loan, there are two types of decisions that could be taken by the company:
    * **Loan accepted:** If the company approves the loan, there are 3 possible scenarios described below:
        * **Fully paid:** Applicant has fully paid the loan (the principal and the interest)
        * **Current:** Applicant is in the process of paying the instalments, i.e. the tenure of the loan is not yet completed. These candidates are not labelled as 'defaulted'.
        * **Charged-off:** Applicant has not paid the instalments in due time for a long period of time, i.e. he/she has defaulted on the loan 
    * **Loan rejected:** The company had rejected the loan (because the candidate does not meet their requirements etc.). Since the loan was rejected, there is no transactional history of those applicants with the company and so this data is not available with the company (and thus in this dataset)
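
The three outcomes of an accepted loan can be turned into a binary default flag. The following is a minimal sketch on toy data, assuming a `loan_status` column whose values are spelled 'Fully Paid', 'Current' and 'Charged Off'; the exact spellings and the `is_default` column name are assumptions made for illustration.

```python
import pandas as pd

# Toy data; the status strings are assumed spellings of the three scenarios.
toyLoans = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Current", "Charged Off", "Fully Paid"]}
)

# 'Current' loans have no final outcome yet, so they are set aside.
closedLoans = toyLoans[toyLoans["loan_status"] != "Current"].copy()

# Binary flag: 1 for a charged-off (defaulted) loan, 0 for a fully paid one.
closedLoans["is_default"] = (closedLoans["loan_status"] == "Charged Off").astype(int)

# Share of closed loans that defaulted.
print(closedLoans["is_default"].mean())
```

Setting 'Current' loans aside is one possible choice; they could also be kept and analysed separately.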

 
### Business Objectives

* This company is the largest online loan marketplace, facilitating personal loans, business loans, and financing of medical procedures. Borrowers can easily access lower interest rate loans through a fast online interface. 

* As with most other lending companies, lending to ‘risky’ applicants is the largest source of financial loss (called credit loss). Credit loss is the amount of money lost by the lender when the borrower refuses to pay or runs away with the money owed. In other words, borrowers who default cause the largest amount of loss to the lenders. In this case, the customers labelled as 'charged-off' are the 'defaulters'. 

* If these risky loan applicants can be identified, the number of such loans can be reduced, thereby cutting down the amount of credit loss. **Identification of such applicants using EDA is the aim of this case study.**

* In other words, the company wants to understand **the driving factors (or driver variables) behind loan default**, i.e. the variables which are strong indicators of default.  The company can utilise this knowledge for its portfolio and risk assessment. 

* To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough).
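
One way to look for driver variables is to compare the default rate across the categories of a candidate variable. The sketch below uses toy data; the `grade` values and the `is_default` flag are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Toy data: one candidate driver variable and a binary default flag.
toyLoans = pd.DataFrame(
    {
        "grade": ["A", "A", "B", "B", "B", "C"],
        "is_default": [0, 0, 0, 1, 0, 1],
    }
)

# Mean of a 0/1 flag per group is the default rate of that group.
defaultRateByGrade = (
    toyLoans.groupby("grade")["is_default"].mean().sort_values(ascending=False)
)
print(defaultRateByGrade)
```

A variable whose categories show clearly different default rates is a candidate driver; a variable with near-identical rates across categories probably is not.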

 
### Results Expected
* Write all your code in one well-commented Python file; briefly mention the insights and observations from the analysis 
* Present the overall approach of the analysis in a presentation: 
    * Mention the problem statement and the analysis approach briefly 
    * Explain the results of univariate, bivariate analysis etc. in business terms
    * Include visualisations and summarise the most important results in the presentation

### Importing libraries

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import statsmodels
import sklearn


##### Reading dataset
* Read the dataset with pandas `read_csv` and the data dictionary with `read_excel`.

In [35]:
# Load the Dataset CSV file
lendingCaseStudyDataFrame = pd.read_csv('loan.csv')

# Load the metadata file
lendingCaseStudyMetadata = pd.read_excel('Data_Dictionary.xlsx', sheet_name='LoanStats')
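
pandas can report mixed types in a column when it infers types chunk by chunk. A possible way to avoid this is to state the type of the affected columns up front, or to pass `low_memory=False`. The sketch below reads a small in-memory CSV; the column names and values are toy stand-ins, and which columns of `loan.csv` actually need this is an assumption to verify.

```python
import io

import pandas as pd

# Toy CSV standing in for loan.csv; the columns are illustrative only.
csvText = "id,int_rate,next_pymnt_d\n1,10.65%,\n2,15.27%,Jun-16\n"

# dtype pins the listed columns to a known type; low_memory=False makes
# pandas infer the remaining column types from the whole file at once.
toyDataFrame = pd.read_csv(
    io.StringIO(csvText),
    dtype={"id": str, "next_pymnt_d": str},
    low_memory=False,
)
print(toyDataFrame.dtypes)
```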



##### Fixing Data Quality Issues
* Some columns contain mixed data types. Standardise the data type of each column.

In [36]:
# Define conversion functions
def toString(column):
    return column.astype(str)

def toFloat(column):
    return column.astype(float)

def toInt(column):
    return column.astype(int)
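
The `astype` helpers above work only when the raw values already look numeric. Columns such as `term` (for example ' 36 months') and `int_rate` (for example '10.65%') carry a text suffix, so a plain `astype` call would raise a `ValueError`. The following is a sketch of two extra helpers; the names `percentToFloat` and `termToMonths` are hypothetical, and the code assumes these columns have no missing values.

```python
import pandas as pd


def percentToFloat(column):
    """Strip a trailing '%' and convert to float, e.g. '10.65%' -> 10.65."""
    return column.astype(str).str.strip().str.rstrip("%").astype(float)


def termToMonths(column):
    """Extract the leading integer from strings such as ' 36 months'."""
    return column.astype(str).str.extract(r"(\d+)", expand=False).astype(int)


# Toy values in the same shape as the raw columns.
toyDataFrame = pd.DataFrame(
    {"int_rate": ["10.65%", "15.27%"], "term": [" 36 months", " 60 months"]}
)
interestRates = percentToFloat(toyDataFrame["int_rate"])
termMonths = termToMonths(toyDataFrame["term"])
print(interestRates.tolist(), termMonths.tolist())
```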




In [50]:
# Some of the columns have mixed types, as reported by the pandas warning when the file is read.
# List each column name with its type(s).
# Reset the display option for the number of printed sequence items to its default.
pd.reset_option('display.max_seq_items')
print("Columns and column types in the CSV file:")
for columnName in lendingCaseStudyDataFrame.columns:
    columnType = lendingCaseStudyDataFrame[columnName].apply(type).unique()
    uniqueColumnValues = lendingCaseStudyDataFrame[columnName].unique()
    print(f"The data type of {columnName} is: {columnType}")
    print(f"The number of unique values of {columnName} is: {uniqueColumnValues.size}")

    # List all unique values if there are fewer than 10, to check whether the column carries any useful information.
    if uniqueColumnValues.size < 10:
        print(f"The unique values of {columnName} is: {uniqueColumnValues}")
    print('\n')


Columns and column types in the CSV file:
The data type of id is: [<class 'int'>]
The number of unique values of id is: 39717


The data type of member_id is: [<class 'int'>]
The number of unique values of member_id is: 39717


The data type of loan_amnt is: [<class 'int'>]
The number of unique values of loan_amnt is: 885


The data type of funded_amnt is: [<class 'int'>]
The number of unique values of funded_amnt is: 1041


The data type of funded_amnt_inv is: [<class 'float'>]
The number of unique values of funded_amnt_inv is: 8205


The data type of term is: [<class 'str'>]
The number of unique values of term is: 2
The unique values of term is: [' 36 months' ' 60 months']


The data type of int_rate is: [<class 'str'>]
The number of unique values of int_rate is: 371


The data type of installment is: [<class 'float'>]
The number of unique values of installment is: 15383


The data type of grade is: [<class 'str'>]
The number of unique values of grade is: 7
The unique values of grade
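
Rather than reading the per-column listing by eye, the columns that hold more than one Python type can be picked out directly. This is a small sketch on toy data; the column names are illustrative and the `mixedTypeColumns` name is hypothetical.

```python
import pandas as pd

# Toy frame: one clean integer column and one object column mixing str and NaN.
toyDataFrame = pd.DataFrame(
    {
        "loan_amnt": [5000, 2500, 2400],
        "next_pymnt_d": pd.Series(["Jun-16", float("nan"), "Jul-16"], dtype=object),
    }
)

# Count the distinct Python types per column; more than one means mixed types.
typeCounts = toyDataFrame.apply(lambda column: column.map(type).nunique())
mixedTypeColumns = typeCounts[typeCounts > 1].index.tolist()
print(mixedTypeColumns)
```

In this toy case the mix comes from a missing value (a float NaN) sitting in a string column, which is a common cause.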

In [56]:
# Remove columns that do not influence the loan decision.
# Such columns provide no variability or meaningful information for the analysis.
# Keeping them would add noise and redundancy, so dropping them leaves a cleaner, more efficient dataset.


# Drop the columns listed below because all of their values are NaN.
columnsListWithAllValuesNan = ['mths_since_last_major_derog', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'tot_coll_amt', 'open_acc_6m',
'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',
'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m', 'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 
'bc_util', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc',
'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 
'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0', 
'num_sats', 'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'tot_hi_cred_lim',
'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit', 'tot_cur_bal']

# Drop 'policy_code' as all values are 1.
# Drop 'application_type' as all values are individual.
# Drop 'acc_now_delinq' and 'delinq_amnt' as all values are 0.
# Drop 'pymnt_plan' as all values are 'n'.
# Drop 'initial_list_status' as all values are 'f'.
columnsListWithAllSameValues = ['policy_code', 'application_type', 'acc_now_delinq', 'delinq_amnt', 'pymnt_plan', 'initial_list_status']

# Drop 'collections_12_mths_ex_med', 'chargeoff_within_12_mths' and 'tax_liens' as all values are either NaN or 0.
columnsListWithEitherNanOrOs = ['collections_12_mths_ex_med', 'chargeoff_within_12_mths', 'tax_liens']

lendingCaseStudyDataFrameCleaned = lendingCaseStudyDataFrame.drop(columns=columnsListWithAllValuesNan + columnsListWithAllSameValues + columnsListWithEitherNanOrOs)

lendingCaseStudyDataFrameCleaned.head()




| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | next_pymnt_d | last_credit_pull_d | pub_rec_bankruptcies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599 | 5000 | 5000 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | ... | 5000.0 | 863.16 | 0.0 | 0.0 | 0.0 | Jan-15 | 171.62 | | May-16 | 0.0 |
| 1 | 1077430 | 1314167 | 2500 | 2500 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | ... | 456.46 | 435.17 | 0.0 | 117.08 | 1.11 | Apr-13 | 119.66 | | Sep-13 | 0.0 |
| 2 | 1077175 | 1313524 | 2400 | 2400 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 | ... | 2400.0 | 605.67 | 0.0 | 0.0 | 0.0 | Jun-14 | 649.91 | | May-16 | 0.0 |
| 3 | 1076863 | 1277178 | 10000 | 10000 | 10000.0 | 36 months | 13.49% | 339.31 | C | C1 | ... | 10000.0 | 2214.92 | 16.97 | 0.0 | 0.0 | Jan-15 | 357.48 | | Apr-16 | 0.0 |
| 4 | 1075358 | 1311748 | 3000 | 3000 | 3000.0 | 60 months | 12.69% | 67.79 | B | B5 | ... | 2475.94 | 1037.39 | 0.0 | 0.0 | 0.0 | May-16 | 67.79 | Jun-16 | May-16 | 0.0 |
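
The hand-written column lists used for the drop can also be derived from the data. A minimal sketch, assuming that "uninformative" means a column with at most one distinct non-null value; the toy columns and the `uninformativeColumns` name are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with one useful column and three uninformative ones.
toyDataFrame = pd.DataFrame(
    {
        "loan_amnt": [5000, 2500, 2400],
        "policy_code": [1, 1, 1],               # constant
        "tax_liens": [0.0, np.nan, 0.0],        # only NaN or 0
        "dti_joint": [np.nan, np.nan, np.nan],  # all NaN
    }
)

# nunique() ignores NaN by default: all-NaN columns give 0,
# constant columns (with or without NaN) give 1.
uniqueCounts = toyDataFrame.nunique()
uninformativeColumns = uniqueCounts[uniqueCounts <= 1].index.tolist()

toyCleaned = toyDataFrame.drop(columns=uninformativeColumns)
print(uninformativeColumns)
```

Deriving the list this way keeps it free of duplicates and in step with the data, at the cost of hiding the reason each column was dropped.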


In [59]:
# Print the DataFrame info, then list each column name and its type again after dropping the unimportant and empty columns.
print("Info and types of datasets after dropping unnecessary columns: ")
lendingCaseStudyDataFrameCleaned.info()
print("\n")
print("Columns and column types in the Cleaned CSV file:")
for columnName in lendingCaseStudyDataFrameCleaned.columns:
    columnType = lendingCaseStudyDataFrameCleaned[columnName].apply(type).unique()
    uniqueColumnValues = lendingCaseStudyDataFrameCleaned[columnName].unique()
    print(f"The data type of {columnName} is: {columnType}")
    print(f"The number of unique values of {columnName} is: {uniqueColumnValues.size}")

    # List all unique values if there are fewer than 10, to check whether the column carries any useful information.
    if uniqueColumnValues.size < 10:
        print(f"The unique values of {columnName} is: {uniqueColumnValues}")
    print('\n')


Info and types of datasets after dropping unnecessary columns: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39717 entries, 0 to 39716
Data columns (total 48 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       39717 non-null  int64  
 1   member_id                39717 non-null  int64  
 2   loan_amnt                39717 non-null  int64  
 3   funded_amnt              39717 non-null  int64  
 4   funded_amnt_inv          39717 non-null  float64
 5   term                     39717 non-null  object 
 6   int_rate                 39717 non-null  object 
 7   installment              39717 non-null  float64
 8   grade                    39717 non-null  object 
 9   sub_grade                39717 non-null  object 
 10  emp_title                37258 non-null  object 
 11  emp_length               38642 non-null  object 
 12  home_ownership           39717 non-null  object 
 13  annual_inc   

In [26]:
pd.set_option('display.max_rows', None)
nullCounts = lendingCaseStudyDataFrame.isnull().sum()
print(nullCounts)

id                                    0
member_id                             0
loan_amnt                             0
funded_amnt                           0
funded_amnt_inv                       0
term                                  0
int_rate                              0
installment                           0
grade                                 0
sub_grade                             0
emp_title                          2459
emp_length                         1075
home_ownership                        0
annual_inc                            0
verification_status                   0
issue_d                               0
loan_status                           0
url                                   0
desc                              12942
purpose                               0
title                                11
zip_code                              0
addr_state                            0
dti                                   0
delinq_2yrs                           0
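
Raw null counts are easier to judge as a share of all rows. The following sketch computes the percentage of missing values per column on toy data; the values are illustrative and `missingPercent` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy frame: one column with half its values missing, one complete column.
toyDataFrame = pd.DataFrame(
    {
        "emp_title": ["teacher", np.nan, "engineer", np.nan],
        "loan_amnt": [5000, 2500, 2400, 10000],
    }
)

# The mean of the boolean null mask is the fraction of missing values.
missingPercent = toyDataFrame.isnull().mean().mul(100).sort_values(ascending=False)

# Show only the columns that have at least one missing value.
print(missingPercent[missingPercent > 0])
```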


In [28]:
lendingCaseStudyDataFrame.size

4368870

In [29]:
# removing columns with null counts and values

In [33]:
# Specify the columns to check for null values
columnsToCheck = ['
# Drop rows with null values in the specified columns
lendingCaseStudyDataFrame_cleaned = lendingCaseStudyDataFrame.dropna(subset=columnsToCheck)

# Print the size of the original DataFrame (dropna returns a new object and leaves the original unchanged)
print(lendingCaseStudyDataFrame.size)

Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq',
       'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens',
       'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit',
       'total_il_high_credit_limit'],
      dtype='object', length=110)
4368870
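
`dropna` returns a new DataFrame, so the effect of the drop shows up only on the returned object. A short sketch on toy data; the choice of `emp_length` as the column to check is hypothetical and is not the list used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two different columns.
toyDataFrame = pd.DataFrame(
    {
        "emp_length": ["10+ years", np.nan, "1 year"],
        "title": ["Computer", "bike", np.nan],
        "loan_amnt": [5000, 2500, 2400],
    }
)

# Hypothetical choice: only rows missing 'emp_length' are dropped.
columnsToCheck = ["emp_length"]
toyCleaned = toyDataFrame.dropna(subset=columnsToCheck)

# Compare shapes of the original and the cleaned frame.
print(toyDataFrame.shape, toyCleaned.shape)
```

Comparing `shape` rather than `size` separates the row count from the column count, which makes a row drop easier to spot.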


In [None]:
# Define specific data types
from enum import Enum

class HomeOwnership(Enum):
    RENT = 'RENT'
    OWN = 'OWN'
    MORTGAGE = 'MORTGAGE'
    NONE = 'NONE'
    OTHER = 'OTHER'

class VerificationStatus(Enum):
    VERIFIED = 'Verified'
    SOURCE_VERIFIED = 'Source Verified'
    NOT_VERIFIED = 'Not Verified'
    
class LoanStatus(Enum):
    CURRENT = 'Current'
    FULLY_PAID = 'Fully Paid'
    CHARGED_OFF = 'Charged Off'
    

# Define a conversion map (like a case switch)
conversionMap = {
    'id': toString,    # Convert 'id' column to string
    'member_id': toString, # Convert 'member_id' column to string
    'loan_amnt': toFloat, # Convert 'loan_amnt' column to float.
    'funded_amnt': toFloat,  # Convert 'funded_amnt' column to float.
    'funded_amnt_inv' : toFloat, # Convert 'funded_amnt_inv' column to float.
    'term' : toInt, # Convert 'term' column to int to enable calculations (the ' months' suffix must be stripped first).
    'int_rate' : toFloat, # Convert 'int_rate' column to float to enable calculations (the '%' suffix must be stripped first).
    'installment' : toFloat, # Convert 'installment' column to float to enable calculations though it already is.
    'grade' : toString, # Convert 'grade' column to string though it already is.
    'sub_grade' : toString, # Convert 'sub_grade' column to string though it already is.
    'emp_title' : toString, # Convert 'emp_title' column to string though it already is.
    'home_ownership': toHomeOwnership, # Convert 'home_ownership' column to HomeOwnership enum. 
    'annual_inc' : toFloat, # Convert 'annual_inc' column to float to enable calculations though it already is.
    'verification_status' : toVerificationStatus, # Convert 'verification_status' column to VerificationStatus enum. 
    'issue_d' : toDateTimeStamp, # Convert 'issue_d' column to DateTimeStamp type.
    'loan_status' : toLoanStatus, # Convert 'loan_status' column to LoanStatus enum.
    'desc' : toString, # Convert 'desc' column to string though it already is.
    'purpose' : toString, # Convert 'purpose' column to string though it already is.
    'title' : toString, # Convert 'title' column to string though it already is.
    'zip_code' : toString, # Convert 'zip_code' column to string though it already is.
    'addr_state' : toString, # Convert 'addr_state' column to string though it already is.
    'dti' : toFloat, # Convert 'dti' column to float though it already is.
    'delinq_2yrs' : toInt, # Convert 'delinq_2yrs' column to int though it already is.
    'earliest_cr_line' : toDateTimeStamp, # Convert 'earliest_cr_line' column to DateTimeStamp type.
    'inq_last_6mths' : toInt, # Convert 'inq_last_6mths' column to int though it already is.
    'mths_since_last_delinq' : toInt, # Convert 'mths_since_last_delinq' column to int.
    'mths_since_last_record' : toInt, # Convert 'mths_since_last_record' column to int.
    'open_acc' : toInt, # Convert 'open_acc' column to int.
    'pub_rec' : toInt, # Convert 'pub_rec' column to int.
    'revol_bal' : toFloat, # Convert 'revol_bal' column to float.
    'revol_util' : toFloat, # Convert 'revol_util' column to float.
    'total_acc' : toInt, # Convert 'total_acc' column to int though it already is.
    'out_prncp' : toFloat, # Convert 'out_prncp' column to float though it already is.
    'out_prncp_inv' : toFloat, # Convert 'out_prncp_inv' column to float though it already is.
    'total_pymnt' : toFloat, # Convert 'total_pymnt' column to float though it already is.
    'total_pymnt_inv' : toFloat, # Convert 'total_pymnt_inv' column to float though it already is.
    'total_rec_prncp' : toFloat, # Convert 'total_rec_prncp' column to float though it already is.
    'total_rec_int' : toFloat, # Convert 'total_rec_int' column to float though it already is.
    'total_rec_late_fee' : toFloat, # Convert 'total_rec_late_fee' column to float though it already is.
    'recoveries' : toFloat, # Convert 'recoveries' column to float though it already is.
    'collection_recovery_fee' : toFloat, # Convert 'collection_recovery_fee' column to float though it already is.
    'last_pymnt_d' : toDateTimeStamp, # Convert 'last_pymnt_d' column to DateTimeStamp type.
    'last_pymnt_amnt' : toFloat, # Convert 'last_pymnt_amnt' column to float though it already is.
    'next_pymnt_d' : toDateTimeStamp, # Convert 'next_pymnt_d' column to DateTimeStamp type.
    'last_credit_pull_d' : toDateTimeStamp, # Convert 'last_credit_pull_d' column to DateTimeStamp type.
    'collections_12_mths_ex_med' : toInt, # Convert 'collections_12_mths_ex_med' column to int.
}
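
A conversion map like this one can be applied with a short loop. The sketch below is a minimal, self-contained illustration on toy data: `toDateTimeStamp` and `toNullableInt` are hypothetical helpers, the 'Dec-11' date format is assumed from the 'Jan-15' style values shown earlier, and the nullable `Int64` type is one possible way to keep integer columns that contain missing values.

```python
import pandas as pd


def toString(column):
    return column.astype(str)


def toFloat(column):
    return column.astype(float)


def toDateTimeStamp(column):
    # Hypothetical helper: parses 'Dec-11' style strings (month abbreviation, two-digit year).
    return pd.to_datetime(column, format="%b-%y")


def toNullableInt(column):
    # Hypothetical helper: pandas' nullable integer type keeps missing values as <NA>.
    return column.astype("Int64")


toyConversionMap = {
    "id": toString,
    "loan_amnt": toFloat,
    "issue_d": toDateTimeStamp,
    "mths_since_last_delinq": toNullableInt,
}

toyDataFrame = pd.DataFrame(
    {
        "id": [1077501, 1077430],
        "loan_amnt": [5000, 2500],
        "issue_d": ["Dec-11", "Nov-11"],
        "mths_since_last_delinq": [35.0, None],
    }
)

# Apply each converter only to the columns that are present in the frame.
for columnName, converter in toyConversionMap.items():
    if columnName in toyDataFrame.columns:
        toyDataFrame[columnName] = converter(toyDataFrame[columnName])

print(toyDataFrame.dtypes)
```

Checking for the column before converting means that entries for columns dropped earlier in the cleanup are skipped instead of raising a `KeyError`.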