# Introduction

<img src="https://cdn.corporatefinanceinstitute.com/assets/Loans-1.jpeg" width="700">

The data set contains details of 500 people who took out a loan. It also records whether each person has paid back the loan and, if so, how many days it took. In this project, we will try to draw a few insights from this sample loan data.

The features in the data set are described below:

1. Loan_id : A unique, system-generated ID assigned to each loan customer
2. Loan_status : Whether the loan is paid off, in collection (the customer has yet to pay), or paid off after collection efforts
3. Principal : Principal loan amount at origination, i.e. the amount of the loan applied for
4. Terms : Repayment schedule (time period to repay)
5. Effective_date : When the loan originated (started)
6. Due_date : Date by which the loan should be paid off
7. Paidoff_time : Actual time when the loan was paid off; null means it is yet to be paid
8. Past_due_days : Number of days the loan has been past its due date
9. Age : Age of the customer
10. Education : Education level of the customer who applied for the loan
11. Gender : Customer gender (Male/Female)

Loading the initial libraries

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Let us load the data set

In [None]:
loan= pd.read_csv('../input/loandata/Loan payments data.csv')

### **Data Analysis On Loan Data Set**

**Checking first 5 and last 5 records from the datasets**

In [None]:
loan.head(5)

In [None]:
loan.tail(5)

Let's check the data set for duplicate rows, then look at its shape, column types, and null values

In [None]:
loan.duplicated().sum()

In [None]:
loan.shape

In [None]:
loan.info()

In [None]:
loan.isnull().sum()

From the analysis:
1. There are no duplicated rows.
2. The loan data set has 500 records and 11 columns/features.
3. There are 100 null values in the "paid_off_time" feature and 300 null values in "past_due_days".
4. We will also need to convert the date columns to the datetime datatype.
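The checks above can be gathered into a single helper. The following is only a sketch: `quality_report` and the small `toy` frame are hypothetical stand-ins, not part of the original notebook or of the real loan data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real loan data.
toy = pd.DataFrame({
    "loan_status": ["PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF", "PAIDOFF"],
    "paid_off_time": ["9/14/2016 19:31", None, "10/10/2016 9:00", "9/20/2016 8:00"],
    "past_due_days": [None, 76.0, 2.0, None],
})

def quality_report(df):
    """Return the duplicate-row count and a per-column dtype/null summary."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # column types as readable strings
        "nulls": df.isnull().sum(),       # missing values per column
    })
    return int(df.duplicated().sum()), summary

n_duplicates, summary = quality_report(toy)
print("duplicate rows:", n_duplicates)
print(summary)
```

Calling `quality_report(loan)` on the real frame would give the same kind of overview in one table.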

Let's convert following columns to the Datetime format

In [None]:
loan['effective_date'] = pd.to_datetime(loan['effective_date'])
loan['due_date'] = pd.to_datetime(loan['due_date'])
# Parse the timestamp and drop the time-of-day part, keeping a datetime64 column
loan['paid_off_time'] = pd.to_datetime(loan['paid_off_time']).dt.normalize()

In [None]:
loan.info()
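Once the date columns are parsed, the "how many days did repayment take" question from the introduction becomes a simple subtraction. A minimal sketch on hypothetical rows follows; the `days_to_repay` column name is an assumption made for illustration and does not exist in the data set.

```python
import pandas as pd

# Hypothetical rows in the same date format as the loan data.
toy = pd.DataFrame({
    "effective_date": ["9/8/2016", "9/8/2016"],
    "due_date": ["10/7/2016", "9/22/2016"],
    "paid_off_time": ["9/14/2016 19:31", None],
})

for col in ["effective_date", "due_date", "paid_off_time"]:
    toy[col] = pd.to_datetime(toy[col])

# Whole days between origination and repayment; unpaid loans stay NaN.
toy["days_to_repay"] = (
    toy["paid_off_time"].dt.normalize() - toy["effective_date"]
).dt.days
print(toy[["effective_date", "paid_off_time", "days_to_repay"]])
```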

**Let's aim to replace NaN values for the columns in accordance with their distribution**

In [None]:
loan.hist(figsize = (15,11), color="#008080")

In [None]:
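# Assign back instead of inplace=True, which may not update `loan` on a column selection
loan['past_due_days'] = loan['past_due_days'].fillna(loan['past_due_days'].mean())
# Note: the -1 sentinel turns this datetime column into an object column
loan['paid_off_time'] = loan['paid_off_time'].fillna(-1)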

In [None]:
loan.isnull().sum()
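Mean imputation gives every on-time loan a nonzero delay, and a `-1` sentinel changes the dtype of the date column. One possible alternative is sketched below. It assumes that a missing `past_due_days` means the loan was never overdue, which the data description suggests but does not state; the `is_paid` flag is a hypothetical helper column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real loan data.
toy = pd.DataFrame({
    "loan_status": ["PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF"],
    "paid_off_time": pd.to_datetime(["2016-09-14", None, "2016-10-10"]),
    "past_due_days": [None, 76.0, 2.0],
})

# Assumption: no recorded past-due days means zero days overdue.
toy["past_due_days"] = toy["past_due_days"].fillna(0)

# Keep unpaid loans as NaT and record repayment in a boolean flag instead,
# so the column stays datetime64.
toy["is_paid"] = toy["paid_off_time"].notna()
print(toy)
```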

There is also one spelling error to correct in the education column

In [None]:
loan['education']= loan['education'].replace('Bechalor','Bachelor')

Now, it seems we are good to go ahead

### **Exploratory Data Analysis**

### **Loan Status Analysis**

In [None]:
loan_stat = loan['loan_status'].value_counts()
pd.DataFrame(loan_stat)

In [None]:
plt.figure(figsize = [10,5])
# Take labels from the value_counts index so each label matches its wedge
plt.pie(loan['loan_status'].value_counts(),labels=loan['loan_status'].value_counts().index,explode=[0,0.1,0],startangle=145,autopct='%1.f%%', colors=['#1e847f', '#ecc19c', '#000000'])
plt.title('Loan Status Distribution',fontsize = 15)
plt.show()

We can see here:
* Out of 500 people, 300 repaid the full amount on time.
* "Collection paid off" shows that 100 people repaid the loan, but only after the due date.
* "Collection" shows that 100 people have not yet repaid the loan.
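The same counts can be read as percentages with `value_counts(normalize=True)`. The sketch below uses a hypothetical five-row series that mirrors the 300/100/100 split rather than the real data.

```python
import pandas as pd

# Hypothetical series with the same 3:1:1 ratio as the counts above.
status = pd.Series(
    ["PAIDOFF"] * 3 + ["COLLECTION", "COLLECTION_PAIDOFF"],
    name="loan_status",
)

# normalize=True returns fractions; scale them to percentages.
shares = status.value_counts(normalize=True).mul(100).round(1)
print(shares)
```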

### **Gender v/s Loan Status Analysis**

In [None]:
loan['Gender'].value_counts().sort_index()

**Out of 500 borrowers, there are 423 males and 77 females**

In [None]:
loan.groupby(['Gender'])['loan_status'].value_counts().to_frame()

In [None]:
plt.figure(figsize = [10,5])
# Pass the column by keyword: recent seaborn versions accept only `data` positionally
sns.countplot(x=loan['Gender'], hue=loan['loan_status'], palette=['#1e847f', '#ecc19c', '#000000'])
plt.legend(loc='upper right')
plt.title('Gender vs Loan Status',fontsize=20)
plt.xlabel('Gender', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.show()

From the above analysis:
* Out of 500 borrowers, there are 423 males and 77 females
* Around 40% of male borrowers repaid late or have yet to repay
* Around 30% of female borrowers repaid late or have yet to repay
* Irrespective of gender, most borrowers tend to repay the loan on time
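Because the two groups differ greatly in size, percentages within each gender are easier to compare than raw counts. A minimal sketch with `pd.crosstab(..., normalize="index")` follows; the rows are hypothetical and the resulting numbers are not those of the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real loan data.
toy = pd.DataFrame({
    "Gender": ["male", "male", "male", "male", "female", "female", "female"],
    "loan_status": [
        "PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF",
        "PAIDOFF", "PAIDOFF", "COLLECTION",
    ],
})

# Row-normalised table: each gender's statuses sum to 100%.
rates = pd.crosstab(toy["Gender"], toy["loan_status"], normalize="index").mul(100)

# Share of each gender that paid late or has not paid yet.
not_on_time = 100 - rates["PAIDOFF"]
print(rates.round(1))
print(not_on_time.round(1))
```

The same pattern applies to any other categorical column, such as `education`.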

### **Education v/s Loan Status Analysis**

In [None]:
loan['education'].value_counts().to_frame()

In [None]:
loan.groupby(['education'])['loan_status'].value_counts().to_frame()

In [None]:
plt.figure(figsize = [10,5])
# Pass the column by keyword: recent seaborn versions accept only `data` positionally
sns.countplot(x=loan['education'], hue=loan['loan_status'], palette=['#1e847f', '#ecc19c', '#000000'])
plt.legend(loc='upper right')
plt.title('Education vs Loan Status',fontsize=20)
plt.xlabel('Education', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.show()

From the above analysis:
* The majority of borrowers have a High School or College background
* Very few people with a Master's degree or above took a loan
* Irrespective of education category, most borrowers repaid their loan

### **Age v/s Loan Status Analysis**

In [None]:
loan['age'].value_counts().to_frame()

In [None]:
plt.figure(figsize = [18,7])
# Pass the column by keyword: recent seaborn versions accept only `data` positionally
sns.countplot(x=loan['age'], hue=loan['loan_status'], palette=['#1e847f', '#ecc19c', '#000000'])
plt.legend(loc='upper left')
plt.title('Age vs Loan Status',fontsize=20)
plt.xlabel('Age', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.show()

From the above analysis:
* The majority of borrowers are between 24 and 38 years old
* The majority of borrowers repaid their loan
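With one bar per year of age, the plot can be hard to read. Grouping ages into bands with `pd.cut` is one way to summarise it. The bin edges, labels, and rows below are assumptions chosen for illustration, not values taken from the data set.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real loan data.
toy = pd.DataFrame({
    "age": [22, 25, 29, 33, 37, 41, 45],
    "loan_status": [
        "PAIDOFF", "PAIDOFF", "COLLECTION", "PAIDOFF",
        "PAIDOFF", "COLLECTION_PAIDOFF", "PAIDOFF",
    ],
})

# Assumed age bands; intervals are right-inclusive, e.g. (24, 31].
bins = [17, 24, 31, 38, 52]
labels = ["18-24", "25-31", "32-38", "39-52"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Loan status counts per age band.
counts = pd.crosstab(toy["age_group"], toy["loan_status"])
print(counts)
```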

### **Principal v/s Loan Status Analysis**

In [None]:
loan['Principal'].value_counts().to_frame()

In [None]:
loan.groupby(['Principal'])['loan_status'].value_counts().to_frame()

In [None]:
plt.figure(figsize = [10,5])
# Pass the column by keyword: recent seaborn versions accept only `data` positionally
sns.countplot(x=loan['Principal'], hue=loan['loan_status'], palette=['#1e847f', '#ecc19c', '#000000'])
plt.legend(loc='upper left')
plt.title('Principal vs Loan Status',fontsize=20)
plt.xlabel('Principal', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.show()

From the above analysis:
* The majority of borrowers opted for a principal of 800 or 1000
* Among those borrowers, the majority repaid their loan

### **Term v/s Loan Status Analysis**

In [None]:
loan['terms'].value_counts().to_frame()

In [None]:
loan.groupby(['terms'])['loan_status'].value_counts().to_frame()

In [None]:
plt.figure(figsize = [10,5])
# Pass the column by keyword: recent seaborn versions accept only `data` positionally
sns.countplot(x=loan['terms'], hue=loan['loan_status'], palette=['#1e847f', '#ecc19c', '#000000'])
plt.legend(loc='upper left')
plt.title('Terms vs Loan Status',fontsize=20)
plt.xlabel('Terms', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.show()

From the above analysis:
* Only a few people opted for a 7-day loan term
* The majority of late payments come from borrowers with loan terms of 15 or 30 days
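Since far more loans have 15-day or 30-day terms, raw counts of late payments partly reflect group size. A late-payment rate per term separates the two effects. The sketch below uses hypothetical rows and a hypothetical `late` helper column, which treats every status other than `PAIDOFF` as late.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real loan data.
toy = pd.DataFrame({
    "terms": [7, 15, 15, 30, 30, 30],
    "loan_status": [
        "PAIDOFF", "PAIDOFF", "COLLECTION",
        "PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF",
    ],
})

# Assumption: any status other than PAIDOFF counts as a late payment.
toy["late"] = toy["loan_status"] != "PAIDOFF"

# The mean of a boolean column is the fraction of True values per group.
late_rate = toy.groupby("terms")["late"].mean()
print(late_rate.round(2))
```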

### **Effective Date v/s Loan Status Analysis**

In [None]:
loan.groupby(['effective_date'])['loan_status'].value_counts().to_frame()

In [None]:
plt.figure(figsize = [10,5])
dates = loan['effective_date'].dt.date
sns.countplot(x=dates, hue=loan['loan_status'],palette=('#1e847f', '#ecc19c', '#000000'))
plt.legend(loc='upper right')
plt.title('Effective Date vs Loan Status',fontsize=20)
plt.xlabel('Effective Date', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.show()

From the above analysis:
* Many loans were issued on 11th and 12th September
* This may have been part of a loan drive or similar event

Let's see correlation between the features

In [None]:
# Restrict the correlation to numeric columns; text and date columns would raise an error
correlation = loan.corr(numeric_only=True)
plt.figure(figsize=(10, 7))
plot = sns.heatmap(correlation, vmin = -1, vmax = 1,annot=True, annot_kws={"size": 10})
plot.set_xticklabels(plot.get_xticklabels(), rotation=30)

### **Conclusion:**
1. 60% of borrowers repaid the loan on time, 20% repaid it after the due date, and 20% have not repaid it.
2. The majority of borrowers have a High School or College background.
3. The majority of borrowers are between 24 and 38 years old.
4. The majority of borrowers opted for a principal of 800 or 1000.
5. The majority of late payments come from borrowers with loan terms of 15 or 30 days.
6. Most older borrowers (35 to 50 years) paid back their loan on time.