
# Analyzing Loan Data

In this notebook I'm going to analyze a loan payments dataset. The data is taken from [Kaggle](https://www.kaggle.com/zhijinzhai/loandata "Kaggle").

My goal is to analyze, explore and visualize the data using the Python libraries pandas, numpy, matplotlib and seaborn.





## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Data exploration or getting familiar with the data

I start by running some common methods in order to explore the data.

In [4]:
ld = pd.read_csv('../input/Loan payments data.csv')

In [5]:
ld.head()

In [6]:
ld.info()
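
Two other quick checks I find useful at this stage are the number of missing values and the number of distinct values per column. The snippet below is only a sketch: `sample` is a small made-up frame that borrows a few column names from the dataset, and its values are hypothetical. On the real data the same calls would be run on `ld`.

```python
import pandas as pd

# Hypothetical rows that mimic a few columns of the loan dataset (not real data)
sample = pd.DataFrame({
    "loan_status": ["PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF", "PAIDOFF"],
    "Principal": [1000, 800, 1000, 300],
    "paid_off_time": ["9/14/2016 19:31", None, "10/10/2016 9:00", "9/22/2016 20:00"],
    "past_due_days": [None, 29.0, 3.0, None],
})

# Count missing values per column
missing = sample.isna().sum()
print(missing)

# Count distinct (non-missing) values per column
distinct = sample.nunique()
print(distinct)
```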

## Data type issue


Looking at the data types, there are 3 columns which have been imported as `object` but hold dates and should be datetimes instead (effective_date, due_date and paid_off_time).

In the next cell I'm converting those 3 columns to datetime.

In [7]:
ld['effective_date'] = pd.to_datetime(ld['effective_date'])
ld['due_date'] = pd.to_datetime(ld['due_date'])
ld['paid_off_time'] = pd.to_datetime(ld['paid_off_time'])
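
The same conversion can also be written as a loop over the column names. The sketch below uses a small made-up frame (`sample`, hypothetical values) and passes `errors="coerce"`, which is my own addition here: it turns any value that cannot be parsed into `NaT` instead of raising. Whether silently coercing is appropriate depends on the data, so treat it as one option rather than the right answer.

```python
import pandas as pd

# Hypothetical rows; "unknown" is a deliberately malformed value for illustration
sample = pd.DataFrame({
    "effective_date": ["9/8/2016", "9/9/2016"],
    "due_date": ["10/7/2016", "unknown"],
    "paid_off_time": ["9/14/2016 19:31", None],
})

date_cols = ["effective_date", "due_date", "paid_off_time"]

# Convert each column; unparseable or missing entries become NaT
for col in date_cols:
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

print(sample.dtypes)
print(sample.isna().sum())
```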

Verifying that it worked:

In [8]:
ld.info()

## Describe + histogram


Moving on with the analysis: next I'm going to take a closer look at the numeric variables with the `describe` and `hist` methods:

In [9]:
ld.describe()

In [10]:
ld.hist(bins=30, figsize=(20,15))

## Taking a closer look at the loan_status

From a bank's perspective, the loan_status column should be the single most interesting column. At the end of the day, it all comes down to whether the loan is paid or not.

In the next few cells I'm going to do some more data exploration with emphasis on the loan_status column.

In [11]:
ld.groupby('loan_status').count()
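
If only the number of loans per status is needed, `value_counts` gives a more compact view than counting every column, and `normalize=True` turns the counts into shares. A minimal sketch on a made-up series (the status labels and their frequencies here are hypothetical):

```python
import pandas as pd

# Hypothetical loan_status values, not taken from the real dataset
status = pd.Series(
    ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF"],
    name="loan_status",
)

# Absolute number of loans per status
counts = status.value_counts()
print(counts)

# Share of loans per status (fractions that sum to 1)
shares = status.value_counts(normalize=True)
print(shares)
```

On the real data this would be `ld['loan_status'].value_counts()`.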

In [12]:
# One 2x2 grid of count plots, each splitting loan_status by a different column
fig, axs = plt.subplots(2, 2, figsize=(20, 15))

sns.countplot(x="loan_status", hue="Principal", data=ld, ax=axs[0][0])
axs[0][0].set_title("Distribution of Principal Loan Amount by Loan Status")

sns.countplot(x="loan_status", hue="terms", data=ld, ax=axs[0][1])
axs[0][1].set_title("Distribution of Loan Payoff Terms by Loan Status")

sns.countplot(x="loan_status", hue="Gender", data=ld, ax=axs[1][0])
axs[1][0].set_title("Distribution of Gender by Loan Status")

sns.countplot(x="loan_status", hue="education", data=ld, ax=axs[1][1])
axs[1][1].set_title("Distribution of Education by Loan Status")

In [13]:
plt.figure(figsize=(12,8))
plt.title('Distribution of Past Due Days by Loan Status')
sns.countplot(x='past_due_days',hue='loan_status', data=ld)

In [14]:
plt.figure(figsize=(12,8))
plt.title('Distribution of Age by Loan Status')
sns.countplot(x='age',hue='loan_status', data=ld)
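
With one bar group per distinct age, a plot like this can get crowded. One possible alternative is to group ages into bins with `pd.cut` and cross-tabulate them against the loan status. The bin edges, labels and rows below are my own assumptions for illustration, not something derived from the dataset.

```python
import pandas as pd

# Hypothetical rows, not real data
sample = pd.DataFrame({
    "age": [22, 27, 31, 35, 44, 50],
    "loan_status": ["PAIDOFF", "COLLECTION", "PAIDOFF",
                    "PAIDOFF", "COLLECTION", "PAIDOFF"],
})

# Assumed bin edges; intervals are right-inclusive, e.g. (25, 35]
bins = [18, 25, 35, 45, 55]
labels = ["19-25", "26-35", "36-45", "46-55"]
sample["age_group"] = pd.cut(sample["age"], bins=bins, labels=labels)

# Number of loans per age group and loan status
table = pd.crosstab(sample["age_group"], sample["loan_status"])
print(table)
```

The resulting `age_group` column could also be passed as `x` to `sns.countplot` in place of `age`.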

## The dates

Finally, let's also look at the date columns.

In [15]:
plt.figure(figsize=(20, 8))
plt.title('Distribution of Effective Date by Loan Status')
sns.countplot(x='effective_date', hue='loan_status', data=ld)
# Rotate the tick labels after plotting so the rotation applies to the final ticks
plt.xticks(rotation=45)

In [16]:
plt.figure(figsize=(20, 8))
plt.title('Distribution of Due Date by Loan Status')
sns.countplot(x='due_date', hue='loan_status', data=ld)
# Rotate the tick labels after plotting so the rotation applies to the final ticks
plt.xticks(rotation=45)
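
Because `countplot` treats each date as a category, the tick labels may show full timestamps depending on the seaborn version. A possible workaround is to format the dates as short strings first. The sketch below does that on a made-up frame (hypothetical values; `effective_day` is a helper column I introduce here) and also prints the same counts as a table, which is the numeric counterpart of the plot.

```python
import pandas as pd

# Hypothetical rows, not real data
sample = pd.DataFrame({
    "effective_date": pd.to_datetime(
        ["2016-09-08", "2016-09-08", "2016-09-09", "2016-09-10"]
    ),
    "loan_status": ["PAIDOFF", "COLLECTION", "PAIDOFF", "PAIDOFF"],
})

# Format each date as a short string such as "2016-09-08"
sample["effective_day"] = sample["effective_date"].dt.strftime("%Y-%m-%d")

# Loans per day and status; combinations that never occur are filled with 0
table = pd.crosstab(sample["effective_day"], sample["loan_status"])
print(table)
```

The `effective_day` column could then be used as `x` in `sns.countplot` instead of the datetime column.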