# BX Data Science - Take-Home Assessment
---
### Ryan Peabody
### 6 January, 2019
---

#### Package imports

In [45]:
import numpy as np
import pandas as pd

from IPython.display import display

###  Part 1: Data Exploration and Evaluation

The dataset of interest describes Lending Club loan originations from 2007 to 2015 and is available as a CSV file on kaggle.com. The full dataset gives 55 features for 887,379 loans, but for our purposes we will use the following 11 features:

> **1. loan_amnt** (loan amount): The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

> **2. funded_amnt** (funded amount): The total amount committed to that loan at that point in time.

> **3. term**: The number of payments on the loan. Values are in months and can be either 36 or 60.

> **4. int_rate** (interest rate): Interest Rate on the loan.

> **5. grade**: Lending Club assigned loan grade.

> **6. annual_inc** (annual income): The self-reported annual income provided by the borrower during registration.

> **7. issue_d** (issue date): The month and year in which the loan was funded.

> **8. dti** (debt to income ratio): A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

> **9. revol_bal** (revolving balance): Total credit revolving balance.

> **10. total_pymnt** (total payment): Payments received to date for total amount funded.

> **11. loan_status** (loan status): Current status of the loan.
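
To make the `dti` definition above concrete, here is a minimal sketch of how such a ratio could be computed. It assumes the value is expressed as a percentage and that monthly income is one twelfth of `annual_inc`; neither is stated in the definitions, and the function name `debt_to_income` and the input figures are purely illustrative.

```python
def debt_to_income(monthly_debt_payments: float, annual_income: float) -> float:
    """Return a debt-to-income ratio as a percentage.

    Assumption: monthly income is annual income divided by 12, and the
    result is scaled by 100. This is an illustration, not Lending Club's
    actual calculation.
    """
    monthly_income = annual_income / 12.0
    if monthly_income <= 0:
        raise ValueError("annual_income must be positive")
    return 100.0 * monthly_debt_payments / monthly_income


# Hypothetical borrower: $750 of monthly debt payments (excluding mortgage
# and the requested loan) on a self-reported income of $60,000 per year.
example_dti = debt_to_income(monthly_debt_payments=750.0, annual_income=60000.0)
print(round(example_dti, 2))  # prints 15.0
```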

---
Let's begin by loading the Lending Club loan dataset. Based on the definitions above, we can attempt to enforce a data type for each column as it is read in.

In [82]:
# Read in DataFrame of loan dataset, enforcing datatypes where possible
columns = {'loan_amnt': float, 'funded_amnt': float, 'term': str, 'int_rate': float, 'grade': str,
           'annual_inc': float, 'issue_d': str, 'dti': float, 'revol_bal': float, 'total_pymnt': float,
           'loan_status': str}
# Pass the column names as a list rather than a dict view
df = pd.read_csv("loan.csv", usecols=list(columns), dtype=columns)

# Take an initial look at the dimensions, and several rows of the loan dataset
df_display = df.tail().copy()
# Append a row showing the dtype of each column (display only)
df_display.loc['Data type', list(columns)] = df_display.dtypes

print('')
print('Total number of rows: {}'.format(df.shape[0]))
print('Total number of features: {}'.format(df.shape[1]))
display(df_display)


Total number of rows: 887379
Total number of features: 11


| | loan_amnt | funded_amnt | term | int_rate | grade | annual_inc | issue_d | loan_status | dti | revol_bal | total_pymnt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 887374 | 10000 | 10000 | 36 months | 11.99 | B | 31000 | Jan-2015 | Current | 28.69 | 14037 | 3971.88 |
| 887375 | 24000 | 24000 | 36 months | 11.99 | B | 79000 | Jan-2015 | Current | 3.9 | 8621 | 9532.39 |
| 887376 | 13000 | 13000 | 60 months | 15.99 | D | 35000 | Jan-2015 | Current | 30.9 | 11031 | 3769.74 |
| 887377 | 12000 | 12000 | 60 months | 19.99 | E | 64400 | Jan-2015 | Current | 27.19 | 8254 | 3787.67 |
| 887378 | 20000 | 20000 | 36 months | 11.99 | B | 100000 | Jan-2015 | Current | 10.83 | 33266 | 7943.76 |
| Data type | float64 | float64 | object | float64 | object | float64 | object | object | float64 | float64 | float64 |
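
The loading step above combines `usecols` and `dtype` in `pd.read_csv`. As a small self-contained sketch of that pattern, the snippet below reads a made-up three-row extract from memory instead of `loan.csv`. The names `sample_csv`, `sample_columns`, and `sample_df`, as well as the `id` column, are hypothetical and exist only for this illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for loan.csv (values are made up)
sample_csv = io.StringIO(
    "id,loan_amnt,term,grade\n"
    "1,10000,36 months,B\n"
    "2,24000,36 months,B\n"
    "3,13000,60 months,D\n"
)

# Map each column we want to keep to the type we want to enforce
sample_columns = {'loan_amnt': float, 'term': str, 'grade': str}

# usecols drops every column not named in the mapping (here, 'id');
# dtype parses the kept columns with the requested types
sample_df = pd.read_csv(sample_csv, usecols=list(sample_columns), dtype=sample_columns)

print(sample_df.dtypes)
print(sample_df)
```

Integer-looking values such as `10000` are read as floats because of the enforced type, which mirrors how `loan_amnt` is reported as `float64` in the output above.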


This looks reasonable. It is worth noting that our columns describing the loan term and issue dates