# Your Name

----

# ACCY576 Final Project

## Overview

In this project, you will finish the first three steps of the CRISP-DM data analytics framework (business understanding, data understanding, and data preparation) using the LendingClub dataset. You will also start on the fourth step of the CRISP-DM framework, modeling.

We provide a template of the project report, which consists of 20 mini-tasks. The first 18 mini-tasks are worth 4 percentage points each. The last two mini-tasks require more effort, so they are worth more: the 19th mini-task is worth 10 percentage points, and the 20th mini-task is worth 18 percentage points. Thus, the total is 100 percentage points. The percentage that you earn will be multiplied by the 280 points for this project (see the syllabus) to determine the number of points that will go towards calculating your final grade. For example, if you earn 90%, you'll receive 252 points (0.9 × 280) for the final project.

All the visualizations should be properly labeled and titled.

You may add extra code cells if needed.

This is an individual project. You may use your notes, but you should not seek help from anyone who is currently enrolled in this course or who has taken this course before.

Please make sure to upload this completed file by the deadline of 11:59 p.m. on Sunday, March 15th.


## Table of Contents


[**Business Understanding**](#Business-Understanding)

[**Data Understanding and Data Preparation**](#Data-Understanding-and-Data-Preparation)

- [**Data File**](#Data-File)
   - [**Task 1: Load the Data Dictionary and the Data File**](#Task-1:-Load-the-Data-Dictionary-and-the-Data-File)
- [**Check Data Quality**](#Check-Data-Quality)
   - [**Task 2: Clean Up the annual_inc Column**](#Task-2:-Clean-Up-the-annual_inc-Column)
   - [**Task 3: Statistics of Numeric Columns**](#Task-3:-Statistics-of-Numeric-Columns)
- [**Exploratory Data Analysis - EDA**](#Exploratory-Data-Analysis---EDA)
   - [**Task 4: Loan Grade**](#Task-4:-Loan-Grade)
   - [**Task 5: Distribution of Interest Rate**](#Task-5:-Distribution-of-Interest-Rate)
   - [**Task 6: Loan Grade and Interest Rate**](#Task-6:-Loan-Grade-and-Interest-Rate)
   - [**Task 7: Loan Term and Interest Rate**](#Task-7:-Loan-Term-and-Interest-Rate)
   - [**Task 8: Loan by State**](#Task-8:-Loan-by-State)
   - [**Task 9: Borrowers Annual Income Distribution**](#Task-9:-Borrowers-Annual-Income-Distribution)
   - [**Task 10: Borrower Annual Income by State**](#Task-10:-Borrower-Annual-Income-by-State)
   - [**Task 11: Annual Income and Interest Rate**](#Task-11:-Annual-Income-and-Interest-Rate)
   - [**Task 12: Convert Date Column**](#Task-12:-Convert-Date-Column)
   - [**Task 13: Loan Issued Over Year**](#Task-13:-Loan-Issued-Over-Year)
   - [**Task 14: Interest Rate Change**](#Task-14:-Interest-Rate-Change)
   - [**Task 15: Loan Status**](#Task-15:-Loan-Status)
   - [**Task 16: More Investigation of Loan Status**](#Task-16:-More-Investigation-of-Loan-Status)
   - [**Task 17: Even More Investigation of Loan Status**](#Task-17:-Even-More-Investigation-of-Loan-Status)
   - [**Task 18: Loan Term and Loan Status**](#Task-18:-Loan-Term-and-Loan-Status)
   - [**Task 19: Loan Return**](#Task-19:-Loan-Return)
- [**Modeling and Evaluation**](#Modeling-and-Evaluation)
   - [**Task 20: Choose the Interest Rate**](#Task-20:-Choose-the-Interest-Rate)


[Back to Top](#Table-of-Contents)

## Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.

### Lending Club

LendingClub is an American peer-to-peer lending company, headquartered in San Francisco, California. It is the world's largest peer-to-peer lending platform.

LendingClub enables borrowers to create unsecured personal loans between \\$1,000 and \\$40,000. Investors can search and browse the loan listings on the LendingClub website and select the loans that they want to invest in based on the information supplied about the borrower, the loan amount, the loan grade, and the loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee.

For more information about the company, please check out the Wikipedia article about [LendingClub](https://en.wikipedia.org/wiki/LendingClub).


### Objective

In this project, we will explore the loan and the borrower information, loan payoff rate and loan returns.


[Back to Top](#Table-of-Contents)

## Data Understanding and Data Preparation
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

Data understanding is always done together with data preparation, which cleans up data, deals with missing values and creates new features through feature engineering.

### Data File

The data file **lending_club_2007_2011_6_states.csv** contains the loan and borrower information for loans initiated from 2007 to 2011 in six states: California, New York, Florida, Texas, New Jersey, and Illinois.

The data dictionary file **data_dictionary.csv** contains descriptions of all the columns in the data file.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#display all dataframe columns in df.head()
pd.options.display.max_columns = None
#display long strings in dataframe
pd.options.display.max_colwidth = 300

#filter out warning messages
import warnings
warnings.filterwarnings('ignore')


[Back to Top](#Table-of-Contents)
### Task 1: Load the Data Dictionary and the Data File
In the next code cell, please load the data dictionary file, `data_dictionary.csv`, to a DataFrame **data_dict** and display the **whole** DataFrame.

#### Your Code for the Data Dictionary

Load the `lending_club_2007_2011_6_states.csv` to a DataFrame **loan_df** and display the first 5 rows.

#### Your Code for the Data

### Check Data Quality
In this section of the CRISP-DM framework, you will perform tasks to check data quality. The most common check is for missing values. We can also do some basic data cleaning, such as cleaning up currency fields. Sometimes a currency field needs to be converted from a string to a float after removing currency symbols like '$', commas ',', and the parentheses that enclose negative values.
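
As a rough illustration of the currency clean-up described above, here is a minimal sketch on made-up values. The series name `amount` and the sample strings are assumptions for illustration only; they are not taken from the project data.

```python
import pandas as pd

# Toy data; the name "amount" is hypothetical, not a column from the project file.
raw = pd.Series(["$1,200.50", "(350.00)", "$85,000", "($2,000)"], name="amount")

# Accounting-style negatives are wrapped in parentheses.
is_negative = raw.str.startswith("(") & raw.str.endswith(")")

# Drop the currency symbol, thousands separators, and parentheses, then convert.
cleaned = raw.str.replace(r"[$,()]", "", regex=True).astype(float)

# Restore the sign for values that were enclosed in parentheses.
cleaned = cleaned.where(~is_negative, -cleaned)
print(cleaned.tolist())
```

Recording which values were negative before stripping the parentheses matters, because the sign information is lost once the characters are removed.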


[Back to Top](#Table-of-Contents)
### Basic Dataframe Info

Discuss the basic information of the data briefly.

You can use the `info()` method to print basic DataFrame information. You may also use `df.isnull().sum()` to check the count of null values in each column.

Please write your code in the code cell and your discussion in the markdown cell. The discussion can be very brief with just a few sentences.

You may add extra code cells if needed.

In [None]:
loan_df.info()

In [None]:
loan_df.isnull().sum()

#### Discussion  

Consider the shape of the dataframe and the number of null values. You might want to ask yourself questions like:
- Are there any columns that you think should be excluded because there are too many null values? 
- What could you do to fill in null values?
- Are there any other questions?
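
One possible way to think about the questions above is sketched below on a toy frame. The column names, the 50% cutoff, and the choice of the median as a fill value are all illustrative assumptions, not requirements of the project.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names and the 50% cutoff are illustrative assumptions.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "some_missing": [10.0, np.nan, 30.0, 50.0],
    "complete": ["a", "b", "c", "d"],
})

# Fraction of null values per column.
null_share = toy.isnull().mean()

# Keep only columns where at most half of the values are missing.
kept = toy.loc[:, null_share <= 0.5]

# Fill the remaining numeric gaps with the column median.
filled = kept.fillna(kept.median(numeric_only=True))
print(null_share)
print(filled)
```

Whether dropping or filling is appropriate depends on what the column means, so a threshold like this is only a starting point for the discussion.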

[Back to Top](#Table-of-Contents)
### Task 2: Clean Up the annual_inc Column

- Strip "$" and "," from annual_inc.
- Convert the datatype to float.
- Display the first 5 rows.

After this task, the annual_inc column in loan_df should be of type float.

#### Your Code

[Back to Top](#Table-of-Contents)
### Task 3: Statistics of Numeric Columns

Print out the descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution.

Hint: use the `describe()` function.

Briefly discuss the following columns:
- funded_amnt
- int_rate
- annual_inc
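
When reading `describe()` output, comparing the mean with the median (the `50%` row) is one quick way to comment on the shape of a distribution. The sketch below uses made-up numbers; only the column name mirrors the dataset.

```python
import pandas as pd

# Toy values; the column name mirrors annual_inc but the numbers are made up.
toy = pd.DataFrame({"annual_inc": [30000.0, 45000.0, 60000.0, 75000.0, 400000.0]})

summary = toy["annual_inc"].describe()
print(summary)

# A mean well above the median (the 50% row) suggests a right-skewed column.
print(summary["mean"] > summary["50%"])
```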

#### Your Code

#### Your Discussion


Discussion:

### Exploratory Data Analysis - EDA
EDA is an approach to analyzing data sets to summarize their main characteristics, often with visualizations.

[Back to Top](#Table-of-Contents)
### Task 4: Loan Grade
How many different loan grades are there in the dataset?

- Plot a bar chart for grade. The x-axis is the loan grade and the y-axis is the count of loans. The plot should have x- and y-axis labels and a proper title.
- Briefly discuss the results

Hint: You may use seaborn countplot. To sort the loan grade you may set `order=sorted(loan_df.grade.unique())` in the countplot.
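
The `order` argument only needs a list of category labels, so it can help to inspect that list before plotting. The sketch below, on made-up labels, contrasts alphabetical order with frequency order.

```python
import pandas as pd

# Toy categories; the values are made up for illustration.
labels = pd.Series(["B", "A", "C", "B", "B", "A"])

# Alphabetical order, as produced by sorted(series.unique()).
alphabetical = sorted(labels.unique())

# Frequency order (most common first), as produced by series.value_counts().index.
by_frequency = labels.value_counts().index.tolist()

print(alphabetical)
print(by_frequency)
```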

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 5: Distribution of Interest Rate

- Plot a histogram of int_rate.

- Discuss the distribution of the interest rate briefly.

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 6: Loan Grade and Interest Rate

Explore the relationship between loan grade and interest rate.

- Get average interest rate for each loan grade.
 - Group by grade column.
 - Apply aggregate function mean on int_rate column.
- Visualize the average interest rate of each loan grade with a bar chart. Make sure the plot has proper labels and title.
- Discuss the relationship briefly

**Hint**: You can directly plot a bar chart on an aggregated groupby object, i.e., `ax = df.groupby(...).agg(...).plot.bar()`
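
The pattern in the hint can be sketched on a toy frame as follows. The column names `group` and `value` are hypothetical, and the labels are placeholders to show where axis labels and a title are set.

```python
import pandas as pd

# Toy data; "group" and "value" are hypothetical column names.
toy = pd.DataFrame({
    "group": ["x", "x", "y", "y", "y"],
    "value": [1.0, 3.0, 10.0, 20.0, 30.0],
})

# Aggregate first, then plot the aggregated result directly.
means = toy.groupby("group").agg({"value": "mean"})
ax = means.plot.bar(legend=False)

# plot.bar() returns a matplotlib Axes, which is used to label the chart.
ax.set_xlabel("Group")
ax.set_ylabel("Mean value")
ax.set_title("Mean value by group")
```

Keeping the returned `ax` is what makes it possible to add the labels and title afterwards.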

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 7: Loan Term and Interest Rate

Explore the relationship between loan term and interest rate.

- Get average interest rate for each loan term.
 - Group by term column.
 - Apply aggregate function mean on int_rate column.
- Visualize the average interest rate of each term with a bar chart. Make sure the plot has proper labels and title.
- Discuss the relationship briefly

**Hint**: You can directly plot a bar chart on an aggregated groupby object, i.e., `ax = df.groupby(...).agg(...).plot.bar()`

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 8: Loan by State

There are loans from six states in the dataset. Please explore the count of loans in each state via visualization.

- Visualize loan counts in each state. Make sure the plot has proper labels and title. (Hint: if you use seaborn countplot, you may sort the bars with `order=loan_df.addr_state.value_counts().index`)
- Discuss the result briefly


#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 9: Borrowers Annual Income Distribution

- Plot a histogram of all borrowers' annual income.
- Plot another histogram of annual income that is less than $250,000
- Compare the plots and discuss briefly.

**Hint**: You may use the `hist()` function on the annual_inc column to plot a histogram.

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 10: Borrower Annual Income by State

- Visualize the **median** income of borrowers from each state with a bar chart.
 - Group by addr_state column.
 - Apply aggregate function median on annual_inc column.

- Discuss the result briefly

**Hint**: You can directly plot a bar chart on an aggregated groupby object, i.e., `ax = df.groupby(...).agg(...).plot.bar()`

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 11: Annual Income and Interest Rate

- Plot a scatter plot of annual income and interest rate.
- Discuss the result briefly. Does the scatter plot reveal any correlation between annual income and interest rate?
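
A scatter plot gives a visual impression of correlation; a correlation coefficient can back up that impression with a number. The sketch below uses made-up values and hypothetical column names `x` and `y`, and it is offered as an optional complement, not as part of the task.

```python
import pandas as pd

# Toy numbers, made up for illustration; they are not taken from the loan data.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [10.0, 8.0, 6.0, 4.0, 2.0],
})

# Pearson correlation ranges from -1 to 1; values near 0 mean no linear relationship.
r = toy["x"].corr(toy["y"])
print(round(r, 2))
```

In this toy case `y` falls by exactly 2 for every unit of `x`, so the coefficient sits at the negative extreme; real data will usually land somewhere in between.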

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 12: Convert Date Column

- Convert the **issue_d** column to a datetime type
- Create a new column, **issue_year**, and set it to the year a loan is issued
- Display the first five rows
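
A minimal sketch of the conversion on toy strings is shown below. The `Mon-YYYY` layout (for example `Dec-2011`) is an assumption about how the dates are stored, so check the actual values in the column before reusing the format string.

```python
import pandas as pd

# Toy strings; the "Mon-YYYY" layout is an assumption, so check the real column first.
toy = pd.DataFrame({"issue_d": ["Dec-2011", "Jan-2010", "Jul-2008"]})

# An explicit format avoids ambiguous parsing and raises an error on unexpected values.
toy["issue_d"] = pd.to_datetime(toy["issue_d"], format="%b-%Y")

# The .dt accessor exposes datetime components such as the year.
toy["issue_year"] = toy["issue_d"].dt.year
print(toy)
```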

#### Your Code

[Back to Top](#Table-of-Contents)
### Task 13: Loan Issued Over Year

Explore the number of loans issued through LendingClub from 2007 to 2011.
- Get a count of loans in each year (Hint: group by issue_year).
- Plot a line chart to see the trend; the x-axis is year and the y-axis is count.
- Discuss the result briefly.

**Hint**: You can directly plot a line chart on an aggregated groupby object, i.e., `ax = df.groupby(...).agg(...).plot.line()`

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 14: Interest Rate Change

Explore the interest rate change in each state over the years.

- Create pivot table, set
  - index to issue_year
  - columns to addr_state
  - values to int_rate
  - aggfunc to median
- Plot a line chart to compare median interest rate change over years of the 6 states.
- Discuss the result briefly

**Hint**: You can directly plot a line chart on a pivot table, i.e., `ax = df.pivot_table(...).plot.line()`
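
To show how the four `pivot_table` arguments fit together, here is a small sketch on toy data. The column names `year`, `region`, and `rate` are hypothetical stand-ins, and the numbers are made up.

```python
import pandas as pd

# Toy data; "year", "region", and "rate" are hypothetical column names.
toy = pd.DataFrame({
    "year":   [2010, 2010, 2010, 2011, 2011, 2011],
    "region": ["N", "N", "S", "N", "S", "S"],
    "rate":   [5.0, 7.0, 9.0, 6.0, 8.0, 12.0],
})

# One row per year, one column per region, each cell the median rate.
pt = toy.pivot_table(index="year", columns="region", values="rate", aggfunc="median")
print(pt)
```

Because each column of the pivot table becomes its own line when plotted, this layout is what makes a per-group comparison over time straightforward.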

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 15: Loan Status

The loans in the dataset were issued no later than 2011 and the longest loan term is 5 years, so all the loans are either paid off or charged off.

Discuss loan status and its relationship with loan/borrower information.

- Create pivot table **pt_year**, set
  - index to issue_year
  - columns to loan_status
  - values to int_rate
  - aggfunc to count
- Create a new column `payoffRate` in the pivot table. Calculate the paid off rate for each year with the formula $\text{payoffRate} = \frac{\text{Fully Paid}}{\text{Fully Paid} + \text{Charged Off}}$.
- Display the pivot table.
- Discuss the result briefly
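
The mechanics of adding a ratio column to a count pivot table can be sketched on toy data as below. The rows are made up, and the status labels `Fully Paid` and `Charged Off` are assumed to match the values in the real `loan_status` column.

```python
import pandas as pd

# Toy data; the rows are made up and the status labels are assumed to match the dataset.
toy = pd.DataFrame({
    "issue_year":  [2010, 2010, 2010, 2010, 2011, 2011],
    "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid", "Charged Off",
                    "Fully Paid", "Charged Off"],
    "int_rate":    [7.5, 8.0, 11.0, 15.0, 9.0, 13.5],
})

# Counting a non-null column gives the number of loans in each cell.
pt = toy.pivot_table(index="issue_year", columns="loan_status",
                     values="int_rate", aggfunc="count")

# Share of loans that were fully paid in each year.
pt["payoffRate"] = pt["Fully Paid"] / (pt["Fully Paid"] + pt["Charged Off"])
print(pt)
```

The column used for `values` does not affect the result here as long as it has no missing entries, since only the row counts are used.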

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 16: More Investigation of Loan Status

Discuss loan status and its relationship with interest rate.

- Create a pivot table, **pt_intRate**, set
    - index to loan_status
    - values to int_rate
    - aggfunc to median
- Display the pivot table
- Discuss the result briefly

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 17: Even More Investigation of Loan Status

Discuss loan status and its relationship with the borrower's annual income.

- Create a pivot table, **pt_income**, set
    - index to loan_status
    - values to annual_inc
    - aggfunc to median
- Display the pivot table
- Discuss the result briefly

#### Your Code

#### Your Discussion


Discussion:

[Back to Top](#Table-of-Contents)
### Task 18: Loan Term and Loan Status

Explore the payoff rate of three- and five-year loans.

- Create pivot table **pt_term**, set
  - index to term
  - columns to loan_status
  - values to int_rate
  - aggfunc to count
- Calculate the paid off rate for loans of different terms
  - Create a new column `payoffRate` in the **pt_term** pivot table that you created. Calculate the paid off rate for each loan term with the formula $\text{payoffRate} = \frac{\text{Fully Paid}}{\text{Fully Paid} + \text{Charged Off}}$.
- Display the pivot table.
- Discuss the result briefly

#### Your Code

#### Your Discussion

Discussion:

[Back to Top](#Table-of-Contents)
### Task 19: Loan Return

Calculate the return of the loan portfolios for the 36-month and 60-month terms. In other words, what is the total return for all loans for each term?

Calculating the loan return exactly is complicated because a loan is repaid in monthly installments. In this project, we simplify the calculation by using the total payment and the funded amount. For charged-off loans, total payment includes post-charge-off recoveries. So we can use the following formula to calculate the total return:

$TotalReturn = \frac{Total Payment + Recoveries}{Funded Amount} - 1$

The total return doesn't reflect loan profitability because loans have different terms. It's more accurate to compare annualized returns. There are only two terms in the dataset, 36 months and 60 months. The formula to calculate the annualized return is:

$Annualized Return = (1+Total Return)^{(1/years)} - 1$. 

For example, if the total return of the 36-month loan portfolio is 10%, then the annualized return = `(1 + 0.1)**(1/3) - 1 = 0.032`.
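
The annualization formula can be wrapped in a small helper, sketched below. The function name `annualized_return` is hypothetical and chosen only for this illustration; the first call reproduces the worked example above.

```python
def annualized_return(total_return: float, years: float) -> float:
    """Convert a total return earned over `years` into a compound annual return."""
    return (1 + total_return) ** (1 / years) - 1


# The worked example: a 10% total return over a 36-month (3-year) term.
print(round(annualized_return(0.10, 3), 3))  # 0.032

# The same total return over a 60-month (5-year) term is lower per year.
print(annualized_return(0.10, 5) < annualized_return(0.10, 3))  # True
```

Expressing the term in years (3 for 36 months, 5 for 60 months) keeps the helper consistent with the exponent in the formula.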

#### Your Code

#### Your Discussion

Discussion:

### Modeling and Evaluation

[Back to Top](#Table-of-Contents)
### Task 20: Choose the Interest Rate

Assume that you are an investor who is evaluating three different loans. Assuming that you could invest in all three of them, and that they were part of a larger portfolio of loans in which you were investing, what interest rate would you set for the three different loans described below?

1. Loan 1 is for an individual seeking a three-year, \\$10,000, C-grade loan. This individual has an annual income of \\$85,000.
2. Loan 2 is for an individual seeking a five-year, \\$20,000, B-grade loan. This individual has an annual income of \\$100,000.
3. Loan 3 is for an individual seeking a three-year, \\$2,000, A-grade loan. This individual has an annual income of \\$30,000.



#### Your Code

#### Your Discussion

Discussion: