# ACCY 577 Final Project  

## Team Members
Please list your team members in this cell.

----



## Overview 

As an individual investor on the LendingClub platform, you want to understand how the individual components (i.e., features) of a loan, such as the amount loaned, the term length, the loan grade, or the interest rate, affect the profitability of a specific loan. In particular, you are interested in understanding how likely a loan is to be repaid. In this project, you will use the provided loan dataset to build a model that helps you pick a profitable loan portfolio.


## Criteria

You will work in groups to analyze the data to make recommendations based on the `lending_club_2007_2011_6_states.csv` dataset. 

You will complete three tasks for this group project:
1. A group report in the form of a Jupyter notebook,
2. A video presentation where your group will present your results, and 
3. Peer evaluation of the contributions of each member of your group.

Your final group report will be a single Jupyter notebook that will integrate Markdown, Python code, and the results from your code, such as data visualizations. Markdown cells should be used to explain any decisions you make regarding the data, to discuss any plots or visualizations generated in your notebook, and the results of your analysis. As a general guideline, the content should be written in a way that a fellow classmate (or an arbitrary data scientist/analyst) should be able to read your report and understand the results, implications, and processes that you followed to achieve your result.

All group members should present in the video presentation. You can use presentation software such as MS PowerPoint. The presentation should cover the steps in your analytics process and highlight your results. You don't need to explain Python code in the presentation. Focus on your analysis method and results. The presentation should take between eight and twelve minutes.

### Rubric
  - Notebook report (160 points)
  - Video presentation (100 points)
  - Peer assessment for your groupmates (20 points) - You will receive these 20 points as long as you submit your peer evaluation before the deadline.
  - Individual grades will be adjusted based on the peer evaluations from your group members.
  
### Submissions
- Each group:
    - Report in .ipynb format
    - Report in .html format
    - Presentation link
- Each individual:
    - Peer evaluation

### General

Your report should 
  - Use proper markdown.  
  - Include all of the code used for your analysis.
  - Have properly labeled plots (e.g., axis labels and titles).
  - Use a consistent style.
  
Your presentation should
  - Use clear, concise slides.
  - Support your points with facts (e.g., statistics, visualizations).
  - Avoid any coding details.
  - Be creative.
-----

## Get Started

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#display all dataframe columns in df.head()
pd.options.display.max_columns = None
#display long strings in dataframe
pd.options.display.max_colwidth = 300

#filter out warning messages
import warnings
warnings.filterwarnings('ignore')

### Business Understanding

#### Lending Club

LendingClub is an American peer-to-peer lending company, headquartered in San Francisco, California. It is the world's largest peer-to-peer lending platform.

LendingClub enables borrowers to create unsecured personal loans between \\$1,000 and \\$40,000. Investors can search and browse the loan listings on the LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee.

For more information about the company please check out the Wikipedia article about the [LendingClub](https://en.wikipedia.org/wiki/LendingClub).

**Updates**  
LendingClub closed down its Notes platform at the end of 2020, and individual investors can no longer invest in loans originated by LendingClub.

### Ungraded Discussions
This part is not graded. We'll discuss the following questions in the week 4 live session.  

- How does peer-to-peer lending work?
- How is a loan interest rate determined?
- How do you invest on LendingClub?
    - What are the different ways for investors to select loans to invest in?
    - How do investors manage risks?

Since you can't get this information from LendingClub now, you may read this review by a LendingClub investor to understand the details of investing on the platform: [Lending Club Review for New Investors](https://www.lendacademy.com/lending-club-review/).

### Load Data and Data Dictionary

In [2]:
data_dict = pd.read_csv('data_dictionary.csv')
data_dict

Unnamed: 0,ColumnName,Description
0,acc_now_delinq,The number of accounts on which the borrower is now delinquent.
1,addr_state,The state provided by the borrower in the loan application
2,annual_inc,The self-reported annual income provided by the borrower during registration.
3,application_type,Indicates whether the loan is an individual application or a joint application with two co-borrowers
4,chargeoff_within_12_mths,Number of charge-offs within 12 months
5,collection_recovery_fee,post charge off collection fee
6,collections_12_mths_ex_med,Number of collections in 12 months excluding medical collections
7,debt_settlement_flag,"Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company."
8,debt_settlement_flag_date,The most recent date that the Debt_Settlement_Flag has been set
9,delinq_2yrs,The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years


In [3]:
loan_df = pd.read_csv('lending_club_2007_2011_6_states.csv')
loan_df.sample(5)

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens,hardship_flag,disbursement_method,debt_settlement_flag,debt_settlement_flag_date
4393,35000,22950,22925.0,60 months,11.99,510.4,B,B5,Level 3 communications llc,10+ years,MORTGAGE,85000.0,Verified,Sep-2011,Fully Paid,n,debt_consolidation,Debt Consolidation Loan,105xx,NY,6.13,0.0,Jan-1992,1.0,,,10.0,0.0,16649,22.5,35.0,f,0.0,0.0,25113.274765,25085.92,22950.0,2163.27,0.0,0.0,0.0,Jul-2012,20535.07,,Feb-2019,0.0,,1,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N,
15132,1500,1500,1500.0,36 months,14.59,51.7,D,D1,Alamo Drafthouse Cinema,9 years,RENT,35004.0,Not Verified,May-2010,Fully Paid,n,other,Personal Loan,787xx,TX,21.7,0.0,Feb-2001,2.0,,,3.0,0.0,23806,99.2,7.0,f,0.0,0.0,1861.241862,1861.24,1500.0,361.24,0.0,0.0,0.0,May-2013,54.32,,May-2013,0.0,,1,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N,
638,18250,18250,18250.0,60 months,23.52,519.95,G,G3,Seminole Hard Rock Hotel & Casino,6 years,OWN,40000.0,Not Verified,Dec-2011,Fully Paid,n,debt_consolidation,no more CC,336xx,FL,19.2,1.0,Feb-1996,0.0,4.0,,8.0,0.0,20240,98.3,16.0,f,0.0,0.0,31196.222337,31196.22,18250.0,12946.22,0.0,0.0,0.0,Dec-2016,519.17,,Jun-2017,0.0,,1,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N,
6300,18000,18000,18000.0,36 months,10.99,589.22,B,B3,Solano College,< 1 year,RENT,54000.0,Verified,Jul-2011,Fully Paid,n,debt_consolidation,Debt Consolidation Loan,940xx,CA,22.98,0.0,Sep-2001,1.0,,,9.0,0.0,11799,32.1,20.0,f,0.0,0.0,20281.57304,20281.57,18000.0,2281.57,0.0,0.0,0.0,Jan-2014,220.95,,Jan-2014,0.0,,1,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N,
15181,12000,12000,12000.0,36 months,14.59,413.58,D,D1,Lockheed Martin,4 years,OWN,70000.0,Verified,May-2010,Charged Off,n,credit_card,Credit Card Payoff,951xx,CA,15.93,0.0,Jun-2003,0.0,,,8.0,0.0,30951,96.1,15.0,f,0.0,0.0,10996.42,10996.42,8116.25,2621.9,20.652499,237.62,3.06,Jul-2012,413.58,,Oct-2016,0.0,,1,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N,


In [4]:
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19908 entries, 0 to 19907
Data columns (total 58 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   loan_amnt                    19908 non-null  int64  
 1   funded_amnt                  19908 non-null  int64  
 2   funded_amnt_inv              19908 non-null  float64
 3   term                         19908 non-null  object 
 4   int_rate                     19908 non-null  float64
 5   installment                  19908 non-null  float64
 6   grade                        19908 non-null  object 
 7   sub_grade                    19908 non-null  object 
 8   emp_title                    18723 non-null  object 
 9   emp_length                   19409 non-null  object 
 10  home_ownership               19908 non-null  object 
 11  annual_inc                   19908 non-null  float64
 12  verification_status          19908 non-null  object 
 13  issue_d         

In [5]:
loan_df.describe()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,next_pymnt_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
count,19908.0,19908.0,19908.0,19908.0,19908.0,19908.0,19908.0,19908.0,19908.0,6998.0,1209.0,19908.0,19908.0,19908.0,19889.0,19908.0,19908.0,19908.0,19908.0,19908.0,19908.0,19908.0,19908.0,19908.0,19908.0,19908.0,0.0,19893.0,0.0,19908.0,19908.0,19893.0,19908.0,19595.0,19898.0
mean,11353.846444,11065.763763,10500.929748,12.089717,330.614254,71073.45,13.008619,0.143962,0.829466,35.841097,69.354839,9.278782,0.046715,13363.994826,49.424966,21.527627,0.0,0.0,12286.852391,11690.155107,9913.51999,2277.120129,1.489973,94.722382,11.861625,2636.253711,,0.0,,1.0,0.0,0.0,0.0,0.037969,0.0
std,7463.700492,7176.276661,7106.22965,3.698287,210.557434,69805.65,6.663658,0.489576,1.044978,21.59917,44.520279,4.414903,0.217827,15943.303849,28.198395,11.269006,0.0,0.0,9098.847567,8984.722396,7126.424699,2583.733214,7.964257,666.434583,141.910324,4412.964304,,0.0,,0.0,0.0,0.0,0.0,0.191393,0.0
min,500.0,500.0,0.0,5.42,15.69,4000.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,,1.0,0.0,0.0,0.0,0.0,0.0
25%,5750.0,5600.0,5000.0,9.45,171.2875,42000.0,7.84,0.0,0.0,19.0,0.0,6.0,0.0,3770.5,26.5,13.0,0.0,0.0,5670.592545,5239.8575,4800.0,687.595,0.0,0.0,0.0,222.53,,0.0,,1.0,0.0,0.0,0.0,0.0,0.0
50%,10000.0,10000.0,9000.0,11.86,285.78,60000.0,13.075,0.0,0.0,34.0,90.0,9.0,0.0,8876.5,50.3,20.0,0.0,0.0,10042.735817,9427.845,8000.0,1389.23,0.0,0.0,0.0,544.7,,0.0,,1.0,0.0,0.0,0.0,0.0,0.0
75%,15000.0,15000.0,14500.0,14.61,440.815,85000.0,18.2,0.0,1.0,51.0,104.0,12.0,0.0,16952.25,72.8,28.0,0.0,0.0,16681.68845,15983.44,14000.0,2842.795,0.0,0.0,0.0,3193.3525,,0.0,,1.0,0.0,0.0,0.0,0.0,0.0
max,35000.0,35000.0,35000.0,24.4,1302.69,6000000.0,29.99,11.0,8.0,106.0,129.0,44.0,3.0,148829.0,99.9,90.0,0.0,0.0,58480.139915,58438.37,35000.02,23480.14,180.2,29623.35,6543.04,35596.41,,0.0,,1.0,0.0,0.0,0.0,2.0,0.0


### Paid-Off Rate and Annual Return

The purpose of the project is to identify the most profitable loans. Two criteria help you evaluate the profitability of a loan portfolio: the paid-off rate and the annual return of the portfolio.

#### Paid-off rate
The dataset has loans initiated from 2007 to 2011. All loans have been either fully paid or charged off. So we will create a 'repaid' column by encoding loan status and map Charged Off to 0 and Fully Paid to 1. 

In [6]:
loan_df.loan_status.value_counts()

Fully Paid     16965
Charged Off     2943
Name: loan_status, dtype: int64

In [7]:
mapping_dict = {'Charged Off':0, 'Fully Paid':1}
loan_df['repaid'] = loan_df.loan_status.map(mapping_dict)
loan_df.repaid.value_counts()

1    16965
0     2943
Name: repaid, dtype: int64

The mean of the 'repaid' column is the paid-off rate of a loan portfolio. For example, the next code cell computes the overall paid-off rate of all loans in the dataset. About 85% of all loans are paid off.

In [8]:
loan_df.repaid.mean()

0.8521699819168174

#### Annual return

Calculating the exact return on a loan is complicated because the loan is repaid in monthly installments. In this project, we simplify the calculation by using the total payment and the funded amount. For charged-off loans, the total amount collected includes post-charge-off recoveries, so we use the following formula to calculate the total return:

$\text{Total Return} = \frac{\text{Total Payment} + \text{Recoveries}}{\text{Funded Amount}} - 1$

The total return alone doesn't reflect loan profitability, since loans have different terms. It's more accurate to compare annualized returns. There are only two terms in the dataset, 36 months and 60 months. The formula for the annualized return is:

$\text{Annualized Return} = (1 + \text{Total Return})^{1/\text{years}} - 1$

For example, if the total return of a 36-month loan is 10%, then the annualized return is `(1 + 0.1)**(1/3) - 1` ≈ `0.032`.

Again, this is not the true annualized return on a loan investment. But the goal of this project is to identify loans to invest in, so we only need a benchmark to evaluate loan performance.
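As a quick check of the formula, here is a minimal sketch that annualizes the same 10% total return over both loan terms. The helper name `annualized_return` is only illustrative and is not part of the provided notebook code.

```python
def annualized_return(total_return, years):
    """Convert a total return earned over `years` years into a compound annual rate."""
    return (1 + total_return) ** (1 / years) - 1

# A 10% total return on a 36-month loan
r36 = annualized_return(0.10, 3)
# The same 10% total return spread over a 60-month loan
r60 = annualized_return(0.10, 5)

print(f"36-month loan: {r36:.4f}")
print(f"60-month loan: {r60:.4f}")
```

The same total return is worth less per year on the longer loan, which is why the two terms are compared on an annualized basis.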

In the following code cells, we first create total_return and annual_return for each loan, then we define a function to calculate annual return of a loan portfolio. The function calculates annual return for 36 month and 60 month loans in a portfolio separately. **Note:** You don't have to use this function in your project. You can define your own ways to evaluate profitability of a portfolio.

In [9]:
loan_df.term.value_counts()

36 months    14852
60 months     5056
Name: term, dtype: int64

In [10]:
loan_df['total_return'] = (loan_df.total_pymnt + loan_df.recoveries)/loan_df.funded_amnt-1
loan_df['loan_term_year'] = loan_df.term.apply(lambda x:3 if x=='36 months' else 5)
loan_df['annual_return'] = loan_df.apply(lambda x:(1+x.total_return)**(1/x.loan_term_year)-1, axis=1)

In [11]:
loan_df.sample(5)

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens,hardship_flag,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,repaid,total_return,loan_term_year,annual_return
11677,10000,6450,6450.0,36 months,5.79,195.61,A,A2,Leapfrog Online,1 year,RENT,48000.0,Not Verified,Nov-2010,Fully Paid,n,debt_consolidation,Debt Consolidation,601xx,IL,9.18,0.0,Sep-2002,0.0,,,8.0,0.0,4714,20.5,19.0,f,0.0,0.0,7041.950733,7041.95,6450.0,591.95,0.0,0.0,0.0,Dec-2013,209.42,,Feb-2019,0.0,,1,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N,,1,0.091775,3,0.029701
7783,5000,5000,5000.0,60 months,18.39,128.04,E,E2,Macy's,< 1 year,RENT,18000.0,Source Verified,May-2011,Charged Off,n,debt_consolidation,Lending Club Loan,939xx,CA,13.87,0.0,Jun-1997,2.0,,,5.0,0.0,2242,77.3,18.0,f,0.0,0.0,4481.4,4481.4,2357.38,2122.7,0.0,1.32,0.0,Apr-2014,128.04,,Feb-2017,0.0,,1,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N,,0,-0.103456,5,-0.021605
1024,18000,18000,18000.0,60 months,19.42,471.1,E,E3,,3 years,RENT,36000.0,Verified,Dec-2011,Fully Paid,n,other,personal,920xx,CA,3.27,1.0,Apr-2003,0.0,14.0,,4.0,0.0,5976,31.0,5.0,f,0.0,0.0,20512.904913,20512.9,18000.0,2512.9,0.0,0.0,0.0,Sep-2012,16746.62,,Oct-2012,0.0,,1,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N,,1,0.139606,5,0.026481
17995,25000,25000,14557.01824,36 months,14.11,855.73,D,D1,Executive 1 Financial,3 years,MORTGAGE,116736.0,Verified,Aug-2009,Charged Off,n,other,Bank of America loan,600xx,IL,15.53,0.0,Dec-2001,1.0,,,11.0,0.0,26438,48.7,30.0,f,0.0,0.0,4444.99,2628.46,2284.79,1134.21,0.0,1025.99,10.2,Dec-2009,855.73,,Oct-2016,0.0,,1,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N,,0,-0.781161,3,-0.397383
145,1400,1400,1400.0,36 months,7.9,43.81,A,A4,Dr.Boers,1 year,RENT,28800.0,Not Verified,Dec-2011,Fully Paid,n,other,auto repair,347xx,FL,10.0,0.0,Jun-2003,1.0,,,5.0,0.0,2016,9.9,7.0,f,0.0,0.0,1478.900477,1478.9,1400.0,78.9,0.0,0.0,0.0,Oct-2012,817.34,,Feb-2019,0.0,,1,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N,,1,0.056357,3,0.018444


In [12]:
loan_df.loan_term_year.value_counts()

3    14852
5     5056
Name: loan_term_year, dtype: int64

In [13]:
def get_portfolio_annual_return(df):
    '''
    Get annual return of 36 and 60 month loans in the portfolio.
    '''
    annual_return_36, annual_return_60 = 0, 0
    df_36 = df[df.loan_term_year==3]
    if(len(df_36)>0):
        return_36 = (df_36.total_pymnt.sum() + df_36.recoveries.sum())/df_36.funded_amnt.sum()-1
        annual_return_36 = (1+return_36)**(1/3)-1
    df_60 = df[df.loan_term_year==5]
    if(len(df_60)>0):
        return_60 = (df_60.total_pymnt.sum() + df_60.recoveries.sum())/df_60.funded_amnt.sum()-1
        annual_return_60 = (1+return_60)**(1/5)-1
    print (f'36 months loan:{len(df_36)}, Annual return:{round(annual_return_36, 4)}')
    print (f'60 months loan:{len(df_60)}, Annual return:{round(annual_return_60, 4)}')
    return annual_return_36, annual_return_60

We get annual returns of all loans in the dataset with the function in the next code cell. The 36-month loans have about 3.03% annual return and the 60-month loans have about 3.13% annual return.

In [14]:
# Overall return of all loans in the dataset
get_portfolio_annual_return(loan_df)

36 months loan:14852, Annual return:0.0303
60 months loan:5056, Annual return:0.0313


(0.03028789309763802, 0.031295585034884166)

### Confusion Matrix Function

The next code cell defines a function that plots the confusion matrix. You may use the function in your model evaluation.

In [15]:
# This method produces a colored heatmap that displays the relationship
# between predicted and actual types from a machine learning method.
def confusion(test, predict, labels, title='Confusion Matrix'):
    '''
        test: true label of test data, must be one dimensional
        predict: predicted label of test data, must be one dimensional
        labels: list of label names in ascending order of the encoded class values, e.g. ['negative', 'positive'] for classes 0 and 1
        title: plot title
    '''

    bins = len(labels)
    # Make a 2D histogram from the test and result arrays
    pts, xe, ye = np.histogram2d(test, predict, bins)

    # For simplicity we create a new DataFrame
    pd_pts = pd.DataFrame(pts.astype(int), index=labels, columns=labels )
    
    # Display heatmap and add decorations
    hm = sns.heatmap(pd_pts, annot=True, fmt="d")    
    hm.axes.set_title(title, fontsize=20)
    hm.axes.set_xlabel('Predicted', fontsize=18)
    hm.axes.set_ylabel('Actual', fontsize=18)

    return None

----

# Notebook Report Tasks (160 points)

There are four tasks you need to finish for this notebook report. Each task is worth the number of points shown below, for a total of 160 points.

- Data understanding and feature selection (60 points)
- Modeling (40 points)
- Construct loan portfolio (40 points)
- Conclusion (20 points)

Please include all needed Python code in this notebook. Each task should have a summary. 

**Please run all code cells above this cell before proceeding.**

----
## Task 1: Data understanding and feature selection (60 points)

Use domain knowledge and exploratory data analysis to make the initial feature selection. You need to justify why you discard or select a feature. For example, you may say, the following features are excluded since they are only available after a loan is initiated: feature1, feature2, ....

Please include all Python code used to help you make the selection. You can have as many code cells as needed, and you will need to summarize your selections and reasons in a markdown cell.

After Task 1, you should have a new DataFrame that contains the selected features. This DataFrame should also include the three features that help to calculate loan returns, total_pymnt, recoveries and funded_amnt. You will not use the three features to select loans, but you will need them to evaluate your model and portfolio.

You may use (but are not limited to) the following criteria to select features:

- Exclude features that are only available after loan initiation. For example, last_pymnt_amnt is not a feature you can use to help you select loans.
- Exclude features that have too many missing values **and** the missing values are hard to fill.
- Exclude categorical features that have too many categories.
- Exclude categorical features that have only one category.
- Exclude features that have just one value.
- Include features that have an impact on the paid-off rate and return.
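Several of these criteria can be screened mechanically before applying judgment. The sketch below is one possible approach, not a required method: it uses a tiny made-up frame in place of `loan_df`, and the thresholds `MAX_MISSING` and `MAX_CATEGORIES` are arbitrary assumptions you would need to tune for the real data.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for loan_df (values are illustrative only)
toy = pd.DataFrame({
    "grade": ["A", "B", "B", "C", "A", "D"],
    "policy_code": [1, 1, 1, 1, 1, 1],
    "next_pymnt_d": [np.nan] * 6,
    "emp_title": ["a", "b", "c", "d", "e", None],
    "annual_inc": [48000.0, 60000.0, 35000.0, 85000.0, 70000.0, 54000.0],
})

# One row per column: share of missing values and number of distinct values
summary = pd.DataFrame({
    "missing_frac": toy.isna().mean(),
    "n_unique": toy.nunique(dropna=True),
})

# Treat every non-numeric column as categorical
non_numeric = pd.Series(
    [not pd.api.types.is_numeric_dtype(toy[c]) for c in toy.columns],
    index=toy.columns,
)

# Hypothetical thresholds; tune them for the real dataset
MAX_MISSING = 0.5
MAX_CATEGORIES = 4

summary["flag"] = ""
summary.loc[summary["missing_frac"] > MAX_MISSING, "flag"] = "mostly missing"
summary.loc[summary["n_unique"] == 1, "flag"] = "single value"
summary.loc[non_numeric & (summary["n_unique"] > MAX_CATEGORIES), "flag"] = "too many categories"

print(summary)
```

A flag is only a prompt for a closer look. Whether a feature is available before loan initiation still has to be decided from the data dictionary and domain knowledge.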

You may use (but are not limited to) the following techniques:

- Domain knowledge (common sense). For example, annual income is likely to have a positive impact on the paid-off rate.
- Exploratory data analysis, which includes:
    - Descriptive statistics
    - Groupby
    - Pivot table
    - Visualization

You are **not** limited or required to use the techniques listed above.
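As an illustration of the groupby and pivot table techniques, the following sketch computes the paid-off rate per category. It uses a small made-up frame rather than the real dataset; on the project data you would apply the same calls to `loan_df` after creating the 'repaid' column.

```python
import pandas as pd

# Made-up sample standing in for loan_df (illustrative values only)
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "C", "C", "C", "D"],
    "term": ["36 months", "60 months", "36 months", "36 months",
             "60 months", "36 months", "60 months", "60 months"],
    "repaid": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Groupby: paid-off rate and number of loans per grade
rate_by_grade = toy.groupby("grade")["repaid"].agg(["mean", "count"])

# Pivot table: paid-off rate per grade and term (NaN where a combination has no loans)
rate_by_grade_term = toy.pivot_table(
    index="grade", columns="term", values="repaid", aggfunc="mean"
)

print(rate_by_grade)
print(rate_by_grade_term)
```

Reporting the count next to the mean helps you avoid over-interpreting rates that are based on only a handful of loans.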

### Ungraded Discussions
This part is not graded. We'll discuss the following questions in the week 5 live session.  

Should we select the following features as training data, and why?
- total_rec_int
- emp_title
- pymnt_plan
- pub_rec_bankruptcies

----
## Task 2: Modeling (40 points)

Construct a classification model to predict whether a loan will be paid in full. Column 'repaid' will be the label. The purpose of the model is to identify loans that are more likely to be paid in full.

#### Preprocessing

- Separate the label from the data. The label is the 'repaid' column.
- Encode categorical features.
- Handle missing values (drop or fill), and justify your choice.
- Split the dataset into training and test sets. Please set random_state to ensure reproducibility.
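The preprocessing steps above can be sketched as follows. This is one possible arrangement under stated assumptions, not the required solution: the frame is made up, one-hot encoding via `pd.get_dummies` and median filling are just example choices, and `test_size=0.25` and `random_state=42` are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample standing in for the Task 1 DataFrame (illustrative values only)
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000, 20000, 15000, 3000,
                  10000, 7000, 25000, 6000, 9000, 18000],
    "grade": ["A", "B", "B", "C", "D", "A", "C", "B", "D", "A", "C", "B"],
    "revol_util": [20.5, np.nan, 48.7, 77.3, 96.1, 9.9,
                   np.nan, 32.1, 98.3, 22.5, 50.3, 31.0],
    "repaid": [1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
})

# 1. Separate the label from the data
y = toy["repaid"]
X = toy.drop(columns="repaid")

# 2. Encode the categorical feature (one-hot, dropping the first level)
X = pd.get_dummies(X, columns=["grade"], drop_first=True)

# 3. Split first, stratifying on the label so both sets keep a similar class mix
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 4. Fill missing values using a statistic computed on the training set only
median_util = X_train["revol_util"].median()
X_train = X_train.fillna({"revol_util": median_util})
X_test = X_test.fillna({"revol_util": median_util})

print(X_train.shape, X_test.shape)
```

Computing the fill value after the split keeps information from the test set out of the training data.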

#### Modeling, Evaluation and Optimization

- Apply at least two classifiers, including LogisticRegression. (The reason is explained in Task 3.)
- If you choose a Support Vector Machine, use LinearSVC instead of SVC, because SVC is very slow with this dataset.
- Apply feature selection with the following techniques (you don't have to use all of them):
    - Filter methods
    - Wrapper methods
    - Embedded methods
- Optimize each classifier with cross-validation. Select the best values for the major hyperparameters of the models you choose.
- Evaluate the models with a classification report and a confusion matrix.
- Compare the models with an ROC plot.
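A minimal sketch of the tuning and evaluation steps for one classifier is shown below. It runs on synthetic data from `make_classification` with a class mix similar to the project data; the grid of `C` values, the `roc_auc` scoring choice, and the scaling step are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed loan data (about 85% positive class)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.15, 0.85], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scale the features, then fit a logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Cross-validated search over the regularization strength C (example grid)
grid = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)

# Evaluate the refit best model on the held-out test set
y_pred = grid.predict(X_test)
y_score = grid.predict_proba(X_test)[:, 1]
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print("Best parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
print(cm)
print("ROC AUC:", round(auc, 3))
```

To compare models visually, the same `y_score` values can be passed to `sklearn.metrics.roc_curve` and plotted for each classifier on one set of axes.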

#### Conclusion
Summarize your analysis in Task 2.

### Ungraded Discussions
This part is not graded. We'll discuss the following questions in the week 6 live session.  

- Divide the dataset into two portfolios, one with all the loans that are paid off and one with all the loans that are not paid off. What are the annual returns of these two portfolios?
- Create portfolios that each contain only one loan grade (i.e., A, B, C, ...), then compare and discuss the paid-off rate and the annual return of each portfolio.


----
## Task 3: Construct Loan Portfolio (40 points)

Construct a loan portfolio out of the test set with the help of your classification model. The portfolio can be a small subset of all loans in the dataset, for example, 50% of all loans in the test set.

For this task, finish the following mini tasks.
- Predict on the test set with the optimized model.
- Select the loans that the model predicts will be paid off.
- Calculate the paid-off rate of the selected loans. Is it better than the overall paid-off rate (85.2%)?
- Calculate the annual return of the selected loans.
- Assess the risk of your portfolio. You may use the variance of the annual return as a risk benchmark.
- Set class_weight to 'balanced' and to other options (e.g., class_weight={1:0.2, 0:0.8}), and use the new model to construct the loan portfolio. How does the accuracy change? How do the paid-off rate and return of the new portfolio change?
- Discuss other possible ways to construct a better loan portfolio. For example, what is the impact of only selecting loans from borrowers who don't have a bankruptcy record (pub_rec_bankruptcies==0)?
- Summarize your findings and the performance of your portfolio.
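The selection and evaluation steps can be sketched on a small made-up test set. The `pred` column is a hypothetical stand-in for your model's predictions, all loans here have a 3-year term for simplicity, and the return formula follows the one defined earlier in this notebook.

```python
import pandas as pd

# Made-up test-set rows; 'pred' is a hypothetical model output (1 = predicted paid off)
test_df = pd.DataFrame({
    "funded_amnt": [10000, 5000, 12000, 8000, 6000, 15000],
    "total_pymnt": [11500.0, 5600.0, 9000.0, 9100.0, 2500.0, 17400.0],
    "recoveries": [0.0, 0.0, 200.0, 0.0, 150.0, 0.0],
    "loan_term_year": [3, 3, 3, 3, 3, 3],
    "repaid": [1, 1, 0, 1, 0, 1],
    "pred": [1, 1, 1, 1, 0, 0],
})

# Keep only the loans the model predicts will be paid off
portfolio = test_df[test_df["pred"] == 1]

# Paid-off rate of the selected loans
paid_off_rate = portfolio["repaid"].mean()

# Portfolio-level return, using the notebook's simplified formula
total_return = (
    portfolio["total_pymnt"].sum() + portfolio["recoveries"].sum()
) / portfolio["funded_amnt"].sum() - 1
annual_return = (1 + total_return) ** (1 / 3) - 1

# Loan-level annual returns and their variance as a simple risk benchmark
loan_return = (portfolio["total_pymnt"] + portfolio["recoveries"]) / portfolio["funded_amnt"] - 1
loan_annual = (1 + loan_return) ** (1 / portfolio["loan_term_year"]) - 1
risk = loan_annual.var()

print(f"Loans selected: {len(portfolio)}")
print(f"Paid-off rate: {paid_off_rate:.2f}")
print(f"Annual return: {annual_return:.4f}")
print(f"Variance of loan-level annual return: {risk:.4f}")
```

With mixed terms, the 36-month and 60-month loans should be annualized separately, as `get_portfolio_annual_return` does.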

**Note:**  
Some machine learning models are not sensitive to class_weight changes. You may change some hyperparameter values to make the model more responsive to class_weight. For example, set max_depth to 10 for a RandomForestClassifier. This is an advanced topic and you are not required to do it. Of the models we learned, LogisticRegression is the most sensitive to class_weight changes. That's why you are asked to include LogisticRegression in your modeling choices in Task 2.
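To see how class_weight shifts the selection, here is a rough sketch that fits LogisticRegression with three class_weight settings on synthetic data and records how many loans each model selects. The data come from `make_classification`, not from the loan dataset, so the numbers only illustrate the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (about 85% positive class)
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.15, 0.85], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

settings = {
    "none": None,
    "balanced": "balanced",
    "custom": {1: 0.2, 0: 0.8},
}

results = {}
for name, class_weight in settings.items():
    model = LogisticRegression(class_weight=class_weight, max_iter=1000)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    selected = pred == 1  # loans the model would put in the portfolio
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "n_selected": int(selected.sum()),
        # Share of selected loans that were actually repaid (precision for class 1)
        "paid_off_rate": float(y_test[selected].mean()),
    }

for name, metrics in results.items():
    print(name, metrics)
```

Giving more weight to class 0 typically makes the model select fewer loans, so accuracy and the paid-off rate of the selected loans can move in different directions.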

#### Conclusion
Summarize your analysis in Task 3. 
- Is the classification model helpful in choosing a portfolio? 
- Does a portfolio with a higher paid-off rate always have a higher return?
- What other methods can you use to help you choose a portfolio with higher return and lower risk?

### Ungraded Discussions
This part is not graded. We'll discuss the following questions in the week 7 live session.  

- We'll use the classification model to help us select loans to invest in. Should we pursue a high accuracy score, high precision, or high recall?


----
## Task 4: Conclusion (20 points)

Summarize:
- Your findings
- The limitations of your analysis
- Possible ways to improve your analysis
- Anything else worth mentioning