# <b>Extreme Data Challenge</b>

##  Today's Mission
- Your objective is to devise the best possible model to predict successful/default loans using Lending Club loan data.

- The training data consists of 100000 loans, each labeled either 1 (successful) or 0 (default), with 33 categorical and numerical features. The testing data consists of 50000 loans.

- A data dictionary file is included as well. It is a table explaining what each feature means.

- You will be judged on how much money your model makes. You will apply your model to the testing dataset by making predictions on it and scoring them. Assume that each loan is 1000 dollars and the interest rate is 10 percent. That means for every loan you issue that is successfully repaid, you will earn 100 dollars, and for every loan you issue that defaults, you will lose 1000 dollars.
    
        Profit = 100*(Number of True Positives) - 1000*(Number of False Positives) 

- Use all the tools at your disposal and try all the models we've learned in class. Refer to past class notes for help. Be sure to use model evaluation techniques such as ROC curves, confusion matrices, recall/precision, etc.

- To optimize your model, find the right combination of features and the right model with the right parameters. Get creative!

- Remember to use your time wisely; it will go by fast. Communicate among yourselves often.
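
The payoff rule above implies a break-even point worth keeping in mind: since one default wipes out the gain from ten repaid loans, issuing loans only pays off when more than 10 of every 11 issued loans are repaid. A minimal sketch of that arithmetic follows; `profit` and `break_even_precision` are hypothetical helper names, and the counts are made up for illustration rather than taken from the challenge data.

```python
GAIN_PER_REPAID = 100     # interest earned on a repaid 1000-dollar loan at 10 percent
LOSS_PER_DEFAULT = 1000   # principal lost on a loan that defaults


def profit(true_positives, false_positives):
    """Profit = 100*(Number of True Positives) - 1000*(Number of False Positives)."""
    return GAIN_PER_REPAID * true_positives - LOSS_PER_DEFAULT * false_positives


def break_even_precision():
    """Share of issued loans that must be repaid for the profit to be exactly zero."""
    return LOSS_PER_DEFAULT / (GAIN_PER_REPAID + LOSS_PER_DEFAULT)


# Illustrative counts only, not results from the challenge data.
print(profit(900, 100))                  # 90% of issued loans repaid -> -10000
print(profit(950, 50))                   # 95% of issued loans repaid -> 45000
print(round(break_even_precision(), 3))  # -> 0.909
```

In other words, a model that approves almost everything can lose money even with high accuracy; precision on the "successful" class is what drives profit here.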
   

### Online resources on Lending Club loan data
Kaggle Page: https://www.kaggle.com/wendykan/lending-club-loan-data. Make sure to check out the kernels section.

Y Hat tutorial (It's in R, but it's still useful): http://blog.yhat.com/posts/machine-learning-for-predicting-bad-loans.html

Blog tutorial on the data from Kevin Davenport: http://kldavenport.com/lending-club-data-analysis-revisted-with-python/


In [None]:
#Imports and set pandas options
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
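# Use the full option names; abbreviated patterns can be ambiguous in newer pandas versions
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)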

In [None]:
# Load in training data.
# Loan_status column is the target variable. Remember to drop it from df.
df = pd.read_csv("loan_training_data.csv")

In [None]:
#Load in data dictionary
# Loan S
data_dict = pd.read_csv("the_data_dictionary.csv")
data_dict

In [None]:
df.head()

### Ready, Set, Go!!

In [None]:
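# Strip the text from emp_length (e.g. '10+ years' -> '10 '); .str methods leave missing values as NaN instead of raising
df['emp_length'] = df['emp_length'].str.replace('years', '', regex=False).str.replace('year', '', regex=False).str.replace('+', '', regex=False)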

In [None]:
df.head()

In [None]:
df['term']=df['term'].apply(lambda x: x.replace('months',''))

In [None]:
df.head()

In [None]:
df1=pd.get_dummies(df)

In [None]:
df1.head()

In [None]:
df1.groupby('loan_status').mean().round()

In [None]:
df1.corr()

In [None]:
cmap = sb.diverging_palette(220, 10, as_cmap=True)
correlations=df1.corr()
sb.heatmap(correlations, cmap=cmap)

In [None]:
k = 10 #number of variables for heatmap
cols = correlations.nlargest(k, 'loan_status')['loan_status'].index
cm = np.corrcoef(df1[cols].values.T)
sb.set(font_scale=1.5)
hm = sb.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

In [None]:
# Extract feature columns
feature_cols = df1.drop('loan_status',axis=1)
# Extract target column 'loan_status'
target_col = df1.loan_status
# Show the list of columns
print ("Feature columns:\n{}".format(list(feature_cols.columns)))
print ("\nTarget column: {}".format(target_col.name))
# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = feature_cols
y_all = target_col
# Show the feature information by printing the first five rows
print ("\nFeature values:")
print (X_all.head())
print (y_all.head())

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

from sklearn.metrics import recall_score as recall
from sklearn import preprocessing

from sklearn.metrics import roc_curve, roc_auc_score

from time import time
from sklearn.metrics import f1_score

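# sklearn.grid_search and sklearn.cross_validation were removed in scikit-learn 0.20;
# GridSearchCV, train_test_split and cross_val_score now live in sklearn.model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

from sklearn.model_selection import train_test_split, cross_val_score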
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import metrics

In [None]:
X_all_scaled = preprocessing.scale(X_all)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_all_scaled, y_all, train_size=0.8, random_state=1)

In [None]:
# Define your models. These are regressors; their continuous predictions serve as scores for the ROC curve

clf_a = LinearRegression()
clf_b = DecisionTreeRegressor(random_state = 0)
clf_c = RandomForestRegressor(n_estimators = 5, random_state = 0)
clf_d = SVR(kernel = 'rbf')

classifiers = (clf_a, clf_b, clf_c, clf_d)

In [None]:
clf_a = clf_a.fit(X_train, y_train)
pred_probs_clf_a = clf_a.predict(X_test)
fpr, tpr, thres = roc_curve(y_test, pred_probs_clf_a)
roc_auc_score(y_test, pred_probs_clf_a)

In [None]:
clf_b = clf_b.fit(X_train, y_train)
pred_probs_clf_b = clf_b.predict(X_test)
fpr, tpr, thres = roc_curve(y_test, pred_probs_clf_b)
roc_auc_score(y_test, pred_probs_clf_b)

In [None]:
clf_c = clf_c.fit(X_train, y_train)
pred_probs = clf_c.predict(X_test)
fpr, tpr, thres = roc_curve(y_test, pred_probs)
roc_auc_score(y_test, pred_probs)

In [None]:
clf_d = clf_d.fit(X_train, y_train)
pred_probs = clf_d.predict(X_test)
fpr, tpr, thres = roc_curve(y_test, pred_probs)
roc_auc_score(y_test, pred_probs)

#### Can ROC score be improved with GridSearchCV, regularization and/or Gradient Descent?

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html



##### Try Lasso, Elastic Net or Ridge Regression


http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
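
One possible way to combine these ideas is to grid-search the regularization strength of a Ridge model and rank candidates by ROC AUC. The sketch below is only a starting point under stated assumptions: it runs on synthetic data from `make_classification` as a stand-in for the scaled loan features, the `alpha` values are arbitrary, and `auc_scorer` is a hypothetical helper that scores a regressor by the AUC of its continuous predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the scaled loan features and the 0/1 loan_status target.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=1)


def auc_scorer(estimator, X, y):
    """Score a fitted regressor by the ROC AUC of its continuous predictions."""
    return roc_auc_score(y, estimator.predict(X))


# Assumed candidate values; widen or refine the grid as time allows.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, scoring=auc_scorer, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)                             # alpha with the best cross-validated AUC
print(auc_scorer(search.best_estimator_, X_te, y_te))  # AUC on the held-out split
```

Swapping `Ridge()` for `Lasso()` or `ElasticNet()` should work the same way, with `l1_ratio` added to the grid for Elastic Net.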

___
### Time to test your results

#### Remember: for every loan you issue that is successfully repaid, you will earn 100 dollars and for every loan you issue that defaults, you will lose 1000 dollars.

##### Hint 1: use a confusion matrix to calculate your profit 


In [None]:
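from sklearn.metrics import confusion_matrix
#Profit calculator
def profit_calculator(y_true, y_preds):
    # y_preds must be hard 0/1 predictions, not continuous scores
    # labels=[0, 1] keeps the matrix 2x2 even if one class never appears
    cm = confusion_matrix(y_true, y_preds, labels=[0, 1])
    tp = cm[1,1]  # loans issued and repaid
    fp = cm[0,1]  # loans issued that defaulted
    return 100*tp - 1000*fp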

##### Hint 2: use your calculator such that:
profit_calculator(y_test, '...' )

'...' being your predictions
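
Because the models above return continuous scores rather than 0/1 labels, the scores need a cutoff before they can go into a confusion matrix. The following sketch shows one way to pick that cutoff by profit. The names `profit_from_labels` and `best_threshold` are hypothetical, and the labels, scores, and candidate thresholds are toy values made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def profit_from_labels(y_true, y_pred):
    """100 dollars per true positive minus 1000 dollars per false positive."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return 100 * cm[1, 1] - 1000 * cm[0, 1]


def best_threshold(y_true, scores, thresholds):
    """Return (threshold, profit) for the most profitable cutoff.

    A loan is issued (predicted 1) when its score is >= the threshold.
    """
    scores = np.asarray(scores)
    results = [(t, profit_from_labels(y_true, (scores >= t).astype(int)))
               for t in thresholds]
    return max(results, key=lambda pair: pair[1])


# Toy labels and scores, made up for illustration.
y_toy = np.array([1, 1, 1, 0, 1, 0])
scores_toy = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.20])
threshold, toy_profit = best_threshold(y_toy, scores_toy, [0.5, 0.75, 0.9])
print(threshold, toy_profit)  # -> 0.75 300
```

To keep the estimate honest, it is probably safer to choose the threshold on a validation split and only then report profit on the held-out data.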

---
## Bonus question

### Take your best model and apply PCA
#### How does that affect your ROC score?
---
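
As a starting point for the bonus question, the sketch below compares the ROC AUC of a model with and without a PCA step inside a pipeline. It rests on assumptions: the data is synthetic (`make_classification`), `LinearRegression` stands in for whatever your best model turns out to be, and keeping 95% of the variance is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features and the 0/1 loan_status target.
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=1)

# Same model twice; the second pipeline keeps enough components for 95% of the variance.
baseline = make_pipeline(StandardScaler(), LinearRegression())
with_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())

auc_baseline = roc_auc_score(y_te, baseline.fit(X_tr, y_tr).predict(X_te))
auc_pca = roc_auc_score(y_te, with_pca.fit(X_tr, y_tr).predict(X_te))

print("components kept:", with_pca.named_steps["pca"].n_components_)
print("AUC without PCA:", auc_baseline)
print("AUC with PCA:   ", auc_pca)
```

Putting the scaler and PCA inside the pipeline means both are fitted on the training split only, so the comparison is not influenced by the test rows.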