For this project we will explore publicly available data from [LendingClub.com](https://www.lendingclub.com). Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers whose profile suggests a high probability of paying you back. We will try to build a model that helps predict this.

Lending Club had a [very interesting year in 2016](https://en.wikipedia.org/wiki/Lending_Club#2016), so let's check out some of their data and keep the context in mind. This data is from before they even went public.

We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. You can download the data from [here](https://www.lendingclub.com/info/download-data.action) or just use the csv already provided. We recommend the provided csv, as it has been cleaned of NA values.

Here is what the columns represent:
* credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
* purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").
* int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
* installment: The monthly installments owed by the borrower if the loan is funded.
* log.annual.inc: The natural log of the self-reported annual income of the borrower.
* dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
* fico: The FICO credit score of the borrower.
* days.with.cr.line: The number of days the borrower has had a credit line.
* revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
* revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
* inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
* delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
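* pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
* not.fully.paid: The target column: 1 if the borrower did not pay back the loan in full, and 0 otherwise.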
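Two of the columns above are stored in transformed units, so a small sketch of how to read them may help. The values below are copied from the first row of the data preview shown later in this notebook; the variable names are only illustrative.

```python
import math

# Values copied from the first row of the data preview in this notebook.
log_annual_inc = 11.350407   # natural log of annual income
int_rate = 0.1189            # interest rate stored as a proportion

annual_income = math.exp(log_annual_inc)   # undo the natural log
rate_percent = int_rate * 100              # proportion -> percent

print(round(annual_income))
print(round(rate_percent, 2))
```

Undoing the log gives an income of roughly 85,000, and the stored rate of 0.1189 corresponds to 11.89%.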

# Import Libraries

**Import the usual libraries for pandas and plotting. You can import sklearn later on.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Get the Data

**Use pandas to read loan_data.csv as a dataframe called df.**

In [2]:
df = pd.read_csv('loan_data.csv')
df.head()

Unnamed: 0,credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
0,1,debt_consolidation,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,1,credit_card,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0
2,1,debt_consolidation,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0
3,1,debt_consolidation,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0
4,1,credit_card,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0


In [3]:
## Select non numerical categorical column names
mylist = list(df.select_dtypes(include=['object']).columns)
print(mylist)


unique=df.purpose.unique()
print("\n")
print("Unique Categorical features:",unique)

count=df.purpose.unique().size
print("\n")
print("Count of Categorical features:",count)

['purpose']


Unique Categorical features: ['debt_consolidation' 'credit_card' 'all_other' 'home_improvement'
 'small_business' 'major_purchase' 'educational']


Count of Categorical features: 7


In [4]:
## Create dummy variables for non numerical categorical variables
dummies = pd.get_dummies(df[mylist], prefix= mylist)
#print(dummies.head())

df.drop(mylist, axis=1, inplace = True) ## Drop Non numerical categorical columns
#df.head()

df=pd.concat([df,dummies], axis =1 ) ## added encoded categorical columns
df.head()

Unnamed: 0,credit.policy,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid,purpose_all_other,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business
0,1,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0,0,0,1,0,0,0,0
1,1,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0,0,1,0,0,0,0,0
2,1,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0,0,0,1,0,0,0,0
3,1,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0,0,0,1,0,0,0,0
4,1,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0,0,1,0,0,0,0,0
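The three steps above (encode, drop, concat) can also be written as a single `pd.get_dummies` call with the `columns` argument. The sketch below uses a tiny made-up frame (`toy` is hypothetical, not the loan data) to check that both routes give the same columns.

```python
import pandas as pd

# Tiny made-up frame standing in for the loans data.
toy = pd.DataFrame({
    "purpose": ["credit_card", "educational", "credit_card"],
    "fico": [707, 682, 667],
})

# Route 1: the three-step approach used in the cell above.
dummies = pd.get_dummies(toy[["purpose"]], prefix=["purpose"])
manual = pd.concat([toy.drop(["purpose"], axis=1), dummies], axis=1)

# Route 2: let get_dummies replace the column in one call.
one_call = pd.get_dummies(toy, columns=["purpose"])

print(manual.columns.tolist())
print(one_call.columns.tolist())
```

Both routes keep the numeric columns and append one indicator column per category.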


In [5]:
print(df['not.fully.paid'].value_counts())
print("\n")


df=df.rename(columns={"not.fully.paid": "not_fully_paid"})

0    8045
1    1533
Name: not.fully.paid, dtype: int64
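The counts above show how imbalanced the target is. As a quick worked example using only those two numbers, we can compute the minority share and the accuracy of a trivial model that always predicts class 0; this baseline is useful to keep in mind when reading the accuracy figures later on.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 8045, 1: 1533}

total = sum(counts.values())
minority_share = counts[1] / total       # fraction of loans not fully paid
baseline_accuracy = counts[0] / total    # accuracy of always predicting 0

print(total)
print(round(minority_share, 3))
print(round(baseline_accuracy, 3))
```

Any model whose accuracy sits near 84% may simply be predicting the majority class, so precision and recall for class 1 are worth checking alongside accuracy.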




# Upsampling Minority Data

In [6]:
from sklearn.utils import resample
df_majority=df[df.not_fully_paid==0] ## all rows where Not.fully.paid==0
df_minority=df[df.not_fully_paid==1] ## all rows where Not.fully.paid==1

df_minority_upsampled=resample(df_minority,replace=True,n_samples=8045)
df_upsampled=pd.concat([df_minority_upsampled,df_majority])

df_upsampled.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16090 entries, 8940 to 9568
Data columns (total 20 columns):
credit.policy                 16090 non-null int64
int.rate                      16090 non-null float64
installment                   16090 non-null float64
log.annual.inc                16090 non-null float64
dti                           16090 non-null float64
fico                          16090 non-null int64
days.with.cr.line             16090 non-null float64
revol.bal                     16090 non-null int64
revol.util                    16090 non-null float64
inq.last.6mths                16090 non-null int64
delinq.2yrs                   16090 non-null int64
pub.rec                       16090 non-null int64
not_fully_paid                16090 non-null int64
purpose_all_other             16090 non-null uint8
purpose_credit_card           16090 non-null uint8
purpose_debt_consolidation    16090 non-null uint8
purpose_educational           16090 non-null uint8
purpose_ho

In [7]:
X_upsampled = df_upsampled.drop('not_fully_paid', axis=1) ## Features (independent variables)
y_upsampled=df_upsampled['not_fully_paid'] ## Target (dependent variable)

X = df.drop('not_fully_paid', axis=1) ## Features (independent variables)
y=df['not_fully_paid'] ## Target (dependent variable)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0) 

X_train_upsampled, X_test_upsampled, y_train_upsampled, y_test_upsampled = train_test_split(X_upsampled, y_upsampled, test_size = 0.3, 
                                                                        random_state = 0) 
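One thing to watch with the two splits above: the upsampled frame is built from all rows of `df`, so rows that land in `X_test` can also appear in `X_train_upsampled`. If a model trained on the upsampled split is scored on `X_test`, it may have already seen those rows, which can inflate the scores. A common alternative, sketched here on made-up data (the `toy` frame and its `fico` values are hypothetical), is to split first and upsample only the training portion.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up stand-in for the loans frame, with a similar class imbalance.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "fico": rng.randint(600, 800, size=200),
    "not_fully_paid": (rng.rand(200) < 0.16).astype(int),
})

# 1. Split first, so the test rows are set aside before any resampling.
train, test = train_test_split(
    toy, test_size=0.3, random_state=0, stratify=toy["not_fully_paid"]
)

# 2. Upsample the minority class inside the training portion only.
train_major = train[train["not_fully_paid"] == 0]
train_minor = train[train["not_fully_paid"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=0
)
train_balanced = pd.concat([train_major, train_minor_up])

# 3. Check that no test row leaked into the balanced training set.
overlap = set(train_balanced.index) & set(test.index)
print(len(overlap))
print(train_balanced["not_fully_paid"].value_counts().to_dict())
```

The test set keeps the original class balance, which is closer to what the model would face on new loans.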

In [8]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 19 columns):
credit.policy                 9578 non-null int64
int.rate                      9578 non-null float64
installment                   9578 non-null float64
log.annual.inc                9578 non-null float64
dti                           9578 non-null float64
fico                          9578 non-null int64
days.with.cr.line             9578 non-null float64
revol.bal                     9578 non-null int64
revol.util                    9578 non-null float64
inq.last.6mths                9578 non-null int64
delinq.2yrs                   9578 non-null int64
pub.rec                       9578 non-null int64
purpose_all_other             9578 non-null uint8
purpose_credit_card           9578 non-null uint8
purpose_debt_consolidation    9578 non-null uint8
purpose_educational           9578 non-null uint8
purpose_home_improvement      9578 non-null uint8
purpose_major_purchase        9

In [9]:
X_upsampled.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16090 entries, 8940 to 9568
Data columns (total 19 columns):
credit.policy                 16090 non-null int64
int.rate                      16090 non-null float64
installment                   16090 non-null float64
log.annual.inc                16090 non-null float64
dti                           16090 non-null float64
fico                          16090 non-null int64
days.with.cr.line             16090 non-null float64
revol.bal                     16090 non-null int64
revol.util                    16090 non-null float64
inq.last.6mths                16090 non-null int64
delinq.2yrs                   16090 non-null int64
pub.rec                       16090 non-null int64
purpose_all_other             16090 non-null uint8
purpose_credit_card           16090 non-null uint8
purpose_debt_consolidation    16090 non-null uint8
purpose_educational           16090 non-null uint8
purpose_home_improvement      16090 non-null uint8
purpose_ma

# Decision Tree

In [10]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()

classifier.fit(X_train,y_train)
predictions = classifier.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

[[1990  416]
 [ 355  113]]
             precision    recall  f1-score   support

          0       0.85      0.83      0.84      2406
          1       0.21      0.24      0.23       468

avg / total       0.75      0.73      0.74      2874
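To see where the class 1 numbers in the report come from, we can recompute them by hand from the confusion matrix printed above. In scikit-learn's layout, rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the output above: rows = actual, cols = predicted.
tn, fp = 1990, 416
fn, tp = 355, 113

precision = tp / (tp + fp)   # of loans flagged as "not fully paid", how many were
recall = tp / (tp + fn)      # of loans actually "not fully paid", how many we caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

The rounded values match the class 1 row of the classification report, and the accuracy is below the 84% that always predicting class 0 would give.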



In [11]:
from sklearn.model_selection import cross_val_score 

accuracies_dt= cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10) 
accuracies_dt_mean=accuracies_dt.mean()*100
print("Mean Accuracy:Decision Tree=",accuracies_dt_mean)

accuracies_dt_std=accuracies_dt.std()*100
print("Standard Deviation - Accuracy:Decision Tree=",accuracies_dt_std)

Mean Accuracy:Decision Tree= 73.88111855630324
Standard Deviation - Accuracy:Decision Tree= 1.2604460990858624
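`cross_val_score` reports accuracy by default for classifiers, which can hide poor performance on the minority class. Its `scoring` argument accepts other metrics such as `"recall"` and `"f1"`. Here is a minimal sketch on synthetic data from `make_classification` (not the loan data), with class weights chosen to roughly mimic the imbalance seen above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the loans features and target.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.84, 0.16], random_state=0
)

tree = DecisionTreeClassifier(random_state=0)

# Same estimator, three different metrics; recall and f1 refer to class 1.
acc = cross_val_score(tree, X_toy, y_toy, cv=5, scoring="accuracy")
rec = cross_val_score(tree, X_toy, y_toy, cv=5, scoring="recall")
f1 = cross_val_score(tree, X_toy, y_toy, cv=5, scoring="f1")

print("accuracy:", acc.mean())
print("recall  :", rec.mean())
print("f1      :", f1.mean())
```

Comparing the three means side by side makes it easier to spot a model that scores well on accuracy but rarely catches the minority class.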


# With Upsampled Data

In [12]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()

classifier.fit(X_train_upsampled,y_train_upsampled)
predictions = classifier.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

from sklearn.model_selection import cross_val_score 

accuracies_dt= cross_val_score(estimator = classifier, X = X_train_upsampled, y = y_train_upsampled, cv = 10) 
accuracies_dt_mean=accuracies_dt.mean()*100
print("Mean Accuracy:Decision Tree=",accuracies_dt_mean)

accuracies_dt_std=accuracies_dt.std()*100
print("Standard Deviation - Accuracy:Decision Tree=",accuracies_dt_std)

[[2256  150]
 [   9  459]]
             precision    recall  f1-score   support

          0       1.00      0.94      0.97      2406
          1       0.75      0.98      0.85       468

avg / total       0.96      0.94      0.95      2874

Mean Accuracy:Decision Tree= 88.55540022789562
Standard Deviation - Accuracy:Decision Tree= 0.8574191428794734


# Check Important Features

In [13]:
df1= pd.DataFrame()
#df1['feature'] = df.drop(['not.fully.paid'], axis=1).columns
df1['feature'] = X.columns
df1['Importance Index']= classifier.feature_importances_
print(df1.sort_values(by='Importance Index', ascending=False))

                       feature  Importance Index
2                  installment          0.139602
7                    revol.bal          0.119999
6            days.with.cr.line          0.115736
8                   revol.util          0.111927
1                     int.rate          0.106630
4                          dti          0.102132
3               log.annual.inc          0.100478
5                         fico          0.056259
0                credit.policy          0.042434
9               inq.last.6mths          0.032040
10                 delinq.2yrs          0.018548
14  purpose_debt_consolidation          0.014819
18      purpose_small_business          0.007744
12           purpose_all_other          0.007612
11                     pub.rec          0.006727
13         purpose_credit_card          0.005086
15         purpose_educational          0.004413
16    purpose_home_improvement          0.004312
17      purpose_major_purchase          0.003503


Homework: Remove the features that are not important and check the results.
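As a starting point for this exercise, one possible approach is to keep only the features whose importance exceeds a cutoff. The sketch below uses synthetic data and a cutoff of 0.05; both the data and the `threshold` value are assumptions for illustration, not recommendations for the loan data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical feature names f0..f5.
X_arr, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Pair each feature name with its importance, as in the table above.
importance = pd.Series(tree.feature_importances_, index=X_toy.columns)

threshold = 0.05  # hypothetical cutoff; tune for your own data
keep = importance[importance > threshold].index.tolist()
X_reduced = X_toy[keep]

print(importance.sort_values(ascending=False))
print("kept:", keep)
```

You can then refit the classifier on `X_reduced` and compare the confusion matrix and report against the full-feature model.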

# Random Forest

In [14]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=25) ## Hyperparameter
classifier.fit(X_train,y_train)
predictions = classifier.predict(X_test)


from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

from sklearn.model_selection import cross_val_score 

accuracies_rf= cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10) 
accuracies_rf_mean=accuracies_rf.mean()*100
print("Mean Accuracy:Random Forest=",accuracies_rf_mean)

accuracies_rf_std=accuracies_rf.std()*100
print("Standard Deviation:Random Forest=",accuracies_rf_std)

[[2375   31]
 [ 450   18]]
             precision    recall  f1-score   support

          0       0.84      0.99      0.91      2406
          1       0.37      0.04      0.07       468

avg / total       0.76      0.83      0.77      2874

Mean Accuracy:Random Forest= 83.71104965173
Standard Deviation:Random Forest= 0.6148740040045731


# With Upsampled Data

In [15]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=25) ## Hyperparameter

classifier.fit(X_train_upsampled,y_train_upsampled)
predictions = classifier.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

from sklearn.model_selection import cross_val_score 

accuracies_rf =cross_val_score(estimator = classifier, X = X_train_upsampled, y = y_train_upsampled, cv = 10) 
accuracies_rf_mean=accuracies_rf.mean()*100
print("Mean Accuracy:Random Forest=",accuracies_rf_mean)

accuracies_rf_std=accuracies_rf.std()*100
print("Standard Deviation - Accuracy:Random Forest=",accuracies_rf_std)

[[2357   49]
 [  11  457]]
             precision    recall  f1-score   support

          0       1.00      0.98      0.99      2406
          1       0.90      0.98      0.94       468

avg / total       0.98      0.98      0.98      2874

Mean Accuracy:Random Forest= 94.16677042274164
Standard Deviation - Accuracy:Random Forest= 0.5739277254011015


# Hyperparameter Tuning of Random Forests

In [16]:
print('Parameters currently in use:\n')
print(classifier.get_params())
print("\n")

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 200, num = 5)] ## play with start and stop

# Number of features to consider at every split
# ('auto' means the same as 'sqrt' for classifiers and was removed in scikit-learn 1.3)
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 20, num = 5)] ## play with 10, 20 and num
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,15]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,10]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(random_grid)

Parameters currently in use:

{'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 25, 'n_jobs': 1, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


{'n_estimators': [20, 65, 110, 155, 200], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 12, 15, 17, 20, None], 'min_samples_split': [2, 5, 10, 15], 'min_samples_leaf': [1, 2, 4, 10], 'bootstrap': [True, False]}


In [17]:
# Use the random grid to search for best hyperparameters

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
from sklearn.model_selection import RandomizedSearchCV

rf_random = RandomizedSearchCV(estimator = classifier, param_distributions = random_grid, n_iter = 100, cv = 3, 
                               verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train_upsampled,y_train_upsampled)
print("Best Parameters are:",rf_random.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  4.2min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  8.1min finished


Best Parameters are: {'n_estimators': 155, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
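Beyond `best_params_`, the fitted search object stores every candidate it tried in `cv_results_`, which is handy for seeing how close the runners-up were. The sketch below runs a much smaller search on synthetic data; `small_grid` is a hypothetical grid chosen only so that the example finishes quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical, deliberately tiny grid (8 combinations in total).
small_grid = {
    "n_estimators": [10, 20],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=small_grid,
    n_iter=4,          # sample 4 of the 8 combinations
    cv=3,
    random_state=42,
)
search.fit(X_toy, y_toy)

# One row per sampled candidate, best-ranked first.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results)
```

If several candidates score within one standard deviation of the best, the simpler or faster one may be the more practical choice.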


In [18]:
best_random = rf_random.best_estimator_
best_random.fit(X_train_upsampled,y_train_upsampled)

predictions = best_random.predict(X_test)


from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

[[2377   29]
 [  13  455]]
             precision    recall  f1-score   support

          0       0.99      0.99      0.99      2406
          1       0.94      0.97      0.96       468

avg / total       0.99      0.99      0.99      2874



# Extra Tree Classifier

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
classifier = ExtraTreesClassifier(n_estimators=700,criterion= 'entropy',min_samples_split= 5,
                            max_depth= 50, min_samples_leaf= 5) 

In [20]:
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

from sklearn.model_selection import cross_val_score 

accuracies_et= cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10) 
accuracies_et_mean=accuracies_et.mean()*100
print("Per-fold Accuracy:Extra Trees=",accuracies_et)

accuracies_et_std=accuracies_et.std()*100
print("Standard Deviation:Extra Trees=",accuracies_et_std)

[[2406    0]
 [ 468    0]]
             precision    recall  f1-score   support

          0       0.84      1.00      0.91      2406
          1       0.00      0.00      0.00       468

avg / total       0.70      0.84      0.76      2874

Per-fold Accuracy:Extra Trees= [0.84053651 0.84053651 0.84053651 0.84053651 0.84053651 0.84179104
 0.84179104 0.84179104 0.84179104 0.84155456]
Standard Deviation:Extra Trees= 0.06073121960374362


In [21]:
classifier.feature_importances_

array([0.1126431 , 0.10841196, 0.06986875, 0.06315166, 0.07229075,
       0.10909734, 0.06006892, 0.04441186, 0.08853576, 0.09075137,
       0.02192613, 0.02341872, 0.02238285, 0.02452572, 0.02629507,
       0.00876445, 0.01472156, 0.01268023, 0.02605378])

In [22]:
df.columns

Index(['credit.policy', 'int.rate', 'installment', 'log.annual.inc', 'dti',
       'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
       'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'not_fully_paid',
       'purpose_all_other', 'purpose_credit_card',
       'purpose_debt_consolidation', 'purpose_educational',
       'purpose_home_improvement', 'purpose_major_purchase',
       'purpose_small_business'],
      dtype='object')

In [23]:
df1= pd.DataFrame()
#df1['feature'] = df.drop(['not.fully.paid'], axis=1).columns
df1['feature'] = X.columns
df1['Importance Index']= classifier.feature_importances_
print(df1.sort_values(by='Importance Index', ascending=False))
#filtered_data=df1['Importance Index'] >0.01
#df1=df1[filtered_data]

                       feature  Importance Index
0                credit.policy          0.112643
5                         fico          0.109097
1                     int.rate          0.108412
9               inq.last.6mths          0.090751
8                   revol.util          0.088536
4                          dti          0.072291
2                  installment          0.069869
3               log.annual.inc          0.063152
6            days.with.cr.line          0.060069
7                    revol.bal          0.044412
14  purpose_debt_consolidation          0.026295
18      purpose_small_business          0.026054
13         purpose_credit_card          0.024526
11                     pub.rec          0.023419
12           purpose_all_other          0.022383
10                 delinq.2yrs          0.021926
16    purpose_home_improvement          0.014722
17      purpose_major_purchase          0.012680
15         purpose_educational          0.008764


# Naive Bayes

In [24]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

from sklearn.model_selection import cross_val_score 

accuracies_nb= cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10) 
accuracies_nb_mean=accuracies_nb.mean()*100
print("Mean Accuracy:Naive Bayes=",accuracies_nb_mean)

accuracies_nb_std=accuracies_nb.std()*100
print("Standard Deviation:Naive Bayes=",accuracies_nb_std)

[[2305  101]
 [ 423   45]]
             precision    recall  f1-score   support

          0       0.84      0.96      0.90      2406
          1       0.31      0.10      0.15       468

avg / total       0.76      0.82      0.78      2874

Mean Accuracy:Naive Bayes= 82.18979567687217
Standard Deviation:Naive Bayes= 0.5878106060881642


In [None]:
## Homework: Use the upsampled data and check the result

# Support Vector Machine

In [25]:
from sklearn.svm import SVC
model = SVC()
model.fit(X_train,y_train) 
predictions = model.predict(X_test)
print(confusion_matrix(y_test,predictions))

[[2406    0]
 [ 468    0]]


In [26]:
param_grid = {'C': [0.1,1,5,7,8,9,10, 100, 1000], 'gamma': [1,0.1,0.01,0.002,0.0005,0.001,0.0001], 'kernel': ['rbf']} 
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)
# May take awhile!
grid.fit(X_train,y_train)

Fitting 3 folds for each of 63 candidates, totalling 189 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV]  C=0.1, gamma=1, kernel=rbf, score=0.8411633109619687, total=   2.9s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.6s remaining:    0.0s


[CV]  C=0.1, gamma=1, kernel=rbf, score=0.8411633109619687, total=   3.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    9.6s remaining:    0.0s


[CV]  C=0.1, gamma=1, kernel=rbf, score=0.8410922112802148, total=   3.1s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=0.8411633109619687, total=   2.8s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=0.8411633109619687, total=   2.8s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=0.8410922112802148, total=   3.6s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV]  C=0.1, gamma=0.01, kernel=rbf, score=0.8411633109619687, total=   2.7s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV]  C=0.1, gamma=0.01, kernel=rbf, score=0.8411633109619687, total=   2.9s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV]  C=0.1, gamma=0.01, kernel=rbf, score=0.8410922112802148, total=   2.9s
[CV] C=0.1, gamma=0.002, kernel=rbf .....

[CV]  C=5, gamma=0.001, kernel=rbf, score=0.8411633109619687, total=   4.5s
[CV] C=5, gamma=0.001, kernel=rbf ....................................
[CV]  C=5, gamma=0.001, kernel=rbf, score=0.8410922112802148, total=   4.6s
[CV] C=5, gamma=0.0001, kernel=rbf ...................................
[CV]  C=5, gamma=0.0001, kernel=rbf, score=0.8331096196868009, total=   4.6s
[CV] C=5, gamma=0.0001, kernel=rbf ...................................
[CV]  C=5, gamma=0.0001, kernel=rbf, score=0.8326621923937361, total=   4.2s
[CV] C=5, gamma=0.0001, kernel=rbf ...................................
[CV]  C=5, gamma=0.0001, kernel=rbf, score=0.8267681289167412, total=   4.5s
[CV] C=7, gamma=1, kernel=rbf ........................................
[CV]  C=7, gamma=1, kernel=rbf, score=0.8411633109619687, total=   5.5s
[CV] C=7, gamma=1, kernel=rbf ........................................
[CV]  C=7, gamma=1, kernel=rbf, score=0.8411633109619687, total=   4.5s
[CV] C=7, gamma=1, kernel=rbf .................

[CV]  C=9, gamma=0.002, kernel=rbf, score=0.8411633109619687, total=   3.4s
[CV] C=9, gamma=0.002, kernel=rbf ....................................
[CV]  C=9, gamma=0.002, kernel=rbf, score=0.8410922112802148, total=   4.7s
[CV] C=9, gamma=0.0005, kernel=rbf ...................................
[CV]  C=9, gamma=0.0005, kernel=rbf, score=0.8411633109619687, total=   4.5s
[CV] C=9, gamma=0.0005, kernel=rbf ...................................
[CV]  C=9, gamma=0.0005, kernel=rbf, score=0.839821029082774, total=   3.3s
[CV] C=9, gamma=0.0005, kernel=rbf ...................................
[CV]  C=9, gamma=0.0005, kernel=rbf, score=0.8401969561324978, total=   3.0s
[CV] C=9, gamma=0.001, kernel=rbf ....................................
[CV]  C=9, gamma=0.001, kernel=rbf, score=0.8411633109619687, total=   3.1s
[CV] C=9, gamma=0.001, kernel=rbf ....................................
[CV]  C=9, gamma=0.001, kernel=rbf, score=0.8411633109619687, total=   3.1s
[CV] C=9, gamma=0.001, kernel=rbf ......

[CV]  C=1000, gamma=0.1, kernel=rbf, score=0.8411633109619687, total=   7.8s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV]  C=1000, gamma=0.1, kernel=rbf, score=0.8411633109619687, total=   3.3s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV]  C=1000, gamma=0.1, kernel=rbf, score=0.8410922112802148, total=   3.2s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV]  C=1000, gamma=0.01, kernel=rbf, score=0.8411633109619687, total=   3.2s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV]  C=1000, gamma=0.01, kernel=rbf, score=0.8411633109619687, total=   3.2s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV]  C=1000, gamma=0.01, kernel=rbf, score=0.8410922112802148, total=   3.7s
[CV] C=1000, gamma=0.002, kernel=rbf .................................
[CV]  C=1000, gamma=0.002, kernel=rbf, score=0.8411633109619687, total=   3.3s
[CV] C=1000, gamma=0.002, kern

[Parallel(n_jobs=1)]: Done 189 out of 189 | elapsed: 19.7min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 5, 7, 8, 9, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.002, 0.0005, 0.001, 0.0001], 'kernel': ['rbf']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)

In [27]:
print(grid.best_params_)
print(grid.best_estimator_)

#model=SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
 # decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
 # max_iter=-1, probability=False, random_state=None, shrinking=True,
 # tol=0.001, verbose=False)


predictions = grid.predict(X_test)
print(confusion_matrix(y_test,predictions))

{'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}
SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
[[2406    0]
 [ 468    0]]
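Even after the grid search, the SVC still predicts class 0 for every test row. One possible contributor is feature scale: an RBF kernel works on distances, and columns such as revol.bal (tens of thousands) sit next to int.rate (around 0.1). A common remedy is to standardize the features inside a pipeline. The sketch below shows the idea on synthetic data with one artificially inflated column; it is an illustration under those assumptions, not a result for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_toy, y_toy = make_classification(n_samples=400, n_features=5, random_state=0)
X_toy[:, 0] *= 10000  # mimic one large-valued column such as revol.bal

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# SVC on raw features versus SVC on standardized features.
raw_svc = SVC(kernel="rbf", gamma=1).fit(X_tr, y_tr)
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=1))
scaled_svc.fit(X_tr, y_tr)

raw_acc = raw_svc.score(X_te, y_te)
scaled_acc = scaled_svc.score(X_te, y_te)
print("raw    :", raw_acc)
print("scaled :", scaled_acc)
```

Because the scaler lives inside the pipeline, it is fitted on the training data only, and the same pipeline object can be passed to `GridSearchCV` in place of `SVC()`.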


In [28]:
model=SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

model.fit(X_train_upsampled,y_train_upsampled)
predictions = model.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

[[   0 2406]
 [   0  468]]
             precision    recall  f1-score   support

          0       0.00      0.00      0.00      2406
          1       0.16      1.00      0.28       468

avg / total       0.03      0.16      0.05      2874



