<a href="https://colab.research.google.com/github/wjung1008/Give-Me-Some-Credit/blob/main/Give_me_some_Credit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Give me some Credit**
### The goal of this assessment is to build a model that borrowers can use to help make the best financial decisions.

## **Import Libraries**

In [None]:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

## **Load Dataset**

*   The goal is to predict whether a borrower will experience a delinquency of 90 days past due or worse within two years (SeriousDlqin2yrs).



In [None]:
data = pd.read_csv('cs-training.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,3,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,4,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


## **Analyze and validate data (i.e. missing data, outlier)**

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   Unnamed: 0                            150000 non-null  int64  
 1   SeriousDlqin2yrs                      150000 non-null  int64  
 2   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 3   age                                   150000 non-null  int64  
 4   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64  
 5   DebtRatio                             150000 non-null  float64
 6   MonthlyIncome                         120269 non-null  float64
 7   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64  
 8   NumberOfTimes90DaysLate               150000 non-null  int64  
 9   NumberRealEstateLoansOrLines          150000 non-null  int64  
 10  NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64  
 11  

In [None]:
# Omit first column
data = data.iloc[: , 1:]

In [None]:
# Number of NaN for each column
data.isna().sum()

SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        0
age                                         0
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                   0
MonthlyIncome                           29731
NumberOfOpenCreditLinesAndLoans             0
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDependents                       3924
dtype: int64

### **Perform outlier detection for every column**

*   Outliers were detected with two techniques: the interquartile range (IQR) rule and manual thresholding.
*   Rather than dropping the outliers, they were capped, either at a bound derived from the IQR or at a fixed minimum or maximum threshold, to stabilize the data.
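A minimal sketch of the IQR rule on a toy series, to make the capping concrete. The helper name `upper_fence` and the toy values are assumptions for illustration and are not part of the notebook. Values above Q3 + 1.5 * IQR are pulled down to that fence instead of being dropped.

```python
import pandas as pd

def upper_fence(series: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR (hypothetical helper, for illustration only)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)

# Toy data with one extreme value.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

fence = upper_fence(toy)         # Q1 = 2, Q3 = 4, IQR = 2, fence = 7
capped = toy.clip(upper=fence)   # 100.0 is replaced by the fence value
print(fence, capped.tolist())
```

Capping keeps the row count unchanged, which matters here because every row still needs a prediction.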


In [None]:
def IQR(data,column):
  q1 = data[column].quantile(0.25)
  q3 = data[column].quantile(0.75)
  iqr = q3 - q1
  return q1, q3, iqr


for column in data:
  q1, q3, iqr = IQR(data, column)
  outlier = (data[column] < q1 - iqr * 1.5 )| (data[column] > q3 + iqr * 1.5)
  print('Found', len(data[outlier]), 'outlier in', column)

  # data.drop(data[outlier].index, inplace = True)
  # data.plot(kind = "scatter", x = column, y = "SeriousDlqin2yrs")

# Correlation between the variables
data.corr()


Found 10026 outlier in SeriousDlqin2yrs
Found 763 outlier in RevolvingUtilizationOfUnsecuredLines
Found 46 outlier in age
Found 23982 outlier in NumberOfTime30-59DaysPastDueNotWorse
Found 31311 outlier in DebtRatio
Found 4879 outlier in MonthlyIncome
Found 3980 outlier in NumberOfOpenCreditLinesAndLoans
Found 8338 outlier in NumberOfTimes90DaysLate
Found 793 outlier in NumberRealEstateLoansOrLines
Found 7604 outlier in NumberOfTime60-89DaysPastDueNotWorse
Found 13336 outlier in NumberOfDependents


Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
SeriousDlqin2yrs,1.0,-0.001802,-0.115386,0.125587,-0.007602,-0.019746,-0.029669,0.117175,-0.007038,0.102261,0.046048
RevolvingUtilizationOfUnsecuredLines,-0.001802,1.0,-0.005898,-0.001314,0.003961,0.007124,-0.011281,-0.001061,0.006235,-0.001048,0.001557
age,-0.115386,-0.005898,1.0,-0.062995,0.024188,0.037717,0.147705,-0.061005,0.03315,-0.057159,-0.213303
NumberOfTime30-59DaysPastDueNotWorse,0.125587,-0.001314,-0.062995,1.0,-0.006542,-0.010217,-0.055312,0.983603,-0.030565,0.987005,-0.00268
DebtRatio,-0.007602,0.003961,0.024188,-0.006542,1.0,-0.028712,0.049565,-0.00832,0.120046,-0.007533,-0.040673
MonthlyIncome,-0.019746,0.007124,0.037717,-0.010217,-0.028712,1.0,0.091455,-0.012743,0.124959,-0.011116,0.062647
NumberOfOpenCreditLinesAndLoans,-0.029669,-0.011281,0.147705,-0.055312,0.049565,0.091455,1.0,-0.079984,0.433959,-0.071077,0.065322
NumberOfTimes90DaysLate,0.117175,-0.001061,-0.061005,0.983603,-0.00832,-0.012743,-0.079984,1.0,-0.045205,0.992796,-0.010176
NumberRealEstateLoansOrLines,-0.007038,0.006235,0.03315,-0.030565,0.120046,0.124959,0.433959,-0.045205,1.0,-0.039722,0.124684
NumberOfTime60-89DaysPastDueNotWorse,0.102261,-0.001048,-0.057159,0.987005,-0.007533,-0.011116,-0.071077,0.992796,-0.039722,1.0,-0.010922
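The large outlier counts for SeriousDlqin2yrs and the past-due columns are what the IQR rule produces on mostly-zero columns: when at least 75% of the values are 0, both quartiles are 0, the IQR is 0, and every nonzero value falls outside the fences. For example, the 8338 flagged rows in NumberOfTimes90DaysLate equal 150000 minus the 141662 zeros counted later in the notebook. A small sketch on toy data (the 90/10 split is an assumption for illustration, not the real class balance):

```python
import pandas as pd

# Toy binary column: 90 zeros and 10 ones.
flag = pd.Series([0] * 90 + [1] * 10)

q1 = flag.quantile(0.25)
q3 = flag.quantile(0.75)
iqr = q3 - q1   # both quartiles are 0, so the IQR is 0

# With a zero-width IQR, the fences collapse to [0, 0].
is_outlier = (flag < q1 - 1.5 * iqr) | (flag > q3 + 1.5 * iqr)
n_flagged = int(is_outlier.sum())   # every nonzero value is flagged
print(iqr, n_flagged)
```

This suggests reading those counts as "number of nonzero rows" rather than as genuine anomalies, which is one reason the notebook switches to manual thresholds for the count columns.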


In [None]:
# Instead of removing the outliers, cap values at the upper fence q3 + 1.5 * iqr.
q1, q3, iqr = IQR(data, 'RevolvingUtilizationOfUnsecuredLines')

data['RevolvingUtilizationOfUnsecuredLines'] = np.where(data['RevolvingUtilizationOfUnsecuredLines'] > (q3 + iqr * 1.5), (q3 + iqr * 1.5), data['RevolvingUtilizationOfUnsecuredLines'])


data.RevolvingUtilizationOfUnsecuredLines.describe()

count    150000.000000
mean          0.322261
std           0.356572
min           0.000000
25%           0.029867
50%           0.154181
75%           0.559046
max           1.352814
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64

In [None]:
# The minimum of the age column is 0, which is not a plausible age, so count the rows below each candidate minimum.
for i in range(19,35):
    print (i, len(data[data.age < i]))


19 1
20 1
21 1
22 184
23 618
24 1259
25 2075
26 3028
27 4221
28 5559
29 7119
30 8821
31 10758
32 12796
33 14846
34 17085


In [None]:
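# Floor age at 22. Per the counts above, this changes the single age-0 row and also the 183 rows with age 21.
data['age'] = np.where(data['age'] < 22, 22, data['age'])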
data.age.describe()

count    150000.000000
mean         52.296573
std          14.768912
min          22.000000
25%          41.000000
50%          52.000000
75%          63.000000
max         109.000000
Name: age, dtype: float64

In [None]:
# Values of 96 and 98 in NumberOfTime30-59DaysPastDueNotWorse are far above every other value, so a maximum of 13 was set.
Counter(data['NumberOfTime30-59DaysPastDueNotWorse'])
data['NumberOfTime30-59DaysPastDueNotWorse'] = np.where(data['NumberOfTime30-59DaysPastDueNotWorse'] > 13, 13, data['NumberOfTime30-59DaysPastDueNotWorse'])
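The fixed-threshold caps used throughout this notebook can also be written with `Series.clip`, which is a little shorter than the `np.where` form. A small sketch on toy values (the series below is an assumption for illustration) showing that the two forms agree:

```python
import numpy as np
import pandas as pd

# Toy counts, including the 96 and 98 values seen in the past-due columns.
counts = pd.Series([0, 1, 2, 13, 96, 98])

capped_where = np.where(counts > 13, 13, counts)   # form used in the notebook
capped_clip = counts.clip(upper=13)                # equivalent pandas form

print(capped_clip.tolist())
```

`clip` returns a Series and keeps the index, while `np.where` returns a plain NumPy array; both leave values at or below the threshold unchanged.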

In [None]:
data.DebtRatio.describe()

count    150000.000000
mean        353.005076
std        2037.818523
min           0.000000
25%           0.175074
50%           0.366508
75%           0.868254
max      329664.000000
Name: DebtRatio, dtype: float64

In [None]:
# Instead of removing the outliers, cap values at the upper fence q3 + 1.5 * iqr.
q1, q3, iqr = IQR(data, 'DebtRatio')

data['DebtRatio'] = np.where(data['DebtRatio'] > (q3 + iqr * 1.5), (q3 + iqr * 1.5), data['DebtRatio'])

data.DebtRatio.describe()

count    150000.000000
mean          0.663258
std           0.688085
min           0.000000
25%           0.175074
50%           0.366508
75%           0.868254
max           1.908024
Name: DebtRatio, dtype: float64

In [None]:
data.MonthlyIncome.describe()

count    1.202690e+05
mean     6.670221e+03
std      1.438467e+04
min      0.000000e+00
25%      3.400000e+03
50%      5.400000e+03
75%      8.249000e+03
max      3.008750e+06
Name: MonthlyIncome, dtype: float64

In [None]:
# Instead of removing the outliers, cap values at the upper fence q3 + 1.5 * iqr (NaN values are left as NaN).
q1, q3, iqr = IQR(data, 'MonthlyIncome')

data['MonthlyIncome'] = np.where(data['MonthlyIncome'] > (q3 + iqr * 1.5), (q3 + iqr * 1.5), data['MonthlyIncome'])

data.MonthlyIncome.describe()

count    120269.000000
mean       6157.633251
std        3747.828434
min           0.000000
25%        3400.000000
50%        5400.000000
75%        8249.000000
max       15522.500000
Name: MonthlyIncome, dtype: float64

In [None]:
data.NumberOfOpenCreditLinesAndLoans.describe()

count    150000.000000
mean          8.452760
std           5.145951
min           0.000000
25%           5.000000
50%           8.000000
75%          11.000000
max          58.000000
Name: NumberOfOpenCreditLinesAndLoans, dtype: float64

In [None]:
Counter(data['NumberOfOpenCreditLinesAndLoans'])

Counter({0: 1888,
         1: 4438,
         2: 6666,
         3: 9058,
         4: 11609,
         5: 12931,
         6: 13614,
         7: 13245,
         8: 12562,
         9: 11355,
         10: 9624,
         11: 8321,
         12: 7005,
         13: 5667,
         14: 4546,
         15: 3645,
         16: 3000,
         17: 2370,
         18: 1874,
         19: 1433,
         20: 1169,
         21: 864,
         22: 685,
         23: 533,
         24: 422,
         25: 337,
         26: 239,
         27: 194,
         28: 150,
         29: 114,
         30: 88,
         31: 74,
         32: 52,
         33: 47,
         34: 35,
         35: 27,
         36: 18,
         37: 7,
         38: 13,
         39: 9,
         40: 10,
         41: 4,
         42: 8,
         43: 8,
         44: 2,
         45: 8,
         46: 3,
         47: 2,
         48: 6,
         49: 4,
         50: 2,
         51: 2,
         52: 3,
         53: 1,
         54: 4,
         56: 2,
         57: 2,
  

In [None]:
# Only a few rows have more than 36 open credit lines, so 36 was set as the maximum.
data['NumberOfOpenCreditLinesAndLoans'] = np.where(data['NumberOfOpenCreditLinesAndLoans'] > 36, 36, data['NumberOfOpenCreditLinesAndLoans'])


In [None]:
Counter(data['NumberOfTimes90DaysLate'])

Counter({0: 141662,
         1: 5243,
         2: 1555,
         3: 667,
         4: 291,
         5: 131,
         6: 80,
         7: 38,
         8: 21,
         9: 19,
         10: 8,
         11: 5,
         12: 2,
         13: 4,
         14: 2,
         15: 2,
         17: 1,
         96: 5,
         98: 264})

In [None]:
# Values of 96 and 98 in NumberOfTimes90DaysLate are far above every other value, so a maximum of 17 was set.
Counter(data['NumberOfTimes90DaysLate'])
data['NumberOfTimes90DaysLate'] = np.where(data['NumberOfTimes90DaysLate'] > 17, 17, data['NumberOfTimes90DaysLate'])

In [None]:
Counter(data['NumberRealEstateLoansOrLines'])

Counter({0: 56188,
         1: 52338,
         2: 31522,
         3: 6300,
         4: 2170,
         5: 689,
         6: 320,
         7: 171,
         8: 93,
         9: 78,
         10: 37,
         11: 23,
         12: 18,
         13: 15,
         14: 7,
         15: 7,
         16: 4,
         17: 4,
         18: 2,
         19: 2,
         20: 2,
         21: 1,
         23: 2,
         25: 3,
         26: 1,
         29: 1,
         32: 1,
         54: 1})

In [None]:
# Only a few rows have more than 13 real estate loans or lines, so 13 was set as the maximum.
data['NumberRealEstateLoansOrLines'] = np.where(data['NumberRealEstateLoansOrLines'] > 13, 13, data['NumberRealEstateLoansOrLines'])

In [None]:
Counter(data['NumberOfTime60-89DaysPastDueNotWorse'])

Counter({0: 142396,
         1: 5731,
         2: 1118,
         3: 318,
         4: 105,
         5: 34,
         6: 16,
         7: 9,
         8: 2,
         9: 1,
         11: 1,
         96: 5,
         98: 264})

In [None]:
# Values of 96 and 98 in NumberOfTime60-89DaysPastDueNotWorse are far above every other value, so a maximum of 11 was set.
Counter(data['NumberOfTime60-89DaysPastDueNotWorse'])
data['NumberOfTime60-89DaysPastDueNotWorse'] = np.where(data['NumberOfTime60-89DaysPastDueNotWorse'] > 11, 11, data['NumberOfTime60-89DaysPastDueNotWorse'])

In [None]:
# Count rows per number of dependents; very few rows exceed 10, so the maximum is set to 10.
# NaN values in NumberOfDependents are then replaced with the median.
for i in range(5,21):
    print (i, len(data[data.NumberOfDependents == i]))

5 746
6 158
7 51
8 24
9 5
10 5
11 0
12 0
13 1
14 0
15 0
16 0
17 0
18 0
19 0
20 1


In [None]:
data['NumberOfDependents'] = np.where(data['NumberOfDependents'] > 10, 10, data['NumberOfDependents'])
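# Assign the result back; calling fillna(..., inplace=True) on a selected column is not reliable in recent pandas versions.
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(data['NumberOfDependents'].median())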

### A Random Forest Regressor was used to fill in NaN values in the MonthlyIncome column (somewhat computationally expensive, but the predictions are reasonable).

In [None]:
# Rows with a known MonthlyIncome form the training set; rows with NaN are predicted afterwards.
train = data[data.MonthlyIncome.notna()]
test = data[data.MonthlyIncome.isna()]

X_train = train.drop(['MonthlyIncome', 'SeriousDlqin2yrs'], axis=1)
y_train = train['MonthlyIncome']

In [None]:
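# Note: criterion='mse' and max_features='auto' are accepted only by older scikit-learn releases;
# recent releases use criterion='squared_error' and max_features=1.0 for the same behavior.
regr = RandomForestRegressor(n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1,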
                              min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True,
                              oob_score=False, n_jobs=1, random_state=None, verbose=1)

regr.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   53.1s finished


RandomForestRegressor(criterion='mse', n_jobs=1, verbose=1)

In [None]:
# Replace NaN with predictions from the model
data['MonthlyIncome'] = np.where(data['MonthlyIncome'].isna(), regr.predict(data.drop(['MonthlyIncome', 'SeriousDlqin2yrs'], axis=1)), data['MonthlyIncome'])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    5.6s finished
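The imputation step can be wrapped in a small function. The sketch below is an assumption-laden illustration, not the notebook's code: `impute_with_forest`, the toy columns, and the reduced `n_estimators=20` are all hypothetical. Unlike the cell above, it predicts only for the rows where the target is missing, which avoids scoring rows whose predictions are discarded.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_with_forest(df, target, exclude=(), random_state=0):
    """Fill NaN in `target` with a forest trained on rows where it is known."""
    features = df.drop(columns=[target, *exclude])
    known = df[target].notna()
    model = RandomForestRegressor(n_estimators=20, random_state=random_state)
    model.fit(features[known], df.loc[known, target])
    out = df.copy()
    if (~known).any():
        # Predict only for the rows that actually need a value.
        out.loc[~known, target] = model.predict(features[~known])
    return out, model

# Toy data: income depends on age; the first 20 incomes are missing.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.integers(22, 80, 200),
                    "debt_ratio": rng.random(200)})
toy["income"] = 1000.0 + 50.0 * toy["age"]
toy.loc[toy.index[:20], "income"] = np.nan

filled, model = impute_with_forest(toy, "income")
print(filled["income"].isna().sum())
```

In the notebook's setting the call would exclude the label column, for example `exclude=("SeriousDlqin2yrs",)`, so that the target of the final classifier does not leak into the imputed feature.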


## **Prediction**

*   Now that the data is cleaned, it can be used to predict "SeriousDlqin2yrs".




In [None]:
X = data.drop('SeriousDlqin2yrs', axis=1)
y = data['SeriousDlqin2yrs']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
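Because only about 7% of the rows are positive (2074 of 30000 in the held-out split reported below), a stratified, seeded split is one option for keeping the class ratio identical in both parts and making the results reproducible. The sketch below uses toy arrays; `stratify` and `random_state` are real `train_test_split` parameters, but the seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, 70 positives (7%).
rng = np.random.default_rng(0)
X_toy = rng.random((1000, 3))
y_toy = np.array([1] * 70 + [0] * 930)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Both parts keep the 7% positive rate.
print(y_tr.mean(), y_te.mean())
```

Without `random_state`, each run of the notebook produces a different split, so the scores reported further down would not be exactly reproducible.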

In [None]:
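# Note: max_features='auto' is accepted only by older scikit-learn releases; for classifiers, recent releases use 'sqrt' for the same behavior.
rand_classifier = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2,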
                               min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
                               max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, 
                               random_state=None, verbose=0)
rand_classifier.fit(X_train, y_train)

RandomForestClassifier(n_jobs=1)

In [None]:
ada_classifier = AdaBoostClassifier(n_estimators=100, random_state=0)
ada_classifier.fit(X_train, y_train)

AdaBoostClassifier(n_estimators=100, random_state=0)

In [None]:
# Fit once; the original cell chained .fit() on construction and then fit the same model a second time.
grad_classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
grad_classifier.fit(X_train, y_train)

GradientBoostingClassifier(learning_rate=1.0, max_depth=1, random_state=0)

In [None]:
y_rand_pred = rand_classifier.predict(X_test)
y_ada_pred = ada_classifier.predict(X_test)
y_grad_pred = grad_classifier.predict(X_test)

In [None]:
from sklearn.metrics import classification_report
print('Random Forest Classifier\n',classification_report(y_test, y_rand_pred))
print('Ada Boost Classifier\n',classification_report(y_test, y_ada_pred))
print('Gradient Boosting Classifier\n',classification_report(y_test, y_grad_pred))

Random Forest Classifier
               precision    recall  f1-score   support

           0       0.94      0.99      0.97     27926
           1       0.55      0.20      0.29      2074

    accuracy                           0.93     30000
   macro avg       0.75      0.59      0.63     30000
weighted avg       0.92      0.93      0.92     30000

Ada Boost Classifier
               precision    recall  f1-score   support

           0       0.94      0.99      0.97     27926
           1       0.57      0.22      0.32      2074

    accuracy                           0.93     30000
   macro avg       0.76      0.61      0.64     30000
weighted avg       0.92      0.93      0.92     30000

Gradient Boosting Classifier
               precision    recall  f1-score   support

           0       0.95      0.96      0.96     27926
           1       0.42      0.35      0.38      2074

    accuracy                           0.92     30000
   macro avg       0.69      0.66      0.67     30

In [None]:
print('Random Forest Classifier:',roc_auc_score(y_test,rand_classifier.predict_proba(X_test)[:, 1] , average='macro', sample_weight=None))
print('Ada Boost Classifier:',roc_auc_score(y_test,ada_classifier.predict_proba(X_test)[:, 1] , average='macro', sample_weight=None))
print('Gradient Boosting Classifier:',roc_auc_score(y_test,grad_classifier.predict_proba(X_test)[:, 1] , average='macro', sample_weight=None))

Random Forest Classifier: 0.8346736702060985
Ada Boost Classifier: 0.8598525059098536
Gradient Boosting Classifier: 0.6574406661329975
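ROC AUC is computed from the predicted probabilities rather than from the 0/1 labels returned by `predict`, which is why a model can show low recall at the default 0.5 cutoff and still have a reasonable AUC. It equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as one half. A minimal sketch of that definition (the helper `pairwise_auc` is hypothetical and compares every positive with every negative, so it is only suitable for small inputs):

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # one entry per (positive, negative) pair
    wins = (diff > 0).sum()
    ties = (diff == 0).sum()
    return float((wins + 0.5 * ties) / diff.size)

y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y, s)   # 3 of the 4 pairs are ordered correctly
print(auc)
```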


## **Test Data Prediction**


*   Since the AdaBoost classifier achieved the highest ROC AUC on the held-out split (0.860, versus 0.835 for Random Forest and 0.657 for Gradient Boosting), it is used to predict the test data.

In [None]:
test_data = pd.read_csv('cs-test.csv')
test_data.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,,0.885519,43,0,0.177513,5700.0,4,0,0,0,0.0
1,2,,0.463295,57,0,0.527237,9141.0,15,0,4,0,2.0
2,3,,0.043275,59,0,0.687648,5083.0,12,0,1,0,2.0
3,4,,0.280308,38,1,0.925961,3200.0,7,0,2,0,0.0
4,5,,1.0,27,0,0.019917,3865.0,4,0,0,0,1.0


In [None]:
# Omit first column
test_data = test_data.iloc[: , 1:]

In [None]:
# Count NaN per column; MonthlyIncome and NumberOfDependents need to be filled in (SeriousDlqin2yrs is the unknown label).
test_data.isna().sum()

SeriousDlqin2yrs                        101503
RevolvingUtilizationOfUnsecuredLines         0
age                                          0
NumberOfTime30-59DaysPastDueNotWorse         0
DebtRatio                                    0
MonthlyIncome                            20103
NumberOfOpenCreditLinesAndLoans              0
NumberOfTimes90DaysLate                      0
NumberRealEstateLoansOrLines                 0
NumberOfTime60-89DaysPastDueNotWorse         0
NumberOfDependents                        2626
dtype: int64

In [None]:
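# The test data is cleaned with the same steps. Note that the IQR fences below are recomputed from the test set,
# while the fixed thresholds (22, 13, 36, 17, 13, 11) are the same values used for the training data.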
q1, q3, iqr = IQR(test_data, 'RevolvingUtilizationOfUnsecuredLines')

test_data['RevolvingUtilizationOfUnsecuredLines'] = np.where(test_data['RevolvingUtilizationOfUnsecuredLines'] > (q3 + iqr * 1.5), (q3 + iqr * 1.5), test_data['RevolvingUtilizationOfUnsecuredLines'])

test_data['age'] = np.where(test_data['age'] < 22, 22, test_data['age'])

test_data['NumberOfTime30-59DaysPastDueNotWorse'] = np.where(test_data['NumberOfTime30-59DaysPastDueNotWorse'] > 13, 13, test_data['NumberOfTime30-59DaysPastDueNotWorse'])

q1, q3, iqr = IQR(test_data, 'DebtRatio')
test_data['DebtRatio'] = np.where(test_data['DebtRatio'] > (q3 + iqr * 1.5), (q3 + iqr * 1.5), test_data['DebtRatio'])

q1, q3, iqr = IQR(test_data, 'MonthlyIncome')
test_data['MonthlyIncome'] = np.where(test_data['MonthlyIncome'] > (q3 + iqr * 1.5), (q3 + iqr * 1.5), test_data['MonthlyIncome'])

test_data['NumberOfOpenCreditLinesAndLoans'] = np.where(test_data['NumberOfOpenCreditLinesAndLoans'] > 36, 36, test_data['NumberOfOpenCreditLinesAndLoans'])

test_data['NumberOfTimes90DaysLate'] = np.where(test_data['NumberOfTimes90DaysLate'] > 17, 17, test_data['NumberOfTimes90DaysLate'])

test_data['NumberRealEstateLoansOrLines'] = np.where(test_data['NumberRealEstateLoansOrLines'] > 13, 13, test_data['NumberRealEstateLoansOrLines'])

test_data['NumberOfTime60-89DaysPastDueNotWorse'] = np.where(test_data['NumberOfTime60-89DaysPastDueNotWorse'] > 11, 11, test_data['NumberOfTime60-89DaysPastDueNotWorse'])
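The cell above recomputes the IQR fences from the test set, so the caps applied to the test data can differ from the ones the models were trained with. One alternative, sketched here under assumptions (the helpers `fit_upper_caps` and `apply_upper_caps` and the toy frames are hypothetical and not part of the notebook), is to learn the caps once on the training data and reuse them:

```python
import pandas as pd

def fit_upper_caps(train: pd.DataFrame, columns, k: float = 1.5) -> dict:
    """Learn the upper fence Q3 + k * IQR for each column from the training data."""
    caps = {}
    for col in columns:
        q1, q3 = train[col].quantile([0.25, 0.75])
        caps[col] = q3 + k * (q3 - q1)
    return caps

def apply_upper_caps(df: pd.DataFrame, caps: dict) -> pd.DataFrame:
    """Cap each column at the previously learned fence; returns a copy."""
    out = df.copy()
    for col, cap in caps.items():
        out[col] = out[col].clip(upper=cap)
    return out

# Toy frames standing in for the training and test sets.
train_toy = pd.DataFrame({"DebtRatio": [0.1, 0.2, 0.3, 0.4, 50.0]})
test_toy = pd.DataFrame({"DebtRatio": [0.25, 9.0]})

caps = fit_upper_caps(train_toy, ["DebtRatio"])   # fence is about 0.7
test_capped = apply_upper_caps(test_toy, caps)
print(caps, test_capped["DebtRatio"].tolist())
```

For this to reproduce the training-time caps exactly, the fences would have to be captured from the raw training columns before they are capped in place.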

In [None]:
# Fill in NaN in NumberOfDependents
test_data['NumberOfDependents'] = np.where(test_data['NumberOfDependents'] > 10, 10, test_data['NumberOfDependents'])
test_data['NumberOfDependents'] = test_data['NumberOfDependents'].fillna(test_data['NumberOfDependents'].median())

In [None]:
# Impute missing MonthlyIncome with the regressor already fitted on the training data (regr is not refit here).
test_data['MonthlyIncome'] = np.where(test_data['MonthlyIncome'].isna(), regr.predict(test_data.drop(['MonthlyIncome', 'SeriousDlqin2yrs'], axis=1)), test_data['MonthlyIncome'])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.7s finished


In [None]:
# Predict using the trained model
X = test_data.drop('SeriousDlqin2yrs', axis=1)
SeriousDlqin2yrs = ada_classifier.predict(X)

In [None]:
test_pred = pd.DataFrame({'SeriousDlqin2yrs': SeriousDlqin2yrs})
test_pred = test_pred.join(X)

In [None]:
# Summary of the predicted labels: about 2.6% of the test rows are predicted as delinquent.
test_pred.SeriousDlqin2yrs.describe()

count    101503.000000
mean          0.025585
std           0.157896
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: SeriousDlqin2yrs, dtype: float64

In [None]:
test_pred.to_csv("./cs-predictions.csv", index=False)