# Exercise 03

# Camilo Torres Ovalle
# Wilfredo David Vega Buelvas

## Data preparation and model evaluation exercise with credit scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which estimate the probability of default, are what banks use to decide whether a loan should be granted. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that a borrower will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards) |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


Read the data into Pandas

In [1]:
import zipfile

import pandas as pd

pd.set_option('display.max_columns', 500)

# read the CSV directly from the zip archive
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    with z.open('KaggleCredit2.csv') as f:
        data = pd.read_csv(f)

data.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0


In [2]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

# Exercise 3.1

Impute the missing values of Age and Number of Dependents.

In [3]:
# drop rows with any missing values
data.dropna().shape

(108648, 12)

In [4]:
# check for missing values
data.isnull().sum()

Unnamed: 0                                 0
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [5]:
# mean Age
X.age.mean()

51.36130439584714

In [6]:
# mean NumberOfDependents 
X.NumberOfDependents.mean()

0.8565735218319711

In [7]:
# fill missing values for age with the median age
# (assign the result back; inplace=True on attribute access is a chained assignment)
X['age'] = X['age'].fillna(X['age'].median())

In [8]:
# fill missing values for NumberOfDependents with the median NumberOfDependents
X['NumberOfDependents'] = X['NumberOfDependents'].fillna(X['NumberOfDependents'].median())
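
As a minimal sketch of the same median imputation, the two columns can be filled in one call by passing a Series of medians to `fillna`. The `toy` frame and its values are hypothetical stand-ins, not rows from the credit dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in the two columns of interest.
toy = pd.DataFrame({
    "age": [45.0, 40.0, np.nan, 30.0, 49.0],
    "NumberOfDependents": [2.0, np.nan, 0.0, 0.0, 1.0],
})

# Compute each column's median once (NaN values are skipped by default).
medians = toy[["age", "NumberOfDependents"]].median()

# fillna with a Series fills each column using the value under its own name.
filled = toy.fillna(medians)

print(medians.to_dict())
print(filled.isnull().sum())
```

The median is less sensitive than the mean to extreme values, which is one reason it may be preferred for skewed columns such as counts of dependents.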

In [9]:
# check for missing values
X.isnull().sum()

Unnamed: 0                              0
RevolvingUtilizationOfUnsecuredLines    0
age                                     0
NumberOfTime30-59DaysPastDueNotWorse    0
DebtRatio                               0
MonthlyIncome                           0
NumberOfOpenCreditLinesAndLoans         0
NumberOfTimes90DaysLate                 0
NumberRealEstateLoansOrLines            0
NumberOfTime60-89DaysPastDueNotWorse    0
NumberOfDependents                      0
dtype: int64

# Exercise 3.2

From the set of features, select the subset that maximizes the **F1 score** of the model under K-fold cross-validation.
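
One possible way to approach this is greedy forward selection: start with no features and repeatedly add the one that most improves the cross-validated score, stopping when nothing helps. The sketch below is an illustration under assumptions, not the exercise's reference solution. The helper name `forward_select`, the synthetic data from `make_classification`, and the choice of 5 stratified folds are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def forward_select(X, y, scoring="f1", n_splits=5):
    """Greedy forward selection driven by a cross-validated score."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1)
    model = LogisticRegression(max_iter=1000)
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Mean CV score for each candidate added to the current selection.
        scores = {
            j: cross_val_score(model, X[:, selected + [j]], y,
                               cv=cv, scoring=scoring).mean()
            for j in remaining
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break  # no remaining feature improves the score
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_score


# Synthetic, imbalanced stand-in data (not the credit dataset).
X_toy, y_toy = make_classification(n_samples=400, n_features=6,
                                   n_informative=3, n_redundant=0,
                                   weights=[0.8, 0.2], random_state=1)
selected, score = forward_select(X_toy, y_toy)
print(selected, round(score, 3))
```

Passing `scoring="roc_auc"` instead of `"f1"` would rank feature subsets by AUC with the same loop. Greedy selection does not guarantee the globally best subset; an exhaustive search over all subsets is feasible only when the number of features is small.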

In [10]:
from sklearn.linear_model import LogisticRegression

In [18]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score

LogReg = LogisticRegression(C=1e9)

results = cross_val_score(LogReg, X, y, cv=10, scoring='accuracy')

In [19]:
print(results)

[ 0.93260716  0.93269571  0.93234148  0.93269571  0.93269571  0.93243004
  0.93295545  0.93295545  0.93330972  0.93286094]


In [20]:
pd.Series(results).describe()


count    10.000000
mean      0.932755
std       0.000281
min       0.932341
25%       0.932629
50%       0.932696
75%       0.932932
max       0.933310
dtype: float64

In [21]:
# F1Score
# The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

# F1=2(precision x recall)/(precision+recall)



In [22]:
# train/test split
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

In [None]:
# make predictions for testing set
y_pred_class = logreg.predict(X_test)

In [None]:
# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.932657904991

In [23]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred_class)

array([[26302,    13],
       [ 1888,    26]])

In [24]:
from sklearn.metrics import precision_score, recall_score, f1_score
print('precision_score ', precision_score(y_test, y_pred_class))
print('recall_score    ', recall_score(y_test, y_pred_class))

('precision_score ', 0.66666666666666663)
('recall_score    ', 0.013584117032392894)


In [25]:
print('f1_score    ', f1_score(y_test, y_pred_class))

('f1_score    ', 0.026625704045058884)
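
The scores above can be reproduced by hand from the confusion matrix, which also shows why accuracy is misleading here. The sketch below uses the four counts printed by `confusion_matrix` (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Counts from the confusion matrix: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 26302, 13, 1888, 26
total = tn + fp + fn + tp

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

accuracy = (tp + tn) / total
# Accuracy of a trivial model that always predicts class 0.
baseline = (tn + fp) / total

print("precision", precision)
print("recall   ", recall)
print("f1       ", f1)
print("accuracy ", accuracy)
print("baseline ", baseline)
```

The model's accuracy is only marginally above the always-predict-0 baseline (both round to about 0.93), while recall shows that it finds just 26 of the 1914 delinquent borrowers in the test set. This is why F1 is a more informative target than accuracy on this imbalanced problem.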


# Exercise 3.3

Now, which set of features is best when selected by AUC?

In [26]:
# predict probability of serious delinquency (class 1)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

In [29]:
# calculate AUC
print(metrics.roc_auc_score(y_test, y_pred_prob))

0.655228621331
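
To help interpret an AUC value, recall that ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below checks this on a small hypothetical set of labels and scores, which are not taken from the credit dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities, including some tied scores.
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.40, 0.40, 0.05])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score.
diff = pos[:, None] - neg[None, :]
pairwise = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

auc = roc_auc_score(y_true, y_score)
print(pairwise, auc)
```

Read this way, an AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Unlike F1, AUC depends only on how the predicted probabilities order the examples, not on the 0.5 classification threshold.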
