## Credit Scoring Classification


This notebook is a quick introduction to the Viya Python client and to scikit-learn (sklearn), used here to fit and compare models.


## Load Python Packages including SAS SWAT



In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric import smoothers_lowess
from pandas import Series, DataFrame
from patsy import dmatrices
from sklearn.ensemble import RandomForestClassifier

import sys
sys.path.insert(0, r'\\newwinsrc\sasgen\dev\mva-vb005\GTKWX6ND\misc\python')
from swat import *
from swat.render import render_html

## Load CSV File to Data Frame


The function that reads CSV files into DataFrames is called ``read_csv``.   We'll use the ``head`` method to display just the first few records.

In [46]:

df = pd.read_csv(r"C:\demodata\cs_accepts_train.csv")  # raw string keeps the backslashes literal
df.head()

Unnamed: 0,_customerID,target,title,Nchildren,Nhousehold,Age,TimeAddress,TimeJob,tel,NumMyLoan,...,region,regionLarge,cash,product,resid,nat,prof,car,card,saving
0,c000000001,Good,H,2,4,30,96.0,21.0,1,0,...,0,0,600,Furniture,Lease,German,,Car,Yes,No
1,c000000004,Good,H,1,3,49,,192.0,1,0,...,0,0,1000,Electronics,Lease,German,B,Car,Yes,Yes
2,c000000005,Good,H,2,4,37,66.0,54.0,1,0,...,2,1,6000,Furniture,Lease,German,,Car,No,No
3,c000000007,Good,H,0,1,33,84.0,192.0,1,0,...,3,1,600,Leisure,Lease,German,B,Car,Yes,
4,c000000011,Good,H,0,1,24,21.0,84.0,1,0,...,4,1,5000,Electronics,,German,,Car,Yes,


There are many more Pandas data readers that you can read about on the [Pandas web site](http://pandas.pydata.org/pandas-docs/stable/io.html).
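
The `Unnamed: 0` column in the display above is the name Pandas gives to a first column that has no header. If that column is a saved row index (an assumption about this file, not something the notebook confirms), the `index_col` parameter of ``read_csv`` can absorb it. A minimal sketch on a hypothetical in-memory CSV:

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first column has no header, as happens
# when a DataFrame is saved together with its row index.
csv_text = ",_customerID,target\n0,c000000001,Good\n1,c000000004,Good\n"

# Default read: the header-less column is kept and named 'Unnamed: 0'
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Whether to keep such a column depends on whether it carries information; for modeling it is usually safer not to pass a row counter in as a feature.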

### Displaying Information about DataFrames

We have displayed the DataFrame above.  We can get more information about the DataFrame using various properties and methods.

The list of column names can be displayed using the ``columns`` property.

In [13]:
df.columns

Index(['target', '_customerID', 'title', 'Nchildren', 'Nhousehold', 'Age',
       'TimeAddress', 'TimeJob', 'tel', 'NumMyLoan', 'NumFinLoan', 'NumLoans',
       'Income', 'EC_Card', 'IncLevel', 'status', 'bureau', 'region',
       'regionLarge', 'cash', 'product', 'resid', 'nat', 'prof', 'car', 'card',
       'saving'],
      dtype='object')

### Fit a Random Forest Using Continuous Features

In [49]:
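cols = ['Age', 'bureau', 'Income', 'status']  # note: 'status' holds string codes, not numbers
y = ['target']
trainArr = df[cols].to_numpy()  # training features (as_matrix was removed in pandas 1.0)
trainRes = df[y].to_numpy()     # training target, shape (n_rows, 1)
print(trainArr)
print(trainRes)

rf = RandomForestClassifier(n_estimators=100)  # initialize a forest of 100 trees
# scikit-learn needs numeric features, so this fit raises a ValueError while
# 'status' is in cols; drop it or encode it as in the next section
rf.fit(trainArr, trainRes.ravel())  # ravel() passes the target as a 1-D array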


[[30 1 0 'V']
 [49 3 0 'V']
 [37 1 3500 'V']
 ..., 
 [31 1 1500 'G']
 [29 1 0 'V']
 [21 3 1900 'U']]
[['Good']
 ['Good']
 ['Good']
 ..., 
 ['Good']
 ['Good']
 ['Good']]
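
As a minimal sketch of the fit-and-predict pattern used in this section, the example below trains a small forest on purely numeric features. The toy values, the names `X`, `labels`, and `toy_rf`, and the `random_state` setting are all assumptions made for illustration; they are not taken from the credit data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: two numeric features (think Age and Income)
X = np.array([[25, 500], [30, 800], [35, 600], [28, 700],
              [45, 4000], [50, 5200], [55, 4500], [48, 6000]])
labels = np.array(["Bad", "Bad", "Bad", "Bad", "Good", "Good", "Good", "Good"])

# random_state is fixed only so that repeated runs build the same forest
toy_rf = RandomForestClassifier(n_estimators=10, random_state=0)
toy_rf.fit(X, labels)  # features: 2-D numeric array, target: 1-D array

# Score two new rows; predict returns one label per row
preds = toy_rf.predict(np.array([[27, 650], [52, 5000]]))
print(toy_rf.classes_)
print(preds)
```

String targets such as `'Good'` and `'Bad'` are accepted as class labels; it is only the feature matrix that has to be numeric.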


### Fit a Random Forest Using Vectorization to Define Design Matrix

In [89]:
# list-of-dicts format of the features, one dict per row
newTrain = df[['Age', 'bureau', 'Income', 'status']].to_dict(orient='records')

# 1-D array format of the target
y = ['target']
newResponse = df[y].to_numpy().ravel() # training results

# vectorizing: numeric values pass through, string values are one-hot encoded
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
newTrain2 = vec.fit_transform(newTrain).toarray()

rf = RandomForestClassifier(n_estimators=100) # initialize
rf.fit(newTrain2, newResponse)                # fit the data to the algorithm

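
To make the design matrix concrete, here is a small sketch of what `DictVectorizer` produces for records shaped like the ones above. The two training records, the extra test record, and the name `demo_vec` are hypothetical; only the column names come from the notebook.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records shaped like the rows passed to the vectorizer
train_records = [
    {"Age": 30, "Income": 0, "status": "V"},
    {"Age": 31, "Income": 1500, "status": "G"},
]

demo_vec = DictVectorizer()
train_matrix = demo_vec.fit_transform(train_records).toarray()

# Numeric keys keep one column each; every distinct string value of
# 'status' gets its own 0/1 column named 'status=<value>'
print(demo_vec.feature_names_)
print(train_matrix)

# transform() reuses the fitted columns; a status code that was not seen
# during fitting has no column, so both status columns stay at zero
unseen = demo_vec.transform([{"Age": 21, "Income": 1900, "status": "U"}]).toarray()
print(unseen)
```

Because the column layout is learned during `fit_transform`, new data should go through `transform` on the same fitted vectorizer so that its columns line up with the ones the model was trained on.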

### Define and Score Test Data

In [55]:
test = pd.read_csv(r"C:\demodata\cs_accepts_validation.csv")

# list-of-dicts format of the test features
newTest = test[['Age', 'bureau', 'Income', 'status']].to_dict(orient='records')

# 1-D array format of the test target
yt = ['target']
testResponse = test[yt].to_numpy().ravel() # test results

# vectorizing: reuse the DictVectorizer fitted on the training data so the
# test columns match the training columns
newTest2 = vec.transform(newTest).toarray()


In [48]:
## Testing!
# score the test data, which is now in the same format as the training data
results = rf.predict(newTest2)
# something I like to do is to add it back to the data frame, so I can compare side-by-side
test['predictions'] = results
test.head()
# note - the column names shifted. Just ignore that.

Unnamed: 0,_customerID,target,title,Nchildren,Nhousehold,Age,TimeAddress,TimeJob,tel,NumMyLoan,...,regionLarge,cash,product,resid,nat,prof,car,card,saving,predictions
0,c000000002,Good,H,1,2,26,288.0,66.0,1,0,...,1,800,Dept. Store,,German,,Without Vehicle,No,,Good
1,c000000003,Good,H,0,2,27,24.0,24.0,1,2,...,1,600,Leisure,Lease,Turkish,,Car,No,No,Good
2,c000000006,Bad,H,0,1,26,96.0,3.0,1,0,...,0,1100,Electronics,Lease,German,F,Car,Yes,Yes,Good
3,c000000008,Good,H,0,1,30,6.0,192.0,1,0,...,1,2500,Furniture,Lease,German,,Car,Yes,Yes,Good
4,c000000009,Good,H,2,4,40,0.0,15.0,1,0,...,1,1400,Electronics,Lease,German,,Car,No,Yes,Good
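
With `target` and `predictions` side by side, the comparison can also be summarized numerically. The sketch below uses only the five rows displayed above, so the figures describe that displayed sample and say nothing reliable about the full validation set; the names `shown`, `acc`, and `table` are illustrative.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# The five target/prediction pairs visible in the output above
shown = pd.DataFrame({
    "target": ["Good", "Good", "Bad", "Good", "Good"],
    "predictions": ["Good", "Good", "Good", "Good", "Good"],
})

# Fraction of rows where the prediction equals the target (4 of 5 here)
acc = accuracy_score(shown["target"], shown["predictions"])

# Cross-tabulation of actual (rows) against predicted (columns) labels
table = pd.crosstab(shown["target"], shown["predictions"])

print(acc)
print(table)
```

In this sample the single `Bad` customer is predicted as `Good`, which is the kind of error a plain accuracy figure can hide when one class dominates.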


The data types of the columns can be displayed using the ``dtypes`` property.

In [8]:
df.dtypes

target          object
_customerID     object
title           object
Nchildren        int64
Nhousehold       int64
Age              int64
TimeAddress    float64
TimeJob        float64
tel              int64
NumMyLoan        int64
NumFinLoan       int64
NumLoans         int64
Income           int64
EC_Card          int64
IncLevel         int64
status          object
bureau           int64
region           int64
regionLarge      int64
cash             int64
product         object
resid           object
nat             object
prof            object
car             object
card            object
saving          object
dtype: object
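
The data types matter for modeling because scikit-learn estimators expect numeric input. One way to separate numeric columns from the rest is `select_dtypes`; the sketch below applies it to a hypothetical three-row sample that borrows a few column names from the listing above.

```python
import pandas as pd

# Hypothetical three-row sample using column names from the listing above
sample = pd.DataFrame({
    "Age": [30, 49, 37],
    "Income": [0, 0, 3500],
    "status": ["V", "V", "G"],
})

# Columns with an integer or float dtype
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else, here the string-valued column
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Note that a numeric dtype does not guarantee a continuous quantity: an integer column can still hold category codes, so the split is a starting point rather than a final feature list.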

For general information about the DataFrame as a whole, you can use the ``info`` method.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 27 columns):
target         100000 non-null object
_customerID    100000 non-null object
title          100000 non-null object
Nchildren      100000 non-null int64
Nhousehold     100000 non-null int64
Age            100000 non-null int64
TimeAddress    96497 non-null float64
TimeJob        98691 non-null float64
tel            100000 non-null int64
NumMyLoan      100000 non-null int64
NumFinLoan     100000 non-null int64
NumLoans       100000 non-null int64
Income         100000 non-null int64
EC_Card        100000 non-null int64
IncLevel       100000 non-null int64
status         100000 non-null object
bureau         100000 non-null int64
region         100000 non-null int64
regionLarge    100000 non-null int64
cash           100000 non-null int64
product        100000 non-null object
resid          82405 non-null object
nat            100000 non-null object
prof           28809 non-null object
...
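
The non-null counts above imply missing values wherever a count falls below the 100000 entries, for example in `TimeAddress` and `prof`. A direct way to count them is `isna().sum()`, sketched here on a small sample built from the first three rows displayed earlier (the name `sample` is illustrative):

```python
import numpy as np
import pandas as pd

# First three displayed rows, restricted to three columns; blank cells in
# the display are entered here as missing values
sample = pd.DataFrame({
    "Age": [30, 49, 37],
    "TimeAddress": [96.0, np.nan, 66.0],
    "prof": [None, "B", None],
})

# isna() marks missing cells; sum() counts them per column
missing = sample.isna().sum()
print(missing)
```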

Now that we know more about the columns and their data types, we can move on to subsetting DataFrames into other DataFrames or columns.

For more information on plotting features of DataFrames, see the [Pandas Visualization documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html).

## Conclusion

We've covered only the basics here: loading a CSV file into a Pandas DataFrame, inspecting it, and fitting and scoring a random forest with scikit-learn. You should have enough to get started, but for more information on Pandas, you should [see the official documentation](http://pandas.pydata.org/pandas-docs/stable/).