# Introduction to Scikit Learn using a Decision Tree

There are many ways to build and apply data mining models in Scikit learn, and you will see lots of examples. Some are more efficient than others. This is how I build a basic data mining project (it may or may not be efficient). Scikit learn has great documentation at http://scikit-learn.org/stable/index.html. Below is a combination of NumPy, Scikit learn, Matplotlib and pandas scripts. These four are the primary tools for data mining in Python.

# Import standard packages for Machine Learning

In [1]:
#Add packages
#These are my standard packages I load for almost every project
%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#From Scikit Learn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report
#Notice I did not load my data mining packages yet (e.g., the decision tree). I will do that as I use them.

# Check current directory

In [2]:
%pwd

'/Users/mpgartland/Documents/Courses/Predictive Models/Pred_Models_git/Week 1/code'

# Change Directory to where my project is located

In [3]:
cd /Users/mpgartland/Documents/Courses/Predictive Models/Pred_Models_git/Week 1/

[Errno 2] No such file or directory: '/Users/mpgartland/Documents/Courses/Predictive Models/Pred_Models_git/Week 1 part 1/'
/Users/mpgartland/Documents/Courses/Predictive Models/Pred_Models_git/Week 1/code


# Read in a CSV file. Print basic information on file

In [4]:
bank = pd.read_csv("data/bank_data.csv", sep=",")
#print type of object for target
print("Data type", bank.savings_acct.dtype)
#Dimensions of dataset
print("Shape of Data", bank.shape)
#Column names
print("Column Names", bank.columns)
#See top few rows of dataset
bank.head(10)

FileNotFoundError: File b'data/bank_data.csv' does not exist
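
The read above fails when the working directory does not contain `data/bank_data.csv`. As a rough sketch of what the same inspection steps print when the read succeeds, here they are on a tiny in-memory CSV. Only the column names `id` and `savings_acct` come from this notebook; every other column and all of the values are made up for illustration and are not the real bank data.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/bank_data.csv (made-up rows and columns)
csv_text = """id,age,region,income,savings_acct
ID1,48,INNER_CITY,17546.0,NO
ID2,40,TOWN,30085.1,YES
ID3,51,RURAL,16575.4,YES
"""
demo = pd.read_csv(io.StringIO(csv_text), sep=",")

# Same three checks as the cell above, applied to the stand-in frame
print("Data type", demo.savings_acct.dtype)   # text columns load as a string/object dtype
print("Shape of Data", demo.shape)            # (rows, columns)
print("Column Names", list(demo.columns))
```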

# Identify Target Variable and Move Target to Column 0 (optional)

In [None]:
# designate target variable name
targetName = 'savings_acct'
targetSeries = bank[targetName]
#remove target from current location and insert in column 0
del bank[targetName]
bank.insert(0, targetName, targetSeries)
#reprint dataframe and see target is in position 0
bank.head(10)
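
A shorter way to do the same move, sketched on a made-up frame: `DataFrame.pop` removes a column and returns it, so the delete and the insert can be combined. The `age` and `income` columns here are hypothetical; only `savings_acct` comes from the notebook.

```python
import pandas as pd

# Hypothetical frame with the target in the last position
demo = pd.DataFrame({"age": [48, 40], "income": [17546.0, 30085.1],
                     "savings_acct": ["NO", "YES"]})

target_name = "savings_acct"
# pop() runs first (removing the column), then insert() puts it at position 0
demo.insert(0, target_name, demo.pop(target_name))
print(list(demo.columns))
```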

# ID column needs to be removed since I do not believe it has predictive power

In [None]:
#Note: axis=1 denotes that we are referring to a column, not a row
bank=bank.drop('id',axis=1)
bank.head(10)

# EDA on the Target

In [None]:
#Basic bar chart since the target is binary (two classes)
groupby = bank.groupby(targetName)
targetEDA=groupby[targetName].aggregate(len)
plt.figure()
targetEDA.plot(kind='bar', grid=False)
plt.axhline(0, color='k')
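
If you want the numbers behind the bar chart, `value_counts` gives the same per-class counts as the groupby above, and `normalize=True` turns them into shares, which is a quick way to see how imbalanced the target is. A minimal sketch on made-up labels:

```python
import pandas as pd

# Made-up target values, not the real bank data
demo = pd.DataFrame({"savings_acct": ["NO", "YES", "YES", "NO", "YES"]})

counts = demo["savings_acct"].value_counts()                 # rows per class
shares = demo["savings_acct"].value_counts(normalize=True)   # fraction per class
print(counts)
print(shares)
```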

# Preprocessing of Data

Preprocessing
The two steps below are for preprocessing. The first cell changes the yes/no values of the target to numeric. I needed to do this because some models require the target to be numeric. The second cell takes all the categorical features and creates dummies from them. This is stock code I have used for a long time (and I did not write it). It is nice because it will take a dataframe of any size and handle its categorical features without my changing a single line. It can be used generically on basically any dataframe, which saves a lot of time compared with coding each feature by hand.

In [None]:
# This code turns a text target into numeric so some scikit learn algorithms can process it
from sklearn import preprocessing
le_dep = preprocessing.LabelEncoder()
#to convert into numbers
bank['savings_acct'] = le_dep.fit_transform(bank['savings_acct'])
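
It helps to check which number each label received. `LabelEncoder` sorts the classes, and the position of a class in `classes_` is its code. A small sketch on made-up values (assuming the target holds the strings "NO" and "YES"; the real file may spell them differently):

```python
import numpy as np
from sklearn import preprocessing

labels = np.array(["NO", "YES", "YES", "NO"])  # made-up target values
le = preprocessing.LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                    # classes in sorted order; index = assigned code
print(encoded)                        # the numeric target
print(le.inverse_transform(encoded))  # map the codes back to the original text
```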

In [None]:
# perform data transformation. Creates dummies of any categorical feature
for col in bank.columns[1:]:
    # discretize (create dummies) for text (object) columns only
    if bank[col].dtype == object:
        bank = pd.concat([bank, pd.get_dummies(bank[col], prefix=col)], axis=1)
        del bank[col]
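
For comparison, here is a sketch of the same idea in a single `pd.get_dummies` call, which accepts a whole dataframe plus a list of columns to encode. This is an alternative, not the stock code above, and the frame below is made up. One difference to be aware of: the resulting column order can differ from the loop's.

```python
import pandas as pd

# Hypothetical frame laid out like the notebook's: numeric target first, features after
demo = pd.DataFrame({
    "savings_acct": [0, 1, 1],
    "age": [48, 40, 51],
    "region": ["INNER_CITY", "TOWN", "RURAL"],
    "married": ["NO", "YES", "YES"],
})

# Pick the non-numeric feature columns, then encode them all at once
cat_cols = [c for c in demo.columns[1:] if not pd.api.types.is_numeric_dtype(demo[c])]
encoded = pd.get_dummies(demo, columns=cat_cols)
print(encoded.columns.tolist())
```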

# Notice new shape and format of the dataframe. It is now ready to data mine

In [None]:
bank.shape

In [None]:
bank.head(10)

# Randomly split your dataset into Train/Test 

I split the data 60/40 into train and test sets. The features are stored in "features_train" and "features_test". The targets are in "target_train" and "target_test". I use a bigger test set when I have an imbalanced dataset.

In [1]:
# split dataset into testing and training
features_train, features_test, target_train, target_test = train_test_split(
    bank.iloc[:,1:].values, bank.iloc[:,0].values, test_size=0.40, random_state=0)

NameError: name 'train_test_split' is not defined
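
The `NameError` above means the import cell was not run in this session; rerun it before splitting. With an imbalanced target, one further option (not used in this notebook) is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 80 rows of class 0, 20 rows of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Same 60/40 split as above, but stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)
print("Share of class 1 in train:", y_train.mean())
print("Share of class 1 in test:", y_test.mean())
```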

## Note the four new train/test arrays and their shapes.

In [None]:
print(features_test.shape)
print(features_train.shape)
print(target_test.shape)
print(target_train.shape)

# Run a Decision Tree Model via Scikit Learn using the above created train/test files

In [None]:
#Decision Tree train model. Call up my model and name it clf_dt
from sklearn import tree 
clf_dt = tree.DecisionTreeClassifier()
#Call up the model to see the parameters you can tune (and their default setting)
print(clf_dt)
#Fit clf_dt to the training data
clf_dt = clf_dt.fit(features_train, target_train)
#Predict with the clf_dt model against the test data
target_predicted_dt = clf_dt.predict(features_test)


# Obtain Accuracy of Model

In [None]:
print("DT Accuracy Score", accuracy_score(target_test, target_predicted_dt))
print(classification_report(target_test, target_predicted_dt))
print(confusion_matrix(target_test, target_predicted_dt))
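
To see how these numbers relate, here is a small sketch that rebuilds accuracy, precision and recall from the four cells of a binary confusion matrix. The labels and predictions are made up; with the real model you would pass `target_test` and `target_predicted_dt` instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])  # made-up labels
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # made-up predictions

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all rows predicted correctly
precision = tp / (tp + fp)                   # of the predicted 1s, how many were right
recall = tp / (tp + fn)                      # of the actual 1s, how many were found
print(accuracy, precision, recall)
```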


# Crossvalidate Tree

I cross-validated with 10 folds. You can see the score on the held-out fold for each of the 10 runs, and the mean. Are the CV results stable? If not, the model might be overfitting.

In [None]:
#verify DT with Cross Validation
scores = cross_val_score(clf_dt, features_train, target_train, cv=10)
print("Cross Validation Score for each K",scores)
scores.mean()          
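
One rough way to judge stability is to look at the spread of the fold scores next to their mean. The sketch below does that on synthetic data from `make_classification`, which stands in for `features_train` and `target_train`; the sample size and `random_state` are arbitrary choices. Note that when `cv` is an integer and the model is a classifier, scikit-learn uses stratified folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for features_train / target_train
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=10)   # one accuracy value per fold

print("Fold scores:", scores.round(3))
# A large standard deviation relative to the mean suggests unstable results
print("Mean: %.3f  Std: %.3f" % (scores.mean(), scores.std()))
```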

# To draw the tree, install these packages

conda install graphviz

pip install pydotplus

In [None]:
from IPython.display import Image
dot_data = tree.export_graphviz(clf_dt, out_file=None, 
                         filled=True, rounded=True,  
                         special_characters=True)
#Add feature names

In [None]:
import pydotplus 
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png()) 
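
The tree above is drawn without feature names because the features were passed in as a plain array. Below is a sketch of one way to add them, on a made-up frame with the target in column 0 like `bank`. It uses `tree.export_text`, which prints the tree as text and needs no Graphviz install, and then passes the same names to `export_graphviz`. The class names assume 0 means "NO" and 1 means "YES"; check the encoder's `classes_` before relying on that.

```python
import pandas as pd
from sklearn import tree

# Made-up frame laid out like the notebook's: target in column 0, features after it
demo = pd.DataFrame({
    "savings_acct": [0, 0, 1, 1, 0, 1],
    "age":          [25, 30, 52, 61, 28, 47],
    "income":       [18000, 21000, 40000, 52000, 19500, 38000],
})
feature_names = list(demo.columns[1:])   # names in the same order as the feature array

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(demo.iloc[:, 1:].values, demo.iloc[:, 0].values)

# Text rendering of the tree with readable feature names
print(tree.export_text(clf, feature_names=feature_names))

# The same names label the nodes in the Graphviz output
dot_data = tree.export_graphviz(clf, out_file=None, feature_names=feature_names,
                                class_names=["NO", "YES"],  # assumed label order
                                filled=True, rounded=True)
```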

# Add your next model using what you have already processed, below

#Perhaps try a KNN on the same data
#http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
#Look at your DT model for structure and guidance. 


## Start with:
from sklearn.neighbors import KNeighborsClassifier
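
As a starting point, here is a sketch of a KNN run that mirrors the decision tree steps: create the model, fit it on the training part, predict on the test part, then score. It runs on synthetic data so it stands alone; in the notebook you would reuse `features_train`, `features_test`, `target_train` and `target_test` instead. The name `clf_knn` is my own choice, and `n_neighbors=5` is simply the library default, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the processed bank data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=0)

clf_knn = KNeighborsClassifier(n_neighbors=5)   # 5 is the library default
clf_knn.fit(X_train, y_train)                   # fit on the training part
predicted_knn = clf_knn.predict(X_test)         # predict on the test part
print("KNN Accuracy Score", accuracy_score(y_test, predicted_knn))
```

KNN works on distances, so features on very different scales can dominate the result; scaling the features first is worth trying.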