__Assignment 10__

1. [Import](#Import)
1. [Assignment 10](#Assignment-10)
    1. [Load data](#load-data)
    1. [Linear regression](#Linear-regression)
    1. [Logistic regression](#logistic-regression)

# Import

<a id = 'Import'></a>

In [1]:
# standard library and settings
import warnings; warnings.simplefilter('ignore')
from IPython.display import display, HTML; display(HTML("<style>.container { width:95% !important; }</style>"))

# data extensions and settings
import numpy as np
np.set_printoptions(threshold = np.inf, suppress = True)
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.options.display.float_format = '{:,.6f}'.format

# modeling extensions
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.linear_model as linear_model
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing


# Assignment 10

<a id = 'Assignment-10'></a>

## Load data

<a id = 'load-data'></a>

In [2]:
# load and inspect data
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/tic-tac-toe/tic-tac-toe.data'
                 ,sep = ','
                 ,names = ['top-left-square', 'top-middle-square', 'top-right-square'
                          ,'middle-left-square', 'middle-middle-square', 'middle-right-square'
                          ,'bottom-left-square', 'bottom-middle-square', 'bottom-right-square'
                          ,'class'])

# encode each column's string labels as integer category codes
# (squares: b = 0, o = 1, x = 2; class: negative = 0, positive = 1)
for column in df:
    df[column] = df[column].astype('category').cat.codes

df.info()
display(df[:5])


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 958 entries, 0 to 957
Data columns (total 10 columns):
top-left-square         958 non-null int8
top-middle-square       958 non-null int8
top-right-square        958 non-null int8
middle-left-square      958 non-null int8
middle-middle-square    958 non-null int8
middle-right-square     958 non-null int8
bottom-left-square      958 non-null int8
bottom-middle-square    958 non-null int8
bottom-right-square     958 non-null int8
class                   958 non-null int8
dtypes: int8(10)
memory usage: 9.4 KB


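|   | top-left-square | top-middle-square | top-right-square | middle-left-square | middle-middle-square | middle-right-square | bottom-left-square | bottom-middle-square | bottom-right-square | class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 |
| 1 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 2 | 1 |
| 3 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4 | 2 | 2 | 2 | 2 | 1 | 1 | 0 | 1 | 0 | 1 |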
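
The integer values above come from the category codes assigned in the loading cell. As a minimal sketch of that mapping, assuming a column of strings with no missing values, the pure-Python helper below ranks the distinct labels in sorted order, which is how pandas orders string categories by default. The name `category_codes` and the sample columns are hypothetical and only serve the illustration.

```python
def category_codes(values):
    """Return (codes, lookup), ranking distinct labels in sorted order.

    Illustrative stand-in for `Series.astype('category').cat.codes`
    on a column of strings without missing values.
    """
    lookup = {label: code for code, label in enumerate(sorted(set(values)))}
    return [lookup[value] for value in values], lookup


# Hypothetical square column: 'b' = blank, 'o' and 'x' are the two players.
square_codes, square_lookup = category_codes(['x', 'o', 'b', 'x'])
print(square_lookup)   # {'b': 0, 'o': 1, 'x': 2}
print(square_codes)    # [2, 1, 0, 2]

# The class column is encoded the same way.
class_codes, class_lookup = category_codes(['positive', 'negative', 'positive'])
print(class_lookup)    # {'negative': 0, 'positive': 1}
```

Read this way, a 2 in the displayed rows marks an `x`, a 1 an `o`, a 0 a blank square, and `class` = 1 is the positive outcome.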


In [3]:
# train/test split
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

XTrain, XTest, yTrain, yTest\
      = model_selection.train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)
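
The `stratify = y` argument asks for a split in which the training and test parts keep roughly the same class mix as the full data. The sketch below illustrates the idea in plain Python; it is not scikit-learn's algorithm and its exact part sizes can differ from `train_test_split` by a row. The function name `stratified_split` is hypothetical, and the label list is built from the class counts of the UCI tic-tac-toe data (626 positive, 332 negative).

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def positive_share(labels, indices):
    """Fraction of the selected rows whose label is 1."""
    return sum(labels[i] for i in indices) / len(indices)


labels = [1] * 626 + [0] * 332
train_idx, test_idx = stratified_split(labels)
train_share = positive_share(labels, train_idx)
test_share = positive_share(labels, test_idx)
print(len(train_idx), len(test_idx))
print(round(train_share, 3), round(test_share, 3))
```

Because each class is sampled on its own, the share of positive rows stays close to 65% in both parts, which keeps the accuracy figures on the two parts comparable.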


## Linear regression


<a id = 'Linear-regression'></a>

In [4]:
# create linear regression model and cross validate
linReg = linear_model.LinearRegression()
scores = model_selection.cross_val_score(linReg
                                        ,XTrain
                                        ,yTrain
                                        ,scoring = 'r2'
                                        ,cv = 10
                                       )
print('CV R^2 on training data: {:.3f} +/- {:.3f}'.format(np.mean(scores), np.std(scores)))


CV R^2 on training data: 0.022 +/- 0.045


In [5]:
# test data
linReg = linear_model.LinearRegression()
linReg.fit(XTrain, yTrain)
yPred = linReg.predict(XTest)
print('Test set R^2: {0}'.format(metrics.r2_score(y_pred = yPred, y_true = yTest)))


Test set R^2: 0.09324481956312736


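> Remarks - As expected, the linear regression model performs quite poorly on this classification problem. The scores above are R^2 values, not accuracies: about 0.02 in cross-validation and 0.09 on the test set, so the model explains almost none of the variance in the class label.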
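
To make the R^2 figures easier to read, here is a small sketch of the coefficient of determination, R^2 = 1 - SS_res / SS_tot, written out by hand. The helper name `r2_manual` and the five labels are hypothetical; the notebook itself relies on `metrics.r2_score`.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1, 1, 0, 1, 0]                 # hypothetical class labels

r2_mean = r2_manual(y_true, [0.6] * 5)   # always predict the label mean
r2_perfect = r2_manual(y_true, y_true)   # predict every label exactly
r2_wrong = r2_manual(y_true, [0, 0, 1, 0, 1])   # every prediction flipped

print(r2_mean)      # 0.0
print(r2_perfect)   # 1.0
print(r2_wrong)     # negative: worse than predicting the mean
```

A score near zero, like the ones reported above, therefore means the fitted line does little better than predicting the average class label for every board.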

## Logistic regression

<a id = 'logistic-regression'></a>

In [6]:
# create logistic regression model and cross validate
logReg = linear_model.LogisticRegression()
scores = model_selection.cross_val_score(logReg
                                        ,XTrain
                                        ,yTrain
                                        ,scoring = 'accuracy'
                                        ,cv = 10
                                       )
print('CV accuracy on training data: {:.3f} +/- {:.3f}'.format(np.mean(scores), np.std(scores)))


CV accuracy on training data: 0.678 +/- 0.025


> Remarks - With a mean CV accuracy of 0.678, this basic logistic regression model performs similarly to the lecture model.

In [7]:
# use GridSearchCV to perform CV over several different combinations of parameters
# solver = 'liblinear' supports both the 'l1' and 'l2' penalties searched below
pipe = pipeline.Pipeline(steps=[('scale', preprocessing.StandardScaler())
                               ,('LogisticRegression', linear_model.LogisticRegression(solver = 'liblinear', random_state = 1))])

param_grid = {'LogisticRegression__C': np.linspace(1e-6, 10, 100)
               ,'LogisticRegression__penalty': ['l1','l2']
              }
gs = model_selection.GridSearchCV(estimator = pipe
                                    ,param_grid = param_grid
                                    ,scoring = 'accuracy'
                                    ,cv = 10
                                    ,n_jobs = -1
                                 )
gs = gs.fit(XTrain, yTrain)
print(gs.best_score_)
print(gs.best_params_)


0.6814621409921671
{'LogisticRegression__C': 0.4040413636363636, 'LogisticRegression__penalty': 'l2'}
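
As a rough sketch of how large this search is, the grid from the cell above can be rebuilt in plain Python. The `c_values` list is a hand-written equivalent of `np.linspace(1e-6, 10, 100)`, and the variable names are chosen for this illustration only.

```python
from itertools import product

n_values = 100
start, stop = 1e-6, 10.0
step = (stop - start) / (n_values - 1)
c_values = [start + i * step for i in range(n_values)]   # evenly spaced C values
penalties = ['l1', 'l2']

# One candidate per (C, penalty) pair, keyed the way the pipeline expects.
candidates = [{'LogisticRegression__C': c, 'LogisticRegression__penalty': p}
              for c, p in product(c_values, penalties)]

n_folds = 10
n_fits = len(candidates) * n_folds   # fits during the search, before the final refit

print(len(candidates))   # 200
print(n_fits)            # 2000
print(c_values[4])       # approximately 0.404041, the fifth grid point
```

The selected `C` of about 0.404 is simply the fifth of the 100 evenly spaced grid points, so the search resolution near small `C` values is coarse; a logarithmic grid would be one alternative to consider.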


In [1]:
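# test data
# uses the fitted `gs` from the grid search cell above; rerun that cell first after a kernel restart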
yPred = gs.predict(XTest)
print('Test set accuracy: {0}'.format(metrics.accuracy_score(y_pred = yPred, y_true = yTest)))


NameError: name 'gs' is not defined

> Remarks - This tuned logistic regression model performs marginally better in cross-validation than the model with default parameters (mean accuracy of 0.681 versus 0.678). To optimize the model, a pipeline first applies standard scaling and then fits the logistic regression. This pipeline is fed into GridSearchCV, which searches over the penalty type and the value of C. The chosen value for C is 0.404 and the penalty type is L2. No test set accuracy is reported because the test cell was run in a session where `gs` was not defined.
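
The scaling step mentioned in the remark can be sketched in a few lines. Placing the scaler inside the pipeline means that, during cross-validation, its mean and standard deviation are learned from the training folds only and then applied to the held-out fold. The helpers `fit_scaler` and `transform` and the two small folds are hypothetical; they mirror the arithmetic of `StandardScaler`, which uses the population standard deviation.

```python
import math


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    return mean, math.sqrt(variance)


def transform(column, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in column]


train_fold = [0, 1, 2, 2]   # hypothetical square codes in the training folds
valid_fold = [1, 0]         # hypothetical square codes in the held-out fold

mean, std = fit_scaler(train_fold)                # statistics from training data only
train_scaled = transform(train_fold, mean, std)
valid_scaled = transform(valid_fold, mean, std)   # reuse them on held-out data

print(mean, std)
print(train_scaled)
print(valid_scaled)
```

After the transform the training values have mean 0 and unit variance, while the held-out values are shifted and scaled by the same constants without influencing them.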