### Online Machine Learning Feature Selection Experiments with Test Events

Jay Urbain, PhD

1/29/2016

#### Description

Feature selection can have a significant impact on building effective machine learning models. Irrelevant or partially relevant features can negatively impact model performance.

This script experiments with automatic feature selection techniques available in the Python scikit-learn machine learning package.

#### Feature selection

Feature selection is the process of selecting the features in your data that collectively contribute the most to predicting a target variable. Irrelevant features can decrease the accuracy of many models, especially linear algorithms such as linear and logistic regression.

Three benefits of performing feature selection before modeling your data include:

- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise. 
- Improves Accuracy: Less misleading data means modeling accuracy improves. 
- Reduces Training Time: Less data means that algorithms train faster.
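
To make the idea concrete, here is a minimal sketch of a univariate filter: score each feature by the absolute Pearson correlation with the label and keep the top k. This is an illustration only, not the method used in the cells of this notebook; the helper names (`pearson`, `top_k_features`) and the toy data are made up, and it is written for Python 3 with the standard library only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no information about the label
    return cov / math.sqrt(vx * vy)

def top_k_features(columns, y, k):
    """Rank features by |correlation with y| and keep the k best."""
    scores = {name: abs(pearson(values, y)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy data (made up): 'signal' tracks the label, 'noise' and 'constant' do not.
toy = {
    "signal": [0.1, 0.2, 0.9, 1.0, 0.15, 0.95],
    "noise": [0.5, 0.1, 0.4, 0.2, 0.3, 0.6],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
label = [0, 0, 1, 1, 0, 1]

selected, scores = top_k_features(toy, label, k=1)
print(selected)  # ['signal']
```

Dropping `noise` and `constant` here is the small-scale version of the three benefits listed above: fewer columns to fit, less chance of fitting noise.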

#### Load Honeywell data

In [1]:
import numpy as np
import pandas as pd
import glob
import re
import csv
from datetime import datetime
from dateutil.parser import parse
from sklearn import preprocessing

# define columns (not really needed, header included)
columns = [
'EQPID',
'EVENT',
'STATUS',
'START_DATE',
'FINISH_DATE',
'DOWN_DAYS',
'PRIORITY',
'TOTALHOURS',
'TECH_ASSOC',
'AS_FOUND_CONDITION',
'AS_LEFT_CONDITION',
'EQTYPE',
'DOB',
'CURAGE',
'N_TESTS',
'initEvent',
'prevEvent',
'deltaEvent',
'rateEvent',
'countEvent',
'initCollectData',
'prevCollectData',
'deltaCollectData',
'rateCollectData',
'countCollectData',
'initCalibration',
'prevCalibration',
'deltaCalibration',
'rateCalibration',
'countCalibration',
'initRemoveFromService',
'prevRemoveFromService',
'deltaRemoveFromService',
'rateRemoveFromService',
'countRemoveFromService',
'initReturnToService',
'prevReturnToService',
'deltaReturnToService',
'rateReturnToService',
'countReturnToService',
'initSafetyApproval',
'prevSafetyApproval',
'deltaSafetyApproval',
'rateSafetyApproval',
'countSafetyApproval',
'maxDownDays',
'avgDownDays',
'maxTotalHours',
'avgTotalHours',
'maxNTests',
'avgNTests',
# 'removeFromService30d',
# 'removeFromService60d',
# 'returnToService30d',
# 'returnToService60d',
# 'calibration30d',
# 'calibration60d',
'maintenanceRequest30d',
# 'maintenanceRequest60d',
'totalHours30d',
# 'totalHours60d',
'downDays30d',
# 'downDays60d',
'nTests30d'
# 'nTests60d'
]

#print columns

# load data file
url = "/Users/jayurbain/Honeywell/cms_cal_sr_comp_time_norm_test_event_09302016_label.csv"
df = pd.read_csv(url, index_col=None)
print df.shape
# 4116

# print df.avgNTests.describe()

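# ensure proper datatypes
# NOTE: astype() returns a new Series; the results below are not assigned back,
# so these calls only check that each conversion succeeds and leave df unchanged.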
df.EQPID.astype('str')
df.EVENT.astype('str')
df.STATUS.astype('str')
# df.START_DATE.astype('datetime64')
# df.START_DATE=np.datetime64(df.START_DATE)
# df.FINISH_DATE.astype('datetime64')
df.DOWN_DAYS.astype('int64')
df.PRIORITY.astype('str')
df.TOTALHOURS.astype('float64')
df.TECH_ASSOC.astype('str')
df.AS_FOUND_CONDITION.astype('str')
df.AS_LEFT_CONDITION.astype('str')
df.EQTYPE.astype('str')
df.DOB.astype('datetime64')
df.CURAGE.astype('float64')
df.N_TESTS.astype('float64')
# df.initEvent.astype('datetime64')
# df.prevEvent.astype('datetime64')
df.deltaEvent.astype('float64')
df.rateEvent.astype('float64')
df.countEvent.astype('float64')
df.initCollectData.astype('datetime64')
df.prevCollectData.astype('datetime64')
df.deltaCollectData.astype('float64')
df.rateCollectData.astype('float64')
df.countCollectData.astype('float64')
df.initCalibration.astype('datetime64')
df.prevCalibration.astype('datetime64')
df.deltaCalibration.astype('float64')
df.rateCalibration.astype('float64')
df.countCalibration.astype('float64')
df.initRemoveFromService.astype('datetime64')
df.prevRemoveFromService.astype('datetime64')
df.deltaRemoveFromService.astype('float64')
df.rateRemoveFromService.astype('float64')
df.countRemoveFromService.astype('float64')
# df.initReturnToService.astype('datetime64')
# df.prevReturnToService.astype('datetime64')
df.deltaReturnToService.astype('float64')
df.rateReturnToService.astype('float64')
df.countReturnToService.astype('float64')
# df.initSafetyApproval.astype('datetime64')
# df.prevSafetyApproval.astype('datetime64')
df.deltaSafetyApproval.astype('float64')
df.rateSafetyApproval.astype('float64')
df.countSafetyApproval.astype('float64')
df.maxDownDays.astype('float64')
df.avgDownDays.astype('float64')
df.maxTotalHours.astype('float64')
df.avgTotalHours.astype('float64')
df.maxNTests.astype('float64')
df.avgNTests.astype('float64')
# df.removeFromService30d.astype('bool_')
# df.removeFromService60d.astype('bool_')
# df.returnToService30d.astype('bool_')
# df.returnToService60d.astype('bool_')
# df.calibration30d.astype('bool_')
# df.calibration60d.astype('bool_')
df.maintenanceRequest30d.astype('bool_')
# df.maintenanceRequest60d.astype('bool_')
df.totalHours30d.astype('bool_')
# # df.totalHours60d.astype('bool_')
df.downDays30d.astype('bool_')
# # df.downDays60d.astype('bool_')
df.nTests30d.astype('bool_')

df[:10]


(34445, 55)


Unnamed: 0,EQPID,EVENT,STATUS,START_DATE,FINISH_DATE,DOWN_DAYS,PRIORITY,TOTALHOURS,TECH_ASSOC,AS_FOUND_CONDITION,...,maxDownDays,avgDownDays,maxTotalHours,avgTotalHours,maxNTests,avgNTests,maintenanceRequest30d,totalHours30d,downDays30d,nTests30d
0,DBIFVPCDYD,Test,Completed,1998-06-01,1998-06-01,1,,0.5,12345,,...,1,1.0,0,0.0,1,1.0,False,False,False,False
1,DBIFVPCDYD,Calibration,Completed,2001-05-24,2001-05-24,1,Normal,3.0,26162,SJJZZIKKBY,...,1,1.0,3,1.5,1,0.5,False,False,False,False
2,DBIFVPCDYD,Calibration,Completed,2001-05-29,2001-06-22,24,Low,6.0,26665,SJJZZIKKBY,...,24,8.666667,6,3.0,1,0.333333,False,False,False,False
3,DBIFVPCDYD,Calibration,Completed,2001-07-09,2001-07-16,7,Low,8.0,26665,SJJZZIKKBY,...,24,8.25,8,4.25,1,0.25,False,False,False,False
4,DBIFVPCDYD,Calibration,Completed,2001-08-10,2001-08-14,4,Normal,6.0,26665,SJJZZIKKBY,...,24,7.4,8,4.6,1,0.2,False,False,False,False
5,DBIFVPCDYD,Calibration,Completed,2001-09-03,2001-12-04,92,Low,6.0,26665,SJJZZIKKBY,...,92,21.5,8,4.833333,1,0.166667,False,False,True,False
6,DBIFVPCDYD,Test,Completed,2001-09-27,2001-09-27,1,,0.5,12345,,...,92,18.571429,8,4.142857,2,0.428571,False,False,False,False
7,DBIFVPCDYD,Calibration,Completed,2002-01-01,2002-01-03,2,Hot,4.0,26665,SJJZZIKKBY,...,92,16.5,8,4.125,2,0.375,False,False,False,False
8,DBIFVPCDYD,Calibration,Completed,2002-01-21,2002-02-08,18,Hot,4.0,26665,SJJZZIKKBY,...,92,16.666667,8,4.111111,2,0.333333,False,False,False,False
9,DBIFVPCDYD,Calibration,Completed,2002-02-25,2002-03-05,8,Hot,4.5,26665,SJJZZIKKBY,...,92,15.8,8,4.1,2,0.3,False,False,True,True


#### Select features for machine learning

In [2]:
column_features = ['EQPID', 'EVENT', 'STATUS', 'DOWN_DAYS', 'PRIORITY', 'TOTALHOURS', 'TECH_ASSOC', 'AS_FOUND_CONDITION', 'AS_LEFT_CONDITION', 'EQTYPE', 'CURAGE', 'N_TESTS', 
                    'initEvent', 'prevEvent', 'deltaEvent', 'rateEvent', 'countEvent',
                    'deltaCollectData', 'rateCollectData', 'countCollectData',
                    'deltaCalibration', 'rateCalibration', 'countCalibration',
                    'deltaRemoveFromService', 'rateRemoveFromService', 'countRemoveFromService',
                    'deltaReturnToService', 'rateReturnToService', 'countReturnToService',
                    'deltaSafetyApproval', 'rateSafetyApproval', 'countSafetyApproval',
                    'maxDownDays', 'avgDownDays', 'maxTotalHours', 'avgTotalHours', 'maxNTests', 'avgNTests',
#                     'removeFromService30d', 
#                     'removeFromService60d',
#                     'returnToService30d', 
#                     'returnToService60d',
#                     'calibration30d', 
#                     'calibration60d',
                    'maintenanceRequest30d', 
#                     'maintenanceRequest60d',
                    'totalHours30d', 
#                     'totalHours60d',
                    'downDays30d',
#                     'downDays60d',
                    'nTests30d'
#                     'nTests60d'
                    ]

df_ = df[:][column_features]
df_[:5]


Unnamed: 0,EQPID,EVENT,STATUS,DOWN_DAYS,PRIORITY,TOTALHOURS,TECH_ASSOC,AS_FOUND_CONDITION,AS_LEFT_CONDITION,EQTYPE,...,maxDownDays,avgDownDays,maxTotalHours,avgTotalHours,maxNTests,avgNTests,maintenanceRequest30d,totalHours30d,downDays30d,nTests30d
0,DBIFVPCDYD,Test,Completed,1,,0.5,12345,,,1,...,1,1.0,0,0.0,1,1.0,False,False,False,False
1,DBIFVPCDYD,Calibration,Completed,1,Normal,3.0,26162,SJJZZIKKBY,SJJZZIKKBY,1,...,1,1.0,3,1.5,1,0.5,False,False,False,False
2,DBIFVPCDYD,Calibration,Completed,24,Low,6.0,26665,SJJZZIKKBY,SJJZZIKKBY,1,...,24,8.666667,6,3.0,1,0.333333,False,False,False,False
3,DBIFVPCDYD,Calibration,Completed,7,Low,8.0,26665,SJJZZIKKBY,SJJZZIKKBY,1,...,24,8.25,8,4.25,1,0.25,False,False,False,False
4,DBIFVPCDYD,Calibration,Completed,4,Normal,6.0,26665,SJJZZIKKBY,SJJZZIKKBY,1,...,24,7.4,8,4.6,1,0.2,False,False,False,False


#### Define X, y

In [3]:

X_features = ['EQPID', 'EVENT', 'STATUS', 'DOWN_DAYS', 'PRIORITY', 'TOTALHOURS', 'TECH_ASSOC', 'AS_FOUND_CONDITION', 'AS_LEFT_CONDITION', 'EQTYPE', 'CURAGE', 'N_TESTS', 
                    'deltaEvent', 'rateEvent', 'countEvent',
                    'deltaCollectData', 'rateCollectData', 'countCollectData',
                    'deltaCalibration', 'rateCalibration', 'countCalibration',
                    'deltaRemoveFromService', 'rateRemoveFromService', 'countRemoveFromService',
                    'deltaReturnToService', 'rateReturnToService', 'countReturnToService',
                    'deltaSafetyApproval', 'rateSafetyApproval', 'countSafetyApproval',
                    'maxDownDays', 'avgDownDays', 'maxTotalHours', 'avgTotalHours', 'maxNTests', 'avgNTests']

# Y_labels = ['removeFromService30d', 'removeFromService60d',
#                     'returnToService30d', 'returnToService60d',
#                     'calibration30d', 'calibration60d',
#                     'maintenanceRequest30d', 'maintenanceRequest60d',
#                     'totalHours30d', 'totalHours60d',
#                     'downDays30d','downDays60d',
#                     'nTests30d','nTests60d']

Y_labels = ['maintenanceRequest30d',
            'totalHours30d',
            'downDays30d',
            'nTests30d']

df[:10]

Unnamed: 0,EQPID,EVENT,STATUS,START_DATE,FINISH_DATE,DOWN_DAYS,PRIORITY,TOTALHOURS,TECH_ASSOC,AS_FOUND_CONDITION,...,maxDownDays,avgDownDays,maxTotalHours,avgTotalHours,maxNTests,avgNTests,maintenanceRequest30d,totalHours30d,downDays30d,nTests30d
0,DBIFVPCDYD,Test,Completed,1998-06-01,1998-06-01,1,,0.5,12345,,...,1,1.0,0,0.0,1,1.0,False,False,False,False
1,DBIFVPCDYD,Calibration,Completed,2001-05-24,2001-05-24,1,Normal,3.0,26162,SJJZZIKKBY,...,1,1.0,3,1.5,1,0.5,False,False,False,False
2,DBIFVPCDYD,Calibration,Completed,2001-05-29,2001-06-22,24,Low,6.0,26665,SJJZZIKKBY,...,24,8.666667,6,3.0,1,0.333333,False,False,False,False
3,DBIFVPCDYD,Calibration,Completed,2001-07-09,2001-07-16,7,Low,8.0,26665,SJJZZIKKBY,...,24,8.25,8,4.25,1,0.25,False,False,False,False
4,DBIFVPCDYD,Calibration,Completed,2001-08-10,2001-08-14,4,Normal,6.0,26665,SJJZZIKKBY,...,24,7.4,8,4.6,1,0.2,False,False,False,False
5,DBIFVPCDYD,Calibration,Completed,2001-09-03,2001-12-04,92,Low,6.0,26665,SJJZZIKKBY,...,92,21.5,8,4.833333,1,0.166667,False,False,True,False
6,DBIFVPCDYD,Test,Completed,2001-09-27,2001-09-27,1,,0.5,12345,,...,92,18.571429,8,4.142857,2,0.428571,False,False,False,False
7,DBIFVPCDYD,Calibration,Completed,2002-01-01,2002-01-03,2,Hot,4.0,26665,SJJZZIKKBY,...,92,16.5,8,4.125,2,0.375,False,False,False,False
8,DBIFVPCDYD,Calibration,Completed,2002-01-21,2002-02-08,18,Hot,4.0,26665,SJJZZIKKBY,...,92,16.666667,8,4.111111,2,0.333333,False,False,False,False
9,DBIFVPCDYD,Calibration,Completed,2002-02-25,2002-03-05,8,Hot,4.5,26665,SJJZZIKKBY,...,92,15.8,8,4.1,2,0.3,False,False,True,True


#### Preprocessing

- Select X features, y classification column 
- Also set categorical values to numeric  
- Normalize values using Z-score normalization 
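
The two transformations in the bullets above can be sketched in plain Python. This is a simplified stand-in for what `preprocessing.LabelEncoder` and `preprocessing.scale` compute, assuming integer codes assigned in sorted class order and scaling by the population standard deviation; the helper names and sample values are hypothetical, and the sketch is written for Python 3.

```python
import statistics

def label_encode(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

def z_score(values):
    """Center to mean 0 and scale to unit variance (population standard deviation)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]

# Categorical column -> integer codes
priority = ["Normal", "Low", "Low", "Hot", "Normal"]
codes, classes = label_encode(priority)
print(classes)  # ['Hot', 'Low', 'Normal']
print(codes)    # [2, 1, 1, 0, 2]

# Numeric column -> z-scores
down_days = [1, 1, 24, 7, 4]
scaled = z_score(down_days)
print(scaled)
```

Note that the integer codes impose an arbitrary ordering on the categories (here `Hot` < `Low` < `Normal`), which linear models will treat as a numeric scale.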

In [4]:
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif

# select X features and y classification column
# (copy so the in-place column assignments below do not write to a view of df_)
X = df_[:][X_features].copy()
y = df_[:]['downDays30d']

# X_features = ['EQPID', 'EVENT', 'STATUS', 'DOWN_DAYS', 'PRIORITY', 'TOTALHOURS', 'TECH_ASSOC', 'AS_FOUND_CONDITION', 'AS_LEFT_CONDITION', 'EQTYPE', 'CURAGE', 'N_TESTS', 
# EQPID
le = preprocessing.LabelEncoder()
le.fit(X.EQPID)
# print (le.classes_)
X.EQPID = le.transform(X.EQPID)

# EVENT
le = preprocessing.LabelEncoder()
le.fit(X.EVENT)
# print (le.classes_)
X.EVENT = le.transform(X.EVENT)

# STATUS
le = preprocessing.LabelEncoder()
le.fit(X.STATUS)
# print (le.classes_)
X.STATUS = le.transform(X.STATUS)

X.DOWN_DAYS = preprocessing.scale(X.DOWN_DAYS)

# PRIORITY
le = preprocessing.LabelEncoder()
le.fit(X.PRIORITY)
# print (le.classes_)
X.PRIORITY = le.transform(X.PRIORITY)

X.TOTALHOURS = preprocessing.scale(X.TOTALHOURS)

# TECH_ASSOC
le = preprocessing.LabelEncoder()
le.fit(X.TECH_ASSOC)
# print (le.classes_)
X.TECH_ASSOC = le.transform(X.TECH_ASSOC)

# AS_FOUND_CONDITION
le = preprocessing.LabelEncoder()
le.fit(X.AS_FOUND_CONDITION)
# print (le.classes_)
X.AS_FOUND_CONDITION = le.transform(X.AS_FOUND_CONDITION)

# AS_LEFT_CONDITION
le = preprocessing.LabelEncoder()
le.fit(X.AS_LEFT_CONDITION)
# print (le.classes_)
X.AS_LEFT_CONDITION = le.transform(X.AS_LEFT_CONDITION)

# EQTYPE
le = preprocessing.LabelEncoder()
le.fit(X.EQTYPE)
# print (le.classes_)
X.EQTYPE = le.transform(X.EQTYPE)

X.CURAGE = preprocessing.scale(X.CURAGE)
X.N_TESTS = preprocessing.scale(X.N_TESTS)

'''                    
'deltaEvent', 'rateEvent', 'countEvent',
'deltaCollectData', 'rateCollectData', 'countCollectData',
'deltaCalibration', 'rateCalibration', 'countCalibration',
'deltaRemoveFromService', 'rateRemoveFromService', 'countRemoveFromService',
'deltaReturnToService', 'rateReturnToService', 'countReturnToService',
'deltaSafetyApproval', 'rateSafetyApproval', 'countSafetyApproval',
'maxDownDays', 'avgDownDays', 'maxTotalHours', 'avgTotalHours', 'maxNTests', 'avgNTests'         
''' 

X.deltaEvent = preprocessing.scale(X.deltaEvent)
X.rateEvent = preprocessing.scale(X.rateEvent)
X.countEvent = preprocessing.scale(X.countEvent)
X.deltaCollectData = preprocessing.scale(X.deltaCollectData)
X.rateCollectData = preprocessing.scale(X.rateCollectData)
X.countCollectData = preprocessing.scale(X.countCollectData)
X.deltaCalibration = preprocessing.scale(X.deltaCalibration)
X.rateCalibration = preprocessing.scale(X.rateCalibration)
X.countCalibration = preprocessing.scale(X.countCalibration)
X.deltaRemoveFromService = preprocessing.scale(X.deltaRemoveFromService)
X.rateRemoveFromService = preprocessing.scale(X.rateRemoveFromService)
X.countRemoveFromService = preprocessing.scale(X.countRemoveFromService)
X.deltaReturnToService = preprocessing.scale(X.deltaReturnToService)
X.rateReturnToService = preprocessing.scale(X.rateReturnToService)
X.countReturnToService = preprocessing.scale(X.countReturnToService)
X.deltaSafetyApproval = preprocessing.scale(X.deltaSafetyApproval)
X.rateSafetyApproval = preprocessing.scale(X.rateSafetyApproval)
X.countSafetyApproval = preprocessing.scale(X.countSafetyApproval)
X.maxDownDays = preprocessing.scale(X.maxDownDays)
X.avgDownDays = preprocessing.scale(X.avgDownDays)
X.maxTotalHours = preprocessing.scale(X.maxTotalHours)
X.avgTotalHours = preprocessing.scale(X.avgTotalHours)
X.maxNTests = preprocessing.scale(X.maxNTests)
X.avgNTests = preprocessing.scale(X.avgNTests)    

# y
le = preprocessing.LabelEncoder()
le.fit(y)
# print (le.classes_)
y = le.transform(y)

print X[:10]
print y[:10]

lr = LogisticRegression()
model = lr.fit(X, y)
coefficients = lr.coef_[0]
intercept = lr.intercept_[0]
print 'intercept: ', intercept, 'coefficients: ', coefficients

print 'Logistic Regression model prediction: ', model.predict(X[:10])

print 'Logistic Regression model no feature selection score: ', model.score(X, y)

from sklearn import svm
svc = svm.SVC(kernel='linear')
svc.fit(X, y)

print 'SVM prediction: ', svc.predict(X[:10])
print 'SVM score: ', svc.score(X, y)

from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier

print 'X.shape: ' , X.shape
# lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(svc, prefit=True)
X_new = model.transform(X)
print 'X_new.shape: ', X_new.shape

print 'Select from model (SVM) features: ', X.columns[model.get_support()]

svc.fit(X_new, y)
print 'Select from model (SVM) score: ', svc.score(X_new, y)

lr.fit(X_new, y)
print 'Select from model (Logistic Regression) score: ', lr.score(X_new, y)




   EQPID  EVENT  STATUS  DOWN_DAYS  PRIORITY  TOTALHOURS  TECH_ASSOC  \
0      1      8       0  -0.096973         2   -0.234935           0   
1      1      0       0  -0.096973         3    0.649023           6   
2      1      0       0   1.674749         1    1.709773           8   
3      1      0       0   0.365215         1    2.416939           8   
4      1      0       0   0.134121         3    1.709773           8   
5      1      0       0   6.912882         1    1.709773           8   
6      1      8       0  -0.096973         2   -0.234935           0   
7      1      0       0  -0.019942         0    1.002606           8   
8      1      0       0   1.212560         0    1.002606           8   
9      1      0       0   0.442247         0    1.179398           8   

   AS_FOUND_CONDITION  AS_LEFT_CONDITION  EQTYPE    ...      \
0                   0                  0       0    ...       
1                   7                  5       0    ...       
2                 

#### Feature selection using trees

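`ExtraTreesClassifier` fits an ensemble of randomized decision trees. Its `feature_importances_` attribute ranks the features, and `SelectFromModel` keeps the ones that score above a threshold.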
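
As a rough sketch of the selection step: scikit-learn documents the default `SelectFromModel` threshold as the mean importance for estimators without an L1 penalty, so a feature is kept when its importance is at least the mean. The function below mimics that rule in plain Python 3; the function name and the importance values are hypothetical and are not results from this dataset.

```python
def select_by_mean_importance(importances):
    """Keep features whose importance is at least the mean importance."""
    threshold = sum(importances.values()) / len(importances)
    kept = [name for name, score in importances.items() if score >= threshold]
    return kept, threshold

# Hypothetical importances, made up for illustration.
toy_importances = {
    "feature_a": 0.50,
    "feature_b": 0.30,
    "feature_c": 0.15,
    "feature_d": 0.05,
}

kept, threshold = select_by_mean_importance(toy_importances)
print(kept)  # ['feature_a', 'feature_b']
```

With a mean threshold, roughly half of the features tend to survive when importances are spread evenly, and far fewer when a handful of features dominate.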

In [5]:
# feature extraction
tree = ExtraTreesClassifier()
tree.fit(X, y)
print 'Data:', X[:5]

d = dict(zip(X.columns, tree.feature_importances_))
print 'tree.feature_importances:'
print d
print
print 'ExtraTreesClassifier model score: ', tree.score(X, y)

# train on the first 3000 rows; hold out rows 3000-3499 for testing
X_train = X[:3000]
X_test = X[3000:3500]
y_train = y[:3000]
y_test = y[3000:3500]

tree.fit(X_train, y_train)
print 'ExtraTreesClassifier model score test: ', tree.score(X_test, y_test)

model = SelectFromModel(tree, prefit=True)
X_new = model.transform(X)
lr = LogisticRegression(C=0.9)
lr.fit(X_new, y)
print 'Logistic regression with ExtraTreesClassifier features score: ', lr.score(X_new, y)

model = SelectFromModel(tree, prefit=True)
X_new_train = model.transform(X_train)
X_new_test = model.transform(X_test)
lr = LogisticRegression(C=0.9)
lr.fit(X_new_train, y_train)
print 'X_new.shape: ', X_new.shape

print 'Logistic regression with ExtraTreesClassifier features score test: ', lr.score(X_new_test, y_test)

# Is the beta_1 value associated with balance significant?
# B1 = balance.coef_[0][0]
# B0 = balance.intercept_[0]
# print 'B1: ', B1, ' B0: ', B0 
# print 'e^B1: ', np.exp(B1)

Data:    EQPID  EVENT  STATUS  DOWN_DAYS  PRIORITY  TOTALHOURS  TECH_ASSOC  \
0      1      8       0  -0.096973         2   -0.234935           0   
1      1      0       0  -0.096973         3    0.649023           6   
2      1      0       0   1.674749         1    1.709773           8   
3      1      0       0   0.365215         1    2.416939           8   
4      1      0       0   0.134121         3    1.709773           8   

   AS_FOUND_CONDITION  AS_LEFT_CONDITION  EQTYPE    ...      \
0                   0                  0       0    ...       
1                   7                  5       0    ...       
2                   7                  5       0    ...       
3                   7                  5       0    ...       
4                   7                  5       0    ...       

   countReturnToService  deltaSafetyApproval  rateSafetyApproval  \
0             -0.774489                    0           -0.920281   
1             -0.774489                    0  

#### Evaluate classifications
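
The evaluation cell in this section splits the data by row position (the first 80% for training, the next 20% for testing) and reports accuracy. A minimal sketch of that procedure, plus a majority-class baseline to compare the scores against, might look like this. The helper names and toy labels are made up, and the sketch is written for Python 3.

```python
def holdout_split(rows, train_fraction=0.8):
    """Split by position: the first part trains, the remainder tests (no shuffling)."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def majority_label(labels):
    """Most frequent label in the training set."""
    return max(set(labels), key=labels.count)

# Toy labels (made up for illustration).
labels = [False, False, True, False, False, False, True, False, False, True]
train, test = holdout_split(labels)

baseline = majority_label(train)  # always predict the most common training label
score = accuracy(test, [baseline] * len(test))
print(baseline, score)  # False 0.5
```

A classifier is only useful if its test accuracy beats this baseline. Because the split is positional and the sample rows appear to be ordered by equipment and date, the test rows may come from different equipment than the training rows, which is worth keeping in mind when reading the scores.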

In [6]:
from sklearn import cross_validation

svc = svm.SVC(kernel='linear')
lr = LogisticRegression(C=0.9)
tree = ExtraTreesClassifier()
# 34445
X_train = X[:27200] # 80% of 34000, omit end 
X_test = X[27200:34000] # 20% of 34000

print 'X.shape: ' , X.shape

for label in Y_labels:
    
    y = df_[:][label]
    y_train = y[:27200]
    y_test = y[27200:34000]
    print '************************'
    print 'label: ', label
    svc.fit(X_train, y_train)
#     scores = cross_validation.cross_val_score(svc, X[:3500], y[:3500], cv=5)
#     print("SVM score cv: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    print 'SVM score train: ', svc.score(X_train, y_train)
    print 'SVM score test: ', svc.score(X_test, y_test)
    tree.fit(X_train, y_train)
#     scores = cross_validation.cross_val_score(tree, X[:3500], y[:3500], cv=5)
#     print("Randomized trees score cv: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    print 'Randomized trees score train: ', tree.score(X_train, y_train)
    print 'Randomized trees score test: ', tree.score(X_test, y_test)
    lr.fit(X_train, y_train)
#     scores = cross_validation.cross_val_score(lr, X[:3500], y[:3500], cv=5)
#     print("LR score cv: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    print 'LR score train: ', lr.score(X_train, y_train)
    print 'LR score test: ', lr.score(X_test, y_test)
    print
    
    model = SelectFromModel(svc, prefit=True)
    X_new_train = model.transform(X_train)
    X_new_test = model.transform(X_test)
    lr.fit(X_new_train, y_train)
    print 'SVM model features: ', X.columns[model.get_support()]
#     scores = cross_validation.cross_val_score(lr, X[:3500], y[:3500], cv=5)
#     print("LR score cv: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    print 'Select from model (Logistic Regression) score train: ', lr.score(X_new_train, y_train)
    print 'Select from model (Logistic Regression) score test: ', lr.score(X_new_test, y_test)
    print
    d = dict(zip(X.columns, tree.feature_importances_))
    print 'tree.feature_importances:'
    print d
    print


X.shape:  (34445, 36)
************************
label:  maintenanceRequest30d
SVM score train:  0.595845588235
SVM score test:  0.543088235294
Randomized trees score train:  1.0
Randomized trees score test:  0.564411764706
LR score train:  0.613786764706
LR score test:  0.509852941176

SVM model features:  Index([u'DOWN_DAYS', u'TOTALHOURS', u'AS_FOUND_CONDITION',
       u'AS_LEFT_CONDITION', u'deltaCollectData', u'rateCollectData',
       u'countCollectData', u'maxDownDays'],
      dtype='object')
Select from model (Logistic Regression) score train:  0.593382352941
Select from model (Logistic Regression) score test:  0.544411764706

tree.feature_importances:
{'TOTALHOURS': 0.0078652840658196433, 'rateRemoveFromService': 0.0, 'countCollectData': 0.0087086833554029051, 'maxTotalHours': 0.022999183620285572, 'AS_LEFT_CONDITION': 0.0025200416222368278, 'avgTotalHours': 0.086933987677504335, 'countCalibration': 0.0727760790113893, 'EQTYPE': 0.0076461491543674655, 'AS_FOUND_CONDITION': 0.002