### Classification Tasks

Below are three example data sets for use in classification tasks.  Your goal is as follows:

1. State a clear question that could be answered using classification with the specific data set.
2. Determine whether you care more about precision or recall in this task, and why.
3. Investigate the performance of `LogisticRegression`, `KNeighborsClassifier`, `SGDClassifier`, and a `DummyClassifier` based on the majority class.
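
As a rough sketch of step 3, the four classifiers can be fit on one shared split and scored on held-out rows only. The data here is a synthetic, imbalanced stand-in generated with `make_classification`, not one of the data sets listed below, and the names `X_demo`, `y_demo`, `models`, and `results` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in data (roughly 11% positives); not the real data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.89, 0.11], random_state=0
)

# One stratified split shared by every model, so the scores are comparable.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
    "SGDClassifier": SGDClassifier(random_state=0),
    "DummyClassifier": DummyClassifier(strategy="most_frequent"),
}

results = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    pred = model.predict(Xd_test)  # predictions on held-out rows only
    results[name] = {
        # zero_division=0 avoids a warning when a model never predicts class 1
        "precision": precision_score(yd_test, pred, zero_division=0),
        "recall": recall_score(yd_test, pred, zero_division=0),
    }
    print(name, results[name])
```

The majority-class dummy never predicts the positive class, so its recall is zero by construction; it serves as the floor the other models should beat.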

---

- https://archive.ics.uci.edu/ml/datasets/bank+marketing
- https://www.kaggle.com/c/dato-native
- https://data.cityofnewyork.us/Health/restaurants/ukrt-mvy9/data

---

In [50]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier

from sklearn import metrics
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in later releases
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

from sklearn import datasets
from sklearn.feature_selection import RFE

In [2]:
bank = pd.read_csv('data/bank-additional-full.csv', delimiter=';')

In [3]:
bank = bank.dropna()  # dropna() returns a new frame; assign it to keep the result
bank.head()

,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [4]:
bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usa

In [5]:
bank.education.unique()

array(['basic.4y', 'high.school', 'basic.6y', 'basic.9y',
       'professional.course', 'unknown', 'university.degree',
       'illiterate'], dtype=object)

In [6]:
bank.y.value_counts()

no     36548
yes     4640
Name: y, dtype: int64
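
The counts above imply a strong class imbalance, which fixes the accuracy that a majority-class baseline reaches. A small worked check, using only the two counts printed above (the variable names are illustrative):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"no": 36548, "yes": 4640})

# Always predicting the majority class is right exactly this often.
baseline_accuracy = float(counts.max() / counts.sum())

# Share of positive ('yes') rows in the data.
positive_rate = float(counts["yes"] / counts.sum())

print(round(baseline_accuracy, 4), round(positive_rate, 4))
```

Any model whose accuracy is close to this baseline of about 0.887 has learned little beyond the class frequencies, which is why precision and recall on the `yes` class are more informative here.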

In [7]:
tc_flag = {'yes': 1, 'no': 0}

# Map the target column `y` from 'yes'/'no' strings to 1/0
bank['y'] = bank['y'].map(tc_flag)

In [8]:
bank.y.value_counts()

0    36548
1     4640
Name: y, dtype: int64

In [10]:
bank_orig = bank.copy()  # independent copy; plain assignment would only alias the same frame

In [11]:
bank_orig.head()

,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


In [12]:
bank.head()

,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


In [13]:
bank.groupby('y').mean(numeric_only=True)  # average the numeric columns only

y,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,39.911185,220.844807,2.633085,984.113878,0.132374,0.248875,93.603757,-40.593097,3.811491,5176.1666
1,40.913147,553.191164,2.051724,792.03556,0.492672,-1.233448,93.354386,-39.789784,2.123135,5095.115991


In [14]:
columns = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']
# One-hot encode each categorical column, then drop the original column
for column in columns:
    bank = bank.join(pd.get_dummies(bank[column], prefix=column))
    bank = bank.drop(column, axis=1)
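
The same encoding can usually be done in one call by passing `columns=` to `pd.get_dummies`. A minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration, not taken from the bank data):

```python
import pandas as pd

# Tiny illustrative frame with one numeric and two categorical columns.
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "marital": ["married", "married", "single"],
    "contact": ["telephone", "cellular", "telephone"],
})

# Encode the listed columns; untouched columns are kept, originals are dropped.
encoded = pd.get_dummies(toy, columns=["marital", "contact"])
print(encoded.columns.tolist())
```

The new columns are named `<column>_<value>`, matching the `prefix=column` naming used in the loop above.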

In [15]:
#bank.columns

In [16]:
# Target column name(s) and the names of all remaining feature columns
y = ['y']
X = [column for column in bank.columns if column not in y]

In [17]:
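logreg = LogisticRegression()
# Newer scikit-learn releases require n_features_to_select as a keyword argument
rfe = RFE(logreg, n_features_to_select=18)
# Pass the target as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning
rfe = rfe.fit(bank[X], bank['y'])
print(rfe.support_)
print(rfe.ranking_)
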
[False False False False  True False False False  True False False  True
 False False False  True False  True  True False False False False False
 False False False False False False False False False False  True False
 False False False False False False False False  True  True  True  True
 False False  True  True  True False False False  True False False False
  True  True  True]
[39 36 15 42  1 17 20 21  1 30 12  1 22 41 43  1 35  1  1 31 28 44  6  7
  8 45 14 24  9 13 40 33 25  2  1  4 46 23 19 37 16 34 18  5  1  1  1  1
 26 27  1  1  1 32 11 10  1 38 29  3  1  1  1]
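
The boolean mask and the ranking array are easier to read once they are attached to column names. A hedged sketch on synthetic data (the frame `X_df`, the feature names `f0`..`f7`, and the choice of three features are assumptions for illustration only):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with named columns.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_df = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_df, y_arr)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features.
selected = X_df.columns[selector.support_].tolist()
ranking = pd.Series(selector.ranking_, index=X_df.columns).sort_values()

print(selected)
print(ranking)
```

Indexing the column index with `support_` avoids copying the selected names by hand from the printed mask.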


In [18]:
X = bank[['previous', 'euribor3m', 'job_blue-collar', 'job_retired', 'job_services', 'job_student', 'default_no',
      'month_aug', 'month_dec', 'month_jul', 'month_nov', 'month_oct', 'month_sep', 'day_of_week_fri', 'day_of_week_wed', 
      'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success']]

y = bank['y']

In [20]:
import statsmodels.api as sm
from scipy import stats
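# Workaround for older statsmodels releases that still call scipy.stats.chisqprob,
# which was removed from SciPy
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)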

logit_model = sm.Logit(y, X)
result = logit_model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.287116
         Iterations 7


Dep. Variable:,y,No. Observations:,41188.0
Model:,Logit,Df Residuals:,41170.0
Method:,MLE,Df Model:,17.0
Date:,"Tue, 24 Apr 2018",Pseudo R-squ.:,0.1844
Time:,10:02:31,Log-Likelihood:,-11826.0
converged:,True,LL-Null:,-14499.0
,,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
previous,0.2385,0.051,4.642,0.000,0.138,0.339
euribor3m,-0.4981,0.012,-40.386,0.000,-0.522,-0.474
job_blue-collar,-0.3222,0.049,-6.549,0.000,-0.419,-0.226
job_retired,0.3821,0.069,5.552,0.000,0.247,0.517
job_services,-0.2423,0.065,-3.701,0.000,-0.371,-0.114
job_student,0.3540,0.086,4.107,0.000,0.185,0.523
default_no,0.3312,0.056,5.943,0.000,0.222,0.440
month_aug,0.4272,0.055,7.770,0.000,0.319,0.535
month_dec,0.8061,0.163,4.948,0.000,0.487,1.125


In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [29]:
y_pred = logreg.predict(X_test)
print(logreg.score(X_test, y_test))

0.8973860969490977


In [37]:
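# KNeighborsClassifier
# Note: this cell re-splits the data with a different test size and seed
# than the logistic regression cell above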
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=100)

knn = KNeighborsClassifier(n_neighbors=2)

# fitting the model
knn.fit(X_train, y_train)

# predict the response
pred = knn.predict(X_test)

# evaluate accuracy
print(accuracy_score(y_test, pred))

0.8895755168101228


In [41]:
# SGDClassifier
sgd = SGDClassifier()
sgd.fit(X_train, y_train)



SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)

In [42]:
sgd.score(X_train, y_train)  # accuracy on the training split, not on held-out data

0.8984236274687444

In [45]:
# DummyClassifier
dumclass = DummyClassifier(strategy='most_frequent')

In [47]:
dumclass.fit(X_train, y_train)

DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

In [48]:
dumclass.score(X_train, y_train)  # accuracy on the training split, not on held-out data

0.8876608081174125

In [58]:
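# Predict
# This re-creates the split that logreg was fit on. The predictions below use X_train,
# so logreg is scored on its own training rows, and knn/sgd/dumclass (fit on the
# test_size=0.33, random_state=100 split) are scored on rows that partly overlap
# their training data. None of the scores below are held-out test scores.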
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [59]:
lr_pred = logreg.predict(X_train)
knn_pred = knn.predict(X_train)
sgd_pred = sgd.predict(X_train)
dumclass_pred = dumclass.predict(X_train)

In [61]:
print("Accuracy score for Logistic Regression model:\n{:.2f}".format(accuracy_score(y_train, lr_pred)))
print("Accuracy score for K-Nearest-Neighbors model:\n{:.2f}".format(accuracy_score(y_train, knn_pred)))
print("Accuracy score for SGD model: \n{:.2f}".format(accuracy_score(y_train, sgd_pred)))
print("Accuracy score for Dummy Classifier model: \n{:.2f}".format(accuracy_score(y_train, dumclass_pred)))

Accuracy score for Logistic Regression model:
0.90
Accuracy score for K-Nearest-Neighbors model:
0.90
Accuracy score for SGD model: 
0.90
Accuracy score for Dummy Classifier model: 
0.89


In [62]:
print("Recall score for Logistic Regression model: \n", recall_score(y_train, lr_pred))
print("Recall score for K-Nearest-Neighbors: \n", recall_score(y_train, knn_pred))
print("Recall score for SGD model: \n", recall_score(y_train, sgd_pred))
print("Recall score for Dummy Classifier model: \n", recall_score(y_train, dumclass_pred))

Recall score for Logistic Regression model: 
 0.18942189421894218
Recall score for K-Nearest-Neighbors: 
 0.21924969249692497
Recall score for SGD model: 
 0.1918819188191882
Recall score for Dummy Classifier model: 
 0.0


In [63]:
print("Logistic Regression full report\n", classification_report(y_train,lr_pred))

Logistic Regression full report
              precision    recall  f1-score   support

          0       0.91      0.99      0.95     25579
          1       0.68      0.19      0.30      3252

avg / total       0.88      0.90      0.87     28831



In [64]:
print("K-Nearest-Neighbors full report\n", classification_report(y_train,knn_pred))

K-Nearest-Neighbors full report
              precision    recall  f1-score   support

          0       0.91      0.99      0.95     25579
          1       0.72      0.22      0.34      3252

avg / total       0.89      0.90      0.88     28831



In [65]:
print("SGD full report\n", classification_report(y_train,sgd_pred))

SGD full report
              precision    recall  f1-score   support

          0       0.91      0.99      0.95     25579
          1       0.67      0.19      0.30      3252

avg / total       0.88      0.90      0.87     28831



In [66]:
print("Dummy full report\n", classification_report(y_train,dumclass_pred))

Dummy full report
              precision    recall  f1-score   support

          0       0.89      1.00      0.94     25579
          1       0.00      0.00      0.00      3252

avg / total       0.79      0.89      0.83     28831
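
The class-1 rows of the reports can be cross-checked by hand, since F1 is the harmonic mean of precision and recall. A small worked check using only the rounded values printed above (so the derived numbers are approximate; the dictionary and variable names are illustrative):

```python
# Rounded class-1 (precision, recall) pairs copied from the reports above.
reported = {
    "LogisticRegression": (0.68, 0.19),
    "KNeighborsClassifier": (0.72, 0.22),
    "SGDClassifier": (0.67, 0.19),
}
support_pos = 3252  # number of class-1 rows in the evaluated split

f1_scores = {}
for name, (precision, recall) in reported.items():
    # F1 = 2PR / (P + R)
    f1 = 2 * precision * recall / (precision + recall)
    # recall = TP / (class-1 rows), so TP is roughly recall * support
    approx_tp = recall * support_pos
    f1_scores[name] = round(f1, 2)
    print(f"{name}: f1={f1:.2f}, approx. true positives={approx_tp:.0f}")
```

The recomputed F1 values agree with the `f1-score` column, and the true-positive estimates show that each model finds only about a fifth of the 3252 positive rows despite roughly 90% accuracy.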



