## Session 12 - Assignment 1

#### I decided to treat this as a classification problem: I create a new binary variable affair (did the woman have at least one affair?) and try to predict its value for each woman.
#### Dataset 
#### The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a 1974 survey by Redbook magazine in which married women were asked about their participation in extramarital affairs. More information about the study is available in a 1978 paper from the Journal of Political Economy.
#### Description of Variables
#### The dataset contains 6366 observations of 9 variables:

#### rate_marriage: woman's rating of her marriage (1 = very poor, 5 = very good)
#### age: woman's age
#### yrs_married: number of years married
#### children: number of children
#### religious: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
#### educ: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
#### occupation: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
#### occupation_husb: husband's occupation
#### affairs: time spent in extra-marital affairs

In [1]:
# Import the required modules and assign aliases to them.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score

In [2]:
# loading a dataset from statsmodels into a pandas dataframe
dta = sm.datasets.fair.load_pandas().data
dta.head()

Unnamed: 0,rate_marriage,age,yrs_married,children,religious,educ,occupation,occupation_husb,affairs
0,3.0,32.0,9.0,3.0,3.0,17.0,2.0,5.0,0.111111
1,3.0,27.0,13.0,3.0,1.0,14.0,3.0,4.0,3.230769
2,4.0,22.0,2.5,0.0,1.0,16.0,3.0,5.0,1.4
3,4.0,37.0,16.5,4.0,3.0,16.0,5.0,5.0,0.727273
4,5.0,27.0,9.0,1.0,1.0,14.0,3.0,4.0,4.666666


In [3]:
# add "affair" column: 1 represents having affairs, 0 represents not
dta['affair'] = (dta.affairs > 0).astype(int)
dta.head()

Unnamed: 0,rate_marriage,age,yrs_married,children,religious,educ,occupation,occupation_husb,affairs,affair
0,3.0,32.0,9.0,3.0,3.0,17.0,2.0,5.0,0.111111,1
1,3.0,27.0,13.0,3.0,1.0,14.0,3.0,4.0,3.230769,1
2,4.0,22.0,2.5,0.0,1.0,16.0,3.0,5.0,1.4,1
3,4.0,37.0,16.5,4.0,3.0,16.0,5.0,5.0,0.727273,1
4,5.0,27.0,9.0,1.0,1.0,14.0,3.0,4.0,4.666666,1


In [4]:
# Add an intercept column as well as dummy variables for occupation and occupation_husb,
# since we are treating them as categorical variables.
y, X = dmatrices('affair ~ rate_marriage + age + yrs_married + children + \
religious + educ + C(occupation) + C(occupation_husb)',
dta, return_type="dataframe")
X.head()

Unnamed: 0,Intercept,C(occupation)[T.2.0],C(occupation)[T.3.0],C(occupation)[T.4.0],C(occupation)[T.5.0],C(occupation)[T.6.0],C(occupation_husb)[T.2.0],C(occupation_husb)[T.3.0],C(occupation_husb)[T.4.0],C(occupation_husb)[T.5.0],C(occupation_husb)[T.6.0],rate_marriage,age,yrs_married,children,religious,educ
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,3.0,32.0,9.0,3.0,3.0,17.0
1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,27.0,13.0,3.0,1.0,14.0
2,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,22.0,2.5,0.0,1.0,16.0
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,37.0,16.5,4.0,3.0,16.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,5.0,27.0,9.0,1.0,1.0,14.0
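The dummy columns above follow treatment coding: one level acts as the reference and is dropped, so six occupation codes yield five indicator columns. Here is a minimal NumPy sketch of that idea, assuming the first level (1.0) is the reference, which is what the T.2.0 through T.6.0 column names suggest. The helper `treatment_code` is hypothetical and is not part of patsy.

```python
import numpy as np

def treatment_code(values, levels):
    """Return indicator columns for every level except the first (the reference)."""
    values = np.asarray(values)
    # One column per non-reference level; 1.0 where the value equals that level.
    return np.column_stack([(values == lvl).astype(float) for lvl in levels[1:]])

# Occupation codes of the first five rows shown in dta.head().
occupation = [2.0, 3.0, 3.0, 5.0, 3.0]
levels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

dummies = treatment_code(occupation, levels)
print(dummies)
```

A woman with occupation 1 gets zeros in all five columns, so her effect is absorbed by the intercept.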


In [5]:
y

Unnamed: 0,affair
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0
5,1.0
6,1.0
7,1.0
8,1.0
9,1.0


In [6]:
# Rename the dummy-variable columns to shorter names.
X = X.rename(columns = {'C(occupation)[T.2.0]':'occ_2',
'C(occupation)[T.3.0]':'occ_3',
'C(occupation)[T.4.0]':'occ_4',
'C(occupation)[T.5.0]':'occ_5',
'C(occupation)[T.6.0]':'occ_6',
'C(occupation_husb)[T.2.0]':'occ_husb_2',
'C(occupation_husb)[T.3.0]':'occ_husb_3',
'C(occupation_husb)[T.4.0]':'occ_husb_4',
'C(occupation_husb)[T.5.0]':'occ_husb_5',
'C(occupation_husb)[T.6.0]':'occ_husb_6'})
X.head()

Unnamed: 0,Intercept,occ_2,occ_3,occ_4,occ_5,occ_6,occ_husb_2,occ_husb_3,occ_husb_4,occ_husb_5,occ_husb_6,rate_marriage,age,yrs_married,children,religious,educ
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,3.0,32.0,9.0,3.0,3.0,17.0
1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,27.0,13.0,3.0,1.0,14.0
2,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,22.0,2.5,0.0,1.0,16.0
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,37.0,16.5,4.0,3.0,16.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,5.0,27.0,9.0,1.0,1.0,14.0


In [7]:
#flatten y into a 1-D array
y = np.ravel(y)
y

array([1., 1., 1., ..., 0., 0., 0.])

In [8]:
# Split the dataset into a training set (75%) and a test set (25%).
# train_test_split was already imported from sklearn.model_selection in the first cell.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
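As a quick sanity check on the split, the expected row counts can be worked out by hand. The sketch below assumes a fractional test_size is rounded up to a whole number of rows; under that assumption the result agrees with the 1592 test predictions counted in the confusion matrix later in this notebook.

```python
import math

n_samples = 6366      # observations in the dataset
test_size = 0.25      # fraction passed to train_test_split

# Assumption: a fractional test size is rounded up to a whole number of rows.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)
```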

In [9]:
print("X-TRAINING DATA:\n",X_train)
print("X-TEST DATA:\n",X_test)
print("Y-TRAINING DATA:\n",y_train)
print("Y-TEST DATA:\n",y_test)

X-TRAINING DATA:
       Intercept  occ_2  occ_3  occ_4  occ_5  occ_6  occ_husb_2  occ_husb_3  \
4533        1.0    0.0    0.0    0.0    1.0    0.0         0.0         0.0   
2720        1.0    0.0    1.0    0.0    0.0    0.0         1.0         0.0   
3407        1.0    0.0    0.0    0.0    1.0    0.0         0.0         0.0   
3326        1.0    1.0    0.0    0.0    0.0    0.0         0.0         0.0   
5306        1.0    1.0    0.0    0.0    0.0    0.0         0.0         0.0   
5986        1.0    0.0    0.0    1.0    0.0    0.0         0.0         1.0   
3562        1.0    0.0    0.0    1.0    0.0    0.0         0.0         1.0   
1843        1.0    1.0    0.0    0.0    0.0    0.0         0.0         0.0   
5146        1.0    0.0    0.0    0.0    1.0    0.0         0.0         0.0   
1556        1.0    0.0    0.0    0.0    1.0    0.0         0.0         0.0   
1431        1.0    0.0    0.0    0.0    1.0    0.0         0.0         0.0   
5335        1.0    0.0    1.0    0.0    0.0   

In [10]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
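Standardization rescales each column to zero mean and unit variance using statistics learned from the training rows only, then reuses those statistics on the test rows. The NumPy sketch below mimics that behaviour on made-up numbers. It assumes that a constant column is left undivided, which is consistent with the zeros in the first column of the scaled output shown in the next cell.

```python
import numpy as np

# Toy data (made up): a constant intercept column plus one real feature.
X_train = np.array([[1.0, 22.0], [1.0, 27.0], [1.0, 32.0], [1.0, 37.0]])
X_test = np.array([[1.0, 27.0], [1.0, 42.0]])

# Learn the statistics from the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0.0] = 1.0   # assumption: constant columns are not divided

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std   # reuse the training statistics

print(X_train_scaled)
print(X_test_scaled)
```

After scaling, the Intercept column is all zeros and carries no information; the model still gets an intercept because LogisticRegression is fitted with fit_intercept=True.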

In [11]:
print("X-TRAINING DATA AFTER FEATURE SCALING:\n",X_train)
print("X-TEST DATA AFTER FEATURE SCALING:\n",X_test)

X-TRAINING DATA AFTER FEATURE SCALING:
 [[ 0.         -0.39947743 -0.87793594 ... -0.97943795  0.66739309
   0.81124293]
 [ 0.         -0.39947743  1.13903527 ... -0.97943795 -0.47418148
  -1.01021444]
 [ 0.         -0.39947743 -0.87793594 ... -0.97943795  0.66739309
  -1.01021444]
 ...
 [ 0.         -0.39947743  1.13903527 ... -0.97943795 -0.47418148
  -0.09948576]
 [ 0.         -0.39947743  1.13903527 ... -0.97943795 -0.47418148
  -0.09948576]
 [ 0.         -0.39947743  1.13903527 ... -0.97943795 -1.61575604
  -1.01021444]]
X-TEST DATA AFTER FEATURE SCALING:
 [[ 0.         -0.39947743  1.13903527 ...  0.42051318 -1.61575604
  -0.09948576]
 [ 0.         -0.39947743  1.13903527 ... -0.97943795  1.80896766
   0.81124293]
 [ 0.         -0.39947743 -0.87793594 ... -0.97943795 -0.47418148
   1.26660727]
 ...
 [ 0.         -0.39947743  1.13903527 ...  1.12048875  0.66739309
  -1.01021444]
 [ 0.         -0.39947743 -0.87793594 ...  1.82046432  0.66739309
  -0.09948576]
 [ 0.         -0.39947

In [14]:
# Fitting Logistic Regression to the Training set
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
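For intuition about what the fitted classifier does at prediction time: it computes a linear score per row, passes it through the sigmoid to get a probability, and assigns class 1 when that probability exceeds 0.5. The sketch below uses hypothetical coefficients and made-up rows, not the values fitted in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and intercept, NOT the fitted values from this notebook.
weights = np.array([-0.7, 0.4])
intercept = -0.8

# Two made-up, already standardized rows.
X = np.array([[1.5, -0.2], [-1.2, 0.9]])

scores = X @ weights + intercept
probabilities = sigmoid(scores)
labels = (probabilities > 0.5).astype(float)

print(probabilities, labels)
```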

In [15]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred

array([1., 0., 0., ..., 0., 1., 1.])

In [16]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [17]:
print("CONFUSION MATRIX:\n",cm)

CONFUSION MATRIX:
 [[992 108]
 [315 177]]
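The confusion matrix says more than overall accuracy does. In scikit-learn's layout the rows are true labels and the columns are predicted labels, so the four cells can be read as true negatives, false positives, false negatives and true positives. A small sketch that derives precision and recall for the affair class from the numbers printed above:

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 992, 108
fn, tp = 315, 177

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted affairs that are real
recall = tp / (tp + fn)      # share of real affairs that are detected

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Recall for the positive class comes out at roughly 0.36, meaning the model misses most of the women who actually had an affair.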


In [18]:
print("MODEL ACCURACY:", metrics.accuracy_score(y_test, y_pred))

MODEL ACCURACY: 0.7342964824120602
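To put this accuracy in context, it helps to compare it with a trivial baseline that predicts "no affair" for every woman. The sketch below computes that baseline from the test-set class counts in the confusion matrix; the comparison is my addition, not a step from the original analysis.

```python
# Test-set class counts, read off the rows of the confusion matrix.
n_negative = 992 + 108   # women with no affair
n_positive = 315 + 177   # women with at least one affair

baseline_accuracy = n_negative / (n_negative + n_positive)
model_accuracy = 0.7342964824120602   # value printed above

print(f"baseline={baseline_accuracy:.3f} model={model_accuracy:.3f}")
```

The baseline reaches about 0.691, so the logistic regression improves on it by roughly four percentage points.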
