# Logistic Regression with scikit-learn

This is an example of logistic regression in Python with the [scikit-learn module](http://scikit-learn.org/), performed for an [assignment](https://github.com/ajschumacher/gadsdc/tree/master/logistic_assignment) with my [General Assembly Data Science class](https://generalassemb.ly/education/data-science).

## Dataset

The dataset I chose is the [affairs dataset](http://statsmodels.sourceforge.net/stable/datasets/generated/fair.html) that comes with [Statsmodels](http://statsmodels.sourceforge.net/). It was derived from a survey of women in 1974 by Redbook magazine, in which married women were asked about their participation in extramarital affairs. More information about the study is available in a [1978 paper](http://fairmodel.econ.yale.edu/rayfair/pdf/1978a200.pdf) from the Journal of Political Economy.

## Description of Variables

The dataset contains 6366 observations of 9 variables:

* `rate_marriage`: woman's rating of her marriage (1 = very poor, 5 = very good)
* `age`: woman's age
* `yrs_married`: number of years married
* `children`: number of children
* `religious`: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
* `educ`: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
* `occupation`: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
* `occupation_husb`: husband's occupation (same coding as above)
* `affairs`: time spent in extra-marital affairs

## Problem Statement

I decided to treat this as a classification problem by creating a new binary variable `affair` (did the woman have at least one affair?) and trying to predict the classification for each woman.

Skipper Seabold, one of the primary contributors to Statsmodels, did a similar classification in his [Statsmodels demo](https://github.com/jseabold/pydc) at a [Statistical Programming DC Meetup](http://www.meetup.com/stats-prog-dc/events/173693192/). However, he used Statsmodels for the classification (whereas I'm using scikit-learn), and he treated the occupation variables as continuous (whereas I'm treating them as categorical).

## Import modules

In [6]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score

## Data Pre-Processing

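First, let's load the dataset. The cells below read minute-by-minute GOOG price data from a local JSON file, with one row per minute and the columns `open`, `high`, `low`, `close`, and `volume`. This is not the affairs dataset described above, so the variable list and the `affair` target from the earlier sections do not apply to the code that follows.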

In [7]:
df = pd.read_json('/Users/Felipe/GitHub/StocksGenie/Arquivo de gerar e jogar para S3/json/GOOG.json')
df = df.T

In [8]:
dta = df
dta.columns = ['open', 'high', 'low', 'close', 'volume']
dta

| | open | high | low | close | volume |
|---|---|---|---|---|---|
| 2017-10-23 09:30:00 | 989.5200 | 989.5200 | 989.5200 | 989.5200 | 10731.0 |
| 2017-10-23 09:31:00 | 988.6700 | 988.7100 | 987.0000 | 987.0000 | 3223.0 |
| 2017-10-23 09:32:00 | 986.5300 | 987.3100 | 986.5300 | 987.3100 | 1500.0 |
| 2017-10-23 09:33:00 | 987.0000 | 987.8414 | 987.0000 | 987.7000 | 2200.0 |
| 2017-10-23 09:34:00 | 987.7000 | 988.0000 | 987.0200 | 987.0200 | 1954.0 |
| 2017-10-23 09:35:00 | 987.2300 | 987.2300 | 986.5000 | 986.9737 | 1020.0 |
| 2017-10-23 09:36:00 | 986.5050 | 987.4600 | 986.5000 | 987.3250 | 1500.0 |
| 2017-10-23 09:37:00 | 987.0200 | 987.0200 | 986.7100 | 986.7100 | 2600.0 |
| 2017-10-23 09:38:00 | 986.6800 | 987.1144 | 986.0000 | 986.0000 | 4489.0 |
| 2017-10-23 09:39:00 | 985.9100 | 986.1750 | 985.1100 | 985.1100 | 2717.0 |


## Data Exploration

In [9]:
dta['raised'] = 0
dta

| | open | high | low | close | volume | raised |
|---|---|---|---|---|---|---|
| 2017-10-23 09:30:00 | 989.5200 | 989.5200 | 989.5200 | 989.5200 | 10731.0 | 0 |
| 2017-10-23 09:31:00 | 988.6700 | 988.7100 | 987.0000 | 987.0000 | 3223.0 | 0 |
| 2017-10-23 09:32:00 | 986.5300 | 987.3100 | 986.5300 | 987.3100 | 1500.0 | 0 |
| 2017-10-23 09:33:00 | 987.0000 | 987.8414 | 987.0000 | 987.7000 | 2200.0 | 0 |
| 2017-10-23 09:34:00 | 987.7000 | 988.0000 | 987.0200 | 987.0200 | 1954.0 | 0 |
| 2017-10-23 09:35:00 | 987.2300 | 987.2300 | 986.5000 | 986.9737 | 1020.0 | 0 |
| 2017-10-23 09:36:00 | 986.5050 | 987.4600 | 986.5000 | 987.3250 | 1500.0 | 0 |
| 2017-10-23 09:37:00 | 987.0200 | 987.0200 | 986.7100 | 986.7100 | 2600.0 | 0 |
| 2017-10-23 09:38:00 | 986.6800 | 987.1144 | 986.0000 | 986.0000 | 4489.0 | 0 |
| 2017-10-23 09:39:00 | 985.9100 | 986.1750 | 985.1100 | 985.1100 | 2717.0 | 0 |


In [10]:
dta['date'] = dta.index
dta = dta.reset_index(drop=True)
dta

| | open | high | low | close | volume | raised | date |
|---|---|---|---|---|---|---|---|
| 0 | 989.5200 | 989.5200 | 989.5200 | 989.5200 | 10731.0 | 0 | 2017-10-23 09:30:00 |
| 1 | 988.6700 | 988.7100 | 987.0000 | 987.0000 | 3223.0 | 0 | 2017-10-23 09:31:00 |
| 2 | 986.5300 | 987.3100 | 986.5300 | 987.3100 | 1500.0 | 0 | 2017-10-23 09:32:00 |
| 3 | 987.0000 | 987.8414 | 987.0000 | 987.7000 | 2200.0 | 0 | 2017-10-23 09:33:00 |
| 4 | 987.7000 | 988.0000 | 987.0200 | 987.0200 | 1954.0 | 0 | 2017-10-23 09:34:00 |
| 5 | 987.2300 | 987.2300 | 986.5000 | 986.9737 | 1020.0 | 0 | 2017-10-23 09:35:00 |
| 6 | 986.5050 | 987.4600 | 986.5000 | 987.3250 | 1500.0 | 0 | 2017-10-23 09:36:00 |
| 7 | 987.0200 | 987.0200 | 986.7100 | 986.7100 | 2600.0 | 0 | 2017-10-23 09:37:00 |
| 8 | 986.6800 | 987.1144 | 986.0000 | 986.0000 | 4489.0 | 0 | 2017-10-23 09:38:00 |
| 9 | 985.9100 | 986.1750 | 985.1100 | 985.1100 | 2717.0 | 0 | 2017-10-23 09:39:00 |


In [11]:
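# label each minute 1 if its close is higher than the previous minute's close, else 0
# (row 0 has no previous minute, so it keeps its initial value of 0)
for i in range(dta['close'].count() - 1):
    if dta.loc[i + 1, 'close'] > dta.loc[i, 'close']:
        dta.loc[i + 1, 'raised'] = 1
    else:
        dta.loc[i + 1, 'raised'] = 0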
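
The labeling loop can also be written without explicit iteration. The following is a minimal sketch on a toy frame built from the first five closing prices shown earlier. It assumes the same column names (`close`, `raised`) and uses `Series.diff()`, which the original cell does not use; the `toy` frame and the `raised_vectorized` column are names introduced only for this illustration.

```python
import pandas as pd

# toy frame: the first five closing prices from the table above
toy = pd.DataFrame({'close': [989.52, 987.00, 987.31, 987.70, 987.02]})

# loop version, mirroring the notebook cell
toy['raised'] = 0
for i in range(toy['close'].count() - 1):
    if toy.loc[i + 1, 'close'] > toy.loc[i, 'close']:
        toy.loc[i + 1, 'raised'] = 1
    else:
        toy.loc[i + 1, 'raised'] = 0

# vectorized version: diff() is NaN for the first row, and NaN > 0 is False,
# so the first row is labeled 0 just as in the loop
toy['raised_vectorized'] = (toy['close'].diff() > 0).astype(int)

print(toy)
```

Both columns come out identical on this toy data; on a frame with thousands of rows the vectorized form is likely to be noticeably faster.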
    

In [12]:
dta

| | open | high | low | close | volume | raised | date |
|---|---|---|---|---|---|---|---|
| 0 | 989.5200 | 989.5200 | 989.5200 | 989.5200 | 10731.0 | 0 | 2017-10-23 09:30:00 |
| 1 | 988.6700 | 988.7100 | 987.0000 | 987.0000 | 3223.0 | 0 | 2017-10-23 09:31:00 |
| 2 | 986.5300 | 987.3100 | 986.5300 | 987.3100 | 1500.0 | 1 | 2017-10-23 09:32:00 |
| 3 | 987.0000 | 987.8414 | 987.0000 | 987.7000 | 2200.0 | 1 | 2017-10-23 09:33:00 |
| 4 | 987.7000 | 988.0000 | 987.0200 | 987.0200 | 1954.0 | 0 | 2017-10-23 09:34:00 |
| 5 | 987.2300 | 987.2300 | 986.5000 | 986.9737 | 1020.0 | 0 | 2017-10-23 09:35:00 |
| 6 | 986.5050 | 987.4600 | 986.5000 | 987.3250 | 1500.0 | 1 | 2017-10-23 09:36:00 |
| 7 | 987.0200 | 987.0200 | 986.7100 | 986.7100 | 2600.0 | 0 | 2017-10-23 09:37:00 |
| 8 | 986.6800 | 987.1144 | 986.0000 | 986.0000 | 4489.0 | 0 | 2017-10-23 09:38:00 |
| 9 | 985.9100 | 986.1750 | 985.1100 | 985.1100 | 2717.0 | 0 | 2017-10-23 09:39:00 |


In [13]:
# average of each numeric column for falling/flat (0) and rising (1) minutes
dta.groupby('raised').mean(numeric_only=True)

| raised | open | high | low | close | volume |
|---|---|---|---|---|---|
| 0 | 1002.914869 | 1003.04539 | 1002.458223 | 1002.603934 | 2792.245605 |
| 1 | 1003.306873 | 1003.76422 | 1003.177364 | 1003.62976 | 3018.151885 |


## Data Visualization

In [14]:
# show plots in the notebook
%matplotlib inline

## Prepare Data for Logistic Regression
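
To prepare the data, we need a response column (`raised`) and a matrix of predictors that includes an intercept column. The `dmatrices` function from the [patsy module](http://patsy.readthedocs.org/en/latest/) can build both from a formula.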

In [15]:
dta

| | open | high | low | close | volume | raised | date |
|---|---|---|---|---|---|---|---|
| 0 | 989.5200 | 989.5200 | 989.5200 | 989.5200 | 10731.0 | 0 | 2017-10-23 09:30:00 |
| 1 | 988.6700 | 988.7100 | 987.0000 | 987.0000 | 3223.0 | 0 | 2017-10-23 09:31:00 |
| 2 | 986.5300 | 987.3100 | 986.5300 | 987.3100 | 1500.0 | 1 | 2017-10-23 09:32:00 |
| 3 | 987.0000 | 987.8414 | 987.0000 | 987.7000 | 2200.0 | 1 | 2017-10-23 09:33:00 |
| 4 | 987.7000 | 988.0000 | 987.0200 | 987.0200 | 1954.0 | 0 | 2017-10-23 09:34:00 |
| 5 | 987.2300 | 987.2300 | 986.5000 | 986.9737 | 1020.0 | 0 | 2017-10-23 09:35:00 |
| 6 | 986.5050 | 987.4600 | 986.5000 | 987.3250 | 1500.0 | 1 | 2017-10-23 09:36:00 |
| 7 | 987.0200 | 987.0200 | 986.7100 | 986.7100 | 2600.0 | 0 | 2017-10-23 09:37:00 |
| 8 | 986.6800 | 987.1144 | 986.0000 | 986.0000 | 4489.0 | 0 | 2017-10-23 09:38:00 |
| 9 | 985.9100 | 986.1750 | 985.1100 | 985.1100 | 2717.0 | 0 | 2017-10-23 09:39:00 |


In [16]:
# build the response y and a design matrix X with an intercept column
# plus the five numeric predictors (the formula adds the intercept automatically)
y, X = dmatrices('raised ~ volume + open + close + high + low', dta, return_type="dataframe")
print(X.columns)


Index(['Intercept', 'volume', 'open', 'close', 'high', 'low'], dtype='object')
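
For a formula this simple (numeric columns only, no categorical or interaction terms), the design matrix is just the selected columns with a constant column in front. The sketch below builds the same layout by hand with pandas on three rows copied from the tables above. The names `toy`, `X_manual`, and `y_manual` are introduced for illustration only, and this is an approximation of what patsy produces, not a replacement for it.

```python
import pandas as pd

# three rows copied from the tables above
toy = pd.DataFrame({
    'open':   [989.52, 988.67, 986.53],
    'high':   [989.52, 988.71, 987.31],
    'low':    [989.52, 987.00, 986.53],
    'close':  [989.52, 987.00, 987.31],
    'volume': [10731.0, 3223.0, 1500.0],
    'raised': [0, 0, 1],
})

# predictors in the same order as the right-hand side of the formula
X_manual = toy[['volume', 'open', 'close', 'high', 'low']].copy()
# patsy adds a constant column named 'Intercept'; here it is added by hand
X_manual.insert(0, 'Intercept', 1.0)
# response as a one-column frame, like the y returned with return_type="dataframe"
y_manual = toy[['raised']].astype(float)

print(list(X_manual.columns))
print(y_manual.shape)
```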


We also need to flatten `y` into a 1-D array, so that scikit-learn will properly understand it as the response variable.

In [17]:
# flatten y into a 1-D array
y = np.ravel(y)
y

array([ 0.,  0.,  1., ...,  0.,  1.,  1.])
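
As a small illustration of what the flattening does, the sketch below applies `np.ravel` to a hand-made column vector with the same shape convention as a one-column response. The array `y_column` is made up for this example.

```python
import numpy as np

# a column vector shaped like a one-column response: 3 rows, 1 column
y_column = np.array([[0.0], [0.0], [1.0]])
y_flat = np.ravel(y_column)

print(y_column.shape)  # (3, 1)
print(y_flat.shape)    # (3,)
```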

## Logistic Regression

Let's go ahead and run logistic regression on the entire data set, and see how accurate it is!

In [18]:
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)

# check the accuracy on the training set (the fraction of cases classified correctly)
model.score(X, y)

0.81922976159287397

In [19]:
# what fraction of minutes closed higher than the previous minute?
y.mean()

0.49331936075451926
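
The mean of `y` is the share of minutes labeled 1, so it gives a baseline for judging the accuracy: a classifier that always predicts the more common class. A quick sketch of that arithmetic, using the value printed above:

```python
# share of observations labeled 1 (the value printed by y.mean() above)
positive_rate = 0.49331936075451926

# accuracy obtained by always predicting the majority class
baseline_accuracy = max(positive_rate, 1 - positive_rate)

print(round(baseline_accuracy, 4))  # 0.5067
```

Against a baseline of roughly 51%, a training accuracy of about 82% is a clear improvement.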

Let's examine the coefficients to see what we learn.

In [20]:
# pair each column name with its fitted coefficient
coefficients = list(zip(X.columns, np.transpose(model.coef_)))
coefficients

[('Intercept', array([-0.05213724])),
 ('volume', array([ -7.47430192e-06])),
 ('open', array([-6.17613236])),
 ('close', array([ 6.24305909])),
 ('high', array([-0.21977292])),
 ('low', array([ 0.15300226]))]
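
One way to read these numbers: the `open` and `close` coefficients are large, nearly equal in size, and opposite in sign, which suggests the model responds mostly to the difference `close - open` within each minute rather than to the price level. The sketch below checks that arithmetic using only the coefficients printed above. It is offered as an interpretation aid under that assumption, and the variable names are introduced here for illustration.

```python
# coefficients copied from the output above
coef = {
    'Intercept': -0.05213724,
    'volume': -7.47430192e-06,
    'open': -6.17613236,
    'close': 6.24305909,
    'high': -0.21977292,
    'low': 0.15300226,
}

# sum of the four price coefficients: the change in the linear predictor
# when every price in a row is shifted up by 1
price_level_effect = sum(coef[name] for name in ('open', 'close', 'high', 'low'))
print(price_level_effect)

# the open/close terms rewritten around (close - open):
#   b_open*open + b_close*close = b_close*(close - open) + (b_open + b_close)*open
leftover_on_open = coef['open'] + coef['close']
print(leftover_on_open)
```

The four price coefficients sum to roughly 0.00016, so adding the same amount to all four prices barely moves the linear predictor, while a one-unit gap between `close` and `open` moves it by about 6.2.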

## Model Evaluation Using a Validation Set

So far, we have trained and tested on the same set. Let's instead split the data into a training set and a testing set.

In [21]:
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.

In [22]:
# predict class labels for the test set
predicted = model2.predict(X_test)
print (predicted)

[ 1.  0.  1. ...,  1.  0.  0.]


In [23]:
# generate class probabilities
probs = model2.predict_proba(X_test)
print (probs)

[[ 0.13808982  0.86191018]
 [ 0.6419753   0.3580247 ]
 [ 0.2133754   0.7866246 ]
 ..., 
 [ 0.17071041  0.82928959]
 [ 0.72474878  0.27525122]
 [ 0.61921292  0.38078708]]
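
The two outputs are consistent with each other: for a binary model, the predicted label is the class with the larger probability. The sketch below applies that rule to the six probability rows shown above (the elided rows are not reconstructed) and recovers the six labels visible in the `predicted` output.

```python
# the six rows of class probabilities printed above: [P(class 0), P(class 1)]
probs_shown = [
    [0.13808982, 0.86191018],
    [0.6419753, 0.3580247],
    [0.2133754, 0.7866246],
    [0.17071041, 0.82928959],
    [0.72474878, 0.27525122],
    [0.61921292, 0.38078708],
]

# picking the more probable class is the same as asking
# whether P(class 1) exceeds 0.5
labels = [1.0 if p_up > 0.5 else 0.0 for _, p_up in probs_shown]
print(labels)  # [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
```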


Now let's generate some evaluation metrics.

In [24]:
# generate evaluation metrics
print (metrics.accuracy_score(y_test, predicted))
print (metrics.roc_auc_score(y_test, probs[:, 1]))

0.828097731239
0.917122578287


We can also see the confusion matrix and a classification report with other metrics.

In [25]:
print (metrics.confusion_matrix(y_test, predicted))
print (metrics.classification_report(y_test, predicted))

[[462 104]
 [ 93 487]]
             precision    recall  f1-score   support

        0.0       0.83      0.82      0.82       566
        1.0       0.82      0.84      0.83       580

avg / total       0.83      0.83      0.83      1146
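
The figures in the report can be recomputed by hand from the confusion matrix, which helps when checking what each metric means. A short sketch, using the four counts printed above and treating class 1 (a rise) as the positive class:

```python
# confusion matrix printed above: rows are true classes, columns are predicted classes
tn, fp = 462, 104
fn, tp = 93, 487

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_up = tp / (tp + fp)   # of the predicted rises, how many were real
recall_up = tp / (tp + fn)      # of the real rises, how many were found
f1_up = 2 * precision_up * recall_up / (precision_up + recall_up)

print(round(accuracy, 4), round(precision_up, 2), round(recall_up, 2), round(f1_up, 2))
# 0.8281 0.82 0.84 0.83
```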



## Model Evaluation Using Cross-Validation

Now let's try 10-fold cross-validation, to see if the accuracy holds up more rigorously.

In [26]:
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print (scores)
print (scores.mean())

[ 0.81462141  0.78590078  0.79895561  0.83246073  0.88188976  0.85826772
  0.77427822  0.81102362  0.78740157  0.81364829]
0.815844772612
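
Besides the mean, the spread of the fold scores indicates how stable the accuracy is from fold to fold. A minimal sketch using the ten values printed above and only the standard library:

```python
from statistics import mean, pstdev

# the ten fold accuracies printed above
scores_shown = [0.81462141, 0.78590078, 0.79895561, 0.83246073, 0.88188976,
                0.85826772, 0.77427822, 0.81102362, 0.78740157, 0.81364829]

print(round(mean(scores_shown), 4))    # 0.8158
print(round(pstdev(scores_shown), 4))  # spread across folds
print(min(scores_shown), max(scores_shown))
```

The folds range from about 77% to about 88%, so a single train/test split could land anywhere in that band.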


Looks good. It's still performing at about 82% accuracy.

## Predicting the Probability That the Stock Goes Up

Just for fun, let's predict the probability of a rise for a made-up observation that is not in the dataset: a volume of 1000, an open and high of 1000, and a close and low of 100. The leading 1 in the array is the value for the `Intercept` column.

In [27]:
# values follow the design matrix column order: Intercept, volume, open, close, high, low
# (predict_proba expects a 2-D array with one row per observation)
model.predict_proba(np.array([[1, 1000, 1000, 100, 1000, 100]]))


array([[ 1.,  0.]])
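
The result means the model puts essentially all of the probability on class 0 (no rise), which is what one would expect for a close far below the open. The sketch below reproduces why, using the coefficients printed earlier. It leaves out `model.intercept_`, which the notebook never prints, so the value of `z` is approximate; the `sigmoid` helper is written here for illustration and is not scikit-learn's internal code.

```python
import math

# coefficients copied from the earlier output, in the order
# Intercept, volume, open, close, high, low; the model's separate
# intercept_ attribute was not printed, so it is left out (assumption)
coef = [-0.05213724, -7.47430192e-06, -6.17613236, 6.24305909, -0.21977292, 0.15300226]
x = [1, 1000, 1000, 100, 1000, 100]

# linear predictor: the dot product of coefficients and inputs
z = sum(c * v for c, v in zip(coef, x))

def sigmoid(t):
    """Logistic function, written so that exp() never overflows."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)  # underflows harmlessly to 0.0 for very negative t
    return e / (1.0 + e)

print(z)           # a very large negative number
print(sigmoid(z))  # 0.0
```

With a linear predictor this far below zero, the probability of a rise is indistinguishable from 0 in floating point, and a naive `exp(-z)` would overflow.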

## Next Steps

There are many different steps that could be tried in order to improve the model:

* including interaction terms
* removing features
* regularization techniques
* using a non-linear model
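
As a starting point for the regularization idea, the sketch below compares several values of `C` (the inverse regularization strength in scikit-learn's `LogisticRegression`) with 10-fold cross-validation. It runs on synthetic data from `make_classification`, not on the GOOG data, and the scaling step and the grid of `C` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption: not the GOOG data used above)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

results = {}
for C in (0.01, 0.1, 1.0, 10.0):
    # smaller C means stronger L2 regularization
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=C))
    scores = cross_val_score(pipeline, X_demo, y_demo, scoring='accuracy', cv=10)
    results[C] = scores.mean()

for C, mean_accuracy in results.items():
    print(C, round(mean_accuracy, 3))
```

To try this on the notebook's data, `X_demo` and `y_demo` would be swapped for `X` and `y`.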