<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Evaluating Classification Models on Humor Styles Data

_Authors: Kiefer Katovich (SF)_

---

In this lab you will practice evaluating classification models (logistic regression in particular) on data from a "Humor Styles" survey.

The survey is designed to assess which "style" of humor each subject has. Your goal is to classify gender from the survey responses.

## Humor styles questions encoding reference

### 32 questions:

Subjects answered **32** different questions outlined below:

    1. I usually don't laugh or joke with other people.
    2. If I feel depressed, I can cheer myself up with humor.
    3. If someone makes a mistake, I will tease them about it.
    4. I let people laugh at me or make fun of me at my expense more than I should.
    5. I don't have to work very hard to make other people laugh. I am a naturally humorous person.
    6. Even when I'm alone, I am often amused by the absurdities of life.
    7. People are never offended or hurt by my sense of humor.
    8. I will often get carried away in putting myself down if it makes family or friends laugh.
    9. I rarely make other people laugh by telling funny stories about myself.
    10. If I am feeling upset or unhappy I usually try to think of something funny about the situation to make myself feel better.
    11. When telling jokes or saying funny things, I am usually not concerned about how other people are taking it.
    12. I often try to make people like or accept me more by saying something funny about my own weaknesses, blunders, or faults.
    13. I laugh and joke a lot with my closest friends.
    14. My humorous outlook on life keeps me from getting overly upset or depressed about things.
    15. I do not like it when people use humor as a way of criticizing or putting someone down.
    16. I don't often say funny things to put myself down.
    17. I usually don't like to tell jokes or amuse people.
    18. If I'm by myself and I'm feeling unhappy, I make an effort to think of something funny to cheer myself up.
    19. Sometimes I think of something that is so funny that I can't stop myself from saying it, even if it is not appropriate for the situation.
    20. I often go overboard in putting myself down when I am making jokes or trying to be funny.
    21. I enjoy making people laugh.
    22. If I am feeling sad or upset, I usually lose my sense of humor.
    23. I never participate in laughing at others even if all my friends are doing it.
    24. When I am with friends or family, I often seem to be the one that other people make fun of or joke about.
    25. I don't often joke around with my friends.
    26. It is my experience that thinking about some amusing aspect of a situation is often a very effective way of coping with problems.
    27. If I don't like someone, I often use humor or teasing to put them down.
    28. If I am having problems or feeling unhappy, I often cover it up by joking around, so that even my closest friends don't know how I really feel.
    29. I usually can't think of witty things to say when I'm with other people.
    30. I don't need to be with other people to feel amused. I can usually find things to laugh about even when I'm by myself.
    31. Even if something is really funny to me, I will not laugh or joke about it if someone will be offended.
    32. Letting others laugh at me is my way of keeping my friends and family in good spirits.

---

### Response scale:

For each question, there are 5 possible response codes (a "Likert scale") that correspond to different answers. A sixth code, -1, indicates that the subject gave no response.

    1 == "Never or very rarely true"
    2 == "Rarely true"
    3 == "Sometimes true"
    4 == "Often true"
    5 == "Very often or always true"
    [-1 == Did not select an answer]
    
---

### Demographics:

    age: entered as text, then parsed to an integer.
    gender: chosen from drop down list (1=male, 2=female, 3=other, 0=declined)
    accuracy: how accurate subjects thought their answers were, on a scale from 0 to 100. Answers were entered as text and parsed to an integer. Subjects were instructed to enter 0 if they did not want to be included in research.

In [67]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

pd.options.display.max_columns = 50

### 1. Load the data and perform any EDA and cleaning you think is necessary.

It is worth reading over the description of the data columns above for this.

In [36]:
hsq = pd.read_csv('./datasets/hsq_data.csv')

In [37]:
# A:
hsq.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,affiliative,selfenhancing,agressive,selfdefeating,age,gender,accuracy
0,2,2,3,1,4,5,4,3,4,3,3,1,5,4,4,4,2,3,3,1,4,4,3,2,1,3,2,4,2,4,2,2,4.0,3.5,3.0,2.3,25,2,100
1,2,3,2,2,4,4,4,3,4,3,4,3,3,4,5,4,2,2,3,2,3,3,4,2,2,5,1,2,4,4,3,1,3.3,3.5,3.3,2.4,44,2,90
2,3,4,3,3,4,4,3,1,2,4,3,2,4,4,3,3,2,4,2,1,4,2,4,3,2,4,3,3,2,5,4,2,3.9,3.9,3.1,2.3,50,1,75
3,3,3,3,4,3,5,4,3,-1,4,2,4,4,5,4,3,3,3,3,3,4,3,2,4,2,4,2,2,4,5,3,3,3.6,4.0,2.9,3.3,30,2,85
4,1,4,2,2,3,5,4,1,4,4,2,2,5,4,4,4,2,3,2,1,5,3,3,1,1,5,2,3,2,5,4,2,4.1,4.1,2.9,2.0,52,1,80


In [38]:
hsq.shape

(1071, 39)

In [50]:
hsq.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,affiliative,selfenhancing,agressive,selfdefeating,age,gender,accuracy
count,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0,1058.0
mean,2.026465,3.337429,3.082231,2.837429,3.604915,4.146503,3.285444,2.536862,2.579395,2.862949,2.741021,2.970699,4.441399,3.267486,3.365784,3.097353,1.912098,2.743856,3.238185,2.099244,4.353497,3.022684,2.761815,2.400756,1.544423,3.507561,2.273157,3.204159,2.320416,3.935728,2.768431,2.844045,4.011248,3.371078,2.957372,2.764745,29.165406,0.549149,87.500945
std,1.075776,1.110941,1.164699,1.159077,1.054277,0.982482,1.098634,1.230741,1.221005,1.202775,1.247529,1.228829,0.881249,1.267191,1.379483,1.215235,1.142134,1.205312,1.255443,1.125342,0.971589,1.23779,1.221836,1.145135,0.88529,1.208782,1.292602,1.319349,1.207975,1.137488,1.307957,1.231588,0.706649,0.660878,0.411734,0.645335,82.373897,0.497814,12.068729
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.3,0.0,0.0,0.0,14.0,0.0,2.0
25%,1.0,3.0,2.0,2.0,3.0,4.0,3.0,2.0,2.0,2.0,2.0,2.0,4.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,4.0,2.0,2.0,2.0,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,3.6,2.9,2.8,2.3,18.0,0.0,80.0
50%,2.0,3.0,3.0,3.0,4.0,4.0,3.0,2.0,2.0,3.0,3.0,3.0,5.0,3.0,3.0,3.0,2.0,3.0,3.0,2.0,5.0,3.0,3.0,2.0,1.0,4.0,2.0,3.0,2.0,4.0,3.0,3.0,4.1,3.4,3.0,2.8,22.0,1.0,90.0
75%,3.0,4.0,4.0,4.0,4.0,5.0,4.0,3.0,3.0,4.0,4.0,4.0,5.0,4.0,5.0,4.0,2.0,4.0,4.0,3.0,5.0,4.0,4.0,3.0,2.0,4.0,3.0,4.0,3.0,5.0,4.0,4.0,4.5,3.8,3.3,3.1,31.0,1.0,95.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,2670.0,1.0,100.0


In [49]:
# a score of 5.1 exceeds the 1-5 response scale, so cap it at 5
hsq.loc[hsq.affiliative == 5.1, 'affiliative'] = 5.

### 2. Set up a predictor matrix to predict `gender` (only male vs. female)

Choice of predictors is up to you. Justify which variables you include.

In [43]:
gender_3 = hsq[hsq.gender == 3].index.values
gender_0 = hsq[hsq.gender == 0].index.values

In [44]:
# remove gender = 0 and gender = 3
hsq.drop(gender_3, inplace = True, axis = 0)
hsq.drop(gender_0, inplace = True, axis = 0)

In [45]:
hsq.shape

(1058, 39)

In [46]:
# recode gender: 1 = male, 0 = female (originally coded 2)
hsq.gender = (hsq.gender == 1).astype(int)

In [47]:
hsq.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,affiliative,selfenhancing,agressive,selfdefeating,age,gender,accuracy
0,2,2,3,1,4,5,4,3,4,3,3,1,5,4,4,4,2,3,3,1,4,4,3,2,1,3,2,4,2,4,2,2,4.0,3.5,3.0,2.3,25,0,100
1,2,3,2,2,4,4,4,3,4,3,4,3,3,4,5,4,2,2,3,2,3,3,4,2,2,5,1,2,4,4,3,1,3.3,3.5,3.3,2.4,44,0,90
2,3,4,3,3,4,4,3,1,2,4,3,2,4,4,3,3,2,4,2,1,4,2,4,3,2,4,3,3,2,5,4,2,3.9,3.9,3.1,2.3,50,1,75
3,3,3,3,4,3,5,4,3,-1,4,2,4,4,5,4,3,3,3,3,3,4,3,2,4,2,4,2,2,4,5,3,3,3.6,4.0,2.9,3.3,30,0,85
4,1,4,2,2,3,5,4,1,4,4,2,2,5,4,4,4,2,3,2,1,5,3,3,1,1,5,2,3,2,5,4,2,4.1,4.1,2.9,2.0,52,1,80


In [53]:
y = hsq.gender
X = hsq[['affiliative', 'selfenhancing', 'agressive', 'selfdefeating']]

In [64]:
hsq[['affiliative', 'selfenhancing', 'agressive', 'selfdefeating','gender']].corr()

Unnamed: 0,affiliative,selfenhancing,agressive,selfdefeating,gender
affiliative,1.0,0.367613,0.058456,0.173478,0.039709
selfenhancing,0.367613,1.0,0.211656,0.252222,-0.022419
agressive,0.058456,0.211656,1.0,0.213565,-0.051389
selfdefeating,0.173478,0.252222,0.213565,1.0,0.082703
gender,0.039709,-0.022419,-0.051389,0.082703,1.0


### 3. Fit a Logistic Regression model and compare your cross-validated accuracy to the baseline.

In [60]:
hsq.gender.value_counts()

1    581
0    477
Name: gender, dtype: int64

In [59]:
# calculating baseline
for i in range(2):
    print(i, hsq.gender.value_counts()[i])

0 477
1 581


In [63]:
baseline = float(hsq.gender.value_counts()[1])/hsq.gender.value_counts().sum()
baseline

0.5491493383742911

In [110]:
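from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
# fit on the full data to get in-sample predictions and probabilities
logreg.fit(X, y)
# 5-fold cross-validated accuracy (the model is refit on each fold)
scores = cross_val_score(logreg, X, y, cv=5)
predict = logreg.predict(X)
# predict_proba columns follow logreg.classes_: 0 (female), then 1 (male)
predictprob = pd.DataFrame(logreg.predict_proba(X), columns = ['Female', 'Male'])
print('cv scores:', scores)
print('cv mean score:', np.mean(scores))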

cv scores: [ 0.53051643  0.58490566  0.53080569  0.56872038  0.55450237]
cv mean score: 0.553890105664


### 4. Create a 50-50 train-test split. Fit the model on training and get the predictions and predicted probabilities on the test data.

In [98]:
# A:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, test_size=0.5, random_state=8)

logreg2 = LogisticRegression()

logreg2.fit(X_train,y_train)

# accuracy, predicted labels, and predicted probabilities on the held-out half
logreg2_score = logreg2.score(X_test, y_test)
y_predict = logreg2.predict(X_test)
y_predictprob = pd.DataFrame(logreg2.predict_proba(X_test), columns = ['Female','Male'])


print('logreg2 score:', logreg2_score)
print(y_predict[0:10])
print(y_predictprob[0:10])

logreg2 score: 0.551984877127
[1 0 1 1 1 1 1 1 1 1]
     Female      Male
0  0.404943  0.595057
1  0.547517  0.452483
2  0.458527  0.541473
3  0.458477  0.541523
4  0.323247  0.676753
5  0.454373  0.545627
6  0.426732  0.573268
7  0.359420  0.640580
8  0.362536  0.637464
9  0.466340  0.533660


### 5. Manually calculate the true positives, false positives, true negatives, and false negatives.

In [105]:
print(len(y))
print(len(predict))

1058
1058


In [106]:
# A:
# label 1 (male) is treated as the positive class
tp = np.sum((y == 1) & (predict == 1))
fp = np.sum((y == 0) & (predict == 1))
tn = np.sum((y == 0) & (predict == 0))
fn = np.sum((y == 1) & (predict == 0))
print('tp:', tp)
print('fp:', fp)
print('tn:', tn)
print('fn:', fn)

tp: 482
fp: 362
tn: 115
fn: 99


### 6. Construct the confusion matrix. 

In [107]:
# A:
from sklearn.metrics import confusion_matrix

In [108]:
confusion_matrix(y, predict)

array([[115, 362],
       [ 99, 482]])
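
To read a matrix like the one above: scikit-learn puts the true labels on the rows and the predicted labels on the columns, so with labels 0 and 1 the layout is `[[tn, fp], [fn, tp]]`. The sketch below checks this on made-up labels (the arrays are illustrative, not the survey data; in the lab they would be `y` and `predict`).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only (assumption); not taken from the survey.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 1, 0, 1, 0])

# Rows are true labels, columns are predicted labels: [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# The same counts computed by hand, as in question 5.
manual_tp = np.sum((y_true == 1) & (y_pred == 1))
manual_fp = np.sum((y_true == 0) & (y_pred == 1))

print('tn:', tn, 'fp:', fp, 'fn:', fn, 'tp:', tp)
print('manual tp:', manual_tp, 'manual fp:', manual_fp)
```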

### 7. Print out the false positive count as you change your threshold for predicting label 1.
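
One possible approach, sketched on made-up numbers: `predict` labels an observation 1 when its predicted probability of class 1 crosses 0.5, so other thresholds can be tried by comparing the probability column against a cutoff directly. The helper `false_positive_count` and both arrays are hypothetical; in the lab the inputs would be `y_test` and the `'Male'` column of `y_predictprob`.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of label 1 (illustration only).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
prob_1 = np.array([0.12, 0.40, 0.35, 0.80, 0.55, 0.65, 0.72, 0.90])

def false_positive_count(y_true, prob_1, threshold):
    """Count observations predicted as 1 (prob >= threshold) whose true label is 0."""
    predicted = (prob_1 >= threshold).astype(int)
    return int(np.sum((y_true == 0) & (predicted == 1)))

# Raising the threshold makes the model more reluctant to predict 1,
# so the false positive count can only stay the same or go down.
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print('threshold:', threshold, 'false positives:',
          false_positive_count(y_true, prob_1, threshold))
```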

### 8. Plot an ROC curve using your predicted probabilities on the test data.

Calculate the area under the curve.

> *Hint: go back to the lecture to find code for plotting the ROC curve.*
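
As a rough reminder of the pattern (not necessarily the lecture's exact code), `roc_curve` takes the true labels and the predicted probability of the positive class and returns the false positive rate and true positive rate at each threshold; `auc` then integrates that curve. The arrays below are invented for illustration; in the lab they would be `y_test` and the `'Male'` column of the predicted probabilities.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and predicted probabilities of label 1 (illustration only).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
prob_1 = np.array([0.12, 0.40, 0.35, 0.80, 0.55, 0.65, 0.72, 0.90])

# False positive rate and true positive rate at each candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, prob_1)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(fpr, tpr, label='ROC (AUC = %0.2f)' % roc_auc)
# The diagonal is what a classifier guessing at random would achieve.
ax.plot([0, 1], [0, 1], linestyle='--', color='grey', label='chance')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.legend(loc='lower right')
```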

In [10]:
from sklearn.metrics import roc_curve, auc

In [11]:
# A:

### 9. Cross-validate a logistic regression with a Ridge penalty.

Logistic regression can also use the Ridge penalty. Sklearn's `LogisticRegressionCV` class will help you cross-validate an appropriate regularization strength.

**Important `LogisticRegressionCV` arguments:**
- `penalty`: this can be one of `'l1'` or `'l2'`. L1 is the Lasso, and L2 is the Ridge.
- `Cs`: If an integer, how many different (automatically-selected) regularization strengths should be tested. A list of specific `C` values can also be passed.
- `cv`: How many cross-validation folds should be used to test regularization strength.
- `solver`: When using the lasso penalty, this should be set to a solver that supports L1, such as `'liblinear'`.

> **Note:** The `C` regularization strength is the *inverse* of alpha. That is to say, `C = 1./alpha`
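
A minimal sketch of how these arguments fit together, assuming synthetic stand-in data from `make_classification` rather than the survey (in the lab, fit on `X_train` and `y_train` instead). The variable names `X_demo`, `y_demo`, and `ridge_cv` are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data (assumption); not the humor styles survey.
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=8)

# Try 20 automatically chosen C values, scoring each with 5-fold cross-validation.
ridge_cv = LogisticRegressionCV(penalty='l2', Cs=20, cv=5)
ridge_cv.fit(X_demo, y_demo)

# C_ holds the selected regularization strength; smaller C means a stronger penalty.
best_C = float(np.ravel(ridge_cv.C_)[0])
print('best C:', best_C)
print('equivalent alpha (1 / C):', 1.0 / best_C)
```

After fitting, the object behaves like an ordinary classifier, so `predict` and `predict_proba` can be called on held-out data.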

In [12]:
from sklearn.linear_model import LogisticRegressionCV

In [13]:
# A:

**9.B Calculate the predicted labels and predicted probabilities on the test set with the Ridge logistic regression.**

In [14]:
# A:

**9.C Construct the confusion matrix for the Ridge LR.**

In [15]:
# A:

### 10. Plot the ROC curve for the original and Ridge logistic regressions on the same plot.

Which performs better?
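
One way to overlay several curves, sketched under the assumption of synthetic data and placeholder model names: loop over the fitted models, compute each model's ROC curve on the same test set, and draw them on a shared axis. The model with the larger area under the curve ranks positives above negatives more often on that test set.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); in the lab use the survey train/test split.
X_demo, y_demo = make_classification(n_samples=400, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=8)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5,
                                          random_state=8)

models = {
    'default LR': LogisticRegression(),
    'Ridge LR (CV)': LogisticRegressionCV(penalty='l2', Cs=10, cv=5),
}

fig, ax = plt.subplots(figsize=(6, 6))
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Column 1 of predict_proba is the probability of label 1.
    prob_1 = model.predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, prob_1)
    aucs[name] = auc(fpr, tpr)
    ax.plot(fpr, tpr, label='%s (AUC = %0.3f)' % (name, aucs[name]))

ax.plot([0, 1], [0, 1], linestyle='--', color='grey', label='chance')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.legend(loc='lower right')
```

Adding another entry to the `models` dictionary draws one more curve on the same axes.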

In [16]:
# A:

### 11. Cross-validate a Lasso logistic regression.

**Remember:**
- `penalty` must be set to `'l1'`
- `solver` must be set to `'liblinear'`

> **Note:** The lasso penalty can be considerably slower. You may want to try fewer Cs or use fewer cv folds.
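
A hedged sketch of these settings on synthetic stand-in data (the names `X_demo`, `y_demo`, and `lasso_cv` are placeholders, and the data is not the survey). It uses fewer `Cs` and folds, as the note suggests, and counts how many coefficients the L1 penalty set exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data (assumption): 6 features, only 2 of them informative.
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=2, n_redundant=0,
                                     random_state=8)

# L1 penalty with the liblinear solver; 10 C values and 3 folds to keep it quick.
lasso_cv = LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=10, cv=3)
lasso_cv.fit(X_demo, y_demo)

best_C = float(np.ravel(lasso_cv.C_)[0])
n_zero = int(np.sum(lasso_cv.coef_ == 0))
print('best C:', best_C)
print('coefficients set exactly to zero:', n_zero)
```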

In [17]:
# A:

### 12. Make the confusion matrix for the Lasso model.

In [18]:
# A:

### 13. Plot all three logistic regression models on the same ROC plot.

Which is best, if any?

In [19]:
# A:

### 14. Look at the coefficients for the Lasso logistic regression model. Which variables are the most important?
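
One way to rank the variables is to put the coefficients in a table and sort by absolute value. The coefficient values below are invented purely to show the mechanics; in the lab they would come from the fitted Lasso model's `coef_[0]`. Magnitudes are only comparable when the predictors are on similar scales, and a coefficient of exactly zero means the Lasso dropped that variable.

```python
import numpy as np
import pandas as pd

feature_names = ['affiliative', 'selfenhancing', 'agressive', 'selfdefeating']
# Made-up coefficients for illustration only (assumption); not fitted results.
coefs = np.array([0.15, 0.0, -0.30, 0.05])

coef_table = pd.DataFrame({'feature': feature_names, 'coef': coefs})
# Sort by magnitude so the sign does not affect the ranking.
coef_table['abs_coef'] = coef_table['coef'].abs()
coef_table = coef_table.sort_values('abs_coef', ascending=False).reset_index(drop=True)
print(coef_table)
```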

In [20]:
# A: