## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

In [5]:
# Does scoring higher on violence-related items increase the chance of being left-handed?
## Q3, Q15, Q16, Q23, Q27, Q39
# Does scoring higher on intelligence-related items increase the chance of being left-handed?
## Q1, Q13, Q35, Q11 (inverted)
# Does scoring higher on introversion-related items increase the chance of being left-handed?
## Q6, Q14

---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [6]:
# library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, \
accuracy_score, RocCurveDisplay, roc_auc_score, recall_score, \
precision_score, f1_score, classification_report

In [7]:
# import data (the file is tab-separated despite its .csv extension)
hand_df = pd.read_csv('data.csv', sep='\t')
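
The effect of the `sep` argument can be seen on a tiny in-memory sample. This is only a sketch: the three-line string below is a made-up stand-in for `data.csv`, not the real file.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for data.csv
raw = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n"

# With the default comma separator, each row lands in a single wide column
wrong = pd.read_csv(io.StringIO(raw))

# Passing sep='\t' splits the columns as intended
right = pd.read_csv(io.StringIO(raw), sep="\t")

print(wrong.shape)
print(right.shape)
```

A one-column result from a plain `pd.read_csv()` call is usually the first sign that the delimiter is not a comma.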

---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

In [8]:
# The frequency of left-handers is reported to be strongly and
# positively correlated with the rate of homicides.
# source: www.ncbi.nlm.nih.gov/pmc/articles/PMC1634940

In [9]:
# A left-hand preference has been linked with increased mathematical
# skill, which, according to the source, makes left-handed
# people more likely to be geniuses.
# source: https://www.harpersbazaar.com/uk/culture/culture-news/news/a42209/left-handed-people-more-likely-to-be-genius/

In [10]:
# More specifically, Orme (1970) found that left-handers reported
# themselves to be more introverted and shy than right-handers.
# source: www.ncbi.nlm.nih.gov/pmc/articles/PMC4315011/

### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

In [11]:
# display 5 rows
hand_df.head()

,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


In [12]:
# check types
hand_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4184 entries, 0 to 4183
Data columns (total 56 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Q1           4184 non-null   int64 
 1   Q2           4184 non-null   int64 
 2   Q3           4184 non-null   int64 
 3   Q4           4184 non-null   int64 
 4   Q5           4184 non-null   int64 
 5   Q6           4184 non-null   int64 
 6   Q7           4184 non-null   int64 
 7   Q8           4184 non-null   int64 
 8   Q9           4184 non-null   int64 
 9   Q10          4184 non-null   int64 
 10  Q11          4184 non-null   int64 
 11  Q12          4184 non-null   int64 
 12  Q13          4184 non-null   int64 
 13  Q14          4184 non-null   int64 
 14  Q15          4184 non-null   int64 
 15  Q16          4184 non-null   int64 
 16  Q17          4184 non-null   int64 
 17  Q18          4184 non-null   int64 
 18  Q19          4184 non-null   int64 
 19  Q20          4184 non-null 

In [13]:
# check missing values
hand_df.isna().sum().sort_values(ascending=False).head()

Q1     0
Q2     0
Q31    0
Q32    0
Q33    0
dtype: int64

In [14]:
# check values in hand column
hand_df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [15]:
# check whether other columns also contain 0 values
hand_df['Q1'].value_counts()

1    2521
3     494
4     461
2     385
5     319
0       4
Name: Q1, dtype: int64

In [16]:
# drop every row that contains a 0 in any column
hand_df = hand_df[hand_df != 0].dropna()

# Why drop zero values?
# A: They are a small % of values, and we're not certain what 0 means,
# but we think it's likely a stand-in for missing values

# What do we do about values of 3? Ambidextrous folks.
# Students say drop.
# We're unable to tell which hand is dominant.
# Also, they are a larger % of 'hand' values, but not overwhelming in terms of
# row or information loss

# Options: exclude them, leave ambidextrous folks in as a category of their own,
# or combine them with another category (e.g. use of left hand vs. not)
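
To see what the mask-then-`dropna()` pattern does, here is a minimal sketch on a toy frame. The values are hypothetical and chosen only so that two of the three rows contain a 0.

```python
import pandas as pd

# Toy frame (hypothetical values): 0 plays the role of a missing answer
toy = pd.DataFrame({"Q1": [4, 0, 2], "hand": [1, 2, 0]})

# Boolean-mask indexing keeps the shape and turns every 0 into NaN
masked = toy[toy != 0]

# dropna() then removes any row that contains at least one NaN
cleaned = masked.dropna()

print(cleaned)
print(cleaned.dtypes)
```

Two side effects are worth keeping in mind: the mask applies to every column, so a 0 in a demographic column also removes the row, and introducing NaN converts integer columns to `float64`, which is why `hand` later shows up as `1.0` and `2.0`.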

In [17]:
# check info after dropping rows with 0 values
hand_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3481 entries, 0 to 4183
Data columns (total 56 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Q1           3481 non-null   float64
 1   Q2           3481 non-null   float64
 2   Q3           3481 non-null   float64
 3   Q4           3481 non-null   float64
 4   Q5           3481 non-null   float64
 5   Q6           3481 non-null   float64
 6   Q7           3481 non-null   float64
 7   Q8           3481 non-null   float64
 8   Q9           3481 non-null   float64
 9   Q10          3481 non-null   float64
 10  Q11          3481 non-null   float64
 11  Q12          3481 non-null   float64
 12  Q13          3481 non-null   float64
 13  Q14          3481 non-null   float64
 14  Q15          3481 non-null   float64
 15  Q16          3481 non-null   float64
 16  Q17          3481 non-null   float64
 17  Q18          3481 non-null   float64
 18  Q19          3481 non-null   float64
 19  Q20   

In [18]:
# class distribution (%) before dropping ambidextrous respondents
hand_df['hand'].value_counts(normalize=True)*100

1.0    85.291583
2.0    10.456765
3.0     4.251652
Name: hand, dtype: float64

In [19]:
# drop ambidextrous respondents (hand == 3)
hand_df = hand_df[hand_df['hand'] != 3]

### Calculate and interpret the baseline accuracy rate:

In [20]:
# baseline score
hand_df['hand'].value_counts(normalize=True)*100

1.0    89.078908
2.0    10.921092
Name: hand, dtype: float64

In [21]:
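# Baseline accuracy is about 89%: always predicting the majority class
# (hand == 1) would be correct for 89.08% of respondents.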
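
The baseline can also be computed by hand from the class counts. The sketch below uses the counts that the EDA tables in this lab report after cleaning (2969 and 364); the variable names are illustrative only.

```python
# Class counts reported in this lab after cleaning
right_handed = 2969  # hand == 1
left_handed = 364    # hand == 2

# Baseline accuracy: share of the majority class
baseline = max(right_handed, left_handed) / (right_handed + left_handed)
print(f"baseline accuracy: {baseline:.4f}")
```

Any model worth keeping should beat this number on held-out data, or do better on a metric that is sensitive to the minority class.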

### EDA

Violent

In [22]:
# violence-related item Q3: count and mean response by handedness
hand_df.groupby('hand')['Q3'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,2.846413
2.0,364,2.752747


In [23]:
# violence-related item Q15: count and mean response by handedness
hand_df.groupby('hand')['Q15'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,3.191984
2.0,364,3.07967


In [24]:
# violence-related item Q16: count and mean response by handedness
hand_df.groupby('hand')['Q16'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,2.952846
2.0,364,2.909341


In [25]:
# violence-related item Q23: count and mean response by handedness
hand_df.groupby('hand')['Q23'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,3.917144
2.0,364,3.714286


In [26]:
# violence-related item Q27: count and mean response by handedness
hand_df.groupby('hand')['Q27'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,2.731223
2.0,364,2.645604


In [27]:
# violence-related item Q39: count and mean response by handedness
hand_df.groupby('hand')['Q39'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,2.585382
2.0,364,2.472527


Smart

In [28]:
# intelligence-related item Q1: count and mean response by handedness
hand_df.groupby('hand')['Q1'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,1.961266
2.0,364,1.925824


In [29]:
# intelligence-related item Q11: count and mean response by handedness
hand_df.groupby('hand')['Q11'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,2.744022
2.0,364,2.760989


In [30]:
# intelligence-related item Q13: count and mean response by handedness
hand_df.groupby('hand')['Q13'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,2.674638
2.0,364,2.706044


In [31]:
# intelligence-related item Q35: count and mean response by handedness
hand_df.groupby('hand')['Q35'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,3.067363
2.0,364,3.035714


Introvert

In [32]:
# introversion-related item Q6: count and mean response by handedness
hand_df.groupby('hand')['Q6'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,3.707309
2.0,364,3.662088


In [33]:
# introversion-related item Q14: count and mean response by handedness
hand_df.groupby('hand')['Q14'].agg(['count', 'mean'])

hand,count,mean
1.0,2969,3.393062
2.0,364,3.214286


### Short answer questions:

In this lab you'll use K-nearest neighbors and logistic regression to model handedness based off of psychological factors. Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

In [34]:
# regression problems predict continuous values
# classification problems predict categorical values

In [35]:
# Classification - discrete target values (categories),
# sometimes binary, sometimes multiclass

# Regression - continuous target values

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

In [36]:
# k is the number of neighbors KNN uses to decide which class label
# a data point belongs to

# if k is low, the model may overfit (low bias, high variance)
# if k is high, the model may underfit (high bias, low variance)

# smaller k --> may lead to overfitting
# optimal k --> 'the sweet spot'
# larger k --> may lead to underfitting
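
The pattern can be observed on synthetic data. The sketch below uses `make_classification` rather than the handedness survey, and the dataset size, feature count, and k values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, not the handedness survey
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

results = {}
for k in [1, 5, 25, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # store (train accuracy, test accuracy) for each k
    results[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
    print(f"k={k:>3}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")
```

With k = 1 every training point is its own nearest neighbor, so the training score is perfect while the test score is typically lower; that gap is the high-variance end of the tradeoff. As k grows the two scores tend to converge, and for very large k both can drift toward the majority-class rate.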

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

In [37]:
# to put all values on the same scale

# kNN is a distance-based algorithm, so the scale/magnitude of features
# impacts performance and output

# Standardization (StandardScaler) -->
# rescales each feature column to have a mean of zero and a std of 1
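
A small numeric sketch shows why scale matters for distances. The two features (age in years, income in dollars) and their values are hypothetical, and the standardization is done by hand with NumPy using the same formula `StandardScaler` applies (column mean and population standard deviation).

```python
import numpy as np

# Hypothetical features on very different scales: age (years), income (dollars)
X = np.array([[25.0, 50_000.0],
              [60.0, 51_000.0],
              [26.0, 90_000.0]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw distances are dominated by the income column
raw_01 = dist(X[0], X[1])  # large age gap, small income gap
raw_02 = dist(X[0], X[2])  # small age gap, large income gap

# Standardize each column: subtract its mean, divide by its std
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_01 = dist(Z[0], Z[1])
std_02 = dist(Z[0], Z[2])

print(f"raw:          d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"standardized: d(0,1)={std_01:.3f}  d(0,2)={std_02:.3f}")
```

On the raw data the 35-year age gap barely registers next to the income differences; after standardizing, both features contribute on a comparable scale.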

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

In [38]:
# for this problem the answer is no,
# because all variables are on the same scale

# Possibly not this time, as all the features we will use
# range from 1-5. They are already on the same or a very similar scale.

#### How do we settle on $k$ for a $k$-nearest neighbors model?

In [39]:
# knn = KNeighborsClassifier(n_neighbors=k)  # set k here, e.g. 1, 5, 10

# Options

# 1. Use the default (n_neighbors=5)
# 2. Guess
# 3. Loop/iterate through a range of k options and select the
# optimal one based on its score on held-out data.
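
Option 3 can be sketched with cross-validation so that the choice of k does not depend on a single split. The data here is synthetic (`make_classification`), and the range of odd k values and the 5-fold setting are assumptions for illustration, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, not the handedness survey
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

cv_means = {}
for k in range(1, 32, 2):  # odd k values from 1 to 31
    knn = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy across 5 cross-validation folds
    cv_means[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)
print(f"best k: {best_k}, mean CV accuracy: {cv_means[best_k]:.3f}")
```

Scoring on held-out folds rather than on the training set matters here, because training accuracy alone always favors very small k.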

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

In [40]:
# L2 or Ridge regularization.

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:

In [41]:
# a lower C strengthens the regularization (a larger penalty, lambda)

# Argument C is the inverse of the regularization strength alpha

# C = 1/alpha
# if alpha = 1, what is C? 1
# if alpha = 0.1, what is C? 10

# What does a higher C mean? Less regularization
# What does a lower C mean? More regularization
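
The conversion is small enough to write down as a helper. `c_from_alpha` is a hypothetical name used only for this sketch; it is not a scikit-learn function.

```python
def c_from_alpha(alpha):
    """Convert a regularization strength alpha into scikit-learn's C (C = 1/alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha

for alpha in [0.1, 1, 10]:
    print(f"alpha = {alpha:<4} -> C = {c_from_alpha(alpha)}")
```

The returned value is what would be passed as `C=` to `LogisticRegression`, so a model with alpha = 10 would use `C=0.1`.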

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

In [42]:
# regularization reduces the model's variance, but it adds bias and
# can cost some predictive power.

# Why? To avoid error due to high variance (overfitting)
# When? High model complexity is leading to overfitting.

# C is high --> less regularization --> more prone to overfitting
# C is low --> more regularization --> less prone to overfitting and, if
# overdone, we could end up underfitting
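
One way to watch regularization simplify a model is to count how many coefficients an L1-penalized logistic regression keeps as C shrinks. This is a sketch on synthetic data; the C values and dataset settings are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with only a few informative features
X, y = make_classification(n_samples=400, n_features=12, n_informative=3,
                           random_state=42)

nonzero = {}
for C in [100, 1, 0.01]:
    lasso = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    lasso.fit(X, y)
    # count coefficients that the L1 penalty has not shrunk to exactly zero
    nonzero[C] = int(np.sum(lasso.coef_ != 0))
    print(f"C={C:<6} non-zero coefficients: {nonzero[C]}")
```

Fewer non-zero coefficients means a simpler model: lower variance, higher bias. If C is pushed low enough, every coefficient can reach zero and the model predicts the same class for everyone.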

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

In [43]:
# Logistic regression has coefficients that we can interpret:
# a one-unit increase in x changes the log-odds of success (vs. failure) by the coefficient

# Interpretable
# LogReg has coefficients (statistical parameters), which are sometimes
# referred to as betas. kNN is non-parametric and has no coefficients
# or statistical parameters.

# Explainable to a non-technical individual?
# Conceptually, kNN isn't too hard to explain to a wide audience

# Explaining relationships in a meaningful and actionable way?
# kNN fails and LogReg is better
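
Because the coefficient lives on the log-odds scale, exponentiating it gives an odds ratio, which is usually easier to communicate. The coefficient value below is hypothetical and does not come from any model in this lab.

```python
import math

# Hypothetical fitted coefficient: change in log-odds per one-unit increase in x
beta = 0.5

# Exponentiating converts the log-odds change into a multiplicative change in odds
odds_ratio = math.exp(beta)
print(f"odds ratio: {odds_ratio:.3f}")
```

Under this assumed coefficient, each one-unit increase in x multiplies the odds of the positive class by about 1.65, holding the other predictors fixed.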

---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your explanatory variables should be 

In [44]:
# create X and y
X = hand_df[['Q1', 'Q3', 'Q6', 'Q11', 'Q13', 'Q14', 'Q15',
             'Q16', 'Q23', 'Q27', 'Q35', 'Q39']]
y = hand_df['hand']
# recode the target: 0 = right-handed (hand == 1), 1 = left-handed (hand == 2)
y = y - 1

In [45]:
# create train test split and set stratify
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)

In [46]:
# check proportion of values in the train target
y_train.value_counts(normalize=True)

0.0    0.890848
1.0    0.109152
Name: hand, dtype: float64

In [47]:
# check proportion of values in the test target
y_test.value_counts(normalize=True)

0.0    0.890555
1.0    0.109445
Name: hand, dtype: float64

#### Create and fit four separate $k$-nearest neighbors models: one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$:

In [48]:
# instantiate model with k = 3, fit it, and score it on the training set
knn_3 = KNeighborsClassifier(n_neighbors = 3)
knn_3.fit(X_train, y_train)
knn_3.score(X_train, y_train)

0.9066016504126031

In [49]:
# instantiate model with k = 5, fit it, and score it on the training set
knn_5 = KNeighborsClassifier(n_neighbors = 5)
knn_5.fit(X_train, y_train)
knn_5.score(X_train, y_train)

0.8938484621155289

In [50]:
# instantiate model with k = 15, fit it, and score it on the training set
knn_15 = KNeighborsClassifier(n_neighbors = 15)
knn_15.fit(X_train, y_train)
knn_15.score(X_train, y_train)

0.890847711927982

In [51]:
# instantiate model with k = 25, fit it, and score it on the training set
knn_25 = KNeighborsClassifier(n_neighbors = 25)
knn_25.fit(X_train, y_train)
knn_25.score(X_train, y_train)

0.890847711927982

In [68]:
k = [3, 5, 15, 25]
for i in k:
    knn = KNeighborsClassifier(n_neighbors = i)
    knn.fit(X_train, y_train)
    print(f'k = {i} {knn.score(X_train, y_train)}')

k = 3 0.9066016504126031
k = 5 0.8938484621155289
k = 15 0.890847711927982
k = 25 0.890847711927982


In [88]:
k_scores = pd.DataFrame(columns=['k', 'train_score', 'test_score', 'preds_one'])

for n, k in enumerate([3, 5, 15, 25]):
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    y_preds = knn.predict(X_test)
    preds_one = y_preds == 1
    train_score = knn.score(X_train, y_train)
    test_score = knn.score(X_test, y_test)
    k_scores.loc[n] = [k, train_score, test_score, pd.Series(preds_one).sum()]

In [89]:
k_scores

,k,train_score,test_score,preds_one
0,3.0,0.906602,0.878561,16.0
1,5.0,0.893848,0.886057,3.0
2,15.0,0.890848,0.890555,0.0
3,25.0,0.890848,0.890555,0.0


### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

In [52]:
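# test k = 3
knn_3.score(X_test, y_test)

# k = 3 is slightly overfit: train accuracy (0.907) is higher than test
# accuracy (0.879), which falls below the 89% baseline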

0.8785607196401799

In [53]:
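# test k = 5
knn_5.score(X_test, y_test)

# k = 5 is mildly overfit: train accuracy (0.894) is higher than test
# accuracy (0.886), which is still below the 89% baseline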

0.8860569715142429

In [54]:
# test k = 15
knn_15.score(X_test, y_test)

# k = 15 performs the same as the baseline

0.8905547226386806

In [55]:
# test k = 25
knn_25.score(X_test, y_test)

# k = 25 performs the same as the baseline

0.8905547226386806

In [90]:
# Confusion matrix of knn with k = 3
confusion_matrix(y_test, knn_3.predict(X_test))
# correctly identifies only 4 of the 73 left-handed respondents

array([[582,  12],
       [ 69,   4]], dtype=int64)
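
Accuracy hides how poorly the minority class is handled, so it helps to derive recall and precision from the same matrix. The sketch below hard-codes the `knn_3` confusion matrix shown above, using scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix reported for knn_3 (rows: actual, columns: predicted)
cm = np.array([[582, 12],
               [69, 4]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)      # share of left-handed respondents that were found
precision = tp / (tp + fp)   # share of left-handed predictions that were right

print(f"accuracy:  {accuracy:.3f}")
print(f"recall:    {recall:.3f}")
print(f"precision: {precision:.3f}")
```

An accuracy near 88% sits alongside a recall of roughly 5%, which is why accuracy alone is a misleading yardstick for this problem.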

In [91]:
# Underfit: the test scores for higher k equal the baseline accuracy.
# The models look good on the majority class and bad on the minority class.

# The challenge here is not so much the algorithm or the value of k
# but the imbalanced class split, along with a likely tenuous
# relationship between X and y

In [None]:
# Options

# Changing k while keeping the roughly 90/10 class split
# made no significant difference in the models' predictive accuracy,
# so the class imbalance itself needs addressing:

# Find more left-handed people to survey
# Oversample the minority class
# Undersample the majority class
# A bit of both, or SMOTE --> Synthetic Minority Over-sampling Technique:
# algorithmically generate new synthetic minority-class cases so that
# the dataset becomes more balanced
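
Random oversampling of the minority class can be sketched with pandas alone. The toy training frame below is hypothetical (8 right-handed rows, 2 left-handed rows); SMOTE itself is not shown, since it would need an additional library.

```python
import pandas as pd

# Toy training set (hypothetical): 0 = right-handed, 1 = left-handed
train = pd.DataFrame({"Q1":   [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                      "hand": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

majority = train[train["hand"] == 0]
minority = train[train["hand"] == 1]

# Sample minority rows with replacement until both classes are the same size
minority_upsampled = minority.sample(n=len(majority), replace=True,
                                     random_state=42)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["hand"].value_counts())
```

Resampling should be applied to the training split only; the test set should keep its original class proportions so that the evaluation reflects the real population.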

---

## Step 4 & 5 Modeling: logistic regression

#### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as above.

In [57]:
# logistic reg, use lasso alpha = 1 (C defaults to 1.0, i.e. 1/alpha)
logreg_la_1 = LogisticRegression(penalty='l1', solver='liblinear')
logreg_la_1.fit(X_train, y_train)
logreg_la_1.score(X_train, y_train)

0.890847711927982

In [58]:
# logistic reg, use lasso alpha = 10
logreg_la_10 = LogisticRegression(penalty='l1', C=1/10, solver='liblinear')
logreg_la_10.fit(X_train, y_train)
logreg_la_10.score(X_train, y_train)

0.890847711927982

In [59]:
# logistic reg, use ridge alpha = 1 (C defaults to 1.0, i.e. 1/alpha)
logreg_ri_1 = LogisticRegression(penalty='l2', solver='liblinear')
logreg_ri_1.fit(X_train, y_train)
logreg_ri_1.score(X_train, y_train)

0.890847711927982

In [60]:
# logistic reg, use ridge alpha = 10
logreg_ri_10 = LogisticRegression(penalty='l2', C=1/10, solver='liblinear')
logreg_ri_10.fit(X_train, y_train)
logreg_ri_10.score(X_train, y_train)

0.890847711927982

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

In [61]:
# test reg l1 alpha = 1
logreg_la_1.score(X_test, y_test)

0.8905547226386806

In [62]:
# test reg l1 alpha = 10
logreg_la_10.score(X_test, y_test)

0.8905547226386806

In [63]:
# test reg l2 alpha = 1
logreg_ri_1.score(X_test, y_test)

0.8905547226386806

In [64]:
# test reg l2 alpha = 10
logreg_ri_10.score(X_test, y_test)

0.8905547226386806

In [65]:
# Confusion matrix
confusion_matrix(y_test, logreg_ri_1.predict(X_test))
# the model predicts right-handed for every respondent, so it has no true positives

array([[594,   0],
       [ 73,   0]], dtype=int64)

---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? What are the "best" models?

In [66]:
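# Every model has high accuracy, but a closer look shows that the models
# cannot reliably identify left-handed respondents, not even logistic regression.
# This probably stems from the imbalance between right- and left-handed
# respondents in the data, so none of the models is worth moving forward with as is.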