## Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

- Does personality type correlate with handedness?
- Is there a correlation between handedness and gender identity?
- How do gender and handedness interact to influence personality traits?

---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!
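One quick way to see why a plain `pd.read_csv()` call fails is to inspect the header line and count candidate separators. The sketch below is only an illustration: the `sample` string is a made-up stand-in for the first lines of `data.csv`, and `guess_delimiter` is a hypothetical helper, not part of pandas.

```python
# Made-up sample standing in for the first lines of data.csv (an assumption).
sample = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n"


def guess_delimiter(text, candidates=("\t", ",", ";")):
    """Return the candidate separator that occurs most often in the header line."""
    header = text.splitlines()[0]
    return max(candidates, key=header.count)


delimiter = guess_delimiter(sample)
print(repr(delimiter))
```

The detected separator can then be passed to `pd.read_csv()` through its `sep` (or `delimiter`) argument.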

In [4]:
# library imports
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import r2_score, root_mean_squared_error

In [5]:
quiz_df = pd.read_csv('data.csv', delimiter='\t')

In [6]:
# review sample rows
quiz_df.head()

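| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | ... | country | fromgoogle | engnat | age | education | gender | orientation | race | religion | hand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 5 | 1 | 5 | 1 | 5 | 1 | 4 | 1 | ... | US | 2 | 1 | 22 | 3 | 1 | 1 | 3 | 2 | 3 |
| 1 | 1 | 5 | 1 | 4 | 2 | 5 | 5 | 4 | 1 | 5 | ... | CA | 2 | 1 | 14 | 1 | 2 | 2 | 6 | 1 | 1 |
| 2 | 1 | 2 | 1 | 1 | 5 | 4 | 3 | 2 | 1 | 4 | ... | NL | 2 | 2 | 30 | 4 | 1 | 1 | 1 | 1 | 2 |
| 3 | 1 | 4 | 1 | 5 | 1 | 4 | 5 | 4 | 3 | 5 | ... | US | 2 | 1 | 18 | 2 | 2 | 5 | 3 | 2 | 2 |
| 4 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 1 | 3 | 1 | ... | US | 2 | 1 | 22 | 3 | 1 | 1 | 3 | 2 | 3 |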


In [7]:
quiz_df.dtypes

Q1              int64
Q2              int64
Q3              int64
Q4              int64
Q5              int64
Q6              int64
Q7              int64
Q8              int64
Q9              int64
Q10             int64
Q11             int64
Q12             int64
Q13             int64
Q14             int64
Q15             int64
Q16             int64
Q17             int64
Q18             int64
Q19             int64
Q20             int64
Q21             int64
Q22             int64
Q23             int64
Q24             int64
Q25             int64
Q26             int64
Q27             int64
Q28             int64
Q29             int64
Q30             int64
Q31             int64
Q32             int64
Q33             int64
Q34             int64
Q35             int64
Q36             int64
Q37             int64
Q38             int64
Q39             int64
Q40             int64
Q41             int64
Q42             int64
Q43             int64
Q44             int64
introelapse     int64
testelapse

In [8]:
# check for null values
quiz_df.isnull().sum()

Q1             0
Q2             0
Q3             0
Q4             0
Q5             0
Q6             0
Q7             0
Q8             0
Q9             0
Q10            0
Q11            0
Q12            0
Q13            0
Q14            0
Q15            0
Q16            0
Q17            0
Q18            0
Q19            0
Q20            0
Q21            0
Q22            0
Q23            0
Q24            0
Q25            0
Q26            0
Q27            0
Q28            0
Q29            0
Q30            0
Q31            0
Q32            0
Q33            0
Q34            0
Q35            0
Q36            0
Q37            0
Q38            0
Q39            0
Q40            0
Q41            0
Q42            0
Q43            0
Q44            0
introelapse    0
testelapse     0
country        0
fromgoogle     0
engnat         0
age            0
education      0
gender         0
orientation    0
race           0
religion       0
hand           0
dtype: int64

In [9]:
# review summary statistics
quiz_df.describe()

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | ... | testelapse | fromgoogle | engnat | age | education | gender | orientation | race | religion | hand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | ... | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 | 4184.0 |
| mean | 1.962715 | 3.829589 | 2.846558 | 3.186902 | 2.86544 | 3.672084 | 3.216539 | 3.184512 | 2.761233 | 3.522945 | ... | 479.994503 | 1.576243 | 1.239962 | 30.370698 | 2.317878 | 1.654398 | 1.833413 | 5.013623 | 2.394359 | 1.190966 |
| std | 1.360291 | 1.551683 | 1.664804 | 1.476879 | 1.545798 | 1.342238 | 1.490733 | 1.387382 | 1.511805 | 1.24289 | ... | 3142.178542 | 0.494212 | 0.440882 | 367.201726 | 0.874264 | 0.640915 | 1.303454 | 1.970996 | 2.184164 | 0.495357 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 7.0 | 1.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 3.0 | 1.0 | 2.0 | 1.0 | 3.0 | 2.0 | 2.0 | 1.0 | 3.0 | ... | 186.0 | 1.0 | 1.0 | 18.0 | 2.0 | 1.0 | 1.0 | 5.0 | 1.0 | 1.0 |
| 50% | 1.0 | 5.0 | 3.0 | 3.0 | 3.0 | 4.0 | 3.0 | 3.0 | 3.0 | 4.0 | ... | 242.0 | 2.0 | 1.0 | 21.0 | 2.0 | 2.0 | 1.0 | 6.0 | 2.0 | 1.0 |
| 75% | 3.0 | 5.0 | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | ... | 324.25 | 2.0 | 1.0 | 27.0 | 3.0 | 2.0 | 2.0 | 6.0 | 2.0 | 1.0 |
| max | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 119834.0 | 2.0 | 2.0 | 23763.0 | 4.0 | 3.0 | 5.0 | 7.0 | 7.0 | 3.0 |
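The summary statistics report a minimum of 0 for the question columns even though `isnull()` found no nulls, and 0 lies outside the 1-5 answer range. A minimal sketch of how such out-of-range codes could be counted is shown here; the `toy` frame and its values are made up for illustration and are not the lab data.

```python
import pandas as pd

# Toy frame standing in for quiz_df; the values are invented for illustration.
toy = pd.DataFrame({"Q1": [4, 0, 1, 5], "Q2": [1, 5, 0, 0], "hand": [1, 2, 0, 1]})

# isnull() reports nothing missing here...
null_counts = toy.isnull().sum()

# ...yet several entries are coded 0, which is outside the expected range.
zero_counts = (toy == 0).sum()

# Number of rows that contain at least one 0 in any column.
rows_with_zero = (toy == 0).any(axis=1).sum()

print(zero_counts)
print(rows_with_zero)
```

Whether a 0 should be treated as a skipped answer is a question for the codebook; the sketch only shows how to find the affected rows.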


---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

> A comparison of the personality characteristics of 34 left-handed and 148 right-handed students (aged 17–22 yrs) suggested that handedness may be related to extraversion for females: Left-handed females were less extraverted than right-handed females.

[Reference](https://www.researchgate.net/publication/232519132_The_relationship_between_handedness_and_personality_traits_Extraversion_and_neuroticism)

### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

In [13]:
# drop rows where any of Q1-Q44 is 0 (outside the 1-5 answer range)
quiz_df = quiz_df[~(quiz_df.loc[:, 'Q1':'Q44'] == 0).any(axis=1)]
quiz_df.shape

(3803, 56)

In [14]:
# drop rows where hand is 0
quiz_df = quiz_df[quiz_df['hand'] != 0]
quiz_df.shape

(3794, 56)

In [15]:
quiz_df.describe()

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | ... | testelapse | fromgoogle | engnat | age | education | gender | orientation | race | religion | hand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | ... | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 | 3794.0 |
| mean | 1.971798 | 3.838166 | 2.844228 | 3.206115 | 2.885345 | 3.68213 | 3.222457 | 3.196363 | 2.768318 | 3.51845 | ... | 446.395888 | 1.577227 | 1.243279 | 24.629678 | 2.32077 | 1.654454 | 1.835793 | 5.009752 | 2.38719 | 1.1903 |
| std | 1.365267 | 1.540441 | 1.661345 | 1.464837 | 1.53997 | 1.330714 | 1.481074 | 1.380508 | 1.500097 | 1.226434 | ... | 2574.865462 | 0.494065 | 0.441832 | 12.448414 | 0.87479 | 0.640901 | 1.303092 | 1.971096 | 2.182681 | 0.489974 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 41.0 | 1.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 1.0 | 3.0 | 1.0 | 2.0 | 1.0 | 3.0 | 2.0 | 2.0 | 1.0 | 3.0 | ... | 187.0 | 1.0 | 1.0 | 18.0 | 2.0 | 1.0 | 1.0 | 5.25 | 1.0 | 1.0 |
| 50% | 1.0 | 5.0 | 3.0 | 3.0 | 3.0 | 4.0 | 3.0 | 3.0 | 3.0 | 4.0 | ... | 243.0 | 2.0 | 1.0 | 21.0 | 2.0 | 2.0 | 1.0 | 6.0 | 2.0 | 1.0 |
| 75% | 3.0 | 5.0 | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | ... | 325.0 | 2.0 | 1.0 | 27.0 | 3.0 | 2.0 | 2.0 | 6.0 | 2.0 | 1.0 |
| max | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 93888.0 | 2.0 | 2.0 | 409.0 | 4.0 | 3.0 | 5.0 | 7.0 | 7.0 | 3.0 |


### Calculate and interpret the baseline accuracy rate:

In [17]:
quiz_df['hand'].value_counts(normalize=True).mul(100).round(2)

hand
1    85.27
2    10.44
3     4.30
Name: proportion, dtype: float64
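The baseline accuracy is the share of the most common class: a model that ignores the features and always predicts `hand` = 1 would be right about 85.27% of the time, so any useful model has to beat that figure. The sketch below computes the same quantity on made-up labels chosen to give a similar class balance; the `labels` list is an assumption, not the lab data.

```python
from collections import Counter

# Made-up labels with a class balance similar to the output above.
labels = [1] * 17 + [2] * 2 + [3] * 1

counts = Counter(labels)

# The majority class and how often it occurs.
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the majority class.
baseline_accuracy = majority_count / len(labels)
print(majority_class, baseline_accuracy)
```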

### Short answer questions:

In this lab, you'll use K-nearest neighbors and logistic regression to model handedness based on psychological factors. 

Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

###### Regression
 - Predicts a continuous numerical value

###### Classification
 - Predicts a categorical value (a class label)

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

###### As $k$ increases
- Bias increases
- Variance decreases

###### As $k$ decreases
- Bias decreases
- Variance increases
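The tradeoff can be seen by comparing training and testing accuracy across several values of $k$. This is a rough sketch on synthetic data from `make_classification`, not the quiz data, and the particular values of $k$ are arbitrary. With $k = 1$ every training point is its own nearest neighbor, so the training accuracy is perfect (low bias, high variance); larger $k$ smooths the decision boundary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the lab dataset).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# Record (training accuracy, testing accuracy) for each k.
scores = {}
for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
    print(f"k={k}: train={scores[k][0]:.3f}, test={scores[k][1]:.3f}")
```

A large gap between the two scores points to high variance; two low, similar scores point to high bias.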

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

Standardization ensures that all features contribute equally to the distance calculation by transforming them to a common scale.
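To illustrate, the sketch below measures how much of the average squared Euclidean distance comes from each of two features before and after standardization. The `age` and `income` features are hypothetical and randomly generated; they are not part of this lab's modeling features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales (randomly generated).
age = rng.uniform(18, 70, size=200)
income = rng.uniform(20_000, 120_000, size=200)
X = np.column_stack([age, income])


def distance_share(data):
    """Fraction of the mean squared pairwise distance owed to each column."""
    diffs = data[:, None, :] - data[None, :, :]
    per_feature = (diffs ** 2).mean(axis=(0, 1))
    return per_feature / per_feature.sum()


# Before scaling, the large-valued column dominates the distance.
raw_share = distance_share(X)

# After standardizing to zero mean and unit variance, both contribute equally.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_share = distance_share(X_std)

print(raw_share, std_share)
```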

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

No. Scaling is not required here because all explanatory variables (Q1-Q44) share the same 1-5 range.

#### How do we settle on $k$ for a $k$-nearest neighbors model?

- Create a grid of candidate $k$ values.
- Train a model for each candidate (and each combination of any other hyperparameters) and evaluate it with cross-validation.
- Select the value of $k$ with the best cross-validated performance.
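The steps above can be sketched with scikit-learn's `GridSearchCV`. The data here is synthetic (from `make_classification`), and the grid reuses the four $k$ values from this lab purely as an example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the lab dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Candidate values of k to compare.
k_grid = [3, 5, 15, 25]

# 5-fold cross-validated accuracy for every candidate k.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": k_grid},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
print(best_k, round(search.best_score_, 3))
```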

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

- L2 (Ridge)

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:

`C` is the inverse of regularization strength: smaller values of `C` specify stronger regularization.
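A small sketch of this inverse relationship, assuming a regularization strength $\alpha$ is mapped to `C = 1 / alpha`: as $\alpha$ grows, `C` shrinks and the fitted coefficients are pulled toward zero. The data is synthetic and the $\alpha$ values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the lab dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Fit with the default L2 penalty at several strengths and record the coefficient norm.
coef_norms = {}
for alpha in (0.01, 1, 100):
    model = LogisticRegression(C=1 / alpha, max_iter=1000)
    model.fit(X, y)
    coef_norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha}, C={1 / alpha}: ||coef|| = {coef_norms[alpha]:.4f}")
```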

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

##### Low Regularization Strength
- High Variance, Low Bias
##### High Regularization Strength
- High Bias, Low Variance

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

- Logistic regression models provide coefficients for each feature
- Logistic regression provides a global view of how features contribute to the prediction
- Logistic regression can provide some local explanations for individual predictions
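As an illustration of the first point, each coefficient is the change in log-odds for a one-unit increase in its feature with the others held fixed, and exponentiating it gives an odds ratio. The sketch below uses synthetic binary data; the feature names are placeholders and do not correspond to the real quiz questions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data with placeholder feature names (assumptions).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["Q1", "Q2", "Q3", "Q4"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) is the multiplicative change in the odds per one-unit increase.
odds_ratios = dict(zip(feature_names, np.exp(model.coef_[0])))
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

A fitted $k$-nearest neighbors model has no comparable per-feature summary to read off.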

---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your features should be the answers to Q1-Q44:

In [37]:
features = [f'Q{i}' for i in range(1, 45)]
X = quiz_df[features]
y = quiz_df['hand']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.8, test_size=0.2, random_state=42)
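
With classes this imbalanced, `stratify=y` keeps the class proportions about the same in the training and testing sets. A minimal check of that behavior, using made-up labels with a similar imbalance rather than the lab data:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up imbalanced labels and a dummy single-column feature matrix.
y = [1] * 850 + [2] * 100 + [3] * 50
X = [[i] for i in range(len(y))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Share of each class in the two splits.
train_share = {c: n / len(y_tr) for c, n in Counter(y_tr).items()}
test_share = {c: n / len(y_te) for c, n in Counter(y_te).items()}
print(train_share)
print(test_share)
```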

#### Create and fit four separate $k$-nearest neighbors models: 
- one with $k = 3$
- one with $k = 5$
- one with $k = 15$
- one with $k = 25$:

In [39]:
# create a KNN score test function
def knn_test(n, X_train, y_train, X_test, y_test):
    # initialize and fit the model
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)

    # report the 5-fold cross-validation, training, and testing accuracy
    print(f"Cross validation score for KNN k={n} is {cross_val_score(knn, X_train, y_train, cv=5).mean():.8f}")
    print(f"Score of training for KNN k={n} is {knn.score(X_train, y_train):.8f}")
    print(f"Score of testing for KNN k={n} is {knn.score(X_test, y_test):.8f}")
    print()

In [40]:
# K=3
knn_test(3, X_train, y_train, X_test, y_test)

Cross validation score for KNN k=3 is 0.82075783
Score of training for KNN k=3 is 0.86952224
Score of testing for KNN k=3 is 0.82608696



In [41]:
# K=5
knn_test(5, X_train, y_train, X_test, y_test)

Cross validation score for KNN k=5 is 0.83855025
Score of training for KNN k=5 is 0.85304778
Score of testing for KNN k=5 is 0.84453228



In [42]:
# K=15
knn_test(15, X_train, y_train, X_test, y_test)

Cross validation score for KNN k=15 is 0.85271829
Score of training for KNN k=15 is 0.85271829
Score of testing for KNN k=15 is 0.85243742



In [43]:
# K=25
knn_test(25, X_train, y_train, X_test, y_test)

Cross validation score for KNN k=25 is 0.85271829
Score of training for KNN k=25 is 0.85271829
Score of testing for KNN k=25 is 0.85243742



### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. 

Are any of your models overfit or underfit? 

Do any of your models beat the baseline accuracy rate?

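- With $k = 3$, the training score (0.870) exceeds the testing score (0.826) by about four points, which indicates some overfitting. With $k = 5$, the gap shrinks to under one point.

- With $k = 15$ and $k = 25$, the cross-validation, training, and testing scores all sit at about 0.853, the majority-class rate. This suggests these models predict the majority class for nearly every row, which is a sign of underfitting.

- None of the models outperform the baseline accuracy of 85.27% on the testing set.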
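
When a score lands almost exactly on the baseline, it is worth checking whether the model is predicting only the majority class. One possible check, sketched on made-up labels and predictions rather than the lab's results, is to look at the confusion matrix and the balanced accuracy (the mean of the per-class recalls):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Made-up true labels and a "model" that always predicts class 1.
y_true = np.array([1] * 17 + [2] * 2 + [3] * 1)
y_pred = np.ones_like(y_true)

# Plain accuracy looks respectable because class 1 dominates.
accuracy = (y_true == y_pred).mean()

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])

# Balanced accuracy exposes that classes 2 and 3 are never predicted.
balanced = balanced_accuracy_score(y_true, y_pred)

print(accuracy)
print(matrix)
print(balanced)
```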

---

## Step 4 & 5 Modeling: logistic regression

#### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as used above with kNN.

In [48]:
# create a LogisticRegression score test function
def lgr_test(algo, alpha, X_train, y_train, X_test, y_test):
    # map the algorithm name to a scikit-learn penalty
    if algo == 'lasso':
        penalty = 'l1'
    elif algo == 'ridge':
        penalty = 'l2'
    else:
        raise ValueError(f"algo must be 'lasso' or 'ridge', got {algo!r}")

    # C is the inverse of the regularization strength alpha
    lgr = LogisticRegression(C=1/alpha, penalty=penalty, solver='liblinear')
    lgr.fit(X_train, y_train)

    # report the 5-fold cross-validation, training, and testing accuracy
    print(f"Cross validation score for LogisticRegression alpha={alpha}, penalty={algo} is {cross_val_score(lgr, X_train, y_train, cv=5).mean():.8f}")
    print(f"Score of training for LogisticRegression alpha={alpha}, penalty={algo} is {lgr.score(X_train, y_train):.8f}")
    print(f"Score of testing for LogisticRegression alpha={alpha}, penalty={algo} is {lgr.score(X_test, y_test):.8f}")
    print()

In [49]:
# Lasso, alpha 1
lgr_test('lasso', 1, X_train, y_train, X_test, y_test)

Cross validation score for LogisticRegression alpha=1, penalty=lasso is 0.85271829
Score of training for LogisticRegression alpha=1, penalty=lasso is 0.85271829
Score of testing for LogisticRegression alpha=1, penalty=lasso is 0.85243742



In [50]:
# Lasso, alpha 10
lgr_test('lasso', 10, X_train, y_train, X_test, y_test)

Cross validation score for LogisticRegression alpha=10, penalty=lasso is 0.85271829
Score of training for LogisticRegression alpha=10, penalty=lasso is 0.85271829
Score of testing for LogisticRegression alpha=10, penalty=lasso is 0.85243742



In [51]:
# Ridge, alpha 1
lgr_test('ridge', 1, X_train, y_train, X_test, y_test)

Cross validation score for LogisticRegression alpha=1, penalty=ridge is 0.85271829
Score of training for LogisticRegression alpha=1, penalty=ridge is 0.85271829
Score of testing for LogisticRegression alpha=1, penalty=ridge is 0.85243742



In [52]:
# Ridge, alpha 10
lgr_test('ridge', 10, X_train, y_train, X_test, y_test)

Cross validation score for LogisticRegression alpha=10, penalty=ridge is 0.85271829
Score of training for LogisticRegression alpha=10, penalty=ridge is 0.85271829
Score of testing for LogisticRegression alpha=10, penalty=ridge is 0.85243742



### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. 

Are any of your models overfit or underfit? 

Do any of your models beat the baseline accuracy rate?

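- All four models return identical scores (about 0.853 on cross-validation, training, and testing), so none is overfit. The scores match the majority-class rate, which suggests each model predicts the majority class for nearly every row, a sign of underfitting.

- None of the models beats the baseline accuracy of 85.27%; they only match it.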

---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? 

What are the "best" models?

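- No. None of the models beats the baseline accuracy, so none is worth moving forward with as is.
- The "best" models are the ones that match the baseline (KNN with $k = 15$ or $k = 25$ and all four logistic regressions), but they appear to do no better than always predicting the majority class. On this evidence, these models do not show that the answers to Q1-Q44 predict handedness.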