## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

Answer:
<br>
1. Which question has the highest correlation to left-handedness?
2. As responses to Q44 become more positive (more respondents say "Yes"), do more respondents tend to be left-handed?
3. Is the response to Q1 influenced by respondents' handedness?

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

In [2]:
df = pd.read_table('data.csv')
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3
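
The hint above comes down to the delimiter: `pd.read_table()` splits on tabs by default, and `pd.read_csv('data.csv', sep='\t')` would be an equivalent call. As a minimal sketch, using the standard-library `csv` module and a made-up three-column sample rather than the real file, this is what happens when tab-separated text is split on commas:

```python
import csv
import io

# Hypothetical tab-separated sample (not the real data.csv).
raw = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n"

def parse(text, delimiter):
    """Parse delimited text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

comma_rows = parse(raw, ",")   # wrong delimiter: each line stays one field
tab_rows = parse(raw, "\t")    # right delimiter: three fields per line

print(len(comma_rows[0]), len(tab_rows[0]))
```

With the wrong delimiter every line collapses into a single column, which is the symptom to look for if a freshly loaded DataFrame has only one column.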


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

Answer:
<br>
1. If certain sensitive categories are not relevant to the problem statement, we may not need to collect data for such categories.
2. If certain sensitive data must be collected, respondents should have an option not to answer ("I do not wish to answer").
3. Data should be collected anonymously throughout the entire process.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [3]:
# determine data type for each column
df.dtypes

Q1              int64
Q2              int64
Q3              int64
Q4              int64
Q5              int64
Q6              int64
Q7              int64
Q8              int64
Q9              int64
Q10             int64
Q11             int64
Q12             int64
Q13             int64
Q14             int64
Q15             int64
Q16             int64
Q17             int64
Q18             int64
Q19             int64
Q20             int64
Q21             int64
Q22             int64
Q23             int64
Q24             int64
Q25             int64
Q26             int64
Q27             int64
Q28             int64
Q29             int64
Q30             int64
Q31             int64
Q32             int64
Q33             int64
Q34             int64
Q35             int64
Q36             int64
Q37             int64
Q38             int64
Q39             int64
Q40             int64
Q41             int64
Q42             int64
Q43             int64
Q44             int64
introelapse     int64
testelapse

In [4]:
# determine number of null values for each column
df.isnull().sum()

Q1             0
Q2             0
Q3             0
Q4             0
Q5             0
Q6             0
Q7             0
Q8             0
Q9             0
Q10            0
Q11            0
Q12            0
Q13            0
Q14            0
Q15            0
Q16            0
Q17            0
Q18            0
Q19            0
Q20            0
Q21            0
Q22            0
Q23            0
Q24            0
Q25            0
Q26            0
Q27            0
Q28            0
Q29            0
Q30            0
Q31            0
Q32            0
Q33            0
Q34            0
Q35            0
Q36            0
Q37            0
Q38            0
Q39            0
Q40            0
Q41            0
Q42            0
Q43            0
Q44            0
introelapse    0
testelapse     0
country        0
fromgoogle     0
engnat         0
age            0
education      0
gender         0
orientation    0
race           0
religion       0
hand           0
dtype: int64

In [5]:
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,...,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,...,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,...,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

Answer: Classification. The target, whether or not a person is left-handed, is a discrete category (Yes/No) rather than a continuous quantity.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

Answer:
<br>
Generally, we standardise variables so that features measured on larger scales do not dominate distance- or norm-based calculations. Standardisation also puts every feature on an equal footing when a regularisation penalty is applied, and makes coefficient magnitudes comparable across features.

For example, let's consider a dataset with two predictor variables: 'Age' (measured in years) and 'Income' (measured in thousands of dollars). Suppose we want to use these variables in a machine learning algorithm, such as k-Nearest Neighbors (k-NN), to predict a target variable.

Without standardisation:
- The 'Age' variable typically ranges from 18 to 80 (a range of 62).
- The 'Income' variable can range from 20 to 500 (a range of 480).
- If we don't standardise these variables, 'Income' will have a much larger influence on the distance calculations used by the k-NN algorithm. Two people at opposite ends of the age range differ by at most 62 units, while a fairly ordinary income gap (e.g., 50 vs. 150) already contributes 100 units, so differences in 'Age' are largely drowned out.
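
The effect can be sketched in plain Python. The four people below are hypothetical (age in years, income in thousands of dollars), and the helper names are made up for illustration; in the lab itself `StandardScaler` would do the rescaling.

```python
import math
import statistics

# Hypothetical people as (age in years, income in thousands of dollars).
people = [(25, 40), (27, 100), (70, 45), (50, 400)]

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardise(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

def nearest_to_first(rows):
    """Index of the row closest to rows[0]."""
    return min(range(1, len(rows)), key=lambda i: euclidean(rows[0], rows[i]))

raw_neighbour = nearest_to_first(people)
scaled_neighbour = nearest_to_first(standardise(people))
print(raw_neighbour, scaled_neighbour)
```

On the raw values, the nearest neighbour of the 25-year-old is the 70-year-old with a similar income; after standardising, it is the 27-year-old. The choice of neighbour, and therefore the k-NN prediction, depends on the scale of the inputs.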

### 7. Give an example of when we might not standardize our variables.

Answer: When all variables are already measured in the same units on the same scale, for example survey items that all use the same response scale.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

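Answer: No. Q1 - Q44 all share the same response scale (1 to 5, with 0 appearing as the minimum in the summary statistics, presumably for skipped items), so no single question dominates the distance calculation.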

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

In [6]:
df['hand'].value_counts()

hand
1    3542
2     452
3     179
0      11
Name: count, dtype: int64

Answer:
<br>
To predict whether or not a person is left-handed, it helps to binarise the `hand` column.
- Drop all rows where `hand` = 0 (missing)
- Map `2` to be `1` (left-handed)
- Map `1` and `3` to be `0` (not left-handed: right-handed or ambidextrous)

In [7]:
df['y'] = [1 if i == 2 else 0 for i in df['hand']]

In [8]:
df['y'].value_counts()

y
0    3732
1     452
Name: count, dtype: int64

In [9]:
df

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand,y
0,4,1,5,1,5,1,5,1,4,1,...,2,1,22,3,1,1,3,2,3,0
1,1,5,1,4,2,5,5,4,1,5,...,2,1,14,1,2,2,6,1,1,0
2,1,2,1,1,5,4,3,2,1,4,...,2,2,30,4,1,1,1,1,2,1
3,1,4,1,5,1,4,5,4,3,5,...,2,1,18,2,2,5,3,2,2,1
4,5,1,5,1,5,1,5,1,3,1,...,2,1,22,3,1,1,3,2,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4179,3,5,4,5,2,4,2,2,2,5,...,1,1,18,2,1,1,6,2,1,0
4180,1,5,1,5,1,4,2,4,1,4,...,1,1,18,2,2,1,3,2,1,0
4181,3,2,2,4,5,4,5,2,2,5,...,2,2,22,2,1,1,6,1,1,0
4182,1,3,4,5,1,3,3,1,1,3,...,2,1,16,1,2,5,1,1,1,0


In [10]:
# drop all rows where hand = 0 (missing response)
df = df[df['hand'] != 0]
df

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand,y
0,4,1,5,1,5,1,5,1,4,1,...,2,1,22,3,1,1,3,2,3,0
1,1,5,1,4,2,5,5,4,1,5,...,2,1,14,1,2,2,6,1,1,0
2,1,2,1,1,5,4,3,2,1,4,...,2,2,30,4,1,1,1,1,2,1
3,1,4,1,5,1,4,5,4,3,5,...,2,1,18,2,2,5,3,2,2,1
4,5,1,5,1,5,1,5,1,3,1,...,2,1,22,3,1,1,3,2,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4179,3,5,4,5,2,4,2,2,2,5,...,1,1,18,2,1,1,6,2,1,0
4180,1,5,1,5,1,4,2,4,1,4,...,1,1,18,2,2,1,3,2,1,0
4181,3,2,2,4,5,4,5,2,2,5,...,2,2,22,2,1,1,6,1,1,0
4182,1,3,4,5,1,3,3,1,1,3,...,2,1,16,1,2,5,1,1,1,0


### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

Answer:
<br>
- With only two classes, an even value of $k$ such as 4 can produce tied votes, where two neighbours belong to each class.
- In such cases there is no majority among the neighbours, so the predicted label depends on an arbitrary tie-breaking rule rather than on the data.
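
A minimal sketch of the tie, assuming a made-up one-dimensional training set and a hypothetical helper `knn_votes` (this is not how `KNeighborsClassifier` is implemented, only the voting idea):

```python
from collections import Counter

def knn_votes(train, query, k):
    """Count class labels among the k training points nearest to query (1-D)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    return Counter(label for _, label in nearest)

# Hypothetical 1-D training set: (feature value, class label).
train = [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1), (9.0, 1)]

votes_k4 = knn_votes(train, query=2.0, k=4)
votes_k5 = knn_votes(train, query=2.0, k=5)
print(dict(votes_k4), dict(votes_k5))
```

With $k = 4$ the vote is split two against two; with $k = 5$ one class necessarily has a majority.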

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [11]:
# see all columns
df.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand', 'y'],
      dtype='object')

In [12]:
temp_df = df.drop(columns=['introelapse', 'testelapse', 'country', 'fromgoogle', 'engnat',
                           'age', 'education', 'gender', 'orientation', 'race', 'religion', 'hand', 'y'])

In [13]:
X = temp_df.values

y = df['y']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [15]:
# create four separate models
# instantiate and fit models

k_3 = KNeighborsClassifier(n_neighbors = 3)
k_3.fit(X_train, y_train)

k_5 = KNeighborsClassifier(n_neighbors = 5)
k_5.fit(X_train, y_train)

k_15 = KNeighborsClassifier(n_neighbors = 15)
k_15.fit(X_train, y_train)

k_25 = KNeighborsClassifier(n_neighbors = 25)
k_25.fit(X_train, y_train)

In [16]:
# k_3_cvs = cross_val_score(k_3, X_train, y_train, cv=5).mean()

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Answer: 
<br>
- Yes. The default penalty is L2 (Ridge).
- The default C value (inverse of regularisation strength) is 1.0.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

Answer:
- In general, yes. sklearn's logistic regression is regularised by default, and the penalty depends on the size of each coefficient, which in turn depends on the scale of each feature.
- In this case it is not strictly necessary, since Q1 - Q44 are already on the same scale.

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.

In [17]:
# lasso model, alpha = 1
lasso_1 = LogisticRegression(penalty = 'l1', C = 1.0, solver = 'liblinear')
lasso_1.fit(X_train, y_train)

# lasso model, alpha = 10
lasso_10 = LogisticRegression(penalty = 'l1', C = 0.1, solver = 'liblinear')
lasso_10.fit(X_train, y_train)

# ridge model, alpha = 1
ridge_1 = LogisticRegression(penalty = 'l2', C = 1.0, solver = 'liblinear')
ridge_1.fit(X_train, y_train)

# ridge model, alpha = 10
ridge_10 = LogisticRegression(penalty = 'l2', C = 0.1, solver = 'liblinear')
ridge_10.fit(X_train, y_train)
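
The hint about $\alpha$ refers to the fact that sklearn's `LogisticRegression` takes `C`, the inverse of the regularisation strength, so $C = 1/\alpha$. A small sketch of that conversion, where `alpha_to_C` is a hypothetical helper written for illustration:

```python
def alpha_to_C(alpha):
    """Convert a regularisation strength alpha to the inverse strength C."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha

# The four specifications requested in the prompt: (penalty, alpha).
specs = [("l1", 1), ("l1", 10), ("l2", 1), ("l2", 10)]
settings = [{"penalty": p, "C": alpha_to_C(a)} for p, a in specs]

for s in settings:
    print(s)
```

Each dictionary could be unpacked into the constructor, e.g. `LogisticRegression(solver='liblinear', **settings[0])`, which matches the values used in the cell above.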

---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

Answer:
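- Probably not. The predictors are responses to psychological survey questions, and it is unlikely that these have much bearing on whether someone is left-handed or right-handed.
- With weak predictors, we should expect accuracy close to what we would get by always predicting the majority class (not left-handed), with little difference between models.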
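
A useful reference point is the accuracy of a "model" that ignores the predictors entirely. The sketch below uses the `hand` counts printed by `value_counts()` earlier in this lab; it is computed on the full data set rather than on the test split, so it is only an approximate benchmark for the test scores.

```python
# Counts taken from the value_counts() output of `hand` earlier in this lab.
hand_counts = {1: 3542, 2: 452, 3: 179, 0: 11}

# Drop the rows coded 0, then treat 2 as left-handed and everything else as not.
kept = {code: n for code, n in hand_counts.items() if code != 0}
total = sum(kept.values())
left = kept[2]

# Accuracy of always predicting "not left-handed".
baseline_accuracy = (total - left) / total
print(round(baseline_accuracy, 4))
```

Any model whose accuracy sits near this baseline is doing little more than predicting the majority class.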

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

Answer:

In [18]:
print("kNN train score with k = 3: " + str(k_3.score(X_train, y_train)))
print("kNN test score with k = 3: " + str(k_3.score(X_test, y_test)))

print("kNN train score with k = 5: " + str(k_5.score(X_train, y_train)))
print("kNN test score with k = 5: " + str(k_5.score(X_test, y_test)))

print("kNN train score with k = 15: " + str(k_15.score(X_train, y_train)))
print("kNN test score with k = 15: " + str(k_15.score(X_test, y_test)))

print("kNN train score with k = 25: " + str(k_25.score(X_train, y_train)))
print("kNN test score with k = 25: " + str(k_25.score(X_test, y_test)))

print("logistic regression train score with lasso penalty, alpha = 1: " + str(lasso_1.score(X_train, y_train)))
print("logistic regression test score with lasso penalty, alpha = 1: " + str(lasso_1.score(X_test, y_test)))

print("logistic regression train score with lasso penalty, alpha = 10: " + str(lasso_10.score(X_train, y_train)))
print("logistic regression test score with lasso penalty, alpha = 10: " + str(lasso_10.score(X_test, y_test)))

print("logistic regression train score with ridge penalty, alpha = 1: " + str(ridge_1.score(X_train, y_train)))
print("logistic regression test score with ridge penalty, alpha = 1: " + str(ridge_1.score(X_test, y_test)))

print("logistic regression train score with ridge penalty, alpha = 10: " + str(ridge_10.score(X_train, y_train)))
print("logistic regression test score with ridge penalty, alpha = 10: " + str(ridge_10.score(X_test, y_test)))

kNN train score with k = 3: 0.9041424169804861
kNN test score with k = 3: 0.8570287539936102
kNN train score with k = 5: 0.8880520369736392
kNN test score with k = 5: 0.8833865814696485
kNN train score with k = 15: 0.8894214310167751
kNN test score with k = 15: 0.8961661341853036
kNN train score with k = 25: 0.8894214310167751
kNN test score with k = 25: 0.896964856230032
logistic regression train score with lasso penalty, alpha = 1: 0.889763779527559
logistic regression test score with lasso penalty, alpha = 1: 0.896964856230032
logistic regression train score with lasso penalty, alpha = 10: 0.889763779527559
logistic regression test score with lasso penalty, alpha = 10: 0.896964856230032
logistic regression train score with ridge penalty, alpha = 1: 0.889763779527559
logistic regression test score with ridge penalty, alpha = 1: 0.896964856230032
logistic regression train score with ridge penalty, alpha = 10: 0.889763779527559
logistic regression test score with ridge penalty, alpha 

**Answer:**

|        Model        |    Value of $k$   | Penalty |   Value of $\alpha$   | Train Score | Test Score |
|:-------------------:|:--------------:|:-------:|:----------:|:-----------------:|:----------------:|
|         kNN         |  $k = 3$ |    NA   |     NA     |       0.9041      |      0.8570      |
|         kNN         |  $k = 5$ |    NA   |     NA     |       0.8880      |      0.8833      |
|         kNN         | $k = 15$ |    NA   |     NA     |       0.8894      |      0.8961      |
|         kNN         | $k = 25$ |    NA   |     NA     |       0.8894      |      0.8969      |
| Logistic regression |   NA   |  Lasso  |  $\alpha = 1$ |       0.8897      |      0.8969      |
| Logistic regression |   NA   |  Lasso  | $\alpha = 10$ |       0.8897      |      0.8969      |
| Logistic regression |   NA   |  Ridge  |  $\alpha = 1$ |       0.8897      |      0.8969      |
| Logistic regression |   NA   |  Ridge  | $\alpha = 10$ |       0.8897      |      0.8969      |

### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

Answer:
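- Overfitting is indicated by a train score that is noticeably higher than the test score.
- This occurs for the kNN model where $k=3$ (train 0.9041 vs. test 0.8570). For $k=5$ the gap is under half a percentage point, and for $k=15$ and $k=25$ the test score is slightly higher than the train score.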
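
The comparison can be made explicit by computing the train/test gap for each $k$, using the scores printed in the output above:

```python
# (train accuracy, test accuracy) as printed in the output above.
knn_scores = {
    3: (0.9041424169804861, 0.8570287539936102),
    5: (0.8880520369736392, 0.8833865814696485),
    15: (0.8894214310167751, 0.8961661341853036),
    25: (0.8894214310167751, 0.896964856230032),
}

# A large positive gap (train much better than test) suggests overfitting.
gaps = {k: train - test for k, (train, test) in knn_scores.items()}
for k, gap in gaps.items():
    print(f"k = {k:>2}: train - test = {gap:+.4f}")

largest_gap_k = max(gaps, key=gaps.get)
```

How large a gap counts as "overfitting" is a judgment call; here $k = 3$ stands out by roughly an order of magnitude.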

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

Answer:
- As $k$ increases, each prediction is a vote over more neighbours, so the decision boundary becomes smoother and the model is less prone to overfitting.
- This means that bias increases and variance decreases as $k$ increases.

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:
1. Increase the value of $k$
2. Reduce the number of features (feature selection or dimensionality reduction)
3. Gather more training data

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

Answer: There is no evidence of overfitting for the logistic regression models, as train scores are slightly lower than test scores in all four.

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer: As C increases, regularisation decreases. Hence, variance increases and bias decreases.
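
One way to see this is through the objective that an L2-penalised logistic regression minimises. The form below follows the convention used in the scikit-learn user guide, with labels coded as $y_i \in \{-1, 1\}$; the symbols $w$, $b$, and $n$ are notation introduced here for illustration and do not appear elsewhere in the lab.

$$\min_{w, b} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + b\right)\right)\right)$$

Because $C$ multiplies the data-fit term, a large $C$ makes the penalty comparatively unimportant (weaker regularisation, lower bias, higher variance), while a small $C$ lets the penalty dominate. Dividing the objective by $C$ gives a penalty weight of $1/C$, which is the $\alpha$ used in the lab prompts.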

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

Answer:
- As C changes, the train and test scores remain unchanged for the logistic regression models.
- Hence, regularisation has little effect on the model.
- In the context of this problem, it means that the questions have very little influence in determining whether someone is left-handed.

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:
1. Gather more data
2. Use regularisation techniques
3. Remove features from model

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

Answer:
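- Logistic regression.
- Logistic regression produces a coefficient for each question, which gives the direction and relative size of that question's association with left-handedness. $k$-NN predicts from nearby observations and provides no per-feature measure of importance.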
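
As a rough sketch of how coefficients could be used this way, the block below ranks the questions by absolute coefficient size. The values are copied from the `lasso_1.coef_` output shown under question 25; treating magnitude as "importance" is an assumption that is only reasonable because all questions share the same scale.

```python
# Coefficients copied from the lasso_1.coef_ output in this lab, in order Q1..Q44.
coefs = [
    0.00089609, -0.0129303, -0.05207573, -0.04669918, 0.03549121,
    0.00189489, 0.00573411, -0.19656907, 0.0, 0.01870479,
    0.00521676, -0.00331328, -0.04987876, 0.06126889, -0.04890026,
    0.04623741, 0.03088057, -0.04046212, -0.00400496, -0.0621593,
    -0.05997433, -0.08411421, -0.03384959, -0.05426119, 0.05080234,
    0.08232161, 0.05554712, -0.04689501, 0.05555425, 0.01473696,
    0.00825508, 0.02169194, -0.04057988, -0.02895332, 0.01994143,
    -0.04770449, -0.01638693, 0.09710724, -0.07509253, -0.10466147,
    -0.0292694, -0.04385765, -0.15124591, 0.0611367,
]

names = [f"Q{i}" for i in range(1, len(coefs) + 1)]

# Sort by absolute value, largest first.
ranked = sorted(zip(names, coefs), key=lambda pair: abs(pair[1]), reverse=True)
top_three = ranked[:3]

# LASSO can shrink coefficients exactly to zero, dropping those features.
zeroed = [name for name, c in zip(names, coefs) if c == 0.0]

print(top_three)
print(zeroed)
```

Even the largest coefficients are small in absolute terms, which is consistent with the questions carrying little information about handedness.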

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

In [19]:
lasso_1.coef_

array([[ 0.00089609, -0.0129303 , -0.05207573, -0.04669918,  0.03549121,
         0.00189489,  0.00573411, -0.19656907,  0.        ,  0.01870479,
         0.00521676, -0.00331328, -0.04987876,  0.06126889, -0.04890026,
         0.04623741,  0.03088057, -0.04046212, -0.00400496, -0.0621593 ,
        -0.05997433, -0.08411421, -0.03384959, -0.05426119,  0.05080234,
         0.08232161,  0.05554712, -0.04689501,  0.05555425,  0.01473696,
         0.00825508,  0.02169194, -0.04057988, -0.02895332,  0.01994143,
        -0.04770449, -0.01638693,  0.09710724, -0.07509253, -0.10466147,
        -0.0292694 , -0.04385765, -0.15124591,  0.0611367 ]])

In [20]:
coeff = lasso_1.coef_[0][0]
print(f"The coefficient for Q1 is {coeff}.")

The coefficient for Q1 is 0.0008960900513012697.


In [21]:
np.exp(lasso_1.coef_[0][0])

1.0008964916599414

Answer: Holding the responses to all other questions constant, a one-unit increase in the response to `Q1` multiplies the odds of being left-handed by about 1.0009, an increase of roughly 0.09%. In practical terms, `Q1` has almost no association with left-handedness in this model.

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

Answer:
- Logistic regression.
- In this case, logistic regression is the better performing model (no overfit) with more consistent train/test scores.

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

Answer:

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
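
For the bonus items on sensitivity, specificity, and unbalanced classes, here is a minimal sketch in plain Python. The labels are hypothetical (10 left-handed out of 100), and `sensitivity_specificity` is a helper written for illustration; it assumes both classes appear in `y_true`.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)   # share of left-handed people identified
    specificity = tn / (tn + fp)   # share of others correctly identified
    return accuracy, sensitivity, specificity

# Hypothetical labels: 1 = left-handed, 0 = not left-handed.
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100   # a "model" that never predicts left-handed

acc, sens, spec = sensitivity_specificity(y_true, always_negative)
print(acc, sens, spec)
```

A model that never predicts "left-handed" scores 90% accuracy here while identifying none of the left-handed people, which is why accuracy alone can be misleading with unbalanced classes.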