## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

__Answers:__

1. Is a person who daydreams more likely to be left-handed? <br>
    _Q5_ 
1. Is a person who put on fake concerts as a child more likely to be left-handed? <br>
    _Q16_
1. Is a person who jumps when excited more likely to be left-handed? <br>
    _Q26_

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data.csv', sep='\t')

In [3]:
df.head(2)

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
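One way to find out why a plain `pd.read_csv()` fails is to sniff the delimiter from a sample of the file. The sketch below is illustrative only: it uses a made-up three-line sample that mirrors the first rows shown above rather than the real `data.csv`, and the names `sample`, `dialect` and `rows` are assumptions.

```python
import csv
import io

# Hypothetical sample: a few columns of a tab-separated file with a .csv name.
sample = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n"

# Ask the standard library to guess the delimiter from a restricted candidate set.
dialect = csv.Sniffer().sniff(sample, delimiters="\t,;")
print(repr(dialect.delimiter))

# Parse with the detected delimiter; the same value could be passed on as
# pd.read_csv(..., sep=dialect.delimiter).
rows = list(csv.DictReader(io.StringIO(sample), delimiter=dialect.delimiter))
print(rows[0])
```

Sniffing is a heuristic, so it is still worth looking at `df.head()` after loading to confirm the columns were split as expected.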


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

**__Answer:__**

1. Participants need to be told why each data point is being collected and how it relates to what the study is trying to achieve.
2. The data should be de-identified so that no data point can be traced back to an individual.
3. A retention period should be set, after which the data is disposed of securely.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4184 entries, 0 to 4183
Data columns (total 56 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Q1           4184 non-null   int64 
 1   Q2           4184 non-null   int64 
 2   Q3           4184 non-null   int64 
 3   Q4           4184 non-null   int64 
 4   Q5           4184 non-null   int64 
 5   Q6           4184 non-null   int64 
 6   Q7           4184 non-null   int64 
 7   Q8           4184 non-null   int64 
 8   Q9           4184 non-null   int64 
 9   Q10          4184 non-null   int64 
 10  Q11          4184 non-null   int64 
 11  Q12          4184 non-null   int64 
 12  Q13          4184 non-null   int64 
 13  Q14          4184 non-null   int64 
 14  Q15          4184 non-null   int64 
 15  Q16          4184 non-null   int64 
 16  Q17          4184 non-null   int64 
 17  Q18          4184 non-null   int64 
 18  Q19          4184 non-null   int64 
 19  Q20          4184 non-null 

In [5]:
(df.isnull().sum() * 100) / len(df)

Q1             0.0
Q2             0.0
Q3             0.0
Q4             0.0
Q5             0.0
Q6             0.0
Q7             0.0
Q8             0.0
Q9             0.0
Q10            0.0
Q11            0.0
Q12            0.0
Q13            0.0
Q14            0.0
Q15            0.0
Q16            0.0
Q17            0.0
Q18            0.0
Q19            0.0
Q20            0.0
Q21            0.0
Q22            0.0
Q23            0.0
Q24            0.0
Q25            0.0
Q26            0.0
Q27            0.0
Q28            0.0
Q29            0.0
Q30            0.0
Q31            0.0
Q32            0.0
Q33            0.0
Q34            0.0
Q35            0.0
Q36            0.0
Q37            0.0
Q38            0.0
Q39            0.0
Q40            0.0
Q41            0.0
Q42            0.0
Q43            0.0
Q44            0.0
introelapse    0.0
testelapse     0.0
country        0.0
fromgoogle     0.0
engnat         0.0
age            0.0
education      0.0
gender         0.0
orientation 

In [6]:
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,...,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,...,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,...,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


#### Findings:
* The `age` column probably contains data-entry errors: its standard deviation (about 367) and maximum (23763) are implausible for ages in years.
* The source documentation notes that `age` was entered as free text by participants.

In [7]:
df.loc[df['age']>100, ['age']]

Unnamed: 0,age
2075,123
2137,409
2690,23763


_Since there is no way to recover the correct ages, I will drop these three rows._

In [8]:
df.drop([2075,2137,2690], inplace=True)

In [9]:
# check max and standard deviation
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4181.0,4181.0,4181.0,4181.0,4181.0,4181.0,4181.0,4181.0,4181.0,4181.0,...,4181.0,4181.0,4181.0,4181.0,4181.0,4181.0,4181.0,4181.0,4181.0,4181.0
mean,1.962688,3.829706,2.845013,3.187037,2.864865,3.672327,3.21526,3.184406,2.760344,3.523559,...,480.141593,1.575939,1.239895,24.581679,2.317867,1.654867,1.831858,5.012437,2.39249,1.190863
std,1.360602,1.551874,1.664401,1.476915,1.545997,1.341916,1.490502,1.387127,1.511825,1.24306,...,3143.300111,0.494259,0.440852,10.869709,0.873939,0.640905,1.302078,1.971165,2.182517,0.494392
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,324.0,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,86.0,4.0,3.0,5.0,7.0,7.0,3.0


_The maximum and standard deviation of the `age` column are now `86` and `10.87` respectively._
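Dropping rows by hard-coded index works here, but a boolean mask expresses the same rule without relying on specific row labels. This is a minimal sketch on a toy frame with made-up values; `toy`, `clean` and `MAX_PLAUSIBLE_AGE` are illustrative names, and the cut-off of 100 is an assumption taken from the check above.

```python
import pandas as pd

# Toy frame standing in for the survey data; values are illustrative only.
toy = pd.DataFrame({"age": [22, 14, 123, 409, 23763, 86],
                    "hand": [3, 1, 1, 2, 1, 1]})

MAX_PLAUSIBLE_AGE = 100  # assumed cut-off, same threshold as the check above

# Keep only rows with a plausible age; .copy() gives an independent frame.
clean = toy[toy["age"] <= MAX_PLAUSIBLE_AGE].copy()
print(clean["age"].max(), len(toy) - len(clean))
```

The mask keeps working if the file is re-exported and the offending rows move to different positions.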

In [10]:
# one-hot encode `country` column
df1 = pd.get_dummies(data=df, columns=['country'], prefix='ctry')
df1.head(2)

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,ctry_UA,ctry_US,ctry_UY,ctry_UZ,ctry_VE,ctry_VI,ctry_VN,ctry_ZA,ctry_ZM,ctry_ZW
0,4,1,5,1,5,1,5,1,4,1,...,0,1,0,0,0,0,0,0,0,0
1,1,5,1,4,2,5,5,4,1,5,...,0,0,0,0,0,0,0,0,0,0


---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

**Answer:** It is a classification problem, because the target is a category rather than a continuous quantity: left-handed or not (or, at most, three classes if ambidextrous respondents are kept).

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

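**Answer:** We standardise when the variables differ widely in range and variance, so that each one carries comparable weight in the algorithm. This matters most for distance-based methods such as $k$-nearest neighbors: without standardisation, a variable like `age` (13 to 86 here) would dominate the distance over a survey item scored on a 5-point scale.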
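To make the effect on distances concrete, here is a small worked example with four made-up respondents described by a survey answer and an age. The data and the helper names `euclidean` and `standardise` are illustrative assumptions. The population standard deviation is used, as `StandardScaler` does.

```python
import math
import statistics

# Hypothetical respondents: (answer on a 1-5 scale, age in years).
points = [(1, 20), (5, 60), (1, 60), (5, 20)]

def euclidean(p, q):
    """Straight-line distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def standardise(rows):
    """Z-score each column using the population standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(col) for col in cols]
    stds = [statistics.pstdev(col) for col in cols]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds)) for row in rows]

z = standardise(points)

# Point 0 vs point 1 differ in both features; point 0 vs point 2 differ in age only.
raw_ab, raw_ac = euclidean(points[0], points[1]), euclidean(points[0], points[2])
z_ab, z_ac = euclidean(z[0], z[1]), euclidean(z[0], z[2])
print(round(raw_ab, 2), round(raw_ac, 2))
print(round(z_ab, 2), round(z_ac, 2))
```

On the raw scale the two distances are almost equal, so the survey answer barely counts. After standardising, the pair that differs in both features is clearly further apart.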

### 7. Give an example of when we might not standardize our variables.

**Answer:** When the variables are categorical (for example, one-hot encoded columns), or when all variables are already on the same scale, standardisation may not be useful.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

**Answer:** Yes, I would still standardise, because some of my predictors do not share the range of the others, e.g. the demographic variables and the `age` column.

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

**Answer:** We need to keep only the rows with an unambiguous answer, i.e. `hand` equal to `1` (right) or `2` (left), and drop the rows where `hand` is `3` (both) or `0` (missing).

In [11]:
df1['hand'].value_counts()

1    3541
2     452
3     178
0      10
Name: hand, dtype: int64

In [12]:
df2 = df1[(df1['hand']==1) | (df1['hand']==2)]
df2.shape

(3993, 149)

### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

**Answer:** With two classes and an even $k$, the neighbours' votes can tie: with `k=4`, two neighbours may be right-handed and two left-handed, leaving the classifier without a clear majority. An odd $k$ avoids this.
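The tie can be illustrated with a plain majority vote. The neighbour labels below are made up, and the sketch shows only the voting step, not how any particular library breaks ties.

```python
from collections import Counter

# Hypothetical labels of the 4 nearest neighbours (1 = right, 2 = left).
neighbour_labels = [1, 2, 2, 1]

votes = Counter(neighbour_labels)
top_two = votes.most_common(2)
# A tie means the two most common labels received the same number of votes.
is_tie = len(top_two) == 2 and top_two[0][1] == top_two[1][1]
print(votes, is_tie)

# Adding a fifth neighbour (odd k) makes a two-class tie impossible.
odd_votes = Counter(neighbour_labels + [2])
print(odd_votes.most_common(1))
```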

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [13]:
# import sklearn modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

In [14]:
# define variables
# Dropped `introelapse` and `testelapse`: these variables are unlikely to be linked to
# handedness, and their range of values might distort the KNN distances.
X = df2.drop(columns=['hand','introelapse', 'testelapse'])
y = df2['hand']

_Possible refinement: drop all the demographic columns and isolate `y` to be just left-handedness._
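The note above could be implemented roughly as follows. This is a sketch on a toy frame, not the lab's actual pipeline: `toy`, `q_cols`, `X_q` and `y_left` are hypothetical names, and only a few of the real columns are included.

```python
import pandas as pd

# Toy stand-in for df2 with only a few of the real column names.
toy = pd.DataFrame({
    "Q1": [4, 1, 2], "Q2": [1, 5, 3], "age": [22, 14, 30],
    "hand": [1, 2, 1],
})

# The survey has Q1..Q44; keep whichever of those exist in the frame.
q_cols = [f"Q{i}" for i in range(1, 45) if f"Q{i}" in toy.columns]
X_q = toy[q_cols]

# Binary target: 1 if left-handed (hand == 2), else 0.
y_left = (toy["hand"] == 2).astype(int)
print(q_cols, y_left.tolist())
```

Restricting the predictors to the question columns also matches the wording of the task, which asks for predictions from Q1 - Q44.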

In [15]:
# train test split with stratification of target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1, stratify=y)

In [16]:
# check stratification
y_train.value_counts()

1    2832
2     362
Name: hand, dtype: int64

In [17]:
y_test.value_counts()

1    709
2     90
Name: hand, dtype: int64

_**StandardScaler**_

In [18]:
ss = StandardScaler()
Z_train = ss.fit_transform(X_train)
Z_test = ss.transform(X_test)

> **KNN_models:** $k = 3, 5, 15$ and $25$

_**Instantiate KNN model**_

In [19]:
knn_3 = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_5 = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_15 = KNeighborsClassifier(n_neighbors=15, weights='distance')
knn_25 = KNeighborsClassifier(n_neighbors=25, weights='distance')

_**What accuracy can be expected from the KNN model for this dataset and target variable?**_
> Using cross validation score

In [20]:
# mean accuracy score
print(f'knn_3 mean accuracy score: {cross_val_score(knn_3, Z_train, y_train, cv=10).mean()}') 
print(f'knn_5 mean accuracy score: {cross_val_score(knn_5, Z_train, y_train, cv=10).mean()}')
print(f'knn_15 mean accuracy score: {cross_val_score(knn_15, Z_train, y_train, cv=10).mean()}')
print(f'knn_25 mean accuracy score: {cross_val_score(knn_25, Z_train, y_train, cv=10).mean()}')

knn_3 mean accuracy score: 0.851916144200627
knn_5 mean accuracy score: 0.8681925940438872
knn_15 mean accuracy score: 0.8863499216300941
knn_25 mean accuracy score: 0.8866634012539185
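In the cell above, the scaler was fitted on all of the training data before cross-validation, so each validation fold had a small influence on the scaling. One common alternative is to put the scaler inside a pipeline. The sketch below uses synthetic data rather than the survey, and the names `X_demo`, `y_demo` and `pipe` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the survey data (not the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.88, 0.12], random_state=1)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
# on that fold's training portion only.
pipe = make_pipeline(StandardScaler(),
                     KNeighborsClassifier(n_neighbors=15, weights="distance"))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean())
```

With features that are mostly on the same scale the difference is usually small, but the pipeline removes the question entirely.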


_**Fit models**_

In [21]:
knn_3 = knn_3.fit(Z_train, y_train)
knn_5 = knn_5.fit(Z_train, y_train)
knn_15 = knn_15.fit(Z_train, y_train)
knn_25 = knn_25.fit(Z_train, y_train)

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

**Answer:** Yes. By default, `LogisticRegression` in sklearn applies an `L2` penalty with `C=1.0`.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

**Answer:** Yes. Regularisation penalises the size of the coefficients, so I do not want variables with a wider range of values to be penalised differently simply because of their scale.

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.

In [22]:
from sklearn.linear_model import LogisticRegression

_**StandardScaler:** reuse the scaled `Z_train` and `Z_test` from the $k$-NN section._

> LASSO (L1 penalty):  $\alpha = 1$ and $\alpha = 10$

> Ridge (L2 penalty, default): $\alpha = 1$ and $\alpha = 10$

_Selected `solver='liblinear'` because it supports both `L1` and `L2` regularisation in `sklearn.linear_model.LogisticRegression`. `solver='saga'` also supports both, but it gave a:_ 
> convergence warning: The max_iter was reached which means the `coef_` did not converge

_So `saga` was not used._

_**Instantiate Logistic regression models**_

In [23]:
# Note: sklearn's C is the inverse of alpha, so alpha = 10 would require C = 0.1.
logr_l1_a1 = LogisticRegression(solver='liblinear', penalty='l1', C=1.0) # LASSO, C = 1 (alpha = 1/C = 1)
logr_l1_a10 = LogisticRegression(solver='liblinear', penalty='l1', C=10.0) # LASSO, C = 10 (alpha = 1/C = 0.1, not 10)
logr_l2_a1 = LogisticRegression(solver='liblinear', penalty='l2', C=1.0) # Ridge, C = 1 (alpha = 1/C = 1)
logr_l2_a10 = LogisticRegression(solver='liblinear', penalty='l2', C=10.0) # Ridge, C = 10 (alpha = 1/C = 0.1, not 10)

_**Fit model**_

In [24]:
logr_l1_a1 = logr_l1_a1.fit(Z_train, y_train)
logr_l1_a10 = logr_l1_a10.fit(Z_train, y_train)
logr_l2_a1 = logr_l2_a1.fit(Z_train, y_train)
logr_l2_a10 = logr_l2_a10.fit(Z_train, y_train)
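In scikit-learn, `C` is the inverse of the regularisation strength, so a given $\alpha$ maps to `C = 1/alpha`. Under that convention `C=10.0` corresponds to $\alpha = 0.1$ rather than $\alpha = 10$. The small helper below (a hypothetical name, not part of sklearn) makes the conversion explicit.

```python
def c_from_alpha(alpha: float) -> float:
    """Convert a regularisation strength alpha to the inverse form C = 1 / alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha

# The two alpha values requested in the exercise.
for alpha in (1, 10):
    print(f"alpha = {alpha:>2} -> C = {c_from_alpha(alpha)}")
```

The result could then be passed on as `LogisticRegression(..., C=c_from_alpha(10))`.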

---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

**Answer:** I do not think my $X$ variables will do a good job of predicting my $Y$ variable, because the survey questions have no obvious or known relationship with any particular type of handedness. I would therefore expect scores close to what we would get by always predicting the majority class.

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

_**Evaluate KNN models**_

In [25]:
# accuracy scores
print(' knn=3 '.center(38, '='))
print(f'knn_3_train_score: {knn_3.score(Z_train, y_train)}') 
print(f'knn_3_test_score: {knn_3.score(Z_test, y_test)}') 
print()
print(' knn=5 '.center(38, '='))
print(f'knn_5_train_score: {knn_5.score(Z_train, y_train)}') 
print(f'knn_5_test_score: {knn_5.score(Z_test, y_test)}')
print()
print(' knn=15 '.center(38, '='))
print(f'knn_15_train_score: {knn_15.score(Z_train, y_train)}') 
print(f'knn_15_test_score: {knn_15.score(Z_test, y_test)}')
print()
print(' knn=25 '.center(38, '='))
print(f'knn_25_train_score: {knn_25.score(Z_train, y_train)}') 
print(f'knn_25_test_score: {knn_25.score(Z_test, y_test)}')
print()
print(' l1_a1 '.center(38, '='))
print(f'logr_l1_a1_trn_score: {logr_l1_a1.score(Z_train, y_train)}') 
print(f'logr_l1_a1_tst_score: {logr_l1_a1.score(Z_test, y_test)}')
print()
print(' l1_a10 '.center(38, '='))
print(f'logr_l1_a10_trn_score: {logr_l1_a10.score(Z_train, y_train)}') 
print(f'logr_l1_a10_tst_score: {logr_l1_a10.score(Z_test, y_test)}')
print()
print(' l2_a1 '.center(38, '='))
print(f'logr_l2_a1_trn_score: {logr_l2_a1.score(Z_train, y_train)}') 
print(f'logr_l2_a1_tst_score: {logr_l2_a1.score(Z_test, y_test)}')
print()
print(' l2_a10 '.center(38, '='))
print(f'logr_l2_a10_trn_score: {logr_l2_a10.score(Z_train, y_train)}') 
print(f'logr_l2_a10_tst_score: {logr_l2_a10.score(Z_test, y_test)}')

knn_3_train_score: 1.0
knn_3_test_score: 0.8635794743429287

knn_5_train_score: 1.0
knn_5_test_score: 0.8823529411764706

knn_15_train_score: 1.0
knn_15_test_score: 0.8886107634543179

knn_25_train_score: 1.0
knn_25_test_score: 0.8873591989987485

logr_l1_a1_trn_score: 0.8872886662492173
logr_l1_a1_tst_score: 0.8873591989987485

logr_l1_a10_trn_score: 0.8869755792110207
logr_l1_a10_tst_score: 0.8873591989987485

logr_l2_a1_trn_score: 0.8869755792110207
logr_l2_a1_tst_score: 0.8873591989987485

logr_l2_a10_trn_score: 0.8869755792110207
logr_l2_a10_tst_score: 0.8873591989987485


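> _The KNN models score a perfect 1.0 on the training data but only about 0.86 to 0.89 on the test data. With `weights='distance'`, each training point is its own nearest neighbour at distance zero, so a perfect training score is expected and is not informative by itself._

> _The logistic regression models perform almost identically on training and test data._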
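A useful reference point for these scores is the accuracy of a classifier that always predicts the majority class. The arithmetic below uses only the test-set class counts printed earlier, and the variable names are illustrative.

```python
# Test-set class counts reported by y_test.value_counts() above.
n_right, n_left = 709, 90

# Accuracy of a classifier that always predicts the majority class (right-handed).
baseline_accuracy = n_right / (n_right + n_left)
print(baseline_accuracy)

# Test accuracy reported above for all four logistic regression models.
logreg_test_accuracy = 0.8873591989987485
print(abs(baseline_accuracy - logreg_test_accuracy) < 1e-12)
```

The match is consistent with the logistic regression models predicting "right-handed" for essentially every test respondent, although only a confusion matrix would confirm that.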

In [26]:
knn_3_train_score = knn_3.score(Z_train, y_train) 
knn_3_test_score = knn_3.score(Z_test, y_test) 

knn_5_train_score = knn_5.score(Z_train, y_train) 
knn_5_test_score = knn_5.score(Z_test, y_test)

knn_15_train_score = knn_15.score(Z_train, y_train) 
knn_15_test_score = knn_15.score(Z_test, y_test)

knn_25_train_score = knn_25.score(Z_train, y_train) 
knn_25_test_score = knn_25.score(Z_test, y_test)

logr_l1_a1_trn_score = logr_l1_a1.score(Z_train, y_train) 
logr_l1_a1_tst_score = logr_l1_a1.score(Z_test, y_test)

logr_l1_a10_trn_score = logr_l1_a10.score(Z_train, y_train) 
logr_l1_a10_tst_score = logr_l1_a10.score(Z_test, y_test)

logr_l2_a1_trn_score = logr_l2_a1.score(Z_train, y_train) 
logr_l2_a1_tst_score = logr_l2_a1.score(Z_test, y_test)

logr_l2_a10_trn_score = logr_l2_a10.score(Z_train, y_train) 
logr_l2_a10_tst_score = logr_l2_a10.score(Z_test, y_test)

In [27]:
knn_3_accuracy_diff = abs((knn_3_train_score - knn_3_test_score)/ knn_3_train_score)*100
knn_5_accuracy_diff = abs((knn_5_train_score - knn_5_test_score)/ knn_5_train_score)*100
knn_15_accuracy_diff = abs((knn_15_train_score - knn_15_test_score)/ knn_15_train_score)*100
knn_25_accuracy_diff = abs((knn_25_train_score - knn_25_test_score)/ knn_25_train_score)*100

logr_l1_a1_accuracy_diff = abs((logr_l1_a1_trn_score - logr_l1_a1_tst_score)/ logr_l1_a1_trn_score)*100
logr_l1_a10_accuracy_diff = abs((logr_l1_a10_trn_score - logr_l1_a10_tst_score)/ logr_l1_a10_trn_score)*100
logr_l2_a1_accuracy_diff = abs((logr_l2_a1_trn_score - logr_l2_a1_tst_score)/ logr_l2_a1_trn_score)*100
logr_l2_a10_accuracy_diff = abs((logr_l2_a10_trn_score - logr_l2_a10_tst_score)/ logr_l2_a10_trn_score)*100

In [28]:
print(' knn=3 '.center(38, '='))
print(f'accuracy perc diff: {knn_3_accuracy_diff} %')
print()
print(' knn=5 '.center(38, '='))
print(f'accuracy perc diff: {knn_5_accuracy_diff} %')
print()
print(' knn=15 '.center(38, '='))
print(f'accuracy perc diff: {knn_15_accuracy_diff} %')
print()
print(' knn=25 '.center(38, '='))
print(f'accuracy perc diff: {knn_25_accuracy_diff} %')
print()
print(' l1_a1 '.center(38, '='))
print(f'accuracy perc diff: {logr_l1_a1_accuracy_diff} %')
print()
print(' l1_a10 '.center(38, '='))
print(f'accuracy perc diff: {logr_l1_a10_accuracy_diff} %')
print()
print(' l2_a1 '.center(38, '='))
print(f'accuracy perc diff: {logr_l2_a1_accuracy_diff} %')
print()
print(' l2_a10 '.center(38, '='))
print(f'accuracy perc diff: {logr_l2_a10_accuracy_diff} %')

accuracy perc diff: 13.64205256570713 %

accuracy perc diff: 11.764705882352944 %

accuracy perc diff: 11.138923654568211 %

accuracy perc diff: 11.26408010012515 %

accuracy perc diff: 0.007949244954218821 %

accuracy perc diff: 0.04325032128495186 %

accuracy perc diff: 0.04325032128495186 %

accuracy perc diff: 0.04325032128495186 %


**Answer:**

|Model|Train score|Test score|Accuracy diff (%)|Comments|
|---|--:|--:|--:|---|
|KNN, k=3|1.0000|0.8636|13.64|Overfitted|
|KNN, k=5|1.0000|0.8824|11.76|Overfitted|
|KNN, k=15|1.0000|0.8886|11.14|Overfitted|
|KNN, k=25|1.0000|0.8874|11.26|Overfitted|
|LogReg L1, C=1|0.8873|0.8874|0.008|Neither|
|LogReg L1, C=10|0.8870|0.8874|0.043|Neither|
|LogReg L2, C=1|0.8870|0.8874|0.043|Neither|
|LogReg L2, C=10|0.8870|0.8874|0.043|Neither|

### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

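**Answer:** All of my KNN models ($k$ = 3, 5, 15, 25) show evidence of overfitting: each scores 1.0 on the training data but only about 0.86 to 0.89 on the test data. Part of that gap is built in, because with `weights='distance'` every training point is its own zero-distance neighbour and is therefore always classified correctly.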
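The perfect training scores are partly a side effect of `weights='distance'`. The sketch below fits $k$-NN to pure noise (random features and random labels, generated here purely for illustration) to show that distance weighting still reaches a training accuracy of 1.0. The names `X_noise`, `y_noise` and `train_scores` are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 5))       # pure-noise features
y_noise = rng.integers(0, 2, size=200)    # labels unrelated to the features

train_scores = {}
for weights in ("uniform", "distance"):
    model = KNeighborsClassifier(n_neighbors=15, weights=weights)
    model.fit(X_noise, y_noise)
    # Score on the same data the model was fitted on.
    train_scores[weights] = model.score(X_noise, y_noise)
print(train_scores)
```

For distance-weighted models, cross-validated or test scores are therefore the meaningful numbers to compare across values of $k$.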

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

**Answer:** As $k$ increases, bias increases, variance decreases.

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

__**Answer:**__

1. Increase the value of $k$
1. Reduce the number of features used in the model
1. Collect more training data
1. Use a different model

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

**Answer:** None of my logistic regression models shows evidence of overfitting: training and test accuracy differ by less than 0.05% in every case, and no model does better on the training set than on the test set.

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

**Answer:** As $C$ increases, bias decreases and variance increases. $C$ is the inverse of the regularisation strength ($C = 1/\alpha$), so a larger $C$ means a weaker penalty.

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

_**Instantiate Logistic regression models**_

In [29]:
logr_l1_a0_5 = LogisticRegression(solver='liblinear', penalty='l1', C=0.5) # LASSO, C = 0.5 (alpha = 1/C = 2)
logr_l1_a1000 = LogisticRegression(solver='liblinear', penalty='l1', C=1000.0) # LASSO, C = 1000 (alpha = 1/C = 0.001)
logr_l2_a0_1 = LogisticRegression(solver='liblinear', penalty='l2', C=0.1) # Ridge, C = 0.1 (alpha = 1/C = 10)
logr_l2_a3000 = LogisticRegression(solver='liblinear', penalty='l2', C=3000.0) # Ridge, C = 3000 (alpha = 1/C, about 0.00033)

_**Fit model**_

In [30]:
logr_l1_a0_5 = logr_l1_a0_5.fit(Z_train, y_train)
logr_l1_a1000 = logr_l1_a1000.fit(Z_train, y_train)
logr_l2_a0_1 = logr_l2_a0_1.fit(Z_train, y_train)
logr_l2_a3000 = logr_l2_a3000.fit(Z_train, y_train)

_**Models intercepts and coefficients**_

In [31]:
print(' l1_C=0.5 '.center(38, '='))
print(f'logr L1 C=0.5 Intercept: {logr_l1_a0_5.intercept_}')
print(f'logr L1 C=0.5 Coefficients: {logr_l1_a0_5.coef_}')
print()
print(' l1_C=1000 '.center(38, '='))
print(f'logr L1 C=1000 Intercept: {logr_l1_a1000.intercept_}')
print(f'logr L1 C=1000 Coefficients: {logr_l1_a1000.coef_}')
print()
print(' l2_C=0.1 '.center(38, '='))
print(f'logr L2 C=0.1 Intercept: {logr_l2_a0_1.intercept_}')
print(f'logr L2 C=0.1 Coefficients: {logr_l2_a0_1.coef_}')
print()
print(' l2_C=3000 '.center(38, '='))
print(f'logr L2 C=3000 Intercept: {logr_l2_a3000.intercept_}')
print(f'logr L2 C=3000 Coefficients: {logr_l2_a3000.coef_}')

logr L1 C=0.5 Intercept: [-2.28859476]
logr L1 C=0.5 Coefficients: [[-0.03700162 -0.00237809 -0.01237858 -0.02778704  0.11789611 -0.00986011
   0.03396422 -0.18015198 -0.029699    0.07291112  0.03024217 -0.04295736
  -0.05175165  0.          0.          0.1174592   0.0618089  -0.12400584
  -0.10492399 -0.02996719 -0.11454725 -0.11956082 -0.13063521 -0.0258022
   0.07299984  0.09134403  0.          0.0237365   0.08400723  0.02782061
   0.          0.          0.01697291  0.01368972  0.0553251  -0.04046584
  -0.03774368  0.03839922 -0.09508136 -0.09472259 -0.03414744 -0.09223491
  -0.16394666 -0.03414725 -0.00860332 -0.15408153  0.          0.
  -0.14691932  0.11129254 -0.03780446 -0.07493267 -0.01825939 -0.02429835
   0.03724478 -0.04403871  0.00125653 -0.12546313 -0.01442547  0.07844697
  -0.01750258 -0.02048096 -0.15727063  0.04222478  0.09839978 -0.08478701
  -0.06957419  0.04032282 -0.01520401 -0.05732542  0.         -0.04400878
  -0.01489304 -0.03915689 -0.01317405 -0.06327274 -0.1

**Answer:**
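* As I increase $C$ (i.e. weaken the regularisation) within `L1` and `L2`, the coefficients grow in magnitude; as I decrease $C$, they shrink towards zero. For example, with `L1` the `Q1` coefficient is about -0.045 at `C=1` but about -0.037 at `C=0.5`, and more of the printed coefficients are exactly zero at `C=0.5`.
* Regularisation counters overfitting by reducing the weight of each variable as the penalty increases. In this problem, however, the accuracy at `C=1` and `C=10` was nearly identical, which suggests the features carry little signal about handedness and there is little overfitting to correct.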
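The direction of the effect can be checked on synthetic data. The sketch below fits `L1` logistic regression at three values of `C` and records how many coefficients are non-zero and how large they are in total. The data come from `make_classification`, not the survey, and `summary` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with a few informative features (not the survey data).
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)
Z_demo = StandardScaler().fit_transform(X_demo)

summary = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(solver="liblinear", penalty="l1", C=C)
    model.fit(Z_demo, y_demo)
    coefs = model.coef_.ravel()
    # (number of non-zero coefficients, sum of absolute coefficient values)
    summary[C] = (int(np.sum(coefs != 0)), float(np.abs(coefs).sum()))
    print(f"C={C:>6}: non-zero={summary[C][0]:>2}, sum|coef|={summary[C][1]:.3f}")
```

Smaller `C` means a stronger penalty, so the total coefficient size is expected to be smallest at `C=0.01`.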

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

__**Answer:**__
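1. Reduce the number of features, or collect more data
1. Strengthen the regularisation (a smaller $C$, or an `L1` penalty to zero out weak features)
1. Choose $C$ by cross-validation rather than by hand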

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

**Answer:** I would use logistic regression, because it produces a coefficient for each feature, whereas $k$-NN does not. With `L1` regularisation some coefficients are shrunk to exactly zero (`L2` only shrinks them towards zero), which eliminates some variables from the equation and guides me towards the features that are more important in determining left-handedness.

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

In [32]:
logr_l1_a1.coef_

array([[-0.04477552, -0.00511478, -0.01479419, -0.03212176,  0.12256912,
        -0.01411235,  0.03930677, -0.1926312 , -0.03263088,  0.07884156,
         0.03532943, -0.04916223, -0.0562834 ,  0.00415261,  0.00344368,
         0.12264699,  0.06693344, -0.1275371 , -0.10912277, -0.03363829,
        -0.12656526, -0.12574779, -0.13408379, -0.03065176,  0.07730055,
         0.09799066,  0.00194269,  0.03071261,  0.08883812,  0.03472107,
         0.        ,  0.        ,  0.02053328,  0.01925867,  0.06234167,
        -0.04400752, -0.0422888 ,  0.04537566, -0.10031398, -0.10308679,
        -0.0395093 , -0.09609659, -0.17129444, -0.03744291, -0.01089062,
        -0.15583946,  0.        ,  0.00045057, -0.15179668,  0.11573305,
        -0.047046  , -0.07825271, -0.03121093, -0.03737941,  0.04192997,
        -0.06781636,  0.00715816, -0.13421053, -0.02737186,  0.08058074,
        -0.04039124, -0.03382575, -0.2088591 ,  0.04583077,  0.10035873,
        -0.12220558, -0.0980446 ,  0.04260372, -0.0

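**Answer:** The coefficient for `Q1` is about -0.045. Because the features were standardised and `hand = 2` (left) is the class being modelled, a one-standard-deviation increase in the `Q1` score multiplies the odds of being left-handed by about $e^{-0.045} \approx 0.956$, i.e. a decrease of roughly 4%, holding the other features constant. LASSO has also eliminated some variables from the equation by shrinking their coefficients to zero.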
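A logistic regression coefficient is easier to read once converted to an odds ratio. The sketch below does this for the first entry of the array printed above (`Q1`). It assumes scikit-learn's convention that binary coefficients refer to the larger class label, here `2` (left-handed), and that the features were standardised as in this lab.

```python
import math

# Q1 coefficient from logr_l1_a1.coef_ above (first entry; features were standardised).
q1_coef = -0.04477552

# exp(coefficient) is the multiplicative change in the odds of the modelled class
# for a one-standard-deviation increase in Q1, other features held fixed.
odds_ratio = math.exp(q1_coef)
percent_change = (odds_ratio - 1) * 100
print(f"odds ratio = {odds_ratio:.3f}, change in odds = {percent_change:.1f}%")
```

An odds ratio below 1 means the odds of being left-handed fall slightly as agreement with `Q1` rises.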

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

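**Answer:** I would pick the `Logistic Regression` model with `L1` regularisation and `C=1`, because it has the smallest gap between training and test accuracy. Its test accuracy (0.8874) is, however, the same as the share of right-handed respondents in the test set (709/799), so it does no better than always predicting right-handed.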

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

In [33]:
import numpy as np

In [34]:
# coef_ is a 2D numpy array, which the code below cannot accept
logr_l1_a1.coef_.shape

(1, 146)

In [None]:
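# pair each coefficient with its feature name and sort by coefficient
coef_df = pd.DataFrame({'features': X.columns, 'coef': logr_l1_a1.coef_.ravel()})
coef_df.sort_values(by='coef', ascending=False).head(10)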

In [35]:
# convert to 1D numpy array
flat_coef = logr_l1_a1.coef_.ravel()

In [36]:
# check conversion
flat_coef

array([-0.04477552, -0.00511478, -0.01479419, -0.03212176,  0.12256912,
       -0.01411235,  0.03930677, -0.1926312 , -0.03263088,  0.07884156,
        0.03532943, -0.04916223, -0.0562834 ,  0.00415261,  0.00344368,
        0.12264699,  0.06693344, -0.1275371 , -0.10912277, -0.03363829,
       -0.12656526, -0.12574779, -0.13408379, -0.03065176,  0.07730055,
        0.09799066,  0.00194269,  0.03071261,  0.08883812,  0.03472107,
        0.        ,  0.        ,  0.02053328,  0.01925867,  0.06234167,
       -0.04400752, -0.0422888 ,  0.04537566, -0.10031398, -0.10308679,
       -0.0395093 , -0.09609659, -0.17129444, -0.03744291, -0.01089062,
       -0.15583946,  0.        ,  0.00045057, -0.15179668,  0.11573305,
       -0.047046  , -0.07825271, -0.03121093, -0.03737941,  0.04192997,
       -0.06781636,  0.00715816, -0.13421053, -0.02737186,  0.08058074,
       -0.04039124, -0.03382575, -0.2088591 ,  0.04583077,  0.10035873,
       -0.12220558, -0.0980446 ,  0.04260372, -0.027443  , -0.08

In [65]:
# sort the coefficients to find which variables have the highest weights
np.flip(np.sort(flat_coef)) 

array([ 0.26280007,  0.17576027,  0.12323285,  0.12264699,  0.12256912,
        0.11573305,  0.11344575,  0.11057787,  0.10035873,  0.10009212,
        0.09799066,  0.08883812,  0.08508992,  0.08096429,  0.08058074,
        0.07927913,  0.07884156,  0.07730055,  0.07003314,  0.06693344,
        0.06447098,  0.06326155,  0.06234167,  0.06152254,  0.05741222,
        0.05077471,  0.04901168,  0.04583077,  0.04537566,  0.04260372,
        0.04197914,  0.04192997,  0.03930677,  0.03532943,  0.03472107,
        0.03071261,  0.02918085,  0.0215397 ,  0.02053328,  0.01925867,
        0.00715816,  0.00526954,  0.00415261,  0.00344368,  0.00194269,
        0.00045057,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        , -0.00511478, -0.00776869,
       -0.01089062, -0.01366224, -0.01411235, -0.01412217, -0.01479419,
       -0.01481438, -0.01504967, -0.01612593, -0.01839827, -0.01
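Sorting the coefficient values on their own loses track of which feature each value belongs to. One way to keep the two together, sketched here with a handful of coefficients copied from the outputs above, is to sort name and value pairs. `coef_by_feature` and `ranked` are illustrative names.

```python
# A handful of (feature, coefficient) pairs copied from the outputs in this lab.
coef_by_feature = {
    "Q1": -0.04477552,
    "Q5": 0.12256912,
    "Q8": -0.1926312,
    "Q16": 0.12264699,
    "Q26": 0.09799066,
}

# Sort by coefficient, largest first, keeping the feature name attached.
ranked = sorted(coef_by_feature.items(), key=lambda item: item[1], reverse=True)
for name, coef in ranked:
    print(f"{name:>4}: {coef:+.4f}")
```

With the full model, the same idea applies to `zip(X.columns, flat_coef)`.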

> The higher the score for a question, the stronger the agreement with the statement

> For `hand` (target), `1`=`Right`, `2`=`Left`

_Only positive coefficients signify a direct relationship between an affirmed personality trait and left-handedness, because of how the scoring is done (`5` = `agree` and `2` = `left`)._ <br><br>
_After reviewing the top positive coefficients in descending order, and excluding those that relate to demographics (since they are not personality traits), the strongest relationships are found in `Q5`, `Q16`, and `Q26`._ <br><br>
_The following questions are therefore based on the strongest relationships. The questions in Step 1 were also revised to match them._

**Questions:**

> Is a person who daydreams more likely to be left-handed? <br>
    _Q5_ 
    
> Is a person who put on fake concerts as a child more likely to be left-handed? <br>
    _Q16_
    
> Is a person who jumps when excited more likely to be left-handed? <br>
    _Q26_

In [71]:
# confirm which questions the coef array indices correspond to
print(np.array(X.columns)[4])
print(np.array(X.columns)[15])
print(np.array(X.columns)[25])

Q5
Q16
Q26


In [72]:
print(' Daydream '.center(50, '='))
print(f' Q5. I have day dreamed about saving someone from a burning building: {flat_coef[4]}')
print()
print(' Fake concerts '.center(50, '='))
print(f' Q16. When I was a child, I put on fake concerts and plays with my friends: {flat_coef[15]}')
print()
print(' Jump in excitement '.center(50, '='))
print(f' Q26. I jump up and down in excitement sometimes: {flat_coef[25]}')

 Q5. I have day dreamed about saving someone from a burning building: 0.1225691222017266

 Q16. When I was a child, I put on fake concerts and plays with my friends: 0.12264698918927933

 Q26. I jump up and down in excitement sometimes: 0.09799066437033165


__**Answer:**__ 
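1. According to the model, respondents who agree more strongly that they have daydreamed about saving someone from a burning building are slightly more likely to be left-handed (a coefficient of about 0.12 per standard deviation of `Q5`). This is a weak association, not evidence that the trait causes left-handedness.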

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
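As a starting point for the sensitivity and specificity bonus, the sketch below evaluates a hypothetical classifier that always predicts right-handed, using the test-set class counts from this lab (709 right, 90 left). The function name `sensitivity_specificity` is an assumption, and left-handed (`2`) is treated as the positive class.

```python
def sensitivity_specificity(y_true, y_pred, positive=2):
    """Return (sensitivity, specificity) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Test-set class counts from this lab: 709 right-handed (1), 90 left-handed (2).
y_true = [1] * 709 + [2] * 90
# Hypothetical classifier that always predicts right-handed.
y_pred = [1] * 799

sens, spec = sensitivity_specificity(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.4f}, sensitivity={sens:.1f}, specificity={spec:.1f}")
```

High accuracy with zero sensitivity is the reason unbalanced classes are a concern here: the metric rewards ignoring the class the professor cares about.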