# Phase 3 Code Challenge

This assessment is designed to test your understanding of Module 3 material. It covers:

* Gradient Descent
* Logistic Regression
* Classification Metrics
* Decision Trees

_Read the instructions carefully_. You will be asked both to write code and to answer short answer questions.

## Code Tests

We have provided some code tests for you to run to check that your work meets the item specifications. Passing these tests does not necessarily mean that you have gotten the item correct - there are additional hidden tests. However, if any of the tests do not pass, this tells you that your code is incorrect and needs changes to meet the specification. To determine what the issue is, read the comments in the code test cells, the error message you receive, and the item instructions.

## Short Answer Questions 

For the short answer questions...

* _Use your own words_. It is OK to refer to outside resources when crafting your response, but _do not copy text from another source_.

* _Communicate clearly_. We are not grading your writing skills, but you can only receive full credit if your teacher is able to fully understand your response. 

* _Be concise_. You should be able to answer most short answer questions in a sentence or two. Writing unnecessarily long answers increases the risk of you being unclear or saying something incorrect.

In [2]:
# Run this cell without changes to import the necessary libraries

from numbers import Number

---
## Part 1: Gradient Descent [Suggested Time: 20 min]
---
In this part, you will describe how gradient descent works to calculate a parameter estimate. Below is an image of a best fit line from a linear regression model using TV advertising spending to predict product sales.

![best fit line](https://raw.githubusercontent.com/learn-co-curriculum/dsc-cc-images/main/phase_3/best_fit_line.png)

This best fit line can be described by the equation $y = mx + b$. Below is the RSS cost curve associated with the slope parameter $m$:

![cost curve](https://raw.githubusercontent.com/learn-co-curriculum/dsc-cc-images/main/phase_3/cost_curve.png)

where RSS is the residual sum of squares: $RSS = \sum_{i=1}^n(y_i - (mx_i + b))^2$ 

### 1.1) Short Answer: Explain how the RSS curve above could be used to find an optimal value for the slope parameter $m$. 

Your answer should provide a one sentence summary, not every step of the process.

In [3]:
# Your answer here

print('''The optimal slope parameter m is the value at which the RSS curve reaches its minimum,
   which is also the point where the derivative of RSS with respect to m equals 0.'''
     )



The optimal slope parameter m is the value at which the RSS curve reaches its minimum,
   which is also the point where the derivative of RSS with respect to m equals 0.
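
The idea of reading the optimal slope off the cost curve can be sketched numerically. The following is a minimal illustration on made-up data (the values of `x`, `y`, the fixed intercept `b`, and the grid of candidate slopes are all assumptions, not the advertising data from the plot): it evaluates RSS for many candidate slopes and picks the one with the lowest cost.

```python
import numpy as np

# Hypothetical data: roughly y = 0.05 * x + 7 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=50)
y = 0.05 * x + 7 + rng.normal(0, 1, size=50)

b = 7.0  # intercept held fixed so that RSS depends on m only


def rss(m):
    """Residual sum of squares for slope m with the fixed intercept b."""
    return np.sum((y - (m * x + b)) ** 2)


# Evaluate the cost curve on a grid of candidate slopes
m_grid = np.linspace(0.0, 0.1, 1001)
costs = np.array([rss(m) for m in m_grid])

# The optimal slope is the one at the bottom of the curve
best_m = m_grid[np.argmin(costs)]
print(f"Slope with the lowest RSS on the grid: {best_m:.4f}")
```

A grid search like this only works for one or two parameters; gradient descent reaches the same minimum without evaluating the whole curve.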


Below is a visualization showing the iterations of a gradient descent algorithm applied to the RSS curve. Each yellow marker represents an estimate, and the lines between markers represent the steps taken between estimates in each iteration. Numeric labels identify the iteration numbers.

![gradient descent](https://raw.githubusercontent.com/learn-co-curriculum/dsc-cc-images/main/phase_3/gd.png)

### 1.2) Short Answer: Explain why the distances between markers get smaller over successive iterations.

In [4]:
# Your answer here

print(
    '''
Each step is the learning rate multiplied by the partial derivative (gradient) of RSS at the current estimate.
As the estimates approach the minimum of the curve, the gradient shrinks toward 0,
   which leads to smaller distances between markers.

For example, the gradient at marker 2 is smaller in magnitude than the gradient at marker 1.
Therefore, the distance from 2 to 3 is smaller than the distance from 1 to 2.
''')


Each step is the learning rate multiplied by the partial derivative (gradient) of RSS at the current estimate.
As the estimates approach the minimum of the curve, the gradient shrinks toward 0,
   which leads to smaller distances between markers.

For example, the gradient at marker 2 is smaller in magnitude than the gradient at marker 1.
Therefore, the distance from 2 to 3 is smaller than the distance from 1 to 2.
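
The shrinking steps can be reproduced with a few lines of code. This is a sketch on hypothetical data (the feature `x`, target `y`, fixed intercept `b`, starting slope, and `learning_rate` are all assumed values): each update moves `m` by the learning rate times the derivative of RSS, and the size of every step is recorded.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)             # hypothetical, already scaled feature
y = 3.0 * x + rng.normal(0, 0.1, size=50)  # hypothetical target, true slope 3
b = 0.0                                    # intercept held fixed for simplicity


def rss_gradient(m):
    """Derivative of RSS = sum((y - (m*x + b))**2) with respect to m."""
    return -2 * np.sum(x * (y - (m * x + b)))


m = 0.0               # starting estimate
learning_rate = 0.01  # assumed value, small enough to avoid overshooting
step_sizes = []

for _ in range(10):
    step = learning_rate * rss_gradient(m)
    m -= step                    # move against the gradient
    step_sizes.append(abs(step))

# Each step is smaller than the previous one because the gradient shrinks
print([round(s, 4) for s in step_sizes])
```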



### 1.3) Short Answer: What would be the effect of decreasing the learning rate for this application of gradient descent?

In [5]:
# Your answer here

print(
    '''
Decreasing the learning rate would shrink every step, which increases the number of iterations needed to reach the minimum of the curve.
The estimate is less likely to overshoot the minimum, so it can be more precise, but convergence is slower.
''')



Decreasing the learning rate would shrink every step, which increases the number of iterations needed to reach the minimum of the curve.
The estimate is less likely to overshoot the minimum, so it can be more precise, but convergence is slower.
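
To make the trade-off concrete, the sketch below counts how many iterations gradient descent needs for two learning rates. The data, the helper `iterations_to_converge`, the tolerance, and both learning rates are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)             # hypothetical feature
y = 3.0 * x + rng.normal(0, 0.1, size=50)  # hypothetical target, true slope 3


def iterations_to_converge(learning_rate, tol=1e-6, max_iter=10_000):
    """Run gradient descent on the slope until a step is smaller than tol."""
    m = 0.0
    for i in range(1, max_iter + 1):
        gradient = -2 * np.sum(x * (y - m * x))  # d(RSS)/dm with intercept 0
        step = learning_rate * gradient
        m -= step
        if abs(step) < tol:
            return i, m
    return max_iter, m


results = {lr: iterations_to_converge(lr) for lr in (0.01, 0.001)}
for lr, (n_iter, m_hat) in results.items():
    print(f"learning_rate={lr}: {n_iter} iterations, m = {m_hat:.4f}")
```

Both runs end near the same slope, but the smaller learning rate needs many more iterations to get there.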



---
## Part 2: Logistic Regression [Suggested Time: 15 min]
---
In this part, you will answer general questions about logistic regression.

### 2.1) Short Answer: Provide one reason why logistic regression is better than linear regression for modeling a binary target/outcome.

In [10]:
# Your answer here

print('''
A linear regression fits a straight line to the data, which treats y as a continuous, unbounded value.
Since a binary outcome can only fall into two categories (for example, 0 and 1), the linear model will
produce predictions (for example, -0.3 or 1.4) that are neither a valid class nor a valid probability.

A logistic regression, however, passes the linear combination through the sigmoid function,
so it estimates the probability (always between 0 and 1) that y is 1 at a given x-value.

This makes logistic regression much more interpretable for modeling binary outcomes.
''')



A linear regression fits a straight line to the data, which treats y as a continuous, unbounded value.
Since a binary outcome can only fall into two categories (for example, 0 and 1), the linear model will
produce predictions (for example, -0.3 or 1.4) that are neither a valid class nor a valid probability.

A logistic regression, however, passes the linear combination through the sigmoid function,
so it estimates the probability (always between 0 and 1) that y is 1 at a given x-value.

This makes logistic regression much more interpretable for modeling binary outcomes.
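
The difference in output range can be checked directly. The following sketch fits both models to a hypothetical one-feature dataset (the data and the three evaluation points in `X_new` are assumptions) and compares their predictions at extreme feature values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical one-feature dataset with a binary target
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 1))
y = (X[:, 0] + rng.normal(0, 0.5, size=200) > 0).astype(int)

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Evaluate both models at extreme and central feature values
X_new = np.array([[-4.0], [0.0], [4.0]])
linear_pred = linear.predict(X_new)
logistic_prob = logistic.predict_proba(X_new)[:, 1]  # P(y = 1)

print("Linear regression predictions:", linear_pred.round(2))
print("Logistic regression P(y = 1): ", logistic_prob.round(2))
```

The linear model's predictions leave the [0, 1] interval at the extremes, while the logistic model's probabilities stay inside it.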



### 2.2) Short Answer: Compare logistic regression to another classification model of your choice (e.g. KNN, Decision Tree, etc.). What is one advantage and one disadvantage logistic regression has when compared with the other model?

In [16]:
# Your answer here

print(
'''
Comparing Logistic Regression to Decision Tree:

Advantage:

* Logistic Regression is much more interpretable at the variable level, since it calculates a coefficient
for each independent variable that quantifies how much a one-unit change in that variable changes
the log-odds of the dependent variable.


Disadvantage:

* Logistic Regression coefficients become unstable and hard to interpret when independent variables
are highly multicollinear, whereas the predictions of a Decision Tree are largely unaffected.

''')



Comparing Logistic Regression to Decision Tree:

Advantage:

* Logistic Regression is much more interpretable at the variable level, since it calculates a coefficient
for each independent variable that quantifies how much a one-unit change in that variable changes
the log-odds of the dependent variable.


Disadvantage:

* Logistic Regression coefficients become unstable and hard to interpret when independent variables
are highly multicollinear, whereas the predictions of a Decision Tree are largely unaffected.
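
As a rough illustration of the interpretability point, the sketch below fits both models to a synthetic dataset (generated with `make_classification`; the dataset and all parameter values are assumptions) and prints what each model exposes about its features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-feature classification problem
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

logreg = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each coefficient is the change in log-odds per one-unit increase in a feature;
# exponentiating gives an odds ratio, which has a direction (above or below 1)
odds_ratios = np.exp(logreg.coef_[0])
print("Odds ratios:", odds_ratios.round(2))

# Tree importances are non-negative and sum to 1, so they rank features
# but do not say in which direction a feature moves the prediction
print("Tree importances:", tree.feature_importances_.round(2))
```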




---
## Part 3: Classification Metrics [Suggested Time: 20 min]
---
In this part, you will make sense of classification metrics produced by various classifiers.

The confusion matrix below represents the predictions generated by a classification model on a small testing dataset.

![cnf matrix](https://raw.githubusercontent.com/learn-co-curriculum/dsc-cc-images/main/phase_3/cnf_matrix.png)

### 3.1) Create a numeric variable `precision` containing the precision of the classifier.

In [17]:
# CodeGrade step3.1
# Replace None with appropriate code

tp = 30
tn = 54
fp = 4
fn = 12



precision = tp / (tp + fp)

In [18]:
# This test confirms that you have created a numeric variable named precision

assert isinstance(precision, Number)

### 3.2) Create a numeric variable `f1score` containing the F-1 score of the classifier.

In [19]:
# CodeGrade step3.2
# Replace None with appropriate code

f1score = (2*tp)/(2*tp + fp + fn)

In [20]:
# This test confirms that you have created a numeric variable named f1score

assert isinstance(f1score, Number)
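
As a cross-check on the two formulas above, the F-1 score can also be computed as the harmonic mean of precision and recall. This sketch reuses the four counts from the confusion matrix; the variable names `recall`, `f1_from_pr`, `f1_from_counts`, and `accuracy` are introduced here for illustration.

```python
# Counts taken from the confusion matrix above
tp, tn, fp, fn = 30, 54, 4, 12

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

# Two equivalent ways to compute the F-1 score
f1_from_pr = 2 * precision * recall / (precision + recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f}, recall={recall:.3f}")
print(f"F-1 (harmonic mean)={f1_from_pr:.3f}, F-1 (counts)={f1_from_counts:.3f}")
print(f"accuracy={accuracy:.3f}")
```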

The ROC curves below were calculated for three different models applied to one dataset.

1. Only Age was used as a feature in the model
2. Only Estimated Salary was used as a feature in the model
3. All features were used in the model

![roc](https://raw.githubusercontent.com/learn-co-curriculum/dsc-cc-images/main/phase_3/many_roc.png)

### 3.3) Short Answer: Identify the best ROC curve in the above graph and explain why it is the best. 

In [21]:
# Your answer here

print(
'''
The best ROC curve achieves the highest true positive rate at any given false positive rate, so it sits closest
to the top-left corner and has the largest area under the curve (AUC closest to 1).

Therefore, the best ROC curve is for the model with All Features included.
''')



The best ROC curve achieves the highest true positive rate at any given false positive rate, so it sits closest
to the top-left corner and has the largest area under the curve (AUC closest to 1).

Therefore, the best ROC curve is for the model with All Features included.
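
Comparing models by AUC can be sketched as follows. The data here are synthetic (two hypothetical features that both influence the target), and the helper `auc_for` is introduced only for this illustration; it is not the data or code behind the plot above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on both features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=1000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def auc_for(columns):
    """Fit a logistic regression on the given columns and return the test AUC."""
    model = LogisticRegression().fit(X_train[:, columns], y_train)
    scores = model.predict_proba(X_test[:, columns])[:, 1]
    return roc_auc_score(y_test, scores)


auc_one = auc_for([0])     # model using only the first feature
auc_all = auc_for([0, 1])  # model using all features

print(f"AUC with one feature:  {auc_one:.3f}")
print(f"AUC with all features: {auc_all:.3f}")
```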



Run the following cells to load a sample dataset, run a classification model on it, and perform some EDA.

In [22]:
# Run this cell without changes

# Include relevant imports
import pickle, sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

network_df = pickle.load(open('sample_network_data.pkl', 'rb'))

# partition features and target
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

The classifier has an accuracy score of 0.956.


In [23]:
# Run this cell without changes

y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64

### 3.4) Short Answer: Explain how the distribution of `y` shown above could explain the high accuracy score of the classification model.

In [24]:
# Your answer here
y.value_counts(normalize=True)

print('''
The dataset is imbalanced, such that there are significantly more 0 values than 1 values.
Since 0-values make up 95.2% of all datapoints, a model that always predicts 0 would already reach about 95.2% accuracy.
Therefore, the model's accuracy of 95.6% is only slightly better than this baseline and does not show that it identifies the 1s well.
''')


The dataset is imbalanced, such that there are significantly more 0 values than 1 values.
Since 0-values make up 95.2% of all datapoints, a model that always predicts 0 would already reach about 95.2% accuracy.
Therefore, the model's accuracy of 95.6% is only slightly better than this baseline and does not show that it identifies the 1s well.
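
The baseline can be made explicit with a majority-class classifier. This sketch rebuilds a target with the class counts shown above; the placeholder feature frame `X_placeholder` is an assumption needed only because scikit-learn's `DummyClassifier` expects some `X`.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Rebuild a target with the class counts shown above (257 zeros, 13 ones)
y = pd.Series([0] * 257 + [1] * 13, name="Purchased")
X_placeholder = pd.DataFrame({"unused": range(len(y))})  # features are ignored

# A classifier that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y)
baseline_acc = baseline.score(X_placeholder, y)
baseline_recall = recall_score(y, baseline.predict(X_placeholder))

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Baseline recall for class 1: {baseline_recall:.3f}")
```

High accuracy combined with zero recall for class 1 is why accuracy alone is a poor metric on imbalanced data.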



### 3.5) Short Answer: What is one method you could use to improve your model to address the issue discovered in Question 3.4?

In [25]:
# Your answer here

print('''
As mentioned above, a major problem with the current model is the class imbalance between 0s and 1s.
The class imbalance can be addressed by using SMOTE, which oversamples the minority class (the 1s) by generating
synthetic examples. It should be applied to the training data only, so that the test set keeps its original distribution.

Oversampling the minority class is the better option compared to undersampling the majority class, given the already small total sample size (n = 270).
''')



As mentioned above, a major problem with the current model is the class imbalance between 0s and 1s.
The class imbalance can be addressed by using SMOTE, which oversamples the minority class (the 1s) by generating
synthetic examples. It should be applied to the training data only, so that the test set keeps its original distribution.

Oversampling the minority class is the better option compared to undersampling the majority class, given the already small total sample size (n = 270).
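
SMOTE itself lives in the separate `imbalanced-learn` package, so the sketch below uses a simpler stand-in: random oversampling with replacement via scikit-learn's `resample`. It is not SMOTE (no synthetic points are generated), and the training arrays are hypothetical, but it shows the same idea of balancing the training set only.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced training data: 190 zeros and 10 ones
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = np.array([0] * 190 + [1] * 10)

X_min = X_train[y_train == 1]  # minority class rows
X_maj = X_train[y_train == 0]  # majority class rows

# Draw minority rows with replacement until both classes are the same size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.array([0] * len(X_maj) + [1] * len(X_min_up))

print(np.bincount(y_balanced))  # class counts after oversampling
```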



---
## Part 4: Decision Trees [Suggested Time: 20 min]
---
In this part, you will use decision trees to fit a classification model to a wine dataset. The data contain the results of a chemical analysis of wines grown in one region in Italy using three different cultivars (grape types). There are thirteen features from the measurements taken, and the wines are classified by cultivar in the `target` variable.

In [26]:
# Run this cell without changes

# Relevant imports 
import pandas as pd 
import numpy as np 
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2

# Load the data 
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'

### 4.1) Use `train_test_split()` to evenly split `X` and `y` data between training sets (`X_train` and `y_train`) and test sets (`X_test` and `y_test`), with `random_state=1`.

Do not alter `X` or `y` before performing the split.

In [27]:
# CodeGrade step4.1
# Replace None with appropriate code

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

In [28]:
# These tests confirm that you have created DataFrames named X_train and X_test, and Series named y_train and y_test

assert type(X_train) == pd.DataFrame
assert type(X_test) == pd.DataFrame
assert type(y_train) == pd.Series
assert type(y_test) == pd.Series

# These tests confirm that you have split the data evenly between train and test sets

assert X_train.shape[0] == X_test.shape[0]
assert y_train.shape[0] == y_test.shape[0]

### 4.2) Create an untuned decision tree classifier `wine_dt` and fit it using `X_train` and `y_train`, with `random_state=1`. 

Use parameter defaults for your classifier. You must use the Scikit-learn DecisionTreeClassifier (docs [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html))

In [29]:
# CodeGrade step4.2
# Replace None with appropriate code

wine_dt = DecisionTreeClassifier(random_state=1)

wine_dt.fit(X_train, y_train)

DecisionTreeClassifier(random_state=1)

In [30]:
# This test confirms that you have created a DecisionTreeClassifier named wine_dt

assert type(wine_dt) == DecisionTreeClassifier

# This test confirms that you have set random_state to 1

assert wine_dt.get_params()['random_state'] == 1

# This test confirms that wine_dt has been fit

sklearn.utils.validation.check_is_fitted(wine_dt)

### 4.3) Create an array `y_pred` generated by using `wine_dt` to make predictions for the test data.

In [31]:
# CodeGrade step4.3
# Replace None with appropriate code

y_pred = wine_dt.predict(X_test)

In [32]:
# This test confirms that you have created an array-like object named y_pred

assert type(np.asarray(y_pred)) == np.ndarray

### 4.4) Create a numeric variable `wine_dt_acc` containing the accuracy score for your predictions. 

Hint: You can use the `sklearn.metrics` module.

In [34]:
# CodeGrade step4.4
# Replace None with appropriate code

wine_dt_acc = accuracy_score(y_test, y_pred)

In [35]:
# This test confirms that you have created a numeric variable named wine_dt_acc

assert isinstance(wine_dt_acc, Number)

### 4.5) Short Answer: Based on the accuracy score, does the model seem to be performing well or to have substantial performance issues? Explain your answer.

In [38]:
# Your answer here

# Proportion of the most common class, i.e. the accuracy of always predicting it
baseline_acc = y.value_counts(normalize=True).max()

print(
    f'''
    Based on the accuracy score of {wine_dt_acc:.3f}, the model seems to be performing well!
    The accuracy score is much higher than the baseline accuracy of always predicting the most common class
    (the largest class proportion in y is ~{baseline_acc:.1f}).

    However, to evaluate the model more thoroughly, it is important to look at which classes
    are being misclassified, using the precision and recall scores.
    '''
)



    Based on the accuracy score of 0.876, the model seems to be performing well!
    The accuracy score is much higher than the baseline accuracy of always predicting the most common class
    (the largest class proportion in y is ~0.4).

    However, to evaluate the model more thoroughly, it is important to look at which classes
    are being misclassified, using the precision and recall scores.
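
One way to follow up on the precision and recall point is sketched below. It repeats the split and the untuned tree from steps 4.1 to 4.3 so that it runs on its own, then prints a confusion matrix and a per-class report; the variable `cm` is introduced here for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data, split, and model settings as in steps 4.1 to 4.3
X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

wine_dt = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
y_pred = wine_dt.predict(X_test)

# Rows are true cultivars, columns are predicted cultivars
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F-1 score for each cultivar
print(classification_report(y_test, y_pred))
```

Off-diagonal entries in the matrix show which cultivars the tree confuses with each other, which a single accuracy score cannot reveal.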
    
