# Phase 3 Code Challenge Review

Agenda:
- Gradient Descent & Cost Function
- Logistic Regression
- Evaluation Metrics
- Decision Trees

## Part I. Gradient Descent & Cost Function
- What is a cost function? What are we trying to find?
- How do we use gradient descent to find the lowest point? How does the gradient change as we get closer to the bottom?
- What role does the learning rate play? How can the learning rate affect your result?


<p style='text-align:center;font-size:20px'>$ \theta_j := \theta_j - \alpha \cdot \frac{\partial J(\theta)}{\partial\theta_j} $</p>

> A cost function measures how far a model's predictions are from the actual values, for example the mean squared error of a regression line. We are systematically trying to find the parameters (the line) that give the lowest cost.

> We can use gradient descent to find the lowest point because it follows the slope (derivative) of the cost curve: at each step we move the parameter in the direction opposite the gradient. We are ultimately trying to reach the minimum of the cost curve, which is not necessarily 0. As we get closer to the bottom, the curve flattens and the gradient approaches 0, so the steps get smaller.

> A learning rate is a constant that we establish to help us determine the appropriate step size as we try to find the minimum loss. Having a learning rate that is too large can make us overshoot and miss the minimum, whereas having a learning rate that is too small can stall our efforts to get to the minimum.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error
import seaborn as sns
%matplotlib inline
x = [1, 1, 2, 3, 4, 3, 4, 6, 4]
y = [2, 1, 0.5, 1, 3, 3, 2, 5, 4]
fig, ax = plt.subplots(figsize=(8, 8))

ax.scatter(x,y)
plt.show()

In [None]:
beta_0 = 0
#beta_1 = [.25, .5, .75, .8, 1,]
beta_1 = np.arange(0.25,3,0.1)
fig, ax = plt.subplots(figsize=(10,6))
mses = []
for t in beta_1:
    line = beta_0 + (np.array(x)*t)
    mse = round(mean_squared_error(y, line),3)
    mses.append(mse)
    ax.plot(x, line, label=f'beta_1={t:.2f}, MSE={mse}')
ax.scatter(x,y)
plt.legend()
plt.show()

In [None]:
# Plot the Cost Curve
fig, ax = plt.subplots(figsize=(10,6))
ax.plot(beta_1, mses)
ax.set_title('Cost Curve')
ax.set_xlabel('beta 1')
ax.set_ylabel('MSE')
plt.show()
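
The cost curve above can also be descended numerically. The following is a minimal sketch, assuming the same toy data, a line through the origin (`beta_0 = 0`), and MSE as the cost; the helper names `mse_gradient` and `gradient_descent` and the two learning rates are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Same toy data as the cells above
x = np.array([1, 1, 2, 3, 4, 3, 4, 6, 4], dtype=float)
y = np.array([2, 1, 0.5, 1, 3, 3, 2, 5, 4], dtype=float)

def mse_gradient(beta_1, x, y):
    """Derivative of the MSE with respect to beta_1 for the line y_hat = beta_1 * x."""
    residuals = beta_1 * x - y
    return 2 * np.mean(residuals * x)

def gradient_descent(x, y, alpha, n_steps, start=0.0):
    """Repeat beta_1 := beta_1 - alpha * dJ/dbeta_1 (hypothetical helper)."""
    beta_1 = start
    for _ in range(n_steps):
        beta_1 = beta_1 - alpha * mse_gradient(beta_1, x, y)
    return beta_1

# A small learning rate converges; a large one overshoots further on every step
beta_small_step = gradient_descent(x, y, alpha=0.01, n_steps=200)
beta_large_step = gradient_descent(x, y, alpha=0.1, n_steps=20)

# With a single parameter the exact minimizer is sum(x*y) / sum(x**2)
beta_exact = np.sum(x * y) / np.sum(x ** 2)

print(f"alpha=0.01 -> beta_1 = {beta_small_step:.4f} (exact: {beta_exact:.4f})")
print(f"alpha=0.1  -> beta_1 = {beta_large_step:.4e} (diverging)")
```

With the smaller learning rate the estimate settles at the bottom of the cost curve, while the larger one bounces across the minimum with growing steps, which is the overshooting behaviour described in the answers above.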

#### For gradient descent, the questions are going to be mostly intuitive and written answers. You will need to be able to answer questions such as the 3 bullet points above.

## Part II. Logistic Regression 
- How does linear regression differ from logistic regression?
- Why is logistic regression better at modeling a binary outcome?
- What are some advantages and disadvantages of logistic regression?

> Linear Regression differs from Logistic Regression because the target variables are of different kinds. The target variable in a Linear Regression is continuous and the target variable in Logistic Regression is categorical. 

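> Logistic Regression is better at modeling a binary outcome because it passes the linear combination of features through the sigmoid function. The resulting S-shaped curve is bounded between 0 and 1, so its output can be read as a probability, whereas the straight line from Linear Regression can predict values below 0 or above 1.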

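> An advantage of Logistic Regression is that its coefficients are interpretable: when the features are on the same scale, the magnitude of a coefficient indicates feature importance. A downside is that it assumes a linear relationship between the features and the log-odds, so it struggles to capture complex relationships.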
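
To make the sigmoid and the coefficient interpretation concrete, here is a small sketch on synthetic data. The data, the names `X_demo`, `y_demo`, and `sigmoid`, and the default `LogisticRegression` settings are all assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data (assumed for illustration): feature 0 drives the label, feature 1 is pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# The predicted probability is the sigmoid of the linear combination of features
linear_part = X_demo @ model.coef_[0] + model.intercept_[0]
manual_probs = sigmoid(linear_part)
sklearn_probs = model.predict_proba(X_demo)[:, 1]

print("coefficients:", np.round(model.coef_[0], 2))
print("probabilities match:", np.allclose(manual_probs, sklearn_probs))
```

Because both features are on the same scale here, the much larger coefficient on feature 0 reflects that it is the one carrying the signal.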

## Part III. Evaluation Metrics 
- What are precision and recall?
- How do we evaluate a logistic regression model?
- What is the ROC curve and AUC?
- What is class imbalance and how do we deal with it?

> Precision is the number of true positives over the total number of predicted positives. This is a good measure to use when you are concerned with the number of False Positives. A good example of this is crime conviction. You don't want to falsely accuse someone of a crime.

> Recall is the number of true positives over the total number of actual positives. This is a good measure to use when you are concerned about the number of False Negatives. A good example of this is cancer detection. You don't want to tell someone they don't have cancer when they do.

> Besides Precision and Recall, you can also evaluate Logistic Regression with Accuracy and F1-score. Which one you use depends on the data and what you are trying to accomplish with it.

> A ROC curve plots the True Positive Rate against the False Positive Rate at every classification threshold. The threshold determines the predicted probability at which you label something positive rather than negative. Again, what you choose depends on your goal; in general, however, the best model is the one whose curve is closest to the top left corner, because it maximizes the True Positive Rate and minimizes the False Positive Rate.

> A class imbalance is when the classes you are trying to predict have very different numbers of observations. This can bias your model toward the majority class because of the lack of data on the minority side. An easy way to correct for this is oversampling the minority class, which creates more data for that side and evens out the two classes. Often a better way to do this is with SMOTE, which creates synthetic data points to balance the classes. This is often a better option because simply replicating existing rows can lead to overfitting on the duplicated points.

<img src = 'confusion_matrix.png' width = 300>

In [1]:
### calculate precision here
TP = 63
TP_FP = (63 + 15)
Precision = TP/TP_FP
print(Precision)

0.8076923076923077


In [2]:
### calculate recall here
TP_FN = (63 + 22)
Recall = TP/TP_FN
print(Recall)

0.7411764705882353


In [3]:
### calculate F1 score here
F1 = (2 * Precision * Recall) / (Precision + Recall)
print(F1)

0.7730061349693251
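
The hand calculations above can be cross-checked with scikit-learn. This sketch rebuilds label arrays from the counts used in the cells above (TP = 63, FP = 15, FN = 22); the true negative count is not given in the cells, so `TN = 100` is a hypothetical placeholder. Precision, recall, and F1 do not depend on it.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

TP, FP, FN = 63, 15, 22
TN = 100  # hypothetical value; it does not affect precision, recall, or F1

# Build one (actual, predicted) pair per observation in the confusion matrix
y_true = np.array([1] * TP + [0] * FP + [1] * FN + [0] * TN)
y_pred = np.array([1] * TP + [1] * FP + [0] * FN + [0] * TN)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(precision, recall, f1)
```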


**Which model below has the best performance based on the ROC curve and AUC? Why?**

<img src='roc_auc.png' width = 400>

Which model is best depends on the measure you care about. In general, the green model is the best: its curve is closest to the top left corner, maximizing the True Positive Rate while minimizing the False Positive Rate, which generally corresponds to the largest area under the curve.
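
As a rough illustration of comparing models by AUC, the sketch below scores two hypothetical models on four made-up observations. The arrays `scores_model_a` and `scores_model_b` are invented for this example and have nothing to do with the models in the image.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores_model_a = np.array([0.1, 0.4, 0.35, 0.8])  # one positive is ranked below a negative
scores_model_b = np.array([0.1, 0.2, 0.6, 0.8])   # every positive is ranked above every negative

auc_a = roc_auc_score(y_true, scores_model_a)
auc_b = roc_auc_score(y_true, scores_model_b)

# roc_curve returns the points that would be plotted: FPR on the x-axis, TPR on the y-axis
fpr_a, tpr_a, thresholds_a = roc_curve(y_true, scores_model_a)

print("AUC model A:", auc_a)  # 0.75
print("AUC model B:", auc_b)  # 1.0
```

Model B separates the classes perfectly, so its curve passes through the top left corner and its AUC is 1.0.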

### Class Imbalance
<img src = 'imbalanced.png' width = 300>

In [4]:
### what problem would it cause? 

The model would lean towards the majority class in an imbalanced dataset. It would have low predictive power for the target=1 class.

In [5]:
### How to remedy it?

There are many solutions. We can resample by undersampling the majority class or by oversampling (i.e. copying) the minority class, or we can use SMOTE. SMOTE creates new synthetic samples rather than copies, which can help avoid overfitting to duplicated rows.

#### Solution 1 - Resampling
<img src = 'resampling.png'>
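
A minimal sketch of random oversampling, assuming a toy DataFrame with a 90/10 class split. The column names `feature` and `target` are made up for the example, and `sklearn.utils.resample` is just one way to do this.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data (assumed): 90 rows of class 0, 10 rows of class 1
df = pd.DataFrame({
    "feature": range(100),
    "target": [0] * 90 + [1] * 10,
})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Sample the minority class with replacement until it matches the majority class
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

balanced = pd.concat([majority, minority_upsampled])
counts = balanced["target"].value_counts()
print(counts)
```

In practice, resampling is usually applied to the training split only, so that the test set keeps the original class distribution.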

#### Solution 2 - SMOTE
<img src = 'smote.png'>

#### Solution 3 - Tomek Links
<img src = 'tomek.png'>

## Part IV. Decision Trees
- Build trees with the sklearn machine learning framework

In [1]:
# import the dataset and set up predictors and target
import seaborn as sns
titanic = sns.load_dataset('titanic')


In [2]:
# define x and y 
y = titanic['survived']
X = titanic[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
     'adult_male']]

In [3]:
titanic.isna().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [4]:
# fill the age column's missing values with the mean
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())

In [5]:
titanic.isna().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [6]:
titanic.dtypes

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

In [7]:
categories = titanic.select_dtypes(['object', 'category', 'bool'])

In [8]:
import pandas as pd
titanic = pd.get_dummies(titanic, columns = categories.columns, drop_first=True)

In [9]:
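y = titanic['survived']
# drop the target from the features; the original `X = titanic[1:]` sliced off the
# first row instead (890 rows vs. 891 labels) and raised
# "ValueError: Found input variables with inconsistent numbers of samples: [890, 891]"
# the dummy column built from 'alive' duplicates the target, so it is dropped as well to avoid leakage
leak_cols = [c for c in titanic.columns if c.startswith('alive')]
X = titanic.drop(columns=['survived'] + leak_cols)

# Train test split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 2, test_size = .2)

scale = StandardScaler()

X_train_scale = pd.DataFrame(scale.fit_transform(X_train), columns = X_train.columns, index = X_train.index)

X_test_scale = pd.DataFrame(scale.transform(X_test), columns = X_test.columns, index = X_test.index)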

In [None]:
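# fit the tree
from sklearn.tree import DecisionTreeClassifier
# default hyperparameters and the name `tree` are assumptions; this cell was left empty
tree = DecisionTreeClassifier(random_state=2)
tree.fit(X_train_scale, y_train)

In [None]:
# test the tree 
pred = tree.predict(X_test_scale)

In [None]:
# generate prediction and output metric (use accuracy)
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print(acc)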

In [None]:
# how did our tree do? did it perform well?

## hint: think about baseline model
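
One way to think about the baseline is a model that always predicts the most frequent class. The sketch below uses made-up labels (`y_train_demo`, `y_test_demo`) rather than the Titanic split, and `DummyClassifier` is just one convenient way to get such a baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: class 0 is the majority in the training data
y_train_demo = np.array([0] * 6 + [1] * 4)
y_test_demo = np.array([0, 0, 0, 1, 1])

# The dummy model ignores the features, so placeholder columns are enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train_demo, y_train_demo)
baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))

print(baseline_acc)  # 3 of the 5 test labels are class 0, so 0.6
```

A tree is only doing well if its accuracy is clearly above this kind of baseline.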
