#### How Good is Your Model?

accuracy (the fraction of correctly classified samples) can be used to measure model performance in classification, but it isn't always a useful metric. For example, if 99% of emails are real and 1% are spam, a model that lets every email through would be 99% accurate. That classifier is doing a horrible job of detecting spam: it never predicts spam at all, so it completely fails at its original purpose

**class imbalance** is when one class is much more frequent than the other. This is pretty common (like the spam example) and calls for more nuanced metrics to assess the performance of the model
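
the spam example can be checked in a few lines. This is a minimal sketch with a made-up inbox of 1000 emails (the counts are an assumption for illustration), using plain Python so the arithmetic is visible:

```python
# Hypothetical inbox: 990 real emails (label 0) and 10 spam emails (label 1).
y_true = [0] * 990 + [1] * 10

# A "classifier" that lets every email through, i.e. it always predicts 0.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# How many of the actual spam emails were caught.
spam_total = sum(y_true)
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)                  # 0.99
print(spam_caught / spam_total)  # 0.0
```

the accuracy looks great while the fraction of spam caught is zero, which is why accuracy alone is misleading here.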

you can build a 2x2 confusion matrix for a binary classifier, with the predicted labels across the top and the actual labels down the side. The four cells count the true positives (tp), false positives (fp), false negatives (fn), and true negatives (tn). The class of interest is usually called the positive class, so if you're looking for spam then spam would be the positive class
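
as a small sketch of that layout, here is scikit-learn's `confusion_matrix` on made-up labels (the eight labels below are assumptions for illustration). Rows are the actual labels and columns are the predicted labels, and because the labels are sorted, the negative class (0) comes first:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = spam (positive class), 0 = real email (negative class).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Rows = actual labels, columns = predicted labels, both ordered [0, 1],
# so the matrix reads [[tn, fp], [fn, tp]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)  # 4 1 1 2
```

note that this puts the true negatives in the top-left corner, which differs from diagrams that draw the positive class first.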

you can calculate metrics from the confusion matrix:
- accuracy is (tp+tn)/(tp+tn+fp+fn), which would be the sum of the diagonal divided by the total sum of the matrix
- precision (PPV, positive predictive value) is tp/(tp+fp), the number of correctly labeled spam emails divided by the total number of emails classified as spam. High precision means not many real emails were incorrectly predicted to be spam (few false positives)
- recall (sensitivity, hit rate, true positive rate) is tp/(tp+fn). High recall means that the classifier predicted most positive (spam) emails correctly
- F1 score is 2*((precision*recall)/(precision+recall)), the harmonic mean of precision and recall
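
the formulas above can be worked through by hand. A minimal sketch, assuming some made-up confusion-matrix counts for a spam classifier:

```python
# Hypothetical counts (assumptions for illustration only).
tp, fp, fn, tn = 30, 10, 20, 940

accuracy = (tp + tn) / (tp + tn + fp + fn)  # diagonal / total
precision = tp / (tp + fp)                  # of the emails flagged as spam, how many were spam
recall = tp / (tp + fn)                     # of the actual spam emails, how many were flagged
f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.970 precision=0.750 recall=0.600 f1=0.667
```

accuracy is 97% even though the classifier misses 40% of the spam, which recall makes obvious.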

by analyzing the confusion matrix and the classification report (scikit-learn's per-class summary of precision, recall, and F1), you can get a much better understanding of your classifier's performance

In [None]:
# confusion matrix in scikit-learn
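from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X (features) and y (labels) are assumed to be defined earlier in the notebook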

# instantiate the classifier
knn = KNeighborsClassifier(n_neighbors=8)
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# fit the training data
knn.fit(X_train, y_train)
# predict the labels of the test set 
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))

print(classification_report(y_test, y_pred))
# for these metrics in scikit-learn, the first argument is the true labels and the second is the predictions

#### Logistic Regression and the ROC Curve

this chapter adds another model to your classification arsenal: logistic regression
even though it seems weird, logistic regression is used in classification problems, not regression problems. Log reg works for binary classification (when you have 2 possible labels for the target variable) and produces a linear decision boundary (a line that divides the yes (1) from the no (0))
when given one feature, log reg will output a probability (p) that the target variable is 1. If p is greater than 0.5 the data is labeled 1, and if p is less than 0.5 the data is labeled 0

by default, the logistic regression threshold is 0.5. The idea of a threshold isn't exclusive to logistic regression; it can also be applied to other classifiers that output probabilities, like KNN
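
to make the threshold idea concrete, here is a minimal sketch with made-up probabilities. `predict_with_threshold` is a hypothetical helper written for this note, not a scikit-learn function; in practice the probabilities would come from something like `.predict_proba(X_test)[:,1]`:

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class (label 1).
probs = np.array([0.05, 0.30, 0.50, 0.65, 0.90])

def predict_with_threshold(probs, threshold=0.5):
    """Label a sample 1 when its probability is greater than the threshold."""
    return (probs > threshold).astype(int)

default_labels = predict_with_threshold(probs)      # threshold 0.5
strict_labels = predict_with_threshold(probs, 0.8)  # fewer samples labeled 1
lenient_labels = predict_with_threshold(probs, 0.2) # more samples labeled 1

print(default_labels)  # [0 0 0 1 1]
print(strict_labels)   # [0 0 0 0 1]
print(lenient_labels)  # [0 1 1 1 1]
```

raising the threshold makes the classifier more conservative about predicting 1, and lowering it does the opposite.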

what happens as you vary the threshold? what happens to the true positive and false positive rates as you vary the threshold? 
- when the threshold is p=0 the model predicts 1 for all the data, so the true positive rate and the false positive rate are both equal to 1
- when the threshold is p=1 the model predicts 0 for all the data, so both the true and false positive rates are 0
- when the threshold is varied between these two extremes then you'll get a series of different false positive and true positive rates 
- the set of points you get when trying all possible thresholds is called the **ROC curve** (receiver operating characteristic curve) 
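
the effect of the threshold on both rates can be computed by hand. This is a sketch on six made-up samples; `tpr_fpr` is a hypothetical helper, and it labels a sample 1 when its probability is greater than the threshold (`roc_curve` in scikit-learn does this sweep for you over every useful threshold):

```python
import numpy as np

# Made-up true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 0, 1, 1])
probs = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.90])

def tpr_fpr(y_true, probs, threshold):
    """Return (true positive rate, false positive rate) at one threshold."""
    y_pred = (probs > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Threshold 0 predicts 1 for everything, threshold 1 predicts 0 for everything.
for threshold in [0.0, 0.5, 1.0]:
    tpr, fpr = tpr_fpr(y_true, probs, threshold)
    print(f"threshold={threshold:.1f} tpr={tpr:.2f} fpr={fpr:.2f}")
# threshold=0.0 tpr=1.00 fpr=1.00
# threshold=0.5 tpr=0.67 fpr=0.33
# threshold=1.0 tpr=0.00 fpr=0.00
```

each (fpr, tpr) pair is one point on the ROC curve.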

classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models
most classifiers in scikit-learn have a .predict_proba() method which returns the probability of a given sample being in a particular class

when looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds
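
a sketch of how the precision-recall values could be generated, using scikit-learn's `precision_recall_curve` on made-up labels and probabilities (both arrays are assumptions for illustration):

```python
from sklearn.metrics import precision_recall_curve

# Made-up true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 0, 1, 1]
probs = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]

# As with roc_curve: true labels first, predicted probabilities second.
precision, recall, thresholds = precision_recall_curve(y_true, probs)

# precision and recall have one more entry than thresholds: the final point
# (precision 1, recall 0) is appended and has no threshold of its own.
for p, r in zip(precision, recall):
    print(f"precision={p:.2f} recall={r:.2f}")
```

the two arrays can then be plotted against each other, for example with `plt.plot(recall, precision)`.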

In [None]:
# logistic regression in scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# instantiate the classifier
logreg = LogisticRegression()
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# fit the training data
logreg.fit(X_train, y_train)
# predict on the test set
y_pred = logreg.predict(X_test)

In [None]:
# plot the ROC curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# predict_proba returns an array with 2 columns, one per class with the probabilities for that target value
# column 1 holds the probability of the positive class (label 1)
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# roc_curve takes the actual labels as the first argument and the predicted probabilities as the second
# unpack the results into 3 variables: the false positive rate, the true positive rate, and the thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# plot the FPR and TPR using pyplot's plot function
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.legend()
plt.show()
# to compute the ROC curve you don't want the thresholded predictions on the test set, you want the probabilities
# (of the observation being labeled 1) that the log reg model outputs before a threshold is applied,
# which is why the predict_proba method is used above

In [None]:
# exercise example, train a logistic regression model and see if it outperforms knn 
# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
# exercise example, once you have built a logistic regression model you can evaluate its performance by plotting an ROC curve
# Import necessary modules
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

#### Area Under the ROC Curve

now that you have the ROC curve, you'll want to extract a metric of interest

the larger the area under the ROC curve (AUC), the better the model is! 
if you had a binary classifier that was just making random guesses, it would be correct about 50% of the time
its ROC curve would be a diagonal line, with the true positive rate and false positive rate always equal
the area under that ROC curve would be 0.5, so if the AUC is greater than 0.5 you know the model is better than random guessing
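
the 0.5 baseline can be checked with a quick simulation. This is a sketch on synthetic data (the labels, the noise level, and the seed are all assumptions): one set of scores is pure noise, the other is the label plus noise, so it carries some real signal:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 10_000

# Synthetic binary labels, roughly balanced.
y_true = rng.integers(0, 2, size=n)

# Random guesser: scores have nothing to do with the labels.
random_scores = rng.random(n)

# Informative scorer: the label itself plus Gaussian noise.
informative_scores = y_true + rng.normal(0.0, 1.0, size=n)

auc_random = roc_auc_score(y_true, random_scores)
auc_informative = roc_auc_score(y_true, informative_scores)

print(f"random guesser AUC:     {auc_random:.3f}")
print(f"informative scorer AUC: {auc_informative:.3f}")
```

the random guesser should land close to 0.5 and the informative scorer clearly above it.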

In [None]:
# compute AUC in scikit-learn
from sklearn.metrics import roc_auc_score

# instantiate the classifier
logreg = LogisticRegression()
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# fit the training data
logreg.fit(X_train, y_train)
# compute the predicted probabilities of the positive class, then pass the true labels and those probabilities to roc_auc_score
y_pred_prob = logreg.predict_proba(X_test)[:,1]
roc_auc_score(y_test, y_pred_prob)

In [None]:
# another way to compute AUC, compute AUC using cross validation
from sklearn.model_selection import cross_val_score

# pass the estimator, the features, the target
cv_scores = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')
print(cv_scores)

#### Hyperparameter Tuning

now that you have a feel for how well your models are performing, let's supercharge them!!!!!!

when fitting a linear regression, what you're really doing is choosing parameters for the model that fit the data the best
- we also saw that we had to choose a value for alpha in ridge and lasso regression before fitting it 
- before fitting and predicting with k-nearest neighbors we needed to choose n_neighbors
- in logistic regression the regularization parameter is C, the inverse of the regularization strength. A large C can lead to an overfit model and a small C can lead to an underfit model
- decision trees have lots of parameters that can be tuned, like max_features, max_depth, and min_samples_leaf (so they're a great use case for RandomizedSearchCV)

**hyperparameters** are parameters that need to be specified before fitting a model; they can't be explicitly learned by fitting the model

how to choose the correct hyperparameter, **hyperparameter tuning**:
- try a bunch of different hyperparameter values
- fit all of them separately
- see how well each performs
- choose the best one

when comparing different values of a hyperparameter you have to use cross-validation, because using a train/test split alone risks overfitting the hyperparameter to the test set. Even when tuning the hyperparameters with cross-validation, you still want to have split off a test set beforehand, so that you can report how well the model is expected to perform on a dataset that it has never seen before
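
the tuning recipe can be written out as a plain loop before handing it to a search class. This is a minimal sketch, assuming a synthetic dataset from `make_classification` and an arbitrary list of candidate `n_neighbors` values; the names `X_demo`, `y_demo`, and `candidate_ks` are made up for this note:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the notebook's X and y).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Split off a test set first; tuning below only touches the training part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Try several values, score each with 5-fold cross-validation.
candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_tr, y_tr, cv=5)
    mean_scores[k] = scores.mean()

# Choose the value with the best mean cross-validation accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best n_neighbors:", best_k)
```

the held-back `X_te` and `y_te` are never used during tuning, so they stay available for a final, honest evaluation.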

**grid search cross-validation**
- choose a grid of possible values that you want to try for the hyperparameter(s), for example if you had two hyperparameters (C and alpha) then you'd have a grid with C values of 0.1-0.5 and alpha values of 0.1 to 0.4
- then perform k-fold cross-validation for each point in the grid (each choice of hyperparameter or combo of hyperparameters)
- choose the point that performed the best (the highest score); its values are the hyperparameters you want
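
to see how quickly a grid grows, here is a sketch that enumerates the example grid with `itertools.product`. The step size of 0.1 is an assumption, since the note only gives the ranges:

```python
from itertools import product

# Example grid from the note, assuming steps of 0.1.
C_values = [0.1, 0.2, 0.3, 0.4, 0.5]
alpha_values = [0.1, 0.2, 0.3, 0.4]

# Every combination of the two hyperparameters is one point in the grid.
grid = [{"C": c, "alpha": a} for c, a in product(C_values, alpha_values)]

n_folds = 5
total_fits = len(grid) * n_folds  # one model fit per grid point per fold

print(len(grid))   # 20
print(total_fits)  # 100
```

every extra hyperparameter multiplies the number of grid points, which is where the computational cost comes from.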

GridSearchCV can be computationally expensive, so an alternative is RandomizedSearchCV, which samples a specified number of hyperparameter settings instead of trying out everything. When it samples from the same fixed grid, RandomizedSearchCV can't find a better setting than an exhaustive GridSearchCV, but its value comes from saving computation time
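
a sketch of what "sampling a specified number of settings" looks like, using scikit-learn's `ParameterSampler` with a made-up search space for a decision tree (the parameter choices and `n_iter=5` are assumptions for illustration):

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Lists are sampled uniformly; scipy distributions are sampled via their rvs method.
param_dist = {
    "max_depth": [3, None],
    "min_samples_leaf": randint(1, 9),  # integers from 1 to 8
}

# Draw 5 random settings instead of enumerating every combination.
sampled = list(ParameterSampler(param_dist, n_iter=5, random_state=0))

for setting in sampled:
    print(setting)
```

each sampled dictionary is one candidate that would then be scored with cross-validation, so the cost is controlled by `n_iter` rather than by the size of the search space.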



In [None]:
# grid search cv in scikit-learn
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# specify the parameters as a dictionary with the hyperparameter names as keys (n_neighbors in KNN, or alpha in lasso regression)
param_grid = {'n_neighbors': np.arange(1, 50)}
# the values in the grid dictionary are lists of the values you want to try for each hyperparameter
# if you specify multiple parameters then all combinations will be tried

# instantiate the classifier
knn = KNeighborsClassifier()

# pass the model to grid search, the grid you want to tune over, and the number of folds you want
knn_cv = GridSearchCV(knn, param_grid, cv=5)
# this returns a GridSearchCV object; fitting it to the data is what actually performs the grid search
knn_cv.fit(X, y)

# get your results
print(knn_cv.best_params_)
print(knn_cv.best_score_)
# these are the hyperparameters that performed the best and their mean cv score across the folds

In [None]:
# exercise example, find the C parameter for logistic regression
# Import necessary modules
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

In [None]:
# exercise example, randomized search with parameters for a decision tree
# Import necessary modules
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

#### Hold-out Set for Final Evaluation

after doing all this tuning, you want to figure out how well your model performs on never-before-seen data: use the model to predict on some labeled data, compare the predictions to the actual labels, and compute the scoring function

the issue arises when you've used all of your data in cross-validation: judging the performance on any of that data might not provide an accurate picture of how the model will perform on unseen data

solution: 
- split data into training and hold-out sets at the very beginning
- perform grid search cv on the training set to tune the model's hyperparameters 
- choose the best hyperparameters and evaluate on the hold-out set, which tells you how the model is expected to perform on data it has never seen before
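
the three steps can be sketched end to end as follows. The synthetic dataset, the small grid of C values, and the names `X_demo`, `y_demo`, `X_hold`, and `y_hold` are all assumptions made for this illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, not the notebook's X and y).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# 1. split into training and hold-out sets at the very beginning
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# 2. grid search with cross-validation on the training set only
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
search.fit(X_tr, y_tr)

# 3. evaluate the refit best model on the hold-out set
cv_accuracy = search.best_score_               # mean CV accuracy on training folds
holdout_accuracy = search.score(X_hold, y_hold)  # accuracy on unseen data

print(search.best_params_)
print(f"cv accuracy: {cv_accuracy:.3f}")
print(f"hold-out accuracy: {holdout_accuracy:.3f}")
```

the hold-out accuracy is the number to report, since the cross-validation score was already used to pick the hyperparameters.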

another penalty in addition to lasso and ridge is **elastic net** regularization. The penalty term is a linear combination of the L1 and L2 penalties (a*L1+b*L2). In scikit-learn the mix is controlled by the "l1_ratio" parameter: 1 is a pure L1 penalty, 0 is a pure L2 penalty, and anything in between is a combination of L1 and L2
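
a sketch of how the two penalties combine, following the parametrization in scikit-learn's `ElasticNet` objective, where the penalty is `alpha * l1_ratio * L1 + 0.5 * alpha * (1 - l1_ratio) * L2`. `elastic_net_penalty` is a hypothetical helper written for this note and the weight vector is made up:

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Penalty term only (no data-fit term) for a weight vector w."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))  # L1 norm: sum of absolute values
    l2 = np.sum(w ** 2)     # squared L2 norm: sum of squares
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2

w = [3.0, -4.0]  # made-up coefficients: L1 = 7, squared L2 = 25

print(elastic_net_penalty(w, l1_ratio=1.0))  # 7.0   (pure L1, like lasso)
print(elastic_net_penalty(w, l1_ratio=0.0))  # 12.5  (pure L2, like ridge)
print(elastic_net_penalty(w, l1_ratio=0.5))  # 9.75  (a mix of both)
```

here a is `alpha * l1_ratio` and b is `0.5 * alpha * (1 - l1_ratio)` in the a*L1+b*L2 form.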

the next section will be about preprocessing techniques and how to piece together all the different stages of the machine learning process into a pipeline

In [None]:
# exercise example, hold-out set for classification
# Import necessary modules
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

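# Instantiate the logistic regression classifier: logreg
# the 'liblinear' solver supports both the 'l1' and 'l2' penalties in the grid (the default 'lbfgs' solver does not support 'l1')
logreg = LogisticRegression(solver='liblinear')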

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))

In [None]:
# exercise example, hold-out set for regression
# Import necessary modules
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

# Fit it to the training data
gm_cv.fit(X_train, y_train)

# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))