# ROC/AUC and Hyperparameter Tuning

## The ROC Curve
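- By default, logistic regression uses a classification threshold of 0.5, which means any sample whose predicted probability is above the threshold is classified as 1.
- Each threshold gives one (False Positive rate, True Positive rate) pair. The set of points we get when trying all possible thresholds is called the **ROC Curve**.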
![roc_img](roc.PNG)
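
To make the threshold idea concrete, here is a minimal sketch that computes one point of the curve by hand. The labels, probabilities, and the helper name `tpr_fpr` are made up for illustration and are not part of the original notes. It counts a probability at or above the threshold as class 1.

```python
import numpy as np

# Hypothetical toy data: true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])


def tpr_fpr(y_true, y_prob, threshold):
    """Return (TPR, FPR) when probabilities >= threshold are classified as 1."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)


# Each threshold gives one (FPR, TPR) point on the ROC curve
for threshold in (0.0, 0.5, 1.0):
    tpr, fpr = tpr_fpr(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

On this toy data, a threshold of 0.5 gives TPR = 2/3 and FPR = 1/3, while the extreme thresholds 0.0 and 1.0 land on the corners (1, 1) and (0, 0) of the plot.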


In [None]:
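## Plotting the ROC Curve ##
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Assumes logreg is an already fitted classifier and X_test, y_test are the test split
# Column 1 of predict_proba holds the predicted probability of the positive class
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# fpr: False Positive rate
# tpr: True Positive rate

plt.plot([0, 1], [0, 1], "k--")  # Diagonal: a classifier that guesses at random
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()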

## Area Under the ROC Curve: AUC

- The ideal model sits in the upper-left corner of the plot (all TPs and no FPs),
     - So the greater the AUC, the better the model: 1.0 is a perfect ranking and 0.5 is no better than random guessing.
- We can calculate AUC by **importing roc_auc_score from sklearn.metrics** and passing it the predicted probabilities, not the predicted labels.
- **OR** we can calculate it in cross-validation by passing **scoring="roc_auc"**, for example to cross_val_score.
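
Both options can be sketched as follows. The synthetic dataset from `make_classification`, the split sizes, and the variable names `test_auc` and `cv_auc` are assumptions made only so the example runs on its own; they are not from the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Assumption: synthetic binary classification data stands in for the real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Option 1: AUC on the test set, computed from predicted probabilities
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_pred_prob)
print("Test AUC:", test_auc)

# Option 2: AUC in 5-fold cross-validation via scoring="roc_auc"
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring="roc_auc")
print("CV AUC per fold:", cv_auc)
print("Mean CV AUC:", cv_auc.mean())
```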

## Hyperparameter Tuning

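- On many occasions we need to choose parameters before fitting, such as k in KNN or alpha in Ridge/Lasso Regression.
- Sadly, hyperparameters like these can't be learned by the model while it is being fit.
- **A common approach is to try many values and compare how well each one performs.**
- It's essential to use cross-validation for that comparison, so the choice doesn't depend on a single train/test split.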

### Grid Search Cross Validation
- Assume we have two parameters Alpha and C:
    - Possible values for C = [0.1, 0.2, 0.3, 0.4]
    - Possible values for Alpha = [0.1, 0.2, 0.3]
    - If we make a grid with one point per combination, we get 4 × 3 = 12 different combinations to try.
- We perform K-fold Cross Validation for each point in the grid.
- After that, we choose the combination with the best score.
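
The grid above can be enumerated with scikit-learn's `ParameterGrid` helper. This is only a sketch of the counting step: the lowercase key `alpha` and the variable name `grid` are illustrative choices, and no model is fitted here.

```python
from sklearn.model_selection import ParameterGrid

# The two candidate lists from the notes above
grid = ParameterGrid({
    "C": [0.1, 0.2, 0.3, 0.4],
    "alpha": [0.1, 0.2, 0.3],
})

print(len(grid))  # 4 values of C x 3 values of alpha = 12 combinations

# Each grid point is a dict with one value per parameter
for params in grid:
    print(params)
```

With 5-fold cross-validation, evaluating every point of this grid means 12 × 5 = 60 model fits, which is why large grids get expensive quickly.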

In [None]:
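import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Candidate values for k: 1, 2, ..., 49
param_grid = {"n_neighbors": np.arange(1, 50)}

knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv=5)  # 5-fold CV for every candidate

# After getting a GridSearchCV object, fit the data
# (assumes X and y are the already loaded features and target)
knn_cv.fit(X, y)
print(knn_cv.best_params_)  # The best combination
print(knn_cv.best_score_)   # Mean cross-validated score of the best combination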

#### Hold-out Set Reasoning
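- We need to see how well the model performs on never-before-seen data.
- So, using ALL the data for CV is not ideal: the best CV score comes from the same data that was used to pick the hyperparameters, so it tends to be optimistic.
- Split the data into a training set and a hold-out set at the beginning.
- Then perform GridSearchCV on the training set only.
- Choose the best hyperparameters and evaluate them on the hold-out set.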
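
The steps above can be sketched as follows. The synthetic data, the 25% hold-out size, the range of k, and names such as `X_holdout` and `holdout_score` are assumptions chosen for illustration, not values from the original notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic data stands in for the real dataset
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Step 1: split off the hold-out set before any tuning
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 2: grid search with 5-fold CV on the training set only
param_grid = {"n_neighbors": np.arange(1, 30)}
knn_cv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_cv.fit(X_train, y_train)

# Step 3: evaluate the refit best model on the untouched hold-out set
holdout_score = knn_cv.score(X_holdout, y_holdout)
print("Best params:", knn_cv.best_params_)
print("Best CV accuracy:", knn_cv.best_score_)
print("Hold-out accuracy:", holdout_score)
```

Because `GridSearchCV` refits the best configuration on the whole training set by default, calling `score` on the search object evaluates that refit model.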