<a href="https://colab.research.google.com/github/siddharth-iyer1/460J-Labs/blob/main/DSL_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Lab: Lab 5

Submit:
1. A PDF of your notebook with solutions.
2. A link to your Colab notebook, or your .ipynb file if you are not working on Colab.

# Goals of this Lab

1. Random Forests
2. Boosting
3. Playing with Ensembling packages, including XGBoost and CatBoost
4. One more time: Revisiting CIFAR-10 and MNIST
5. Getting ready for Kaggle

We will soon open a Kaggle competition made for this class, in which you will participate on your own. This lab is an introduction to get us started, and also an excuse to work with the regularization and regression methods we have been discussing. You'll revisit some problems from earlier labs, this time using Random Forests and Boosting. In particular, take this opportunity to become familiar with some very useful boosting packages. I recommend not only the boosting implementations in scikit-learn, but also XGBoost, LightGBM, CatBoost, and possibly others. You will have to install these and get them running, then read their documentation to figure out how they work, what the hyperparameters are, and so on.

Also, the metric we will use in the Kaggle competition is AUC (area under the ROC curve). We will discuss this. In the meantime, you may want to understand how it works. One key thing to remember: to get a good AUC score, you need to submit soft scores (probabilities), not rounded values (i.e., not 0s and 1s).
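
To see why rounding hurts, here is a minimal sketch using scikit-learn's `roc_auc_score`. The labels and probabilities are made-up toy values, not competition data. AUC measures how well the scores rank positives above negatives, so collapsing scores to 0s and 1s creates ties and discards ranking information.

```python
from sklearn.metrics import roc_auc_score

# Toy binary labels and hypothetical predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 1, 0]
soft_scores = [0.10, 0.30, 0.35, 0.80, 0.45, 0.40]

# Rounding at 0.5 ties most examples together and loses their relative order.
hard_labels = [int(s >= 0.5) for s in soft_scores]

soft_auc = roc_auc_score(y_true, soft_scores)
hard_auc = roc_auc_score(y_true, hard_labels)

print(f"AUC with probabilities:  {soft_auc:.3f}")
print(f"AUC with rounded labels: {hard_auc:.3f}")
```

With a scikit-learn classifier, the soft scores for a binary problem would typically come from `predict_proba(X)[:, 1]` rather than `predict(X)`.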


## Problem 1: Revisiting Logistic Regression and MNIST

We have played with the handwriting recognition problem (the MNIST data set) using decision trees. We have also considered the same problem using multi-class Logistic Regression in a previous Lab. We revisit this one more time.

**Part 1**: Use Random Forests to try to get the best possible *test accuracy* on MNIST. This involves getting acquainted with how Random Forests work, understanding their parameters, and then using cross-validation to find the best settings. How well can you do? Use the accuracy metric, since this is what you used in the previous Lab; this lets you compare your Random Forest results with your results from L1- and L2-regularized Logistic Regression.

What are the hyperparameters of your best model?
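
One possible way to organize such a search is sketched below with `RandomizedSearchCV`. To keep it fast, it assumes scikit-learn's small 8x8 `load_digits` set as a stand-in for MNIST, and the search ranges are illustrative guesses rather than recommended values.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Assumption: the small 8x8 digits set stands in for MNIST so the sketch runs quickly.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Illustrative search space; these ranges are assumptions, not tuned values.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2", 0.25],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 4],
}

# Sample 5 random configurations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.4f}")
# The test set is touched only once, after the search has picked a configuration.
print(f"Held-out test accuracy: {search.score(X_test, y_test):.4f}")
```

A randomized search evaluates a fixed number of configurations regardless of grid size, which helps when each fit is expensive. Passing `n_jobs=-1` would additionally run the fits in parallel.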

**Part 2**: Use Boosting to do the same. Take the time to understand how XGBoost works (and/or other available boosting packages; CatBoost is another favorite). Try your best to tune your hyperparameters. As added motivation: the winners and near-winners of the Kaggle competition are typically those best able to tune and cross-validate XGBoost. What are the hyperparameters of your best model?


In [2]:
# MNIST Data Download
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)

In [5]:
# MNIST Random Forest
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Global Hyperparameters
TEST_SIZE = 0.25
NUM_ESTIMATORS = 100

# Setup train test split
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=TEST_SIZE, random_state=42)

# Baseline Random Forest
clf = RandomForestClassifier(n_estimators=NUM_ESTIMATORS, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.966


In [6]:
# Estimate the baseline model's accuracy with 5-fold cross-validation
from sklearn.model_selection import cross_val_score
import numpy as np

# Cross Validation
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")
print(f"Cross Validation Scores: {scores}")
print(f"Mean: {np.mean(scores)}")
print(f"Standard Deviation: {np.std(scores)}")

Cross Validation Scores: [0.96704762 0.966      0.96714286 0.96771429 0.96361905]
Mean: 0.9663047619047619
Standard Deviation: 0.0014523731459587456


In [13]:
# Hyperparameter Optimization
from sklearn.model_selection import GridSearchCV

# Keep bootstrap=False: it was False in every earlier search over the other hyperparameters
param_grid = [
    {'n_estimators': [100, 150, 200], 'max_features': [6, 7, 8], 'bootstrap': [False]}
]

# 9 configurations x 5 folds = 45 fits on 52,500 training images; n_jobs=-1 spreads them across all cores
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring="accuracy", return_train_score=True, n_jobs=-1)

In [14]:
grid_search.fit(X_train, y_train)

KeyboardInterrupt: 

In [12]:
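from sklearn.metrics import accuracy_score, precision_score, recall_score

# Note: this cell (In [12]) ran before the grid was redefined in In [13], so the output
# below comes from an earlier search; n_estimators=30 is not in the grid above.
predictions = grid_search.predict(X_test)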

best_params = grid_search.best_params_
best_scores = grid_search.best_score_

print("Best parameters found by grid search:", best_params)
print("Best scores found by grid search:", best_scores)

accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

# Macro-averaged precision and recall: compute the metric for each digit class,
# then take the unweighted mean (this does not account for label imbalance).
precision = precision_score(y_test, predictions, average='macro')
print(f"Precision: {precision}")

recall = recall_score(y_test, predictions, average='macro')
print(f"Recall: {recall}")

Best parameters found by grid search: {'bootstrap': False, 'max_features': 8, 'n_estimators': 30}
Best scores found by grid search: 0.9600380952380952
Accuracy: 0.9597142857142857
Precision: 0.9592910573424861
Recall: 0.9591724331616659


## Problem 2: Revisiting Logistic Regression and CIFAR-10

Now that you have your pipeline set up, it should be easy to apply the above procedure to CIFAR-10. If you did something that takes significant computation time, keep in mind that CIFAR-10 is a few times larger.

**Part 1**: What is the best accuracy you can get on the test data, by tuning Random Forests? What are the hyperparameters of your best model?

**Part 2**: What is the best accuracy you can get on the test data, by tuning XGBoost? What are the hyperparameters of your best model?
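
Because CIFAR-10 images are 32x32 RGB, each one flattens to 3072 features, so a search that was already slow on MNIST gets noticeably slower. One possible way to keep tuning affordable is to search on a stratified subsample and then refit the chosen configuration on the full training set. The sketch below assumes random arrays as a stand-in for the real images, and the 20% subsample size is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: random uint8 arrays stand in for CIFAR-10 images (32x32 RGB);
# replace them with the real arrays once the data is loaded.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)
labels = np.repeat(np.arange(10), 100)  # 10 classes, 100 images each

# Tree ensembles expect a 2-D feature matrix: flatten each image to 3072 values.
X = images.reshape(len(images), -1)

# Draw a stratified 20% subsample for hyperparameter tuning.
X_tune, _, y_tune, _ = train_test_split(
    X, labels, train_size=0.2, stratify=labels, random_state=42
)

print(X.shape)              # (1000, 3072)
print(X_tune.shape)         # (200, 3072)
print(np.bincount(y_tune))  # 20 images per class
```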

## Problem 3: Revisiting Kaggle

This is a continuation of Problem 2 from Lab 3. You already did some first steps there, including making a Kaggle account, and trying ridge and lasso linear regression. You also tried stacking.

**Part 1**: (Nothing to hand in) Revisit Lab 3 and your answers there.

**Part 2**: Train a gradient boosting regression model, e.g., using XGBoost. What score can you get from a single XGBoost model? (You will need to tune its hyperparameters.)
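
A minimal sketch of tuning a single boosted regressor is given below. It assumes synthetic data from `make_regression` in place of the Lab 3 training set, and it uses scikit-learn's `GradientBoostingRegressor` rather than XGBoost so it runs without extra installs. The grid values and the RMSE scoring are illustrative assumptions; use the metric the competition actually specifies.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Assumption: synthetic data stands in for the Lab 3 Kaggle training set.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=10, noise=10.0, random_state=0
)

# Illustrative grid over the parameters that usually matter most for boosting.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(subsample=0.8, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
)
search.fit(X, y)

best_rmse = -search.best_score_
print("Best hyperparameters:", search.best_params_)
print(f"Cross-validated RMSE: {best_rmse:.2f}")
```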

**Part 3**: Do your best to get a more accurate model. Try feature engineering and stacking many models. You are allowed to use any public tool in Python. No non-Python tools are allowed.
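
For the stacking part, one option is scikit-learn's `StackingRegressor`, sketched below. It combines ridge, lasso, and gradient boosting base models and, as before, assumes synthetic `make_regression` data in place of the competition data. The base models, their settings, and the `RidgeCV` meta-learner are illustrative choices, not a prescribed recipe.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data stands in for the competition training set.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=10, noise=10.0, random_state=0
)

# Base models whose out-of-fold predictions become features for the meta-learner.
base_models = [
    ("ridge", Ridge(alpha=1.0)),
    ("lasso", Lasso(alpha=0.1)),
    ("gbr", GradientBoostingRegressor(random_state=0)),
]

# cv=5 controls the internal folds used to produce the out-of-fold predictions.
stack = StackingRegressor(estimators=base_models, final_estimator=RidgeCV(), cv=5)

# The outer cross-validation estimates how the whole stack generalizes.
scores = cross_val_score(stack, X, y, cv=3, scoring="neg_root_mean_squared_error")
stack_rmse = -scores.mean()
print(f"Stacked model cross-validated RMSE: {stack_rmse:.2f}")
```

Stacking tends to help most when the base models make different kinds of errors, which is why this sketch mixes linear models with a tree ensemble.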

**Part 4**: (Optional) Read the Kaggle forums, tutorials, and Kernels for this competition. This is an excellent way to learn. Mention in your report anything you liked in the forums, and any post or code you published yourself, especially if other Kagglers liked or used it afterwards.

**Other**: Be sure to read and learn the rules of Kaggle! No sharing of code or data outside the Kaggle forums. Every student should have their own individual Kaggle account; teams can be formed in the Kaggle submissions with your Lab partner. This is more important for live competitions, of course.

In the real in-class Kaggle competition (which will be next), you will be graded based on your public score (include that in your report) and also on the creativity of your solution. In your report, due after the competition closes, you will explain what worked and what did not work. Many creative things will not work, but you will get partial credit for developing them. You can start thinking about this now.