# Scikit-learn Tutorial
Author: Lauren Gliane

In this tutorial, we'll go over the following topics to build our own classification model:
1. Introduction
2. Install
3. Datasets
4. Data Preprocessing
5. Choosing a Model
6. Training the Model
7. Save/Load Models
8. Making Predictions and Evaluating the Model

## 1. Get to know SK-learn
### What is Scikit-learn?
Scikit-learn (AKA sk-learn) is an open-source library written in Python and is **one of the most used ML libraries today**. Sk-learn is built on top of NumPy, SciPy, and Matplotlib, and contains tons of ready-to-use algorithms to train, evaluate, and save models straight out of the box!

### Why learn Scikit-learn?
With sk-learn, we don’t need to implement complex algorithms built on a backbone of linear algebra and statistics. By using sk-learn’s ML algorithms and neural networks, we can build models faster while getting familiar with industry-standard tools.

## 2. Install
To use sk-learn in this tutorial, we'll need **scipy**, **numpy**, **pandas**, and **scikit-learn**. To install these, run `pip install scipy numpy pandas scikit-learn` in your terminal.

You can confirm sk-learn was installed correctly by importing something from the package.

In [None]:
from sklearn.tree import DecisionTreeClassifier
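
As an optional extra check, a minimal sketch like the one below prints the installed version. The version you see depends on your environment.

```python
import sklearn

# Any version string printed here means the package imported correctly
print("scikit-learn version:", sklearn.__version__)
```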

## 3. Datasets
When doing machine learning, we need data to train and evaluate our models, because without data, we can't learn patterns, validate performance, or generalize to unseen examples.

Scikit-learn helps with this through built-in datasets, dataset loading utilities, and data preparation functions (like [train_test_split](https://sklearn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).

### Features and Labels
In machine learning, we use data to train models to make predictions or decisions. This data is typically structured into two main parts:
1. **Features (Input)**
- Features are the input variables (also called independent variables) that the model uses to learn.
- Think of them as the measurable properties or characteristics that cause or correlate with your output.
- Features are either real numbers or non-numeric values that have been transformed into a numerical representation.

    **Example:** In a house price prediction dataset, features might include the number of bedrooms, square footage, and location.

2. **Labels (Output)**
- The label is the target variable (also called the dependent variable). Labels are what you want the model to predict.
- Think of it as the correct answer for each example in the dataset.

    **Example:** For house prices, the label is the actual price of the house.

### Step 1: Pick a Dataset
Scikit-learn comes with several built-in toy datasets. These datasets are small, well-structured, and easy to load, making them perfect for learning and experimenting.

Look through the following [toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) and select one that interests you for this tutorial!

**Note**: Don't choose the digits dataset. It is an image dataset that requires additional preprocessing this tutorial does not cover.

In [None]:
from sklearn.datasets import ______

dataset = ______
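
If you want to see what a filled-in cell might look like, here is a minimal sketch. It assumes the iris dataset purely as an illustration, and the `example_` variable names are hypothetical so they don't overwrite your own `dataset`.

```python
from sklearn.datasets import load_iris

# Assumption: iris is used only as an example; any non-image toy dataset works the same way
example_dataset = load_iris()

X_example = example_dataset.data      # features: one row per flower, one column per measurement
y_example = example_dataset.target    # labels: one integer class index per flower

print(X_example.shape)                # (150, 4)
print(y_example.shape)                # (150,)
print(example_dataset.target_names)   # ['setosa' 'versicolor' 'virginica']
```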

##### Viewing your data
Whenever you're working with new data, it's always a good idea to familiarize yourself with the feature names, your labels, and examine a bit of the data. Run the following to understand a bit about the dataset you've chosen.

In [None]:
X = dataset.data 
y = dataset.target 
  
feature_names = dataset.feature_names 
target_names = dataset.target_names 
  
print("Feature names:", feature_names) 
print("Target names:", target_names) 

print("\nType of X is:", type(X)) 

print("\nFirst 5 rows of X:\n", X[:5])


#### Want to use custom data?
If we’re using an external dataset, we can use the pandas library to load and manipulate the datasets with ease. If you haven’t yet, check out our [AI Club Pandas Tutorial](https://github.com/npragin/ai-club-project-management/blob/main/tutorials/pandas-tutorial.ipynb)!
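
As a rough sketch of that workflow, the example below reads a tiny CSV with pandas and separates features from labels. The CSV contents and the `price` label column are made up for illustration; with a real file you would pass its path to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical CSV contents; in practice you would pass a file path to pd.read_csv
csv_text = """bedrooms,square_feet,price
3,1500,300000
2,900,180000
4,2200,450000
"""

df = pd.read_csv(io.StringIO(csv_text))

# Assumption: the label lives in a column named "price"
X_custom = df.drop(columns=["price"])  # features
y_custom = df["price"]                 # labels

print(X_custom.shape)  # (3, 2)
print(y_custom.shape)  # (3,)
```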

## 4. Data Preprocessing
When working with real-world data, it often requires some preprocessing to ensure it's in the right format for training a model. This can include handling missing values, scaling features, and selecting the relevant features.

The first step is always to inspect your data to understand its structure and identify any potential issues.

### A. Missing Values
Missing values can occur for various reasons, such as data entry errors, sensor malfunctions, or respondents skipping questions in surveys. Handling missing values is crucial because they can hinder the performance of machine learning models. Sometimes, datasets will use NaN (Not a Number), null, or zero to represent missing values. It is important to read about your dataset to understand how missing values are represented.

#### Identify Missing Data
We'll use Pandas to find missing values in our dataset.

In [None]:
import pandas as pd

# Convert to DataFrame for easier handling
X = pd.DataFrame(X, columns=feature_names)
y = pd.Series(y)

print("Number of null entries per column:")
print(X.isnull().sum())
print()
print("Number of zero entries per column (zeros sometimes stand in for missing values):")
print((X == 0).sum())
print()
print("Number of non-null entries per column using .info():")
X.info()

#### Strategies for Handling Missing Values


**Option 1: Remove Missing Values**
- Drop rows (examples) with missing values, and don't predict unless all features are available
  - Useful when a few rows are missing, and you have a large dataset
  - `df.dropna(inplace=True)`

- Drop columns (features), and don't use that feature for training or prediction
  - Useful when a feature has many missing values and is not critical to the task
  - `df.drop(columns=['column_name'], inplace=True)`
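
One detail worth noting when dropping rows is that the labels must be dropped too, so features and labels stay aligned. Below is a minimal sketch on made-up toy data; the column names `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 each have one missing value
X_toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [10.0, 20.0, np.nan, 40.0]})
y_toy = pd.Series([0, 1, 0, 1])

# Drop rows with any missing feature, then keep only the labels for the surviving rows
X_rows_dropped = X_toy.dropna()
y_rows_dropped = y_toy.loc[X_rows_dropped.index]

# Alternatively, drop a whole column and keep every row
X_col_dropped = X_toy.drop(columns=["b"])

print(X_rows_dropped.index.tolist())  # [0, 3]
print(y_rows_dropped.tolist())        # [0, 1]
print(X_col_dropped.columns.tolist()) # ['a']
```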

**Option 2: Imputation (Fill Missing Values)**

[SK-Learn imputers](https://scikit-learn.org/stable/api/sklearn.impute.html)

- Fill missing values with a specific value, like the mean, median, or mode of the column
  - `SimpleImputer`
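- Model each feature that has missing values as a function of the other features, then use that model to fill in the missing values
  - `IterativeImputer` (experimental: run `from sklearn.experimental import enable_iterative_imputer` before importing it)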
- Use a KNN model to predict and fill in the missing values
  - `KNNImputer`
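
To make the simplest of these concrete, here is a small sketch of mean imputation with `SimpleImputer` on a made-up array. The other imputers follow the same `fit_transform` pattern.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in the first column
X_missing = np.array([[1.0, 2.0],
                      [np.nan, 3.0],
                      [7.0, 6.0]])

mean_imputer = SimpleImputer(strategy="mean")
X_filled = mean_imputer.fit_transform(X_missing)

print(X_filled)  # the NaN becomes 4.0, the mean of 1.0 and 7.0
```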

#### Step 2: Fill Missing Values (if needed)
If you found your dataset has missing values, choose one of the strategies above to handle them. If not, you can skip this step.

In [None]:
from sklearn.impute import ____
imputer = ____
X = imputer.fit_transform(X)

### B. Feature Scaling

For many machine learning models, especially those that rely on distance calculations (like [KNN](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification) or [SVM](https://scikit-learn.org/stable/modules/svm.html#classification)), it's important to scale your features so they are on a similar scale. These models work by calculating distances between data points, and if one feature has a much larger range of values than others, it can dominate the distance calculations and lead to suboptimal performance.

Other models, like [logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) and [neural networks](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#neural-networks-supervised), also benefit from feature scaling as it can lead to faster optimization during training.

There are several common methods for feature scaling:

- **Standardization**
  - Transforms each feature to have a mean of 0 and a standard deviation of 1
  - Especially useful for algorithms that assume a Gaussian or Normal distribution of the data
  - Uses the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
- **Normalization**
  - Scales each feature to a range between 0 and 1
  - Useful when you want to ensure all features contribute equally to distance calculations
  - Uses the [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
- **Outlier-Robust Methods**
  - If your data is prone to outliers, consider using techniques like [`RobustScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) that are less sensitive to outliers

You can find more scalers, imputers, and other preprocessing techniques in scikit-learn's [`preprocessing`](https://scikit-learn.org/stable/api/sklearn.preprocessing.html#module-sklearn.preprocessing) module.
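
The difference between standardization and normalization is easiest to see on a tiny example. The sketch below uses made-up values where the second feature has a much larger range than the first.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: the second column has a much larger range than the first
X_unscaled = np.array([[1.0, 100.0],
                       [2.0, 300.0],
                       [3.0, 500.0]])

X_standardized = StandardScaler().fit_transform(X_unscaled)
X_normalized = MinMaxScaler().fit_transform(X_unscaled)

print(X_standardized.mean(axis=0))  # approximately [0, 0]
print(X_standardized.std(axis=0))   # approximately [1, 1]
print(X_normalized)                 # each column now runs from 0 to 1
```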

#### Step 3: Scale Features (if needed)
If your data is not already normalized or standardized, choose a scaling strategy and apply it below. Otherwise, you can skip this step.


In [None]:
from sklearn.preprocessing import ___

scaler = ___
X = scaler.fit_transform(X)

### C. Split the Data
To train a model and evaluate it fairly, we split the dataset into a training set and a testing set.

- *Training set:* teaches our model to recognize patterns in the data
- *Testing set:* checks our model’s performance on new, never seen before data

We will use the [train_test_split()](https://sklearn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from the `sklearn.model_selection` module to do this.

##### Deciding on a split
80% training and 20% testing data is the most common split for larger datasets. 
Since most of sk-learn's toy datasets are small, we’ll use 60% for training and 40% for testing to ensure we can be confident about our evaluation results. 

To do this, find the parameter that controls the train size or test size (you can set either or both) by taking a look at the docs to see how to pass it in. By setting `random_state=1`, we ensure the split is the same on every run, which makes our results reproducible.

**Result:**
We will have four subsets of the data after splitting.

- **X_train and y_train:** feature and target values for training

- **X_test and y_test:** feature and target values for testing
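
To see the four subsets and the effect of `random_state`, here is a small sketch on made-up data. The 25% test size is an arbitrary choice for this toy array, not the split used in this tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 examples with 2 features each
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Two calls with the same random_state produce the same split
first = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)
second = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)           # (6, 2) (2, 2)
print(np.array_equal(X_te, second[1]))  # True
```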

#### Step 4: Split the Data
Read through the docs on [train_test_split()](https://sklearn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and fill in the parameter to set the test set size to 40%. Then, run the following code to split your data into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, ___) # Fill in parameter(s) here

### Other Common Steps in Data Preparation with Scikit-learn tools
For this tutorial, we only cover loading data, handling missing values, scaling, and splitting. The table below lists other common preprocessing steps and the sk-learn tools used to complete them.


| Step                                   | What It Does                                                    | Example Tools in Scikit-learn         |
| -------------------------------------- | --------------------------------------------------------------- | ------------------------------------- |
| **Encode Categorical Variables**    | Convert text labels into numbers.                               | `OneHotEncoder`, `LabelEncoder`       |
| **Feature Selection / Engineering** | Choose the features that maximize performance.             | `SelectKBest`, `SequentialFeatureSelector` |

## 5. Choose a Model

### Where to Find Models
Choosing the right model depends on your data, problem type (classification, regression), and performance needs.

Scikit-learn includes a wide range of built-in models for classification and regression. You can browse all available models [here](https://scikit-learn.org/stable/supervised_learning.html).

Scikit-learn also provides this [helpful flowchart to pick models](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html). It walks you through choosing based on task type, data size, and data type. However, not all models are included in the flowchart, so, if you're curious, be sure to explore the full list of models!

#### Step 5: Pick a Model
Using the list or flowchart above, choose a model that you think will work well with your dataset. You may have to check your dataset's documentation to remember if your task is classification or regression.

In [None]:
from sklearn.___ import ___

model = ___

## 6. Train the Model
The objective of your model is to minimize error when predicting against new, never-before-seen data. First, we must train the model on labeled data.

### Fitting Models on Training Data
Fitting involves training the model on the training data using `.fit(X_train, y_train)`. It learns the mapping from features (X) to target (y). scikit-learn provides a consistent interface across all models, so once you learn how to fit one model, you know how to fit all models.
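
The consistent interface can be sketched as follows. The two models and the iris dataset are arbitrary choices for illustration; the point is that the `fit`, `predict`, and `score` calls are identical for both.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Two unrelated model types, chosen only to show the shared interface
candidate_models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]

fitted_scores = {}
for candidate in candidate_models:
    candidate.fit(X_iris, y_iris)                # same call for every estimator
    predictions = candidate.predict(X_iris[:3])  # same call for every estimator
    fitted_scores[type(candidate).__name__] = candidate.score(X_iris, y_iris)

print(fitted_scores)
```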

#### Step 6: Fit the Model
Run the code below to train your model on the training data! Because of scikit-learn's consistent interface, the code below will work regardless of the model you chose in Step 5.

In [None]:
model.fit(X_train, y_train)

### Training vs Validation Performance
Training performance refers to how well your model performs on the data it was trained on, while validation performance refers to how well your model performs on new, unseen data. Monitoring both is crucial to ensure your model generalizes well.

Recall, we should **never** touch the test set until the very end, after we've finalized our model. The test set is only for evaluating the final model's performance. The validation set is used to tune hyperparameters and make decisions about the model.

#### Step 7: Evaluate Initial Performance
Run the cell below to evaluate your model's performance on both the training and validation sets! The score returned by `.score()` depends on your model. If you are working on a classification problem, it is accuracy. If you are working on a regression problem, it is the coefficient of determination (R²), not mean squared error.

In [None]:
# Split the training data into training and validation sets, we can't touch the test set until the very end!
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, random_state=1, test_size=0.2)

# Retrain the model because it was trained on the data including the validation set before
model.fit(X_train_split, y_train_split)

print("Train Score:", model.score(X_train_split, y_train_split))
print("Validation Score:", model.score(X_val, y_val))

### Watch out for Underfitting and Overfitting:
- **Underfitting**
  - Low accuracy on both the train set and the test/validation set
  - Consider using a more complex model (adding layers to a neural network, or allowing deeper trees in a random forest).
- **Overfitting**
  - High accuracy on train set, but low accuracy on test/validation set
  - Consider simplifying your model, gathering more data, or using regularization techniques.
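
One way to get a feel for these two failure modes is to compare a very constrained model with an unconstrained one. The sketch below assumes the iris dataset and a decision tree purely for illustration; the exact scores depend on the data and split, so none are shown.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_fit, X_check, y_fit, y_check = train_test_split(X_iris, y_iris, test_size=0.3, random_state=1)

results = {}
for depth in (1, None):  # max_depth=None lets the tree keep growing to fit the training data
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_fit, y_fit)
    results[depth] = (tree.score(X_fit, y_fit), tree.score(X_check, y_check))
    print("max_depth =", depth, "| train:", results[depth][0], "| validation:", results[depth][1])
```

A depth-1 tree can only make a single split, so low scores on both sets would point to underfitting. A large gap between train and validation scores for the unrestricted tree would point to overfitting.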

### Cross-validation

Cross-validation is another technique to evaluate how well your model generalizes to unseen data. Instead of relying on a single train/validation split, cross-validation:

1. Splits data into `K` subsets, or "folds" (e.g., 5 or 10)

2. Trains the model on `K-1` folds

3. Uses the remaining fold as validation data to calculate performance

4. Repeats `K` times, so each fold is used as validation data once, and averages the result
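
The four steps above can be sketched by hand as shown below. This is an illustrative loop, not how you would normally write it; the KNN model and iris dataset are arbitrary choices, and `cross_val_score` performs the same procedure in one call.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)
folds = KFold(n_splits=5, shuffle=True, random_state=0)    # step 1: split into K folds

manual_scores = []
for train_idx, val_idx in folds.split(X_iris):             # step 4: repeat for each fold
    fold_model = clone(knn)                                # fresh, unfitted copy per fold
    fold_model.fit(X_iris[train_idx], y_iris[train_idx])   # step 2: train on K-1 folds
    manual_scores.append(fold_model.score(X_iris[val_idx], y_iris[val_idx]))  # step 3: validate

library_scores = cross_val_score(knn, X_iris, y_iris, cv=folds)

print("Manual mean:", np.mean(manual_scores))              # step 4: average the K results
print("cross_val_score mean:", library_scores.mean())
```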

**Types of Cross-Validation**

| Type              | When to Use                                |
| ----------------- | ------------------------------------------ |
| [`KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)           | General-purpose                            |
| [`StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) | Ensures each fold has the same class distribution as the whole dataset |
| [`TimeSeriesSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html#sklearn.model_selection.TimeSeriesSplit) | For time-series data (preserves order)     |
| Many more! | See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) for a full list and comprehensive guide. |

#### Step 8: Evaluate Initial Performance Using Cross-Validation
Look through the list of [splitters](https://scikit-learn.org/stable/api/sklearn.model_selection.html#splitters) scikit-learn provides and choose one that fits your data and task!

Once you've selected one and run the code below, compare the results to your previous evaluation, where we did not use cross-validation.

Which method told you your model performs better? Which method do you trust more?

In [None]:
from sklearn.model_selection import cross_val_score, ___

cv = ___

cv_scores = cross_val_score(model, X_train, y_train, cv=cv)

print("Cross-Validation Scores:", cv_scores)
print("Mean Cross-Validation Score:", cv_scores.mean())

### Hyperparameters

#### Parameters vs Hyperparameters
Hyperparameters are parameters **set by the engineer** that define the model itself. These are the knobs you will play with to get the best performance you can!

Model parameters are learned from the training data, and are **set by the underlying algorithm**.

#### Why Are Hyperparameters Important?
Hyperparameters are important to ML since they:

- Affect model complexity (ex: tree depth)
- Determine how fast or how well a model learns (ex: learning rate)
- Influence overfitting vs underfitting
- Impact training time and computational cost

#### What Hyperparameters Can I Tune?

The hyperparameters available to you depend on the model you've chosen. The hyperparameters are defined when you initialize your model.

For example, some of the hyperparameters available to the [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) are `n_estimators`, `max_depth`, and `min_samples_split`. Some of these are easy to understand, but some of these require reading the documentation.
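
As a small sketch of what "defined when you initialize your model" means, the example below sets those three hyperparameters and reads them back with `get_params()`. The specific values are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are passed as keyword arguments when the model is created
forest = RandomForestClassifier(n_estimators=50, max_depth=3, min_samples_split=4, random_state=0)

# get_params() lists every hyperparameter, including ones left at their defaults
params = forest.get_params()
print(params["n_estimators"], params["max_depth"], params["min_samples_split"])  # 50 3 4
```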

#### Hyperparameter Types
1. **Model Complexity Hyperparameters**
- Examples: max_depth, C, n_neighbors
- Controls how complex a model can get
- More complexity → lower bias, higher variance

2. **Training Control Hyperparameters**
- Examples: learning_rate, batch_size, epochs
- Controls how the model is trained
- Lower learning rate → slower, possibly more accurate convergence

3. **Regularization Hyperparameters**
- Examples: alpha, C, l1_ratio
- Penalizes model complexity to prevent overfitting

#### Tuning Hyperparameters
One option when tuning hyperparameters is to manually train a bunch of slightly different models, and compare their performance on validation data, choosing the one with the best results.

However, scikit-learn provides several tools to help with your hyperparameter search:

| Method                    | Description                                  | Best For                           |
| ------------------------- | -------------------------------------------- | ---------------------------------- |
| [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)           | Tries all combinations from a param grid     | Small search spaces                |
| [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)         | Samples random combinations from the grid    | Large search spaces                |
| Manual tuning         | Trial and error                              | Simple models or quick prototyping |
| [More options here](https://scikit-learn.org/stable/api/sklearn.model_selection.html#hyper-parameter-optimizers) | scikit-learn has many other options you can find here! | |
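
For comparison with the grid search exercise, here is a minimal sketch of `RandomizedSearchCV`. The KNN model, the iris dataset, and the search space are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Hypothetical search space chosen only for illustration
param_distributions = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
}

random_search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions,
    n_iter=5,        # try 5 random combinations instead of all 12
    cv=5,            # 5-fold cross-validation for each combination
    random_state=0,  # make the random sampling reproducible
)
random_search.fit(X_iris, y_iris)

print(random_search.best_params_)
print(random_search.best_score_)
```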

#### Step 9: Tune Hyperparameters using `GridSearchCV`
1. Take a look at the documentation for [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and figure out how to define a parameter grid (Hint: Ctrl+F for "Examples")

2. Define a parameter grid for the model you selected. You may have to revisit the documentation on the model you selected. Take a look at the default values for the hyperparameters to understand what a good starting point is.

Don't worry about tuning every hyperparameter. This is just about getting a feel for `GridSearchCV`.

**Did performance improve?**

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = ___

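grid_search = GridSearchCV(model, parameters)
grid_search.fit(X_train, y_train) # Searches the grid, then refits a copy of our model on X_train using the best parameters found
model = grid_search.best_estimator_ # Keep the tuned model so the later steps (saving, testing) use it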

print("Cross-Validation Score of Best Model Found: ", grid_search.best_score_)
print("Best Parameters Found: ", grid_search.best_params_)

### Common Tuning Mistakes
| Mistake                      | Description                                     | How to Avoid                           |
| ---------------------------- | ----------------------------------------------- | -------------------------------------- |
| Using default values blindly | Defaults may not suit your dataset              | Always perform hyperparameter tuning   |
| Tuning on test data          | Leads to data leakage                           | Use cross-validation or validation set |
| Over-tuning                  | Too many parameters → overfitting on validation | Keep tuning space sensible             |

## 7. Save/Load Models

Once you've trained a model, you might want to save it for later use without needing to retrain it. Scikit-learn provides many different ways to save and load models, but we will use `pickle`, a built-in Python library for serializing and deserializing Python objects, because it comes standard with every Python installation.

If you are working with large models, consider using `ONNX` or `joblib`. You can find documentation on these methods in the [scikit-learn model persistence documentation](https://scikit-learn.org/stable/model_persistence.html).
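
As a rough sketch of the `joblib` alternative, the example below saves and reloads a small model in a temporary directory and checks that the reloaded copy makes the same predictions. The decision tree, iris dataset, and file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X_iris, y_iris)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "saved_model.joblib")  # hypothetical file name
    joblib.dump(tree, model_path)            # write the fitted model to disk
    restored_tree = joblib.load(model_path)  # read it back without retraining

same_predictions = bool((restored_tree.predict(X_iris) == tree.predict(X_iris)).all())
print(same_predictions)  # True
```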

### Saving a Model Using Pickle

In [None]:
from pickle import dump

with open("saved_model.pkl", "wb") as file:
    dump(model, file)

### Loading a Model Using Pickle
When you use pickle, you can load the model and use it without re-creating or retraining it. Pickle encodes everything about the object, so you can simply load it and start making predictions. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.

In [None]:
from pickle import load

with open("saved_model.pkl", "rb") as file:
    loaded_model = load(file)

## 8. Make Predictions and Evaluate the Model
### Predict on test data
The time has finally come to use our test data! Recall, we use validation and training data to tune our model and hyperparameters; the test data is never-before-seen data we will use to evaluate our model's performance.

Predicting on test data is as easy as calling `.predict(X_test)`, but how do we get something meaningful out of it? We can use various evaluation metrics to understand how well our model is performing.

### Accuracy scoring
The simplest way to measure the performance of our model is to calculate its accuracy. This is simple using the `accuracy_score` function from `sklearn.metrics`.

However, accuracy alone can be misleading, especially with imbalanced datasets. For example, if 95% of your data belongs to one class, a model that always predicts that class will have 95% accuracy but is not useful. So, it is always good to compare your accuracy against a baseline model, such as one that makes random predictions or always predicts the majority class.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier

random_classifier = DummyClassifier(strategy="uniform", random_state=42) # Model that will guess randomly
random_classifier.fit(X_train, y_train)
random_classifier_test_predictions = random_classifier.predict(X_test)

most_frequent_classifier = DummyClassifier(strategy="most_frequent") # Model that will always guess the most frequent class
most_frequent_classifier.fit(X_train, y_train)
most_frequent_classifier_test_predictions = most_frequent_classifier.predict(X_test)

loaded_model_test_predictions = loaded_model.predict(X_test)

print("Random Classifier Accuracy:", accuracy_score(y_test, random_classifier_test_predictions))
print("Most Frequent Classifier Accuracy:", accuracy_score(y_test, most_frequent_classifier_test_predictions))
print("Our Model Accuracy:", accuracy_score(y_test, loaded_model_test_predictions))
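
The 95% example mentioned above can be reproduced with a small sketch. The labels here are made up to match that scenario, and the features are placeholders because `DummyClassifier` ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95 examples of class 0 and 5 of class 1
y_imbalanced = np.array([0] * 95 + [1] * 5)
X_placeholder = np.zeros((100, 1))  # DummyClassifier ignores the feature values

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_imbalanced)
baseline_accuracy = accuracy_score(y_imbalanced, baseline.predict(X_placeholder))

print(baseline_accuracy)  # 0.95, even though class 1 is never predicted
```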

### Confusion matrix
One flaw with using accuracy scoring is it assumes all errors (false positives and false negatives) are equally bad. In many applications, this is not the case. For example, in medical diagnosis, a false negative (failing to identify a disease) can be much more serious than a false positive (incorrectly diagnosing a disease).

To get a better understanding of our model's performance, we can use a **confusion matrix**. A confusion matrix is a table that shows us the number of correct and incorrect predictions made by our model, broken down by each class. It provides a more detailed view of how our model is performing across different classes.

**How many examples did your model misclassify?**

In [None]:
from sklearn.metrics import confusion_matrix

print("Random Classifier Confusion Matrix:\n", confusion_matrix(y_test, random_classifier_test_predictions))
print("Most Frequent Classifier Confusion Matrix:\n", confusion_matrix(y_test, most_frequent_classifier_test_predictions))
print("Our Model Confusion Matrix:\n", confusion_matrix(y_test, loaded_model_test_predictions))
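
If you're unsure how to read the output, here is a tiny sketch with made-up labels. In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a two-class problem
y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0]

matrix = confusion_matrix(y_true_demo, y_pred_demo)
print(matrix)
# [[2 1]
#  [1 2]]

# Correct predictions sit on the diagonal, so everything else is a mistake
misclassified = matrix.sum() - matrix.trace()
print(misclassified)  # 2
```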

### Classification reporting

A classification report provides a detailed summary of several metrics: precision (of the examples predicted as a particular class, the fraction that truly belong to it), recall (of the examples that truly belong to a class, the fraction the model correctly identified), and F1-score (the harmonic mean of precision and recall).

Precision is useful when you are more concerned about the accuracy of positive predictions, while recall is important when you want to capture as many positive instances as possible. Consider which metric is most useful for your specific application.

A medical diagnosis model might prioritize recall to ensure that as many cases of a disease are identified as possible, even if it means some false positives. On the other hand, a spam detection system might prioritize precision to avoid incorrectly marking legitimate emails as spam.

In [None]:
from sklearn.metrics import classification_report

print("Random Classifier Classification Report:\n", classification_report(y_test, random_classifier_test_predictions))
print("Most Frequent Classifier Classification Report:\n", classification_report(y_test, most_frequent_classifier_test_predictions))
print("Our Model Classification Report:\n", classification_report(y_test, loaded_model_test_predictions))
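
To connect the report to the definitions, the sketch below computes the three metrics on made-up labels with 2 true positives, 1 false positive, and 2 false negatives.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 2 true positives, 1 false positive, 2 false negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0]

precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true_demo, y_pred_demo)                # harmonic mean of the two = 4 / 7

print(precision, recall, f1)
```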

### Find Your Own Metrics
Check out the metrics supported by scikit-learn in the `sklearn.metrics` API reference, and read more about them in the model evaluation section of the scikit-learn user guide! Try using one in the code cell below and see how your model compares to the baselines!

In [None]:
from sklearn.metrics import ___

print("Random Classifier Score:\n", ___)
print("Most Frequent Classifier Score:\n", ___)
print("Our Model Score:\n", ___)