# Nonparametric ML Models - Cumulative Lab

## Introduction

In this cumulative lab, you will apply two nonparametric models you have just learned — k-nearest neighbors and decision trees — to the forest cover dataset.

## Objectives

* Practice identifying and applying appropriate preprocessing steps
* Perform an iterative modeling process, starting from a baseline model
* Explore multiple model algorithms, and tune their hyperparameters
* Practice choosing a final model across multiple model algorithms and evaluating its performance

## Your Task: Complete an End-to-End ML Process with Nonparametric Models on the Forest Cover Dataset

![line of pine trees](https://curriculum-content.s3.amazonaws.com/data-science/images/trees.jpg)

Photo by <a href="https://unsplash.com/@michaelbenz?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Michael Benz</a> on <a href="/s/photos/forest?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

### Business and Data Understanding

To repeat the previous description:

> Here we will be using an adapted version of the forest cover dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/covertype). Each record represents a 30 x 30 meter cell of land within Roosevelt National Forest in northern Colorado, which has been labeled as `Cover_Type` 1 for "Cottonwood/Willow" and `Cover_Type` 0 for "Ponderosa Pine". (The original dataset contained 7 cover types but we have simplified it.)

The task is to predict the `Cover_Type` based on the available cartographic variables:

In [1]:
# Run this cell without changes
import pandas as pd

df = pd.read_csv('data/forest_cover.csv')
df

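| | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type_31 | Soil_Type_32 | Soil_Type_33 | Soil_Type_34 | Soil_Type_35 | Soil_Type_36 | Soil_Type_37 | Soil_Type_38 | Soil_Type_39 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2553 | 235 | 17 | 351 | 95 | 780 | 188 | 253 | 199 | 1410 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2011 | 344 | 17 | 313 | 29 | 404 | 183 | 211 | 164 | 300 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2022 | 24 | 13 | 391 | 42 | 509 | 212 | 212 | 134 | 421 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2038 | 50 | 17 | 408 | 71 | 474 | 226 | 200 | 102 | 283 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2018 | 341 | 27 | 351 | 34 | 390 | 152 | 188 | 168 | 190 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38496 | 2396 | 153 | 20 | 85 | 17 | 108 | 240 | 237 | 118 | 837 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38497 | 2391 | 152 | 19 | 67 | 12 | 95 | 240 | 237 | 119 | 845 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38498 | 2386 | 159 | 17 | 60 | 7 | 90 | 236 | 241 | 130 | 854 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38499 | 2384 | 170 | 15 | 60 | 5 | 90 | 230 | 245 | 143 | 864 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |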


> As you can see, we have over 38,000 rows, each with 52 feature columns and 1 target column:

> * `Elevation`: Elevation in meters
> * `Aspect`: Aspect in degrees azimuth
> * `Slope`: Slope in degrees
> * `Horizontal_Distance_To_Hydrology`: Horizontal dist to nearest surface water features in meters
> * `Vertical_Distance_To_Hydrology`: Vertical dist to nearest surface water features in meters
> * `Horizontal_Distance_To_Roadways`: Horizontal dist to nearest roadway in meters
> * `Hillshade_9am`: Hillshade index at 9am, summer solstice
> * `Hillshade_Noon`: Hillshade index at noon, summer solstice
> * `Hillshade_3pm`: Hillshade index at 3pm, summer solstice
> * `Horizontal_Distance_To_Fire_Points`: Horizontal dist to nearest wildfire ignition points, meters
> * `Wilderness_Area_x`: Wilderness area designation (3 columns)
> * `Soil_Type_x`: Soil Type designation (39 columns)
> * `Cover_Type`: 1 for cottonwood/willow, 0 for ponderosa pine

This is also an imbalanced dataset, since cottonwood/willow trees are relatively rare in this forest:

In [2]:
# Run this cell without changes
print("Raw Counts")
print(df["Cover_Type"].value_counts())
print()
print("Percentages")
print(df["Cover_Type"].value_counts(normalize=True))

Raw Counts
Cover_Type
0    35754
1     2747
Name: count, dtype: int64

Percentages
Cover_Type
0    0.928651
1    0.071349
Name: proportion, dtype: float64


Thus, a baseline model that always chose the majority class would have an accuracy of over 92%. Therefore we will want to report additional metrics at the end.
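
As a quick check of that claim, here is a minimal sketch using scikit-learn's `DummyClassifier`. It assumes only the class counts printed above; the names `X_toy` and `y_toy` are hypothetical stand-ins, and the feature values are placeholders because a majority-class model never looks at them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical stand-in data: labels with the class counts reported above.
# The features are all zeros because a majority-class model ignores them.
y_toy = np.array([0] * 35754 + [1] * 2747)
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)
dummy_preds = dummy.predict(X_toy)

dummy_accuracy = accuracy_score(y_toy, dummy_preds)
dummy_recall = recall_score(y_toy, dummy_preds)

print(f"accuracy: {dummy_accuracy:.4f}")
print(f"recall:   {dummy_recall:.4f}")
```

The accuracy equals the majority-class share, while recall for class 1 is zero because the model never predicts it. That gap is why accuracy alone is not enough here.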

### Previous Best Model

In a previous lab, we used SMOTE to create additional synthetic data, then tuned the hyperparameters of a logistic regression model to get the following final model metrics:

* **Log loss:** 0.13031294393913376
* **Accuracy:** 0.9456679825472678
* **Precision:** 0.6659919028340081
* **Recall:** 0.47889374090247455

In this lab, you will try to beat those scores using more-complex, nonparametric models.

### Modeling

Although you may be aware of some additional model algorithms available from scikit-learn, for this lab you will be focusing on two of them: k-nearest neighbors and decision trees. Here are some reminders about these models:

#### kNN - [documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

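This algorithm, unlike linear models or tree-based models, does not emphasize learning the relationship between the features and the target. Instead, for a given test record, it finds the most similar records in the training set and combines their target values. For classification, the predicted probability of a class is the share of those neighbors that belong to it (in the binary case, the average of their 0/1 targets), and the predicted label is the majority class.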

* **Training speed:** Fast. In theory it's just saving the training data for later, although the scikit-learn implementation has some additional logic "under the hood" to make prediction faster.
* **Prediction speed:** Very slow. The model has to look at every record in the training set to find the k closest to the new record.
* **Requires scaling:** Yes. The algorithm to find the nearest records is distance-based, so it matters that distances are all on the same scale.
* **Key hyperparameters:** `n_neighbors` (how many nearest neighbors to find; too few neighbors leads to overfitting, too many leads to underfitting), `p` and `metric` (what kind of distance to use in defining "nearest" neighbors)
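
The prediction logic described above can be sketched by hand on a tiny made-up dataset and compared with `KNeighborsClassifier`. The arrays `X_small`, `y_small`, and `x_new` are hypothetical, and the sketch assumes the default uniform weights and Euclidean distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up training set: two features, binary target
X_small = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_small = np.array([0, 0, 0, 1, 1, 1])
x_new = np.array([[1.0, 1.0]])
k = 4

# By hand: Euclidean distance to every training record,
# keep the k closest, then average their 0/1 labels
distances = np.sqrt(((X_small - x_new) ** 2).sum(axis=1))
nearest = np.argsort(distances)[:k]
manual_prob_1 = y_small[nearest].mean()

# Same thing with scikit-learn (uniform weights, Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=k).fit(X_small, y_small)
sklearn_prob_1 = knn.predict_proba(x_new)[0, 1]

print(manual_prob_1, sklearn_prob_1)
```

Three of the four nearest neighbors are class 0, so both approaches give a class 1 probability of 0.25. With `weights="distance"` the average would instead be weighted by inverse distance.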

#### Decision Trees - [documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

Similar to linear models (and unlike kNN), this algorithm emphasizes learning the relationship between the features and the target. However, unlike a linear model that tries to find linear relationships between each of the features and the target, decision trees look for ways to split the data based on features to decrease the impurity of the target in each split (Gini impurity by default in scikit-learn, or entropy if you choose it).

* **Training speed:** Slow. The model considers candidate splits on as many as all of the available features, and it can split on the same feature multiple times. The training cost therefore grows with both the number of columns and the number of rows, and an unrestricted tree keeps adding nodes until its leaves are pure.
* **Prediction speed:** Medium fast. Producing a prediction with a decision tree means applying several conditional statements, which is slower than something like logistic regression but faster than kNN.
* **Requires scaling:** No. This model is not distance-based. You can also use integer codes (for example from an `OrdinalEncoder`, the feature-side counterpart of `LabelEncoder`) rather than `OneHotEncoder` for categorical data, since this algorithm doesn't necessarily assume that the distance between `1` and `2` is the same as the distance between `2` and `3`.
* **Key hyperparameters:** Many hyperparameters relate to "pruning" the tree. By default they are set so the tree can overfit, and by setting them higher or lower (depending on the hyperparameter) you can reduce overfitting, but too much will lead to underfitting. These are: `max_depth`, `min_samples_split`, `min_samples_leaf`, `min_weight_fraction_leaf`, `max_features`, `max_leaf_nodes`, and `min_impurity_decrease`. You can also try changing the `criterion` to "entropy" or the `splitter` to "random" if you want to change the splitting logic.
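
To make the splitting criterion concrete, here is a small sketch of how the impurity decrease of one candidate split could be computed. The helper names `gini`, `entropy`, and `impurity_decrease` are hypothetical illustrations, not scikit-learn functions, and the label arrays are made up.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return -np.sum(proportions * np.log2(proportions))

def impurity_decrease(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * impurity(left) \
        + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted_children

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A mediocre split: each child is still mixed
mixed_gain = impurity_decrease(parent, np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1]))

# A perfect split: each child is pure
pure_gain = impurity_decrease(parent, np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1]))

print(mixed_gain, pure_gain)
```

The tree greedily picks the split with the largest decrease at each node. Passing `impurity=entropy` switches the measure, much like setting `criterion="entropy"` on the classifier.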

### Requirements

#### 1. Prepare the Data for Modeling

#### 2. Build a Baseline kNN Model

#### 3. Build Iterative Models to Find the Best kNN Model

#### 4. Build a Baseline Decision Tree Model

#### 5. Build Iterative Models to Find the Best Decision Tree Model

#### 6. Choose and Evaluate an Overall Best Model

## 1. Prepare the Data for Modeling

The target is `Cover_Type`. In the cell below, split `df` into `X` and `y`, then perform a train-test split with `random_state=42` and `stratify=y` to create variables with the standard `X_train`, `X_test`, `y_train`, `y_test` names.

Include the relevant imports as you go.

In [3]:

# Import the relevant function
from sklearn.model_selection import train_test_split

# Split df into X and y
X = df.drop("Cover_Type", axis=1)
y = df["Cover_Type"]

# Perform train-test split with random_state=42 and stratify=y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
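
The effect of `stratify=y` can be illustrated with a short sketch on made-up labels (`X_demo` and `y_demo` are hypothetical, not the forest cover data). It compares the minority-class rate in the two resulting subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 920 negatives and 80 positives (8% positive)
y_demo = np.array([0] * 920 + [1] * 80)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same arguments as in the lab; the default test size is 25%
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate in train: {train_rate:.3f}")
print(f"positive rate in test:  {test_rate:.3f}")
```

With stratification, both subsets keep roughly the same minority share as the full dataset, which matters when the positive class is this rare.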

Now, instantiate a `StandardScaler`, fit it on `X_train`, and create new variables `X_train_scaled` and `X_test_scaled` containing values transformed with the scaler.

In [4]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
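
As an illustration of what the scaler does, the sketch below uses made-up data (the arrays `train` and `test` and the name `demo_scaler` are hypothetical). It checks that transforming the test set amounts to subtracting the training mean and dividing by the training standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
rng = np.random.default_rng(0)
train = rng.normal(loc=[2500.0, 15.0], scale=[300.0, 8.0], size=(200, 2))
test = rng.normal(loc=[2500.0, 15.0], scale=[300.0, 8.0], size=(50, 2))

# Fit on the training data only, then apply the same transform to the test data
demo_scaler = StandardScaler().fit(train)
test_scaled = demo_scaler.transform(test)

# Equivalent manual computation with the *training* mean and standard deviation
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(test_scaled, manual))
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.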

The following code checks that everything is set up correctly:

In [5]:
# Run this cell without changes

# Checking that df was separated into correct X and y
assert type(X) == pd.DataFrame and X.shape == (38501, 52)
assert type(y) == pd.Series and y.shape == (38501,)

# Checking the train-test split
assert type(X_train) == pd.DataFrame and X_train.shape == (28875, 52)
assert type(X_test) == pd.DataFrame and X_test.shape == (9626, 52)
assert type(y_train) == pd.Series and y_train.shape == (28875,)
assert type(y_test) == pd.Series and y_test.shape == (9626,)

# Checking the scaling
assert X_train_scaled.shape == X_train.shape
assert round(X_train_scaled[0][0], 3) == -0.636
assert X_test_scaled.shape == X_test.shape
assert round(X_test_scaled[0][0], 3) == -1.370

## 2. Build a Baseline kNN Model

Build a scikit-learn kNN model with default hyperparameters. Then use `cross_val_score` with `scoring="neg_log_loss"` to find the mean log loss for this model (passing in `X_train_scaled` and `y_train` to `cross_val_score`). You'll need to find the mean of the cross-validated scores, and negate the value (either put a `-` at the beginning or multiply by `-1`) so that your answer is a log loss rather than a negative log loss.

Call the resulting score `knn_baseline_log_loss`.

Your code might take a minute or more to run.
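
For reference, here is a small sketch of what log loss measures and why the cross-validated score needs to be negated. It uses made-up predictions and a synthetic dataset from `make_classification`, not the forest cover data; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Log loss by hand for four made-up predictions
y_true = np.array([0, 0, 1, 1])
prob_1 = np.array([0.1, 0.4, 0.8, 0.3])  # predicted P(class 1)
manual_log_loss = -np.mean(
    y_true * np.log(prob_1) + (1 - y_true) * np.log(1 - prob_1)
)
sklearn_log_loss = log_loss(y_true, prob_1)

# scikit-learn scorers follow "higher is better", so log loss comes back negated
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=42)
neg_scores = cross_val_score(
    KNeighborsClassifier(), X_syn, y_syn, cv=5, scoring="neg_log_loss"
)
mean_log_loss = -neg_scores.mean()

print(manual_log_loss, sklearn_log_loss)
print(neg_scores, mean_log_loss)
```

The per-fold scores are all zero or below, and flipping the sign of their mean gives an ordinary log loss where lower is better.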

In [6]:
# Replace None with appropriate code

# Relevant imports
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score 

# Creating the model (cross_val_score fits its own clones on each fold,
# so no separate fit or test-set prediction is needed here)
knn_baseline_model = KNeighborsClassifier()

# Perform cross-validation
knn_baseline_log_loss = -cross_val_score(knn_baseline_model, X_train_scaled, y_train, cv=5, scoring='neg_log_loss').mean()

knn_baseline_log_loss

0.12964546386734577

Our best logistic regression model had a log loss of 0.13031294393913376.

Is this model better? Compare it in terms of metrics and speed.

In [7]:
# Replace None with appropriate text
"""
Our log loss is better with the vanilla (un-tuned) kNN model
than it was with the tuned logistic regression model

It was also much slower, taking around a minute to complete
the cross-validation on this machine

It depends on the business case whether this is really a better
model
"""


## 3. Build Iterative Models to Find the Best kNN Model

Build and evaluate at least two more kNN models to find the best one. Explain why you are changing the hyperparameters you are changing as you go. These models will be *slow* to run, so be thinking about what you might try next as you run them.

In [8]:
# kNN with 57 neighbors and the default Minkowski distance (p=2, i.e. Euclidean)
knn_baseline_model_3 = KNeighborsClassifier(n_neighbors=57, metric="minkowski")

knn_baseline_loss_3 = -cross_val_score(knn_baseline_model_3, X_train_scaled, y_train, cv=5, scoring='neg_log_loss').mean()
knn_baseline_loss_3

0.08093412009193049

In [9]:
# Create a kNN model with hyperparameters: 5 neighbors, Manhattan distance, and uniform weights
knn_model_1 = KNeighborsClassifier(n_neighbors=5, metric='manhattan', weights='uniform')

# Perform cross-validation and calculate the mean log loss
knn_model_1_log_loss = -cross_val_score(knn_model_1, X_train_scaled, y_train, cv=5, scoring="neg_log_loss").mean()

knn_model_1_log_loss

0.1181008016543303

In [10]:
# Create a kNN model with hyperparameters: 15 neighbors, Euclidean distance, and distance-based weights
knn_model_2 = KNeighborsClassifier(n_neighbors=15, metric='euclidean', weights='distance')

# Perform cross-validation and calculate the mean log loss
knn_model_2_log_loss = -cross_val_score(knn_model_2, X_train_scaled, y_train, cv=5, scoring="neg_log_loss").mean()

knn_model_2_log_loss

0.06158180065403176

In [11]:
# Model 3: kNN with 11 neighbors, Euclidean distance, and uniform weights
knn_model_3 = KNeighborsClassifier(n_neighbors=11, metric='euclidean', weights='uniform')

# Perform cross-validation and calculate the mean log loss
knn_model_3_log_loss = -cross_val_score(knn_model_3, X_train_scaled, y_train, cv=5, scoring="neg_log_loss").mean()

knn_model_3_log_loss

0.07392904401138965

In [12]:
# Model 4: kNN with 19 neighbors, Minkowski distance (p=3), and distance-based weights
knn_model_4 = KNeighborsClassifier(n_neighbors=19, metric='minkowski', p=3, weights='distance')

# Perform cross-validation and calculate the mean log loss
knn_model_4_log_loss = -cross_val_score(knn_model_4, X_train_scaled, y_train, cv=5, scoring="neg_log_loss").mean()

# Output the log loss
knn_model_4_log_loss

0.06177786275589958

In [13]:
# output the log loss for comparison
knn_model_1_log_loss, knn_model_2_log_loss, knn_model_3_log_loss, knn_model_4_log_loss

(0.1181008016543303,
 0.06158180065403176,
 0.07392904401138965,
 0.06177786275589958)

In [14]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model with scaled features
knn_6 = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_6.fit(X_train_scaled, y_train)
y_pred_6 = knn_6.predict(X_test_scaled)

# Evaluate accuracy
accuracy_6 = accuracy_score(y_test, y_pred_6)
print(f"Accuracy with scaling: {accuracy_6}")


Accuracy with scaling: 0.984209432786204


In [15]:
# Your code here (add more cells as needed)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

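# Define parameter grid
# Note: metric='manhattan' ignores p, and 'minkowski' with p=1 is also
# Manhattan distance, so several of these combinations are duplicates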
param_grid = {
    'n_neighbors': [3, 5, 7, 10],
    'metric': ['minkowski', 'manhattan'],
    'p': [1, 2]
}

# Grid search with cross-validation
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

# Best parameters and accuracy
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validated accuracy: {grid_search.best_score_}")

# Evaluate on the test set
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test_scaled)
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Test set accuracy of best model: {accuracy_best}")


Best parameters: {'metric': 'minkowski', 'n_neighbors': 3, 'p': 1}
Best cross-validated accuracy: 0.9843116883116882
Test set accuracy of best model: 0.9851444005817578
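
Because the rest of this lab compares models by cross-validated log loss, a grid search could be scored the same way. The sketch below is one possible setup on a small synthetic dataset (`X_syn`, `y_syn`, and the grid values are illustrative assumptions, not tuned for the forest cover data). It wraps the scaler and the model in a pipeline so the scaler is re-fit inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic imbalanced dataset (stand-in for the real data)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = {
    "kneighborsclassifier__n_neighbors": [5, 15, 25],
    "kneighborsclassifier__weights": ["uniform", "distance"],
    "kneighborsclassifier__p": [1, 2],
}

# Score by negative log loss so the search optimizes predicted probabilities
search = GridSearchCV(pipe, grid, cv=5, scoring="neg_log_loss")
search.fit(X_syn, y_syn)

best_log_loss = -search.best_score_
print(search.best_params_)
print(best_log_loss)
```

Scoring by accuracy and scoring by log loss can favor different hyperparameters, since log loss also rewards well-spread probabilities rather than only correct labels.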


## 4. Build a Baseline Decision Tree Model

Now that you have chosen your best kNN model, start investigating decision tree models. First, build and evaluate a baseline decision tree model, using default hyperparameters (with the exception of `random_state=42` for reproducibility).

(Use cross-validated log loss, just like with the previous models.)

In [23]:
# Your code here

# Relevant imports
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Create the baseline decision tree model
dt_baseline_model = DecisionTreeClassifier(random_state=42)

# Perform cross-validation with log loss scoring
dt_baseline_log_loss = -cross_val_score(
    dt_baseline_model,
    X_train_scaled,
    y_train,
    scoring="neg_log_loss",
    cv=5
).mean()

dt_baseline_log_loss

0.7364763809378052

Interpret this score. How does this compare to the log loss from our best logistic regression and best kNN models? Any guesses about why?



In [17]:
# Replace None with appropriate text
"""
### Interpretation of Decision Tree Log Loss
- The log loss of the baseline decision tree model is *0.7365*.
- This is *much higher* (worse) than the log loss of the best logistic regression model (*0.1303*) and the best kNN model (*0.0616*).

---

### Comparison to Other Models
1. *Logistic Regression*:
   - Performs reasonably well even though it assumes a linear relationship between the features and the log-odds of the target.
   - While simple, it is effective for datasets with clear patterns, and it generalizes well.

2. *kNN Model*:
   - Achieved the best performance (lowest log loss) because it captures local patterns and is highly flexible.
   - Its success suggests the dataset benefits from proximity-based relationships among instances.

3. *Decision Tree*:
   - The baseline decision tree struggles with performance.
   - Reasons for high log loss:
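     - *Overfitting*: Decision trees with default settings often overfit the training data, especially when they grow deep and complex without restrictions (e.g., no maximum depth). A fully grown tree also tends to have pure leaves, so its predicted probabilities are exactly 0 or 1, and log loss penalizes each confidently wrong prediction very heavily.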
     - *Imbalance Sensitivity*: The dataset's imbalance (fewer cottonwood/willow trees) may cause poor probabilistic predictions.
     - *No Regularization*: Default trees lack hyperparameter constraints (e.g., max depth, min samples per split) that help balance complexity and generalization.

---

### Guesses About Improvement
To improve the decision tree's performance:
1. *Regularization*: Limit tree depth or specify minimum samples for splits/leaves to prevent overfitting.
2. *Class Imbalance Handling*: Weight the classes or resample the dataset to reduce the impact of the imbalance.
3. *Hyperparameter Tuning*: Explore values for max_depth, min_samples_split, and min_samples_leaf to optimize the tree structure.
"""

'\nNone\n'
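
One plausible contributor to the high log loss can be sketched with made-up numbers: the same five mistakes cost far more when they are made with probabilities of exactly 0 or 1 than with softer probabilities. The arrays below are hypothetical and do not come from the lab's models.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up test labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# Hard 0/1 probabilities with five mistakes (two false positives, three misses),
# similar to what a tree with pure leaves would output
hard_probs = y_true.astype(float)
wrong = [0, 1, 95, 96, 97]
hard_probs[wrong] = 1.0 - hard_probs[wrong]

# Same decisions, but expressed with less extreme probabilities
soft_probs = np.where(hard_probs == 1.0, 0.9, 0.1)

hard_loss = log_loss(y_true, hard_probs)
soft_loss = log_loss(y_true, soft_probs)

print(f"hard 0/1 probabilities: {hard_loss:.3f}")
print(f"softer probabilities:   {soft_loss:.3f}")
```

Both sets of predictions are 95% accurate, yet the hard version has a much larger log loss. Constraints such as `max_depth` or `min_samples_leaf` leave more mixed leaves, which produce intermediate probabilities.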

## 5. Build Iterative Models to Find the Best Decision Tree Model

Build and evaluate at least two more decision tree models to find the best one. Explain why you are changing the hyperparameters you are changing as you go.

In [24]:
# Your code here (add more cells as needed)
# Create a decision tree model with max_depth=10
dt_model_1 = DecisionTreeClassifier(max_depth=10, random_state=42)

# Perform cross-validation and calculate the mean log loss
dt_model_1_log_loss = -cross_val_score(dt_model_1, X_train_scaled, y_train, cv=5, scoring="neg_log_loss").mean()

dt_model_1_log_loss


0.3487610164993546

In [25]:
# Your code here (add more cells as needed)
# Model 2: Adjust minimum samples per split
dt_model_2 = DecisionTreeClassifier(max_depth=10, min_samples_split=20, random_state=42)

# Perform cross-validation
dt_model_2_log_loss = -cross_val_score(
    dt_model_2,
    X_train_scaled,
    y_train,
    scoring="neg_log_loss",
    cv=5
).mean()

dt_model_2_log_loss

0.23751695985660098

In [26]:
# Your code here (add more cells as needed)
# Model 3: Adjust minimum samples per leaf
dt_model_3 = DecisionTreeClassifier(max_depth=10, min_samples_split=20, min_samples_leaf=10, random_state=42)

# Perform cross-validation
dt_model_3_log_loss = -cross_val_score(
    dt_model_3,
    X_train_scaled,
    y_train,
    scoring="neg_log_loss",
    cv=5
).mean()

dt_model_3_log_loss

0.20413494515597375
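
The trend above suggests trying a few more values of a single hyperparameter in a loop. Here is a minimal sketch of that pattern on a synthetic dataset (`X_syn`, `y_syn`, and the candidate values are illustrative assumptions, not recommendations for the forest cover data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small synthetic imbalanced dataset (stand-in for the real data)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=42
)

# Cross-validated log loss for several leaf-size constraints
leaf_results = {}
for min_leaf in [1, 5, 10, 25, 50]:
    tree = DecisionTreeClassifier(
        max_depth=10, min_samples_leaf=min_leaf, random_state=42
    )
    neg_scores = cross_val_score(tree, X_syn, y_syn, cv=5, scoring="neg_log_loss")
    leaf_results[min_leaf] = -neg_scores.mean()

best_min_leaf = min(leaf_results, key=leaf_results.get)

for min_leaf, loss in leaf_results.items():
    print(f"min_samples_leaf={min_leaf:>2}: log loss {loss:.4f}")
print("best:", best_min_leaf)
```

Keeping every other setting fixed while one hyperparameter varies makes it easier to explain why each change helped or hurt.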

## 6. Choose and Evaluate an Overall Best Model

Which model had the best performance? What type of model was it?

Instantiate a variable `final_model` using your best model with the best hyperparameters.

In [27]:
# Replace None with appropriate code
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the best model with optimal hyperparameters
final_model = KNeighborsClassifier(
    n_neighbors=15,  # Best hyperparameter from tuning
    weights="distance",  # Ensures closer neighbors contribute more
    metric="euclidean"  # Optimal distance metric
)

# Fit the model on the full training data
# (scaled or unscaled depending on the model)
final_model.fit(X_train_scaled, y_train)


Now, evaluate the log loss, accuracy, precision, and recall. This code is mostly filled in for you, but you need to replace `None` with either `X_test` or `X_test_scaled` depending on the model you chose.

In [29]:
# Replace None with appropriate code
from sklearn.metrics import accuracy_score, precision_score, recall_score, log_loss

# Make predictions and predict probabilities
preds = final_model.predict(X_test_scaled)
probs = final_model.predict_proba(X_test_scaled)

# Evaluate metrics
print("log loss: ", log_loss(y_test, probs))
print("accuracy: ", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds))
print("recall:   ", recall_score(y_test, preds))

log loss:  0.07192440025459416
accuracy:  0.9804695616039892
precision: 0.9179229480737019
recall:    0.7976710334788938
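
To connect precision and recall with the underlying counts, the sketch below derives both from a confusion matrix on ten made-up predictions. The arrays are hypothetical and are not the lab's test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Ten made-up records: six negatives followed by four positives
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision: of the records predicted positive, how many really are positive?
manual_precision = tp / (tp + fp)
# Recall: of the records that really are positive, how many did we find?
manual_recall = tp / (tp + fn)

print(manual_precision, precision_score(y_true_demo, y_pred_demo))
print(manual_recall, recall_score(y_true_demo, y_pred_demo))
```

Here two of the three positive predictions are correct, and two of the four real positives are found, so precision and recall differ even though they share the same true positive count.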


Interpret your model performance. How would it perform on different kinds of tasks? How much better is it than a "dummy" model that always chooses the majority class, or the logistic regression described at the start of the lab?

# Replace None with appropriate text
"""
### *Interpretation of Model Performance*

1. *Log Loss*:  
   The log loss of *0.0719* indicates that the model's predicted probabilities are highly confident and accurate for the correct class. A lower log loss reflects better-calibrated probabilities.

2. *Accuracy*:  
   The accuracy of *98.05%* shows that the model correctly classifies a vast majority of instances in the test set. However, for imbalanced datasets like this, accuracy alone can be misleading as it may favor the majority class.

3. *Precision*:  
   The precision of *91.79%* indicates that when the model predicts the minority class (Cover_Type = 1), it is correct about *91.79%* of the time. This is important for tasks where false positives need to be minimized, such as identifying rare species with high confidence.

4. *Recall*:  
   The recall of *79.77%* shows that the model identifies nearly 80% of the minority class instances. This is crucial for tasks where missing true positives is costly, such as conservation efforts or environmental monitoring.

---

### *Comparison to Baseline Models*

1. *Dummy Model*:
   - A dummy model that always predicts the majority class (Cover_Type = 0) would have a high accuracy (~93%, the share of the majority class) but a recall of 0% for the minority class. In contrast, this model achieves both *high accuracy* and a *balanced precision-recall tradeoff*, making it significantly better.

2. *Logistic Regression*:
   - The tuned logistic regression model from the previous lab had a log loss of *0.1303* compared to this model's *0.0719*, a substantial improvement. Precision also rose from about 0.67 to 0.92 and recall from about 0.48 to 0.80. Logistic regression assumes linear relationships, which likely limits its performance on this nonlinear dataset.

---

### *Task Suitability*
- *Strengths*:  
   This model performs exceptionally well for tasks requiring both accurate classification and well-calibrated probabilities, such as species classification, ecological surveys, or conservation monitoring.
   
- *Limitations*:  
   The slightly lower recall (~80%) could be a drawback for tasks where identifying all instances of the minority class is critical, such as early detection of invasive species.

"""

## Conclusion

In this lab, you practiced the end-to-end machine learning process with multiple model algorithms, including tuning the hyperparameters for those different algorithms. You saw how nonparametric models can be more flexible than linear models, potentially leading to overfitting but also potentially reducing underfitting by being able to learn non-linear relationships between variables. You also likely saw how there can be a tradeoff between speed and performance, with good metrics correlating with slow speeds.