In [None]:
%load_ext nb_black

# KNN Regression

### Warm-up 🥵

* How are weights implemented in K Nearest Neighbors?
  * By default in `sklearn` (aka the `'uniform'` option)?
  * When using the `'distance'` option in `sklearn`?
  
* What type of machine learning problem were we using KNN on up until now?
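
To check your answers to the first question, here is a minimal sketch on made-up 1-D data (`X_toy`, `y_toy`, and `x_new` are illustrative assumptions, not part of the dataset). It fits `KNeighborsRegressor` with both `weights` options and reproduces each prediction by hand: `'uniform'` takes a plain mean of the neighbors' targets, while `'distance'` weights each neighbor by the inverse of its distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data: three points on a line
X_toy = np.array([[0.0], [1.0], [3.0]])
y_toy = np.array([0.0, 10.0, 30.0])
x_new = np.array([[0.25]])  # query point, closest to x=0 and x=1

uniform_knn = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X_toy, y_toy)
distance_knn = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X_toy, y_toy)

uniform_pred = uniform_knn.predict(x_new)[0]
distance_pred = distance_knn.predict(x_new)[0]

# Reproduce both predictions by hand from the 2 nearest neighbors
dists, idx = uniform_knn.kneighbors(x_new)
neighbor_y = y_toy[idx[0]]
manual_uniform = neighbor_y.mean()  # every neighbor counts equally
inv_dist = 1.0 / dists[0]  # closer neighbors get larger weights
manual_distance = np.sum(inv_dist * neighbor_y) / np.sum(inv_dist)

print(f"uniform:  sklearn={uniform_pred}, by hand={manual_uniform}")
print(f"distance: sklearn={distance_pred}, by hand={manual_distance}")
```

The two options pick the same neighbors; they only differ in how the neighbors' target values are averaged.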

## Data Import and General EDA 🚗

We'll be looking at the Auto MPG dataset from UCI, which can be found [here](https://archive.ics.uci.edu/ml/datasets/auto+mpg).  From the description we see:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

Our target variable will be `mpg`.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original"
names = [
    "mpg",
    "cylinders",
    "displacement",
    "horsepower",
    "weight",
    "acceleration",
    "year",
    "origin",
    "model",
]

# r"\s+" is a regex meaning "one or more whitespace characters". You can
# download the data from data_url to inspect it and see why this makes sense.
# The raw string (r"...") avoids an invalid-escape warning for "\s".
auto = pd.read_csv(data_url, sep=r"\s+", names=names)

Do some *really* general EDA, i.e., just things like shape/head/info/describe.

Display rows with any NAs in them.
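
One way these two steps might look is sketched below. To keep the example self-contained it runs on `auto_demo`, a tiny made-up stand-in with a few of the same column names (the values are invented); in the notebook you would call the same methods on `auto`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `auto` (values are invented for illustration)
auto_demo = pd.DataFrame(
    {
        "mpg": [18.0, np.nan, 24.0, 31.0],
        "horsepower": [130.0, 165.0, np.nan, 65.0],
        "origin": [1, 1, 3, 2],
        "model": ["ford torino", "ford pinto", "toyota corolla", "fiat 128"],
    }
)

# Really general EDA
print(auto_demo.shape)
print(auto_demo.head())
auto_demo.info()  # prints dtypes and non-null counts directly
print(auto_demo.describe())

# Rows with an NA in any column
na_rows = auto_demo[auto_demo.isna().any(axis=1)]
print(na_rows)
```

`isna().any(axis=1)` builds a boolean mask with one entry per row, which is then used to filter the frame.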

## Data Cleaning and Feature Engineering

### Handling NAs

Since our target variable is `mpg`, we probably don't want to impute it.  We should drop rows with NAs in the target unless we have some domain expertise that tells us otherwise.

The `horsepower` column is responsible for the rest of the NAs.  In practice, we might look up this info somehow, but that would probably take too much time for this demo.

So how do you want to handle these? By dropping? With imputation?  If imputing, what should we impute?
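
As one possible answer (not the only reasonable one), the sketch below drops rows with a missing target and fills missing `horsepower` with the median. The frame `na_demo` and its values are made up for illustration. In a stricter workflow you might instead compute the fill value from the training split only, for example with `SimpleImputer` inside a pipeline.

```python
import numpy as np
import pandas as pd

# Made-up stand-in with the two columns that contain NAs
na_demo = pd.DataFrame(
    {
        "mpg": [18.0, np.nan, 24.0, 31.0, 27.0],
        "horsepower": [130.0, 165.0, np.nan, 65.0, 90.0],
    }
)

# Never impute the target: drop rows where mpg is missing
cleaned = na_demo.dropna(subset=["mpg"]).copy()

# Fill missing horsepower with the median of the remaining rows
hp_median = cleaned["horsepower"].median()
cleaned["horsepower"] = cleaned["horsepower"].fillna(hp_median)

print(cleaned)
```

The median is less sensitive to a few very powerful cars than the mean, which is the main reason to prefer it here.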

### Handling Categorical Variables

* From the description of the columns above, we can see that `origin` is a 'discrete' value.
* The `model` column is also a categorical variable.
* You can also see that `year` is 'discrete' from the description, but in practice we'll treat year variables as ordinal, so we don't need to make any changes.

Show the value counts for each of our categorical columns (don't include year).
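
A minimal sketch of this step, using a made-up frame `cat_demo` in place of `auto` (values invented for illustration):

```python
import pandas as pd

# Made-up stand-in for the two categorical columns
cat_demo = pd.DataFrame(
    {
        "origin": [1, 1, 1, 2, 3],
        "model": ["ford pinto", "ford pinto", "toyota corolla", "fiat 128", "datsun 510"],
    }
)

# One value_counts() table per categorical column
for col in ["origin", "model"]:
    print(cat_demo[col].value_counts())
    print()

origin_counts = cat_demo["origin"].value_counts()
model_counts = cat_demo["model"].value_counts()
```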

For origin, this is a pretty easy decision: let's one-hot encode and move on.  We're one-hot encoding instead of leaving it as ordinal because we have no reason to believe that the origin codes have a meaningful order.

The model category is a little trickier... Most model values appear only once (we have 305 categories for our 406 rows), so encoding the column as-is wouldn't be too useful.  What is a feature we could engineer from this column though?
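
One candidate feature is the manufacturer. The sketch below assumes the first word of each `model` string is the make, and lumps makes seen fewer than `min_count` times into an `"other"` bucket. The frame `model_demo`, its values, and the `min_count` threshold are all hypothetical choices for illustration.

```python
import pandas as pd

# Made-up model strings in the same "make model" style as the dataset
model_demo = pd.DataFrame(
    {"model": ["ford pinto", "ford torino", "toyota corolla", "toyota corona", "fiat 128"]}
)

# Assumption: the first whitespace-separated token is the manufacturer
model_demo["make"] = model_demo["model"].str.split().str[0]

# Lump rare makes together so each category has enough rows to be useful
min_count = 2  # hypothetical threshold
make_counts = model_demo["make"].value_counts()
rare_makes = make_counts[make_counts < min_count].index
model_demo["make"] = model_demo["make"].where(~model_demo["make"].isin(rare_makes), "other")

print(model_demo)
```

On the real data it is worth checking the resulting value counts for misspelled makes before settling on a threshold.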

## Modeling

Perform a train/test split with 20% of the data in the test set.

In [None]:
X = auto.drop(columns=["mpg", "model"])
y = auto["mpg"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

We're going to build... a modeling pipeline for KNN.

In [None]:
cat_cols = ["origin", "make"]
drop_cats = [1, "other"]

# The rest are numeric
num_cols = [c for c in X if c not in cat_cols]
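
A pipeline built from these lists might look like the sketch below. It is one possible design, not the only one: one-hot encode the categorical columns (dropping one reference category per column), standardize the numeric columns so no single feature dominates the distance calculation, and grid-search over `n_neighbors` and `weights`. The data is synthetic and every `demo_` name, the grid values, and `cv=3` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (not the real auto MPG values)
rng = np.random.default_rng(0)
n = 120
demo_X = pd.DataFrame(
    {
        "weight": rng.uniform(1600, 5000, n),
        "horsepower": rng.uniform(50, 220, n),
        "origin": np.tile([1, 2, 3], n // 3),
        "make": np.tile(["ford", "toyota", "ford", "other"], n // 4),
    }
)
demo_y = 45 - 0.006 * demo_X["weight"] - 0.03 * demo_X["horsepower"] + rng.normal(0, 1, n)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

demo_cat_cols = ["origin", "make"]
demo_drop_cats = [1, "other"]  # one reference category per categorical column
demo_num_cols = [c for c in demo_X if c not in demo_cat_cols]

preprocessing = ColumnTransformer(
    [
        ("one_hot", OneHotEncoder(drop=demo_drop_cats), demo_cat_cols),
        ("scale", StandardScaler(), demo_num_cols),
    ]
)
pipeline = Pipeline([("preprocessing", preprocessing), ("knn", KNeighborsRegressor())])

# Step-name prefix "knn__" routes each parameter to the KNN step
grid = {"knn__n_neighbors": [3, 5, 9], "knn__weights": ["uniform", "distance"]}
demo_pipeline_cv = GridSearchCV(pipeline, grid, cv=3)
demo_pipeline_cv.fit(demo_X_train, demo_y_train)

demo_test_score = demo_pipeline_cv.score(demo_X_test, demo_y_test)
print(demo_pipeline_cv.best_params_)
print(f"test R^2: {demo_test_score}")
```

Putting the scaler inside the pipeline means it is refit on each cross-validation training fold, so the validation folds never leak into the scaling statistics.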

In [None]:
train_score = pipeline_cv.score(X_train, y_train)
test_score = pipeline_cv.score(X_test, y_test)

print(f"train_score: {train_score}")
print(f"test_score: {test_score}")

Compare to linear regression
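
A rough sketch of such a comparison is below: the same preprocessing is used for both models and only the final estimator changes, so the scores are comparable. To stay short it uses synthetic, numeric-only data (all names and values are assumptions); in the notebook you would reuse the full preprocessing step and the real train/test split. The scores printed here say nothing about which model wins on the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the real auto MPG values)
rng = np.random.default_rng(1)
n = 100
cmp_X = pd.DataFrame(
    {"weight": rng.uniform(1600, 5000, n), "horsepower": rng.uniform(50, 220, n)}
)
cmp_y = 45 - 0.006 * cmp_X["weight"] - 0.03 * cmp_X["horsepower"] + rng.normal(0, 1.5, n)

cmp_X_train, cmp_X_test, cmp_y_train, cmp_y_test = train_test_split(
    cmp_X, cmp_y, test_size=0.2, random_state=42
)

models = {"knn": KNeighborsRegressor(n_neighbors=5), "linear": LinearRegression()}
scores = {}
for name, estimator in models.items():
    # Same scaling step for both models; only the last step differs
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    pipe.fit(cmp_X_train, cmp_y_train)
    scores[name] = {
        "train": pipe.score(cmp_X_train, cmp_y_train),
        "test": pipe.score(cmp_X_test, cmp_y_test),
    }
    print(f"{name}: train R^2={scores[name]['train']:.3f}, test R^2={scores[name]['test']:.3f}")
```

Both `score` calls return R², so a large gap between train and test scores for either model is a sign of overfitting.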