## Introduction

To get you familiar with the setup for the regression challenge, we have prepared this notebook for you.

### Data
Attached to this homework you will find data in the same format as the data from Timo's Regression Challenge.

1) `labeled_data_simplified.csv`: this is your training dataset.
2) `unlabeled_data_simplified.csv`: this is your test data. It does not include the column "Quantity_Sold_(kilo)" because you need to predict this column with the model you train.

**Note:** To pass this homework you can freely choose between this simplified data and the "real" data `labeled_data.csv` and `unlabeled_data.csv` which Timo uploaded. If you want to make yourself familiar with the structure first or don't have much time, use the simplified version. If you want to get started with the full data immediately, that's okay too. We will give feedback either way.

This notebook will create a file "prediction_solution.csv" that you can upload to https://f25-regression-challenge.streamlit.app/ (the challenge leaderboard) to see your MSE (mean squared error) test score.
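
As a quick reminder of what the leaderboard score measures, here is a minimal sketch of how MSE is computed. The values in `y_true` and `y_pred` are made up for illustration and are not taken from the challenge data.

```python
import numpy as np

# Toy values, invented for illustration only.
y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([2.5, 3.0, 7.0])

# MSE is the mean of the squared differences between truth and prediction.
mse = np.mean((y_true - y_pred) ** 2)

# RMSE is the square root of MSE and is in the same unit as the target (kilos).
rmse = np.sqrt(mse)

print(f"MSE: {mse:.3f} | RMSE: {rmse:.3f}")
```

Because the errors are squared, a few large misses can dominate the score, so it can be worth checking for outliers in the target.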

### Feature Encoding & Modeling

I removed some of the more difficult features, but there is still encoding to be done. The modeling also still needs to be added by you.

Train at least 3 different models/use different features.

Your task for this homework intro is to 
1) change at least one thing about the features (in addition to simple encoding) and 
2) apply at least one regularization technique.

## Imports

In [72]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

## Loading Data

In [73]:
# Change this to the full data path `labeled_data.csv` from Timo's challenge to load the full data and see all available columns instead.
full_data_df = pd.read_csv("labeled_data_simplified.csv")
full_data_df.head()


| Unnamed: 0 | Quantity_Sold_(kilo) | Unit Selling Price (RMB/kg) | Category Name | Loss Rate (%) | Weekday | IsWeekend | Item Name |
|---|---|---|---|---|---|---|---|
| 0 | 6.841 | 6.0 | Flower/Leaf Vegetables | 18.52 | Wednesday | False | Amaranth |
| 1 | 1.909 | 16.0 | Capsicum | 15.98 | Wednesday | False | 7 Colour Pepper (1) |
| 2 | 5.472 | 14.0 | Flower/Leaf Vegetables | 18.51 | Wednesday | False | Spinach |
| 3 | 4.119 | 10.0 | Aquatic Tuberous Vegetables | 29.25 | Wednesday | False | High Melon (1) |
| 4 | 10.0 | 6.85 | Flower/Leaf Vegetables | 2.48 | Wednesday | False | Wawacai |


In [50]:
full_data_df.shape
full_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13980 entries, 0 to 13979
Data columns (total 6 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unit Selling Price (RMB/kg)  13980 non-null  float64
 1   Category Name                13980 non-null  object 
 2   Loss Rate (%)                13980 non-null  float64
 3   Weekday                      13980 non-null  object 
 4   IsWeekend                    13980 non-null  bool   
 5   Item Name                    13980 non-null  object 
dtypes: bool(1), float64(2), object(3)
memory usage: 559.9+ KB


## Split into Train/Val(/Test)

In [74]:
X = full_data_df.drop(columns=["Quantity_Sold_(kilo)"])
y = full_data_df["Quantity_Sold_(kilo)"]
# TODO: Split into Train/Val(/Test) sets to avoid overfitting during feature engineering and to evaluate your model before submitting it.
from sklearn.model_selection import train_test_split

# Step 1: Train (70%) and Temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 2: Split Temp (30%) into Validation (15%) and Test (15%)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print("Train size:", X_train.shape)
print("Validation size:", X_val.shape)
print("Test size:", X_test.shape)

# You should have something like this at the end:
# X_train, y_train, X_val, y_val(, X_test, y_test)

Train size: (22833, 6)
Validation size: (4893, 6)
Test size: (4893, 6)
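
The two-step split can be sanity-checked on toy data. The sketch below uses a made-up array of 1000 rows (the `*_demo` names are hypothetical) and confirms that `test_size=0.3` followed by `test_size=0.5` yields roughly 70/15/15 with no row shared between the parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (1000 rows).
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# Step 1: 70% train, 30% temporary hold-out.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Step 2: split the hold-out in half: 15% validation, 15% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

# Fraction of all rows that ended up in each part.
fractions = {
    "train": len(X_tr) / len(X_demo),
    "val": len(X_va) / len(X_demo),
    "test": len(X_te) / len(X_demo),
}
print(fractions)

# Each row should appear in exactly one of the three parts.
all_rows = np.concatenate([X_tr[:, 0], X_va[:, 0], X_te[:, 0]])
print("No overlap:", len(np.unique(all_rows)) == len(X_demo))
```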


## Explore the training data

Timo changed this data up a bit from the initial EDA homework in Week 1.

Apply what you have learned about EDA and look into the features (especially if you want to work with the full data).

In [93]:
# TODO
X.isnull().sum()
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32619 entries, 0 to 32618
Data columns (total 6 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unit Selling Price (RMB/kg)  32619 non-null  float64
 1   Category Name                32619 non-null  object 
 2   Loss Rate (%)                32619 non-null  float64
 3   Weekday                      32619 non-null  object 
 4   IsWeekend                    32619 non-null  bool   
 5   Item Name                    32619 non-null  object 
dtypes: bool(1), float64(2), object(3)
memory usage: 1.3+ MB
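
Beyond `isnull()` and `info()`, a few more checks may be useful. The following is only a sketch on a tiny made-up frame (`demo_df`) that reuses some of the notebook's column names; on the real data you would run the same calls on the training split.

```python
import pandas as pd

# Tiny invented frame; the values are illustrative, not real challenge data.
demo_df = pd.DataFrame({
    "Quantity_Sold_(kilo)": [6.8, 1.9, 5.5, 4.1, 10.0, 3.2],
    "Unit Selling Price (RMB/kg)": [6.0, 16.0, 14.0, 10.0, 6.85, 12.0],
    "Category Name": [
        "Flower/Leaf Vegetables", "Capsicum", "Flower/Leaf Vegetables",
        "Aquatic Tuberous Vegetables", "Flower/Leaf Vegetables", "Capsicum",
    ],
})

# Summary statistics of the numeric columns (count, mean, quartiles, ...).
numeric_summary = demo_df.describe()

# Number of rows per category; rare categories are harder to learn from.
category_counts = demo_df["Category Name"].value_counts()

# Mean target per category: a rough signal of how informative the feature is.
mean_by_category = demo_df.groupby("Category Name")["Quantity_Sold_(kilo)"].mean()

# Linear correlation between price and quantity sold.
price_corr = demo_df["Unit Selling Price (RMB/kg)"].corr(
    demo_df["Quantity_Sold_(kilo)"]
)

print(numeric_summary)
print(category_counts)
print(mean_by_category)
print(f"Price/quantity correlation: {price_corr:.3f}")
```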


## Encode Features & Create new Features

In [106]:
## TODO: Encode categorical features & create new features.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

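# Note: `IsWeekend` appears in neither list, so the ColumnTransformer below
# drops it (its default is remainder='drop'). It is likely redundant with `Weekday`.
num_cols = ['Unit Selling Price (RMB/kg)', 'Loss Rate (%)']

cat_cols = ['Category Name', 'Weekday', 'Item Name']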
# ===============================================
# 2. Define transformers for numeric and categorical columns
# ===============================================
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine numeric + categorical preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ]
)
    
# Remember to only use the training data to compute any statistics needed for feature engineering (e.g. mean/median for imputation, most frequent category for categorical imputation, etc.)
# Then apply the same transformations to the validation (and test) data.
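
To illustrate the train-only rule together with one possible new feature, here is a minimal sketch on made-up frames (`train_demo`, `val_demo`). The log-transformed price is an assumption chosen for illustration, not a feature the challenge prescribes. The encoder is fitted on the training frame only and then reused on the validation frame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Invented toy frames reusing two of the notebook's column names.
train_demo = pd.DataFrame({
    "Unit Selling Price (RMB/kg)": [6.0, 16.0, 14.0, 10.0],
    "Weekday": ["Monday", "Tuesday", "Monday", "Friday"],
})
val_demo = pd.DataFrame({
    "Unit Selling Price (RMB/kg)": [8.0, 12.0],
    "Weekday": ["Monday", "Sunday"],  # "Sunday" never occurs in train_demo
})

demo_preprocessor = ColumnTransformer(
    transformers=[
        # Hypothetical engineered feature: log(1 + price) to dampen large prices.
        ("log_price", FunctionTransformer(np.log1p), ["Unit Selling Price (RMB/kg)"]),
        # Categories unseen during fit are encoded as all zeros instead of failing.
        ("weekday", OneHotEncoder(handle_unknown="ignore"), ["Weekday"]),
    ],
    sparse_threshold=0,  # always return a dense array, easier to inspect
)

# Fit on the training data only, then apply the fitted transform to validation.
train_encoded = demo_preprocessor.fit_transform(train_demo)
val_encoded = demo_preprocessor.transform(val_demo)

print(train_encoded.shape, val_encoded.shape)
print(val_encoded)
```

Wrapping the preprocessor and the model in a single `Pipeline`, as this notebook does, gives the same guarantee automatically, because `fit` only ever sees the training data.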

## Create a model and train it

In [107]:
## TODO: Build a pipeline or use your favorite model after feature engineering.

# ===============================================
# 3. Define model
# ===============================================
model = RandomForestRegressor(
    n_estimators=200, random_state=42, n_jobs=-1
)

# ===============================================
# 4. Build pipeline
# ===============================================
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

X_train.info()
# Remember to only train on the training set and evaluate on the validation set.
# Rename the model variable to whatever more specific model you want to use.
# model.fit(X_train_encoded, y_train)
pipeline.fit(X_train, y_train)


<class 'pandas.core.frame.DataFrame'>
Index: 22833 entries, 2238 to 23654
Data columns (total 6 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unit Selling Price (RMB/kg)  22833 non-null  float64
 1   Category Name                22833 non-null  int64  
 2   Loss Rate (%)                22833 non-null  float64
 3   Weekday                      22833 non-null  int64  
 4   IsWeekend                    22833 non-null  bool   
 5   Item Name                    22833 non-null  int64  
dtypes: bool(1), float64(2), int64(3)
memory usage: 1.1 MB


## Evaluate the model

In [108]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# TODO: Adjust if you want to use different metrics or outputs, this is just a suggestion to give you an idea.
y_val_pred = pipeline.predict(X_val)
mse = mean_squared_error(y_val, y_val_pred)
rmse = np.sqrt(mse)
print(f"Validation RMSE: {rmse:.2f}")


def report(y_true, y_pred, label):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{label:20s} | MAE {mae:.3f} | MSE {mse:.3f} | R² {r2:.3f}")


# print("training error:")
# report(y_train, pipeline.predict(X_train), "Description of your model here")
# print("validation error:")
# report(y_val, pipeline.predict(X_val), "Description of your model here")

Validation RMSE: 12.43
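
A validation RMSE is easier to interpret next to a trivial baseline. The sketch below uses synthetic data (all `*_demo` names and values are made up) and compares a model against scikit-learn's `DummyRegressor`, which always predicts the training mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: a linear signal plus noise, for illustration only.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(300, 1))
y_all = 3.0 * X_all[:, 0] + rng.normal(scale=0.5, size=300)
X_tr_demo, X_va_demo = X_all[:200], X_all[200:]
y_tr_demo, y_va_demo = y_all[:200], y_all[200:]

# Baseline: ignores the features and always predicts the mean of y_tr_demo.
baseline = DummyRegressor(strategy="mean").fit(X_tr_demo, y_tr_demo)
baseline_rmse = np.sqrt(mean_squared_error(y_va_demo, baseline.predict(X_va_demo)))

# Any real model should beat the baseline on the validation set.
demo_model = LinearRegression().fit(X_tr_demo, y_tr_demo)
model_rmse = np.sqrt(mean_squared_error(y_va_demo, demo_model.predict(X_va_demo)))

print(f"Baseline RMSE: {baseline_rmse:.2f} | Model RMSE: {model_rmse:.2f}")
```

If a model's validation error is not clearly below the baseline, the features are probably not carrying much signal yet.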


## Repeat at least twice

Train at least 3 different models/use different features.

Your task for this homework intro is to 
1) change at least one thing about the features (in addition to simple encoding) and 
2) apply at least one regularization technique.

So you should have at least 3 different validation errors.
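
One way to cover the regularization task is Ridge regression (L2 penalty). The following is a sketch on synthetic data, not the challenge data; the alpha values are arbitrary picks for illustration. It trains three models and records one validation MSE per model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of ten features carry signal.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 10))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

results = {}
for alpha in [0.01, 1.0, 100.0]:
    # Scaling first matters: the L2 penalty treats all coefficients alike.
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    ridge.fit(X_tr, y_tr)
    val_mse = mean_squared_error(y_va, ridge.predict(X_va))
    # A larger alpha shrinks the coefficients towards zero.
    coef_norm = np.linalg.norm(ridge.named_steps["ridge"].coef_)
    results[alpha] = (val_mse, coef_norm)
    print(f"alpha={alpha:>6} | val MSE {val_mse:.3f} | coef norm {coef_norm:.3f}")
```

For tree ensembles such as `RandomForestRegressor`, limiting `max_depth` or raising `min_samples_leaf` plays a comparable role by restricting model complexity.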

## Be as creative as you want to try other models, play around with features etc.

## Once you're done (for now), create your prediction_solution.csv to upload

If you tried a few different models, you can put your best model to the test (best validation/test score on your previous data split) by getting it evaluated by Timo's website. Here's how:

In [None]:
# If you used the full data, change this to the `unlabeled_data.csv` from Timo's folder.
# Both will work for the submission on Timo's website, but the full data has more features and will likely perform better if you applied good feature engineering.
sarahs_test_data = pd.read_csv("unlabeled_data_simplified.csv")
# This is equivalent to just X_test, Timo's website will evaluate your predictions on the hidden test set (y_test).

# TODO: Apply the same feature engineering to the test data as you did to the training data.
# sarahs_test_data_cleaned = apply_feature_engineering(sarahs_test_data)
# Note: You should not drop any rows from this test data, otherwise your submission will have missing rows and won't work.
# Make sure to not change the order of the rows either, that would mess up the IDs.

# TODO: Create your predictions for the test data.
# your_test_predictions = model.predict(sarahs_test_data_cleaned)

prediction_IDs = pd.read_csv("prediction_example.csv")["ID"]

# TODO: Create your prediction_solution.csv to upload, just uncomment this code part after you created your_test_predictions.
# prediction_solution = pd.DataFrame(
#    {"ID": prediction_IDs, "Quantity_Sold_(kilo)": your_test_predictions}
# )

# prediction_solution.to_csv(
#     "prediction_solution.csv", index=False
# )

Upload the `prediction_solution.csv` if you want :) Ask any questions in the group Slack.
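
Before uploading, it can help to check that the submission has one prediction per ID and no missing values. The helper below, `build_submission`, is a hypothetical function written for this sketch; the demo IDs and predictions are made up.

```python
import numpy as np
import pandas as pd


def build_submission(ids, predictions, target_col="Quantity_Sold_(kilo)"):
    """Combine IDs and predictions into a submission frame after basic checks."""
    predictions = np.asarray(predictions, dtype=float)
    # Dropping or adding rows in the test data would break the ID alignment.
    if len(ids) != len(predictions):
        raise ValueError(
            f"Got {len(predictions)} predictions for {len(ids)} IDs."
        )
    # A missing prediction usually points to a feature engineering problem.
    if np.isnan(predictions).any():
        raise ValueError("Predictions contain NaN values.")
    return pd.DataFrame({"ID": np.asarray(ids), target_col: predictions})


# Invented stand-ins for prediction_IDs and your_test_predictions.
demo_ids = pd.Series([0, 1, 2], name="ID")
demo_predictions = [5.1, 2.3, 7.8]

submission = build_submission(demo_ids, demo_predictions)
print(submission)
```

With the pipeline from this notebook, the predictions would come from something like `pipeline.predict(sarahs_test_data)`, since the pipeline applies the fitted preprocessing itself.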