# Kaggle Competition

The goal of machine learning is to build models with high predictive accuracy. Thus, it is not surprising that there exist machine learning competitions, where participants compete to build the model with the lowest possible prediction error.

[Kaggle](http://www.kaggle.com/) is a website that hosts machine learning competitions. In this lab, you will participate in a Kaggle competition with other students in this class! To join the competition, visit [this link](https://www.kaggle.com/t/0de6e5c147c84833a7338e6adcc1ed37). You will need a Kaggle account, but you can sign up with your Google account.

## Question 1

Train many different models to predict IBU. Try different subsets of variables. Try different machine learning algorithms (you are not restricted to just $k$-nearest neighbors). At least one of your models must contain variables derived from the `description` of each beer. Use cross-validation to systematically select good models and submit your predictions to Kaggle. You are allowed 2 submissions per day, so submit early and often!
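
As a minimal sketch of what "use cross-validation to select models" can look like, the example below scores one pipeline that combines `abv` with TF-IDF features derived from `description`. The toy rows, the choice of `Ridge`, and the 4-fold split are assumptions made for illustration only; they are not the competition data or a recommended configuration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Made-up rows standing in for the real training data.
toy = pd.DataFrame({
    "abv": [4.5, 7.2, 5.0, 9.1, 6.3, 4.8, 8.0, 5.5],
    "description": [
        "light crisp lager", "hoppy bitter ipa with citrus",
        "smooth wheat beer", "bold imperial ipa very bitter",
        "pale ale with floral hops", "mild golden lager",
        "double ipa with piney hops", "malty amber ale",
    ],
    "ibu": [12, 70, 15, 95, 40, 10, 85, 25],
})

pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("abv", "passthrough", ["abv"]),                          # numeric column, used as-is
        ("description_tfidf", TfidfVectorizer(), "description"),  # text column -> TF-IDF features
    ])),
    ("model", Ridge(alpha=1.0)),
])

# scikit-learn maximizes scores, so RMSE is reported as a negative number.
scores = cross_val_score(
    pipeline, toy[["abv", "description"]], toy["ibu"],
    cv=4, scoring="neg_root_mean_squared_error",
)
cv_rmse = -scores.mean()
print(f"Cross-validated RMSE: {cv_rmse:.2f}")
```

Repeating this for each candidate pipeline and keeping the one with the lowest cross-validated RMSE is the selection procedure the question asks for.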

Note that to submit your predictions to Kaggle, you will need to export your predictions to a CSV file (using `.to_csv()`) in the format expected by Kaggle (see `beer_test_sample_submission.csv` for an example).
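
Before uploading, it can help to compare your file against the sample submission. The sketch below is one possible check, not part of the official instructions: `check_submission` is a hypothetical helper, the toy data frames stand in for the real files, and the identifier is assumed to be the first column of the sample file.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append(f"columns {list(submission.columns)} != {list(sample.columns)}")
    if len(submission) != len(sample):
        problems.append(f"{len(submission)} rows, expected {len(sample)}")
    if submission.isna().any().any():
        problems.append("submission contains missing values")
    # Compare the identifier column (assumed to be the first column of the sample).
    id_col = sample.columns[0]
    if id_col in submission.columns and set(submission[id_col]) != set(sample[id_col]):
        problems.append(f"'{id_col}' values do not match the sample file")
    return problems


# Toy data standing in for the real files (made-up ids and predictions).
sample = pd.DataFrame({"id": [1, 2, 3], "ibu": [0.0, 0.0, 0.0]})
good = pd.DataFrame({"id": [1, 2, 3], "ibu": [35.2, 61.0, 18.4]})
bad = pd.DataFrame({"id": [1, 2], "ibu": [35.2, None]})

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

With the real files, `sample` would come from `pd.read_csv("beer_test_sample_submission.csv")` and `submission` from the CSV you exported.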

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import numpy as np

train_url = 'https://dlsun.github.io/pods/data/beer/beer_train.csv'
data = pd.read_csv(train_url)

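# Rows with the non-numeric value 'Over 40' are dropped so that `srm` can be
# converted to numbers. `.copy()` avoids a SettingWithCopyWarning on assignment.
data = data[data['srm'] != 'Over 40'].copy()
data['srm'] = pd.to_numeric(data['srm'])

# Drop training rows that have a missing value in any column.
data = data.dropna()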


test_url = 'https://dlsun.github.io/pods/data/beer/beer_test.csv'
test_data = pd.read_csv(test_url)


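# Impute missing test values rather than dropping rows, so every test row
# still gets a prediction. Assigning the result back replaces the chained
# `fillna(..., inplace=True)` calls, which recent pandas versions deprecate.
test_data['abv'] = test_data['abv'].fillna(test_data['abv'].mean())
test_data['name'] = test_data['name'].fillna('unknown')
test_data['available'] = test_data['available'].fillna('unknown')
test_data['glass'] = test_data['glass'].fillna('unknown')
test_data['description'] = test_data['description'].fillna('No description available')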




In [None]:
feature_subsets = [
    ["abv"],
    ["abv", "name"],
    ["abv", "name", "available"],
    ["abv", "name", "available", "glass"],
    ["abv", "name", "description"],
    ["abv", "name", "available", "glass", "description"],
    ["abv", "originalGravity"],
    ["abv", "srm"],
    ["abv", "isOrganic"],
    ["abv", "name", "description", "originalGravity", "srm", "isOrganic"]
]

target = "ibu"

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "DecisionTreeRegressor": DecisionTreeRegressor(),
    "RandomForestRegressor": RandomForestRegressor(),
    "GradientBoostingRegressor": GradientBoostingRegressor(),
}

param_grids = {
    "LinearRegression": {},
    "Ridge": {'model__alpha': [0.1, 1.0, 10.0, 100.0]},
    "Lasso": {'model__alpha': [0.1, 1.0, 10.0, 100.0]},
    "ElasticNet": {'model__alpha': [0.1, 1.0, 10.0], 'model__l1_ratio': [0.1, 0.5, 0.9]},
    "DecisionTreeRegressor": {'model__max_depth': [5, 10, 20]},
    "RandomForestRegressor": {'model__n_estimators': [100, 200], 'model__max_depth': [5, 10, 20]},
    "GradientBoostingRegressor": {'model__n_estimators': [100, 200], 'model__learning_rate': [0.01, 0.1, 0.2]},
}


In [None]:
results = {}

for model_name, model in models.items():
    for feature_set in feature_subsets:
        transformers = []

        if "abv" in feature_set:
            transformers.append(("abv", RobustScaler(), ["abv"]))
        if "name" in feature_set:
            transformers.append(("name_tfidf", TfidfVectorizer(max_features=100), "name"))
        if "available" in feature_set:
            transformers.append(("available_ohe", OneHotEncoder(handle_unknown='ignore'), ["available"]))
        if "glass" in feature_set:
            transformers.append(("glass_ohe", OneHotEncoder(handle_unknown='ignore'), ["glass"]))
        if "description" in feature_set:
            transformers.append(("description_tfidf", TfidfVectorizer(max_features=100), "description"))
        if "originalGravity" in feature_set:
            transformers.append(("originalGravity", RobustScaler(), ["originalGravity"]))
        if "srm" in feature_set:
            transformers.append(("srm", RobustScaler(), ["srm"]))
        if "isOrganic" in feature_set:
            transformers.append(("isOrganic_ohe", OneHotEncoder(handle_unknown='ignore'), ["isOrganic"]))

        preprocessor = ColumnTransformer(transformers, remainder='drop')

        pipeline = Pipeline(steps=[
            ("preprocessor", preprocessor),
            ("model", model)
        ])

        param_grid = param_grids[model_name]

        grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')
        X = data[feature_set].copy()
        y = data[target]
        grid_search.fit(X, y)

        best_model = grid_search.best_estimator_
        best_params = grid_search.best_params_
        best_score = -grid_search.best_score_
        best_rmse = np.sqrt(best_score)

        results[(model_name, tuple(feature_set))] = (best_score, best_rmse, best_params)


In [None]:

for (model_name, feature_set), (mse, rmse, best_params) in results.items():
    print(f"Model: {model_name}, Features: {feature_set}, MSE: {mse:.2f}, RMSE: {rmse:.2f}, Best Params: {best_params}")

best_model_features = min(results, key=lambda x: results[x][0])
best_model_name, best_features = best_model_features
best_mse, best_rmse, best_params = results[best_model_features]
print(f"\nBest Model: {best_model_name}, Features: {best_features}, MSE: {best_mse:.2f}, RMSE: {best_rmse:.2f}, Best Params: {best_params}")


Model: LinearRegression, Features: ('abv',), MSE: 507.99, RMSE: 22.54, Best Params: {}
Model: LinearRegression, Features: ('abv', 'name'), MSE: 378.85, RMSE: 19.46, Best Params: {}
Model: LinearRegression, Features: ('abv', 'name', 'available'), MSE: 379.56, RMSE: 19.48, Best Params: {}
Model: LinearRegression, Features: ('abv', 'name', 'available', 'glass'), MSE: 372.12, RMSE: 19.29, Best Params: {}
Model: LinearRegression, Features: ('abv', 'name', 'description'), MSE: 312.10, RMSE: 17.67, Best Params: {}
Model: LinearRegression, Features: ('abv', 'name', 'available', 'glass', 'description'), MSE: 312.09, RMSE: 17.67, Best Params: {}
Model: LinearRegression, Features: ('abv', 'originalGravity'), MSE: 480.93, RMSE: 21.93, Best Params: {}
Model: LinearRegression, Features: ('abv', 'srm'), MSE: 504.38, RMSE: 22.46, Best Params: {}
Model: LinearRegression, Features: ('abv', 'isOrganic'), MSE: 507.71, RMSE: 22.53, Best Params: {}
Model: LinearRegression, Features: ('abv', 'name', 'descrip

In [None]:
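best_model = models[best_model_name]
# Strip the 'model__' pipeline prefix so the parameters can be set on the bare estimator.
best_params = {k.split('__')[-1]: v for k, v in best_params.items()}
print(best_params)
best_features = list(best_features)

# One transformer per candidate feature, matching those used in the grid search.
available_transformers = {
    "abv": ("abv", RobustScaler(), ["abv"]),
    "name": ("name_tfidf", TfidfVectorizer(max_features=100), "name"),
    "available": ("available_ohe", OneHotEncoder(handle_unknown='ignore'), ["available"]),
    "glass": ("glass_ohe", OneHotEncoder(handle_unknown='ignore'), ["glass"]),
    "description": ("description_tfidf", TfidfVectorizer(max_features=100), "description"),
    "originalGravity": ("originalGravity", RobustScaler(), ["originalGravity"]),
    "srm": ("srm", RobustScaler(), ["srm"]),
    "isOrganic": ("isOrganic_ohe", OneHotEncoder(handle_unknown='ignore'), ["isOrganic"]),
}
# Keep only the transformers for the winning feature set, so the
# ColumnTransformer never refers to a column that is missing from X.
transformers = [available_transformers[f] for f in best_features]

pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers, remainder='drop')),
    ("model", best_model.set_params(**best_params))
])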

print(best_features)
print(target)
X = data[best_features].copy()
y = data[target]

pipeline.fit(X, y)

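# A submission normally needs a prediction for every test row, so report
# any rows that are about to be dropped because of missing feature values.
n_incomplete = test_data[best_features].isna().any(axis=1).sum()
if n_incomplete:
    print(f"Warning: dropping {n_incomplete} test rows with missing features")
test_data.dropna(subset=best_features, inplace=True)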
X_test = test_data[best_features].copy()

predictions = pipeline.predict(X_test)

test_data['ibu'] = predictions

submission = test_data[['id', 'ibu']]
submission.to_csv("submission.csv", index=False)

print("Predictions saved to submission.csv")


{'learning_rate': 0.1, 'n_estimators': 200}
['abv', 'name', 'available', 'glass', 'description']
ibu
Predictions saved to submission.csv



## Question 2

_This question should be done late in the competition, after you have already made several submissions to Kaggle._

In class, we discussed "ensemble methods", which are methods for combining predictions from different machine learning models. One simple method of ensembling regression models is to take a straight average of the predictions from the models. Work with another team, average the predictions from your best model and their best model, and upload the resulting predictions to Kaggle. Look at your RMSE on the public leaderboard. How does the test RMSE of the ensemble model compare to the test RMSEs of the individual models?

(_Note:_ You are not required to evaluate the ensemble model using cross-validation. Just report the RMSE from Kaggle.)
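
To see why a straight average can help, here is a small simulation on synthetic data (not the beer data). It assumes two hypothetical models whose errors are independent and of similar size, which is the favorable case. In general, the RMSE of the average is never larger than the mean of the two individual RMSEs, but it can still be worse than the better model alone when the errors are strongly correlated or one model is much weaker.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


rng = np.random.default_rng(0)

# Synthetic "true" values and two sets of predictions with independent noise.
y_true = rng.uniform(10, 100, size=1000)
pred_a = y_true + rng.normal(0, 20, size=1000)
pred_b = y_true + rng.normal(0, 20, size=1000)

# Straight average of the two models' predictions.
pred_avg = (pred_a + pred_b) / 2

rmse_a, rmse_b, rmse_avg = rmse(y_true, pred_a), rmse(y_true, pred_b), rmse(y_true, pred_avg)
print(f"model A: {rmse_a:.2f}, model B: {rmse_b:.2f}, average: {rmse_avg:.2f}")
```

Averaging cancels part of the independent noise, so the ensemble's RMSE here comes out well below either model's. On Kaggle, the gain is usually smaller because models trained on the same data tend to make correlated errors.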

In [None]:
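import pandas as pd

# Load both submission files; each is expected to have an `id` and an `ibu` column.
df1 = pd.read_csv('my_submission.csv')
df2 = pd.read_csv('other.csv')

# Match predictions by test id and check that each id appears only once per file.
merged_df = pd.merge(df1, df2, on='id', suffixes=('_mine', '_other'), validate='one_to_one')

# The ensemble prediction is the straight average of the two models' predictions.
merged_df['ibu'] = (merged_df['ibu_mine'] + merged_df['ibu_other']) / 2
merged_df[['id', 'ibu']].to_csv('ensemble_submission.csv', index=False)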

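After averaging the predictions from my best model and another student's best model, I uploaded the result to Kaggle. My RMSE on the public leaderboard dropped from 20.82356 to 20.03754, a reduction of 0.78602, or about 3.8%. Since lower RMSE is better, the ensemble performed slightly better than my individual model.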
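
The size of the improvement depends on the baseline it is measured against. As a quick check of the arithmetic, the sketch below expresses the drop between the two leaderboard scores quoted above in two ways: relative to the starting RMSE, and relative to the mean of the two scores. The variable names are illustrative.

```python
before, after = 20.82356, 20.03754  # public-leaderboard RMSEs quoted above

absolute_drop = before - after
relative_to_before = 100 * absolute_drop / before
relative_to_mean = 100 * absolute_drop / ((before + after) / 2)

print(f"absolute drop:        {absolute_drop:.5f}")
print(f"relative to start:    {relative_to_before:.2f}%")
print(f"relative to the mean: {relative_to_mean:.2f}%")
```

The two percentages differ slightly (roughly 3.77% versus 3.85%), so it is worth stating which baseline a reported percentage uses.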

## Submission Instructions

- Restart this notebook and run the cells from beginning to end:
  - Go to Runtime > Restart and Run All.
- Download the notebook:
  - Go to File > Download > Download .ipynb.
- Submit your notebook file to the assignment on Canvas.