<font color='darkred'>Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, you'll need to update the *apputil\.py* file and the *app\.py* file.

## Exercise 1

Recall the [simple Streamlit app](https://github.com/leontoddjohnson/simple_streamlit) and the [coffee analysis data](https://raw.githubusercontent.com/leontoddjohnson/datasets/refs/heads/main/data/coffee_analysis.csv) it uses.

Write a Python script called `train.py` that does the following:

- Loads the [coffee analysis data](https://raw.githubusercontent.com/leontoddjohnson/datasets/refs/heads/main/data/coffee_analysis.csv) (from the URL).
- Trains a (Scikit-Learn) linear regression model to predict `rating` based on the single feature `100g_USD`.
- Saves the trained model in this repository as a pickle file called `model_1.pickle`.

## Exercise 2

Update the script to train a **Decision Tree Regressor** model that predicts `rating` based on *both* `100g_USD` and `roast`, and saves the trained model as `model_2.pickle`. Notice that the `roast` column is categorical, so you'll need to convert it into a numerical label format:

- Create a dictionary that maps *all* categories to a number (e.g., `roast_cat['Medium-Light'] = 1`).
- Use `.map` or `.apply` (in pandas) to create a numerical column to train your model.
- Save the dictionary along with this process for the next exercise.

*Note: **Do not worry about model performance**. Interestingly, though, tree-based models like this tend to perform more efficiently with category labels than with one-hot encoded features.*
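
The encoding steps listed above can be sketched as follows. This is a minimal illustration on made-up rows, not the coffee data itself, and the file name `roast_cat.pickle` is a hypothetical choice (the exercise only asks that the dictionary be saved somehow).

```python
import pickle

import pandas as pd

# Toy stand-in for the coffee data; the real column values may differ.
df = pd.DataFrame({
    "100g_USD": [4.5, 6.0, 5.2, 7.1],
    "roast": ["Light", "Medium-Light", "Dark", "Medium-Light"],
})

# Map every distinct category to an integer label, in order of first appearance.
roast_cat = {cat: idx for idx, cat in enumerate(df["roast"].unique())}

# Create the numerical column that the model would be trained on.
df["roast_num"] = df["roast"].map(roast_cat)

# Persist the mapping so prediction code can reuse exactly the same labels.
# 'roast_cat.pickle' is a hypothetical file name, not one the exercises require.
with open("roast_cat.pickle", "wb") as f:
    pickle.dump(roast_cat, f)

# Reload it the way a prediction function might.
with open("roast_cat.pickle", "rb") as f:
    roast_cat_loaded = pickle.load(f)
```

Saving the dictionary next to the model matters because the integer labels are arbitrary: a prediction function has to apply the same mapping that was used during training.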

## Exercise 3

Update the *apputil\.py* file to include a `predict_rating(df_X)` function that takes in a two-column dataframe, `df_X`, with columns `100g_USD` (numerical) and `roast` (in original text form), and returns an array containing corresponding predicted `rating` values. If a `roast` value is not one of the roast values in the training data, the function should only use the `100g_USD` value to make the prediction (recall `model_1.pickle`). Otherwise, it should use both features.
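
One possible shape for this fallback logic is sketched below. It is an illustration under assumptions, not the required solution: the two models are trained on made-up rows so the block runs on its own, whereas *apputil\.py* would load them from `model_1.pickle` and `model_2.pickle`, and passing the models and the `roast_cat` dictionary as default arguments is a hypothetical design choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy training data standing in for the coffee dataset (values are made up).
train = pd.DataFrame({
    "100g_USD": [4.0, 5.0, 6.0, 8.0, 10.0, 12.0],
    "roast": ["Light", "Dark", "Medium", "Light", "Dark", "Medium"],
    "rating": [90.0, 88.0, 91.0, 93.0, 89.0, 94.0],
})
roast_cat = {cat: idx for idx, cat in enumerate(train["roast"].unique())}
train["roast_num"] = train["roast"].map(roast_cat)

# In apputil.py these would be loaded from model_1.pickle and model_2.pickle.
model_1 = LinearRegression().fit(train[["100g_USD"]].values, train["rating"].values)
model_2 = DecisionTreeRegressor(random_state=42).fit(
    train[["100g_USD", "roast_num"]].values, train["rating"].values
)


def predict_rating(df_X, model_1=model_1, model_2=model_2, roast_cat=roast_cat):
    """Predict ratings, falling back to the price-only model for unseen roasts."""
    price = df_X["100g_USD"].to_numpy(dtype=float)
    roast_num = df_X["roast"].map(roast_cat)  # NaN where the roast is unknown
    known = roast_num.notna().to_numpy()

    y_pred = np.empty(len(df_X), dtype=float)
    if known.any():
        # Rows with a known roast use both features (decision tree).
        X_known = np.column_stack(
            [price[known], roast_num[known].to_numpy(dtype=float)]
        )
        y_pred[known] = model_2.predict(X_known)
    if (~known).any():
        # Rows with an unseen roast use only the price (linear regression).
        y_pred[~known] = model_1.predict(price[~known].reshape(-1, 1))
    return y_pred


df_X = pd.DataFrame(
    [[10.00, "Dark"], [15.00, "Very Light"]],
    columns=["100g_USD", "roast"],
)
y_pred = predict_rating(df_X)
```

Here "Very Light" is absent from the toy training data, so the second row is routed to the price-only model while the first row uses both features.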

In [8]:
# Exercise 1: Write a Python script called 'train.py' that loads the coffee analysis
# data from a CSV file, trains a scikit-learn linear regression model to predict
# 'rating' based on the single feature '100g_USD', and saves the trained model in
# this repository as a pickle file called 'model_1.pickle'.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import pickle

# 1) Load the coffee analysis data from a CSV file
data = pd.read_csv("https://raw.githubusercontent.com/leontoddjohnson/datasets/refs/heads/main/data/coffee_analysis.csv")

# 2) Train a scikit-learn linear regression model to predict 'rating' based on the single feature '100g_USD'
# Split the data into training and testing sets with an 80:20 ratio, then fit the
# model on the training portion only.
df_train, df_test = train_test_split(data, test_size=0.2)
features = ['100g_USD']
X_train = df_train[features]
y_train = df_train['rating']
lm = LinearRegression()
lm.fit(X_train.values, y_train.values)

# 3) Save the trained model in this repository as a pickle file called 'model_1.pickle'
with open('model_1.pickle', 'wb') as f:
    pickle.dump(lm, f)

print("Linear Regression model trained and saved as 'model_1.pickle'")

Linear Regression model trained and saved as 'model_1.pickle'


In [7]:
# Exercise 2: Update the script to train a Decision Tree Regressor model that predicts
# 'rating' based on both '100g_USD' and 'roast', and saves the trained model as
# 'model_2.pickle'.
# Note: the 'roast' column is categorical, so it must be converted into a numerical
# label format.
from sklearn.tree import DecisionTreeRegressor

# Plan:
# - Create a dictionary that maps all categories to a number
#   (e.g., roast_cat['Medium-Light'] = 1).
# - Use '.map' or '.apply' (in pandas) to create a numerical column to train the model.
# - Save the dictionary along with this process for the next exercise.
# 1) Encode the categorical 'roast' column into numerical labels
roast_cat = {cat: idx for idx, cat in enumerate(data['roast'].unique())}
data['roast_num'] = data['roast'].map(roast_cat)

# 2) Prepare features and target
features = ['100g_USD', 'roast_num']
X = data[features]
y = data['rating']

# 3) Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

# Save the trained model as 'model_2.pickle'
with open('model_2.pickle', 'wb') as f:
    pickle.dump(dt, f)

print("Decision Tree Model trained and saved as 'model_2.pickle'")

Decision Tree Model trained and saved as 'model_2.pickle'


In [9]:
import pandas as pd
from apputil import predict_rating

df_X = pd.DataFrame([
    [10.00, "Dark"],
    [15.00, "Very Light"]], 
    columns=["100g_USD", "roast"])
y_pred = predict_rating(df_X)
y_pred

ImportError: cannot import name 'predict_rating' from 'apputil' (/Users/woodsprocise/Documents/IU Indy - Fall '25/Code Space Projects /week-10/apputil.py)

## (Optional) Bonus Exercise

Vectorize the `desc_3` column in the coffee analysis data using TF-IDF vectorization. Train a linear regression model to predict `rating` based only on the vectorized text data, and save the trained model as `model_3.pickle`.

Adjust your function to `predict_rating(X, text=True)`, where the `text` argument indicates that `X` is an array of strings (in the style of the reviews in `desc_3`). When `text=True`, the function should return predicted ratings based on the text.

Note: you'll need to figure out what to do when the input text contains words that were not in the training data!
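
As a rough sketch of the text pipeline, the block below fits scikit-learn's `TfidfVectorizer` and a linear regression on a few made-up reviews; the texts and ratings are placeholders, not rows from `desc_3`. One option for unseen words, shown here, is to rely on the fitted vectorizer's `transform`, which ignores tokens that are not in its learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Made-up reviews and ratings standing in for 'desc_3' and 'rating'.
texts = [
    "bright citrus notes with a sweet finish",
    "smoky and bold with a bitter finish",
    "chocolate and caramel sweetness, very balanced",
    "flat and bitter with little aroma",
]
ratings = [93.0, 88.0, 94.0, 85.0]

# Learn the vocabulary and TF-IDF weights from the training text only.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)
model_3 = LinearRegression().fit(X_text, ratings)

# New text is transformed with the already-fitted vectorizer, so words that
# were never seen in training (e.g., "delightful") simply contribute nothing.
new_texts = ["a delightful coffee with hints of chocolate and caramel"]
X_new = vectorizer.transform(new_texts)
y_new = model_3.predict(X_new)
```

With this approach the fitted vectorizer has to be saved alongside `model_3.pickle`, since the model only understands columns in the vocabulary order learned at training time.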

In [None]:
X = pd.DataFrame([
    "A delightfull coffee with hints of chocolate and caramel.",
    "A strong coffee with a bold flavor and a smoky finish."], 
    columns=["text"])
y = predict_rating(X, text=True)
y