> DUPLICATE THIS COLAB BEFORE YOU START WORKING ON IT, using File > Save a copy in Drive.

# Applied ML SOTA Week 1: End-to-end Supervised Learning Model Development

In this project, you will be building a supervised learning model on a single Kaggle dataset scraped from [Wish](https://www.wish.com/), an e-commerce website. Unlike previous weeks, where cleaned, complete datasets were provided for you, you will be working on designing an ML project end-to-end.

This assignment will walk you through the steps to prepare a raw web scrape into a dataset that can be used to train a supervised learning model. You will also have the opportunity to try new state-of-the-art modeling approaches and apply what you have learned in practice.

### Instructions

1. We provide starter code and data to give your work a common starting point and structure. You must keep function signatures unchanged to support later usage and to ensure your project is graded successfully.
2. Read through the document and starting code before beginning your work. Understand the overall structure and goals of the project to ensure your implementation is efficient.
3. Tasks marked as _extensions_ or _optional_ are intended to provide advanced ML engineering or modeling challenges. You may skip or attempt these tasks as you like.


# Dependencies

Let's start by importing all the libraries that we'll need throughout the project:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from typing import Set, Tuple
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder

# Testing
%pip install -U ipytest
import ipytest
import pytest
ipytest.autoconfig()

Next, we'll fix a random seed so we produce consistent results that can be easily discussed.

In [None]:
# Fix the random seed so that we get consistent results
# We'll use this same seed throughout the notebook
SEED = 0
np.random.seed(SEED)

# Summer clothing sales prediction

The raw Kaggle dataset we will be working with was gathered from Wish.com, an e-commerce website. Each sample in the dataset corresponds to a product that would appear if you type "summer" in the search field of the website. [Here is the Kaggle link.](https://www.kaggle.com/datasets/jmmvutu/summer-products-and-sales-in-ecommerce-wish?datasetId=819786&searchQuery=sales+prediction)

The following screenshot from the website shows some features and how to interpret them:

![](https://drive.google.com/uc?export=view&id=1yH00henYvw0h84lx0UyZGskPhkkKeQXP)

Given the attributes (or features) of a product, can we predict the number of units sold for that product? Let's load the dataset and formulate the problem as an ML task.

## Data Loading

We store the dataset on Google Drive. This code downloads the dataset to your Colab instance.

In [None]:
!gdown 1NOzxjbZIiVc31V1_GEyXbMo4n4AuvpH4

In [None]:
df = pd.read_csv('summer-products-with-rating-and-performance_2020-08.csv')
print("Dataset size:", len(df))
df.head()

Now, let's split the dataset into train and test sets. Since we are trying to predict the number of units sold, the `units_sold` column will become our $y$ values.

The next cell separates the input ($X$) matrix and output ($y$) vector and splits
the train and test sets. *We'll keep these splits fixed for the rest of the notebook to avoid leakage between train and test sets.*

In [None]:
# Make `units_sold` the y values
X = df.drop('units_sold', axis=1)
y = df['units_sold'].values

# Get train and test set splits - don't modify these!
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, shuffle=True
)

## ML Task Formulation

Recall that we want to design an ML model that can predict the number of units sold for the [Wish](https://www.wish.com/) product listings. At first glance, this naturally sounds like a regression problem.

The small sample of product data above shows that the values for `units_sold` are rounded numbers. Let's take a closer look at the distribution of `units_sold` to get a better understanding of what we're predicting.


In [None]:
# How many unique values for `units_sold` do we have?
set(y_train)

There are only 14 unique values for `units_sold`! This makes sense, since the Kaggle page describes this column as a "lower bound approximation by steps." In other words, this isn't truly a continuous column, since the number of units sold has been put into bins. It would be better to formulate this as a **classification** problem, where we'll treat the different bins as classes.

We formalize this classification task as:
* Inputs $x$: attributes of the product listing, including `title`, `price`, `rating`, etc. (these may change as we do some feature engineering on the dataset later in this notebook!)
* Output $y$: one of 10 classes indicating the number of units sold

Let's convert the `units_sold` column into a column of 10 classes and build an initial model $f_θ(x) → y$. We'll combine the small values (where $y < 10$) into the same class.

In [None]:
def get_units_sold_class(units_sold: int) -> str:
    """
    Get the class for the given value of units_sold.
    There are 10 distinct classes.
    Args:
      units_sold (int): original units_sold value from
      the Kaggle dataset
    Returns:
      units_sold_class (string): string representation of the
      class the given value of units_sold belongs to
    """
    if units_sold < 10:
        return '1-10'
    elif units_sold < 50:
        return '10-50'
    elif units_sold < 100:
        return '50-100'
    elif units_sold < 1000:
        return '100-1000'
    elif units_sold < 5000:
        return '1000-5000'
    elif units_sold < 10000:
        return '5000-10000'
    elif units_sold < 20000:
        return '10000-20000'
    elif units_sold < 50000:
        return '20000-50000'
    elif units_sold < 100000:
        return '50000-100000'
    else:
        return '>100000'

### Task: Convert outputs into classes

Using the `get_units_sold_class` function defined above, convert `y_train` and `y_test` into vectors of classes, represented by strings (`'1-10'`, `'10-50'`, etc.). Store the new vectors in variables called `y_train_class` and `y_test_class`, respectively.

In [None]:
#############################
# Store your new output vectors with the following variable names:
# y_train_class = ...
# y_test_class = ...
#### YOUR CODE GOES HERE ####

#############################

In [None]:
#@title Test: Convert outputs into classes

%%ipytest

@pytest.mark.parametrize('y, y_class, split', [
    (y_train, y_train_class, 'train'),
    (y_test, y_test_class, 'test'),
])
def test_y_class(y, y_class, split):
    for i, (yi, yi_class) in enumerate(zip(y, y_class)):
        assert get_units_sold_class(yi) == yi_class, \
            'Incorrect class assignment for index {} of y_{}_class'.format(i, split)

## Classification Baseline

Let's predict the majority class and report its test set accuracy. This will serve as our classification baseline.
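
The majority class does not have to be hard-coded. As a minimal sketch, scikit-learn's `DummyClassifier` with `strategy='most_frequent'` learns the most common label from the training split and predicts it for every sample. The toy labels and `toy_`-prefixed names below are made up for illustration and are not taken from the Wish dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels standing in for y_train_class / y_test_class
toy_y_train = np.array(['100-1000', '100-1000', '10-50', '100-1000', '1000-5000'])
toy_y_test = np.array(['100-1000', '10-50', '100-1000', '1000-5000'])

# DummyClassifier ignores the features, so a placeholder matrix is enough
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

# Learn the most frequent label from the training split only
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X_train, toy_y_train)

toy_pred = baseline.predict(toy_X_test)
baseline_acc = accuracy_score(toy_y_test, toy_pred)
print("Majority class:", toy_pred[0])
print("Baseline test accuracy: %.2f" % baseline_acc)
```

Deriving the majority class from the training labels keeps the baseline honest if the split or the class definitions change later.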

In [None]:
# Let's plot a bar chart of the frequencies for each class
fig, ax = plt.subplots()
units_sold_classes = pd.DataFrame({'units_sold_class': y_train_class})
units_sold_classes.value_counts().plot(ax=ax, kind='bar')
plt.show()

In [None]:
# Predict majority class
y_train_pred = ['100-1000'] * len(y_train_class)
y_test_pred = ['100-1000'] * len(y_test_class)

print("Test accuracy: %.2f" % accuracy_score(y_test_class, y_test_pred))

# MVP Model
Now that we've loaded the dataset, formulated the appropriate ML task, and performed a basic data check, let's get our first model working. The dataset doesn't come prepared in a format that can be fed into a model right off the bat, so we'll need to do some preprocessing.

In [None]:
# Get a fresh copy of the train and test set features
# If we make any changes, we'll make them on this copy
X_train_ft = X_train.copy()
X_test_ft = X_test.copy()

## One-hot Encoding and Missing Value Imputation

The inputs to a model are represented by the real-valued feature matrix $X$. This means that any string columns need to be transformed into numeric values in the feature matrix. We'll make the assumption that we can separate continuous and categorical variables based on their datatype; any numeric columns will be considered as continuous variables, and any string columns will be considered as categorical variables. That way, we can one-hot encode string columns in order to transform them in the feature matrix. (This is a naive solution - we'll soon see some better feature engineering approaches for this dataset.)

We'll also need to consider any null values, which can't be given as inputs to the model. We'll just fill those with 0 for now.
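
To see why the train and test sets are encoded together when using `pd.get_dummies`, here is a small sketch on made-up data. The frames and the `toy_` names are hypothetical; the point is only that encoding each split separately can yield different column sets, while encoding the concatenated frame keeps them aligned.

```python
import pandas as pd

# Hypothetical toy frames; 'color' has a category in test that train never sees
toy_train = pd.DataFrame({'price': [5.0, None, 8.0], 'color': ['red', 'blue', 'red']})
toy_test = pd.DataFrame({'price': [7.0, 6.0], 'color': ['green', None]})

# Encoding each split separately yields different column sets
separate_train_cols = list(pd.get_dummies(toy_train).columns)
separate_test_cols = list(pd.get_dummies(toy_test).columns)

# Encoding the concatenated frame keeps the columns aligned,
# and fillna(0) replaces the remaining missing numeric values
n_train = len(toy_train)
toy_all = pd.get_dummies(pd.concat([toy_train, toy_test])).fillna(0)
joint_train, joint_test = toy_all.iloc[:n_train], toy_all.iloc[n_train:]

print("Separate train columns:", separate_train_cols)
print("Separate test columns: ", separate_test_cols)
print("Joint columns:         ", list(joint_train.columns))
```

Note that a missing categorical value simply gets all zeros in its one-hot columns, since `get_dummies` does not create a separate column for nulls by default.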

In [None]:
def get_columns(df: pd.DataFrame) -> Tuple[pd.Index, pd.Index]:
    """
    Get the continuous and categorical columns from the given
    DataFrame df based on the columns' datatype. Numerical columns
    (dtype float64, int64) are considered continuous, and string columns
    (dtype object) are considered categorical.
    Args:
      df (DataFrame): original dataset features
    Returns:
      (continuous_columns, categorical_columns): tuple containing
      continuous and categorical columns from the input df
    """
    continuous_columns = df.select_dtypes(include=['float64', 'int64']).columns
    categorical_columns = df.select_dtypes(include=['object']).columns
    return continuous_columns, categorical_columns

continuous_columns, categorical_columns = get_columns(X_train)

### Task: Simple preprocessing using Pandas

One-hot encode categorical columns and impute missing values in the dataset using Pandas functions (i.e. `get_dummies` and `fillna`).
Assign the new DataFrame to a variable `X`.

In [None]:
# One-hot encode categorical columns and fill null values with zero
# When using get_dummies, the X_train and X_test need to be
# encoded simultaneously in order to have the same number of columns
num_train = len(X_train)
X = pd.concat([X_train, X_test])

#############################
#### YOUR CODE GOES HERE ####

#############################

In [None]:
#@title Test: Simple preprocessing using Pandas

%%ipytest

def test_dtypes():
    assert(np.dtype('object') not in set(X.dtypes)), \
        '''Dataframe contains non-numeric column. Check your one-hot encoding.'''

def test_column_count():
    assert(len(X.columns) == 11969), \
        '''Dataframe has the incorrect number of columns. Check your one-hot encoding.'''

def test_null_entries():
    for col, null_count in X.isnull().sum().items():
        assert(null_count == 0), '''Column {} contains {} null entries'''.format(col, null_count)

In [None]:
# Separate out X_train_ft and X_test_ft
X_train_ft = X.iloc[:num_train].copy()
X_test_ft = X.iloc[num_train:].copy()

# Check for null values (total count across all columns)
print('Null values in train set:', X_train_ft.isnull().sum().sum())
print('Null values in test set:', X_test_ft.isnull().sum().sum())

# Check the new number of features
print('Train set shape:', X_train_ft.shape)

## Linear Model Test
Now that we've done this preprocessing, we can train a scikit-learn model from our dataset. Let's try logistic regression.

In [None]:
# Fit linear model
lr = linear_model.LogisticRegression(max_iter=500)
lr.fit(X_train_ft, y_train_class)

print("Train accuracy: %.2f" % accuracy_score(y_train_class, lr.predict(X_train_ft)))
print("Test accuracy: %.2f" % accuracy_score(y_test_class, lr.predict(X_test_ft)))

The logistic regression model performs better than our baseline, but accuracy is still quite low on both the training and test sets. We also get a convergence warning, even when setting a large number of iterations.

## Coordinate Standardization

One crucial preprocessing step we are missing is coordinate standardization for the continuous features. Let's take a look at the mean and standard deviation of these features in our dataset.

In [None]:
# Get stats about continuous features using .describe()
X_train_ft[continuous_columns].describe()

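Many ML models, especially linear models fit with iterative optimizers, work best when all the features have a mean of 0 and a standard deviation of 1 (tree-based models are largely insensitive to feature scale). This is far from true for our dataset right now. Let's z-score our dataset and see how it improves the linear model's performance.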
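
As a worked example of z-scoring, each value $x$ in a column is mapped to $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are that column's mean and standard deviation computed on the training set. The sketch below uses made-up numbers (the `toy_` arrays are hypothetical) and checks the manual computation against `StandardScaler`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two continuous features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
toy_test = np.array([[4.0, 200.0]])

# Statistics come from the training set only
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)  # population std (ddof=0)

# Manual z-scoring of both splits with the training statistics
manual_train = (toy_train - train_mean) / train_std
manual_test = (toy_test - train_mean) / train_std

# StandardScaler stores the same statistics when fit on the training set
toy_scaler = StandardScaler().fit(toy_train)
print(np.allclose(manual_train, toy_scaler.transform(toy_train)))
print(np.allclose(manual_test, toy_scaler.transform(toy_test)))
```

The test split is scaled with the training mean and standard deviation, so it will generally not have exactly zero mean and unit variance itself.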

### Task: Scaling continuous columns
The code block below uses scikit-learn's built-in `StandardScaler` to z-score the continuous columns in the training dataset. Notice that the `fit_transform()` function serves to both "fit" the scaler to the training dataset and perform the transformation simultaneously. Your task is to transform the test set using the fitted `scaler`. Store the transformed test set in a variable called `X_test_ft`.

You can read more about the usage of `StandardScaler` in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [None]:
scaler = StandardScaler()

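# Separate columns (the list comprehension keeps the original column order,
# whereas a set difference would give an arbitrary order)
continuous_set = set(continuous_columns)
encoded_columns = [col for col in X_train_ft.columns if col not in continuous_set]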
X_train_encoded_cols = X_train_ft[encoded_columns]
X_train_scaled_cols = scaler.fit_transform(X_train_ft[continuous_columns])
X_train_ft = np.concatenate([X_train_scaled_cols, X_train_encoded_cols], axis=1)

#############################
#### YOUR CODE GOES HERE ####

#############################

In [None]:
#@title Test: Scaling continuous columns

%%ipytest

def test_feature_cnt():
    assert(X_train_ft.shape[1] == X_test_ft.shape[1]), \
        '''Train and test sets have a mismatch in the number of features'''

Now, we will fit a few models on the normalized dataset.

In [None]:
# Fit linear model
lr = linear_model.LogisticRegression(max_iter=500)
lr.fit(X_train_ft, y_train_class)

print("Train accuracy: %.2f" % accuracy_score(y_train_class, lr.predict(X_train_ft)))
print("Test accuracy: %.2f" % accuracy_score(y_test_class, lr.predict(X_test_ft)))

In [None]:
# Try nonlinear model
dt = DecisionTreeClassifier(random_state=SEED)
dt.fit(X_train_ft, y_train_class)

print("Train accuracy: %.2f" % accuracy_score(y_train_class, dt.predict(X_train_ft)))
print("Test accuracy: %.2f" % accuracy_score(y_test_class, dt.predict(X_test_ft)))

After z-scoring the dataset, we no longer get a convergence warning while fitting the logistic regression model, and we get a more reasonable test accuracy. However, we can see that the model is overfitting the training set since it achieves perfect train accuracy.

**Question:**
Why do you think a simple linear model like logistic regression is overfitting the training set?

===== (Write your answer here) =====

In [None]:
X_train_ft.shape

Our transformed feature matrix for the training set has 11,969 features for just 1,258 training samples! As a general rule of thumb, the ratio of samples to features should be at least about 10:1. In our case, our feature matrix is far from ideal -- there are far too many features for a small number of training examples.
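
To make the effect of having many more features than samples concrete, here is a small sketch on purely random data (all names and sizes are hypothetical and unrelated to the Wish dataset). The labels carry no signal, yet with 500 features and only 50 samples a logistic regression can typically still fit the training set almost perfectly, while test accuracy should stay near chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_samples, n_features = 50, 500  # far more features than samples

# Random features and random binary labels: there is nothing real to learn
noise_X_train = rng.normal(size=(n_samples, n_features))
noise_y_train = rng.integers(0, 2, size=n_samples)
noise_X_test = rng.normal(size=(n_samples, n_features))
noise_y_test = rng.integers(0, 2, size=n_samples)

noise_lr = LogisticRegression(max_iter=1000)
noise_lr.fit(noise_X_train, noise_y_train)

noise_train_acc = accuracy_score(noise_y_train, noise_lr.predict(noise_X_train))
noise_test_acc = accuracy_score(noise_y_test, noise_lr.predict(noise_X_test))
print("Train accuracy on noise: %.2f" % noise_train_acc)
print("Test accuracy on noise: %.2f" % noise_test_acc)
```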

Before trying any more complex models to improve test performance, let's take a closer look at our dataset in order to engineer better features for our model.

# Feature Engineering

In order to perform principled feature engineering, we need to get a better understanding of what our dataset actually contains. We'll see that our naive choices for continuous and categorical columns were not ideal, and some columns will require hands-on feature derivations.

In [None]:
# Get a fresh copy of the train and test set features
# If we make any changes, we'll make them on this copy
X_train_ft = X_train.copy()
X_test_ft = X_test.copy()

## Dataset Understanding

Let's take a closer look at all of the columns in our original training dataset, before we performed any transformations. Pandas's `.info()` function provides the row counts, column names, non-null counts for each column, and dtype for each column.

In [None]:
# Get info for each column
X_train_ft.info()

In [None]:
# Observe first example in dataset
X_train_ft.iloc[0]

Under the CSV preview, the Kaggle link also includes column descriptions that can aid us with feature engineering:

![](https://drive.google.com/uc?export=view&id=10lW-pwE91IOZJuhHK5u-Z5ouYgvo_JOu)

Here are the column descriptions:
1. Title: localized for European countries. May be the same as title_orig if the seller did not offer a translation
2. Title_orig: original English title of the product
3. price: price for buyer
4. retail_price: retail price, or reference price in other stores/places. Used by the seller to indicate a regular value or the price before discount.
5. currency_buyer
6. units_sold: Number of units sold. Lower bound approximation by steps
7. uses_ad_boosts: Whether the seller paid to boost his product within the platform (highlighting, better placement or whatever)
8. rating: Mean product rating
9. rating_count: Total number of ratings of the product
10. rating_five_count: Number of 5-star ratings
11. rating_four_count: Number of 4-star ratings
12. rating_three_count: Number of 3-star ratings
13. rating_two_count: Number of 2-star ratings
14. rating_one_count: Number of 1-star ratings
15. badges_count: number of badges the product or seller have
16. badges_local_product: A badge that denotes the product is a local product. Conditions may vary (being produced locally, or something else). Some people may prefer buying local products rather than. 1 means Yes, has the badge
17. badge_product_quality: Badge awarded when many buyers consistently gave good evaluations 1 means Yes, has the badge
18. badge_fast_shipping: Badge awarded when this product's order is consistently shipped rapidly
19. tags: tags set by the seller
20. product_color: Product's main color
21. product_variation_size_id: One of the available size variation for this product
22. product_variation_inventory: Inventory the seller has. Max allowed quantity is 50
23. shipping_option_name
24. shipping_option_price: shipping price
25. shipping_is_express: whether the shipping is express or not. 1 for True
26. countries_shipped_to: Number of countries this product is shipped to. Sellers may choose to limit where they ship a product to
27. inventory_total: Total inventory for all the product's variations (size/color variations for instance)
28. has_urgency_banner: whether there was an urgency banner with an urgency
29. urgency_text: A text banner that appears over some products in the search results.
30. origin_country
31. merchant_title: Merchant's displayed name (show in the UI as the seller's shop name)
32. merchant_name: Merchant's canonical name. A name not shown publicly. Used by the website under the hood as a canonical name. Easier to process since all lowercase without white space
33. merchant_info_subtitle: The subtitle text as shown on a seller's info section to the user. (raw, not preprocessed). The website shows this to the user to give an overview of the seller's stats to the user. Mostly consists of "% ( reviews)" written in French
34. merchant_rating_count: Number of ratings of this seller
35. merchant_rating: merchant's rating
36. merchant_id: merchant unique id
37. merchant_has_profile_picture: Convenience boolean that says whether there is a "merchant_profile_picture" url
38. merchant_profile_picture: Custom profile picture of the seller (if the seller has one). Empty otherwise.
39. product_url: url to the product page. You may need to login to access it
40. product_picture
41. product_id: product identifier. You can use this key to remove duplicate entries if you're not interested in studying them.
42. theme: the search term used in the search bar of the website to get these search results.
43. crawl_month: meta for info only.

**Question:**

Based on what we know so far about the dataset, what columns do you think are important for predicting the number of units sold? What columns do you think are irrelevant? Feel free to consult the additional metadata about the dataset on the Kaggle page!

===== (Write your answer here) =====

## Re-evaluating Column Encodings

Observing the info table and the example, we can see that some of the string columns we treated as categorical features are actually free-form text, such as `title`, `title_orig`, `merchant_info_subtitle`, etc. These are likely to be unique to each example and shouldn't be treated as true classes.

Let's see if this is the case by checking the number of unique values for each column.

In [None]:
X_train_ft.nunique()

Many of the string columns have a large number of unique values. Out of the 1,258 samples in the training set, there are 1,100 unique values for `title`. One-hot encoding a column such as `title` blows up the number of features in our dataset, and since a `title` in the test set is unlikely to exactly match one in the training set, we won't gain any meaningful generalization from those features. This is true not only for free-form text columns, but also for columns containing IDs or URLs. There are separate NLP methods to handle free-form text, but for now, let's omit all these columns.
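
One way to spot such columns programmatically is to compare each column's number of unique values to the number of rows. The sketch below runs on a made-up DataFrame, and the cutoff `CARDINALITY_THRESHOLD` is a hypothetical choice for illustration rather than a recommended value.

```python
import pandas as pd

# Hypothetical toy listings
toy_df = pd.DataFrame({
    'title': ['Red summer dress', 'Blue beach shorts', 'Floral tank top', 'Striped shirt'],
    'product_id': ['a1', 'b2', 'c3', 'd4'],
    'origin_country': ['CN', 'CN', 'US', 'CN'],
})

# Fraction of rows that hold a distinct value, per column
unique_ratio = toy_df.nunique() / len(toy_df)

# Hypothetical cutoff: flag columns where more than half the rows are distinct
CARDINALITY_THRESHOLD = 0.5
high_cardinality_cols = unique_ratio[unique_ratio > CARDINALITY_THRESHOLD].index.tolist()
print(high_cardinality_cols)
```

A ratio close to 1 suggests an identifier or free-form text column; the flagged list is a starting point for manual review, not an automatic drop list.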

In [None]:
# Drop long-form text-based columns
text_cols = ['title', 'title_orig', 'merchant_title', 'merchant_name', 'merchant_info_subtitle']

# Drop IDs, usernames, URLs, etc.
id_cols = ['merchant_id', 'merchant_profile_picture', 'product_picture', 'product_url', 'product_id']

categorical_cols_to_drop = text_cols + id_cols

There are also several columns that only have a single unique value. A model won't learn any meaningful variation between samples from such features, since all samples have the same value. We can omit these columns.
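
Constant columns can be found with `nunique` as well. In this sketch on made-up data, `dropna=False` counts a missing value as its own value, so a column that mixes one value with nulls is not flagged as constant.

```python
import pandas as pd

# Hypothetical toy frame; None in the numeric column becomes NaN
toy_df = pd.DataFrame({
    'currency_buyer': ['EUR', 'EUR', 'EUR'],
    'theme': ['summer', 'summer', 'summer'],
    'price': [5.0, 8.0, 11.0],
    'has_urgency_banner': [1.0, None, 1.0],
})

# Count distinct values per column, treating NaN as a value
n_unique = toy_df.nunique(dropna=False)

# Columns with exactly one distinct value carry no information
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```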

In [None]:
# Columns with only one unique value
single_val_cols = ['currency_buyer', 'theme', 'crawl_month']
categorical_cols_to_drop.extend(single_val_cols)

You may have noticed that we did not choose to drop `has_urgency_banner` and `urgency_text`. Those columns actually have two unique values - a null value and a non-null value. A null value indicates that no urgency banner was present on the listing. However, these two columns essentially encode the same attribute and will be completely correlated, so let's drop one of them.

In [None]:
# Drop urgency text since it's redundant
categorical_cols_to_drop.append('urgency_text')

# Now, let's drop all these columns
X_train_ft.drop(columns=categorical_cols_to_drop, inplace=True)
X_test_ft.drop(columns=categorical_cols_to_drop, inplace=True)

## Derived Features

Some columns can be transformed into useful derived features. In the dataset, the `tags` column is a list of keywords set by the seller. We can extract the most frequent tags in the training set and use that to encode a new set of boolean features.

In [None]:
# View some tags in the training set
X_train_ft['tags'].head()

### Task: Find the most frequent tags in the training set

Extract the most frequent tags across the training set. For example, in the samples above, `Mini`, `womens dresses`, and `Summer` are different tags. Some points to consider:
- Tags in the dataset are comma-separated.
- You should do some case normalization -- `shirt` and `Shirt` should be considered the same tag.
- Use Python's `Counter` to keep counts of the tags, and store the counter in a variable `c`.
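
As a quick refresher on the `Counter` API, here is a sketch on a few made-up tag strings (not the dataset itself). Stripping whitespace around each tag is an assumption made for this toy example.

```python
from collections import Counter

# Hypothetical comma-separated tag strings
toy_tags = ['Summer,Mini,womens dresses', 'summer,Shirt', 'shirt,Beach,SUMMER']

toy_counter = Counter()
for tag_string in toy_tags:
    # Split on commas and lowercase so 'Shirt' and 'shirt' share one count
    toy_counter.update(tag.strip().lower() for tag in tag_string.split(','))

# most_common(n) returns (tag, count) pairs sorted by descending count
top_two = toy_counter.most_common(2)
print(top_two)
```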

In [None]:
c = Counter()

#############################
#### YOUR CODE GOES HERE ####

#############################

# View top tags
c.most_common(20)

In [None]:
#@title Test: Find the most frequent tags

%%ipytest

top_tags = [
  "women's fashion",
  'summer',
  'fashion',
  'women',
  'casual',
  'plus size',
  'sleeveless',
  'shorts',
  'dress',
  'tops',
  'sexy',
  'beach',
  'print',
  'short sleeves',
  'sleeve',
  'shirt',
  'tank',
  'necks',
  'printed',
  't shirts',
]

@pytest.mark.parametrize('tag', list(zip(*c.most_common(20)))[0])
def test_top_tags(tag):
    assert(tag in top_tags)

### Task: Create tag-based features

Now that you've created a counter with the most frequent tags, it's time to derive a new set of columns from the most frequent tags. Engineer a new set of features based on the 20 most common tags. The new feature columns should be named after the tag and have a prefix `tag_`.

For example, if the tag is "summer", the new feature column should be called `tag_summer`. For a given sample (i.e. row in the dataset), if "summer" is one of the tags in the `tags` column, then the value of the `tag_summer` column should be 1. Otherwise, it should be 0. You can think of this as an *indicator* feature.

There should be a total of 20 additional feature columns for the 20 tags. Make sure to add the columns to both the training and the test set features.
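
One pitfall with indicator features is matching a tag as a substring rather than as a whole tag: `sleeve` is contained in both `sleeveless` and `short sleeves`. The sketch below contrasts the two on made-up rows; `has_tag` is a hypothetical helper written for this example.

```python
import pandas as pd

# Hypothetical toy rows
toy_df = pd.DataFrame({'tags': ['Sleeveless,Summer', 'sleeve,Tank', 'Short Sleeves,summer']})

def has_tag(tag_string: str, tag: str) -> int:
    """Return 1 if `tag` is one of the comma-separated tags (case-insensitive)."""
    tags = {t.strip().lower() for t in tag_string.split(',')}
    return int(tag in tags)

# Exact match against the parsed tag set
exact = toy_df['tags'].apply(lambda s: has_tag(s, 'sleeve'))

# Substring match also fires on 'sleeveless' and 'short sleeves'
substring = toy_df['tags'].str.lower().str.contains('sleeve', regex=False).astype(int)

print(exact.tolist())
print(substring.tolist())
```

Only the exact-match version gives per-tag counts that agree with a counter built from the split tags.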

In [None]:
# Get 20 most common tags
common_tags = [tag for tag, _ in c.most_common(20)]

#############################
#### YOUR CODE GOES HERE ####

#############################

# Drop old tags column
X_train_ft.drop('tags', axis=1, inplace=True)
X_test_ft.drop('tags', axis=1, inplace=True)

In [None]:
# Sanity check a few tags' features
print("'summer' count:", len(X_train_ft[X_train_ft['tag_summer'] == 1])) # Should be 1055
print("'sleeve' count:", len(X_train_ft[X_train_ft['tag_sleeve'] == 1])) # Should be 227
print("'tank' count:", len(X_train_ft[X_train_ft['tag_tank'] == 1])) # Should be 211

In [None]:
#@title Test: Tag feature counts

%%ipytest

@pytest.mark.parametrize('tag', top_tags)
def test_top_tags(tag):
    assert(len(X_train_ft[X_train_ft['tag_'+tag] == 1]) == c[tag])

Another column that can be encoded into a better set of derived features is the `product_color` column. When we one-hot encode this column, variants of the same color are completely orthogonal to each other, i.e. there is no indication that `green` and `armygreen` are similar. Since there are so many color variants in the dataset, let's consolidate the different color categories and encode a smaller set of features based on the most common colors.

In [None]:
color_counts = X_train_ft['product_color'].value_counts()
color_counts[color_counts > 2]

Some of the common color variants are represented by entirely different words. For instance, `beige` is a variant of `brown`. We'll have to consolidate those manually.

In [None]:
# Manually consolidate color variants
# ('gray' is mapped to its more common spelling 'grey')
color_map = {'beige': 'brown', 'coffee': 'brown', 'rose': 'pink', 'gray': 'grey'}

# Assign the result back instead of calling replace(..., inplace=True) on a
# selected column, which is unreliable under pandas' copy-on-write behavior
X_train_ft['product_color'] = X_train_ft['product_color'].replace(color_map)
X_test_ft['product_color'] = X_test_ft['product_color'].replace(color_map)

In [None]:
common_colors = [
    'black', 'white', 'yellow', 'blue', 'pink', 'brown',
    'red', 'green', 'grey', 'purple', 'orange'
]

# Let's engineer a new set of features based on this list of colors
for color in common_colors:
  X_train_ft['color_'+color] = X_train_ft.apply(
      lambda x: int(color in str(x['product_color']).lower()),
      axis = 1
  )
  X_test_ft['color_'+color] = X_test_ft.apply(
      lambda x: int(color in str(x['product_color']).lower()),
      axis = 1
  )

In [None]:
# Sanity check colors
X_train_ft[['product_color', 'color_white', 'color_black', 'color_red', 'color_green', 'color_grey', 'color_pink']]

In [None]:
# Drop old product_color column
X_train_ft.drop('product_color', axis=1, inplace=True)
X_test_ft.drop('product_color', axis=1, inplace=True)

## Remove Correlated Features

Next, let's check on whether the continuous columns in the dataset have any correlations with each other. A group of highly correlated features does not contribute new information for the model to learn, and can even cause [linear models to produce wildly varying solutions](https://en.wikipedia.org/wiki/Multicollinearity#Consequences_of_multicollinearity) from small changes to the dataset.
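
Besides inspecting correlations visually, highly correlated pairs can also be listed programmatically. The sketch below builds a made-up DataFrame in which two columns are nearly linearly related; the column names are borrowed for readability only, and `CORR_THRESHOLD` is a hypothetical cutoff.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical toy data: the second column is almost a rescaled copy of the first
toy_df = pd.DataFrame({
    'rating_count': base,
    'rating_five_count': 0.9 * base + 0.1 * rng.normal(size=200),
    'price': rng.normal(size=200),
})

corr = toy_df.corr().abs()

# Keep only the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

CORR_THRESHOLD = 0.9  # hypothetical cutoff
pairs = upper.stack()
high_corr_pairs = pairs[pairs > CORR_THRESHOLD].index.tolist()
print(high_corr_pairs)
```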

### Task: Visualize Correlated Feature Matrix

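The pandas DataFrame function `.corr()` produces a correlation matrix that we can plot to easily visualize the correlations. Plot a heatmap of the correlation matrix to visualize correlated features. (The DataFrame still contains string columns at this point, so in pandas 2.0 and later you will need to pass `numeric_only=True` or select the continuous columns first.)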

In [None]:
#############################
#### YOUR CODE GOES HERE ####

#############################

The heatmap shows that there are several features related to the product rating that are correlated with each other. The columns `rating_five_count`, `rating_four_count`, etc. are all correlated, and `rating` and `rating_count` can be derived directly from those columns as well. Let's remove the redundant columns.

In [None]:
# Remove redundant correlated features
drop_cols = ['rating_five_count', 'rating_four_count', 'rating_three_count', 'rating_two_count', 'rating_one_count']
X_train_ft.drop(columns=drop_cols, inplace=True)
X_test_ft.drop(columns=drop_cols, inplace=True)

**Question:**

Are there any other features you would remove or transform in the dataset?

===== (Write your answer here) =====

## Built-in Preprocessing Steps

Now that we've done some principled feature engineering, we will finish by completing the required steps of one-hot encoding, scaling, and missing value imputation as we had done previously.

You might have noticed that performing these preprocessing steps can get quite repetitive for any new data that we would like to pass into our model. Fortunately, scikit-learn provides several built-in functions that can do these preprocessing steps for us -- in particular, `Pipeline` is very useful for cleaning up code and collapsing all preprocessing and modeling steps into a single line of code.

We already used `StandardScaler` to z-score continuous columns, but scikit-learn also provides built-ins like `OneHotEncoder` and `SimpleImputer` to perform the one-hot encoding and missing value imputation that we previously computed using Pandas functions.


Let's try these out rather than performing the preprocessing steps manually.

In [None]:
# Get new list of continuous and categorical columns
continuous_columns, categorical_columns = get_columns(X_train_ft)

### Task: Scikit-learn preprocessing with Pipelines
Use scikit-learn's built-in `StandardScaler`, `OneHotEncoder`, and `SimpleImputer` to preprocess the dataset. Combine multiple preprocessing steps using the `Pipeline` class. `Pipeline` takes a list of tuples of transformers for its `steps` argument, where each tuple has the pattern `('name_of_transformer', transformer)`. Each step will be chained and applied to the passed DataFrame in the given order.

We have already written a pipeline that transforms continuous features as an example. Write another pipeline that imputes and one-hot encodes categorical features. You can refer to the [documentation on `Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for more examples, as well as [this tutorial](https://towardsdatascience.com/how-to-use-sklearn-pipelines-for-ridiculously-neat-code-a61ab66ca90d).

In [None]:
# Instantiate scaler, encoder, & imputer
scaler = StandardScaler()
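# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4;
# on older versions, use sparse=False instead
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')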
imputer = SimpleImputer()

# Example of using Pipeline for numeric features
numeric_pipeline = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")),
           ("scale", StandardScaler())]
)

#############################
#### YOUR CODE GOES HERE ####

#############################

We'll now combine both pipelines using `ColumnTransformer`, which is similar to `Pipeline`, but it allows us to specify which columns to apply a transformation to. This creates one processor that can fit and transform the dataset in a single line of code.

(Note: you can also add model instantiation as a part of a scikit-learn `Pipeline`. This means you can preprocess the dataset and train the model all in one line. This is shown in the tutorial linked above.)
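
The note above can be sketched on toy data as follows. The column names (`price`, `color`) and labels are hypothetical, and the transformers are deliberately simpler than the pipelines used in this notebook: a `ColumnTransformer` routes each column to its own transformer, and wrapping it in a `Pipeline` with a classifier gives one estimator that preprocesses and trains in a single `fit` call.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names
toy_df = pd.DataFrame({
    "price": [5.0, 8.0, 12.0, 20.0, 6.0, 18.0],
    "color": ["red", "blue", "red", "blue", "red", "blue"],
})
toy_y = ["low", "low", "high", "high", "low", "high"]

# Each entry is (name, transformer, columns to apply it to)
toy_processor = ColumnTransformer(
    transformers=[
        ("numeric", StandardScaler(), ["price"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ]
)

# Preprocessing and model chained into a single estimator
toy_model = Pipeline(
    steps=[("preprocess", toy_processor),
           ("classify", LogisticRegression(max_iter=500))]
)
toy_model.fit(toy_df, toy_y)
toy_predictions = toy_model.predict(toy_df)

# One scaled numeric column plus one column per color category
n_features_out = toy_model.named_steps["preprocess"].transform(toy_df).shape[1]
print(n_features_out)
```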

In [None]:
from sklearn.compose import ColumnTransformer

full_processor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipeline, continuous_columns),
        ("categorical", categorical_pipeline, categorical_columns),
    ]
)

X_train_ft = full_processor.fit_transform(X_train_ft)
X_test_ft = full_processor.transform(X_test_ft)

In [None]:
# Fit linear model
lr = linear_model.LogisticRegression(max_iter=500)
lr.fit(X_train_ft, y_train_class)

print("Train accuracy: %.2f" % accuracy_score(y_train_class, lr.predict(X_train_ft)))
print("Test accuracy: %.2f" % accuracy_score(y_test_class, lr.predict(X_test_ft)))

In [None]:
# Number of features
X_train_ft.shape

With the new feature set, the linear model no longer fits the training set perfectly.
Test set accuracy has also dropped, but we were able to train a model with far fewer features (159), which converges much faster and overfits the training set less.

In [None]:
# Try nonlinear model
dt = DecisionTreeClassifier(random_state=SEED)
dt.fit(X_train_ft, y_train_class)

print("Train accuracy: %.2f" % accuracy_score(y_train_class, dt.predict(X_train_ft)))
print("Test accuracy: %.2f" % accuracy_score(y_test_class, dt.predict(X_test_ft)))

The non-linear decision tree model still overfits the training set, but we see a slight improvement in test set performance.

In [None]:
#@title Test: logistic regression and decision tree metrics

%%ipytest

def test_lr_test_accuracy():
    # Round to 2 decimals; round() with no digits would collapse the accuracy to 0 or 1
    assert round(accuracy_score(y_test_class, lr.predict(X_test_ft)), 2) > 0.55, \
       '''Logistic regression's test accuracy is low. Double check your encoding and model training.'''

def test_dt_test_accuracy():
    assert round(accuracy_score(y_test_class, dt.predict(X_test_ft)), 2) > 0.70, \
       '''Decision tree's test accuracy is low. Double check your encoding and model training.'''

# Improving Performance with SOTA Models

Now that we've cleaned the dataset with a reduced set of derived features, let's try more SOTA classification models to improve test set performance. This week's [reference notebook](https://colab.research.google.com/drive/1uQF5cF1HnmhWIaze4gDMbN2XE4Ovh_X_?usp=sharing) lists several competition-winning models that often apply easily in practice.

This part of the project is open-ended--it is up to you to choose whether you want to pursue advanced or optional techniques to improve performance. Our suggestion is that you spend a few hours to achieve the best model you can.
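
As one possible starting point, here is a minimal sketch of a randomized hyperparameter search over a random forest. It runs on synthetic data from `make_classification` so that it is self-contained; in the notebook you would pass your processed training features and labels instead. The search space is illustrative only, not a tuned recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption: replace with the notebook's processed features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Illustrative search space
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 5],
}

# Sample 5 parameter combinations and score each with 3-fold cross-validation
demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
    random_state=0,
)
demo_search.fit(X_demo_train, y_demo_train)

print("Best params:", demo_search.best_params_)
print("CV accuracy: %.2f" % demo_search.best_score_)
print("Test accuracy: %.2f" % demo_search.score(X_demo_test, y_demo_test))
```

Selecting hyperparameters by cross-validation on the training split keeps the test set untouched until the final evaluation.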

**Keep track of your work**

As you try different techniques, visualize data/results, and try side experiments, keep track of your code and experiments! It's okay to let your work contain models that helped you learn but were replaced in later experiments. Keeping a _research journal_ as you work will help you refer back to what you've tried, what works, and where you can improve further later. Keep track of your work here in case you talk through it with peers or teaching staff.

**Note about using XGBoost**

XGBoost is a gradient boosting decision tree library that has a nice interface that integrates with scikit-learn. To use XGBoost, you will need to convert the output classes from strings to integers. You can do this easily with `LabelEncoder`. (XGBoost used to use `LabelEncoder` under the hood for you, but this functionality has been removed in a recent update.)
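
A minimal sketch of that label conversion, using made-up string labels in place of the notebook's class labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels standing in for the notebook's classes
toy_labels_train = np.array(["low", "high", "medium", "low"])
toy_labels_test = np.array(["medium", "low"])

label_encoder = LabelEncoder()
toy_train_enc = label_encoder.fit_transform(toy_labels_train)  # fit on training labels only
toy_test_enc = label_encoder.transform(toy_labels_test)

# Classes are sorted alphabetically, so the integer codes carry no ordinal meaning
print(label_encoder.classes_)

# Map integer codes (e.g. model predictions) back to the original strings
decoded = label_encoder.inverse_transform(toy_test_enc)
print(decoded)
```

Keeping the fitted encoder around lets you translate integer predictions back into readable class names when reporting results.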


In [None]:
# SOTA models to try
from sklearn.ensemble import \
  RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

# Highly optimized gradient boosting library
!pip install xgboost

In [None]:
import xgboost as xgb

# Parameter searching functions
from sklearn.model_selection import RandomizedSearchCV, KFold

In [None]:
from datetime import datetime

def timer(start_time=None):
    '''
    Helper function to keep track of training time.
    Example usage:
      start_time = timer(None) # timing starts from this point for "start_time" variable
      ...
      timer(start_time) # timing ends here for "start_time" variable
    '''
    if start_time is None:
        # First call: start the clock and hand back the start time
        return datetime.now()
    # Second call: report the time elapsed since start_time
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\nTime taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))

In [None]:
#############################
#### YOUR CODE GOES HERE ####

#############################

## (Optional) Extension: Text and Image Features

So far, we have only been using tabular features and have not been considering the text and images provided in the dataset. Each example has a title and an image that can provide more signals about the item's price. Let's see if including these additional features improves model performance.

The next cells load pre-generated sentence features for the `title_orig` attribute and pre-generated image features for the `product_picture` attribute in the dataset using a *foundation model* called CLIP. Next week's lecture will be all about these models, but you can get a sneak peek into their capabilities in this extension. (We used a model client called `clip-as-service` to generate these features. You can look at [this reference notebook](https://colab.research.google.com/drive/1o0YSoZnWUgmy0SkLZbaj2DgBuT9n6sSr?usp=sharing) to see the exact code that generated the features.)

You can use these precomputed features and see whether they improve test set performance. Alternatively, you can try some of the traditional text and image preprocessing methods shown in this week's reference notebook, using scikit-learn's `CountVectorizer`/`TfidfVectorizer` for text and HOG features for images.
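
As a rough sketch of the traditional text route, the example below builds TF-IDF features from a few made-up titles and stacks them next to toy tabular features. All values are illustrative. The same `np.hstack` pattern is one way to combine the precomputed CLIP arrays with your tabular features, assuming their rows are in the same order.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles and tabular features (illustrative values only)
toy_titles_train = ["summer floral dress", "men sport watch", "floral beach dress"]
toy_titles_test = ["sport dress"]
toy_tab_train = np.array([[0.1, 1.0], [0.5, 0.0], [0.3, 1.0]])
toy_tab_test = np.array([[0.2, 0.0]])

# Fit the vocabulary on training titles only, then reuse it for the test set
vectorizer = TfidfVectorizer()
toy_text_train = vectorizer.fit_transform(toy_titles_train).toarray()
toy_text_test = vectorizer.transform(toy_titles_test).toarray()

# Place text features next to tabular features, one row per example
toy_combined_train = np.hstack([toy_tab_train, toy_text_train])
toy_combined_test = np.hstack([toy_tab_test, toy_text_test])

print(toy_combined_train.shape, toy_combined_test.shape)
```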

In [None]:
# Download the precomputed CLIP text and image features with gdown
!gdown 1wRyNDxXGNT5KdNeIfo-p8xCJekZC7mlJ

In [None]:
loaded = np.load('clip_features.npz')

# Get pre-computed text features
X_train_text = loaded['X_train_text']
X_test_text = loaded['X_test_text']

# Get pre-computed image features
X_train_image = loaded['X_train_image']
X_test_image = loaded['X_test_image']

In [None]:
#############################
#### YOUR CODE GOES HERE ####

#############################

In [None]:
print("Train accuracy: %.2f" % accuracy_score(y_train_enc, random_search.predict(X_train_fts)))
print("Test accuracy: %.2f" % accuracy_score(y_test_enc, random_search.predict(X_test_fts)))

# Takeaways
We don't do much in the way of formal grading, but you should prepare some experimental results, explanations of your experiments, and conclusions of your modeling work. Reporting what you tried and the outcomes you observed is a central part of quality ML engineering -- and it's critical for building successful ML systems when collaboration is involved. Here are some results and answers you should have ready when discussing your project:
* What are some baseline methods and their performance on this task?
* What modeling improvements did you try? How did each modeling improvement affect results? (Show a full results table if you can.)
* What is your best result? What combination of modeling/data tricks produced this result?
* Did you perform any ablation or sensitivity experiments to understand which aspects of your best system are most important?
* Error analysis: Have you visualized where your model makes mistakes? (Either in aggregate or with individual mistaken examples)
* What is your current diagnosis of the ML system? Is it high variance or high bias? What are your thoughts on current dataset size relative to model capacity / fit?
* What might you try next to improve on this task? Could you improve with more data? More time spent building larger models? Data augmentation or similar techniques?
* Can you identify cases or types of inputs where the model is likely to make mistakes? Are there gaps in the training set and/or model assumptions which would lead the model to make mistakes or not have sufficient data in certain situations?
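
For the error analysis and variance/bias questions, one simple starting point is to compare train and test accuracy and inspect a confusion matrix. The sketch below uses synthetic three-class data and an unpruned decision tree purely for illustration; substitute your own fitted model and data splits.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (assumption: stands in for the project dataset)
X_diag, y_diag = make_classification(
    n_samples=400, n_features=10, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_diag, y_diag, test_size=0.25, random_state=0)

diag_model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# A large gap between train and test accuracy suggests high variance (overfitting)
train_acc = accuracy_score(y_tr, diag_model.predict(X_tr))
test_acc = accuracy_score(y_te, diag_model.predict(X_te))
print("Train accuracy: %.2f, test accuracy: %.2f" % (train_acc, test_acc))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, diag_model.predict(X_te), labels=[0, 1, 2])
print(cm)
```

Off-diagonal cells with large counts point to the class pairs the model confuses most often, which is a useful place to start looking at individual mistaken examples.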