## ML for Prediction

### Just For Fun

First, let's try a zero-shot learning approach. We will use a pre-trained GPT model to predict the class from the `brand` and `description` columns.

In [2]:

import pandas as pd
from sklearn.model_selection import train_test_split


def get_train_val_test():
    food_train = pd.read_csv("data/food_train.csv")
    
    features_df = food_train.drop("category", axis=1)
    labels_df = food_train["category"]
    
    image_scores_df = (
        pd.read_csv("data/resnet18_food_train_features_fine_tuned.csv", index_col=0)
        .set_index(food_train.index)
        .drop(columns=["y"])
        .add_prefix("image_scores_")
    )
    
    features_df = pd.concat([features_df, image_scores_df], axis=1)
    
    
    X_train, X_val_test, y_train, y_val_test = train_test_split(
        features_df, labels_df, test_size=0.2, random_state=42
    )
    
    
    X_val, X_test, y_val, y_test = train_test_split(
        X_val_test, y_val_test, test_size=0.25, random_state=42
    )
    
    return X_train, X_val, X_test, y_train, y_val, y_test

In [3]:
from torchvision import models
from resnet import get_datasets
import torch
from tqdm import tqdm

model_ft = models.resnet18(weights="IMAGENET1K_V1")
model_ft.eval()

datasets = get_datasets(X_train, y_train, X_val, y_val)
train_ds = datasets["train"]


with torch.no_grad():
    for phase, ds in datasets.items():
        features, labels = [], []
        loop = tqdm(range(len(ds)))
        for i in loop:
            _, image, label = ds[i]
            resnet18_features = model_ft(image.unsqueeze(0)).squeeze(0).numpy()

            features.append(resnet18_features)
            labels.append(label)

        rs18_df = pd.DataFrame(features)
        rs18_df["y"] = labels
        rs18_df.to_csv(f"data/resnet18_{phase}_features.csv")

100%|██████████| 25400/25400 [28:06<00:00, 15.06it/s]  
100%|██████████| 4763/4763 [03:23<00:00, 23.44it/s]


Not too bad, but recall that we got the same results with a much simpler model (Naive Bayes), so we should not be too impressed by this result. I also tried using GPT embeddings to measure the cosine similarity between the embedded `description` / `brand` features and the target category embeddings. This approach didn't work well: embedding `brand` and `description` together and classifying by the argmax of cosine similarity reached only 0.66 accuracy.

I was following OpenAI's guide: https://github.com/openai/openai-cookbook/blob/main/examples/Zero-shot_classification_with_embeddings.ipynb

All code is available in `helpers/chatgpt.py`.
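
As a rough illustration of the argmax-of-cosine-similarity rule described above, here is a minimal sketch. It is not the code from `helpers/chatgpt.py`: the function name `zero_shot_classify` is hypothetical, and the tiny 3-dimensional vectors are made-up stand-ins for real embeddings.

```python
import numpy as np


def zero_shot_classify(item_embeddings, label_embeddings):
    """Return, for each item, the index of the most cosine-similar label."""
    items = np.asarray(item_embeddings, dtype=float)
    labels = np.asarray(label_embeddings, dtype=float)
    # Normalize rows to unit length so a dot product equals cosine similarity.
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    labels = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    similarity = items @ labels.T  # shape: (n_items, n_labels)
    return similarity.argmax(axis=1)


# Toy vectors standing in for real embeddings (made up for illustration).
label_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
item_vecs = np.array([[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]])
predicted = zero_shot_classify(item_vecs, label_vecs)
```

In the real setting, each row of `item_vecs` would be the embedding of a product's `brand` and `description` text, and each row of `label_vecs` the embedding of a category name.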

## Baseline Model

First, we need to find out how well we can predict the target variable using the features we have, without hyperparameter tuning. We will replace each textual column with the Naive Bayes model scores for each category. We'll use the same features as in the previous part.
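
To make the idea of replacing a text column with per-category Naive Bayes scores concrete, here is a minimal sketch. It only approximates what the `NaiveBayesScores` step in `helpers/preprocess.py` might do; the function `naive_bayes_scores`, the column naming scheme, and the toy data are assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def naive_bayes_scores(train_text, train_labels, new_text, prefix):
    """Fit bag-of-words Naive Bayes on one text column and
    return one log-probability column per category."""
    vectorizer = CountVectorizer(stop_words="english", strip_accents="unicode")
    nb = MultinomialNB()
    nb.fit(vectorizer.fit_transform(train_text), train_labels)
    scores = nb.predict_log_proba(vectorizer.transform(new_text))
    columns = [f"{prefix}_nb_{cls}" for cls in nb.classes_]
    return pd.DataFrame(scores, columns=columns, index=new_text.index)


# Toy data, made up for illustration.
train_text = pd.Series(
    ["dark chocolate bar", "milk chocolate truffle", "salted potato chips", "corn tortilla chips"]
)
train_labels = pd.Series(["chocolate", "chocolate", "chips", "chips"])
new_text = pd.Series(["chocolate bar with sea salt", "kettle cooked chips"])
score_df = naive_bayes_scores(train_text, train_labels, new_text, prefix="description")
```

The resulting numeric columns can then be fed to a downstream classifier in place of the raw text. Scores computed on the same rows the Naive Bayes model was fitted on tend to be optimistic, so out-of-fold scoring may be worth considering.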


### Building the Pipeline

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

train_sentences = X_train["description"]
val_sentences = X_val["description"]

vectorizer = CountVectorizer(
    stop_words="english", ngram_range=(1, 6), strip_accents="unicode", min_df=60
)
train_matrix = vectorizer.fit_transform(train_sentences).todense()
val_matrix = vectorizer.transform(val_sentences).todense()

train_matrix.shape

In [7]:

from helpers.preprocess import (
    FillNA,
    MergeWithFoodNutrients,
    CleanAndListifyIngredients,
    NaiveBayesScores,
    LogTransformation,
    DropColumns,
    AddImportantTokens,
    StandardScale,
    StemDescription,
)
from sklearn.preprocessing import PolynomialFeatures
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

steps = [
    FillNA(),
    MergeWithFoodNutrients(nutrient_min_freq=2),
    CleanAndListifyIngredients(),
    StemDescription(),
    NaiveBayesScores(
        colname="brand",
        preprocess_func=lambda x: x.replace(" ", ""),
        vectorizer_kwgs=dict(
            stop_words="english", ngram_range=(1, 6), strip_accents="unicode", min_df=20
        ),
        mode="count",
        use_tfidf=False,
    ),
    NaiveBayesScores(
        colname="description",
        vectorizer_kwgs=dict(
            stop_words="english", ngram_range=(1, 6), strip_accents="unicode", min_df=50
        ),
        mode="count",
        use_tfidf=False,
    ),
    NaiveBayesScores(
        colname="ingredients",
        vectorizer_kwgs=dict(
            stop_words="english", ngram_range=(1, 6), strip_accents="unicode", min_df=50, max_df=0.6
        ),
        mode="count",
        use_tfidf=False,
    ),
    NaiveBayesScores(
        colname="household_serving_fulltext",
        vectorizer_kwgs=dict(stop_words="english", ngram_range=(1, 6), strip_accents="unicode", min_df=50),
        mode="count",
        use_tfidf=False,
    ),
    LogTransformation(columns=["serving_size"]),
    DropColumns(columns=["serving_size_unit"]),
]


pipe = Pipeline([(f"{i}", step) for i, step in enumerate(steps)])

X_train, X_val, X_test, y_train, y_val, y_test = get_train_val_test()
X = pd.concat([X_val, X_train], axis=0)
y = pd.concat([y_val, y_train], axis=0)
X = pipe.fit_transform(X, y)
X_test = pipe.transform(X_test)
# X_train, y_train = SMOTE(sampling_strategy={'chocolate': y_train.value_counts().loc['chocolate'] + 1000, 'candy': y_train.value_counts().loc['candy'] + 1000}).fit_resample(X_train, y_train)

# X_val = pipe.transform(X_val)
dmap = {
    "cakes_cupcakes_snack_cakes": 0,
    "candy": 1,
    "chips_pretzels_snacks": 2,
    "chocolate": 3,
    "cookies_biscuits": 4,
    "popcorn_peanuts_seeds_related_snacks": 5,
}
y = y.apply(lambda x: dmap[x])
y_test = y_test.apply(lambda x: dmap[x])

print(X.shape)

(30163, 2123)


### Train and evaluate base model

In [3]:
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier
)
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from joblib import dump

best_xgb = XGBClassifier()
best_xgb.load_model('checkpoints/xgb/xgb.model')
ensemble = BaggingClassifier(best_xgb, n_estimators=20, verbose=2, n_jobs=2, random_state=42)

ensemble.fit(X, y)
y_pred = ensemble.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")

dump(ensemble, 'checkpoints/xgb_ensemble/ensemble.joblib')


In [8]:
from joblib import dump, load

# dump(ensemble, 'checkpoints/xgb_ensemble/ensemble.joblib')
m = load('checkpoints/xgb_ensemble/ensemble.joblib')
y_pred = m.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


0.9439546599496221


[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    3.5s finished


In [18]:
(y_val==y_pred).mean()

0.9357547764014277

In [9]:
def get_test_set():
    X = pd.read_csv("data/food_test.csv")
    
    image_scores_df = (
        pd.read_csv("data/resnet18_food_test_features_fine_tuned.csv", index_col=0)
        .set_index(X.index)
        .add_prefix("image_scores_")
    )
    
    X = pd.concat([X, image_scores_df], axis=1)
    
    return X

def save_predictions(model, path):
    labels = [
      "cakes_cupcakes_snack_cakes",
      "candy",
      "chips_pretzels_snacks",
      "chocolate",
      "cookies_biscuits",
      "popcorn_peanuts_seeds_related_snacks"
    ]
    
    X = get_test_set()
    X = pipe.transform(X)
    
    X['pred_cat'] = model.predict(X)
    X['pred_cat'] = X['pred_cat'].apply(lambda x: x if isinstance(x, str) else labels[x])
    
    X[['idx', 'pred_cat']].to_csv(path, index=False)
    X.drop(columns=['pred_cat'], inplace=True)

from sklearn.ensemble import BaggingClassifier
# RStudio segfaulted at this step, so the predictions were saved to CSV from Jupyter instead

ensemble = load('checkpoints/xgb_ensemble/ensemble.joblib')

y_pred = ensemble.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

save_predictions(ensemble, 'predictions/model03.csv')


In [6]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

model = XGBClassifier()
model.load_model("checkpoints/xgb/xgb.model")
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.0


In [73]:
(y_test == y_pred).mean()

0.9452141057934509

In [76]:
# Count the misclassified test samples per true class (correctly predicted rows are dropped)
pd.DataFrame({'a': y_test == y_pred, 'b': y_test}).replace(True, None).dropna()['b'].value_counts()

3    34
1    15
4    13
5    12
0    10
2     3
Name: b, dtype: int64

Not too bad. We aim to improve this score in the next part, starting with the images in the dataset. For this part, a ResNet18 model is used. All of the model's weights are frozen except for the last layer, which is replaced with a new layer with 6 outputs. The model is trained for 25 epochs with a learning rate of 0.001, decayed by a factor of 0.1 every 7 epochs.

All code is available in `helpers/resnet.py`.

Source: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html#convnet-as-fixed-feature-extractor

In [70]:
dmap = {
    "cakes_cupcakes_snack_cakes": 0,
    "candy": 1,
    "chips_pretzels_snacks": 2,
    "chocolate": 3,
    "cookies_biscuits": 4,
    "popcorn_peanuts_seeds_related_snacks": 5,
}
y = y.apply(lambda x: dmap[x])
y_test = y_test.apply(lambda x: dmap[x])

In [22]:
X_test = pipe.transform(X_test)

In [23]:
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# param_grid = {
#     'n_estimators': [100, 200, 300, 400, 500],
#     'learning_rate': [0.01, 0.1, 0.3],
#     'max_depth': [3, 5, 6, 7, 8],
#     'min_child_weight': [1, 2, 4],
#     'subsample': [0.8, 0.9, 1.0],
#     'colsample_bytree': [0.8, 0.9, 1.0],
#     'gamma': [0, 0.1, 0.3],
#     'reg_alpha': [0, 0.1, 0.3],
#     'reg_lambda': [0, 0.1, 0.3],
# }

# xgb = XGBClassifier(objective='multi:softmax', num_class=6, random_state=42)

# random_search = RandomizedSearchCV(
#     estimator=xgb,
#     param_distributions=param_grid,
#     n_iter=25,  # Number of iterations
#     scoring='accuracy',  # Use an appropriate scoring metric
#     cv=4,  # Number of cross-validation folds
#     verbose=3,
#     n_jobs=2,  # Use 2 parallel workers
#     random_state=42
# )

# random_search.fit(X, y)

# best_params = random_search.best_params_
# best_model = random_search.best_estimator_

# Note: despite the "Validation" label, this score is computed on the held-out test split
val_accuracy = best_model.score(X_test, y_test)
print("Best Parameters:", best_params)
print("Validation Accuracy:", val_accuracy)


with open('checkpoints/xgb/xgb_random_search.txt', 'w') as f:
    f.write("Best Parameters:\n")
    f.write(str(best_params))
    f.write("\n\nValidation Accuracy:\n")
    f.write(str(val_accuracy))


Best Parameters: {'subsample': 1.0, 'reg_lambda': 0, 'reg_alpha': 0.1, 'n_estimators': 400, 'min_child_weight': 1, 'max_depth': 8, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.9}
Validation Accuracy: 0.9452141057934509


In [5]:
from xgboost import XGBClassifier

best_xgb = XGBClassifier()
best_xgb.load_model('checkpoints/xgb/xgb.model')

In [52]:
test_accuracy = m.score(X_test, y_test)
print("Best Parameters:", best_params)
print("Validation Accuracy:", test_accuracy)


Best Parameters: {'subsample': 1.0, 'reg_lambda': 0, 'reg_alpha': 0.1, 'n_estimators': 400, 'min_child_weight': 1, 'max_depth': 8, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.9}
Validation Accuracy: 0.9452141057934509


In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from joblib import dump, load

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],  # Number of trees in the forest
    'criterion': ['gini', 'entropy'],           # Split criterion
    'max_depth': [None, 10, 20, 30, 40],        # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],            # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],              # Minimum number of samples required to be at a leaf node
    'bootstrap': [True, False],                 # Whether bootstrap samples are used
    'random_state': [42]                        # Random seed for reproducibility
}

# Create a RandomForestClassifier
rf_classifier = RandomForestClassifier()

# Create RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf_classifier,
    param_distributions=param_grid,
    n_iter=50,            # Number of parameter settings that are sampled
    scoring='accuracy',   # Scoring metric for evaluation
    cv=5,                 # Cross-validation folds
    verbose=2,
)

# Fit the RandomizedSearchCV to your data
random_search.fit(X, y)

print("Best Score:", random_search.best_score_)
print("Best Parameters:", random_search.best_params_)

best_model = random_search.best_estimator_
dump(best_model, 'checkpoints/rf/best_model.joblib')

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END bootstrap=True, criterion=gini, max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=500, random_state=42; total time= 1.3min
[CV] END bootstrap=True, criterion=gini, max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=500, random_state=42; total time= 1.4min
[CV] END bootstrap=True, criterion=gini, max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=500, random_state=42; total time= 1.4min
[CV] END bootstrap=True, criterion=gini, max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=500, random_state=42; total time= 1.3min
[CV] END bootstrap=True, criterion=gini, max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=500, random_state=42; total time= 1.3min
[CV] END bootstrap=False, criterion=gini, max_depth=30, min_samples_leaf=2, min_samples_split=10, n_estimators=300, random_state=42; total time=  53.9s
[CV] END bootstrap=Fa

['checkpoints/rf/best_model.joblib']

In [None]:
df2 = X_val.copy()
df2["y_pred"] = y_pred == y_val
df2["y"] = y_val

df1 = X_train.copy()
df1["y_pred"] = model.predict(X_train) == y_train
df1["y"] = y_train.values

In [None]:
df1['y_pred'].mean()

In [None]:
df2['y_pred'].mean()

In [None]:
print("train mistakes:")
df1[~df1["y_pred"]]["y"].value_counts()