# Week 2: Leveraging Foundation Models for Robust Supervised Learning

In last week's project, you saw how important choosing the right features is to building a successful ML model. For more complex modalities like language, speech, and vision, the week 1 reference notebook used standard baseline feature encodings rather than the best features available. In this project we will use *foundation model features*. Foundation models are trained on large, related datasets, and we can extract general-purpose features from them for prediction tasks.

In this project, we will introduce several Kaggle challenge datasets for further improvement via foundation models! The project notebook will introduce a few different foundation models to use as features, and establish a basic baseline system with each. You can leverage these models along with anything else you explored in Week 1 with the continued goal of using build-measure-learn iterations to achieve the best system you can.

Foundation models are often complex deep learning models with many parameters, trained on large datasets. We provide pre-trained foundation models for use in this project. If you want to learn more about designing and training foundation models, check out the Uplimit introduction to deep learning course!

Note: foundation models apply to modalities like language, speech, and vision, but they generally aren't used for featurizing tabular data. If you would like to continue working on tabular data, you are welcome to work on Dataset 1 (the transaction fraud dataset). Otherwise, you can try foundation models on one of the other datasets.

### Instructions

1. We provide starter code below as a scaffold. You will be using many of the skills you learned from previous weeks to complete this notebook.
2. Read through the document and starter code before beginning your work. Understanding the overall structure and goals of the project will make your iteration smoother.
3. As in Week 1, keep track of what you try and iterate towards building the best ML system you can! You are also welcome to try some targeted evaluations of your model to see if e.g. foundation features make the model more robust to noisy or transformed inputs.
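
One possible shape for such a targeted robustness check is sketched below. It is only an illustration under stated assumptions: the data is synthetic (`make_classification`), the helper `accuracy_under_noise` is hypothetical, and Gaussian noise is added directly to the feature vectors as a stand-in for perturbing raw inputs. For images or audio you would more likely perturb the raw input and recompute its features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for a feature matrix (assumption: swap in your own features).
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LinearSVC(random_state=0, max_iter=5000).fit(X_tr, y_tr)


def accuracy_under_noise(model, X, y, noise_std, seed=0):
    """Accuracy after adding zero-mean Gaussian noise to the inputs."""
    rng = np.random.default_rng(seed)
    X_noisy = X + rng.normal(0.0, noise_std, size=X.shape)
    return accuracy_score(y, model.predict(X_noisy))


# Sweep a few noise levels; 0.0 reproduces the clean test accuracy.
noise_levels = [0.0, 0.5, 1.0, 2.0]
robustness = {s: accuracy_under_noise(clf, X_te, y_te, s) for s in noise_levels}
for s, acc in robustness.items():
    print(f'noise_std={s}: test accuracy={acc:.3f}')
```

Running the same sweep for two feature sets (for example a classical encoding and a foundation model encoding) gives a simple side-by-side view of how quickly each one degrades.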

# Dependencies

We first set up the libraries required for the project. Many of these may already be installed by default in Colab.

In [None]:
!pip install numpy
!pip install scikit-learn
!pip install librosa
!pip install xgboost
!pip install --upgrade --no-cache-dir gdown

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import librosa
import librosa.display
from tqdm import tqdm
from collections import Counter

# importing a potpourri of models you can use
# feel free to add more!
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import accuracy_score
import xgboost as xgb

In [None]:
# shared setup code for datasets
from sklearn.model_selection import train_test_split


class BaseDataset:
  """
  We will use this base class for all datasets.
  You do not need to change this class.
  """
  def __init__(self):
    self._data = self.make_data()

  def _load(self):
    raise NotImplementedError

  def make_data(self):
    print('loading data...')
    X_train, y_train = self._load()
    X_train, X_test, y_train, y_test = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42, shuffle=True)
    print('done.')
    return dict(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test)

  def get_train_data(self):
    return self._data['X_train'], self._data['y_train']

  def get_test_data(self):
    return self._data['X_test'], self._data['y_test']

  @property
  def num_train(self):
    return len(self._data['X_train'])

  @property
  def num_test(self):
    return len(self._data['X_test'])


# Kaggle challenge datasets

We have chosen a few datasets from Kaggle as possible tasks for you. Each of these datasets was chosen so that we can leverage foundation features for different input modalities. Your project this week is to choose a dataset, and build the best performing model you can by leveraging modeling iterations and foundation features.


## **Task: Choose ONE dataset. Use foundation features to achieve the best performance you can**
Choose a dataset and leverage foundation features along with your modeling best practices to achieve good performance.

We have pre-computed foundation features for you using different models applicable to each task domain. Before starting work you might want to briefly review each of the datasets and corresponding foundation models.

Report your best performance. For practice, ensure you can summarize what your final model is, and what you tried along the way. We provide a _research notebook_ starting point at the bottom for you to track your work.

_You only need to work on one of the datasets below, but try more than one if you'd like!_

In [None]:
# useful general functions
from sklearn.metrics import accuracy_score

def train_svm(X_train, y_train):
  model = LinearSVC()
  model.fit(X_train, y_train)
  return model

def predict_svm(model, X):
  return model.predict(X)

# Dataset 1: Transaction Fraud Detection

[Kaggle link](https://www.kaggle.com/c/ieee-fraud-detection/overview)

This dataset contains Vesta's real world e-commerce transactions with features from device type to product types. The challenge is to design a model to classify fraudulent transactions, helping businesses reduce loss.

**Transaction Features:**

- `TransactionDT`: timedelta from a given reference datetime (not an actual timestamp)
- `TransactionAmt`: transaction payment amount in USD
- `ProductCD`: product code, the product for each transaction
- `card1` - `card6`: payment card information, such as card type, card category, issue bank, country, etc.
- `addr`: address
- `dist`: distance
- `P_emaildomain` and `R_emaildomain`: purchaser and recipient email domain
- `C1`-`C14`: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- `D1`-`D15`: timedelta, such as days between previous transaction, etc.
- `M1`-`M9`: match, such as names on card and address, etc.
- `Vxxx`: Vesta engineered rich features, including ranking, counting, and other entity relations.

**Identity Features:**

Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)


The following are categorical features:
`ProductCD`, `card1` - `card6`, `addr1`, `addr2`, `P_emaildomain`, `R_emaildomain`, `M1` - `M9`, `DeviceType`, `DeviceInfo`, `id_12` - `id_38`. We recommend you handle categorical features by converting them to [one-hot representations](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/).

Further, this dataset may have missing entries, as is common in tabular data. You have many options here: you can drop rows with missing data, or replace with a filler value, or try to impute it with similar values. It is up to you!
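
The three options for missing values, followed by one-hot encoding, can be sketched with pandas as below. This is a minimal illustration on a hypothetical toy frame: the values are made up, only three of the column names from the lists above are used, and the `-1` and `'missing'` fillers are arbitrary choices rather than recommendations.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, reusing a few column names from the lists above.
toy = pd.DataFrame({
    'D1': [0.0, np.nan, 14.0, 3.0],
    'ProductCD': ['W', 'C', None, 'W'],
    'DeviceType': ['mobile', None, 'desktop', 'mobile'],
})
numeric_cols = ['D1']
categorical_cols = ['ProductCD', 'DeviceType']

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: replace missing entries with a filler value.
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(-1)
filled[categorical_cols] = filled[categorical_cols].fillna('missing')

# Option 3: impute numeric columns with a statistic (here the column median).
imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())
imputed[categorical_cols] = imputed[categorical_cols].fillna('missing')

# One-hot encode the categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(imputed, columns=categorical_cols)
print(encoded.columns.tolist())
```

On the real dataset, any statistic used for imputation (such as the median) should be computed on the training split only and then reused on the test split.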

In [None]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=11_y7TCGE3YRL_qW33XVVWUILlrrkzcSZ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=11_y7TCGE3YRL_qW33XVVWUILlrrkzcSZ" -O train_transaction.csv && rm -rf /tmp/cookies.txt
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1c1u1zKKVz6FnbcMUM6yUzrigqfK6bQn2' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1c1u1zKKVz6FnbcMUM6yUzrigqfK6bQn2" -O train_identity.csv && rm -rf /tmp/cookies.txt

In [None]:
import numpy as np
import pandas as pd

class FraudDataset(BaseDataset):

  def _load(self):
    rs = np.random.RandomState(42)

    train_tx = pd.read_csv('./train_transaction.csv')
    train_id = pd.read_csv('./train_identity.csv')
    train_data = train_tx.merge(train_id, on='TransactionID', how='left')
    # drop=True so the old index is not added as an extra feature column
    train_data.reset_index(drop=True, inplace=True)
    del train_data['TransactionID']
    train_label = train_data['isFraud']
    del train_data['isFraud']

    # subsample 10k positive and negative!
    indices0 = rs.choice(np.where(train_label == 0)[0], 10000, replace=False)
    indices1 = rs.choice(np.where(train_label == 1)[0], 10000, replace=False)
    indices = np.concatenate([indices0, indices1])
    train_data = train_data.iloc[indices]
    train_label = train_label.iloc[indices]

    return train_data, train_label


dataset = FraudDataset()

In [None]:
X_train, y_train = dataset.get_train_data()
X_test, y_test = dataset.get_test_data()
print('Raw Input:')
print(X_train.head())
print('Targets:')
print(y_train.head())

## Iterating on Tabular Datasets

Foundation models apply to modalities like language, speech, and vision, but they generally aren't used for featurizing tabular data. If you would like to continue working with tabular data and choose to iterate on this fraud dataset, there are more steps you can take to improve model performance:

1.   Try scikit-learn's [built-in preprocessing methods](https://scikit-learn.org/stable/modules/preprocessing.html) for standardizing continuous features and encoding categorical features.
2.   Try different methods for imputing missing values. Scikit-learn provides many classes like [`SimpleImputer`, `IterativeImputer`, and `KNNImputer`](https://scikit-learn.org/stable/modules/impute.html) that you can try out.
3. Try some of the [Tier-2 advanced models](https://colab.research.google.com/drive/1nx7V_XHrWc5RNJ187aGT18gHwp5JTC1h#scrollTo=PzaSS4S7veeZ) suggested in last week's SOTA reference notebook.
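
Steps 1 and 2 can be combined into a single scikit-learn pipeline, sketched below under stated assumptions. The toy data, the choice of columns, and the `median` / `most_frequent` imputation strategies are illustrative only; the fraud dataset has many more columns and may call for different choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with made-up values (assumption: stands in for the real table).
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    'C1': rng.poisson(3, n).astype(float),
    'D1': rng.normal(10, 5, n),
    'ProductCD': rng.choice(['W', 'C', 'H'], n),
})
toy_y = (toy_X['C1'] > 3).astype(int)
# Knock out some entries to mimic missing data.
toy_X.loc[rng.choice(n, 20, replace=False), 'D1'] = np.nan
toy_X.loc[rng.choice(n, 20, replace=False), 'ProductCD'] = np.nan

numeric_cols = ['C1', 'D1']
categorical_cols = ['ProductCD']

# Numeric: impute then standardize. Categorical: impute then one-hot encode.
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]),
     categorical_cols),
])
pipeline = Pipeline([('preprocess', preprocess),
                     ('clf', LogisticRegression(max_iter=1000))])

# Fit on the first 150 rows, evaluate on the remaining 50.
pipeline.fit(toy_X.iloc[:150], toy_y.iloc[:150])
toy_acc = pipeline.score(toy_X.iloc[150:], toy_y.iloc[150:])
print(f'Toy test accuracy: {toy_acc:.3f}')
```

Wrapping preprocessing and the classifier in one `Pipeline` means the imputation and scaling statistics are learned from the training rows only, which avoids leaking test information.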



*Previous learners have been able to achieve >80% test accuracy on the transaction fraud dataset, depending on their choices of model + features.*

# Dataset 2: Disaster Prediction from Tweets

[(kaggle link)](https://www.kaggle.com/c/nlp-getting-started/overview)

Tweets are an important communication channel in times of emergency. Ideally, our protection agencies can programmatically monitor Twitter to detect disasters and provide relief. However, a tweet that sounds like it is reporting a disaster may be referring to something else entirely. This dataset contains a collection of tweet texts annotated with binary labels that indicate whether the tweet describes a real disaster or not. Additional features, such as location and keyword, may be provided.

In [None]:
!gdown --id 1NfuR0tuBF0t5HJW2Q12l0c0hnAH1VqTj

In [None]:
import pandas as pd

class TweetDataset(BaseDataset):

  def _load(self):
    train_data = pd.read_csv('./train.csv')
    train_label = train_data['target']
    del train_data['id'], train_data['target']
    return train_data, train_label

# --
dataset = TweetDataset()

In [None]:
X_train, y_train = dataset.get_train_data()
X_test, y_test = dataset.get_test_data()
print(f'Train shape: {X_train.shape}')
print(f'Test shape: {X_test.shape}')

## Word2Vec

[(blog)](https://jalammar.github.io/illustrated-word2vec/) [(paper)](https://arxiv.org/pdf/1310.4546.pdf)

Word2Vec is a very popular algorithm for mapping individual words to high-dimensional vector representations. It does so in a way that places words with similar meanings close to each other in vector space. In fact, a famous Word2Vec example is that embedding of `king` - embedding of `man` + embedding of `woman` lands close to the embedding of `queen`. This example illustrates that these representations hold semantic meaning.

Word2Vec can be trained on many kinds of large text corpora, from tweets to collections of books, and the training distribution has a large effect on the representations learned! The advantage of Word2Vec is simplicity: the model is stored as a large dictionary from words to embeddings. Additionally, it captures more complex behavior than TF-IDF (which we saw in Week 1). The downside is that each word still gets a single fixed embedding, regardless of the words around it. We know as language speakers that words are rarely used in isolation, so an ideal representation should capture the context a word is used in; Word2Vec sacrifices this. In practice, Word2Vec returns an embedding for every word in the sentence, and we average across all words to obtain a single embedding per tweet.
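
The averaging step can be sketched as follows. This is a toy illustration, not the code used to produce the precomputed features: the three-dimensional `toy_embeddings` are made up, whitespace tokenization is a simplification, and returning a zero vector when no word is in the vocabulary is an assumption.

```python
import numpy as np

# Made-up 3-dimensional embeddings; real Word2Vec vectors are much larger.
toy_embeddings = {
    'forest': np.array([0.9, 0.1, 0.0]),
    'fire':   np.array([0.8, 0.3, 0.1]),
    'near':   np.array([0.1, 0.2, 0.7]),
}
EMBED_DIM = 3


def embed_tweet(text, embeddings, dim):
    """Average the vectors of known words; return zeros if none are known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# 'town' is out of vocabulary here and is simply skipped.
tweet_vec = embed_tweet('Forest fire near town', toy_embeddings, EMBED_DIM)
print(tweet_vec.shape)
```

Because the result is a plain mean, `'fire forest'` and `'forest fire'` map to the same vector, which is one concrete way to see that word order and context are lost.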

In [None]:
# precomputed word2vec embeddings on the TweetDataset
!gdown --id 11AY0dD_FZ4ghoIr-Jt8TyledvXvTTdTF
!gdown --id 1MFByPWEuqFjKXje-TjX90JPlbmoAuHJq

In [None]:
# these will be in the same order and size as X_train/y_train from above
X_word2vec_train = np.load('./tweet_word2vec_train.npy', allow_pickle=True)
X_word2vec_test = np.load('./tweet_word2vec_test.npy', allow_pickle=True)
print(f'Train shape: {X_word2vec_train.shape}')
print(f'Test shape: {X_word2vec_test.shape}')

The cell below establishes a baseline model training and evaluation experiment using word2vec features. You may use this as a comparison point for performance, and as a starting point to improve from.

In [None]:
model = train_svm(X_word2vec_train, y_train)
y_hat_train = predict_svm(model, X_word2vec_train)
y_hat_test = predict_svm(model, X_word2vec_test)
train_acc = accuracy_score(y_train, y_hat_train)
test_acc = accuracy_score(y_test, y_hat_test)
print(f'Train accuracy: {train_acc}')
print(f'Test accuracy: {test_acc}')

## BERT

[(blog)](https://jalammar.github.io/illustrated-bert/) [(paper)](https://arxiv.org/abs/1810.04805)

In 2018, BERT shocked the NLP research world by outperforming its competition on a variety of benchmarks. BERT learns **contextual** word embeddings by mixing features from individual tokens using the popular Transformer network. Unlike Word2Vec, BERT embeddings capture the greater context of the sentence and document in which a word is used. We linked an excellent blog post above that we highly recommend to the curious reader.

Below, we precomputed embeddings for two different kinds of BERT, one trained on a large corpus of internet articles and books, and the other trained on a large corpus of Tweets. You are free to experiment with both, or combine them!


In [None]:
# download BERT
!gdown --id 18ryiowk_A73UyTB8Mw3zhqbclbPPaGDc
!gdown --id 1u0XENXcs8A_D96EHIN2iENMm6qR7IyAJ

In [None]:
X_bert_train = np.load('./tweet_bert_train.npy')
X_bert_test = np.load('./tweet_bert_test.npy')
print(f'Train shape: {X_bert_train.shape}')
print(f'Test shape: {X_bert_test.shape}')

Here is a baseline setup to use BERT features with an SVM classifier.

In [None]:
model = train_svm(X_bert_train, y_train)
y_hat_train = predict_svm(model, X_bert_train)
y_hat_test = predict_svm(model, X_bert_test)
train_acc = accuracy_score(y_train, y_hat_train)
test_acc = accuracy_score(y_test, y_hat_test)
print(f'Train accuracy: {train_acc}')
print(f'Test accuracy: {test_acc}')

In [None]:
# download BERTweet
!gdown --id 1-Ef8QkuhYVClgIra-AWFmMV0L0K7zWJ7
!gdown --id 11gURmayDn1TsMGYY7guelJ9Khs_bnFdC

In [None]:
X_bertweet_train = np.load('./tweet_bertweet_train.npy')
X_bertweet_test = np.load('./tweet_bertweet_test.npy')
print(f'Train shape: {X_bertweet_train.shape}')
print(f'Test shape: {X_bertweet_test.shape}')

The below establishes a baseline model using the tweet-specific BERT features. You may use this as a baseline, and combine the features in whatever way you choose to achieve the best final system performance you can below.

In [None]:
model = train_svm(X_bertweet_train, y_train)
y_hat_train = predict_svm(model, X_bertweet_train)
y_hat_test = predict_svm(model, X_bertweet_test)
train_acc = accuracy_score(y_train, y_hat_train)
test_acc = accuracy_score(y_test, y_hat_test)
print(f'Train accuracy: {train_acc}')
print(f'Test accuracy: {test_acc}')
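
One simple way to combine feature sets is to concatenate them column-wise before training, sketched below. The arrays here are random stand-ins generated by a hypothetical `fake_features` helper; in the notebook you would substitute arrays such as `X_bert_train` and `X_bertweet_train`, which need to have their rows in the same order. Standardizing before the SVM is an assumption, added because different models can produce features on very different scales.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_train, n_test = 200, 50
y_tr = rng.integers(0, 2, n_train)
y_te = rng.integers(0, 2, n_test)


def fake_features(labels, dim, scale):
    """Random features at a given scale, with label signal in the first column."""
    X = rng.normal(0.0, scale, size=(len(labels), dim))
    X[:, 0] += labels * 2.0 * scale
    return X


# Two stand-in feature sets with different dimensionality and scale.
feats_a_tr, feats_a_te = fake_features(y_tr, 8, 1.0), fake_features(y_te, 8, 1.0)
feats_b_tr, feats_b_te = fake_features(y_tr, 5, 10.0), fake_features(y_te, 5, 10.0)

# Concatenate along the feature axis: one row per example, columns side by side.
X_combined_tr = np.concatenate([feats_a_tr, feats_b_tr], axis=1)
X_combined_te = np.concatenate([feats_a_te, feats_b_te], axis=1)

combined_model = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
combined_model.fit(X_combined_tr, y_tr)
combined_acc = combined_model.score(X_combined_te, y_te)
print(f'Combined shape: {X_combined_tr.shape}, test accuracy: {combined_acc:.3f}')
```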

*Previous learners have been able to achieve ~83% test accuracy on the Tweet dataset. Try comparing classical NLP methods like TF-IDF with foundation model features like Word2Vec and BERT.*
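
A classical TF-IDF comparison point can be sketched as below. The tweets and labels are made up for illustration; on the real data you would pass the raw text column instead (assuming it is named `text`, as in the Kaggle CSV) and reuse the same train/test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up examples: 1 = real disaster, 0 = not a disaster.
toy_tweets = [
    'Forest fire spreading near the highway, evacuate now',
    'Massive earthquake hits the coast, buildings collapsed',
    'Flood waters rising fast downtown, roads closed',
    'This new album is fire, listening all day',
    'My inbox is a disaster after vacation',
    'That concert last night was an earthquake of fun',
]
toy_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each tweet into a sparse bag-of-words vector; the SVM sits on top.
tfidf_model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words='english'),
    LinearSVC(),
)
tfidf_model.fit(toy_tweets, toy_labels)

toy_pred = tfidf_model.predict(['Wildfire evacuation ordered near the highway'])
print(toy_pred)
```

Keeping the classifier fixed (a linear SVM in every case) makes the comparison about the features rather than the model.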

# Dataset 3: Classifying Cats and Dogs

[(kaggle link)](https://www.kaggle.com/c/dogs-vs-cats)

Is this an image of a cat or a dog? This training dataset contains 25,000 images of both animals. These are real-world images of pets with different camera angles, backgrounds, and quality. In other words, this is a difficult task! The top performing model scores 98.9% but uses more sophisticated methods than shown in this notebook. Still, see how well you can do!

In [None]:
!gdown --id 1ya_pBnNQ72Rw9AG0-6sZNRnt2ds_mBfP
!unzip -q train.zip

In [None]:
from glob import glob

class CatDogDataset(BaseDataset):

  def _load(self):
    cat_files = glob('train/cat.*.jpg')
    dog_files = glob('train/dog.*.jpg')
    img_files = cat_files + dog_files
    labels = [0] * len(cat_files) + [1] * len(dog_files)
    data = np.array(img_files)
    labels = np.array(labels)

    return data, labels

# --
dataset = CatDogDataset()

In [None]:
X_train, y_train = dataset.get_train_data()
X_test, y_test = dataset.get_test_data()
print(f'Train shape: {X_train.shape}')
print(f'Test shape: {X_test.shape}')

## ResNet50

[(blog)](https://towardsdatascience.com/introduction-to-resnets-c0a830a288a4) [(paper)](https://arxiv.org/abs/1512.03385)

Residual Networks (ResNets) were among the first neural networks to have a truly deep architecture, e.g. 50 layers. They accomplished this through "residual connections", where the output of an early layer is fed directly as input to a layer further down the network. ResNets trained on ImageNet, the standard large-scale visual benchmark dataset, are widely used to generate features for general images.
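
The idea of a residual connection can be sketched in a few lines of NumPy. This is a simplified illustration, not the ResNet50 architecture: it uses small dense layers with random weights instead of convolutions and batch normalization, and the function name `residual_block` is hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, W1, W2):
    """Compute relu(x + F(x)), where F is a small two-layer transform."""
    f_x = relu(x @ W1) @ W2   # the learned transform F(x)
    return relu(x + f_x)      # skip connection adds the input back in


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                 # batch of 4 vectors
W1 = rng.normal(scale=0.1, size=(16, 16))
W2 = rng.normal(scale=0.1, size=(16, 16))
out = residual_block(x, W1, W2)

# With all-zero weights F(x) is 0, so the block just passes relu(x) through.
zeros = np.zeros((16, 16))
identity_out = residual_block(x, zeros, zeros)
print(out.shape, np.allclose(identity_out, relu(x)))
```

The zero-weight case shows why stacking many such blocks is feasible: a block can fall back to passing its input through, so adding depth does not have to make the signal harder to propagate.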

In [None]:
!gdown --id 1zcgfCH_bJiFn09ulR_ILbYF1OtM3UWt7
!gdown --id 1wjzFlhwzYXeRONBVxs6r8sIS8zslvRsC

In [None]:
X_resnet50_train = np.load('./catdog_resnet50_train.npy')
X_resnet50_test = np.load('./catdog_resnet50_test.npy')
print(f'Train shape: {X_resnet50_train.shape}')
print(f'Test shape: {X_resnet50_test.shape}')

The below establishes a baseline model training and evaluation experiment using the ResNet features. You may use this as a comparison point for performance, and a starting point for integrating these features into ideas or other approaches you explored in Week 1 for this dataset.

In [None]:
model = train_svm(X_resnet50_train, y_train)
y_hat_train = predict_svm(model, X_resnet50_train)
y_hat_test = predict_svm(model, X_resnet50_test)
train_acc = accuracy_score(y_train, y_hat_train)
test_acc = accuracy_score(y_test, y_hat_test)
print(f'Train accuracy: {train_acc}')
print(f'Test accuracy: {test_acc}')

## CLIP

[(blog)](https://openai.com/blog/clip/) [(paper)](https://arxiv.org/abs/2103.00020)

CLIP (Contrastive Language-Image Pre-training) is a neural network released by OpenAI in 2021. It is trained without manually annotated labels to jointly learn image and text embeddings from a large corpus of image and caption pairs built by scraping the web. At the time of publication, CLIP was state of the art, and it has become increasingly popular as a model for image (and text) features.

In [None]:
!gdown --id 1RinM0zUNUllTUXL6yUjcvKQFExGyUGJP
!gdown --id 1YgypozUZqzOjUJOqRPltgcW5KCAI0UeD

In [None]:
X_clip_train = np.load('./catdog_clip_train.npy')
X_clip_test = np.load('./catdog_clip_test.npy')
print(f'Train shape: {X_clip_train.shape}')
print(f'Test shape: {X_clip_test.shape}')

The below establishes a baseline model training and evaluation experiment using CLIP features. You may use this as a comparison point for performance, and a starting point for further work with these features.

In [None]:
model = train_svm(X_clip_train, y_train)
y_hat_train = predict_svm(model, X_clip_train)
y_hat_test = predict_svm(model, X_clip_test)
train_acc = accuracy_score(y_train, y_hat_train)
test_acc = accuracy_score(y_test, y_hat_test)
print(f'Train accuracy: {train_acc}')
print(f'Test accuracy: {test_acc}')

*CLIP features achieve close to 100% accuracy. How does this compare with a classical image featurization method like HOG?*

# Dataset 4: Google Home Command Classification

[(kaggle link)](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/overview)

Google Home, and similar smart devices, rely on speech models to detect when the user utters commands, like "Hey Google". This dataset contains 65,000 one-second long utterances of 30 different short words, each uttered by thousands of people. The labels you will need to predict are `yes`, `no`, `up`, `down`, `left`, `right`, `on`, `off`, `stop`, `go`. You should ignore all other classes.

In [None]:
!gdown --id 1sfkLsKT8JHPMM1pifQJqefL5elopjFX7
!7z x train.7z -y

In [None]:
import os
from glob import glob


class CommandDataset(BaseDataset):
  _commands = ['yes', 'no', 'up', 'down', 'left', 'right',
               'on', 'off', 'stop', 'go']
  _sample_rate = 16000

  def _load(self):
    data, labels = [], []
    for c, command in enumerate(self._commands):
      files = glob(os.path.join(f'./train/audio/{command}/*.wav'))
      labels_c = [c] * len(files)
      data += files
      labels += labels_c
    data = np.array(data)
    labels = np.array(labels)

    return data, labels

# --
dataset = CommandDataset()

In [None]:
X_train, y_train = dataset.get_train_data()
X_test, y_test = dataset.get_test_data()
print(f'Train shape: {X_train.shape}')
print(f'Test shape: {X_test.shape}')

Let's look at two speech foundation models, wav2vec 2.0 and HuBERT, and see how they compare on this dataset.

### wav2vec 2.0

[(blog)](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) [(paper)](https://arxiv.org/abs/2006.11477)

In [None]:
!gdown --id 1_QKzTYqR28K3LCaFBWGIRqKx1ieFPSXj
!gdown --id 17Tv7Tk_uOtbwmIRb_h9sPXcyMCpWkKTa

In [None]:
X_wav2vec_train = np.load('./google_wav2vec_train.npy')
X_wav2vec_test = np.load('./google_wav2vec_test.npy')
print(f'Train shape: {X_wav2vec_train.shape}')
print(f'Test shape: {X_wav2vec_test.shape}')

Similar to the foundation feature reference notebook, this establishes a baseline classifier using the wav2vec 2.0 features.

In [None]:
model = train_svm(X_wav2vec_train, y_train)
y_hat_train = predict_svm(model, X_wav2vec_train)
y_hat_test = predict_svm(model, X_wav2vec_test)
train_acc = accuracy_score(y_train, y_hat_train)
test_acc = accuracy_score(y_test, y_hat_test)
print(f'Train accuracy: {train_acc}')
print(f'Test accuracy: {test_acc}')

### HuBERT

[(blog)](https://ai.facebook.com/blog/hubert-self-supervised-representation-learning-for-speech-recognition-generation-and-compression/) [(paper)](https://arxiv.org/abs/2106.07447)

In [None]:
!gdown --id 1r6RQlOjs3RjEPd7xVa-mC1ds8FTomh9s
!gdown --id 1eMnKrfGi3OB3dKcEMHLKJpbsztUvSDA7

In [None]:
X_hubert_train = np.load('./google_hubert_train.npy')
X_hubert_test = np.load('./google_hubert_test.npy')
print(f'Train shape: {X_hubert_train.shape}')
print(f'Test shape: {X_hubert_test.shape}')

The code below establishes a baseline model training and evaluation experiment using HuBERT features. You may use this as a comparison point for performance, and a starting point for integrating these features into ideas or other approaches you explored in Week 1 for this dataset.

In [None]:
model = train_svm(X_hubert_train, y_train)
y_hat_train = predict_svm(model, X_hubert_train)
y_hat_test = predict_svm(model, X_hubert_test)
train_acc = accuracy_score(y_train, y_hat_train)
test_acc = accuracy_score(y_test, y_hat_test)
print(f'Train accuracy: {train_acc}')
print(f'Test accuracy: {test_acc}')

*See if you can achieve near 100% test accuracy using more complex models.*

# Research Notebook

The sections above outline loading foundation model features, and establishing baseline performance using these features on each task. Your main task for this project is to continue improving performance on one of the challenge datasets using what you developed in Week 1, along with your new modeling tool -- foundation features!

As usual, develop in build-measure-learn loops guided by a hypothesis of how you're improving the model or testing an idea at each iterative step.
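
One lightweight way to keep track of iterations is to record every experiment in a small results table, sketched below. The data is synthetic and the three models are arbitrary examples; in your own loop each entry would correspond to one hypothesis (a new feature set, model, or hyperparameter) evaluated on your chosen dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for precomputed features (assumption: swap in real ones).
X, y = make_classification(n_samples=500, n_features=32, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Each entry is one experiment: a name plus the model to try.
experiments = {
    'linear_svm': LinearSVC(max_iter=5000),
    'logreg': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

rows = []
for name, model in experiments.items():
    model.fit(X_tr, y_tr)
    rows.append({
        'model': name,
        'train_acc': model.score(X_tr, y_tr),
        'test_acc': model.score(X_te, y_te),
    })

# Sort so the best-performing experiment is listed first.
results = pd.DataFrame(rows).sort_values('test_acc', ascending=False)
print(results.to_string(index=False))
```

Logging train and test accuracy side by side makes overfitting easy to spot: a large gap between the two columns suggests the next iteration should focus on regularization or simpler models.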

In [None]:
#############################
#### YOUR CODE GOES HERE ####


#############################