> DUPLICATE THIS COLAB TO START WORKING ON IT, using File > Save a copy in Drive.

# Week 3: Building State-of-the-art Supervised Learning Models

In this project, we will be building more advanced supervised learning models to solve Kaggle challenges, where you can compare your models to a community leaderboard. Unlike previous weeks, this assignment will be more open-ended and give you more opportunity to try new and creative approaches to building better-performing models. This is representative of the research cycle, where you can put some of the things you have learned so far into practice. Use discussion with your peers and the teaching staff to help guide your thinking. Remember to always frame your work in terms of build-measure-learn thinking so you can be clear on what you're trying and why.


### Instructions

1. We provide starter code and a selection of Kaggle datasets below as a scaffold. Although a lot of the assignment is open-ended, you should constrain yourself to one of the dataset options below and models within scikit-learn (at first). 
1. Ensure you read through the document and starting code before beginning your work. Understand the overall structure and goals of the project to make your iteration smoother.
1. This project is open-ended due to the many possible ways to improve performance. We leave it to you to choose when your project practice is "enough", and whether you want to pursue advanced, optional techniques to improve performance. We encourage you to share progress on Slack to calibrate your solutions with others as you work. 
1. As you work, try to practice hypothesis-driven _build-measure-learn_ development loops. Open-ended ML modeling projects benefit especially from thinking clearly about the state of your ML experiments so far and what to try next.

# Dependencies

We first set up the libraries required for the project. Many of these may already be installed by default in Colab.

In [1]:
!pip install numpy
!pip install scikit-learn
!pip install xgboost
!pip install librosa

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image  # image loading library
import librosa  # speech library for loading
import xgboost as xgb  # gradient boosting library
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Choose a dataset and build a SOTA ML model

Your assignment is to pick your favorite dataset, and use a combination of the modeling and featurization techniques introduced here to build the best model you can. In each dataset, we have fixed the training and test set. You may only use the training set for fitting models and hyperparameter tuning. Using data augmentation techniques based on the data we've provided is okay, but for "official" results don't add more training data of your own. Please avoid fitting your hyperparameters too much using the test set. Instead, create a small dev set of your own from the training set and use the test set infrequently.
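
The dev-set advice above can be sketched as follows. This is a minimal illustration on synthetic arrays, not part of the starter code: the names `X_train_full`, `X_fit`, and `X_dev` and the 20% split size are assumptions chosen for the example. The idea is to split the provided training set once more, fit on one part, tune on the other, and leave the test set for infrequent final checks.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training split returned by a dataset object.
rng = np.random.RandomState(0)
X_train_full = rng.normal(size=(1000, 5))
y_train_full = rng.randint(0, 2, size=1000)

# Hold out 20% of the training set as a dev set (the fraction is an assumption).
# stratify keeps the class balance similar in both parts.
X_fit, X_dev, y_fit, y_dev = train_test_split(
    X_train_full, y_train_full,
    test_size=0.2, random_state=0, stratify=y_train_full)

print(X_fit.shape, X_dev.shape)
```

Hyperparameters would then be compared using `X_dev` and `y_dev`, with the test set used only for the occasional final check.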

Refer to course pages and previous projects for strategies to explore in improving your results. Remember to always work in build-measure-learn hypothesis-led iterations. 

At the end, you should provide your best model(s) and results, and prepare to review/discuss what you tried with classmates.

## Datasets

We present 4 different Kaggle datasets, each covering a different data modality (tabular, text, audio, and images). **You should pick your favorite one and build a model for that dataset.** Because this project emphasizes achieving SOTA performance, see how far you can go in improving performance on just a single task. (Of course, if you achieve good performance and run out of ideas for how to improve, it's okay to work on multiple datasets in this project.)

For each dataset, we have downloaded and formatted the data for you. We provide a brief description below and the link to the original Kaggle competition where you can find forum discussion and example notebooks for inspiration. These datasets can get quite large, and Colab has limited RAM and storage. We recommend you read through the descriptions below and the Kaggle pages but only pick one to download.

Some additional reminders:
- You may have to handle missing data. 
- You may want to create dev sets for hyperparameter tuning. 
- You are free to remove and add additional features. 
- The data will be given to you as Pandas Dataframes. You may need to convert these to NumPy arrays for your model training.
- We don't have access to the true test labels for many Kaggle competitions, so we partition the training set into our own test set.
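
As a small illustration of the DataFrame-to-NumPy reminder above, the sketch below converts a toy, all-numeric frame. The column values are made up for the example, and the choice of `float32` is an assumption. Non-numeric columns would need to be encoded or dropped before this step.

```python
import pandas as pd

# Toy numeric frame; the values are hypothetical.
toy = pd.DataFrame({'TransactionAmt': [25.0, 21.0], 'card1': [16485, 5033]})

# to_numpy returns a plain 2-D array that scikit-learn estimators accept.
X_np = toy.to_numpy(dtype='float32')
print(X_np.shape, X_np.dtype)
```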

**NOTE: You should just pick one problem out of the 4 for your project**

If you're in doubt, start with the Transaction Fraud or Disaster prediction problems.

In [3]:
# shared setup code for datasets
from sklearn.model_selection import train_test_split


class BaseDataset:
  """
  We will use this base class for all datasets.
  You do not need to change this class.
  """
  def __init__(self):
    self._data = self.make_data()

  def _load(self):
    raise NotImplementedError

  def make_data(self):
    print('loading data...')
    X_train, y_train = self._load()
    X_train, X_test, y_train, y_test = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42, shuffle=True)
    print('done.')
    return dict(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test)

  def get_train_data(self):
    return self._data['X_train'], self._data['y_train']

  def get_test_data(self):
    return self._data['X_test'], self._data['y_test']

  @property
  def num_train(self):
    return len(self._data['X_train'])

  @property
  def num_test(self):
    return len(self._data['X_test'])


### Transaction Fraud Detection

[Kaggle link](https://www.kaggle.com/c/ieee-fraud-detection/overview) 

This dataset contains Vesta's real world e-commerce transactions with features from device type to product types. The challenge is to design a model to classify fraudulent transactions, helping businesses reduce loss.

**Transaction Features:**

- `TransactionDT`: timedelta from a given reference datetime (not an actual timestamp)
- `TransactionAmt`: transaction payment amount in USD
- `ProductCD`: product code, the product for each transaction
- `card1` - `card6`: payment card information, such as card type, card category, issue bank, country, etc.
- `addr`: address
- `dist`: distance
- `P_emaildomain` and `R_emaildomain`: purchaser and recipient email domain
- `C1`-`C14`: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- `D1`-`D15`: timedelta, such as days between previous transaction, etc.
- `M1`-`M9`: match, such as names on card and address, etc.
- `Vxxx`: Vesta engineered rich features, including ranking, counting, and other entity relations.

**Identity Features:**

Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked, and a pairwise dictionary will not be provided, for privacy protection and contract agreement.)


The following are categorical features:
`ProductCD`, `card1` - `card6`, `addr1`, `addr2`, `P_emaildomain`, `R_emaildomain`, `M1` - `M9`, `DeviceType`, `DeviceInfo`, `id_12` - `id_38`. We recommend you handle categorical features by converting them to [one-hot representations](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/). 
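
To make the one-hot recommendation concrete, here is a minimal sketch using `pandas.get_dummies` on a toy frame. The rows are invented for illustration; only the column names `card4` and `TransactionAmt` come from the dataset. Each distinct category becomes its own indicator column. With the default settings, a missing category gets no column of its own, so that row has zeros in every indicator.

```python
import pandas as pd

# Toy rows (hypothetical values) with one categorical and one numeric column.
toy = pd.DataFrame({
  'card4': ['visa', 'mastercard', 'visa', None],
  'TransactionAmt': [25.0, 21.0, 107.95, 141.0],
})

# One indicator column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=['card4'], prefix=['card4'])
print(encoded)
```

High-cardinality columns such as `card1` can produce a very large number of indicator columns, so it may be worth checking the encoded width before fitting.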

Further, this dataset may have missing entries, as is common in tabular data. You have many options here: you can drop rows with missing data, or replace with a filler value, or try to impute it with similar values. It is up to you!
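
The three options for missing data can be sketched side by side. The toy frame and its values are assumptions made for illustration, and median imputation is just one possible strategy, not a recommendation from the original assignment.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with gaps (hypothetical values).
toy = pd.DataFrame({
  'dist1': [1.0, np.nan, 3.0, 5.0],
  'C1': [2.0, 4.0, np.nan, 6.0],
})

# Option 1: drop any row that has a missing value.
dropped = toy.dropna()

# Option 2: replace missing values with a constant filler.
filled = toy.fillna(0)

# Option 3: impute each column with its median.
imputer = SimpleImputer(strategy='median')
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(len(dropped), filled.isna().sum().sum())
print(imputed)
```

When imputing, fitting the imputer on the training portion only and then applying it to dev and test data avoids leaking information from held-out rows.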

In [4]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=11_y7TCGE3YRL_qW33XVVWUILlrrkzcSZ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=11_y7TCGE3YRL_qW33XVVWUILlrrkzcSZ" -O train_transaction.csv && rm -rf /tmp/cookies.txt
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1c1u1zKKVz6FnbcMUM6yUzrigqfK6bQn2' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1c1u1zKKVz6FnbcMUM6yUzrigqfK6bQn2" -O train_identity.csv && rm -rf /tmp/cookies.txt

--2022-09-26 06:18:03--  https://docs.google.com/uc?export=download&confirm=t&id=11_y7TCGE3YRL_qW33XVVWUILlrrkzcSZ
Resolving docs.google.com (docs.google.com)... 74.125.142.113, 74.125.142.138, 74.125.142.100, ...
Connecting to docs.google.com (docs.google.com)|74.125.142.113|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0k-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/5m5369meqemeslbvlveej8bt8abgchnv/1664173050000/17643477956629335341/*/11_y7TCGE3YRL_qW33XVVWUILlrrkzcSZ?e=download&uuid=383e1808-3ea7-425d-a5c9-92d9c0b67f95 [following]
--2022-09-26 06:18:03--  https://doc-0k-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/5m5369meqemeslbvlveej8bt8abgchnv/1664173050000/17643477956629335341/*/11_y7TCGE3YRL_qW33XVVWUILlrrkzcSZ?e=download&uuid=383e1808-3ea7-425d-a5c9-92d9c0b67f95
Resolving doc-0k-44-docs.googleusercontent.com (doc-0k-44-docs.googleusercontent.com)... 74.125.197.132, 

In [5]:
import numpy as np
import pandas as pd

class FraudDataset(BaseDataset):
  
  def _load(self):
    rs = np.random.RandomState(42)

    train_tx = pd.read_csv('./train_transaction.csv')
    train_id = pd.read_csv('./train_identity.csv')
    train_data = train_tx.merge(train_id, on='TransactionID', how='left')
    train_data.reset_index(inplace=True)
    del train_data['TransactionID']
    train_label = train_data['isFraud']
    del train_data['isFraud']
    
    # subsample 10k negative and 10k positive examples to get a balanced set
    indices0 = rs.choice(np.where(train_label == 0)[0], 10000, replace=False)
    indices1 = rs.choice(np.where(train_label == 1)[0], 10000, replace=False)
    indices = np.concatenate([indices0, indices1])
    train_data = train_data.iloc[indices]
    train_label = train_label.iloc[indices]
    
    return train_data, train_label


dataset = FraudDataset()

loading data...
done.


In [6]:
X_train, y_train = dataset.get_train_data()
X_test, y_test = dataset.get_test_data()
print('Raw Input:')
print(X_train.head())
print('Targets:')
print(y_train.head())

Raw Input:
         index  TransactionDT  TransactionAmt ProductCD  card1  card2  card3  \
100677  100677        2040188           25.00         H  16485  174.0  150.0   
48739    48739        1171969           21.00         W   5033  269.0  150.0   
317134  317134        7907020          107.95         W   9485  111.0  150.0   
335997  335997        8275057          141.00         W   6530  206.0  150.0   
196248  196248        4410075          117.00         W   3574  232.0  150.0   

             card4  card5  card6  ...        id_31  id_32  id_33  \
100677        visa  226.0  debit  ...  chrome 63.0   24.0    NaN   
48739   mastercard  224.0  debit  ...          NaN    NaN    NaN   
317134        visa  226.0  debit  ...          NaN    NaN    NaN   
335997  mastercard  126.0  debit  ...          NaN    NaN    NaN   
196248        visa  166.0  debit  ...          NaN    NaN    NaN   

                 id_34 id_35 id_36  id_37  id_38  DeviceType  DeviceInfo  
100677  match_status:2  

In [7]:
#############################
#### YOUR CODE GOES HERE ####
from tqdm import tqdm
from sklearn import preprocessing
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

if 'index' in X_train: del X_train['index']
if 'index' in X_test: del X_test['index']
categorical_columns = [
    'card4', 'card6', 'P_emaildomain', 'R_emaildomain',
    'id_12', 'id_15', 'id_16', 'id_23',
    'id_27', 'id_28', 'id_29', 'id_30', 'id_31',
    'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38',
    'DeviceType', 'DeviceInfo', 'ProductCD',
    'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9'
]
num_train = len(X_train)
X = pd.concat([X_train, X_test])
X = pd.get_dummies(X, columns = categorical_columns, prefix=categorical_columns)

X.fillna(0, inplace=True)
X.replace([np.inf, -np.inf], 0, inplace=True)

X_train = X.iloc[:num_train]
X_test = X.iloc[num_train:]

model = LinearSVC()
model.fit(X_train, y_train)
yhat_train = model.predict(X_train)
yhat_test = model.predict(X_test)

train_acc = accuracy_score(y_train, yhat_train)
test_acc = accuracy_score(y_test, yhat_test)

print(f'Train accuracy: {train_acc}')
print(f'Test accuracy: {test_acc}')
#############################

Train accuracy: 0.5634375
Test accuracy: 0.55275
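
The baseline above reaches roughly 55% test accuracy on a class-balanced subsample, which is close to chance. One hypothesis worth testing in a build-measure-learn loop is that the unscaled features hurt `LinearSVC`, since linear SVMs are sensitive to feature scale. The sketch below shows how scaling could be added with a pipeline. It runs on synthetic data from `make_classification`, so it says nothing about how much, if at all, scaling helps on the fraud data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data; one feature is blown up to mimic mixed scales.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_toy[:, 0] *= 1e6

X_fit, X_dev, y_fit, y_dev = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# The scaler is fit on the fitting split only, then applied to dev data.
scaled_svc = make_pipeline(StandardScaler(), LinearSVC())
scaled_svc.fit(X_fit, y_fit)

dev_acc = accuracy_score(y_dev, scaled_svc.predict(X_dev))
print(f'Dev accuracy with scaling: {dev_acc}')
```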




### Disaster Prediction from Tweets

[Kaggle link](https://www.kaggle.com/c/nlp-getting-started/overview)

Tweets are an important communication channel in times of emergency. Ideally, our protection agencies can programmatically monitor Twitter to detect disasters and provide relief. However, a tweet that sounds like it is reporting a disaster may be referring to something else entirely.

This dataset contains a collection of tweet texts annotated with binary labels that indicate whether the tweet describes a real disaster or not. Additional features, such as location and keyword may be provided.

Data Columns:
- id - a unique identifier for each tweet
- text - the text of the tweet
- location - the location the tweet was sent from (may be blank)
- keyword - a particular keyword from the tweet (may be blank)
- target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

To get started, we recommend ignoring the keyword and location data initially, and focus on featurizing the tweet itself. Once you have a baseline model, you can try adding the keyword and location as additional features to see if they improve performance.
Further, this dataset may have missing entries. You have many options here: you can drop rows with missing data, drop features with missing values, replace missing entries with a filler value, or try to impute them with similar values. It is up to you!

Raw text is not easily fed into a model. We recommend exploring different methods to featurize the dataset. Try it out!
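
As one possible way to featurize text, the sketch below turns a few made-up sentences into a TF-IDF matrix with scikit-learn's `TfidfVectorizer`. The example sentences are invented for illustration and are not from the dataset. Each row is a document and each column is a vocabulary term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example sentences; "fire" is literal in one and figurative in another.
toy_tweets = [
  'Forest fire near the highway',
  'This new album is fire',
  'Flood warning issued for the county',
]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(toy_tweets)  # sparse matrix, one row per text

print(features.shape)
print(sorted(vectorizer.vocabulary_))
```

The output is a sparse matrix, which scikit-learn linear models accept directly. Options such as `ngram_range` and `max_features` change which terms become columns.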

In [8]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1NfuR0tuBF0t5HJW2Q12l0c0hnAH1VqTj' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1NfuR0tuBF0t5HJW2Q12l0c0hnAH1VqTj" -O train.csv && rm -rf /tmp/cookies.txt

--2022-09-26 06:20:41--  https://docs.google.com/uc?export=download&confirm=&id=1NfuR0tuBF0t5HJW2Q12l0c0hnAH1VqTj
Resolving docs.google.com (docs.google.com)... 74.125.195.102, 74.125.195.100, 74.125.195.113, ...
Connecting to docs.google.com (docs.google.com)|74.125.195.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0c-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/1j08uae47fleppmr0cfj41qbdqhedg3r/1664173200000/17643477956629335341/*/1NfuR0tuBF0t5HJW2Q12l0c0hnAH1VqTj?e=download&uuid=57201274-0f4f-4f70-97e8-1a4d9b8af6c6 [following]
--2022-09-26 06:20:43--  https://doc-0c-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/1j08uae47fleppmr0cfj41qbdqhedg3r/1664173200000/17643477956629335341/*/1NfuR0tuBF0t5HJW2Q12l0c0hnAH1VqTj?e=download&uuid=57201274-0f4f-4f70-97e8-1a4d9b8af6c6
Resolving doc-0c-44-docs.googleusercontent.com (doc-0c-44-docs.googleusercontent.com)... 74.125.197.132, 2

In [9]:
import pandas as pd

class TweetDataset(BaseDataset):
  
  def _load(self):
    train_data = pd.read_csv('./train.csv')
    train_label = train_data['target']
    del train_data['id'], train_data['target']
    return train_data, train_label


dataset = TweetDataset()

loading data...
done.


In [10]:
X_train, y_train = dataset.get_train_data()
print('Raw Input:')
print(X_train.head())
print('Targets:')
print(y_train.head())

Raw Input:
       keyword            location  \
4996  military               Texas   
3263  engulfed                 NaN   
4907  massacre  Cottonwood Arizona   
2855   drought         Spokane, WA   
4716      lava     Medan,Indonesia   

                                                   text  
4996  Courageous and honest analysis of need to use ...  
3263  @ZachZaidman @670TheScore wld b a shame if tha...  
4907  Tell @BarackObama to rescind medals of 'honor'...  
2855  Worried about how the CA drought might affect ...  
4716  @YoungHeroesID Lava Blast &amp; Power Red #Pan...  
Targets:
4996    1
3263    0
4907    1
2855    1
4716    0
Name: target, dtype: int64


In [11]:
#############################
#### YOUR CODE GOES HERE ####
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

X_train, y_train = dataset.get_train_data()
X_test, y_test = dataset.get_test_data()

corpus_train = np.asarray(X_train.text)
corpus_test = np.asarray(X_test.text)
tfidf = TfidfVectorizer()
tfidf.fit(corpus_train)

XF_train = tfidf.transform(corpus_train)
XF_test = tfidf.transform(corpus_test)

model = LinearSVC()
model.fit(XF_train, y_train)
yhat_train = model.predict(XF_train)
yhat_test = model.predict(XF_test)

train_acc = accuracy_score(y_train, yhat_train)
test_acc = accuracy_score(y_test, yhat_test)

print(f'Train accuracy : {train_acc}')
print(f'Test accuracy: {test_acc}')
#############################

Train accuracy : 0.9862068965517241
Test accuracy: 0.8003939592908733


### Model iteration

In [12]:
from sklearn.svm import SVC

model = SVC(kernel='rbf')
model.fit(XF_train, y_train)
yhat_train = model.predict(XF_train)
yhat_test = model.predict(XF_test)

train_acc = accuracy_score(y_train, yhat_train)
test_acc = accuracy_score(y_test, yhat_test)

print(f'Train accuracy: {train_acc}.')
print(f'Test accuracy: {test_acc}.')

Train accuracy: 0.9719211822660099.
Test accuracy: 0.8056467498358503.


In [13]:
from datetime import datetime

def timer(start_time=None):
  if not start_time:
    start_time = datetime.now()
    return start_time
  elif start_time:
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 1)))

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SVC(kernel='rbf')),
])
parameters = {
    'tfidf__max_df': (0.5, 0.6, 0.7),
    'tfidf__max_features': (100, 1000, 10000, 100000),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)], 
}

grid_search_tune = GridSearchCV(pipeline, parameters, cv=3, verbose=3)

start_time = timer(None) #timing starts from this point for "start_time" variable
grid_search_tune.fit(corpus_train, y_train)
timer(start_time) #timing ends here for "start_time" variable

print("Best parameters set:")
print(grid_search_tune.best_params_)

yhat_train = grid_search_tune.predict(corpus_train)
yhat_test = grid_search_tune.predict(corpus_test)

train_acc = accuracy_score(y_train, yhat_train)
test_acc = accuracy_score(y_test, yhat_test)

print(f'Train accuracy: {train_acc}.')
print(f'Test accuracy: {test_acc}.')

Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV 1/3] END tfidf__max_df=0.5, tfidf__max_features=100, tfidf__ngram_range=(1, 1);, score=0.717 total time=   2.2s
[CV 2/3] END tfidf__max_df=0.5, tfidf__max_features=100, tfidf__ngram_range=(1, 1);, score=0.736 total time=   1.7s
[CV 3/3] END tfidf__max_df=0.5, tfidf__max_features=100, tfidf__ngram_range=(1, 1);, score=0.697 total time=   1.6s
[CV 1/3] END tfidf__max_df=0.5, tfidf__max_features=100, tfidf__ngram_range=(1, 2);, score=0.715 total time=   1.9s
[CV 2/3] END tfidf__max_df=0.5, tfidf__max_features=100, tfidf__ngram_range=(1, 2);, score=0.735 total time=   1.9s
[CV 3/3] END tfidf__max_df=0.5, tfidf__max_features=100, tfidf__ngram_range=(1, 2);, score=0.700 total time=   1.9s
[CV 1/3] END tfidf__max_df=0.5, tfidf__max_features=100, tfidf__ngram_range=(1, 3);, score=0.717 total time=   2.2s
[CV 2/3] END tfidf__max_df=0.5, tfidf__max_features=100, tfidf__ngram_range=(1, 3);, score=0.735 total time=   2.1s
[CV 3/3] E

In [15]:
tfidf = TfidfVectorizer(max_df=0.6, max_features=10000)
tfidf.fit(corpus_train)

XF_train = tfidf.transform(corpus_train)
XF_test = tfidf.transform(corpus_test)

In [16]:
xgbc = xgb.XGBClassifier()
xgbc.fit(XF_train, y_train)

yhat_train = xgbc.predict(XF_train)
yhat_test = xgbc.predict(XF_test)

train_acc = accuracy_score(y_train, yhat_train)
test_acc = accuracy_score(y_test, yhat_test)

print(f'Train accuracy: {train_acc}.')
print(f'Test accuracy: {test_acc}.')

Train accuracy: 0.774384236453202.
Test accuracy: 0.7557452396585687.


In [17]:
from sklearn.model_selection import RandomizedSearchCV, KFold

params = {"learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
          "max_depth"        : range(3,12),
          "min_child_weight" : range(1,8),
          "gamma"            : [0.0, 0.1, 0.2, 0.3, 0.4],
          "colsample_bytree" : [0.3, 0.4, 0.5, 0.7] }

folds = 5
param_comb = 20

skf = KFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(xgb.XGBClassifier(objective='binary:logistic'),
                                   param_distributions=params,
                                   n_iter=param_comb,
                                   scoring='roc_auc',
                                   n_jobs=4,
                                   cv=skf.split(XF_train, y_train),
                                   verbose=3,
                                   random_state=1001)

start_time = timer(None)
random_search.fit(XF_train, y_train)
timer(start_time)

Fitting 5 folds for each of 20 candidates, totalling 100 fits

 Time taken: 0 hours 2 minutes and 53.9 seconds.


In [18]:
print("Best parameters set:")
print(random_search.best_params_)

yhat_train = random_search.predict(XF_train)
yhat_test = random_search.predict(XF_test)

train_acc = accuracy_score(y_train, yhat_train)
test_acc = accuracy_score(y_test, yhat_test)

print(f'Train accuracy: {train_acc}.')
print(f'Test accuracy: {test_acc}.')

Best parameters set:
{'min_child_weight': 1, 'max_depth': 11, 'learning_rate': 0.2, 'gamma': 0.3, 'colsample_bytree': 0.4}
Train accuracy: 0.9057471264367816.
Test accuracy: 0.783322390019698.


### Google Home Command Classification

[Kaggle link](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/overview)

Google Home, and similar smart devices, rely on speech models to detect when the user utters commands, like "Hey Google". This dataset contains 65,000 one-second long utterances of 30 different short words, each uttered by thousands of people. The challenge is to build an algorithm to classify spoken commands. 

Below we download a file called `train.7z`. It contains a few informational files and a folder of audio files. The audio folder contains subfolders with 1 second clips of voice commands, with the folder name being the label of the audio clip. The labels you will need to predict are `yes`, `no`, `up`, `down`, `left`, `right`, `on`, `off`, `stop`, `go`. You should ignore all other classes. Unlike the Kaggle challenge, here, you do not need to worry about auxiliary labels and background noise. That being said, you will not be able to directly compare your results to the Kaggle leaderboard given these differences.

We recommend featurizing the audio clips as a first step. Consider computing log mel spectrograms as we did above. 
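
A log mel spectrogram has shape `(n_mels, n_frames)`, while scikit-learn models expect one fixed-length vector per clip, so the time axis has to be reduced in some way. The sketch below pools over time by taking the per-band mean and standard deviation. It uses a random array as a stand-in for a real spectrogram, and both the array sizes and the helper name `pool_over_time` are assumptions made for this example.

```python
import numpy as np

rng = np.random.RandomState(0)
# Stand-in for a log mel spectrogram: 128 mel bands x 32 time frames (assumed sizes).
log_mel = rng.normal(size=(128, 32))

def pool_over_time(spec):
  """Collapse (n_mels, n_frames) into a vector of per-band mean and std."""
  return np.concatenate([spec.mean(axis=1), spec.std(axis=1)])

features = pool_over_time(log_mel)
print(features.shape)
```

Mean pooling alone discards how energy changes within the clip. Adding the standard deviation, or keeping a few coarse time segments, are simple variations to try.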


In [19]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1sfkLsKT8JHPMM1pifQJqefL5elopjFX7' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1sfkLsKT8JHPMM1pifQJqefL5elopjFX7" -O train.7z && rm -rf /tmp/cookies.txt
!7z x train.7z

--2022-09-26 06:42:10--  https://docs.google.com/uc?export=download&confirm=t&id=1sfkLsKT8JHPMM1pifQJqefL5elopjFX7
Resolving docs.google.com (docs.google.com)... 74.125.195.101, 74.125.195.138, 74.125.195.100, ...
Connecting to docs.google.com (docs.google.com)|74.125.195.101|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-14-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/bdtju7p4066lkra35jhp4v3usd28mit6/1664174475000/17643477956629335341/*/1sfkLsKT8JHPMM1pifQJqefL5elopjFX7?e=download&uuid=33499090-afd4-40b2-82ad-7a46d780e0e4 [following]
--2022-09-26 06:42:10--  https://doc-14-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/bdtju7p4066lkra35jhp4v3usd28mit6/1664174475000/17643477956629335341/*/1sfkLsKT8JHPMM1pifQJqefL5elopjFX7?e=download&uuid=33499090-afd4-40b2-82ad-7a46d780e0e4
Resolving doc-14-44-docs.googleusercontent.com (doc-14-44-docs.googleusercontent.com)... 74.125.197.132, 

In [20]:
import os
import librosa
from glob import glob
import pandas as pd

class CommandDataset(BaseDataset):
  _commands = ['yes', 'no', 'up', 'down', 'left', 'right', 
               'on', 'off', 'stop', 'go']
  _sample_rate = 16000
  
  def _load(self):
    # Returns NumPy arrays, not dataframes.
    data, labels = [], []
    max_length = 0
    for c, command in enumerate(self._commands):
      files = glob(os.path.join(f'./train/audio/{command}/*.wav'))
      data_c = [librosa.load(f, sr=self._sample_rate)[0] for f in files]
      labels_c = [c] * len(data_c)
      max_length_c = max(len(row) for row in data_c)
      data += data_c
      labels += labels_c
      if max_length_c > max_length:
        max_length = max_length_c
    
    data = [  # pad to max length with 0s if < 16000 frames
      np.pad(row, (0, max_length - len(row)), 
             'constant', constant_values=(0, 0))
      for row in data]
    data = np.array(data)
    labels = np.array(labels)

    return data, labels

dataset = CommandDataset()

loading data...
done.


In [21]:
X_train, y_train = dataset.get_train_data()
print('Raw Input:')
print(X_train[:5])
print('Targets:')
print(y_train[:5])

Raw Input:
[[ 6.7138672e-04  1.0681152e-03  1.4038086e-03 ...  2.2888184e-03
   2.1057129e-03  1.7700195e-03]
 [ 1.8310547e-04 -1.0681152e-03 -1.8920898e-03 ...  1.0375977e-03
   1.9531250e-03  2.6245117e-03]
 [ 1.5258789e-04  2.1362305e-04  4.8828125e-04 ...  3.9367676e-03
   7.3242188e-03  1.3305664e-02]
 [-6.1035156e-05 -6.1035156e-05 -6.1035156e-05 ...  3.5705566e-03
   6.1035156e-04 -1.0681152e-03]
 [-1.6174316e-03 -6.3476562e-03 -9.4604492e-03 ...  7.4768066e-03
   5.7373047e-03  4.1503906e-03]]
Targets:
[5 7 5 0 9]


In [22]:
#############################
#### YOUR CODE GOES HERE ####

from tqdm import tqdm
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X_train, y_train = dataset.get_train_data()
X_test, y_test = dataset.get_test_data()

def _compute_features(data):
  outputs = []
  pbar = tqdm(total=len(data), position=0, leave=True)
  for i in range(len(data)):
    audio_input = data[i]
    wav_mel = librosa.feature.melspectrogram(
        y=audio_input,
        sr=16000,
        hop_length=512,
    )
    feats = np.log(wav_mel + 1e-7)
    feats = librosa.util.normalize(feats)
    feats = feats.mean(1)
    outputs.append(feats)
    pbar.update()
  pbar.close()
  outputs = np.stack(outputs)
  return outputs

XF_train = _compute_features(X_train)
XF_test = _compute_features(X_test)

model = LinearSVC()
model.fit(XF_train, y_train)
yhat_train = model.predict(XF_train)
yhat_test = model.predict(XF_test)

train_acc = accuracy_score(y_train, yhat_train)
test_acc = accuracy_score(y_test, yhat_test)

print(f'Train accuracy : {train_acc}')
print(f'Test accuracy: {test_acc}')
#############################

100%|██████████| 18945/18945 [02:29<00:00, 126.80it/s]
100%|██████████| 4737/4737 [00:36<00:00, 129.86it/s]


Train accuracy : 0.42264449722882025
Test accuracy: 0.3947646189571459




### Classifying Cats and Dogs

[Kaggle link](https://www.kaggle.com/c/dogs-vs-cats)

Is this an image of a cat or a dog? This training dataset contains 25,000 images of both animals. These are real-world images of pets with different camera angles, backgrounds, and quality. In other words, this is a difficult task! The top-performing model scores 98.9% but uses more sophisticated methods than shown in this notebook. Still, see how well you can do!

In [23]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ya_pBnNQ72Rw9AG0-6sZNRnt2ds_mBfP' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ya_pBnNQ72Rw9AG0-6sZNRnt2ds_mBfP" -O train.zip && rm -rf /tmp/cookies.txt
!unzip -q train.zip

--2022-09-26 06:51:18--  https://docs.google.com/uc?export=download&confirm=t&id=1ya_pBnNQ72Rw9AG0-6sZNRnt2ds_mBfP
Resolving docs.google.com (docs.google.com)... 74.125.142.100, 74.125.142.139, 74.125.142.102, ...
Connecting to docs.google.com (docs.google.com)|74.125.142.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-00-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/t9d50ntcl7dso247icvlnepckr17foon/1664175075000/17643477956629335341/*/1ya_pBnNQ72Rw9AG0-6sZNRnt2ds_mBfP?e=download&uuid=66288332-728e-45f4-9822-730484c04ef1 [following]
--2022-09-26 06:51:18--  https://doc-00-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/t9d50ntcl7dso247icvlnepckr17foon/1664175075000/17643477956629335341/*/1ya_pBnNQ72Rw9AG0-6sZNRnt2ds_mBfP?e=download&uuid=66288332-728e-45f4-9822-730484c04ef1
Resolving doc-00-44-docs.googleusercontent.com (doc-00-44-docs.googleusercontent.com)... 74.125.197.132, 

In [24]:
from glob import glob
cat_files = glob('train/cat.*.jpg')
dog_files = glob('train/dog.*.jpg')
print(f'{len(cat_files)} cat photos')
print(f'{len(dog_files)} dog photos')

12500 cat photos
12500 dog photos


In [25]:
import os
import torch
import pandas as pd
from glob import glob
from tqdm import tqdm
from skimage.transform import resize
from skimage.io import imread
from skimage import color

def resize_pipeline(image, target_size):
  # find larger and smaller of height vs width
  shape = list(image.shape)
  larger_size = max(shape[:2])
  smaller_size = min(shape[:2])
  ratio = target_size / smaller_size 
  new_larger_size = int(larger_size * ratio)
  new_size = []
  for size in shape[:2]:
    if size == larger_size:
      new_size.append(new_larger_size)
    else:
      new_size.append(target_size)
  new_size.append(shape[2])
  image = resize(image, tuple(new_size), anti_aliasing=True)
  return image

def center_crop(image, target_size):
  shape = list(image.shape)
  y, x = shape[0], shape[1]
  startx = x //2-(target_size//2)
  starty = y //2-(target_size//2)    
  image = image[
    starty: starty + target_size, 
    startx: startx + target_size, 
    :
  ]
  return image

class CatDogDataset(BaseDataset):
  IMAGE_SIZE = 224  # note: _load below hard-codes 64 and does not read this value

  def _load(self):
    cat_files = glob('train/cat.*.jpg')
    dog_files = glob('train/dog.*.jpg')
    img_files = cat_files + dog_files
    labels = [0] * len(cat_files) + [1] * len(dog_files)

    img_data = []
    pbar = tqdm(total=len(img_files), position=0, leave=True)
    for img_file in img_files:
      image = imread(img_file)
      image = resize_pipeline(image, 64)
      if image.shape[0] < 64 and image.shape[1] < 64:
        # case 1: the image is smaller than 64 in both height and width,
        # so zero-pad it to 64 x 64
        image_ = np.zeros((64, 64, 3))
        image_[:image.shape[0], :image.shape[1], :] = image
        image = image_
      elif image.shape[0] < 64:
        # case 2: only the height (axis 0) is below 64: pad rows, then crop
        image_ = np.zeros((64, image.shape[1], 3))
        image_[:image.shape[0], :, :] = image
        image = center_crop(image_, 64)
      elif image.shape[1] < 64:
        # case 3: only the width (axis 1) is below 64: pad columns, then crop
        image_ = np.zeros((image.shape[0], 64, 3))
        image_[:, :image.shape[1], :] = image
        image = center_crop(image_, 64)
      else:
        # case 4: the image is at least 64 in both dimensions and only needs cropping
        image = center_crop(image, 64)
      image = image[np.newaxis, ...]  # n x h x w x c
      img_data.append(image)
      pbar.update()
    pbar.close()

    data = np.concatenate(img_data, axis=0)
    labels = np.array(labels)

    return data, labels


dataset = CatDogDataset()

loading data...


100%|██████████| 25000/25000 [21:20<00:00, 19.53it/s]


done.


In [26]:
X_train, y_train = dataset.get_train_data()
print('Image shape:')
print(X_train[0].shape)
print('Image:')
print(X_train[0])
print('Label:')
print(y_train[0])

Image shape:
(64, 64, 3)
Image:
[[[0.75888202 0.67260751 0.59025457]
  [0.75552445 0.66924994 0.586897  ]
  [0.75836258 0.67208807 0.58973513]
  ...
  [0.67859431 0.58100072 0.48998997]
  [0.67753593 0.58029885 0.47900234]
  [0.73481785 0.6234946  0.52118148]]

 [[0.73723262 0.65095811 0.56860517]
  [0.74127395 0.65499944 0.5726465 ]
  [0.72638146 0.64010695 0.55775401]
  ...
  [0.705083   0.62093778 0.51874721]
  [0.73777434 0.64635834 0.5494165 ]
  [0.74042864 0.63270248 0.54250641]]

 [[0.74417614 0.65790163 0.57554869]
  [0.74675245 0.66047794 0.578125  ]
  [0.73375529 0.64748078 0.56512784]
  ...
  [0.66331467 0.59344641 0.49242006]
  [0.67710283 0.59577763 0.50184798]
  [0.67999387 0.57252813 0.49651989]]

 ...

 [[0.75083138 0.65279217 0.56651766]
  [0.7441831  0.64614388 0.55986937]
  [0.75289383 0.65485461 0.5685801 ]
  ...
  [0.17990335 0.14024203 0.10494792]
  [0.17559325 0.125415   0.09392965]
  [0.18413547 0.12707359 0.0996226 ]]

 [[0.70865363 0.61061442 0.52433991]
  [0.

In [27]:
#############################
#### YOUR CODE GOES HERE ####

from tqdm import tqdm
from skimage.feature import hog
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X_train, y_train = dataset.get_train_data()
X_test, y_test = dataset.get_test_data()

def _compute_features(data):
  outputs = []
  pbar = tqdm(total=len(data), position=0, leave=True)
  for i in range(len(data)):
    image = data[i]
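    # Only the feature vector is needed, so skip the HOG visualization image
    # (visualize=True would also compute and return it).
    hog_features = hog(
        image,
        orientations=9,
        pixels_per_cell=(8, 8),
        cells_per_block=(2, 2),
        multichannel=True,  # scikit-image >= 0.19 uses channel_axis=-1 instead
    )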
    outputs.append(hog_features)
    pbar.update()
  pbar.close()
  outputs = np.stack(outputs)
  return outputs

XF_train = _compute_features(X_train)
XF_test = _compute_features(X_test)

model = LinearSVC()
model.fit(XF_train, y_train)
yhat_train = model.predict(XF_train)
yhat_test = model.predict(XF_test)

train_acc = accuracy_score(y_train, yhat_train)
test_acc = accuracy_score(y_test, yhat_test)

print(f'Train accuracy : {train_acc}')
print(f'Test accuracy: {test_acc}')
#############################

100%|██████████| 20000/20000 [04:14<00:00, 78.47it/s]
100%|██████████| 5000/5000 [01:01<00:00, 80.68it/s]


Train accuracy : 0.76825
Test accuracy: 0.7032




## Playground

The main project work for this week is to **achieve the best results you can on the dataset of your choice**! Use this as a chance to explore whatever aspect of ML model development you want to study. Remember to use our course materials and previous projects to outline good development practices, and specific algorithms/models/techniques to try. Always work in build-measure-learn iterations to guide your thinking. 


Your goal is to produce the best model you can on the test set. You may use model-centric and data-centric techniques to improve your modeling approach, training set, and fitting/tuning procedure. Part of the challenge of this assignment is choosing how to navigate uncertainty and allocate your time across different approaches. We provide some scaffolding below for featurization, training, and evaluation. You are free to customize the starter code as you wish; just report your results and communicate what you tried below. The teaching staff and your peers can provide feedback as you work.
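
As one illustration of a data-centric technique, the sketch below doubles a training set by adding left-right mirrored copies of each image. It assumes images are stored as an `(n, h, w, c)` array, as in the cats-vs-dogs starter code; `augment_with_flips` is a hypothetical helper name, not part of the provided scaffold, and the random arrays only stand in for real data.

```python
import numpy as np

def augment_with_flips(images, labels):
  """Return the original images plus their left-right mirrored copies.

  Assumes `images` has shape (n, h, w, c) and `labels` has shape (n,).
  """
  flipped = images[:, :, ::-1, :]  # reverse the width axis
  images_aug = np.concatenate([images, flipped], axis=0)
  labels_aug = np.concatenate([labels, labels], axis=0)
  return images_aug, labels_aug

# Tiny synthetic stand-in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.random((4, 64, 64, 3))
y_demo = np.array([0, 1, 0, 1])

X_aug, y_aug = augment_with_flips(X_demo, y_demo)
print(X_aug.shape, y_aug.shape)
```

Augmentation like this should be applied to the training split only, so that the test set still measures performance on unmodified images.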

### What did you try?

You are free to consult the internet. All of these datasets are taken from Kaggle and you may draw inspiration from the public solutions online, especially the winning ones. 
- What models did they use? 
- What features did they introduce?
- What was their development cycle like? 

Hint: ML papers, talks, and blogs are often biased toward more complex methods because they are interesting to program and think about. In practice, it is often the simple things that make the most difference. As you work through this assignment, consider prioritizing simpler experiments (e.g. adding a nonlinear feature or tuning a hyperparameter) before you explore complex pipelines (e.g. boosting a bagged ensemble). Tools like data augmentation, outlier removal, and feature engineering often make the winning difference in ML competitions where everyone can fit models correctly on a given dataset.
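
A simple experiment of the "tune a hyperparameter" kind might look like the sketch below, which cross-validates the regularization strength `C` of a `LinearSVC`. The synthetic features stand in for real ones (for example the HOG features from the starter code), and the grid of `C` values is an arbitrary assumption, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for a feature matrix and binary labels.
rng = np.random.default_rng(0)
XF_demo = rng.normal(size=(200, 20))
y_demo = (XF_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 5-fold cross-validation over a small, arbitrary grid of C values.
search = GridSearchCV(
    LinearSVC(max_iter=5000),
    param_grid={'C': [0.01, 0.1, 1.0, 10.0]},
    scoring='accuracy',
    cv=5,
)
search.fit(XF_demo, y_demo)

print(f'Best C: {search.best_params_["C"]}')
print(f'Best cross-validated accuracy: {search.best_score_:.3f}')
```

Because the search uses cross-validation on the training data only, the test set stays untouched until you are ready to report a final number.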

**Keep track of your work**

As you try different techniques, visualize data/results, and try side experiments, keep track of your code and experiments! It's okay to let your work contain models that helped you learn but were replaced in later experiments. Keeping a _research journal_ as you work will help you refer back to what you've tried, what works, and where you can improve further later. Keep track of your work here in case you talk through it with peers or teaching staff.
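
One lightweight way to keep such a journal is to append a row to a CSV file after every experiment. The sketch below is only one possible layout: `log_experiment` and its column names are hypothetical, and the example row reuses the accuracy figures printed by the HOG + `LinearSVC` starter cell.

```python
import csv
import os
import tempfile
from datetime import datetime

def log_experiment(path, name, train_acc, test_acc, notes=''):
  """Append one experiment record to a CSV journal, writing a header if the file is new."""
  is_new = not os.path.exists(path)
  with open(path, 'a', newline='') as f:
    writer = csv.writer(f)
    if is_new:
      writer.writerow(['timestamp', 'name', 'train_acc', 'test_acc', 'notes'])
    timestamp = datetime.now().isoformat(timespec='seconds')
    writer.writerow([timestamp, name, train_acc, test_acc, notes])

# A temporary directory keeps the demo self-contained; in Colab you might
# point this at a file on your mounted Drive instead.
journal_path = os.path.join(tempfile.mkdtemp(), 'journal.csv')
log_experiment(journal_path, 'HOG + LinearSVC baseline', 0.76825, 0.7032,
               'starter code defaults')

with open(journal_path, newline='') as f:
  rows = list(csv.reader(f))
print(rows)
```

A plain CSV is easy to load into a results table later, which helps when you prepare your results discussion.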

In [None]:
# GO FOR IT! 

### Preparing for results discussion

We don't do much in the way of formal grading, but you should prepare some experimental results, explanations of your experiments, and conclusions from your modeling work. Reporting what you tried and the outcomes you observed is a central part of quality ML engineering, and it's critical for building successful ML systems when collaboration is involved. Here are some results and answers you should have ready when discussing your project:
* What are some baseline methods and their performance on this task?
* What modeling improvements did you try? How did each modeling improvement affect results? (Show a full results table if you can.)
* What is your best result? What combination of modeling/data tricks produced this result?
* Did you perform any ablation or sensitivity experiments to understand which aspects of your best system are most important?
* Error analysis: Have you visualized where your model makes mistakes? (Either in aggregate or with individual mistaken examples)
* What is your current diagnosis of the ML system? Is it high variance or high bias? What are your thoughts on current dataset size relative to model capacity / fit?
* What might you try next to improve on this task? Could you improve with more data? More time spent building larger models? Data augmentation or similar techniques? 
* Can you identify cases or types of inputs where the model is likely to make mistakes? Are there gaps in the training set and/or model assumptions which would lead the model to make mistakes or not have sufficient data in certain situations?
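
For the error-analysis question above, a confusion matrix and the indices of misclassified examples are a reasonable starting point. The sketch below uses made-up labels and predictions; in the cats-vs-dogs notebook these would be `y_test` and `yhat_test`, with 0 meaning cat and 1 meaning dog.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Indices of misclassified examples, useful for pulling up the
# corresponding images and inspecting them by eye.
wrong_idx = np.flatnonzero(y_true != y_pred)
print(wrong_idx)
```

Looking at the images behind `wrong_idx` often suggests what to try next, such as a missing kind of training example or a feature that fails on certain poses or backgrounds.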