<a href="https://colab.research.google.com/github/qingdao81/mlops/blob/main/Lars_Bachmann_%5BMLOps%5D%5BJune_2023%5D_Week_1_starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

### Problem

In the project this week, we will build a machine learning text classifier to predict news categories from the news article text. 

1. We will iterate on classification models of increasing complexity and performance: N-gram models, pre-trained Transformer models, and third-party hosted Large Language Models (LLMs).

2. We will look at the impact of labeled dataset size and composition on model performance. The labeled dataset is used for training in the case of N-gram models and pre-trained Transformers, and for selecting examples for in-context few-shot learning in the case of LLMs.

3. [advanced] As an extension, we will explore how to augment your existing training data efficiently (efficiency is measured as the improvement in performance normalized by the volume of data added).

Throughout the project there are suggested model architectures that we expect to work reasonably well for this problem. If you wish to extend or modify any part of this pipeline, or explore new model architectures, feel free to do so.


## Step 1: Prereqs & Installation

Install and import the libraries needed throughout the project.

In [4]:
# Install all the required dependencies for the project

!pip install numpy
!pip install scikit-learn
!pip install sentence-transformers
!pip install matplotlib
!pip install langchain
!pip install openai


In [5]:
# Package imports that will be needed for this project

import numpy as np
import json
from collections import Counter
from sklearn.metrics import accuracy_score, f1_score
from sentence_transformers import SentenceTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from pprint import pprint
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# [TO BE IMPLEMENTED]
# Add any other imports needed below, depending on the model architectures you are using, e.g.
# from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

In [6]:
# Global Constants
LABEL_SET = [
    'Business',
    'Sci/Tech',
    'Software and Developement',
    'Entertainment',
    'Sports',
    'Health',
    'Toons',
    'Music Feeds'
]

WORD_VECTOR_MODEL = 'glove-wiki-gigaword-100'
SENTENCE_TRANSFORMER_MODEL = 'all-mpnet-base-v2'

TRAIN_SIZE_EVALS = [500, 1000, 2000, 5000, 10000, 25000]
EPS = 0.001
SEED = 0

np.random.seed(SEED)

## Step 2: Download & Load Datasets 

[AG News](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) is a collection of more than 1 million news articles gathered from more than 2000 news sources by an academic news search engine. The news topic classification dataset & benchmark was first used in [Character-level Convolutional Networks for Text Classification (NIPS 2015)](https://arxiv.org/abs/1509.01626). The dataset has the text description (summary) of each news article along with some metadata. **For this project, we will use a slightly modified (cleaned-up) version of this dataset.**

Schema:
* Source - News publication source
* URL - URL of the news article
* Title - Title of the news article
* Description - Summary description of the news article
* Category (Label) - News category

Sample row in this dataset:
```
{
    'description': 'A capsule carrying solar material from the Genesis space '
                'probe has made a crash landing at a US Air Force training '
                'facility in the US state of Utah.',
    'id': 86273,
    'label': 'Entertainment',
    'source': 'Voice of America',
    'title': 'Capsule from Genesis Space Probe Crashes in Utah Desert',
    'url': 'http://www.sciencedaily.com/releases/2004/09/040908090621.htm'
 }
```




In [7]:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile

DIRECTORY_NAME = "data"
DOWNLOAD_URL = 'https://corise-mlops.s3.us-west-2.amazonaws.com/project1/agnews.zip'

def download_dataset():
    """
    Download the dataset. The zip contains the JSON files used in this notebook,
    including train.json, test.json, augment.json and test_mini.json
    """
    http_response = urlopen(DOWNLOAD_URL)
    zipfile = ZipFile(BytesIO(http_response.read()))
    zipfile.extractall(path=DIRECTORY_NAME)

# Expensive operation so we should just do this once
download_dataset()

In [8]:
Datasets = {}

for ds in ['train', 'test', 'augment', 'test_mini']:
    with open('data/{}.json'.format(ds), 'r') as f:
        Datasets[ds] = json.load(f)
    print("Loaded Dataset {0} with {1} rows".format(ds, len(Datasets[ds])))

print("\nExample train row:\n")
pprint(Datasets['train'][0])

print("\nExample test row:\n")
pprint(Datasets['test'][0])

print("\nExample test mini row:\n")
pprint(Datasets['test_mini'][0])
print(len(Datasets['test_mini']))

Loaded Dataset train with 25000 rows
Loaded Dataset test with 5000 rows
Loaded Dataset augment with 150000 rows
Loaded Dataset test_mini with 1000 rows

Example train row:

{'description': 'A capsule carrying solar material from the Genesis space '
                'probe has made a crash landing at a US Air Force training '
                'facility in the US state of Utah.',
 'id': 86273,
 'label': 'Entertainment',
 'source': 'Voice of America',
 'title': 'Capsule from Genesis Space Probe Crashes in Utah Desert',
 'url': 'http://www.sciencedaily.com/releases/2004/09/040908090621.htm'}

Example test row:

{'description': 'European Union regulators will decide Tuesday whether Oracle '
                "Corp.'s hostile \\$7.7 billion bid for rival business "
                "software concern PeopleSoft Inc. can proceed, the EU's "
                'antitrust chief said Friday.',
 'id': 278781,
 'label': 'Sci/Tech',
 'source': 'Washington Post Tech',
 'title': "EU to Rule Tuesday on Oracle'

In [9]:
X_train, Y_train = [], []
X_test, Y_true = [], []
X_augment, Y_augment = [], []
X_test_mini, Y_true_mini = [], []

for row in Datasets['train']:
    X_train.append(row['description'])
    Y_train.append(row['label'])

for row in Datasets['test']:
    X_test.append(row['description'])
    Y_true.append(row['label'])

for row in Datasets['augment']:
    X_augment.append(row['description'])
    Y_augment.append(row['label'])

for row in Datasets['test_mini']:
    X_test_mini.append(row['description'])
    Y_true_mini.append(row['label'])
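
The training subsets used in the following steps are prefixes of the training data (`X_train[:n]`), so their label mix depends on the row order in the file. The sketch below is one way to inspect the class balance of such a prefix. `label_distribution` and `toy_labels` are hypothetical names introduced for illustration; in the notebook you would pass `Y_train` instead of the toy list.

```python
from collections import Counter


def label_distribution(labels, n):
    """Return {label: fraction} for the first n labels."""
    prefix = labels[:n]
    total = len(prefix)
    if total == 0:
        return {}
    counts = Counter(prefix)
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical labels standing in for Y_train
toy_labels = ['Sports', 'Business', 'Sports', 'Health', 'Sports', 'Business']
toy_distribution = label_distribution(toy_labels, 4)
print(toy_distribution)
```

A heavily skewed prefix would be one possible explanation for weak results at small training sizes, although that has to be checked against the real data.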

## Step 3: [Modeling part 1] N-gram model


In [10]:
models = {}

for n in TRAIN_SIZE_EVALS:
    print("Evaluating for training data size = {}".format(n))
    X_train_i = X_train[:n]
    Y_train_i = Y_train[:n]

    """
    [TO BE IMPLEMENTED]
        
    Goal: initialized below is a dummy sklearn Pipeline object with no steps.
    You have to replace it with a pipeline object which contains at least two steps:
    (1) mapping the input document to an N-gram feature extractor. You can use feature extractors
        provided by sklearn out of the box (e.g. CountVectorizer, TfidfTransformer)
    (2) a classifier that predicts the class label using the feature output of first step

    You can add other steps to preprocess or post-process your data as you see fit.
    You can also try any sklearn model architecture you want, but a linear classifier
    will do just fine to start with

    e.g. 
    pipeline = Pipeline([
        ('featurizer', <your N-gram feature extractor instance here>),
        ('classifier', <your sklearn classifier class instance here>)
    ])

    Reference: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
    """
    pipeline = Pipeline([
      ('featurizer', CountVectorizer(ngram_range=(1,1))),
      ('tf-idf', TfidfTransformer(use_idf=True)),
      ('classifier', LogisticRegression())
      #('classifier', MultinomialNB())   
      #('classifier', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=20, tol=None)) 
    ])
    
    # train
    pipeline.fit(X_train_i, Y_train_i)
    # predict
    Y_pred_i = pipeline.predict(X_test)
    # record results
    models[n] = {
        'pipeline': pipeline,
        'test_predictions': Y_pred_i,
        'accuracy': accuracy_score(Y_true, Y_pred_i),
        'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
        'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
    }
    print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))

Evaluating for training data size = 500
Accuracy on test set: 0.5866
Evaluating for training data size = 1000
Accuracy on test set: 0.6216
Evaluating for training data size = 2000
Accuracy on test set: 0.6614
Evaluating for training data size = 5000


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy on test set: 0.7106
Evaluating for training data size = 10000


Accuracy on test set: 0.7364
Evaluating for training data size = 25000


Accuracy on test set: 0.7546
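
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` warning above means the logistic regression solver stopped before converging. A minimal sketch of the remedy the warning itself suggests is to raise `max_iter`. The value 1000, the bigram range and the toy data are assumptions for illustration; this variant was not run on the AG News data, so no accuracy change is claimed. `TfidfVectorizer` is used here as a shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data standing in for X_train / Y_train (hypothetical)
toy_texts = [
    "team wins the final match",
    "player scores a late goal",
    "stocks fall as markets slide",
    "company reports higher profit",
]
toy_labels = ['Sports', 'Sports', 'Business', 'Business']

toy_pipeline = Pipeline([
    # Unigrams and bigrams with TF-IDF weighting
    ('featurizer', TfidfVectorizer(ngram_range=(1, 2))),
    # A higher iteration cap gives the default lbfgs solver room to converge
    ('classifier', LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(toy_texts, toy_labels)

# n_iter_ holds the number of iterations the solver actually used
n_iter = int(toy_pipeline.named_steps['classifier'].n_iter_.max())
toy_predictions = toy_pipeline.predict(toy_texts)
print("Solver iterations used:", n_iter)
```

If `n_iter` stays below `max_iter`, the solver converged and the warning should no longer appear.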


## Step 4: [Modeling part 2] Pretrained Transformer model

In [11]:
# Initialize the pretrained transformer model
sentence_transformer_model = SentenceTransformer(
    'sentence-transformers/{model}'.format(model=SENTENCE_TRANSFORMER_MODEL))

# Sanity check
example_encoding = sentence_transformer_model.encode(
    "This is an example sentence",
    normalize_embeddings=True
)

print(example_encoding.shape)


(768,)


In [12]:
class TransformerFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, dim, sentence_transformer_model):
        self.dim = dim
        self.sentence_transformer_model = sentence_transformer_model
        # you can add any other params to be passed to the constructor here

    # Estimator step: the featurizer has nothing to learn, so fit() is a no-op
    def fit(self, X, y=None):
        return self

    # Transformation step: return the sentence-transformer encoding of each document
    def transform(self, X, y=None):
        X_t = []
        """
        [TO BE IMPLEMENTED]
        
        Goal: TransformerFeaturizer's transform() method converts the raw text document
        into a feature vector to be passed as input to the classifier.
            
        Given below is a dummy implementation that always maps it to a zero vector.
        You have to implement this function so it computes a document embedding
        of the input document using self.sentence_transformer_model. 
        This will be our feature representation of the document
        """
        for doc in X:
            # Encode each document into a unit-length embedding
            X_t.append(self.sentence_transformer_model.encode(doc, normalize_embeddings=True))
        return X_t
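
The `transform()` method above calls `encode()` once per document. `SentenceTransformer.encode` also accepts a list of sentences and a `batch_size` argument, so a batched variant is a possible speed-up. The sketch below shows the idea with a hypothetical `StandInEncoder` in place of the real model so that it runs offline. `transform_batched` is an illustrative helper, not part of the notebook.

```python
class StandInEncoder:
    """Hypothetical stand-in for SentenceTransformer so the sketch runs offline."""

    def __init__(self):
        self.calls = 0

    def encode(self, sentences, batch_size=32, normalize_embeddings=False):
        self.calls += 1
        # Fake 2-dimensional "embeddings": [number of words, number of characters]
        return [[float(len(s.split())), float(len(s))] for s in sentences]


def transform_batched(model, docs, batch_size=32):
    """Encode all documents with a single encode() call instead of one call per document."""
    return model.encode(list(docs), batch_size=batch_size, normalize_embeddings=True)


encoder = StandInEncoder()
features = transform_batched(encoder, ["first document", "second one here", "third"])
print(len(features), encoder.calls)
```

With the real model, the same call shape would let the library batch documents internally rather than processing them one at a time.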

In [21]:
models_v2 = {}
for n in TRAIN_SIZE_EVALS:
    print("Evaluating for training data size = {}".format(n))
    X_train_i = X_train[:n]
    Y_train_i = Y_train[:n]

    """
    [TO BE IMPLEMENTED]
        
    Goal: initialized below is a dummy sklearn Pipeline object with no steps.
    You have to replace it with a pipeline object which contains at least two steps:
    (1) mapping the input document to a feature vector (using TransformerFeaturizer)
    (2) a classifier that predicts the class label using the feature output of first step

    You can add other steps to preprocess or post-process your data as you see fit.
    You can also try any sklearn model architecture you want, but a linear classifier
    will do just fine to start with

    e.g. 
    pipeline = Pipeline([
        ('featurizer', <your TransformerFeaturizer class instance here>),
        ('classifier', <your sklearn classifier class instance here>)
    ])
    """
    pipeline = Pipeline([
        ('featurizer', TransformerFeaturizer(dim=768, sentence_transformer_model=sentence_transformer_model)),
        ('classifier', LogisticRegression())
    ])

    # train
    pipeline.fit(X_train_i, Y_train_i)
    # predict
    Y_pred_i = pipeline.predict(X_test)
    # record results
    models_v2[n] = {
        'pipeline': pipeline,
        'test_predictions': Y_pred_i,
        'accuracy': accuracy_score(Y_true, Y_pred_i),
        'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
        'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
    }
    print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))


Evaluating for training data size = 500
Accuracy on test set: 0.713
Evaluating for training data size = 1000
Accuracy on test set: 0.7378
Evaluating for training data size = 2000
Accuracy on test set: 0.7542
Evaluating for training data size = 5000


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy on test set: 0.7706
Evaluating for training data size = 10000


Accuracy on test set: 0.7758
Evaluating for training data size = 25000


Accuracy on test set: 0.784


## Step 5: [Modeling part 3] Large Language Models

In [13]:
# Here are a couple of code snippets to help you get familiar with generating labels with LLMs using langchain.

import os

os.environ['OPENAI_API_KEY'] = "XXX"

from langchain.chat_models import ChatOpenAI
from langchain.schema import LLMResult, HumanMessage, Generation

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    max_tokens=1000,
    temperature=0.0
)

In [14]:

zero_shot_prompt_template = """
You are an expert at judging the sentiment of tweets. 
Your job is to categorize the sentiment of a given tweet into one of three categories: Positive, Negative, Neutral.

Tweet: {tweet}
Sentiment:
"""

prompt = zero_shot_prompt_template.format(
    tweet="Yesss! I love machine learning"
)
print(prompt)

result = llm.generate([[HumanMessage(content=prompt)]])
print(result.generations[0][0])




You are an expert at judging the sentiment of tweets. 
Your job is to categorize the sentiment of a given tweet into one of three categories: Positive, Negative, Neutral.

Tweet: Yesss! I love machine learning
Sentiment:

text='Positive' generation_info=None message=AIMessage(content='Positive', additional_kwargs={}, example=False)


In [15]:

few_shot_prompt_template = """
You are an expert at judging the sentiment of tweets. 
Your job is to categorize the sentiment of a given tweet into one of three categories: Positive, Negative, Neutral.

Some example tweets along with the correct sentiment are shown below.

Tweet: Another big happy 18th birthday to my partner in crime. I love u very much!
Sentiment: Positive

Tweet: The more I use this application, the more I dislike it. It's slow and full of bugs.
Sentiment: Negative

Tweet: #Dreamforce Returns to San Francisco for 20th Anniversary. Learn more: http://bit.ly/3AgwO0H
Sentiment: Neutral

Now I want you to label the following example: 
Tweet: {tweet}
Sentiment:
"""

prompt = few_shot_prompt_template.format(
    tweet="I like chocolate"
)

result = llm.generate([[HumanMessage(content=prompt)]])
print(result.generations[0][0])



text='Positive' generation_info=None message=AIMessage(content='Positive', additional_kwargs={}, example=False)


In [16]:
from sklearn.base import BaseEstimator, ClassifierMixin


class LLMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, llm_model, prompt_template):
        self.llm_model = llm_model
        self.prompt_template = prompt_template

    #This will be called during the training step
    def fit(self, X, y):
        return self

    #This will be called during inference.
    def predict(self, X):
        """
        [TO BE IMPLEMENTED]
        
        Goal: LLMClassifier's predict() method constructs the final prompt input
        for the LLM for each x in X, using the prompt template.

        You have to implement this function so it does the following:
        1. Construct the final prompt for the LLM
        2. Call `self.llm_model` to generate the completion (label) for the prompt
        3. Do any post-processing/response parsing to fetch the label from the LLM response
        """
        predicted_labels = []

        for x in X:
            prompt = self.prompt_template.format(description=x)
            response = self.llm_model.generate([[HumanMessage(content=prompt)]])
            predicted_labels.append(response.generations[0][0].text)

        print(predicted_labels)
        return predicted_labels
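
Step 3 of the docstring above (post-processing the response) matters because the LLM is free to answer with text outside the label set. In this notebook's zero-shot and few-shot runs it returned `'Politics'`, which is not in `LABEL_SET`. The sketch below is one possible parser. `normalize_label` and its `fallback` parameter are hypothetical, and returning `None` for unknown answers is an assumption (a default class would be another option).

```python
# Copied from the Global Constants cell so the sketch is self-contained
LABEL_SET = [
    'Business', 'Sci/Tech', 'Software and Developement', 'Entertainment',
    'Sports', 'Health', 'Toons', 'Music Feeds'
]


def normalize_label(raw_response, label_set, fallback=None):
    """Map a raw LLM response onto one of the allowed labels, else return fallback."""
    cleaned = raw_response.strip().strip('."\'').lower()
    # Drop a leading "Label:" if the model echoed the prompt's field name
    if cleaned.startswith('label:'):
        cleaned = cleaned[len('label:'):].strip()
    by_lower = {label.lower(): label for label in label_set}
    if cleaned in by_lower:
        return by_lower[cleaned]
    # Otherwise accept a label that appears inside a longer response
    for lowered, label in by_lower.items():
        if lowered in cleaned:
            return label
    return fallback


print(normalize_label(' sports.\n', LABEL_SET))
print(normalize_label('Label: Sci/Tech', LABEL_SET))
print(normalize_label('Politics', LABEL_SET))
```

Inside `predict()`, the parser would wrap `response.generations[0][0].text` before the value is appended to `predicted_labels`.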


In [18]:
# Zero-shot classification pipeline with LLMs

models_v3 = {}

"""
[TO BE IMPLEMENTED]
        
Goal: initialized below is a dummy sklearn Pipeline object with no steps.
You have to replace it with a pipeline object which uses the `LLMClassifier` you have implemented 
above to perform zero-shot classification on the test set.

You can add other steps to preprocess or post-process your data as you see fit.

"""
labels = ', '.join(LABEL_SET)
base_template = """
You are an expert in finding the correct labels of the description.
Your job is to categorize the label as the given description into one of the categories: {labels}
"""

zero_template = base_template.format(labels=labels) + """

Description: {description}
Label:
""" 

# I kept running into rate limits, so I further reduced the test_mini dataset to its first 10 rows

X_test_mini_mini = X_test_mini[:10]
Y_true_mini_mini = Y_true_mini[:10]

pipeline = Pipeline([
    ('classifier', LLMClassifier(llm_model=llm, prompt_template=zero_template))
])

# train
pipeline.fit(X_train_i, Y_train_i)
# predict
Y_pred_i = pipeline.predict(X_test_mini_mini)
# record results
models_v3["zero-shot"] = {
    'test_predictions': Y_pred_i,
    'accuracy': accuracy_score(Y_true_mini_mini, Y_pred_i),
    'f1': f1_score(Y_true_mini_mini, Y_pred_i, average='weighted'),
    'errors': sum([x != y for (x, y) in zip(Y_true_mini_mini, Y_pred_i)])
}
print("Accuracy on test set: {}".format(accuracy_score(Y_true_mini_mini, Y_pred_i)))



['Sports', 'Sci/Tech', 'Sports', 'Business', 'Sports', 'Sports', 'Business', 'Sports', 'Politics', 'Sports']
Accuracy on test set: 0.7
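
The rate limiting mentioned in the cell above is commonly handled by retrying with exponential backoff instead of shrinking the evaluation set. The sketch below shows the pattern in isolation. `call_with_backoff`, its parameters and the simulated failure are hypothetical, and catching every `Exception` is a simplification: in practice you would catch only the rate-limit error raised by the client library you use.

```python
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # assumption: rate-limit errors surface as exceptions
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Simulated LLM call that fails twice before succeeding
attempts = {'count': 0}


def flaky_call():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RuntimeError("simulated rate limit")
    return 'Sports'


waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

In `LLMClassifier.predict`, the wrapped function would be the `self.llm_model.generate(...)` call for a single prompt.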


In [20]:
# Few-shot classification with LLMs

"""
[TO BE IMPLEMENTED]
        
Goal: initialized below is a dummy sklearn Pipeline object with no steps.
You have to replace it with a pipeline object which uses the `LLMClassifier` you have implemented 
above to perform few-shot classification on the test set.

With few-shot classification, you can pass up to 5 demonstration examples as part of the prompt 
to the LLM. You can add other steps to preprocess or post-process your data as you see fit. 

"""
base_template = """
You are an expert in finding the correct labels of the description.
Your job is to categorize the label as the given description into one of the categories: {labels}

Here are some example labels for given descriptions:

Description: {e1}
Label: {l1}

Description: {e2}
Label: {l2}

Description: {e3}
Label: {l3}

Description: {e4}
Label: {l4}

Description: {e5}
Label: {l5}
"""

few_shot_template = base_template.format(
    labels=labels,
    e1=X_train_i[0],
    l1=Y_train_i[0],
    e2=X_train_i[1],
    l2=Y_train_i[1],
    e3=X_train_i[2],
    l3=Y_train_i[2],
    e4=X_train_i[3],
    l4=Y_train_i[3],
    e5=X_train_i[4],
    l5=Y_train_i[4]
    ) + """

Description: {description}
Label:
""" 

pipeline = Pipeline([
    ('classifier', LLMClassifier(llm_model=llm, prompt_template=few_shot_template))
])

# train
pipeline.fit(X_train_i, Y_train_i)
# predict
Y_pred_i = pipeline.predict(X_test_mini_mini)
# record results
models_v3["few-shot"] = {
    'test_predictions': Y_pred_i,
    'accuracy': accuracy_score(Y_true_mini_mini, Y_pred_i),
    'f1': f1_score(Y_true_mini_mini, Y_pred_i, average='weighted'),
    'errors': sum([x != y for (x, y) in zip(Y_true_mini_mini, Y_pred_i)])
}
print("Accuracy on test set: {}".format(accuracy_score(Y_true_mini_mini, Y_pred_i)))




['Sports', 'Business', 'Sports', 'Business', 'Sports', 'Sports', 'Business', 'Sports', 'Politics', 'Sports']
Accuracy on test set: 0.7
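
The few-shot prompt above uses the first five training rows as demonstrations, whatever their labels happen to be. A possible alternative, sketched below, picks demonstrations that cover distinct labels and builds the example block programmatically. `pick_demonstrations`, `format_demonstrations` and the toy rows are hypothetical. Braces in the example texts are escaped because the template is later passed through `.format(description=...)`, which would otherwise treat them as placeholders.

```python
def pick_demonstrations(texts, labels, k=5):
    """Pick up to k (text, label) pairs, taking the first example seen for each distinct label."""
    seen = set()
    picked = []
    for text, label in zip(texts, labels):
        if label not in seen:
            seen.add(label)
            picked.append((text, label))
        if len(picked) == k:
            break
    return picked


def format_demonstrations(pairs):
    """Render the pairs as 'Description/Label' blocks, escaping braces for a later .format()."""
    def escape(text):
        return text.replace('{', '{{').replace('}', '}}')
    return "\n\n".join(
        "Description: {0}\nLabel: {1}".format(escape(text), escape(label))
        for text, label in pairs
    )


# Toy rows standing in for X_train / Y_train
toy_texts = ['Team wins the cup', 'Shares fall {sharply}', 'Another match report', 'New phone released']
toy_labels = ['Sports', 'Business', 'Sports', 'Sci/Tech']

picked = pick_demonstrations(toy_texts, toy_labels, k=3)
template = format_demonstrations(picked) + "\n\nDescription: {description}\nLabel:"
prompt = template.format(description="Some text")
print(prompt)
```

Whether label coverage improves accuracy on this dataset is an open question that the 10-row evaluation above cannot settle.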


## Step 5: Report Results from previous three steps

In [33]:
# Report results

print("N-gram Models: ")
for train_size, result in models.items():
    print("Train size: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        train_size,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))


N-gram Models: 
Train size: 500  |  Accuracy: 0.5866  |  F1 score: 0.5531464348676678 |  Num errors: 2067
Train size: 1000  |  Accuracy: 0.6216  |  F1 score: 0.5946056697055893 |  Num errors: 1892
Train size: 2000  |  Accuracy: 0.6614  |  F1 score: 0.6447871530651375 |  Num errors: 1693
Train size: 5000  |  Accuracy: 0.7106  |  F1 score: 0.7029140315199996 |  Num errors: 1447
Train size: 10000  |  Accuracy: 0.7364  |  F1 score: 0.7293632339280425 |  Num errors: 1318
Train size: 25000  |  Accuracy: 0.7546  |  F1 score: 0.7490570514445333 |  Num errors: 1227


In [34]:
print("Pretrained Transformer Models: ")
for train_size, result in models_v2.items():
    print("Train size: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        train_size,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))

Pretrained Transformer Models: 
Train size: 500  |  Accuracy: 0.713  |  F1 score: 0.7037330967530628 |  Num errors: 1435
Train size: 1000  |  Accuracy: 0.7378  |  F1 score: 0.7295150730982559 |  Num errors: 1311
Train size: 2000  |  Accuracy: 0.7542  |  F1 score: 0.7456153407085256 |  Num errors: 1229
Train size: 5000  |  Accuracy: 0.7706  |  F1 score: 0.7639019368889814 |  Num errors: 1147
Train size: 10000  |  Accuracy: 0.7758  |  F1 score: 0.7695766070197272 |  Num errors: 1121
Train size: 25000  |  Accuracy: 0.784  |  F1 score: 0.7774143430544371 |  Num errors: 1080


In [35]:
print("Large Language Models: ")
for mode, result in models_v3.items():
    print("Mode: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        mode,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))

Large Language Models: 
Mode: zero-shot  |  Accuracy: 0.7  |  F1 score: 0.6666666666666667 |  Num errors: 3
Mode: few-shot  |  Accuracy: 0.7  |  F1 score: 0.65 |  Num errors: 3


## Step 6: Data Augmentation [Optional]

In this section, we explore how to augment your existing training data efficiently. This is a largely empirical exercise without a well-defined playbook, so this part of the project is open-ended. Let us first define what we mean by efficiency here, and why it matters:

### Performance Gain (G):
We measure the performance gain from data augmentation as the improvement in model accuracy (reduction in the number of errors) on the test dataset defined above.

### Budget (K):
We measure "budget" as the number of additional rows added to the original training dataset. In this project, the pool of data from which you select rows to add to your training set is Datasets['augment'] (and downstream X_augment, Y_augment).

This data is already labeled, but in most real-world scenarios the additional data is unlabeled. Before adding it to your training data, you have to get it annotated, which costs time and money. This is the motivation for treating budget as a metric.

### Efficiency (E = G / K): 
Efficiency = Performance Gain (reduction in the number of errors on the test set) / Budget (number of additional rows added to the training dataset)

We want the maximum gain in performance at the minimum annotation cost.
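
As a worked example of E = G / K, the sketch below computes efficiency from error counts. `augmentation_efficiency` is a hypothetical helper. The baseline of 1227 errors is the N-gram result at 25000 training rows reported above, while the post-augmentation error count and the number of added rows are made-up numbers for illustration.

```python
def augmentation_efficiency(errors_before, errors_after, rows_added):
    """E = G / K: reduction in test errors per added training row."""
    if rows_added <= 0:
        raise ValueError("rows_added must be positive")
    gain = errors_before - errors_after
    return gain / rows_added


# 1227 comes from the results above; 1127 and 5000 are illustrative only
efficiency = augmentation_efficiency(errors_before=1227, errors_after=1127, rows_added=5000)
print(efficiency)
```

Here 100 fewer errors for 5000 added rows gives E = 0.02, that is, one error removed per 50 annotated rows.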



We can always sample more data at random from the augmentation set, and this is probably the first thing to try. Can we be smarter about which data we add to the training dataset?

**Idea 1**: Look at the test errors that the current model is making. How can this help us guide our "data collection" for augmentation? One possible idea is to select examples from the augmentation dataset that are similar to these errors and add them to the training data. Similarity can be approximated in many ways:
1. [Jaccard distance between two texts](https://studymachinelearning.com/jaccard-similarity-text-similarity-metric-in-nlp/)
2. L2 distance between mean word vectors (we already compute these features for the entire dataset using WordVectorFeaturizer)
3. L2 distance between sentence transformer embedding (we already compute these features for the entire dataset using TransformerFeaturizer)
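
The L2-distance options in the list above can be sketched as a small nearest-neighbor search. `k_nearest_by_l2` and the toy vectors are hypothetical; in the notebook the vectors would be embeddings of a test error and of the rows in `X_augment`. For unit-length embeddings (as produced with `normalize_embeddings=True`), ranking by L2 distance gives the same order as ranking by cosine similarity.

```python
import heapq
import math


def k_nearest_by_l2(query_vec, candidate_vecs, k=5):
    """Return the indices of the k candidates closest to query_vec in L2 distance."""
    distances = ((math.dist(query_vec, vec), idx) for idx, vec in enumerate(candidate_vecs))
    return [idx for _, idx in heapq.nsmallest(k, distances)]


# Toy 2-dimensional vectors standing in for embeddings
query = [0.0, 0.0]
candidates = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
nearest = k_nearest_by_l2(query, candidates, k=2)
print(nearest)
```

This brute-force search is linear in the number of candidates per query, so for the full augmentation set a vectorized or approximate search may be preferable.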
  

**Idea 2**: Compute the model's predictions on the augmentation dataset, and add to the training dataset the examples that the model finds "hard". A proxy for this is to look for cases where the output score distribution has nearly identical scores for the top two or three labels.
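
Idea 2 can be sketched as margin-based selection: rank rows by the gap between their two highest class scores and keep the rows with the smallest gap. `smallest_margin_indices` and the toy scores are hypothetical; in the notebook the scores could come from `pipeline.predict_proba(X_augment)`, assuming the final classifier supports `predict_proba` (as `LogisticRegression` does).

```python
def smallest_margin_indices(probabilities, budget):
    """Return the indices of the `budget` rows with the smallest top-1 minus top-2 score gap."""
    margins = []
    for idx, row in enumerate(probabilities):
        top_two = sorted(row, reverse=True)[:2]
        margins.append((top_two[0] - top_two[1], idx))
    margins.sort()
    return [idx for _, idx in margins[:budget]]


# Toy class-score rows standing in for predict_proba output (each row needs at least two classes)
toy_scores = [
    [0.90, 0.05, 0.05],  # confident
    [0.40, 0.38, 0.22],  # uncertain between two labels
    [0.50, 0.30, 0.20],
    [0.34, 0.33, 0.33],  # uncertain across all labels
]
hard_rows = smallest_margin_indices(toy_scores, budget=2)
print(hard_rows)
```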

**Idea 3**: Look at the test errors that the current model is making, and the distribution of these errors across labels. Select examples from the augmentation dataset that belong to these classes: adding more training data for labels that the current model does poorly on can improve performance (assuming label quality is good).
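
Idea 3 can be sketched by counting errors per true label and splitting the budget in proportion to each label's share. `per_label_budget` and the toy labels are hypothetical, and the proportional split is only one possible allocation rule.

```python
from collections import Counter


def per_label_budget(true_labels, predicted_labels, budget):
    """Split `budget` across labels in proportion to each label's share of the errors."""
    error_counts = Counter(t for t, p in zip(true_labels, predicted_labels) if t != p)
    total_errors = sum(error_counts.values())
    if total_errors == 0:
        return {}
    # Floor division can leave a few rows of the budget unassigned
    return {label: (budget * count) // total_errors for label, count in error_counts.items()}


# Toy labels standing in for Y_true and the model's test predictions
toy_true = ['Sports', 'Sports', 'Business', 'Health', 'Health', 'Health']
toy_pred = ['Sports', 'Business', 'Business', 'Sports', 'Business', 'Health']
allocation = per_label_budget(toy_true, toy_pred, budget=300)
print(allocation)
```

The resulting counts say how many rows of each label to draw from `X_augment` / `Y_augment`.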

In [41]:
# Examine current test errors
test_errors = []
Y_pred_i = models[25000]['test_predictions']

for idx, label in enumerate(Y_true):
    if label != Y_pred_i[idx]:
        test_errors.append((X_test[idx], label,  Y_pred_i[idx]))

print("Number of errors in the test set: {}".format(len(test_errors)))
print("Example errors: [example, true label, predicted label]")
for i in range(10):
    print(test_errors[i])

Number of errors in the test set: 1227
Example errors: [example, true label, predicted label]
("European Union regulators will decide Tuesday whether Oracle Corp.'s hostile \\$7.7 billion bid for rival business software concern PeopleSoft Inc. can proceed, the EU's antitrust chief said Friday.", 'Sci/Tech', 'Business')
('Police said Wednesday a woman who lives at the home found the intruder Sept. 27 and called police. A police dog tried to track the intruder but the person got away.', 'Sports', 'Entertainment')
('OCTOBER 22, 2004 (IDG NEWS SERVICE) - Yahoo Inc. has snapped up privately held software company Stata Labs, which develops technology allowing users to quickly search through e-mail and attachments.', 'Entertainment', 'Sci/Tech')
('The unexpected windfall from DVD sales have become the wild card in Hollywoods negotiations with the Screen Actors Guild.', 'Business', 'Sci/Tech')
('Within the next 18 months more than 20 million mobile telephone users in Britain, Germany and Irela

In [72]:
'''
[TO BE IMPLEMENTED]

Your additional data augmentation explorations go here

For instance, the pseudocode for Idea (1) might look like the following:

Augmented = {}
For e in test_errors:
   1. X_nn, y_nn = k nearest neighbors to (e) from X_augment, y_augment
   2. Add each (x, y) from (X_nn, y_nn) to Augmented

Add the Augmented examples to the training set
Train the new model and record performance improvements

'''
def jaccard_similarity(e, a):
  """Jaccard similarity between the word sets of two texts (1.0 means identical sets)."""
  e_set = set(e.lower().split())
  a_set = set(a.lower().split())

  union = e_set | a_set
  if len(union) == 0:
    return 0.0
  return len(e_set & a_set) / len(union)

Augmented = set()
k = 5

# Note: every test error is compared with every augmentation row, so this is slow
for text, _, _ in test_errors:
  scored = [(jaccard_similarity(text, aug), idx) for idx, aug in enumerate(X_augment)]
  scored = sorted(scored, key=lambda pair: pair[0])
  # Keep the k augmentation rows most similar to this test error
  for _, idx in scored[-k:]:
    Augmented.add(idx)

# Build the augmented training set from the selected indices.
# (The original loop indexed with a stale `idx` and appended the same row repeatedly.)
X_train_aug = X_train + [X_augment[idx] for idx in Augmented]
Y_train_aug = Y_train + [Y_augment[idx] for idx in Augmented]

# TBD: retrain the model on (X_train_aug, Y_train_aug)