# Introduction

### Problem

In the project this week, we will build a machine learning text classifier to predict news categories from the news article text. 

1. We will iterate on classification models with increasing level of complexity and improved performance: N-gram models, pre-trained Transformer models, and third-party hosted Large Language Models (LLMs).

2. We will look at the impact of labeled dataset size and composition on model performance. The labeled dataset will be used for training in case of N-gram models and pre-trained Transformers, and for selecting examples for in-context few-shot learning for LLMs.

3. [advanced] As an extension, we will explore how to augment data efficiently to your existing training data (efficiency measured as improvement in performance normalized by volume of data augmented). 

Throughout the project there are suggested model architectures that we expect to work reasonably well for this problem. But if you wish to extend/modify any part of this pipeline, or explore new model architectures you should definitely feel free to do so.


## Step1: Prereqs & Installation

Download & Import all the necessary libraries we need throughout the project.

In [97]:
# Install all the required dependencies for the project

!pip install numpy
!pip install scikit-learn
!pip install sentence-transformers
!pip install matplotlib
!pip install langchain
!pip install openai
!pip install asyncio
!pip install nest_asyncio
!pip install tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tiktoken
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (

In [41]:
# Package imports that will be needed for this project

import numpy as np
import json
from collections import Counter
from sklearn.metrics import accuracy_score, f1_score
from sentence_transformers import SentenceTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from pprint import pprint
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import nest_asyncio

# Needed to run asyncio code in notebook easily
# Refer to https://stackoverflow.com/questions/55409641/asyncio-run-cannot-be-called-from-a-running-event-loop-when-using-jupyter-no
nest_asyncio.apply()
# [TO BE IMPLEMENTED] 
# Add any other imports needed below depending on the model architectures you are using. For e.g.
# from sklearn.linear_model import LogisticRegression

In [2]:
# Global Constants
LABEL_SET = [
    'Business',
    'Sci/Tech',
    'Software and Developement',
    'Entertainment',
    'Sports',
    'Health',
    'Toons',
    'Music Feeds'
]

WORD_VECTOR_MODEL = 'glove-wiki-gigaword-100'
SENTENCE_TRANSFORMER_MODEL = 'all-mpnet-base-v2'

TRAIN_SIZE_EVALS = [500, 1000, 2000, 5000, 10000, 25000]
EPS = 0.001
SEED = 0

np.random.seed(SEED)

## Step 2: Download & Load Datasets 

[AG News](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) is a collection of more than 1 million news articles gathered from more than 2000 news sources by an academic news search engine. The news topic classification dataset & benchmark was first used in [Character-level Convolutional Networks for Text Classification (NIPS 2015)](https://arxiv.org/abs/1509.01626). The dataset has the text description (summary) of the news article along with some metadata. **For this project, we will use a slightly modified (cleaned up) version of this dataset** 

Schema:
* Source - News publication source
* URL - URL of the news article
* Title - Title of the news article
* Description - Summary description of the news article
* Category (Label) - News category

Sample row in this dataset:
```
{
    'description': 'A capsule carrying solar material from the Genesis space '
                'probe has made a crash landing at a US Air Force training '
                'facility in the US state of Utah.',
    'id': 86273,
    'label': 'Entertainment',
    'source': 'Voice of America',
    'title': 'Capsule from Genesis Space Probe Crashes in Utah Desert',
    'url': 'http://www.sciencedaily.com/releases/2004/09/040908090621.htm'
 }
```




In [3]:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile

DIRECTORY_NAME = "data"
DOWNLOAD_URL = 'https://corise-mlops.s3.us-west-2.amazonaws.com/project1/agnews.zip'

def download_dataset():
    """
    Download the dataset. The zip contains three files: train.json, test.json and unlabeled.json 
    """
    http_response = urlopen(DOWNLOAD_URL)
    zipfile = ZipFile(BytesIO(http_response.read()))
    zipfile.extractall(path=DIRECTORY_NAME)

# Expensive operation so we should just do this once
download_dataset()

In [4]:
Datasets = {}

for ds in ['train', 'test', 'augment', 'test_mini']:
    with open('data/{}.json'.format(ds), 'r') as f:
      Datasets[ds] = json.load(f)
    print("Loaded Dataset {0} with {1} rows".format(ds, len(Datasets[ds])))

print("\nExample train row:\n")
pprint(Datasets['train'][0])

print("\nExample test row:\n")
pprint(Datasets['test'][0])

Loaded Dataset train with 25000 rows
Loaded Dataset test with 5000 rows
Loaded Dataset augment with 150000 rows
Loaded Dataset test_mini with 1000 rows

Example train row:

{'description': 'A capsule carrying solar material from the Genesis space '
                'probe has made a crash landing at a US Air Force training '
                'facility in the US state of Utah.',
 'id': 86273,
 'label': 'Entertainment',
 'source': 'Voice of America',
 'title': 'Capsule from Genesis Space Probe Crashes in Utah Desert',
 'url': 'http://www.sciencedaily.com/releases/2004/09/040908090621.htm'}

Example test row:

{'description': 'European Union regulators will decide Tuesday whether Oracle '
                "Corp.'s hostile \\$7.7 billion bid for rival business "
                "software concern PeopleSoft Inc. can proceed, the EU's "
                'antitrust chief said Friday.',
 'id': 278781,
 'label': 'Sci/Tech',
 'source': 'Washington Post Tech',
 'title': "EU to Rule Tuesday on Oracle'

In [5]:
X_train, Y_train = [], []
X_test, Y_true = [], []
X_augment, Y_augment = [], []

for row in Datasets['train']:
    X_train.append(row['description'])
    Y_train.append(row['label'])

for row in Datasets['test_mini']:
    X_test.append(row['description'])
    Y_true.append(row['label'])

for row in Datasets['augment']:
    X_augment.append(row['description'])
    Y_augment.append(row['label'])

In [6]:
target_labels = set(Y_true)
print(f"The unique labels for this dataset are {target_labels}")

The unique labels for this dataset are {'Music Feeds', 'Sci/Tech', 'Entertainment', 'Toons', 'Business', 'Software and Developement', 'Sports', 'Health'}


## Step 3: [Modeling part 1] N-gram model


In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

N_GRAM_SIZE = 5
LR_MAX_ITER = 1_000

models = {}

for n in TRAIN_SIZE_EVALS:
    print("Evaluating for training data size = {}".format(n))
    X_train_i = X_train[:n]
    Y_train_i = Y_train[:n]

    """
    [TO BE IMPLEMENTED]
        
    Goal: initialized below is a dummy sklearn Pipeline object with no steps.
    You have to replace it with a pipeline object which contains at least two steps:
    (1) mapping the input document to an N-gram feature extractor. You can use feature extractors
        provided by sklearn out of the box (e.g. CountVectorizer, TfidfTransformer)
    (2) a classifier that predicts the class label using the feature output of first step

    You can add other steps to preproces, post-process your data as you see fit. 
    You can also try any sklearn model architecture you want, but a linear classifier
    will do just fine to start with

    e.g. 
    pipeline = Pipeline([
        ('featurizer', <your WordVectorFeaturizer class instance here>),
        ('classifier', <your sklearn classifier class instance here>)
    ])

    Reference: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
    """
    pipeline = Pipeline(
        [
            ('featurizer', CountVectorizer(ngram_range=(N_GRAM_SIZE, N_GRAM_SIZE))),
            ('classifier', LogisticRegression(max_iter=LR_MAX_ITER))
        ]
    )
    
    # train
    pipeline.fit(X_train_i, Y_train_i)
    # predict
    Y_pred_i = pipeline.predict(X_test)
    # record results
    models[n] = {
        'pipeline': pipeline,
        'test_predictions': Y_pred_i,
        'accuracy': accuracy_score(Y_true, Y_pred_i),
        'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
        'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
    }
    print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))

Evaluating for training data size = 500
Accuracy on test set: 0.303
Evaluating for training data size = 1000
Accuracy on test set: 0.216
Evaluating for training data size = 2000
Accuracy on test set: 0.329
Evaluating for training data size = 5000
Accuracy on test set: 0.352
Evaluating for training data size = 10000
Accuracy on test set: 0.281
Evaluating for training data size = 25000
Accuracy on test set: 0.368


## Step 4: [Modeling part 2] Pretrained Transformer model

In [8]:
# Initialize the pretrained transformer model
sentence_transformer_model = SentenceTransformer(
    'sentence-transformers/{model}'.format(model=SENTENCE_TRANSFORMER_MODEL))

# Sanity check
example_encoding = sentence_transformer_model.encode(
    "This is an example sentence",
    normalize_embeddings=True
)

print(example_encoding)


Downloading (…)a8e1d/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)b20bca8e1d/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading (…)0bca8e1d/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)e1d/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading (…)a8e1d/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading (…)8e1d/train_script.py:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

Downloading (…)b20bca8e1d/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)bca8e1d/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

[ 2.25026105e-02 -7.82917812e-02 -2.30307486e-02 -5.10006677e-03
 -8.03404450e-02  3.91321294e-02  1.13428580e-02  3.46483383e-03
 -2.94574704e-02 -1.88930072e-02  9.47434008e-02  2.92747878e-02
  3.94859761e-02 -4.63165939e-02  2.54246294e-02 -3.21999453e-02
  6.21928424e-02  1.55591788e-02 -4.67795618e-02  5.03902026e-02
  1.46113662e-02  2.31413934e-02  1.22066885e-02  2.50696652e-02
  2.93654157e-03 -4.19822149e-02 -4.01036721e-03 -2.27843709e-02
 -7.68588809e-03 -3.31090614e-02  3.22118513e-02 -2.09992286e-02
  1.16730649e-02 -9.85073894e-02  1.77932623e-06 -2.29931846e-02
 -1.31140910e-02 -2.80222818e-02 -6.99970722e-02  2.59314068e-02
 -2.89501362e-02  8.76336619e-02 -1.20919012e-02  3.98605317e-02
 -3.31382118e-02  3.59108336e-02  3.46099064e-02  6.49783984e-02
 -3.00817713e-02  6.98187873e-02 -3.99513636e-03 -1.01600029e-03
 -3.50185446e-02 -4.36567441e-02  5.08025587e-02  4.68757823e-02
  5.39663173e-02 -4.03008349e-02  3.20139038e-03  1.36618130e-02
  3.82188670e-02 -3.23845

In [9]:
class TransformerFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, sentence_transformer_model):
        self.sentence_transformer_model = sentence_transformer_model
        # you can add any other params to be passed to the constructor here

    #estimator. Since we don't have to learn anything in the featurizer, this is a no-op
    def fit(self, X, y=None):
        return self

    #transformation: return the encoding of the document as returned by the transformer model 
    def transform(self, X, y=None):
        """
        [TO BE IMPLEMENTED]
        
        Goal: TransformerFeaturizer's transform() method converts the raw text document
        into a feature vector to be passed as input to the classifier.
            
        Given below is a dummy implementation that always maps it to a zero vector.
        You have to implement this function so it uses computes a document embedding
        of the input document using self.sentence_transformer_model. 
        This will be our feature representation of the document
        """
        pool = self.sentence_transformer_model.start_multi_process_pool()

        #Compute the embeddings using the multi-process pool
        embeddings = self.sentence_transformer_model.encode_multi_process(X, pool)
        print("Embeddings computed. Shape:", embeddings.shape)

        #Optional: Stop the proccesses in the pool
        self.sentence_transformer_model.stop_multi_process_pool(pool)
        return embeddings

In [10]:
models_v2 = {}
for n in TRAIN_SIZE_EVALS:
    print("Evaluating for training data size = {}".format(n))
    X_train_i = X_train[:n]
    Y_train_i = Y_train[:n]

    """
    [TO BE IMPLEMENTED]
        
    Goal: initialized below is a dummy sklearn Pipeline object with no steps.
    You have to replace it with a pipeline object which contains at least two steps:
    (1) mapping the input document to a feature vector (using TransformerFeaturizer)
    (2) a classifier that predicts the class label using the feature output of first step

    You can add other steps to preproces, post-process your data as you see fit. 
    You can also try any sklearn model architecture you want, but a linear classifier
    will do just fine to start with

    e.g. 
    pipeline = Pipeline([
        ('featurizer', <your TransformerFeaturizer class instance here>),
        ('classifier', <your sklearn classifier class instance here>)
    ])
    """
    pipeline = Pipeline([
        ('featurizer', TransformerFeaturizer(sentence_transformer_model)),
        ('classifier', LogisticRegression(max_iter=LR_MAX_ITER))
    ])

    # train
    pipeline.fit(X_train_i, Y_train_i)
    # predict
    Y_pred_i = pipeline.predict(X_test)
    # record results
    models_v2[n] = {
        'pipeline': pipeline,
        'test_predictions': Y_pred_i,
        'accuracy': accuracy_score(Y_true, Y_pred_i),
        'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
        'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
    }
    print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))


Evaluating for training data size = 500
Embeddings computed. Shape: (500, 768)
Embeddings computed. Shape: (1000, 768)
Accuracy on test set: 0.717
Evaluating for training data size = 1000
Embeddings computed. Shape: (1000, 768)
Embeddings computed. Shape: (1000, 768)
Accuracy on test set: 0.74
Evaluating for training data size = 2000
Embeddings computed. Shape: (2000, 768)
Embeddings computed. Shape: (1000, 768)
Accuracy on test set: 0.75
Evaluating for training data size = 5000
Embeddings computed. Shape: (5000, 768)
Embeddings computed. Shape: (1000, 768)
Accuracy on test set: 0.767
Evaluating for training data size = 10000
Embeddings computed. Shape: (10000, 768)
Embeddings computed. Shape: (1000, 768)
Accuracy on test set: 0.771
Evaluating for training data size = 25000
Embeddings computed. Shape: (25000, 768)
Embeddings computed. Shape: (1000, 768)
Accuracy on test set: 0.789


## Step 5: [Modeling part 3] Large Language Models

In [103]:
# Here's a couple of code snippets to help you familiarize with how to generate labels with LLMs using langchain,

from langchain.chat_models import ChatOpenAI
from langchain.schema import LLMResult, HumanMessage, Generation
import tiktoken

MODEL_NAME = "gpt-3.5-turbo" 
encoding = tiktoken.encoding_for_model(MODEL_NAME)

def num_tokens_from_string(string, encoding_name = encoding) -> int:
    """Returns the number of tokens in a text string."""
    num_tokens = len(encoding.encode(string))
    return num_tokens

MAX_TOKENS = max(num_tokens_from_string(label) for label in LABEL_SET)

llm = ChatOpenAI(
    model_name=MODEL_NAME,
    max_tokens=MAX_TOKENS,
    temperature=0.0
)



Sci/Tech




In [14]:

zero_shot_prompt_template = """
You are an expert at judging the sentiment of tweets. 
Your job is to categorize the sentiment of a given tweet into one of three categories: Positive, Negative, Neutral.

Tweet: {tweet}
Sentiment:
"""

prompt = zero_shot_prompt_template.format(
    tweet="I hate machine learning"
)

result = llm.generate([[HumanMessage(content=prompt)], [HumanMessage(content=prompt)]])
print(result.generations[0][0])


text='Negative' generation_info=None message=AIMessage(content='Negative', additional_kwargs={}, example=False)


In [15]:

few_shot_prompt_template = """
You are an expert at judging the sentiment of tweets. 
Your job is to categorize the sentiment of a given tweet into one of three categories: Positive, Negative, Neutral.

Some example tweets along with the correct sentiment are shown below.

Tweet: Another big happy 18th birthday to my partner in crime. I love u very much!
Sentiment: Positive

Tweet: The more I use this application, the more I dislike it. It's slow and full of bugs.
Sentiment: Negative

Tweet: #Dreamforce Returns to San Francisco for 20th Anniversary. Learn more: http://bit.ly/3AgwO0H
Sentiment: Neutral

Now I want you to label the following example: 
Tweet: {tweet}
Sentiment:
"""

prompt = few_shot_prompt_template.format(
    tweet="I like chocolate"
)

result = llm.generate([[HumanMessage(content=prompt)]])
print(result.generations[0][0])


text='Positive' generation_info=None message=AIMessage(content='Positive', additional_kwargs={}, example=False)


In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from langchain.callbacks import get_openai_callback
import asyncio
from asyncio import Semaphore

class LLMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, llm_model, prompt_template, semaphore):
        self.llm_model = llm_model
        self.prompt_template = prompt_template
        self.semaphore = semaphore

    # This will be called during the training step
    def fit(self, X, y):
        return self

    # This will be called during inference.
    def predict(self, X):
        """
        [TO BE IMPLEMENTED]

        Goal: LLMClassifier's predict() method constructs the final prompt input
        for the LLM for each x in X, using the prompt template.

        You have to implement this function so it does the following:
        1. Construct the final prompt for the LLM
        2. Call `self.llm_model` to generate the completion (label) for the prompt
        3. Do any post-processing/response parsing to fetch the label from the LLM response
        """
        with get_openai_callback() as cb:
          prompts = [
              [[HumanMessage(content=self.prompt_template.format(article=document))]]
              for document in X
          ]
          result = asyncio.run(self.__generate_labels(prompts))
          print(f"Total Tokens: {cb.total_tokens}")
          print(f"Prompt Tokens: {cb.prompt_tokens}")
          print(f"Completion Tokens: {cb.completion_tokens}")
          print(f"Total Cost (USD): ${cb.total_cost}")
          print(result)
          return result

    async def __async_generate(self, prompt):
        async with self.semaphore:
          response = await self.llm_model.agenerate(prompt)
          label = response.generations[0][0].text
          print(label)
          return label

    async def __generate_labels(self, prompts):
        tasks = [self.__async_generate(prompt) for prompt in prompts]
        labels = await asyncio.gather(*tasks)
        return labels




In [105]:
news_article_zero_shot_template = """
You are an expert at judging the category of which a news article belongs too. 
Your job is to categorize the category of a given article into one of three categories: Sci/Tech, Software and Developement, Entertainment, Sports, Health, Toons, and Music Feed.

Article: {article}
News article category:
"""

In [106]:
# Zero-shot classification pipeline with LLMs

models_v3 = {}
"""
[TO BE IMPLEMENTED]
        
Goal: initialized below is a dummy sklearn Pipeline object with no steps.
You have to replace it with a pipeline object which uses the `LLMClassifier` you have implemented 
above to perform zero-shot classification on the test set.

You can add other steps to preproces, post-process your data as you see fit. 

"""
semaphore = Semaphore(2)

pipeline = Pipeline(
    [
      ('classifier', LLMClassifier(llm_model=llm, prompt_template=news_article_zero_shot_template, semaphore=semaphore)) 
    ]
)

# train
pipeline.fit(X_train_i, Y_train_i)
# predict
Y_pred = pipeline.predict(X_test)
# record results
models_v3["zero-shot"] = {
    'test_predictions': Y_pred,
    'accuracy': accuracy_score(Y_true, Y_pred),
    'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
    'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred)])
}
print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred)))

Sports
Sci/Tech
Sci/Tech




Sports
Sci/Tech
Business/Finance
Sports
Sports
Health




Economy/Finance
Sports
Politics/Current Events
Sports
Entertainment
Health
Business/Finance
Sports
None of the above categories. This article belongs to the category of World/International News or Politics.
Sci/Tech
Sci/Tech
Health
Business/Finance
Entertainment
Health
Business/Economy




Sci/Tech
Sports
Sports
Sports
Sports
Sports
Sports (possibly political




Sports
Politics/Current Events
Politics/Current Events
Sci/Tech
Entertainment
Entertainment
Sports




Sports
Business/Finance
Politics/Economics
Economics/Business
Crime/Justice
Sports
Music Feed
Politics
Business/Economy
Sports
Sci/Tech




Sports
Health
Health
Health




Health
Software and Development
Music Feed
Sports
Business/Finance
Sports
Sci/Tech




Sports
Finance/Economy
Sports




Sci/Tech
Sports
Sports




Sci/Tech
Sports
Sci/Tech
Sports
Health
Sports
Health




Employment and Labor




Sci/Tech
Entertainment
Sports
Sports
Entertainment
Sci/Tech




Sports
Business/Finance
This news article belongs
Health
Health
Sci/Tech




Health
Health




Sports
Sci/Tech
Software and Development
Sports
Entertainment
Finance/Business
Politics/Law




Health
Entertainment
Sci/Tech
Sports
Sports
Sci/Tech
Entertainment
Entertainment




Sci/Tech
Sports
Health
Sports
Sci/Tech
Sports
Health




Sports




Sports
Sports




Business/Finance




Politics/World News




Sci/Tech
Health
Sports
Sports
Politics/Current Events
Sports




Sports
Sports




Health
Sports
Sci/Tech
Sports
Sci/Tech
Entertainment
Sports




Sports
Entertainment
Health
Entertainment
Politics
Software and Development




Sci/Tech




Sci/Tech
Sports




Sci/Tech




Trade/Economy
Sports
Politics/Government
Politics
Sports (incorrect categor




Health
Sports




Entertainment




Sports
Sci/Tech
Politics/Elections
Sports
Sports
Business/Finance
Sci/Tech




Entertainment
Economics/Finance
Sci/Tech
Sports
Sports




Sports
Sports




None of the above
Sci/Tech
Health
Sci/Tech




Animals and Wildlife
Entertainment
Sports
Entertainment
Politics




Sports
Health




Sports
Sports
Sports
Sci/Tech




Sci/Tech
Sports




Weather/Natural Dis
Sci/Tech
Entertainment




Sci/Tech
Politics/Law




Sports
Sci/Tech
Politics/International Relations
Music Feed




Entertainment
Sci/Tech (as it involves government contracts and investigations)
Sports




Sci/Tech
Sci/Tech
Politics




Business/Finance
Sci/Tech




Business/Finance
Business/Finance
Sports
Entertainment
Software and Development




Sports
Politics/International Affairs
Politics/Law




Software and Development
Business/Economy
Business/Finance
Sci/Tech
Health
Sports
Business/Economy
Entertainment
Sports
Sports




Sci/Tech
Sports
Politics
Sports
Sci/Tech
Business/Finance
Sci/Tech




Sports
Sports
Sports
Sci/Tech




Sci/Tech
Sports
Entertainment
Sports
Business/Economy
Sports




Sports
Business/Finance
Music Feed
Obituary/
Sports
Sports
Politics
Health




Entertainment
None of the given
Sci/Tech
Entertainment
Sports




Sports
Sci/Tech
Entertainment
Sci/Tech
I'm sorry, but you have not provided an article for me to categorize. Please provide an article for me to categorize.
Health




Politics or World News
Health




Sports
Sports
Toons
Sports
Health
Entertainment




Entertainment
Sports
Sci/Tech
Entertainment
Sports
Health
Sports
Sports
Software and Development
Sports
Sci/Tech




Sports
Sci/Tech
Sports
Sports
Sci/Tech
Entertainment
Sci/Tech
Health
Sports
Sci/Tech




Entertainment
Sports
Entertainment
Entertainment
Sci/Tech
Health
Sci/Tech
Entertainment
Sports
Sports




Health
Sports (This article
Sci/Tech
Sci/Tech
Sports
Business/Finance
Politics/Current Events
Health
Politics/International Affairs
Health
Sports




Sports
Politics
Entertainment
Health
Sports
Music Feed
Sports
Sports
Sports
Sci/Tech
Sports
Sports
Entertainment
Finance/Investment
Sports
Sports




Sports
I'm sorry,
Sci/Tech
Health
Music Feed
Sci/Tech
Sports (This is
Sports
Health




Sci/Tech or
Music Feed
Entertainment
Economics/Business
Finance/Economy
Finance/Economy
Sci/Tech
Sci/Tech
Sci/Tech
Sports




Entertainment
Sports
Entertainment
Sci/Tech
Sports
Sports
Sports
Sports




Sci/Tech
Politics
Software and Development
Sci/Tech
Entertainment
Toons
Retail/Sales
Sports
Entertainment
Labor/Employment (not one of the given categories, but closest match would be Business/Finance)
Sports
Sports




Sports
Sports
Toons
Sports
Entertainment
Health
Entertainment
Health
Sports
Finance/Business
Entertainment
Health (due to
Business/Finance
Sports
Politics




Sci/Tech
None of the above
Business/Finance
Sports
Entertainment
None of the above categories. This article falls under the category of World News or Current Events.
Health
Science/Technology (Sci/Tech)
Politics/LGBTQ+ Rights




Sports
Sports




Sci/Tech
Politics
Sci/Tech




Health (as it pertains to the safety and well-being of Nepalese workers)
It is difficult to determine the category of this news article based on the given information. It could potentially fall under the category of Crime or Law Enforcement.
Sports




Entertainment




Politics/International Affairs
Sports
Sports




Politics
Sci/Tech




Sports
Business/Finance
Sports




Sci/Tech
This news article falls under the category of Crime or Law and Order. It does not fit into any of the given categories.
Sci/Tech
Health
Sports




Sci/Tech
Sci/Tech
Sports




Sports
Sports




Sports
Sports
Health
Sports




Health
Sports
Entertainment
Finance/Economy
Politics/International Affairs
Sports
Business/Economy
Health
Entertainment




Business/Finance
Sports
Sports
Sports




Sports
Sports




Health
Health
Sci/Tech
Health




Sports
Entertainment
Weather/Natural Disaster
Sports (incorrect category) 

Correct category: None of the given categories fit this news article. It is related to airport security and does not fall under any of the provided categories.
Entertainment
Health
Sports
Health (as the article is related to the safety and well-being of individuals in a conflict zone)
Sports
Sci/Tech
Business/Economy




Sports
Music Feed
Health
Sports
Sports




Entertainment
Sci/Tech
Sci/Tech




Sports
None of the above
Sci/Tech
Health
Health (due to
Business/Finance
Music Feed
Health
Sports (This is not a news article, but if it were, it would likely fall under the category of war or conflict, which is not one of the given categories.)
Sci/Tech
Sports
Finance/Business




Sci/Tech
Sports
Sports
Sci/Tech
Sports




Sports




Charity/Fundraising (not one of the given categories)
Health




Sports




Health
Sports




Sports
Sci/Tech
Sports




Sports
Business/Finance
Sci/Tech
Business/Economics
Sci/Tech
Natural Disasters/Weather
Entertainment




Sci/Tech
Finance/Business
Sci/Tech
Politics/International Affairs
Sports (as it pertains to military action and conflict)
Politics
Retail/Sales




Health
Sports
Sports
Sports
Entertainment




Health
Economics/Business
Politics/Economics
Sci/Tech
Health




Health
Politics/Elections
Sports
Sports
Sports




Business/Finance
Entertainment




Toons
Sports
Sports
Sports




Economy/Finance
Sci/Tech
Sci/Tech
Business/Economy
Sports
Sci/Tech
Entertainment
Business/Finance
Health
Sports
Business/Finance




Sports
Health
Sports
Sci/Tech
Business/Finance
Health
Sports
Health
None of the above
Sci/Tech
Sports
Sci/Tech or Software and Development (depending on the focus of the article)
Sports
Sports
Sports
Entertainment




Economics/Business
Sports
Business/Economy
Sci/Tech
Sports
Sports
Sports
Sports
Sports




Sports
Sports
Sci/Tech




Health
Health
Sports
Sports
Health
Entertainment
Politics/Current Events




Health
Sci/Tech
Sports
Sports
Finance/Business
Sports
Sci/Tech
Sports




Entertainment
Sports
Sports
Sports
Business/Finance
Sports
Politics/Economics
Politics
Politics/World News
Sports
Business/Finance




Sports
Sports
Sci/Tech
Entertainment
Entertainment
Sci/Tech
Sci/Tech




Sci/Tech
Sci/Tech
Sci/Tech
Politics
Health
Entertainment
Sci/Tech
Sports
Sports




Business/Finance
Sci/Tech
Sci/Tech
Sci/Tech




Health
Sci/Tech
Politics
Finance/Economy
Sports
Entertainment
Sci/Tech
Sci/Tech




Entertainment
Politics/Law
Sports




Music Feed
Sci/Tech (
Sci/Tech
I'm sorry, but you have not provided an article for me to categorize. Please provide an article for me to categorize.
Sports
Finance/Business




Business/Finance
Sports
Politics/International Affairs
Sci/Tech
Sports




Sports
Sports
Business/Finance
Sports




Sports
Entertainment
Sports (as the article is likely discussing the attack on Pearl Harbor, which is a historical event related to war and military action)
Politics




Entertainment
I'm sorry,
Sports




Health
Health
Sports
Sports
Sci/Tech




Sports
Sci/Tech




Entertainment
Sci/Tech
Sci/Tech
It is difficult to
Sports




Politics/International Affairs
Sports
Sports




Finance/Economy
Sports
Sports




Health
Sports




Sports




Sports




None of the above. This article does not fit into any of the given categories. It may fall under the category of "human interest" or "local news."
Sci/Tech




Sports
Travel/Tourism




Business/Finance




Sports
Politics




Politics/Elections
Politics




Sci/Tech




Sports




Politics/International Affairs




Sci/Tech
Sci/Tech
Finance/Economics
Sports




Sci/Tech




Sci/Tech
Sci/Tech




Sports
Sports




Health
Economics/Business
Sports




Entertainment




Finance/Economy
Sci/Tech
Sci/Tech
Sports
Sports




Sports
Sci/Tech
Toons
Health
Labor/Employment
Sports




Sports
Entertainment
Entertainment
Sci/Tech
Entertainment
Business/Finance
Politics
Health (as it pertains to the regulation of alcoholic beverages)
Sports
Finance/Business
Entertainment
Sci/Tech




None of the above
Health
Science/Technology (
Security/Politics
Politics/LGBTQ
Sports
Sci/Tech
Entertainment
Sports




Business/Finance




Sci/Tech
Politics
Sci/Tech
Sports
Sci/Tech
Finance/Investment
Sports
Sports




Health (as it
Entertainment
Sports
Sports
Politics/International Affairs
Politics
Sci/Tech
Sci/Tech
Sports
This news article falls
Finance/Economy
Sports




Sports
Sci/Tech
Sports
Sports
Sports
Sports
Finance/Economy
Health
Sports
Politics/Current Events




Health
Health
Sports (incorrect category
Health
Health
Health (as the
Sci/Tech
Music Feed
Sports (incorrect categorization) 

Correct categorization: International Affairs/Politics
Sports
Sci/Tech
Health
Sci/Tech
Sports




Sports (This is
Sports
Sci/Tech
Sports
Sports
Sports
Sports
Sci/Tech




Charity/Fundra
Sports
Health
Sports
Politics/Economics
Health
Health
Sports
Sci/Tech
Sports




Sci/Tech
Sports
Sci/Tech (
Health
Sci/Tech
Business/Finance
Politics/Elections




Sci/Tech
Business/Economics
Sci/Tech
Sports
Natural Disasters/
Business/Finance
Entertainment




Business/Economics
Sci/Tech
Finance/Business
Sci/Tech
Politics/International Affairs
Sports (as it
It is difficult to determine the exact category of the given article without more context. However, based on the keywords "neat idea" and "price," it could potentially fall under the Sci/Tech or Software and Development categories.
Politics
Entertainment
Retail/Sales
Sports
Health
Sports
Sports




Sci/Tech
Sports




Sports
Entertainment
Politics/World News
Health




Sci/Tech




Economics/Business
Politics/Economics
Sci/Tech




Sci/Tech
Sports




Health
Health
Politics/Elections
Sci/Tech




Health
Sports




Sports
Sports




Sports
Health
Business/Finance
Business/Finance
Sports
Sci/Tech or
Sports
Sports
Sports
Entertainment
Sci/Tech
Sports




Politics/Current Events
Sports
Sci/Tech
Entertainment
Politics/Economics
Sci/Tech
Business/Finance
Health
Sports




Sports
Business/Economy
Sports
Sci/Tech
Sports
Finance/Economy
Sci/Tech
Entertainment




Sci/Tech
I'm sorry,
Finance/Business
Sci/Tech
Sports
Sports (as the
Sports
Sports
Entertainment
Sports
Sci/Tech
Sci/Tech
Sports
Finance/Economy
Sports
Sports
Health
Sports
Sports
None of the given




Entertainment
Sports
Politics/International Relations
Sports
Energy/Commodities
Health
Health
Sports
Sports
Finance/Economy
Sports
Entertainment
Health
Sci/Tech
Music Feed
Crime/Local News




Health
Politics
Sports
Politics/Government
Business/Finance
Business/Finance
Politics/International Affairs
Sports
Sports
Sci/Tech
Sports
Sci/Tech
Sports
Health
Sci/Tech
Sports
Entertainment
Sci/Tech
Sports
Entertainment
Entertainment
Health
Sports
Entertainment
Sports
Sports
Sci/Tech
Sports
None of the above (this is just a schedule of events, not a news article with a specific category)
Health
Health
Business/Finance
Software and Development




Health
Sci/Tech




Sci/Tech




Sports
Sports




Sci/Tech
Sports
Sports
Politics/Elections
Finance/Business




Travel/Tourism
Health




Business/Finance
Sports
Politics
Politics




Sci/Tech
Politics
Sports
Finance/Economics
Business/Finance
Politics/International Affairs
Health




Sci/Tech
Sci/Tech
Health




Sports
Sci/Tech




Sports




Sci/Tech
Sports
Sci/Tech




Sports
Politics/International Affairs
Entertainment
Business/Economy
Sci/Tech




Business/Economy
Sports




Health
Sports




Sci/Tech
Sports (incorrect categorization) 

Correct categorization: Weather/Natural Disaster or Current Events/Politics
Health (as it
Sports
Finance/Business
Health




Sci/Tech
Security/Politics
Toons




Sci/Tech
Entertainment
Business/Finance
None of the given categories fit this article. It could potentially fall under the category of "World News" or "Politics."
Sports




Sci/Tech
Sports
Sports
Sports




Sci/Tech




Sci/Tech
Finance/Economy
Health
Sci/Tech
Sports
Politics/Current Events




Health
Health
Sports
Sci/Tech
Sports (incorrect categor




Health
Sports
Sports
Sports




Sports
Sci/Tech




Sci/Tech
Politics/Economics
Health
Health
Sci/Tech




Sci/Tech
Health
Sci/Tech
Politics/Elections
Sci/Tech




I'm sorry, but you have not provided an article for me to categorize. Please provide an article for me to categorize.
Sports




Business/Finance
Business/Economics
Toons
Sci/Tech
Sci/Tech




Entertainment
Sports
Sports
Sports




Music Feed




Politics/World News
Sci/Tech
Sci/Tech
Sports




Sports
Sports




Sci/Tech




Health
Health
Employment and Labor Force (not one of the given categories)
Sports
Sci/Tech
Entertainment
Sports
Entertainment




Health
Sports




Entertainment
Sci/Tech
Health
Health
Sports
Sports
Sports
Sci/Tech
Sci/Tech (related to international trade and sanctions)
Politics/Government
Health




Sports
Politics/Elections
Politics/Diplom




Sports
Energy/Commod
Health
Sci/Tech
Health
Legal/Finance




Sports
Entertainment
Sports




Finance/Economy
Sports
Entertainment
Sports
Health
Sci/Tech
Sci/Tech
Sports




Music Feed
Crime/Local News
Science/Technology
Sports




Entertainment
Politics




Sports
Business/Finance
Sports
Business/Finance




Politics/International Affairs
Sports




Sports
Sports
Sports
Sci/Tech
Sci/Tech
Sports
Sports
Health
Sports
Sports




Sci/Tech (
Sports
Entertainment
Sci/Tech
Sports




Entertainment
Sci/Tech
Entertainment
Health
Sports
Entertainment
Sports
Sports
Sci/Tech
I'm sorry, but you have not provided an article for me to categorize. Please provide an article for me to categorize.
None of the above




Sports
Health
Health
Health
Business/Finance
Sci/Tech




Software and Development
Health
Sci/Tech
Sci/Tech
Sports
Health
Sci/Tech
Sports
Sports
Finance/Business
Sports




Politics
Health
Health
Business/Finance




Sports
Business/Finance
Politics/International Affairs
Sports




Sci/Tech
Sports
Business/Economy
Business/Economy
Sports (incorrect categor
Toons
Sports
Business/Finance
Health




Sports (incorrect categorization)
None of the given
Sports




Sci/Tech
Sports
Sci/Tech
Sports (assuming it




Sports




Sports
Sci/Tech
Sports
Sci/Tech
Finance/Economy
Health
Health




Sci/Tech




Sci/Tech
Sports




Sports
I'm sorry,




Toons
Sci/Tech
Sci/Tech
Sports




Sci/Tech
Music Feed




Sports
Sci/Tech
Sports




Weather/Natural Disasters
Employment and Labor
Music Feed
Politics/Elections




Sci/Tech
Sports
Sports
Health




Entertainment




Health
Sports
Politics/Economics
Business/Economy




Health
None of the above. This article does not fit into any of the given categories.
Entertainment




Politics
Sci/Tech




Legal/Finance
Entertainment
Sports
Sci/Tech
Politics/Current Events
Politics/Law




Sports
Sci/Tech
Sports
Sports




Sports




Sports
Sports
Sports
Sports
Business/Finance
Entertainment




Sports
Sci/Tech




I'm sorry,
Health
Entertainment
Sci/Tech
Sports
Sci/Tech
Software and Development
Sports




Sports (incorrect category
Sports
Business/Finance
Business/Trade
Sci/Tech




Sports
Sci/Tech
Finance/Economy
Sports
Sci/Tech




Health
Sports
Sports
Sports
Sports




Sci/Tech
Sci/Tech
Sci/Tech
Sports
Sci/Tech




Weather/Natural Dis
Sci/Tech (This article does not fit into any of the given categories. It is a news article about a political and security issue.)
Music Feed




Sports
Entertainment




Health




Business/Economy
None of the above
Sci/Tech




Sci/Tech
Politics/Law
Politics
Business/Finance
Politics/Current Events




Sci/Tech
Sports
Sports
Business/Finance
Entertainment
Sci/Tech
Sports




Sports
Entertainment
Sports
Software and Development
Sci/Tech
Sci/Tech




Entertainment
Sports




Sports
Sports
Sci/Tech (
Sci/Tech
Health
Sports
Business/Economy
Sci/Tech




Sports
Business/Finance
Sci/Tech
Entertainment
Sci/Tech
I'm sorry, but you have not provided an article for me to categorize. Please provide an article for me to categorize.
Business/Economy
I'm sorry,
Sports
Sports
Sports
Sports
Sports




Software and Development
Sci/Tech
Finance/Economy
Sports
Sci/Tech
Entertainment
Sports
Health




Software and Development
Sci/Tech
Sci/Tech




Sci/Tech
Sports
Sports
Sports
Sports
Sci/Tech
Sports
Politics/International Affairs
Sports
Sports
Finance/Economy




Sports
Sci/Tech
Sports
Sports
Sports
Sci/Tech
Sports
None of the above
Entertainment
Entertainment
Sci/Tech




Health
Entertainment
Toons
Health
Sports
Politics
Health
Finance/Economy
Sci/Tech
Sci/Tech




Health
Health (due to
Sports
Sports
Sci/Tech
Sports




Entertainment
Health
Sci/Tech
This news article belongs
Sci/Tech
Business/Finance
Sports
Health
Sci/Tech
Health




Health
Sci/Tech (related to the oil industry and market speculation)
Business/Finance
Sports
Sports
Politics/International Relations




Sports
Sports
Sci/Tech
Sci/Tech
Sci/Tech
Finance/Economy
Sports
Politics/Current Affairs
I'm sorry,
Sports




Sports
Business/Finance
I'm sorry,
Sports




Health
Retail/Business
Sports
Sports
Sports
Sports
Politics/World News




Sports
Sci/Tech or
Sports
Sports
Sports
Sports
Sci/Tech
Sports
Sci/Tech or




Music Feed
Sports
Politics or Military Affairs
Finance/Investment
Health
Entertainment
Sports
Politics/Human Rights
Sci/Tech
Sci/Tech
Sports




Politics
Sports
Music Feed
Sci/Tech
Sci/Tech
Sports
Sports
Entertainment
Sports
Entertainment
Software and Development




Sports
Health




Health
Sci/Tech
I'm sorry,




Sports
Sports
Sci/Tech
Entertainment
Sports
Sci/Tech
Sports
Health (due to
Business/Finance
Entertainment
Sci/Tech




Sports
Sci/Tech
Sports
Sci/Tech
Business/Finance
Sports
Sci/Tech
Business/Finance
Sports
Health
Politics




Sports
Health
Sci/Tech or
Finance/Economics
Sci/Tech
Sports
Entertainment
Sports
Entertainment
Health
Health




Sports
Sports
Sci/Tech
Entertainment
Sci/Tech
Toons
Economy/Politics
Sports
Entertainment




Entertainment
Sports
Finance/Economy
Sci/Tech
Politics
Sci/Tech
Entertainment
Sci/Tech
Finance/Economy




Business/Finance
Entertainment
Sports
Sci/Tech
Sci/Tech
None of the given
Health
Sci/Tech
Business/Finance
Sports
Health




Sci/Tech (
None of the above
Sci/Tech
Health
Business/Finance
Sports
Sci/Tech
Business/Finance
Health
Health (due to the focus on the number of civilian deaths as a consequence of war)




Sports
None of the above. This article does not fit into any of the given categories.
Software and Development
Sports
Sports




Sports
Health




Health
Health
Health
Sci/Tech
Sports
Entertainment




Sports
Entertainment
Sports
Health
Sports




Sports




Sci/Tech
Health
Sci/Tech




Health
Health
Health
Health
None of the given
Entertainment
This news article belongs to the category of "Current Events" or "Politics" as it reports on the deployment of the army in response to violent protests during a funeral in Pakistan. It does not fit into any of the given categories.
Sci/Tech




Sports
Sci/Tech
Sports
Health




Sci/Tech
Sports
Business/Finance




Entertainment
Sports
Sports
Sci/Tech




Health
Entertainment




Sci/Tech
Software and Development




Business/Finance
Sci/Tech
Sports




Sci/Tech
Sports
Sports
Sci/Tech
Politics/International Relations




Health
Sci/Tech




Sports




Sports
Sci/Tech




Sci/Tech




Sci/Tech




Politics/Current Affairs
Finance/Economy
Sci/Tech




I'm sorry, but you have not provided an article for me to categorize. Please provide an article for me to categorize.
Sports
I'm sorry, but you have not provided an article for me to categorize. Please provide an article for me to categorize.




Sports




Retail/Business
Health
Sports
Sports
Sci/Tech or Music Feed
Sports
Sports
Sports
Sports




Sports
Health
Total Tokens: 113655
Prompt Tokens: 111599
Completion Tokens: 2056
Total Cost (USD): $0.2273100000000002
['Sports', 'Sci/Tech', 'Sports', 'Sci/Tech', 'Sports', 'Sports', 'Economy/Finance', 'Sports', 'Politics/Current Events', 'Sports', 'Health', 'Entertainment', 'Business/Finance', 'Sports', 'Sci/Tech', 'Sci/Tech', 'Business/Finance', 'Entertainment', 'Health', 'Sci/Tech', 'Sci/Tech', 'Sports', 'Sports', 'Sports', 'Sports (possibly political', 'Sports', 'Politics/Current Events', 'Politics/Current Events', 'Entertainment', 'Sports', 'Sports', 'Business/Finance', 'Politics/Economics', 'Crime/Justice', 'Music Feed', 'Politics', 'Sports', 'Sports', 'Health', 'Health', 'Health', 'Software and Development', 'Music Feed', 'Business/Finance', 'Sci/Tech', 'Finance/Economy', 'Sci/Tech', 'Sports', 'Sci/Tech', 'Sports', 'Sci/Tech', 'Sports', 'Employment and Labor', 'Sci/Tech', 'Entertainment', 'Sports', 'Sci/Tech', 'Sports', 'Business/Finance', 'This news article belongs', 'Health',

In [111]:
for x, y in zip(X_test[:3], Y_true[:3]):
  print(x, y)


AP - Denny Neagle's contract was terminated by the Colorado Rockies on Monday, three days after the oft-injured pitcher was cited for solicitation. Sports
The rush by Wal-Mart and other companies to put radio frequency identification devices in their goods could imperil consumer privacy. Software and Developement
AP - Tampa Bay Buccaneers owner Malcolm Glazer increased his stake in Manchester United for the third straight trading day, upping his ownership share Tuesday to 28.11 percent and edging closer still to a possible takeover bid. Sports


In [112]:
news_article_few_shot_template = """
You are an expert at judging the category of which a news article belongs too. 
Your job is to categorize the category of a given article into one of three categories: Sci/Tech, Software and Developement, Entertainment, Sports, Health, Toons, and Music Feed.

Some example tweets along with the correct sentiment are shown below.

Article: AP - Denny Neagle's contract was terminated by the Colorado Rockies on Monday, three days after the oft-injured pitcher was cited for solicitation.
News article category: Sports

Article: The rush by Wal-Mart and other companies to put radio frequency identification devices in their goods could imperil consumer privacy.
News article category: Software and Developement

Article: AP - Tampa Bay Buccaneers owner Malcolm Glazer increased his stake in Manchester United for the third straight trading day, upping his ownership share Tuesday to 28.11 percent and edging closer still to a possible takeover bid.
News article category: Sports

Now I want you to label the following example: 

Article: {article}
News article category:
"""

In [113]:
# Few-shot classification with LLMs

"""
[TO BE IMPLEMENTED]
        
Goal: initialized below is a dummy sklearn Pipeline object with no steps.
You have to replace it with a pipeline object which uses the `LLMClassifier` you have implemented 
above to perform few-shot classification on the test set.

With few-shot classification, you can pass upto 5 demonstration examples as part of the prompt 
to the LLM. You can add other steps to preproces, post-process your data as you see fit. 

"""
pipeline = Pipeline(
    [
      ('classifier', LLMClassifier(llm_model=llm, prompt_template=news_article_few_shot_template, semaphore=semaphore)) 
    ]
)

# train
pipeline.fit(X_train_i, Y_train_i)
# predict
Y_pred_i = pipeline.predict(X_test)
# record results
models_v3["few-shot"] = {
    'test_predictions': Y_pred_i,
    'accuracy': accuracy_score(Y_true, Y_pred_i),
    'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
    'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
}
print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))


Sports
Software and Development
Sports
Sci/Tech
Sports
Sports
Sci/Tech
Sports
Politics
Sports
Entertainment
Health
Sports
Business/Finance
Sci/Tech
Sci/Tech
Business
Entertainment
Health
Sci/Tech
Sci/Tech
Natural Disasters/
Sports
Health
Politics/Current Events
Sports
Politics/Current Events
Politics/World News
Fashion/Entertainment
Sports
Sports
Business
Sci/Tech
Sci/Tech
Crime or Law Enforcement
Sci/Tech
Sports
Sports
Health
Health
Health
Software and Developement
Business
Music Feed
Sci/Tech
Sci/Tech
Sci/Tech
Sci/Tech
Sports
Sci/Tech
Sports
Sci/Tech
Health
Sci/Tech
Sports
Entertainment
Sports
Sci/Tech
Politics
Business/Finance
Health
Health
Sports
Sci/Tech
Sci/Tech
Entertainment
Business
Health
Entertainment
Sci/Tech
Sci/Tech
Sci/Tech
Sci/Tech
Sports
Health
Sci/Tech
Health
Entertainment
Health
Sports
Politics
Sports
Health
Sports
Sci/Tech
Sports
Sports
Sports
Entertainment
Health
Software and Development
Sports
Trade and Commerce
Sports
Politics
Sci/Tech (
Sports
Sports
Sci/Tech
Spo



Sports
Music Feed
Sports
Sports
Sports
Sports
Sci/Tech
Entertainment
Sports
Sports
Sports
I'm sorry,
Sci/Tech
Sports
Health
Entertainment
Sci/Tech (
Music Feed
Sci/Tech
Entertainment
Finance/Economy
Sci/Tech
Sci/Tech
Health
Entertainment
Sports
Entertainment
Sports
Health
Politics
Sci/Tech
Sports
Software and Development
Sports
Retail/Sales (
Sports
Music Feed
Sports
Sports
Entertainment
Sports
Health
Health
Sports
Sci/Tech
Health (due to
Automotive
Sci/Tech
Sci/Tech (
Sci/Tech
Sci/Tech
Business
Sci/Tech
Sci/Tech
Health
Sci/Tech
Sports
Sports
Sports
Health
Health
Sports
Entertainment
Politics
Business/Economy
Entertainment
Sports
Business
Sports
Sports
Health
Health
Sci/Tech
Sports
Entertainment
Weather/Natural Disaster
Entertainment
Health
Sports
Sports
Sci/Tech
Sports
Health
Entertainment
Sports
Sci/Tech
Politics
Health
Health
Sports
Business
Music Feed
Health
Finance/Business
Sports
Sci/Tech
Sports
Sci/Tech
Sci/Tech
Entertainment
Toons
Sports
Economics/Finance
Sci/Tech
Sci/Tech
Spor



Entertainment
Sports
Business/Finance
Sports
Health
None of the given
Sports
Health
Sci/Tech
Health (as the
Sci/Tech
Sci/Tech
Sports
Sci/Tech
Sci/Tech
Sci/Tech (
Sports
Sci/Tech
Sports
Sports




Charity/Fundra
Sports
Health
Sports
Sports
Health
Sports
Sports
Sci/Tech
Sports
Business/Finance
Sci/Tech
Sci/Tech
Sci/Tech
Sci/Tech
Natural Disasters/
Sci/Tech
Sci/Tech
Sci/Tech
Sci/Tech
Politics
Sci/Tech
Politics
Business/Economy
Health
Sports
Sci/Tech
Sports
Entertainment
Health
Economics/Business
Politics
Sci/Tech
Health
Health
Politics/Government
Sci/Tech
Sports
Sports
Business
Sports
Business
Sports
Sports
Sci/Tech
Sports
Sci/Tech
Politics
Sports
Sci/Tech
Politics/Economy
Business/Finance
Sports
Sports
Health
Sci/Tech
Entertainment
Sci/Tech
Sci/Tech
I'm sorry,
Sci/Tech
Finance/Economics
Sports
It is difficult to
Sports
Sports
Music Feed
Sports
Sci/Tech
Sci/Tech
Sports
Sci/Tech
Sports
Sports
Health
Sports
Sports
Health
Sci/Tech
Sports
Politics
Travel/Tourism
Business/Finance
Sports
Politics
Sci/Tech
Politics
Finance
Sports
Politics
Sci/Tech
Sci/Tech
Sci/Tech
Sports
Sci/Tech
Sports
Entertainment
Sports
Sci/Tech
Health
Sports
Sci/Tech
Legal/Politics
Sci/Tech
Sci/Tech
Health (as it
S



Entertainment
Health
Sports
Entertainment
Sci/Tech
Sports
Sports
Sci/Tech
Sci/Tech
Sports
Health
Health
Sci/Tech
Software and Developement
Health
Sci/Tech
Sci/Tech
Entertainment
Health
Sports
Sports
Politics
Health
Business/Finance
Health
Politics
Sports
Sports
Sci/Tech
Sci/Tech
Sci/Tech
Natural Disaster/Weather
Toons
Sports
Health
Politics/International Affairs
Sports
Sci/Tech
Sports
Sci/Tech
Sci/Tech (
Sci/Tech
Sports
Sci/Tech
Sports
Sci/Tech
Health
Sci/Tech
I'm sorry,
It is impossible to
Sci/Tech
Sports
Music Feed
Sports
Business
Health
Politics
Sci/Tech
Sports
Health
Health
Sports
Sci/Tech
Health
Entertainment
Sci/Tech
Legal/Politics
Entertainment
Sports
Sci/Tech
Sports
Sports
Business
Sports
Sports
Sports
Business
It is unclear what
I'm sorry,
Health
Sci/Tech
Sci/Tech
Sports




Sci/Tech
Sports
Business/Finance
Sports
Finance/Economy
Finance/Economy
Sci/Tech
Health
Sports
Sports
Sci/Tech
Sci/Tech
Sci/Tech
Sports
Weather/Natural Dis
Sci/Tech
Entertainment
Business
Toons
Politics
Politics
Politics
Sci/Tech
Sci/Tech
Sports
Sports
Entertainment
Sports
Entertainment
Sports
Software and Development
Sci/Tech
Sci/Tech
Entertainment
Travel/Transportation
Sports
Sci/Tech
Sci/Tech
Health
Sports
Sports
Sci/Tech
Business/Finance
Sci/Tech
Entertainment
Sci/Tech
Business
I'm sorry,
Sports
Sports
Sports
Software and Development
Sci/Tech
Sci/Tech
Sports
Sci/Tech
Entertainment
Health
Sci/Tech
Sci/Tech
Sports
Sports
Sci/Tech
Sports
Sports
Sports
Politics/World News
Sports
Sports
Sci/Tech
Sports
Weather/Natural Dis
Entertainment
Health
Sports
Entertainment
Sci/Tech
Health
Health
Toons
Toons and Music
Politics
Sports
Finance/Economy
Software and Developement
Health
Health
Sports
Sports
Entertainment
Sports
Health
Sci/Tech
Politics/Current Events
Sci/Tech
Business/Finance
Health
He

## Step 5: Report Results from previous two steps

In [126]:
# Report results

print("N-gram Models: ")
for train_size, result in models.items():
    print("Train size: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        train_size,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))


N-gram Models: 
Train size: 500  |  Accuracy: 0.303  |  F1 score: 0.16343520671051212 |  Num errors: 697
Train size: 1000  |  Accuracy: 0.216  |  F1 score: 0.10880678256758546 |  Num errors: 784
Train size: 2000  |  Accuracy: 0.329  |  F1 score: 0.21644958343160567 |  Num errors: 671
Train size: 5000  |  Accuracy: 0.352  |  F1 score: 0.26360265195289206 |  Num errors: 648
Train size: 10000  |  Accuracy: 0.281  |  F1 score: 0.2310255973993498 |  Num errors: 719
Train size: 25000  |  Accuracy: 0.368  |  F1 score: 0.3551272825603496 |  Num errors: 632


In [127]:
print("Pretrained Transformer Models: ")
for train_size, result in models_v2.items():
    print("Train size: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        train_size,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))

Pretrained Transformer Models: 
Train size: 500  |  Accuracy: 0.717  |  F1 score: 0.7083022505473252 |  Num errors: 283
Train size: 1000  |  Accuracy: 0.74  |  F1 score: 0.7321001798864292 |  Num errors: 260
Train size: 2000  |  Accuracy: 0.75  |  F1 score: 0.7413502143733617 |  Num errors: 250
Train size: 5000  |  Accuracy: 0.767  |  F1 score: 0.7599778036819289 |  Num errors: 233
Train size: 10000  |  Accuracy: 0.771  |  F1 score: 0.7635818154780631 |  Num errors: 229
Train size: 25000  |  Accuracy: 0.789  |  F1 score: 0.7812156072716253 |  Num errors: 211


In [128]:
print("Large Language Models: ")
for mode, result in models_v3.items():
    print("Mode: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        mode,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))

Large Language Models: 
Mode: zero-shot  |  Accuracy: 0.517  |  F1 score: 0.7812156072716253 |  Num errors: 483
Mode: few-shot  |  Accuracy: 0.534  |  F1 score: 0.5193196290625451 |  Num errors: 466


## Step 6: Data Augmentation [Optional]

In this section, we want to explore how to augment data efficiently to your existing training data. This is a very empirical exercise with a less well-defined playbook which means this section of the project is going to be open ended. Let us first understand what we mean by efficiency here, and why it matters:

### Performance Gain (G):
We will measure performance gain from data augmentation as the improvement in model accuracy (reduction in num. errors) on the Test dataset as defined above. 

### Budget (K):
We will measure "budget" as the number of additional rows augmentated to the original training dataset.  In this project, the universe of data from which you will select to add to your training set is Datasets['augment'] (and downstream X_augment, Y_augment).

This data is already labeled of course, but in most real-world scenarios the additional data is typically unlabeled. In order to augment it to your training data, you have to get it annotated which incurs some cost in time & money. This is the motivation to consider budget as a metric.

### Efficiency (E = G / K): 
Efficiency = Performance Gain (Reduction in num errors in test set) / Budget (Number of additional rows augmented to the training dataset)

We want to get the maximum gain in performance, while incurring minimum annotation cost.



We can always sample more data at random from the augmentation set, and this is probably the first thing to try. Can we be more intelligent with the data we choose to augment to the training dataset?

**Idea 1**: Look at the test errors that the current model is making. How can this help us guide our "data collection" for augmentation? One possible idea is to select examples from the augmentation dataset that are similar to these errors and add them to the training data. Similarity can be approximated in many ways:
1. [Jaccard distance between two texts](https://studymachinelearning.com/jaccard-similarity-text-similarity-metric-in-nlp/)
2. L2 distance between mean word vectors (we already compute these features for the entire dataset using WordVectorFeaturizer)
3. L2 distance between sentence transformer embedding (we already compute these features for the entire dataset using TransformerFeaturizer)
  

**Idea 2**: Compute model's predictions on the augmentation dataset, and include those examples to the training dataset that the model finds "hard" ? (a proxy for this would be to look at cases where the output score distribution across all labels has nearly identical scores for top two or three labels).

**Idea 3**: Look at the test errors that the current model is making, and the distribution of these errors across labels. Select examples from the augmentation dataset that belong to these classes - adding more training data for labels that the curent model does not do well on, can improve performance (assuming label quality is good)

In [None]:
# Examine current test errors
test_errors = []
Y_pred_i = models[25000]['test_predictions']

for idx, label in enumerate(Y_true):
    if label != Y_pred_i[idx]:
        test_errors.append((X_test[idx], label,  Y_pred_i[idx]))

print("Number of errors in the test set: {}".format(len(test_errors)))
print("Example errors: [example, true label, predicted label]")
for i in range(10):
    print(test_errors[i])

In [None]:
'''
[TO BE IMPLEMENTED]

Your additional data augmentation explorations go here

For instance, the pseudocode for Idea (1) might look like the following:

Augmented = {}
For e in test_errors:
   1. X_nn, y_nn = k nearest neighbors to (e) from X_augment, y_augment
   2. Add each (x, y) from (X_nn, y_nn) to Augmented

Add the Augmented examples to the training set
Train the new model and record performance improvements

'''