# Introduction

### Problem

In the project this week, we will build a machine learning text classifier to predict news categories from the news article text. 

1. We will iterate on classification models with increasing level of complexity and improved performance: N-gram models, pre-trained Transformer models, and third-party hosted Large Language Models (LLMs).

2. We will look at the impact of labeled dataset size and composition on model performance. The labeled dataset will be used for training in case of N-gram models and pre-trained Transformers, and for selecting examples for in-context few-shot learning for LLMs.

3. [advanced] As an extension, we will explore how to augment data efficiently to your existing training data (efficiency measured as improvement in performance normalized by volume of data augmented). 

Throughout the project there are suggested model architectures that we expect to work reasonably well for this problem. But if you wish to extend/modify any part of this pipeline, or explore new model architectures you should definitely feel free to do so.


## Step1: Prereqs & Installation

Download & Import all the necessary libraries we need throughout the project.

In [1]:
# Install all the required dependencies for the project

!pip install numpy
!pip install scikit-learn
!pip install sentence-transformers
!pip install matplotlib
!pip install langchain
!pip install openai
!pip install wandb
!pip install asyncio
!pip install nest_asyncio
!pip install tiktoken

You should consider upgrading via the '/Users/sjabbireddy/.pyenv/versions/3.10.0/bin/python3.10 -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/Users/sjabbireddy/.pyenv/versions/3.10.0/bin/python3.10 -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/Users/sjabbireddy/.pyenv/versions/3.10.0/bin/python3.10 -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/Users/sjabbireddy/.pyenv/versions/3.10.0/bin/python3.10 -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/Users/sjabbireddy/.pyenv/versions/3.10.0/bin/python3.10 -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/Users/sjabbireddy/.pyenv/versions/3.10.0/bin/python3.10 -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/Users/sjabbireddy/.pyenv/versions/3.10.0/bin/python3.10 -m pip install --upgrade pip' command.[0m
Collecting asyncio
  Downlo

In [None]:
import nest_asyncio

# Needed to run asyncio code in notebook easily
# Refer to https://stackoverflow.com/questions/55409641/asyncio-run-cannot-be-called-from-a-running-event-loop-when-using-jupyter-no
nest_asyncio.apply()

In [5]:
# Package imports that will be needed for this project

import os
import numpy as np
import json
from collections import Counter
from sklearn.metrics import accuracy_score, f1_score
from sentence_transformers import SentenceTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from pprint import pprint
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# [TO BE IMPLEMENTED] 
# Add any other imports needed below depending on the model architectures you are using. For e.g.
# from sklearn.linear_model import LogisticRegression

  from .autonotebook import tqdm as notebook_tqdm


In [41]:
with open('apikey.json') as f:
   apikey = json.load(f)

OPENAI_API_KEY = apikey['API_KEY_USER']

# Global Constants
LABEL_SET = [
    'Business',
    'Sci/Tech',
    'Software and Developement',
    'Entertainment',
    'Sports',
    'Health',
    'Toons',
    'Music Feeds'
]

WORD_VECTOR_MODEL = 'glove-wiki-gigaword-100'
SENTENCE_TRANSFORMER_MODEL = 'all-mpnet-base-v2'

TRAIN_SIZE_EVALS = [500, 1000, 10000, 25000]
EPS = 0.001
SEED = 0

np.random.seed(SEED)

## Step 2: Download & Load Datasets 

[AG News](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) is a collection of more than 1 million news articles gathered from more than 2000 news sources by an academic news search engine. The news topic classification dataset & benchmark was first used in [Character-level Convolutional Networks for Text Classification (NIPS 2015)](https://arxiv.org/abs/1509.01626). The dataset has the text description (summary) of the news article along with some metadata. **For this project, we will use a slightly modified (cleaned up) version of this dataset** 

Schema:
* Source - News publication source
* URL - URL of the news article
* Title - Title of the news article
* Description - Summary description of the news article
* Category (Label) - News category

Sample row in this dataset:
```
{
    'description': 'A capsule carrying solar material from the Genesis space '
                'probe has made a crash landing at a US Air Force training '
                'facility in the US state of Utah.',
    'id': 86273,
    'label': 'Entertainment',
    'source': 'Voice of America',
    'title': 'Capsule from Genesis Space Probe Crashes in Utah Desert',
    'url': 'http://www.sciencedaily.com/releases/2004/09/040908090621.htm'
 }
```




In [7]:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile

DIRECTORY_NAME = "data"
DOWNLOAD_URL = 'https://corise-mlops.s3.us-west-2.amazonaws.com/project1/agnews.zip'

def download_dataset():
    """
    Download the dataset. The zip contains three files: train.json, test.json and unlabeled.json 
    """
    http_response = urlopen(DOWNLOAD_URL)
    zipfile = ZipFile(BytesIO(http_response.read()))
    zipfile.extractall(path=DIRECTORY_NAME)

# Expensive operation so we should just do this once
download_dataset()

In [8]:
Datasets = {}

for ds in ['train', 'test_mini', 'augment']:
    with open('data/{}.json'.format(ds), 'r') as f:
        if ds == 'test_mini':
          Datasets['test'] = json.load(f)
        else:
          Datasets[ds] = json.load(f)
    print("Loaded Dataset {0}".format(ds))

print("\nExample train row:\n")
pprint(Datasets['train'][0])

print("\nExample test row:\n")
pprint(Datasets['test'][0])

Loaded Dataset train
Loaded Dataset test_mini
Loaded Dataset augment

Example train row:

{'description': 'A capsule carrying solar material from the Genesis space '
                'probe has made a crash landing at a US Air Force training '
                'facility in the US state of Utah.',
 'id': 86273,
 'label': 'Entertainment',
 'source': 'Voice of America',
 'title': 'Capsule from Genesis Space Probe Crashes in Utah Desert',
 'url': 'http://www.sciencedaily.com/releases/2004/09/040908090621.htm'}

Example test row:

{'description': "AP - Denny Neagle's contract was terminated by the Colorado "
                'Rockies on Monday, three days after the oft-injured pitcher '
                'was cited for solicitation.',
 'id': 116767,
 'label': 'Sports',
 'source': 'Yahoo Sports',
 'title': "Rockies Terminate Neagle's Contract (AP)",
 'url': 'http://us.rd.yahoo.com/dailynews/rss/sports/*http://story.news.yahoo.com/news?tmpl=story2 '
        'u=/ap/20041207/ap_on_sp_ba_ne/bbn_rocki

In [9]:
X_train, Y_train = [], []
X_test, Y_true = [], []
X_augment, Y_augment = [], []

for row in Datasets['train']:
    X_train.append(row['description'])
    Y_train.append(row['label'])

for row in Datasets['test']:
    X_test.append(row['description'])
    Y_true.append(row['label'])

for row in Datasets['augment']:
    X_augment.append(row['description'])
    Y_augment.append(row['label'])

## Step 3: [Modeling part 1] N-gram model


In [10]:
models = {}

for n in TRAIN_SIZE_EVALS:
    print("Evaluating for training data size = {}".format(n))
    X_train_i = X_train[:n]
    Y_train_i = Y_train[:n]

    """
    [TO BE IMPLEMENTED]
        
    Goal: initialized below is a dummy sklearn Pipeline object with no steps.
    You have to replace it with a pipeline object which contains at least two steps:
    (1) mapping the input document to an N-gram feature extractor. You can use feature extractors
        provided by sklearn out of the box (e.g. CountVectorizer, TfidfTransformer)
    (2) a classifier that predicts the class label using the feature output of first step

    You can add other steps to preproces, post-process your data as you see fit. 
    You can also try any sklearn model architecture you want, but a linear classifier
    will do just fine to start with

    e.g. 
    pipeline = Pipeline([
        ('featurizer', <your WordVectorFeaturizer class instance here>),
        ('classifier', <your sklearn classifier class instance here>)
    ])

    Reference: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
    """
    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', LogisticRegression(max_iter=1000, C=0.1))
    ])
    
    # train
    pipeline.fit(X_train_i, Y_train_i)
    # predict
    Y_pred_i = pipeline.predict(X_test)
    # record results
    models[n] = {
        'pipeline': pipeline,
        'test_predictions': Y_pred_i,
        'accuracy': accuracy_score(Y_true, Y_pred_i),
        'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
        'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
    }
    print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))

Evaluating for training data size = 500
Accuracy on test set: 0.339
Evaluating for training data size = 1000
Accuracy on test set: 0.509
Evaluating for training data size = 10000
Accuracy on test set: 0.647
Evaluating for training data size = 25000
Accuracy on test set: 0.696


Side-Note: 
Recieved following error for training data size = 10000 and above 
/Users/sjabbireddy/.pyenv/versions/3.10.0/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

The error message says that the LogisticRegression object failed to converge after a certain number of iterations. This can happen for a number of reasons, including:
* The data is not linearly separable.
* The data is too noisy.
* The regularization parameter is too high.

To fix this error, following are the options:

* Increase the number of iterations.
* Use a different regularization parameter.
* Use a different algorithm, such as a decision tree or a random forest.

Here are some specific changes you can make:

* Change the max_iter parameter of the LogisticRegression object to a higher value.
* Change the C parameter of the LogisticRegression object to a lower value.
* Use a different algorithm, such as a decision tree or a random forest.


## Step 4: [Modeling part 2] Pretrained Transformer model

In [11]:
# Initialize the pretrained transformer model
sentence_transformer_model = SentenceTransformer(
    'sentence-transformers/{model}'.format(model=SENTENCE_TRANSFORMER_MODEL))

# Sanity check
example_encoding = sentence_transformer_model.encode(
    "This is an example sentence",
    normalize_embeddings=True
)

print('example_encoding:\n', example_encoding)


example_encoding:
 [ 2.25026626e-02 -7.82916993e-02 -2.30307151e-02 -5.10002859e-03
 -8.03404301e-02  3.91322300e-02  1.13428347e-02  3.46491998e-03
 -2.94575840e-02 -1.88930500e-02  9.47433040e-02  2.92748380e-02
  3.94859649e-02 -4.63165380e-02  2.54245698e-02 -3.21999528e-02
  6.21928051e-02  1.55591769e-02 -4.67794649e-02  5.03901392e-02
  1.46114370e-02  2.31413841e-02  1.22067872e-02  2.50696782e-02
  2.93659070e-03 -4.19822037e-02 -4.01036115e-03 -2.27843486e-02
 -7.68594025e-03 -3.31090614e-02  3.22118476e-02 -2.09993217e-02
  1.16730919e-02 -9.85073894e-02  1.77932668e-06 -2.29932163e-02
 -1.31141404e-02 -2.80222669e-02 -6.99970424e-02  2.59314720e-02
 -2.89502330e-02  8.76336992e-02 -1.20919039e-02  3.98605354e-02
 -3.31381746e-02  3.59109193e-02  3.46098952e-02  6.49784356e-02
 -3.00816000e-02  6.98187649e-02 -3.99512099e-03 -1.01595640e-03
 -3.50185521e-02 -4.36566696e-02  5.08026257e-02  4.68758643e-02
  5.39663546e-02 -4.03008834e-02  3.20150354e-03  1.36618102e-02
  3.82

In [12]:
class TransformerFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, dim, sentence_transformer_model):
        self.dim = dim
        self.sentence_transformer_model = sentence_transformer_model
        # you can add any other params to be passed to the constructor here

    #estimator. Since we don't have to learn anything in the featurizer, this is a no-op
    def fit(self, X, y=None):
        return self

    #transformation: return the encoding of the document as returned by the transformer model 
    def transform(self, X, y=None):
        X_t = []
        """
        [TO BE IMPLEMENTED]
        
        Goal: TransformerFeaturizer's transform() method converts the raw text document
        into a feature vector to be passed as input to the classifier.
            
        Given below is a dummy implementation that always maps it to a zero vector.
        You have to implement this function so it uses computes a document embedding
        of the input document using self.sentence_transformer_model. 
        This will be our feature representation of the document
        """
        # for doc in X:
        #     # TODO: replace this dummy implementation
        #     X_t.append(np.zeros(self.dim))
        # return X_t
        return self.sentence_transformer_model.encode(X, normalize_embeddings=True, batch_size=128)

In [13]:
models_v2 = {}
for n in TRAIN_SIZE_EVALS:
    print("Evaluating for training data size = {}".format(n))
    X_train_i = X_train[:n]
    Y_train_i = Y_train[:n]

    """
    [TO BE IMPLEMENTED]
        
    Goal: initialized below is a dummy sklearn Pipeline object with no steps.
    You have to replace it with a pipeline object which contains at least two steps:
    (1) mapping the input document to a feature vector (using TransformerFeaturizer)
    (2) a classifier that predicts the class label using the feature output of first step

    You can add other steps to preproces, post-process your data as you see fit. 
    You can also try any sklearn model architecture you want, but a linear classifier
    will do just fine to start with

    e.g. 
    pipeline = Pipeline([
        ('featurizer', <your TransformerFeaturizer class instance here>),
        ('classifier', <your sklearn classifier class instance here>)
    ])
    """
    pipeline =  Pipeline([
        ('vect', TransformerFeaturizer(dim=1024, sentence_transformer_model=sentence_transformer_model)),
        ('clf', LogisticRegression(max_iter=1000, C=0.1))
    ])

    # train
    pipeline.fit(X_train_i, Y_train_i)
    # predict
    Y_pred_i = pipeline.predict(X_test)
    # record results
    models_v2[n] = {
        'pipeline': pipeline,
        'test_predictions': Y_pred_i,
        'accuracy': accuracy_score(Y_true, Y_pred_i),
        'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
        'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
    }
    print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))


Evaluating for training data size = 500
Accuracy on test set: 0.672
Evaluating for training data size = 1000
Accuracy on test set: 0.681
Evaluating for training data size = 10000
Accuracy on test set: 0.762
Evaluating for training data size = 25000
Accuracy on test set: 0.775


## Step 5: [Modeling part 3] Large Language Models

In [44]:
# Here's a couple of code snippets to help you familiarize with how to generate labels with LLMs using langchain,

from langchain.chat_models import ChatOpenAI
from langchain.schema import LLMResult, HumanMessage, Generation
import tiktoken

MODEL_NAME = "gpt-3.5-turbo" 
encoding = tiktoken.encoding_for_model(MODEL_NAME)

def num_tokens_from_string(string, encoding_name = encoding) -> int:
    """Returns the number of tokens in a text string."""
    num_tokens = len(encoding.encode(string))
    return num_tokens

MAX_TOKENS = max(num_tokens_from_string(label) for label in LABEL_SET)

llm = ChatOpenAI(
    model_name=MODEL_NAME,
    max_tokens=MAX_TOKENS,
    temperature=0.0,
    # It's better to do this an environment variable but putting it in plain text for clarity
    openai_api_key = OPENAI_API_KEY
)

In [48]:
from sklearn.base import BaseEstimator, ClassifierMixin
from langchain.callbacks import get_openai_callback
import asyncio
from asyncio import Semaphore

class LLMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, llm_model, prompt_template, semaphore):
        self.llm_model = llm_model
        self.prompt_template = prompt_template
        self.semaphore = semaphore

    # This will be called during the training step
    def fit(self, X, y):
        return self

    # This will be called during inference.
    def predict(self, X):
        """
        [TO BE IMPLEMENTED]

        Goal: LLMClassifier's predict() method constructs the final prompt input
        for the LLM for each x in X, using the prompt template.

        You have to implement this function so it does the following:
        1. Construct the final prompt for the LLM
        2. Call `self.llm_model` to generate the completion (label) for the prompt
        3. Do any post-processing/response parsing to fetch the label from the LLM response
        """
        with get_openai_callback() as cb:
          prompts = [
              [[HumanMessage(content=self.prompt_template.format(article=document))]]
              for document in X
          ]
          result = asyncio.run(self.__generate_labels(prompts))
          print(f"Total Tokens: {cb.total_tokens}")
          print(f"Prompt Tokens: {cb.prompt_tokens}")
          print(f"Completion Tokens: {cb.completion_tokens}")
          print(f"Total Cost (USD): ${cb.total_cost}")
          print(result)
          return result

    async def __async_generate(self, prompt):
        async with self.semaphore:
          response = await self.llm_model.agenerate(prompt)
          label = response.generations[0][0].text
        #   print(label)
          return label

    async def __generate_labels(self, prompts):
        tasks = [self.__async_generate(prompt) for prompt in prompts]
        labels = await asyncio.gather(*tasks)
        return labels

Pass in the semaphore to the pipeline, the semaphore will control the maximum number of concurrent requests to run in one go, this will prevent a rate limit error.

In [35]:
news_article_zero_shot_template = """
You are an expert at categorizing the topics of different articles. 
Your job is to receive information about a news article, including title and summary and categorize the topic
The possible topic categories are: 'Business', 'Sci/Tech', 'Software and Developement', 'Entertainment', 'Sports', 'Health', 'Toons' and 'Music Feeds'.

I want you to label the following example: 
article: {article}
Category:
"""

LangChain has a neat function called get_openai_callback() which allows us to view metadata about our usage such as:


In [36]:
get_openai_callback()

<contextlib._GeneratorContextManager at 0x2fa3ab460>

In [14]:
# Here's a couple of code snippets to help you familiarize with how to generate labels with LLMs using langchain,

from langchain.chat_models import ChatOpenAI
from langchain.schema import LLMResult, HumanMessage, Generation

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    # model_name='ada',
    max_tokens=1000,
    temperature=0.0,
    request_timeout=120,
    # It's better to do this an environment variable but putting it in plain text for clarity
    openai_api_key = OPEN_API_KEY
)

### Example Prompts

In [15]:

zero_shot_prompt_template = """
You are an expert at judging the sentiment of tweets. 
Your job is to categorize the sentiment of a given tweet into one of three categories: Positive, Negative, Neutral.

Tweet: {tweet}
Sentiment:
"""

prompt = zero_shot_prompt_template.format(
    tweet="Yesss! I love machine learning"
)

result = llm.generate([[HumanMessage(content=prompt)]])
print(result.generations[0][0])


text='Positive' generation_info=None message=AIMessage(content='Positive', additional_kwargs={}, example=False)


In [16]:

few_shot_prompt_template = """
You are an expert at judging the sentiment of tweets. 
Your job is to categorize the sentiment of a given tweet into one of three categories: Positive, Negative, Neutral.

Some example tweets along with the correct sentiment are shown below.

Tweet: Another big happy 18th birthday to my partner in crime. I love u very much!
Sentiment: Positive

Tweet: The more I use this application, the more I dislike it. It's slow and full of bugs.
Sentiment: Negative

Tweet: #Dreamforce Returns to San Francisco for 20th Anniversary. Learn more: http://bit.ly/3AgwO0H
Sentiment: Neutral

Now I want you to label the following example: 
Tweet: {tweet}
Sentiment:
"""

prompt = few_shot_prompt_template.format(
    tweet="I like chocolate"
)

result = llm.generate([[HumanMessage(content=prompt)]])
print(result.generations[0][0])


AuthenticationError: <empty message>

### Actual Pipeline Implementation

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin


class LLMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, llm_model, prompt_template):
        self.llm_model = llm_model
        self.prompt_template = prompt_template

    #This will be called during the training step
    def fit(self, X, y):
        return self

    #This will be called during inference.
    def predict(self, X):
        """
        [TO BE IMPLEMENTED]
        
        Goal: LLMClassifier's predict() method constructs the final prompt input
        for the LLM for each x in X, using the prompt template.

        You have to implement this function so it does the following:
        1. Construct the final prompt for the LLM
        2. Call `self.llm_model` to generate the completion (label) for the prompt
        3. Do any post-processing/response parsing to fetch the label from the LLM response
        """
        predictions = []
        for x in X:
            # Construct the final prompt for the LLM
            prompt = self.prompt_template.format(article = x)

            # Call `self.llm_model` to generate the completion (label) for the prompt
            label = self.llm_model.generate([[HumanMessage(content=prompt)]]).generations[0][0].text

            # Do any post-processing/response parsing to fetch the label from the LLM response
            predictions.append(label)

        return predictions
        # pass


In [46]:
# Zero-shot classification pipeline with LLMs
semaphore = Semaphore(2)

pipeline = Pipeline(
    [
      ('LLMClassifier', LLMClassifier(llm_model=llm, prompt_template=news_article_zero_shot_template, semaphore=semaphore)) 
    ]
)


models_v3 = {}

"""
[TO BE IMPLEMENTED]
        
Goal: initialized below is a dummy sklearn Pipeline object with no steps.
You have to replace it with a pipeline object which uses the `LLMClassifier` you have implemented 
above to perform zero-shot classification on the test set.

You can add other steps to preproces, post-process your data as you see fit. 

"""
# pipeline = Pipeline([('LLMClassifier', LLMClassifier(llm, news_article_zero_shot_template, semaphore=semaphore))])

# train
pipeline.fit(X_train_i, Y_train_i)
# predict
Y_pred_i = pipeline.predict(X_test)
# record results
models_v3["zero-shot"] = {
    'test_predictions': Y_pred_i,
    'accuracy': accuracy_score(Y_true, Y_pred_i),
    'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
    'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
}
print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))

Sports
Sci/Tech
Sports
Business
Sports
Sports
Business
Politics/Government
Sports
Sports
Health
Entertainment
Business
Business
Sports
Business
Business
Health
Business
Sci/Tech
Sci/Tech
Sports
Business
Military/Politics
Politics/Current Events
Sports
Politics/Military
Politics/Current Events
Sports
Entertainment
Business
Sports
Business
Crime/Justice
Music Feeds
Business
Sports
Sports
Health
Health
Health
Music Feeds
Software and Development
Business
Sci/Tech
Business
Business
Sorry, I cannot
Business
Sports
Sports
Business
Sci/Tech
Software and Development
Sports
Business
Sports
Business
Politics/International Affairs
Business
Health
Sports
Business
Business
Entertainment
Business
Health
Entertainment
Sci/Tech
Sci/Tech
Sci/Tech
Software and Development
Sports
Health
Sci/Tech
Health
Entertainment
Health
Sports
Politics
Sports
Health
Sports
Sci/Tech
Sports
Sports
Sports
Entertainment
Health
Software and Development
Sports
Business
Sports
Politics
War/Conflict
Sports
Sports
Politics/Ele

Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID c063275870c439a8be46d742fd327cca in your message.).


Health
Animals/Pets
Entertainment
Sports
Sci/Tech
Entertainment
Sports
Sports
Sports
Business
Sports
Weather/Natural Dis
Sci/Tech
Business
Sports
Sci/Tech
Politics/International Relations
Business
Sports
Politics
Business
Business
Business
Business
Sports
Software and Development
Software and Development
Politics/Law
Business
Business
Music Feeds
Sci/Tech
Business
Sci/Tech
Sports
Business
Sports
Politics
Business
Sports
Sports
Sports
Business
Sci/Tech
Sci/Tech
Sports
Entertainment
Sports
Business
Sports
Business
Music Feeds
Sci/Tech
Sports
Health
Sports
Entertainment
Politics/Current Events
Business
Business
Sports
Sports
Software and Development
Entertainment
Politics/Current Events
Health
Sports
Sports
Sports
Toons
Entertainment
Sports
Sci/Tech
Entertainment
Sports
Health
Sports
Software and Development
Sports
Sci/Tech
Sports
Sports
Sci/Tech
Sports
Business
Health
Sports
Entertainment
Sports
Entertainment
Entertainment
Sci/Tech
Health
Sci/Tech
Business
Health
Sports
Terrorism/
Busine

Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 5436716917e8caf93fa676c7377ea629 in your message.).


Health
Entertainment
Sci/Tech
Sports
Business
Sports
Business
Health
Sci/Tech
Politics
Sports
Business
Entertainment
Politics/Law
Music Feeds
Business
Business
Sports
Business
Sports
Politics/International Affairs
Sports
Sports
Business
Sports
Entertainment
Politics
I'm sorry,
Health
Health
Sports
Sports
Sci/Tech
Entertainment
Sci/Tech
I'm sorry,
Politics/Conflict
Sports
Sports
Sports
Business
Sci/Tech
Sports
Health
Business
Business
Software and Development
Sci/Tech
Sports
Sports
Business
Toons
Business
Sports
Business
Entertainment
Entertainment
Business
Politics
Sports
Business
Health (as the
Health
Sci/Tech
Politics/LGBTQ
Sports
Sports
Sci/Tech
Politics
Business
International Affairs/Politics
Sports
Sports
Sports
Politics/Law
Sports
Politics
Software and Development
Health (as the
Sports
Sports
Sports
Business
Sports
Sports
Sports
Business
Sports
Health
Conflict/Politics
Sports
Health
'Business' (
'News/Current
Sci/Tech
Music Feeds
Sports
Sci/Tech
Software and Development
Sports
Sp

Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 0ebf319ef075bba0f4c0f579dc5ef97e in your message.).


Software and Development
Business
Business
Business
Politics/International Relations
Sports
Politics
Business
Health
Sports
Sports
Sports
Entertainment
Health
Business
Business
Sci/Tech
Health
Health
Politics/Elections
Travel/Transportation
Sports
Sports
Sports
Business
Business
Sports
Software and Development
Sports
Sports
Sci/Tech
Politics/International Affairs
Sports
Business
Business
Business
Sports
Sports
Business
Politics/Law
Business
Entertainment
Business
I'm sorry,
Business
Sci/Tech
Sports
History/Current Events
Sports
Sports
Entertainment
Sports
Sci/Tech
Business
Sports
Business
Sports
Sports
Health
Sports
Sports
None of the given
Sci/Tech
Sports
Politics
Travel/Tourism
Business
Sports
Politics
Business
Politics
Sports
Business
Politics/International Relations
Business
Sci/Tech
Software and Development
Business
Sports
Sports
Entertainment
Business
Sports
Health
Sports
Sci/Tech
Business
Business
Business
Business
Business
Entertainment
Business
Business
Sports
Sports
Sci/Tech


Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised APIError: Bad gateway. {"error":{"code":502,"message":"Bad gateway.","param":null,"type":"cf_bad_gateway"}} 502 {'error': {'code': 502, 'message': 'Bad gateway.', 'param': None, 'type': 'cf_bad_gateway'}} <CIMultiDictProxy('Date': 'Sat, 10 Jun 2023 04:10:54 GMT', 'Content-Type': 'application/json', 'Content-Length': '84', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'same-origin', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Server': 'cloudflare', 'CF-RAY': '7d4ebcd71ae1e843-DFW', 'alt-svc': 'h3=":443"; ma=86400')>.


Sci/Tech
Sports
Business
'News/Current
Sci/Tech
Entertainment
Sports
Sci/Tech
Sports
Business
Business
Health
I'm sorry,
Sorry, I cannot
Sports
Sci/Tech
Music Feeds
Business
Business
Sports
Business
Politics/Elections
Health
Sports
Sports
Business
Health
Entertainment
Sci/Tech
Business
Entertainment
Sports
'Sci/Tech
Sorry, I cannot
Sports
Business
Sports
Sports
Sports
Business
Software and Development
I'm sorry,
Health
Sci/Tech


Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 6210a50713a2c8b9f3a2b87bddce6528 in your message.).


Health
Sports
Sports (incorrect categor
Sports
Business
Sports
Software and Development
Business
Health
Sports
Sports
Sci/Tech
Business
Environment/Climate Change
Sports
Weather/Natural Dis


Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 3381a6747a2c54e46bbdeb78df6da2be in your message.).


Software and Development
Entertainment
Software and Development
Business
Business
Politics
Politics/Military


Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID c580a66e8a4f5be6aabfdd13f87555da in your message.).


Business
Business
Sports


Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 92f94ef313706167cab4b043ec6601ca in your message.).


Sports
Entertainment
Sports
Entertainment
Business
Business
Sports
Business
Business
Sports
Sports
Sports
Politics/Terrorism
Business
Health
Sports
Business
Business
Sports
Business
Entertainment
Business
Business
Sports
Sports
Sports
Software and Development
Sci/Tech
Business
Sports
Sci/Tech
Entertainment
Health
Sci/Tech
Business
Sports
Business
Sports
Sports
Sports
Sports
Politics/International Affairs
Sports
Sports
Business
Sports
Weather/Natural Dis
Sports
None of the given
Sports
Entertainment
Business
Sci/Tech
Business
Health
Toons
Sports
Politics/Foreign Policy
Business
Software and Development
Health
'Politics/Current
Sports
Sports
Business
Sports
Health
Sci/Tech
'Politics/Current
Sci/Tech
Business
Health
Business
Health
Business
Health
Sports
Business
Sports
Sports
Sci/Tech
Business
Business
Energy/Environment
Politics/Current Affairs
I'm sorry,
Business
Sports
I'm sorry,
Sports
Health
Business
Sports


Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 5bf1154a84c09cd946c629bc71aa52f2 in your message.).


Sports
Sports
Software and Development
I'm sorry,
Sports
Sports
Weather/Natural Disaster
Sports
Sports
Sci/Tech (
Software and Development
Sports
Music Feeds
Business
Military/Politics
Health
Entertainment
Business
Politics/Human Rights
Politics
Sports
Sports
Music Feeds
Business
Sci/Tech
Sports
Entertainment
Software and Development
Entertainment
Politics
Health
Health
I'm sorry,
Sports
'Sci/Tech
Health
Sci/Tech
Sports
Health (due to
Business
Business
Sports
Sci/Tech
Sports
Business
Business
Sports
Business
Sports
Politics/Government
Sports
Health
Sci/Tech
Business
'News' (
Sci/Tech
Sports
Entertainment
Sports
Health (Note:
Sports
Entertainment
Business
Sci/Tech
Business
Entertainment
Business
Sports
Business
Sci/Tech
Business
Entertainment
Business
Sci/Tech
Sports
Entertainment
Business
Software and Development
Business
Politics/International Affairs
Business
Sports
'Politics/International
News/Current Events
Business
Health
Sports
Business
Business
Business
Health
Sports
Sports
Soft

In [47]:
news_category_few_shot_template = """
You are an expert at categorizing the topics of different articles. 
You receive info about a news article and have to categorize the topic.
The possible topic categories are: 'Business', 'Sci/Tech', 'Software and Developement', 'Entertainment', 'Sports', 'Health', 'Toons' and 'Music Feeds'.

Some examples below.

Article information: EU to Rule Tuesday on Oracle's Bid for PeopleSoft. European Union regulators will decide Tuesday whether Oracle Corporation hostile $7.7 billion bid for rival business software concern PeopleSoft Inc. can proceed, the EU's antitrust chief said Friday.
Category: Sci/Tech

Article information: Capsule from Genesis Space Probe Crashes in Utah Desert. A capsule carrying solar material from the Genesis space probe has made a crash landing at a US Air Force training facility in the US state of Utah.
Category: Entertainment


I want you to label the following example: 
article: {article}
Category:
"""

In [49]:
# Few-shot classification with LLMs

"""
[TO BE IMPLEMENTED]
        
Goal: initialized below is a dummy sklearn Pipeline object with no steps.
You have to replace it with a pipeline object which uses the `LLMClassifier` you have implemented 
above to perform few-shot classification on the test set.

With few-shot classification, you can pass upto 5 demonstration examples as part of the prompt 
to the LLM. You can add other steps to preproces, post-process your data as you see fit. 

"""
semaphore = Semaphore(2)

pipeline = Pipeline(
    [
      ('LLMClassifier', LLMClassifier(llm_model=llm, prompt_template=news_article_zero_shot_template, semaphore=semaphore)) 
    ]
)

# train
pipeline.fit(X_train_i, Y_train_i)
# predict
Y_pred_i = pipeline.predict(X_test)
# record results
models_v3["few-shot"] = {
    'test_predictions': Y_pred_i,
    'accuracy': accuracy_score(Y_true, Y_pred_i),
    'f1': f1_score(Y_true, Y_pred_i, average='weighted'),
    'errors': sum([x != y for (x, y) in zip(Y_true, Y_pred_i)])
}
print("Accuracy on test set: {}".format(accuracy_score(Y_true, Y_pred_i)))


Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID ef8a75696393932763ecfb7f5af3f268 in your message.).
Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 7ef575cbe2b02dce4dac2c7b41cc68fa in your message.).
Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact 

Total Tokens: 139273
Prompt Tokens: 137599
Completion Tokens: 1674
Total Cost (USD): $0.2785459999999996
['Sports', 'Sci/Tech', 'Sports', 'Business', 'Sports', 'Sports', 'Business', 'Sports', 'Politics/Government', 'Sports', 'Health', 'Entertainment', 'Business', 'Sports', 'Business', 'Business', 'Business', 'Business', 'Health', 'Sci/Tech', 'Sci/Tech', 'Business', 'Sports', 'Military/Politics', 'Politics/Current Events', 'Sports', 'Politics/Military', 'Politics/Current Events', 'Entertainment', 'Sports', 'Sports', 'Business', 'Business', 'Crime/Justice', 'Music Feeds', 'Business', 'Sports', 'Sports', 'Health', 'Health', 'Health', 'Software and Development', 'Music Feeds', 'Business', 'Sci/Tech', 'Business', 'Sorry, I cannot', 'Business', 'Business', 'Sports', 'Sci/Tech', 'Sports', 'Business', 'Sci/Tech', 'Software and Development', 'Sports', 'Business', 'Sports', 'Business', 'Politics/International Affairs', 'Business', 'Health', 'Sports', 'Business', 'Business', 'Entertainment', 'Bus

## Step 5: Report Results from previous two steps

In [50]:
models.items()

dict_items([(500, {'pipeline': Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(C=0.1, max_iter=1000))]), 'test_predictions': array(['Entertainment', 'Entertainment', 'Entertainment', 'Entertainment',
       'Entertainment', 'Entertainment', 'Entertainment', 'Entertainment',
       'Entertainment', 'Entertainment', 'Entertainment', 'Entertainment',
       'Entertainment', 'Entertainment', 'Entertainment', 'Business',
       'Entertainment', 'Entertainment', 'Entertainment', 'Entertainment',
       'Entertainment', 'Business', 'Entertainment', 'Entertainment',
       'Entertainment', 'Entertainment', 'Entertainment', 'Entertainment',
       'Entertainment', 'Sports', 'Entertainment', 'Entertainment',
       'Entertainment', 'Entertainment', 'Entertainment', 'Entertainment',
       'Entertainment', 'Sports', 'Entertainment', 'Entertainment',
       'Entertainment', 'Entertainment', 'Entertainment', 'Entertainment',
   

In [51]:
# Report results

print("N-gram Models: ")
for train_size, result in models.items():
    print("Train size: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        train_size,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))


N-gram Models: 
Train size: 500  |  Accuracy: 0.339  |  F1 score: 0.22849242844534107 |  Num errors: 661
Train size: 1000  |  Accuracy: 0.509  |  F1 score: 0.4454164885922577 |  Num errors: 491
Train size: 10000  |  Accuracy: 0.647  |  F1 score: 0.6220611335342483 |  Num errors: 353
Train size: 25000  |  Accuracy: 0.696  |  F1 score: 0.6869693176074849 |  Num errors: 304


In [52]:
print("Pretrained Transformer Models: ")
for train_size, result in models_v2.items():
    print("Train size: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        train_size,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))

Pretrained Transformer Models: 
Train size: 500  |  Accuracy: 0.672  |  F1 score: 0.6341188063142182 |  Num errors: 328
Train size: 1000  |  Accuracy: 0.681  |  F1 score: 0.6440309626741572 |  Num errors: 319
Train size: 10000  |  Accuracy: 0.762  |  F1 score: 0.7535683408162516 |  Num errors: 238
Train size: 25000  |  Accuracy: 0.775  |  F1 score: 0.7661563476965911 |  Num errors: 225


In [53]:
print("Large Language Models: ")
for mode, result in models_v3.items():
    print("Mode: {0}  |  Accuracy: {1}  |  F1 score: {2} |  Num errors: {3}".format(
        mode,
        result['accuracy'],
        result['f1'],
        result['errors']
    ))

Large Language Models: 
Mode: zero-shot  |  Accuracy: 0.649  |  F1 score: 0.6558476695041018 |  Num errors: 351
Mode: few-shot  |  Accuracy: 0.649  |  F1 score: 0.6560318803610582 |  Num errors: 351


## Step 6: Data Augmentation [Optional]

In this section, we want to explore how to augment data efficiently to your existing training data. This is a very empirical exercise with a less well-defined playbook which means this section of the project is going to be open ended. Let us first understand what we mean by efficiency here, and why it matters:

### Performance Gain (G):
We will measure performance gain from data augmentation as the improvement in model accuracy (reduction in num. errors) on the Test dataset as defined above. 

### Budget (K):
We will measure "budget" as the number of additional rows augmentated to the original training dataset.  In this project, the universe of data from which you will select to add to your training set is Datasets['augment'] (and downstream X_augment, Y_augment).

This data is already labeled of course, but in most real-world scenarios the additional data is typically unlabeled. In order to augment it to your training data, you have to get it annotated which incurs some cost in time & money. This is the motivation to consider budget as a metric.

### Efficiency (E = G / K): 
Efficiency = Performance Gain (Reduction in num errors in test set) / Budget (Number of additional rows augmented to the training dataset)

We want to get the maximum gain in performance, while incurring minimum annotation cost.



We can always sample more data at random from the augmentation set, and this is probably the first thing to try. Can we be more intelligent with the data we choose to augment to the training dataset?

**Idea 1**: Look at the test errors that the current model is making. How can this help us guide our "data collection" for augmentation? One possible idea is to select examples from the augmentation dataset that are similar to these errors and add them to the training data. Similarity can be approximated in many ways:
1. [Jaccard distance between two texts](https://studymachinelearning.com/jaccard-similarity-text-similarity-metric-in-nlp/)
2. L2 distance between mean word vectors (we already compute these features for the entire dataset using WordVectorFeaturizer)
3. L2 distance between sentence transformer embedding (we already compute these features for the entire dataset using TransformerFeaturizer)
  

**Idea 2**: Compute model's predictions on the augmentation dataset, and include those examples to the training dataset that the model finds "hard" ? (a proxy for this would be to look at cases where the output score distribution across all labels has nearly identical scores for top two or three labels).

**Idea 3**: Look at the test errors that the current model is making, and the distribution of these errors across labels. Select examples from the augmentation dataset that belong to these classes - adding more training data for labels that the curent model does not do well on, can improve performance (assuming label quality is good)

In [54]:
# Examine current test errors
test_errors = []
Y_pred_i = models[25000]['test_predictions']

for idx, label in enumerate(Y_true):
    if label != Y_pred_i[idx]:
        test_errors.append((X_test[idx], label,  Y_pred_i[idx]))

print("Number of errors in the test set: {}".format(len(test_errors)))
print("Example errors: [example, true label, predicted label]")
for i in range(10):
    print(test_errors[i])

Number of errors in the test set: 304
Example errors: [example, true label, predicted label]
('The rush by Wal-Mart and other companies to put radio frequency identification devices in their goods could imperil consumer privacy.', 'Software and Developement', 'Sci/Tech')
('Intel said Thursday that President Paul Otellini will become its next CEO -- a change that could help the No. 1 chipmaker overcome recent missteps.', 'Entertainment', 'Sci/Tech')
('Thousands of Douglas County residents at high risk for influenza may be unable to get flu shots this year because of a shortage of vaccine.', 'Entertainment', 'Health')
('Jeff Greenberg leaves post at No. 1 insurance broker not long after Spitzer targeted firm: WSJ. NEW YORK (CNN/Money) - Marsh  amp; McLennan #39;s embattled Chief Executive Officer Jeffrey Greenberg stepped down as', 'Entertainment', 'Business')
('How the toy industry is being outplayed by video games this holiday season', 'Business', 'Sci/Tech')
('They say antioxidants in

In [None]:
'''
[TO BE IMPLEMENTED]

Your additional data augmentation explorations go here

For instance, the pseudocode for Idea (1) might look like the following:

Augmented = {}
For e in test_errors:
   1. X_nn, y_nn = k nearest neighbors to (e) from X_augment, y_augment
   2. Add each (x, y) from (X_nn, y_nn) to Augmented

Add the Augmented examples to the training set
Train the new model and record performance improvements

'''

In [57]:
test_errors[1]

('Intel said Thursday that President Paul Otellini will become its next CEO -- a change that could help the No. 1 chipmaker overcome recent missteps.',
 'Entertainment',
 'Sci/Tech')

In [None]:
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.neighbors import NearestNeighbors

# Initialize the augmented dataset
Augmented = []
for idx, label in enumerate(Y_true)
X_test[idx], label,  Y_pred_i[idx]

neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(test_errors)

# For each test error
for e in test_errors:

    # Find the k nearest neighbors of e in the augmentation dataset
    
    X_nn, y_nn = kneighbors(X=, n_neighbors=None, return_distance=True)
    X_nn, y_nn = KNeighborsClasfier(e, X_test[idx], Y_pred_i[idx])

    # Add each of the k nearest neighbors to the augmented dataset
    Augmented.extend(X_nn)

# Add the augmented dataset to the training set
X_train.extend(Augmented)
y_train.extend(y_nn)

# Train a new model on the augmented training set
model = train_model(X_train, y_train)

# Evaluate the new model on the test set
evaluate_model(model, X_test, y_test)

In [55]:
sentence_transformer_model.encode(X_augment, normalize_embeddings=True, batch_size=128)

KeyboardInterrupt: 