# Monitoring Text Embeddings with the 20 Newsgroups Dataset

We will divide this example in two stages: Pre-deployment Stage and Production Stage.

In the __Pre-deployment Stage__ we will:
- train a classifier
- calculate the centroids for each topic cluster

In the __Production Stage__ we will:
- load daily batches of data
- vectorize the data
- predict the topic for each document
- log:
    - embeddings distance to the centroids
    - tokens list for each document
    - predictions and targets

In the Production Stage, we will introduce documents in another language (Spanish) to see how the model behaves, and how we can monitor this with WhyLabs.

## Installing Dependencies

In [1]:
# !pip install whylogs[whylabs] scikit-learn==1.0.2

## ✔️ Setting the Environment Variables


In [2]:
# import getpass
# import os

# # set your org-id here - should be something like "org-xxxx"
# print("Enter your WhyLabs Org ID") 
# os.environ["WHYLABS_DEFAULT_ORG_ID"] = input()

# # set your datased_id (or model_id) here - should be something like "model-xxxx"
# print("Enter your WhyLabs Dataset ID")
# os.environ["WHYLABS_DEFAULT_DATASET_ID"] = input()


# # set your API key here
# print("Enter your WhyLabs API key")
# os.environ["WHYLABS_API_KEY"] = getpass.getpass()
# print("Using API Key ID: ", os.environ["WHYLABS_API_KEY"][0:10])

In [3]:
import os

os.environ["WHYLABS_API_ENDPOINT"] = "https://songbird.development.whylabsdev.com"
# set your org-id here - should be something like "org-xxxx"
print("Enter your WhyLabs Org ID") 
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-HVB9AM"

# set your datased_id (or model_id) here - should be something like "model-xxxx"
print("Enter your WhyLabs Dataset ID")
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-51"


# set your API key here
print("Enter your WhyLabs API key")
os.environ["WHYLABS_API_KEY"] = "z8fYdnQwHr.ibJaqDpZSsZd9dpo5ILyKlOgwXWPV7LGvtIsyqFUs54MGUsHMNz6q"
print("Using API Key ID: ", os.environ["WHYLABS_API_KEY"][0:10])

Enter your WhyLabs Org ID
Enter your WhyLabs Dataset ID
Enter your WhyLabs API key
Using API Key ID:  z8fYdnQwHr


## Pre-deployment

### Training the model

In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd
from whylogs.experimental.preprocess.embeddings.selectors import PCACentroidsSelector
from sklearn.naive_bayes import MultinomialNB

In [5]:
categories = [
    "alt.atheism",
    "soc.religion.christian",
    "comp.graphics",
    "rec.sport.baseball",
    "talk.politics.guns",
    "misc.forsale",
    "sci.med",
]

twenty_train = fetch_20newsgroups(
    subset="train", remove=("headers", "footers", "quotes"), categories=categories, shuffle=True, random_state=42
)

vectorizer = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ]
)
vectors_train = vectorizer.fit_transform(twenty_train.data)

vectors_train = vectors_train.toarray()

clf = MultinomialNB(alpha=0.01)
clf.fit(vectors_train, twenty_train.target)

MultinomialNB(alpha=0.01)

### Calculating Reference Embeddings

In [6]:
references, labels = PCACentroidsSelector(n_components=20).calculate_references(vectors_train, twenty_train.target)
ref_labels = [twenty_train.target_names[x].split(".")[-1] for x in labels]
print(ref_labels)

['atheism', 'graphics', 'forsale', 'baseball', 'med', 'christian', 'guns']


## Production Stage

### Configuring Schema for Embeddings+Tokens+Performance logging

In [7]:
import whylogs as why
from whylogs.core.resolvers import MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.experimental.extras.embedding_metric import (
    DistanceFunction,
    EmbeddingConfig,
    EmbeddingMetric,
)
from whylogs.experimental.extras.nlp_metric import BagOfWordsMetric
from whylogs.core.resolvers import STANDARD_RESOLVER


config = EmbeddingConfig(
    references=references,
    labels=ref_labels,
    distance_fn=DistanceFunction.cosine,
)
embeddings_resolver = ResolverSpec(column_name="news_centroids", metrics=[MetricSpec(EmbeddingMetric, config)])
tokens_resolver = ResolverSpec(column_name="document_tokens", metrics=[MetricSpec(BagOfWordsMetric)])

schema = DeclarativeSchema(STANDARD_RESOLVER+[embeddings_resolver,tokens_resolver])

### Loading daily batches

To speed things up, let's download the production data from a public S3 bucket. That way, we won't have to translate or tokenize the documents ourselves.

The DataFrame below contains 5306 documents - 2653 in English and 2653 in Spanish. The spanish documents were obtained by simply translating the english ones. Documents that have the same `doc_id` refers to the same document in different languages.

The tokenization was done using the `nltk` library.

In [8]:
download_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Newsgroups/production_en_es.parquet"
prod_df = pd.read_parquet(download_url)
prod_df.head()

Unnamed: 0,doc,target,predicted,tokens,language,batch_id,doc_id
0,Hello\n\n Just one quick question\n ...,4,4,"[Hello, Just, one, quick, question, My, father...",en,0,0.0
1,OFFICIAL UNITED NATIONS SOUVENIR FOLDERS\n\nEa...,2,2,"[OFFICIAL, UNITED, NATIONS, SOUVENIR, FOLDERS,...",en,0,1.0
2,I am selling Joe Montana SportsTalk Football 9...,2,2,"[I, selling, Joe, Montana, SportsTalk, Footbal...",en,0,2.0
3,\n\nNonsteroid Proventil is a brand of albute...,4,4,"[Nonsteroid, Proventil, brand, albuterol, bron...",en,0,3.0
4,Two URGENT requests\n\n1 I need the latest upd...,6,6,"[Two, URGENT, requests, 1, I, need, latest, up...",en,0,4.0


### Language Perturbation - Spanish Documents

In [9]:
language_perturbation_ratio = [0,0,0,0,0.33,0.66,1]

def get_docs_by_language_ratio(batch_df, ratio):
    n_docs = len(batch_df[batch_df["language"] == "en"])
    n_es_docs = int(n_docs * ratio)
    n_en_docs = n_docs - n_es_docs
    en_df = batch_df[batch_df["language"] == "en"].sample(n_en_docs)    
    es_df = batch_df[~batch_df['doc_id'].isin(en_df["doc_id"])]
    # filter out docs with doc_id in en_df


    es_df = es_df[es_df["language"] == "es"].sample(n_es_docs)
    docs = pd.concat([en_df, es_df])
    return docs

### Log and Upload to WhyLabs

In [12]:
mixed_df = get_docs_by_language_ratio(batch_df, ratio)

ValueError: Cannot take a larger sample than population when 'replace=False'

In [10]:
from datetime import datetime,timedelta, timezone
import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter
import random

writer = WhyLabsWriter()

for day, batch_df in prod_df.groupby("batch_id"):
    dataset_timestamp = datetime.now() - timedelta(days=6-day)
    dataset_timestamp = dataset_timestamp.replace(hour=0, minute=0, second=0, microsecond=0, tzinfo = timezone.utc)

    print(f"day {day}: {dataset_timestamp}")

    ratio = language_perturbation_ratio[day]
    print(f"{ratio*100}% of documents with language perturbation")
 
    mixed_df = get_docs_by_language_ratio(batch_df, ratio)
    mixed_df = mixed_df.dropna()

    sample_ratio = random.uniform(0.8, 1) # just to have some variability in the total number of daily docs
    mixed_df = mixed_df.sample(frac=sample_ratio).reset_index(drop=True)


    vectors = vectorizer.transform(mixed_df['doc']).toarray()
    predicted = clf.predict(vectors)
    print("mean accuracy: ", np.mean(predicted == mixed_df['target']))

    print("Profiling and logging Embeddings and Tokens...")
    profile = why.log(row={"news_centroids": vectors,
                                     "document_tokens": mixed_df["tokens"]},
                                     schema=schema)
    profile.set_dataset_timestamp(dataset_timestamp)
    writer.write(file=profile.view())


    newsgroups_df = pd.DataFrame({"output_target": mixed_df["target"],
                            "output_prediction": predicted})
    # to map indices to label names
    newsgroups_df["output_target"] = newsgroups_df["output_target"].apply(lambda x: ref_labels[x])
    newsgroups_df["output_prediction"] = newsgroups_df["output_prediction"].apply(lambda x: ref_labels[x])
    newsgroups_df["document_tokens"] = mixed_df["tokens"]

    
    print("Profiling and logging classification metrics...")
    classification_results = why.log_classification_metrics(
        newsgroups_df,
        target_column="output_target",
        prediction_column="output_prediction",
        schema=schema,
        log_full_data=True
    )
    classification_results.set_dataset_timestamp(dataset_timestamp)    
    writer.write(file=classification_results.view())


day 0: 2023-03-25 00:00:00+00:00
0% of documents with language perturbation
mean accuracy:  0.8467966573816156
Profiling and logging Embeddings and Tokens...
Profiling and logging classification metrics...
day 1: 2023-03-26 00:00:00+00:00
0% of documents with language perturbation
mean accuracy:  0.8333333333333334
Profiling and logging Embeddings and Tokens...
Profiling and logging classification metrics...
day 2: 2023-03-27 00:00:00+00:00
0% of documents with language perturbation
mean accuracy:  0.8253521126760563
Profiling and logging Embeddings and Tokens...
Profiling and logging classification metrics...
day 3: 2023-03-28 00:00:00+00:00
0% of documents with language perturbation
mean accuracy:  0.8065476190476191
Profiling and logging Embeddings and Tokens...
Profiling and logging classification metrics...
day 4: 2023-03-29 00:00:00+00:00
33.0% of documents with language perturbation
mean accuracy:  0.6225352112676056
Profiling and logging Embeddings and Tokens...
Profiling and l

ValueError: Cannot take a larger sample than population when 'replace=False'