# NLP Sentiment Analysis with CatBoost

This notebook introduces model developers to documenting a natural language processing (NLP) model with the ValidMind Developer Framework. The use case is sentiment analysis of COVID-19-related tweets, categorized as positive or negative. The model employs binary text classification using the CatBoost library. The notebook guides you through setting up the ValidMind Developer Framework, initializing the client library, and loading a sample dataset for training. It then runs the framework's model validation tests to generate documentation on the data and model.

## ValidMind at a glance

ValidMind's platform enables organizations to identify, document, and manage model risks for all types of models, including AI/ML models, LLMs, and statistical models. As a model developer, you use the ValidMind Developer Framework to automate documentation and validation tests, and then use the ValidMind AI Risk Platform UI to collaborate on documentation projects. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between you and model validators.

If this is your first time trying out ValidMind, we recommend going through the following resources first:

- [Get started](https://docs.validmind.ai/guide/get-started.html) — The basics, including key concepts, and how our products work
- [Get started with the ValidMind Developer Framework](https://docs.validmind.ai/guide/get-started-developer-framework.html) —  The path for developers, more code samples, and our developer reference

## Before you begin

::: {.callout-tip}
### New to ValidMind? 
For access to all features available in this notebook, create a free ValidMind account. 

Signing up is FREE — [**Sign up now**](https://app.prod.validmind.ai)
:::

If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).
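
To find out which modules are missing before you run the notebook, a quick check like the one below can help. This is a minimal sketch and not part of the ValidMind framework: `find_missing` is a hypothetical helper, and the module list is an assumption based on the imports used later in this notebook.

```python
import importlib.util

def find_missing(module_names):
    """Return the names of modules that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Top-level modules imported later in this notebook (assumed list).
required = ["validmind", "pandas", "numpy", "catboost", "sklearn", "nltk", "torch"]

missing = find_missing(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Note that the import name and the `pip` package name can differ; for example, `sklearn` is installed with `pip install scikit-learn`.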

## Install the client library

The client library provides Python support for the ValidMind Developer Framework. To install it:

In [1]:
%pip install -q validmind


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Initialize the client library

Every documentation project in the Platform UI comes with a _code snippet_ that lets the client library associate your documentation and tests with the right project on the Platform UI when you run this notebook.

Get your code snippet by creating a documentation project:

1. In a browser, log into the [Platform UI](https://app.prod.validmind.ai).

2. Go to **Documentation Projects** and click **Create new project**.

3. Select **`NLP-based Text Classification`** and **`Initial Validation`** for the model name and type, give the project a unique name to make it yours, and then click **Create project**.

4. Go to **Documentation Projects** > **YOUR_UNIQUE_PROJECT_NAME** > **Getting Started** and click **Copy snippet to clipboard**.

Next, replace this placeholder with your own code snippet:

In [2]:
## Replace with code snippet from your documentation project ##

import validmind as vm

vm.init(
  api_host = "https://api.prod.validmind.ai/api/v1/tracking",
  api_key = "...",
  api_secret = "...",
  project = "..."
)
  

2024-01-19 13:59:48,431 - INFO(validmind.api_client): Connected to ValidMind. Project: NLP Text Classification Model - Initial Validation (clrkpesc2005m19jwe634yolh)
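
If you prefer not to paste credentials into the notebook, one option is to read them from environment variables and pass them to `vm.init()`. The sketch below is an illustration only: `load_vm_credentials` and the `VM_*` variable names are hypothetical and are not defined by ValidMind, while the keyword names match the snippet above.

```python
import os

def load_vm_credentials(env=os.environ):
    """Build keyword arguments for vm.init() from environment variables."""
    # The VM_* variable names are hypothetical; choose any names you like.
    return {
        "api_host": env.get("VM_API_HOST", "https://api.prod.validmind.ai/api/v1/tracking"),
        "api_key": env["VM_API_KEY"],        # raises KeyError if the variable is unset
        "api_secret": env["VM_API_SECRET"],
        "project": env["VM_PROJECT"],
    }

# Usage, once the variables are set in your shell:
# import validmind as vm
# vm.init(**load_vm_credentials())
```

Failing with a `KeyError` on a missing variable is a deliberate choice here, so that a misconfigured environment is noticed before any tests run.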


## 1. Exploratory data analysis of COVID-19 tweets data
This section focuses on analyzing and preprocessing the text data (tweets). The dataset consists of manually tagged COVID-19 tweets in five sentiment classes, ranging from Extremely Negative to Extremely Positive. During this Exploratory Data Analysis (EDA), the five classes are reduced to two, positive and negative, and neutral tweets are dropped.



### Initialize the Python environment

Next, let's initialize the environment and import libraries for data manipulation, machine learning, and plotting, and then configure PyTorch:

In [3]:
%set_env PYTORCH_MPS_HIGH_WATERMARK_RATIO 0.8

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split


%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import torch
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

device = "cpu"

train_model = True

env: PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.8


###  Load COVID-19 tweets data

In [4]:
from validmind.datasets.nlp import twitter_covid_19 as demo_data
df = demo_data.load_data()
df.head(10)

| | OriginalTweet | Sentiment |
|---|---|---|
| 0 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | advice Talk to your neighbours family to excha... | Positive |
| 2 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | My food stock is not the only one which is emp... | Positive |
| 4 | Me, ready to go at supermarket during the #COV... | Extremely Negative |
| 5 | As news of the regions first confirmed COVID-... | Positive |
| 6 | Cashier at grocery store was sharing his insig... | Positive |
| 7 | Was at the supermarket today. Didn't buy toile... | Neutral |
| 8 | Due to COVID-19 our retail store and classroom... | Positive |
| 9 | For corona prevention,we should stop to buy th... | Negative |


### Run text data quality test suite
In this section, we use the ValidMind Developer Framework to run various data quality checks on the dataset, and send the results to the model document on the ValidMind Platform UI:

In [5]:
vm_ds = vm.init_dataset(dataset=df, type="generic", text_column='OriginalTweet', target_column="Sentiment")

2024-01-19 13:59:49,709 - INFO(validmind.client): The 'type' argument to init_dataset() argument is deprecated and no longer required.
2024-01-19 13:59:49,709 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


In [6]:
config = {
    "class_imbalance":{"min_percent_threshold": 3}
}
text_data_test_suite = vm.run_test_suite("text_data_quality",
                                       inputs = {"dataset":vm_ds},
                                       config=config)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juanvalidmind/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
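
To make the `class_imbalance` setting more concrete, the sketch below shows one way to read `min_percent_threshold`: a class is flagged when it makes up less than that percentage of the rows. This interpretation is an assumption, the helper `classes_below_threshold` is hypothetical, and the toy labels are made up; it is not ValidMind's implementation.

```python
from collections import Counter

def classes_below_threshold(labels, min_percent_threshold):
    """Return {label: percent} for classes rarer than the threshold (in percent)."""
    counts = Counter(labels)
    total = sum(counts.values())
    percents = {label: 100 * n / total for label, n in counts.items()}
    return {label: pct for label, pct in percents.items() if pct < min_percent_threshold}

# Toy labels (made up): 60% Positive, 38% Negative, 2% Neutral.
labels = ["Positive"] * 60 + ["Negative"] * 38 + ["Neutral"] * 2
print(classes_below_threshold(labels, min_percent_threshold=3))
# {'Neutral': 2.0}
```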

## 2. Preprocess data

### Handle class bias 

One way to handle class bias is to merge a class with a related class. Here, we copy the text and class labels into separate columns so that the originals remain available for comparison:

In [7]:
print("Original Classes:", df.Sentiment.unique())

df['text'] = df.OriginalTweet
df["text"] = df["text"].astype(str)

def classes_def(x):
    if x ==  "Extremely Positive":
        return "positive"
    elif x == "Extremely Negative":
        return "negative"
    elif x == "Negative":
        return "negative"
    elif x ==  "Positive":
        return "positive"
    else:
        return "neutral"

df['sentiment']=df['Sentiment'].apply(lambda x:classes_def(x))
target=df['sentiment']

print(df.sentiment.value_counts(normalize= True))
print("Modified Classes:", df.sentiment.unique())

Original Classes: ['Neutral' 'Positive' 'Extremely Negative' 'Negative' 'Extremely Positive']
positive    0.435814
negative    0.378846
neutral     0.185341
Name: sentiment, dtype: float64
Modified Classes: ['neutral' 'positive' 'negative']
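
The same mapping can also be written as a dictionary lookup, which keeps the label pairs in one place. This is an equivalent sketch, not the notebook's own code: `LABEL_MAP` and `to_coarse_label` are hypothetical names introduced for illustration.

```python
# Dictionary-based equivalent of classes_def(); anything not listed maps to "neutral".
LABEL_MAP = {
    "Extremely Positive": "positive",
    "Positive": "positive",
    "Extremely Negative": "negative",
    "Negative": "negative",
}

def to_coarse_label(label):
    """Collapse a five-class sentiment label into positive, negative, or neutral."""
    return LABEL_MAP.get(label, "neutral")

labels = ["Neutral", "Positive", "Extremely Negative", "Negative", "Extremely Positive"]
print([to_coarse_label(label) for label in labels])
# ['neutral', 'positive', 'negative', 'negative', 'positive']
```

With pandas, `df["Sentiment"].map(LABEL_MAP).fillna("neutral")` should produce the same column without a Python-level function call per row.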


### Remove sentiments that are neutral

In [8]:
df = df[df["sentiment"] != "neutral"]
print(df.sentiment.unique())
print(df.sentiment.value_counts(normalize= True))
print(df.shape)

['positive' 'negative']
positive    0.534964
negative    0.465036
Name: sentiment, dtype: float64
(36623, 4)


In [9]:
df

| | OriginalTweet | Sentiment | text | sentiment |
|---|---|---|---|---|
| 1 | advice Talk to your neighbours family to excha... | Positive | advice Talk to your neighbours family to excha... | positive |
| 2 | Coronavirus Australia: Woolworths to give elde... | Positive | Coronavirus Australia: Woolworths to give elde... | positive |
| 3 | My food stock is not the only one which is emp... | Positive | My food stock is not the only one which is emp... | positive |
| 4 | Me, ready to go at supermarket during the #COV... | Extremely Negative | Me, ready to go at supermarket during the #COV... | negative |
| 5 | As news of the regions first confirmed COVID-... | Positive | As news of the regions first confirmed COVID-... | positive |
| ... | ... | ... | ... | ... |
| 44949 | @RicePolitics @MDCounties Craig, will you call... | Negative | @RicePolitics @MDCounties Craig, will you call... | negative |
| 44950 | Meanwhile In A Supermarket in Israel -- People... | Positive | Meanwhile In A Supermarket in Israel -- People... | positive |
| 44951 | Did you panic buy a lot of non-perishable item... | Negative | Did you panic buy a lot of non-perishable item... | negative |
| 44953 | Gov need to do somethings instead of biar je r... | Extremely Negative | Gov need to do somethings instead of biar je r... | negative |


### Remove URLs and HTML links

In [10]:
import re

def remove_urls(text):
    url_remove = re.compile(r'https?://\S+|www\.\S+')
    return url_remove.sub(r'', text)

df['text']=df['text'].apply(lambda x:remove_urls(x))

def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

df['text']=df['text'].apply(lambda x:remove_html(x))

### Convert text to lower case 


In [11]:
def lower(text):
    low_text= text.lower()
    return low_text
df['text']=df['text'].apply(lambda x:lower(x))


### Remove numbers 

In [12]:
def remove_num(text):
    remove= re.sub(r'\d+', '', text)
    return remove
df['text']=df['text'].apply(lambda x:remove_num(x))


### Remove stopwords 

In [13]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
df['text']=df['text'].apply(lambda x:remove_stopwords(x))

### Remove punctuation

In [14]:
def punct_remove(text):
    punct = re.sub(r"[^\w\s\d]","", text)
    return punct
df['text']=df['text'].apply(lambda x:punct_remove(x))


### Remove mentions 

In [15]:
def remove_mention(x):
    text=re.sub(r'@\w+','',x)
    return text
df['text']=df['text'].apply(lambda x:remove_mention(x))


### Remove hashtags 

In [16]:
def remove_hash(x):
    text=re.sub(r'#\w+','',x)
    return text
df['text']=df['text'].apply(lambda x:remove_hash(x))

### Remove extra whitespace left while removing other text

In [17]:
def remove_space(text):
    space_remove = re.sub(r"\s+"," ",text).strip()
    return space_remove
df['text']=df['text'].apply(lambda x:remove_space(x))
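
The order of the cleaning steps matters. Because punctuation is stripped before mentions and hashtags, the `@` and `#` markers are already gone when those two steps run, so the handle and tag text stays in the tweet. The sketch below compares the order used above with a variant that removes mentions and hashtags first. It is an illustration only: the function names are hypothetical, the tiny stopword set is a stand-in for NLTK's list, and whether hashtag text should be dropped at all is a modeling choice.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption, for illustration only).
STOPWORDS = {"to", "the", "will", "you", "a", "of"}

def clean_notebook_order(text):
    """Apply the cleaning steps in the same order as the cells above."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"<.*?>", "", text)                   # HTML tags
    text = text.lower()
    text = re.sub(r"\d+", "", text)                     # numbers
    text = " ".join(w for w in text.split() if w not in STOPWORDS)
    text = re.sub(r"[^\w\s\d]", "", text)               # punctuation, including @ and #
    text = re.sub(r"@\w+", "", text)                    # mentions: no @ left to match
    text = re.sub(r"#\w+", "", text)                    # hashtags: no # left to match
    return re.sub(r"\s+", " ", text).strip()

def clean_mentions_first(text):
    """Variant that removes mentions and hashtags before punctuation."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"@\w+|#\w+", "", text)               # handle and tag text removed
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = " ".join(w for w in text.split() if w not in STOPWORDS)
    text = re.sub(r"[^\w\s\d]", "", text)
    return re.sub(r"\s+", " ", text).strip()

tweet = "@RicePolitics Craig, will you call #COVID19 a crisis? https://t.co/abc"
print(clean_notebook_order(tweet))   # ricepolitics craig call covid crisis
print(clean_mentions_first(tweet))   # craig call crisis
```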

In [18]:
df

| | OriginalTweet | Sentiment | text | sentiment |
|---|---|---|---|---|
| 1 | advice Talk to your neighbours family to excha... | Positive | advice talk neighbours family exchange phone n... | positive |
| 2 | Coronavirus Australia: Woolworths to give elde... | Positive | coronavirus australia woolworths give elderly ... | positive |
| 3 | My food stock is not the only one which is emp... | Positive | food stock one empty please panic enough food ... | positive |
| 4 | Me, ready to go at supermarket during the #COV... | Extremely Negative | me ready go supermarket covid outbreak im para... | negative |
| 5 | As news of the regions first confirmed COVID-... | Positive | news regions first confirmed covid case came s... | positive |
| ... | ... | ... | ... | ... |
| 44949 | @RicePolitics @MDCounties Craig, will you call... | Negative | ricepolitics mdcounties craig call general ass... | negative |
| 44950 | Meanwhile In A Supermarket in Israel -- People... | Positive | meanwhile supermarket israel people dance sing... | positive |
| 44951 | Did you panic buy a lot of non-perishable item... | Negative | panic buy lot nonperishable items echo needs f... | negative |
| 44953 | Gov need to do somethings instead of biar je r... | Extremely Negative | gov need somethings instead biar je rakyat ass... | negative |


### Run text data quality tests again
Here, we rerun the data quality tests on the cleaned `text` column to verify that the preprocessing meets our requirements and that the tests pass:

In [19]:
vm_ds = vm.init_dataset(dataset=df, type="generic", text_column='text', target_column="sentiment")

config = {
    "class_imbalance":{"min_percent_threshold": 3}
}
text_data_test_suite = vm.run_test_suite("text_data_quality",
                                       inputs = {"dataset":vm_ds},
                                       config=config)

2024-01-19 14:00:15,065 - INFO(validmind.client): The 'type' argument to init_dataset() argument is deprecated and no longer required.
2024-01-19 14:00:15,066 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...



[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juanvalidmind/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

## 3. Modeling

### Create training, validation, and test data sets

With our data in nice shape, we'll split it into training, validation, and test sets:

In [20]:

df = df[df['sentiment'] != "neutral"]
df.loc[df['sentiment'] == "positive", 'sentiment'] = 1
df.loc[df['sentiment'] == "negative", 'sentiment'] = 0
print(np.unique(df['sentiment']))

print(df.head())
train, test = train_test_split(df[['text','sentiment']], test_size=0.33, random_state=42)
train = train[['text','sentiment']]
test = test[['text','sentiment']]

train, valid = train_test_split(
    train,
    train_size=0.7,
    random_state=0,
    stratify=train['sentiment'])
y_train, X_train = \
    train['sentiment'], train.drop(['sentiment'], axis=1)
y_valid, X_valid = \
    valid['sentiment'], valid.drop(['sentiment'], axis=1)
y_test, X_test= \
    test['sentiment'], test.drop(['sentiment'], axis=1)

[0 1]
                                       OriginalTweet           Sentiment  \
1  advice Talk to your neighbours family to excha...            Positive   
2  Coronavirus Australia: Woolworths to give elde...            Positive   
3  My food stock is not the only one which is emp...            Positive   
4  Me, ready to go at supermarket during the #COV...  Extremely Negative   
5  As news of the regions first confirmed COVID-...            Positive   

                                                text sentiment  
1  advice talk neighbours family exchange phone n...         1  
2  coronavirus australia woolworths give elderly ...         1  
3  food stock one empty please panic enough food ...         1  
4  me ready go supermarket covid outbreak im para...         0  
5  news regions first confirmed covid case came s...         1  
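
Because the data is split twice, the final proportions are not immediately obvious from the arguments. The short calculation below works them out from the `test_size=0.33` and `train_size=0.7` values used in the cell above; the exact row counts can differ slightly because of rounding and stratification.

```python
test_size = 0.33    # first split: share of all rows held out for testing
train_size = 0.7    # second split: share of the remainder kept for training

test_frac = test_size
train_frac = (1 - test_size) * train_size
valid_frac = (1 - test_size) * (1 - train_size)

for name, frac in [("train", train_frac), ("valid", valid_frac), ("test", test_frac)]:
    print(f"{name}: {frac:.1%}")
# train: 46.9%
# valid: 20.1%
# test: 33.0%
```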


### Build the model

In [21]:
def fit_model(X_train, y_train,val_data, **kwargs):
    model = CatBoostClassifier(
        task_type='CPU',
        iterations=5000,
        eval_metric='Accuracy',
        od_type='Iter',
        od_wait=500,
        **kwargs
    )
    return model.fit(
        X=X_train,
        y=y_train,
        eval_set=val_data,
        verbose=100,
        plot=True,
        use_best_model=True
        )

In [22]:
model = fit_model(
    X_train, y_train,
    val_data=(X_valid,y_valid),
    text_features=['text'],
    learning_rate=0.35,
    tokenizers=[
        {
            'tokenizer_id': 'Sense',
            'separator_type': 'BySense',
            'lowercasing': 'True',
            'token_types':['Word', 'Number', 'SentenceBreak'],
            'sub_tokens_policy':'SeveralTokens'
        }
    ],
    dictionaries = [
        {
            'dictionary_id': 'Word',
            'max_dictionary_size': '5000'
        }
    ],
    feature_calcers = [
        'BoW:top_tokens_count=10000'
    ]
)


0:	learn: 0.6037263	test: 0.6062211	best: 0.6062211 (0)	total: 124ms	remaining: 10m 19s
100:	learn: 0.8601456	test: 0.8288509	best: 0.8291225 (98)	total: 7.42s	remaining: 5m 59s
200:	learn: 0.9101601	test: 0.8515349	best: 0.8528932 (193)	total: 15.1s	remaining: 5m 59s
300:	learn: 0.9349054	test: 0.8562891	best: 0.8564249 (297)	total: 22.9s	remaining: 5m 57s
400:	learn: 0.9557496	test: 0.8566965	best: 0.8583265 (348)	total: 30.5s	remaining: 5m 49s
500:	learn: 0.9712955	test: 0.8569682	best: 0.8583265 (348)	total: 37.9s	remaining: 5m 39s
600:	learn: 0.9795633	test: 0.8576474	best: 0.8583265 (348)	total: 46s	remaining: 5m 36s
700:	learn: 0.9865502	test: 0.8533007	best: 0.8583265 (348)	total: 53.7s	remaining: 5m 29s
800:	learn: 0.9915575	test: 0.8572399	best: 0.8583265 (348)	total: 1m 1s	remaining: 5m 19s
900:	learn: 0.9949345	test: 0.8603640	best: 0.8610432 (869)	total: 1m 8s	remaining: 5m 11s
1000:	learn: 0.9970306	test: 0.8604999	best: 0.8610432 (869)	total: 1m 15s	remaining: 5m 1s
1100

### Initialize ValidMind objects

With the model ready, we can now initialize the training and test datasets, as well as the model, using [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) and [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model):

In [23]:
vm_train_ds = vm.init_dataset(dataset=pd.concat([X_train, y_train], axis=1), type="generic", target_column="sentiment")
vm_test_ds = vm.init_dataset(dataset=pd.concat([X_test, y_test], axis=1), type="generic",target_column="sentiment")
vm_model = vm.init_model(model, train_ds=vm_train_ds, test_ds=vm_test_ds)

2024-01-19 14:03:13,512 - INFO(validmind.client): The 'type' argument to init_dataset() argument is deprecated and no longer required.
2024-01-19 14:03:13,523 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
2024-01-19 14:03:13,539 - INFO(validmind.client): The 'type' argument to init_dataset() argument is deprecated and no longer required.
2024-01-19 14:03:13,542 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


#### Run model metrics test suite

Next, we run the `classifier_metrics` test suite on the initialized model to collect performance metrics for binary classification:

In [25]:
model_metrics_test_suite = vm.run_test_suite("classifier_metrics",
                                             inputs = {"model":vm_model}
                                            )


2024-01-19 14:03:54,387 - ERROR(validmind.vm_models.test_suite.test): Failed to run test 'classifier_in_sample_performance': (ValueError) Classification metrics can't handle a mix of unknown and binary targets
2024-01-19 14:03:54,525 - ERROR(validmind.vm_models.test_suite.test): Failed to run test 'classifier_out_of_sample_performance': (ValueError) Classification metrics can't handle a mix of unknown and binary targets
2024-01-19 14:03:54,529 - ERROR(validmind.vm_models.test_suite.test): Failed to run test 'pfi': (SkipTestError) Skipping PFI for catboost models
2024-01-19 14:03:54,621 - ERROR(validmind.vm_models.test_suite.test): Failed to run test 'pr_curve': (ValueError) unknown format is not supported
2024-01-19 14:03:54,745 - INFO(validmind.tests.model_validation.sklearn.PopulationStabilityIndex): Skiping PSI for catboost models
2024-01-19 14:03:54,745 - INFO(validmind.tests.model_validation.sklearn.PopulationStabilityIndex): Skiping PSI for catboost models
2024-01-19 14:03:54,746
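
The `ValueError` in the log above ("mix of unknown and binary targets") is what scikit-learn typically reports when the true labels are stored with `object` dtype while the predictions are numeric. One plausible cause here, though the source does not confirm it, is that assigning `1` and `0` into the string-typed `sentiment` column leaves that column as `object`. The sketch below reproduces the effect on a made-up toy frame and shows an explicit integer cast as one possible remedy.

```python
import pandas as pd

# Toy frame (made up) with string labels stored as object dtype.
df = pd.DataFrame({"sentiment": pd.Series(["positive", "negative", "positive"], dtype=object)})

# Same in-place assignment pattern as in the split cell.
df.loc[df["sentiment"] == "positive", "sentiment"] = 1
df.loc[df["sentiment"] == "negative", "sentiment"] = 0
print(df["sentiment"].dtype)   # still object, even though every value is an int

# Casting explicitly gives the labels a numeric dtype.
labels = df["sentiment"].astype(int)
print(labels.dtype)
```

If this is indeed the cause, casting the column once before the train/test split would carry the integer dtype through to `y_train`, `y_valid`, and `y_test`.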


#### Run model validation test suite

Finally, let's run the `classifier_validation` test suite on the initialized model to validate its binary classification performance:

In [27]:
model_validation_test_suite = vm.run_test_suite(
    "classifier_validation",
    inputs = {
        "model":vm_model,
        "models":[vm_model]
    }
)


2024-01-19 14:05:14,586 - ERROR(validmind.vm_models.test_suite.test): Failed to run test 'models_performance_comparison': (ValueError) Classification metrics can't handle a mix of unknown and binary targets



## Next steps

You can review the results of these test suites right in the notebook where you ran the code. But there is a better way: view the test results as part of your model documentation in the ValidMind Platform UI:

1. Log back into the [Platform UI](https://app.prod.validmind.ai) 

2. Go to **Documentation Projects** > **YOUR_DOCUMENTATION_PROJECT** > **Documentation**.

3. Expand **3. Model Development** > **3.2. Prompt Evaluation**.

What you see now is a more easily consumable version of the tests you just ran, along with other parts of your documentation project that still need to be completed.

If you want to learn more about where you are in the model documentation process, take a look at [How do I use the framework?](https://docs.validmind.ai/guide/get-started-developer-framework.html#how-do-i-use-the-framework).

