# NLP Sentiment Analysis with CatBoost

This notebook introduces model developers to documenting a natural language processing (NLP) model with the ValidMind Developer Framework. The use case is sentiment analysis of COVID-19-related tweets, categorized as positive or negative. The model employs binary text classification using the CatBoost library. The notebook guides you through setting up the ValidMind Developer Framework, initializing the client library, and loading a sample dataset for training. It then runs the framework's model validation tests to generate documentation on the data and model.

## ValidMind at a glance

ValidMind's platform enables organizations to identify, document, and manage model risks for all types of models, including AI/ML models, LLMs, and statistical models. As a model developer, you use the ValidMind Developer Framework to automate documentation and validation tests, and then use the ValidMind AI Risk Platform UI to collaborate on documentation projects. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between you and model validators.

If this is your first time trying out ValidMind, we recommend going through the following resources first:

- [Get started](https://docs.validmind.ai/guide/get-started.html): The basics, including key concepts, and how our products work
- [Get started with the ValidMind Developer Framework](https://docs.validmind.ai/guide/get-started-developer-framework.html): The path for developers, more code samples, and our developer reference

## Before you begin

::: {.callout-tip}
### New to ValidMind? 
For access to all features available in this notebook, create a free ValidMind account. 

Signing up is FREE — [**Sign up now**](https://app.prod.validmind.ai)
:::

If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).

## Install the client library

The client library provides Python support for the ValidMind Developer Framework. To install it:

In [None]:
%pip install -q validmind

## Initialize the client library

Every documentation project in the Platform UI comes with a _code snippet_ that lets the client library associate your documentation and tests with the right project on the Platform UI when you run this notebook.

Get your code snippet by creating a documentation project:

1. In a browser, log into the [Platform UI](https://app.prod.validmind.ai).

2. Go to **Documentation Projects** and click **Create new project**.

3. Select **`NLP-based Text Classification`** and **`Initial Validation`** for the model name and type, give the project a unique name to make it yours, and then click **Create project**.

4. Go to **Documentation Projects** > **YOUR_UNIQUE_PROJECT_NAME** > **Getting Started** and click **Copy snippet to clipboard**.

Next, replace this placeholder with your own code snippet:

In [None]:
## Replace with code snippet from your documentation project ##

import validmind as vm

vm.init(
  api_host = "https://api.prod.validmind.ai/api/v1/tracking",
  api_key = "...",
  api_secret = "...",
  project = "..."
)
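
If you prefer not to paste credentials into the notebook, one option is to read them from environment variables and pass them to `vm.init()`. The sketch below is only an illustration: the variable names (`VM_API_HOST`, `VM_API_KEY`, `VM_API_SECRET`, `VM_API_PROJECT`) and the helper `load_vm_credentials` are assumptions made for this example, not part of the snippet generated by the Platform UI.

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
REQUIRED_VARS = {
    "api_host": "VM_API_HOST",
    "api_key": "VM_API_KEY",
    "api_secret": "VM_API_SECRET",
    "project": "VM_API_PROJECT",
}

def load_vm_credentials(env=None):
    """Collect the keyword arguments for vm.init() from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[name] for arg, name in REQUIRED_VARS.items()}

# Example with an explicit mapping instead of the real environment
example_env = {
    "VM_API_HOST": "https://api.prod.validmind.ai/api/v1/tracking",
    "VM_API_KEY": "...",
    "VM_API_SECRET": "...",
    "VM_API_PROJECT": "...",
}
credentials = load_vm_credentials(example_env)
print(sorted(credentials))
# In the notebook you would then call: vm.init(**credentials)
```

Calling `load_vm_credentials()` with no argument reads from the real environment and raises a `KeyError` that names any variable that is not set.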


## 1. Exploratory data analysis of COVID-19 tweets data
This section focuses on in-depth analysis and preprocessing of the text data (tweets). The dataset consists of manually tagged COVID-19 tweets in five distinct classes, ranging from Extremely Negative to Extremely Positive. During this Exploratory Data Analysis (EDA) and the preprocessing that follows, these five classes are reduced to two: Positive and Negative.



### Initialize the Python environment

Next, let's initialize the environment and import libraries for data manipulation, machine learning, and plotting, followed by configuring PyTorch:

In [None]:
%set_env PYTORCH_MPS_HIGH_WATERMARK_RATIO 0.8

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split


%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import torch
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Force CPU for this notebook, overriding the device detected above
device = "cpu"

train_model = True

### Load COVID-19 tweets data

In [None]:
from validmind.datasets.nlp import twitter_covid_19 as demo_data
df = demo_data.load_data()
df.head(10)

### Run text data quality test suite
In this section, we use the ValidMind Developer Framework to run various data quality checks on the dataset, and send the results to the model document on the ValidMind Platform UI:

In [None]:
vm_ds = vm.init_dataset(dataset=df, type="generic", text_column='OriginalTweet', target_column="Sentiment")

In [None]:
config = {
    "class_imbalance":{"min_percent_threshold": 3}
}
text_data_test_suite = vm.run_test_suite("text_data_quality",
                                       inputs = {"dataset":vm_ds},
                                       config=config)
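
To build intuition for the `min_percent_threshold` setting, the sketch below shows the kind of check a class imbalance test might perform: compute each class's share of the rows and flag any class below the threshold. This is an illustration under that assumption, not ValidMind's implementation; the helper names and toy labels are made up for the example.

```python
from collections import Counter

def class_percentages(labels):
    """Return the share of each class as a percentage of all rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}

def classes_below_threshold(labels, min_percent_threshold):
    """List the classes whose share falls below the given percentage."""
    shares = class_percentages(labels)
    return sorted(label for label, pct in shares.items()
                  if pct < min_percent_threshold)

# Toy labels: 'Neutral' makes up 2 of 100 rows (2%)
toy_labels = ["Positive"] * 58 + ["Negative"] * 40 + ["Neutral"] * 2
print(class_percentages(toy_labels))
print(classes_below_threshold(toy_labels, min_percent_threshold=3))
```

With a threshold of 3, only the toy `Neutral` class is flagged, because its share is 2%.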

## 2. Preprocess data

### Handle class imbalance

One way to handle class imbalance is to merge a class with a related class. Here, we copy the text and class labels into separate columns so that the original values remain available for comparison:

In [None]:
print("Original Classes:", df.Sentiment.unique())

df['text'] = df.OriginalTweet
df["text"] = df["text"].astype(str)

def classes_def(x):
    if x ==  "Extremely Positive":
        return "positive"
    elif x == "Extremely Negative":
        return "negative"
    elif x == "Negative":
        return "negative"
    elif x ==  "Positive":
        return "positive"
    else:
        return "neutral"

df['sentiment']=df['Sentiment'].apply(lambda x:classes_def(x))
target=df['sentiment']

print(df.sentiment.value_counts(normalize= True))
print("Modified Classes:", df.sentiment.unique())
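
The same five-to-three mapping can also be written as a dictionary lookup, which keeps the class definitions in one place. The sketch below is a plain-Python equivalent of the `if`/`elif` chain; the names `SENTIMENT_MAP` and `collapse_sentiment` are introduced here for illustration only.

```python
# Dictionary-based equivalent of the if/elif mapping;
# any label that is not listed is treated as neutral.
SENTIMENT_MAP = {
    "Extremely Positive": "positive",
    "Positive": "positive",
    "Extremely Negative": "negative",
    "Negative": "negative",
}

def collapse_sentiment(label):
    """Map an original label to positive, negative, or neutral."""
    return SENTIMENT_MAP.get(label, "neutral")

original = ["Extremely Positive", "Negative", "Neutral",
            "Positive", "Extremely Negative"]
collapsed = [collapse_sentiment(label) for label in original]
print(collapsed)
```

On a DataFrame, the same idea could be expressed as `df['Sentiment'].map(SENTIMENT_MAP).fillna("neutral")`.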

### Remove sentiments that are neutral

In [None]:
df = df[df["sentiment"] != "neutral"]
print(df.sentiment.unique())
print(df.sentiment.value_counts(normalize= True))
print(df.shape)

In [None]:
df

### Remove URLs and HTML links

In [None]:
import re

def remove_urls(text):
    url_remove = re.compile(r'https?://\S+|www\.\S+')
    return url_remove.sub(r'', text)

df['text']=df['text'].apply(lambda x:remove_urls(x))

def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

df['text']=df['text'].apply(lambda x:remove_html(x))
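
To see what these two patterns do, it can help to try them on a single string. The sketch below reuses the same regular expressions on a made-up sample; the helper name `strip_urls_and_html` and the sample text are assumptions for illustration.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
HTML_PATTERN = re.compile(r'<.*?>')

def strip_urls_and_html(text):
    """Remove URLs first, then HTML tags, using the patterns defined above."""
    return HTML_PATTERN.sub('', URL_PATTERN.sub('', text))

sample = "Stay safe <b>everyone</b>! Details: https://example.com/covid www.example.org"
cleaned = strip_urls_and_html(sample)
print(repr(cleaned))

# Caveat: the tag pattern matches any '<' ... '>' pair, not only real tags
print(repr(HTML_PATTERN.sub('', "I <3 this > that")))
```

The second call shows a limitation worth keeping in mind for tweets: an emoticon such as `<3` followed later by a `>` is removed together with the text in between.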

### Convert text to lower case 


In [None]:
def lower(text):
    low_text= text.lower()
    return low_text
df['text']=df['text'].apply(lambda x:lower(x))


### Remove numbers 

In [None]:
def remove_num(text):
    remove= re.sub(r'\d+', '', text)
    return remove
df['text']=df['text'].apply(lambda x:remove_num(x))


### Remove stopwords 

In [None]:
import nltk
from nltk.corpus import stopwords

# Download the stopword list if it is not already available locally
nltk.download('stopwords', quiet=True)
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
df['text']=df['text'].apply(lambda x:remove_stopwords(x))
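
Stopword filtering is an exact match on whitespace-separated tokens, which is why lowercasing happens first. The sketch below illustrates this with a tiny hand-picked stopword set (an assumption for the example; NLTK's English list is much longer).

```python
# A tiny hand-picked stopword set for illustration only
TOY_STOPWORDS = {"the", "is", "a", "in", "of", "and", "to"}

def drop_stopwords(text, stopwords):
    """Keep only whitespace-separated tokens that are not in the stopword set."""
    return " ".join(word for word in text.split() if word not in stopwords)

lowercase_result = drop_stopwords("the pandemic is a test of patience", TOY_STOPWORDS)
print(lowercase_result)

# Matching is exact, so capitalized or punctuated tokens slip through
mixed_result = drop_stopwords("The end is near, and the", TOY_STOPWORDS)
print(mixed_result)
```

In the second call, `The` survives because it is capitalized, and `near,` keeps its trailing comma because punctuation has not been stripped at that point.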

### Remove mentions 

In [None]:
def remove_mention(x):
    text=re.sub(r'@\w+','',x)
    return text
df['text']=df['text'].apply(lambda x:remove_mention(x))


### Remove hashtags 

In [None]:
def remove_hash(x):
    text=re.sub(r'#\w+','',x)
    return text
df['text']=df['text'].apply(lambda x:remove_hash(x))

### Remove punctuation

Punctuation is stripped after mentions and hashtags, because the `@` and `#` prefixes that those patterns rely on are themselves punctuation:

In [None]:
def punct_remove(text):
    punct = re.sub(r"[^\w\s]","", text)
    return punct
df['text']=df['text'].apply(lambda x:punct_remove(x))

### Remove extra whitespace left while removing other text

In [None]:
def remove_space(text):
    space_remove = re.sub(r"\s+"," ",text).strip()
    return space_remove
df['text']=df['text'].apply(lambda x:remove_space(x))

In [None]:
df
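
The individual cleaning steps can also be composed into a single function. The sketch below is one possible arrangement, assuming an order in which mentions and hashtags are removed before punctuation so that the `@` and `#` prefixes are still present when their patterns run. Stopword removal is left out to keep the example free of external downloads, and the name `clean_tweet` and the sample text are made up for illustration.

```python
import re

def clean_tweet(text):
    """Apply the cleaning steps in an order where each pattern can still match."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>', '', text)                   # HTML tags
    text = text.lower()                                 # lower case
    text = re.sub(r'\d+', '', text)                     # numbers
    text = re.sub(r'@\w+', '', text)                    # mentions (needs the @)
    text = re.sub(r'#\w+', '', text)                    # hashtags (needs the #)
    text = re.sub(r'[^\w\s]', '', text)                 # remaining punctuation
    return re.sub(r'\s+', ' ', text).strip()            # extra whitespace

sample = "@WHO says 3 NEW cases!!! #COVID19 https://t.co/abc <br> Stay home."
cleaned = clean_tweet(sample)
print(cleaned)

# Stripping punctuation first removes the @, so the mention pattern no longer matches
punct_first = re.sub(r'@\w+', '', re.sub(r'[^\w\s]', '', "@who says hi"))
print(punct_first)
```

The last two lines show why the order matters: once the `@` is gone, the account name stays in the text as an ordinary word.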

### Run text data quality tests again
Here, we rerun the data quality tests on the cleaned text to verify that preprocessing has brought the data to a sufficient standard and that the tests pass according to our requirements:

In [None]:
vm_ds = vm.init_dataset(dataset=df, type="generic", text_column='text', target_column="sentiment")

config = {
    "class_imbalance":{"min_percent_threshold": 3}
}
text_data_test_suite = vm.run_test_suite("text_data_quality",
                                       inputs = {"dataset":vm_ds},
                                       config=config)

## 3. Modeling

### Create training, validation, and test data sets

With our data in nice shape, we'll split it into training, validation, and test sets:

In [None]:

# Keep only positive and negative tweets, working on a copy to avoid chained assignment
df = df[df['sentiment'] != "neutral"].copy()
# Encode the target as integers: positive -> 1, negative -> 0
df['sentiment'] = df['sentiment'].map({"positive": 1, "negative": 0}).astype(int)
print(np.unique(df['sentiment']))

print(df.head())
# Hold out 33% of the rows as the test set
train, test = train_test_split(df[['text','sentiment']], test_size=0.33, random_state=42)

train, valid = train_test_split(
    train,
    train_size=0.7,
    random_state=0,
    stratify=train['sentiment'])
y_train, X_train = \
    train['sentiment'], train.drop(['sentiment'], axis=1)
y_valid, X_valid = \
    valid['sentiment'], valid.drop(['sentiment'], axis=1)
y_test, X_test= \
    test['sentiment'], test.drop(['sentiment'], axis=1)
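
Because the split is done in two stages, the final proportions are not immediately obvious from the arguments. The short calculation below works them out from the `test_size=0.33` and `train_size=0.7` values used above.

```python
# Fractions taken from the two train_test_split calls
test_fraction = 0.33          # first split: test_size=0.33
train_of_remainder = 0.7      # second split: train_size=0.7

remainder = 1 - test_fraction
train_fraction = remainder * train_of_remainder
valid_fraction = remainder * (1 - train_of_remainder)

for name, fraction in [("train", train_fraction),
                       ("valid", valid_fraction),
                       ("test", test_fraction)]:
    print(f"{name}: {fraction:.1%}")
```

Roughly 47% of the rows end up in the training set, 20% in the validation set, and 33% in the test set.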

### Build the model

In [None]:
def fit_model(X_train, y_train,val_data, **kwargs):
    model = CatBoostClassifier(
        task_type='CPU',
        iterations=5000,
        eval_metric='Accuracy',
        od_type='Iter',
        od_wait=500,
        **kwargs
    )
    return model.fit(
        X=X_train,
        y=y_train,
        eval_set=val_data,
        verbose=100,
        plot=True,
        use_best_model=True
        )
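
The `od_type='Iter'` and `od_wait=500` settings enable a patience-style overfitting detector, and `use_best_model=True` keeps the model from the best validation iteration. The sketch below is a simplified illustration of that idea in plain Python; it is not CatBoost's implementation, and the function name and toy scores are assumptions for the example.

```python
def best_iteration_with_patience(scores, wait):
    """Return (best_index, stop_index) for a higher-is-better metric.

    Training stops once `wait` iterations pass without a new best score.
    """
    best_index, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        if score > best_score:
            best_index, best_score = i, score
        elif i - best_index >= wait:
            return best_index, i
    return best_index, len(scores) - 1

# Toy validation accuracy per iteration
val_accuracy = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.75, 0.79]
result = best_iteration_with_patience(val_accuracy, wait=3)
print(result)
```

With a patience of 3, the toy run stops at iteration 5 and keeps iteration 2 as the best, so the later improvement to 0.79 is never reached. A larger patience trades longer training for a lower chance of stopping too early.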

In [None]:
model = fit_model(
    X_train, y_train,
    val_data=(X_valid,y_valid),
    text_features=['text'],
    learning_rate=0.35,
    tokenizers=[
        {
            'tokenizer_id': 'Sense',
            'separator_type': 'BySense',
            'lowercasing': 'True',
            'token_types':['Word', 'Number', 'SentenceBreak'],
            'sub_tokens_policy':'SeveralTokens'
        }
    ],
    dictionaries = [
        {
            'dictionary_id': 'Word',
            'max_dictionary_size': '5000'
        }
    ],
    feature_calcers = [
        'BoW:top_tokens_count=10000'
    ]
)
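
The `dictionaries` and `feature_calcers` arguments describe a bag-of-words representation: a dictionary of frequent tokens is built, and each text is turned into features that indicate which dictionary tokens it contains. The sketch below illustrates that concept in plain Python. CatBoost performs tokenization, dictionary building, and feature calculation internally, so the helper names and the tie-breaking rule here are assumptions for illustration only.

```python
from collections import Counter

def build_dictionary(texts, max_size):
    """Keep the max_size most frequent tokens, mapped to column indices."""
    counts = Counter(token for text in texts for token in text.split())
    # Sort by descending frequency, then alphabetically for a stable order
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {token: idx for idx, (token, _) in enumerate(ranked[:max_size])}

def bag_of_words(text, dictionary):
    """Binary vector: 1 if the dictionary token occurs in the text, else 0."""
    present = set(text.split())
    return [1 if token in present else 0 for token in dictionary]

texts = ["stay home stay safe", "shops closed stay home", "panic buying again"]
dictionary = build_dictionary(texts, max_size=3)
print(dictionary)

features = bag_of_words("stay safe everyone", dictionary)
print(features)
```

Tokens outside the dictionary, such as `safe` and `everyone` in the last call, contribute nothing to the feature vector, which is the trade-off a size limit like `max_dictionary_size` introduces.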

### Initialize ValidMind objects

With the model ready, we can now initialize the training and testing datasets, as well as the model, for sentiment analysis using [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) and [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model):

In [None]:
vm_train_ds = vm.init_dataset(dataset=pd.concat([X_train, y_train], axis=1), type="generic", target_column="sentiment")
vm_test_ds = vm.init_dataset(dataset=pd.concat([X_test, y_test], axis=1), type="generic",target_column="sentiment")
vm_model = vm.init_model(model, train_ds=vm_train_ds, test_ds=vm_test_ds)

#### Run model metrics test suite

Next, we run the `classifier_metrics` test suite on the initialized model to collect performance metrics for binary classification:

In [None]:
model_metrics_test_suite = vm.run_test_suite("classifier_metrics",
                                             inputs = {"model":vm_model}
                                            )
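
For reference, the sketch below computes a few standard binary classification metrics by hand. The notebook does not list which metrics the test suite reports, so treat these as common examples rather than a description of the suite; the function name and toy labels are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy example there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out at 0.75.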

#### Run model validation test suite

And finally, let's run the `classifier_validation` test suite on the initialized model to validate the model's binary classification performance:

In [None]:
model_validation_test_suite = vm.run_test_suite(
    "classifier_validation",
    inputs = {
        "model":vm_model,
        "models":[vm_model]
    }
)

## Next steps

You can look at the results of these test suites right in the notebook where you ran the code, as you would expect. But there is a better way: view the test results as part of your model documentation right in the ValidMind Platform UI:

1. Log back into the [Platform UI](https://app.prod.validmind.ai).

2. Go to **Documentation Projects** > **YOUR_DOCUMENTATION_PROJECT** > **Documentation**.

3. Expand **3. Model Development** > **3.2. Prompt Evaluation**.

What you can see now is a more easily consumable version of the testing you just performed, along with other parts of your documentation project that still need to be completed.

If you want to learn more about where you are in the model documentation process, take a look at [How do I use the framework?](https://docs.validmind.ai/guide/get-started-developer-framework.html#how-do-i-use-the-framework).

