# Sensitivity Analysis
## Natural Language Processing Analysis & Binary Classification using CatBoost
This notebook introduces how to document an NLP model with the ValidMind Developer Framework. The use case is sentiment analysis of COVID-19 tweets, classified as "positive" or "negative"; the model is a binary text classifier built with the CatBoost library.

We will train a sample model and demonstrate the following documentation functionalities:

- Initializing the ValidMind Developer Framework
- Using a sample dataset provided by the library to train a simple NLP classification model with the CatBoost library
- Running various tests to quickly generate documentation about the data and model

## Before you begin

To use the ValidMind Developer Framework with a Jupyter notebook, you need to install and initialize the client library first, along with getting your Python environment ready.

If you don't already have one, you should also [create a documentation project](https://docs.validmind.ai/guide/create-your-first-documentation-project.html) on the ValidMind platform. You will use this project to upload your documentation and test results.

## Install the client library


In [1]:
%pip install --upgrade validmind

## Initialize the client library

In a browser, go to the **Client Integration** page of your documentation project and click **Copy to clipboard** next to the code snippet. This code snippet gives you the API key, API secret, and project identifier to link your notebook to your documentation project.

::: {.column-margin}
::: {.callout-tip}
This step requires a documentation project. [Learn how you can create one](https://docs.validmind.ai/guide/create-your-first-documentation-project.html).
:::
:::

Next, replace this placeholder with your own code snippet:

In [2]:
## Replace with code snippet from your documentation project ##

import validmind as vm

vm.init(
    api_host="https://api.prod.validmind.ai/api/v1/tracking",
    api_key="...",
    api_secret="...",
    project="..."
)

2023-07-06 12:52:29,172 - INFO - api_client - Connected to ValidMind. Project: nlp model sensitivity analysis - Initial Validation (cliop8llc003x32rlklophmdl)
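
Hard-coding the API key and secret makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the credentials are exported as environment variables beforehand. The variable names `VM_API_KEY`, `VM_API_SECRET`, and `VM_PROJECT` and the helper `vm_init_kwargs` are chosen here for illustration only:

```python
import os

def vm_init_kwargs(env=os.environ):
    """Collect vm.init() arguments from environment variables (names are illustrative)."""
    required = {
        "api_key": "VM_API_KEY",
        "api_secret": "VM_API_SECRET",
        "project": "VM_PROJECT",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    kwargs = {arg: env[var] for arg, var in required.items()}
    # Same host as in the snippet above
    kwargs["api_host"] = "https://api.prod.validmind.ai/api/v1/tracking"
    return kwargs

# Usage (requires the validmind package and real credentials):
# import validmind as vm
# vm.init(**vm_init_kwargs())
```

The helper fails early with a clear message when a variable is missing, instead of sending empty credentials to the API.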


## 1. Exploratory Data Analysis of COVID-19 tweets data
This section focuses on in-depth analysis and preprocessing of the text data (tweets). The dataset contains manually tagged COVID-19 tweets in five classes, ranging from Extremely Negative to Extremely Positive. Later in the notebook, these five classes are simplified to two: positive and negative.



### Load libraries

In [3]:
%set_env PYTORCH_MPS_HIGH_WATERMARK_RATIO 0.8

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split


%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

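import torch
# Detect the best available PyTorch device (Apple MPS, then CUDA, then CPU)
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# The detected device is overridden here, so `device` is always "cpu" in this notebook
device = "cpu"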

train_model = True

env: PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.8


### Load COVID-19 tweets data

In [4]:
from validmind.datasets.nlp import twitter_covid_19 as demo_data
df = demo_data.load_data()
df.head(10)

Unnamed: 0,OriginalTweet,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative
5,As news of the regions first confirmed COVID-...,Positive
6,Cashier at grocery store was sharing his insig...,Positive
7,Was at the supermarket today. Didn't buy toile...,Neutral
8,Due to COVID-19 our retail store and classroom...,Positive
9,"For corona prevention,we should stop to buy th...",Negative


### Run text data quality test plan
In this section we use the ValidMind Developer Framework to run various data quality checks on the dataset, and send the results to the model document on the ValidMind Platform UI.

In [5]:
vm_ds = vm.init_dataset(dataset=df, type="generic", text_column='OriginalTweet', target_column="Sentiment")

2023-07-06 12:52:30,028 - INFO - client - Pandas dataset detected. Initializing VM Dataset instance...
2023-07-06 12:52:30,028 - INFO - dataset - Inferring dataset types...


In [6]:
config = {
    "class_imbalance":{"min_percent_threshold": 3}
}
text_data_test_plan = vm.run_test_plan("text_data_quality",
                                       dataset=vm_ds,
                                       config=config)

HBox(children=(Label(value='Running test plan...'), IntProgress(value=0, max=14)))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anilsorathiya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anilsorathiya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


VBox(children=(HTML(value='<h2>Results for <i>Text Data Quality</i> Test Plan:</h2><hr>'), HTML(value='<div cl…
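
The `class_imbalance` entry in `config` sets `min_percent_threshold` to 3. A minimal sketch of what a check of this kind could compute, assuming it flags any class whose share of rows falls below the threshold percentage. This illustrates the idea and is not ValidMind's implementation; `class_shares` and `failing_classes` are hypothetical helpers:

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the rows, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: 100.0 * n / total for cls, n in counts.items()}

def failing_classes(labels, min_percent_threshold=3):
    """List classes whose share is below the threshold (assumed pass/fail rule)."""
    shares = class_shares(labels)
    return sorted(cls for cls, pct in shares.items() if pct < min_percent_threshold)

# Toy labels: 'Rare' makes up 2% of the rows
labels = ["Positive"] * 60 + ["Negative"] * 38 + ["Rare"] * 2
print(class_shares(labels))
print(failing_classes(labels))
```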

## 2. Preprocess data

### Handle class imbalance
One way to handle class imbalance is to merge a class with a related class, for example folding "Extremely Positive" into "positive".
Here, we copy the text and class labels into separate columns so that the originals remain available for comparison.

In [7]:
print("Original Classes:", df.Sentiment.unique())

df['text'] = df.OriginalTweet
df["text"] = df["text"].astype(str)

def classes_def(x):
    """Collapse the five original labels into positive, negative, or neutral."""
    if x in ("Extremely Positive", "Positive"):
        return "positive"
    elif x in ("Extremely Negative", "Negative"):
        return "negative"
    else:
        return "neutral"

df['sentiment'] = df['Sentiment'].apply(classes_def)
target = df['sentiment']

print(df.sentiment.value_counts(normalize= True))
print("Modified Classes:", df.sentiment.unique())

Original Classes: ['Neutral' 'Positive' 'Extremely Negative' 'Negative' 'Extremely Positive']
positive    0.435814
negative    0.378846
neutral     0.185341
Name: sentiment, dtype: float64
Modified Classes: ['neutral' 'positive' 'negative']
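
The same five-to-three mapping can also be written as a lookup table, which keeps the label pairs in one place. A minimal sketch in plain Python; `LABEL_MAP` and `collapse_label` are illustrative names, and unknown labels fall back to "neutral" just as the `else` branch of `classes_def` does:

```python
from collections import Counter

# Illustrative lookup table mirroring the branches of classes_def
LABEL_MAP = {
    "Extremely Positive": "positive",
    "Positive": "positive",
    "Extremely Negative": "negative",
    "Negative": "negative",
}

def collapse_label(label):
    """Map a five-class label to positive/negative; anything else becomes neutral."""
    return LABEL_MAP.get(label, "neutral")

sample = ["Neutral", "Positive", "Extremely Negative", "Negative", "Extremely Positive"]
collapsed = [collapse_label(s) for s in sample]
print(collapsed)
print(Counter(collapsed))
```

With pandas, `df['Sentiment'].map(LABEL_MAP).fillna('neutral')` would apply the same table to the whole column.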


### Remove neutral class

In [8]:
df = df[df["sentiment"] != "neutral"].copy()  # copy so later column assignments do not act on a slice of the original frame
print(df.sentiment.unique())
print(df.sentiment.value_counts(normalize= True))
print(df.shape)

['positive' 'negative']
positive    0.534964
negative    0.465036
Name: sentiment, dtype: float64
(36623, 4)
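
Dropping the neutral rows rescales the remaining shares so that they sum to one again. A quick check of the arithmetic, using the proportions printed before the neutral class was removed:

```python
# Shares printed before removing the neutral class
positive, negative = 0.435814, 0.378846

# Renormalize over the rows that remain
remaining = positive + negative
positive_share = positive / remaining
negative_share = negative / remaining

print(round(positive_share, 4), round(negative_share, 4))
```

The results agree with the 0.534964 and 0.465036 reported above, up to rounding of the inputs.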


In [9]:
df

Unnamed: 0,OriginalTweet,Sentiment,text,sentiment
1,advice Talk to your neighbours family to excha...,Positive,advice Talk to your neighbours family to excha...,positive
2,Coronavirus Australia: Woolworths to give elde...,Positive,Coronavirus Australia: Woolworths to give elde...,positive
3,My food stock is not the only one which is emp...,Positive,My food stock is not the only one which is emp...,positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,"Me, ready to go at supermarket during the #COV...",negative
5,As news of the regions first confirmed COVID-...,Positive,As news of the regions first confirmed COVID-...,positive
...,...,...,...,...
44949,"@RicePolitics @MDCounties Craig, will you call...",Negative,"@RicePolitics @MDCounties Craig, will you call...",negative
44950,Meanwhile In A Supermarket in Israel -- People...,Positive,Meanwhile In A Supermarket in Israel -- People...,positive
44951,Did you panic buy a lot of non-perishable item...,Negative,Did you panic buy a lot of non-perishable item...,negative
44953,Gov need to do somethings instead of biar je r...,Extremely Negative,Gov need to do somethings instead of biar je r...,negative


### Remove URLs and HTML tags

In [10]:
# Remove URLs and HTML tags
import re

def remove_urls(text):
    """Strip http(s):// and www. links."""
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub('', text)

df['text'] = df['text'].apply(remove_urls)

def remove_html(text):
    """Strip HTML tags such as <br>."""
    html_pattern = re.compile(r'<.*?>')
    return html_pattern.sub('', text)

df['text'] = df['text'].apply(remove_html)

### Convert text to lower case 


In [11]:
# Lower casing
def lower(text):
    low_text= text.lower()
    return low_text
df['text']=df['text'].apply(lambda x:lower(x))


### Remove numbers 

In [12]:
# Number removal
def remove_num(text):
    remove= re.sub(r'\d+', '', text)
    return remove
df['text']=df['text'].apply(lambda x:remove_num(x))


### Remove stopwords 

In [13]:
# Remove stopwords
from nltk.corpus import stopwords

# Build the set once so membership checks are fast
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    """Remove English stopwords from whitespace-separated text."""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
df['text']=df['text'].apply(lambda x:remove_stopwords(x))

### Remove punctuation

In [14]:
# Remove punctuation (every character that is not a word character or whitespace)

def punct_remove(text):
    punct = re.sub(r"[^\w\s\d]","", text)
    return punct
df['text']=df['text'].apply(lambda x:punct_remove(x))


### Remove mentions 

In [15]:
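# Remove mentions
# Note: the punctuation step above has already stripped "@", so this pattern
# no longer matches and the handle text stays in the tweet (for example
# "ricepolitics mdcounties" in the output below). Run this step before
# punctuation removal if handles should be dropped.
def remove_mention(x):
    text=re.sub(r'@\w+','',x)
    return text
df['text']=df['text'].apply(lambda x:remove_mention(x))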


### Remove hashtags 

In [16]:
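# Remove hashtags
# Note: "#" was also stripped by the punctuation step, so hashtag words such
# as "covid" remain in the text. Run this step before punctuation removal if
# hashtags should be dropped.
def remove_hash(x):
    text=re.sub(r'#\w+','',x)
    return text
df['text']=df['text'].apply(lambda x:remove_hash(x))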

### Remove extra whitespace left by the previous steps

In [17]:
# Collapse runs of whitespace and trim both ends
def remove_space(text):
    space_remove = re.sub(r"\s+"," ",text).strip()
    return space_remove
df['text']=df['text'].apply(lambda x:remove_space(x))
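
The cleaning steps above run one after another, and their order matters: once punctuation has been stripped, the `@` and `#` markers are gone, so the mention and hashtag patterns can no longer match. A minimal sketch of a single function that applies equivalent patterns with mentions and hashtags handled first and stopwords filtered last. `clean_tweet` and the small `DEMO_STOPWORDS` set are illustrative stand-ins (the notebook itself uses NLTK's English stopword list), and whether hashtag words should be dropped or kept is a modeling choice:

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption, keeps the example self-contained)
DEMO_STOPWORDS = {"to", "the", "at", "a", "is", "and", "of"}

def clean_tweet(text, stopwords=DEMO_STOPWORDS):
    """Apply the cleaning regexes in an order that keeps @ and # usable."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>', '', text)                   # HTML tags
    text = re.sub(r'@\w+', '', text)                    # mentions, before punctuation removal
    text = re.sub(r'#\w+', '', text)                    # hashtags, before punctuation removal
    text = text.lower()
    text = re.sub(r'\d+', '', text)                     # numbers
    text = re.sub(r'[^\w\s]', '', text)                 # punctuation
    words = [w for w in text.split() if w not in stopwords]
    return " ".join(words)                              # joining also collapses extra whitespace

print(clean_tweet("@Shopper Went to the store at 9am! <br> #COVID19 https://t.co/abc Stay safe."))
```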

In [18]:
df

Unnamed: 0,OriginalTweet,Sentiment,text,sentiment
1,advice Talk to your neighbours family to excha...,Positive,advice talk neighbours family exchange phone n...,positive
2,Coronavirus Australia: Woolworths to give elde...,Positive,coronavirus australia woolworths give elderly ...,positive
3,My food stock is not the only one which is emp...,Positive,food stock one empty please panic enough food ...,positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,me ready go supermarket covid outbreak im para...,negative
5,As news of the regions first confirmed COVID-...,Positive,news regions first confirmed covid case came s...,positive
...,...,...,...,...
44949,"@RicePolitics @MDCounties Craig, will you call...",Negative,ricepolitics mdcounties craig call general ass...,negative
44950,Meanwhile In A Supermarket in Israel -- People...,Positive,meanwhile supermarket israel people dance sing...,positive
44951,Did you panic buy a lot of non-perishable item...,Negative,panic buy lot nonperishable items echo needs f...,negative
44953,Gov need to do somethings instead of biar je r...,Extremely Negative,gov need somethings instead biar je rakyat ass...,negative


### Run text data quality tests again
Here, we run the data quality tests again to verify that the preprocessing worked and that the tests pass according to our requirements.

In [19]:
vm_ds = vm.init_dataset(dataset=df, type="generic", text_column='text', target_column="sentiment")

config = {
    "class_imbalance":{"min_percent_threshold": 3}
}
text_data_test_plan = vm.run_test_plan("text_data_quality",
                                       dataset=vm_ds,
                                       config=config)

2023-07-06 12:52:42,024 - INFO - client - Pandas dataset detected. Initializing VM Dataset instance...
2023-07-06 12:52:42,024 - INFO - dataset - Inferring dataset types...


HBox(children=(Label(value='Running test plan...'), IntProgress(value=0, max=14)))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anilsorathiya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anilsorathiya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


VBox(children=(HTML(value='<h2>Results for <i>Text Data Quality</i> Test Plan:</h2><hr>'), HTML(value='<div cl…

## 3. Modeling

### Training, validation, test

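With the data cleaned, we split it into training, validation, and test sets. First, 33% of the rows are held out for testing. The remaining 67% are then split 70/30 into training and validation sets (about 47% and 20% of the full data), and this second split is stratified by sentiment.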

In [20]:

# Neutral rows were already dropped above, so this filter is only a safeguard
df = df[df['sentiment'] != "neutral"]
# Encode the target as integers: positive -> 1, negative -> 0
df.loc[df['sentiment'] == "positive", 'sentiment'] = 1
df.loc[df['sentiment'] == "negative", 'sentiment'] = 0
print(np.unique(df['sentiment']))

print(df.head())
# Hold out 33% of the rows as the test set
train, test = train_test_split(df[['text','sentiment']], test_size=0.33, random_state=42)

# Split the remainder 70/30 into training and validation sets, stratified by label
train, valid = train_test_split(
    train,
    train_size=0.7,
    random_state=0,
    stratify=train['sentiment'])
y_train, X_train = \
    train['sentiment'], train.drop(['sentiment'], axis=1)
y_valid, X_valid = \
    valid['sentiment'], valid.drop(['sentiment'], axis=1)
y_test, X_test= \
    test['sentiment'], test.drop(['sentiment'], axis=1)

[0 1]
                                       OriginalTweet           Sentiment  \
1  advice Talk to your neighbours family to excha...            Positive   
2  Coronavirus Australia: Woolworths to give elde...            Positive   
3  My food stock is not the only one which is emp...            Positive   
4  Me, ready to go at supermarket during the #COV...  Extremely Negative   
5  As news of the regions first confirmed COVID-...            Positive   

                                                text sentiment  
1  advice talk neighbours family exchange phone n...         1  
2  coronavirus australia woolworths give elderly ...         1  
3  food stock one empty please panic enough food ...         1  
4  me ready go supermarket covid outbreak im para...         0  
5  news regions first confirmed covid case came s...         1  
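
The two `train_test_split` calls carve the 36,623 rows into three parts. A rough sketch of the resulting sizes, assuming scikit-learn rounds a fractional `test_size` up and a fractional `train_size` down (this is our understanding of its behavior and is worth verifying against the library); `split_sizes` is a hypothetical helper:

```python
import math

def split_sizes(n_rows, test_size=0.33, train_size=0.7):
    """Estimate (train, valid, test) row counts for the two-stage split."""
    n_test = math.ceil(test_size * n_rows)      # first split: held-out test set
    n_rest = n_rows - n_test
    n_train = math.floor(train_size * n_rest)   # second split: training portion
    n_valid = n_rest - n_train
    return n_train, n_valid, n_test

n_train, n_valid, n_test = split_sizes(36623)
print(n_train, n_valid, n_test)
print([round(n / 36623, 3) for n in (n_train, n_valid, n_test)])
```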


### Build model

In [21]:
def fit_model(X_train, y_train, val_data, **kwargs):
    """Fit a CatBoost classifier and keep the iteration with the best validation accuracy."""
    model = CatBoostClassifier(
        task_type='CPU',
        iterations=5000,
        eval_metric='Accuracy',
        od_type='Iter',
        od_wait=500,
        **kwargs
    )
    return model.fit(
        X=X_train,
        y=y_train,
        eval_set=val_data,
        verbose=100,
        plot=True,
        use_best_model=True
        )
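
The `od_type='Iter'` and `od_wait=500` arguments turn on CatBoost's overfitting detector, which stops training once the validation metric has gone a set number of iterations without improving. A simplified sketch of that stopping rule in plain Python, written to illustrate the idea rather than to reproduce CatBoost's implementation; `stop_iteration` is a hypothetical helper and the scores are made up:

```python
def stop_iteration(valid_scores, wait):
    """Return (stop_at, best_iter) for a patience-based stopping rule.

    Training halts once `wait` iterations have passed since the best score
    (higher is better, as with accuracy). Otherwise it runs to the end.
    """
    best_iter, best_score = 0, float("-inf")
    for i, score in enumerate(valid_scores):
        if score > best_score:
            best_iter, best_score = i, score
        elif i - best_iter >= wait:
            return i, best_iter
    return len(valid_scores) - 1, best_iter

# Made-up validation accuracies: the best value is at iteration 2
scores = [0.60, 0.70, 0.75, 0.74, 0.73, 0.74, 0.72]
print(stop_iteration(scores, wait=3))
```

With `use_best_model=True`, the fitted model is then cut back to the best iteration rather than the last one trained.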

In [22]:
model = fit_model(
    X_train, y_train,
    val_data=(X_valid,y_valid),
    text_features=['text'],
    learning_rate=0.35,
    tokenizers=[
        {
            'tokenizer_id': 'Sense',
            'separator_type': 'BySense',
            'lowercasing': 'True',
            'token_types':['Word', 'Number', 'SentenceBreak'],
            'sub_tokens_policy':'SeveralTokens'
        }      
    ],
    dictionaries = [
        {
            'dictionary_id': 'Word',
            'max_dictionary_size': '5000'
        }
    ],
    feature_calcers = [
        'BoW:top_tokens_count=10000'
    ]
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

0:	learn: 0.6037263	test: 0.6062211	best: 0.6062211 (0)	total: 101ms	remaining: 8m 25s
100:	learn: 0.8601456	test: 0.8288509	best: 0.8291225 (98)	total: 4.36s	remaining: 3m 31s
200:	learn: 0.9101601	test: 0.8515349	best: 0.8528932 (193)	total: 8.52s	remaining: 3m 23s
300:	learn: 0.9349054	test: 0.8562891	best: 0.8564249 (297)	total: 12.8s	remaining: 3m 19s
400:	learn: 0.9557496	test: 0.8566965	best: 0.8583265 (348)	total: 17s	remaining: 3m 15s
500:	learn: 0.9712955	test: 0.8569682	best: 0.8583265 (348)	total: 21.2s	remaining: 3m 10s
600:	learn: 0.9795633	test: 0.8576474	best: 0.8583265 (348)	total: 25.3s	remaining: 3m 5s
700:	learn: 0.9865502	test: 0.8533007	best: 0.8583265 (348)	total: 29.6s	remaining: 3m 1s
800:	learn: 0.9915575	test: 0.8572399	best: 0.8583265 (348)	total: 33.8s	remaining: 2m 57s
900:	learn: 0.9949345	test: 0.8603640	best: 0.8610432 (869)	total: 38.1s	remaining: 2m 53s
1000:	learn: 0.9970306	test: 0.8604999	best: 0.8610432 (869)	total: 42.4s	remaining: 2m 49s
1100:	l
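
The `dictionaries` and `feature_calcers` settings ask CatBoost to build a word dictionary and derive bag-of-words features from it. A toy sketch of the general bag-of-words idea, not CatBoost's internal implementation: keep the most frequent tokens up to a size limit, then mark which of them occur in each text. The names `build_dictionary` and `bow_vector` are hypothetical:

```python
from collections import Counter

def build_dictionary(texts, max_size):
    """Keep the `max_size` most frequent whitespace tokens, most frequent first."""
    counts = Counter(token for text in texts for token in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:max_size]]

def bow_vector(text, dictionary):
    """Binary indicator per dictionary token: 1 if the token occurs in the text."""
    tokens = set(text.split())
    return [1 if token in tokens else 0 for token in dictionary]

texts = ["panic buy food", "food stock empty", "please panic"]
dictionary = build_dictionary(texts, max_size=3)
print(dictionary)
print([bow_vector(t, dictionary) for t in texts])
```

A smaller dictionary limit drops rare tokens, which trades some detail for fewer and denser features.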

### Initialize ValidMind objects

In [23]:
vm_train_ds = vm.init_dataset(dataset=pd.concat([X_train, y_train], axis=1), type="generic", target_column="sentiment")
vm_test_ds = vm.init_dataset(dataset=pd.concat([X_test, y_test], axis=1), type="generic",target_column="sentiment")
vm_model = vm.init_model(model, train_ds=vm_train_ds, test_ds=vm_test_ds)

2023-07-06 12:54:24,980 - INFO - client - Pandas dataset detected. Initializing VM Dataset instance...
2023-07-06 12:54:24,982 - INFO - dataset - Inferring dataset types...
2023-07-06 12:54:24,998 - INFO - client - Pandas dataset detected. Initializing VM Dataset instance...
2023-07-06 12:54:24,999 - INFO - dataset - Inferring dataset types...


#### Run model metrics test plan

In [24]:
model_metrics_test_plan = vm.run_test_plan("binary_classifier_metrics", 
                                             model=vm_model
                                            )

HBox(children=(Label(value='Running test plan...'), IntProgress(value=0, max=20)))

2023-07-06 12:54:25,525 - INFO - PermutationFeatureImportance - Skiping PFI for catboost models
2023-07-06 12:54:25,555 - INFO - PopulationStabilityIndex - Skiping PSI for catboost models
2023-07-06 12:54:25,556 - INFO - SHAPGlobalImportance - Skiping SHAP for catboost models


VBox(children=(HTML(value='<h2>Results for <i>Binary Classifier Metrics</i> Test Plan:</h2><hr>'), HTML(value=…
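
Test plans for binary classifiers typically report metrics derived from the confusion matrix. A minimal sketch of how accuracy, precision, recall, and F1 follow from the confusion counts, using plain Python and made-up labels; `binary_metrics` is a hypothetical helper and not part of ValidMind:

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```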

#### Run model validation test plan

In [25]:
model_validation_test_plan = vm.run_test_plan("binary_classifier_validation", 
                                             model=vm_model
                                            )

HBox(children=(Label(value='Running test plan...'), IntProgress(value=0, max=8)))

VBox(children=(HTML(value='<h2>Results for <i>Binary Classifier Validation</i> Test Plan:</h2><hr>'), HTML(val…