## Overview
This notebook applies several NLP techniques to this competition, where we are required to **estimate** the output. The input data is a CSV file containing various features (also called columns or predictors). There is also a response variable, which must be predicted and submitted to the competition.

Here we will look at some of the NLP techniques applicable to this scenario:
- CountVectorizer
- TfidfVectorizer

Before that, some visualizations are produced to help in understanding the underlying data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem.snowball import SnowballStemmer
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
stemmer = SnowballStemmer("english")
df_master  = pd.read_csv('/kaggle/input/feedback-prize-effectiveness/train.csv')

The following steps are required whenever we are presented with a text-related problem:
- The first step in any scenario is to get the data. Here, the data is stored in a CSV file, so we load it first.
- Basic exploratory analysis to understand the data distribution.
- In text processing, the data needs to be converted to numbers so that it can be fed into various algorithms for building models. We will start with a basic approach, using CountVectorizer.
- To achieve this, a special data structure called a **corpus** is created.
- Note that some basic preprocessing is applied to the text. Text data is likely to be dirty because of jargon and misspellings, so this step is useful and necessary.

In [None]:

ps = PorterStemmer()
# Keep 'not' in the text: negation can change the meaning of a sentence
all_stopwords = set(stopwords.words('english')) - {'not'}
corpus = []
for text in df_master['discourse_text']:
  review = re.sub('[^a-zA-Z]', ' ', text)  # replace non-letters with spaces
  review = review.lower().split()
  # Drop stop words and reduce the remaining words to their stems
  review = [ps.stem(word) for word in review if word not in all_stopwords]
  corpus.append(' '.join(review))

## Basic Exploratory Analysis

In [None]:
# Basic exploration
plt.figure(figsize=(12,8))
sns.countplot(x="discourse_effectiveness", data=df_master)
plt.ylabel('Count', fontsize=12)
plt.xlabel('discourse_effectiveness', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of discourse_effectiveness", fontsize=15)
plt.show()

In [None]:
sw = set(STOPWORDS)
def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=sw,
        max_words=200,
        max_font_size=40, 
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(' '.join(map(str, data)))  # join every text; str(Series) would give only a truncated preview

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)
    plt.imshow(wordcloud)
    plt.show()

## Examining Wordclouds

Wordclouds are a convenient way to visualize text data. They give us a visual representation of the most frequently occurring words, which may help in choosing a corpus.
We observe that the wordclouds for the three categories are very different.
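
A wordcloud is essentially a picture of a word-frequency table, so it can help to look at the counts directly. The following is only a rough sketch: the sample sentences, the tiny stop-word set, and the helper name `top_words` are made up for illustration and are not part of the competition data.

```python
import re
from collections import Counter

# Made-up sample texts and stop words, for illustration only
sample_texts = [
    "Students should have phones in school",
    "Phones distract students in class",
]
sample_stop_words = {"in", "should", "have"}

def top_words(texts, stop_words, n=3):
    """Return the n most frequent lowercase words that are not stop words."""
    words = re.findall(r"[a-z]+", " ".join(texts).lower())
    return Counter(w for w in words if w not in stop_words).most_common(n)

print(top_words(sample_texts, sample_stop_words, n=2))
```

The words with the highest counts are the ones a wordcloud would draw largest.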

### WordCloud for terms appearing in the `Adequate` category

In [None]:
title = "Adequate"
df = df_master[df_master['discourse_effectiveness'] == title]
show_wordcloud(df['discourse_text'], title )

### WordCloud for terms appearing in the `Ineffective` category

In [None]:
title = "Ineffective"
df = df_master[df_master['discourse_effectiveness'] == title]
show_wordcloud(df['discourse_text'], title )

### WordCloud for terms appearing in the `Effective` category

In [None]:
title = "Effective"
df = df_master[df_master['discourse_effectiveness'] == title]
show_wordcloud(df['discourse_text'], title )

## Feature Building using CountVectorizer
Once the corpus is created, we build the features using **CountVectorizer**. This is a simple yet powerful technique: each document is represented by the counts of the words (terms) it contains. Because most documents use only a small part of the vocabulary, the resulting matrix is mostly zeros, so the concept of a sparse matrix is also useful to understand here.
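
To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix. It is not scikit-learn's implementation; it only imitates CountVectorizer's default behaviour of lowercasing, keeping tokens of two or more word characters, and sorting the vocabulary. The toy documents and the helper name `bag_of_words` are assumptions made for this example.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Build a document-term count matrix over a sorted vocabulary."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

toy_docs = ["the cat sat", "the cat saw the dog"]  # made-up documents
toy_vocab, toy_matrix = bag_of_words(toy_docs)
print(toy_vocab)
for row in toy_matrix:
    print(row)
```

Each row is one document and each column is one vocabulary term. With a real corpus most entries are zero, which is why scikit-learn returns a sparse matrix from `fit_transform` and why `.toarray()` is needed to get a dense array.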

In [None]:
cv = CountVectorizer(max_features = 500)

## Machine Learning
Once feature building is done, machine learning can be performed because the dataset is in the desired format. Here we split the dataset into training and testing samples. The testing samples help determine the performance of the model.

In [None]:
X = cv.fit_transform(corpus).toarray()
y = df_master.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.10, random_state = 0)

## Gaussian Naive Bayes Algorithm
Gaussian Naive Bayes is a special type of NB algorithm. It is used specifically when the features have continuous values, and it assumes that each feature follows a Gaussian (i.e., normal) distribution within each class.
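
The following sketch shows the calculation behind this idea in plain Python, on made-up data. It is a simplified illustration, not scikit-learn's `GaussianNB`: the function names, the toy data, and the fixed `eps` added to each variance are assumptions for this example.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def fit_gnb(X, y, eps=1e-9):
    """Estimate the prior and per-feature mean/variance for each class."""
    params = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # eps keeps every variance strictly positive
        variances = [sum((v - m) ** 2 for v in c) / len(c) + eps
                     for c, m in zip(cols, means)]
        params[label] = (len(rows) / len(X), means, variances)
    return params

def predict_gnb(params, x):
    """Return the class with the highest log prior plus log likelihood."""
    def score(label):
        prior, means, variances = params[label]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
    return max(params, key=score)

# Made-up two-feature data with two classes
X_toy = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.2]]
y_toy = ["a", "a", "b", "b"]
gnb_params = fit_gnb(X_toy, y_toy)
print(predict_gnb(gnb_params, [1.1, 2.0]), predict_gnb(gnb_params, [3.1, 4.1]))
```

The "naive" part is the sum over features: each feature contributes its own log likelihood independently of the others.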

In [None]:
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

## Evaluating the model based on CountVectorizer strategy
After the model is built, it is evaluated on the test set. This model shows an accuracy of about 32 percent. The confusion matrix is also printed for a better understanding of the model.
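
As a small illustration of how the two numbers relate, the sketch below builds a confusion matrix by hand and derives accuracy from its diagonal. The labels and predictions are made up and do not come from the model in this notebook. Rows are true classes and columns are predicted classes, which is the same layout scikit-learn's `confusion_matrix` uses.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns predicted."""
    index = {lab: i for i, lab in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Accuracy is the diagonal (correct predictions) divided by the total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels and predictions, for illustration only
toy_labels = ["Adequate", "Effective", "Ineffective"]
toy_true = ["Adequate", "Adequate", "Effective", "Ineffective", "Effective"]
toy_pred = ["Adequate", "Effective", "Effective", "Adequate", "Effective"]

toy_cm = build_confusion_matrix(toy_true, toy_pred, toy_labels)
print(toy_cm)
print(accuracy_from_matrix(toy_cm))
```

Off-diagonal cells show which classes are confused with each other, something a single accuracy figure hides.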

In [None]:
cm = confusion_matrix(y_test, y_pred)
score = accuracy_score(y_test, y_pred)
print("Confusion Matrix")
print(cm)
print(score)

## Creating the Submission File
Since this is a competition, we need a file called `submission.csv` that can be scored. The following steps run the prediction on the test set and generate the submission file.

In [None]:
df_submit  = pd.read_csv('/kaggle/input/feedback-prize-effectiveness/test.csv')
submit_corpus = []
for i in range(0, len(df_submit)):
  review = re.sub('[^a-zA-Z]', ' ', df_submit['discourse_text'][i])  # read from the test set, not the training set
  review = review.lower()
  review = review.split()
  ps = PorterStemmer()
  all_stopwords = stopwords.words('english')
  all_stopwords.remove('not')
  review = [ps.stem(word) for word in review if not word in set(all_stopwords)]
  review = ' '.join(review)
  submit_corpus.append(review)

X_submit = cv.transform(submit_corpus).toarray()
preds = classifier.predict_proba(X_submit)
df_res = pd.DataFrame(preds)
df_res.columns  = classifier.classes_
pd.concat([df_submit['discourse_id'], df_res], axis = 1).to_csv("submission.csv", index = False)

## Technique 2 - TF-IDF based text analyzer
tf-idf is another popular technique that we will look at. As per Wikipedia:

In information retrieval, tf–idf short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.
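
The definition above can be worked through by hand. This sketch uses the plain textbook form, tf × log(N / df), on made-up tokenized documents; the function name `tf_idf` and the toy data are assumptions for this example. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Made-up tokenized documents, for illustration only
toy_corpus = [["cat", "sat"], ["cat", "saw", "dog"], ["dog", "barked"]]

def tf_idf(term, doc, docs):
    """Textbook tf-idf; assumes the term occurs in at least one document."""
    tf = doc.count(term) / len(doc)          # share of the document taken by the term
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / df)           # rarer terms get a larger weight
    return tf * idf

print(tf_idf("cat", toy_corpus[0], toy_corpus))
print(tf_idf("sat", toy_corpus[0], toy_corpus))
```

Both words appear once in the first document, but "sat" occurs in only one document while "cat" occurs in two, so "sat" receives the higher weight.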

In [None]:
df_master  = pd.read_csv('/kaggle/input/feedback-prize-effectiveness/train.csv')
def clean_tokenize_orig(document):
  review = re.sub('[^a-zA-Z]', ' ', document)
  review = review.lower()
  review = review.split()
  ps = PorterStemmer()
  all_stopwords = stopwords.words('english')
  all_stopwords.remove('not')
  review = [ps.stem(word) for word in review if not word in set(all_stopwords)]
  review = ' '.join(review)
  return(review)

## Loading and Cleaning Data

In [None]:
df_master = df_master[['discourse_id','discourse_text','discourse_effectiveness']].dropna()
texts = df_master['discourse_text'].tolist()
texts[0] #one instance of dataset.
cleaned_texts = list(map(clean_tokenize_orig, texts))

## Feature creation using TFIDF vectorizer and Model Creation

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_array = tfidf_vectorizer.fit_transform(cleaned_texts)
X = tfidf_array.toarray()
y = df_master.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.10, random_state = 0)
classifier = GaussianNB()
classifier.fit(X_train, y_train)

## Evaluating Model Performance

In [None]:
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
score = accuracy_score(y_test, y_pred)
print("Confusion Matrix")
print(cm)
print(round(score,2))

We observe that the model built with CountVectorizer outperforms the one built with TfidfVectorizer by a margin of 26 percent, which is huge. However, this should not undermine the effectiveness of tf-idf, which is applicable in many areas and should always be considered.

That's it. We have seen a couple of basic NLP techniques for analyzing text data. I will be adding more in the coming days. Stay tuned.

## Using Transformers

In [None]:
from transformers import TrainingArguments,Trainer
from transformers import AutoModelForSequenceClassification,AutoTokenizer
import datasets
from datasets import load_dataset, Dataset, DatasetDict

In [None]:
model_nm = 'microsoft/deberta-v3-small'
tokz = AutoTokenizer.from_pretrained(model_nm)

In [None]:
ds = Dataset.from_pandas(df_master)

In [None]:
ds

In [None]:
def tok_func(x): return tokz(x["discourse_text"])
tok_ds = ds.map(tok_func, batched=True)

In [None]:
row = tok_ds[0]
row['discourse_text'], row['input_ids']

In [None]:
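label2score = {'Ineffective': 0.0, 'Adequate': 1.0, 'Effective': 2.0}  # assumed ordinal float targets for the regression head (num_labels=1)
tok_ds = tok_ds.map(lambda x: {'labels': label2score[x['discourse_effectiveness']]}, remove_columns=['discourse_effectiveness'])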

In [None]:
tok_ds.column_names

In [None]:
tok_ds = tok_ds.remove_columns(['discourse_id', 'discourse_text'])  # drop only the raw columns; keep the tokenizer outputs and labels

In [None]:
dds = tok_ds.train_test_split(0.20, seed=0)
dds

In [None]:
tok_ds = tok_ds.remove_columns(dds["train"].column_names)

In [None]:
from transformers import TrainingArguments,Trainer

In [None]:
bs = 128
epochs = 4

In [None]:
lr = 8e-5

In [None]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=False,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

In [None]:
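def corr(x, y): return np.corrcoef(np.ravel(x), np.ravel(y))[0, 1]  # Pearson correlation of predictions and labels
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}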

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

In [None]:
trainer.train();