# Coronavirus Tweets NLP - Text Classification
This notebook builds a text classification engine on the Coronavirus Tweets NLP - Text Classification dataset, which contains around 41,157 tweets. The engine works as follows: given a tweet, it cleans the text and then classifies the tweet as positive, negative or neutral.

The Notebook is organised as follows.

**1.Text Cleaning**

* Removing URLs
* Removing HTML tags
* Removing Numbers/Digits
* Removing Punctuation
* Removing Mentions
* Removing Hashtags
* Removing extra spaces

**2.Converting Text to Numerical Vector**  
* TF-IDF

**3.Modeling**
* MultinomialNB
* Random Forest
* SGD Classifier
* XGBoost

**4.Conclusion**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Libraries

In [None]:
# Importing the libraries
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import nltk
import string
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import metrics
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:
train_dataset = pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv",encoding="latin")
test_dataset = pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_test.csv",encoding="latin")

In [None]:
print(train_dataset.shape)
print(test_dataset.shape)

In [None]:
print(train_dataset.columns)
print(test_dataset.columns)

In [None]:
train_dataset["Sentiment"].unique()

Here, there are five classes: 'Neutral', 'Positive', 'Extremely Negative', 'Negative', 'Extremely Positive'.<br>
Extremely Negative & Negative are encoded as 0.<br>
Neutral is encoded as 1.<br>
Extremely Positive & Positive are encoded as 2.

In [None]:
def classes_def(x):
    if x ==  "Extremely Positive":
        return "2"
    elif x == "Extremely Negative":
        return "0"
    elif x == "Negative":
        return "0"
    elif x ==  "Positive":
        return "2"
    else:
        return "1"
    

train_dataset['class']=train_dataset['Sentiment'].apply(lambda x:classes_def(x))

In [None]:
train_dataset["class"].value_counts(normalize= True)

# Text Cleaning

In [None]:
from bs4 import BeautifulSoup
STOPWORDS = set(stopwords.words('english'))

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

from tqdm import tqdm
preprocessed_tweets = []
# tqdm is for printing the status bar
for sentance in tqdm(train_dataset['OriginalTweet'].values):
    sentance = re.sub(r'https?://\S+|www\.\S+', r'', sentance) # remove URLs
    sentance = re.sub(r'<.*?>', r'', sentance) # remove HTML tags
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    # mentions and hashtags are removed before punctuation is stripped,
    # otherwise the leading '@' / '#' is already gone and the patterns never match
    sentance = re.sub(r'@\w+','', sentance) # remove mentions
    sentance = re.sub(r'#\w+','', sentance) # remove hashtags
    sentance = re.sub(r'\d+', '', sentance).strip() # remove numbers
    sentance = re.sub(r"[^\w\s\d]","", sentance) # remove punctuation
    sentance = re.sub(r"\s+"," ", sentance).strip() # collapse extra spaces
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip() # drop any leftover tokens containing digits
    sentance = re.sub(r'[^A-Za-z]+', ' ', sentance) # keep letters only
    
    # lowercase and drop English stopwords
    sentance = ' '.join([e.lower() for e in sentance.split() if e.lower() not in STOPWORDS])
    preprocessed_tweets.append(sentance.strip())
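
One subtlety of regex-based cleaning is the order of the steps. If punctuation is stripped before mentions and hashtags, the `@` and `#` markers are already gone and the mention/hashtag patterns no longer match anything. The sketch below illustrates this on a made-up tweet; the two helper functions are hypothetical and only exist for this comparison.

```python
import re

def strip_punct_first(text):
    # Punctuation goes first, so '@' and '#' vanish here ...
    text = re.sub(r"[^\w\s]", "", text)
    # ... and these two patterns have nothing left to match.
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"#\w+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def strip_mentions_first(text):
    # Mentions and hashtags are removed while their markers are still present.
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"#\w+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "@user stay safe! #covid19 is spreading"  # made-up example tweet
print(strip_punct_first(sample))     # user stay safe covid19 is spreading
print(strip_mentions_first(sample))  # stay safe is spreading
```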

# TF-IDF

In [None]:
tf_idf_vect = TfidfVectorizer(min_df=10)
tf_idf_vect.fit(preprocessed_tweets)
# get_feature_names_out() replaces get_feature_names(), which was removed in recent scikit-learn releases
print("some sample features (unique words in the corpus)",tf_idf_vect.get_feature_names_out()[0:10])
print('='*50)

final_tf_idf = tf_idf_vect.transform(preprocessed_tweets)
print("the type of the TF-IDF matrix ",type(final_tf_idf))
print("the shape of our text TF-IDF matrix ",final_tf_idf.get_shape())
print("the number of unique words (unigrams only, the TfidfVectorizer default) ", final_tf_idf.get_shape()[1])

In [None]:
X = final_tf_idf
y = train_dataset["class"].tolist()

X_train, X_test, y_train, y_test = train_test_split(X.tocsr(), y, test_size= 0.33, stratify=y,  random_state=42)
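
The cells above fit the vectorizer on all tweets and split afterwards, so the vocabulary and IDF weights also see the held-out rows. One possible alternative, sketched here on made-up toy data rather than the real dataset, is to split the raw text first and wrap the vectorizer and the model in a scikit-learn `Pipeline`, so TF-IDF is learned from the training split only. The toy tweets, labels and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up, already-cleaned tweets: 4 per class ("0" negative, "1" neutral, "2" positive).
toy_tweets = [
    "prices are rising and shelves are empty",
    "panic buying is terrible for everyone",
    "scammers are exploiting the crisis",
    "awful queues outside the shop again",
    "supermarket opening hours changed today",
    "the store restocks on monday",
    "delivery slots are released at noon",
    "the shop is open until nine",
    "thank you to all the grocery workers",
    "great to see neighbours helping each other",
    "so grateful for the delivery drivers",
    "lovely community spirit this week",
]
toy_labels = ["0"] * 4 + ["1"] * 4 + ["2"] * 4

# Split the raw text first, then let the pipeline fit TF-IDF on the training part only.
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_tweets, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB(alpha=0.1)),
])
pipe.fit(texts_train, labels_train)
toy_pred = pipe.predict(texts_test)

# The learned vocabulary contains only words that occur in the training split.
vocab = set(pipe.named_steps["tfidf"].vocabulary_)
print(len(vocab), "vocabulary terms learned from", len(texts_train), "training tweets")
```

The same pipeline object can be passed to `GridSearchCV`, in which case the vectorizer is refit inside every cross-validation fold.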

# Modeling

# MultinomialNB

In [None]:
grid_params ={'alpha':[10**x for x in range(-4,4)]}
alpha_log = [math.log(x,10) for x in grid_params["alpha"]]

MultinomialNB_model = GridSearchCV(MultinomialNB(),grid_params,
                     scoring = 'accuracy', cv=10,n_jobs=-1, return_train_score=True)
MultinomialNB_model.fit(X_train, y_train)

In [None]:
results = pd.DataFrame.from_dict(MultinomialNB_model.cv_results_)
results = results.sort_values(['param_alpha'])

plt.plot(alpha_log, results["mean_train_score"], label='Train Accuracy')
plt.plot(alpha_log, results["mean_test_score"].values, label='CV Accuracy')

plt.scatter(alpha_log, results["mean_train_score"].values, label='Train Accuracy points')
plt.scatter(alpha_log, results["mean_test_score"].values, label='CV Accuracy points')

plt.legend()
plt.xlabel("log10(alpha): hyperparameter")
plt.ylabel("Accuracy")
plt.title("Train vs. CV accuracy")
plt.grid()
plt.show()
print(MultinomialNB_model.best_estimator_)
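
The fitted search object can also be used directly rather than copying the best value by hand: `best_params_` holds the winning setting, and because `refit=True` is the `GridSearchCV` default, `best_estimator_` is already retrained on the full data passed to `fit`. A minimal sketch on made-up toy data (the tweets, labels and `toy_*` names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Made-up, already-cleaned tweets: 2 per class so that 2-fold CV is possible.
toy_tweets = [
    "panic buying is terrible for everyone",
    "awful queues outside the shop again",
    "the store restocks on monday",
    "the shop is open until nine",
    "so grateful for the delivery drivers",
    "lovely community spirit this week",
]
toy_labels = ["0", "0", "1", "1", "2", "2"]

toy_X = TfidfVectorizer().fit_transform(toy_tweets)
alphas = [10**x for x in range(-4, 4)]

toy_search = GridSearchCV(MultinomialNB(), {"alpha": alphas}, scoring="accuracy", cv=2)
toy_search.fit(toy_X, toy_labels)

best_alpha = toy_search.best_params_["alpha"]  # winning hyperparameter value
tuned_model = toy_search.best_estimator_       # already refit on all of toy_X
print("best alpha:", best_alpha)
```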

In [None]:
MultinomialNB_model = MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
MultinomialNB_model.fit(X_train,y_train)

y_pred = MultinomialNB_model.predict(X_test)
cm=confusion_matrix(y_test, y_pred)
cm_df=pd.DataFrame(cm,index=[0,1,2],columns=[0,1,2])
print("Accuracy:",accuracy_score(y_test, y_pred))

sns.set(font_scale=1.4,color_codes=True,palette="deep")
sns.heatmap(cm_df,annot=True,annot_kws={"size":16},fmt="d",cmap="YlGnBu")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

In [None]:
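# classification_report lists labels in sorted order; unique() returns them in
# order of appearance, which can attach the wrong name to a row
print(metrics.classification_report(y_test, y_pred, 
                                    target_names=["0", "1", "2"]))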

From the results, we can see that the recall is very low for class 2.<br>
Recall = True Positive/(True Positive + False Negative).<br>
This implies that **74% of the tweets** with **Positive** sentiment are predicted as **Negative** or **Neutral**.
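
To make the recall reading concrete, here is a small sketch that computes per-class recall, TP / (TP + FN), from a confusion matrix laid out as in scikit-learn (rows are true classes, columns are predicted classes). The counts are made up for illustration and are not the notebook's actual results.

```python
def per_class_recall(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    # The row sum is TP + FN for that class; the diagonal entry is TP.
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# Made-up counts, chosen so that class 2 has a recall of 0.26.
toy_cm = [
    [80, 10, 10],
    [5, 40, 5],
    [50, 24, 26],
]

recalls = per_class_recall(toy_cm)
for label, r in enumerate(recalls):
    print(f"class {label}: recall = {r:.2f}, missed = {1 - r:.2f}")
```

With these toy counts, 74 of the 100 true class-2 samples end up in another column, which is what a recall of 0.26 means.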

# Random Forest

In [None]:
max_depth = [1,5,10,50]
n_estimators = [5,10,100,500]
grid_params ={'max_depth':max_depth,'n_estimators':n_estimators}

RandomFoest_model = GridSearchCV(RandomForestClassifier(class_weight = 'balanced'), grid_params,
                  scoring = 'accuracy', cv=10,n_jobs=-1, return_train_score=True)
RandomFoest_model.fit(X_train, y_train)

results = pd.DataFrame.from_dict(RandomFoest_model.cv_results_)
print(RandomFoest_model.best_estimator_)

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib

max_depth = [1,1,1,1,5,5,5,5,10,10,10,10,50,50,50,50]
n_estimators = [5,10,100,500,5,10,100,500,5,10,100,500,5,10,100,500]
mean_train_score = list(results["mean_train_score"].values)
mean_test_score = list(results["mean_test_score"].values)

fig = matplotlib.pyplot.figure(figsize=(12,6))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(max_depth, n_estimators, mean_train_score, c='r', marker='o', label='Train Accuracy')
ax.scatter(max_depth, n_estimators, mean_test_score, c='b', marker='o', label='CV Accuracy')
ax.legend()

ax.set_xlabel('max_depth ')
ax.set_ylabel('n_estimators')
ax.set_zlabel('Accuracy')

In [None]:
# Only the non-default settings are spelled out; max_features='auto' and
# min_impurity_split are no longer accepted by recent scikit-learn releases
RandomFoest_model = RandomForestClassifier(class_weight='balanced', max_depth=50,
            n_estimators=500, n_jobs=1, random_state=None)
RandomFoest_model.fit(X_train,y_train)

y_pred = RandomFoest_model.predict(X_test)
cm=confusion_matrix(y_test, y_pred)
cm_df=pd.DataFrame(cm,index=[0,1,2],columns=[0,1,2])
print("Accuracy:",accuracy_score(y_test, y_pred))

sns.set(font_scale=1.4,color_codes=True,palette="deep")
sns.heatmap(cm_df,annot=True,annot_kws={"size":16},fmt="d",cmap="YlGnBu")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

In [None]:
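# classification_report lists labels in sorted order; unique() returns them in
# order of appearance, which can attach the wrong name to a row
print(metrics.classification_report(y_test, y_pred, 
                                    target_names=["0", "1", "2"]))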

From the results, we can see that recall is improved but precision is low for class 2.<br>
Precision = True Positive/(True Positive + False Positive).<br>
This implies that **51% of the tweets** predicted as **Positive** actually have **Negative** or **Neutral** sentiment.<br>
Recall and precision are decent for classes 1 and 2.
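
Precision is read off the columns of the confusion matrix rather than the rows. The sketch below computes per-class precision, TP / (TP + FP), for a matrix in scikit-learn's layout (rows are true classes, columns are predicted classes). The counts are made up for illustration and are not the notebook's actual results.

```python
def per_class_precision(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    # The column sum is TP + FP for that class; the diagonal entry is TP.
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    return [cm[j][j] / col_sums[j] for j in range(n)]

# Made-up counts, chosen so that class 2 has a precision of 0.49.
toy_cm = [
    [60, 10, 30],
    [9, 70, 21],
    [1, 0, 49],
]

precisions = per_class_precision(toy_cm)
for label, p in enumerate(precisions):
    print(f"class {label}: precision = {p:.2f}, wrong among predicted = {1 - p:.2f}")
```

With these toy counts, 51 of the 100 samples predicted as class 2 truly belong to class 0 or class 1.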

# SGD Classifier

In [None]:
alpha = [10**x for x in range(-4,4)]
penalty = ["l1","l2"]
grid_params ={'alpha':alpha,'penalty':penalty}
alpha_log = [math.log(x,10) for x in grid_params["alpha"]]

SGDClassifier_model = GridSearchCV(SGDClassifier(class_weight= 'balanced'), grid_params,
                     scoring = 'accuracy', cv=10,n_jobs=-1, return_train_score=True)
SGDClassifier_model.fit(X_train, y_train)

results = pd.DataFrame.from_dict(SGDClassifier_model.cv_results_)
results = results.sort_values(['param_alpha'])

print(SGDClassifier_model.best_estimator_)

In [None]:
SGDClassifier_model = SGDClassifier(class_weight='balanced', penalty='l1')
SGDClassifier_model.fit(X_train,y_train)

y_pred = SGDClassifier_model.predict(X_test)
cm=confusion_matrix(y_test, y_pred)
cm_df=pd.DataFrame(cm,index=[0,1,2],columns=[0,1,2])
print("Accuracy:",accuracy_score(y_test, y_pred))

sns.set(font_scale=1.4,color_codes=True,palette="deep")
sns.heatmap(cm_df,annot=True,annot_kws={"size":16},fmt="d",cmap="YlGnBu")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

In [None]:
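# classification_report lists labels in sorted order; unique() returns them in
# order of appearance, which can attach the wrong name to a row
print(metrics.classification_report(y_test, y_pred, 
                                    target_names=["0", "1", "2"]))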

From the results, we can see that recall is improved but precision is low for class 2.<br>
Precision = True Positive/(True Positive + False Positive).<br>
This implies that **42% of the tweets** predicted as **Positive** actually have **Negative** or **Neutral** sentiment.<br>
Recall and precision are good for classes 1 and 2.

# XGBoost

In [None]:
learning_rate = [0.0001, 0.001, 0.01, 0.1]
max_depth = [1,3,5,7]
n_estimators = [5,10,100,500]
grid_params ={'max_depth':max_depth,'n_estimators':n_estimators, 'learning_rate':learning_rate}

XGBoost_model = GridSearchCV(XGBClassifier(), grid_params,
                      scoring = 'accuracy', cv=10,n_jobs=-1, return_train_score=True)
XGBoost_model.fit(X_train, y_train)

results = pd.DataFrame.from_dict(XGBoost_model.cv_results_)
print(XGBoost_model.best_estimator_)

In [None]:
# Only the relevant settings are spelled out; legacy arguments such as 'silent',
# 'nthread' and 'seed' are dropped because recent xgboost releases deprecate or ignore them.
# Note: newer xgboost versions may also require integer class labels (0, 1, 2) rather than strings.
XGBoost_model = XGBClassifier(base_score=0.5, learning_rate=0.1, max_depth=7,
       n_estimators=500, n_jobs=1, objective='multi:softprob', random_state=0)
XGBoost_model.fit(X_train,y_train)

y_pred = XGBoost_model.predict(X_test)
cm=confusion_matrix(y_test, y_pred)
cm_df=pd.DataFrame(cm,index=[0,1,2],columns=[0,1,2])
print("Accuracy:",accuracy_score(y_test, y_pred))

sns.set(font_scale=1.4,color_codes=True,palette="deep")
sns.heatmap(cm_df,annot=True,annot_kws={"size":16},fmt="d",cmap="YlGnBu")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

In [None]:
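# classification_report lists labels in sorted order; unique() returns them in
# order of appearance, which can attach the wrong name to a row
print(metrics.classification_report(y_test, y_pred, 
                                    target_names=["0", "1", "2"]))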

From the results, we can see that recall and precision are improved and decent for all classes.

# Conclusion

* Precision, recall and F1 score for the XGBoost model are better than for the other models, so we will go with the XGBoost model.
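
One way to make such a comparison explicit is to reduce each model's confusion matrix to a single macro-averaged F1 score and rank the models by it. The sketch below does this in plain Python; the model names and counts are hypothetical placeholders, not the results obtained in this notebook.

```python
def macro_f1(cm):
    """Unweighted mean of per-class F1; cm[i][j] = true class i predicted as class j."""
    n = len(cm)
    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # rest of row k
        fp = sum(cm[i][k] for i in range(n)) - tp   # rest of column k
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n

# Hypothetical confusion matrices for two hypothetical models.
toy_results = {
    "model_a": [[80, 10, 10], [5, 40, 5], [50, 24, 26]],
    "model_b": [[85, 5, 10], [5, 40, 5], [10, 10, 80]],
}

ranking = sorted(toy_results, key=lambda name: macro_f1(toy_results[name]), reverse=True)
for name in ranking:
    print(f"{name}: macro F1 = {macro_f1(toy_results[name]):.3f}")
```

Macro averaging gives every class the same weight, which matters here because a model can reach a good accuracy while still doing poorly on one class.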