## _Natural Language Processing_ 
### _Help Twitter Combat Hate Speech Using NLP and Machine Learning_
***
<b>DESCRIPTION</b>

Using NLP and ML, build a model to identify hate speech (racist or sexist tweets) on Twitter.

<b>Problem Statement:</b>
***

Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium to spread hate.

You are a data scientist at Twitter, and you will help Twitter in identifying the tweets with hate speech and removing them from the platform. You will use NLP techniques, perform specific cleanup for tweets data, and make a robust model.

Domain: Social Media

<b>Analysis to be done:</b> Clean up the tweets and build a classification model using NLP techniques and tweet-specific cleanup, then apply regularization and hyperparameter tuning with stratified k-fold cross-validation to get the best model.

<b>Content: </b>

id: identifier number of the tweet

Label: 0 (non-hate) /1 (hate)

Tweet: the text in the tweet
***

<b>Tasks: </b>

Load the tweets file using read_csv function from Pandas package. 

Get the tweets into a list for easy text cleanup and manipulation.

<b>To cleanup: </b>

- Normalize the casing.
- Using regular expressions, remove user handles. These begin with '@'.
- Using regular expressions, remove URLs.
- Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.
- Remove stop words.
- Remove redundant terms like ‘amp’, ‘rt’, etc.
- Remove ‘#’ symbols from the tweet while retaining the term.
- Extra cleanup by removing terms with a length of 1.

<b>Check out the top terms in the tweets:</b>
- First, get all the tokenized terms into one large list.
- Use the counter and find the 10 most common terms.

<b>Data formatting for predictive modeling:</b>
- Join the tokens back to form strings. This will be required for the vectorizers.
- Assign x and y.
- Perform train_test_split using sklearn.

<b>We’ll use TF-IDF values for the terms as a feature to get into a vector space model.</b>
- Import TF-IDF  vectorizer from sklearn.
- Instantiate with a maximum of 5000 terms in your vocabulary.
- Fit and apply on the train set.
- Apply on the test set.

<b>Model building: Ordinary Logistic Regression</b>
- Instantiate Logistic Regression from sklearn with default parameters.
- Fit on the train data.
- Make predictions for the train and the test set.

<b>Model evaluation: Accuracy, recall, and f_1 score.</b>
- Report the accuracy on the train set.
- Report the recall on the train set: decent, high, or low.
- Get the f1 score on the train set.

<b>Looks like you need to adjust the class imbalance, as the model seems to focus on the 0s.</b>
- Adjust the appropriate class in the LogisticRegression model.

<b>Train again with the adjustment and evaluate.</b>
- Train the model on the train set.
- Evaluate the predictions on the train set: accuracy, recall, and f_1 score.

<b>Regularization and Hyperparameter tuning:</b>
- Import GridSearch and StratifiedKFold because of class imbalance.
- Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters.
- Use a balanced class weight while instantiating the logistic regression.

<b>Find the parameters with the best recall in cross-validation.</b>
- Choose ‘recall’ as the metric for scoring.
- Choose a stratified 4 fold cross-validation scheme.
- Fit on the train set.

<b>What are the best parameters?</b>

<b>Predict and evaluate using the best estimator.</b>
- Use the best estimator from the grid search to make predictions on the test set.
- What is the recall on the test set for the hate tweets (label 1)?
- What is the f_1 score?


## Table of Contents

- [1 - Import Libraries and Load Data](#1)
- [2 - Text Cleaning](#2)
    - [2.1 - Handle Diacritics using Text Normalization](#2-1)
    - [2.2 - Remove user handles](#2-2)
    - [2.3 - Remove the URLs](#2-3)
    - [2.4 - Tokenize using TweetTokenizer](#2-4)
    - [2.5 - Remove Stopwords](#2-5)
    - [2.6 - Spelling Corrections](#2-6)
    - [2.7 - Remove # symbols while retaining the text](#2-7)
    - [2.8 - Remove single and double character length tokens](#2-8)
    - [2.9 - Remove digits](#2-9)
    - [2.10 - Remove non-alphanumeric characters](#2-10)

    
- [3 - Exploratory Data Analysis](#3)
    - [3.1 - Check for data imbalance](#3-1)
    - [3.2 - Check top terms in the tweet](#3-2)
    
- [4 - Predictive Modeling](#4)
    - [4.1 - Data Formatting for Predictive Modeling](#4-1)
    - [4.2 - Using tf-idf vectorizer to generate the feature vectors](#4-2)
    - [4.3 - Model using Ordinary Logistic Regression with Default Parameters](#4-3)
    - [4.4 - Model Evaluation](#4-4)
    - [4.5 - Model using Weighted Logistic Regression to handle data imbalance](#4-5)
    - [4.6 - Model Fine Tuning using Randomized Search](#4-6)
    - [4.7 - Fine Tuned Model Prediction & Evaluation with equal class weights](#4-7)
    - [4.8 - Fine Tuned Model Prediction & Evaluation with class weights proportional to the imbalance](#4-8)
- [5 - Summary](#5)
    

<a id='1'></a>
## _Import Libraries and Load Data_

In [None]:
#general packages for data manipulation
import os
import pandas as pd
import numpy as np
#visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#consistent sized plot 
from pylab import rcParams
rcParams['figure.figsize']=12,5
rcParams['axes.labelsize']=12
rcParams['xtick.labelsize']=12
rcParams['ytick.labelsize']=12
#handle the warnings in the code
import warnings
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
warnings.filterwarnings(action='ignore',category=FutureWarning)
#text preprocessing libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
#import texthero
#import texthero as hero
#regular expressions
import re
#display pandas dataframe columns 
pd.options.display.max_columns = None

In [None]:
#load the csv file as a pandas dataframe
#ISO-8859-1
tweet = pd.read_csv('/kaggle/input/twitter-hate-speech/TwitterHate.csv',delimiter=',',engine='python',encoding='utf-8-sig')
tweet.head()

In [None]:
#get rid of the identifier number of the tweet
tweet.drop('id',axis=1,inplace=True)

In [None]:
#view one of the tweets randomly 
random = np.random.randint(0,len(tweet))
print(random)
tweet.iloc[random]['tweet']

In [None]:
#create a copy of the original data to work with 
df = tweet.copy()

<a id='2'></a>
## _Text Cleaning_

<a name='2-1'></a>
### _Handle Diacritics using text normalization_

In [None]:
import unicodedata

def simplify(text):
    '''Function to handle the diacritics in the text'''
    #decompose accented characters (NFD), then drop every code point outside ASCII
    text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode("utf-8")
    return str(text)

In [None]:
df['tweet'] = df['tweet'].apply(simplify)
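
The effect of this normalization is easiest to see on a couple of strings. The sketch below uses only the standard library; `strip_diacritics` is a hypothetical name and the sample words are invented for illustration.

```python
import unicodedata

def strip_diacritics(text):
    # NFD splits an accented letter into its base letter plus combining marks
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with errors='ignore' drops the marks and any other non-ASCII code point
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_diacritics('café naïve'))   # accents removed, base letters kept
print(strip_diacritics('smørrebrød'))   # 'ø' has no decomposition, so the letter disappears
```

One consequence worth keeping in mind: characters that do not decompose into an ASCII base letter (such as 'ø', or emoji) are removed entirely rather than simplified.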

<a id='2-2'></a>
### _Remove user handles_

In [None]:
#test on a sample string
sample = "and @user1 i would like you to discuss with @user2 and then with @username3"
pattern = re.compile(r'@\w+')
re.findall(pattern,sample)

In [None]:
#remove all the user handles --> strings starting with @
#assign the result back instead of using inplace=True on a selected column
df['tweet'] = df['tweet'].replace(r'@\w+','',regex=True)

<a id='2-3'></a>
### _Remove the urls_

In [None]:
#test on a sample 
sample = "https://www.machinelearing.com prakhar and https://www.simple.com"
pattern = re.compile(r'http\S+')
re.findall(pattern,sample)

In [None]:
#remove all the urls --> strings starting with http
df['tweet'] = df['tweet'].replace(r'http\S+','',regex=True)
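
As a rough standard-library sketch, the two patterns used above can be applied to a single string and the leftover whitespace collapsed. `strip_handles_and_urls` is a hypothetical helper, and the sample tweets are invented.

```python
import re

HANDLE = re.compile(r'@\w+')    # same pattern as the user-handle step
URL = re.compile(r'http\S+')    # same pattern as the URL step

def strip_handles_and_urls(text):
    text = HANDLE.sub('', text)
    text = URL.sub('', text)
    # Collapse the gaps left behind by the removed pieces
    return ' '.join(text.split())

print(strip_handles_and_urls('@user1 check https://t.co/abc now  @user2'))
print(strip_handles_and_urls('see www.example.com'))
```

The second call shows a limitation of `http\S+`: a link written without a scheme is left untouched.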

<a id='2-4'></a>
### _Tokenize using tweet tokenizer_

In [None]:
#test on a sample text
sample = 'wonderfl :-)  when are you coming for #party'
tweet_tokenize = TweetTokenizer(preserve_case=True)
tweet_tokenize.tokenize(sample)

In [None]:
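#tokenize the tweets in the dataframe using TweetTokenizer
#preserve_case=False lowercases the tokens, which normalizes the casing (emoticons are left as-is)
tokenizer = TweetTokenizer(preserve_case=False)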
df['tweet'] = df['tweet'].apply(tokenizer.tokenize)

In [None]:
#view the tokenized tweets
df.head(3)

<a id='2-5'></a>
### _Remove Stopwords_
_Append more words to be removed from the text, for example rt and amp, which occur very frequently_

In [None]:
stop_words = stopwords.words('english')

#add additional stop words to be removed from the text
additional_list = ['amp','rt','u',"can't",'ur']

stop_words.extend(additional_list)

In [None]:
stop_words[-10:]

In [None]:
#remove stop words
def remove_stopwords(text):
    '''Function to remove the stop words from the text corpus'''
    clean_text = [word for word in text if word not in stop_words]
    return clean_text    

In [None]:
#remove the stop words from the tweets
df['tweet'] = df['tweet'].apply(remove_stopwords)

In [None]:
df['tweet'].head()

<a id='2-6'></a>
### _Spelling corrections_

In [None]:
#apply spelling correction on a sample text
from textblob import TextBlob
sample = 'amazng man you did it finallyy'
txtblob = TextBlob(sample)
corrected_text = txtblob.correct()
print(corrected_text)

In [None]:
#textblob expects a single string, not a list of tokens
from textblob import TextBlob

def spell_check(text):
    '''Function to do spelling correction using TextBlob'''
    txtblob = TextBlob(text)
    #correct() returns a TextBlob object, so convert it back to a plain string
    corrected_text = txtblob.correct()
    return str(corrected_text)
    

<a id='2-7'></a>
### _Remove # symbols while retaining the text_

In [None]:
#try removing # symbols from a sample text
sample = '#winner #machine i am learning'
pattern = re.compile(r'#')
re.sub(pattern,'',sample)

In [None]:
def remove_hashsymbols(text):
    '''Function to remove the hashtag symbol from the text'''
    pattern = re.compile(r'#')
    text = ' '.join(text)
    clean_text = re.sub(pattern,'',text)
    return tokenizer.tokenize(clean_text)    

In [None]:
df['tweet'] = df['tweet'].apply(remove_hashsymbols)

In [None]:
df.head(3)
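
An alternative that avoids joining and re-tokenizing is to strip the symbol token by token. This is only a sketch with a hypothetical `strip_hash` helper, and it differs slightly from the function above: it removes leading `#` characters only, whereas the regex removes every `#` in the text.

```python
def strip_hash(tokens):
    # Remove leading '#' characters from every token
    stripped = (tok.lstrip('#') for tok in tokens)
    # A token that was only '#' becomes empty and is dropped
    return [tok for tok in stripped if tok]

print(strip_hash(['#winner', '#machine', 'i', 'am', 'learning', '#']))
```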

<a id='2-8'></a>
### _Remove single and double character length tokens_

In [None]:
def rem_shortwords(text):
    '''Function to remove the short words of length 1 and 2 characters
       Arguments:
       text: list of tokens
       returns: string of the remaining tokens joined by spaces'''
    lengths = [1,2]
    #keep only the tokens longer than 2 characters and return them as one string
    return ' '.join(word for word in text if len(word) not in lengths)
    

In [None]:
df['tweet'] = df['tweet'].apply(rem_shortwords)

In [None]:
df.head(2)

In [None]:
df['tweet'] = df['tweet'].apply(tokenizer.tokenize)

In [None]:
df.head(3)

<a id='2-9'></a>
### _Remove digits_

In [None]:
def rem_digits(text):
    '''Function to remove the digits from the list of strings'''
    no_digits = []
    for word in text:
        no_digits.append(re.sub(r'\d','',word))
    return ' '.join(no_digits)   

In [None]:
df['tweet'] = df['tweet'].apply(rem_digits)

In [None]:
df['tweet'] = df['tweet'].apply(tokenizer.tokenize)

In [None]:
df.head()

<a id='2-10'></a>
### _Remove special characters_


In [None]:
def rem_nonalpha(text):
    '''Function to drop the tokens that contain any non-alphabetic character'''
    text = [word for word in text if word.isalpha()]
    return text

In [None]:
#drop the tokens that are not purely alphabetic
df['tweet'] = df['tweet'].apply(rem_nonalpha)
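
The last three filters (short tokens, digits, non-alphabetic tokens) can also be pictured as one pass over a token list. The sketch below is not an exact replica of the cells above, since it applies the length check after the digits are stripped; `clean_tokens` and `min_length` are hypothetical names and the sample tokens are invented.

```python
import re

def clean_tokens(tokens, min_length=3):
    cleaned = []
    for tok in tokens:
        tok = re.sub(r'\d', '', tok)   # strip digits inside the token
        # Keep purely alphabetic tokens of at least min_length characters
        if tok.isalpha() and len(tok) >= min_length:
            cleaned.append(tok)
    return cleaned

print(clean_tokens(['2day', 'is', 'gr8', '!!!', 'sunny', ':-)', 'b4']))
```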

<a id='3'></a>
## _Exploratory Data Analysis - Broad Approach_

<a id='3-1'></a>
### _Check for data balance_

In [None]:
#plot the count of hate and non-hate tweets
sns.countplot(x='label',data=df)
plt.title('Count of Hate vs Non Hate Tweet')
plt.grid()
plt.show()

_The dataset contains more non-hate tweets than hate tweets_

<a id='3-2'></a>
### _Check out the top terms in the tweets_

In [None]:
from collections import Counter
results = Counter()
df['tweet'].apply(results.update)
#print the top 10 most common terms in the tweet 
print(results.most_common(10))

In [None]:
#plot the cumulative frequency of the top 10 most common tokens 
frequency = nltk.FreqDist(results)
plt.title('Top 10 Most Common Terms')
frequency.plot(10,cumulative=True)
plt.show()

In [None]:
#plot the frequency of the top 10 most common tokens 
frequency = nltk.FreqDist(results)
plt.title('Top 10 Most Common Terms')
frequency.plot(10,cumulative=False)
plt.show()

_Love is the most frequent term, followed by day, happy, and so on. This is expected, since the dataset contains more non-hate tweets than hate tweets_
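
Because the overall counts are dominated by the majority class, it can help to count terms per label. The following is a minimal sketch with a hypothetical `top_terms_by_label` helper and invented toy data, not output from this dataset.

```python
from collections import Counter

def top_terms_by_label(token_lists, labels, n=3):
    counters = {}
    for tokens, label in zip(token_lists, labels):
        # One Counter per label, updated with that tweet's tokens
        counters.setdefault(label, Counter()).update(tokens)
    return {label: counter.most_common(n) for label, counter in counters.items()}

# Toy data, invented for illustration
toy_tweets = [['love', 'day'], ['happy', 'love'], ['hate', 'you'], ['love', 'happy']]
toy_labels = [0, 0, 1, 0]
result = top_terms_by_label(toy_tweets, toy_labels, n=2)
print(result)
```

With the notebook's dataframe, the same idea would be applied to `df['tweet']` and `df['label']` while the tweets are still token lists.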

<a id='4'></a>
## _Predictive Modeling_

<a id='4-1'></a>
### _Data Formatting for Predictive Modeling_

In [None]:
df.head()

In [None]:
#check for the null values
df.isnull().sum()

In [None]:
#join the tokens back to form the string
df['tweet'] = df['tweet'].apply(lambda x: ' '.join(x))

In [None]:
#check the top rows
df.head(3)

In [None]:
#split the data into input X and output y
X = df['tweet']
y = df['label']

In [None]:
#split the data 
from sklearn.model_selection import train_test_split
seed = 51
test_size = 0.2 #hold out 20% of the data for testing
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=test_size,random_state=seed,stratify=y)
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)

<a id='4-2'></a>
### _Use tf-idf as a feature to get into the vector space model_


In [None]:
#import tfidf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
#instantiate the vectorizer 
vectorizer = TfidfVectorizer(max_features=5000)

In [None]:
#fit on the training data
X_train = vectorizer.fit_transform(X_train)
#transform the test data
X_test = vectorizer.transform(X_test)

In [None]:
#check the shape
X_train.shape, X_test.shape
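
To make the feature values less opaque, here is a hand-rolled sketch of the weighting, assuming the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. It splits on whitespace and ignores `max_features`, so it is an illustration rather than a replacement; `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        # L2-normalize the row; an empty document keeps a norm of 1 to avoid dividing by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_docs = ['love day', 'love happy', 'hate']
rows = tfidf_rows(toy_docs)
print(rows[0])
```

In the first toy document, 'day' receives a higher weight than 'love' because 'love' also appears in another document.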

<a id='4-3'></a>
### _Model building: Ordinary Logistic Regression_

In [None]:
#import the models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

In [None]:
#instantiate the models with default hyper-parameters
clf = LogisticRegression()
clf.fit(X_train,y_train)
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)

<a id='4-4'></a>
### _Model evaluation_



In [None]:
#import the metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
#get the model accuracy on the training and the test set
print('Accuracy Score on training set %.5f' %accuracy_score(y_train,train_predictions))
print('Accuracy Score on test set %.5f' %accuracy_score(y_test,test_predictions))

_Accuracy is a poor metric for an imbalanced dataset such as this one, because a model that mostly predicts the majority class still scores well. The f1-score highlights this: a low f1-score for a label indicates poor performance of the model on that label._
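
A small worked example makes the point concrete. It uses invented toy labels with one positive for every thirteen negatives and a baseline that always predicts 0; the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# Toy labels: 13 non-hate tweets and 1 hate tweet
y_true_toy = [0] * 13 + [1]
# Baseline that never predicts the hate class
y_pred_toy = [0] * 14

print(round(accuracy(y_true_toy, y_pred_toy), 3), recall(y_true_toy, y_pred_toy))
```

The baseline reaches an accuracy of 13/14 while finding none of the hate tweets, which is why recall and f1 on label 1 are the more informative numbers here.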

In [None]:
print('Classification Report Training set')
print('\n')
print(classification_report(y_train,train_predictions))

In [None]:
print('Classification Report Testing set')
print('\n')
print(classification_report(y_test,test_predictions))

_The model's f1-score is low for label 1, the label that marks the hate tweets_

<a id='4-5'></a>
### _Weighted Logistic Regression Or Cost Sensitive Logistic Regression_


In [None]:
df['label'].value_counts()

_The minority to majority class ratio is 1:13_ 
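
The weights can be derived from label counts rather than typed in by hand. The sketch below shows two common conventions: the inverse ratio used in this notebook, and the `n_samples / (n_classes * count)` heuristic that scikit-learn documents for `class_weight='balanced'`. The helper names and the toy counts (1300 and 100) are invented to reproduce a 1:13 ratio.

```python
from collections import Counter

def inverse_ratio_weights(labels):
    counts = Counter(labels)
    majority = max(counts.values())
    # The majority class gets 1.0; each other class gets majority / its own count
    return {label: majority / count for label, count in counts.items()}

def balanced_weights(labels):
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

toy_labels = [0] * 1300 + [1] * 100
print(inverse_ratio_weights(toy_labels))
print(balanced_weights(toy_labels))
```

Both conventions give the minority class 13 times the weight of the majority class; they differ only in overall scale, which interacts with the regularization strength `C`.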

In [None]:
#define the weight of the class labels using inverse ratio
weights = {0:1.0,1:13.0}

#instantiate the logistic regression model and account for the weights to be applied for model coefficients update magnitude
clf = LogisticRegression(solver='lbfgs',class_weight=weights)

#fit and predict
clf.fit(X_train,y_train)
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)

#classification report
print('Classification Report Training set')
print('------------------------------------')
print('\n')
print(classification_report(y_train,train_predictions))
print('\n')

print('Classification Report Testing set')
print('------------------------------------')
print('\n')
print(classification_report(y_test,test_predictions))

_The f1 score on both the training and testing sets has improved compared to the plain Logistic Regression model. There is still room to improve the score with better models or by handling the data imbalance with synthetic data_

<a id='4-6'></a>
### _Regularization and Hyperparameter tuning:_

In [None]:
#import the required libraries for the randomized search
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

In [None]:
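# define search space
#note: not every solver supports every penalty (lbfgs and newton-cg do not support 'l1',
#and 'elasticnet' needs the 'saga' solver); such candidates fail to fit and are scored
#as NaN under the default error_score
from scipy.stats import loguniform
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['l1', 'l2', 'elasticnet']
space['C'] = loguniform(1e-5, 100)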

In [None]:
#check the search space 
print(space)
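
For intuition about what a log-uniform range means for `C`, the standard-library sketch below draws values whose logarithm is uniform between the two bounds. `sample_loguniform` is a hypothetical stand-in for illustration, not the scipy implementation.

```python
import math
import random

def sample_loguniform(low, high, rng):
    # Draw uniformly in log space, then map back with exp
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(51)
samples = [sample_loguniform(1e-5, 100, rng) for _ in range(10000)]
# Five of the seven decades between 1e-5 and 100 lie below 1
share_below_one = sum(s < 1 for s in samples) / len(samples)
print(round(share_below_one, 2))
```

Each order of magnitude is equally likely, so roughly five sevenths of the sampled `C` values fall below 1, which a plain uniform draw over the same range would almost never produce.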

<a id='4-7'></a>
### _Fine tuned Model with Equal Class Weights_

In [None]:
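#define the model with equal class weights (no correction for the imbalance)
weights = {0:1.0,1:1.0}
clf = LogisticRegression(class_weight=weights)
#define the number of folds; shuffle=True is needed for random_state to take effect
folds = StratifiedKFold(n_splits=4,shuffle=True,random_state=seed)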
# define search
grid_search = RandomizedSearchCV(estimator=clf,param_distributions=space, n_iter=100, scoring='recall',
                            n_jobs=-1, cv=folds, random_state=seed)
#fit grid search on the train data
grid_result = grid_search.fit(X_train,y_train)
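
The idea behind stratified folds can be sketched without any library: shuffle the indices of each class separately and deal them out to the folds in turn, so every fold keeps roughly the original class ratio. This is an illustration under that assumption, not scikit-learn's actual algorithm; `stratified_folds` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits, seed):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Toy labels with a 1:13 ratio: 52 negatives and 4 positives
toy_labels = [0] * 52 + [1] * 4
toy_folds = stratified_folds(toy_labels, n_splits=4, seed=51)
print([sum(toy_labels[i] for i in fold) for fold in toy_folds])
```

Each of the four folds ends up with exactly one positive example, whereas an unstratified split of such a small sample could easily leave a fold with none.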

In [None]:
#retrieve the best model 
grid_result.best_estimator_

In [None]:
#instantiate the best model
clf = LogisticRegression(C=23.871926754399514,penalty='l1',solver='liblinear',class_weight=weights)

In [None]:
#fit and predict
clf.fit(X_train,y_train)
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)

#classification report
print('Classification Report Training set')
print('------------------------------------')
print('\n')
print(classification_report(y_train,train_predictions))
print('\n')

print('Classification Report Testing set')
print('------------------------------------')
print('\n')
print(classification_report(y_test,test_predictions))

<a id='4-8'></a>
### _Fine tuned model with class weights proportional to the class imbalance_

In [None]:
#use the class weights to handle the imbalance in the labels
weights = {0:1.0,1:13}

clf = LogisticRegression(class_weight=weights)
#define the number of folds; shuffle=True is needed for random_state to take effect
folds = StratifiedKFold(n_splits=4,shuffle=True,random_state=seed)
# define search
grid_search = RandomizedSearchCV(estimator=clf,param_distributions=space, n_iter=100, scoring='recall',
                            n_jobs=-1, cv=folds, random_state=seed)
#fit grid search on the train data
grid_result = grid_search.fit(X_train,y_train)

#retrieve the best model 
grid_result.best_estimator_

In [None]:
#instantiate the best model
clf = LogisticRegression(C=0.16731783677034165,penalty='l2',solver='liblinear',class_weight=weights)

#fit and predict
clf.fit(X_train,y_train)
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)

#classification report
print('Classification Report Training set')
print('------------------------------------')
print('\n')
print(classification_report(y_train,train_predictions))
print('\n')

print('Classification Report Testing set')
print('------------------------------------')
print('\n')
print(classification_report(y_test,test_predictions))


In [None]:
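#plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf,X_test,y_test,cmap='summer')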
plt.title('Confusion Matrix Test Set')
plt.show()
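
The four cells of a binary confusion matrix are enough to recover the metrics reported above. Here is a minimal sketch on invented toy predictions; the helper names are hypothetical and the numbers are not results from this notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # f1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(toy_true, toy_pred)
print(tp, fp, fn, tn)
print(precision_recall_f1(tp, fp, fn))
```

Raising the weight of label 1 typically moves errors from the false-negative cell to the false-positive cell, which lifts recall at the cost of precision.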

<a id='5'></a>
## _Summary_

- Logistic Regression with default parameters: recall = 29%
- Logistic Regression with class weights in proportion to the data imbalance: recall = 75%
- Logistic Regression fine tuned with randomized search and equal class weights: recall = 56%
- Logistic Regression fine tuned with randomized search and class weights in proportion to the data imbalance: recall = 77%
