# PGP AI - Natural Language Processing

## Project 2: Help Twitter Combat Hate Speech Using NLP and Machine Learning

> ### Author:
>
> ***Saikat Narayan Bhattacharjya***
>
>  ***Email: <snbhattacharjya@gmail.com>***

### DESCRIPTION

Using NLP and ML, build a model to identify hate speech (racist or sexist tweets) on Twitter.

### Problem Statement:  

Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium to spread hate.

You are a data scientist at Twitter, and you will help Twitter in identifying the tweets with hate speech and removing them from the platform. You will use NLP techniques, perform specific cleanup for tweets data, and make a robust model.


### Domain: Social Media

### Analysis to be done: 

Clean up the tweets and build a classification model using NLP techniques, tweet-specific cleanup, regularization, and hyperparameter tuning with stratified k-fold cross-validation to get the best model.

### Content: 

1. id: identifier number of the tweet
2. label: 0 (non-hate) / 1 (hate)
3. tweet: the text in the tweet


### Tasks: 

1. Load the tweets file using read_csv function from Pandas package.
2. Get the tweets into a list for easy text cleanup and manipulation.
3. To cleanup: 
    1. Normalize the casing.
    2. Using regular expressions, remove user handles. These begin with ‘@’.
    3. Using regular expressions, remove URLs.
    4. Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.
    5. Remove stop words.
    6. Remove redundant terms like ‘amp’, ‘rt’, etc.
    7. Remove ‘#’ symbols from the tweet while retaining the term.
4. Extra cleanup by removing terms with a length of 1.
5. Check out the top terms in the tweets:
    1. First, get all the tokenized terms into one large list.
    2. Use the counter and find the 10 most common terms.
6. Data formatting for predictive modeling:
    1. Join the tokens back to form strings. This will be required for the vectorizers.
    2. Assign x and y.
    3. Perform train_test_split using sklearn.
7. We’ll use TF-IDF values for the terms as a feature to get into a vector space model.
    1. Import the TF-IDF vectorizer from sklearn.
    2. Instantiate with a maximum of 5000 terms in your vocabulary.
    3. Fit and apply on the train set.
    4. Apply on the test set.
8. Model building: Ordinary Logistic Regression
    1. Instantiate Logistic Regression from sklearn with default parameters.
    2. Fit on the train data.
    3. Make predictions for the train and the test set.
9. Model evaluation: Accuracy, recall, and f_1 score.
    1. Report the accuracy on the train set.
    2. Report the recall on the train set: decent, high, or low.
    3. Get the f1 score on the train set.
10. It looks like you need to adjust for the class imbalance, as the model seems to focus on the 0s.
    1. Adjust the appropriate class in the LogisticRegression model.
11. Train again with the adjustment and evaluate.
    1. Train the model on the train set.
    2. Evaluate the predictions on the train set: accuracy, recall, and f_1 score.
12. Regularization and hyperparameter tuning:
    1. Import GridSearchCV and StratifiedKFold because of class imbalance.
    2. Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters.
    3. Use a balanced class weight while instantiating the logistic regression.
13. Find the parameters with the best recall in cross-validation.
    1. Choose ‘recall’ as the metric for scoring.
    2. Choose a stratified 4-fold cross-validation scheme.
    3. Fit on the train set.
14. What are the best parameters?
15. Predict and evaluate using the best estimator.
    1. Use the best estimator from the grid search to make predictions on the test set.
    2. What is the recall on the test set for the hate tweets?
    3. What is the f_1 score?

#### Setting up the Environment

In [None]:
import numpy as np
import pandas as pd
import nltk
from pprint import pprint

import matplotlib.pyplot as plt
%matplotlib inline

#### Read the data

In [None]:
tweet_data = pd.read_csv('TwitterHate.csv')
tweet_data.head()

In [None]:
tweet_data.info()

#### Extracting the tweets into a list structure for pre-processing

In [None]:
tweets = tweet_data['tweet'].to_list()
tweets[:5]

#### Defining a Basic pre-processing function

In [None]:
import re
from nltk.tokenize import TweetTokenizer

def basic_tweet_cleanup(tweets):
    #Lower casing
    tweets = [tweet.lower() for tweet in tweets]
    
    #Removing user handles (@...), including a handle at the very end of a tweet
    tweets = [re.sub(r'@\S+','',tweet) for tweet in tweets]
    
    #Removing URLs (both http and https)
    tweets = [re.sub(r'https?://\S+','',tweet) for tweet in tweets]
    
    #Remove ‘#’ symbols from the tweet while retaining the term.
    tweets = [re.sub('#','',tweet) for tweet in tweets]
      
    #Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.
    tweet_tokenizer = TweetTokenizer()
    tweets = [tweet_tokenizer.tokenize(tweet) for tweet in tweets]
    
    #Remove stop words (build the set once instead of reloading the list for every token)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tweets = [[token for token in tweet if token not in stop_words] for tweet in tweets]
    
    #Remove redundant words like 'amp', 'rt'
    tweets = [[token for token in tweet if token not in ['amp','rt']] for tweet in tweets]
    
    return tweets
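
To see what the handle, URL, and hashtag substitutions do to a single tweet, here is a minimal sketch that uses only `re`. The sample tweet is invented for illustration, and the patterns `@\S+` and `https?://\S+` are one reasonable choice rather than the only possible one. Tokenization and stop-word removal are left out so the example runs without NLTK.

```python
import re

# Hypothetical sample tweet, invented for illustration only.
sample = "RT @user1: loving this! http://t.co/abc https://t.co/xyz #blessed @user2"

text = sample.lower()                     # normalize the casing
text = re.sub(r'@\S+', '', text)          # drop user handles, including one at the very end
text = re.sub(r'https?://\S+', '', text)  # drop http and https links
text = text.replace('#', '')              # keep the hashtag term, drop the symbol
cleaned = " ".join(text.split())          # collapse the leftover whitespace

print(cleaned)
```

The term `rt` survives this step on purpose: in the notebook it is removed later, after tokenization, together with `amp`.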

#### Checking the output after basic preprocessing

In [None]:
%%time
tweets_cleaned = basic_tweet_cleanup(tweets)
pprint(tweets_cleaned[:5], compact=True)

#### Original text data before pre-processing

In [None]:
pprint(tweets[:5], compact=True)

#### Defining an Advanced Cleanup function

In [None]:
def advanced_tweet_cleanup(tweets):
    #Filtering only alphabet words with length > 1
    tweets_cleaned = [[token for token in tweet_tokens if token.isalpha() and len(token) > 1] for tweet_tokens in tweets]
    return tweets_cleaned

#### Checking the output after Advanced cleanup of the text data

In [None]:
tweets_cleaned = advanced_tweet_cleanup(tweets_cleaned)
pprint(tweets_cleaned[:5], compact=True)

#### Joining the tokens in a list to find the top ten common terms

In [None]:
terms = []

for tweet in tweets_cleaned:
    for token in tweet:
        terms.append(token)

print("Total Tokens: {}".format(len(terms)))

#### Creating a table of top ten common words in the text data

In [None]:
from collections import Counter

counts_terms = Counter(terms)
terms_df = pd.DataFrame(counts_terms.most_common(10), columns=['term', 'count'])
terms_df
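
The flatten-and-count step can be checked on a tiny example. The sketch below uses invented token lists in place of `tweets_cleaned` and flattens them with `itertools.chain.from_iterable`, which is an alternative to the nested loop, before calling `Counter.most_common`.

```python
from collections import Counter
from itertools import chain

# Toy token lists standing in for tweets_cleaned (invented data).
toy_tweets = [["love", "day"], ["love", "life"], ["happy", "day", "love"]]

# Flatten the list of lists into one list of terms, then count each term.
toy_terms = list(chain.from_iterable(toy_tweets))
top_two = Counter(toy_terms).most_common(2)

print(len(toy_terms))  # 7
print(top_two)         # [('love', 3), ('day', 2)]
```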

#### Visualising the top ten common words

In [None]:
terms_df.sort_values(by='count', ascending=True).plot(kind="barh", x='term', figsize=(12,10), color='teal')
plt.show()

#### Adding cleaned tweet data to the data frame for creating Bag of Words by TfidfVectorizer

In [None]:
# Join each tweet's tokens back into a single space-separated string
tweets_cleaned_sent = [" ".join(tweet) for tweet in tweets_cleaned]

tweets_cleaned_sent[:5]

In [None]:
tweet_data['tweet_cleaned'] = tweets_cleaned_sent
tweet_data.head(10)

#### Assigning X and y for Classification Model

In [None]:
X = tweet_data['tweet_cleaned']
y = tweet_data['label']

#### Splitting the dataset into Train and Test sets in the ratio of 70:30

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
len(X_train), len(X_test)

#### Initialising the vectorizer with a maximum of 5000 features (words/columns) for creating the Bag of Words

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features = 5000)

#### Creating the Train and Test feature matrix for prediction modelling

In [None]:
X_train_bow = vectorizer.fit_transform(X_train)
# Only transform the test set, so it reuses the vocabulary and IDF weights learned from the train set
X_test_bow = vectorizer.transform(X_test)

X_train_bow.shape, X_test_bow.shape
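
The values in these matrices are TF-IDF weights. As a rough illustration of what the vectorizer computes with its default settings (raw term counts, smoothed IDF, L2 normalisation), here is a pure-Python sketch on an invented three-tweet corpus. It is a simplification that skips scikit-learn's own tokenization and the `max_features` cut-off, and the names `docs`, `idf`, and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned tweets (invented data).
docs = ["love happy day", "hate day", "love love life"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, as documented for smooth_idf=True.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens):
    """Return {term: weight}: raw term count times IDF, scaled to unit L2 length."""
    tf = Counter(tokens)
    raw = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

Terms that appear in fewer documents get a larger IDF, so a rare term such as `hate` outweighs a common one such as `day` within the same tweet.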

#### Creating a classification model using Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()

In [None]:
log_reg.fit(X_train_bow, y_train)

#### Performing predictions on Train and Test data

In [None]:
y_train_pred = log_reg.predict(X_train_bow)
y_test_pred = log_reg.predict(X_test_bow)

#### Checking the Accuracy and Performance Metrics without Regularisation and Hyperparameter Tuning

In [None]:
from sklearn.metrics import accuracy_score, classification_report

In [None]:
print("Accuracy Score for Training set: {}%".format(accuracy_score(y_train, y_train_pred)*100))

In [None]:
print(classification_report(y_train, y_train_pred))
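
To make the reported metrics concrete, the sketch below computes accuracy, precision, recall, and the F1 score by hand for the positive (hate) class. The labels are invented and are not taken from the dataset. They are chosen to show how accuracy can look acceptable while recall on the minority class stays low.

```python
# Toy labels (invented): 1 = hate, 0 = non-hate.
y_true = [0] * 16 + [1] * 4
y_pred = [0] * 15 + [1] + [1] + [0] * 3

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # hate tweets correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-hate tweets wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # hate tweets that were missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-hate tweets correctly passed

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 3))  # 0.8 0.5 0.25 0.333
```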

#### Analysis of the result:

The dataset is imbalanced because class label '0' dominates; the class proportions are checked below.

In [None]:
tweet_data['label'].value_counts(normalize=True)

#### Creating a classifier model by Logistic Regression using class weight as 'balanced'

In [None]:
log_reg = LogisticRegression(class_weight='balanced')
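
For reference, scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to invented labels. The 93:7 split is an assumption made for illustration, not the actual ratio in the dataset.

```python
from collections import Counter

# Toy labels (invented): 93 non-hate tweets and 7 hate tweets.
y_toy = [0] * 93 + [1] * 7

counts = Counter(y_toy)
n_samples = len(y_toy)
n_classes = len(counts)

# Weight for each class: n_samples / (n_classes * count of that class).
balanced_weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

print({label: round(weight, 3) for label, weight in balanced_weights.items()})
```

The minority class receives a much larger weight, so misclassifying a hate tweet costs the model more during training.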

In [None]:
log_reg.fit(X_train_bow, y_train)

In [None]:
y_train_pred = log_reg.predict(X_train_bow)
y_test_pred = log_reg.predict(X_test_bow)

In [None]:
print("Accuracy Score for Training set after Class Balanced: {}%".format(accuracy_score(y_train, y_train_pred)*100))

In [None]:
print(classification_report(y_train, y_train_pred))

#### For Regularisation and Hyperparameter Tuning, we are using GridSearchCV and StratifiedKFold

In [None]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

#### Creating a parameter grid of C and Penalty to find the best possible combination for a higher recall

In [None]:
search_params= {
    "C": [0.01,0.1,1,10,100],
    "penalty": ["l1","l2"],
}
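
A grid search tries every combination in the grid. The sketch below enumerates the candidates with `itertools.product` to show how many models this grid implies under 4-fold cross-validation. The names `search_params_demo` and `candidates` are hypothetical and used only here.

```python
from itertools import product

# Same grid as above, copied so that the sketch is self-contained.
search_params_demo = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
}

# Build one dict per combination of parameter values.
names = list(search_params_demo)
candidates = [dict(zip(names, values)) for values in product(*search_params_demo.values())]

n_folds = 4
print(len(candidates))            # 10 candidate settings
print(len(candidates) * n_folds)  # 40 cross-validation fits
```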

In [None]:
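# liblinear is used because scikit-learn's default lbfgs solver does not support the l1 penalty
log_reg = LogisticRegression(class_weight = "balanced", solver = "liblinear")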

In [None]:
grid_search = GridSearchCV(estimator = log_reg, param_grid = search_params, cv = StratifiedKFold(4), scoring = "recall", 
                           n_jobs = -1, verbose = 1)
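
Stratification matters here because a plain split could leave a fold with very few hate tweets, which would make the recall score for that fold unreliable. The sketch below is a simplified illustration of the idea and is not scikit-learn's actual implementation. It deals the samples of each class out to the folds in turn, so every fold keeps roughly the overall class ratio. The function name `stratified_folds` and the labels are invented.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign sample indices to folds by dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Toy labels (invented): 20 non-hate tweets and 4 hate tweets.
y_toy = [0] * 20 + [1] * 4
folds = stratified_folds(y_toy, n_splits=4)

for number, fold in enumerate(folds):
    positives = sum(y_toy[i] for i in fold)
    print("fold", number, "size", len(fold), "hate tweets", positives)
```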

In [None]:
import warnings
warnings.filterwarnings('ignore')

grid_search.fit(X_train_bow, y_train)

In [None]:
grid_search.best_estimator_

In [None]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow)
print(classification_report(y_test, y_test_pred))