# Combating Hate Speech Using NLP and Machine Learning

### Objective:

Using NLP and ML, create a model to identify hate speech on Twitter.

### Problem Statement:

Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium to spread hate.

### Domain: Social Media

### Analysis to be done: 

Clean up the tweets and build a classification model using NLP techniques. Apply regularization and hyperparameter tuning with stratified k-fold cross-validation to select the best model.

### Content:
* id: identifier number of the tweet
* label: 0 (non-hate) / 1 (hate)
* tweet: the text in the tweet

### Tasks:
1. Load the tweets file using read_csv function from Pandas Package.
2. Get the tweets into a list for easy text cleanup and manipulation
3. To Cleanup:
    * Normalize the casing
    * Using regular expressions, remove user handles. These begin with '@'
    * Using regular expressions, remove URLs.
    * Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.
    * Remove stopwords
    * Remove redundant terms like 'amp', 'rt', etc.
    * Remove '#' symbols from the tweet while retaining the term
4. Extra Cleanup by removing terms with a length of 1
5. Check out the top terms in the tweets:
    * First, get all the tokenized terms into one large list.
    * Use the counter and find the 10 most common terms.
6. Data Formatting for predictive modeling:
    * Join the tokens back to form strings. This will be required for the vectorizers.
    * Assign X and y.
    * Perform train_test_split using sklearn
7. We'll use TF-IDF values for the terms as a feature to get into a vector space model
    * Import the TF-IDF vectorizer (TfidfVectorizer) from sklearn
    * Instantiate with a maximum of 5000 terms in your vocabulary
    * Fit and apply on the train set.
    * Apply on the test set
8. Model Building: Ordinary Logistic Regression
    * Instantiate Logistic Regression from sklearn with default parameters.
    * Fit into the train data
    * Make predictions for the train and the test set
9. Model Evaluation: Accuracy, recall and f1 score
    * Report the accuracy on the train set.
    * Report the recall on the train set: decent, high or low
    * Get the f1 score on the train set
10. Looks like you need to adjust for the class imbalance.
11. Train again with the adjustment and evaluate.
    * Train the model on the train set
    * Evaluate the predictions on the train set: accuracy, recall and f1 score
12. Regularization and Hyperparameter tuning:
    * Import GridSearchCV and StratifiedKFold (stratified because of the class imbalance).
    * Provide the parameter grid to choose for 'C' and penalty parameters.
    * Use a balanced class weight while instantiating the logistic regression
13. Find the parameters with the best recall in cross validation
    * Choose 'recall' as the metric for scoring
    * Choose stratified 4 fold cross validation scheme
    * Fit into the train set
14. What are the best parameters?
15. Predict and evaluate using the best estimator.
    * Use the best estimator from the grid search to make predictions on the test
    * What is the recall on the test set for the hate tweets (label 1)?
    * What is the f1 score?

In [2]:
import pandas as pd
import numpy as np
import os, re

#### Read in the csv using pandas

In [3]:
inp_tweets0 = pd.read_csv ("Tweets_USA.csv")
inp_tweets0.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


In [4]:
inp_tweets0.label.value_counts(normalize = True)

0    0.929854
1    0.070146
Name: label, dtype: float64

#### So around 7% of the tweets in this data are labelled as hate
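With such a skewed label distribution, accuracy alone can be misleading. As a minimal sketch (the 93/7 toy labels below are rounded from the proportions above for illustration and are not the real data), a classifier that always predicts 0 already looks accurate while catching no hate tweets:

```python
# Toy labels approximating the class balance above: 93 non-hate, 7 hate.
y_true = [0] * 93 + [1] * 7

# A "majority class" baseline that never predicts hate.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for class 1 = true positives / all actual positives.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_hate = true_pos / sum(t == 1 for t in y_true)

print(accuracy, recall_hate)
```

This is why recall and f1 for class 1 are tracked throughout the notebook rather than accuracy alone.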

In [5]:
inp_tweets0.tweet.sample().values[0]

#checking the terms contained in a random tweet

'#thursdaythoughts  from @user  choose   #everyday '

#### Get the tweets into a list, for easy text clean up and manipulation

In [6]:
tweets0 = inp_tweets0.tweet.values

#note: .values returns a NumPy array rather than a Python list; it supports the same iteration and slicing used below

In [7]:
len(tweets0)

31962

In [8]:
tweets0[:5]

array([' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
       "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
       '  bihday your majesty',
       '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
       ' factsguide: society now    #motivation'], dtype=object)

#### As we can see here, the tweets can contain:

1. URLs
2. Hashtags
3. user handles
4. 'RT'

### Cleaning up the data

#### Normalizing the casing

In [9]:
tweets_lower = [twt.lower() for twt in tweets0] 

#to avoid case sensitivity

In [10]:
tweets_lower[:5]

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

### Removing user handles that begin with '@'

In [11]:
import re

In [12]:
re.sub(r"@\w+","", "@Rahim this course rocks! http://rahimbaig.com/ai")

#substituting using regular expression

' this course rocks! http://rahimbaig.com/ai'

In [13]:
tweets_nouser = [re.sub(r"@\w+","",twt) for twt in tweets_lower]

#doing the same for the whole list

In [14]:
tweets_nouser[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

#### Now it looks a little cleaner, as all the user handles have been removed
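One caveat with the `@\w+` pattern is that it also matches the part after '@' in an email address. A hedged sketch of a stricter variant, assuming a handle is never glued to a preceding word character (the name `HANDLE_RE` and the sample text are hypothetical):

```python
import re

# Match '@name' only when it is not directly preceded by a word character,
# so 'someone@example.com' is left alone.
HANDLE_RE = re.compile(r"(?<!\w)@\w+")

sample = "@user mail me at someone@example.com, thanks @friend_1!"

simple = re.sub(r"@\w+", "", sample)   # simple pattern used above
strict = HANDLE_RE.sub("", sample)     # stricter variant

print(simple)
print(strict)
```

Whether the stricter pattern is worth it depends on how often such strings occur in the data.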

### Removing URLs

In [15]:
re.sub(r"\w+://\S+","", "@Rahim this course rocks! http://rahimbaig.com/ai")

#the + sign after \S is important: without it, only the first character after '://' is matched, so the rest of the URL ('ahimbaig.com/ai') would remain in the output

'@Rahim this course rocks! '

In [16]:
tweets_nourl = [re.sub(r"\w+://\S+","", twt)
                for twt in tweets_nouser]
#doing same for the whole list

In [17]:
tweets_nourl[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

#### Now we can see that there are no URLs present in the tweets
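The `\w+://\S+` pattern only catches links that include a scheme such as `http://`. If the data also contained bare `www.` links, a sketch like the following could be used; the extended pattern `URL_RE` and the sample text are assumptions for illustration, not part of the original cleanup:

```python
import re

# Either scheme://... or a bare www.... link, up to the next whitespace.
URL_RE = re.compile(r"(?:\w+://|www\.)\S+")

sample = "read this http://example.com/a and www.example.org/b now"

# Remove the links, then collapse the leftover runs of spaces.
no_urls = URL_RE.sub("", sample)
cleaned = re.sub(r"\s+", " ", no_urls).strip()
print(cleaned)
```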

#### Tokenizing the tweets into individual terms using TweetTokenizer from NLTK

In [18]:
from nltk.tokenize import TweetTokenizer

In [19]:
tkn = TweetTokenizer()

In [20]:
print(tkn.tokenize(tweets_nourl[0]))

['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run']


In [21]:
tweet_token = [tkn.tokenize(sent) for sent in tweets_nourl]
#print(tweet_token[31000])
print(tweet_token[12345])

['got', 'a', 't-shi', 'and', 'mini-sticker', 'for', "father's", 'day', '!', '!', 'just', 'need', 'to', 'get', 'a', 'magnet', 'to', 'complete', 'the', 'collection', '!']
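TweetTokenizer can also take over some of the earlier cleanup itself. A sketch using its `preserve_case`, `strip_handles` and `reduce_len` options (the sample tweet is made up, and this is an optional variation rather than what the notebook ran):

```python
from nltk.tokenize import TweetTokenizer

# preserve_case=False lowercases tokens, strip_handles=True drops @handles,
# reduce_len=True shortens runs of 3+ repeated characters to 3.
tkn_demo = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

sample = "@remy: This is waaaaayyyy too much for you!!!!!!"
tokens = tkn_demo.tokenize(sample)
print(tokens)
```

Doing these steps inside the tokenizer keeps the pipeline shorter, at the cost of less control than the explicit regular expressions.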


### Remove punctuation, stop words and other redundant terms like 'rt', 'amp', etc.

* Also remove the '#' symbol from hashtags

In [22]:
from nltk.corpus import stopwords
from string import punctuation
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\2211592\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
stop_nltk = stopwords.words("english")
stop_punct = list(punctuation)
stop_punct

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [24]:
stop_punct.extend(['...', '``', "''",".."])

In [25]:
stop_context = ['rt', 'amp']

In [26]:
stop_final = stop_nltk + stop_punct + stop_context

#### Creating a function to:

* remove stop words from a single tokenized tweet
* remove the '#' symbol from hashtags while keeping the term
* remove terms with length = 1

In [27]:
def del_stop(sent):
    #drop stop words and single-character terms, then strip '#' from the remaining terms
    return [re.sub("#","",term) for term in sent if ((term not in stop_final) and (len(term)>1))]

In [28]:
#del_stop(tweet_token[4])
del_stop(tweet_token[12345])
#trying the function on one random tweet

['got',
 't-shi',
 'mini-sticker',
 "father's",
 'day',
 'need',
 'get',
 'magnet',
 'complete',
 'collection']

In [29]:
tweets_clean = [del_stop(tweet) for tweet in tweet_token]

#applying on the whole list
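Because `stop_final` is a list, every `term not in stop_final` check scans it from the start. A minimal sketch of the same filter with a set, which gives constant-time lookups on average; the shortened stop list and the name `del_stop_fast` are stand-ins for illustration:

```python
# Hypothetical, shortened stand-in for the stop_final list built above.
stop_final_demo = ["a", "and", "for", "to", "the", "just", "!", "rt", "amp"]
stop_set = set(stop_final_demo)

def del_stop_fast(sent):
    # Keep terms that are not stop words and are longer than one character,
    # then drop any '#' so hashtags keep only their term.
    return [term.replace("#", "") for term in sent
            if term not in stop_set and len(term) > 1]

tokens = ["rt", "just", "need", "a", "magnet", "for", "the", "#collection", "!"]
result = del_stop_fast(tokens)
print(result)
```

On roughly 32,000 short tweets the difference is small, but it grows with the size of the corpus and the stop list.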

### Checking out the top terms in the tweets

In [30]:
from collections import Counter

In [31]:
term_list = []
for tweet in tweets_clean:
    term_list.extend(tweet)
#getting all the tokenized terms into one list

In [32]:
res = Counter(term_list)
res.most_common(10)

#using the counter to get the 10 most common terms

[('love', 2748),
 ('day', 2276),
 ('happy', 1684),
 ('time', 1131),
 ('life', 1118),
 ('like', 1047),
 ("i'm", 1018),
 ('today', 1013),
 ('new', 994),
 ('thankful', 946)]
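The loop-and-extend step can also be written with `itertools.chain`, which feeds all tokens to `Counter` without building the intermediate list. A small sketch on made-up token lists:

```python
from collections import Counter
from itertools import chain

# Hypothetical cleaned tweets, already tokenized.
demo_tweets = [["love", "day"], ["happy", "day"], ["love", "life", "love"]]

# chain.from_iterable flattens the nested lists lazily.
counts = Counter(chain.from_iterable(demo_tweets))
top_two = counts.most_common(2)
print(top_two)
```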

### Data Formatting for Predictive Modelling

In [33]:
tweets_clean[30000]

#checking the tokens for one random tweet

['never', 'msg', 'first', 'dun', 'msg', 'first', 'disappointed']

In [34]:
tweets_clean = [" ".join(tweet) for tweet in tweets_clean]
# joining the tokens back to string form

In [35]:
tweets_clean[30000]
#checking the strings for one random tweet

'never msg first dun msg first disappointed'

### Separating X and y and performing train test split, 70-30

In [36]:
len(tweets_clean)

31962

In [37]:
X = tweets_clean
y = inp_tweets0.label.values

#### Train Test Split using sklearn

In [38]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)
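With only about 7% positives, a plain random split can leave slightly different class ratios in the train and test sets. `train_test_split` accepts a `stratify` argument that preserves the ratio; the sketch below uses made-up labels and is an optional variation, not what the notebook ran:

```python
from sklearn.model_selection import train_test_split

# Toy data: 1000 documents, 7% labelled 1.
X_demo = [f"tweet {i}" for i in range(1000)]
y_demo = [1] * 70 + [0] * 930

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# Share of positives in each split.
train_rate = sum(y_tr) / len(y_tr)
test_rate = sum(y_te) / len(y_te)
print(train_rate, test_rate)
```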

### Create a document term matrix using the TF-IDF vectorizer

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [40]:
vectorizer = TfidfVectorizer(max_features = 5000)

#instantiating with a maximum of 5000 terms in our vocabulary

In [41]:
len(X_train), len(X_test)

(22373, 9589)

In [42]:
X_train_bow = vectorizer.fit_transform(X_train)

X_test_bow = vectorizer.transform(X_test)

In [43]:
X_train_bow.shape, X_test_bow.shape

((22373, 5000), (9589, 5000))
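For reference, with its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn's `TfidfVectorizer` weights a term $t$ in document $d$ roughly as sketched below, where $n$ is the number of training documents, $\mathrm{df}(t)$ is the number of them containing $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in $d$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$

Each document vector is then scaled to unit Euclidean length. Because $n$ and $\mathrm{df}(t)$ are learned from the train set in `fit_transform`, the test set is only passed through `transform`.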

### Model Building

#### Using a simple logistic Regression

In [44]:
from sklearn.linear_model import LogisticRegression

In [45]:
logreg = LogisticRegression()

In [46]:
logreg.fit(X_train_bow, y_train)

LogisticRegression()

In [47]:
y_train_pred = logreg.predict(X_train_bow)
y_test_pred = logreg.predict(X_test_bow)

In [48]:
from sklearn.metrics import accuracy_score, classification_report

In [49]:
accuracy_score(y_train, y_train_pred)

0.9560184150538595

In [50]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     20815
           1       0.96      0.39      0.55      1558

    accuracy                           0.96     22373
   macro avg       0.96      0.69      0.76     22373
weighted avg       0.96      0.96      0.95     22373



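#### Here, the weighted average recall is 0.96, but recall for the hate class (1) is only 0.39 (f1 score 0.55), so the model misses most hate tweets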
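The f1 score in the report is the harmonic mean of precision and recall, which is why a high precision cannot compensate for a low recall. A quick check using the rounded class 1 values printed above:

```python
# Class 1 precision and recall as printed in the report above.
precision, recall = 0.96, 0.39

# f1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))
```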

### Adjusting for class imbalance

In [51]:
logreg = LogisticRegression(class_weight = "balanced") 

In [52]:
logreg.fit(X_train_bow, y_train)

LogisticRegression(class_weight='balanced')
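According to the scikit-learn documentation, `class_weight='balanced'` sets each class weight to `n_samples / (n_classes * count_of_class)`. A small sketch applying that formula to the train set supports shown in the earlier report (20815 and 1558):

```python
# Class counts in the train set, taken from the 'support' column above.
counts = {0: 20815, 1: 1558}
n_samples = sum(counts.values())
n_classes = len(counts)

# Rarer classes get proportionally larger weights.
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}
print({label: round(w, 2) for label, w in weights.items()})
```

Errors on hate tweets are therefore penalized roughly 13 times more heavily than errors on non-hate tweets during training.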

In [53]:
y_train_pred = logreg.predict(X_train_bow)
y_test_pred = logreg.predict(X_test_bow)

In [54]:
accuracy_score(y_train, y_train_pred)

0.9527108568363652

In [55]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.97     20815
           1       0.60      0.97      0.74      1558

    accuracy                           0.95     22373
   macro avg       0.80      0.96      0.86     22373
weighted avg       0.97      0.95      0.96     22373



#### Here, the weighted average recall is 0.95, and recall for the hate class (1) rises to 0.97 (f1 score 0.74)

### Regularization and Hyperparameter tuning:

In [56]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# StratifiedKFold keeps the class proportions the same in every fold, which matters because of the class imbalance

In [57]:
#Create the parameter grid of candidate values for C, the inverse regularization strength
param_grid = {
    'C': [0.01, 0.1, 0.5, 1, 5, 10]
}

In [58]:
classifier_lr = LogisticRegression(class_weight = "balanced")

#using a balanced class weight while instantiating the logistic regression

In [59]:
# Instantiate the grid search model

grid_search = GridSearchCV(estimator = classifier_lr,
                           param_grid = param_grid,
                           cv = StratifiedKFold(4),
                           n_jobs = -1, verbose = 1,
                           scoring = "recall" )

#choosing recall as the scoring metric and a stratified 4-fold cross-validation scheme

In [60]:
grid_search.fit(X_train_bow, y_train)

Fitting 4 folds for each of 6 candidates, totalling 24 fits


GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=None, shuffle=False),
             estimator=LogisticRegression(class_weight='balanced'), n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 0.5, 1, 5, 10]}, scoring='recall',
             verbose=1)

In [61]:
grid_search.best_estimator_

LogisticRegression(C=0.5, class_weight='balanced')
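Besides `best_estimator_`, the fitted search object exposes `best_params_`, `best_score_` and `cv_results_`, which show how each candidate performed. A self-contained sketch on synthetic data (the dataset from `make_classification` and the smaller grid are stand-ins, not the tweet data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the TF-IDF matrix.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

search = GridSearchCV(
    estimator=LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=StratifiedKFold(4),
    scoring="recall",
)
search.fit(X_demo, y_demo)

# Mean cross-validated recall for every candidate value of C.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))

print(search.best_params_, round(search.best_score_, 3))
```

Looking at all candidates, not only the winner, shows whether the best value of C is clearly better or only marginally so.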

### Using the best estimator to make predictions on the test set

In [62]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow)

In [63]:
y_train_pred = grid_search.best_estimator_.predict(X_train_bow)

In [64]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.98      0.93      0.96      8905
           1       0.47      0.78      0.59       684

    accuracy                           0.92      9589
   macro avg       0.73      0.86      0.77      9589
weighted avg       0.95      0.92      0.93      9589



#### Here, the weighted average recall on the test set is 0.92; for the hate class (1), recall is 0.78 and the f1 score is 0.59
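The weighted average weights each class's recall by its support, so the majority class dominates it. A quick check using the rounded per-class values from the test report above:

```python
# Per-class recall and support from the test report above.
recall = {0: 0.93, 1: 0.78}
support = {0: 8905, 1: 684}

total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in support) / total
print(round(weighted_recall, 2))
```

Support-weighted recall is mathematically the same quantity as accuracy, so the class 1 recall is the more informative number for this task.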