<b>DESCRIPTION</b> :- Using NLP and ML, make a model to identify hate speech (racist or sexist tweets) in Twitter.

<font color="blue"><b>Problem Statement</b></font>:  

Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium  to spread hate. 

You are a data scientist at Twitter, and you will help Twitter in identifying the tweets with hate speech and removing them from the platform. You will use NLP techniques, perform specific cleanup for tweets data, and make a robust model.

<font color="green"><b>Domain</b></font>: Social Media

Analysis to be done: Clean up tweets and build a classification model by using NLP techniques, cleanup specific for tweets data, regularization and hyperparameter tuning using stratified k-fold and cross-validation to get the best model.

Content: 

id: identifier number of the tweet

Label: 0 (non-hate) /1 (hate)

Tweet: the text in the tweet

In [1]:
import pandas as pd
import gensim
from bs4 import BeautifulSoup
import re, string
import numpy as np

In [2]:
inp_data = pd.read_csv("TwitterHate.csv")

In [3]:
inp_data.shape

(31962, 3)

In [4]:
inp_data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


## Cleaning data

In [5]:
# ---------------- Take only english character and numbers--------------

def remove_junk(x):
    import string
    import re
    sent = ''
    for i in x:
        if(i in string.printable):
            sent+=i
    new_sent = ''
    for i in sent:
        if(i in string.ascii_letters+' '+'0123456789'):
            new_sent+=i
            
    new_sent= re.sub('\s+', ' ', new_sent)
    return new_sent

            
    

In [6]:
inp_data["tweet"] = inp_data["tweet"].apply(lambda a: remove_junk(a))

In [7]:
inp_data.head()

Unnamed: 0,id,label,tweet
0,1,0,user when a father is dysfunctional and is so...
1,2,0,user user thanks for lyft credit i cant use ca...
2,3,0,bihday your majesty
3,4,0,model i love u take with u all the time in ur
4,5,0,factsguide society now motivation


## Lemmatization using Spacy

In [None]:
import spacy
nlp = spacy.load('en', disable=['parser', 'ner'])

# function to lemmatize text
def lemmatization(texts):
    s = [token.lemma_ for token in nlp(texts)]
    return ' '.join(s)

In [12]:
inp_data['tweet'] = inp_data['tweet'].apply(lambda a: lemmatization(a))

## Prepare data for modellig

In [13]:
from gensim import utils
from gensim.test.utils import datapath
from gensim import utils


class MyCorpus(object):
    """An interator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        print(open(corpus_path))
        for line in inp_data.tweet:
            print(line)
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)

In [14]:
import gensim.models

sentences = MyCorpus()


## Word to vec model using gensim

In [None]:
model = gensim.models.Word2Vec(sentences,
                              size=50)#model = gensim.models.Word2Vec(sentences=sentences)

## Saving and Loading model

In [20]:
#

import tempfile
import numpy as np
with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
    model.save(temporary_filepath)
    #
    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.
    #
    # To load a saved model:
    #
    new_model = gensim.models.Word2Vec.load(temporary_filepath)

In [21]:
w2v_data = []

## Adding word vectors to form sentence vectors

In [23]:
for j in inp_data.tweet:
    word_lst = j.split()
   # print(word_lst)
    for i, k in enumerate(word_lst):
        average_vec = np.array([0]*50)
        if i ==0 or (k not in new_model.wv.index2word) :
            average_vec = average_vec+ np.array([0]*50)
            #print(k)
            continue
        
        average_vec = average_vec+new_model.wv[k]
    #average_vec = average_vec/len(word_lst)
    w2v_data.append(average_vec)
    
    #print(j)

In [24]:
##---------------Convert it to numpy array
new_inp_data = np.array(w2v_data)

## Break the data into train and validation

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
train_xx, test_xx, train_yy, test_yy = train_test_split(new_inp_data, inp_data.label)

In [27]:
train_xx.shape, test_xx.shape, train_yy.shape, test_yy.shape

((23971, 50), (7991, 50), (23971,), (7991,))

## Use Logstic Regression for modelling

In [28]:
from sklearn.linear_model import LogisticRegression

In [29]:
lg = LogisticRegression()
lg.fit(train_xx, train_yy)
# make class predictions for X_test_dtm
y_pred_class = lg.predict(test_xx)
# print the confusion matrix


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


## Evaluation of model

In [30]:
from sklearn import metrics
print("accuracy",metrics.accuracy_score(test_yy, y_pred_class))
print("precision",metrics.precision_score(test_yy, y_pred_class))
print("recall",metrics.recall_score(test_yy, y_pred_class))
print("f1",metrics.f1_score(test_yy, y_pred_class))
print("confusion",metrics.confusion_matrix(test_yy, y_pred_class))



accuracy 0.9317982730571893
precision 0.6666666666666666
recall 0.05008944543828265
f1 0.09317803660565724
confusion [[7418   14]
 [ 531   28]]
