This project is based on an Analytics Vidhya prompt, which narrows the scope of hate speech to racist and sexist language.

In [1]:
import re    # for regular expressions 
import nltk  # for text manipulation 
import string 
import warnings 
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt  

pd.set_option("display.max_colwidth", 200) 
warnings.filterwarnings("ignore", category=DeprecationWarning) 

%matplotlib inline

In [2]:
train  = pd.read_csv('csv_and_npz_files/train_E6oV3lV.csv') 
test = pd.read_csv('csv_and_npz_files/test_tweets_anuFYb8.csv')
combined = pd.read_csv('csv_and_npz_files/processed_tweets.csv').fillna('')

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer 
import gensim

# Feature Generation with Bag of Words

In [4]:
# Here, I use a Bag of Words approach: each tweet becomes a vector of term counts.
# The vocabulary is capped (arbitrarily) at the 1000 most frequent terms in the normalized tweets.

bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english') 
bow = bow_vectorizer.fit_transform(combined['tidy_tweet'])

# Then apply a TF-IDF vectorizer, which gives greater weight to terms that are frequent within a tweet but rare across the corpus.
# Terms that occur in many tweets receive less weight in subsequent calculations.

tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english') 
tfidf = tfidf_vectorizer.fit_transform(combined['tidy_tweet']) 

# We can keep this set of features for future ML application.
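
To make the TF-IDF weighting concrete, here is a minimal sketch on a hypothetical three-tweet corpus. It uses the plain textbook formula (raw term count times log(N / df)), which is an assumption for illustration only; scikit-learn's `TfidfVectorizer` by default smooths the IDF term and L2-normalizes each row, so its numbers will differ. The names `docs`, `doc_freq`, and `tfidf` are invented for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each "document" is one already-tokenized tweet.
docs = [
    ["love", "coffee", "morning"],
    ["love", "tea"],
    ["love", "coffee", "coffee"],
]

n_docs = len(docs)

# Document frequency: the number of tweets in which each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    counts = Counter(doc)
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tfidf(doc) for doc in docs]
for doc, doc_weights in zip(docs, weights):
    print(doc, doc_weights)
```

In this sketch "love" appears in every tweet, so its weight is zero, while "tea" and "morning" each appear in only one tweet and receive the largest weights.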


# Feature Generation with Word2Vec Embedding

In [19]:
# Here, we build an alternative feature set with Word2Vec, a shallow neural network trained with the skip-gram objective.
# Skip-gram learns associations between words by predicting the surrounding context words from a given input word.
# The learned network weights, rather than raw word counts, become the representation of each word.
# Each word is embedded as a dense vector with 200 dimensions.

tokenized_tweet = combined['tidy_tweet'].apply(lambda x: x.split()) # tokenizing 
model_w2v = gensim.models.Word2Vec(
            tokenized_tweet,
            size=200, # desired no. of features/independent variables
            window=5, # context window size
            min_count=2,
            sg = 1, # 1 for skip-gram model
            hs = 0,
            negative = 10, # for negative sampling
            workers= 2, # no.of cores
            seed = 34) 

model_w2v.train(tokenized_tweet, total_examples= len(combined['tidy_tweet']), epochs=20)

(6510028, 7536020)
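
The skip-gram idea described in the comments above can be sketched without any library: for each word, the training examples are the (input word, context word) pairs found within the window. The helper `skipgram_pairs` below is hypothetical and only illustrates how such pairs arise; gensim's actual training adds further steps, such as negative sampling, that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for one tokenized tweet."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the boundaries of the tweet.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical example using a window of 1 (the notebook uses window=5).
tokens = ["thank", "#lyft", "credit", "caus"]
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

With a window of 1, "thank" is paired only with "#lyft", while "#lyft" is paired with both of its neighbours.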

In [22]:
# Let's experiment with our model_w2v
# The scores returned by most_similar are cosine similarities between the query word's vector and each listed word's vector.

model_w2v.wv.most_similar(positive="breakfast")

[('#feelingdiet', 0.6099264621734619),
 ('melani', 0.5826607942581177),
 ('avocado', 0.5726335048675537),
 ('#tacotuesday', 0.5583226680755615),
 ('deadlift', 0.5555545091629028),
 ('#pushyourself', 0.5494673252105713),
 ('cookout', 0.5476097464561462),
 ('#yogu', 0.5425668358802795),
 ('cuppa', 0.5418993830680847),
 ('sushi', 0.5393021106719971)]

In [None]:
# Deadlift is an interesting entry, but otherwise this seems to have worked effectively.

In [23]:
model_w2v.wv.most_similar(positive="trump")

[('donald', 0.5566582083702087),
 ('hillari', 0.5443348288536072),
 ('phoni', 0.535092830657959),
 ('unstabl', 0.5234814882278442),
 ('unfit', 0.5224140882492065),
 ('melo', 0.517794132232666),
 ('nomine', 0.5167006850242615),
 ('#delegaterevolt', 0.514083743095398),
 ('potu', 0.5131095051765442),
 ('unfavor', 0.5121117234230042)]
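
For intuition about the scores in these outputs, the sketch below computes cosine similarity by hand with NumPy. The three-dimensional vectors in `toy_vectors` are made up for illustration and are not taken from the trained model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical embeddings, chosen so the geometry is easy to see.
toy_vectors = {
    "breakfast": np.array([1.0, 2.0, 0.0]),
    "sushi": np.array([2.0, 4.0, 0.0]),     # same direction as "breakfast"
    "deadlift": np.array([0.0, 0.0, 3.0]),  # orthogonal to "breakfast"
}

sims = {
    word: cosine_similarity(toy_vectors["breakfast"], vec)
    for word, vec in toy_vectors.items()
    if word != "breakfast"
}
print(sims)
```

Vectors pointing in the same direction score close to 1 regardless of length, and orthogonal vectors score 0, which is why the scores above are best read as a measure of direction rather than magnitude.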

The Word2Vec model trained with skip-gram maps each word in its vocabulary to a 200-dimensional vector. To craft features that will be usable in future ML application, we need to convert each tweet into a single vector, which we do by averaging the vectors of the words it contains. This will result in 200 additional features we can use for training.

In [20]:
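def word_vector(tokens, size):
    # Average the word vectors of all in-vocabulary tokens in one tweet.
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v.wv[word].reshape((1, size))
            count += 1.
        except KeyError: # handling the case where the token is not in vocabulary
            continue
    # Divide once, after the loop, so the result is the mean of the summed vectors.
    if count != 0:
        vec /= count
    return vec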

wordvec_arrays = np.zeros((len(tokenized_tweet), 200)) 

for i in range(len(tokenized_tweet)):
    wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)

wordvec_df = pd.DataFrame(wordvec_arrays)
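
As a quick sanity check of the averaging idea, here is a minimal sketch that uses a hypothetical two-dimensional embedding dictionary in place of the trained model. The names `toy_embeddings` and `mean_vector` are invented for this example.

```python
import numpy as np

# Hypothetical embeddings standing in for the trained Word2Vec vectors.
toy_embeddings = {
    "love": np.array([1.0, 0.0]),
    "coffee": np.array([0.0, 1.0]),
}


def mean_vector(tokens, embeddings, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)


# The out-of-vocabulary token is skipped and does not affect the mean.
tweet_vec = mean_vector(["love", "coffee", "unknownword"], toy_embeddings, size=2)
empty_vec = mean_vector(["unknownword"], toy_embeddings, size=2)
print(tweet_vec, empty_vec)
```

A tweet with no in-vocabulary tokens ends up as an all-zero row, which is worth keeping in mind when these features are passed to a classifier.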


# Feature Generation with Doc2Vec Embedding

In this next section, let's use Doc2Vec Embedding to generate additional features.  Doc2Vec can be thought of as an extension of Word2Vec, where a vector is learned for each whole document (here, each tweet) in addition to the vectors for single words within the document.

In [3]:
from tqdm import tqdm 
tqdm.pandas(desc="progress-bar") 
from gensim.models.doc2vec import LabeledSentence

In [22]:
def add_label(twt):
    output = []
    for i, s in zip(twt.index, twt):
        output.append(LabeledSentence(s, ["tweet_" + str(i)]))
    return output
labeled_tweets = add_label(tokenized_tweet) # label all the tweets

In [27]:
labeled_tweets[:6]

[LabeledSentence(words=['when', 'father', 'dysfunct', 'selfish', 'drag', 'kid', 'into', 'dysfunct', '#run'], tags=['tweet_0']),
 LabeledSentence(words=['thank', '#lyft', 'credit', 'caus', 'they', 'offer', 'wheelchair', 'van', '#disapoint', '#getthank'], tags=['tweet_1']),
 LabeledSentence(words=['bihday', 'your', 'majesti'], tags=['tweet_2']),
 LabeledSentence(words=['#model', 'love', 'take', 'with', 'time'], tags=['tweet_3']),
 LabeledSentence(words=['factsguid', 'societi', '#motiv'], tags=['tweet_4']),
 LabeledSentence(words=['huge', 'fare', 'talk', 'befor', 'they', 'leav', 'chao', 'disput', 'when', 'they', 'there', '#allshowandnogo'], tags=['tweet_5'])]

In [23]:
model_d2v = gensim.models.Doc2Vec(dm=1, # dm = 1 for ‘distributed memory’ model
                                  dm_mean=1, # dm_mean = 1 for using mean of the context word vectors
                                  vector_size=200, # no. of desired features
                                  window=5, # width of the context window
                                  negative=7, # if > 0 then negative sampling will be used
                                  min_count=5, # ignores all words with total frequency lower than 5
                                  workers=3, # no. of cores
                                  alpha=0.1, # learning rate
                                  seed = 23)
model_d2v.build_vocab([i for i in tqdm(labeled_tweets)])
model_d2v.train(labeled_tweets, total_examples= len(combined['tidy_tweet']), epochs=15)

100%|██████████| 49159/49159 [00:00<00:00, 1833464.85it/s]


In [29]:
docvec_arrays = np.zeros((len(tokenized_tweet), 200)) 
for i in range(len(combined)):
    docvec_arrays[i,:] = model_d2v.docvecs[i].reshape((1,200))    

docvec_df = pd.DataFrame(docvec_arrays) 

In [3]:
from scipy import sparse

sparse.save_npz("csv_and_npz_files/bow.npz", bow)
sparse.save_npz("csv_and_npz_files/tfidf.npz", tfidf)
wordvec_df.to_csv('csv_and_npz_files/wordvec_df.csv')
docvec_df.to_csv('csv_and_npz_files/docvec_df.csv')
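
When these files are read back in a later notebook, the round trip might look like the sketch below. It uses small stand-in data and a temporary directory rather than the project's real paths, so the file names here are hypothetical. One detail to watch: `to_csv` writes the index as the first column by default, so `index_col=0` is needed on reload to avoid an extra feature column.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse

# Small stand-ins for the sparse BoW matrix and the dense vector DataFrame.
toy_sparse = sparse.csr_matrix(np.array([[0, 1, 0], [2, 0, 3]]))
toy_df = pd.DataFrame(np.array([[0.1, 0.2], [0.3, 0.4]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    npz_path = os.path.join(tmp_dir, "toy_bow.npz")
    csv_path = os.path.join(tmp_dir, "toy_vectors.csv")

    # Save in the same way as the cell above.
    sparse.save_npz(npz_path, toy_sparse)
    toy_df.to_csv(csv_path)

    # Reload; index_col=0 restores the saved index instead of adding a column.
    loaded_sparse = sparse.load_npz(npz_path)
    loaded_df = pd.read_csv(csv_path, index_col=0)

print(loaded_sparse.shape, loaded_df.shape)
```

After reloading, the column labels of the DataFrame are strings ("0", "1") rather than integers, since CSV does not preserve label types.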


This concludes the Data Processing prior to ML application.