## Selection of Methods


##### Libraries
- spaCy

spaCy outperforms NLTK in **word tokenization** and **part-of-speech tagging**. NLTK is faster at **sentence tokenization** because it only makes simple attempts at splitting text into sentences, whereas spaCy constructs a syntactic tree for each sentence, a more robust method that yields more information about the text.

Here we assume that the out-of-bag samples are all in English, so spaCy can be used.


 ![https://www.thedataincubator.com/wp-content/uploads/timing.png](https://www.thedataincubator.com/wp-content/uploads/timing.png)

## Methods to explore (from proposal)
- TF-IDF
- N-gram
- Word Embedding
- Bag-of-Words
- Skip-Gram


In [None]:
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

path = "/content/drive/My Drive/Project/labeled_data.csv"
twitter_hate = pd.read_csv(path)

Mounted at /content/drive


In [None]:
twitter_hate.head(20)

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
5,5,3,1,2,0,1,"!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just..."
6,6,3,0,3,0,1,"!!!!!!""@__BrighterDays: I can not just sit up ..."
7,7,3,0,3,0,1,!!!!&#8220;@selfiequeenbri: cause I'm tired of...
8,8,3,0,3,0,1,""" &amp; you might not get ya bitch back &amp; ..."
9,9,3,1,2,0,1,""" @rhythmixx_ :hobbies include: fighting Maria..."


# Cleaning Tweets

https://www.kaggle.com/code/thebrownviking20/topic-modelling-with-spacy-and-scikit-learn/notebook

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
#!python -m spacy download en_core_web_lg

### Function used to

1. Lemmatize words
2. Convert to lowercase
3. Remove stopwords and punctuation (except `#`)
4. Reconnect all tokens to form a new sentence



In [None]:
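import string
from tqdm import tqdm
import re

# spaCy's English stop words plus the retweet marker 'rt'
stopwords = list(STOP_WORDS) + ['rt']
# strip all punctuation except '#', so hashtags survive this pass
punctuations = list(string.punctuation)
punctuations.remove('#')
parser = English()

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    # lemmatize and lowercase; keep the surface form of pronouns (spaCy v2 gives them the lemma "-PRON-")
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]
    # drop stop words and punctuation tokens
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]
    # rejoin the remaining tokens into a single string
    return " ".join(mytokens)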

In [None]:
tqdm.pandas()
twitter_hate["tweets_cleaned_v1"] = twitter_hate["tweet"].progress_apply(spacy_tokenizer)

100%|██████████| 24783/24783 [00:08<00:00, 3031.60it/s]


### After exploring the first cleaned version of the tweets, apply a second layer of cleaning to remove

1. @ mentions
2. URL links
3. Numbers
4. Underscores
5. Remaining punctuation, including `#`

In [None]:
tqdm.pandas()
twitter_hate["tweets_cleaned"] = twitter_hate["tweets_cleaned_v1"].progress_apply(lambda x: re.sub(r"(_[A-Za-z0-9-_]+)|(@[A-Za-z0-9]+)|[^\w\s]|http\S+|[0-9]", "",x))



100%|██████████| 24783/24783 [00:00<00:00, 161116.85it/s]
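
To make the effect of the second cleaning pass concrete, the sketch below applies the same regular expression to a made-up string. The sample text and the helper name `second_pass` are illustrative assumptions, not part of the dataset or the notebook.

```python
import re

# Same pattern as the cleaning step above
SECOND_PASS = re.compile(r"(_[A-Za-z0-9-_]+)|(@[A-Za-z0-9]+)|[^\w\s]|http\S+|[0-9]")

def second_pass(text):
    """Remove underscore suffixes, @ mentions, punctuation, URLs and digits."""
    return SECOND_PASS.sub("", text)

sample = "@user123 check http://t.co/abc 42 times foo_bar!"  # hypothetical input
cleaned = second_pass(sample)
# Removed spans leave their surrounding spaces behind, so normalise whitespace before printing
print(cleaned.split())
```

Note that the first alternative removes an underscore together with the characters that follow it, so `foo_bar` becomes `foo` rather than `foobar`. This is also what clears handles such as `@some_user`: the mention pattern stops at the underscore, and the underscore pattern removes the rest.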


In [None]:
twitter_hate.head(1000)

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,tweets_cleaned_v1,tweets_cleaned
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,@mayasolovely woman complain cleaning house am...,woman complain cleaning house amp man trash
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,@mleew17 boy dats cold ... tyga dwn bad cuffin...,boy dats cold tyga dwn bad cuffin dat hoe st...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,@urkindofbrand dawg @80sbaby4life fuck bitch s...,dawg fuck bitch start cry confused shit
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,@c_g_anderson @viva_based look like tranny,look like tranny
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,@shenikaroberts shit hear true faker bitch tol...,shit hear true faker bitch told ya
...,...,...,...,...,...,...,...,...,...
995,1017,3,0,3,0,1,&#128514;&#128514;&#128514;&#128514; RT @SMASH...,# 128514;&#128514;&#128514;&#128514 @smashavel...,murda sucking bitches howdhow
996,1018,3,0,3,0,1,&#128514;&#128514;&#128514;&#128514; bitch if ...,# 128514;&#128514;&#128514;&#128514 bitch hobb...,bitch hobbit need let know right
997,1019,3,0,2,1,1,&#128514;&#128514;&#128514;&#128514; these fol...,# 128514;&#128514;&#128514;&#128514 folks bad ...,folks bad talk trash
998,1020,6,0,6,0,1,&#128514;&#128514;&#128514;&#128514;&#128514; ...,# 128514;&#128514;&#128514;&#128514;&#128514 b...,brittany bitch u dog man


### Function to bucket offensive and hate speech into the same class

In [None]:
def bucket(x):
    # class 2 ("neither") becomes 0; hate speech (0) and offensive language (1) merge into 1
    return 0 if x == 2 else 1

In [None]:

twitter_hate["class"] = twitter_hate['class'].progress_apply(bucket)

100%|██████████| 24783/24783 [00:00<00:00, 700803.88it/s]


### New DataFrame with just cleaned tweets and new classes

In [None]:
twitter_cleaned = twitter_hate[["tweets_cleaned","class"]]

In [None]:
twitter_cleaned.head(100)

Unnamed: 0,tweets_cleaned,class
0,woman complain cleaning house amp man trash,0
1,boy dats cold tyga dwn bad cuffin dat hoe st...,1
2,dawg fuck bitch start cry confused shit,1
3,look like tranny,1
4,shit hear true faker bitch told ya,1
...,...,...
95,going school sucks dick hoes attend,1
96,way fuck yo bitch year old,1
97,come bring food car retard,1
98,richnow hella tinder hoes friend anymore chil...,1


# EDA of tweets

### Most common words

**Looking at the top 20 most used tokens for neutral speech**

In [None]:
import itertools
import collections

nwl = [tweet.split() for tweet in twitter_cleaned.tweets_cleaned[twitter_cleaned['class'] == 0 ]]

word_list_neutral = list(itertools.chain(*nwl))

neutral_count_word = collections.Counter(word_list_neutral)

neutral_count_word.most_common(20)

[('trash', 689),
 ('like', 304),
 ('bird', 304),
 ('charlie', 259),
 ('yankees', 223),
 ('yellow', 218),
 ('birds', 171),
 ('amp', 166),
 ('lol', 145),
 ('got', 131),
 ('colored', 117),
 ('monkey', 115),
 ('ghetto', 113),
 ('u', 111),
 ('good', 95),
 ('know', 90),
 ('new', 90),
 ('love', 87),
 ('day', 85),
 ('game', 85)]

**Looking at the top 20 most used tokens for Hate & Offensive speech**

In [None]:
hwl = [tweet.split() for tweet in twitter_cleaned.tweets_cleaned[twitter_cleaned['class'] == 1 ]]

word_list_hate = list(itertools.chain(*hwl))

hate_count_word = collections.Counter(word_list_hate)

hate_count_word.most_common(20)

[('bitch', 8299),
 ('bitches', 3109),
 ('like', 2480),
 ('hoes', 2376),
 ('pussy', 2135),
 ('hoe', 1907),
 ('ass', 1572),
 ('got', 1469),
 ('fuck', 1425),
 ('shit', 1278),
 ('nigga', 1225),
 ('u', 1185),
 ('lol', 941),
 ('niggas', 790),
 ('know', 721),
 ('amp', 681),
 ('fucking', 631),
 ('love', 629),
 ('yo', 597),
 ('bad', 535)]

**`amp`, `u`, `lol`, `got`, `know` and `love` are added as stopwords because they are common in both classes and might therefore hurt classification accuracy.**

**We modify our initial tweet cleaning function with the additional stopwords**
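
As a minimal sketch of how the shared tokens could be found programmatically rather than by eye, the helper below intersects the top-N entries of two `Counter` objects. The function name `shared_top_tokens` and the toy counters are assumptions for illustration; in the notebook the inputs would be `neutral_count_word` and `hate_count_word`.

```python
from collections import Counter

def shared_top_tokens(counter_a, counter_b, top_n=20):
    """Return tokens that appear in the top_n of both counters, in counter_a's order."""
    top_a = [word for word, _ in counter_a.most_common(top_n)]
    top_b = {word for word, _ in counter_b.most_common(top_n)}
    return [word for word in top_a if word in top_b]

# Toy counters standing in for the per-class word counts
neutral_demo = Counter({"trash": 5, "like": 4, "amp": 3, "bird": 2})
other_demo = Counter({"like": 9, "amp": 6, "bad": 4, "yo": 1})

shared = shared_top_tokens(neutral_demo, other_demo, top_n=3)
print(shared)
```

The result is only a candidate list. Whether a shared token is actually dropped remains a judgment call, which is why a word such as `like` can stay in the vocabulary.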

In [None]:
additional_stopwords = ["amp","u","lol","got","know","love"]
stopwords = list(STOP_WORDS) + additional_stopwords
parser = English()
def spacy_tokenizer_2(sentence):
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords ]
    mytokens = " ".join([i for i in mytokens])
    return mytokens

In [None]:
tqdm.pandas()
# take an explicit copy first, so the assignment does not write to a slice of twitter_hate
twitter_cleaned = twitter_cleaned.copy()
twitter_cleaned["tweets_cleaned"] = twitter_hate["tweets_cleaned"].progress_apply(spacy_tokenizer_2)

100%|██████████| 24783/24783 [00:03<00:00, 7400.88it/s]


In [None]:
nwl_2 = [tweet.split() for tweet in twitter_cleaned.tweets_cleaned[twitter_cleaned['class'] == 0 ]]

word_list_neutral2 = list(itertools.chain(*nwl_2))

neutral_count_word2 = collections.Counter(word_list_neutral2)

neutral_count_word2.most_common(20)

[('trash', 689),
 ('like', 304),
 ('bird', 304),
 ('charlie', 259),
 ('yankees', 223),
 ('yellow', 218),
 ('birds', 171),
 ('colored', 117),
 ('monkey', 115),
 ('ghetto', 113),
 ('good', 95),
 ('new', 90),
 ('day', 85),
 ('game', 85),
 ('man', 84),
 ('people', 83),
 ('want', 83),
 ('ho', 80),
 ('time', 79),
 ('brownies', 78)]

In [None]:
hwl_2 = [tweet.split() for tweet in twitter_cleaned.tweets_cleaned[twitter_cleaned['class'] == 1 ]]

word_list_hate2 = list(itertools.chain(*hwl_2))

hate_count_word2 = collections.Counter(word_list_hate2)

hate_count_word2.most_common(20)

[('bitch', 8300),
 ('bitches', 3109),
 ('like', 2480),
 ('hoes', 2376),
 ('pussy', 2135),
 ('hoe', 1907),
 ('ass', 1572),
 ('fuck', 1425),
 ('shit', 1278),
 ('nigga', 1225),
 ('niggas', 790),
 ('y', 680),
 ('fucking', 631),
 ('yo', 597),
 ('bad', 535),
 ('want', 499),
 ('trash', 465),
 ('ya', 460),
 ('man', 454),
 ('good', 437)]

### N-Gram Analysis
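
This section is left empty in the notebook. As a placeholder, here is a minimal sketch of how bigram frequencies could be counted on the cleaned tweets using only the standard library. The function name `count_ngrams` and the two sample strings are assumptions for illustration.

```python
from collections import Counter

def count_ngrams(texts, n=2):
    """Count n-grams over whitespace-tokenized texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip over n shifted copies of the token list to form the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

sample_tweets = ["come bring food car", "bring food home"]  # hypothetical input
bigrams = count_ngrams(sample_tweets, n=2)
print(bigrams.most_common(1))
```

On the real data this could be called as `count_ngrams(twitter_cleaned.tweets_cleaned)`, once per class, to mirror the most-common-words comparison above.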


### Topic Modelling

**Using Topic Modelling to conduct a rough EDA on the frequent words under different topics**

https://www.kaggle.com/code/thebrownviking20/topic-modelling-with-spacy-and-scikit-learn/notebook

In [None]:
!pip install pyldavis
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence




In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(twitter_cleaned.tweets_cleaned)

In [None]:
NUM_TOPICS = 5

In [None]:
# Latent Dirichlet Allocation Model
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_lda = lda.fit_transform(data_vectorized)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


In [None]:
# Non-Negative Matrix Factorization Model
nmf = NMF(n_components=NUM_TOPICS)
data_nmf = nmf.fit_transform(data_vectorized) 

In [None]:
# Latent Semantic Indexing Model using Truncated SVD
lsi = TruncatedSVD(n_components=NUM_TOPICS)
data_lsi = lsi.fit_transform(data_vectorized)

In [None]:
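# Function for printing the top keywords of each topic
def selected_topics(model, vectorizer, top_n=10):
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so the reversed slice yields the top_n highest-weighted terms
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])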

In [None]:
# Keywords for topics clustered by Latent Dirichlet Allocation
print("LDA Model:")
selected_topics(lda, vectorizer)

LDA Model:
Topic 0:
[('bitch', 776.4424008690262), ('fucking', 636.8736122977173), ('faggot', 388.5604596852031), ('charlie', 291.9436547507748), ('retarded', 260.6308595885329), ('like', 260.502549054973), ('smh', 205.89669479386362), ('best', 184.68607913692645), ('fag', 184.04580723943042), ('colored', 172.84119476712627)]
Topic 1:
[('bitch', 3106.382551606828), ('trash', 1163.4561433099423), ('nigga', 642.4750127852396), ('fuck', 544.9488148621414), ('pussy', 456.19397647594985), ('girl', 436.327025349354), ('let', 410.6993731627661), ('shit', 379.29837908483967), ('white', 372.5633221444536), ('need', 343.799576757489)]
Topic 2:
[('bitches', 2209.9786481694446), ('bitch', 1757.7056794305333), ('ass', 1543.634084453861), ('pussy', 983.5138900924397), ('like', 949.379370302404), ('fuck', 735.2472137413894), ('hoe', 531.3826115304356), ('bad', 453.2950380904888), ('bird', 388.71125555771414), ('wit', 365.0343390972079)]
Topic 3:
[('hoes', 2478.887859818733), ('bitch', 1623.2147822457

In [None]:
# Keywords for topics clustered by Non-Negative Matrix Factorization
print("NMF Model:")
selected_topics(nmf, vectorizer)

NMF Model:
Topic 0:
[('bitch', 10.121489378845006), ('ass', 0.9816763486992757), ('fuck', 0.7695945696187617), ('nigga', 0.7349595456495887), ('shit', 0.4971439799472803), ('bad', 0.28978974512627487), ('fucking', 0.22353242965763054), ('lil', 0.20779090304605902), ('little', 0.19616398543903946), ('want', 0.18542074205735687)]
Topic 1:
[('bitches', 8.701705551328779), ('fuck', 0.8816804480676657), ('niggas', 0.8234114834674219), ('shit', 0.6443059734670081), ('bad', 0.5247609318889208), ('ass', 0.47380960543621364), ('nigga', 0.4103009240726412), ('hate', 0.27646590945956195), ('want', 0.2756984731790545), ('wit', 0.2386922803976572)]
Topic 2:
[('like', 7.987074860020253), ('hoe', 1.2700804744988752), ('look', 0.7810100462906572), ('trash', 0.3774679738275771), ('shit', 0.2969158459869883), ('act', 0.24239548447911366), ('feel', 0.23733659647170807), ('looks', 0.19897500570741788), ('hate', 0.1955027444251619), ('people', 0.17437858167090442)]
Topic 3:
[('hoes', 6.632151344799834), ('

In [None]:
# Keywords for topics clustered by Latent Semantic Indexing
print("LSI Model:")
selected_topics(lsi, vectorizer)

LSI Model:
Topic 0:
[('bitch', 0.9470234456612769), ('like', 0.18537465172726372), ('ass', 0.11728147789397204), ('fuck', 0.09606182500117315), ('nigga', 0.09067559473796152), ('shit', 0.07140415494385846), ('bitches', 0.059890368650847896), ('pussy', 0.03868429040918971), ('hoe', 0.03740061744886309), ('bad', 0.03586237898675675)]
Topic 1:
[('bitches', 0.6805720719800732), ('like', 0.5320641345904927), ('hoes', 0.2462875237061559), ('pussy', 0.16625269225159084), ('hoe', 0.1255531853551552), ('fuck', 0.11311790615666562), ('shit', 0.10767237886730269), ('niggas', 0.10455890674253303), ('ass', 0.0914828505609747), ('nigga', 0.08358845133163126)]
Topic 2:
[('like', 0.4853069557821071), ('hoes', 0.42151877901573037), ('pussy', 0.24338610629544832), ('hoe', 0.18092090895778323), ('look', 0.04992648903719221), ('nigga', 0.038002362477741004), ('ass', 0.03499981363624225), ('trash', 0.034779640301759396), ('shit', 0.02190104806464595), ('eat', 0.01898274356854081)]
Topic 3:
[('hoes', 0.8106


**Reading the pyLDAvis dashboard:**

**1. Topics are shown on the left, and their respective keywords on the right.**

**2. Larger topics are more frequent, and the closer two topics are, the more similar they are.**

**3. Keywords are selected based on their frequency and how well they discriminate between topics.**

In [None]:
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda, data_vectorized, vectorizer, mds='tsne')
dash

### Splitting to train, validation and test datasets, ensuring balance of classes

The original class labels are imbalanced:

- 0 = Hate Speech (5.77%)
- 1 = Offensive Speech (77.43%)
- 2 = Neither (Neutral) (16.8%)

After bucketing, class 0 is neutral speech (16.8%) and class 1 combines hate and offensive speech (83.2%).


In [None]:
for i,j in enumerate(twitter_cleaned['class'].value_counts().sort_index()):
    print(i,j,j/twitter_cleaned.shape[0]*100)

0 4163 16.797804946939436
1 20620 83.20219505306056


In [None]:
X = twitter_cleaned.tweets_cleaned
y = twitter_cleaned['class']


- **Train -> 72%** (80% of the 90% left after the test split)
- **Validation -> 18%** (20% of the 90% left after the test split)
- **Test -> 10%**
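
The percentages follow from applying the two splits one after the other. The sketch below works through the arithmetic for the 24,783 tweets in this dataset. The helper `split_sizes` is hypothetical, and it assumes each held-out share is rounded up, which agrees with the shapes printed later in this notebook.

```python
import math

def split_sizes(n_samples, test_frac=0.1, val_frac=0.2):
    """Sizes of the train, validation and test sets after two successive splits."""
    # assumption: each held-out share is rounded up and the remainder goes to the other part
    n_test = math.ceil(n_samples * test_frac)
    n_train_val = n_samples - n_test
    n_val = math.ceil(n_train_val * val_frac)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(24783)
for name, size in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {size} ({size / 24783:.1%})")
```

Because the validation fraction is taken from the remaining 90%, `test_size = 0.2` in the second split yields 18% of the full dataset, not 20%.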






In [None]:
from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size = 0.1, shuffle=True, stratify=y)

In [None]:
print(X_train_val.shape)
print(y_train_val.shape)
print(X_test.shape)
print(y_test.shape)

(22304,)
(22304,)
(2479,)
(2479,)


Checking the class balance of the combined training and validation set

In [None]:
for i,j in enumerate(y_train_val.value_counts().sort_index()):
    print(i,j,j/y_train_val.shape[0]*100)

0 3747 16.79967718794835
1 18557 83.20032281205165


In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size = 0.2, shuffle = True, stratify = y_train_val)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)

(17843,)
(17843,)
(4461,)
(4461,)


In [None]:
for i,j in enumerate(y_train.value_counts().sort_index()):
    print(i,j,j/y_train.shape[0]*100)

for i,j in enumerate(y_val.value_counts().sort_index()):
    print(i,j,j/y_val.shape[0]*100)

0 2998 16.802107268957016
1 14845 83.19789273104298
0 749 16.78995740865277
1 3712 83.21004259134723


# Models

### Text Generation based on sequence using LSTM

https://www.kaggle.com/code/shivamb/beginners-guide-to-text-generation-using-lstms/notebook

https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

In [None]:
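from tensorflow import keras
# import everything from tensorflow.keras so that a separately installed keras package is not mixed in
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
# import keras.utils as ku 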

In [None]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

In [None]:
twitter_cleaned.tweets_cleaned[0]

' woman complain cleaning house man trash'

In [None]:
inp_sequences, total_words = get_sequence_of_tokens(twitter_cleaned.tweets_cleaned)
inp_sequences[:10]

[[202, 825],
 [202, 825, 2981],
 [202, 825, 2981, 141],
 [202, 825, 2981, 141, 18],
 [202, 825, 2981, 141, 18, 11],
 [95, 969],
 [95, 969, 366],
 [95, 969, 366, 1276],
 [95, 969, 366, 1276, 6327],
 [95, 969, 366, 1276, 6327, 17]]

**Add padding to sentence to fit LSTM input layer**

In [None]:
import numpy as np

In [None]:
predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
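
The notebook calls `generate_padded_sequences` without showing its definition. The sketch below is not that function. It is a plain-Python illustration, under assumptions, of what pre-padding and splitting into predictors and labels typically involves: every n-gram sequence is left-padded with zeros to a common length, the last token becomes the label and the rest become the predictors. The name `pad_and_split` is hypothetical, and the one-hot encoding of labels that `categorical_crossentropy` expects is omitted.

```python
def pad_and_split(sequences, pad_value=0):
    """Left-pad sequences to equal length, then split off the last token as the label."""
    max_len = max(len(seq) for seq in sequences)
    padded = [[pad_value] * (max_len - len(seq)) + list(seq) for seq in sequences]
    predictors = [row[:-1] for row in padded]  # all tokens except the last
    labels = [row[-1] for row in padded]       # the token to be predicted
    return predictors, labels, max_len

# First three n-gram sequences from the output above
demo_sequences = [[202, 825], [202, 825, 2981], [202, 825, 2981, 141]]
demo_predictors, demo_labels, demo_max_len = pad_and_split(demo_sequences)
print(demo_predictors)
print(demo_labels)
```

Padding on the left keeps the most recent tokens next to the label position, which suits a model that predicts the next word from the words before it.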

Training the LSTM model with:

**1 input embedding layer (10 dimensions)**


**1 LSTM hidden layer (100 units), followed by dropout (rate = 0.1)**

**1 dense output layer (softmax over the vocabulary)**

**Categorical cross-entropy loss, optimized with Adam**



In [None]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 27, 10)            202490    
                                                                 
 lstm (LSTM)                 (None, 100)               44400     
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 20249)             2045149   
                                                                 
Total params: 2,292,039
Trainable params: 2,292,039
Non-trainable params: 0
_________________________________________________________________
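
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes, taking the vocabulary size of 20,249 from the dense layer's output shape. The helper names are assumptions for illustration.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim

total_words, embed_dim, units = 20249, 10, 100
counts = [
    embedding_params(total_words, embed_dim),
    lstm_params(embed_dim, units),
    dense_params(units, total_words),
]
print(counts, sum(counts))
```

Almost 90% of the parameters sit in the softmax output layer, because it needs one set of weights for every word in the vocabulary.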


Open question: can the model be tested with a larger variety of data?

In [None]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
model.fit(predictors, label, epochs= 100, verbose=5)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7f7b61167690>

In [None]:
model.save("/content/drive/My Drive/Project/model_final")



INFO:tensorflow:Assets written to: /content/drive/My Drive/Project/model_final/assets




In [None]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        # encode the text so far and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # greedy decoding: take the single most probable next token
        predicted = model.predict(token_list, verbose=0)
        predicted_index = int(np.argmax(predicted, axis=1)[0])

        # map the index back to its word; index 0 is padding and has no word
        output_word = tokenizer.index_word.get(predicted_index, "")
        seed_text += " "+output_word
    return seed_text.title()

In [None]:
import tensorflow as tf
new_model = tf.keras.models.load_model('/content/drive/My Drive/Project/model_final')

OSError: ignored

In [None]:
print(generate_text("Yo", 5, new_model, max_sequence_len))

Yo Bitch Ass Nigga Said Fuck


In [None]:
print(generate_text("You are", 5, new_model, max_sequence_len))

You Are Gtrt Ankle Socks Turned Colored


In [None]:
print(generate_text("You are not", 5, new_model, max_sequence_len))

You Are Not Gtrt Ankle Socks Turned Colored


In [None]:
print(generate_text("How are you so", 5, new_model, max_sequence_len))

How Are You So Gtrt Ankle Socks Turned Colored


In [None]:
print(generate_text("You are so", 5, new_model, max_sequence_len))

You Are So Gtrt Ankle Socks Turned Colored


In [None]:
print(generate_text("I am not a", 5, new_model, max_sequence_len))

I Am Not A Gtrt Ankle Socks Turned Colored


In [None]:
print(generate_text("Hi how are you, why are you", 5, new_model, max_sequence_len))

Hi How Are You, Why Are You Bitch Kids Okay Ampamp Hahaha


In [None]:
print(generate_text("what do you mean, I don't want", 5, new_model, max_sequence_len))

What Do You Mean, I Don'T Want Pussy Tastes Like Dragons Burnt


After the user types a message, the model predicts the next K words, where K is the configured number of continuation words. Based on TF-IDF, we can then set a threshold using n-grams or word frequency.

For example, if any of the top 10 words for the classified hate speech appear in the phrase predicted from what the user has typed so far, a warning message is shown to the user: "If you post this message and it is tagged as hate speech, you will be severely penalized."
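
A minimal sketch of that check is shown below, assuming the flagged vocabulary is available as a set of lowercase tokens (for instance the words returned by `hate_count_word.most_common(10)`). The function name `should_warn`, the `threshold` parameter and the placeholder tokens are assumptions for illustration, not part of the notebook.

```python
def should_warn(predicted_phrase, flagged_words, threshold=1):
    """Return True when the phrase contains at least `threshold` flagged tokens."""
    tokens = predicted_phrase.lower().split()
    hits = sum(1 for token in tokens if token in flagged_words)
    return hits >= threshold

flagged = {"slur_a", "slur_b"}  # placeholder tokens standing in for the top-10 list
print(should_warn("You Are slur_a", flagged))
print(should_warn("have a nice day", flagged))
```

The phrase is lowercased before matching because `generate_text` returns title-cased output. Raising `threshold` makes the warning less sensitive.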

### Predicting class labels using Naive Bayes (NB)

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test_nb(vect, X_train, y_train, X_test, y_test):
    
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features: ', X_train_dtm.shape[1])
    
    # use Multinomial Naive Bayes to predict the class labels of tweets
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    
    # print the accuracy of its predictions
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vect = CountVectorizer()
tokenize_test_nb(vect, X_train, y_train, X_test, y_test)

NameError: ignored

In [None]:
vect = TfidfVectorizer()
tokenize_test_nb(vect, X_train, y_train, X_test, y_test)

In [None]:
vect = TfidfVectorizer(norm=None)
tokenize_test_nb(vect, X_train, y_train, X_test, y_test)

In [None]:
vect = TfidfVectorizer(norm=None,stop_words='english')
tokenize_test_nb(vect, X_train, y_train, X_test, y_test)

In [None]:
vect = TfidfVectorizer(norm=None,stop_words='english',ngram_range=(1, 2))
tokenize_test_nb(vect, X_train, y_train, X_test, y_test)

### Predicting using Logistic Regression

In [None]:
from sklearn import metrics


In [None]:
from sklearn.linear_model import LogisticRegression

def tokenize_test_lr(vect, X_train, y_train, X_test, y_test):
    
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features: ', X_train_dtm.shape[1])
    
    # use logistic regression to predict the class labels of tweets
    # C is the inverse regularization strength, so C=100 corresponds to a strength of 0.01;
    # 'sag' is the stochastic average gradient solver, capped at 500 iterations
    lr = LogisticRegression(C=100,solver='sag',max_iter=500)
    lr.fit(X_train_dtm, y_train)
    y_pred_class = lr.predict(X_test_dtm)
    
    # print the accuracy of its predictions
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [None]:
vect = CountVectorizer()
tokenize_test_lr(vect, X_train, y_train, X_val, y_val)

Features:  16887
Accuracy:  0.9466487334678323


In [None]:
vect = TfidfVectorizer()
tokenize_test_lr(vect, X_train, y_train, X_test, y_test)

NameError: ignored

In [None]:
vect = TfidfVectorizer(norm=None)
tokenize_test_lr(vect, X_train, y_train, X_test, y_test)

Features:  20213
Accuracy:  0.864635868468832




In [None]:
vect = TfidfVectorizer(norm=None,stop_words='english')
tokenize_test_lr(vect, X_train, y_train, X_test, y_test)

Features:  19949
Accuracy:  0.864635868468832




In [None]:
vect = TfidfVectorizer(norm=None,stop_words='english',ngram_range=(1, 2))
tokenize_test_lr(vect, X_train, y_train, X_test, y_test)

Features:  110022
Accuracy:  0.892071817631632




In [None]:
#  max_features= 50000, max_df = 0.5
vect = TfidfVectorizer(norm=None,stop_words='english',ngram_range=(1, 2), max_df= 0.7)
tokenize_test_lr(vect, X_train, y_train, X_test, y_test)

Features:  110022
Accuracy:  0.8924752874722615




In [None]:
import fastai