# Problem Statement

The objective of this task is to detect hate speech in tweets. For simplicity, a tweet is said to contain hate speech if it expresses racist or sexist sentiment. The task is therefore to separate racist or sexist tweets from all other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes that the tweet is racist/sexist and label '0' denotes that it is not, your objective is to predict the labels on the test dataset.

# Data

Our overall collection of tweets was split in the ratio of 65:35 into training and testing data. Out of the testing data, 30% is public and the rest is private.
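
The split described above can be checked with a little arithmetic. This is only a sketch: the row counts are taken from the feature-matrix shapes printed further down in this notebook (31,962 training tweets and 17,197 test tweets), and the variable names are illustrative.

```python
# Row counts as reported by the feature-matrix shapes in this notebook
n_train = 31962
n_test = 17197

total = n_train + n_test
train_share = n_train / total      # fraction of all tweets used for training
test_share = n_test / total        # fraction held out for testing

# 30% of the test data is public, the remaining 70% is private
public_share = 0.30 * test_share   # expressed as a share of the whole collection
private_share = 0.70 * test_share

print(f"train: {train_share:.1%}, test: {test_share:.1%}")
print(f"public test: {public_share:.1%}, private test: {private_share:.1%}")
```

The public part of the test set is therefore roughly a tenth of the whole collection, and the private part roughly a quarter.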

# Data Acquisition

In [2]:
# Importing necessary Packages
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
data = pd.read_csv('./Hate_Speech_train.csv')

In [4]:
data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [5]:
data['tweet'].head()

0     @user when a father is dysfunctional and is s...
1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
3    #model   i love u take with u all the time in ...
4               factsguide: society now    #motivation
Name: tweet, dtype: object

# Text Cleaning

Text Cleaning involves 4 major steps:

1) Lowering the entire text

2) Removing Punctuations

3) Removing Stopwords

4) Normalizing the text

In [21]:
# Using Word Lemmatizer
lemm = WordNetLemmatizer()

# Build the stopword set once instead of on every call; set lookups are fast
stopwords_set = set(stopwords.words('english'))

def cleaning(text):
    # Lowering the entire text
    lower = text.lower()

    # Removing punctuation (ASCII characters in string.punctuation only)
    no_punc = "".join(char for char in lower if char not in string.punctuation)

    # Removing stopwords
    no_stop = [word for word in no_punc.split() if word not in stopwords_set]

    # Lemmatization (normalization); every word is lemmatized as a verb ('v')
    cleaned = [lemm.lemmatize(word, 'v') for word in no_stop]

    # Join the tokens back into a single string
    return ' '.join(cleaned)

In [22]:
data['Cleaned_tweet'] = data['tweet'].apply(lambda x : cleaning(x))
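
As an aside on the punctuation step of `cleaning`: the same effect can be had with `str.translate`, which avoids the per-character Python loop. This is a sketch of an alternative, not the method used in the notebook, and `PUNCT_TABLE` and `strip_punctuation` are hypothetical names.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(PUNCT_TABLE)

sample = "@user Thanks for #lyft credit, I can't use it!"
print(strip_punctuation(sample))
```

Because `string.punctuation` only covers ASCII, emoji and other non-ASCII characters pass through untouched, which is consistent with the leftover characters visible in some `Cleaned_tweet` values.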

In [23]:
data.head()

Unnamed: 0,id,label,tweet,Cleaned_tweet
0,1,0,@user when a father is dysfunctional and is s...,user father dysfunctional selfish drag kid dys...
1,2,0,@user @user thanks for #lyft credit i can't us...,user user thank lyft credit cant use cause don...
2,3,0,bihday your majesty,bihday majesty
3,4,0,#model i love u take with u all the time in ...,model love u take u time urð± ðððð...
4,5,0,factsguide: society now #motivation,factsguide society motivation


# Feature Engineering

## Count Features

In [25]:
# Word count before cleaning
data['word_count'] = data['tweet'].apply(lambda x : len(x.split()))

In [26]:
# Word count after cleaning
data['word_count_clean'] = data['Cleaned_tweet'].apply(lambda x: len(x.split()))

In [27]:
data['char_count'] = data['tweet'].apply(lambda x: len(x.replace(' ','')))

In [28]:
data.head()

Unnamed: 0,id,label,tweet,Cleaned_tweet,word_count,word_count_clean,char_count
0,1,0,@user when a father is dysfunctional and is s...,user father dysfunctional selfish drag kid dys...,18,8,82
1,2,0,@user @user thanks for #lyft credit i can't us...,user user thank lyft credit cant use cause don...,19,15,101
2,3,0,bihday your majesty,bihday majesty,3,2,17
3,4,0,#model i love u take with u all the time in ...,model love u take u time urð± ðððð...,14,9,70
4,5,0,factsguide: society now #motivation,factsguide society motivation,4,3,32
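
To make the three count features concrete, here is a small sketch that applies the same expressions to a single tweet, the one shown with id 3 above. The helper name `count_features` is made up for illustration.

```python
def count_features(raw, cleaned):
    """Apply the notebook's three count expressions to one tweet."""
    return {
        "word_count": len(raw.split()),            # whitespace-separated tokens before cleaning
        "word_count_clean": len(cleaned.split()),  # tokens left after cleaning
        "char_count": len(raw.replace(" ", "")),   # characters excluding spaces
    }

features = count_features("bihday your majesty", "bihday majesty")
print(features)
```

The values agree with the corresponding row of the table: 3 words, 2 words after cleaning, 17 characters.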


## PoS counting

In [29]:
pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

In [36]:
# Count how many words in a sentence carry a part-of-speech tag from the given family
def check_pos_tag(txt, flag):
    tags = nltk.pos_tag(nltk.word_tokenize(txt))
    count = 0
    for _, tag in tags:
        if tag in pos_family[flag]:
            count += 1
    return count

In [35]:
data['noun_count'] = data['tweet'].apply(lambda x: check_pos_tag(x, 'noun'))
data['verb_count'] = data['tweet'].apply(lambda x: check_pos_tag(x, 'verb'))
data['adj_count'] = data['tweet'].apply(lambda x: check_pos_tag(x, 'adj'))
data['adv_count'] = data['tweet'].apply(lambda x: check_pos_tag(x, 'adv'))
data['pron_count'] = data['tweet'].apply(lambda x: check_pos_tag(x, 'pron'))
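
The five `apply` calls above tokenize and tag every tweet five times. A possible alternative is to tag once and count all families in a single pass. The sketch below assumes the tags are already available as `(word, tag)` pairs, which is the shape `nltk.pos_tag` returns; the tagged sentence is hand-written for illustration, and `TAG_TO_FAMILY` and `count_pos_families` are hypothetical names.

```python
from collections import Counter

pos_family = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS'],
    'pron': ['PRP', 'PRP$', 'WP', 'WP$'],
    'verb': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
    'adj': ['JJ', 'JJR', 'JJS'],
    'adv': ['RB', 'RBR', 'RBS', 'WRB'],
}

# Reverse lookup: Penn Treebank tag -> family name
TAG_TO_FAMILY = {tag: family for family, tags in pos_family.items() for tag in tags}

def count_pos_families(tagged_tokens):
    """Count tokens per family from (word, tag) pairs in one pass."""
    counts = Counter(
        TAG_TO_FAMILY[tag] for _, tag in tagged_tokens if tag in TAG_TO_FAMILY
    )
    # Report every family, including those with zero matches
    return {family: counts.get(family, 0) for family in pos_family}

# Hand-written tags; in the notebook they would come from nltk.pos_tag
tagged = [("she", "PRP"), ("really", "RB"), ("loves", "VBZ"),
          ("sunny", "JJ"), ("days", "NNS"), (".", ".")]
print(count_pos_families(tagged))
```

The returned dictionary could then be expanded into the five count columns, so each tweet is tagged only once.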

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer 

In [49]:
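# Keep the 500 most frequent unigrams and bigrams as TF-IDF features.
# Note: `tfidf` is rebound to the transformed matrix on the last line,
# so the fitted vectorizer itself is no longer accessible afterwards.
tfidf = TfidfVectorizer(max_features=500, ngram_range=(1,2))
tfidf.fit(data["Cleaned_tweet"].values)
tfidf = tfidf.transform(data["Cleaned_tweet"].values)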

In [43]:
#data = data.drop(['tfidf'],axis=1)
data.head()

Unnamed: 0,id,label,tweet,Cleaned_tweet,word_count,word_count_clean,char_count,noun_count,verb_count,adj_count,adv_count,pron_count
0,1,0,@user when a father is dysfunctional and is s...,user father dysfunctional selfish drag kid dys...,18,8,82,5,4,2,2,3
1,2,0,@user @user thanks for #lyft credit i can't us...,user user thank lyft credit cant use cause don...,19,15,101,9,5,3,2,1
2,3,0,bihday your majesty,bihday majesty,3,2,17,1,0,0,1,1
3,4,0,#model i love u take with u all the time in ...,model love u take u time urð± ðððð...,14,9,70,5,1,4,0,0
4,5,0,factsguide: society now #motivation,factsguide society motivation,4,3,32,3,0,0,1,0


In [50]:
from scipy.sparse import hstack, csr_matrix

meta_features = ['word_count', 'word_count_clean',
       'char_count', 'noun_count', 'verb_count', 'adj_count', 'adv_count',
       'pron_count']

feature_set1 = data[meta_features]

train = hstack([tfidf, csr_matrix(feature_set1)], "csr")
train

<31962x508 sparse matrix of type '<class 'numpy.float64'>'
	with 348770 stored elements in Compressed Sparse Row format>
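
The 508 columns are the 500 TF-IDF features followed by the 8 count features. The toy sketch below shows how `hstack` places the two blocks side by side; the numbers in it are made up and only the shapes matter.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-in for a TF-IDF block: 2 documents x 3 terms (values are made up)
tfidf_block = csr_matrix(np.array([[0.0, 0.7, 0.7],
                                   [1.0, 0.0, 0.0]]))

# Stand-in for the dense count features: 2 documents x 2 features
meta_block = np.array([[18, 82],
                       [3, 17]])

# Columns of the second block are appended to the right of the first
combined = hstack([tfidf_block, csr_matrix(meta_block)], format="csr")
print(combined.shape)
print(combined.toarray())
```

One thing this makes visible is the difference in scale: TF-IDF values lie between 0 and 1, while the raw counts are much larger, which may matter for scale-sensitive models such as logistic regression or an SVM.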

In [52]:
from sklearn.preprocessing import LabelEncoder 

target = data["label"].values
target = LabelEncoder().fit_transform(target)

In [53]:
target

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

In [54]:
from sklearn.model_selection import train_test_split
train_x, val_x, train_y, val_y = train_test_split(train, target)

In [57]:
train_x.shape

(23971, 508)

In [58]:
val_x.shape

(7991, 508)
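
With its default settings, `train_test_split` holds out 25% of the rows (7,991 of 31,962 here) and draws a different random split on each run. When the positive class is rare, a reproducible, stratified split can be a safer choice. The sketch below uses made-up data with 10% positives; the `test_size` and `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 10 of them positive
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class proportions equal in both parts;
# random_state makes the split repeatable
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_tr.sum(), y_val.sum())
```

Both parts keep the 10% positive rate, so validation scores are not distorted by an unlucky draw of the minority class.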

In [59]:
from sklearn import naive_bayes
from sklearn.linear_model import LogisticRegression
from sklearn import svm 
from sklearn import ensemble
from sklearn.metrics import accuracy_score, f1_score

In [77]:
model1 = naive_bayes.MultinomialNB()
model1.fit(train_x, train_y)
preds1 = model1.predict(val_x)
# scikit-learn metrics expect (y_true, y_pred) in that order
accuracy_score(val_y, preds1)

0.9413089725941685

In [78]:
f1_score(val_y, preds1)

0.33663366336633666

In [79]:
model2 = LogisticRegression()
model2.fit(train_x, train_y)
preds2 = model2.predict(val_x)
accuracy_score(val_y, preds2)

0.9436866474784132

In [80]:
f1_score(val_y, preds2)

0.4155844155844156

In [81]:
model3 = svm.SVC()
model3.fit(train_x, train_y)
preds3 = model3.predict(val_x)
accuracy_score(val_y, preds3)

0.9302965836566137

In [84]:
# The SVC predicts no positive tweets on this split, so F1 is 0.0
# (scikit-learn reports this with an UndefinedMetricWarning).
f1_score(val_y, preds3)

0.0
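
An F1 of 0.0 next to roughly 93% accuracy is the pattern a classifier produces when it never predicts the positive class on imbalanced labels. The sketch below reproduces that pattern on made-up labels; the 7% positive rate is an assumption chosen for illustration, not a figure measured from the dataset, and the `zero_division` argument assumes scikit-learn 0.22 or newer.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 93 negatives and 7 positives
y_true = [0] * 93 + [1] * 7

# A degenerate classifier that always predicts the majority class
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0.0 quietly when no positives are predicted
f1 = f1_score(y_true, y_pred, zero_division=0)

print(acc, f1)
```

This is why accuracy alone is a weak guide here and the F1 score is the more informative number when comparing the models.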

In [85]:
import xgboost

In [86]:
model4 = xgboost.XGBClassifier()
model4.fit(train_x, train_y)
preds4 = model4.predict(val_x)
accuracy_score(val_y, preds4)

0.9423100988612189

In [66]:
from sklearn.metrics import f1_score

In [87]:
f1_score(val_y, preds4)

0.321060382916053

In [88]:
from sklearn.metrics import precision_score,recall_score
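
`precision_score` and `recall_score` are imported here but not called in the cells shown. As a rough sketch of how they relate to the F1 score, the example below evaluates a handful of made-up labels (not taken from the dataset).

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 2 true positives, 1 false positive, 2 false negatives
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)
```

Looking at precision and recall separately shows whether a low F1 comes from missed hateful tweets (low recall) or from false alarms (low precision).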

In [89]:
model = model2
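
The choice of `model2` matches the validation scores reported above, where logistic regression has the highest F1. A small sketch of that comparison follows; the numbers are the notebook's own outputs rounded to four decimals, while the dictionary layout and names are illustrative.

```python
# Validation scores printed earlier in this notebook, rounded to 4 decimals
reported = {
    "model1 (MultinomialNB)": {"accuracy": 0.9413, "f1": 0.3366},
    "model2 (LogisticRegression)": {"accuracy": 0.9437, "f1": 0.4156},
    "model3 (SVC)": {"accuracy": 0.9303, "f1": 0.0},
    "model4 (XGBClassifier)": {"accuracy": 0.9423, "f1": 0.3211},
}

# Rank by F1, since accuracy is dominated by the majority class
best = max(reported, key=lambda name: reported[name]["f1"])
print(best)
```

Because the split is not seeded, these scores would shift somewhat from run to run, so the ranking should be read as indicative.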

In [90]:
test = pd.read_csv('./Hate_Speech_test.csv')

In [91]:
test.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


In [92]:
test['Cleaned_tweet'] = test['tweet'].apply(lambda x : cleaning(x))

In [93]:
test.head()

Unnamed: 0,id,tweet,Cleaned_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,studiolife aislife require passion dedication ...
1,31964,@user #white #supremacists want everyone to s...,user white supremacists want everyone see new ...
2,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways heal acne altwaystoheal healthy heal
3,31966,is the hp and the cursed child book up for res...,hp curse child book reservations already yes ð...
4,31967,"3rd #bihday to my amazing, hilarious #nephew...",3rd bihday amaze hilarious nephew eli ahmir un...


In [94]:
test['word_count'] = test['tweet'].apply(lambda x : len(x.split()))
test['word_count_clean'] = test['Cleaned_tweet'].apply(lambda x: len(x.split()))
test['char_count'] = test['tweet'].apply(lambda x: len(x.replace(' ','')))

In [95]:
test['noun_count'] = test['tweet'].apply(lambda x: check_pos_tag(x, 'noun'))
test['verb_count'] = test['tweet'].apply(lambda x: check_pos_tag(x, 'verb'))
test['adj_count'] = test['tweet'].apply(lambda x: check_pos_tag(x, 'adj'))
test['adv_count'] = test['tweet'].apply(lambda x: check_pos_tag(x, 'adv'))
test['pron_count'] = test['tweet'].apply(lambda x: check_pos_tag(x, 'pron'))

In [98]:
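# Note: this fits a second vectorizer on the test tweets, so its 500 columns follow
# the test vocabulary and need not line up with the columns the model was trained on.
tfidf_test = TfidfVectorizer(max_features=500, ngram_range=(1,2))
tfidf_test.fit(test["Cleaned_tweet"].values)
tfidf_test = tfidf_test.transform(test["Cleaned_tweet"].values)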

In [99]:
test.head()

Unnamed: 0,id,tweet,Cleaned_tweet,word_count,word_count_clean,char_count,noun_count,verb_count,adj_count,adv_count,pron_count
0,31963,#studiolife #aislife #requires #passion #dedic...,studiolife aislife require passion dedication ...,9,8,79,3,2,3,0,0
1,31964,@user #white #supremacists want everyone to s...,user white supremacists want everyone see new ...,16,12,82,7,2,2,3,0
2,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways heal acne altwaystoheal healthy heal,9,7,57,2,2,3,0,1
3,31966,is the hp and the cursed child book up for res...,hp curse child book reservations already yes ð...,22,11,119,5,1,2,6,0
4,31967,"3rd #bihday to my amazing, hilarious #nephew...",3rd bihday amaze hilarious nephew eli ahmir un...,15,11,76,5,2,3,0,2


In [100]:
from scipy.sparse import hstack, csr_matrix

meta_features = ['word_count', 'word_count_clean',
       'char_count', 'noun_count', 'verb_count', 'adj_count', 'adv_count',
       'pron_count']

feature_set2 = test[meta_features]

test_test = hstack([tfidf_test, csr_matrix(feature_set2)], "csr")
test_test

<17197x508 sparse matrix of type '<class 'numpy.float64'>'
	with 187231 stored elements in Compressed Sparse Row format>
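
One caveat about the test features: a `TfidfVectorizer` fitted on the test tweets learns its own vocabulary, so a given column can stand for a different term than it did during training, even though both matrices have 508 columns. The usual pattern is to fit on the training text once and only call `transform` on the test text. The sketch below illustrates this on a made-up corpus; it is not the notebook's code and its variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpora standing in for the cleaned tweets
train_docs = ["user love sunny day", "user hate rainy day", "love motivation quote"]
test_docs = ["user love quote", "completely unseen words"]

# Fit the vocabulary on the training text only
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_docs)

# Reuse the fitted vocabulary so test columns mean the same thing as train columns
X_test = vectorizer.transform(test_docs)

# For contrast: a vectorizer refitted on the test text learns a different vocabulary
refit = TfidfVectorizer(ngram_range=(1, 2)).fit(test_docs)
same_vocab = refit.vocabulary_ == vectorizer.vocabulary_

print(X_train.shape, X_test.shape, same_vocab)
```

Terms that never occurred in the training text are simply dropped by `transform`, which is why the second test document ends up as an all-zero row.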

In [101]:
preds = model.predict(test_test)

In [102]:
preds

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [103]:
out = pd.DataFrame({'id': test['id'],'label':preds})

In [104]:
# index=False keeps the file to the two columns id and label
out.to_csv('./OutputV1.csv', index=False)

In [106]:
preds.sum()

162