# Dataset Information
The objective of this task is to detect hate speech in tweets. For simplicity, we say a tweet contains hate speech if it expresses racist or sexist sentiment. The task is therefore to distinguish racist or sexist tweets from all other tweets.

Formally, given a sample of tweets and labels, where label '1' denotes that the tweet is racist/sexist and label '0' denotes that it is not, our objective is to build the model with the best accuracy.


# Import Modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
import nltk
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')

# Load Dataset

In [2]:
df = pd.read_csv("C:/Users/vardh/Desktop/twitter_dataset.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,tweet,sentiment
0,0,is upset that he can't update his Facebook by ...,0.0
1,1,@Kenichan I dived many times for the ball. Man...,0.0
2,2,my whole body feels itchy and like its on fire,0.0
3,3,"@nationwideclass no, it's not behaving at all....",0.0
4,4,@Kwesidei not the whole crew,0.0


In [3]:
# datatype info
df.info()
df['sentiment'].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3142403 entries, 0 to 3142402
Data columns (total 3 columns):
 #   Column      Dtype  
---  ------      -----  
 0   Unnamed: 0  int64  
 1   tweet       object 
 2   sentiment   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 71.9+ MB


0.0    1570067
1.0    1561529
2.0      10725
Name: sentiment, dtype: int64
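
The printed counts are worth a quick sanity check before any modelling. The sketch below only re-uses the numbers shown in the two outputs above (the variable names are illustrative, not part of the notebook). It assumes both outputs come from the same DataFrame, and it relies on `value_counts()` skipping missing values by default.

```python
# Counts copied from the value_counts() output above.
label_counts = {0.0: 1570067, 1.0: 1561529, 2.0: 10725}
# Row count copied from the RangeIndex line of df.info().
total_rows = 3142403

labelled_rows = sum(label_counts.values())
# value_counts() drops NaN by default, so a shortfall suggests missing labels.
unlabelled_rows = total_rows - labelled_rows

# Share of each label among the labelled rows.
shares = {label: count / labelled_rows for label, count in label_counts.items()}

print("labelled rows:  ", labelled_rows)
print("unlabelled rows:", unlabelled_rows)
for label, share in shares.items():
    print(f"label {label}: {share:.4%}")
```

Under those assumptions, the totals differ by 82 rows, which would be consistent with a small number of tweets having no label. The output also shows a third label, 2.0, that the problem statement does not define, so the data is not strictly binary as loaded.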

In [4]:
#data.drop(['Unnamed: 0'],axis=1,inplace=True)

In [5]:
# data_pos = data[data['sentiment'] == 1]
# data_neg = data[data['sentiment'] == 0]

In [6]:
# data_pos = data_pos.iloc[:int(150000)]
# data_neg = data_neg.iloc[:int(150000)]

In [7]:
#df = pd.concat([data_pos, data_neg])
df['tweet']=df['tweet'].str.lower()
df.reset_index(drop=True, inplace= True)

# Preprocessing the dataset

In [8]:
# removes every match of pattern from the input text
def remove_pattern(input_txt, pattern):
    r = re.findall(pattern, input_txt)
    for word in r:
        # escape the match so it is treated as literal text, not as a regex
        input_txt = re.sub(re.escape(word), "", input_txt)
    return input_txt

In [9]:
# Remove Usernames
username_pattern = re.compile(r"@\w*")  # raw string; compiled once, not on every call
def rem_username(text):
    return username_pattern.sub('', text)
df['clean_tweet'] = df['tweet'].apply(rem_username)

In [10]:
df.head()

Unnamed: 0.1,Unnamed: 0,tweet,sentiment,clean_tweet
0,0,is upset that he can't update his facebook by ...,0.0,is upset that he can't update his facebook by ...
1,1,@kenichan i dived many times for the ball. man...,0.0,i dived many times for the ball. managed to s...
2,2,my whole body feels itchy and like its on fire,0.0,my whole body feels itchy and like its on fire
3,3,"@nationwideclass no, it's not behaving at all....",0.0,"no, it's not behaving at all. i'm mad. why am..."
4,4,@kwesidei not the whole crew,0.0,not the whole crew


In [11]:
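# remove special characters, numbers and punctuation
# (regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default)
df['clean_tweet'] = df['clean_tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)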
df.head()

Unnamed: 0.1,Unnamed: 0,tweet,sentiment,clean_tweet
0,0,is upset that he can't update his facebook by ...,0.0,is upset that he can t update his facebook by ...
1,1,@kenichan i dived many times for the ball. man...,0.0,i dived many times for the ball managed to s...
2,2,my whole body feels itchy and like its on fire,0.0,my whole body feels itchy and like its on fire
3,3,"@nationwideclass no, it's not behaving at all....",0.0,no it s not behaving at all i m mad why am...
4,4,@kwesidei not the whole crew,0.0,not the whole crew


In [12]:
#remove short words
df['clean_tweet'] = df['clean_tweet'].apply(lambda x: " ".join([w for w in x.split() if len(w)>2]))
df.head()

Unnamed: 0.1,Unnamed: 0,tweet,sentiment,clean_tweet
0,0,is upset that he can't update his facebook by ...,0.0,upset that can update his facebook texting and...
1,1,@kenichan i dived many times for the ball. man...,0.0,dived many times for the ball managed save the...
2,2,my whole body feels itchy and like its on fire,0.0,whole body feels itchy and like its fire
3,3,"@nationwideclass no, it's not behaving at all....",0.0,not behaving all mad why here because can see ...
4,4,@kwesidei not the whole crew,0.0,not the whole crew
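
The cleaning steps so far (lowercasing, username removal, replacing non-letters, dropping short words) can be checked on a single string without pandas. The following is a minimal sketch using the same regular expressions as the cells above. The function name `clean_tweet`, the `min_len` parameter and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

USERNAME = re.compile(r"@\w*")          # same pattern as rem_username
NON_LETTER = re.compile(r"[^a-zA-Z#]")  # keep letters and '#', as in cell 11


def clean_tweet(text, min_len=3):
    """Lowercase, drop @usernames and non-letters, then drop short words."""
    text = text.lower()
    text = USERNAME.sub("", text)
    text = NON_LETTER.sub(" ", text)
    # min_len=3 matches the notebook's len(w) > 2 filter
    return " ".join(w for w in text.split() if len(w) >= min_len)


# Made-up sample, for illustration only.
sample = "@Kenichan I dived many times for the ball. It's 50%!"
print(clean_tweet(sample))
```

Testing on one string like this makes side effects easy to spot. For example, replacing the apostrophe splits "it's" into "it" and "s", and both pieces are then removed by the short-word filter.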


# Tokenization

In [13]:
# from nltk.tokenize import word_tokenize
# # nltk.download('punkt')
# tokenized_tweet = df['clean_tweet'].apply(lambda text:word_tokenize(text))

In [14]:
#individual words considered as tokens
tokenized_tweet = df['clean_tweet'].apply(lambda x: x.split())
tokenized_tweet.head()

0    [upset, that, can, update, his, facebook, text...
1    [dived, many, times, for, the, ball, managed, ...
2    [whole, body, feels, itchy, and, like, its, fire]
3    [not, behaving, all, mad, why, here, because, ...
4                              [not, the, whole, crew]
Name: clean_tweet, dtype: object

# Stemming 

In [15]:
stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from',
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves']

STOPWORDS = set(stopwordlist)
# def cleaning_stopwords(text):
#     return " ".join([word for word in str(text).split() if word not in STOPWORDS])
# data['clean_tweet'] = data['clean_tweet'].apply(lambda text: cleaning_stopwords(text) for i in text)
# data['clean_tweet'].head()

In [None]:
# stemming the words
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokenized_tweet = tokenized_tweet.apply(lambda sentence: [stemmer.stem(word.lower()) for word in sentence if word not in STOPWORDS])
tokenized_tweet.head()

In [None]:
# combine words into single sentence
#df.reset_index(drop=True,inplace=True)
# join each token list in one pass instead of looping over every row index
df['clean_tweet'] = tokenized_tweet.apply(" ".join)
df.head()

# Visualizing Frequent words

In [None]:
# visualize the frequent words
all_words = " ".join([sentence for sentence in df['clean_tweet']])

from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(all_words)

# plot the graph
plt.figure(figsize=(15,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()


In [None]:
# frequent words visualization for +ve
all_words = " ".join([sentence for sentence in df['clean_tweet'][df['sentiment']==1]])

wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(all_words)

# plot the graph
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

In [None]:
# frequent words visualization for -ve
all_words = " ".join([sentence for sentence in df['clean_tweet'][df['sentiment']==0]])

wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(all_words)

# plotting the graph
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

# Hashtags

In [None]:
# extract the hashtag
def hashtag_extract(tweets):
    hashtags = []
    # loop words in the tweet
    for tweet in tweets:
        ht = re.findall(r"#(\w+)", tweet)
        hashtags.append(ht)
    return hashtags    

In [None]:
df['tweet'] = df['tweet'].apply(lambda x: " ".join([w.lower() for w in x.split() if len(w)>3]))
df.head()

In [None]:
# extract hashtags from tweets labelled 1
ht_positive = hashtag_extract(df['tweet'][df['sentiment']==1])

# extract hashtags from tweets labelled 0
ht_negative = hashtag_extract(df['tweet'][df['sentiment']==0])

In [None]:
ht_pos=[]
for rows in ht_positive:
    for i in rows:
        ht_pos.append(i)
ht_pos[:5]

In [None]:
ht_neg=[]
for rows in ht_negative:
    for i in rows:
        ht_neg.append(i)
ht_neg[:5]

In [None]:
freq = nltk.FreqDist(ht_pos)
d = pd.DataFrame({'Hashtag': list(freq.keys()),
                 'Count': list(freq.values())})
d.head(10)

In [None]:
# select top 10 positive hashtags
dp = d.nlargest(columns='Count',n=10)
plt.figure(figsize=(15,9))
ax=sns.barplot(data=dp, x='Hashtag', y='Count',edgecolor='black')
for bars in ax.containers:
    ax.bar_label(bars)

In [None]:
freqn = nltk.FreqDist(ht_neg)
dn= pd.DataFrame({'Hashtag': list(freqn.keys()),
                 'Count': list(freqn.values())})
dn.head(10)

In [None]:
# select top 10 negative hashtags
dn = dn.nlargest(columns='Count', n=10)
plt.figure(figsize=(15,9))
ax=sns.barplot(data=dn, x='Hashtag', y='Count',edgecolor='black')
for bars in ax.containers:
    ax.bar_label(bars)
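
The extract, flatten and count steps of this section can also be sketched with the standard library alone, which is handy for checking the logic on a few strings. This is only an illustration: it uses `collections.Counter` in place of `nltk.FreqDist`, and the sample tweets are made up.

```python
import re
from collections import Counter
from itertools import chain


def hashtag_extract(tweets):
    """Return one list of hashtags (without '#') per tweet."""
    return [re.findall(r"#(\w+)", tweet) for tweet in tweets]


# Made-up tweets, for illustration only.
sample_tweets = [
    "loving the #sunshine and the #beach",
    "no hashtags here",
    "#sunshine again",
]

nested = hashtag_extract(sample_tweets)
flat = list(chain.from_iterable(nested))  # flatten the list of lists
top = Counter(flat).most_common(2)        # most frequent hashtags first
print(top)
```

`chain.from_iterable` does the same job as the nested `for` loops that build `ht_pos` and `ht_neg`, and `most_common(n)` plays the role of `nlargest` on the count column.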

# Creating the Bag of Words model

In [None]:
# feature extraction
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
x= cv.fit_transform(df['clean_tweet']).toarray()
y=df['sentiment'].values
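
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of the core computation: build a vocabulary, then count how often each vocabulary word occurs in each document. It is a simplification and not how `CountVectorizer` is implemented. It ignores `max_df`, `min_df`, `max_features` and the stop-word list, and the three documents are made up.

```python
from collections import Counter

# Made-up, already-cleaned documents, for illustration only.
docs = ["whole bodi feel itchi", "feel like fire", "whole crew feel fire"]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

One practical point about the cell above: `fit_transform` returns a sparse matrix, and `.toarray()` converts it to a dense one. If all 3,142,403 rows are vectorised with 1,000 features as 64-bit integers, the dense array would need roughly 25 GB, so keeping the sparse matrix (which the scikit-learn estimators that support sparse input can consume directly) may be the safer choice.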

# Splitting the dataset into the Training set and Test set


In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.25)

# Training the LogisticRegression model on the Training set

In [None]:
from sklearn.linear_model import LogisticRegression
model_1= LogisticRegression()
model_1.fit(x_train, y_train)

Predicting Output and checking the model performance

In [None]:
# testing
from sklearn.metrics import f1_score, accuracy_score,confusion_matrix
pred = model_1.predict(x_test)

In [None]:
cm1 = confusion_matrix(y_test, pred)
print(cm1)
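# sentiment has three classes (0.0, 1.0, 2.0), so f1_score needs an explicit average
print("f1_score : ",f1_score(y_test, pred, average='macro'))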
print("accuracy_score : ",accuracy_score(y_test,pred))
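
Since the notebook reports both accuracy and F1, it may help to see how the two are derived from the same confusion-matrix counts. The sketch below covers the binary case only, with made-up labels. The helper names are illustrative and this is not scikit-learn's implementation.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels, for illustration only: 6 negatives, 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, fp, fn, tn))
print("f1:", f1(tp, fp, fn))
```

In this toy example the classifier finds only one of the four positives, yet accuracy is still 0.6 because the negatives dominate. F1 ignores the true negatives and comes out at about 0.33, which is why both numbers are worth reporting side by side.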

# Training the Naive Bayes model on the Training set

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

Predicting Output and checking the model performance

In [None]:
y1_pred = classifier.predict(x_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y1_pred)
print(cm)
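# sentiment has three classes (0.0, 1.0, 2.0), so f1_score needs an explicit average
print("f1_score : ",f1_score(y_test, y1_pred, average='macro'))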
print("accuracy_score : ",accuracy_score(y_test,y1_pred))

# Training the RandomForestClassifier model on the Training set

In [None]:
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
RFC.fit(x_train, y_train)


Predicting Output and checking the model performance

In [None]:
y2_pred = RFC.predict(x_test)
cm1 = confusion_matrix(y_test, y2_pred)
print(cm1)
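# sentiment has three classes (0.0, 1.0, 2.0), so f1_score needs an explicit average
print("f1_score : ",f1_score(y_test, y2_pred, average='macro'))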
print("accuracy_score : ",accuracy_score(y_test,y2_pred))

# Training the SupportVectorClassifier model on the Training set

In [None]:
# from sklearn.svm import SVC
# svc = SVC(kernel = 'rbf', random_state = 0)
# svc.fit(x_train, y_train)

Predicting Output and checking the model performance

In [None]:
# y3_pred = svc.predict(x_test)
# cm2 = confusion_matrix(y_test, y3_pred)
# print(cm2)
# print("f1_score : ",f1_score(y_test, y3_pred))
# print("accuracy_score : ",accuracy_score(y_test,y3_pred))

# Conclusion

Upon evaluating all the models, we can conclude the following:

Accuracy: As far as accuracy is concerned, SVM performs better than Logistic Regression, which in turn performs better than RandomForestClassifier.

F1-score: The F1 scores for class 0 and class 1 are:
(a) For class 0: Bernoulli Naive Bayes (F1 = 0.90) < SVM (F1 = 0.91) < Logistic Regression (F1 = 0.92)
(b) For class 1: Bernoulli Naive Bayes (F1 = 0.66) < SVM (F1 = 0.68) < Logistic Regression (F1 = 0.69)

We, therefore, conclude that the Support Vector Machine is the best model for the above-given dataset.

A common heuristic is that, when no assumptions are made about the data, the simplest model tends to work best. We made no assumptions about our dataset and Logistic Regression is a simple model, so the heuristic holds for the above-mentioned dataset.