# Problem Statement:


The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech 
if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label ‘1’ denotes the tweet is racist/sexist and 
label ‘0’ denotes the tweet is not racist/sexist, your objective is to predict the labels on the given test dataset.

<b>Note:</b>
The evaluation metric from this practice problem is F1-Score

![title](sentiment_analytics_header.jpg)



## Data
Our overall collection of tweets was split in the ratio of 65:35 into training and testing data. Out of the testing data,
30% is public and the rest is private.

## Data Files
 

train.csv - For training the models, we provide a labelled dataset of 31,962 tweets. The dataset is provided in the form of 
a csv file with each line storing a tweet id, its label and the tweet.
There is 1 test file (public)

test_tweets.csv - The test data file contains only tweet ids and the tweet text with each tweet in a new line.

## Evaluation Metric:
The metric used for evaluating the performance of classification model would be F1-Score.

The metric can be understood as -

 
True Positives (TP) - These are the correctly predicted positive values which means that the value of actual class is yes and the value of predicted class is also yes.

True Negatives (TN) - These are the correctly predicted negative values which means that the value of actual class is no and value of predicted class is also no.

False Positives (FP) – When actual class is no and predicted class is yes.

False Negatives (FN) – When actual class is yes but predicted class in no.

Precision = TP/TP+FP

Recall = TP/TP+FN

 
F1 Score = 2*(Recall * Precision) / (Recall + Precision)

F1 is usually more useful than accuracy, especially if for an uneven class distribution

## 1.Exploring Dataset:

In [312]:
# importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [313]:
# reading csv file

train_data=pd.read_csv("train_E6oV3lV.csv")

test_data=pd.read_csv("test_tweets_anuFYb8.csv")

In [314]:
# checking dimension of both csv file

print(train_data.shape)
print(test_data.shape)

(31962, 3)
(17197, 2)


In [315]:
# seeing first 10 rows

train_data.head(10)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...


In [316]:
test_data.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


In [317]:
# it will be used during submission time

test_id=test_data['id']
test_id=pd.DataFrame(data=test_id)
test_id.head()

Unnamed: 0,id
0,31963
1,31964
2,31965
3,31966
4,31967


### Observation:
we can easily see the symbol <b>'@','#'</b> etc in 'tweet' column so in preprocessing part these must be removed because these will not contribute towards our results

In [318]:
# keeping copy of both dataset

train_data_copy=train_data.copy()
test_data_copy=test_data.copy()

## Variables Identification:

In [319]:
# checking variables type

train_data.dtypes

id        int64
label     int64
tweet    object
dtype: object

In [320]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
id       31962 non-null int64
label    31962 non-null int64
tweet    31962 non-null object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [321]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17197 entries, 0 to 17196
Data columns (total 2 columns):
id       17197 non-null int64
tweet    17197 non-null object
dtypes: int64(1), object(1)
memory usage: 268.8+ KB


### Observation :
There is no null value in both data set so we don't require <b>'Missing Value treatment'</b> step

## Separating Independent and dependent variables and merging train and test datasets for preprocessing part: 

In [322]:
# dependent variable

y=train_data['label']
print(type(y))
print('*****************')
y=pd.DataFrame(data=y)
print(y.shape)
y.head()

<class 'pandas.core.series.Series'>
*****************
(31962, 1)


Unnamed: 0,label
0,0
1,0
2,0
3,0
4,0


In [323]:
# independent variables

train_data=train_data[['id','tweet']]

train_data.head()

Unnamed: 0,id,tweet
0,1,@user when a father is dysfunctional and is s...
1,2,@user @user thanks for #lyft credit i can't us...
2,3,bihday your majesty
3,4,#model i love u take with u all the time in ...
4,5,factsguide: society now #motivation


In [324]:
# combining two dataset

combine_data=pd.merge(train_data,test_data,how='outer')
combine_data.shape

(49159, 2)

In [325]:
# staring of second dataset at this index

combine_data.iloc[31962]

id                                                   31963
tweet    #studiolife #aislife #requires #passion #dedic...
Name: 31962, dtype: object

In [326]:
combine_data.head()

Unnamed: 0,id,tweet
0,1,@user when a father is dysfunctional and is s...
1,2,@user @user thanks for #lyft credit i can't us...
2,3,bihday your majesty
3,4,#model i love u take with u all the time in ...
4,5,factsguide: society now #motivation


In [327]:
# 'id' columns doesn't help me in prediction so i am taking it out

combine_data=combine_data['tweet']
print(type(combine_data))
print('*****************')
combine_data=pd.DataFrame(data=combine_data)
combine_data.head()

<class 'pandas.core.series.Series'>
*****************


Unnamed: 0,tweet
0,@user when a father is dysfunctional and is s...
1,@user @user thanks for #lyft credit i can't us...
2,bihday your majesty
3,#model i love u take with u all the time in ...
4,factsguide: society now #motivation


## Removing punctuation,numbers,converting into lowercase

In [328]:
import re
import string

In [329]:
# keeping only text and rest thing like digit,punctuation etc are removing because this can't help in prediction

#remove_num= lambda x: re.sub('\w*\d\w*',' ',x)
#removing_pun_low=lambda x: re.sub('[%s]'%re.escape(string.punctuation),' ',x.lower())

remove_all=lambda x:re.sub('[^A-Za-z]',' ',x.lower())

combine_data['tweet']=combine_data['tweet'].map(remove_all)
combine_data

Unnamed: 0,tweet
0,user when a father is dysfunctional and is s...
1,user user thanks for lyft credit i can t us...
2,bihday your majesty
3,model i love u take with u all the time in ...
4,factsguide society now motivation
5,huge fan fare and big talking before the...
6,user camping tomorrow user user user use...
7,the next school year is the year for exams ...
8,we won love the land allin cavs champ...
9,user user welcome here i m it s so gr...


## Removing whitespace 
(front and end both) and if there is whitespace in between two words that will automatically removed once we tokenize it

In [330]:
#for w in range(combine_data.shape[0]):
#    combine_data['tweet'].iloc[w]=combine_data['tweet'].iloc[w].strip()

In [331]:
strip=lambda x: re.sub('%s',' ',x)

combine_data['tweet']=combine_data['tweet'].map(strip)
combine_data.head()

Unnamed: 0,tweet
0,user when a father is dysfunctional and is s...
1,user user thanks for lyft credit i can t us...
2,bihday your majesty
3,model i love u take with u all the time in ...
4,factsguide society now motivation


## Stemming & Lemmatization:

In [332]:
from nltk.stem import WordNetLemmatizer,LancasterStemmer

wnc=WordNetLemmatizer()
lcs=LancasterStemmer()
lemmatize=lambda x: lcs.stem(x) 

combine_data['tweet']=combine_data['tweet'].map(lemmatize)
combine_data.head()

Unnamed: 0,tweet
0,user when a father is dysfunctional and is s...
1,user user thanks for lyft credit i can t us...
2,bihday your majesty
3,model i love u take with u all the time in ...
4,factsguide society now motivation


In [333]:
combine_data.iloc[111]

tweet     user i m not interested in a  linguistics tha...
Name: 111, dtype: object

## Removing Stopwords:

In [334]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
combine_data['tweet'] = combine_data['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

combine_data.head()

Unnamed: 0,tweet
0,user father dysfunctional selfish drags kids d...
1,user user thanks lyft credit use cause offer w...
2,bihday majesty
3,model love u take u time ur
4,factsguide society motivation


In [335]:
from sklearn.feature_extraction.text import CountVectorizer

cv=CountVectorizer(max_features=1000)
m=cv.fit_transform(combine_data['tweet'])
combine_data=pd.DataFrame(m.toarray(),columns=cv.get_feature_names())
combine_data.head()

Unnamed: 0,able,absolutely,account,act,actor,actually,adapt,add,adventure,affirmation,...,yes,yesterday,yet,yo,yoga,york,young,youtube,yr,yummy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [336]:
"""
from sklearn.feature_extraction.text import TfidfVectorizer

tf=TfidfVectorizer(max_features=1000)
p=tf.fit_transform(combine_data['tweet'])
combine_data=pd.DataFrame(p.toarray(),columns=tf.get_feature_names())
combine_data.head()

"""

"\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\ntf=TfidfVectorizer(max_features=1000)\np=tf.fit_transform(combine_data['tweet'])\ncombine_data=pd.DataFrame(p.toarray(),columns=tf.get_feature_names())\ncombine_data.head()\n\n"

In [337]:
combine_data.shape

(49159, 1000)

In [338]:
# now again separating as train and test dataset because we had combined it for preprocessing

x=combine_data[:31962]
test=combine_data[31962:]

In [339]:
x.shape,test.shape

((31962, 1000), (17197, 1000))

In [340]:
x.head(3)

Unnamed: 0,able,absolutely,account,act,actor,actually,adapt,add,adventure,affirmation,...,yes,yesterday,yet,yo,yoga,york,young,youtube,yr,yummy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Splitting into train and test set

In [341]:

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=51)
print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)


(23971, 1000) (23971, 1)
(7991, 1000) (7991, 1)


## Modeling:

In [342]:
from sklearn.linear_model import LogisticRegression

lr=LogisticRegression()
lr.fit(x_train,y_train)
y_predict=lr.predict(x_test)

  y = column_or_1d(y, warn=True)


In [343]:
"""
from sklearn.model_selection import cross_val_score

cvs=cross_val_score(lr,x,y,cv=15,scoring='f1').mean()
cvs
"""

"\nfrom sklearn.model_selection import cross_val_score\n\ncvs=cross_val_score(lr,x,y,cv=15,scoring='f1').mean()\ncvs\n"

In [344]:
from sklearn.metrics import accuracy_score,f1_score

In [345]:
f_score=f1_score(y_test,y_predict)
f_score

0.4976851851851853

In [346]:
from sklearn.naive_bayes import BernoulliNB

gnb=BernoulliNB()
gnb.fit(x_train,y_train)
y_predict=gnb.predict(x_test)
f1_score(y_test,y_predict)

  y = column_or_1d(y, warn=True)


0.5203252032520326

In [347]:
#cvs1=cross_val_score(gnb,x,y,cv=10,scoring='f1').mean()
#cvs1

In [348]:
# now predicting on test dataset

yy_predict=gnb.predict(test)
yy_predict=pd.DataFrame(data=yy_predict)
yy_predict.shape

(17197, 1)

In [349]:
combine=test_id.join(yy_predict)
combine.rename(columns={0:'my_label'},inplace=True)
combine.head()

Unnamed: 0,id,my_label
0,31963,0
1,31964,0
2,31965,0
3,31966,0
4,31967,0


In [350]:
submission=pd.read_csv('sample_submission_gfvA5FD.csv')
submission.head()

Unnamed: 0,id,label
0,31963,0
1,31964,0
2,31965,0
3,31966,0
4,31967,0


In [351]:
submission['my_label']=yy_predict
submission['id']=combine['id']
submission=submission[['id','my_label']]
submission.rename(columns={'my_label':'label'},inplace=True)
submission.head()

Unnamed: 0,id,label
0,31963,0
1,31964,0
2,31965,0
3,31966,0
4,31967,0


In [352]:
pd.DataFrame(submission, columns=['id','label']).to_csv('naive.csv',index=False)