**Dataset**
A labeled dataset collected from Twitter.

**Objective**
Classify each tweet as containing hate speech or not. <br>
0 -> no hate speech <br>
1 -> contains hate speech <br>

**Total Estimated Time = 90-120 Mins**

### Import Libraries

In [19]:
import numpy as np
import pandas as pd
import re

### Load Dataset

In [5]:
df = pd.read_csv("dataset.csv")

In [16]:
df.head()
#print(df.shape)

Unnamed: 0,label,tweet
0,0,@user when a father is dysfunctional and is s...
1,0,@user @user thanks for #lyft credit i can't us...
2,0,bihday your majesty
3,0,#model i love u take with u all the time in ...
4,0,factsguide: society now #motivation


### EDA

- check NaNs

In [7]:
df.isna().sum()

id       0
label    0
tweet    0
dtype: int64

In [8]:
df.drop(columns='id', inplace=True)

- check duplicates

In [11]:
df.duplicated().sum()

2432

In [12]:
df = df.drop_duplicates()

In [13]:
df.duplicated().sum()

0

- show sample tweets to identify the required preprocessing steps

In [17]:
df.tail()

Unnamed: 0,label,tweet
31956,0,off fishing tomorrow @user carnt wait first ti...
31957,0,ate @user isz that youuu?ðððððð...
31958,0,to see nina turner on the airwaves trying to...
31959,0,listening to sad songs on a monday morning otw...
31961,0,thank you @user for you follow


- check class balance

In [18]:
df['label'].value_counts()

0    27517
1     2013
Name: label, dtype: int64
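
The counts above point to a strong class imbalance. As a minimal sketch of the arithmetic, using only the two counts printed above (the variable names are illustrative, not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 27517, 1: 2013}

total = sum(counts.values())
# Share of each class in the de-duplicated dataset.
shares = {label: n / total for label, n in counts.items()}
# Number of non-hate tweets per hate-speech tweet.
imbalance_ratio = counts[0] / counts[1]

for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
print(f"ratio 0:1 is about {imbalance_ratio:.1f} to 1")
```

On the DataFrame itself, `df['label'].value_counts(normalize=True)` should give the same shares directly.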

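- Cleaning and preprocessing steps:
    - remove @user mentions
    - remove punctuation (keeping #)
    - replace every remaining character that is not a letter, a digit or # with a space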

### Cleaning and Preprocessing

In [None]:
# is there a relation between hashtags (#), emoticons such as :) and the label?
# problems: #, @user, emoji, punctuation, URLs
# goal: check whether a sentence contains hate speech or not

In [20]:
def remove_at(txt):
    clean_txt = re.sub('@user', '',txt )
    return clean_txt
#checked 

In [25]:
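def remove_pun(txt):
    # Punctuation to strip; '#' is not in the class, so hashtags are kept.
    regex = r"[!\"\$%&\'\(\)\*\+,-\./:;<=>\?@\[\\\]\^_`{\|}~]"
    # count and flags are omitted: re.MULTILINE has no effect without ^ or $.
    clean_txt = re.sub(regex, "", txt)
    return clean_txt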
#checked 

In [26]:
def contains_url(text):
    pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    return bool(pattern.search(text))

In [27]:
df['tweet'].apply(contains_url)

0        False
1        False
2        False
3        False
4        False
         ...  
31956    False
31957    False
31958    False
31959    False
31961    False
Name: tweet, Length: 29530, dtype: bool
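
The truncated output only shows the first and last rows, so it does not reveal whether any tweet contains a URL. A small sketch of counting matches instead, assuming a simplified pattern and a hypothetical list of sample tweets (neither comes from the notebook):

```python
import re

# Simplified URL pattern (an assumption; looser than the one in contains_url).
URL_PATTERN = re.compile(r"https?://\S+")

# Hypothetical sample tweets, for illustration only.
sample_tweets = [
    "thank you @user for you follow",
    "read this https://example.com/post now",
    "no link here",
]

# One boolean per tweet, then the number of tweets with a URL.
has_url = [bool(URL_PATTERN.search(t)) for t in sample_tweets]
url_count = sum(has_url)
print(url_count)
```

On the notebook's DataFrame, `df['tweet'].apply(contains_url).sum()` would return the same kind of count in one step.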

In [None]:
#def remove_emojii(txt):
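
The stub above is left empty. One possible way to fill it, sketched here under the assumption that dropping every non-ASCII character is acceptable (the helper name `remove_non_ascii` is hypothetical):

```python
def remove_non_ascii(txt):
    """Drop every character outside the ASCII range, emojis included."""
    # Characters that cannot be encoded as ASCII are silently discarded.
    return txt.encode("ascii", "ignore").decode("ascii")

# Hypothetical sample: one emoji and one accented letter.
sample = "isz that youuu? \U0001F602 caf\u00e9"
cleaned = remove_non_ascii(sample)
print(repr(cleaned))
```

This also removes accented letters, so it is a blunt tool. In this notebook the later `remove_non_english` step already replaces such characters with spaces, which may make a separate emoji step unnecessary.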

In [33]:
def remove_non_english(txt):
    clean_txt = re.sub(r"[^a-zA-Z0-9#]", ' ',txt )
    return clean_txt
#checked     

In [36]:
def wrangle(txt):
    clean_txt1 = remove_at(txt)
    clean_txt2 = remove_pun(clean_txt1)
    clean_txt3 = remove_non_english(clean_txt2)
    return clean_txt3

In [37]:
df['tweet'] = df['tweet'].apply(wrangle)

In [39]:
df.tail()

Unnamed: 0,label,tweet
31956,0,off fishing tomorrow carnt wait first time in...
31957,0,ate isz that youuu ...
31958,0,to see nina turner on the airwaves trying to...
31959,0,listening to sad songs on a monday morning otw...
31961,0,thank you for you follow
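
Because mentions are deleted and other characters are replaced with spaces, cleaned tweets can carry leading, trailing or repeated spaces. A minimal sketch of an optional extra step, assuming a hypothetical helper named `collapse_spaces` that is not part of the notebook:

```python
import re

def collapse_spaces(txt):
    """Replace any run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", txt).strip()

# Hypothetical input resembling a tweet after mention removal.
sample = "  thank you   for you follow  "
collapsed = collapse_spaces(sample)
print(repr(collapsed))
```

For the word-level `CountVectorizer` used below this is mostly cosmetic, since its default tokenizer extracts word tokens regardless of the spacing between them.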


**If you reached this point within 60 minutes, you are doing great** <br>
**If not, you are also doing great**

### Modelling

In [53]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [57]:
target = 'label'
feature = 'tweet'
X = df[feature]
y = df[target]

In [58]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
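
The split above is random and not stratified. With a minority class this small, a stratified split is a common alternative that keeps the class ratio the same in both parts. A sketch on hypothetical toy data (the reported results in this notebook come from the unstratified split, not from this variant):

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 90 negatives and 10 positives.
X_toy = [f"tweet {i}" for i in range(100)]
y_toy = [0] * 90 + [1] * 10

# stratify=y_toy keeps the 90/10 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy)

# Number of positive examples in the train and test parts.
print(sum(y_tr), sum(y_te))
```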

In [59]:
vec = CountVectorizer()
clf = LogisticRegression()
pipe = make_pipeline(vec, clf)
pipe.fit(X_train,y_train);

#### Evaluation

In [64]:
def print_report(pipe, X_test, y_test):
    y_pred = pipe.predict(X_test)
    report = metrics.classification_report(y_test, y_pred)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))



In [65]:
print_report(pipe, X_test, y_test)

              precision    recall  f1-score   support

           0       0.96      0.99      0.98      9075
           1       0.84      0.44      0.58       670

    accuracy                           0.96      9745
   macro avg       0.90      0.72      0.78      9745
weighted avg       0.95      0.96      0.95      9745

accuracy: 0.956
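
The accuracy figure needs context on an imbalanced test set. A short worked check, using only numbers from the report above (variable names are illustrative), compares it with a model that always predicts label 0 and recomputes the F1 score for label 1:

```python
# Test-set supports copied from the classification report above.
support_neg, support_pos = 9075, 670

# Accuracy of a trivial model that always predicts label 0.
baseline_accuracy = support_neg / (support_neg + support_pos)

# F1 for label 1, recomputed from the reported precision and recall.
precision_pos, recall_pos = 0.84, 0.44
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"f1 for label 1: {f1_pos:.2f}")
```

The model's 0.956 is only about 2.5 points above this baseline, so the recall and F1 for label 1 say more about how well hate speech is detected than accuracy does.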


### Enhancement

- Using different N-grams
- Using a different text representation technique

In [66]:
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5), min_df=.01, max_df=.3)
clf = LinearSVC()
pipe_tfidf = make_pipeline(vec, clf)
pipe_tfidf.fit(X_train,y_train)

In [67]:
print_report(pipe_tfidf,X_test, y_test)

              precision    recall  f1-score   support

           0       0.96      0.99      0.97      9075
           1       0.77      0.43      0.55       670

    accuracy                           0.95      9745
   macro avg       0.86      0.71      0.76      9745
weighted avg       0.95      0.95      0.95      9745

accuracy: 0.952
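
Both models above recall fewer than half of the label 1 tweets. One further option, not tried in this notebook, is to weight classes inversely to their frequency. The sketch below only demonstrates the scikit-learn API on a tiny hypothetical corpus; it makes no claim about results on the tweet data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, only to show the API; not the tweet dataset.
texts = ["have a nice day", "love this sunny morning", "great game today",
         "thanks for the follow", "happy birthday friend", "i hate you people"]
labels = [0, 0, 0, 0, 0, 1]

# class_weight="balanced" reweights samples inversely to class frequency.
pipe_balanced = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000))
pipe_balanced.fit(texts, labels)

preds = pipe_balanced.predict(texts)
print(preds)
```

With the notebook's data, the same pipeline could be fitted on `X_train`, `y_train` and passed to `print_report` to compare recall for label 1.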


#### Done!