In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from string import punctuation
from collections import Counter



# Tasks:

 

1)Load the tweets file using read_csv function from Pandas package. 

2)Get the tweets into a list for easy text cleanup and manipulation.

3)To cleanup: 

     a)Normalize the casing.

     b)Using regular expressions, remove user handles. These begin with '@'.

     c)Using regular expressions, remove URLs.

     d)Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.

     e)Remove stop words.

     f)Remove redundant terms like ‘amp’, ‘rt’, etc.

     g)Remove ‘#’ symbols from the tweet while retaining the term.

4)Extra cleanup by removing terms with a length of 1.

5)Check out the top terms in the tweets:

      a)First, get all the tokenized terms into one large list.

       b)Use the counter and find the 10 most common terms.

6)Data formatting for predictive modeling:

      a)Join the tokens back to form strings. This will be required for the vectorizers.

      b)Assign x and y.

      c)Perform train_test_split using sklearn.

7)We’ll use TF-IDF values for the terms as a feature to get into a vector space model.

       a)Import the TF-IDF vectorizer from sklearn.

       b)Instantiate with a maximum of 5000 terms in your vocabulary.

       c)Fit and apply on the train set.

       d)Apply on the test set.

8)Model building: Ordinary Logistic Regression

       a)Instantiate Logistic Regression from sklearn with default parameters.

       b)Fit on the train data.

       c)Make predictions for the train and the test set.

9)Model evaluation: Accuracy, recall, and f_1 score.

       a)Report the accuracy on the train set.

       b)Report the recall on the train set: decent, high, or low.

       c)Get the f1 score on the train set.

10)Looks like you need to adjust the class imbalance, as the model seems to focus on the 0s.

       a)Adjust the appropriate class in the LogisticRegression model.

11)Train again with the adjustment and evaluate.

       a)Train the model on the train set.

       b)Evaluate the predictions on the train set: accuracy, recall, and f_1 score.

12)Regularization and Hyperparameter tuning:

       a)Import GridSearch and StratifiedKFold because of class imbalance.

       b)Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters.

       c)Use a balanced class weight while instantiating the logistic regression.

13)Find the parameters with the best recall in cross validation.

       a)Choose ‘recall’ as the metric for scoring.

       b)Choose stratified 4 fold cross validation scheme.

       c)Fit on the train set.

14)What are the best parameters?

15)Predict and evaluate using the best estimator.

      a)Use the best estimator from the grid search to make predictions on the test set.

      b)What is the recall on the test set for the toxic comments?

      c)What is the f_1 score?


Load the tweets file using read_csv function from Pandas package. 

In [2]:
df = pd.read_csv('TwitterHate.csv')
df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation


Get the tweets into a list for easy text cleanup and manipulation.

In [3]:
tweets = df['tweet'].tolist()
tweets

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation',
 '[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo  ',
 ' @user camping tomorrow @user @user @user @user @user @user @user dannyâ\x80¦',
 "the next school year is the year for exams.ð\x9f\x98¯ can't think about that ð\x9f\x98\xad #school #exams   #hate #imagine #actorslife #revolutionschool #girl",
 'we won!!! love the land!!! #allin #cavs #champions #cleveland #clevelandcavaliers  â\x80¦ ',
 " @user @user welcome here !  i'm   it's so #gr8 ! ",
 ' â\x86\x9d #ireland consume

To cleanup: 

     a)Normalize the casing.

     b)Using regular expressions, remove user handles. These begin with '@'.

     c)Using regular expressions, remove URLs.

     d)Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.

     e)Remove stop words.

     f)Remove redundant terms like ‘amp’, ‘rt’, etc.

     g)Remove ‘#’ symbols from the tweet while retaining the term.


In [4]:
tweets_lower = [twt.lower() for twt in tweets]

In [5]:
# Raw string so that \w reaches the regex engine without an invalid-escape warning
tweets_removing_handles = [re.sub(r"@\w+", "", twt) for twt in tweets_lower]

In [6]:
# Matches scheme://... up to the next whitespace (raw string for the same reason as above)
tweets_removing_urls = [re.sub(r"\w+://\S+", "", twt) for twt in tweets_removing_handles]
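
To see what the casing, handle, and URL steps do to a single tweet, here is a minimal sketch. The sample tweet and the helper name `clean_tweet` are made up for illustration; the two patterns are the ones used in the cells above.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet, then strip user handles and URLs (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r"@\w+", "", text)        # remove handles such as @user
    text = re.sub(r"\w+://\S+", "", text)   # remove anything shaped like scheme://...
    return text

sample = "@User Check THIS out https://example.com/page #Fun"
cleaned = clean_tweet(sample)
print(cleaned.split())
```

The removed pieces leave extra spaces behind, which is harmless here because the tokenizer in the next step splits on whitespace anyway.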

In [7]:
tkn = TweetTokenizer()

In [8]:
tweet_token = [tkn.tokenize(sent) for sent in tweets_removing_urls]
print(tweet_token[0])

['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run']


In [9]:
stop_nltk = stopwords.words("english")
stop_punct = list(punctuation)

#Adding some specific punctuation from the  data :
stop_punct.extend(['...','``',"''",".."])
stop_context = ['rt', 'amp']

#Final stop word list including all of these:
stop_final = stop_nltk + stop_punct + stop_context

In [10]:
stop_final

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

Extra cleanup by removing terms with a length of 1.

In [11]:
def del_stop(sent):
    # Keep terms that are not stop words and are longer than one character, then strip '#'
    return [re.sub("#","",term) for term in sent if (term not in stop_final) and (len(term) > 1)]

#Applying the function on the data:
tweets_clean = [del_stop(tweet) for tweet in tweet_token]
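
The filter can be tried on a handful of tokens. This is a sketch only: `stop_demo` is a tiny stand-in for `stop_final`, and `del_stop_demo` is a hypothetical copy of `del_stop` that uses it.

```python
import re

# Tiny illustrative stop list; the notebook uses NLTK's English list plus punctuation
stop_demo = {"when", "a", "is", "and", ".", "rt", "amp"}

def del_stop_demo(tokens):
    """Drop stop words and 1-character tokens, then strip '#' from what remains."""
    return [re.sub("#", "", t) for t in tokens if t not in stop_demo and len(t) > 1]

tokens = ["when", "a", "father", "is", "selfish", ".", "#run", "u", "rt"]
kept = del_stop_demo(tokens)
print(kept)  # ['father', 'selfish', 'run']
```

Note that the length check runs before the '#' is stripped, so a token such as '#a' passes the filter and ends up as the single character 'a'.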


Check out the top terms in the tweets:

      a)First, get all the tokenized terms into one large list.

       b)Use the counter and find the 10 most common terms.


In [12]:
term_list = []
for tweet in tweets_clean:
    term_list.extend(tweet)

#Using counter to get top terms:
res = Counter(term_list)
res.most_common(10)

[('love', 2748),
 ('day', 2276),
 ('happy', 1684),
 ('time', 1131),
 ('life', 1118),
 ('like', 1047),
 ("i'm", 1018),
 ('today', 1013),
 ('new', 994),
 ('thankful', 946)]
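
The flatten-and-count step can also be written with a single comprehension. The toy token lists below are invented for illustration.

```python
from collections import Counter

# Three already-tokenized "tweets"
docs = [["love", "day"], ["love", "happy"], ["day", "love"]]

# Flatten the list of lists into one list of terms, then count
flat = [term for doc in docs for term in doc]
top2 = Counter(flat).most_common(2)
print(top2)  # [('love', 3), ('day', 2)]
```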

Data formatting for predictive modeling:

      a)Join the tokens back to form strings. This will be required for the vectorizers.

      b)Assign x and y.

      c)Perform train_test_split using sklearn.

In [13]:
tweets_clean = [" ".join(tweet) for tweet in tweets_clean]

In [14]:
X = tweets_clean
y = df['label']

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)

We’ll use TF-IDF values for the terms as a feature to get into a vector space model.

       a)Import the TF-IDF vectorizer from sklearn.

       b)Instantiate with a maximum of 5000 terms in your vocabulary.

       c)Fit and apply on the train set.

       d)Apply on the test set.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfmodel = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidfmodel.fit_transform(X_train)
X_test_tfidf = tfidfmodel.transform(X_test)
X_train_tfidf.shape, X_test_tfidf.shape

((22373, 5000), (9589, 5000))
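
For intuition about what the vectorizer stores in each cell, the weights can be computed by hand on a toy corpus. The sketch below follows the formula scikit-learn documents for its defaults (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization per row). It is a hand computation, not the library's implementation, and the three documents are invented.

```python
import math
from collections import Counter

docs = ["love happy day", "love life", "happy happy day"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}
# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized tf-idf vector of one document over vocab."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, matrix):
    print(doc, [round(v, 3) for v in row])
```

In the second document, 'life' gets a larger weight than 'love' because it appears in fewer documents, which is the behaviour the model relies on to separate rare, informative terms from common ones.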

Model building: Ordinary Logistic Regression

       a)Instantiate Logistic Regression from sklearn with default parameters.

       b)Fit on the train data.

       c)Make predictions for the train and the test set.

In [17]:
from sklearn.linear_model import LogisticRegression

In [18]:
log = LogisticRegression()

In [19]:
log.fit(X_train_tfidf,y_train)

LogisticRegression()

In [20]:
y_train_pred = log.predict(X_train_tfidf)

In [21]:
y_test_pred = log.predict(X_test_tfidf)

Model evaluation: Accuracy, recall, and f_1 score.

       a)Report the accuracy on the train set.

       b)Report the recall on the train set: decent, high, or low.

       c)Get the f1 score on the train set.

In [22]:
from sklearn.metrics import accuracy_score,classification_report

In [23]:
accuracy_score(y_train,y_train_pred)

0.9561078085191973

In [24]:
print(classification_report(y_train,y_train_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     20815
           1       0.96      0.39      0.55      1558

    accuracy                           0.96     22373
   macro avg       0.96      0.69      0.76     22373
weighted avg       0.96      0.96      0.95     22373
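
The class 1 row can be checked by hand, since the f1-score is the harmonic mean of precision and recall. The sketch below uses the rounded values printed in the report, so it can differ in the last digit from what sklearn computes from unrounded counts.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 precision and recall as printed (rounded) in the report above
f1_class1 = f1(0.96, 0.39)
print(round(f1_class1, 2))  # 0.55
```

The low recall drags the f1-score down even though precision is high: the model is rarely wrong when it predicts class 1, but it misses most class 1 tweets.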



In [25]:
df['label'].value_counts(normalize=True)

0    0.929854
1    0.070146
Name: label, dtype: float64
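
With about 93% of the labels equal to 0, accuracy alone says little. As a rough illustration, the sketch below scores a hypothetical classifier that always predicts 0, using the train-set class counts from the report (20815 and 1558).

```python
y_true = [0] * 20815 + [1] * 1558   # class counts from the train-set report
y_pred = [0] * len(y_true)          # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_1 = true_pos / sum(t == 1 for t in y_true)

print(round(accuracy, 2), recall_1)  # 0.93 0.0
```

This baseline reaches 0.93 accuracy while finding none of the class 1 tweets, which is why recall on class 1 is the metric to watch here.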

In [26]:
logreg = LogisticRegression(class_weight="balanced")

In [27]:
logreg.fit(X_train_tfidf,y_train)

LogisticRegression(class_weight='balanced')
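
According to the scikit-learn documentation, `class_weight="balanced"` weights each class by `n_samples / (n_classes * count_of_that_class)`. A small sketch of that arithmetic, using the train-set class counts from the report:

```python
counts = {0: 20815, 1: 1558}        # train-set support from the report
n_samples = sum(counts.values())
n_classes = len(counts)

# Weight per class: rarer classes get proportionally larger weights
weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}
print({c: round(w, 2) for c, w in weights.items()})  # {0: 0.54, 1: 7.18}
```

Each class 1 example therefore counts roughly 13 times as much as a class 0 example in the loss, so both classes contribute the same total weight.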

In [28]:
y_train_pred_balanced = logreg.predict(X_train_tfidf)
accuracy_score(y_train, y_train_pred_balanced)

0.9521744960443391

In [29]:
print(classification_report(y_train,y_train_pred_balanced))

              precision    recall  f1-score   support

           0       1.00      0.95      0.97     20815
           1       0.60      0.97      0.74      1558

    accuracy                           0.95     22373
   macro avg       0.80      0.96      0.86     22373
weighted avg       0.97      0.95      0.96     22373



Regularization and Hyperparameter tuning:

       a)Import GridSearch and StratifiedKFold because of class imbalance.

       b)Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters.

       c)Use a balanced class weight while instantiating the logistic regression.

In [30]:
from sklearn.model_selection import GridSearchCV,StratifiedKFold

In [31]:
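param_grid = {
    'C': [0.01,0.1,1,10,100],
    # Note: the default 'lbfgs' solver supports only 'l2', so the 'l1' candidates
    # fail to fit and are scored as NaN unless a solver such as 'liblinear' is set
    'penalty': ["l1","l2"]
}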


In [32]:
classifier_lr = LogisticRegression(class_weight="balanced")

Find the parameters with the best recall in cross validation.

       a)Choose ‘recall’ as the metric for scoring.

       b)Choose stratified 4 fold cross validation scheme.

       c)Fit on the train set.

In [33]:
grid_search = GridSearchCV(estimator = classifier_lr, param_grid = param_grid, 
                          cv = StratifiedKFold(4), n_jobs = -1, verbose = 1, scoring = "recall" )

In [34]:
grid_search.fit(X_train_tfidf, y_train)

Fitting 4 folds for each of 10 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    5.7s finished


GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=None, shuffle=False),
             estimator=LogisticRegression(class_weight='balanced'), n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']},
             scoring='recall', verbose=1)
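
The search above uses the default `lbfgs` solver, which does not support the `l1` penalty, so only the `l2` candidates are actually compared. One possible variation, sketched here on synthetic data, sets `solver="liblinear"` so that both penalties are evaluated. The data from `make_classification` is a stand-in for the TF-IDF matrix, and the names `X_demo`, `y_demo`, `clf`, and `grid` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 10% positives), for illustration only
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.9, 0.1], random_state=0
)

# liblinear supports both 'l1' and 'l2' penalties
clf = LogisticRegression(class_weight="balanced", solver="liblinear")
grid = GridSearchCV(
    estimator=clf,
    param_grid={"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]},
    cv=StratifiedKFold(4),   # each fold keeps roughly the same class ratio
    scoring="recall",
)
grid.fit(X_demo, y_demo)

scores = grid.cv_results_["mean_test_score"]
print(grid.best_params_)
```

With this solver every one of the 10 candidates gets a real recall score, so the reported best parameters reflect a comparison across both penalties.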

What are the best parameters?

In [35]:
grid_search.best_estimator_

LogisticRegression(C=1, class_weight='balanced')

Predict and evaluate using the best estimator.

      a)Use the best estimator from the grid search to make predictions on the test set.

      b)What is the recall on the test set for the toxic comments?

      c)What is the f_1 score?

In [36]:
y_test_pred = grid_search.best_estimator_.predict(X_test_tfidf)
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96      8905
           1       0.49      0.77      0.60       684

    accuracy                           0.93      9589
   macro avg       0.73      0.85      0.78      9589
weighted avg       0.95      0.93      0.93      9589



On the test set, class 1 (the toxic tweets) has an f1-score of 0.60 and a recall of 0.77.

In [38]:
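# Index the predictions by the test rows so they line up with the right tweets
Predictions = pd.Series(y_test_pred, index=y_test.index, name='Label_Predicted')

In [39]:
# Keep only the test-set rows; there are no predictions for the training rows
idtest = df.loc[y_test.index, ['label','tweet']]

In [40]:
results = pd.concat([idtest,Predictions],axis=1)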

In [45]:
results.to_csv('Tweets_Predictions.csv',index=False)