In this notebook I applied different Machine Learning methods (SVM, Decision Trees, Naive Bayes) to the data I streamed in the previous notebook.

In [1]:
import pandas as pd
import numpy as np

import nltk

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
import re

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import naive_bayes
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import classification_report,confusion_matrix, accuracy_score

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\zainab\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\zainab\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

I started by loading the data into a dataframe. All the information about the dataframe is shown below.

In [6]:
df = pd.read_pickle("./twitter_data.pkl")
df = df.sample(frac=1).reset_index(drop=True)

df.info()
df.head(10)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 7 columns):
id          30000 non-null int64
text        30000 non-null object
date        30000 non-null datetime64[ns]
source      30000 non-null object
likes       30000 non-null int64
retweets    30000 non-null int64
class       30000 non-null int64
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 1.6+ MB


Unnamed: 0,id,text,date,source,likes,retweets,class
0,1287859173556920327,Rafale: First French fighter jets head to Indi...,2020-07-27 21:15:28,True Anthem,164,59,0
1,1214207516726513672,Anti-Wall Extremist The Kool-Aid Man Leads Cam...,2020-01-06 15:30:05,Buffer,3836,704,1
2,1288034661155901440,A video featuring a group of doctors making fa...,2020-07-28 08:52:48,SocialFlow,2073,912,0
3,1222515270708801544,Bell CEO’s mental health dramatically improves...,2020-01-29 13:42:08,Twitter Web App,822,612,1
4,1248618917871812610,Lots of people are taking part in WHO’s #SafeH...,2020-04-10 14:28:42,Wildmoka,43,16,0
5,1244737188908208128,New York City Health Officials Board Up Sun To...,2020-03-30 21:24:06,SocialFlow,932,141,1
6,1225622721758998528,".@Lin_Manuel: “Holy hell, I wrote one musical ...",2020-02-07 03:30:02,Buffer,5776,488,1
7,1252215192576032768,Failure Now An Option https://t.co/wFtITBA6PZ ...,2020-04-20 12:39:01,Sprout Social,3471,427,1
8,1283624204336857090,A Pennsylvania couple didn't know they had hou...,2020-07-16 04:47:13,SocialFlow,329,82,0
9,1281400157330984960,California Attorney General Xavier Becerra ann...,2020-07-10 01:29:39,SocialFlow,1048,222,0


Before applying anything to the dataframe, I wanted to see how many of the 30000 tweets are coronavirus-related by looking for the words "corona" or "covid" in the tweet text, as shown below.

In [3]:
c=0
for i in range(0, len(df)):
    t=df['text'][i]
    t=t.lower()
    if t.find('corona')!=-1 or t.find('covid')!=-1: 
        c=c+1   
print(c)

3949
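The same count can also be obtained without an explicit loop. The following is a minimal sketch on a made-up toy dataframe (the four texts are illustrative, not taken from the dataset), using the pandas `str.contains` method with a case-insensitive regular expression:

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe; the texts are invented for illustration.
toy = pd.DataFrame({"text": [
    "Coronavirus cases rise again",
    "Local team wins the final",
    "New COVID-19 guidance released",
    "Stock markets close higher",
]})

# The pattern is a regular expression, so "|" means "or";
# case=False makes the match case-insensitive.
mask = toy["text"].str.contains("corona|covid", case=False)
covid_count = int(mask.sum())
print(covid_count)  # 2
```

The boolean mask is also convenient later on, because it can be used directly to select the matching rows.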


Only 3949 tweets are COVID-related, which is about 13% of the data, so I decided to use them as the testing set and the remaining tweets as the training set.
The idea is to see whether Machine Learning models can differentiate between real and fake news when it comes to a new topic.

In [8]:
# Boolean mask: True where the tweet mentions "corona" or "covid" (case-insensitive)
is_corona = df['text'].str.contains('corona|covid', case=False)

# COVID-related tweets become the testing set; all other tweets stay in the training set
df_ct = df[is_corona].reset_index(drop=True)
df = df[~is_corona].reset_index(drop=True)

In [9]:
df.info()
df_ct.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26051 entries, 0 to 26050
Data columns (total 7 columns):
id          26051 non-null int64
text        26051 non-null object
date        26051 non-null datetime64[ns]
source      26051 non-null object
likes       26051 non-null int64
retweets    26051 non-null int64
class       26051 non-null int64
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 1.4+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3949 entries, 0 to 3948
Data columns (total 7 columns):
id          3949 non-null int64
text        3949 non-null object
date        3949 non-null datetime64[ns]
source      3949 non-null object
likes       3949 non-null int64
retweets    3949 non-null int64
class       3949 non-null int64
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 216.1+ KB



Here I cleaned each tweet individually. I replaced every non-letter character with a space, split the text into words, and removed unimportant words (English stopwords). After that I applied stemming (which reduces each word to its root form) and stored the result in a new column called 'cleaned' in both dataframes. Note that the stopword check runs before lowercasing, so capitalized stopwords such as "The" are not removed.
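To make the cleaning steps easier to follow, here is a minimal sketch of the same idea as a small function. It is not the notebook's exact code: it lowercases before filtering stopwords, the stopword set `demo_stop_words` is a tiny hypothetical stand-in, and the stemmer is an optional argument that defaults to doing nothing.

```python
import re

def clean(text, stop_words, stem=lambda w: w):
    """Keep letters only, lowercase, drop stopwords, then stem what remains."""
    tokens = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stem(t) for t in tokens if t not in stop_words)

# Hypothetical, tiny stopword set used only for this illustration
demo_stop_words = {"the", "is", "a", "of", "in"}

cleaned = clean("The virus is spreading in 3 cities!", demo_stop_words)
print(cleaned)  # virus spreading cities
```

Passing `set(stopwords.words("english"))` as `stop_words` and `SnowballStemmer("english").stem` as `stem` would plug in the NLTK tools used in this notebook.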

In [10]:
stemmer = SnowballStemmer('english')
words = stopwords.words("english")

df['cleaned'] = df['text'].apply(lambda x: " ".join([stemmer.stem(i) for i in re.sub("[^a-zA-Z]", " ", x).split() if i not in words]).lower())
df_ct['cleaned'] = df_ct['text'].apply(lambda x: " ".join([stemmer.stem(i) for i in re.sub("[^a-zA-Z]", " ", x).split() if i not in words]).lower())


In [11]:
# Shuffling the data before applying ML algorithms 
df = df.sample(frac=1).reset_index(drop=True)
df_ct = df_ct.sample(frac=1).reset_index(drop=True)



The 'fit_transform' function learns the vocabulary from the training set and returns its document-term matrix. For the testing set I used the 'transform' function only, so that both matrices share the same vocabulary and therefore the same number of columns; otherwise the dimensions do not match and you get an error.

In [12]:
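cv = CountVectorizer()
X_train = cv.fit_transform(df['cleaned'])  
X_test = cv.transform(df_ct['cleaned'])
# Labels for both sets (the 'class' column); every model below uses them
y_train = df['class']
y_test = df_ct['class']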
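The difference between 'fit_transform' and 'transform' is easiest to see on a toy example. The sketch below uses made-up sentences (not tweets from the dataset) and shows that words unseen during fitting are simply ignored when the testing text is transformed:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example documents, for illustration only
train_docs = ["fake news spreads fast", "real news report"]
test_docs = ["new virus news"]

vec = CountVectorizer()
X_tr = vec.fit_transform(train_docs)  # learns the vocabulary from the training text
X_te = vec.transform(test_docs)       # reuses that vocabulary; unseen words are dropped

print(X_tr.shape, X_te.shape)   # both matrices have the same number of columns
print(sorted(vec.vocabulary_))  # "new" and "virus" are not part of the vocabulary
```

Only "news" is counted in the testing document, because "new" and "virus" never appeared in the training text. This is the same situation the models below face with COVID-specific words.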

In [144]:
#LinearSVC is the linear classification method of SVM in sklearn library
svc = LinearSVC().fit(X_train, y_train)

In [145]:
# Calculating the confusion matrix and accuracy score of the training Set
pred= svc.predict(X_train)
print(classification_report(y_train,pred))
print()
print('Confusion Matrix: \n',confusion_matrix(y_train,pred))
print('Accuracy: ',accuracy_score(y_train,pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     11566
           1       1.00      1.00      1.00     14485

    accuracy                           1.00     26051
   macro avg       1.00      1.00      1.00     26051
weighted avg       1.00      1.00      1.00     26051


Confusion Matrix: 
 [[11566     0]
 [    2 14483]]
Accuracy:  0.9999232275152585


In [147]:
# Calculating the confusion matrix and accuracy score of the testing Set
pred= svc.predict(X_test)
print(classification_report(y_test,pred))
print()
print('Confusion Matrix: \n',confusion_matrix(y_test,pred))
print('Accuracy: ',accuracy_score(y_test,pred))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96      3434
           1       0.68      0.89      0.77       515

    accuracy                           0.93      3949
   macro avg       0.83      0.91      0.87      3949
weighted avg       0.94      0.93      0.94      3949


Confusion Matrix: 
 [[3221  213]
 [  58  457]]
Accuracy:  0.9313750316535832


As you can see, SVM reaches almost 100% accuracy (99.99%) on the training set and about 93% on the testing set.
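The numbers in the classification report can be reproduced by hand from the confusion matrix. As a worked example, the sketch below takes the four testing-set counts printed above for LinearSVC (rows are true classes, columns are predicted classes) and treats class 1 as the positive class:

```python
# Testing-set confusion matrix reported above for LinearSVC
tn, fp = 3221, 213  # true class 0: predicted 0, predicted 1
fn, tp = 58, 457    # true class 1: predicted 0, predicted 1

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_1 = tp / (tp + fp)  # share of predicted class-1 tweets that really are class 1
recall_1 = tp / (tp + fn)     # share of true class-1 tweets that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.9314 0.68 0.89 0.77
```

The 0.68 precision for class 1 shows that the overall accuracy hides a fairly large number of class-0 tweets that are labelled as class 1.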


In [148]:
# 'sqrt' is what max_features='auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
tree = DecisionTreeClassifier(criterion='gini',splitter='random',max_features='sqrt').fit(X_train,y_train)

In [149]:
# Calculating the confusion matrix and accuracy score of the training Set
pred= tree.predict(X_train)
print(classification_report(y_train,pred))
print()
print('Confusion Matrix: \n',confusion_matrix(y_train,pred))
print('Accuracy: ',accuracy_score(y_train,pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     11566
           1       1.00      1.00      1.00     14485

    accuracy                           1.00     26051
   macro avg       1.00      1.00      1.00     26051
weighted avg       1.00      1.00      1.00     26051


Confusion Matrix: 
 [[11566     0]
 [    0 14485]]
Accuracy:  1.0


In [151]:
pred= tree.predict(X_test)
print(classification_report(y_test,pred))
print()
print('Confusion Matrix: \n',confusion_matrix(y_test,pred))
print('Accuracy: ',accuracy_score(y_test,pred))


              precision    recall  f1-score   support

           0       0.96      0.78      0.86      3434
           1       0.35      0.80      0.49       515

    accuracy                           0.78      3949
   macro avg       0.66      0.79      0.67      3949
weighted avg       0.88      0.78      0.81      3949


Confusion Matrix: 
 [[2663  771]
 [ 101  414]]
Accuracy:  0.7791846036971385


When it comes to the Decision Tree, the training accuracy is 100%, yet the testing accuracy is only about 78%, which indicates overfitting.
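A gap like this between training and held-out accuracy can be examined with cross-validation. The following is only a sketch under assumptions: it uses synthetic data from `make_classification` instead of the tweets, and the depth limit of 5 is an arbitrary illustrative value, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the tweet features)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

results = {}
for depth in (None, 5):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_acc = model.fit(X_demo, y_demo).score(X_demo, y_demo)
    cv_acc = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    results[depth] = (train_acc, cv_acc)
    print(depth, round(train_acc, 3), round(cv_acc, 3))
```

Cross-validation inside the training set only measures ordinary overfitting. It would not capture the additional shift to a new topic that the COVID testing set introduces.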

In [153]:
naive = naive_bayes.MultinomialNB().fit(X_train,y_train)

In [155]:
pred= naive.predict(X_train)
print(classification_report(y_train,pred))
print()
print('Confusion Matrix: \n',confusion_matrix(y_train,pred))
print('Accuracy: ',accuracy_score(y_train,pred))


              precision    recall  f1-score   support

           0       0.98      0.97      0.97     11566
           1       0.97      0.98      0.98     14485

    accuracy                           0.97     26051
   macro avg       0.98      0.97      0.97     26051
weighted avg       0.97      0.97      0.97     26051


Confusion Matrix: 
 [[11179   387]
 [  268 14217]]
Accuracy:  0.9748570112471691


In [156]:
pred= naive.predict(X_test)
print(classification_report(y_test,pred))
print()
print('Confusion Matrix: \n',confusion_matrix(y_test,pred))
print('Accuracy: ',accuracy_score(y_test,pred))


              precision    recall  f1-score   support

           0       0.98      0.95      0.97      3434
           1       0.72      0.88      0.79       515

    accuracy                           0.94      3949
   macro avg       0.85      0.91      0.88      3949
weighted avg       0.95      0.94      0.94      3949


Confusion Matrix: 
 [[3262  172]
 [  63  452]]
Accuracy:  0.9404912636110407


For Naive Bayes we have 97% accuracy for training and 94% accuracy for testing.


Even though SVM does better on the training data, Naive Bayes slightly outperforms it on the testing data (94.0% vs. 93.1% accuracy).
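When comparing these accuracies, it helps to keep the class balance of the testing set in mind. The sketch below uses only numbers printed above (the support counts and the three testing accuracies) and compares each model with a baseline that always predicts the majority class:

```python
# Testing-set class counts, taken from the "support" column of the reports above
support = {0: 3434, 1: 515}
baseline = max(support.values()) / sum(support.values())

# Testing accuracies as printed above
test_accuracy = {
    "LinearSVC": 0.9313750316535832,
    "DecisionTree": 0.7791846036971385,
    "MultinomialNB": 0.9404912636110407,
}

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.3f} ({acc - baseline:+.3f} vs. baseline {baseline:.3f})")
```

Always predicting class 0 would already give about 87% accuracy on this testing set, so the Decision Tree falls below that baseline, while SVM and Naive Bayes improve on it by roughly 6 and 7 percentage points.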