## Naive Bayes Classifier for Classifying Russian Troll Tweets
---

Bayes' theorem provides a way to compute the conditional probability of an event.

The theorem is applied in many fields, from finance to medicine.

Bayes' theorem can be summed up as:

![Bayes' theorem formula](https://i.imgur.com/SnKmKwx.gif)

Where P(A|B) is the probability of A being true given B (known as the posterior), P(B|A) is the probability of B given that A is true (known as the likelihood), P(A) is the probability of A being true before we look at our evidence (known as the prior term), and P(B) is the probability of our evidence being true (known as the evidence term).
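As a quick numeric illustration of the formula, the sketch below computes a posterior from a prior, a likelihood, and an evidence term. The function name `posterior` and all of the numbers are made up for illustration and are not taken from the tweet datasets.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) using Bayes' theorem: P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: A = "tweet is from a troll", B = "tweet contains a given word".
p_a = 0.5               # prior P(A)
p_b_given_a = 0.2       # likelihood P(B|A)
p_b_given_not_a = 0.05  # P(B|not A)

# Evidence P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

result = posterior(p_b_given_a, p_a, p_b)
print(round(result, 3))
```

With these made-up inputs, seeing the word raises the probability of "troll" from the prior of 0.5 to a noticeably higher posterior.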

---

Bayes' theorem has also seen use in machine learning through linear classifiers known as naive Bayes classifiers. 

These classifiers are called naive because, in addition to applying Bayes' theorem, they assume that the features of a sample (here, the words in a tweet) are independent of one another given the class.
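To make the naive assumption concrete, the sketch below scores a short text by multiplying one probability per word for each class, treating the words as independent given the class. The probabilities and the names `word_probs`, `priors`, and `naive_score` are hypothetical and only illustrate the factorization.

```python
# Hypothetical per-class word probabilities P(word | class).
word_probs = {
    "fake": {"breaking": 0.30, "news": 0.40, "cat": 0.05},
    "real": {"breaking": 0.05, "news": 0.10, "cat": 0.30},
}
priors = {"fake": 0.5, "real": 0.5}

def naive_score(words, label):
    """Prior times the product of P(word | label), treating words as independent."""
    score = priors[label]
    for word in words:
        score *= word_probs[label][word]
    return score

words = ["breaking", "news"]
scores = {label: naive_score(words, label) for label in priors}
best = max(scores, key=scores.get)
print(best)
```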

Despite this simplification, naive Bayes classifiers tend to perform well even when trained on a limited dataset.

---

Utilizing a naive Bayes classifier, we will be going over how to tell if a tweet is from an alleged troll account.

Let's get started.

The first thing we need to do is install kaggle so that we can obtain our datasets.

In [0]:
!pip install kaggle

Once kaggle is installed, we need to upload our kaggle.json API key and download the datasets.

In [0]:
from google.colab import files
files.upload()

In [0]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d vikasg/russian-troll-tweets
!kaggle datasets download -d shashank1558/preprocessed-twitter-tweets
!kaggle datasets download -d speckledpingu/RawTwitterFeeds
!unzip russian-troll-tweets.zip
!unzip preprocessed-twitter-tweets.zip
!unzip RawTwitterFeeds.zip
!ls
!rm kaggle.json
!rm ~/.kaggle/kaggle.json

With our datasets unpacked we should import the packages we will be using.

In [0]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import math

Our tweets consist of five different .csv files.

Real tweets are contained within:
- processedPositive.csv
- processedNeutral.csv
- processedNegative.csv
- AllTweets.csv

The fake troll tweets are within:
- tweets.csv

What we need to do is place all of these tweets in an array for our training set and an array for our test set. 

Each item within the arrays is labeled and represented as: [tweet text, label] where a label of 1 means the tweet is fake and a label of 0 means the tweet is real.
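One detail of this representation is worth knowing: when a list of `[text, label]` pairs is converted to a NumPy array, the integer labels are coerced to strings, so later comparisons need `'1'` and `'0'` rather than `1` and `0`. A minimal sketch with made-up tweets (the names `pairs` and `arr` are illustrative):

```python
import numpy as np

# Two made-up labeled tweets in the [tweet text, label] layout.
pairs = [["hello world", 1], ["good morning", 0]]
arr = np.array(pairs)

# Mixed str/int rows become a single string dtype, so the labels are now text.
print(arr.dtype.kind)  # 'U' is the kind code for a unicode string dtype
label_is_string = arr[0, 1] == "1"
print(label_is_string)
```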


In [9]:
np.random.seed(0)
twts_pos = pd.read_csv('processedPositive.csv')
twts_neut = pd.read_csv('processedNeutral.csv')
twts_neg = pd.read_csv('processedNegative.csv')
twts_various = pd.read_csv('AllTweets.csv')
twts_various = twts_various.text.dropna()

twts_real = pd.concat([twts_pos, twts_neut, twts_neg])
# iterating over a DataFrame yields its column labels, which is how twts_real is
# walked below, so the real-tweet count uses the number of columns
twts_real_cnt = len(twts_real.columns) + len(twts_various)

twts_fake = pd.read_csv('tweets.csv')
twts_fake = twts_fake.text.dropna()

# sample as many fake tweets as there are real tweets so both classes are balanced
twts_fake_sample = twts_fake.sample(twts_real_cnt, random_state=0)

labeled_tweets = []

for row in tqdm(twts_fake_sample):
  labeled_tweets.append([row, 1])
  
for row in tqdm(twts_real):
  labeled_tweets.append([row, 0])

for row in tqdm(twts_various):
  labeled_tweets.append([row, 0])
  
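# note: converting to a NumPy array coerces the integer labels to strings ('0'/'1')
labeled_tweets = np.array(labeled_tweets)
np.random.shuffle(labeled_tweets)

# hold out 20% of the shuffled tweets for testing; the rest is used for training
test_set_size = math.floor(len(labeled_tweets) * 0.2)

test_set = labeled_tweets[:test_set_size]
train_set = labeled_tweets[test_set_size:]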

print(train_set.shape)
print(test_set.shape)

100%|██████████| 92497/92497 [00:00<00:00, 430570.67it/s]
3872it [00:00, 894686.27it/s]
100%|██████████| 88625/88625 [00:00<00:00, 418334.82it/s]


(147996, 2)
(36998, 2)


With the data split we have a training set of 147996 labeled tweets and a test set of 36998 labeled tweets.
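The shuffle-and-slice split can also be wrapped in a small reusable helper. This is a sketch rather than the notebook's code: `split_train_test` and its parameters are hypothetical names, and it uses NumPy's `default_rng` generator instead of the global seed.

```python
import math
import numpy as np

def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and return (train, test) slices."""
    rng = np.random.default_rng(seed)
    shuffled = np.array(items)
    rng.shuffle(shuffled)  # shuffles rows in place along the first axis
    test_size = math.floor(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]

# Ten made-up labeled tweets.
demo = [[f"tweet {i}", i % 2] for i in range(10)]
train_demo, test_demo = split_train_test(demo)
print(train_demo.shape, test_demo.shape)
```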

Now that we have created our training set and test set, we can start working on our naive Bayes classifier.

We start off by creating two dictionaries that will store each word and the number of its occurrences.
We will then create a function for bagging each word of the tweet.

Then we will create our training function that will iterate through every tweet in our training set, bag each word, and then compute our probability values for both fake and real tweets.
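The bagging step amounts to counting how often each word appears per class. A minimal sketch of the same idea using `collections.Counter` on made-up tweets (the names `bags` and `add_to_bag` are illustrative, not the notebook's):

```python
from collections import Counter

# One word-count bag per label; labels are strings, as in the NumPy array.
bags = {"1": Counter(), "0": Counter()}

def add_to_bag(text, label):
    """Split a tweet on whitespace and add its word counts to the label's bag."""
    bags[label].update(text.split())

add_to_bag("breaking news breaking", "1")
add_to_bag("good morning", "0")
print(bags["1"]["breaking"])
```

A `Counter` returns 0 for words it has not seen, which plays the same role as `dict.get(word, 0)`.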


In [10]:
false_tweets = {}
real_tweets = {}

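def add_words_to_bag(words, label):
  # labels are strings ('1' = fake, '0' = real) because they come from a NumPy string array
  for word in words:
    if label == '1':
      false_tweets[word] = false_tweets.get(word, 0) + 1
    else:
      real_tweets[word] = real_tweets.get(word, 0) + 1
      
def train():
  # count the words of every training tweet into the bag for its label
  for tweet in train_set:
    tweet_words = tweet[0].split()
    add_words_to_bag(tweet_words, tweet[1])
  # the priors are estimated from the number of distinct words seen per class
  total = len(false_tweets) + len(real_tweets)
  if total == 0:
    total = 1
  p_fake = len(false_tweets) / total
  p_real = (total - len(false_tweets)) / total
  return p_real, p_fake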

p_real, p_fake = train()  
print(p_real)
print(p_fake)

0.45905262596221624
0.5409473740377837


After training, we have two dictionaries filled with word occurrences from each category, along with prior probabilities of a tweet being real or fake.
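Note that `train()` derives `p_real` and `p_fake` from the number of distinct words in each dictionary. A more conventional estimate of the class priors counts tweets per label instead. The sketch below shows that alternative on a tiny made-up training set; `class_priors` is a hypothetical helper, and on the real data it would give different values from the ones printed above.

```python
def class_priors(labeled):
    """Estimate P(real) and P(fake) as the fraction of tweets with each label."""
    total = len(labeled)
    if total == 0:
        raise ValueError("need at least one labeled tweet")
    # str() lets the helper accept both integer and string labels.
    n_fake = sum(1 for _, label in labeled if str(label) == "1")
    return (total - n_fake) / total, n_fake / total

# Four made-up [tweet text, label] pairs, two per class.
demo_train = [["a b", "1"], ["c d", "0"], ["e f", "0"], ["g h", "1"]]
demo_p_real, demo_p_fake = class_priors(demo_train)
print(demo_p_real, demo_p_fake)
```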

In [0]:
def tweet_probability(tweet, label):
  # product of smoothed per-word probabilities for the given label (1 = fake, 0 = real)
  probability = 1.0
  tweet_words = tweet.split()
  bag = false_tweets if label == 1 else real_tweets
  # adding 1 to each count keeps unseen words from zeroing out the product
  denominator = len(bag) + len(tweet_words)
  for word in tweet_words:
    probability *= (bag.get(word, 0) + 1) / denominator
  return probability

def prob_classify_tweet(tweet):
  # unnormalized posterior scores: prior times likelihood for each class
  prob_real = p_real * tweet_probability(tweet, 0)
  prob_fake = p_fake * tweet_probability(tweet, 1)
  return prob_real, prob_fake

def boolean_classify_tweet(tweet):
  # True means the tweet is classified as real, False means fake
  prob_real, prob_fake = prob_classify_tweet(tweet)
  return prob_real > prob_fake

def test():
  num_correct = 0
  num_incorrect = 0
  false_positives = 0
  false_negatives = 0
  for tweet in test_set:
    # classification is True when the tweet is predicted to be real
    classification = boolean_classify_tweet(tweet[0])
    # labels are strings here: '0' is real, '1' is fake
    if (classification and tweet[1] == '0') or (not classification and tweet[1] == '1'):
      num_correct += 1
    else:
      num_incorrect += 1
      if tweet[1] == '1':
        # a fake tweet classified as real ("real" is the positive class)
        false_positives += 1
      else:
        # a real tweet classified as fake
        false_negatives += 1
      
  print('Correct: ' + str(num_correct))
  print('Incorrect: ' + str(num_incorrect))
  print('False positives: ' + str(false_positives))
  print('False negatives: ' + str(false_negatives))
  

We can now use the data obtained from training to classify the tweets in our test set.

After classifying all of the test tweets, the classifier labels 33386 of 36998 correctly, which is about 90% accuracy.

In [15]:
test()

Correct: 33386
Incorrect: 3612
False positives: 3310
False negatives: 302
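To keep the error categories unambiguous, it can help to tally all four outcomes under an explicit convention. The sketch below treats "real" as the positive class, matching the boolean returned by `boolean_classify_tweet`; `confusion_counts` and the sample data are hypothetical.

```python
def confusion_counts(predictions, labels):
    """Count outcomes, where a True prediction or a '0' label means real (positive)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for predicted_real, label in zip(predictions, labels):
        actually_real = label == "0"
        if predicted_real and actually_real:
            counts["tp"] += 1
        elif not predicted_real and not actually_real:
            counts["tn"] += 1
        elif predicted_real:
            counts["fp"] += 1  # fake tweet classified as real
        else:
            counts["fn"] += 1  # real tweet classified as fake
    return counts

# Four made-up predictions paired with string labels.
demo_counts = confusion_counts([True, False, True, False], ["0", "1", "1", "0"])
demo_accuracy = (demo_counts["tp"] + demo_counts["tn"]) / 4
print(demo_counts, demo_accuracy)
```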


This approach can be extended to improve accuracy, but it works well as a proof of concept.
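One possible extension, sketched here under assumptions rather than taken from the notebook: multiplying many small probabilities can underflow for long texts, so naive Bayes implementations often sum log-probabilities instead. The sketch also uses standard add-one (Laplace) smoothing, where the denominator is the class's total word count plus the vocabulary size. All names and counts below are made up.

```python
import math

def log_score(words, counts, vocab_size, prior):
    """Return log(prior) plus the sum of log P(word | class) with add-one smoothing."""
    total_words = sum(counts.values())
    score = math.log(prior)
    for word in words:
        score += math.log((counts.get(word, 0) + 1) / (total_words + vocab_size))
    return score

# Hypothetical word counts per class.
fake_counts = {"breaking": 3, "news": 2}
real_counts = {"good": 2, "morning": 2, "news": 1}
vocab = set(fake_counts) | set(real_counts)

words = "breaking news today".split()
fake_log = log_score(words, fake_counts, len(vocab), 0.5)
real_log = log_score(words, real_counts, len(vocab), 0.5)
print("fake" if fake_log > real_log else "real")
```

Because the logarithm is monotonic, comparing the two log scores picks the same class as comparing the products would, as long as the products do not underflow.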