# Purpose
This project aims to train a machine learning classifier that reports the attitude of a message (or post) towards vaccine-related topics.

# Methods
__The data__ used in this project comes from the author of the article '[The Small, Small World of Facebook’s Anti-vaxxers](https://www.theatlantic.com/health/archive/2019/02/anti-vaxx-facebook-social-media/583681/)', Alexis C. Madrigal. By using the web-monitoring tool CrowdTangle, she analyzed the most popular posts since 2016 that contain the word _vaccine_ from Facebook. [Her data](https://docs.google.com/spreadsheets/d/1j6tJDlMJErjBwoxLh4GpzCV4nM00nQS_1faZMXPGcxw/edit#gid=1593598275) is publicly available through a link included in the article.
<br>

__The steps__ included in this projects are as follow,
1. Identify the pro-vaccine and anti-vaccine posts to be used to train the model and extract these posts from the dataset
2. Label (i.e., provaccine/anti-vaccine) each post
3. Process the message of the post to an analyzable format using Natural Language Processing techniques
4. Prepare the data for training a machine learning model
5. Train several models

# Reading Guide
I try to document everything from lay people's perspective. You shouldn't need a programming background to understand my note and interpretation in this notebook. However, there will be some Python related notes for replication purposes. Those notes are left directly with the codes.

Import all the packages used in this project. Packages used include nltk, numpy, pandas, pickle, random, re, sklearn, statistics.

In [20]:
import pandas as pd
import numpy as np
import nltk
import random
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
import pickle
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode

Convert the data to a pandas dataframe for data manipulation. Display the first 5 rows for an initial check.

In [21]:
data = pd.read_csv('FBVaxData.csv')
data.head()

Unnamed: 0,Page Name,User Name,Page Id,Page Likes at Posting,Created,Type,Likes,Comments,Shares,Love,...,URL,Message,Link,Final Link,Link Text,Description,Sponsor Id,Sponsor Name,Score,Yes
0,Planet Paws,PlanetPaws.ca,112438000000000.0,1578676,2017-02-26 09:50:24 EST,Native Video,75461,42879,894482,2406,...,https://www.facebook.com/PlanetPaws.ca/posts/1...,Over-vaccinating and the overdosing of pet vac...,https://www.facebook.com/PlanetPaws.ca/videos/...,,The Dangers of Vaccine Overdosing,,,,1091367.0,1
1,Natalie Bomke Fox 32 Chicago,NatalieBomkeFox32Chicago,166804000000000.0,112316,2018-02-02 09:30:00 EST,Native Video,135485,41642,627972,27306,...,https://www.facebook.com/NatalieBomkeFox32Chic...,CANCER VACCINE SUCCESSFUL IN TESTING... Stanfo...,https://www.facebook.com/NatalieBomkeFox32Chic...,,CANCER VACCINE SUCCESSFUL IN TESTING...,,,,861565.0,1
2,FOX 2 Detroit,WJBKFox2Detroit,363658000000.0,819062,2018-02-05 14:45:02 EST,Native Video,84593,15447,580983,14496,...,https://www.facebook.com/WJBKFox2Detroit/posts...,CANCER VACCINE SUCCESSFUL IN TESTING: The vacc...,https://www.facebook.com/WJBKFox2Detroit/video...,,CANCER VACCINE SUCCESSFUL,,,,712774.0,1
3,Gizmodo,gizmodo,5718759000.0,1555197,2017-11-20 17:40:46 EST,Native Video,122206,20150,433945,11158,...,https://www.facebook.com/gizmodo/posts/1015594...,"Paul Alexander spends nearly every hour, of ev...",https://www.facebook.com/gizmodo/videos/101559...,,The Last of the Iron Lungs,,,,674298.0,1
4,Hashem Al-Ghaili,ScienceNaturePage,693505000000000.0,13426659,2018-06-22 11:08:29 EDT,Native Video,40464,3112,322544,3762,...,https://www.facebook.com/ScienceNaturePage/pos...,Cancer Vaccine Has Been Approved For Human Tri...,https://www.facebook.com/ScienceNaturePage/vid...,,Cancer Vaccine Has Been Approved For Human Trials,,,,373126.0,1


### 1. Identify the posts to be used to train the model and extract these posts from the dataset
Identify the posts to be used for training and testing the model. 
<br>
- Anti-vaccine posts are chosen from the seven Facebook pages that generate the top 20% of the anti-vaccine posts. There were 1429 posts.
- In order to have close sample size for pro-vaccine posts, I choose posts from the as many Facebook page that are pro-vaccine as necessary. There are 1182 posts.
<br>

** See [Alexis's data](https://docs.google.com/spreadsheets/d/1j6tJDlMJErjBwoxLh4GpzCV4nM00nQS_1faZMXPGcxw/edit#gid=1593598275) (the 3rd spreadsheet, 'Copy of Pivot Table 1') and [her article](https://www.theatlantic.com/health/archive/2019/02/anti-vaxx-facebook-social-media/583681/) for details.

In [22]:
AntiVaxSource = ['NaturalNews.com', 'Dr. Tenpenny on Vaccines and Current Events', 'Stop Mandatory Vaccination', 
                 'March Against Monsanto', 'J. B. Handley', 'Erin at Health Nut News', 'Revolution For Choice']
ProVaxSource = ['I fucking love science', 'SciBabe', 'The Credible Hulk', 'National Vaccine Information Center',
                'Gavi, the Vaccine Alliance', 'Refutations to Anti-Vaccine Memes', 'Do you even Science, Bro',
                'Stop the Anti-Science Movement', 'We Love GMOs and Vaccines', 'NPR', 
                'World Health Organization (WHO)', 'March for Science', 'ScienceAlert', 
                "The Skeptics' Guide to the Universe", 'Futurism', 'Being Liberal', 'A Science Enthusiast',
                'ZDoggMD', 'Insufferably Intolerant Science Nerd', 'Now This']

def CalPostNumber(SourceList):
    Num = 0
    for s in SourceList:
        for p in data['Page Name']:
            if p == s:
                Num += 1
    return Num

print('Number of Anti-vaxx Posts: ', CalPostNumber(AntiVaxSource))
print('Number of Pro-vaxx Posts: ', CalPostNumber(ProVaxSource))

Number of Anti-vaxx Posts:  1429
Number of Pro-vaxx Posts:  1182


### 2. Label each post as 'ProVax' or 'AntiVax'
Label the data with 'ProVax' and 'AntiVax'. Drop the posts with missing data as well as the data that is not needed for the project (i.e., the posts that don't belong to the selected sources).
<br>
At this point, I have a dataframe with three columns: index, message, and label. Again, display the first 5 rows for double check.

In [23]:
data = data[['Page Name', 'Message']]
data['Label'] = np.NaN

def Labeling(SourceList, LabelName): # Only posts from the selected sources will be assigned a label
    for s in SourceList:
        for i in range(0, len(data)):
            if data['Page Name'][i] == s:
                data['Label'][i] = LabelName
    return data

data = Labeling(AntiVaxSource, 'AntiVax')
data = Labeling(ProVaxSource, 'ProVax')
data = data.dropna() # This will drop both posts with missing data & data without a label
data = data.reset_index(drop = True) # Reset the index after dropping data points
data = data.drop(columns = ['Page Name']) # No longer need the source for the following analyses

data.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,Message,Label
0,Go vaccines!,ProVax
1,"It's like the MMR jab, but for heroin.",ProVax
2,"In 90 mice infected with cancer, 87 of them we...",ProVax
3,Girls who received the HPV vaccine are much le...,ProVax
4,Mainstream news reporting that #BigPharma is p...,AntiVax


Now, check the final sample size:
- Anti-vaccine posts: 1355
- Pro-vaccine posts: 1031

In [24]:
print('Number of Anti-vaxx Posts: ', len(data[data['Label'] == 'AntiVax']))
print('Number of Pro-vaxx Posts: ', len(data[data['Label'] == 'ProVax']))

Number of Anti-vaxx Posts:  1355
Number of Pro-vaxx Posts:  1031


### 3. Process the message using Natural Language Processing techniques
Process each message by
- Remove the hyperlink that was included in the message
- Remove punctuation
- Convert the words to lowercase and tokenize the words
- Stemmerize the words
- Filter out the stop words that were included in the message
<br>

At this point, each message is converted to a list of word stems in lowercase. All the hyperlinks, punctuations, emojis, and stop words are removed. Stop words are the words that don't bring semantic significance to the context, such as "a," "and," "but."

Convert the dataframe to a list that has the following information for each post
- processed message
- Label of the message
<br>

Display the first 2 data points for double check.

In [25]:
def ProcessMessages():
    for i in range(0, len(data)):
        NoLink = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", data['Message'][i])
        NoPunc = re.sub(r'[^\w\s]','',NoLink)
        WordTokens = word_tokenize(NoPunc.lower())
        ps = PorterStemmer()
        WordStems = [ps.stem(w) for w in WordTokens]
        StopWords = set(stopwords.words('english'))
        FilteredSentence = [w for w in WordStems if not w in StopWords] 
        data['Message'][i] = FilteredSentence
    return data

data = ProcessMessages()

documents = []

def CreateDoc():
    for i in range(0, len(data)):
        documents.append((data['Message'][i], data['Label'][i]))
    return data

CreateDoc()
random.shuffle(documents) # Randomize the order of the data points
documents[:2]

[(['antivaxxerlog'], 'ProVax'), (['cure', 'step', 'closer'], 'ProVax')]

Create a list that contains all the words that are used, and calculate the how often each word is used. Print out the 30 most commonly used words and its frequency.

In [26]:
def CreateAllWords():
    AllWords = []
    for i in range(0, len(data)):
        for w in data['Message'][i]:
            AllWords.append(w)
    return AllWords

AllWords = CreateAllWords()

AllWords = nltk.FreqDist(AllWords)
print(AllWords.most_common(30))

[('vaccin', 2474), ('thi', 774), ('children', 315), ('wa', 266), ('flu', 234), ('doctor', 214), ('get', 211), ('ha', 210), ('us', 206), ('inform', 202), ('year', 200), ('one', 198), ('caus', 193), ('antivaccin', 190), ('health', 184), ('know', 175), ('peopl', 172), ('measl', 170), ('like', 168), ('parent', 167), ('diseas', 160), ('hpv', 151), ('shot', 150), ('say', 138), ('death', 129), ('time', 128), ('hi', 128), ('whi', 126), ('free', 124), ('follow', 122)]


### 4. Prepare the data for training a machine learning model
Use the 3000 most used words as features to train the model.

In [27]:
WordFeatures = list(AllWords.keys())[:3000]

def FindFeatures(document):
    words = set(document)
    features = {}
    for w in WordFeatures:
        features[w] = (w in words)
    return features

For each post, screen through each word and mark whether the word is in the features (i.e., 3000 most used words). 

In [28]:
featuresets = [(FindFeatures(msg), label) for (msg, label) in documents]

Split the data to a training set and a testing set.

In [29]:
TrainingSet = featuresets[:1900]
TestingSet = featuresets[1900:]

### 5. Train models
Train a Naive Bayes classifier model, report prediction accuracy percentage, and show the 15 most informative features.
<br>

The following results show that the Naive Bayes classifier model makes the correct prediction __82.71%__ of the time. Word like 'meme' is used 78 times more by pro-vaxxers than anti-vaxxers (probably to make fun of anti-vaxxers), and word like 'network' is used 38 times more by anti-vaxxers than pro-vaxxers.
<br>

** The words shown below as features are word stems.

In [30]:
classifier = nltk.NaiveBayesClassifier.train(TrainingSet)
print("Original Naive Bayes Algo accuracy percent: ",(nltk.classify.accuracy(classifier, TestingSet))*100)
classifier.show_most_informative_features(15)

Original Naive Bayes Algo accuracy percent:  82.71604938271605
Most Informative Features
                    meme = True           ProVax : AntiVa =     78.4 : 1.0
                 network = True           AntiVa : ProVax =     38.1 : 1.0
                industri = True           AntiVa : ProVax =     25.7 : 1.0
                   truth = True           AntiVa : ProVax =     22.5 : 1.0
                 resourc = True           AntiVa : ProVax =     21.3 : 1.0
                  insert = True           AntiVa : ProVax =     20.7 : 1.0
              antivaccin = True           ProVax : AntiVa =     20.7 : 1.0
                  despit = True           ProVax : AntiVa =     15.9 : 1.0
              mainstream = True           AntiVa : ProVax =     13.2 : 1.0
                 coverag = True           ProVax : AntiVa =     12.4 : 1.0
                antivaxx = True           ProVax : AntiVa =     12.3 : 1.0
                    seri = True           AntiVa : ProVax =     11.7 : 1.0
           

Train another seven classifiers.
<br>

First I try another two Naive Bayes classifiers. Naive Bayes classifier for multinomial models makes the correct prediction __82.72%__ of the time, and Naive Bayes classifier for multivariate Bernoulli models makes the correct prediction __83.33%__ of the time. I would think that Naive Bayes classifier for multivariate Bernoulli models should perform slightly better, because unlike Naive Bayes classifier for multinomial models, it is designed for binary/boolean features. My data is in binary format. However, the two models preform equally well here.

Logistic regression classifier makes the correct prediction __81.48%__ of the time, and linear classifiers with stochastic gradient descent (SGD) training makes the correct prediction __81.28%__ of the time. Again, the two models perform equally well here.

C-Support vector classification, linear support vector classification, and Nu-Support vector classification makes the correct prediction __58.85%__ , __80.04%__ , __78.60%__ of the time, respectively. C-Support vector classification performs the worst among these three models.

In [31]:
MNBClassifier = SklearnClassifier(MultinomialNB())
MNBClassifier.train(TrainingSet)
print("MultinomialNB accuracy percent:",nltk.classify.accuracy(MNBClassifier, TestingSet))

BNBClassifier = SklearnClassifier(BernoulliNB())
BNBClassifier.train(TrainingSet)
print("BernoulliNB accuracy percent:",nltk.classify.accuracy(BNBClassifier, TestingSet))

LogisticRegressionClassifier = SklearnClassifier(LogisticRegression())
LogisticRegressionClassifier.train(TrainingSet)
print("LogisticRegression Classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegressionClassifier, TestingSet))*100)

SGDClassifierClassifier = SklearnClassifier(SGDClassifier())
SGDClassifierClassifier.train(TrainingSet)
print("SGDClassifier Classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifierClassifier, TestingSet))*100)

SVCClassifier = SklearnClassifier(SVC())
SVCClassifier.train(TrainingSet)
print("SVC Classifier accuracy percent:", (nltk.classify.accuracy(SVCClassifier, TestingSet))*100)

LinearSVCClassifier = SklearnClassifier(LinearSVC())
LinearSVCClassifier.train(TrainingSet)
print("LinearSVC Classifier accuracy percent:", (nltk.classify.accuracy(LinearSVCClassifier, TestingSet))*100)

NuSVCClassifier = SklearnClassifier(NuSVC())
NuSVCClassifier.train(TrainingSet)
print("NuSVC Classifier accuracy percent:", (nltk.classify.accuracy(NuSVCClassifier, TestingSet))*100)

MultinomialNB accuracy percent: 0.8271604938271605
BernoulliNB accuracy percent: 0.8333333333333334




LogisticRegression Classifier accuracy percent: 81.48148148148148




SGDClassifier Classifier accuracy percent: 81.27572016460906




SVC Classifier accuracy percent: 58.8477366255144
LinearSVC Classifier accuracy percent: 80.04115226337449
NuSVC Classifier accuracy percent: 78.60082304526749


Finally, I ensemble several models that have similar performace to create a model that aggregate their predictions.

In [32]:
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers
    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)
    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

This final model aggregates predictions from Naive Bayes classifier, Naive Bayes classifier for multinomial models, Naive Bayes classifier for multivariate Bernoulli models, logistic regression classifier, linear classifiers with stochastic gradient descent training, linear support vector classification, and Nu-Support vector classification. It makes the correct prediction __82.92%__ of the time. 
<br>

I also use the model to predict the first 5 posts from the testing dataset. The confidence of the prediction is shown next to the predction.

In [33]:
VotedClassifier = VoteClassifier(classifier,
                                  NuSVCClassifier,
                                  LinearSVCClassifier,
                                  SGDClassifierClassifier,
                                  MNBClassifier,
                                  BNBClassifier,
                                  LogisticRegressionClassifier)

print("VotedClassifier accuracy percent:", (nltk.classify.accuracy(VotedClassifier, TestingSet))*100)

print("Classification:", VotedClassifier.classify(TestingSet[0][0]), "Confidence %:",VotedClassifier.confidence(TestingSet[0][0])*100)
print("Classification:", VotedClassifier.classify(TestingSet[1][0]), "Confidence %:",VotedClassifier.confidence(TestingSet[1][0])*100)
print("Classification:", VotedClassifier.classify(TestingSet[2][0]), "Confidence %:",VotedClassifier.confidence(TestingSet[2][0])*100)
print("Classification:", VotedClassifier.classify(TestingSet[3][0]), "Confidence %:",VotedClassifier.confidence(TestingSet[3][0])*100)
print("Classification:", VotedClassifier.classify(TestingSet[4][0]), "Confidence %:",VotedClassifier.confidence(TestingSet[4][0])*100)
print("Classification:", VotedClassifier.classify(TestingSet[5][0]), "Confidence %:",VotedClassifier.confidence(TestingSet[5][0])*100)

VotedClassifier accuracy percent: 82.92181069958848
Classification: ProVax Confidence %: 100.0
Classification: AntiVax Confidence %: 100.0
Classification: ProVax Confidence %: 100.0
Classification: AntiVax Confidence %: 100.0
Classification: ProVax Confidence %: 71.42857142857143
Classification: ProVax Confidence %: 57.14285714285714


** The following are the lists and models that are used in the actual web app.

In [34]:
SaveWordFeatures = open("WordFeatures.pickle","wb")
pickle.dump(WordFeatures, SaveWordFeatures)
SaveWordFeatures.close()

StopWords = set(stopwords.words('english'))

SaveStopWords = open("StopWords.pickle","wb")
pickle.dump(StopWords, SaveStopWords)
SaveStopWords.close()

In [35]:
SaveClassifier = open("NaiveBayes.pickle","wb")
pickle.dump(classifier, SaveClassifier)
SaveClassifier.close()

ClassifierF = open("NaiveBayes.pickle", "rb")
classifier = pickle.load(ClassifierF)
ClassifierF.close()