## To Vaccinate or Not to Vaccinate: Analysing social media sentiment towards vaccines

Although it may be many months before COVID-19 vaccines are available on a global scale, it is important to monitor public sentiment towards vaccination now, and especially once COVID-19 vaccines are offered to the public. Anti-vaccination sentiment could pose a serious threat to global efforts to bring COVID-19 under control in the long term.

The objective of this challenge is to develop a machine learning model to assess if a Twitter post related to vaccinations is positive, neutral, or negative. 

This is a natural language processing (NLP) challenge.
* NLP (Natural Language Processing) is a subcategory of machine learning that covers a wide range of techniques designed to help machines learn from text.
    * NLP is most commonly used in chatbots and search engines, and in tasks such as sentiment analysis and machine translation (e.g. Google Translate).

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
from sklearn.model_selection import StratifiedKFold, train_test_split
import xgboost as xgb

import utils # Custom functions defined in utils.py
import re
import os

import warnings
warnings.filterwarnings('ignore')

In [None]:
train_df = pd.read_csv('./raw_data/Train.csv')
test_df = pd.read_csv('./raw_data/Test.csv')
sub = pd.read_csv('./raw_data/SampleSubmission.csv')

### Tweet exploration

Let's take a look at what pro-vaccination, neutral and anti-vaccination tweets look like.

In [None]:
train_df.head()

In [None]:
# Neutral
train_df[train_df['label'] == 0]['safe_text'].values[0]

In [None]:
# Pro-vaccination
train_df[train_df['label'] == 1]['safe_text'].values[0]

In [None]:
# Anti-vaccination
train_df[train_df['label'] == -1]['safe_text'].values[0]

In [None]:
print(train_df.label.value_counts())
# Keep only the three valid labels, dropping the outlier label (0.666667)
train_df = train_df[train_df['label'].isin([-1, 0, 1])]

In [None]:
plt.figure(figsize=(9,4))
plt.title('Class Distributions')
train_df.label.value_counts().plot(kind='bar', color=('green', 'gray'))

In [None]:
train_df.head()

### Text Preprocessing:
* Remove stop words
* Remove symbols, e.g. ampersands (&), question marks (?), exclamation marks (!)
* Remove HTML tags from tweets
* Remove URLs
* Remove emojis
* Remove single characters (the model will not learn anything useful from them)
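
The helpers in `utils.py` are not shown in this notebook, so the following is only a rough sketch of how the steps above could be implemented. The function name `preprocess`, the tiny `STOP_WORDS` set and the regular expressions are all assumptions made for illustration; the real `utils.remove_html`, `utils.remove_URL`, `utils.clean_text` and `utils.remove_emoji` may behave differently.

```python
import html
import re

# Hypothetical stand-ins for the helpers in utils.py (not shown in the notebook).
# A real stop-word list (e.g. from NLTK or scikit-learn) would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "]+"
)


def preprocess(text: str) -> str:
    """Apply the cleaning steps listed above to a single tweet."""
    text = html.unescape(text)                     # "&amp;" -> "&"
    text = HTML_TAG_RE.sub(" ", text)              # remove HTML tags
    text = URL_RE.sub(" ", text)                   # remove URLs
    text = EMOJI_RE.sub(" ", text)                 # remove emojis
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # remove symbols and digits
    # Drop stop words and single characters
    tokens = [t for t in text.split() if len(t) > 1 and t not in STOP_WORDS]
    return " ".join(tokens)


cleaned = preprocess("<b>Vaccines</b> are safe &amp; effective! https://t.co/abc x")
print(cleaned)
```

Note that the symbol-stripping step in this sketch would also remove emojis on its own; the explicit emoji pattern is kept only to mirror the list above.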

In [None]:
#test_df[test_df['safe_text'].isnull() == True]
train_df.dropna(inplace=True) # drop the row with a NaN label
test_df.fillna(value='am ok with it as long as its not dangerous', inplace=True) # fill the null safe_text row (an arbitrary imputation)

In [None]:
# Clean train_df
train_df['safe_text'] = train_df.safe_text.apply(utils.remove_html)
train_df['safe_text'] = train_df.safe_text.apply(utils.remove_URL)
train_df['safe_text'] = train_df.safe_text.apply(utils.clean_text)
train_df['safe_text'] = train_df.safe_text.apply(utils.remove_emoji)

# Clean test_df
test_df['safe_text'] = test_df.safe_text.apply(utils.remove_html)
test_df['safe_text'] = test_df.safe_text.apply(utils.remove_URL)
test_df['safe_text'] = test_df.safe_text.apply(utils.clean_text)
test_df['safe_text'] = test_df.safe_text.apply(utils.remove_emoji)

In [None]:
# split data into documents/features and labels
X = train_df.safe_text
y = train_df.label

### Building vectors

The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether it is pro-vaccination, neutral or anti-vaccination (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's `CountVectorizer` to count the words in each tweet and turn them into data a machine learning model can process.

Note: a `vector` is, in this context, a set of numbers that a machine learning model can work with.
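
To make the idea concrete, here is a small sketch of `CountVectorizer` on three made-up tweets. The example sentences are invented for illustration, and `utils.count_vectorize` (defined in `utils.py`, not shown here) may configure the vectorizer differently.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy tweets, invented for illustration only
tweets = [
    "vaccines save lives",
    "vaccines cause harm",
    "no opinion on vaccines",
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(tweets)  # sparse matrix: one row per tweet

# Each column corresponds to one token of the learned vocabulary
print(sorted(vectorizer.vocabulary_))
print(vectors.toarray())

# Unseen text is mapped onto the same vocabulary; unknown words are ignored
unseen = vectorizer.transform(["vaccines are great"])
print(unseen.toarray())
```

Each row is the vector for one tweet, and each entry counts how often the corresponding vocabulary token occurs in it.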

In [None]:
# Create train and test vectors
train_vectors, count_vectorizer = utils.count_vectorize(X)

# Map the tokens learned from the train set onto the test set,
# i.e. the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df['safe_text'])

### Building the model

The words contained in each tweet are a good indicator of whether it is `pro-vaccination (1)`, `neutral (0)` or `anti-vaccination (-1)`. The presence of a particular word (or set of words) in a tweet might link directly to one of these classes.


In [None]:
# train-validation split
X_train, X_val, y_train, y_val = train_test_split(train_vectors, y, test_size=0.2, random_state=0)

In [None]:
# Build the model (without CV); the labels are treated as a continuous score, so a regressor is fitted
xgb_clf = xgb.XGBRegressor(max_depth=9, n_estimators=200, colsample_bytree=0.8, 
                           objective='reg:squarederror', subsample=0.8,
                           nthread=2, learning_rate=0.1, random_state=42
                            )
xgb_clf.fit(X_train, y_train)


In [None]:
val_preds = xgb_clf.predict(X_val)
RMSE = utils.rmse(y_val, val_preds)
RMSE
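
The `utils.rmse` helper lives in `utils.py` and is not shown in this notebook. Assuming it computes the usual root mean squared error, a minimal equivalent might look like the sketch below (the function body is an assumption, not the original helper).

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Perfect predictions give 0; being off by 1 on every sample gives 1
print(rmse([1, 0, -1], [1, 0, -1]))
print(rmse([1, -1], [0, 0]))
```

Lower is better: with labels in the range -1 to 1, an RMSE of 1 would mean predictions are off by a full class on average.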

In [None]:
# Model with Cross-Validation
scores = []
kf = StratifiedKFold(10, shuffle=True, random_state=1)
for i, (tr, val) in enumerate(kf.split(train_vectors, y)):
    X_tr, y_tr = train_vectors[tr], np.take(y, tr, axis=0)
    X_val, y_val = train_vectors[val], np.take(y, val, axis=0)
    xgb_clf = xgb.XGBRegressor(max_depth=9, n_estimators=200, colsample_bytree=0.8, 
                               objective='reg:squarederror', subsample=0.8,
                               nthread=2, learning_rate=0.1, random_state=42
                              )
    xgb_clf.fit(X_tr, y_tr)
    score = utils.rmse(y_val, xgb_clf.predict(X_val))
    scores.append(score)
    print(score)
print(f'Mean_RMSE: {np.mean(scores)}')

### Making predictions

In [None]:
# Predict with the model fitted on the last CV fold
arr = xgb_clf.predict(test_vectors)
#arr = xgb_pipe.predict(test_vectors)
# Clip predictions to the valid label range [-1, 1]
arr = np.clip(arr, -1, 1)
        
sub['label'] = arr

In [None]:
sub.head()

In [None]:
# Create the output directory if it does not exist yet
os.makedirs('./submissions', exist_ok=True)
sub.to_csv(f"./submissions/sub_xgb_{np.round(np.mean(scores), 4)}.csv", index=False)