<a href="https://colab.research.google.com/github/skyflaren/sentiment-analysis/blob/master/SKLearn_SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis

An AI model that predicts the sentiment of tweets using a Naive Bayes classifier. 

Created by @[astrocat879](https://github.com/astrocat879), @[dulldesk](https://github.com/dulldesk), @[hewmatt10](https://github.com/hewmatt10), and @[skyflaren](https://github.com/skyflaren)

# Preliminary

Some preliminary setup before training the model and predicting on the data. 

## Fetch Data
Files uploaded to Google Colab are recycled after a certain number of hours. To get around this, our data is hosted remotely and pulled into the environment as needed using the `wget` module. These files are exact copies of the provided datasets in the [Division Sigma resources folder](https://drive.google.com/drive/folders/1EH9S0XcSlDqEG9f-gzOjk0We5hb5OvAq). 

In [None]:
!pip install wget
import wget
import os
import requests

In [None]:
# location of the training data
data_root = "/content/given_data"

In [None]:
if not os.path.isdir(data_root): os.mkdir(data_root)
 
download_root = "https://skyflaren.github.io/sentiment-analysis/"
to_download = ['training_data.csv','contestant_judgment.csv']

for filename in to_download:
    # if the dataset file was already downloaded, remove the local copy first
    local_path = os.path.join(data_root, filename)
    if os.path.isfile(local_path): 
        os.remove(local_path)
    wget.download(download_root+filename,data_root)

## Import modules

Here, we import all of the modules necessary to run the notebook. Their usages are as follows:

### **textblob**

Text analysis library used to perform common NLP tasks, such as lemmatization.

### **pandas**

Data analysis library used to easily manipulate and iterate through the training and judging datasets. 

### **re**

Used for cleaning text input from the dataset via regular expressions. 

In [None]:
from textblob import TextBlob, Word
import pandas as pd
import re, random
########################
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB, GaussianNB
# from sklearn.feature_extraction.text import TfidfTransformer

### **NLTK**

This library deals with natural language processing. We use it to lemmatize words and to classify text using the Naive Bayes algorithm. An overview of the downloads:

#### **punkt**

Sentence tokenizer

#### **averaged_perceptron_tagger**

Used for tagging words with their parts of speech

#### **wordnet**

Database of words

#### **stopwords**

Provides a list of common words, such as `between` or `the`, to be ignored as they do not provide value in tokenization


In [None]:
!python -m textblob.download_corpora

from nltk import download as nltk_download
nltk_download('punkt')
nltk_download('averaged_perceptron_tagger')
nltk_download('wordnet')
nltk_download('stopwords')

from nltk import classify, NaiveBayesClassifier
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# The Algorithm

We first preprocess and clean the training dataset, then train on it and predict on the judgement dataset. 

<hr>

Here, we use `pandas` to retrieve the training dataset.

In [None]:
data = pd.read_csv(data_root+'/training_data.csv', header=None)[1:]

We use the `tags` attribute of the `TextBlob` object, which yields the part of speech tag for each word in the object. NLTK (and presumably TextBlob, as it is built on NLTK) uses the [Penn Treebank parts of speech tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). 

To ensure that words such as `go` and `went` are grouped together, we lemmatize each word in the TextBlob before processing it. TextBlob's `lemmatize()` function uses WordNet ([source](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.Word.lemmatize)), whose morphological processor uses [these parts of speech tags](https://wordnet.princeton.edu/documentation/morph3wn):

| Part of Speech | POS Short Form used in `lemmatize()` |
| :----:  | :-----: |
| Noun | n |
| Verb | v |
| Adjective | a |
| Adverb | r | 
| Adjective satellite | s | 

This dictionary maps the NLTK/Penn POS tags (used by `TextBlob.tags`) to their corresponding WordNet POS tags (used by `Word.lemmatize()`).

In [None]:
parts_of_speech = {'nn':'n','nns':'n','nnp':'n','nnps':'n','prp':'n','prp$':'n','vb':'v','vbg':'v','vbd':'v','vbn':'v','vbp':'v','vbz':'v','rb':'r','rbr':'r','rbs':'r','wrb':'r','jj':'a','jjr':'a','jjs':'a'}
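As a quick illustration of the mapping, the sketch below looks up a Penn tag and falls back to `None` when there is no WordNet equivalent. It uses only a small subset of the dictionary above. The helper name `to_wordnet_pos` is hypothetical, and lowercasing the tag first assumes the tagger emits uppercase tags such as `VBD` while the dictionary keys are lowercase.

```python
# Subset of the Penn -> WordNet mapping, for illustration only
penn_to_wordnet = {'nn': 'n', 'vbd': 'v', 'jj': 'a', 'rb': 'r'}

def to_wordnet_pos(penn_tag):
    """Hypothetical helper: return the WordNet tag for a Penn tag, or None."""
    # Lowercase first so that 'VBD' and 'vbd' resolve to the same key
    return penn_to_wordnet.get(penn_tag.lower())

print(to_wordnet_pos('VBD'))  # v
print(to_wordnet_pos('UH'))   # None (not in this subset of the mapping)
```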

`delim` - a custom regex-based tokenizer, used to keep contractions intact when tokenizing (e.g. `can't` instead of `can` and `'t`)

In [None]:
delim = RegexpTokenizer("[\\w']+|[^\\w\\s]+")
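To see what this pattern does, here is a small sketch that applies the same regular expression with `re.findall`. This is a stand-in, on the assumption that `RegexpTokenizer` in its default (non-gap) mode returns the pattern's matches in order. The sample text is made up.

```python
import re

# Same pattern as `delim`: runs of word characters/apostrophes, or runs of symbols
pattern = r"[\w']+|[^\w\s]+"

sample = "I can't wait!!! #excited"  # made-up example text
tokens = re.findall(pattern, sample)
print(tokens)  # ['I', "can't", 'wait', '!!!', '#', 'excited']
```

Note that the contraction stays in one piece, while runs of punctuation become their own tokens that can be discarded later.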

`dataset` - a list of tuples, each containing a map (key is a lemmatized word, value is `True`) and the sentiment label of the text (`1` or `0`)

`sentiment_ind`, `text_ind` - constants for the column numbers of the expected output sentiment value and the input text

In [None]:
# == Initialize variables
sentiment_ind = 3
text_ind = 2
dataset = []
all_text = []

`stop_words` - a set of the stop words to be ignored in the text

In [None]:
stop_words = set(stopwords.words('english'))

## Training

### Preprocessing

Here, we iterate over every text and clean each of its words, adding the results to the `dataset`. This process will take about 25 minutes.

<hr>

`cleanse()` - given a pair of a word and its parts-of-speech tag (Penn guidelines), the function returns a cleaned, lemmatized version of the word. If the cleaned word is an empty string, then `None` is returned.

In [None]:
# == Cleaning (function)

def cleanse(word):
    # Remove symbols and ignore numbers
    word_txt = re.sub('[.,$^!@#%&*"()~`\\-_+=\\[\\]{}<>/\\\\:;]*','',word[0]).strip().lower()

    # Ignore stop words, empty strings, mentions of other users, and pure numbers
    if word_txt == '' or word_txt in stop_words or word[0].strip()[0] == '@' or re.sub(r"[0-9]\s*",'',word_txt) == '':
        return None

    w = Word(word_txt)
    # TextBlob yields uppercase Penn tags (e.g. 'NN'), while the keys of parts_of_speech are lowercase
    pos = parts_of_speech.get(word[1].lower())
    # Return the lemmatized word if its tag maps to a WordNet tag; if not, return the cleaned word
    return w.lemmatize(pos) if pos else word_txt
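The symbol stripping and number check can be tried in isolation. The sketch below reuses the two regular expressions from `cleanse()` but leaves out stop words and lemmatization. The helper names `strip_symbols` and `is_only_digits` are hypothetical, and the sample tokens are made up.

```python
import re

# Same character class as in cleanse(), written as a raw string
SYMBOLS = re.compile(r'[.,$^!@#%&*"()~`\-_+=\[\]{}<>/\\:;]*')

def strip_symbols(token):
    """Hypothetical helper: remove the listed symbols, trim, and lowercase."""
    return SYMBOLS.sub('', token).strip().lower()

def is_only_digits(text):
    """Hypothetical helper: True when nothing is left after removing digits."""
    return re.sub(r'[0-9]\s*', '', text) == ''

for token in ['#Hello!', "can't", '...', '2020']:
    cleaned = strip_symbols(token)
    kept = cleaned != '' and not is_only_digits(cleaned)
    print(repr(token), '->', repr(cleaned), 'kept' if kept else 'dropped')
```

Apostrophes are not in the symbol list, so contractions survive this step unchanged.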

Here, the text and corresponding sentiment value of each entry in the original dataset are processed. After the text is stripped of any hyperlinks, it is broken down into its words and their parts of speech. Each word is cleaned of unnecessary characters, or discarded if it carries no value (e.g. a stray symbol). 

`cleaned_words` is added to the `dataset`, which is eventually used by the classifier. 

In [None]:
# == Preprocessing

for ind, row in data.iterrows():
    # The text feature in the current entry
    txt = row[text_ind]

    # Dictionary of cleaned words used by the training classifiers
    cleaned_words = {}

    # An integer representing the label for the current feature
    pos_or_neg = int(row[sentiment_ind])

    # Remove hyperlinks
    txt = re.sub('https?:\\/\\/[^\\s]*(?=(\\s|$))','',txt)

    # Breaks down each sentence into (word, part of speech tag) pairs
    sentence = TextBlob(txt, tokenizer=delim) 

    # Iterate through all (word, part of speech tag) pairs in the sentence, cleanse them, and add them
    for word in sentence.tags:
        ret = cleanse(word)
        if ret is None: continue

        cleaned_words[ret] = True

    # Append the sentence's worth of cleaned words
    dataset.append( (cleaned_words, pos_or_neg) )
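For reference, the shape of a single `dataset` entry can be sketched without TextBlob. In the simplified example below, a plain whitespace split stands in for the tagging and `cleanse()` steps, so the resulting words are not lemmatized or filtered. The helper name `to_featureset` and the sample text are hypothetical.

```python
import re

URL = r'https?:\/\/[^\s]*(?=(\s|$))'  # same hyperlink pattern as above

def to_featureset(text, label):
    """Hypothetical helper: strip links, then mark each remaining word as present."""
    text = re.sub(URL, '', text)
    # Plain whitespace split stands in for the TextBlob tagging and cleanse() steps
    features = {word.lower(): True for word in text.split()}
    return (features, label)

entry = to_featureset('Loving this https://t.co/abc123 so much', 1)
print(entry)
# ({'loving': True, 'this': True, 'so': True, 'much': True}, 1)
```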

### Naive Bayes

Naive Bayes estimates, from the training data, how likely each word is to appear in text of each sentiment, then uses those probabilities to judge the sentiment of a new sentence. It is built on Bayes' theorem:

$$P(A|B) = \frac{P(B|A) * P(A)}{P(B)}$$

Bayes' theorem gives the probability of an event occurring given that another event has occurred. From the training data we estimate how often each word occurs in text of each sentiment, i.e. the probability of the word given the sentiment. Combining these with how common each sentiment is gives a relative probability for each sentiment given the words in a piece of text, which tells us which sentiment the text is more likely to have.
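Under the "naive" assumption that words are independent given the sentiment, the score for a sentiment $s$ and words $w_1, \dots, w_n$ is proportional to $P(s) \prod_i P(w_i|s)$. The toy sketch below works through that calculation by hand on four made-up examples. It is a simplified illustration, not NLTK's implementation: the add-one smoothing, the choice to skip unseen words, and all function names are assumptions made for this example.

```python
import math
from collections import Counter

# Made-up training data: (set of words, sentiment label)
train = [
    ({'love', 'great'}, 1),
    ({'love', 'fun'}, 1),
    ({'hate', 'awful'}, 0),
    ({'hate', 'boring'}, 0),
]

def fit(examples):
    """Count how many texts carry each label, and each word within each label."""
    label_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in label_counts}
    for words, label in examples:
        word_counts[label].update(words)
    vocab = {w for words, _ in examples for w in words}
    return label_counts, word_counts, vocab

def log_score(words, label, label_counts, word_counts, vocab):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for w in words:
        if w not in vocab:
            continue  # assumption: ignore words never seen in training
        p = (word_counts[label][w] + 1) / (label_counts[label] + 2)
        score += math.log(p)
    return score

def predict(words, model):
    """Return the label with the highest log score."""
    label_counts = model[0]
    return max(label_counts, key=lambda label: log_score(words, label, *model))

model = fit(train)
print(predict({'love'}, model))            # 1
print(predict({'hate', 'boring'}, model))  # 0
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer texts.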

In [None]:
# == Training

random.shuffle(dataset)

data_len = len(dataset)

# Use the first 70% of the shuffled data for training and hold out the remaining 30% for judging
split = int(0.7*data_len)
data_for_training = dataset[:split]
data_for_judging = dataset[split:]

classifier = NaiveBayesClassifier.train(data_for_training)
print("Accuracy is:", classify.accuracy(classifier, data_for_judging))

# This method prints the table itself and returns None, so it is not wrapped in print()
classifier.show_most_informative_features(10)

## Predicting

Here, the model predicts the sentiment of each entry in the judgement dataset. 

In [None]:
judgedata = pd.read_csv(data_root+'/contestant_judgment.csv', header=None)[1:]

The text cleaning process used in the training section is performed on the judgement data. 

In [None]:
#@title
# == Predicting
predictions = []

for ind, row in judgedata.iterrows():
    txt = row[text_ind]
    cleaned_words = {}

    # Remove hyperlinks, as in the training preprocessing
    txt = re.sub('https?:\\/\\/[^\\s]*(?=(\\s|$))','',txt)

    # Breaks down each sentence into (word, part of speech tag) pairs
    sentence = TextBlob(txt, tokenizer=delim) 

    # Iterate through all (word, part of speech tag) pairs in the sentence, cleanse them, and add them
    for word in sentence.tags:
        ret = cleanse(word)
        if ret is not None:
            cleaned_words[ret] = True

    predictions.append(classifier.classify(cleaned_words))

Finally, we build a dataframe of the predicted data and export it to a CSV file. 

In [None]:
# Build the output dataframe column by column, renaming the integer-labelled columns to string ones.
# The predictions list is in the same order as the rows of judgedata, so it is used positionally
# (the row labels of judgedata start at 1, so they cannot be used to index the list directly).
prediction_df = pd.DataFrame({
    "ID": judgedata[0].values,
    "User": judgedata[1].values,
    "Text": judgedata[2].values,
    "Sentiment": [int(p == 1) for p in predictions],
})

# Export the dataframe to the csv to be submitted
prediction_df.to_csv(data_root+"/predictions.csv",index=False)