# Sentiment Analysis with NLTK and VADER
# TODO: decide on 'I' vs 'we'

## Step 1: Load the input text

In [1]:
from pathlib import Path

# read and close the file in one line
# see https://stackoverflow.com/a/49564464
text_file = Path("input.txt").read_text()

# ensure that we've read the correct file
print("Read 50 characters of input file: " + text_file[0:50])

Read 50 characters of input file: "Stop blushing. I'm not needling, really I'm not. 


## Step 2: Split into two paragraphs and reformat text
The input text file appears to contain two paragraphs, which we analyze separately.
These paragraphs are separated by a blank line (two consecutive newlines), so we split on that here.
The input text file is also hard-wrapped with newlines to keep any one line from being too long, so we then replace every remaining newline with a space.

In [7]:
# split into paragraphs and remove newlines inside paragraphs
# we also need to remove double-spaces because some lines end with a space
paragraphs = text_file.split("\n\n")
paragraphs = [paragraph.replace("\n", " ").replace("  ", " ") for paragraph in paragraphs]
print("Extracted {} paragraphs: '{}...' '{}...'".format(len(paragraphs), paragraphs[0][0:20], paragraphs[1][0:20]))

Extracted 2 paragraphs: '"Stop blushing. I'm ...' 'I think you may like...'
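
The chained `replace` calls above collapse one double space per position, which is enough for this input. As a hedged alternative, the sketch below uses `str.split()` and `" ".join()` to collapse any run of whitespace in one pass. The helper name `split_paragraphs` and the sample string are illustrative assumptions, not part of the original notebook.

```python
def split_paragraphs(text):
    """Split text on blank lines and collapse all whitespace inside each paragraph."""
    paragraphs = []
    for block in text.split("\n\n"):
        # str.split() with no arguments splits on any run of whitespace,
        # so joining with a single space removes newlines and repeated spaces.
        cleaned = " ".join(block.split())
        if cleaned:  # skip empty blocks, e.g. from a trailing blank line
            paragraphs.append(cleaned)
    return paragraphs


# Hand-written sample: a trailing space before a newline and a run of three spaces.
sample = "First line \nsecond line.\n\nAnother   paragraph\nhere.\n"
print(split_paragraphs(sample))
# ['First line second line.', 'Another paragraph here.']
```

Unlike the cell above, this variant also strips leading and trailing whitespace from each paragraph and drops empty paragraphs.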


## Step 3: Process paragraphs into words
At this stage, we have two paragraphs to analyze. We first split each paragraph into a list of words.

## Step 3.1: Import and install
We use [NLTK](https://www.nltk.org/) to split each paragraph into words here, leveraging the existing
[Punkt](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) tokenizer to do the heavy lifting
for us. First we must import NLTK and download the `punkt` resource. We also download the `stopwords`
resource for later.

In [31]:
import nltk
nltk.download("punkt")

from nltk.tokenize import word_tokenize
print("Tokenized test sentence: {}".format(word_tokenize("Test sentence here.")))

# we will use this later
nltk.download("stopwords")

Tokenized test sentence: ['Test', 'sentence', 'here', '.']


[nltk_data] Downloading package punkt to /home/max/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/max/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Step 3.2: Tokenize words
Here we call the `word_tokenize()` function on each paragraph.

In [25]:
tokenized = [word_tokenize(paragraph) for paragraph in paragraphs]
print("Tokenized paragraphs: {} {}".format(tokenized[0][0:4], tokenized[1][0:4]))

Tokenized paragraphs: ['``', 'Stop', 'blushing', '.'] ['I', 'think', 'you', 'may']


## Step 3.3: Remove non-words
As we saw from the previous output, the tokenized list includes "words" such as `'.'` that we don't want to
analyze. We remove them here:

In [40]:
# using a nested list comprehension because we need to iterate over a nested list
# str.isalpha() is True only when every character in the token is a letter,
# so this drops punctuation tokens and contraction fragments such as "'m"
# we also lowercase every word, as the case of words does not contribute meaningfully
# to the sentiment of the text
tokenized_words = [[word.lower() for word in paragraph if word.isalpha()] for paragraph in tokenized]

print("Removed non-words from paragraphs: {} {}".format(tokenized_words[0][0:5], tokenized_words[1][0:5]))

Removed non-words from paragraphs: ['stop', 'blushing', 'i', 'not', 'needling'] ['i', 'think', 'you', 'may', 'like']
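
To make the effect of the `str.isalpha()` filter concrete, here is a small sketch on a hand-written token list. The tokens are illustrative assumptions chosen to resemble `word_tokenize` output; they are not taken from the input file.

```python
# Hand-written tokens resembling word_tokenize output (illustrative only).
sample_tokens = ["``", "Stop", "blushing", ".", "I", "'m", "n't", "well-known", "2nd"]

# Show which tokens pass the isalpha() test.
for token in sample_tokens:
    print(repr(token), token.isalpha())

# Same filter-and-lowercase step as in the cell above.
kept = [token.lower() for token in sample_tokens if token.isalpha()]
print(kept)
# ['stop', 'blushing', 'i']
```

Note that the filter also discards tokens containing apostrophes, hyphens, or digits. In particular, dropping a fragment such as `"n't"` removes a negation, which may matter for sentiment.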


Note that at this point we have lost sentence structure. An improved analysis might require keeping this structure.
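
One possible way to keep sentence structure, sketched here under our own assumptions, is to group the tokens into sentences before filtering. The helper `group_into_sentences` and the `SENTENCE_END` set are hypothetical names, and the token list is hand-written to resemble the start of the first paragraph. NLTK's `sent_tokenize` would be another option; this sketch avoids it to stay self-contained.

```python
# Tokens that we treat as ending a sentence (a simplifying assumption).
SENTENCE_END = {".", "!", "?"}


def group_into_sentences(tokens):
    """Group a flat token list into sentences, ending each at ., ! or ?."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep any trailing tokens that lack closing punctuation
        sentences.append(current)
    return sentences


# Hand-written tokens resembling word_tokenize output (illustrative only).
tokens = ["``", "Stop", "blushing", ".", "I", "'m", "not", "needling", ",",
          "really", "I", "'m", "not", "."]

sentences = group_into_sentences(tokens)
# Apply the same non-word filter as above, one sentence at a time.
cleaned = [[t.lower() for t in s if t.isalpha()] for s in sentences]
print(cleaned)
# [['stop', 'blushing'], ['i', 'not', 'needling', 'really', 'i', 'not']]
```

The result is one word list per sentence instead of one per paragraph, so later steps could score each sentence separately.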

## Step 3.4: Remove "stopwords"
As we saw from the previous output, there are many words such as "I" and "you" that do not contribute
much to the sentiment of the text. These words occur frequently in English and are known as stopwords.

In [45]:
from nltk.corpus import stopwords

stopwords_list = stopwords.words("english")
# print a few of the stopwords so we can see them
print("Loaded stopwords set: {}".format(stopwords_list[0:5]))
# note that all of these words are lowercase, just like the words in tokenized_words

# convert the list to a set for faster membership tests
stopwords_set = set(stopwords_list)

# remove stopwords from dataset
tokenized_words_filtered = [[word for word in paragraph if word not in stopwords_set] for paragraph in tokenized_words]

# print a few of the filtered words
print("Removed stopwords from paragraphs: {} {}".format(tokenized_words_filtered[0][0:5], tokenized_words_filtered[1][0:5]))

Loaded stopwords set: ['i', 'me', 'my', 'myself', 'we']
Removed stopwords from paragraphs: ['stop', 'blushing', 'needling', 'really', 'know'] ['think', 'may', 'like', 'know', 'something']
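
The stopword check is an exact, case-sensitive string match, which is why lowercasing in Step 3.3 matters. The sketch below illustrates this with a small hand-picked set standing in for the full NLTK list. The names `sample_stopwords` and `remove_stopwords` are assumptions made for this example.

```python
# Hand-picked stand-in for stopwords.words("english"); the real list is much longer.
sample_stopwords = {"i", "me", "my", "myself", "we", "you", "not"}


def remove_stopwords(words, stopword_set):
    """Return the words that are not in stopword_set (exact, case-sensitive match)."""
    return [word for word in words if word not in stopword_set]


mixed_case = ["I", "think", "you", "may", "like"]

# "I" survives because only the lowercase "i" is in the set.
print(remove_stopwords(mixed_case, sample_stopwords))
# ['I', 'think', 'may', 'like']

# After lowercasing, both "i" and "you" are removed.
print(remove_stopwords([w.lower() for w in mixed_case], sample_stopwords))
# ['think', 'may', 'like']
```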


At this point we have processed our input file into the list `tokenized_words_filtered`, which contains two lists
of words (one per paragraph of the input text file), excluding punctuation and words that do not contribute meaningfully
to the sentiment of the text. We have also converted every word to lowercase to make it easier to process.