# Introduction

These are my notes from following the Real Python tutorial at https://realpython.com/sentiment-analysis-python/. I am working through it step by step to better understand the concepts behind sentiment analysis.

In [1]:
import spacy

In [4]:
text = """
Dave watched as the forest burned up on the hill,
only a few miles from his house. The car had
been hastily packed and Marta was inside trying to round
up the last of the pets. "Where could she be?" he wondered
as he continued to wait for Marta to appear with the pets.
"""

# Tokenizing

Tokenizing is the process of breaking down chunks of text into smaller pieces, called tokens.

In [5]:
nlp = spacy.load("en_core_web_sm")  # load spaCy's small English pipeline
doc = nlp(text)  # run the pipeline on the text; the resulting Doc is a sequence of tokens
token_list = list(doc)  # collect the tokens into a plain list

In [13]:
len(token_list)

67
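
To get a feel for what tokenizing does, here is a rough sketch of a tokenizer built on a single regular expression. It is not how spaCy works internally; the function name `simple_tokenize` and the pattern are my own assumptions for illustration. Unlike the spaCy output above, it throws whitespace away instead of keeping newlines as tokens.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


sample = '"Where could she be?" he wondered'
tokens = simple_tokenize(sample)
print(tokens)
print(len(tokens))
```

Punctuation marks come out as their own tokens, which is part of why a token count is higher than a plain word count.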

# Removing Stop Words

Stop words are very common words (such as "the", "as", and "on") that matter in human communication but carry little value for a machine learning model, so they are usually filtered out.

In [11]:
filtered_tokens = [token for token in doc if not token.is_stop]

In [14]:
len(filtered_tokens)

34
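
The same idea can be sketched without spaCy by checking each word against a set. The `STOP_WORDS` set below is a tiny list I made up for this example; spaCy ships its own, much larger list behind `token.is_stop`.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "as", "up", "on", "a", "from", "his", "had", "been", "and", "was", "to", "of"}


def remove_stop_words(words):
    """Keep only the words whose lowercase form is not in STOP_WORDS."""
    return [word for word in words if word.lower() not in STOP_WORDS]


words = "Dave watched as the forest burned up on the hill".split()
filtered = remove_stop_words(words)
print(filtered)
```

Note that the 34 tokens left by spaCy above still include punctuation and newline tokens, since those are not stop words.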

# Normalizing Words

Normalization condenses all forms of a word into a single representation of that word. For example, the conjugated forms “watched,” “watching,” and “watches” can all be normalized to “watch.”

There are two ways to do this:
1. Stemming
2. Lemmatization

With stemming, a word is cut off at its stem, the smallest unit of that word from which you can create the descendant words. Because it only truncates, stemming will miss the relationship between irregular forms such as “feel” and “felt”.
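
A toy suffix-stripping function makes the limitation visible. This is a minimal sketch under my own assumptions (the name `naive_stem` and the suffix list are made up), not a real stemming algorithm such as Porter's.

```python
def naive_stem(word):
    """Strip one common English suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["watched", "watching", "watches", "feel", "felt"]:
    print(word, "->", naive_stem(word))
```

The three forms of “watch” collapse to the same stem, but “feel” and “felt” stay separate because no suffix rule connects them.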

Lemmatization seeks to address this issue. This process uses a data structure that relates all forms of a word back to its simplest form, or lemma.
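
The "data structure" idea can be sketched as a plain dictionary lookup. The table and the helper `lookup_lemma` are hypothetical and only cover a handful of words; real lemmatizers, spaCy's included, are considerably more involved.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "watched": "watch",
    "watching": "watch",
    "watches": "watch",
    "felt": "feel",
    "feels": "feel",
    "feeling": "feel",
}


def lookup_lemma(word):
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


for word in ["watched", "felt", "feel", "forest"]:
    print(word, "->", lookup_lemma(word))
```

Because the mapping is explicit, “felt” can point back to “feel” even though the two share no common suffix pattern.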

In [15]:
lemmas = [f"Token: {token}, lemma: {token.lemma_}" for token in filtered_tokens]

In [16]:
lemmas

['Token: \n, lemma: \n',
 'Token: Dave, lemma: Dave',
 'Token: watched, lemma: watch',
 'Token: forest, lemma: forest',
 'Token: burned, lemma: burn',
 'Token: hill, lemma: hill',
 'Token: ,, lemma: ,',
 'Token: \n, lemma: \n',
 'Token: miles, lemma: mile',
 'Token: house, lemma: house',
 'Token: ., lemma: .',
 'Token: car, lemma: car',
 'Token: \n, lemma: \n',
 'Token: hastily, lemma: hastily',
 'Token: packed, lemma: pack',
 'Token: Marta, lemma: Marta',
 'Token: inside, lemma: inside',
 'Token: trying, lemma: try',
 'Token: round, lemma: round',
 'Token: \n, lemma: \n',
 'Token: pets, lemma: pet',
 'Token: ., lemma: .',
 'Token: ", lemma: "',
 'Token: ?, lemma: ?',
 'Token: ", lemma: "',
 'Token: wondered, lemma: wonder',
 'Token: \n, lemma: \n',
 'Token: continued, lemma: continue',
 'Token: wait, lemma: wait',
 'Token: Marta, lemma: Marta',
 'Token: appear, lemma: appear',
 'Token: pets, lemma: pet',
 'Token: ., lemma: .',
 'Token: \n, lemma: \n']

# Vectorizing Text

Vectorization is a process that transforms a token into a vector: a numeric array that, in the context of NLP, is unique to a token and represents various features of it.
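
One reason to turn tokens into numbers is that vectors can be compared. The sketch below uses made-up three-dimensional vectors (the values in `TOY_VECTORS` are invented, not taken from any model) and cosine similarity, a common way to measure how closely two vectors point in the same direction.

```python
import math

# Made-up 3-dimensional vectors; real models use many more dimensions.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.4],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


dog_cat = cosine_similarity(TOY_VECTORS["dog"], TOY_VECTORS["cat"])
dog_car = cosine_similarity(TOY_VECTORS["dog"], TOY_VECTORS["car"])
print(dog_cat > dog_car)
```

With these invented values, "dog" ends up closer to "cat" than to "car", which is the kind of relationship a trained model's vectors are meant to capture.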

In [19]:
filtered_tokens[1].vector

array([ 1.8371646 ,  1.4529226 , -1.6147211 ,  0.678362  , -0.6594443 ,
        1.6417935 ,  0.5796405 ,  2.3021278 , -0.13260496,  0.5750932 ,
        1.5654886 , -0.6938864 , -0.59607106, -1.5377437 ,  1.9425622 ,
       -2.4552505 ,  1.2321601 ,  1.0434952 , -1.5102385 , -0.5787632 ,
        0.12055647,  3.6501784 ,  2.6160972 , -0.5710199 , -1.5221789 ,
        0.00629176,  0.22760668, -1.922073  , -1.6252862 , -4.226225  ,
       -3.495663  , -3.312053  ,  0.81387717, -0.00677544, -0.11603224,
        1.4620426 ,  3.0751472 ,  0.35958546, -0.22527039, -2.743926  ,
        1.269633  ,  4.606786  ,  0.34034157, -2.1272311 ,  1.2619178 ,
       -4.209798  ,  5.452852  ,  1.6940253 , -2.5972986 ,  0.95049495,
       -1.910578  , -2.374927  , -1.4227567 , -2.2528825 , -1.799806  ,
        1.607501  ,  2.9914255 ,  2.8065152 , -1.2510269 , -0.54964066,
       -0.49980402, -1.3882618 , -0.470479  , -2.9670253 ,  1.7884955 ,
        4.5282774 , -1.2602427 , -0.14885521,  1.0419178 , -0.08