<a href="https://colab.research.google.com/github/saralieber/CS_Studio/blob/master/Review_Ch5_ML_Spacy_NaiveBayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!rm -r  'uo_puddles' #flush the old directory
my_github_name = 'uo-puddles'
clone_url = f'https://github.com/{my_github_name}/uo_puddles.git'
!git clone $clone_url  #this adds the library to colab so you can now import it
import uo_puddles.uo_puddles as up

rm: cannot remove 'uo_puddles': No such file or directory
Cloning into 'uo_puddles'...
remote: Enumerating objects: 225, done.
remote: Counting objects: 100% (225/225), done.
remote: Compressing objects: 100% (189/189), done.
remote: Total 225 (delta 133), reused 64 (delta 33), pack-reused 0
Receiving objects: 100% (225/225), 56.74 KiB | 8.11 MiB/s, done.
Resolving deltas: 100% (133/133), done.


In [0]:
# spaCy is a text analysis library
import spacy
# spaCy needs a trained language model; this downloads the medium English model (en_core_web_md) so it is available in Colab
!python -m spacy download en_core_web_md
import en_core_web_md # import the downloaded model package
nlp = en_core_web_md.load()  # nlp parses text into tokens; the result is a new data type called a doc

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
     |████████████████████████████████| 96.4MB 1.1MB/s 
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp36-none-any.whl size=98051305 sha256=a791213de60fd4a11ae468ea65f38ef6d0ebcf69c57ba4d8ad59b54c599d4d31
  Stored in directory: /tmp/pip-ephem-wheel-cache-tbq2ntik/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [0]:
# Practice using Spacy

practice_sentence = 'spaCy is a relatively new package for “Industrial strength NLP in Python” developed by Matt Honnibal at Explosion AI.'
doc = nlp(practice_sentence) # parse the practice sentence into tokens (spaCy's term for words and punctuation) - doc is an object that holds data similarly to a list

# spaCy tokens have different attributes, one of which is text (which gives the token as a string)

for i in range(len(doc)):
  token = doc[i]
  print(token.text) # notice the attributes don't use () after the name

# the above is equivalent to:
# for token in doc:
#   print(token.text)


spaCy
is
a
relatively
new
package
for
“
Industrial
strength
NLP
in
Python
”
developed
by
Matt
Honnibal
at
Explosion
AI
.


In [0]:
# Look more at spacy attributes

first_token = doc[0] # can index using docs
first_token

type(first_token)

print(first_token.text) # original word as a string
print(first_token.lemma_) # lemmatized version
print(first_token.pos_) # coarse part of speech (PROPN = proper noun)
print(first_token.tag_) # fine-grained tag (NNP = proper noun, singular)
print(first_token.dep_) # dependency relation in the parse tree (nsubj = nominal subject)
print(first_token.shape_) # word shape: x = lowercase letter, X = uppercase letter
print(first_token.is_alpha) # yes, consists of all alphabetic characters
print(first_token.is_stop) # no, is not a stop word
print(first_token.is_punct) # no, is not punctuation

spaCy
spaCy
PROPN
NNP
nsubj
xxxXx
True
False
False


In [0]:
# Look at the parse tree
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)

In [0]:
# Spacy can also recognize entities
displacy.render(doc, style="ent", jupyter=True)

In [0]:
# Practice text analysis with spacy
import pandas as pd

gothic_sentences = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vQqRwyE0ceZREKqhuaOw8uQguTG6Alr5kocggvAnczrWaimXE8ncR--GC0o_PyVDlb-R6Z60v-XaWm9/pub?output=csv',
                          encoding='utf-8')
len(gothic_sentences) # 19,579 gothic sentences
gothic_sentences.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [0]:
# Display all text when displaying the table
pd.set_option('display.max_colwidth',None) # None forces all of sentence to be shown
gothic_sentences.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP
1,id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL
2,id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP
3,id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS
4,id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL


In [0]:
# Machine Learning with Text Analysis Tools

## 1. Randomly shuffle the rows of gothic_sentences. Split into a training set (70%) and testing set (30%).
## 2. Build a "bag of words" from the training table.
## 3. Introduce Naive Bayes, which we will use with the bag of words to predict the author of a sentence.
## 4. Evaluate on testing data.

In [0]:
## 1. Randomly shuffle the rows of gothic_sentences. Split into a training set (70%) and testing set (30%).

set_seed = 1234

import numpy as np
rsgen = np.random.RandomState(set_seed)


# Shuffled Gothic Sentences
shuffled_gothics = gothic_sentences.sample(frac=1, random_state = rsgen).reset_index(drop=True)
len(shuffled_gothics)


# Calculating n's for Testing and Training Tables
n_testing = round(len(shuffled_gothics) * .3) # round so the count is a whole number of rows
n_testing # 5874

n_training = len(shuffled_gothics) - n_testing
n_training # 13705


# Training Set
training_table = shuffled_gothics[:n_training].reset_index(drop=True)

# Testing Set
testing_table = shuffled_gothics[n_training:].reset_index(drop=True)


# Grab the Text and Author Columns from the Training and Testing Sets and Convert into Lists (easier to work with text data this way)
training_text = training_table['text'].tolist()
training_authors = training_table['author'].tolist()

testing_text = testing_table['text'].tolist()
testing_authors = testing_table['author'].tolist()
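
The split sizes can also be computed rather than typed in. The following is a minimal sketch in plain Python (no pandas), assuming a hypothetical helper named `shuffle_and_split`; it is meant to illustrate the 70/30 arithmetic, not to replace the cell above.

```python
import random

def shuffle_and_split(rows, test_fraction=0.3, seed=1234):
    """Hypothetical helper: seeded shuffle, then a training/testing split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_testing = round(len(shuffled) * test_fraction)
    n_training = len(shuffled) - n_testing
    return shuffled[:n_training], shuffled[n_training:]

# 19,579 stand-in row ids, matching the number of gothic sentences
training, testing = shuffle_and_split(range(19579))
print(len(training), len(testing))  # 13705 5874
```

Because the slice points come from `n_training`, the same code keeps working if the table grows or the fraction changes.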

In [0]:
## 2. Build a "bag of words" from the training table.

# Create an empty dataframe that will be the "word bag"
word_bag = pd.DataFrame(columns=['word','EAP','MWS','HPL']) # build a dataframe with one row per word and a count column for each author's abbreviated name
word_bag.head() # currently empty


# See how you can add a row
row0 = ['indefinite',1,0,0]
word_bag.loc[0] = row0
word_bag.head()


# A function in uo_puddles has been written to add rows the way we did above (up.update_gothic_row)
up.update_gothic_row(word_bag, 'indefinite', 'EAP') # acts as a counter - adds another count underneath the author given for the corresponding word; if the word doesn't exist in the dataframe yet, it adds a new row for that word


Unnamed: 0,word,EAP,MWS,HPL
0,indefinite,2,0,0
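
The source of `up.update_gothic_row` is not shown here, so the following is only a guess at the counting behavior described above, written against a plain dictionary instead of a dataframe. The name `update_row` and the `AUTHORS` tuple are made up for this sketch.

```python
AUTHORS = ('EAP', 'MWS', 'HPL')

def update_row(bag, word, author):
    """Hypothetical stand-in for the counter: add one to bag[word][author]."""
    if word not in bag:
        # first time we see this word: start every author at zero
        bag[word] = {name: 0 for name in AUTHORS}
    bag[word][author] += 1

bag = {}
update_row(bag, 'indefinite', 'EAP')
update_row(bag, 'indefinite', 'EAP')
update_row(bag, 'sense', 'HPL')
print(bag['indefinite'])  # {'EAP': 2, 'MWS': 0, 'HPL': 0}
```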


In [0]:
# 2a. Strategy for building up the word bag

## 1. Look at sentence at row 0.
## 2. Get author from that row too.
## 3. Parse the sentence into words using spacy
## 4. For each word, update word_bag using up.update_gothic_row
## 5. Repeat steps 1-4 for all rows in the training table

### Note: Stop words are words so common that they carry little information for telling authors apart. They can be removed from the text analysis process, but choose them with care: there is no universal agreement on which words should be considered stop words.

## Example using the first sentence from the training set
# First sentence in training set
first_training_sentence = training_text[0]
first_training_sentence

# Parse the first sentence into words using spacy
doc_first_training_sentence = nlp(first_training_sentence.lower()) # convert it into lowercase so the word bag won't be case sensitive

# We want to add a count for words in the word bag IF certain criteria are met
## If the word is_alpha and if the word is NOT a stop word (NOT is_stop)

### Example using 'indefinite' and 'EAP'
token = doc_first_training_sentence[1] # taking 'indefinite' from the first sentence in the training set that's been parsed into tokens
print(token)
author = 'EAP'

if token.is_alpha and not token.is_stop:
  up.update_gothic_row(word_bag, token.text, author) # pass token.text (a string); if the word had been in spacy's "stop word" list, it would not have been added

word_bag.head()

indefinite


Unnamed: 0,word,EAP,MWS,HPL
0,indefinite,4,0,0


In [0]:
# Nested Loops
## We will need this concept moving forward

# Example - list of lists
list_of_chars = [['a', 'b', '3'], ['c', '4'], ['5', '6', 's', 'd']]
list_of_chars

# Say we want just the letters from this list of lists
letter_list = []

for i in range(len(list_of_chars)):
  inner_list = list_of_chars[i] # list 1, list 2, list 3

  for j in range(len(inner_list)):
    alpha_chars = inner_list[j] # each item contained in each list
    if alpha_chars.isalpha():
      letter_list.append(alpha_chars)

letter_list

['a', 'b', 'c', 's', 'd']
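
For comparison, the same nested loop can be written by iterating over the items directly, or as a single list comprehension. This is just an alternative spelling of the cell above, using the same `list_of_chars` data.

```python
list_of_chars = [['a', 'b', '3'], ['c', '4'], ['5', '6', 's', 'd']]

# Loop over items directly instead of over index positions
letter_list = []
for inner_list in list_of_chars:
    for char in inner_list:
        if char.isalpha():
            letter_list.append(char)

# The same logic as one list comprehension (outer loop first, then inner)
letter_list_short = [char
                     for inner_list in list_of_chars
                     for char in inner_list
                     if char.isalpha()]

print(letter_list_short)  # ['a', 'b', 'c', 's', 'd']
```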

In [0]:
# Chapter 5 Exercises
## Build a word bag using the training_text and training_authors lists

word_bag = pd.DataFrame(columns=['word','EAP','MWS','HPL']) # blank word bag

for i in range(len(training_text)):
  training_sentences = training_text[i].lower()
  parsed_sentences = nlp(training_sentences)
  author = training_authors[i]
  
  for token in parsed_sentences:  # looping over a doc (the output of nlp()) yields its tokens directly, so no index variable is needed
    if token.is_alpha and not token.is_stop:
      up.update_gothic_row(word_bag, token.text, author) # use token.text to make the output a string
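
The whole strategy can be tried on a toy example without spaCy or pandas. The sketch below makes several assumptions that differ from the real cell: it splits on whitespace instead of using `nlp`, it uses a tiny made-up `STOP_WORDS` set rather than spaCy's list, and it stores counts in a dictionary. The function name `build_word_bag` is hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {'the', 'a', 'of', 'and', 'was', 'in'}  # toy list, NOT spaCy's

def build_word_bag(sentences, authors):
    """Count, per word, how often each author used it (toy tokenizer)."""
    bag = defaultdict(lambda: {'EAP': 0, 'MWS': 0, 'HPL': 0})
    for sentence, author in zip(sentences, authors):
        for word in sentence.lower().split():
            word = word.strip('.,;:!?"\'')  # crude punctuation removal
            if word.isalpha() and word not in STOP_WORDS:
                bag[word][author] += 1
    return dict(bag)

toy_sentences = ['The raven was silent.', 'A silent sea, and the storm.']
toy_authors = ['EAP', 'MWS']
bag = build_word_bag(toy_sentences, toy_authors)
print(bag['silent'])  # {'EAP': 1, 'MWS': 1, 'HPL': 0}
```

Using `zip` pairs each sentence with its author, which plays the same role as indexing `training_text[i]` and `training_authors[i]` in the cell above.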

In [0]:
word_bag.head()

Unnamed: 0,word,EAP,MWS,HPL
0,indefinite,10,0,2
1,sense,44,18,27
2,awe,12,6,4
3,sight,31,32,52
4,navigators,1,1,0


In [0]:
word_bag.tail()

Unnamed: 0,word,EAP,MWS,HPL
21547,differential,1,0,0
21548,nobly,0,1,0
21549,conferred,0,1,0
21550,uniformly,1,0,0
21551,branded,0,1,0


In [0]:
# Word bag has been created. Now we need a machine learning algorithm. We want to be able to predict which author most likely wrote a sentence, given the words it contains.

# Sort based on the word column
sorted_word_bag = word_bag.sort_values(by=['word']).reset_index(drop=True)
sorted_word_bag.head()

# Up until now, we've used a numerical index (0 through length of table)
## It may be useful to set the word as the index instead
sorted_word_bag = sorted_word_bag.set_index('word') 
sorted_word_bag.head()

# Now we can choose a row based on the desired word
sorted_word_bag.loc['indefinite','EAP'] # give index ('indefinite') and then column name ('EAP') to get the value in that cell
## EAP used the word indefinite 10 times

10

In [0]:
sorted_word_bag.head() # sorted alphabetically by word
sorted_word_bag.tail()

Unnamed: 0_level_0,EAP,MWS,HPL
word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ærial,1,0,0
æronaut,2,0,0
æschylus,1,0,0
émeutes,1,0,0
οἶδα,0,0,2


In [0]:
# Storing the word bag
## We don't want to have to rebuild this word bag each time we open a notebook
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
with open('/content/gdrive/My Drive/word_bag_s20.csv', 'w') as file:
    sorted_word_bag.to_csv(file, index=True)

In [0]:
# First way to read the data back in
with open('/content/gdrive/My Drive/word_bag_s20.csv', 'r') as file:
    sorted_word_table = pd.read_csv(file, dtype={'word':str}, encoding='utf-8',
                                    index_col='word', na_filter=False)

In [0]:
# Second way to read the data back in
import pandas as pd
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQR107nAfeU_z-p6sUv3yhnti9vNsklgXsm2RXAExQBHPUE3APm32qMQxTuYCEBbSz09MCVx-rnOXGb/pub?output=csv'
sorted_word_bag = pd.read_csv(url, dtype={'word':str}, encoding='utf-8',
                                index_col='word', na_filter=False)
sorted_word_bag = sorted_word_bag.rename(index={'TRUE': 'true', 'FALSE': 'false'}) #need this because of bug in reading from url
sorted_word_bag.head()

Unnamed: 0_level_0,EAP,MWS,HPL
word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ab,1,0,0
aback,2,0,0
abaft,0,1,0
abandon,3,1,2
abandoned,8,4,8


In [0]:
# A new ML algorithm: Naive Bayes
## Thinking of probability of outcome in terms of the count of different text items

Basic Bayes Theorem

<img src='https://www.dropbox.com/s/efpfgkenlit9rk1/Screenshot%202020-04-23%2009.42.25.png?raw=1' height=200>

The jargony terms for the various pieces are as follows:

* P(A|B): The *posterior* probability.

* P(B|A): The *likelihood* or *conditional probability*.

* P(A): The *prior* probability.

* P(B): The *marginal* probability (also called the *evidence*).



A More Complex Version of Bayes Theorem

<img src='https://www.dropbox.com/s/gstzvvtvh9b39o8/bayes.png?raw=1'>

O is equivalent to A from above; E is equivalent to B

O stands for one of the "classes" being used. In our case, these are our three authors.

E stands for the words in each sentence, in our case. More precisely, it's the spacy tokens that meet is_alpha and not is_stop.

For instance, P(indefinite|EAP) means "what is the probability of seeing the word indefinite in a sentence that EAP wrote?"

P(E) is the probability of seeing a given word in a sentence, regardless of author.

P(O) is the probability that a sentence was written by one of the authors.
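
Since the two formulas above are only available as images, here they are written out. The second one is reconstructed from how the later cells compute the result (a product of the P(E_i|O) values, divided by a product of the P(E_i) values, times P(O)), so treat it as a reading of the notebook's code rather than a transcription of the image.

```latex
% Basic Bayes theorem
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

% Naive Bayes form used in this notebook:
% O is an author, E_1, ..., E_n are the kept words of a sentence
P(O \mid E_1, \ldots, E_n) \approx
  \frac{\prod_{i=1}^{n} P(E_i \mid O)}{\prod_{i=1}^{n} P(E_i)} \; P(O)
```

The "naive" part is the assumption that the words are independent of one another, which is what allows the joint probabilities to be replaced by simple products.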

In [0]:
# Compute the probability of a sentence being from each author - P(O)

n_training = len(training_text)
n_training

EAP_count = training_authors.count('EAP')
print(EAP_count) # 5470
MWS_count = training_authors.count('MWS')
print(MWS_count)  # 4278
HPL_count = training_authors.count('HPL')
print(HPL_count)  # 3957

prob_EAP = EAP_count/n_training
print(prob_EAP) # .40
prob_MWS = MWS_count/n_training
print(prob_MWS) # .31
prob_HPL = HPL_count/n_training
print(prob_HPL) # .29

5470
4278
3957
0.39912440715067493
0.31214885078438526
0.2887267420649398


In [0]:
# Compute P(indefinite|EAP) - P(E1|O)
## If we can solve this example, we can repeat this for all other words and authors
### We are going to assume that no word appears more than once in any sentence

p_indefinite_EAP = sorted_word_bag.loc['indefinite','EAP']/EAP_count # the number of times EAP used indefinite divided by the number of EAP sentences in the training set
p_indefinite_EAP # .0018 - under the assumption above, about .18% of EAP's sentences contain the word indefinite

0.0018281535648994515

In [0]:
# Compute P(indefinite) - P(E1)

p_indefinite = sum(sorted_word_bag.loc['indefinite'].tolist())/n_training # the number of times indefinite was used by any author, divided by the number of training sentences
p_indefinite # .0009 - under the same assumption, about .09% of training sentences contain 'indefinite'

0.0008755928493250639
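
As a check on what the two denominators mean, the arithmetic can be redone by hand with the numbers shown earlier: the word bag row for 'indefinite' (10 for EAP, 0 for MWS, 2 for HPL), 5470 EAP sentences, and 13705 training sentences. The variable names below are only for this sketch.

```python
eap_sentences = 5470   # number of EAP sentences in the training set
n_training = 13705     # number of training sentences overall
indefinite_counts = {'EAP': 10, 'MWS': 0, 'HPL': 2}  # word bag row

# P(indefinite | EAP): EAP's uses divided by EAP's sentences
p_indefinite_given_eap = indefinite_counts['EAP'] / eap_sentences

# P(indefinite): all uses divided by all training sentences
p_indefinite = sum(indefinite_counts.values()) / n_training

print(p_indefinite_given_eap, p_indefinite)
```

Both values agree with the cell outputs above, which confirms that each probability is measured per sentence, not per word.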

In [0]:
# Compute P(O|E1, E2, ..., En)
## Example: choose random sentence to practice on

random_sentence = training_text[2] 
random_sentence

# Build a list of words and store them in doc
# Then go through doc and add token.text to a new list 
# Remember the constraints: is_alpha and not is_stop

e_list = []

for i in range(1):
  doc = nlp(random_sentence.lower()) 

  for token in doc:  
    if token.is_alpha and not token.is_stop:
      e_list.append(token.text)

e_list

['laugh', 'oliver', 'grandfather', 'wo', 'ride', 'motor']

In [0]:
# Alternative solution

doc = nlp(random_sentence.lower())
e_list = []

for token in doc:
  if token.is_alpha and not token.is_stop:
    e_list.append(token.text)

In [0]:
# Assume that we use 'EAP' for 'O'
## Compute the values p(Ei|EAP) using e_list from above

p_ei_EAP_list = []

for i in range(len(e_list)):
  items = e_list[i]
  p_ei_EAP = sorted_word_bag.loc[items, 'EAP']/EAP_count
  p_ei_EAP_list.append(p_ei_EAP)
print(p_ei_EAP_list)

[0.0009140767824497258, 0.0, 0.0, 0.0021937842778793418, 0.0009140767824497258, 0.0]


In [0]:
# Alternative solution

p_ei_eap_list = []

for word in e_list:
  result = sorted_word_bag.loc[word, 'EAP']/EAP_count
  p_ei_eap_list.append(result)
print(p_ei_eap_list)

In [0]:
list(zip(e_list, p_ei_EAP_list))

[('laugh', 0.0009140767824497258),
 ('oliver', 0.0),
 ('grandfather', 0.0),
 ('wo', 0.0021937842778793418),
 ('ride', 0.0009140767824497258),
 ('motor', 0.0)]

In [0]:
# Compute p(Ei) for the words in e_list

p_ei_list = []

for i in range(len(e_list)):
  items = e_list[i]
  p_ei = sum(sorted_word_bag.loc[items].tolist())/n_training
  p_ei_list.append(p_ei)
print(p_ei_list)

[0.0010944910616563297, 7.296607077708865e-05, 0.0016052535570959504, 0.0015322874863188617, 0.0008026267785479752, 0.00036483035388544326]


In [0]:
# Calculate p(EAP)
## We already have it!

prob_EAP

0.39912440715067493

In [0]:
# Python doesn't have a built-in multiplication function equivalent to the sum function (math.prod was only added in Python 3.8)
## Below is a function written to achieve multiplication

def float_mult(number_list: list) -> float:
  assert isinstance(number_list, list), f'number_list should be a list but is instead a {type(number_list)}'
  assert all([isinstance(item, float) for item in number_list]), f'number_list must contain all floats'

  result = 1.
  for number in number_list:  #fancier version of for i in range(n):
    result *= number

  return result


# Test out the function
z = [4., 2., 3.]
x = float_mult(z)
x

24.0
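
On Python 3.8 or newer, the standard library's `math.prod` does the same job as `float_mult`. The install log earlier in this notebook shows a Python 3.6 wheel (`cp36`), so this would not have run in that environment; it is shown only as an alternative for newer runtimes.

```python
import math

z = [4., 2., 3.]
product = math.prod(z)  # multiplies all items together, starting from 1
print(product)  # 24.0
```

Unlike `float_mult`, `math.prod` does not check that every item is a float, so the assertions in the hand-written version still add something.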

In [0]:
# Now, calculate the final result using the formula from above

<img src='https://www.dropbox.com/s/gstzvvtvh9b39o8/bayes.png?raw=1'>

In [0]:
numerator = float_mult(p_ei_EAP_list)
numerator

denominator = float_mult(p_ei_list)
denominator

prob_EAP

p_EAP_ei = (numerator/denominator)*prob_EAP
p_EAP_ei # 0.0

0.0

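We have concluded that, according to Naive Bayes, there's a zero probability that EAP wrote the random_sentence we picked out. The result is zero because 'oliver', 'grandfather', and 'motor' never appear in EAP's training sentences, and a single zero factor makes the whole product zero.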

P(EAP|'laugh', 'oliver', 'grandfather', 'wo', 'ride', 'motor') = 0.0
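
A hard zero like this is a known weakness of multiplying raw counts: one unseen word rules an author out entirely. A common remedy, which is not part of this notebook, is additive (Laplace) smoothing, where a small constant is added to every count. The sketch below assumes a hypothetical helper `smoothed_likelihood` and a "word present or absent in a sentence" view of the data, which is why the denominator grows by two times `alpha`.

```python
def smoothed_likelihood(word_count, author_sentences, alpha=1.0):
    """Additive smoothing: never returns exactly zero when alpha > 0.

    Hypothetical helper, not from the notebook. The 2 * alpha term
    covers the two outcomes (word present, word absent).
    """
    return (word_count + alpha) / (author_sentences + 2 * alpha)

unsmoothed = 0 / 5470                      # e.g. 'oliver' for EAP
smoothed = smoothed_likelihood(0, 5470)    # small, but not zero
print(unsmoothed, smoothed)
```

With smoothing, an unseen word lowers an author's score sharply without forcing it to zero.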

In [0]:
# We need to do the same thing for MWS and HPL
## Hopefully HPL (the author who actually wrote the sentence) will get the highest value
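
Repeating the calculation for every author and picking the largest value might look like the sketch below. All of the counts here are invented toy numbers, not values from the gothic data, and the names `naive_bayes_score`, `word_counts`, and `author_sentences` are hypothetical.

```python
def naive_bayes_score(words, author, word_counts, author_sentences, n_training):
    """Product of P(Ei|O) over product of P(Ei), times P(O)."""
    numerator = 1.0
    denominator = 1.0
    for word in words:
        numerator *= word_counts[word][author] / author_sentences[author]
        # assumes every word appears at least once in the bag,
        # otherwise this factor (and the division below) would be zero
        denominator *= sum(word_counts[word].values()) / n_training
    prior = author_sentences[author] / n_training
    return numerator / denominator * prior

# Invented toy data: 10 training sentences in total
author_sentences = {'EAP': 4, 'MWS': 3, 'HPL': 3}
word_counts = {
    'laugh': {'EAP': 1, 'MWS': 1, 'HPL': 2},
    'motor': {'EAP': 0, 'MWS': 0, 'HPL': 1},
}
words = ['laugh', 'motor']

scores = {author: naive_bayes_score(words, author, word_counts,
                                    author_sentences, 10)
          for author in author_sentences}
best = max(scores, key=scores.get)  # author with the largest score
print(scores, best)
```

In this toy case the HPL value comes out above 1, a reminder that with the independence shortcut in the denominator the results are best treated as scores to compare across authors rather than as exact probabilities.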