<a href="https://colab.research.google.com/github/saralieber/CS_Studio/blob/master/DataSciSeminar_ListOfFunctions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Review of Topics Covered in Data Science Seminar


*   ### Ch.1 - Introduction to Manipulating Strings Using Methods, Lists, and Indexing
*   ### Ch. 2 - Data Wrangling & For Loops
*   ### Ch. 3 - K Nearest Neighbor 
    - A way of calculating similarty between cases using this formula: <img src='https://www.dropbox.com/s/9wao0kf3u32i3e9/euclidian.png?raw=1'>
    - Examples: Calculated similarity between a chosen case ("Braund") and all other passengers based on which letters from the alphabet are contained in their names. Then, looked to see if the most similar passengers on this dimension scored similarly on the outcome variable (whether survived Tianic sinking or not)
    - The outcome variable is also called the *label* column
*   ### Ch. 4 - Introduction to Machine Learning
    - Intro to ML algorithms -- in this course, we cover KNN (K Nearest Neighbor), Naive Bayes, and ANNs (Artificial Neural Nets) 
    - A *model* is an algorithm paired with training data. The model is used to make predictions on the testing data.

    - Splitting data into *training* and *testing* sets
    - Covers *KNN* algorithm applied to ML example
    - Covers *Cosine Similarity* algorithm. Cosine similarity is a similarity measure that uses the equation: <img src='https://www.dropbox.com/s/oi1ttx99hf0uejn/cosine.png?raw=1'>

*   ### Ch. 5 - spaCy & Naive Bayes
    *   Basic Bayes Theorem

<img src='https://www.dropbox.com/s/efpfgkenlit9rk1/Screenshot%202020-04-23%2009.42.25.png?raw=1' height=200>

The jargony terms for the various pieces are as follows:

   * P(A|B): The *posterior* probability.

   * P(B|A): The *likelihood* or *conditional probability*.

   * P(A): A *marginal* or *prior* probability.

   * P(B): A *marginal* or *prior* probability.



#### A More Complex Version of Bayes Theorem

<img src='https://www.dropbox.com/s/gstzvvtvh9b39o8/bayes.png?raw=1'>

   * O is equivalent to A from above; E is equivalent to B

   * O stands for one of the "classes" being used. In our case, these are our three authors.

   * E stands for the words in each sentence, in our case. More precisely, it's the spacy tokens that meet is_alpha and not is_stop.

   * For instance, P(indefinite|EAP) means "what is the probabilty of seeing the word indefinite in a sentence that EAP wrote?"

   * P(E) is the probabilty of seeing each word in the word bag.

   * P(O) is the probability of a word being from one of the authors.


*   Naive Bayes requires building a 'Word-Bag'


### Ch. 6 - Other Considerations with Naive Bayes
*   The Laplace Smoothing Factor
    - A word may appear in the testing set that was not in the training set, and therefore would not be in our 'word-bag,' and would get assigned a probability of zero - not good!
    - The Laplace smoothing factor adds 1 to each numerator (**it's like adding 1 to all the words in the word-bag**)
    - This creates the problem that each author column will now be 1*|V|, where V is the vocabulary and |V| is the number of words we have in V (i.e., the length of the word-bag). To compensate, divide by |V|
    - The result is you should never get a value of zero for any `P(word|author)
*   Eliminating the denominator
    - The denominator of the Naive Bayes equation can actually be ignored because it doesn't change the relationship between the probability of each outcome (i.e., the probability of which author used a particular word)
*   Avoiding Underflow or The Vanishing Gradient
   - Problem: taking the product of very small numbers gives a number close to zero, and there's a limit to how small a number Python can represent (turns it into a zero instead)
   - Proposed Fix: If a product drops to zero (via underflow), we will set the product to the minimum numerical value that Python can represent - this avoids returning a zero probability from Naive Bayes

### Ch. 7 (part 1) - Text as Vectors 
   - Defining Functions For Subtracting, Adding, and Dividing, Vectors
   - Calculating the Mean of the Columns of a Matrix
   - `ordered_embeddings` function (for calculating euclidean distance between a specific vector and other vectors in a table, and sort)
   - `build_embedding_matrix` function (takes a string and a table as inputs and produces a matrix of values from it)

### Ch. 7 (part 2) - More with spaCy
*   Deriving meaning from text
    - The distributional hypothesis (can derive words' meanings from their contexts)
*   GloVe - Stanford's Global Vectors for Word Representation project
    - A database of 300-Dimensional vectors regarding the contextual meaning for all the words in the `en_core_web_md` dictionary that comes with spaCy
*   Calculating similarity between the meanings of different sentences based on their GloVe vectors

### Ch. 8 - Artificial Neural Nets
*   Another type of ML algorith. A model that has an input layer, a hidden layer, and an output layer.
*   Takes an input vector, multiplies it by (like regression) weights (using the dot-product function), then passes these through an activation function (for non-linear solutions) to get the output
*   Types of activation functions:
    - RELU (rectified linear activation function) -- a piecewise linear function that will output the input directly if the output is positive; otherwise, it will output zero (if the input is negative) -- <img src='https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2018/10/Line-Plot-of-Rectified-Linear-Activation-for-Negative-and-Positive-Inputs.png' height=200>
    - Sigmoid -- produces a value between 0 and 1 -- <img src='https://www.dropbox.com/s/58hr9e4iusnmapc/Screenshot%202020-02-18%2014.02.06.png?raw=1'> <img src='https://www.dropbox.com/s/wdqdl22m2l7jruo/Screenshot%202020-02-18%2014.02.21.png?raw=1'>


### Ch. 9 - ANNs with Numerical Data


### Ch. 10 - CNNs (Convolutional Neural Nets)
*   CNNs on 2D data -- like images!
*   CNNs on 1D data -- like a sentence

<img src='https://missinglink.ai/wp-content/uploads/2019/03/1D-convolutional-example_2x.png' height=300>


# Libraries

In [0]:
import pandas as pd
import numpy as np
import math

# Import uo_puddles library

In [0]:
# Flush the old uo_puddles directory and re-import
!rm -r 'uo_puddles'
my_github_name = 'uo-puddles' # can replace with your account name
clone_url = f'https://github.com/{my_github_name}/uo_puddles.git'
!git clone $clone_url
import uo_puddles.uo_puddles as up

# Store a Data Set in Google Drive

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')
with open('content/gdrive/My Drive/titanic_letters.csv', 'w') as file:
  table_name.to_csv(file, index=False)

# Read the Data Back In

In [0]:
# First Option to Read Data Back that's Been Stored in Google Drive
with open('/content/gdrive/My Drive/word_bag_s20.csv', 'r') as file:
    sorted_word_table = pd.read_csv(file, dtype={'word':str}, encoding='utf-8',
                                    index_col='word', na_filter=False)

In [0]:
# Second Option to Read Data Back that's Been Stored in Google Drive
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQR107nAfeU_z-p6sUv3yhnti9vNsklgXsm2RXAExQBHPUE3APm32qMQxTuYCEBbSz09MCVx-rnOXGb/pub?output=csv'
sorted_word_bag = pd.read_csv(url, dtype={'word':str}, encoding='utf-8',
                                index_col='word', na_filter=False)
sorted_word_bag = sorted_word_bag.rename(index={'TRUE': 'true', 'FALSE': 'false'}) #need this because of bug in reading from url

# String Methods

In [0]:
# Count Number of Time a Character or Phrase Occurs in a String
FF.count('n') # count the number of times a specific character occurs in a string


# Replace a Character or Phrase in a String with Another Character or Phrase
FF.replace('diverged', 'coalesced') # replaces a character or word with a new character or word


# Change Formatting of Characters in a String
FF.capitalize() # capitalizes first character
FF.casefold() # converts string to lower case
FF.lower() # returns lower case version of string
FF.upper() # converts string to upper case
FF.swapcase() # changes lower case characters to upper case and vice versa
FF.title() # converts first character of each word to upper case

# Indexing Characters or Phrases in a String
FF.index('v') # returns index of a character or phrase based on its FIRST instance in the string
FF.rindex('v') # returns index of a character or phrase based on its LAST instance in the string
FF.find('l') # returns the FIRST instance of a specific character by giving its index
FF.rfind('l') # returns the LAST instance in the string where a character is found by giving its index


# Checking Nature of the String
FF.isalnum() # returns True if all characters in the string are alphanumeric
FF.isalpha() # retruns True if all characters in the string are in the alphabet
FF.isdecimal() # returns True if all characters in the string are decimals
FF.isdigit() # returns True if all characters in the string are digits
FF.isidentifier() # returns True if the string is an identifier (a string that contains only alphanumeric characters or underscores; cannot start with a number or contain any spaces)
FF.islower() # returns True if all characters in the string are lower case
FF.isnumeric() # returns True if all characters in the string are numeric
FF.isprintable() # returns True if all characters in the string are printable
FF.isspace() # returns True if all characters in the string are whitespaces
FF.istitle() # returns True if the string follows the rules of a title (all words start with an upper case letter and the rest of the letters in the words are lower case; symbols and numbers are ignored)
FF.isupper() # returns True if all characters in the string are upper case
FF.endswith(",") # returns True/False for whether the string ends with the specified value
FF.startswith("T")


# String Alignment
FF.ljust(36) # returns left-aligned string; requires length of string argument
FF.rjust(36) # returns right-aligned string; requires length of string argument
FF.center(36) # center align a string; requires a length of string argument


# Partition a String into Parts Based on a Given Keyword
FF.partition('diverged') # returns a tuple based on the FIRST instance of the specified word in a string: 1) everything before the specific word, 2) the specified word, 3) everything after the specified word
FF.rpartition('diverged') # same as partition but it bases the separation on the LAST instance of the specified word


# Split a String into Parts Based on a Given Keyword
FF.split('diverged') # Splits the string at the specified separator and returns a list based on FIRST instance of given word
FF.rsplit('diverged') # Splits the string at the specified separator and returns a list based on LAST instance of given word


# Strip Character(s) from Beginning or End of a String
FF_for_lstrip = (',,,,,Two roads diverged in a yellow wood,')
FF_for_lstrip.lstrip(',') # removes specified leading characters from the string
FF_for_lstrip.rstrip(',') # trims from the right (notice no change in sentence occurred for the lstrip version of the string)


# Joining Items in a Dictionary with a Given Separator
Dict = {"name": "John", "country": "Normway"}
Separator = ";"
Separator.join(Dict) # join items in a dictionary into a string using a given separator 


# Other Formatting Changes
FF_for_format = 'Two {object} diverged in a {color} wood,' # the {word} are placeholders that can be modified with the format method
FF_for_format.format(object = "pandas", color = "pink")
  # Inside the placeholders, you can add formatting specifics, some examples:
  ## :< left aligns the result
  ## :e scientific format with a lowercase e
  ## :n number format
FF.encode() # encodes the string; default encoding is UTF-8
FF.expandtabs() # Sets the tab size of the string


# Fill the String with a Sepcified Number of 0 Values at the Beginning
FF.zfill(50) # fill the string with zeros at the beginning until it is a given number of characters long

# Dropping Unwanted Columns from Dataset

In [0]:
drop_list = ['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'] # Columns to drop
trimmed_table = titanic_table.drop(drop_list, axis = 1) # the drop function drops the given columns from the dataset; axis=1 means drop columns (axis=0 drops rows, which is default)

# Convert a Column Variable to a list
### (Need to do this to work with raw text data)

In [0]:
list_of_names = trimmed_table['Name'].tolist() 

# Create a New List from an Old List

In [0]:
new_list = []

for i in range(len(list_of_names)):
  items_old_list = old_list[i]
  new_list_items = len(items_old_list)
  new_list.append(new_list_items)
  

# Add a Column to a Data Set

In [0]:
trimmed_table['Length'] = lengths

# Index a Row from a Table

In [0]:
table_name.iloc[0].tolist() # and also convert the output to a list

# Calculate number of times each letter from the alphabet occurs in names of passengers from the Titanic data set

In [0]:
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

for i in range(len(alphabet)):
  alpha = alphabet[i] # take each letter of the alphabet at a time
  alpha_counts = []   # create an empty list 27 times

  for j in range(len(list_of_names)):
    name = list_of_names[j] # obtain each passenger name 
    lower_name = name.lower() # convert each passenger name to lower case to match the case of the alphabet list
    alpha_count = lower_name.count(alpha) # count the number of times each letter of the alphabet occurs in each name, one at a time
    alpha_counts.append(alpha_count) # add the count for each letter to the empty list for each name and each letter of the alphabet
  trimmed_table[alpha] = alpha_counts # create 27 new columns for each letter of the alphabet populated with the counts for each name

# Calculate Average Number of Words per Sentence in a Text

In [0]:
mystery_poem = '''
Dust always blowing about the town,
Except when sea-fog laid it down,
And I was one of the children told
Some of the blowing dust was gold.

All the dust the wind blew high
Appeared like god in the sunset sky,
But I was one of the children told
Some of the dust was really gold.

Such was life in the Golden Gate:
Gold dusted all we drank and ate,
And I was one of the children told,
"We all must eat our peck of gold"
'''

# First, remove punctuation or other characters that are not part of words
mystery_poem = mystery_poem.replace(',','')
mystery_poem = mystery_poem.replace(':','')
mystery_poem = mystery_poem.replace('"','')

# Split by sentence (using a period to indicate end of each sentence)
poem_sentences = mystery_poem.split('.')
poem_sentences

n = len(poem_sentences)

word_counts = []

for i in range(n):
  sentence = poem_sentences[i]
  words = sentence.split()
  counts = len(words)
  word_counts.append(counts)
print(word_counts)

average = sum(word_counts)/3
average

# Create a Matrix out of Rows with Specified Columns from Original Table

In [0]:
n = len(letters_table) # length of original table (number of rows)

all_passengers = [] # blank new matrix

for i in range(n):
  rows = letters_table.iloc[i].tolist() # convert each row to a list of numbers
  reducing_rows = rows[2:] # only want to keep the third column and on
  all_passengers.append(reducing_rows) # append chosen rows to the new matrix


# Calculate Euclidean Distance Between Two Chosen Cases using `euclidean_distance` function from the UO puddles (up) library

In [0]:
# Either import the uo_puddles library or include the function definition:

from numpy.linalg import norm

def euclidean_distance(vect1:list ,vect2:list) -> float:
  assert isinstance(vect1, list), f'vect1 is not a list but a {type(vect1)}'
  assert isinstance(vect2, list), f'vect2 is not a list but a {type(vect2)}'
  assert len(vect1) == len(vect2), f"Mismatching length for euclidean vectors: {len(vect1)} and {len(vect2)}"
  '''
  sum = 0
  for i in range(len(vect1)):
      sum += (vect1[i] - vect2[i])**2
      
  #could put assert here on result   
  return sum**.5  # I claim that this square root is not needed in K-means - see why?
  '''
  a = np.array(vect1)
  b = np.array(vect2)
  return norm(a-b)

In [0]:
up.euclidean_distance(braund_number_list, allen_number_list)

# Create a Matrix of All the Euclidean Distances Between a Specific Case and the Other Rows in a Table and Sort by Distance

In [0]:
all_distances_matrix = []

for i in range(n):
  passenger_list = all_passengers[i]
  braund = braund_number_list
  distances = up.euclidean_distance(braund, all_passengers[i])
  all_distances_matrix.append([distances, i]) # append new matrix with euclidean distances and index numbers

sorted_distances_matrix = sorted(all_distances_matrix, key = lambda x:x[1])  #sorted will default to using the first item in our 2-item lists


# Look at top 10 most similar cases to the chosen comparison case in sorted table
sorted_distances_matrix[:10]

# Pick out the row for the case most similar (at the top of the sorted table)
letters_table.iloc[477].tolist()  #go back to original table and see


# Write a for loop to get the top four most similar cases from the sorted table
top4 = [477, 482, 804, 806] # Index numbers of top four most similar cases from the sorted table

for i in range(4):
  index = top4[i]  #477, then 482, then 804, then 806
  row = letters_table.iloc[index].tolist() # identify the row from the table of each index and convert to a list
  outcome = row[0]  #The first value in row from the letters_table is the Survived value, 0 (survived) or 1 (didn't survive)
  print(outcome) # Print the survived scores for the four people grabbed and see if they have the same outcome as the comparison case

# The `ordered_distances_matrix` function from the UO puddles library does the same thing done above
   - Meaning it calculates the Euclidean distance between a reference case and all other cases, then sorts the cases from smallest distance to largest distance

In [0]:
def ordered_distances_matrix(target_vector:list, crowd_matrix:list,  dfunc=euclidean_distance) -> list:
  assert isinstance(target_vector, list), f'target_vector not a list but instead a {type(target_vector)}'
  assert isinstance(crowd_matrix, list), f'crowd_matrix not a list but instead a {type(crowd_matrix)}'
  assert all([isinstance(row, list) for row in crowd_matrix]), f'crowd_matrix not a list of lists'
  assert all([len(target_vector)==len(row) for row in crowd_matrix]), f'crowd_matrix has varied row lengths'
  assert callable(dfunc), f'dfunc not a function but instead a {type(dfunc)}'

  distance_list = [(index, dfunc(target_vector, row)) for index, row in enumerate(crowd_matrix)]
  return sorted(distance_list, key=lambda pair: pair[1])

# Import spacy (text analysis library) and a commonly used dictionary

In [0]:
import spacy
!python -m spacy download en_core_web_md # download the dictionary
import en_core_web_md
nlp = en_core_web_md.load()

# Parse a string into words (spaCy 'tokens') using `doc` function

In [0]:
doc = nlp(practice_sentence)

# Convert those spaCy 'tokens' back into strings

In [0]:
for token in doc:
  print(token.text)  #using the text attribute - gives me a string

# Display all text when displaying a table

In [0]:
pd.set_option('display.max_colwidth', None)  #None forces all of sentence to be shown

# Check if a Word is Contained in spaCy's Dictionary

In [0]:
nlp.vocab.has_vector('frankenstein') # Check to make sure word vectors have been loaded

# `get_vec` function

Stanford has a project called Global Vectors for Word Representation (GloVe), which contains 300-dimensional vectors associated with all of the words in spaCy's vocabulary. 

The vector for each word represents its semantic value according to the distributional hypothesis from linguistics - that you can derive the meaning of words based on their contexts.

The get_vec function gets the 300-dimensional vector associated with a word in spaCy's vocabulary and converts the vector to a string.

In [0]:
def get_vec(s:str) -> list:
  return nlp.vocab[s].vector.tolist()

# Compare Similarity of Two Text Items that Are Associated with Numerical Values using Cosine Similarity

In [0]:
up.cosine_similarity(get_vec('dog'), get_vec('puppy')) 

# `ordered_embeddings` function

This function takes a vector and a table and calculates the euclidean distance between the item represented by the vector and the items in each row of the table.



In [0]:
def ordered_embeddings(target_vector, table): # define a new function, called ordered_embeddings, that takes a target_vector and a table as inputs
                                              # the target_vector needs to be a list (convert to a list if it's not)
  names = table.index.tolist() # names is a list of the indexes from the provided table (in this case, would be the animal names)
  ordered_list = [] # the results ordering difference between each animal and the target_animal will be listed in order here
  for i in range(len(names)): # for each animal row
    name = names[i] # name is an interation of each row in the animal table
    row = table.loc[name].tolist() # convert each row to a list
    d = up.euclidean_distance(target_vector, row) # calculate distance between the target_animal and the animal in each other row of the table
    ordered_list.append([d, names[i]]) # fill the ordered_list with the calculated distances and names of each animal
  ordered_list = sorted(ordered_list) # sort the list from lowest to highest distance

  return ordered_list

# `subtractv` function

Subectract two vectors.

In [0]:
def subtractv(x:list, y:list) -> list:
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, list), f"y must be a list but instead if {type(y)}"
  assert len(x) == len(y), f"x and y must be the same length"

  result = [] # blank list to contain results of subtracting each item in x and y
  for i in range(len(x)):
    c1 = x[i]
    c2 = y[i]
    result.append(c1-c2)
  return result

# `addv` function

Add two vectors.

In [0]:
def addv(x:list, y:list) -> list: # define a function, called addv, that takes variables x & y (both lists) as inputs
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, list), f"y must be a list but instead is {type(y)}"
  assert len(x) == len(y), f"x and y must be the same length"

  #result = [(c1 - c2) for c1, c2 in zip(x, y)]  #one-line compact version - called a list comprehension

  result = []
  for i in range(len(x)):
    c1 = x[i]
    c2 = y[i]
    result.append(c1+c2)

  return result

# `dividev` function

Divide a vector by a constant.

In [0]:
def dividev(x:list, y:int) -> list:
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, int), f"y must be an integer but instead is {type(y)}"

  result = []
  for i in range(len(x)): 
    c1 = x[i]
    result.append(c1/y)
  return result 

# `meanv` function

Calculate the mean vector from a matrix.

In [0]:
def meanv(matrix:list) -> list:
  assert isinstance(matrix, list), f"matrix must be a list but instead is {type(x)}"
  assert len(matrix) >=1, f"matrix must have at least one row"

  sumv = matrix[0] # start with the first row
  for row in matrix[1:]: # add each row to the first row, starting with the second row
    sumv = addv(sumv, row) # take the sum of the first+second row, and then this resulting sum plus the third row
  mean = dividev(sumv, len(matrix)) # divide the sum of all the rows by the number of rows
  return mean

In [0]:
# Test the meanv function using matrix A

A = [[0,1], 
     [2,2], 
     [4,3]]

A[0] # [0,1]
A[1] # [2,2]
A[2] # [4,3]
len(A) # 3

# Expected result - note: the numbers in corresponding positions get added using the addv function
## First iteration:
# sumv = A[0] + A[1] = ([0,1]+[2,2]) = [2,3]
## Second iteration:
# sumv = [2,3] + A[2] = ([2,3]+[4,3]) = [6,6]
## mean = [6,6]/3 = [2.0,2.0]

meanv(A) # Finds the mean of each column

[2.0, 2.0]

# `dot` function

Compute the dot-product of two vectors

In [0]:
def dot(vector1: list, vector2:list) -> float:
  assert isinstance(vector1, list), f'vector1 should be a list but is instead a {type(vector1)}'
  assert isinstance(vector2, list), f'vector2 should be a list but is instead a {type(vector2)}'
  assert len(vector1) == len(vector2), f'both vectors should be the same length'

  result = 0
  for i in range(len(vector1)):
    term = vector1[i]*vector2[i]
    result += term
  return result

# Sentence Similarity

## Converting an entire sentence into a single GloVe vector

In [0]:
pilot_sentences = [
  'It was really cold yesterday.',
  'It will be really warm today, though.',
  "It'll be really hot tomorrow!'",
  'Will it be really cool Tuesday?'
]

# For a single sentence
first_sent = pilot_sentences[0]
doc = nlp(first_sent.lower())

vectors = []

for token in doc:
  word = token.text
  vecs = get_vec(word)
  vectors.append(vecs)

s0_average = meanv(vectors)
len(s0_average) # 300
print(s0_average[:10])

# `sent2vec` function

Takes a sentence (a raw string) as an argument and produces the average GloVe vector for it.

If you run across a sentence that adds nothing to the matrix (because it has no legal tokens), return this value: [0.0]*300 - which builds a 300-dimensional matrix of all zeros.

In [0]:
def sent2vec(raw_text: str):
  assert isinstance(raw_text, str), f'raw_text should be string but instead is {type(raw_text)}'

  matrix = [] 
  doc = nlp(raw_text.lower()) 

  for token in doc: 
    if token.is_alpha and not token.is_stop:
      word = token.text # convert each word to a string
      vectors = get_vec(word) # get the 300-D GloVe vector for each word
      matrix.append(vectors) # add each of these 300-D vectors to the blank matrix
      if token.is_stop or not token.is_alpha:
          return ([0.0]*300) # this line needs to be updated
  return meanv(matrix) # return the mean of the matrix

# Finding Similarity among Sentences

*   First, GloVe-ify all the sentences and put in a matrix (in this case, the sentences from Dracula)
*   Then, choose a comparison sentence and use sent2vec to get its GloVe vector
*   Use cosine similarity to compare GloVe vectors of the comparison sentence and all sentences from chosen text 
*   Sort the final result 

In [0]:
drac_matrix = []

for i in range(len(drac_sentences)):  #we defined drac_sentences above
  sentence = drac_sentences[i]
  vec = sent2vec(sentence.text)
  drac_matrix.append(vec)

# Test sentence
test_sentence = "My favorite food is strawberry ice cream."


# Find sentences in Dracula closest to the test sentence using sent2vec
# Based on cosine similarity
input_vec = sent2vec(test_sentence)

ordered_distances = []

for i in range(len(drac_matrix)):  #we defined drac_sentences above
  vec = drac_matrix[i]
  d = up.fast_cosine(np.array(input_vec), np.array(vec))  #using speedier version that relies on numpy
  ordered_distances.append([d, i])

for d,j in sorted(ordered_distances, reverse=True)[:10]:
  print(drac_sentences[j])
  print('=========') # puts this after each sentence so it's easy to tell them apart

# `build_embedding_matrix` function

Takes a string (e.g., a book) and a table (e.g., a table with all the colors represented by their RGB codes) as inputs and produces a matrix of values from it (e.g., the RGB values).

In [0]:
def build_embedding_matrix(raw_text: str, table) -> list:
  assert isinstance(raw_text, str), f'raw_text should be string but instead is {type(raw_text)}'
  assert isinstance(table, pd.core.frame.DataFrame), f'table not a dataframe but instead a {type(table)}'
  assert 'nlp' in globals(), f'This function assumes that the spacy nlp function has been defined'

  matrix = [] # a blank matrix that will be filled
  index_list = table.index.tolist() # convert the index of the table (in this example, color names) to a list
  doc = nlp(raw_text.lower()) # parse the raw text (in this case, the text from a book)

  # short version
  # matrix = [table.loc[token.text].tolist() for token in doc if token.text in index_list]

  for token in doc: # for each word in the parsed book
    word = token.text # convert each word to a string
    if word in index_list: # if any word from the book is also in the list of color names
      matrix.append(table.loc[word].tolist()) # add that word to the blank matrix
  return matrix

# `zip` function

Pairs together values in corresponding column positions from two separate lists.

In [0]:
list_A = [1,2,3]
list_B = [4,5,6]

zipped = list(zip(list_A, list_B))
zipped # [(1, 4), (2, 5), (3, 6)]

# Machine Learning with KNN and/or Cosine Similarity

1. Randomly shuffle the rows of the original data set.
2. Decide the percentage of data you want in the training/testing set. Calculate the sample size needed for each.
3. Split data into training & testing sets.

In [0]:
# Shuffle the rows of the full data set randomly
set_seed = 1234 # to be able to replicate random shuffling
import numpy as np
rsgen = np.random.RandomState(set_seed) # numpy's random number generator

# Use the .sample() method to randomly shuffle the rows of the original table
shuffled_table = letters_table.sample(frac=1, random_state = rsgen) # frac=1 means shuffle the entire table
shuffled_table.head()


# Calculate number of rows for the testing set by dividing the total length of the table by 3 (~33%)
# For this example, we want 30% of the full data set to go into the testing set (aka holdout sample), and 70% to go into the training set
## Because prefer more data to train the model on
n_for_test_set = int(len(letters_table)/3) # using int() rounds the result to the nearest whole number
n_for_test_set 

# Do the same for calculating number of rows for the training set by subtracting total length of the data by number of rows going into testing set
n_for_train_set = len(letters_table) - n_for_test_set # the other 70% goes in the training set
n_for_train_set 



## Split the randomly shuffled table into training & testing sets based on n's above
testing_table = shuffled_table[:n_for_test_set]
testing_table = testing_table.reset_index(drop=True) # reset the indices 

training_table = shuffled_table[n_for_test_set:]
training_table = training_table.reset_index(drop=True)


# Convert the training table into a matrix of lists taking only the third column and on
training_passengers_matrix = []

for i in range(len(training_table)):
  rows = training_table.iloc[i].tolist()
  sliced_rows = rows[2:]
  training_passengers_matrix.append(sliced_rows)

# Store the Scores on the Outcome Variable for the Training Set 
   - Note: We're doing supervised learning for building the ML model since the training set has known scores on the outcome variable (called labels)

In [0]:
training_labels = letters_table['Survived'].tolist() # 'Survived' is the outcome variable (0 = died, 1 = survived)

# Apply the KNN Algorithm to build a model and make predictions about cases in the testing set
   - Using the knn function from the UO puddles library
   - This produces a prediction for our reference person from the testing set based on who they are most similar to from the training set & what the majority outcome of those people was

In [0]:
# up.knn(testing_set_cases, training_set_cases, training_cases_outcomes, value_of_k, similarity_measure)
# k is a hyperparameter specified by the researcher (in this case, 5)

n = len(testing_table)

predictions = []

for i in range(n):
  rows = testing_table.iloc[i].tolist()
  values = rows[2:]
  preds = up.knn(values, training_passengers_matrix, training_labels, 5, 'euclidean')
  predictions.append(preds)
print(predictions)

# Were our predictions correct?

testing_labels = testing_table['Survived'].tolist() # Store actual scores on the outcome variable to a list
cases = list(zip(predictions,testing_labels)) # Pair together predictions with actual scores on outcome variable
print(cases[:10])  #[(1, 1), (1, 0), (0, 0), (1, 1), (0, 0), (1, 0), (0, 0), (1, 0), (0, 1), (0, 0)] # Evaluate which predictions were correct

# The code for the knn function from UO puddles library

In [0]:
def knn(target_vector:list, crowd_matrix:list,  labels:list, k:int, sim_type='euclidean') -> int:
  assert isinstance(target_vector, list), f'target_vector not a list but instead a {type(target_vector)}'
  assert isinstance(crowd_matrix, list), f'crowd_matrix not a list but instead a {type(crowd_matrix)}'

  #assert sim_type in sim_funs, f'sim_type must be one of {list(sim_funs.keys())}.'
    
  if sim_type in ['pearson', 'linear', 'correlation']:
    distance_list = [[index, abs(np.corrcoef(np.array(target_vector), np.array(row))[0][1])] for index,row in enumerate(crowd_matrix)]
    direction = True
  else:
    sim_funs = {'euclidean': [euclidean_distance, False], 'cosine': [cosine_similarity, True]}
    dfunc = sim_funs[sim_type][0]
    distance_list = [[index, dfunc(target_vector, row)] for index,row in enumerate(crowd_matrix)]
    direction = sim_funs[sim_type][1]

  sorted_crowd =  sorted(distance_list, key=lambda pair: pair[1], reverse=direction)  #False is ascending

  #Compute top_k
  top_k = [i for i,d in sorted_crowd[:k]]
  #Compute opinions
  opinions = [labels[index] for index in top_k]
  #Compute winner
  winner = 1 if opinions.count(1) > opinions.count(0) else 0
  #Return winner
  return winner

# Apply the Cosine Similarity Algorithm to build a model and make predictions about cases in the testing set

In [0]:
n = len(testing_table)

predictions = []

for i in range(n):
  rows = testing_table.iloc[i].tolist()
  values = rows[2:]
  preds = up.knn(values, training_passengers_matrix, training_labels, 5, 'cosine')
  predictions.append(preds)
print(predictions)

# Were our predictions correct?

testing_labels = testing_table['Survived'].tolist() # Store actual scores on the outcome variable to a list
cases = list(zip(predictions,testing_labels)) # Pair together predictions with actual scores on outcome variable
print(cases[:10])  #[(1, 1), (1, 0), (0, 0), (1, 1), (0, 0), (1, 0), (0, 0), (1, 0), (0, 1), (0, 0)] # Evaluate which predictions were correct

# Accuracy of ML Model

## Idea of True Positive, False Positive, True Negatives, and False Negatives

In [0]:
print('True positive: ', cases.count((1,1)))  #True positive:  69
print('True negative: ', cases.count((0,0)))  #True negative:  116
print('False positive: ', cases.count((1,0))) #False positive:  68
print('False negative: ', cases.count((0,1))) #False negative:  44

## One way of assessing accuracy is the number of true positive and negatives out of total number of predictions

In [0]:
(69+116)/testing_n  #0.622895622895623

# Other ways of defining accuracy

<img src='https://www.dropbox.com/s/zubecbzi8zsdzgg/confusion_matrix.png?raw=1'>

# Machine Learning with Naive Bayes Algorithm

1. Randomly shuffle the rows of the original data set. Split into a training set (70%) and testing set (30%).
2. Build a "bag of words" from the training table to use with the Naive Bayes Algorithm.
3. Apply Naive Bayes to build a model.
4. Evaluate on testing data.

In [0]:
## 1. Randomly shuffle the rows of gothic_sentences. Split into a training set (70%) and testing set (30%).

set_seed = 1234

import numpy as np
rsgen = np.random.RandomState(set_seed)


# Shuffled Gothic Sentences
shuffled_gothics = gothic_sentences.sample(frac=1, random_state = rsgen).reset_index(drop=True)
len(shuffled_gothics)


# Calculating n's for Testing and Training Tables
n_testing = (len(shuffled_gothics))*.3
n_testing # 5874

n_training = (len(shuffled_gothics)) - n_testing
n_training # 13705


# Training Set
training_table = shuffled_gothics[:13705].reset_index(drop=True)

# Testing Set
testing_table = shuffled_gothics[13705:].reset_index(drop=True)


# Grab the Text and Author Columns from the Training and Testing Sets and Convert into Lists (easier to work with text data this way)
training_text = training_table['text'].tolist()
training_authors = training_table['author'].tolist()

testing_text = testing_table['text'].tolist()
testing_authors = testing_table['author'].tolist()

In [0]:
## 2. Build a "bag of words" from the training table.

# Create an empty dataframe that will be the "word bag"
word_bag = pd.DataFrame(columns=['word','EAP','MWS','HPL']) # build a dataframe with columns for each word in the gothic texts and each authors' name abbreviated
word_bag.head() # currently empty


# See how you can add a row
row0 = ['indefinite',1,0,0]
word_bag.loc[0] = row0
word_bag.head()

## There's an `update_gothic_row` function in UO puddles library that updates rows in an empty word-bag 
*   The update_gothic_row function takes a word and an author. 
*   It first checks to see if the word is already in the table. If it is not, it creates a row for it.
*   It then finds the column that goes with the author and increments the value by 1.

In [0]:
# A function in uo_puddles has been written to add rows the way we did above (up.update_gothic_row)
word_table = pd.DataFrame(columns=['word', 'EAP', 'MWS', 'HPL'])
up.update_gothic_row(word_bag, 'indefinite', 'EAP') # acts as a counter - adds another count underneath the author given for the corresponding word; if the word doesn't exist in the dataframe yet, it adds a new row for that word

# `update_gothic_row` function from UO puddles

In [0]:
def update_gothic_row(word_table, word:str, author:str):
  assert author in word_table.columns.tolist(), f'{author} not found in {word_table.columns.tolist()}'

  word_list = word_table['word'].tolist()
  real_word = word if type(word) == str else word.text

  if real_word in word_list:
    j = word_list.index(real_word)
  else:
    j = len(word_table)
    word_table.loc[j] = [real_word] + [0]*(len(word_table.columns)-1)

  word_table.loc[j, author] += 1

  return word_table

# Create a 'Word-Bag' Using a For Loop that Goes Through All Words

In [0]:
# Reminder of what training_text and training_authors are:
## training_text = training_table['text'].to_list()
## The 'Text' column from the training set (which was assembled from taking 70% of the rows from a shuffled table with the original data)

# training_authors = training_table['author'].to_list()
## The 'Authors' column from the training set

word_table = pd.DataFrame(columns=['word', 'EAP', 'MWS', 'HPL'])  # Empty word-bag

for i in range (len(training_text)): # We're building the word-bag from the training set
  training_sentences = training_text[i].lower() # Each lowercase word in the training_text
  doc = nlp(training_sentences) # Tokenize each word in the training text
  author = training_authors[i] # Take each author name from the training_authors list

  for token in doc: # For each token in doc
    if token.is_alpha and not token.is_stop # Remove cases that are not a letter and only keep cases that are not stop words
      up.update_gothic_row(word_table, token.text, author) # Apply update_gothic_row function to each word

# Sort the Word Bag Alphabetically

In [0]:
sorted_word_table = word_table.sort_values(by=['word'])
sorted_word_table = sorted_word_table.reset_index(drop=True)

# Set a New Index for a Table

In [0]:
sorted_word_table = sorted_word_table.set_index('word')  #set the word column to be the table index
sorted_word_table.head() 

# Now, you can pull a row based on the word you want
sorted_word_table.loc['indefinite'].tolist()  # **Use loc instead of iloc**

# Can also use loc to get a specific column in a row (can't do this with iloc)
sorted_word_table.loc['indefinite', 'EAP']  # give index first then column name to get a cell

# Build a ML Model using Naive Bayes algorithm on the word-bag

In [0]:
result_list = [] # blank list to store the probabilities calculated via Naive Bayes

for i in range(len(testing_text)): 
  sentences = testing_text[i].lower() # take the sentences from the testing set
  
  word_list = [] # another blank list to store the tokenized words from the 
  doc = nlp(sentences) # tokenize the sentences from testing set

  for i in range(len(doc)): # for each token from the testing set
    token = doc[i]
    if token.is_alpha and not token.is_stop:
      word_list.append(token.text) # Add it to the word_list if it's alphabetical and not a stop word

  result = up.bayes_gothic(word_list, sorted_word_table, training_table) # Apply Naive Bayes to each word, comparing each to words in the word-bag and applying the Naive Bayes formula to calculate the probability that that word came from each author
  result_list.append(result) # store the probabilities in the result_list

`bayes_gothic` function from UO puddles

In [0]:
def bayes_gothic(evidence:list, evidence_bag:dframe, training_table:dframe, laplace:float=1.0) -> tuple:
  assert isinstance(evidence, list), f'evidence not a list but instead a {type(evidence)}'
  assert all([isinstance(item, str) for item in evidence]), f'evidence must be list of strings (not spacy tokens)'
  assert isinstance(evidence_bag, pd.core.frame.DataFrame), f'evidence_bag not a dframe but instead a {type(evidence_bag)}'
  assert isinstance(training_table, pd.core.frame.DataFrame), f'training_table not a dataframe but instead a {type(training_table)}'
  assert 'author' in training_table, f'author column is not found in training_table'

  author_list = training_table.author.unique().tolist()
  mapping = ['EAP', 'MWS', 'HPL']
  label_list = [mapping.index(auth) for auth in author_list]
  n_classes = len(set(label_list))
  #assert len(list(evidence_bag.values())[0]) == n_classes, f'Values in evidence_bag do not match number of unique classes ({n_classes}) in labels.'

  word_list = evidence_bag.index.values.tolist()

  evidence = list(set(evidence))  #remove duplicates
  counts = []
  probs = []
  for i in range(n_classes):
    ct = label_list.count(i)
    counts.append(ct)
    probs.append(ct/len(label_list))

  #now have counts and probs for all classes

  #CONSIDER CHANGING TO LN OF PRODUCTS. END UP SUMMING LOGS OF EACH ITEM. AVOIDS UNDERFLOW.
  results = []
  for a_class in range(n_classes):
    numerator = 1
    for ei in evidence:
      if ei not in word_list:
        #did not see word in training set
        the_value =  1/(counts[a_class] + len(evidence_bag))
      else:
        values = evidence_bag.loc[ei].tolist()
        the_value = ((values[a_class]+laplace)/(counts[a_class] + laplace*len(evidence_bag)))
      numerator *= the_value
    #if (numerator * probs[a_class]) == 0: print(evidence)
    results.append(max(numerator * probs[a_class], 2.2250738585072014e-308))

  return tuple(results)

#used week 5 and moved here week 6
def float_mult(number_list: list) -> float:
  assert isinstance(number_list, list), f'number_list should be a list but is instead a {type(number_list)}'
  assert all([isinstance(item, float) for item in number_list]), f'number_list must contain all floats'

  result = 1.
  for number in number_list:  #fancier version of for i in range(n):
    result *= number

  return result

# Make Predictions

In [0]:
authors = ['EAP', 'MWS', 'HPL']

predictions = []

for i in range(len(result_list)):
  result = result_list[i] # Each result (list of probabilities for each author)

  for i in range(len(result)):
    m = max(result) # find the max probability 
    author_index = result.index(m) # get the (numerical?) index (author name) for each
    author = authors[author_index] # find the author name corresponding to each index
    pred = author # prediction is the author predicted for each row based on which author had the max probability for that word

  predictions.append(pred) # fill the blank predictions list with the author predictions

predictions[:10] # Predicted outcomes
testing_authors[:10] # Compare to actual outcomes

# Calculate Accuracy

In [0]:
cases = list(zip(predictions,testing_authors))
print(cases[:10])

print(cases.count(('EAP', 'EAP')))
print(cases.count(('MWS', 'MWS')))
print(cases.count(('HPL', 'HPL')))

accuracy = (2008 + 1467 + 1330)/len(testing_authors)
accuracy

# Heat Map (a way of visualizing accuracy)

In [0]:
up.heat_map(cases, ['EAP', 'MWS', 'HPL'])  #EAP=0, MWS=1, HPL=2

# `heat_map` function from UO puddles

In [0]:
def heat_map(zipped, label_list):
  case_list = []
  for i in range(len(label_list)):
    inner_list = []
    for j in range(len(label_list)):
      inner_list.append(zipped.count((label_list[i], label_list[j])))
    case_list.append(inner_list)


  fig, ax = plt.subplots(figsize=(10, 10))
  ax.imshow(case_list)
  ax.grid(False)
  ax.set_xlabel('Predicted outputs', fontsize=32, color='black')
  ax.set_ylabel('Actual outputs', fontsize=32, color='black')
  ax.xaxis.set(ticks=range(len(label_list)))
  ax.yaxis.set(ticks=range(len(label_list)))
  
  for i in range(len(label_list)):
      for j in range(len(label_list)):
          ax.text(j, i, case_list[i][j], ha='center', va='center', color='white', fontsize=32)
  plt.show()
  return None

# Artificial Neural Nets (ANNs)

## Calculate an ANN from scratch

1. First, specify the input values & weights on the input values. Calculate the dot-product for these.
2. Apply an activiation function to the result from 1.

In [0]:
# Specify the input values and the weights on the input values
inputs = [.002, -.09, .6] # values coming from a previous layer
weights = [.5, .4, -.2] # weights on those values

# Calculate the dot-product of the weights and input values
z = dot(weights, inputs)
z

In [0]:
# Defining the activation functions

## Sigmoid function
def sigmoid(t:float) -> float:
  s = 1 / (1 + math.exp(-t)) # e to the -t power
  return s

## RELU function
def relu(t:float) -> float:
  result = max(t, 0.0)
  return result

In [0]:
# Apply the sigmoid function to the dot-product, z
sigmoid(z)

In [0]:
# Apply the RELU function to the dot-product, z.
relu(z)

# `neuron_output` function

A single function for computing the output node (using the sigmoid activation function).

In [0]:
def neuron_output(weights:list, inputs:list) -> float:
  assert isinstance(weights, list), f'weights should be a list but is instead a {type(weights)}'
  assert isinstance(inputs, list), f'inputs should be a list but is instead a {type(inputs)}'
  assert len(weights) == len(inputs), f'weights and inputs should be the same length'

  z = dot(weights, inputs)
  s = sigmoid(z)
  return s

In [0]:
# Test the neuron_output function out - it takes inputs and weights as arguments
neuron_output(weights, inputs)

# A feedforward function

Takes as arguments the set of weights in a network and input values.

Outputs the final result (i.e., prediction).

<img src='https://codingvision.net/imgs/posts/c-backpropagation-tutorial-xor/1.png'>

<img src='https://www.dropbox.com/s/fvko9fo71pp1cpr/Screenshot%202020-02-21%2009.14.37.png?raw=1'>


Even though there are three layers shown above, the input layer is implied.
- Only have to deal with two layers: the hidden and output

#### Choosing the weights

It's up to the researcher. One way is to use a random distribution of weights between -1 and 1. 

So for the first hidden node, we will have a list of two weights (notice two weights feed into it in the image above).

<pre>
hidden1 = [rdist1, rdist2]
</pre>

`rdist1` and `rdist2` are random numbers falling in a uniform distribution. We'll need the same for hidden2 and for the output node.


In [0]:
# Getting random numbers to use as the weights for the hidden nodes & output node
np.random.seed(1234)

hidden1 = list(np.random.uniform(-1,1,2)) # Create a list of 2 random items taken from a uniform distribution between -1 and 1
hidden2 = list(np.random.uniform(-1,1,2))
output = list(np.random.uniform(-1,1,2))


# Combine the weights into a single object - a list of lists

xor_network = [[hidden1, hidden2], [output]] # the weights for each node in the hidden layer are contained in a separate list from the weights for the output node

print(len(xor_network)) # 2 (there are two separate lists)
print(xor_network)

<img src='https://www.dropbox.com/s/a8s43c314op5qg8/Screenshot%202020-05-13%2015.11.06.png?raw=1' height=500>

# `layer_output` function

*   First, define the function `layer_output`.
*   The `layer` parameter below is something like xor_network[0] or xor_network[1]
*   The `inputs` are values from the preceding layer.
*   You will use the 'create a new list from an old list' gist and the `neuron_output` function.

In [0]:
 def layer_output(layer:list, inputs:list) -> list:
  assert isinstance(layer, list), f'layer must be a list but is a {type(layer)}'
  assert all([isinstance(item, list) for item in layer]), f'layer must be a list of lists'
  assert isinstance(inputs, list), f'inputs must be a list but is a {type(inputs)}'

  new_list = []

  for i in range(len(layer)):
    item = layer[i]  
    output = neuron_output(item, inputs)

    new_list.append(output)

  return new_list

# `feed_forward` function

Create a function called `feed_forward` that takes the entire network in as a parameter (instead of each layer in steps), as well as the initial input to the network. 

It will go through each layer calling `layer_output`.

In [0]:
def feed_forward(neural_network:list, input_vector:list) -> float:

  outputs = []

  for i in range(len(neural_network)):
    layer = neural_network[i]  #layer
    output = layer_output(layer, input_vector) # want to use the output from this as the input for next layer_output
    outputs.append(output)
    final_output = layer_output(layer, outputs[0])

  return final_output

# Convolutional Neural Nets (CNNs)

## on 1D data

In [0]:
# There are standard settings used to load the data and split into training & testing sets:

max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 2

from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

### Build the 300-D vectors 

In [0]:

print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(Dropout(0.2))

# we add a Convolution1D, which will learn
# word group filters of size filter_length:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use max pooling:
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


# Model Summary

In [0]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))


# Accuracy

In [0]:
score, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test accuracy:', acc)