<a href="https://colab.research.google.com/github/saralieber/CS_Studio/blob/master/DataSciSeminar_ListOfFunctions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Things to know for the midterm:


*   sent2vec & any functions that sent2vec relies on
*   build a matrix of responses from the full dataset
*   call sent2vec on each response in the matrix and store them in a new vector that you add to the matrix
*   accuracy / epochs
*   ANN - all steps of performing an ANN
*   cross-validation

# Import uo_puddles library

In [0]:
# Flush the old uo_puddles directory and re-import
!rm -r 'uo_puddles'
my_github_name = 'uo-puddles' # can replace with your account name
clone_url = f'https://github.com/{my_github_name}/uo_puddles.git'
!git clone $clone_url
import uo_puddles.uo_puddles as up

# Import spacy (text analysis library) and a commonly used dictionary

In [0]:
import spacy
!python -m spacy download en_core_web_md # download the dictionary
import en_core_web_md
nlp = en_core_web_md.load()

# `ordered_embeddings` function

This function takes a target vector and a table, calculates the Euclidean distance between the target vector and each row of the table, and returns a list of `[distance, row name]` pairs sorted from nearest to farthest.



In [0]:
def ordered_embeddings(target_vector, table): # takes a target_vector (a list) and a table (a DataFrame) as inputs
                                              # the target_vector needs to be a list (convert to a list if it's not)
  names = table.index.tolist() # list of the index labels from the provided table (in this case, the animal names)
  ordered_list = [] # will hold one [distance, name] pair per row of the table
  for i in range(len(names)): # for each animal row
    name = names[i] # name of the current row in the animal table
    row = table.loc[name].tolist() # convert the row to a list
    d = up.euclidean_distance(target_vector, row) # distance between the target vector and the animal in this row
    ordered_list.append([d, name]) # store the calculated distance together with the animal's name
  ordered_list = sorted(ordered_list) # sort the list from lowest to highest distance

  return ordered_list
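
To see the ordering without the course library, here is a minimal sketch. It assumes `up.euclidean_distance` computes the standard Euclidean distance; the local `euclidean_distance` below is a stand-in, not the library's code, and the animal table and the name `ordered_embeddings_demo` are made up for illustration.

```python
import math
import pandas as pd

def euclidean_distance(a: list, b: list) -> float:
    # Stand-in (assumption) for up.euclidean_distance:
    # square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ordered_embeddings_demo(target_vector: list, table: pd.DataFrame) -> list:
    # Same logic as ordered_embeddings, using the local stand-in distance
    pairs = [[euclidean_distance(target_vector, table.loc[name].tolist()), name]
             for name in table.index.tolist()]
    return sorted(pairs)  # lowest distance first

# Hypothetical table: each row is an animal, each column a made-up feature
animals = pd.DataFrame(
    {'size': [1, 4, 9], 'speed': [2, 6, 3]},
    index=['mouse', 'dog', 'horse'])

ranking = ordered_embeddings_demo([1, 2], animals)
print([name for _, name in ranking])  # ['mouse', 'dog', 'horse']
```

Because each entry is a `[distance, name]` list, `sorted` compares distances first and only falls back to comparing names when two distances are equal.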

# `subtractv` function

Subtract two vectors, element by element.

In [0]:
def subtractv(x:list, y:list) -> list:
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, list), f"y must be a list but instead is {type(y)}"
  assert len(x) == len(y), f"x and y must be the same length"

  result = [] # blank list to contain results of subtracting each item in x and y
  for i in range(len(x)):
    c1 = x[i]
    c2 = y[i]
    result.append(c1-c2)
  return result
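
As a quick check of element-wise subtraction, the loop above can also be written as a list comprehension over `zip`. This is only a sketch; the name `subtractv_compact` is hypothetical and not part of the course code.

```python
def subtractv_compact(x: list, y: list) -> list:
    assert len(x) == len(y), "x and y must be the same length"
    # zip pairs up the items in corresponding positions of x and y
    return [c1 - c2 for c1, c2 in zip(x, y)]

diff = subtractv_compact([5, 7, 9], [1, 2, 3])
print(diff)  # [4, 5, 6]
```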

# `addv` function

Add two vectors.

In [0]:
def addv(x:list, y:list) -> list: # define a function, called addv, that takes variables x & y (both lists) as inputs
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, list), f"y must be a list but instead is {type(y)}"
  assert len(x) == len(y), f"x and y must be the same length"

  #result = [(c1 + c2) for c1, c2 in zip(x, y)]  #one-line compact version - called a list comprehension

  result = []
  for i in range(len(x)):
    c1 = x[i]
    c2 = y[i]
    result.append(c1+c2)

  return result

# `dividev` function

Divide a vector by a constant.

In [0]:
def dividev(x:list, y:int) -> list:
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, int), f"y must be an integer but instead is {type(y)}"

  result = []
  for i in range(len(x)): 
    c1 = x[i]
    result.append(c1/y)
  return result 

# `meanv` function

Calculate the mean vector from a matrix (the column-wise average of its rows).

In [0]:
def meanv(matrix:list) -> list:
  assert isinstance(matrix, list), f"matrix must be a list but instead is {type(matrix)}"
  assert len(matrix) >=1, f"matrix must have at least one row"

  sumv = matrix[0] # start with the first row
  for row in matrix[1:]: # add each row to the first row, starting with the second row
    sumv = addv(sumv, row) # take the sum of the first+second row, and then this resulting sum plus the third row
  mean = dividev(sumv, len(matrix)) # divide the sum of all the rows by the number of rows
  return mean

In [9]:
# Test the meanv function using matrix A

A = [[0,1], 
     [2,2], 
     [4,3]]

A[0] # [0,1]
A[1] # [2,2]
A[2] # [4,3]
len(A) # 3

# Expected result - note: the numbers in corresponding positions get added using the addv function
## First iteration:
# sumv = A[0] + A[1] = ([0,1]+[2,2]) = [2,3]
## Second iteration:
# sumv = [2,3] + A[2] = ([2,3]+[4,3]) = [6,6]
## mean = [6,6]/3 = [2.0,2.0]

meanv(A) # Finds the mean of each column

[2.0, 2.0]
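
The same column-wise mean can be sketched without `addv` and `dividev` by transposing the matrix with `zip(*matrix)`. This is an alternative formulation for comparison, not the version used in the course; `meanv_columns` is a hypothetical name.

```python
def meanv_columns(matrix: list) -> list:
    assert len(matrix) >= 1, "matrix must have at least one row"
    n = len(matrix)
    # zip(*matrix) groups the items of each column together,
    # so each column can be summed and divided by the number of rows
    return [sum(column) / n for column in zip(*matrix)]

A = [[0, 1],
     [2, 2],
     [4, 3]]

print(meanv_columns(A))  # [2.0, 2.0]
```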

# `build_embedding_matrix` function

Takes a string (e.g., the text of a book) and a table (e.g., a table of color names with their RGB codes) as inputs. For each word in the text that also appears in the table's index, it appends that row's values (e.g., the RGB values) to a matrix, and returns the matrix.

In [0]:
import pandas as pd # needed for the DataFrame type check below

def build_embedding_matrix(raw_text: str, table) -> list:
  assert isinstance(raw_text, str), f'raw_text should be string but instead is {type(raw_text)}'
  assert isinstance(table, pd.core.frame.DataFrame), f'table not a dataframe but instead a {type(table)}'
  assert 'nlp' in globals(), f'This function assumes that the spacy nlp function has been defined'

  matrix = [] # a blank matrix that will be filled
  index_list = table.index.tolist() # convert the index of the table (in this example, color names) to a list
  doc = nlp(raw_text.lower()) # parse the raw text (in this case, the text from a book)

  # short version
  # matrix = [table.loc[token.text].tolist() for token in doc if token.text in index_list]

  for token in doc: # for each token in the parsed book
    word = token.text # get the token's text as a string
    if word in index_list: # if the word from the book is also in the list of color names
      matrix.append(table.loc[word].tolist()) # add that word's row of values to the matrix
  return matrix
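
Here is a minimal sketch of the same idea that runs without spaCy. It assumes a made-up three-color table and swaps the `nlp` parser for a plain whitespace split, so unlike spaCy it will not separate punctuation from words; `build_matrix_demo` is a hypothetical name.

```python
import pandas as pd

# Hypothetical color table: index is the color name, columns are RGB values
colors = pd.DataFrame(
    {'r': [255, 0, 0], 'g': [0, 128, 0], 'b': [0, 0, 255]},
    index=['red', 'green', 'blue'])

def build_matrix_demo(raw_text: str, table: pd.DataFrame) -> list:
    # Stand-in tokenizer (assumption): split on whitespace instead of calling nlp
    index_set = set(table.index.tolist())
    return [table.loc[word].tolist()
            for word in raw_text.lower().split()
            if word in index_set]

matrix = build_matrix_demo('The red barn sat in a green field under a Blue sky', colors)
print(matrix)  # [[255, 0, 0], [0, 128, 0], [0, 0, 255]]
```

Words that are not in the table's index (here, everything except the three color names) contribute nothing to the matrix.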

# `get_vec` function

Stanford has a project called Global Vectors for Word Representation (GloVe), which represents words as numeric vectors. Here, each word in spaCy's vocabulary is associated with a 300-dimensional vector.

The vector for each word represents its semantic value according to the distributional hypothesis from linguistics - that you can derive the meaning of words based on their contexts.

The `get_vec` function gets the 300-dimensional vector associated with a word in spaCy's vocabulary and converts the vector to a list.

In [0]:
def get_vec(s:str) -> list:
  return nlp.vocab[s].vector.tolist()

# `sent2vec` function

Takes a sentence (a raw string) as an argument and produces the average GloVe vector for it.

If you run across a sentence that adds nothing to the matrix (because it has no legal tokens), return this value: `[0.0]*300`, which builds a 300-dimensional vector of all zeros.

In [0]:
def sent2vec(raw_text: str):
  assert isinstance(raw_text, str), f'raw_text should be string but instead is {type(raw_text)}'

  matrix = [] # a blank matrix that will be filled
  doc = nlp(raw_text.lower()) # parse the raw text (in this case, a sentence)

  for token in doc: # for each token in the parsed sentence
    word = token.text # get the token's text as a string
    vector = get_vec(word) # look up the 300-dimensional vector for the word
    matrix.append(vector)
  if len(matrix) == 0: # nothing was added (e.g., an empty string), so return all zeros
    return [0.0]*300
  return meanv(matrix)
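
The averaging and the all-zeros fallback described above can be sketched with a toy vocabulary, which also shows the midterm step of calling the function on each response. Everything here is an assumption for illustration: `TOY_VECTORS` uses made-up 3-dimensional vectors in place of the 300-dimensional spaCy ones, a whitespace split stands in for `nlp`, and a "legal token" is taken to mean a word that has a vector in the toy vocabulary.

```python
# Hypothetical 3-dimensional embeddings standing in for the 300-dimensional vectors
TOY_VECTORS = {
    'good': [1.0, 0.0, 2.0],
    'bad': [-1.0, 0.0, 0.0],
    'movie': [0.0, 3.0, 1.0],
}
DIM = 3  # would be 300 with the real vectors

def toy_sent2vec(raw_text: str) -> list:
    # Keep only the words that have a vector in the toy vocabulary
    matrix = [TOY_VECTORS[w] for w in raw_text.lower().split() if w in TOY_VECTORS]
    if not matrix:
        # No usable tokens: return the all-zeros vector instead of averaging
        return [0.0] * DIM
    # Column-wise mean of the collected word vectors
    return [sum(column) / len(matrix) for column in zip(*matrix)]

# Call the function on each response and collect the results in a new list
responses = ['Good movie', 'bad', '']
response_vectors = [toy_sent2vec(r) for r in responses]

for response, vector in zip(responses, response_vectors):
    print(repr(response), vector)
```

With a real dataset, the list of response vectors could then be stored alongside the responses, one vector per row.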