# Intro to Artificial Intelligence with Python

## Part VII - Language

Harvard's CS50 Introduction to Artificial Intelligence with Python is an online course that I took in the spring of 2020. It consisted of 6 lectures, and I have a notebook for each. Each lecture had 2 projects, which are located in the projects folder in the same directory as this notebook.

[Course Link](https://cs50.harvard.edu/ai/)

[Lecture Link](https://www.youtube.com/watch?v=_hAVVULrZ0Q&list=PLhQjrBD2T382Nz7z1AEXmioc27axa19Kv&index=8)

In [1]:
import csv
import itertools
import re
import math
import nltk
import os
import sys
from collections import Counter

---
## Natural Language Processing (NLP)
Natural language processing is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve:

* Automatic Summarization
* Information Extraction
* Language Identification
* Machine Translation
* Named Entity Recognition
* Speech Recognition
* Text Classification
* Word Sense Disambiguation


**Syntax** - In linguistics, syntax is the set of rules, principles, and processes that govern the structure of sentences in a given language.

**Formal Grammar** - a system of rules for generating sentences in a language

**Context-Free Grammar** - a formal grammar that abstracts a sentence away from its meaning and describes only its structure: each word is replaced by a symbol for its syntactic category, and rewriting rules state how those symbols combine into phrases and sentences

**Context-Free Grammar Examples:**
* Initial sentence: she saw the city.
* she ---> represented by N for Noun
* saw ---> represented by V for Verb
* the ---> represented by D for Determiner (the, a, an)
* city ---> represented by N for Noun
* Final sentence structure: N-V-D-N
* N, V, and D are known as non-terminal symbols; terminal symbols are the actual words

Each non-terminal symbol is defined by a rule that lists the words it can be replaced with. For example:

* N ---> she | city | car | Harry | ...
* D ---> the | a | an | ...
* V ---> saw | ate | walked | ...
* P (preposition) ---> to | on | over | ...
* ADJ ---> blue | busy | old | ...

There are also non-terminal symbols that stand for phrases, i.e. sequences of other non-terminal symbols:

* NP (Noun Phrase) ---> N | D N, meaning either a Noun or a Determiner followed by a Noun
* An example of a noun phrase: the(D) city(N)
* VP (Verb Phrase) ---> V | V NP, meaning either a Verb or a Verb followed by a Noun Phrase, e.g. saw(V) the(D) city(N)
* S (Sentence) ---> NP VP, meaning a noun phrase followed by a verb phrase
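
To make the rules above concrete, here is a minimal sketch in plain Python (no nltk) that expands them into every sentence they can generate. The `GRAMMAR` dictionary and the `expand` helper are my own illustrative names, and the word lists are cut down to a few entries so the result stays small.

```python
from itertools import product

# Each non-terminal maps to its alternative right-hand sides.
# Anything that is not a key in this dictionary is a terminal (an actual word).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "N": [["she"], ["city"], ["car"]],
    "D": [["the"], ["a"]],
    "V": [["saw"], ["walked"]],
}


def expand(symbol):
    """Return every word sequence that can be derived from symbol."""
    if symbol not in GRAMMAR:
        return [[symbol]]  # terminal: the word itself
    results = []
    for production in GRAMMAR[symbol]:
        # Expand each symbol on the right-hand side, then combine the options
        parts = [expand(s) for s in production]
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results


sentences = {" ".join(words) for words in expand("S")}
print(len(sentences))
print("she saw the city" in sentences)
```

This only works because the toy grammar has no recursive rules; a grammar with a rule like NP ---> AP NP can generate infinitely many sentences, so it could not be enumerated this way.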

Taking all of the above syntax rules, a sentence structure could look like this:

<img src='data/lan1.png'>

The image above represents a parse tree. Note that the top level is the sentence itself (S), which splits into a Noun Phrase (NP), in this case just the single noun 'she', and a Verb Phrase (VP). The VP is composed of a single verb (V) and a noun phrase (NP), which contains a Determiner (D) and a Noun (N).

**Parse Tree (syntax tree)** - A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term parse tree itself is used primarily in computational linguistics; in theoretical syntax, the term syntax tree is more common.

Structures like parse trees are practical when the vocabulary is small and fixed, because every word the parser may encounter has to be listed in the grammar.

**nltk (Natural Language Toolkit)** - a Python library designed to handle basic NLP tasks such as parsing with a context-free grammar


## Context-Free Grammar Examples using nltk library

In [5]:
# This example parses a sentence into a syntax tree using nltk. Note that
# the bank of words is pre-set in the grammar.

grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> D N | N
    VP -> V | V NP

    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)

sentence = input("Sentence: ").split()
try:
    for tree in parser.parse(sentence):
        tree.pretty_print()
        tree.draw()
except ValueError:
    print("No parse tree possible.")

Sentence: she saw a car
         S             
  _______|___           
 |           VP        
 |    _______|___       
 NP  |           NP    
 |   |        ___|___   
 N   V       D       N 
 |   |       |       |  
she saw      a      car



In [11]:
# This example contains more complex rules than the one above
# (AP - Adjective Phrases, PP - Prepositional Phrases).
# Only the first parse tree found is shown (see the break below).

grammar = nltk.CFG.fromstring("""
    S -> NP VP

    AP -> A | A AP
    NP -> N | D NP | AP NP | N PP
    PP -> P NP
    VP -> V | V NP | V NP PP

    A -> "big" | "blue" | "small" | "dry" | "wide"
    D -> "the" | "a" | "an"
    N -> "she" | "city" | "car" | "street" | "dog" | "binoculars"
    P -> "on" | "over" | "before" | "below" | "with"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)

sentence = input("Sentence: ").split()
try:
    for tree in parser.parse(sentence):
        tree.pretty_print()
        tree.draw()
        break
except ValueError:
    print("No parse tree possible.")

Sentence: she saw the big city with binoculars
     S                                      
  ___|_______                                
 |           VP                             
 |    _______|_________________              
 |   |       NP                |            
 |   |    ___|___              |             
 |   |   |       NP            PP           
 |   |   |    ___|___      ____|______       
 NP  |   |   AP      NP   |           NP    
 |   |   |   |       |    |           |      
 N   V   D   A       N    P           N     
 |   |   |   |       |    |           |      
she saw the big     city with     binoculars



---

## n-grams
A contiguous sequence of n items from a sample of text, where n is the number of items (characters or words) in each sequence

**Character n-gram** - a contiguous sequence of n characters from a sample of text

**Word n-gram** - a contiguous sequence of n words from a sample of text

**Unigram** - a contiguous sequence of 1 item from a sample of text (where n=1)

**Bigram** - a contiguous sequence of 2 items from a sample of text (where n=2)

**Trigram** - a contiguous sequence of 3 items from a sample of text (where n=3)

An example of a trigram would be:
* Initial sentence: 'How often have I said to you to stop that?'
* Trigram 1: How often have
* Trigram 2: often have I
* Trigram 3: have I said, and so on

Breaking sentences down into smaller sections makes it easier for an AI to identify structure and meaning.
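
As a rough sketch of the idea, n-grams can be extracted with a sliding window over any sequence. The `ngrams` helper below is my own stand-in written for illustration, not the nltk function of the same name; it works on a list of words as well as on the characters of a string.

```python
def ngrams(items, n):
    """Return all contiguous n-item tuples from a sequence."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word trigrams from the example sentence (punctuation left out for simplicity)
words = "How often have I said to you to stop that".split()
trigrams = ngrams(words, 3)
print(trigrams[:3])

# Character bigrams: a string is just a sequence of characters
char_bigrams = ngrams("city", 2)
print(char_bigrams)
```

A sequence of length m yields m - n + 1 n-grams, so the 10-word sentence produces 8 trigrams.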

---
## Tokenization
The task of splitting a sequence of characters into pieces (tokens). For example, Python's built-in str.split() method tokenizes a sentence on whitespace and returns the tokens in a list.
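
A quick sketch of why plain whitespace splitting is often not enough: punctuation stays glued to the words. The regular expression used here is a deliberately crude assumption for illustration (lowercase letters only); it is not how nltk.word_tokenize works internally.

```python
import re

text = "She saw the city. It was big!"

# Splitting on whitespace keeps punctuation attached to the tokens
whitespace_tokens = text.split()
print(whitespace_tokens)

# A crude alternative: lowercase the text and keep only runs of letters
word_tokens = re.findall(r"[a-z]+", text.lower())
print(word_tokens)
```

With the whitespace approach 'city.' and 'city' would be counted as two different tokens, which is why the examples in this notebook rely on a proper tokenizer and filter out tokens without letters.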

### The tokenization example below examines 17 Sherlock Holmes stories and locates the most common n-grams across all of them combined

In [20]:
def main():
    """Print the most common n-grams in a corpus of documents."""
    print("Loading data...")

    # Use trigrams (n = 3) over the Sherlock Holmes corpus
    n = 3
    corpus = load_data('data/holmes')

    # Compute n-grams
    ngrams = Counter(nltk.ngrams(corpus, n))

    # Print most common n-grams
    for ngram, freq in ngrams.most_common(10):
        print(f"{freq}: {ngram}")


def load_data(directory):
    contents = []

    # Read all files and extract words
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename)) as f:
            contents.extend([
                word.lower() for word in
                nltk.word_tokenize(f.read())
                if any(c.isalpha() for c in word)
            ])
    return contents

main()

Loading data...
62: ('it', 'was', 'a')
61: ('one', 'of', 'the')
53: ('i', 'think', 'that')
47: ('out', 'of', 'the')
47: ('that', 'he', 'was')
46: ('there', 'was', 'a')
43: ('it', 'is', 'a')
42: ('that', 'i', 'was')
42: ('that', 'it', 'was')
42: ('that', 'he', 'had')


---
**The n-grams above are useful for prediction: knowing the most common trigrams, one could feasibly predict the third word when the first two are seen together.**
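
Here is a minimal sketch of that prediction idea on a tiny made-up word list (not the Holmes corpus). `build_predictor` and `predict` are hypothetical helper names of my own; the approach simply counts which word follows each pair of words and returns the most frequent one.

```python
from collections import Counter, defaultdict


def build_predictor(tokens):
    """Map each pair of consecutive words to a Counter of the words that follow."""
    followers = defaultdict(Counter)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        followers[(first, second)][third] += 1
    return followers


def predict(followers, first, second):
    """Return the most frequent third word, or None if the pair was never seen."""
    options = followers.get((first, second))
    if not options:
        return None
    return options.most_common(1)[0][0]


# Made-up training text for illustration only
tokens = "it was a dark night and it was a long road and it was the end".split()
followers = build_predictor(tokens)

print(predict(followers, "it", "was"))
print(predict(followers, "never", "seen"))
```

In this toy text 'it was' is followed by 'a' twice and by 'the' once, so 'a' is the prediction.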

## Markov models can be used for predicting words
In the example below, a Python library called markovify is used to generate random sentences that sound like Shakespeare (but that do not actually appear in his works). The complete works of Shakespeare are used as the training text. The model simply predicts which words are likely to come next based on the prior words (n-grams), so the sentences don't necessarily have any meaning.

In [24]:
import markovify

# Read text from file
with open('data/shakespeare.txt') as f:
    text = f.read()

# Train model
text_model = markovify.Text(text)

# Generate sentences
print()
for i in range(5):
    print(text_model.make_sentence())
    print()


I never Deny your asking: take your way to give breathing to his wish?

If that the lips of those broils, Beginning in the incursions, thou strikest as slow as another.

Come, cordial and not mean to look upon my sin.

Have you a good memory And witness of their death.

Not knowing what they swear in both the worlds suffer, Ere we depart we'll share a bounteous time In different beds of flow'rs: Love-thoughts lie rich when canopied with bow'rs.
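
To show the mechanism without the library, here is a much-simplified sketch of a Markov chain text generator. It is not how markovify is implemented: I assume a state of just one word, and `build_chain` and `generate` are illustrative names of my own.

```python
import random
from collections import defaultdict


def build_chain(tokens):
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Walk the chain at random, stopping early if a word has no followers."""
    words = [start]
    while len(words) < length and chain.get(words[-1]):
        # Followers seen more often appear more often in the list,
        # so they are proportionally more likely to be picked
        words.append(rng.choice(chain[words[-1]]))
    return words


tokens = "to be or not to be that is the question".split()
chain = build_chain(tokens)

rng = random.Random(0)  # fixed seed so the run is repeatable
sentence = generate(chain, "to", 8, rng)
print(" ".join(sentence))
```

Every pair of neighbouring words in the output occurs somewhere in the training text, but the sentence as a whole usually does not, which is the same effect seen in the Shakespeare output above.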



---
## Text Categorization
Text categorization assigns a piece of text to a category. It is useful for determining sentiment (i.e. positive or negative); product reviews are an example of text where categorization can be used.


### Bag-Of-Words Model
A model that represents text as an unordered collection of words: syntax and word order are ignored, and only the words themselves are taken into account

### Naive Bayes Model (based on Bayes' Rule from probability)
For text classification we might want to know the probability that a specific piece of text is positive (P(Positive)) or negative (P(Negative)). The evidence used to decide between the two is the specific words that appear in the text.

Naive Bayes Model Example:
* P(positive | 'my grandson loved it'): given the sentence 'my grandson loved it', what is the probability that the message is positive?
* The bag-of-words approach treats the sentence as just a collection of words: P(positive | 'my', 'grandson', 'loved', 'it')
* Bayes' Rule can then be applied like so (see the Uncertainty lecture for more info on Bayes' Rule):

**P('my', 'grandson', 'loved', 'it' | positive) * P(positive) / P('my', 'grandson', 'loved', 'it')**

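The denominator is the same for every category, so it can be dropped and the results normalized at the end. Making the 'naive' assumption that the words are independent of each other given the category, the numerator can then be computed by multiplying the probability of the category by the probability of each individual word given that category, like so: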


### Explained well 58 minutes into lecture
**P(positive) * P('my' | positive) * P('grandson' | positive) * P('loved' | positive) * P('it' | positive)**

* P(positive) = number of positive samples / total number of samples
* P('loved' | positive) = number of positive samples containing 'loved' / number of positive samples
* All the other probabilities are calculated the same way

**Additive Smoothing** - adding a value $\alpha$ to each value in our distribution to smooth the data (handles words that have 0 probability in a data set)

**Laplace Smoothing** - adding 1 to each value in our distribution: pretending we've seen each value one more time than we actually have (prevents multiplication by 0)
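
The calculation and the effect of smoothing can be sketched by hand on a tiny made-up data set. Everything here is an assumption for illustration: the five samples are invented, `classify` is my own helper (not the nltk classifier used further down), and only the words present in the query are multiplied in. Since each word has two possible outcomes (present or absent), $\alpha$ is added twice in the denominator.

```python
# Hypothetical toy training data: each sample is the set of words in one review
SAMPLES = {
    "positive": [{"my", "grandson", "loved", "it"}, {"loved", "it"}, {"great", "fun"}],
    "negative": [{"it", "was", "bad"}, {"my", "grandson", "hated", "it"}],
}


def classify(words, alpha=0):
    """Return normalized P(label | words) under the naive Bayes assumption."""
    total = sum(len(docs) for docs in SAMPLES.values())
    scores = {}
    for label, docs in SAMPLES.items():
        score = len(docs) / total  # P(label)
        for word in words:
            count = sum(word in doc for doc in docs)
            # P(word | label), with alpha added to both outcomes of the word
            score *= (count + alpha) / (len(docs) + 2 * alpha)
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


query = ["my", "grandson", "loved", "it"]

unsmoothed = classify(query)         # 'loved' never appears in a negative sample
smoothed = classify(query, alpha=1)  # Laplace smoothing

print(unsmoothed)
print(smoothed)
```

Without smoothing, the single unseen word 'loved' forces the negative probability to exactly 0 no matter what the other words say; with Laplace smoothing the negative category keeps a small but non-zero probability.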



In [19]:
def main(filename):

    # Read data from files
    positives, negatives = load_data(filename)

    # Create a set of all words
    words = set()
    for document in positives:
        words.update(document)
    for document in negatives:
        words.update(document)

    # Extract features from text
    training = []
    training.extend(generate_features(positives, words, "Positive"))
    training.extend(generate_features(negatives, words, "Negative"))

    # Classify a new sample
    classifier = nltk.NaiveBayesClassifier.train(training)
    s = input("s: ")
    result = (classify(classifier, s, words))
    for key in result.samples():
        print(f"{key}: {result.prob(key):.4f}")


def extract_words(document):
    return set(
        word.lower() for word in nltk.word_tokenize(document)
        if any(c.isalpha() for c in word)
    )


def load_data(directory):
    result = []
    for filename in ["positives.txt", "negatives.txt"]:
        with open(os.path.join(directory, filename)) as f:
            result.append([
                extract_words(line)
                for line in f.read().splitlines()
            ])
    return result


def generate_features(documents, words, label):
    features = []
    for document in documents:
        features.append(({
            word: (word in document)
            for word in words
        }, label))
    return features


def classify(classifier, document, words):
    document_words = extract_words(document)
    features = {
        word: (word in document_words)
        for word in words
    }
    return classifier.prob_classify(features)

main('data/test_corpus')

s: I liked it
Positive: 0.7485
Negative: 0.2515


---
## Information Retrieval
The task of finding relevant documents in response to a user query

**Topic Modeling** - models for discovering the topics for a set of documents

### Term Frequency (TF)
number of times a term appears in a document
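
As a small sketch, term frequency for a single document can be computed with collections.Counter (already imported at the top of this notebook). The sentence is made up for illustration.

```python
from collections import Counter

# A made-up one-line "document"
document = "the dog saw the cat and the cat saw the dog"

# Count how many times each term appears
term_frequencies = Counter(document.split())

print(term_frequencies["the"])
print(term_frequencies.most_common(1))
```

Counter does the same job as the manual dictionary counting used in the examples in this section, just more compactly.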



The examples below use term frequency to find the most common words in a corpus of Sherlock Holmes stories

In [15]:
def main(filename):
    """Calculate top term frequencies for a corpus of documents."""
    corpus = load_data(filename)

    # Get all words in corpus
    print("Extracting words from corpus...")
    words = set()
    for filename in corpus:
        print(filename)
        words.update(corpus[filename])
        
    # Calculate term frequencies
    print("Calculating term frequencies...")
    tfidfs = dict()
    for filename in corpus:
        tfidfs[filename] = []
        for word in corpus[filename]:
            tf = corpus[filename][word]
            tfidfs[filename].append((word, tf))

    # Sort and get top 5 term frequencies for each file
    print("Computing top terms...")
    for filename in corpus:
        tfidfs[filename].sort(key=lambda tfidf: tfidf[1], reverse=True)
        tfidfs[filename] = tfidfs[filename][:5]

    # Print results
    print()
    for filename in corpus:
        print(filename)
        for term, score in tfidfs[filename]:
            print(f"    {term}: {score:.4f}")


def load_data(directory):
    files = dict()
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename)) as f:

            # Extract words
            contents = [
                word.lower() for word in
                nltk.word_tokenize(f.read())
                if word.isalpha()
            ]

            # Count frequencies
            frequencies = dict()
            for word in contents:
                if word not in frequencies:
                    frequencies[word] = 1
                else:
                    frequencies[word] += 1
            files[filename] = frequencies

    return files


main('data/holmes2')

Extracting words from corpus...
bohemia.txt
squires.txt
carbuncle.txt
clerk.txt
problem.txt
league.txt
blaze.txt
boscombe.txt
coronet.txt
ritual.txt
twisted.txt
interpreter.txt
patient.txt
engineer.txt
speckled.txt
face.txt
treaty.txt
gloria_scott.txt
crooked.txt
copper.txt
bachelor.txt
Calculating term frequencies...
Computing top terms...

bohemia.txt
    the: 443.0000
    i: 261.0000
    and: 254.0000
    to: 245.0000
    of: 237.0000
squires.txt
    the: 508.0000
    of: 206.0000
    and: 169.0000
    to: 168.0000
    a: 152.0000
carbuncle.txt
    the: 463.0000
    of: 233.0000
    a: 208.0000
    and: 199.0000
    i: 188.0000
clerk.txt
    the: 312.0000
    i: 210.0000
    a: 186.0000
    and: 180.0000
    of: 174.0000
problem.txt
    the: 427.0000
    i: 231.0000
    to: 209.0000
    of: 191.0000
    and: 187.0000
league.txt
    the: 460.0000
    and: 271.0000
    i: 264.0000
    a: 239.0000
    of: 224.0000
blaze.txt
    the: 641.0000
    of: 242.0000
    and: 242.0000
    a: 24

The above results are not very useful: the top words in every Sherlock Holmes story are simply very common English words like 'I', 'the', and 'a'. These words are known as **function words**.

**Function Words** - words that have little meaning on their own, but are used to grammatically connect other words (am, by, the, a, an, to, etc.)

**Content Words** - words that have meaning on their own (algorithm, category, computer)

The example below reads a text file containing a number of common function words and skips those words while counting, so that the most frequent words are content words rather than function words.

In [14]:
def main(filename):
    """
    Calculate top term frequencies for a corpus of documents.
    Excludes stop words.
    """
    corpus = load_data(filename)

    # Get all words in corpus
    print("Extracting words from corpus...")
    words = set()
    for filename in corpus:
        words.update(corpus[filename])

    # Calculate term frequencies
    print("Calculating term frequencies...")
    tfidfs = dict()
    for filename in corpus:
        tfidfs[filename] = []
        for word in corpus[filename]:
            tf = corpus[filename][word]
            tfidfs[filename].append((word, tf))

    # Sort and get top 5 term frequencies for each file
    print("Computing top terms...")
    for filename in corpus:
        tfidfs[filename].sort(key=lambda tfidf: tfidf[1], reverse=True)
        tfidfs[filename] = tfidfs[filename][:5]

    # Print results
    print()
    for filename in corpus:
        print(filename)
        for term, score in tfidfs[filename]:
            print(f"    {term}: {score:.4f}")


def load_data(directory):

    with open("data/function_words.txt") as f:
        function_words = set(f.read().splitlines())

    files = dict()
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename)) as f:

            # Extract words
            contents = [
                word.lower() for word in
                nltk.word_tokenize(f.read())
                if word.isalpha()
            ]

            # Count frequencies
            frequencies = dict()
            for word in contents:

                if word in function_words:
                    continue
                elif word not in frequencies:
                    frequencies[word] = 1
                else:
                    frequencies[word] += 1
            files[filename] = frequencies

    return files

main('data/holmes2')

Extracting words from corpus...
Calculating term frequencies...
Computing top terms...

bohemia.txt
    holmes: 48.0000
    man: 22.0000
    photograph: 21.0000
    own: 18.0000
    street: 18.0000
squires.txt
    holmes: 53.0000
    man: 35.0000
    cunningham: 31.0000
    inspector: 29.0000
    colonel: 25.0000
carbuncle.txt
    man: 39.0000
    holmes: 38.0000
    hat: 27.0000
    see: 27.0000
    goose: 26.0000
clerk.txt
    holmes: 26.0000
    pycroft: 23.0000
    man: 22.0000
    little: 17.0000
    mawson: 17.0000
problem.txt
    holmes: 28.0000
    watson: 22.0000
    man: 21.0000
    moriarty: 20.0000
    away: 15.0000
league.txt
    holmes: 53.0000
    man: 26.0000
    little: 25.0000
    see: 23.0000
    wilson: 22.0000
blaze.txt
    holmes: 51.0000
    horse: 50.0000
    colonel: 44.0000
    straker: 38.0000
    inspector: 21.0000
boscombe.txt
    holmes: 46.0000
    man: 43.0000
    mccarthy: 37.0000
    father: 33.0000
    son: 28.0000
coronet.txt
    coronet: 27.0000
   

While weeding out function words is a step in the right direction, it still does not really get to the bottom of what each story is about, other than that they are all Sherlock Holmes stories. To actually discern differences between the stories, word usage must also be compared across the stories.

**Inverse Document Frequency (IDF)** - measures how common or rare a word is across a corpus of documents, calculated by:
* log(total documents / number of documents containing the word). Note that with this formula, if every document contains a specific word the result is log(1) = 0, meaning the word is not rare at all; the rarer the word, the higher the value

**tf-idf** - a ranking of which words are important in a document, obtained by multiplying term frequency (TF) by inverse document frequency (IDF)
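
A worked sketch on a three-document toy corpus makes the two formulas concrete. The documents are invented, and `compute_idfs` and `top_terms` are illustrative helper names of my own.

```python
import math
from collections import Counter

# Hypothetical toy corpus: file name -> list of words
corpus = {
    "a.txt": "the goose saw holmes chase the goose".split(),
    "b.txt": "holmes saw the horse".split(),
    "c.txt": "holmes met the king".split(),
}


def compute_idfs(corpus):
    """Return log(total documents / documents containing the word) per word."""
    words = set(word for tokens in corpus.values() for word in tokens)
    return {
        word: math.log(len(corpus) / sum(word in tokens for tokens in corpus.values()))
        for word in words
    }


def top_terms(corpus, name, k=3):
    """Return the k words with the highest tf-idf score in one document."""
    idfs = compute_idfs(corpus)
    tf = Counter(corpus[name])
    scores = {word: count * idfs[word] for word, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


idfs = compute_idfs(corpus)
print(idfs["holmes"])  # appears in all 3 documents
print(idfs["goose"])   # appears in only 1 document
print(top_terms(corpus, "a.txt", k=1))
```

'holmes' and 'the' appear in every document, so their IDF is 0 and they drop to the bottom of the ranking however often they occur, while 'goose' is both frequent in a.txt and unique to it.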

In the example below, tf-idf is used to find the most important words per story

In [2]:
def main(filename):
    """Calculate top TF-IDF for a corpus of documents."""
    corpus = load_data(filename)

    # Get all words in corpus
    print("Extracting words from corpus...")
    words = set()
    for filename in corpus:
        words.update(corpus[filename])

    # Calculate IDFs
    print("Calculating inverse document frequencies...")
    idfs = dict()
    for word in words:
        f = sum(word in corpus[filename] for filename in corpus)
        idf = math.log(len(corpus) / f)
        idfs[word] = idf

    # Calculate TF-IDFs
    print("Calculating term frequencies...")
    tfidfs = dict()
    for filename in corpus:
        tfidfs[filename] = []
        for word in corpus[filename]:
            tf = corpus[filename][word]
            
            # HERE IS WHERE TF-IDF IS CALCULATED
            tfidfs[filename].append((word, tf * idfs[word]))

    # Sort and get top 5 TF-IDFs for each file
    print("Computing top terms...")
    for filename in corpus:
        tfidfs[filename].sort(key=lambda tfidf: tfidf[1], reverse=True)
        tfidfs[filename] = tfidfs[filename][:5]

    # Print results
    print()
    for filename in corpus:
        print(filename)
        for term, score in tfidfs[filename]:
            print(f"    {term}: {score:.4f}")


def load_data(directory):
    files = dict()
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename)) as f:

            # Extract words
            contents = [
                word.lower() for word in
                nltk.word_tokenize(f.read())
                if word.isalpha()
            ]

            # Count frequencies
            frequencies = dict()
            for word in contents:
                if word not in frequencies:
                    frequencies[word] = 1
                else:
                    frequencies[word] += 1
            files[filename] = frequencies

    return files

main('data/holmes2')

Extracting words from corpus...
Calculating inverse document frequencies...
Calculating term frequencies...
Computing top terms...

bohemia.txt
    majesty: 54.8014
    briony: 33.4897
    irene: 32.9193
    adler: 30.5679
    photograph: 26.3080
squires.txt
    cunningham: 94.3802
    alec: 57.8459
    acton: 45.6678
    william: 31.5063
    colonel: 31.3191
carbuncle.txt
    goose: 61.1358
    geese: 51.7569
    horner: 39.5788
    ryder: 36.5343
    peterson: 33.4897
clerk.txt
    pycroft: 70.0240
    mawson: 51.7569
    pinner: 25.8651
    hardware: 21.3117
    birmingham: 21.1624
problem.txt
    moriarty: 60.8904
    professor: 28.2165
    spray: 18.2671
    rock: 18.2671
    meiringen: 15.2226
league.txt
    wilson: 42.8100
    league: 37.6220
    merryweather: 36.5343
    jones: 33.4897
    assistant: 32.9193
blaze.txt
    straker: 115.6919
    colonel: 55.1216
    horse: 54.9306
    trainer: 51.7569
    moor: 48.7124
boscombe.txt
    mccarthy: 112.6473
    lestrade: 56.4330
   

In [3]:
# Example using tf-idf to get the most important words from all
# the Federalist Papers

main('data/federalist')

Extracting words from corpus...
Calculating inverse document frequencies...
Calculating term frequencies...
Computing top terms...

73.hamilton.txt
    negative: 18.0359
    executive: 15.0754
    his: 9.5552
    resolutions: 8.1790
    preventing: 8.1790
37.madison.txt
    obscurity: 15.2818
    difficulties: 10.0204
    predetermination: 8.8853
    allowances: 8.8853
    organ: 8.8853
18.madison.txt
    macedon: 41.2445
    cities: 37.4511
    achaeans: 33.7455
    amphictyonic: 22.2133
    philip: 22.2133
58.disputed.txt
    representatives: 14.5816
    majority: 9.4764
    coalition: 9.1691
    reapportionments: 8.8853
    house: 8.5153
12.hamilton.txt
    patrols: 13.3280
    contraband: 13.3280
    revenue: 12.1645
    trade: 10.4076
    duties: 10.1168
56.disputed.txt
    knowledge: 17.3630
    information: 14.9821
    representatives: 9.4780
    eight: 9.1691
    hundred: 9.0180
29.hamilton.txt
    militia: 46.8934
    disciplining: 13.3280
    posse: 13.3280
    comitatus: 13.

## Information Extraction
The task of extracting knowledge from documents

### Note: there were examples of basic information extraction in the lecture at 1:25, but the source code supplied did not work, so I deleted it.

### WordNet
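A lexical database of English, accessible through nltk, that works much like a dictionary: for each sense of a word it provides a definition and related words, such as its hypernyms (more general terms)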

In [11]:
from nltk.corpus import wordnet

word = input("Word: ")
synsets = wordnet.synsets(word)

for synset in synsets:
    print()
    print(f"{synset.name()}: {synset.definition()}")
    for hypernym in synset.hypernyms():
        print(f"  {hypernym.name()}")

Word: city

city.n.01: a large and densely populated urban area; may include several independent administrative districts
  municipality.n.01

city.n.02: an incorporated administrative district established by state charter
  administrative_district.n.01

city.n.03: people living in a large densely populated municipality
  municipality.n.02


## Word Representation
Ideally we want an AI to be able to take a word in a given sentence and understand what that word means. One of the easiest ways for a computer to do so is by converting words to numeric vectors.


**One-Hot Representation** - A representation of meaning as a vector with a single 1 and all other values 0; the word assignments for 'He wrote a book.' below are an example. This type of representation only really works for small word banks, since each vector needs one entry per word in the vocabulary. Another issue is that all words get different, unrelated vectors, even words with similar meanings like 'book' and 'novel'. See the example below:

* Initial Sentence: 'He wrote a book.'
* he = [1, 0, 0, 0]
* wrote = [0, 1, 0, 0]
* a = [0, 0, 1, 0]
* book = [0, 0, 0, 1]
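
The assignments above can be generated with a few lines of Python. `one_hot_vectors` and `dot` are helper names I made up for this sketch.

```python
def one_hot_vectors(vocabulary):
    """Give each word a vector with a single 1 at that word's own index."""
    size = len(vocabulary)
    return {
        word: [1 if i == index else 0 for i in range(size)]
        for index, word in enumerate(vocabulary)
    }


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


vectors = one_hot_vectors(["he", "wrote", "a", "book"])
print(vectors["book"])

# Any two different words have a dot product of 0
print(dot(vectors["a"], vectors["book"]))
```

Because every pair of different words has a dot product of 0, one-hot vectors cannot express that two words are similar, and each vector grows by one entry for every word added to the vocabulary.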

**Distributed Representation** - A representation of meaning distributed across multiple numerical values. With this type of representation, words with similar meanings should have vectors with similar values. See the example below:

* Initial Sentence: 'He wrote a book.'
* he = [-0.34, -0.08, 0.02, -0.18, ...]
* wrote = [-0.27, 0.40, 0.00, -0.65, ...]
* a = [-0.12, -0.25, 0.29, -0.09, ...]
* book = [-0.23, -0.16, -0.05, -0.57, ...]
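
How close two such vectors are is commonly measured with cosine distance, which is 1 minus the cosine of the angle between them (the same quantity scipy.spatial.distance.cosine returns). Below is a plain-Python sketch; the three 3-dimensional vectors are invented for illustration and are not real word embeddings.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v): 0 means same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1 - dot / (norm_u * norm_v)


# Hypothetical vectors, chosen so that 'book' and 'novel' point the same way
toy = {
    "book": [0.9, 0.8, 0.1],
    "novel": [0.8, 0.9, 0.2],
    "breakfast": [0.1, 0.2, 0.9],
}

print(cosine_distance(toy["book"], toy["novel"]))
print(cosine_distance(toy["book"], toy["breakfast"]))
```

The distance between 'book' and 'novel' comes out much smaller than the distance between 'book' and 'breakfast', which is the behaviour a good distributed representation should have.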




## word2vec
A model for generating word vectors


**Skip-Gram Architecture** - Neural network architecture for predicting context words given a target word. In such a neural network there is usually 1 hidden layer, and every word is an input node. The weights connecting a word's input node to the hidden layer become the vector representation of that word (see distributed representation above). So if a hidden layer with 4 nodes is created, every target word gets its own vector representation with 4 numerical values. Words with similar meanings should end up with vector values that are close together.
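
The training data for such a network consists of (target word, context word) pairs. Here is a sketch of how those pairs could be produced, assuming a context window of one word on each side; `skipgram_pairs` is a hypothetical helper of my own, and the training of the network itself is not shown.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs("he wrote a book".split(), window=1)
for target, context in pairs:
    print(target, "->", context)
```

Words that show up with the same context words produce similar training pairs, which is what pushes their vectors close together.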

In [24]:
from scipy.spatial.distance import cosine

import math
import numpy as np

with open("data/words.txt") as f:
    words = dict()
    for i in range(50000):
        row = next(f).split()
        word = row[0]
        vector = np.array([float(x) for x in row[1:]])
        words[word] = vector


def distance(w1, w2):
    return cosine(w1, w2)


def closest_words(embedding):
    distances = {
        w: distance(embedding, words[w])
        for w in words
    }
    return sorted(distances, key=lambda w: distances[w])[:10]


def closest_word(embedding):
    return closest_words(embedding)[0]

# This is the vector array that represents the word 'city'
print(words['city'])

print()

print('Distance "book" is from itself:')
print(distance(words['book'], words['book']))

print()

print('Distance "book" is from the word "breakfast":')
print(distance(words['book'], words['breakfast']))

print()

print('Distance "book" is from the word "novel":')
print(distance(words['book'], words['novel']))

print()
print('10 closest words to "book"')
print(closest_words(words['book'])[:10])

[ 0.231087 -0.238098  0.584713 -0.524351  0.40278   0.148448  0.386096
 -0.493994 -0.198922 -0.411161  0.556962  0.220978 -0.304637 -0.499713
 -0.092555  0.262613  0.752704  0.463667  0.054477  0.155809 -0.195134
 -0.009269  0.378139 -0.651306 -0.029372 -0.563472  0.024709  0.366842
 -0.476904 -0.42565  -0.094642 -0.052822  0.124612  0.296046 -0.244881
  0.195957  0.223666  0.064116  0.577874  0.083096 -0.378262  0.196044
 -0.220993 -0.630213 -0.311214  0.435611  0.351486  0.342794 -0.229961
 -0.157521  0.204315  0.253944 -0.562277  0.534482 -0.4158    0.120161
  0.649395 -0.227012 -0.130488 -0.332326  0.691952 -0.400436  0.410125
  0.026237 -0.408483  0.188236  0.130957 -0.320686  0.225932 -0.171665
 -0.335107 -0.009982  0.680831 -0.023788 -0.165798  0.345986 -0.232295
  0.021137  0.08515  -0.24387  -0.142469 -0.058325  0.086046 -0.173068
  0.198108  0.009103  0.381725  0.095911  0.317972 -0.10012   0.143178
  0.106724 -0.419844 -0.175785 -0.251805  0.211927  0.411175  0.317378
  0.45

### Because word2vec represents words as numerical vectors, mathematical operations can be used to calculate words. See the examples below:

In [26]:
closest_word(words['king'] - words['man'] + words['woman'])

'queen'

In [27]:
closest_word(words['paris'] - words['france'] + words['england'])

'london'
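
Why this arithmetic works can be sketched with hand-made 2-dimensional vectors. These are purely hypothetical values that I chose so that the first number loosely stands for 'royalty' and the second for 'female'; real word2vec vectors have many more dimensions and no such readable axes.

```python
import math

# Hypothetical vectors: [royalty, female]
toy = {
    "man": [1.0, 0.2],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.2],
    "queen": [3.0, 1.0],
}


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1 - dot / (norm_u * norm_v)


def closest_word(vector, vectors):
    """Return the word whose vector has the smallest cosine distance."""
    return min(vectors, key=lambda w: cosine_distance(vector, vectors[w]))


# king - man + woman, computed coordinate by coordinate
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
print(target)
print(closest_word(target, toy))
```

Subtracting 'man' from 'king' leaves the 'royalty' part, and adding 'woman' puts the 'female' part back in, so the result lands on 'queen'.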