# Assignment #01 - Word Vectors
Deep Learning / Winter 1399, Khatam University



---



**Please pay attention to these notes:**
<br><br>



- **Assignment Due:** <b><font color='red'>1399.12.17</font></b> 23:59:00
- If you need any additional information, please review the assignment page on the course website.
- The items you need to answer are highlighted in <font color="SeaGreen">**bold SeaGreen**</font> and the coding parts you need to implement are denoted by:
```
# ------------------
# Put your implementation here     
# ------------------
```
- We always recommend co-operation and discussion in groups for assignments. However, **each student has to finish all the questions by him/herself**. If our matching system identifies any sort of copying, you'll be responsible for the consequences.
- Students who audit this course should submit their assignments like other students to be qualified for attending the rest of the sessions.
- If you have any questions about this assignment, feel free to drop us a line. You may also post your questions on the course Microsoft Teams channel.
- You must run this notebook on the Google Colab platform; it depends on the Google Colab VM for some of its dependencies.
- You can double click on collapsed code cells to expand them.
- <b><font color='red'>When you are ready to submit, please follow the instructions at the end of this notebook.</font></b>


<br>



# Introduction

This assignment is derived from the first assignment in Stanford's CS224n course. Of course, there are many solutions for it out there (especially for the code parts), but this is a very easy warm-up assignment, and we believe that you don't need any external help for the implementation parts.

## Word Vectors

Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from *co-occurrence matrices*, and those derived via *word2vec*. 

**Note on Terminology:** The terms "word vectors" and "word embeddings" are often used interchangeably. The term "embedding" refers to the fact that we are encoding aspects of a word's meaning in a lower dimensional space. As [Wikipedia](https://en.wikipedia.org/wiki/Word_embedding) states, "*conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension*".

In [None]:
#@title Prepare environment
# All Import Statements Defined Here
# Note: Do not add to this list.
# ----------------
!pip install fasttext
!pip install arabic-reshaper
!pip install python-bidi

from IPython.display import clear_output

import arabic_reshaper
from bidi.algorithm import get_display

import sys, os, re
from collections import Counter

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 8]

import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)

clear_output()
# ----------------

## Part 1: Count-Based Word Vectors

Most word vector models start from the following idea:

*You shall know a word by the company it keeps ([Firth, J. R. 1957:11](https://en.wikipedia.org/wiki/John_Rupert_Firth))*

Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many "old school" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, *co-occurrence matrices* (for more information, see [here](http://web.stanford.edu/class/cs124/lec/vectorsemantics.video.pdf) or [here](https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285)).

### Co-Occurrence

A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the *context window* surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \dots w_{i-1}$ and $w_{i+1} \dots w_{i+n}$. We build a *co-occurrence matrix* $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window.

**Example: Co-Occurrence with Fixed Window of n=1**:

Document 1: "all that glitters is not gold"

Document 2: "all is well that ends well"


|          | START | all | that | glitters | is   | not  | gold  | well | ends | END |
|----------|-------|-----|------|----------|------|------|-------|------|------|-----|
| START    | 0     | 2   | 0    | 0        | 0    | 0    | 0     | 0    | 0    | 0   |
| all      | 2     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 0    | 0   |
| that     | 0     | 1   | 0    | 1        | 0    | 0    | 0     | 1    | 1    | 0   |
| glitters | 0     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 0    | 0   |
| is       | 0     | 1   | 0    | 1        | 0    | 1    | 0     | 1    | 0    | 0   |
| not      | 0     | 0   | 0    | 0        | 1    | 0    | 1     | 0    | 0    | 0   |
| gold     | 0     | 0   | 0    | 0        | 0    | 1    | 0     | 0    | 0    | 1   |
| well     | 0     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 1    | 1   |
| ends     | 0     | 0   | 1    | 0        | 0    | 0    | 0     | 1    | 0    | 0   |
| END      | 0     | 0   | 0    | 0        | 0    | 0    | 1     | 1    | 0    | 0   |

**Note:** In NLP, we often add START and END tokens to represent the beginning and end of sentences, paragraphs or documents. In this case we imagine START and END tokens encapsulating each document, e.g., "START All that glitters is not gold END", and include these tokens in our co-occurrence counts.
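The table above can be checked with a few lines of Python. This is only a minimal sketch and not the implementation required later in this assignment: it stores counts of (center, context) pairs in a `Counter` instead of building a matrix, and the helper name `count_window_pairs` is our own hypothetical choice.

```python
from collections import Counter

def count_window_pairs(documents, n=1):
    """Count (center, context) pairs for a symmetric window of size n."""
    counts = Counter()
    for doc in documents:
        for i, center in enumerate(doc):
            # Clip the window at the document boundaries.
            lo = max(0, i - n)
            hi = min(len(doc), i + n + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, doc[j])] += 1
    return counts

docs = [
    "START all that glitters is not gold END".split(),
    "START all is well that ends well END".split(),
]
pair_counts = count_window_pairs(docs, n=1)

print(pair_counts[("START", "all")])   # "all" follows START in both documents
print(pair_counts[("well", "END")])
print(pair_counts[("gold", "well")])   # never adjacent, so the count is 0
```

Because the window is symmetric, the count for `(a, b)` always equals the count for `(b, a)`, which is why the matrix $M$ is symmetric.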

The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run *dimensionality reduction*. In particular, we will run *SVD (Singular Value Decomposition)*, which is a kind of generalized *PCA (Principal Components Analysis)* to select the top $k$ principal components. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is $A$ with $n$ rows corresponding to $n$ words. We obtain a full matrix decomposition, with the singular values ordered in the diagonal $S$ matrix, and our new, shorter length-$k$ word vectors in $U_k$.

![Picture of an SVD](https://github.com/teias-courses/nlp99/raw/gh-pages/assets/img/svd.png)

This reduced-dimensionality co-occurrence representation preserves semantic relationships between words, e.g. *doctor* and *hospital* will be closer than *doctor* and *dog*. 

**Notes:** If you can barely remember what an eigenvalue is, here's [a slow, friendly introduction to SVD](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf). If you want to learn more thoroughly about PCA or SVD, these course notes provide a great high-level treatment of these general purpose algorithms: [1](https://web.stanford.edu/class/cs168/l/l7.pdf), [2](http://theory.stanford.edu/~tim/s15/l/l8.pdf), [3](https://web.stanford.edu/class/cs168/l/l9.pdf). Though, for the purpose of this class, you only need to know how to extract the k-dimensional embeddings by utilizing pre-programmed implementations of these algorithms from the numpy, scipy, or sklearn python packages. In practice, it is challenging to apply full SVD to large corpora because of the memory needed to perform PCA or SVD. However, if you only want the top $k$ vector components for relatively small $k$ — known as *[Truncated SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition#Truncated_SVD)* — then there are reasonably scalable techniques to compute those iteratively.
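To make the decomposition concrete, here is a small sketch on a made-up 4x4 matrix. It uses NumPy's full SVD (`np.linalg.svd`) and then keeps the top $k$ components by hand, which is fine for a toy matrix but is not the scalable Truncated SVD routine discussed above. The matrix values and variable names are assumptions chosen purely for illustration.

```python
import numpy as np

# Toy symmetric "co-occurrence" matrix for 4 hypothetical words.
A = np.array([
    [0., 2., 1., 0.],
    [2., 0., 0., 1.],
    [1., 0., 0., 3.],
    [0., 1., 3., 0.],
])

k = 2
# Full SVD: A = U @ diag(S) @ Vt, singular values sorted in descending order.
U, S, Vt = np.linalg.svd(A)

# Keep the top-k components; each row is a k-dimensional word vector (U_k * S_k).
embeddings = U[:, :k] * S[:k]

# Rank-k reconstruction of A from the truncated factors.
A_k = embeddings @ Vt[:k, :]

print(embeddings.shape)            # (4, 2)
print(np.linalg.norm(A - A_k))     # error contributed by the dropped components
```

The reconstruction error (Frobenius norm) depends only on the singular values that were dropped, so a small error means the top $k$ components capture most of the structure in the matrix.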

### Plotting Co-Occurrence Word Embeddings

Here, we will be using the [Bijankhan corpus](https://www.peykaregan.ir/dataset/%D9%BE%DB%8C%DA%A9%D8%B1%D9%87-%D8%A8%DB%8C%E2%80%8C%D8%AC%D9%86%E2%80%8C%D8%AE%D8%A7%D9%86). We provide a `read_corpus` function below that reads and processes a portion of the Bijankhan corpus. The function also adds START and END tokens to each of the documents and applies light normalization (digits are replaced with `#`, the text after a zero-width non-joiner is dropped, and punctuation tokens are removed). You do **not** have to perform any other kind of pre-processing.

In [None]:
#@title Download Bijankhan corpus
!git clone https://github.com/tihu-nlp/normalized_bijankhan
!7z x /content/normalized_bijankhan/bijankhan.7z
clear_output()
print('Done!')

Done!


In [None]:
#@title Read corpus

DATA_DIR = '/content/bijankhan.txt'

def preprocess_token(token):

  token = re.sub('[0-9]', '#', token)        # replace each digit with '#'
  token = re.sub(r'\u200c\S*', '', token)    # drop the semi-space (ZWNJ) and everything after it
  return token.strip()

def read_corpus():
  punctuations = '[،.!«»؟؛()]'
  with open(DATA_DIR) as txt_file:
    tokens = []
    for i, line in enumerate(txt_file):
      if line.startswith('!') or line.startswith('#'):
        if len(tokens) > 10:
          yield [START_TOKEN] + tokens + [END_TOKEN]
        tokens = []
        continue
      else:
        new_token = line.partition('\t')[0]
        new_token = preprocess_token(new_token)
        if new_token not in punctuations:
          tokens.append(new_token)

corpus = list(read_corpus())

print(len(corpus), '\tdocuments')
print(sum(map(lambda t: len(t), corpus)), '\ttokens')

1176 	documents
266016 	tokens


Let's take a look at what these documents look like.

In [None]:
pprint.pprint(corpus[:2], compact=True, width=100)

### Question 1.1: Implement `distinct_words` [code] (2 points)

<font color="SeaGreen"><b>Write a method to work out the distinct words (word types) that occur in the corpus.</b></font>
 You can do this with `for` loops, but it's more efficient to do it with Python list comprehensions. In particular, [this](https://coderwall.com/p/rcmaea/flatten-a-list-of-lists-in-one-line-in-python) may be useful to flatten a list of lists. If you're not familiar with Python list comprehensions in general, here's [more information](https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html).

You may find it useful to use [Python sets](https://www.w3schools.com/python/python_sets.asp) to remove duplicate words.

In [None]:
def distinct_words(corpus):
  """ Determine a list of distinct words for the corpus.
      Params:
          corpus (list of list of strings): corpus of documents
      Return:
          corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
          num_corpus_words (integer): number of distinct words across the corpus
  """
  corpus_words = []
  num_corpus_words = -1
  
  # ------------------
  # Write your implementation here.
  # ------------------

  return corpus_words, num_corpus_words

In [None]:
#@title Sanity check for Q1.1

# Define toy corpus
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
test_corpus_words, num_corpus_words = distinct_words(test_corpus)

# Correct answers
ans_test_corpus_words = sorted(list(set(["START", "All", "ends", "that", "gold", "All's", "glitters", "isn't", "well", "END"])))
ans_num_corpus_words = len(ans_test_corpus_words)

# Test correct number of words
assert (num_corpus_words == ans_num_corpus_words), "Incorrect number of distinct words. Correct: {}. Yours: {}".format(ans_num_corpus_words, num_corpus_words)

# Test correct words
assert (test_corpus_words == ans_test_corpus_words), "Incorrect corpus_words.\nCorrect: {}\nYours:   {}".format(str(ans_test_corpus_words), str(test_corpus_words))

# Print Success
print("-" * 80)
print("Passed All Tests!")
print("-" * 80)

--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------


### Question 1.2: Implement `compute_co_occurrence_matrix` [code]

<font color="SeaGreen"><b>Write a method that constructs a co-occurrence matrix for a certain window-size $n$ (with a default of 4), considering words $n$ before and $n$ after the word in the center of the window. Here, we start to use `numpy (np)` to represent vectors, matrices, and tensors.</b></font>



In [None]:
def compute_co_occurrence_matrix(corpus, window_size=4):
  """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).

    Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
      number of co-occurring words.
      
      For example, if we take the document "START All that glitters is not gold END" with window size of 4,
      "All" will co-occur with "START", "that", "glitters", "is", and "not".

    Params:
      corpus (list of list of strings): corpus of documents
      window_size (int): size of context window
    Return:
      M (numpy matrix of shape (number of corpus words, number of corpus words)): 
        Co-occurrence matrix of word counts. 
        The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
      word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
  """
  words, num_words = distinct_words(corpus)
  M = None
  word2Ind = dict()
  
  # ------------------
  # Write your implementation here.
  # ------------------

  return M, word2Ind

In [None]:
#@title Sanity check for Q1.2


# Define toy corpus and get student's co-occurrence matrix
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)

# Correct M and word2Ind
M_test_ans = np.array( 
    [[0., 0., 0., 1., 0., 0., 0., 0., 1., 0.,],
     [0., 0., 0., 1., 0., 0., 0., 0., 0., 1.,],
     [0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,],
     [1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,],
     [0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,],
     [0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,],
     [0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,],
     [0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,],
     [1., 0., 0., 0., 1., 1., 0., 0., 0., 1.,],
     [0., 1., 1., 0., 1., 0., 0., 0., 1., 0.,]]
)
word2Ind_ans = {'All': 0, "All's": 1, 'END': 2, 'START': 3, 'ends': 4, 'glitters': 5, 'gold': 6, "isn't": 7, 'that': 8, 'well': 9}

# Test correct word2Ind
assert (word2Ind_ans == word2Ind_test), "Your word2Ind is incorrect:\nCorrect: {}\nYours: {}".format(word2Ind_ans, word2Ind_test)

# Test correct M shape
assert (M_test.shape == M_test_ans.shape), "M matrix has incorrect shape.\nCorrect: {}\nYours: {}".format(M_test_ans.shape, M_test.shape)

# Test correct M values
for w1 in word2Ind_ans.keys():
  idx1 = word2Ind_ans[w1]
  for w2 in word2Ind_ans.keys():
    idx2 = word2Ind_ans[w2]
    student = M_test[idx1, idx2]
    correct = M_test_ans[idx1, idx2]
    if student != correct:
      print("Correct M:")
      print(M_test_ans)
      print("Your M: ")
      print(M_test)
      raise AssertionError("Incorrect count at index ({}, {})=({}, {}) in matrix M. Yours has {} but should have {}.".format(idx1, idx2, w1, w2, student, correct))

# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)

--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------


### Question 1.3: Implement `reduce_to_k_dim` [code]

<font color="SeaGreen"><b>Construct a method that performs dimensionality reduction on the matrix to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings. </b></font>

**Note:** All of numpy, scipy, and scikit-learn (`sklearn`) provide *some* implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use [sklearn.decomposition.TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html).

In [None]:
def reduce_to_k_dim(M, k=2):
  """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
    to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
        - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

    Params:
      M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurrence matrix of word counts
      k (int): embedding size of each word after dimension reduction
    Return:
      M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
            In terms of the SVD from math class, this actually returns U * S
  """    
  n_iters = 100     # Use this parameter in your call to `TruncatedSVD`
  M_reduced = None
  print("Running Truncated SVD over %i words..." % (M.shape[0]))
  
  # ------------------
  # Write your implementation here.
  # ------------------

  print("Done.")
  return M_reduced

In [None]:
#@title Sanity check for Q1.3


# Define toy corpus and run student code
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
M_test_reduced = reduce_to_k_dim(M_test, k=2)

# Test proper dimensions
assert (M_test_reduced.shape[0] == 10), "M_reduced has {} rows; should have {}".format(M_test_reduced.shape[0], 10)
assert (M_test_reduced.shape[1] == 2), "M_reduced has {} columns; should have {}".format(M_test_reduced.shape[1], 2)

# Print Success
print("-" * 80)
print("Passed All Tests!")
print("-" * 80)

Running Truncated SVD over 10 words...
Done.
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------


### Question 1.4: Implement `plot_embeddings` [code]

Here you will write a function to plot a set of 2D vectors in 2D space. 
<font color="SeaGreen"><b>Plot 2D vectors for the given words (scatter plot), and annotate each vector (point) with its corresponding word. To show vector distances fairly, the plot should use the same scale for the x and y axes.</b></font>

For this example, you may find it useful to adapt [this code](https://www.pythonmembers.club/2018/05/08/matplotlib-scatter-plot-annotate-set-text-at-label-each-point/). In the future, a good way to make a plot is to look at [the Matplotlib gallery](https://matplotlib.org/gallery/index.html), find a plot that looks somewhat like what you want, and adapt the code they give.

In [None]:
def plot_embeddings(M_reduced, word2Ind, words):
  """ Plot in a scatterplot the embeddings of the words specified in the list "words".
      NOTE: do not plot all the words listed in M_reduced / word2Ind.
      Include a label next to each point.
      
      Params:
          M_reduced (numpy matrix of shape (number of unique words in the corpus , k)): matrix of k-dimensional word embeddings
          word2Ind (dict): dictionary that maps word to indices for matrix M
          words (list of strings): words whose embeddings we want to visualize
  """

  # ------------------
  # Write your implementation here.
  # ------------------

plot_embeddings(M_test_reduced, word2Ind_test, ['All', 'ends', 'well', 'that', 'gold'])

### Question 1.5: Co-Occurrence Plot Analysis [written]

Now we will put together all the parts you have written! We will compute the co-occurrence matrix with fixed window of 4, over the Bijankhan corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U\*S, so we normalize the returned vectors, so that all the vectors will appear around the unit circle (**therefore closeness is directional closeness**). 

Run the cell below to produce the plot. It'll probably take a few seconds to run. 

<font color="SeaGreen"><b>What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might think should have?
</b></font>

In [None]:
#@title Compute normalized 2D word vectors

corpus = list(read_corpus())
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

Running Truncated SVD over 17151 words...
Done.


In [None]:
words = ['خوزستان', 'آلمان', 'نفت', 'عربستان', 'عراق',
         'ایتالیا', 'بنزین', 'شعر', 'کویت', 'معادن', 'یونان',
         'صنعت', 'ادبیات', 'خراسان', 'شیراز']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)

#### <font color="red">Write your answer here.</font>

## Part 2: Prediction-Based Word Vectors

In [None]:
#@title Download the Persian model

!gdown https://drive.google.com/uc?id=1blEFKa9253O4HM9J58SCzN7_RSoC-bhA

clear_output()
print ("Done!")

In [None]:
#@title Load the persian model
import fasttext

ft = fasttext.load_model('cc.fa.100.bin')

clear_output()
print ("Done!")

In [None]:
#@title Reduce dims

from sklearn.decomposition import PCA

def reduce_dims(X, k=2):
  pca = PCA(n_components=k)  # honor the requested number of dimensions
  pca.fit(X)
  return pca.transform(X)

In [None]:
#@title plot points

import matplotlib.pyplot as plt
from bidi.algorithm import get_display
import arabic_reshaper

def plot_points(P, labels):
  fig, ax = plt.subplots(figsize=(6, 6))
  ax.plot(P[:,0], P[:,1], "b.")
  for i, txt in enumerate(labels):
    reshaped_text = arabic_reshaper.reshape(txt)
    fatext = get_display(reshaped_text)

    ax.annotate(fatext, (P[i,0], P[i,1]), fontsize=14)

  plt.axis('off')
  plt.show()

### Question 2.1: Word2Vec Plot Analysis 
Run the cell below to plot the 2D embeddings (taken from the pretrained fastText model loaded above) for: 

`['سوئد', 'موز', 'انرژی', 'صنعت', 'کویت', 'نفت', 'سیب', 'بنزین', 'ایران']`.

<font color="SeaGreen"><b>What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might think should have? How is the plot different from the one generated earlier from the co-occurrence matrix?</b></font>

In [None]:
import numpy as np
words = ['سوئد', 'موز', 'انرژی', 'صنعت', 'کویت', 'نفت', 'سیب', 'بنزین', 'ایران']
vectors = []
for word in words:
  vectors.append(ft.get_word_vector(word))
reduced_vectors = reduce_dims(np.array(vectors))

plot_points(reduced_vectors, words)

#### <font color="red">Write your answer here.</font>

### Question 2.2: Polysemous Words 
<font color="SeaGreen"><b>Find a [polysemous](https://en.wikipedia.org/wiki/Polysemy) word (for example, "شیر") such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. You will probably need to try several polysemous words before you find one. Please state the polysemous word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous words you tried didn't work?</b></font>

**Note**: You should use the `ft.get_nearest_neighbors(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word.

In [None]:
# ------------------
# Write your polysemous word exploration code here.
# ------------------

#### <font color="red">Write your answer here.</font>

### Question 2.3: Synonyms & Antonyms 
When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.

<font color="SeaGreen"><b>Find three words (w1,w2,w3) where w1 and w2 are synonyms and w1 and w3 are antonyms, but Cosine Distance(w1,w3) < Cosine Distance(w1,w2). For example, w1="شاد" is closer to w3="غمگین" than to w2="خوشحال". 

Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.</b></font>

You should use the `sklearn.metrics.pairwise.cosine_similarity` function here to compute the cosine similarity between two word vectors, and then subtract it from 1 to get the cosine distance. Please see the __[sklearn documentation](https://scikit-learn.org/stable/modules/metrics.html#cosine-similarity)__ for further assistance.
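To build intuition for the quantity involved, the sketch below computes cosine distance by hand with NumPy on made-up 2-D vectors. The helper `cosine_distance` is our own hypothetical name and is only meant to show what the value means; it is not a replacement for the sklearn function mentioned above.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - similarity

a = [1.0, 0.0]
print(cosine_distance(a, [2.0, 0.0]))    # same direction -> distance 0
print(cosine_distance(a, [0.0, 3.0]))    # orthogonal -> distance 1
print(cosine_distance(a, [-1.0, 0.0]))   # opposite direction -> distance 2
```

Note that vector length does not matter: only the angle between the two vectors determines the distance, which ranges from 0 to 2.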

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# ------------------
# Write your synonym & antonym exploration code here.
# ------------------

#### <font color="red">Write your answer here.</font>

### Solving Analogies with Word Vectors
Word2Vec vectors have been shown to *sometimes* exhibit the ability to solve analogies. 

As an example, for the analogy 

شاه : مرد --> ملکه : ؟

The answer would be "زن" or similar words.
In the cell below, we show you how to use word vectors to find the missing word. The `get_analogies` function finds words that are most similar to the vector of `wordA - wordB + wordC`. The answer to the analogy will be the word ranked most similar (largest numerical value).
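The vector arithmetic behind an analogy query can be illustrated on a toy example. The 2-D vectors below are invented for illustration (they are not taken from the fastText model), and `toy_analogy` is a hypothetical helper; skipping the three query words when ranking candidates is a simplification we assume here.

```python
import numpy as np

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def toy_analogy(word_a, word_b, word_c, vectors):
    """Return the word whose vector is most cosine-similar to A - B + C."""
    query = vectors[word_a] - vectors[word_b] + vectors[word_c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (word_a, word_b, word_c):
            continue  # skip the query words themselves
        score = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Same argument order as the fastText call: wordA=man, wordB=king, wordC=queen.
best_word, best_score = toy_analogy("man", "king", "queen", toy_vectors)
print(best_word, best_score)
```

Here `man - king + queen` removes the "royalty" component and flips the "gender" component, so the query vector lands exactly on the toy vector for "woman". Real embeddings are far noisier, which is why analogies only *sometimes* work.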

In [None]:
# Run this cell to answer the analogy

ft.get_analogies(wordA="مرد",
                 wordB="شاه",
                 wordC="ملکه", 
                 k=1)

### Question 2.4: Finding Analogies 
<font color="SeaGreen"><b>Find an example of an analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.</b></font>

**Note**: You may have to try many analogies to find one that works!

In [None]:
# ------------------
# Write your analogy exploration code here.
# ------------------

#### <font color="red">Write your answer here.</font>

### Question 2.5: Incorrect Analogy 
<font color="SeaGreen"><b>Find an example of an analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the (incorrect) value of b according to the word vectors.</b></font>

In [None]:
# ------------------
# Write your incorrect analogy exploration code here.
# ------------------

#### <font color="red">Write your answer here.</font>

### Question 2.6: Guided Analysis of Bias in Word Vectors 

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit to our word embeddings.

<font color="SeaGreen"><b>Run the cell below to examine whether the word "رییس" is closer to the word "مرد" or to the word "زن". 

Explain your observations:</b></font>

In [None]:
# Run this cell

w1 = "رییس"
w2 = "مرد"
w3 = "زن"

w1_vec = ft.get_word_vector(w1)
w2_vec = ft.get_word_vector(w2)
w3_vec = ft.get_word_vector(w3)

w1_w2_dist = 1-cosine_similarity([w1_vec], [w2_vec])
w1_w3_dist = 1-cosine_similarity([w1_vec], [w3_vec])

print("{}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist[0][0]))
print("{}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist[0][0]))


رییس, مرد have cosine distance: 0.714638352394104
رییس, زن have cosine distance: 0.9129302501678467


#### <font color="red">Write your answer here.</font>

### Question 2.7: Independent Analysis of Bias in Word Vectors

<font color="SeaGreen"><b>Use the method above to find another case where some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.</b></font>

In [None]:
# ------------------
# Write your bias exploration code here.
# ------------------

#### <font color="red">Write your answer here.</font>

### Question 2.8: Thinking About Bias

<font color="SeaGreen"><b>What might be the cause of these biases in the word vectors?</b></font>

#### <font color="red">Write your answer here.</font>

# Submission

Congratulations! You finished the assignment & you're ready to submit your work. Please follow the instructions:

1. Check and review your answers. Make sure all of the cell outputs are what you want. 
2. Select File > Save.
3. **Fill in your information** and run the cell below.
4. Run the **Make Submission** cell. It may take several minutes, and it may ask you for your credentials.
5. Run the **Download Submission** cell to obtain your submission as a zip file.
6. Grab the downloaded file (`dl_asg01__xx__xx.zip`) and hand it in on Microsoft Teams.

## Fill your information (Run the cell)

In [None]:
#@title Enter your information & "RUN the cell!!" { run: "auto" }
student_id = "" #@param {type:"string"}
student_name = "" #@param {type:"string"}

print("your student id:", student_id)
print("your name:", student_name)


from pathlib import Path

ASSIGNMENT_PATH = Path('asg01')
ASSIGNMENT_PATH.mkdir(parents=True, exist_ok=True)

## Make Submission (Run the cell)

In [None]:
#@title Make submission
! pip install -U --quiet PyDrive > /dev/null
! pip install -U --quiet jdatetime > /dev/null

# ! wget -q https://github.com/github/hub/releases/download/v2.10.0/hub-linux-amd64-2.10.0.tgz 


import os
import time
import yaml
import json
import jdatetime

from google.colab import files
from IPython.display import Javascript
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

asg_name = 'Assignment_1'
script_save = '''
require(["base/js/namespace"],function(Jupyter) {
    Jupyter.notebook.save_checkpoint();
});
'''
# repo_name = 'iust-deep-learning-assignments'
submission_file_name = 'dl_asg01__%s__%s.zip'%(student_id, student_name.lower().replace(' ',  '_'))

sub_info = {
    'student_id': student_id,
    'student_name': student_name, 
    'dateime': str(jdatetime.date.today()),
    'asg_name': asg_name
}
json.dump(sub_info, open('info.json', 'w'))

Javascript(script_save)

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_id = drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('%s.ipynb'%asg_name) 

! jupyter nbconvert --to script "$asg_name".ipynb > /dev/null
! jupyter nbconvert --to html "$asg_name".ipynb > /dev/null
! zip "$submission_file_name" "$asg_name".ipynb "$asg_name".html "$asg_name".txt info.json > /dev/null

print("##########################################")
print("Done! Submission created. Please download it using the cell below!")

In [None]:
drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']

In [None]:
files.download(submission_file_name)