# Homework 5: Advanced Vector Space Models

## Due Date: Jun 17
## Total Points: 116 points + 8 bonus
- **Overview**: In this assignment, we will examine some advanced uses of vector representations of words. We are going to look at three different problems:

  - Solving word relation problems like analogies using word embeddings.
  - Comparing human judgments of word similarity with vector similarities by measuring their correlation.
  - Discovering the different senses of a ‘polysemous’ word by clustering together its paraphrases.


- **Deliverables:** This assignment has several deliverables:
  - Code (this notebook) *(Automatically Graded)*
    - Part 1: answers to questions
    - Part 3: 4 different clustering functions
  - Write Up (include in this notebook or a separate **writeup.pdf**) *(Manually Graded)*
    - Answers to all questions labeled as `Answer #.#` in a file named `writeup.pdf`
      - Part 2: answers to questions **[writeup.pdf]**
      - Part 3: F-scores for clustering algorithms & discussions about your models **[writeup.pdf]**
  - Leaderboard Without K *(Automatically Graded on Gradescope)*
    - `test_nok_output_leaderboard.txt` = Task 3.4 output file
  - Leaderboard With K *(Automatically Graded on Gradescope)*
    - `test_output_leaderboard.txt` = Task 3.2 or 3.3 output file

- **Grading**: We will use the auto-grading system called `PennGrader`. To complete the homework assignment, implement everything marked with `#TODO` and run each cell marked with a `#PennGrader` note.


## Recommended Readings
- [Vector Semantics](https://web.stanford.edu/~jurafsky/slp3/6.pdf). Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd edition draft).
- [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf?). Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. arXiv 2013.
- [Linguistic Regularities in Continuous Space Word Representations](https://www.aclweb.org/anthology/N13-1090). Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig. NAACL 2013.
- [Discovering Word Senses from Text](https://cs.uwaterloo.ca/~cdimarco/pdf/cs886/Pantel+Lin02.pdf). Patrick Pantel and Dekang Lin. KDD 2002.
- [Linguistic Regularities in Sparse and Explicit Word Representations](https://aclanthology.org/W14-1618.pdf). Omer Levy and Yoav Goldberg. CoNLL 2014.
- [Clustering Paraphrases by Word Sense](https://www.cis.upenn.edu/~ccb/publications/clustering-paraphrases-by-word-sense.pdf). Anne Cocos and Chris Callison-Burch. NAACL 2016.

## To get started, **make a copy** of this Colab notebook into your Google Drive!

## Setup 1: PennGrader Setup [4 points]

In [5]:
## DO NOT CHANGE ANYTHING, JUST RUN
%%capture
!pip install penngrader-client

In [6]:
%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

Overwriting notebook-config.yaml


In [7]:
!cat notebook-config.yaml


grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'


In [8]:
from penngrader.grader import *

## TODO - Start
STUDENT_ID = 62502470 # YOUR PENN-ID GOES HERE AS AN INTEGER#
## TODO - End

SECRET = STUDENT_ID
grader = PennGrader('notebook-config.yaml', 'CIS5300_OL_23Su_HW5', STUDENT_ID, SECRET)

PennGrader initialized with Student ID: 62502470

Make sure this correct or we will not be able to store your grade


In [9]:
# check if the PennGrader is set up correctly
# do not change this cell, see if you get 4/4!
name_str = 'Rui Jiang'
grader.grade(test_case_id = 'name_test', answer = name_str)

Correct! You earned 4/4 points. You are a star!

Your submission has been successfully recorded in the gradebook.


## Setup 2: Dataset / Packages
- **Run the following cells without changing anything!**

In [10]:
### This cell might take 3 min to run ###
! echo "Installing Magnitude.... (please wait, can take a while)"
! (curl https://raw.githubusercontent.com/plasticityai/magnitude/master/install-colab.sh | /bin/bash 1>/dev/null 2>/dev/null)
! echo "Done installing Magnitude."

Installing Magnitude.... (please wait, can take a while)
Done installing Magnitude.


In [11]:
!gdown 1luyNlDu0GdH_B3D6rjYLKIGGfz_S3yU5 # GoogleNews-vectors-negative300.filter.magnitude
!gdown 17a4uC7eNrYdtVlW60wshjyDLcFxTtq0y # SimLex-999.txt
!gdown 1h2DHMuubO7OEVxmGQGbvb2Ovj_A6hakC # dev_input.txt
!gdown 1I83_VA_i_UB-9cf9GcEe5oGPoc8-ZmLh # dev_output.txt
!gdown 1CjK3eYkacyxo3gdLbf9IGdk1DFEfAYvM # test_input.txt
!gdown 1sZuq8a2zHJfe6bLjrK3wD2jrZWkQ0-6S # test_nok_input.txt
!gdown 1gK13ZVDMA5XYi8sZY8G1gOIZMdxGTuay # coocvec-500mostfreq-window-3.filter.magnitude

!gdown 1r0ebRDG-_4ALl3PJ7Vko0DkLcMdLPIoL # glove.6B.50d.magnitude
!gdown 1TQ5W7mma_fYKqVL-Dm7_ogwIftyJpXAT # glove.6B.100d.magnitude
!gdown 1LiKprfuwD434FGC-bf8OARMIKCtNIL4Z # glove.6B.200d.magnitude
!gdown 1_p-9y15JvbobeJ37L5v4kXnWMXsfHsD4 # glove.6B.300d.magnitude
!gdown 1zs0Z-m7YbbVbKvqkq-HEIxNYp3e75-7e # glove.840B.300d.magnitude

# if the above wget command gives you an error, then uncomment the line below and run this cell
!gdown 115ryZ01s_guR1ySc7YLD2kbAm6UpL7VP # GoogleNews-vectors-negative300.magnitude

Downloading...
From: https://drive.google.com/uc?id=1luyNlDu0GdH_B3D6rjYLKIGGfz_S3yU5
To: /content/GoogleNews-vectors-negative300.filter.magnitude
100% 3.99M/3.99M [00:00<00:00, 117MB/s]
Downloading...
From: https://drive.google.com/uc?id=17a4uC7eNrYdtVlW60wshjyDLcFxTtq0y
To: /content/SimLex-999.txt
100% 43.0k/43.0k [00:00<00:00, 48.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1h2DHMuubO7OEVxmGQGbvb2Ovj_A6hakC
To: /content/dev_input.txt
100% 17.4k/17.4k [00:00<00:00, 26.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1I83_VA_i_UB-9cf9GcEe5oGPoc8-ZmLh
To: /content/dev_output.txt
100% 23.1k/23.1k [00:00<00:00, 32.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1CjK3eYkacyxo3gdLbf9IGdk1DFEfAYvM
To: /content/test_input.txt
100% 3.81k/3.81k [00:00<00:00, 11.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1sZuq8a2zHJfe6bLjrK3wD2jrZWkQ0-6S
To: /content/test_nok_input.txt
100% 4.55k/4.55k [00:00<00:00, 13.2MB/s]
Downloading...
From: https://drive.

In [12]:
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download en


In [13]:
!pip uninstall -y annoy
!pip install annoy

Found existing installation: annoy 1.17.3
Uninstalling annoy-1.17.3:
  Successfully uninstalled annoy-1.17.3

In [14]:
!pip install -U lz4==1.0.0
!pip install -U xxhash==1.0.1
!pip install -U fasteners==0.14.1


In [15]:
import spacy

In [16]:
# Check your python version
!python --version

Python 3.11.13


If your Python version is >= 3.9, run the code cell below before importing from pymagnitude:

In [17]:
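spacy.load('en_core_web_sm')
import collections
import collections.abc

# Older libraries such as pymagnitude expect these abstract base classes on
# `collections`; newer Python versions only provide them in `collections.abc`,
# so alias them back before importing pymagnitude.
collections.Sequence = collections.abc.Sequence
collections.Mapping = collections.abc.Mapping
collections.MutableMapping = collections.abc.MutableMapping
collections.Iterable = collections.abc.Iterable
collections.MutableSet = collections.abc.MutableSet
collections.Callable = collections.abc.Callable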

In [19]:
from pymagnitude import * # if you encounter an error for this line, try re-running it - I know it's silly but it might work

In [None]:
# this might take ~2min to run
!wget http://magnitude.plasticity.ai/word2vec/light/GoogleNews-vectors-negative300.magnitude

--2025-06-11 02:20:09--  http://magnitude.plasticity.ai/word2vec/light/GoogleNews-vectors-negative300.magnitude
Resolving magnitude.plasticity.ai (magnitude.plasticity.ai)... 52.217.116.149, 3.5.8.241, 16.15.192.128, ...
Connecting to magnitude.plasticity.ai (magnitude.plasticity.ai)|52.217.116.149|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2025-06-11 02:20:09 ERROR 403: Forbidden.



In [None]:
# if the above wget command gives you an error, then uncomment the line below and run this cell
!gdown 115ryZ01s_guR1ySc7YLD2kbAm6UpL7VP

Downloading...
From (original): https://drive.google.com/uc?id=115ryZ01s_guR1ySc7YLD2kbAm6UpL7VP
From (redirected): https://drive.google.com/uc?id=115ryZ01s_guR1ySc7YLD2kbAm6UpL7VP&confirm=t&uuid=5c56bb4c-a7a8-4df2-a953-749b3c294125
To: /content/GoogleNews-vectors-negative300.magnitude
100% 4.21G/4.21G [00:56<00:00, 74.4MB/s]


In [None]:
# !curl -L "https://drive.usercontent.google.com/download?id=115ryZ01s_guR1ySc7YLD2kbAm6UpL7VP&export=download&confirm" -o GoogleNews-vectors-negative300.magnitude

In [20]:
import os
os.listdir('./')

['.config',
 'glove.840B.300d.magnitude',
 'wiki-news-300d-1M.vec',
 'notebook-config.yaml',
 'glove.6B.300d.magnitude',
 'test_nok_input.txt',
 'glove.6B.200d.magnitude',
 'wiki-news-300d-1M.vec.zip',
 'dev_output.txt',
 'glove.6B.100d.magnitude',
 'test_input.txt',
 'dev_input.txt',
 'glove.6B.50d.magnitude',
 'SimLex-999.txt',
 'GoogleNews-vectors-negative300.filter.magnitude',
 'GoogleNews-vectors-negative300.magnitude',
 'coocvec-500mostfreq-window-3.filter.magnitude',
 'sample_data']

In [21]:
from itertools import combinations
from prettytable import PrettyTable
from sklearn.cluster import KMeans
import random
import pandas as pd
import numpy as np
import scipy.stats as stats

In [None]:
# first time run will take 10 minutes
file_path = "/content/GoogleNews-vectors-negative300.magnitude"
vectors = Magnitude(file_path)

sims = vectors.most_similar("picnic")


In [None]:
# !gdown 1luyNlDu0GdH_B3D6rjYLKIGGfz_S3yU5 # GoogleNews-vectors-negative300.filter.magnitude
!gdown 17a4uC7eNrYdtVlW60wshjyDLcFxTtq0y # SimLex-999.txt
!gdown 1h2DHMuubO7OEVxmGQGbvb2Ovj_A6hakC # dev_input.txt
!gdown 1I83_VA_i_UB-9cf9GcEe5oGPoc8-ZmLh # dev_output.txt
!gdown 1CjK3eYkacyxo3gdLbf9IGdk1DFEfAYvM # test_input.txt
!gdown 1sZuq8a2zHJfe6bLjrK3wD2jrZWkQ0-6S # test_nok_input.txt
!gdown 1gK13ZVDMA5XYi8sZY8G1gOIZMdxGTuay # coocvec-500mostfreq-window-3.filter.magnitude

Downloading...
From: https://drive.google.com/uc?id=17a4uC7eNrYdtVlW60wshjyDLcFxTtq0y
To: /content/SimLex-999.txt
100% 43.0k/43.0k [00:00<00:00, 67.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1h2DHMuubO7OEVxmGQGbvb2Ovj_A6hakC
To: /content/dev_input.txt
100% 17.4k/17.4k [00:00<00:00, 38.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1I83_VA_i_UB-9cf9GcEe5oGPoc8-ZmLh
To: /content/dev_output.txt
100% 23.1k/23.1k [00:00<00:00, 42.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1CjK3eYkacyxo3gdLbf9IGdk1DFEfAYvM
To: /content/test_input.txt
100% 3.81k/3.81k [00:00<00:00, 8.44MB/s]
Downloading...
From: https://drive.google.com/uc?id=1sZuq8a2zHJfe6bLjrK3wD2jrZWkQ0-6S
To: /content/test_nok_input.txt
100% 4.55k/4.55k [00:00<00:00, 13.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1gK13ZVDMA5XYi8sZY8G1gOIZMdxGTuay
To: /content/coocvec-500mostfreq-window-3.filter.magnitude
100% 3.50M/3.50M [00:00<00:00, 24.7MB/s]


# Section 1: Exploring Analogies and Other Word Pair Relationships [4 points]
**Background:** Word2vec is a very cool word embedding method that was developed by [Tomas Mikolov et al](https://aclanthology.org/N13-1090/). One of the noteworthy things about the method is that it can be used to solve word analogy problems like:

***man is to king as woman is to [blank]***

The way it works is to take the vectors representing king, man and woman and perform some vector arithmetic to produce a vector that is close to the expected answer:

***king − man + woman ≈ queen***

We can find the nearest vector in the vocabulary by looking for the word *x* that maximizes *cos(x, king − man + woman)*. Omer Levy has an explanation of the method in this [Quora post](https://www.quora.com/unanswered/How-does-Mikolovs-word-analogy-for-word-embedding-work-How-can-I-code-such-a-function) and in the [paper](https://aclanthology.org/W14-1618/).
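
To make the arithmetic concrete, here is a minimal sketch of the argmax-cosine search on hand-made 3-dimensional vectors. The toy embeddings and the helper names (`cosine`, `solve_analogy`) are invented for illustration and are not part of Magnitude; real word2vec vectors have 300 dimensions and will not line up this cleanly.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' by maximizing cos(x, b - a + c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are excluded, as is usual for analogy tasks.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Hypothetical toy embeddings: axis 1 ~ "female", axis 2 ~ "royal".
toy = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}

answer = solve_analogy("man", "king", "woman", toy)
print(answer)
```

Excluding the query words matters in practice: without that filter, the nearest neighbour of the combined vector is often one of the input words themselves.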

In addition to solving this sort of analogy problem, the same sort of vector arithmetic was used with word2vec embeddings to find relationships between pairs of words like the following:

<img src='https://drive.google.com/uc?id=1_ewkcJ6EQMuIK0SrgBulzK7LFi8kD9nD'>

In the first part of the assignment, you will play around with the [Magnitude](https://github.com/plasticityai/magnitude) library. You will use Magnitude to load a vector model trained using word2vec, and use it to manipulate and analyze the vectors. To proceed, you need the Medium Google word2vec embedding model trained on Google News, stored in the file `GoogleNews-vectors-negative300.magnitude`. Once the file is downloaded, use the following Python commands:

In [22]:
# file_path = "/content/GoogleNews-vectors-negative300.filter.magnitude"
file_path = "/content/GoogleNews-vectors-negative300.magnitude"
vectors = Magnitude(file_path)

In [None]:
# Mount Google Drive to avoid downloading the files again
from google.colab import drive
drive.mount('/content/drive')

# file_path = "/content/drive/CIS5300/HW5/"

Now you can use vectors to perform queries. For instance, you can query the distance of cat and dog in the following way:

In [None]:
print(vectors.distance("cat", "dog")) # should be ~0.69

0.69145405
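
The value printed above is a distance, not a similarity, so smaller means closer. For unit-length vectors, Euclidean distance and cosine similarity are tied together by d = sqrt(2 − 2·cos). The sketch below checks that identity on toy 2-dimensional vectors; whether Magnitude normalizes its vectors by default is an assumption here and is worth confirming in the library documentation.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity; for unit vectors this is just the dot product."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def euclidean(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Toy vectors, chosen only so the numbers are easy to follow.
u = normalize([3.0, 4.0])
v = normalize([4.0, 3.0])

cos_sim = cosine(u, v)
dist = euclidean(u, v)
print(cos_sim, dist, math.sqrt(2 - 2 * cos_sim))
```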


The questions below are designed to familiarize you with the Magnitude word2vec package and get you thinking about what type of semantic information word embeddings can encode. We recommend reading the [Using the Library](https://github.com/plasticityai/magnitude#using-the-library) section of the Magnitude documentation before answering the following questions:

- **Problem 1.1:** What is the dimensionality of these word embeddings? Provide an integer answer. [1 point]

In [None]:
vectors.dim

300

In [None]:
# TODO
dimensionality = vectors.dim

# PennGrader - DO NOT CHANGE
grader.grade(test_case_id = 'test_q11_dim', answer = dimensionality) # we only check partial data

Correct! You earned 1/1 points. You are a star!

Your submission has been successfully recorded in the gradebook.


 - **Problem 1.2:** What are the top-5 most similar words to `picnic` (not including `picnic` itself)? (Hint: try using `vectors.most_similar`) Please return these as a list of strings named `mostsim`. [1 point]

In [None]:
### The first time you run "vectors.most_similar" it will take about 5~10 mins to run
sims = vectors.most_similar("picnic")
sims_top5 = []
for i in range(5):
    sims_top5.append(sims[i][0])
sims_top5

['picnics', 'picnic_lunch', 'Picnic', 'potluck_picnic', 'picnic_supper']

In [None]:
# TODO
mostsim = sims_top5

# PennGrader - DO NOT CHANGE
# reload_grader()
grader.grade(test_case_id = 'test_q12_picnic', answer = mostsim) # we only check partial data

Correct! You earned 1/1 points. You are a star!

Your submission has been successfully recorded in the gradebook.


 - **Problem 1.3:** According to the word embeddings, which of these words is not like the others? `['tissue', 'papyrus', 'manila', 'newsprint', 'parchment', 'gazette']` [1 point]
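
One way to turn "not like the others" into a computation is to pick the word whose average cosine similarity to the rest of the list is lowest. The sketch below does that on made-up 2-dimensional vectors; the function name `odd_one_out` and the toy embeddings are hypothetical, and other definitions of the odd word out (for example, distance to the mean vector) are equally reasonable.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def odd_one_out(words, embeddings):
    """Return the word with the lowest mean similarity to all other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(embeddings[word], embeddings[o]) for o in others) / len(others)
    return min(words, key=mean_similarity)

# Hypothetical toy embeddings: three animals point one way, "car" another.
toy = {
    "cat":   [1.0, 0.1],
    "dog":   [0.9, 0.2],
    "horse": [1.0, 0.3],
    "car":   [0.1, 1.0],
}

outlier = odd_one_out(list(toy), toy)
print(outlier)
```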

In [None]:
word_list = ['tissue', 'papyrus', 'manila', 'newsprint', 'parchment', 'gazette']
for i in range(len(word_list)):
    for j in range(i):
        print("[{}] and [{}] sim scores is [{}]".format(word_list[i], word_list[j], vectors.similarity(word_list[i], word_list[j])))

[papyrus] and [tissue] sim scores is [0.1519498974084854]
[manila] and [tissue] sim scores is [0.13553906977176666]
[manila] and [papyrus] sim scores is [0.2507871985435486]
[newsprint] and [tissue] sim scores is [0.19044463336467743]
[newsprint] and [papyrus] sim scores is [0.2438289374113083]
[newsprint] and [manila] sim scores is [0.21090111136436462]
[parchment] and [tissue] sim scores is [0.20001494884490967]
[parchment] and [papyrus] sim scores is [0.5869774222373962]
[parchment] and [manila] sim scores is [0.2911486029624939]
[parchment] and [newsprint] sim scores is [0.3022960424423218]
[gazette] and [tissue] sim scores is [-0.00041741711902432144]
[gazette] and [papyrus] sim scores is [0.22188228368759155]
[gazette] and [manila] sim scores is [0.2529136538505554]
[gazette] and [newsprint] sim scores is [0.20029675960540771]
[gazette] and [parchment] sim scores is [0.18534383177757263]


In [None]:
# TODO
doesnt_match = "tissue"

# PennGrader - DO NOT CHANGE
# reload_grader()
grader.grade(test_case_id = 'test_q13_does_not_match', answer = doesnt_match) # we only check partial data

Correct! You earned 1/1 points. You are a star!

Your submission has been successfully recorded in the gradebook.


 -  **Problem 1.4:** Solve the following analogy: `leg` is to `jump` as X is to `throw` [1 point]

In [None]:
vectors.most_similar(positive = ["leg", "throw"], negative = ["jump"])

[('forearm', np.float32(0.48294652)),
 ('shin', np.float32(0.47376162)),
 ('elbow', np.float32(0.4679689)),
 ('metacarpal_bone', np.float32(0.46781474)),
 ('metacarpal_bones', np.float32(0.46605822)),
 ('ankle', np.float32(0.46434426)),
 ('shoulder', np.float32(0.46183354)),
 ('thigh', np.float32(0.45393682)),
 ('knee', np.float32(0.4455708)),
 ('ulna_bone', np.float32(0.4423491))]

In [None]:
# TODO
analogy = "forearm"

# PennGrader - DO NOT CHANGE
# reload_grader()
grader.grade(test_case_id = 'test_q14_analogy', answer = analogy)

Correct! You earned 1/1 points. You are a star!

Your submission has been successfully recorded in the gradebook.


# Section 2: SimLex-999 Dataset Revisited [10 points + 5 Bonus]
Let us revisit the [SimLex-999](https://fh295.github.io/simlex.html) dataset from the Extra Credit in HW4. We will use `SimLex-999.txt`.

We provide a script below that:

1. Takes the `word1`, `word2`, and `SimLex999` columns from the `SimLex-999.txt` dataset,
2. Computes the similarity between `word1` and `word2` using `GoogleNews-vectors-negative300.magnitude` from Part 1,
3. Reports how well the vector similarities correlate with the human similarity judgments, using [Kendall’s Tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient).
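
Kendall's Tau compares rankings rather than raw values: it counts how many pairs of items are ordered the same way by both score lists (concordant) versus oppositely (discordant). The sketch below is a plain-Python version of the simplest variant, tau-a, which ignores ties; `scipy.stats.kendalltau` defaults to tau-b, which corrects for ties, and the two agree when there are none. The four score pairs are taken from the first rows printed by the reference script, rounded to two decimals.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both lists order this pair the same way
        elif direction < 0:
            discordant += 1      # the lists disagree on this pair
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# smart/intelligent, hard/difficult, old/new, hard/easy
human = [9.2, 8.77, 1.58, 0.95]
model = [0.65, 0.60, 0.22, 0.47]

tau = kendall_tau_a(human, model)
print(f"tau = {tau:.4f}")
```

Only one of the six pairs is discordant here (humans rate old/new above hard/easy, the vectors do the opposite), which gives tau = (5 − 1) / 6.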

In [23]:
# Reference Code - DO NOT CHANGE
vectors = Magnitude("/content/GoogleNews-vectors-negative300.magnitude")
df = pd.read_csv('/content/SimLex-999.txt', sep='\t')[['word1', 'word2', 'SimLex999']]
human_scores = []
vector_scores = []

counter = 0
for word1, word2, score in df.values.tolist():
    human_scores.append(score)
    similarity_score = vectors.similarity(word1, word2)
    vector_scores.append(similarity_score)
    if counter < 5: # only print the first five
        print(f'{word1},{word2},{score},{similarity_score:.4f}')
        counter += 1

print()
correlation, p_value = stats.kendalltau(human_scores, vector_scores)
print(f'Correlation = {correlation}, P Value = {p_value}')

old,new,1.58,0.2228
smart,intelligent,9.2,0.6495
hard,difficult,8.77,0.6026
happy,cheerful,9.55,0.3838
hard,easy,0.95,0.4710

Correlation = 0.30913428432001067, P Value = 2.6592796177776212e-48


In this part of the assignment we would like you to explore how the Kendall’s Tau correlation changes depending on the embedding model used to compute the vector similarity. You may use the script we provided or create your own script.

**Please respond to the following questions in your report.**

(Note: **5 extra points** will be awarded for creativity and a more thorough qualitative analysis.)

 - **Answer 2.1:** What are the 2 least similar pairs of words based on human judgment scores and on vector similarity? Do the pairs match? [3 points]

**TODO**: [Least similar pairs] **[writeup.pdf]**

In [24]:
word_pairs = []
sim_scores = []
human_sim_scores = []
for word1, word2, score in df.values.tolist():
    word_pairs.append((word1, word2))
    sim_score = vectors.similarity(word1, word2)
    sim_scores.append(sim_score)
    human_sim_scores.append(score)

sim_scores_array = np.array(sim_scores)
idx = np.argmin(sim_scores_array)
print("least pair in vector similarity {}".format(word_pairs[idx]))

human_scores_array = np.array(human_sim_scores)
idx = np.argmin(human_scores_array)
print("least pair in human similarity {}".format(word_pairs[idx]))

least pair in vector similarity ('house', 'key')
least pair in human similarity ('new', 'ancient')
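
Note that `np.argmin` and `np.argmax` return a single index each. If the question is read as asking for more than one pair per measure, sorting the scores is one simple option. The helper below (`extreme_pairs`, a hypothetical name) is a sketch of that idea, run on the five human scores printed by the reference script rather than on the full dataset.

```python
def extreme_pairs(pairs, scores, k=2, largest=False):
    """Return the k pairs with the smallest (or largest) scores."""
    ranked = sorted(zip(scores, pairs), key=lambda item: item[0], reverse=largest)
    return [pair for _, pair in ranked[:k]]

# First five SimLex-999 rows as printed earlier in this notebook.
pairs = [("old", "new"), ("smart", "intelligent"), ("hard", "difficult"),
         ("happy", "cheerful"), ("hard", "easy")]
human = [1.58, 9.2, 8.77, 9.55, 0.95]

least_two = extreme_pairs(pairs, human, k=2)
most_two = extreme_pairs(pairs, human, k=2, largest=True)
print("least similar:", least_two)
print("most similar: ", most_two)
```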


 - **Answer 2.2:** What are the 2 most similar pairs of words based on human judgment scores and on vector similarity? Do the pairs match? [3 points]

**TODO**: [Most similar pairs] **[writeup.pdf]**

In [25]:
idx = np.argmax(sim_scores_array)
print("most similar pair in vector similarity {}".format(word_pairs[idx]))

human_scores_array = np.array(human_sim_scores)
idx = np.argmax(human_scores_array)
print("most similar pair in human similarity {}".format(word_pairs[idx]))

most similar pair in vector similarity ('south', 'north')
most similar pair in human similarity ('vanish', 'disappear')


- **Answer 2.3:** Provide correlation scores and p-values for the following models:
   - (Stanford - GloVe Wikipedia 2014 + Gigaword 5 6B Medium 50D) `glove.6B.50d.magnitude`
   - (Stanford - GloVe Wikipedia 2014 + Gigaword 5 6B Medium 100D) `glove.6B.100d.magnitude`
   - (Stanford - GloVe Wikipedia 2014 + Gigaword 5 6B Medium 200D) `glove.6B.200d.magnitude`
   - (Stanford - GloVe Wikipedia 2014 + Gigaword 5 6B Medium 300D) `glove.6B.300d.magnitude`
   - (Stanford - GloVe Common Crawl Medium 300D) `glove.840B.300d.magnitude`

  **How do those correlation values compare to each other?** [4 points]

**TODO**: [Discussion] **[writeup.pdf]**

In [None]:
%%capture
!wget http://magnitude.plasticity.ai/glove/medium/glove.6B.50d.magnitude
!wget http://magnitude.plasticity.ai/glove/medium/glove.6B.100d.magnitude
!wget http://magnitude.plasticity.ai/glove/medium/glove.6B.200d.magnitude
!wget http://magnitude.plasticity.ai/glove/medium/glove.6B.300d.magnitude
!wget http://magnitude.plasticity.ai/glove/medium/glove.840B.300d.magnitude

In [None]:
# if the above links do not work, please uncomment the below lines and run them

In [None]:
!gdown 1r0ebRDG-_4ALl3PJ7Vko0DkLcMdLPIoL # glove.6B.50d.magnitude
!gdown 1TQ5W7mma_fYKqVL-Dm7_ogwIftyJpXAT # glove.6B.100d.magnitude
!gdown 1LiKprfuwD434FGC-bf8OARMIKCtNIL4Z # glove.6B.200d.magnitude
!gdown 1_p-9y15JvbobeJ37L5v4kXnWMXsfHsD4 # glove.6B.300d.magnitude
!gdown 1zs0Z-m7YbbVbKvqkq-HEIxNYp3e75-7e # glove.840B.300d.magnitude

Downloading...
From (original): https://drive.google.com/uc?id=1r0ebRDG-_4ALl3PJ7Vko0DkLcMdLPIoL
From (redirected): https://drive.google.com/uc?id=1r0ebRDG-_4ALl3PJ7Vko0DkLcMdLPIoL&confirm=t&uuid=a8570c14-3ffa-4621-a2c2-112d08b6e932
To: /content/glove.6B.50d.magnitude
100% 211M/211M [00:02<00:00, 82.8MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1TQ5W7mma_fYKqVL-Dm7_ogwIftyJpXAT
From (redirected): https://drive.google.com/uc?id=1TQ5W7mma_fYKqVL-Dm7_ogwIftyJpXAT&confirm=t&uuid=faa12057-7a32-404b-bb06-f67e56a8eb20
To: /content/glove.6B.100d.magnitude
100% 302M/302M [00:03<00:00, 79.3MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1LiKprfuwD434FGC-bf8OARMIKCtNIL4Z
From (redirected): https://drive.google.com/uc?id=1LiKprfuwD434FGC-bf8OARMIKCtNIL4Z&confirm=t&uuid=ff13bcb6-0636-4223-9188-938ee4474d12
To: /content/glove.6B.200d.magnitude
100% 507M/507M [00:03<00:00, 133MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1_p-9y15Jvbo

In [None]:
os.listdir("./")

['.config',
 'GoogleNews-vectors-negative300.magnitude',
 'notebook-config.yaml',
 'sample_data']

In [26]:
def get_correlation(file_path):
    vectors = Magnitude(file_path)
    df = pd.read_csv('/content/SimLex-999.txt', sep='\t')[['word1', 'word2', 'SimLex999']]
    human_scores = []
    vector_scores = []

    for word1, word2, score in df.values.tolist():
        human_scores.append(score)
        similarity_score = vectors.similarity(word1, word2)
        vector_scores.append(similarity_score)

    correlation, p_value = stats.kendalltau(human_scores, vector_scores)
    print(f'file_path = {file_path} Correlation = {correlation}, P Value = {p_value}')
    return correlation, p_value

In [27]:
## YOUR CODE HERE ##
# you can re-use the code from the Reference Code
magnitude_files = [
    "/content/glove.6B.50d.magnitude",
    "/content/glove.6B.100d.magnitude",
    "/content/glove.6B.200d.magnitude",
    "/content/glove.6B.300d.magnitude",
    "/content/glove.840B.300d.magnitude",
]

for file_path in magnitude_files:
    get_correlation(file_path)



file_path = /content/glove.6B.50d.magnitude Correlation = 0.18100126067449063, P Value = 1.2242211264976856e-17
file_path = /content/glove.6B.100d.magnitude Correlation = 0.20506409092608713, P Value = 3.41228663395174e-22
file_path = /content/glove.6B.200d.magnitude Correlation = 0.23670323199262908, P Value = 4.9936324557833286e-29
file_path = /content/glove.6B.300d.magnitude Correlation = 0.25894302181101986, P Value = 2.080389068003349e-34
file_path = /content/glove.840B.300d.magnitude Correlation = 0.2860664813618063, P Value = 1.293335613361039e-41


# Section 3: Creating Word Sense Clusters [96 points]
**Background:** Many Natural Language Processing (NLP) tasks require knowing the sense of polysemous words, which are words with multiple meanings. For example, the word bug can mean:

1. A creepy crawly thing
2. An error in your computer code
3. A virus or bacteria that makes you sick
4. A listening device planted by the FBI

In past research my PhD students and I have looked into automatically deriving the different meanings of polysemous words like bug by clustering their paraphrases. We have developed a resource called [the paraphrase database (PPDB)](http://paraphrase.org/) that contains paraphrases for tens of millions of words and phrases. For the target word bug, we have an unordered list of paraphrases including: insect, glitch, beetle, error, microbe, wire, cockroach, malfunction, microphone, mosquito, virus, tracker, pest, informer, snitch, parasite, bacterium, fault, mistake, failure and many others. We used automatic clustering to group those into sets like:

<img src='https://drive.google.com/uc?id=1-YbbvZ0qwRKPiHZ1ZWOm62dfCkfu4tLJ'>

The clusters in the image above approximate the different word senses of bug, where the 4 circles are the 4 senses of bug. The input to this problem is all the paraphrases in a single list, and the task is to separate them correctly. As humans, this is pretty intuitive, but computers are not that smart. You will explore the main idea underlying our word sense clustering method: measure the similarity between each pair of paraphrases for a target word, and then group together the paraphrases that are most similar to each other. This affinity matrix gives an example of one of the methods for measuring similarity that we tried in our [paper](https://www.cis.upenn.edu/~ccb/publications/clustering-paraphrases-by-word-sense.pdf):

<img src='https://drive.google.com/uc?id=1v1dBzwoSM3S3Y1wDUwqcVBEZ7GxxKKJ4'>

Here the darkness values give an indication of how similar paraphrases are to each other. For instance, in this example the similarity between *insect* and *pest* is greater than the similarity between *insect* and *error*. You can read more about this task in [these](https://www.cis.upenn.edu/~ccb/publications/clustering-paraphrases-by-word-sense.pdf) [papers](https://cs.uwaterloo.ca/~cdimarco/pdf/cs886/Pantel+Lin02.pdf).
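
An affinity matrix of this kind is simply a table of pairwise similarity scores. As a rough sketch, the code below builds one from cosine similarities over made-up 2-dimensional vectors; the toy embeddings and the function name `affinity_matrix` are illustrative assumptions, and the paper explores several similarity measures beyond this one.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def affinity_matrix(words, embeddings):
    """matrix[i][j] holds the similarity between words[i] and words[j]."""
    return [[cosine(embeddings[a], embeddings[b]) for b in words] for a in words]

# Hypothetical toy embeddings for three paraphrases of "bug".
toy = {
    "insect": [1.0, 0.1],
    "pest":   [0.9, 0.3],
    "error":  [0.1, 1.0],
}

words = list(toy)
matrix = affinity_matrix(words, toy)
for word, row in zip(words, matrix):
    print(f"{word:>7}", " ".join(f"{value:.2f}" for value in row))
```

The matrix is symmetric with ones on the diagonal, so only the entries above the diagonal carry information; a clustering algorithm can work either from this matrix or directly from the vectors.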

In this assignment, we will use vector representations to measure the similarity of pairs of paraphrases. You will play with different vector space representations of words to create clusters of word senses. We expect that you have read Jurafsky and Martin [Chapter 6](https://web.stanford.edu/~jurafsky/slp3/6.pdf). Word vectors, also known as word embeddings, can be thought of simply as points in some high-dimensional space. Remember in geometry class when you learned about the Euclidean plane, and 2-dimensional points in that plane? It’s not hard to understand distance between those points – you can even measure it with a ruler. Then you learned about 3-dimensional points, and how to calculate the distance between these. These 3-dimensional points can be thought of as positions in physical space.

Now, do your best to stop thinking about physical space, and generalize this idea in your mind: you can calculate a distance between 2-dimensional and 3-dimensional points, now imagine a point with `N` dimensions. The dimensions don’t necessarily have meaning in the same way as the X, Y, and Z dimensions in physical space, but we can calculate distances all the same.

This is how we will use word vectors in this assignment: as points in some high-dimensional space, where distances between points are meaningful. The interpretation of distance between word vectors depends entirely on how they were made, but for our purposes, we will consider distance to measure semantic similarity. Word vectors that are close together should have meanings that are similar.

With this framework, we can see how to solve our paraphrase clustering problem.

**The Data:**
The input data to be used for this assignment consists of sets of paraphrases, each corresponding to one polysemous target word, e.g.

Target    | Paraphrase set
----------|------------------
note.v    | comment mark tell observe state notice say remark mention
hot.a     | raging spicy blistering red-hot live

(Here the `.v` following the target `note` indicates the part of speech: `.v` for verb, `.a` for adjective.)

Your objective is to automatically cluster each paraphrase set such that each cluster contains words pertaining to a single sense, or meaning, of the target word. Note that a single word from the paraphrase set might belong to one or more clusters.

**Development Data:** The development data consists of two files:

1. words file (input)
2. clusters file (output)

The words file `dev_input.txt` is formatted such that each line contains one target, the number of ground truth clusters `k`, and the target's paraphrase set, separated by a `::` symbol. You can use `k` as input to your clustering algorithm.

`target.pos :: k :: paraphrase1 paraphrase2 paraphrase3 ...`

The clusters file `dev_output.txt` contains the ground truth clusters for each target word’s paraphrase set, split over k lines:

```
target.pos :: 1 :: paraphrase2 paraphrase6
target.pos :: 2 :: paraphrase3 paraphrase4 paraphrase5
    .
    .
    .
target.pos :: k :: paraphrase1 paraphrase9
```

**Test data:** For testing Tasks 3.1 – 3.3, you will receive only the words file `test_input.txt`, containing the test target words, the number of ground truth clusters, and their paraphrase sets. For testing Task 3.4, you will receive only the words file `test_nok_input.txt`, containing the test target words and their paraphrase sets. Neither the order of senses nor the order of words in a cluster matters.

**Evaluation:** There are many possible ways to evaluate clustering solutions. For this homework we will rely on the paired F-score, which you can read more about in [this paper](https://www.cs.york.ac.uk/semeval2010_WSI/paper/semevaltask14.pdf).

The general idea behind paired F-score is to treat clustering prediction like a classification problem; given a target word and its paraphrase set, we call a *positive instance* any pair of paraphrases that appear together in a ground-truth cluster. Once we predict a clustering solution for the paraphrase set, we similarly generate the set of word pairs such that both words in the pair appear in the same predicted cluster. We can then evaluate our set of predicted pairs against the ground truth pairs using precision, recall, and F-score.
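As a worked example of the paired F-score idea, the sketch below builds the set of same-cluster word pairs for a gold and a predicted clustering and compares them. The helper name `same_cluster_pairs` is introduced here for illustration only; the clusterings are toy data.

```python
from itertools import combinations

def same_cluster_pairs(clustering):
    """Return the set of unordered word pairs that share a cluster."""
    pairs = set()
    for cluster in clustering:
        for pair in combinations(cluster, 2):
            pairs.add(tuple(sorted(pair)))
    return pairs

gold = [['comment', 'remark'], ['mark', 'observe', 'state'], ['tell', 'say', 'mention']]
predicted = [['comment', 'remark', 'mark'], ['observe', 'state'], ['tell', 'say', 'mention']]

gold_pairs = same_cluster_pairs(gold)
predicted_pairs = same_cluster_pairs(predicted)
overlap = gold_pairs & predicted_pairs

precision = len(overlap) / len(predicted_pairs)
recall = len(overlap) / len(gold_pairs)
f_score = 2 * precision * recall / (precision + recall)

print(len(gold_pairs), len(predicted_pairs), len(overlap))  # 7 7 5
print(round(f_score, 4))  # 0.7143
```

Moving *mark* into the wrong cluster costs two gold pairs (*mark*–*observe*, *mark*–*state*) and adds two incorrect predicted pairs, so precision and recall are both 5/7.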

V-Measure is another metric used to evaluate clustering solutions; however, we will not be using it in this assignment.

**Tasks:**
Your task is to fill in 4 functions: `cluster_random`, `cluster_with_sparse_representation`, `cluster_with_dense_representation`, `cluster_with_no_k`.

We provided 5 utility functions for you to use:

1. `load_input_file(file_path)` that converts the input data (the words file) into 2 dictionaries. The first dictionary is a mapping between a target word and a list of paraphrases. The second dictionary is a mapping between a target word and a number of clusters for a given target word.

2. `load_output_file(file_path)` that converts the output data (the clusters file) into a dictionary, where a key is a target word and a value is its list of lists of paraphrases. Each list of paraphrases is a cluster. Remember that neither the order of senses nor the order of words in a cluster matters.

3. `get_paired_f_score(gold_clustering, predicted_clustering)` that calculates paired F-score given a gold and predicted clustering for a target word.

4. `evaluate_clusterings(gold_clusterings, predicted_clusterings)` that calculates paired F-score for all target words present in the data and prints the final F-Score weighted by the number of senses that a target word has.

5. `write_to_output_file(file_path, clusterings)` that writes the result of the clustering for each target word into the output file (clusters file).

Full points will be awarded for each of the tasks if your implementation gets above a certain threshold on the test dataset. Please submit to the autograder to see the thresholds. Note that thresholds are based on the scores from the previous year and might be lowered depending on the average performance.
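To make the weighting used by `evaluate_clusterings` concrete, here is a small sketch of a weighted average in which each target word's paired F-score counts in proportion to its number of senses `k`. The three (F-score, `k`) values are copied from the sample random-clustering table in Section 3.1 purely for illustration.

```python
import numpy as np

# (paired F-score, k) for note.v, play.v and party.n from the sample table.
f_scores = np.array([0.6667, 0.0896, 0.2480])
ks = np.array([3, 34, 5])

weighted = np.average(f_scores, weights=ks)  # sum(f * k) / sum(k)
unweighted = f_scores.mean()

print(f"weighted by k: {weighted:.4f}")
print(f"unweighted   : {unweighted:.4f}")
```

Because `play.v` has many senses and a low score, it pulls the weighted average well below the plain mean; targets with many senses dominate the final number.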

In [28]:
# Helper functions, DO NOT MODIFY
def load_input_file(file_path):
    """
    Loads the input file to two dictionaries
    :param file_path: path to an input file
    :return: 2 dictionaries:
    1. Dictionary, where key is a target word and value is a list of paraphrases
    2. Dictionary, where key is a target word and value is a number of clusters
    """
    word_to_paraphrases_dict = {}
    word_to_k_dict = {}

    with open(file_path, 'r') as fin:
        for line in fin:
            target_word, k, paraphrases = line.split(' :: ')
            word_to_k_dict[target_word] = int(k)
            word_to_paraphrases_dict[target_word] = paraphrases.split()

    return word_to_paraphrases_dict, word_to_k_dict

    # Example: for the target word 'note', the first dictionary maps it to its list of
    # paraphrases and the second maps it to its number of clusters k, e.g.
    # {'note': ['currency', 'comment', 'mark', 'tell']}, {'note': 4}

def load_output_file(file_path):
    """
    :param file_path: path to an output file
    :return: A dictionary, where key is a target word and value is a list of list of paraphrases
    """
    clusterings = {}

    with open(file_path, 'r') as fin:
        for line in fin:
            target_word, _, paraphrases_in_cluster = line.strip().split(' :: ')
            paraphrases_list = paraphrases_in_cluster.strip().split()
            if target_word not in clusterings:
                clusterings[target_word] = []
            clusterings[target_word].append(paraphrases_list)

    return clusterings

    # Example return value (the key is a target word; each inner list is one cluster):
    # {'note': [['comment', 'remark'], ['mark', 'observe', 'state'], ['tell', 'say', 'mention']]}


def get_paired_f_score(gold_clustering, predicted_clustering):
    """
    :param gold_clustering: gold list of list of paraphrases
    :param predicted_clustering: predicted list of list of paraphrases
    :return: Paired F-Score
    """
    gold_pairs = set()
    for gold_cluster in gold_clustering:
        for pair in combinations(gold_cluster, 2):
            gold_pairs.add(tuple(sorted(pair)))

    predicted_pairs = set()
    for predicted_cluster in predicted_clustering:
        for pair in combinations(predicted_cluster, 2):
            predicted_pairs.add(tuple(sorted(pair)))

    overlapping_pairs = gold_pairs & predicted_pairs

    precision = 1. if len(predicted_pairs) == 0 else float(len(overlapping_pairs)) / len(predicted_pairs)
    recall = 1. if len(gold_pairs) == 0 else float(len(overlapping_pairs)) / len(gold_pairs)
    paired_f_score = 0. if precision + recall == 0 else 2 * precision * recall / (precision + recall)

    return paired_f_score


    # Example call:
    # gold_clustering = [['comment', 'remark'], ['mark', 'observe', 'state'], ['tell', 'say', 'mention']]
    # predicted_clustering = [['comment', 'remark', 'mark'], ['observe', 'state'], ['tell', 'say', 'mention']]
    # f_score = get_paired_f_score(gold_clustering, predicted_clustering)
    # print(f_score)  # Output: about 0.714 (5 shared pairs; 7 gold pairs and 7 predicted pairs)

def evaluate_clusterings(gold_clusterings, predicted_clusterings):
    """
    Displays evaluation scores between gold and predicted clusterings
    :param gold_clusterings: dictionary where key is a target word and value is a list of list of paraphrases
    :param predicted_clusterings: dictionary where key is a target word and value is a list of list of paraphrases
    :return: N/A
    """
    target_words = set(gold_clusterings.keys()) & set(predicted_clusterings.keys())

    if len(target_words) == 0:
        print('No overlapping target words in ground-truth and predicted files')
        return None

    paired_f_scores = np.zeros((len(target_words)))
    ks = np.zeros((len(target_words)))

    table = PrettyTable(['Target', 'k', 'Paired F-Score'])
    for i, target_word in enumerate(target_words):
        paired_f_score = get_paired_f_score(gold_clusterings[target_word], predicted_clusterings[target_word])
        k = len(gold_clusterings[target_word])
        paired_f_scores[i] = paired_f_score
        ks[i] = k
        table.add_row([target_word, k, f'{paired_f_score:0.4f}'])

    average_f_score = np.average(paired_f_scores, weights=ks)
    print(table)
    print(f'=> Average Paired F-Score:  {average_f_score:.4f}')

    #example call below -> averages paired f score also weighted by no of senses that a target word has
    #gold_clusterings = {'note.v': [['comment', 'remark'], ['mark', 'observe', 'state'], ['tell', 'say', 'mention']]}
    #predicted_clusterings = {'note.v': [['comment', 'remark', 'mark'], ['observe', 'state'], ['tell', 'say', 'mention']]}
    #evaluate_clusterings(gold_clusterings, predicted_clusterings)

def write_to_output_file(file_path, clusterings):
    """
    Writes the result of clusterings into an output file
    :param file_path: path to an output file
    :param clusterings:  A dictionary, where key is a target word and value is a list of list of paraphrases
    :return: N/A
    """
    with open(file_path, 'w') as fout:
        for target_word, clustering in clusterings.items():
            for i, cluster in enumerate(clustering):
                fout.write(f'{target_word} :: {i + 1} :: {" ".join(cluster)}\n')
        fout.close()

## 3.1. Cluster Randomly [11 points]
Write a function `cluster_random(word_to_paraphrases_dict, word_to_k_dict)` that accepts 2 dictionaries:

1. word_to_paraphrases_dict = a mapping between a target word and a list of paraphrases

2. word_to_k_dict = a mapping between a target word and a number of clusters for a given target

The function outputs a dictionary, where the key is a target word and a value is a list of list of paraphrases, where a list of paraphrases represents a distinct sense of a target word.

For this task, put paraphrases into distinct senses at random. That is, pick a random word for each cluster, as opposed to picking a random cluster for each word. This ensures that every cluster has at least one word in it. We recommend using a random-number package such as `random`. Please use 123 as the random seed. Your output should look similar to this on the development dataset:

```
word_to_paraphrases_dict, word_to_k_dict = load_input_file('dev_input.txt')
gold_clusterings = load_output_file('dev_output.txt')
predicted_clusterings = cluster_random(word_to_paraphrases_dict, word_to_k_dict)
evaluate_clusterings(gold_clusterings, predicted_clusterings)
```
```
+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    paper.n     | 7  |     0.2978     |
|     play.v     | 34 |     0.0896     |
|     miss.v     | 8  |     0.2376     |
|   produce.v    | 7  |     0.2335     |
|    party.n     | 5  |     0.2480     |
|     note.v     | 3  |     0.6667     |
|     bank.n     | 9  |     0.1515     |
    .
    .
    .
|     eat.v      | 6  |     0.2908     |
|    climb.v     | 6  |     0.2427     |
|    degree.n    | 7  |     0.2891     |
|   interest.n   | 5  |     0.2093     |
+----------------+----+----------------+
=> Average Paired F-Score:  0.2318
```

- **Problem 3.1:** Implement `cluster_random` function. **The autograder for 3.2, 3.3, 3.4 will grade your implementation based on the test-set `f_score` achieved by the clustering.**  [10 points]


# Recitation


        word_to_paraphrases_dict = {
            'note.v': ['comment', 'mark', 'tell', 'observe', 'state', 'notice', 'say', 'remark', 'mention'],
            'hot.a': ['raging', 'spicy', 'blistering', 'red-hot', 'live']
        }

        word_to_k_dict = {
            'note.v': 3,
            'hot.a': 2
        }

        Expected output:
        {
            'note.v': [['remark', 'mention'], ['comment', 'state', 'tell', 'mark'], ['observe', 'say', 'notice']],
            'hot.a': [['blistering', 'red-hot'], ['live', 'spicy', 'raging']]
        }

        # The paraphrases of 'note.v' in reshuffled order:
        shuffled = ['tell', 'remark', 'say', 'comment', 'mention', 'observe', 'state', 'mark', 'notice']


In [29]:
import random
random.seed(123)
import numpy as np
np.random.seed(123)
def cluster_random(word_to_paraphrases_dict, word_to_k_dict):
    """
    Clusters paraphrases randomly
    :param word_to_paraphrases_dict: dictionary, where key is a target word and value is a list of paraphrases
    :param word_to_k_dict: dictionary, where key is a target word and value is a number of clusters
    :return: dictionary, where key is a target word and value is a list of list of paraphrases,
    where each list corresponds to a cluster
    """
    clusterings = {}

    for target_word in word_to_paraphrases_dict.keys():
        paraphrase_list = word_to_paraphrases_dict[target_word]
        k = word_to_k_dict[target_word]
        # TODO: Implement beg
        # One empty list per cluster.
        cluster = [[] for _ in range(k)]
        if k <= 1:
            # Nothing to split: keep every paraphrase in a single cluster.
            cluster = [paraphrase_list[:]]
        else:
            # Shuffle the paraphrase indices, then deal them out round-robin so that
            # every cluster receives a word (as long as there are at least k paraphrases).
            numbers = list(range(len(paraphrase_list)))
            random.shuffle(numbers)

            for i, idx in enumerate(numbers):
                cluster[i % k].append(paraphrase_list[idx])

        clusterings[target_word] = cluster
        # TODO: Implement end

    return clusterings

- **Answer 3.1:** Run clustering on `dev` data, report the `f_scores` from the `dev` data [1 point]

**TODO**: [Report f_scores from the dev data] **[writeup.pdf]**

In [None]:
### Reference Code ###
###### You can use the following code to test your clustering on dev data ######
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_random(word_to_paraphrases_dict_dev, word_to_k_dict_dev)
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    smell.v     | 4  |     0.2857     |
|    image.n     | 9  |     0.1204     |
|   express.v    | 7  |     0.1562     |
|     talk.v     | 6  |     0.2428     |
|     play.v     | 34 |     0.0308     |
|     miss.v     | 8  |     0.1972     |
|   produce.v    | 7  |     0.1968     |
|    write.v     | 9  |     0.1507     |
|   provide.v    | 7  |     0.2204     |
|    party.n     | 5  |     0.2017     |
|     bank.n     | 9  |     0.0741     |
|     plan.n     | 3  |     0.3693     |
|   shelter.n    | 5  |     0.2500     |
|  difference.n  | 5  |     0.2564     |
|     eat.v      | 6  |     0.2186     |
|     mean.v     | 6  |     0.1772     |
|    treat.v     | 8  |     0.1610     |
|     use.v      | 6  |     0.2407     |
|   suspend.v    | 6  |     0.1304     |
|   judgment.n   | 7  |     0.1480     |
| organization.n | 7  |     0.1734     |
|   interest.n  

Run the following command to generate the output file for the predicted clusterings for the test dataset.

In [None]:
word_to_paraphrases_dict, word_to_k_dict = load_input_file('test_input.txt')
predicted_clusterings_random = cluster_random(word_to_paraphrases_dict, word_to_k_dict)
# write_to_output_file('test_output_random.txt', predicted_clusterings_random)

In [None]:
# PennGrader - DO NOT CHANGE
# reload_grader()
grader.grade(test_case_id = 'test_q3_clusters_random', answer = (predicted_clusterings_random, 'random'))

Correct! You earned 10/10 points. You are a star!

Your submission has been successfully recorded in the gradebook.


## 3.2 Cluster with Sparse Representations [26 points]

Write a function `cluster_with_sparse_representation(word_to_paraphrases_dict, word_to_k_dict)`. The input and output remain the same as in Task 3.1; however, the clustering of paraphrases will no longer be random and is instead based on a sparse vector representation.

We will use a feature-based (not dense) vector space representation. In this type of VSM, each dimension of the vector space corresponds to a specific feature, such as a context word (see, for example, the term-context matrix described in [Chapter 6.1.2 of Jurafsky & Martin](https://web.stanford.edu/~jurafsky/slp3/6.pdf)). You will use cooccurrence vectors calculated on the Reuters RCV1 corpus. It can take a long time to build cooccurrence vectors, so we have provided a pre-built set called `coocvec-500mostfreq-window-3.filter.magnitude`. To save on space, these include only the words used in the given files. This representation of words uses a term-context matrix `M` of size `|V| x D`, where `|V|` is the size of the vocabulary and `D=500`. Each feature corresponds to one of the top 500 most-frequent words in the corpus. The value of matrix entry `M[i][j]` gives the number of times the context word represented by column `j` appeared within `W=3` words to the left or right of the word represented by row `i` in the corpus.

Use one of the clustering algorithms, for instance K-means clustering, in `cluster_with_sparse_representation(word_to_paraphrases_dict, word_to_k_dict)`. An example of the K-means clustering code follows the recitation notes below.

# Recitation
This function clusters paraphrases based on sparse vector representations, using the concept of a cooccurrence matrix.

1. Cooccurrence Vectors: A cooccurrence vector records how often a word co-occurs with each other word in the vocabulary within a given context window.
Suppose we have the following sentences in our corpus:

* "I like eating apples."
* "Apples are delicious."
* "She bought some apples from the market."
* "The teacher gave us apples as a snack."

Our vocabulary consists of the following words: "I", "like", "eating", "apples", "are", "delicious", "she", "bought", "some", "from", "the", "market", "teacher", "gave", "us", "as", "a", "snack".

With a context window of one word on each side, the first few rows of the cooccurrence matrix would look like this (simplified):

|        | I | like | eating | apples | are | delicious | she | bought | some | from | the | market | teacher | gave | us | as | a | snack |
|--------|---|------|--------|--------|-----|-----------|-----|--------|------|------|-----|--------|---------|------|----|----|---|-------|
| I      | 0 | 1    | 0      | 0      | 0   | 0         | 0   | 0      | 0    | 0    | 0   | 0      | 0       | 0    | 0  | 0  | 0 | 0     |
| like   | 1 | 0    | 1      | 0      | 0   | 0         | 0   | 0      | 0    | 0    | 0   | 0      | 0       | 0    | 0  | 0  | 0 | 0     |
| eating | 0 | 1    | 0      | 1      | 0   | 0         | 0   | 0      | 0    | 0    | 0   | 0      | 0       | 0    | 0  | 0  | 0 | 0     |
| apples | 0 | 0    | 1      | 0      | 1   | 0         | 0   | 0      | 1    | 1    | 0   | 0      | 0       | 0    | 1  | 1  | 0 | 0     |

The value at position (i, j) in the matrix represents the number of times word i co-occurs with word j within the context window.
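A minimal sketch of how such counts can be collected for the four toy sentences is shown below. It assumes a window of one word on each side, lowercases every token, and strips punctuation; the function name `cooccurrence_counts` is hypothetical, and this is not how the provided `.magnitude` file was built.

```python
import re
from collections import Counter, defaultdict

sentences = [
    "I like eating apples.",
    "Apples are delicious.",
    "She bought some apples from the market.",
    "The teacher gave us apples as a snack.",
]

def cooccurrence_counts(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

counts = cooccurrence_counts(sentences, window=1)
print(dict(counts["apples"]))
print(dict(counts["like"]))
```

Each `Counter` is one sparse row of the matrix: context words that never appear next to the target are simply absent, which corresponds to a zero entry.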

This is an example of a sparse vector representation because most entries of the matrix above are zero. We are using a sparse vector representation from Magnitude:

**coocvec-500mostfreq-window-3.filter.magnitude**: Derived from a co-occurrence matrix based on the Reuters RCV1 corpus. Each dimension represents the count of the word's co-occurrence with one of the top 500 most frequent context words within a window of 3 words.




```
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=k).fit(X)
print(kmeans.labels_)
```
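The snippet above leaves `X` and `k` undefined. As a self-contained sketch, the example below runs K-means on four made-up 2-dimensional vectors and turns the resulting labels into a list of clusters. The words and vectors are invented for illustration; `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: one row of X per word (values invented for illustration).
words = ["spicy", "peppery", "blistering", "scorching"]
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])
k = 2

kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Group the words by the cluster label K-means assigned to each row.
clusters = [[] for _ in range(k)]
for word, label in zip(words, kmeans.labels_):
    clusters[label].append(word)

print(kmeans.labels_)
print(clusters)
```

The numbering of the labels is arbitrary, so only the grouping matters, which is consistent with the paired F-score ignoring the order of senses.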



In [37]:
vectors_filter = Magnitude("coocvec-500mostfreq-window-3.filter.magnitude")
vectors_filter.dim

500

In [38]:
len(vectors_filter)

2178

In [None]:
vectors_filter.query("wage")

array([4.596719e-01, 3.361610e-01, 3.025449e-01, 2.642704e-01,
       2.494139e-01, 2.509247e-01, 3.181569e-01, 3.522766e-01,
       1.480619e-01, 4.935400e-02, 8.561400e-02, 6.924700e-03,
       6.093710e-02, 2.366980e-02, 1.913730e-02, 1.578824e-01,
       9.039840e-02, 3.802270e-02, 2.832820e-02, 6.219610e-02,
       8.397730e-02, 5.514550e-02, 4.910220e-02, 5.136840e-02,
       2.845410e-02, 7.931890e-02, 1.611560e-02, 0.000000e+00,
       3.600830e-02, 3.361610e-02, 4.532500e-03, 2.304030e-02,
       3.424560e-02, 2.278840e-02, 1.573790e-02, 2.027040e-02,
       2.631370e-02, 4.041490e-02, 3.009080e-02, 4.343650e-02,
       3.273480e-02, 8.309600e-03, 2.933540e-02, 3.764500e-02,
       2.052220e-02, 1.888500e-03, 1.070180e-02, 2.304030e-02,
       5.162000e-03, 1.334570e-02, 2.505470e-02, 2.769900e-03,
       0.000000e+00, 1.548610e-02, 1.775230e-02, 4.784300e-03,
       1.095360e-02, 1.485660e-02, 4.532500e-03, 1.548610e-02,
       1.158310e-02, 1.057590e-02, 2.404750e-02, 1.1834

In [None]:
vectors_filter.most_similar("provide", topn=10)

[('pay', np.float32(0.96469593)),
 ('make', np.float32(0.9622039)),
 ('initiate', np.float32(0.95320475)),
 ('deliver', np.float32(0.95313513)),
 ('produce', np.float32(0.95056176)),
 ('develop', np.float32(0.950259)),
 ('give', np.float32(0.94986254)),
 ('add', np.float32(0.9494212)),
 ('invite', np.float32(0.9436469)),
 ('help', np.float32(0.9426739))]

In [None]:
# pick a random vector
random_idx = random.randint(0, len(vectors_filter)-1)
vectors_filter.index(random_idx)[1]

array([0. , 0.5, 0. , 0.5, 0. , 0. , 0. , 0. , 0. , 0. , 0.5, 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
       0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.

In [None]:
"bewhisker" in vectors_filter

False

In [None]:
# testing purpose
candidates = ["ramp", "wage", "computerise", "yield", "charge", "articulate", "nourish", "upholster", "fix", "match", "glut", "rail", "furnish", "edge", "fire", "canal", "engage", "grate", "cater", "dish",  "indulge", "arm", "sustain", "capitalise",  "provision", "headquarter", "fulfil", "feed", "fill"]
# "transistorise",  "crenel", "bewhisker", "interleave", "reflectorise", "alphabetize","transistorize", "subtitle",
for cand in candidates:
    sim_score = vectors_filter.similarity("provide", cand)
    print(f"{cand}: {sim_score}")

ramp: 0.768477201461792
wage: 0.6933643817901611
computerise: 0.8486056923866272
yield: 0.6685950756072998
charge: 0.564055323600769
articulate: 0.7511733174324036
nourish: 0.8423349261283875
upholster: 0.4503878951072693
fix: 0.7347373962402344
match: 0.633495032787323
glut: 0.5237604379653931
rail: 0.6514250040054321
furnish: 0.7317972183227539
edge: 0.6045271158218384
fire: 0.6418688893318176
canal: 0.6681944727897644
engage: 0.6400899887084961
grate: 0.6223409175872803
cater: 0.9186713099479675
dish: 0.6929568648338318
indulge: 0.5828347206115723
arm: 0.4609805643558502
sustain: 0.9171574115753174
capitalise: 0.6914789080619812
provision: 0.5384178161621094
headquarter: 0.4618830978870392
fulfil: 0.9044535756111145
feed: 0.7542145252227783
fill: 0.9162811636924744


In [None]:
import os
os.listdir("./")

['.config',
 'dev_output.txt',
 'glove.840B.300d.magnitude',
 'SimLex-999.txt',
 'glove.6B.200d.magnitude',
 'coocvec-500mostfreq-window-3.filter.magnitude',
 'glove.6B.100d.magnitude',
 'GoogleNews-vectors-negative300.magnitude',
 'glove.6B.50d.magnitude',
 'glove.6B.300d.magnitude',
 'test_output_dense.txt',
 'test_nok_input.txt',
 'notebook-config.yaml',
 'dev_input.txt',
 'test_input.txt',
 'sample_data']

In [39]:
vectors2 = Magnitude("coocvec-500mostfreq-window-3.filter.magnitude")
len(vectors2)

2178

- **Problem 3.2:** Implement `cluster_with_sparse_representation` function [20 points]

In [40]:
from sklearn.cluster import KMeans

def cluster_with_sparse_representation(word_to_paraphrases_dict, word_to_k_dict):
    """
    Clusters paraphrases using sparse vector representation
    :param word_to_paraphrases_dict: dictionary, where key is a target word and value is a list of paraphrases
    :param word_to_k_dict: dictionary, where key is a target word and value is a number of clusters
    :return: dictionary, where key is a target word and value is a list of list of paraphrases,
    where each list corresponds to a cluster
    """
    # Note: any vector representation should be in the same directory as this file
    vectors = Magnitude("coocvec-500mostfreq-window-3.filter.magnitude")
    clusterings = {}

    for target_word in word_to_paraphrases_dict.keys():
        paraphrase_list = word_to_paraphrases_dict[target_word]
        k = word_to_k_dict[target_word]
        # One empty list per cluster.
        cluster = [[] for _ in range(k)]
        # TODO: Implement
        # Build the feature matrix X: one row (a 500-dimensional vector) per paraphrase.
        X = np.zeros((len(paraphrase_list), vectors.dim))
        for i, paraphrase in enumerate(paraphrase_list):
            if paraphrase in vectors:
                X[i] = vectors.query(paraphrase)
            else:
                # Out-of-vocabulary paraphrase: fall back to the vector of a randomly chosen word.
                random_idx = random.randint(0, len(vectors) - 1)
                X[i] = vectors.index(random_idx)[1].copy()
        # Fit K-means and group the paraphrases by their assigned cluster label.
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(X)
        labels = kmeans.labels_
        for i in range(len(paraphrase_list)):
            cluster[labels[i]].append(paraphrase_list[i])
        # TODO: end
        clusterings[target_word] = cluster

    return clusterings

# Recitation
Example:

        word_to_paraphrases_dict = {
        'note.v': ['comment', 'mark', 'tell', 'observe', 'state', 'notice', 'say', 'remark', 'mention'],
        'hot.a': ['raging', 'spicy', 'blistering', 'red-hot', 'live']
                                    }

          word_to_k_dict = {
          'note.v': 3,
          'hot.a': 2
                           }
        Expected output :
        {
            'note.v': [['remark', 'mention'], ['comment', 'state', 'tell', 'mark'], ['observe', 'say', 'notice']],
            'hot.a': [['blistering', 'red-hot'], ['live', 'spicy', 'raging']]
        }
        X (matrix of features): each paraphrase is represented as a 500-dimensional vector
          Word      dim1 dim2 dim3 ... dim500
          comment   ........................
          mark      ........................
          .
          .
          .
          mention   ........................
        If a paraphrase is not present in the Magnitude file, randomly draw a vector from the file instead.

        Once you have your X, run K-means on it.
        # K-means assigns a cluster number to each paraphrase, e.g.
            # comment 0
            # tell 1
            # state 1
            # say 0

        Based on these (paraphrase, cluster number) pairs, build your list of lists and hence your output dictionary.
        

**Answer 3.2.1:** Run clustering on `dev` data, report the `f_scores` from the `dev` data [1 point]

**TODO**: [Report f_scores from the dev data] **[writeup.pdf]**

In [41]:
# YOUR CODE HERE (you can re-use reference code from 3.1)
###### You can use the following code to test your clustering on dev data ######
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_sparse_dev = cluster_with_sparse_representation(word_to_paraphrases_dict_dev, word_to_k_dict_dev)
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_sparse_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.3273     |
|  atmosphere.n  | 6  |     0.3058     |
|     play.v     | 34 |     0.0845     |
|    smell.v     | 4  |     0.2947     |
|     talk.v     | 6  |     0.3223     |
|     plan.n     | 3  |     0.5772     |
|    party.n     | 5  |     0.2322     |
|   express.v    | 7  |     0.2340     |
|     win.v      | 4  |     0.4051     |
|    paper.n     | 7  |     0.4899     |
|     rule.v     | 7  |     0.2174     |
|   produce.v    | 7  |     0.2159     |
|     note.v     | 3  |     0.5333     |
| performance.n  | 5  |     0.2684     |
|     bank.n     | 9  |     0.3373     |
|     miss.v     | 8  |     0.2182     |
|   operate.v    | 7  |     0.2283     |
|    write.v     | 9  |     0.1660     |
|    source.n    | 9  |     0.2337     |
|     mean.v     | 6  |     0.3804     |
|    expect.v    | 6  |     0.3661     |
|  difference.n 

Run the following command to generate the output file for the predicted clusterings for the test dataset.

In [42]:
word_to_paraphrases_dict, word_to_k_dict = load_input_file('test_input.txt')
predicted_clusterings_sparse = cluster_with_sparse_representation(word_to_paraphrases_dict, word_to_k_dict)
# write_to_output_file('test_output_sparse.txt', predicted_clusterings_sparse)

In [43]:
# PennGrader - DO NOT CHANGE
# reload_grader()
grader.grade(test_case_id = 'test_q3_clusters_sparse', answer = (predicted_clusterings_sparse, 'sparse'))

Correct! You earned 20/20 points. You are a star!

Your submission has been successfully recorded in the gradebook.


**Answer 3.2.2:** Provide a brief description of your method in the report, making sure to describe the vector space model you chose, the clustering algorithm you used, and the results of any preliminary experiments you might have run on the dev set.  [5 points]

Suggestions to improve the performance of your model:

 - ~~What if you reduce or increase `D` in the baseline implementation? **-> Use a vector representation with more or fewer than 500 dimensions.**~~

 - ~~Does it help to change the window `W` used to extract contexts? **-> The file `coocvec-500mostfreq-window-3.filter.magnitude` contains vectors generated with a window size of W=3: when the co-occurrence matrix was built, the context of each word consisted of up to 3 words to its left and 3 words to its right. Try a vector representation built with a different window.**~~

 - Play around with the feature weighting – instead of raw counts, would it help to use PPMI? **-> Convert your co-occurrence matrix to a PPMI matrix and use these weighted vectors for clustering.**

 - Try a different clustering algorithm that’s included with the [scikit-learn clustering package](http://scikit-learn.org/stable/modules/clustering.html), or implement your own. **-> For example, Agglomerative Clustering or DBSCAN.**

 - What if you include additional types of features, like paraphrases in the [Paraphrase Database](http://www.paraphrase.org/) or the part-of-speech of context words? **-> Enrich your vectors with additional features and observe the impact on clustering performance.**

The only feature types that are off-limits are WordNet features.
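As a sketch of the PPMI suggestion, the function below converts a matrix of raw co-occurrence counts into positive pointwise mutual information. It assumes you have raw counts available; the values stored in the provided `.magnitude` file appear to be normalized already, so treat this as an illustration on a toy matrix rather than a drop-in step. The function name `ppmi` and the matrix `M` are introduced here for the example.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI: max(0, log2(P(w, c) / (P(w) * P(c)))) for each cell."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)
    col_sums = counts.sum(axis=0, keepdims=True)
    # Count expected in each cell if words and contexts were independent.
    expected = row_sums * col_sums / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(counts / expected)
    # Zero counts give -inf (or nan for empty rows/columns); treat them as 0.
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

# Toy count matrix: rows are target words, columns are context words.
M = np.array([
    [10, 0, 2],
    [0, 8, 2],
    [1, 1, 6],
])
print(np.round(ppmi(M), 3))
```

PPMI down-weights context words that co-occur with everything and keeps only associations that occur more often than chance, which is why it is a common alternative to raw counts.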

In [44]:
# Agglomerative clustering
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

def cluster_with_sparse_representation_agglo(word_to_paraphrases_dict, word_to_k_dict, linkage_opt):
    """
    Clusters paraphrases using sparse vector representation
    :param word_to_paraphrases_dict: dictionary, where key is a target word and value is a list of paraphrases
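    :param word_to_k_dict: dictionary, where key is a target word and value is a number of clusters
    :param linkage_opt: linkage criterion passed to AgglomerativeClustering ("ward", "complete", "average" or "single")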
    :return: dictionary, where key is a target word and value is a list of list of paraphrases,
    where each list corresponds to a cluster
    """
    # Note: any vector representation should be in the same directory as this file
    vectors = Magnitude("coocvec-500mostfreq-window-3.filter.magnitude")
    clusterings = {}

    for target_word in word_to_paraphrases_dict.keys():
        paraphrase_list = word_to_paraphrases_dict[target_word]
        k = word_to_k_dict[target_word]
        cluster = []
        for i in range(k):
            cluster.append([])
        # TODO: Implement
        X = np.zeros((len(paraphrase_list), vectors.dim))
        for i in range(len(paraphrase_list)):
            paraphrase = paraphrase_list[i]
            if paraphrase in vectors:
                X[i] = vectors.query(paraphrase)
            else:
                # Out-of-vocabulary paraphrase: fall back to a randomly chosen vocabulary vector
                random_idx = random.randint(0, len(vectors) - 1)
                X[i] = vectors.index(random_idx)[1].copy()
        agglo_model = AgglomerativeClustering(n_clusters=k, linkage=linkage_opt)
        labels = agglo_model.fit_predict(X)

        for i in range(len(paraphrase_list)):
            cluster[labels[i]].append(paraphrase_list[i])
        # TODO: end
        clusterings[target_word] = cluster

    return clusterings

In [49]:
# ward
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_with_sparse_representation_agglo(word_to_paraphrases_dict_dev, word_to_k_dict_dev, "ward")
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.2674     |
|  atmosphere.n  | 6  |     0.2597     |
|     play.v     | 34 |     0.0761     |
|    smell.v     | 4  |     0.2955     |
|     talk.v     | 6  |     0.3026     |
|     plan.n     | 3  |     0.5239     |
|    party.n     | 5  |     0.2497     |
|   express.v    | 7  |     0.2630     |
|     win.v      | 4  |     0.3873     |
|    paper.n     | 7  |     0.3442     |
|     rule.v     | 7  |     0.2484     |
|   produce.v    | 7  |     0.2740     |
|     note.v     | 3  |     0.4762     |
| performance.n  | 5  |     0.2428     |
|     bank.n     | 9  |     0.2299     |
|     miss.v     | 8  |     0.1895     |
|   operate.v    | 7  |     0.2174     |
|    write.v     | 9  |     0.2181     |
|    source.n    | 9  |     0.2067     |
|     mean.v     | 6  |     0.3505     |
|    expect.v    | 6  |     0.2533     |
|  difference.n 

In [46]:
# average
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_with_sparse_representation_agglo(word_to_paraphrases_dict_dev, word_to_k_dict_dev, "average")
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.4654     |
|  atmosphere.n  | 6  |     0.3924     |
|     play.v     | 34 |     0.1538     |
|    smell.v     | 4  |     0.3462     |
|     talk.v     | 6  |     0.5996     |
|     plan.n     | 3  |     0.6387     |
|    party.n     | 5  |     0.3567     |
|   express.v    | 7  |     0.4131     |
|     win.v      | 4  |     0.5247     |
|    paper.n     | 7  |     0.5109     |
|     rule.v     | 7  |     0.2911     |
|   produce.v    | 7  |     0.4287     |
|     note.v     | 3  |     0.6400     |
| performance.n  | 5  |     0.4931     |
|     bank.n     | 9  |     0.2429     |
|     miss.v     | 8  |     0.2667     |
|   operate.v    | 7  |     0.2844     |
|    write.v     | 9  |     0.2740     |
|    source.n    | 9  |     0.3248     |
|     mean.v     | 6  |     0.4436     |
|    expect.v    | 6  |     0.3302     |
|  difference.n 

In [47]:
# complete
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_with_sparse_representation_agglo(word_to_paraphrases_dict_dev, word_to_k_dict_dev, "complete")
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.3727     |
|  atmosphere.n  | 6  |     0.3003     |
|     play.v     | 34 |     0.0985     |
|    smell.v     | 4  |     0.2892     |
|     talk.v     | 6  |     0.3480     |
|     plan.n     | 3  |     0.5809     |
|    party.n     | 5  |     0.3024     |
|   express.v    | 7  |     0.2855     |
|     win.v      | 4  |     0.3776     |
|    paper.n     | 7  |     0.3645     |
|     rule.v     | 7  |     0.2584     |
|   produce.v    | 7  |     0.3088     |
|     note.v     | 3  |     0.5333     |
| performance.n  | 5  |     0.3973     |
|     bank.n     | 9  |     0.2597     |
|     miss.v     | 8  |     0.1609     |
|   operate.v    | 7  |     0.2385     |
|    write.v     | 9  |     0.2881     |
|    source.n    | 9  |     0.2305     |
|     mean.v     | 6  |     0.4094     |
|    expect.v    | 6  |     0.3564     |
|  difference.n 

In [48]:
# single
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_with_sparse_representation_agglo(word_to_paraphrases_dict_dev, word_to_k_dict_dev, "single")
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.4666     |
|  atmosphere.n  | 6  |     0.3595     |
|     play.v     | 34 |     0.1712     |
|    smell.v     | 4  |     0.4839     |
|     talk.v     | 6  |     0.6142     |
|     plan.n     | 3  |     0.6387     |
|    party.n     | 5  |     0.3572     |
|   express.v    | 7  |     0.4345     |
|     win.v      | 4  |     0.4813     |
|    paper.n     | 7  |     0.4570     |
|     rule.v     | 7  |     0.2880     |
|   produce.v    | 7  |     0.4385     |
|     note.v     | 3  |     0.6400     |
| performance.n  | 5  |     0.4427     |
|     bank.n     | 9  |     0.2429     |
|     miss.v     | 8  |     0.3511     |
|   operate.v    | 7  |     0.2844     |
|    write.v     | 9  |     0.3516     |
|    source.n    | 9  |     0.3378     |
|     mean.v     | 6  |     0.4361     |
|    expect.v    | 6  |     0.4205     |
|  difference.n 

**TODO**: Description of your method **[writeup.pdf]**

## 3.3 Cluster with Dense Representations [28 points]

Write a function `cluster_with_dense_representation(word_to_paraphrases_dict, word_to_k_dict)`. The input and output remain the same as in Tasks 1 and 2; however, the clustering of paraphrases is based on a dense vector representation.

We would like to see if dense word embeddings are better for clustering the words in our test set. Run the word clustering task again, but this time use a dense word representation.

For this task, we have also included a file called `GoogleNews-vectors-negative300.filter.magnitude`, which is filtered to contain only the words in the dev/test splits.

As before, use the provided word vectors to represent words and perform one of the clusterings. Here are some suggestions to improve the performance of your model:

 - Try downloading a different dense vector space model from the web, like [Paragram](http://www.cs.cmu.edu/~jwieting/) or [fastText](https://fasttext.cc/docs/en/english-vectors.html).
 - Train your own word vectors, either on the provided corpus or something you find online. You can try the skip-gram or CBOW models, or [GloVe](https://nlp.stanford.edu/projects/glove/). Try experimenting with the dimensionality.
 - [Retrofitting](https://www.cs.cmu.edu/~hovy/papers/15HLT-retrofitting-word-vectors.pdf) is a simple way to add additional semantic knowledge to pre-trained vectors. The retrofitting code is available here. Experiment with different lexicons, or even try [counter-fitting](http://www.aclweb.org/anthology/N16-1018).
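
To make the retrofitting suggestion more concrete, the following is a rough sketch of the iterative update, assuming uniform weights (the original vector and the average of the neighbours count equally). The function name `retrofit`, the dictionary-based vector store, and the toy lexicon are illustrative assumptions; this is not the released retrofitting code.

```python
import numpy as np

def retrofit(vectors, lexicon, n_iters=10):
    """Pull each word vector towards the vectors of its lexicon neighbours.

    vectors: dict mapping word -> np.ndarray (the pre-trained embeddings)
    lexicon: dict mapping word -> list of related words (e.g. paraphrases)
    """
    new_vectors = {word: vec.copy() for word, vec in vectors.items()}
    for _ in range(n_iters):
        for word, neighbours in lexicon.items():
            known = [n for n in neighbours if n in new_vectors]
            if word not in vectors or not known:
                continue  # nothing to retrofit for this entry
            neighbour_mean = np.mean([new_vectors[n] for n in known], axis=0)
            # Equal weight on the original vector and on the neighbour average
            new_vectors[word] = (vectors[word] + neighbour_mean) / 2.0
    return new_vectors

# Hypothetical toy embeddings and lexicon
toy_vectors = {
    "big": np.array([1.0, 0.0]),
    "large": np.array([0.0, 1.0]),
    "cat": np.array([5.0, 5.0]),
}
toy_lexicon = {"big": ["large"], "large": ["big"]}
retrofitted = retrofit(toy_vectors, toy_lexicon)
print(retrofitted["big"], retrofitted["large"])
```

Words that are linked in the lexicon move closer together, while words without lexicon entries keep their original vectors.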

- **Problem 3.3:** Implement `cluster_with_dense_representation` function [20 points]

# Recitation
**GoogleNews-vectors-negative300.magnitude:**



*   Dense vector representation.
*   Trained on a large corpus of Google News articles.
*   300 dimensions
*   Please keep the `np.random.seed(5)` call; it is important for the autograder.
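
The seed matters because paraphrases missing from the vocabulary are replaced by a randomly chosen vector, so an unseeded run can produce different clusters each time. A small sketch of this effect, using a hypothetical helper `draw_fallback_indices` and an arbitrary vocabulary size:

```python
import numpy as np

def draw_fallback_indices(n_missing, vocab_size, seed=5):
    """Draw random vocabulary indices for out-of-vocabulary paraphrases."""
    np.random.seed(seed)  # re-seeding makes the draws repeatable
    return np.random.randint(0, vocab_size, size=n_missing)

first = draw_fallback_indices(3, 1000)
second = draw_fallback_indices(3, 1000)
print(np.array_equal(first, second))  # the two draws are identical
```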



In [51]:
def cluster_with_dense_representation(word_to_paraphrases_dict, word_to_k_dict):
    """
    Clusters paraphrases using dense vector representation
    :param word_to_paraphrases_dict: dictionary, where key is a target word and value is a list of paraphrases
    :param word_to_k_dict: dictionary, where key is a target word and value is a number of clusters
    :return: dictionary, where key is a target word and value is a list of list of paraphrases,
    where each list corresponds to a cluster
    """
    # Note: any vector representation should be in the same directory as this file
    vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")
    clusterings = {}

    for target_word in word_to_paraphrases_dict.keys():
        paraphrase_list = word_to_paraphrases_dict[target_word]
        k = word_to_k_dict[target_word]
        np.random.seed(5)

        # TODO: Implement
        cluster = [] # init an empty array for clustering
        for i in range(k):
            cluster.append([])

        X = np.zeros((len(paraphrase_list), vectors.dim))
        kmeans = KMeans(n_clusters=k, random_state=42)
        for i in range(len(paraphrase_list)):
            paraphrase = paraphrase_list[i]
            if paraphrase in vectors:
                X[i] = vectors.query(paraphrase)
            else:
                # Out-of-vocabulary paraphrase: fall back to a random vocabulary vector
                # (reproducible because of the np.random.seed(5) call above)
                random_idx = np.random.randint(0, len(vectors))
                X[i] = vectors.index(random_idx)[1].copy()
        kmeans.fit(X)
        labels = kmeans.labels_
        for i in range(len(paraphrase_list)):
            cluster[labels[i]].append(paraphrase_list[i])
        # TODO: end
        clusterings[target_word] = cluster

    return clusterings

**Answer 3.3.1:** Run clustering on `dev` data, report the `f_scores` from the `dev` data [1 point]

**TODO**: [Report f_scores from the dev data] **[writeup.pdf]**

In [52]:
# YOUR CODE HERE (you can re-use reference code from 3.1)

word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_with_dense_representation(word_to_paraphrases_dict_dev, word_to_k_dict_dev)
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.3531     |
|  atmosphere.n  | 6  |     0.3123     |
|     play.v     | 34 |     0.1218     |
|    smell.v     | 4  |     0.3500     |
|     talk.v     | 6  |     0.2812     |
|     plan.n     | 3  |     0.4451     |
|    party.n     | 5  |     0.4209     |
|   express.v    | 7  |     0.2698     |
|     win.v      | 4  |     0.4737     |
|    paper.n     | 7  |     0.3784     |
|     rule.v     | 7  |     0.2846     |
|   produce.v    | 7  |     0.2705     |
|     note.v     | 3  |     0.6190     |
| performance.n  | 5  |     0.3798     |
|     bank.n     | 9  |     0.7333     |
|     miss.v     | 8  |     0.2703     |
|   operate.v    | 7  |     0.2491     |
|    write.v     | 9  |     0.1633     |
|    source.n    | 9  |     0.3111     |
|     mean.v     | 6  |     0.3145     |
|    expect.v    | 6  |     0.3512     |
|  difference.n 

As before, run the following command to generate the output file for the predicted clusterings for the test dataset.

In [53]:
word_to_paraphrases_dict, word_to_k_dict = load_input_file('test_input.txt')
predicted_clusterings_dense = cluster_with_dense_representation(word_to_paraphrases_dict, word_to_k_dict)
# write_to_output_file('test_output_dense.txt', predicted_clusterings_dense)

In [54]:
# PennGrader - DO NOT CHANGE
# reload_grader()
grader.grade(test_case_id = 'test_q3_clusters_dense', answer = (predicted_clusterings_dense, 'dense'))

You earned 16/20 points.

But, don't worry, you can re-submit and we will keep only your latest score.


**Answer 3.3.2:** Provide a brief description of your method in the report that includes the vectors you used, and any experimental results you have from running your model on the dev set.  [5 points]

**TODO**: [Describe your method] **[writeup.pdf]**

**Answer 3.3.3:** In addition, for Task 3.2 and 3.3, do an analysis of different errors made by each system – i.e. look at instances that the word-context matrix representation gets wrong and dense gets right, and vice versa, and see if there are any interesting patterns. There is no right answer for this. [2 points]

In [56]:
# Agglomerative clustering
import random  # used for the out-of-vocabulary fallback below

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def cluster_with_dense_representation_agglo(word_to_paraphrases_dict, word_to_k_dict, linkage_opt):
    """
    Clusters paraphrases using dense vector representation
    :param word_to_paraphrases_dict: dictionary, where key is a target word and value is a list of paraphrases
    :param word_to_k_dict: dictionary, where key is a target word and value is a number of clusters
    :param linkage_opt: linkage criterion passed to AgglomerativeClustering ("ward", "complete", "average" or "single")
    :return: dictionary, where key is a target word and value is a list of list of paraphrases,
    where each list corresponds to a cluster
    """
    # Note: any vector representation should be in the same directory as this file
    vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")
    clusterings = {}

    for target_word in word_to_paraphrases_dict.keys():
        paraphrase_list = word_to_paraphrases_dict[target_word]
        k = word_to_k_dict[target_word]
        cluster = []
        for i in range(k):
            cluster.append([])
        # TODO: Implement
        X = np.zeros((len(paraphrase_list), vectors.dim))
        for i in range(len(paraphrase_list)):
            paraphrase = paraphrase_list[i]
            if paraphrase in vectors:
                X[i] = vectors.query(paraphrase)
            else:
                # Out-of-vocabulary paraphrase: fall back to a randomly chosen vocabulary vector
                random_idx = random.randint(0, len(vectors) - 1)
                X[i] = vectors.index(random_idx)[1].copy()
        agglo_model = AgglomerativeClustering(n_clusters=k, linkage=linkage_opt)
        labels = agglo_model.fit_predict(X)

        for i in range(len(paraphrase_list)):
            cluster[labels[i]].append(paraphrase_list[i])
        # TODO: end
        clusterings[target_word] = cluster

    return clusterings

In [57]:
# ward
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_with_dense_representation_agglo(word_to_paraphrases_dict_dev, word_to_k_dict_dev, "ward")
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.3283     |
|  atmosphere.n  | 6  |     0.2236     |
|     play.v     | 34 |     0.1390     |
|    smell.v     | 4  |     0.2857     |
|     talk.v     | 6  |     0.2830     |
|     plan.n     | 3  |     0.4488     |
|    party.n     | 5  |     0.3990     |
|   express.v    | 7  |     0.2657     |
|     win.v      | 4  |     0.3699     |
|    paper.n     | 7  |     0.3660     |
|     rule.v     | 7  |     0.3360     |
|   produce.v    | 7  |     0.2832     |
|     note.v     | 3  |     0.5714     |
| performance.n  | 5  |     0.3806     |
|     bank.n     | 9  |     0.3103     |
|     miss.v     | 8  |     0.2778     |
|   operate.v    | 7  |     0.2461     |
|    write.v     | 9  |     0.1729     |
|    source.n    | 9  |     0.2488     |
|     mean.v     | 6  |     0.3478     |
|    expect.v    | 6  |     0.4048     |
|  difference.n 

In [58]:
# complete
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_with_dense_representation_agglo(word_to_paraphrases_dict_dev, word_to_k_dict_dev, "complete")
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.3319     |
|  atmosphere.n  | 6  |     0.2931     |
|     play.v     | 34 |     0.1204     |
|    smell.v     | 4  |     0.3500     |
|     talk.v     | 6  |     0.4376     |
|     plan.n     | 3  |     0.5688     |
|    party.n     | 5  |     0.3825     |
|   express.v    | 7  |     0.3270     |
|     win.v      | 4  |     0.3404     |
|    paper.n     | 7  |     0.3838     |
|     rule.v     | 7  |     0.2937     |
|   produce.v    | 7  |     0.2707     |
|     note.v     | 3  |     0.5238     |
| performance.n  | 5  |     0.3339     |
|     bank.n     | 9  |     0.4167     |
|     miss.v     | 8  |     0.2821     |
|   operate.v    | 7  |     0.2187     |
|    write.v     | 9  |     0.1714     |
|    source.n    | 9  |     0.2390     |
|     mean.v     | 6  |     0.2993     |
|    expect.v    | 6  |     0.6139     |
|  difference.n 

In [60]:
# average
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_with_dense_representation_agglo(word_to_paraphrases_dict_dev, word_to_k_dict_dev, "average")
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.4062     |
|  atmosphere.n  | 6  |     0.4215     |
|     play.v     | 34 |     0.1794     |
|    smell.v     | 4  |     0.4348     |
|     talk.v     | 6  |     0.5852     |
|     plan.n     | 3  |     0.6679     |
|    party.n     | 5  |     0.6483     |
|   express.v    | 7  |     0.4139     |
|     win.v      | 4  |     0.4157     |
|    paper.n     | 7  |     0.5536     |
|     rule.v     | 7  |     0.2963     |
|   produce.v    | 7  |     0.3726     |
|     note.v     | 3  |     0.8400     |
| performance.n  | 5  |     0.4231     |
|     bank.n     | 9  |     0.6154     |
|     miss.v     | 8  |     0.3218     |
|   operate.v    | 7  |     0.2791     |
|    write.v     | 9  |     0.2697     |
|    source.n    | 9  |     0.2428     |
|     mean.v     | 6  |     0.3689     |
|    expect.v    | 6  |     0.4428     |
|  difference.n 

In [61]:
# single
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_with_dense_representation_agglo(word_to_paraphrases_dict_dev, word_to_k_dict_dev, "single")
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.4232     |
|  atmosphere.n  | 6  |     0.4354     |
|     play.v     | 34 |     0.1533     |
|    smell.v     | 4  |     0.4677     |
|     talk.v     | 6  |     0.6511     |
|     plan.n     | 3  |     0.6471     |
|    party.n     | 5  |     0.4323     |
|   express.v    | 7  |     0.3982     |
|     win.v      | 4  |     0.4892     |
|    paper.n     | 7  |     0.5633     |
|     rule.v     | 7  |     0.2975     |
|   produce.v    | 7  |     0.4542     |
|     note.v     | 3  |     0.8400     |
| performance.n  | 5  |     0.4327     |
|     bank.n     | 9  |     0.6154     |
|     miss.v     | 8  |     0.3217     |
|   operate.v    | 7  |     0.3113     |
|    write.v     | 9  |     0.3581     |
|    source.n    | 9  |     0.2733     |
|     mean.v     | 6  |     0.3646     |
|    expect.v    | 6  |     0.4675     |
|  difference.n 

In [74]:
word_to_paraphrases_dict_test, word_to_k_dict_test = load_input_file('test_input.txt')
predicted_clusterings_random_test = cluster_with_dense_representation_agglo(word_to_paraphrases_dict_test, word_to_k_dict_test, "single")
predicted_clusterings_random_test

{'important.a': [['authoritative'], ['significant'], ['crucial']],
 'argument.n': [['variable',
   'casuistry',
   'conflict',
   'word',
   'clincher',
   'summary',
   'arguing',
   'argy-bargy',
   'tilt',
   'statement',
   'contestation',
   'polemic',
   'fight',
   'contention',
   'adducing',
   'address',
   'proof',
   'pro',
   'policy',
   'argumentation',
   'sparring',
   'disputation',
   'value',
   'difference',
   'determiner',
   'dispute',
   'case',
   'firestorm',
   'parameter',
   'reasoning',
   'discussion',
   'reference',
   'con',
   'evidence',
   'controversy',
   'counterargument',
   'debate'],
  ['sum-up'],
  ['argle-bargle'],
  ['give-and-take'],
  ['logomachy'],
  ['line'],
  ['disceptation']],
 'encounter.v': [['find',
   'see',
   'chance',
   'be',
   'play',
   'happen',
   'meet',
   'replay',
   'face',
   'receive',
   'confront',
   'have'],
  ['bump'],
  ['experience'],
  ['cross'],
  ['intersect']],
 'activate.v': [['initiate',
   'modify',

# Recitation
To compare the two systems, build a dataframe with the following columns:

**Target word | F-score (sparse) | F-score (dense) | Diff (sparse - dense) | Diff (dense - sparse)**

Then sort by each difference column to identify the target words where one representation works better than the other.
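
A minimal sketch of this comparison is shown below. It uses plain Python rather than a dataframe, and it assumes the per-target F-scores are available as dictionaries; the dictionary names and the `compare_scores` helper are hypothetical. The four example values are copied from the single-linkage dev tables printed earlier in this notebook.

```python
# Per-target paired F-scores from the single-linkage dev runs above
sparse_scores = {"bank.n": 0.2429, "note.v": 0.6400, "smell.v": 0.4839, "miss.v": 0.3511}
dense_scores = {"bank.n": 0.6154, "note.v": 0.8400, "smell.v": 0.4677, "miss.v": 0.3217}

def compare_scores(sparse, dense):
    """Return (target, sparse, dense, sparse - dense) rows sorted by the difference."""
    rows = []
    for target in sorted(set(sparse) & set(dense)):
        diff = sparse[target] - dense[target]
        rows.append((target, sparse[target], dense[target], diff))
    # Most negative first: targets where the dense vectors help the most
    return sorted(rows, key=lambda row: row[3])

rows = compare_scores(sparse_scores, dense_scores)
for target, s, d, diff in rows:
    print(f"{target:<10} sparse={s:.4f} dense={d:.4f} sparse-dense={diff:+.4f}")
```

The rows at the top are the targets where the dense representation wins by the largest margin, and the rows at the bottom are those where the sparse representation does better. The same rows can be loaded into a dataframe if you prefer.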

**TODO**: [Error analysis] **[writeup.pdf]**

## 3.4 Cluster without K [31 points]

So far we made the clustering problem deliberately easier by providing you with `k`, the number of clusters, as an input. But in most clustering situations the best `k` is not given. To take this assignment one step further, see if you can come up with a way to automatically choose `k`.

Write a function `cluster_with_no_k(word_to_paraphrases_dict)` that accepts only the first dictionary as an input and produces clusterings for given target words.

We have provided an additional test set `test_nok_input.txt`, where the `k` field has been zeroed out. See if you can come up with a method that clusters words by sense, and chooses the best `k` on its own. You can start by assigning `k=5` for all target words as a baseline model.

You might want to use the development data to analyze how good your model is at determining `k`.

One way to select the best `k` for a target word and its list of paraphrases is to try a range of `k` values and judge the resulting clusterings with some metric, for instance the [silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).
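
To show what the silhouette score measures, here is a small NumPy sketch of the computation with Euclidean distances. The function name `silhouette` and the toy points are illustrative assumptions; in practice `sklearn.metrics.silhouette_score` should be used instead.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters get a score of 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

# Hypothetical toy data: two well-separated pairs of points
points = np.array([[0, 0], [0, 1], [10, 0], [10, 1]])
good = silhouette(points, [0, 0, 1, 1])  # labels match the visible groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

A clustering that matches the natural groups scores close to 1, while one that splits them scores below 0, which is why the `k` with the highest score is a reasonable choice.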

Be sure to describe your method in the Report.

- **Problem 3.4:** Implement `cluster_with_no_k` function [25 points]


# Recitation
**Until now we gave you `k`, and you used it in K-means to split your paraphrase list into a list of lists (clusters). In this section we do not give you `k`. Keep the same logic for preparing the feature matrix. Next, run K-means with different numbers of clusters, calculate the silhouette score for each, and choose the number of clusters with the best score for that target word.**

In [65]:
from sklearn.metrics import silhouette_score # Hint: silhouette_score can serve as a clustering quality metric when choosing k

In [63]:
def get_fscore(cluster_func):
    word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
    predicted_clusterings_random_dev = cluster_func(word_to_paraphrases_dict_dev, word_to_k_dict_dev)
    gold_clusterings_dev = load_output_file('dev_output.txt')
    f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
    return f_score

In [64]:
def get_cluster_with_k(paraphrase_list, k):
    # Note: expects a module-level `vectors` Magnitude object to exist when this is called
    cluster = [[] for _ in range(k)]  # one empty list per cluster

    X = np.zeros((len(paraphrase_list), vectors.dim))
    kmeans = KMeans(n_clusters=k, random_state=42)
    for i in range(len(paraphrase_list)):
        paraphrase = paraphrase_list[i]
        if paraphrase in vectors:
            X[i] = vectors.query(paraphrase)
        else:
            # Out-of-vocabulary paraphrase: fall back to a random vocabulary vector
            random_idx = np.random.randint(0, len(vectors))
            X[i] = vectors.index(random_idx)[1].copy()
    kmeans.fit(X)
    labels = kmeans.labels_
    for i in range(len(paraphrase_list)):
        cluster[labels[i]].append(paraphrase_list[i])
    s_score = silhouette_score(X, labels)
    return s_score, cluster


In [62]:
vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")  # module-level handle used by get_cluster_with_k
w = "different.a"

word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
p_list = word_to_paraphrases_dict_dev[w]
print(p_list)

clusters = [] # init an empty array for clustering
s_scores = []
potential_k = min(11, len(p_list))
for i in range(2, potential_k):
    s_score, cluster = get_cluster_with_k(p_list, i)
    print("score is {}".format(s_score))
    print(f"s_score: {s_score} cluster: {cluster}")
    clusters.append(cluster)
    s_scores.append(s_score)
s_score_array = np.asarray(s_scores)
s_scores
# best_k = np.argmax(s_score_array)

# print(f"Best k for {w}: {best_k} tested clustering {potential_k}")

['dissimilar', 'unlike']


[]

In [None]:
def cluster_with_no_k(word_to_paraphrases_dict):
    """
    Clusters paraphrases using any vector representation
    :param word_to_paraphrases_dict: dictionary, where key is a target word and value is a list of paraphrases
    :return: dictionary, where key is a target word and value is a list of list of paraphrases,
    where each list corresponds to a cluster
    """
    # Note: any vector representation should be in the same directory as this file
    vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")
    clusterings = {}

    for target_word in word_to_paraphrases_dict.keys():
        # print(f"Processing {target_word}")
        paraphrase_list = word_to_paraphrases_dict[target_word]
        # TODO: Implement
        # hint: first find the best k value, you can define a separate function, if needed.
        # hint: then fit a KMeans model on the data with k clusters
        clusters = [] # init an empty array for clustering
        s_scores = []
        potential_k = min(11, len(paraphrase_list))
        if potential_k <= 2:
            clusterings[target_word] = [paraphrase_list]
            continue
        for i in range(2, potential_k):
            s_score, cluster = get_cluster_with_k(paraphrase_list, i)
            clusters.append(cluster)
            s_scores.append(s_score)
        # np.argmax returns an index into s_scores; index 0 corresponds to k=2
        best_idx = int(np.argmax(s_scores))
        # print(f"Best k for {target_word}: {best_idx + 2} (tried k=2..{potential_k - 1})")
        clusterings[target_word] = clusters[best_idx]

    return clusterings

**Answer 3.4.1:** Run clustering on `dev` data, report the `f_scores` from the `dev` data [1 point]

**TODO:** [Report f_score on dev data] **[writeup.pdf]**

In [None]:
# YOUR CODE HERE (you can re-use reference code from 3.1)
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_random_dev = cluster_with_no_k(word_to_paraphrases_dict_dev)
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_random_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    write.v     | 9  |     0.2859     |
|    treat.v     | 8  |     0.3296     |
|     note.v     | 3  |     0.1875     |
|    paper.n     | 7  |     0.5691     |
|     talk.v     | 6  |     0.4923     |
|     wash.v     | 13 |     0.2747     |
|    source.n    | 9  |     0.2827     |
|    degree.n    | 7  |     0.4367     |
|   operate.v    | 7  |     0.3035     |
|     bank.n     | 9  |     0.4231     |
|    smell.v     | 4  |     0.4340     |
|     use.v      | 6  |     0.6310     |
|   provide.v    | 7  |     0.5948     |
|     mean.v     | 6  |     0.3425     |
|     hear.v     | 5  |     0.2222     |
|   suspend.v    | 6  |     0.4878     |
|    party.n     | 5  |     0.4551     |
|    watch.v     | 5  |     0.2927     |
|    climb.v     | 6  |     0.2400     |
|    image.n     | 9  |     0.3075     |
| performance.n  | 5  |     0.3264     |
| organization.n

As before, run the following command to generate the output file for the predicted clusterings for the test dataset.

In [None]:
word_to_paraphrases_dict, _ = load_input_file('/content/test_nok_input.txt')
predicted_clusterings_nok = cluster_with_no_k(word_to_paraphrases_dict)
# write_to_output_file('test_output_nok.txt', predicted_clusterings_nok)

In [None]:
# PennGrader - DO NOT CHANGE
# reload_grader()
grader.grade(test_case_id = 'test_q3_clusters_no_k', answer = (predicted_clusterings_nok, 'no k'))

Correct! You earned 25/25 points. You are a star!

Your submission has been successfully recorded in the gradebook.


In [None]:
# Agglomerative clustering
def cluster_with_no_k_agglo(word_to_paraphrases_dict):
    """
    Clusters paraphrases using any vector representation
    :param word_to_paraphrases_dict: dictionary, where key is a target word and value is a list of paraphrases
    :return: dictionary, where key is a target word and value is a list of list of paraphrases,
    where each list corresponds to a cluster
    """
    # Note: any vector representation should be in the same directory as this file
    vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")
    clusterings = {}

    for target_word in word_to_paraphrases_dict.keys():
        # print(f"Processing {target_word}")
        paraphrase_list = word_to_paraphrases_dict[target_word]
        # TODO: Implement
        # hint: first find the best k value, you can define a separate function, if needed.
        # hint: then fit an agglomerative model on the data with k clusters
        clusters = [] # init an empty array for clustering
        s_scores = []
        potential_k = min(11, len(paraphrase_list))
        if potential_k <= 2:
            clusterings[target_word] = [paraphrase_list]
            continue
        for i in range(2, potential_k):
            # Use the agglomerative helper (defined in a later cell) instead of the KMeans one;
            # it is assumed to return (s_score, cluster) like get_cluster_with_k
            s_score, cluster = get_cluster_with_k_agglomerative(paraphrase_list, i)
            clusters.append(cluster)
            s_scores.append(s_score)
        # np.argmax returns an index into s_scores; index 0 corresponds to k=2
        best_idx = int(np.argmax(s_scores))
        # print(f"Best k for {target_word}: {best_idx + 2} (tried k=2..{potential_k - 1})")
        clusterings[target_word] = clusters[best_idx]

    return clusterings

**Answer 3.4.2:** In the report, briefly describe your method, including the vectors you used and any experimental results from running your model on the dev set. [5 points]

**TODO**: [Describe your method] **[writeup.pdf]**
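
As a starting point for that description, the model-selection step used in this notebook can be summarized as: score each candidate `k` by silhouette, then keep the best-scoring clustering. The sketch below isolates only that bookkeeping. `candidate_ks` and `select_k` are hypothetical helper names that do not appear elsewhere in the notebook, and the scores in the usage lines are made-up placeholders, not dev-set results.

```python
def candidate_ks(n_items, max_k=10):
    """Return the k values worth trying for n_items paraphrases.

    The silhouette score is only defined for 2 <= k <= n_items - 1,
    which matches range(2, min(11, n_items)) in the notebook code.
    """
    return list(range(2, min(max_k, n_items - 1) + 1))


def select_k(scores_by_k):
    """Return the k with the highest score, or 1 if nothing was scored."""
    if not scores_by_k:
        return 1  # too few items to compare: fall back to a single cluster
    return max(scores_by_k, key=scores_by_k.get)


# Illustrative placeholder scores only (not measured values).
example_scores = {2: 0.10, 3: 0.25, 4: 0.18}
best_k = select_k(example_scores)
```

Keying the scores by `k` rather than by list position avoids confusing the position of the best score with the value of `k` itself.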

In [66]:
# Agglomerative clustering
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def get_cluster_with_k_agglomerative(paraphrase_list, k, linkage_opt="single"):
    """Cluster the paraphrases into k groups; return (silhouette score, clusters)."""
    # Relies on the global `vectors` (Magnitude embeddings) loaded earlier in the notebook
    cluster = [[] for _ in range(k)]  # one empty list per cluster

    # Build the feature matrix: one embedding row per paraphrase
    X = np.zeros((len(paraphrase_list), vectors.dim))
    for i, paraphrase in enumerate(paraphrase_list):
        if paraphrase in vectors:
            X[i] = vectors.query(paraphrase)
        else:
            # Out-of-vocabulary paraphrase: fall back to a randomly chosen vocabulary vector
            random_idx = np.random.randint(0, len(vectors))
            X[i] = vectors.index(random_idx)[1].copy()

    agglo_model = AgglomerativeClustering(n_clusters=k, linkage=linkage_opt)
    agglo_model.fit(X)
    labels = agglo_model.labels_
    for i in range(len(paraphrase_list)):
        cluster[labels[i]].append(paraphrase_list[i])
    s_score = silhouette_score(X, labels)
    return s_score, cluster

In [67]:
def cluster_with_no_k_agglo(word_to_paraphrases_dict):
    """
    Clusters paraphrases using any vector representation
    :param word_to_paraphrases_dict: dictionary, where key is a target word and value is a list of paraphrases
    :return: dictionary, where key is a target word and value is a list of list of paraphrases,
    where each list corresponds to a cluster
    """
    # Note: any vector representation should be in the same directory as this file
    vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")
    clusterings = {}

    for target_word in word_to_paraphrases_dict.keys():
        # print(f"Processing {target_word}")
        paraphrase_list = word_to_paraphrases_dict[target_word]
        # TODO: Implement
        # hint: first find the best k value, you can define a separate function, if needed.
        # note: this variant fits an agglomerative model (not KMeans) for each candidate k
        clusters = []  # clusters[j] holds the clustering for k = j + 2
        s_scores = []  # s_scores[j] holds the silhouette score for k = j + 2
        potential_k = min(11, len(paraphrase_list))
        if potential_k <= 2:
            # Too few paraphrases to compare k values: keep them in one cluster
            clusterings[target_word] = [paraphrase_list]
            continue
        for k in range(2, potential_k):
            s_score, cluster = get_cluster_with_k_agglomerative(paraphrase_list, k)
            clusters.append(cluster)
            s_scores.append(s_score)
        # argmax returns a list index, so the selected k is best_idx + 2
        best_idx = int(np.argmax(s_scores))
        # print(f"Best k for {target_word}: {best_idx + 2} (tried k < {potential_k})")
        clusterings[target_word] = clusters[best_idx]

    return clusterings

In [68]:
word_to_paraphrases_dict_dev, word_to_k_dict_dev = load_input_file('dev_input.txt')
predicted_clusterings_agglo_dev = cluster_with_no_k_agglo(word_to_paraphrases_dict_dev)
gold_clusterings_dev = load_output_file('dev_output.txt')
f_score = evaluate_clusterings(gold_clusterings_dev, predicted_clusterings_agglo_dev)
f_score

+----------------+----+----------------+
|     Target     | k  | Paired F-Score |
+----------------+----+----------------+
|    degree.n    | 7  |     0.4460     |
|  atmosphere.n  | 6  |     0.3825     |
|     play.v     | 34 |     0.1753     |
|    smell.v     | 4  |     0.5033     |
|     talk.v     | 6  |     0.6414     |
|     plan.n     | 3  |     0.6745     |
|    party.n     | 5  |     0.3739     |
|   express.v    | 7  |     0.4256     |
|     win.v      | 4  |     0.5445     |
|    paper.n     | 7  |     0.6022     |
|     rule.v     | 7  |     0.3317     |
|   produce.v    | 7  |     0.4968     |
|     note.v     | 3  |     0.7719     |
| performance.n  | 5  |     0.4761     |
|     bank.n     | 9  |     0.5688     |
|     miss.v     | 8  |     0.3058     |
|   operate.v    | 7  |     0.3060     |
|    write.v     | 9  |     0.3567     |
|    source.n    | 9  |     0.3109     |
|     mean.v     | 6  |     0.4011     |
|    expect.v    | 6  |     0.4642     |
|  difference.n 

In [73]:
word_to_paraphrases_dict_test, _ = load_input_file('test_nok_input.txt')
predicted_clusterings_agglo_test = cluster_with_no_k_agglo(word_to_paraphrases_dict_test)
predicted_clusterings_agglo_test


{'terrible.a': [['horrendous',
   'painful',
   'dreaded',
   'fearful',
   'fearsome',
   'awful',
   'atrocious',
   'dire',
   'frightful',
   'tremendous',
   'severe',
   'frightening',
   'abominable',
   'unspeakable',
   'dread',
   'wicked',
   'dreadful',
   'horrific'],
  ['direful']],
 'closely.r': [['tight', 'close', 'intimately'], ['nearly']],
 'around.r': [['roughly', 'approximately'], ['round'], ['some'], ['about']],
 'range.n': [['spectrum',
   'Primus',
   'hearing',
   'compass',
   'image',
   'eyeshot',
   'motley',
   'earshot',
   'reach',
   'ambit',
   'assortment',
   'grasp',
   'smorgasbord',
   'palette',
   'formation',
   'band',
   'contrast',
   'view',
   'internationalism',
   'potpourri',
   'chain',
   'installation',
   'tract',
   'internationality',
   'miscellanea',
   'horizon',
   'ken',
   'potbelly',
   'facility',
   'ballpark',
   'set',
   'capability',
   'capableness',
   'latitude',
   'purview',
   'stove',
   'extent',
   'confines',

# Submissions


## Leaderboards [2 points + 3 bonus]
To stir up some friendly competition, we would also like you to submit the clustering from your best model to a leaderboard. There are two leaderboards:
- **Clusters with no K**: Copy the output file from your best model **(has to come from `3.4`)** to a file called `test_nok_output_leaderboard.txt` and include it with your submission in `HW5: Leaderboard Without K` following the format of the clusters file. [1 point]

- **Clusters with K**: Copy the output file from your best model **(has to come from `3.2 or 3.3`)** to a file called `test_output_leaderboard.txt` and include it with your submission in `HW5: Leaderboard With K` following the format of the clusters file. [1 point]

The first 10 places in either of the two leaderboards get 3 extra points.
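
The copy step for either leaderboard can be scripted. The following is a minimal sketch, assuming your model's output was already written to disk; `copy_for_leaderboard` is a hypothetical helper, and the source file name in the commented call is an assumption taken from the commented-out `write_to_output_file` call earlier in this notebook.

```python
import shutil
from pathlib import Path


def copy_for_leaderboard(source, target):
    """Copy a clustering output file to the name a leaderboard expects."""
    source, target = Path(source), Path(target)
    if not source.is_file():
        # Fail early instead of submitting a missing or stale file
        raise FileNotFoundError(f"Output file not found: {source}")
    shutil.copyfile(source, target)
    return target


# Example call (replace the source name with whatever file your Task 3.4 cell wrote):
# copy_for_leaderboard("test_output_nok.txt", "test_nok_output_leaderboard.txt")
```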

## Free-response Checklist (check if you missed anything!)
We will look for the following free-responses in this notebook:
- Section 2: Question responses and analysis of correlations
- Section 3: For each clustering algorithm, report the `f_score` on the `dev` set and a description of your method.
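
When interpreting the `f_score` values you report, it may help to recall how a paired F-score is usually computed for one target word. The sketch below assumes the common pairwise definition (precision and recall over item pairs that share a cluster). The notebook's own `evaluate_clusterings` is not reproduced here and may differ in details such as averaging across words or edge cases; `same_cluster_pairs` and `paired_f_score` are hypothetical names.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """Return the set of unordered item pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical (a, b) order
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def paired_f_score(gold_clusters, predicted_clusters):
    """Harmonic mean of pairwise precision and recall (assumed definition)."""
    gold = same_cluster_pairs(gold_clusters)
    pred = same_cluster_pairs(predicted_clusters)
    if not gold or not pred:
        return 0.0  # assumption: score 0 when either side has no pairs
    overlap = len(gold & pred)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: gold pairs are ab, ac, bc; predicted pairs are ab, cd.
toy_score = paired_f_score([["a", "b", "c"], ["d"]], [["a", "b"], ["c", "d"]])
```

In the toy example one of the two predicted pairs is correct (precision 1/2) and one of the three gold pairs is recovered (recall 1/3), giving an F-score of 0.4.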

## GradeScope File Submission
Here are the deliverables you need to submit to GradeScope:
- Write-up:
    - Part 2: answers to questions
    - Part 3: F-scores for clustering algorithms & discussions about your models
- Code:
    - This notebook and py file: rename them to `homework5.ipynb` and `homework5.py`. To download both files, go to the top-left corner of this webpage and choose `File -> Download -> Download .ipynb/.py`.
- Leaderboard Results:
  - Leaderboard Without K: `test_nok_output_leaderboard.txt` (Task 3.4 output file)
  - Leaderboard With K: `test_output_leaderboard.txt` (Task 3.2 or 3.3 output file)