# SNLP Project
Name 1: <br/>
Student id 1: <br/>
Email 1: <br/>


Name 2: <br/>
Student id 2: <br/>
Email 2: <br/>

**Instructions:** Read each question carefully. <br/>
Make sure you appropriately comment your code wherever required. Your final submission should contain the completed Notebook and the respective Python files for any additional exercises necessary. There is no need to submit the data files should they exist. <br/>
Upload the zipped folder on CMS. Please follow the naming convention of **Name1_studentID1_Name2_studentID2.zip**. Make sure to click on "Turn-in" (or the equivalent on CMS) after you upload your submission, otherwise the assignment will not be considered as submitted. Only one member of the group should make the submission.

---

# Question 1: The Theory

## GloVe (4 points)
1.  Let's start with the original paper [\[Link\]](https://nlp.stanford.edu/pubs/glove.pdf). Read the paper and answer these questions: <br/>
  a. What do you think of the jump from (Equation 3 + Equation 4) -> Equation 5? Provide a counter-example where the given result may not hold. Why does this not affect the algorithm? (2 points) <br/>

  b. Why does GloVe use a smoothing function? How does the smoothing function interact with the loss objective? (1 point) <br/>

  c. Look at the [Mittens](https://arxiv.org/pdf/1803.09901.pdf) paper, which extends GloVe. What do they add to the GloVe loss function? What happens as a result? (1 point)
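
As a starting point for question b, the function in question can be written down in a few lines. This is a minimal sketch of the weighting function $f(x)$ as defined in the GloVe paper (the paper calls it a weighting function), using the paper's reported values `x_max = 100` and `alpha = 0.75` as defaults. The name `glove_weight` is our own and does not come from the GloVe code base.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight applied to the squared error of a pair with co-occurrence count x."""
    # Rare pairs get a weight below 1; counts at or above x_max are capped at 1.
    return (x / x_max) ** alpha if x < x_max else 1.0


# Inspect how the weight grows with the co-occurrence count.
for count in (0, 1, 10, 100, 1000):
    print(count, glove_weight(count))
```

Evaluating it for a few counts may help when reasoning about how it scales each term of the loss.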


## Word2Vec (5 points)



2. The two word2vec papers can be skimmed to answer the following questions: [[Paper](https://arxiv.org/pdf/1301.3781.pdf)] and [[2nd Paper](https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf)] (optional). You can also use [this](https://docs.chainer.org/en/latest/examples/word2vec.html) post for reference. Additionally, for an intuition behind vector operations, check this [blog post](https://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html/). The word2vec paper has a neural focus; however, it is not crucial to understand how neural networks work at this time. Answer the following questions: <br/>

  a. Describe the two proposed methods/training objectives for obtaining word embeddings presented in the original paper. Focus only on the **training objective** and not the technical/optimization details. (2 points) <br/>

  b. In section 2.1 of the paper, the authors mention using a hierarchical softmax function where they represent the vocabulary as a Huffman binary tree. They go on to claim that only $\log_2(Unigram\_perplexity(V))$ evaluations are needed to arrive at a result, where $V$ is the size of the vocabulary. Why is this the case? (1 point) <br/>

  c. The second paper describes an extension of the skip-gram approach to obtaining word2vec embeddings, called skip-gram with negative sampling. Explain what this method consists of and why it is more efficient. (1 point) <br/>

  d. Attempts at explaining what word2vec is doing have tried to link the resulting $W$ embedding matrix to the good ol' point-wise mutual information (PMI) approaches. Explain why this might make sense as an intuition. Feel free to answer with what has been covered in the SNLP course or take a look at this [paper](https://proceedings.neurips.cc/paper/2014/file/feab05aa91085b7a8012516bc3533958-Paper.pdf) which claims that $WC$ is the PMI shifted by a constant, where $W$ is our embedding matrix and $C$ is our context matrix. (1 point)
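
For question d, it can help to have PMI in executable form. The following is a minimal sketch that computes $PMI(w, c) = \log \frac{\#(w,c) \cdot N}{\#(w) \cdot \#(c)}$ from a dictionary of (word, context) counts, where $N$ is the total number of pairs. The function name `pmi_table` and the toy counts are illustrative assumptions, not part of the course material.

```python
import math
from collections import Counter


def pmi_table(pair_counts):
    """Compute PMI for every observed (word, context) pair."""
    total = sum(pair_counts.values())
    word_totals, context_totals = Counter(), Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n
    # Unobserved pairs are skipped because log(0) is undefined.
    return {
        (word, context): math.log(n * total / (word_totals[word] * context_totals[context]))
        for (word, context), n in pair_counts.items()
        if n > 0
    }


# Toy counts: "good" prefers "movie", "bad" prefers "plot".
toy_counts = {("good", "movie"): 3, ("good", "plot"): 1,
              ("bad", "movie"): 1, ("bad", "plot"): 3}
print(pmi_table(toy_counts))
```

Pairs that co-occur more often than their individual frequencies predict get a positive score, and pairs that co-occur less often get a negative one.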

**Notes**:
* (Question 2) In a neural language model, we predict the probability of each token in the vocabulary being the next token given the context. In order to get this distribution we use the $softmax(x)_j = \frac{\exp(x_j)}{\sum_{i=1}^{V} \exp(x_i)}$ function, where $x$ is the vector of scores over the vocabulary and $V$ is the vocabulary size. It is not completely necessary to understand this fully. Hint: If you wish to learn more about this, feel free to take the Neural Networks course.
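
The softmax from the note above can be sketched in plain Python. This is only an illustration under our own naming (`softmax`, `scores`); it subtracts the maximum score before exponentiating, a common trick that leaves the result unchanged but avoids overflow.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    # Shifting by the maximum does not change the output but keeps exp() bounded.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One score per vocabulary entry; the outputs are non-negative and sum to 1.
print(softmax([2.0, 1.0, 0.1]))
```

Note that the denominator sums over the whole vocabulary, which is the cost that hierarchical softmax and negative sampling try to avoid.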


# Question 2: Training GloVe Embeddings (11 points)

Let's train our own GloVe embeddings!
1. Start by splitting the corpus into train:test using a 50:50 ratio. Remove all punctuation and lowercase the corpus. (1 point)

2. Write a function that computes the co-occurrence matrix for a fixed vocabulary and given window length. (2 points)

3. Train your own GloVe embeddings from scratch using the [glove](https://github.com/stanfordnlp/glove) repo. Use the default parameters in ```demo.sh``` for this question. Check for empty and duplicate embeddings! (2 points)

4. Use the resulting embeddings to train a sentiment classifier using your train data. Represent each sentence as the **sum** of its word vectors and train a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). Replace all OOV entries with a zero vector. Use 5-fold cross validation on your training data to fix the depth of your Random Forest. Keep all other hyperparameters unchanged. Print the F1 score (macro) for your classifier. (2 points)

5. Plot the classifier performance across a range of embedding sizes {50, 100, 200} (plus 300 if you can) and context lengths {5, 10, 15} used to train the embeddings. What do you think about the trends? (1 point)


6. Replace your embeddings with the global GloVe embeddings [\[Link\]](https://glove-embeddings.github.io/) (mind the vocabularies). Re-run the classifier training and testing. How do the results compare to your in-house embeddings? Plot the classifier performance against vector length {50, 100, 200} for the in-house and global embeddings. (2 points)

7. Let's see if the trends you observed hold across domains. Compare the IMDB embeddings you built against the global embeddings (50-dimensional vectors) when building a similar classifier for the [Financial Phrasebank corpus](https://huggingface.co/datasets/financial_phrasebank). What do you observe? (1 point)


Note: The default setting for GloVe is a context length of 10 and an embedding size of 50. If you are experimenting with one of the variables, you can default the other one.
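
To make the co-occurrence step in item 2 concrete, here is one possible sketch. It counts symmetric, unweighted co-occurrences inside a fixed window and stores them sparsely in a `Counter`. The function name `cooccurrence_counts` and its signature are our own assumptions. The GloVe paper additionally weights each pair by the inverse of its distance, which this sketch leaves out.

```python
from collections import Counter


def cooccurrence_counts(sentences, vocabulary, window=2):
    """Count how often two vocabulary words appear within `window` tokens of each other."""
    vocab = set(vocabulary)
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            if word not in vocab:
                continue
            # Look only to the right and update both directions to stay symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                other = tokens[j]
                if other in vocab:
                    counts[(word, other)] += 1
                    counts[(other, word)] += 1
    return counts


# Each sentence is a list of tokens; "was" is outside the fixed vocabulary.
counts = cooccurrence_counts([["the", "movie", "was", "great"]],
                             vocabulary=["the", "movie", "great"], window=2)
print(counts)
```

Out-of-vocabulary tokens are not counted here but still occupy a position in the window. Whether that is the desired behaviour is a design choice you should state in your solution.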



## Downloading data, GloVe repo

In [None]:
!git clone http://github.com/stanfordnlp/glove
!cd glove && make;
!wget https://github.com/Ankit152/IMDB-sentiment-analysis/raw/master/IMDB-Dataset.csv

## Into the Void . . .

In [None]:
#Write your code here.

### GloVe training

In [None]:
#Make edits to demo.sh before running!
!cd glove && ./demo.sh
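
Once training has finished, the check for empty and duplicate embeddings could look like the sketch below. It assumes the text output format of the GloVe tool (one `word v1 v2 ...` entry per line) and treats a repeated word as a duplicate and an all-zero or missing vector as empty. Both definitions, the function name `load_glove_vectors`, and the path `glove/vectors.txt` in the usage comment are assumptions, so adapt them to your setup.

```python
def load_glove_vectors(lines):
    """Parse text-format GloVe lines into a dict and report suspicious entries."""
    vectors, duplicates, empty = {}, [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        word = parts[0]
        values = [float(v) for v in parts[1:] if v]
        if not values or all(v == 0.0 for v in values):
            empty.append(word)
        elif word in vectors:
            duplicates.append(word)  # keep the first occurrence only
        else:
            vectors[word] = values
    return vectors, duplicates, empty


# Usage, assuming demo.sh wrote its vectors to glove/vectors.txt:
# with open("glove/vectors.txt", encoding="utf-8") as f:
#     vectors, duplicates, empty = load_glove_vectors(f)
```

The function accepts any iterable of lines, so it can be tried on a handful of hand-written strings before pointing it at the real file.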

## Financial Phrasebank

In [None]:
!pip install -q datasets

In [None]:
from datasets import load_dataset
dataset = load_dataset("financial_phrasebank", 'sentences_75agree')
# Label ids: 0 = negative, 1 = neutral, 2 = positive

test_financial = dataset['train'].to_pandas()

# Question 3: Training word2vec embeddings (10 points)

Let's train our own word2vec embeddings!

Some libraries are readily available to train your own word2vec embeddings. We're going to keep it fun: you will implement a small intermediate piece of code to build further intuition.

1. Re-use the splits you used for the GloVe training. Alternatively, if you skipped the GloVe training, split the corpus into `train:test` using a `50:50` ratio. Remove all punctuation and lowercase the corpus.

2. Implement a function that, given a window of words, a negative sampling rate and a vocabulary, generates training examples for every word in the window: positive examples from the observed context and negative examples drawn at a rate of `negative_sample_rate`. Note: you will **not** use the code from this exercise in later parts. (3 points)

3. Train your own word2vec embeddings from scratch using the [gensim](https://radimrehurek.com/gensim/models/word2vec.html) library. Perform 5-fold cross validation to find optimal values for the sampling rate, the length of the vector and the context size. (2 points)

4. Use the resulting embeddings to train a sentiment classifier using your train data. Represent each sentence as the **sum** of word2vec vectors for the tokens in the text and train a Random Forest classifier (use the parameters from Q2.4; if you skipped that question, use 5-fold cross validation for the tree depth). Handle OOV tokens with zero vectors. Print the F1 score (macro) for your classifier. (2 points)

5. Now replace your embeddings with the pre-trained word2vec embeddings trained on Google News [[Link](https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models)]. Re-run the classification experiments. Plot the classifier performance against vector length for the in-house and the Google News embeddings. To what do you attribute the shift in performance? (2 points)

6. Compare the word2vec and GloVe embeddings you built. (1 point)
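
The sentence representation used for the classifiers (sum of word vectors, zero vector for OOV tokens) can be sketched as follows. The helper name `sentence_vector` is hypothetical. It only assumes that `embeddings` supports `in` and `[]` lookups, which holds for a plain dict of NumPy arrays and should also hold for gensim's `model.wv`.

```python
import numpy as np


def sentence_vector(tokens, embeddings, dim):
    """Sum the vectors of all tokens; out-of-vocabulary tokens contribute zeros."""
    total = np.zeros(dim)
    for token in tokens:
        if token in embeddings:
            total += embeddings[token]
    return total


# Toy embeddings; "zzz" is out of vocabulary and therefore adds nothing.
toy = {"good": np.array([1.0, 2.0]), "film": np.array([0.5, -1.0])}
print(sentence_vector(["good", "film", "zzz"], toy, dim=2))
```

Stacking one such vector per review gives the feature matrix that can be passed to the Random Forest.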

In [None]:
# Install necessary packages
!pip install -q pandas gensim nltk tqdm scikit-learn

In [None]:
# Import necessary packages
# Feel free to add more libraries here
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

In [None]:
def preprocess_text(text: str, remove_stopwords: bool) -> str:
    # SOLUTION HERE
    raise NotImplementedError

In [None]:
# Preprocess your data with the above method
train = [preprocess_text(review, remove_stopwords=True) for review in train.review.tolist()]

### Implement your own negative sampling

In [None]:
# Implement your own negative sampling method
def negative_sampling(context: str,
                      window_size: int,
                      vocabulary: list,
                      sampling_rate: float):
    raise NotImplementedError
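
If you are unsure what the output of such a function could look like, the sketch below shows one possible shape. It deliberately uses a different name (`skipgram_negative_examples`) and a slightly different signature than the stub above: it takes a token list and an integer number of negatives per positive pair, and it samples negatives uniformly. These are all simplifying assumptions on our side. The second word2vec paper instead draws negatives from the unigram distribution raised to the power 3/4.

```python
import random


def skipgram_negative_examples(tokens, window_size, vocabulary, num_negatives, seed=0):
    """Return (target, word, label) triples: label 1 for observed pairs, 0 for sampled ones."""
    rng = random.Random(seed)
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
        context_words = [tokens[j] for j in range(lo, hi) if j != i]
        # Negatives come from vocabulary words not observed around this target.
        candidates = [w for w in vocabulary if w != target and w not in context_words]
        for context in context_words:
            examples.append((target, context, 1))
            if candidates:
                for negative in rng.choices(candidates, k=num_negatives):
                    examples.append((target, negative, 0))
    return examples


# Two uniformly sampled negatives for every observed (target, context) pair.
for example in skipgram_negative_examples(["a", "b", "c"], 1, ["a", "b", "c", "d", "e"], 2):
    print(example)
```

The fixed seed only makes the toy output reproducible; it is not something the exercise asks for.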

### Train your own embeddings

In [None]:
from gensim.models import Word2Vec

# Train your Word2Vec model
# Note: gensim expects an iterable of token lists, so split each preprocessed string.
model = Word2Vec(sentences=[review.split() for review in train],
                 vector_size=100,      # Dimensionality of vectors
                 min_count=5,          # Ignore words with fewer occurrences
                 window=5,             # Window size
                 max_vocab_size=None,  # Optional cap on vocabulary size (None = no limit)
                 sg=0,                 # 0 = CBOW, 1 = skip-gram
                 hs=0,                 # 0 = no hierarchical softmax
                 negative=5            # Number of negative samples per positive example
                 )

In [None]:
# Check the words most similar to a test word
model.wv.most_similar(positive=['horrendous'], topn=3)

### Use pre-trained embeddings