1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [1]:
SOURCE_DIR = '/content/gdrive/MyDrive/NLP/HW2/Q3_data.csv'

In [2]:
import torch
import re
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
import math
from gensim.models import Word2Vec

In [3]:
def delete_hashtag_usernames(text):
  try:
    result = []
    for word in text.split():
      if word[0] not in ['@', '#']:
        result.append(word)
    return ' '.join(result)
  except:
    return ''

def delete_url(text):
  text = re.sub(r'http\S+', '', text)
  return text

def delete_ex(text):
  text = re.sub(r'\u200c', '', text)
  return text

# 0. Data preprocessing

In [None]:
import csv

In [4]:
# 1. extract all tweets from file and save them in memory
df = pd.read_csv(SOURCE_DIR)
df.head()
# 2. remove urls, hashtags and usernames. use the prepared functions
df['preprocessed_text'] = df['Text'].apply(delete_hashtag_usernames)
df['preprocessed_text'] = df['preprocessed_text'].apply(delete_url)
df['preprocessed_text'] = df['preprocessed_text'].apply(delete_ex)

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,Datetime,Text,PureText,Language,Sentiment,Date,preprocessed_text
0,0,2022-09-22 09:14:35+00:00,بنشین تا شود نقش فال ما \nنقش هم‌ فردا شدن\n#م...,بنشین تا شود نقش فال ما نقش هم‌ فردا شدن,fa,negative,2022-09-22,بنشین تا شود نقش فال ما نقش هم فردا شدن
1,1,2022-10-06 01:44:55+00:00,@Tanasoli_Return @dr_moosavi این گوزو رو کی گر...,این گوزو رو کی گردن میگیره؟؟ دچار زوال عقل شده...,fa,very negative,2022-10-06,این گوزو رو کی گردن میگیره؟؟ دچار زوال عقل شده...
2,2,2022-09-22 15:12:28+00:00,@ghazaleghaffary برای ایران، برای مهسا.\n#OpIr...,برای ایران، برای مهسا.,fa,positive,2022-09-22,برای ایران، برای مهسا.
3,3,2022-09-22 09:35:50+00:00,@_hidden_ocean مرگ بر دیکتاتور \n#OpIran \n#Ma...,مرگ بر دیکتاتور,fa,very negative,2022-09-22,مرگ بر دیکتاتور
4,4,2022-09-22 01:31:25+00:00,نذاریم خونشون پایمال شه.‌‌.‌‌.\n#Mahsa_Amini #...,نذاریم خونشون پایمال شه.‌‌.‌‌.,fa,negative,2022-09-22,نذاریم خونشون پایمال شه...


# 1. Functions

## Cosine Similarity

To measure the similarity between two words, you need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity(u, v)} = \frac {u \cdot v} {||u||_2 ||v||_2} = cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$.
* The cosine similarity depends on the angle between $u$ and $v$.
    * If $u$ and $v$ are very similar, their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value.

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
<caption><center><font color='purple'><b>Figure 1</b>: The cosine of the angle between two vectors is a measure of their similarity.</font></center></caption>

Implement the function `cosine_similarity()` to evaluate the similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

In [6]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """

    # Compute the dot product of u and v
    dot_product = np.dot(u, v)

    # Compute the Euclidean norms of u and v
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)

    # Calculate the cosine similarity
    cosine_similarity = dot_product / (norm_u * norm_v)

    return cosine_similarity

## find k nearest neighbors

In [7]:
def find_k_nearest_neighbors(word, embedding_dict, k):
  """
    implement a function to return the nearest words to an specific word based on the given dictionary

    Arguments:
        word           -- a word, string
        embedding_dict -- dictionary that maps words to their corresponding vectors
        k              -- the number of word that should be returned

    Returns:
        a list of size k consisting of the k most similar words to the given word

    Note: use the cosine_similarity function that you have implemented to calculate the similarity between words
  """

  # Get the embedding vector for the input word
  word_vector = embedding_dict.get(word)
  if word_vector is None:
      raise ValueError(f"Word '{word}' not found in the embedding dictionary.")

  # Calculate cosine similarity with all other words
  similarities = {}
  for other_word, other_vector in embedding_dict.items():
      if other_word != word:
          similarity = cosine_similarity(word_vector, other_vector)
          similarities[other_word] = similarity

  # Sort words by similarity (higher similarity first)
  sorted_words = sorted(similarities, key=lambda x: similarities[x], reverse=True)

  # Return the top k words (excluding the input word itself)
  nearest_words = sorted_words[:k]
  return nearest_words

# 2. One hot encoding

In [8]:
# making the vocabulary out of tweet texts
vocab = []
for sentence in df['preprocessed_text']:
  for word in sentence.split():
    if word not in vocab:
      vocab.append(word)

In [None]:
# 1. find one hot encoding of each word
# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse=False, handle_unknown="ignore")

# Fit and transform the vocabulary
one_hot_encoded = encoder.fit_transform(np.array(vocab).reshape(-1, 1))

# Create the dictionary
word_to_one_hot = {word: encoding for word, encoding in zip(vocab, one_hot_encoded)}
word_to_one_hot.get("آزادی")

# 2. find 10 nearest words from "آزادی"
find_k_nearest_neighbors("آزادی", word_to_one_hot, 10)



['بنشین', 'تا', 'شود', 'نقش', 'فال', 'ما', 'هم', 'فردا', 'شدن', 'این']

[link text](https://)##### Describe advantages and disadvantages of one-hot encoding

Advantage:  

Simplicity: It's a straightforward and easy-to-understand method for representing categorical variables.
Efficiency: For models that expect numerical input, it provides a clear and efficient way to encode categorical data.
Binary Representation: Each category is represented by a binary vector, which can be useful for certain types of algorithms that benefit from such distinct representations.

Disadvantage:  

Dimensionality: It can lead to a high-dimensional feature space, especially if the categorical variable has many unique values, which can be problematic for model performance and computation.
Sparsity: The resulting encoding is sparse, meaning that there are many zeros in the dataset, which can be inefficient for storage and computation.
Lack of Relationship Representation: One-hot encoding does not capture any relationships between categories since each category is equally distant from every other category in the encoding space.


# 3. TF-IDF

In [None]:
import math
# 1. find the TF-IDF of all tweets.
tweets = list(df['preprocessed_text'])
tf_idfs = {}
idf = {}

# calculating tf
mat = []
for tweet in tweets:
  word_cnt = {}
  length_of_tweet = len(tweet.split())
  for word in tweet.split():
    word_cnt[word] = word_cnt.get(word, 0) + 1

  for key, value in word_cnt.items():
    word_cnt[key] = value/length_of_tweet
  mat.append(word_cnt)
tf = []
for number, voc in enumerate(vocab):
  tf.append([])
  for word_dict in mat:
    tf[number].append(word_cnt.get(voc, 0))

In [None]:
# calculating idf
total = len(tweets)
for word in vocab:
  num = 0
  for tweet in tweets:
    if word in tweet:
      num += 1
  idf[word] = math.log(total / num)


In [None]:
# calculating tf-idf
for word in vocab:
  word_tf = tf.pop(0)
  tf_idfs[word] = np.array(word_tf) * idf[word]

In [None]:
import random
# 2. choose one tweets randomly.
# Generate a random number within the range
random_number = random.randint(0, len(tweets)-1)
random_tweet = tweets[random_number]

In [None]:
random_tweet

'با اعتماد به نفس نوشته حرکت سال ۸۸ بسیار ساختارمند بودو شکست خورد شماها هم عقب بکشید.اگر خیلی آدم درستی هستی تجربه بده، اگر خیلی لاشی هستی چرا ناامیدمون میکنی؟ تجربه شما باعث آگاهیتون نشده باعث ترسو شدنتون شده ولی برای ما آگاهی و بینش به وجود آورده.'

In [None]:
# centroid of a tweet which is average of its words' tf-idf vectors
def calculate_centroid(tweet, tf_idfs, n):
  centroid = np.zeros(shape=(n,))
  tokens = tweet.split()
  for token in tokens:
    centroid += tf_idfs[token]
  centroid = centroid / len(tokens)
  return centroid

In [None]:
# 3. find 10 nearest tweets from chosen tweet.
scores = []
random_centroid = calculate_centroid(random_tweet, tf_idfs, len(tweets))
for ind, tweet in enumerate(tweets):
  if ind != random_number:
    other_centroid = calculate_centroid(tweet, tf_idfs, len(tweets))
    score = cosine_similarity(random_centroid, other_centroid)
    scores.append(score)

In [None]:
import heapq
# Find the indices of the 10 largest numbers
largest_indices = heapq.nlargest(10, range(len(scores)), key=lambda i: scores[i])

In [None]:
largest_indices

[17419, 15728, 11261, 8700, 4095, 5170, 1666, 14229, 11992, 7088]

In [None]:
print("10 nearest tweets to the random tweet above are:")
for index in largest_indices:
  print(tweets[index])

10 nearest tweets to the random tweet above are:
برای اسم زیبای فارسیم که تو غربت میرینن توش با لحجشون 🌺✊
نایاک و بقیه کاسه لیسای این حکومت بدونن دیگه تمومه ماجرا.
چقدر حرف داره میزنه این پوستر
مردم امشب خیابان را در تسخیر جوانان است 
يا الآن يا هيچوقت!!
مگه ما ربات نبودیم؟ ساواک آخه؟!
برای چه کلمه ای به موهایت بافته بودی که همه را به خیابان کشاند
برای این همه دروغ. 
برخی تصاویر ممکن است شما را به راه رایت هدایت کند
پشت سر ژینا، ندا، نوید، پویا، بکتاش و... ایستادن. این خونها روی زمین نمیمونه مرگ بر جمهوری منحوس اسلامی 


##### Describe advantages and disadvantages of TF-IDF

Advatages:  

Relevance: TF-IDF helps to highlight words that are more relevant to the context of a particular document, as opposed to common words that appear across all documents.
Simplicity: It is straightforward to implement and understand. The calculations involved are not complex, making it computationally efficient.
Versatility: TF-IDF can be adapted for various NLP tasks, including information retrieval, text summarization, and keyword extraction.


Disadvantages:  

Context Ignorance: TF-IDF does not account for the meaning of words or their context within the document. It treats every term independently.
Document Length Bias: It may bias against longer documents where term frequency is naturally higher, even if the term is not particularly relevant.
Static: The method does not learn from data. It doesn't evolve or improve over time as more documents are added to the corpus.


# 4. Word2Vec

In [None]:
!pip install nltk



In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
# 1. train a word2vec model base on all tweets
tweets = [word_tokenize(tweet) for tweet in df['preprocessed_text']]
model = Word2Vec(sentences=tweets, vector_size=100, window=5, min_count=1, sg=0)
emb_dict = {}
for word in model.wv.index_to_key:
  emb_dict[word] = model.wv[word]
# 2. find 10 nearest words from "آزادی"
find_k_nearest_neighbors("آزادی", emb_dict, 10)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['زن،',
 'ازادی',
 'آزادی،',
 'عدالت',
 'وطنم',
 'زن',
 'مرد،',
 'زندگی',
 'اولیه',
 'زندگی،']

##### Describe advantages and disadvantages of Word2Vec

Advantages:  

Semantic Relationships: Word2Vec is excellent at capturing semantic relationships between words. Words with similar meanings are positioned closer together in the vector space.
Handling Rare Words: It works well with large datasets and is effective at handling rare words.
Flexibility: The model is flexible to linguistic context and can capture a variety of semantic nuances.


Disadvantages:  

Computational Cost: The Skip-Gram model of Word2Vec, which predicts context words from a target word, can be computationally expensive due to its task of predicting multiple context words.
Memory Usage: Storing co-occurrence matrices for the GloVe model, which is often compared with Word2Vec, requires more memory, which can be a limitation for some applications.
Context Window: The effectiveness of Word2Vec can be influenced by the size of the 'context window' used during training, which may not always capture the full semantic relationship if set improperly.


# 5. Contextualized embedding

In [None]:
!pip install transformers[sentencepiece]



In [9]:
# Load model and tokenizer
from transformers import BertForMaskedLM, BertTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling

model_name_tokenizer = "HooshvareLab/bert-base-parsbert-uncased"
model_name = "Sina-Alinejad-2002/NLP_HW2"
model = BertForMaskedLM.from_pretrained(model_name)
print(model.num_parameters()) #10457864
# model = BertForMaskedLM.from_pretrained(model_name)
# tokenizer = BertTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


162942880


In [None]:
!pip install accelerate -U

In [None]:
!pip install datasets

In [11]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
from transformers import AutoTokenizer
from datasets import Dataset

#load base tokenizer to train on dataset
tokenizer_base = AutoTokenizer.from_pretrained(model_name_tokenizer)
# convert pandas dataset to HF dataset
dataset = Dataset.from_pandas(df)

# define iterator
training_corpus = (
    dataset[i : i + 500]["preprocessed_text"]
    for i in range(0, len(dataset), 500)
)

#train the new tokenizer for dataset
tokenizer = tokenizer_base.train_new_from_iterator(training_corpus, 3000)
#test trained tokenizer for sample text
text = dataset['preprocessed_text'][123]
print(text)

config.json:   0%|          | 0.00/434 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/1.22M [00:00<?, ?B/s]

تو روزای عادی هرچی کیر پرت میکنی اخونده که میپره با دندون میگیره تو ببین این روزا یکیشون تخم میکنه نفس بکشه؟ سوراخ موشا رو بگردید بەرخودان ژیانە✌🏻


In [12]:
# let's check tokenization process
input_ids = tokenizer(text).input_ids
subword_view = [tokenizer.convert_ids_to_tokens(id) for id in input_ids]
np.array(subword_view)

array(['[CLS]', 'تو', 'روزا', '##ی', 'عادی', 'هرچی', 'کیر', 'پرت',
       'میکنی', 'اخوند', '##ه', 'که', 'میپ', '##ره', 'با', 'دن', '##د',
       '##ون', 'میگیره', 'تو', 'ببین', 'این', 'روزا', 'یکی', '##شون',
       'تخم', 'میکنه', 'نفس', 'بک', '##شه', '؟', 'سو', '##را', '##خ',
       'موش', '##ا', 'رو', 'بگ', '##ردی', '##د', 'ب', '##ە', '##رخ',
       '##ود', '##ان', 'ژیان', '##ە', '##✌🏻', '[SEP]'], dtype='<U6')

In [13]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

In [17]:
def tokenize_function(examples):
        # examples["preprocessed_text"] = [
        #     line for line in examples["preprocessed_text"] if len(line[0]) > 0 and not line[0].isspace()
        # ]
        a = tokenizer(
            examples["preprocessed_text"],
            padding="max_length",
            truncation=True,
            max_length=128,
            return_special_tokens_mask=True,
        )
        return a

In [None]:
tokenized_dataset = dataset.map(tokenize_function, batched=True, batch_size=1)

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'stor

In [None]:
from transformers import Trainer, TrainingArguments
from time import time
training_args = TrainingArguments(
    run_name=f'First Run-{time()}',
    output_dir='outputs', overwrite_output_dir=False,
    auto_find_batch_size=False, # pip install accelerate
    per_device_train_batch_size = 8,
    num_train_epochs=2,
    save_strategy='steps', save_steps=512,
    save_total_limit=5,
    save_safetensors=False,
    group_by_length=True,
    push_to_hub=True,
    hub_model_id='Sina-Alinejad-2002/NLP_HW2',
    hub_strategy='every_save',
    hub_private_repo=False,
    hub_token='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',
    learning_rate = 1e-4,
    # report_to='wandb'
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [None]:
trainer.train()

Step,Training Loss
500,5.3903
1000,6.3889
1500,6.2898
2000,6.0937
2500,6.0206
3000,5.8448
3500,5.6948
4000,5.5496


Step,Training Loss
500,5.3903
1000,6.3889
1500,6.2898
2000,6.0937
2500,6.0206
3000,5.8448
3500,5.6948
4000,5.5496
4500,5.4765
5000,5.364


TrainOutput(global_step=5000, training_loss=5.811301806640625, metrics={'train_runtime': 1912.6785, 'train_samples_per_second': 20.913, 'train_steps_per_second': 2.614, 'total_flos': 2634182492160000.0, 'train_loss': 5.811301806640625, 'epoch': 2.0})

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

In [26]:
test_str = "آزادی"
model_inputs = tokenizer(vocab[:500], return_tensors="pt", padding=True, truncation=True, max_length=128)
model_inputs.to(device)
# Unpack the outputs from the tokenizer (might vary depending on tokenizer)
# input_ids = inputs['input_ids']  # Assuming 'input_ids' is the key for token ids
# attention_mask = inputs['attention_mask']  # Assuming 'attention_mask' is a key
# type(torch.Tensor(input_ids))
outputs = model(**model_inputs)  # Pass unpacked arguments

In [31]:
word_embeddings = model.bert.embeddings.word_embeddings

model_embedding_dict = {word: word_embeddings(torch.tensor([tokenizer.convert_tokens_to_ids(word)]).to(device)).cpu().squeeze(0).detach().numpy() for word in tokenizer.get_vocab().keys()}

In [2]:
word = "ازادی"
nearest = find_k_nearest_neighbors(word, model_embedding_dict, 10)
nearest

['ازادی',
 'ازاد',
 'ازادیها',
 'رهایی',
 'ازادیهای',
 'ازادسازی',
 'ازادیها',
 'اسایش',
 'استقلال',
 'زندگی']

##### Describe advantages and disadvantages of Contextualized embedding

Advantages:  

Context Awareness: Unlike traditional embeddings, contextualized embeddings take into account the context in which a word appears, allowing for a more nuanced understanding of its meaning.
Polysemy Handling: They can differentiate between the meanings of words that have multiple meanings based on their usage in a sentence.
Improved Performance: For many NLP tasks, contextualized embeddings have been shown to significantly outperform non-contextual embeddings, leading to state-of-the-art results.


Disadvantages:  

Computational Resources: Training models like BERT from scratch requires significant computational resources, which can be costly and time-consuming.
Complexity: The complexity of these models makes them harder to understand and work with compared to simpler models like word2vec or GloVe.
Data Requirements: While they can work well with less data, they still require a substantial amount of training data to perform effectively.
