Task 1. Describe your dataset, including the following factors: size, number of distinct words, word frequencies, and number of frequent and rare words. (10%)

Task 2. Use Word2Vec to train the model and describe the chosen parameters (https://github.com/tmikolov/word2vec). (30%)
                Use your model to find synonyms (or words with similar meaning) for 10 different words. Was your model accurate? Describe the results.
                Apply mathematical operations to the vector values of similar words. Are there any patterns visible? Describe the results.

Task 3. Use GloVe to train the model and describe the chosen parameters (https://github.com/stanfordnlp/GloVe). (30%)
                Use your model to find synonyms (or words with similar meaning) for 10 different words. Was your model accurate? Describe the results.
                Apply mathematical operations to the vector values of similar words. Are there any patterns visible? Describe the results.

Task 4. Compare the GloVe results with Word2Vec. (10%)


In [2]:
from gensim.models import Word2Vec
import logging


# 1. Dataset Description

In [7]:
import pandas as pd
import re

def tokenize_document(document):
    # Use regular expression to split tokens based on non-alphanumeric characters
    words = re.split(r'\W+', document)
    # Filter out empty strings and single characters
    words = [word.strip() for word in words if len(word.strip()) > 1]
    return words


# Load the news dataset from CSV
news_df = pd.read_csv("oxu_az_500_000.csv")

# Size of the dataset (number of news articles)
dataset_size = len(news_df)
    
news_df['content'] = news_df['content'].str.lower()

# Concatenate all news articles into one large text
all_text = " ".join(news_df['content'])
    
# Tokenize once and reuse the token list for both statistics
tokens = tokenize_document(all_text)

# Distinct words (the vocabulary)
distinct_words = set(tokens)

# Number of distinct words
num_distinct_words = len(distinct_words)

# Word frequencies
word_freq = {}
for word in tokens:
    word_freq[word] = word_freq.get(word, 0) + 1
    
# Number of frequent and rare words
frequent_words_threshold = 100  # Define your threshold for frequent words
rare_words_threshold = 5  # Define your threshold for rare words
frequent_words = [word for word, freq in word_freq.items() if freq >= frequent_words_threshold]
rare_words = [word for word, freq in word_freq.items() if freq <= rare_words_threshold]
    
# Display dataset description
print("Dataset Description:")
print("====================")
print(f"Size of the dataset: {dataset_size} news articles")
print(f"Number of distinct words: {num_distinct_words}")
print(f"Number of frequent words (appearing >= {frequent_words_threshold} times): {len(frequent_words)}")
print(f"Number of rare words (appearing <= {rare_words_threshold} times): {len(rare_words)}")
    
# Optionally, you can print some frequent and rare words
print("\nExample frequent words:")
print("========================")
print(frequent_words[:10])  # Print first 10 frequent words
print("\nExample rare words:")
print("========================")
print(rare_words[:10])  # Print first 10 rare words

Dataset Description:
Size of the dataset: 15571 news articles
Number of distinct words: 120750
Number of frequent words (appearing >= 100 times): 3434
Number of rare words (appearing <= 5 times): 90100

Example frequent words:
['altında', 'yeni', 'tapılıb', 'güclü', 'sonra', 'cü', 'gündə', 'aşkar', 'edilib', 'müvafiq']

Example rare words:
['dalia', 'tampa', 'ucundan', 'şimala', 'muniri', 'parkından', 'ötürərək', 'dağdan', 'enmək', 'səhhətlərində']
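
Task 1 also asks for word frequencies, which the cell above computes but never reports. The following is a minimal sketch of one way to summarise them, using only the standard library. The function name `frequency_summary` and the toy token list are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def frequency_summary(tokens, top_n=5):
    """Summarise a token list: totals, most common words, and hapax legomena."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    # Words that occur exactly once in the corpus
    hapaxes = sum(1 for c in counts.values() if c == 1)
    # Share of all running tokens covered by the top_n word types
    coverage = sum(c for _, c in top) / total if total else 0.0
    return {
        "tokens": total,
        "types": len(counts),
        "top": top,
        "hapaxes": hapaxes,
        "top_coverage": coverage,
    }

# Toy input (hypothetical); in the notebook this would be the full token list
toy_tokens = "a b a c a b d".split()
summary = frequency_summary(toy_tokens, top_n=2)
print(summary)
```

On the real corpus, the same function could be called on `tokenize_document(all_text)` to list the most frequent words with their counts.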


In [4]:
news_df = news

In [6]:
news_df['content']

0        mərakeşdə dağıntılar altında yeni doğulmuş kör...
1        “i̇dalia” tropik qasırğası kubanın qərb əyalət...
2        bizim whatsapp kanalımıza buradan abunə ola bi...
3        dağlıq ərazidə itkin düşmüş şəxslər tapılıblar...
4        bir gündür ki, azərbaycan qızıl aypara cəmiyyə...
                               ...                        
15566    bizim whatsapp kanalımıza buradan abunə ola bi...
15567    bizim whatsapp kanalımıza buradan abunə ola bi...
15568    bizim whatsapp kanalımıza buradan abunə ola bi...
15569    bizim whatsapp kanalımıza buradan abunə ola bi...
15570    bizim whatsapp kanalımıza buradan abunə ola bi...
Name: content, Length: 15571, dtype: object
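
The second row of the output above starts with an `i` followed by a stray combining dot. This is what Python's `str.lower()` produces for the Azerbaijani capital `İ` (U+0130), and it can split one word into several vocabulary entries. The sketch below shows a possible Azerbaijani-aware lowercasing step. The helper name `az_lower` is an assumption, and mapping every `I` to `ı` would also affect foreign words, so treat it as a starting point only.

```python
def az_lower(text):
    """Lowercase text using the Azerbaijani dotted/dotless i convention."""
    # Map the two capital i letters explicitly before the generic lower():
    #   İ (U+0130) -> i   and   I (U+0049) -> ı (U+0131)
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

# Plain str.lower() turns İ into 'i' plus a combining dot above (U+0307)
plain = "\u0130dalia".lower()
fixed = az_lower("\u0130dalia")
print(len(plain), len(fixed))
```

If adopted, this would replace the `.str.lower()` call with `.apply(az_lower)` on the `content` column.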

In [4]:
print(distinct_words)

{'dəvətlidir', 'dağlıq', 'bilmirsə', 'tutulmasına', 'ödənilmədiyi', 'kartını', 'anlayacaqsan', 'damlı', 'komandanızın', 'məqbərəsinin', 'Hansen', 'həsrətində', 'Junior718Braziliya', 'Hikməti', 'qarşılanmadığı', 'Klaksvik', '625', '3385', 'dönüklüyün', 'החמאס', 'Perkuşin', 'Nurlan', 'nədəsə', 'seçkilik', 'deputatlardan', 'hədəfləyirlər', 'Ehtiyat', 'Bayramlının', 'dincələn', 'like', 'bilməsəm', 'oğluNiftəliyev', 'Marketə', '612', 'Hüseynovun', 'güləşdən', 'Məzuniyyəti', 'oynayacağımıza', 'Mxitaryan', 'toplarla', 'parikləri', '340', 'kampusdur', 'bacalarda', 'Makinenin', 'qaranlığında', 'etimad', 'dözülməzdir', 'Tikinti', 'Qudaf', 'qovşaqlar', 'israillini', 'vidalaşırdı', 'Cessi', 'obyektin', 'Yerablur', 'köçsün', 'yerləşdirilmişdir', 'salındılar', 'Ətrafı', 'Fiat', 'xanımının', 'zibilliklər', 'Sənayedə', 'başındadır', 'Komandanım', 'qiymətə', 'Markalanma', 'yekdil', '3KA0f1NCari', 'FSX2023', 'qatılmışdı', 'kolonializmindən', 'nutriya', 'setteyim', 'körpəyə', 'Smolensk', 'daşıdı', 'payım

# 2. Word2Vec

In [8]:
# Enable logging to see training progress
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

# Tokenize the text data
tokenized_text = [tokenize_document(text) for text in news_df['content']]

# Train the Word2Vec model
vector_size = 100  # Dimensionality of the word vectors
window = 5  # Maximum distance between the current and predicted word within a sentence
min_count = 5  # Ignore all words with total frequency lower than min_count
workers = 4  # Number of CPU cores to use for training the model

word2vec_model = Word2Vec(tokenized_text, vector_size=vector_size, window=window, min_count=min_count, workers=workers)

print("Word2Vec Model Parameters:")
print("===========================")
print(f"Vector size: {vector_size}")
print(f"Window size: {window}")
print(f"Minimum count: {min_count}")
print(f"Number of workers: {workers}")


2024-05-09 21:09:43,797 : INFO : collecting all words and their counts
2024-05-09 21:09:43,798 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-05-09 21:09:43,955 : INFO : PROGRESS: at sentence #10000, processed 1571196 words, keeping 95991 word types
2024-05-09 21:09:44,047 : INFO : collected 120750 word types from a corpus of 2476044 raw words and 15571 sentences
2024-05-09 21:09:44,048 : INFO : Creating a fresh vocabulary
2024-05-09 21:09:44,104 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 34727 unique words (28.76% of original 120750, drops 86023)', 'datetime': '2024-05-09T21:09:44.104372', 'gensim': '4.3.2', 'python': '3.11.5 (v3.11.5:cce6ba91b3, Aug 24 2023, 10:50:31) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-14.3-arm64-arm-64bit', 'event': 'prepare_vocab'}
2024-05-09 21:09:44,105 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 2331493 word corpus (94.16% of original 2476044, drops 1

Word2Vec Model Parameters:
Vector size: 100
Window size: 5
Minimum count: 5
Number of workers: 4
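
The training log above reports that `min_count=5` retains 34727 of 120750 word types (28.76%) while still covering 94.16% of the running words. To help justify the choice of `min_count`, the sketch below reproduces that calculation for any threshold with the standard library. The function name `min_count_effect` and the toy sentences are illustrative assumptions.

```python
from collections import Counter

def min_count_effect(sentences, min_count):
    """Report how many word types and running tokens survive a min_count cut."""
    counts = Counter(tok for sent in sentences for tok in sent)
    kept = {w: c for w, c in counts.items() if c >= min_count}
    total_types = len(counts)
    total_tokens = sum(counts.values())
    kept_tokens = sum(kept.values())
    return {
        "kept_types": len(kept),
        "type_share": len(kept) / total_types if total_types else 0.0,
        "kept_tokens": kept_tokens,
        "token_share": kept_tokens / total_tokens if total_tokens else 0.0,
    }

# Toy corpus (hypothetical); in the notebook this would be tokenized_text
toy_sentences = [["a", "b", "a"], ["a", "c", "b"], ["d", "a"]]
effect = min_count_effect(toy_sentences, min_count=2)
print(effect)
```

Running it for a few thresholds (for example 2, 5, and 10) on `tokenized_text` would show how quickly the vocabulary shrinks relative to token coverage.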


In [9]:
# Find synonyms for 10 different words and describe results
def find_synonyms(word2vec_model):
    words_to_find_synonyms = ["şəhər", "görüntülər","tələ","hücum","görüş","ağac","yük","nəticəsinə","silah","xəsarət"]
    print("Synonyms for 10 different words:")
    print("=================================")
    for word in words_to_find_synonyms:
        similar_words = word2vec_model.wv.most_similar(word)
        print(f"{word}:")
        for similar_word, similarity_score in similar_words:
            print(f"  - {similar_word} (Similarity Score: {similarity_score})")
        print()  # Add a blank line after each word's synonyms

    
# Call the functions
find_synonyms(word2vec_model)

Synonyms for 10 different words:
şəhər:
  - rayon (Similarity Score: 0.8197852373123169)
  - şəhəri (Similarity Score: 0.8026642799377441)
  - gəncə (Similarity Score: 0.7909250259399414)
  - cra (Similarity Score: 0.7889021039009094)
  - qazax (Similarity Score: 0.7811596393585205)
  - şəhərinin (Similarity Score: 0.7736684083938599)
  - lənkəran (Similarity Score: 0.7677494287490845)
  - goranboy (Similarity Score: 0.7533689737319946)
  - sumqayıt (Similarity Score: 0.7508994340896606)
  - cəlilabad (Similarity Score: 0.7432382106781006)

görüntülər:
  - kadrlar (Similarity Score: 0.8771671056747437)
  - kadrlarda (Similarity Score: 0.8698485493659973)
  - video (Similarity Score: 0.8671796321868896)
  - yayılan (Similarity Score: 0.8338212370872498)
  - görüntülərini (Similarity Score: 0.8278599977493286)
  - paylaşılıb (Similarity Score: 0.8252955675125122)
  - videoda (Similarity Score: 0.8190869092941284)
  - kadrları (Similarity Score: 0.8118186593055725)
  - görüntüləri (Simila

In [10]:
# Apply mathematical equations on vector values of similar words and describe results
def apply_mathematical_equations(word2vec_model):
    print("\nApplying mathematical equations on vector values of similar words:")
    print("====================================================================")
    
    results = word2vec_model.wv.most_similar_cosmul(positive=["taxta","şüşə"])#, negative=["qadın"])
    print("Şüşə + Taxta:")
    for result in results:
        word, similarity = result
        print(f"{word}: {similarity:.4f}")

apply_mathematical_equations(word2vec_model)




Applying mathematical equations on vector values of similar words:
Şüşə + Taxta:
birmərtəbəli: 0.9234
günəbaxan: 0.9070
kürsülü: 0.9037
üçotaqlı: 0.9032
birotaqlı: 0.9024
otaqlı: 0.8986
atmaları: 0.8963
ikiotaqlı: 0.8933
toyuq: 0.8855
qurumuş: 0.8843
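
To interpret scores like the ones above, it helps to see how two query words can be combined. The sketch below contrasts an additive score (cosine to the mean of the normalised query vectors, which is how I understand `most_similar` to work) with a multiplicative score (product of cosines shifted into [0, 1], in the spirit of `most_similar_cosmul`). The two-dimensional vectors and all names are made up for illustration and do not come from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def unit(v):
    """Scale a vector to length 1."""
    norm = math.hypot(*v)
    return [a / norm for a in v]

def add_score(candidate, positives):
    """Additive combination: cosine to the mean of the unit query vectors."""
    units = [unit(p) for p in positives]
    mean = [sum(col) / len(units) for col in zip(*units)]
    return cosine(candidate, mean)

def mul_score(candidate, positives):
    """Multiplicative combination: product of cosines shifted to [0, 1]."""
    score = 1.0
    for p in positives:
        score *= (1.0 + cosine(candidate, p)) / 2.0
    return score

# Hypothetical 2-D vectors standing in for the two query words and two candidates
glass, wood = [1.0, 0.0], [0.0, 1.0]
between, off_axis = [1.0, 1.0], [1.0, -1.0]

for name, vec in [("between", between), ("off_axis", off_axis)]:
    print(name, round(add_score(vec, [glass, wood]), 4), round(mul_score(vec, [glass, wood]), 4))
```

Both scores rank the candidate that lies between the two query vectors higher, but the multiplicative score penalises a candidate that is close to one query word and far from the other more sharply.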


# 3. Glove

In [8]:
from collections import defaultdict

def create_coocurrence_matrix(tokens, window_size=5):
    cooccur_matrix = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        for j in range(max(0, i - window_size), min(len(tokens), i + window_size + 1)):
            if i != j:
                cooccur_matrix[token][tokens[j]] += 1
    return cooccur_matrix
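
The function above expects one flat list of tokens, so calling it on a list of whole documents would count co-occurrences between documents rather than between words. The sketch below is a variant that iterates over documents and weights each pair by the inverse of its distance, which is the default behaviour of the reference GloVe co-occurrence tool as far as I know. The name `weighted_cooccurrence` and the toy documents are assumptions for illustration.

```python
from collections import defaultdict

def weighted_cooccurrence(documents, window_size=5):
    """Symmetric co-occurrence counts, weighted by 1/distance, per document."""
    counts = defaultdict(float)
    for tokens in documents:
        for i, word in enumerate(tokens):
            # Look only at the left context; add both directions for symmetry
            for j in range(max(0, i - window_size), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts

# Toy documents (hypothetical); each document is its own list of tokens
toy_docs = [["a", "b", "c"], ["c", "a"]]
cooc = weighted_cooccurrence(toy_docs, window_size=5)
print(dict(cooc))
```

Because windows never cross document boundaries, the last word of one article does not count as context for the first word of the next.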


In [None]:
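from glove import Corpus, Glove  # assumes the glove-python package
import pandas as pd

# Load the news dataset from CSV
news_df = pd.read_csv("oxu_az_500_000.csv")

# One token list per article; Corpus expects an iterable of token lists
sentences = [tokenize_document(text) for text in news_df['content']]

# Build the sparse co-occurrence matrix with glove-python's Corpus helper
corpus = Corpus()
corpus.fit(sentences, window=5)

vector_size = 100

# glove-python names the dimensionality parameter no_components
glove_model = Glove(no_components=vector_size, learning_rate=0.05)
glove_model.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove_model.add_dictionary(corpus.dictionary)  # needed for word lookups

glove_model.save('glove_model.model')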
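
When describing the GloVe parameters, it may help to recall the objective from the GloVe paper (Pennington et al., 2014), since it is what separates GloVe from Word2Vec: GloVe fits word vectors to the logarithm of global co-occurrence counts instead of predicting context words one window at a time. The symbols below follow the paper and are not names used in this notebook: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are word and context vectors, $b_i$ and $\tilde{b}_j$ are biases, and $V$ is the vocabulary size. The paper reports $x_{\max} = 100$ and $\alpha = 3/4$.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

The weighting function $f$ limits the influence of very frequent pairs and gives zero weight to pairs that never co-occur.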


In [11]:
import pandas as pd
import re
import csv
from gensim.models import Word2Vec

# Predefined functions
def load_stopwords(stopwords_file):
    # Return a set so that membership tests in remove_stopwords are fast
    with open(stopwords_file, 'r', encoding='utf-8') as f:
        stopwords = {word.strip() for word in f if word.strip()}
    return stopwords

def tokenize_document(document):
    words = re.split(r'\W+', document)
    words = [word.strip() for word in words if len(word.strip()) > 1]
    return words

def remove_stopwords(text, stopwords):
    words = tokenize_document(text)
    filtered_words = [word for word in words if word.lower() not in stopwords]
    return ' '.join(filtered_words)

def remove_stopwords_from_csv(input_csv, output_csv, stopwords):
    with open(input_csv, 'r', encoding='utf-8') as csv_input:
        reader = csv.DictReader(csv_input)
        fieldnames = reader.fieldnames
        
        with open(output_csv, 'w', encoding='utf-8', newline='') as csv_output:
            writer = csv.DictWriter(csv_output, fieldnames=fieldnames)
            writer.writeheader()

            for row in reader:
                row['content'] = remove_stopwords(row['content'], stopwords)
                writer.writerow(row)

# Load and clean data
stopwords_az = load_stopwords('azerbaijani.txt')
remove_stopwords_from_csv('oxu_az_500_000.csv', 'news_without_stopwords.csv', stopwords_az)
print("Stopwords removed from the CSV file and saved to 'news_without_stopwords.csv'")

# Load data into pandas DataFrame
df = pd.read_csv('news_without_stopwords.csv')

def remove_digits_from_dataframe(df):
    # Raw string avoids the invalid escape sequence warning for \d
    df['content'] = df['content'].str.replace(r'\d+', '', regex=True)
    return df

df = remove_digits_from_dataframe(df)

# Tokenize documents
df['tokens'] = df['content'].apply(tokenize_document)

# Prepare texts: list of list of words
texts = df['tokens'].tolist()

# Train a Word2Vec model as a stand-in for GloVe (this is NOT GloVe itself)
model = Word2Vec(sentences=texts, vector_size=100, window=5, min_count=2, sg=0)  # sg=0 selects CBOW; sg=1 would select skip-gram

# Save the model
model.save("word2vec_model_news.model")
print("Model trained and saved as 'word2vec_model_news.model'")


Stopwords removed from the CSV file and saved to 'news_without_stopwords.csv'


2024-05-09 22:20:17,791 : INFO : collecting all words and their counts
2024-05-09 22:20:17,792 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-05-09 22:20:17,931 : INFO : PROGRESS: at sentence #10000, processed 1264739 words, keeping 102883 word types
2024-05-09 22:20:17,998 : INFO : collected 129866 word types from a corpus of 1986515 raw words and 15571 sentences
2024-05-09 22:20:17,998 : INFO : Creating a fresh vocabulary
2024-05-09 22:20:18,084 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 retains 75973 unique words (58.50% of original 129866, drops 53893)', 'datetime': '2024-05-09T22:20:18.084444', 'gensim': '4.3.2', 'python': '3.11.5 (v3.11.5:cce6ba91b3, Aug 24 2023, 10:50:31) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-14.3-arm64-arm-64bit', 'event': 'prepare_vocab'}
2024-05-09 22:20:18,085 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 leaves 1932622 word corpus (97.29% of original 1986515, drops 

Model trained and saved as 'word2vec_model_news.model'
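
Two preprocessing details in the cell above may explain odd vocabulary entries such as `bilərsinizNoyabr` or `Junior718Braziliya` seen elsewhere in this notebook: deleting digits with an empty replacement glues the neighbouring words together, and the scraped text sometimes lacks a space before a capitalised word. The sketch below shows one possible cleanup step. The helper name `unglue` and the character classes are assumptions, and the case-boundary rule would also split legitimate mixed-case names.

```python
import re

# Zero-width position between a lowercase letter and an uppercase letter,
# including the Azerbaijani-specific letters
_CASE_BOUNDARY = re.compile(r'(?<=[a-zəıöüçşğ])(?=[A-ZƏİÖÜÇŞĞ])')

def unglue(text):
    """Separate words that were fused by digit removal or missing spaces."""
    text = re.sub(r'\d+', ' ', text)        # replace digits with a space, not ''
    text = _CASE_BOUNDARY.sub(' ', text)    # insert a space at lower->upper boundaries
    return re.sub(r'\s+', ' ', text).strip()

for sample in ["bilərsinizNoyabr", "Junior718Braziliya", "oğluNiftəliyev"]:
    print(sample, "->", unglue(sample))
```

Applying a step like this before tokenization, together with lowercasing, would likely reduce the number of one-off word types in the vocabulary.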


In [14]:
from gensim.models import Word2Vec

# Load your model
model = Word2Vec.load("word2vec_model_news.model")

# List of words to find synonyms for
words = ["şəhər", "görüntülər","tələ","hücum","görüş","ağac","yük","nəticəsinə","silah","xəsarət"]

# Find synonyms and analyze accuracy
print("Synonyms and Similar Words:")
for word in words:
    try:
        similar_words = model.wv.most_similar(word)
        print(f"{word}:")
        for similar_word, similarity_score in similar_words:
            print(f"  - {similar_word} (Similarity Score: {similarity_score})")
        print()  # Add a blank line after each word's synonyms
    except KeyError:
        print(f"{word}: Word not in vocabulary.")


2024-05-09 22:28:05,208 : INFO : loading Word2Vec object from word2vec_model_news.model
2024-05-09 22:28:05,267 : INFO : loading wv recursively from word2vec_model_news.model.wv.* with mmap=None
2024-05-09 22:28:05,268 : INFO : setting ignored attribute cum_table to None
2024-05-09 22:28:05,461 : INFO : Word2Vec lifecycle event {'fname': 'word2vec_model_news.model', 'datetime': '2024-05-09T22:28:05.461871', 'gensim': '4.3.2', 'python': '3.11.5 (v3.11.5:cce6ba91b3, Aug 24 2023, 10:50:31) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-14.3-arm64-arm-64bit', 'event': 'loaded'}


Synonyms and Similar Words:
şəhər:
  - şəhəri (Similarity Score: 0.8837113380432129)
  - rayon (Similarity Score: 0.8772793412208557)
  - şəhərinin (Similarity Score: 0.8667839765548706)
  - Ağcabədi (Similarity Score: 0.8511502146720886)
  - qəsəbə (Similarity Score: 0.8500770926475525)
  - Yevlax (Similarity Score: 0.8477436304092407)
  - Qazax (Similarity Score: 0.8436801433563232)
  - Gəncə (Similarity Score: 0.8423677682876587)
  - Goranboy (Similarity Score: 0.8371660709381104)
  - rayonlarının (Similarity Score: 0.8214628100395203)

görüntülər:
  - kadrlar (Similarity Score: 0.9159531593322754)
  - Kadrlarda (Similarity Score: 0.9134451150894165)
  - video (Similarity Score: 0.9041030406951904)
  - görüntüləri (Similarity Score: 0.8978232741355896)
  - videogörüntülər (Similarity Score: 0.893683910369873)
  - kadrları (Similarity Score: 0.8813112378120422)
  - yayılan (Similarity Score: 0.8721815347671509)
  - görüntülərini (Similarity Score: 0.8711584806442261)
  - mediada (Sim

In [11]:

# Example of vector arithmetic to find semantic relationships
try:
    results = model.wv.most_similar(positive=['şüşə', 'taxta'])#, negative=['kişi'])
    print("\nApplying mathematical equations on vector values of similar words:")
    print("====================================================================")  
    print("Şüşə + taxta:")
    for result in results:
        word, similarity = result
        print(f"{word}: {similarity:.4f}")

except KeyError as e:
    print(e)



Applying mathematical equations on vector values of similar words:
Şüşə + taxta:
mətbəxində: 0.9690
palçıq: 0.9640
uzunluğunda: 0.9638
gümüşü: 0.9619
diri: 0.9613
bərkidilmiş: 0.9605
traktorun: 0.9590
soyuducu: 0.9585
çörək: 0.9580
qum: 0.9566


In [12]:

# Example of vector arithmetic to find semantic relationships
try:
    results = model.wv.most_similar(positive=['avtovağzal'], negative=['avtobus'])
    print("====================================================================")    
    print("\nVector arithmetic (avtovağzal - avtobus):")
    for result in results:
        word, similarity = result
        print(f"{word}: {similarity:.4f}")

except KeyError as e:
    print(e)



Vector arithmetic (avtovağzal - avtobus):
Müttəfiq: 0.5222
İcmasında: 0.5131
bilərsinizNoyabr: 0.4987
adamlarımızın: 0.4919
tanımayacaq: 0.4630
bilərsinizŞimal: 0.4560
Pedoqoji: 0.4483
özəlliklə: 0.4394
addımıdır: 0.4379
bilərsinizMaliyyə: 0.4364
