# Data Engineering Pipeline for the Second Part of the Project: Fine-tuning the 30-dim Embeddings


In this notebook, we prepare the data needed to fine-tune the reduced embeddings. We first fetch a training corpus of x words from a few Wikipedia pages. We also create ID pairs of central and context words. Once this is done, we fetch the x corresponding 300-dim word embeddings from fastText and pass them through the previously saved AutoEncoder, whose bottleneck layer returns the reduced 30-dim embeddings. Finally, we save the reduced embeddings and the corpus data.

## 1. Load Required Libraries

We will start by loading the necessary libraries.

In [1]:
import os
os.chdir("..")  # Move up one directory
import numpy as np
import torch
from torch.utils.data import DataLoader
import wikipediaapi
import spacy
from utils.utils import DataSamplization
from sklearn.preprocessing import StandardScaler
from models.autoencoder import AutoEncoder

## 2. Load Wikipedia-Extracted French Data

We fetch French Wikipedia articles on a few topics and save them as text files. They form the training corpus for the Word2Vec model.

In [None]:
user_agent = "WikipediaAPI/0.5 (Academic Project; rayan.hanader@gmail.com)"
wiki_fr = wikipediaapi.Wikipedia(language='fr', extract_format=wikipediaapi.ExtractFormat.WIKI, user_agent=user_agent)


topics = [
    "Animal domestique", "Animal de compagnie"
]


output_dir = "data/wikipediaDump"
os.makedirs(output_dir, exist_ok=True)


for topic in topics:
    page = wiki_fr.page(topic)
    if page.exists():
        print(f"Fetching article: {topic}")
        with open(f"{output_dir}/{topic.replace(' ', '_')}.txt", "w", encoding="utf-8") as f:
            f.write(page.text)
    else:
        print(f"Article not found: {topic}")

print(f"Articles fetched and saved in {output_dir}")

## 3. Preprocessing of the extracted data

We preprocess the Wikipedia data with spaCy: tokenization, lowercasing, stopword removal, and removal of non-alphabetic tokens.

In [2]:
# Load SpaCy's French language model
nlp = spacy.load("fr_core_news_sm")

# Define directories
input_dir = "data/wikipediaDump"
preprocessed_dir = "data/preprocessedWikiDump"
os.makedirs(preprocessed_dir, exist_ok=True)

def preprocess_text_spacy(text):
    """
    Preprocess the input text using spaCy.
    - Tokenization
    - Lowercasing
    - Stopword removal
    - Removal of non-alphabetic tokens
    """
    # Process text with spaCy
    doc = nlp(text)
    # Filter tokens: keep alphabetic tokens, not stopwords, and in lowercase
    tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]
    return tokens

# Process each article
words = []
output_path = os.path.join(preprocessed_dir, "preprocessedWikiDump.txt")
for file_name in os.listdir(input_dir):
    input_path = os.path.join(input_dir, file_name)

    if file_name.endswith(".txt"):
        print(f"Processing {file_name}")
        with open(input_path, "r", encoding="utf-8") as f:
            text = f.read()

        # Preprocess the text
        tokens = preprocess_text_spacy(text)
        words+=tokens



dataSamplization = DataSamplization()


Processing Animal_de_compagnie.txt
Processing Animal_domestique.txt
Loading saved FastText model...


## 4. Fetching of the corresponding embeddings from fasttext

We fetch the 300-dim fastText embeddings corresponding to the words of the corpus previously fetched from Wikipedia.
We pass these embeddings through the encoder to get the corresponding 30-dim bottleneck embeddings.
We also extract the IDs of the words, which are later used to construct the Skip-gram pairs.
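
The bookkeeping involved here (dropping words that have no embedding, keeping one entry per distinct word, and renumbering IDs so that they index rows of the embedding array) can be illustrated with a minimal, self-contained sketch. The toy `toy_vocab` dictionary and the helper name `build_index` are hypothetical stand-ins: the notebook itself relies on `DataSamplization`, whose internals are not shown, and the sketch assumes that an unknown word is signalled by the ID `-1`.

```python
def build_index(words, vocab):
    """Filter out-of-vocabulary words and remap IDs to contiguous row indices.

    `vocab` is a hypothetical word -> ID mapping standing in for the fastText
    vocabulary; unknown words are given the ID -1 (an assumption).
    """
    ids = [vocab.get(w, -1) for w in words]
    # Keep only the words that have an embedding
    kept_words = [w for w, i in zip(words, ids) if i != -1]
    # Distinct words in order of first appearance (dicts preserve insertion order)
    unique_words = list(dict.fromkeys(kept_words))
    # Row index of each distinct word in the embedding matrix
    row_of = {w: row for row, w in enumerate(unique_words)}
    corpus_rows = [row_of[w] for w in kept_words]
    return kept_words, unique_words, corpus_rows


toy_vocab = {"chien": 1042, "chat": 87, "animal": 5}
toy_words = ["chien", "xyzzy", "chat", "chien", "animal"]
kept_words, unique_words, corpus_rows = build_index(toy_words, toy_vocab)
print(kept_words)    # ['chien', 'chat', 'chien', 'animal']
print(unique_words)  # ['chien', 'chat', 'animal']
print(corpus_rows)   # [0, 1, 0, 2]
```

With this layout, row `k` of the saved embedding array corresponds to `unique_words[k]`, and the corpus is a sequence of row indices that can be fed directly to the pair construction step.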

In [7]:
wordIds = dataSamplization.getWordsIds(words)
wordIdsToMatchEmbeds = dataSamplization.getWordsIdsWihtoutRepeat(words)
cp = wordIds.copy()
for i in range(len(cp)-1, -1, -1):
    if cp[i] == -1:
        # Remove the unfound word from the corpus
        words.pop(i)
        wordIds.pop(i)
cp = wordIdsToMatchEmbeds.copy()
for i in range(len(cp)-1, -1, -1):
    if cp[i] == -1:
        wordIdsToMatchEmbeds.pop(i)
embeddings = dataSamplization.getWordsEmbeddings(words)

# Save the embeddings before reducing as a dict {word : embed}
# Distinct corpus words in order of first appearance
wordsToMatchEmbeds = list(dict.fromkeys(words))
os.makedirs('data/modelsSavedLocally/wikipedia', exist_ok=True)
embedsPath = 'data/modelsSavedLocally/wikipedia/300dim_embeddings_DictWithWords.npy'
embeds300withWords = {word: embeddings[i] for i, word in enumerate(wordsToMatchEmbeds)}
print(list(embeds300withWords.items())[:3])
np.save(embedsPath, embeds300withWords)
print(f"Embeddings saved in {embedsPath}")

# Save the preprocessed articles in a file
with open(output_path, "w", encoding="utf-8") as f:
    f.write(" ".join(words))
print(f"Preprocessed articles saved in {preprocessed_dir}")

scaler = StandardScaler()
best_config = {'hidden_dim1':256, 'hidden_dim2':128, 'learning_rate':0.001, 'batch_size':64}
embedding_matrix = np.array(embeddings)
embedding_matrix_normalized = scaler.fit_transform(embedding_matrix)
embedding_tensor = torch.tensor(embedding_matrix_normalized, dtype=torch.float32)
embedding_dataloader = DataLoader(embedding_tensor, batch_size=best_config['batch_size'], shuffle=False)

autoEncoder = AutoEncoder(input_dim=300, hidden_dim1=best_config['hidden_dim1'], hidden_dim2=best_config['hidden_dim2'], bottleneck_dim=30)
autoEncoder.load_state_dict(torch.load('data/modelsSavedLocally/autoencoder.pth'))

# Get the bottleneck outputs for the embeddings
print("\n\nPassing the embeddings through the AutoEncoder... Please wait...\n\n")
bottleneck_outputs = []
autoEncoder.eval()
with torch.no_grad():
    for batch in embedding_dataloader:
        outputs = autoEncoder.encoder(batch)
        bottleneck_outputs.append(outputs)
bottleneck_outputs = torch.cat(bottleneck_outputs)
print("\n\nEmbeddings reduced by the AutoEncoder")


# Save the bottleneck outputs
bottleneck_outputs = bottleneck_outputs.detach().numpy()
embeds_with_id = {wordIdsToMatchEmbeds[i]: bottleneck_outputs[i] for i in range(len(wordIdsToMatchEmbeds))}
embedsDictWithWords = {wordsToMatchEmbeds[i] : bottleneck_outputs[i] for i in range(len(wordsToMatchEmbeds))}
# Remap the original word IDs to contiguous row indices (0..N-1) so that
# each ID in wordIds points to the matching row of the saved 30-dim array
idToRow = {key: idx for idx, key in enumerate(embeds_with_id)}
wordIds = [idToRow.get(wordId, wordId) for wordId in wordIds]
embedsPath = 'data/modelsSavedLocally/wikipedia/30dim_embeddings_ArraySimple.npy'
finalEmbeds = np.array(list(embeds_with_id.values()))
np.save(embedsPath, finalEmbeds)
print(f"Embeddings saved in {embedsPath}")
embedsPath = 'data/modelsSavedLocally/wikipedia/30dim_embeddings_DictWithWords.npy'
np.save(embedsPath, embedsDictWithWords)
print(f"Embeddings saved in {embedsPath}")

[('animal', array([-3.20413075e-02, -8.37033242e-03,  7.30173737e-02, -7.97412023e-02,
       -7.69651830e-02, -5.42337494e-03,  5.64670451e-02,  1.16956020e-02,
        3.85769606e-02, -3.51716466e-02,  8.05939436e-02,  2.11401861e-02,
       -2.21901424e-02, -7.26218969e-02,  3.11316494e-02, -7.59886065e-03,
        2.42326912e-02,  7.00246096e-02,  2.48890072e-02, -2.52103899e-02,
       -1.01999179e-01,  7.01836944e-02, -8.36341269e-03, -7.77921686e-03,
        7.37397596e-02,  3.73775661e-02, -7.65878484e-02, -1.55226355e-02,
        1.63189676e-02,  1.38134500e-02,  3.77668813e-02,  6.13541678e-02,
       -1.33383647e-02, -7.59667009e-02,  3.31273861e-02,  2.04847157e-02,
        4.56130020e-02,  5.28351776e-02, -1.96153522e-02,  7.45140985e-02,
       -5.10749705e-02, -9.68552101e-03, -4.37516011e-02,  4.10298221e-02,
       -7.65891895e-02,  3.67603004e-02, -7.95678701e-03,  3.80145684e-02,
        7.39008859e-02,  2.94812862e-02, -5.93920425e-03, -1.81104522e-02,
        5.187



## 5. Skip-gram (input, output) word ID pairs construction

In [5]:
# Skip-gram (input, output) word ID pairs construction for a window size of 2


word_pairs = []
# Start at index 0 so that the first word of the corpus is also used as an input word
for i in range(len(wordIds)):
    input_id = wordIds[i]
    context = []

    if i + 2 < len(wordIds):
        context.append(wordIds[i + 1])
        context.append(wordIds[i + 2])
    else :
        if i + 1 < len(wordIds):
            context.append(wordIds[i + 1])

    if i - 2 >= 0:
        context.append(wordIds[i - 2])
        context.append(wordIds[i - 1])
    else :
        if i - 1 >= 0:
            context.append(wordIds[i - 1])
    
    for output_id in context:
        word_pairs.append((input_id, output_id))

# Save the word ID pairs into a txt file
wordPairsPath = 'data/skipgramPairs/word_pairs_fromWikiDump.txt'
os.makedirs('data/skipgramPairs', exist_ok=True)
with open(wordPairsPath, "w") as f:
    for pair in word_pairs:
        f.write(f"{pair[0]} {pair[1]}\n")
print("Word pairs saved")


Word pairs saved
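
To make the pairing rule concrete, here is a small sketch on a toy ID sequence. The function name `skipgram_pairs` and its `window_size` parameter are illustrative and do not appear in the project code; the sketch pairs every position with up to two neighbors on each side, clipped at the corpus boundaries.

```python
def skipgram_pairs(ids, window_size=2):
    """Return (input, output) ID pairs for a symmetric context window."""
    pairs = []
    for i, center in enumerate(ids):
        # Clip the window at the start and end of the sequence
        start = max(0, i - window_size)
        end = min(len(ids), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is never paired with its own position
                pairs.append((center, ids[j]))
    return pairs


toy_ids = [0, 1, 2, 3]
pairs = skipgram_pairs(toy_ids)
print(pairs)
# [(0, 1), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 3), (3, 1), (3, 2)]
```

Words near the boundaries produce fewer pairs (two for the first and last positions here) than interior words, which can have up to four.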


## 6. Similarity test dataset creation

In [10]:
wordsToTestSimilarity = [
    ("chien", "chat"), ("chien", "loup"),
    ("perruche", "perroquet"), ("domestication", "apprivoisement"),
    ("élevage", "captivité"), ("animal", "compagnon"),
    ("animal", "chien"), ("animal", "chat"),
    ("Animal", "Compagnie"),
    ("Animal", "Espèce"),
    ("Compagnie", "Présence"),
    ("Espèce", "Animaux"),
    ("Animaux", "Familiers"),
    ("Chien", "Chat"),
    ("Maison", "Jardin"),
    ("Objet", "Domestication"),
    ("Présence", "Rassurante"),
    ("Beaux", "Talents")
    ]

os.makedirs('data/comparaisonDataSet', exist_ok=True)
np.save(file="data/comparaisonDataSet/wordsToTestSimilarity.npy", arr=wordsToTestSimilarity)
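
One plausible way to use these pairs later is to score each pair with the cosine similarity of its two embeddings. The notebook does not state which metric the evaluation uses, so cosine similarity is an assumption here, and the toy 3-dim vectors and the helper `cosine_similarity` are hypothetical. Since preprocessing lowercases every token, the sketch also lowercases the words before lookup, so that a pair such as `("Chien", "Chat")` can still be resolved.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy stand-in for a {word: embedding} dictionary (values are made up)
toy_embeds = {
    "chien": [1.0, 0.0, 0.0],
    "chat": [1.0, 1.0, 0.0],
    "maison": [0.0, 0.0, 2.0],
}
toy_pairs = [("Chien", "Chat"), ("chien", "maison"), ("chien", "loup")]

results = {}
for w1, w2 in toy_pairs:
    k1, k2 = w1.lower(), w2.lower()
    if k1 in toy_embeds and k2 in toy_embeds:
        results[(w1, w2)] = cosine_similarity(toy_embeds[k1], toy_embeds[k2])
    else:
        results[(w1, w2)] = None  # at least one word has no embedding

for pair, score in results.items():
    print(pair, score)
```

Pairs containing a word absent from the dictionary are reported as `None` rather than silently skipped, which makes gaps in corpus coverage visible.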

## 7. Conclusion

In this notebook, we completed the data pipeline for our project. All that is left to do now is to train and apply the Word2Vec model.