<a href="https://colab.research.google.com/github/wwww3341/-Projects/blob/main/Spanish_Corpus_with_Word_Embedding_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings and Word2Vec: A Brief Theoretical Overview
Word embeddings are a fundamental concept in natural language processing (NLP) that represent words as dense vectors in a continuous vector space. These embeddings capture semantic relationships between words, allowing algorithms to understand and work with the contextual meaning of words.
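
To make "dense vectors" and "semantic relationships" concrete, the sketch below compares three hand-made vectors using cosine similarity, the measure commonly used to compare embeddings. The words and numbers in `toy_vectors` are invented for illustration and are not learned from any corpus; real embeddings typically have 100 or more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumed values, not trained)
toy_vectors = {
    "perro": [0.9, 0.8, 0.1],
    "gato":  [0.8, 0.9, 0.2],
    "coche": [0.1, 0.2, 0.9],
}

sim_perro_gato = cosine_similarity(toy_vectors["perro"], toy_vectors["gato"])
sim_perro_coche = cosine_similarity(toy_vectors["perro"], toy_vectors["coche"])

print(f"perro vs gato:  {sim_perro_gato:.3f}")
print(f"perro vs coche: {sim_perro_coche:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, so in this toy setup "perro" comes out closer to "gato" than to "coche".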

Word2Vec is a popular technique for learning word embeddings from large text corpora. Developed by Tomas Mikolov et al., it trains a shallow neural network to map words into a continuous vector space. The core idea is that words appearing in similar contexts are likely to have similar meanings. Word2Vec exploits this by sliding a context window over the text and learning either to predict a word from its surrounding context (the CBOW variant) or to predict the surrounding context from a word (the skip-gram variant).
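
The context-window idea can be sketched in a few lines. The helper below, `skipgram_pairs` (a hypothetical name, not part of Gensim), lists the (center word, context word) pairs that arise when context words are predicted from a center word, which is the skip-gram formulation. It is only an illustration of how training examples come from a window; real implementations add details such as negative sampling and subsampling of frequent words, which are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)                # left edge of the window
        end = min(len(tokens), i + window + 1)    # right edge (exclusive)
        for j in range(start, end):
            if j != i:                            # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Toy sentence, already tokenized (illustrative only)
sentence = ["el", "gato", "duerme", "en", "casa"]
pairs = skipgram_pairs(sentence, window=1)

for center, context in pairs:
    print(center, "->", context)
```

A larger `window` yields more pairs per word and pulls in more distant context.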

The code below uses the Gensim library to train a Word2Vec model. The steps are: loading a dataset (here, the Spanish 'raw_es' configuration of 'wikicorpus'), tokenizing each text into words with NLTK, and training the Word2Vec model with specified parameters: vector size, window size, minimum word frequency, and the number of worker threads.
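
Of these parameters, the minimum word frequency is probably the least intuitive. The sketch below approximates its effect in plain Python: words whose total count is below the threshold are dropped from the vocabulary and therefore never receive a vector. The function `build_vocab` and the toy corpus are assumptions made for illustration; Gensim performs this filtering internally.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_count=5):
    """Return the set of words that occur at least `min_count` times in total."""
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Toy corpus, already tokenized (illustrative only)
toy_corpus = [
    ["la", "casa", "es", "grande"],
    ["la", "casa", "es", "azul"],
    ["la", "calle", "es", "larga"],
]

vocab = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab))
```

Words that appear only once in the toy corpus, such as "grande" or "calle", are excluded, which is why rare words cannot be looked up in a trained model.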

The trained model can then be used to find words similar to a user-supplied word. The `most_similar` method of the model's word vectors (`model.wv`) returns the words whose vectors have the highest cosine similarity to the vector of the given word. This notebook walks through a practical use of Word2Vec for word similarity analysis in Spanish and shows how word embeddings capture semantic relationships within a language.

In [None]:
# Install the 'datasets' package for easy dataset loading
!pip install datasets

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15


In [None]:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from datasets import load_dataset
from gensim.models import Word2Vec

# Download the dataset
# Using the 'wikicorpus' dataset in Spanish ('raw_es' version)
dataset = load_dataset("wikicorpus", "raw_es")

# Download the 'punkt' tokenizer
# 'punkt' is a tokenizer from the NLTK library used for word tokenization
nltk.download('punkt')

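# Get texts from the dataset
# Each split (here only 'train') is a Dataset; indexing it with 'text' returns the list of all article texts in that split
texts = [split['text'] for split in dataset.values()]

# Tokenize each article into a list of words
# Note: every "sentence" passed to Word2Vec below is a whole article, not a single sentence.
# word_tokenize uses the English Punkt model unless language='spanish' is passed.
tokenized_sentences = [word_tokenize(article) for split_texts in texts for article in split_texts]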

# Remove empty lists (if any)
# Filtering out any empty lists from the tokenized sentences
tokenized_sentences = [sentence for sentence in tokenized_sentences if sentence]

# Train Word2Vec model
# Using Word2Vec to create word embeddings with the specified parameters
# - vector_size: Size of the word vectors (100 in this case)
# - window: Maximum distance between the current and predicted word within a sentence (5 in this case)
# - min_count: Ignores all words with a total frequency lower than this (5 in this case)
# - workers: Number of worker threads used during training (4 in this case)
# - sg is left at its default (0), so the CBOW training algorithm is used
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=5, workers=4)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# Find and print words similar to the user-input word
# Prompt the user to enter a word in Spanish
user_input = input('Enter a word in Spanish: ')

# Find the most similar words using the trained Word2Vec model
# - model.wv.most_similar(): Returns the top-N most similar words for the given word
# - topn: Number of top similar words to retrieve (5 in this case)
similar_words = model.wv.most_similar(user_input, topn=5)

# Print the results
print(f"Words similar to '{user_input}': {similar_words}")

Enter a word in Spanish: saludo
Words similar to 'saludo': [('gesto', 0.7299390435218811), ('grito', 0.7242612242698669), ('sermón', 0.6907185316085815), ('discurso', 0.6808280944824219), ('himno', 0.6709105968475342)]
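
One practical caveat: `most_similar` raises a `KeyError` when the word has no vector, which happens for any word that occurred fewer than `min_count` times in the corpus or that is typed differently from how it appears there (the tokens are not lowercased, so lookups are case-sensitive). A minimal sketch of a guard follows; `lookup_similar` is a hypothetical helper name, and it assumes only that the object passed in supports the `in` operator and a `most_similar` method, as Gensim's `model.wv` does.

```python
def lookup_similar(keyed_vectors, word, topn=5):
    """Return most_similar results, or an empty list if `word` has no vector."""
    word = word.strip()                  # ignore accidental surrounding whitespace
    if word not in keyed_vectors:        # membership test on the vocabulary
        return []
    return keyed_vectors.most_similar(word, topn=topn)

# Intended use with the model trained above (assumes `model` and `user_input` exist):
# similar_words = lookup_similar(model.wv, user_input)
# if not similar_words:
#     print(f"'{user_input}' is not in the model's vocabulary.")
```

Returning an empty list keeps the interactive cell from crashing on unknown input; raising a custom error would be an equally reasonable choice.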
