## Introduction to Approach 2: TF-IDF and Word2Vec with Grid Search Optimization

In this notebook, we explore two text vectorization techniques, TF-IDF (Term Frequency-Inverse Document Frequency) and Word2Vec, each combined with logistic regression to build a language identification model. The two represent text differently: TF-IDF produces sparse, count-based features, while Word2Vec produces dense, learned embeddings.

### TF-IDF Vectorization
TF-IDF weights a term by how often it occurs in a document, scaled down by how many documents in the corpus contain it. Here, the terms are character n-grams rather than words. Character-level features capture spelling patterns such as letter combinations and diacritics that differ sharply between languages, which makes them well suited to language identification.
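
As a small illustration of the character-level representation described above, the sketch below fits a `TfidfVectorizer` on character bigrams. The toy sentences are assumptions made for this example and are not taken from the notebook's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences (assumed for illustration only)
texts = [
    "the cat sat on the mat",
    "der Hund schläft im Haus",
    "le chat dort sur le lit",
]

# analyzer='char' builds features from character n-grams instead of words
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 2))
X = vectorizer.fit_transform(texts)

print(X.shape)                         # (number of texts, number of distinct bigrams)
print('th' in vectorizer.vocabulary_)  # bigram that occurs in the English sentence
print('sc' in vectorizer.vocabulary_)  # bigram that occurs in the German sentence
```

Each row of `X` is a sparse vector whose non-zero entries correspond to the bigrams present in that sentence.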

### Word2Vec Embeddings
Word2Vec, by contrast, learns a dense vector for each word from the words that surround it in the training corpus, so words that appear in similar contexts receive similar vectors (Mikolov et al., 2013). It therefore tends to capture semantic relationships that TF-IDF does not represent.

### Grid Search for Hyperparameter Optimization
To tune the logistic regression model built on TF-IDF features, we use scikit-learn's `GridSearchCV`. It evaluates every combination in a specified parameter grid with cross-validation and keeps the combination with the best mean validation score. The result is the best setting within the grid, not a guaranteed global optimum.
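
To make "exhaustive" concrete, the sketch below enumerates the candidates of the grid that `train_and_tune_model` uses for the TF-IDF pipeline later in this notebook. It only lists the combinations and does not fit any model.

```python
from sklearn.model_selection import ParameterGrid

# Grid taken from the TF-IDF branch of train_and_tune_model in this notebook
parameters = {
    'logisticregression__C': [0.1, 1],
    'tfidfvectorizer__ngram_range': [(2, 2)],
}

# ParameterGrid yields one dict per combination of values
candidates = list(ParameterGrid(parameters))
for params in candidates:
    print(params)

cv_folds = 5
# Each candidate is fitted once per fold; GridSearchCV then refits the best one on all data
print(f"{len(candidates)} candidates x {cv_folds} folds = {len(candidates) * cv_folds} fits")
```

The number of fits grows multiplicatively with every value added to the grid, which is why the grid here is kept small.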



### References

1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. *Proceedings of Workshop at ICLR*. Available at [arXiv:1301.3781](https://arxiv.org/abs/1301.3781).

In [25]:
import numpy as np
import random
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
from joblib import dump
from gensim.models.callbacks import CallbackAny2Vec

In [26]:
# Ensure necessary NLTK components are downloaded
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [27]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data Preprocessing

In [35]:
import re
# Placeholder, not used for now: returns the text unchanged
def preprocess_text(text):
    """
    Preprocess the input text for language identification.

    Args:
    text (str): The input text to preprocess.

    Returns:
    str: The preprocessed text.
    """
    return text

# Data Loading and Sampling

This section loads the training and test data and draws a random sample from each. Sampling reduces computation time while developing the model on Google Colab. We sample 25,000 lines for training and 10,000 for testing. The text file and the label file are sampled in separate calls; the pairs stay aligned because both calls use the same random seed, which requires the two files to have the same number of lines.
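
An alternative that does not rely on reseeding is to draw the indices once and apply them to both lists. The sketch below shows this with a hypothetical helper, `sample_aligned`, and toy data; neither is part of the notebook.

```python
import random

def sample_aligned(texts, labels, sample_size, random_state=42):
    """Hypothetical helper: sample (text, label) pairs using one shared set of indices."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    rng = random.Random(random_state)  # local generator, leaves the global state untouched
    indices = rng.sample(range(len(texts)), min(sample_size, len(texts)))
    return [texts[i] for i in indices], [labels[i] for i in indices]

# Toy data (assumed): label i belongs to text i
texts = [f"sentence {i}" for i in range(10)]
labels = [f"lang{i}" for i in range(10)]

X_sample, y_sample = sample_aligned(texts, labels, sample_size=4)
print(list(zip(X_sample, y_sample)))
```

Because the length check runs before sampling, a mismatch between the two files raises an error instead of silently pairing texts with the wrong labels.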

In [6]:
def load_data_sample(filepath, sample_size=25000, random_state=42):
    """Load a random sample of lines from a file to speed up computations."""
    with open(filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Reseeding on every call yields identical indices for any two files with the
    # same number of lines, which keeps sampled texts and labels aligned.
    random.seed(random_state)
    sampled_indices = random.sample(range(len(lines)), sample_size)
    sampled_lines = [lines[i].strip() for i in sampled_indices]
    return sampled_lines

def load_datasets(train_data_path, train_labels_path, test_data_path, test_labels_path, train_sample_size=25000, test_sample_size=10000):
    """
    Load training and testing data and labels from specified file paths.

    Args:
    train_data_path (str): Path to the training data file.
    train_labels_path (str): Path to the training labels file.
    test_data_path (str): Path to the testing data file.
    test_labels_path (str): Path to the testing labels file.
    train_sample_size (int): Number of samples to load from the training data.
    test_sample_size (int): Number of samples to load from the testing data.

    Returns:
    tuple: Tuple containing loaded training data, training labels, testing data, testing labels.
    """
    # Load training data and labels
    X_train = load_data_sample(train_data_path, sample_size=train_sample_size)
    y_train = load_data_sample(train_labels_path, sample_size=train_sample_size)

    # Load testing data and labels
    X_test = load_data_sample(test_data_path, sample_size=test_sample_size)
    y_test = load_data_sample(test_labels_path, sample_size=test_sample_size)

    return X_train, y_train, X_test, y_test


# Define the base path for data files
base_path = '/content/drive/MyDrive/data'

# Paths to data files using the base path
train_data_path = f'{base_path}/train/x_train.txt'
train_labels_path = f'{base_path}/train/y_train.txt'
test_data_path = f'{base_path}/test/x_test.txt'
test_labels_path = f'{base_path}/test/y_test.txt'

In [7]:
X_train, y_train, X_test, y_test = load_datasets(train_data_path, train_labels_path, test_data_path, test_labels_path)

## Word2Vec Model Training and Transformation

This section defines the Word2Vec transformation and training process. We utilize the `Word2Vec` model from the Gensim library, which is effective for natural language processing tasks such as embedding generation based on the context of words.

### EpochLogger Callback
To monitor the training progress of the Word2Vec model, we implement the `EpochLogger` class as a callback. This callback logs the completion of each epoch during training, providing visibility into the model's training process. This is especially useful for long training sessions, as it gives real-time feedback about the progress.

### Word2Vec Transformation Function
The `word2vec_transform` function is responsible for:
1. **Training the Word2Vec model**: The function takes tokenized sentences as input and trains a Word2Vec model. Key parameters such as `vector_size`, `window`, `min_count`, and `epochs` are configurable, allowing customization based on specific dataset characteristics.
2. **Returning the word embeddings**: After training, the function returns a dictionary that maps each vocabulary word to its vector. A sentence is then represented downstream by the average of its tokens' vectors, a dense vector that can be used for further machine learning tasks.

Averaged embeddings give the classifier a compact, fixed-length feature vector for every text, regardless of its length.
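
The averaging step can be sketched as follows. The helper `average_embedding` and the toy vectors are hypothetical and serve only to illustrate the idea; in the notebook the vectors come from `word2vec_transform`.

```python
import numpy as np

# Toy word vectors (assumed values for illustration)
word_vectors = {
    "hello": np.array([1.0, 0.0, 2.0]),
    "world": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(tokens, word_vectors, vector_size):
    """Hypothetical helper: mean of the known token vectors, or zeros if none are known."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)

print(average_embedding(["hello", "world", "unseen"], word_vectors, vector_size=3))  # [2. 1. 1.]
print(average_embedding(["unseen"], word_vectors, vector_size=3))                    # [0. 0. 0.]
```

Falling back to a zero vector is one possible way to handle texts with no known tokens; without such a fallback, the mean of an empty list is undefined.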


In [22]:
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class EpochLogger(CallbackAny2Vec):
    """Print a message at the end of each Word2Vec training epoch."""
    def __init__(self):
        self.epoch = 0
    def on_epoch_end(self, model):
        print(f"Epoch {self.epoch} completed")
        self.epoch += 1

def word2vec_transform(sentences, vector_size=100, window=5, min_count=1, epochs=10):
    """Train Word2Vec on tokenized sentences (lists of tokens) and return {word: vector}."""
    # Passing `sentences` to the constructor builds the vocabulary and trains the model
    # in one step, so separate build_vocab() and train() calls would train a second time.
    model = Word2Vec(sentences, vector_size=vector_size, window=window, min_count=min_count,
                     epochs=epochs, callbacks=[EpochLogger()])
    return {word: model.wv[word] for word in model.wv.index_to_key}  # Dictionary of word vectors

## Prediction and Model Training Functions

This section outlines the core functionalities of our language identification pipeline, covering the model training, prediction, and evaluation processes. These functions are designed to work with different types of text vectorization techniques, namely TF-IDF and Word2Vec.

### Predict Language Function
The `predict_language` function handles the prediction of the language for a given piece of text based on the specified model and vectorizer type:
- **TF-IDF Vectorization**: If the model is trained with TF-IDF, the prediction is straightforward: the raw text is passed through the pipeline, which vectorizes it and returns the predicted language.
- **Word2Vec Vectorization**: For Word2Vec, the text is lowercased and tokenized, each known token is looked up in the trained word vectors, and the vectors are averaged into a single row with the shape the classifier expects.

### Train and Tune Model Function
The `train_and_tune_model` function configures and trains the model based on the specified vectorization technique:
- **TF-IDF**: A pipeline combines a character-level TF-IDF vectorizer with a logistic regression classifier. `GridSearchCV` with 5-fold cross-validation tunes the regularization strength `C` (0.1 or 1) using character bigrams as features, and the best estimator is returned.
- **Word2Vec**: The training texts are first converted into averaged Word2Vec embeddings, and a logistic regression classifier is then fitted on them. No grid search is run for this branch. The function returns the trained model, ready for making predictions.

### Predict and Evaluate Function
The `predict_and_evaluate` function is used to assess the performance of the trained model. It iterates over a dataset, applies the `predict_language` function to each text entry, and collects predictions. It then calculates and prints the accuracy of the model based on these predictions, providing a straightforward metric to evaluate the model's effectiveness.

This setup allows for a flexible application of different text vectorization techniques, facilitating easy comparison and selection based on performance metrics.
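
For the TF-IDF pipeline, the per-text loop is not strictly required, because a scikit-learn pipeline can predict a whole list of texts in one call. The sketch below illustrates this on toy data (assumed for the example); it is an alternative to, not a copy of, the notebook's `predict_and_evaluate`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Toy data (assumed for illustration only)
texts = [
    "the cat sat on the mat", "where is the train station",
    "der Hund schläft im Haus", "wo ist der Bahnhof",
]
labels = ["en", "en", "de", "de"]

pipeline = make_pipeline(
    TfidfVectorizer(analyzer='char', ngram_range=(2, 2)),
    LogisticRegression(max_iter=500),
)
pipeline.fit(texts, labels)

# One call vectorizes and classifies every text
predictions = pipeline.predict(texts)
accuracy = accuracy_score(labels, predictions)
print(f"Accuracy on the toy training texts: {accuracy}")
```

Batch prediction avoids vectorizing one text at a time, which matters when evaluating thousands of test sentences.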


In [23]:
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from gensim.models import Word2Vec


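# Word vectors learned in train_and_tune_model(..., 'word2vec'); read by predict_language
word2vec = {}

def embed_text(text):
    """Average the Word2Vec vectors of the tokens in `text` that are in the vocabulary."""
    vectors = [word2vec[word] for word in word_tokenize(text.lower()) if word in word2vec]
    if not vectors:
        # No known token: fall back to a zero vector of the embedding size
        return np.zeros(len(next(iter(word2vec.values()))))
    return np.mean(vectors, axis=0)

def predict_language(text, model, vectorizer_type='word2vec'):
    if vectorizer_type == 'tfidf':
        # The pipeline vectorizes the raw text itself
        return model.predict([text])[0]
    elif vectorizer_type == 'word2vec':
        return model.predict(embed_text(text).reshape(1, -1))[0]
    raise ValueError(f"Unknown vectorizer_type: {vectorizer_type!r}")

def train_and_tune_model(X_train, y_train, vectorizer_type='tfidf'):
    global word2vec
    if vectorizer_type == 'tfidf':
        # The initial ngram_range is overridden by the grid below, which only tries bigrams
        vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
        pipeline = make_pipeline(vectorizer, LogisticRegression(max_iter=500))
        parameters = {'logisticregression__C': [0.1, 1], 'tfidfvectorizer__ngram_range': [(2, 2)]}
        grid_search = GridSearchCV(pipeline, parameters, cv=5, verbose=2)
        grid_search.fit(X_train, y_train)
        print("Best TF-IDF model selected.")
        return grid_search.best_estimator_
    elif vectorizer_type == 'word2vec':
        print("Transforming training data using Word2Vec...")
        # Word2Vec expects tokenized sentences (lists of tokens), not raw strings
        word2vec = word2vec_transform([word_tokenize(text.lower()) for text in X_train])
        # One averaged embedding per training text
        X_train_transformed = np.vstack([embed_text(text) for text in X_train])
        pipeline = Pipeline([('classifier', LogisticRegression(max_iter=10000))])
        pipeline.fit(X_train_transformed, y_train)
        print("Word2Vec model fitted.")
        return pipeline
    raise ValueError(f"Unknown vectorizer_type: {vectorizer_type!r}")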

def predict_and_evaluate(X, y, model, vectorizer_type):
    predictions = []
    for index, text in enumerate(X):
        predictions.append(predict_language(text, model, vectorizer_type))
        if (index + 1) % 100 == 0 or index + 1 == len(X):
            print(f"Completed {index + 1}/{len(X)} predictions")
    accuracy = accuracy_score(y, predictions)
    print(f"{vectorizer_type} Accuracy: {accuracy}")
    return accuracy

## Training, Evaluating, and Saving the TF-IDF Model

This section is dedicated to handling the operations for the TF-IDF vectorized model. We follow a structured approach to train the model, evaluate its performance, and then save it for future use. This process ensures that we have a deployable model at hand without the need to retrain.

### Training the Model
We start by training our model using the `train_and_tune_model` function with the TF-IDF vectorization approach. Besides fitting the model, the function tunes its hyperparameters with `GridSearchCV` and returns the best configuration found in the grid.

### Evaluating the Model
Once the model is trained, we use the `predict_and_evaluate` function to test the model's performance on a separate test dataset. This function predicts the languages of the texts in the test dataset and calculates the accuracy, providing a quantitative measure of the model's effectiveness.
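
Saving and reloading can be sketched with `joblib`, which the notebook already imports. The toy pipeline and the file name `model_tfidf_demo.joblib` below are assumptions for this example; the check at the end confirms that the reloaded model gives the same predictions.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the tuned pipeline (assumed data)
texts = ["the cat sat on the mat", "where is the station",
         "der Hund schläft im Haus", "wo ist der Bahnhof"]
labels = ["en", "en", "de", "de"]
model = make_pipeline(
    TfidfVectorizer(analyzer='char', ngram_range=(2, 2)),
    LogisticRegression(max_iter=500),
).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_tfidf_demo.joblib')  # hypothetical file name
    dump(model, path)      # serializes the vectorizer and the classifier together
    restored = load(path)

same = list(model.predict(texts)) == list(restored.predict(texts))
print(same)
```

Because the vectorizer is part of the pipeline, the saved file is enough to classify raw text later without refitting anything.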



In [None]:
model_tfidf = train_and_tune_model(X_train, y_train, vectorizer_type='tfidf')
accuracy_tfidf = predict_and_evaluate(X_test, y_test, model_tfidf, 'tfidf')  # returns the accuracy, not the predictions
# dump(model_tfidf, 'model_tfidf.joblib')
# model:lr_tfidf_bigram_best, ngram_range=(2, 2), iter = 500, tfidf, max_iter=500, Accuracy: 0.9274

In [30]:
from google.colab import files
files.download('model_tfidf.joblib')


In [31]:
from joblib import dump, load
loaded_model_tfidf = load('model_tfidf.joblib')
print(loaded_model_tfidf)

Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='char', ngram_range=(2, 2))),
                ('logisticregression', LogisticRegression(C=1, max_iter=500))])


## Training, Evaluating, and Saving the Word2Vec Model

This section focuses on the procedures using the Word2Vec model. We demonstrate the comprehensive steps from training the model with Word2Vec embeddings, evaluating its performance on the test data, and finally saving the model for later use. These steps ensure our model is both effective and reusable.

### Training the Word2Vec Model
The `train_and_tune_model` function is used here with the `word2vec` vectorization type. Unlike TF-IDF, Word2Vec provides a dense representation of words that captures semantic relationships. The function trains the embeddings on the training texts and feeds the averaged sentence vectors to a logistic regression classifier to predict the language.

### Evaluating the Model
After training, the Word2Vec model is evaluated with the `predict_and_evaluate` function. It applies the trained model to the test dataset and reports the accuracy, which shows how well Word2Vec embeddings work for language identification on held-out data.


In [None]:
# Call the modified function for predictions and evaluations
model_word2vec = train_and_tune_model(X_train, y_train, vectorizer_type='word2vec')
accuracy_word2vec = predict_and_evaluate(X_test, y_test, model_word2vec, 'word2vec')  # returns the accuracy, not the predictions

In [70]:
# dump(model_word2vec, 'model_word2vec.joblib')
files.download('model_word2vec.joblib')


# Conclusion

In this second approach, we compared character-level TF-IDF features tuned with grid search against averaged Word2Vec embeddings, both feeding a logistic regression classifier. The Word2Vec route, while designed to capture contextual relationships within text data, presented several challenges:

- **Model Complexity and Training Time**: The Word2Vec model requires substantial computational resources and time for training, especially as the size of the dataset increases.
- **Convergence Issues**: Despite hyperparameter tuning, logistic regression sometimes failed to converge within its iteration limit (`max_iter`), which may reflect the difficulty of fitting high-dimensional features.
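
One way to check for the convergence problem noted above is to compare the solver's iteration count with `max_iter` after fitting. The sketch below uses synthetic data as a stand-in for the notebook's features, and the helper `hit_iteration_limit` is hypothetical.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed) standing in for the notebook's feature vectors
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

def hit_iteration_limit(max_iter):
    """Hypothetical helper: fit and report whether the solver stopped at max_iter."""
    clf = LogisticRegression(max_iter=max_iter)
    with warnings.catch_warnings():
        # Silence the warning here; the return value carries the same information
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        clf.fit(X, y)
    return bool(clf.n_iter_.max() >= max_iter)

print(hit_iteration_limit(1))     # a single iteration is far too few
print(hit_iteration_limit(1000))  # a generous limit for this small problem
```

If the limit is reached on real data, raising `max_iter` or standardizing the features are common remedies to try.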

These observations underscore the need for alternative approaches that can leverage existing linguistic knowledge and reduce the dependency on extensive training regimes.
