<a href="https://colab.research.google.com/github/rashed963/LanguageIdentificationNLP/blob/main/notebooks/LR_Word2Vec%26GridSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [60]:
import numpy as np
import random
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
from joblib import dump
from gensim.models.callbacks import CallbackAny2Vec

In [61]:
# Ensure necessary NLTK components are downloaded
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [62]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data Preprocessing

In [35]:
import re

def preprocess_text(text):
    """
    Preprocess the input text for language identification.

    Args:
    text (str): The input text to preprocess.

    Returns:
    str: The preprocessed text.
    """
    # Convert text to lowercase
    text = text.lower()

    # Keep only ASCII letters a-z and whitespace.
    # Note: this also strips digits, punctuation, accented letters, and all non-Latin scripts.
    text = re.sub(r'[^a-z\s]', '', text)

    return text
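
The ASCII-only pattern in `preprocess_text` deletes accented letters and every non-Latin script, which are often strong cues for language identification. Below is a minimal sketch of a Unicode-aware variant, assuming that only punctuation, digits, and underscores should be dropped. `preprocess_text_unicode` is a hypothetical name, not part of the notebook, and combining marks may still be removed because `\w` does not match them.

```python
import re

# Matches anything that is neither a word character nor whitespace,
# plus digits and underscores. For str patterns, \w and \d are Unicode-aware.
_NON_LETTER = re.compile(r"[^\w\s]|[\d_]")


def preprocess_text_unicode(text):
    """Lowercase text and keep letters from any script (hypothetical helper)."""
    text = text.lower()
    text = _NON_LETTER.sub("", text)
    # Collapse the runs of whitespace left behind by the removals
    return " ".join(text.split())


print(preprocess_text_unicode("Ça va? 123"))
print(preprocess_text_unicode("Привет, мир!"))
```

Whether digits and punctuation carry useful signal depends on the dataset, so the character classes here are a starting point rather than a recommendation.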

# Data Loading and Sampling

This section loads the training and test data and draws a random sample from each. Sampling reduces computation time while developing the model on Google Colab. We sample 25,000 lines for training and 10,000 for testing. Texts and labels are sampled in separate calls that share the same seed, so they stay aligned only if each text file has exactly as many lines as its label file.

In [76]:
def load_data_sample(filepath, sample_size=25000, random_state=42):
    """Load a random sample of lines from a file to speed up computations."""
    # Note: the whole file is read into memory before sampling
    with open(filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Seeding makes the sample reproducible: the same seed, line count, and
    # sample size always yield the same indices
    random.seed(random_state)
    sampled_indices = random.sample(range(len(lines)), sample_size)
    sampled_lines = [lines[i].strip() for i in sampled_indices]
    return sampled_lines

def load_datasets(train_data_path, train_labels_path, test_data_path, test_labels_path, train_sample_size=25000, test_sample_size=10000):
    """
    Load training and testing data and labels from specified file paths.

    Args:
    train_data_path (str): Path to the training data file.
    train_labels_path (str): Path to the training labels file.
    test_data_path (str): Path to the testing data file.
    test_labels_path (str): Path to the testing labels file.
    train_sample_size (int): Number of samples to load from the training data.
    test_sample_size (int): Number of samples to load from the testing data.

    Returns:
    tuple: Tuple containing loaded training data, training labels, testing data, testing labels.
    """
    # Load training data and labels
    X_train = load_data_sample(train_data_path, sample_size=train_sample_size)
    y_train = load_data_sample(train_labels_path, sample_size=train_sample_size)

    # Load testing data and labels
    X_test = load_data_sample(test_data_path, sample_size=test_sample_size)
    y_test = load_data_sample(test_labels_path, sample_size=test_sample_size)

    return X_train, y_train, X_test, y_test


# Define the base path for data files
base_path = '/content/drive/MyDrive/data'

# Paths to data files using the base path
train_data_path = f'{base_path}/train/x_train.txt'
train_labels_path = f'{base_path}/train/y_train.txt'
test_data_path = f'{base_path}/test/x_test.txt'
test_labels_path = f'{base_path}/test/y_test.txt'

In [77]:
X_train, y_train, X_test, y_test = load_datasets(train_data_path, train_labels_path, test_data_path, test_labels_path)
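
`load_datasets` samples texts and labels in separate calls, which keeps them aligned only through the shared seed and matching line counts. A more defensive option is to draw one set of indices and apply it to both files. The sketch below assumes one example per line in each file; `load_paired_sample` is a hypothetical helper, not part of the notebook.

```python
import random


def load_paired_sample(data_path, labels_path, sample_size, random_state=42):
    """Sample the same line indices from a text file and its label file."""
    with open(data_path, "r", encoding="utf-8") as f:
        texts = [line.rstrip("\n") for line in f]
    with open(labels_path, "r", encoding="utf-8") as f:
        labels = [line.strip() for line in f]

    # Fail loudly instead of silently pairing texts with the wrong labels
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")

    # A local generator avoids reseeding the global random module
    rng = random.Random(random_state)
    # Cap the sample size so small files do not raise ValueError
    k = min(sample_size, len(texts))
    indices = rng.sample(range(len(texts)), k)
    return [texts[i] for i in indices], [labels[i] for i in indices]
```

With this helper, a call such as `load_paired_sample(train_data_path, train_labels_path, 25000)` would return both lists at once.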

In [3]:
# def load_data_sample(filepath, sample_size=10, random_state=42):
#     with open(filepath, 'r', encoding='utf-8') as file:
#         lines = file.readlines()

#     random.seed(random_state)
#     sampled_indices = random.sample(range(len(lines)), sample_size)
#     sampled_lines = [lines[i].strip() for i in sampled_indices]
#     return sampled_lines

# def train_model(X_train, y_train):
#     model = make_pipeline(CountVectorizer(analyzer='char', ngram_range=(2, 2)), LogisticRegression(max_iter=500))
#     model.fit(X_train, y_train)
#     return model

# def predict_language(text, model):
#     return model.predict([text])[0]



## Word2Vec Model Training and Transformation

This section defines the Word2Vec training and transformation step. We use the `Word2Vec` model from the Gensim library, which learns a dense vector for each token from the contexts in which that token appears.

### EpochLogger Callback
To monitor the training progress of the Word2Vec model, we implement the `EpochLogger` class as a callback. This callback logs the completion of each epoch during training, providing visibility into the model's training process. This is especially useful for long training sessions, as it gives real-time feedback about the progress.

### Word2Vec Transformation Function
The `word2vec_transform` function is responsible for:
1. **Initializing and training the Word2Vec model**: The function takes sentences as input and trains a Word2Vec model. Key parameters such as `vector_size`, `window`, and `min_count` are configurable, allowing customization based on specific dataset characteristics.
2. **Generating sentence embeddings**: After training, the function averages the Word2Vec vectors of the tokens in each sentence. A sentence with no known tokens is represented by a zero vector. The resulting dense vector represents the sentence and can be used as input to a classifier.

Averaging discards word order but gives every sentence a fixed-length feature vector, which is the input format a classifier such as logistic regression expects.
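
The averaging step can be illustrated on its own, without training a model. The sketch below assumes `vectors` is any mapping that supports `in` and item lookup, such as a plain dict or Gensim's `KeyedVectors` (`model.wv`). `average_embeddings` and the toy vectors are made up for illustration.

```python
import numpy as np


def average_embeddings(token_lists, vectors, vector_size):
    """Average the vectors of known tokens; use a zero row if none are known."""
    rows = []
    for tokens in token_lists:
        known = [vectors[t] for t in tokens if t in vectors]
        rows.append(np.mean(known, axis=0) if known else np.zeros(vector_size))
    return np.array(rows)


# Toy two-dimensional vectors standing in for trained embeddings
toy_vectors = {"the": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
features = average_embeddings([["the", "cat"], ["unknown"]], toy_vectors, vector_size=2)
print(features)
```

The first row is the mean of the two known vectors, and the second row is all zeros because none of its tokens are in the mapping.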


In [78]:
from gensim.models.callbacks import CallbackAny2Vec

# Callback that reports the completion of each training epoch
class EpochLogger(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        print(f"Epoch {self.epoch} completed")
        self.epoch += 1

def word2vec_transform(sentences, vector_size=100, window=5, min_count=1, epochs=10):
    epoch_logger = EpochLogger()  # Initialize the logger
    model = Word2Vec(sentences, vector_size=vector_size, window=window, min_count=min_count, callbacks=[epoch_logger], epochs=epochs)
    word_vectors = model.wv
    return np.array([
        np.mean([word_vectors[w] for w in words if w in word_vectors.key_to_index] or [np.zeros(vector_size)], axis=0)
        for words in sentences
    ])

## Prediction and Model Training Functions

This section outlines the core functionalities of our language identification pipeline, covering the model training, prediction, and evaluation processes. These functions are designed to work with different types of text vectorization techniques, namely TF-IDF and Word2Vec.

### Predict Language Function
The `predict_language` function predicts the language of a single text, given a fitted model and the vectorizer type it was trained with:
- **TF-IDF Vectorization**: The fitted pipeline contains the TF-IDF vectorizer, so the raw text is passed straight to `predict`.
- **Word2Vec Vectorization**: The text is tokenized with `word_tokenize`, converted to an averaged embedding by `word2vec_transform`, and reshaped to `(1, n_features)` before being passed to the classifier. Because `word2vec_transform` trains a new Word2Vec model on every call, the embedding at prediction time comes from a model fitted on that single text, not from the model used during training. This makes prediction slow and leaves the features incomparable with the training features.

### Train and Tune Model Function
The `train_and_tune_model` function configures and trains the model based on the specified vectorization technique:
- **TF-IDF**: A pipeline is built from a character-level TF-IDF vectorizer and a logistic regression classifier. `GridSearchCV` with 5-fold cross-validation tunes the regularization strength `C` over `[0.1, 1]`. The grid fixes the n-gram range at `(2, 2)`, which overrides the `(3, 3)` passed when the vectorizer is constructed.
- **Word2Vec**: The training data are first transformed into averaged Word2Vec embeddings, and a logistic regression classifier is then fitted on them without a grid search. `X_train` is passed as raw strings, so Gensim treats each character as a token.

### Predict and Evaluate Function
The `predict_and_evaluate` function is used to assess the performance of the trained model. It iterates over a dataset, applies the `predict_language` function to each text entry, and collects predictions. It then calculates and prints the accuracy of the model based on these predictions, providing a straightforward metric to evaluate the model's effectiveness.

This setup allows for a flexible application of different text vectorization techniques, facilitating easy comparison and selection based on performance metrics.
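
Predicting one text at a time adds per-call overhead. When the model accepts a list of raw texts in `predict`, as the TF-IDF pipeline does, evaluation can run in batches instead. The sketch below is a minimal illustration: `evaluate_in_batches` and `FirstLetterModel` are hypothetical names, and the stand-in model exists only to make the example self-contained.

```python
def evaluate_in_batches(model, texts, labels, batch_size=1000):
    """Accuracy of any model exposing predict(list_of_texts)."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    correct = 0
    for start in range(0, len(texts), batch_size):
        end = min(start + batch_size, len(texts))
        # One predict call per batch instead of one per text
        predictions = model.predict(texts[start:end])
        correct += sum(p == t for p, t in zip(predictions, labels[start:end]))
        print(f"Completed {end}/{len(texts)} predictions")
    return correct / len(texts) if texts else 0.0


class FirstLetterModel:
    """Stand-in model for the demo: labels a text by its first character."""

    def predict(self, batch):
        return [text[0] for text in batch]


accuracy = evaluate_in_batches(
    FirstLetterModel(), ["apple", "banana", "cherry"], ["a", "b", "x"], batch_size=2
)
print(accuracy)
```

Two of the three stand-in predictions match their labels, so the printed accuracy is two thirds.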


In [81]:
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from gensim.models import Word2Vec


def predict_language(text, model, vectorizer_type='tfidf'):
    if vectorizer_type == 'tfidf':
        # The TF-IDF pipeline vectorizes raw text itself, so predict directly
        return model.predict([text])[0]
    elif vectorizer_type == 'word2vec':
        # Tokenize the single text
        tokenized_text = word_tokenize(text)
        # NOTE: word2vec_transform trains a new Word2Vec model on this one text,
        # so these features do not come from the model used during training
        transformed_text = word2vec_transform([tokenized_text])
        # Reshape the input to (1, -1), which is (1 sample, N features)
        transformed_text = transformed_text.reshape(1, -1)
        return model.predict(transformed_text)[0]
    raise ValueError(f"Unknown vectorizer_type: {vectorizer_type!r}")

# Function to train Word2Vec and transform data
def word2vec_transform(sentences, vector_size=100, window=5, min_count=1, epochs=10):
    # Initialize and train a Word2Vec model
    model = Word2Vec(sentences, vector_size=vector_size, window=window, min_count=min_count, epochs=epochs)
    word_vectors = model.wv
    # Compute the average of word vectors for each sentence
    return np.array([
        np.mean([word_vectors[w] for w in words if w in word_vectors.key_to_index] or [np.zeros(vector_size)], axis=0)
        for words in sentences
    ])

# Build the model for the chosen vectorizer; only TF-IDF is tuned with a grid search
def train_and_tune_model(X_train, y_train, vectorizer_type='tfidf'):
    if vectorizer_type == 'tfidf':
        # This ngram_range is overridden by the grid below
        vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
        pipeline = make_pipeline(vectorizer, LogisticRegression(max_iter=500))
    elif vectorizer_type == 'word2vec':
        print("Transforming training data using Word2Vec...")
        # NOTE: X_train holds raw strings, so Word2Vec treats each character as a token
        X_train_transformed = word2vec_transform(X_train)
        # max_iter=500 matches the TF-IDF branch; the default of 100 can stop before convergence
        pipeline = Pipeline([('classifier', LogisticRegression(max_iter=500))])
        print("Fitting Word2Vec model...")
        pipeline.fit(X_train_transformed, y_train)
        return pipeline
    else:
        raise ValueError(f"Unknown vectorizer_type: {vectorizer_type!r}")

    # Only the TF-IDF pipeline reaches this point
    parameters = {
        'logisticregression__C': [0.1, 1],
        'tfidfvectorizer__ngram_range': [(2, 2)],
    }
    print("Starting grid search for TF-IDF...")
    grid_search = GridSearchCV(pipeline, parameters, cv=5, verbose=2)  # verbose=2 logs every fit
    grid_search.fit(X_train, y_train)
    print("Best model selected.")
    return grid_search.best_estimator_

def predict_and_evaluate(X, y, model, vectorizer_type):
    predictions = []
    total = len(X)
    print(f"Starting predictions for {vectorizer_type}...")
    for index, text in enumerate(X):
        prediction = predict_language(text, model, vectorizer_type)
        predictions.append(prediction)
        if (index + 1) % 100 == 0 or index + 1 == total:  # Update every 100 samples or last sample
            print(f"Completed {index + 1}/{total} predictions")
    accuracy = accuracy_score(y, predictions)
    print(f"{vectorizer_type} Accuracy: {accuracy}")
    return accuracy


## Training, Evaluating, and Saving the TF-IDF Model

This section is dedicated to handling the operations for the TF-IDF vectorized model. We follow a structured approach to train the model, evaluate its performance, and then save it for future use. This process ensures that we have a deployable model at hand without the need to retrain.

### Training the Model
We start by training our model using the `train_and_tune_model` function with the TF-IDF vectorization approach. This function also tunes the hyperparameters with `GridSearchCV` and returns the best configuration among those listed in the grid.

### Evaluating the Model
Once the model is trained, we use the `predict_and_evaluate` function to test the model's performance on a separate test dataset. This function predicts the languages of the texts in the test dataset and calculates the accuracy, providing a quantitative measure of the model's effectiveness.

### Saving the Model
After evaluating the model, we save the fitted pipeline to disk with `joblib`, so it can be reused without retraining. The code below writes `model_tfidf.joblib` to the current working directory.

The result is a fitted, saved pipeline that can be reloaded for deployment or further experimentation.
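
The tune, save, and reload cycle can be tried on a toy corpus before committing to a long run. The sketch below assumes a made-up eight-sentence dataset and a temporary directory. The grid keys follow the step names that `make_pipeline` derives from the lowercased class names.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Toy corpus, for illustration only: four short texts per language
texts = ["the cat sat", "a dog ran", "we like tea", "she reads books",
         "le chat dort", "un chien court", "nous aimons le thé", "elle lit des livres"]
labels = ["en"] * 4 + ["fr"] * 4

pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 2)),
    LogisticRegression(max_iter=500),
)
# Grid keys use the "<step name>__<parameter>" convention
param_grid = {
    "logisticregression__C": [0.1, 1],
    "tfidfvectorizer__ngram_range": [(2, 2), (3, 3)],
}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)

# Round trip through joblib: the reloaded pipeline should predict identically
path = os.path.join(tempfile.mkdtemp(), "model_tfidf.joblib")
dump(search.best_estimator_, path)
reloaded = load(path)
print(reloaded.predict(["le chat"]))
```

Listing more than one n-gram range, as here, lets the search actually compare vectorizer settings instead of only varying `C`.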


In [15]:
model_tfidf = train_and_tune_model(X_train, y_train, vectorizer_type='tfidf')
# predict_and_evaluate returns the accuracy, not the predictions
accuracy_tfidf = predict_and_evaluate(X_test, y_test, model_tfidf, 'tfidf')
dump(model_tfidf, 'model_tfidf.joblib')

Starting grid search for TF-IDF...
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] END logisticregression__C=0.1, tfidfvectorizer__ngram_range=(2, 2); total time= 3.4min
[CV] END logisticregression__C=0.1, tfidfvectorizer__ngram_range=(2, 2); total time= 3.2min
[CV] END logisticregression__C=0.1, tfidfvectorizer__ngram_range=(2, 2); total time= 3.0min
[CV] END logisticregression__C=0.1, tfidfvectorizer__ngram_range=(2, 2); total time= 3.2min
[CV] END logisticregression__C=0.1, tfidfvectorizer__ngram_range=(2, 2); total time= 3.1min
[CV] END logisticregression__C=1, tfidfvectorizer__ngram_range=(2, 2); total time= 7.3min
[CV] END logisticregression__C=1, tfidfvectorizer__ngram_range=(2, 2); total time= 7.3min
[CV] END logisticregression__C=1, tfidfvectorizer__ngram_range=(2, 2); total time= 7.3min
[CV] END logisticregression__C=1, tfidfvectorizer__ngram_range=(2, 2); total time= 6.9min
[CV] END logisticregression__C=1, tfidfvectorizer__ngram_range=(2, 2); total time= 7.

['model_tfidf.joblib']

In [30]:
from google.colab import files
files.download('model_tfidf.joblib')


In [31]:
from joblib import dump, load
loaded_model_tfidf = load('model_tfidf.joblib')
print(loaded_model_tfidf)

Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='char', ngram_range=(2, 2))),
                ('logisticregression', LogisticRegression(C=1, max_iter=500))])


## Training, Evaluating, and Saving the Word2Vec Model

This section focuses on the procedures using the Word2Vec model. We demonstrate the comprehensive steps from training the model with Word2Vec embeddings, evaluating its performance on the test data, and finally saving the model for later use. These steps ensure our model is both effective and reusable.

### Training the Word2Vec Model
The `train_and_tune_model` function is called here with the `word2vec` vectorization type. Unlike TF-IDF, which produces sparse weighted n-gram features, Word2Vec produces a dense vector per token that reflects the contexts in which the token appears. The function averages these vectors for each text and fits a logistic regression classifier on the result to predict the language.

### Evaluating the Model
After training, the Word2Vec model is evaluated with the `predict_and_evaluate` function. This function applies the trained model to the test dataset and calculates the accuracy on held-out data, which shows how useful the Word2Vec embeddings are for language identification. In the run recorded below, evaluation was interrupted manually before it finished, so no accuracy is reported.

### Saving the Model
Once the model's performance is verified, the final step is to save it with `joblib` as `model_word2vec.joblib`, which allows deployment or further experiments without retraining. In the cell below, the `dump` call is commented out, so no file is written.

These steps mirror the TF-IDF workflow, so the two feature representations can be compared on the same data.


In [82]:
# Train the Word2Vec-based model and evaluate it on the test sample
model_word2vec = train_and_tune_model(X_train, y_train, vectorizer_type='word2vec')
# predict_and_evaluate returns the accuracy, not the predictions
accuracy_word2vec = predict_and_evaluate(X_test, y_test, model_word2vec, 'word2vec')
# dump(model_word2vec, 'model_word2vec.joblib')



Transforming training data using Word2Vec...
Fitting Word2Vec model...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Starting predictions for word2vec...
Completed 100/10000 predictions
Completed 200/10000 predictions
Completed 300/10000 predictions
Completed 400/10000 predictions
Completed 500/10000 predictions
Completed 600/10000 predictions
Completed 700/10000 predictions
Completed 800/10000 predictions
Completed 900/10000 predictions
Completed 1000/10000 predictions
Completed 1100/10000 predictions
Completed 1200/10000 predictions
Completed 1300/10000 predictions
Completed 1400/10000 predictions
Completed 1500/10000 predictions
Completed 1600/10000 predictions
Completed 1700/10000 predictions
Completed 1800/10000 predictions
Completed 1900/10000 predictions
Completed 2000/10000 predictions
Completed 2100/10000 predictions
Completed 2200/10000 predictions
Completed 2300/10000 predictions
Completed 2400/10000 predictions
Completed 2500/10000 predictions
Completed 2600/10000 predictions
Completed 2700/10000 predictions
Completed 2800/10000 predictions
Completed 2900/10000 predictions
Completed 3000/

KeyboardInterrupt: 

In [70]:
files.download('model_word2vec.joblib')


In [75]:
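word = 'example'
# train_and_tune_model returns a pipeline with only a 'classifier' step, and
# word2vec_transform discards its Word2Vec model, so guard the lookup
w2v_step = model_word2vec.named_steps.get('word2vec')
if w2v_step is None:
    print("The pipeline has no 'word2vec' step, so no word vectors are available.")
elif word in w2v_step.model.wv.key_to_index:
    print(f"Vector for '{word}':", w2v_step.model.wv[word])
else:
    print(f"Word '{word}' not in vocabulary.")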