# Natural Language Processing (NLP) Project

### Author: Raúl Varela Ferrando

This project applies language detection to texts and covers the preparation, evaluation, and comparison of models. Language detection is of great interest because many downstream tasks, such as lexical analysis, depend on knowing the language of the input.



### 1. Preparation of the working dataset


The dataset is the *“European Parliament Proceedings Parallel Corpus 1996-2011”*, available at https://www.statmt.org/europarl/. It contains the transcripts of the sessions of the European Parliament and their translations into the different languages of the European Union. We use the transcripts in German, Spanish, French, English, Italian, and Polish.  

Since this corpus was prepared for training translation systems, each language comes as an aligned pair: the corpus in that language together with the same corpus in English. English therefore has no corpus of its own, so we use the English side of the Spanish pair (*"europarl-v7.es-en.en"*).  

The first task is to build the training and validation datasets in a balanced way. The assignment suggests a random selection strategy: while traversing the original file, out of every 11 lines, 10 are sent to the training file and 1, chosen at random, is sent to the evaluation file.  
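
The selection strategy can be sketched as follows. This is a minimal illustration rather than the code used later in this notebook: the function name `split_ten_to_one` and the `seed` parameter are hypothetical, and the sketch works on an in-memory list of lines instead of the corpus files.

```python
import random

def split_ten_to_one(lines, seed=0):
    """Send 1 randomly chosen line out of every 11 to evaluation, the rest to training."""
    rng = random.Random(seed)  # seeded for reproducibility (an assumption of this sketch)
    train, evaluation = [], []
    for start in range(0, len(lines), 11):
        group = lines[start:start + 11]
        held_out = rng.randrange(len(group))  # index of the line sent to evaluation
        for index, line in enumerate(group):
            (evaluation if index == held_out else train).append(line)
    return train, evaluation

train_lines, eval_lines = split_ten_to_one([f"line {i}" for i in range(110)])
print(len(train_lines), len(eval_lines))
```

Because the held-out line is drawn independently inside each group of 11, the evaluation lines are spread evenly across the file instead of being taken from one contiguous region.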

First, we define a function that allows us to read each of the files within the folder that contains all the corpora.


In [None]:
def read_documents(document):
    with open(document, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    
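    # Join all lines into a single string (each line keeps its trailing newline)
    merged = ' '.join(lines)
    
    # Split on a newline followed by a period; since the lines were joined with
    # a space, this separator does not normally occur, so each file typically
    # yields a single document
    documents = merged.split('\n.')
    
    # Strip surrounding whitespace and drop empty documents
    documents = [doc.strip() for doc in documents if doc.strip()]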
    
    return documents

languages = {
    "german": "europarl-v7.de-en.de",
    "english": "europarl-v7.es-en.en",
    "spanish": "europarl-v7.es-en.es",
    "french": "europarl-v7.fr-en.fr",
    "italian": "europarl-v7.it-en.it",
    "polish": "europarl-v7.pl-en.pl"
}

# A dictionary to store the documents by language
document_dict = {}

for key, value in languages.items():
    documents = read_documents(value)
    document_dict[key] = documents

# Now, document_dict contains the documents by language
# You can access them using document_dict["german"], document_dict["english"], etc.


Once the dictionary containing the datasets by language has been created, we generate the training and validation datasets as described earlier, thus obtaining a dictionary for each set. Additionally, we have implemented a function to remove common punctuation marks from our text and normalize it to lowercase.

In [None]:
import random
import re  # Import the regular expressions module

# Define the proportion of lines for training and validation (10 to 1)
train_ratio = 10
validation_ratio = 1

# Function to normalize and clean the text
def clean_text(text):
    # Remove common punctuation marks (except the apostrophe if relevant)
    text = re.sub(r'[^\w\s\']', '', text)

    # Normalize text to lowercase
    text = text.lower()

    return text

# Dictionary to store training and validation documents by language
train_data = {}
validation_data = {}

# Iterate over each language in document_dict
for language, documents in document_dict.items():
    # Initialize lists for training and validation documents
    train_documents = []
    validation_documents = []

    # Randomly split lines into training and validation
    for document in documents:
        lines = document.split('\n')
        random.shuffle(lines)  # Shuffle the lines randomly
        total_lines = 110000  # Line limit to use

        # Calculate the number of lines for training and validation
        num_train_lines = (total_lines // (train_ratio + validation_ratio)) * train_ratio
        train_lines = lines[:num_train_lines]
        validation_lines = lines[num_train_lines:total_lines]

        # Join the lines back into documents and apply normalization
        train_document = ' '.join([clean_text(line) for line in train_lines])
        validation_document = ' '.join([clean_text(line) for line in validation_lines])

        train_documents.append(train_document)
        validation_documents.append(validation_document)

    # Store the training and validation documents in the dictionaries
    train_data[language] = train_documents
    validation_data[language] = validation_documents

# Now, train_data contains the normalized training documents, and validation_data contains the validation documents
# You can access them using train_data["german"], validation_data["german"], etc.
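
As a quick check of the normalization step, the sketch below applies the same regular expression to a made-up sentence. The sentence is an illustrative assumption and is not taken from the corpus.

```python
import re

def clean_text(text):
    # Drop every character that is not a word character, whitespace, or an apostrophe
    text = re.sub(r"[^\w\s']", "", text)
    # Normalize to lowercase
    return text.lower()

sample = "Señor Presidente, it's 10:30!"
print(clean_text(sample))
```

Digits survive because `\w` matches them, and accented letters are kept because `\w` is Unicode-aware for `str` patterns in Python 3.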


Once this is done, we proceed to calculate the 100 most frequent words for each of the languages.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Modify train_data to be a list of documents
train_data_list = [train_data[language] for language in languages]

# Define a function to get the 100 most frequent words
def get_top_words(train_documents):
    # Combine all training documents into a single text
    corpus = ' '.join(train_documents)

    # Tokenize the text into words
    words = word_tokenize(corpus)

    # Calculate the frequency of each word
    fdist = FreqDist(words)

    # Get the 100 most frequent words
    top_words = fdist.most_common(100)

    return top_words

# List to store the 100 most frequent words per language
top_words_by_language = []

# Iterate through each language in train_data
for train_documents in train_data_list:
    top_words = get_top_words(train_documents)
    top_words_by_language.append(top_words)

# Print the 100 most frequent words per language
for i, top_words in enumerate(top_words_by_language):
    language = list(languages.keys())[i]
    print(f"Language: {language}")
    for word, frequency in top_words:
        print(f"{word}: {frequency}")
    print("\n")


Language: german
die: 103713
der: 89201
und: 69987
in: 40850
zu: 33089
den: 30163
wir: 26167
ich: 25407
das: 25076
für: 24987
von: 24372
ist: 24084
es: 20082
dass: 19555
nicht: 18956
des: 18896
auf: 18228
eine: 17735
werden: 16302
im: 15721
mit: 14780
sie: 14280
auch: 14168
dem: 12602
ein: 12015
wird: 11281
sich: 11048
haben: 10526
sind: 10308
hat: 9148
um: 9134
wie: 9089
europäischen: 9063
als: 8597
kommission: 8381
über: 8361
daß: 8291
diese: 8274
herr: 7919
an: 7838
zur: 7731
bei: 6893
einer: 6832
union: 6564
dieser: 6458
uns: 6450
wenn: 6390
müssen: 6022
einen: 6005
möchte: 5725
aber: 5567
aus: 5308
vor: 5254
präsident: 5246
noch: 5155
können: 5104
so: 5090
nach: 5086
nur: 5059
diesem: 4942
kann: 4922
zum: 4760
was: 4738
europäische: 4695
parlament: 4675
bericht: 4552
durch: 4541
sein: 4495
oder: 4457
sehr: 4404
mitgliedstaaten: 4316
dies: 4175
dieses: 3893
einem: 3737
ihre: 3647
frau: 3642
muss: 3638
europa: 3342
wurde: 3329
alle: 3266
er: 3237
hier: 3174
man: 3115
mehr: 3106
damit:

### 2. Implementation of a TF-IDF model for language detection

This section builds a TF-IDF model for classifying documents in the 6 languages. We first fit a TF-IDF vectorizer on the training documents of all languages, which yields the TF-IDF matrix, and then save that matrix to disk.

With the TF-IDF matrix in hand, we move on to the model. Since this is a classification problem, the model could be Naive Bayes, an SVC, or any other classifier studied in this master's program for such cases. I have chosen an SVC because of the good performance this model provides.
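
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on three toy sentences. It assumes the smoothed IDF variant, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is the weighting `TfidfVectorizer` applies with its default settings. The tokenization here is a plain whitespace split, and the toy documents and the function name `tfidf` are illustrative only.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document, using smoothed IDF and L2 normalization."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

toy_docs = ["the parliament of the union", "das parlament der union", "el parlamento de la unión"]
vectors = tfidf(toy_docs)
for doc, vec in zip(toy_docs, vectors):
    top_term = max(vec, key=vec.get)
    print(f"{doc!r}: highest-weighted term = {top_term!r}")
```

In the first toy document, "union" also appears in the second document, so its weight is lower than that of a term found in only one document.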

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
from sklearn.svm import SVC

# SVM classifier for language detection
language_classifier = SVC(kernel='linear')

# Create a dictionary that associates each language with a number (label)
language_labels = {language: i for i, language in enumerate(languages.keys())}

# Create lists to store training documents and labels
X_train = []
y_train = []

# Iterate over each language in train_data
for language, train_documents in train_data.items():
    # Add the normalized documents to X_train
    X_train.extend(train_documents)
    # Assign the corresponding label to each document
    y_train.extend([language_labels[language]] * len(train_documents))

# Now, X_train contains the training documents and y_train the corresponding labels.

# Initialize the TF-IDF vectorizer with appropriate parameters
tfidf_vectorizer = TfidfVectorizer()

# Fit the TF-IDF vectorizer to the training documents
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Save the TF-IDF matrix to a file
with open('tfidf_matrices.pkl', 'wb') as file:
    pickle.dump(X_train_tfidf, file)

# Train the SVM classifier with the TF-IDF data and labels
language_classifier.fit(X_train_tfidf, y_train)

# Save the trained model to a file
with open('language_classifier_model.pkl', 'wb') as file:
    pickle.dump(language_classifier, file)


### 3. Evaluation of the language detection model using TF-IDF

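Once we have our training and validation sets, we evaluate the model. The evaluation reuses the vectorizer and the classifier from the training code. Note that Section 1 produces a single concatenated validation document per language, so the evaluation set contains six documents in total. The code is as follows: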


In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

# Load the SVM classifier from the file
with open('language_classifier_model.pkl', 'rb') as file:
    language_classifier = pickle.load(file)

# Create lists to store evaluation documents and labels
X_eval = []
y_eval = []

# Iterate over each language in validation_data (as done in Section 1)
for language, eval_documents in validation_data.items():
    # Add the normalized documents to X_eval
    X_eval.extend(eval_documents)
    # Assign the corresponding label to each document
    y_eval.extend([language_labels[language]] * len(eval_documents))

# Transform the evaluation documents into TF-IDF representation using the vectorizer fitted in Section 2
X_eval_tfidf = tfidf_vectorizer.transform(X_eval)

# Predict the languages of the evaluation documents
y_pred = language_classifier.predict(X_eval_tfidf)

# Compute the confusion matrix
confusion = confusion_matrix(y_eval, y_pred)

# Compute the overall model accuracy
accuracy = accuracy_score(y_eval, y_pred)

# Print the confusion matrix and overall accuracy
print("Confusion Matrix:")
print(confusion)
print("\nOverall Accuracy:", accuracy)


Confusion Matrix:
[[1 0 0 0 0 0]
 [0 1 0 0 0 0]
 [0 0 1 0 0 0]
 [0 0 0 1 0 0]
 [0 0 0 0 1 0]
 [0 0 0 0 0 1]]

Overall Accuracy: 1.0


### 4. Cleaning of the training and evaluation corpus

The assignment suggests cleaning the corpus of each language because, as its statement details, some corpora contain lines or words that belong to another language. For example, the Spanish corpus contains words such as "the" that do not belong to Spanish.

The strategy suggested in the statement is to calculate the 100 most frequent words in each language, find the lines in the other languages' corpora that contain those words, and discard these lines. This requires adapting the code from the first sections.
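
The suggested strategy can be sketched on toy data as follows. This is an illustration under assumptions, not the code used in this section: the tiny word lists stand in for the real 100-word lists, and the helper name `is_foreign_line` and its `threshold` parameter are hypothetical.

```python
import re

# Toy stand-ins for the 100 most frequent words of each language
# (assumed for illustration, not computed from the corpus)
top_words = {
    "english": {"the", "of", "and", "to", "is"},
    "spanish": {"de", "la", "que", "el", "en"},
}

def is_foreign_line(line, language, top_words, threshold=2):
    """Flag a line containing at least `threshold` frequent words that belong only to other languages."""
    own = top_words[language]
    others = (words for lang, words in top_words.items() if lang != language)
    foreign = set().union(*others) - own
    tokens = re.findall(r"\b\w+\b", line.lower())
    return sum(token in foreign for token in tokens) >= threshold

spanish_lines = [
    "el informe de la comisión",
    "the report of the commission is adopted",
]
kept = [line for line in spanish_lines if not is_foreign_line(line, "spanish", top_words)]
print(kept)
```

Subtracting the language's own list matters because some short words are frequent in more than one language, so they are weak evidence that a line is foreign.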

In [None]:
import os
import re
import random

# Define the function to get the 100 most frequent words of a language
def get_top_words(train_documents):
    # Combine all training documents into a single text
    corpus = ' '.join(train_documents)

    # Tokenize the text into words
    words = re.findall(r'\b\w+\b', corpus)

    # Calculate the frequency of each word
    word_freq = {}
    for word in words:
        word = word.lower()  # Normalize to lowercase
        if word not in word_freq:
            word_freq[word] = 0
        word_freq[word] += 1

    # Sort words by frequency in descending order
    sorted_words = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)

    # Get the 100 most frequent words
    top_words = [word for word, _ in sorted_words[:100]]

    return top_words

# Directory where the cleaned corpora will be saved
clean_corpus_dir = "clean_corpus"

# Create the directory for the cleaned corpora if it does not exist
os.makedirs(clean_corpus_dir, exist_ok=True)

# Iterate over each language in languages
for language, filename in languages.items():
    # Read the original training documents
    with open(filename, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Get the 100 most frequent words of this language
    top_words = get_top_words(train_data[language])

    # Initialize lists for valid training and evaluation lines
    valid_train_lines = []
    valid_eval_lines = []

    # Iterate over all lines in the original corpus
    for line in lines:
        # Tokenize the line into words
        words = re.findall(r'\b\w+\b', line)

        # Check if any word is among the 100 most frequent
        if any(word.lower() in top_words for word in words):
            # Consider the line as valid
            valid_train_lines.append(line)

    # Take 100,000 lines for training and 10,000 lines for evaluation
    num_train_lines = 100000
    num_eval_lines = 10000
    random.shuffle(valid_train_lines)
    valid_eval_lines = valid_train_lines[:num_eval_lines]
    valid_train_lines = valid_train_lines[num_eval_lines:num_eval_lines + num_train_lines]

    # Save the valid lines in cleaned corpus files
    clean_train_filename = os.path.join(clean_corpus_dir, f"{language}_train.txt")
    clean_eval_filename = os.path.join(clean_corpus_dir, f"{language}_eval.txt")

    with open(clean_train_filename, 'w', encoding='utf-8') as train_file:
        train_file.writelines(valid_train_lines)

    with open(clean_eval_filename, 'w', encoding='utf-8') as eval_file:
        eval_file.writelines(valid_eval_lines)


### 5. TF-IDF model with the clean corpus

In this section, we repeat Sections 2 and 3 with the cleaned corpus so that we can observe the impact of corpus cleaning on the model's performance.

First, we read the files with the cleaned corpora and store them in dictionaries.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score
import pickle
import os
import re
import random

# Dictionary to store cleaned training documents by language
clean_train_data = {}

# Dictionary to store cleaned validation documents by language
clean_validation_data = {}

# Iterate through each language in languages
for language in languages.keys():
    # Read cleaned training documents
    clean_train_filename = os.path.join(clean_corpus_dir, f"{language}_train.txt")
    with open(clean_train_filename, 'r', encoding='utf-8') as file:
        clean_train_documents = file.readlines()
    
    # Read cleaned validation documents
    clean_eval_filename = os.path.join(clean_corpus_dir, f"{language}_eval.txt")
    with open(clean_eval_filename, 'r', encoding='utf-8') as file:
        clean_eval_documents = file.readlines()

    # Store cleaned documents in dictionaries
    clean_train_data[language] = clean_train_documents
    clean_validation_data[language] = clean_eval_documents


We create the TF-IDF vectorizer, fit it on the cleaned training data, and train an SVC model on the resulting matrix. The same cell also prepares the cleaned evaluation data.

In [None]:
# SVM Classifier for Language Detection
language_classifier_clean = SVC(kernel='linear')

# Create lists to store training documents and labels
X_train_clean = []
y_train_clean = []

# Iterate through each language in clean_train_data
for language, clean_train_documents in clean_train_data.items():
    # Add the cleaned documents to X_train_clean
    X_train_clean.extend(clean_train_documents)
    # Assign the corresponding label to each document
    y_train_clean.extend([language_labels[language]] * len(clean_train_documents))

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the TF-IDF vectorizer to the training documents
X_train_clean_tfidf = tfidf_vectorizer.fit_transform(X_train_clean)

# Save the TF-IDF matrix to a file
with open('tfidf_matrices_clean.pkl', 'wb') as file:
    pickle.dump(X_train_clean_tfidf, file)

# Train the SVM classifier with TF-IDF data and labels
language_classifier_clean.fit(X_train_clean_tfidf, y_train_clean)

# Save the trained model to a file
with open('language_classifier_clean_model.pkl', 'wb') as file:
    pickle.dump(language_classifier_clean, file)

# Create lists to store clean evaluation documents and labels
X_eval_clean = []
y_eval_clean = []

# Iterate through each language in clean_validation_data
for language, clean_eval_documents in clean_validation_data.items():
    # Add clean documents to X_eval_clean
    X_eval_clean.extend(clean_eval_documents)
    # Assign the corresponding label to each document
    y_eval_clean.extend([language_labels[language]] * len(clean_eval_documents))

# Transform the evaluation documents into TF-IDF representation using the TF-IDF model with clean data
X_eval_clean_tfidf = tfidf_vectorizer.transform(X_eval_clean)


Finally, we evaluate our model.

In [None]:
# Predict the languages of the clean evaluation documents
y_pred_clean = language_classifier_clean.predict(X_eval_clean_tfidf)

# Compute the confusion matrix
confusion_clean = confusion_matrix(y_eval_clean, y_pred_clean)

# Compute the overall accuracy of the model with clean data
accuracy_clean = accuracy_score(y_eval_clean, y_pred_clean)

# Print the confusion matrix and overall accuracy with clean data
print("Confusion Matrix with Clean Data:")
print(confusion_clean)
print("\nOverall Accuracy with Clean Data:", accuracy_clean)


Confusion Matrix with Clean Data:
[[9993    2    1    0    2    2]
 [   0 9991    4    2    2    1]
 [   0    3 9989    0    2    6]
 [   0    1    0 9994    3    2]
 [   0    0    2    0 9995    3]
 [   0    1    0    0    0 9999]]

Overall Accuracy with Clean Data: 0.99935
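
Per-language recall can be read off the confusion matrix above. The sketch below recomputes it in plain Python from the reported values. It assumes that rows are true labels and columns are predictions (the convention of scikit-learn's `confusion_matrix`) and that the label order follows the `languages` dictionary.

```python
# Confusion matrix reported above (rows: true language, columns: predicted language)
confusion_clean = [
    [9993, 2, 1, 0, 2, 2],
    [0, 9991, 4, 2, 2, 1],
    [0, 3, 9989, 0, 2, 6],
    [0, 1, 0, 9994, 3, 2],
    [0, 0, 2, 0, 9995, 3],
    [0, 1, 0, 0, 0, 9999],
]
language_names = ["german", "english", "spanish", "french", "italian", "polish"]

# Recall: correctly classified documents divided by all documents of that language
recall = {
    name: row[i] / sum(row)
    for i, (name, row) in enumerate(zip(language_names, confusion_clean))
}

# Overall accuracy: sum of the diagonal divided by the total number of documents
correct = sum(confusion_clean[i][i] for i in range(len(confusion_clean)))
total = sum(sum(row) for row in confusion_clean)
accuracy = correct / total

for name, value in recall.items():
    print(f"{name}: {value:.4f}")
print(f"accuracy: {accuracy:.5f}")
```

Under these assumptions, Spanish has the lowest recall (9989 of 10000) and Polish the highest (9999 of 10000).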


### 6. Language detection using n-grams or other language models

Some words, such as "a", are repeated across many languages and can be observed in various corpora, so single words can be ambiguous. For this reason, the use of **n-grams**, word sequences of length **n**, is recommended. In this case, we will use bigrams.
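
As a small illustration of why sequences help, the sketch below extracts word n-grams from two made-up phrases. The word "a" appears in both, but the bigrams that contain it differ. The function name `word_ngrams` and the phrases are assumptions for illustration.

```python
def word_ngrams(text, n=2):
    """Return the list of word n-grams (as tuples) of a whitespace-tokenized text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

spanish_bigrams = word_ngrams("a la sesión", n=2)
english_bigrams = word_ngrams("a member of a committee", n=2)

print(spanish_bigrams)
print(english_bigrams)
```

A text with fewer than `n` words simply yields an empty list.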

In [None]:
from collections import Counter

# Create a dictionary to store bigram frequencies by language
bigram_frequencies_by_language = {}

# Iterate over each language in clean_train_data
for language, clean_train_documents in clean_train_data.items():
    # Initialize a bigram counter
    bigram_counter = Counter()
    
    # Iterate over documents and count bigrams
    for document in clean_train_documents:
        words = document.split()  # Split the document into words
        bigrams = zip(words, words[1:])  # Create bigrams
        bigram_counter.update(bigrams)  # Update the bigram counter
    
    # Store bigram frequencies in the dictionary by language
    bigram_frequencies_by_language[language] = bigram_counter


Once we have built the bigram model, we use it to evaluate language detection on the evaluation data. To do this, we need to calculate the probability that a document's sequence of bigrams belongs to each language.
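
One practical caveat when multiplying many bigram probabilities is that the product can underflow to zero in floating point for long documents. A common workaround, sketched below with made-up probabilities, is to sum logarithms instead. The values are purely illustrative.

```python
import math

# 200 bigrams, each with a hypothetical probability of 1e-5
bigram_probabilities = [1e-5] * 200

# Direct product: becomes exactly 0.0 once the value drops below the float range
product = 1.0
for p in bigram_probabilities:
    product *= p

# Sum of logarithms: stays finite and preserves the ranking between languages
log_probability = sum(math.log(p) for p in bigram_probabilities)

print(product)
print(log_probability)
```

Since the logarithm is monotonic, the language with the highest log-probability is also the one with the highest probability.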

In [None]:
import math

def bigram_log_probability(words, bigram_frequencies, total_bigrams):
    # Sum log-probabilities instead of multiplying probabilities to avoid underflow;
    # add-one smoothing keeps an unseen bigram from zeroing out the whole score
    vocabulary_size = len(bigram_frequencies)
    log_prob = 0.0
    for bigram in zip(words, words[1:]):
        smoothed = (bigram_frequencies[bigram] + 1) / (total_bigrams + vocabulary_size + 1)
        log_prob += math.log(smoothed)
    return log_prob

# Total number of bigrams per language, computed once
total_bigrams_by_language = {
    language: sum(frequencies.values())
    for language, frequencies in bigram_frequencies_by_language.items()
}

# Dictionary to store results by language with bigrams
results_by_language_bigram = {}

# Iterate through the evaluation documents of each (true) language
for language, eval_documents in clean_validation_data.items():
    predictions = Counter()
    for document in eval_documents:
        words = document.split()  # Split the document into words
        # Score the document under every language model and keep the best one
        scores = {
            lang: bigram_log_probability(words, frequencies, total_bigrams_by_language[lang])
            for lang, frequencies in bigram_frequencies_by_language.items()
        }
        predictions[max(scores, key=scores.get)] += 1

    # Store the most frequently predicted language and the per-language accuracy
    results_by_language_bigram[language] = {
        'predicted_language': predictions.most_common(1)[0][0],
        'accuracy': predictions[language] / len(eval_documents),
    }

# Print the results by language with bigrams
for language, results in results_by_language_bigram.items():
    print(f"Language: {language}")
    print(f"Predicted Language with Bigrams: {results['predicted_language']}")
    print(f"Accuracy with Bigrams: {results['accuracy']:.4f}\n")
