# Language Detection Project

This project focuses on detecting the language of text data using **n-gram models** implemented with the **Natural Language Toolkit (NLTK)**. The project involves data cleaning, preprocessing, model training, and evaluation to accurately identify languages such as English, Portuguese, and Spanish from StackOverflow datasets.


## Table of Contents

1. [Introduction](#introduction)
2. [Environment Setup](#environment-setup)
3. [Data Loading](#data-loading)
4. [Data Preprocessing](#data-preprocessing)
    - [Cleaning Text](#cleaning-text)
    - [Splitting the Data](#splitting-the-data)
    - [Tokenization and Bigrams](#tokenization-and-bigrams)
5. [Model Training](#model-training)
    - [Training MLE Models](#training-mle-models)
    - [Training Laplace Models](#training-laplace-models)
6. [Model Evaluation](#model-evaluation)
    - [Calculating Perplexity](#calculating-perplexity)
    - [Assigning Languages Based on Perplexity](#assigning-languages-based-on-perplexity)
7. [Results](#results)
8. [Conclusion](#conclusion)
9. [References](#references)

---

## Introduction

Language detection is a fundamental task in Natural Language Processing (NLP), with applications ranging from text classification to machine translation. This project uses **n-gram language models** to classify text into three languages: English, Portuguese, and Spanish. One bigram model is trained per language with NLTK's `nltk.lm` module, and each text is assigned to the language whose model gives it the lowest perplexity.

## Environment Setup

### Required Libraries

Ensure you have the following libraries installed. You can install any missing libraries using `pip`.


```bash
pip install pandas nltk scikit-learn
```

### Importing Libraries

```python
import pandas as pd
import re
from nltk.util import bigrams
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.lm import MLE, Laplace
from nltk.tokenize import WhitespaceTokenizer
from sklearn.model_selection import train_test_split
```


---

## Data Loading


We have three cleaned datasets for English, Portuguese, and Spanish stored in the `NLP_LM's_Regex/data_cleaned/` directory.

### Loading the Data

```python
# Define file paths
caminho_portugues = "../data_cleaned/stackoverflow_portuguese_clean.csv"
caminho_espanhol = "../data_cleaned/stackoverflow_spanish_clean.csv"
caminho_ingles = "../data_cleaned/stackoverflow_english_clean.csv"

# Load datasets
dados_portugues = pd.read_csv(caminho_portugues)
dados_espanhol = pd.read_csv(caminho_espanhol)
dados_ingles = pd.read_csv(caminho_ingles)

# Display the first few rows of each dataset
print("Portuguese data:")
print(dados_portugues.head())

print("\nSpanish data:")
print(dados_espanhol.head())

print("\nEnglish data:")
print(dados_ingles.head())
```

```text
Portuguese data:
      Id                                             Título  \
0   2402         Como fazer hash de senhas de forma segura?   
1   6441  Qual é a diferença entre INNER JOIN e OUTER JOIN?   
2    579  Por que não devemos usar funções do tipo mysql_*?   
3   2539           As mensagens de erro devem se desculpar?   
4  17501  Qual é a diferença de API, biblioteca e Framew...   

                                             Questão  \
0  <p>Se eu fizer o <em><a href="http://pt.wikipe...   
1  <p>Qual é a diferença entre <code>INNER JOIN</...   
2  <p>Uma dúvida muito comum é por que devemos pa...   
3  <p>É comum encontrar uma mensagem de erro que ...   
4  <p>Me parecem termos muito próximos e eventual...   

                                         Tags  Pontuação  Visualizações  \
0     <hash><segurança><senhas><criptografia>        350          22367   
1                                 <sql><join>        276         176953   
2                                <php><
```

---

## Data Preprocessing

### Cleaning Text

The data has already been cleaned with regex functions that remove HTML tags, code snippets, punctuation, digits, extra spaces, and newlines; the cleaned text is stored in the `cleaned_code_tag` column. The only additional step here is to add an `idioma` (language) label to each dataset and check its structure before training.

```python
# Adding language identification
dados_portugues["idioma"] = "portugues"
dados_espanhol["idioma"] = "espanhol"
dados_ingles["idioma"] = "ingles"

# Verify the datasets
print(dados_portugues.info())
print(dados_espanhol.info())
print(dados_ingles.info())
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Id                500 non-null    int64 
 1   Título            500 non-null    object
 2   Questão           500 non-null    object
 3   Tags              500 non-null    object
 4   Pontuação         500 non-null    int64 
 5   Visualizações     500 non-null    int64 
 6   cleaned_code_tag  500 non-null    object
 7   idioma            500 non-null    object
dtypes: int64(3), object(5)
memory usage: 31.4+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Id                500 non-null    int64 
 1   Título            500 non-null    object
 2   Questão           500 non-null    object
 3   Tags              500 non-null    object
```


### Splitting the Data

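We split the cleaned text column (`cleaned_code_tag`) of each language dataset into a training set (80%) and a test set (20%). A fixed `random_state` makes the split reproducible.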

```python
# Splitting datasets into training and testing
port_treino, port_teste = train_test_split(dados_portugues['cleaned_code_tag'], test_size=0.2, random_state=123)
esp_treino, esp_teste = train_test_split(dados_espanhol['cleaned_code_tag'], test_size=0.2, random_state=123)
ing_treino, ing_teste = train_test_split(dados_ingles['cleaned_code_tag'], test_size=0.2, random_state=123)
```


### Tokenization and Bigrams


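The first step is tokenization: all training texts of a language are joined and split on whitespace into a flat list of words. The bigrams are built later, from the characters of each word, when the models are trained and when perplexity is calculated.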

```python
# Function to tokenize the training texts
def gerar_bigrams(lista_textos):
    """Join all texts and split them on whitespace (despite its name, no bigrams are built here)."""
    todas_questoes = ' '.join(lista_textos)
    todas_palavras = WhitespaceTokenizer().tokenize(todas_questoes)
    return todas_palavras

# Tokenizing words for each language
todas_palavras_port = gerar_bigrams(port_treino)
todas_palavras_esp = gerar_bigrams(esp_treino)
todas_palavras_ing = gerar_bigrams(ing_treino)

print(todas_palavras_port[:50])
```

```text
['sou', 'iniciante', 'em', 'p', 'hp', 'e', 'gostaria', 'de', 'saber', 'se', 'p', 'do', 'ph', 'p', 'data', 'objects', 'é', 'a', 'maneira', 'mais', 'segura', 'de', 'se', 'conectar', 'a', 'um', 'banco', 'de', 'dados', 'preciso', 'também', 'de', 'um', 'exemplo', 'de', 'como', 'fazer', 'esta', 'conexão', 'e', 'inserirselecionar', 'dados', 'por', 'exemplo', 'code', 'estou', 'fazendo', 'um', 'efeito', 'aqui']
```
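
As a minimal sketch of how a single token turns into bigrams, assuming the `<s>` and `</s>` padding symbols that NLTK's `pad_both_ends` uses by default, a word can be padded and split into character bigrams in plain Python. The helper names `pad_word` and `char_bigrams` are illustrative and are not part of the project:

```python
def pad_word(word, n=2):
    """Add n-1 start and end symbols around the characters of a word."""
    return ["<s>"] * (n - 1) + list(word) + ["</s>"] * (n - 1)


def char_bigrams(word):
    """Return the pairs of consecutive symbols in the padded word."""
    padded = pad_word(word)
    return list(zip(padded, padded[1:]))


print(char_bigrams("de"))
# [('<s>', 'd'), ('d', 'e'), ('e', '</s>')]
```

The padding lets a model learn which characters tend to start and end words, which differs noticeably between languages.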


---

## Model Training

### Training MLE Models


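First, we train a Maximum Likelihood Estimation (MLE) model for each language. Because `padded_everygram_pipeline` receives a list of words, it treats each word as a sequence of characters, so the models learn character unigrams and bigrams within words.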

```python
# Function to train an MLE model
def treinar_modelo_MLE(lista_textos):
    todas_questoes = ' '.join(lista_textos)
    todas_palavras = WhitespaceTokenizer().tokenize(todas_questoes)
    # Each word is consumed as a sequence of characters, so the pipeline
    # yields padded character unigrams and bigrams plus the vocabulary
    bigrams_tratados, vocabulario = padded_everygram_pipeline(2, todas_palavras)
    modelo = MLE(2)
    modelo.fit(bigrams_tratados, vocabulario)
    return modelo

# Training MLE models
modelo_port_MLE = treinar_modelo_MLE(port_treino)
modelo_esp_MLE = treinar_modelo_MLE(esp_treino)
modelo_ing_MLE = treinar_modelo_MLE(ing_treino)
```
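
The idea behind the MLE estimate can be sketched in plain Python: the probability of a character given the previous one is the bigram count divided by the count of the previous character. The function names and the three-word toy corpus below are assumptions for illustration; this mirrors the idea behind NLTK's `MLE`, not its implementation.

```python
from collections import Counter


def train_counts(words):
    """Count character bigrams and their contexts over padded words."""
    bigram_counts = Counter()
    context_counts = Counter()
    for word in words:
        padded = ["<s>"] + list(word) + ["</s>"]
        for previous, current in zip(padded, padded[1:]):
            bigram_counts[(previous, current)] += 1
            context_counts[previous] += 1
    return bigram_counts, context_counts


def mle_score(bigram_counts, context_counts, previous, current):
    """Relative frequency of `current` after `previous`; 0.0 if never seen."""
    if context_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, current)] / context_counts[previous]


# Toy corpus, chosen only for illustration
bigram_counts, context_counts = train_counts(["de", "do", "da"])
seen = mle_score(bigram_counts, context_counts, "d", "e")    # 1 of the 3 bigrams starting with "d"
unseen = mle_score(bigram_counts, context_counts, "e", "o")  # "eo" never occurs in the corpus
print(seen, unseen)
```

The zero assigned to the unseen bigram is the weakness that smoothing addresses.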


### Training Laplace Models

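An MLE model gives probability zero to any bigram it never saw in training, which makes the perplexity of a text containing such a bigram infinite. Laplace (add-one) smoothing avoids this by adding one to every bigram count, so we also train Laplace-smoothed models.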

```python
# Function to train a Laplace model
def treinar_modelo_Laplace(lista_textos):
    todas_questoes = ' '.join(lista_textos)
    todas_palavras = WhitespaceTokenizer().tokenize(todas_questoes)
    bigrams_tratados, vocabulario = padded_everygram_pipeline(2, todas_palavras)
    modelo = Laplace(2)
    modelo.fit(bigrams_tratados, vocabulario)
    return modelo

# Training Laplace models
modelo_port_Laplace = treinar_modelo_Laplace(port_treino)
modelo_esp_Laplace = treinar_modelo_Laplace(esp_treino)
modelo_ing_Laplace = treinar_modelo_Laplace(ing_treino)
```
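
A rough sketch of add-one smoothing on a toy corpus: one is added to every bigram count and the vocabulary size is added to the denominator, so no bigram ends up with probability zero. The corpus and the `laplace_score` helper are assumptions for illustration; NLTK's own vocabulary handling may differ in detail (for example, in how unknown tokens are counted), so the exact numbers will not match `Laplace`.

```python
from collections import Counter

# Toy corpus, chosen only for illustration
words = ["de", "do", "da"]

bigram_counts = Counter()
context_counts = Counter()
vocab = set()
for word in words:
    padded = ["<s>"] + list(word) + ["</s>"]
    vocab.update(padded)
    for previous, current in zip(padded, padded[1:]):
        bigram_counts[(previous, current)] += 1
        context_counts[previous] += 1


def laplace_score(previous, current):
    """Add-one estimate: (count + 1) / (context count + vocabulary size)."""
    return (bigram_counts[(previous, current)] + 1) / (context_counts[previous] + len(vocab))


print(laplace_score("d", "e"))  # seen bigram: (1 + 1) / (3 + 6)
print(laplace_score("e", "o"))  # unseen bigram: (0 + 1) / (1 + 6), small but not zero
```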


---

## Model Evaluation

### Calculating Perplexity

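Perplexity measures how well a probability model predicts a sample: the lower the perplexity of a text under a model, the less that text surprises the model. Here it is used to compare the three language models on the same text, so the model with the lowest perplexity is taken as the best match.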

```python
# Function to calculate perplexity
def calcular_perplexidade(modelo, texto):
    """
    Compute the perplexity of a text under the given model.
    - modelo: trained model (MLE or Laplace)
    - texto: string to evaluate
    """
    perplexidade = 0
    palavras = WhitespaceTokenizer().tokenize(texto)
    # Pad each word and build its character bigrams
    palavras_fakechar = [list(pad_both_ends(palavra, n=2)) for palavra in palavras]
    palavras_bigrams = [list(bigrams(palavra)) for palavra in palavras_fakechar]
    
    # The score is the sum of per-word perplexities, so it grows with text
    # length; compare it only across models for the same text
    for palavra in palavras_bigrams:
        perplexidade += modelo.perplexity(palavra)
    
    return perplexidade
```
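
To make the quantity concrete, here is a minimal sketch of perplexity computed from a list of bigram probabilities, assuming the usual definition of 2 raised to the average negative log2 probability. The standalone `perplexity` function is illustrative and is not the project's code.

```python
import math


def perplexity(probabilities):
    """2 ** (average negative log2 probability); infinite if any probability is zero."""
    if any(p == 0 for p in probabilities):
        return math.inf
    entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** entropy


print(perplexity([0.5, 0.5, 0.5]))  # 2.0
print(perplexity([0.5, 0.0, 0.5]))  # inf
```

A single zero probability is enough to make the result infinite, which is why unsmoothed models are hard to compare on unseen text.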


### Assigning Languages Based on Perplexity

We will assign a language to each text by comparing the perplexity scores from each language model. The language with the lowest perplexity score is considered the predicted language.

```python
# Function to assign language based on perplexity
def atribui_idioma(lista_textos):
    """
    Assign a language to each text in the list based on the lowest perplexity.
    - lista_textos: list of texts (str) to classify
    """
    idiomas = []
    for texto in lista_textos:
        portugues = calcular_perplexidade(modelo_port_Laplace, texto)
        ingles = calcular_perplexidade(modelo_ing_Laplace, texto)
        espanhol = calcular_perplexidade(modelo_esp_Laplace, texto)
        
        if portugues <= ingles and portugues <= espanhol:
            idiomas.append("portugues")
        elif ingles < portugues and ingles <= espanhol:
            idiomas.append("ingles")
        else:
            idiomas.append("espanhol")
    
    return idiomas
```
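
The same decision rule can be sketched more compactly with `min` over a dictionary of scores. This is an alternative formulation, not the project's code: `pick_language` is a hypothetical helper, and the scores below are made-up numbers used only to show the behavior. Because `min` returns the first smallest key in insertion order, ties are resolved in the order Portuguese, English, Spanish.

```python
def pick_language(scores):
    """Return the label with the lowest score; ties go to the earliest key."""
    return min(scores, key=scores.get)


# Hypothetical perplexity scores for one text (illustrative values only)
example_scores = {"portugues": 2000.0, "ingles": 5800.0, "espanhol": 3100.0}
print(pick_language(example_scores))  # portugues

# A tie between the first two labels resolves to the first one
tied_scores = {"portugues": 10.0, "ingles": 10.0, "espanhol": 12.0}
print(pick_language(tied_scores))  # portugues
```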


---

## Results

### Classifying Test Data

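We classify the test set of each language and compute the hit rate. Because every text in a given test set has the same true language, the share of texts assigned to that language is the per-language accuracy.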

```python
# Classify test data
resultados_portugues = atribui_idioma(port_teste)
resultados_ingles = atribui_idioma(ing_teste)
resultados_espanhol = atribui_idioma(esp_teste)

# Calculate accuracy rates
taxa_portugues = resultados_portugues.count("portugues") / len(resultados_portugues)
taxa_ingles = resultados_ingles.count("ingles") / len(resultados_ingles)
taxa_espanhol = resultados_espanhol.count("espanhol") / len(resultados_espanhol)

# Display the results
print(f"Accuracy - Portuguese: {taxa_portugues:.2f}")
print(f"Accuracy - English: {taxa_ingles:.2f}")
print(f"Accuracy - Spanish: {taxa_espanhol:.2f}")
```

```text
Accuracy - Portuguese: 0.99
Accuracy - English: 1.00
Accuracy - Spanish: 0.98
```
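
A hit rate alone does not show which language the misclassified texts were assigned to. One way to inspect that, sketched here with `collections.Counter`, is to count every predicted label for a test set. The prediction list below is made up for illustration and is NOT the project's real output.

```python
from collections import Counter

# Made-up predictions for a 100-text Spanish test set (illustrative only)
example_predictions = ["espanhol"] * 98 + ["portugues"] * 2

label_counts = Counter(example_predictions)
hit_rate = label_counts["espanhol"] / len(example_predictions)

print(label_counts.most_common())  # [('espanhol', 98), ('portugues', 2)]
print(f"{hit_rate:.2f}")           # 0.98
```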


### Handling Perplexity Issues

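With the MLE models, any text containing a bigram that never appeared in training gets an infinite perplexity, which makes the scores impossible to compare. The Laplace-smoothed models avoid this: even the English model scoring a Portuguese text returns a finite, though much larger, value.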

```python
# Example of handling perplexity
texto_exemplo = "good morning"
print(calcular_perplexidade(modelo_ing_Laplace, texto_exemplo))

texto_port_teste = port_teste.iloc[0]
print(calcular_perplexidade(modelo_ing_Laplace, texto_port_teste))
```

```text
30.315093820557337
5825.03337622124
```


### Final Model Perplexity Calculation

The `calcular_perplexidade` function defined earlier works for any text. Scoring the same Portuguese test question with the Portuguese and the English models shows the gap that drives the classification: the perplexity is much lower under the Portuguese model.

```python
# Testing the function
print(calcular_perplexidade(modelo_ing_Laplace, "good morning"))
print(calcular_perplexidade(modelo_port_Laplace, port_teste.iloc[0]))
print(calcular_perplexidade(modelo_ing_Laplace, port_teste.iloc[0]))  # Expected to be higher for incorrect language
```

```text
30.315093820557337
2008.8011005039791
5825.03337622124
```
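
To see the whole pipeline in miniature without NLTK or the datasets, the sketch below trains two tiny add-one-smoothed character-bigram models and classifies a short text by the lower summed per-word perplexity. Everything in it is an assumption for illustration: the `ToyLaplaceModel` class, the two one-sentence corpora, and the extra vocabulary slot reserved for unseen characters. It is not the project's implementation.

```python
import math
from collections import Counter


def char_bigrams(word):
    """Pad a word with <s> and </s> and return its character bigrams."""
    padded = ["<s>"] + list(word) + ["</s>"]
    return list(zip(padded, padded[1:]))


class ToyLaplaceModel:
    """Character-bigram model with add-one smoothing (illustrative only)."""

    def __init__(self, words):
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        vocab = set()
        for word in words:
            for previous, current in char_bigrams(word):
                self.bigram_counts[(previous, current)] += 1
                self.context_counts[previous] += 1
                vocab.update((previous, current))
        # Assumption: one extra slot so unseen characters still get some mass
        self.vocab_size = len(vocab) + 1

    def score(self, previous, current):
        count = self.bigram_counts[(previous, current)]
        return (count + 1) / (self.context_counts[previous] + self.vocab_size)

    def perplexity(self, word):
        pairs = char_bigrams(word)
        entropy = -sum(math.log2(self.score(a, b)) for a, b in pairs) / len(pairs)
        return 2 ** entropy


def text_score(model, text):
    """Sum of per-word perplexities, as in calcular_perplexidade."""
    return sum(model.perplexity(word) for word in text.split())


# One-sentence toy corpora, far too small for real use
english = ToyLaplaceModel("the question is how to connect to the database".split())
portuguese = ToyLaplaceModel("a pergunta é como conectar ao banco de dados".split())

text = "como conectar"
scores = {
    "ingles": text_score(english, text),
    "portugues": text_score(portuguese, text),
}
prediction = min(scores, key=scores.get)
print(prediction)  # portugues
```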


---

## Conclusion

This project demonstrates the use of n-gram language models for language detection with NLTK. Using the pre-cleaned data and Laplace-smoothed bigram models, the classifier reached hit rates of 0.99 for Portuguese, 1.00 for English, and 0.98 for Spanish on the held-out test sets. Future improvements could include supporting more languages, experimenting with different n-gram sizes, and integrating more sophisticated NLP techniques.

---

## References

1. **NLTK Documentation**:
   - The official NLTK documentation can be consulted to understand the preprocessing functions, model training, and n-gram generation:
     - [NLTK Docs](https://www.nltk.org/)
   
2. **Bigrams and Language Modeling**:
   - The `bigrams` function from NLTK creates pairs of consecutive items in a sequence; in this project it is applied to the characters of each word. More details can be found in [NLTK Util Bigram](https://www.nltk.org/api/nltk.html#module-nltk.util).
   - `pad_both_ends` adds start and end symbols so that the bigrams also capture the first and last characters of each word.
   
3. **Language Model with NLTK**:
   - The models in NLTK's language modeling module, such as `MLE` and `Laplace`, estimate the probabilities of token sequences. These models are useful for building language detection systems, as described in:
     - [NLTK LM](https://www.nltk.org/api/nltk.html#module-nltk.lm)
   
4. **Calculating Perplexity**:
   - Perplexity measures the quality of a language model. The lower the perplexity, the better the model can predict a sequence of words. More information can be found in:
     - [Perplexity Explained](https://en.wikipedia.org/wiki/Perplexity)

5. **Scikit-Learn Documentation**:
   - The [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) documentation describes `train_test_split`, the function used here to split the data into training and test sets.
   
6. **Code References**:
   - The code was adapted from various sources and best practices within the Natural Language Processing (NLP) and Machine Learning community.
