<a href="https://colab.research.google.com/github/reddyzhub/Language-Recognition-Model/blob/main/Language_Recognition_using_Distributed_High_Dimensional_Representations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Project Summary**

This project builds a language recognition model that identifies the language of a given text. It is trained on sentences from 21 European languages. A `CountVectorizer` extracts word-count features and a `RidgeClassifier` performs the classification. On a held-out 20% test split, the model reaches an accuracy and weighted F1-score of about 99.6%.


## Mount Google Drive

Mount Google Drive to access the corpus.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Verifying Data Accessibility
Verify that the data is accessible in Google Drive by listing the contents of the `Corpora` folder.

In [3]:
import os

# The path to the folder in your Google Drive
folder_path = "/content/drive/MyDrive/Corpora"

# List the files in the folder
if os.path.exists(folder_path):
    files = os.listdir(folder_path)
    if files:
        print("Files in the folder:")
        for file in files:
            print(file)
    else:
        print("The folder is empty.")
else:
    print(f"The folder {folder_path} does not exist.")

Files in the folder:
Bulgarian.tar.gz
Danish.tar.gz
Czech.tar.gz
German.tar.gz
Greek.tar.gz
English.tar.gz
Estonian.tar.gz
Finnish.tar.gz
French.tar.gz
Hungarian.tar.gz
Italian.tar.gz
Latvian.tar.gz
Lithuanian.tar.gz
Dutch.tar.gz
Polish.tar.gz
Romanian.tar.gz
Slovak.tar.gz
Slovene.tar.gz
Spanish.tar.gz
Swedish.tar.gz
French
Finnish
Estonian
English
Dutch
Danish
Romanian
Polish
Lithuanian
Latvian
Italian
Hungarian
Greek
German
Czech
Bulgarian
Swedish
Spanish
Slovene
Slovak
Portuguese.tar.gz
Portuguese


## Loading the Data
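This code iterates through the items in the `Corpora` folder, extracts each `.tar.gz` archive
if its directory does not exist yet, and reads every `*-sentences.txt` file it finds.
Each line is split on the tab character, and the second field (the sentence) is lowercased and stored with its language label in a pandas DataFrame.
Note that when a language has both an archive and an extracted directory in the folder, as in the listing above, the loop reads its sentences once per branch, so each sentence is appended twice.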

In [4]:
import tarfile
import pandas as pd
import os

# The path to the folder in your Google Drive
folder_path = "/content/drive/MyDrive/Corpora"

# A list to store the data
data = []

# Iterate through the files in the folder
for item in os.listdir(folder_path):
    item_path = os.path.join(folder_path, item)
    if os.path.isdir(item_path):
        # The item is a directory, so the data is already extracted
        language = item
        for root, _, files in os.walk(item_path):
            for file_name in files:
                if file_name.endswith("-sentences.txt"):
                    # The path to the sentences file
                    file_path = os.path.join(root, file_name)

                    # Read the sentences from the file
                    with open(file_path, 'r', encoding='utf-8') as f:
                        for line in f:
                            text = line.strip()
                            if text:
                                data.append({'text': text.split('\t')[1].lower(), 'language': language})
    elif item.endswith(".tar.gz"):
        # The item is a tar.gz file, so we need to extract it first
        language = item.split(".")[0]

        # The path to the tar.gz file
        tar_path = os.path.join(folder_path, item)

        # The path to the extracted folder
        extracted_folder_path = os.path.join(folder_path, language)

        # Extract the tar.gz file
        if not os.path.exists(extracted_folder_path):
            with tarfile.open(tar_path, "r:gz") as tar:
                tar.extractall(path=folder_path)

        # Read the data from the extracted folder
        for root, _, files in os.walk(extracted_folder_path):
            for file_name in files:
                if file_name.endswith("-sentences.txt"):
                    # The path to the sentences file
                    file_path = os.path.join(root, file_name)

                    # Read the sentences from the file
                    with open(file_path, 'r', encoding='utf-8') as f:
                        for line in f:
                            text = line.strip()
                            if text:
                                data.append({'text': text.split('\t')[1].lower(), 'language': language})

# Create a pandas DataFrame from the data
df = pd.DataFrame(data)

# Display the first few rows of the DataFrame
display(df.head())

| | text | language |
|---|---|---|
| 0 | “12 км крос в гората. | Bulgarian |
| 1 | "6 от жените, включително екскурзоводката, са ... | Bulgarian |
| 2 | air france какво правят тук не мога да си обясня. | Bulgarian |
| 3 | a la carte културен свят, който носи пластично... | Bulgarian |
| 4 | aqua е известна като една от най-здравословнит... | Bulgarian |
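
The folder listing above contains both an archive and an extracted directory for every language, so it may be worth checking the collected rows for repeats before splitting. The following is a minimal sketch in plain Python, assuming rows shaped like the entries of `data`; the helper names `count_duplicate_rows` and `drop_duplicate_rows` and the toy rows are hypothetical.

```python
from collections import Counter


def count_duplicate_rows(rows):
    """Return how many rows repeat an earlier (text, language) pair."""
    counts = Counter((row["text"], row["language"]) for row in rows)
    return sum(n - 1 for n in counts.values())


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each (text, language) pair, in order."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["text"], row["language"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Toy rows (illustrative only, not taken from the corpus)
toy_rows = [
    {"text": "das ist ein haus", "language": "German"},
    {"text": "this is one house", "language": "English"},
    {"text": "das ist ein haus", "language": "German"},
]

print(count_duplicate_rows(toy_rows))      # number of repeated rows
print(len(drop_duplicate_rows(toy_rows)))  # rows left after de-duplication
```

On the real DataFrame, `df.duplicated().sum()` and `df.drop_duplicates()` should do the same job.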


## Splitting the Data
This code splits the data into training and testing sets with scikit-learn's `train_test_split`: 80% for training and 20% for testing, with `random_state=42` so the split is reproducible.

In [5]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['language'], test_size=0.2, random_state=42)

# Print the size of the training and testing sets
print(f"Training data size: {len(X_train)}")
print(f"Testing data size: {len(X_test)}")

Training data size: 336000
Testing data size: 84000
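
The split above is purely random. If each language should appear in the same proportion in both sets, `train_test_split` also accepts a `stratify` argument. A small sketch on toy labels follows; the toy data is an assumption for illustration, not the corpus.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 10 placeholder sentences per language (not corpus text)
texts = [f"sentence {i}" for i in range(20)]
labels = ["German"] * 10 + ["English"] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,  # keep the language proportions in both splits
)

print(Counter(y_tr))
print(Counter(y_te))
```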


## Vectorizing Text Data
This code converts the text into numerical features with `CountVectorizer`.
It builds a vocabulary of word tokens from the training data only, then transforms both the training
and testing data into sparse matrices of token counts that the model can consume.
Words in the test data that are not in the training vocabulary are ignored.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit the vectorizer on the training data and transform the training data
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_vec = vectorizer.transform(X_test)

# Print the shape of the vectorized data
print(f"Shape of vectorized training data: {X_train_vec.shape}")
print(f"Shape of vectorized testing data: {X_test_vec.shape}")

Shape of vectorized training data: (336000, 562753)
Shape of vectorized testing data: (84000, 562753)
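
To see what such a count matrix contains, here is a toy sketch with two training sentences and one test sentence. The sentences and the `toy_` names are made up for illustration; only `CountVectorizer` itself comes from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["das ist ein haus", "this is one house"]
toy_test = ["das ist ein baum"]  # "baum" never appears in toy_train

toy_vectorizer = CountVectorizer()
toy_train_vec = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_vec = toy_vectorizer.transform(toy_test)        # reuses that vocabulary

# Vocabulary words listed in column order
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(columns)
print(toy_test_vec.toarray())
```

The unseen word "baum" has no column, so it is dropped from the test vector without any warning.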


## Training and Evaluating the Model
This code creates a `RidgeClassifier`, trains it on the vectorized training data,
and evaluates it on the test data. Accuracy and the weighted-average F1-score
are printed to assess the model's effectiveness.

In [7]:
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Create a RidgeClassifier model
model = RidgeClassifier()

# Train the model
model.fit(X_train_vec, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test_vec)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

# Print the results
print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")

Accuracy: 0.9955595238095238
F1 Score: 0.9955633976924689


### **Results and Interpretation**

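The model reaches an accuracy and weighted F1-score of about 99.56% on the test split. Because the F1-score is a weighted average over languages, it shows that precision and recall are balanced overall, but it does not by itself rule out weaker performance on an individual language; per-language scores would be needed for that. The figures should also be read with the data loading in mind: the loading loop reads a language twice when both its archive and its extracted directory are present, so identical sentences can land in both the training and the test set, which would make the test score optimistic.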
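
To make the difference between a weighted average and per-language scores concrete, the arithmetic can be spelled out in plain Python. The labels below are invented for illustration and are not results from this model; the helper `per_class_f1` is hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from true and predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Invented labels: the single Slovene sentence is predicted as English
toy_true = ["German", "German", "German", "English", "English", "Slovene"]
toy_pred = ["German", "German", "German", "English", "English", "English"]

f1_by_label = per_class_f1(toy_true, toy_pred)
support = Counter(toy_true)

macro_f1 = sum(f1_by_label.values()) / len(f1_by_label)
weighted_f1 = sum(f1_by_label[l] * support[l] for l in f1_by_label) / len(toy_true)

print(f1_by_label)
print(f"macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

In this toy case the weighted score stays well above the macro score even though one language is never recognized. With scikit-learn, `f1_score(y_test, y_pred, average=None)` or `classification_report` gives the per-language breakdown directly.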

## Testing the Model with a Sample Sentence
This code defines a function to predict the language of a given sentence.
It uses the trained vectorizer and model to make a prediction.
The function is then tested with a sample German sentence to demonstrate its usage.


In [8]:
def predict_language(sentence):
    """
    Predicts the language of a given sentence.
    """
    # Vectorize the sentence
    sentence_vec = vectorizer.transform([sentence])

    # Predict the language
    predicted_language = model.predict(sentence_vec)[0]

    return predicted_language

# Example of how to use the function with a German sentence
german_sentence = "Das ist ein schöner Tag."
predicted_language = predict_language(german_sentence)
print(f"The predicted language of the sentence '{german_sentence}' is: {predicted_language}")

The predicted language of the sentence 'Das ist ein schöner Tag.' is: German
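
`predict_language` relies on the global `vectorizer` and `model`. One alternative, sketched here on toy data, is to bundle both steps in a scikit-learn pipeline so that raw text goes in and a label comes out. The training sentences and the name `toy_pipeline` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only)
toy_texts = [
    "das ist ein haus",
    "das ist ein hund",
    "this is one house",
    "this is one dog",
]
toy_labels = ["German", "German", "English", "English"]

# The vectorizer and the classifier are fitted and applied as one object
toy_pipeline = make_pipeline(CountVectorizer(), RidgeClassifier())
toy_pipeline.fit(toy_texts, toy_labels)

print(toy_pipeline.predict(["das ist ein baum"])[0])
print(toy_pipeline.predict(["this is one tree"])[0])
```

A single fitted pipeline is also easier to save and reload than a separate vectorizer and model.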


## Testing the Model with Multiple Languages
This code spot-checks the model with one hand-written sentence per language.
It iterates through a dictionary of sample sentences, predicts the language of each one,
and prints the predicted and actual languages side by side.

In [9]:
# A dictionary of sample sentences in each language
sample_sentences = {
    "Bulgarian": "Това е изречение на български.",
    "Czech": "Toto je věta v češtině.",
    "Danish": "Dette er en sætning på dansk.",
    "German": "Dies ist ein Satz auf Deutsch.",
    "Greek": "Αυτή είναι μια πρόταση στα ελληνικά.",
    "English": "This is a sentence in English.",
    "Estonian": "See on lause eesti keeles.",
    "Finnish": "Tämä on lause suomeksi.",
    "French": "Ceci est une phrase en français.",
    "Hungarian": "Ez egy mondat magyarul.",
    "Italian": "Questa è una frase in italiano.",
    "Latvian": "Šis ir teikums latviešu valodā.",
    "Lithuanian": "Tai sakinys lietuvių kalba.",
    "Dutch": "Dit is een zin in het Nederlands.",
    "Polish": "To jest zdanie w języku polskim.",
    "Portuguese": "Esta é uma frase em português.",
    "Romanian": "Aceasta este o propoziție în limba română.",
    "Slovak": "Toto je veta v slovenčine.",
    "Slovene": "To je stavek v slovenščini.",
    "Spanish": "Esta es una oración en español.",
    "Swedish": "Detta är en mening på svenska."
}

# Iterate through the sample sentences and predict the language
for language, sentence in sample_sentences.items():
    predicted_language = predict_language(sentence)
    print(f"Sentence: '{sentence}'")
    print(f"Predicted language: {predicted_language}")
    print(f"Actual language: {language}")
    print("-" * 20)

Sentence: 'Това е изречение на български.'
Predicted language: Bulgarian
Actual language: Bulgarian
--------------------
Sentence: 'Toto je věta v češtině.'
Predicted language: Czech
Actual language: Czech
--------------------
Sentence: 'Dette er en sætning på dansk.'
Predicted language: Danish
Actual language: Danish
--------------------
Sentence: 'Dies ist ein Satz auf Deutsch.'
Predicted language: German
Actual language: German
--------------------
Sentence: 'Αυτή είναι μια πρόταση στα ελληνικά.'
Predicted language: Greek
Actual language: Greek
--------------------
Sentence: 'This is a sentence in English.'
Predicted language: English
Actual language: English
--------------------
Sentence: 'See on lause eesti keeles.'
Predicted language: Estonian
Actual language: Estonian
--------------------
Sentence: 'Tämä on lause suomeksi.'
Predicted language: Finnish
Actual language: Finnish
--------------------
Sentence: 'Ceci est une phrase en français.'
Predicted language: French
Actual lang

### Discussion of Test Results
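The model correctly predicts all languages except for Slovene, which it misclassifies as Dutch. Slovene is a Slavic language and Dutch a Germanic one, so linguistic similarity is an unlikely cause. A more plausible explanation lies in the features: the default `CountVectorizer` tokenizer drops single-character tokens such as "v", words that are missing from the training vocabulary contribute nothing, and the remaining short words can be shared across languages ("je", for instance, is also a common Dutch word). With so few informative tokens in a short sentence, a single shared word may be enough to tip the prediction.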
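
One way to probe such a miss is to look at which tokens of the sentence actually reach the model. The sketch below applies the default `CountVectorizer` token pattern with Python's `re` module. The small `known_words` set is a made-up stand-in for the trained vocabulary, which in the notebook would be `vectorizer.vocabulary_`.

```python
import re

# Default token pattern of CountVectorizer: words of two or more characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

sentence = "To je stavek v slovenščini."
tokens = TOKEN_PATTERN.findall(sentence.lower())
print(tokens)

# Hypothetical vocabulary, for illustration only
known_words = {"to", "je"}
print([t for t in tokens if t in known_words])      # tokens the model can use
print([t for t in tokens if t not in known_words])  # tokens that are ignored
```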