
# Introduction


This is a machine learning project that uses Python libraries to detect the language of a given text. We use the "Language Identification Dataset" from Kaggle.

We use the multinomial Naive Bayes classifier because the features are word counts, which is the kind of data the multinomial variant models, and because it handles more than two classes (here, 22 languages) directly. A Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem. It is a simple but effective algorithm that is often used for text classification, spam filtering, and sentiment analysis.
The Naive Bayes classifier assumes that the features of a data point are conditionally independent of each other given the class. Under this assumption, the probability of a data point belonging to a particular class is proportional to the prior probability of that class multiplied by the product of the probabilities of each feature given that class.
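
The independence assumption can be made concrete with a tiny hand-computed example. The sketch below is not part of the project code: the two-language vocabulary, the word probabilities, and the helper name `posterior` are made-up assumptions for illustration only.

```python
import math

# Hypothetical toy numbers (assumptions, not taken from the dataset):
# prior probability of each class and per-class word likelihoods P(word | class).
priors = {"French": 0.5, "English": 0.5}
likelihoods = {
    "French": {"la": 0.30, "vie": 0.20, "the": 0.01, "life": 0.01},
    "English": {"la": 0.02, "vie": 0.01, "the": 0.30, "life": 0.20},
}


def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    # Unnormalised score: log prior plus the sum of log likelihoods of each word.
    log_scores = {
        label: math.log(priors[label])
        + sum(math.log(likelihoods[label][w]) for w in words)
        for label in priors
    }
    # Normalise so that the posteriors sum to 1.
    total = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / total for label, s in log_scores.items()}


result = posterior(["la", "vie"])
print(max(result, key=result.get))  # prints: French
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once a text contains more than a handful of words.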

# Data Extraction & Analysis

In [None]:
#Importing the libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [None]:
#Importing the dataset
df = pd.read_csv('/content/dataset.csv')
df.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


In [None]:
#Checking the dataset for NULL values
df.isnull()

Unnamed: 0,Text,language
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
21995,False,False
21996,False,False
21997,False,False
21998,False,False
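
The full boolean table is hard to scan for a dataset of this size. A minimal sketch, assuming a small stand-in DataFrame (the `sample` frame below is hypothetical, not the Kaggle data), shows how `isnull().sum()` condenses the same check into one count per column:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset, with one missing text on purpose.
sample = pd.DataFrame(
    {
        "Text": ["la vie est belle", None, "life is beautiful"],
        "language": ["French", "French", "English"],
    }
)

# Number of missing values in each column.
null_counts = sample.isnull().sum()
print(null_counts)

# True if any cell in the whole frame is missing.
has_nulls = bool(sample.isnull().values.any())
print(has_nulls)
```

Applied to the notebook's `df`, the same two calls would give a per-column count and a single yes/no answer instead of thousands of boolean rows.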


In [None]:
#Count the number of lines for each language
df["language"].value_counts()

Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: language, dtype: int64

# Building the Language Detection Model

In [None]:
#Separate the input texts (features) from the language labels (target)
x = np.array(df["Text"])
y = np.array(df["language"])

#Convert each text into a vector of word counts
cv = CountVectorizer()

X = cv.fit_transform(x)
#Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
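
In the cell above, the vectorizer is fitted on the full dataset before the split, so the vocabulary also includes words that occur only in the test rows. One possible alternative, sketched here on a made-up six-sentence corpus (the `texts` and `labels` lists are assumptions, not the project's data), is to wrap both steps in a scikit-learn pipeline so that the vocabulary is learned from the training split only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus used only to illustrate the workflow.
texts = [
    "la vie est belle", "je suis content", "bonjour le monde",
    "life is beautiful", "i am happy", "hello to the world",
]
labels = ["French", "French", "French", "English", "English", "English"]

# Split the raw strings first, keeping both languages in each split.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training texts only,
# then trains MultinomialNB on the resulting counts.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# Raw strings can be passed directly; vectorisation happens inside the pipeline.
predictions = pipeline.predict(test_texts)
print(len(predictions), "test sentences classified")
```

With this arrangement the test rows never influence the vocabulary, so the reported score is a slightly more honest estimate of performance on unseen text.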

In [None]:
#Train the model and report its accuracy on the test set
model = MultinomialNB()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.953168044077135
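
A single accuracy figure does not show which languages are confused with each other. As a hedged sketch on a hypothetical set of true and predicted labels (the `y_true` and `y_pred` lists are invented for illustration), scikit-learn's `classification_report` and `confusion_matrix` break the result down per class:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels; in the notebook these would be y_test and model.predict(X_test).
y_true = ["French", "French", "Spanish", "Spanish", "English", "English"]
y_pred = ["French", "Spanish", "Spanish", "Spanish", "English", "English"]

# Overall accuracy, the same quantity model.score reports for a classifier.
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)

# Precision, recall and F1 for each language.
print(classification_report(y_true, y_pred))

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
labels = ["English", "French", "Spanish"]
matrix = confusion_matrix(y_true, y_pred, labels=labels)
print(matrix)
```

In this toy case the only error is one French sentence predicted as Spanish, which appears as the single off-diagonal entry in the French row of the matrix.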

In [None]:
#Predict the language of a sample input
string = input("Enter a text as test input : ")
#MultinomialNB accepts the sparse matrix directly, so no .toarray() is needed
test_data = cv.transform([string])
target_result = model.predict(test_data)
print(target_result)

Enter a text as test input : La vie est belle
['French']
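
Besides the single predicted label, `MultinomialNB` can also report a probability for every class through `predict_proba`. The sketch below trains on a made-up four-sentence corpus (an assumption for illustration, not the Kaggle data) to show how the class probabilities for one input could be inspected:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data.
train_texts = [
    "la vie est belle", "bonjour tout le monde",
    "life is beautiful", "hello to the whole world",
]
train_labels = ["French", "French", "English", "English"]

toy_cv = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_cv.fit_transform(train_texts), train_labels)

# predict_proba returns one row per input and one column per class,
# with columns ordered as in toy_model.classes_.
probabilities = toy_model.predict_proba(toy_cv.transform(["la vie est belle"]))[0]
for label, probability in zip(toy_model.classes_, probabilities):
    print(label, round(probability, 3))
```

Looking at the probabilities rather than only the top label can help flag short or ambiguous inputs where the model is not confident.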


# Success Criteria

The main success criteria of the project are:

1.   Analyse the given dataset to detect language.
2.   Build a machine learning model for classification.
3.   Understand the concept of Naive Bayes Classifier and its types.





# Conclusion

Using Python libraries such as pandas, NumPy, and scikit-learn, we built a machine learning model to detect the language of a given string. We used the Multinomial Naive Bayes classifier because it works on word-count features and supports multi-class classification. Although Naive Bayes is an effective classification method, it can be inaccurate because it assumes that the features are independent of each other. We imported the "Language Identification Dataset" from Kaggle. The model reached about 95% accuracy on the held-out test set. We also tested the model on a French input sentence, and its prediction was correct.


# References

Dataset: https://www.kaggle.com/datasets/zarajamshaid/language-identification-datasst?resource=download

Source Code: https://thecleverprogrammer.com/2021/10/30/language-detection-with-machine-learning/
