# Language Detection with Machine Learning

https://thecleverprogrammer.com/2021/10/30/language-detection-with-machine-learning/

Language detection is a natural language processing task in which we identify the language of a text or document. A few years ago, using machine learning for language identification was difficult because little multilingual data was available. Now that such data is easy to obtain, several powerful machine learning models for language identification already exist.

## Language Detection

As a human, you can easily detect the languages you know. For example, I can easily identify Hindi and English, but even as an Indian I cannot identify every Indian language. This is where a language identification model helps. Google Translate, one of the most popular translation tools in the world, includes a machine learning model that detects the language of your text, which is useful when you don’t know which language you are translating from.

![](https://i0.wp.com/thecleverprogrammer.com/wp-content/uploads/2021/10/google-translate.png?resize=768%2C371&ssl=1)

The most important part of training a language detection model is data. The more data you have for each language, the more accurately your model will perform on real-world text. The dataset I am using comes from Kaggle. It covers 22 popular languages with 1000 sentences in each, which makes it an appropriate dataset for training a language detection model with machine learning.

## Language Detection using Python

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/dataset.csv")

In [3]:
df.tail()

| | Text | language |
|---|---|---|
| 21995 | hors du terrain les années et sont des année... | French |
| 21996 | ใน พศ หลักจากที่เสด็จประพาสแหลมมลายู ชวา อินเ... | Thai |
| 21997 | con motivo de la celebración del septuagésimoq... | Spanish |
| 21998 | 年月，當時還只有歲的她在美國出道，以mai-k名義推出首張英文《baby i like》，由... | Chinese |
| 21999 | aprilie sonda spațială messenger a nasa și-a ... | Romanian |


In [4]:
df.isnull().sum()

Text        0
language    0
dtype: int64

In [6]:
df.language.value_counts()

Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: language, dtype: int64

This dataset contains 22 languages with 1000 sentences from each language. It is perfectly balanced and has no missing values, so it can be used to train a machine learning model without further cleaning.

## Language Detection Model

In [14]:
X = np.array(df.Text)
y = np.array(df.language)

In [15]:
cv = CountVectorizer()

In [16]:
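# Convert every sentence into a sparse vector of word counts
X = cv.fit_transform(X)
# Hold out 20% of the sentences for testing
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)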

In [24]:
X[0].todense().shape

(1, 277720)
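
The shape `(1, 277720)` means that each sentence becomes one row with 277,720 columns, one for every distinct token that `CountVectorizer` found in the corpus. As a minimal sketch of that idea, the toy sentences below (hypothetical, not taken from the dataset) show how the vocabulary and the count matrix are built. The `toy_` names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences, not part of the Kaggle dataset
toy_sentences = ["the cat sat", "the dog sat down"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_sentences)  # sparse matrix of word counts

# vocabulary_ maps each token to its column index; sorting the keys
# alphabetically lists the tokens in column order
toy_vocabulary = sorted(toy_cv.vocabulary_)
print(toy_vocabulary)        # ['cat', 'dog', 'down', 'sat', 'the']
print(toy_counts.shape)      # (2, 5): two sentences, five distinct tokens
print(toy_counts.toarray())  # one row of counts per sentence
```

With 22,000 sentences across 22 languages the same process yields a much wider matrix, which is why the result is stored as a sparse matrix.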

In [25]:
model = MultinomialNB()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9529545454545455
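
Note that the cells above fit `CountVectorizer` on the full dataset before splitting, so the vocabulary also contains words that occur only in test sentences. One possible alternative, sketched here on a tiny hypothetical English/Spanish corpus rather than the Kaggle data, is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training portion only. The corpus, the variable names, and the query sentence are assumptions for illustration, and the accuracy on such a small sample says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df.Text and df.language
toy_texts = [
    "the weather is nice today",
    "i like to read books",
    "where is the train station",
    "this is a good day",
    "el clima es agradable hoy",
    "me gusta leer libros",
    "donde esta la estacion de tren",
    "este es un buen dia",
]
toy_labels = ["English"] * 4 + ["Spanish"] * 4

# Split the raw text first, keeping both languages in each split
text_train, text_test, label_train, label_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

# The pipeline fits the vectorizer on the training sentences only
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(text_train, label_train)

toy_accuracy = toy_pipeline.score(text_test, label_test)
toy_prediction = toy_pipeline.predict(["the weather is good"])[0]
print(toy_accuracy, toy_prediction)
```

A pipeline also accepts raw strings at prediction time, so the separate `cv.transform` step is no longer needed.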

In [26]:
user = input("Enter a Text: ")
# Reuse the fitted vectorizer; MultinomialNB accepts the sparse matrix directly
data = cv.transform([user])
output = model.predict(data)

Enter a Text:  देखकर अच्छा लगता है


In [27]:
output

array(['Hindi'], dtype='<U10')
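
`predict` returns only the single most likely language. To see how strongly the model prefers it, the class probabilities can be ranked with `predict_proba`. The helper below is a sketch under stated assumptions: `rank_languages` is a hypothetical name, and the two-language toy corpus stands in for the fitted `cv` and `model` from the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the 22-language dataset
toy_texts = [
    "good morning how are you",
    "thank you very much",
    "bonjour comment allez vous",
    "merci beaucoup mon ami",
]
toy_labels = ["English", "English", "French", "French"]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)


def rank_languages(text, vectorizer, classifier, top_k=3):
    """Return up to top_k (language, probability) pairs, most likely first."""
    probabilities = classifier.predict_proba(vectorizer.transform([text]))[0]
    ranked = sorted(
        zip(classifier.classes_, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]


ranking = rank_languages("thank you my friend", toy_cv, toy_model)
print(ranking)
```

With the real objects, the call would be `rank_languages(user, cv, model)`.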