# Language Detection with Machine Learning

<img src='images/language.jpg'>

Language detection is a natural language processing task: given a text or document, identify the language it is written in. Training a machine learning model for language identification used to be difficult because labeled multilingual data was scarce, but such data is now easy to obtain and several capable models are readily available. In this project, I will walk you through training a language detection model in Python: a bag-of-words `CountVectorizer` feeding a Multinomial Naive Bayes classifier, trained on a dataset of 22,000 text samples covering 22 languages.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px 

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

import warnings
warnings.filterwarnings('ignore')

# Show up to 100 columns when displaying a DataFrame
pd.set_option('display.max_columns', 100)

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/dataset.csv")

## EDA - Exploratory Data Analysis

In [4]:
df.head()

|   | Text | language |
|---|------|----------|
| 0 | klement gottwaldi surnukeha palsameeriti ning ... | Estonian |
| 1 | sebes joseph pereira thomas på eng the jesuit... | Swedish |
| 2 | ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ... | Thai |
| 3 | விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர... | Tamil |
| 4 | de spons behoort tot het geslacht haliclona en... | Dutch |


In [5]:
df.shape

(22000, 2)

In [6]:
df.isnull().sum()

Text        0
language    0
dtype: int64

In [7]:
df["language"].value_counts()

language
Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: count, dtype: int64

## Modelling

In [8]:
x = np.array(df["Text"])
y = np.array(df["language"])

In [9]:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()), 
    ('classifier', MultinomialNB())     
])

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
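
The split above is purely random. Because the dataset is perfectly balanced, one optional variation is to pass `stratify=y` so that each language keeps the same share in the train and test sets. The sketch below shows the effect on made-up toy arrays (`toy_x`, `toy_y` are hypothetical stand-ins for `x` and `y`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for x and y: 3 labels with 10 samples each
toy_x = np.array([f"sample {i}" for i in range(30)])
toy_y = np.array(["A", "B", "C"] * 10)

# stratify=toy_y keeps the label proportions equal in both splits
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

# Count how many test samples each label received
labels, counts = np.unique(ty_test, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```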

In [11]:
pipeline.fit(x_train, y_train)

In [12]:
pred = pipeline.predict(x_test)

In [13]:
from sklearn.metrics import accuracy_score, classification_report
accuracy_score(y_test, pred)

0.9574380165289256
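
A single accuracy figure hides which languages are confused with each other. `classification_report` (imported above but not yet used) and `confusion_matrix` give a per-class view. The sketch below runs on hand-written toy labels, not on the notebook's predictions; to apply it to the real model, pass `y_test` and `pred` instead.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written toy labels for illustration (not real model output)
toy_true = ["English", "English", "Dutch", "Dutch", "Swedish", "Swedish"]
toy_pred = ["English", "English", "Dutch", "Swedish", "Swedish", "Swedish"]

# Per-class precision, recall and F1 as a printable text table
print(classification_report(toy_true, toy_pred))

# The same metrics as a nested dict, convenient for programmatic checks
report = classification_report(toy_true, toy_pred, output_dict=True)

# Rows are true labels, columns are predicted labels, in the given order
label_order = ["Dutch", "English", "Swedish"]
matrix = confusion_matrix(toy_true, toy_pred, labels=label_order)
print(matrix)
```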

In [14]:
pipeline.predict(["Hello, how are you doing?"])

array(['English'], dtype='<U10')

In [15]:
pipeline.predict(["Merhaba, benim adım Zafer."])

array(['Turkish'], dtype='<U10')
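
`predict` returns only the most likely label. When a rough confidence signal is useful, `predict_proba` returns one probability per class. The sketch below trains a tiny pipeline on four made-up sentences (`toy_pipeline`, `toy_texts`, and `toy_labels` are hypothetical); with the real model, call `pipeline.predict_proba` directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Four made-up training sentences, for illustration only
toy_texts = [
    "hello how are you",
    "good morning my friend",
    "merhaba nasılsın",
    "günaydın benim arkadaşım",
]
toy_labels = ["English", "English", "Turkish", "Turkish"]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_pipeline.fit(toy_texts, toy_labels)

# One row per input text, one column per class
proba = toy_pipeline.predict_proba(["how are you today"])[0]
classes = toy_pipeline.named_steps["classifier"].classes_

# Pair each class name with its estimated probability
for name, p in zip(classes, proba):
    print(f"{name}: {p:.3f}")
```

Naive Bayes probabilities are often poorly calibrated, so treat these values as a ranking aid rather than as exact confidence levels.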

### Saving the Model

In [16]:
from joblib import dump
dump(pipeline, 'pipeline.joblib')

['pipeline.joblib']
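
A saved pipeline can be restored with `joblib.load` and used for prediction without retraining. The sketch below is a self-contained round trip on a tiny made-up pipeline written to a temporary directory; in this notebook the equivalent call would be `load('pipeline.joblib')`.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up pipeline so the example runs on its own
toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_pipeline.fit(
    ["hello how are you", "merhaba benim adım"],
    ["English", "Turkish"],
)

sample = ["how are you"]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(toy_pipeline, path)   # serialize vectorizer and classifier together
    restored = load(path)      # rebuild the fitted pipeline from disk

# The restored pipeline should predict exactly like the original
original_pred = toy_pipeline.predict(sample)
restored_pred = restored.predict(sample)
print(restored_pred)
```

Joblib files are pickle-based, so only load files from sources you trust, and ideally load them with the same scikit-learn version that saved them.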