# Problem Statement
The aim is to implement a machine learning model to identify the language a document is written in. 

# Data



The dataset is derived from WiLI-2018, the Wikipedia language identification benchmark dataset, which contains 235,000 paragraphs in 235 languages. A subset of this dataset from Kaggle, covering 22 languages, is used here for modeling.

https://www.kaggle.com/zarajamshaid/language-identification-datasst

### Loading packages

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

### Reading Data

In [3]:
df = pd.read_csv('data/dataset.csv')

In [4]:
df.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


In [5]:
df = df.drop_duplicates(subset='Text')
df = df.reset_index(drop=True)

In [6]:
df['language'].value_counts()

English       1000
Urdu          1000
Japanese      1000
Persian       1000
Korean        1000
Turkish       1000
Chinese       1000
Romanian      1000
Thai          1000
Russian        999
Estonian       999
Arabic         998
Portugese      997
Spanish        996
Dutch          996
Pushto         993
Swedish        992
French         990
Hindi          990
Tamil          981
Indonesian     975
Latin          953
Name: language, dtype: int64

### Splitting train and test data

In [7]:
train_doc, test_doc, train_labels, test_labels = train_test_split(
    df['Text'].values, df['language'].values, test_size=0.33, random_state=42)

# Calculating Features

Count vectors of character-level n-grams, for n from 1 up to 4, are used as features. Converting the input text into n-grams is called tokenization.

For example, the features for the word $match$ are $m, ma, mat, matc, a, at, atc, atch, t, tc, tch, c, ch, h$.
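The enumeration above can be reproduced with a few lines of plain Python. This is only a minimal sketch: `char_ngrams` is a hypothetical helper written for illustration, not the routine scikit-learn uses internally, and it lists the n-grams grouped by start position to match the order in the example.

```python
def char_ngrams(text, min_n=1, max_n=4):
    """Return every character n-gram of `text` with min_n <= n <= max_n,
    grouped by start position (hypothetical helper, for illustration)."""
    grams = []
    for start in range(len(text)):
        for n in range(min_n, max_n + 1):
            # Skip n-grams that would run past the end of the string.
            if start + n <= len(text):
                grams.append(text[start:start + n])
    return grams


print(char_ngrams("match"))
# ['m', 'ma', 'mat', 'matc', 'a', 'at', 'atc', 'atch', 't', 'tc', 'tch', 'c', 'ch', 'h']
```

A five-character word yields 5 + 4 + 3 + 2 = 14 n-grams, which is why the feature space grows quickly and a cap on the vocabulary size is useful.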

Count vectorization is the most basic method of transforming text into vectors: it counts the occurrences of each character n-gram in each document. The output is a document-term matrix in which each row represents a document and each column represents a token; each entry is the number of times that token occurs in that document.

We use only the training data to build the vocabulary for count vectorization, so that no information from the test set leaks into the features.
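To make the document-term matrix and the train-only vocabulary concrete, here is a small sketch on a toy corpus. The documents and the names `toy_train`, `toy_test`, and `toy_vectorizer` are assumptions made up for illustration; a narrower `ngram_range=(1, 2)` is used so the matrix stays readable.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["aba", "bb"]  # hypothetical training documents
toy_test = ["abc"]         # 'c' and 'bc' never occur in the training data

toy_vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer="char")

# fit_transform learns the vocabulary from the training documents only.
train_matrix = toy_vectorizer.fit_transform(toy_train)

# List the vocabulary in column order (term -> column index mapping).
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)                            # ['a', 'ab', 'b', 'ba', 'bb']
print(train_matrix.toarray().tolist())  # [[2, 1, 1, 1, 0], [0, 0, 2, 0, 1]]

# transform reuses the learned vocabulary; unseen n-grams are dropped.
test_matrix = toy_vectorizer.transform(toy_test)
print(test_matrix.toarray().tolist())   # [[1, 1, 1, 0, 0]]
```

Each row is a document and each column an n-gram. The test document contributes counts only for `a`, `ab`, and `b`, because `c` and `bc` are not in the vocabulary learned from the training set.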

In [8]:
vectorizer = CountVectorizer(ngram_range=(1,4),analyzer='char',max_features=25000) 

In [9]:
vector = vectorizer.fit_transform(train_doc)
train_df = pd.DataFrame(vector.toarray())

# Model

A random forest model is used for classification.

In [10]:
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(train_df.values, train_labels)

RandomForestClassifier(n_estimators=1000)
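The dense `toarray()` conversion used before fitting is not strictly required: scikit-learn's `RandomForestClassifier` also accepts SciPy sparse matrices, which can save a good deal of memory with tens of thousands of n-gram columns. The following is a minimal sketch on a toy corpus, not a rerun of the experiment; the sentences, the `toy_*` names, and the smaller `n_estimators=50` are assumptions chosen for illustration.

```python
from scipy.sparse import issparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only).
toy_docs = [
    "the cat sat on the mat",
    "this is a short sentence",
    "de kat zat op de mat",
    "dit is een korte zin",
]
toy_labels = ["English", "English", "Dutch", "Dutch"]

toy_vectorizer = CountVectorizer(ngram_range=(1, 4), analyzer="char")
X_sparse = toy_vectorizer.fit_transform(toy_docs)  # stays sparse, no toarray()

# A fixed random_state makes the fitted forest reproducible.
toy_clf = RandomForestClassifier(n_estimators=50, random_state=0)
toy_clf.fit(X_sparse, toy_labels)

# New documents are vectorized with the same fitted vocabulary.
X_new = toy_vectorizer.transform(["the dog sat", "de hond zat"])
toy_pred = toy_clf.predict(X_new)
print(issparse(X_sparse), X_sparse.shape[0], toy_pred.shape)
```

With only four training sentences the predicted labels themselves are not meaningful; the point is that fitting and predicting work directly on the sparse output of the vectorizer.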

## Prediction

In [11]:
vector_test = vectorizer.transform(test_doc)
test_df = pd.DataFrame(vector_test.toarray())

In [12]:
y_pred = clf.predict(test_df.values)

In [13]:
print(classification_report(test_labels,y_pred))

              precision    recall  f1-score   support

      Arabic       1.00      1.00      1.00       330
     Chinese       0.99      0.98      0.99       301
       Dutch       1.00      0.98      0.99       323
     English       0.81      0.99      0.90       332
    Estonian       0.99      0.97      0.98       317
      French       0.97      1.00      0.98       320
       Hindi       1.00      0.97      0.99       316
  Indonesian       1.00      0.98      0.99       352
    Japanese       1.00      0.98      0.99       302
      Korean       1.00      0.99      0.99       355
       Latin       0.93      0.94      0.93       326
     Persian       1.00      1.00      1.00       330
   Portugese       0.98      0.98      0.98       331
      Pushto       1.00      0.96      0.98       305
    Romanian       1.00      0.99      0.99       356
     Russian       0.99      0.99      0.99       308
     Spanish       1.00      0.96      0.98       353
     Swedish       1.00    