<a href="https://colab.research.google.com/github/vaibhav-vemula/Multilingual_Language_Identification/blob/main/Multilingual_Language_Identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Language Identification in Multilingual Sentences**

---



In [None]:
import os
os.chdir("/content/drive/MyDrive/MultiLingual_Detection/")

In [None]:
import pandas as pd
import re

from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

## Dataset

Dataset used - https://drive.google.com/file/d/1-2lhUZy9x1WW3WHhTP2DrFl_t9vbPr58/view?usp=sharing

Languages in the dataset - 

1. English
2. French
3. Spanish
4. Hindi
5. Portuguese (labelled "Portugeese" in the dataset)
6. Italian
7. Russian
8. Swedish (labelled "Sweedish" in the dataset)
9. Malayalam
10. Dutch
11. Arabic
12. Turkish
13. German
14. Tamil
15. Danish
16. Kannada
17. Greek

In [None]:
df = pd.read_csv('Languages_Dataset.csv')
df.drop(['Unnamed: 0'], axis=1, inplace =True)
df

,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English
...,...,...
11050,31 अक्टूबर 1984 को काला दिवस कहा जाता है। इस द...,Hindi
11051,\n\nगुंडे को देख सत्तर और अस्सी के दशक का सिने...,Hindi
11052,Chandermohan.sharma@timesgroup.com ग्लैमर इंडस...,Hindi
11053,"निर्माता :\nसुनीता गोवारीकर, अजय बिजली, संजीव ...",Hindi


In [None]:
df['Language'].value_counts()

English       1385
French        1014
Spanish        819
Hindi          781
Portugeese     739
Italian        698
Russian        692
Sweedish       676
Malayalam      594
Dutch          546
Arabic         536
Turkish        474
German         470
Tamil          469
Danish         428
Kannada        369
Greek          365
Name: Language, dtype: int64

## Preprocessing

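Remove digits, brackets, punctuation and other special characters from every sentence. For languages written in a non-Latin script (Russian, Malayalam, Hindi, Kannada, Tamil, Arabic), also remove any embedded English letters. Finally, lowercase the text.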

In [None]:
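# Non-Latin-script languages from which stray Latin letters are removed.
non_latin_langs = ['Russian', 'Malayalam', 'Hindi', 'Kannada', 'Tamil', 'Arabic']

def preprocess_text(x, y):
  # x: raw sentence, y: its language label
  if y in non_latin_langs:
    x = re.sub(r'[a-zA-Z]+', '', x)
  # Strip digits, newlines and common punctuation.
  x = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', '', x)
  # Strip square brackets (escaped so '[' and ']' are each matched on their own).
  x = re.sub(r'[\[\]]', '', x)
  return x.lower()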
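To see what these substitutions do on concrete input, the sketch below applies the same kind of regular expressions to two made-up strings. The helper name `clean`, the reduced label set `NON_LATIN`, and both sample strings are illustrative assumptions and are not part of the notebook.

```python
import re

# Illustrative subset of labels whose text uses a non-Latin script (assumption).
NON_LATIN = {'Hindi', 'Russian'}

def clean(text, language):
    """Hypothetical cleaner mirroring the preprocessing steps described above."""
    if language in NON_LATIN:
        # Drop embedded Latin-script fragments such as e-mail addresses.
        text = re.sub(r'[a-zA-Z]+', '', text)
    # Drop digits, newlines and common punctuation.
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', '', text)
    # Drop square brackets; each bracket is escaped inside the character class.
    text = re.sub(r'[\[\]]', '', text)
    return text.lower()

english = clean('[1] The word "Nature", in 2020!', 'English')
hindi = clean('abc@mail.com भाषा 1984', 'Hindi')
print(repr(english))
print(repr(hindi))
```

Note that the period in the second sample survives, because `.` is not in the punctuation character class. Leftover whitespace is also kept rather than collapsed.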

In [None]:
x = df.apply(lambda row: preprocess_text(row.Text, row.Language), axis=1)
y = df['Language']

In [None]:
x

0         nature in the broadest sense is the natural p...
1        nature can refer to the phenomena of the physi...
2        the study of nature is a large if not the only...
3        although humans are part of nature human activ...
4         the word nature is borrowed from the old fren...
                               ...                        
11050     अक्टूबर  को काला दिवस कहा जाता है। इस दिन तत्...
11051    गुंडे को देख सत्तर और अस्सी के दशक का सिनेमा य...
11052    .. ग्लैमर इंडस्ट्री में आर. बाल्की को बिग बी क...
11053    निर्माता सुनीता गोवारीकर अजय बिजली संजीव के. ब...
11054    फोर्स  उन अंडरकवर एजेंट्स को समर्पित है जो समय...
Length: 11055, dtype: object

In [None]:
y

0        English
1        English
2        English
3        English
4        English
          ...   
11050      Hindi
11051      Hindi
11052      Hindi
11053      Hindi
11054      Hindi
Name: Language, Length: 11055, dtype: object

## Model

Two steps in a pipeline -
1. TfidfVectorizer() - TF-IDF features over character n-grams of length 1 to 3 (analyzer='char', ngram_range=(1,3))
2. LogisticRegression() - the classifier trained on those features
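The vectorizer in this notebook is configured with `analyzer='char'` and `ngram_range=(1, 3)`, so its features are character sequences of length 1, 2 and 3 rather than whole words. The following pure-Python sketch enumerates such n-grams for a short string. The function name `char_ngrams` is an assumption for illustration, and the sketch leaves out any lowercasing or whitespace normalisation that scikit-learn applies before extraction.

```python
def char_ngrams(text, n_min=1, n_max=3):
    """Return every character n-gram of length n_min..n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the string.
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

print(char_ngrams('the'))
# ['t', 'h', 'e', 'th', 'he', 'the']
```

Short character sequences like these tend to be distinctive per script and per language, which is one plausible reason a single word can often be classified on its own.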

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 10, test_size=0.25)

In [None]:
vect = TfidfVectorizer(ngram_range=(1,3), analyzer='char', encoding= 'utf-8')
lg = LogisticRegression()

In [None]:
pipe = Pipeline([('idf_vector', vect),('logreg', lg)])

In [None]:
pipe.fit(x_train,y_train)

Pipeline(memory=None,
         steps=[('idf_vector',
                 TfidfVectorizer(analyzer='char', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 3), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('logreg',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept

In [None]:
y_pred = pipe.predict(x_test)
acc = accuracy_score(y_test,y_pred)
print("Accuracy = ",acc*100,"%" )

Accuracy =  98.84225759768451 %
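The reported percentage can be translated back into counts. The sketch below assumes the 11055-row dataset and `test_size=0.25` used above, and relies on `train_test_split` rounding the test-set size up. The variable names are illustrative.

```python
import math

n_samples = 11055               # rows in the dataset
test_size = 0.25                # fraction passed to train_test_split
accuracy = 0.9884225759768451   # reported accuracy, as a fraction

# The test-set size is the ceiling of n_samples * test_size.
n_test = math.ceil(n_samples * test_size)
n_correct = round(accuracy * n_test)
n_wrong = n_test - n_correct

print(n_test, n_correct, n_wrong)
# 2764 2732 32
```

Under these assumptions, the model misclassified 32 of 2764 held-out sentences.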


## Prediction

In [None]:
def pred(sentence):
  # Classify each whitespace-separated word on its own and print a language
  # whenever it differs from the language of the previous word.
  print('Languages Detected in the Sentence - \n')
  previous = None
  for word in sentence.split():
    lang = pipe.predict([word])[0]
    if lang != previous:
      print(lang)
    previous = lang
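The function above reports a language only when it differs from the previous word's language. In other words, it collapses runs of identical consecutive labels. That step can be sketched on its own with `itertools.groupby`. Here `script_of` is a toy stand-in for `pipe.predict` that reads the script from the Unicode name of a word's first character. Both helper names are assumptions made for illustration, and the sketch does not use the trained model.

```python
from itertools import groupby
import unicodedata

def script_of(word):
    """Toy stand-in for the classifier: script of the word's first character."""
    return unicodedata.name(word[0]).split()[0]

def collapse_runs(labels):
    """Keep one label per run of identical consecutive labels."""
    return [label for label, _ in groupby(labels)]

words = 'भाषा भाषा ПРОВЕРКА ലാംഗ്വേജ് ലാംഗ്വേജ് VÉRIFICATION'.split()
labels = [script_of(w) for w in words]
print(collapse_runs(labels))
```

Because only consecutive duplicates are merged, a language that reappears later in the sentence is reported again, which matches the behaviour of `pred`.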

In [None]:
sentence = 'भाषा भाषा ПРОВЕРКА ലാംഗ്വേജ് ലാംഗ്വേജ് ലാംഗ്വേജ് ലാംഗ്വേജ് ലാംഗ്വേജ് ലാംഗ്വേജ് ലാംഗ്വേജ് VÉRIFICATION'
pred(sentence)

Languages Detected in the Sentence - 

Hindi
Russian
Malayalam
French


In [None]:
sentence = 'ಕನ್ನಡ ലാംഗ്വേജ് को काला कहा जाता है മലയാളം மொழி' 
pred(sentence)

Languages Detected in the Sentence - 

Kannada
Malayalam
Hindi
Malayalam
Tamil
