Required Libraries

In [2]:
import string
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns

Reading the dataset

In [3]:
df = pd.read_csv('Language Detection.csv')
df.head()

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English


The dataset is not clean, so we need to clean it first.

We start by removing the punctuation.

In [4]:
# string.punctuation is a constant containing all ASCII punctuation characters
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [5]:
# "my name is Supratim Paul".replace("Supratim", "0")

We will create a function to remove the punctuation from our dataset and convert all uppercase letters to lowercase.

In [6]:
def remove_pun(text):
    # Strip every ASCII punctuation character, then lowercase the result
    for pun in string.punctuation:
        text = text.replace(pun, "")
    text = text.lower()
    return text


As we can see, the function works as intended: it removes the punctuation and converts the text to lowercase.

In [7]:
remove_pun('"Nature" can refer to the phenomena of the physical world, and also to life in general. ! ok: i will come back"')

'nature can refer to the phenomena of the physical world and also to life in general  ok i will come back'
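
The same cleaning step can also be written with str.translate, which deletes all the characters in one pass instead of looping over string.punctuation. This is only a sketch of an alternative, not the method used in this notebook, and the name remove_pun_translate is made up for illustration. Note that string.punctuation covers ASCII punctuation only, so marks such as the Arabic comma are left in place by either version.

```python
import string

# Translation table that maps every ASCII punctuation character to None (delete).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_pun_translate(text):
    """Hypothetical alternative to remove_pun: drop ASCII punctuation, then lowercase."""
    return text.translate(_PUNCT_TABLE).lower()

sample = '"Nature" can refer to the phenomena of the physical world, and also to life in general. ! ok: i will come back"'
print(remove_pun_translate(sample))

# Non-ASCII punctuation (here the Arabic comma) is NOT removed.
print(remove_pun_translate('مرحبا، Hello!'))
```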

Now we can apply this function to our dataset.

In [8]:
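# apply() returns a new Series; without assigning it back, df is unchanged (see output below).
# To keep the cleaned text: df['Text'] = df['Text'].apply(remove_pun)
df['Text'].apply(remove_pun)
df.head()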

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English


We will split our dataset into training and test sets.

In [9]:
from sklearn.model_selection import train_test_split

Before splitting, we separate the dataset into features and labels:

X = all the text

Y = languages

In [10]:
X = df.iloc[:, 0]
Y = df.iloc[:, 1]

In [11]:
X

0         Nature, in the broadest sense, is the natural...
1        "Nature" can refer to the phenomena of the phy...
2        The study of nature is a large, if not the onl...
3        Although humans are part of nature, human acti...
4        [1] The word nature is borrowed from the Old F...
                               ...                        
10332    ನಿಮ್ಮ ತಪ್ಪು ಏನು ಬಂದಿದೆಯೆಂದರೆ ಆ ದಿನದಿಂದ ನಿಮಗೆ ಒ...
10333    ನಾರ್ಸಿಸಾ ತಾನು ಮೊದಲಿಗೆ ಹೆಣಗಾಡುತ್ತಿದ್ದ ಮಾರ್ಗಗಳನ್...
10334    ಹೇಗೆ ' ನಾರ್ಸಿಸಿಸಮ್ ಈಗ ಮರಿಯನ್ ಅವರಿಗೆ ಸಂಭವಿಸಿದ ಎ...
10335    ಅವಳು ಈಗ ಹೆಚ್ಚು ಚಿನ್ನದ ಬ್ರೆಡ್ ಬಯಸುವುದಿಲ್ಲ ಎಂದು ...
10336    ಟೆರ್ರಿ ನೀವು ನಿಜವಾಗಿಯೂ ಆ ದೇವದೂತನಂತೆ ಸ್ವಲ್ಪ ಕಾಣು...
Name: Text, Length: 10337, dtype: object

In [12]:
Y

0        English
1        English
2        English
3        English
4        English
          ...   
10332    Kannada
10333    Kannada
10334    Kannada
10335    Kannada
10336    Kannada
Name: Language, Length: 10337, dtype: object

The function train_test_split() returns four values, so we create four variables: the train and test portions of X and Y.

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
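
Because the split is random, each run of the cell above gives a different partition and a slightly different accuracy. As a hedged sketch (not what this notebook does), train_test_split accepts random_state to make the split repeatable and stratify to keep the label proportions the same in both parts. The toy texts and labels below are invented for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data (invented): 20 texts, two languages with 10 samples each.
texts = [f"sample {i}" for i in range(20)]
labels = ["English"] * 10 + ["Danish"] * 10

def split(seed):
    """Repeatable 80/20 split that preserves the label proportions."""
    return train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels
    )

X_tr, X_te, y_tr, y_te = split(42)
print(len(X_tr), len(X_te))   # 16 training rows, 4 test rows
print(sorted(y_te))           # two test rows per language
```

Calling split(42) again returns exactly the same partition, which makes results comparable between runs.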


In [14]:
X_train

8106                                                şimdi.
2881     [215] Em 2005 a Agence France-Presse citou Dan...
9287     ربما كنت تتحدث على الهاتف وكان شخص ما يتحدث مع...
10160    ಈ ವೀಡಿಯೊದಲ್ಲಿ ನಾನು ನಿಮಗೆ ಸಂಭಾಷಣೆಯಲ್ಲಿ ಬಳಸಬಹುದಾ...
6764     Jeg forvirrede dig, så lad mig give dig nogle ...
                               ...                        
981      Algorithmic bias is a potential result from da...
2220     எனவே நீங்கள் ஒருவரைச் சந்திக்கிறீர்கள், அது எப...
3547     Sa politique de licence libre a obligé le mond...
5064     [171]​ Algunas muestras son las siguientes: Lo...
8824                                              åsikter.
Name: Text, Length: 8269, dtype: object

-->  The selection here is random, as we can see from the shuffled index values above.

-->  But there is a problem: our model cannot take raw strings as input.

-->  So we will use VECTORIZATION (converting the strings into numerical features).

TF-IDF (Term Frequency-Inverse Document Frequency)

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
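
To make the formula concrete, here is a small worked example in plain Python. It uses the textbook definition above together with the usual term frequency (count of the term divided by the number of terms in the document). The corpus and the helper names tf, idf and tf_idf are invented for illustration. scikit-learn's TfidfVectorizer differs slightly by default: it smooths the IDF as ln((1 + n) / (1 + df)) + 1 and L2-normalizes each row, so its numbers will not match this sketch exactly.

```python
import math
from collections import Counter

# Toy corpus (invented); each document is split into word terms.
docs = ["the cat sat", "the dog sat", "a bird flew"]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """Textbook IDF: log_e(total documents / documents containing `term`).

    Assumes `term` occurs in at least one document of `corpus`.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

print(idf("the", tokenized))    # "the" is in 2 of 3 documents: ln(3/2)
print(idf("bird", tokenized))   # "bird" is in 1 of 3 documents: ln(3)
print(tf_idf("bird", tokenized[2], tokenized))  # (1/3) * ln(3)
```

A term that appears in many documents gets a low IDF, so it contributes less to the feature vector than a rare, more distinctive term.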

In [15]:
from sklearn import feature_extraction

In [16]:
vec = feature_extraction.text.TfidfVectorizer(ngram_range=(1,2), analyzer='char')
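
With analyzer='char' and ngram_range=(1,2), the vectorizer builds its features from single characters and pairs of adjacent characters rather than from words. The helper below is a plain-Python sketch of that idea; char_ngrams is a made-up name, and unlike the real vectorizer it does not lowercase the text or normalize whitespace.

```python
def char_ngrams(text, min_n=1, max_n=2):
    """Return all character n-grams of length min_n..max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

print(char_ngrams("hej"))  # ['h', 'e', 'j', 'he', 'ej']
```

Character n-grams suit language detection because letter combinations differ between languages even when the texts share no whole words.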

In [17]:
from sklearn import pipeline
from sklearn import linear_model

We will use Logistic Regression

In [18]:
model_pipe = pipeline.Pipeline([('vec', vec), ('clf', linear_model.LogisticRegression())])

In [19]:
model_pipe.fit(X_train, Y_train)

These are the languages we have

In [20]:
model_pipe.classes_

array(['Arabic', 'Danish', 'Dutch', 'English', 'French', 'German',
       'Greek', 'Hindi', 'Italian', 'Kannada', 'Malayalam', 'Portugeese',
       'Russian', 'Spanish', 'Sweedish', 'Tamil', 'Turkish'], dtype=object)

In [21]:
predict_val = model_pipe.predict(X_test)

In [22]:
from sklearn import metrics

Accuracy on the test set, as a percentage

In [23]:
metrics.accuracy_score(Y_test, predict_val)*100

98.45261121856866
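
Accuracy is simply the fraction of test rows whose predicted label equals the true label. The sketch below computes it by hand on invented labels; the function name accuracy is ours. The last lines are an inference rather than something shown in the notebook: 10337 rows minus 8269 training rows leaves 2068 test rows, and 2036 correct predictions out of 2068 would give the score above.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Invented example: 3 of 4 predictions are right.
y_true = ["English", "Danish", "Danish", "Arabic"]
y_pred = ["English", "Danish", "Sweedish", "Arabic"]
print(accuracy(y_true, y_pred) * 100)  # 75.0

# Inferred check: 2036 correct out of 2068 test rows.
print(round(2036 / 2068 * 100, 4))  # 98.4526
```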

In [24]:
metrics.confusion_matrix(Y_test, predict_val)

array([[112,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,  80,   0,   0,   0,   0,   0,   0,   1,   0,   0,   0,   0,
          1,   2,   0,   0],
       [  0,   0,  92,   1,   0,   2,   0,   0,   0,   0,   0,   1,   0,
          0,   0,   0,   0],
       [  0,   0,   0, 291,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   0,   0, 206,   1,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   2,   2,   0,  99,   0,   0,   0,   0,   0,   0,   0,
          1,   2,   0,   0],
       [  0,   0,   0,   0,   0,   0,  55,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,  15,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   0,   0,   2,   0,   0,   0, 133,   0,   0,   0,   0,
          2,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,  67,   0,   0,   0,
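
The printed matrix is cut off in this export, but it can still be read: scikit-learn puts the true labels on the rows and the predicted labels on the columns, in sorted label order (the same order as model_pipe.classes_, assuming every language occurs in the test set). The first row, for example, says that all 112 Arabic test sentences were predicted as Arabic. The sketch below shows how per-class recall and precision follow from such a matrix; the 3-class matrix is invented and is not taken from the output above.

```python
# Hypothetical 3-class confusion matrix (values invented for illustration).
# Rows are true labels, columns are predicted labels.
labels = ["Danish", "Dutch", "German"]
cm = [
    [80, 0, 2],
    [0, 92, 3],
    [1, 2, 99],
]

def recall(cm, i):
    """Of all rows that truly belong to class i, the share predicted as i."""
    return cm[i][i] / sum(cm[i])

def precision(cm, j):
    """Of all rows predicted as class j, the share that truly belong to j."""
    return cm[j][j] / sum(row[j] for row in cm)

for k, name in enumerate(labels):
    print(name, round(recall(cm, k), 3), round(precision(cm, k), 3))
```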
         

In [25]:
model_pipe.predict(['اسمي'])

array(['Arabic'], dtype=object)

In [26]:
import pickle

In [27]:
# Use a context manager so the file is closed even if dump() raises
with open('model.pckl', 'wb') as new_file:
    pickle.dump(model_pipe, new_file)
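
To use the saved model later, it can be read back with pickle.load. The sketch below shows the round trip with a stand-in dictionary instead of the fitted pipeline so that it runs on its own; the temporary path and the variable names are assumptions. Two cautions apply: only unpickle files you trust, since loading a pickle can execute arbitrary code, and load a scikit-learn model with the same library version that saved it.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline so the sketch runs without scikit-learn.
stand_in_model = {"classes": ["Arabic", "Danish"], "note": "placeholder object"}

# Write to a temporary directory rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "model.pckl")

with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Later, possibly in another script: open in binary read mode and load.
with open(path, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model == stand_in_model)  # True
```

With the real file, loaded_model would be the pipeline itself, so loaded_model.predict([...]) would work just like model_pipe.predict above.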