# Language Detection with Machine Learning


# Advanced Classification South African Language Identification Hack 2022
EDSA 2201 & 2207 classification hackathon

©  Explore Data Science Academy

---

### Honour Code

I, Harmony Odumuko, confirm, by submitting this document, that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

---


South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable
tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and
political life of South African society. With such a multilingual population, it is only natural that our systems and
devices should also communicate in multiple languages.

The model in this notebook takes text written in any of South Africa's 11 official languages and identifies which
language the text is in.

Let’s start the task of language detection with machine learning by importing the necessary Python libraries and the dataset:


In [53]:
import re
import string

import pandas as pd
from nltk import tokenize
from sklearn import feature_selection, pipeline
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

Load the training and test data:

In [54]:
df_train = pd.read_csv('data/train_set.csv')

df_test = pd.read_csv('data/test_set.csv')

In [55]:
df_train.head()

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


Now let’s have a look at all the languages present in this dataset:

In [56]:
df_train['lang_id'].unique()

array(['xho', 'eng', 'nso', 'ven', 'tsn', 'nbl', 'zul', 'ssw', 'tso',
       'sot', 'afr'], dtype=object)

In [57]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


In [58]:
df_train.isnull().sum()

lang_id    0
text       0
dtype: int64

In [59]:
df_train['lang_id'].value_counts()

xho    3000
eng    3000
nso    3000
ven    3000
tsn    3000
nbl    3000
zul    3000
ssw    3000
tso    3000
sot    3000
afr    3000
Name: lang_id, dtype: int64

This dataset contains 11 languages with 3,000 text samples per language. The classes are perfectly balanced and there are no missing values, so the data needs only light cleaning before it can be used to train a machine learning model.

In [60]:
df_train[df_train.lang_id == 'xho'].sample(2)

Unnamed: 0,lang_id,text
21198,xho,ixabiso lixhomekeka kuhlobo lwemibuzo oza nayo...
26227,xho,oku kungasentla kuyasebenza nakogunyaziswe ngu...


Data Cleaning

In [61]:
def clean_data(text):
    # change the case of all words in the text to lowercase 
    text = text.lower()
    
    # let's remove punctuation
    text = "".join([x for x in text if x not in string.punctuation])
    
    # remove numbers
    text = re.sub(r'\d+', '', text)
    
    return text
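
Because `clean_data` deletes punctuation outright, hyphenated words are fused together (for example `kwazulu-natal` becomes `kwazulunatal`). The sketch below shows one possible alternative that replaces punctuation with a space instead. The name `clean_data_spaced` and the sample sentence are hypothetical and not part of the original notebook; whether this variant helps the classifier has not been tested here.

```python
import re
import string

# Hypothetical variant of clean_data: every punctuation character is mapped
# to a space instead of being deleted, so hyphenated words are split.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_data_spaced(text):
    # change the case of all words in the text to lowercase
    text = text.lower()
    # replace punctuation with spaces
    text = text.translate(_PUNCT_TO_SPACE)
    # remove numbers
    text = re.sub(r"\d+", "", text)
    # collapse runs of whitespace left behind by the previous steps
    return re.sub(r"\s+", " ", text).strip()


sample = "The Province of KwaZulu-Natal, 2022!"  # made-up sample text
cleaned = clean_data_spaced(sample)
print(cleaned)
```

With character n-grams the difference between the two cleaning strategies is likely to be small, so this is a design choice to experiment with rather than a required fix.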


Cleaning the dataset

In [62]:
# Clean the train dataset
df_train['text'] = df_train['text'].apply(clean_data)

# Clean the test dataset
df_test['text'] = df_test['text'].apply(clean_data)

df_train.head()

Unnamed: 0,lang_id,text
0,xho,umgaqosiseko wenza amalungiselelo kumaziko axh...
1,xho,idha iya kuba nobulumko bokubeka umsebenzi nap...
2,eng,the province of kwazulunatal department of tra...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


Now let's build a bank of the words (or characters) that appear in the text:

In [63]:
def create_bank_set(dataset, word=False, category=""):
    '''
        Creates a bank of all the words or characters in the text feature

        Input:
            dataset - The dataset to extract words or characters from
            word - Specifies the level of extraction; True for words, False for characters
            category - Filters the dataset by the specified lang_id; empty means no filter

        Output:
            pandas DataFrame of all the characters or words of the specified category
    '''
    if category:
        df = dataset[dataset['lang_id'] == category]['text']
    else:
        df = dataset['text']

    if word:
        # one entry per space-separated token
        bank = []
        for row in df:
            bank.extend(row.split(" "))
    else:
        # one entry per character
        bank = [char for row in df for char in row]

    return pd.DataFrame(bank)
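
As a rough alternative to building a full DataFrame of tokens, the same word frequencies can be gathered with `collections.Counter`. The helper name `word_counts` and the toy texts below are assumptions made for illustration. Note that `row.split()` without an argument skips empty tokens, whereas `row.split(" ")` in `create_bank_set` produces an empty string wherever two spaces are adjacent.

```python
from collections import Counter


def word_counts(texts):
    """Count whitespace-separated tokens across an iterable of strings."""
    counts = Counter()
    for row in texts:
        # split() with no argument ignores repeated spaces
        counts.update(row.split())
    return counts


# Toy input standing in for df_train['text'] (hypothetical data)
toy_texts = ["ukuba okanye ukuba", "kunye  ukuba"]
counts = word_counts(toy_texts)
print(counts.most_common(1))
```

On the real data this could be called as `word_counts(df_train['text'])`, assuming `df_train` has been loaded and cleaned as above.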

In [64]:
all_words = create_bank_set(df_train[df_train['lang_id'] != 'xho'], word=True)
all_words.value_counts()

ya                28239
a                 21160
le                20808
ka                18121
go                17102
                  ...  
lyons                 1
lynvisvoorraad        1
lynvisspesies         1
lynitems              1
magaweni              1
Length: 124783, dtype: int64

In [65]:
create_bank_set(df_train, category='xho', word=True).value_counts()

ukuba             1636
okanye            1306
kufuneka           665
kunye              498
kwaye              390
                  ... 
lamasebe             1
lamaqumrhu           1
lamapolisa           1
alunakusekelwa       1
ã·                   1
Length: 25353, dtype: int64

In [66]:
df_train

Unnamed: 0,lang_id,text
0,xho,umgaqosiseko wenza amalungiselelo kumaziko axh...
1,xho,idha iya kuba nobulumko bokubeka umsebenzi nap...
2,eng,the province of kwazulunatal department of tra...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...
...,...,...
32995,tsn,popo ya dipolateforomo tse ke go tlisa boetele...
32996,sot,modise mosadi na o ntse o sa utlwe hore thaban...
32997,eng,closing date for the submission of completed t...
32998,xho,nawuphina umntu ofunyenwe enetyala phantsi kwa...


Language Detection Model

Now let’s split the labelled data in half, using one half for training and holding out the other half for evaluation:

In [67]:
x_train, x_test, y_train, y_test = train_test_split(df_train['text'], df_train['lang_id'], random_state=42, test_size=0.5)
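
The split above is not stratified, which is why the per-class support in the reports further down ranges from 1461 to 1538 rather than being exactly 1500. If equal class proportions in both halves are wanted, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on a made-up two-language frame; the `toy` DataFrame is an assumption for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_train (hypothetical data)
toy = pd.DataFrame({
    "lang_id": ["xho"] * 4 + ["eng"] * 4,
    "text": [f"sample {i}" for i in range(8)],
})

# stratify keeps the class proportions identical in both halves
x_tr, x_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["lang_id"],
    test_size=0.5,
    random_state=42,
    stratify=toy["lang_id"],
)
print(y_te.value_counts())
```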

In [68]:
vectorizer = TfidfVectorizer(ngram_range=(1,3), analyzer='char', min_df=3, max_df = 0.7)
model = pipeline.Pipeline([
    ('vectorizer', vectorizer),
    ('clf', LogisticRegression())
])
model.fit(x_train,y_train)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(analyzer='char', max_df=0.7, min_df=3,
                                 ngram_range=(1, 3))),
                ('clf', LogisticRegression())])
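
To make the vectorizer settings more concrete, the sketch below inspects which features `analyzer='char'` with `ngram_range=(1, 3)` extracts from a short string. It uses `build_analyzer()`, which returns the callable the vectorizer applies to each document; the input string `"uku"` is just an illustrative example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same analyzer settings as the pipeline above, without the min_df / max_df
# filters (those only apply when fitting on a corpus).
vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
analyze = vec.build_analyzer()

# Every run of 1, 2 and 3 consecutive characters becomes a feature
ngrams = analyze("uku")
print(ngrams)
```

Character n-grams of this kind capture typical letter sequences, prefixes and suffixes, which is one plausible reason they work well for telling closely related languages apart.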

Evaluating the logistic regression classifier on the held-out half

In [69]:
y_pred = model.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00      1493
         eng       1.00      1.00      1.00      1491
         nbl       0.99      0.99      0.99      1504
         nso       1.00      0.99      1.00      1520
         sot       1.00      1.00      1.00      1502
         ssw       1.00      1.00      1.00      1538
         tsn       0.99      1.00      1.00      1467
         tso       1.00      1.00      1.00      1483
         ven       1.00      1.00      1.00      1535
         xho       0.99      0.99      0.99      1506
         zul       0.99      0.99      0.99      1461

    accuracy                           1.00     16500
   macro avg       1.00      1.00      1.00     16500
weighted avg       1.00      1.00      1.00     16500
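
The only scores below 1.00 in this report belong to `nbl`, `xho`, `zul`, `nso` and `tsn`. A confusion matrix would show which pairs of languages are actually being mixed up. The sketch below uses made-up labels purely to show the mechanics; on the real data one would pass `y_test` and `y_pred` instead.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; not results from the notebook
labels = ["nbl", "xho", "zul"]
y_true_toy = ["nbl", "nbl", "xho", "xho", "zul", "zul"]
y_pred_toy = ["nbl", "zul", "xho", "xho", "zul", "xho"]

# Rows are the true language, columns the predicted language
cm = pd.DataFrame(
    confusion_matrix(y_true_toy, y_pred_toy, labels=labels),
    index=labels,
    columns=labels,
)
print(cm)
```

Off-diagonal cells count the misclassifications, so a large value at row `nbl`, column `zul` would mean isiNdebele text is often predicted as isiZulu.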



Ridge Classifier

In [70]:
vectorizer = TfidfVectorizer(ngram_range=(3,6), analyzer='char', min_df=3, max_df = 0.5)
model = pipeline.Pipeline([
    ('vectorizer', vectorizer),
    ('clf', RidgeClassifier())
])
model.fit(x_train,y_train)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(analyzer='char', max_df=0.5, min_df=3,
                                 ngram_range=(3, 6))),
                ('clf', RidgeClassifier())])

Evaluating the ridge classifier on the held-out half

In [71]:
y_pred = model.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00      1493
         eng       1.00      1.00      1.00      1491
         nbl       1.00      1.00      1.00      1504
         nso       1.00      1.00      1.00      1520
         sot       1.00      1.00      1.00      1502
         ssw       1.00      1.00      1.00      1538
         tsn       1.00      1.00      1.00      1467
         tso       1.00      1.00      1.00      1483
         ven       1.00      1.00      1.00      1535
         xho       1.00      1.00      1.00      1506
         zul       1.00      1.00      1.00      1461

    accuracy                           1.00     16500
   macro avg       1.00      1.00      1.00     16500
weighted avg       1.00      1.00      1.00     16500



SGDClassifier

In [72]:
vectorizer = TfidfVectorizer(ngram_range=(3,6), analyzer='char', min_df=3, max_df = 0.5)
model = pipeline.Pipeline([
    ('vectorizer', vectorizer),
    ('clf', SGDClassifier())
])
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00      1493
         eng       1.00      1.00      1.00      1491
         nbl       1.00      1.00      1.00      1504
         nso       1.00      1.00      1.00      1520
         sot       1.00      1.00      1.00      1502
         ssw       1.00      1.00      1.00      1538
         tsn       1.00      1.00      1.00      1467
         tso       1.00      1.00      1.00      1483
         ven       1.00      1.00      1.00      1535
         xho       1.00      1.00      1.00      1506
         zul       1.00      1.00      1.00      1461

    accuracy                           1.00     16500
   macro avg       1.00      1.00      1.00     16500
weighted avg       1.00      1.00      1.00     16500
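
Since all three reports round to 1.00, a single split rounded to two decimals does not separate the classifiers. One possible way to compare them more finely is k-fold cross-validation with more printed digits. The sketch below runs on a tiny made-up English/Afrikaans corpus so that it is self-contained; the texts, the `candidates` dictionary and the choice of `cv=3` are all assumptions, and the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for df_train (hypothetical data)
texts = [
    "the province of the north", "closing date for the submission",
    "the department of transport", "this is the final report",
    "the minister said that", "all members of the committee",
    "die provinsie van die noorde", "sluitingsdatum vir die indiening",
    "die departement van vervoer", "dit is die finale verslag",
    "die minister het gesê", "alle lede van die komitee",
]
labels = ["eng"] * 6 + ["afr"] * 6

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "ridge": RidgeClassifier(),
    "sgd": SGDClassifier(random_state=42),
}

results = {}
for name, clf in candidates.items():
    # a fresh pipeline per classifier, so the vectorizer is refit in every fold
    candidate = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3)), clf
    )
    scores = cross_val_score(candidate, texts, labels, cv=3, scoring="f1_macro")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

On the real data the same loop could be run with `df_train['text']` and `df_train['lang_id']`, at the cost of fitting each pipeline several times.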



Final model

In [73]:
vectorizer = TfidfVectorizer(ngram_range=(3,5), analyzer='char', min_df=7, max_df = 0.7)
final_model = pipeline.Pipeline([
    ('vectorizer', vectorizer),
    ('clf', RidgeClassifier())
])
final_model.fit(x_train,y_train)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(analyzer='char', max_df=0.7, min_df=7,
                                 ngram_range=(3, 5))),
                ('clf', RidgeClassifier())])
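
The final model above is fitted on `x_train`, which is only half of the labelled data. A common option before producing a submission is to refit the chosen pipeline on every labelled row. The sketch below illustrates that idea on a small made-up frame; `toy_train` and `full_model` are hypothetical names, and `min_df` is left at its default because the notebook's `min_df=7` would discard every feature in such a tiny corpus.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline

# Toy stand-in for the full labelled df_train (hypothetical data)
toy_train = pd.DataFrame({
    "lang_id": ["eng", "eng", "eng", "afr", "afr", "afr"],
    "text": [
        "the province of the north", "the department of transport",
        "closing date for the submission", "die provinsie van die noorde",
        "die departement van vervoer", "sluitingsdatum vir die indiening",
    ],
})

full_model = Pipeline([
    ("vectorizer", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ("clf", RidgeClassifier()),
])

# Fit on all labelled rows instead of only the training half
full_model.fit(toy_train["text"], toy_train["lang_id"])
pred = full_model.predict(["the minister of transport"])
print(pred)
```

The trade-off is that no held-out data remains to evaluate the refitted model, so the scores from the earlier split serve only as an estimate.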

Predicting the language of each text in the provided test set

In [74]:
predictions = final_model.predict(df_test['text'])

Creating a pandas DataFrame with `lang_id` as the only column and the test set's `index` column as its index, then writing it to a CSV file for submission to Kaggle

In [75]:
submission = pd.DataFrame({'lang_id':predictions}, index=df_test['index'])

In [76]:
submission.to_csv('submission.csv')
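
Before uploading, it can be worth reading the file back and checking its shape. The sketch below reproduces the submission step on made-up data and writes to an in-memory buffer instead of `submission.csv`, so it runs on its own. The expected column names `index` and `lang_id` are inferred from the code above, not from the competition rules, so treat them as an assumption.

```python
import io

import pandas as pd

# Toy stand-ins for df_test and predictions (hypothetical data)
toy_test = pd.DataFrame({"index": [1, 2, 3], "text": ["a", "b", "c"]})
toy_predictions = ["tsn", "nbl", "ven"]

toy_submission = pd.DataFrame({"lang_id": toy_predictions}, index=toy_test["index"])

# Write to an in-memory buffer and read it back, as one would with the CSV file
buffer = io.StringIO()
toy_submission.to_csv(buffer)
buffer.seek(0)
check = pd.read_csv(buffer)

print(check.columns.tolist())
print(len(check), "rows,", check["lang_id"].isna().sum(), "missing predictions")
```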