# Language classifier - Data Analysis

In [48]:
# imports
import pandas as pd
import numpy as np
import plotly.express as px


In [49]:
# load the dataset
dataset = pd.read_csv("../model_code/data/datasets/LanguageDetection.csv", encoding="utf-8")
# print some examples
dataset.sample(5)

Unnamed: 0,Text,Language
3470,"Selon Larry Sanger, Nupedia a échoué à cause d...",French
1472,വിക്കിപീഡിയയുടെ വെബ് വിലാസത്തിന്റെ ആദ്യതാളിൽ വ...,Malayalam
4183,"désolé, je ne peux pas pour le moment.",French
4735,wacht een seconde.,Dutch
10190,ನನಗೆ ತಿಳಿದಿಲ್ಲ ಎಂದು ನಾನು ಹೆದರುತ್ತೇನೆ.,Kannada


In [50]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10337 entries, 0 to 10336
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Text      10337 non-null  object
 1   Language  10337 non-null  object
dtypes: object(2)
memory usage: 161.6+ KB


## Understand language distribution
Plot language distribution over the data

In [51]:
# count the examples per language (the target variable)
language_count = dataset["Language"].value_counts().to_dict()
# print the counts
print("Examples per language:\n", language_count)


Examples per language:
 {'English': 1385, 'French': 1014, 'Spanish': 819, 'Portugeese': 739, 'Italian': 698, 'Russian': 692, 'Sweedish': 676, 'Malayalam': 594, 'Dutch': 546, 'Arabic': 536, 'Turkish': 474, 'German': 470, 'Tamil': 469, 'Danish': 428, 'Kannada': 369, 'Greek': 365, 'Hindi': 63}


In [52]:
# X values
x = list(language_count.keys())
# y values
y = list(language_count.values())
# plot distribution
px.histogram(x=x, y=y).update_layout(xaxis_title="Languages", yaxis_title="Examples")

## Focus on the Italian/non-Italian problem

In [53]:
# create the binary target: 1 for Italian, 0 for every other language
dataset["target"] = (dataset["Language"] == "Italian").astype(int)
# count the examples per class
target_var = dict(dataset["target"].value_counts())

In [54]:
# plot 
# X values
x = list(target_var.keys())
x = ["not-italian" if example == 0 else "italian" for example in x]
# y values
y = list(target_var.values())
# plot distribution
px.histogram(x=x, y=y).update_layout(xaxis_title="Italian/Non-Italian", yaxis_title="Examples")

## Discussion about the distribution
The dataset is unbalanced, which is common in this kind of task. We can assume that the distribution reflects the real world, so I don't want to perturb the dataset too much: the data distribution carries "prior" knowledge that I want to preserve (the probability of Italian is related to the number of Italian examples collected in a given period). This assumption holds if we trust the data collection process and the data itself. It does not hold if the distribution is not a real-world one (for example, if we started collecting data before Italian was supported).

In this case the data distribution is too unbalanced, so I will apply a mix of undersampling and oversampling.
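To make the imbalance concrete, the sketch below computes the Italian prior and the non-Italian/Italian ratio from the counts printed earlier (698 Italian examples out of 10337), and again for the class sizes requested in the resampling cell that follows (2500 non-Italian, 1200 Italian). The helper `class_stats` is a hypothetical name introduced only for this illustration.

```python
# Illustrative only: `class_stats` is a hypothetical helper, not notebook code.

def class_stats(n_italian: int, n_other: int) -> tuple:
    """Return (Italian prior, non-Italian / Italian ratio)."""
    total = n_italian + n_other
    return n_italian / total, n_other / n_italian

# Original dataset: 698 Italian examples out of 10337 rows.
prior_before, ratio_before = class_stats(698, 10337 - 698)
# Class sizes requested from the resampling step: 1200 Italian, 2500 non-Italian.
prior_after, ratio_after = class_stats(1200, 2500)

print(f"before: prior={prior_before:.3f}, ratio={ratio_before:.1f}")
print(f"after:  prior={prior_after:.3f}, ratio={ratio_after:.1f}")
```

Under these numbers the Italian prior moves from roughly 7% to roughly 32%, which is the trade-off discussed above: the classes become easier to learn from, but part of the original prior is given up.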

In [55]:
dataset.sample(10)

Unnamed: 0,Text,Language,target
6472,"вы качаетесь, если кто-то работает не очень хо...",Russian,0
5824,Θα το πάρω.,Greek,0
6150,"По словам Сэнгера, Википедия напоминает «психб...",Russian,0
81,This exposure alternates as the Earth revolves...,English,0
5707,έχω ραντεβού με τους φίλους μου σήμερα το απόγ...,Greek,0
6972,"siger ""jeg ved det ikke"".",Danish,0
6061,Википедия началась как дополнительный проект д...,Russian,0
5992,Μήπως υποθέτω ότι δεν θα ήθελε άλλο χρυσό ψωμί...,Greek,0
9261,من أين حصلت على الأخبار مثل كيف لك.,Arabic,0
10010,ಅಥವಾ ಮಂದ ಬುದ್ಧಿವಂತಿಕೆಯು ಈಗ ನೀವು ಜ್ಯಾಮಿತಿ ಚೂಪಾದ...,Kannada,0


In [60]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline

# undersample the non-Italian class (0) down to 2500 examples
rus = RandomUnderSampler(random_state=12, sampling_strategy={0:2500})
# oversample the Italian class (1) up to 1200 examples
ros = RandomOverSampler(random_state=12, sampling_strategy={1:1200})

# apply undersampling first, then oversampling
pipeline = make_pipeline(rus, ros)

X_resampled, y_resampled = pipeline.fit_resample(dataset.drop(["target"], axis=1), dataset["target"])
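As a rough illustration of what random undersampling followed by random oversampling does, here is a standard-library sketch on toy labels. It mirrors the general idea only and is not imblearn's implementation; `random_resample` and its parameters are hypothetical names.

```python
import random
from collections import Counter

def random_resample(labels, targets, seed=12):
    """Return row indices so that each class in `targets` has the requested size.

    Classes larger than their target are undersampled without replacement;
    smaller classes keep all their rows and are padded with random duplicates.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    selected = []
    for label, indices in by_class.items():
        n = targets.get(label, len(indices))
        if n <= len(indices):
            # undersampling: pick n distinct rows
            selected.extend(rng.sample(indices, n))
        else:
            # oversampling: keep every row, then add duplicates
            extra = rng.choices(indices, k=n - len(indices))
            selected.extend(indices + extra)
    return selected

# Toy labels: 20 non-Italian (0) rows and 4 Italian (1) rows.
labels = [0] * 20 + [1] * 4
picked = random_resample(labels, targets={0: 10, 1: 6})
print(Counter(labels[i] for i in picked))
```

Because the oversampled rows are exact duplicates, the balanced dataset contains repeated Italian sentences. That is expected with random oversampling.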

In [62]:
balanced_dataset = X_resampled
balanced_dataset["target"] = y_resampled
balanced_dataset

Unnamed: 0,Text,Language,target
0,i wasn't listening another phrase that i use a...,English,0
1,هذا لطف كبير منك.,Arabic,0
2,colour versus color)[132] or points of view.,English,0
3,"Toutefois, les études ne concluent pas sur le ...",French,0
4,ಆ ಕ್ಷಣದಲ್ಲಿ ನಾರ್ಸಿಸಸ್ ತನ್ನ ನಿದ್ರೆಯಲ್ಲಿ ನಗಲು ಪ್...,Kannada,0
...,...,...,...
3695,fallo un po 'ringraziandoti in anticipo.,Italian,1
3696,con la sua unica figlia narcisa marion era est...,Italian,1
3697,I collaboratori di Wikipedia mantengono in più...,Italian,1
3698,"Progetti come Wikipedia, Susning.nu e la Encic...",Italian,1


In [65]:
# recompute the Italian/non-Italian target on the balanced dataset
balanced_dataset["target"] = (balanced_dataset["Language"] == "Italian").astype(int)
# count the examples per class
target_var = dict(balanced_dataset["target"].value_counts())

In [66]:
# plot 
# X values
x = list(target_var.keys())
x = ["not-italian" if example == 0 else "italian" for example in x]
# y values
y = list(target_var.values())
# plot distribution
px.histogram(x=x, y=y).update_layout(xaxis_title="Italian/Non-Italian", yaxis_title="Examples")

In [68]:
# save dataset
balanced_dataset.to_csv("../model_code/data/datasets/balanced.csv", index=False)

## Number of different tokens

In [69]:
def create_vocabulary(dataset: pd.DataFrame, text_column: str = "Text") -> dict:
    """Count how often each lower-cased, whitespace-separated token occurs."""
    words = dict()

    for sentence in dataset[text_column].to_numpy():
        # simple tokenization: lower-case the sentence and split on whitespace
        for word in sentence.lower().split():
            if word not in words:  # first occurrence of this token
                words[word] = 1
            else:
                words[word] += 1
    return words

In [70]:
# token -> number of occurrences
vocab = create_vocabulary(dataset=balanced_dataset)

In [71]:
# top K words by occurrences
K = 10
top_k = {k:v for k,v in sorted(vocab.items(), key=lambda item: item[1], reverse=True)[:K]}
top_k

{'di': 1142,
 'de': 926,
 'a': 754,
 'la': 729,
 'in': 701,
 'e': 695,
 'che': 545,
 'un': 533,
 'en': 444,
 'è': 415}
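The same counting can be expressed with `collections.Counter`, which also gives the number of distinct tokens via `len`. The sketch below is an alternative formulation under the same whitespace tokenization, not the notebook's own code. `count_tokens` is a hypothetical name and the sentences are made up for the example.

```python
from collections import Counter

def count_tokens(sentences):
    """Count lower-cased, whitespace-separated tokens across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts

# Made-up sentences standing in for the "Text" column.
sentences = ["La casa di Marco", "la macchina di Anna", "Il libro di Luca"]
counts = count_tokens(sentences)

print(counts.most_common(2))  # the two most frequent tokens with their counts
print(len(counts))            # number of distinct tokens
```

Note that splitting on whitespace leaves punctuation attached to words, so the distinct-token count is an upper estimate of the real vocabulary size.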