# Language classifier - Data Analysis

In [2]:
# imports
import pandas as pd
import numpy as np
import plotly.express as px


In [3]:
# load the dataset
dataset = pd.read_csv("../model_code/data/datasets/LanguageDetection.csv", encoding="utf-8")
# print some examples
dataset.sample(5)

Unnamed: 0,Text,Language
1495,ഇംഗ്ലീഷ് വിക്കിപീഡിയയിൽ പലപ്പോഴും ഭൂരിപക്ഷം ആള...,Malayalam
10206,ನಾನು ನಿಮ್ಮೊಂದಿಗೆ 100% ಒಪ್ಪುತ್ತೇನೆ.,Kannada
9908,Marian erzählte Mellie und Terry alles über Na...,German
6198,Было разослано около 40 000 дисков.,Russian
7466,ci sono troppe cose nel mio piatto che sto aff...,Italian


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10337 entries, 0 to 10336
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Text      10337 non-null  object
 1   Language  10337 non-null  object
dtypes: object(2)
memory usage: 161.6+ KB


## Understand language distribution
Plot language distribution over the data

In [5]:
# count the examples per language (the target variable)
language_count = dataset["Language"].value_counts().to_dict()
# print
print("Examples per language:\n", language_count)


Examples per language:
 {'English': 1385, 'French': 1014, 'Spanish': 819, 'Portugeese': 739, 'Italian': 698, 'Russian': 692, 'Sweedish': 676, 'Malayalam': 594, 'Dutch': 546, 'Arabic': 536, 'Turkish': 474, 'German': 470, 'Tamil': 469, 'Danish': 428, 'Kannada': 369, 'Greek': 365, 'Hindi': 63}


In [6]:
# X values
x = list(language_count.keys())
# y values
y = list(language_count.values())
# plot distribution
px.histogram(x=x, y=y).update_layout(xaxis_title="Languages", yaxis_title="Examples")
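
To put numbers on the imbalance, the counts printed above can be turned into per-language shares. This is only a minimal sketch: `language_count` is copied from the output above, while `shares` and `imbalance_ratio` are names introduced here for illustration.

```python
# Counts copied from the notebook output above (labels spelled as in the dataset).
language_count = {
    "English": 1385, "French": 1014, "Spanish": 819, "Portugeese": 739,
    "Italian": 698, "Russian": 692, "Sweedish": 676, "Malayalam": 594,
    "Dutch": 546, "Arabic": 536, "Turkish": 474, "German": 470,
    "Tamil": 469, "Danish": 428, "Kannada": 369, "Greek": 365, "Hindi": 63,
}

total = sum(language_count.values())
# Share of the dataset taken by each language.
shares = {lang: n / total for lang, n in language_count.items()}
# Ratio between the largest and the smallest class.
imbalance_ratio = max(language_count.values()) / min(language_count.values())

for lang, share in shares.items():
    print(f"{lang:<12}{share:6.2%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

Italian accounts for roughly 6.75% of the examples, which is the prior that the binary problem below inherits.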

## Focus on the Italian/non-Italian problem

In [7]:
# create the binary target variable: 1 = Italian, 0 = non-Italian
dataset["target"] = (dataset["Language"] == "Italian").astype(int)
# count the examples per class
target_var = dataset["target"].value_counts().to_dict()

In [8]:
# plot
# X values: target 1 is Italian, target 0 is non-Italian
x = ["Italian" if label == 1 else "Non-Italian" for label in target_var.keys()]
# y values
y = list(target_var.values())
# plot distribution
px.histogram(x=x, y=y).update_layout(xaxis_title="Italian/Non-Italian", yaxis_title="Examples")

## Discussion about the distribution
The dataset is unbalanced, which is common in this kind of task. We can assume that the distribution reflects the real world, so I don't want to perturb the dataset much: the class frequencies carry "prior" knowledge (the probability of Italian is related to the number of Italian examples collected in a given period). This assumption holds if we trust the data collection process and the data itself. It does not hold if the distribution is not representative of the real world (for example, if we started collecting data before Italian was supported).

In my project I want to assume that we can trust the data, so I will not perform undersampling.
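
If the class prior is to be preserved, any later train/test split should keep roughly the same Italian share in both parts. The sketch below illustrates one way to do that with a stratified split in plain Python. `stratified_split`, the 20% test fraction and the seed are assumptions made for this example, not part of the notebook; the toy labels only mimic the 698 Italian examples out of 10337.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with the same Italian share as the notebook's target column.
labels = [1] * 698 + [0] * 9639
train_idx, test_idx = stratified_split(labels)

prior = sum(labels) / len(labels)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"Italian prior: {prior:.4f}, share in test split: {test_share:.4f}")
```

On the real data the same idea would apply to `dataset["target"]` instead of the toy list.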

In [9]:
dataset.sample(10)

Unnamed: 0,Text,Language,target
934,"Artificial neural networks (ANNs), or connecti...",English,0
3047,você arrasa se alguém não está trabalhando mui...,Portugeese,0
2884,A ordem judicial surgiu de uma ação movida pel...,Portugeese,0
12,"It is often taken to mean the ""natural environ...",English,0
6009,Большинство направляемых в Nature статей отсеи...,Russian,0
10305,"ಟಿ ನೀವು ಹೇಗಿದ್ದೀರಿ ನಾನು ತುಂಬಾ ಒಳ್ಳೆಯವನಾಗಿದ್ದೆ,...",Kannada,0
6926,Jeg fejlede.,Danish,0
6101,"В августе 2007 года сайт, разработанный аспира...",Russian,0
6062,«Нупедия» была основана 9 марта 2000 года как ...,Russian,0
6872,"det betyder ikke noget, om du vil ønske nogen ...",Danish,0


## Number of different tokens

In [10]:
def create_vocabulary(dataset:pd.DataFrame, text_column:str="Text"):
    words = dict()

    for sentence in dataset[text_column].to_numpy():
        # simple tokenization
        # lower case sentence
        for word in sentence.lower().split():
            if word not in words:  # vocabulary
                words[word] = len(words)  # Assign each word with a unique index
    return words

In [11]:
# token to idx
vocab = create_vocabulary(dataset=dataset)

In [12]:
# last K tokens added to the vocabulary (highest indices);
# note: vocab stores insertion indices, not occurrence counts
K = 10
top_k = {k:v for k,v in sorted(vocab.items(), key=lambda item: item[1], reverse=True)[:K]}
top_k

{'ಒಳ್ಳೆಯವರು': 58686,
 'ಅವನಾಗಬಹುದು': 58685,
 'ನೋಡುತ್ತಿದ್ದೇನೆ': 58684,
 'ಕಾಣುತ್ತಿದ್ದೀರಿ': 58683,
 'ದೇವದೂತನಂತೆ': 58682,
 'ಹಿಸಿದ್ದೇನೆ.': 58681,
 'ಬದಲಾಗಿದ್ದಾಳೆ.': 58680,
 'ಸಮಯದಿಂದ': 58679,
 'ಹೇಳಿದೆ': 58678,
 'ಎಲ್ಲವನ್ನೂ': 58677}

The simple whitespace tokenization doesn't seem to work well for the Kannada examples (Kannada is an Indian language), so let's remove the Kannada examples and try again.

In [13]:
# rebuild the vocabulary without the Kannada examples
non_kannada_vocab = create_vocabulary(dataset=dataset[dataset["Language"] != "Kannada"])
K = 10
non_kannada_top_k = {k:v for k,v in sorted(non_kannada_vocab.items(), key=lambda item: item[1], reverse=True)[:K]}
non_kannada_top_k

{'seid': 56980,
 'sein?': 56979,
 'ich,': 56978,
 'oder?': 56977,
 'verändert.': 56976,
 'passiert': 56975,
 'beiden': 56974,
 'erzählt': 56973,
 'narzissmus': 56972,
 "wie'": 56971}

In [14]:
# Kannada impact
len(dataset[dataset["Language"] == "Kannada"]) / len(dataset)

0.03569701073812518

Now the results seem better, so I will delete the Kannada language from the dataset (-3.57%). In the future we can think about adding support for the Kannada language.
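
Even without Kannada, the whitespace tokenizer leaves punctuation attached to tokens, as in `sein?` and `ich,` in the listing above. One possible refinement is to strip punctuation from the edges of each token. This is a sketch under the assumption that ASCII punctuation is enough: `tokenize` is a hypothetical helper, and non-ASCII marks such as `«` or `»` would need extra handling.

```python
import string


def tokenize(sentence):
    """Lowercase, split on whitespace, strip ASCII punctuation from token edges."""
    tokens = []
    for raw in sentence.lower().split():
        token = raw.strip(string.punctuation)
        if token:  # drop tokens that were punctuation only
            tokens.append(token)
    return tokens


print(tokenize("Ich, oder? Wie' sein?"))
# → ['ich', 'oder', 'wie', 'sein']
```

Swapping this into `create_vocabulary` in place of `sentence.lower().split()` would merge variants like `sein` and `sein?` into one vocabulary entry.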