## Data Analysis and Cleaning
This notebook documents the exploratory analysis and cleaning performed on the sentiment dataset.

In [1]:
# read the given data
import pandas as pd 

train = pd.read_csv("./sentiment-analysis-test/data/train.csv")
test = pd.read_csv("./sentiment-analysis-test/data/test.csv")

In [2]:
# Class distribution in training set
train['sentiment'].value_counts()

neutral       8823
positive      8318
negative      7858
unassigned       1
Name: sentiment, dtype: int64

In [3]:
# drop row with unassigned label 
unassigned = train.loc[train['sentiment'] == "unassigned"]
print(unassigned)
train.drop(unassigned.index, inplace=True)

                content   sentiment
5657  ويلييي شو بتصرع💙💙  unassigned


In [4]:
train.loc[train['sentiment'] == "neutral"].sample(5)

| | content | sentiment |
|---|---|---|
| 18438 | Johnson benadrukt weer dat VK op 31 oktober EU... | neutral |
| 6707 | 22 минуты назад | neutral |
| 10222 | Basarnas perkuat sinergitas dengan potensi SAR... | neutral |
| 19702 | Burkina Faso’s long night of horror in killing... | neutral |
| 10238 | L'UDC Jura présente une liste au Conseil des E... | neutral |


In [5]:
train.loc[train['sentiment'] == "positive"].sample(5)

| | content | sentiment |
|---|---|---|
| 959 | 大家早上好 | positive |
| 19290 | Obal vypadá pěkně a drží. | positive |
| 22538 | ห้องพักวิวสวยมากมองเห็นวิวทะเลจากห้องพัก ห้องส... | positive |
| 9804 | Geste solitaire merci Gp Renault maroc | positive |
| 18998 | A wonderfully written book that really brings ... | positive |


In [6]:
train.loc[train['sentiment'] == "negative"].sample(5)

| | content | sentiment |
|---|---|---|
| 14612 | Что сказать за качество + дисплей не показывае... | negative |
| 24517 | Rất mau khô lại, lúc đầu mở nắp ra thử lên tay... | negative |
| 24061 | Pas conforme à l'image couleur pas identique e... | negative |
| 7805 | Não serve para quem tem filhos acima de 13 ano... | negative |
| 6306 | フリード乗り換えました。CROSSTARのガソリン車です。 [エクステリア] 外観カッコイイ... | negative |


In [7]:
# Label-to-id mapping, following the ids of the model I'm using
sentiment2id = {"negative": 0, "neutral": 1, "positive": 2}
# Number of whitespace-separated tokens per text (a rough proxy for length in words)
train['num_words'] = train['content'].apply(lambda x: len(x.split()))
test['num_words'] = test['content'].apply(lambda x: len(x.split()))
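
The `sentiment2id` mapping is defined above but not applied in this notebook. As a minimal sketch, assuming it is later used to encode the `sentiment` column, the plain-Python helper below (the name `encode_labels` is hypothetical) converts labels to ids and fails loudly on anything outside the mapping, such as the `unassigned` label dropped earlier.

```python
# Same mapping as in the notebook cell above.
sentiment2id = {"negative": 0, "neutral": 1, "positive": 2}

def encode_labels(labels):
    """Map sentiment strings to integer ids; raise on unexpected labels."""
    # Collect every label that has no id, so the error message lists them all.
    unexpected = sorted(set(labels) - set(sentiment2id))
    if unexpected:
        raise ValueError(f"Unexpected sentiment labels: {unexpected}")
    return [sentiment2id[label] for label in labels]

print(encode_labels(["neutral", "positive", "negative"]))  # [1, 2, 0]
```

On the DataFrame itself, `train['sentiment'].map(sentiment2id)` would do the same encoding, but it silently yields NaN for labels missing from the mapping, which is why the sketch raises instead.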

In [8]:
train.describe()

| | num_words |
|---|---|
| count | 24999.0 |
| mean | 18.658666 |
| std | 45.455195 |
| min | 1.0 |
| 25% | 5.0 |
| 50% | 10.0 |
| 75% | 18.0 |
| max | 2994.0 |


In [9]:
test.describe()

| | num_words |
|---|---|
| count | 2500.0 |
| mean | 18.1208 |
| std | 33.95469 |
| min | 1.0 |
| 25% | 5.0 |
| 50% | 9.0 |
| 75% | 17.0 |
| max | 674.0 |


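The training set is roughly balanced across the three sentiment classes (8823 neutral, 8318 positive, 7858 negative). Texts in both train and test average about 18 words, with a median of 9 to 10 and a long tail (up to 2994 words in train). These word counts are biased, because splitting on whitespace does not work for languages written without spaces between words, such as Chinese, Japanese, or Thai.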
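
To illustrate the bias, the sketch below compares whitespace token counts with character counts on two of the positive samples shown above. The helper name `whitespace_word_count` is illustrative, and character length is offered only as one possible complementary measure, not as something this notebook uses.

```python
def whitespace_word_count(text):
    """Count tokens the same way the notebook does: split on whitespace."""
    return len(text.split())

# Two texts taken from the positive samples printed earlier.
samples = {
    "Czech": "Obal vypadá pěkně a drží.",
    "Chinese": "大家早上好",
}

for language, text in samples.items():
    # Print the whitespace token count next to the raw character count.
    print(language, whitespace_word_count(text), len(text))
```

The Chinese greeting is counted as a single "word" even though it has five characters, while the Czech sentence is split into five tokens, so `num_words` understates the length of texts in scripts written without spaces.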

Next, we inspect which languages appear in the train and test sets.

In [11]:
from langdetect import detect, LangDetectException

def detect_language(s):
    """Return the language code detected by langdetect, or "unknown" if detection fails."""
    try:
        return detect(s)
    except LangDetectException:
        return "unknown"

train['lang'] = train['content'].apply(detect_language)
test['lang'] = test['content'].apply(detect_language)

In [12]:
train['lang'].value_counts()

en         4048
ru         3265
id         3196
ar         1801
fr         1498
es         1323
pt         1211
ko         1076
zh-cn       842
ja          742
it          587
de          548
th          460
tr          297
tl          258
so          246
vi          208
pl          196
et          183
nl          178
uk          156
ro          156
sv          153
ca          144
gu          143
bg          134
hi          131
bn          130
fi          113
da          107
fa          106
zh-tw       101
no           95
cs           93
ta           89
af           86
he           85
mk           85
sk           82
ml           81
hu           81
lt           72
ur           70
el           58
unknown      57
hr           55
cy           53
sw           43
sl           36
lv           25
sq            9
kn            3
mr            2
te            2
Name: lang, dtype: int64

In [13]:
test['lang'].value_counts()

en         393
id         313
ru         291
ar         181
fr         157
pt         135
es         122
ko         114
zh-cn       89
it          74
ja          69
de          62
th          45
tl          32
tr          30
nl          24
pl          23
vi          22
et          18
ro          18
so          16
fi          16
sv          15
hi          15
uk          14
bg          14
ca          14
af          12
bn          12
ml          12
no          12
zh-tw       11
he          11
hu          11
fa          11
lt          10
cs          10
da           9
ta           8
unknown      8
ur           7
gu           7
el           6
mk           6
cy           5
sw           5
sk           5
hr           2
sl           2
lv           1
mr           1
Name: lang, dtype: int64

In [17]:
# Number of distinct language labels across train and test (includes "unknown")
len(set(train['lang'].tolist() + test['lang'].tolist()))

54

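Language detection assigns 54 distinct labels across the two sets: 53 language codes (with zh-cn and zh-tw counted separately) plus the `unknown` fallback. English, Russian, Indonesian, Arabic, French, Spanish, Portuguese, and Korean are the most represented.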
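
As a quick check that train and test have a similar language mix, the sketch below turns the counts reported above for the eight most frequent languages into shares of each set. The counts and set sizes (24999 and 2500) are copied from the outputs in this notebook; the variable names are illustrative.

```python
# Counts copied from the value_counts() outputs above (top eight languages).
train_counts = {"en": 4048, "ru": 3265, "id": 3196, "ar": 1801,
                "fr": 1498, "es": 1323, "pt": 1211, "ko": 1076}
test_counts = {"en": 393, "ru": 291, "id": 313, "ar": 181,
               "fr": 157, "es": 122, "pt": 135, "ko": 114}

# Set sizes taken from the describe() outputs above.
train_size, test_size = 24999, 2500

train_share = {lang: n / train_size for lang, n in train_counts.items()}
test_share = {lang: n / test_size for lang, n in test_counts.items()}

for lang in train_counts:
    print(f"{lang}: train {train_share[lang]:.1%}, test {test_share[lang]:.1%}")
```

With the DataFrames at hand, `train['lang'].value_counts(normalize=True)` gives the same shares directly.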

In [42]:
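from sklearn.model_selection import train_test_split

# Keep only the text and label columns
final_train = train[["content", "sentiment"]]
# Hold out 20% for validation, stratified so class proportions are preserved
train_split, val = train_test_split(final_train, test_size=0.2, stratify=final_train['sentiment'])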

In [44]:
train_split['sentiment'].value_counts()

neutral     7058
positive    6654
negative    6287
Name: sentiment, dtype: int64

In [46]:
val["sentiment"].value_counts()

neutral     1765
positive    1664
negative    1571
Name: sentiment, dtype: int64
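
Stratifying on `sentiment` should keep class proportions nearly identical in the two splits. The sketch below checks this from the counts printed above; the helper name `proportions` is illustrative.

```python
# Counts copied from the two value_counts() outputs above.
train_split_counts = {"neutral": 7058, "positive": 6654, "negative": 6287}
val_counts = {"neutral": 1765, "positive": 1664, "negative": 1571}

def proportions(counts):
    """Convert class counts into class shares that sum to 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_props = proportions(train_split_counts)
val_props = proportions(val_counts)

for label in train_props:
    print(f"{label}: train {train_props[label]:.3f}, val {val_props[label]:.3f}")
```

Note that `train_test_split` is called here without `random_state`, so the rows that land in each split change between runs even though the class counts stay the same.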

In [47]:
train_split.to_csv("sentiment-analysis-test/data/train_clean.csv", index=False)
val.to_csv("sentiment-analysis-test/data/val_clean.csv", index=False)