# Data Preprocessing
This notebook describes the preprocessing steps and a short exploratory analysis of the training data.

In [1]:
import pandas as pd
import networkx as nx
import ast


Loading the data:

In [2]:
data = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
data.head()

Unnamed: 0,language,sentence,n,edgelist,root
0,Japanese,2,23,"[(6, 4), (2, 6), (2, 23), (20, 2), (15, 20), (...",10
1,Japanese,5,18,"[(8, 9), (14, 8), (4, 14), (5, 4), (1, 2), (6,...",10
2,Japanese,8,33,"[(2, 10), (2, 14), (4, 2), (16, 4), (6, 16), (...",3
3,Japanese,11,30,"[(30, 1), (14, 24), (21, 14), (3, 21), (7, 3),...",30
4,Japanese,12,19,"[(19, 13), (16, 19), (2, 16), (4, 10), (4, 15)...",11


Next we check the type and the number of null values of each column of the training data. No null values are found:

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10500 entries, 0 to 10499
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   language  10500 non-null  object
 1   sentence  10500 non-null  int64 
 2   n         10500 non-null  int64 
 3   edgelist  10500 non-null  object
 4   root      10500 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 410.3+ KB


Now we can check some basic statistics of the numeric columns:
- On average, each sentence has $18.8$ words (column `n`), with a minimum of 3 and a maximum of 70.
- The root can be any vertex from the 1st to the 68th, depending on the length of the sentence.
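
Since the root is a vertex of the sentence, it should never exceed the sentence length. A minimal sketch of that check follows, assuming vertices are numbered from 1 to `n` (the statistics above suggest this, but the notebook does not state it). The `toy` frame and the `invalid_roots` helper are hypothetical stand-ins; in the notebook the function would be called on `data`.

```python
import pandas as pd

# Hypothetical stand-in rows, not taken from the real dataset.
toy = pd.DataFrame({
    'n':    [23, 18, 5],
    'root': [10, 10, 7],   # the last row is deliberately invalid (7 > 5)
})

def invalid_roots(df):
    """Return the rows whose root is not a vertex id in the range 1..n.

    Assumes vertices are numbered 1..n, which is an assumption about the data.
    """
    ok = (df['root'] >= 1) & (df['root'] <= df['n'])
    return df[~ok]

bad = invalid_roots(toy)
print(len(bad))  # number of rows with an out-of-range root
```

An empty result on the real data would confirm that every `root` label points at a vertex that exists in its sentence.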

In [4]:
data.describe()

Unnamed: 0,sentence,n,root
count,10500.0,10500.0,10500.0
mean,494.778,18.807524,9.844476
std,290.256632,8.190593,7.20774
min,2.0,3.0,1.0
25%,233.5,13.0,4.0
50%,483.0,18.0,8.0
75%,742.25,23.0,14.0
max,995.0,70.0,68.0


The training data contains 21 languages with 500 sentences each, for a total of $10,500$ rows:

In [5]:
print('Total languages: ' + str(data[['language']].value_counts().shape[0]))
data[['language']].value_counts()

Total languages: 21


language  
Arabic        500
Chinese       500
Czech         500
English       500
Finnish       500
French        500
Galician      500
German        500
Hindi         500
Icelandic     500
Indonesian    500
Italian       500
Japanese      500
Korean        500
Polish        500
Portuguese    500
Russian       500
Spanish       500
Swedish       500
Thai          500
Turkish       500
Name: count, dtype: int64

## Dataset preparation
Now the dataset is transformed into a training set suitable for binary classification models, using centralities as vertex features.

We define a function that converts an edge list into a networkx graph and computes four centralities for every vertex (degree, harmonic, betweenness and PageRank):

In [6]:
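def centralities(edgelist):
    """Map each vertex to (degree, harmonic, betweenness, pagerank)."""
    # Build an undirected graph from the list of (u, v) edges
    T = nx.from_edgelist(edgelist)
    dc = nx.degree_centrality(T)
    cc = nx.harmonic_centrality(T)
    bc = nx.betweenness_centrality(T)
    pc = nx.pagerank(T)
    return {v: (dc[v], cc[v], bc[v], pc[v]) for v in T}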
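
To make the features concrete, here is a small worked example on a three-vertex path 1 - 2 - 3, where the values can be checked by hand. It is only a sketch: `toy_centralities` is a hypothetical reduced variant of the function above that leaves out PageRank (whose values are not as easy to verify by hand), and the `raw` string is made up to mimic the format of the `edgelist` column.

```python
import ast
import networkx as nx

def toy_centralities(edgelist):
    """Hypothetical reduced variant: (degree, harmonic, betweenness) per vertex."""
    T = nx.from_edgelist(edgelist)
    dc = nx.degree_centrality(T)       # degree / (number of vertices - 1)
    hc = nx.harmonic_centrality(T)     # sum of 1 / distance to every other vertex
    bc = nx.betweenness_centrality(T)  # normalized share of shortest paths through v
    return {v: (dc[v], hc[v], bc[v]) for v in T}

# Made-up edge list in the same string format as the `edgelist` column
raw = "[(1, 2), (2, 3)]"
edges = ast.literal_eval(raw)  # string -> list of (u, v) tuples

scores = toy_centralities(edges)
for v in sorted(scores):
    print(v, scores[v])
```

By hand, the middle vertex 2 has degree centrality 2/2 = 1, harmonic centrality 1 + 1 = 2 and betweenness 1, because the only shortest path between 1 and 3 passes through it. The end vertices have degree centrality 1/2, harmonic centrality 1 + 1/2 = 1.5 and betweenness 0.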

Now we iterate over each row of the training data, transform its edge list into a graph and compute the centralities of every vertex. The new binary dataset has one row per vertex, with the columns `language`, `sentence`, `vertex`, `n`, the four centrality scores, and `is_root` (1 if the vertex is the root of its sentence, 0 otherwise):

In [7]:
columns = ['language', 'sentence', 'vertex', 'n',
           'degree', 'harmonic',
           'betweeness', 'pagerank',
           'is_root']

binary_data_list = []

# Parse the edgelist column from its string form into a list of (u, v) tuples
data['edgelist'] = data['edgelist'].apply(ast.literal_eval)

for row in data.itertuples(index=False):
    target = row.root
    
    for node, (degree, harmonic, betweeness,
              pagerank) in centralities(row.edgelist).items():
        new_row = {'language': row.language,
                   'sentence': row.sentence,
                   'vertex': node,
                   'n': row.n,
                   'degree': degree,
                   'harmonic': harmonic,
                   'betweeness': betweeness,
                   'pagerank': pagerank,
                   'is_root': 1 if node == target else 0}

        binary_data_list.append(new_row)

expanded_data = pd.DataFrame(binary_data_list, columns=columns)
expanded_data.head(5)



Unnamed: 0,language,sentence,vertex,n,degree,harmonic,betweeness,pagerank,is_root
0,Japanese,2,6,23,0.090909,5.823846,0.090909,0.048565,0
1,Japanese,2,4,23,0.045455,4.561122,0.0,0.027162,0
2,Japanese,2,2,23,0.136364,6.991703,0.255411,0.066901,0
3,Japanese,2,23,23,0.045455,5.157179,0.0,0.025477,0
4,Japanese,2,20,23,0.090909,7.146825,0.311688,0.042552,0
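
The expanded table can be sanity-checked per sentence: each sentence should contribute exactly one row with `is_root == 1`, and, assuming `n` equals the number of vertices in the edge list, exactly `n` rows. The sketch below illustrates the idea on a made-up frame; `toy` and `summarize` are hypothetical names, and in the notebook the function would be applied to `expanded_data`. Sentences are grouped by both `language` and `sentence`, because the `sentence` ids appear to repeat across languages.

```python
import pandas as pd

# Hypothetical stand-in for expanded_data: two sentences with the same id
# in two different (made-up) languages.
toy = pd.DataFrame({
    'language': ['A', 'A', 'A', 'B', 'B'],
    'sentence': [2, 2, 2, 2, 2],
    'vertex':   [1, 2, 3, 1, 2],
    'n':        [3, 3, 3, 2, 2],
    'is_root':  [0, 1, 0, 1, 0],
})

def summarize(expanded):
    """Per sentence: number of rows, declared length n, and number of roots."""
    groups = expanded.groupby(['language', 'sentence'])
    return pd.DataFrame({
        'rows':  groups.size(),
        'n':     groups['n'].first(),
        'roots': groups['is_root'].sum(),
    })

summary = summarize(toy)
# True only if every sentence has n rows and exactly one root
consistent = ((summary['rows'] == summary['n']) & (summary['roots'] == 1)).all()
print(summary)
print(bool(consistent))
```

A `False` result on the real data would point to either a parsing problem in the edge lists or a `root` label that does not appear among the vertices.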
