# PROJECT 2: Investigating the history of number words

### Dependencies:
1. It is assumed that the user has anaconda installed on their system
2. Additional from anaconda, a python module named 'distance' is required to run this notebook. It can be installed using the following pip command: <br>
    pip install distance
   





In [1]:
#Loading Dependencies
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join
import re
import math
import random
from distance import levenshtein as edit_d, hamming, nlevenshtein as nedit_d, jaccard

random.seed(1234)

#Data_Paths:
data_path = 'csv_files/'
metadata_ie = 'Languages_indo_euro.csv'
metadata_aa = 'Languages_afro_asiatic.csv'

### Loading data

Let's store all the language datasets in a dictionary so that we can fetch them later on easily

In [2]:
#Store all languages datasets in a dictionary 
lang_data = {file[18:-4]: pd.read_csv(data_path + file, encoding='utf8') for file in listdir(data_path)}

#No. of languages:
print("Number of languages: %d" % len(lang_data))


Number of languages: 7221


Let's also store the metadata of some Classification glottologs:

In [3]:
#Indo_European metadata
indo_euro = pd.read_csv(metadata_ie, encoding='utf8')
print("Number of indo european languages: %d" % len(indo_euro))

#Afro_asiatic metadata
afro_asia = pd.read_csv(metadata_aa, encoding='utf8')
print("Number of afro asiatic languages: %d" % len(afro_asia))

Number of indo european languages: 414
Number of afro asiatic languages: 358


## Main Approach:

1. **Categorization of Languages by Glottolog Classification**: <br>
We begin by creating two independent datasets, one for Indo-European languages and the other for Afro-Asiatic languages. <br>

2. **Categorization of Languages by Year of Extinction**: <br>
We create three clusters of languages: <br>
2.1 Ancient <br>
2.2 Recently Extinct <br>
2.3 Non-Extinct <br>

3. **We analyze the similarities of 'number words' VS 'other words' for each of these clusters and compare the results on between different glottolog classifications**. <br>


Defining some functions below so that they can be used later efficiently.
Each function's description is described within the function

In [4]:
def ancient_langs(data):
    '''
    Input: asjp languages metadata 
    Output: a list of ancient language names
    '''
    #store names of languages that are extinct
    names = data[data.long_extinct == True].name.values
    
    #replace spaces by '_' and changes everything to uppercase
    names = [re.sub(r' ', r'_', name).upper() for name in names] 
    
    print("%d ancient languages stored" % len(names))
    print(names)
    print("\n")
    
    return [(x, 'ancient') for x in names]
    
def recently_extinct(data):
    '''
    Input: asjp language metadata 
    Output: a list of names of recently extinct languages
    '''
    #store names of languages that are extinct
    names = data[data.recently_extinct == True].name.values
    
    #replace spaces by '_' and changes everything to uppercase
    names = [re.sub(r' ', r'_', name).upper() for name in names] 
    
    print("%d recently extinct languages stored" % len(names))
    print(names)
    print("\n")
    
    return [(x, 'recently_extinct') for x in names]

def new_languages(data):
    '''
    Input: asjp language metadata 
    Output: a short subset of list of names of recent languages (non-extinct)
    '''
    random.seed(12)
    
    #store names of languages that are extinct
    names = data[(data.recently_extinct == False) & \
                (data.long_extinct == False)].name.values
    
    #Select random 10 language names (as the whole list is too long)
    random.shuffle(names)
    names = names[:13]
    
    #replace spaces by '_' and changes everything to uppercase
    names = [re.sub(r' ', r'_', name).upper() for name in names] 
    
    random.shuffle(names)
    
    print("%d new languages stored" % len(names))
    print(names)
    print("\n")
    
    return [(x, 'non_extinct') for x in names]

def vocab(language):
    '''
    Input: language name (a string) 
    Output: list of words in that language
    '''
    vocab = set(list(lang_data[language].Parameter_name.values))
    
    print("%d words of %s stored" % (len(vocab), language))
    
    return list(vocab)

def similarity_metric(metric, lang1, lang2, word):
    '''
    Input:  1. metric: Similarity function's name  : {'edit_dist', 'jaccard'}
            2. lang1: first language's name
            3. lang2: second language's name
            4. word: word to be compared
            
    Output: Value of similarity metric calculated on the word b/w lang1 and lang2
    '''
    values_lang1 = lang_data[lang1][lang_data[lang1].Parameter_name == word].Value.values
    values_lang2 = lang_data[lang2][lang_data[lang2].Parameter_name == word].Value.values
    #If multiple values, pick the one with smallest length
    index1 = list(values_lang1).index(min(values_lang1, key=len))
    index2 = list(values_lang2).index(min(values_lang2, key=len))
    
    if metric == 'edit_d':
        return edit_d(values_lang1[index1], values_lang2[index2])
    
    elif metric == 'nedit_d':  #Normalized edit distance
        return nedit_d(values_lang1[index1], values_lang2[index2], method=2) #longest alignment
        
    elif metric == 'norm_hamming':
        return hamming(values_lang1[index1], values_lang2[index2], normalized = True)
    
    elif metric == 'jaccard':
        return jaccard(values_lang1[index1], values_lang2[index2])
        
    else:
        return None

## Section 1. Indo-European Languages

We start by analyzing the similarites of words on Indo-European languages.

The names of these languages are stored in lists to be fetched later on <br>


In [5]:
#Store the names of languages in a list 
ancient_l = ancient_langs(indo_euro)
rec_ext_l = recently_extinct(indo_euro)
new_lang_l = new_languages(indo_euro)

#Common words
#words = vocab('ENGLISH')
words = ['one', 'two', 'I', 'you', 'we', 'person', 'fish', 'dog', \
         'louse', 'tree', 'leaf', 'skin', 'blood', 'bone', 'horn', \
        'ear', 'eye', 'nose', 'tooth', 'tongue', 'knee', 'hand', \
        'breast', 'liver', 'drink', 'see', 'hear', 'die', 'come',
        'sun', 'star', 'water', 'stone', 'fire', 'path',\
        'mountain', 'night', 'full', 'new', 'name']

21 ancient languages stored
['HITTITE', 'PALAIC', 'ARMENIAN_CLASSICAL', 'OLD_PRUSSIAN', 'CORNISH', 'GOTHIC', 'OLD_ENGLISH', 'OLD_FRISIAN', 'OLD_HIGH_GERMAN', 'OLD_LOW_FRANCONIAN', 'OLD_NORSE', 'OLD_SAXON', 'GREEK_ANCIENT', 'ILLYRIAN', 'PALI', 'SANSKRIT', 'AVESTAN', 'AVESTAN_2', 'OLD_PERSIAN', 'LATIN', 'OLD_CHURCH_SLAVONIC']


13 recently extinct languages stored
['MANX', 'OLD_IRISH', 'WELSH_ROMANI', 'DALMATIAN', 'EMILIANO_CARPIGIANO', 'EMILIANO_FERRARESE', 'EMILIANO_REGGIANO', 'ROMAGNOL_RAVENNATE', 'COMMON_TOCHARIAN', 'TOCHARIAN_A', 'TOCHARIAN_B', 'BERBICE_DUTCH_CREOLE', 'NEGERHOLLANDS']


13 new languages stored
['KASHMIRI', 'KOSOVO_ARLI_ROMANI', 'PARGAM_NISAR_KHOWAR', 'DOMAAKI', 'SANDNES_NORWEGIAN', 'KASHUBIAN', 'GULLAH', 'ASHRET_PHALURA', 'FRENCH', 'JUDEO_TAT_2', 'ENGLISH', 'PALAS_SHINA', 'CHAMAN_PASHTO']




### 1.1 Metric: EDIT DISTANCE
#### Dataframe to compare EDIT DISTANCES between phonemes of words in different languages

In [6]:
all_langs = new_lang_l + rec_ext_l + ancient_l

#Multi index for pandas dataframe
index_adv = pd.MultiIndex.from_tuples(all_langs, names=['language', 'cluster'])

Calculating edit_distance between each pair of words between different languages

In [7]:
#Create an empty dataframe with columns as words and multi index (language, cluster)
df = pd.DataFrame(columns = words, index = index_adv, dtype=float)

#Evaluate metrics for each pair of language
for (lang, cluster) in all_langs:
    for word in words:
        try:
            df.loc[(lang, cluster),word] = similarity_metric('edit_d', 'ENGLISH', lang, word)
        except:
            pass   #for words not found in a particular language               

#### A look at EDIT_DISTANCE data frame

The NaN values mean that the particular word was not found in that language

In [8]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,one,two,I,you,we,person,fish,dog,louse,tree,...,star,water,stone,fire,path,mountain,night,full,new,name
language,cluster,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
KASHMIRI,non_extinct,3.0,3.0,3.0,2.0,3.0,5.0,3.0,3.0,2.0,3.0,...,3.0,4.0,4.0,3.0,3.0,7.0,3.0,5.0,1.0,2.0
KOSOVO_ARLI_ROMANI,non_extinct,3.0,2.0,2.0,1.0,4.0,6.0,6.0,5.0,3.0,5.0,...,6.0,4.0,4.0,4.0,5.0,6.0,3.0,7.0,3.0,3.0
PARGAM_NISAR_KHOWAR,non_extinct,3.0,1.0,3.0,1.0,3.0,6.0,4.0,4.0,,3.0,...,2.0,5.0,4.0,4.0,2.0,,3.0,,2.0,1.0
DOMAAKI,non_extinct,3.0,2.0,,1.0,3.0,,6.0,6.0,,2.0,...,3.0,5.0,3.0,4.0,,,5.0,,3.0,1.0
SANDNES_NORWEGIAN,non_extinct,2.0,1.0,1.0,,2.0,5.0,2.0,3.0,1.0,1.0,...,4.0,3.0,4.0,4.0,3.0,7.0,2.0,,2.0,3.0
KASHUBIAN,non_extinct,4.0,3.0,2.0,2.0,2.0,,4.0,5.0,,7.0,...,,4.0,4.0,4.0,5.0,6.0,3.0,5.0,3.0,6.0
GULLAH,non_extinct,1.0,0.0,2.0,0.0,0.0,3.0,,1.0,,0.0,...,,3.0,,2.0,2.0,7.0,1.0,0.0,,0.0
ASHRET_PHALURA,non_extinct,3.0,1.0,2.0,1.0,2.0,6.0,5.0,6.0,,3.0,...,3.0,5.0,4.0,4.0,3.0,,3.0,,2.0,2.0
FRENCH,non_extinct,3.0,2.0,2.0,2.0,2.0,6.0,7.0,3.0,3.0,4.0,...,5.0,5.0,5.0,3.0,3.0,5.0,2.0,4.0,2.0,2.0
JUDEO_TAT_2,non_extinct,3.0,2.0,2.0,2.0,3.0,6.0,3.0,2.0,4.0,3.0,...,5.0,5.0,4.0,4.0,2.0,7.0,3.0,3.0,4.0,1.0


#### Mean of Edit Distances for three clusters

In [9]:
#Missing Value imputations by Median
df.fillna(df.median(), inplace=True)

#Mean Grouping by clsuters
df.groupby(level=['cluster']).mean()

Unnamed: 0_level_0,one,two,I,you,we,person,fish,dog,louse,tree,...,star,water,stone,fire,path,mountain,night,full,new,name
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ancient,3.047619,2.857143,2.333333,1.714286,2.333333,5.904762,3.904762,3.619048,3.952381,3.52381,...,2.904762,3.047619,3.285714,2.952381,3.428571,6.380952,3.190476,3.047619,2.904762,2.619048
non_extinct,2.615385,1.692308,1.923077,1.230769,2.307692,5.0,3.846154,3.461538,3.153846,2.846154,...,3.230769,3.923077,3.846154,3.307692,2.769231,6.153846,2.769231,3.230769,2.230769,1.769231
recently_extinct,2.461538,1.923077,2.076923,1.384615,2.846154,5.692308,3.076923,3.076923,4.230769,4.0,...,3.846154,3.538462,4.076923,3.153846,5.076923,5.615385,3.307692,3.230769,1.846154,1.461538


#### 'Number' words Vs 'Other' words

In [10]:
number_words = list(df.columns[:2])
other_words = list(df.columns[2:])

df['number_words'] = df[number_words].mean(axis=1)
df['other_words'] = df[other_words].mean(axis=1)

df[['number_words', 'other_words']].groupby(level=['cluster']).mean()

Unnamed: 0_level_0,number_words,other_words
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
ancient,2.952381,3.635338
non_extinct,2.153846,3.313765
recently_extinct,2.192308,3.682186


### 1.2 METRIC: Normalized Edit DISTANCE

#### Dataframe based on normalized Edit Distance:
We can analyze all the results with respect to normalized edit distance as well

In [11]:
#Create an empty dataframe with columns as words and multi index (language, cluster)
df = pd.DataFrame(columns = words, index = index_adv, dtype=float)

for (lang, cluster) in all_langs:
    for word in words:
        try:
            df.loc[(lang, cluster),word] = similarity_metric('nedit_d', 'ENGLISH', lang, word)
        except:
            pass   #for words not found in a particular language 
              
#Missing Value imputations
df.fillna(df.median(), inplace=True)
df.groupby(level=['cluster']).mean()

Unnamed: 0_level_0,one,two,I,you,we,person,fish,dog,louse,tree,...,star,water,stone,fire,path,mountain,night,full,new,name
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ancient,0.823413,0.801134,0.857143,0.663492,0.768254,0.884921,0.775661,0.897619,0.72619,0.804195,...,0.473621,0.601587,0.719841,0.65873,0.85,0.894558,0.701587,0.614626,0.693651,0.606349
non_extinct,0.830769,0.641026,0.858974,0.615385,0.826923,0.791209,0.837179,0.826923,0.665385,0.699359,...,0.561172,0.784615,0.793956,0.765385,0.788462,0.798535,0.657692,0.753846,0.628205,0.46337
recently_extinct,0.737179,0.75,0.897436,0.608974,0.801282,0.876374,0.862821,0.865385,0.876923,0.82619,...,0.689744,0.697436,0.828205,0.761538,0.929487,0.751374,0.726923,0.858974,0.583333,0.44359


In [12]:
number_words = list(df.columns[:2])
other_words = list(df.columns[2:])

df['number_words'] = df[number_words].mean(axis=1)
df['other_words'] = df[other_words].mean(axis=1)

df[['number_words', 'other_words']].groupby(level=['cluster']).mean()

Unnamed: 0_level_0,number_words,other_words
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
ancient,0.812273,0.749742
non_extinct,0.735897,0.747043
recently_extinct,0.74359,0.808606


### 1.3 METRIC: Normalized Hamming

#### Dataframe based on normalized Hamming:
We can analyze all the results with respect to normalized hamming

In [13]:
#Create an empty dataframe with columns as words and multi index (language, cluster)
df = pd.DataFrame(columns = words, index = index_adv, dtype=float)

for (lang, cluster) in all_langs:
    for word in words:
        try:
            df.loc[(lang, cluster),word] = similarity_metric('norm_hamming', 'ENGLISH', lang, word)
        except:
            pass   #for words not found in a particular language   
            
            
    
#Missing Value imputations
df.fillna(df.median(), inplace=True)
df.groupby(level=['cluster']).mean()

Unnamed: 0_level_0,one,two,I,you,we,person,fish,dog,louse,tree,...,star,water,stone,fire,path,mountain,night,full,new,name
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ancient,0.968254,0.52381,0.952381,0.595238,0.5,0.920635,1.0,0.698413,1.0,0.968254,...,0.952381,0.409524,0.72619,0.714286,0.666667,0.857143,0.797619,0.539683,0.5,0.365079
non_extinct,0.871795,0.5,0.884615,0.615385,0.576923,0.839744,0.923077,0.666667,0.884615,0.74359,...,0.923077,0.384615,0.711538,0.711538,0.717949,0.791209,0.673077,0.589744,0.461538,0.358974
recently_extinct,0.923077,0.615385,0.923077,0.576923,0.538462,0.923077,0.948718,0.692308,1.0,1.0,...,1.0,0.553846,0.730769,0.769231,0.717949,0.857143,0.730769,0.794872,0.5,0.410256


In [14]:
number_words = list(df.columns[:2])
other_words = list(df.columns[2:])

df['number_words'] = df[number_words].mean(axis=1)
df['other_words'] = df[other_words].mean(axis=1)

df[['number_words', 'other_words']].groupby(level=['cluster']).mean()

Unnamed: 0_level_0,number_words,other_words
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
ancient,0.746032,0.713429
non_extinct,0.685897,0.667852
recently_extinct,0.769231,0.738245


### 1.4 METRIC : Jaccard

#### Dataframe based on Jaccard


In [15]:
#Create an empty dataframe with columns as words and multi index (language, cluster)
df = pd.DataFrame(columns = words, index = index_adv, dtype=float)

for (lang, cluster) in all_langs:
    for word in words:
        try:
            df.loc[(lang, cluster),word] = similarity_metric('jaccard', 'ENGLISH', lang, word)
        except:
            pass   #for words not found in a particular language    
                 
#Missing Value imputations
df.fillna(df.median(), inplace=True)
df.groupby(level=['cluster']).mean()

Unnamed: 0_level_0,one,two,I,you,we,person,fish,dog,louse,tree,...,star,water,stone,fire,path,mountain,night,full,new,name
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ancient,0.871032,0.845918,0.873016,0.768254,0.821429,0.886144,0.814683,0.880499,0.732937,0.835809,...,0.43254,0.656047,0.762528,0.719274,0.892744,0.901623,0.716837,0.620975,0.703175,0.586054
non_extinct,0.852564,0.705128,0.871795,0.692308,0.826923,0.782143,0.853114,0.85641,0.696154,0.725641,...,0.553266,0.815476,0.77967,0.78315,0.839103,0.807082,0.700641,0.75641,0.675641,0.567308
recently_extinct,0.798718,0.824359,0.923077,0.683333,0.839744,0.859463,0.874908,0.914652,0.856838,0.864927,...,0.647619,0.711355,0.869231,0.81859,0.95,0.6663,0.779945,0.830037,0.652564,0.557692


In [16]:
number_words = list(df.columns[:2])
other_words = list(df.columns[2:])

df['number_words'] = df[number_words].mean(axis=1)
df['other_words'] = df[other_words].mean(axis=1)

df[['number_words', 'other_words']].groupby(level=['cluster']).mean()

Unnamed: 0_level_0,number_words,other_words
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
ancient,0.858475,0.769448
non_extinct,0.778846,0.770276
recently_extinct,0.811538,0.822428


## Section 2. Afro-Asiatic Languages

Now we apply the same approach to a different classification of languages.
Let's see what conclusions we come to in this case

In [17]:
#Store the names of languages in a list 
ancient_l = ancient_langs(afro_asia)
rec_ext_l = recently_extinct(afro_asia)
new_lang_l = new_languages(afro_asia)


all_langs = new_lang_l + rec_ext_l + ancient_l

#Multi index for pandas dataframe
index_adv = pd.MultiIndex.from_tuples(all_langs, names=['language', 'cluster'])

18 ancient languages stored
['LATE_EGYPTIAN', 'MIDDLE_EGYPTIAN', 'AKKADIAN', 'AKKADIAN_2', 'ARABIC_QURANIC', 'CLASSICAL_ARABIC', 'STANDARD_ARABIC', 'UGARITIC', 'ARAMAIC_ANCIENT', 'ACHAEMENID_ARAMAIC', 'CLASSICAL_MANDAIC', 'BIBLICAL_HEBREW', 'PHOENICIAN', 'SABEAN', 'ETHIOPIC', 'GEEZ', 'GEEZ_2', 'GEEZ_3']


12 recently extinct languages stored
['FOQAHA', 'HOLMA', 'TESHENAWA', 'KWADZA', 'COPTIC', 'CLASSICAL_SYRIAC', 'SYRIAC', 'SYRIAC_2', 'MLAHSO', 'GAFAT', 'MESMES', 'MESMES_2']


13 new languages stored
['BAZZA', 'MWAGHAVUL', 'FALI_GILI', 'BUDUMA', 'TSAMAI', 'BILIN_2', 'BURUNGE', 'SOMRAI', 'DAHALO_2', 'GHYE', 'SHA', 'CUVOK', 'KULERE']




### 2.1 Metric: EDIT DISTANCE
#### Dataframe to compare EDIT DISTANCES between phonemes of words in different languages

In [18]:
#Create an empty dataframe with columns as words and multi index (language, cluster)
df = pd.DataFrame(columns = words, index = index_adv, dtype=float)

#Evaluate metrics for each pair of language
for (lang, cluster) in all_langs:
    for word in words:
        try:
            df.loc[(lang, cluster),word] = similarity_metric('edit_d', 'ENGLISH', lang, word)
        except:
            pass   #for words not found in a particular language     

#Missing Value imputations by Median
df.fillna(df.median(), inplace=True)

#Mean Grouping by clsuters
df.groupby(level=['cluster']).mean()

Unnamed: 0_level_0,one,two,I,you,we,person,fish,dog,louse,tree,...,star,water,stone,fire,path,mountain,night,full,new,name
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ancient,3.666667,4.055556,2.944444,3.333333,3.888889,6.0,3.666667,3.944444,5.111111,4.222222,...,4.611111,4.5,3.888889,3.777778,4.388889,6.222222,4.0,3.833333,4.444444,2.277778
non_extinct,3.923077,3.538462,2.615385,3.0,4.230769,5.923077,5.0,4.307692,5.076923,4.153846,...,6.615385,4.461538,4.615385,4.153846,4.615385,6.153846,4.538462,4.384615,5.230769,3.538462
recently_extinct,3.583333,4.166667,3.0,2.916667,3.916667,5.833333,4.166667,3.833333,5.416667,4.333333,...,5.416667,4.416667,4.75,4.083333,4.75,6.083333,5.25,3.833333,4.916667,3.166667


In [19]:
number_words = list(df.columns[:2])
other_words = list(df.columns[2:])

df['number_words'] = df[number_words].mean(axis=1)
df['other_words'] = df[other_words].mean(axis=1)

df[['number_words', 'other_words']].groupby(level=['cluster']).mean()

Unnamed: 0_level_0,number_words,other_words
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
ancient,3.861111,4.162281
non_extinct,3.730769,4.496964
recently_extinct,3.875,4.433114


### 2.2 METRIC: Normalized Edit DISTANCE

#### Dataframe based on normalized Edit Distance:

In [20]:
#Create an empty dataframe with columns as words and multi index (language, cluster)
df = pd.DataFrame(columns = words, index = index_adv, dtype=float)

for (lang, cluster) in all_langs:
    for word in words:
        try:
            df.loc[(lang, cluster),word] = similarity_metric('norm_hamming', 'ENGLISH', lang, word)
        except:
            pass   #for words not found in a particular language   
            
            
    
#Missing Value imputations
df.fillna(df.median(), inplace=True)
df.groupby(level=['cluster']).mean()

Unnamed: 0_level_0,one,two,I,you,we,person,fish,dog,louse,tree,...,star,water,stone,fire,path,mountain,night,full,new,name
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ancient,0.981481,0.944444,1.0,1.0,,0.981481,0.981481,1.0,1.0,1.0,...,0.986111,0.8,0.944444,1.0,1.0,,1.0,1.0,,0.685185
non_extinct,0.974359,1.0,1.0,0.961538,,0.987179,1.0,1.0,0.961538,0.974359,...,1.0,0.815385,1.0,1.0,1.0,,1.0,1.0,,0.74359
recently_extinct,1.0,1.0,1.0,1.0,,0.986111,1.0,0.972222,1.0,1.0,...,1.0,0.8,1.0,1.0,1.0,,1.0,1.0,,0.777778


In [21]:
number_words = list(df.columns[:2])
other_words = list(df.columns[2:])

df['number_words'] = df[number_words].mean(axis=1)
df['other_words'] = df[other_words].mean(axis=1)

df[['number_words', 'other_words']].groupby(level=['cluster']).mean()

Unnamed: 0_level_0,number_words,other_words
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
ancient,0.962963,0.967116
non_extinct,0.987179,0.969451
recently_extinct,1.0,0.973651


### 2.3 METRIC: Normalized Hamming

#### Dataframe based on normalized Hamming:

In [22]:
#Create an empty dataframe with columns as words and multi index (language, cluster)
df = pd.DataFrame(columns = words, index = index_adv, dtype=float)

for (lang, cluster) in all_langs:
    for word in words:
        try:
            df.loc[(lang, cluster),word] = similarity_metric('norm_hamming', 'ENGLISH', lang, word)
        except:
            pass   #for words not found in a particular language   
            
            
    
#Missing Value imputations
df.fillna(df.median(), inplace=True)
df.groupby(level=['cluster']).mean()

Unnamed: 0_level_0,one,two,I,you,we,person,fish,dog,louse,tree,...,star,water,stone,fire,path,mountain,night,full,new,name
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ancient,0.981481,0.944444,1.0,1.0,,0.981481,0.981481,1.0,1.0,1.0,...,0.986111,0.8,0.944444,1.0,1.0,,1.0,1.0,,0.685185
non_extinct,0.974359,1.0,1.0,0.961538,,0.987179,1.0,1.0,0.961538,0.974359,...,1.0,0.815385,1.0,1.0,1.0,,1.0,1.0,,0.74359
recently_extinct,1.0,1.0,1.0,1.0,,0.986111,1.0,0.972222,1.0,1.0,...,1.0,0.8,1.0,1.0,1.0,,1.0,1.0,,0.777778


In [23]:
number_words = list(df.columns[:2])
other_words = list(df.columns[2:])

df['number_words'] = df[number_words].mean(axis=1)
df['other_words'] = df[other_words].mean(axis=1)

df[['number_words', 'other_words']].groupby(level=['cluster']).mean()

Unnamed: 0_level_0,number_words,other_words
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
ancient,0.962963,0.967116
non_extinct,0.987179,0.969451
recently_extinct,1.0,0.973651


### 2.4 METRIC : Jaccard

#### Dataframe based on Jaccard


In [24]:
#Create an empty dataframe with columns as words and multi index (language, cluster)
df = pd.DataFrame(columns = words, index = index_adv, dtype=float)

for (lang, cluster) in all_langs:
    for word in words:
        try:
            df.loc[(lang, cluster),word] = similarity_metric('jaccard', 'ENGLISH', lang, word)
        except:
            pass   #for words not found in a particular language    
                 
#Missing Value imputations
df.fillna(df.median(), inplace=True)
df.groupby(level=['cluster']).mean()


Unnamed: 0_level_0,one,two,I,you,we,person,fish,dog,louse,tree,...,star,water,stone,fire,path,mountain,night,full,new,name
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ancient,0.923876,0.887302,0.953704,1.0,0.958333,0.86764,0.966667,0.906085,0.799647,0.905556,...,0.887566,0.894841,0.882275,0.89871,0.982804,0.851808,0.879894,0.822487,0.972222,0.772222
non_extinct,0.929823,0.958791,0.901282,0.955128,0.940934,0.895843,0.894597,0.840018,0.86859,0.890018,...,0.7887,0.885531,0.956044,0.984203,0.978022,0.83315,0.881654,0.901099,0.964011,0.850549
recently_extinct,0.933333,0.904762,0.869444,1.0,0.965278,0.812497,0.990741,0.85873,0.83955,0.864286,...,0.850992,0.899967,0.846329,0.870784,0.988095,0.760747,0.850956,0.860317,0.979167,0.812963


In [25]:
number_words = list(df.columns[:2])
other_words = list(df.columns[2:])

df['number_words'] = df[number_words].mean(axis=1)
df['other_words'] = df[other_words].mean(axis=1)

df[['number_words', 'other_words']].groupby(level=['cluster']).mean()

Unnamed: 0_level_0,number_words,other_words
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
ancient,0.905589,0.914588
non_extinct,0.944307,0.910546
recently_extinct,0.919048,0.903343
