## Importing Dataset

In this section I import the libraries needed to load, clean and process the dataset that I will use for the baseline testing. I then load the dataset from Google Drive and perform the cleaning and processing steps it needs.

In [72]:
import pandas as pd
import string

In [73]:
df = pd.read_table('../../data/etymwn.tsv', header=None)

The dataset I have chosen is an etymological wordnet dataset from http://etym.org/

As you can see below, the dataset consists of three columns. Two columns each contain a word prefixed with its ISO 639-3 language code, and the remaining column contains the etymological relationship between them. Some cleaning is required to get this data into a usable format.

In [74]:
df.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | aaq: Pawanobskewi | rel:etymological_origin_of | eng: Penobscot |
| 1 | aaq: senabe | rel:etymological_origin_of | eng: sannup |
| 2 | abe: waniigan | rel:etymological_origin_of | eng: wangan |
| 3 | abe: waniigan | rel:etymological_origin_of | eng: wannigan |
| 4 | abs: beta | rel:etymological_origin_of | zsm: beta |
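Each entry in columns 0 and 2 has the form `code: word`. As a minimal sketch (the `sample` frame and the `split_entry` helper are illustrative names, not part of the notebook), the code and the word can be separated on the `": "` delimiter instead of by fixed-width slicing:

```python
import pandas as pd

# Three rows copied from the df.head() output above
sample = pd.DataFrame({
    0: ["aaq: Pawanobskewi", "aaq: senabe", "abe: waniigan"],
    1: ["rel:etymological_origin_of"] * 3,
    2: ["eng: Penobscot", "eng: sannup", "eng: wangan"],
})

def split_entry(column):
    """Split 'code: word' strings into a two-column frame (lang, word)."""
    parts = column.str.split(": ", n=1, expand=True)
    parts.columns = ["lang", "word"]
    return parts

source = split_entry(sample[0])
target = split_entry(sample[2])
print(target)
```

Splitting on the delimiter also drops the space after the colon, which a slice starting at position 4 would keep.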


In [75]:
df2 = df.loc[df[0].str[:3] == 'eng']

In [76]:
df3 = df2.loc[df2[1].str[4:] == 'etymology']

In [77]:
# .copy() makes df4 an independent frame, so the assignments below
# no longer trigger pandas' SettingWithCopyWarning
df4 = df3.loc[df3[2].str[:3] != 'eng'].copy()

# Keep only the three-letter language code of the source word
df4[2] = df4[2].str[:3]

# Drop the 'eng:' prefix and the space that follows it
df4[0] = df4[0].str[4:].str.strip()


In [78]:
df4 = df4.drop(1, axis=1)
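The three filtering steps can also be expressed as one boolean mask. The following is only a sketch on a made-up `toy` frame (the rows are illustrative, not taken from the dataset), assuming the same `code: word` and `rel:...` formats shown earlier:

```python
import pandas as pd

# Made-up rows in the same format as the dataset
toy = pd.DataFrame({
    0: ["eng: sannup", "eng: wangan", "aaq: senabe", "eng: unhappy"],
    1: ["rel:etymology", "rel:etymology",
        "rel:etymological_origin_of", "rel:etymology"],
    2: ["aaq: senabe", "abe: waniigan", "eng: sannup", "eng: happy"],
})

mask = (
    toy[0].str.startswith("eng:")       # the word itself is English
    & (toy[1] == "rel:etymology")       # the relation points to its origin
    & ~toy[2].str.startswith("eng:")    # the origin is not English
)

# .copy() gives an independent frame that can be modified safely
filtered = toy.loc[mask].copy()
print(filtered)
```

Building the mask once avoids the chain of intermediate frames (`df2`, `df3`, `df4`) and makes it explicit which rows survive.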

## Language Codes

I will now import a table that maps each ISO 639-3 language code to the name of its language. I will use it to replace the language codes with language names, which makes the dataset more interpretable.

In [79]:
lanAlt = pd.read_csv('../../data/iso-639-3_Name_Index_20230123.tab', sep='\t')

In [80]:
#Some language codes in the data set were not present in the language code file
#Because of this I added them manually
lang_dict = dict(zip(lanAlt['Id'], lanAlt['Print_Name']))
lang_dict['p_s'] = 'unknown'
lang_dict['nah'] = 'Nahuatl'
lang_dict['nan'] = 'Min Nan'
lang_dict['wit'] = 'unknown'

In [81]:
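# Replace each ISO 639-3 code with the name of its language
df4[2] = df4[2].map(lang_dict)

# Strip hyphens and apostrophes from the English words. Assigning to the
# row yielded by iterrows() is not guaranteed to write back to df4, so
# the replacement is applied to the column directly.
df4[0] = df4[0].str.replace('-', '', regex=False).str.replace("'", "", regex=False)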

In [83]:
df4[2] = df4[2].str.replace(r'\(.*\)', '', regex=True)

The dataset contains many languages, and a model trained to predict that many classes would be extremely skewed because most classes are under-represented. To make the problem more tractable, I will keep only the ten largest classes and build a model around them.

In [85]:
language_counts = df4[2].value_counts()
result_df = pd.DataFrame({'Language': language_counts.index, 'Count': language_counts.values})
topLang = result_df['Language'].tolist()[:10]
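To judge how much data survives this cut, it can help to compute the share of rows covered by the largest classes. A minimal sketch on a toy label column (the `labels` series and the choice of `head(2)` are illustrative; the notebook keeps the top ten):

```python
import pandas as pd

# Toy label column standing in for df4[2]
labels = pd.Series(
    ["Latin"] * 6 + ["French"] * 3 + ["Japanese"] * 2 + ["Nahuatl"]
)

counts = labels.value_counts()          # sorted from largest to smallest
top_k = counts.head(2)                  # the k largest classes
coverage = top_k.sum() / counts.sum()   # fraction of rows that are kept

print(top_k.index.tolist(), coverage)
```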

In [86]:
print("Number of languages in the dataset: " + str(len(result_df['Language'].tolist())))

Number of languages in the dataset: 238


In [87]:
print("Top ten languages: ")
for i in topLang:
  print(i)

Top ten languages: 
Latin
Middle English 
French
Ancient Greek 
Old French 
Old English 
Middle French 
Japanese
Italian
Spanish


In [88]:
# Keep only the rows whose language is one of the top ten
df4 = df4[df4[2].isin(topLang)].copy()

## Baseline Test

The idea behind this baseline test is that different languages will have character n-grams that are somewhat unique to that language. Of course this is not always the case, and languages that are similar to each other, such as the Romance languages, will often share many of these n-grams.

The n-gram character combinations were determined through a mix of my own prior knowledge of some languages, trial and error, and asking ChatGPT. The prompt given to ChatGPT was "Give me common letter combinations in French", after which I cherry-picked the results I thought would work best.

If nothing is found, Latin is returned as it is the most common etymological root for English. Also note that I have decided to merge "French", "Middle French" and "Old French" into just "French" as they are too similar to tell apart. I have also done this with "Middle English" and "Old English". This leaves us with seven classes to predict.
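The merge can be made explicit with a small label-normalising function. This is a sketch only: `merge_label` is a hypothetical helper that the notebook does not define, and the label list is the top-ten output printed earlier, including the trailing spaces left by the bracket removal.

```python
def merge_label(label):
    """Collapse historical stages of French and English into one class."""
    label = label.strip()  # some labels carry a trailing space
    for family in ("French", "English"):
        if label.endswith(family):
            return family
    return label

top_ten = [
    "Latin", "Middle English ", "French", "Ancient Greek ", "Old French ",
    "Old English ", "Middle French ", "Japanese", "Italian", "Spanish",
]

merged = sorted(set(merge_label(label) for label in top_ten))
print(merged)
```

Normalising the labels up front would allow predictions to be compared with `==` rather than with a substring test.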

In [89]:
# Ordered rules: the first language with a matching n-gram wins, so an
# n-gram that repeats an earlier rule ("ia", "au", "gh") never fires there.
EPT_RULES = [
    ("Latin", ("ua", "eca", "us", "um", "ia", "ex", "lib", "oc", "au")),
    ("Ancient Greek", ("ia", "ee", "oy", "ow", "ev", "ng", "ch", "th", "ph")),
    ("English", ("gh", "kn", "wr", "ghth", "tion", "ou")),
    ("French", ("au", "eau", "ai", "ei", "oi", "eu", "gn", "ill")),
    ("Japanese", ("ka", "ki", "ku", "ke", "ko")),
    ("Italian", ("ce", "ge", "sce", "gh", "gli", "io", "za")),
    ("Spanish", ("ll", "qu", "gu", "rr", "gui", "ves")),
]

def ept_classifier_rule(word):
    for language, ngrams in EPT_RULES:
        if any(ngram in word for ngram in ngrams):
            return language
    # Nothing matched: fall back to Latin, the most common root
    return "Latin"
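Because the rules are checked in order, an n-gram in a later rule can never fire if an n-gram from an earlier rule is contained in it. The sketch below lists such cases; `RULES` restates the n-grams of the classifier, and `shadowed_ngrams` is a hypothetical helper written for this check.

```python
# The classifier's n-grams, restated in the order the branches are tested
RULES = [
    ("Latin", ("ua", "eca", "us", "um", "ia", "ex", "lib", "oc", "au")),
    ("Ancient Greek", ("ia", "ee", "oy", "ow", "ev", "ng", "ch", "th", "ph")),
    ("English", ("gh", "kn", "wr", "ghth", "tion", "ou")),
    ("French", ("au", "eau", "ai", "ei", "oi", "eu", "gn", "ill")),
    ("Japanese", ("ka", "ki", "ku", "ke", "ko")),
    ("Italian", ("ce", "ge", "sce", "gh", "gli", "io", "za")),
    ("Spanish", ("ll", "qu", "gu", "rr", "gui", "ves")),
]

def shadowed_ngrams(rules):
    """Return (language, ngram) pairs that an earlier rule always pre-empts."""
    shadowed = []
    earlier = []  # n-grams that belong to rules already visited
    for language, ngrams in rules:
        for ngram in ngrams:
            # Any word containing `ngram` also contains `prev`,
            # so the earlier rule returns first
            if any(prev in ngram for prev in earlier):
                shadowed.append((language, ngram))
        earlier.extend(ngrams)
    return shadowed

for language, ngram in shadowed_ngrams(RULES):
    print(language, ngram)
```

The check only looks across rules; duplicates inside one rule (such as "gui" after "gu") are redundant but do not change the prediction.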

In [90]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df4, test_size=0.2, random_state=42)

In [91]:
# A prediction counts as correct when it appears inside the label, so
# "French" also matches "Old French" and "Middle French"
results = [
    int(ept_classifier_rule(word) in label)
    for word, label in zip(test_df[0], test_df[2])
]

accuracy = sum(results) / len(results)
print("Accuracy: " + str(accuracy))

Accuracy: 0.3175089331291475
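A single accuracy figure is easier to interpret next to a majority-class baseline and a per-class breakdown. The following sketch uses made-up `labels` and `preds` (illustrative values, not results from this notebook) to show how both could be computed with pandas:

```python
import pandas as pd

# Made-up labels and predictions; in the notebook they would come from test_df
labels = pd.Series(["Latin", "Latin", "French", "Japanese", "Latin", "French"])
preds = pd.Series(["Latin", "Ancient Greek", "French", "Latin", "Latin", "Latin"])

correct = preds == labels
rule_accuracy = correct.mean()

# Majority-class baseline: always predict the most frequent label
majority = labels.value_counts().idxmax()
majority_accuracy = (labels == majority).mean()

# Accuracy broken down by true class
per_class = correct.groupby(labels).mean()

print(rule_accuracy, majority, majority_accuracy)
print(per_class)
```

If the rule-based accuracy is not clearly above the majority-class figure, the n-gram rules are adding little beyond the Latin fallback.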


In [92]:
word = "democracy"

print("The etymology of the word " + word + " is "  + ept_classifier_rule(word))

The etymology of the word democracy is Latin


Testing on a random sample of 30 words

In [93]:
test = pd.read_csv('../../data/test.csv')

# Exact match here, unlike the substring check used on test_df above
correct = sum(ept_classifier_rule(r['0']) == r['2'] for _, r in test.iterrows())

accuracy = correct / len(test)
print(accuracy)

0.36666666666666664
