NL2DS - Winter 2023

Assignment 5 -- Language Phylogeny and Clustering

Name: **[YOUR NAME HERE]**

Student ID: **[YOUR STUDENT ID HERE]**

In this assignment, we will look at some cross-linguistic word form data and use some of the tools we saw in class to build family trees of languages based on the sound forms of words---otherwise known as "optimal phylogenies." 

We will use data from the following recent paper.


[Dellert, J., Daneyko, T., Muench, A., Ladygina, A., Buch, A., Clarius, N., Grigorjew, I., Balabel, M., Boga, H. I., Baysarova, Z., Muehlenbernd, R., Wahle, J., and Jaeger, G. (2020). NorthEuraLex: A wide-coverage lexical database of Northern Eurasia. Language Resources and Evaluation, 54, 273–301.](https://drive.google.com/file/d/1ptoMNctdJs99wPWfBUGbw4_X60NtKl9B/view?usp=sharing)

The data can also be found [here](http://northeuralex.org/).

Copy the data to your drive folder from: [here](https://drive.google.com/file/d/1Mfa8XayBFJb0fY8wfinODw90yuRal8AD/view?usp=sharing), [here](https://drive.google.com/file/d/1AQqkscWKlq3quw-BWjB8xqSQzm7-uDtt/view?usp=sharing), and [here](https://drive.google.com/file/d/1R7ZLEzDW9QKUen3BjItPsySaUPCpu7xk/view?usp=sharing).

# **Part 1**

***Question 1:*** What is the NorthEuraLex dataset? Give a brief overview. What kind of data is it? What is its purpose? How was it constructed? There is no need to go into the particulars (such as the fields of the files); just give an overview of no more than one paragraph that conveys the gist to someone unfamiliar with the dataset.

**A1: put your answer here (please keep it brief, 3-5 sentences)**

Now, let's read in the wordforms in this dataset.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

import pandas as pd
wordforms=pd.read_csv("/content/drive/My Drive/northeuralex.csv")
display(wordforms)

***Question 2:*** Describe the meaning of the `Language_ID`, `Concept_ID`, `rawIPA`, and `IPA` columns of the data.

**A2: put your answer here (please keep it brief, no more than 1 sentence per column)**

Now let's read in some metadata about the languages.

In [None]:
languages=pd.read_csv("/content/drive/My Drive/northeuralex-languages.csv")
display(languages)

***Question 3:*** Describe the meaning of the `family`, `iso_code`, and `subfamily` columns of the data.

**A3: put your answer here (please keep it brief, no more than 1 sentence per column)**

Now let's read in some further data about the concepts.

In [None]:
concepts=pd.read_csv("/content/drive/My Drive/northeuralex-concepts.csv")
display(concepts)

***Question 4:*** Describe the meaning of the `id_nelex`, `gloss_en`, and `position_in_ranking` columns of the data.

**A4: put your answer here (please keep it brief, no more than 1 sentence per column)**

# **Part 2**

It will be useful to merge all of the meta-information into the main wordforms dataframe.
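
As a minimal sketch of the general rename-then-merge pattern in `pandas`, the toy example below joins three small tables. All table contents and column names here (`lang`, `code`, `cid`, and so on) are hypothetical stand-ins, not the actual NorthEuraLex columns; you will need to work out which real columns correspond.

```python
import pandas as pd

# Toy stand-ins for a forms table and two metadata tables.
# Every column name below is hypothetical.
forms = pd.DataFrame({
    "lang": ["l1", "l2", "l1"],
    "concept": ["c1", "c1", "c2"],
    "ipa": ["aba", "apa", "iki"],
})
langs = pd.DataFrame({"code": ["l1", "l2"], "name": ["German", "English"]})
concs = pd.DataFrame({"cid": ["c1", "c2"], "gloss": ["water", "fire"]})

# Rename the key columns so that they match the forms table.
langs = langs.rename(columns={"code": "lang"})
concs = concs.rename(columns={"cid": "concept"})

# Left merges keep every row of the forms table, in its original order.
merged = (
    forms.merge(langs, on="lang", how="left")
         .merge(concs, on="concept", how="left")
)
print(merged)
```

A left merge is used here so that no word form is dropped if its metadata is missing; an inner merge would silently discard such rows.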

In [None]:
# Problem 1a: rename the appropriate columns in the languages and concepts dataframes to make this merge possible.
#your code here

# Problem 1b: Use the merge function to merge the three dataframes into one.
#your code here

display(wordforms)

In this problem set, we will make use of the `lingpy` package of tools for historical linguistics. You can find more information on this [here](https://lingpy.org/index.html). We'll start by installing the package.

In [None]:
!pip install lingpy

In order to make our computations below more manageable, we will focus on the Indo-European languages, which you can read more about [here](https://en.wikipedia.org/wiki/Indo-European_languages). We will also focus on just the top 20 concepts as determined by their rank.
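
Filtering of this kind is usually done with boolean masks. The sketch below shows the idea on a toy dataframe; the column names (`family`, `rank`) and the values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy dataframe; column names and values are hypothetical.
df = pd.DataFrame({
    "lang": ["A", "A", "B", "C"],
    "family": ["X", "X", "Y", "X"],
    "rank": [1, 25, 3, 20],
})

# Combine two conditions with & (each condition needs its own parentheses).
keep = (df["family"] == "X") & (df["rank"] <= 20)
subset = df[keep].reset_index(drop=True)
print(subset)
```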

In [None]:
#Problem 2a: Filter out the non-Indo-European languages from the dataframes
#your code here

#Problem 2b: Filter the concepts to include those less than or equal to rank 20 in the dataframe.
# your code here

display(wordforms)

# **Part 3**

Our goal is to use agglomerative clustering to try to reconstruct the tree for the Indo-European languages. You can find a reference tree (for families) [here](https://en.wikipedia.org/wiki/Indo-European_languages#/media/File:IndoEuropeanLanguageFamilyRelationsChart.jpg).

In order to do this, we will need to construct a matrix of pairwise distances between the languages, which we will call a confusion matrix.

We will compute the (normalized) Levenshtein distance between the strings for each concept for each pair of languages. For instance, we will compute the normalized Levenshtein distance between the German and English words for Wasser::N (water in English), and then similarly for the 19 other concepts. If a language has multiple words for the same concept, take the average across all possible pairs. We will then average these values (i.e., average across all concepts) to find the distance between German and English. We will do this for all pairs of languages to create a list of lists representing the confusion matrix.
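
To make the averaging scheme concrete, here is a small pure-Python sketch that does not use `lingpy`. It assumes that "normalized" means dividing the edit distance by the length of the longer string, and it compares strings character by character, which may not match how IPA segments are tokenized in the dataset. The helper names and toy lexicons are hypothetical.

```python
from itertools import product

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def concept_distance(forms1, forms2):
    """Average over all pairings of the word forms for one concept."""
    pairs = list(product(forms1, forms2))
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)

def language_distance(lex1, lex2):
    """Average the per-concept distances over the concepts both languages share."""
    shared = [c for c in lex1 if c in lex2]
    return sum(concept_distance(lex1[c], lex2[c]) for c in shared) / len(shared)

# Toy lexicons mapping a concept to its list of word forms (hypothetical data).
lex_a = {"water": ["abc"], "fire": ["xy"]}
lex_b = {"water": ["abc", "abd"], "fire": ["xz"]}
print(language_distance(lex_a, lex_b))
```

In this toy case the "water" concept has two pairings (distances 0 and 1/3, averaging 1/6) and "fire" has one (distance 1/2), so the language-level value is their mean.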

Note that running your code will take a few minutes.

In [None]:
import lingpy as lp
import numpy as np

#Problem 3: fill the confusion matrix  using the 
#lp.align.pairwise.edit_dist function from lingpy, on 
#the "IPA" fields for each language.

#initialize confusion matrix
language_list = None # Initialise list of languages in the current modified wordforms dataset
confusion = [[0 for j in range(len(language_list))] for i in range(len(language_list))]

for language1 in ...
  for language2 in ...
    ...
    for concept in ...
      ...
    confusion[language1][language2]=...

Clear the output of the above cell (by clicking on the cross at top left of the output part) so that it doesn't clutter the pdf.

Now that we have computed a matrix of distances, we can use clustering algorithms to try to build phylogenetic trees representing the languages' historical relationships. First, let's use the `lp.algorithm.clustering.flat_cluster` function from `lingpy` to derive a flat clustering of languages.


In [None]:
lp.algorithm.clustering.flat_cluster('upgma', 0.6, confusion, language_list)

***Question 5:*** Do you recognize any of the clusters of languages? Are there any noteworthy errors in this clustering?

**A5: put your answer here (please keep it brief, no more than 4-5 sentences.)**


# **Part 4**

Now we will build our own dendrogram using the clustering algorithms available in [`scipy`](https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html). You can read in particular about the [`linkage`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) function and the [`dendrogram`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html) function. 
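
One detail worth knowing about `linkage`: it accepts either a condensed (one-dimensional) distance vector or a two-dimensional array of observation vectors, so a square distance matrix passed directly is interpreted as observations rather than distances. The sketch below, on a hypothetical four-item toy matrix, converts a square matrix with `scipy.spatial.distance.squareform` first. It is an illustration under those assumptions, not the required solution.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix for four items.
labels = ["A", "B", "C", "D"]
dist = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])

# Convert the square matrix to the condensed form that linkage expects.
condensed = squareform(dist)

# Average linkage (UPGMA) on the condensed distances.
linked_toy = linkage(condensed, method="average")

# no_plot=True returns the tree layout without drawing anything.
tree = dendrogram(linked_toy, labels=labels, no_plot=True)
print(linked_toy)
print(tree["ivl"])
```

Each row of the linkage output records one merge: the two cluster indices, the distance at which they merged, and the size of the new cluster.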

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.metrics import v_measure_score
import matplotlib.pyplot as plt

#Problem 4: use the linkage function with the average linkage method to compute the clustering.
linked = ...


#plot the results using dendrogram
def llf(id): return language_list[id]
plt.figure(figsize=(12, 8))
dendrogram(linked,
           p=100,
           truncate_mode="level",
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=False,
           leaf_label_func=llf)

plt.show()

***Question 6:*** Do you recognize any of the clusters of languages at any of the levels? Are there any noteworthy errors in this clustering?

**A6: put your answer here (please keep it brief, no more than 4-5 sentences.)**

# **Part 5**

***Question 7:*** Try three of the other linkage methods and describe how they change the results.

**A7: Put your answer here (please keep it brief, no more than 4-5 sentences). Include all the code in separate cells.**


***Question 8:*** Try increasing the number of concepts we use to compute our confusion matrix to be higher than 20. Does it change the results?

**A8: Put your answer here (please keep it brief, no more than 3-4 sentences). Include all the code in separate cells.**

# To Submit
To submit, name this notebook `YOUR_STUDENT_ID_Assignment_5.ipynb`, then convert this `.ipynb` file to a `.pdf` (e.g., using the following instructions) and upload the PDF to the Gradescope assignment "Assignment 5 -- Language Phylogeny and Clustering".

(Note: `Print > Save as PDF` **will not work** because it will not display your figures correctly.)

You can convert the notebook to a PDF using the following instructions.

# Converting this notebook to a PDF

1. Make sure you have renamed the notebook, e.g. `000000000_Assignment_5.ipynb` where `000000000` is your student ID.
2. Make sure to save the notebook (`ctrl/cmd + s`).

Make sure Google Drive is mounted (it likely already is from the first question).

In [None]:
from google.colab import drive
drive.mount('/content/drive/')
!ls "/content/drive/MyDrive/Colab Notebooks/"

3. Install packages for converting .ipynb to .pdf

In [None]:
!apt-get -q install texlive-xetex texlive-fonts-recommended texlive-plain-generic

4. Convert to PDF (replace `STUDENT-ID` with your student ID)

In [None]:
# Replace STUDENT-ID with your student ID
!jupyter nbconvert --to pdf "/content/drive/MyDrive/Colab Notebooks/STUDENT-ID_Assignment_5.ipynb"

5. Download the resulting PDF file. If you are using Chrome, you can do so by running the following code. On other browsers, you can download the PDF using the file manager on the left of the screen (Navigate to the file > Right Click > Download).

In [None]:
# Replace STUDENT-ID with your student ID below:
from google.colab import files
files.download("/content/drive/MyDrive/Colab Notebooks/STUDENT-ID_Assignment_5.pdf")

6. Verify that your PDF correctly displays your figures and responses.