Demonstration of the UPGMA hierarchical clustering algorithm in Pandas, Seaborn, and Scipy

upgma

Demonstration of the UPGMA hierarchical clustering algorithm in Pandas, Seaborn, and Scipy.

Introduction

The Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is a bottom-up (agglomerative) hierarchical clustering algorithm commonly applied to genetic distance matrices. The sequence of merges it produces can be drawn as a dendrogram. The code in this repository uses Pandas and Seaborn for their data visualization and vectorization capabilities.
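As a minimal sketch of the idea, and not the code in upgma.py, the example below runs UPGMA through SciPy's `linkage` function, where `method="average"` is SciPy's name for UPGMA. The four taxa and their distances are hypothetical values chosen only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical 4-taxon distance matrix (symmetric, zero diagonal).
labels = ["A", "B", "C", "D"]
dist = np.array([
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 6.0, 10.0],
    [6.0, 6.0, 0.0, 10.0],
    [10.0, 10.0, 10.0, 0.0],
])

# linkage() expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(dist)

# method="average" selects UPGMA: the distance between two clusters is the
# mean of all pairwise distances between their members.
merges = linkage(condensed, method="average")

# Each row holds: the two clusters merged, their distance, the new cluster size.
for left, right, distance, size in merges:
    print(int(left), int(right), distance, int(size))
```

Clusters created by a merge are numbered after the original taxa, so with four taxa the first merged cluster has index 4.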

In this repository, UPGMA is deterministic: every run produces the same results. In addition, as long as the data itself is unchanged, the rows and columns may be arranged in any order and the results remain the same.
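The order-independence claim can be checked with a small sketch. It uses SciPy's UPGMA (`method="average"`) rather than the code in upgma.py, and a hypothetical distance matrix with no tied distances so that the comparison is unambiguous. The helper name `merge_distances` is an assumption made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def merge_distances(dist):
    """Return the sorted UPGMA merge distances for a square distance matrix."""
    return np.sort(linkage(squareform(dist), method="average")[:, 2])


# Hypothetical distance matrix (symmetric, zero diagonal, no tied values).
dist = np.array([
    [0.0, 2.0, 7.0, 11.0],
    [2.0, 0.0, 5.0, 9.0],
    [7.0, 5.0, 0.0, 12.0],
    [11.0, 9.0, 12.0, 0.0],
])

# Reorder the taxa: apply the same permutation to rows and columns.
order = [3, 1, 0, 2]
shuffled = dist[np.ix_(order, order)]

# The merge distances should agree regardless of the input order.
same = np.allclose(merge_distances(dist), merge_distances(shuffled))
print(same)
```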

Start

[Image: Start.png]

Finish

[Image: Finish.png]

Results

{('Man', 'Monkey'): 0.5,
 ('Turtle', 'Chicken'): 4.0,
 (('Man', 'Monkey'), 'Dog'): 6.25,
 (('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')): 7.875,
 ((('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna'): 14.1875,
 (((('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna'), 'Moth'): 18.21875}
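Because each key is a nested tuple of taxon names, the tree topology can be read directly from the keys. The sketch below copies the results above into a dictionary and renders the largest cluster as a Newick-style string. The names `results`, `to_newick`, and `leaves` are hypothetical helpers for this example and are not part of the repository.

```python
def to_newick(node):
    """Render a nested tuple of taxon names as a Newick-style topology string."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def leaves(node):
    """Return the taxon names under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


# Copy of the results shown above.
results = {
    ('Man', 'Monkey'): 0.5,
    ('Turtle', 'Chicken'): 4.0,
    (('Man', 'Monkey'), 'Dog'): 6.25,
    (('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')): 7.875,
    ((('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna'): 14.1875,
    (((('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna'), 'Moth'): 18.21875,
}

# The root is the cluster with the largest value, which contains every taxon.
root = max(results, key=results.get)
print(to_newick(root) + ";")
print(leaves(root))
```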

Dendrogram

[Image: dendrogram.png]
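A dendrogram of this kind can also be produced with SciPy, one of the listed dependencies. The following is a sketch under assumptions and not necessarily how dendrogram.png was generated: the taxa and distances are hypothetical, and `no_plot=True` makes `dendrogram` return the layout data instead of drawing it.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Hypothetical 4-taxon distance matrix (symmetric, zero diagonal).
labels = ["A", "B", "C", "D"]
dist = np.array([
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 6.0, 10.0],
    [6.0, 6.0, 0.0, 10.0],
    [10.0, 10.0, 10.0, 0.0],
])

# Build the UPGMA linkage matrix, then compute the dendrogram layout.
merges = linkage(squareform(dist), method="average")
layout = dendrogram(merges, labels=labels, no_plot=True)

# "ivl" lists the leaf labels in the order they appear along the axis.
print(layout["ivl"])
```

Removing `no_plot=True` draws the tree on the current Matplotlib axes, which requires Matplotlib to be installed.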

Dependencies

  • python3-numpy
  • python3-pandas
  • python3-scipy
  • python3-seaborn

Running the Code

Execute the upgma.py file in an IPython environment.

Tables may be viewed by running commands such as:

upgma.upgma_records[('Man', 'Monkey')]
upgma.upgma_records[(((('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna'),'Moth')]

The phylogenetic distances may be viewed by running:

upgma.phylogeny

Notes

  • The Pandas Styler contains a bug that affects one of the intermediate steps of this program. When the index is [((('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna')], the original DataFrame cannot be styled and the following error is raised:

ValueError: Buffer has wrong number of dimensions (expected 1, got 3)

See the reported issue: https://github.com/pandas-dev/pandas/issues/24687