# Ancestral reconstruction and clustering

The ancestral reconstruction problem arises naturally whenever we must identify the true original objects from a set of objects created by noisy reproduction procedures:
* Find out how species have evolved using DNA samples.
* Find out which of the ancient manuscripts is the original.
* Find out the source of a piece of gossip and evaluate internet memes.
* Find out how academic texts are plagiarised.

In [1]:
import numpy as np
import pandas as pd
import numpy.random as rnd
import scipy.stats as stats
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hclust 
import sklearn

from pandas import Series
from pandas import DataFrame
from typing import List,Tuple

from pandas import Categorical
from pandas.api.types import CategoricalDtype

from tqdm import tnrange#, tqdm_notebook
from plotnine import *

from scipy.stats import norm
from scipy.stats import multivariate_normal


from sklearn.cluster import AgglomerativeClustering

# Local imports
from common import *
from convenience import *

## I. Naive probabilistic model

For simplicity, let us consider the case where objects are encoded as binary vectors of fixed length. 
Let $\boldsymbol{u}=(u_1,\ldots,u_n)$ and $\boldsymbol{v}=(v_1,\ldots,v_n)$ denote two documents such that $\boldsymbol{v}$ is generated from $\boldsymbol{u}$ by a noisy copying procedure that flips each value with probability $p$:

\begin{align*}
\begin{aligned}
&\Pr[u_i=0\to v_i=0]=1-p\\
&\Pr[u_i=0\to v_i=1]=p
\end{aligned}
\qquad\qquad
\begin{aligned}
&\Pr[u_i=1\to v_i=0]=p\\
&\Pr[u_i=1\to v_i=1]=1-p
\end{aligned}
\end{align*}

Under the assumption that bits $u_i$ are copied independently, the probability that $\boldsymbol{v}$ is generated from $\boldsymbol{u}$ is

\begin{align*}
\Pr[\boldsymbol{u}\to\boldsymbol{v}]=(1-p)^{n-h(\boldsymbol{u},\boldsymbol{v})}p^{h(\boldsymbol{u},\boldsymbol{v})}
\end{align*}

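where $h(\boldsymbol{u},\boldsymbol{v})$ is the Hamming distance between $\boldsymbol{u}$ and $\boldsymbol{v}$, that is, the number of positions in which they differ.
The probability of the entire tree $\mathcal{T}$ is the product of its edge probabilities: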

\begin{align*}
\Pr[\mathcal{T}]=\prod_{\boldsymbol{u}\to\boldsymbol{v}}\Pr[\boldsymbol{u}\to\boldsymbol{v}]= \prod_{\boldsymbol{u}\to\boldsymbol{v}}(1-p)^{n}\left(\frac{p}{1-p}\right)^{h(\boldsymbol{u},\boldsymbol{v})}\,.
\end{align*}
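The two formulas above can be checked numerically. The following is a minimal sketch, assuming documents are stored as equal-length lists of 0/1 values and a tree is given as a list of (parent, child) pairs. The helper names `hamming`, `edge_log_prob` and `tree_log_prob` and the toy tree are illustrative and not part of the notebook's local modules.

```python
import numpy as np

def hamming(u, v):
    """Number of positions where two equal-length binary vectors differ."""
    u, v = np.asarray(u), np.asarray(v)
    return int(np.sum(u != v))

def edge_log_prob(u, v, p):
    """log Pr[u -> v] under the naive model with flip probability p."""
    n = len(u)
    h = hamming(u, v)
    return (n - h) * np.log(1 - p) + h * np.log(p)

def tree_log_prob(edges, p):
    """Log-probability of a tree given as a list of (parent, child) pairs."""
    return sum(edge_log_prob(u, v, p) for u, v in edges)

# Toy tree: one root with two children (illustrative values only).
root = [0, 0, 1, 1]
edges = [(root, [0, 0, 1, 0]), (root, [1, 0, 1, 1])]
print(np.exp(tree_log_prob(edges, p=0.1)))
```

Working with log-probabilities avoids numerical underflow when trees have many edges or documents are long.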

Dividing the log-likelihood by the constant $n\cdot\log(1-p)$, which is negative, turns likelihood maximisation into a simpler minimisation goal:

\begin{align*}
|E|+\tau(p)\cdot \sum_{\boldsymbol{u}\to\boldsymbol{v}} h(\boldsymbol{u},\boldsymbol{v})\to\min
\end{align*}

where $|E|$ is the number of edges and

\begin{align*}
\tau(p)=\frac{\log\left(\frac{p}{1-p}\right)}{n \cdot \log(1-p)}
=\frac{1}{n}\cdot \Bigl(\frac{\log p}{\log(1-p)}-1\Bigr)\,.
\end{align*}

This implies that among trees of equal size we should take the one with fewer changes, provided $p<\frac{1}{2}$ so that $\tau(p)>0$. For trees of different sizes, the choice depends on the value of $\tau(p)$.
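As a sanity check on the derivation, here is a minimal sketch of the minimisation objective. The function names `tau` and `objective` and the example edge counts are assumptions made for illustration only.

```python
import numpy as np

def tau(p, n):
    """Weight of a single bit flip relative to the cost of one edge."""
    return (np.log(p) / np.log(1 - p) - 1) / n

def objective(num_edges, total_hamming, p, n):
    """Value of |E| + tau(p) * sum of Hamming distances (smaller is better)."""
    return num_edges + tau(p, n) * total_hamming

n, p = 4, 0.1

# Consistency check: the objective equals log Pr[T] / (n * log(1 - p)).
num_edges, total_hamming = 2, 2
log_prob = num_edges * n * np.log(1 - p) + total_hamming * np.log(p / (1 - p))
assert np.isclose(objective(num_edges, total_hamming, p, n),
                  log_prob / (n * np.log(1 - p)))

# Hypothetical comparison: a smaller tree with more changes
# against a larger tree with fewer changes.
small_tree = objective(num_edges=4, total_hamming=5, p=p, n=n)
large_tree = objective(num_edges=5, total_hamming=3, p=p, n=n)
print(small_tree, large_tree)
```

For small $p$ the weight $\tau(p)$ is large, so extra edges are cheap compared to extra bit flips; as $p$ approaches $\frac{1}{2}$ the weight tends to zero and smaller trees win.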

# Homework

## 1.1 Optimal solution for fixed task (<font color='red'>2p</font>)
Let $0010$, $1011$, $1001$, $0011$, $1011$ represent features present or missing in variations of the same text coming from different sources. Find the most probable history based on the naive mutation model. Recall that the maximum likelihood solution can be found by solving the minimisation task:
\begin{align*}
|E|+\tau(p)\cdot\sum_{\boldsymbol{u}\to\boldsymbol{v}} h(\boldsymbol{u},\boldsymbol{v})\to \min\,.
\end{align*} 
Find an optimal solution for each tree size $|E|=6,\ldots, 16$ and the corresponding regions of mutation probabilities $p_k\in[a_k,b_k]$ where the tree of size $k$ provides an optimal solution.

## 1.2 Hierarchical clustering as an approximation to ancestral reconstruction (<font color='red'>2p</font>)

First implement the naive mutation model as `generate_data`. For that you need to fix the number of child nodes for each document. Assume that the number of children follows a Poisson distribution with an expected value of 2. Use the Hamming distance in the clustering algorithm and try out different clustering methods from [`scipy.cluster.hierarchy.linkage`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage). Define the goodness of clustering using the [Robinson–Foulds metric](https://en.wikipedia.org/wiki/Robinson–Foulds_metric). The latter is implemented in the `ete3` package.

In [23]:
from ete3 import Tree
t1 = Tree('(((a,b),c), ((e, f), g));')
t2 = Tree('(((a,c),b), ((e, f), g));')
print('Normalised RF-metric {:.2f}'.format(t1.compare(t2)['norm_rf']))

Normalised RF-metric 0.25
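The clustering step of exercise 1.2 might be set up as follows. This is only a sketch under stated assumptions: the toy matrix `X`, the labels and the helper `to_newick` are illustrative, and the `average` linkage is just one of the methods the exercise asks you to compare. It does not implement `generate_data`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist

# Toy binary documents (illustrative values, not the homework data).
X = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
])
n = X.shape[1]

# pdist's 'hamming' metric returns the fraction of differing positions,
# so multiply by n to recover the number of differing bits.
D = pdist(X, metric='hamming') * n

# Agglomerative clustering on the condensed distance matrix.
Z = linkage(D, method='average')

def to_newick(node, labels):
    """Convert a scipy ClusterNode into a Newick string without branch lengths."""
    if node.is_leaf():
        return labels[node.id]
    return '({},{})'.format(to_newick(node.left, labels),
                            to_newick(node.right, labels))

labels = ['a', 'b', 'c', 'd']
newick = to_newick(to_tree(Z), labels) + ';'
print(newick)
```

The resulting Newick string can be passed to `Tree` from `ete3` and compared with a reference tree in the same way as `t1` and `t2` above. Note that hierarchical clustering always yields a binary tree with documents as leaves, whereas the true history may contain documents as internal nodes, so the comparison is an approximation.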
