# Classification with `pyani` output

This notebook contains experiments in generating *continuous* classifications from `pyani` output. The general principle is as follows:

1. Generate a graph with each isolate as a node, whose edges contain information about %identity and %coverage for the pairwise comparison between genomes.
2. Prune the graph by removing edges that fall below a minimum level of %coverage.
3. Identify all complete (*k*-complete) subgraphs that remain at this coverage level, and note the minimum %identity *i* in each.

Then, for each of the initial graphs:

* Progressively prune the graph edges, from lowest to highest %identity.
* When a *k*-complete graph is observed, record it as a specific grouping/classification $G_{m,n}$, where $m$ is the %coverage used to construct the initial graph and $n$ is the %identity at which the *k*-complete graph is first observed.
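The steps above can be sketched in plain Python, without `networkx`, so that the logic is easy to follow in isolation. This is a minimal sketch under one assumption that the text does not confirm: a "*k*-complete graph" is taken to mean a connected component in which every pair of nodes is joined by an edge. The function names `connected_components` and `classify`, the `min_size` parameter (which skips single isolates), and the toy comparison values are all hypothetical and are not taken from the Buchnera data.

```python
from itertools import combinations


def connected_components(nodes, edges):
    """Yield each connected component as a set of nodes."""
    neighbours = {node: set() for node in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen = set()
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk from an unvisited node
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        yield component


def classify(nodes, comparisons, min_coverage, min_size=2):
    """Map each group (frozenset of nodes) to (m, n).

    m is the coverage threshold; n is the lowest identity threshold at
    which the group is a complete connected component (our assumption
    about what "k-complete" means).
    """
    # Drop comparisons below the coverage threshold m
    kept = [(u, v, ident) for u, v, cov, ident in comparisons
            if cov >= min_coverage]
    groups = {}
    # Raise the identity threshold through each observed value in turn
    for threshold in sorted({ident for _, _, ident in kept}):
        edges = [(u, v) for u, v, ident in kept if ident >= threshold]
        edge_set = {frozenset(edge) for edge in edges}
        for component in connected_components(nodes, edges):
            if len(component) < min_size:
                continue
            complete = all(frozenset(pair) in edge_set
                           for pair in combinations(component, 2))
            if complete:
                # setdefault keeps the first (lowest) threshold recorded
                groups.setdefault(frozenset(component),
                                  (min_coverage, threshold))
    return groups


# Toy data: (from, to, coverage, identity); values are made up
comparisons = [
    ("A", "B", 0.95, 0.99),
    ("A", "C", 0.92, 0.96),
    ("B", "C", 0.93, 0.97),
    ("C", "D", 0.40, 0.98),
]
groups = classify(["A", "B", "C", "D"], comparisons, min_coverage=0.9)
for members, (m, n) in groups.items():
    print(sorted(members), "G_{%s,%s}" % (m, n))
```

In this toy case the C–D comparison is discarded for low coverage, A, B and C form a complete group at 0.96 identity, and the A–B pair is recorded separately at 0.99 once the weaker edges to C have been pruned.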

# Implementing the algorithm

## Python dependencies

We're using `networkx` for the graph manipulations, and `pygraphviz` for some rendering.

In [None]:
%pylab inline

import pandas as pd
import networkx as nx

## Initial Graph construction

Load the coverage and identity data as `Pandas` dataframes:

In [None]:
# Data from a Buchnera comparison
covfile = "Buchnera_ANIm/ANIm_alignment_coverage.tab"
idfile = "Buchnera_ANIm/ANIm_percentage_identity.tab"

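# Load pyani outputs into dataframes, using the first column as the index
# (pd.DataFrame.from_csv was removed in pandas 1.0; read_csv replaces it)
covdata = pd.read_csv(covfile, sep='\t', index_col=0)
iddata = pd.read_csv(idfile, sep='\t', index_col=0)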

Create nodes and edges for an undirected graph. Coverage can differ between the two directions of a comparison, so each edge takes the *minimum* of the two values; identity should be the same in both directions.

In [None]:
# Our node names will be the isolate IDs
node_names = covdata.columns

# Loop over each isolate ID and get all coverage/identity values
rows_list = []
for idx, node_from in enumerate(node_names[:-1]):
    for node_to in node_names[idx+1:]:
        datadict = {'from': node_from,
                    'to': node_to,
                    'coverage': min(covdata[node_from][node_to],
                                    covdata[node_to][node_from]),
                    'identity': iddata[node_from][node_to]}
        rows_list.append(datadict)

# Create dataframe of from/to nodes, coverage and identity (in that order)
node_data = pd.DataFrame(rows_list, columns=['from', 'to', 'coverage', 'identity'])

Build a `networkx` undirected graph from this data:

In [None]:
# Create graph
G = nx.Graph()

# Function to add an edge to the graph
def add_edge(row, graph, cov_thresh=0.5):
    if row['coverage'] >= cov_thresh:
        graph.add_edge(row['from'], row['to'],
                       coverage=row['coverage'],
                       identity=row['identity'],
                       difference=1-row['identity'],
                       logident=abs(log(row['identity'])))

# Add nodes, then loop over rows in node data, adding edges
for node in node_names:
    G.add_node(node)

errs = node_data.apply(add_edge, axis=1, args=(G,), cov_thresh=0.5)

Visualising the graph with `pylab`:

In [None]:
pos = nx.spring_layout(G, weight='identity')
nx.draw(G, pos)

In [None]:
pos = nx.spring_layout(G, weight='difference')
nx.draw(G, pos)

In [None]:
pos = nx.spring_layout(G, weight='logident')
nx.draw(G, pos)

In [None]:
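# Report the size of each connected component
# (nx.connected_component_subgraphs was removed in networkx 2.4)
for component in nx.connected_components(G):
    print(len(component))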

In [None]:
for c in nx.find_cliques(G):
    print(c)

In [None]:
# Create graph
H = nx.Graph()

# Add nodes, then loop over rows in node data, adding edges
for node in node_names:
    H.add_node(node)

errs = node_data.apply(add_edge, axis=1, args=(H,), cov_thresh=0.9)

In [None]:
nx.draw(H)

In [None]:
for c in nx.find_cliques(H):
    print(c)

In [None]:
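# Remove edges below 95% identity. Copy the edge view to a list first,
# because the graph cannot be modified while its edges are being iterated
edgelist = list(H.edges(data='identity'))
for e in edgelist:
    if e[-1] < 0.95:
        H.remove_edge(e[0], e[1])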

In [None]:
nx.draw(H)

In [None]:
for c in nx.find_cliques(H):
    print(len(c), c)

In [None]:
len(H.nodes())

In [None]:
# Remove edges below 99.9% identity, again iterating over a list copy
edgelist = list(H.edges(data='identity'))
for e in edgelist:
    if e[-1] < 0.999:
        H.remove_edge(e[0], e[1])

In [None]:
nx.draw(H)

In [None]:
for c in nx.find_cliques(H):
    print(len(c), c)
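
The last few cells repeat the same two steps at 0.95 and 0.999 identity: remove the edges below a threshold, then list the remaining cliques. One way this could be wrapped up is sketched below. The helper name `cliques_at_identity` and the toy graph are hypothetical and not part of `pyani` or `networkx`. Unlike the cells above, the helper works on a copy, so the original graph is left unchanged.

```python
import networkx as nx


def cliques_at_identity(graph, min_identity, attr="identity"):
    """Return maximal cliques left after dropping edges below min_identity.

    Cliques come back as sorted lists of nodes, largest clique first.
    The input graph is copied, not modified.
    """
    pruned = graph.copy()
    # Collect the weak edges first, then remove them in one call, so the
    # edge view is never modified while it is being iterated
    weak = [(u, v) for u, v, value in pruned.edges(data=attr)
            if value < min_identity]
    pruned.remove_edges_from(weak)
    return sorted((sorted(c) for c in nx.find_cliques(pruned)),
                  key=lambda c: (-len(c), c))


# Toy graph with made-up identities (not the Buchnera data)
toy = nx.Graph()
toy.add_edge("A", "B", identity=0.999)
toy.add_edge("A", "C", identity=0.96)
toy.add_edge("B", "C", identity=0.97)
toy.add_edge("C", "D", identity=0.90)

for threshold in (0.95, 0.999):
    print(threshold, cliques_at_identity(toy, threshold))
```

Note that `nx.find_cliques` reports an isolated node as a clique of size one, so isolates that lose all their edges still appear in the output.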