## ROH analysis

This notebook analyses ROH inferred with scikit-allel's `roh_mhmm` and bcftools roh, and plots the joint distribution of fROH and nROH per individual.

In [None]:
# Import our libs
import sgkit as sg
import hmmlearn
import json
import hashlib
import allel
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [None]:
# Define some useful parameters

# Load and filter metadata
df_samples = pd.read_csv('/Users/dennistpw/Projects/AsGARD/metadata/cease_combinedmetadata_noqc.20250212.csv')

# Chrom / lengths dict
scaflens={'CM023248' : 93706023,
'CM023249' : 88747589,
'CM023250' : 22713616}

# px config
config = {
  'toImageButtonOptions': {
    'format': 'png', # one of png, svg, jpeg, webp
    'filename': 'custom_image',
    'height': 500,
    'width': 700,
    'scale':6 # Multiply title/legend/axis/canvas sizes by this factor
  }
}

# Palettes
pop_code_cols = {
    'APA' : '#ff7f00', #orange
    'SAE' : '#6a3d9a', #dark purple
    'SAR' : '#cab2d6', #lighter purple
    'IRS' : '#c27a88', #not sure yet
    'IRH' : '#c57fc9', #not sure yet
    'INB' : '#96172e', #darkred
    'INM' : '#f03e5e', #lightred
    'DJI' : '#507d2a', #sap green
    'ETB' : '#007272', #cobalt turq
    'ETS' : '#33a02c',#green
    'ETW' : '#a6cee3',#cerulean
    'SUD' : '#fccf86',#ochre
    'YEM' : '#CC7722'#pinkish
}

In [None]:
# Define functions for analysis

# Hashing func
def hash_params(*args, **kwargs):
    """Helper function to hash analysis parameters."""
    o = {
        'args': args,
        'kwargs': kwargs
    }
    s = json.dumps(o, sort_keys=True).encode()
    h = hashlib.md5(s).hexdigest()
    return h

def infer_roh(ind, chrom, analysis_name, results_dir):
    """Infer ROH for one individual on one chromosome, caching the result as a CSV."""

    # Construct a key to save the results under
    results_key = hash_params(
        ind=ind,
        chrom=chrom,
        analysis_name=analysis_name
    )

    # Define paths for results files
    data_path = f'{results_dir}/{results_key}-roh.csv'

    try:
        # Try to load previously generated results
        df_roh = pd.read_csv(data_path)
        return df_roh
    except FileNotFoundError:
        # No previous results available, need to run analysis
        print(f'running analysis: {results_key}')

    # Load data
    ds = sg.load_dataset(f'/Users/dennistpw/Projects/AsGARD/data/variants_combined_cohorts/combined_cohorts.{chrom}.zarr')

    # Locate selected samples
    loc_samples = df_samples['sample_id'] == ind
    ds = ds.isel(samples=loc_samples)

    # Subset to accessible sites only and load genotypes
    print('subsetting to accessible sites only')
    accmask = ds['is_accessible'].compute()
    ds = ds.sel(variants=(accmask))
    gt = allel.GenotypeArray(ds['call_genotype'])
    gt = gt[:,0]

    # Get variant position
    pos = ds['variant_position'].compute()

    # Infer ROH for ind / chrom
    print(f'computing ROH for {ind}, {chrom}')
    df_roh = allel.roh_mhmm(gv=gt, pos=pos, contig_size = scaflens[chrom])[0]
    df_roh['ind'] = ind
    df_roh['chrom'] = chrom
    
    
    # Save results to hash cache
    df_roh.to_csv(data_path, index=False)
    print(f'saved results: {results_key}')
    return df_roh

# Plotting function for big ROH df
def plot_roh(
        roh_df,
        length = 1000,
        attr1 = 'fROH',
        attr2 = 'count',
        colour='analysis_pop',
        tit = 'allel',
        metadata = df_samples,
        palette = None,  # dict mapping colour-column values to colours; None uses plotly defaults
        **kwargs,
        ):
    
    # Aggregate roh data and get fROH
    roh_df = roh_df[roh_df['length'] > length]
    roh_data_agg = roh_df.groupby('ind')['length'].agg(['count', 'sum', 'median', 'skew'])
    roh_data_agg['fROH'] = roh_data_agg['sum'] / sum(scaflens.values())
    
    # Join to df_samples by sample_id
    metadata = metadata.set_index('sample_id', drop=False)
    roh_data_agg = roh_data_agg.join(metadata)
    roh_data_agg['size'] = 1 #hack to enable size control of pts

    # Define plot options
    # Labels
    labs = (f"nROH > {length/1000} kb", 'fROH', 'State')  # tuple of labels

    plot_kwargs = dict(
        width=800,
        height=600,
        template='simple_white',
        hover_name='sample_id',
        title = f'{tit} ROH output, length > {length}',
        hover_data=[
            'sample_id',
            'admin1_name',
            'location', 
            'country', 
        ],
        size='size',
        color_discrete_map = palette,
        size_max=8,
        opacity=0.9,
        render_mode='svg',
    )

    # apply any user overrides
    plot_kwargs.update(kwargs)

    fig = px.scatter(roh_data_agg,
            x = attr1,
            y = attr2,
            color=colour,
            **plot_kwargs)
    fig.show()
    # Return aggregate data
    return roh_data_agg


In [None]:
# Iterate over inds and chroms and infer ROH. This takes about a day using a single processor on my laptop for 500 mosquitoes
# Could / should probably speed this up by multiprocessing.
rohlist = []
for chrom in scaflens.keys():
    for ind in df_samples['sample_id']:
        roh_df = infer_roh(ind, chrom, 'default', '../data/roh_20240920/')
        rohlist.append(roh_df)
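
The loop above could be parallelised along the following lines. This is a minimal sketch using the standard library's `concurrent.futures`, not a tested replacement: `run_roh_jobs` and its parameters are hypothetical names. It assumes the worker can be pickled for use in worker processes, which may mean moving `infer_roh` (and the `df_samples` / `scaflens` globals it reads) into an importable module rather than defining it in the notebook.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product


def run_roh_jobs(worker, inds, chroms, max_workers=4,
                 executor_cls=ProcessPoolExecutor):
    """Run worker(ind, chrom) for every ind/chrom pair (hypothetical helper).

    Results are returned in the same order as the nested loop
    (chrom outer, ind inner), regardless of completion order.
    """
    tasks = list(product(chroms, inds))
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, ind, chrom) for chrom, ind in tasks]
        return [future.result() for future in futures]


# Possible usage (assumes infer_roh is importable and picklable):
# from functools import partial
# worker = partial(infer_roh, analysis_name='default',
#                  results_dir='../data/roh_20240920/')
# rohlist = run_roh_jobs(worker, df_samples['sample_id'], scaflens.keys())
```

Because each ind/chrom pair is cached under its own hash key, the jobs write to separate files and should not collide.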

In [None]:
#Concat into a big dataframe
roh_data = pd.concat(rohlist)

In [None]:
#have a look at total length distribution by chrom
fig = px.histogram(roh_data, x="length", facet_col='chrom')
fig.show()

Ok, the long tail makes this distribution impossible to visualise - let's try truncating the tail

In [None]:
#have a look at total length distribution by chrom - with length filter of < 50k
fig = px.histogram(roh_data[roh_data['length'] < 50000], x="length", facet_col='chrom')
fig.show()

This is more informative: most ROH in the genome are below 20 kb. Ag1000G uses a length cutoff of 100 kb.
The scale of LD is important for understanding whether we are falsely inferring short linkage blocks as ROH. However, short LD blocks still reflect ancestral demographic events. The Pemberton sheep study examines ROH/IBD length over different categories, so let's try this here with: all segments, segments > 100 kb only, segments > 1 Mb only, and short (< 100 kb) segments only.
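
The length categories listed above could be summarised per individual roughly as follows. This is a sketch under assumptions: `LENGTH_CLASSES` and `summarise_roh_classes` are hypothetical names, the input is assumed to have the `ind` and `length` columns used elsewhere in this notebook, and lower bounds are exclusive to match the `>` filter in `plot_roh`.

```python
import pandas as pd

# Hypothetical class boundaries in bp: (exclusive lower, inclusive upper)
LENGTH_CLASSES = {
    'all': (0, float('inf')),
    'short_lt_100kb': (0, 100_000),
    'gt_100kb': (100_000, float('inf')),
    'gt_1Mb': (1_000_000, float('inf')),
}


def summarise_roh_classes(roh, genome_length):
    """Return nROH and fROH per individual for each length class."""
    frames = []
    for name, (lower, upper) in LENGTH_CLASSES.items():
        subset = roh[(roh['length'] > lower) & (roh['length'] <= upper)]
        agg = subset.groupby('ind')['length'].agg(nROH='count', total='sum')
        agg['fROH'] = agg['total'] / genome_length
        agg['length_class'] = name
        frames.append(agg.reset_index())
    return pd.concat(frames, ignore_index=True)
```

Individuals with no ROH in a given class are simply absent from that class in the output, so they would need filling with zeros before plotting if that matters.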

While we're doing this, let's analyse bcftools roh output alongside.

In [None]:
cols = ['#a6cee3','#1f78b4','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f','#ff7f00','#cab2d6']

roh_0 = plot_roh(attr2 = 'count',
         length=0,
         roh_df=roh_data,
         tit = 'bcftools',
         palette=pop_code_cols)

Some interesting signals here: there are clearly more ROH in Saudi Arabia, India, Afghanistan and Pakistan, but in the native range, especially Afghanistan and Pakistan, F tends to be much lower, perhaps because this is the much more established population. The invasive population shows much longer tails and much higher F.

In [None]:
roh_agg = plot_roh(attr2 = 'count',
         length=25_000,
         roh_df=roh_data,
        palette=pop_code_cols)

Applying a 25 kb length filter removes most of the Pakistani and Yemeni segments.

In [None]:
roh_100k = plot_roh(attr2 = 'count',
         length=100_000,
         roh_df=roh_data,
        palette=pop_code_cols)

Now they begin to resemble each other much more. bcftools generally seems to call more ROH than scikit-allel. Which one do we pick?

In [None]:
roh_agg = plot_roh(attr2 = 'count',
         length=1e6,
         roh_df=roh_data,
        palette=pop_code_cols)

In [None]:
from matplotlib.ticker import LinearLocator, MaxNLocator, FuncFormatter
sns.set_style("ticks")


# Plot for paper
def plot_roh_for_paper(df, figname):


    sns.set_theme(rc={'figure.figsize':(7,7)}, style="ticks")

    scatter_kws = {
        's': 150,  # Point size
        'edgecolor': 'white',  # Thin white boundary
        'linewidth': 0.5,  # Thickness of the boundary
        'alpha': 0.6  # Point opacity (60% opacity)
    } 


    # Plot the first scatterplot
    g = sns.scatterplot(data=df, x='fROH', y='count', hue='pop_code', palette=pop_code_cols, **scatter_kws)
    
    #Format axes and rm legend
    g.set_ylim(10, 210)
    g.set_xlim(0, 0.7)
    g.set(xlabel='fROH', ylabel='nROH')
    g.xaxis.set_major_locator(MaxNLocator(nbins=4, integer=True))  # At most 4 intervals (5 ticks) on the x-axis
    g.yaxis.set_major_locator(MaxNLocator(4, integer=True))  # At most 4 intervals (5 ticks) on the y-axis

    #g.yaxis.set_major_locator(MaxNLocator(integer=True))  # Integer ticks on the y-axis
    g.xaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{x:.2g}'))
    #g.spines['left'].set_visible(False)
    #g.spines['bottom'].set_visible(False)  # Remove x-axis spine
    g.spines['right'].set_visible(False)  # Remove the right spine
    g.spines['top'].set_visible(False)  # Remove the top spine
    g.spines['left'].set_position(('outward', 10))  # Move the left spine outward (further left)
    g.spines['bottom'].set_position(('outward', 10))  # Move the bottom spine outward (further down)
    g.yaxis.set_tick_params(labelsize = 14)
    g.xaxis.set_tick_params(labelsize = 14)

   # Set axis labels and font size
    g.set_xlabel('fROH', fontsize=16)
    g.set_ylabel('nROH', fontsize=16)
    g.legend(title='Cohort')
    #rm spines
    #g.legend_.remove()


    # Despine the axes

    plt.savefig(f'../figures/{figname}.svg', format='svg')

roh_plot = roh_100k[roh_100k['fROH'] < 0.9]

# Make final ROH for manuscript
plot_roh_for_paper(roh_plot, 'roh_1e5')  # the function appends the .svg extension

In [None]:
# Plot scatterplots of fROH by location

# Aggregate roh data and get fROH
roh_df = roh_data[roh_data['length'] > 1e5]
roh_data_agg = roh_df.groupby('ind')['length'].agg(['count', 'sum', 'median', 'skew'])
roh_data_agg['fROH'] = roh_data_agg['sum'] / sum(scaflens.values())

# Join to df_samples by sample_id
metadata = df_samples.set_index('sample_id', drop=False)
roh_data_agg = roh_data_agg.join(metadata)
roh_data_agg['size'] = 1 #hack to enable size control of pts

roh_data_agg = roh_data_agg[(roh_data_agg['country'] == "Ethiopia") | (roh_data_agg['country'] == "Djibouti")]

# Define plot options
# Labels
#labs = (f"nROH > {length/1000} kb", 'fROH', 'State'), #tuple of labels

plot_kwargs = dict(
    width=800,
    height=600,
    template='simple_white',
    hover_name='sample_id',
    title = 'ROH output, length > 1e5',
    hover_data=[
        'sample_id',
        'admin1_name',
        'location', 
        'country', 
    ],
    size='size',
    #color_discrete_map = palette,
    size_max=8,
    opacity=0.9,
    render_mode='svg',
)

# apply any user overrides
#plot_kwargs.update(kwargs)

fig = px.scatter(roh_data_agg,
        x = 'fROH',
        y = 'count',
        color='location',
        **plot_kwargs)
fig.show()

We can see that the properties of the f/nROH distribution differ by length. The literature suggests that inference of ROH shorter than the typical LD length is prone to artefacts (though the distinction between ROH and LD segments as relics of ancestry is unclear, as LD segments are also indicative of ancestry, albeit in the recent past).

Many populations show signs of extreme inbreeding: an fROH of 0.2 is still quite high for mosquitoes! Even in humans, some of the more inbred populations (see Ceballos, 2018) have fROH of around 0.1, and An. gambiae are for the most part below 0.1.

Different demographic events leave footprints in the length and number of ROH in the genome. Thus, it is possible to date demographic events based on different ROH distributions. Each population and species has specific ROH characteristics - see [Pemberton et al, 2012](https://www.cell.com/ajhg/pdf/S0002-9297(12)00323-0.pdf). This paper (and others subsequently) applies clustering approaches to identify different categories of ROH lengths in different populations.

The Pemberton paper uses `mclust` in R, a Gaussian-mixture-based method for clustering data. This approach is also implemented in `scikit-learn`. A Gaussian mixture model (GMM) is a probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Mixture models generalise k-means clustering to incorporate information about the covariance of the data. K-means is an algorithm, whereas a GMM is a model.
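
To make the GMM idea concrete, here is a minimal sketch using `sklearn.mixture.GaussianMixture` to assign ROH to length classes. The function name `classify_roh_lengths`, the choice of three classes, and fitting on log10 lengths are all assumptions made for illustration; they are not taken from the Pemberton paper and would need checking against its methods.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def classify_roh_lengths(lengths, n_classes=3, random_state=0):
    """Fit a 1-D Gaussian mixture to log10 ROH lengths (hypothetical helper).

    Returns (labels, model), where labels are relabelled so that
    0 is the class with the shortest mean length and n_classes - 1 the longest.
    """
    x = np.log10(np.asarray(lengths, dtype=float)).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_classes, random_state=random_state)
    gmm.fit(x)

    # Component indices are arbitrary, so rank components by their mean
    order = np.argsort(gmm.means_.ravel())
    rank = np.empty(n_classes, dtype=int)
    rank[order] = np.arange(n_classes)
    return rank[gmm.predict(x)], gmm
```

In this notebook the call might look like `labels, gmm = classify_roh_lengths(roh_data['length'])`, with the number of components chosen by comparing BIC (`gmm.bic`) across fits rather than fixed in advance.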

Another interesting paper on ROH distributions, this time in Soay sheep, uses the expected generation time and recombination rate to infer likely ROH ages. The equation g = 100 / (2rL), where g = age of the segment in generations, r = recombination rate in cM/Mb, and L = segment length in Mb, can be used to date ROH segments. We can assume a recombination rate similar to An. gambiae: either a genomewide average of around 1 cM/Mb, or, dividing the corrected (see [this paper](https://academic.oup.com/genetics/article/153/1/251/6047849#325536228)) chr 2 linkage map length in cM by the chr 2 length in Mb (128/93.1), around 1.37 cM/Mb. As an example, for a group of ROH with length 1 Mb we would expect an age of 100 / (2 × 1.37 × 1) ≈ 37 generations. Assuming around 11 generations per year in Anopheles (verify/modify this for a range), that puts a 1 Mb segment at around 3 years old.
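
The dating arithmetic can be written as a small helper. This is a sketch only: the function names are hypothetical, and the default of 11 generations per year is the rough figure quoted above, which still needs verifying.

```python
def roh_age_generations(length_mb, recomb_rate_cm_per_mb):
    """Expected age in generations of an ROH segment: g = 100 / (2 * r * L)."""
    return 100 / (2 * recomb_rate_cm_per_mb * length_mb)


def roh_age_years(length_mb, recomb_rate_cm_per_mb, generations_per_year=11):
    """Convert the age in generations to years (generations_per_year is an assumption)."""
    generations = roh_age_generations(length_mb, recomb_rate_cm_per_mb)
    return generations / generations_per_year


# Worked example from the text: 1 Mb segment, r = 1.37 cM/Mb
print(round(roh_age_generations(1, 1.37), 1))  # 36.5 generations
print(round(roh_age_years(1, 1.37), 1))        # 3.3 years
```

Because g scales as 1/L, a 100 kb segment under the same assumptions would be ten times older, at roughly 365 generations.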