## Setting up the simulation

All the data needed to run this notebook is in [test-data](test-data/).

In [1]:
import sys
import pandas as pd
import gzip
import time
import numpy as np
import fwdpy11
import msprime

sys.path.append('../')
from simutils import utils

fwdpy11.__version__

'0.17.2.dev15+g84f4f8ed'

# Simulation data

## Loading the data

I have generated 200 random samples of 1 Mb length from the genome. Here, I take one of those samples as an example to set up the simulation.

**NOTE:**

- I subtract the start position of the region from the intervals and the recombination map, so that the region starts at position zero. (I am not sure whether this is needed.)
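
A minimal sketch of the shift described above, assuming intervals are half-open `(start, end)` pairs in absolute chromosome coordinates. `shift_intervals` is a hypothetical helper for illustration and is not part of `simutils`. The example values are the first two coding intervals of this region, written back in absolute coordinates.

```python
def shift_intervals(intervals, region_start):
    """Return intervals re-expressed relative to region_start."""
    return [(start - region_start, end - region_start) for start, end in intervals]


region_start = 29_000_000  # start of the sampled region on chromosome 22
absolute = [(29_042_494, 29_042_569), (29_043_298, 29_043_430)]
relative = shift_intervals(absolute, region_start)
print(relative)
```

As far as I can tell, the shift is useful because the population further down is created with `length=int(1e6)`, so region coordinates are expected to fall inside `[0, 1e6)`.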


I created a Python module, [simutils](../simutils/), where
I put some functions to load the data.

In [2]:
sim_dat = utils.simuldata(path_to_samples='test-data/', sample_id=23, path_to_genetic_maps='test-data/')
print(sim_dat)

Region: 22, start: 29000000, end: 30000000


# Genomic intervals

In [3]:
## Coding intervals are stored as a data frame
sim_dat.coding_intervals.head()

Unnamed: 0,chro,start,end
0,22,42494,42569
1,22,43298,43430
2,22,44779,44890
3,22,46715,46883
4,22,48388,48491


In [4]:
## Same for noncoding
sim_dat.noncoding_intervals.head()

Unnamed: 0,chro,start,end
0,22,0,42494
1,22,42569,43298
2,22,43430,44779
3,22,44890,46715
4,22,46883,48388
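
The two tables are related: in the rows shown, each noncoding interval is the gap between consecutive coding intervals. A small sketch of that relationship, assuming half-open, non-overlapping intervals; `complement` is a hypothetical helper, not a function from `simutils`.

```python
def complement(intervals, length):
    """Return the gaps left by half-open intervals on [0, length)."""
    gaps = []
    previous_end = 0
    for start, end in sorted(intervals):
        if start > previous_end:
            gaps.append((previous_end, start))
        previous_end = max(previous_end, end)
    if previous_end < length:
        gaps.append((previous_end, length))
    return gaps


# First five coding intervals from the table above
coding = [(42494, 42569), (43298, 43430), (44779, 44890), (46715, 46883), (48388, 48491)]
noncoding = complement(coding, length=1_000_000)
print(noncoding[:5])
```

This only illustrates the rows displayed here. The reported coding and noncoding lengths (28876 + 957378) add up to less than 1 Mb, so the real noncoding set is evidently not the full complement of the coding intervals.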


In [5]:
# we can get the length of the coding and noncoding intervals


print(
 f'coding length = {sim_dat.L(coding=True)}\nnoncoding length = {sim_dat.L(False)}'
)


coding length = 28876
noncoding length = 957378


## Recombination rate in the region

**NOTE:** I think the recombination rate in the map below
    is per base pair.

In [6]:
# Recombination map in the region
print(sim_dat.rmap)


┌──────────────────────────────────────────────┐
│left    │right    │       mid│  span│     rate│
├──────────────────────────────────────────────┤
│0       │2389     │    1194.5│  2389│  1.4e-09│
│2389    │2471     │      2430│    82│  9.3e-09│
│2471    │4331     │      3401│  1860│  1.5e-08│
│4331    │4527     │      4429│   196│  8.4e-09│
│4527    │5345     │      4936│   818│  1.3e-08│
│5345    │6551     │      5948│  1206│  9.5e-10│
│6551    │6763     │      6657│   212│  1.1e-09│
│6763    │6844     │    6803.5│    81│  9.9e-10│
│6844    │7062     │      6953│   218│  1.3e-09│
│7062    │8663     │    7862.5│  1601│  3.5e-09│
│⋯       │⋯        │         ⋯│     ⋯│        ⋯│
│994559  │994665   │    994612│   106│        0│
│994665  │994787   │    994726│   122│        0│
│994787  │995001   │    994894│   214│  4.7e-11│
│995001  │995292   │  995146.5│   291│  3.4e-11│
│995292  │996551   │  995921.5│  1259│    4e-11│
│996551  │997514   │  997032.5│   963│  3.1e-11│
│997514  │997555   

**NOTE:** Intervals are given as positions relative to the start of the region.

## Scaled mutation rates

In [7]:
print(f'''

Scaled mutation rates

Non-coding ml: {sim_dat.ml_noncoding}
synonymous ml: {sim_dat.ml_synonymous}
missense ml: {sim_dat.ml_missense}
LOF ml: {sim_dat.ml_LOF}
''')



Scaled mutation rates

Non-coding ml: 0.00957497053154
synonymous ml: 7.921106974e-05
missense ml: 0.00019355650402
LOF ml: 1.151627658e-05



In [8]:
# we also have the per-base mutation rates

print(f'''

Per base mutation rates 

Non-coding m: {sim_dat.m_noncoding}
synonymous m: {sim_dat.m_synonymous}
missense m: {sim_dat.m_missense}
LOF m: {sim_dat.m_LOF}
''')



Per base mutation rates 

Non-coding m: 1.0001243533421491e-08
synonymous m: 2.7431455097658955e-09
missense m: 6.703023411137277e-09
LOF m: 3.988182774622524e-10
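
The scaled rates printed earlier appear to be the per-base rates multiplied by the length of the corresponding intervals (noncoding length for noncoding, coding length for the other three). A quick arithmetic check using only the numbers printed in this notebook; the dictionary names are made up for the example.

```python
import math

coding_length = 28876
noncoding_length = 957378

# (per-base rate, interval length, reported scaled rate)
rates = {
    "noncoding": (1.0001243533421491e-08, noncoding_length, 0.00957497053154),
    "synonymous": (2.7431455097658955e-09, coding_length, 7.921106974e-05),
    "missense": (6.703023411137277e-09, coding_length, 0.00019355650402),
    "LOF": (3.988182774622524e-10, coding_length, 1.151627658e-05),
}

matches = {}
for name, (per_base, length, reported) in rates.items():
    scaled = per_base * length
    matches[name] = math.isclose(scaled, reported, rel_tol=1e-8)
    print(f"{name}: m * L = {scaled:.6e}, reported = {reported:.6e}")
```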



# Setting up the simulation

## Neutral regions

In [9]:
## we will label the mutations according to the functional category

mut_labels = {
    'neutral': 0,
    'missense': 1,
    'synonymous': 2,
    'LOF': 3,
}

In [10]:
# Construct the neutral regions from the noncoding intervals
# we also assume that synonymous mutations are neutral
# NOTE: we set the weight to the per base mutation rate


nregions = []
for _, noexon in sim_dat.noncoding_intervals.iterrows():
    nregions.append(
        fwdpy11.Region(beg=noexon.start, end=noexon.end, weight=sim_dat.m_noncoding, label=mut_labels['neutral'])
    
    )

# synonymous mutations are assumed to be neutral
for _, exon in sim_dat.coding_intervals.iterrows():
    nregions.append(
        fwdpy11.Region(beg=exon.start, end=exon.end, weight=sim_dat.m_synonymous, label=mut_labels['synonymous'])
    
    )

In [11]:
nregions[:5]

[fwdpy11.Region(beg=0, end=42494, weight=1.0001243533421491e-08, coupled=True, label=0),
 fwdpy11.Region(beg=42569, end=43298, weight=1.0001243533421491e-08, coupled=True, label=0),
 fwdpy11.Region(beg=43430, end=44779, weight=1.0001243533421491e-08, coupled=True, label=0),
 fwdpy11.Region(beg=44890, end=46715, weight=1.0001243533421491e-08, coupled=True, label=0),
 fwdpy11.Region(beg=46883, end=48388, weight=1.0001243533421491e-08, coupled=True, label=0)]

## Distributions of effect sizes | Selected regions

- For now I use Aaron's inferred DFEs [see here](https://moments.readthedocs.io/en/main/modules/dfe.html#all-data).
- The weights establish the relative probability that a mutation comes from a given region.

**NOTE:**

- When multiple “sregion” objects are used, the default behavior is to multiply the input weight by `end - beg`.
- The weights should depend on the mutation type (i.e. synonymous, missense). We will
use the per-base mutation rate as the weight.
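
To make the weighting concrete, here is a sketch in plain Python (no fwdpy11 calls) of how per-base weights turn into relative probabilities, assuming each weight is multiplied by the region length as described in the note above. `label_probabilities` is a hypothetical helper; the exons are the first two coding intervals of this region.

```python
def label_probabilities(regions):
    """Probability that a new mutation falls under each label.

    regions: list of (label, beg, end, per_base_weight) tuples.
    Assumes the effective weight is per_base_weight * (end - beg).
    """
    effective = [(label, weight * (end - beg)) for label, beg, end, weight in regions]
    total = sum(w for _, w in effective)
    probs = {}
    for label, w in effective:
        probs[label] = probs.get(label, 0.0) + w / total
    return probs


m_missense = 6.703023411137277e-09
m_lof = 3.988182774622524e-10
exons = [(42494, 42569), (43298, 43430)]

regions = []
for beg, end in exons:
    regions.append(("missense", beg, end, m_missense))
    regions.append(("LOF", beg, end, m_lof))

probs = label_probabilities(regions)
print(probs)
```

Because missense and LOF regions cover the same exons, the lengths cancel and the split reduces to the ratio of the per-base rates.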


**Comments:**

- The selection and dominance coefficients should also depend on the mutation class. We'll need to pick an appropriate DFE for each case.


### DFE for missense variants

The fitted parameters are the shape ($\alpha$) and scale ($\beta$) of a gamma distribution.

- Ne = 11372.91
- shape: 0.1596
- scale: 2332.3

The mean of the gamma distribution is $\alpha\beta$. I assume the DFE was fit on the population-scaled coefficient $2N_e s$, so I divide by $2N_e$ to get the mean of $s$.


In [12]:
Ne = 11372.91
shape = 0.1596
scale = 2332.3
mean_s = (shape * scale) / (2 * Ne)
mean_s

0.01636498838028262
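
As a sanity check on the conversion, the sketch below draws from a gamma distribution with the fitted shape and scale using the standard library, rescales each draw by $2N_e$, and compares the sample mean with `shape * scale / (2 * Ne)`. It assumes the fitted distribution is over $2N_e s$; the seed and sample size are arbitrary choices for the example.

```python
import random

Ne = 11372.91
shape = 0.1596
scale = 2332.3

rng = random.Random(42)  # arbitrary seed, only for reproducibility
n_draws = 200_000

# random.gammavariate(alpha, beta) has mean alpha * beta
draws = [rng.gammavariate(shape, scale) / (2 * Ne) for _ in range(n_draws)]

sample_mean = sum(draws) / n_draws
expected_mean = (shape * scale) / (2 * Ne)
print(f"expected mean s = {expected_mean:.5f}, sample mean s = {sample_mean:.5f}")
```

With a shape this small the distribution is very skewed: most draws are close to zero and a few are large, so the sample mean converges slowly.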

In [13]:
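# This will be the DFE for missense variants
# NOTE (to be checked): mean_s is positive here. As far as I understand, the sign of
# `mean` sets the sign of s, so deleterious mutations would need mean=-mean_s.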
fwdpy11.GammaS(beg=0, end=1, weight=1, mean=mean_s, shape_parameter=shape, h=1)

fwdpy11.GammaS(beg=0, end=1, weight=1, mean=0.01636498838028262, shape_parameter=0.1596, h=1, coupled=True, label=0, scaling=1.0)

In [14]:
# DFE for LOF
shape_lof = 0.3589
scale_lof = 7830.5
mean_s_lof = (shape_lof * scale_lof) / (2 * Ne)
mean_s_lof

0.1235552927966545

In [15]:
fwdpy11.GammaS(beg=0, end=1, weight=1, mean=mean_s_lof, shape_parameter=shape_lof, h=1)

fwdpy11.GammaS(beg=0, end=1, weight=1, mean=0.1235552927966545, shape_parameter=0.3589, h=1, coupled=True, label=0, scaling=1.0)

In [16]:
# Construct the selected regions from the exonic intervals
sregions = []
for _, exon in sim_dat.coding_intervals.iterrows():
    # missense
    sregions.append(
        fwdpy11.GammaS(
            beg=exon.start, end=exon.end,
            weight=sim_dat.m_missense,
            mean=mean_s, shape_parameter=shape,
            h=1,
            label=mut_labels['missense'])
    
    )
    # loss of function
    sregions.append(
        fwdpy11.GammaS(
            beg=exon.start, end=exon.end,
            weight=sim_dat.m_LOF,
            mean=mean_s_lof, shape_parameter=shape_lof,
            h=1,
            label=mut_labels['LOF'])
    
    )

In [17]:
sregions[:5]

[fwdpy11.GammaS(beg=42494, end=42569, weight=6.703023411137277e-09, mean=0.01636498838028262, shape_parameter=0.1596, h=1, coupled=True, label=1, scaling=1.0),
 fwdpy11.GammaS(beg=42494, end=42569, weight=3.988182774622524e-10, mean=0.1235552927966545, shape_parameter=0.3589, h=1, coupled=True, label=3, scaling=1.0),
 fwdpy11.GammaS(beg=43298, end=43430, weight=6.703023411137277e-09, mean=0.01636498838028262, shape_parameter=0.1596, h=1, coupled=True, label=1, scaling=1.0),
 fwdpy11.GammaS(beg=43298, end=43430, weight=3.988182774622524e-10, mean=0.1235552927966545, shape_parameter=0.3589, h=1, coupled=True, label=3, scaling=1.0),
 fwdpy11.GammaS(beg=44779, end=44890, weight=6.703023411137277e-09, mean=0.01636498838028262, shape_parameter=0.1596, h=1, coupled=True, label=1, scaling=1.0)]

In [18]:
nrec = len(sim_dat.rmap) - 1

In [19]:
recregions = []
for i in range(nrec):
    recregions.append(
     fwdpy11.PoissonInterval(
         beg=sim_dat.rmap.left[i],
         end=sim_dat.rmap.right[i],
         mean=sim_dat.rmap.rate[i] * sim_dat.rmap.span[i]
     )   
    )

In [20]:
recregions[:10]

[fwdpy11.PoissonInterval(beg=0.0, end=2389.0, mean=3.373172757461935e-06, discrete=False),
 fwdpy11.PoissonInterval(beg=2389.0, end=2471.0, mean=7.599999999885475e-07, discrete=False),
 fwdpy11.PoissonInterval(beg=2471.0, end=4331.0, mean=2.8360000000060556e-05, discrete=False),
 fwdpy11.PoissonInterval(beg=4331.0, end=4527.0, mean=1.6399999999694439e-06, discrete=False),
 fwdpy11.PoissonInterval(beg=4527.0, end=5345.0, mean=1.0490000000029642e-05, discrete=False),
 fwdpy11.PoissonInterval(beg=5345.0, end=6551.0, mean=1.139999999955066e-06, discrete=False),
 fwdpy11.PoissonInterval(beg=6551.0, end=6763.0, mean=2.3000000004547158e-07, discrete=False),
 fwdpy11.PoissonInterval(beg=6763.0, end=6844.0, mean=7.999999995789153e-08, discrete=False),
 fwdpy11.PoissonInterval(beg=6844.0, end=7062.0, mean=2.800000000191538e-07, discrete=False),
 fwdpy11.PoissonInterval(beg=7062.0, end=8663.0, mean=5.580000000005025e-06, discrete=False)]

## Total recombination rate

Is the total recombination rate the sum of the rates of all the regions?

In [21]:
total_recombination_rate = sum([x.mean for x in recregions])

print(f'total_recombination_rate is {total_recombination_rate}')

total_recombination_rate is 0.007796313172757488
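
On the question above: the number of crossovers in each interval is modelled as an independent Poisson variable, and a sum of independent Poisson variables is again Poisson with the summed mean. So the sum of the per-interval means is the expected number of crossovers per meiosis in the whole region, i.e. its genetic length in Morgans. A sketch with the first five rows of the map, using the rounded rates as displayed (so the per-interval means differ slightly from the unrounded ones):

```python
import math

# (left, right, rate per bp), first five rows of the map as displayed (rounded)
rmap_rows = [
    (0, 2389, 1.4e-09),
    (2389, 2471, 9.3e-09),
    (2471, 4331, 1.5e-08),
    (4331, 4527, 8.4e-09),
    (4527, 5345, 1.3e-08),
]

# Poisson mean per interval = per-bp rate * interval span
means = [(right - left) * rate for left, right, rate in rmap_rows]
partial_total = math.fsum(means)
print(f"expected crossovers in the first five intervals: {partial_total:.6e}")

# The total reported above, converted from Morgans to centimorgans
reported_total = 0.007796313172757488
centimorgans = reported_total * 100
print(f"region length: {centimorgans:.4f} cM")
```

The region is about 0.78 cM long over roughly 1 Mb.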


## Rates

We need to specify the total rates

In [22]:
#  The neutral mutation rate, selected mutation rate, and total recombination rate, respectively.
neutral_ml = sim_dat.ml_noncoding + sim_dat.ml_synonymous
selected_ml = sim_dat.ml_missense + sim_dat.ml_LOF

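# recombination_rate is left as None: as far as I understand, each PoissonInterval
# in recregions already carries its own mean, so no overall rate is passed here.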
rates = fwdpy11.MutationAndRecombinationRates(
    neutral_mutation_rate=neutral_ml,
    selected_mutation_rate=selected_ml,
    recombination_rate=None)


## Demography

To test the DFE I will use a constant-size population model, because it runs faster.

In [23]:
Ne = 500
pop = fwdpy11.DiploidPopulation(N=Ne, length=int(1e6))
pop.N
pop.tables.genome_length


1000000.0

## Setting up the parameters for a simulation


In [24]:
SIM_LEN = 10 * pop.N #TODO: should be 10

In [25]:
# the parameters that fwdpy11 needs to run the simulation
p = {
    "nregions": nregions,  # neutral regions (noncoding + synonymous)
    "gvalue": fwdpy11.Multiplicative(2.0),  # fitness model
    "sregions": sregions, 
    "recregions": recregions,
    "rates": rates,
    "prune_selected": True,
    "demography": fwdpy11.DiscreteDemography(),  # pass the demographic model
    "simlen": SIM_LEN
}

In [26]:
params = fwdpy11.ModelParams(**p)

In [27]:
# set up the random number generator
rng = fwdpy11.GSLrng(54321) 

In [None]:
# run the simulation
print('running simulation ...')
time1 = time.time()
fwdpy11.evolvets(
    rng, pop, params, simplification_interval=100, suppress_table_indexing=True
)
print("Simulation took", int(time.time() - time1), "seconds")

# simulation finished
print("Final population sizes =", pop.deme_sizes())

In [None]:
mkdir -p results

In [None]:
# save the simulation results
with gzip.open('results/sim-pop.gz', 'wb') as f:
    pop.pickle_to_file(f)

## Results

- I will get the SFS from 100 sampled nodes (haploid genomes) in deme 0, first for all mutations (neutral + selected) and then for selected mutations only.

In [None]:
nodes = np.array(pop.tables.nodes, copy=False)
alive_nodes = pop.alive_nodes
deme0_nodes = alive_nodes[np.where(nodes["deme"][alive_nodes] == 0)[0]]

In [None]:
# SFS (neutral + selected mutations)
pop.tables.fs([deme0_nodes[:100]], include_neutral=True)

In [None]:
# SFS selected
pop.tables.fs([deme0_nodes[:100]], include_neutral=False)
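
For reference, a site frequency spectrum simply counts how many sites carry the derived allele in 1, 2, ..., n of the n sampled genomes. The sketch below computes one from a tiny made-up genotype matrix in plain Python. It illustrates the definition only and is not how `pop.tables.fs` is implemented; `site_frequency_spectrum` is a hypothetical helper.

```python
def site_frequency_spectrum(genotypes, n_samples):
    """Count sites by derived allele count.

    genotypes: one row per site, one 0/1 entry per sampled genome.
    Returns a list of length n_samples + 1, where entry k is the
    number of sites with k derived copies in the sample.
    """
    sfs = [0] * (n_samples + 1)
    for site in genotypes:
        sfs[sum(site)] += 1
    return sfs


# Made-up data: 5 sites, 4 sampled genomes
genotypes = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]
sfs = site_frequency_spectrum(genotypes, n_samples=4)
print(sfs)
```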

In [None]:
ts = pop.dump_tables_to_tskit()
ts

# How many mutations for each class do we observe?

In [None]:
from collections import Counter
mut_clas = []
for m in ts.mutations():
    mut_clas.append(m.metadata['label'])
    
Counter(mut_clas)
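
The counter above is keyed by the integer labels. A small sketch of mapping them back to the class names by inverting the `mut_labels` dictionary defined earlier; the `observed_labels` list is made up for the example and stands in for the labels read from the tree sequence.

```python
from collections import Counter

mut_labels = {
    'neutral': 0,
    'missense': 1,
    'synonymous': 2,
    'LOF': 3,
}

# Invert the mapping: integer label -> class name
label_names = {label: name for name, label in mut_labels.items()}

# Made-up labels, standing in for m.metadata['label'] of each mutation
observed_labels = [0, 0, 2, 1, 0, 3, 1, 0]

counts = Counter(label_names[label] for label in observed_labels)
print(counts)
```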