# Compute expected SFS under selection (given gamma distribution for the DFE)


This analysis computes the expected site frequency spectrum (SFS) under selection, assuming the distribution of fitness effects (DFE) is a gamma distribution.
We use `moments` for this.

The example/tutorial code is in the `moments` documentation: [DFE inference](https://moments.readthedocs.io/en/main/modules/dfe.html)


The main goal of these notes is to compute the [expected SFS under selection](https://github.com/santiago1234/TheSciJournal/blob/main/journal-2022/Aaron-Notes/220714-Expected-SFS-under-selection.pdf).

Check also [these notes](https://github.com/santiago1234/TheSciJournal/blob/main/journal-2022/Aaron-Notes/220721-DFE-expected-SFS-scaling.pdf) from Aaron.

In [1]:
import numpy as np
import moments
import scipy.stats  # load the stats submodule explicitly; scipy.stats.gamma is used below
import pandas as pd
import sys
sys.path.append('../')

from simutils.utils import DFE_lof, DFE_missense, simuldata

In [2]:
## Load data
sim_dat = simuldata(path_to_samples='test-data/', sample_id=23, path_to_genetic_maps='test-data/')
sim_dat

Region: 22, start: 29000000, end: 30000000

## Neutral SFS

In [3]:
# First we obtain the neutral SFS

SFSs = dict()

# non coding represents: intergenic + intronic regions
SFSs['noncoding'] = moments.Demographics1D.snm([100])
SFSs['synonymous'] = moments.Demographics1D.snm([100])

## Selected SFS

In [4]:
def selection_spectrum(gamma):
    fs = moments.LinearSystem_1D.steady_state_1D(100, gamma=gamma)
    fs = moments.Spectrum(fs)
    return fs
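
As a rough cross-check on `selection_spectrum`, the standard diffusion result for the expected SFS under genic selection can be integrated numerically. This is only a sketch and not how `moments` obtains the spectrum (it solves a system of moment equations), so the two should agree approximately rather than exactly. `diffusion_sfs_entry` is a hypothetical helper. It assumes `gamma` follows the same sign convention as above (negative means deleterious), `theta = 1`, and moderate `|gamma|`, since this naive form overflows for very large values.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import comb


def diffusion_sfs_entry(i, n, gamma):
    """Expected number of sites with i derived copies in a sample of n (theta = 1).

    Hypothetical helper: integrates binomial sampling against the
    diffusion density under genic selection, for 1 <= i <= n - 1.
    """
    def integrand(x):
        if gamma == 0:
            # Neutral limit of the selection term below.
            sel = 1.0 - x
        else:
            sel = np.expm1(-2 * gamma * (1 - x)) / np.expm1(-2 * gamma)
        # The 1 / (x * (1 - x)) factor of the density is folded into the exponents.
        return comb(n, i) * x ** (i - 1) * (1 - x) ** (n - i - 1) * sel

    value, _ = quad(integrand, 0.0, 1.0)
    return value


n_demo = 20  # small sample size, for illustration only
neutral_demo = [diffusion_sfs_entry(i, n_demo, 0.0) for i in range(1, n_demo)]
deleterious_demo = [diffusion_sfs_entry(i, n_demo, -10.0) for i in range(1, n_demo)]
```

In the neutral case each entry reduces to `1 / i`, and with negative `gamma` every entry is smaller than its neutral counterpart.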

In [5]:
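# Cache the spectra so that each gamma is only solved once.
spectrum_cache = {}
spectrum_cache[0] = selection_spectrum(0)

# Log-spaced grid of |gamma|. Keys are positive, but the spectra are
# computed with negative gamma (deleterious mutations).
gammas = np.logspace(-4, 3, 61)
for gamma in gammas:
    spectrum_cache[gamma] = selection_spectrum(-gamma)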

In [6]:
# Trapezoid-rule width for each grid point: half the gap to each neighbour.
dxs = ((gammas - np.concatenate(([gammas[0]], gammas))[:-1]) / 2
    + (np.concatenate((gammas, [gammas[-1]]))[1:] - gammas) / 2)
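
To see what the `dxs` expression does, here is the same formula on a tiny, made-up grid (`demo_grid` is for illustration only). Each point gets half the gap to its left neighbour plus half the gap to its right neighbour, which gives the trapezoid-rule weights for an uneven grid.

```python
import numpy as np

demo_grid = np.array([1.0, 2.0, 4.0, 8.0])  # made-up grid, not the gammas above

# Same expression as for dxs, applied to the toy grid.
demo_dxs = ((demo_grid - np.concatenate(([demo_grid[0]], demo_grid))[:-1]) / 2
    + (np.concatenate((demo_grid, [demo_grid[-1]]))[1:] - demo_grid) / 2)

print(demo_dxs)        # [0.5 1.5 3.  2. ]
print(demo_dxs.sum())  # 7.0, the length of the interval [1, 8]

# The trapezoid rule is exact for a linear function: integral of x over [1, 8] is 31.5.
print(np.sum(demo_grid * demo_dxs))  # 31.5
```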

In [7]:
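def dfe_func(params, theta=1):
    """Expected SFS for a gamma DFE with (shape, scale) = params, integrated over the gamma grid."""
    alpha, beta = params
    # Effects weaker than the smallest grid point are treated as neutral.
    fs = spectrum_cache[0] * scipy.stats.gamma.cdf(gammas[0], alpha, scale=beta)
    # DFE density at each grid point, combined with the integration widths dxs.
    weights = scipy.stats.gamma.pdf(gammas, alpha, scale=beta)
    for gamma, dx, w in zip(gammas, dxs, weights):
        fs += spectrum_cache[gamma] * dx * w
    fs = theta * fs
    return fs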
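
A quick sanity check on the discretisation in `dfe_func`: the weight given to the neutral spectrum plus the weights on the grid should add up to roughly 1, otherwise part of the DFE is lost (for example, mass beyond the largest `gamma`). The sketch below rebuilds the same grid and uses made-up shape and scale values, not the ones in `DFE_lof` or `DFE_missense`.

```python
import numpy as np
import scipy.stats

# Same grid and widths as above, under demo names.
demo_gammas = np.logspace(-4, 3, 61)
demo_widths = ((demo_gammas - np.concatenate(([demo_gammas[0]], demo_gammas))[:-1]) / 2
    + (np.concatenate((demo_gammas, [demo_gammas[-1]]))[1:] - demo_gammas) / 2)

demo_shape, demo_scale = 0.2, 100.0  # hypothetical DFE parameters

# Mass treated as neutral, mass integrated on the grid, and mass that is ignored.
mass_below_grid = scipy.stats.gamma.cdf(demo_gammas[0], demo_shape, scale=demo_scale)
mass_on_grid = np.sum(scipy.stats.gamma.pdf(demo_gammas, demo_shape, scale=demo_scale) * demo_widths)
mass_above_grid = scipy.stats.gamma.sf(demo_gammas[-1], demo_shape, scale=demo_scale)

total_weight = mass_below_grid + mass_on_grid
print(total_weight, mass_above_grid)
```

If `total_weight` is far from 1 for the real parameters, a denser grid or a larger upper bound may be needed.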

In [8]:
SFSs['LOF'] = dfe_func((DFE_lof.shape, DFE_lof.scale))
SFSs['missense'] = dfe_func((DFE_missense.shape, DFE_missense.scale))

In [9]:
def spectrum_to_frame(v_class, SFS_dict):
    # One row per derived allele count (0..100) for the given variant class
    return pd.DataFrame({
        'DerivedFreq': range(101),
        'MutType': v_class,
        'Frequency': SFS_dict[v_class].data
    })

In [10]:
SFSs_theta1 = pd.concat([spectrum_to_frame(x, SFSs) for x in SFSs.keys()])

### Scaling the SFS to the data

The SFS needs to be scaled to the data; check [these notes](https://github.com/santiago1234/TheSciJournal/blob/main/journal-2022/week29/aaron-meeting-w29.pdf) from Aaron.

To scale the spectrum to the simulations we use this relation:

$$
\begin{aligned}
SFS &= \sum_{r \in \text{replicates}} \text{sfs}(r) \\
    &\approx E[SFS \mid \theta = 1] \times 4 \times N_e \times mL \times N_{sim}
\end{aligned}
$$
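
As a worked example of this scaling, the sketch below uses the `Ne` and `N_sim` values from the next cell and a made-up `mL` (assumed here to be the per-generation mutation rate summed over the sites of one variant class). The toy spectrum is the neutral expectation `1 / i` for `theta = 1`.

```python
import numpy as np

Ne_demo = 5000
N_sim_demo = 100
mL_demo = 1e-4  # hypothetical value, not taken from sim_dat

# Scaling factor from the formula above.
theta_demo = 4 * Ne_demo * mL_demo * N_sim_demo

# Neutral expectation for theta = 1 at derived counts i = 1..4.
unit_sfs_demo = 1.0 / np.arange(1, 5)
scaled_sfs_demo = theta_demo * unit_sfs_demo

print(theta_demo)       # approximately 200
print(scaled_sfs_demo)  # approximately [200, 100, 66.67, 50]
```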

In [11]:
N_sim = 100
Ne = 5000

mLs = {
    'noncoding': sim_dat.ml_noncoding,
    'synonymous': sim_dat.ml_synonymous,
    'LOF': sim_dat.ml_LOF,
    'missense': sim_dat.ml_missense,
}


def scale_sfs(v_class):
    theta = 4 * Ne * mLs[v_class] * N_sim
    return SFSs[v_class] * theta

SFSs_scaled_to_data = {x: scale_sfs(x) for x in SFSs.keys()}
SFSs_scaled_to_data = pd.concat([spectrum_to_frame(x, SFSs_scaled_to_data) for x in SFSs_scaled_to_data.keys()])

In [12]:
# SAVE RESULTS
SFSs_theta1.to_csv('results/expected-sfs/SFS.csv', index=False)
SFSs_scaled_to_data.to_csv('results/expected-sfs/SFS_scaled_to_data.csv', index=False)