# Stats Module Overview

This notebook discusses the functions used by generate_results from the main_module:
1. process_dataframe
2. generate_bin_values
3. generate_stats_for_likelihood_ratio
4. generate_p_threshold_and_binomial

## 1) Process Dataframe
This function processes a dataframe of targets by extracting statistics and removing irrelevant targets.

### Logic

For each cluster in the dataframe:

1. Get the mean density for all samples and the mean density for all controls
    - If the mean density for all samples is 0, remove the cluster from the dataframe
2. Compute the growth coefficient
    - Get all periods in the samples and all periods in the controls
    - Get all densities in the samples and all densities in the controls
    - Use the periods and densities to compute the growth coefficient of the samples and of the controls
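
The loop above can be sketched roughly as follows. This is a dependency-free illustration, not the module's code: plain dictionaries stand in for the dataframe, the names process_clusters_sketch and least_squares_slope are hypothetical, and the growth coefficient is assumed to be the least-squares slope of density against period, since this notebook does not state the formula. It covers only the filtering and growth-coefficient steps.

```python
from statistics import mean


def least_squares_slope(periods, densities):
    """Slope of densities against periods (ASSUMED growth-coefficient definition)."""
    mean_p, mean_d = mean(periods), mean(densities)
    denom = sum((p - mean_p) ** 2 for p in periods)
    if denom == 0:
        return 0.0  # a single period gives no trend to fit
    num = sum((p - mean_p) * (d - mean_d) for p, d in zip(periods, densities))
    return num / denom


def process_clusters_sketch(clusters):
    """clusters maps a cluster id to {"samples": [(period, density), ...],
    "controls": [(period, density), ...]} (hypothetical layout)."""
    all_sample_means, all_control_means, growth_coefficients = [], [], []
    data = {}
    for cluster_id, rows in clusters.items():
        sample_periods, sample_densities = zip(*rows["samples"])
        control_periods, control_densities = zip(*rows["controls"])

        sample_mean = mean(sample_densities)
        if sample_mean == 0:
            continue  # drop clusters whose sample mean density is 0

        all_sample_means.append(sample_mean)
        all_control_means.append(mean(control_densities))
        # Sample GC first, then control GC, in one flat list.
        growth_coefficients.append(least_squares_slope(sample_periods, sample_densities))
        growth_coefficients.append(least_squares_slope(control_periods, control_densities))
        data[cluster_id] = rows

    return {
        "all_sample_means": all_sample_means,
        "all_control_means": all_control_means,
        "growth_coefficients": growth_coefficients,
        "data": data,
    }
```

Keeping both coefficients in one flat list mirrors the growth_coefficients layout described under Returns.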
        
### Returns
The function returns the following:
* all_sample_means
* all_control_means
* growth_coefficients: a single array where for each cluster its sample GC is followed by control GC
* samples_gt_controls
* n_targets_gt_0
* data: new dataframe where clusters with 0 sample mean have been removed

## 2) Generate Bin Values

This function takes the main dataframe and the control dataframe and generates bin values from these. 

### Logic

1. Create new bin columns for both the main dataframe and the control dataframe
2. Get total samples (sum all contribution) and total controls (count all rows from control dataframe)
3. Loop through each bin
    - sample_count: for all samples in the bin, sum the contribution
    - control_count: count all control dataframe rows that correspond to the bin
    - multiplier (the likelihood ratio): sample_count/control_count (if control_count = 0, the multiplier is set to -1)
    - p_sample: sample_count/total_samples
    - p_control: control_count/total_controls
    - p_likelihood_ratio: p_sample/p_control
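
As a rough illustration of the per-bin arithmetic, the sketch below uses plain lists instead of dataframes. The function name generate_bin_values_sketch and the input layout are hypothetical. Returning -1 for p_likelihood_ratio when p_control is 0 is also an assumption that mirrors the multiplier rule; the notebook does not say how that case is handled.

```python
from collections import defaultdict


def generate_bin_values_sketch(sample_rows, control_bins):
    """sample_rows: list of (bin, contribution) pairs, one per sample row.
    control_bins: list of bin labels, one per control row.
    Both layouts are hypothetical stand-ins for the two dataframes."""
    total_samples = sum(contribution for _, contribution in sample_rows)
    total_controls = len(control_bins)

    sample_by_bin = defaultdict(float)
    for bin_value, contribution in sample_rows:
        sample_by_bin[bin_value] += contribution  # sum the contributions

    control_by_bin = defaultdict(int)
    for bin_value in control_bins:
        control_by_bin[bin_value] += 1  # count control rows

    results = []
    for bin_value in sorted(set(sample_by_bin) | set(control_by_bin)):
        sample_count = sample_by_bin.get(bin_value, 0.0)
        control_count = control_by_bin.get(bin_value, 0)
        multiplier = sample_count / control_count if control_count else -1
        p_sample = sample_count / total_samples
        p_control = control_count / total_controls
        # ASSUMPTION: reuse -1 as the sentinel when p_control is 0.
        p_likelihood_ratio = p_sample / p_control if p_control else -1
        results.append({
            "bin": bin_value,
            "sample_count": sample_count,
            "control_count": control_count,
            "multiplier": multiplier,
            "p_sample": p_sample,
            "p_control": p_control,
            "p_likelihood_ratio": p_likelihood_ratio,
        })
    return results
```

The sketch returns one record per bin; the real function returns the same quantities as parallel arrays.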
    
    
### Returns
The function returns the following:
* bin_array
* sample_counts
* control_counts
* multipliers
* p_samples
* p_controls
* p_likelihood_ratios
    

## 3) Generate Stats for Likelihood Ratio

This function generates graphs and statistics mainly for the likelihood ratio. However, it also produces graphs to show actual vs predicted number of sites.

### Logic
The logic is as follows:
1. Trim arrays using trim_values function (remove rows where p_control < 30)
2. Fit and compare linear and threshold models
    - get predicted p_likelihood_ratios
    - Score the predicted p_likelihood_ratios using R² and chi-square scores
3. Compute statistics of a divided graph
    - divide graph using divide_graph function
    - compute the variance for each side as well as the Levene statistic
    - compute p_likelihood_ratio of divided graph (ratio for each side)
4. Plot graphs
    - plot likelihood ratio for linear and threshold models 
5. Write different statistics to file
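
The scoring in step 2 can be illustrated with the textbook definitions of the two scores. This is a minimal sketch under assumptions: the helper names r_squared and chi_square_statistic are hypothetical, and the module may compute these scores through a library rather than by hand.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def chi_square_statistic(observed, predicted):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - p) ** 2 / p for o, p in zip(observed, predicted))
```

A model whose predictions match the observed p_likelihood_ratios exactly scores an R² of 1 and a chi-square of 0, so comparing the linear and threshold models amounts to comparing how close each gets to those values.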

## 4) Generate p Threshold and Binomial

This function looks for the threshold (a bin value) that minimizes the binomial for p.

### Conditions for p Binomial:
- Trial: comparison between p_sample and p_controls (done per bin)
- Success: p_sample <= p_control
- Assumed probability of success: 0.5


### Logic
The logic to look for this threshold is as follows:

1. We set the current binomial to 100 (a large number) 
2. We loop through each bin and assume this bin i is the threshold
3. We loop from the first bin to bin i
    - we compare p_sample and p_control per iteration (+1 trial) 
    - if p_sample <= p_control, +1 success
4. We then check the binomial for the successes and trials with bin i as threshold
5. If this binomial is smaller than the current binomial, it becomes the new current binomial
    - we also save the following:
        - the bin as the current threshold
        - the number of success
        - the number of trials
        - the array of p_samples
        - the array of p_controls
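
The search above can be sketched as follows. Two things here are assumptions rather than facts from the module: "the binomial" is taken to be the binomial probability of observing exactly that many successes in that many trials with p = 0.5, and the inner loop is taken to include bin i itself. The names find_threshold_sketch and binomial_pmf are hypothetical.

```python
from math import comb


def binomial_pmf(successes, trials, p=0.5):
    """Probability of exactly `successes` successes in `trials` trials."""
    return comb(trials, successes) * p ** successes * (1 - p) ** (trials - successes)


def find_threshold_sketch(bins, p_samples, p_controls):
    best = {"binomial": 100, "threshold": None}  # 100 is larger than any probability
    for i in range(len(bins)):
        successes = 0
        trials = 0
        # ASSUMPTION: compare every bin from the first up to and including bin i.
        for j in range(i + 1):
            trials += 1
            if p_samples[j] <= p_controls[j]:
                successes += 1
        binomial = binomial_pmf(successes, trials)
        if binomial < best["binomial"]:
            best = {
                "binomial": binomial,
                "threshold": bins[i],
                "threshold_success_count": successes,
                "threshold_trial_count": trials,
                "threshold_samples": list(p_samples[: i + 1]),
                "threshold_controls": list(p_controls[: i + 1]),
            }
    return best
```

Because the comparison is strict, ties keep the earliest bin as the threshold.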
        
### Returns
* binomial
* threshold
* threshold_success_count
* threshold_trial_count
* threshold_samples
* threshold_controls

