# Nanocompore SampComp API demo 

## Basic usage of SampComp

### Import the package

In [10]:
from nanocompore.SampComp import SampComp

#### Using a Python dictionary to specify the location of the eventalign files

In [15]:
# Init the object
s = SampComp (
    eventalign_fn_dict = {
        'Modified': {'rep1':'./sample_files/modified_rep_1.tsv', 'rep2':'./sample_files/modified_rep_2.tsv'},
        'Unmodified': {'rep1':'./sample_files/unmodified_rep_1.tsv', 'rep2':'./sample_files/unmodified_rep_2.tsv'}},
    output_db_fn = "./results/out.db",
    fasta_fn = "./reference/ref.fa",
    )

# Run the analysis
db = s ()

Initialise SampComp and checks options
Initialise Whitelist and checks options
Read eventalign index files
	References found in index: 5
Filter out references with low coverage
	References remaining after reference coverage filtering: 5
Start data processing
  out=out, **kwargs)
  ret = ret.dtype.type(ret / rcount)


NanocomporeError: Traceback (most recent call last):
  File "/home/aleg/Programming/nanocompore/nanocompore/SampComp.py", line 326, in __process_references
    strict=self.__strict,
  File "/home/aleg/Programming/nanocompore/nanocompore/TxComp.py", line 67, in txCompare
    gmm_results = gmm_test(data, anova=anova, logit=logit, strict=strict)
  File "/home/aleg/Programming/nanocompore/nanocompore/TxComp.py", line 179, in gmm_test
    aov_results = gmm_anova_test(counters, sample_condition_labels, condition_labels, gmm_ncomponents, strict=strict)
  File "/home/aleg/Programming/nanocompore/nanocompore/TxComp.py", line 235, in gmm_anova_test
    raise NanocomporeError("While doing the Annova test we found a sample with within variance = 0. Use strict=False to ignore.")
nanocompore.common.NanocomporeError: While doing the Annova test we found a sample with within variance = 0. Use strict=False to ignore.
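
The run above stops because of the `strict` option, and the error message itself suggests `strict=False` as a way around it. As a rough illustration of that pattern (a warning is promoted to an exception only in strict mode), here is a standard-library sketch. `StrictModeError`, `toy_test` and `run_test` are hypothetical names, and this is not Nanocompore's own code.

```python
import warnings


class StrictModeError(Exception):
    """Hypothetical stand-in for nanocompore.common.NanocomporeError."""


def toy_test(values):
    """Toy statistical step: warn when all values are identical (variance = 0)."""
    if len(set(values)) == 1:
        warnings.warn("within variance = 0", RuntimeWarning)
    return sum(values) / len(values)


def run_test(values, strict=True):
    """Run toy_test with warnings promoted to exceptions.

    In strict mode the warning becomes a StrictModeError; otherwise the
    problematic input is skipped and None is returned.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # every warning is raised as an exception
        try:
            return toy_test(values)
        except RuntimeWarning as exc:
            if strict:
                raise StrictModeError(f"{exc}. Use strict=False to ignore.") from exc
            return None
```

In the demo, the equivalent would presumably be to pass `strict=False` when creating the `SampComp` object, as the error message suggests.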


#### Using a YAML file to specify the file locations instead

In [35]:
# Init the object
s = SampComp (
    eventalign_fn_dict = "./samples.yaml",
    output_db_fn = "./results/out.db",
    fasta_fn = "./reference/ref.fa")

# Run the analysis
db = s ()

Initialise SampComp and checks options
Initialise Whitelist and checks options
Read eventalign index files
	References found in index: 5
Filter out references with low coverage
	References remaining after reference coverage filtering: 5
Start data processing
100%|██████████| 5/5 [00:10<00:00,  2.28s/ Processed References]
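
The content of `samples.yaml` is not shown in this demo. A minimal sketch, assuming the YAML file simply mirrors the two-level dictionary from the previous example (condition, then replicate, then path), is given below. `samples_to_yaml` is a hypothetical helper written with the standard library only; it is not part of Nanocompore.

```python
def samples_to_yaml(samples):
    """Render a {condition: {replicate: path}} dict as simple block-style YAML.

    Assumes keys and paths are plain strings that need no YAML quoting.
    """
    lines = []
    for condition, replicates in samples.items():
        lines.append(f"{condition}:")
        for replicate, path in replicates.items():
            lines.append(f"  {replicate}: {path}")
    return "\n".join(lines) + "\n"


# Same layout as the dictionary example earlier in this demo
samples = {
    "Modified": {
        "rep1": "./sample_files/modified_rep_1.tsv",
        "rep2": "./sample_files/modified_rep_2.tsv",
    },
    "Unmodified": {
        "rep1": "./sample_files/unmodified_rep_1.tsv",
        "rep2": "./sample_files/unmodified_rep_2.tsv",
    },
}

yaml_text = samples_to_yaml(samples)
print(yaml_text)
```

The resulting text could be saved as `./samples.yaml`, provided the assumed layout matches what `SampComp` expects.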


## More advanced usage examples

#### Tweaking statistical options

In [None]:
# Init the object
s = SampComp (
    eventalign_fn_dict = "./samples.yaml",
    output_db_fn = "./results/out.db",
    fasta_fn = "./reference/ref.fa",
    comparison_method=["GMM", "MW", "KS"],
    sequence_context=2,
    sequence_context_weights='harmonic')

# Run the analysis
db = s ()
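
To give an intuition for `sequence_context` and `sequence_context_weights`, the sketch below builds one weight per position in a window of `2 * sequence_context + 1` bases. It assumes that "harmonic" means a weight of `1 / (distance + 1)` from the central position and that "uniform" means equal weights. This is an interpretation for illustration; `context_weights` is a hypothetical function, not Nanocompore's API.

```python
def context_weights(sequence_context, scheme="uniform"):
    """Return one weight per position in the window around a central base.

    The window spans offsets -sequence_context .. +sequence_context.
    'harmonic' (assumed meaning): weight 1 / (|offset| + 1).
    'uniform': every position weighs 1.
    """
    offsets = range(-sequence_context, sequence_context + 1)
    if scheme == "harmonic":
        return [1 / (abs(offset) + 1) for offset in offsets]
    if scheme == "uniform":
        return [1.0 for _ in offsets]
    raise ValueError(f"Unknown weighting scheme: {scheme!r}")


harmonic = context_weights(2, "harmonic")
uniform = context_weights(2, "uniform")
```

Under this assumption, `sequence_context=2` gives a five-position window in which the central base counts fully and its neighbours count progressively less with `'harmonic'`, while `'uniform'` treats all five positions equally.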

## Full API documentation

In [8]:
from nanocompore.common import jhelp

jhelp(SampComp)

### nanocompore.SampComp.\__init__

Initialise a `SampComp` object and generate a whitelist of references with sufficient coverage for subsequent analysis. The returned object can then be called to start the analysis.

* **eventalign_fn_dict** *: dict or str (required)*

Multilevel dictionary indicating the condition_label, sample_label and file name of the eventalign_collapse output. 2 conditions are expected and at least 2 sample replicates per condition are highly recommended. One can also pass a YAML file describing the samples instead. Example: `d = {"S1": {"R1":"path1.tsv", "R2":"path2.tsv"}, "S2": {"R1":"path3.tsv", "R2":"path4.tsv"}}`

* **output_db_fn** *: file_path (required)*

Path where the result database is written.

* **fasta_fn** *: file_path (required)*

Path to a fasta file corresponding to the reference used for read alignment.

* **bed_fn** *: file_path (default = None)*

Path to a BED file containing the annotation of the transcriptome used as reference when mapping

* **whitelist** *: nanocompore.Whitelist object (default = None)*

Whitelist object previously generated with nanocompore Whitelist. If not given, one is generated automatically.

* **comparison_method** *: list of str (default = ['GMM', 'KS'])*

Statistical method(s) used to compare the 2 samples (mann_whitney or MW, kolmogorov_smirnov or KS, t_test or TT, gaussian_mixture_model or GMM). This can be a list or a comma-separated string.

* **logit** *: bool (default = True)*

Force logistic regression even if there are fewer than 2 replicates in any condition.

* **strict** *: bool (default = True)*

If True, an exception is raised when the Anova test throws a warning because a sample has a null variance.

* **sequence_context** *: int (default = 0)*

Extend the statistical analysis to contiguous adjacent bases, if available.

* **sequence_context_weights** *: {uniform,harmonic} (default = uniform)*

Type of weights to use for combining p-values.

* **min_coverage** *: int (default = 50)*

Minimal read coverage required in all samples.

* **downsample_high_coverage** *: int (default = 0)*

For references with higher coverage, downsample by randomly selecting reads.

* **max_invalid_kmers_freq** *: float (default = 0.1)*

Maximum frequency of NNNNN, mismatching and missing kmers in reads.

* **select_ref_id** *: list or str (default = [])*

If given, only reference ids in the list will be selected for the analysis.

* **exclude_ref_id** *: list or str (default = [])*

If given, reference ids in the list will be excluded from the analysis.

* **nthreads** *: int <= 4 (default = 4)*

Number of threads (two are used for reading and writing, all the others for processing in parallel).

* **log_level** *: {warning,info,debug} (default = info)*

Set the log level.
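
Two of the options above accept flexible input: `eventalign_fn_dict` expects 2 conditions with ideally 2 or more replicates each, and `comparison_method` accepts long or short method names, as a list or as a comma-separated string. The sketch below shows how such inputs could be checked before starting a run. `check_samples`, `normalise_methods` and `METHOD_ALIASES` are hypothetical helpers for illustration, and the case-insensitive matching is an assumption. None of this is Nanocompore's own implementation.

```python
# Long and short names as listed in the comparison_method description
METHOD_ALIASES = {
    "mann_whitney": "MW", "mw": "MW",
    "kolmogorov_smirnov": "KS", "ks": "KS",
    "t_test": "TT", "tt": "TT",
    "gaussian_mixture_model": "GMM", "gmm": "GMM",
}


def normalise_methods(methods):
    """Turn a list or comma-separated string into a list of short method names."""
    if isinstance(methods, str):
        methods = methods.split(",")
    normalised = []
    for name in methods:
        key = name.strip().lower()  # assumption: names are matched case-insensitively
        if key not in METHOD_ALIASES:
            raise ValueError(f"Unknown comparison method: {name!r}")
        short = METHOD_ALIASES[key]
        if short not in normalised:  # drop duplicates, keep first-seen order
            normalised.append(short)
    return normalised


def check_samples(samples):
    """Check a {condition: {replicate: path}} dict.

    Raises ValueError unless exactly 2 conditions are given, and returns the
    conditions that have fewer than the recommended 2 replicates.
    """
    if len(samples) != 2:
        raise ValueError(f"Expected 2 conditions, found {len(samples)}")
    return [cond for cond, reps in samples.items() if len(reps) < 2]


methods = normalise_methods("GMM, kolmogorov_smirnov, MW")
low_replicates = check_samples(
    {"S1": {"R1": "path1.tsv", "R2": "path2.tsv"}, "S2": {"R1": "path3.tsv"}}
)
```

Here `low_replicates` lists the conditions that might deserve a warning rather than an error, since 2 replicates are recommended but not stated as mandatory.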




## Initialise and call SampComp