# Examples of repo usage

Here you will find several examples of how the modules in this repository work and perform.


## RandomForestClassifierCustom

This example shows that fitting the custom Random Forest classifier is parallelized. Wall time is measured with n_jobs=1 and n_jobs=2, respectively: fit drops from 27.9 s to 14.8 s, while predict stays almost unchanged (6.51 s vs 6.42 s).

In [2]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from custom_random_forest import RandomForestClassifierCustom

In [3]:
SEED = 42
X, y = make_classification(n_samples=100000, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)

In [4]:
random_forest_1 = RandomForestClassifierCustom(max_depth=30, n_estimators=50,
                                             max_features=2, random_state=SEED)
random_forest_2 = RandomForestClassifierCustom(max_depth=30, n_estimators=50,
                                             max_features=2, random_state=SEED)

In [5]:
%%time
random_forest_1.fit(X_train, y_train, n_jobs=1)

CPU times: user 107 ms, sys: 163 ms, total: 271 ms
Wall time: 27.9 s


In [6]:
%%time
y_pred1 = random_forest_1.predict(X_test, n_jobs=1)

CPU times: user 348 ms, sys: 862 ms, total: 1.21 s
Wall time: 6.51 s


In [7]:
%%time
random_forest_2.fit(X_train, y_train, n_jobs=2)

CPU times: user 91 ms, sys: 157 ms, total: 248 ms
Wall time: 14.8 s


In [8]:
%%time
y_pred2 = random_forest_2.predict(X_test, n_jobs=2)

CPU times: user 340 ms, sys: 896 ms, total: 1.24 s
Wall time: 6.42 s


In [9]:
# check that the predictions are identical for one and two jobs

all(y_pred1 == y_pred2)

True
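
The timings above can be summarized as a speedup factor. The snippet below is a small sketch that only reuses the wall times printed by the %%time cells; the `speedup` helper and the `wall_times` dictionary are illustrative names, not part of the repository.

```python
# Wall-clock times in seconds, copied from the %%time outputs above.
# Keys of the inner dictionaries are the n_jobs values used in each run.
wall_times = {
    "fit": {1: 27.9, 2: 14.8},
    "predict": {1: 6.51, 2: 6.42},
}


def speedup(t_one_job: float, t_n_jobs: float) -> float:
    """Return how many times faster the multi-job run is than the 1-job run."""
    return t_one_job / t_n_jobs


for stage, times in wall_times.items():
    factor = speedup(times[1], times[2])
    efficiency = factor / 2  # fraction of the ideal 2x speedup for two jobs
    print(f"{stage}: speedup {factor:.2f}x, parallel efficiency {efficiency:.0%}")
```

For these measurements, fit is roughly 1.9 times faster with two jobs, while predict shows essentially no gain.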

## OpenFasta

Here is an example of how to use the OpenFasta module. In simple terms, it lets you iterate over a FASTA file record by record and prints each record in a compact, readable form.

In [10]:
from bio_files_processor import OpenFasta

In [11]:
with OpenFasta("data/example_fasta.fasta") as fasta_file:
    for record in fasta_file:
        print(record)

id='GTD323452', description='5S_rRNA NODE_272_len...', seq='ACGGCCATAGGACTTTGAAA...'
id='GTD678345', description='16S_rRNA NODE_80_len...', seq='TTGGCTTCTTAGAGGGACTT...'
id='GTD174893', description='16S_rRNA NODE_1_leng...', seq='TTGAAGAGTTTGATCATGGC...'
id='GTD906783', description='16S_rRNA NODE_1_leng...', seq='TTGAAGAGTTTGATCATGGC...'
id='GTD129563', description='16S_rRNA NODE_4_leng...', seq='CGGACGGGTGAGTAATGTCT...'


## GenescanOutput

Here is an example of how the GenescanOutput class works, specifically through the run_genscan function. In simple terms, it runs the online Genscan tool from Python. The output is a dataclass object that holds the status code of the request, as well as the predicted peptides, exons, and introns for the input sequence.

In [3]:
from custom_tools_main import run_genscan

In [4]:
genscan = run_genscan(sequence_file='data/genscan_sequence.fasta')
genscan

Status code: 200

Predicted peptides:
GENSCAN_predicted_peptide_1: XSQTAFRVTAMEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

Predicted introns:
Intron 1.1: 11015 - 11131
Intron 1.2: 11154 - 11262
Intron 1.3: 11542 - 12298
Intron 1.4: 12483 - 12563
Intron 1.5: 12677 - 13244
Intron 1.6: 13355 - 13697
Intron 1.7: 13835 - 13926
Intron 1.8: 14001 - 16819
Intron 1.9: 16927 - 17844

Predicted exons:
Exon 1.01: 10913 - 11014
Exon 1.02: 11132 - 11153
Exon 1.03: 11263 - 11541
Exon 1.04: 12299 - 12482
Exon 1.05: 12564 - 12676
Exon 1.06: 13245 - 13354
Exon 1.07: 13698 - 13834
Exon 1.08: 13927 - 14000
Exon 1.09: 16820 - 16926
Exon 1.10: 17845 - 17926
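
In the output above, each intron spans exactly the gap between two consecutive exons. The sketch below checks that relationship using the exon coordinates copied from the listing. It is only an illustration of how the two lists relate, not a description of how run_genscan computes them; the helper `introns_between` is a hypothetical name, and the coordinates are assumed to be 1-based and inclusive.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), assumed 1-based and inclusive

# Exon coordinates copied from the Genscan output above.
exons: List[Interval] = [
    (10913, 11014), (11132, 11153), (11263, 11541), (12299, 12482),
    (12564, 12676), (13245, 13354), (13698, 13834), (13927, 14000),
    (16820, 16926), (17845, 17926),
]


def introns_between(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons, ordered by start position."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


for number, (start, end) in enumerate(introns_between(exons), start=1):
    print(f"Intron 1.{number}: {start} - {end}")
```

Ten exons give nine gaps, which matches the nine introns reported above.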