# Examples

In [1]:
import pandas as pd
pd.set_option("display.max_rows", 10)

## Higher level interface
Work with `DataFrame` objects through the `CDHIT` class.

In [2]:
from pycdhit import read_fasta, read_clstr, CDHIT

Read a FASTA file of amino acid sequences into a `DataFrame`.

In [3]:
df_in = read_fasta("apd.fasta")
df_in

| | identifier | sequence |
|---|---|---|
| 0 | 00001 | GLWSKIKEVGKEAAKAAAKAAGKAALGAVSEAV |
| 1 | 00002 | YVPLPNVPQPGRRPFPTFPGQGPFNPKIKWPQGY |
| 2 | 00003 | DGVKLCDVPSGTWSGHCGSSSKCSQQCKDREHFAYGGACHYQFPSV... |
| 3 | 00004 | NLCERASLTWTGNCGNTGHCDTQCRNWESAKHGACHKRGNWKCFCYFDC |
| 4 | 00005 | VFIDILDKVENAIHNAAQVGIGFAKPFEKLINPK |
| ... | ... | ... |
| 45 | 00046 | GPLSCRRNGGVCIPIRCPGPMRQIGTCFGRPVKCCRSW |
| 46 | 00047 | GPLSCGRNGGVCIPIRCPVPMRQIGTCFGRPVKCCRSW |
| 47 | 00048 | SGISGPLSCGRNGGVCIPIRCPVPMRQIGTCFGRPVKCCRSW |
| 48 | 00049 | GIGALSAKGALKGLAKGLAEHFAN |
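
For orientation, the table above holds one row per FASTA record. The following standard-library sketch shows that mapping. It is not pycdhit's implementation: `parse_fasta` is a hypothetical helper, and it assumes the identifier is the header text up to the first whitespace.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream.

    Hypothetical helper for illustration; not part of pycdhit.
    """
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier ends at the first whitespace.
            header = line[1:].split(maxsplit=1)
            identifier = header[0] if header else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Two records taken from the table above, wrapped across lines as FASTA allows.
sample = StringIO(
    ">00001\n"
    "GLWSKIKEVGKEAAKAAAKAAGKA\n"
    "ALGAVSEAV\n"
    ">00049\n"
    "GIGALSAKGALKGLAKGLAEHFAN\n"
)
records = list(parse_fasta(sample))
```

A list of such pairs can be passed to `pd.DataFrame(records, columns=["identifier", "sequence"])` to obtain a table with the same two columns.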


Initialize a command object and show the help message.

In [None]:
cdhit = CDHIT(prog="cd-hit", path="~/cd-hit")
cdhit.help()

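Set options and run the command to cluster the sequences. The options correspond to CD-HIT flags: `c=0.7` sets the sequence identity threshold to 70%, `d=0` keeps each identifier up to the first space of the FASTA header, and `sc=1` sorts the clusters by size.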

In [5]:
df_out, df_clstr = cdhit.set_options(c=0.7, d=0, sc=1).cluster(df_in)
df_clstr

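| | identifier | cluster | size | is_representative | identity |
|---|---|---|---|---|---|
| 0 | 00037 | 0 | 40 | False | 100.00 |
| 1 | 00038 | 0 | 42 | True | 100.00 |
| 2 | 00041 | 0 | 42 | False | 85.71 |
| 3 | 00042 | 0 | 40 | False | 90.00 |
| 4 | 00043 | 0 | 38 | False | 86.84 |
| ... | ... | ... | ... | ... | ... |
| 44 | 00001 | 24 | 33 | True | 100.00 |
| 45 | 00035 | 25 | 26 | True | 100.00 |
| 46 | 00045 | 26 | 40 | True | 100.00 |
| 47 | 00036 | 27 | 38 | True | 100.00 |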


The stdout of the finished program is available through the command object's `subprocess` attribute.

In [None]:
print(cdhit.subprocess.stdout)

## Lower level interface
Work directly with files through functions such as `cd_hit`.

In [7]:
from pycdhit import cd_hit, read_clstr

If the CD-HIT installation directory is not on `PATH`, set it in the `CD_HIT_DIR` environment variable.

In [8]:
import os
os.environ["CD_HIT_DIR"] = "~/cd-hit"
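
Whether the variable is needed depends on whether the `cd-hit` executable is already discoverable. The standard-library check below is one way to find out. `find_cd_hit` is a hypothetical helper, and its fallback to `CD_HIT_DIR` only illustrates the lookup order described above; it is an assumption, not pycdhit's actual resolution logic.

```python
import os
import shutil


def find_cd_hit(prog="cd-hit", env=None):
    """Return the path to `prog`, or None if it cannot be located.

    Hypothetical helper: looks on PATH first, then in the directory named
    by CD_HIT_DIR (with a leading "~" expanded to the home directory).
    """
    env = os.environ if env is None else env
    found = shutil.which(prog, path=env.get("PATH"))
    if found:
        return found
    cd_hit_dir = env.get("CD_HIT_DIR")
    if cd_hit_dir:
        # Search only the configured installation directory.
        return shutil.which(prog, path=os.path.expanduser(cd_hit_dir))
    return None


# Prints a path, or None when the executable cannot be found.
print(find_cd_hit())
```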

Cluster the sequences using cd-hit.

In [9]:
res = cd_hit(
    i="apd.fasta",
    o="out",
    c=0.7,
    d=0,
    sc=1,
)
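
The keyword arguments mirror CD-HIT's command-line flags. As a rough sketch, assuming each keyword `k=v` corresponds to a `-k v` flag (which is how the call above reads), the equivalent shell command can be assembled as follows. `build_command` is a hypothetical helper, not pycdhit's code.

```python
import shlex


def build_command(prog, **options):
    """Turn keyword options into an argument list such as ['cd-hit', '-i', ...].

    Hypothetical helper; assumes every keyword maps to a single-dash flag.
    """
    args = [prog]
    for flag, value in options.items():
        args += [f"-{flag}", str(value)]
    return args


cmd = build_command("cd-hit", i="apd.fasta", o="out", c=0.7, d=0, sc=1)
print(shlex.join(cmd))  # cd-hit -i apd.fasta -o out -c 0.7 -d 0 -sc 1
```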

The run writes two output files: a FASTA file of representative sequences (`out`) and a text file of clusters (`out.clstr`). Read the `.clstr` file into a `DataFrame`.

In [10]:
df_clstr = read_clstr("out.clstr")
df_clstr

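| | identifier | cluster | size | is_representative | identity |
|---|---|---|---|---|---|
| 0 | 00037 | 0 | 40 | False | 100.00 |
| 1 | 00038 | 0 | 42 | True | 100.00 |
| 2 | 00041 | 0 | 42 | False | 85.71 |
| 3 | 00042 | 0 | 40 | False | 90.00 |
| 4 | 00043 | 0 | 38 | False | 86.84 |
| ... | ... | ... | ... | ... | ... |
| 44 | 00001 | 24 | 33 | True | 100.00 |
| 45 | 00035 | 25 | 26 | True | 100.00 |
| 46 | 00045 | 26 | 40 | True | 100.00 |
| 47 | 00036 | 27 | 38 | True | 100.00 |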
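
To make the columns above easier to relate to the raw `.clstr` text, here is a minimal standard-library sketch of parsing that format. It assumes protein output, where each member line reads like `0	40aa, >00037... at 100.00%` and the representative ends with `*`. `parse_clstr` is a hypothetical helper and is not how `read_clstr` is implemented.

```python
import re

# Member line: index, length in residues, identifier, then either "*"
# (representative) or "at <identity>%".
MEMBER = re.compile(r"^\d+\s+(\d+)aa, >(.+?)\.\.\. (?:\*|at ([\d.]+)%)$")


def parse_clstr(lines):
    """Return one dict per sequence from the lines of a .clstr file.

    Hypothetical helper for illustration; not pycdhit's `read_clstr`.
    """
    rows, cluster = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])
            continue
        match = MEMBER.match(line)
        if match is None:
            continue  # ignore lines that do not fit the assumed layout
        size, identifier, identity = match.groups()
        rows.append({
            "identifier": identifier,
            "cluster": cluster,
            "size": int(size),
            "is_representative": identity is None,
            # The table above lists representatives with identity 100.00.
            "identity": 100.0 if identity is None else float(identity),
        })
    return rows


# Reconstructed from the first rows of the table above; the real file may
# differ in whitespace.
sample = """>Cluster 0
0\t40aa, >00037... at 100.00%
1\t42aa, >00038... *
2\t42aa, >00041... at 85.71%
""".splitlines()
rows = parse_clstr(sample)
```

Each dictionary carries the same fields as a row of the table, so the list could be handed to `pd.DataFrame(rows)` for further filtering, for example to keep only representatives.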


The stdout of the finished program is available on the object returned by `cd_hit`.

In [None]:
print(res.stdout)