## Working with PyCoM (locally)

The PyCoM database (PyCoMdb) consists of two files:
1. `pycom.db` (700MB): the curated protein database from SwissProt.
2. `pycom.mat` (114GB): the corresponding coevolution matrices of all proteins, in HDF5 format.

There are two ways of working with PyCoMdb through the PyCoM API:
1. Download the PyCoMdb files and work with them locally. This option requires 115GB of disk space for the two database files listed above.
2. Use the PyCoM API to query the database hosted at pycom.brunel.ac.uk. A tutorial for that is available in `example_remote.ipynb`.
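
If you choose the local option, it can help to confirm that the target drive has enough room before starting the download. The following is a minimal sketch using only the Python standard library. The helper name `has_enough_space` is illustrative and not part of the PyCoM API; the 115 GB default is taken from the figure above.

```python
import shutil

def has_enough_space(folder: str, required_gb: float = 115.0) -> bool:
    """Return True if the filesystem holding `folder` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(folder).free
    return free_bytes >= required_gb * 1000**3

# Check the current directory; replace "." with your intended download folder.
print(has_enough_space("."))
```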

### Installation

Install the PyCoM package:

* `pip install git+https://github.com/scdantu/pycom`
* Requires Python 3.8 or higher

Download the `pycom.db` and `pycom.mat` files from:

* [https://pycom.brunel.ac.uk/downloads](https://pycom.brunel.ac.uk/downloads) into a folder of your choice, for example `/Volumes/Data/PyCoMdb/`


## Tutorial

The first step is to create a `PyCom` object, here called `obj_pycom`. You need to provide the full path to each of the two database files.


In [None]:
from pycom import PyCom, ProteinParams

#database_folder_path="/Volumes/Data/PyCoMdb/"
database_folder_path="/Volumes/mason/Work/Sarath/Research/pycom/"

obj_pycom = PyCom(db_path=database_folder_path+'pycom.db', mat_path=database_folder_path+'pycom.mat')
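
A wrong folder is an easy mistake to make at this step, so it may be worth checking that both files exist before constructing the object. Below is a minimal sketch with `pathlib`. The helper `resolve_db_paths` is hypothetical (not part of PyCoM), and the `PyCom(...)` call is left commented out because it needs the package and the downloaded files.

```python
from pathlib import Path

def resolve_db_paths(folder):
    """Return (db_path, mat_path) as strings, checking that both files exist."""
    folder = Path(folder).expanduser()
    db_path = folder / "pycom.db"
    mat_path = folder / "pycom.mat"
    # Collect the names of any missing files so the error message is specific.
    missing = [p.name for p in (db_path, mat_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {folder}: {', '.join(missing)}")
    return str(db_path), str(mat_path)

# Usage with the constructor shown above:
# db_path, mat_path = resolve_db_paths("/Volumes/Data/PyCoMdb/")
# obj_pycom = PyCom(db_path=db_path, mat_path=mat_path)
```

Joining with `pathlib` also avoids depending on a trailing slash in the folder string.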

### Construct the query parameters

Now the fun part: to find the proteins you are interested in, you can either construct a `query_parameters` dictionary using `ProteinParams` or pass the keywords listed below directly.

The example below shows how to use `ProteinParams`. If you type `ProteinParams.` and press TAB, you will see the list of available parameters (in capitals, e.g. `HAS_PTM` or `DISEASE`).

#### `query_parameters` using `ProteinParams`

In [None]:
# Here we are asking for all the proteins that match the enzyme class 3 and have been associated with the disease cancer.
query_parameters={
    ProteinParams.ENZYME: '3.*.*.*',
    ProteinParams.DISEASE: 'cancer',  # string search, case-insensitive
}
# executing the query returns a pandas dataframe with information about all the proteins which match the query
entries_data_frame = obj_pycom.find(query_parameters)
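
The `'3.*.*.*'` pattern in the query above selects every enzyme whose EC number falls in class 3. To make the wildcard semantics concrete, here is a small standalone sketch using `fnmatch` from the standard library. It only illustrates shell-style matching on dotted numbers; that PyCoM's own matching behaves identically is an assumption, and the listed numbers are arbitrary examples (the last one is a deliberately invalid decoy).

```python
from fnmatch import fnmatchcase

def matches_ec(ec_number: str, pattern: str) -> bool:
    """Shell-style wildcard match of a dotted EC number against a pattern."""
    return fnmatchcase(ec_number, pattern)

# Arbitrary example values; "31.1.1.1" is a decoy that starts with "3" but is not class 3.
examples = ["3.1.3.48", "3.4.21.5", "1.3.1.3", "31.1.1.1"]
class_3 = [ec for ec in examples if matches_ec(ec, "3.*.*.*")]
print(class_3)  # ['3.1.3.48', '3.4.21.5']
```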

#### Empty `query_parameters`
**Note 1:** Supplying an empty `query_parameters` to `obj_pycom.find()` returns the full protein database, with 457,622 entries, as a pandas dataframe.

In [1]:
#executing the query using the query_parameters
entries_data_frame = obj_pycom.find(query_parameters)
#printing the contents of the dataframe
entries_data_frame


**Note 2:** At this stage, the coevolution matrices have not yet been loaded into the dataframe.


#### Query using keyword arguments

In [None]:
#executing the query using keywords
entries_data_frame = obj_pycom.find(
    cofactor='FAD',  # string search, case-insensitive
    has_ptm=True,
    has_disease=False,
    min_length=200,
)
# printing the contents of the dataframe
entries_data_frame
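
Keyword queries are convenient to assemble programmatically, for example when some filters are optional. The sketch below is plain Python: `build_query` is a hypothetical helper, not part of PyCoM, and the final `find` call is commented out because it needs the local database.

```python
def build_query(**filters):
    """Drop filters whose value is None so only active criteria are passed on."""
    return {name: value for name, value in filters.items() if value is not None}

# max_length is left unset (None), so it is omitted from the final query.
query = build_query(cofactor="FAD", has_ptm=True, has_disease=False,
                    min_length=200, max_length=None)
print(query)
# {'cofactor': 'FAD', 'has_ptm': True, 'has_disease': False, 'min_length': 200}

# entries_data_frame = obj_pycom.find(**query)
```

Note that `has_disease=False` is kept: the helper tests for `None` rather than truthiness, so an explicit `False` filter survives.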

### Supported query keywords:
* `uniprot_id`: The UniProt ID of the protein.
* `sequence`: The amino acid sequence of the protein to search for. (full match)
* `min_length` / `max_length`: Min/Max number of residues in the protein.
* `min_helix` / `max_helix`: Min/Max percentage of helical structure in the protein.
* `min_turn` / `max_turn`: Min/Max percentage of turn structure in the protein.
* `min_strand` / `max_strand`: Min/Max percentage of beta strand structure in the protein.
* `organism`: Taxonomic name of the genus / species of the protein. (case-insensitive)
  * Species name or any parent taxonomic level can be used. (`obj_pycom.get_organism_list()` for full list)
  * Surround with `:` to get precise results
    * `:homo:` returns `Homo sapiens` & `Homo sapiens neanderthalensis`
    * `homo` also returns **homo**eomma, t**homo**mys, and *hundreds* of others
* `organism_id`: Precise NCBI Taxonomy ID of the species of the protein. (prefer to use `organism` instead)
* `cath`: CATH classification of the protein (`3.40.50.360` or `3.40.*.*` or `3.*`).
* `enzyme`: Enzyme Commission number of the protein. (`1.3.1.3` or `1.3.*.*` or `1.*`).
* `has_substrate`: Whether the protein has a known substrate. (`True`/`False`)
* `has_ptm`: Whether the protein has a known post-translational modification. (`True`/`False`)
* `has_pdb`: Whether the protein has a known PDB structure. (`True`/`False`)
* `disease`: The disease associated with the protein. (name of disease, case-insensitive, e.g. `cancer`)
  * Use `obj_pycom.get_disease_list()` for full list.
  * `cancer` matches `Ovarian cancer`, `Lung cancer`, ...
* `disease_id`: The ID of the disease associated with the protein. (`DI-02205`, see `obj_pycom.get_disease_list()`)
* `has_disease`: Whether the protein is associated with a disease. (`True`/`False`)
* `cofactor`: The cofactor associated with the protein. (name of cofactor, case-insensitive, e.g. `Zn(2+)`)
* `cofactor_id`: The ID of the cofactor associated with the protein. (`CHEBI:00001`, see `obj_pycom.get_cofactor_list()`)
* `biological_process`: Biological process associated with the protein. (e.g. `antiviral defense`, use `obj_pycom.get_biological_process_list()` for full list)
* `cellular_component`: Cellular component associated with the protein. (e.g. `nucleus`, use `obj_pycom.get_cellular_component_list()` for full list)
* `domain`: Domain associated with the protein. (e.g. `zinc-finger`, use `obj_pycom.get_domain_list()` for full list)
* `ligand`: Ligand associated with the protein. (e.g. `zinc`, use `obj_pycom.get_ligand_list()` for full list)
* `molecular_function`: Molecular function associated with the protein. (e.g. `antioxidant activity`, use `obj_pycom.get_molecular_function_list()` for full list)
* `ptm`: Post-translational modification associated with the protein. (e.g. `phosphoprotein`, use `obj_pycom.get_ptm_list()` for full list)
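
To see why wrapping an `organism` query in `:` gives more precise results, consider a colon-delimited lineage string of the kind shown in the `taxonomy` column of the organism list. The sketch below assumes the search is a case-insensitive substring test against such a string; that is an assumption about the mechanism, not confirmed library behaviour, and the lineages are shortened, illustrative values rather than database entries.

```python
def taxonomy_matches(query: str, taxonomy: str) -> bool:
    """Case-insensitive substring test against a colon-delimited lineage string."""
    return query.lower() in taxonomy.lower()

# Shortened, illustrative lineages (not copied from the database).
lineages = {
    "Homo sapiens": ":Eukaryota:Primates:Hominidae:Homo:",
    "Thomomys bottae": ":Eukaryota:Rodentia:Geomyidae:Thomomys:",
}

loose = [name for name, lineage in lineages.items() if taxonomy_matches("homo", lineage)]
precise = [name for name, lineage in lineages.items() if taxonomy_matches(":homo:", lineage)]
print(loose)    # ['Homo sapiens', 'Thomomys bottae']
print(precise)  # ['Homo sapiens']
```

With the delimiters included, the query can only match a complete taxonomic level, so `Thomomys` no longer qualifies.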


### Paginate the results

Before loading coevolution matrices, it is recommended to paginate the results, as the matrices can take up a lot of memory.

Here is an example of making a large query, then paginating the results:

In [13]:
entries_data_frame = obj_pycom.find(min_length=200)
print(f'Found {len(entries_data_frame)} entries with length >= 200')

page = obj_pycom.paginate(entries_data_frame, page=1, per_page=100)  # get first n entries (default 100)
print(f'Found {len(page)} entries on page 1')

Found 280391 entries with length >= 200
Found 100 entries on page 1
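
To process every result rather than just the first page, you need the total number of pages. The arithmetic is sketched below in plain Python, assuming pages are 1-based and cover contiguous rows, as `page=1` returning the first entries suggests. The helper names are illustrative, and how `obj_pycom.paginate` works internally is not shown here.

```python
import math

def page_count(total_entries: int, per_page: int = 100) -> int:
    """Number of pages needed to cover all entries."""
    return math.ceil(total_entries / per_page)

def page_bounds(page: int, per_page: int = 100):
    """Zero-based [start, stop) row range for a 1-based page number."""
    start = (page - 1) * per_page
    return start, start + per_page

print(page_count(280391))  # 2804
print(page_bounds(1))      # (0, 100)
print(page_bounds(2804))   # (280300, 280400)
```

The last page is only partly filled: its nominal range runs past the 280391 available rows, so it would hold 91 entries.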


### Load coevolution matrices

Now the coevolution matrices can be loaded for the paginated results.

This loads them into the `matrix` column of the dataframe.
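
As a rough guide to why pagination matters here: a coevolution matrix is square in the sequence length, so memory grows quadratically with protein size. The back-of-the-envelope sketch below assumes dense storage at 8 bytes per value (64-bit floats); the dtype actually stored may differ, so treat the numbers as an order-of-magnitude estimate. The function names are illustrative.

```python
def matrix_bytes(length: int, bytes_per_value: int = 8) -> int:
    """Memory for one dense length x length matrix."""
    return length * length * bytes_per_value

def page_megabytes(lengths, bytes_per_value: int = 8) -> float:
    """Total memory in MB for one matrix per sequence length in `lengths`."""
    total = sum(matrix_bytes(n, bytes_per_value) for n in lengths)
    return total / 1e6

print(matrix_bytes(200))             # 320000
print(page_megabytes([1000] * 100))  # 800.0
```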

In [14]:
obj_pycom.load_matrices(page)

page.iloc[0].matrix  # show the coevolution matrix for the first entry

array([[0.        , 0.1006002 , 0.09838662, ..., 0.0467115 , 0.06828113,
        0.08706232],
       [0.10060021, 0.        , 0.10724315, ..., 0.04788237, 0.06594779,
        0.08771347],
       [0.09838659, 0.10724315, 0.        , ..., 0.04391594, 0.05695503,
        0.08397543],
       ...,
       [0.04671153, 0.04788237, 0.04391595, ..., 0.        , 0.07415783,
        0.06841787],
       [0.06828113, 0.06594782, 0.05695501, ..., 0.07415782, 0.        ,
        0.08349027],
       [0.08706234, 0.08771349, 0.08397545, ..., 0.06841787, 0.08349026,
        0.        ]])
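
The printed matrix has a zero diagonal and is symmetric only up to floating-point rounding (compare `0.1006002` with `0.10060021`). When checking symmetry, a tolerance is therefore needed. The sketch below uses plain nested lists holding the top-left 3x3 corner of the output above; `is_symmetric` is an illustrative helper, not a PyCoM function.

```python
import math

def is_symmetric(matrix, abs_tol: float = 1e-6) -> bool:
    """Check that a square nested-list matrix equals its transpose within abs_tol."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False
    return all(
        math.isclose(matrix[i][j], matrix[j][i], abs_tol=abs_tol)
        for i in range(n)
        for j in range(i + 1, n)
    )

# Top-left 3x3 corner of the matrix printed above.
corner = [
    [0.0, 0.1006002, 0.09838662],
    [0.10060021, 0.0, 0.10724315],
    [0.09838659, 0.10724315, 0.0],
]
print(is_symmetric(corner))               # True
print(is_symmetric(corner, abs_tol=0.0))  # False
```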

By default, the matrices are loaded as a `numpy.ndarray`. Different formats can be specified.

Here is an example of the matrices being loaded as Pandas DataFrames and 2d-lists:

In [None]:
from pycom import MatrixFormat

# Load the matrices for the same page in two alternative formats
results_pandas = obj_pycom.load_matrices(page, mat_format=MatrixFormat.PANDAS)
results_list = obj_pycom.load_matrices(page, mat_format=MatrixFormat.LIST)

print(f'Pandas: {type(results_pandas.iloc[0].matrix)}')
print(f'List: {type(results_list.iloc[0].matrix)}')

### Load additional information

The lists of cofactors, diseases, and organisms can be loaded by calling:

In [17]:
entries_data_frame = obj_pycom.find(max_length=200)
cofactors = obj_pycom.get_cofactor_list()
diseases = obj_pycom.get_disease_list()
organisms = obj_pycom.get_organism_list()
organisms

| | organismId | nameScientific | nameCommon | taxonomy |
|---|---|---|---|---|
| 0 | 561445 | African swine fever virus (isolate Pig/Kenya/K... | ASFV | :Viruses:Varidnaviria:Bamfordvirae:Nucleocytov... |
| 1 | 10500 | African swine fever virus (isolate Tick/Malawi... | ASFV | :Viruses:Varidnaviria:Bamfordvirae:Nucleocytov... |
| 2 | 561443 | African swine fever virus (isolate Tick/South ... | ASFV | :Viruses:Varidnaviria:Bamfordvirae:Nucleocytov... |
| 3 | 561444 | African swine fever virus (isolate Warthog/Nam... | ASFV | :Viruses:Varidnaviria:Bamfordvirae:Nucleocytov... |
| 4 | 10498 | African swine fever virus (strain Badajoz 1971... | Ba71V | :Viruses:Varidnaviria:Bamfordvirae:Nucleocytov... |
| ... | ... | ... | ... | ... |
| 14316 | 31581 | Rotavirus A (isolate RVA/Pig/Australia/TFR-41/... | RV-A | :Viruses:Riboviria:Orthornavirae:Duplornaviric... |
| 14317 | 31579 | Rotavirus A (isolate RVA/Pig/Australia/BEN144/... | RV-A | :Viruses:Riboviria:Orthornavirae:Duplornaviric... |
| 14318 | 10918 | Rotavirus A (strain RVA/Pig/Russia/K/1987) | RV-A | :Viruses:Riboviria:Orthornavirae:Duplornaviric... |
| 14319 | 31580 | Rotavirus A (isolate RVA/Pig/Australia/BMI-1/1... | RV-A | :Viruses:Riboviria:Orthornavirae:Duplornaviric... |


Congratulations on completing this tutorial. To learn the basics of analysing the coevolution matrices, please refer to `02_Analysis.ipynb`.