major code update
rvhonorato committed May 26, 2022
1 parent 14b7518 commit b62921e
Showing 21 changed files with 909 additions and 686 deletions.
3 changes: 3 additions & 0 deletions .flake8
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
[flake8]
max-line-length = 88
extend-ignore = E203
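
Note on the `.flake8` settings above: an 88-character line limit and ignoring E203 are the values usually paired with the Black formatter, and the `.isort.cfg` added in this commit (`profile = black`) suggests that is the intent here, although the commit does not say so. The sketch below is not part of the commit; the variable names are made up for illustration. It shows the kind of slice formatting that flake8 reports as E203 (whitespace before `:`) unless that check is ignored.

```python
# Illustrative only: this snippet is not part of the commit.
values = list(range(10))
lower, upper, offset = 1, 4, 2

# Black puts spaces around the colon when slice bounds are expressions.
# pycodestyle reports that spacing as E203 unless the check is ignored.
window = values[lower + offset : upper + offset]
print(window)
```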
161 changes: 158 additions & 3 deletions .gitignore
@@ -1,5 +1,160 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
paper/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintainted in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
build/
cazy_parser.egg-info/

# VScode
.vscode/

# Project-specific
*.csv
*.chk
*.fasta
2 changes: 2 additions & 0 deletions .isort.cfg
@@ -0,0 +1,2 @@
[settings]
profile = black
9 changes: 1 addition & 8 deletions CONTRIBUTE.md → CONTRIBUTING.md
@@ -1,17 +1,10 @@
## cazy-parser
*A way to extract specific information from the Carbohydrate-Active enZYmes.*

# How to contribute?
# How to contribute to cazy-parser?

There are still a few features that could be implemented, such as:

* Organism specific selection
* Retrieve three dimensional structures for each family

and specially

* **Retrieve fasta sequences from NCBIs servers**

___

Feel free to contact me with **suggestions**, **bugs reports** or if you need any **assistance** running the software.
106 changes: 29 additions & 77 deletions README.md
@@ -18,105 +18,57 @@ License: [GNU GPLv3](https://www.gnu.org/licenses/gpl-3.0.html)

doi: 10.21105/joss.00053

# IMPORTANT

Due to changes in the CAZy database, the parser is no longer functional, I will try to revive the code and update it soon. (:

## Introduction
*cazy-parser* is a tool that extract information from [CAZy](http://www.cazy.org/) in a more usable and readable format. Firstly, a script reads the HTML structure and creates a mirror of the database as a tab delimited file. Secondly, information is extracted from the database according to user inputted parameters and presented to the user as a set of accession codes.

## Install / Upgrade
`$ pip install --upgrade cazy-parser`

or

Download latest source from [this link](https://pypi.python.org/pypi/cazy-parser)

```
$ tar -zxvf cazy-parser-x.x.x.tar.gz
$ cd cazy-parser-x.x.x
$ python setup.py install
$ pip install --upgrade cazy-parser
```

Note: It my be necessary to open a new terminal.

## Usage

*Internet connection required*

1) Database creation

`$ create_cazy_db`

(-h for help)
* This script will parse the [CAZy](http://www.cazy.org/) database website and create a comma separated table containing the following information:
* domain
* protein_name
* family
* tag *(characterized status)*
* organism_code
* [EC](http://www.enzyme-database.org/) number (ec stands for enzyme comission number)
* [GENBANK](https://www.ncbi.nlm.nih.gov/genbank/) id
* [UNIPROT](https://www.uniprot.org) code
* subfamily
* organism
* [PDB](http://www.rcsb.org/) code

2) Extract accession codes

* Based on the previously generated csv table, extract accession codes for a given protein family.

`$ extract_cazy_ids --db <database> --family <family code>`

(-h for help)
* Optional:

`--subfamilies` Create a file for each subfamily, default = False

`--characterized` Create a file containing only characterized enzymes, default = False

## Usage examples

1) Extract all accession codes from family 9 of Glycosyl Transferases.

`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GT9`

This will generate the following files:
```
GT9.csv
```

2) Extract all accession codes from family 43 of Glycoside Hydrolase, including subfamilies

`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GH43 --subfamilies`

This will generate the following files:

```
GH43.csv
GH43_sub1.csv
GH43_sub2.csv
GH43_sub3.csv
(...)
GH43_sub37.csv
cazy-parser -h
usage: cazy-parser [-h] [-f FAMILY] [-s SUBFAMILY] [-c CHARACTERIZED] [-v] {GH,GT,PL,CA,AA}
positional arguments:
{GH,GT,PL,CA,AA}
optional arguments:
-h, --help show this help message and exit
-f FAMILY, --family FAMILY
-s SUBFAMILY, --subfamily SUBFAMILY
-c CHARACTERIZED, --characterized CHARACTERIZED
-v, --version show version
```

3) Extract all accession codes from family 42 of Polysaccharide Lyases including characterized entries

`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family PL42 --characterized`
### Example

This will generate the following files:
Extract all fasta sequences from family 43 of Glycoside Hydrolase subfamily 1

```
PL42.csv
PL42_characterized.csv
$ cazy-parser GH -f 43 -s 1
[2022-05-26 16:39:21,511 91 INFO] ------------------------------------------
[2022-05-26 16:39:21,511 92 INFO]
[2022-05-26 16:39:21,511 93 INFO] ┌─┐┌─┐┌─┐┬ ┬ ┌─┐┌─┐┬─┐┌─┐┌─┐┬─┐
[2022-05-26 16:39:21,511 94 INFO] │ ├─┤┌─┘└┬┘───├─┘├─┤├┬┘└─┐├┤ ├┬┘
[2022-05-26 16:39:21,511 95 INFO] └─┘┴ ┴└─┘ ┴ ┴ ┴ ┴┴└─└─┘└─┘┴└─ v2.0.0
[2022-05-26 16:39:21,511 96 INFO]
[2022-05-26 16:39:21,511 97 INFO] ------------------------------------------
[2022-05-26 16:39:21,511 183 INFO] Fetching links for Glycoside-Hydrolases, url: http://www.cazy.org/Glycoside-Hydrolases.html
[2022-05-26 16:39:22,454 189 INFO] Only using links of family 43 subfamily 1
[2022-05-26 16:39:23,029 26 INFO] Dowloading 1415 fasta sequences...
[2022-05-26 16:40:32,187 51 INFO] Dumping fasta sequences to file GH43_1_26052022.fasta
```

### Download fasta sequences

Go to [NCBI's Batch Entrez](https://www.ncbi.nlm.nih.gov/sites/batchentrez) change the database to protein and submit the generated `.csv`.
This will generate the following file `GH43_1_DDMMYYY.fasta` containing the fasta sequences.
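
As a follow-up to the example above, here is a minimal sketch of how the resulting FASTA file could be read back in Python. It is not part of cazy-parser: `read_fasta` is a hypothetical helper written for illustration, and it assumes the output is plain-text FASTA with `>` header lines. The file name is taken from the log output above.

```python
from pathlib import Path
from typing import Iterator, Tuple


def read_fasta(path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a plain-text FASTA file.

    Hypothetical helper for illustration; not part of cazy-parser.
    """
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                # Sequence lines may be wrapped; collect and join them.
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # File name copied from the log above; adjust it to your own run.
    fasta_file = Path("GH43_1_26052022.fasta")
    if fasta_file.exists():
        records = list(read_fasta(fasta_file))
        print(f"{len(records)} sequences read from {fasta_file}")
```

Counting the records this way gives a quick check against the number of sequences reported in the log.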

## To-do and how to contribute

Please refer to CONTRIBUTE.md
Please refer to [CONTRIBUTING](CONTRIBUTING.md) (:
2 changes: 1 addition & 1 deletion README.rst
@@ -3,7 +3,7 @@ cazy-parser

The `Carbohydrate-Active enZYmes Database (CAZy) <https://www.cazy.org>`_ provides access to a sequence based classification of enzyme that are responsible for the assembly, modification and breakdown of oligo and polysaccharides.

This database has been online for eighteen years providing relevant genomic, structural and biochemical data on carbohydrate-active enzymes, such asglycoside hydrolases, glycosyl transferases, polysaccharide lyases, carbohydrateesterases and similar enzymes with auxiliary activities. The database isorganized and presented to the user as a series of highly annotated HTML tables.
This database has been online for several years providing relevant genomic, structural and biochemical data on carbohydrate-active enzymes, such as glycoside hydrolases, glycosyl transferases, polysaccharide lyases, carbohydrateesterases and similar enzymes with auxiliary activities. The database isorganized and presented to the user as a series of highly annotated HTML tables.

This script provides a way to extract information from the database according to user need.
