major code update

rvhonorato · May 26, 2022 · b62921e
1 parent 14b7518

Showing 21 changed files with 909 additions and 686 deletions.
diff --git a/.flake8 b/.flake8
@@ -0,0 +1,3 @@
+[flake8]
+max-line-length = 88
+extend-ignore = E203
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,160 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
 dist/
-paper/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 .idea/
-build/
-cazy_parser.egg-info/
+
+# VScode
+.vscode/
+
+# Project-specific
+*.csv
+*.chk
+*.fasta
diff --git a/.isort.cfg b/.isort.cfg
@@ -0,0 +1,2 @@
+[settings]
+profile = black
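
Note on the two linter configs above: `max-line-length = 88`, `extend-ignore = E203` and `profile = black` are the values the Black documentation suggests for running flake8 and isort alongside Black. Whether this repository actually formats with Black is an assumption, since the diff does not show it. A minimal sketch of the kind of line E203 would otherwise flag (the variable names are hypothetical):

```python
# Black puts spaces around the colon when slice bounds are expressions.
# Without "extend-ignore = E203", flake8 reports "whitespace before ':'" here.
values = list(range(10))
lower, upper, offset = 1, 4, 2

window = values[lower + offset : upper + offset]
print(window)
```
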
diff --git a/CONTRIBUTE.md b/CONTRIBUTING.md
@@ -1,17 +1,10 @@
-## cazy-parser
-*A way to extract specific information from the Carbohydrate-Active enZYmes.*
-
-# How to contribute?
+# How to contribute to cazy-parser?
 
 There are still a few features that could be implemented, such as:
 
 * Organism specific selection
 * Retrieve three dimensional structures for each family
 
-and specially
-
-* **Retrieve fasta sequences from NCBIs servers**
-
 ___
 
 Feel free to contact me with **suggestions**, **bug reports** or if you need any **assistance** running the software.
diff --git a/README.md b/README.md
@@ -18,105 +18,57 @@ License: [GNU GPLv3](https://www.gnu.org/licenses/gpl-3.0.html)
 
 doi: 10.21105/joss.00053
 
-# IMPORTANT
-
-Due to changes in the CAZy database, the parser is no longer functional, I will try to revive the code and update it soon. (:
 
 ## Introduction
  *cazy-parser* is a tool that extracts information from [CAZy](http://www.cazy.org/) in a more usable and readable format. First, a script reads the HTML structure and creates a mirror of the database as a tab-delimited file. Second, information is extracted from the database according to user-supplied parameters and presented to the user as a set of accession codes.
 
 ## Install / Upgrade
-`$ pip install --upgrade cazy-parser`
-
-or
-
-Download latest source from [this link](https://pypi.python.org/pypi/cazy-parser)
-
 ```
-$ tar -zxvf cazy-parser-x.x.x.tar.gz
-$ cd cazy-parser-x.x.x
-$ python setup.py install
-
+$ pip install --upgrade cazy-parser
 ```
 
-Note: It my be necessary to open a new terminal.
 
 ## Usage
 
 *Internet connection required*
 
-1) Database creation
-
-`$ create_cazy_db`
-
-(-h for help)
-* This script will parse the [CAZy](http://www.cazy.org/) database website and create a comma separated table containing the following information:
-    * domain
-    * protein_name
-    * family
-    * tag *(characterized status)*
-    * organism_code
-    * [EC](http://www.enzyme-database.org/) number (ec stands for enzyme comission number)
-    * [GENBANK](https://www.ncbi.nlm.nih.gov/genbank/) id
-    * [UNIPROT](https://www.uniprot.org) code
-    * subfamily
-    * organism
-    * [PDB](http://www.rcsb.org/) code
-
-2) Extract accession codes
-
-* Based on the previously generated csv table, extract accession codes for a given protein family.
-
-`$ extract_cazy_ids --db <database> --family <family code>`
-
-(-h for help)
-* Optional:
-
-`--subfamilies` Create a file for each subfamily, default = False
-
-`--characterized` Create a file containing only characterized enzymes, default = False
-
-## Usage examples
-
-1) Extract all accession codes from family 9 of Glycosyl Transferases.
-
-`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GT9`
-
-This will generate the following files:
-```
-GT9.csv
-```
-
-2) Extract all accession codes from family 43 of Glycoside Hydrolase, including subfamilies
-
-`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GH43 --subfamilies`
-
-This will generate the following files:
 
 ```
-GH43.csv
-GH43_sub1.csv
-GH43_sub2.csv
-GH43_sub3.csv
-(...)
-GH43_sub37.csv
+cazy-parser -h
+usage: cazy-parser [-h] [-f FAMILY] [-s SUBFAMILY] [-c CHARACTERIZED] [-v] {GH,GT,PL,CA,AA}
+
+positional arguments:
+  {GH,GT,PL,CA,AA}
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -f FAMILY, --family FAMILY
+  -s SUBFAMILY, --subfamily SUBFAMILY
+  -c CHARACTERIZED, --characterized CHARACTERIZED
+  -v, --version         show version
 ```
 
-3) Extract all accession codes from family 42 of Polysaccharide Lyases including characterized entries
-
-`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family PL42 --characterized`
+### Example
 
-This will generate the following files:
+Extract all fasta sequences from subfamily 1 of Glycoside Hydrolase family 43
 
 ```
-PL42.csv
-PL42_characterized.csv
+$ cazy-parser GH -f 43 -s 1
+ [2022-05-26 16:39:21,511 91 INFO] ------------------------------------------
+ [2022-05-26 16:39:21,511 92 INFO]
+ [2022-05-26 16:39:21,511 93 INFO] ┌─┐┌─┐┌─┐┬ ┬   ┌─┐┌─┐┬─┐┌─┐┌─┐┬─┐
+ [2022-05-26 16:39:21,511 94 INFO] │  ├─┤┌─┘└┬┘───├─┘├─┤├┬┘└─┐├┤ ├┬┘
+ [2022-05-26 16:39:21,511 95 INFO] └─┘┴ ┴└─┘ ┴    ┴  ┴ ┴┴└─└─┘└─┘┴└─ v2.0.0
+ [2022-05-26 16:39:21,511 96 INFO]
+ [2022-05-26 16:39:21,511 97 INFO] ------------------------------------------
+ [2022-05-26 16:39:21,511 183 INFO] Fetching links for Glycoside-Hydrolases, url: http://www.cazy.org/Glycoside-Hydrolases.html
+ [2022-05-26 16:39:22,454 189 INFO] Only using links of family 43 subfamily 1
+ [2022-05-26 16:39:23,029 26 INFO] Dowloading 1415 fasta sequences...
+ [2022-05-26 16:40:32,187 51 INFO] Dumping fasta sequences to file GH43_1_26052022.fasta
 ```
 
-### Download fasta sequences
-
-Go to [NCBI's Batch Entrez](https://www.ncbi.nlm.nih.gov/sites/batchentrez) change the database to protein and submit the generated `.csv`.
+This will generate the file `GH43_1_DDMMYYYY.fasta`, which contains the fasta sequences.
 
 ## To-do and how to contribute
 
-Please refer to CONTRIBUTE.md
+Please refer to [CONTRIBUTING](CONTRIBUTING.md) (:
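
Note on the new README usage: the help text and the example log suggest a single positional enzyme class plus optional `-f`/`-s`/`-c` flags, and an output file named from the class, family, subfamily and the run date as DDMMYYYY. The sketch below mirrors that interface with `argparse` to make the naming concrete. It is not the code from this commit: `build_parser` and `output_name` are hypothetical helpers, the `-v` flag is left out, and the file name produced when no family or subfamily is given is an assumption.

```python
import argparse
from datetime import date

# Enzyme classes listed in the README usage text.
ENZYME_CLASSES = ["GH", "GT", "PL", "CA", "AA"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser shaped like the usage text shown in the README."""
    parser = argparse.ArgumentParser(prog="cazy-parser")
    parser.add_argument("enzyme_class", choices=ENZYME_CLASSES)
    parser.add_argument("-f", "--family")
    parser.add_argument("-s", "--subfamily")
    parser.add_argument("-c", "--characterized")
    return parser


def output_name(enzyme_class, family=None, subfamily=None, today=None):
    """Compose a name such as GH43_1_26052022.fasta (date as DDMMYYYY)."""
    today = today or date.today()
    name = enzyme_class
    if family:
        name += str(family)
    if subfamily:
        # Assumption: the subfamily is appended after an underscore.
        name += f"_{subfamily}"
    return f"{name}_{today.strftime('%d%m%Y')}.fasta"


# Same arguments as the README example, with the date taken from its log.
args = build_parser().parse_args(["GH", "-f", "43", "-s", "1"])
print(output_name(args.enzyme_class, args.family, args.subfamily, date(2022, 5, 26)))
```

With the date from the example log (2022-05-26) this gives `GH43_1_26052022.fasta`, the file name reported in the last log line.
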
diff --git a/README.rst b/README.rst
@@ -3,7 +3,7 @@ cazy-parser
 
 The `Carbohydrate-Active enZYmes Database (CAZy) <https://www.cazy.org>`_ provides access to a sequence-based classification of enzymes that are responsible for the assembly, modification and breakdown of oligo- and polysaccharides.
 
-This database has been online for eighteen years providing relevant genomic, structural and biochemical data on carbohydrate-active enzymes, such asglycoside hydrolases, glycosyl transferases, polysaccharide lyases, carbohydrateesterases and similar enzymes with auxiliary activities. The database isorganized and presented to the user as a series of highly annotated HTML tables.
+This database has been online for several years providing relevant genomic, structural and biochemical data on carbohydrate-active enzymes, such as glycoside hydrolases, glycosyl transferases, polysaccharide lyases, carbohydrate esterases and similar enzymes with auxiliary activities. The database is organized and presented to the user as a series of highly annotated HTML tables.
 
 This script provides a way to extract information from the database according to user need.