create_cazy_db fails #4

Closed
lonsbio opened this issue May 24, 2017 · 2 comments
Comments

@lonsbio

lonsbio commented May 24, 2017

Unable to create the database on Python 2.7.13. The output (excluding the BeautifulSoup warning) is as follows:

>> Gathering species codes for species with full genomes
>> Glycoside-Hydrolases
>> 145 families found on http://www.cazy.org/Glycoside-Hydrolases.html
> GH1

followed by this error:

first_page_idx = int(page_index_list[0]['href'].split('PRINC=')[-1].split('#')[0]) # be careful with this
ValueError: invalid literal for int() with base 10: 'GH1_archaea.html?debut_TAXO=100'

Has the site's pagination markup changed so that this expression now fails?

@rvhonorato
Owner

Yes, it looks like the pagination changed a bit. I pushed a quick fix using regular expressions (#5), and it should work fine now. Thanks for opening this issue.
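For readers hitting the same traceback, here is a minimal sketch of what a regex-based offset parse could look like. It is not the code from #5: the helper name `page_offset` and the pattern are assumptions, based only on the two link forms visible in this thread (`PRINC=<n>` in the original expression and `debut_TAXO=100` in the error message).

```python
import re

# Hypothetical pattern: accepts the two query keys seen in this thread
# (PRINC= from the original split, TAXO= from the failing href).
_OFFSET_RE = re.compile(r'(?:PRINC|TAXO)=(\d+)')


def page_offset(href):
    """Return the numeric pagination offset found in href, or None."""
    match = _OFFSET_RE.search(href)
    # Only the captured digits reach int(), so a trailing '#anchor' or a
    # leading file name can no longer raise ValueError.
    return int(match.group(1)) if match else None
```

Returning `None` when no offset is present lets the caller decide how to treat a family with a single page, rather than crashing in the middle of the scrape.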

@lonsbio
Author

lonsbio commented May 25, 2017

Thanks! I tried my own patch overnight (not as elegant) and it seemed to work too.

Also, I'm not sure whether this is a recent issue or incidental, but my DB download file seems to have newlines surrounding the organism field:

domain	protein_name	family	tag	organism_code	ec	genbank	uniprot	subfamily	organism	pdb
	 Ahos_0285	GH1		invalid	 	AEE93176.1	 		
Acidianus hospitalis W1
	  

Fixing it doesn't seem to affect the extract script, but it does make the csv (tsv) file readable. Is the wrapping intentional?
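A rough sketch of the kind of cleanup I mean, assuming each record is available as a list of field strings before it is written out. The function names `clean_field` and `clean_row` are hypothetical and not part of the project:

```python
def clean_field(value):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    # str.split() with no argument splits on any run of whitespace and
    # drops leading/trailing whitespace, so a whitespace-only field
    # becomes an empty string.
    return " ".join(value.split())


def clean_row(fields):
    """Return one tab-separated line built from the cleaned fields."""
    return "\t".join(clean_field(field) for field in fields)
```

Applied to the record above, the organism field becomes `Acidianus hospitalis W1` with no surrounding newlines, and the row stays on a single line with the same 11 columns as the header.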
