goatools package issue #5

Closed
lonetravelwolf opened this issue Apr 7, 2022 · 7 comments

Comments

@lonetravelwolf

Hello, when you ran the create_go_data part of gen_onto_protein_data.py, did you run into problems like the following?
With goatools 1.2.3, go_term.definition raises an error: the object has no .definition attribute.
With goatools 1.0.11, it fails with RecursionError: maximum recursion depth exceeded while calling a Python object

@Alexzhuan
Collaborator

Hello. To extract the definitions of GO terms from the obo file, we modified the source of obo_parser.py in the goatools package. The changes are:

# line 132
elif line[:5] == "def: ":
    rec_curr.definition = line[5:]

# line 169
self.definition = ""
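For readers who would rather not patch the installed package, here is a minimal sketch of reading the same `def: ` lines directly with the standard library. It is not the project's method. The function name `read_obo_definitions` and the sample stanzas are hypothetical, and the sketch only assumes the usual OBO layout of `[Term]` stanzas with `id: ` and `def: ` tags.

```python
import io


def read_obo_definitions(handle):
    """Map each term id found in [Term] stanzas to the text after 'def: '."""
    definitions = {}
    in_term = False
    term_id = None
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas describe GO terms.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[4:]
            # Default to an empty definition, like self.definition = "".
            definitions.setdefault(term_id, "")
        elif in_term and term_id is not None and line.startswith("def: "):
            definitions[term_id] = line[5:]
    return definitions


# Placeholder stanzas for illustration only, not real GO content.
SAMPLE_OBO = """format-version: 1.2

[Term]
id: GO:0000001
name: example term
def: "Placeholder definition text." [EXAMPLE:ref]

[Term]
id: GO:0000002
name: term without a definition

[Typedef]
id: part_of
def: "Not a GO term, so it is skipped." []
"""

definitions = read_obo_definitions(io.StringIO(SAMPLE_OBO))
```

With a real file, the same function can be called on `open("go.obo", encoding="utf-8")`, where the path is again an assumption.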

@lonetravelwolf
Author

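Thank you for the reply, and sorry to bother you again. When I run the create_onto_protein_data part, some records in the three files 'component.txt', 'function.txt', and 'process.txt' have 3 elements while others have 4, so the statement protein, relation, go, _ = rec fails to unpack. Is this a problem with the data?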

@Alexzhuan Alexzhuan reopened this Apr 8, 2022
@Alexzhuan
Collaborator

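Hello. The four fields in component.txt, function.txt, and process.txt are the protein ID, the relation, the GO ID, and the evidence code, and none of them should be missing: all four are required in the corresponding GO annotation file goa_uniprot_all.gaf (see the annotation standard). You may want to check the GO annotation data you passed to create_goa_triplet.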

Also, the data construction script runs its steps in this order: create_uniprot_data -> create_goa_triplet -> create_go_data -> create_onto_protein_data
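To find which records break the four-way unpacking, one option is to separate well-formed records from malformed ones before the loop. The following is a minimal sketch under assumptions: `partition_records` is a hypothetical helper, each record is assumed to be a sequence of strings, and the sample values are illustrative rather than taken from the project's files.

```python
def partition_records(records, expected_fields=4):
    """Split records into those with the expected field count and the rest."""
    valid, malformed = [], []
    for index, rec in enumerate(records):
        if len(rec) == expected_fields:
            valid.append(tuple(rec))
        else:
            # Keep the position so the offending input line can be located.
            malformed.append((index, tuple(rec)))
    return valid, malformed


# Illustrative records: the second one lacks the relation field.
records = [
    ("A0A000", "enables", "GO:0003824", "IEA"),
    ("A0A000", "GO:0003870", "IEA"),
]

valid, malformed = partition_records(records)

triples = []
for protein, relation, go, _ in valid:
    # Safe to unpack: every record in `valid` has exactly four fields.
    triples.append((protein, relation, go))
```

Inspecting `malformed` shows which input lines to check in the annotation data, instead of failing on the first bad record.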

@lonetravelwolf
Author

Hello, here is part of my goa_uniprot_all.gaf file:

!gaf-version: 2.1
!
!This file contains all GO annotations and gene product information for proteins in the UniProt KnowledgeBase (UniProtKB),
!IntAct protein complexes, and RNAcentral identifiers.
!
!Generated: 2016-07-04 15:52
!GO-version: http://purl.obolibrary.org/obo/go/releases/2016-06-29/go.owl
!
UniProtKB	A0A000	moeA5		GO:0003824	GO_REF:0000002	IEA	InterPro:IPR015421|InterPro:IPR015422	F	MoeA5	A0A000_9ACTN|moeA5	protein	taxon:35758	20160702	InterPro		
UniProtKB	A0A000	moeA5		GO:0003870	GO_REF:0000002	IEA	InterPro:IPR010961	F	MoeA5	A0A000_9ACTN|moeA5	protein	taxon:35758	20160702	InterPro		
UniProtKB	A0A000	moeA5		GO:0009058	GO_REF:0000002	IEA	InterPro:IPR004839	P	MoeA5	A0A000_9ACTN|moeA5	protein	taxon:35758	20160702	InterPro

Some of the records instead look like this:

UniProtKB	K4CLI3	K4CLI3	NOT	GO:0031616	GO_REF:0000033	IBA	PANTHER:PTN000682053	C	Uncharacterized protein	K4CLI3_SOLLC	protein	taxon:4081	20140909	GO_Central	

Is it normal to have these two different kinds of entries? The file I downloaded is goa_uniprot_all.gaf.156.gz

@Alexzhuan
Collaborator

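Hello. I looked at it, and the annotation data you downloaded uses gaf-version: 2.1, which is an earlier format. Recently released annotation data uses gaf-version: 2.2, and version 2.2 changed the Qualifier (relationship) field (details are in the linked format description). We use relatively recent annotation data. If you need earlier annotation data, you can download it in .gpa format instead (see the corresponding field list), but the field positions in the script would then need to be changed.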
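The format difference may explain the 3-element records: in the GAF 2.1 rows quoted in this thread the Qualifier column (column 4) is either empty or just `NOT`, whereas GAF 2.2 puts a relation there. Below is a minimal sketch of pulling the four fields out of a GAF row while tolerating an empty Qualifier. The helper `gaf_line_to_quad` is hypothetical, and the fallback mapping from the Aspect column to a relation (`F` to `enables`, `P` to `involved_in`, `C` to `located_in`) is an assumption, not something the project's script is known to do.

```python
# Assumed fallback relation per Aspect when a GAF 2.1 row has no relation.
DEFAULT_RELATION = {"F": "enables", "P": "involved_in", "C": "located_in"}


def gaf_line_to_quad(line):
    """Return (protein, relation, go_id, evidence) or None for non-data lines."""
    if line.startswith("!"):
        return None  # header/comment line
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return None  # too short to contain the Aspect column
    protein, qualifier, go_id = cols[1], cols[3], cols[4]
    evidence, aspect = cols[6], cols[8]
    parts = [p for p in qualifier.split("|") if p]
    negated = "NOT" in parts
    relations = [p for p in parts if p != "NOT"]
    # Use the explicit relation when present, else the assumed default.
    relation = relations[0] if relations else DEFAULT_RELATION.get(aspect, "")
    if negated:
        relation = "NOT|" + relation
    return protein, relation, go_id, evidence


# First nine columns of two rows quoted in this thread (GAF 2.1).
row_empty_qualifier = "\t".join([
    "UniProtKB", "A0A000", "moeA5", "", "GO:0003824", "GO_REF:0000002",
    "IEA", "InterPro:IPR015421|InterPro:IPR015422", "F",
])
row_not_qualifier = "\t".join([
    "UniProtKB", "K4CLI3", "K4CLI3", "NOT", "GO:0031616", "GO_REF:0000033",
    "IBA", "PANTHER:PTN000682053", "C",
])

quad_empty = gaf_line_to_quad(row_empty_qualifier)
quad_not = gaf_line_to_quad(row_not_qualifier)
header = gaf_line_to_quad("!gaf-version: 2.1")
```

Whether a defaulted relation is acceptable for training data is a separate question. Using GAF 2.2 data avoids the guess entirely.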

@lonetravelwolf
Author

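Hello, I tried downloading the new goa_uniprot_all.gaf.gz, but this dataset seems to be extremely large: the compressed file is 40G and I cannot decompress it. Is this the one you used? Could I use goa_uniprot_all.gpi.gz as a substitute?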

@Alexzhuan
Collaborator

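The annotation data is quite large, probably close to 100G once decompressed. .gpi probably won't work: according to the official documentation, a GPI file is a companion to the GPAD file and contains no GO annotation information. You can look into the official documentation for details.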
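If disk space is the obstacle, the compressed file can be read line by line without ever writing the decompressed copy. This is a minimal sketch using Python's standard `gzip` module. `iter_gaf_rows` is a hypothetical helper and the tiny file written below exists only for demonstration. Adapting the project's script to read this way is left as an assumption.

```python
import gzip
import os
import tempfile


def iter_gaf_rows(path):
    """Yield tab-split data rows from a gzip-compressed GAF file, lazily."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("!"):
                continue  # skip header/comment lines
            yield line.rstrip("\n").split("\t")


# Build a tiny compressed file so the sketch can be run end to end.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.gaf.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write("!gaf-version: 2.2\n")
        out.write("UniProtKB\tA0A000\tmoeA5\tenables\tGO:0003824\n")
    rows = list(iter_gaf_rows(demo_path))
```

Because the function is a generator, only one line is held in memory at a time, so the same loop can be pointed at the full goa_uniprot_all.gaf.gz.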

@zxlzr zxlzr closed this as completed Apr 9, 2022