
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2 #83

Closed
tjiagoM opened this issue Jul 7, 2019 · 20 comments

Comments

@tjiagoM

tjiagoM commented Jul 7, 2019

Hello,

I have to run multiple enrichments over different groups of genes, so I just have a big for loop that goes over all these groups of genes and, for each one, runs:

enr = gp.enrichr(gene_list=list(genes_array.astype('<U3')),
                 organism='human',
                 description='test',
                 gene_sets='Reactome_2016',
                 cutoff=1)

Once in a while I have this error:

Traceback (most recent call last):                       
File "my_script.py", line 83, in <module>
  cutoff=1)
File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 391, in enrichr
  enr.run()
File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 331, in run
  shortID, res = self.get_results(genes_list)
File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 169, in get_results
  res = pd.read_csv(StringIO(response.content.decode('utf-8')),sep="\t")
File "/home_location/miniconda/envs/env-general/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
  return _read(filepath_or_buffer, kwds)
File "/home_location/miniconda/envs/env-general/lib/python3.6/site-packages/pandas/io/parsers.py", line 435, in _read
  data = parser.read(nrows)
File "/home_location/miniconda/envs/env-general/lib/python3.6/site-packages/pandas/io/parsers.py", line 1139, in read
  ret = self._engine.read(nrows)
File "/home_location/miniconda/envs/env-general/lib/python3.6/site-packages/pandas/io/parsers.py", line 1995, in read
  data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2   

I'm having a lot of difficulty isolating the error because it doesn't always happen for the same group of genes. Could anyone give a hint about what the problem could be? I've only started using gseapy very recently.

If I cannot find the source of the error I guess it's fine, because I've been able to run all the groups by just re-running the code... which is quite annoying, as I don't know whether some enrichments might be wrong. What could I be missing here?

@zqfang
Owner

zqfang commented Jul 8, 2019

How many gene groups are you querying? You got this problem because of this line of code:

 res = pd.read_csv(StringIO(response.content.decode('utf-8')),sep="\t")

I don't know exactly what happens, but I suspect the reason is network latency: gseapy waits a long time to get results back from the Enrichr server. I'll take some time to look at this.

@tjiagoM
Author

tjiagoM commented Jul 8, 2019

Yeah, for some groups I have a few hundred genes, but I ended up not saving the failing groups because they constantly change. I will try running again and see for which groups it stops this time.

Now that you mention it, gseapy was sometimes failing because of a connection reset exception, and I solved that by just adding a few milliseconds of sleep before calling enrichr() each time. Could it be that the response read by StringIO contains some error/warning from the API request, and that's why pandas cannot parse it properly?
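For context, the loop with the sleep workaround looks roughly like this (a minimal sketch; gene_groups stands in for the real data structure in my script):

import time
import gseapy as gp

results = {}
for group_name, genes_array in gene_groups.items():  # gene_groups is a placeholder
    time.sleep(0.005)  # a few milliseconds of pause before each request, as described above
    enr = gp.enrichr(gene_list=list(genes_array.astype('<U3')),
                     organism='human',
                     description='test',
                     gene_sets='Reactome_2016',
                     cutoff=1)
    results[group_name] = enr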

@tjiagoM
Author

tjiagoM commented Jul 8, 2019

@zqfang I was going to create a new issue, but I'm now receiving another error in an inconsistent way (a bit like the error in this issue). Do you think it might be related?
Apologies for just throwing the exceptions here, but they appear randomly, so maybe you know better how to help me.

Traceback (most recent call last):
  File "07_explain_communitites.py", line 84, in <module>
    cutoff=0.05)
  File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 391, in enrichr
    enr.run()
  File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 309, in run
    gss = self.parse_genesets()
  File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 68, in parse_genesets
    enrichr_library = self.get_libraries()
  File "home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 183, in get_libraries
    libs = [lib['libraryName'] for lib in libs_json['statistics']]
KeyError: 'statistics'

@zqfang
Owner

zqfang commented Jul 9, 2019

I think the problems you've had have the same cause: the Enrichr server cannot handle gseapy sending many requests from the same IP address in a short time. It seems that such users get blocked to prevent API abuse, so when you try to get the data back, you get nothing. I have no better way to improve this than adding a sleep after each query. Do you have any ideas?
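To make the idea concrete, a retry with a growing pause on the caller's side would look roughly like this (just a sketch, not part of gseapy; the helper name is made up):

import time
import gseapy as gp
from pandas.errors import ParserError

# hypothetical helper, not part of gseapy: retry a query with a growing pause
def enrichr_with_retry(genes, gene_sets, retries=3, delay=5):
    for attempt in range(retries):
        try:
            return gp.enrichr(gene_list=genes, gene_sets=gene_sets,
                              organism='human', description='test', cutoff=1)
        except (ParserError, KeyError):
            # the server probably returned an error page instead of a result table,
            # so back off for a while before trying again
            time.sleep(delay * (attempt + 1))
    raise RuntimeError("Enrichr did not return valid results after %d attempts" % retries)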

@tjiagoM
Author

tjiagoM commented Jul 9, 2019

I see, thanks for the help anyway!

I'd say if you get a timeout from the Enrichr server, or some error in the response coming back from Enrichr, maybe just catch it and tell the user that the problem is with the Enrichr server (and perhaps suggest waiting a bit or reducing the number of requests). Otherwise all these errors will just cause confusion when the problem is actually simple, as you pointed out.
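Just to illustrate the kind of check I mean on the library side, something along these lines would already help (only a sketch of the idea; parse_enrichr_response is a made-up name, not gseapy's actual internals):

import pandas as pd
from io import StringIO

# hypothetical sketch, not gseapy's actual code: validate the response before parsing
def parse_enrichr_response(response, logger):
    text = response.content.decode('utf-8')
    lines = text.splitlines()
    # a valid Enrichr export is a tab-separated table; anything else is most likely
    # an error page returned because the server is busy or blocking the client
    if not response.ok or not lines or '\t' not in lines[0]:
        logger.error("Enrichr server did not return results; "
                     "wait a bit or reduce the number of requests and try again")
        return None
    return pd.read_csv(StringIO(text), sep="\t")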

@zqfang
Owner

zqfang commented Jul 10, 2019

Well, good idea. A warning should be printed to the console if nothing gets back. The Enrichr server is being upgraded right now. If you still have the same problem, you'll need to re-run.
(screenshot attached: 2019-07-10, 3:41 PM)

@tsnetterfield

I am also getting the same error that @tjiagoM posted above when executing the following on a list of about 50 genes:

en_rnk_1=gp.enrichr(gene_list=rnk1_en,description='test',gene_sets='NCI-Nature_2016',outdir='./GSEA Files/Selected Gene Sets')

I updated to the latest release and am still getting this issue. Is there still a problem with the server that is causing this?

@tsnetterfield

tsnetterfield commented Sep 26, 2019

I have waited a week and I am still getting the same error.

2019-09-26 14:28:42,305 Error fetching enrichment results: TRRUST_Transcription_Factors_2019
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-59-902aeaec60e8> in <module>
----> 1 en_rnk_1=gp.enrichr(gene_list=rnk1_en,gene_sets='TRRUST_Transcription_Factors_2019',outdir='./GSEA Files/Selected Gene Sets')

~\Anaconda3\lib\site-packages\gseapy\enrichr.py in enrichr(gene_list, gene_sets, organism, description, outdir, background, cutoff, format, figsize, top_term, no_plot, verbose)
    415     enr = Enrichr(gene_list, gene_sets, organism, description, outdir,
    416                   cutoff, background, format, figsize, top_term, no_plot, verbose)
--> 417     enr.run()
    418 
    419     return enr

~\Anaconda3\lib\site-packages\gseapy\enrichr.py in run(self)
    354                 self._logger.debug("Start Enrichr using library: %s" % (self._gs))
    355                 self._logger.info('Analysis name: %s, Enrichr Library: %s' % (self.descriptions, self._gs))
--> 356                 shortID, res = self.get_results(genes_list)
    357                 # Remember gene set library used
    358             res.insert(0, "Gene_set", self._gs)

~\Anaconda3\lib\site-packages\gseapy\enrichr.py in get_results(self, gene_list)
    182         if not response.ok:
    183             self._logger.error('Error fetching enrichment results: %s'%self._gs)
--> 184         res = pd.read_csv(StringIO(response.content.decode('utf-8')), sep="\t")
    185         return [job_id['shortId'], res]
    186 

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    700                     skip_blank_lines=skip_blank_lines)
    701 
--> 702         return _read(filepath_or_buffer, kwds)
    703 
    704     parser_f.__name__ = name

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    433 
    434     try:
--> 435         data = parser.read(nrows)
    436     finally:
    437         parser.close()

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1137     def read(self, nrows=None):
   1138         nrows = _validate_integer('nrows', nrows)
-> 1139         ret = self._engine.read(nrows)
   1140 
   1141         # May alter columns / col_dict

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1993     def read(self, nrows=None):
   1994         try:
-> 1995             data = self._reader.read(nrows)
   1996         except StopIteration:
   1997             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2

Any insight into why this may be happening?

zqfang added a commit that referenced this issue Sep 28, 2019
@zqfang
Owner

zqfang commented Sep 28, 2019

@tsnetterfield, sorry for replying late. Could you please install the latest PR and try again? I've updated the data that pandas reads. I hope this fixes the problem you're having.

@tsnetterfield

@zqfang Thanks for getting back to me! I updated my Python to 3.7.4 and am still getting the same error I posted above.

@zqfang
Owner

zqfang commented Sep 29, 2019

@tsnetterfield, please install the latest gseapy using this line of code:

pip install git+git://github.com/zqfang/gseapy.git#egg=gseapy

Make sure that you are using v0.9.16.

@tsnetterfield

@zqfang When I do this in the Anaconda Prompt, this is the first line that comes up:

Requirement already satisfied: gseapy from git+git://github.com/zqfang/gseapy.git#egg=gseapy in c:\users\tatiana\anaconda3\lib\site-packages (0.9.15)

Anaconda seems to only see the 0.9.15 development version for some reason.

@armadillocommander

armadillocommander commented Sep 29, 2019 via email

@tsnetterfield

@armadillocommander thanks for the tip! I uninstalled and now have version 0.9.16. However, I am still getting the exact same parser error from above.
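For reference, the uninstall/reinstall sequence was roughly the following (inferred from the tip above; adjust for your own environment):

pip uninstall gseapy
pip install git+git://github.com/zqfang/gseapy.git#egg=gseapy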

@zqfang
Owner

zqfang commented Sep 30, 2019

@tsnetterfield, do you mind sharing your gene list input with me? I can't reproduce your bug.

@tsnetterfield

my_gene_list.txt

Hi @zqfang, attached is the list I was trying to run. I tried a different list just now and got the same error.

@zqfang
Owner

zqfang commented Oct 8, 2019

@tsnetterfield, sorry for replying late. I was on vacation. However, I still could not reproduce the error you got using the same code:

en_rnk_1=gp.enrichr(gene_list="my_gene_list.txt" ,description='test',gene_sets='NCI-Nature_2016',outdir='./GSEA Files/Selected Gene Sets')

Even when I ran the code 50 times, it did not break.

zqfang added a commit that referenced this issue May 2, 2020
@zqfang
Owner

zqfang commented May 2, 2020

Closing now. This issue should be gone.

@zqfang zqfang closed this as completed May 2, 2020
@Eddy265

Eddy265 commented Feb 23, 2021

Alternatively, you can save the file as CSV UTF-8 (comma delimited).

@smartup10

I had the same error; I fixed it by regularizing the data in the CSV file.
