Error with prerank, after updating to newest version of GSEApy #251

Closed
jasiozaucha opened this issue Feb 28, 2024 · 1 comment

Setup

I am reporting a problem with GSEApy version, Python version, and operating
system as follows:

python 3.9.6 (default, Nov 10 2023, 13:38:27)
[Clang 15.0.0 (clang-1500.1.0.2.5)]
CPython
macOS-14.1-arm64-arm-64bit
1.1.1

Expected behaviour

2024-02-28 10:42:18,872 [WARNING] Duplicated values found in preranked stats: 4.97% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-02-28 10:42:18,872 [INFO] Parsing data files for GSEA.............................
2024-02-28 10:42:18,872 [INFO] Enrichr library gene sets already downloaded in: /Users/kpbr532/.cache/gseapy, use local file
2024-02-28 10:42:18,880 [INFO] 0000 gene_sets have been filtered out when max_size=1000 and min_size=5
2024-02-28 10:42:18,880 [INFO] 0050 gene_sets used for further statistical testing.....
2024-02-28 10:42:18,880 [INFO] Start to run GSEA...Might take a while..................
2024-02-28 10:42:20,297 [INFO] Congratulations. GSEApy runs successfully................

Actual behaviour

In [76]: pre_res = gp.prerank(rnk='RNAseq.rnk', # or rnk = rnk,
...: gene_sets='MSigDB_Hallmark_2020',
...: threads=4,
...: min_size=5,
...: max_size=1000,
...: permutation_num=1000, # reduce number to speed up testing
...: outdir=None, # don't write to disk
...: seed=6,
...: verbose=True, # see what's going on behind the scenes
...: )
2024-02-28 10:42:50,362 [INFO] Input gene rankings contains duplicated IDs

KeyError Traceback (most recent call last)
Cell In[76], line 1
----> 1 pre_res = gp.prerank(rnk='RNAseq.rnk', # or rnk = rnk,
2 gene_sets='MSigDB_Hallmark_2020',
3 threads=4,
4 min_size=5,
5 max_size=1000,
6 permutation_num=1000, # reduce number to speed up testing
7 outdir=None, # don't write to disk
8 seed=6,
9 verbose=True, # see what's going on behind the scenes
10 )

File ~/Library/Python/3.9/lib/python/site-packages/gseapy/__init__.py:396, in prerank(rnk, gene_sets, outdir, pheno_pos, pheno_neg, min_size, max_size, permutation_num, weight, ascending, threads, figsize, format, graph_num, no_plot, seed, verbose, *arg, **kwargs)
375 weight = kwargs["weighted_score_type"]
377 pre = Prerank(
378 rnk,
379 gene_sets,
(...)
394 verbose,
395 )
--> 396 pre.run()
397 return pre

File ~/Library/Python/3.9/lib/python/site-packages/gseapy/gsea.py:444, in Prerank.run(self)
441 assert self.min_size <= self.max_size
443 # parsing rankings
--> 444 dat2 = self.load_ranking()
445 assert len(dat2) > 1
446 self.ranking = dat2

File ~/Library/Python/3.9/lib/python/site-packages/gseapy/gsea.py:418, in Prerank.load_ranking(self)
415 rank_metric = self._load_data(self.rnk) # gene id is the first column
416 if rank_metric.select_dtypes(np.number).shape[1] == 1:
417 # return series
--> 418 return self._load_ranking(rank_metric)
419 ## In case the input type multi-column ranking dataframe
420 # drop na gene id values
421 rank_metric = rank_metric.dropna(subset=rank_metric.columns[0])

File ~/Library/Python/3.9/lib/python/site-packages/gseapy/gsea.py:385, in Prerank._load_ranking(self, rank_metric)
383 rank_metric.dropna(how="any", inplace=True)
384 # rename duplicate id, make them unique
--> 385 rank_metric = self.make_unique(rank_metric, col_idx=0)
386 # reset ranking index, because you have sort values and drop duplicates.
387 rank_metric.reset_index(drop=True, inplace=True)

File ~/Library/Python/3.9/lib/python/site-packages/gseapy/base.py:246, in GSEAbase.make_unique(self, rank_metric, col_idx)
243 self._logger.info("Input gene rankings contains duplicated IDs")
244 mask = rank_metric.duplicated(subset=id_col, keep=False)
245 dups = (
--> 246 rank_metric.loc[mask, id_col]
247 .groupby(id_col)
248 .cumcount()
249 .map(lambda c: "_" + str(c) if c else "")
250 )
251 rank_metric.loc[mask, id_col] = rank_metric.loc[mask, id_col] + dups
252 return rank_metric

File ~/Library/Python/3.9/lib/python/site-packages/pandas/core/series.py:2076, in Series.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
2073 raise TypeError("You have to supply one of 'by' and 'level'")
2074 axis = self._get_axis_number(axis)
-> 2076 return SeriesGroupBy(
2077 obj=self,
2078 keys=by,
2079 axis=axis,
2080 level=level,
2081 as_index=as_index,
2082 sort=sort,
2083 group_keys=group_keys,
2084 squeeze=squeeze,
2085 observed=observed,
2086 dropna=dropna,
2087 )

File ~/Library/Python/3.9/lib/python/site-packages/pandas/core/groupby/groupby.py:965, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
962 if grouper is None:
963 from pandas.core.groupby.grouper import get_grouper
--> 965 grouper, exclusions, obj = get_grouper(
966 obj,
967 keys,
968 axis=axis,
969 level=level,
970 sort=sort,
971 observed=observed,
972 mutated=self.mutated,
973 dropna=self.dropna,
974 )
976 self.obj = obj
977 self.axis = obj._get_axis_number(axis)

File ~/Library/Python/3.9/lib/python/site-packages/pandas/core/groupby/grouper.py:888, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
886 in_axis, level, gpr = False, gpr, None
887 else:
--> 888 raise KeyError(gpr)
889 elif isinstance(gpr, Grouper) and gpr.key is not None:
890 # Add key to exclusions
891 exclusions.add(gpr.key)

KeyError: 0

Steps to reproduce

See my attached rnk file; I can't identify the problem with it. Please note that I had to add the .txt extension at the end, otherwise GitHub would not accept it.

RNAseq.rnk.txt

zqfang pushed a commit that referenced this issue Feb 28, 2024
Owner

zqfang commented Feb 28, 2024

Hi,

Sorry for the bug. The error is triggered when your input contains duplicated gene names. I just pushed a fix to the repo.
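
For anyone who wants to see the failure in isolation, here is a minimal sketch, assuming pandas is installed. The gene names and scores are made up, and the columns are named 0 and 1 to mimic a header-less .rnk file. The first half reproduces the `KeyError: 0` from the traceback; the second half is only one possible way to make the renaming step work and is not necessarily the fix that was pushed.

```python
import pandas as pd

# Toy ranking table (made-up values); columns are named 0 and 1.
rank_metric = pd.DataFrame({0: ["TP53", "TP53", "MYC"], 1: [2.5, 1.0, -0.7]})
id_col = rank_metric.columns[0]

# Rows whose gene ID occurs more than once.
mask = rank_metric.duplicated(subset=id_col, keep=False)

# Selecting a single column yields a Series. A Series has no column
# called 0, so grouping it by that label raises KeyError.
try:
    rank_metric.loc[mask, id_col].groupby(id_col)
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Hypothetical alternative: select with a list so the result stays a
# DataFrame, where the label 0 resolves to a real column.
suffix = (
    rank_metric.loc[mask, [id_col]]
    .groupby(id_col)
    .cumcount()
    .map(lambda c: "_" + str(c) if c else "")
)
rank_metric.loc[mask, id_col] = rank_metric.loc[mask, id_col] + suffix
print(rank_metric[id_col].tolist())
```

With this approach the first occurrence of each duplicated ID keeps its name and later ones get a numeric suffix, so every ID becomes unique without dropping rows.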

For now, you can remove the duplicated gene names from your input and run again. It will work.
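
A rough sketch of that workaround, using only the Python standard library. It assumes the .rnk file is tab-delimited with the gene name in the first column and the score in the second, and it keeps the first row seen for each gene; both are assumptions, so adjust them to your data. The helper name `drop_duplicate_genes` and the sample rows are made up for illustration.

```python
import csv
import io


def drop_duplicate_genes(lines):
    """Yield (gene, score) pairs, keeping only the first row seen per gene."""
    seen = set()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        gene = row[0]
        if gene in seen:
            continue  # a later duplicate of a gene we already kept
        seen.add(gene)
        yield gene, row[1]


# Toy input standing in for the contents of a .rnk file.
sample = io.StringIO("TP53\t2.5\nMYC\t1.2\nTP53\t0.3\n")
deduped = list(drop_duplicate_genes(sample))
print(deduped)
```

To apply it to a real file, pass an open file handle instead of `sample` and write the resulting rows to a new .rnk file with `csv.writer(..., delimiter="\t")`. Which duplicate to keep matters for the ranking, so keeping the first one is a choice worth checking against your analysis.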

@zqfang zqfang closed this as completed Mar 14, 2024
@zqfang zqfang mentioned this issue Mar 19, 2024