Error with prerank, after updating to newest version of GSEApy #251

jasiozaucha · 2024-02-28T10:45:12Z

Setup

I am reporting a problem with GSEApy version, Python version, and operating
system as follows:

python 3.9.6 (default, Nov 10 2023, 13:38:27)
[Clang 15.0.0 (clang-1500.1.0.2.5)]
CPython
macOS-14.1-arm64-arm-64bit
1.1.1

Expected behaviour

2024-02-28 10:42:18,872 [WARNING] Duplicated values found in preranked stats: 4.97% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-02-28 10:42:18,872 [INFO] Parsing data files for GSEA.............................
2024-02-28 10:42:18,872 [INFO] Enrichr library gene sets already downloaded in: /Users/kpbr532/.cache/gseapy, use local file
2024-02-28 10:42:18,880 [INFO] 0000 gene_sets have been filtered out when max_size=1000 and min_size=5
2024-02-28 10:42:18,880 [INFO] 0050 gene_sets used for further statistical testing.....
2024-02-28 10:42:18,880 [INFO] Start to run GSEA...Might take a while..................
2024-02-28 10:42:20,297 [INFO] Congratulations. GSEApy runs successfully................

Actual behaviour

In [76]: pre_res = gp.prerank(rnk='RNAseq.rnk', # or rnk = rnk,
...: gene_sets='MSigDB_Hallmark_2020',
...: threads=4,
...: min_size=5,
...: max_size=1000,
...: permutation_num=1000, # reduce number to speed up testing
...: outdir=None, # don't write to disk
...: seed=6,
...: verbose=True, # see what's going on behind the scenes
...: )
2024-02-28 10:42:50,362 [INFO] Input gene rankings contains duplicated IDs

KeyError Traceback (most recent call last)
Cell In[76], line 1
----> 1 pre_res = gp.prerank(rnk='RNAseq.rnk', # or rnk = rnk,
2 gene_sets='MSigDB_Hallmark_2020',
3 threads=4,
4 min_size=5,
5 max_size=1000,
6 permutation_num=1000, # reduce number to speed up testing
7 outdir=None, # don't write to disk
8 seed=6,
9 verbose=True, # see what's going on behind the scenes
10 )

File ~/Library/Python/3.9/lib/python/site-packages/gseapy/__init__.py:396, in prerank(rnk, gene_sets, outdir, pheno_pos, pheno_neg, min_size, max_size, permutation_num, weight, ascending, threads, figsize, format, graph_num, no_plot, seed, verbose, *arg, **kwargs)
375 weight = kwargs["weighted_score_type"]
377 pre = Prerank(
378 rnk,
379 gene_sets,
(...)
394 verbose,
395 )
--> 396 pre.run()
397 return pre

File ~/Library/Python/3.9/lib/python/site-packages/gseapy/gsea.py:444, in Prerank.run(self)
441 assert self.min_size <= self.max_size
443 # parsing rankings
--> 444 dat2 = self.load_ranking()
445 assert len(dat2) > 1
446 self.ranking = dat2

File ~/Library/Python/3.9/lib/python/site-packages/gseapy/gsea.py:418, in Prerank.load_ranking(self)
415 rank_metric = self._load_data(self.rnk) # gene id is the first column
416 if rank_metric.select_dtypes(np.number).shape[1] == 1:
417 # return series
--> 418 return self._load_ranking(rank_metric)
419 ## In case the input type multi-column ranking dataframe
420 # drop na gene id values
421 rank_metric = rank_metric.dropna(subset=rank_metric.columns[0])

File ~/Library/Python/3.9/lib/python/site-packages/gseapy/gsea.py:385, in Prerank._load_ranking(self, rank_metric)
383 rank_metric.dropna(how="any", inplace=True)
384 # rename duplicate id, make them unique
--> 385 rank_metric = self.make_unique(rank_metric, col_idx=0)
386 # reset ranking index, because you have sort values and drop duplicates.
387 rank_metric.reset_index(drop=True, inplace=True)

File ~/Library/Python/3.9/lib/python/site-packages/gseapy/base.py:246, in GSEAbase.make_unique(self, rank_metric, col_idx)
243 self.logger.info("Input gene rankings contains duplicated IDs")
244 mask = rank_metric.duplicated(subset=id_col, keep=False)
245 dups = (
--> 246 rank_metric.loc[mask, id_col]
247 .groupby(id_col)
248 .cumcount()
249 .map(lambda c: "" + str(c) if c else "")
250 )
251 rank_metric.loc[mask, id_col] = rank_metric.loc[mask, id_col] + dups
252 return rank_metric

File ~/Library/Python/3.9/lib/python/site-packages/pandas/core/series.py:2076, in Series.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
2073 raise TypeError("You have to supply one of 'by' and 'level'")
2074 axis = self._get_axis_number(axis)
-> 2076 return SeriesGroupBy(
2077 obj=self,
2078 keys=by,
2079 axis=axis,
2080 level=level,
2081 as_index=as_index,
2082 sort=sort,
2083 group_keys=group_keys,
2084 squeeze=squeeze,
2085 observed=observed,
2086 dropna=dropna,
2087 )

File ~/Library/Python/3.9/lib/python/site-packages/pandas/core/groupby/groupby.py:965, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
962 if grouper is None:
963 from pandas.core.groupby.grouper import get_grouper
--> 965 grouper, exclusions, obj = get_grouper(
966 obj,
967 keys,
968 axis=axis,
969 level=level,
970 sort=sort,
971 observed=observed,
972 mutated=self.mutated,
973 dropna=self.dropna,
974 )
976 self.obj = obj
977 self.axis = obj._get_axis_number(axis)

File ~/Library/Python/3.9/lib/python/site-packages/pandas/core/groupby/grouper.py:888, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
886 in_axis, level, gpr = False, gpr, None
887 else:
--> 888 raise KeyError(gpr)
889 elif isinstance(gpr, Grouper) and gpr.key is not None:
890 # Add key to exclusions
891 exclusions.add(gpr.key)

KeyError: 0
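
For readers following the traceback, the last gseapy frame (`make_unique`) can be reproduced in isolation. This is a minimal sketch rather than the library's code: the three-row frame is made up, and the assumption is that the ID column is labelled `0`, as `KeyError: 0` suggests. Selecting one column with `.loc[mask, id_col]` yields a Series, and a Series has no column called `0` to group by, so pandas raises. Grouping by the Series' own values is one way to number the duplicates; it is shown for illustration and is not necessarily what the fix commit does.

```python
import pandas as pd

# Made-up ranking with one duplicated gene ID; column label 0 plays the
# role of `id_col` from the traceback (an assumption about the real input).
rank_metric = pd.DataFrame({0: ["MYC", "TP53", "TP53"], 1: [2.5, 1.0, -0.3]})
id_col = 0

mask = rank_metric.duplicated(subset=id_col, keep=False)
ids = rank_metric.loc[mask, id_col]  # a Series, no longer a DataFrame

try:
    # Same shape as the failing call: group a Series by a column label.
    ids.groupby(id_col).cumcount()
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Grouping by the values themselves numbers each repeat of an ID.
counts = ids.groupby(ids).cumcount()
print(counts.tolist())
```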

Steps to reproduce

See my rnk file attached; I can't identify the problem with it. Please note that I had to add the .txt extension to the filename, otherwise GitHub would not accept it.

RNAseq.rnk.txt

zqfang · 2024-02-28T19:27:17Z

Hi,

Sorry for the bug. The log message says you have duplicated gene names in your input. I just pushed a fix to the repo.

For now, you can remove the duplicated gene names from your input and run again. It will work.
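
The workaround above could look something like the following sketch. It assumes the ranking has the gene ID in the first column and the score in the second, and it keeps the row with the largest absolute score for each duplicated ID. That tie-breaking rule is an illustrative choice, not something prescribed in this thread, and `drop_duplicate_ids` is a hypothetical helper name.

```python
import pandas as pd


def drop_duplicate_ids(rnk: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per gene ID: the row with the largest absolute score."""
    id_col, score_col = rnk.columns[0], rnk.columns[1]
    # Order rows by |score| so the strongest entry for each ID comes first.
    order = rnk[score_col].abs().sort_values(ascending=False).index
    deduped = rnk.loc[order].drop_duplicates(subset=id_col, keep="first")
    # Restore a descending ranking and a clean 0..n-1 index.
    return deduped.sort_values(score_col, ascending=False).reset_index(drop=True)


# Small in-memory example with made-up values. A real file might be read with
# pd.read_csv("RNAseq.rnk", sep="\t", header=None), assuming it is
# tab-separated without a header row.
rnk = pd.DataFrame({0: ["TP53", "MYC", "TP53", "EGFR"], 1: [1.0, 2.5, -3.0, 0.4]})
clean = drop_duplicate_ids(rnk)
print(clean)
```

If `prerank` accepts a DataFrame for `rnk` (the `# or rnk = rnk` comment in the report suggests it does), `clean` could be passed in place of the file path.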

zqfang pushed a commit that referenced this issue Feb 28, 2024

fixed duplicated IDs, #251

7d5dd11

zqfang closed this as completed Mar 14, 2024

zqfang mentioned this issue Mar 19, 2024

key error #255

Open
