Fix --expand mode for empty attributes
If an attribute appears in sample_attribute but has no value
associated with it, it is as good as non-existent, so we
discard it. This only affects the individual record, not the
entire run table. If any other sample in the same run table
has that attribute, the DataFrame will still have all the
relevant columns, and the record missing a value for that
attribute gets NaN in that column.

Closes #11
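
The record-level behaviour described above can be illustrated with a small sketch. It assumes the parsed attributes end up as one dict per record and that the run table is built with `pd.DataFrame` from those dicts; the `records` list is hypothetical illustration data, not output from pysradb.

```python
import pandas as pd

# Hypothetical parsed records: the second one had "Phosphate:" with no value,
# so that key was discarded for this record only.
records = [
    {"depth": "400:m", "Phosphate": "0.5:µM"},
    {"depth": "400:m"},
]

# Columns are the union of keys across records; a missing key becomes NaN.
df = pd.DataFrame(records)
print(df)
```

The `Phosphate` column survives because the first record carries a value for it, while the second record shows NaN there.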
saketkc committed Jul 18, 2019
1 parent a4ae359 commit 8b9cfa0
Showing 1 changed file with 15 additions and 3 deletions.
18 changes: 15 additions & 3 deletions pysradb/filter_attrs.py
@@ -8,19 +8,31 @@ def _get_sample_attr_keys(sample_attribute):
     if sample_attribute is None:
         return None, None
     sample_attribute_splitted = sample_attribute.split("||")
-    split_by_colon = [str(attr).split(": ") for attr in sample_attribute_splitted]
+    split_by_colon = [
+        str(attr).strip().split(": ") for attr in sample_attribute_splitted
+    ]
 
     # Iterate once more to consider first one as the key
     # and remaining as the value
     # This is because of bad annotations like in this example
     # Example: isolate: not applicable || organism: Mus musculus || cell_line: 17-Cl1 ||\
     # infect: MHV-A59 || time point: 5: hour || compound: cycloheximide ||\
     # sequencing protocol: RiboSeq || biological repeat: long read sequencing
     # Notice the `time: 5: hour`
+    # sample_attribute: investigation type: metagenome || project name: Landsort Depth 20090415 transect ||
+    # sequencing method: 454 || collection date: 2009-04-15 || ammonium: 8.7: µM || chlorophyll: 0: µg/L ||
+    # dissolved oxygen: -1.33: µmol/kg || nitrate: 0.02: µM || nitrogen: 0: µM ||
+    # environmental package: water || geographic location (latitude): 58.6: DD ||
+    # geographic location (longitude): 18.2: DD || geographic location (country and/or sea,region): Baltic Sea ||
+    # environment (biome): 00002150 || environment (feature): 00002150 || environment (material): 00002150 ||
+    # depth: 400: m || Phosphate: || Total phosphorous: || Silicon:
+    # Handle empty cases as above
+    split_by_colon = [attr for attr in split_by_colon if len(attr) >= 2]
+
     for index, element in enumerate(split_by_colon):
         if len(element) > 2:
-            key = element[0]
-            value = ":".join(element[1:])
+            key = element[0].strip()
+            value = ":".join(element[1:]).strip()
             split_by_colon[index] = [key, value]
 
     try:
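
The effect of the added `.strip()` call and the `len(attr) >= 2` filter can be reproduced with a standalone sketch. `parse_sample_attribute` is a hypothetical helper written for illustration, not the library's `_get_sample_attr_keys`, whose remaining body is not shown in this hunk. Returning a dict is an assumption made here for readability.

```python
def parse_sample_attribute(sample_attribute):
    """Hypothetical helper: parse 'key: value || key: value' into a dict."""
    if sample_attribute is None:
        return {}
    # Strip each chunk first so that a trailing "Phosphate: " becomes
    # "Phosphate:" and no longer splits into ["Phosphate", ""].
    pairs = [str(attr).strip().split(": ") for attr in sample_attribute.split("||")]
    # Discard attributes that have a key but no value.
    pairs = [pair for pair in pairs if len(pair) >= 2]
    parsed = {}
    for pair in pairs:
        # The first element is the key; everything else is rejoined as the value.
        key = pair[0].strip()
        value = ":".join(pair[1:]).strip()
        parsed[key] = value
    return parsed


example = "depth: 400: m || Phosphate: || Total phosphorous: || Silicon:"
print(parse_sample_attribute(example))  # {'depth': '400:m'}
```

Without the `.strip()`, a chunk such as `" Phosphate: "` would split on `": "` into two elements, the second one empty, and so pass the length filter with an empty value. Stripping first leaves `"Phosphate:"`, which yields a single element and is dropped.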
