Merge pull request #2 from vsoch/add/data-example

adding example dataset
vsoch · Oct 22, 2020 · f1716c0
2 parents 54b03f7 + c6f3600
commit f1716c0
Showing 7 changed files with 38,040 additions and 3,600 deletions.
diff --git a/README.md b/README.md
@@ -5,10 +5,42 @@ in [index.html](index.html) with data derived from the SGD_features file. I've g
 the file to match the example data provided in the library for [a human](https://github.com/eweitz/ideogram/blob/master/dist/data/annotations/SRR562646.json).
 
 We want an interface that a user can quickly search a gene, and then see it
-be highlighted in the plot. I'd like to eventually integrate this into a web application
-to show expression levels across a dataset, but I have several questions.
+be highlighted in the plot. This could be integrated into some web application to show
+expression levels. Two examples are provided:
 
-## Filter Maps
+ - [Random Generation](https://vsoch.github.io/yeast-ideogram/)
+ - [Generation from Dataset](https://vsoch.github.io/yeast-ideogram/dataset.html)
+
+## Usage
+
+To generate the data for [index.html](index.html) with random expression values,
+run the data generation script:
+
+```bash
+python data/generate_yeast_data.py 
+```
+This will generate the files "yeast-annots-random.json" and "features-count.json"
+in the [data](data) directory. The counts can be visualized in the ipython notebook.
+
+```
+$ ls data/
+feature_counts.ipynb    saccharomyces-cerevisiae.json  SRR562646.json          yeast.txt
+features-count.json     SGD_features.README            yeast-annots.json
+generate_yeast_data.py  SGD_features.tab               yeast_data_example.txt
+```
+
+To generate the input for [dataset.html](dataset.html) with actual expression values,
+pass an input file as an argument:
+
+```bash
+python data/generate_yeast_data.py data/yeast_data_example.txt
+```
+
+This will generate the "yeast-annots.json" file that is read in by dataset.html.
+The input file should be tab-separated, with a systematic name (ORF) in the first column
+and the expression value in the second. The first row is assumed to be a column header and is ignored.
+
+### Filter Maps
 
 For the original list of filter maps, see [this file](https://github.com/eweitz/ideogram/blob/3ae4fdecc01f511fabf90ce8f87225e10675393c/annotations-histogram.html#L131). Expression level was largely unchanged (with the addition of 0 if the gene has no expression) and gene type
 was modified to include a different set:
@@ -31,13 +63,13 @@ gene_types = {
 ```
 It's likely that this needs to be further organized or filtered.
 
-### Gene-Type
+#### Gene-Type
 
 One of the entries in the data is relevant for a "gene-type," an integer, and specifically
 this is referring to a Gene Type filter. We use the second entry in the features tab file,
 the "feature" to assign an integer for `gene-type` that maps to the correct string.
 
-## Expression Level
+### Expression Level
 
 Akin to Gene type, expression level is another range of values that has [this mapping](https://github.com/yeastphenome/yeastphenome.org/pull/36) from very low to very high. It would be up to the generation interface to
 assign different expression levels depending on the dataset. For the example here, since we aren't
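
The input format described in the README change above (systematic name in the first column, expression value in the second, header row skipped) could be turned into integer expression levels roughly as follows. This is a minimal standard-library sketch, not the script's own code: the function names `read_expression_table` and `assign_levels` are hypothetical, the 7 levels are an assumption taken from `random.choice(range(1, 8))` in the script, and the rank-based binning is only a stand-in for quantile binning (ties and bin edges are not treated the way `pandas.qcut` treats them).

```python
import csv
import io


def read_expression_table(handle):
    """Return (orf, value) pairs from a tab-separated file, skipping the header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # the first row is assumed to be a column header
    return [(row[0], float(row[1])) for row in reader if len(row) >= 2]


def assign_levels(pairs, n_levels=7):
    """Map each ORF to an integer level in 1..n_levels using equal-frequency bins.

    Rank-based approximation of quantile binning: the lowest values get
    level 1 and the highest values get level n_levels.
    """
    ordered = sorted(pairs, key=lambda pair: pair[1])
    total = len(ordered)
    levels = {}
    for rank, (orf, _value) in enumerate(ordered):
        levels[orf] = rank * n_levels // total + 1
    return levels


# Made-up values for illustration only; not taken from yeast_data_example.txt
example = io.StringIO(
    "orf\tvalue\n"
    "YAL001C\t0.2\n"
    "YAL002W\t5.1\n"
    "YAL003W\t1.7\n"
)
levels = assign_levels(read_expression_table(example), n_levels=3)
print(levels)
```

Sorting by value before assigning bins keeps the direction explicit: the smallest value always lands in level 1, regardless of the order in which level names happen to be declared elsewhere.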

diff --git a/data/generate_yeast_data.py b/data/generate_yeast_data.py
@@ -8,7 +8,10 @@
 import os
 import random
 import re
+import sys
+import pandas
 
+here = os.path.dirname(os.path.abspath(__file__))
 
 def str_to_roman(string):
 
@@ -30,15 +33,15 @@ def str_to_roman(string):
 
 
 def main():
-    with open("SRR562646.json", "r") as fd:
+    with open(os.path.join(here, "SRR562646.json"), "r") as fd:
         data = json.loads(fd.read())
 
     # Each entry at data['annots'] is a dict with {'chr': <chr-number>, 'annots'}
     # annots is a list of annotations, formatted like : ['KIFAP3', 169890461, 153421, 6, 1]
     # corresponding to the data['keys']: ['name', 'start', 'length', 'expression-level', 'gene-type']
 
     # metadata about yeast loci
-    with open("SGD_features.tab", "r") as fd:
+    with open(os.path.join(here, "SGD_features.tab"), "r") as fd:
         content = [x.strip("\n").split("\t") for x in fd.readlines() if x]
 
     # Ensure we have correct length
@@ -81,7 +84,6 @@ def main():
         "moderate": 3,
         "low": 2,
         "very-low": 1,
-        "none": 0,
     }
 
     # Gene type for interface lookup
@@ -100,6 +102,18 @@ def main():
         "long_terminal_repeat": 12,
     }
 
+    # Does the user provide an input file with data (requires pandas)
+    datafile = None
+    df = None
+    if len(sys.argv) > 1:
+        datafile = sys.argv[1]
+        if not os.path.exists(datafile):
+            sys.exit("Datafile %s provided, but does not exist." % datafile)
+        df = pandas.read_csv(datafile, sep="\t")
+        df.columns = ['orf', 'value']
+        df['expression_level'] = pandas.qcut(df['value'], q=len(expression_levels), labels=sorted(expression_levels, key=expression_levels.get))
+        df.index = df['orf']
+
     # Let's keep counts of feature types
     feature_counts = dict()
 
@@ -134,7 +148,13 @@ def main():
             continue
 
         gene_type = gene_types[feature]
-        expression_level = random.choice(range(1, 8))
+        if df is not None:
+            if name in df.index:
+                expression_level = expression_levels[df.loc[name]['expression_level']]
+            else:
+                expression_level = 1
+        else:
+            expression_level = random.choice(range(1, 8))
 
         # name, start, length, expression-level, gene-type
         chroms[chromosome].append(
@@ -147,11 +167,13 @@ def main():
         data["annots"].append({"chr": str_to_roman(chrom), "annots": annots})
 
     # Save counts and data to file
-    with open("yeast-annots.json", "w") as fd:
+    annots_file = "yeast-annots.json"
+    if not datafile:
+        annots_file = "yeast-annots-random.json"
+    with open(os.path.join(here, annots_file), "w") as fd:
         fd.write(json.dumps(data, indent=4))
-    with open("features-count.json", "w") as fd:
+    with open(os.path.join(here, "features-count.json"), "w") as fd:
         fd.write(json.dumps(feature_counts, indent=4))
-
 
 
 if __name__ == "__main__":
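
The per-gene logic this change introduces (look up a level when a dataset is given, fall back to 1 for genes missing from it, otherwise pick a random level, and choose the output filename accordingly) can be summarized in a small sketch. The helper names `pick_expression_level`, `annots_filename`, and `make_annotation` are hypothetical and do not appear in `generate_yeast_data.py`; the coordinates and level values in the example are invented for illustration.

```python
import random


def pick_expression_level(name, levels=None):
    """Return an integer expression level for a gene.

    levels: optional dict mapping ORF name to level. Without it a random
    level in 1..7 is used; with it, genes absent from the dict get 1.
    """
    if levels is None:
        return random.choice(range(1, 8))
    return levels.get(name, 1)


def annots_filename(datafile=None):
    """Mirror the output naming in the diff: random data gets its own file."""
    return "yeast-annots.json" if datafile else "yeast-annots-random.json"


def make_annotation(name, start, length, expression_level, gene_type):
    """Build one entry in the order: name, start, length, expression-level, gene-type."""
    return [name, start, length, expression_level, gene_type]


# Invented example values; 12 is the gene-type code shown for long_terminal_repeat
levels = {"YAL001C": 5}
entry = make_annotation(
    "YAL001C", 1000, 250, pick_expression_level("YAL001C", levels), 12
)
print({"chr": "I", "annots": [entry]}, annots_filename("data/yeast_data_example.txt"))
```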