Merge pull request #2 from vsoch/add/data-example
adding example dataset
vsoch committed Oct 22, 2020
2 parents 54b03f7 + c6f3600 commit f1716c0
Showing 7 changed files with 38,040 additions and 3,600 deletions.
42 changes: 37 additions & 5 deletions README.md
@@ -5,10 +5,42 @@ in [index.html](index.html) with data derived from the SGD_features file. I've g
 the file to match the example data provided in the library for [a human](https://github.com/eweitz/ideogram/blob/master/dist/data/annotations/SRR562646.json).

 We want an interface where a user can quickly search for a gene, and then see it
-be highlighted in the plot. I'd like to eventually integrate this into a web application
-to show expression levels across a dataset, but I have several questions.
+be highlighted in the plot. This could be integrated into some web application to show
+expression levels. Two examples are provided:

-## Filter Maps
+- [Random Generation](https://vsoch.github.io/yeast-ideogram/)
+- [Generation from Dataset](https://vsoch.github.io/yeast-ideogram/dataset.html)
+
+## Usage
+
+To generate the data for [index.html](index.html) with random expression values,
+just run the data generation script:
+
+```bash
+python data/generate_yeast_data.py
+```
+This will generate the files "yeast-annots-random.json" and "features-count.json"
+in the [data](data) directory. The counts can be visualized in the IPython notebook.
+
+```
+$ ls data/
+feature_counts.ipynb saccharomyces-cerevisiae.json SRR562646.json yeast.txt
+features-count.json SGD_features.README yeast-annots.json
+generate_yeast_data.py SGD_features.tab yeast_data_example.txt
+```
+
+To generate the input for [dataset.html](dataset.html) that has actual expression values,
+you'll want to include an input file as an argument.
+
+```bash
+python data/generate_yeast_data.py data/yeast_data_example.txt
+```
+
+This will generate the "yeast-annots.json" file that is read in by dataset.html.
+The input file should be tab-separated, with a systematic name (ORF) in the first column, then the expression value.
+It's assumed that the first row is a column header, so it's ignored.
+
+### Filter Maps

 For the original list of filter maps, see [this file](https://github.com/eweitz/ideogram/blob/3ae4fdecc01f511fabf90ce8f87225e10675393c/annotations-histogram.html#L131). Expression level was largely unchanged (with the addition of 0 if the gene has no expression) and gene type
 was modified to include a different set:
@@ -31,13 +63,13 @@ gene_types = {
 ```
 It's likely that this needs to be further organized or filtered.

-### Gene-Type
+#### Gene-Type

 Each annotation includes a `gene-type` entry, an integer that refers to
 the Gene Type filter. We use the second column of the features tab file,
 the "feature", to assign the integer for `gene-type` that maps to the correct string.

-## Expression Level
+### Expression Level

 Akin to Gene type, expression level is another range of values that has [this mapping](https://github.com/yeastphenome/yeastphenome.org/pull/36) from very low to very high. It would be up to the generation interface to
 assign different expression levels depending on the dataset. For the example here, since we aren't
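
To illustrate the input format described in the README changes of this commit (tab-separated, systematic name in the first column, expression value in the second, header row ignored), here is a minimal sketch that reads such a table with pandas. The ORF names and values are made up for illustration and are not taken from `yeast_data_example.txt`; `load_expression` is a hypothetical helper, not a function in the repository.

```python
import io

import pandas

# Hypothetical example content; a real run would pass a file path instead.
EXAMPLE = "orf\tvalue\nYAL001C\t0.5\nYAL002W\t2.25\nYAL003W\t10.0\n"


def load_expression(handle):
    """Read a two-column, tab-separated table and index it by ORF."""
    df = pandas.read_csv(handle, sep="\t")
    # read_csv consumes the first row as a header; use fixed column names
    # so the original header text does not matter.
    df.columns = ["orf", "value"]
    return df.set_index("orf")


table = load_expression(io.StringIO(EXAMPLE))
print(table.loc["YAL002W", "value"])
```

Because the header row is replaced rather than interpreted, any column titles in the input file should work, as long as there are exactly two columns.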
36 changes: 29 additions & 7 deletions data/generate_yeast_data.py
@@ -8,7 +8,10 @@
 import os
 import random
 import re
+import sys
+import pandas

+here = os.path.dirname(os.path.abspath(__file__))

 def str_to_roman(string):

@@ -30,15 +33,15 @@ def str_to_roman(string):


 def main():
-    with open("SRR562646.json", "r") as fd:
+    with open(os.path.join(here, "SRR562646.json"), "r") as fd:
         data = json.loads(fd.read())

     # Each entry at data['annots'] is a dict with {'chr': <chr-number>, 'annots'}
     # annots is a list of annotations, formatted like : ['KIFAP3', 169890461, 153421, 6, 1]
     # corresponding to the data['keys']: ['name', 'start', 'length', 'expression-level', 'gene-type']

     # metadata about yeast loci
-    with open("SGD_features.tab", "r") as fd:
+    with open(os.path.join(here, "SGD_features.tab"), "r") as fd:
         content = [x.strip("\n").split("\t") for x in fd.readlines() if x]

     # Ensure we have correct length
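
The change above resolves data files relative to the script's own directory, so the script can be run from any working directory (for example `python data/generate_yeast_data.py` from the repository root). A minimal sketch of the same idea follows; `sibling_path` is a hypothetical helper introduced only for illustration, and the relative script path is an assumed example.

```python
import os


def sibling_path(script_file, filename):
    """Return the path of `filename` in the same directory as `script_file`."""
    here = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(here, filename)


# Hypothetical script location, used only for illustration; a real script
# would pass its own __file__ instead.
example = sibling_path(
    os.path.join("data", "generate_yeast_data.py"), "SGD_features.tab"
)
print(example)
```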
@@ -81,7 +84,6 @@ def main():
         "moderate": 3,
         "low": 2,
         "very-low": 1,
-        "none": 0,
     }

     # Gene type for interface lookup
Expand All @@ -100,6 +102,18 @@ def main():
"long_terminal_repeat": 12,
}

# Does the user provide an input file with data (requires pandas)
datafile = None
df = None
if len(sys.argv) > 1:
datafile = sys.argv[1]
if not os.path.exists(datafile):
sys.exit("Datafile %s provided, but does not exist." % datafile)
df = pandas.read_csv(datafile, sep="\t")
df.columns = ['orf', 'value']
df['expression_level'] = pandas.qcut(df['value'], q=len(expression_levels), labels=expression_levels.keys())
df.index = df['orf']

# Let's keep counts of feature types
feature_counts = dict()
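
The hunk above bins raw expression values into named levels by quantile. As a small sketch of how that binning behaves: `pandas.qcut` assigns the supplied labels to bins from the lowest quantile upward, so the labels need to be ordered from the lowest level to the highest. The three-level mapping and the values below are hypothetical and much smaller than the script's own `expression_levels`.

```python
import pandas

# Hypothetical subset of an expression-level mapping (label -> integer).
expression_levels = {"high": 3, "moderate": 2, "low": 1}

# Made-up expression values for six ORFs.
values = pandas.Series([0.1, 0.4, 1.2, 3.5, 8.0, 20.0])

# Order the labels by their integer level so the lowest quantile
# receives the lowest label.
labels = sorted(expression_levels, key=expression_levels.get)
binned = pandas.qcut(values, q=len(labels), labels=labels)

# Map each label back to its integer for the annotation file.
levels = [expression_levels[label] for label in binned]
print(levels)
```

Passing the dictionary keys in their declaration order would only give the same result if the dictionary were declared from lowest to highest level.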

@@ -134,7 +148,13 @@ def main():
             continue

         gene_type = gene_types[feature]
-        expression_level = random.choice(range(1, 8))
+        if df is not None:
+            if name in df.index:
+                expression_level = expression_levels[df.loc[name]['expression_level']]
+            else:
+                expression_level = 1
+        else:
+            expression_level = random.choice(range(1, 8))

         # name, start, length, expression-level, gene-type
         chroms[chromosome].append(
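
The lookup added in this hunk falls back to the lowest level (1) when a gene is missing from the input file. A minimal sketch of that lookup-with-default pattern, assuming each ORF appears at most once in the index; the table contents, the two-level mapping, and the `level_for` helper are hypothetical.

```python
import pandas

# Hypothetical binned table, indexed by ORF.
df = pandas.DataFrame(
    {"orf": ["YAL001C", "YAL002W"], "expression_level": ["high", "low"]}
).set_index("orf")

# Hypothetical subset of an expression-level mapping (label -> integer).
expression_levels = {"high": 3, "low": 1}


def level_for(name, default=1):
    """Return the integer level for an ORF, or `default` if it is absent."""
    if name in df.index:
        return expression_levels[df.loc[name, "expression_level"]]
    return default


print(level_for("YAL001C"), level_for("YZZ999W"))
```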
@@ -147,11 +167,13 @@ def main():
         data["annots"].append({"chr": str_to_roman(chrom), "annots": annots})

     # Save counts and data to file
-    with open("yeast-annots.json", "w") as fd:
+    annots_file = "yeast-annots.json"
+    if not datafile:
+        annots_file = "yeast-annots-random.json"
+    with open(os.path.join(here, annots_file), "w") as fd:
         fd.write(json.dumps(data, indent=4))
-    with open("features-count.json", "w") as fd:
+    with open(os.path.join(here, "features-count.json"), "w") as fd:
         fd.write(json.dumps(feature_counts, indent=4))


-
 if __name__ == "__main__":
