frbr: CSHALS 2011 tutorial

Timothy Lebo edited this page Feb 14, 2012 · 76 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

If you have questions, please contact Timothy Lebo: http://tw.rpi.edu/instances/TimLebo.

In-depth conversion for Jim's tutorial

Jim McCusker used csv2rdf4lod to incorporate some data for his Semantic Healthcare and Life Sciences Tutorial (his slides). This (on-the-fly!) tutorial gives more detail on how he did it. I am piecing it together from the provenance that csv2rdf4lod captured when Jim originally used it for his demo.

Blog about our tutorials: http://www.genomeweb.com/informatics/semantic-technologies-bear-fruit-spite-development-challenges.

Source files on GitHub

You can get the source at:

https://github.com/timrdf/csv2rdf4lod-automation/tree/master/doc/examples/source/ncbi-nih-gov

Overview of csv2rdf4lod workflow

[Figure: diagram of provenance captured during csv2rdf4lod conversion]

Install csv2rdf4lod

Installing csv2rdf4lod automation

gene2go

Data: ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz

Step a.1: [name](Conversion process phase: name) the data:

Use the modification date reported by the server (the Last-Modified line) to name the version:

bash-3.2$ curl -I ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
Last-Modified: Wed, 23 Feb 2011 07:49:05 GMT
Content-Length: 12359614
Accept-ranges: bytes
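The version name used in the following steps (2011-Feb-23) is the date part of that Last-Modified line, reordered. A minimal sketch of the rearrangement, assuming the header has the layout shown above; the function name `version_from_last_modified` is a hypothetical helper, not part of csv2rdf4lod:

```shell
# Hypothetical helper: turn a Last-Modified header line into YYYY-Mon-DD.
version_from_last_modified() {
  # Whitespace-separated fields of the header line:
  #   1=Last-Modified:  2=Wed,  3=23  4=Feb  5=2011  6=07:49:05  7=GMT
  printf '%s\n' "$1" | awk '{ printf "%s-%s-%02d\n", $5, $4, $3 }'
}

version=$(version_from_last_modified 'Last-Modified: Wed, 23 Feb 2011 07:49:05 GMT')
echo "$version"   # prints 2011-Feb-23
```

Against the live server, the header line could be captured with something like `curl -sI <url> | grep -i '^Last-Modified:'` and passed to the helper.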

Step a.2: [retrieve](Conversion process phase: retrieve) the data: Create the directory to keep a local copy of NIH's data:

mkdir -p ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/source
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/source

Step a.3: Download the gzip archive, uncompress it, and log the provenance:

pcurl.sh ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
gunzip -c gene2go.gz > gene2go
justify.sh gene2go.gz gene2go uncompress
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/
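The `gunzip -c ... > ...` form matters here: it writes the uncompressed copy to a new file and leaves the downloaded archive in place, so both the input and the output named in the justify.sh call still exist. A small sketch in a throwaway directory; the file name `sample` and its contents are stand-ins, not NIH data:

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)

# Stand-in for the downloaded archive (hypothetical content).
printf '9606\t2\tGO:0001869\n' > "$workdir/sample"
gzip "$workdir/sample"            # replaces sample with sample.gz

# Same pattern as above: write the uncompressed copy, keep the archive.
gunzip -c "$workdir/sample.gz" > "$workdir/sample"

ls "$workdir"                     # lists both sample and sample.gz
```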

Step a.4: [csv-ify](Conversion process phase: csv-ify) the data: We only need Homo sapiens (taxon 9606), and the file is tab-delimited (we need comma-separated). We also want to strip the GO: prefix so that we can construct URIs that overlap with bio2rdf. We make a manual tweak and store it in manual/ (capturing the provenance):

mkdir manual/
grep "^9606" source/gene2go | perl -pe 's/^/"/; s/GO://; s/\t/","/g; s/$/"/' > manual/gene2go-9606.csv

Step a.5: Create verbatim interpretation of tabular literals ([create](Conversion process phase: create conversion trigger) and [pull](Conversion process phase: pull conversion trigger) the conversion trigger):

cr-create-convert-sh.sh -w manual/gene2go-9606.csv
./convert-gene2go.sh

Step a.6: Cheat and get Jim's [tweaked](Conversion process phase: tweak enhancement parameters) enhanced interpretation parameters:

curl -L https://github.com/timrdf/csv2rdf4lod-automation/raw/master/doc/examples/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/manual/gene2go-9606.csv.e1.params.ttl > manual/gene2go-9606.csv.e1.params.ttl

Step a.7: Create enhanced interpretation of tabular literals ([pull](Conversion process phase: pull conversion trigger) the conversion trigger again):

./convert-gene2go.sh

Step a.8: Check out automatic/gene2go-9606.csv.e1.ttl

<http://bio2rdf.org/geneid:2>
   dcterms:isReferencedBy <http://sparql.tw.rpi.edu/ontowiki/source/ncbi-nih-gov/dataset/gene2go/version/2011-Feb-23> ;
   a gene2go_vocab:Gene ;
   dcterms:identifier "2" ;
   e1:has_species  <http://bio2rdf.org/taxon:9606> ;
   e1:has_evidence_code "IDA" ;
   e1:has_evidence_code "TAS" ;
   e1:has_evidence_code "IPI" ;
   e1:has_evidence_code "NAS" ;
   e1:has_evidence_code "IEA" ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0001869> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0002576> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0004867> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0005096> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0005515> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0005576> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0005615> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0005829> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0006953> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0007264> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0007584> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0007596> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0007597> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0010037> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0019838> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0019899> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0019959> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0019966> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0030168> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0031093> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0043120> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0051056> ;
   skos:broadMatch <http://purl.org/obo/owl/GO#GO_0051384> ;
   ov:csvRow "4"^^xsd:integer ;
   ov:csvRow "5"^^xsd:integer ;
   ov:csvRow "6"^^xsd:integer ;
   ov:csvRow "7"^^xsd:integer ;
   ov:csvRow "8"^^xsd:integer ;
   ov:csvRow "9"^^xsd:integer , "10"^^xsd:integer ;
   ov:csvRow "11"^^xsd:integer ;
   ov:csvRow "12"^^xsd:integer ;
   ov:csvRow "13"^^xsd:integer ;
   ov:csvRow "14"^^xsd:integer ;
   ov:csvRow "15"^^xsd:integer ;
   ov:csvRow "16"^^xsd:integer ;
   ov:csvRow "17"^^xsd:integer ;
   ov:csvRow "18"^^xsd:integer ;
   ov:csvRow "19"^^xsd:integer ;
   ov:csvRow "20"^^xsd:integer ;
   ov:csvRow "21"^^xsd:integer ;
   ov:csvRow "22"^^xsd:integer ;
   ov:csvRow "23"^^xsd:integer ;
   ov:csvRow "24"^^xsd:integer ;
   ov:csvRow "25"^^xsd:integer ;
   ov:csvRow "26"^^xsd:integer ;
   ov:csvRow "27"^^xsd:integer .

Homo sapiens gene info

Data: ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz

Step b.1: [name](Conversion process phase: name) the data:

  • base URI: http://sparql.tw.rpi.edu/ontowiki/
  • source: ncbi-nih-gov
  • dataset: gene-mammalia-homo-sapien
  • version: 2011-Feb-23
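These four names determine both the working directory and the URI of the converted dataset version. The sketch below composes them following the pattern visible in the paths and URIs elsewhere on this page; the variable names are hypothetical and not part of csv2rdf4lod:

```shell
# The four names chosen in this step (variable names are illustrative).
base_uri='http://sparql.tw.rpi.edu/ontowiki/'
source_id='ncbi-nih-gov'
dataset_id='gene-mammalia-homo-sapien'
version_id='2011-Feb-23'

# Directory layout used in the steps below, relative to the demo root.
version_dir="source/$source_id/$dataset_id/version/$version_id"

# URI of the versioned dataset; ${base_uri%/} drops the trailing slash.
version_uri="${base_uri%/}/source/$source_id/dataset/$dataset_id/version/$version_id"

echo "$version_dir"
echo "$version_uri"
```

The sample output at the end of this section shows a different host in its dataset URI, but the part after the base follows the same source/dataset/version pattern.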

Use the modification date reported by the server (the Last-Modified line) to name the version:

bash-3.2$ curl -I ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
Last-Modified: Wed, 23 Feb 2011 08:04:26 GMT
Content-Length: 2402004
Accept-ranges: bytes

Step b.2: [retrieve](Conversion process phase: retrieve) the data: Create the directory to keep a local copy of NIH's data:

mkdir -p ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/source
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/source

Step b.3: Download the gzip archive, uncompress it, and log the provenance:

pcurl.sh ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
gunzip -c Homo_sapiens.gene_info.gz > Homo_sapiens.gene_info
justify.sh Homo_sapiens.gene_info.gz Homo_sapiens.gene_info uncompress
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/

Step b.4: [csv-ify](Conversion process phase: csv-ify) the data: NIH's data is tab-delimited (we need comma-separated). We make a manual tweak and store it in manual/ (capturing the provenance):

mkdir -p manual/
cat source/Homo_sapiens.gene_info | perl -pe 's/^/"/; s/\t/","/g; s/$/"/' > manual/Homo_sapiens.gene_info.csv
justify.sh source/Homo_sapiens.gene_info manual/Homo_sapiens.gene_info.csv tab2comma

Step b.5: Create verbatim interpretation of tabular literals ([create](Conversion process phase: create conversion trigger) and [pull](Conversion process phase: pull conversion trigger) the conversion trigger):

cr-create-convert-sh.sh -w manual/Homo_sapiens.gene_info.csv
./convert-gene-mammalia-homo-sapien.sh

Step b.6: Cheat and get Jim's [tweaked](Conversion process phase: tweak enhancement parameters) enhanced interpretation parameters:

curl -L https://github.com/timrdf/csv2rdf4lod-automation/raw/master/doc/examples/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/manual/Homo_sapiens.gene_info.csv.e1.params.ttl > manual/Homo_sapiens.gene_info.csv.e1.params.ttl

Step b.7: Create enhanced interpretation of tabular literals ([pull](Conversion process phase: pull conversion trigger) the conversion trigger again):

./convert-gene-mammalia-homo-sapien.sh

Step b.8: Check out automatic/Homo_sapiens.gene_info.csv.e1.ttl

<http://bio2rdf.org/geneid:1> 
   dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/ncbi-nih-gov/dataset/gene-mammalia-homo-sapien/version/2011-Feb-23> ;
   a <http://purl.obolibrary.org/obo/SO_0000704> , local_vocab:Gene ;
   e1:has_species                            <http://bio2rdf.org/taxon:9606> ;
   dcterms:identifier                        "1" ;
   jim:has_symbol                            "A1BG" ;
   rdfs:label                                "A1BG" ;
   e1:has_symbol                             "HYST2477" , "DKFZp686F0970" , "ABG" , "GAB" , "A1B" ;
   dcterms:identifier                        "MIM:138670" , "HGNC:5" , "HPRD:00726" , "Ensembl:ENSG00000121410" ;
   e1:has_location                           <http://bio2rdf.org/mapviewer:19q13_4> ;
   dcterms:description                       "alpha-1-B glycoprotein" ;
   e1:has_gene_type                          "protein-coding" ;
   e1:has_symbol_from_nomenclature_authority "A1BG" ;
   e1:has_name_from_nomenclature_authority   "alpha-1-B glycoprotein" ;
   e1:has_nomenclature_status                "Official" ;
   e1:has_other_designation                  "alpha-1B-glycoprotein" ;
   dcterms:modified                          "2011-02-06"^^xsd:date ;
   ov:csvRow                                 "1"^^xsd:integer .