## OTG Integration in OTP

This notebook contains raw data summaries and field interpretations for Open Targets Genetics (OTG) portal evidence in the Open Targets Platform (OTP).

See [notes.md](notes.md) for more info on the history and methods behind this.


### Summary

Summary for new OTG evidence records as of v20.06:

- There are now 393k evidence strings, which expand to about 640k associations
    - This "expansion" results from associating the same variants with diseases that are EFO ancestors of the original phenotype
    - For reference, there are 185,865 variant-trait associations in GWAS Catalog v1.0.2
- Frequency of study source:
    - GWAS Catalog: **264k (67%)**
        - example: https://genetics.opentargets.org/study/GCST007068
    - Neale Lab v2 UKB: **123k (31%)**
        - example: https://genetics.opentargets.org/study/NEALE2_30090_raw
    - Lee Lab SAIGE UKB: **5.3k (1%)**
        - example: https://genetics.opentargets.org/study/SAIGE_415
        - homepage: https://www.leelabsg.org/resources
- Variant type recorded as one of "SNP", "insertion", or "deletion" in `variant.type`
- `variant.rs_id` present in 99.3% of all 393k evidence strings
- `evidence.gene2variant.resource_score` contains OTG L2G score
- `evidence.variant2disease.gwas_sample_size` has size from original study
- `evidence.variant2disease.reported_trait` has original trait name
- `evidence.variant2disease.resource_score` contains p-value from GWAS
- `evidence.gene2variant.consequence_code` has VEP consequences with these frequencies:

|    | consequence_code                   |   count |
|---:|:-----------------------------------|--------:|
|  0 | intergenic_variant                 |  290094 |
|  1 | intron_variant                     |   67178 |
|  2 | upstream_gene_variant              |   11168 |
|  3 | downstream_gene_variant            |   10411 |
|  4 | missense_variant                   |    8036 |
|  5 | 3_prime_UTR_variant                |    3699 |
|  6 | synonymous_variant                 |    1202 |
|  7 | 5_prime_UTR_variant                |     877 |
|  8 | splice_region_variant              |     338 |
|  9 | non_coding_transcript_exon_variant |      82 |
| 10 | stop_gained                        |      50 |
| 11 | splice_donor_variant               |      23 |
| 12 | inframe_deletion                   |      22 |
| 13 | frameshift_variant                 |      21 |
| 14 | inframe_insertion                  |      20 |
| 15 | splice_acceptor_variant            |       5 |
| 16 | start_lost                         |       4 |
| 17 | coding_sequence_variant            |       2 |
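
The ancestor expansion described in the summary can be sketched as follows. This is a minimal illustration rather than the OTP pipeline: the toy ontology, the term and variant names, and the `ancestors` / `expand` helpers are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical toy ontology: term -> direct parents (these are NOT real EFO IDs).
PARENTS: Dict[str, List[str]] = {
    "type_2_diabetes": ["diabetes_mellitus"],
    "diabetes_mellitus": ["metabolic_disease"],
    "metabolic_disease": [],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every ancestor of `term` by following parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def expand(
    pairs: Iterable[Tuple[str, str]], parents: Dict[str, List[str]]
) -> Set[Tuple[str, str]]:
    """Add one (variant, ancestor) pair for every ancestor of each disease."""
    expanded: Set[Tuple[str, str]] = set()
    for variant, disease in pairs:
        expanded.add((variant, disease))
        for ancestor in ancestors(disease, parents):
            expanded.add((variant, ancestor))
    return expanded


# Hypothetical evidence: two variants annotated at different ontology depths.
evidence = [
    ("rs_example_1", "type_2_diabetes"),
    ("rs_example_2", "diabetes_mellitus"),
]
associations = expand(evidence, PARENTS)
print(len(evidence), "->", len(associations))
```

Here two evidence records become five associations, because each disease also contributes its ancestors.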

In [1]:
from dotenv import load_dotenv
load_dotenv()
from data_source import catalog
from pyspark.sql.session import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
cat = catalog.load()

### GWAS Catalog vs L2G Associations

The L2G score was first incorporated into OTP in 20.02. This section compares L2G evidence in 20.04/20.06 with 19.11, the last release with only lead variants from GWAS Catalog / UKB.

#### Differences

- `rs_id` now included in variant object
- `resource_score` was added to `evidence.gene2variant`, whereas before it only applied to `evidence.variant2disease` (now both have it)
- There were 194k variant-target-disease combinations before, and now there are:
    - 1.931M in 20.04
    - 393k in 20.06 (see https://blog.opentargets.org/2020/06/17/open-targets-platform-20-06-has-been-released/)
        - A 0.05 threshold on the L2G score was added to remove the large number of low-scoring associations
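
As a rough sketch of what such a cutoff does, the snippet below filters toy records on the L2G score stored at `evidence.gene2variant.resource_score.value`. The records, the `keep_by_l2g` helper, and the choice of an inclusive comparison (`>=`) are assumptions for illustration; the v20.06 file loaded below already has the threshold applied upstream.

```python
from typing import Any, Dict, List

L2G_THRESHOLD = 0.05  # value from the 20.06 release notes; inclusive vs. exclusive is assumed


def l2g_score(record: Dict[str, Any]) -> float:
    """Read the L2G score from the nested evidence structure."""
    return record["evidence"]["gene2variant"]["resource_score"]["value"]


def keep_by_l2g(
    records: List[Dict[str, Any]], threshold: float = L2G_THRESHOLD
) -> List[Dict[str, Any]]:
    """Keep only records whose L2G score meets the threshold."""
    return [r for r in records if l2g_score(r) >= threshold]


def toy_record(score: float) -> Dict[str, Any]:
    """Build a hypothetical record carrying only the field used here."""
    return {"evidence": {"gene2variant": {"resource_score": {"value": score}}}}


records = [toy_record(0.01), toy_record(0.05), toy_record(0.82)]
kept = keep_by_l2g(records)
print(f"kept {len(kept)} of {len(records)} records")

# On a Spark DataFrame with the same schema, the equivalent filter would be:
#   df.filter(F.col('evidence.gene2variant.resource_score.value') >= 0.05)
```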

In [8]:
path = cat.download('otpev', 'l2g', version='v20.06', overwrite=True); print(path)
df = spark.read.parquet(path)
schema_new = df._jdf.schema().treeString()
df.count()

/tmp/data_source_cache/otpev/l2g/v20.06/20200616T000000/data.parquet


393232

In [3]:
path = cat.download('otpev', 'gwas', version='v19.11'); print(path)
dfo = spark.read.parquet(path)
schema_old = dfo._jdf.schema().treeString()
dfo.count()

/tmp/data_source_cache/otpev/gwas/v19.11/20191128T000000/data.parquet


194170

#### Schemas

In [5]:
print(schema_new)

root
 |-- access_level: string (nullable = true)
 |-- disease: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- reported_trait: string (nullable = true)
 |-- evidence: struct (nullable = true)
 |    |-- gene2variant: struct (nullable = true)
 |    |    |-- consequence_code: string (nullable = true)
 |    |    |-- date_asserted: string (nullable = true)
 |    |    |-- evidence_codes: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- functional_consequence: string (nullable = true)
 |    |    |-- is_associated: boolean (nullable = true)
 |    |    |-- provenance_type: struct (nullable = true)
 |    |    |    |-- database: struct (nullable = true)
 |    |    |    |    |-- dbxref: struct (nullable = true)
 |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |-- version: string (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- version: string

In [6]:
print(schema_old)

root
 |-- access_level: string (nullable = true)
 |-- disease: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |-- evidence: struct (nullable = true)
 |    |-- gene2variant: struct (nullable = true)
 |    |    |-- date_asserted: string (nullable = true)
 |    |    |-- evidence_codes: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- functional_consequence: string (nullable = true)
 |    |    |-- is_associated: boolean (nullable = true)
 |    |    |-- provenance_type: struct (nullable = true)
 |    |    |    |-- database: struct (nullable = true)
 |    |    |    |    |-- dbxref: struct (nullable = true)
 |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |-- version: string (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- version: string (nullable = true)
 |    |    |    |-- expert: struct (nullable = true)
 |    |    |    |    |-- statement: s

In [4]:
# Schema diff
with open('/tmp/schema_new.txt', 'w') as f:
    f.write(schema_new)
with open('/tmp/schema_old.txt', 'w') as f:
    f.write(schema_old)
!diff /tmp/schema_new.txt /tmp/schema_old.txt

5d4
<  |    |-- reported_trait: string (nullable = true)
8d6
<  |    |    |-- consequence_code: string (nullable = true)
24,34d21
<  |    |    |    |-- literature: struct (nullable = true)
<  |    |    |    |    |-- references: array (nullable = true)
<  |    |    |    |    |    |-- element: struct (containsNull = true)
<  |    |    |    |    |    |    |-- author: string (nullable = true)
<  |    |    |    |    |    |    |-- lit_id: string (nullable = true)
<  |    |    |    |    |    |    |-- year: long (nullable = true)
<  |    |    |-- resource_score: struct (nullable = true)
<  |    |    |    |-- method: struct (nullable = true)
<  |    |    |    |    |-- description: string (nullable = true)
<  |    |    |    |-- type: string (nullable = true)
<  |    |    |    |-- value: double (nullable = true)
39a27
>  |    |    |-- gwas_panel_resolution: long (nullable = true)
42c30
<  |    |    |-- odds_ratio: double (nullable = true)
---
>  |    |    |-- odds_ratio: string (nullable = true)

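The same comparison can be done without shelling out to `diff`, using the standard-library `difflib`. This is only a sketch: the two schema strings are shortened, hypothetical stand-ins for `schema_new` and `schema_old`, and `schema_diff` is an illustrative helper. Note that the direction is old to new, so `+` marks lines present only in the new schema (the reverse of the `<` / `>` markers above).

```python
import difflib
from typing import List


def schema_diff(old_schema: str, new_schema: str) -> List[str]:
    """Return only the added (+) and removed (-) lines between two schema trees."""
    diff = difflib.unified_diff(
        old_schema.splitlines(),
        new_schema.splitlines(),
        fromfile="schema_old",
        tofile="schema_new",
        lineterm="",
        n=0,  # no context lines, only changes
    )
    # Drop the file headers and hunk markers, keep the changed lines.
    return [line for line in diff if not line.startswith(("---", "+++", "@@"))]


# Hypothetical, shortened schema trees for illustration.
toy_old = "\n".join([
    "root",
    " |-- disease: struct",
    " |    |-- id: string",
    " |-- odds_ratio: string",
])
toy_new = "\n".join([
    "root",
    " |-- disease: struct",
    " |    |-- id: string",
    " |    |-- reported_trait: string",
    " |-- odds_ratio: double",
])

changes = schema_diff(toy_old, toy_new)
for line in changes:
    print(line)
```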

### Field Stats

In [10]:
df.groupBy(F.col('variant.rs_id').isNull()).count().show()

+-----------------------+------+
|(variant.rs_id IS NULL)| count|
+-----------------------+------+
|                   true|  2797|
|                  false|390435|
+-----------------------+------+
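
The `rs_id` coverage quoted in the summary follows directly from these two counts. A small check, with the counts copied by hand from the output above:

```python
# Counts copied from the groupBy output above.
without_rsid = 2_797
with_rsid = 390_435

total = with_rsid + without_rsid
coverage = with_rsid / total
print(f"{with_rsid:,} of {total:,} records have an rs_id ({coverage:.1%})")
```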



In [11]:
df.groupBy('evidence.gene2variant.is_associated', 'evidence.variant2disease.is_associated').count().show()

+-------------+-------------+------+
|is_associated|is_associated| count|
+-------------+-------------+------+
|         true|         true|393232|
+-------------+-------------+------+



In [12]:
df.select(F.explode('evidence.gene2variant.evidence_codes')).groupBy('col').count().show(truncate=50)

+-------------------------------------------------+------+
|                                              col| count|
+-------------------------------------------------+------+
|http://identifiers.org/eco/locus_to_gene_pipeline|393232|
|       http://purl.obolibrary.org/obo/ECO_0000362|393232|
+-------------------------------------------------+------+



In [13]:
df.select(F.explode('evidence.variant2disease.evidence_codes')).groupBy('col').count().show(truncate=50)

+------------------------------------------+------+
|                                       col| count|
+------------------------------------------+------+
|           http://identifiers.org/eco/GWAS|393232|
|http://purl.obolibrary.org/obo/ECO_0000362|393232|
+------------------------------------------+------+



In [38]:
df.groupBy('evidence.gene2variant.resource_score.method.description').count().show(truncate=50)

+--------------------------------------------------+------+
|                                       description| count|
+--------------------------------------------------+------+
|Locus to gene score generated by OpenTargets Ge...|393232|
+--------------------------------------------------+------+



In [14]:
df.groupBy('evidence.variant2disease.resource_score.method.description').count().show(truncate=50)

+-----------------------------------------+------+
|                              description| count|
+-----------------------------------------+------+
|pvalue for the snp to disease association|393232|
+-----------------------------------------+------+



In [39]:
df.groupBy(F.col('evidence.variant2disease.resource_score.value').isNull()).count().show(truncate=50)

+-------------------------------------------------------+------+
|(evidence.variant2disease.resource_score.value IS NULL)| count|
+-------------------------------------------------------+------+
|                                                  false|393232|
+-------------------------------------------------------+------+



In [41]:
df.agg(
    F.min(F.col('evidence.variant2disease.resource_score.value')),
    F.max(F.col('evidence.variant2disease.resource_score.value'))
).show()

+--------------------------------------------------+--------------------------------------------------+
|min(evidence.variant2disease.resource_score.value)|max(evidence.variant2disease.resource_score.value)|
+--------------------------------------------------+--------------------------------------------------+
|                                          1.0E-302|                              5.000000000000001E-8|
+--------------------------------------------------+--------------------------------------------------+



In [15]:
df.groupBy('evidence.variant2disease.resource_score.type').count().show(truncate=50)

+------+------+
|  type| count|
+------+------+
|pvalue|393232|
+------+------+



In [16]:
df.groupBy('evidence.variant2disease.provenance_type.expert.statement').count().show(truncate=50)

+-----------------------------+------+
|                    statement| count|
+-----------------------------+------+
|Primary submitter of the data|393232|
+-----------------------------+------+



In [17]:
df.groupBy('evidence.variant2disease.provenance_type.expert.status').count().show(truncate=50)

+------+------+
|status| count|
+------+------+
|  true|393232|
+------+------+



In [18]:
df.groupBy('evidence.variant2disease.reported_trait').count().show(5, truncate=50)

+--------------------------------------------------+-----+
|                                    reported_trait|count|
+--------------------------------------------------+-----+
|                        6mm strong meridian (left)|  401|
|  Estimated glomerular filtration rate in diabetes|  168|
|LDL cholesterol levels x alcohol consumption (r...|   26|
|Chickenpox | non-cancer illness code, self-repo...|    8|
|Chronic obstructive pulmonary disease [conditio...|    4|
+--------------------------------------------------+-----+
only showing top 5 rows



In [19]:
df.groupBy('type').count().show()

+-------------------+------+
|               type| count|
+-------------------+------+
|genetic_association|393232|
+-------------------+------+



In [20]:
df.groupBy('variant.type').count().show()

+---------+------+
|     type| count|
+---------+------+
|      SNP|362375|
|insertion| 16486|
| deletion| 14371|
+---------+------+



In [21]:
df.groupBy('variant.source_link').count().show(3, truncate=100)

+--------------------------------------------------------+-----+
|                                             source_link|count|
+--------------------------------------------------------+-----+
|https://genetics.opentargets.org/variant/5_139715518_G_T|    2|
| https://genetics.opentargets.org/variant/5_40392626_T_C|   10|
|https://genetics.opentargets.org/variant/19_44908822_C_T|  265|
+--------------------------------------------------------+-----+
only showing top 3 rows



In [22]:
df.groupBy('target.target_type').count().show(3, truncate=100)

+------------------------------------------------+------+
|                                     target_type| count|
+------------------------------------------------+------+
|http://identifiers.org/cttv.target/gene_evidence|393232|
+------------------------------------------------+------+



In [23]:
df.groupBy('target.activity').count().show(3, truncate=100)

+-------------------------------------------------------+------+
|                                               activity| count|
+-------------------------------------------------------+------+
|http://identifiers.org/cttv.activity/predicted_damaging|393232|
+-------------------------------------------------------+------+



In [24]:
df.groupBy('sourceID').count().show(3, truncate=100)

+------------------+------+
|          sourceID| count|
+------------------+------+
|ot_genetics_portal|393232|
+------------------+------+



#### Variant Consequence

In [35]:
df.groupBy('evidence.gene2variant.functional_consequence').count().sort(F.col('count').desc()).show(3, truncate=False)

+-----------------------------------------+------+
|functional_consequence                   |count |
+-----------------------------------------+------+
|http://purl.obolibrary.org/obo/SO_0001628|290094|
|http://purl.obolibrary.org/obo/SO_0001627|67178 |
|http://purl.obolibrary.org/obo/SO_0001631|11168 |
+-----------------------------------------+------+
only showing top 3 rows
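
The `functional_consequence` URIs are Sequence Ontology identifiers, and the three shown here correspond to the top three `consequence_code` values (the counts match). A minimal sketch of translating them, assuming a hand-written lookup that covers only these three terms; `SO_TERMS` and `so_label` are illustrative names, not part of any library:

```python
# Hand-written lookup for the three SO terms shown above (not a full SO mapping).
SO_TERMS = {
    "SO_0001628": "intergenic_variant",
    "SO_0001627": "intron_variant",
    "SO_0001631": "upstream_gene_variant",
}


def so_label(uri: str) -> str:
    """Return the SO term name for a URI, or the bare ID if it is not in the lookup."""
    so_id = uri.rstrip("/").rsplit("/", 1)[-1]
    return SO_TERMS.get(so_id, so_id)


for uri in (
    "http://purl.obolibrary.org/obo/SO_0001628",
    "http://purl.obolibrary.org/obo/SO_0001627",
    "http://purl.obolibrary.org/obo/SO_0001631",
):
    print(uri, "->", so_label(uri))
```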



In [37]:
print(df.groupBy('evidence.gene2variant.consequence_code').count().sort(F.col('count').desc()).toPandas().to_markdown())

|    | consequence_code                   |   count |
|---:|:-----------------------------------|--------:|
|  0 | intergenic_variant                 |  290094 |
|  1 | intron_variant                     |   67178 |
|  2 | upstream_gene_variant              |   11168 |
|  3 | downstream_gene_variant            |   10411 |
|  4 | missense_variant                   |    8036 |
|  5 | 3_prime_UTR_variant                |    3699 |
|  6 | synonymous_variant                 |    1202 |
|  7 | 5_prime_UTR_variant                |     877 |
|  8 | splice_region_variant              |     338 |
|  9 | non_coding_transcript_exon_variant |      82 |
| 10 | stop_gained                        |      50 |
| 11 | splice_donor_variant               |      23 |
| 12 | inframe_deletion                   |      22 |
| 13 | frameshift_variant                 |      21 |
| 14 | inframe_insertion                  |      20 |
| 15 | splice_acceptor_variant            |       5 |
| 16 | start_lost                         |       4 |
| 17 | coding_sequence_variant            |       2 |

#### Studies

In [25]:
df.groupBy('unique_association_fields.study').count().sort(F.col('count').desc()).show(25, truncate=100)

+-------------------------------------------------------+-----+
|                                                  study|count|
+-------------------------------------------------------+-----+
|   https://genetics.opentargets.org/study/NEALE2_50_raw| 3917|
|      https://genetics.opentargets.org/study/GCST007841| 3239|
|      https://genetics.opentargets.org/study/GCST006571| 2755|
|https://genetics.opentargets.org/study/NEALE2_20015_raw| 2611|
|      https://genetics.opentargets.org/study/GCST006979| 2413|
|      https://genetics.opentargets.org/study/GCST006568| 2287|
|https://genetics.opentargets.org/study/NEALE2_23129_raw| 2090|
|      https://genetics.opentargets.org/study/GCST007069| 2071|
|https://genetics.opentargets.org/study/NEALE2_23130_raw| 2038|
|https://genetics.opentargets.org/study/NEALE2_30100_raw| 1990|
|https://genetics.opentargets.org/study/NEALE2_30080_raw| 1990|
|      https://genetics.opentargets.org/study/GCST005537| 1985|
|https://genetics.opentargets.org/study/

In [29]:
(
    df
    .select(F.element_at(F.split(F.col('unique_association_fields.study'), '/'), -1).alias('study'))
    .withColumn('prefix', F.col('study').substr(1, 5))
    .groupBy('prefix').count().show()
)

+------+------+
|prefix| count|
+------+------+
| NEALE|123153|
| SAIGE|  5313|
| GCST0|264766|
+------+------+
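
The source shares in the summary can be recomputed from these prefix counts. The sketch below mirrors the Spark logic in plain Python (last path segment, then its first five characters); the `SOURCE_BY_PREFIX` labels and the `study_source` helper are illustrative, and the counts are copied by hand from the output above.

```python
# Illustrative labels for the three study ID prefixes seen in this dataset.
SOURCE_BY_PREFIX = {
    "GCST0": "GWAS Catalog",
    "NEALE": "Neale Lab v2 UKB",
    "SAIGE": "Lee Lab SAIGE UKB",
}


def study_source(url: str) -> str:
    """Classify a study URL by the first five characters of its study ID."""
    study_id = url.rsplit("/", 1)[-1]
    return SOURCE_BY_PREFIX.get(study_id[:5], "unknown")


# Counts copied from the groupBy output above.
counts = {"NEALE": 123_153, "SAIGE": 5_313, "GCST0": 264_766}
total = sum(counts.values())
shares = {
    SOURCE_BY_PREFIX[prefix]: round(100 * n / total, 1)
    for prefix, n in counts.items()
}

for source, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{source}: {pct}%")
print(study_source("https://genetics.opentargets.org/study/SAIGE_415"))
```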



In [30]:
(
    df.filter(F.col('unique_association_fields.study').contains('SAIGE'))
    .select('unique_association_fields.study')
    .show(5, truncate=False)
)

+--------------------------------------------------+
|study                                             |
+--------------------------------------------------+
|https://genetics.opentargets.org/study/SAIGE_454_1|
|https://genetics.opentargets.org/study/SAIGE_415  |
|https://genetics.opentargets.org/study/SAIGE_740_1|
|https://genetics.opentargets.org/study/SAIGE_240  |
|https://genetics.opentargets.org/study/SAIGE_208  |
+--------------------------------------------------+
only showing top 5 rows



In [26]:
(
    df
    .filter(F.col('unique_association_fields.study').contains('NEALE'))
    .select(F.element_at(F.split(F.col('unique_association_fields.study'), '/'), -1).alias('study'))
    .withColumn('prefix', F.element_at(F.split(F.col('study'), '_'), 1))
    .groupBy('prefix').count().show()
)

+------+------+
|prefix| count|
+------+------+
|NEALE2|123153|
+------+------+

