## Evidence Preparation Error Summary Report

In [None]:
import $ivy.`org.plotly-scala::plotly-almond:0.7.2`
import $file.^.sparkinit, sparkinit._
import $file.^.pathinit, pathinit._
import $file.^.cpinit, cpinit._
import ss.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import java.nio.file.Paths
import plotly._
import plotly.element._
import plotly.layout._
import plotly.Almond._
implicit class DFOPs(df: DataFrame) { def fn[T](fn: DataFrame => T): T = fn(df)}

In [2]:
init(offline=false)

### Load Summary Datasets

In [3]:
// Derive the validation step names from the summary file names; files that do not match are skipped
val regex = ".*evidence_(.*)_validation_.*".r
val steps = RESULTS_DIR.resolve("errors").toFile.listFiles.map(_.toString).collect({
    case regex(k) => k
}).toSet
def loadSummary(step: String) = ss.read
    .parquet(RESULTS_DIR.resolve(s"errors/evidence_${step}_validation_summary.parquet").toString)
    .withColumn("step", lit(step))
val df = steps.map(loadSummary).reduce(_.union(_))

regex: scala.util.matching.Regex = .*evidence_(.*)_validation_.*
steps: Set[String] = Set("disease_id", "schema", "target_id", "data_source")
defined function loadSummary
df: DataFrame = [sourceID: string, reason: string ... 2 more fields]
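
As a minimal sketch of how the step names are derived, the snippet below applies the same pattern to a few file names. The paths are hypothetical and only assume the `evidence_<step>_validation_summary.parquet` layout that `loadSummary` reads; `_SUCCESS` stands in for any file that does not match, and `stepPattern`, `fileNames` and `stepNames` are illustrative names.

```scala
// Same pattern as in the cell above; used as an extractor it must match the whole string
val stepPattern = ".*evidence_(.*)_validation_.*".r

// Hypothetical directory listing (illustrative paths only)
val fileNames = Seq(
  "errors/evidence_disease_id_validation_summary.parquet",
  "errors/evidence_schema_validation_summary.parquet",
  "errors/_SUCCESS"
)

// collect keeps only the names that match and extracts the captured step
val stepNames = fileNames.collect { case stepPattern(step) => step }.toSet

println(stepNames)
```

Using `collect` with a partial function drops names that do not match, so no empty step name reaches `loadSummary`.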

In [4]:
df.withColumn("reason", coalesce($"reason", lit("none"))).show(1000, false)

+------------------+------------------------------+-------+-----------+
|sourceID          |reason                        |count  |step       |
+------------------+------------------------------+-------+-----------+
|uniprot_literature|id_not_found                  |10     |disease_id |
|gene2phenotype    |id_not_found                  |3      |disease_id |
|phewas_catalog    |id_not_found                  |8941   |disease_id |
|cancer_gene_census|none                          |96274  |disease_id |
|genomics_england  |id_not_found                  |522    |disease_id |
|eva_somatic       |id_not_found                  |10     |disease_id |
|phenodigm         |id_not_found                  |714    |disease_id |
|europepmc         |id_not_found                  |106933 |disease_id |
|cancer_gene_census|id_not_found                  |10     |disease_id |
|crispr            |none                          |1844   |disease_id |
|reactome          |id_not_found                  |30     |disea




### Record Invalidation Cause Frequency by Source

Evidence records are eliminated in the pipeline for a variety of reasons, and this section shows the frequency with which each reason is encountered per data source. A reason of "none" below indicates that the record passed all validation filters (i.e. only these records are kept; all others are dropped).
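
To make the kept-versus-dropped reading concrete, here is a small sketch on plain Scala collections (no Spark) that computes the fraction of records retained for one source within one step. The `ReasonCount` case class and the `retainedFraction` helper are hypothetical names; the two rows are the `cancer_gene_census` entries for the `disease_id` step from the summary table above.

```scala
// Hypothetical row type mirroring the summary columns; a missing reason means the record passed
case class ReasonCount(sourceID: String, reason: Option[String], count: Long, step: String)

val rows = Seq(
  ReasonCount("cancer_gene_census", None, 96274L, "disease_id"),
  ReasonCount("cancer_gene_census", Some("id_not_found"), 10L, "disease_id")
)

// Fraction of records with no invalidation reason, per (step, sourceID)
def retainedFraction(rows: Seq[ReasonCount]): Map[(String, String), Double] =
  rows.groupBy(r => (r.step, r.sourceID)).map { case (key, group) =>
    val kept = group.filter(_.reason.isEmpty).map(_.count).sum
    val total = group.map(_.count).sum
    key -> kept.toDouble / total
  }

val fractions = retainedFraction(rows)
fractions.foreach { case ((step, source), f) => println(f"$step%s / $source%s: $f%.5f") }
```

The same calculation could be expressed on the Spark `df` with a group-by over `step` and `sourceID`; the collection version is used here only to keep the sketch self-contained.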

In [None]:
// Visualize the number of records with some reason they were invalid alongside those that were not (reason = "none")
val traces = df.select("step").dropDuplicates().collect().map(_.getAs[String]("step")).toList.map(s => 
        df
            .withColumn("reason", coalesce($"reason", lit("none")))
            .withColumn("sourceID", coalesce($"sourceID", lit("none")))
            .filter($"step" === s)
            .fn(ds => {
                s -> ds.select("reason").dropDuplicates().collect.map(_.getAs[String]("reason")).toSeq.map(r => {
                    ds
                        .filter($"reason" === r)
                        .sort($"count".desc)
                        .fn( dsr => {
                            Bar(
                                x=dsr.select("sourceID").collect.map(_.getAs[String]("sourceID")).toList,
                                y=dsr.select("count").collect.map(_.getAs[Long]("count")).toList,
                                name=r,
                                showlegend=true
                            )
                        })
                })
            })
    ).toMap
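
The nested construction above amounts to: for each step, group the rows by reason and emit one bar series per reason, with sources ordered by descending count. A minimal sketch of that grouping on plain Scala collections follows. `SourceCount` and `Series` are hypothetical stand-ins (the latter for plotly's `Bar`), and the rows are a subset of the `disease_id` entries shown in the summary table.

```scala
// Hypothetical stand-ins for a summary row and for a plotly Bar trace
case class SourceCount(sourceID: String, reason: String, count: Long)
case class Series(name: String, x: List[String], y: List[Long])

// Subset of the disease_id step, with null reasons already replaced by "none"
val diseaseIdRows = Seq(
  SourceCount("gene2phenotype", "id_not_found", 3L),
  SourceCount("phewas_catalog", "id_not_found", 8941L),
  SourceCount("europepmc", "id_not_found", 106933L),
  SourceCount("crispr", "none", 1844L),
  SourceCount("cancer_gene_census", "none", 96274L)
)

// One series per reason: x holds the sources, y the counts, largest count first
val series: Map[String, Series] =
  diseaseIdRows.groupBy(_.reason).map { case (reason, rs) =>
    val sorted = rs.sortBy(-_.count)
    reason -> Series(reason, sorted.map(_.sourceID).toList, sorted.map(_.count).toList)
  }

series.values.foreach(println)
```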

In [6]:
traces.foreach {case (k, data) => {
    data.plot(
        title=s"Validation phase: $k", 
        yaxis=Axis(`type`=AxisType.Log),
        margin=Margin(t=40),
        barmode=BarMode.Group
    )
}}

For reference, show only the counts of invalid records (rows with a non-null reason):

In [7]:
df.filter($"reason".isNotNull).show(100, false)

+------------------+------------------------------+------+----------+
|sourceID          |reason                        |count |step      |
+------------------+------------------------------+------+----------+
|uniprot_literature|id_not_found                  |10    |disease_id|
|gene2phenotype    |id_not_found                  |3     |disease_id|
|phewas_catalog    |id_not_found                  |8941  |disease_id|
|genomics_england  |id_not_found                  |522   |disease_id|
|eva_somatic       |id_not_found                  |10    |disease_id|
|phenodigm         |id_not_found                  |714   |disease_id|
|europepmc         |id_not_found                  |106933|disease_id|
|cancer_gene_census|id_not_found                  |10    |disease_id|
|reactome          |id_not_found                  |30    |disease_id|
|uniprot           |id_not_found                  |58    |disease_id|
|progeny           |id_not_found                  |87    |disease_id|
|gwas_catalog      |