Official Website, Part 4: Software Tools #16

wanghaisheng opened this issue Mar 18, 2015 · 0 comments

Analysis Tools

(Overview diagram)

These tools leverage advanced visualization, analytical methods, and interactive exploration of the data.
The source code for all of the tools is published on GitHub.

1. ACHILLES: Data Characterization

Statistical analysis of OMOP CDM v4 databases. The software was presented at the 2014 EDM Forum in San Diego. For a live demonstration, see the Demo link.

ACHILLES provides database characterization, quality assessment, and visualization. It lets users interactively assess patient demographics; the prevalence of conditions, drugs, and procedures; and the distributions of clinical observation values.

Any site holding patient-level data can deploy ACHILLES locally.

ACHILLES has two main components. The first is implemented in R and runs locally, so no individual-level information leaves the site. The R package requires data in the OMOP Common Data Model format, and it generates and exports summary statistics describing the quality and content of the patient-level database. The second is an HTML5/JavaScript front end that provides interactive reports for visualizing and exploring those statistics.
A single front end can serve multiple back-end databases.

Part 1: the R package that generates the summary statistics (https://github.com/OHDSI/Achilles)

Part 2: the web interface that visualizes the statistics (https://github.com/OHDSI/AchillesWeb)
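
The snippet below is a minimal sketch of how these two parts are typically run from R: the Achilles package computes the summary statistics in the database, and an export step produces the JSON files that the AchillesWeb front end reads. The connection settings and schema names are placeholders, and exact function arguments may differ between Achilles releases, so treat this as illustrative rather than definitive.

```r
# Minimal sketch: run ACHILLES against a CDM database, then export for AchillesWeb.
# Server, credentials, and schema names below are placeholders.
library(DatabaseConnector)
library(Achilles)

connectionDetails <- createConnectionDetails(
  dbms     = "postgresql",
  server   = "localhost/ohdsi",            # placeholder server
  user     = "cdm_user",
  password = Sys.getenv("CDM_PASSWORD")
)

# Compute the summary statistics inside the database
achilles(
  connectionDetails,
  cdmDatabaseSchema     = "cdm",            # schema holding the CDM tables
  resultsDatabaseSchema = "results",        # schema that receives the Achilles results
  sourceName            = "My CDM source"
)

# Export the statistics as JSON for the AchillesWeb front end
exportToJson(
  connectionDetails,
  cdmDatabaseSchema     = "cdm",
  resultsDatabaseSchema = "results",
  outputPath            = "achilles_json"
)
```

The exported JSON folder is what an AchillesWeb installation points at; pointing one front end at several exported folders is how a single interface serves multiple back-end databases.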

2. HOMER: Population-Level Estimation

Observational healthcare data, such as electronic health records and insurance claims, hold enormous value for the study of health, disease, and medical products.
The current paradigm for large-scale analysis in healthcare is still largely episodic in nature: a researcher poses a hypothesis about an association (for example, between a drug and an outcome), designs an observational analysis to test it, executes the analysis on whatever observational database is available, and, if the result happens to reach p < 0.05, declares the difference statistically significant and seeks to disseminate the finding through peer-reviewed literature and conference presentations. In general, such causal hypothesis testing focuses on producing an unbiased estimate of the strength of association and on deciding whether the relative risk metric is sufficient to reject the null hypothesis of no effect. This paradigm raises several problems:

  • The current research process is inefficient: evidence is generated to support one hypothesis at a time, while the number of questions about diseases and medical products for which patients and providers deserve reliable evidence grows at a pace that outstrips the output of the entire research enterprise. For example, across all pharmaceutical drugs and all health outcomes of interest, only 4% of combinations have evidence in the published literature from randomized clinical trials or observational studies; 96% of the potential questions remain unasked and therefore unanswered.
  • The evidence produced is unreliable: estimates of the strength of association from observational database analyses are subject to the systematic error that bedevils the field of epidemiology. Repeated examples illustrate the challenge of proper analysis: different research groups attempting to answer the same question on the same data generate conflicting results (such as bisphosphonates and esophageal cancer, or pioglitazone and bladder cancer), and findings on the same issue fail to replicate across observational databases (such as fluoroquinolones and retinal detachment, or dabigatran and bleeding). Heterogeneity across data sources and sensitivity to method parameters make it critically important to explore multiple databases and multiple analysis choices when addressing a particular product-outcome association, but conducting multiple large-scale analyses across disparate sources and synthesizing results across the analyses is difficult.
  • The evidence is insufficient to address causality: most observational database analyses provide estimates of the strength of association and, when statistically significant findings are observed, offer post-hoc rationalizations of biologic plausibility. Austin Bradford Hill outlined in 1965 [14] many facets that bear consideration when assessing a causal effect, including strength of association, plausibility, consistency, temporality, biologic gradient, analogy, specificity, experiment, and coherence. These viewpoints have been applied to specific pharmacovigilance analyses [27], but have not been consistently adopted or systematically applied in the context of observational data studies. An open opportunity for the novel use of observational data is to develop exploratory analyses for each of these causal dimensions, as well as some novel dimensions, to strengthen the interpretation of any purported effect. In addition, we propose to develop quantitative metrics associated with each of these dimensions.
  • The data used and the results produced are static: patient-level data are summarized in a series of statistics that populate tables in a manuscript. The level of detail provided about the underlying data and the analysis methods applied to them is often not transparent enough to evaluate the integrity of the study, and because the patient-level data are not publicly available, the analyses are often not reproducible. Yet most study results stimulate more questions than they answer. For example, if we find that dabigatran causes bleeding, the community will immediately want to go further and ask: is the effect observed for all indications of the treatment, and for all patient subgroups within each indication? Do other anticoagulants have similar effects? If the drug causes gastrointestinal bleeding, are there other hemorrhagic conditions it is also associated with? Do observed associations persist as observational data accumulate, health care delivery evolves, and the practice of medicine learns from prior work to develop interventions intended to maximize the benefits and minimize the risks of treatments?
  • Addressing these problems requires an iterative approach to data analysis, one which facilitates exploration of summary results while protecting patient privacy, through coordination of an observational data network of disparate sources that provides timely, ongoing access to current summary analysis results.

To address these problems, we will design, implement, and deploy the Health Outcomes and Medical Effectiveness Research (HOMER) system. HOMER is an interactive visualization platform on which researchers can explore associations across a network of observational databases. We will provide standardized tools for large-scale analysis to extract summary statistics, along with a web interface for exploring those statistics in real time.

Consider the four dimensions of big data. Observational healthcare databases continue to grow in volume, with many sources exceeding 100 million patients and containing billions of clinical observations. Across the data network, the data generated through electronic health records and insurance claims processing also exhibit substantial variety. Veracity depends on our ability to interpret the data: to turn them into accurate predictions about individual patients' experiences and to obtain unbiased estimates of a drug's effects in a population. Healthcare data also offer substantial velocity, with clinical observations captured every day, and large-scale analyses are expected to be executed on a regular basis, if not in real time, to keep all generated evidence current. However, the “big data” problem goes beyond the patient-level sources: with tens of thousands of medical interventions and thousands of health outcomes of interest to patients, a comprehensive summary of all potential effects constitutes “big results”. If we estimate that 1,000 summary statistics are needed to properly characterize a single drug-outcome effect, then the result set required to explore all drugs and all outcomes should be expected to exceed 10 billion entries.
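
For a sense of scale, here is a back-of-the-envelope calculation using the figures quoted above (purely illustrative):

```r
# Rough size of the "big results" space, using the estimates quoted above
n_interventions <- 1e4   # tens of thousands of medical interventions
n_outcomes      <- 1e3   # thousands of health outcomes of interest
stats_per_pair  <- 1e3   # summary statistics per drug-outcome pair

n_interventions * n_outcomes * stats_per_pair
# [1] 1e+10   (roughly 10 billion summary statistics)
```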

Results at this scale therefore require a new approach to research: no individual can manually review all of the information to identify potential effects and weigh the accuracy of the evidence needed to dispel spurious ones. What is needed is a large-scale exploration framework built on interactive visualization, in which researchers can filter, zoom, and pan across results and relate them to orthogonal analysis components, in order to build an evidence-based story of a drug's effects or to identify the reasons behind an association. Moreover, large-scale analysis results enable large-scale evaluation of the reliability and performance of the research methods themselves, providing further evidence about how to learn from these results and how much confidence they deserve at any given point in time.

The HOMER framework starts from Sir Austin Bradford Hill's causal considerations. For each causal component (strength of association, consistency, temporality, experiment, plausibility, coherence, biologic gradient, specificity, and analogy) we will develop a large-scale analysis solution.
Each component has two parts: a method for deriving summary statistics from a patient-level database, and a method for visualizing those statistics. Within the tool, a user can select any drug and any outcome and examine all of the evidence relevant to that drug-outcome pair.
Performing analysis at this scale requires tools with strong computational performance: they must handle patient-level datasets containing millions of individuals, apply sophisticated regularization strategies for confounding adjustment across millions of covariates, and study millions of drug-outcome pairs.

All components of HOMER are open source and openly available.

3. PLATO: Patient-Level Prediction

Person-level assessment of treatment outcomes: predictive models that estimate, from a patient's medical history, the probability that the patient will experience a particular outcome after receiving a particular intervention.

Patients seek medical care to diagnose and treat illness. Current medical practice relies on limited aggregate information for prognosis and prediction of a patient's health. When predictive models are used in healthcare, they draw on data from hundreds to thousands of patients and consider a small number of patient characteristics, often five or fewer. This contrasts sharply with the reality of modern medicine, wherein patients generate a rich digital trail that is well beyond the power of any medical practitioner to fully assimilate. The recent emergence of massive patient-level databases of electronic health records and administrative claims opens up extraordinary opportunities for massive-scale, patient-specific predictive modeling. Such models can inform truly personalized medical care, hopefully leading to sharply improved patient outcomes.

We develop predictive models using longitudinal data accumulated over many years, covering 100 million patients and more than 5 billion clinical observations.
Such a large population base provides rich data for building effective predictive models and allows timely service for the many patients whose care stands to be improved. Studying these data effectively requires new methodology and cross-disciplinary collaboration. We are confident that, with OHDSI's combined expertise and access to data at this scale, dedicated collaboration can produce breakthroughs in this area.

The work focuses on models and algorithms that derive clinically meaningful predictors from irregularly spaced electronic health record data, on algorithms that use this information for modeling at massive variable scale, and on evaluating the accuracy of person-level outcome prediction. Predictive modeling in databases containing data for upwards of 100 million patients presents non-trivial engineering challenges,
but our team is familiar with these data and has already set up a purpose-built computing environment. The primary goal is to establish a standardized process for developing accurate, personalized predictive models.

The predictive models are built on the OMOP Common Data Model, and all of the tools are open source.
Person-Level Assessment of Treatment Outcomes (PLATO) will be an integrated framework that allows all users to apply the library of predictive models to produce individualized risk estimates for all medical interventions and all health outcomes of interest, based on personal demographics, medical history, and health behaviors.
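
As an illustration of the kind of large-scale, regularized modeling described above, the sketch below fits an L1-regularized logistic regression on a sparse, hypothetical patient-by-covariate matrix using glmnet. This is not the PLATO implementation itself; the data, dimensions, and choice of glmnet are assumptions made only for the example.

```r
# Illustrative person-level prediction sketch: regularized logistic regression
# on a sparse covariate matrix (hypothetical data, not OHDSI tooling).
library(glmnet)
library(Matrix)

set.seed(42)
n_patients   <- 5000
n_covariates <- 2000    # e.g. conditions, drugs, procedures from the patient history

# Sparse 0/1 patient-by-covariate matrix; most patients have few covariates set
x <- rsparsematrix(n_patients, n_covariates, density = 0.01,
                   rand.x = function(n) rep(1, n))
y <- rbinom(n_patients, 1, 0.1)   # hypothetical outcome indicator

# LASSO-penalized logistic regression with cross-validated penalty strength,
# the style of regularization needed when covariates number in the millions
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Predicted outcome risk per patient at the selected penalty
risk <- predict(fit, newx = x, s = "lambda.min", type = "response")
head(risk)
```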

4. HERMES: Vocabulary Tool

Health Entity Relationship and Metadata Exploration System (HERMES)

HERMES is a web-based tool for searching and exploring the vocabulary stored in the OMOP Common Data Model (CDM).
It also supports vocabulary management and export.
HERMES consists of a front end implemented in HTML/JavaScript and back-end services that access the OMOP CDM vocabulary resources.
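
For readers who want to see what such a vocabulary lookup amounts to, the sketch below issues the same kind of query directly against the CDM vocabulary tables from R. The schema name is a placeholder and the column names follow CDM v5 conventions, which may differ from the CDM version at a given site.

```r
# Sketch of a vocabulary search like the ones HERMES performs, run directly
# against the CDM concept table. Schema name "cdm" is a placeholder.
library(DatabaseConnector)

connection <- connect(connectionDetails)   # connectionDetails as in the ACHILLES sketch

sql <- "
  SELECT concept_id, concept_name, vocabulary_id, domain_id, standard_concept
  FROM   cdm.concept
  WHERE  LOWER(concept_name) LIKE '%warfarin%'
"
concepts <- querySql(connection, sql)
head(concepts)

disconnect(connection)
```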

5. HERCULES: Quality Reporting

Health Enterprise Resource, Care, and Utilization Learning Exploration System (HERCULES):
Standardized descriptive reporting: healthcare organizations can identify where care can be improved and compare themselves against baselines from similar organizations.

HERCULES analyzes quality of care, cost, and patterns of medical practice, and is likewise based on the OMOP Common Data Model.
It provides visualization tools and the ability to drill down into quality measures, and it can be applied across multiple patient cohorts.

6. WhiteRabbit: ETL Design Tool

WhiteRabbit is a tool that helps organizations design the ETL that converts their source data into the OMOP Common Data Model (CDM).
Source data can come from CSV files or from a MySQL, SQL Server, Oracle, or PostgreSQL database;
the CDM can be hosted in MySQL, SQL Server, or PostgreSQL.

WhiteRabbit's core function is to scan the source data and report detailed information about every table, field, and value. The scan produces a report that serves as a reference when designing the ETL process, for instance when using the Rabbit-In-a-Hat tool. Rabbit-In-a-Hat reads the scan document and displays the source data information through a graphical user interface, allowing a user to map the source data structure onto the CDM data structure. The purpose of Rabbit-In-a-Hat is to generate documentation for the ETL process, not to generate code that implements the ETL.

Download WhiteRabbit: https://github.com/OHDSI/WhiteRabbit
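
The scan report is an Excel workbook (typically named ScanReport.xlsx) with one sheet per scanned source table. A quick way to inspect it outside the tools, shown below as an illustrative sketch since the file name, sheet names, and layout vary between WhiteRabbit versions, is to read it with readxl in R.

```r
# Sketch: peek at a WhiteRabbit scan report in R. File name and sheet layout
# may differ between WhiteRabbit versions; adjust to the actual output.
library(readxl)

scan_file <- "ScanReport.xlsx"

# Each scanned source table gets its own sheet; list them first
excel_sheets(scan_file)

# Read one sheet to see the fields and frequently observed values for that table
first_table <- read_excel(scan_file, sheet = 1)
head(first_table)
```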
