CMDL

Cross-Modal Data Discovery over Structured and Unstructured Data Lakes

Set up:

Create the conda environment from environment.yml with: conda env create -f environment.yml

Entry points:

trainer/pretrain-text.ipynb: fine-tunes a language model on the text corpus to learn text embeddings
trainer/pretrain-tables.ipynb: fine-tunes a language model on the table collection to learn tuple embeddings
trainer/column_text_joint_training.ipynb: trains a baseline that connects text to table columns
compare_gt.py: measures the accuracy of search-based baselines and similarity sketches on text->table relation discovery, using the provided ground truth
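
To make the kind of measurement compare_gt.py reports more concrete, here is a minimal sketch of precision@k and recall@k for text->table discovery. It is not the code in compare_gt.py: the function names, the dictionary formats, and the choice of metric are all assumptions made for illustration.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision@k and recall@k for a single text.

    ranked:   table names returned for the text, best first (assumed format,
              assumed to contain no duplicates)
    relevant: set of table names the ground truth links to the text
    """
    top_k = ranked[:k]
    hits = sum(1 for table in top_k if table in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_scores(predictions, ground_truth, k):
    """Average precision@k and recall@k over every text in the ground truth.

    predictions:  {text_id: ranked list of table names}  (hypothetical format)
    ground_truth: {text_id: set of relevant table names} (hypothetical format)
    """
    scores = [
        precision_recall_at_k(predictions.get(text, []), tables, k)
        for text, tables in ground_truth.items()
    ]
    if not scores:
        return 0.0, 0.0
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n
```

Texts that have ground truth but no prediction count as zero hits, so a method is penalized for returning nothing.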

Data Sets & Ground Truths:

All files and directories listed below are inside the inputs directory

Pharma
- drugbank-tables: DrugBank tables as csv files
- pubmed-targets: PubMed article abstracts as txt files
- DrugBank_Synthetic_dataset: synthetic DrugBank tables as csv files
ChEBI
- ChEBI_tables_dataset: ChEBI tables as csv files
  Note: chebi-reference.csv.zip and chebi-structures.csv.zip are compressed due to GitHub file size limits
ChEMBL
- ChEMBL_tables_dataset: ChEMBL tables as csv files
  Note: chembl_27-activity_supp.csv.zip, chembl_27-chembl_id_lookup.csv.zip, chembl_27-compound_records.csv.zip, and chembl_27-molecule_dictionary.csv.zip are compressed due to GitHub file size limits
MLOpen
- MLOpen Data Source
- Our experiments use subsets of the data, which can be found in these subdirectories:
  - mlopen_t2t_SS_dataset
  - mlopen_t2t_MS_dataset
  - mlopen_t2t_LS_dataset
UKOpen
- UKOpen Data Source

The ground truth files for each dataset are also in the inputs directory
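
Because several ChEBI and ChEMBL tables are shipped as .csv.zip archives, they need to be extracted before use. A minimal sketch that does this with the Python standard library follows; the function name is hypothetical, and it assumes each archive holds its csv file at the top level so that the extracted file lands next to the archive.

```python
import zipfile
from pathlib import Path


def extract_zipped_csvs(inputs_dir):
    """Extract every *.csv.zip archive under inputs_dir into the archive's own folder.

    Returns the list of archive paths that were extracted.
    """
    extracted = []
    for archive in sorted(Path(inputs_dir).rglob("*.csv.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Assumption: the csv sits at the archive root, not in a subfolder.
            zf.extractall(archive.parent)
        extracted.append(archive)
    return extracted
```

Calling extract_zipped_csvs("inputs") from the repository root would cover all dataset subdirectories in one pass.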

Resources:

Paper manuscripts are provided in the 'docs' folder

Prior baselines:

snorkel labeler.ipynb: must be run in its own separate environment, set up by following the instructions at: https://github.com/snorkel-team/snorkel
build_label_files.py: profiles data, indexes tables, creates labels by probing indexes using each text
build_features.py: featurizes input data, saves features to disk to be read during training
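
The index-then-probe labeling that build_label_files.py describes can be illustrated with a toy example. The sketch below is not the repository's implementation: it assumes a plain token-level inverted index, and the function names, the input format, and the min_overlap threshold are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(value):
    """Lowercase a value and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", str(value).lower()))


def build_index(tables):
    """Build an inverted index from token to the tables containing it.

    tables: {table_name: list of rows, each row a list of cell values}
    """
    index = defaultdict(set)
    for name, rows in tables.items():
        for row in rows:
            for cell in row:
                for token in tokenize(cell):
                    index[token].add(name)
    return index


def probe(index, text, min_overlap=2):
    """Label a text with every table sharing at least min_overlap distinct tokens."""
    counts = Counter()
    for token in tokenize(text):
        for name in index.get(token, ()):
            counts[name] += 1
    return {name for name, count in counts.items() if count >= min_overlap}
```

For example, with one table containing the cells "aspirin" and "analgesic" and another containing "cox1" and "enzyme", probing with the text "Aspirin is an analgesic that inhibits COX1" labels only the first table, since the second shares a single token.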
