The COVID-19 Open Research Dataset Challenge (CORD-19) is published on Kaggle: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge. The goal is to explore the dataset and get new insights using Weaviate.
Download all json files from https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge into a folder.
$ docker run --env "WEAVIATE_URL=<weaviate-url>" semitechnologies/weaviate-demo-covid19
Please note that the Kaggle data is cached inside the Docker container.
Start a local Weaviate with the English contextionary, or a Weaviate on the cluster service (install weaviate-cli, then run $ weaviate-cli cluster-create).
$ pip3 install -r requirements.txt
$ weaviate-cli schema-import --location=schema.json
$ python3 import.py <weaviate-url> <data-folder>
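As a rough sketch of what the import step has to do with the downloaded files, the Python below reads each JSON file and pulls out the paper id, title, abstract and body text. It is not the actual import.py. The input keys follow the CORD-19 JSON layout as commonly documented, and the output keys (paperId, bodyText) and function names are hypothetical.

```python
import json
from pathlib import Path


def parse_paper(path):
    """Extract id, title, abstract and body text from one CORD-19 JSON file."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    return {
        "paperId": raw["paper_id"],  # output key names are hypothetical
        "title": raw.get("metadata", {}).get("title", ""),
        # abstract and body_text are assumed to be lists of {"text": ...} paragraphs
        "abstract": "\n".join(p["text"] for p in raw.get("abstract", [])),
        "bodyText": "\n".join(p["text"] for p in raw.get("body_text", [])),
    }


def load_papers(data_folder):
    """Parse every .json file below data_folder (searched recursively)."""
    return [parse_paper(p) for p in sorted(Path(data_folder).rglob("*.json"))]
```

Calling load_papers("<data-folder>") would then give a list of dictionaries that can be turned into Paper objects.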
Useful links:
- Learn about Weaviate GraphQL in the documentation.
- Run the queries interactively in the Weaviate Playground.
{
  Aggregate {
    Things {
      Paper {
        meta {
          count
        }
      }
    }
  }
}
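Outside the Playground, a query like the count query above can also be sent from a script. The following is a minimal sketch using only the Python standard library. The endpoint path /v1/graphql and the helper names are assumptions and may need adjusting for the Weaviate version in use.

```python
import json
import urllib.request

COUNT_QUERY = "{ Aggregate { Things { Paper { meta { count } } } } }"


def build_graphql_request(weaviate_url, query):
    """Build a POST request for the GraphQL endpoint (the path is an assumption)."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        weaviate_url.rstrip("/") + "/v1/graphql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(weaviate_url, query):
    """Send the query and return the decoded JSON response."""
    request = build_graphql_request(weaviate_url, query)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a running instance, run_query("<weaviate-url>", COUNT_QUERY) should return the JSON document that the Playground shows for the same query.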
{
  Get {
    Things {
      Paper {
        title
        Journal {
          ... on Journal {
            name
          }
        }
      }
    }
  }
}
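The result of the query above is nested the same way as the query itself. The sketch below flattens such a result into (title, journal names) pairs. It assumes the response is wrapped in a "data" key and that the Journal reference comes back as a list (or null when absent). The sample response is made up for illustration and is not real data.

```python
def paper_journals(response):
    """Return (title, [journal names]) pairs from a Get response."""
    papers = response["data"]["Get"]["Things"]["Paper"]
    rows = []
    for paper in papers:
        # Assumption: a missing reference is returned as None or left out
        journals = paper.get("Journal") or []
        rows.append((paper["title"], [j["name"] for j in journals]))
    return rows


# Hypothetical response, made up for illustration only
sample = {
    "data": {
        "Get": {
            "Things": {
                "Paper": [
                    {"title": "Paper A", "Journal": [{"name": "Journal X"}]},
                    {"title": "Paper B", "Journal": None},
                ]
            }
        }
    }
}

for title, journals in paper_journals(sample):
    print(title, journals)
```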
{
  Get {
    Things {
      Paper(
        explore: {
          concepts: ["chiroptera"] # <== basically a bat
        },
        limit: 5
      ) {
        title
        abstract
      }
    }
  }
}
{
  Get {
    Things {
      Paper(
        explore: {
          concepts: ["chiroptera"] # <== basically a bat
          moveTo: {
            concepts: ["cattle"] # <== relation to cows, bulls, oxen, or calves
            force: 0.85
          }
        },
        limit: 5
      ) {
        title
        abstract
      }
    }
  }
}
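To try other concepts or another force value without editing the query by hand, the explore queries above can be generated from parameters. This is a small sketch; the function name explore_query and its arguments are hypothetical, and it only builds the query string shown above.

```python
import json


def explore_query(concepts, move_to=None, force=None, limit=5):
    """Build a Get query on Paper with an explore filter, as a string."""
    # json.dumps produces a double-quoted list, e.g. ["chiroptera"]
    parts = ["concepts: " + json.dumps(concepts)]
    if move_to is not None:
        if force is None:
            raise ValueError("moveTo needs a force value")
        parts.append(
            "moveTo: { concepts: %s, force: %s }" % (json.dumps(move_to), force)
        )
    return (
        "{ Get { Things { Paper(explore: { %s }, limit: %d) "
        "{ title abstract } } } }" % (", ".join(parts), limit)
    )


print(explore_query(["chiroptera"], move_to=["cattle"], force=0.85))
```

The printed string can be pasted into the Playground or sent to Weaviate from a script.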
Currently, the script above imports the paper id, title, abstract and full body text of the 885 papers in the biorxiv_medrxiv collection. Next steps are:
- Add metadata (references) of the papers to Weaviate. First, separate objects for authors, institutes, journals, etc. need to be created.
- Add papers from the other collections (commercial, non-commercial, custom licence, see https://pages.semanticscholar.org/coronavirus-research)
- Let's collaborate to make something great
- I did not pay full attention to the vectorization settings in the schema. For search and classification these may need better values (better to check now than later).
- Used some code from this kernel: https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv
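As a starting point for the first next step (separate objects for authors and institutes), the sketch below collects unique author and institution names from already parsed CORD-19 JSON documents. It assumes each document has a metadata.authors list whose entries carry first, last and affiliation.institution fields. The function name is hypothetical, and creating the actual Weaviate objects is left out.

```python
def collect_authors_and_institutes(papers):
    """Return sorted unique author names and institution names."""
    authors, institutes = set(), set()
    for paper in papers:
        for author in paper.get("metadata", {}).get("authors", []):
            # Join first and last name, skipping empty parts
            name_parts = (author.get("first"), author.get("last"))
            name = " ".join(part for part in name_parts if part)
            if name:
                authors.add(name)
            # Assumption: affiliation may be missing or an empty object
            institution = (author.get("affiliation") or {}).get("institution")
            if institution:
                institutes.add(institution)
    return sorted(authors), sorted(institutes)
```

Deduplicating by plain name is a simplification: two different authors with the same name would be merged, so a real import may need a better key.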