CORD-19 Weaviate

The COVID-19 Open Research Dataset Challenge (CORD-19) is published on Kaggle: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge. The goal of this repository is to explore the dataset and gain new insights using Weaviate.

How to get started

Download the papers

Download all JSON files from https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge into a local folder.
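
Before importing, it can help to confirm that the download landed where you expect. The following is a minimal sketch, not part of this repository: the helper name find_json_files is hypothetical, and it assumes the JSON files may sit in subfolders of the data folder.

```python
from pathlib import Path


def find_json_files(data_folder):
    """Return all .json files under data_folder, sorted for a stable order."""
    root = Path(data_folder)
    if not root.is_dir():
        raise NotADirectoryError(f"{data_folder} is not a folder")
    # rglob also descends into subfolders, in case the archive
    # was unpacked with its directory tree intact.
    return sorted(root.rglob("*.json"))
```

Calling len(find_json_files("<data-folder>")) gives a quick count of the papers that are available for import.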

Run with Docker

docker run --env "WEAVIATE_URL=<weaviate-url>" semitechnologies/weaviate-demo-covid19

Please note that the Kaggle data is cached inside the Docker container.

Execute the import

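Start a local English-language Weaviate instance, or create one on the Weaviate cluster service (install weaviate-cli, then run $ weaviate-cli cluster-create). Then install the dependencies, import the schema, and run the import script: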

$ pip3 install -r requirements.txt
$ weaviate-cli schema-import --location=schema.json
$ python3 import.py <weaviate-url> <data-folder>
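
To illustrate what the import has to do with each file, here is a rough sketch of flattening one CORD-19 JSON document into the fields this README mentions (paper ID, title, abstract, body text). It is not the code in import.py. It assumes the dataset layout with top-level paper_id, metadata.title, abstract and body_text keys, where the last two are lists of sections with a text field. The output keys paperId and body are hypothetical names; title and abstract match the properties used in the example queries.

```python
def join_sections(sections):
    """Concatenate the 'text' field of each section into one string."""
    return "\n\n".join(s.get("text", "") for s in sections or [])


def paper_to_object(raw):
    """Flatten one parsed CORD-19 JSON document into a flat dict.

    The keys 'paperId' and 'body' are illustrative names, not
    necessarily the property names in schema.json.
    """
    return {
        "paperId": raw.get("paper_id"),
        "title": raw.get("metadata", {}).get("title", ""),
        "abstract": join_sections(raw.get("abstract")),
        "body": join_sections(raw.get("body_text")),
    }
```

A file would be read with json.load and passed to paper_to_object before the result is sent to Weaviate.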

Example Queries

Useful links:

Learn about Weaviate GraphQL in the documentation.
Run the queries interactively in the Weaviate Playground.

Count all papers

{
  Aggregate{
    Things{
      Paper{
        meta {
          count
        }
      }
    }
  }
}
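
The queries in this section can also be sent from a script instead of the Playground. The sketch below uses only the Python standard library and assumes that Weaviate accepts GraphQL as a JSON POST on the /v1/graphql path; the helper names are hypothetical and no authentication is handled.

```python
import json
import urllib.request

COUNT_QUERY = "{ Aggregate { Things { Paper { meta { count } } } } }"


def build_graphql_request(weaviate_url, query):
    """Build a POST request carrying a GraphQL query as JSON."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        weaviate_url.rstrip("/") + "/v1/graphql",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(weaviate_url, query):
    """Send the query and return the decoded JSON response."""
    request = build_graphql_request(weaviate_url, query)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, run_query("<weaviate-url>", COUNT_QUERY) would return the aggregate result as a Python dict.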

Get a paper with a graph reference

{
  Get {
    Things {
      Paper {
        title
        Journal {
          ... on Journal {
            name
          }
        }
      }
    }
  }
}
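
A GraphQL response mirrors the shape of the query, so the result of the query above can be unpacked as sketched here. This assumes the Journal reference comes back as a list of objects (or is missing when a paper has no journal); the function name is hypothetical.

```python
def titles_with_journals(response):
    """Return (title, [journal names]) pairs from a Get/Things/Paper response."""
    papers = response["data"]["Get"]["Things"]["Paper"]
    pairs = []
    for paper in papers:
        # A paper may have no journal reference at all.
        refs = paper.get("Journal") or []
        pairs.append((paper.get("title"), [ref.get("name") for ref in refs]))
    return pairs
```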

Search for the concept of chiroptera

{
  Get {
    Things {
      Paper(
        explore:{
          concepts: ["chiroptera"] # <== basically a bat
        },
        limit: 5
      ) {
        title
        abstract
      }
    }
  }
}

Search for the concept of chiroptera in relation to cattle

{
  Get {
    Things {
      Paper(
        explore:{
          concepts: ["chiroptera"] # <== basically a bat
          moveTo: {
            concepts: ["cattle"] # <== relation to cows, bulls, oxen, or calves
            force: 0.85
          }
        },
        limit: 5
      ) {
        title
        abstract
      }
    }
  }
}
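
The two concept searches differ only in the optional moveTo clause, so they can be generated from one template. The following sketch builds the query text for either case; the function and parameter names are hypothetical, and the defaults simply repeat the values used in the examples above.

```python
import json


def explore_query(concept, move_to=None, force=0.85, limit=5):
    """Build a Get query on Paper with an explore filter.

    If move_to is given, the search is moved toward that concept
    with the given force.
    """
    # json.dumps adds the double quotes and escapes the concept string.
    explore = f"concepts: [{json.dumps(concept)}]"
    if move_to is not None:
        explore += (
            f" moveTo: {{ concepts: [{json.dumps(move_to)}] force: {force} }}"
        )
    return (
        "{ Get { Things { Paper("
        f"explore: {{ {explore} }}, limit: {limit}"
        ") { title abstract } } } }"
    )
```

explore_query("chiroptera") corresponds to the first search, and explore_query("chiroptera", move_to="cattle") to the second.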

Status

Currently, the script above imports the paper ID, title, abstract, and full body text of 885 papers from the biorxiv_medrxiv collection. Next steps are:

Add metadata (references) of the papers to Weaviate. This first requires creating separate objects for authors, institutes, journals, etc.
Add papers from the other collections (commercial, non-commercial, custom license; see https://pages.semanticscholar.org/coronavirus-research).

Notes

Let's collaborate to make something great
I did not pay close attention to the vectorization settings in the schema. They may need tuning before the data is used for search and classification, and it is better to check this now than later.

References

Used some code from this kernel: https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv
