weaviate/biggraph-wikidata-search-with-weaviate

PyTorch-BigGraph Wikidata search with the Weaviate vector search engine

PyTorch-BigGraph (PBG) is a project by Facebook Research, described as a "distributed system for learning graph embeddings for large graphs". It is based on the paper PyTorch-BigGraph: A Large-scale Graph Embedding Framework. As an example dataset, the authors trained a PBG model on the full Wikidata graph.

In this repository, you'll find a guide on how to import the complete Wikidata PBG model into Weaviate and search through the entire dataset in < 50 milliseconds (excluding internet latency). The demo GraphQL queries below include both pure vector searches and queries that mix scalar and vector search.

If you like what you see, a ⭐ on the Weaviate GitHub repo or joining our Slack is appreciated.

Additional links:

Acknowledgments

Stats

Description              Value
Data objects imported    78,404,883
Machine                  16 CPU, 128 GB memory
Weaviate version         v1.8.0-rc.2
Dataset size             125 GB

Note:

  • This dataset is indexed on a single Weaviate node to show the capabilities of a single Weaviate instance. You can also set up a Weaviate Kubernetes cluster and import the complete dataset in that way.

Import

You can import the data yourself in two ways: by running the Python script included in this repo or by restoring a Weaviate backup (this is the fastest!).

Import using Python from source

$ wget https://dl.fbaipublicfiles.com/torchbiggraph/wikidata_translation_v1.tsv.gz
$ gzip -d wikidata_translation_v1.tsv.gz
$ pip3 install -r requirements.txt
$ docker-compose up -d
$ python3 import.py

The import takes a few hours, so you probably want to run it in the background, for example:

$ nohup python3 -u import.py &

Note:

  • The script assumes that the tsv file is called: wikidata_translation_v1.tsv
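
As a rough illustration of what an import has to do with each row of the file, the sketch below splits one TSV line into a key and a float vector. The row layout (a key followed by tab-separated floats) and the helper name `parse_row` are assumptions made for illustration; the import.py in this repo may handle the file differently.

```python
from typing import List, Tuple


def parse_row(line: str) -> Tuple[str, List[float]]:
    """Split one TSV row into (key, vector).

    Assumes the row is a key followed by tab-separated floats; this is
    an assumption about the file layout, not taken from import.py.
    """
    key, *values = line.rstrip("\n").split("\t")
    if not values:
        # No vector components, e.g. a header or malformed line.
        raise ValueError(f"row has no vector components: {key!r}")
    return key, [float(v) for v in values]


if __name__ == "__main__":
    sample = "<http://www.wikidata.org/entity/Q1>\t0.25\t-0.5\t1.0\n"
    print(parse_row(sample))
```

Rows without any vector components raise a `ValueError`, so a caller can decide whether to skip or abort on such lines.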

Restore as Weaviate backup

You can download a backup and restore it. This is by far the fastest way to get the dataset up and running ⁉️

# clone this repository
$ git clone https://github.com/semi-technologies/biggraph-wikidata-search-with-weaviate
# download the Weaviate backup
$ curl https://storage.googleapis.com/semi-technologies-public-data/weaviate-1.8.0-rc.2-backup-wikipedia-pytorch-biggraph.tar.gz -O
# untar the backup (125G unpacked)
$ tar -xvzf weaviate-1.8.0-rc.2-backup-wikipedia-pytorch-biggraph.tar.gz
# print the path of the unpacked directory
$ echo $(pwd)/var/weaviate
# use the above result (e.g., /home/foobar/weaviate-disk/var/weaviate)
#   update the volumes entry in docker-compose.yml (NOT PERSISTENCE_DATA_PATH!) so that the host side points to the above output
#   (e.g., under volumes: - '/home/foobar/weaviate-disk/var/weaviate:/var/lib/weaviate')
#   With 16 CPUs, the restore takes about 12 to 15 minutes
# start the container
$ docker-compose up -d

Notes:

  • Weaviate needs some time to restore the backup. You can follow the status of the import in the Docker logs. For more verbose output, add LOG_LEVEL: 'debug' to docker-compose.yml.
  • This setup was tested with Ubuntu 20.04.3 LTS and the Weaviate version in the included docker-compose.yml file.
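
Instead of watching the logs, you could poll Weaviate's readiness endpoint until the instance answers. The sketch below is a minimal version under a few assumptions: Weaviate is reachable on localhost:8080, and the instance only reports ready once the restored data has been loaded (not verified for this backup). The helper names `http_ready` and `wait_until_ready` are made up for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Assumed local address; adjust to the port mapped in docker-compose.yml.
READY_URL = "http://localhost:8080/v1/.well-known/ready"


def http_ready(url: str = READY_URL) -> bool:
    """Return True when the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def wait_until_ready(probe: Callable[[], bool] = http_ready,
                     timeout_s: float = 1800.0,
                     interval_s: float = 10.0) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` after `docker-compose up -d` blocks for up to 30 minutes by default and returns `False` if the instance never became ready in that time.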

Example queries

Finding Stanley

##
# The one and only Stanley Kubrick 🚀⬛🐒
##
{
  Get {
    Entity(
      nearObject: {id: "7392bc9d-a3c0-4738-9d25-a473245971c5", certainty: 0.75}
      limit: 24
    ) {
      url
      _additional {
        id
        certainty
      }
    }
    Label(nearObject: {id: "7392bc9d-a3c0-4738-9d25-a473245971c5", certainty: 0.8}) {
      content
      language
      _additional {
        id
        certainty
      }
    }
  }
}
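
If you prefer to run such a query from a script rather than a GraphQL console, the sketch below builds a `nearObject` query for a single class and posts it to Weaviate's `/v1/graphql` endpoint using only the Python standard library. The address localhost:8080 and the helper names `near_object_query` and `run_query` are assumptions for illustration.

```python
import json
import urllib.request

# Assumed local endpoint; host and port depend on your docker-compose.yml.
GRAPHQL_URL = "http://localhost:8080/v1/graphql"


def near_object_query(class_name: str, object_id: str,
                      certainty: float, limit: int, fields: list) -> str:
    """Build a Get query with a nearObject filter for one class."""
    field_lines = "\n      ".join(fields)
    return (
        "{\n"
        "  Get {\n"
        f"    {class_name}(\n"
        f'      nearObject: {{id: "{object_id}", certainty: {certainty}}}\n'
        f"      limit: {limit}\n"
        "    ) {\n"
        f"      {field_lines}\n"
        "      _additional { id certainty }\n"
        "    }\n"
        "  }\n"
        "}"
    )


def run_query(query: str, url: str = GRAPHQL_URL) -> dict:
    """POST the query as JSON and return the decoded response."""
    body = json.dumps({"query": query}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

For example, `run_query(near_object_query("Entity", "7392bc9d-a3c0-4738-9d25-a473245971c5", 0.75, 24, ["url"]))` would send the Entity part of the query above to a running instance.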

Show those vectors!

##
# Na na na na na na na na na na na na na na na na... BATMAN! 🦇
##
{
  Get {
    Entity(
      nearObject: {id: "72784488-d8a9-4fa5-8c5c-208465a31fe2", certainty: 0.75}
      limit: 3
    ) {
      url
      _additional {
        id
        certainty
        vector
      }
    }
  }
}
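
Once you have the raw vectors, you can compare them yourself. The sketch below computes the cosine similarity of two vectors and maps it to a certainty value with certainty = (1 + cosine) / 2. That mapping is an assumption about how Weaviate derives certainty for cosine distance, so treat the numbers as an approximation rather than a reproduction of the engine's output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def certainty(a: Sequence[float], b: Sequence[float]) -> float:
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed mapping)."""
    return (1.0 + cosine_similarity(a, b)) / 2.0
```

Under this mapping, identical directions give a certainty of 1, orthogonal vectors give 0.5, and opposite directions give 0.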
