Advanced similarity and duplicate source code at scale.

Gemini

Find similar code in Git repositories

Gemini is a tool for finding similar 'items' in source code repositories. The supported granularity levels (items) are:

  • repositories (TBD)
  • files
  • functions


./hash   <path-to-repos-or-siva-files>
./query  <path-to-file>

If you run Gemini in Docker, prefix each command with docker-compose exec gemini. See below for how to start Gemini in Docker or in standalone mode.


To pre-process a number of repositories so that duplicates can be found quickly, run:

./hash ./src/test/resources/siva

The input format of the repositories is the same as in src-d/Engine.

To pre-process repositories for similar-function search, run:

./hash -m func ./src/test/resources/siva


To find all duplicates of a single file, run:

./query <path-to-single-file>

To find functions similar to any function defined in a file, run:

./query -m func <path-to-single-file>

If you are interested in similarities for only one function defined in the file, run:

./query -m func <path-to-single-file>:<function name>:<line number where the function is defined>
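The single-function target above is a colon-separated triple, so it is easy to assemble in a script. The sketch below shows one way to do that; func_target is a hypothetical helper (not part of Gemini), and the file path, function name and line number in the usage comment are made up.

```shell
# func_target FILE NAME LINE
# Hypothetical helper: prints the FILE:NAME:LINE target used with ./query -m func.
func_target() {
  file=$1
  name=$2
  line=$3
  # Reject an empty or non-numeric line number before building the target.
  case $line in
    ''|*[!0-9]*)
      echo "line must be a number: $line" >&2
      return 1
      ;;
  esac
  printf '%s:%s:%s\n' "$file" "$name" "$line"
}

# Usage (illustrative values only):
#   ./query -m func "$(func_target ./consumer.go NewConsumer 42)"
```

Quoting the command substitution keeps the target a single argument even if the path contains spaces.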


To find all duplicate files and similar functions in all repositories, run:

./report

All repositories must be hashed beforehand, and a community detection library must be installed.



Start containers:

docker-compose up -d

The local directories repositories and query are available inside the container as /repositories and /query.


docker-compose exec gemini ./hash /repositories
docker-compose exec gemini ./query /query/consumer.go
docker-compose exec gemini ./report
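Typing the docker-compose exec gemini prefix for every command gets repetitive, so a small shell wrapper can help. This is a minimal sketch: gemini_exec and the COMPOSE variable are hypothetical names and are not shipped with Gemini.

```shell
# gemini_exec CMD [ARGS...]
# Hypothetical wrapper: runs CMD inside the "gemini" service container.
gemini_exec() {
  # COMPOSE is an assumed override hook, e.g. COMPOSE="docker compose"
  # on installations that use the Compose plugin instead of docker-compose.
  ${COMPOSE:-docker-compose} exec gemini "$@"
}

# Usage, mirroring the commands above:
#   gemini_exec ./hash /repositories
#   gemini_exec ./query /query/consumer.go
#   gemini_exec ./report
```

The wrapper forwards its arguments unchanged, so any flag accepted by hash, query or report can be passed through it.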


You will need:

  • JVM 1.8
  • Apache Cassandra or ScyllaDB
  • Apache Spark
  • Python 3
  • Bblfshd v2.5.0+

By default, all commands use:

  • an Apache Cassandra or ScyllaDB instance available at localhost:9042
  • Apache Spark, available through $SPARK_HOME
# save some repos in .siva files using Borges
echo -e "\n" > repo-list.txt

# get Borges from
borges pack --loglevel=debug --workers=2 --to=./repos -f repo-list.txt

# start Apache Cassandra
docker run -p 9042:9042 \
  --name cassandra -d rinscy/cassandra:3.11

# or ScyllaDB, with a workaround
docker run -p 9042:9042 --volume $(pwd)/scylla:/var/lib/scylla \
  --name some-scylla -d scylladb/scylla:2.0.0 \
  --broadcast-address --listen-address --broadcast-rpc-address \
  --memory 2G --smp 1

# to get access to DB for development
docker exec -it some-scylla cqlsh
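Because the commands look for Spark through $SPARK_HOME by default, it can be worth checking that variable before hashing. The sketch below is one possible pre-flight check; check_spark_home is a hypothetical helper, and looking for bin/spark-submit assumes a standard Spark distribution layout rather than anything Gemini documents.

```shell
# check_spark_home
# Hypothetical helper: verifies that SPARK_HOME points at a Spark install.
check_spark_home() {
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  # Assumption: a standard Spark distribution ships bin/spark-submit.
  if [ ! -x "$SPARK_HOME/bin/spark-submit" ]; then
    echo "no executable spark-submit under $SPARK_HOME/bin" >&2
    return 1
  fi
  echo "using Spark from $SPARK_HOME"
}

# Usage:
#   check_spark_home && ./hash ./src/test/resources/siva
```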

External Apache Spark cluster

Set the URL of the Spark master through the MASTER environment variable:

MASTER="spark://<spark-master-url>" ./hash <path>

CLI arguments

All three commands accept parameters for database connection and logging:

  • -h/--host - cassandra/scylla db hostname, default
  • -p/--port - cassandra/scylla db port, default 9042
  • -k/--keyspace - cassandra/scylla db keyspace, default hashes
  • -v/--verbose - produce more verbose output, default false

The query and hash commands also accept parameters for configuring bblfsh and the features extractor:

  • -m/--mode - similarity modes: file or function, default file
  • --bblfsh-host - babelfish server host, default
  • --bblfsh-port - babelfish server port, default 9432
  • --features-extractor-host - features-extractor host, default
  • --features-extractor-port - features-extractor port, default 9001

Hash command specific arguments:

  • -l/--limit - limit the number of repositories to be processed. All repositories will be processed by default
  • -f/--format - format of the stored repositories: siva, bare or standard, default siva
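To see how the database and hash-specific flags combine, the sketch below assembles a ./hash command line and prints it without running anything. hash_cmd and the GEMINI_* variables are hypothetical names introduced for illustration; only the flags themselves come from the lists above, and localhost:9042 is taken from the stated default database location.

```shell
# hash_cmd REPOS_PATH
# Hypothetical helper: prints a ./hash invocation built from optional
# GEMINI_* variables (assumed names, not read by Gemini itself).
hash_cmd() {
  repos=$1
  cmd="./hash --host ${GEMINI_DB_HOST:-localhost} --port ${GEMINI_DB_PORT:-9042}"
  cmd="$cmd --keyspace ${GEMINI_KEYSPACE:-hashes} --format ${GEMINI_FORMAT:-siva}"
  # Without --limit, hash processes every repository it finds.
  if [ -n "${GEMINI_LIMIT:-}" ]; then
    cmd="$cmd --limit $GEMINI_LIMIT"
  fi
  printf '%s %s\n' "$cmd" "$repos"
}

# Usage (dry run; inspect the line before executing it yourself):
#   GEMINI_FORMAT=bare GEMINI_LIMIT=3 hash_cmd ./repos
```

Printing the command instead of executing it makes it easy to review the flags first, which is useful when pointing at a shared Cassandra or ScyllaDB instance.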


Compile & Run

If the env var DEV is set, ./sbt is used to compile and run all non-Spark commands: ./query and ./report. This is convenient for local development: with no separate "compile" step, the dev workflow feels similar to working with an interpreted language.


To build the final .jars for all commands, run:

./sbt assemblyPackageDependency
./sbt assembly

Instead of one fat jar we build two, separating the dependencies from the application code so that simple changes rebuild faster.


To run tests, that rely

./sbt test

Re-generate code

The latest generated gRPC code is already checked in under src/main/scala/tech/sourced/featurext. If you update any of the src/main/proto/*.proto files, you need to regenerate the gRPC code for the Feature Extractors:


To generate new protobuf messages fixtures for tests, you may use bblfsh-sdk-tools:

bblfsh-sdk-tools fixtures -p .proto -l <LANG> <path-to-source-code-file>


Copyright (C) 2018 source{d}. This project is licensed under the GNU General Public License v3.0.