Skip to content
This repository has been archived by the owner on Nov 20, 2023. It is now read-only.

wikimedia/research-datasets

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

# archived - see https://gitlab.wikimedia.org/repos/research

# research repository

This repository serves multiple purposes
- library for sharing and reusing code for research at wmf. E.g. pipeline transformations for
    - graphs revision graph, diffs 
    - redirect resolution, link graph between articles
    - topic modelling
    - ...
- datasets. Pipelines that create datasets as output
- tooling for running orchestrated machine learning pipelines in the data engineering infrastructure
    - running airflow dags
    - gpu training on yarn cluster
    - custom workloads using skein 


Instead of splitting different functionality into different repos prematurely, this will be a mono repo for now. 

## Library

To use this package, add it as a requirement to your setup.py or pip install it into a given python evironment.

`pip install "wmf_research @ git+https://github.com/wikimedia/research-datasets.git@master"`


### Use interactively

todo. describe interactive ipython notebook 

## Datasets

### covid19

datasets about the covid19 pandemic that the wmf released publicly

### commons images

download images uploaded to wikipedia commons from swift and make the available on hdfs for machine learning use cases

About

No description, website, or topics provided.

Resources

Code of conduct

Stars

Watchers

Forks

Languages