Procedural Map Generation
This is the code behind our map of Wikipedia, which maps Wikipedia articles into an imaginary geographical space based on their relatedness to each other. If you want to be able to generate your own maps, read on!
- These instructions are written for macOS. They should port fairly easily to Linux or Windows with the appropriate command-line adjustments, but we haven't tested them on those systems yet, so some things might be different.
- You'll need your own vectorized dataset and some kind of popularity/weighting measure for it (for Wikipedia, ours was a combination of page views and page rank). Unfortunately, we can't publish our Wikipedia dataset: at ~10GB, it doesn't fit on GitHub :( If you'd like to play around with the full Wikipedia dataset, open an issue and we'll do our best to get you set up.
- Python 2.7, which you can install here if you don't already have it, and your text editor of choice.
- Docker, which you can install here
- PyCharm, which you can install here
Fork and clone the repo:
git clone https://github.com/shilad/cartograph
cd cartograph/
Getting your data set up
As long as your data are in the proper format, the pipeline should be able to handle them just fine. Unfortunately, it's a pretty specific format, so be careful.
The basics: Your data need to be vectorized (think word2vec) and have some kind of popularity/weighting score attached to each individual data point. They'll be stored in TSV (tab-separated values) files, which are pretty easy to create if you don't have them already. (This may change to CSV files as we move to processing in pandas.)
Files you need
The files you need are listed below in a fixed order, so that you can refer back to them later.
- A file of all your vectors, one per line, with individual numbers separated by tabs. The first line of this should just be a list of numbers from 1 to the length of your vectors (ours goes from 1 to 100). The line numbers in this file will eventually become the unique id numbers for each vector/item (id numbers are arbitrary and meaningless, but help with data tracking and lookup). Here are the first two lines of our vecs file so you can see:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
-0.07138634 -0.08672774 -0.013411164 0.07870448 0.04251623 0.11026025 0.008073568 0.044899702 0.05322492 -0.13140923 0.077965975 -0.12912643 0.073434114 -0.053325653 -0.09941709 0.08749974 0.060241103 -0.124527335 -0.014015436 0.033687353 0.1030426 -0.012437582 0.004548788 -0.061617494 0.1483283 0.0057908297 -0.13584042 0.15024185 -0.037984252 -0.12157321 0.09136045 0.020964265 0.016398907 -0.081725836 -0.036406517 -0.15739226 -0.05564201 0.056917787 0.06571305 0.2697841 0.19257343 0.11049521 0.06296027 0.01737237 0.083135724 -0.06151676 -0.22469974 -0.14886552 0.05225146 -0.060946107 0.049633026 -0.15782869 -0.12573588 -0.015895367 0.012135506 0.043959737 0.03600359 0.034795165 -0.04003203 -0.007905722 0.08175945 0.06722367 -0.017473102 0.009483457 0.04708171 0.16971242 0.08944678 -0.0101213455 0.055675507 0.05460131 0.17296863 0.19190204 -0.13852596 -0.114959955 -0.03523159 -0.014250398 0.114590645 0.0142169 -0.04355693 -0.19052565 0.07115126 -0.28525978 -0.027846217 -0.07007706 -0.14977187 -0.022709966 0.056917787 -0.025865555 0.06698871 0.09790647 -0.0046830177 0.11667204 0.03761494 -0.0047165155 0.17921269 0.07742882 -0.14728767 -0.14275575 -0.073064804 0.14440072
- A file of all your names/titles, one per line. There should be two columns, one called "index" and one called "name". "Index" should correspond to the line number of the vectors file (i.e. they should be in the same order), and "name" is the title of your item/point. Again, everything should be tab-separated. Here are the first few lines of our names file:
index name
1 Kat Graham
2 Cedi Osman
3 List of Chicago Bulls seasons
4 Alabama
5 An American in Paris
- A file of all your popularities/weights, one per line. This should have two tab-separated columns: the first contains the name/title of the point, and the second contains the score. This file does not need a header row. It also does not need to be in the same order as the other two. Here are a few lines from our popularity file, which measures the popularity of Wikipedia articles using a weighted combination of page views and page rank:
2014–15 West Ham United F.C. season 6.038550978007928E-6
The Return of Superman (TV series) 8.09840661109975E-5
Bethany Mota 9.359317492029514E-5
Elfrid Payton (basketball) 1.3807395228814057E-5
Dothraki language 5.198423874878717E-5
List of edible molluscs 4.500109212570877E-6
- (OPTIONAL) Region names. This is probably something you'll add after you see your data for the first time. It's the file that creates the region labels ("Physics & Maths", "Politics & Geography", etc.) on our map. It has a header row with two columns, "cluster_id" and "label", and is tab-separated like all of our data files. If you leave this file out, the code will just number your regions on the map.
cluster_id label
0 India
1 History & Geography
2 Politics & Economics
3 Physics & Maths
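If your titles, vectors, and scores already live in Python, the three required files could be written with something like the sketch below. It is only an illustration and not part of Cartograph: write_dataset and its argument names are made up for this example, and it assumes your names are text strings that contain no tabs or newlines. It numbers names from 1, as in the sample names file above.

```python
import io


def write_dataset(items, vecs_path, names_path, popularity_path):
    """Write the vectors, names, and popularity files in tab-separated form.

    `items` is a list of (name, vector, popularity) tuples. This helper is
    hypothetical; it only mirrors the file layouts described in this README.
    """
    if not items:
        raise ValueError("need at least one item")
    dim = len(items[0][1])
    # Validate everything up front so no half-written files are left behind.
    for name, vector, _popularity in items:
        if len(vector) != dim:
            raise ValueError("all vectors must have the same length")
        if u"\t" in name or u"\n" in name:
            raise ValueError("names must not contain tabs or newlines")

    with io.open(vecs_path, "w", encoding="utf-8") as vecs, \
            io.open(names_path, "w", encoding="utf-8") as names, \
            io.open(popularity_path, "w", encoding="utf-8") as pops:
        # Vectors file: the first line lists the column numbers 1..dim.
        vecs.write(u"\t".join(u"%d" % i for i in range(1, dim + 1)) + u"\n")
        # Names file: header row, then one "index<TAB>name" row per item.
        names.write(u"index\tname\n")
        for index, (name, vector, popularity) in enumerate(items, start=1):
            vecs.write(u"\t".join(u"%r" % float(x) for x in vector) + u"\n")
            names.write(u"%d\t%s\n" % (index, name))
            # Popularity file: no header, "name<TAB>score" per line.
            pops.write(u"%s\t%r\n" % (name, float(popularity)))
```

Writing all three files in one pass keeps the vectors and names in the same order, which is the part of the format that is easiest to get wrong.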
This is the top of data/conf/defaultconfig.txt, which you'll need to edit to correspond to your data files. You'll also have to create a couple of directories for your files to live in.
[DEFAULT]
dataset: dev_en
baseDir: ./data/%(dataset)s
generatedDir: %(baseDir)s/tsv
mapDir: %(baseDir)s/maps
geojsonDir: %(baseDir)s/geojson
externalDir: ./data/labdata/%(dataset)s

[ExternalFiles]
vecs_with_id: %(externalDir)s/vecs.tsv
names_with_id: %(externalDir)s/names.tsv
popularity: %(externalDir)s/popularity.tsv
article_embedding = %(externalDir)s/tsne_cache.tsv
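The %(dataset)s and %(externalDir)s placeholders look like the interpolation syntax of Python's standard ConfigParser module. Assuming that is how the file is read (we are not claiming this is exactly what Cartograph does internally), the sketch below shows how the placeholders expand into real paths. resolve_external_files is a hypothetical helper written for this example.

```python
try:
    import configparser                   # Python 3
except ImportError:
    import ConfigParser as configparser   # Python 2.7


def resolve_external_files(conf_path):
    """Return {option: expanded path} for the [ExternalFiles] section.

    Hypothetical helper for illustration only. Note that ConfigParser
    lowercases option names, so baseDir is seen as "basedir".
    """
    parser = configparser.ConfigParser()
    if not parser.read(conf_path):
        raise IOError("could not read %s" % conf_path)
    defaults = parser.defaults()
    # items() also returns everything inherited from [DEFAULT]; drop those
    # so only the options defined for the external files remain.
    return dict((key, value)
                for key, value in parser.items("ExternalFiles")
                if key not in defaults)
```

With the config shown above, the vecs_with_id entry would expand to ./data/labdata/dev_en/vecs.tsv, which is a quick way to check that changing the dataset line points everything at your own folder.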
Directory setup and config
The block shown above is the only part of data/conf/defaultconfig.txt you should need to change. The rest is either generated from your data or constant.
First up, the directories.
- In data/labdata, create a new folder and call it something relevant to your dataset. Ours is called dev_en (development English).
- Put all your data files from the Files you need section into this folder
- Change the "dataset" line in the config file to point to your folder rather than dev_en
- One more conf file to create: make a file called conf.txt and put it in the base directory (cartograph). It doesn't need to do anything, but it won't work if it's blank, so just add an arbitrary heading like so:
Next, the external files. They're pretty easy: they go in the order described above under Files you need. Just put the name of your file after the last / in the filepath. For example, if you called your vectors file myvectors.tsv, you would change the vecs_with_id line to read
vecs_with_id: %(externalDir)s/myvectors.tsv
Do the same for the other files. If you don't have the last two, it's okay to leave them - the pipeline is set up to catch that and work around it.
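Before kicking off the pipeline, it can be handy to see which of the configured files are actually on disk. Here is a minimal sketch, assuming you already have the config keys and their expanded paths in a dict. find_missing_files is a hypothetical helper, not something Cartograph provides.

```python
import os


def find_missing_files(external_files):
    """Return the sorted config keys whose paths are not existing files.

    `external_files` maps a config key (e.g. "popularity") to a file path.
    Hypothetical helper for illustration only.
    """
    return sorted(key for key, path in external_files.items()
                  if not os.path.isfile(os.path.expanduser(path)))
```

If the only keys it reports are for files you meant to leave out, you are good to go. Anything else is probably a typo in the config.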
If you'd just like to create our example map of Simple English Wikipedia, you can also download the data files for that map from the Cartograph website (or at least, you should be able to soon).
Docker: Dependencies galore!
Docker will take care of installing a bunch of dependencies in order to get you started. Most of them are pretty quick, with the exception of pygsp, which takes about half an hour - ample time to go get a snack or catch some Pokémon or something.
The actual location of the cartograph repo (and even its name) will depend on where it lives in your file system, so replace ~/PycharmProjects/cartograph in the command below with the path to your own copy. For us, the Cartograph directory contains a /data/ext/simple subdirectory that itself contains our data files.
docker run -ti -v ~/PycharmProjects/cartograph:/testvoldir mjulstro/cartorepo:version4
Run the pipeline!
This runs a Luigi script that works through workflow.py, checking whether any tasks have already been completed and skipping those (so you don't have to rerun the clustering algorithm, denoising, etc. every time). It will automatically update if code marked as required has been changed. The end product of this script is an XML file that represents the map. As with the previous command, what comes after the --conf flag will depend on your filesystem and where your data are stored.
./build.sh --conf ./data/conf/summer2017_simple.txt
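The skip-what's-already-done behaviour can be pictured with a make-style freshness check: rebuild an output only if it is missing or older than the things it depends on. The sketch below is a plain-Python illustration of that idea. It is not Luigi's API and not the logic in workflow.py, and needs_rebuild and run_task are names invented for this example.

```python
import os


def needs_rebuild(output_path, dependency_paths):
    """True if the output is missing or older than any of its dependencies."""
    if not os.path.exists(output_path):
        return True
    built_at = os.path.getmtime(output_path)
    return any(os.path.getmtime(dep) > built_at for dep in dependency_paths)


def run_task(output_path, dependency_paths, build):
    """Call build(output_path) only when the output is stale.

    Returns True if the task ran and False if it was skipped.
    """
    if needs_rebuild(output_path, dependency_paths):
        build(output_path)
        return True
    return False
```

Here the dependencies could be input data files or the source files a step relies on, so touching either one makes the step run again the next time around.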
Run the server!
The last step is to run the TileStache server, which takes your map XML and turns it into tiles that can then be served. This step also handles setting up things like the search function. For the last part of this command, just enter the same thing you entered after --conf in the previous command.
python cartograph/server/app2.py ./data/conf/summer2017_simple.txt