Data Pixels

The map is powered by pre-calculated pixels, not coordinates: geographic coordinates are projected to canvas pixel positions ahead of time, so the client doesn't have to do any projection work.

Setting up the tables

These tables only need to be created once...

CREATE TABLE IF NOT EXISTS addshore.wikidata_map_item_pixels (
    `id` string,
    `posx` int,
    `posy` int
)
PARTITIONED BY (
  `snapshot` string COMMENT 'Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';

CREATE TABLE IF NOT EXISTS addshore.wikidata_map_item_relation_pixels (
    `forId` string,
    `posx1` int,
    `posy1` int,
    `posx2` int,
    `posy2` int
)
PARTITIONED BY (
  `snapshot` string COMMENT 'Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
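As a quick optional sanity check (assuming you have access to the addshore database), you can confirm the tables exist and inspect their partitions:

SHOW CREATE TABLE addshore.wikidata_map_item_pixels;
SHOW PARTITIONS addshore.wikidata_map_item_pixels;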

Generating data

Setup

spark2-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64

You can read more about the WMF Spark setup on Wikitech.

Selecting the snapshot & table settings

SET hivevar:WIKIDATA_MAP_SNAPSHOT='2021-10-18';

-- Set the table variables to ONE of the following pairs.
-- Modern flow tables:
SET hivevar:WIKIDATA_MAP_ITEM_COORD_TABLE=addshore.wikidata_map_item_coordinates;
SET hivevar:WIKIDATA_MAP_ITEM_RELATION_TABLE=addshore.wikidata_map_item_relations;
-- Old backfill tables:
SET hivevar:WIKIDATA_MAP_ITEM_COORD_TABLE=addshore.wikidata_map_item_coordinates_old_backfill_text;
SET hivevar:WIKIDATA_MAP_ITEM_RELATION_TABLE=addshore.wikidata_map_item_relations_old_backfill_text;

You also need to enable non-strict dynamic partitioning, as the INSERT statements below take the snapshot partition value from the SELECT rather than declaring it statically:

SET hive.exec.dynamic.partition.mode=nonstrict;
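You can check that variable substitution is working before running anything heavy (a trivial optional check, assuming the variables above are set):

SELECT ${WIKIDATA_MAP_SNAPSHOT} AS snapshot;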

Calculate pixel locations

From here we want to calculate pixel locations for a canvas, to avoid doing any computation on the client.

The primary target canvas size is 1920 x 1080.

The old "huge" map rendered at 8000 x 4000.

In order to get to a similar quality we will multiply the target size by 4, to 7680 x 4320.

  • TODO add ids to the pixel entries, so they can be displayed on the map.
  • TODO the query below is for Earth (globe Q2) only.
INSERT INTO addshore.wikidata_map_item_pixels
PARTITION(snapshot)
SELECT
    id,
    -- Longitude [-180, 180] maps onto [0, 7680); dividing by 361 rather than 360 keeps +180 just inside the canvas
    cast((cast(longitude as decimal(15, 10)) + 180) / 361 * 7680 as int) as posx,
    -- Latitude [90, -90] maps onto [0, 4320); dividing by 181 rather than 180 keeps -90 just inside the canvas
    cast(abs((cast(latitude as decimal(15, 10)) - 90) / 181 * 4320) as int) as posy,
    snapshot
FROM ${WIKIDATA_MAP_ITEM_COORD_TABLE}
WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT}
    AND globe = "http://www.wikidata.org/entity/Q2";
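As a hypothetical spot check of the formula (not part of the pipeline), the coordinate (0, 0) should land near the canvas centre of (3840, 2160):

-- (longitude 0, latitude 0) -> roughly the middle of the 7680 x 4320 canvas
SELECT
    cast((cast(0.0 as decimal(15, 10)) + 180) / 361 * 7680 as int) as posx, -- 3829
    cast(abs((cast(0.0 as decimal(15, 10)) - 90) / 181 * 4320) as int) as posy; -- 2148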

Calculate item relation pixel locations

Then join the relations against the pixel locations we just calculated:

INSERT INTO addshore.wikidata_map_item_relation_pixels
PARTITION(snapshot)
SELECT
    x.forId as forId,
    a.posx as posx1,
    a.posy as posy1,
    b.posx as posx2,
    b.posy as posy2,
    x.snapshot as snapshot
FROM (
    SELECT fromId, toId, forId, snapshot
    FROM ${WIKIDATA_MAP_ITEM_RELATION_TABLE}
    WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT}
) x
JOIN addshore.wikidata_map_item_pixels a ON (a.id = x.fromId) AND a.snapshot=x.snapshot
JOIN addshore.wikidata_map_item_pixels b ON (b.id = x.toId) AND b.snapshot=x.snapshot
WHERE x.snapshot=${WIKIDATA_MAP_SNAPSHOT}
-- The GROUP BY deduplicates relations that collapse onto the same pair of pixels
GROUP BY
    x.forId,
    a.posx,
    a.posy,
    b.posx,
    b.posy,
    x.snapshot
LIMIT 100000000;
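To eyeball a few of the joined rows (an optional spot check):

SELECT * FROM addshore.wikidata_map_item_relation_pixels WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT} LIMIT 10;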

Check the amount of data

You should have rows for the correct snapshot in all of the tables...

SELECT COUNT(*) FROM addshore.wikidata_map_item_pixels WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT};
SELECT COUNT(*) FROM addshore.wikidata_map_item_relation_pixels WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT};

Cleanup?

If you accidentally insert duplicate rows, you can clean them up!

INSERT OVERWRITE TABLE addshore.wikidata_map_item_pixels SELECT DISTINCT * FROM addshore.wikidata_map_item_pixels;
INSERT OVERWRITE TABLE addshore.wikidata_map_item_relation_pixels SELECT DISTINCT * FROM addshore.wikidata_map_item_relation_pixels;
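To confirm the deduplication worked, the total row count should now match the distinct row count (shown here for one table; the relation table check is analogous):

SELECT COUNT(*) FROM addshore.wikidata_map_item_pixels;
SELECT COUNT(*) FROM (SELECT DISTINCT * FROM addshore.wikidata_map_item_pixels) t;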

Generate the CSVs

Exit spark and do the rest!

Set environment variables with the snapshot date and the properties to export:

WIKIDATA_MAP_SNAPSHOT='2021-10-18'
PropertyArray=("P17"  "P36"  "P47"  "P138"  "P150"  "P190"  "P197"  "P403")

And write the files...

  • tail -n +2 removes the first line of output, which will be PYSPARK_PYTHON=python3.7
  • sed 's/[\t]/,/g' turns the TSV into a CSV
spark2-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 -e "SELECT posx, posy, COUNT(*) as count FROM addshore.wikidata_map_item_pixels WHERE snapshot = '${WIKIDATA_MAP_SNAPSHOT}' GROUP BY posx, posy ORDER BY count DESC LIMIT 100000000" | tail -n +2 | sed 's/[\t]/,/g'  > map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv
for PROPERTY in ${PropertyArray[*]}; do
    echo $PROPERTY
    spark2-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 -e "SELECT posx1, posy1, posx2, posy2 FROM addshore.wikidata_map_item_relation_pixels WHERE snapshot = '${WIKIDATA_MAP_SNAPSHOT}' AND forId = '${PROPERTY}' LIMIT 100000000" | tail -n +2 | sed 's/[\t]/,/g'  > map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv
done

You should find the new files on disk in your current working directory. You can check how many lines they have.

cat map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv | wc -l
for PROPERTY in ${PropertyArray[*]}; do
    echo $PROPERTY
    cat map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv | wc -l
done

And they will look something like this...

Pixel location and entity count

posx,posy,count
3933,898,7144
3761,812,6263

Relations between pixel locations

posx1,posy1,posx2,posy2
6111,1644,6126,1766
6078,1911,6126,1766

Publishing data

If everything went well, you are ready to publish the data.

Publishing the CSVs

Set environment variables with the snapshot date and the properties to publish:

WIKIDATA_MAP_SNAPSHOT='2021-10-18'
PropertyArray=("P17"  "P36"  "P47"  "P138"  "P150"  "P190"  "P197"  "P403")

And copy them into the published directory:

cp map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv /srv/published/datasets/one-off/wikidata/addshore/map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv
for PROPERTY in ${PropertyArray[*]}; do
    echo $PROPERTY
    cp map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv /srv/published/datasets/one-off/wikidata/addshore/map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv 
done
published-sync

This can take a little while to show up...

Make sure the files appear: https://analytics.wikimedia.org/published/datasets/one-off/wikidata/addshore/