The map is powered by pre calculated pixels, not coordinates.
These tables only need to be created once...
CREATE TABLE IF NOT EXISTS addshore.wikidata_map_item_pixels (
`id` string,
`posx` int,
`posy` int
)
PARTITIONED BY (
`snapshot` string COMMENT 'Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
CREATE TABLE IF NOT EXISTS addshore.wikidata_map_item_relation_pixels (
`forId` string,
`posx1` int,
`posy1` int,
`posx2` int,
`posy2` int
)
PARTITIONED BY (
`snapshot` string COMMENT 'Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
spark2-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64
You can read more about the WMF spark setup here.
SET hivevar:WIKIDATA_MAP_SNAPSHOT='2021-10-18';
-- Modern flow tables
SET hivevar:WIKIDATA_MAP_ITEM_COORD_TABLE=addshore.wikidata_map_item_coordinates;
SET hivevar:WIKIDATA_MAP_ITEM_RELATION_TABLE=addshore.wikidata_map_item_relations;
-- Old backfill tables
SET hivevar:WIKIDATA_MAP_ITEM_COORD_TABLE=addshore.wikidata_map_item_coordinates_old_backfill_text;
SET hivevar:WIKIDATA_MAP_ITEM_RELATION_TABLE=addshore.wikidata_map_item_relations_old_backfill_text;
You also need to set this:
SET hive.exec.dynamic.partition.mode=nonstrict;
From here was want to calculate pixel locations for a canvas, to avoid doing any computation on the client.
The primary target canvas size is 1920 x 1080.
The old "huge" map rendered at 8000 x 4000.
In order to get to a similar quality we will x4 the target size, to 7680 x 4320.
- TODO add ids to the pixel entries, so they can be displayed on the map..
- TODO the below query is for earth only...
INSERT INTO addshore.wikidata_map_item_pixels
PARTITION(snapshot)
SELECT
id,
cast((cast(longitude as decimal(15, 10)) + 180) / 361 * 7680 as int) as posx,
cast(abs((cast(latitude as decimal(15, 10)) - 90) / 181 * 4320)as int) as posy,
snapshot
FROM ${WIKIDATA_MAP_ITEM_COORD_TABLE}
WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT}
AND globe = "http://www.wikidata.org/entity/Q2";
Then figure out how the relations relate to our pixel map:
INSERT INTO addshore.wikidata_map_item_relation_pixels
PARTITION(snapshot)
SELECT
x.forId as forId,
a.posx as posx1,
a.posy as posy1,
b.posx as posx2,
b.posy as posy2,
x.snapshot as snapshot
FROM (
SELECT fromId, toId, forId, snapshot
FROM ${WIKIDATA_MAP_ITEM_RELATION_TABLE}
WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT}
) x
JOIN addshore.wikidata_map_item_pixels a ON (a.id = x.fromId) AND a.snapshot=x.snapshot
JOIN addshore.wikidata_map_item_pixels b ON (b.id = x.toId) AND b.snapshot=x.snapshot
WHERE x.snapshot=${WIKIDATA_MAP_SNAPSHOT}
GROUP BY
x.forId,
a.posx,
a.posy,
b.posx,
b.posy,
x.snapshot
LIMIT 100000000;
You should have rows for the correct snapshot in all of the tables...
SELECT COUNT(*) FROM addshore.wikidata_map_item_pixels WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT};
SELECT COUNT(*) FROM addshore.wikidata_map_item_relation_pixels WHERE snapshot=${WIKIDATA_MAP_SNAPSHOT};
If you add duplicate stuff by accident, you can clean it up!
INSERT OVERWRITE TABLE addshore.wikidata_map_item_pixels SELECT DISTINCT * FROM addshore.wikidata_map_item_pixels;
INSERT OVERWRITE TABLE addshore.wikidata_map_item_relation_pixels SELECT DISTINCT * FROM addshore.wikidata_map_item_relation_pixels;
Exit spark and do the rest!
Set an environment variable with the snapshot date:
WIKIDATA_MAP_SNAPSHOT='2021-10-18'
PropertyArray=("P17" "P36" "P47" "P138" "P150" "P190" "P197" "P403")
And write the files...
tail -n +2
removes the firt line of output, which will bePYSPARK_PYTHON=python3.7
sed 's/[\t]/,/g'
turns the TSV into a CSV
spark2-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 -e "SELECT posx, posy, COUNT(*) as count FROM addshore.wikidata_map_item_pixels WHERE snapshot = '${WIKIDATA_MAP_SNAPSHOT}' GROUP BY posx, posy ORDER BY count DESC LIMIT 100000000" | tail -n +2 | sed 's/[\t]/,/g' > map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv
for PROPERTY in ${PropertyArray[*]}; do
echo $PROPERTY
spark2-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 -e "SELECT posx1, posy1, posx2, posy2 FROM addshore.wikidata_map_item_relation_pixels WHERE snapshot = '${WIKIDATA_MAP_SNAPSHOT}' AND forId = '${PROPERTY}' LIMIT 100000000" | tail -n +2 | sed 's/[\t]/,/g' > map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv
done
You should find the new files on disk in your current working directory. You can check how many lines they have.
cat map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv | wc -l
for PROPERTY in ${PropertyArray[*]}; do
echo $PROPERTY
cat map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv | wc -l
done
And they will look something like this
Pixel location and entity count
posx,posy,count
3933,898,7144
3761,812,6263
Relations between pixel locations
posx1,posy1,posx2,posy2
6111,1644,6126,1766
6078,1911,6126,1766
If everything went well, you are ready to publish the data.
Set an environment variable with the snapshot date:
WIKIDATA_MAP_SNAPSHOT='2021-10-18'
PropertyArray=("P17" "P36" "P47" "P138" "P150" "P190" "P197" "P403")
And move them into the published directory
cp map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv /srv/published/datasets/one-off/wikidata/addshore/map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-pixels.csv
for PROPERTY in ${PropertyArray[*]}; do
echo $PROPERTY
cp map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv /srv/published/datasets/one-off/wikidata/addshore/map-${WIKIDATA_MAP_SNAPSHOT}-7680-4320-relation-pixels-${PROPERTY}.csv
done
published-sync
This can take a little while to show up...
Make sure the file appears: https://analytics.wikimedia.org/published/datasets/one-off/wikidata/addshore/