Open
Description
Working on HTTPArchive/dataform#48 and noticed that we have many duplicated version values:
SELECT
tech,
ARRAY_LENGTH(tech.info) AS count,
COUNT(DISTINCT version) AS count_distinct
FROM crawl.pages,
UNNEST(technologies) AS tech,
UNNEST(tech.info) AS version
WHERE
date = '2024-11-01' AND
client = 'mobile' AND
rank = 1000 AND
tech.technology LIKE 'WordPress%'
GROUP BY 1,2

Seems like it's the case whenever there are >1 category for a technology:
SELECT
tech,
ARRAY_LENGTH(tech.info) AS count,
COUNT(DISTINCT version) AS count_distinct
FROM crawl.pages,
UNNEST(technologies) AS tech,
UNNEST(tech.info) AS version
WHERE
date = '2024-11-01' AND
client = 'mobile' AND
rank = 1000 AND
ARRAY_LENGTH(tech.categories) > 2
GROUP BY 1,2

Having distinct values instead would make iterating over the versions more efficient.
Metadata
Metadata
Assignees
Labels
No labels