smhigley/py-soup-scraping


A web scraper to check for broken id associations

This is related to the open w3c issue about using relative element references as a DX improvement. The goal of this scraper is to check whether the current id-based reference approach results in user-facing problems.

Websites

I took the top 100 most-visited websites from https://backlinko.com/most-popular-websites, cross-checked against the Wikipedia page of most-visited websites (which lists fewer sites). I excluded sites that display almost nothing without a login (e.g. facebook), as well as porn-related sites, since this is running on my work machine.

Findings

So far, the conclusion that duplicate id attribute values are a widespread problem has been based on the total number of duplicate ids present on a page. Based on this data, however, the number of duplicate ids that are actually referenced is far smaller.

In the top 100 sites scraped, there were a total of 2705 id-referencing attributes present. Of those:

  • 62 sites had duplicate ids present, for a total of 1210 duplicated ids across all sites
  • 10 sites had id-referencing attributes that pointed to a non-unique id, for a total of 155 referenced duplicated ids across all sites

The number of non-unique ids actually referenced was a small fraction of the total: only 5.7% of id-referencing attributes (155 of 2705) pointed at a duplicated id.

Invalid id refs (i.e. an id-referencing attribute pointing at an id that does not exist anywhere on the page) appear to be far more common. These occurred on 41 of the 100 sites, with a total of 249 attributes (9.2% of the total) pointing at nonexistent ids.
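As a quick sanity check, both percentages follow directly from the raw counts reported above:

```python
# Reported counts from the scrape of the top 100 sites
total_refs = 2705    # id-referencing attributes found in total
dup_refs = 155       # references pointing at a duplicated id
missing_refs = 249   # references pointing at an id not present in the DOM

print(f"{dup_refs / total_refs:.1%}")      # -> 5.7%
print(f"{missing_refs / total_refs:.1%}")  # -> 9.2%
```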

These also do not necessarily correspond to a real-world bug -- it is possible, for example, to reference an error message element's id in aria-describedby, counting on the fact that it will only be read if the error message actually exists in the DOM.
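The kind of check described above can be sketched with Beautiful Soup. This is a minimal illustration, not the repository's actual implementation; the set of id-referencing attributes and the sample markup are assumptions:

```python
from collections import Counter
from bs4 import BeautifulSoup

# Attributes whose value is one or more element ids (assumed subset;
# ARIA defines more, e.g. aria-activedescendant and aria-owns)
ID_REF_ATTRS = ["aria-labelledby", "aria-describedby", "aria-controls", "for", "list"]

# Hypothetical sample markup with one duplicated id and one missing ref target
html = """
<div id="dup"></div>
<span id="dup"></span>
<input aria-describedby="dup">
<button aria-controls="missing">toggle</button>
"""

soup = BeautifulSoup(html, "html.parser")

# Count every non-empty id value on the page
ids = Counter(tag["id"] for tag in soup.find_all(id=True) if tag["id"].strip())
duplicate_ids = {i for i, n in ids.items() if n > 1}

# Collect every id referenced by the attributes above
refs = []
for attr in ID_REF_ATTRS:
    for tag in soup.find_all(attrs={attr: True}):
        # aria-labelledby/describedby can hold a space-separated id list
        refs.extend(tag[attr].split())

dup_refs = [r for r in refs if r in duplicate_ids]      # refs to non-unique ids
invalid_refs = [r for r in refs if r not in ids]        # refs to nonexistent ids

print(len(refs), dup_refs, invalid_refs)  # -> 2 ['dup'] ['missing']
```

Counting empty-string ids is deliberately excluded here, since two of the sites below turned out to be false positives caused by matching empty strings.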

Specific duplicate id usage details

  1. duckduckgo.com: mobile vs. desktop navigation where only one shows at a time
  2. live.com: exclusive tabpanels x 2 (both dupes on the same element pair)
  3. https://samsung.com: exclusively rendered sections x 2; false positive, separate exclusively rendered sections in same search dialog
  4. https://nytimes.com: All duplicates are in the nav. Some ids only work because of top-down id matching: the duplicated id is on an empty heading node used for spacing that clearly should not have an id at all. In other cases, the duplicated ids are incorrect and cause lists of links to have the wrong label. In one nav section none of the ids work because the id contains a space, clearly generated from the section title. The remaining duplicates come from mutually exclusive navigation (floating vs. anchored)
  5. https://walmart.com: excluded from results: a false positive in the Python code, which matched an empty string
  6. https://zoom.us: all in mutually exclusive navigations for different screen sizes
  7. https://accuweather.com: cannot repro in Japan
  8. https://instructure.com: excluded from results: a false positive in the Python code, which matched empty strings
  9. https://theguardian.com: multiple search inputs that are mutually exclusive. No practical impact as the associated labels are not good labels, and the fallback placeholder is better.
  10. https://ikea.com: aria-controls on mutually exclusive buttons with the same label
  11. https://max.com: cannot repro in Japan
  12. https://outlook.com: aria-controls on buttons in exclusively rendered tabs

All of these misuses center on the need to conditionally render multiple instances of the same UI in the DOM, which might make a good addition to the central problem statement.

How to run the scripts

This is a Python-based web scraper and assumes python3 and pip are installed.

```shell
# create the virtual environment
python3 -m venv venv

# activate the virtual environment
source venv/bin/activate

# install dependencies
pip install -r requirements.txt

# Optional: re-run the scraper to update data
python src/scraper.py

# Print info about the scraped data
python src/process.py

# deactivate the virtual environment when done
deactivate
```

About

Use Python + Beautiful Soup to run a11y checks on top sites
