Domain Knowledge: Learning with Pydomains
You are what you browse. More or less. We jest, just a bit.
To make it easier to learn from browsing data, we developed a Python package, pydomains. The package provides multiple ways to infer the kind of content hosted by a domain. To illustrate its power (and the general workflow), we use it to answer two important questions:
- Do poor people, minorities, and the less well-educated visit sites that distribute malware or engage in phishing more frequently than their respective complementary groups (the better-off, the racial majority, and the better educated)?
- How does consumption of pornography vary by education and age?
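
The general workflow behind both questions can be sketched roughly as follows: label each visited domain with a content category, then compute, for each demographic group, the share of visits that land on a category of interest. The sketch below is only an illustration under stated assumptions. It does not call pydomains; the lookup table, the group names, the records, and the helper `share_by_group` are all hypothetical stand-ins, and in practice the category labels would come from the package and the visits from the browsing data.

```python
from collections import defaultdict

# Hypothetical domain-to-category lookup. In a real analysis these labels
# would come from a domain classifier or curated list, not a hand-written dict.
DOMAIN_CATEGORY = {
    "example-news.com": "news",
    "example-malware.net": "malware",
    "example-shop.org": "shopping",
}

# Toy records invented purely for illustration: (group label, visited domain).
VISITS = [
    ("group_a", "example-news.com"),
    ("group_a", "example-malware.net"),
    ("group_b", "example-malware.net"),
    ("group_b", "example-shop.org"),
    ("group_b", "example-news.com"),
    ("group_b", "unlabeled.example"),
]


def share_by_group(visits, domain_category, target):
    """Return, per group, the share of visits to domains labeled `target`."""
    totals = defaultdict(int)  # all visits per group
    hits = defaultdict(int)    # visits to `target` domains per group
    for group, domain in visits:
        totals[group] += 1
        # Domains missing from the lookup are treated as "unknown".
        if domain_category.get(domain, "unknown") == target:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    print(share_by_group(VISITS, DOMAIN_CATEGORY, "malware"))
```

Unlabeled domains stay in the denominator here, which is one possible choice; dropping them instead would change the shares, so the treatment of unknown domains is worth stating explicitly in any real analysis.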
Browsing data: comScore data are proprietary, so we cannot release them. A codebook translating the numerical codes to semantic labels for the demographic data is posted here.
- Malware by Age, Race, Education
- Pornography Consumption by Age and Education for comScore 2004
- We picked 2004 because we also have data from the Trusted Source API for that year. We plan to present some supplementary data and analysis that illustrate some of the issues with comScore data, but much of it is beyond the scope of this illustration, and we may do it separately.
Suriyan Laohaprapanon and Gaurav Sood