Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
139 lines (88 sloc) 2.63 KB
title author date vignette
Using rdomains
Gaurav Sood
2016-06-15
%\VignetteIndexEntry{Illustrating use of rdomains} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}

rdomains: Get the category of content hosted by a domain

Install and Load the package

The latest development version of the package will always be on GitHub. To install the package from GitHub and to load the installed package:

#library(devtools)
install_github("soodoku/domain_classifier/rdomains")

To install the package from CRAN, type in:

install.packages("rdomains")

Next, load the package:

library(rdomains)

Shalla

To get category of the content from shallalist, first download the latest file using:

get_shalla_data()

And then, get the category using:

shalla_cat("http://www.google.com")
##   domain_name shalla_category
## 1  google.com   searchengines

DMOZ

To get category of the content from DMOZ, first download the archived parsed CSV file using:

get_dmoz_data()

And then, get the category using:

dmoz_cat("http://www.google.com")

ML

Probability that Domain Hosts Adult Content Based on features of Domain Name and Suffix alone:

adult_ml1_cat("http://www.google.com")
##   domain_name  category
## 1  google.com 0.3133728

Virustotal

Start by getting the API key from virustotal.

Get virustotal category by running:

virustotal_cat("http://www.google.com")
##                 domain   bitdefender dr_web  alexa        google       websense             trendmicro
## 1 http://www.google.com searchengines  chats google searchengines advertisements search engines portals

Trusted (McAfee)

Get the content category of a domain according to McAfee (Trusted):

trusted_cat("http://www.google.com")
##                    url          status   categorization   reputation
## 2 http://www.google.com Categorized URL - Search Engines Minimal Risk

Alexa Category

To get the category of content from Amazon (Alexa) (which provides it via DMOZ), start by getting credentials from https://aws.amazon.com/. Next, set the environment variables:

Sys.setenv("AWS_ACCESS_KEY_ID", "key_id") 
Sys.getenv("AWS_SECRET_ACCESS_KEY", "secret_key")

Then run,

alexa_cat(domain="http://www.google.com")[1,]
##                   Title                                           AbsolutePath
## 1 Search Engines/Google Top/Computers/Internet/Searching/Search_Engines/Google