## **How to Create a Web Scraping Tool in R**

Have you ever had the need to gather up all of the information from a web page? Here's how to write a tool in R that will do that for you. 

# Install R Packages

- install.packages("httr")
- install.packages("RCurl")
- install.packages("stringr")
- install.packages("stringi")
- install.packages("openssl")

# Let's go
First, I create two functions to detect if a sentence is a french question and another to count the number of words because my goal is to predict feature snippets in Google and I need to know if the content of a HTML element is a question. 
I test with 2 small unit tests.

In [4]:
library("stringr")

nwords <- function(string, pseudo=F){
  ifelse( pseudo, 
          pattern <- "\\S+", 
          pattern <- "[[:alpha:]]+" 
        )
  str_count(string, pattern)
}

detectQuestion <- function(q) {
  substring="qui | quoi |où|quand|comment|combien|pourquoi| quel |?"
  val <- grepl(substring,q)
  return(val)
}

# result : 4
nwords("it is a test")

# result TRUE
detectQuestion("comment on crée un notebook en R")

# RCurl
I use the RCurl Package to get data form URL. We need to get data from the header and the HTML content of HTTP response. If you need more information about this function, please wait my TeknSEO talk.

##  2 methods
- regex
- xptah


In [6]:
library(RCurl)
library(urltools)
library(XML)  
library(stringr)
library(dplyr)
library(readr)
library(stringdist)

seoscraper2df <- function(url,keyword,debug){
    
  html <- ""
  time <- 0

  tryCatch({

    useragent <- "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25"
    h <- basicTextGatherer()
    
    time <- system.time( 
      html <- getURL(url
                    ,followlocation = TRUE
                    ,ssl.verifypeer = FALSE 
                    ,httpheader = c('User-Agent' = useragent)
                    ,headerfunction = h$update
                    #,verbose=TRUE
                   )
    )    


      }, error = function(e) {
        print("error getURL")
        print(e)
     }
   )  
    
  if(html!="") {
      
    #############TRANSFORM TO UTF-8#############################
    ind1 <- grep("^Content-Type",h$value(NULL))

    if(length(ind1)) {
      
      contentType <- tail(h$value(NULL)[ind1],1)

      charset <- str_extract(contentType, "charset=([a-zA-Z0-9-]+)")
      charset <- gsub("charset=","",charset)
      
      if (!grepl("utf-8",charset) && !is.na(charset)) {
        if (debug) print("no UTF-8 encoding, we need to convert html")
         html <- iconv(html, charset, "UTF-8", sub="byte") 
         isUTF8 <- validUTF8(html)
         if (debug && isUTF8) print("OK now it is UTF-8 encoding")
         doc <- htmlParse(html, asText=TRUE,encoding="UTF-8")
           
      } else {
        if (debug) print("UTF-8 encoding detected")
        doc <- htmlParse(html, asText=TRUE,encoding="UTF-8")      
      }
           
        
    } else {
      enc <- guess_encoding(html)
    
      isUTF8 <- validUTF8(html)

       if (!isUTF8) {
         if (debug) print("no UTF-8 encoding, we need to convert html")
          html <- iconv(html, head(enc$encoding,1), "UTF-8")
          isUTF8 <- validUTF8(html)
          if (debug && isUTF8) print("OK now it is UTF-8 encoding")
          doc <- htmlParse(html, asText=TRUE,encoding="UTF-8")      
        } else {
         if (debug) print("UTF-8 encoding detected")
         doc <- htmlParse(html, asText=TRUE,encoding="UTF-8")      
       }
    } 
    ################################################## 
           
    if (is.null(doc)) 
      return(data.frame(Address=url,Domain="",Wordcount="",Size="",StatusCode="",ContentType="",
                        LastModified="",ContentLanguage="",Title="",TitleLength="",
                        DescriptionLength="",H1="",H1Length="",H2="",H2Length="",
                        Robots="",Canonical="",Outlinks="",ExternalOutlinks="",ResponseTime=""))
      
    plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    maintext <- paste(plain.text, collapse = " ")
    maintext <- gsub("\n", "", maintext)  
    maintext <- gsub("  ", " ", maintext)       
      
    df <- data.frame(Domain=url)
      
    #get Status Code
    ind0 <- grep("HTTP/",h$value(NULL))

    if(length(ind0)) {
      df$StatusCode <- tail(h$value(NULL)[ind0],1)
    } else {
      df$StatusCode <- ""
    } 

    ## Content-Type
    ind1 <- grep("^Content-Type",h$value(NULL))

    if(length(ind1)) {
      df$ContentType <- gsub("Content-Type:","",tail(h$value(NULL)[ind1],1))
    } else {
      df$ContentType <- ""
    }
  
    ## Last-Modified
    ind2 <- grep("Last-Modified",h$value(NULL))
 
    if(length(ind2)) {
      df$LastModified <- gsub("Last-Modified:","",tail(h$value(NULL)[ind2],1))
    } else {
      df$LastModified <- ""
    }  
      
    ind4 <- grep("Location",h$value(NULL)) 

    if(length(ind4)) {
      df$Location <- gsub("Location:","",tail(h$value(NULL)[ind4],1))
    } else {
      df$Location <- ""
    }   
      
    #get title tag
    title <- head(xpathSApply(doc, "//title", xmlValue),1)
      
    if(class(title)=="character") {
      df$Title <- title
      df$TitleLength <- nchar(title)  
    } else {
      df$Title <- ""
      df$TitleLength <- 0
    } 
       
    
    df$TitleDist <- stringdist(keyword,df$Title)   
      
    df$TitleIsQuestion <- detectQuestion(df$Title)   
      
    # <meta name="googlebot" content="nosnippet">  
    # TODO add nosnippet
    nosnippet <- head(xpathSApply(doc, '//meta[@name="googlebot" and @content="nosnippet"]', xmlGetAttr, 'content'),1)    
  
    if (length(nosnippet)) {
      df$noSnippet <- TRUE
    } else {
      df$noSnippet <- FALSE
    }   

    #<script type="application/ld+json">
    pattern <- '<script type=\\"application\\/ld\\+json\\">(.*?)</script>'
    #print(html)
    jsonld <- str_match(html,pattern)
    if(!is.na(jsonld[1])) {
      df$isJsonLD <- TRUE  
    } else {
      df$isJsonLD <- FALSE  
    }
    #print(df$isJsonLD)
      
    #Xpath : //*[@itemtype]/@itemtype
    isItemType <- xpathSApply(doc, "//*[@itemtype]/@itemtype")
    if (length(isItemType)) {
      df$isItemType <- TRUE
    } else {
      df$isItemType <- FALSE
    }  
      
    #Xpath : //*[@itemprop]/@itemprop
    isItemProp <- xpathSApply(doc, "//*[@itemprop]/@itemprop")
    if (length(isItemProp)) {
      df$isItemProp <- TRUE
    } else {
      df$isItemProp <- FALSE
    }      
      
    #TODO : uncomment    
    df$Wordcount <- nwords(maintext)  
      
    df$Size <- nchar(html)   
      
    df$ResponseTime <- time[3]    
      
    #get first H1 tag
    H1 <- head(xpathSApply(doc, "//h1", xmlValue),1)
    #print(length(H1))
    if (length(H1)) {
      df$H1 <- H1
      df$H1Length <- nchar(df$H1)    
    } else {
      df$H1 <- ""
      df$H1Length <- 0
    }
      
    df$H1Dist <- stringdist(keyword,df$H1)    
      
    df$H1IsQuestion <- detectQuestion(df$H1)        

    #get first H2 tag
    H2 <- head(xpathSApply(doc, "//h2", xmlValue),1)
    #print(length(H2))
    if (length(H2)) {
      df$H2 <- H2
      df$H2Length <- nchar(df$H2)    
    } else {
      df$H2 <- ""
      df$H2Length <- 0
    } 
      
    df$H2Dist <- stringdist(keyword,df$H2)  
      
    df$H2IsQuestion <- detectQuestion(df$H2) 
            
    return(df)    
      
  }
  else
    return(data.frame(Address=url,Domain="",Wordcount="",Size="",StatusCode="",ContentType="",
                        LastModified="",ContentLanguage="",Title="",TitleLength="",
                        DescriptionLength="",H1="",H1Length="",H2="",H2Length="",
                        Robots="",Canonical="",Outlinks="",ExternalOutlinks="",ResponseTime=""))
      
    
}



           
seoscraper2df("ovh.com",
               "mot clé",
               FALSE)



Loading required package: bitops

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Domain,StatusCode,ContentType,LastModified,Location,Title,TitleLength,TitleDist,TitleIsQuestion,noSnippet,⋯,Size,ResponseTime,H1,H1Length,H1Dist,H1IsQuestion,H2,H2Length,H2Dist,H2IsQuestion
ovh.com,HTTP/1.1 200 OK,text/html; charset=utf-8,"Wed, 07 Jun 2017 22:10:02 GMT",https://www.ovh.com/fr/,"Hébergement Internet, Cloud, et Serveurs dédiés - OVH",53,48,True,False,⋯,229638,2.641,Private Cloud Healthcare,24,21,True,Commandez votre nom de domaine,30,26,True


# Ranxplorer

Ranxplorer is a good SEO tool to get all Featured Snippets. Please use your API key : https://ranxplorer.com/

*rankplorerGetSERP* function is useful to get the list of results displayed in natural results for a search query.

I test with search query : "ovh"

In [20]:
library(rjson)
# load DF
key <- "XXXXXXXXXXXXXXXXXXXXXXXXX"

rankplorerGetSERP <- function(search,limit) {

  url <- paste("http://api.ranxplorer.com/v1/seo/serps?search=",search,"&limit=",limit,sep="")
    
  h <- basicTextGatherer()
  
  req <-getURL(
    url,
    httpheader = c(
      "X-Ranxplorer-Token" = key,
      "Accept" = "application/json"
    )
    , headerfunction = h$update
    , verbose = TRUE
  );
    
  ind1 <- grepl("200 OK",h$value(NULL)[1])
    
  json <- fromJSON( req )
  
  if (ind1) {
    if (json$errors=="FALSE") {
      json <- json$data  
      df <- do.call(rbind, lapply(json, data.frame, stringsAsFactors=FALSE))
    }
    else {
      df <- data.frame(keyword=search,Url="",Rx="",Ssl="",Date="")
    }
  }
  else {
      df <- data.frame(keyword=search,Url="",Rx="",Ssl="",Date="")
  }
  
  return(df)
}

df <- rankplorerGetSERP("ovh",10)
df <- cbind(keyword="ovh",df)
df

keyword,Url,Rx,Ssl,Date
ovh,www.ovh.com/fr/,1,1,2017-05-05
ovh,www.ovhtelecom.fr/,2,1,2017-05-05
ovh,www.ovhtelecom.fr/telephonie/,3,1,2017-05-05
ovh,mail.ovh.net/fr/,4,1,2017-05-05
ovh,fr.wikipedia.org/wiki/OVH,5,1,2017-05-05
ovh,www.lavoixdunord.fr/98489/article/2017-01-04/ovh-lance-un-recrutement-de-plus-de-200-personnes,6,0,2017-05-05
ovh,twitter.com/ovh?lang=fr,7,1,2017-05-05
ovh,hubic.com/home/,8,1,2017-05-05
ovh,www.kimsufi.com/fr/hosting.xml,9,1,2017-05-05
ovh,www.nic.ovh/,10,0,2017-05-05


Now, I prepare a function to parallelize many threads. 
I test on my 5th line in my dataset.

In [22]:
seoscraperThread <- function(DF,i){
 
 #reload lib by thread
 library(RCurl)
 library(urltools)
 library(XML)  
 library(stringr)
 library(dplyr)
 library(readr)
 library(stringdist)
 library(dataiku)
   
  url <- toString(DF$Url[i])
  url <- URLdecode(url)
    
  keyword <- toString(DF$keyword[i])  
    
  print(url)

  tryCatch({   
    dfcurrent <- seoscraper2df(url,keyword,FALSE)
    dfcurrent <- cbind(Keyword=keyword,dfcurrent)  
  },
  error = function(e) {
      print("end thread")
      print(e)
      return(NULL)
    }     
  )

   
}

result <- seoscraperThread(df,5) 
result

[1] "fr.wikipedia.org/wiki/OVH"


Keyword,Domain,StatusCode,ContentType,LastModified,Location,Title,TitleLength,TitleDist,TitleIsQuestion,⋯,Size,ResponseTime,H1,H1Length,H1Dist,H1IsQuestion,H2,H2Length,H2Dist,H2IsQuestion
ovh,fr.wikipedia.org/wiki/OVH,HTTP/1.1 200 OK,text/html; charset=UTF-8,"Wed, 07 Jun 2017 08:26:15 GMT",https://fr.m.wikipedia.org/wiki/OVH,OVH — Wikipédia,15,15,True,⋯,72858,0.464,β,15,15,True,Sommaire,8,7,True


# Ok, now it is time to use OpenMPI.

MPI : Message Passing Interface is a specification for an API for passing messages between different computers. 

## Install MPI

- sudo yum install openmpi openmpi-devel openmpi-libs

- sudo ldconfig /usr/lib64/openmpi/lib/

- export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/usr/lib64/openmpi/lib/“

- install.packages("Rmpi",
                 configure.args =
                 c("--with-Rmpi-include=/usr/include/openmpi-x86_64/",
                   "--with-Rmpi-libpath=/usr/lib64/openmpi/lib/",
                   "--with-Rmpi-type=OPENMPI"))

- install.packages (“doMPI“)


## Programming with MPI 
- Difficult because of Rmpi package defines about 110 R functions
- Needs a parallel programming system to do the actual work in parallel

## The doMPI package acts as an adaptor to the Rmpi package,  which in turn is  an R interface  to an implementation of MPI
- Very easy to install Open MPI, and Rmpi on Debian / Ubuntu
- You can test with one computer



In [23]:
library(doMPI)
# start cluster
cl <- startMPIcluster(count=10)

registerDoMPI(cl)


Loading required package: foreach
Loading required package: iterators
Loading required package: Rmpi


	10 slaves are spawned successfully. 0 failed.


In [25]:
max <- dim(df)[1]
max

The option "rbind" is very useful to gather all results in one dataset

In [26]:
system.time(
y <- foreach(i=1:max, .combine="rbind") %dopar% seoscraperThread(df,i)
)
y

   user  system elapsed 
  1.573   0.665  11.903 

Keyword,Domain,StatusCode,ContentType,LastModified,Location,Title,TitleLength,TitleDist,TitleIsQuestion,⋯,Size,ResponseTime,H1,H1Length,H1Dist,H1IsQuestion,H2,H2Length,H2Dist,H2IsQuestion
ovh,www.ovh.com/fr/,HTTP/1.1 200 OK,text/html; charset=utf-8,"Wed, 07 Jun 2017 22:20:04 GMT",https://www.ovh.com/fr/,"Hébergement Internet, Cloud, et Serveurs dédiés - OVH",53,51,True,⋯,229638,0.632,Private Cloud Healthcare,24,22,True,Commandez votre nom de domaine,30,28,True
ovh,www.ovhtelecom.fr/,HTTP/1.1 200 OK,text/html; charset=utf-8,"Tue, 06 Jun 2017 14:08:01 GMT",https://www.ovhtelecom.fr/,"OVH Télécom : Fournisseur Internet (92 Mb/s), Téléphonie, E-mails...",68,66,True,⋯,95946,0.812,,0,3,True,"OVH Télécom, le fournisseur Internet (FAI) neutre",49,48,True
ovh,www.ovhtelecom.fr/telephonie/,HTTP/1.1 200 OK,text/html; charset=utf-8,"Tue, 06 Jun 2017 14:08:13 GMT",https://www.ovhtelecom.fr/telephonie/,Téléphonie et VoIP pour entreprise - OVH TELECOM,49,48,True,⋯,60595,0.634,Téléphonie Pro,14,13,True,Au service de vos utilisateurs,30,29,True
ovh,mail.ovh.net/fr/,HTTP/1.1 200 OK,text/html,"Thu, 22 Dec 2016 08:48:54 GMT",https://mail.ovh.net/fr/,Webmail - OVH,13,13,True,⋯,8638,0.638,Se connecter au Webmail,23,22,True,Les Webmails proposÃ©s parÂ OVH,31,30,True
ovh,fr.wikipedia.org/wiki/OVH,HTTP/1.1 200 OK,text/html; charset=UTF-8,"Wed, 07 Jun 2017 08:26:15 GMT",https://fr.m.wikipedia.org/wiki/OVH,OVH — Wikipédia,15,15,True,⋯,72858,0.778,β,15,15,True,Sommaire,8,7,True
ovh,www.lavoixdunord.fr/98489/article/2017-01-04/ovh-lance-un-recrutement-de-plus-de-200-personnes,HTTP/1.1 401 Unauthorized,text/html; charset=utf-8,,/check_cookies?url=%2F98489%2Farticle%2F2017-01-04%2Fovh-lance-un-recrutement-de-plus-de-200-personnes,401 Vous devez activer les cookies pour naviguer sur ce site,60,58,True,⋯,394,4.116,Error 401 Vous devez activer les cookies pour naviguer sur ce site,66,64,True,,0,3,True
ovh,twitter.com/ovh?lang=fr,HTTP/1.1 200 OK,,,,OVH (@OVH) sur Twitter,22,22,True,⋯,63251,1.185,,0,3,True,,0,3,True
ovh,hubic.com/home/,HTTP/1.1 200 OK,text/html; charset=UTF-8,,https://hubic.com/en/,hubiC: Online storage for all your files – hubiC.com,52,50,True,⋯,13654,0.693,"My life, my data, my hubiC",26,25,True,"750,000 users!",14,14,True
ovh,www.kimsufi.com/fr/hosting.xml,HTTP/1.1 200 OK,text/html; charset=utf-8,"Thu, 09 Mar 2017 16:22:10 GMT",https://www.kimsufi.com/fr/hosting.xml,Kimsufi : le serveur dédié pas cher !,37,35,True,⋯,40957,1.663,,0,3,True,Trouvez votre nom de domaine,29,27,True
ovh,www.nic.ovh/,HTTP/1.1 200 OK,text/html; charset=utf-8,"Fri, 29 May 2015 14:41:54 GMT",http://www.nic.ovh/fr/index.xml,Extension .OVH : Créez votre domaine.ovh,40,37,True,⋯,27681,0.821,,0,3,True,,0,3,True


In [28]:
# close cluster
closeCluster(cl)

Use the code I went over today as a jupyter notebook for your own R tool. 
    
Now you have the basics, you can easily create a customized tool that can be applied in many different places.
Data Scraping will be more and more complex, I advice you to personnalize your tools.