# UC Davis Stat Club Workshop: Translating R to Python

_Nick Ulle (17/01/12)_

The recipes file used below can be found [here](recipes).

In STA 141a discussion 8, I presented an R script to scrape ingredients from [Serious Eats](http://seriouseats.com/):
```r
# Scrape SeriousEats
#
# 1. Download (and cache) the pages.
# 2. Extract text from the pages with XPath or CSS Selectors.
# 3. Extract content from the text with vectorized regex.

library(xml2)
library(dplyr)

main = function(cachedir = "cache") {
  # The starting point of our scraper.
  urls = readLines("recipes")

  if (!dir.exists(cachedir))
    dir.create(cachedir)

  pages = lapply(urls, function(url) {
    parse_page(fetch_page(url, cachedir), url)
  })
  pages = do.call(rbind, pages)

  # Use regex here on the entire ingredients column.
  # ...

  return (pages)
}

fetch_page = function(url, cachedir) {
  # Download a page and cache it.
  # Cached to cachedir/basename
  cache = file.path(cachedir, basename(url))

  # Load from cache if possible; otherwise web.
  if (file.exists(cache)) {
    return (read_html(cache))
  }

  page = read_html(url)
  write_xml(page, cache)
  return (page)
}

parse_page = function(page, url) {
  # Parse a page to extract ingredients.
  li = xml_find_all(page, "//li[@class = 'ingredient']")
  ingredients = xml_text(li)
  data_frame(ingredient = ingredients, url = url)
}
```

The table below maps the R functions used for web scraping to their Python equivalents in lxml and BeautifulSoup:

R              | lxml                    | BeautifulSoup           | Description
-------------- | ----------------------- | ----------------------- | -----------
curl()         | requests.get()          | requests.get()          | download a web page
read_html()    | lxml.html.fromstring()  | bs4.BeautifulSoup()     | parse a web page
write_xml()    | open() & .write()       | open() & .write()       | save a web page
xml_find_all() | .xpath()                | (no XPath support)      | extract tags by XPath
html_nodes()   | .cssselect()            | .select()               | extract tags by CSS Selector
xml_text()     | .text_content()         | .get_text()             | extract text inside of a tag
xml_attr()     | .get()                  | [ ]                     | extract an attribute on a tag
url_absolute() | urllib.parse.urljoin()  | urllib.parse.urljoin()  | convert relative URL to absolute URL
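
The lxml column is used in the translation later in these notes, so here is a minimal sketch of the BeautifulSoup column instead. The HTML string, the `parse_sample()` helper, and the `example.com` base URL are made up for illustration; they are not taken from Serious Eats.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up HTML standing in for a recipe page (not real Serious Eats markup).
SAMPLE_HTML = """
<html><body>
  <ul>
    <li class="ingredient">1 tablespoon ground cinnamon</li>
    <li class="ingredient">6 ounces <a href="/cream">heavy cream</a></li>
  </ul>
</body></html>
"""


def parse_sample(html, base_url):
    """Hypothetical helper: return ingredient texts and absolute links."""
    # Like read_html(): parse the page.
    soup = BeautifulSoup(html, "html.parser")
    # Like html_nodes(): select tags with a CSS Selector.
    items = soup.select("li.ingredient")
    # Like xml_text(): extract the text inside each tag.
    ingredients = [li.get_text() for li in items]
    # Like xml_attr() followed by url_absolute(): read href, make it absolute.
    links = [urljoin(base_url, a["href"]) for a in soup.select("li.ingredient a")]
    return ingredients, links


ingredients, links = parse_sample(SAMPLE_HTML, "http://example.com/recipes/")
```

BeautifulSoup has no XPath support, which is why the sketch uses a CSS Selector where the R script uses `xml_find_all()`.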


```python
import requests
import requests_cache

def fetch_page(url):
    # Download a page.
    response = requests.get(url)
    # Raise an error if the status is a 4xx or 5xx error code.
    response.raise_for_status()
    return response
```

```python
import pandas as pd

def main(cache="cache"):
    # Read the recipes file, one URL per line.
    with open("recipes") as f:
        recipes = f.readlines()

    recipes = [x.strip() for x in recipes]

    # Turn on the cache, so repeated requests are served from disk.
    requests_cache.install_cache(cache)

    # Call fetch_page() and parse_page() on each url.
    data = []
    for url in recipes:
        page = fetch_page(url)
        data.append(parse_page(page))

    # Bind the rows together (like do.call(rbind, pages) in R).
    data = pd.concat(data)

    return data
```
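
The R script caches each page by hand in `cachedir/basename(url)`, whereas the Python version hands caching over to `requests_cache`. For comparison, a more literal translation of the R caching logic might look like the sketch below. It is an illustration under assumptions, not the workshop's code: `fetch_page_cached()`, `download()`, and the `downloader` parameter are hypothetical names, and the standard library's `urllib.request` is swapped in for `requests` so the sketch has no third-party dependencies.

```python
from pathlib import Path
from urllib.request import urlopen


def download(url):
    # Assumption: the page is UTF-8 text. urllib stands in for requests here.
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def fetch_page_cached(url, cachedir="cache", downloader=download):
    """Hypothetical helper: return the page text, caching it on disk."""
    cachedir = Path(cachedir)
    # Like dir.create(): make the cache directory if it is missing.
    cachedir.mkdir(parents=True, exist_ok=True)

    # Like file.path(cachedir, basename(url)): name the file after the
    # last component of the URL.
    cache = cachedir / url.rstrip("/").rsplit("/", 1)[-1]

    # Load from the cache if possible; otherwise from the web.
    if cache.exists():
        return cache.read_text(encoding="utf-8")

    text = downloader(url)
    cache.write_text(text, encoding="utf-8")
    return text
```

Passing the downloader in as an argument is only a convenience for trying the function without a network connection. Naming files by the last URL component means two URLs that end the same way would share a cache file, which is a limitation this sketch inherits from the `basename()` approach.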

```python
import lxml.html as lx
import pandas as pd

def parse_page(page):
    # Parse a page to extract ingredients.
    html = lx.fromstring(page.text)
    li = html.xpath("//li[@class = 'ingredient']")
    ingredients = [x.text_content() for x in li]

    # One row per ingredient; the URL is repeated on every row.
    return pd.DataFrame({"ingredients": ingredients, "url": page.url})
```

```python
data = main()
data
```

|   | ingredients | url |
| - | ----------- | --- |
| 0 | 16 ounces peeled, roughly diced sweet potatoes... | http://www.seriouseats.com/recipes/2016/11/swe... |
| 1 | 1 vanilla bean, split and scraped, seeds reser... | http://www.seriouseats.com/recipes/2016/11/swe... |
| 2 | 2 or 3 cinnamon sticks (about 6 inches total) | http://www.seriouseats.com/recipes/2016/11/swe... |
| 3 | 1 whole nutmeg (see note above) | http://www.seriouseats.com/recipes/2016/11/swe... |
| 4 | 26 ounces milk (3 1/4 cups; 740g), any percent... | http://www.seriouseats.com/recipes/2016/11/swe... |
| 5 | 6 ounces heavy cream (3/4 cup; 170g) | http://www.seriouseats.com/recipes/2016/11/swe... |
| 6 | 7 ounces white or lightly toasted sugar (1 cup... | http://www.seriouseats.com/recipes/2016/11/swe... |
| 7 | 1 tablespoon ground cinnamon | http://www.seriouseats.com/recipes/2016/11/swe... |
| 8 | 1 teaspoon freshly grated nutmeg (see note above) | http://www.seriouseats.com/recipes/2016/11/swe... |
| 9 | 1/2 teaspoon Diamond Crystal kosher salt; use ... | http://www.seriouseats.com/recipes/2016/11/swe... |
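
Step 3 of the R script (vectorized regex over the whole ingredients column) is left as a placeholder there. In pandas, the `.str` methods play the same role. The sketch below is one possible approach, not the workshop's solution: the regex pattern and the `quantity` and `unit` column names are assumptions chosen for illustration, and the sample rows are copied from the untruncated rows of the output.

```python
import pandas as pd

# A few untruncated rows copied from the scraped output.
data = pd.DataFrame({
    "ingredients": [
        "2 or 3 cinnamon sticks (about 6 inches total)",
        "1 whole nutmeg (see note above)",
        "6 ounces heavy cream (3/4 cup; 170g)",
        "1 tablespoon ground cinnamon",
    ]
})

# Hypothetical pattern: a leading quantity, then an optional unit.
# It is deliberately simple and ignores cases such as "2 or 3".
pattern = (
    r"^(?P<quantity>\d+(?:/\d+)?)\s+"
    r"(?P<unit>ounces?|cups?|tablespoons?|teaspoons?)?"
)

# str.extract() applies the regex to every row at once and returns one
# column per named group; a group that did not match becomes NaN.
parts = data["ingredients"].str.extract(pattern)
```

Here `parts` has one row per ingredient, so it can be joined back onto `data` with `pd.concat([data, parts], axis=1)` if the extracted columns are wanted alongside the original text.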
