<h2>Caveat</h2>
Web sites often change the format of their pages so this may not always work. 

If it doesn't, rework the examples after examining the HTML content of the page (most browsers will let you see the HTML source).

<h3>Import necessary modules</h3>

In [1]:
import requests
from bs4 import BeautifulSoup

<h3>The http request response cycle</h3>

In [2]:
url = "http://www.epicurious.com/search/Tofu Chili"

# get HTTP request response for the above URL + check that it works
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Success


In [4]:
# get keywords to search for from a user
keywords = input("Please enter the things you want to see in a recipe: ")

# append their keywords to the search
url = "http://www.epicurious.com/search/" + keywords

# get HTTP request response for the above URL w/ user searches + check that it works
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Please enter the things you want to see in a recipe: Tofu egg chili
Success
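
Keywords typed by a user can contain spaces or characters such as `&` and `/` that have a special meaning in a URL. `requests` will usually percent-encode spaces on its own, but encoding the keywords explicitly makes the final URL predictable. A minimal sketch, assuming the same search URL pattern as above; `build_search_url` and `SEARCH_BASE` are hypothetical names used only for illustration:

```python
from urllib.parse import quote

SEARCH_BASE = "http://www.epicurious.com/search/"  # same base URL as the cells above


def build_search_url(keywords):
    """Percent-encode the keywords and append them to the search URL."""
    # safe="" also encodes "/" so the keywords stay in a single path segment
    return SEARCH_BASE + quote(keywords.strip(), safe="")


print(build_search_url("Tofu egg chili"))
```

The resulting string can be passed to `requests.get` exactly like the hand-built `url` above.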


**HTML** = HyperText Markup Language = takes text + formats a document.

* All text lives inside **tagged elements** which actually mark up the text.
* **Elements** can have **attributes**, which often contain formatting commands.
* Closely related to XML (both are SGML based)
* An HTML page can also contain runnable scripts, typically written in JavaScript, which can complicate matters a little bit.
* These scripts = essentially programs/program fragments that generate an output + until the script is run, you're not going to get the output.
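
To make the vocabulary above concrete, here is a small sketch that parses one made-up element and prints its tag name, its attributes, and the text it marks up. The HTML snippet is invented for illustration, and the sketch uses Python's built-in `html.parser` (rather than `lxml`) only so that it runs without extra installs:

```python
from bs4 import BeautifulSoup

# A single element: tag name "p", two attributes, and some marked-up text
html = '<p class="intro" id="first">Hello, <b>world</b>!</p>'
soup = BeautifulSoup(html, "html.parser")

p_tag = soup.find("p")
print(p_tag.name)        # the tag name
print(p_tag.attrs)       # the attributes, as a dictionary
print(p_tag.get_text())  # the text enclosed by the element
```

Note that Beautiful Soup stores `class` as a list, because an element can carry several classes at once.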

* It's not strictly necessary to wrap a page in an `html` tag, but doing so is good practice for forward compatibility.
* In practice, the content is almost always included inside an `html` tag.
* Page = divided into 2 parts: a **head** = metadata, style (**CSS**), + typically includes title (in tab) and a **body** = contains the actual contents of the page + what shows up on the page itself like headers, paragraphs, subheadings, dividers (into sections)
* Style = important for web scraping.
* Pages can call an API to get some information from somewhere (such as Google) + do something with it (like embed a map)

* When you're structuring a web page, you define different segments + for each segment you're going to have a different format.
* Ex: Divide page up into parts for an image, for a panel, + for text.
* Any organization delivering HTML to you is designing the page + are going to include certain formatting info inside the tags so it can be formatted appropriately.
* We can use that formatting info to decide whether there is valuable info or not in that section.
* Style tag = defines the different formatting commands the page might have.
    * Ex: listbox in a div tag = div has a listbox class w/ attributes like a light green background + width of 500 px
* Could be more than one div tag on the page (typically have many div tags on the page)
* Each div tag is formatted in a particular way, depending on the content + we can use that formatting as a mechanism for figuring out which section of the page contains useful info for us.
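
The idea of using formatting classes to locate content can be sketched on a toy page. The markup below is invented (it mirrors the `listbox` example above), and the built-in `html.parser` is used so the sketch runs without `lxml`:

```python
from bs4 import BeautifulSoup

html = """
<html>
 <head>
  <style>.listbox { background: lightgreen; width: 500px; }</style>
 </head>
 <body>
  <div class="banner">Site banner</div>
  <div class="listbox"><ul><li>First item</li><li>Second item</li></ul></div>
  <div class="footer">Contact us</div>
 </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

all_divs = soup.find_all("div")               # every div on the page
listbox = soup.find("div", class_="listbox")  # only the div we care about
items = [li.get_text() for li in listbox.find_all("li")]
print(len(all_divs), items)
```

The page has three div tags, but the class name lets us go straight to the one that holds the useful list.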

<h1> Beautiful Soup </h1>

**Web scraping** = automating the process of extracting info from web pages = programmatically going to a web page + picking up info from there.

* Go into a browser, type in a URL, then the URL goes to a server you're addressing, server processes it + gets data or does whatever it's supposed to do, + returns the HTML back, browser takes HTML + renders it.
* Instead of rendering a page on a screen, we're going to use the HTML that comes back to extract data from it

Make sure the terms of use of a website don't prohibit you from using programs to get the data for commercial purposes (if using it for educational purposes, you're probably OK).

In general, if getting factual data (population of the US from last census), it's OK b/c it's a fact + is not something generated by their website = not their intellectual content

If scraping proprietary data, then it depends on what you are going to do with it.

* Ex: Go to a stock-trading website + collect certain technical indicator values from a web server taking equity market data + then computing these technical factors to use for your own commercial intent, then that's probably not kosher.

Make sure the scraping does no damage

* If you write a script that's not working very well + instead of getting a page, it keeps hitting the server + becomes a **fast crawler** going through millions of URLs in a second, you're going to be damaging the server b/c it's going to slow down.
* Worst case, you crash that server (**denial of service**)
* Test out a script that scrapes the web very carefully before you actually use it.
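
One simple way to avoid becoming a fast crawler is to cap the number of pages and pause between requests. The sketch below is only an illustration under assumptions: `fetch_politely`, its parameters, and the stand-in `fake_fetch` function are hypothetical names, and the default delay of one second is an arbitrary choice, not a recommendation from any site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, max_pages=10):
    """Fetch at most max_pages URLs, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls[:max_pages]):
        if index > 0:
            time.sleep(delay_seconds)  # give the server a break between requests
        pages.append(fetch(url))
    return pages


# Dry run with a stand-in fetch function, so no server is contacted while testing
def fake_fetch(url):
    return "<html>" + url + "</html>"


pages = fetch_politely(["/a", "/b", "/c"], fake_fetch, delay_seconds=0.01, max_pages=2)
print(pages)
```

Once the dry run behaves as expected, `fake_fetch` could be swapped for a function that calls `requests.get`.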

Any information that's private is probably better not to scrape. If it's public, why not.

* If you can, try to get info openly rather than scraping surreptitiously,
* Might want to see whether there is an API
* Typically the company lets one subscribe to a service w/ limits + stuff 
    * Can probably pay money + do those kind of things to get more info.
* But w/ an API, you can run tests very quickly without doing anything that may/may not be kosher.

Finally, ask yourself is there a public interest involved + if so, it's probably all right.

**3 goals of web scraping**

* To be able to send HTTP requests and receive responses,
* To be able to get the HTML back from the response,
* To extract the information from it.

Python has 3 basic libraries that are very useful for this (there are actually many libraries + many of them build upon these 3 or provide services above and beyond this)

* **Requests** handles the request and response cycle necessary to send a request + get the HTML back.
* **Beautiful Soup** utilizes the fact that all content is inside tags + it can quickly parse a page + we can query the resulting Beautiful Soup object.
    * takes a page + constructs a kind of a multi-layered dictionary out of it we can query very quickly to zoom in on particular pieces of content we're interested in.
    * HTML + XML parser that makes use of the tags.
    * creates a multi-layered **parse tree** + keeps track of attributes of them
    * get a big tree structure w/ multiple branches everywhere.
    * we want to be able to dig into these many branches + find our content.
    * Each tag might have attributes
* **Selenium** = originally designed to test a server
    * when building a server, want to be able to test what will happen when users come to + interact w/ it
    * can't hire thousands of users to do that stuff, so instead write a program that keeps hitting server w/ requests.
    * Selenium is an independent package to support that = Python has its API to Selenium that supports that activity
    * will programmatically test a server, but b/c it's hitting the server, we can also use it to scrape the server.
    * particularly useful when a page contains scripts.
        * when a page contains scripts, you send a request to the server who sends back a response that contains HTML used to render the page 
        * inside that HTML, there may be JS tags/scripts/programs + when the page comes back, you get HTML w/ these scripts.
        * Those scripts might be going back to the server + programmatically get some more data or might be constructing data for use inside your program itself or inside your HTML itself.
        * Python cannot understand JS or jQuery or anything else that shows up in these scripts.
        * Need to somehow make the browser run those scripts, generate the content that's going to appear in the page, + then scrape it (a second-layered step Beautiful Soup can't handle but Selenium can)
    * Selenium **emulates** a browser to get the data back.
    * Is also useful sometimes when a web server detects you're coming from a program + blocks you.
        * use Selenium to emulate a browser (pretend to be Chrome) so the server doesn't know you're coming from a program.
        * Not a great idea b/c doing things surreptitiously is not a great idea.
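
The parse tree that Beautiful Soup builds can be explored one branch at a time. A minimal sketch on an invented page, using the built-in `html.parser` so that it runs without `lxml`:

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>Demo</title></head>"
        "<body><div id='main'><p>One</p><p>Two</p></div></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Walk down the tree one level at a time, using tag names as attributes
main_div = soup.html.body.div
child_names = [child.name for child in main_div.children]

print(soup.title.string)        # text of the title tag
print(main_div["id"])           # attribute lookup on a tag
print(child_names)              # names of the tags directly under the div
print(main_div.p.parent.name)   # every tag keeps a reference to its parent
```

Dotted access such as `soup.html.body.div` always returns the first matching tag at each level; `find` and `find_all`, covered below, are the more flexible way to search.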

<h3>Set up the BeautifulSoup object</h3>

In [5]:
# parse the response content w/ the lxml parser (the notebook displays the resulting object)
BeautifulSoup(response.content,'lxml')

<!DOCTYPE html>
<html>
<head><meta charset="utf-8"/>
<meta content="app-id=312101965" name="apple-itunes-app"/>
<title>Search | Epicurious.com</title>
<link href="//www.epicurious.com" rel="dns-prefetch"/>
<link href="//assets.adobedtm.com" rel="dns-prefetch"/>
<link href="//www.google-analytics.com" rel="dns-prefetch"/>
<link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>
<link href="//static.parsely.com" rel="dns-prefetch"/>
<link href="//cdn.optimizely.com" rel="dns-prefetch"/>
<link href="//condenast.demdex.net" rel="dns-prefetch"/>
<link href="//capture.condenastdigital.com" rel="dns-prefetch"/>
<link href="//pixel.condenastdigital.com" rel="dns-prefetch"/>
<link href="//use.typekit.net" rel="dns-prefetch"/>
<link href="//fonts.typekit.net" rel="dns-prefetch"/>
<link href="//p.typekit.net" rel="dns-prefetch"/>
<link href="//assets.epicurious.com" rel="dns-prefetch"/>
<link href="//ad.doubleclick.net" rel="dns-prefetch"/>
<link href="//pagead2.googlesyndication.com" rel="d

In [6]:
# print out bs4 results in a readable format
results_page = BeautifulSoup(response.content,'lxml')
print(results_page.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="app-id=312101965" name="apple-itunes-app"/>
  <title>
   Search | Epicurious.com
  </title>
  <link href="//www.epicurious.com" rel="dns-prefetch"/>
  <link href="//assets.adobedtm.com" rel="dns-prefetch"/>
  <link href="//www.google-analytics.com" rel="dns-prefetch"/>
  <link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//static.parsely.com" rel="dns-prefetch"/>
  <link href="//cdn.optimizely.com" rel="dns-prefetch"/>
  <link href="//condenast.demdex.net" rel="dns-prefetch"/>
  <link href="//capture.condenastdigital.com" rel="dns-prefetch"/>
  <link href="//pixel.condenastdigital.com" rel="dns-prefetch"/>
  <link href="//use.typekit.net" rel="dns-prefetch"/>
  <link href="//fonts.typekit.net" rel="dns-prefetch"/>
  <link href="//p.typekit.net" rel="dns-prefetch"/>
  <link href="//assets.epicurious.com" rel="dns-prefetch"/>
  <link href="//ad.doubleclick.net" rel="dns-prefetch"/>
  <link 

<h3>BS4 functions</h3>

<h4>find_all finds all instances of a specified tag</h4>
<h4>returns a ResultSet (behaves like a list)</h4>

In [None]:
all_a_tags = results_page.find_all('a')
print(type(all_a_tags))
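
Because the live page may change, here is the same call on a small invented snippet, showing that the result can be counted, indexed, and iterated like a list. The built-in `html.parser` is used so the sketch runs without `lxml`:

```python
from bs4 import BeautifulSoup

html = '<a href="/one">One</a> <a href="/two">Two</a> <span>no link</span>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
hrefs = [a.get("href") for a in links]   # iterate like a list

print(len(links))                  # number of matches
print(links[0].get_text())         # index like a list
print(hrefs)
print(len(soup.find_all("table")))  # no match gives an empty result, not an error
```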

<h4>find finds the first instance of a specified tag</h4>
<h4>returns a bs4 element</h4>


In [None]:
div_tag = results_page.find('div')
print(div_tag)

In [None]:
type(div_tag)


<h4>bs4 functions can be recursively applied on elements</h4>

In [None]:
div_tag.find('a')
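
A sketch of nested searching on an invented snippet: each `find` call only looks inside the element it is called on, and returns `None` when nothing matches. The built-in `html.parser` is used so the sketch runs without `lxml`:

```python
from bs4 import BeautifulSoup

html = ('<div id="outer"><p>Intro</p>'
        '<div id="inner"><a href="/recipe">Recipe</a></div></div>')
soup = BeautifulSoup(html, "html.parser")

outer = soup.find("div")      # first div in document order
inner = outer.find("div")     # search only inside the outer div
link = inner.find("a")        # search only inside the inner div
missing = inner.find("img")   # find returns None when nothing matches

print(outer["id"], inner["id"], link.get_text(), missing)
```

Checking for `None` before chaining another call avoids an `AttributeError` when a page does not contain the expected tag.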

<h4>Both find as well as find_all can be qualified by css selectors</h4>
<li>using selector=value
<li>using a dictionary

In [None]:
#When using this method and looking for 'class' use 'class_' (because class is a reserved word in python)
#Note that we get a list back because find_all returns a list
results_page.find_all('article',class_="recipe-content-card")

In [None]:
#Since we're using a string as the key, the fact that class is a reserved word is not a problem
#We get an element back because find returns an element
results_page.find('article',{'class':'recipe-content-card'})
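
The two styles can be compared side by side on an invented snippet. The sketch also shows `select`, which accepts CSS selector syntax directly; the sample markup and class names other than `recipe-content-card` are made up, and the built-in `html.parser` is used so it runs without `lxml`:

```python
from bs4 import BeautifulSoup

html = """
<article class="recipe-content-card featured"><a href="/r/1">Chili</a></article>
<article class="recipe-content-card"><a href="/r/2">Tofu</a></article>
<article class="ad-card"><a href="/ad">Ad</a></article>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("article", class_="recipe-content-card")
by_dict = soup.find_all("article", {"class": "recipe-content-card"})
by_css = soup.select("article.recipe-content-card")  # CSS selector syntax

names = [card.find("a").get_text() for card in by_css]
print(len(by_keyword), len(by_dict), len(by_css), names)
```

All three calls match the first two articles; an element with several classes still matches when any one of them equals the class searched for.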

<h4>get_text() returns the marked up text (the content) enclosed in a tag.</h4>
<li>returns a string

In [None]:
results_page.find('article',{'class':'recipe-content-card'}).get_text()
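
Text pulled from real pages often carries stray whitespace. `get_text` accepts a separator and a `strip` flag that tidy this up. A small sketch on invented markup, using the built-in `html.parser` so it runs without `lxml`:

```python
from bs4 import BeautifulSoup

html = "<article><h4> Tofu Chili </h4><p> Spicy and quick. </p></article>"
article = BeautifulSoup(html, "html.parser").find("article")

raw_text = article.get_text()                   # text exactly as it appears
clean_text = article.get_text(" ", strip=True)  # strip each piece, join w/ a space

print(repr(raw_text))
print(repr(clean_text))
```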

<h4>get returns the value of a tag attribute</h4>
<li>returns a string

In [None]:
recipe_tag = results_page.find('article',{'class':'recipe-content-card'})
recipe_link = recipe_tag.find('a')
print("a tag:",recipe_link)
link_url = recipe_link.get('href')
print("link url:",link_url)
print(type(link_url))
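
Two details are worth seeing in isolation: `get` returns `None` (or a default you supply) when the attribute is missing, and relative links can be turned into full URLs with `urljoin` from the standard library. A sketch on an invented anchor tag, using the built-in `html.parser` so it runs without `lxml`:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<a href="/recipes/food/views/spicy-lemongrass-tofu-233844">Spicy Lemongrass Tofu</a>'
link = BeautifulSoup(html, "html.parser").find("a")

href = link.get("href")            # value of an existing attribute
title = link.get("title")          # missing attribute: get returns None
fallback = link.get("title", "")   # or a default of your choice
full_url = urljoin("http://www.epicurious.com", href)

print(full_url)
```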

<h1>A function that returns a list containing recipe names, recipe descriptions (if any) and recipe urls</h1>

In [None]:
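def get_recipes(keywords):
    """Return a list of (name, link, description) tuples, or None if the search fails."""
    import requests
    from bs4 import BeautifulSoup
    recipe_list = list()
    url = "http://www.epicurious.com/search/" + keywords
    response = requests.get(url)
    if response.status_code != 200:
        return None
    try:
        results_page = BeautifulSoup(response.content,'lxml')
        recipes = results_page.find_all('article',class_="recipe-content-card")
        for recipe in recipes:
            # look up the link once and reuse it for both the url and the name
            recipe_anchor = recipe.find('a')
            recipe_link = "http://www.epicurious.com" + recipe_anchor.get('href')
            recipe_name = recipe_anchor.get_text()
            # the description is optional: find returns None when it is missing
            description_tag = recipe.find('p',class_='dek')
            recipe_description = description_tag.get_text() if description_tag else ''
            recipe_list.append((recipe_name,recipe_link,recipe_description))
        return recipe_list
    except (AttributeError, TypeError):
        # the page layout differs from what we expected (no link or no href)
        return None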

In [None]:
get_recipes("Tofu chili")

In [None]:
get_recipes('Nothing')
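
Since the live site can change or be unreachable, it helps to test the parsing logic separately from the network call. The sketch below applies the same extraction as `get_recipes` to a hand-written sample; `parse_recipe_cards` and `SAMPLE_RESULTS` are hypothetical names, the sample markup is an assumption about the page layout, and the built-in `html.parser` is used so it runs without `lxml`:

```python
from bs4 import BeautifulSoup

SAMPLE_RESULTS = """
<article class="recipe-content-card">
  <a href="/recipes/a">Tofu Chili</a>
  <p class="dek">A quick weeknight chili.</p>
</article>
<article class="recipe-content-card">
  <a href="/recipes/b">Egg Curry</a>
</article>
"""


def parse_recipe_cards(html):
    """Return (name, link, description) tuples from search-result HTML."""
    soup = BeautifulSoup(html, "html.parser")
    recipes = []
    for card in soup.find_all("article", class_="recipe-content-card"):
        anchor = card.find("a")
        if anchor is None:  # skip cards without a link
            continue
        dek = card.find("p", class_="dek")
        description = dek.get_text(strip=True) if dek else ""
        link = "http://www.epicurious.com" + anchor.get("href", "")
        recipes.append((anchor.get_text(strip=True), link, description))
    return recipes


print(parse_recipe_cards(SAMPLE_RESULTS))
```

If the real page stops matching, updating the sample and this function together makes it easier to see which selector broke.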

<h2>Let's write a function that</h2>
<h3>given a recipe link</h3>
<h3>returns a dictionary containing the ingredients and preparation instructions</h3>

In [None]:
recipe_link = "http://www.epicurious.com" + '/recipes/food/views/spicy-lemongrass-tofu-233844'

In [None]:
def get_recipe_info(recipe_link):
    """Return a dictionary w/ 'ingredients' and 'preparation' lists (empty dictionary on failure)."""
    import requests
    from bs4 import BeautifulSoup
    recipe_dict = dict()
    try:
        response = requests.get(recipe_link)
    except requests.RequestException:
        # network problem or malformed url
        return recipe_dict
    if response.status_code != 200:
        return recipe_dict
    result_page = BeautifulSoup(response.content,'lxml')
    ingredient_list = list()
    prep_steps_list = list()
    for ingredient in result_page.find_all('li',class_='ingredient'):
        ingredient_list.append(ingredient.get_text())
    for prep_step in result_page.find_all('li',class_='preparation-step'):
        prep_steps_list.append(prep_step.get_text().strip())
    recipe_dict['ingredients'] = ingredient_list
    recipe_dict['preparation'] = prep_steps_list
    return recipe_dict

In [None]:
get_recipe_info(recipe_link)
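
The same extraction can be checked offline against a hand-written sample. `parse_recipe_page` and `SAMPLE_RECIPE` are hypothetical names, the sample markup is an assumption about the recipe page layout, and the built-in `html.parser` is used so the sketch runs without `lxml`:

```python
from bs4 import BeautifulSoup

SAMPLE_RECIPE = """
<ul>
  <li class="ingredient">2 stalks lemongrass</li>
  <li class="ingredient">1 block firm tofu</li>
</ul>
<ol>
  <li class="preparation-step"> Chop the lemongrass. </li>
  <li class="preparation-step"> Fry the tofu. </li>
</ol>
"""


def parse_recipe_page(html):
    """Return a dictionary of ingredients and preparation steps."""
    soup = BeautifulSoup(html, "html.parser")
    ingredients = soup.find_all("li", class_="ingredient")
    steps = soup.find_all("li", class_="preparation-step")
    return {
        "ingredients": [li.get_text(strip=True) for li in ingredients],
        "preparation": [li.get_text(strip=True) for li in steps],
    }


recipe = parse_recipe_page(SAMPLE_RECIPE)
print(recipe)
```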

<h2>Construct a list of dictionaries for all recipes</h2>

In [None]:
def get_all_recipes(keywords):
    results = list()
    # get_recipes returns None when the search fails, so fall back to an empty list
    all_recipes = get_recipes(keywords) or list()
    for recipe in all_recipes:
        recipe_dict = get_recipe_info(recipe[1])
        recipe_dict['name'] = recipe[0]
        recipe_dict['description'] = recipe[2]
        results.append(recipe_dict)
    return results

In [None]:
get_all_recipes("Tofu chili")

<h1>Logging in to a web server</h1>

<h2>Get username and password</h2>
<li>Best to store in a file for reuse
<li>You will need to set up your own login and password and place them in a file called wikidata.txt
<li>Line one of the file should contain your username
<li>Line two your password

In [None]:
with open('wikidata.txt') as f:
    contents = f.read().split('\n')
    username = contents[0]
    password = contents[1]
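
A slightly more defensive way to read the two lines is sketched below. It strips stray whitespace and fails with a clear message if the file is incomplete. `read_credentials` is a hypothetical helper, and the demonstration writes made-up credentials to a temporary file rather than touching `wikidata.txt`:

```python
import os
import tempfile


def read_credentials(path):
    """Return (username, password) from the first two lines of a text file."""
    with open(path) as f:
        lines = [line.strip() for line in f.read().splitlines()]
    if len(lines) < 2 or not lines[0] or not lines[1]:
        raise ValueError("expected a username on line one and a password on line two")
    return lines[0], lines[1]


# Demonstration with a throwaway file and made-up credentials
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo_credentials.txt")
    with open(demo_path, "w") as f:
        f.write("example_user\nexample_password\n")
    demo_user, demo_password = read_credentials(demo_path)

print(demo_user)
```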


<h3>Construct an object that contains the data to be sent to the login page</h3>

In [None]:

payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '', # We need to read this from the page
    }

<h3>get the value of the login token</h3>

In [None]:
def get_login_token(response):
    soup = BeautifulSoup(response.text, 'lxml')
    token = soup.find('input',{'name':"wpLoginToken"}).get('value')
    return token
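
The token lookup can be tried without contacting the server by running it on a hand-written form. The sample form and its token value are invented, `extract_hidden_value` is a hypothetical helper that returns `None` instead of raising when the field is absent, and the built-in `html.parser` is used so the sketch runs without `lxml`:

```python
from bs4 import BeautifulSoup

SAMPLE_LOGIN_PAGE = """
<form action="/login" method="post">
  <input name="wpName" type="text"/>
  <input name="wpPassword" type="password"/>
  <input name="wpLoginToken" type="hidden" value="abc123"/>
</form>
"""


def extract_hidden_value(html, field_name):
    """Return the value of the named input field, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # A dictionary is needed here: find(name=...) would search by tag name instead
    field = soup.find("input", {"name": field_name})
    return field.get("value") if field else None


print(extract_hidden_value(SAMPLE_LOGIN_PAGE, "wpLoginToken"))
```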


<h3>Setup a session, login, and get data</h3>

In [None]:
with requests.Session() as s:
    response = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page')
    payload['wpLoginToken'] = get_login_token(response)
    # Send the login request
    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    # Get another page and check if we're still logged in
    response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')
    data = BeautifulSoup(response.content,'lxml')
    # find returns None when the div is missing (for example, if the login failed)
    changes = data.find('div',class_='mw-changeslist')
    print(changes.get_text() if changes else "Watchlist not found; the login may have failed")

<h1>QUIZ</h1>

In [7]:
from bs4 import BeautifulSoup

text = '<div class_="special"> <a href="http://www.somelink.com">Special link</a>'

print(BeautifulSoup(text).find_all('div',class_="special").get_text())



 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))


AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
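
The error message points at the cause: `find_all` returns a ResultSet, while `get_text` is defined on individual tags. One possible corrected version is sketched below, assuming the markup was meant to use the ordinary HTML attribute `class` (the `class_` spelling is only needed for the Python keyword argument) and with a closing div tag added. The built-in `html.parser` is named explicitly, which also avoids the parser warning:

```python
from bs4 import BeautifulSoup

text = '<div class="special"> <a href="http://www.somelink.com">Special link</a></div>'
soup = BeautifulSoup(text, "html.parser")

# Option 1: find returns a single tag, which has get_text
first_text = soup.find("div", class_="special").get_text()

# Option 2: keep find_all and call get_text on each tag in the result
all_texts = [div.get_text() for div in soup.find_all("div", class_="special")]

print(repr(first_text), all_texts)
```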