# Web Scraping

Web scraping is the process of extracting useful information from web pages. Before we get into web scraping we should have a basic understanding of the technologies used to create webpages.



# Web technologies

There's a trio of technologies commonly used in the creation of webpages:

* HTML - used to create the content of the page.
* CSS - used to style the content.
* JavaScript - used to add interactivity to pages, add logic and fetch data.


The browser rendering engine takes the HTML and CSS and renders them into a webpage. Any JavaScript is then executed, which may change the layout of the page or fetch some additional data.




## HTML 
HTML (HyperText Markup Language) consists of HTML elements, written as tags: names surrounded by angle brackets (`<>`). HTML tags usually come in pairs, for example:

```
<tagname>content goes here...</tagname>

```
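To make the opening/closing pairing concrete, here is a minimal sketch using Python's built-in `html.parser` module. The `TagLogger` class name and the sample markup are made up for illustration; it simply records each opening tag, closing tag and piece of text in the order the parser meets them.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every opening tag, closing tag and piece of text seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


logger = TagLogger()
logger.feed("<p>content goes <strong>here</strong></p>")
logger.close()

for event in logger.events:
    print(event)
```

Each `open` event is matched by a later `close` event for the same tag name, with the content in between.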


Below we have an example of what HTML looks like. We can use the IPython HTML magic (`%%html`) to render the HTML in the notebook so we can see how it looks.

In [7]:
%%html
<body>
    <h1>This is a heading</h1>
    <p>Normal text usually going in paragraph tags</p>
    <ul>
        <li>Item 1</li>
        <li><strong>We can bold text by putting it inside strong tags</strong></li>
        <li> <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element">Link to MDN Docs</a> </li>
    </ul>
</body>

HTML elements can also have attributes, which provide a way to add additional information to the element. For example, the `<a>` tag has an `href` attribute which contains a link to another site.

```html
<a href="https://www.javascript.com/" >JavaScript</a>
```
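As a rough illustration of how attributes look from Python, the sketch below parses a single link with Beautiful Soup (the `bs4` package used later in this notebook) and reads its attributes. The `title` attribute and the variable names are assumptions added for the example.

```python
from bs4 import BeautifulSoup

snippet = '<a href="https://www.javascript.com/" title="JS home">JavaScript</a>'
link = BeautifulSoup(snippet, "html.parser").a

print(link.name)           # the tag name
print(link.attrs)          # all attributes as a dictionary
print(link["href"])        # dictionary-style access; raises KeyError if missing
print(link.get("target"))  # .get returns None when the attribute is absent
```

Using `.get` is usually the safer choice when scraping, since not every element carries every attribute.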

Another common tag is the `<div>` tag. Web developers use div tags to divide up pages and to apply styles easily to many elements at once. Div tags often contain a class attribute, which can be targeted with CSS.


```html
  <div class="red">
        <p>Some text that I want red</p>
  </div>
```

A `<style>` tag can contain CSS, which lets us apply styling to the HTML. For illustrative purposes I put the CSS inside the style tag, but usually it lives in a separate file called a stylesheet that is linked to the HTML document using a `<link>` tag. Class attributes can be reused to apply the same styles to many elements. This is in contrast to id attributes, each of which should be used on a single element. Notice that in CSS, class selectors start with a dot, e.g. `.classname`, whereas ID selectors start with a hash, e.g. `#myid`.

In [18]:
%%html
<body>
    <style>
        .red {
            color:red
        }
        
        .hidden {
            display:none
        }
        
        #an_id {
            color: blue
        }
    
    </style>
    <h2 class="hidden">This text will be hidden. Try and remove the hidden class to see what happens.</h2>
    <div class="red">
        <p> This text will be red because it is surrounded by a div that has a class of red. </p>
        <p id="an_id">The styling from IDs has higher priority than classes, hence me being blue. IDs should be unique.</p>
    </div>
    <img src="http://www.catster.com/wp-content/uploads/2017/08/A-fluffy-cat-looking-funny-surprised-or-concerned.jpg" alt="">
</body>
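The same class/id distinction is what we later rely on when scraping. The sketch below uses Beautiful Soup (the `bs4` package used later in this notebook) on a small made-up page: a class can match several elements, while an id is expected to match one. The extra `box` class and the third paragraph are assumptions added for the example.

```python
from bs4 import BeautifulSoup

page = """
<div class="red box">
    <p>First paragraph</p>
    <p id="an_id">Second paragraph</p>
</div>
<p class="red">Third paragraph</p>
"""
soup = BeautifulSoup(page, "html.parser")

red_elements = soup.find_all(class_="red")  # a class may match many elements
unique = soup.find(id="an_id")              # an id should match exactly one

print([el.name for el in red_elements])
print(red_elements[0]["class"])             # class values come back as a list
print(unique.get_text())
```

Because an element can carry several classes, Beautiful Soup returns the `class` attribute as a list rather than a single string.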

HTML is in fact a tree structure, so when describing HTML you may hear words such as parent, child, sibling and ancestor node. A child node is anything directly contained within an element; for example, the p tags within the div are children of the div. Siblings are elements that share the same parent; for example, the two p tags above are siblings. A parent node is the node directly above an element. For more on the HTML tree see [here](https://javascript.info/dom-nodes). There are many HTML elements; a good resource to learn more about them and web tech in general is the MDN (Mozilla Developer Network) [documentation](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).
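These relationships can be explored directly from Python. The following is a small sketch, assuming Beautiful Soup (the `bs4` package used later in this notebook) and a trimmed-down version of the div from the example above.

```python
from bs4 import BeautifulSoup

page = '<div class="red"><p>first</p><p id="an_id">second</p></div>'
soup = BeautifulSoup(page, "html.parser")

div = soup.div
first_p = div.p

children = [child.name for child in div.children]    # direct children of the div
parent_name = first_p.parent.name                    # node directly above the p
sibling = first_p.find_next_sibling("p")             # next p at the same level
ancestors = [node.name for node in first_p.parents]  # every node above the p

print(children)
print(parent_name)
print(sibling["id"])
print(ancestors)
```

The last entry in `ancestors` is the document object itself, which sits at the root of the tree.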

Below is one final example: a table in HTML.

In [2]:
%%html
<table style="width:100%">
  <tr>   
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>

Firstname,Lastname,Age
Jill,Smith,50
Eve,Jackson,94


# Chrome Developer Tools

Before we can scrape a webpage we need to understand its structure, and for this we'll use the Chrome developer tools to inspect it. Most browsers have developer tools, but Chrome's are among the best. The developer tools allow us to inspect an HTML page easily so that we can find the HTML elements we would like to extract information from. They also allow us to look at network requests to see if the page is fetching or sending data via AJAX.

# HTTP

HTTP (Hypertext Transfer Protocol) is a request-response protocol that we use to send and receive information on the internet. The protocol supports many [methods](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods). For example, when you visit a webpage your browser makes a GET request, and the server responds with the HTML page, some metadata (headers) and a status code. Likewise, when you fill out a form on a website and press the submit button, your browser typically sends the form's information to the server using a POST request, and the server responds to confirm it received the request.
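To see what GET and POST requests look like without contacting any server, the sketch below builds two requests with the `requests` library and inspects them before they would be sent. The `example.org` URLs and the form fields are placeholders chosen for illustration.

```python
import requests

# Build (but do not send) a GET request with a query string.
get_request = requests.Request(
    "GET", "https://example.org/search", params={"q": "42"}
).prepare()

# Build (but do not send) a POST request carrying form data.
post_request = requests.Request(
    "POST", "https://example.org/form", data={"firstname": "Jill", "age": "50"}
).prepare()

print(get_request.method, get_request.url)
print(post_request.method, post_request.url)
print(post_request.headers["Content-Type"])
print(post_request.body)
```

A GET request carries its parameters in the URL, whereas a POST request carries the form data in the request body.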


In [1]:
import requests

In [2]:
url = "https://en.wikipedia.org/wiki/42_(number)"

In [3]:
response = requests.get(url) #make get request

The response object has many attributes and methods. Below we look at the status code of the response to check whether it was successful.

In [4]:
response.status_code #200 means successful

200
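Status codes other than 200 are common when scraping. As a minimal sketch, the helper below (the name `describe_status` and the `FAMILIES` table are made up for this example) uses the standard library's `http.HTTPStatus` to turn a code into a readable description.

```python
from http import HTTPStatus

# The first digit of a status code identifies its family.
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    return f"{code} {phrase} ({FAMILIES[code // 100]})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

In a script, calling `response.raise_for_status()` is a convenient way to stop early, since it raises an exception for 4xx and 5xx responses.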

Headers carry additional information when sending and responding to HTTP requests, such as whether the request is coming from a laptop or a mobile device, or which browser was used to make it. In the response headers below we can see some additional information, like the character encoding ('utf-8') and that the content has been gzipped. The browser makes use of this information to render the page correctly.

In [5]:
response.headers

{'X-UA-Compatible': 'IE=Edge', 'X-Cache-Status': 'hit-front', 'P3P': 'CP="This is not a P3P policy! See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'Backend-Timing': 'D=121814 t=1524331584206864', 'Accept-Ranges': 'bytes', 'Link': '</static/images/project-logos/enwiki.png>;rel=preload;as=image;media=not all and (min-resolution: 1.5dppx),</static/images/project-logos/enwiki-1.5x.png>;rel=preload;as=image;media=(min-resolution: 1.5dppx) and (max-resolution: 1.999999dppx),</static/images/project-logos/enwiki-2x.png>;rel=preload;as=image;media=(min-resolution: 2dppx)', 'Content-Length': '47707', 'Set-Cookie': 'WMF-Last-Access=24-Apr-2018;Path=/;HttpOnly;secure;Expires=Sat, 26 May 2018 00:00:00 GMT, WMF-Last-Access-Global=24-Apr-2018;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Sat, 26 May 2018 00:00:00 GMT, GeoIP=US:TX:San_Antonio:29.42:-98.49:v4; Path=/; secure; Domain=.wikipedia.org', 'Cache-Control'
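Headers can be read and set much like a dictionary. The sketch below runs offline: it builds a small header collection by hand and prepares a request with a custom `User-Agent`. The header values, the `my-scraper/0.1` name and the `example.org` URL are all assumptions for illustration.

```python
import requests
from requests.structures import CaseInsensitiveDict

# Response headers behave like a dictionary with case-insensitive keys.
response_headers = CaseInsensitiveDict(
    {"Content-Type": "text/html; charset=UTF-8", "Content-Encoding": "gzip"}
)
print(response_headers["content-type"])
print(response_headers.get("CONTENT-ENCODING"))

# Request headers are passed as a plain dictionary, e.g. to set a User-Agent.
prepared = requests.Request(
    "GET", "https://example.org", headers={"User-Agent": "my-scraper/0.1"}
).prepare()
print(prepared.headers["User-Agent"])
```

The same `headers=` argument can be passed to `requests.get` when a site expects a particular `User-Agent`.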

What we're really interested in is the response's text, in other words the HTML document itself.

In [6]:
html = response.text #the html
print("Number of chars: ",len(html))
html[:1000]

Number of chars:  250772


'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>42 (number) - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"42_(number)","wgTitle":"42 (number)","wgCurRevisionId":837567490,"wgRevisionId":837567490,"wgArticleId":191178,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: Julian–Gregorian uncertainty","All articles with dead external links","Articles with dead external links from December 2017","Articles with permanently dead external links","Articles needing additional references from February 2011","All articles needing additional references","All articles with unsourced statements","

As you can see, it would be very tricky to extract information out of this large HTML document without some help. This is where Beautiful Soup comes in.

# Beautiful soup

Beautiful Soup will help us parse the HTML and return an easy-to-use object. Using this object we can get anything we want from the page. First, however, we need to work out what information we want and what kind of HTML element it is contained in. Does that element always have a certain attribute, or some other unique way to identify it? The easiest way to find out is to use the Chrome developer tools.

In [7]:
from bs4 import BeautifulSoup, NavigableString, Tag,Comment

In [8]:
soup = BeautifulSoup(html,"lxml") #lxml is a fast html parser written in C
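The `lxml` parser is a separate package and may not be installed everywhere. One possible approach, sketched below with a hypothetical helper named `make_soup`, is to fall back to Python's built-in `html.parser` when `lxml` is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


demo = make_soup("<html><head><title>42 (number)</title></head><body></body></html>")
print(demo.title.get_text())
```

The two parsers can treat badly formed HTML slightly differently, so it is worth sticking to one of them within a project.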

## Links

Using `soup.find_all` we can extract all of the `<a>` tags from a page.

In [9]:
a_tags = soup.find_all('a') #get all a tags
a_tags[:5]

[<a id="top"></a>,
 <a href="#mw-head">navigation</a>,
 <a href="#p-search">search</a>,
 <a class="image" href="/wiki/File:Question_book-new.svg"><img alt="" data-file-height="399" data-file-width="512" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/75px-Question_book-new.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/100px-Question_book-new.svg.png 2x" width="50"/></a>,
 <a href="/wiki/Wikipedia:Verifiability" title="Wikipedia:Verifiability">verification</a>]

If we only want `<a>` tags with an `href` attribute, we have to pass `href=True` to `find_all`.

In [11]:
a_tags = soup.find_all('a',href=True)

In [12]:
hrefs = [ a.get('href') for a in a_tags]
# hrefs = [ a.attrs['href'] for a in a_tags]
hrefs[:5]

['#mw-head',
 '#p-search',
 '/wiki/File:Question_book-new.svg',
 '/wiki/Wikipedia:Verifiability',
 '//en.wikipedia.org/w/index.php?title=42_(number)&action=edit']
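Many of these hrefs are relative (`/wiki/...`), protocol-relative (`//en.wikipedia.org/...`) or just fragments (`#mw-head`), so they cannot be requested as they stand. A minimal sketch of resolving them against the page URL with the standard library's `urljoin` (the variable names are illustrative):

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/42_(number)"
raw_hrefs = [
    "#mw-head",
    "/wiki/Wikipedia:Verifiability",
    "//en.wikipedia.org/w/index.php?title=42_(number)&action=edit",
]

# Resolve every href against the URL of the page it was found on.
absolute = [urljoin(base_url, href) for href in raw_hrefs]
for link in absolute:
    print(link)
```

Links that are already absolute pass through `urljoin` unchanged, so it is safe to apply to every href.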

If we don't specify the `<a>` tag, `find_all` will return all elements with an `href` attribute.

In [13]:
href_tags = soup.find_all(href=True) #will also get style sheets

Another option is to use [css selectors](https://www.w3schools.com/cssref/css_selectors.asp). We'll cover more on css selectors later on. 

In [17]:
soup.select('a[href*=numerals]') # select all a tags where the href contains the substring numerals.

[<a href="/wiki/Greek_numerals" title="Greek numerals">Greek numeral</a>,
 <a href="/wiki/Roman_numerals" title="Roman numerals">Roman numeral</a>,
 <a href="/wiki/Arabic_numerals" title="Arabic numerals">Arabic</a>,
 <a href="/wiki/Chinese_numerals" title="Chinese numerals">Chinese</a>,
 <a href="/wiki/Chuvash_numerals" title="Chuvash numerals">Chuvash</a>,
 <a href="/wiki/Hebrew_numerals" title="Hebrew numerals">Hebrew</a>]

## Text

Often we'll want to extract text from a webpage. We can use `get_text` for that.

In [19]:
text = soup.get_text()
text[:1000]

'\n\n\n42 (number) - Wikipedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );\n(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"42_(number)","wgTitle":"42 (number)","wgCurRevisionId":828403215,"wgRevisionId":828403215,"wgArticleId":191178,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: Julian–Gregorian uncertainty","All articles with dead external links","Articles with dead external links from December 2017","Articles with permanently dead external links","Articles needing additional references from February 2011","All articles needing additional references","All articles with unsourced statements","Articles with unsourced statements from February 2011","Articles containing Afrikaans-language text","Articles containing Albanian-l

This also gets lots of text that we don't care about. We'll have to be more specific, for example by only getting the text from the p tags.

In [20]:
p_tags =  soup.find_all('p')
p_tags[0]

<p><b>42</b> (<b>forty-two</b>) is the <a href="/wiki/Natural_number" title="Natural number">natural number</a> that succeeds <a href="/wiki/41_(number)" title="41 (number)">41</a> and precedes <a href="/wiki/43_(number)" title="43 (number)">43</a>.</p>

In [22]:
p_tags[0].get_text()

'42 (forty-two) is the natural number that succeeds 41 and precedes 43.'
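Another way to cut down on unwanted text is to delete the elements that hold code rather than prose before calling `get_text`. The sketch below does this on a small made-up page. Whether script contents show up in `get_text` at all can depend on the Beautiful Soup version, so treat this as a defensive step rather than a required one.

```python
from bs4 import BeautifulSoup

page = """
<html>
  <head>
    <title>Demo</title>
    <script>var hidden = "not for readers";</script>
    <style>p { color: red; }</style>
  </head>
  <body><p>Hello <b>world</b></p></body>
</html>
"""
soup = BeautifulSoup(page, "html.parser")

# Remove elements whose contents are code rather than readable text.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

# Join the remaining strings with spaces and strip surrounding whitespace.
text = soup.get_text(separator=" ", strip=True)
print(text)
```

The `separator` and `strip` arguments help keep words from neighbouring tags from running together.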

## Tables

For extracting tables, the pandas package has a really useful function, `read_html`. It won't work on all HTML tables, but it handles many of them. The table might not always come out formatted correctly, but this is often easy to fix in pandas.

In [23]:
import pandas as pd

In [24]:
tables = pd.read_html("http://www.nanotech-now.com/metric-prefix-table.htm") #download all tables on page

In [25]:
df = tables[0]
df = df.rename(columns=df.iloc[0])
df = df[df.index > 0]
df.head()

Unnamed: 0,Prefix,Symbol,Multiplier,Exponential
1,yotta,Y,1000000000000000000000000,1024
2,zetta,Z,1000000000000000000000,1021
3,exa,E,1000000000000000000,1018
4,peta,P,1000000000000000,1015
5,tera,T,1000000000000,1012
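When `read_html` cannot cope with a table, the rows can be walked by hand with Beautiful Soup. The following is a sketch on the small name/age table from earlier, not a general-purpose table parser; it assumes the first row holds the column names and that no cells span several rows or columns.

```python
from bs4 import BeautifulSoup

table_html = """
<table>
  <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>
"""
soup = BeautifulSoup(table_html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Header cells are <th> and data cells are <td>; accept both.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# Treat the first row as column names and pair them with each data row.
header, *body = rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(records)` if a DataFrame is wanted.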


## Example

When web developers make heavy use of div tags or classes, this often makes our life easier because we can target specific tags or attributes to get the information we desire. However, sometimes this is not the case, and we need to inspect the page with dev tools and try to spot some kind of pattern. In this case I wanted all of the relevant text from a Wikipedia article. The first pattern I noticed was that all of the headings have a class of `mw-headline`.

In [30]:

headings = [ h.get_text() for h in soup.find_all(class_="mw-headline")]
headings

['Mathematics',
 'Science',
 'Technology',
 'Astronomy',
 'Religion',
 'Popular culture',
 "The Hitchhiker's Guide to the Galaxy",
 'Works of Lewis Carroll',
 'Music',
 'Television and film',
 'Video games',
 'Sports',
 'Gaming',
 'Architecture',
 'Other fields',
 'Other languages',
 'References',
 'External links']

Let's use those headings as keys in a dictionary, allowing us to keep track of each section's text. As we mentioned earlier, the HTML document is a tree, which means we can use recursive tree algorithms to traverse it! We will traverse all of the elements, and when we find an element with the class `mw-headline` we'll start storing the text below it under that heading in our dictionary.

In [31]:
d = {heading:"" for heading in headings}
k = "garbage" #add garbage key for stuff we don't care about
d[k] = ""
d

{'Architecture': '',
 'Astronomy': '',
 'External links': '',
 'Gaming': '',
 'Mathematics': '',
 'Music': '',
 'Other fields': '',
 'Other languages': '',
 'Popular culture': '',
 'References': '',
 'Religion': '',
 'Science': '',
 'Sports': '',
 'Technology': '',
 'Television and film': '',
 "The Hitchhiker's Guide to the Galaxy": '',
 'Video games': '',
 'Works of Lewis Carroll': '',
 'garbage': ''}

In [34]:
tag_names = ["li","a","p","span"] #only care about text containing nodes.

In [35]:
#walk every node in the tree (.descendants is the current name for recursiveChildGenerator)
for element in soup.descendants:
    if element.name in tag_names:
        if element.has_attr('class') and 'mw-headline' in element['class']:
            k = element.text
        d[k] += element.text
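One side effect of the loop above is that nested tags contribute their text more than once: an `<a>` inside a `<p>` is added with the paragraph and then again on its own, which is why fragments such as `atomic numbermolybdenum` appear in the output. A possible variant, sketched here on a small made-up page rather than the real article, collects only the string leaves of the tree so that each piece of text is counted once.

```python
from bs4 import BeautifulSoup, NavigableString

page = """
<div>
  <h2><span class="mw-headline">Science</span></h2>
  <p>42 is the atomic number of <a href="/wiki/Molybdenum">molybdenum</a>.</p>
  <h2><span class="mw-headline">Music</span></h2>
  <p>Forty-two is a <a href="/wiki/Song">song</a>.</p>
</div>
"""
soup = BeautifulSoup(page, "html.parser")

parts = {}      # heading -> list of raw text fragments
current = None  # heading we are currently collecting text for

for node in soup.descendants:
    if isinstance(node, NavigableString):
        # Skip the heading's own text and anything before the first heading.
        in_heading = "mw-headline" in node.parent.get("class", [])
        if current is not None and not in_heading:
            parts[current].append(str(node))
    elif "mw-headline" in node.get("class", []):
        current = node.get_text()
        parts[current] = []

# Join the fragments and collapse runs of whitespace.
sections = {k: " ".join("".join(v).split()) for k, v in parts.items()}
print(sections)
```

On the real page some extra filtering would probably still be needed, for example to drop the `[edit]` links.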


In [36]:
d['Architecture']

'Architecture[edit][edit]The architects of the Rockefeller Center in New York City worked daily in the Graybar Building where on "the twenty-fifth floor, one enormous drafting room contained forty-two identical drawing boards, each the size of a six-seat dining room table; another room harboured twelve more, and an additional fourteen stood just outside the principals\' offices at the top of the circular iron staircase connecting 25 to 26".[26]Rockefeller CenterNew York CityGraybar Building[26]In the Rockefeller Center (New York City) there are a total of "forty-two elevators in five separate banks"[27] which carry tenants and visitors to the sixty-six floors.Rockefeller CenterNew York City[27]'

In [37]:
d['Science']

'Science[edit][edit]42 is the atomic number of molybdenum.atomic numbermolybdenum42 is the atomic mass of one of the naturally occurring stable isotopes of calcium.atomic masscalciumThe angle rounded to whole degrees for which a rainbow appears (the critical angle).rainbowIn 1966, mathematician Paul Cooper theorized that the fastest, most efficient way to travel across continents would be to bore a straight hollow tube directly through the Earth, connecting a set of antipodes, remove the air from the tube and fall through.[9] The first half of the journey consists of free-fall acceleration, while the second half consists of an exactly equal deceleration. The time for such a journey works out to be 42\xa0minutes. Even if the tube does not pass through the exact center of the Earth, the time for a journey powered entirely by gravity (known as a gravity train) always works out to be 42\xa0minutes, so long as the tube remains friction-free, as while the force of gravity would be lessened, 

The next step could be to clean the text further. Maybe we want to remove the `[edit]` markers from the text; this can be done with a regex. We may also wish to get the links for the references.

In [38]:
import re 

In [39]:
d = {k: re.sub(r'\[edit\]',"",v) for k,v in d.items()} #raw string avoids invalid escape warnings

In [40]:
d['Science']

'Science42 is the atomic number of molybdenum.atomic numbermolybdenum42 is the atomic mass of one of the naturally occurring stable isotopes of calcium.atomic masscalciumThe angle rounded to whole degrees for which a rainbow appears (the critical angle).rainbowIn 1966, mathematician Paul Cooper theorized that the fastest, most efficient way to travel across continents would be to bore a straight hollow tube directly through the Earth, connecting a set of antipodes, remove the air from the tube and fall through.[9] The first half of the journey consists of free-fall acceleration, while the second half consists of an exactly equal deceleration. The time for such a journey works out to be 42\xa0minutes. Even if the tube does not pass through the exact center of the Earth, the time for a journey powered entirely by gravity (known as a gravity train) always works out to be 42\xa0minutes, so long as the tube remains friction-free, as while the force of gravity would be lessened, the distance
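The text still contains citation markers such as `[9]` and non-breaking spaces (`\xa0`). A small sketch of removing those too is shown below; the sample string is a shortened, hand-assembled fragment in the style of the output above.

```python
import re

raw = "Science[edit]The time for such a journey works out to be 42\xa0minutes.[9]"

# Drop "[edit]" links and numeric citation markers such as "[9]".
cleaned = re.sub(r"\[(?:edit|\d+)\]", "", raw)
# Replace non-breaking spaces with ordinary spaces.
cleaned = cleaned.replace("\xa0", " ")

print(cleaned)
```

The pattern only matches square brackets containing `edit` or digits, so other bracketed text is left alone.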

We can also collect the reference links:

In [42]:
references = soup.find(class_="references")
reference_links = [ a.get('href') for a in references.find_all(class_="external text")]
{ i+1 : link for i,link in enumerate(reference_links)}

{1: 'https://oeis.org/A002378',
 2: 'https://oeis.org/A054377',
 3: 'https://oeis.org/A000108',
 4: 'https://oeis.org/A051867',
 5: 'http://www.mathpages.com/home/kmath255.htm',
 6: 'http://oeis.org/A019283',
 7: 'https://web.archive.org/web/20070519144253/http://www.uni.uiuc.edu/gargoyle/2007/05/zhai_poised_to_represent_unite.htm',
 8: 'https://archive.is/20120721120006/http://www.cbc.ca/health/story/2004/07/20/math_win040720.html',
 9: 'http://adsabs.harvard.edu/abs/1966AmJPh..34...68C',
 10: '//doi.org/10.1119%2F1.1972773',
 11: 'http://www.time.com/time/magazine/article/0,9171,842469,00.html',
 12: 'https://web.archive.org/web/20080512131156/http://www.time.com/time/magazine/article/0%2C9171%2C842469%2C00.html',
 13: 'https://web.archive.org/web/20080602142755/https://www.youtube.com/watch?v=FAFUSbIs5KA',
 14: 'https://www.youtube.com/watch?v=FAFUSbIs5KA',
 15: 'http://www.eternalgadgetry.com/ancient_astronomy.html',
 16: 'http://spiedl.aip.org/getabs/servlet/GetabsServlet?prog=nor

## CSS Selectors

Web developers use CSS selectors to select the elements on a page that they want to style. Hence CSS selectors provide us with a succinct way to specify which information on the HTML page we'd like to extract. The simplest way to get to grips with them is to play one of these games:

* [CSS Diner](https://flukeout.github.io/)
* [CSS Leveler](http://toolness.github.io/css-selector-game/)

Alternatively you could write a simple HTML page and try to style it. For a good cheatsheet on CSS selectors and XPath see [here](http://www.cheetyr.com/css-selectors).
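As a quick reference, the sketch below tries a handful of common selector forms with Beautiful Soup's `select` and `select_one` on a small made-up page. The ids, classes and links are invented for the example.

```python
from bs4 import BeautifulSoup

page = """
<div id="content">
  <ul class="links">
    <li><a href="https://oeis.org/A002378">OEIS</a></li>
    <li><a href="/wiki/Roman_numerals">Roman numeral</a></li>
  </ul>
  <p class="note">A note</p>
</div>
<p>Outside</p>
"""
soup = BeautifulSoup(page, "html.parser")

by_tag = soup.select("p")                  # every <p>
by_class = soup.select("p.note")           # <p> with class "note"
by_id = soup.select_one("#content")        # the element with id "content"
descendants = soup.select("#content a")    # <a> anywhere inside #content
children = soup.select("ul.links > li")    # <li> directly under the list
external = soup.select('a[href^="http"]')  # href starting with "http"

print(len(by_tag), len(by_class), by_id.name)
print([a.get_text() for a in descendants])
print(len(children), [a["href"] for a in external])
```

`select` always returns a list, while `select_one` returns the first match or `None`.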


# Exercises

* **1.** Play one of the CSS Selector Games.
    * [CSS Diner](https://flukeout.github.io/)
    * [CSS Leveler](http://toolness.github.io/css-selector-game/)
* **2.** Write a script that uses pandas `read_html` to get all of the proxies from https://free-proxy-list.net/ and write them to a file.


In the following exercises you may wish to use the proxies you gathered when you make requests to the webpages. This can be done with:

```python
import requests

proxies = {"http": "http://10.10.1.10:3128",
           "https": "http://10.10.1.10:1080"}

requests.get("http://example.org", proxies=proxies)
```




* **3.** Extract all of the title names and links from [Hacker News](https://news.ycombinator.com/).
* **4.** Extract the `src` path for all of the images on [Reddit](https://www.reddit.com/).
* **5.** Bonus: Use Beautiful Soup to scrape a website of your choosing.

# Resources 

* [Beautiful Soup Sentdex](https://www.youtube.com/playlist?list=PLQVvvaa0QuDfV1MIRBOcqClP6VZXsvyZS)
* [Beautiful soup docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [MDN Docs](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)