# Web Scraping using Python

#### Web Scraping: What is it?
We can define it as using a program to get data from the web by pulling content without an API (Application Programming Interface).

Many sites have APIs you can connect to and pull data from, such as the Twitter API. This is great! But sometimes you need data from a site that doesn't have an API.

#### Where is it used?
Really, anywhere it would be appropriate to gather data.

Some people have built web scrapers to look for jobs and find apartments.

Companies may search for email addresses or contact information.

Companies may run competitive analysis on a competitor, for example to see what prices it charges.

Realtors may scrape housing listings

Understand sentiment and words in reviews

Anytime you want data!

##### Consider the Ethics
Read the site's terms of service and its robots.txt file.

https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01
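
As a rough sketch of how a robots.txt check could be automated, the standard library's `urllib.robotparser` can tell you whether a given path is allowed. The robots.txt content, the `my-scraper` user agent, and the example.com URLs below are made up for illustration; in practice you would load the real file from the site you intend to scrape.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "my-scraper" is an assumed user-agent name for this sketch.
allowed = parser.can_fetch("my-scraper", "https://example.com/index.html")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data.html")

print("index.html allowed:", allowed)
print("private/data.html allowed:", blocked)
```

A robots.txt check does not replace reading the terms of service; it only covers what the site asks of automated clients.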

#### Some things you should consider before web scraping a website:

1.) You should check a site's terms and conditions before you scrape them. 

2.) Space out your requests so you don't overload the site's server; flooding a site with requests could get you blocked.

3.) Scrapers break over time. Web pages change their layout all the time, so you'll more than likely have to rewrite your code.

4.) Web pages are usually inconsistent, so more than likely you'll have to clean up the data after scraping it.

5.) Every web page and situation is different, so you'll have to spend time configuring your scraper.
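
Spacing out requests (point 2) can be as simple as sleeping between calls. The helper below is a minimal sketch, not a prescribed approach: `fetch_politely` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `requests.get` so the example runs without network access.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for requests.get; returns a fake page body."""
    return f"<html>content of {url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages), "pages fetched")
```

With the real library you would pass `requests.get` as the `fetch` argument and pick a delay that suits the site.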


### Inspect element of a web page
1.) Go to a web page.

2.) Right-click anywhere on the page.

3.) Select "Inspect" (or "Inspect Element").

4.) You should now see a panel or frame showing the HTML of the web page.

Every "web-scraping job" is going to be unique, because almost every website is unique.

#### Basic components of a WebSite

#### HTML
HTML stands for Hypertext Markup Language, and every website on the internet uses it to display information. Even the Jupyter Notebook system uses it to display this information in your browser. If you right-click on a website and select "View Page Source" you can see the raw HTML of a web page. This is the information that Python will be looking at to grab information from. Let's take a look at a simple webpage's HTML:

    <!DOCTYPE html>  
    <html>  
        <head>
            <title>Title on Browser Tab</title>
        </head>
        <body>
            <h1> Website Header </h1>
            <p> Some Paragraph </p>
        </body>
    </html>
    
Let's break down these components.

Every <tag> indicates a specific block type on the webpage:

    1. <!DOCTYPE html> HTML documents always start with this type declaration, letting the browser know it's an HTML file.
    2. The component blocks of the HTML document are placed between <html> and </html>.
    3. Metadata and script connections (like a link to a CSS file or a JS file) are often placed in the <head> block.
    4. The <title> tag block defines the title of the webpage (it's what shows up in the tab of a website you're visiting).
    5. Between the <body> and </body> tags are the blocks that will be visible to the site visitor.
    6. Headings are defined by the <h1> through <h6> tags, where the number represents the level of the heading (<h1> is the largest).
    7. Paragraphs are defined by the <p> tag; this is essentially just normal text on the website.
    8. <header>, <main>, <footer> denote which part of the page elements belong to.
    9. <a href=""></a> creates a hyperlink in the page.
    10. <ul>, <ol> create lists.
    11. <li> contains items in lists.
    12. <br> inserts a single line break.
    13. <table> for tables, <tr> for table rows, and <td> for table cells.
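
To make the nesting described above concrete, here is a small sketch that prints each opening tag of the sample page at its depth. It uses the standard library's `html.parser` instead of BeautifulSoup, purely to keep the example dependency-free; the `TagOutline` class is a hypothetical helper, and it assumes every tag is explicitly closed (it would miscount void tags such as `<br>`).

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>Title on Browser Tab</title>
    </head>
    <body>
        <h1> Website Header </h1>
        <p> Some Paragraph </p>
    </body>
</html>"""


class TagOutline(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = TagOutline()
parser.feed(SAMPLE_HTML)

# Indent each tag by its depth to show the tree structure.
for depth, tag in parser.outline:
    print("  " * depth + f"<{tag}>")
```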

**Self-closing Tags:**
Most HTML tags require an opening and a closing tag. A few, known as void elements, do not:

    1. <img src=""> creates an image in the page
    
    2. <br> creates a break in the content
    
    3. <input type=""> creates an input field
    
    4. <hr> Creates a line in the page
    
**IDs, Classes**
    
IDs and classes are very similar. These are used to target specific elements.

    1. <h1 id="profile-header"></h1>

    2. <h1 class="subject-header"></h1>

IDs should only be used once on a page. IDs can also be used to bring the user to a specific part of the page. your-site/#profile-picture will load the page near the profile picture.

Classes can be used multiple times on a page.

See More tags [here](https://www.w3schools.com/tags/ref_byfunc.asp)

Learn more HTML [here](https://www.w3schools.com/Html/)

    
#### CSS

CSS stands for Cascading Style Sheets. It is what gives "style" to a website, including colors and fonts, and even some animations! CSS uses attributes such as **id** or **class** to connect an HTML element to a CSS rule, such as a particular color. An **id** must be unique within the HTML document, so it is basically a single-use connection. A **class** defines a general style that can be linked to multiple HTML tags. If you only want a single HTML tag to be red, you would use an id; if you wanted several HTML tags/blocks to be red, you would create a class in your CSS file and then assign it to each of those blocks.

#### To learn more about HTML:

[W3School](http://www.w3schools.com/html/)

[Codecademy](http://www.codecademy.com/tracks/web)

    

Here are three approaches for web scraping which are among the most popular:

**1-** Sending an HTTP request, ordinarily via Requests, to a webpage and then parsing the HTML (ordinarily using BeautifulSoup) which is returned to access the desired information.

ex: Standard web scraping problem, refer to the case study.

**2-** Using tools ordinarily used for automated software testing, primarily Selenium, to access a website's content programmatically.

Typical Use Case: Websites which use Javascript or are otherwise not directly accessible through HTML.

**3-** Scrapy, which can be thought of as more of a general web scraping framework, which can be used to build spiders and scrape data from various websites minimizing repetitions. 

Use Case: Scraping Amazon Reviews.

While you could scrape data using any other programming language as well, Python is commonly used due to its easy syntax as well as the large variety of libraries available for scraping purposes in Python.

Note: The standard combination of Requests + BeautifulSoup is generally the most flexible and easiest to pick up, so we will use it in the examples below. The tools above are not mutually exclusive; you might, for example, get some HTML text with Scrapy or Selenium and then parse it with BeautifulSoup.

### Importing Modules

To get the libraries needed for the examples below, go to your command line and install them with conda install (if you are using the Anaconda distribution) or pip install (for other Python distributions).

1.) requests is used to visit a URL and get a webpage. You can install it by typing *pip install requests* or *conda install requests* (for the Anaconda distribution of Python) in your command prompt.

2.) BeautifulSoup is used to parse HTML and extract the information we need from our web page. You can install it by typing *pip install beautifulsoup4* or *conda install beautifulsoup4* (for the Anaconda distribution of Python) in your command prompt.

3.) lxml is a Pythonic binding for the C libraries libxml2 and libxslt. It combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API that is mostly compatible with the well-known ElementTree API. BeautifulSoup can use it as a fast parser. You can install it by typing *pip install lxml* or *conda install lxml* (for the Anaconda distribution of Python) in your command prompt.


In [5]:
# pip install requests beautifulsoup4 lxml
# conda install requests beautifulsoup4 lxml

In [1]:
import requests
import bs4


### 1- Grabbing the title of a page

To grab the title of a page, you can use the HTML block with the **title** tag.

For this task we will use **www.example.com** which is a website specifically made to serve as an example domain. 

#### Making a request

Requests will allow us to load a webpage into python so that we can parse it and manipulate it. 

In [2]:
# Use the requests library to grab the page
# Note, this may fail if you have a firewall blocking Python/Jupyter 
# Note sometimes you need to run this twice if it fails the first time
res = requests.get("http://www.example.com")


This object is a requests.models.Response object and it actually contains the information from the website, for example:

In [9]:
type(res)


requests.models.Response

In [10]:
print(res.text)


<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

#### Parsing HTML
To analyze the extracted page we'll use **BeautifulSoup** . 

Technically we could use our own custom script to look for items in the string of **res.text** but the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file).


### What is Beautiful Soup?

Beautiful Soup is a Python library for parsing data out of HTML and XML files. It is useful for navigating, searching, and modifying the parse tree. The major concept with Beautiful Soup is that it allows you to access elements of your page by following the CSS structures, such as grabbing all links, all headers, specific classes, or more. It is a powerful library.

Once we grab elements, Python makes it easy to write the elements or relevant components of the elements into other files, such as a CSV, that can be stored in a database or opened in other software.
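
As an illustration of writing scraped values to a CSV, the sketch below uses the standard library's `csv` module. The `headings` list is hard-coded here as a stand-in for text pulled from a page, and the output goes to an in-memory buffer so the example has no side effects.

```python
import csv
import io

# Stand-in for text extracted from a page (assumed values for this sketch).
headings = ["History", "Techniques", "Legal issues"]

buffer = io.StringIO()
writer = csv.writer(buffer)

# Header row first, then one row per scraped item.
writer.writerow(["position", "heading"])
for position, heading in enumerate(headings, start=1):
    writer.writerow([position, heading])

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the same `csv.writer` calls work on a file opened with `open("headings.csv", "w", newline="")`, where the file name is up to you.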

#### Make the Soup

First, we have to turn the website code into a Python object. We have already imported the Beautiful Soup library, so we can start calling some of the methods in the library. Instead of calling **print(res.text)**, pass the text to BeautifulSoup as shown below. This turns the text into a Python object named **soup**.

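An important note: You should specify the parser that Beautiful Soup uses to parse your text. This is done in the second argument of the BeautifulSoup function. If you leave it out, Beautiful Soup picks the best parser it finds installed and issues a warning, so the result can differ between machines. Python's built-in parser is always available, and we can call it using **html.parser**.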

You can also use **lxml** or **html5lib**. This is nicely described in the [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser).
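
One possible way to handle machines where lxml may not be installed is to try it first and fall back to the built-in parser. This is only a sketch: `make_soup` is a hypothetical helper name, and the markup is a shortened stand-in for a real page.

```python
import bs4

# Shortened stand-in for a downloaded page.
markup_text = (
    "<html><head><title>Example Domain</title></head>"
    "<body><p>Hello</p></body></html>"
)


def make_soup(markup):
    """Prefer lxml when it is installed; otherwise use the built-in parser."""
    try:
        return bs4.BeautifulSoup(markup, "lxml")
    except bs4.FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is unavailable.
        return bs4.BeautifulSoup(markup, "html.parser")


soup = make_soup(markup_text)
print(soup.title.string)
```

Different parsers can build slightly different trees from invalid HTML, so it is worth sticking to one parser within a project.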

Using the Beautiful Soup **prettify()** function, we can print the page to see the code printed in a readable and legible manner.


In [162]:
#Using BeautifulSoup you can create a "soup" object that contains all the "ingredients" of the webpage.
soup = bs4.BeautifulSoup(res.text,"lxml")


In [163]:
print(soup.prettify())


<!DOCTYPE html>
<html>
 <head>
  <title>
   Example Domain
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style type="text/css">
   body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
  </style>
 </head>
 <body>
  <div>
   <h1>
    Example Domain
   </h1>
   <p>
    This dom

#### Navigating the Data Structure

Beautiful Soup allows us to navigate the data structure. We called our Beautiful Soup object **soup**, so we can run the Beautiful Soup functions on this object.


In [164]:
# Access the title element
print(soup.title)


<title>Example Domain</title>


In [165]:
# Access the content of the title element
print(soup.title.string)


Example Domain


In [15]:
# Access data in the first 'p' tag
print(soup.p)


<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>


Use the **.select()** method to grab elements with a CSS selector. We are looking for the 'title' tag, so we will pass in 'title'.

select('css selector') --> [List of Tags](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors)

[BeautifulSoup 4 Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)


In [166]:
soup.select('title')


[<title>Example Domain</title>]

In [167]:
soup.select("head > title")


[<title>Example Domain</title>]

In [17]:
type(soup.select('title'))


bs4.element.ResultSet

Notice what is returned here: it's actually a list containing all the title elements (along with their tags). You can use indexing or even looping to grab the elements from the list. Since each item in the list is still a specialized Tag object, we can use method calls to grab just the text.
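
The indexing and looping described above might look like the following sketch. It parses a small inline document (a made-up stand-in modeled on example.com) with `html.parser`, so it runs without a network connection or lxml.

```python
import bs4

# Inline stand-in for a downloaded page.
html_doc = """
<html><body>
<p>First paragraph.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</body></html>
"""

soup = bs4.BeautifulSoup(html_doc, "html.parser")
paragraphs = soup.select("p")  # list-like result, one Tag per match

# Loop over the result to pull out just the text of each element.
texts = [p.get_text() for p in paragraphs]

# Index into the result, then reach into the nested <a> tag's attributes.
first_link = paragraphs[1].a["href"]

print(texts)
print(first_link)
```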

In [170]:
p_tag = soup.select('p')


In [171]:
len(p_tag)


2

In [172]:
p_tag[0]


<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>

In [173]:
p_tag[1]


<p><a href="https://www.iana.org/domains/example">More information...</a></p>

In [20]:
type(p_tag[0])


bs4.element.Tag

In [174]:
p_tag[0].getText()


'This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.'

**find_all(tags, keyword_args, attrs={'attr': 'value'})** --> [List of Tags](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)

The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one <body> tag, it’s a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all, you can use the find() method.


**find(tags, keyword_args, attrs={'attr': 'value'})** --> [Tag](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find)
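
Here is a short sketch of the keyword and `attrs` forms of these two methods. The HTML snippet is invented for the example, reusing the `profile-header` and `subject-header` names from the IDs and classes section.

```python
import bs4

# Invented snippet reusing the id/class names from earlier in this notebook.
html_doc = """
<div>
  <h1 id="profile-header">Profile</h1>
  <h2 class="subject-header">Math</h2>
  <h2 class="subject-header">Physics</h2>
  <a href="/about">About</a>
</div>
"""

soup = bs4.BeautifulSoup(html_doc, "html.parser")

# class is a reserved word in Python, so the keyword form is class_.
by_class = soup.find_all("h2", class_="subject-header")

# The attrs dictionary expresses the same filter.
by_attrs = soup.find_all("h2", attrs={"class": "subject-header"})

# find() returns a single Tag, or None when nothing matches.
by_id = soup.find(id="profile-header")
missing = soup.find("table")

print([tag.get_text() for tag in by_class])
print(by_id.get_text())
print(missing)
```

Because `find()` returns `None` when there is no match, it is worth checking the result before calling methods on it.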

In [179]:
soup.find_all("p")
#soup("p")

[<p>This domain is for use in illustrative examples in documents. You may use this
     domain in literature without prior coordination or asking for permission.</p>,
 <p><a href="https://www.iana.org/domains/example">More information...</a></p>]

In [181]:
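# find() returns the first match as a single Tag;
# find_all('p', limit=1) would return the same tag wrapped in a one-element list
soup.find('p')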

<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>

In [183]:
soup.find("body").find("p")


<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>

In [182]:
soup.find("head").find("title")


<title>Example Domain</title>

### 2- Grabbing all elements of a class


For this task, we grab all the section headings of the Wikipedia article on web scraping from this URL: https://en.wikipedia.org/wiki/Web_scraping

In [22]:
# First get the request
res = requests.get('https://en.wikipedia.org/wiki/Web_scraping')


In [23]:
# Create a soup from request
soup = bs4.BeautifulSoup(res.text,"lxml")


Now it's time to figure out what we are actually looking for. Inspect the element on the page to see that the section headers have the class "mw-headline". Because this is a class and not a plain tag, we need to follow CSS selector syntax. The most common patterns are:

<table>

<thead >
<tr>
<th>
<p>Syntax to pass to the .select() method</p>
</th>
<th>
<p>Match Results</p>
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<p><code>soup.select('div')</code></p>
</td>
<td>
<p>All elements with the <code>&lt;div&gt;</code> tag</p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('#some_id')</code></p>
</td>
<td>
<p>The HTML element containing the <code>id</code> attribute of <code>some_id</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('.notice')</code></p>
</td>
<td>
<p>All the HTML elements with the CSS <code>class</code> named <code>notice</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div span')</code></p>
</td>
<td>
<p>Any elements named <code>&lt;span&gt;</code> that are within an element named <code>&lt;div&gt;</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div &gt; span')</code></p>
</td>
<td>
<p>Any elements named <code class="literal2">&lt;span&gt;</code> that are <span><em >directly</em></span> within an element named <code class="literal2">&lt;div&gt;</code>, with no other element in between</p>
</td>
</tr>
</tbody>
</table>
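
To see the selector patterns from the table side by side, the sketch below runs each one against a small invented document. The `some_id` and `notice` names come from the table; the rest of the markup is made up for illustration.

```python
import bs4

# Invented document exercising each selector pattern from the table.
html_doc = """
<div id="some_id">
  <span class="notice">direct child</span>
  <p><span>nested deeper</span></p>
</div>
<span>outside any div</span>
"""

soup = bs4.BeautifulSoup(html_doc, "html.parser")

all_divs = soup.select("div")          # every <div> element
by_id = soup.select("#some_id")        # the element whose id is some_id
by_class = soup.select(".notice")      # every element with class notice
descendants = soup.select("div span")  # <span> anywhere inside a <div>
children = soup.select("div > span")   # <span> directly inside a <div>

print(len(descendants), "descendant spans")
print(len(children), "direct child span:", children[0].get_text())
```

The span outside the div matches neither of the last two selectors, and the span wrapped in a `<p>` matches only the descendant form.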

In [24]:
soup.select(".mw-headline")


[<span class="mw-headline" id="History">History</span>,
 <span class="mw-headline" id="Techniques">Techniques</span>,
 <span class="mw-headline" id="Human_copy-and-paste">Human copy-and-paste</span>,
 <span class="mw-headline" id="Text_pattern_matching">Text pattern matching</span>,
 <span class="mw-headline" id="HTTP_programming">HTTP programming</span>,
 <span class="mw-headline" id="HTML_parsing">HTML parsing</span>,
 <span class="mw-headline" id="DOM_parsing">DOM parsing</span>,
 <span class="mw-headline" id="Vertical_aggregation">Vertical aggregation</span>,
 <span class="mw-headline" id="Semantic_annotation_recognizing">Semantic annotation recognizing</span>,
 <span class="mw-headline" id="Computer_vision_web-page_analysis">Computer vision web-page analysis</span>,
 <span class="mw-headline" id="Software">Software</span>,
 <span class="mw-headline" id="Legal_issues">Legal issues</span>,
 <span class="mw-headline" id="United_States">United States</span>,
 <span class="mw-headline"

In [25]:
for item in soup.select(".mw-headline"):
    print(item.text)
    

History
Techniques
Human copy-and-paste
Text pattern matching
HTTP programming
HTML parsing
DOM parsing
Vertical aggregation
Semantic annotation recognizing
Computer vision web-page analysis
Software
Legal issues
United States
The EU
Australia
India
Methods to prevent web scraping
See also
References


### 3- Getting an Image from a Website

For this task, we grab an image from the Wikipedia page for the Mona Lisa: https://en.wikipedia.org/wiki/Mona_Lisa (the Cicada 3301 page, https://en.wikipedia.org/wiki/Cicada_3301, works as a second example).

In [129]:
res = requests.get("https://en.wikipedia.org/wiki/Mona_Lisa")


In [124]:
# Example 2:
#res = requests.get("https://en.wikipedia.org/wiki/Cicada_3301")

In [130]:
soup = bs4.BeautifulSoup(res.text,'lxml')

In [133]:
image_info = soup.select('.thumbimage')


In [134]:
image_info


[<img alt="" class="thumbimage" data-file-height="750" data-file-width="2500" decoding="async" height="114" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Mona_Lisa_margin_scribble.jpg/380px-Mona_Lisa_margin_scribble.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Mona_Lisa_margin_scribble.jpg/570px-Mona_Lisa_margin_scribble.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Mona_Lisa_margin_scribble.jpg/760px-Mona_Lisa_margin_scribble.jpg 2x" width="380"/>,
 <img alt="" class="thumbimage" data-file-height="600" data-file-width="410" decoding="async" height="249" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Leonardo_di_ser_Piero_da_Vinci_-_Portrait_de_Mona_Lisa_%28dite_La_Joconde%29_-_Louvre_779_-_Detail_%28right_landscape%29.jpg/170px-Leonardo_di_ser_Piero_da_Vinci_-_Portrait_de_Mona_Lisa_%28dite_La_Joconde%29_-_Louvre_779_-_Detail_%28right_landscape%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Leonardo_di_ser_Pi

In [140]:
len(image_info)


11

In [141]:
image = image_info[2]


In [142]:
type(image)

bs4.element.Tag

You can make dictionary-like calls for parts of the Tag. In this case, we are interested in the **src**, or "source", of the image, which should be its own .jpg or .png link:

In [143]:
image['src']


'//upload.wikimedia.org/wikipedia/commons/thumb/6/64/Leonardo_di_ser_Piero_da_Vinci_-_Portrait_de_Mona_Lisa_%28dite_La_Joconde%29_-_Louvre_779_-_Detail_%28hands%29.jpg/220px-Leonardo_di_ser_Piero_da_Vinci_-_Portrait_de_Mona_Lisa_%28dite_La_Joconde%29_-_Louvre_779_-_Detail_%28hands%29.jpg'

Now that you have the actual src link, you can grab the image with requests.get and read the raw bytes from the .content attribute. Note how we had to add http:// before the link, because the src starts with // and has no scheme. If you don't do this, requests will complain (but it gives you a pretty descriptive error message).
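
Rather than adding the scheme by hand, one option is to let the standard library's `urllib.parse.urljoin` resolve the src against the URL of the page it came from. This is a sketch of that alternative, using the page URL and one of the src values shown in the output above.

```python
from urllib.parse import urljoin

# URL of the page the <img> tag was found on.
page_url = "https://en.wikipedia.org/wiki/Mona_Lisa"

# Scheme-less src value, as found in the tag's attributes.
src = (
    "//upload.wikimedia.org/wikipedia/commons/thumb/e/e8/"
    "Mona_Lisa_margin_scribble.jpg/380px-Mona_Lisa_margin_scribble.jpg"
)

# urljoin fills in the missing scheme from the page URL.
image_url = urljoin(page_url, src)
print(image_url)
```

The same call also handles relative paths such as `/static/logo.png`, which get the page's scheme and host.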

In [105]:
image_link = requests.get('http://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Mona_Lisa_margin_scribble.jpg/380px-Mona_Lisa_margin_scribble.jpg')

In [155]:
image_link = requests.get('http://upload.wikimedia.org/wikipedia/commons/thumb/6/64/Leonardo_di_ser_Piero_da_Vinci_-_Portrait_de_Mona_Lisa_%28dite_La_Joconde%29_-_Louvre_779_-_Detail_%28hands%29.jpg/220px-Leonardo_di_ser_Piero_da_Vinci_-_Portrait_de_Mona_Lisa_%28dite_La_Joconde%29_-_Louvre_779_-_Detail_%28hands%29.jpg')

In [156]:
# The raw content (it's a binary file, meaning we will need to use binary read/write methods for saving it)
image_link.content


b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00\xff\xfe\x00\xa1File source: https://commons.wikimedia.org/wiki/File:Leonardo_di_ser_Piero_da_Vinci_-_Portrait_de_Mona_Lisa_(dite_La_Joconde)_-_Louvre_779_-_Detail_(hands).jpg\xff\xdb\x00C\x00\x06\x04\x05\x06\x05\x04\x06\x06\x05\x06\x07\x07\x06\x08\n\x10\n\n\t\t\n\x14\x0e\x0f\x0c\x10\x17\x14\x18\x18\x17\x14\x16\x16\x1a\x1d%\x1f\x1a\x1b#\x1c\x16\x16 , #&\')*)\x19\x1f-0-(0%()(\xff\xdb\x00C\x01\x07\x07\x07\n\x08\n\x13\n\n\x13(\x1a\x16\x1a((((((((((((((((((((((((((((((((((((((((((((((((((\xff\xc0\x00\x11\x08\x00\xa5\x00\xdc\x03\x01"\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1c\x00\x00\x02\x03\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x03\x00\x01\x04\x05\x06\x07\x08\xff\xc4\x00:\x10\x00\x01\x03\x03\x02\x04\x03\x06\x05\x03\x03\x05\x01\x00\x00\x00\x01\x02\x03\x11\x00\x04!\x121\x05AQa\x13"q\x062\x81\x91\xa1\xb1\x14#B\xc1\xd1R\xe1\xf0\x15br\x16%4S\xd2\xf1\xff\xc4\x00\x19\x01\x00\x03\x01\x01\x01\x00\x00\x00\x00\x00\x00\

**Let's write this to a file. Note the 'wb' mode, which denotes binary writing of the file.**

In [157]:
f = open('my_file.jpg','wb')


In [158]:
f.write(image_link.content)


6383

In [159]:
f.close()
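
The open/write/close steps can also be written with a `with` block, which closes the file automatically even if the write fails. The sketch below uses placeholder bytes instead of a downloaded image and a temporary directory instead of the notebook folder, so it runs on its own; both are assumptions made for the example.

```python
import os
import tempfile

# Placeholder bytes standing in for image_link.content (not a real image).
image_bytes = b"\xff\xd8\xff\xe0 placeholder bytes"

with tempfile.TemporaryDirectory() as folder:
    output_path = os.path.join(folder, "my_file.jpg")

    # 'wb' opens the file for binary writing; the file is closed on exit.
    with open(output_path, "wb") as image_file:
        bytes_written = image_file.write(image_bytes)

    # Read the file back in binary mode to confirm the round trip.
    with open(output_path, "rb") as image_file:
        round_trip = image_file.read()

print(bytes_written, "bytes written")
```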

Now we can display this file right here in the notebook as markdown using:

    <img src="my_file.jpg">
    
Just write the above line in a new markdown cell and it will display the image we just downloaded!

<img src="my_file.jpg">
