# Web Scraping - HTML Basics

In this set of web scraping tutorials, we will use Python to scrape data from the internet. Before we begin, note that not every website can or should be scraped by a bot. If you are unsure whether scraping a website is permitted, check its terms and conditions first (for example, the terms that cover Yahoo Finance: https://www.verizonmedia.com/policies/us/en/verizonmedia/terms/otos/index.html). Unapproved web scraping can have legal ramifications.

We will learn three methods to scrape data from websites.
    
1. **scrapy**
2. Text splitting
3. Pandas **read_html**

We will illustrate using Ford's summary page on Yahoo Finance ('https://finance.yahoo.com/quote/F').

#### Method 1 - scrapy

The **scrapy** module in Python allows us to directly interact with the HTML of a website to extract information.

To use **scrapy**, we must install it first. 

1. In Windows, click on the Windows icon in the lower-left corner of your screen.
2. Search for and open "Anaconda Prompt".
3. Within "Anaconda Prompt", type "pip install scrapy".
4. Exit the "Anaconda Prompt" window.

Let's now import the **Selector** class from the **scrapy** module, which we will use to read the HTML of our website.

In [1]:
#!pip install scrapy

In [2]:
from scrapy.selector import Selector

To use **scrapy**, we must first understand the basic structure of HTML. I will only cover some of the basics.

Below is some simple HTML code.

In [3]:
html = """<html>
    <body>
        <div>
            <p>
                I love programming!
            </p>
            <p id="id1">
                This is the best career!
            </p>
        </div>
        <p class="MFIN290">
            I am learning to code.
            <a href="www.codingisfun.com">Coding is Fun</a>
        </p>
        <p class="MFIN290">
            MFIN Courses Fun!
            <a href="https://merage.uci.edu/programs/masters/master-finance/index.html">MFIN</a>
        <p>
    </body>
</html>"""
print(html)

<html>
    <body>
        <div>
            <p>
                I love programming!
            </p>
            <p id="id1">
                This is the best career!
            </p>
        </div>
        <p class="MFIN290">
            I am learning to code.
            <a href="www.codingisfun.com">Coding is Fun</a>
        </p>
        <p class="MFIN290">
            MFIN Courses Fun!
            <a href="https://merage.uci.edu/programs/masters/master-finance/index.html">MFIN</a>
        <p>
    </body>
</html>
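
The nesting of tags is what the paths in the rest of this section rely on, so it can help to see it as an outline. The sketch below uses Python's built-in `html.parser` module (not **scrapy**) to record each opening tag with its depth and attributes. The `snippet` string and the `OutlineParser` class are hypothetical names introduced for illustration, and the snippet is a shortened, well-formed variant of the HTML above.

```python
from html.parser import HTMLParser

# Hypothetical, shortened and well-formed variant of the tutorial's HTML.
snippet = """<html>
    <body>
        <div>
            <p id="id1">This is the best career!</p>
        </div>
        <p class="MFIN290">
            I am learning to code.
            <a href="www.codingisfun.com">Coding is Fun</a>
        </p>
    </body>
</html>"""


class OutlineParser(HTMLParser):
    """Record each opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then step one level deeper.
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves us back up one level.
        self.depth -= 1


parser = OutlineParser()
parser.feed(snippet)

# Print the tags as an indented outline.
for depth, tag, attrs in parser.outline:
    print("    " * depth + tag, attrs)
```

Each level of indentation in the printed outline corresponds to one step in a path such as `/html/body/div/p`.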


We use the **Selector** class to read our HTML with the following code:

In [4]:
response = Selector(text=html)

We can now call the **.xpath()** method on the **response** variable to extract data from our HTML. An XPath is simply a path that directs us to a specific location within the HTML document. For example, we can navigate to the `<p>` tags nested directly inside the `<div>` tag using the following code:

In [5]:
response.xpath('/html/body/div/p').extract()

['<p>\n                I love programming!\n            </p>',
 '<p id="id1">\n                This is the best career!\n            </p>']

The **.extract()** method returns a list containing every match of the path in our HTML. To obtain only the first match, we can use the **.extract_first()** method.

In [6]:
response.xpath('/html/body/div/p').extract_first()

'<p>\n                I love programming!\n            </p>'

We can also select the first, second, third, etc. tag using bracket notation. Note that XPath positions start at 1, not 0. For example, to obtain the second `<p>` tag inside the `<div>`, we can use the following code:

In [7]:
response.xpath('/html/body/div/p[2]').extract_first()

'<p id="id1">\n                This is the best career!\n            </p>'

To obtain the text contained in a tag, rather than the entire tag, we can append `/text()` to the path. The **.strip()** call removes the surrounding whitespace and newlines:

In [8]:
response.xpath('/html/body/div/p[2]/text()').extract_first().strip()

'This is the best career!'

Alternatively, we can obtain all tags of a specific type (e.g., p, div, etc.), regardless of where they sit in the document, by starting the path with `//`. For example, to obtain all `<p>` tags, we can use the code:

In [9]:
response.xpath('//p').extract()

['<p>\n                I love programming!\n            </p>',
 '<p id="id1">\n                This is the best career!\n            </p>',
 '<p class="MFIN290">\n            I am learning to code.\n            <a href="www.codingisfun.com">Coding is Fun</a>\n        </p>',
 '<p class="MFIN290">\n            MFIN Courses Fun!\n            <a href="https://merage.uci.edu/programs/masters/master-finance/index.html">MFIN</a>\n        </p>',
 '<p>\n    </p>']

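Note that the list ends with an empty `<p>` tag. The last line before `</body>` in our HTML is an opening `<p>` rather than a closing `</p>`, so the parser closes the previous paragraph for us and treats that line as the start of a new, empty paragraph.

Similarly, we can focus on a specific tag using bracket notation. Wrapping `//p` in parentheses applies the position to the full list of matches, so `(//p)[3]` is the third `<p>` tag in the whole document. For example, we can extract the text of the 3rd `<p>` tag using the following code: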

In [10]:
response.xpath('(//p)[3]/text()').extract_first().strip()

'I am learning to code.'

#### Exercise

1. Obtain the second `<a>` tag in the HTML document.
2. Obtain the text contained in the second `<a>` tag.

#### Solution for # 1

In [11]:
response.xpath('(//a)[2]').extract_first()

'<a href="https://merage.uci.edu/programs/masters/master-finance/index.html">MFIN</a>'

#### Solution for # 2

In [12]:
response.xpath('(//a)[2]/text()').extract_first()

'MFIN'