# <font color = red> Tutorial: Web scraping using Python</font>

## <font color = blue>Goal:</font> retrieve stock indices automatically from the Internet

In [1]:
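# import libraries
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3: provides the same urlopen()
from bs4 import BeautifulSoup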

Let's take one page from the Bloomberg Quote website as an example.

Suppose we follow the stock market and want to get the index name (S&P 500) and its price from this page.

First, right-click the page and open your browser's inspector. Hover your cursor over the price and a blue box should appear around it. Click it, and the related HTML will be selected in the inspector.

From the result, we can see that the price is nested inside a few levels of HTML tags:

    <div class="basic-quote"> → <div class="price-container up"> → <div class="price">
  
Similarly, if you hover and click the name "S&P 500 Index", it is inside:

    <div class="basic-quote"> and <h1 class="name"> 
    
Now we know the unique location of our data with the help of the class attributes (an id would also work, but Bloomberg does not use one here).

**To start, declare a variable for the URL of the page.**

In [11]:
# specify the url of S&P 500
quote_page = 'http://www.bloomberg.com/quote/SPX:IND'  

**Then, use Python's urllib2 to get the HTML page at the declared URL.**

In [3]:
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)  

**Finally, parse the page with BeautifulSoup so we can work with its contents.**

In [4]:
# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')  

**Now we have a variable soup containing the HTML of the page. Here's where we can start coding the part that extracts the data.**

**Remember the unique layers of our data? BeautifulSoup can help us get into these layers and extract the content easily by using find(). In this case, since the HTML tag of the name is unique on this page, we can simply query** *h1 class="name"*

In [5]:
# Take out the <h1> of name and get its value
name_box = soup.find('h1', attrs={'class': 'name'})  

**After we have the tag, we can get the data by getting its text.**

In [6]:
name = name_box.text.strip() # strip() removes leading and trailing whitespace
print(name)

S&P 500 Index
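
When a tag is not unique on the page, the layers identified earlier can be followed one `find()` call at a time. The sketch below does this on a small hand-written HTML fragment. The fragment is an assumption modeled on the class names noted above, not Bloomberg's actual markup, and the values in it are placeholders.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the relevant part of the page (an assumption,
# not the real Bloomberg markup).
sample_html = """
<div class="basic-quote">
  <h1 class="name">S&amp;P 500 Index</h1>
  <div class="price-container up">
    <div class="price">2,414.22</div>
  </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Walk down the layers one find() call at a time.
quote = sample_soup.find('div', attrs={'class': 'basic-quote'})
container = quote.find('div', attrs={'class': 'price-container'})
sample_price = container.find('div', attrs={'class': 'price'}).text

# The name sits directly under the outer <div>.
sample_name = quote.find('h1', attrs={'class': 'name'}).text

print(sample_name)
print(sample_price)
```

Searching from `quote` or `container` instead of `soup` limits the match to tags nested inside that element.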


**Similarly, we can retrieve the current price of the S&P 500 index.**

In [7]:
# get the index price
price_box = soup.find('div', attrs={'class':'price'})  
price = price_box.text  
print(price)

2,414.22
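
The price comes back as text with a thousands separator, so it cannot be used in arithmetic directly. Here is a minimal sketch of converting it, assuming US-style formatting (comma for thousands, period for decimals). `parse_price` is a hypothetical helper name, not part of BeautifulSoup.

```python
def parse_price(text):
    """Convert a string such as '2,414.22' to a float.

    Assumes commas are thousands separators and the period is the decimal mark.
    """
    return float(text.strip().replace(',', ''))


price_value = parse_price('2,414.22')
print(price_value)
```

Whether to store the raw text or the converted number in the CSV file is a design choice. The rest of this tutorial keeps the text as scraped.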


# <font color = green>Store data to a CSV file</font>

Import the Python csv module and the datetime module to get the record date. 

In [22]:
import csv  
from datetime import datetime  

# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:  
    writer = csv.writer(csv_file)
    writer.writerow([name, price, datetime.now()])
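
To check that append mode keeps earlier rows, the sketch below writes two rows in separate `open()` calls and reads the file back. It is written for Python 3 (hence `newline=''`) and uses a temporary file rather than `index.csv`. The `append_quote` helper and the sample values are placeholders for illustration.

```python
import csv
import os
import tempfile
from datetime import datetime


def append_quote(path, name, price):
    """Append one (name, price, timestamp) row to the CSV file at `path`."""
    # newline='' lets the csv module control line endings (Python 3).
    with open(path, 'a', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([name, price, datetime.now().isoformat()])


# Use a throwaway file so the example does not touch index.csv.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'index_demo.csv')

append_quote(demo_path, 'S&P 500 Index', '2,414.22')
append_quote(demo_path, 'S&P 500 Index', '2,415.07')  # placeholder second reading

# Read everything back: both rows should be present.
with open(demo_path, newline='') as csv_file:
    rows = list(csv.reader(csv_file))

print(len(rows))
```

The price contains a comma, so the csv module quotes that field when writing and restores it unchanged when reading.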

# <font color = purple> Now, try extracting multiple indices at the same time!</font>


First, change quote_page into a list of URLs.

In [23]:
quote_page = ['http://www.bloomberg.com/quote/SPX:IND', 'http://www.bloomberg.com/quote/CCMP:IND']  

Then we wrap the data extraction code in a for loop, which processes the URLs one by one and stores each result as a tuple in the list data.

In [24]:
# for loop
data = []  
for pg in quote_page:  
    # query the website and return the html to the variable 'page'
    page = urllib2.urlopen(pg)

    # parse the html using Beautiful Soup and store in variable `soup`
    soup = BeautifulSoup(page, 'html.parser')

    # Take out the <h1> of name and get its value
    name_box = soup.find('h1', attrs={'class': 'name'})
    name = name_box.text.strip() # strip() removes leading and trailing whitespace

    # get the index price
    price_box = soup.find('div', attrs={'class':'price'})
    price = price_box.text

    # save the data in tuple
    data.append((name, price))
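
The loop body can also be factored into a function that takes HTML text and returns the tuple, which lets the extraction be tried without a network connection. This is a sketch under assumptions: `extract_quote` is a hypothetical name, the HTML snippets are hand-written stand-ins for downloaded pages with placeholder values, and returning `None` when a tag is missing is only one possible design choice.

```python
from bs4 import BeautifulSoup


def extract_quote(html):
    """Return (name, price) from a quote page, or None if either tag is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    name_box = soup.find('h1', attrs={'class': 'name'})
    price_box = soup.find('div', attrs={'class': 'price'})
    # find() returns None when nothing matches, so check before reading .text
    if name_box is None or price_box is None:
        return None
    return (name_box.text.strip(), price_box.text.strip())


# Hand-written stand-ins for downloaded pages (placeholder values).
sample_pages = [
    '<h1 class="name"> S&amp;P 500 Index </h1><div class="price">2,414.22</div>',
    '<h1 class="name">NASDAQ Composite Index</h1><div class="price">6,210.19</div>',
    '<p>page layout changed</p>',
]

sample_data = []
for html in sample_pages:
    quote = extract_quote(html)
    if quote is not None:
        sample_data.append(quote)

print(sample_data)
```

The third snippet has neither tag, so it is skipped rather than raising an `AttributeError` on `.text`.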

Store the data in a CSV file as before.

In [25]:
# open a csv file with append, so old data will not be erased
with open('multiple_index.csv', 'a') as csv_file:  
    writer = csv.writer(csv_file)
    # write one row per (name, price) tuple
    for name, price in data:
        writer.writerow([name, price, datetime.now()])
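
Because `datetime.now()` is called inside the loop, each row gets a slightly different timestamp. If one timestamp per batch is preferred, a possible variant records the time once and writes all rows with `writerows()`. The sketch targets Python 3, writes to an in-memory buffer instead of `multiple_index.csv`, and uses placeholder data.

```python
import csv
import io
from datetime import datetime

# Placeholder data in the same (name, price) shape as `data` above.
sample_data = [
    ('S&P 500 Index', '2,414.22'),
    ('NASDAQ Composite Index', '6,210.19'),
]

# Take the timestamp once so every row in the batch shares it.
recorded_at = datetime.now().isoformat()

buffer = io.StringIO()  # stands in for the opened CSV file
writer = csv.writer(buffer)
writer.writerows([name, price, recorded_at] for name, price in sample_data)

# Parse the buffer back to inspect what was written.
written_rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(written_rows)
```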