# Data Mining

Data mining is the practice of examining data sources in order to generate new information. First, we will get a feel for web crawling/scraping by extracting some live information from the web. Then, we will see how data APIs generally work.

## 1. Web Scraping with BeautifulSoup

BeautifulSoup is a Python library that comes with a lot of handy functions for web scraping and gathering information from the internet. There are many things you can do with BeautifulSoup, but in this notebook, I'll show you one specific example of how BeautifulSoup can be applied to data mining.

To do this, we will use the Eastern Iowa - Cedar Rapids Airport website as an example. It provides real-time flight status updates for travellers (https://flycid.com/flight-status/). Let's open this website and see what it looks like.

![Cedar Rapids Airport Webpage](https://github.com/stephenbaek/bigdata/blob/master/in-class-assignments/ica03/figures/cid_web.png?raw=1)

### 1.1. Anatomy of a Web Page

Different people have different approaches, but what I usually do is take a look at the anatomy of the web page using my web browser's developer tools. If you use Chrome or Firefox, the developer tools can be opened by pressing `Ctrl + Shift + I` (`Cmd + Option + I` on macOS) or `F12`. If you use Safari, the tool is called Web Inspector and can be opened with `Cmd + Option + I`. For other web browsers, there should be a menu item somewhere, or instructions on the internet.

Now, in the developer tools, you should find the markup that defines the web page. In Chrome, it looks like this:

![Developer Tools in Chrome](https://github.com/stephenbaek/bigdata/blob/master/in-class-assignments/ica03/figures/dev_tools.png?raw=1)

The markup here looks a lot like XML. It is in fact called Hypertext Markup Language, or HTML, which is the standard markup language for web documents. You don't have to know all the HTML tags. However, if you are curious about some basic ones, here's a [nice summary of the most commonly used HTML tags](https://www.geeksforgeeks.org/most-commonly-used-tags-in-html/).

Now, most web browsers highlight the corresponding part of the web document when you hover the mouse cursor over an element in the developer tools.


This is where your job as a data scientist gets less elegant and a little dirty and brute-force (welcome to the real world!): the first step in extracting information from a web document is to figure out exactly where the desired information is located. In this example, after a few minutes of digging (basically hovering the mouse cursor over different parts of the HTML), I found that the flight information was being displayed in an `iframe`, which is basically a web page within a web page.


What this means is that the airport website does not retrieve the flight information itself. Instead, it displays an external web page (https://webservice.prodigiq.com/wfids/CID/small?rows=18) inside the airport web page as if it were part of that page. Long story short, this external page is where all the information we need lives and, hence, where we will do the web scraping.
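
If you would rather not hunt for the `iframe` by hand, here is a minimal sketch of how BeautifulSoup could list the `iframe` sources of a page. The `SAMPLE_HTML` string and the helper name `find_iframe_sources` are made up for illustration; the real airport page is much larger and may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the airport page's HTML.
SAMPLE_HTML = """
<html><body>
  <h1>Flight Status</h1>
  <iframe src="https://webservice.prodigiq.com/wfids/CID/small?rows=18"></iframe>
</body></html>
"""

def find_iframe_sources(html):
    """Return the src attribute of every iframe found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True keeps only iframes that actually carry a src attribute
    return [frame["src"] for frame in soup.find_all("iframe", src=True)]

print(find_iframe_sources(SAMPLE_HTML))
# → ['https://webservice.prodigiq.com/wfids/CID/small?rows=18']
```

Note that this only sees iframes present in the raw HTML; if a page adds them later with JavaScript, they will not show up here.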

### 1.2. Get and Parse HTML

Now that we know where the information lives, let's retrieve the HTML and parse it into something useful. First off, let's retrieve the entire web page.

In [None]:
import requests
page = requests.get("https://webservice.prodigiq.com/wfids/CID/small?rows=18")
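
In practice, a request can hang or come back with an error page, so it may be worth wrapping the call. Below is a small sketch of one way to do that; the helper name `fetch_html` and the 10-second default timeout are my own choices for illustration, not something the website requires.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    # timeout (in seconds) stops the call from waiting forever on a dead server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content

# Example call (needs network access, so it is left commented out):
# html = fetch_html("https://webservice.prodigiq.com/wfids/CID/small?rows=18")
```

The returned bytes can be handed to `BeautifulSoup` in the same way as `page.content`.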

Now, with BeautifulSoup, we parse the page content and display it in the notebook.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
print(soup.prettify())

There is a lot going on, but after some more dirty work digging through the tags, we find that the flight information table lives in a `table` tag with the attribute `class="views-table cols-5"`, which we can search for with BeautifulSoup:

In [None]:
table = soup.find('table', {'class': 'views-table cols-5'})
print(table)
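
One thing to be aware of: passing a string with a space as the `class` value matches the attribute's exact text, so the lookup breaks if the site ever reorders its class names. Here is a small sketch, on made-up HTML, comparing that lookup with a CSS selector via `select_one`, which matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up HTML in which the two class names appear in the opposite order.
SAMPLE_HTML = (
    '<table class="cols-5 views-table">'
    "<tbody><tr><td>AA 1234</td></tr></tbody>"
    "</table>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match on the class attribute: misses the reordered classes.
exact = soup.find("table", {"class": "views-table cols-5"})

# CSS selector: requires both classes, in any order.
by_css = soup.select_one("table.views-table.cols-5")

print(exact is None)       # → True
print(by_css is not None)  # → True
```

Either call returns `None` when nothing matches, so it is worth checking the result before calling `.find('tbody')` on it.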

Furthermore, within the table, it seems that all the flight information is structured within the `tbody` tag.

In [None]:
tbody = table.find('tbody')
tbody

Breaking it down further, each flight sits within a `tr` tag. So, we are going to find all the `tr` tags in `tbody` and collect them in a list. As a crash course in HTML, `tr` is an abbreviation for 'table row' while `td` stands for 'table data' (a single cell, i.e. one column of the row).

In [None]:
trows = tbody.find_all('tr')
trows[0]

As we can see, each table row (`tr`) contains multiple table data cells (`td`). Counting from zero, the `td` at index 1 contains the flight number, index 2 the city of departure, index 3 the arrival time, index 4 the baggage claim, and index 5 the arrival status:

In [None]:
print('{:10s} | {:15s} | {:15s} | {:15s} | {:10s}'.format(
    'Flight', 'Departure City', 'Arrival Time', 'Baggage Claim', 'Status'))
for trow in trows:
    titems = trow.find_all('td')   # find all the data cells in each row
    if len(titems) < 6:            # skip rows without the expected six cells
        continue
    print('{:10s} | {:15s} | {:15s} | {:15s} | {:10s}'.format(
        titems[1].get_text(strip=True),   # flight number
        titems[2].get_text(strip=True),   # city of departure
        titems[3].get_text(strip=True),   # arrival time
        titems[4].get_text(strip=True),   # baggage claim
        titems[5].get_text(strip=True),   # arrival status
    ))
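
Printing is fine for a quick look, but for any further analysis you will probably want the rows in a data structure. Here is a rough sketch that collects each flight into a dictionary. The sample HTML, the flights in it, the column names in `COLUMNS`, and the helper `parse_flights` are all invented for illustration; only the cell positions follow the layout described above.

```python
from bs4 import BeautifulSoup

# Invented sample rows that mimic the layout described above:
# cell 0 is ignored, cells 1-5 hold the flight details.
SAMPLE_HTML = """
<table class="views-table cols-5">
  <tbody>
    <tr><td></td><td>AA 1234</td><td>Chicago</td><td>10:15 AM</td><td>1</td><td>On Time</td></tr>
    <tr><td></td><td>DL 5678</td><td>Atlanta</td><td>11:40 AM</td><td>2</td><td>Delayed</td></tr>
  </tbody>
</table>
"""

# Hypothetical column names, chosen for this example.
COLUMNS = ["flight", "departure_city", "arrival_time", "baggage_claim", "status"]

def parse_flights(html):
    """Turn each table row into a dict keyed by COLUMNS."""
    soup = BeautifulSoup(html, "html.parser")
    flights = []
    for row in soup.select("table.views-table tbody tr"):
        cells = row.find_all("td")
        if len(cells) < 6:   # skip rows that do not have all six cells
            continue
        values = [cell.get_text(strip=True) for cell in cells[1:6]]
        flights.append(dict(zip(COLUMNS, values)))
    return flights

for flight in parse_flights(SAMPLE_HTML):
    print(flight["flight"], "|", flight["status"])
```

A list of dictionaries like this can be passed straight to `pandas.DataFrame(...)` if you prefer working with a table.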

There, now you can retrieve the real-time flight arrival information at the Cedar Rapids Airport!

## 2. Get Live Stock Price using `yahoo_fin` API

As we have seen above, writing scraping/crawling code from scratch involves a lot of dirty, brute-force work. In many cases, however, there are people who have already gone through all of this and quite generously decided to build a set of handy functions that let you skip the hassle. Or, sometimes, the engineers and developers at a company, who actually built the web pages and know exactly how the information is structured, decide to give "nerd users" like me a way to access their data. Whatever the reason, a data set API is basically a predefined set of functions that helps you access the data.

In this quick example, we are going to retrieve stock price data from Yahoo! Finance using the `yahoo_fin` library. Let us first install `yahoo_fin`.

In [None]:
!pip install --upgrade yahoo_fin

Now, the `yahoo_fin` library comes with several modules. In this example, we are going to use the `stock_info` module. Importing it should look like this:

In [None]:
from yahoo_fin import stock_info as si

We are not going to go into too much detail on data set APIs, but here are several things you can do to retrieve stock data.

In [None]:
# Get Netflix (NFLX) stock info from year 2015 to 2018
data = si.get_data('NFLX', start_date='01/01/2015', end_date='12/31/2018')
data

In [None]:
data.index # gives time stamps

In [None]:
data['volume'].values  # gives values of the column named 'volume'

In [None]:
data[['open','close']].values # gives multiple columns
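
Once the data is in hand, the same indexing tricks let you do simple analysis. The sketch below assumes that `data` is a pandas `DataFrame` with a date index and `open`, `close`, and `volume` columns, as the cells above suggest. To keep it runnable offline, it builds a tiny stand-in table with made-up numbers (not real market data).

```python
import pandas as pd

# Made-up prices for illustration only; these are NOT real NFLX quotes.
data = pd.DataFrame(
    {
        "open": [100.0, 102.0, 101.0],
        "close": [102.0, 101.0, 104.0],
        "volume": [1000, 1500, 1200],
    },
    index=pd.to_datetime(["2018-01-02", "2018-01-03", "2018-01-04"]),
)

# Column arithmetic: how much the price moved between open and close each day
daily_change = data["close"] - data["open"]

# idxmax() returns the index label (here, the date) of the largest value
busiest_day = data["volume"].idxmax()

# With a date index, .loc accepts a partial date string such as a month
january = data.loc["2018-01"]

print(daily_change.tolist())  # → [2.0, -1.0, 3.0]
print(busiest_day.date())     # → 2018-01-03
print(len(january))           # → 3
```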