< [How to Read and Represent Data](../ica02/How_to_Read_and_Represent_Data.ipynb) | Contents (TODO) | [Data Preprocessing and Visualization](../ica04/Data_Preprocessing_and_Visualization.ipynb) >

<a href="https://colab.research.google.com/github/stephenbaek/bigdata/blob/master/in-class-assignments/ica03/Data_Mining.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

# Data Mining

Data mining is the practice of examining data sources in order to generate new information. People say [data is the new oil](https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data) (and some say [otherwise](https://www.forbes.com/sites/bernardmarr/2018/03/05/heres-why-data-is-not-the-new-oil/#1b2106913aa9)...). So we arrive at the ~~cheesy~~ analogy that data mining is like mining for oil.

Nonetheless, what we are going to do today in this notebook is threefold. First, we will see how to access a large collection of public data sets published on a website called Kaggle. Second, we will get a feel for web crawling/scraping by extracting some live information from the web. Last but not least, we will see how data APIs generally work.

## 1. Kaggle Datasets

In short, Kaggle is an online community for data scientists, owned by Google since 2017. The community allows its users (called Kagglers) to publish data sets, build models in a web-based data science environment, enter data science competitions, exchange ideas and code, etc. Around the time Google acquired Kaggle, its user base passed 1,000,000 registered users, spanning 190+ countries, forming the largest and most diverse data science community in the world.

For students of data science in particular, it is an extremely useful source of materials, not just because of the huge number of real-world data sets and data science problems, but also because of the source code, ideas, etc. shared by other Kagglers. In this section, I'll show one of the (many) ways to use Kaggle for your own project.


### 1.1. Getting Started

#### Sign up
First off, you have to sign up (and no, I'm not getting paid by Google for encouraging you to sign up :)). The sign-up process is rather simple, so I won't explain it here. Once you have signed up, go ahead and explore the website. In particular, click the `Competitions` menu at the top to see the list of competitions. Also, make sure to check out the `Datasets` menu, where you can find tens of thousands of real-world data sets for free. Lastly, `Notebooks` can be a great resource for your future projects, for example to get an idea of how to solve a specific problem.

#### Create an API Token
Now, go to `My Account` by clicking the user profile at the top right corner. 
![User settings menu](figures/kaggle_my_account.png)

Scroll down a bit and you will find an API section. Click the `Create New API Token` button.
![Create New Kaggle API Token](figures/create_new_kaggle_api_token.png)

This will download a file named `kaggle.json`. Open it with a text editor. You will see something like this in the file.
```json
{"username":"bigdata","key":"0123456789abcdefghijklmn"}
```
Make sure to keep it open somewhere, as you will need it a few cells later.

### 1.2. Install Python Kaggle Library

Now, in order to use Kaggle on your local machine (or Google Colab), you will need to install the Python Kaggle library, which takes just one line of code.

In [1]:
# This is how you install a python library
!pip install kaggle



The next thing is to provide the Kaggle API token to the library we just installed so that it knows you are a legitimate Kaggler. To do this, copy and paste the contents of the JSON file you downloaded above into the cell below.

In [3]:
# type your Kaggle API token here
token = {"username":"bigdata","key":"0123456789abcdefghijklmn"}

Now, the code below is for advanced users only. You don't have to understand it line by line; just run the cell and safely ignore what's in it. Just remember that the API token needs to be set up like this. When you need to do it again in the future, come back to this notebook, copy the code, and reuse it.

In [4]:
# This cell is only for advanced users. Run this cell and you can safely move on to the next cell
import os
from pathlib import Path
import json
import platform

# creates and places the token file where the Kaggle library looks for it (~/.kaggle/kaggle.json)
home = str(Path.home())
kaggle_root = os.path.join(home, '.kaggle')
os.makedirs(kaggle_root, exist_ok=True)  # unlike os.mkdir, does not fail if the folder already exists
token_path = os.path.join(kaggle_root, 'kaggle.json')
with open(token_path, 'w') as file:
    json.dump(token, file)

# on Mac/Linux, make the key file readable and writable only by the owner;
# on Windows, this only clears the read-only attribute of the file
if platform.system() == 'Windows':
    !attrib -R {token_path}
else:
    !chmod 600 {token_path}
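
As an alternative to writing `kaggle.json`, the Kaggle library can, to my knowledge, also pick up credentials from the environment variables `KAGGLE_USERNAME` and `KAGGLE_KEY`. Below is a minimal sketch of that route, assuming the same placeholder token as above and that the cell runs before any `!kaggle` command is issued:

```python
import os

# Placeholder token copied from the example above; replace it with your own.
token = {"username": "bigdata", "key": "0123456789abcdefghijklmn"}

# Commands started from the notebook inherit these environment variables.
os.environ["KAGGLE_USERNAME"] = token["username"]
os.environ["KAGGLE_KEY"] = token["key"]

print("Kaggle user set to:", os.environ["KAGGLE_USERNAME"])
```

The variables only last for the current session, so nothing is left on disk, but you have to set them again every time the notebook restarts.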

### 1.3. Downloading a Data Set from Kaggle

Downloading a data set from Kaggle takes just one line.
```bash
!kaggle datasets download -d <path-to-dataset> -p <download-location>
```

The path to a data set is what comes after `http://www.kaggle.com/` in the data set URL. For example, somebody gathered avocado prices and published them as a data set on Kaggle, which can be found at https://www.kaggle.com/neuromusic/avocado-prices. So in this case, the path to the data set is `neuromusic/avocado-prices`.
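
If you prefer not to cut the path out of the URL by hand, the rule above can be sketched in a few lines of Python. The helper name `dataset_path_from_url` is made up for this illustration, and the handling of a leading `datasets/` segment is an assumption about how newer Kaggle URLs may look:

```python
from urllib.parse import urlparse

def dataset_path_from_url(url):
    """Return the '<owner>/<dataset-name>' part of a Kaggle data set URL."""
    # .path is everything after the domain name, e.g. '/neuromusic/avocado-prices'
    parts = [part for part in urlparse(url).path.split('/') if part]
    # assumption: some URLs carry an extra 'datasets' segment in front of the owner
    if parts and parts[0] == 'datasets':
        parts = parts[1:]
    return '/'.join(parts[:2])

print(dataset_path_from_url('https://www.kaggle.com/neuromusic/avocado-prices'))
# prints: neuromusic/avocado-prices
```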

Download location means the name of the folder on your hard drive (or Google virtual machine's hard drive, if you're using Colab). Say, you want to create a folder called `data` under the present working directory (where this notebook `ipynb` file is located). You can simply type something like below to download the avocado data set under the said folder.

In [5]:
!kaggle datasets download -d neuromusic/avocado-prices -p data

Downloading avocado-prices.zip to data




  0%|          | 0.00/629k [00:00<?, ?B/s]
100%|##########| 629k/629k [00:00<00:00, 12.3MB/s]


Many of the data sets on Kaggle come as a compressed file (e.g. a zip file in this case). Mac/Linux users can simply type in a code cell:
```bash
!unzip ./data/avocado-prices.zip -d ./data
```
to extract all the contents of the zip file into the data folder. If you are a Windows user, however, the story is a bit different. You will have to unzip the file manually by opening 'File Explorer' and navigating to the folder where you downloaded the data set. If you already have 7-zip or another Windows unzip tool installed, or if you have the Java Development Kit installed, there are ways to do this conveniently in a notebook, without having to do everything manually (see [this](https://stackoverflow.com/questions/1021557/how-to-unzip-a-file-using-the-command-line) for details). However, this is beyond the scope of this class, so I won't dive into the details.

In [8]:
# Unzipping files.
if platform.system() == 'Windows':
    # the present working directory, i.e. the folder this notebook runs in
    data_path = os.path.join(os.getcwd(), 'data')
    print('[IMPORTANT] No automatic unzipping supported on Windows.')
    print('You have to open `File Explorer` and manually unzip `' + data_path + '\\avocado-prices.zip`')
    print('Make sure `avocado.csv` file in the zip file is placed directly under `data` folder:')
    print('|- ica03')
    print('    |- data')
    print('        |- avocado.csv')
    print('    |- Data_Mining.ipynb')
else:
    !unzip ./data/avocado-prices.zip -d ./data

[IMPORTANT] No automatic unzipping supported on Windows.
You have to open `File Explorer` and manually unzip `D:\dev\bigdata\in-class-assignments\ica03\data\avocado-prices.zip`
Make sure `avocado.csv` file in the zip file is placed directly under `data` folder:
|- ica03
    |- data
        |- avocado.csv
    |- Data_Mining.ipynb
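
One way to sidestep the platform difference is Python's standard `zipfile` module, which should behave the same on Windows, Mac, and Linux. The sketch below is only a suggestion, not part of the original workflow; the helper name `unzip_archive` is made up for this illustration, and the file names are the ones used above:

```python
import os
import zipfile

def unzip_archive(zip_path, dest_dir):
    """Extract every file in zip_path into dest_dir and return the file names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)  # creates dest_dir if it does not exist
        return archive.namelist()

# Only runs if the avocado archive was downloaded in the earlier cell.
zip_path = os.path.join('data', 'avocado-prices.zip')
if os.path.exists(zip_path):
    print(unzip_archive(zip_path, 'data'))
```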


Okay, now you are ready to play with the avocado data. The avocado data set comes as a single comma-separated values (CSV) file. As we have seen in the [previous notebook](../ica02/How_to_Read_and_Represent_Data.ipynb), we can use Pandas to open it.

In [9]:
import pandas as pd

DF = pd.read_csv('data/avocado.csv')
DF.head(5)

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


### 1.4. Assignment

Find a data set on Kaggle using the search keyword "iowa". Pick one of the data sets from the search results and download it in this notebook. You are encouraged to download something other than CSV, such as JSON. Read the downloaded data set using Pandas and display its summary statistics (Hint: there was a function called `describe()` [last time](../ica02/How_to_Read_and_Represent_Data.ipynb)).

## 2. Web Scraping with BeautifulSoup

BeautifulSoup is a Python library that comes with a lot of handy functions for web scraping and gathering information from the internet. There are many things you can do with BeautifulSoup, but in this notebook, I'll show you one specific example of how BeautifulSoup can be applied to data mining.

To do this, we will use the Eastern Iowa - Cedar Rapids Airport website as an example. The airport provides real-time flight status updates for travellers (https://flycid.com/flight-status/). Let's open this website and see what it looks like.

TODO: IMAGE HERE

### 2.1. Anatomy of a Web Page

Different people have different approaches, but what I usually do is take a look at the anatomy of the web page using my web browser's developer tools. In Chrome or Firefox, the developer tools can be opened by pressing `ctrl + shift + I` (`cmd + option + I` on a Mac) or `F12`. In Safari, the tool is called Web Inspector and can be opened with `cmd + option + I` once the Develop menu is enabled in the settings. For other web browsers, there should be a menu somewhere, or instructions on the internet.

Now, in the developer tools, you should find the markup that defines the web page. In Chrome, it looks like this:

TODO: IMAGE HERE

The script here looks a lot like the XML we learned in the [previous lecture](https://docs.google.com/presentation/d/17HzZmXP-xWtvgPrPOptM-AEKFnGaUJSzmEiJjz784_c/edit?usp=sharing). It is in fact called Hypertext Markup Language, or HTML, which is the standard markup language for web documents. You don't have to know all the HTML tags. However, if you are curious about some basic ones, here's a [nice summary of the most commonly used HTML tags](https://www.geeksforgeeks.org/most-commonly-used-tags-in-html/).

Now, most web browsers highlight the corresponding part of the web document when you hover the mouse cursor over a piece of HTML in the developer tools, as in the screenshot below.

TODO: IMAGE HERE

This is where your job as a data scientist gets less elegant and a little dirty and brute-force (welcome to the real world!): the first step in extracting information from a web document is to figure out exactly where the desired information is located. In this example, after a few minutes of digging (basically hovering the mouse cursor over different parts of the HTML), I found that the flight information was being displayed in an `iframe`, which is basically a web page within a web page.

TODO: IMAGE HERE

What this means is that the airport website does not retrieve the flight information itself but instead displays an external web page (https://webservice.prodigiq.com/wfids/CID/small?rows=18) inside the airport web page as if it were part of it. Long story short, this external page is where all the information we need lives and, hence, where we will do the web scraping.
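
Hovering in the developer tools is how I found the `iframe`, but the same search can be sketched in code. The example below runs on a tiny hand-written page that stands in for the airport website (the real markup is far more complex, and the name `sample_html` is made up for this illustration). It lists the address of every embedded page:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the airport page, not the real markup.
sample_html = """
<html><body>
  <h1>Flight Status</h1>
  <iframe src="https://webservice.prodigiq.com/wfids/CID/small?rows=18"></iframe>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
# collect the address of every page embedded through an iframe tag
iframe_sources = [frame.get('src') for frame in sample_soup.find_all('iframe')]
print(iframe_sources)
```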

### 2.2. Get and Parse HTML

Now that we know where the information lives, let's retrieve the HTML and parse it into something useful. First off, let's retrieve the entire web page.

In [10]:
import requests
page = requests.get("https://webservice.prodigiq.com/wfids/CID/small?rows=18")

Now, with BeautifulSoup, we parse the page and display it in the notebook.

In [11]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [12]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 9]><html class="lt-ie10"  lang="en" dir="ltr"><![endif]-->
<!--[if (gt IE 9)|(gt IEMobile 7)]><!-->
<html dir="ltr" lang="en" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
 <!--<![endif]-->
 <head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta charset="utf-8"/>
  <title>
   CID Mobile Web
  </title>
 

There is a lot going on, but after some more dirty work digging through the tags, we find that the flight information table lives in a `table` tag with the attribute `class="views-table cols-5"`, which we can search for with BeautifulSoup:

In [13]:
table = soup.find('table', {'class': 'views-table cols-5'})
print(table)

<table class="views-table cols-5">
<thead>
<tr>
<th class="views-field views-field-title">Flight</th>
<th class="views-field views-field-field-destination">City</th>
<th class="views-field views-field-field-scheduled-time">Time</th>
<th class="views-field views-field-field-revised-time">Claim</th>
<th class="views-field views-field-php">Status</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>UA 4789</td>
<td>Chicago - ORD</td>
<td>8:58 am</td>
<td>2</td>
<td class="ontime">Arrived</td>
</tr>
<tr class="even">
<td>DL 3927</td>
<td>Detroit</td>
<td>9:04 am</td>
<td>1</td>
<td class="ontime">Arrived</td>
</tr>
<tr class="odd">
<td>DL 3734</td>
<td>Minneapolis</td>
<td>10:11 am</td>
<td>1</td>
<td class="ontime">On time</td>
</tr>
<tr class="even">
<td>DL 5373</td>
<td>Atlanta</td>
<td>10:39 am</td>
<td>1</td>
<td class="ontime">On time</td>
</tr>
<tr class="odd">
<td>AA 3473</td>
<td>Chicago - ORD</td>
<td>11:04 am</td>
<td>2</td>
<td class="delayed">At 11:15 am</td>
</tr>
<tr class="even

In [20]:
tbody = table.find('tbody')

In [23]:
trows = tbody.find_all('tr')

In [32]:
import pandas as pd
columns = ['Flight', 'Departure City', 'Arrival Time', 'Baggage Claim', 'Status']
# collect the rows first, then build the data frame in one go
# (assigning cell by cell with df['Flight'][i] = ... is unreliable in recent versions of Pandas)
rows = []
for trow in trows:
    titems = trow.find_all('td')
    # text of the five cells: flight, city, time, baggage claim, status
    rows.append([titem.get_text(strip=True) for titem in titems[:5]])
df = pd.DataFrame(rows, columns=columns)

print(df)

     Flight Departure City Arrival Time Baggage Claim       Status
0   AA 9805  Chicago - ORD      8:00 am             2      Arrived
1   DL 3927        Detroit      9:00 am             1      Arrived
2   UA 4601  Chicago - ORD      9:01 am             2      Arrived
3   DL 3734    Minneapolis     10:11 am             1      On time
4   DL 5373        Atlanta     10:34 am             1      On time
5   AA 3473  Chicago - ORD     11:04 am             2  At 11:13 am
6   UA 3794  Chicago - ORD     11:23 am             2      On time
7   AA 4035   Dallas - DFW     12:24 pm             2      On time
8   DL 3878        Detroit     12:49 pm             1      On time
9   AA 5073      Charlotte      1:13 pm             2      On time
10    G4 18      Las Vegas      1:45 pm             1      On time
11   UA 774         Denver      2:03 pm             2      On time
12  DL 4135    Minneapolis      2:04 pm             1      On time
13  AA 3584  Chicago - ORD      2:17 pm             2      On 
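
Since the live page changes every few minutes, it can help to practice the same steps on a fixed piece of HTML. The sketch below uses a hand-shortened copy of the table printed earlier (two rows only) and turns each row into a dictionary keyed by the column headers. The names `sample_html`, `sample_table`, and `flights` are made up for this illustration:

```python
from bs4 import BeautifulSoup

# Shortened copy of the flight table shown earlier in this notebook.
sample_html = """
<table class="views-table cols-5">
<thead><tr><th>Flight</th><th>City</th><th>Time</th><th>Claim</th><th>Status</th></tr></thead>
<tbody>
<tr class="odd"><td>UA 4789</td><td>Chicago - ORD</td><td>8:58 am</td><td>2</td><td class="ontime">Arrived</td></tr>
<tr class="even"><td>DL 3927</td><td>Detroit</td><td>9:04 am</td><td>1</td><td class="ontime">Arrived</td></tr>
</tbody>
</table>
"""

sample_table = BeautifulSoup(sample_html, 'html.parser').find('table', {'class': 'views-table cols-5'})
# column names come from the header cells
headers = [th.get_text(strip=True) for th in sample_table.find('thead').find_all('th')]

flights = []
for row in sample_table.find('tbody').find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    flights.append(dict(zip(headers, cells)))  # pair each header with its cell

print(flights[0])
```

A list of dictionaries like `flights` can be passed directly to `pd.DataFrame(flights)` if you want a table again.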

### 2.3. Assignment: Bongo Bus Arrival at the Downtown Interchange


## 3. Getting Live Stock Prices with the `yahoo_fin` API

In [1]:
!pip install requests_html
!pip install --upgrade yahoo_fin

Requirement already up-to-date: yahoo_fin in c:\users\sbaek\appdata\local\continuum\anaconda3\envs\bigdata\lib\site-packages (0.8.2)


In [2]:
from yahoo_fin import stock_info as si

In [3]:
data = si.get_data('NFLX', start_date='01/01/2015', end_date='12/31/2018')

print(data)

                  open        high         low       close    adjclose  \
date                                                                     
2015-01-02   49.151428   50.331429   48.731430   49.848572   49.848572   
2015-01-05   49.258572   49.258572   47.147144   47.311428   47.311428   
2015-01-06   47.347141   47.639999   45.661430   46.501427   46.501427   
2015-01-07   47.347141   47.421429   46.271427   46.742859   46.742859   
2015-01-08   47.119999   47.835712   46.478573   47.779999   47.779999   
...                ...         ...         ...         ...         ...   
2018-12-21  263.829987  264.500000  241.289993  246.389999  246.389999   
2018-12-24  242.000000  250.649994  233.679993  233.880005  233.880005   
2018-12-26  233.919998  254.500000  231.229996  253.669998  253.669998   
2018-12-27  250.110001  255.589996  240.100006  255.570007  255.570007   
2018-12-28  257.940002  261.910004  249.800003  256.079987  256.079987   

                volume ticker  
date 

In [4]:
data.index # gives time stamps
data['volume'].values  # gives values of the column named 'volume'
data[['open','close']].values # gives multiple columns

array([[ 49.15142822,  49.84857178],
       [ 49.25857162,  47.31142807],
       [ 47.34714127,  46.5014267 ],
       ...,
       [233.91999817, 253.66999817],
       [250.11000061, 255.57000732],
       [257.94000244, 256.07998657]])
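
Note that a notebook cell only displays the value of its last line, which is why only the open/close array appears above. To see what each of the three expressions returns, here is a small sketch on a hand-typed copy of the first three rows of the NFLX output (only the `open` and `close` columns are reproduced, and `sample` is a made-up name for this illustration):

```python
import pandas as pd

# First three rows of the NFLX output shown above, typed in by hand.
sample = pd.DataFrame(
    {
        'open':  [49.151428, 49.258572, 47.347141],
        'close': [49.848572, 47.311428, 46.501427],
    },
    index=pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-06']),
)

print(sample.index)                            # the time stamps
print(sample['close'].values)                  # one column gives a 1-D array
print(sample[['open', 'close']].values.shape)  # two columns give a 2-D array of shape (3, 2)
```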

In [5]:
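import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters

# lets matplotlib handle the Pandas time stamps on the x-axis
register_matplotlib_converters()

data = si.get_data('NFLX', start_date='01/01/2015', end_date='12/31/2018')
# other available columns: 'high', 'low', 'adjclose', 'volume'
plt.plot(data.index, data[['open','close']].values)
plt.legend(['open', 'close'])
plt.title('Netflix Stock Price')
plt.show()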


