# Web Data Crawling
2017.07.28 @ Lorem Ipsum
## Seungwon Park
yyyyy (at) snu.ac.kr

### Types of good datasets
- Format : `JSON`, `CSV`, ...
- Well-documented / Straightforward directory structure and API
- [CERN Open Data Portal](http://opendata.cern.ch/) : [ROOT](https://root.cern.ch/) framework
- [Sloan Digital Sky Survey (SDSS)](http://www.sdss.org/)


### Types of non-organized data
- No API. GUI-based service.
- [Lottery winning numbers](http://nlotto.co.kr/gameResult.do?method=byWin)
- Misc. data from websites : [Notices](http://physics.snu.ac.kr/xe/underbbs), [Lecture info](http://sugang.snu.ac.kr), ...
- **Today, we will learn how to crawl these.**

# Today's Aim
1. Implement a part of [norazo-lotto](http://swpark.ddns.net/norazo-lotto)'s backend:
  - `urllib` (Python) + cronjob (Linux) ( + PHP frontend)
1. Brief intro to Regex, `BeautifulSoup`.
1. Solve some problems related to web data crawling on [Baekjoon Online Judge(BOJ)](https://acmicpc.net)

### Some HTML tags
- First, let's have a look at some HTML tags.
    - `<a href="http://example.com">Example</a>`
    - `<img src="../images/001.jpg">`
- Use 'View source' or 'Inspect' on your browser!
- Ctrl + F : (perhaps) the fastest way to learn HTML

### Step 0. Find appropriate URL
- Some checklists:
  - Is there really no API?
  - `robots.txt` / Ethical issues?
  - Does this page contain all the data you need?
  - Isn't there a smaller frame / page?
  - (Is this server stable/reachable enough?)
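
The `robots.txt` item on this checklist can be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses a made-up rule set so that it runs offline. The rules, the `my-crawler` user agent, and the `example.com` URLs are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-crawler', 'http://example.com/index.html'))
print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-crawler', 'http://example.com/private/data'))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file instead of parsing a string.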

### Step 1. Get HTML code from URL
Here, we can use `urllib` or `requests`.

In [1]:
import urllib.request
url = 'http://nlotto.co.kr/gameResult.do?method=byWin'
with urllib.request.urlopen(url) as response:
    html = str(response.read(), 'euc-kr')

In [2]:
import requests
response = requests.get('http://nlotto.co.kr/gameResult.do?method=byWin')
response.encoding = 'euc-kr'  # same encoding as in the urllib version
html = response.text
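
A crawler that runs unattended may also want a timeout and some error handling around the download. The following is a minimal sketch under that assumption. The function name `fetch_html` and its default values are hypothetical, not part of the original code.

```python
import urllib.request

def fetch_html(url, encoding='euc-kr', timeout=10):
    """Download `url` and decode it; return None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
    except OSError as err:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        print('Failed to fetch %s : %s' % (url, err))
        return None
    # errors='replace' keeps going if a few bytes are not valid in `encoding`.
    return raw.decode(encoding, errors='replace')
```

Calling `fetch_html(url)` with the lottery URL above would then give either the page source as a string or `None`.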

### Step 2. Extract data from HTML
- First, find [where](http://nlotto.co.kr/gameResult.do?method=byWin) our data can be found.

```html
<meta id="desc" name="description" content="나눔로또 763회 당첨번호 3,8,16,32,34,43+10. 1등 총 8명, 1인당 당첨금액 2,138,130,000원." />
...
<span>(2017년 07월 15일 추첨)</span>
```

- We will take a look at 3 possible solutions - `split()`, RegEx, BeautifulSoup.

### 2-1. Using `split()`

In [3]:
a = 'Lorem Ipsum Dolor Sit'
b = a.split()
print(b)

['Lorem', 'Ipsum', 'Dolor', 'Sit']


In [4]:
a = 'Alice,Bob,,David'
b = a.split(',')
print(b)

['Alice', 'Bob', '', 'David']


In [5]:
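# Date: jump past the title image, then keep the text between '<span>(' and '일 추첨)'
html_date = html.split('<img src="/img/contents/result/wininfo/txt_lotto_num02.gif"  alt="제" />')[1]
html_date = html_date.split('일 추첨)')[0]
html_date = html_date.split('<span>(')[1]
# Winning numbers: keep the meta description text up to its first '.'
html_win = html.split('<meta id="desc" name="description" content="나눔로또 ')[1]
html_win = html_win.split('.')[0]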

### 2-2. Using Regular Expression
- Shorter, more concise code.
- Largely the same syntax across languages.

In [6]:
import re
# '<', '>' and '=' need no escaping; a literal '.' does ('\.'), and '.*?' stops at the first match
html_date = re.findall(r'\((.*?)일 추첨\)</span></h3>', html)[0]
html_win = re.findall(r'<meta id="desc" name="description" content="나눔로또 (.*?)\. 1등 총', html)[0]

In [7]:
import re

# <span>(2017년 07월 15일 추첨)</span></h3>
html_date = re.findall(r'\((.*?)일 추첨\)</span></h3>', html)[0]
# '2017년 07월 15' -> '2017-07-15'
html_date = html_date.replace('년 ','-')
html_date = html_date.replace('월 ','-')

# <meta id="desc" name="description" content="나눔로또 763회 당첨번호 3,8,16,32,34,43+10. 1등 총 8명, 1인당 당첨금액 2,138,130,000원." />
html_win = re.findall(r'<meta id="desc" name="description" content="나눔로또 (.*?)\. 1등 총', html)[0]
# Six winning numbers followed by the bonus number, as a list of strings
num = html_win.split('당첨번호 ')[1].replace('+',',').split(',')
# Round number: the digits before '회'
lottoNo = int(html_win.split('회')[0])

Now, let's check whether the data was parsed correctly.

In [8]:
print('Parsed %d : %s %s' % (lottoNo, html_date, ','.join([str(x).rjust(3) for x in num])))

Parsed 764 : 2017-07-22   7, 22, 24, 31, 34, 36, 15
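
As a possible refinement, capture groups can pull out the round number, the winning numbers, the bonus number, and the date in one pass, without any follow-up `split()` or `replace()` calls. This is only a sketch: the function name `parse_draw` and the dictionary keys are hypothetical, and the patterns assume the page text looks like the sample from Step 2.

```python
import re
from datetime import date

# e.g. '763회 당첨번호 3,8,16,32,34,43+10'
DESC_PATTERN = re.compile(r'(\d+)회 당첨번호 ([\d,]+)\+(\d+)')
# e.g. '(2017년 07월 15일 추첨)'
DATE_PATTERN = re.compile(r'\((\d{4})년 (\d{1,2})월 (\d{1,2})일 추첨\)')

def parse_draw(html):
    """Return a dict describing the draw, or None if the page does not match."""
    m_desc = DESC_PATTERN.search(html)
    m_date = DATE_PATTERN.search(html)
    if m_desc is None or m_date is None:
        return None
    year, month, day = (int(g) for g in m_date.groups())
    return {
        'round': int(m_desc.group(1)),
        'numbers': [int(n) for n in m_desc.group(2).split(',')],
        'bonus': int(m_desc.group(3)),
        'date': date(year, month, day),
    }

# Sample taken from the HTML snippet shown in Step 2.
sample = ('<meta id="desc" name="description" content="나눔로또 763회 당첨번호 '
          '3,8,16,32,34,43+10. 1등 총 8명, 1인당 당첨금액 2,138,130,000원." />\n'
          '<span>(2017년 07월 15일 추첨)</span>')
print(parse_draw(sample))
```

Returning `None` on a mismatch lets the caller notice a layout change instead of crashing on `[0]`.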


### 2-3. Using BeautifulSoup
- BeautifulSoup : Python library for pulling data out of HTML and XML files
- `pip install beautifulsoup4`

#### CSS Selection
- In Google Chrome:
  - 'Inspect' - (Click the desired part) - 'Copy' - 'Copy Selector'
  - This will look like `#desc` or `body > h3 > a`.
- https://www.w3schools.com/cssref/css_selectors.asp

In [9]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

In [10]:
print(soup.select('#desc'))
print(soup.select('body > section > article > article > div > div.lotto_win_number.mt12 > h3 > span'))

[<meta content="나눔로또 764회 당첨번호 7,22,24,31,34,36+15. 1등 총 7명, 1인당 당첨금액 2,459,975,465원." id="desc" name="description"/>]
[<span>(2017년 07월 22일 추첨)</span>]
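
`select()` returns a list of tags, so one more step is needed to reach the actual values. The sketch below uses `select_one()`, attribute access, and `get_text()` on a reduced HTML fragment. The fragment and the shortened selector `div.lotto_win_number h3 > span` are assumptions made so the example runs offline; the real page is nested more deeply.

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical fragment modeled on the output shown above.
sample_html = '''
<html><head>
<meta id="desc" name="description" content="나눔로또 764회 당첨번호 7,22,24,31,34,36+15. 1등 총 7명, 1인당 당첨금액 2,459,975,465원." />
</head><body>
<div class="lotto_win_number mt12"><h3><span>(2017년 07월 22일 추첨)</span></h3></div>
</body></html>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# select_one() returns the first matching tag, or None if nothing matches.
meta = soup.select_one('#desc')
span = soup.select_one('div.lotto_win_number h3 > span')

# Attributes are read like dictionary keys; get_text() returns the inner text.
description = meta['content'] if meta is not None else None
draw_date = span.get_text(strip=True) if span is not None else None

print(description)
print(draw_date)
```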


### Cronjobs

To fully automate the work, we'll use cronjobs.
```bash
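crontab -e   # opens your crontab in an editor; add a line like the one below
# fields: min(0-59), hour(0-23), day of month(1-31), month(1-12), day of week(0-6, Sunday=0)
# '* * * * *' runs every minute; give an absolute path, since cron's PATH is minimal
* * * * * /path/to/update.sh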
```
- Don't forget to make the script executable!
```bash
chmod 755 update.sh
```
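
Because cron re-runs the script on a schedule, the script should be safe to run repeatedly without storing the same draw twice. What `update.sh` actually does is not shown here, so the following is only a sketch of one way to do it: append each draw to a CSV file unless its round number is already present. The function name `save_if_new` and the CSV layout are hypothetical.

```python
import csv
import os

def save_if_new(path, round_no, draw_date, numbers):
    """Append one draw to the CSV at `path` unless that round is already stored.

    Returns True if a row was written, False if the round was already there.
    """
    if os.path.exists(path):
        with open(path, newline='', encoding='utf-8') as f:
            # The first column of every row holds the round number.
            seen = {row[0] for row in csv.reader(f) if row}
        if str(round_no) in seen:
            return False
    with open(path, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow([round_no, draw_date] + list(numbers))
    return True
```

With the variables from the regex section, a call such as `save_if_new('draws.csv', lottoNo, html_date, num)` would add a row only the first time a new round appears.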

### Full demonstration
- Uses PHP Frontend.
- Refer to [`index.php`](https://github.com/seungwonpark/norazo-lotto/blob/master/index.php) at [seungwonpark/norazo-lotto](https://github.com/seungwonpark/norazo-lotto).
- Tip : To run PHP code locally, use [MAMP](https://www.mamp.info/).

<center><img src="images/norazo-lotto-screenshot.png" width="500"></center>

## Practice
Solve Pokemon-related problems :
- [Pokémon Master (포켓몬 마스터) @ BOJ(9987)](https://www.acmicpc.net/problem/9987)
- [Pokémon GO Evolution (포켓몬 GO 진화) @ BOJ(12092)](https://www.acmicpc.net/problem/12092)
- [HM and TM (HM과 TM) @ BOJ(9995)](https://www.acmicpc.net/problem/9995)

## My related projects @ GitHub
All of the following projects are public on [GitHub](https://github.com/seungwonpark).
- [Norazo Lotto scorecard](https://github.com/seungwonpark/norazo-lotto)
- [RSS feed for the SNU Physics and Astronomy board](https://github.com/seungwonpark/SNU_physics_board_rss)
- [Image downloader for sunspot analysis](https://github.com/seungwonpark/SunSpotTracker)
- [Satellite image downloader](https://github.com/seungwonpark/kosc_file_downloader)
- [Macro for finding students with overlapping lectures at Gyeonggi Science High School](https://github.com/seungwonpark/lecture)
- [Plotting an H-R diagram from SDSS data](https://github.com/seungwonpark/HR-Diagram)

## References / Further Reading
- https://beomi.github.io/2017/01/20/HowToMakeWebCrawler/
- https://scotch.io/tutorials/an-introduction-to-regex-in-python
- Web Scraping with Python (Ryan Mitchell)
- Web Scraping with Python (Richard Lawson)
- lxml, scrapy
- https://scrapinghub.com/portia