## Extracting Data from the Web
- Extract web data using Yahoo Finance as the source
- Understand the concepts of HTML and unpacking

In [13]:
# urllib.request ships with the Python standard library, so it needs no installation
# !pip install lxml

- parse: break a whole document down into its individual parts and represent them, as a browser does with an HTML page
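
To make the idea of parsing concrete, here is a minimal offline sketch. The HTML string `SAMPLE_HTML` is a made-up snippet (not the Yahoo page), and `lxml.html.fromstring` is used only so that no network access is needed.

```python
from lxml.html import fromstring

# Hypothetical HTML snippet, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Options</h1>
  <a href="/news">News</a>
  <a href="/tech">Tech</a>
</body></html>
"""

# Parsing turns the flat string into a tree of Element objects.
root = fromstring(SAMPLE_HTML)

# Walk the tree and collect the tag name of every element.
tags = [el.tag for el in root.iter()]
print(tags)
```

Each tag in the string becomes one Element in the tree, which is what the later `findall` calls search through.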

In [2]:
from lxml.html import parse
from urllib.request import urlopen

parsed = parse(urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options?ltr=1'))

doc = parsed.getroot()
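
The request above depends on the network and on the page still existing at that address. Below is a hedged sketch of a slightly more defensive fetch: it sets a timeout and a browser-like `User-Agent` header. The helper names `build_request` and `fetch_root` and the header value are assumptions for illustration, and whether the server requires such a header is not confirmed here.

```python
from urllib.request import Request, urlopen

from lxml.html import parse


def build_request(url, user_agent='Mozilla/5.0'):
    """Build a Request that carries a browser-like User-Agent header (assumed value)."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_root(url, timeout=10):
    """Download url and return the root Element of the parsed HTML document."""
    # The with-block closes the connection once parsing has finished.
    with urlopen(build_request(url), timeout=timeout) as response:
        return parse(response).getroot()

# Usage (needs network access):
# doc = fetch_root('http://finance.yahoo.com/q/op?s=AAPL+Options?ltr=1')
```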

In [3]:
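links = doc.findall('.//a')  # XPath: '.' is the current element and '//' matches descendants at any depth, so this finds every <a> tag
links[15:20]  # each Element corresponds to one tag; the document is built from these tags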

[<Element a at 0x170bf380368>,
 <Element a at 0x170bf3803b8>,
 <Element a at 0x170bf380548>,
 <Element a at 0x170bf380638>,
 <Element a at 0x170bf3807c8>]
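
The difference between `'./a'` and `'.//a'` is easier to see on a small tree. This is a sketch on a made-up fragment, not on the Yahoo page.

```python
from lxml.html import fromstring

# Hypothetical fragment: one <a> directly under <div>, one nested inside <p>.
fragment = fromstring('<div><a href="/a">top</a><p><a href="/b">nested</a></p></div>')

direct = fragment.findall('./a')     # direct children of the current element only
anywhere = fragment.findall('.//a')  # descendants at any depth

print(len(direct), len(anywhere))
print([a.get('href') for a in anywhere])
```

Only `'.//a'` reaches the link nested inside `<p>`, which is why it is the right pattern for collecting every link on a page.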

In [4]:
lnk = links[28]
lnk

<Element a at 0x170bf380b38>

- Because this is an Element object, it has attributes. The actual link and its label can be retrieved as follows.
- lnk.get("href")
- lnk.text_content()

In [24]:
lnk.get('href')

'/quote/AAPL/options?strike=147&straddle=false'

In [25]:
lnk.text_content()

'147.00'
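
A small sketch of how `get`, `.text`, and `text_content()` behave, using a made-up link whose label is split between a child tag and trailing text. The markup is hypothetical and chosen only to show the difference.

```python
from lxml.html import fromstring

# Hypothetical link: the label is split between a child <span> and trailing text.
link = fromstring('<a href="/quote/AAPL/options?strike=147"><span>147</span>.00</a>')

print(link.get('href'))     # value of the href attribute
print(link.get('title'))    # None, because the attribute is absent
print(link.text)            # None, because no text precedes the first child
print(link.text_content())  # all text in the element and its descendants
```

`text_content()` is the safer choice for labels, since `.text` misses anything that sits inside or after a child tag.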

In [26]:
# url
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls[-10:] 

['/',
 '/watchlists',
 '/portfolios',
 '/screener',
 '/calendar',
 '/industries',
 '/videos',
 '/news',
 '/personal-finance',
 '/tech']
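
The hrefs above are relative paths. One way to turn them into full addresses is `urllib.parse.urljoin` from the standard library. In this sketch the `BASE_URL` value and the sample list are assumptions for illustration; the `None` entry stands for an `<a>` tag without an `href`, for which `get('href')` returns `None`.

```python
from urllib.parse import urljoin

BASE_URL = 'https://finance.yahoo.com/'  # assumed base address of the site

# Sample hrefs: relative paths, a missing href (None), and an already absolute URL.
relative = ['/', '/watchlists', '/news', None, 'https://example.com/x']

# Skip missing hrefs, then resolve each remaining one against the base address.
absolute = [urljoin(BASE_URL, href) for href in relative if href is not None]
print(absolute)
```

`urljoin` leaves an already absolute URL unchanged, so mixed lists are handled without a special case.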

In [27]:
# table
tables = doc.findall('.//table')
tb = [str(ele) for ele in tables]
tb

['<Element table at 0x1a2a80a66d8>', '<Element table at 0x1a2a80c9408>']

In [29]:
# Print the tag and attrib of each table
for elem in tables:
    print(elem.tag)
    print(elem.attrib)

table
{'data-reactid': '23', 'class': 'calls table-bordered W(100%) Pos(r) Bd(0) Pt(0) list-options'}
table
{'data-reactid': '697', 'class': 'puts table-bordered W(100%) Pos(r) list-options'}


In [31]:
calls = tables[0]
puts = tables[1]
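
Picking the tables by position assumes the page always lists calls first and puts second. A sketch of a lookup by class name instead, on a made-up fragment that imitates the class attributes printed above; `find_table` is a hypothetical helper, not part of lxml.

```python
from lxml.html import fromstring

# Hypothetical fragment that imitates the two class attributes printed above.
SAMPLE = """
<div>
  <table class="calls table-bordered list-options"><tr><td>c</td></tr></table>
  <table class="puts table-bordered list-options"><tr><td>p</td></tr></table>
</div>
"""


def find_table(root, class_name):
    """Return the first <table> whose class attribute contains class_name, else None."""
    for table in root.findall('.//table'):
        # The class attribute is a space-separated list of names.
        if class_name in table.get('class', '').split():
            return table
    return None


root = fromstring(SAMPLE)
calls_table = find_table(root, 'calls')
puts_table = find_table(root, 'puts')
print(calls_table.text_content().strip(), puts_table.text_content().strip())
```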

In [32]:
calls.text_content()  # all of the text inside <table>...</table>

'Contract NameLast Trade DateStrikeLast PriceBidAskChange% ChangeVolumeOpen InterestImplied VolatilityAAPL190322C001200002019-03-19 3:31PM EDT120.0066.400.000.000.00-200.00%AAPL190322C001250002019-03-19 3:31PM EDT125.0061.420.000.000.00-200.00%AAPL190322C001300002019-03-19 3:31PM EDT130.0056.750.000.000.00-200.00%AAPL190322C001350002019-03-19 3:31PM EDT135.0051.490.000.000.00-100.00%AAPL190322C001400002019-03-08 1:10PM EDT140.0031.800.000.000.00-600.00%AAPL190322C001450002019-03-05 6:53PM EDT145.0030.9540.8041.700.00-1010138.28%AAPL190322C001460002019-03-07 11:11AM EDT146.0027.200.000.000.00-1000.00%AAPL190322C001470002019-03-11 10:50AM EDT147.0031.120.000.000.00-100.00%AAPL190322C001480002019-03-15 1:43PM EDT148.0039.500.000.000.00-1000.00%AAPL190322C001490002019-02-07 1:44PM EDT149.0022.3336.6537.700.00-041125.20%AAPL190322C001500002019-03-18 3:30PM EDT150.0037.850.000.000.00-800.00%AAPL190322C001525002019-03-07 1:32PM EDT152.5021.350.000.000.00-200.00%AAPL190322C001550002019-03-15 3

In [43]:
rows = calls.findall('.//tr')

In [48]:
def _unpack(row, tag='td'):
    tds = row.findall('.//%s' % tag)
    return [td.text_content() for td in tds]  # return the cell texts as a list

In [49]:
# cf) Names may contain 1. letters, 2. digits, and 3. _ ; the first character must be a letter or _.
# cf) A list serves as a temporary container for the data.

In [50]:
_unpack(rows[0], tag='th')  # the header row has 11 columns

['Contract Name',
 'Last Trade Date',
 'Strike',
 'Last Price',
 'Bid',
 'Ask',
 'Change',
 '% Change',
 'Volume',
 'Open Interest',
 'Implied Volatility']

In [51]:
_unpack(rows[1], tag='td')  # values of the first data row

['AAPL190322C00120000',
 '2019-03-19 3:31PM EDT',
 '120.00',
 '66.40',
 '0.00',
 '0.00',
 '0.00',
 '-',
 '2',
 '0',
 '0.00%']

In [52]:
# TextParser infers a type for each column of the incoming data; get_chunk() returns the result as a DataFrame.
from pandas.io.parsers import TextParser

def parse_options_data(table):
    rows = table.findall('.//tr')
    header = _unpack(rows[0], tag='th')  # rows[0]: the row that holds the column names
    data = [_unpack(row) for row in rows[1:]]  # rows[1:]: every record, so data is a list of lists
    return TextParser(data, names=header).get_chunk()  # get_chunk() returns the parsed rows as a DataFrame

In [53]:
call_data = parse_options_data(calls)
put_data = parse_options_data(puts)
call_data[:10]

,Contract Name,Last Trade Date,Strike,Last Price,Bid,Ask,Change,% Change,Volume,Open Interest,Implied Volatility
0,AAPL190322C00120000,2019-03-19 3:31PM EDT,120.0,66.4,0.0,0.0,0.0,-,2,0,0.00%
1,AAPL190322C00125000,2019-03-19 3:31PM EDT,125.0,61.42,0.0,0.0,0.0,-,2,0,0.00%
2,AAPL190322C00130000,2019-03-19 3:31PM EDT,130.0,56.75,0.0,0.0,0.0,-,2,0,0.00%
3,AAPL190322C00135000,2019-03-19 3:31PM EDT,135.0,51.49,0.0,0.0,0.0,-,1,0,0.00%
4,AAPL190322C00140000,2019-03-08 1:10PM EDT,140.0,31.8,0.0,0.0,0.0,-,6,0,0.00%
5,AAPL190322C00145000,2019-03-05 6:53PM EDT,145.0,30.95,40.8,41.7,0.0,-,10,10,138.28%
6,AAPL190322C00146000,2019-03-07 11:11AM EDT,146.0,27.2,0.0,0.0,0.0,-,10,0,0.00%
7,AAPL190322C00147000,2019-03-11 10:50AM EDT,147.0,31.12,0.0,0.0,0.0,-,1,0,0.00%
8,AAPL190322C00148000,2019-03-15 1:43PM EDT,148.0,39.5,0.0,0.0,0.0,-,10,0,0.00%
9,AAPL190322C00149000,2019-02-07 1:44PM EDT,149.0,22.33,36.65,37.7,0.0,-,0,41,125.20%
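
In the output above, `Implied Volatility` still holds strings such as `0.00%`. Below is a sketch of one way to convert such columns by hand, using `pd.DataFrame` and `pd.to_numeric` instead of `TextParser`. The two rows and five columns are a subset copied from the output above; treating the percentage as a fraction is an assumption about how the value will be used.

```python
import pandas as pd

# A subset of the columns and rows shown above, kept as raw strings.
header = ['Contract Name', 'Strike', 'Last Price', 'Volume', 'Implied Volatility']
data = [
    ['AAPL190322C00120000', '120.00', '66.40', '2', '0.00%'],
    ['AAPL190322C00145000', '145.00', '30.95', '10', '138.28%'],
]

frame = pd.DataFrame(data, columns=header)

# Convert the plain numeric columns from strings to numbers.
for col in ['Strike', 'Last Price', 'Volume']:
    frame[col] = pd.to_numeric(frame[col])

# Strip the trailing '%' and store the value as a fraction (138.28% -> 1.3828).
frame['Implied Volatility'] = (
    frame['Implied Volatility'].str.rstrip('%').astype(float) / 100
)

print(frame.dtypes)
```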


- This was an exercise in crawling a table from the web.

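## Exercise 1
#### 1) Web data extraction with the lxml library
In part 1) of Section 6.1.5, we created `links`, a list of Elements with the structure <a href=url>link text</a>.
From `links`, create a pandas DataFrame with the structure [(link text, active url), ...].

Note that an active url must not be a relative path that is only meaningful inside the web server; it must be a url that can be typed directly into the user's browser and opened.

>Example of an active url: https://finance.yahoo.com/watchlists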

In [97]:
from lxml.html import parse
from urllib.request import urlopen

parsed = parse(urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options?ltr=1'))

doc = parsed.getroot()