# 1. Basics of Working with HTML

- Web Scraping with Python (2nd edition), Hanbit Media

In [1]:
from bs4 import BeautifulSoup

## 1.1 Connecting

In [2]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
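The b'...' prefix in the output above shows that html.read() returns bytes rather than str. A minimal sketch of decoding such a value for readable printing follows; the sample bytes are a shortened stand-in for the real response, and UTF-8 is an assumption about the page's encoding.

```python
# Shortened stand-in for the bytes returned by html.read() (not the full page).
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# Assumption: the page is UTF-8 encoded; decode() turns bytes into str.
text = raw.decode('utf-8')

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)                 # the \n escapes are now real line breaks
```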


## http://pythonscraping.com/pages/page1.html

![urlopen](../images/web_scraping_p25.png)

* Right-click, then View Page Source (Ctrl+U)
* F12 (developer tools)

### python library
- https://docs.python.org/3/library/
- The urlopen function in the request module of the urllib library

### Anaconda installation folder
- Lib folder: standard library => urllib
- pkgs folder: other external packages are installed here

## 1.2 Introduction to BeautifulSoup

### http://www.crummy.com/software/BeautifulSoup/bs4

- parser
    - html.parser
    - lxml: handles errors in HTML structure
    - html5lib
    
### Installing the BeautifulSoup module

- pip install beautifulsoup4

![bs installation](../images/그림15-1-2_p507.png)

### Creating a BeautifulSoup object
- BeautifulSoup(html.read(), 'html.parser')
    - html.read(): reads the HTML text
    - html.parser: the parser

- Other parsers
    - lxml: parses HTML that does not strictly follow the format (pip3 install lxml)
    - html5lib: repairs HTML errors while parsing
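BeautifulSoup also accepts a plain string, which makes it easy to try the constructor without a network connection. The sketch below assumes beautifulsoup4 is installed; sample_html is a hypothetical stand-in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, inlined so no network connection is needed.
sample_html = ("<html><head><title>A Useful Page</title></head>"
               "<body><h1>An Interesting Title</h1></body></html>")

# 'html.parser' ships with Python; 'lxml' and 'html5lib' need a separate install.
bs = BeautifulSoup(sample_html, 'html.parser')

print(bs.title)             # the whole <title> tag
print(bs.title.get_text())  # only the text inside it
print(bs.h1.get_text())
```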

In [23]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
#bs = BeautifulSoup(html, 'html.parser')
print(bs.title)
print(bs.h1)
print(bs.div)

<title>A Useful Page</title>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>


### Extracting tags from a BeautifulSoup object

In [24]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.html.body.h1)
print(bs.body.h1)
print(bs.html.h1)
print(bs.h1)

<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>


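### Handling exceptions when connecting to the web
- The web page cannot be found on the server, or the server returns an error while retrieving it => HTTPError
- The server cannot be found => URLError
- A tag on the bs object does not exist: the tag itself comes back as None, and accessing anything on that None => AttributeError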
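One detail behind the first two cases: HTTPError is a subclass of URLError, so the more specific exception has to be checked first. The sketch below verifies this with the standard library only; describe is a hypothetical helper, and the exceptions are constructed by hand rather than raised by a real request.

```python
import io
from urllib.error import HTTPError, URLError

# HTTPError derives from URLError, so "except URLError" alone would also catch it.
print(issubclass(HTTPError, URLError))  # True

def describe(exc):
    """Hypothetical helper: classify an exception the way ordered except clauses would."""
    if isinstance(exc, HTTPError):   # check the more specific class first
        return "HTTP error"
    if isinstance(exc, URLError):
        return "server not found"
    return "other"

# Hand-built exceptions for illustration; no request is sent.
print(describe(URLError("getaddrinfo failed")))
print(describe(HTTPError("http://example.com", 404, "Not Found", None, io.BytesIO())))
```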

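### HTTP status codes & exception handling
https://ko.wikipedia.org/wiki/HTTP_%EC%83%81%ED%83%9C_%EC%BD%94%EB%93%9C

- 1xx (informational): the request was received and processing continues
- 2xx (success): the request was successfully received, understood, and accepted
- 3xx (redirection): further action is needed to complete the request
- 4xx (client error): the request has bad syntax or cannot be fulfilled
- 5xx (server error): the server failed to fulfill an apparently valid request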
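The five classes above are determined by the first digit of the code. A small sketch using the standard library's http.HTTPStatus; status_class is a hypothetical helper written for this note.

```python
from http import HTTPStatus

def status_class(code):
    """Hypothetical helper: map a status code to its class by the first digit."""
    classes = {1: "informational", 2: "success", 3: "redirection",
               4: "client error", 5: "server error"}
    return classes.get(code // 100, "unknown")

for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found"
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```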

In [3]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("https://pythonscrapingthisurldoesnotexist.com")
except HTTPError as e:
    print("The server returned an HTTP error", e)
except URLError as e:
    print("The server could not be found!", e)
else:
    print(html.read())

The server could not be found! <urlopen error [Errno 11001] getaddrinfo failed>


In [26]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        # h10 is not a real tag on this page, so this evaluates to None;
        # AttributeError is raised only if body itself is missing
        title = bsObj.body.h10
    except AttributeError as e:
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

Title could not be found
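In the getTitle example above, bsObj.body.h10 does not raise anything; it returns None. The AttributeError appears only when something is accessed on that None. A minimal offline sketch, assuming beautifulsoup4 is installed and using hypothetical inline markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
bs = BeautifulSoup("<html><body><h1>An Interesting Title</h1></body></html>",
                   "html.parser")

# A missing tag is simply None; no exception is raised here.
print(bs.body.h10)  # None

# Going one level deeper means reading an attribute of None, which raises.
raised = False
try:
    bs.body.h10.span
except AttributeError as e:
    raised = True
    print("AttributeError:", e)
```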


# 2. Advanced HTML Parsing

## 2.1 HTML tag
  
- http://tcpschool.com/html-tags/intro

## 2.2 CSS (Cascading Style Sheets)
- A style sheet language used to define how HTML elements are displayed on various media
- Handles the design-related parts of an HTML document
- CSS Syntax
![css](../images/css_select.png)
    - The selector points to the HTML elements
    - The declaration includes a CSS property name and a value, separated by a colon
    
- http://tcpschool.com/css/intro

- Selector: HTML element selector
![css](../images/css_example1.png)

- Selector: id selector
![css](../images/css_example2.png)

- Selector: class selector
![css](../images/css_example3.png)

- Selector: group selector
![css](../images/css_example4.png)
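The same four selector kinds can be tried directly from BeautifulSoup through its select() method, which takes a CSS selector string. The markup below is a hypothetical fragment that borrows the class and id names of the warandpeace.html page used later in this section.

```python
from bs4 import BeautifulSoup

# Hypothetical markup using the same class/id names as warandpeace.html.
sample_html = """
<h1>War and Peace</h1>
<div id="text">
<span class="red">Well, Prince</span>
<span class="green">Anna Pavlovna</span>
<span class="green">Prince Vasili</span>
</div>
"""
bs = BeautifulSoup(sample_html, 'html.parser')

print(len(bs.select('span')))      # element selector -> 3
print(len(bs.select('#text')))     # id selector      -> 1
print(len(bs.select('.green')))    # class selector   -> 2
print(len(bs.select('h1, .red')))  # group selector   -> 2
```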

In [27]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs)

<html>
<head>
<style>
.green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
</style>
</head>
<body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">
"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>"
<p></p>
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the firs

![bs](../images/web_scraping_p39.PNG)

In [28]:
nameList = bs.findAll('span', {'class': 'green'})

for name in nameList:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


## 2.2 find(), find_all(), findAll(): extracting specific tags

- findAll(tag, attributes, recursive, text, limit, keywords)
- find_all(tag, attributes, recursive, text, limit, keywords)
- find(tag, attributes, recursive, text, keywords)

    - tag: a tag name string, or a list of tag names
    - attributes: a dictionary of attributes
    - recursive: True or False; whether to search children, grandchildren, and deeper descendants recursively
    - text: searches by text content
    - limit: returns only the first limit matches
    - keywords: selects tags that have a particular attribute
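A short offline sketch of the recursive, keyword, and limit arguments, assuming beautifulsoup4 is installed. The markup is hypothetical; class_ is the keyword form bs4 uses because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one span is nested inside <p>, the other two are direct children.
sample_html = """
<div id="text">
<span class="red">quote</span>
<p><span class="green">Anna Pavlovna</span></p>
<span class="green">the prince</span>
</div>
"""
bs = BeautifulSoup(sample_html, 'html.parser')
div = bs.find('div', {'id': 'text'})

# recursive=False looks only at direct children, so the span inside <p> is skipped.
print(len(div.find_all('span')))                   # 3
print(len(div.find_all('span', recursive=False)))  # 2

# Keyword form of an attribute filter.
print([s.get_text() for s in bs.find_all('span', class_='green')])

# find() returns the same tag as the first result of find_all(..., limit=1).
print(bs.find_all('span', limit=1)[0] == bs.find('span'))  # True
```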

In [29]:
titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
print([title for title in titles])

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]


In [30]:
# limit: search only up to the first limit matches
allText = bs.find_all('span', {'class':{'green', 'red'}},limit=3)
print([text.get_text() for text in allText])
#print([text for text in allText])

["Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don't tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy 'faithful slave,' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news.", 'Anna\nPavlovna Scherer', 'Empress Marya\nFedorovna']


In [31]:
# text parameter: matches against the text content.
nameList = bs.find_all(text='the prince')
print(nameList)
print(len(nameList))

['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
7


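## 2.3 Navigating the tree
- Used when a tag has to be found by its position in the document

### Children and descendants
- children: the direct children of the currently selected tag
- descendants: every descendant of the currently selected tag (the scope findAll searches)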
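The difference between the two is easiest to see by counting. The sketch below uses a hypothetical two-row table written on a single line, so no "\n" text nodes are mixed in; it assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, written on one line so there are no "\n" text nodes.
sample_html = ('<table id="giftList"><tr><th>Item</th></tr>'
               '<tr><td>Vegetable Basket</td></tr></table>')
bs = BeautifulSoup(sample_html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

children = list(table.children)        # the two <tr> tags
descendants = list(table.descendants)  # <tr>, <th>, "Item", <tr>, <td>, "Vegetable Basket"

print(len(children))     # 2
print(len(descendants))  # 6
```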

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.prettify())

In [None]:
# child tags
aa= list(bs.children)
aa[2]

In [None]:
for child in bs.find('table',{'id':'giftList'}).children:
    print(child)

In [None]:
# descendants
print(len(list(bs.find('table',{'id':'giftList'}).descendants)))
print(len(list(bs.find('table',{'id':'giftList'}).children)))            # also counts "\n" text nodes

### Sibling nodes
- next_siblings, next_sibling
- previous_siblings, previous_sibling
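A minimal offline sketch of the plural and singular forms, assuming beautifulsoup4 is installed. The table is hypothetical and written without line breaks, so every sibling is a tr tag.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a header row followed by two product rows, no whitespace nodes.
sample_html = ('<table id="giftList"><tr><th>Item</th></tr>'
               '<tr><td>Vegetable Basket</td></tr>'
               '<tr><td>Russian Nesting Dolls</td></tr></table>')
bs = BeautifulSoup(sample_html, 'html.parser')
first_row = bs.find('table', {'id': 'giftList'}).tr

# next_siblings is a generator over everything after first_row at the same level.
rows = [row.get_text() for row in first_row.next_siblings]
print(rows)

# The singular forms return one node, or None when nothing exists on that side.
print(first_row.previous_sibling)         # None
print(first_row.next_sibling.get_text())  # Vegetable Basket
```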

In [None]:
# every product row in the table except the first (title) row
for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)

### Working with parents
- parent, parents

In [None]:
print(bs.find('img',
              {'src':'../img/gifts/img1.jpg'})
      .parent.previous_sibling.get_text())

![bs_parent](../images/web_scraping_p48.PNG)
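The same parent-then-sibling walk can be reproduced offline. The row below is a hypothetical reduction of the gift table (a price cell followed by an image cell); it assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical row: a price cell followed by an image cell.
sample_html = ('<table><tr><td>$15.00</td>'
               '<td><img src="../img/gifts/img1.jpg"></td></tr></table>')
bs = BeautifulSoup(sample_html, 'html.parser')

img = bs.find('img', {'src': '../img/gifts/img1.jpg'})

print(img.parent.name)                         # td: the cell holding the image
print(img.parent.previous_sibling.get_text())  # $15.00: the cell just before it

# parents walks upward from the tag toward the document root.
ancestor_names = [p.name for p in img.parents]
print(ancestor_names)
```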

### Accessing attributes
- tag.attrs['src'] => returns the value of the tag's 'src' attribute.
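A small sketch of attribute access, assuming beautifulsoup4 is installed. The img tag and its alt value are hypothetical; attrs is a regular dictionary, and tag['src'] is shorthand for tag.attrs['src'].

```python
from bs4 import BeautifulSoup

# Hypothetical cell containing one image; the alt value is made up for illustration.
bs = BeautifulSoup('<td><img src="../img/gifts/img1.jpg" alt="gift"></td>',
                   'html.parser')
img = bs.find('img')

print(img.attrs)         # all attributes as a dictionary
print(img.attrs['src'])  # ../img/gifts/img1.jpg
print(img['src'])        # shorthand for the line above
print(img.get('width'))  # None: get() does not raise for a missing attribute
```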
