# Static Crawling

Order of actions
1. Fetch the target URL's HTML contents using the `requests` package
2. Parse the entire HTML using `BeautifulSoup4`
3. Extract only the required information and append it to a list
4. `print()` the results or save them with pandas' `to_csv()` / `to_excel()`

## 1. `requests`
`requests` : package for getting the HTML contents of a webpage

In [1]:
import requests

url = 'https://www.naver.com'
response = requests.get(url)
html_text = response.text

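- `response.text` : returns the response body decoded to `str`
- `response.content` : returns the response body as raw `bytes`
- `response.json()` : a method (not an attribute) that parses the body as JSON; it raises an error when the body is not valid JSON, e.g. an HTML page
- `response.status_code` : returns the HTTP status code (200 if the request succeeded)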
- `response.url` : gets the final responded URL (if the URL redirects, returns the redirected URL)
- `response.headers` : returns the headers
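
The attributes listed above can be tried out without a network connection. The sketch below builds a `requests.Response` by hand; setting the private `_content` attribute and the `example.com` URL are assumptions made purely for offline illustration. In real use the object comes from `requests.get(url)`.

```python
import requests

# Build a Response by hand so the example runs offline.
# Assumption: writing to the private `_content` attribute is for illustration
# only; normally the object is returned by requests.get(url).
response = requests.Response()
response.status_code = 200
response.url = 'https://example.com/api'  # hypothetical URL
response.headers['Content-Type'] = 'application/json'
response.encoding = 'utf-8'
response._content = b'{"greeting": "hello"}'

text = response.text      # body decoded to str
raw = response.content    # body as raw bytes
data = response.json()    # body parsed as JSON into a dict

print(type(text).__name__, type(raw).__name__, type(data).__name__)
print(response.status_code, response.url)
print(response.headers['Content-Type'])
```

Because `json()` only works on a JSON body, `text` is the one to pass to `BeautifulSoup` when the page is HTML.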


## 2. `BeautifulSoup4`
Parsing : the process of converting the long, messy HTML string into a structured tree of tags that can be searched

In [3]:
from bs4 import BeautifulSoup as bs
soup = bs(html_text, 'html.parser')

### `find()` and `find_all()`
- `find()` : returns just one element / `find_all()` : returns a list of all matching elements
- if many elements match, `find()` returns the first one only (and `None` if nothing matches)
- inside parentheses : an HTML tag name and/or attributes
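
The difference between the two can be seen on a small document. This is a minimal sketch that assumes a hand-written HTML snippet (not taken from any real page) so it runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written by hand for this example.
html = """
<div>
  <p class="intro">first</p>
  <p class="body">second</p>
  <p class="body">third</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')                    # only the first <p>
all_p = soup.find_all('p')                  # every <p>, as a list-like ResultSet
body_p = soup.find_all('p', class_='body')  # tag name and class together
missing = soup.find('table')                # None, because nothing matches

print(first_p.get_text())
print([p.get_text() for p in all_p])
print(len(body_p), missing)
```

Since `find()` can return `None`, checking the result before calling `get_text()` on it avoids an `AttributeError`.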

In [4]:
# by tag name
soup.find('p')

<p class="dsc">
<i class="imsc ico_election"></i><span class="_alert_passage"></span>
</p>

In [8]:
# by class
soup.find(class_ = 'input_text')
soup.find(attrs={'class':'input_text'})

<input accesskey="s" autocomplete="off" class="input_text" data-atcmp-element="" id="query" maxlength="255" name="query" onclick="document.getElementById('fbm').value=1;" placeholder="검색어를 입력해 주세요." style="ime-mode:active;" tabindex="1" title="검색어 입력" type="search" value=""/>

In [9]:
# by id
soup.find(id='query')
soup.find(attrs={'id':'query'})

<input accesskey="s" autocomplete="off" class="input_text" data-atcmp-element="" id="query" maxlength="255" name="query" onclick="document.getElementById('fbm').value=1;" placeholder="검색어를 입력해 주세요." style="ime-mode:active;" tabindex="1" title="검색어 입력" type="search" value=""/>

In [13]:
# tag name & class
soup.find('a', class_='btn_keyboard')

<a class="btn_keyboard" href="#" id="ke_kbd_btn" onclick="return false;" role="button"><span class="blind">한글 입력기</span><span class="ico_keyboard"></span></a>

### `select()`, `select_one()`
- `select()` works like `find_all()` (returns a list of all matches) / `select_one()` works like `find()` (returns the first match only)
- a CSS selector goes inside the parentheses

In [14]:
soup.select_one('a.btn_keyboard')

<a class="btn_keyboard" href="#" id="ke_kbd_btn" onclick="return false;" role="button"><span class="blind">한글 입력기</span><span class="ico_keyboard"></span></a>

In [15]:
soup.select('a.btn_keyboard')

[<a class="btn_keyboard" href="#" id="ke_kbd_btn" onclick="return false;" role="button"><span class="blind">한글 입력기</span><span class="ico_keyboard"></span></a>]

In [19]:
btn = soup.select('a.btn_keyboard')

for i in btn :
    btns = i.get_text()
    print(btns)  # print inside the loop so every match is shown, not only the last one

한글 입력기


## 3. Getting the needed information & Printing
- Example : getting the news titles for a keyword search result (news agency names can be extracted the same way with their own CSS selector)

In [29]:
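from bs4 import BeautifulSoup as bs
from urllib.parse import quote
import requests

query = input('Search for keyword : ')
# URL-encode the keyword so spaces and special characters survive in the query string
url = 'https://search.naver.com/search.naver?where=news&sm=tab_jum&query=' + quote(query)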

response = requests.get(url)
html_text = response.text

soup = bs(html_text, 'html.parser')

titles = soup.select('a.news_tit')

for i in titles :
    title = i.get_text() # select() returns a list, so get_text() is called on each element in turn
    print(title)

뉴진스, 한국대중음악상 3관왕 영예
뉴진스, 日 최대 패션축제 뜨자 '들썩'
리바이스, 150주년 기념 뉴진스 글로벌 앰버서더 발탁
250, 한국대중음악상 ‘올해의 음반’ 등 4관왕…뉴진스 3관왕
뉴진스의, 봄[화보]
250, '올해의 음악인' 등 한대음 4관왕…뉴진스는 3관왕
좋은 곡은 뉴진스에게 먼저?…'SM 사태' 우려 혹은 오류 [연계소문]
'한대음' 뉴진스 3관왕·250 4관왕..윤하 올해의 노래상[종합]
대세 입증 뉴진스, '쿠키'도 스포티파이 스트리밍 1억회 돌파
뉴진스에 없는 딱 한 가지 [Oh!쎈 초점]
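
Steps 3 and 4 of the outline (append to a list, then save) can be sketched as follows. This is only a sketch under stated assumptions: the HTML snippet and the `example.com` URLs are made up to mimic the `a.news_tit` links, and the standard-library `csv` module stands in for pandas' `to_csv()` so that the example needs no extra dependency.

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the `a.news_tit` links selected above.
html = """
<a class="news_tit" href="https://example.com/1">First headline</a>
<a class="news_tit" href="https://example.com/2">Second headline</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# Step 3: keep only the needed fields and append them to a list
rows = []
for link in soup.select('a.news_tit'):
    rows.append({'title': link.get_text(strip=True), 'url': link.get('href')})

# Step 4: write the rows as CSV. An in-memory buffer is used here;
# open('news.csv', 'w', newline='', encoding='utf-8') would write a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'url'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

With pandas installed, `pandas.DataFrame(rows).to_csv('news.csv', index=False)` writes the same rows to a file.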


#### CSS selector usage
- Using CSS selector
    - `soup.select('tag')`
    - `soup.select('.class')`
    - `soup.select('#id')`
    - `soup.select('parent > child > grandchild')` : `>` matches direct children only
    - `soup.select('parent.class > child.class')`
- Tag & attributes
    - `soup.select('tag[attribute="value"]')`
- Selecting only one value
    - `soup.select_one('tag[attribute="value"]')`
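
Each selector form above can be checked against a small document. The HTML, the `data-kind` attribute, and the `example.com` URL in this sketch are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written by hand for this example.
html = """
<div id="main" class="wrap">
  <ul class="menu">
    <li class="item"><a href="/a" data-kind="internal">A</a></li>
    <li class="item"><a href="https://example.com/b" data-kind="external">B</a></li>
  </ul>
  <a href="/c">C</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.select('a')                         # every <a>
by_class = soup.select('.item')                   # every element with class "item"
by_id = soup.select('#main')                      # the element with id "main"
children = soup.select('ul.menu > li.item > a')   # direct-child chain only
by_attr = soup.select('a[data-kind="external"]')  # tag with a given attribute value
one = soup.select_one('a[data-kind="internal"]')  # first match only, not a list

print(len(by_tag), len(by_class), len(by_id))
print([a.get_text() for a in children])
print(by_attr[0]['href'], one.get_text())
```

The `C` link is matched by `'a'` but not by the child chain, because it sits directly under the `div` rather than inside `ul.menu > li.item`.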

#### Getting elements
- `get_text()` : gets the text inside the element
- `get('src')` or `['src']` : gets an attribute value, e.g. the `src` URL of an image
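
The two attribute forms differ when the attribute is missing: `get()` returns `None` (or a default), while `[]` raises `KeyError`. A minimal sketch, assuming a made-up `<img>` tag:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written by hand for this example.
html = '<div><img src="/static/logo.png" alt="logo"> caption text</div>'
soup = BeautifulSoup(html, 'html.parser')
img = soup.select_one('img')

src_by_index = img['src']          # raises KeyError if the attribute is absent
src_by_get = img.get('src')        # returns None if the attribute is absent
missing = img.get('data-lazysrc')  # None: this tag has no such attribute

# Fall back to `src` when the lazy-loading attribute is not present
image_url = img.get('data-lazysrc', img.get('src'))

try:
    img['data-lazysrc']
    raised = False
except KeyError:
    raised = True

caption = soup.div.get_text(strip=True)  # text of the element, whitespace trimmed

print(src_by_index, src_by_get, missing, image_url, raised, caption)
```

`get()` is the safer choice when a page may or may not lazy-load its images.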

In [47]:
import urllib.request

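# select a single <img> element by its CSS selector
img = soup.select_one('#sp_nws1 > div.news_wrap.api_ani_send > a > img')

# the image URL is read from the `data-lazysrc` attribute here rather than `src`
url = img['data-lazysrc']
savename = 'image.png'

# download the image and save it under `savename`
urllib.request.urlretrieve(url, savename)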
print("Saved image")

Saved image


#### Others
- `href` extraction : getting links, e.g. `a['href']` or `a.get('href')`
- attributes (`tag.attrs`) : a dictionary of all attributes on a tag, for reading the value of one particular attribute
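
Both items can be sketched on a small example. The HTML and the base URL below are made up for illustration, and `urllib.parse.urljoin` is added here (it is not part of the notes above) to turn relative links into absolute ones.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://example.com/news/'  # hypothetical page URL

# Hypothetical HTML, written by hand for this example.
html = """
<a class="news_tit" href="/article/1" title="One">One</a>
<a class="news_tit" href="https://other.example.org/2">Two</a>
<a name="anchor-without-href">Three</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# 'a[href]' skips anchors that have no href, so a['href'] cannot raise KeyError
links = [urljoin(base_url, a['href']) for a in soup.select('a[href]')]

# .attrs is a dict of every attribute on the tag; `class` is stored as a list
first_attrs = soup.find('a').attrs

print(links)
print(first_attrs)
```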