# Scraping Dynamic Web Pages
- Static webpages
  - A static site contains an HTML file for each page. The information on the page is delivered to the user exactly as it’s stored. All sites were built like this in the early days of the internet.
  - Now, this format is most often used to build sites where the content isn’t constantly changing. Scraping data from static pages is a straightforward process:

    1. Give the scraper the URL of the page you want to scrape.
    2. Identify the location of the data you want (for example, with the Inspect tool in Chrome).
    3. Request the data using selectors.
    4. Export the data into a JSON or CSV file.
- Dynamic websites
  - Dynamic websites have continuously updating feeds, such as websites that deliver stock market data. These sites use Asynchronous JavaScript and XML (AJAX) to update parts of the page without a full reload.
  - They do this by exchanging small data packets with the server in the background. AJAX makes scraping more complicated, since the data has to be scraped again each time it changes.

  - To scrape a dynamic page, you have to determine the format and destination of the request the page sends to the server (so you can reproduce it) and the format of the response (so you can extract data from it). In Chrome, you can identify the request using the following steps:
    1. With the Developer Tools panel open, click on Network to see all of the requests made for the page.
    2. Under the Headers tab of a request, look for Form Data, which should contain the parameters of the AJAX request.
    3. Find the parameters that define the request, and the endpoint it is sent to.

  - You can find the response format under the Response tab; it should be JSON or something similar. Once you have identified the request parameters and the response format, you can configure your web scraper.
  - There are two ways to scrape dynamic web pages:
    1. Automated browsers: simulate user actions with a browser automation or rendering tool (e.g. Selenium, Splash), or
    2. Intercepting AJAX calls: request the underlying information source directly.
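
Whichever route you take, the last step is the same as step 4 of the static workflow: turn the extracted records into a file. The sketch below assumes a made-up JSON body (the `items`, `symbol`, and `price` names are hypothetical, standing in for whatever the Response tab shows) and writes it out as CSV using only the standard library.

```python
import csv
import io
import json

# Hypothetical JSON body, standing in for what the Response tab shows.
response_text = '{"items": [{"symbol": "AAA", "price": 10.5}, {"symbol": "BBB", "price": 7.25}]}'

# Parse the JSON text and pick out the list of records.
rows = json.loads(response_text)["items"]

# Write the records as CSV into an in-memory buffer
# (swap io.StringIO() for open("out.csv", "w", newline="") to write a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["symbol", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

A real response is likely to be nested differently, so the `["items"]` lookup is the part that would need adapting.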

## AJAX (Asynchronous JavaScript And XML)
- Have you ever visited a page that automatically loads extra content as you scroll? Then you’ve seen AJAX pages in action. Social media sites with “infinite scroll” are the most common examples of AJAX pages. Still, AJAX can be found on any site that presents dynamic and constantly updating content.
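- AJAX is a communication technique that lets the browser and the server exchange data asynchronously using JavaScript.
- The client and the server exchange JSON or XML data.
- With asynchronous requests, only the data that is needed is loaded, which reduces wasted resources.
- AJAX sends requests to the server through the XMLHttpRequest object.
- Because only the required data is received (as JSON or XML) and used to update the page, resources and time are saved accordingly.
- Refer to: https://www.w3schools.com/js/js_ajax_intro.asp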

## How to scrape AJAX website?
- Go to the page you want to scrape.
- Press F12 to open “Developer Tools”.
- Go to the “Network” tab.
- Go to the XHR (XMLHttpRequest) section, and refresh the page if it is empty.
- Explore the different requests until you find the one you want, then go to its “Headers” tab.
- Scroll to the “Form Data” field (present when it is a POST request).
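
Once the endpoint and Form Data are known, the request can be rebuilt in code. The sketch below uses a hypothetical endpoint (`https://example.com/api/feed`) and hypothetical form fields (`page`, `count`); it only constructs the request object and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values, as they might be copied from the Headers tab.
endpoint = "https://example.com/api/feed"
form_data = {"page": 2, "count": 20}

# Form Data is sent URL-encoded in the request body.
body = urlencode(form_data).encode("utf-8")

request = Request(
    endpoint,
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(request.get_method())  # the method becomes POST because a body is attached
print(request.full_url)
print(request.data)

# Sending it would be urllib.request.urlopen(request); it is not run here
# because the endpoint is made up.
```

With the `requests` library used elsewhere in this notebook, the equivalent call would be `requests.post(endpoint, data=form_data)`.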

## URL convention
- The Joinsland web page no longer works, so we use another example.
- (ex) The URL: https://store.steampowered.com/search/results/?query&start=1&count=100&tags=1702
  - https:// indicates that the website is accessed over a secure HTTPS connection
  - store.steampowered.com: the domain name of the Steam Store website.
  - /search/results/ is the path that specifies the search results page.
  - ?query: the ? starts the query string; query is a parameter passed without a value
  - &start=1: the starting position of the search results (set to 1 in this URL)
  - &count=100: the number of search results to be displayed per page.
  - &tags=1702: represents a tag or category filter for the search results (The value 1702 corresponds to a specific tag or category within the Steam Store)
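
The same breakdown can be done in code with `urllib.parse`. This is a minimal sketch; `page_url` is a hypothetical helper, and the idea that later pages are reached by increasing `start` is an assumption about how the site paginates.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = "https://store.steampowered.com/search/results/?query&start=1&count=100&tags=1702"

# Split the URL into scheme, domain, path, and query string.
parts = urlsplit(url)
print(parts.scheme)
print(parts.netloc)
print(parts.path)

# keep_blank_values=True keeps the value-less "query" parameter.
params = parse_qs(parts.query, keep_blank_values=True)
print(params)

def page_url(start, count=100):
    """Rebuild the URL with a different start position (hypothetical helper)."""
    query = urlencode({"query": "", "start": start, "count": count, "tags": 1702})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

print(page_url(101))
```

Note that `urlencode` writes the empty parameter as `query=` rather than `query`; servers normally treat the two the same, but that is not guaranteed.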


# Example 1: Naver industry analysis reports (산업분석 리포트)
- http://developer88.tistory.com/428how-to-scrape-an-ajax-website-using-python-qw8fuitvi
- Scraping Naver's industry analysis reports (pages 1 to 3)
- modified by jyj (2023-7-18)

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = 'https://finance.naver.com/research/industry_list.naver'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

In [3]:
response.status_code

200

In [4]:
# see page numbers (page navigation list)
pagenation = soup.find('table', class_='Nnavi')
print(pagenation)

<table align="center" class="Nnavi" summary="페이지 네비게이션 리스트">
<caption>페이지 네비게이션</caption>
<tr>
<td class="on">
<a href="/research/industry_list.naver?&amp;page=1">1</a>
</td>
<td>
<a href="/research/industry_list.naver?&amp;page=2">2</a>
</td>
<td>
<a href="/research/industry_list.naver?&amp;page=3">3</a>
</td>
<td>
<a href="/research/industry_list.naver?&amp;page=4">4</a>
</td>
<td>
<a href="/research/industry_list.naver?&amp;page=5">5</a>
</td>
<td>
<a href="/research/industry_list.naver?&amp;page=6">6</a>
</td>
<td>
<a href="/research/industry_list.naver?&amp;page=7">7</a>
</td>
<td>
<a href="/research/industry_list.naver?&amp;page=8">8</a>
</td>
<td>
<a href="/research/industry_list.naver?&amp;page=9">9</a>
</td>
<td>
<a href="/research/industry_list.naver?&amp;page=10">10</a>
</td>
<td class="pgR">
<a href="/research/industry_list.naver?&amp;page=11">
				다음<img alt="" border="0" height="5" src="https://ssl.pstatic.net/static/n/cmn/bu_pgarR.gif" width="3"/>
</a>
</td>
<td class="p

In [5]:
pages = pagenation.find_all('a')
pages

[<a href="/research/industry_list.naver?&amp;page=1">1</a>,
 <a href="/research/industry_list.naver?&amp;page=2">2</a>,
 <a href="/research/industry_list.naver?&amp;page=3">3</a>,
 <a href="/research/industry_list.naver?&amp;page=4">4</a>,
 <a href="/research/industry_list.naver?&amp;page=5">5</a>,
 <a href="/research/industry_list.naver?&amp;page=6">6</a>,
 <a href="/research/industry_list.naver?&amp;page=7">7</a>,
 <a href="/research/industry_list.naver?&amp;page=8">8</a>,
 <a href="/research/industry_list.naver?&amp;page=9">9</a>,
 <a href="/research/industry_list.naver?&amp;page=10">10</a>,
 <a href="/research/industry_list.naver?&amp;page=11">
 				다음<img alt="" border="0" height="5" src="https://ssl.pstatic.net/static/n/cmn/bu_pgarR.gif" width="3"/>
 </a>,
 <a href="/research/industry_list.naver?&amp;page=1014">맨뒤
 				<img alt="" border="0" height="5" src="https://ssl.pstatic.net/static/n/cmn/bu_pgarRR.gif" width="8"/>
 </a>]

In [6]:
pages[0]['href']

'/research/industry_list.naver?&page=1'

In [7]:
last_page = 3

for k in range(1, last_page+1):
    page_url = 'https://finance.naver.com' + pages[k-1]['href']
    print(page_url)

https://finance.naver.com/research/industry_list.naver?&page=1
https://finance.naver.com/research/industry_list.naver?&page=2
https://finance.naver.com/research/industry_list.naver?&page=3
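
Indexing into `pages` works here because the first ten links happen to line up with page numbers 1 to 10; the later entries are the "다음" (next) and "맨뒤" (last) links. A sketch of an alternative is to build the URL from the page number directly. `build_page_url` is a hypothetical helper, and it assumes the server accepts `?page=N` in the same way as the `?&page=N` form found in the links.

```python
from urllib.parse import urlencode

BASE_URL = "https://finance.naver.com/research/industry_list.naver"

def build_page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"page": page})

# URLs for pages 1 to 3, independent of the navigation table.
page_urls = [build_page_url(k) for k in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```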


In [8]:
page_url_0 = 'https://finance.naver.com/research/industry_list.naver?&page=1'
soup = BeautifulSoup(requests.get(page_url_0).text, 'html.parser')
table_body = soup.find('table', class_='type_1')
print(table_body)

<table cellpadding="0" cellspacing="0" class="type_1" summary="산업분석 리포트 게시판 글목록">
<caption>산업분석 리포트게시판</caption>
<col width="17%"/><col width="*%"/><col width="15%"/><col width="5%"/><col width="9%"/><col width="7%"/>
<tr>
<th>분류</th>
<th>제목</th>
<th style="text-align:left">증권사</th>
<th>첨부</th>
<th>작성일</th>
<th>조회수</th>
</tr>
<tr><td class="blank_07" colspan="6"></td></tr>
<tr>
<td style="padding-left:10">자동차</td>
<td><a href="industry_read.naver?nid=34037&amp;page=1">23년 6월 유럽 자동차 판매: 높아진 수요, 높..</a><img alt="NEW" class="ico_new" height="8" src="https://ssl.pstatic.net/imgstock/images5/ico_research_new.gif" width="8"/></td>
<td>하이투자증권</td>
<td class="file"><a href="https://ssl.pstatic.net/imgstock/upload/research/industry/1689825589627.pdf" target="_blank"><img align="absmiddle" alt="pdf" src="https://ssl.pstatic.net/imgstock/images5/down.gif"/></a></td>
<td class="date" style="padding-left:5px">23.07.20</td>
<td class="date">124</td>
</tr>
<tr>
<td style="padding-left:10">항공운송</td>
<

In [9]:
trs = table_body.find_all('tr')
trs[0]

<tr>
<th>분류</th>
<th>제목</th>
<th style="text-align:left">증권사</th>
<th>첨부</th>
<th>작성일</th>
<th>조회수</th>
</tr>

In [10]:
trs[1]

<tr><td class="blank_07" colspan="6"></td></tr>

In [11]:
trs[2]

<tr>
<td style="padding-left:10">자동차</td>
<td><a href="industry_read.naver?nid=34037&amp;page=1">23년 6월 유럽 자동차 판매: 높아진 수요, 높..</a><img alt="NEW" class="ico_new" height="8" src="https://ssl.pstatic.net/imgstock/images5/ico_research_new.gif" width="8"/></td>
<td>하이투자증권</td>
<td class="file"><a href="https://ssl.pstatic.net/imgstock/upload/research/industry/1689825589627.pdf" target="_blank"><img align="absmiddle" alt="pdf" src="https://ssl.pstatic.net/imgstock/images5/down.gif"/></a></td>
<td class="date" style="padding-left:5px">23.07.20</td>
<td class="date">124</td>
</tr>

In [12]:
trs[3]

<tr>
<td style="padding-left:10">항공운송</td>
<td><a href="industry_read.naver?nid=34036&amp;page=1">업계 재편, 어떻게 봐야 할까? ①합병이 된다..</a><img alt="NEW" class="ico_new" height="8" src="https://ssl.pstatic.net/imgstock/images5/ico_research_new.gif" width="8"/></td>
<td>한화투자증권</td>
<td class="file"><a href="https://ssl.pstatic.net/imgstock/upload/research/industry/1689814031637.pdf" target="_blank"><img align="absmiddle" alt="pdf" src="https://ssl.pstatic.net/imgstock/images5/down.gif"/></a></td>
<td class="date" style="padding-left:5px">23.07.20</td>
<td class="date">279</td>
</tr>

- Let's try parsing a single row first.

In [13]:
# from trs[2] ... trs[len(trs)-1]
tr = trs[2]
tds = tr.find_all('td')
tds

[<td style="padding-left:10">자동차</td>,
 <td><a href="industry_read.naver?nid=34037&amp;page=1">23년 6월 유럽 자동차 판매: 높아진 수요, 높..</a><img alt="NEW" class="ico_new" height="8" src="https://ssl.pstatic.net/imgstock/images5/ico_research_new.gif" width="8"/></td>,
 <td>하이투자증권</td>,
 <td class="file"><a href="https://ssl.pstatic.net/imgstock/upload/research/industry/1689825589627.pdf" target="_blank"><img align="absmiddle" alt="pdf" src="https://ssl.pstatic.net/imgstock/images5/down.gif"/></a></td>,
 <td class="date" style="padding-left:5px">23.07.20</td>,
 <td class="date">124</td>]

In [14]:
tds[0]

<td style="padding-left:10">자동차</td>

In [15]:
tds[1]

<td><a href="industry_read.naver?nid=34037&amp;page=1">23년 6월 유럽 자동차 판매: 높아진 수요, 높..</a><img alt="NEW" class="ico_new" height="8" src="https://ssl.pstatic.net/imgstock/images5/ico_research_new.gif" width="8"/></td>

In [16]:
tds[1].a.text

'23년 6월 유럽 자동차 판매: 높아진 수요, 높..'

In [17]:
tds[1].a['href']

'industry_read.naver?nid=34037&page=1'

In [18]:
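url_head = 'https://finance.naver.com/research/'

def get_research(tds):
    # tds[0] is the category (분류) cell; the brokerage (증권사) is in tds[2]
    category = tds[0].string
    title = tds[1].a.string
    url_query = tds[1].a['href']  # relative link, e.g. 'industry_read.naver?nid=...'
    result = {'category': category,
              'title': title,
              'url': url_head + url_query
    }
    return result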
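
As an alternative to concatenating `url_head` with the relative link by hand, the standard library's `urljoin` resolves an `href` against the URL of the page it was found on. This is a small sketch using the links shown in the outputs above; it handles both the relative report links and the navigation links that start with `/`.

```python
from urllib.parse import urljoin

# URL of the listing page the links were scraped from.
list_url = "https://finance.naver.com/research/industry_list.naver?&page=1"

# Relative link taken from tds[1].a['href'].
report_href = "industry_read.naver?nid=34037&page=1"
report_url = urljoin(list_url, report_href)
print(report_url)

# Navigation link taken from pages[1]['href'] (starts with "/").
nav_href = "/research/industry_list.naver?&page=2"
nav_url = urljoin(list_url, nav_href)
print(nav_url)
```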

- Let's put it all together.

In [19]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url_head = 'https://finance.naver.com/research/'

def get_research(tds):
    """Return one report row as a dict of one-element lists (ready for pd.DataFrame)."""
    category = tds[0].string
    title = tds[1].a.string
    research_url = url_head + tds[1].a['href']
    result = {'category': [category],
              'title': [title],
              'research_url': [research_url]
    }
    return result

url = 'https://finance.naver.com/research/industry_list.naver'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# page navigation links
pagenation = soup.find('table', class_='Nnavi')
pages = pagenation.find_all('a')

last_page = 3
reports = pd.DataFrame({"category":[],
                        "title":[],
                        "research_url":[]})

for k in range(1, last_page+1):
    page_url = 'https://finance.naver.com' + pages[k-1]['href']

    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    table_body = soup.find('table', class_='type_1')

    trs = table_body.find_all('tr')

    for tr in trs[2:]:  # skip the header row and the blank row
        tds = tr.find_all('td')
        if len(tds) < 2:
            continue  # skip border lines
        report = get_research(tds)
        reports = pd.concat([reports, pd.DataFrame(report)], ignore_index=True)

reports

Unnamed: 0,category,title,research_url
0,자동차,"23년 6월 유럽 자동차 판매: 높아진 수요, 높..",https://finance.naver.com/research/industry_re...
1,항공운송,"업계 재편, 어떻게 봐야 할까? ①합병이 된다..",https://finance.naver.com/research/industry_re...
2,자동차,유럽 자동차 6월: 현대차/기아 +5%,https://finance.naver.com/research/industry_re...
3,전기전자,"2Q23 실적 관전포인트, ""저점 확인""",https://finance.naver.com/research/industry_re...
4,기타,(23-07) 이제는 다시 펀더멘탈,https://finance.naver.com/research/industry_re...
...,...,...,...
85,제약,기초 체력이 레벨 업 되는 기업을 주목,https://finance.naver.com/research/industry_re...
86,철강금속,5년 내 최고치를 기록하는 구리 TC/RC,https://finance.naver.com/research/industry_re...
87,항공운송,6월 항공 데이터: 비수기 통과,https://finance.naver.com/research/industry_re...
88,자동차,테슬라발 가격 인하 효과로 글로벌 전기차 판..,https://finance.naver.com/research/industry_re...


# Example 2: Web scraping dynamic content using only Beautiful Soup
- How to find the information source?
  0. Go to: https://finviz.com/quote.ashx?t=tsla
     - In the middle of the page you see the "income statements" table (not "cash flow").
     - Suppose you want to extract the "cash flow" information instead.
  1. Open Developer Tools -> Network -> XHR.
  2. Click the item you want to extract (clicking it loads the data dynamically).
  3. Watch which requests appear each time you click the item.
  4. Check "Preview" and "Headers" to find the source URL (here https://finviz.com/api/statement.ashx?t=tsla&s=CA), which returns the data in JSON format.

- Accessing the url using requests.get()
  - When using requests.get() to make HTTP requests in Python, providing headers can be important for several reasons:
    - User-Agent: Many websites use the User-Agent header to identify the client making the request. Some websites may block or limit access to their content based on the User-Agent. By setting a User-Agent header that resembles a common web browser, you can mimic a typical browser request and avoid being blocked or throttled.
    - Authentication: Some websites require authentication to access certain resources or APIs; credentials are typically passed in a header such as Authorization.
    - Content Negotiation: The Accept header allows you to specify the preferred content type for the response (e.g., JSON, XML, HTML). By setting the Accept header, you can ensure that the server returns the response in the format you desire.
    - Custom Headers: Certain APIs or web services may require specific custom headers to function correctly.
  - Normally, the default User-Agent should work fine for most APIs and web services.
  - However, if you encounter issues with specific APIs or web services, and they require a valid User-Agent header, you can use a generic User-Agent string like the one used by common web browsers.

In [20]:
# typical User-Agent string that mimics a web browser's User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
           'Accept': 'application/json',}
cashflow = 'https://finviz.com/api/statement.ashx?t=tsla&s=CA'
cf = requests.get(cashflow, headers=headers)
cf.text

'{"currency":"USD","data":{"Period End Date":["TTM","12/31/2022","12/31/2021","12/31/2020","12/31/2019","12/31/2018","12/31/2017","12/31/2016"],"Period Length":["12 Months","12 Months","12 Months","12 Months","12 Months","12 Months","12 Months","12 Months"],"Net Income":["11,846.00","12,587.00","5,644.00","862.00","-775.00","-1,062.58","-2,240.58","-773.05"],"Depreciation/Depletion":["3,913.00","3,747.00","2,911.00","2,322.00","2,154.00","1,901.05","1,636.00","947.10"],"Amortization":["","","","","","","",""],"Deferred Taxes":["","","","","","","",""],"Non-Cash Items":["2,319.00","2,298.00","2,424.00","2,575.00","1,375.00","1,201.38","1,040.52","395.98"],"Changes in Working Capital":["-4,836.00","-3,908.00","518.00","184.00","-349.00","57.95","-496.60","-693.86"],"Cash from Operating Activities":["13,242.00","14,724.00","11,497.00","5,943.00","2,405.00","2,097.80","-60.65","-123.83"],"Capital Expenditures":["-7,464.00","-7,172.00","-8,014.00","-3,242.00","-1,437.00","-2,319.52","-4,081

In [21]:
cfdata = cf.json()

In [22]:
import json
import pandas as pd
# json.loads(cf.text) is equivalent to cf.json(), so the result matches the next cell
pd.DataFrame(json.loads(cf.text))

Unnamed: 0,currency,data
Amortization,USD,"[, , , , , , , ]"
Capital Expenditures,USD,"[-7,464.00, -7,172.00, -8,014.00, -3,242.00, -..."
Cash Interest Paid,USD,"[, 152.00, 266.00, 444.00, 455.00, 380.84, 182..."
Cash Taxes Paid,USD,"[, 1,203.00, 561.00, 115.00, 54.00, 35.41, 65...."
Cash from Financing Activities,USD,"[-1,846.00, -3,527.00, -5,203.00, 9,973.00, 1,..."
Cash from Investing Activities,USD,"[-12,290.00, -11,973.00, -7,868.00, -3,132.00,..."
Cash from Operating Activities,USD,"[13,242.00, 14,724.00, 11,497.00, 5,943.00, 2,..."
Changes in Working Capital,USD,"[-4,836.00, -3,908.00, 518.00, 184.00, -349.00..."
Deferred Taxes,USD,"[, , , , , , , ]"
Depreciation/Depletion,USD,"[3,913.00, 3,747.00, 2,911.00, 2,322.00, 2,154..."


In [23]:
pd.DataFrame(cfdata)

Unnamed: 0,currency,data
Amortization,USD,"[, , , , , , , ]"
Capital Expenditures,USD,"[-7,464.00, -7,172.00, -8,014.00, -3,242.00, -..."
Cash Interest Paid,USD,"[, 152.00, 266.00, 444.00, 455.00, 380.84, 182..."
Cash Taxes Paid,USD,"[, 1,203.00, 561.00, 115.00, 54.00, 35.41, 65...."
Cash from Financing Activities,USD,"[-1,846.00, -3,527.00, -5,203.00, 9,973.00, 1,..."
Cash from Investing Activities,USD,"[-12,290.00, -11,973.00, -7,868.00, -3,132.00,..."
Cash from Operating Activities,USD,"[13,242.00, 14,724.00, 11,497.00, 5,943.00, 2,..."
Changes in Working Capital,USD,"[-4,836.00, -3,908.00, 518.00, 184.00, -349.00..."
Deferred Taxes,USD,"[, , , , , , , ]"
Depreciation/Depletion,USD,"[3,913.00, 3,747.00, 2,911.00, 2,322.00, 2,154..."


In [24]:
# for "balance sheet"
balance_sheet = 'https://finviz.com/api/statement.ashx?t=tsla&s=BA'
bs = requests.get(balance_sheet, headers=headers)
bsdata = bs.json()
pd.DataFrame(bsdata)

Unnamed: 0,currency,data
Accounts Payable,USD,"[15,255.00, 10,025.00, 6,051.00, 3,771.00, 3,4..."
"Accounts Receivable - Trade, Net",USD,"[2,952.00, 1,913.00, 1,886.00, 1,324.00, 949.0..."
Accrued Expenses,USD,"[5,553.00, 4,303.00, 2,814.00, 2,091.00, 1,372..."
Additional Paid-In Capital,USD,"[32,177.00, 29,803.00, 27,260.00, 12,736.82, 1..."
Capital Lease Obligations,USD,"[568.00, 991.00, 1,094.00, 1,232.00, 993.18, 5..."
Cash and Equivalents,USD,"[16,253.00, 17,576.00, 19,384.00, 6,268.00, 3,..."
Cash and Short Term Investments,USD,"[22,185.00, 17,707.00, 19,384.00, 6,268.00, 3,..."
Common Stock,USD,"[3.00, 1.00, 1.00, 0.18, 0.17, 0.17, 0.16, 0.13]"
Current Port. of LT Debt/Capital Leases,USD,"[1,502.00, 1,589.00, 2,132.00, 1,785.00, 2,567..."
Full-Time Employees,USD,"[127,855, 99,290, 70,757, 48,016, 48,817, 37,5..."


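- If you want to extract only part of the data, note that parsing the response with BeautifulSoup does not help: the body is JSON, not HTML, so the soup below is just the raw text.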

In [25]:
cashflow = 'https://finviz.com/api/statement.ashx?t=tsla&s=CA'
cf = requests.get(cashflow, headers=headers)
soup = BeautifulSoup(cf.content, 'html.parser')

In [26]:
soup

{"currency":"USD","data":{"Period End Date":["TTM","12/31/2022","12/31/2021","12/31/2020","12/31/2019","12/31/2018","12/31/2017","12/31/2016"],"Period Length":["12 Months","12 Months","12 Months","12 Months","12 Months","12 Months","12 Months","12 Months"],"Net Income":["11,846.00","12,587.00","5,644.00","862.00","-775.00","-1,062.58","-2,240.58","-773.05"],"Depreciation/Depletion":["3,913.00","3,747.00","2,911.00","2,322.00","2,154.00","1,901.05","1,636.00","947.10"],"Amortization":["","","","","","","",""],"Deferred Taxes":["","","","","","","",""],"Non-Cash Items":["2,319.00","2,298.00","2,424.00","2,575.00","1,375.00","1,201.38","1,040.52","395.98"],"Changes in Working Capital":["-4,836.00","-3,908.00","518.00","184.00","-349.00","57.95","-496.60","-693.86"],"Cash from Operating Activities":["13,242.00","14,724.00","11,497.00","5,943.00","2,405.00","2,097.80","-60.65","-123.83"],"Capital Expenditures":["-7,464.00","-7,172.00","-8,014.00","-3,242.00","-1,437.00","-2,319.52","-4,081.

In [27]:
cf.content

b'{"currency":"USD","data":{"Period End Date":["TTM","12/31/2022","12/31/2021","12/31/2020","12/31/2019","12/31/2018","12/31/2017","12/31/2016"],"Period Length":["12 Months","12 Months","12 Months","12 Months","12 Months","12 Months","12 Months","12 Months"],"Net Income":["11,846.00","12,587.00","5,644.00","862.00","-775.00","-1,062.58","-2,240.58","-773.05"],"Depreciation/Depletion":["3,913.00","3,747.00","2,911.00","2,322.00","2,154.00","1,901.05","1,636.00","947.10"],"Amortization":["","","","","","","",""],"Deferred Taxes":["","","","","","","",""],"Non-Cash Items":["2,319.00","2,298.00","2,424.00","2,575.00","1,375.00","1,201.38","1,040.52","395.98"],"Changes in Working Capital":["-4,836.00","-3,908.00","518.00","184.00","-349.00","57.95","-496.60","-693.86"],"Cash from Operating Activities":["13,242.00","14,724.00","11,497.00","5,943.00","2,405.00","2,097.80","-60.65","-123.83"],"Capital Expenditures":["-7,464.00","-7,172.00","-8,014.00","-3,242.00","-1,437.00","-2,319.52","-4,08
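
Since the body is JSON, a part of it can be pulled out with plain dictionary access. The sketch below works on a shortened excerpt of the response shown above (only two fields, three periods), so it runs without a network call; `to_number` is a hypothetical helper for the comma-formatted strings.

```python
import json

# Excerpt of the response body shown above (two fields, three periods).
body = (
    '{"currency":"USD","data":{'
    '"Period End Date":["TTM","12/31/2022","12/31/2021"],'
    '"Net Income":["11,846.00","12,587.00","5,644.00"]}}'
)

payload = json.loads(body)
data = payload["data"]

def to_number(text):
    """Convert a string like '11,846.00' to a float; empty strings become None."""
    return float(text.replace(",", "")) if text else None

# Map each period to its net income.
net_income = dict(zip(data["Period End Date"], map(to_number, data["Net Income"])))
print(payload["currency"])
print(net_income)
```

With the live response, the same access would start from `cfdata["data"]`. If every list under `data` has the same length, `pd.DataFrame(cfdata["data"])` should also give one column per line item.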

--------------------

# Exercise

In [28]:
df = pd.DataFrame({"a":[1,2,3], "b":[4,5,6]})
df

Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6


In [29]:
dic = {"a":[77], "b":[88]}
pd.DataFrame(dic)

Unnamed: 0,a,b
0,77,88


In [30]:
pd.concat([df, pd.DataFrame(dic)], ignore_index=True)

Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6
3,77,88
