<a href="https://colab.research.google.com/github/rtajeong/M1_2025/blob/main/Ch6_Scraping_ref.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping

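- How a web page is served:
  - The client (web browser) sends an HTTP request to the server
  - The server responds by sending the HTML file and related resources (CSS, JS, etc.) to the client
  - The client receives these files, renders the web page, and displays it on screen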
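
The request/response cycle above can be sketched with the standard library alone. This is a minimal illustration, not how a production server is written: `HelloHandler` and the HTML body are made up for the example, and the server runs on a local port chosen by the OS.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

class HelloHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET with a small HTML document."""

    def do_GET(self):
        body = b"<html><body><h1>hello</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the notebook output quiet

# Port 0 lets the OS pick a free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: send an HTTP GET request and read the response
opener = build_opener(ProxyHandler({}))  # bypass any configured proxy for localhost
with opener.open(f"http://127.0.0.1:{port}/", timeout=5) as resp:
    status = resp.status
    content_type = resp.headers["Content-Type"]
    html = resp.read().decode("utf-8")

server.shutdown()
server.server_close()

print(status, content_type)
print(html)
```

A browser does the same thing as the client half of this sketch, then goes on to request the CSS and JS files the HTML refers to and renders the result.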

# Crawling and Scraping
- crawling: the process of automatically navigating web pages, typically by following a site's link structure to visit pages and collect data
- scraping: extracting the desired information from a specific web page by analyzing the page's HTML structure and retrieving the data needed (e.g., text, images, tables)
- Crawling focuses on discovering and indexing web pages, while scraping targets specific data extraction from those pages.
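
The difference can be sketched on a tiny site held in memory, so no network access is needed. Everything here is hypothetical: the `SITE` dictionary, its paths, and the helper names `scrape_title` and `crawl` exist only for this illustration.

```python
from collections import deque
from bs4 import BeautifulSoup

# Hypothetical three-page site held in memory (path -> HTML)
SITE = {
    "/": '<h1>Home</h1><a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<h1>Page A</h1><a href="/b">B</a>',
    "/b": '<h1>Page B</h1><a href="/">Home</a>',
}

def scrape_title(page):
    """Scraping: pull one specific piece of data out of a parsed page."""
    return page.h1.text

def crawl(start):
    """Crawling: follow links breadth-first, visiting each page once."""
    seen, queue, titles = {start}, deque([start]), {}
    while queue:
        path = queue.popleft()
        page = BeautifulSoup(SITE[path], "html.parser")
        titles[path] = scrape_title(page)
        for a in page.find_all("a", href=True):
            if a["href"] in SITE and a["href"] not in seen:
                seen.add(a["href"])
                queue.append(a["href"])
    return titles

titles = crawl("/")
print(titles)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.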

# JSON format
- json: JavaScript Object Notation
- A text-based standard format for easily exchanging and storing data

## encode/decode

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import json
from bs4 import BeautifulSoup

In [None]:
# exercise 1 : a simple json format
obj = """
{
    "name": "Kim",
    "places_lived": ["Seoul", "Korea"],
    "pet": null,
    "siblings": [{"name": "Scott", "age":25, "pet":"Zuko"}]
}
"""

In [None]:
type(obj)

str

In [None]:
r = json.loads(obj)   # decoding (json --> dict)
print(r)
type(r)

{'name': 'Kim', 'places_lived': ['Seoul', 'Korea'], 'pet': None, 'siblings': [{'name': 'Scott', 'age': 25, 'pet': 'Zuko'}]}


dict

In [None]:
json.dumps(r)    # encoding (dict --> json)

'{"name": "Kim", "places_lived": ["Seoul", "Korea"], "pet": null, "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"}]}'
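
Two `json.dumps` options that often matter in practice are `indent` (pretty-printing) and `ensure_ascii=False` (keeping non-ASCII characters such as Hangul readable instead of escaping them). A minimal sketch, using a made-up `record` that is not part of the exercises:

```python
import json

# Hypothetical record for illustration only
record = {"name": "김", "pet": None, "tags": ["a", "b"]}

compact = json.dumps(record)  # non-ASCII characters are escaped as \uXXXX
readable = json.dumps(record, ensure_ascii=False, indent=2)

print(compact)
print(readable)

# Decoding either string gives back an equal dict
decoded_compact = json.loads(compact)
decoded_readable = json.loads(readable)
print(decoded_compact == decoded_readable == record)
```

Both strings are valid JSON for the same data; the options only change how the text looks.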

In [None]:
# Exercise 2 (from https://rfriend.tistory.com/474)
data_dict = {

    "1.FirstName": "Gildong",
    "2.LastName": "Hong",
    "3.Age": 20,
    "4.University": "Hangook University",
    "5.Courses": [
        {
            "Classes": [
                "Probability",
                "Generalized Linear Model",
                "Categorical Data Analysis"
            ],
            "Major": "Statistics"
        },
        {
            "Classes": [
                "Data Structure",
                "Programming",
                "Algorithms"
            ],
            "Minor": "ComputerScience"
        }
    ]
}

In [None]:
type(data_dict)

dict

In [None]:
pd.DataFrame(data_dict)

Unnamed: 0,1.FirstName,2.LastName,3.Age,4.University,5.Courses
0,Gildong,Hong,20,Hangook University,"{'Classes': ['Probability', 'Generalized Linea..."
1,Gildong,Hong,20,Hangook University,"{'Classes': ['Data Structure', 'Programming', ..."


In [None]:
data_dict.keys()

dict_keys(['1.FirstName', '2.LastName', '3.Age', '4.University', '5.Courses'])

In [None]:
data_dict.values()

dict_values(['Gildong', 'Hong', 20, 'Hangook University', [{'Classes': ['Probability', 'Generalized Linear Model', 'Categorical Data Analysis'], 'Major': 'Statistics'}, {'Classes': ['Data Structure', 'Programming', 'Algorithms'], 'Minor': 'ComputerScience'}]])

In [None]:
data_dict.items()

dict_items([('1.FirstName', 'Gildong'), ('2.LastName', 'Hong'), ('3.Age', 20), ('4.University', 'Hangook University'), ('5.Courses', [{'Classes': ['Probability', 'Generalized Linear Model', 'Categorical Data Analysis'], 'Major': 'Statistics'}, {'Classes': ['Data Structure', 'Programming', 'Algorithms'], 'Minor': 'ComputerScience'}])])

In [None]:
data_dict['5.Courses']

[{'Classes': ['Probability',
   'Generalized Linear Model',
   'Categorical Data Analysis'],
  'Major': 'Statistics'},
 {'Classes': ['Data Structure', 'Programming', 'Algorithms'],
  'Minor': 'ComputerScience'}]

In [None]:
import json
data_json = json.dumps(data_dict)
print(data_json)
type(data_json)

{"1.FirstName": "Gildong", "2.LastName": "Hong", "3.Age": 20, "4.University": "Hangook University", "5.Courses": [{"Classes": ["Probability", "Generalized Linear Model", "Categorical Data Analysis"], "Major": "Statistics"}, {"Classes": ["Data Structure", "Programming", "Algorithms"], "Minor": "ComputerScience"}]}


str

In [None]:
pd.Series(data_dict)

Unnamed: 0,0
1.FirstName,Gildong
2.LastName,Hong
3.Age,20
4.University,Hangook University
5.Courses,"[{'Classes': ['Probability', 'Generalized Line..."


In [None]:
pd.DataFrame(data_dict)

Unnamed: 0,1.FirstName,2.LastName,3.Age,4.University,5.Courses
0,Gildong,Hong,20,Hangook University,"{'Classes': ['Probability', 'Generalized Linea..."
1,Gildong,Hong,20,Hangook University,"{'Classes': ['Data Structure', 'Programming', ..."


In [None]:
pd.DataFrame(data_dict).iloc[:,-1]

Unnamed: 0,5.Courses
0,"{'Classes': ['Probability', 'Generalized Linea..."
1,"{'Classes': ['Data Structure', 'Programming', ..."


In [None]:
pd.DataFrame.from_dict(data_dict)  # create a dataframe from a dict (more flexible, more options)

Unnamed: 0,1.FirstName,2.LastName,3.Age,4.University,5.Courses
0,Gildong,Hong,20,Hangook University,"{'Classes': ['Probability', 'Generalized Linea..."
1,Gildong,Hong,20,Hangook University,"{'Classes': ['Data Structure', 'Programming', ..."


In [None]:
pd.json_normalize(data_dict)  # flattens a nested data structure into a flat DataFrame

Unnamed: 0,1.FirstName,2.LastName,3.Age,4.University,5.Courses
0,Gildong,Hong,20,Hangook University,"[{'Classes': ['Probability', 'Generalized Line..."


In [None]:
data_dict['5.Courses']

[{'Classes': ['Probability',
   'Generalized Linear Model',
   'Categorical Data Analysis'],
  'Major': 'Statistics'},
 {'Classes': ['Data Structure', 'Programming', 'Algorithms'],
  'Minor': 'ComputerScience'}]

In [None]:
pd.json_normalize(data_dict, "5.Courses")

Unnamed: 0,Classes,Major,Minor
0,"[Probability, Generalized Linear Model, Catego...",Statistics,
1,"[Data Structure, Programming, Algorithms]",,ComputerScience


In [None]:
pd.json_normalize(data_dict,
                  record_path = "5.Courses",
                  meta = ['3.Age'])

Unnamed: 0,Classes,Major,Minor,3.Age
0,"[Probability, Generalized Linear Model, Catego...",Statistics,,20
1,"[Data Structure, Programming, Algorithms]",,ComputerScience,20


## json_normalize (data, record_path, meta, ...)
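- flattens nested structures into a table
- data: a dict or a list of dicts
- record_path: the key (or path) of the nested list of records to expand into rows, i.e. a value shaped like [{}, {}, {}, ...]
- meta: fields at the same level as the record_path key that should be copied into each resulting row as extra columns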
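
What `record_path` and `meta` do can be sketched in plain Python. This is only an approximation of the row-building idea, not pandas' actual implementation, and the `course_info` dictionary is invented for the example.

```python
# Hypothetical nested record for illustration only
course_info = {
    "school": "X",
    "year": 2025,
    "courses": [
        {"title": "Stats", "credits": 3},
        {"title": "Python", "credits": 2},
    ],
}

record_path = "courses"  # the list of dicts to expand into rows
meta = ["school"]        # sibling fields to copy onto every row

# One output row per record, with the chosen meta fields attached
rows = [
    {**record, **{key: course_info[key] for key in meta}}
    for record in course_info[record_path]
]

for row in rows:
    print(row)
```

Fields that are not listed in `meta` (here `year`) are left out of the rows, which matches how the `meta` argument is used in the exercises.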

In [None]:
# JSON exercise3
# from https://pandas.pydata.org/pandas-docs/stable/reference/api/\
#              pandas.io.json.json_normalize.html

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

In [None]:
type(data), len(data)

(list, 2)

In [None]:
pd.json_normalize(data)

Unnamed: 0,state,shortname,counties,info.governor
0,Florida,FL,"[{'name': 'Dade', 'population': 12345}, {'name...",Rick Scott
1,Ohio,OH,"[{'name': 'Summit', 'population': 1234}, {'nam...",John Kasich


In [None]:
pd.json_normalize(data, 'counties')

Unnamed: 0,name,population
0,Dade,12345
1,Broward,40000
2,Palm Beach,60000
3,Summit,1234
4,Cuyahoga,1337


In [None]:
pd.json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])

Unnamed: 0,name,population,state,shortname,info.governor
0,Dade,12345,Florida,FL,Rick Scott
1,Broward,40000,Florida,FL,Rick Scott
2,Palm Beach,60000,Florida,FL,Rick Scott
3,Summit,1234,Ohio,OH,John Kasich
4,Cuyahoga,1337,Ohio,OH,John Kasich


# YAML parsing
- YAML (short for "YAML Ain't Markup Language") is a human-readable data serialization format that is commonly used for configuration files, data exchange, and other structured data.
- It is often used as a more readable alternative to formats such as XML and JSON.
- pyyaml: a Python library for parsing YAML text into Python objects and serializing Python objects back to YAML

- example 1

In [None]:
!pip install pyyaml



In [None]:
import yaml

yaml_data = """
name: John Doe
age: 30
hobbies:
  - reading
  - hiking
  - cooking
"""

# Convert YAML to a Python dictionary
data = yaml.safe_load(yaml_data)
print(data)

{'name': 'John Doe', 'age': 30, 'hobbies': ['reading', 'hiking', 'cooking']}
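
The reverse direction, from a Python dictionary to YAML text, can be sketched with `yaml.safe_dump`. The `person` dictionary is made up for the example, and `sort_keys=False` assumes a reasonably recent PyYAML (5.1 or later).

```python
import yaml

# Hypothetical data for illustration only
person = {"name": "John Doe", "age": 30, "hobbies": ["reading", "hiking"]}

# dict --> YAML text, keeping the original key order
yaml_text = yaml.safe_dump(person, sort_keys=False)
print(yaml_text)

# YAML text --> dict again
round_trip = yaml.safe_load(yaml_text)
print(round_trip == person)
```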


- example 2

In [None]:
%%writefile config.yaml
# sample YAML configuration file (generated by ChatGPT)

server:
  host: localhost
  port: 8080

database:
  host: localhost
  port: 5432
  user: myuser
  password: mypassword

Writing config.yaml


In [None]:
!cat config.yaml

# sample YAML configuration file (generated by ChatGPT)

server:
  host: localhost
  port: 8080

database:
  host: localhost
  port: 5432
  user: myuser
  password: mypassword


In [None]:
import yaml

with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)
    # config = yaml.load(f, Loader=yaml.FullLoader)  # for more complex types (trusted input only)
print(config)
print(type(config))

{'server': {'host': 'localhost', 'port': 8080}, 'database': {'host': 'localhost', 'port': 5432, 'user': 'myuser', 'password': 'mypassword'}}
<class 'dict'>


In [None]:
print(config['server'])
print(config['server']['host'])    # Output: localhost
print(config['database']['user'])  # Output: myuser

{'host': 'localhost', 'port': 8080}
localhost
myuser


# HTML Parsing
- Before working through this example, open and view some of the example HTML files in this directory.

- \<div>: short for "division"; used to mark off a block within a document
  - a block-level element that takes up the full width of a line by default
- class: not a tag but an attribute, which can be attached to any HTML tag such as div, p, or span
  - groups styles or behavior (like a "label" attached to a tag)
  - the same class name can be given to several elements, which makes it useful for grouping them
- (Examples)

```
  <div>This is one block</div>
  <div>This is another block</div>
```
```
  <div class="box">First box</div>
  <div class="box">Second box</div>
  <p class="box">A paragraph can use the same class too</p>
```


```
<div class="notice">
  <h2 class="title">Notice</h2>
  <p class="content">We are closed today.</p>
</div>
```
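
Because class names group elements, they are also a convenient handle for picking elements out of a page. A minimal sketch, assuming a notice block shaped like the last snippet above (the English text and the variable names are chosen for this example):

```python
from bs4 import BeautifulSoup

notice_html = """
<div class="notice">
  <h2 class="title">Notice</h2>
  <p class="content">We are closed today.</p>
</div>
"""

soup_notice = BeautifulSoup(notice_html, "html.parser")

# Find the block by its class, then look inside it for the labelled parts
notice = soup_notice.find("div", class_="notice")
title = notice.find(class_="title").text
content = notice.find(class_="content").text
print(title, "|", content)
```

The keyword is `class_` with a trailing underscore because `class` is a reserved word in Python.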

In [None]:
from bs4 import BeautifulSoup

In [None]:
html_text = """
<html>
<body>
  <h1> reading web page with python </h1>
     <p> page analysis </p>
     <p> page alignment </p>
     <td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
</body>
</html>
"""

In [None]:
soup = BeautifulSoup(html_text, 'html.parser')
soup


<html>
<body>
<h1> reading web page with python </h1>
<p> page analysis </p>
<p> page alignment </p>
<td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
</body>
</html>

In [None]:
type(soup)

In [None]:
soup.h1

<h1> reading web page with python </h1>

In [None]:
soup.h1.text.strip()

'reading web page with python'

In [None]:
soup.p

<p> page analysis </p>

In [None]:
soup.p.next_sibling.next_sibling

<p> page alignment </p>

In [None]:
soup.td.next_sibling.next_sibling

<td><p>more text</p></td>

In [None]:
print(soup.td.next_sibling, soup.td.next_sibling.text)

<td></td> 
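
The `<p>` example needs `.next_sibling` twice because the line break between the two tags is itself a sibling (a text node), whereas the `<td>` tags sit directly next to each other. A small sketch on made-up markup shows this, along with `find_next_sibling`, which skips text nodes:

```python
from bs4 import BeautifulSoup

sibling_html = "<body>\n<p>first</p>\n<p>second</p>\n</body>"
s = BeautifulSoup(sibling_html, "html.parser")

first = s.p
whitespace = first.next_sibling            # the text node between the two tags
second = first.next_sibling.next_sibling   # the second <p>
same = first.find_next_sibling("p")        # skips text nodes, returns the next <p> tag

print(repr(whitespace))
print(second)
print(same is second)
```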


In [None]:
html_text2 = """
<html>
<body>
  <h1 id="title"> reading web page with python </h1>
     <p id="body"> page analysis </p>
     <p> page alignment </p>
     <td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
     <ul>
         <li><a href = "http://www.naver.com"> naver</a></li>
         <li><a href = "http://www.daum.net"> daum</a></li>
     </ul>
  <div id="xxx">
    <h1> Wiki-books store </h1>
    <ul class="item all">
      <li> introduction to game design </li>
      <li> introduction to python </li>
      <li> introduction to web design </li>
    </ul>
  </div>
</body>
</html>
"""

In [None]:
soup = BeautifulSoup(html_text2, 'html.parser')

In [None]:
soup


<html>
<body>
<h1 id="title"> reading web page with python </h1>
<p id="body"> page analysis </p>
<p> page alignment </p>
<td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
<ul>
<li><a href="http://www.naver.com"> naver</a></li>
<li><a href="http://www.daum.net"> daum</a></li>
</ul>
<div id="xxx">
<h1> Wiki-books store </h1>
<ul class="item all">
<li> introduction to game design </li>
<li> introduction to python </li>
<li> introduction to web design </li>
</ul>
</div>
</body>
</html>

In [None]:
soup.h1.text.strip()

'reading web page with python'

## access by tags

In [None]:
soup


<html>
<body>
<h1 id="title"> reading web page with python </h1>
<p id="body"> page analysis </p>
<p> page alignment </p>
<td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
<ul>
<li><a href="http://www.naver.com"> naver</a></li>
<li><a href="http://www.daum.net"> daum</a></li>
</ul>
<div id="xxx">
<h1> Wiki-books store </h1>
<ul class="item all">
<li> introduction to game design </li>
<li> introduction to python </li>
<li> introduction to web design </li>
</ul>
</div>
</body>
</html>

In [None]:
soup.find(id='title')

<h1 id="title"> reading web page with python </h1>

In [None]:
soup.find(id='body').text

' page analysis '

In [None]:
soup.find_all('p')

[<p id="body"> page analysis </p>,
 <p> page alignment </p>,
 <p>more text</p>,
 <p>more text</p>]

In [None]:
soup.find_all('li')

[<li><a href="http://www.naver.com"> naver</a></li>,
 <li><a href="http://www.daum.net"> daum</a></li>,
 <li> introduction to game design </li>,
 <li> introduction to python </li>,
 <li> introduction to web design </li>]

In [None]:
soup.find_all('li')[0]

<li><a href="http://www.naver.com"> naver</a></li>

In [None]:
soup.find_all('li')[0].text, soup.find_all('li')[0].attrs

(' naver', {})

In [None]:
soup.find_all('a')

[<a href="http://www.naver.com"> naver</a>,
 <a href="http://www.daum.net"> daum</a>]

In [None]:
soup.find_all('a')[0].text, soup.find_all('a')[0].attrs

(' naver', {'href': 'http://www.naver.com'})

In [None]:
for aa in soup.find_all('a'):
    href = aa.attrs['href']
    text = aa.text
    print (text, "-->", href)

 naver --> http://www.naver.com
 daum --> http://www.daum.net
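
Real pages often contain `<a>` tags without an `href`, and links that are relative to the current page. A hedged sketch of a more defensive loop follows; `base_url` and the markup are hypothetical, and `urljoin` comes from the standard library.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page URL and markup for illustration only
base_url = "http://example.com/docs/index.html"
link_html = (
    '<a href="http://www.naver.com">naver</a> '
    '<a href="page2.html">next</a> '
    '<a>no link</a>'
)

links = []
for a in BeautifulSoup(link_html, "html.parser").find_all("a"):
    href = a.get("href")  # returns None instead of raising KeyError when href is missing
    if href is None:
        continue
    # Absolute URLs pass through unchanged; relative ones are resolved against base_url
    links.append((a.text.strip(), urljoin(base_url, href)))

print(links)
```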


## access by regular expression

In [None]:
soup


<html>
<body>
<h1 id="title"> reading web page with python </h1>
<p id="body"> page analysis </p>
<p> page alignment </p>
<td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
<ul>
<li><a href="http://www.naver.com"> naver</a></li>
<li><a href="http://www.daum.net"> daum</a></li>
</ul>
<div id="xxx">
<h1> Wiki-books store </h1>
<ul class="item all">
<li> introduction to game design </li>
<li> introduction to python </li>
<li> introduction to web design </li>
</ul>
</div>
</body>
</html>

In [None]:
import re
soup.find_all(re.compile("^p"))   # tags starting with a character 'p'

[<p id="body"> page analysis </p>,
 <p> page alignment </p>,
 <p>more text</p>,
 <p>more text</p>]

In [None]:
soup.find_all(re.compile("div" ))

[<div id="xxx">
 <h1> Wiki-books store </h1>
 <ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>
 </div>]

In [None]:
soup.find_all(href=re.compile("^http://"))

[<a href="http://www.naver.com"> naver</a>,
 <a href="http://www.daum.net"> daum</a>]
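
Beautiful Soup applies a compiled pattern with `search`, so the pattern matches anywhere in the tag name unless it is anchored. A small sketch on made-up markup shows how `^` and `$` change the result:

```python
import re
from bs4 import BeautifulSoup

regex_html = "<div><p>text</p><pre>code</pre><span>x</span></div>"
s = BeautifulSoup(regex_html, "html.parser")

starts_with_p = [t.name for t in s.find_all(re.compile("^p"))]   # p, pre
exactly_p = [t.name for t in s.find_all(re.compile("^p$"))]      # p only
contains_p = [t.name for t in s.find_all(re.compile("p"))]       # also matches span

print(starts_with_p, exactly_p, contains_p)
```

When only one exact tag name is wanted, passing the plain string (`find_all("p")`) is simpler than a pattern.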

## access by css (Cascading Style Sheets) selector

In [None]:
soup.select('h1')    # by tags

[<h1 id="title"> reading web page with python </h1>,
 <h1> Wiki-books store </h1>]

In [None]:
soup.select('#xxx')  # by id

[<div id="xxx">
 <h1> Wiki-books store </h1>
 <ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>
 </div>]

In [None]:
soup.select('.item') # by class name

[<ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>]

In [None]:
soup.select('div .item')  # descendant combinator: elements with class "item" inside a <div>

[<ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>]

In [None]:
soup.select("#xxx > ul > li")  # hierarchy (child)

[<li> introduction to game design </li>,
 <li> introduction to python </li>,
 <li> introduction to web design </li>]

In [None]:
soup.select_one("#xxx > ul > li")  # hierarchy (child)

<li> introduction to game design </li>

In [None]:
soup.select("div")

[<div id="xxx">
 <h1> Wiki-books store </h1>
 <ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>
 </div>]

In [None]:
soup.select("div ul")   # hierarchy (div tag >>> ul tag) (descendants)

[<ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>]

In [None]:
soup.select("div li")

[<li> introduction to game design </li>,
 <li> introduction to python </li>,
 <li> introduction to web design </li>]
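
The difference between the child combinator (`>`) and the descendant combinator (a space) is easiest to see when they give different results. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

selector_html = '<div id="box"><ul><li>one</li><li>two</li></ul></div>'
s = BeautifulSoup(selector_html, "html.parser")

descendants = s.select("#box li")          # <li> at any depth below #box
children = s.select("#box > li")           # <li> directly under #box: none here
grandchildren = s.select("#box > ul > li") # spell out each level explicitly

print(len(descendants), len(children), len(grandchildren))
```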

- find with classes

In [None]:
text = '<p class="body strikeout"></p>'

soup = BeautifulSoup(text, 'html.parser')
soup.find_all("p", class_="strikeout")  # <p> tags whose class attribute includes "strikeout"
                                        # (a class attribute can hold several space-separated class names)

[<p class="body strikeout"></p>]

In [None]:
soup.find_all("p", class_="body")

[<p class="body strikeout"></p>]

In [None]:
# If you want to search for tags that match two or more CSS classes,
# you should use a CSS selector:

soup.select("p.body.strikeout")

[<p class="body strikeout"></p>]

In [None]:
soup.select("p.strikeout")

[<p class="body strikeout"></p>]
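
One pitfall with `class_`: a value containing a space is compared against the attribute string exactly as written, so the order of the class names matters. The sketch below contrasts that with a CSS selector, where order does not matter.

```python
from bs4 import BeautifulSoup

s = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

exact = s.find_all("p", class_="body strikeout")      # same string as in the markup
reordered = s.find_all("p", class_="strikeout body")  # different order: no match
by_selector = s.select("p.strikeout.body")            # CSS selector: order is irrelevant

print(len(exact), len(reordered), len(by_selector))
```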

# Example from JOBKOREA (Homework)
- confirmed on 2025.10.1
- search for 'data scientist', 'seoul' in JobKorea
- Note that the page content may change each time the page is refreshed.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

# search for 'data scientist', 'seoul' in JobKorea
url = 'https://www.jobkorea.co.kr/Search/?stext=data%20scientist&local=I000'
# url = 'https://www.jobkorea.co.kr/Search/?stext=datascience&local=I000'
response = requests.get(url)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
response

<Response [200]>

In [None]:
response.text

'<!DOCTYPE html><html lang="ko"><head><meta charset="utf-8" /><meta http-equiv="X-UA-Compatible" content="IE=edge" /><title>보안정책</title><style> *{margin:0;padding:0;box-sizing:border-box}body{font-family:\'NanumGothic\',\'나눔고딕\',\'Malgun Gothic\',\'맑은 고딕\',sans-serif;background-color:#f8f9fa;line-height:1.6;color:#333;min-height:100vh;display:flex;flex-direction:column}.container{max-width:800px;margin:50px auto;padding:40px 20px;background-color:#fff;box-shadow:0 2px 10px rgba(0,0,0,0.1);border-radius:8px}.header{text-align:center;margin-bottom:40px;padding-bottom:20px;border-bottom:2px solid #e9ecef}.header h1{font-size:28px;font-weight:700;color:#2c3e50;margin:0 0 10px 0}.content{margin-bottom:40px}.content p{font-size:15px;color:#4a5568;margin-bottom:25px;text-align:justify;line-height:1.7;font-weight:400}.content p:last-child{margin-bottom:35px}.contact-info{background:#f7fafc;padding:35px;border-radius:16px;margin-bottom:40px;border:1px solid rgba(102,126,234,0.1);position:relati

- The response is a security-policy page (title "보안정책") rather than the search results, so the site appears to block direct (automated) access to this page.
- We can modify the request to include headers that mimic a browser request.
  - Add a User-Agent header: web browsers send this header to identify themselves when making requests to web servers.
  - By setting a User-Agent that mimics a popular browser (e.g., Google Chrome), **the request appears to come from a real user rather than an automated script.** This can help avoid being blocked by websites that restrict automated access.
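
To see what the header change does without contacting any server, the request can be built and inspected before it is sent. This is a sketch under stated assumptions: `example_url` is a placeholder, and `fetch` is a hypothetical helper, not part of `requests`.

```python
import requests

# Hypothetical target URL for illustration only
example_url = "https://example.com/search?q=data"
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Build the request without sending it, to inspect what would go on the wire
prepared = requests.Request("GET", example_url, headers=browser_headers).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])

def fetch(url, headers, timeout=10):
    """Hypothetical helper: send the request and raise on 4xx/5xx responses."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text
```

A status check alone is not a complete guard: the blocked response shown earlier still came back as `<Response [200]>`, so it is worth also checking the returned page (for example its `<title>`) before parsing.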

In [None]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Fetch the HTML content from the URL with headers
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
response.text



In [None]:
soup.select('div .styles_mb_space2__dk46ts4t')[0]

<div class="styles_mb_space2__dk46ts4t" data-sentry-element="Block" data-sentry-source-file="index.tsx"><a class="Flex_display_flex__i0l0hl2 Flex_gap_space6__i0l0hl1g Flex_align_center__i0l0hl8 styles_mb_space2__dk46ts4t" data-interactive="true" data-sentry-element="BaseLink" data-sentry-source-file="index.tsx" href="https://www.jobkorea.co.kr/Recruit/GI_Read/47797902?Oem_Code=C1&amp;logpath=1&amp;stext=data+scientist&amp;listno=1&amp;sc=630" rel="noopener noreferrer" style="max-width:700px" target="_blank"><span class="Typography_variant_size18__344nw25 Typography_weight_medium__344nw2d Typography_color_gray900__344nw2m Typography_truncate__344nw2y" data-accent-color="gray900" data-sentry-element="Typography" data-sentry-source-file="index.tsx">데이터 사이언티스트</span></a></div>

In [None]:
ss0 = soup.select('div .styles_mb_space2__dk46ts4t')[0]
ss0.find_all('a')[0].text

'데이터 사이언티스트'

# ------- From here on, the rest is your homework. ----------

# Exercise

In [None]:
import pandas as pd

d = {"col1": [10, 20, 30]}
pd.Series(d)


Unnamed: 0,0
col1,"[10, 20, 30]"


In [None]:
pd.DataFrame(d)

Unnamed: 0,col1
0,10
1,20
2,30
