<a href="https://colab.research.google.com/github/rtajeong/M1_new/blob/main/gg09_json_yaml_html_parsing_rev10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping using API
- When you type www.google.com in your browser's address bar, your computer is actually asking the www.google.com server for a web page; the server returns the page to your browser.
- APIs work much the same way, except instead of your web browser asking for a web page, your program asks for data. The API usually returns this data in JavaScript Object Notation (JSON) format.
  - GET request: the HTTP method used to retrieve data, for example from the OpenNotify API later in this notebook.

### Crawling and Scraping
- Crawling refers to the process of automatically traversing through a website and following links to other pages in order to discover and index content.
- Scraping, on the other hand, refers to the process of extracting specific data elements from web pages. Scraping involves sending HTTP requests to web pages and then parsing the HTML or other markup language in the response to extract the desired data.
- Crawling focuses on discovering and indexing web pages, while scraping targets specific data extraction from those pages.
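The distinction above can be sketched without any network access. The example below is only an illustration: `PAGES` is a hypothetical in-memory site, and `PageParser` and `crawl_and_scrape` are made-up helper names built on the standard library's `html.parser`. Following the links is the crawling part; pulling the `<h1>` text out of each page is the scraping part.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "website": path -> HTML. No network access is needed.
PAGES = {
    "/": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<h1>Page A</h1><a href="/b">B</a>',
    "/b": '<h1>Page B</h1><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collect link targets (for crawling) and <h1> text (for scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.titles = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.titles.append(data.strip())


def crawl_and_scrape(start):
    """Visit every page reachable from `start` and record its <h1> text."""
    seen, queue, scraped = set(), [start], {}
    while queue:
        path = queue.pop(0)
        if path in seen or path not in PAGES:
            continue
        seen.add(path)
        parser = PageParser()
        parser.feed(PAGES[path])
        # Scraping step: extract one specific data element from the page.
        scraped[path] = parser.titles[0] if parser.titles else None
        # Crawling step: queue the links found on the page.
        queue.extend(parser.links)
    return scraped


result = crawl_and_scrape("/")
print(result)
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other.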

# JSON format
- json: JavaScript Object Notation
- A text-based standard data-interchange format for easily exchanging and storing data
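As a quick illustration of the format, the sketch below shows how each JSON type maps to a Python type when decoded with the standard `json` module, and how Python objects are encoded back to text. The sample records are hypothetical.

```python
import json

# Hypothetical record mixing the JSON types: object, array, string, number, true/false, null.
text = '{"name": "Kim", "age": 25, "score": 4.5, "student": true, "pet": null, "tags": ["a", "b"]}'

record = json.loads(text)                 # JSON text -> Python objects
for key, value in record.items():
    print(f"{key:8} {type(value).__name__:9} {value!r}")

# Python objects -> JSON text. ensure_ascii=False keeps non-ASCII characters
# readable instead of escaping them; indent=2 pretty-prints the result.
pretty = json.dumps({"city": "서울", "visited": True, "note": None},
                    ensure_ascii=False, indent=2)
print(pretty)
```

Note that JSON `null`, `true`, and `false` become Python `None`, `True`, and `False`, and the reverse happens on encoding.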

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import json
from bs4 import BeautifulSoup

In [None]:
# exercise 1 : a simple json format
obj = """
{
    "name": "Kim",
    "places_lived": ["Seoul", "Korea"],
    "pet": null,
    "siblings": [{"name": "Scott", "age":25, "pet":"Zuko"}]
}
"""

In [None]:
type(obj)

str

In [None]:
r = json.loads(obj)   # decoding (json --> dict)
print(r)
type(r)

{'name': 'Kim', 'places_lived': ['Seoul', 'Korea'], 'pet': None, 'siblings': [{'name': 'Scott', 'age': 25, 'pet': 'Zuko'}]}


dict

In [None]:
json.dumps(r)    # encoding (dict --> json)

'{"name": "Kim", "places_lived": ["Seoul", "Korea"], "pet": null, "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"}]}'

In [2]:
# Exercise 2 (from https://rfriend.tistory.com/474)
py_data = {

    "1.FirstName": "Gildong",
    "2.LastName": "Hong",
    "3.Age": 20,
    "4.University": "Hangook University",
    "5.Courses": [
        {
            "Classes": [
                "Probability",
                "Generalized Linear Model",
                "Categorical Data Analysis"
            ],
            "Major": "Statistics"
        },
        {
            "Classes": [
                "Data Structure",
                "Programming",
                "Algorithms"
            ],
            "Minor": "ComputerScience"
        }
    ]
}

In [3]:
type(py_data)

dict

In [4]:
py_data.keys()

dict_keys(['1.FirstName', '2.LastName', '3.Age', '4.University', '5.Courses'])

In [5]:
py_data.values()

dict_values(['Gildong', 'Hong', 20, 'Hangook University', [{'Classes': ['Probability', 'Generalized Linear Model', 'Categorical Data Analysis'], 'Major': 'Statistics'}, {'Classes': ['Data Structure', 'Programming', 'Algorithms'], 'Minor': 'ComputerScience'}]])

In [6]:
py_data.items()

dict_items([('1.FirstName', 'Gildong'), ('2.LastName', 'Hong'), ('3.Age', 20), ('4.University', 'Hangook University'), ('5.Courses', [{'Classes': ['Probability', 'Generalized Linear Model', 'Categorical Data Analysis'], 'Major': 'Statistics'}, {'Classes': ['Data Structure', 'Programming', 'Algorithms'], 'Minor': 'ComputerScience'}])])

In [7]:
py_data['5.Courses']

[{'Classes': ['Probability',
   'Generalized Linear Model',
   'Categorical Data Analysis'],
  'Major': 'Statistics'},
 {'Classes': ['Data Structure', 'Programming', 'Algorithms'],
  'Minor': 'ComputerScience'}]

In [8]:
import json
json_str = json.dumps(py_data)
print(json_str)
type(json_str)

{"1.FirstName": "Gildong", "2.LastName": "Hong", "3.Age": 20, "4.University": "Hangook University", "5.Courses": [{"Classes": ["Probability", "Generalized Linear Model", "Categorical Data Analysis"], "Major": "Statistics"}, {"Classes": ["Data Structure", "Programming", "Algorithms"], "Minor": "ComputerScience"}]}


str

In [None]:
pd.Series(py_data)

1.FirstName                                               Gildong
2.LastName                                                   Hong
3.Age                                                          20
4.University                                   Hangook University
5.Courses       [{'Classes': ['Probability', 'Generalized Line...
dtype: object

In [9]:
pd.DataFrame(py_data)

Unnamed: 0,1.FirstName,2.LastName,3.Age,4.University,5.Courses
0,Gildong,Hong,20,Hangook University,"{'Classes': ['Probability', 'Generalized Linea..."
1,Gildong,Hong,20,Hangook University,"{'Classes': ['Data Structure', 'Programming', ..."


In [10]:
pd.DataFrame(py_data).iloc[:,-1]

Unnamed: 0,5.Courses
0,"{'Classes': ['Probability', 'Generalized Linear Model', 'Categorical Data Analysis'], 'Major': 'Statistics'}"
1,"{'Classes': ['Data Structure', 'Programming', 'Algorithms'], 'Minor': 'ComputerScience'}"


In [None]:
pd.DataFrame.from_dict(py_data)  # create a dataframe from a dict (more flexible)

Unnamed: 0,1.FirstName,2.LastName,3.Age,4.University,5.Courses
0,Gildong,Hong,20,Hangook University,"{'Classes': ['Probability', 'Generalized Linea..."
1,Gildong,Hong,20,Hangook University,"{'Classes': ['Data Structure', 'Programming', ..."


In [11]:
pd.json_normalize(py_data)  # flattens nested dicts into columns; lists (5.Courses) are left as-is

Unnamed: 0,1.FirstName,2.LastName,3.Age,4.University,5.Courses
0,Gildong,Hong,20,Hangook University,"[{'Classes': ['Probability', 'Generalized Line..."


In [None]:
py_data['5.Courses']

[{'Classes': ['Probability',
   'Generalized Linear Model',
   'Categorical Data Analysis'],
  'Major': 'Statistics'},
 {'Classes': ['Data Structure', 'Programming', 'Algorithms'],
  'Minor': 'ComputerScience'}]

In [12]:
pd.json_normalize(py_data, "5.Courses")

Unnamed: 0,Classes,Major,Minor
0,"[Probability, Generalized Linear Model, Catego...",Statistics,
1,"[Data Structure, Programming, Algorithms]",,ComputerScience


In [13]:
pd.json_normalize(py_data, "5.Courses", ['3.Age'])

Unnamed: 0,Classes,Major,Minor,3.Age
0,"[Probability, Generalized Linear Model, Catego...",Statistics,,20
1,"[Data Structure, Programming, Algorithms]",,ComputerScience,20


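# json_normalize (data, record_path, meta, ...)
- data: dict or list of dicts
- record_path: the key whose value is a list of records to unpack into rows, e.g. [{}, {}, {} ...]
- meta: keys at the same level as record_path whose values should be added as columns to the resulting DataFrame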
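To make the roles of `record_path` and `meta` concrete, here is a plain-Python sketch of the idea. It is not how pandas implements `json_normalize`; `flatten_records` is a hypothetical helper that handles only a single top-level key for `record_path` and flat keys for `meta`, and the `orders` data is made up.

```python
def flatten_records(data, record_path, meta=()):
    """Plain-Python illustration of record_path / meta (not pandas' implementation).

    data        : a dict or a list of dicts
    record_path : key whose value is a list of dicts; each dict becomes one row
    meta        : top-level keys copied onto every row produced from that parent
    """
    parents = [data] if isinstance(data, dict) else data
    rows = []
    for parent in parents:
        for record in parent[record_path]:
            row = dict(record)                 # columns from the nested record
            for key in meta:
                row[key] = parent[key]         # repeat the parent's value on each row
            rows.append(row)
    return rows


# Hypothetical data shaped like the examples in this notebook.
orders = [
    {"customer": "Kim", "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"customer": "Lee", "items": [{"sku": "C3", "qty": 5}]},
]

rows = flatten_records(orders, "items", meta=["customer"])
for row in rows:
    print(row)
```

With pandas, `pd.json_normalize(orders, "items", ["customer"])` should produce the same three rows as a DataFrame.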

In [14]:
# JSON exercise3
# from https://pandas.pydata.org/pandas-docs/stable/reference/api/\
#              pandas.io.json.json_normalize.html
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

In [None]:
type(data), len(data)

(list, 2)

In [15]:
pd.json_normalize(data)

Unnamed: 0,state,shortname,counties,info.governor
0,Florida,FL,"[{'name': 'Dade', 'population': 12345}, {'name...",Rick Scott
1,Ohio,OH,"[{'name': 'Summit', 'population': 1234}, {'nam...",John Kasich


In [16]:
pd.json_normalize(data, 'counties')

Unnamed: 0,name,population
0,Dade,12345
1,Broward,40000
2,Palm Beach,60000
3,Summit,1234
4,Cuyahoga,1337


In [17]:
pd.json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])

Unnamed: 0,name,population,state,shortname,info.governor
0,Dade,12345,Florida,FL,Rick Scott
1,Broward,40000,Florida,FL,Rick Scott
2,Palm Beach,60000,Florida,FL,Rick Scott
3,Summit,1234,Ohio,OH,John Kasich
4,Cuyahoga,1337,Ohio,OH,John Kasich


# YAML parsing
- YAML (short for "YAML Ain't Markup Language") is a human-readable data serialization format that is commonly used for configuration files, data exchange, and other structured data.
- pyyaml: a Python library for parsing YAML into Python objects and serializing Python objects back to YAML
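A minimal round-trip sketch, assuming the `pyyaml` package is installed. The settings below are hypothetical and are parsed from a string rather than a file.

```python
import yaml  # provided by the pyyaml package

# Hypothetical settings, written inline instead of read from a file.
text = """
server:
  host: localhost
  port: 8080
features:
  - search
  - export
debug: false
"""

settings = yaml.safe_load(text)            # YAML text -> dict / list / scalars
print(settings["server"]["port"], type(settings["server"]["port"]))
print(settings["features"], settings["debug"])

settings["server"]["port"] = 9090          # edit the Python object ...
dumped = yaml.safe_dump(settings, sort_keys=False)   # ... and serialize it back
print(dumped)
```

Unquoted scalars are typed on load: `8080` becomes an `int` and `false` becomes the boolean `False`.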

In [18]:
%%writefile config.yaml
# sample YAML configuration file (generated by ChatGPT)

server:
  host: localhost
  port: 8080

database:
  host: localhost
  port: 5432
  user: myuser
  password: mypassword

Writing config.yaml


In [19]:
!cat config.yaml

# sample YAML configuration file (generated by ChatGPT)

server:
  host: localhost
  port: 8080

database:
  host: localhost
  port: 5432
  user: myuser
  password: mypassword


In [20]:
!pip install pyyaml



In [22]:
import yaml

with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)
    # config = yaml.load(f, Loader=yaml.FullLoader)   # for more complex types; PyYAML 6+ requires Loader
print(config, type(config))

{'server': {'host': 'localhost', 'port': 8080}, 'database': {'host': 'localhost', 'port': 5432, 'user': 'myuser', 'password': 'mypassword'}} <class 'dict'>


In [23]:

print(config['server'])
print(config['server']['host'])    # Output: localhost
print(config['database']['user'])  # Output: myuser

{'host': 'localhost', 'port': 8080}
localhost
myuser


# HTML Parsing
- Before working through this example, open and run some of the example HTML files in this directory.

In [24]:
from bs4 import BeautifulSoup

In [25]:
html_text = """
<html>
<body>
  <h1> reading web page with python </h1>
     <p> page analysis </p>
     <p> page alignment </p>
     <td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
</body>
</html>
"""

In [26]:
soup = BeautifulSoup(html_text, 'html.parser')
soup


<html>
<body>
<h1> reading web page with python </h1>
<p> page analysis </p>
<p> page alignment </p>
<td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
</body>
</html>

In [27]:
type(soup)

In [31]:
soup.h1

<h1> reading web page with python </h1>

In [33]:
soup.h1.text.strip()

'reading web page with python'

In [34]:
soup.p

<p> page analysis </p>

In [35]:
soup.p.next_sibling.next_sibling

<p> page alignment </p>

In [None]:
soup.td.next_sibling.next_sibling

<td><p>more text</p></td>

In [None]:
print(soup.td.next_sibling, soup.td.next_sibling.text)

<td></td> 


In [36]:
html_text2 = """
<html>
<body>
  <h1 id="title"> reading web page with python </h1>
     <p id="body"> page analysis </p>
     <p> page alignment </p>
     <td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
     <ul>
         <li><a href = "http://www.naver.com"> naver</a></li>
         <li><a href = "http://www.daum.net"> daum</a></li>
     </ul>
  <div id="xxx">
    <h1> Wiki-books store </h1>
    <ul class="item all">
      <li> introduction to game design </li>
      <li> introduction to python </li>
      <li> introduction to web design </li>
    </ul>
  </div>
</body>
</html>
"""

In [37]:
soup = BeautifulSoup(html_text2, 'html.parser')

In [38]:
soup


<html>
<body>
<h1 id="title"> reading web page with python </h1>
<p id="body"> page analysis </p>
<p> page alignment </p>
<td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
<ul>
<li><a href="http://www.naver.com"> naver</a></li>
<li><a href="http://www.daum.net"> daum</a></li>
</ul>
<div id="xxx">
<h1> Wiki-books store </h1>
<ul class="item all">
<li> introduction to game design </li>
<li> introduction to python </li>
<li> introduction to web design </li>
</ul>
</div>
</body>
</html>

## access by tags

In [39]:
soup.find(id='title')

<h1 id="title"> reading web page with python </h1>

In [41]:
soup.find(id='body').text

' page analysis '

In [42]:
soup.find_all('p')

[<p id="body"> page analysis </p>,
 <p> page alignment </p>,
 <p>more text</p>,
 <p>more text</p>]

In [43]:
soup.find_all('li')

[<li><a href="http://www.naver.com"> naver</a></li>,
 <li><a href="http://www.daum.net"> daum</a></li>,
 <li> introduction to game design </li>,
 <li> introduction to python </li>,
 <li> introduction to web design </li>]

In [44]:
soup.find_all('li')[0]

<li><a href="http://www.naver.com"> naver</a></li>

In [45]:
soup.find_all('li')[0].text, soup.find_all('li')[0].attrs

(' naver', {})

In [None]:
soup.find_all('a')[0]

<a href="http://www.naver.com"> naver</a>

In [None]:
soup.find_all('a')[0].text, soup.find_all('a')[0].attrs

(' naver', {'href': 'http://www.naver.com'})

In [46]:
for aa in soup.find_all('a'):
    href = aa.attrs['href']
    text = aa.string
    print (text, "-->", href)

 naver --> http://www.naver.com
 daum --> http://www.daum.net


## access by regular expression

In [None]:
soup


<html>
<body>
<h1 id="title"> reading web page with python </h1>
<p id="body"> page analysis </p>
<p> page alignment </p>
<td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>
<ul>
<li><a href="http://www.naver.com"> naver</a></li>
<li><a href="http://www.daum.net"> daum</a></li>
</ul>
<div id="xxx">
<h1> Wiki-books store </h1>
<ul class="item all">
<li> introduction to game design </li>
<li> introduction to python </li>
<li> introduction to web design </li>
</ul>
</div>
</body>
</html>

In [None]:
import re
soup.find_all(re.compile("^p"))   # tags whose name starts with 'p'

[<p id="body"> page analysis </p>,
 <p> page alignment </p>,
 <p>more text</p>,
 <p>more text</p>]

In [None]:
soup.find_all(re.compile("div"))

[<div id="xxx">
 <h1> Wiki-books store </h1>
 <ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>
 </div>]

In [None]:
soup.find_all(href=re.compile("^http://"))

[<a href="http://www.naver.com"> naver</a>,
 <a href="http://www.daum.net"> daum</a>]

## access by css (Cascading Style Sheets) selector

In [47]:
soup.select('h1')    # by tags

[<h1 id="title"> reading web page with python </h1>,
 <h1> Wiki-books store </h1>]

In [48]:
soup.select('#xxx')  # by id

[<div id="xxx">
 <h1> Wiki-books store </h1>
 <ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>
 </div>]

In [49]:
soup.select('.item') # by class name

[<ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>]

In [50]:
soup.select('div .item')  # descendant combinator: elements with class "item" inside a div

[<ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>]

In [None]:
soup.select("#xxx > ul > li")  # hierarchy (child)

[<li> introduction to game design </li>,
 <li> introduction to python </li>,
 <li> introduction to web design </li>]

In [51]:
soup.select_one("#xxx > ul > li")  # hierarchy (child)

<li> introduction to game design </li>

In [None]:
soup.select("div")

[<div id="xxx">
 <h1> Wiki-books store </h1>
 <ul class="item all">
 <li> introduction to game design </li>
 <li> introduction to python </li>
 <li> introduction to web design </li>
 </ul>
 </div>]

In [None]:
soup.select("div li")   # hierarchy (descendants): every li anywhere inside a div

[<li> introduction to game design </li>,
 <li> introduction to python </li>,
 <li> introduction to web design </li>]

- find with classes

In [None]:
text = '<p class="body strikeout"></p>'

css_soup = BeautifulSoup(text, 'html.parser')
css_soup.find_all("p", class_="strikeout")  # can have multiple values for a class

[<p class="body strikeout"></p>]

In [None]:
css_soup.find_all("p", class_="body")

[<p class="body strikeout"></p>]

In [None]:
# If you want to search for tags that match two or more CSS classes,
# you should use a CSS selector:
css_soup.select("p.body.strikeout")

[<p class="body strikeout"></p>]

In [None]:
css_soup.select("p.strikeout")

[<p class="body strikeout"></p>]

# Example from JOBKOREA
- confirmed on 2024.7.18
- search for 'data scientist', 'seoul' in JobKorea
- Remember that the page content can change each time it is refreshed, so your results may differ.

In [52]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

# search for 'data scientist', 'seoul' in JobKorea
url = 'https://www.jobkorea.co.kr/Search/?stext=data%20scientist&local=I000'
# url = 'https://www.jobkorea.co.kr/Search/?stext=datascience&local=I000'
response = requests.get(url)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')

In [54]:
response.text

'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\r\n<html>\r\n<head>\r\n<meta http-equiv="content-type" content="text/html; charset=utf-8">\r\n</head>\r\n<body>\r\n<br>\r\n<br>\r\n<center>\r\n<h2>\r\n회원님께서는 현재 입력할 수 없는 문자열의 사용으로 인해 차단이 되었습니다.<br>\r\n문제가 지속적으로 발생할 경우 아래 고객센터로 문의하시기 바랍니다.<br>\r\n이용에 불편을 드려 죄송합니다.<br>\r\n\r\n문의(고객센터): 1588-9350<br>\r\n<br>\r\n</body>\r\n</html>'

- It seems that direct (automated) access to this web page is not allowed: the response above is a block notice, not search results.
- We can modify our request to include headers that mimic a browser request.
  - Added User-Agent header: web browsers send this header to identify themselves when making requests to web servers.
  - By setting a User-Agent that mimics a popular browser (e.g., Google Chrome), **we make our request appear to come from a real user rather than an automated script.** This helps avoid being blocked by websites that restrict automated access.
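One way to check what will actually be sent is to prepare the request without sending it. This is a sketch, assuming the `requests` package; the URL and search parameters mirror the ones used in this notebook, and the User-Agent string is only an example value.

```python
import requests

url = "https://www.jobkorea.co.kr/Search/"
params = {"stext": "data scientist", "local": "I000"}
headers = {
    # Browser-like identifier; the exact string is only an example.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# prepare() builds the request without sending it, so nothing touches the network.
prepared = requests.Request("GET", url, params=params, headers=headers).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])

# Sending it would look like this (left commented out; needs network access):
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     response.raise_for_status()   # raise on 4xx/5xx status codes
```

Inspecting `prepared.url` and `prepared.headers` is a cheap way to debug a blocked request before retrying it against the real site.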

In [55]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Fetch the HTML content from the URL with headers
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')

In [64]:
soup.prettify()[:1000]

'<!DOCTYPE html>\n<html lang="ko">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n  <meta content="IE=edge; chrome=1" http-equiv="X-UA-Compatible"/>\n  <title>\n   data scientist 채용공고 | 총 55건의 검색결과 - 잡코리아\n  </title>\n  <meta content="data scientist 채용공고 | 총 55건의 검색결과 - 잡코리아" name="title"/>\n  <meta content="data scientist 검색결과 총 55건- 2024 data scientist 채용정보가 더 알고 싶다면? 기업정보, 취업정보를 No.1 커리어 플랫폼 잡코리아에서 확인해보세요." name="description"/>\n  <meta content="data scientist, 채용, 구인구직사이트, 취업, 공기업, 직업, 취업박람회, 자소서, 채용사이트, 구직사이트,취업정보사이트, 채용공고, WORK, 직업종류, 커리어, 이직, 공고, 구인, 구직, 헤드헌팅, 경력, 신입, 인턴, 공채, 취업정보, 취업 정보, 취업뉴스, 취업 속보, 취업 뉴스, 취업상담실, 해외취업, 취업센터,채용박람회, 직업적성검사, 면접, 대기업채용, 리쿠르팅, 구인사이트, 잡, 구인광고, 직업추천, 청년일자리, 일자리사이트, 취업지원센터, 리크루팅, 구직자, 채용공고사이트, 인턴십, 일자리박람회, 일자리구하기, 취업성공, 취업사이트, 채용, 채용포털, 채용정보, 고용정보, 알바, 일자리, 구인정보, 이력서, Work, Job, 전직, 재취업, 여성취업, 정보통신취업, IT취업, 임원, CEO, 리쿠르트, 리크루트, 기업, 대기업, 중소기업, 벤처기업, 잡코리아, Jobkorea, wkqzhfldk, 원픽, OnePick, on

In [70]:
len(soup.select('article .content-list .list'))

1

In [73]:
soup.select('article .content-list .list')[0]

<article class="list">
<article class="list-item" data-gainfo='{"dimension42":"45064061", "dimension43":"전기·전자·제어", "dimension44":"DBA,데이터엔지니어,데이터사이언티스트", "dimension45":"[BS사업본부] Data Scientist", "dimension46":"서울", "dimension65":"N", "dimension66":"일반기업", "dimension70":"무료", "dimension47":"21247745", "dimension48":"엘지전자㈜"}' data-gavirturl="https://www.jobkorea.co.kr/virtual/Recruit/GI_Read/45064061?Oem_Code=C1&amp;logpath=1&amp;stext=data scientist&amp;listno=1" data-gino="47305364" data-gno="45064061" data-listno="1" data-mem-sys="21247745" data-mem-type="C">
<div class="list-section-corp">
<a class="corp-name-link dev-view" href="/Recruit/GI_Read/45064061?Oem_Code=C1&amp;logpath=1&amp;stext=data scientist&amp;listno=1" nav-src="/Search/_ContentsGIRead?Gno=45064061&amp;Mem_Type_Code=C&amp;Mem_Sys_No=21247745" onclick="$(this).closest('.list-item').addClass('checked'); GA_Virtual_Dimension($(this).closest('.list-post[data-gno=45064061]').data('gainfo')); GA_Virtual('홈&gt;통합검색&gt;공고뷰',

In [75]:
ss = soup.select('article .content-list .list')[0].select('.list-item')
len(ss)

20

In [83]:
ss[0].select('.list-section-corp')[0].a.text.strip() # corp name

'엘지전자㈜'

In [88]:
ss[0].select('.information-title')[0].a.text.strip() # title

'[BS사업본부] Data Scientist'

In [108]:
ss[0].select('.list-section-information')[0].select('ul')[0].find_all('li')

[<li class="chip-information-item">경력</li>,
 <li class="chip-information-item">대졸↑</li>,
 <li class="chip-information-item">서울 영등포구</li>,
 <li class="chip-information-item dday">오늘마감</li>]

In [111]:
# location and other information
[i.text for i in ss[0].select('.list-section-information')[0].select('ul')[0].find_all('li')]

['경력', '대졸↑', '서울 영등포구', '오늘마감']

In [115]:
ss = soup.select('article .content-list .list')[0].select('.list-item')

names, titles, infos = [], [], []

for item in ss:
    name = item.select('.list-section-corp')[0].a.text.strip()
    title = item.select('.information-title')[0].a.text.strip()
    info = [i.text for i in item.select('.list-section-information')[0]
            .select('ul')[0].find_all('li')]
    names.append(name)
    titles.append(title)
    infos.append(info)


In [120]:
# np.c_[names, titles, infos] raises "ValueError: setting an array element with a
# sequence ... inhomogeneous shape" because the lists in `infos` differ in length.
# Building the frame from a dict keeps each info list as a single cell instead.
pd.DataFrame({"names": names, "titles": titles, "infos": infos})
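The scraping loop above indexes `[0]` on every selector result, which fails with an `IndexError` as soon as one posting lacks an element. A more defensive variant might look like the sketch below. The HTML is hypothetical markup that only reuses the class names seen above, and `text_or_none` is a made-up helper; the real page structure may differ or change.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names seen above; the second
# posting is missing its title link on purpose.
html = """
<article class="list-item">
  <div class="list-section-corp"><a>Corp A</a></div>
  <div class="information-title"><a>Data Scientist</a></div>
  <ul><li class="chip-information-item">Seoul</li><li class="chip-information-item">Today</li></ul>
</article>
<article class="list-item">
  <div class="list-section-corp"><a>Corp B</a></div>
  <ul><li class="chip-information-item">Busan</li></ul>
</article>
"""


def text_or_none(node, selector):
    """Return the stripped text of the first match, or None when nothing matches."""
    found = node.select_one(selector)
    return found.get_text(strip=True) if found else None


demo = BeautifulSoup(html, "html.parser")
rows = []
for item in demo.select(".list-item"):
    rows.append({
        "name": text_or_none(item, ".list-section-corp a"),
        "title": text_or_none(item, ".information-title a"),
        # Join the variable-length chip list into one string per posting.
        "info": ", ".join(li.get_text(strip=True)
                          for li in item.select("li.chip-information-item")),
    })

for row in rows:
    print(row)
```

Because every row is a dict with the same keys, `pd.DataFrame(rows)` can turn the result into a table without the shape problems that ragged lists cause.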

# Exercise
- .text: returns the combined text content of an element and all its descendants. (recommended)
- .string: returns the element's text only when that text is unambiguous, that is, when the element contains a single string, or a single child tag that itself contains a single string.
- If an element is empty or has several children (multiple strings or tags), .string returns None.
- In summary, use .text when you want everything inside an element, and .string only when you expect exactly one string.
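Beautiful Soup also offers `get_text()` and `stripped_strings`, which give more control over whitespace than `.text`. The sketch below compares them with `.string` on the same four cells used in this exercise; `cell_soup` and `summary` are names chosen only for this illustration.

```python
from bs4 import BeautifulSoup

cell_soup = BeautifulSoup(
    "<td>some text</td><td></td><td><p>more text</p></td><td>even <p>more text</p></td>",
    "html.parser",
)

summary = []
for td in cell_soup.find_all("td"):
    summary.append({
        "string": td.string,                      # None unless exactly one string is reachable
        "text": td.get_text(" ", strip=True),     # all strings, stripped and joined by a space
        "pieces": list(td.stripped_strings),      # the individual strings, stripped
    })

for row in summary:
    print(row)
```

The last cell shows the difference most clearly: `.string` is `None` because the cell holds both a string and a tag, while `get_text(" ", strip=True)` still returns `'even more text'`.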


In [None]:
# .text and .string
html_text = """
<html>
<body>
  <td>some text</td>
  <td></td>
  <td><p>more text</p></td>
  <td>even <p>more text</p></td>
</body>
</html>
"""
soup = BeautifulSoup(html_text, 'html.parser')

In [None]:
soup.find_all('td')

[<td>some text</td>,
 <td></td>,
 <td><p>more text</p></td>,
 <td>even <p>more text</p></td>]

In [None]:
for i in soup.find_all('td'):
    print(i.string)

some text
None
more text
None


In [None]:
for i in soup.find_all('td'):
    print(i.text)

some text

more text
even more text


### Request
- get: retrieve data from a server
- put: send data to a server to create or replace a resource

In [None]:
import requests

# the following two requests are equivalent: params is encoded into the query string
parameters = {"lat": 40.71, "lon": -74}
response = requests.get("http://api.open-notify.org/iss-now.json",
                        params=parameters)
print(response.content)

response = requests.get("http://api.open-notify.org/iss-now.json?lat=40.71&lon=-74")
print(response.content)

b'{"timestamp": 1689835502, "message": "success", "iss_position": {"latitude": "28.6522", "longitude": "-80.1510"}}'
b'{"timestamp": 1689835502, "message": "success", "iss_position": {"latitude": "28.6522", "longitude": "-80.1510"}}'
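The equivalence of the two calls comes down to query-string encoding. As a rough illustration with the standard library (not the code `requests` runs internally), `urlencode` turns the same dictionary into the query string that was typed by hand:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = "http://api.open-notify.org/iss-now.json"
parameters = {"lat": 40.71, "lon": -74}

# urlencode turns the dictionary into a query string.
built = f"{base}?{urlencode(parameters)}"
manual = "http://api.open-notify.org/iss-now.json?lat=40.71&lon=-74"
print(built)
print(built == manual)

# Going the other way: split a URL and read its query parameters back.
query = parse_qs(urlsplit(manual).query)
print(query)
```

Passing a dictionary is usually the safer habit, since values containing spaces or other special characters are escaped for you.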


In [None]:
response.status_code

200

In [None]:
response = requests.get("http://api.open-notify.org/iss-now.json")

In [None]:
response.content

b'{"timestamp": 1689835502, "message": "success", "iss_position": {"latitude": "28.6522", "longitude": "-80.1510"}}'

- getting JSON from a request

In [None]:
# Make the same request as before

response = requests.get("http://api.open-notify.org/iss-now.json")

# Get the response data as a Python object.  Verify that it's a dictionary.
json_data = response.json()
print(type(json_data))
print(json_data)

<class 'dict'>
{'timestamp': 1689835502, 'message': 'success', 'iss_position': {'latitude': '28.6522', 'longitude': '-80.1510'}}
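Once the body is decoded, the values usually still need converting before use. The sketch below works offline on the response body shown earlier: it decodes the bytes with `json.loads` (roughly what `response.json()` does for you), converts the coordinate strings to floats, and interprets the timestamp as seconds since the Unix epoch.

```python
import json
from datetime import datetime, timezone

# Body copied from the response shown above, so no network call is needed.
body = b'{"timestamp": 1689835502, "message": "success", "iss_position": {"latitude": "28.6522", "longitude": "-80.1510"}}'

payload = json.loads(body)        # json.loads accepts bytes as well as str

# The coordinates arrive as strings; convert them before doing arithmetic.
latitude = float(payload["iss_position"]["latitude"])
longitude = float(payload["iss_position"]["longitude"])

# The timestamp is seconds since the Unix epoch.
observed_at = datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc)

print(payload["message"], latitude, longitude)
print(observed_at.isoformat())
```

Passing `tz=timezone.utc` keeps the result independent of the machine's local time zone.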


In [None]:
# response.headers is a case-insensitive, dictionary-like object
print(response.headers)
content_type = response.headers['Content-Type']
content_type

{'Server': 'nginx/1.10.3', 'Date': 'Thu, 20 Jul 2023 06:45:02 GMT', 'Content-Type': 'application/json', 'Content-Length': '113', 'Connection': 'keep-alive', 'access-control-allow-origin': '*'}


'application/json'