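# Chicago Sandwich Restaurant Analysis
## [Learning Objectives]
- Learn Beautiful Soup for fetching web data
- Find the desired tags with Chrome DevTools
- Access a site that introduces Chicago's best sandwich restaurants
- Extract and organize the desired data from the accessed page
- Automatically visit multiple web pages and collect the desired information
- Create a progress bar in Jupyter Notebook
- Apply the progress bar while accessing pages
- Collect information from 50 web pages
- Show the restaurant locations on a map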

In [1]:
# BeautifulSoup practice
# BeautifulSoup is a class in the bs4 module
from bs4 import BeautifulSoup

In [2]:
# Read the example HTML file
with open('./data_01/03. test_first.html', 'r') as f: # read mode; the file is closed automatically
    page = f.read()
page

'<!DOCTYPE html>\n<html>\n    <head>\n        <title>Very Simple HTML Code by PinkWink</title>\n    </head>\n    <body>\n        <div>\n            <p class="inner-text first-item" id="first">\n                Happy PinkWink.\n                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>\n            </p>\n            <p class="inner-text second-item">\n                Happy Data Science.\n                <a href="https://www.python.org" id="py-link">Python</a>\n            </p>\n        </div>\n        <p class="outer-text first-item" id="second">\n            <b>\n                Data Science is funny.\n            </b>\n        </p>\n        <p class="outer-text">\n            <b>\n                All I need is Love.\n            </b>\n        </p>\n    </body>\n</html>'

In [3]:
# Parsing => build a DOM tree object
soup = BeautifulSoup(page, 'html.parser') # parse as HTML => DOM tree
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Very Simple HTML Code by PinkWink
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    Happy PinkWink.
    <a href="http://www.pinkwink.kr" id="pw-link">
     PinkWink
    </a>
   </p>
   <p class="inner-text second-item">
    Happy Data Science.
    <a href="https://www.python.org" id="py-link">
     Python
    </a>
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    Data Science is funny.
   </b>
  </p>
  <p class="outer-text">
   <b>
    All I need is Love.
   </b>
  </p>
 </body>
</html>


In [4]:
# Direct children of the soup object (DOM traversal, not CSS selection)
list(soup.children)

['html',
 '\n',
 <html>
 <head>
 <title>Very Simple HTML Code by PinkWink</title>
 </head>
 <body>
 <div>
 <p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>
 <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>
 </div>
 <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>
 <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>
 </body>
 </html>]

In [5]:
html = list(soup.children)[2] # the third child of soup is the <html> tag
html

<html>
<head>
<title>Very Simple HTML Code by PinkWink</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>
<p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
                Data Science is funny.
            </b>
</p>
<p class="outer-text">
<b>
                All I need is Love.
            </b>
</p>
</body>
</html>

- Only the html element comes out!

In [6]:
list(html.children)

['\n',
 <head>
 <title>Very Simple HTML Code by PinkWink</title>
 </head>,
 '\n',
 <body>
 <div>
 <p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>
 <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>
 </div>
 <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>
 <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>
 </body>,
 '\n']

In this source, the '\n' text nodes between tags also count as children, so they take up index positions too.
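A minimal sketch of skipping those whitespace nodes, assuming a small inline document (`html_doc`, `all_children`, and `tag_children` are illustrative names, not part of the notebook). It keeps only the children that are `Tag` objects, so the indices line up with the actual elements.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature document with newlines between the tags.
html_doc = '<html>\n<head><title>t</title></head>\n<body><p>x</p></body>\n</html>'
soup = BeautifulSoup(html_doc, 'html.parser')

# Every direct child, including the '\n' text nodes.
all_children = list(soup.html.children)

# Only the element children; text nodes are filtered out.
tag_children = [c for c in soup.html.children if isinstance(c, Tag)]
names = [c.name for c in tag_children]

print(len(all_children))
print(names)
```

With the text nodes removed, `tag_children[1]` is the body element, instead of index 3 as in the cell that follows.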

In [7]:
body = list(html.children)[3]
body

<body>
<div>
<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>
<p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
                Data Science is funny.
            </b>
</p>
<p class="outer-text">
<b>
                All I need is Love.
            </b>
</p>
</body>

In [8]:
soup.body # same as list(list(soup.children)[2].children)[3]

<body>
<div>
<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>
<p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
                Data Science is funny.
            </b>
</p>
<p class="outer-text">
<b>
                All I need is Love.
            </b>
</p>
</body>

`soup.body` returns the same value as `body = list(html.children)[3]`.
- It selects the body tag!

In [9]:
len(list(body.children))

7

In [10]:
soup.find_all('p') # find every p tag => selection by tag name
#  => returned as a list

[<p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>,
 <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>]

- This works because the DOM tree has already been built.

#### Because the result comes back as a list, it can be used in a loop!

In [11]:
soup.find('p') # find only the first p tag

<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>

- It finds the first p tag object!

In [12]:
soup.find_all('p', class_='outer-text') # class_ : filter by CSS class

[<p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>]

- Selecting by **class** lets you search more precisely.

In [13]:
# class_ : search by class; if no other tag shares the class name, the p tag name can be omitted
soup.find_all(class_='outer-text') 

[<p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>]

In [14]:
soup.find_all(id='first') # searching by id also works!

[<p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>]

### find_all returns a list (ResultSet) even when there are 0 matches!
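A small sketch of why this matters in practice, assuming an inline document with no `table` tag (`html_doc`, `missing_one`, and `missing_all` are illustrative names). `find` gives back `None` when nothing matches, while `find_all` gives back an empty result that is still safe to loop over.

```python
from bs4 import BeautifulSoup

# Hypothetical document that contains no <table> tag.
html_doc = '<div><p id="first">Happy PinkWink.</p></div>'
soup = BeautifulSoup(html_doc, 'html.parser')

missing_one = soup.find('table')      # no match: None
missing_all = soup.find_all('table')  # no match: empty ResultSet

# Looping over an empty result is safe; the body never runs.
for tag in missing_all:
    print(tag.get_text())

# find() needs an explicit None check before calling methods on the result.
first_table_text = missing_one.get_text() if missing_one is not None else ''

print(missing_one, len(missing_all), repr(first_table_text))
```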

In [15]:
type(soup.find('p')) # find the first p

bs4.element.Tag

A Tag object is returned.

In [16]:
type(soup.find_all('p'))

bs4.element.ResultSet

ResultSet => behaves like a list

#### Each item inside a ResultSet is a Tag object.

In [17]:
type(soup)

bs4.BeautifulSoup

It is a BeautifulSoup object.

In [18]:
type(soup.children)

list_iterator

list_iterator : an iterator (iterable), not a list
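The difference shows up as soon as you try to index. The sketch below uses a plain Python iterator as a stand-in for `soup.children` (the `children_like` name and its contents are assumptions for illustration): an iterator cannot be subscripted and is consumed once, which is why the earlier cells wrap it in `list(...)`.

```python
# Stand-in for soup.children: an iterator over three child nodes.
children_like = iter(['html', '\n', '<html>...</html>'])

# Indexing an iterator directly raises TypeError.
try:
    children_like[2]
    indexing_error = ''
except TypeError as exc:
    indexing_error = str(exc)

# Converting to a list makes indexing possible.
as_list = list(children_like)
third = as_list[2]

# The iterator is now exhausted; a second pass yields nothing.
exhausted = list(children_like)

print(indexing_error)
print(third)
print(exhausted)
```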

In [19]:
type(soup.head) # == type(soup.p)

bs4.element.Tag

A Tag object is returned!

In [20]:
list(soup)

['html',
 '\n',
 <html>
 <head>
 <title>Very Simple HTML Code by PinkWink</title>
 </head>
 <body>
 <div>
 <p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>
 <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>
 </div>
 <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>
 <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>
 </body>
 </html>]

In [21]:
soup.head

<head>
<title>Very Simple HTML Code by PinkWink</title>
</head>

In [22]:
soup.head.next_sibling # the next node at the same level (a sibling under the same parent)

'\n'

To reach body from head, next_sibling has to be applied twice, because a '\n' text node sits between them.

In [23]:
soup.head.next_sibling.next_sibling

<body>
<div>
<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>
<p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
                Data Science is funny.
            </b>
</p>
<p class="outer-text">
<b>
                All I need is Love.
            </b>
</p>
</body>

In [24]:
soup.head.previous_sibling # the previous sibling

'\n'

In [25]:
soup.body.previous_sibling.previous_sibling # step back two siblings to reach head

<head>
<title>Very Simple HTML Code by PinkWink</title>
</head>

In [26]:
body.p

<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>

In [27]:
soup.html.body.p # soup.find('p')

<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>

Because the DOM tree has been built, parent-child and sibling relationships exist!

In [28]:
soup.find_all('p')

[<p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>,
 <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>]

In [29]:
# tmp_tag : a p Tag object
for tmp_tag in soup.find_all('p'):
    print(tmp_tag.get_text()) # get_text() : extract only the text


                Happy PinkWink.
                PinkWink


                Happy Data Science.
                Python



                Data Science is funny.
            



                All I need is Love.
            



**get_text()** : extracts only the text, without the tags!

In [30]:
body.get_text()

'\n\n\n                Happy PinkWink.\n                PinkWink\n\n\n                Happy Data Science.\n                Python\n\n\n\n\n                Data Science is funny.\n            \n\n\n\n                All I need is Love.\n            \n\n'

Whitespace counts as characters too, so calling get_text() on the whole body is rarely useful.
- It is better to target a specific tag!

In [31]:
# Find the a tags
soup.find_all('a')

[<a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>,
 <a href="https://www.python.org" id="py-link">Python</a>]

find_all returns a list!

In [32]:
for tmp_tag in soup.find_all('a'):
    # access an attribute by name => returns its value
    href_value = tmp_tag['href']
    text = tmp_tag.string
    print(text + ' -> ' + href_value)

PinkWink -> http://www.pinkwink.kr
Python -> https://www.python.org


---
## Find the Desired Tags with Chrome DevTools

### Extracting the KRW/USD Exchange Rate

In [33]:
# Extract the KRW/USD exchange rate
# https://finance.naver.com/marketindex/

# urlopen : fetches a resource from the web
from urllib.request import urlopen

In [34]:
url = 'https://finance.naver.com/marketindex/'
page = urlopen(url)
page

<http.client.HTTPResponse at 0x1aa722a69c8>

In [35]:
soup = BeautifulSoup(page, 'html.parser')
soup # the DOM tree has been built


<script language="javascript" src="/template/head_js.nhn?referer=info.finance.naver.com&amp;menu=marketindex&amp;submenu=market"></script>
<script src="https://ssl.pstatic.net/imgstock/static.pc/20210325123932/js/info/jindo.min.ns.1.5.3.euckr.js" type="text/javascript"></script>
<script src="https://ssl.pstatic.net/imgstock/static.pc/20210325123932/js/jindo.1.5.3.element-text-patch.js" type="text/javascript"></script>
<div id="container" style="padding-bottom:0px;">
<div class="market_include">
<div class="market_data">
<div class="market1">
<div class="title">
<h2 class="h_market1"><span>환전 고시 환율</span></h2>
</div>
<!-- data -->
<div class="data">
<ul class="data_lst" id="exchangeList">
<li class="on">
<a class="head usd" href="/marketindex/exchangeDetail.nhn?marketindexCd=FX_USDKRW" onclick="clickcr(this, 'fr1.usdt', '', '', event);">
<h3 class="h_lst"><span class="blind">미국 USD</span></h3>
<div class="head_info point_up">
<span class="value">1,133.80</span>
<span class="txt_krw"><s

Once parsing is done, elements can be selected.

In [36]:
soup.find_all('span', class_='value')

[<span class="value">1,133.80</span>,
 <span class="value">1,031.29</span>,
 <span class="value">1,334.03</span>,
 <span class="value">172.42</span>,
 <span class="value">109.7600</span>,
 <span class="value">1.1766</span>,
 <span class="value">1.3774</span>,
 <span class="value">92.9500</span>,
 <span class="value">61.56</span>,
 <span class="value">1533.6</span>,
 <span class="value">1712.2</span>,
 <span class="value">62221.58</span>]

In [37]:
soup.find_all('span', class_='value')[0].string

'1,133.80'
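The scraped value is a string with a thousands separator, so it cannot go straight into arithmetic. A minimal sketch of converting it, assuming the comma is always a thousands separator (`raw_values` reuses a few values from the output above; `rates` is an illustrative name):

```python
# A few of the scraped strings shown in the output above.
raw_values = ['1,133.80', '1,031.29', '172.42']

# Remove the thousands separator, then convert to float.
rates = [float(v.replace(',', '')) for v in raw_values]

print(rates)
```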

In [38]:
# #exchangeList > li:nth-child(1) > a.head.usd > div > span.value
# selector that goes straight to the KRW/USD exchange rate

#### DevTools (F12) => Copy selector
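A copied selector like this can be passed to Beautiful Soup's `select_one`. The sketch below runs it against a trimmed, hand-written copy of the markup shown earlier rather than the live page, so `html_doc` is an assumption and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the exchange-rate list on the page.
html_doc = '''
<ul class="data_lst" id="exchangeList">
  <li class="on">
    <a class="head usd" href="#">
      <div class="head_info point_up">
        <span class="value">1,133.80</span>
      </div>
    </a>
  </li>
</ul>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Selector copied from DevTools; '#exchangeList' matches the id attribute.
selector = '#exchangeList > li:nth-child(1) > a.head.usd > div > span.value'
usd = soup.select_one(selector)

print(usd.string)
```

`select_one` returns the first match or `None`, so it plays the same role as `find`, while `select` corresponds to `find_all`.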

## Chicago Sandwich Restaurant Site

In [39]:
from bs4 import BeautifulSoup 
from urllib.request import urlopen, Request

In [40]:
url_base = 'https://www.chicagomag.com'
url_sub = '/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'

# Extract the info for the sub pages (50 a tags) from the main page
# 1. Open the main page
url = url_base + url_sub # url : main page URL
req = Request(url, headers={'User-Agent' : 'Mozilla/5.0'}) # makes the request look like it comes from a browser
html_ = urlopen(req) 
html_ # the response object (html without the underscore is the earlier practice variable)

In [41]:
# https://www.chicagomag.com/Chicago-Magazine/November-2012/
# Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/

# https://www.chicagomag.com/Chicago-Magazine/November-2012/
# Best-Sandwiches-in-Chicago-Au-Cheval-Fried-Bologna/

In [42]:
# Parsing
soup = BeautifulSoup(html_, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [43]:
soup.find_all('div', class_='sammy')

[<div class="sammy" style="position: relative;">
 <div class="sammyRank">1</div>
 <div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br/>
 Old Oak Tap<br/>
 <em>Read more</em> </a></div>
 </div>,
 <div class="sammy" style="position: relative;">
 <div class="sammyRank">2</div>
 <div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Au-Cheval-Fried-Bologna/"><b>Fried Bologna</b><br/>
 Au Cheval<br/>
 <em>Read more</em> </a></div>
 </div>,
 <div class="sammy" style="position: relative;">
 <div class="sammyRank">3</div>
 <div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Xoco-Woodland-Mushroom/"><b>Woodland Mushroom</b><br/>
 Xoco<br/>
 <em>Read more</em> </a></div>
 </div>,
 <div class="sammy" style="position: relative;">
 <div class="sammyRank">4</div>
 <div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-i

In [44]:
# Pull out only the first-ranked entry
soup.find_all('div', class_='sammy')[0]

<div class="sammy" style="position: relative;">
<div class="sammyRank">1</div>
<div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br/>
Old Oak Tap<br/>
<em>Read more</em> </a></div>
</div>

- Check the rank!
- href : relative address
- Menu
- Cafe name

In [45]:
tmp_one = soup.find_all('div', class_='sammy')[0]

In [46]:
type(tmp_one)

bs4.element.Tag

In [47]:
tmp_one

<div class="sammy" style="position: relative;">
<div class="sammyRank">1</div>
<div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br/>
Old Oak Tap<br/>
<em>Read more</em> </a></div>
</div>

In [48]:
# Get the rank
tmp_one.find(class_='sammyRank')

<div class="sammyRank">1</div>

In [49]:
tmp_one.find(class_='sammyRank').string

'1'

In [50]:
tmp_one.find(class_='sammyRank').get_text()

'1'

In [51]:
tmp_one.find(class_='sammyListing')

<div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br/>
Old Oak Tap<br/>
<em>Read more</em> </a></div>

In [52]:
tmp_one.find(class_='sammyListing').get_text() # all nested text; .string would be None here because <a> has several children

'BLT\nOld Oak Tap\nRead more '
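`.string` and `get_text()` agree only when a tag holds a single piece of text. A minimal sketch of the difference, using a hand-copied fragment of the first listing (`html_doc`, `rank_tag`, and `listing_tag` are illustrative names):

```python
from bs4 import BeautifulSoup

# Fragment modelled on the first 'sammy' entry shown above.
html_doc = (
    '<div class="sammyRank">1</div>'
    '<div class="sammyListing"><a href="/x/"><b>BLT</b><br/>\n'
    'Old Oak Tap<br/>\n<em>Read more</em> </a></div>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

rank_tag = soup.find(class_='sammyRank')
listing_tag = soup.find(class_='sammyListing')

# One text child: .string and get_text() give the same value.
rank_string = rank_tag.string
rank_text = rank_tag.get_text()

# Several nested children: .string is None, get_text() joins all text.
listing_string = listing_tag.string
listing_text = listing_tag.get_text()

print(rank_string, rank_text)
print(listing_string)
print(listing_text.split('\n')[:2])
```

For tags with nested markup, `get_text()` is therefore the safer choice.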

In [53]:
tmp_one.find('a')

<a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br/>
Old Oak Tap<br/>
<em>Read more</em> </a>

In [54]:
tmp_one.find('a')['href'] # [...] reads an attribute value

'/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'

In [55]:
tmp_string = \
tmp_one.find(class_='sammyListing').get_text()

In [56]:
tmp_string

'BLT\nOld Oak Tap\nRead more '

In [57]:
tmp_string.split('\n')

['BLT', 'Old Oak Tap', 'Read more ']

In [58]:
print(tmp_string.split('\n')[0])
print(tmp_string.split('\n')[1])

BLT
Old Oak Tap


In [59]:
from urllib.parse import urljoin # function for joining URLs

In [60]:
# Extract rank, main_menu, cafe_name, url_add => store them
rank = []
main_menu = []
cafe_name = []
url_add = []

# Get the list of 50 ranked entries
list_soup = soup.find_all('div', class_='sammy')

for item in list_soup:
    # 1. Store the rank
    rank.append(item.find(class_='sammyRank').get_text())
    
    # 2. Extract and store main_menu and cafe_name
    tmp_string = item.find(class_='sammyListing').get_text()
    main_menu.append(tmp_string.split('\n')[0]) # the menu is item 0
    cafe_name.append(tmp_string.split('\n')[1]) # the cafe name is item 1
    
    # 3. Extract and store the URL
    url_add.append(urljoin(url_base, item.find('a')['href'])) # join base and href

In [61]:
rank

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '31',
 '32',
 '33',
 '34',
 '35',
 '36',
 '37',
 '38',
 '39',
 '40',
 '41',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '48',
 '49',
 '50']

In [62]:
url_add[:5]

['https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/',
 'https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Au-Cheval-Fried-Bologna/',
 'https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Xoco-Woodland-Mushroom/',
 'https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Als-Deli-Roast-Beef/',
 'https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Publican-Quality-Meats-PB-L/']
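`urljoin` is used instead of plain string concatenation because it copes with both relative and absolute `href` values. In the sketch below, `relative_href` comes from the listing above, while `absolute_href` is a made-up path used only for illustration.

```python
from urllib.parse import urljoin

url_base = 'https://www.chicagomag.com'

# href as it appears in the first listing (relative to the site root).
relative_href = '/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'

# Hypothetical href that is already a full URL.
absolute_href = 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Example-Page/'

joined_relative = urljoin(url_base, relative_href)  # base + path
joined_absolute = urljoin(url_base, absolute_href)  # returned unchanged

print(joined_relative)
print(joined_absolute)
```

An `href` that is already absolute keeps its own scheme, which may be why one URL in the later DataFrame output starts with `http://` rather than `https://`.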

#### Save as a DataFrame

In [63]:
# DataFrame
import pandas as pd

In [64]:
data  = {
      'Rank' : rank
    , 'Menu' : main_menu
    , 'Cafe' : cafe_name
    , 'URL'  : url_add
}
df = pd.DataFrame(data)
df.head()

| | Rank | Menu | Cafe | URL |
|---|---|---|---|---|
| 0 | 1 | BLT | Old Oak Tap | https://www.chicagomag.com/Chicago-Magazine/No... |
| 1 | 2 | Fried Bologna | Au Cheval | https://www.chicagomag.com/Chicago-Magazine/No... |
| 2 | 3 | Woodland Mushroom | Xoco | https://www.chicagomag.com/Chicago-Magazine/No... |
| 3 | 4 | Roast Beef | Al’s Deli | https://www.chicagomag.com/Chicago-Magazine/No... |
| 4 | 5 | PB&L | Publican Quality Meats | https://www.chicagomag.com/Chicago-Magazine/No... |


It is better for the Cafe name to come before Menu!

In [65]:
df = pd.DataFrame(
      data
    , columns=['Rank', 'Cafe', 'Menu', 'URL']
    
)
df.head()

| | Rank | Cafe | Menu | URL |
|---|---|---|---|---|
| 0 | 1 | Old Oak Tap | BLT | https://www.chicagomag.com/Chicago-Magazine/No... |
| 1 | 2 | Au Cheval | Fried Bologna | https://www.chicagomag.com/Chicago-Magazine/No... |
| 2 | 3 | Xoco | Woodland Mushroom | https://www.chicagomag.com/Chicago-Magazine/No... |
| 3 | 4 | Al’s Deli | Roast Beef | https://www.chicagomag.com/Chicago-Magazine/No... |
| 4 | 5 | Publican Quality Meats | PB&L | https://www.chicagomag.com/Chicago-Magazine/No... |


In [66]:
# Save as a CSV file
df.to_csv(
      './data_01/03. best_sandwiches_list_chicago.csv'
    , encoding='utf-8'
    , sep=','
)

### Extracting Sub Page Information
- The saved URLs let us reach each sub page.

In [67]:
df.head(3)

| | Rank | Cafe | Menu | URL |
|---|---|---|---|---|
| 0 | 1 | Old Oak Tap | BLT | https://www.chicagomag.com/Chicago-Magazine/No... |
| 1 | 2 | Au Cheval | Fried Bologna | https://www.chicagomag.com/Chicago-Magazine/No... |
| 2 | 3 | Xoco | Woodland Mushroom | https://www.chicagomag.com/Chicago-Magazine/No... |


#### Fetch and process the first URL

In [68]:
# Check that the first one loads correctly => if it does, loop over all 50
df['URL'][0]

'https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'

In [69]:
req = Request(
    df['URL'][0]
    , headers={'User-Agent':'Mozilla/5.0'}
)
html_ = urlopen(req)
html_

<http.client.HTTPResponse at 0x1aa727a3cc8>

In [70]:
# Parsing
soup_tmp = BeautifulSoup(html_, 'html.parser')
type(soup_tmp)

bs4.BeautifulSoup

In [71]:
print(soup_tmp.find('p', class_='addy'))

<p class="addy">
<em>$10. 2109 W. Chicago Ave., 773-772-0406, <a href="http://www.theoldoaktap.com/">theoldoaktap.com</a></em></p>


In [72]:
price_tmp = soup_tmp.find('p', class_='addy').get_text()
price_tmp

'\n$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com'

In [75]:
price_tmp.split() # default: split on whitespace

['$10.', '2109', 'W.', 'Chicago', 'Ave.,', '773-772-0406,', 'theoldoaktap.com']

In [74]:
price_tmp.split()[0][:-1]

'$10'

In [78]:
price_tmp.split()[1:-2]

['2109', 'W.', 'Chicago', 'Ave.,']

In [79]:
# Join into a single string => the full address
' '.join(price_tmp.split()[1:-2]) 
# joins the whitespace-separated words back into one string

'2109 W. Chicago Ave.,'

#### Use join to turn a list into a string!
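The split, slice, and join steps above can be collected into one helper. This is a sketch under the assumption that every entry follows the `price. address, phone, website` layout of the first page; `parse_addy` is a hypothetical name, and stripping the trailing comma is an addition that the notebook itself does not perform.

```python
def parse_addy(text):
    """Split an 'addy' string into (price, address)."""
    tokens = text.split()                 # split on any whitespace
    price = tokens[0][:-1]                # drop the trailing '.'
    address = ' '.join(tokens[1:-2])      # drop the phone and website tokens
    return price, address.rstrip(',')     # extra step: remove the trailing comma


# The text extracted from the first sub page above.
sample = '\n$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com'
price_one, address_one = parse_addy(sample)

print(price_one)
print(address_one)
```

Entries that do not follow this layout would need separate handling.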

In [80]:
df

| | Rank | Cafe | Menu | URL |
|---|---|---|---|---|
| 0 | 1 | Old Oak Tap | BLT | https://www.chicagomag.com/Chicago-Magazine/No... |
| 1 | 2 | Au Cheval | Fried Bologna | https://www.chicagomag.com/Chicago-Magazine/No... |
| 2 | 3 | Xoco | Woodland Mushroom | https://www.chicagomag.com/Chicago-Magazine/No... |
| 3 | 4 | Al’s Deli | Roast Beef | https://www.chicagomag.com/Chicago-Magazine/No... |
| 4 | 5 | Publican Quality Meats | PB&L | https://www.chicagomag.com/Chicago-Magazine/No... |
| 5 | 6 | Hendrickx Belgian Bread Crafter | Belgian Chicken Curry Salad | http://www.chicagomag.com/Chicago-Magazine/Nov... |
| 6 | 7 | Acadia | Lobster Roll | https://www.chicagomag.com/Chicago-Magazine/No... |
| 7 | 8 | Birchwood Kitchen | Smoked Salmon Salad | https://www.chicagomag.com/Chicago-Magazine/No... |
| 8 | 9 | Cemitas Puebla | Atomica Cemitas | https://www.chicagomag.com/Chicago-Magazine/No... |
| 9 | 10 | Nana | Grilled Laughing Bird Shrimp and Fried Po’ Boy | https://www.chicagomag.com/Chicago-Magazine/No... |


In [88]:
# Visit the sub pages and extract the price and address
# Trial run: process only the first 3
price = []
address = []

for n in df.index[:3]: # repeat only 3 times
    # 1. Fetch the sub page
    req = Request(df['URL'][n], headers={'User-Agent':'Mozilla/5.0'})
    html_ = urlopen(req)
    soup_tmp = BeautifulSoup(html_, 'html.parser') # 'lxml' could be used instead if lxml is installed
    
    # 2. Extract
    get_str = soup_tmp.find('p', class_='addy').get_text()
    
    # 3. Append to the lists
    price.append(get_str.split()[0][:-1]) # price
    address.append(' '.join(get_str.split()[1:-2])) # address

In [84]:
price

['$10', '$9', '$9.50']

In [85]:
address

['2109 W. Chicago Ave.,', '800 W. Randolph St.,', '445 N. Clark St.,']

In [91]:
from tqdm import tqdm # the first tqdm is the module, the second is the progress-bar wrapper
from time import sleep

In [94]:
text = ''
for char in tqdm(['a','b','c','d']):
    sleep(1)
    text = text + char
    
print(text)

100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.01s/it]

abcd





In [95]:
# Visit the 50 sub pages and extract the price and address
price = []
address = []

for n in tqdm(df.index): # repeat once per index entry
    # 1. Fetch the sub page
    req = Request(df['URL'][n], headers={'User-Agent':'Mozilla/5.0'})
    html_ = urlopen(req)
    soup_tmp = BeautifulSoup(html_, 'html.parser') # 'lxml' could be used instead if lxml is installed
    
    # 2. Extract
    get_str = soup_tmp.find('p', class_='addy').get_text()
    
    # 3. Append to the lists
    price.append(get_str.split()[0][:-1]) # price
    address.append(' '.join(get_str.split()[1:-2])) # address

100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [01:08<00:00,  1.36s/it]
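The loop above assumes every sub page contains a `p` tag with class `addy`; if one did not, `find` would return `None` and the `.get_text()` call would raise an error partway through the 50 requests. A defensive sketch, run on inline HTML instead of the live site (`extract_price_address`, `sample_ok`, and `sample_missing` are hypothetical names):

```python
from bs4 import BeautifulSoup


def extract_price_address(html_text):
    """Return (price, address), or (None, None) when the addy tag is missing."""
    soup_tmp = BeautifulSoup(html_text, 'html.parser')
    addy = soup_tmp.find('p', class_='addy')
    if addy is None:
        return None, None
    tokens = addy.get_text().split()
    return tokens[0][:-1], ' '.join(tokens[1:-2])


# Markup copied from the first sub page shown earlier.
sample_ok = (
    '<p class="addy">\n<em>$10. 2109 W. Chicago Ave., 773-772-0406, '
    '<a href="http://www.theoldoaktap.com/">theoldoaktap.com</a></em></p>'
)
# Hypothetical page without the address block.
sample_missing = '<p>No address block here.</p>'

result_ok = extract_price_address(sample_ok)
result_missing = extract_price_address(sample_missing)

print(result_ok)
print(result_missing)
```

Appending `None` for a missing entry keeps the `price` and `address` lists the same length as the DataFrame.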


In [96]:
price

['$10',
 '$9',
 '$9.50',
 '$9.40',
 '$10',
 '$7.25',
 '$16',
 '$10',
 '$9',
 '$17',
 '$11',
 '$5.49',
 '$14',
 '$10',
 '$13',
 '$4.50',
 '$11.95',
 '$11.50',
 '$6.25',
 '$15',
 '$5',
 '$6',
 '$8',
 '$5.99',
 '$7.52',
 '$11.95',
 '$7.50',
 '$12.95',
 '$7',
 '$21',
 '$9.79',
 '$9.75',
 '$13',
 '$7.95',
 '$9',
 '$9',
 '$8',
 '$8',
 '$7',
 '$6',
 '$7.25',
 '$11',
 '$6',
 '$9',
 '$5.49',
 '$8',
 '$6.50',
 '$7.50',
 '$8.75',
 '$6.85']
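The prices are still strings with a leading `$`, so they cannot be averaged or sorted numerically yet. A minimal conversion sketch, assuming every entry has the `$<number>` form seen above (`price_sample` holds the first four values from the output; `price_value` is an illustrative name):

```python
# First four scraped prices from the output above.
price_sample = ['$10', '$9', '$9.50', '$9.40']

# Strip the dollar sign and convert to float.
price_value = [float(p.lstrip('$')) for p in price_sample]

print(price_value)
print(max(price_value))
```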

In [97]:
address

['2109 W. Chicago Ave.,',
 '800 W. Randolph St.,',
 '445 N. Clark St.,',
 '914 Noyes St., Evanston,',
 '825 W. Fulton Mkt.,',
 '100 E. Walton',
 '1639 S. Wabash Ave.,',
 '2211 W. North Ave.,',
 '3619 W. North Ave.,',
 '3267 S. Halsted St.,',
 '2537 N. Kedzie Blvd.,',
 'Multiple',
 '3124 N. Broadway,',
 '3455 N. Southport Ave.,',
 '2657 N. Kedzie Ave.,',
 '1120 W. Grand Ave.,',
 '1141 S. Jefferson St.,',
 '333 E. Benton Pl.,',
 '1411 N. Wells St.,',
 '1747 N. Damen Ave.,',
 '3209 W. Irving Park',
 'Multiple',
 '5347 N. Clark St.,',
 '2954 W. Irving Park Rd.,',
 'Multiple',
 '191 Skokie Valley Rd., Highland Park,',
 'Multiple',
 '1818 W. Wilson Ave.,',
 '2517 W. Division St.,',
 '218 W. Kinzie',
 'Multiple',
 '1547 N. Wells St.,',
 '415 N. Milwaukee Ave.,',
 '1840 N. Damen Ave.,',
 '1220 W. Webster Ave.,',
 '5357 N. Ashland Ave.,',
 '1834 W. Montrose Ave.,',
 '615 N. State St.,',
 'Multiple',
 '241 N. York Rd., Elmhurst,',
 '1323 E. 57th St.,',
 '655 Forest Ave., Lake Forest,',
 'Hotel L

In [99]:
df['price'] = price
df['Address'] = address
df.head()

| | Rank | Cafe | Menu | URL | price | Address |
|---|---|---|---|---|---|---|
| 0 | 1 | Old Oak Tap | BLT | https://www.chicagomag.com/Chicago-Magazine/No... | $10 | 2109 W. Chicago Ave., |
| 1 | 2 | Au Cheval | Fried Bologna | https://www.chicagomag.com/Chicago-Magazine/No... | $9 | 800 W. Randolph St., |
| 2 | 3 | Xoco | Woodland Mushroom | https://www.chicagomag.com/Chicago-Magazine/No... | $9.50 | 445 N. Clark St., |
| 3 | 4 | Al’s Deli | Roast Beef | https://www.chicagomag.com/Chicago-Magazine/No... | $9.40 | 914 Noyes St., Evanston, |
| 4 | 5 | Publican Quality Meats | PB&L | https://www.chicagomag.com/Chicago-Magazine/No... | $10 | 825 W. Fulton Mkt., |


In [104]:
df = df.loc[:, ['Rank', 'Cafe', 'Menu', 'price', 'Address']]
df.set_index('Rank', inplace=True) # use the Rank column as the index
df.head()

| Rank | Cafe | Menu | price | Address |
|---|---|---|---|---|
| 1 | Old Oak Tap | BLT | $10 | 2109 W. Chicago Ave., |
| 2 | Au Cheval | Fried Bologna | $9 | 800 W. Randolph St., |
| 3 | Xoco | Woodland Mushroom | $9.50 | 445 N. Clark St., |
| 4 | Al’s Deli | Roast Beef | $9.40 | 914 Noyes St., Evanston, |
| 5 | Publican Quality Meats | PB&L | $10 | 825 W. Fulton Mkt., |
