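# Handling Different Website Layouts
### What a parsing function does
- Select the title element and extract the title text
- Select the article's main content
- Select any other content items that are needed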
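
The three steps above can be sketched on an inline HTML string, so that nothing depends on a live site. The markup, the `byline` class, and the `parse_article` helper are assumptions made for illustration; only the `h1` and `div.post-body` lookups mirror the Brookings example in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration; real sites use other tags and classes.
SAMPLE_HTML = """
<html><body>
  <h1>Sample headline</h1>
  <span class="byline">Jane Doe</span>
  <div class="post-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

def parse_article(html):
    """Return (title, body, byline) extracted from an HTML string."""
    bs = BeautifulSoup(html, 'html.parser')
    # Step 1: select the title element and extract its text
    title = bs.find('h1').get_text(strip=True)
    # Step 2: select the main content
    body = bs.find('div', class_='post-body').get_text(' ', strip=True)
    # Step 3: select any other content that is needed (here, a made-up byline)
    byline = bs.find('span', class_='byline').get_text(strip=True)
    return title, body, byline

title, body, byline = parse_article(SAMPLE_HTML)
print(title)
print(body)
print(byline)
```

Each site needs its own tag names and classes in steps 1 to 3, which is why the layout-specific part is kept in one small function.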

#### brookings.edu

In [1]:
!pip install requests beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import requests
from bs4 import BeautifulSoup

class Content:
    def __init__(self, url, title, body):
        self.url=url
        self.title=title
        self.body=body
        
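def getPage(url):
    # A timeout keeps the request from hanging indefinitely
    req=requests.get(url, timeout=10)
    return BeautifulSoup(req.text, 'html.parser')

def scrapeBrookings(url):
    bs=getPage(url)
    # find() returns None when the layout does not match, so check before reading .text
    titleTag=bs.find('h1')
    bodyTag=bs.find('div', class_='post-body')
    title=titleTag.text if titleTag is not None else ''
    body=bodyTag.text if bodyTag is not None else ''
    return Content(url, title, body)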

url='https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'

content=scrapeBrookings(url)
print('Title : {}'.format(content.title))
print('URL : {}'.format(content.url))
print(content.body)

Title : Delivering inclusive urban access: 3 uncomfortable truths
URL : https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/

The past few decades have been filled with a deep optimism about the role of cities and suburbs across the world. These engines of economic growth host a majority of world population, are major drivers of economic innovation, and have created pathways to opportunities for untold amounts of people.







Jeffrey Gutman

					Former Nonresident Fellow, Global Economy and Development										







Adie Tomer

					Senior Fellow - Brookings Metro 

 Twitter
AdieTomer





But all is not well within our so-called Urban Century. Rapid urbanization, rising gentrification, concentrated poverty, and shortages of basic infrastructure have combined to create spatial inequity in cities and suburbs across the globe. The challenges of housing, moving, and employing so many people have led to longer travel t
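
The printed body above contains long runs of blank lines and tab-indented author blurbs, because `.text` keeps the whitespace that sits between nested tags. One possible cleanup, shown here as a sketch, strips every line and drops the empty ones. `clean_body` is a hypothetical helper and the `raw` string is a stand-in, not real output from the page.

```python
def clean_body(text):
    """Strip each line and drop the empty ones, keeping the original order."""
    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)

# A short stand-in for the raw .text of an article body (not real output).
raw = 'First paragraph.\n\n\n\nAuthor Name\n\n\t\t\tJob title\t\t\n\n\nSecond paragraph.'
print(clean_body(raw))
```

Applied to the scraper, this would be called as `clean_body(content.body)` before printing. It does not remove the author blurbs themselves; that would need a narrower selector.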

## Crawling Sites Through Search

In [41]:
class Content:
    def __init__(self, topic, url, title, body):
        self.topic=topic
        self.url=url
        self.title=title
        self.body=body
        
    def print(self):
        print('New article found for topic : {}'.format(self.topic))
        print('URL : {}'.format(self.url))
        print('TITLE : {}'.format(self.title))
        print('BODY : \n{}'.format(self.body))
        
        
class Website:
    def __init__(self, name, url, searchUrl, resultListing, resultUrl, absoluteUrl, titleTag, bodyTag):
        self.name=name
        self.url=url
        self.searchUrl=searchUrl
        self.resultListing=resultListing
        self.resultUrl=resultUrl
        self.absoluteUrl=absoluteUrl
        self.titleTag=titleTag
        self.bodyTag=bodyTag
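
Judging from how the crawler below uses it, `Website` is a plain record: `url` is the site's base URL, `searchUrl` is the prefix that the topic is appended to, `resultListing` and `resultUrl` are CSS selectors for a result block and the link inside it, `absoluteUrl` says whether those links are already absolute, and `titleTag` and `bodyTag` are CSS selectors for the article page. As an alternative sketch, the same record could be written as a dataclass so that every field is named at the call site. `SiteConfig` and the example.com values are assumptions for illustration, not part of the original notebook.

```python
from dataclasses import dataclass

@dataclass
class SiteConfig:
    """Hypothetical dataclass equivalent of the Website class."""
    name: str
    url: str
    searchUrl: str
    resultListing: str
    resultUrl: str
    absoluteUrl: bool
    titleTag: str
    bodyTag: str

# Placeholder values for illustration only; example.com is not a real crawl target.
site = SiteConfig(
    name='Example',
    url='https://example.com',
    searchUrl='https://example.com/search?q=',
    resultListing='div.result',
    resultUrl='h3 > a',
    absoluteUrl=False,
    titleTag='h1',
    bodyTag='div.article p',
)
print(site.searchUrl + 'python')
```

Keyword arguments make it harder to swap two selectors by accident than an eight-item positional list does.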

In [42]:
import requests
from bs4 import BeautifulSoup

class Crawler:
    def getPage(self, url):
        try:
            req=requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')
    
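    def safeGet(self, pageObj, selector):
        # select() always returns a list, so an empty list means there was no match
        childObj=pageObj.select(selector)
        if len(childObj) > 0:
            return childObj[0].get_text()
        # Return an empty string so that callers can test for a missing value
        return ''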
        
    def getAllBody(self, pageObj, selector):
        # Join the text of every element that matches the selector, one per line
        childObj=pageObj.select(selector)
        return ''.join(child.get_text()+'\n' for child in childObj)
        
    def search(self, topic, site):
        # site: a Website object
        print('searchUrl+topic:', site.searchUrl+topic)

        bs=self.getPage(site.searchUrl+topic)
        if bs is None:
            print('Could not load the search page. Skipping')
            return
        searchResults=bs.select(site.resultListing)

        for result in searchResults:
            url=result.select(site.resultUrl)[0].attrs['href']
            if (site.absoluteUrl):
                bs=self.getPage(url)
            else:
                bs=self.getPage(site.url+url)
            if bs is None:
                print('Something was wrong with that page or URL. Skipping')
                continue

            title=self.safeGet(bs, site.titleTag)
            #body=self.safeGet(bs, site.bodyTag)  # first paragraph only
            body=self.getAllBody(bs, site.bodyTag)  # the whole article

            if title.strip()!='' and body.strip()!='':
                content=Content(topic, url, title, body)
                content.print()
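
How `resultListing` and `resultUrl` work together can be tried offline. The sketch below runs the two selectors from the Reuters configuration in the next cell against made-up markup. The HTML and the `/article/one` style paths are assumptions, and the real search page may be structured differently.

```python
from bs4 import BeautifulSoup

# Made-up search-results markup shaped to match the two selectors below.
SEARCH_HTML = """
<div class="search-result-content">
  <h3 class="search-result-title"><a href="/article/one">First result</a></h3>
</div>
<div class="search-result-content">
  <h3 class="search-result-title"><a href="/article/two">Second result</a></h3>
</div>
"""

bs = BeautifulSoup(SEARCH_HTML, 'html.parser')
hrefs = []
# The outer selector yields one element per search result ...
for result in bs.select('div.search-result-content'):
    # ... and the inner selector finds the link inside that result.
    links = result.select('h3.search-result-title > a')
    if links:  # skip result blocks that have no matching link
        hrefs.append(links[0].attrs['href'])
print(hrefs)
```

The `if links:` check is an addition in this sketch. `search` indexes `[0]` directly, which raises `IndexError` when a result block has no matching link.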

In [43]:
crawler=Crawler()

siteData1=[
    ['Reuters',
    'http://reuters.com',
    'http://www.reuters.com/search/news?blob=',
    'div.search-result-content',
    'h3.search-result-title > a',
    False,
    'h1',
    'p.Paragraph-paragraph-2Bgue']
]

sites=[]
for row in siteData1:
    sites.append(Website(*row))
    
topics=['python']
for topic in topics:
    print('GETTING INFO ABOUT : '+topic)
    for targetSite in sites:
        crawler.search(topic, targetSite)

GETTING INFO ABOUT : python
searchUrl+topic: http://www.reuters.com/search/news?blob=python
New article found for topic : python
URL : /article/idUSKCN11S04G
TITLE : Python in India demonstrates huge appetite
BODY : 
A 20 feet rock python was caught on camera in Junagadh district of India’s western Gujarat state with a swollen stomach after it consumed an antelope on Tuesday (September 20).
Residents informed authorities at Girnar Wildlife Sanctuary after they spotted the reptile lying in discomfort in a field.
In view of the massive swelling of the python’s stomach, the forest authorities suspect that it gobbled up a full-grown ‘nilgai’ or blue bull.
The python - unable to move now - was rescued by the forest personnel and has been put under observation.
“We will keep it (python) under observation. We will release it back in the wild once it digests the antelope and the swelling subsides,” said Assistant Conservator of Forest, S.D. Tilala.
A blue bull is far larger than an ideal prey 

New article found for topic : python
URL : /article/idUSKBN1OD2CM
TITLE : UK woman illegally imported python-skin products
BODY : 
LONDON (Reuters) - A British woman who illegally imported and sold fashion accessories made from python skin was convicted on Friday, London police said.
Stephanie Scolaro, 26, was involved in the illegal import of a parcel containing 10 python-skin hats and two bags which was seized by customs in 2016 at Leipzig airport in Germany, Southwark Crown Court had heard.
An investigation began after London police’s Wildlife Crime Unit was alerted to the incident.
The enquiry found that Scolaro operated an online company named ‘SS-Python.com’, where she sold python-skin hats, bags, chokers and mobile phone covers.
More illicit python snake products were subsequently found in Scolaro’s central London home, and for sale at three shops in London.
“Pythons are one of many species protected under CITES, an international treaty to protect endangered plants and animals,”

New article found for topic : python
URL : /article/idUSL5N0J50QB20131120
TITLE : Monty Python not dead after all - stage show planned
BODY : 
LONDON, Nov 20 (Reuters) - The comic team Monty Python, whose BBC TV series from the 1970s and feature films took their subversive humour and “Dead Parrot” routine around the world, are to reunite for a stage show, British media reported on Wednesday.
A news release issued on behalf of the five surviving Pythons, Eric Idle, John Cleese, Terry Gilliam, Michael Palin and Terry Jones, all in their 70s, said that an official announcement would be made on Thursday.
But several British newspapers and media outlets reported that the five would be appearing on stage for the first time together since the 1980s.
The group was famed for its skits about a man trying to return a dead parrot to a shopkeeper who claimed the bird was “resting” and for poking fun at the establishment, the military and religion.
“We’re getting together and putting on a show - it’
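
The crawler builds its URLs by string concatenation (`site.searchUrl+topic` and `site.url+url`), and the output above prints the relative path of each article, for example `/article/idUSKCN11S04G`. A possible alternative, sketched with the standard library's `urllib.parse`, encodes the topic and resolves relative paths against the base URL. The helper names `build_search_url` and `absolute_url` are hypothetical.

```python
from urllib.parse import quote_plus, urljoin

def build_search_url(search_url, topic):
    """Append a URL-encoded topic to a search URL prefix."""
    return search_url + quote_plus(topic)

def absolute_url(base_url, href):
    """Resolve href against base_url; an href that is already absolute is unchanged."""
    return urljoin(base_url, href)

print(build_search_url('http://www.reuters.com/search/news?blob=', 'monty python'))
print(absolute_url('http://reuters.com', '/article/idUSKCN11S04G'))
print(absolute_url('http://reuters.com', 'https://www.reuters.com/world/'))
```

Because `urljoin` handles both relative and absolute links, a crawler written this way would not need a separate `absoluteUrl` flag per site.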

# Crawling Through Links

In [31]:
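import requests
from bs4 import BeautifulSoup

url='https://www.reuters.com'
link_list=[]
req=requests.get(url)
soup=BeautifulSoup(req.text, 'html.parser')

# Anchors whose data-testid is 'Heading' or 'Link'; href=True skips anchors without a target
data_testid_links=soup.find_all('a', href=True, attrs={'data-testid':['Heading', 'Link']})

# Print every matching anchor, and record each distinct href once in link_list
for i, link in enumerate(data_testid_links):
    href=link['href']
    print('[{:4}] : {}'.format(i, href))
    if href not in link_list:
        link_list.append(href)

print('link_list length:', len(link_list))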

[   0] : https://www.reuters.com/world/europe/
[   1] : /world/
[   2] : /world/europe/ukraine-russia-what-you-need-know-right-now-2022-07-03/
[   3] : /world/
[   4] : /world/europe/ukraine-says-18-medics-killed-hundreds-facilities-damaged-since-invasion-2022-07-24/
[   5] : /world/
[   6] : /world/europe/zelenskiy-says-ukraine-unbowed-even-russians-expect-defeat-2022-07-24/
[   7] : /world/
[   8] : /world/europe/lavrov-offers-reassurance-over-russian-grain-supplies-cairo-visit-2022-07-24/
[   9] : /world/
[  10] : /world/europe/russia-says-it-hit-military-boat-odesa-port-ukraine-2022-07-24/
[  11] : /world/
[  12] : /world/europe/odesa-strike-shows-it-will-not-be-easy-export-grain-via-ports-ukraine-2022-07-24/
[  13] : /world/
[  14] : /world/europe/russian-investigator-says-wants-new-tribunal-ukraine-2022-07-25/
[  15] : /world/
[  16] : /world/middle-east/ukraine-works-resume-grain-exports-flags-russian-strikes-risk-2022-07-24/
[  17] : /world/europe/russian-investigator-says-want
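
The listing above mixes one absolute URL with relative paths and repeats the section link `/world/` many times. Before following such links, a crawler would usually resolve them to absolute URLs, keep only links on the same host, and drop duplicates. A minimal sketch under those assumptions follows. `unique_internal_links` is a hypothetical helper, and the example.com entry is made up to show the filtering.

```python
from urllib.parse import urljoin, urlparse

def unique_internal_links(base_url, hrefs):
    """Resolve hrefs against base_url, keep same-host links, drop duplicates in order."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc != base_host:
            continue  # external link, not followed
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

# hrefs copied from the first lines of the listing, plus one made-up external link
hrefs = [
    'https://www.reuters.com/world/europe/',
    '/world/',
    '/world/europe/ukraine-russia-what-you-need-know-right-now-2022-07-03/',
    '/world/',
    'https://example.com/elsewhere',
]
internal = unique_internal_links('https://www.reuters.com', hrefs)
for link in internal:
    print(link)
```

A set is used for the membership test because `href not in some_list` scans the whole list each time, which becomes slow as the number of collected links grows.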

## Installing the Selenium Library

In [32]:
!pip install selenium

Defaulting to user installation because normal site-packages is not writeable
Collecting selenium
  Downloading selenium-4.3.0-py3-none-any.whl (981 kB)
Collecting trio-websocket~=0.9
  Using cached trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting trio~=0.17
  Downloading trio-0.21.0-py3-none-any.whl (358 kB)
Collecting outcome
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.1.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0
  Using cached h11-0.13.0-py3-none-any.whl (58 kB)
Installing collected packages: outcome, h11, async-generator, wsproto, trio, trio-websocket, selenium
Successfully installed async-generator-1.10 h11-0.13.0 outcome-1.2.0 selenium-4.3.0 trio-0.21.0 trio-websocket-0.9.2 wsproto-1.1.0
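
Because the log reports a user installation, it can be worth confirming that the running kernel actually sees the package. The sketch below uses the standard library's `importlib.metadata` (Python 3.8 or later). `installed_version` is a hypothetical helper that returns `None` when the package is not visible to the current interpreter.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print('selenium:', installed_version('selenium'))
```

If this prints `None` right after a successful install, the notebook kernel is probably running a different Python environment from the one `pip` installed into.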
