<a href="https://colab.research.google.com/github/yjyuwisely/AI_project_mastery_bootcamp/blob/main/cnn_ai_article_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

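Mini project (2024.11.03)

- **Goal**: Search CNN (edition.cnn.com) for "Artificial Intelligence" and collect the titles and URLs of five articles from the search results.

- **Limitation**: The approach used in class for the Naver News main page (locating the search icon element) did not work on the CNN main page. Instead, the task was completed using the search page URL that appears when the magnifying-glass icon is clicked to open an empty search box (https://edition.cnn.com/search?q=&from=0&size=10&page=1&sort=newest&types=all&section=).

- **Code and results**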



In [None]:
# Import the required modules
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import time

# Initialize the Chrome WebDriver
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

# Open the CNN search page
url = 'https://edition.cnn.com/search?q=&from=0&size=10&page=1&sort=newest&types=all&section='
driver.get(url)

# Locate the search input field and the search button
search_input = driver.find_element(By.XPATH, '//*[@id="search"]/div[1]/div[1]/input')
search_button = driver.find_element(By.XPATH, '//*[@id="search"]/div[1]/div[1]/button[2]')

# Type the search term and click the search button
search_input.send_keys("Artificial Intelligence")
search_button.click()

# Wait for the results to load
time.sleep(3)  # adjust this delay as needed

# General XPath pattern that matches every article link element
path = '//*[@id="search"]/div[2]/div/div[2]//a[2]'

# Get all elements that match the article-link XPath
article_elements = driver.find_elements(By.XPATH, path)

# Print the title and URL of the first 5 articles, if available
for i, article in enumerate(article_elements[:5]):  # limit to the first 5 articles
    title = article.find_element(By.XPATH, './div/div[1]/span').text  # title text
    url = article.get_attribute('href')  # article URL
    print(f"기사 {i + 1}: {title}")  # "기사" means "Article"
    print(f"URL: {url}\n")

기사 1: Apple wants its AI iPhone to turn around a sales rut. Here’s how it’s going so far
URL: https://www.cnn.com/2024/10/31/tech/iphone-16-ai-early-sales-numbers-earnings/index.html

기사 2: Apple debuted AI on the iPhone today. Here’s what to look out for
URL: https://www.cnn.com/2024/10/28/tech/apple-iphone-ai/index.html

기사 3: Welcome to the ‘show me the money’ quarter
URL: https://www.cnn.com/2024/10/26/investing/q3-earnings-ai-tech/index.html

기사 4: Painting by AI robot Ai-Da could fetch more than $120,000 at auction
URL: https://www.cnn.com/2024/10/24/style/ai-da-ai-robot-painting-auction-sothebys-intl-scli-tan/index.html

기사 5: Biden makes clear AI can’t launch nukes as he looks to harness new technology’s power
URL: https://www.cnn.com/2024/10/24/politics/joe-biden-artificial-intelligence-nuclear-weapons/index.html
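
The cell above pauses for a fixed `time.sleep(3)` before reading the results. An alternative is to poll until the results actually appear, which is the idea behind Selenium's own `WebDriverWait`. The sketch below shows that polling idea in plain Python so it runs without a browser; `wait_until` and `fake_find_results` are hypothetical names introduced only for this illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], Optional[T]],
               timeout: float = 10.0, poll: float = 0.5) -> T:
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a page whose results only show up on the third check.
attempts = {"n": 0}


def fake_find_results():
    attempts["n"] += 1
    return ["result"] if attempts["n"] >= 3 else []


found = wait_until(fake_find_results, timeout=2.0, poll=0.01)
print(found, attempts["n"])
```

With a real driver, the condition could be `lambda: driver.find_elements(By.XPATH, path)`, since `find_elements` returns an empty list while nothing matches.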



- **Ways to save the results**
1. Saving as a text file

In [None]:
with open("cnn_ai_articles.txt", "w", encoding="utf-8") as file:
    for i, article in enumerate(article_elements[:5]):
        title = article.find_element(By.XPATH, './div/div[1]/span').text
        url = article.get_attribute('href')
        file.write(f"기사 {i + 1}: {title}\n")
        file.write(f"URL: {url}\n\n")

Result

In [None]:
기사 1: Apple wants its AI iPhone to turn around a sales rut. Here’s how it’s going so far
URL: https://www.cnn.com/2024/10/31/tech/iphone-16-ai-early-sales-numbers-earnings/index.html

기사 2: Apple debuted AI on the iPhone today. Here’s what to look out for
URL: https://www.cnn.com/2024/10/28/tech/apple-iphone-ai/index.html

기사 3: Welcome to the ‘show me the money’ quarter
URL: https://www.cnn.com/2024/10/26/investing/q3-earnings-ai-tech/index.html

기사 4: Painting by AI robot Ai-Da could fetch more than $120,000 at auction
URL: https://www.cnn.com/2024/10/24/style/ai-da-ai-robot-painting-auction-sothebys-intl-scli-tan/index.html

기사 5: Biden makes clear AI can’t launch nukes as he looks to harness new technology’s power
URL: https://www.cnn.com/2024/10/24/politics/joe-biden-artificial-intelligence-nuclear-weapons/index.html

2. Saving as a CSV file

In [None]:
import csv

with open("cnn_ai_articles.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Article Number", "Title", "URL"])  # Header row
    for i, article in enumerate(article_elements[:5]):
        title = article.find_element(By.XPATH, './div/div[1]/span').text
        url = article.get_attribute('href')
        writer.writerow([i + 1, title, url])

Result

In [None]:
Article Number,Title,URL
1,Apple wants its AI iPhone to turn around a sales rut. Here’s how it’s going so far,https://www.cnn.com/2024/10/31/tech/iphone-16-ai-early-sales-numbers-earnings/index.html
2,Apple debuted AI on the iPhone today. Here’s what to look out for,https://www.cnn.com/2024/10/28/tech/apple-iphone-ai/index.html
3,Welcome to the ‘show me the money’ quarter,https://www.cnn.com/2024/10/26/investing/q3-earnings-ai-tech/index.html
4,"Painting by AI robot Ai-Da could fetch more than $120,000 at auction",https://www.cnn.com/2024/10/24/style/ai-da-ai-robot-painting-auction-sothebys-intl-scli-tan/index.html
5,Biden makes clear AI can’t launch nukes as he looks to harness new technology’s power,https://www.cnn.com/2024/10/24/politics/joe-biden-artificial-intelligence-nuclear-weapons/index.html

3. Saving as a JSON file

In [None]:
import json

articles = []
for i, article in enumerate(article_elements[:5]):
    title = article.find_element(By.XPATH, './div/div[1]/span').text
    url = article.get_attribute('href')
    articles.append({"Article Number": i + 1, "Title": title, "URL": url})

with open("cnn_ai_articles.json", "w", encoding="utf-8") as file:
    json.dump(articles, file, ensure_ascii=False, indent=4)

Result

In [None]:
[
    {
        "Article Number": 1,
        "Title": "Apple wants its AI iPhone to turn around a sales rut. Here’s how it’s going so far",
        "URL": "https://www.cnn.com/2024/10/31/tech/iphone-16-ai-early-sales-numbers-earnings/index.html"
    },
    {
        "Article Number": 2,
        "Title": "Apple debuted AI on the iPhone today. Here’s what to look out for",
        "URL": "https://www.cnn.com/2024/10/28/tech/apple-iphone-ai/index.html"
    },
    {
        "Article Number": 3,
        "Title": "Welcome to the ‘show me the money’ quarter",
        "URL": "https://www.cnn.com/2024/10/26/investing/q3-earnings-ai-tech/index.html"
    },
    {
        "Article Number": 4,
        "Title": "Painting by AI robot Ai-Da could fetch more than $120,000 at auction",
        "URL": "https://www.cnn.com/2024/10/24/style/ai-da-ai-robot-painting-auction-sothebys-intl-scli-tan/index.html"
    },
    {
        "Article Number": 5,
        "Title": "Biden makes clear AI can’t launch nukes as he looks to harness new technology’s power",
        "URL": "https://www.cnn.com/2024/10/24/politics/joe-biden-artificial-intelligence-nuclear-weapons/index.html"
    }
]
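
Each of the three save cells above queries the live page again for the title and URL. One possible variation is to collect the data once into plain dictionaries and write all three files from that list. The sketch below assumes this restructuring; `save_articles` is a hypothetical helper, and the two sample records are copied from the output shown earlier so the block runs without a browser.

```python
import csv
import json
import tempfile
from pathlib import Path

FIELDS = ["Article Number", "Title", "URL"]


def save_articles(articles, out_dir="."):
    """Write the same list of article dicts as .txt, .csv and .json files."""
    out = Path(out_dir)
    with open(out / "cnn_ai_articles.txt", "w", encoding="utf-8") as f:
        for a in articles:
            # Same layout as the text-file cell ("기사" means "Article").
            f.write(f"기사 {a['Article Number']}: {a['Title']}\n")
            f.write(f"URL: {a['URL']}\n\n")
    with open(out / "cnn_ai_articles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(articles)  # titles containing commas are quoted automatically
    with open(out / "cnn_ai_articles.json", "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=4)


# Two records taken from the results above.
sample = [
    {"Article Number": 3,
     "Title": "Welcome to the ‘show me the money’ quarter",
     "URL": "https://www.cnn.com/2024/10/26/investing/q3-earnings-ai-tech/index.html"},
    {"Article Number": 4,
     "Title": "Painting by AI robot Ai-Da could fetch more than $120,000 at auction",
     "URL": "https://www.cnn.com/2024/10/24/style/ai-da-ai-robot-painting-auction-sothebys-intl-scli-tan/index.html"},
]

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    save_articles(sample, tmp)
    print(sorted(p.name for p in Path(tmp).iterdir()))
```

In the notebook, the list could be built once from `article_elements[:5]` with the same `find_element` and `get_attribute` calls used in the cells above.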

- **Class examples** (2024.11.01)

**1. Web crawling with Selenium**

In [None]:
# Import the packages needed for this task
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

In [None]:
# Open the Chrome browser
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

In [None]:
from selenium.webdriver.common.by import By  # needed for By.ID below

url = 'https://ldjwj.github.io/webPage/'
driver.get(url)

selected_id = driver.find_element(By.ID, 'rank')
print(selected_id)
print(selected_id.tag_name)  # tag name of the element
print(selected_id.text)      # text content of the element

<selenium.webdriver.remote.webelement.WebElement (session="1eb0fdc87eb68b3c77ff9ceeef2c058f", element="f.781CE7D7A19D2CE1A344B51CA5D62DAD.d.0BA3DBF09179CB2931FB1417C8E5BA0F.e.23")>
a
10. 랭킹 정보 가져오기(웹 크롤링)


In [None]:
from selenium.webdriver.common.by import By
select_id = driver.find_element(By.ID, 'rank')
print(select_id)
print(select_id.text)

<selenium.webdriver.remote.webelement.WebElement (session="1eb0fdc87eb68b3c77ff9ceeef2c058f", element="f.781CE7D7A19D2CE1A344B51CA5D62DAD.d.0BA3DBF09179CB2931FB1417C8E5BA0F.e.23")>
10. 랭킹 정보 가져오기(웹 크롤링)


In [None]:
sel_tag_h1 = driver.find_element(By.TAG_NAME, 'h1')
print(sel_tag_h1.text)
print(sel_tag_h1)

my web page
<selenium.webdriver.remote.webelement.WebElement (session="1eb0fdc87eb68b3c77ff9ceeef2c058f", element="f.781CE7D7A19D2CE1A344B51CA5D62DAD.d.0BA3DBF09179CB2931FB1417C8E5BA0F.e.24")>


In [None]:
from selenium.webdriver.common.by import By

## Get every <a> tag on the page
# Older Selenium 3 style call, removed in Selenium 4.3:
# selected_tags_a = driver.find_elements_by_tag_name('a')
selected_tag_a = driver.find_elements(By.TAG_NAME, 'a')
print(selected_tag_a)

[<selenium.webdriver.remote.webelement.WebElement (session="1eb0fdc87eb68b3c77ff9ceeef2c058f", element="f.781CE7D7A19D2CE1A344B51CA5D62DAD.d.0BA3DBF09179CB2931FB1417C8E5BA0F.e.25")>, <selenium.webdriver.remote.webelement.WebElement (session="1eb0fdc87eb68b3c77ff9ceeef2c058f", element="f.781CE7D7A19D2CE1A344B51CA5D62DAD.d.0BA3DBF09179CB2931FB1417C8E5BA0F.e.26")>, <selenium.webdriver.remote.webelement.WebElement (session="1eb0fdc87eb68b3c77ff9ceeef2c058f", element="f.781CE7D7A19D2CE1A344B51CA5D62DAD.d.0BA3DBF09179CB2931FB1417C8E5BA0F.e.27")>, <selenium.webdriver.remote.webelement.WebElement (session="1eb0fdc87eb68b3c77ff9ceeef2c058f", element="f.781CE7D7A19D2CE1A344B51CA5D62DAD.d.0BA3DBF09179CB2931FB1417C8E5BA0F.e.28")>, <selenium.webdriver.remote.webelement.WebElement (session="1eb0fdc87eb68b3c77ff9ceeef2c058f", element="f.781CE7D7A19D2CE1A344B51CA5D62DAD.d.0BA3DBF09179CB2931FB1417C8E5BA0F.e.29")>, <selenium.webdriver.remote.webelement.WebElement (session="1eb0fdc87eb68b3c77ff9ceeef2c05

In [None]:
sel_tag_a1 = driver.find_elements(By.TAG_NAME, 'a')
print(type(sel_tag_a1))
for one in sel_tag_a1:
    print(one.text)

<class 'list'>
01. 제목 가져오기(title)
02. 텍스트 가져오기(p)
03. 링크 가져오기(a)
04. 이미지 정보 가져오기(img)
05. 리스트 정보 가져오기(ul,ol)
06. id를 활용한 정보 획득
07. class를 활용한 정보 획득
08. 하나의 이미지 다운로드
09. 여러개의 이미지 다운로드
10. 랭킹 정보 가져오기(웹 크롤링)


**2. Keyword search example on Naver News**

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://news.naver.com/'

# Navigate to the given URL with the web driver.
driver.get(url)

In [None]:
from selenium.webdriver.common.by import By

# Find the search icon element
# /html/body/section/header/div[1]/div/div/div[2]/div[3]/a
search_icon = driver.find_element(By.XPATH, '/html/body/section/header/div[1]/div/div/div[2]/div[3]/a')
print(search_icon.tag_name)
print(search_icon.text)
search_icon.click()


# Find the search box element
# //*[@id="u_hs"]/div/div/input
search_input = driver.find_element(By.XPATH, '//*[@id="u_hs"]/div/div/input')
print(search_input.tag_name)
print(search_input.text)

# Find the search button element
# //*[@id="u_hs"]/div/div/button[2]
search_button = driver.find_element(By.XPATH, '//*[@id="u_hs"]/div/div/button[2]')
print(search_button.tag_name)
print(search_button.text)

# Type the search term ("패션" means "fashion") and run the search
search_input.send_keys("패션")
search_button.click()

a
검색
input

button
뉴스검색


In [None]:
import time

# Save the handle of the current tab
current_tab = driver.current_window_handle
print(current_tab)

# Get the handles of all open tabs
all_tabs = driver.window_handles
print(all_tabs)

# Switch to the new tab
for tab in all_tabs:
    if tab != current_tab:
        driver.switch_to.window(tab)
        break

# Read the URL of the new tab
time.sleep(2)  # wait for the page to load
current_url = driver.current_url
print("새로운 탭의 URL:", current_url)  # label means "URL of the new tab"

781CE7D7A19D2CE1A344B51CA5D62DAD
['781CE7D7A19D2CE1A344B51CA5D62DAD', '8FE080541522135F5192FB2E13B2DB74']
새로운 탭의 URL: https://search.naver.com/search.naver?where=news&ie=utf8&sm=nws_hty&query=%ED%8C%A8%EC%85%98
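
The `query` parameter in the URL printed above is the percent-encoded UTF-8 form of the search term. As a small check, the standard library's `urllib.parse` can decode it back; the URL below is copied from that output.

```python
from urllib.parse import parse_qs, urlsplit

result_url = ("https://search.naver.com/search.naver"
              "?where=news&ie=utf8&sm=nws_hty&query=%ED%8C%A8%EC%85%98")

# parse_qs maps each parameter name to a list of decoded values.
params = parse_qs(urlsplit(result_url).query)
print(params["query"][0])  # 패션
```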


2.1 Results after searching

In [None]:
## Get information from the search results page
# //*[@id="sp_nws1"]/div[1]/div/div[2]/a[2]

path = '//*[@id="sp_nws1"]/div[1]/div/div[2]/a[2]'
sel_xpath = driver.find_element(By.XPATH, path)
print(sel_xpath.text)

[포토]김설희, 패션모델부문 인기상


**3. Getting five article titles and links from the "Artificial Intelligence" search results on CNN (edition.cnn.com)**

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By

#url = 'https://edition.cnn.com/'
url = 'https://edition.cnn.com/search?q=&from=0&size=10&page=1&sort=newest&types=all&section='

# Navigate to the given URL with the web driver.
driver.get(url)
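
The search URL above leaves its `q` parameter empty, and the URL recorded in a later output carries the search term there. A minimal sketch, assuming CNN keeps accepting this parameter layout: build the URL with the term already filled in. `build_cnn_search_url` is a hypothetical helper; the other parameter values are copied unchanged from the URL above.

```python
from urllib.parse import urlencode


def build_cnn_search_url(query: str) -> str:
    """Return the CNN search URL with `query` placed in the q parameter."""
    params = {
        "q": query,
        "from": 0,
        "size": 10,
        "page": 1,
        "sort": "newest",
        "types": "all",
        "section": "",
    }
    # urlencode turns spaces into '+' and keeps the dict's insertion order.
    return "https://edition.cnn.com/search?" + urlencode(params)


print(build_cnn_search_url("Artificial Intelligence"))
```

Passing an empty string reproduces the URL used in the cell above, so the same helper covers both cases.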

In [None]:
from selenium.webdriver.common.by import By

# Find the search icon element
# XPaths tried for the icon (the Naver one first, then CNN's header icon):
# /html/body/section/header/div[1]/div/div/div[2]/div[3]/a
# //*[@id="headerMenuIcon"]/svg
# //*[@id="headerMenuIcon"]/svg/path
# search_icon = driver.find_element(By.XPATH, '//*[@id="headerMenuIcon"]/svg/path')
# print(search_icon.tag_name)
# print(search_icon.text)
# search_icon.click()

# Find the search box element
# Candidate XPaths (the last one is used):
# //*[@id="u_hs"]/div/div/input
# //*[@id="pageHeader"]/div/div/div[2]/div/div[1]/form/input
# //*[@id="search"]/div[1]/div[1]/input
search_input = driver.find_element(By.XPATH, '//*[@id="search"]/div[1]/div[1]/input')
print(search_input.tag_name)
print(search_input.text)

# Find the search button element
# Candidate XPaths (the last one is used):
# //*[@id="u_hs"]/div/div/button[2]
# //*[@id="pageHeader"]/div/div/div[2]/div/div[1]/form/button
# //*[@id="search"]/div[1]/div[1]/button[2]
search_button = driver.find_element(By.XPATH, '//*[@id="search"]/div[1]/div[1]/button[2]')
print(search_button.tag_name)
print(search_button.text)

# Type the search term and run the search
search_input.send_keys("Artificial Intelligence")
search_button.click()

input

button



In [None]:
import time

# Save the handle of the current tab
current_tab = driver.current_window_handle
print(current_tab)

# Get the handles of all open tabs
all_tabs = driver.window_handles
print(all_tabs)

# Switch to the new tab, if one was opened
for tab in all_tabs:
    if tab != current_tab:
        driver.switch_to.window(tab)
        break

# Read the URL of the active tab
time.sleep(2)  # wait for the page to load
current_url = driver.current_url
print("새로운 탭의 URL:", current_url)  # label means "URL of the new tab"

BE34C1F2EFED0BA0E6346BC5D2FBBBE9
['BE34C1F2EFED0BA0E6346BC5D2FBBBE9']
새로운 탭의 URL: https://edition.cnn.com/search?q=Artificial+Intelligence&from=0&size=10&page=1&sort=newest&types=all&section=
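
The output above lists a single window handle, so the CNN search stayed in the same tab and the switching loop had nothing to switch to, whereas the Naver search earlier produced two handles. The sketch below isolates that selection logic as a pure function so both cases can be checked without a browser; `find_other_tab` is a hypothetical helper, and the handle strings are copied from the two recorded outputs.

```python
from typing import Optional, Sequence


def find_other_tab(current: str, handles: Sequence[str]) -> Optional[str]:
    """Return the first handle that differs from `current`, or None if there is none."""
    for handle in handles:
        if handle != current:
            return handle
    return None


# Naver run: the search opened a second tab.
naver = find_other_tab(
    "781CE7D7A19D2CE1A344B51CA5D62DAD",
    ["781CE7D7A19D2CE1A344B51CA5D62DAD", "8FE080541522135F5192FB2E13B2DB74"],
)
# CNN run: only the original tab exists.
cnn = find_other_tab(
    "BE34C1F2EFED0BA0E6346BC5D2FBBBE9",
    ["BE34C1F2EFED0BA0E6346BC5D2FBBBE9"],
)
print(naver)  # 8FE080541522135F5192FB2E13B2DB74
print(cnn)    # None
```

In the notebook, a `None` result would mean `driver.switch_to.window` can be skipped and the current tab read directly.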


In [None]:
## Get information from the search results page
# //*[@id="sp_nws1"]/div[1]/div/div[2]/a[2]

#path = '//*[@id="sp_nws1"]/div[1]/div/div[2]/a[2]'
path = '//*[@id="search"]/div[2]/div/div[2]/div/div[2]/div/div/div[1]/a[2]/div/div[1]/span'
sel_xpath = driver.find_element(By.XPATH, path)
print(sel_xpath.text)

Apple wants its AI iPhone to turn around a sales rut. Here’s how it’s going so far


3.1 Getting the article titles

In [None]:
# Define a more general XPath pattern for all article titles
# This XPath should capture all articles, not just one
path = '//*[@id="search"]/div[2]/div/div[2]//a[2]/div/div[1]/span'

# Get all elements that match the XPath for article titles
article_elements = driver.find_elements(By.XPATH, path)

# Print the text of the first 5 articles, if available
for i, article in enumerate(article_elements[:5]):  # Limit to first 5 articles
    print(f"기사 {i + 1}: {article.text}")


기사 1: Apple wants its AI iPhone to turn around a sales rut. Here’s how it’s going so far
기사 2: Apple debuted AI on the iPhone today. Here’s what to look out for
기사 3: Welcome to the ‘show me the money’ quarter
기사 4: Painting by AI robot Ai-Da could fetch more than $120,000 at auction
기사 5: Biden makes clear AI can’t launch nukes as he looks to harness new technology’s power
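
The pattern in 3.1 replaces the fixed chain of `div` steps from the single-title XPath with `//`, which matches at any depth, so one expression covers every result card. Below is a minimal sketch of that idea on a hypothetical, much simplified markup. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, instead of Selenium, so it runs without a browser; the `SAMPLE` markup and its values are invented for illustration and are not CNN's real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical result list: each card holds an image link, then the story link.
SAMPLE = """
<div id="search">
  <div class="card">
    <a href="/img1">image</a>
    <a href="/story1"><div><div><span>First headline</span></div></div></a>
  </div>
  <div class="card">
    <a href="/img2">image</a>
    <a href="/story2"><div><div><span>Second headline</span></div></div></a>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# './/a[2]' selects, at any depth, each <a> that is the second <a> of its parent.
links = root.findall(".//a[2]")

articles = []
for link in links:
    # Relative path from the link down to the title span, as in the notebook.
    title = link.find("./div/div[1]/span").text
    articles.append((title, link.get("href")))

print(articles)
```

The second step, resolving `./div/div[1]/span` relative to each link, corresponds to calling `find_element` on an element instead of on the driver.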


3.2 Adding the URLs

In [None]:
# General XPath pattern that matches every article link element.
# Adjust it if CNN's page structure changes.

path = '//*[@id="search"]/div[2]/div/div[2]//a[2]'

# Get all elements that match the article-link XPath
article_elements = driver.find_elements(By.XPATH, path)

# Print the title and URL of the first 5 articles, if available
for i, article in enumerate(article_elements[:5]):  # limit to the first 5 articles
    title = article.find_element(By.XPATH, './div/div[1]/span').text  # title text
    url = article.get_attribute('href')  # article URL
    print(f"기사 {i + 1}: {title}")  # "기사" means "Article"
    print(f"URL: {url}\n")

기사 1: Apple wants its AI iPhone to turn around a sales rut. Here’s how it’s going so far
URL: https://www.cnn.com/2024/10/31/tech/iphone-16-ai-early-sales-numbers-earnings/index.html

기사 2: Apple debuted AI on the iPhone today. Here’s what to look out for
URL: https://www.cnn.com/2024/10/28/tech/apple-iphone-ai/index.html

기사 3: Welcome to the ‘show me the money’ quarter
URL: https://www.cnn.com/2024/10/26/investing/q3-earnings-ai-tech/index.html

기사 4: Painting by AI robot Ai-Da could fetch more than $120,000 at auction
URL: https://www.cnn.com/2024/10/24/style/ai-da-ai-robot-painting-auction-sothebys-intl-scli-tan/index.html

기사 5: Biden makes clear AI can’t launch nukes as he looks to harness new technology’s power
URL: https://www.cnn.com/2024/10/24/politics/joe-biden-artificial-intelligence-nuclear-weapons/index.html

