# **Web Scraping of Stock Market News for Sentiment Analysis**
### *Implementation of a Buy or Sell Recommendation Model for Stocks in the Financial Market Based on Sentiment Analysis*

[Robert Garcia Rey](https://www.notion.so/Robert-Garcia-Rey-Data-Analyst-6d7b578d2bf848d585dc9d1a97b1036c?pvs=4)
- garcia.robert.0514@eam.edu.co
- https://www.linkedin.com/in/robert-garcia-rey/

### 1. Introduction

Stock market news articles published from 2021-10-01 to 2023-11-09 are collected from Investing.com by web scraping. Each listing page is requested with ***Request*** and `urlopen` from `urllib.request`, and the article links are extracted from the returned HTML with ***Beautiful Soup***.

### 2. Import libraries

In [2]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import yfinance as yf
import time
from newspaper import Article
from htmldate import find_date
import warnings
warnings.filterwarnings('ignore')

### 3. Data collection

##### *Scraping to build the news dataset from investing.com*

In [130]:
def get_newslinks(company, page_number):
    """Scrapes article URLs for a given company and page number from a news website.

    :param company: name of the company to scrape articles for
    :param page_number: page number on the news website to iterate over

    :return: list of article URLs
    """
    url = f"https://www.investing.com/equities/{company}-news/{page_number}"
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }

    req = Request(url, headers=headers)
    html_content = urlopen(req).read()

    soup = BeautifulSoup(html_content, "html.parser")
    articles = soup.find_all('article')

    cleaned_links = []
    
    for article in articles:
        links = article.find_all('a')
        for link in links:
            partial_link = link.get('href')
            if partial_link is not None:  # Skip anchors that have no href
                if 'https' in partial_link:
                    cleaned_links.append(partial_link)
                elif partial_link.startswith('/'):
                    cleaned_links.append('https://www.investing.com' + partial_link)

    return list(set(cleaned_links))


In [131]:
all_company_urls = []
for page in range(1, 119):
    results = get_newslinks('meta', page)
    all_company_urls.extend(results)
    time.sleep(5)  # Wait 5 seconds between requests
all_company_urls

['https://www.investing.com/news/stock-market-news/goldman-sachs-highlights-seven-tech-giants-as-market-leaders-93CH-3237107#comments',
 'https://www.investing.com/news/cryptocurrency-news/meta-introduces-ai-models-for-video-generation-image-editing-3238168',
 'https://www.investing.com/news/stock-market-news/goldman-sachs-highlights-seven-tech-giants-as-market-leaders-93CH-3237107',
 'https://www.investing.com/news/cryptocurrency-news/meta-introduces-ai-models-for-video-generation-image-editing-3237043',
 'https://www.investing.com/news/stock-market-news/meta-launches-aibased-video-editing-tools-3237113',
 'https://www.investing.com/news/stock-market-news/sirius-xm-shares-slip-as-market-shows-mixed-results-93CH-3237177',
 'https://www.investing.com/news/stock-market-news/exclusivemetas-head-of-augmented-reality-software-stepping-down-3238124',
 'https://www.investing.com/pro/offers/breaking-news-offer?referral=3238125_news_8',
 'https://www.investing.com/news/stock-market-news/meta-sh
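
The list above contains the same article twice (once with a `#comments` fragment and once without) as well as a promotional `/pro/offers/` link. A minimal sketch of one way to normalize the links before saving them follows. The helper `normalize_links` and the rule that real articles live under `/news/` are assumptions made for illustration, not part of the original notebook.

```python
from urllib.parse import urldefrag, urljoin, urlparse

BASE_URL = "https://www.investing.com"


def normalize_links(hrefs, base_url=BASE_URL):
    """Return unique, absolute article URLs in first-seen order.

    Hypothetical helper, not part of the notebook above.
    """
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href:
            continue
        # Resolve relative paths and drop fragments such as '#comments'
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        # Assumption: real articles live under the /news/ path
        if not urlparse(absolute).path.startswith("/news/"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


# Sample hrefs taken from the output above, plus a relative path and a missing href
sample = [
    "https://www.investing.com/news/stock-market-news/goldman-sachs-highlights-seven-tech-giants-as-market-leaders-93CH-3237107#comments",
    "https://www.investing.com/news/stock-market-news/goldman-sachs-highlights-seven-tech-giants-as-market-leaders-93CH-3237107",
    "https://www.investing.com/pro/offers/breaking-news-offer?referral=3238125_news_8",
    "/news/stock-market-news/meta-launches-aibased-video-editing-tools-3237113",
    None,
]
links = normalize_links(sample)
print(links)
```

Unlike `list(set(...))`, this keeps the links in the order they were found, which makes repeated runs easier to compare.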

In [132]:
# Save URLs to a text file

with open('meta_article_investing.txt', 'w', encoding='utf-8') as f:
    for link in all_company_urls:
        f.write("%s\n" % link)

In [2]:
# List that will hold the URLs read from the file
all_company_urls = []

# Open the file and read it line by line
with open('meta_article_investing.txt', 'r', encoding='utf-8') as archivo:
    # Strip the trailing newline and add each URL to the list
    for linea in archivo:
        all_company_urls.append(linea.strip())

In [4]:
ticker = 'META'
columns = ['ticker', 'publish_date', 'title', 'body_text', 'url']
rows = []

# Download and parse every article; skip the ones that fail
for link in all_company_urls:
    article = Article(link)
    article.download()

    try:
        article.parse()  # raises if download() did not succeed
    except Exception as e:
        print(f"Error processing URL {link}: {str(e)}")
        continue

    #sid = SentimentIntensityAnalyzer()
    #polarity = sid.polarity_scores(article.text)

    row = {'ticker': ticker, 'publish_date': find_date(link), 'title': article.title, 'body_text': article.text, 'url': link}
    #row.update(polarity)
    rows.append(row)

# Build the DataFrame once instead of concatenating inside the loop
article_sentiments = pd.DataFrame(rows, columns=columns)

Error processing URL https://www.investing.com/news/world-news/us-govt-tells-vaccine-makers-to-price-updated-covid-shots-reasonably-3125494: Article `download()` failed with HTTPSConnectionPool(host='www.investing.com', port=443): Read timed out. (read timeout=7) on URL https://www.investing.com/news/world-news/us-govt-tells-vaccine-makers-to-price-updated-covid-shots-reasonably-3125494
Error processing URL https://www.investing.com/news/economy/moderna-raises-covid19-vaccine-sales-forecast-to-19-billion-2771155#comments: Article `download()` failed with HTTPSConnectionPool(host='www.investing.com', port=443): Read timed out. (read timeout=7) on URL https://www.investing.com/news/economy/moderna-raises-covid19-vaccine-sales-forecast-to-19-billion-2771155#comments
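
Both failures above are read timeouts, which are often transient. One option is to retry a failed download a few times before skipping the URL. The sketch below shows a generic retry helper with exponential backoff. `call_with_retries`, its defaults, and the stand-in `flaky_download` are hypothetical names used only for illustration. In the loop above the exception surfaces in `article.parse()`, so the wrapped function would need to cover both `download()` and `parse()`.

```python
import time


def call_with_retries(func, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying after an exception with exponential backoff.

    Hypothetical helper: the names and default values are assumptions.
    """
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except Exception as exc:
            if attempt == retries:
                raise  # out of attempts: let the caller log and skip the URL
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Demo with a stand-in for the download step that times out twice
attempts = {"count": 0}


def flaky_download(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("Read timed out")
    return f"<html>content of {url}</html>"


# sleep is replaced by a no-op so the demo does not actually wait
html = call_with_retries(flaky_download, "https://example.com/article", sleep=lambda s: None)
print(attempts["count"], html)
```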


Investing.com dataset

In [9]:
article_sentiments

Unnamed: 0,ticker,publish_date,title,body_text,url
0,META,2023-11-02,Stock Market Today: Dow ends higher as Treasur...,"Published Nov 01, 2023 07:08PM ET Updated Nov ...",https://www.investing.com/news/stock-market-ne...
1,META,2023-11-02,Wall Street indexes rally on bets of peak US i...,"Published Nov 02, 2023 05:51AM ET Updated Nov ...",https://www.investing.com/news/economy/futures...
2,META,2023-11-03,"Moderna downside has played out, HSBC raises r...","Published Nov 03, 2023 09:49AM ET\n\n© Reuters...",https://www.investing.com/news/stock-market-ne...
3,META,2023-11-03,Uber notches upgrade ahead of Q3 earnings: 4 b...,"Published Nov 03, 2023 06:10AM ET\n\n© Reuters...",https://www.investing.com/news/stock-market-ne...
4,META,2023-11-08,Moderna Highlights its Digital and AI Strategy...,"Published Nov 08, 2023 07:04AM ET\n\n© Reuters...",https://www.investing.com/news/assorted/modern...
...,...,...,...,...,...
1497,META,2021-10-04,"Sweden to give 12-15 year olds Pfizer vaccine,...","Published Oct 04, 2021 10:45AM ET Updated Oct ...",https://www.investing.com/news/coronavirus/swe...
1498,META,2021-10-04,EU regulator backs mRNA vaccine booster for pe...,"Published Oct 04, 2021 10:19AM ET Updated Oct ...",https://www.investing.com/news/stock-market-ne...
1499,META,2021-10-04,"Tesla, Merck Rise Premarket; 3M Falls By Inves...",By Peter Nurse\n\nInvesting.com -- Stocks in f...,https://www.investing.com/news/stock-market-ne...
1500,META,2021-10-04,"Amid COVID-19 booster data dilemma, EU nations...","Published Oct 03, 2021 08:14PM ET\n\n© Reuters...",https://www.investing.com/news/stock-market-ne...
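
Most `body_text` values in the table start with a `Published ... ET` stamp, and the first row pairs a `publish_date` of 2023-11-02 with a text that begins `Published Nov 01, 2023 07:08PM ET`. If the stamp in the text is needed, it can be parsed directly. The sketch below assumes the stamp always has the format seen in the table; `parse_published` is a hypothetical helper, not part of the original notebook.

```python
import re
from datetime import datetime

# Assumption: the stamp always looks like 'Published Nov 01, 2023 07:08PM ET'
PUBLISHED_RE = re.compile(
    r"^Published (?P<stamp>[A-Z][a-z]{2} \d{2}, \d{4} \d{2}:\d{2}[AP]M) ET"
)


def parse_published(body_text):
    """Return the 'Published' timestamp (naive, Eastern Time) or None."""
    match = PUBLISHED_RE.match(body_text)
    if match is None:
        return None
    return datetime.strptime(match.group("stamp"), "%b %d, %Y %I:%M%p")


# Prefixes copied from the truncated table above
with_stamp = parse_published("Published Nov 01, 2023 07:08PM ET Updated Nov ...")
without_stamp = parse_published("By Peter Nurse\n\nInvesting.com -- Stocks in f...")
print(with_stamp, without_stamp)
```

Rows without a stamp, such as the one starting with `By Peter Nurse`, return `None` and can keep the date found by `find_date`.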


In [10]:
article_sentiments['publish_date'].min(), article_sentiments['publish_date'].max()

('2021-10-01', '2023-11-09')

In [11]:
article_sentiments['url'].iloc[1456]

'https://www.investing.com/news/stock-market-news/wall-street-opens-higher-on-debt-ceiling-energy-relief-dow-up-400-pts-2637565'

In [12]:
article_sentiments['body_text'].iloc[1456]



##### *Building the META stock price dataset from Yahoo Finance*

In [3]:
df_meta_price = yf.download("META", start='2021-10-01', end='2023-11-09')
df_meta_price

[*********************100%***********************]  1 of 1 completed


Date,Open,High,Low,Close,Adj Close,Volume
2021-10-01,341.609985,345.019989,338.640015,343.010010,343.010010,14905300
2021-10-04,335.529999,335.940002,322.700012,326.230011,326.230011,42885000
2021-10-05,328.579987,335.179993,326.160004,332.959991,332.959991,35377900
2021-10-06,329.739990,334.380005,325.799988,333.640015,333.640015,26443000
2021-10-07,337.000000,338.839996,328.980011,329.220001,329.220001,28307500
...,...,...,...,...,...,...
2023-11-02,317.299988,318.820007,308.329987,310.869995,310.869995,21631800
2023-11-03,312.549988,315.549988,311.019989,314.600006,314.600006,16754100
2023-11-06,315.980011,318.329987,314.450012,315.799988,315.799988,12887700
2023-11-07,317.059998,321.000000,315.119995,318.820007,318.820007,14055600
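
According to the yfinance documentation, the `end` argument of `yf.download` is exclusive, so a request ending at `'2023-11-09'` returns no row for 2023-11-09 itself. If the price range should cover the last news date, the `end` value can be moved one day forward. A minimal sketch, where `exclusive_end` is a hypothetical helper:

```python
from datetime import date, timedelta


def exclusive_end(last_day):
    """Return the day after last_day (ISO string) for use as an exclusive end."""
    return (date.fromisoformat(last_day) + timedelta(days=1)).isoformat()


end_param = exclusive_end("2023-11-09")
print(end_param)  # 2023-11-10
```

The result could then be passed as `end=end_param` in the download call above.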


### 4. Save the two DataFrames

In [23]:
article_sentiments.to_csv("../data/raw/meta_article_investing.csv", sep=',', encoding='utf-8', header=True)

print("DataFrame saved as 'meta_article_investing.csv'")

DataFrame saved as 'meta_article_investing.csv'


In [26]:
# Save the DataFrame to a CSV file
df_meta_price.to_csv('../data/raw/meta_price_yfinance.csv')

print("DataFrame saved as 'meta_price_yfinance.csv'")

DataFrame saved as 'meta_price_yfinance.csv'


In [17]:
article_sentiments.to_pickle("../data/raw/meta_article_investing.pkl")
df_meta_price.to_pickle("../data/raw/meta_price_yfinance.pkl")