# **Web Scraping of Stock Market News for Sentiment Analysis**
### *Implementation of a Buy/Sell Recommendation Model for Stocks in the Financial Market Based on Sentiment Analysis*

[Robert Garcia Rey](https://www.notion.so/Robert-Garcia-Rey-Data-Analyst-6d7b578d2bf848d585dc9d1a97b1036c?pvs=4)
- garcia.robert.0514@eam.edu.co
- https://www.linkedin.com/in/robert-garcia-rey/

### 1. Introduction

The target dataset is stock market news articles published between 2019-10-21 and 2023-11-09 (the run recorded in this notebook returned articles dated 2023-06-21 to 2023-11-18). The articles are collected from Investing.com by web scraping: `Request` from ***urllib.request*** sends each page request with a browser-like `User-Agent` header, and ***Beautiful Soup*** parses the returned HTML to extract the data.

### 2. Import libraries

In [1]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import yfinance as yf
import time
from newspaper import Article
from htmldate import find_date
import warnings
warnings.filterwarnings('ignore')

### 3. Data collection

##### *Scraping investing.com to build the news dataset*

In [2]:
def get_newslinks(company, page_number):
    """Scrapes article URLs for a given company and page number from a news website.

    :param company: name of the company to scrape articles for
    :param page_number: page number on the news website to iterate over

    :return: list of article URLs
    """
    url = f"https://www.investing.com/equities/{company}-news/{page_number}"
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }

    req = Request(url, headers=headers)
    html_content = urlopen(req).read()

    soup = BeautifulSoup(html_content, "html.parser")
    articles = soup.find_all('article')

    cleaned_links = []
    
    for article in articles:
        links = article.find_all('a')
        for link in links:
            partial_link = link.get('href')
            if partial_link is not None:  # Skip anchors that have no href
                if 'https' in partial_link:
                    cleaned_links.append(partial_link)
                elif partial_link.startswith('/'):
                    cleaned_links.append('https://www.investing.com' + partial_link)

    return list(set(cleaned_links))

In [4]:
all_company_urls = []
for page in range(1, 119):
    results = get_newslinks('microsoft-corp', page)
    all_company_urls.extend(results)
    time.sleep(5)  # Wait 5 seconds between requests
all_company_urls

['https://www.investing.com/news/stock-market-news/wedbush-expects-short-covering-for-the-ages-says-new-tech-bull-market-has-now-begun-3238057#comments',
 'https://www.investing.com/news/assorted/openai-announces-ceo-sam-altman-to-depart-the-company-432SI-3238126',
 'https://www.investing.com/news/stock-market-news/ousted-openai-ceo-altman-welcome-in-france-minister-says-3238264#comments',
 'https://www.investing.com/news/stock-market-news/in-ousting-ceo-sam-altman-chatgpt-loses-its-best-fundraiser-3238215#comments',
 'https://www.investing.com/news/stock-market-news/fbi-warns-on-scattered-spider-hackers-urges-victims-to-come-forward-3236929',
 'https://www.investing.com/news/stock-market-news/ibm-shares-near-52week-high-as-market-closes-mixed-93CH-3237130',
 'https://www.investing.com/news/stock-market-news/in-ousting-ceo-sam-altman-chatgpt-loses-its-best-fundraiser-3238215',
 'https://www.investing.com/news/stock-market-news/citi-maintains-buy-on-microsoft-with-a-432-target-despite-h
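The link list can contain the same article twice, once with and once without a `#comments` fragment (compare the two `...3238215` entries in the output above), so the article would later be downloaded twice. A minimal sketch of one way to normalize the links before downloading, using only the standard library. `normalize_links` and `BASE_URL` are hypothetical names that are not part of the original notebook:

```python
from urllib.parse import urldefrag, urljoin

BASE_URL = "https://www.investing.com"  # assumed base for relative hrefs


def normalize_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url, drop fragments, and de-duplicate.

    Hypothetical helper, not part of the original notebook.
    """
    seen = set()
    unique = []
    for href in hrefs:
        if not href:
            continue  # skip anchors without an href
        absolute = urljoin(base_url, href)      # handles relative and absolute links
        clean, _fragment = urldefrag(absolute)  # strips "#comments" and similar
        if clean not in seen:
            seen.add(clean)
            unique.append(clean)                # keep first-seen order
    return unique


sample = [
    "/news/stock-market-news/in-ousting-ceo-sam-altman-chatgpt-loses-its-best-fundraiser-3238215#comments",
    "https://www.investing.com/news/stock-market-news/in-ousting-ceo-sam-altman-chatgpt-loses-its-best-fundraiser-3238215",
    None,
]
print(normalize_links(sample))
```

Unlike `list(set(...))`, this keeps the links in the order in which they were found, which can make reruns easier to compare.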

In [5]:
# Save URLs to a text file

with open('microsoft_article_investing.txt', 'w') as f:
    for link in all_company_urls:
        f.write("%s\n" % link)

In [8]:
ticker = 'MSFT'
columns = ['ticker', 'publish_date', 'title', 'body_text', 'url']
rows = []  # one dict per successfully parsed article

# Loop over all the articles
for link in all_company_urls:
    article = Article(link)
    article.download()

    try:
        # newspaper reports a failed download when parse() is called
        article.parse()
    except Exception as e:
        print(f"Error processing URL {link}: {str(e)}")
        continue

    #sid = SentimentIntensityAnalyzer()
    #polarity = sid.polarity_scores(article.text)

    tmpdic = {'ticker': ticker, 'publish_date': find_date(link), 'title': article.title, 'body_text': article.text, 'url': link}
    #tmpdic.update(polarity)
    rows.append(tmpdic)

# Build the DataFrame once instead of concatenating inside the loop
article_sentiments = pd.DataFrame(rows, columns=columns)

Error processing URL https://www.investing.com/news/stock-market-news/magnificent-seven-tech-stocks-primed-for-turnaround-goldman-sachs-suggests-93CH-3188396#comments: Article `download()` failed with 502 Server Error: Bad Gateway for url: https://www.investing.com/news/stock-market-news/magnificent-seven-tech-stocks-primed-for-turnaround-goldman-sachs-suggests-93CH-3188396#comments on URL https://www.investing.com/news/stock-market-news/magnificent-seven-tech-stocks-primed-for-turnaround-goldman-sachs-suggests-93CH-3188396#comments
Error processing URL https://www.investing.com/news/stock-market-news/4-big-deal-reports-ftc-falls-flat-in-attempt-to-block-microsoftactivision-3126354: Article `download()` failed with HTTPSConnectionPool(host='www.investing.com', port=443): Read timed out. (read timeout=7) on URL https://www.investing.com/news/stock-market-news/4-big-deal-reports-ftc-falls-flat-in-attempt-to-block-microsoftactivision-3126354
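Both errors above (a 502 response and a read timeout) are transient, so a second attempt may succeed and the article would not be lost. A minimal sketch of a retry wrapper with exponential backoff follows. `call_with_retries` is a hypothetical helper, and the `attempts` and `base_delay` values are illustrative assumptions rather than tuned settings:

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff.

    Hypothetical helper; attempts and base_delay are illustrative values.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except Exception as exc:
            if attempt == attempts:
                raise  # out of retries: let the caller log and skip the URL
            delay = base_delay * 2 ** (attempt - 1)  # 5 s, 10 s, 20 s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("502 Server Error")
    return "ok"


print(call_with_retries(flaky, sleep=lambda seconds: None))  # no real waiting in the demo
```

To use it in the loop, the download and parse steps would need to live in one function (for example, one that builds the `Article`, calls `download()` and `parse()`, and returns it), because newspaper raises the download error only when `parse()` runs.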


investing.com dataset

In [9]:
article_sentiments

| | ticker | publish_date | title | body_text | url |
|---|---|---|---|---|---|
| 0 | MSFT | 2023-11-17 | Wedbush expects 'short covering for the ages,'... | Published Nov 17, 2023 01:39PM ET\n\n© Reuters... | https://www.investing.com/news/stock-market-ne... |
| 1 | MSFT | 2023-11-17 | OpenAI announces CEO Sam Altman to depart the ... | Published Nov 17, 2023 03:36PM ET\n\n© Reuters... | https://www.investing.com/news/assorted/openai... |
| 2 | MSFT | 2023-11-18 | Ousted OpenAI CEO Altman welcome in France, mi... | Published Nov 18, 2023 09:04AM ET\n\n2/2 © Reu... | https://www.investing.com/news/stock-market-ne... |
| 3 | MSFT | 2023-11-18 | In ousting CEO Sam Altman, ChatGPT loses its b... | Published Nov 17, 2023 08:29PM ET\n\n© Reuters... | https://www.investing.com/news/stock-market-ne... |
| 4 | MSFT | 2023-11-17 | FBI warns on Scattered Spider hackers, urges v... | Published Nov 16, 2023 03:37PM ET Updated Nov ... | https://www.investing.com/news/stock-market-ne... |
| ... | ... | ... | ... | ... | ... |
| 1453 | MSFT | 2023-06-21 | Accenture and Microsoft Expand Collaboration t... | Published Jun 21, 2023 03:00PM ET\n\n© Reuters... | https://www.investing.com/news/assorted/accent... |
| 1454 | MSFT | 2023-06-21 | Stock market today: Dow ends lower as Powell's... | Published Jun 21, 2023 05:01PM ET\n\n© Reuters... | https://www.investing.com/news/stock-market-ne... |
| 1455 | MSFT | 2023-06-22 | Tech companies including Google gripe about un... | Published Jun 21, 2023 08:07PM ET Updated Jun ... | https://www.investing.com/news/economy/tech-co... |
| 1456 | MSFT | 2023-06-21 | Short Bets on US Stocks Hit $1 Trillion, Most ... | Published Jun 21, 2023 02:24PM ET\n\nUS500 +0.... | https://www.investing.com/news/stock-market-ne... |


In [10]:
article_sentiments['publish_date'].min(), article_sentiments['publish_date'].max()

('2023-06-21', '2023-11-18')

In [11]:
article_sentiments['url'].iloc[1456]

'https://www.investing.com/news/stock-market-news/short-bets-on-us-stocks-hit-1-trillion-most-since-april-2022-3110337#comments'

In [12]:
article_sentiments['body_text'].iloc[1456]



##### *Downloading Microsoft (MSFT) stock prices from Yahoo Finance to build the price dataset*

In [13]:
df_msft_price = yf.download("MSFT", start="2023-06-21", end="2023-11-18")
df_msft_price

[*********************100%***********************]  1 of 1 completed


| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2023-06-21 | 336.369995 | 337.730011 | 332.070007 | 333.559998 | 332.181030 | 25117800 |
| 2023-06-22 | 334.119995 | 340.119995 | 333.339996 | 339.709991 | 338.305634 | 23556800 |
| 2023-06-23 | 334.359985 | 337.959991 | 333.450012 | 335.019989 | 333.635010 | 23084700 |
| 2023-06-26 | 333.720001 | 336.109985 | 328.489990 | 328.600006 | 327.241577 | 21520600 |
| 2023-06-27 | 331.859985 | 336.149994 | 329.299988 | 334.570007 | 333.186890 | 24354100 |
| ... | ... | ... | ... | ... | ... | ... |
| 2023-11-13 | 368.220001 | 368.470001 | 365.899994 | 366.679993 | 365.937256 | 19986500 |
| 2023-11-14 | 371.010010 | 371.950012 | 367.350006 | 370.269989 | 369.519989 | 27683900 |
| 2023-11-15 | 371.279999 | 373.130005 | 367.109985 | 369.670013 | 369.670013 | 26860100 |
| 2023-11-16 | 370.959991 | 376.350006 | 370.179993 | 376.170013 | 376.170013 | 27182300 |


### 4. Save the two DataFrames

In [14]:
article_sentiments.to_csv("../data/raw/msft_article_investing.csv", sep=',', encoding='utf-8', header=True)

print("DataFrame saved as 'msft_article_investing.csv'")

DataFrame saved as 'msft_article_investing.csv'


In [16]:
# Save the DataFrame to a CSV file
df_msft_price.to_csv('../data/raw/msft_price_yfinance.csv')

print("DataFrame saved as 'msft_price_yfinance.csv'")

DataFrame saved as 'msft_price_yfinance.csv'


In [18]:
article_sentiments.to_pickle("../data/raw/msft_article_investing.pkl")
df_msft_price.to_pickle("../data/raw/msft_price_yfinance.pkl")