# Scraping News from PR Newswire
This is part 1 of the project, in which we scrape news releases from the last 2 weeks and store each item's date, URL, title, and the ticker found in its content in a CSV file that can be opened in Excel.

## Importing Libraries
Importing all the necessary libraries to execute the code.

In [1]:
from requests_html import HTMLSession
import pandas as pd
from time import sleep
import requests

# keep track of loading progress
from tqdm.notebook import tqdm

import pathlib
import time

## Get Ticker Function
This function extracts the ticker symbol from a news release. It takes the article URL and an HTMLSession as input parameters and returns the ticker symbol, or None if no ticker is found.

In [1]:
def get_ticker(url, session):
    # Fetch the article page
    r = session.get(url)
    # Select the section(s) that hold the body of the release
    content = r.html.find('section.release-body')
    # Default value, used when the page has no release body or no ticker link
    ticker = None
    try:
        for item in content:
            # First ticker link in this section; if there is none, find() returns None
            # and accessing .text raises AttributeError
            ticker = item.find('a.ticket-symbol', first=True).text
    except AttributeError:
        ticker = None
    return ticker # Ticker symbol, or None if no ticker was found
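
`get_ticker` keeps a single ticker per article, so a release that mentions several companies would lose all but one. As a rough sketch of how every ticker link could be collected instead, the example below uses the standard library's `html.parser` rather than `requests_html`, so it runs offline. The `TickerCollector` class and the sample markup are hypothetical and only mimic the `a.ticket-symbol` selector used above.

```python
from html.parser import HTMLParser


class TickerCollector(HTMLParser):
    """Collect the text of every <a> element whose class list contains 'ticket-symbol'."""

    def __init__(self):
        super().__init__()
        self.tickers = []
        self._in_ticker = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'a' and 'ticket-symbol' in classes:
            self._in_ticker = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_ticker = False

    def handle_data(self, data):
        # Only keep text that appears inside a ticker link
        if self._in_ticker and data.strip():
            self.tickers.append(data.strip())


# Hypothetical markup, shaped like the release body the scraper targets
sample = '''<section class="release-body">
<p>Acme (NYSE: <a class="ticket-symbol" href="#">ACME</a>) and
Beta (NASDAQ: <a class="ticket-symbol" href="#">BETA</a>) announced a deal.</p>
</section>'''

collector = TickerCollector()
collector.feed(sample)
print(collector.tickers)
```

Whether a list of tickers or a single value is more useful depends on how the later parts of the project consume the `Ticker` column.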

## Get Page Details Function
This function fetches the details of each news release (date, title, ticker, and URL) and stores them in a list so that all the news from the last 2 weeks can be exported. Its inputs are the number of the page to parse, an HTMLSession, and a list to append to; its output is that list with one dictionary appended per release.

In [2]:
def get_page_details(x, session, data=None):
    # Create a fresh list when none is passed; a mutable default ([]) would be shared across calls.
    if data is None:
        data = []
    # Build the listing URL: {x} is the page number to fetch, and each page contains 100 news releases.
    url = f'https://www.prnewswire.com/news-releases/news-releases-list/?page={x}&pagesize=100'
    r = session.get(url)
    content = r.html.find('div.row.arabiclistingcards')
    for item in tqdm(content, desc='Parsing page...\t', leave=False): # Adding progress bars to see if news are getting parsed.
        date = item.find('h3', first=True).text.split('ET')[-2] # getting date value
        title = item.find('h3', first=True).text.split('ET')[-1] # getting title value
        article_url = 'https://www.prnewswire.com' + item.find('a.newsreleaseconsolidatelink', first=True).attrs['href']
        ticker = get_ticker(article_url, session) # calling ticker function to get the ticker value
        try:
            dic = {
              'Date': pd.to_datetime(date),
              'Title': title,
              'Ticker': ticker,
              'Article URL': article_url
            } # Storing values in a dictionary
            data.append(dic) # Appending all the values in a list.
        except Exception:
            pass
        
    return data # Return the final list with all the values
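
The date and title are both taken from the same `h3` text by splitting on `'ET'`. Because `split('ET')` is case-sensitive and matches anywhere in the string, a title that itself contains "ET" (for example "MARKET") would be cut in the wrong place. The sketch below shows that effect and one possible alternative that splits only at the first `' ET'` marker. The `split_heading` helper and the sample heading are hypothetical, and the assumed `'<timestamp> ET <title>'` layout is inferred from the printed output rather than confirmed from the page.

```python
def split_heading(text):
    """Split '<timestamp> ET <title>' at the first ' ET' marker only."""
    stamp, marker, title = text.partition(' ET')
    if not marker:
        # No timestamp marker found: treat the whole heading as the title
        return None, text.strip()
    return stamp.strip(), title.strip()


# Hypothetical heading whose title also contains the letters 'ET'
heading = 'Nov 21, 2021, 21:27 ET MARKET Update: NETWORK Stocks Rally'

# Approach used in get_page_details: the last piece after any 'ET'
print(heading.split('ET')[-1])

# Alternative: split once, at the timestamp marker
print(split_heading(heading))
```

With this heading the original approach keeps only the tail of the title, while the alternative returns the timestamp and the full title as a pair.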

## Get News Function
This is the main function. It takes the number of pages as an input parameter and calls get_page_details for each page to collect the details. It then converts the list to a DataFrame so the data can be exported.

In [4]:
def get_news(pages):
    session = HTMLSession()
    data = []

    for x in tqdm(range(1, pages+1), desc='Loading Pages...\t'): # Loading progress bars
        get_page_details(x, session,data)
    
    df = pd.DataFrame(data) # Converting list to dataframe
    return df
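
The first cell imports `sleep`, but none of the functions call it. One plausible use is to pause between page requests so the site is not hit in a tight loop. The sketch below shows that pattern under stated assumptions: `fetch_pages`, `fake_fetch`, and `delay_seconds` are hypothetical names, and `fake_fetch` stands in for `get_page_details` so the example runs without network access.

```python
from time import sleep


def fetch_pages(pages, fetch_page, delay_seconds=1.0):
    """Call fetch_page(x) for x in 1..pages, pausing between consecutive calls."""
    results = []
    for x in range(1, pages + 1):
        results.extend(fetch_page(x))
        if x < pages:
            sleep(delay_seconds)  # pause between requests, not after the last one
    return results


def fake_fetch(x):
    # Stand-in for a real page fetch; returns one row per page
    return [{'page': x}]


rows = fetch_pages(3, fake_fetch, delay_seconds=0.01)
print(len(rows))
```

A suitable delay is a judgment call and depends on the site's terms and how many pages are requested.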

In [5]:
df = get_news(2) # Calling the main function

In [6]:
print(df) # printing the dataframe

                   Date                                              Title  \
0   2021-11-21 21:27:00   Ding Lei: New Ultra-Futuristic HiPhi Z from H...   
1   2021-11-21 21:15:00   Pace raises USD40 million Series A funding fr...   
2   2021-11-21 20:53:00            Lanvin Group Unveils New Brand Identity   
3   2021-11-21 20:41:00   Aston Martin Vantage mit BAPE+GEEKVAPE-Beschi...   
4   2021-11-21 20:34:00   New Models, New Energy | GAC at the Auto Guan...   
..                  ...                                                ...   
194 2021-11-19 17:42:00   Lucid Named To Inaugural Inc. Best-Led Compan...   
195 2021-11-19 17:41:00   Nabors Energy Transition Corp. Announces Clos...   
196 2021-11-19 17:39:00   Cyber Defense Labs' CEO Robert Anderson Jr, N...   
197 2021-11-19 17:30:00   Agnico Eagle and Kirkland Lake Gold Remind Sh...   
198 2021-11-19 17:30:00   Water Bottles with Filters Market Size to inc...   

    Ticker                                        Article URL  

In [7]:
df.to_csv('Outputfile.csv', index=False) # Exporting the dataframe to a CSV file (can be opened in Excel)