<a href="https://colab.research.google.com/github/radroid/simple-stock-tracker/blob/main/notebooks/AIDI_1100_01-02_FINAL_PROJECT_GROUP_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding Section 1: Scan/Parse 

- Scan (the last two weeks or the last week, up to you) from the “newswire” website.
- Parse scanned news.

# Coding Section 2: Search/Track/Store
- Keep track of the news by storing the parsed news in a CSV file.
- For all parsed news, search the content of the tracked news to find at least 2-3 stock symbols in a specific industry of your choice, e.g. (TSX: TSLA), (TSX: GM).
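
As a rough illustration of the search step, the sketch below pulls exchange-prefixed symbols such as `(TSX: TSLA)` out of plain article text with a regular expression. The pattern, the `find_tickers` helper, and the sample sentence are assumptions made for this example; the notebook itself detects tickers through the article's HTML links instead.

```python
import re

# Matches exchange-prefixed symbols in parentheses, e.g. "(TSX: TSLA)".
# The exact pattern is an assumption about how tickers appear in article text.
TICKER_PATTERN = re.compile(
    r'\(\s*([A-Z]{2,10})\s*:\s*([A-Z][A-Z0-9.\-]{0,9})\s*\)'
)

def find_tickers(text):
    """Return (exchange, symbol) pairs in order of appearance, without duplicates."""
    seen = []
    for exchange, symbol in TICKER_PATTERN.findall(text):
        pair = (exchange, symbol)
        if pair not in seen:
            seen.append(pair)
    return seen

# Hypothetical sample sentence, written only for this example.
sample = (
    'Tesla (TSX: TSLA) and General Motors (TSX: GM) both reported results; '
    'Tesla (TSX: TSLA) led.'
)
print(find_tickers(sample))  # [('TSX', 'TSLA'), ('TSX', 'GM')]
```

A text-based search like this can serve as a fallback when an article mentions a symbol without linking it.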

## Team Notes
Three functions carry out the tasks described above.
- Function `page_parse()` scans one listing page of [PRNewswire.com](https://www.prnewswire.com) and records the date, title, URL, and ticker of each article on it.
- Function `url_parse()` parses a single article and returns a ticker mentioned in it, or `None` if there is none.
- Function `run_scanner()` uses the above functions and returns a pandas DataFrame indexed by `Date` with the columns:
    - `Title`
    - `Ticker`
    - `Article URL`

In [1]:
import pandas as pd
from requests_html import HTMLSession
import requests

# keep track of loading progress
from tqdm.notebook import tqdm

import pathlib
import time

In [2]:
# Parse a single article for ticker mentions.
# Takes the URL of the article and the session created by the page parser.
def url_parse(parse_url, parse_session):
    parse_request = parse_session.get(parse_url)
    content = parse_request.html.find('section.release-body')
    parse_ticker = None  # stays None if no ticker is found
    for item in content:
        symbol = item.find('a.ticket-symbol', first=True)
        if symbol is not None:
            parse_ticker = symbol.text
    return parse_ticker
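
`url_parse()` keeps a single ticker per article. If every symbol mentioned in an article is of interest, one option is to collect all matching links instead. The sketch below shows the idea using only the standard library's `html.parser` as a stand-in for `requests_html`; the `TickerLinkParser` class, the `all_tickers` helper, and the sample HTML (including the symbols `ACME` and `BETA`) are hypothetical, and unlike `url_parse()` it does not restrict the search to `section.release-body`.

```python
from html.parser import HTMLParser

class TickerLinkParser(HTMLParser):
    """Collects the text of every <a> tag whose class list contains 'ticket-symbol'."""

    def __init__(self):
        super().__init__()
        self.tickers = []
        self._in_ticker_link = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'a' and 'ticket-symbol' in classes:
            self._in_ticker_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_ticker_link = False

    def handle_data(self, data):
        if self._in_ticker_link and data.strip():
            self.tickers.append(data.strip())

def all_tickers(html):
    """Return every linked ticker symbol found in the given HTML string."""
    parser = TickerLinkParser()
    parser.feed(html)
    return parser.tickers

# Hypothetical article fragment, written only for this example.
sample_html = (
    '<section class="release-body">'
    '<p>Acme (NYSE: <a class="ticket-symbol" href="#">ACME</a>) and '
    'Beta (NASDAQ: <a class="ticket-symbol" href="#">BETA</a>) announced a deal. '
    '<a href="#">Read more</a></p></section>'
)
print(all_tickers(sample_html))  # ['ACME', 'BETA']
```

With `requests_html`, calling `find()` without `first=True` returns a list of all matching elements, so the same idea should carry over without a custom parser.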

In [3]:
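# Parse one listing page and collect its news items so they can be checked for tickers.
# Takes 3 parameters: the page number, the shared HTML session, and an optional
# list of already collected news items to append to.
def page_parse(x, page_session, data=None):
    if data is None:
        data = []  # avoid sharing one default list across calls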
    page_url = f'https://www.prnewswire.com/news-releases/english-releases/?page={x}&pagesize=100'
    page_request = page_session.get(page_url)
    content = page_request.html.find('div.row.arabiclistingcards')
    for item in tqdm(content, desc='Parsing page...\t', leave=False):
        date = item.find('h3', first=True).text.split('ET')[-2]
        title = item.find('h3', first=True).text.split('ET')[-1]
        article_url = 'https://www.prnewswire.com' + item.find('a.newsreleaseconsolidatelink', first=True).attrs['href']
        ticker = url_parse(article_url, page_session)
        try:
            dic = {
              'Date': pd.to_datetime(date),
              'Title': title,
              'Ticker': ticker,
              'Article URL': article_url
            }
            data.append(dic)
        except Exception:
            pass
        
    return data
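
`page_parse()` separates the timestamp from the title by splitting the headline text on `'ET'` and taking the last two pieces. If a title itself contains the letters `ET` (for example the word "MARKET"), the pieces shift, the date no longer parses, and the `except` branch silently drops the article. A minimal sketch of a more defensive split follows, assuming the headline text has the form `<timestamp> ET <title>`; the `split_headline` helper and the sample headline are hypothetical.

```python
def split_headline(text):
    """Split '<timestamp> ET <title>' at the first 'ET' only."""
    timestamp, separator, title = text.partition('ET')
    if not separator:
        # No 'ET' marker found: treat the whole string as the title.
        return None, text.strip()
    return timestamp.strip(), title.strip()

# Hypothetical headline whose title also contains 'ET'.
headline = 'Nov 09, 2021, 19:48 ET MARKET Update from Example Corp'
print(split_headline(headline))
# ('Nov 09, 2021, 19:48', 'MARKET Update from Example Corp')
```

Because `str.partition` stops at the first occurrence, any later `ET` inside the title is left untouched.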

In [4]:
# Main scanner: takes the number of pages to parse (default `10`).
# `50` pages cover about a week of news.
# `100` pages cover about 2 weeks of news.
def run_scanner(pages=10):
    session = HTMLSession()
    data = []

    for x in tqdm(range(1, pages+1), desc='Loading Pages...\t'):
        page_parse(x, session, data)
    
    df = pd.DataFrame(data)
    df.dropna(subset=['Ticker'], inplace=True)
    df.set_index('Date', inplace=True)
    return df

In [5]:
t0 = time.time()
df = run_scanner(10)
print(f'Time Taken: {(time.time() - t0): .1f}s')

Loading Pages...	:   0%|          | 0/10 [00:00<?, ?it/s]

Parsing page...	:   0%|          | 0/100 [00:00<?, ?it/s]

Time Taken:  178.4s


In [7]:
df.head()

| Date | Title | Ticker | Article URL |
|---|---|---|---|
| 2021-11-09 19:48:00 | Intercorp Financial Services announces virtua... | IFS | https://www.prnewswire.com/news-releases/inter... |
| 2021-11-09 19:46:00 | ICC Holdings, Inc. Reports 2021 Third Quarter... | ICCH | https://www.prnewswire.com/news-releases/icc-h... |
| 2021-11-09 19:00:00 | Smith+Nephew establishes its first Medical Ed... | SNN | https://www.prnewswire.com/news-releases/smith... |
| 2021-11-09 18:56:00 | Cambridge Bancorp Announces Expansion of Weal... | CATC | https://www.prnewswire.com/news-releases/cambr... |
| 2021-11-09 18:35:00 | EQT Ventures and EQT Growth to exit its holdi... | DASH | https://www.prnewswire.com/news-releases/eqt-v... |


# Saving the file
In this section we will save the Pandas DataFrame created above.

In [8]:
# Create the filename from the first and last dates.
sorted_df = df.sort_index()  # 'Date' is the index

start_date = sorted_df.index[0].date()
end_date = sorted_df.index[-1].date()

# Concatenate the dates to build the name of the CSV file.
filename = start_date.strftime('%d%b%y') + '-to-' + end_date.strftime('%d%b%y') + '.csv'

filename

'09Nov21-to-09Nov21.csv'
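
The naming scheme can be tried out in isolation with plain `datetime.date` objects. The sketch below wraps it in a hypothetical `make_filename` helper and also shows an ISO-style format as an alternative; both the helper and the alternative format are assumptions for illustration, not part of the notebook.

```python
from datetime import date

def make_filename(start_date, end_date, fmt='%d%b%y'):
    """Build '<start>-to-<end>.csv' using the given strftime format."""
    return f'{start_date.strftime(fmt)}-to-{end_date.strftime(fmt)}.csv'

# Same format as the notebook; '%b' is the locale's abbreviated month name.
print(make_filename(date(2021, 11, 9), date(2021, 11, 9)))

# ISO-style dates do not depend on the locale and sort chronologically by name.
print(make_filename(date(2021, 11, 2), date(2021, 11, 9), fmt='%Y-%m-%d'))
# 2021-11-02-to-2021-11-09.csv
```

Which format to use is a matter of preference; the ISO variant mainly helps when many files accumulate in one directory.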

In [9]:
data_dir = pathlib.Path('../datasets/')

if data_dir.exists():
    df.to_csv(data_dir / filename) # Save dataframe as a csv for further analysis
    print(f'CSV file saved to: {data_dir / filename}')
else:
    print('Please define a valid directory to save the CSV file to.')

CSV file saved to: ..\datasets\09Nov21-to-09Nov21.csv
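
The cell above only saves when the target directory already exists. An alternative is to create the directory on demand. The sketch below shows that pattern with the standard library's `csv` module in place of `DataFrame.to_csv` so that it runs on its own; the `save_rows` helper and the sample row are hypothetical.

```python
import csv
import pathlib
import tempfile

FIELDS = ['Date', 'Title', 'Ticker', 'Article URL']

def save_rows(rows, data_dir, filename):
    """Write rows to data_dir/filename, creating data_dir if it is missing."""
    data_dir = pathlib.Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = data_dir / filename
    with target.open('w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return target

# Hypothetical row, written only for this example.
example_rows = [{
    'Date': '2021-11-09 19:48:00',
    'Title': 'Example title',
    'Ticker': 'IFS',
    'Article URL': 'https://www.prnewswire.com/',
}]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_rows(example_rows, pathlib.Path(tmp) / 'datasets', '09Nov21-to-09Nov21.csv')
    print(saved.exists())  # True
```

In the notebook, the same effect could be had by calling `data_dir.mkdir(parents=True, exist_ok=True)` before `df.to_csv(data_dir / filename)`.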
