# Web Scraping Project - End of the Day Stock data (EODData)

Data Source : [EODDATA - End of the Day Stock data](http://eoddata.com/stocklist/TSX/A.htm)
![](https://i.imgur.com/rWt0W74.jpg)


## Web Scraping 

>### Q1. What is Web Scraping?
In the simplest terms, **Web Scraping** is the process of extracting data from a website and saving it in a form that is easy to read, understand, and work with.

>By 'easy to work with', we mean that the extracted data can be used to draw useful insights and answer questions that would be hard to answer if the data were not stored in a simple, sorted form, generally a `CSV File, an Excel File or a Database`.

>### Q2. How does web scraping work?
![](https://i.imgur.com/iv6RhmW.png)

>To understand web scraping, it’s important to first understand that web pages are built with text-based mark-up languages – the most common being `HTML`.

>A mark-up language defines the structure of a website's content. Since mark-up languages share universal components and tags, it is much easier for web scrapers to pull all the information they need.
Once the HTML is parsed, the scraper extracts the necessary data and stores it.  
**Note**  : Not all websites allow Web Scraping, especially when users' personal information is involved, so we should always make sure that we do not explore too much and do not collect information that belongs to someone else.
Websites generally have protections in place, and they may block our access if they see us scraping a large amount of data.
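
Many sites publish their crawling rules in a `robots.txt` file. As a minimal sketch, Python's standard `urllib.robotparser` module can check whether a path is allowed before we request it. The rules and the `example.com` URLs below are made up so the example runs offline; they are not EODData's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(user_agent, url) reports whether the rules permit the request.
allowed = parser.can_fetch("*", "http://example.com/stocklist/TSX/A.htm")
blocked = parser.can_fetch("*", "http://example.com/private/data.htm")
print(allowed, blocked)
```

For a real site, the same parser can load the live file with `set_url()` followed by `read()`.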

### About EODData

![](https://i.imgur.com/sOQ7li5.png)

EODData provides free, quality end-of-day stock market data to traders, covering a wide range of exchanges, data formats, tools and services. The website also provides historical data for a minimum monthly fee.

The website also has servers dedicated to finding and correcting the errors that stock exchanges produce. Its historical data has been screened and adjusted for splits.

### Project Idea

As part of this project, we will parse the EODData website to get stock details for the Toronto Stock Exchange.

We will retrieve information from the **'Toronto Stock Exchange'** page using _web scraping_: a process of extracting information from a website programmatically. For this specific project we will scrape stocks whose symbols start with the letters A to H.

### Project Goal

The project goal is to build a web scraper that extracts stock information and assembles it into a single CSV file. The format of the output CSV file is shown below:

|#|Code|Name|High|Low|Close|Volume|Stock Page URL
|-|----------|-------|---------------|-----|------|-----------------|-----------
|1|AAB|Aberdeen International Inc|0.1400|0.1350|0.1400|13138|http://eoddata.com/stockquote/TSX/AAB.htm
|2|AAV|Advantage Oil & Gas Ltd|6.370|6.130|6.360|684302|http://eoddata.com/stockquote/TSX/AAV.htm

### Project steps
Here is an outline of the steps we'll follow :

1. Download the webpage using `requests`
2. Parse the HTML source code using the `BeautifulSoup` library and extract the desired information
3. Build the scraper components
4. Compile the extracted information into Python lists and dictionaries
5. Convert the Python dictionaries into `Pandas DataFrames`
6. Write the information to the final CSV file
7. Future work and references



>### Packages Used:
>1. Requests: for downloading the HTML code from the EODData URL
>2. BeautifulSoup4: for parsing and extracting data from the HTML string
>3. Pandas: for gathering the data into a DataFrame for further processing

### How to run the code

This tutorial is an executable [Jupyter notebook](https://jupyter.org) hosted on [Jovian](https://www.jovian.ai). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.

>  **Jupyter Notebooks**: This tutorial is a [Jupyter notebook](https://jupyter.org) - a document made of _cells_. Each cell can contain code written in Python or explanations in plain English. You can execute code cells and view the results, e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful platform for experimentation and analysis. Don't be afraid to mess around with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top.

## Let's start scraping

>Note : We will use the `Jovian` library and its `commit()` function throughout the code to save our progress as we move along.

In [1]:
!pip install jovian --upgrade --quiet
import jovian
# Execute this to save new versions of the notebook
jovian.commit(project="final-web-scraping-project")

[jovian] Updating notebook "ivarchan/final-web-scraping-project" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/ivarchan/final-web-scraping-project


'https://jovian.ai/ivarchan/final-web-scraping-project'

## Download the webpage using `requests`



>#### **What is `requests`**


>Requests is a Python HTTP library that allows us to send HTTP requests to web servers from code, instead of using a browser.

>We use `pip`, a package-management system, to install and manage software. Since the platform we selected is **Binder**, we run `!pip install requests` in a notebook cell to install `requests`. You will see similar `!pip install` commands when we install other packages.

>To use functions from a library, we use the `import` statement. For example, once `requests` is installed, `import requests` lets us use any function from the `requests` library.

In [2]:
!pip install requests --quiet --upgrade
import requests

#### **requests.get()**

In order to **download a web page**, we use `requests.get()` to **send the HTTP request** to the **EODData server**. The function returns a **response object**, which is **the HTTP response**. 

![](https://i.imgur.com/ssV51Yc.png)

In [3]:
home_url = 'http://eoddata.com/stocklist/TSX/A.htm'   #The URL Address of the webpage we will scrape, i.e. Stocks starting from A
response = requests.get(home_url)      #requests.get()

#### **Status code**

Now we have to `check` whether the HTTP request was sent successfully and an HTTP response came back. Because we are NOT using a browser, we don't get that `feedback` directly when a request fails.

In general, the way to check whether the server sent back a successful HTTP response is the **status code**. In the `requests` library, `requests.get` returns a response object, which contains the page contents and a status code indicating whether the HTTP request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.


If the request was successful, `response.status_code` is set to a value between **200 and 299**.

In [4]:
response.status_code    #Here we are checking the Status code, -> 200-299 will mean that the request was successful

200
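
The 200 to 299 rule can be written as a small check. The helper name `is_success` is hypothetical and not part of `requests`. The library itself also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
def is_success(status_code):
    """Return True when an HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299

# Try the check on a few common status codes.
for code in (200, 204, 404, 500):
    print(code, is_success(code))
```

In the notebook this would be called as `is_success(response.status_code)` before reading `response.text`.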

The HTTP response contains HTML that is ready to be displayed in a browser. Here we can use `response.text` to retrieve the HTML document.

In [5]:
page_contents = response.text
len(page_contents)    #The `len` function tells us the number of characters in the page contents

112462

In [6]:
page_contents[:1000]   #This displays the first 1000 characters of `page_contents`

'\r\n\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<head><link rel="stylesheet" href="../../styles/jquery-ui-1.10.0.custom.min.css" type="text/css" /><link rel="stylesheet" href="../../styles/main.css" type="text/css" /><link rel="stylesheet" href="../../styles/button.css" type="text/css" /><link rel="stylesheet" href="../../styles/nav.css" type="text/css" />\r\n  <script src="/scripts/jquery-1.9.0.min.js" type="text/javascript"></script>\r\n  <script src="/scripts/jquery-ui-1.10.0.custom.min.js" type="text/javascript"></script>\r\n\t<script type="text/javascript">\t\tvar _sf_startpt = (new Date()).getTime()</script>\r\n  \r\n\t<script type="text/javascript" src="scripts/jquery-1.4.2.min.js"></script>\r\n<meta name="keywords" content="list of symbols for Toronto Stock Exchange,list of stock symbols,download symbols,stock symbols list,TSX symbol list,TSX stoc

- What we see above is the source code of the web page. It is written in a language called HTML. 
- It defines the content and structure of the web page, which browsers such as Chrome then display.

In [7]:
jovian.commit() #Saving the work done till now

[jovian] Updating notebook "ivarchan/final-web-scraping-project" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/ivarchan/final-web-scraping-project


'https://jovian.ai/ivarchan/final-web-scraping-project'

## Parse the HTML source code using Beautiful Soup library


>### What is Beautiful Soup?

>Beautiful Soup is **a Python package** for **parsing HTML and XML documents**. Beautiful Soup enables us to get data out of sequences of characters. It creates a parse tree for parsed pages that can be used to extract data from HTML. It's a handy tool when it comes to web scraping. You can read more on their documentation site. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help

>To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library. Let's install the library and import **the BeautifulSoup class** from **the bs4 module.**

In [8]:
!pip install beautifulsoup4 --quiet --upgrade
from bs4 import BeautifulSoup
doc = BeautifulSoup(page_contents, 'html.parser')  #Now 'doc' contains entire html in parsed format

In [9]:
type(doc)

bs4.BeautifulSoup

### Inspecting the HTML source code of a web page



>In Beautiful Soup library, we can specify `html.parser` to ask Python to read components of the page, instead of reading it as a long string. 

>### What is HTML?
Before we dive into how to inspect HTML, we should cover the basics of HTML.

>The HyperText Markup Language, or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets and scripting languages such as JavaScript.







![](https://i.imgur.com/ChftiDR.png)

#### **An HTML tag comprises three parts:**

1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.
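
The three parts can be read directly from a parsed tag. This is a small sketch using one `<a>` tag copied from the EODData list page (it appears in the output later in this notebook). The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

snippet = '<a href="/stockquote/TSX/AAB.htm" title="Display Quote &amp; Chart for TSX,AAB">AAB</a>'
tag = BeautifulSoup(snippet, "html.parser").find("a")

print(tag.name)      # the tag's name
print(tag.attrs)     # its attributes, as a dictionary
print(tag["href"])   # a single attribute value
print(tag.text)      # the text between the opening and closing segments
```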


### Common tags and attributes

#### **Tags in HTML**

There are around 100 types of HTML tags, but on a day-to-day basis around 15 to 20 of them are in common use, such as the `<div>`, `<p>`, `<section>`, `<img>`, and `<a>` tags.


Of these, I want to highlight the **`<a>` tag**, which can carry attributes such as `href` (hyperlink reference). Clicking an `<a>` tag takes the user to another page, which is why it is called the **anchor** tag.

#### **Attributes**

Each tag supports several attributes. Following are some common attributes used to modify the behavior of tags

* `id`
* `style`
* `class`
* `href` (used with `<a>`)
* `src` (used with `<img>`)

With **a BeautifulSoup object**, we can get **a specific type of tag in the HTML** by passing the tag's name, as shown in the code cell below.

Here, we use the `find()` function of BeautifulSoup to find the first `<title>` tag in the HTML document and display its content.

In [10]:
title = doc.find('title')
title

<title>
	List of Symbols for Toronto Stock Exchange [TSX] Starting with A
</title>

### Inspecting HTML in the Browser

>To view the **source code** of any webpage right within **your browser**, you can **right click** anywhere on a page and **select** the **"Inspect"** option. This opens the **"Developer Tools"** mode, where you can see the source code as **a tree**. You can expand and collapse various nodes and find the source code for a specific portion of the page.

![](https://i.imgur.com/ByuBJbA.png)


As shown in the screenshot above, I hovered over one of the stocks to see how its content is laid out. 
I found that each `stock` sits inside a `<tr>` tag.

Since we have downloaded a single page and turned it into a BeautifulSoup object, we can start using functions from the Beautiful Soup library to pull out the pieces of information we want.

#### Here we get the `<tr>` tags that hold the complete stock information. Rows alternate between the classes `ro` and `re`, so we fetch both

In [11]:
tr_parent1 = doc.find_all('tr',{'class':'ro'}) 
tr_parent2 = doc.find_all('tr',{'class':'re'})

#### Looks like we have around 120 records for stocks starting with 'A'

In [12]:
len(tr_parent1) + len(tr_parent2)

121
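
As an alternative sketch, `find_all` also accepts a list of class names and returns every matching row in document order, so the two row classes can be fetched in one call. The tiny table below is made up to mimic the alternating `ro` / `re` rows; it is not the real page.

```python
from bs4 import BeautifulSoup

# Small made-up table that mimics the alternating 'ro' / 're' row classes.
sample_html = """
<table>
  <tr class="ro"><td>AAB</td></tr>
  <tr class="re"><td>AAV</td></tr>
  <tr class="ro"><td>ABCT</td></tr>
</table>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# A list of class names matches rows having any of them, in document order.
rows = sample_doc.find_all("tr", class_=["ro", "re"])
symbols = [row.find("td").text for row in rows]
print(symbols)
```

With this approach the rows come back already in page order, so no later step is needed to interleave two separate lists.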

#### Now let's get the individual `<td>` cells for the first stock, which hold all the required information

In [13]:
td_child1 = tr_parent1[0].find_all('td')

In [14]:
td_child1

[<td><a href="/stockquote/TSX/AAB.htm" title="Display Quote &amp; Chart for TSX,AAB">AAB</a></td>,
 <td>Aberdeen International Inc</td>,
 <td align="right">0.1350</td>,
 <td align="right">0.1300</td>,
 <td align="right">0.1300</td>,
 <td align="right">146,215</td>,
 <td align="right">-0.0050</td>,
 <td align="center"><img src="/images/dn.gif"/></td>,
 <td align="left">3.70</td>,
 <td align="right"><a href="/stockquote/TSX/AAB.htm" title="Download Data for TSX,AAB"><img height="14" src="/images/dl.gif" width="14"/></a> <a href="/stockquote/TSX/AAB.htm" title="View Quote and Chart for TSX,AAB"><img height="14" src="/images/chart.gif" width="14"/></a></td>]

### Get the individual information 

#### Symbol

In [15]:
symbol = td_child1[0].find('a').text.strip()

#### Name

In [16]:
name = td_child1[1].text.strip()

#### High value

In [17]:
high = td_child1[2].text.strip()

#### Low value

In [18]:
low = td_child1[3].text.strip()

#### Closing value of the day

In [19]:
close = td_child1[4].text.strip()

#### Total volume of the day

In [20]:
volume = td_child1[5].text.strip().replace(',', '') # Here we remove the comma
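
The values extracted so far are all text. If they are needed as numbers, a minimal sketch of the conversion looks like this, using the high price and volume strings of the first stock shown above.

```python
high_text = "0.1350"       # text of the High cell
volume_text = "146,215"    # text of the Volume cell, with a thousands separator

high_value = float(high_text)
volume_value = int(volume_text.replace(",", ""))  # remove the comma before converting
print(high_value, volume_value)
```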

#### Stock URL

In [21]:
url = "http://eoddata.com/" + td_child1[0].find('a')['href'] # Here we append the base url

#### Print all the values 

In [22]:
print("Symbol:", format(symbol))
print("Name:", format(name))
print("High:", format(high))
print("Low:", format(low))
print("Volume:", format(volume))
print("URL:", format(url))

Symbol: AAB
Name: Aberdeen International Inc
High: 0.1350
Low: 0.1300
Volume: 146215
URL: http://eoddata.com//stockquote/TSX/AAB.htm
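
The printed URL contains a double slash because the base URL ends with `/` and the `href` also starts with `/`. One way to avoid this, sketched here as an alternative to plain concatenation, is `urljoin` from the standard library.

```python
from urllib.parse import urljoin

base_url = "http://eoddata.com/"
href = "/stockquote/TSX/AAB.htm"   # value of the <a> tag's href on the list page

print(base_url + href)             # plain concatenation keeps both slashes
clean_url = urljoin(base_url, href)
print(clean_url)                   # urljoin resolves the path against the base
```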


## Create the generic function with all the information

In [23]:
def parse_document(tr_tag):
    
    td_tag = tr_tag.find_all('td')
    symbol = td_tag[0].find('a').text.strip()
    name = td_tag[1].text.strip()
    high = td_tag[2].text.strip()
    low = td_tag[3].text.strip()
    close = td_tag[4].text.strip()
    volume = td_tag[5].text.strip().replace(',', '')
    url = "http://eoddata.com/" + td_tag[0].find('a')['href']   # use this row's own link, not td_child1
    
    print("Symbol:", format(symbol))
    print("Name:", format(name))
    print("High:", format(high))
    print("Low:", format(low))
    print("Volume:", format(volume))
    print("URL:", format(url))
    

### Let's test the function on specific stocks

In [24]:
parse_document(tr_parent1[2])

Symbol: ABST
Name: Absolute Software Corp
High: 11.65
Low: 11.10
Volume: 186919
URL: http://eoddata.com//stockquote/TSX/ABST.htm


In [25]:
parse_document(tr_parent1[10])

Symbol: AD.UN
Name: Alaris Equity Partners Income Trust
High: 18.26
Low: 17.92
Volume: 98156
URL: http://eoddata.com//stockquote/TSX/AD.UN.htm


### Now let's update the function to return a dictionary 

In [26]:
def parse_document(tr_tag):
    
    td_tag = tr_tag.find_all('td')
    symbol = td_tag[0].find('a').text.strip()
    name = td_tag[1].text.strip()
    high = td_tag[2].text.strip()
    low = td_tag[3].text.strip()
    close = td_tag[4].text.strip()
    volume = td_tag[5].text.strip().replace(',', '')
    url = "http://eoddata.com/" + td_tag[0].find('a')['href']
    
    # Return a dictionary
    return {
        'Symbol': symbol,
        'Name': name,        
        'High': high,
        'Low': low,
        'Close': close,
        'Volume': volume,
        'URL': url
    }   

### Now use the above function to get all the stock information of the given page

In [27]:
all_records_1 = [parse_document(tag) for tag in tr_parent1]
all_records_2 = [parse_document(tag) for tag in tr_parent2]

In [28]:
len(all_records_1) + len(all_records_2) # The number of parsed records matches the number of rows we found earlier.

121

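###### Combine both lists

Note that `zip` stops at the shorter of the two lists, so when they differ in length the last unpaired record is left out. This is why the combined length below is one less than the 121 rows counted earlier.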

In [29]:
all_records = [item for sublist in zip(all_records_1, all_records_2) for item in sublist]

In [30]:
len(all_records)

120
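
The combined list has 120 items rather than 121 because `zip` drops the unpaired last record. A possible alternative, shown here on short stand-in lists rather than the real records, is `itertools.zip_longest`, which keeps every item.

```python
from itertools import zip_longest

odd_rows = ["AAB", "ABCT", "ABST"]   # stand-ins for records from 'ro' rows
even_rows = ["AAV", "ABCT.RT"]       # stand-ins for records from 're' rows

# zip stops at the shorter list, so the unpaired last item is lost.
with_zip = [item for pair in zip(odd_rows, even_rows) for item in pair]

# zip_longest pads the shorter list with None, which we then filter out.
with_zip_longest = [
    item
    for pair in zip_longest(odd_rows, even_rows)
    for item in pair
    if item is not None
]

print(len(with_zip), with_zip)
print(len(with_zip_longest), with_zip_longest)
```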

## Writing information to CSV files

In [31]:
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")
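
Joining values with `','` works as long as no value contains a comma itself. As a hedged alternative, the standard `csv` module quotes such values automatically. The second record below is made up to show the quoting, and an in-memory buffer stands in for a file.

```python
import csv
import io

records = [
    {"Symbol": "AAB", "Name": "Aberdeen International Inc", "Volume": "146215"},
    {"Symbol": "XYZ", "Name": "Example Holdings, Inc", "Volume": "100"},  # made-up row with a comma
]

buffer = io.StringIO()   # in-memory file so the example writes nothing to disk
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the buffer can be replaced with `open(path, 'w', newline='')`.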

### Testing the function

In [32]:
write_csv(all_records,"A.csv")

In [33]:
import pandas as pd

In [34]:
pd.read_csv('A.csv')

Unnamed: 0,Symbol,Name,High,Low,Close,Volume,URL
0,AAB,Aberdeen International Inc,0.135,0.130,0.130,146215,http://eoddata.com//stockquote/TSX/AAB.htm
1,AAV,Advantage Oil & Gas Ltd,6.630,6.060,6.610,2106233,http://eoddata.com//stockquote/TSX/AAV.htm
2,ABCT,ABC Technologies Holdings Inc,5.790,5.550,5.790,4029,http://eoddata.com//stockquote/TSX/ABCT.htm
3,ABCT.RT,ABC Technologies Holdings Inc Rights,0.010,0.005,0.005,58102,http://eoddata.com//stockquote/TSX/ABCT.RT.htm
4,ABST,Absolute Software Corp,11.650,11.100,11.130,186919,http://eoddata.com//stockquote/TSX/ABST.htm
...,...,...,...,...,...,...,...
115,AX.PR.E,Artis REIT Pref Ser E,24.350,24.250,24.300,3300,http://eoddata.com//stockquote/TSX/AX.PR.E.htm
116,AX.PR.I,Artis REIT Pref Series I,25.660,25.400,25.660,1269,http://eoddata.com//stockquote/TSX/AX.PR.I.htm
117,AX.UN,Artis Real Estate Investment Trust Units,13.130,12.930,13.020,308738,http://eoddata.com//stockquote/TSX/AX.UN.htm
118,AXU,Alexco Resource Corp,1.950,1.820,1.930,162738,http://eoddata.com//stockquote/TSX/AXU.htm


![](https://i.imgur.com/NlQftOh.png)

## Final function with all the information above 

In [35]:
def scrap_stockInfo(alpha_list):  
    base_url = "http://eoddata.com/stocklist/TSX/"
    
    for i in range(len(alpha_list)):
        data_url = base_url + alpha_list[i] +".htm"
        response = requests.get(data_url)
        page_contents = response.text
        doc = BeautifulSoup(page_contents, 'html.parser')
        tr_tags1 = doc.find_all('tr',{'class':'ro'})
        tr_tags2 = doc.find_all('tr',{'class':'re'})
        all_records_1 = [parse_document(tag) for tag in tr_tags1]
        all_records_2 = [parse_document(tag) for tag in tr_tags2]
        all_records = [item for sublist in zip(all_records_1, all_records_2) for item in sublist]
        
        file_name = alpha_list[i] + ".csv"
        write_csv(all_records,file_name)

### Let's create a separate CSV for each letter across multiple pages

In [36]:
alpha_list = ['A','B','D','E','F','G','H']   # Note: 'C' is not in this list, so stocks starting with C are skipped in this run
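
The list above is typed by hand and has no `'C'`. As a sketch, the letters A to H can instead be generated from `string.ascii_uppercase`. The helper `build_list_url` is hypothetical; it only follows the URL pattern already used in this notebook.

```python
import string

# First eight uppercase letters: A through H.
letters = list(string.ascii_uppercase[:8])
print(letters)

# Hypothetical helper (not part of the notebook) that builds each list-page URL.
def build_list_url(letter, exchange="TSX"):
    return f"http://eoddata.com/stocklist/{exchange}/{letter}.htm"

urls = [build_list_url(letter) for letter in letters]
print(urls[0])
```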

In [37]:
scrap_stockInfo(alpha_list)

![](https://i.imgur.com/5rNA8M3.png)

In [38]:
pd.read_csv('H.csv')

Unnamed: 0,Symbol,Name,High,Low,Close,Volume,URL
0,H,Hydro One Ltd,31.21,30.770,30.92,1238299,http://eoddata.com//stockquote/TSX/H.htm
1,HAB,Horizons Active Corporate Bond ETF,10.47,10.440,10.47,2705,http://eoddata.com//stockquote/TSX/HAB.htm
2,HAC,Horizons Seasonal Rotation ETF,25.55,25.270,25.31,4568,http://eoddata.com//stockquote/TSX/HAC.htm
3,HAD,Horizons Active CDN Bond ETF,9.78,9.730,9.73,1055,http://eoddata.com//stockquote/TSX/HAD.htm
4,HAEB,Horizons Active ESG Corporate Bond ETF,9.46,9.460,9.46,323,http://eoddata.com//stockquote/TSX/HAEB.htm
...,...,...,...,...,...,...,...
167,HYI,Horizons Active High Yield Bond ETF,8.77,8.560,8.68,13927,http://eoddata.com//stockquote/TSX/HYI.htm
168,HYLD,Hamilton Enhanced U.S. Covered Call ETF,16.03,15.670,15.67,159908,http://eoddata.com//stockquote/TSX/HYLD.htm
169,HYLD.U,Hamilton Enhanced US Coverd Call ETF USD,15.95,15.700,15.70,5626,http://eoddata.com//stockquote/TSX/HYLD.U.htm
170,HZD,Betapro Silver 2X Daily Bear ETF,18.57,18.100,18.10,22030,http://eoddata.com//stockquote/TSX/HZD.htm


### Now that we have created a CSV for each starting letter, let's combine everything and remove the individual files

In [39]:
import os

# create empty list
final_list = []
 
# append individual csv into the list
for i in range(len(alpha_list)):
    temp_df = pd.read_csv(alpha_list[i]+".csv")
    os.remove(alpha_list[i]+".csv")
    final_list.append(temp_df)
    
# create new data frame with the combined list
merged_df = pd.concat(final_list,axis=0, ignore_index=True)

# export into final csv
merged_df.to_csv( "Toronto_Stocks.csv", index=None)
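
The cell above goes through intermediate CSV files. A list of dictionaries can also be turned into a DataFrame directly. This is a minimal sketch with two records in the shape returned by `parse_document`; the name `stocks_df` is only illustrative.

```python
import pandas as pd

# Two records in the shape returned by parse_document (all values are strings).
records = [
    {"Symbol": "AAB", "Name": "Aberdeen International Inc", "High": "0.1350",
     "Low": "0.1300", "Close": "0.1300", "Volume": "146215",
     "URL": "http://eoddata.com//stockquote/TSX/AAB.htm"},
    {"Symbol": "AAV", "Name": "Advantage Oil & Gas Ltd", "High": "6.630",
     "Low": "6.060", "Close": "6.610", "Volume": "2106233",
     "URL": "http://eoddata.com//stockquote/TSX/AAV.htm"},
]

stocks_df = pd.DataFrame(records)

# Convert the text columns to numbers so they can be sorted and aggregated.
for column in ["High", "Low", "Close"]:
    stocks_df[column] = stocks_df[column].astype(float)
stocks_df["Volume"] = stocks_df["Volume"].astype(int)

print(stocks_df.dtypes)
```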

![](https://i.imgur.com/vEHvvoc.png)

### Check a few records in the CSV file

In [40]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 976 entries, 0 to 975
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Symbol  976 non-null    object 
 1   Name    976 non-null    object 
 2   High    976 non-null    float64
 3   Low     976 non-null    float64
 4   Close   976 non-null    float64
 5   Volume  976 non-null    int64  
 6   URL     976 non-null    object 
dtypes: float64(3), int64(1), object(3)
memory usage: 53.5+ KB


In [41]:
merged_df.head(10)

Unnamed: 0,Symbol,Name,High,Low,Close,Volume,URL
0,AAB,Aberdeen International Inc,0.135,0.13,0.13,146215,http://eoddata.com//stockquote/TSX/AAB.htm
1,AAV,Advantage Oil & Gas Ltd,6.63,6.06,6.61,2106233,http://eoddata.com//stockquote/TSX/AAV.htm
2,ABCT,ABC Technologies Holdings Inc,5.79,5.55,5.79,4029,http://eoddata.com//stockquote/TSX/ABCT.htm
3,ABCT.RT,ABC Technologies Holdings Inc Rights,0.01,0.005,0.005,58102,http://eoddata.com//stockquote/TSX/ABCT.RT.htm
4,ABST,Absolute Software Corp,11.65,11.1,11.13,186919,http://eoddata.com//stockquote/TSX/ABST.htm
5,ABTC,Accelerate Carbon Negative Bitcoin ETF,3.85,3.84,3.84,200,http://eoddata.com//stockquote/TSX/ABTC.htm
6,ABTC.U,Accelerate Carbon Neg Bitcoin ETF USD,3.0,3.0,3.0,16100,http://eoddata.com//stockquote/TSX/ABTC.U.htm
7,ABX,Barrick Gold Corp,29.47,28.85,29.11,4830196,http://eoddata.com//stockquote/TSX/ABX.htm
8,AC,Air Canada,25.98,24.71,24.91,5230236,http://eoddata.com//stockquote/TSX/AC.htm
9,ACB,Aurora Cannabis Inc,5.15,4.75,4.76,1566158,http://eoddata.com//stockquote/TSX/ACB.htm


In [42]:
merged_df.tail(10)

Unnamed: 0,Symbol,Name,High,Low,Close,Volume,URL
966,HXT.U,Horizons S&P TSX 60 Index ETF USD,39.2,39.2,39.2,4000,http://eoddata.com//stockquote/TSX/HXT.U.htm
967,HXU,Betapro S&P TSX 60 2X Daily Bull ETF,20.81,20.28,20.31,104968,http://eoddata.com//stockquote/TSX/HXU.htm
968,HXX,Horizons Euro Stoxx 50 Index ETF,36.97,36.93,36.97,650,http://eoddata.com//stockquote/TSX/HXX.htm
969,HYBR,Horizons Active Hybrd Bond Prf Share ETF,9.73,9.67,9.67,11742,http://eoddata.com//stockquote/TSX/HYBR.htm
970,HYDR,Horizons Global Hydrogen Index ETF,15.2,15.2,15.2,156,http://eoddata.com//stockquote/TSX/HYDR.htm
971,HYI,Horizons Active High Yield Bond ETF,8.77,8.56,8.68,13927,http://eoddata.com//stockquote/TSX/HYI.htm
972,HYLD,Hamilton Enhanced U.S. Covered Call ETF,16.03,15.67,15.67,159908,http://eoddata.com//stockquote/TSX/HYLD.htm
973,HYLD.U,Hamilton Enhanced US Coverd Call ETF USD,15.95,15.7,15.7,5626,http://eoddata.com//stockquote/TSX/HYLD.U.htm
974,HZD,Betapro Silver 2X Daily Bear ETF,18.57,18.1,18.1,22030,http://eoddata.com//stockquote/TSX/HZD.htm
975,HZM,Horizonte Minerals Plc,0.11,0.105,0.11,2603800,http://eoddata.com//stockquote/TSX/HZM.htm


In [43]:
merged_df.loc[115:125]

Unnamed: 0,Symbol,Name,High,Low,Close,Volume,URL
115,AX.PR.E,Artis REIT Pref Ser E,24.35,24.25,24.3,3300,http://eoddata.com//stockquote/TSX/AX.PR.E.htm
116,AX.PR.I,Artis REIT Pref Series I,25.66,25.4,25.66,1269,http://eoddata.com//stockquote/TSX/AX.PR.I.htm
117,AX.UN,Artis Real Estate Investment Trust Units,13.13,12.93,13.02,308738,http://eoddata.com//stockquote/TSX/AX.UN.htm
118,AXU,Alexco Resource Corp,1.95,1.82,1.93,162738,http://eoddata.com//stockquote/TSX/AXU.htm
119,AYA,Aya Gold and Silver Inc,10.55,9.86,10.36,543086,http://eoddata.com//stockquote/TSX/AYA.htm
120,BABY,Else Nutrition Holdings Inc,1.17,1.14,1.17,53438,http://eoddata.com//stockquote/TSX/BABY.htm
121,BABY.WT,Else Nutrition Holdings Inc WT,0.065,0.05,0.05,25000,http://eoddata.com//stockquote/TSX/BABY.WT.htm
122,BABY.WT.A,Else Nutrition Holdings Inc.,0.21,0.2,0.2,13500,http://eoddata.com//stockquote/TSX/BABY.WT.A.htm
123,BAM.A,Brookfield Asset Management Inc Cl A Lv,68.84,66.6,66.7,1304902,http://eoddata.com//stockquote/TSX/BAM.A.htm
124,BAM.PF.A,Brookfield Asset Mgmt Inc Pref Ser 32,24.85,24.71,24.85,10649,http://eoddata.com//stockquote/TSX/BAM.PF.A.htm


## Summary

Finally, we have managed to `parse` the EODData website and get our hands on **interesting and insightful data** from the world of financial stock information.  
We have saved all the information we extracted from that website in a `CSV` file, which we can use to answer many questions we may want to ask, e.g. `Which stock performed best on the given day`
![](https://imgur.com/uAGgHE3.jpg)



Let us look at the steps that we took from start to finish : 

1. We downloaded the webpage using `requests`  


2. We `parsed` the HTML source code using the `BeautifulSoup` library and located the desired information, i.e.
    * The `<tr>` rows that hold each stock
    * The `<td>` cells inside each row


3. We extracted detailed information for each stock, such as:
    * Stock Symbol
    * Stock Name
    * Highest price
    * Lowest price
    * Closing price	
    * Total volumes traded
    * URL to get the historical data of the stock	


4. We then created a `Python Dictionary` to save all these details


5. We wrote the dictionaries for each letter to a CSV file and read those files into `Pandas DataFrames`


6. Then we combined the DataFrames for all letters into a single DataFrame and removed the individual CSV files.


7. With one single DataFrame in hand, we then converted it into a single `CSV` file, which was the goal of our project.

## Future Work

We can now explore this data further to draw meaningful information out of it.  

With further analysis of the data, we can answer many questions, such as:   
* Which stock performed better on the given day 
* Which stock traded more based on volume
* Individual stock information
* Gain/Loss information of the stock

And the list goes on..
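
As a small sketch of such questions, the snippet below uses four rows copied from the `merged_df.head(10)` output earlier in this notebook, with prices and volume as numbers. The variable names are only illustrative.

```python
# A few rows copied from the earlier output, with prices and volume as numbers.
stocks = [
    {"Symbol": "AAB", "High": 0.135, "Low": 0.13, "Close": 0.13, "Volume": 146215},
    {"Symbol": "AAV", "High": 6.63, "Low": 6.06, "Close": 6.61, "Volume": 2106233},
    {"Symbol": "ABX", "High": 29.47, "Low": 28.85, "Close": 29.11, "Volume": 4830196},
    {"Symbol": "AC", "High": 25.98, "Low": 24.71, "Close": 24.91, "Volume": 5230236},
]

# Stock with the highest traded volume.
most_traded = max(stocks, key=lambda stock: stock["Volume"])

# Stock with the largest gap between its high and low price.
widest_range = max(stocks, key=lambda stock: stock["High"] - stock["Low"])

print("Most traded:", most_traded["Symbol"])
print("Widest high-low range:", widest_range["Symbol"])
```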

In the future, I would like to make this `DataSet` even richer by:

* Adding stock information for symbols starting with other letters
* Scraping the individual stock detail pages to get more insight into specific stocks
* Scraping other exchanges, such as NASDAQ
* Writing an automation script that scrapes the stock information daily, producing a data set that can be used for Exploratory Data Analysis and for drawing insights about stock markets across different exchanges.

## References


[1] Python official documentation. https://docs.python.org/3/

[2] Requests library. https://pypi.org/project/requests/

[3] Beautiful Soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[4] Aakash N S, Introduction to Web Scraping, 2021. https://jovian.ai/aakashns/python-web-scraping-and-rest-api

[5] Pandas library documentation. https://pandas.pydata.org/docs/

[6] IMDB Website. https://www.imdb.com/chart/top

[7] Web Scraping Article. https://www.toptal.com/python/web-scraping-with-python

[8] Web Scraping Image. https://morioh.com/p/431153538ecb

[9] Working with Jupyter Notebook. https://towardsdatascience.com/write-markdown-latex-in-the-jupyter-notebook-10985edb91fd

In [44]:
jovian.commit(files=['Toronto_Stocks.csv'])

[jovian] Updating notebook "ivarchan/final-web-scraping-project" on https://jovian.ai
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/ivarchan/final-web-scraping-project


'https://jovian.ai/ivarchan/final-web-scraping-project'