# Web Scraping with Python · Stock Market Data

Completed by [Anton Starshev](http://linkedin.com/in/starshev) on 25/02/2024

### Context

The task is to extract stock data for specific companies with the `yfinance` library, and to scrape quarterly revenue tables from web pages by making HTTP requests with the `requests` library, parsing the HTML with `BeautifulSoup` (or with pandas directly), and converting the result into a pandas DataFrame.

### Data

Links to the HTML pages containing the desired tables are saved in the following variables in advance:

* `url_tsla` for task 2
* `url_gme` for task 4

### Execution

Imported all necessary libraries.

In [72]:
import pandas as pd
import requests
import yfinance as yf
from bs4 import BeautifulSoup

**Task 1:** Using the ticker object, extract stock information for Tesla and save it in a dataframe named `tesla_data`, given that its ticker symbol is `TSLA`.

Created a ticker object for the `TSLA` symbol.

In [4]:
ticker = yf.Ticker('TSLA')

Using the ticker object, extracted stock information and saved it in a dataframe named `tesla_data`. Set the `period` parameter to `max` in order to get information for the maximum amount of time.

In [75]:
tesla_data = ticker.history(period = 'max')
tesla_data.reset_index(inplace = True)

tesla_data

| | Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|---|
| 0 | 2002-02-13 00:00:00-05:00 | 1.620129 | 1.693350 | 1.603296 | 1.691667 | 76216000 | 0.0 | 0.0 |
| 1 | 2002-02-14 00:00:00-05:00 | 1.712707 | 1.716074 | 1.670626 | 1.683250 | 11021600 | 0.0 | 0.0 |
| 2 | 2002-02-15 00:00:00-05:00 | 1.683250 | 1.687458 | 1.658001 | 1.674834 | 8389600 | 0.0 | 0.0 |
| 3 | 2002-02-19 00:00:00-05:00 | 1.666418 | 1.666418 | 1.578047 | 1.607504 | 7410400 | 0.0 | 0.0 |
| 4 | 2002-02-20 00:00:00-05:00 | 1.615920 | 1.662210 | 1.603296 | 1.662210 | 6892800 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5659 | 2024-08-08 00:00:00-04:00 | 21.010000 | 21.879999 | 20.809999 | 21.750000 | 5439700 | 0.0 | 0.0 |
| 5660 | 2024-08-09 00:00:00-04:00 | 21.510000 | 22.170000 | 21.459999 | 21.930000 | 4828900 | 0.0 | 0.0 |
| 5661 | 2024-08-12 00:00:00-04:00 | 21.980000 | 22.270000 | 21.450001 | 21.879999 | 4449100 | 0.0 | 0.0 |
| 5662 | 2024-08-13 00:00:00-04:00 | 21.959999 | 22.379999 | 21.860001 | 22.270000 | 3904400 | 0.0 | 0.0 |
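
The `Date` column produced by `reset_index` keeps the exchange's time zone offset, as the output above shows. The following is a minimal sketch of how that offset could be dropped when plain calendar dates are enough. It uses a small synthetic frame with made-up values as a stand-in for the `ticker.history()` result, so the names `prices` and `flat` are illustrative only.

```python
import pandas as pd

# Synthetic stand-in for the frame returned by ticker.history(); the values are made up.
index = pd.DatetimeIndex(
    ['2024-08-12', '2024-08-13'], tz='America/New_York', name='Date'
)
prices = pd.DataFrame({'Close': [10.0, 10.5]}, index=index)

# reset_index moves the DatetimeIndex into an ordinary 'Date' column
flat = prices.reset_index()

# Drop the time zone while keeping the local wall-clock time (midnight here)
flat['Date'] = flat['Date'].dt.tz_localize(None)

print(flat.dtypes)
```

Whether to drop the time zone depends on the downstream use; for daily data that is only plotted or joined by date, naive timestamps are usually easier to work with.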


**Task 2.1:** Use web scraping to extract Tesla Quarterly Revenue data into a dataframe named `tsla_q_rev` from the given webpage source: `url_tsla`.


Made an HTTP request to the source URL and saved the response into a variable.

In [89]:
html_tsla = requests.get(url_tsla)
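
A bare `requests.get` call waits indefinitely if the server stalls and does not complain about error responses. As a hedged sketch, the request could be wrapped in a small helper that sets a timeout and raises on HTTP errors. The helper name `fetch_html` and its `get` parameter are hypothetical and not part of the original notebook; `get` exists only so the function can be tried without network access.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page body as text, raising on HTTP errors.

    `fetch_html` and `get` are hypothetical names introduced for this sketch.
    """
    response = get(url, timeout=timeout)  # give up instead of hanging forever
    response.raise_for_status()           # raise requests.HTTPError on 4xx/5xx
    return response.text
```

Note that this helper returns the HTML text rather than the response object, so it would be passed to `BeautifulSoup` directly, without `.text`.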

Parsed the HTML text with BeautifulSoup, then searched its tables for the one mentioning both "Tesla" and "quarterly" and saved it into a variable.

In [90]:
soup = BeautifulSoup(html_tsla.text, 'html.parser')  # name the parser explicitly
tsla_tables = soup.find_all('table')

# Pick the first table whose text mentions both keywords
tsla_q_rev_table = None
for item in tsla_tables:
    text = item.get_text().lower()
    if 'tesla' in text and 'quarterly' in text:
        tsla_q_rev_table = item
        break

if tsla_q_rev_table is not None:
    print('Tesla Revenue table is created.')
else:
    print('Tesla Revenue table was not found.')

Tesla Revenue table is created.
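
Since the same keyword search is needed again for GameStop, the lookup could be factored into a reusable function. The sketch below assumes a hypothetical helper named `find_table_by_keywords` and exercises it on made-up HTML rather than on the real page.

```python
from bs4 import BeautifulSoup


def find_table_by_keywords(soup, keywords):
    """Return the first <table> whose text contains every keyword, else None."""
    for table in soup.find_all('table'):
        text = table.get_text().lower()
        if all(keyword.lower() in text for keyword in keywords):
            return table
    return None


# Made-up HTML used only to exercise the helper
sample_html = """
<table><tr><th>Tesla Annual Revenue</th></tr><tr><td>2021</td><td>$100</td></tr></table>
<table><tr><th>Tesla Quarterly Revenue</th></tr><tr><td>2021-12-31</td><td>$25</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
found = find_table_by_keywords(sample_soup, ['tesla', 'quarterly'])
print(found.find('th').get_text())  # Tesla Quarterly Revenue
```

Returning `None` when nothing matches gives the caller an explicit value to test, instead of checking whether a variable name exists.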


Extracted headers and rows from the table tag object, saved the table into a dataframe, and fell back to explicit column names when the number of headers did not match the number of cells in a row.

In [91]:
headers = [th.get_text() for th in tsla_q_rev_table.find_all('th')]

rows = []
for row in tsla_q_rev_table.find_all('tr')[1:]:
    cells = [td.get_text() for td in row.find_all('td')]
    rows.append(cells)

if len(headers) != len(rows[0]):
    headers = ['Quarter', 'TSLA Revenue'] 
    
tsla_q_rev = pd.DataFrame(data = rows, columns = headers)

tsla_q_rev.head(10)

| | Quarter | TSLA Revenue |
|---|---|---|
| 0 | 2022-09-30 | $21,454 |
| 1 | 2022-06-30 | $16,934 |
| 2 | 2022-03-31 | $18,756 |
| 3 | 2021-12-31 | $17,719 |
| 4 | 2021-09-30 | $13,757 |
| 5 | 2021-06-30 | $11,958 |
| 6 | 2021-03-31 | $10,389 |
| 7 | 2020-12-31 | $10,744 |
| 8 | 2020-09-30 | $8,771 |
| 9 | 2020-06-30 | $6,036 |
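
Both columns in the scraped table are still strings, so the revenue values cannot be summed or plotted yet. A possible follow-up step, shown here as a sketch on three rows copied from the output above, strips the `$` and thousands separators and converts the columns to proper types. The name `cleaned` is illustrative.

```python
import pandas as pd

# Three rows copied from the scraped output above
revenue = pd.DataFrame({
    'Quarter': ['2022-09-30', '2022-06-30', '2022-03-31'],
    'TSLA Revenue': ['$21,454', '$16,934', '$18,756'],
})

cleaned = revenue.copy()
cleaned['Quarter'] = pd.to_datetime(cleaned['Quarter'])

# Remove '$' and ',' before converting; unparseable cells become NaN
cleaned['TSLA Revenue'] = pd.to_numeric(
    cleaned['TSLA Revenue'].str.replace(r'[$,]', '', regex=True),
    errors='coerce',
)

print(cleaned.dtypes)
```

Using `errors='coerce'` is a design choice: empty or malformed cells turn into `NaN` and can then be dropped with `dropna()`, rather than stopping the conversion with an exception.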


**Task 2.2:** Complete the same task as 2.1, but using only the pandas library: extract Tesla Quarterly Revenue data into a dataframe named `tsla_q_rev2` from the given webpage source: `url_tsla`.

Read all HTML tables from the page into a list of dataframes with `pd.read_html`.

In [120]:
tsla_tables2 = pd.read_html(url_tsla)

Identified the desired dataframe by keywords in its column names and saved it into a variable `tsla_q_rev2`.

In [121]:
tsla_q_rev2 = None
for item in tsla_tables2:
    # Column labels are not guaranteed to be strings, so cast before lowercasing
    labels = [str(col).lower() for col in item.columns]
    if any('tesla' in label for label in labels) and any('quarterly' in label for label in labels):
        tsla_q_rev2 = item
        break

if tsla_q_rev2 is not None:
    print('Tesla Revenue table is created.')
else:
    print('Tesla Revenue table was not found.')

Tesla Revenue table is created.


Modified the column names and checked the first 10 rows of the dataframe.

In [122]:
tsla_q_rev2.columns = ['Quarter', 'TSLA Revenue']

tsla_q_rev2.head(10)

| | Quarter | TSLA Revenue |
|---|---|---|
| 0 | 2022-09-30 | $21,454 |
| 1 | 2022-06-30 | $16,934 |
| 2 | 2022-03-31 | $18,756 |
| 3 | 2021-12-31 | $17,719 |
| 4 | 2021-09-30 | $13,757 |
| 5 | 2021-06-30 | $11,958 |
| 6 | 2021-03-31 | $10,389 |
| 7 | 2020-12-31 | $10,744 |
| 8 | 2020-09-30 | $8,771 |
| 9 | 2020-06-30 | $6,036 |
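
`pd.read_html` also accepts a `match` argument that keeps only the tables containing a given string or regular expression, which could replace the manual keyword loop. The sketch below runs on made-up HTML wrapped in `StringIO` instead of the real page, and it assumes an HTML parser backend for pandas (`lxml`, or `bs4` with `html5lib`) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables; only the second mentions "Quarterly"
sample_html = """
<table>
  <tr><th colspan="2">Tesla Annual Revenue</th></tr>
  <tr><td>2021</td><td>$100</td></tr>
</table>
<table>
  <tr><th colspan="2">Tesla Quarterly Revenue</th></tr>
  <tr><td>2021-12-31</td><td>$25</td></tr>
</table>
"""

# Keep only the tables whose text matches the pattern
matching = pd.read_html(StringIO(sample_html), match='Quarterly')

quarterly = matching[0]
quarterly.columns = ['Quarter', 'Revenue']
print(quarterly)
```

With the real page, the call would take `url_tsla` in place of the `StringIO` object, and the pattern may need to be more specific if several tables mention the same word.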


**Task 3:** Using the ticker symbol `GME`, extract stock data for GameStop.


Created a ticker object.

In [115]:
ticker = yf.Ticker('GME')

Using the ticker object, extracted stock information and saved it in a dataframe named `gme_data`. Set the `period` parameter to `max` for the maximum amount of time.

In [116]:
gme_data = ticker.history(period = 'max')
gme_data.reset_index(inplace = True)

gme_data

| | Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|---|
| 0 | 2002-02-13 00:00:00-05:00 | 1.620128 | 1.693350 | 1.603296 | 1.691666 | 76216000 | 0.0 | 0.0 |
| 1 | 2002-02-14 00:00:00-05:00 | 1.712707 | 1.716074 | 1.670626 | 1.683250 | 11021600 | 0.0 | 0.0 |
| 2 | 2002-02-15 00:00:00-05:00 | 1.683251 | 1.687459 | 1.658002 | 1.674834 | 8389600 | 0.0 | 0.0 |
| 3 | 2002-02-19 00:00:00-05:00 | 1.666418 | 1.666418 | 1.578048 | 1.607504 | 7410400 | 0.0 | 0.0 |
| 4 | 2002-02-20 00:00:00-05:00 | 1.615920 | 1.662210 | 1.603296 | 1.662210 | 6892800 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5659 | 2024-08-08 00:00:00-04:00 | 21.010000 | 21.879999 | 20.809999 | 21.750000 | 5439700 | 0.0 | 0.0 |
| 5660 | 2024-08-09 00:00:00-04:00 | 21.510000 | 22.170000 | 21.459999 | 21.930000 | 4828900 | 0.0 | 0.0 |
| 5661 | 2024-08-12 00:00:00-04:00 | 21.980000 | 22.270000 | 21.450001 | 21.879999 | 4449100 | 0.0 | 0.0 |
| 5662 | 2024-08-13 00:00:00-04:00 | 21.959999 | 22.379999 | 21.860001 | 22.270000 | 3913700 | 0.0 | 0.0 |


**Task 4:** Use web scraping to extract GameStop Quarterly Revenue data into a dataframe named `gme_q_rev` from the given webpage source: `url_gme`.

Downloaded the webpage from the given URL.

In [117]:
html_gme = requests.get(url_gme)

Parsed the HTML and extracted all tables from the webpage.

In [135]:
soup = BeautifulSoup(html_gme.text, 'html.parser')  # name the parser explicitly

gme_tables = soup.find_all('table')  # find_all replaces the deprecated findAll alias

Searched the header cells of each table to find the Quarterly Revenue table for GameStop.

In [136]:
gme_q_rev_table = None
for gme_table in gme_tables:
    # Lowercased text of every header cell in this table
    header_texts = [th.get_text().lower() for th in gme_table.find_all('th')]
    if any('gamestop' in text for text in header_texts) and any('quarterly' in text for text in header_texts):
        gme_q_rev_table = gme_table
        break

if gme_q_rev_table is not None:
    print('GameStop Revenue table is created.')
else:
    print('GameStop Revenue table was not found.')

GameStop Revenue table is created.


Converted the table tag object into a dataframe, adjusted column names and previewed the final dataframe.

In [137]:
gme_q_rev = pd.DataFrame(
    # One list of cell texts per table row, skipping the header row
    [[td.get_text() for td in tr.find_all('td')] for tr in gme_q_rev_table.find_all('tr')[1:]],
    columns = ['Quarter', 'GME Revenue'])

gme_q_rev

| | Quarter | GME Revenue |
|---|---|---|
| 0 | 2020-04-30 | $1,021 |
| 1 | 2020-01-31 | $2,194 |
| 2 | 2019-10-31 | $1,439 |
| 3 | 2019-07-31 | $1,286 |
| 4 | 2019-04-30 | $1,548 |
| ... | ... | ... |
| 57 | 2006-01-31 | $1,667 |
| 58 | 2005-10-31 | $534 |
| 59 | 2005-07-31 | $416 |
| 60 | 2005-04-30 | $475 |
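
Slicing off the first `<tr>` works when the table has exactly one header row. A slightly more defensive variant, sketched below with a hypothetical helper named `table_to_frame`, keeps only the rows whose number of `<td>` cells matches the expected columns. The sample HTML is made up, with two rows copied from the output above.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(table, columns):
    """Build a DataFrame from <td> cells, skipping rows that do not fit `columns`."""
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == len(columns):  # header rows contain no <td> cells
            rows.append(cells)
    return pd.DataFrame(rows, columns=columns)


# Made-up table structure; the two data rows are copied from the output above
sample = BeautifulSoup("""
<table>
  <tr><th colspan="2">GameStop Quarterly Revenue</th></tr>
  <tr><td>2020-04-30</td><td>$1,021</td></tr>
  <tr><td>2020-01-31</td><td>$2,194</td></tr>
</table>
""", 'html.parser')

frame = table_to_frame(sample.find('table'), ['Quarter', 'GME Revenue'])
print(frame)
```

The same helper would also cover the Tesla table by passing `['Quarter', 'TSLA Revenue']` as the column list.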


### Acknowledgment

I would like to express gratitude to IBM and Coursera for supporting the educational process and providing the opportunity to refine and showcase skills acquired during the courses by completing real-life scenario portfolio projects such as this one.

### Reference

This is a workplace scenario project proposed within the syllabus of the IBM Data Analyst Professional Certificate on Coursera.