<a href="https://colab.research.google.com/github/realmistic/PythonInvest-basic-fin-analysis/blob/master/PythonInvest_com_3_Scraping_financial_data_Earnings_per_share_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> **Scraping Financial Data (EPS) with Python**
* **What?** Scrape earnings-per-share (EPS) data from the Yahoo Finance website in Python using the BeautifulSoup library
* **Why?** EPS is one of the most important financial indicators tracked by analysts. It is updated during the quarterly earnings calls, and it can be a big driver of a stock's price change.
* **How?**
  * Get the raw data with the ***Requests*** library
  * Extract data with ***BeautifulSoup***
  * Data cleaning in ***Pandas***
  * Basic visualisation of a dataframe (scatterplot and histogram)


## 1) IMPORTS

In [None]:
# https://pypi.org/project/beautifulsoup4/

!pip install beautifulsoup4



In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## 2) Scraping web data with Requests library

In [38]:

url = "https://www.investing.com/equities/societe-moderne-de-ceramique-financial-summary"

In [39]:
r = requests.get(url)
r.ok

True

In [40]:
r.status_code

200
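The request above can be wrapped in a small helper that sets a timeout and fails loudly on HTTP errors. This is a minimal sketch, not part of the original notebook: the function name `fetch_html`, the `get` parameter (there only so the helper can be exercised without a network call), and the User-Agent string are all assumptions for illustration.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page body as text, raising on HTTP errors."""
    # Hypothetical User-Agent string; some sites reject the default one sent by Requests
    headers = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing silently
    response.raise_for_status()
    return response.text


# Usage (performs a real network request):
# html = fetch_html(url)
```

Checking `r.ok` or `r.status_code` by hand works in a notebook; `raise_for_status()` is more convenient once the request lives inside a function or a loop over many pages.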

In [41]:
r.text



## 3) Extract tags data with BeautifulSoup

In [42]:
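# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.text, 'html.parser')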

In [43]:
table = soup.find_all('table')

In [44]:
table

[<table class="genTbl openTbl companyFinancialSummaryTbl">
 <thead>
 <tr>
 <th class="arial_11 noBold title right period">Period Ending:</th>
 <th>Jun 30, 2023</th>
 <th>Mar 31, 2023</th>
 <th>Dec 31, 2022</th>
 <th>Sep 30, 2022</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td class="bold left">Total Revenue</td>
 <td>23.89</td>
 <td>23.89</td>
 <td>51.24</td>
 <td>51.24</td>
 </tr>
 <tr>
 <td class="bold left">Gross Profit</td>
 <td>10.04</td>
 <td>10.04</td>
 <td>24.91</td>
 <td>24.91</td>
 </tr>
 <tr>
 <td class="bold left">Operating Income</td>
 <td>2.99</td>
 <td>2.99</td>
 <td>1.95</td>
 <td>1.95</td>
 </tr>
 <tr>
 <td class="bold left">Net Income</td>
 <td>-0.461</td>
 <td>-0.461</td>
 <td>-3.99</td>
 <td>-3.99</td>
 </tr>
 </tbody>
 </table>,
 <table class="genTbl openTbl companyFinancialSummaryTbl">
 <thead>
 <tr>
 <th class="arial_11 noBold title right period">Period Ending:</th>
 <th>Jun 30, 2023</th>
 <th>Mar 31, 2023</th>
 <th>Dec 31, 2022</th>
 <th>Sep 30, 2022</th>
 </tr>
 </th

In [45]:
# 11 tables found on the page; soup.table (used below) returns the first one
len(table)

11
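Since the page contains several tables, it can help to select by CSS class rather than rely on position. The sketch below runs on a small inline HTML string modelled on the markup shown above (the `otherTbl` table is made up for the example), so it works without a network request.

```python
from bs4 import BeautifulSoup

# Small stand-in for r.text; the first table is a made-up distractor
html = """
<table class="otherTbl"><tbody><tr><td>ignore me</td></tr></tbody></table>
<table class="genTbl openTbl companyFinancialSummaryTbl">
  <thead><tr><th>Period Ending:</th><th>Jun 30, 2023</th></tr></thead>
  <tbody><tr><td>Total Revenue</td><td>23.89</td></tr></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any table whose class list contains this value
summary_tables = soup.find_all("table", class_="companyFinancialSummaryTbl")
first_summary = summary_tables[0]

# get_text(strip=True) removes surrounding whitespace from each header cell
header_cells = [th.get_text(strip=True) for th in first_summary.find_all("th")]
print(len(soup.find_all("table")), len(summary_tables), header_cells)
```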

In [51]:
# Get all column names
spans = soup.table.thead.find_all('th')
spans

[<th class="arial_11 noBold title right period">Period Ending:</th>,
 <th>Jun 30, 2023</th>,
 <th>Mar 31, 2023</th>,
 <th>Dec 31, 2022</th>,
 <th>Sep 30, 2022</th>]

In [52]:
columns = []
for span in spans:
  print(span.text)
  columns.append(span.text)
columns

Period Ending:
Jun 30, 2023
Mar 31, 2023
Dec 31, 2022
Sep 30, 2022


['Period Ending:',
 'Jun 30, 2023',
 'Mar 31, 2023',
 'Dec 31, 2022',
 'Sep 30, 2022']

In [53]:
rows = soup.table.tbody.find_all('tr')

In [54]:
# 4 rows in the first table
len(rows)

4

In [55]:
# Read the table row by row and collect each row as a dict
records = []
for row in rows:
  elems = row.find_all('td')
  # Map each cell to its column name by position; values stay as strings for now
  records.append({columns[i]: elem.text for i, elem in enumerate(elems)})

# Build the dataframe once; DataFrame.append was removed in pandas 2.0
stocks_df = pd.DataFrame(records, columns=columns)


In [56]:
stocks_df

     Period Ending:  Jun 30, 2023  Mar 31, 2023  Dec 31, 2022  Sep 30, 2022
0     Total Revenue         23.89         23.89         51.24         51.24
1      Gross Profit         10.04         10.04         24.91         24.91
2  Operating Income          2.99          2.99          1.95          1.95
3        Net Income        -0.461        -0.461         -3.99         -3.99
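The header and row extraction steps above can be combined into one reusable function. This is a sketch under assumptions: the name `table_to_dataframe` is hypothetical, and the function expects a table with both a `<thead>` and a `<tbody>`, like the one scraped here. The sample HTML is a shortened copy of the first table.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(table):
    """Convert one <table> tag into a dataframe of strings."""
    columns = [th.get_text(strip=True) for th in table.thead.find_all("th")]
    records = []
    for row in table.tbody.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # zip pairs each cell with its column name by position
        records.append(dict(zip(columns, cells)))
    return pd.DataFrame(records, columns=columns)


sample_html = """
<table>
  <thead><tr><th>Period Ending:</th><th>Jun 30, 2023</th><th>Dec 31, 2022</th></tr></thead>
  <tbody>
    <tr><td>Total Revenue</td><td>23.89</td><td>51.24</td></tr>
    <tr><td>Gross Profit</td><td>10.04</td><td>24.91</td></tr>
  </tbody>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").table
df = table_to_dataframe(sample_table)
print(df)
```

Collecting the rows in a list and constructing the dataframe once avoids growing a dataframe inside a loop, which copies the data on every iteration.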


In [57]:
stocks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Period Ending:  4 non-null      object
 1   Jun 30, 2023    4 non-null      object
 2   Mar 31, 2023    4 non-null      object
 3   Dec 31, 2022    4 non-null      object
 4   Sep 30, 2022    4 non-null      object
dtypes: object(5)
memory usage: 288.0+ bytes


## 4) Data cleaning in Pandas

In [59]:
filter1 = stocks_df['Period Ending:']!='-'
filter2 = stocks_df['Jun 30, 2023']!='-'
filter3 = stocks_df['Mar 31, 2023']!='-'

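# Keep rows without the '-' placeholder; .copy() avoids SettingWithCopyWarning in later assignments
stocks_df_noMissing = stocks_df[filter1 & filter2 & filter3].copy()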

In [60]:
stocks_df_noMissing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Period Ending:  4 non-null      object
 1   Jun 30, 2023    4 non-null      object
 2   Mar 31, 2023    4 non-null      object
 3   Dec 31, 2022    4 non-null      object
 4   Sep 30, 2022    4 non-null      object
dtypes: object(5)
memory usage: 192.0+ bytes
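The three column filters can also be written as a single check across every column. The sketch below uses a small made-up dataframe (the `'-'` under Gross Profit is hypothetical, added only to show a row being dropped) and assumes that `'-'` is the only placeholder for a missing value.

```python
import pandas as pd

# Illustrative data; the '-' cell is invented for this example
df = pd.DataFrame({
    "Period Ending:": ["Total Revenue", "Gross Profit", "Net Income"],
    "Jun 30, 2023": ["23.89", "-", "-0.461"],
    "Mar 31, 2023": ["23.89", "10.04", "-0.461"],
})

# True for rows in which no cell equals the '-' placeholder
has_all_values = (df != "-").all(axis=1)

# .copy() gives an independent dataframe that is safe to modify later
df_no_missing = df[has_all_values].copy()
print(df_no_missing)
```

Note that the comparison is against the exact string `'-'`, so negative numbers such as `'-0.461'` are kept.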


In [61]:
stocks_df_noMissing.head()

     Period Ending:  Jun 30, 2023  Mar 31, 2023  Dec 31, 2022  Sep 30, 2022
0     Total Revenue         23.89         23.89         51.24         51.24
1      Gross Profit         10.04         10.04         24.91         24.91
2  Operating Income          2.99          2.99          1.95          1.95
3        Net Income        -0.461        -0.461         -3.99         -3.99


In [62]:
stocks_df_noMissing['Period Ending:'] = stocks_df_noMissing['Period Ending:'].astype('string')
stocks_df_noMissing['Jun 30, 2023'] = stocks_df_noMissing['Jun 30, 2023'].astype('string')
stocks_df_noMissing['Mar 31, 2023'] = stocks_df_noMissing['Mar 31, 2023'].astype('string')

In [63]:
stocks_df_noMissing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Period Ending:  4 non-null      string
 1   Jun 30, 2023    4 non-null      string
 2   Mar 31, 2023    4 non-null      string
 3   Dec 31, 2022    4 non-null      object
 4   Sep 30, 2022    4 non-null      object
dtypes: object(2), string(3)
memory usage: 192.0+ bytes


In [65]:
stocks_df_noMissing.set_index('Period Ending:', inplace=True)

In [66]:
stocks_df_noMissing.head()

                  Jun 30, 2023  Mar 31, 2023  Dec 31, 2022  Sep 30, 2022
Period Ending:
Total Revenue            23.89         23.89         51.24         51.24
Gross Profit             10.04         10.04         24.91         24.91
Operating Income          2.99          2.99          1.95          1.95
Net Income              -0.461        -0.461         -3.99         -3.99
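The values are still text at this point, so they cannot be plotted or summed as numbers. One possible next step, sketched here on a hand-built copy of two columns from the table above rather than on the scraped dataframe itself, is to convert every column with `pd.to_numeric`.

```python
import pandas as pd

# Hand-built copy of two columns from the table above, stored as strings
df = pd.DataFrame(
    {
        "Jun 30, 2023": ["23.89", "10.04", "2.99", "-0.461"],
        "Dec 31, 2022": ["51.24", "24.91", "1.95", "-3.99"],
    },
    index=pd.Index(
        ["Total Revenue", "Gross Profit", "Operating Income", "Net Income"],
        name="Period Ending:",
    ),
)

# errors="coerce" turns anything that is not a number (e.g. '-') into NaN
numeric_df = df.apply(pd.to_numeric, errors="coerce")
print(numeric_df.dtypes)
```

With `errors="coerce"`, unparseable cells become `NaN` instead of raising an exception, which may hide bad data; dropping the argument makes the conversion strict.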
