# YahooFinance data extraction and formatting

## Purpose

The purpose of this notebook is to retrieve financial data from Yahoo in a consistent way across multiple companies.

It was created to support a data project started in 2022; this version was created on 2025-06-01 to focus on just the companies within the MIB40 index of the [Borsa Italiana](https://www.borsaitaliana.it/homepage/homepage.htm).

## Structure

* Section A: reading a CSV containing the list of companies and loading it into a pandas DataFrame
* Section B: data retrieval

The data retrieval section also allows a "dry run" limited to three records; see the comment at the top of the section.
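
A dry-run switch of this kind can be sketched as follows. This is only an illustration under assumptions: `limit_rows` and the placeholder tickers are hypothetical names that do not appear in the notebook, which uses a counter and `break` inside the loop instead.

```python
from itertools import islice

def limit_rows(rows, dry_run, limit=3):
    """Return at most `limit` items when dry_run is True, otherwise all of them."""
    return list(islice(rows, limit)) if dry_run else list(rows)

# Placeholder tickers, for illustration only (not real data)
tickers = ["AAA.MI", "BBB.MI", "CCC.MI", "DDD.MI", "EEE.MI"]

dry = limit_rows(tickers, dry_run=True)    # only the first three items
full = limit_rows(tickers, dry_run=False)  # every item
```

Limiting the input up front keeps the loop body free of counting logic, at the cost of one extra helper.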

## Results

The following data are extracted:
* income statement
* balance sheet
* cashflow
* sustainability
* information about the company, restructured for readability; the list of officers is kept as a data structure, to ease future processing (e.g. collating data from multiple companies and seeing who serves in more than one of them)
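
The cross-company check on officers mentioned in the last item could look roughly like the sketch below. It assumes each company's officers are available as a list of dictionaries with a `name` key; that layout, the helper `shared_officers`, and the sample entries are all illustrative assumptions, not data retrieved by this notebook.

```python
from collections import defaultdict

def shared_officers(officers_by_ticker):
    """Map each officer name found under more than one ticker to those tickers."""
    seen = defaultdict(set)
    for ticker, officers in officers_by_ticker.items():
        for officer in officers:
            name = officer.get("name")
            if name:
                seen[name].add(ticker)
    return {name: sorted(tickers) for name, tickers in seen.items() if len(tickers) > 1}

# Made-up sample data, for illustration only
sample = {
    "AAA.MI": [{"name": "Person One", "title": "CEO"},
               {"name": "Person Two", "title": "CFO"}],
    "BBB.MI": [{"name": "Person Two", "title": "Director"}],
}
result = shared_officers(sample)
```

Matching on the plain name is the simplest possible rule; real data would likely need normalization of spelling and titles before comparing.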

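Each file generated from the retrieved data follows the same naming convention:

<code>ISIN\_Yahooticker\_filecontent.csv</code>

The company information file is the exception: it is saved as plain text, with the extension <code>.txt</code>.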

| filename part | description |
| --- | --- |
| ISIN | the unique code identifying the listed security, prefixed by the two-letter ISO code of the country |
| Yahooticker | the ticker symbol ("nickname") used by Yahoo, e.g. A2A.MI |
| filecontent | one of the following: incomestatement, balancesheet, cashflow, sustainability, info |
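
As a sketch of the convention in the table, the helper below assembles a file name from its three parts. The function `build_filename` is hypothetical, and the ISIN in the example is a placeholder rather than a real code; the base folder and the `.txt` extension for the info file mirror what the retrieval loop in Section B does.

```python
def build_filename(isin, ticker, content, base="./financials/rawdata/"):
    """Build the output path: <base><ISIN>_<Yahooticker>_<filecontent>.<ext>."""
    # The info file is written as plain text; everything else is CSV
    ext = "txt" if content == "info" else "csv"
    return f"{base}{isin}_{ticker}_{content}.{ext}"

# Placeholder ISIN, for illustration only
csv_name = build_filename("IT0000000000", "A2A.MI", "cashflow")
txt_name = build_filename("IT0000000000", "A2A.MI", "info")
```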

The <code>yfinance</code> library can also extract e.g. prices, dividends, and news; please search online for further information.

Note: the release of the library used in 2022 for the initial version of this notebook had a different syntax and structure; the library is actively maintained.


# Section A: reading a CSV containing the list

In [1]:
import pandas as pd
import yfinance as yf
import time

In [2]:
print(yf.__version__)

0.2.61


In [3]:
# read in the reference table (no header row) containing four columns: name, isin, yahoo link, yahoo ticker code
referencefile = "listino_catalog_kaggle_yahoo_mib40.csv"
datainput = pd.read_csv(referencefile, header=None)
datainput.columns = ["name","isin","yahoolink","yahoocode"]
datainput.dropna(inplace=True)
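
Since the ISIN becomes part of every file name, a quick shape check on the loaded codes may be worthwhile before retrieving anything. The sketch below is an optional addition, not part of the notebook: `looks_like_isin` is a hypothetical helper, and it only tests the 12-character layout (two letters, nine alphanumerics, one digit) without verifying the check digit.

```python
import re

# Layout only: 2-letter country prefix, 9 alphanumeric characters, 1 final digit
ISIN_SHAPE = re.compile(r"[A-Z]{2}[A-Z0-9]{9}[0-9]")

def looks_like_isin(code):
    """Return True when `code` has the 12-character ISIN layout (check digit not verified)."""
    return isinstance(code, str) and ISIN_SHAPE.fullmatch(code) is not None

# Placeholder values, for illustration only
ok = looks_like_isin("IT0000000000")
too_short = looks_like_isin("IT123")
```

With the DataFrame from the cell above, the same function could be applied to the `isin` column to list any rows that need a second look.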

# Section B: data retrieval

In [4]:
# number of companies processed
counter = 0

# where the retrieved data will be saved
sharedpath = "./financials/rawdata/"

# dry run: when True, only the first three records are processed
testprocess = True

for index, row in datainput.iterrows():
    # process all the records - showing the names
    print(row['yahoocode'])
    
    # selecting the item
    ISINticker = row['yahoocode']
    
    # read from yahoo
    ISIN = yf.Ticker(ISINticker)
        
    # income statement
    tempfilename = sharedpath + row['isin'] + "_" + ISINticker + "_incomestatement.csv"
    incomestatement = ISIN.get_financials()
    incomestatement.to_csv(tempfilename)

    # balance sheet
    tempfilename = sharedpath + row['isin'] + "_" + ISINticker + "_balancesheet.csv"
    balancesheet = ISIN.get_balance_sheet()
    balancesheet.to_csv(tempfilename)
    
    # cashflow
    tempfilename = sharedpath + row['isin'] + "_" + ISINticker + "_cashflow.csv"
    cashflow = ISIN.get_cashflow()
    cashflow.to_csv(tempfilename)
        
    # sustainability
    tempfilename = sharedpath + row['isin'] + "_" + ISINticker + "_sustainability.csv"
    sustainability = ISIN.get_sustainability()
    sustainability.to_csv(tempfilename)
    
    # information about the company and its organizational structure
    tempfilename = sharedpath + row['isin'] + "_" + ISINticker + "_info.txt"
    info = ISIN.get_info()
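    # one "key:\n value" block per entry, separated by blank lines
    text = "".join(f"{key}:\n {value}\n\n" for key, value in info.items())
    # explicit encoding, so that non-ASCII characters are written the same way on every platform
    with open(tempfilename, 'w', encoding='utf-8') as file:
        file.write(text)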
    
    time.sleep(5)
    
    counter += 1
    
    if testprocess and (counter > 2):
        break

print("total:", len(datainput))
print("process:", counter)

A2A.MI
AMP.MI
AZM.MI
total: 40
process: 3
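
The loop above assumes that every call returns something that can be written to disk. Should a company lack one of the datasets, a small guard along the following lines could skip the file instead of stopping the run. This is a sketch under assumptions: `save_if_available` is a hypothetical helper, and whether a given `yfinance` call returns `None` or an empty table for missing data is not confirmed here, so both cases are covered.

```python
def save_if_available(frame, path):
    """Write `frame` to CSV and return True; return False when there is nothing to save."""
    # Covers both a missing result (None) and an empty table (DataFrame.empty is True)
    if frame is None or getattr(frame, "empty", False):
        print("no data for", path)
        return False
    frame.to_csv(path)
    return True
```

Inside the loop, a call such as `save_if_available(sustainability, tempfilename)` would then replace the direct `to_csv` call.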
