# Introduction to Google Colab and yfinance

Good morning all. I am putting together this quick notebook so that we can reference it when we need to. It should help you get comfortable with the yfinance API, and it will also let us figure out how to connect our notebooks to the shared drive. Since Machine Learning Pipelines is one of the prereqs, I imagine we should use the Makefile setup with multiple Python scripts that can be run alone or run together using the `if __name__ == "__main__":` guard. So let's get started with connecting.
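As a quick aside, here is roughly what I mean by a script that can be run alone or imported by another one. This is only a sketch: the file name `fetch_prices.py` and the function `run` are hypothetical placeholders, not names we have agreed on.

```python
# fetch_prices.py (hypothetical file name, for illustration only)

def run(ticker: str = "MSFT") -> str:
    """Placeholder pipeline step; a real step would download or transform data."""
    return f"Fetched data for {ticker}"


if __name__ == "__main__":
    # Runs only when the file is executed directly (python fetch_prices.py),
    # not when another script does `import fetch_prices`.
    print(run())
```

Another script could then call `fetch_prices.run("AAPL")` without triggering the `print`, which is what would let a Makefile target run each step on its own or chain them together.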

In [None]:
# A simple variable to change depending on who is working in the notebook

user = 'Tom'

## Connecting Colab to Shared Drive
We will start by importing a few libraries and then setting the file path. The file path will likely be different for each of us, so I made it depend on the `user` variable: put your own path under your name, and then we can just switch the name at the top.

In [None]:
# importing only the connection libraries
import os
from google.colab import drive
drive.mount('/content/drive')

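# Specify the path specific to each user
if user.lower() == 'tom':
    path = '/content/drive/MyDrive/Colab Notebooks/696 - Milestone II/696 - Milestone II - Shared'
elif user.lower() == 'peter':
    path = ''  # Peter: add your path here
elif user.lower() == 'melody':
    path = ''  # Melody: add your path here
else:
    # Fail early instead of hitting an undefined `path` below
    raise ValueError(f'Unknown user: {user}')

# Change to the shared directory
os.chdir(path)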
os.getcwd()

Mounted at /content/drive


'/content/drive/MyDrive/Colab Notebooks/696 - Milestone II/696 - Milestone II - Shared'
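If the list of users grows, one alternative is to keep the paths in a dictionary and look them up. This is just a sketch of that idea; `USER_PATHS` and `resolve_shared_path` are names I made up for illustration, and the empty strings are placeholders until Peter and Melody add their own paths.

```python
# Map each teammate to their own Drive path; empty strings are placeholders
# that still need to be filled in.
USER_PATHS = {
    "tom": "/content/drive/MyDrive/Colab Notebooks/696 - Milestone II/696 - Milestone II - Shared",
    "peter": "",
    "melody": "",
}


def resolve_shared_path(user: str, paths: dict = USER_PATHS) -> str:
    """Return the shared-folder path for `user`, or raise a clear error."""
    key = user.strip().lower()
    path = paths.get(key)
    if not path:
        # Covers both unknown users and users whose path is still blank
        raise ValueError(f"No shared-folder path configured for user {user!r}")
    return path
```

In the notebook this would be used as `os.chdir(resolve_shared_path(user))`, so a missing or blank path fails with a readable message rather than an error from `os.chdir`.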

In [None]:
# Let's make sure it's connected to the right folder by listing the files in the directory
os.listdir()

["Yueyao's idea", 'Intro to yfinance.ipynb']

## Introduction to yfinance
yfinance is relatively easy to use and one of the most popular free sources of stock data. I used it in Milestone I to obtain price data for thousands of stocks (be careful of rate limiting), and I am excited to explore more of the fundamental side of the API with this project. I'll show you a very quick and easy example, but feel free to check out the documentation [here](https://ranaroussi.github.io/yfinance/).
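On the rate-limiting point, one simple habit is to request tickers in batches and pause between requests. The sketch below shows that idea only; `chunked` and `download_in_batches` are hypothetical helpers, and the batch size and pause length are guesses on my part, not documented yfinance limits.

```python
import time
from typing import Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_in_batches(tickers, batch_size=50, pause_seconds=2.0,
                        download_fn=None, **kwargs):
    """Download tickers batch by batch, sleeping between requests.

    batch_size and pause_seconds are assumptions to tune, not known limits.
    """
    if download_fn is None:
        # Imported lazily so the helper can be tried offline with a stand-in
        import yfinance as yf
        download_fn = yf.download
    results = []
    batches = list(chunked(list(tickers), batch_size))
    for i, batch in enumerate(batches):
        results.append(download_fn(tickers=batch, **kwargs))
        if i < len(batches) - 1:
            time.sleep(pause_seconds)  # back off before the next request
    return results
```

A call would look like `download_in_batches(my_tickers, period='21d', interval='5m')`, which returns one result per batch that we could then concatenate.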


In [None]:
# Import the libraries for working with yfinance
import numpy as np
import pandas as pd
import yfinance as yf
import altair as alt

In [None]:
# Let's start with a really popular ticker: Microsoft
# Side note: a ticker is the stock symbol, a short code (typically one to five letters
# for US stocks) that identifies the stock on the exchange. We will be dealing with
# tickers a lot, so it is worth knowing the terminology.

#Assign our ticker
msft = yf.Ticker('MSFT')

# Let's start by getting some price data: the past three weeks in 5-minute intervals
# (date/time, open, high, low, close and volume for every 5 minutes of trading)
msft_price = yf.download(tickers='MSFT', period='21d', interval='5m')
msft_price.head()

[*********************100%***********************]  1 of 1 completed


| Price | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| Ticker | MSFT | MSFT | MSFT | MSFT | MSFT |
| Datetime | | | | | |
| 2025-08-01 13:30:00+00:00 | 531.289978 | 535.799988 | 530.320007 | 535.315002 | 2232960 |
| 2025-08-01 13:35:00+00:00 | 530.695007 | 531.859985 | 530.0 | 531.289978 | 577468 |
| 2025-08-01 13:40:00+00:00 | 530.655029 | 532.60498 | 530.200012 | 530.659973 | 713810 |
| 2025-08-01 13:45:00+00:00 | 529.780029 | 530.859985 | 528.98999 | 530.655029 | 559370 |
| 2025-08-01 13:50:00+00:00 | 528.909973 | 530.089905 | 528.470093 | 529.799988 | 477935 |
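Two things stand out in that output: the columns have two levels (`Price` and `Ticker`), and the timestamps are in UTC. For a single ticker it may be more convenient to drop the `Ticker` level and convert to New York time. Here is a small sketch of that; `flatten_single_ticker` is a hypothetical helper, and I rebuild two rows of the output by hand so it runs without a download.

```python
import pandas as pd

# Rebuild two rows of the output above so the sketch runs offline.
index = pd.to_datetime(
    ["2025-08-01 13:30:00+00:00", "2025-08-01 13:35:00+00:00"], utc=True
).rename("Datetime")
columns = pd.MultiIndex.from_product(
    [["Close", "High", "Low", "Open", "Volume"], ["MSFT"]],
    names=["Price", "Ticker"],
)
sample = pd.DataFrame(
    [[531.289978, 535.799988, 530.320007, 535.315002, 2232960],
     [530.695007, 531.859985, 530.0, 531.289978, 577468]],
    index=index, columns=columns,
)


def flatten_single_ticker(prices: pd.DataFrame,
                          tz: str = "America/New_York") -> pd.DataFrame:
    """Drop the Ticker column level and convert the UTC index to `tz`."""
    flat = prices.droplevel("Ticker", axis=1)
    flat.index = flat.index.tz_convert(tz)
    return flat


flat_sample = flatten_single_ticker(sample)
```

After this, `flat_sample['Close']` is a plain column, and the first bar lines up with the 9:30 market open in New York.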


Great, so it was that simple to get historical price data for a stock. Now, let's get some fundamental information like the sector, market cap, etc. These are things that we will likely want to add to the dataframe for our machine learning pipeline, and they are pretty easy to pull. Let's just look at a few.

In [None]:
# Get the corporate info for Microsoft
msft_info = msft.info
# Let's print off some easy information
# (double quotes outside, single inside: reusing the same quote inside an f-string needs Python 3.12+)
print(f"Sector: {msft_info.get('sector')}")
print(f"Industry: {msft_info.get('industry')}")
print(f"Market Capitalization: {msft_info.get('marketCap')}")
print(f"Employees: {msft_info.get('fullTimeEmployees')}")

Sector: Technology
Industry: Software - Infrastructure
Market Capitalization: 3781996838912
Employees: 228000
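Since we will probably want these fields as columns in our modeling dataframe, here is one way that could look. It is only a sketch: `INFO_FIELDS` and `info_to_row` are hypothetical names, and I am assuming that some keys may be missing for some tickers, which is why the helper uses `.get()` and falls back to `None`.

```python
import pandas as pd

# Field names taken from the cell above; other tickers may lack some of them.
INFO_FIELDS = ["sector", "industry", "marketCap", "fullTimeEmployees"]


def info_to_row(ticker: str, info: dict, fields=INFO_FIELDS) -> dict:
    """Pick selected keys from an .info-style dict, using None when a key is missing."""
    row = {"ticker": ticker}
    for field in fields:
        row[field] = info.get(field)
    return row


# Values copied from the printed output above, so no network call is needed.
msft_sample_info = {
    "sector": "Technology",
    "industry": "Software - Infrastructure",
    "marketCap": 3781996838912,
    "fullTimeEmployees": 228000,
}
features = pd.DataFrame([info_to_row("MSFT", msft_sample_info)]).set_index("ticker")
```

With real data we would pass `msft.info` (or the info dict of any other ticker) instead of the hand-copied sample, and append one row per company.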


Now, for the big part of our work. We are going to be working a lot with financial statements in order to model revenue, expenses, margins, and other health indicators for a company moving forward, so that we can predict a target for the share price in the future. So let's grab some annual statements and some quarterly statements.

In [None]:
#Let's pull some financial information from Microsoft
annual_statements = msft.get_income_stmt(freq='yearly')
quarterly_statements = msft.get_income_stmt(freq='quarterly')

Now, let's take a look at both of these dataframes. We will need to figure out which metrics are useful and which are likely unimportant (or highly correlated with other metrics). We can then also think about normalization (expressing items as a percentage of revenue) so that companies are comparable regardless of market cap, and about a way of bringing in time-series analysis (using year-over-year change instead of the raw values alone). These are just my thoughts so far.

In [None]:
annual_statements

| | 2025-06-30 | 2024-06-30 | 2023-06-30 | 2022-06-30 |
|---|---|---|---|---|
| TaxEffectOfUnusualItems | -77217840.0 | -100090000.0 | -2850000.0 | 43754000.0 |
| TaxRateForCalcs | 0.176296 | 0.182313 | 0.19 | 0.131 |
| NormalizedEBITDA | 160603000000.0 | 133558000000.0 | 105155000000.0 | 99905000000.0 |
| TotalUnusualItems | -438000000.0 | -549000000.0 | -15000000.0 | 334000000.0 |
| TotalUnusualItemsExcludingGoodwill | -438000000.0 | -549000000.0 | -15000000.0 | 334000000.0 |
| NetIncomeFromContinuingOperationNetMinorityInterest | 101832000000.0 | 88136000000.0 | 72361000000.0 | 72738000000.0 |
| ReconciledDepreciation | 34153000000.0 | 22287000000.0 | 13861000000.0 | 14460000000.0 |
| ReconciledCostOfRevenue | 87831000000.0 | 74114000000.0 | 65863000000.0 | 62650000000.0 |
| EBITDA | 160165000000.0 | 133009000000.0 | 105140000000.0 | 100239000000.0 |
| EBIT | 126012000000.0 | 110722000000.0 | 91279000000.0 | 85779000000.0 |
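To make the normalization and year-over-year ideas from above a bit more concrete, here is a rough sketch on a frame laid out like this one (line items as rows, period end dates as columns). `common_size` and `year_over_year` are hypothetical helper names, and the numbers are toy values I made up so the arithmetic is easy to check; they are not Microsoft's figures.

```python
import pandas as pd


def common_size(statements: pd.DataFrame, base: str = "TotalRevenue") -> pd.DataFrame:
    """Express every line item as a fraction of `base` for the same period."""
    by_period = statements.T.sort_index()  # rows = period end dates, oldest first
    return by_period.div(by_period[base], axis=0)


def year_over_year(statements: pd.DataFrame, periods: int = 1) -> pd.DataFrame:
    """Fractional change of each line item versus `periods` rows earlier."""
    by_period = statements.T.sort_index()
    return by_period.pct_change(periods=periods, fill_method=None)


# Toy numbers (NOT Microsoft's figures), laid out like the statement above.
toy = pd.DataFrame(
    {"2024-06-30": [150.0, 45.0],
     "2023-06-30": [120.0, 30.0],
     "2022-06-30": [100.0, 20.0]},
    index=["TotalRevenue", "EBIT"],
)
toy.columns = pd.to_datetime(toy.columns)

margins = common_size(toy)    # e.g. EBIT / TotalRevenue per year
growth = year_over_year(toy)  # first (oldest) row is NaN by construction
```

For quarterly statements, `periods=4` would compare each quarter with the same quarter a year earlier, assuming no quarters are missing from the frame.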


Let's look at the revenue for the past four years and see if it is growing and where we would forecast it to go just by eye.

In [None]:
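# Transpose so each row is a fiscal year end and each column is a line item
plotting_df = annual_statements.T
plotting_df.reset_index(inplace=True, names='date')
#plotting_df.head()
# Bar chart of total revenue by fiscal year end
rev = alt.Chart(plotting_df).mark_bar(width=20, color='black').encode(x='date:T', y='TotalRevenue:Q')
rev.show()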
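If we want a number to compare against the eyeball forecast, a naive baseline is to fit a straight line through the yearly values and extend it one step. This is only a sketch of that baseline, not a proposal for our actual model; `linear_trend_forecast` is a hypothetical helper and the revenue series is toy data, not Microsoft's figures.

```python
import numpy as np


def linear_trend_forecast(values, steps_ahead: int = 1) -> float:
    """Fit a straight line through equally spaced values (oldest first) and extrapolate."""
    y = np.asarray(values, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope * (len(y) - 1 + steps_ahead) + intercept)


# Toy revenue series (NOT Microsoft's figures), oldest first.
toy_revenue = [100.0, 110.0, 120.0, 130.0]
next_year = linear_trend_forecast(toy_revenue)
```

With the real data we would pass the `TotalRevenue` values sorted from oldest to newest; with only four annual points this is a sanity check at best.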

In [None]:
quarterly_statements

| | 2025-06-30 | 2025-03-31 | 2024-12-31 | 2024-09-30 | 2024-06-30 |
|---|---|---|---|---|---|
| TaxEffectOfUnusualItems | 495125.1 | 69660000.0 | -203220000.0 | 57190000.0 | -16263850.0 |
| TaxRateForCalcs | 0.165042 | 0.18 | 0.18 | 0.19 | 0.191339 |
| NormalizedEBITDA | 44431000000.0 | 40324000000.0 | 37915000000.0 | 37933000000.0 | 34416000000.0 |
| TotalUnusualItems | 3000000.0 | 387000000.0 | -1129000000.0 | 301000000.0 | -85000000.0 |
| TotalUnusualItemsExcludingGoodwill | 3000000.0 | 387000000.0 | -1129000000.0 | 301000000.0 | -85000000.0 |
| NetIncomeFromContinuingOperationNetMinorityInterest | 27233000000.0 | 25824000000.0 | 24108000000.0 | 24667000000.0 | 22036000000.0 |
| ReconciledDepreciation | 11203000000.0 | 8740000000.0 | 6827000000.0 | 7383000000.0 | 6380000000.0 |
| ReconciledCostOfRevenue | 24014000000.0 | 21919000000.0 | 21799000000.0 | 20099000000.0 | 19684000000.0 |
| EBITDA | 44434000000.0 | 40711000000.0 | 36786000000.0 | 38234000000.0 | 34331000000.0 |
| EBIT | 33231000000.0 | 31971000000.0 | 29959000000.0 | 30851000000.0 | 27951000000.0 |


Okay, I think that is a good introduction and hopefully it gives you something to play around with and look at. Let me know if you have any questions!