----
# Data Cleaning
----

## Set Up
---

In [1]:
import numpy as np
import pandas as pd

# plotting
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns

# stats
from statsmodels.api import tsa # time series analysis
import statsmodels.api as sm

## Data Loading
----

In [2]:
msft_df = pd.read_csv('../../data/microsoft_data.csv', index_col='Date')

## Checking datatypes
---

In [3]:
msft_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1259 entries, 2019-07-29 to 2024-07-29
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       1259 non-null   float64
 1   High       1259 non-null   float64
 2   Low        1259 non-null   float64
 3   Close      1259 non-null   float64
 4   Adj Close  1259 non-null   float64
 5   Volume     1259 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 68.9+ KB


### Convert the Date index to datetime type

In [4]:
msft_df.index = pd.to_datetime(msft_df.index)

In [5]:
msft_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1259 entries, 2019-07-29 to 2024-07-29
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       1259 non-null   float64
 1   High       1259 non-null   float64
 2   Low        1259 non-null   float64
 3   Close      1259 non-null   float64
 4   Adj Close  1259 non-null   float64
 5   Volume     1259 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 68.9 KB


**Comment:**
The index is now a `DatetimeIndex` and the remaining columns are all numerical, so we can continue with data cleaning.
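As an alternative to converting the index after loading, `pd.read_csv` can parse the dates while reading. The sketch below shows this on a small in-memory sample. The two rows and their prices are made up for illustration and only mimic the column layout of the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample with the same columns as the CSV (values are invented)
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2019-07-29,100.0,101.0,99.0,100.5,100.5,1000\n"
    "2019-07-30,100.5,102.0,100.0,101.5,101.5,1200\n"
)

# parse_dates=True asks pandas to parse the index column as dates
sample_df = pd.read_csv(sample_csv, index_col="Date", parse_dates=True)
print(type(sample_df.index).__name__)
```

Either approach should end with the same `DatetimeIndex`. Parsing at load time just saves a separate conversion cell.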

## Looking for missing dates
---

In [6]:
# Getting first and last day
first_day = msft_df.index.min()
last_day = msft_df.index.max()


In [7]:
# Calculate difference between last and first day
last_day - first_day

Timedelta('1827 days 00:00:00')

In [8]:
msft_df.shape

(1259, 6)

**Comment:**

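The date range spans 1827 days (1828 calendar dates counting both endpoints), but the dataset has only 1259 rows, so 569 dates have no record. Most of this gap is expected to come from weekends, when the market is closed. Given that this dataset is based on stock price data, I will intentionally leave gaps in the date range for weekends to accurately reflect the non-trading days.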
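As a quick sanity check on the gap described above, the sketch below counts calendar days, weekend days and weekdays between the first and last dates. The two date literals are copied from the `info()` output rather than read from `msft_df`, so this is an illustration and not part of the cleaning pipeline.

```python
import pandas as pd

# Assumption: start and end dates copied from the info() output above
start = pd.Timestamp("2019-07-29")
end = pd.Timestamp("2024-07-29")

calendar_days = pd.date_range(start=start, end=end, freq="D")

# dayofweek uses Monday=0, so 5 and 6 are Saturday and Sunday
is_weekend = calendar_days.dayofweek >= 5

n_calendar = len(calendar_days)
n_weekend = int(is_weekend.sum())
n_weekday = n_calendar - n_weekend

print(n_calendar, n_weekend, n_weekday)
```

Any weekdays left over after subtracting the 1259 rows would point to closures other than weekends.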


In [9]:
# Calculate full date range between first and last day
# Using freq "B" so the range contains weekdays only (excludes weekends, but NOT holidays)
full_range = pd.date_range(start=first_day, end=last_day, freq="B")

In [10]:
difference = full_range.difference(msft_df.index)

In [11]:
difference.shape

(47,)

**Comment:** 

There are 47 missing weekdays over the 5-year period. I will review these dates to determine whether there is an underlying reason why they are missing. Given that Yahoo Finance typically provides a clean dataset, I expect these gaps to have a specific cause, such as market closures, rather than being data errors.


In [12]:
difference

DatetimeIndex(['2019-09-02', '2019-11-28', '2019-12-25', '2020-01-01',
               '2020-01-20', '2020-02-17', '2020-04-10', '2020-05-25',
               '2020-07-03', '2020-09-07', '2020-11-26', '2020-12-25',
               '2021-01-01', '2021-01-18', '2021-02-15', '2021-04-02',
               '2021-05-31', '2021-07-05', '2021-09-06', '2021-11-25',
               '2021-12-24', '2022-01-17', '2022-02-21', '2022-04-15',
               '2022-05-30', '2022-06-20', '2022-07-04', '2022-09-05',
               '2022-11-24', '2022-12-26', '2023-01-02', '2023-01-16',
               '2023-02-20', '2023-04-07', '2023-05-29', '2023-06-19',
               '2023-07-04', '2023-09-04', '2023-11-23', '2023-12-25',
               '2024-01-01', '2024-01-15', '2024-02-19', '2024-03-29',
               '2024-05-27', '2024-06-19', '2024-07-04'],
              dtype='datetime64[ns]', freq=None)

**Comment:**

On investigating the missing dates, it looks like they correspond to US bank holidays. Since the stock market is closed on these dates, we should exclude them, in addition to weekends, when analysing the dataset. To do this, I will use the `holidays` library together with the pandas `CustomBusinessDay` offset to add US bank holidays to the list of dates to exclude.
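One way to double-check which holiday each missing date corresponds to is to look the dates up in a named holiday calendar. The sketch below uses the `USFederalHolidayCalendar` that ships with pandas, which is an assumption made for illustration and not the `holidays` library used in this notebook. The four dates are a hand-picked sample from the list above.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# A hand-picked sample of the missing dates listed above
missing_sample = pd.to_datetime(
    ["2019-09-02", "2019-11-28", "2019-12-25", "2020-04-10"]
)

# Series of holiday names indexed by date, covering the sample's date span
named_holidays = USFederalHolidayCalendar().holidays(
    start=missing_sample.min(), end=missing_sample.max(), return_name=True
)

# Dates that are not federal holidays come back as NaN
labels = named_holidays.reindex(missing_sample)
print(labels)
```

Any date that comes back unlabelled is not in the federal calendar and would need a separate look.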


In [13]:
from pandas.tseries.offsets import CustomBusinessDay
import holidays

In [14]:
us_bank_hols = holidays.UnitedStates(years=[2019,2020,2021,2022,2023,2024])

In [15]:
# Now excluding weekends AND US bank holidays
cust_business_days = CustomBusinessDay(holidays=us_bank_hols)

In [16]:
# Recalculating full date range between first and last day
# Now excluding weekends and US holidays
business_days = pd.date_range(start=first_day, end=last_day, freq=cust_business_days)

In [17]:
business_days.difference(msft_df.index)

DatetimeIndex(['2020-04-10', '2021-04-02', '2022-04-15', '2023-04-07',
               '2024-03-29'],
              dtype='datetime64[ns]', freq=None)

**Comment:**

Only 5 dates are missing now, one per year, which is interesting.

After looking into these dates further, it appears they are the Good Friday holiday for each year. Good Friday is a stock market holiday but not a US federal holiday, which likely explains why it is not included in `us_bank_hols`:

    April 10, 2020: Good Friday
    April 2, 2021: Good Friday
    April 15, 2022: Good Friday
    April 7, 2023: Good Friday
    March 29, 2024: Good Friday


Once Good Friday is added to the holiday list, no missing dates should remain that require filling, so cleaning can proceed.
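Rather than typing the Good Friday dates by hand, they can also be derived from the date of Easter. The sketch below assumes `python-dateutil` is available (it is installed as a pandas dependency) and uses its `easter` function. The name `computed_good_fridays` is introduced here for illustration only.

```python
from datetime import timedelta

import pandas as pd
from dateutil.easter import easter  # returns the date of (Western) Easter Sunday

# Good Friday falls two days before Easter Sunday
computed_good_fridays = pd.to_datetime(
    [easter(year) - timedelta(days=2) for year in range(2019, 2025)]
)
print(computed_good_fridays)
```

Computing the dates this way avoids transcription errors and extends to later years by widening the `range`.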

In [18]:
# Define Good Friday dates (for 2019 to 2024)
good_fridays = pd.to_datetime([
    '2019-04-19',  # Good Friday 2019
    '2020-04-10',  # Good Friday 2020
    '2021-04-02',  # Good Friday 2021
    '2022-04-15',  # Good Friday 2022
    '2023-04-07',  # Good Friday 2023
    '2024-03-29',  # Good Friday 2024
])


In [19]:
all_holidays = pd.to_datetime(list(us_bank_hols) + list(good_fridays))

In [20]:
# Create a custom business day calendar including these holidays
cust_business_days = CustomBusinessDay(holidays=all_holidays)

In [21]:
business_days = pd.date_range(start=first_day, end=last_day, freq=cust_business_days)

In [22]:
business_days.difference(msft_df.index)

DatetimeIndex([], dtype='datetime64[ns]', freq=None)
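The empty result above shows that no expected business day is missing from the data. It does not show the reverse, namely whether the data contains days that the custom calendar treats as holidays. General holiday calendars can include days such as Veterans Day, on which the stock market still trades. The sketch below illustrates a two-way check on one month. It uses pandas' `USFederalHolidayCalendar` as a stand-in for `us_bank_hols` and a hypothetical `trading_days` index as a stand-in for `msft_df.index`, so both are assumptions.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

start, end = "2019-11-01", "2019-11-30"

# Assumption: the pandas federal calendar stands in for the holiday list used above
federal_holidays = USFederalHolidayCalendar().holidays(start=start, end=end)
expected_days = pd.date_range(
    start=start, end=end, freq=CustomBusinessDay(holidays=federal_holidays)
)

# Hypothetical stand-in for msft_df.index: every weekday except Thanksgiving
weekdays = pd.date_range(start=start, end=end, freq="B")
trading_days = weekdays[weekdays != pd.Timestamp("2019-11-28")]

# Expected business days with no data
missing_from_data = expected_days.difference(trading_days)
# Days with data that the calendar treats as holidays
unexpected_in_data = trading_days.difference(expected_days)

print(missing_from_data)
print(unexpected_in_data)
```

Running `msft_df.index.difference(business_days)` on the real data would be the equivalent check. A non-empty result there would suggest the holiday list is broader than the exchange's actual closures.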

**Comment:**

I have decided not to use interpolation to 'fill' the missing dates in my stock market data. The missing dates occur on days when the stock market is not open, such as weekends and bank holidays. Interpolating data for non-trading days could lead to unreliable results and insights since there are no actual market transactions or price changes during these periods.

Moving forward, I will explore the effect of aggregating the data to address the gaps caused by market closures. Many time series methods expect observations at regular intervals, and since interpolation is not a suitable choice in this context, I will aggregate the data to a weekly or monthly level so that every period has a value.
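A minimal sketch of what the weekly aggregation could look like is shown below. It uses two weeks of invented daily prices in place of `msft_df`, and the choice of aggregation per column (first open, highest high, lowest low, last close, summed volume) is a common convention assumed here, not necessarily the one used later in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical two weeks of daily prices, standing in for msft_df
idx = pd.date_range("2024-07-01", periods=10, freq="B")
steps = np.arange(10)
daily = pd.DataFrame(
    {
        "Open": 100 + steps,
        "High": 101 + steps,
        "Low": 99 + steps,
        "Close": 100.5 + steps,
        "Volume": 1000,
    },
    index=idx,
)

# One row per week ending on Friday, aggregating each column separately
weekly = daily.resample("W-FRI").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)
print(weekly)
```

The same pattern applies to monthly aggregation by swapping in a monthly frequency alias. A week shortened by a holiday still produces one row, which is what removes the irregular gaps.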


## Checking for missing values
----

In [23]:
msft_df.isna().sum() 

Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

**Comment:** 

As expected, there are no missing values in the dataset. Yahoo Finance provides a clean dataset with minimal need for data cleaning.


## Exporting dataframe to csv
----

In [24]:
msft_df.to_csv('../../data/microsoft_data_cleaned.csv')

TODOs:

- format of graphs
- add intro/conc