----
# Data Cleaning UPDATED
----

## Set Up
---

In [35]:
import numpy as np
import pandas as pd

# plotting
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns

# stats
from statsmodels.api import tsa # time series analysis
import statsmodels.api as sm

## Data Loading
----

In [36]:
msft_df = pd.read_csv('../../data/microsoft_data.csv', index_col='Date')

In [37]:
msft_df.head(5)

Date,Open,High,Low,Close,Adj Close,Volume
2019-07-29,141.5,141.509995,139.369995,141.029999,134.49971,16605900
2019-07-30,140.139999,141.220001,139.800003,140.350006,133.851212,16846500
2019-07-31,140.330002,140.490005,135.080002,136.270004,129.960098,38598800
2019-08-01,137.0,140.940002,136.929993,138.059998,131.667221,40557500
2019-08-02,138.089996,138.320007,135.259995,136.899994,130.560913,30791600


## Checking datatypes
---

In [38]:
msft_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1259 entries, 2019-07-29 to 2024-07-29
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       1259 non-null   float64
 1   High       1259 non-null   float64
 2   Low        1259 non-null   float64
 3   Close      1259 non-null   float64
 4   Adj Close  1259 non-null   float64
 5   Volume     1259 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 68.9+ KB


### Reset Date to be datetime type

In [39]:
msft_df.index = pd.to_datetime(msft_df.index)

In [40]:
msft_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1259 entries, 2019-07-29 to 2024-07-29
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       1259 non-null   float64
 1   High       1259 non-null   float64
 2   Low        1259 non-null   float64
 3   Close      1259 non-null   float64
 4   Adj Close  1259 non-null   float64
 5   Volume     1259 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 68.9 KB


**Comment:**
The remaining columns are all numeric, so we can continue with data cleaning.
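
As an alternative to converting the index afterwards, the dates can be parsed while the file is read. The sketch below uses a two-row in-memory CSV (a cut-down copy of the first rows shown above, not the real file) so that it runs on its own; `csv_text` and `df` are illustrative names.

```python
import io

import pandas as pd

# Two rows copied from the head of the dataset, reduced to two columns
csv_text = """Date,Open,Close
2019-07-29,141.5,141.029999
2019-07-30,140.139999,140.350006
"""

# parse_dates=True asks pandas to parse the index column as dates
df = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(type(df.index).__name__)
```

With the real file, the same two arguments would be passed to the existing `pd.read_csv` call.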

## Looking for missing dates
---

In [41]:
# Getting first and last day from dataset
first_day = msft_df.index.min()
last_day = msft_df.index.max()


In [42]:
# Calculate difference between last and first day
last_day - first_day

Timedelta('1827 days 00:00:00')

In [43]:
msft_df.shape

(1259, 6)

**Comment:**

A Timedelta of 1827 days means the range covers 1828 calendar dates (counting both the first and the last day), but the dataset has only 1259 rows, so 569 dates are missing. I assume these are non-trading days such as weekends and US bank holidays. They will need to be filled in before time series forecasting, since the methods to be used require a continuous date range.

See Appendix/Missing Dates for checking dates are non-trading days (weekends/bank holidays).
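
The number of missing dates can also be counted directly instead of being derived from the Timedelta, which avoids an off-by-one slip: a span of N days contains N + 1 calendar dates. A minimal sketch on a toy index follows; `count_missing_days` is a hypothetical helper, not part of pandas.

```python
import pandas as pd


def count_missing_days(index: pd.DatetimeIndex) -> int:
    """Count calendar days between the first and last date (inclusive) that are absent from `index`."""
    full = pd.date_range(index.min(), index.max(), freq="D")
    return len(full.difference(index))


# Toy index: a Friday and the following Monday, so Saturday and Sunday are missing
toy_index = pd.to_datetime(["2019-08-02", "2019-08-05"])
missing = count_missing_days(toy_index)

print(missing)
```

Applied to the notebook's data, the call would be `count_missing_days(msft_df.index)`.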


## Checking for missing values
----

In [44]:
msft_df.isna().sum() 

Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

**Comment:** 

As expected, there are no missing values in the dataset. Yahoo Finance provides a clean dataset with minimal need for data cleaning.

## Reindexing dates
---

Before filling in the missing non-trading dates, I need to reindex the DataFrame so that the index (dates) is continuous.

In [45]:
# msft_df already has a DatetimeIndex (converted above)
# Build a daily index covering every calendar day between the first and last date
full_index = pd.date_range(start=first_day, end=last_day, freq='D')


In [46]:
msft_df = msft_df.reindex(full_index)

### Checking date range is now continuous

In [47]:
full_range = pd.date_range(start=first_day, end=last_day, freq='D')

In [48]:
full_range.difference(msft_df.index)

DatetimeIndex([], dtype='datetime64[ns]', freq='D')

There are no longer any missing dates in the date range, so the analysis can continue.

In [49]:
msft_df.isna().sum()

Open         569
High         569
Low          569
Close        569
Adj Close    569
Volume       569
dtype: int64

Now there are missing values for the non-trading dates added during re-indexing.
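
To make the effect of reindexing concrete, here is a small sketch on made-up values (the prices and volumes are illustrative, not taken from the dataset). It also shows a side effect worth knowing about: an integer column such as `Volume` becomes `float64`, because NaN cannot be stored in an `int64` column.

```python
import pandas as pd

# Made-up values for a Friday and the following Monday
toy = pd.DataFrame(
    {"Close": [100.0, 103.0], "Volume": [1000, 4000]},
    index=pd.to_datetime(["2019-08-02", "2019-08-05"]),
)

# Reindex onto every calendar day; the weekend rows are added as NaN
daily = toy.reindex(pd.date_range(toy.index.min(), toy.index.max(), freq="D"))

print(daily)
print(daily.dtypes)
```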

### Filling values for missing dates using Interpolation


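Linear interpolation fills each missing value by drawing a straight line between the last known value before the gap and the first known value after it. For a weekend, the Saturday and Sunday values are therefore evenly spaced between Friday's and Monday's values. Because the index is now a continuous daily range, every row is exactly one day apart, so the equal-spacing assumption of `method='linear'` matches the actual calendar spacing. Note that the interpolation is applied to every column, including `Volume`.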

In [50]:
# Linear interpolation
msft_df = msft_df.interpolate(method='linear')
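
As a worked example of what this call does across a weekend gap, the sketch below interpolates a toy series with made-up prices (100 on Friday, 103 on Monday). The two missing days are filled with evenly spaced values on the straight line between the known points.

```python
import pandas as pd

# Friday to Monday, with Saturday and Sunday missing (values are made up)
idx = pd.date_range("2019-08-02", "2019-08-05", freq="D")
prices = pd.Series([100.0, float("nan"), float("nan"), 103.0], index=idx, name="Close")

# Each gap value lies on the straight line between 100.0 and 103.0
filled = prices.interpolate(method="linear")

print(filled)
```

The filled values are 101.0 for Saturday and 102.0 for Sunday, that is, one third and two thirds of the way from Friday's value to Monday's.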

In [51]:
msft_df.isna().sum()

Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

There are no longer any null values, so the cleaned dataset can be exported for EDA.

## Exporting clean data
----

In [52]:
msft_df.to_csv('../../data/microsoft_data_cleaned.csv')
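
One detail to watch when exporting: an index built with `pd.date_range` has no name, so after reindexing the date column is written to the CSV with an empty header. The sketch below, on a toy frame with made-up values and an in-memory buffer in place of a real file, names the index before writing and checks that the data survives a round trip. Naming the index `Date` is a suggestion on my part, not something the original cell does.

```python
import io

import pandas as pd

# Toy cleaned frame on a continuous daily index (values are made up)
idx = pd.date_range("2019-08-02", "2019-08-05", freq="D")
clean = pd.DataFrame({"Close": [100.0, 101.0, 102.0, 103.0]}, index=idx)
clean.index.name = "Date"  # date_range produces an unnamed index

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
clean.to_csv(buffer)
buffer.seek(0)

# Read back, parsing the named index column as dates
restored = pd.read_csv(buffer, index_col="Date", parse_dates=True)

print(restored)
```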

## Conclusion
----

#TODOs

- add in extra comments
- checking of comments and flow
- intro/conc to be written

## Appendix
-----

### Missing Dates

In [12]:
# Calculate full date range between first and last day
# Using freq "B" so the range contains business days only (Monday to Friday; holidays are NOT excluded)
full_range = pd.date_range(start=first_day, end=last_day, freq="B")

In [13]:
difference = full_range.difference(msft_df.index)

In [14]:
difference.shape

(47,)

**Comment:** 

There are 47 missing business days over the 5-year period. I will review these dates to determine whether there is an underlying reason why they are missing. Given that Yahoo Finance typically provides a clean dataset, I expect there to be a specific reason for each of them, since weekends have already been excluded.

In [15]:
difference

DatetimeIndex(['2019-09-02', '2019-11-28', '2019-12-25', '2020-01-01',
               '2020-01-20', '2020-02-17', '2020-04-10', '2020-05-25',
               '2020-07-03', '2020-09-07', '2020-11-26', '2020-12-25',
               '2021-01-01', '2021-01-18', '2021-02-15', '2021-04-02',
               '2021-05-31', '2021-07-05', '2021-09-06', '2021-11-25',
               '2021-12-24', '2022-01-17', '2022-02-21', '2022-04-15',
               '2022-05-30', '2022-06-20', '2022-07-04', '2022-09-05',
               '2022-11-24', '2022-12-26', '2023-01-02', '2023-01-16',
               '2023-02-20', '2023-04-07', '2023-05-29', '2023-06-19',
               '2023-07-04', '2023-09-04', '2023-11-23', '2023-12-25',
               '2024-01-01', '2024-01-15', '2024-02-19', '2024-03-29',
               '2024-05-27', '2024-06-19', '2024-07-04'],
              dtype='datetime64[ns]', freq=None)

**Comment:**

On investigating the missing dates, it looks like they correspond to US bank holidays. Since the stock market is closed on these dates, we should exclude them, in addition to weekends, when analysing the dataset. To do this, I will use the `holidays` library together with the pandas `CustomBusinessDay` offset to add US bank holidays to the list of dates to exclude.
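
To illustrate the difference between the plain business-day frequency and a custom one, here is a minimal sketch for a single week. It uses only pandas and passes one holiday by hand (4 July 2024, one of the missing dates listed above); the variable names are illustrative.

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# Plain business days: Monday to Friday, holidays are still included
plain = pd.date_range("2024-07-01", "2024-07-05", freq="B")

# Custom business days: Monday to Friday minus the listed holidays
trading_day = CustomBusinessDay(holidays=["2024-07-04"])
custom = pd.date_range("2024-07-01", "2024-07-05", freq=trading_day)

print(len(plain), len(custom))
print(plain.difference(custom))
```

The only date in `plain` that is absent from `custom` is the holiday itself.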

In [16]:
from pandas.tseries.offsets import CustomBusinessDay
import holidays

In [17]:
us_bank_hols = holidays.UnitedStates(years=[2019,2020,2021,2022,2023,2024])

In [18]:
# Now excluding weekends AND US bank holidays
cust_business_days = CustomBusinessDay(holidays=us_bank_hols)

In [19]:
# Recalculating full date range between first and last day
# Now excluding weekends and US holidays
business_days = pd.date_range(start=first_day, end=last_day, freq=cust_business_days)

In [20]:
business_days.difference(msft_df.index)

DatetimeIndex(['2020-04-10', '2021-04-02', '2022-04-15', '2023-04-07',
               '2024-03-29'],
              dtype='datetime64[ns]', freq=None)

**Comment:**

Only 5 dates are missing now, one per year, which is interesting.

After looking into these dates further, they turn out to be Good Friday in each year. Good Friday is not a US federal holiday, so it is not included in `us_bank_hols`, but the US stock exchanges are closed on that day:

    April 10, 2020: Good Friday
    April 2, 2021: Good Friday
    April 15, 2022: Good Friday
    April 7, 2023: Good Friday
    March 29, 2024: Good Friday
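
Instead of being typed in by hand, these dates can be derived from Easter Sunday, since Good Friday always falls two days before it. The sketch below assumes the `dateutil` package is available (it is installed as a pandas dependency); `good_friday` is a hypothetical helper name.

```python
from datetime import date, timedelta

from dateutil.easter import easter


def good_friday(year: int) -> date:
    """Return Good Friday, two days before (Western) Easter Sunday."""
    return easter(year) - timedelta(days=2)


# Good Friday for each year covered by the dataset
good_friday_dates = [good_friday(year) for year in range(2019, 2025)]

for d in good_friday_dates:
    print(d.isoformat())
```

The resulting list can be passed to `pd.to_datetime` in the same way as a hand-written list of date strings.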



In [21]:
# Define Good Friday dates (for 2019 to 2024)
good_fridays = pd.to_datetime([
    '2019-04-19',  # Good Friday 2019
    '2020-04-10',  # Good Friday 2020
    '2021-04-02',  # Good Friday 2021
    '2022-04-15',  # Good Friday 2022
    '2023-04-07',  # Good Friday 2023
    '2024-03-29',  # Good Friday 2024
])


In [22]:
all_holidays = pd.to_datetime(list(us_bank_hols) + list(good_fridays))

In [23]:
# Create a custom business day calendar including these holidays
cust_business_days = CustomBusinessDay(holidays=all_holidays)

In [24]:
business_days = pd.date_range(start=first_day, end=last_day, freq=cust_business_days)

In [25]:
business_days.difference(msft_df.index)

DatetimeIndex([], dtype='datetime64[ns]', freq=None)

This confirms that all the missing dates are either weekends (not business days) or US market holidays (bank holidays plus Good Friday), all of which are non-trading days.