## <u>Data lifecycle (cf. Module 1)</u>
##### **Relate your project to one or more of the lifecycle models discussed in class, review each other's work**

The datasets we chose are the **historical stock market data for Apple Inc. from Yahoo Finance**, and the **historical S&P 500 Index data from the Federal Reserve Economic Data (FRED) database**.

##### Based on the Yahoo Finance dataset, we have determined that it follows a restricted data science lifecycle model.

Planning: Access to Yahoo Finance data is not guaranteed because it is stored on Yahoo's private servers, and web scraping the data is not allowed. It is important to understand the terms and conditions so that no rules are violated when using the dataset.

Acquisition: Yahoo Finance collects the data from the financial markets, and it may be compiled from multiple sources.

Processing: The dataset is processed so that we can only access it through the `yfinance` library.

Preservation: The data is stored on Yahoo's servers, but access to it is not guaranteed. Yahoo can delete the data at any time, which makes it a somewhat unreliable source.

Publishing/sharing: The dataset can be used only for noncommercial purposes and cannot be shared or redistributed.

##### Based on the FRED dataset, we have determined that this is a standard data science lifecycle model.  

Planning: The FRED data is accessible to all, as long as you give credit. Knowing how to properly cite the data is important.

Acquisition: The data is collected and maintained by the Federal Reserve Bank of St. Louis, which draws on official financial sources.

Processing: The data is processed so that it can be accessed in CSV format or through an API.

Preservation: The dataset can be used freely by anybody, and the historical data remains available through the API.

Publishing/sharing: You are allowed to publish and share your results; the only request is that you credit FRED as your source.

## <u>Ethical data handling (cf. Module 2)</u>
##### **Identification of all ethical, legal, or policy constraints and how they were addressed. This includes issues related to consent, privacy/confidentiality, copyright, licenses and terms of use.**


##### <u>For the Apple Inc. stock market data from Yahoo Finance</u>
Consent: This is public market data and contains no data about individuals, so consent is not needed.

Privacy/Confidentiality: The dataset does not contain any personal or identifiable information.

Copyright: Under Yahoo's Terms of Service, we are allowed to use the dataset as long as it is not for commercial use, so educational use is acceptable.

Licenses: This dataset is under a proprietary license. According to the Terms of Service, the data may be used as long as the use follows the Terms and Conditions, and you are not allowed to sell the data or create something new based on it.

Terms of Use: The Terms of Service prohibit data scraping beyond API access. To address this, we use the `yfinance` library, which complies with those rules.

##### <u>For the historical S&P 500 Index data from the Federal Reserve Economic Data (FRED) database</u>

Consent: This is public market data and contains no data about individuals, so consent is not needed.

Privacy/Confidentiality: No personal data was used in the dataset.

Copyright: The FRED data is public domain, and it is available for non-commercial research and for educational use.

License: The FRED data is under a permissive license, meaning that it is available for everyone to use, but FRED requests that it be cited as the source of the data.

## <u>Data Collection and Acquisition (cf. Module 3)</u>
##### **Collection or acquisition of at least 2 different datasets from distinct trustworthy sources**

The first dataset we chose is the historical stock market data for Apple Inc. (AAPL) from Yahoo Finance.

In [83]:
import yfinance as yf
import pandas as pd

# Download quarterly (3-month) price bars for Apple
df_yf = yf.download(
    'AAPL',
    start='2015-01-01',
    end='2025-10-25',
    interval='3mo',
    auto_adjust=False,
    progress=False
)

# yfinance can return two-level (Price, Ticker) columns; keep only the
# price-field level so the CSV is written with a single header row
if isinstance(df_yf.columns, pd.MultiIndex):
    df_yf.columns = df_yf.columns.get_level_values(0)

# Turn the Date index into a regular column
df_yf = df_yf.reset_index()

# Only 'Adj Close' needs renaming (removes the space in the name)
df_yf = df_yf.rename(columns={'Adj Close': 'Adj_Close'})

df_yf.to_csv('AAPL_2015_2025.csv', index=False)

df_yf.head()


Price,Date,Adj_Close,Close,High,Low,Open,Volume
0,2015-01-01,27.61183,31.1075,33.400002,26.157499,27.8475,14321762800
1,2015-04-01,27.94359,31.3575,33.634998,30.775,31.205,11315577200
2,2015-07-01,24.675539,27.575001,33.2425,23.0,31.725,15486588000
3,2015-10-01,23.654613,26.315001,30.955,26.205,27.2675,11140271600
4,2016-01-01,24.597677,27.247499,27.605,23.0975,25.6525,11315040800


The second dataset we chose is the historical S&P 500 Index data from the Federal Reserve Economic Data (FRED) database.

In [82]:
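import os

from fredapi import Fred
import pandas as pd

# Read the API key from the environment instead of hard-coding it in the
# notebook (assumes the key is stored in a FRED_API_KEY environment variable)
fred = Fred(api_key=os.environ['FRED_API_KEY'])

# The FRED API has no `interval` parameter, so that argument is dropped here;
# the series comes back at its native daily frequency, as the output shows
sp500_series = fred.get_series(
    'SP500',
    observation_start='2015-01-01',
    observation_end='2025-10-25'
)

# Name the value column and turn the date index into a Date column
sp500 = pd.DataFrame(sp500_series)
sp500 = sp500.rename(columns={0: 'S&P 500 Index'})

sp500 = sp500.reset_index().rename(columns={'index': 'Date'})

sp500.to_csv('sp500_2015_2025.csv', index=False)

sp500.head()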


Unnamed: 0,Date,S&P 500 Index
0,2015-11-16,2053.19
1,2015-11-17,2050.44
2,2015-11-18,2083.58
3,2015-11-19,2081.24
4,2015-11-20,2089.17
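
The FRED series above comes back as daily observations, while the AAPL table uses 3-month bars. One possible way to bring the index onto the same quarterly grid is sketched below. This is not part of our workflow: the dates and index values are made up for illustration, and keeping the last observation of each quarter is an assumption chosen to mirror a quarterly closing price.

```python
import pandas as pd

# Made-up daily index levels on business days (the real values come from FRED).
dates = pd.bdate_range("2016-01-04", "2016-06-30")
daily = pd.DataFrame(
    {"Date": dates, "S&P 500 Index": range(2000, 2000 + len(dates))}
)

# Group by calendar quarter, label each group by its quarter-start date ("QS"),
# and keep the last available observation in each quarter.
quarterly = (
    daily.set_index("Date")["S&P 500 Index"]
    .resample("QS")
    .last()
    .rename_axis("Date")
    .reset_index()
)

print(quarterly)
```

The quarter-start labels (`2016-01-01`, `2016-04-01`, ...) line up with the `Date` values in the AAPL table, so an exact-date merge on the resampled table would no longer depend on a trading day falling on the first day of the quarter.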


## <u>Storage and organization (cf. Modules 4-5)</u>
##### **Select and describe a specific storage and organization strategy. This may include use of tabular, relational, or semi-structured models via filesystems or databases as well as filesystem structures and naming conventions**

We decided to store each dataset as its own CSV file (`AAPL_2015_2025.csv` and `sp500_2015_2025.csv`), which is a tabular model.
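
Because CSV is plain text, column types such as dates are not stored in the file and have to be restored on load. The sketch below shows a round trip under that assumption. The two-row frame is made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for one of the project tables.
frame = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2016-04-01", "2016-07-01"]),
        "Close": [23.90, 28.26],
    }
)

# Write to an in-memory buffer so the sketch leaves no files behind.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime type that the CSV text does not carry.
restored = pd.read_csv(buffer, parse_dates=["Date"])

print(restored.dtypes)
```

Without `parse_dates`, the `Date` column would be read back as plain strings, which matters later when the tables are joined on `Date`.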

## <u>Extraction and enrichment (cf. Module 6)</u>

## <u>Data integration (cf. Module 7-8)</u>
##### **Integration of datasets (Python/Pandas or SQL)**

In [85]:
import pandas as pd

# Load both tables, parsing Date so the join keys are real datetimes
sp500 = pd.read_csv('sp500_2015_2025.csv', parse_dates=['Date'])
df_yf = pd.read_csv('AAPL_2015_2025.csv', parse_dates=['Date'])

# Inner join: keep only the dates that appear in both tables
merged_df = pd.merge(df_yf, sp500, on='Date', how='inner')

merged_df.to_csv('AAPL_SP500_merged.csv', index=False)

merged_df



Unnamed: 0,Date,Adj_Close,Close,High,Low,Open,Volume,S&P 500 Index
0,2016-01-01,24.59767723083496,27.247499465942383,27.604999542236328,23.09749984741211,25.65250015258789,11315040800,
1,2016-04-01,21.69279670715332,23.899999618530277,28.09749984741211,22.36750030517578,27.19499969482422,10210211600,2072.78
2,2016-07-01,25.80860710144043,28.262500762939453,29.045000076293945,23.592500686645508,23.872499465942383,9135694800,2102.95
3,2018-01-01,39.3307991027832,41.94499969482422,45.875,37.560001373291016,42.540000915527344,9205205600,
4,2018-10-01,37.401893615722656,39.435001373291016,58.36750030517578,36.647499084472656,56.98749923706055,10599989600,2924.59
5,2019-01-01,45.19637680053711,47.48749923706055,49.42250061035156,35.5,38.72249984741211,7806437600,
6,2019-04-01,47.294715881347656,49.47999954223633,53.82749938964844,42.567501068115234,47.90999984741211,7043172000,2867.19
7,2019-07-01,53.725677490234375,55.99250030517578,56.60499954223633,48.14500045776367,50.79249954223633,6790001600,2964.33
8,2019-10-01,70.70809173583984,73.4124984741211,73.49250030517578,53.782501220703125,56.26750183105469,6615331600,2940.25
9,2020-01-01,61.41440963745117,63.5724983215332,81.9625015258789,53.15250015258789,74.05999755859375,12233722000,


## <u>Data quality (cf. Module 9)</u>
##### **Document data quality assessment results**

## <u>Data cleaning (cf. Module 10)</u>
##### **Describe any data cleaning methods applied (e.g., missing values, outliers, syntactic or semantic cleaning)**

## <u>Workflow automation and provenance (cf. Module 11-12)</u>
##### **Provide an automated end-to-end workflow**

## <u>Reproducibility and transparency (cf. Module 13)</u>
##### **Your project must provide sufficient information to allow someone else to reproduce your workflow and analysis**

## <u>Metadata and data documentation (cf. Module 15)</u>
##### **Metadata and data documentation to support discovery, understandability, and reuse**

## <u>Create README file</u>