
# Daily Stock Prices Dataset

This dataset provides historical daily stock prices for selected S&P 500 companies, drawn from the last three decades. It was curated to capture the trends, patterns, and fluctuations of individual stocks while reducing the influence of shared market-wide movements. It combines leading companies from the 1990 S&P 500 list with a random selection from the remainder of that list. For each of the 52 selected stocks, the dataset contains 1000 consecutive trading days of data (roughly 4 years). To keep the focus on technical analysis, the stock tickers have been replaced with randomly generated unique strings, which masks the original tickers and removes potential analyst bias.

### Dataset Composition:

1. **Historical Reference:** The dataset includes the top 10 companies from the 1990 S&P 500 list that are still in existence, supplemented by a random selection from the remaining companies, ensuring all companies included are currently operational.

2. **Random Time Window:** A random start date for each company's data was selected between the beginning of 1990 and the beginning of 2016. From each start date, 1000 consecutive trading days of data were gathered, amounting to roughly four years of trading data per stock.

3. **De-correlation Objective:** The choice of random start dates for each stock dataset prevents the ensemble from reflecting a uniform market period, avoiding highly correlated stock movements. This method aims to offer a more decorrelated and realistic picture of each stock's performance, irrespective of market trends.

4. **Anonymization of Identifiers:** To focus the analysis purely on price movements and technical indicators, company identifiers have been replaced with anonymized strings, creating a level playing field for technical analysis without preconceived biases.

5. **Dataset Structure:** The final saved dataset is structured as a CSV file, with columns for the anonymized ticker, the epoch (day number from 1 to 1000), and various stock price fields such as 'Adj Close', 'Close', 'High', 'Low', 'Open', and 'Volume'.
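
The structure described in item 5 can be checked with a short sketch like the one below. The column names are taken from the field definitions later in this notebook; the helper `check_structure` and the toy values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Column layout of the saved CSV, as defined later in this notebook.
EXPECTED_COLS = ["Masked_Ticker", "Day_Num", "Adj Close",
                 "Close", "High", "Low", "Open", "Volume"]

def check_structure(df, n_days=1000):
    """Return True if df has the expected columns and every series
    has Day_Num running from 1 to n_days in order."""
    if list(df.columns) != EXPECTED_COLS:
        return False
    for _, grp in df.groupby("Masked_Ticker"):
        if grp["Day_Num"].tolist() != list(range(1, n_days + 1)):
            return False
    return True

# Toy data (made-up values): two anonymized series of 3 days each.
toy = pd.DataFrame({
    "Masked_Ticker": ["AAAAAAAA"] * 3 + ["BBBBBBBB"] * 3,
    "Day_Num": [1, 2, 3, 1, 2, 3],
    "Adj Close": [10.0, 10.5, 10.2, 55.0, 54.1, 56.3],
    "Close": [10.1, 10.6, 10.3, 55.2, 54.3, 56.5],
    "High": [10.4, 10.9, 10.6, 55.9, 55.0, 57.0],
    "Low": [9.8, 10.2, 10.0, 54.5, 53.8, 55.7],
    "Open": [10.0, 10.3, 10.5, 55.1, 55.0, 54.4],
    "Volume": [1000, 1200, 900, 5000, 4800, 5100],
})

structure_ok = check_structure(toy, n_days=3)
```

On the real file, the same helper could be applied to the result of `pd.read_csv` with the default `n_days=1000`.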

### Advantages of the Dataset:

- **Diverse Temporal Insights:** The dataset spans various market conditions, offering insights into stock behavior during different economic cycles, including bull markets and recessions.

- **Time Series Forecasting:** With its temporal spread and de-correlation strategy, the dataset serves as an ideal benchmark for time series forecasting models that aim to predict stock price movements.

- **Technical Analysis Focus:** The anonymization of stock tickers shifts the focus entirely to the technical aspects of the stock data, making it a robust resource for analysts practicing technical analysis without influence from the companies' fundamental data.

The **S&P 500 Historical Sampler Dataset** is carefully balanced to mimic the complexities of real-world stock market dynamics and provides a comprehensive resource for advanced time series analysis and forecasting techniques. 


## Imports

In [None]:
import os
import numpy as np
import pandas as pd
import random
import string

In [None]:
dataset_name = "daily_stock_prices_full"

In [None]:
# Run the Jupyter notebook "download_stocks_data.ipynb" in this directory
# to create the following file.
input_fname = "original_downloaded_stocks_data.csv"

In [None]:
output_dir = f'./../../processed/{dataset_name}/'
os.makedirs(output_dir, exist_ok=True)
outp_fname = os.path.join(output_dir, f'{dataset_name}.csv')
outp_fig_fname = os.path.join(output_dir, f'{dataset_name}.png')

## Read selected stocks

In [None]:
stocks = pd.read_csv(input_fname)
stocks.head()

## Pick random date for each ticker

In [None]:
tickers = stocks['Ticker'].unique().tolist()

In [None]:
len(tickers)

In [None]:
mask = (stocks['Date'] >= '1990-01-01') & (stocks['Date'] <= '2016-01-01')
unique_valid_dates = stocks[mask]['Date'].unique()
random_dates = pd.Series(unique_valid_dates).sample(len(tickers), replace=True, random_state=42).tolist()
# print(random_dates)
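
The mask above compares `Date` values as strings, which only orders correctly when the dates are stored in ISO `YYYY-MM-DD` format. That is an assumption about the downloaded file, not something this notebook verifies. A minimal sketch of the same filter on parsed dates, using made-up toy dates:

```python
import pandas as pd

# Toy dates (made up) around both edges of the sampling window.
toy = pd.DataFrame({
    "Date": ["1989-12-29", "1990-01-02", "2015-12-31", "2016-01-04"]
})

# Parse to datetime so comparisons are chronological rather than lexicographic.
toy["Date"] = pd.to_datetime(toy["Date"])

window_start = pd.Timestamp("1990-01-01")
window_end = pd.Timestamp("2016-01-01")
mask = (toy["Date"] >= window_start) & (toy["Date"] <= window_end)

valid_dates = toy.loc[mask, "Date"]
```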

### Get 4 years of daily data for each ticker starting from its corresponding random date

In [None]:
stock_df = []
for t, start_date in zip(tickers, random_dates):
    subset = stocks[stocks['Ticker'] == t]
    subset = subset[subset['Date'] >= start_date]
    subset = subset.iloc[:1000]
    stock_df.append(subset)

final_data = pd.concat(stock_df)
final_data.head()

In [None]:
# Verify that all tickers have the same number of rows (expected output: 1)
final_data['Ticker'].value_counts().nunique()
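
Because `iloc[:1000]` silently returns fewer rows when a ticker has less than 1000 trading days after its start date, it may help to list which tickers fall short rather than only counting distinct lengths. The helper name `tickers_with_short_history` and the toy frame below are hypothetical; this is a sketch, not part of the original pipeline.

```python
import pandas as pd

def tickers_with_short_history(df, n_rows=1000, ticker_col="Ticker"):
    """Return the sorted list of tickers whose row count differs from n_rows."""
    counts = df[ticker_col].value_counts()
    return sorted(counts[counts != n_rows].index.tolist())

# Toy example (made up): ticker "B" has only 2 of the 3 expected rows.
toy = pd.DataFrame({"Ticker": ["A"] * 3 + ["B"] * 2})
short = tickers_with_short_history(toy, n_rows=3)
```

In this notebook the call would be `tickers_with_short_history(final_data)`, and an empty list would indicate that every ticker has the full 1000 rows.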

## Anonymize Stock Tickers

In [None]:
random.seed(0)

unique_tickers = final_data['Ticker'].unique()

# Generate a random string of fixed length (8 uppercase letters by default)
def generate_random_string(length=8):
    return ''.join(random.choices(string.ascii_uppercase, k=length))

# Create a mapping dictionary from original tickers to random strings
ticker_mapping = {ticker: generate_random_string() for ticker in unique_tickers}

# Ensure uniqueness of the random strings, if not regenerate
while len(set(ticker_mapping.values())) < len(ticker_mapping):
    for ticker in ticker_mapping:
        ticker_mapping[ticker] = generate_random_string()

# Create a column with masked tickers 
final_data['Masked_Ticker'] = final_data['Ticker'].map(ticker_mapping)

print(final_data.head())
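
The loop above regenerates every string whenever a collision occurs. An alternative is to guarantee uniqueness while generating, as in the sketch below. The function `make_unique_codes` and the toy tickers are hypothetical, and the resulting strings will differ from those produced by the cell above.

```python
import random
import string

def make_unique_codes(n, length=8, seed=0):
    """Generate n distinct random uppercase strings, in generation order."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    codes, seen = [], set()
    while len(codes) < n:
        code = ''.join(rng.choices(string.ascii_uppercase, k=length))
        if code not in seen:   # skip duplicates instead of restarting
            seen.add(code)
            codes.append(code)
    return codes

# Toy tickers (made up) mapped to anonymized strings.
toy_tickers = ["AAA", "BBB", "CCC"]
mapping = dict(zip(toy_tickers, make_unique_codes(len(toy_tickers))))
```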

## Define Fields in Data

In [None]:
series_col = 'Masked_Ticker'
epoch_col = 'Day_Num'
time_col = 'Date'
value_col = 'Adj Close'
exog_cols = ['Close', 'High', 'Low', 'Open', 'Volume']

## Add the Time Field (if Missing)

In [None]:
# Although the 'Date' field exists, each stock covers a different date range.
# We want to treat time as aligned across stocks, so we create an integer
# field called 'Day_Num' that starts at 1 and increments by one per row
# (trading day) of data. Every ticker therefore runs from 1 to 1000.
if epoch_col not in final_data.columns:
    final_data[epoch_col]=-1
    unique_series = final_data[series_col].unique().tolist()
    for s in unique_series:
        idx = final_data[series_col] == s
        final_data.loc[idx, epoch_col] = np.arange(sum(idx)) + 1
final_data.head()
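
The per-series loop above can also be written with a grouped cumulative count. The sketch below uses a made-up toy frame and assumes the rows are sorted by date within each series, which is why it sorts first.

```python
import pandas as pd

# Toy data (made up): two series covering different calendar periods.
toy = pd.DataFrame({
    "Masked_Ticker": ["X", "X", "X", "Y", "Y"],
    "Date": ["2001-01-02", "2001-01-03", "2001-01-04",
             "2010-06-01", "2010-06-02"],
})

# Sort so the count follows chronological order within each series.
toy = toy.sort_values(["Masked_Ticker", "Date"])

# cumcount() numbers rows 0, 1, 2, ... within each group; add 1 to start at 1.
toy["Day_Num"] = toy.groupby("Masked_Ticker").cumcount() + 1
```

Both approaches should give the same `Day_Num` values when the rows are already in date order; the grouped version avoids one boolean mask per series.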

## Save processed data

In [None]:
all_cols = [series_col, epoch_col, value_col] + exog_cols    
final_data.sort_values(by=[series_col, epoch_col], inplace=True)
final_data[all_cols].to_csv(outp_fname, index=False)

In [None]:
final_data[all_cols].head()
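
After saving, a quick round-trip read can confirm that the file has the intended columns and row count. The sketch below writes a made-up toy frame to a temporary directory, so it does not touch the real output path.

```python
import os
import tempfile
import pandas as pd

# Toy frame (made-up values) with a subset of the saved columns.
toy = pd.DataFrame({
    "Masked_Ticker": ["AAAAAAAA", "AAAAAAAA", "BBBBBBBB", "BBBBBBBB"],
    "Day_Num": [1, 2, 1, 2],
    "Adj Close": [10.0, 10.5, 55.0, 54.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(path, index=False)   # same call pattern as the save cell above
    reloaded = pd.read_csv(path)

# Compare layout rather than exact float values.
same_columns = list(reloaded.columns) == list(toy.columns)
same_length = len(reloaded) == len(toy)
```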