## Data Preparation for Linear Regression
This notebook prepares the data for a linear regression analysis. The workflow is organized into tiers:

### Bronze Tier
The Bronze Tier contains the raw tables, written as loaded with no transformations.

### Silver Tier
The Silver Tier contains curated, cleaned, and joined data.

#### Table `encoded_train_df`:  
Numerical variables: `total_daily_sales`, `days_since_earliest_date`, `transactions`, `onpromotion`.  

Categorical variables: `store_nbr`, `city`, `state`, `type`, `cluster`, `day_of_week`, `day_of_month`, `month`, `year`.  
All categorical variables are one-hot encoded using prefix `is_<varname>_` e.g. `is_state__Ohio` (note the double underscore).
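
The double underscore comes from pandas itself: `pd.get_dummies` joins the prefix and the category value with `prefix_sep`, which defaults to `_`, so a prefix that already ends in an underscore yields two. A minimal sketch on a made-up two-row frame (the values are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Toy frame; the state values are illustrative, not taken from the dataset.
toy = pd.DataFrame({"state": ["Ohio", "Texas"], "sales": [1.0, 2.0]})

# The prefix ends with "_" and the default prefix_sep is also "_",
# hence the double underscore in the resulting column names.
encoded = pd.get_dummies(toy, columns=["state"], dtype=int, prefix="is_state_")
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Passing `prefix_sep=""` would give a single underscore instead, but any downstream code that expects the `is_<varname>__<value>` pattern would then need to change as well.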


This tier is designed for analysis and modeling, providing a structured dataset with relevant features for linear regression.

In [0]:
# imports
import os
from datetime import datetime

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql import DataFrame as SparkDataFrame

In [0]:
VOLUME_ROOT_PATH = "/Volumes/cscie103_catalog/final_project/data"
# place where raw csvs land after download
VOLUME_TARGET_DIR = f"{VOLUME_ROOT_PATH}/raw"
# raw data
VOLUME_BRONZE_DIR = f"{VOLUME_ROOT_PATH}/bronze"
# place where prepared data is written
VOLUME_SILVER_DIR = f"{VOLUME_ROOT_PATH}/silver"
# place where final data is written
VOLUME_GOLD_DIR = f"{VOLUME_ROOT_PATH}/gold"

# ensure all paths exist (exist_ok=True makes this a no-op for existing directories)
for path in [VOLUME_TARGET_DIR, VOLUME_BRONZE_DIR, VOLUME_SILVER_DIR, VOLUME_GOLD_DIR]:
    os.makedirs(path, exist_ok=True)

In [0]:
# load the data from local volumes
filenames = {
    'holidays_events': 'holidays_events.csv',
    'oil': 'oil.csv',
    'sample_submission': 'sample_submission.csv',
    'stores': 'stores.csv',
    'test': 'test.csv',
    'train': 'train.csv',
    'transactions': 'transactions.csv'
}

# holidays_events_df = spark.read.csv(f"{VOLUME_TARGET_DIR}/{filenames.get('holidays_events')}", header=True, inferSchema=True)
# oil_df = spark.read.csv(f"{VOLUME_TARGET_DIR}/{filenames.get('oil')}", header=True, inferSchema=True)
stores_df = spark.read.csv(f"{VOLUME_TARGET_DIR}/{filenames.get('stores')}", header=True, inferSchema=True)
transactions_df = spark.read.csv(f"{VOLUME_TARGET_DIR}/{filenames.get('transactions')}", header=True, inferSchema=True)
train_df = spark.read.csv(f"{VOLUME_TARGET_DIR}/{filenames.get('train')}", header=True, inferSchema=True)

test_df = spark.read.csv(f"{VOLUME_TARGET_DIR}/{filenames.get('test')}", header=True, inferSchema=True)

## Bronze Tier

In [0]:
# write all dfs as they are into bronze
for df, name in zip([stores_df, transactions_df, train_df, test_df], ['stores', 'transactions', 'train', 'test']):
  df.write.mode("overwrite").parquet(f"{VOLUME_BRONZE_DIR}/{name}")

## Silver Tier

Produce & persist table:  

| Column | Type |
|---|---|
| `store_nbr` | int |
| `date` | date |
| `id` | int |
| `family` | string |
| `sales` | double |
| `onpromotion` | int |
| `transactions` | int |
| `city` | string |
| `state` | string |
| `type` | string |
| `cluster` | int |

In [0]:
# prepare: train_df
def smart_na_drop(df):
    """
    Drops all rows with any null values in columns.
    """
    before = df.count()
    df = df.dropna()
    after = df.count()
    print(f"dropped {before - after} rows")
    return df

In [0]:
train_df = smart_na_drop(train_df)
transactions_df = smart_na_drop(transactions_df)
stores_df = smart_na_drop(stores_df)

In [0]:
%skip
# This shows the rows which are dropped in the cell below after merging with transactions.
# A separate name is used so the test_df loaded earlier is not overwritten.
preview_df = train_df.join(transactions_df, on=['date', 'store_nbr'], how='left')
preview_df = preview_df.withColumn(
    'transactions',
    F.when(F.col('sales') == 0, 0).otherwise(F.col('transactions'))
)
# show only rows where nulls are present
preview_df.where(F.col('transactions').isNull()).show()

In [0]:
# 1. Merge with transactions data
#       .a Fill transactions as 0 when a row's sales is 0 (applied per family row, before aggregation)
#       .b Drop rows where any column is null
# 2. Aggregate daily sales across all product families per store_nbr into total_daily_sales
# 3. Merge with stores_df

strain_df = train_df

# 1. Merge with transactions data
strain_df = strain_df.join(transactions_df, on=['date', 'store_nbr'], how='left')
strain_df = strain_df.withColumn(
    'transactions',
    F.when(F.col('sales') == 0, 0).otherwise(F.col('transactions'))
)
strain_df = smart_na_drop(strain_df) # expected to drop 3248 rows

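# 2. Aggregate daily sales across all product families per store_nbr into total_daily_sales
strain_df = strain_df.groupBy('date', 'store_nbr').agg(
    F.sum('sales').alias('total_daily_sales'),
    F.sum('onpromotion').alias('onpromotion'),
    # transactions is a store-day value repeated on every family row after the join,
    # so take the max instead of the sum to avoid counting it once per family
    F.max('transactions').alias('transactions')
)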

# 3. Merge with stores_df
strain_df = strain_df.join(stores_df, ['store_nbr'], how='left')
strain_df = smart_na_drop(strain_df) # expected to drop 0 rows
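
One detail worth checking in step 2: `transactions` is recorded once per store and day, but after the join it is repeated on every product-family row. The toy pandas sketch below (made-up numbers, hypothetical family names) compares `sum` and `max` over such rows. It only illustrates the aggregation behaviour and says nothing about what the modelling step expects.

```python
import pandas as pd

# One store-day with three family rows; the store-day had 100 transactions.
rows = pd.DataFrame({
    "date": ["2013-01-02"] * 3,
    "store_nbr": [1, 1, 1],
    "family": ["A", "B", "C"],         # hypothetical family names
    "sales": [10.0, 0.0, 5.0],
    "transactions": [100, 100, 100],   # same store-day value on every row
})

# Mirror the notebook: zero out transactions where the family had no sales.
rows.loc[rows["sales"] == 0, "transactions"] = 0

agg = rows.groupby(["date", "store_nbr"]).agg(
    total_daily_sales=("sales", "sum"),
    transactions_sum=("transactions", "sum"),
    transactions_max=("transactions", "max"),
).reset_index()
print(agg)
```

In this toy case the sum scales with the number of families that had non-zero sales, while the max recovers the original store-day figure.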

In [0]:
# Schema of strain_df
strain_df.printSchema()

strain_df.display()

In [0]:
# 1. Add columns day_of_week, day_of_month, month, year
# 2. Add column days_since_earliest_date as number of days since earliest date
# 3. Drop date column

# 1. Add columns day_of_week, day_of_month, month, year
strain_df = strain_df.withColumn('day_of_week', F.dayofweek(F.col('date')))
strain_df = strain_df.withColumn('day_of_month', F.dayofmonth(F.col('date')))
strain_df = strain_df.withColumn('month', F.month(F.col('date')))
strain_df = strain_df.withColumn('year', F.year(F.col('date')))

# 2. Add column days_since_earliest_date
earliest_date = strain_df.select(F.min('date')).collect()[0][0] # Normally is 2013-01-01
strain_df = strain_df.withColumn(
    'days_since_earliest_date',
    F.datediff(F.col('date'), F.lit(earliest_date))
)

# 3. Drop date column
strain_df = strain_df.drop('date')
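
Spark's `F.dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which differs from Python's `date.isoweekday()` (1 = Monday) and pandas' `dt.dayofweek` (0 = Monday). The standard-library sketch below reproduces the same calendar features and the day offset for a single date, assuming the earliest date is 2013-01-01. `date_features` is a hypothetical helper for illustration and is not part of the notebook.

```python
from datetime import date

EARLIEST = date(2013, 1, 1)  # assumed earliest date in the training data


def date_features(d: date, earliest: date = EARLIEST) -> dict:
    """Return the calendar features derived in the cell above for one date."""
    return {
        # isoweekday(): Mon=1 .. Sun=7; Spark dayofweek: Sun=1 .. Sat=7
        "day_of_week": d.isoweekday() % 7 + 1,
        "day_of_month": d.day,
        "month": d.month,
        "year": d.year,
        # same idea as F.datediff(date, earliest_date)
        "days_since_earliest_date": (d - earliest).days,
    }


features = date_features(date(2013, 1, 6))  # a Sunday
print(features)
```

Keeping the numbering convention in mind matters once `day_of_week` is one-hot encoded, because a column such as `is_day_of_week__1` then refers to Sunday, not Monday.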

In [0]:
# Preparation of the data for Linear Regression
# 1. One-hot encode all categorical
#  variables: store_nbr, city, state, type, cluster, day_of_week, day_of_month, month, year

setrain_df = strain_df.toPandas()

# 1. One-hot encode all categorical variables: store_nbr, city, state, type, cluster, day_of_week, day_of_month, month, year
for colname in ['store_nbr', 'city', 'state', 'type', 'cluster', 'day_of_week', 'day_of_month', 'month', 'year']:
    setrain_df = pd.get_dummies(
        setrain_df,
        columns=[colname],
        dtype=int,
        prefix=f'is_{colname}_'
    )
display(setrain_df)
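
Because every level of each categorical variable gets its own column, the dummies for one variable always sum to 1, which is collinear with an intercept term. Whether that matters depends on the downstream model: a regularized fit tolerates it, while an unregularized ordinary least squares fit may not. If it does matter, `pd.get_dummies` accepts `drop_first=True` to omit one reference level per variable. A sketch on toy data with made-up `type` values:

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "type": ["A", "B", "C", "A"],
    "total_daily_sales": [1.0, 2.0, 3.0, 4.0],
})

# Full encoding, as in the notebook: one column per level.
full = pd.get_dummies(toy, columns=["type"], dtype=int, prefix="is_type_")

# Reduced encoding: the first level ("A") becomes the implicit reference.
reduced = pd.get_dummies(
    toy, columns=["type"], dtype=int, prefix="is_type_", drop_first=True
)

full_cols = [c for c in full.columns if c.startswith("is_type_")]
reduced_cols = [c for c in reduced.columns if c.startswith("is_type_")]
row_sums = full[full_cols].sum(axis=1).tolist()
print(full_cols, reduced_cols, row_sums)
```

With the reduced encoding, a row where all `is_type__*` columns are 0 represents the dropped reference level.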

In [0]:
# 1. Convert setrain_df back to spark dataframe
# 2. Write setrain_df to silver table

# 1. Convert setrain_df back to spark dataframe
setrain_df = spark.createDataFrame(setrain_df)

# 2. Write setrain_df to silver table
setrain_df.write.mode("overwrite").parquet(f"{VOLUME_SILVER_DIR}/encoded_train_df")