# Spare Parts Inventory Project: Raw Data Import

This notebook loads the raw data files required for the **Spare Parts Inventory Optimization** project. These datasets are derived from the M5 Forecasting competition dataset and adapted for spare parts planning.

### Files Loaded

| File | Description |
|------|-------------|
| `calendar.csv` | Provides the calendar mapping for each `d_` column in the sales dataset, including event info and dates. |
| `sales_train_validation.csv` | Contains unit sales data for each product in each store for each day. Used to generate demand forecasts. |
| `sell_prices.csv` | Contains product price history by store and week. Useful for calculating revenue and analyzing price trends. |

These datasets form the foundation for demand forecasting, price trend analysis, and inventory policy simulation in subsequent steps.


In [1]:
import pandas as pd

# Load the three raw files (absolute paths are specific to the author's machine).
calendar = pd.read_csv(r"C:\Users\priya\SparePartsInventory\raw_data\calendar.csv")
sales = pd.read_csv(r"C:\Users\priya\SparePartsInventory\raw_data\sales_train_validation.csv")
sell_prices = pd.read_csv(r"C:\Users\priya\SparePartsInventory\raw_data\sell_prices.csv")
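
The absolute Windows paths above only resolve on the author's machine. A minimal sketch of a more portable loader follows. The `RAW_DIR` folder, the `RAW_FILES` mapping, and the `load_raw` helper are all hypothetical names introduced for illustration and are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location of the raw files; adjust to your own project layout.
RAW_DIR = Path("raw_data")

# File names as listed in the "Files Loaded" table.
RAW_FILES = {
    "calendar": "calendar.csv",
    "sales": "sales_train_validation.csv",
    "sell_prices": "sell_prices.csv",
}


def load_raw(raw_dir=RAW_DIR):
    """Read each raw CSV into a DataFrame, keyed by a short name."""
    raw_dir = Path(raw_dir)
    return {name: pd.read_csv(raw_dir / fname) for name, fname in RAW_FILES.items()}
```

With this in place, `frames = load_raw()` followed by `frames["calendar"]` would stand in for the three separate `read_csv` calls, and only `RAW_DIR` needs to change between machines.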

In [2]:
calendar.info()
calendar.isnull().sum()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1969 entries, 0 to 1968
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          1969 non-null   object
 1   wm_yr_wk      1969 non-null   int64 
 2   weekday       1969 non-null   object
 3   wday          1969 non-null   int64 
 4   month         1969 non-null   int64 
 5   year          1969 non-null   int64 
 6   d             1969 non-null   object
 7   event_name_1  162 non-null    object
 8   event_type_1  162 non-null    object
 9   event_name_2  5 non-null      object
 10  event_type_2  5 non-null      object
 11  snap_CA       1969 non-null   int64 
 12  snap_TX       1969 non-null   int64 
 13  snap_WI       1969 non-null   int64 
dtypes: int64(7), object(7)
memory usage: 215.5+ KB


date               0
wm_yr_wk           0
weekday            0
wday               0
month              0
year               0
d                  0
event_name_1    1807
event_type_1    1807
event_name_2    1964
event_type_2    1964
snap_CA            0
snap_TX            0
snap_WI            0
dtype: int64

In [3]:

sales.info()
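# Note: this re-checks `calendar`, so the null counts below repeat cell 2;
# `sales.isnull().sum()` was probably intended here.
calendar.isnull().sum()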


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30490 entries, 0 to 30489
Columns: 1919 entries, id to d_1913
dtypes: int64(1913), object(6)
memory usage: 446.4+ MB


date               0
wm_yr_wk           0
weekday            0
wday               0
month              0
year               0
d                  0
event_name_1    1807
event_type_1    1807
event_name_2    1964
event_type_2    1964
snap_CA            0
snap_TX            0
snap_WI            0
dtype: int64

In [4]:
sell_prices.info()
sell_prices.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6841121 entries, 0 to 6841120
Data columns (total 4 columns):
 #   Column      Dtype  
---  ------      -----  
 0   store_id    object 
 1   item_id     object 
 2   wm_yr_wk    int64  
 3   sell_price  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 208.8+ MB


store_id      0
item_id       0
wm_yr_wk      0
sell_price    0
dtype: int64
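
The column listings above suggest how the three tables link: `calendar` maps each day label `d` to a week id `wm_yr_wk`, and `sell_prices` is keyed by `store_id`, `item_id`, and `wm_yr_wk`. The sketch below illustrates that join on toy data. The item id, prices, and unit counts are made up, and the `units_sold` column assumes the sales table has already been reshaped to one row per day.

```python
import pandas as pd

# Toy stand-ins that reuse the key columns of the real files.
calendar_toy = pd.DataFrame({
    "d": ["d_1", "d_2", "d_8"],
    "wm_yr_wk": [11101, 11101, 11102],
})
prices_toy = pd.DataFrame({
    "store_id": ["CA_1", "CA_1"],
    "item_id": ["ITEM_X", "ITEM_X"],  # hypothetical item id
    "wm_yr_wk": [11101, 11102],
    "sell_price": [2.50, 2.75],       # made-up prices
})
sales_toy = pd.DataFrame({
    "store_id": ["CA_1"] * 3,
    "item_id": ["ITEM_X"] * 3,
    "d": ["d_1", "d_2", "d_8"],
    "units_sold": [2, 0, 4],          # made-up daily demand
})

# Day -> week via the calendar, then week -> price via sell_prices.
daily = (
    sales_toy
    .merge(calendar_toy, on="d", how="left")
    .merge(prices_toy, on=["store_id", "item_id", "wm_yr_wk"], how="left")
)
daily["revenue"] = daily["units_sold"] * daily["sell_price"]
```

Using `how="left"` keeps every sales row even when no price is recorded for that week, in which case `sell_price` and `revenue` come out as `NaN`.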

# Calendar Feature Engineering

The calendar dataset has been enriched with key time-based and event-based features to support time-series forecasting and inventory pattern detection.

### Engineered Features

| Feature | Description |
|--------|-------------|
| `is_event` | Binary flag: 1 if `event_name_1` is set for the day. |
| `is_weekend` | 1 on Saturdays and Sundays. |
| `snap_flag` | 1 if SNAP purchases are allowed that day in at least one of the three states (CA, TX, WI). |
| `week_of_month` | Week index within the month (1–5), computed from the day number: days 1–7 are week 1, days 8–14 are week 2, and so on. |
| `season` | Mapped from month to one of Winter, Spring, Summer, Fall. |
| `day_of_year` | Ordinal day in the year (1–365/366). |
| `is_payday` | Flags typical paydays (1st, 15th, 30th, and 31st of the month). |
| `is_working_Day` | 1 on days that are neither a weekend nor an event day. |
| `event_in_3days`, `event_in_7days` | 1 if an event falls exactly 3 (or 7) days after the current day. |
| `is_month_start`, `is_month_end` | Marks first and last day of month. |

These features enhance temporal awareness for downstream demand forecasting models.
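
Two of these features are easy to misread, so here is a small worked sketch on a toy calendar. It shows the `week_of_month` arithmetic and how a `shift(-3)` lookahead flags an event exactly three days ahead. The `event_within_3days` column is a hypothetical alternative, not something the notebook computes, that flags an event on any of the next three days. The event date is made up.

```python
import pandas as pd

# Toy calendar: 10 consecutive days with a single (made-up) event on the last day.
toy = pd.DataFrame({"date": pd.date_range("2011-01-29", periods=10, freq="D")})
toy["is_event"] = 0
toy.loc[toy.index[-1], "is_event"] = 1

# Week of month from the day number: days 1-7 -> 1, 8-14 -> 2, ..., 29-31 -> 5.
toy["week_of_month"] = (toy["date"].dt.day - 1) // 7 + 1

# shift(-3) looks exactly three rows (here: days) ahead.
toy["event_in_3days"] = toy["is_event"].shift(-3).fillna(0).astype(int)

# Hypothetical alternative: is there an event on ANY of the next three days?
next_three = pd.concat([toy["is_event"].shift(-k) for k in range(1, 4)], axis=1)
toy["event_within_3days"] = next_three.max(axis=1).fillna(0).astype(int)
```

In this toy frame `event_in_3days` is set on a single row (three days before the event), while `event_within_3days` is set on the three rows leading up to it. Which behavior is preferable depends on how the downstream model is meant to anticipate events.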


In [5]:
# Parse dates so the .dt accessors below work.
calendar["date"] = pd.to_datetime(calendar["date"])

# Event, weekend, and SNAP flags (1 = true, 0 = false).
calendar["is_event"] = calendar["event_name_1"].notnull().astype(int)
calendar["is_weekend"] = calendar["weekday"].isin(["Saturday", "Sunday"]).astype(int)
calendar["snap_flag"] = calendar[["snap_CA", "snap_TX", "snap_WI"]].max(axis=1)

# Position within the month and year.
calendar["week_of_month"] = (calendar["date"].dt.day - 1) // 7 + 1
calendar["season"] = calendar["month"].map({12: "Winter", 1: "Winter", 2: "Winter",
                                            3: "Spring", 4: "Spring", 5: "Spring",
                                            6: "Summer", 7: "Summer", 8: "Summer",
                                            9: "Fall", 10: "Fall", 11: "Fall"})
calendar["day_of_year"] = calendar["date"].dt.dayofyear
calendar["is_payday"] = calendar["date"].dt.day.isin([1, 15, 30, 31]).astype(int)
calendar["is_working_Day"] = ((calendar["is_weekend"] == 0) & (calendar["is_event"] == 0)).astype(int)

# Lookahead flags: is there an event exactly 3 (or 7) rows, i.e. days, ahead?
calendar["event_in_3days"] = calendar["is_event"].shift(-3).fillna(0).astype(int)
calendar["event_in_7days"] = calendar["is_event"].shift(-7).fillna(0).astype(int)

calendar["is_month_start"] = calendar["date"].dt.is_month_start.astype(int)
calendar["is_month_end"] = calendar["date"].dt.is_month_end.astype(int)
calendar.head(10)

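| | date | wm_yr_wk | weekday | wday | month | year | d | event_name_1 | event_type_1 | event_name_2 | ... | snap_flag | week_of_month | season | day_of_year | is_payday | is_working_Day | event_in_3days | event_in_7days | is_month_start | is_month_end |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-29 | 11101 | Saturday | 1 | 1 | 2011 | d_1 | NaN | NaN | NaN | ... | 0 | 5 | Winter | 29 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2011-01-30 | 11101 | Sunday | 2 | 1 | 2011 | d_2 | NaN | NaN | NaN | ... | 0 | 5 | Winter | 30 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 2011-01-31 | 11101 | Monday | 3 | 1 | 2011 | d_3 | NaN | NaN | NaN | ... | 0 | 5 | Winter | 31 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | 2011-02-01 | 11101 | Tuesday | 4 | 2 | 2011 | d_4 | NaN | NaN | NaN | ... | 1 | 1 | Winter | 32 | 1 | 1 | 0 | 0 | 1 | 0 |
| 4 | 2011-02-02 | 11101 | Wednesday | 5 | 2 | 2011 | d_5 | NaN | NaN | NaN | ... | 1 | 1 | Winter | 33 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 2011-02-03 | 11101 | Thursday | 6 | 2 | 2011 | d_6 | NaN | NaN | NaN | ... | 1 | 1 | Winter | 34 | 0 | 1 | 1 | 0 | 0 | 0 |
| 6 | 2011-02-04 | 11101 | Friday | 7 | 2 | 2011 | d_7 | NaN | NaN | NaN | ... | 1 | 1 | Winter | 35 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 2011-02-05 | 11102 | Saturday | 1 | 2 | 2011 | d_8 | NaN | NaN | NaN | ... | 1 | 1 | Winter | 36 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 2011-02-06 | 11102 | Sunday | 2 | 2 | 2011 | d_9 | SuperBowl | Sporting | NaN | ... | 1 | 1 | Winter | 37 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 2011-02-07 | 11102 | Monday | 3 | 2 | 2011 | d_10 | NaN | NaN | NaN | ... | 1 | 1 | Winter | 38 | 0 | 1 | 0 | 1 | 0 | 0 |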


# Sales Data Sampling (1%)

To speed up experimentation and reduce memory load, a **1% sample** of the long-format sales data (`Sales_merged.csv`) was taken within each day (the `d` column), so the sample is stratified by day. This ensures:

- Every day contributes the same share of rows, so the temporal distribution remains intact.
- Approximately 583,000 rows are retained in total (1% of the 30,490 × 1,913 ≈ 58.3M rows in the long table).
- The sample is reproducible, because `random_state=42` fixes the random seed.

This subset is ideal for prototyping models and visualizations.
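
The per-day sampling described above can be sketched on a toy frame. This version loops over the groups and concatenates the pieces instead of calling `groupby.apply`, since recent pandas versions emit a deprecation warning when `apply` operates on the grouping column. The frame size and the `item_*` ids are made up for illustration.

```python
import pandas as pd

# Toy long-format frame: 3 days x 200 series, same series order within each day.
n_series = 200
days = ["d_1", "d_2", "d_3"]
toy = pd.DataFrame({
    "d": [d for d in days for _ in range(n_series)],
    "id": [f"item_{i}" for _ in days for i in range(n_series)],
})

# 1% of the rows from each day, seeded the same way as in the notebook.
sampled = pd.concat(
    [grp.sample(frac=0.01, random_state=42) for _, grp in toy.groupby("d")]
)

rows_per_day = sampled.groupby("d").size()           # 2 rows for each of the 3 days
ids_per_day = sampled.groupby("d")["id"].apply(set)  # same id set on every day
```

One side effect worth knowing: because the same seed is reused for every day, and every day here has the same number of rows in the same order, the same row positions (and hence the same series) are drawn each day. If the real long table keeps a consistent series order within each day, as a melted table typically does, the 1% sample would likely follow a fixed subset of series through time rather than a fresh draw per day.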


In [1]:
import pandas as pd

df = pd.read_csv(r'C:\Users\priya\SparePartsInventory\data\Sales_merged.csv')

# 1% sample per date = ~580k total
sampled_df = df.groupby('d', group_keys=False).apply(lambda x: x.sample(frac=0.01, random_state=42))



In [2]:
# index=False avoids writing the row index as an extra unnamed column.
sampled_df.to_csv('Sampled_sales.csv', index=False)

# Sales Data Transformation: Wide to Long Format

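The original `sales` dataset is in **wide format**: each column from `d_1` to `d_1913` holds the daily unit sales for one item-store combination.

To make the data analysis-ready, especially for time series modeling or merging with the calendar, the day columns are melted into **long format**. Each row then holds one item-store-day observation, with the day label in `d` and the quantity in `units_sold`: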



In [6]:
sales_melted = sales.melt(
    id_vars=['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'],
    var_name='d',
    value_name='units_sold'
)
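
As a small illustration of what the reshape does and why it helps with the calendar merge, here is a sketch on a toy wide table. The ids, dates, and values are made up, and only two of the six id columns are used to keep it short.

```python
import pandas as pd

# Toy wide table: two item-store series, three day columns.
wide = pd.DataFrame({
    "id": ["A_CA_1", "B_CA_1"],
    "store_id": ["CA_1", "CA_1"],
    "d_1": [3, 0],
    "d_2": [1, 2],
    "d_3": [0, 5],
})

# One row per (series, day): day labels go to `d`, quantities to `units_sold`.
long = wide.melt(id_vars=["id", "store_id"], var_name="d", value_name="units_sold")

# Toy calendar keyed on the same `d` labels.
cal = pd.DataFrame({
    "d": ["d_1", "d_2", "d_3"],
    "date": pd.to_datetime(["2011-01-29", "2011-01-30", "2011-01-31"]),
})

# Attach calendar dates to each sales row.
merged = long.merge(cal, on="d", how="left")
```

The two-row wide table becomes six long rows (2 series × 3 days), and the shared `d` key lets each row pick up its calendar date and, by extension, any engineered calendar feature.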

In [9]:
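# index=False keeps the row index out of the file, so reading it back adds no "Unnamed: 0" column.
sales_melted.to_csv("Sales_merged.csv", index=False)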

In [10]:
calendar.to_csv("Calendar_new.csv", index=False)