# 🚩 Problem Definition

The objective of this project is to build a sales forecasting model that predicts the sales of product families at Favorita stores in Ecuador. Using the historical sales data, along with the provided features such as store information, promotional data, oil prices, and holidays/events, the model should accurately forecast the sales for the given test dates.

The sales forecasting model will assist Favorita in optimizing inventory management, planning promotions, and meeting customer demand effectively. Accurate sales predictions can help in making informed business decisions, ensuring optimal stock levels, and maximizing profitability.

# 💾 Data Loading

### Data:
1. **train.csv**: Sales time series data with features like `store_nbr`, `family`, and `onpromotion`.
2. **test.csv**: Same features as `train.csv` but without the `sales` target; used to predict sales for the provided dates.
3. **sample_submission.csv**: Sample submission file format for predictions.
4. **stores.csv**: Store metadata including `city`, `state`, `type`, and `cluster` grouping.
5. **oil.csv**: Daily oil prices, important for Ecuador's oil-dependent economy.
6. **holidays_events.csv**: Information on holidays and events, including transferred dates and bridge days.

### How to?
1. Import `data_preprocessing.py` and call its `load_data` function, which returns one DataFrame per CSV file listed above (the sample submission excluded).
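
The real `load_data` lives in `data_preprocessing.py` and is not shown in this notebook. As a rough sketch of what such a loader might look like, the version below assumes all CSV files sit in a single directory; the `data_dir` parameter and its default value are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_data(data_dir="data"):
    """Read the competition CSVs and return them as five DataFrames.

    Assumption: `data_dir` and its default are illustrative only.
    """
    data_dir = Path(data_dir)
    names = ["train", "test", "stores", "oil", "holidays_events"]
    # Order: train, test, stores, oil, holidays (matches the unpacking in In [1]).
    return tuple(pd.read_csv(data_dir / f"{name}.csv") for name in names)
```

Returning a tuple in a fixed order keeps the call site short, at the cost of the caller having to remember that order.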
 


In [1]:
import data_preprocessing as dp
import util as ut

train_data, test_data, stores_data, oil_data, holidays_data = dp.load_data()


In [2]:
train_data.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0


In [3]:
test_data.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0


In [4]:
stores_data.head()

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [5]:
oil_data.head()

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2


In [6]:
holidays_data.head()

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


# 🧭 Exploration

### What to do?
1. Understand the structure and contents of your data by examining the DataFrame.
2. Use descriptive methods and functions like `head()`, `info()`, and `describe()` to get an overview of the data.
3. Check for missing values using functions like `isnull()` or `isna()` and handle them if necessary.
4. Perform data transformations and cleaning, such as removing unnecessary columns or converting data types.
5. Visualize the data using plots, histograms, box plots, scatter plots, or any other relevant visualizations to gain insights into the data distribution, patterns, and relationships.
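
Steps 3 and 4 can be tried on a tiny stand-in before touching the full data. The sketch below rebuilds the first three rows of the `oil_data` preview shown earlier (the first price is missing) and is only an illustration, not the project's cleaning code.

```python
import pandas as pd

# Tiny stand-in for oil.csv, copied from the oil_data.head() preview.
oil = pd.DataFrame(
    {
        "date": ["2013-01-01", "2013-01-02", "2013-01-03"],
        "dcoilwtico": [float("nan"), 93.14, 92.97],
    }
)

# Step 3: count missing values per column.
missing = oil.isna().sum()
print(missing)

# Step 4: dates are read from CSV as plain strings, so convert them to datetime.
oil["date"] = pd.to_datetime(oil["date"])
print(oil.dtypes)
```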

### How to?
1. Clean the datasets. Define the cleaning functions in the `data_preprocessing` Python script and keep complex code out of this notebook so that it stays readable.
2. After cleaning a dataset, save it with `df.to_csv()` inside a folder named `data_clean`.
Note: the folder MUST be named `data_clean`; otherwise git will not ignore it and will try to upload the cleaned dataset to GitHub.
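
One possible way to handle the save step is a small helper like the one below. The name `save_clean` and its parameters are hypothetical and not part of the existing `data_preprocessing` module.

```python
from pathlib import Path

import pandas as pd


def save_clean(df, name, out_dir="data_clean"):
    """Write `df` to `<out_dir>/<name>.csv` and return the file path.

    Hypothetical helper; the default folder name follows the note above.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    target = out_path / f"{name}.csv"
    # index=False avoids writing the row index as an extra unnamed column.
    df.to_csv(target, index=False)
    return target
```

A call such as `save_clean(train_data, "train")` would then write `data_clean/train.csv`.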


### 🏋️ Training Dataset

**train.csv**: This file contains time series data of sales and features such as `store_nbr`, `family`, and `onpromotion`. The target variable is `sales`, the total sales for a product family at a particular store on a given date.

To do:
1. 🧹 Clean the dataset.
2. 🔎 Analysis: Identify the most important features, then retain or list them.

In [7]:
# Viewing the data set
train_data

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0
1,1,2013-01-01,1,BABY CARE,0.000,0
2,2,2013-01-01,1,BEAUTY,0.000,0
3,3,2013-01-01,1,BEVERAGES,0.000,0
4,4,2013-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [8]:
train_data.isnull().values.any()

False

The data contains no null values, so we can proceed to visualising what the `train.csv` data shows.

In [9]:
train_data.describe()

Unnamed: 0,id,store_nbr,sales,onpromotion
count,3000888.0,3000888.0,3000888.0,3000888.0
mean,1500444.0,27.5,357.7757,2.60277
std,866281.9,15.58579,1101.998,12.21888
min,0.0,1.0,0.0,0.0
25%,750221.8,14.0,0.0,0.0
50%,1500444.0,27.5,11.0,0.0
75%,2250665.0,41.0,195.8473,0.0
max,3000887.0,54.0,124717.0,741.0


In [10]:
train_data['onpromotion'].unique()

array([  0,   3,   5,   1,  56,  20,  19,   2,   4,  18,  17,  12,   6,
         7,  10,   9,  50,   8,  16,  42,  51,  13,  15,  47,  21,  40,
        37,  54,  24,  58,  22,  59,  11,  45,  25,  55,  26,  43,  35,
        14,  28,  46,  36,  32,  53,  57,  27,  39,  41,  30,  29,  49,
        23,  48,  44,  38,  31,  52,  33,  34,  61,  60, 116,  86,  73,
       113, 102,  68, 104,  93,  70,  92, 121,  72, 178, 174, 161, 118,
       105, 172, 163, 167, 142, 154, 133, 180, 181, 173, 165, 168, 186,
       140, 149, 145, 169, 188,  62,  84, 111,  65, 107,  63, 101,  87,
       125,  94, 114, 171, 153, 170, 166, 141, 155, 179, 192, 131, 147,
       151, 189,  79,  74, 110,  64,  67,  99, 123, 157, 117, 150, 182,
       162, 160, 194, 135, 190,  69, 108,  89, 126, 156, 103, 146, 132,
       177, 164, 176, 112,  75, 109,  91, 128, 175, 187, 148, 137, 184,
       196, 144, 158, 119, 106,  66, 100,  90, 120, 115,  98, 159, 152,
       185, 139, 143,  80, 124,  71, 134, 193,  78,  88, 122, 13

In [11]:

dp.assess_data(train_data)


['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion']
{'id': dtype('int64'), 'date': dtype('O'), 'store_nbr': dtype('int64'), 'family': dtype('O'), 'sales': dtype('float64'), 'onpromotion': dtype('int64')}
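
The output above suggests that `dp.assess_data` prints the column names followed by a dict of column dtypes. The real function is not shown here; a minimal sketch that would produce the same two lines, assuming it does nothing else, might be:

```python
import pandas as pd


def assess_data(df):
    """Print the column names, then a dict mapping each column to its dtype.

    Sketch only: the behaviour is inferred from the printed output above.
    """
    print(list(df.columns))
    print(df.dtypes.to_dict())
```

Note that `date` is reported as `dtype('O')`, meaning it is still stored as strings rather than datetimes.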


# 🔧 Feature Engineering

1. Encapsulate data preprocessing tasks in `data_preprocessing.py`.
2. Handle missing values by imputation or removal based on analysis.
3. Perform feature engineering, such as transforming dates, encoding categorical variables, and creating additional relevant features.
4. Apply any necessary data transformations, such as scaling or normalization.
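
The steps above could be sketched roughly as follows. All three function names are hypothetical, and the specific choices (calendar parts, one-hot encoding, linear interpolation for oil prices) are assumptions for illustration rather than decisions the project has already made.

```python
import pandas as pd


def add_date_features(df):
    """Return a copy of `df` with calendar features derived from `date`."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    out["day_of_week"] = out["date"].dt.dayofweek  # Monday=0 ... Sunday=6
    return out


def encode_family(df):
    """Return a copy of `df` with the `family` column one-hot encoded."""
    return pd.get_dummies(df, columns=["family"], prefix="family")


def fill_oil_prices(oil):
    """Fill gaps in `dcoilwtico` by linear interpolation.

    Interpolation cannot fill a leading gap (such as the missing
    2013-01-01 price), so the start is back-filled afterwards.
    """
    out = oil.copy()
    out["dcoilwtico"] = out["dcoilwtico"].interpolate().bfill()
    return out
```

One-hot encoding is simple but adds one column per product family; tree-based models may work just as well with integer category codes.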

# ⚙️ Model Selection & Training 

1. Encapsulate model training functionality in `model_training.py`.
2. Select a suitable forecasting model based on the problem requirements.
3. Implement functions or classes to handle model selection, hyperparameter tuning, and training on the training data.
4. Save the trained model for later use.
5. Encapsulate model evaluation functionality in `model_evaluation.py`.
6. Implement functions or classes to evaluate the performance of the trained model using metrics like MAE, RMSE, or others.
7. Generate evaluation reports, visualizations, or any additional analysis to assess the model's accuracy and quality.
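
For the evaluation step, the two metrics named above are straightforward to implement with NumPy. This is a sketch of what `model_evaluation.py` could contain; the function names are assumptions.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

RMSE penalises large misses more heavily than MAE, which matters here because `sales` ranges from 0 to over 100,000 in the training data.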

# 🔮 Model Prediction

1. Encapsulate prediction functionality in `model_prediction.py`.
2. Create functions or classes to load the trained model and make predictions on the test data or unseen data.
3. Format the predictions and prepare them for further analysis or visualization.
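
A possible shape for these helpers is sketched below. The function names are hypothetical, `pickle` is just one way to persist a model, and the `id`/`sales` output columns are an assumption: the layout of `sample_submission.csv` is not shown in this notebook, so check it before relying on this.

```python
import pickle

import pandas as pd


def save_model(model, path):
    """Serialise a trained model to `path` with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model previously written by `save_model`."""
    with open(path, "rb") as f:
        return pickle.load(f)


def format_predictions(test_df, predictions):
    """Pair each test `id` with its predicted `sales`.

    Assumption: negative predictions are clipped to zero, since sales
    cannot be negative.
    """
    out = pd.DataFrame({"id": test_df["id"].to_numpy(), "sales": predictions})
    out["sales"] = out["sales"].clip(lower=0)
    return out
```

Only unpickle model files you created yourself, since loading a pickle can execute arbitrary code.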

# 🚀 Deployment 
1. Deploy using Streamlit.