# Meal Demand Forecasting

Background:
The client is a meal delivery company that operates in multiple cities, with fulfillment centers in each city that dispatch meal orders to customers. The client wants demand forecasts for the upcoming weeks so that each center can plan its stock of raw materials accordingly.

Most raw materials are replenished weekly, and because they are perishable, procurement planning is critical. Accurate demand forecasts also help with staffing the centers.

__Objective:__ predict the demand for the next 10 weeks (weeks 146-155) for the center-meal combinations in the test set, using the following information:
- Historical demand for each product-center combination (weeks 1 to 145)
- Product (meal) features such as category, sub-category, current price and discount
- Fulfillment center information such as center area and city

Data:
https://www.kaggle.com/sureshmecad/meal-demand-forecasting

In [None]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Import datasets

### Product (Meal) categories

In [None]:
#load Product features
meal_info = pd.read_csv("../input/food-demand-forecasting/meal_info.csv")
print(meal_info.shape)
meal_info.head()

In [None]:
#verify no null values
meal_info.info()

### Fulfillment center information

In [None]:
fulfilment_center = pd.read_csv('../input/food-demand-forecasting/fulfilment_center_info.csv')
print(fulfilment_center.shape)
fulfilment_center.head()

In [None]:
fulfilment_center.info()

### Product demand historical data

In [None]:
#import train set
raw_train = pd.read_csv('../input/food-demand-forecasting/train.csv')
print(raw_train.shape)
raw_train.head()

In [None]:
raw_train.info()

In [None]:
raw_test = pd.read_csv('../input/food-demand-forecasting/test.csv')
print(raw_test.shape)
raw_test.head()

In [None]:
raw_test.info()

## Data Wrangling

Merge the relational tables on their IDs to obtain train and test datasets with the features needed for the multivariate model, handling any missing or incorrect values.
Data type errors are corrected along the way. Some numeric features (city_code, region_code) are treated as categorical variables, because they identify a location rather than measure a quantity.
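
As a minimal sketch of this step, the cell below merges tiny made-up tables, checks for missing values and casts the location codes to a categorical dtype. The frames `orders`, `centers` and `meals` and all of their values are invented for illustration; only the column names mirror the ones used in this notebook. The `validate="many_to_one"` argument is an optional safeguard that makes pandas raise an error if a lookup table contains duplicate keys.

```python
import pandas as pd

# Tiny synthetic stand-ins for the real tables (all values are made up)
orders = pd.DataFrame({
    "id": [1, 2, 3],
    "center_id": [10, 10, 11],
    "meal_id": [100, 101, 100],
    "num_orders": [40, 13, 27],
})
centers = pd.DataFrame({
    "center_id": [10, 11],
    "city_code": [590, 526],
    "region_code": [56, 34],
})
meals = pd.DataFrame({
    "meal_id": [100, 101],
    "category": ["Beverages", "Pasta"],
})

# Left joins keep every order row; validate raises MergeError on duplicate lookup keys
merged = orders.merge(centers, on="center_id", how="left", validate="many_to_one")
merged = merged.merge(meals, on="meal_id", how="left", validate="many_to_one")

# Location codes are labels, not quantities, so store them as categories
merged = merged.astype({"city_code": "category", "region_code": "category"})

# Count missing values per column (non-zero counts would indicate unmatched keys)
missing_per_column = merged.isna().sum()
print(missing_per_column)
```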

In [None]:
# Merge datasets on their shared ID columns
test = pd.merge(raw_test, fulfilment_center, on='center_id', how='left')
test = pd.merge(test, meal_info, on='meal_id', how='left')

#change type of incorrectly classified features
#test[['city_code','region_code']] = test[['city_code','region_code']].astype('object')
#test.set_index('id', inplace=True)

print(test.shape)
test.head()

In [None]:
# Merge datasets on their shared ID columns
train = pd.merge(raw_train, fulfilment_center, on='center_id', how='left')
train = pd.merge(train, meal_info, on='meal_id', how='left')

#change type of incorrectly classified features
#train[['city_code','region_code']] = train[['city_code','region_code']].astype('object')
#train.set_index('id', inplace=True)

print(train.shape)
train.head()

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
append = pd.concat([test, train])
append.shape

## Explore data

In [None]:
fig, ax = plt.subplots(1,2, figsize=(14,6))
fig.suptitle('Meal info distribution')
sns.histplot(ax = ax[0], data=train[['category','cuisine']],x='category', hue='cuisine', multiple='stack').set_title("Train set")
sns.histplot(ax = ax[1], data=test[['category','cuisine']],x='category', hue='cuisine', multiple='stack').set_title("Test set")

# Use a separate loop variable so the axes array is not shadowed
for axis in ax:
    axis.tick_params(axis='x', labelrotation=90)
plt.show()

In [None]:
fig, ax = plt.subplots(1,2, figsize=(14,6))
fig.suptitle('Fulfilment center sampled region and center types')
sns.countplot(ax = ax[0], data=train[['region_code','center_type']],x='region_code', hue='center_type').set_title("Train set")
sns.countplot(ax = ax[1], data=test[['region_code','center_type']],x='region_code', hue='center_type').set_title("Test set")

# Use a separate loop variable so the axes array is not shadowed
for axis in ax:
    axis.tick_params(axis='x', labelrotation=90)
plt.show()

In [None]:
plt.figure(figsize=(13,7))
sns.lineplot(data=append, x='week', y='num_orders').set_title('Historic demand')
plt.show()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
append['center_type'] = le.fit_transform(append['center_type'])
plt.figure(figsize=(15,10))
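# Restrict to numeric columns: corr() raises on string columns in pandas 2.x
sns.heatmap(append.select_dtypes('number').corr(), cbar=True, annot=True, square=True, fmt='.2f')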
plt.show()

In the linear correlation heatmap, the operation area (size in m2), the prices and the promotions show the strongest linear relationship with the number of orders. Note that correlation only measures linear association and does not establish a causal effect.
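
When the heatmap is crowded, it can help to rank the features by the absolute value of their correlation with the target. The cell below is only a sketch on randomly generated data: the frame `demo`, its coefficients and the column names `checkout_price`, `op_area` and `emailer_for_promotion` are assumptions made for illustration and are not results from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data; coefficients are arbitrary and only serve the illustration
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "checkout_price": rng.uniform(100, 500, n),
    "op_area": rng.uniform(1, 7, n),
    "emailer_for_promotion": rng.integers(0, 2, n),
})
demo["num_orders"] = (
    600
    - demo["checkout_price"]
    + 50 * demo["op_area"]
    + 100 * demo["emailer_for_promotion"]
    + rng.normal(0, 20, n)
)

# Pearson correlation of every numeric feature with the target, strongest first
corr_with_target = (
    demo.select_dtypes("number")
    .corr()["num_orders"]
    .drop("num_orders")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```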

## Model Development
- One hot encoding for categorical variables
- Normalization
- Model computing
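
The three steps above can also be combined into a single scikit-learn pipeline. The sketch below is an alternative arrangement, not the approach used in the cells that follow: it uses a `ColumnTransformer` so that the encoder is fitted once on the training data and reused for prediction, and `handle_unknown="ignore"` so that categories unseen during fitting are encoded as all zeros. The frame `toy_X`, the target `toy_y` and their values are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that mimic a few of the notebook's columns
toy_X = pd.DataFrame({
    "center_type": ["TYPE_A", "TYPE_B", "TYPE_A", "TYPE_C", "TYPE_B", "TYPE_A"],
    "cuisine": ["Thai", "Indian", "Italian", "Thai", "Indian", "Italian"],
    "checkout_price": [136.8, 290.0, 184.4, 339.5, 243.5, 251.2],
})
toy_y = pd.Series([177, 54, 189, 40, 28, 270])

categorical = ["center_type", "cuisine"]
numeric = ["checkout_price"]

# One-hot encode the categorical columns and standardize the numeric ones
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numeric),
])

# Preprocessing and model are fitted together, so the same encoding is reused at predict time
model = make_pipeline(preprocess, RandomForestRegressor(n_estimators=20, random_state=0))
model.fit(toy_X, toy_y)

# TYPE_D never appeared during fitting; it is encoded as all zeros instead of raising
unseen = pd.DataFrame({
    "center_type": ["TYPE_D"],
    "cuisine": ["Thai"],
    "checkout_price": [200.0],
})
prediction = model.predict(unseen)
print(prediction)
```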

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import metrics 

In [None]:
train

### One hot encoding

In [None]:
features_to_encode = ['center_type', 'category', 'cuisine']

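def one_hot_encode(features_to_encode, dataset):
    # Note: the encoder is fitted on the dataset passed in, so train and test
    # only get identical columns if they contain the same category values.
    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(dataset[features_to_encode]).toarray()

    # get_feature_names() was removed from scikit-learn; use get_feature_names_out().
    # Reuse the dataset index so the encoded columns align row by row.
    encoded_cols = pd.DataFrame(encoded,
                                columns=encoder.get_feature_names_out(features_to_encode),
                                index=dataset.index)
    dataset = dataset.drop(columns=features_to_encode)
    return pd.concat([dataset, encoded_cols], axis=1)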

In [None]:
OH_train = one_hot_encode(features_to_encode, train)
OH_train.set_index('id', inplace=True)
OH_train_y = OH_train['num_orders']
OH_train_X = OH_train.drop(columns='num_orders')
OH_train

In [None]:
OH_test = one_hot_encode(features_to_encode, test)
OH_test.set_index('id', inplace=True)
OH_test

In [None]:
from sklearn.model_selection import train_test_split
# Fix the seed so the random 70/30 split is reproducible between runs
X_train, X_val, y_train, y_val = train_test_split(OH_train_X, OH_train_y, test_size=0.30, random_state=42)

### Ensemble Model: RandomForestRegressor

In [None]:
RF_pipe = make_pipeline(StandardScaler(),RandomForestRegressor())
RF_pipe.fit(X_train, y_train)
RF_train_y_pred = RF_pipe.predict(X_val)
print(RF_pipe.score(X_val, y_val))
# The printed value is the RMSLE multiplied by 100
print('100 * RMSLE:', 100*np.sqrt(metrics.mean_squared_log_error(y_val, RF_train_y_pred)))

### Linear Models: Stochastic Gradient Descent

In [None]:
# make pipeline
SGD_pipe = make_pipeline(StandardScaler(),SGDRegressor())
SGD_pipe.fit(X_train, y_train)
SGD_train_y_pred = SGD_pipe.predict(X_val)
print(SGD_pipe.score(X_val, y_val))
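# Left disabled: mean_squared_log_error raises an error if any prediction is negative,
# which a linear model can produce
#print('100 * RMSLE:', 100*np.sqrt(metrics.mean_squared_log_error(y_val, SGD_train_y_pred)))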

### Decision Tree: Regressor

In [None]:
DT_pipe = make_pipeline(StandardScaler(),DecisionTreeRegressor())
DT_pipe.fit(X_train, y_train)
DT_train_y_pred = DT_pipe.predict(X_val)
print(DT_pipe.score(X_val, y_val))
# The printed value is the RMSLE multiplied by 100
print('100 * RMSLE:', 100*np.sqrt(metrics.mean_squared_log_error(y_val, DT_train_y_pred)))

### Tune hyperparameters for best fitting model. Cross Validate.
Tune the hyperparameters of the best-performing model with a cross-validated grid search, then refit it on the whole training dataset.
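
As a small, self-contained sketch of this idea, the cell below runs a grid search on random data and scores each candidate with the mean squared log error, so that model selection uses the same family of metric as the RMSLE reported above. The arrays `X` and `y`, the grid values and the choice of `scoring` are assumptions for illustration; the notebook's own search uses the estimator's default score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with a strictly positive target (required by the log error)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 4))
y = np.round(50 + 200 * X[:, 0] + rng.uniform(0, 10, 120))

# Small illustrative grid; 1.0 means "use all features" at each split
param_grid = {"max_features": [1.0, "log2"], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_log_error",  # higher (closer to zero) is better
)
search.fit(X, y)  # refit=True by default: the best candidate is refitted on all of X, y

# Square root of the mean cross-validated MSLE of the best candidate
best_rmsle = np.sqrt(-search.best_score_)
print(search.best_params_, best_rmsle)
```

Because `refit` defaults to `True`, `search.predict` uses the best candidate retrained on the full data, which matches the "fit to whole training dataset" step described above.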

In [None]:
for keys in RF_pipe.get_params().items():
    print(keys[0],": ",keys[1])

In [None]:
# 'auto' was removed from scikit-learn; 1.0 (all features) is its equivalent for regressors
hyperparameters = {'randomforestregressor__max_features' : [1.0,'log2'],
                  'randomforestregressor__max_depth' : [None]}

RF_grid_search = GridSearchCV(RF_pipe, hyperparameters, cv=2, verbose=1)

#Fit and tune model
RF_grid_search.fit(OH_train_X, OH_train_y)
print (RF_grid_search.best_params_)
print (RF_grid_search.refit)

### Forecast test data with trained, optimized model

In [None]:
OH_test_y_pred = RF_grid_search.predict(OH_test)

submission = pd.DataFrame(data=OH_test_y_pred, index=OH_test.index, columns = ['num_orders'])
submission.reset_index(inplace=True)
# index=False avoids writing the row counter as an extra unnamed column
submission.to_csv('./submission.csv', index=False)
submission