In [None]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**PROBLEM STATEMENT**

Your client is a meal delivery company that operates in multiple cities. It has several fulfillment centers in these cities for dispatching meal orders to customers. The client wants help forecasting demand for the upcoming weeks so that these centers can plan their stock of raw materials accordingly.

Most raw materials are replenished weekly, and because they are perishable, procurement planning is of utmost importance. Accurate demand forecasts also help with staffing the centers. Given the following information, the task is to predict demand for the next 10 weeks (weeks 146-155) for the center-meal combinations in the test set:

- Historical demand data for each product-center combination (weeks 1 to 145)
- Product (meal) features such as category, sub-category, current price and discount
- Fulfillment center information such as center area and city

**Data Dictionary**
 

**Weekly demand data (train.csv): contains the historical demand data for all centers. test.csv contains all of the following features except the target variable.**

| Variable | Definition |
|---|---|
| id | Unique ID |
| week | Week No |
| center_id | Unique ID for fulfillment center |
| meal_id | Unique ID for Meal |
| checkout_price | Final price including discount, taxes & delivery charges |
| base_price | Base price of the meal |
| emailer_for_promotion | Emailer sent for promotion of meal |
| homepage_featured | Meal featured at homepage |
| num_orders | (Target) Orders Count |
   

**fulfilment_center_info.csv: Contains information for each fulfilment center**
 

| Variable | Definition |
|---|---|
| center_id | Unique ID for fulfillment center |
| city_code | Unique code for city |
| region_code | Unique code for region |
| center_type | Anonymized center type |
| op_area | Area of operation (in km^2) |
 

**meal_info.csv: Contains information for each meal being served**
 

| Variable | Definition |
|---|---|
| meal_id | Unique ID for the meal |
| category | Type of meal (beverages/snacks/soups/…) |
| cuisine | Meal cuisine (Indian/Italian/…) |
 

**Evaluation Metric**

The evaluation metric for this competition is 100*RMSLE where RMSLE is Root of Mean Squared Logarithmic Error across all entries in the test set.
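
To make the metric concrete, here is a minimal sketch of 100*RMSLE written directly with NumPy. It assumes the usual definition based on `log(1 + x)`, which is also what `sklearn.metrics.mean_squared_log_error` uses; the helper name `rmsle_100` and the sample numbers are illustrative, not part of the competition material.

```python
import numpy as np

def rmsle_100(y_true, y_pred):
    """Return 100 * RMSLE, clipping negative predictions to zero first."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    # log1p(x) = log(1 + x), which keeps zero order counts well defined
    squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return 100 * np.sqrt(squared_log_error.mean())

# A perfect forecast scores 0; errors are measured on a relative (log) scale
perfect = rmsle_100([10, 100, 1000], [10, 100, 1000])
off = rmsle_100([10, 100], [20, 50])
print(perfect, off)
```

Because the error is taken on the log scale, doubling a small count and halving a large one are penalized by similar amounts.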

**Importing necessary libraries**

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# display all columns of the dataframe
pd.options.display.max_columns = None

# display all rows of the dataframe
pd.options.display.max_rows = None
 
# display float values to 6 decimal places
pd.options.display.float_format = '{:.6f}'.format

# import train-test split 
from sklearn.model_selection import train_test_split

# import StandardScaler to perform scaling
from sklearn.preprocessing import StandardScaler 

plt.rcParams['figure.figsize'] = [15,8]

# import the sklearn utility used later for hyperparameter tuning
# (classifier imports are not needed: this is a regression task)
from sklearn.model_selection import GridSearchCV

**Importing the datasets**

In [None]:
train = pd.read_csv("../input/food-demand-forecasting/train.csv")
test = pd.read_csv("../input/food-demand-forecasting/test.csv")

In [None]:
train.head()

In [None]:
test.head()

In [None]:
meal_info = pd.read_csv("../input/food-demand-forecasting/meal_info.csv")
center_info = pd.read_csv("../input/food-demand-forecasting/fulfilment_center_info.csv")

**Let us now see the number of variables and observations in the train data.**

In [None]:
train.shape

*The data has 456548 observations and 9 variables.*

In [None]:
train.info()

In [None]:
test.info()

**Checking Null values**

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
train['num_orders'].describe()

**Merging dataframes**

In [None]:
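# left joins keep exactly one row per training record; an outer join could add
# mostly-NaN rows for meals or centers that never appear in train
trainfinal = pd.merge(train, meal_info, on="meal_id", how="left")
trainfinal = pd.merge(trainfinal, center_info, on="center_id", how="left")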
trainfinal.head()
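
The choice of join type matters when a lookup table contains keys that the fact table does not. The sketch below uses made-up toy frames (`orders` and `meals` are hypothetical stand-ins, not the competition files) to show how `how="left"` and `how="outer"` differ, and how the `validate` argument of `pd.merge` can guard against duplicated keys in the lookup table.

```python
import pandas as pd

# Toy stand-ins for train.csv and meal_info.csv (values are made up)
orders = pd.DataFrame({"id": [1, 2, 3],
                       "meal_id": [10, 10, 20],
                       "num_orders": [5, 7, 9]})
meals = pd.DataFrame({"meal_id": [10, 20, 30],
                      "category": ["Soup", "Pizza", "Salad"]})

# how="left" keeps exactly the rows of `orders`;
# validate raises MergeError if meal_id is duplicated in `meals`
merged = pd.merge(orders, meals, on="meal_id", how="left",
                  validate="many_to_one")

# how="outer" also keeps meal 30, which has no order and therefore a NaN id
outer = pd.merge(orders, meals, on="meal_id", how="outer")

print(len(orders), len(merged), len(outer))
```

Comparing the row count before and after a merge is a cheap check that no records were added or lost.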

In [None]:
trainfinal = trainfinal.drop(['center_id', 'meal_id'], axis=1)
trainfinal.head()

In [None]:
cols = trainfinal.columns.tolist()
print(cols)

In [None]:
trainfinal = trainfinal[cols]

**Label Encoding required features**

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# use a separate encoder per column so each fitted mapping stays available
lb1 = LabelEncoder()
trainfinal['center_type'] = lb1.fit_transform(trainfinal['center_type'])

lb2 = LabelEncoder()
trainfinal['category'] = lb2.fit_transform(trainfinal['category'])

lb3 = LabelEncoder()
trainfinal['cuisine'] = lb3.fit_transform(trainfinal['cuisine'])

In [None]:
trainfinal.head()
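
`LabelEncoder` assigns integer codes in sorted order of the values it sees during `fit`. The sketch below, on made-up category values, illustrates why a fitted encoder is usually reused with `transform` on new data: refitting on a column that lacks some categories can shift the codes. The variable names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up category values for illustration
train_col = pd.Series(["TYPE_B", "TYPE_A", "TYPE_C", "TYPE_A"])
test_col = pd.Series(["TYPE_C", "TYPE_A"])

enc = LabelEncoder()
train_codes = enc.fit_transform(train_col)   # learns the sorted classes
test_codes = enc.transform(test_col)         # reuses the same mapping

# classes_ holds the learned categories; the position is the integer code
mapping = {label: code for code, label in enumerate(enc.classes_)}

# Refitting on the test column alone yields different codes here,
# because TYPE_B is absent and TYPE_C moves from code 2 to code 1
refit_codes = LabelEncoder().fit_transform(test_col)

print(mapping, list(test_codes), list(refit_codes))
```

When train and test contain exactly the same set of categories, refitting produces identical codes, so the difference only shows up when a category is missing from one of them.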

In [None]:
trainfinal.shape

**(456548,13) is the shape of the final train dataset.**

**Checking correlation**

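Selecting the 7 features most positively correlated with the target. `nlargest(8, 'num_orders')` returns 8 columns, but the first is `num_orders` itself (correlation 1), which is dropped before modelling.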

In [None]:
trainfinal2 = trainfinal.drop(['id'], axis=1)
correlation = trainfinal2.corr(method='pearson')
columns = correlation.nlargest(8, 'num_orders').index
columns
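
Note that `nlargest` on the raw Pearson coefficients ranks by signed value, so a strongly negative predictor ranks below an uncorrelated one. As a hedged alternative, the sketch below ranks by absolute correlation on synthetic data; the column names and coefficients are invented for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
x_pos = rng.normal(size=n)
x_neg = rng.normal(size=n)
x_noise = rng.normal(size=n)

# Synthetic target: rises with x_pos, falls with x_neg, ignores x_noise
toy = pd.DataFrame({
    "x_pos": x_pos,
    "x_neg": x_neg,
    "x_noise": x_noise,
    "target": 2 * x_pos - 3 * x_neg + rng.normal(scale=0.1, size=n),
})

corr_with_target = toy.corr(method="pearson")["target"].drop("target")

# Signed ranking misses x_neg even though it is the strongest predictor
by_signed = corr_with_target.nlargest(2).index.tolist()
# Ranking by magnitude keeps strong negative relationships
by_abs = corr_with_target.abs().nlargest(2).index.tolist()

print(by_signed, by_abs)
```

Whether magnitude-based selection helps on this dataset is an open question; it is shown here only to make the behaviour of `nlargest` explicit.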

In [None]:
correlation_map = np.corrcoef(trainfinal2[columns].values.T)
sns.set(font_scale=1.0)
heatmap = sns.heatmap(correlation_map, cbar=True, annot=True, square=True, fmt='.2f', yticklabels=columns.values, xticklabels=columns.values)
plt.show()

**Splitting the dataset into 70% training and 30% validation.**

In [None]:
features = columns.drop(['num_orders'])
trainfinal3 = trainfinal[features]
X = trainfinal3.values
y = trainfinal['num_orders'].values

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.30)
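
Since the task is to forecast future weeks, one alternative to a random split is to hold out the most recent weeks for validation. The sketch below shows the idea on a toy frame; the `cutoff` value and the data are assumptions made for illustration, not a choice taken from this notebook.

```python
import pandas as pd

# Toy frame with a week column, mimicking the structure of the train data
toy = pd.DataFrame({
    "week": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "num_orders": [40, 55, 42, 60, 38, 58, 45, 62, 41, 59],
})

cutoff = 4  # hypothetical: weeks >= cutoff are held out for validation

train_part = toy[toy["week"] < cutoff]
val_part = toy[toy["week"] >= cutoff]

print(len(train_part), len(val_part))
```

With this kind of split the model is always validated on weeks later than any it was trained on, which mirrors how the test set is constructed.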

In [None]:
trainfinal3.head()

**Importing libraries for model building**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor

## XGBoost Regressor Model

In [None]:
from xgboost import XGBRegressor
XG = XGBRegressor()
XG.fit(X_train, y_train)
y_pred = XG.predict(X_val) 
y_pred[y_pred<0] = 0 
from sklearn import metrics 
print('RMSLE:', 100*np.sqrt(metrics.mean_squared_log_error(y_val, y_pred)))

## Linear Regression

In [None]:
LR = LinearRegression()
LR.fit(X_train, y_train) 
y_pred = LR.predict(X_val) 
y_pred[y_pred<0] = 0 
from sklearn import metrics 
print('RMSLE:', 100*np.sqrt(metrics.mean_squared_log_error(y_val, y_pred)))

## Decision Tree Regressor

In [None]:
DT = DecisionTreeRegressor()
DT.fit(X_train, y_train)
y_pred = DT.predict(X_val)
y_pred[y_pred<0] = 0
from sklearn import metrics
print('RMSLE:', 100*np.sqrt(metrics.mean_squared_log_error(y_val, y_pred)))

## K Neighbors Regressor

In [None]:
KNN = KNeighborsRegressor()
KNN.fit(X_train, y_train)
y_pred = KNN.predict(X_val)
y_pred[y_pred<0] = 0
from sklearn import metrics
print('RMSLE:', 100*np.sqrt(metrics.mean_squared_log_error(y_val, y_pred)))

In [None]:
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV, cross_val_score

## Hyperparameter Tuning on Decision Tree Regressor Model

In [None]:
param_grid = { "min_samples_split": [2, 4, 8, 16], "min_samples_leaf": [1, 2, 3, 4], "max_leaf_nodes": [None, 10, 20, 100] }
grid_cv_dtm = GridSearchCV(DT, param_grid, cv=5)
grid_cv_dtm.fit(X_train, y_train)

In [None]:
print("R-Squared::{}".format(grid_cv_dtm.best_score_))
print("Best Hyperparameters::\n{}".format(grid_cv_dtm.best_params_))
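
By default `GridSearchCV` scores a regressor with its `score` method, which is R², so the reported best score is not the competition metric. A possible variation, sketched here on synthetic data, is to pass `scoring="neg_mean_squared_log_error"` so the search ranks candidates by squared log error instead. The data, the small grid, and the names `X_toy`, `y_toy`, and `search` are all made up for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 3))
# Synthetic non-negative "order counts" driven mainly by the first feature
y_toy = np.round(50 + 20 * X_toy[:, 0] + rng.uniform(0, 10, size=300))

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"min_samples_leaf": [1, 4, 16], "max_leaf_nodes": [None, 20]},
    scoring="neg_mean_squared_log_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X_toy, y_toy)

# best_score_ is the mean negative MSLE across folds; convert to 100 * root
best_rmsle_100 = 100 * np.sqrt(-search.best_score_)
print(search.best_params_, best_rmsle_100)
```

The scorer is negated because scikit-learn always maximizes the score, so the value closest to zero wins.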

In [None]:
df = pd.DataFrame(data=grid_cv_dtm.cv_results_)
df.head()

In [None]:
predicted = grid_cv_dtm.best_estimator_.predict(X)
residuals = y.flatten()-predicted

fig, ax = plt.subplots()
ax.scatter(y.flatten(), residuals)
ax.axhline(lw=2,color='black')
ax.set_xlabel('Observed')
ax.set_ylabel('Residual')
plt.show()

In [None]:
grid_cv_dtm.best_estimator_.fit(X_train, y_train)
y_pred = grid_cv_dtm.best_estimator_.predict(X_val)
y_pred[y_pred<0] = 0
from sklearn import metrics
print('RMSLE:', 100*np.sqrt(metrics.mean_squared_log_error(y_val, y_pred)))

**The RMSLE was the same before and after tuning.**

In [None]:
# left joins keep exactly one row per test record, so every submission row has an id
testfinal = pd.merge(test, meal_info, on="meal_id", how="left")
testfinal = pd.merge(testfinal, center_info, on="center_id", how="left")
testfinal = testfinal.drop(['meal_id', 'center_id'], axis=1)
testcols = testfinal.columns.tolist()
print(testcols)

In [None]:
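# use a separate encoder per column; LabelEncoder assigns codes in sorted order,
# so these match the training codes only if test contains the same set of categories
lb1 = LabelEncoder()
testfinal['center_type'] = lb1.fit_transform(testfinal['center_type'])

lb2 = LabelEncoder()
testfinal['category'] = lb2.fit_transform(testfinal['category'])

lb3 = LabelEncoder()
testfinal['cuisine'] = lb3.fit_transform(testfinal['cuisine'])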

testfinal.head()

In [None]:
X_test = testfinal[features].values
X_test

In [None]:
pred = DT.predict(X_test)
pred[pred<0] = 0
submission = pd.DataFrame({
    'id' : testfinal['id'],
    'num_orders' : pred})

In [None]:
submission.to_csv("submission.csv", index=False)

In [None]:
submission.head()