<a href="https://colab.research.google.com/github/sidharth-ds/Demand-Forecasting-in-Store-project/blob/main/store_Orders_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Forecasting the number of orders for a product helps a business decide how much to invest in marketing it.
* To predict the number of orders a company may receive for a particular product, you need historical data on the orders it has received.
* The dataset used here contains sales data for supplements and was collected from Kaggle.

Features:
* Product ID
* Store ID
* The type of store where the supplement was sold
* The type of location the order was received from
* Sales Date
* Region code
* Whether it is a public holiday or not at the time of order
* Whether the product was on discount or not
* Number of orders placed
* Sales

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px

In [None]:
data = pd.read_csv("/content/order prediction.csv")
data.head()

Unnamed: 0,ID,Store_id,Store_Type,Location_Type,Region_Code,Date,Holiday,Discount,#Order,Sales
0,T1000001,1,S1,L3,R1,2018-01-01,1,Yes,9,7011.84
1,T1000002,253,S4,L2,R1,2018-01-01,1,Yes,60,51789.12
2,T1000003,252,S3,L2,R1,2018-01-01,1,Yes,42,36868.2
3,T1000004,251,S2,L3,R1,2018-01-01,1,Yes,23,19715.16
4,T1000005,250,S2,L3,R4,2018-01-01,1,Yes,62,45614.52


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188340 entries, 0 to 188339
Data columns (total 10 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ID             188340 non-null  object 
 1   Store_id       188340 non-null  int64  
 2   Store_Type     188340 non-null  object 
 3   Location_Type  188340 non-null  object 
 4   Region_Code    188340 non-null  object 
 5   Date           188340 non-null  object 
 6   Holiday        188340 non-null  int64  
 7   Discount       188340 non-null  object 
 8   #Order         188340 non-null  int64  
 9   Sales          188340 non-null  float64
dtypes: float64(1), int64(3), object(6)
memory usage: 14.4+ MB


In [None]:
data.isnull().sum()

ID               0
Store_id         0
Store_Type       0
Location_Type    0
Region_Code      0
Date             0
Holiday          0
Discount         0
#Order           0
Sales            0
dtype: int64

In [None]:
data.describe()

Unnamed: 0,Store_id,Holiday,#Order,Sales
count,188340.0,188340.0,188340.0,188340.0
mean,183.0,0.131783,68.205692,42784.327982
std,105.366308,0.338256,30.467415,18456.708302
min,1.0,0.0,0.0,0.0
25%,92.0,0.0,48.0,30426.0
50%,183.0,0.0,63.0,39678.0
75%,274.0,0.0,82.0,51909.0
max,365.0,1.0,371.0,247215.0


### EDA:

In [None]:
pie = data["Store_Type"].value_counts()
store = pie.index
orders = pie.values

fig = px.pie(data, values=orders, names=store, width=700, height=400,
             title="distribution of the number of orders received according to the store type:")
fig.show()

* Around 50% of the records come from store type S1 (the chart counts rows per store type, not the sum of #Order).
* About 25% come from store type S4.
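
Because `value_counts()` counts rows rather than adding up the `#Order` column, the share of rows and the share of orders can differ. The following is a minimal sketch on a made-up toy frame (the values are illustrative and not taken from the Kaggle file) showing both calculations side by side:

```python
import pandas as pd

# Toy data (hypothetical values, not from the dataset).
toy = pd.DataFrame({
    "Store_Type": ["S1", "S1", "S1", "S4"],
    "#Order": [10, 10, 10, 70],
})

# Share of rows per store type: this is what value_counts() measures.
row_share = toy["Store_Type"].value_counts(normalize=True)

# Share of total orders per store type: sum #Order within each group.
order_totals = toy.groupby("Store_Type")["#Order"].sum()
order_share = order_totals / order_totals.sum()

print(row_share)
print(order_share)
```

In this toy case S1 has most of the rows while S4 has most of the orders, so the choice of aggregation changes what the pie chart says.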

In [None]:
pie2 = data["Location_Type"].value_counts()
location = pie2.index
orders = pie2.values

fig = px.pie(data, values=orders, names=location,  width=650, height=400,
             title="distribution of the number of orders received according to the location:")
fig.show()

* About 45% of the records come from location type L1.
* Location type L4 has the fewest records.

In [None]:
pie3 = data["Discount"].value_counts()
discount = pie3.index
orders = pie3.values

fig = px.pie(data, values=orders, names=discount,  width=500, height=400,
             title='Purchased with discount:')
fig.show()

* According to the above figure, about 55% of the records are sales without a discount, so supplements sell even when no discount is offered.

In [None]:
pie4 = data["Holiday"].value_counts()
holiday = pie4.index
orders = pie4.values

fig = px.pie(data, values=orders, names=holiday, width=400, height=400,
             title='supplement bought during holidays:')
fig.show()

* According to the above figure, most records (87%) fall on working days rather than public holidays.

### Encoding:

In [None]:
data["Discount"] = data["Discount"].map({"No": 0, "Yes": 1})
data["Store_Type"] = data["Store_Type"].map({"S1": 1, "S2": 2, "S3": 3, "S4": 4})
data["Location_Type"] = data["Location_Type"].map({"L1": 1, "L2": 2, "L3": 3, "L4": 4, "L5": 5})
data = data.dropna()  # dropna() returns a new frame, so assign the result
data

Unnamed: 0,ID,Store_id,Store_Type,Location_Type,Region_Code,Date,Holiday,Discount,#Order,Sales
0,T1000001,1,1,3,R1,2018-01-01,1,1,9,7011.84
1,T1000002,253,4,2,R1,2018-01-01,1,1,60,51789.12
2,T1000003,252,3,2,R1,2018-01-01,1,1,42,36868.20
3,T1000004,251,2,3,R1,2018-01-01,1,1,23,19715.16
4,T1000005,250,2,3,R4,2018-01-01,1,1,62,45614.52
...,...,...,...,...,...,...,...,...,...,...
188335,T1188336,149,2,3,R2,2019-05-31,1,1,51,37272.00
188336,T1188337,153,4,2,R1,2019-05-31,1,0,90,54572.64
188337,T1188338,154,1,3,R2,2019-05-31,1,0,56,31624.56
188338,T1188339,155,3,1,R2,2019-05-31,1,1,70,49162.41
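
One thing to watch with `map` is that any label missing from the dictionary silently becomes `NaN`. The sketch below uses a toy frame with a hypothetical extra label `"S5"` (not present in the notebook's data) to show how to count such values and why the result of `dropna()` has to be assigned:

```python
import pandas as pd

# Toy frame with one label ("S5") that the mapping does not cover (hypothetical).
toy = pd.DataFrame({"Store_Type": ["S1", "S4", "S5"]})
toy["Store_Type"] = toy["Store_Type"].map({"S1": 1, "S2": 2, "S3": 3, "S4": 4})

# Unmapped labels become NaN, so count them before modelling.
n_unmapped = int(toy["Store_Type"].isna().sum())

# dropna() returns a new frame; the original is unchanged unless reassigned.
cleaned = toy.dropna()

print(n_unmapped, len(toy), len(cleaned))
```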


### Splitting:

In [None]:
x = np.array(data[["Store_Type", "Location_Type", "Holiday", "Discount"]])
y = np.array(data["#Order"])

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
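
The rows are dated, so a random split lets the model train on days that come after some of the test days. For a forecasting task, a chronological split may be a fairer check. This is only a sketch of that alternative on a toy frame (the order counts are made up), not what the notebook does:

```python
import pandas as pd

# Toy daily data (hypothetical order counts).
toy = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=10, freq="D"),
    "#Order": [9, 60, 42, 23, 62, 51, 90, 56, 70, 48],
})

# Sort by date, then keep the earliest 80% for training
# and the most recent 20% for testing.
toy = toy.sort_values("Date")
cutoff = int(len(toy) * 0.8)
train, test = toy.iloc[:cutoff], toy.iloc[cutoff:]

print(train["Date"].max(), test["Date"].min())
```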

### Modelling:

In [None]:
import xgboost as xgb           
from sklearn.model_selection import cross_val_score

# Compare learning rates using the training R² and the mean 10-fold cross-validated R².
for lr in [0.01, 0.05, 0.1, 0.2, 0.5, 0.7, 1, 1.3]:
    model = xgb.XGBRegressor(learning_rate=lr, n_estimators=100, verbosity=0)
    model.fit(x_train, y_train)

    train_score = model.score(x_train, y_train)  # R² on the training data
    cv_score = np.mean(cross_val_score(model, x_train, y_train, cv=10))  # mean R² over 10 folds
    print("Learning rate : ", lr, " Train score : ", train_score, " Cross-Val score : ", cv_score)

Learning rate :  0.01  Train score :  -0.17858922215566553  Cross-Val score :  -0.17915389168753434
Learning rate :  0.05  Train score :  0.5918524674605412  Cross-Val score :  0.5916227957046243
Learning rate :  0.1  Train score :  0.5931123976268431  Cross-Val score :  0.5928680057524804
Learning rate :  0.2  Train score :  0.5933079940582765  Cross-Val score :  0.5930039035737136
Learning rate :  0.5  Train score :  0.5933499440407601  Cross-Val score :  0.5930184244464802
Learning rate :  0.7  Train score :  0.5933839562317715  Cross-Val score :  0.5930388902945116
Learning rate :  1  Train score :  0.5933884793201439  Cross-Val score :  0.5930478536679729
Learning rate :  1.3  Train score :  0.5933931142546474  Cross-Val score :  0.5930413781761753


result : the cross-validation score is highest at learning rate = 1
* i.e. R² ≈ 0.593, although every learning rate from 0.05 upward scores within about 0.002 of it
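
The score reported for a regressor is the coefficient of determination, R² = 1 - SS_res / SS_tot. It is negative whenever the predictions are worse than always predicting the mean of the target, which is one plausible reading of the -0.18 at learning rate 0.01: with only 100 trees, such small steps may leave the predictions well short of the targets. A minimal sketch with toy numbers (the helper `r2` is written here for illustration):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([40.0, 60.0, 80.0])  # toy targets

perfect = r2(y, y)                            # predictions equal the targets
mean_baseline = r2(y, np.full(3, y.mean()))   # always predict the mean
underfit = r2(y, np.full(3, 30.0))            # predictions stuck far below the mean

print(perfect, mean_baseline, underfit)
```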

In [None]:
best_lr = 1  # learning rate with the highest cross-validation score above
model = xgb.XGBRegressor(learning_rate=best_lr, n_estimators=100, verbosity=0)

model.fit(x_train, y_train)  # train the model

print("Learning rate : ", best_lr, " Train score : ", model.score(x_train,y_train), " Cross-Val score : ", np.mean(cross_val_score(model, x_train, y_train, cv=10)))

Learning rate :  1  Train score :  0.5933884793201439  Cross-Val score :  0.5930478536679729


In [None]:
model.score(x_test, y_test)  # R² on the held-out test set

0.5921180960516392

### Predicting:

In [None]:
ypred = model.predict(x_test)

df = pd.DataFrame(data={"Predicted Orders": ypred.flatten(),"actual":y_test})
df.head()

Unnamed: 0,Predicted Orders,actual
0,47.360344,54
1,97.331123,111
2,66.578537,59
3,85.021072,67
4,54.452305,60
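
R² alone does not say how far off a prediction is in orders. As a rough illustration, the sketch below computes the mean absolute error and root mean squared error for just the five rows displayed above; a real evaluation would use all of `ypred` and `y_test` instead of these hand-copied values:

```python
import numpy as np

# The five predicted/actual pairs shown in the output above.
predicted = np.array([47.360344, 97.331123, 66.578537, 85.021072, 54.452305])
actual = np.array([54, 111, 59, 67, 60])

errors = predicted - actual
mae = np.mean(np.abs(errors))          # mean absolute error, in orders
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error, in orders

print(f"MAE: {mae:.2f} orders, RMSE: {rmse:.2f} orders")
```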


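INTERPRETATION OF THE MODEL:
  * With a test R² of about 0.59, this XGBoost model explains roughly 59% of the variance in orders, so it is only a moderate forecaster of demand.
  * Better results may come from further data cleaning, from using the columns left out here (Store_id, Region_Code, Date), or from trying different models.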
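
One possible next step is to turn the `Date` column, which is loaded as plain text, into calendar features the model can use. This is a sketch under the assumption that month and day of week matter for demand; the column names `Month` and `DayOfWeek` are made up for illustration:

```python
import pandas as pd

# Two dates in the same format as the dataset's Date column.
toy = pd.DataFrame({"Date": ["2018-01-01", "2019-05-31"]})

# Parse the text into datetimes, then derive simple calendar features.
toy["Date"] = pd.to_datetime(toy["Date"])
toy["Month"] = toy["Date"].dt.month
toy["DayOfWeek"] = toy["Date"].dt.dayofweek  # Monday = 0, Sunday = 6

print(toy)
```

Columns like these could be added to the feature list passed to `train_test_split`, alongside the four features already used.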