This notebook explores forecasting product demand using some basic time series analysis.

In [65]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('../data/raw/product_demand.csv')
df.head(5)
df.tail()

Every row needs a date, since the date is probably the most important feature we're going to be working with, so rows without one get dropped.

In [66]:
# Drop rows where Date is missing
df.dropna(subset=['Date'], inplace=True)

In [67]:
df['Date'] = pd.to_datetime(df['Date'])
df.head(5)

Unnamed: 0,Product_Code,Warehouse,Product_Category,Date,Order_Demand,day,month,year
0,982,2,27,2012-07-27,100,27,7,2012
1,968,2,27,2012-01-19,500,19,1,2012
2,968,2,27,2012-02-03,500,3,2,2012
3,968,2,27,2012-02-09,500,9,2,2012
4,968,2,27,2012-03-02,500,2,3,2012


That was easy enough. Looks like we should be able to access the day, month and year from the converted datetime objects now.

In [68]:
df.dtypes

Product_Code                 int64
Warehouse                    int64
Product_Category             int64
Date                datetime64[ns]
Order_Demand                 int64
day                          int64
month                        int64
year                         int64
dtype: object

I'll now use a label encoder to give numerical values to the rest of the features, which makes them much easier to work with.

In [69]:
from sklearn.preprocessing import LabelEncoder
#encoding all other features as integers so the regression models can use them
le = LabelEncoder()
for col in ['Product_Code', 'Warehouse', 'Product_Category']:
    #fit_transform re-fits the encoder on each column, then transforms it
    df[col] = le.fit_transform(df[col])
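
To make the encoding step a bit more concrete, here is a minimal sketch on made-up values. The warehouse names and the `toy` frame are assumptions for illustration only and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up warehouse names, purely for illustration (not from the real data)
toy = pd.DataFrame({"Warehouse": ["Whse_J", "Whse_A", "Whse_J", "Whse_C"]})

le = LabelEncoder()
# fit_transform learns the sorted unique labels and maps each one to 0..n-1
toy["Warehouse_enc"] = le.fit_transform(toy["Warehouse"])

# classes_ keeps the original labels, so the codes can be mapped back later
decoded = le.inverse_transform(toy["Warehouse_enc"])

print(toy)
print(list(le.classes_))
```

One thing worth keeping in mind: the codes are arbitrary integers, so a model may read an ordering into them that isn't really there.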

In [70]:
#df["seasons"] = ""
#del df['seasons']
#df.head(5)

In [71]:
#The individual day, month, and year could be useful in a time series analysis. Another feature
#that I can see being added is what season the product was ordered; this could show a lot of insight.
df['day'] = df['Date'].dt.day
df['month'] = df['Date'].dt.month
df['year'] = df['Date'].dt.year
df.head(5)

Unnamed: 0,Product_Code,Warehouse,Product_Category,Date,Order_Demand,day,month,year
0,982,2,27,2012-07-27,100,27,7,2012
1,968,2,27,2012-01-19,500,19,1,2012
2,968,2,27,2012-02-03,500,3,2,2012
3,968,2,27,2012-02-09,500,9,2,2012
4,968,2,27,2012-03-02,500,2,3,2012
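
The season feature mentioned in the comment above isn't actually built in this notebook. A rough sketch of how it could be derived from the month is below. The `MONTH_TO_SEASON` mapping (Northern Hemisphere, meteorological seasons) and the `add_season` helper are assumptions of mine, and the `sample` frame is just a stand-in for `df`.

```python
import pandas as pd

# Assumed mapping from month number to meteorological season (Northern Hemisphere)
MONTH_TO_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

def add_season(frame, date_col="Date"):
    """Return a copy of frame with a 'season' column derived from date_col."""
    out = frame.copy()
    out["season"] = out[date_col].dt.month.map(MONTH_TO_SEASON)
    return out

# Small stand-in frame using a few of the dates shown in the output above
sample = pd.DataFrame(
    {"Date": pd.to_datetime(["2012-07-27", "2012-01-19", "2012-03-02"])}
)
sample = add_season(sample)
print(sample)
```

The resulting column holds strings, so it would need to be encoded (for example with the label encoder) before being passed to a model.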


In [72]:
#splitting the data into train and test sets using the 80/20 split discussed in class.
from sklearn.model_selection import train_test_split
labels = df['Order_Demand']
#initially going to test against a few features, not all of them.
features = df[['year' , 'month', 'Product_Code', 'Product_Category', 'Warehouse']]
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    labels, 
                                                    test_size=0.20, 
                                                    random_state=42)

In [73]:
#Note: the split in the previous cell takes labels and features from df, so this cleanup has to run before that cell for the models to see the cleaned values.
#Had to do a little conversion here because Order_Demand was not numeric; it was type object.
df['Order_Demand'] = pd.to_numeric(df['Order_Demand'], errors='coerce')
df.dtypes
df.isnull().values.any()
#when this was run, it returned True, meaning that there were some missing values in the dataframe
#using dropna to remove any missing values
df = df.dropna()
#conversion to type int so it can be used in the regression models
df['Order_Demand'] = df['Order_Demand'].astype(int)
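
For a sense of what `errors='coerce'` does in the cell above, here is a small sketch. The raw strings are hypothetical and only stand in for whatever non-numeric entries the real column contains.

```python
import pandas as pd

# Hypothetical raw values; the second and fourth are deliberately not parseable
raw = pd.Series(["100", "(500)", "250", "n/a"])

# Entries that cannot be parsed as numbers become NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# Counting the NaNs first shows how many rows dropna() is about to remove
n_bad = int(numeric.isna().sum())
clean = numeric.dropna().astype(int)

print(n_bad)
print(clean.tolist())
```

Checking the NaN count before calling `dropna()` is a cheap way to see how much data the conversion is silently discarding.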

The first model I'll try is a Gradient Boosting Regression model, imported from scikit-learn.

In [83]:
from sklearn.ensemble import GradientBoostingRegressor
GBR = GradientBoostingRegressor()
model = GBR.fit(X_train, y_train.values.ravel())
#Predicting Order_Demand for the held-out test set
prediction = model.predict(X_test)
#score() returns R^2 (coefficient of determination) for regressors, not an accuracy
print(GBR.score(X_test, y_test))

0.10986028296246353


Looks like the Gradient Boosting Regression model wasn't the best for this: an R² of about 0.11 means it explains very little of the variance in demand. I'll try a different model to see if it scores better.

In [85]:
from sklearn.ensemble import RandomForestRegressor
RFR = RandomForestRegressor()
model = RFR.fit(X_train, y_train.values.ravel())
#Predicting Order_Demand for the held-out test set
prediction = model.predict(X_test)
#score() returns R^2 (coefficient of determination) for regressors, not an accuracy
print(RFR.score(X_test, y_test))

0.10286005892698447


Hm, didn't seem to like that one either: the R² is about 0.10. One thing I can think of is that a lot more features were added prior to creating the models (breaking down the datetime into day, month, and year). This could potentially be a reason for the low scores. Another reason, as we discussed, could be overfitting. However, we seem to have plenty of training data, so I'm not sure this is really what's causing it.
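
One way to check the overfitting idea is to compare the score on the training split with the score on the test split. A large gap points toward overfitting, while two similarly low scores suggest the features just don't explain much. Below is a minimal sketch on synthetic data, not the demand dataset. The data generation, `n_estimators=50`, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the target is mostly noise relative to the features
rng = np.random.default_rng(42)
X = rng.integers(0, 12, size=(500, 3))
y = 10 * X[:, 0] + rng.normal(0, 100, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# score() returns R^2 for regressors; compare it on both splits
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

Running the same two `score` calls on the notebook's own `X_train`/`y_train` and `X_test`/`y_test` would show which of the two explanations fits better here.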