# Future price prediction
This data comes from Kaggle. It was not designed for time-series prediction, but we want to experiment and see whether time-series models can produce a reasonably good prediction.

## Conclusion
After several attempts with both ARIMA and Prophet, we concluded that this data is not ideal for time-series prediction. We tried capping outliers so that the data would reflect the general trend in sales behavior, but the models still could not produce accurate predictions.

The data does not appear to contain enough trend for the models to learn from. One possible reason is that the company kept making ad hoc changes to its marketing strategy, which made the sales behave erratically. Another is that the company relies purely on organic growth and the provided dataset simply does not contain enough data for accurate time-series prediction.

In conclusion, time-series modeling is not a good approach for this data; other methods, such as regression or a CNN, should be used instead.

In [1]:
# This block is from https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

#load packages
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}". format(sys.version))

import pandas as pd #collection of functions for data processing and analysis modeled after R dataframes with SQL like features
print("pandas version: {}". format(pd.__version__))

import matplotlib #collection of functions for scientific and publication-ready visualization
print("matplotlib version: {}". format(matplotlib.__version__))

import numpy as np #foundational package for scientific computing
print("NumPy version: {}". format(np.__version__))

import scipy as sp #collection of functions for scientific computing and advance mathematics
print("SciPy version: {}". format(sp.__version__)) 

import IPython
from IPython import display #pretty printing of dataframes in Jupyter notebook
print("IPython version: {}". format(IPython.__version__)) 

import sklearn #collection of machine learning algorithms
print("scikit-learn version: {}". format(sklearn.__version__))

import seaborn as sns #collection of functions for data visualization
print("seaborn version: {}". format(sns.__version__))

from sklearn.preprocessing import OneHotEncoder #OneHot Encoder
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
%matplotlib inline

#misc libraries
import os
import random
import time
from datetime import datetime #pandas.datetime is deprecated, so use the standard library class


#ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)



# Input data files are available in the "input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
# os.listdir works on every platform, unlike calling "ls" through subprocess

print(os.listdir("input"))

Python version: 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
pandas version: 1.0.5
matplotlib version: 3.2.2
NumPy version: 1.18.5
SciPy version: 1.5.0
IPython version: 7.16.1
scikit-learn version: 0.23.1
seaborn version: 0.10.1
-------------------------

In [None]:
item_categories = pd.read_csv('input/item_categories.csv')
items = pd.read_csv('input/items.csv')
sales_train_raw = pd.read_csv('input/sales_train.csv')
sample_submission = pd.read_csv('input/sample_submission.csv')
shops_raw = pd.read_csv('input/shops.csv')
test_raw = pd.read_csv('input/test.csv')

## There is a lot we can do with this data, but in this kernel we will focus on forecasting

In [None]:
sales_train_raw.info()

## From the info() above, we can see that the date column is read as the object dtype, so let's read the file again with proper date parsing

In [None]:
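def parser(x):
    #dates in the CSV are formatted as day.month.year
    return datetime.strptime(x, '%d.%m.%Y')

#parse the first column as dates and use it as the index
sales_train_di = pd.read_csv('input/sales_train.csv', index_col=0, parse_dates=[0], date_parser=parser)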
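
The same parsing can also be done after loading, which may be handy because newer pandas releases deprecate the `date_parser` argument. This is only a sketch: the two-row CSV below is made up for illustration and merely assumes the `day.month.year` layout that the parser above implies.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the date layout of sales_train.csv
csv_text = "date,item_price,item_cnt_day\n02.01.2013,999.0,1.0\n15.03.2013,899.0,2.0\n"

df = pd.read_csv(io.StringIO(csv_text))

# Parse day.month.year strings explicitly, then use the dates as the index
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")
df = df.set_index("date")

print(df.index.dtype)
print(df.index[0])
```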

In [None]:
sales_train_raw.head()

In [None]:
sales_train_di.head()

## Check for outliers

In [None]:
fig = plt.figure(figsize=(5,5))
plt.subplot(1,2,1)
sns.boxplot(y='item_price', data=sales_train_di)
plt.subplot(1,2,2)
sns.boxplot(y='item_cnt_day', data=sales_train_di)
fig.tight_layout(pad=1.0)

## Cap outliers using thresholds read off the boxplots above

In [None]:
#replace prices above 10000 with 6000, and cap daily counts at 700
sales_train_di['item_price'] = sales_train_di['item_price'].mask(sales_train_di['item_price'] > 10000, 6000)
sales_train_di['item_cnt_day'] = sales_train_di['item_cnt_day'].clip(upper=700)

In [None]:
fig = plt.figure(figsize=(5,5))
plt.subplot(1,2,1)
sns.boxplot(y='item_price', data=sales_train_di)
plt.subplot(1,2,2)
sns.boxplot(y='item_cnt_day', data=sales_train_di)
fig.tight_layout(pad=1.0)

## Next, we reshape the data into a format we can use for modeling

In [None]:
#since we want to predict the sales through time, we are creating the feature we want to target
sales_train_di['sales'] = sales_train_di['item_price']*sales_train_di['item_cnt_day']

In [None]:
sales_train_di.index.value_counts()

### Observation

It looks like the data is separated by date and item_id, meaning that each observation is the sale of one item on a specific date. The dates are therefore not unique, so we can group the data by month.
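
As a small illustration of what grouping by month does to repeated dates, here is a sketch on made-up numbers (the values and dates are not from the dataset). It groups on the calendar month of the index and sums the rows in each month.

```python
import pandas as pd

# Made-up daily sales with repeated dates, as described in the observation
idx = pd.to_datetime(
    ["2013-01-02", "2013-01-02", "2013-01-15", "2013-02-03", "2013-02-03"]
)
daily = pd.DataFrame({"sales": [100.0, 50.0, 25.0, 10.0, 40.0]}, index=idx)

# Sum all rows that fall in the same calendar month
monthly = daily.groupby(daily.index.to_period("M")).sum()

print(monthly)
```

Five daily rows collapse into two monthly totals, one per calendar month.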

In [None]:
sales = sales_train_di.drop(['date_block_num', 'shop_id', 'item_id', 'item_price', 'item_cnt_day'], axis=1)

In [None]:
sales.head()

## Let's see if we have outliers in the newly created sales feature

In [None]:
sns.boxplot(x=sales['sales'])

In [None]:
#cap sales at 800000
sales['sales'] = sales['sales'].clip(upper=800000)

In [None]:
sns.boxplot(x=sales['sales'])

In [None]:
#grouping the data by month ("M" is the month-end frequency alias)
sales_gm = sales.resample("M").sum()

In [None]:
sales_gm

In [None]:
sales_gm.size

In [None]:
plt.figure(figsize=(14,6))
sns.lineplot(data=sales_gm.sales)

In [None]:
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(sales_gm.sales)

In [None]:
#split the data for training and validation purposes
train_index_m = int(np.rint(sales_gm.size*0.8))
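
For a time series the split has to be chronological: the earliest 80% of the months go to training and the most recent 20% to validation, with no shuffling. A minimal sketch on a hypothetical 10-month series (the values are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with 10 observations
series = pd.Series(
    np.arange(10.0),
    index=pd.period_range("2013-01", periods=10, freq="M"),
)

# First 80% of the rows for training, the remainder for validation
split = int(np.rint(len(series) * 0.8))
train, test = series.iloc[:split], series.iloc[split:]

print(len(train), len(test))  # 8 2
```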

In [None]:
sales_gm.size

In [None]:
train_index_m

In [None]:
X_train_m = sales_gm[:train_index_m]
X_train_m.size


In [None]:
X_train_m

In [None]:
X_test_m = sales_gm[train_index_m:]
X_test_m.size

In [None]:
X_test_m

In [None]:
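#hyperparameter tuning for the ARIMA model
import itertools
#legacy ARIMA API; newer statsmodels releases replace it with statsmodels.tsa.arima.model.ARIMA
from statsmodels.tsa.arima_model import ARIMA

def hyper_p(train):
    """Grid-search (p, d, q) and return the fitted model with the lowest AIC."""
    best_aic = np.inf
    best_param = None
    best_model = None

    #try every combination of p, d and q from 0 to 11
    p = d = q = range(0, 12)
    pdq = list(itertools.product(p, d, q))

    for param in pdq:
        try:
            arima = ARIMA(train, order=param)
            arima_fit = arima.fit()
            if arima_fit.aic < best_aic:
                best_aic = arima_fit.aic
                best_param = param
                best_model = arima_fit
        except Exception:
            #skip orders that are invalid or fail to converge
            continue

    print('aic: {:6.5f} | pdq set: {}'.format(best_aic, best_param))
    return best_model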

In [None]:
best_arima = hyper_p (X_train_m)

In [None]:
predictions_m= best_arima.forecast(steps=X_test_m.size)[0]
predictions_m

In [None]:
from sklearn.metrics import mean_squared_error
score = mean_squared_error(X_test_m, predictions_m)

In [None]:
score
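
The raw MSE is in squared sales units, which makes it hard to judge on its own. As a sketch of two metrics that are easier to read, the snippet below computes RMSE (same units as the sales) and MAPE (a percentage) on made-up numbers. The arrays are placeholders, not results from this notebook.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 180.0, 440.0])

mse = np.mean((actual - predicted) ** 2)
rmse = np.sqrt(mse)  # back in the units of the series
mape = np.mean(np.abs((actual - predicted) / actual)) * 100  # percent error

print(mse, rmse, mape)
```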

In [None]:
p_df = pd.DataFrame({'sale': predictions_m}, index = X_test_m.index)

In [None]:
p_df

In [None]:
plt.plot(X_test_m)
plt.plot(p_df,color='red')

### Observation

Not a good prediction. Non-stationarity should not be the reason, because the differencing order d in ARIMA should be able to remove it. More likely, the data does not exhibit enough of a pattern for the model to pick up.
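
To illustrate what the d parameter does, here is a small sketch with NumPy on made-up series: differencing once turns a linear trend into a constant, and differencing twice does the same for a quadratic trend. This only demonstrates differencing itself and is not a statement about our sales data.

```python
import numpy as np

t = np.arange(12, dtype=float)

# Made-up series: one with a linear trend, one with a quadratic trend
linear = 3.0 * t + 2.0
quadratic = t ** 2

# d=1: first differences of a linear trend are constant
diff_linear = np.diff(linear, n=1)
# d=2: second differences of a quadratic trend are constant
diff_quadratic = np.diff(quadratic, n=2)

print(diff_linear)
print(diff_quadratic)
```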

## Next, we want to see if we could get a better result from the Facebook Prophet model

In [None]:
train_rindex = X_train_m.copy()

In [None]:
train_rindex.reset_index(level=0, inplace=True)

In [None]:
train_rindex.columns = ['ds', 'y']

In [None]:
train_rindex

In [None]:
#borrowed from https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts
from fbprophet import Prophet
#Prophet requires a pandas DataFrame in the following layout:
# (date column named ds and value column named y)
model = Prophet(yearly_seasonality=True) #instantiate Prophet with yearly seasonality only, since our data is monthly
model.fit(train_rindex) #fit the model with our dataframe

In [None]:
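#forecast one month-end period per validation month, matching the month-end index of sales_gm
future = model.make_future_dataframe(periods=len(X_test_m), freq='M')
# now let's make the forecasts
forecast = model.predict(future)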


In [None]:
forecast

In [None]:
model.plot(forecast)

In [None]:
sales_gm.plot()

In [None]:
#yhat is the point forecast; yhat_lower is only the lower bound of the uncertainty interval
y_pred = forecast[['ds', 'yhat']].tail(len(X_test_m))

In [None]:
y_pred.columns = ['date', 'sales']

In [None]:
y_pred=y_pred.set_index('date')

In [None]:
plt.plot(X_test_m)
plt.plot(y_pred,color='red')
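
Besides plotting, the forecast can be compared with the held-out months numerically. The sketch below assumes a forecast frame with Prophet's usual `ds` and `yhat` columns. Both the frame and the actual values are hand-built placeholders, so the numbers mean nothing beyond showing how the alignment works.

```python
import pandas as pd

# Hand-built stand-in for a Prophet forecast frame (values are made up)
forecast_demo = pd.DataFrame({
    "ds": pd.to_datetime(["2015-08-31", "2015-09-30", "2015-10-31"]),
    "yhat": [100.0, 120.0, 90.0],
})

# Hypothetical held-out actuals on the same month-end dates
actual_demo = pd.Series(
    [110.0, 100.0, 95.0], index=pd.Index(forecast_demo["ds"]), name="sales"
)

# Index the point forecast by date so it aligns with the actuals
y_hat = forecast_demo.set_index("ds")["yhat"]
abs_err = (actual_demo - y_hat).abs()

print(abs_err)
```

Because both series share the same date index, the subtraction matches each prediction with the month it belongs to.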

## Conclusion
After several attempts with both ARIMA and Prophet, we concluded that this data is not ideal for time-series prediction. We tried capping outliers so that the data would reflect the general trend in sales behavior, but the models still could not produce accurate predictions.

The data does not appear to contain enough trend for the models to learn from. One possible reason is that the company kept making ad hoc changes to its marketing strategy, which made the sales behave erratically. Another is that the company relies purely on organic growth and the provided dataset simply does not contain enough data for accurate time-series prediction.

In conclusion, time-series modeling is not a good approach for this data; other methods, such as regression or a CNN, should be used instead.