## Problem
> Sales in some seasons run significantly higher or lower than average. A company that is unaware of these seasonal swings can lose a great deal of money, so predicting future sales is one of its most important planning tasks. Sales forecasts guide decisions on stocking inventory, estimating revenue, and making new investments. Knowing future sales also matters to investors: meeting targets set at the start of a season can lift the stock price and investor confidence, while missing a projected target can significantly damage the stock price. For a company as large as Walmart, that is an especially serious problem.

## Aim
> My aim in this project is to build a model that predicts store sales. With this model, Walmart decision-makers can plan ahead, which is very important for stocking inventory, estimating revenue, and deciding whether to make new investments.

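## Steps
- Load the `features`, `stores`, `train`, and `test` datasets and merge them
- Add date-based features
- Fit the model and evaluate its predictions on held-out dates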

In [2]:
import logging
import warnings
import pandas as pd
import numpy as np
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import shap
from sklearn.metrics import r2_score
#import check_numeric

sns.set_style('darkgrid')
sns.set_palette("bright")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [95]:
class SalesPredict:

    """
    Predict sales per store
     > trainset: 2010-02-05 ~ 2012-10-26
     > testset: 2012-11-02 ~ 2013-07-26
    """
    
    def __init__(self):
        self.file_path = "data"        
        self.datasets = ['features', 'stores', 'train', 'test']
        self.loaded_datasets = {}
        self.label_encoder = LabelEncoder()
        self.target_encoder = ce.TargetEncoder(cols=['store'])        
        self.base_model = LGBMRegressor()
        self.use_vote_regressor = False
        self.opt_param = False  # if True, run hyperparameter optimization
        self.param_space = {
                            'learning_rate': np.logspace(-3, 0, num=100),
                            'n_estimators': np.arange(50, 300, step=10),
                            'max_depth': np.arange(3, 11),
                            'num_leaves': np.arange(10, 101),
                            'min_child_samples': np.arange(5, 31)
                            } 
        
        
    def load_data(self):
        """
        Load the datasets
        and merge the separate datasets into one DataFrame
        """
        for data in self.datasets:
            df = pd.read_csv(f"{self.file_path}/{data}.csv") 
            df.columns = df.columns.str.lower()
            self.loaded_datasets[data] = df
        
        # fetch each loaded DataFrame
        features = self.loaded_datasets['features']
        stores = self.loaded_datasets['stores']
        train = self.loaded_datasets['train']
        test = self.loaded_datasets['test']
                
        # merge
        df = pd.concat([train, test])        
        df = df.merge(features, on=['store','date','isholiday'], how='left')\
                     .merge(stores, on=['store'], how='left')
        
        # columns type change
        df['date'] = pd.to_datetime(df['date'])
        df['isholiday'] = df['isholiday'].astype(int)
        #df['is_test'] = np.where(df['weekly_sales'].isnull() == True, 1, 0)        
        
        # shape check
        assert len(df) == len(train)+len(test)
        print(df.shape)
        return df
    
    
    def add_date_features(self, df):
        
        """
        Date-related feature engineering
        """
        df['month'] = df['date'].dt.month
        #df['day'] = df['date'].dt.dayofweek
        #df['dayofweek'] = df['date'].dt.day_name()
        #df["weekend_yn"] = np.where(df["dayofweek"].isin(["Saturday", "Sunday"]), 1, 0)  # weekend flag
        df["dayofyear"] = df['date'].dt.dayofyear  # day of the year
        df["week_no"] = df['date'].dt.isocalendar()['week'].astype('int')  # ISO week number
        df['week_num'] = np.ceil((df['date'].dt.to_period('M').dt.to_timestamp().dt.weekday + df['date'].dt.day) / 7.0).astype(int)  # week of the month
        #df = pd.concat([df, pd.get_dummies(data=df['dayofweek'], prefix='dayofweek')], axis=1)  # one-hot encode the day of week

        # cumulative order count per month
        # convert to YYYY-MM form for monthly prediction
        # df['date_month'] = pd.to_datetime(df['target_date'].dt.strftime('%Y-%m'))        
        # df['ord_cnt_cum'] = df.groupby(['date_month', 'warehouse'])['order'].cumsum()        
        return df
    
    
    def add_fe_laggging(self, df, max_lagging_n=30):
        pass
    
    
    
    def fit_predict(self, df):
        
        # use the date column as the index
        if 'date' in df.columns:
            df = df.set_index("date")             
            
        # [Important] drop columns: prevents errors and target leakage
        drop_cols = ['weekly_sales']
        # df = df.dropna(subset=['weekly_sales']).drop(drop_cols, axis=1)

        #pred_res = pd.DataFrame()
        #for date in test_idx:            
        #    print("train and predict of", date)            
        # hold out 2012 onward for evaluation; copy so the column assignments below
        # modify independent frames rather than slices of df
        train = df[df.index < '2012-01-01'].copy()
        pred = df[df.index >= '2012-01-01'].copy()

        train['type'] = self.label_encoder.fit_transform(train['type'])
        train['store_enc'] = self.target_encoder.fit_transform(train['store'], train['weekly_sales'])
        X_train = train.drop(drop_cols, axis=1)
        y_train = train['weekly_sales']

        # for performance checking: apply the already-fitted encoders (the target is not passed)
        pred['type'] = self.label_encoder.transform(pred['type'])
        pred['store_enc'] = self.target_encoder.transform(pred['store'])
        X_pred = pred.drop(drop_cols, axis=1)
        y_pred = pred[['weekly_sales']].copy()
        
        model = self.base_model                                                        
        model.fit(X_train, y_train)
        y_pred['pred_cnt'] = model.predict(X_pred).astype(int)
        #pred_res = pred_res.append(y_pred)          
        
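        # drop rows without actual sales (the unlabeled test period)
        pred_res = y_pred.copy().dropna()
        # note: 'ape' stores 1 - absolute percentage error, so higher is better and
        # values go negative when the error exceeds the actual sales
        pred_res['ape'] = 1-(abs(pred_res['weekly_sales']-pred_res['pred_cnt'])/pred_res['weekly_sales'])
        print(pred_res.ape.median())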
        
        return df, pred_res
        
    
    
    def process(self):
        
        df = self.load_data()
        df = self.add_date_features(df)
        df, pred_res = self.fit_predict(df)
        
        return df, pred_res
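
The `week_num` expression in `add_date_features` is dense, so here is a minimal standard-library sketch of the same calculation, assuming weeks start on Monday (as `weekday` does, with Monday = 0). The helper name `week_of_month` is hypothetical and not part of the class.

```python
import math
from datetime import date


def week_of_month(d: date) -> int:
    """Return the 1-based week of the month for d, with weeks starting on Monday.

    Hypothetical helper mirroring the pandas expression used for `week_num`.
    """
    # Weekday of the first day of the month: Monday = 0 ... Sunday = 6.
    first_weekday = d.replace(day=1).weekday()
    # Offsetting the day number by that weekday aligns the count to Monday-based weeks.
    return math.ceil((first_weekday + d.day) / 7)


# 2010-02-01 was a Monday, so 2010-02-05 falls in the first week of February.
print(week_of_month(date(2010, 2, 5)))
# 2012-01-01 was a Sunday, so 2012-01-06 already belongs to the second week.
print(week_of_month(date(2012, 1, 6)))
```

Because the offset depends on the weekday of the first of the month, a month that starts late in the week has a short first week, and the count can reach 6.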

In [96]:
self = SalesPredict()
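
`add_fe_laggging` is left as a stub in the class above. One possible way to fill it in, shown only as a sketch, is to shift `weekly_sales` within each series so that every row sees earlier weeks of the same series. The helper name `add_lag_features`, the chosen lags, and the `dept` grouping column are assumptions for illustration and do not come from the notebook; adjust them to the columns actually present in the data.

```python
import pandas as pd


def add_lag_features(df, lags=(1, 2, 4), group_cols=("store", "dept"),
                     target="weekly_sales"):
    """Hypothetical helper: add lagged copies of `target` per group.

    Assumes `df` has a `date` column and the columns named in `group_cols`.
    """
    # Sort so that shifting moves values forward in time within each group.
    out = df.sort_values([*group_cols, "date"]).copy()
    grouped = out.groupby(list(group_cols))[target]
    for lag in lags:
        # shift(lag) gives each row the target value from `lag` rows earlier.
        out[f"{target}_lag_{lag}"] = grouped.shift(lag)
    return out


# Toy data, invented for illustration only.
demo = pd.DataFrame({
    "store": [1, 1, 1, 2, 2, 2],
    "dept": [1, 1, 1, 1, 1, 1],
    "date": pd.to_datetime(["2012-01-06", "2012-01-13", "2012-01-20"] * 2),
    "weekly_sales": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})
lagged = add_lag_features(demo, lags=(1,))
print(lagged)
```

The first row of each group has no earlier week, so its lag is missing. The same holds for future rows whose recent `weekly_sales` values are not known yet, which limits how short a lag can be for a given forecast horizon.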

In [97]:
df, pred_res = self.process()

(536634, 16)
0.6768199388093075
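
The first printed value is the shape of the merged frame, which `load_data` checks with an `assert` on the row count. A complementary check, sketched here on invented toy data, is the `validate` argument of `DataFrame.merge`, which raises as soon as the right-hand keys are not unique instead of silently duplicating rows. The frames and the flag name `duplicate_detected` are illustrative assumptions.

```python
import pandas as pd

# Toy data, invented for illustration only.
sales = pd.DataFrame({
    "store": [1, 1, 2],
    "date": ["2012-01-06", "2012-01-13", "2012-01-06"],
    "weekly_sales": [100.0, 110.0, 50.0],
})
stores = pd.DataFrame({"store": [1, 2], "type": ["A", "B"]})

# A left merge keeps every sales row; validate="many_to_one" additionally
# requires each store to appear at most once on the right-hand side.
merged = sales.merge(stores, on="store", how="left", validate="many_to_one")
print(merged.shape)

# With a duplicated store row the same merge raises MergeError.
stores_dup = pd.concat([stores, stores.iloc[[0]]], ignore_index=True)
try:
    sales.merge(stores_dup, on="store", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```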


In [99]:
pred_res

| date | weekly_sales | pred_cnt | ape |
|---|---|---|---|
| 2012-01-06 | 16567.69 | 20833 | 0.742553 |
| 2012-01-13 | 16894.40 | 20570 | 0.782437 |
| 2012-01-20 | 18365.10 | 20851 | 0.864640 |
| 2012-01-27 | 18378.16 | 23069 | 0.744760 |
| 2012-02-03 | 23510.49 | 22319 | 0.949321 |
| ... | ... | ... | ... |
| 2012-09-28 | 508.37 | 2462 | -2.842929 |
| 2012-10-05 | 628.10 | 2740 | -2.362363 |
| 2012-10-12 | 1061.02 | 2740 | -0.582421 |
| 2012-10-19 | 760.01 | 2699 | -1.551269 |
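
The `ape` column is computed in `fit_predict` as one minus the absolute percentage error, so it reads as an accuracy-like score rather than an error. The sketch below recomputes three rows of the output above with a hypothetical helper, `one_minus_ape`, to show why rows with small actual sales produce negative values.

```python
from statistics import median


def one_minus_ape(actual: float, predicted: float) -> float:
    """Hypothetical helper: 1 - |actual - predicted| / actual, as in fit_predict."""
    return 1 - abs(actual - predicted) / actual


# (weekly_sales, pred_cnt) pairs copied from the output table above.
rows = [(16567.69, 20833), (23510.49, 22319), (508.37, 2462)]
scores = [one_minus_ape(actual, predicted) for actual, predicted in rows]

for (actual, predicted), score in zip(rows, scores):
    print(f"{actual:>10.2f} {predicted:>6d} {score:.6f}")

# The notebook reports the median of this score over all evaluated rows.
print(median(scores))
```

For the last pair, the prediction misses by almost four times the actual sales, so the score drops below zero. The median is less sensitive to such rows than the mean, which may be why it is the value printed by `fit_predict`.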
