## Problem
> Sales are significantly higher or lower than average in many seasons. If a company is unaware of these seasonal swings, it can lose a great deal of money. Predicting future sales is therefore one of the most crucial parts of a company's planning. Sales forecasting helps a company manage stock, estimate revenue, and decide whether to make new investments. Another advantage is that hitting targets set at the start of a season can have a positive effect on the stock price and on investors' perceptions. Conversely, missing a projected target can significantly damage the stock price, which is a particularly serious problem for a company as large as Walmart.

## Aim
> My aim in this project is to build a model that predicts store sales. With this model, Walmart's decision-makers can plan ahead, which is very important for managing stock, estimating revenue, and deciding whether to make new investments.

## Steps

In [2]:
import logging
import warnings
import pandas as pd
import numpy as np
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import shap
from sklearn.metrics import r2_score
#import check_numeric

sns.set_style('darkgrid')
sns.set_palette("bright")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [71]:
class SalesPredict:

    """
    Predict sales for each store
     > trainset: 2010-02-05 ~ 2012-10-26
     > testset: 2012-11-02 ~ 2013-07-26
    """
    
    def __init__(self):
        self.file_path = "data"        
        self.datasets = ['features', 'stores', 'train', 'test']
        self.loaded_datasets = {}
        self.label_encoder = LabelEncoder()
        self.target_encoder = ce.TargetEncoder(cols=['store', 'dept'])
        self.base_model = LGBMRegressor()
        self.use_vote_regressor = False
        self.opt_param = False  # if True, run hyperparameter optimization
        self.param_space = {
                            'learning_rate': np.logspace(-3, 0, num=100),
                            'n_estimators': np.arange(50, 300, step=10),
                            'max_depth': np.arange(3, 11),
                            'num_leaves': np.arange(10, 101),
                            'min_child_samples': np.arange(5, 31)
                            } 
        
        
    def load_data(self):
        """
        Load the datasets
        and merge the separate tables into one DataFrame
        """
        for data in self.datasets:
            df = pd.read_csv(f"{self.file_path}/{data}.csv") 
            df.columns = df.columns.str.lower()
            self.loaded_datasets[data] = df
        
        # Fetch each DataFrame
        features = self.loaded_datasets['features']
        stores = self.loaded_datasets['stores']
        train = self.loaded_datasets['train']
        test = self.loaded_datasets['test']
                
        # Merge
        df = pd.concat([train, test])        
        df = df.merge(features, on=['store','date','isholiday'], how='left')\
                     .merge(stores, on=['store'], how='left')
        
        # columns type change
        df['date'] = pd.to_datetime(df['date'])
        df['isholiday'] = df['isholiday'].astype(int)
        #df['is_test'] = np.where(df['weekly_sales'].isnull() == True, 1, 0)        
        
        # shape check
        assert len(df) == len(train)+len(test)
        print(df.shape)
        return df
    
    
    def add_date_features(self, df):
        
        """
        Date-related feature engineering
        """
        df['month'] = df['date'].dt.month
        #df['day'] = df['date'].dt.dayofweek
        #df['dayofweek'] = df['date'].dt.day_name()
        #df["weekend_yn"] = np.where(df["dayofweek"].isin(["Saturday", "Sunday"]), 1, 0)  # weekend flag
        df["dayofyear"] = df['date'].dt.dayofyear  # day of the year
        df["week_no"] = df['date'].dt.isocalendar()['week'].astype('int')
        df['week_num'] = np.ceil((df['date'].dt.to_period('M').dt.to_timestamp().dt.weekday + df['date'].dt.day) / 7.0).astype(int)  # week of the month
        #df = pd.concat([df, pd.get_dummies(data=df['dayofweek'], prefix='dayofweek')], axis=1)  # one-hot encoding of the day of week

        # Cumulative order count per month
        # Convert to YYYY-MM for monthly prediction
        # df['date_month'] = pd.to_datetime(df['target_date'].dt.strftime('%Y-%m'))        
        # df['ord_cnt_cum'] = df.groupby(['date_month', 'warehouse'])['order'].cumsum()        
        return df
    
    
    def add_fe_lagging(self, df, max_lagging_n=30):
        pass
    
    
    
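    def fit_predict(self, df):
        """
        Walk-forward validation: for each Friday in the 2012 window,
        train on all earlier weeks and predict that week.
        """
        # Use the date as the index
        if 'date' in df.columns:
            df = df.set_index("date")

        # [Important] Keep only labelled rows (the test period has no weekly_sales).
        # The target is split off from the features below, so it must not be dropped here.
        df = df.dropna(subset=['weekly_sales'])

        pred_res = []
        # The data is weekly and every date is a Friday, so step week by week
        for date in pd.date_range('2012-01-01', '2012-10-26', freq='W-FRI'):
            print("train and predict of", date.date())
            train = df[df.index < date].copy()
            pred = df[df.index == date].copy()
            if pred.empty:
                continue

            # Fit the encoders on the training window only, to avoid leakage
            self.target_encoder = ce.TargetEncoder(cols=['store', 'dept'])
            train['type'] = self.label_encoder.fit_transform(train['type'])
            enc = self.target_encoder.fit_transform(train[['store', 'dept']], train['weekly_sales'])
            train['store_enc'] = enc['store'].to_numpy()
            train['dept_enc'] = enc['dept'].to_numpy()
            X_train = train.drop("weekly_sales", axis=1)
            y_train = train['weekly_sales']

            # Held-out week, used to check performance: transform only, never refit
            pred['type'] = self.label_encoder.transform(pred['type'])
            enc = self.target_encoder.transform(pred[['store', 'dept']])
            pred['store_enc'] = enc['store'].to_numpy()
            pred['dept_enc'] = enc['dept'].to_numpy()
            X_pred = pred.drop("weekly_sales", axis=1)

            # Fit the base model and keep the actual and predicted sales side by side
            self.base_model.fit(X_train, y_train)
            pred_res.append(pred[['store', 'dept', 'weekly_sales']]
                            .assign(prediction=self.base_model.predict(X_pred)))

        return pd.concat(pred_res)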
    
    
    def process(self):
        
        df = self.load_data()
        df = self.add_date_features(df)
        
        return df
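
Target encoding is a common source of the leakage that `fit_predict` is trying to avoid. As a minimal sketch in plain pandas (not the `category_encoders` implementation, which also smooths the means), the per-category mean is learned on the training window only and then mapped onto the held-out rows. The helper name `mean_encode` and the toy frames are hypothetical.

```python
import pandas as pd

def mean_encode(train, holdout, col, target):
    """Hypothetical helper: encode `col` by the mean of `target` seen in `train` only."""
    means = train.groupby(col)[target].mean()
    fallback = train[target].mean()  # used for categories unseen in training
    return train[col].map(means), holdout[col].map(means).fillna(fallback)

# Toy data, not taken from the Walmart files
train = pd.DataFrame({"store": [1, 1, 2, 2],
                      "weekly_sales": [10.0, 30.0, 100.0, 300.0]})
holdout = pd.DataFrame({"store": [1, 2, 3],
                        "weekly_sales": [999.0, 999.0, 999.0]})

train_enc, holdout_enc = mean_encode(train, holdout, "store", "weekly_sales")
print(holdout_enc.tolist())  # [20.0, 200.0, 110.0]
```

The held-out targets (999.0) never influence the encoding, which is the property that matters here.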

In [72]:
self = SalesPredict()

In [73]:
df = self.process()

(536634, 16)
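
The `assert` in `load_data` checks after the fact that the left joins did not change the row count. A hedged alternative sketch: `DataFrame.merge` accepts a `validate` argument that raises `MergeError` as soon as the right-hand table has duplicate keys. The small frames below are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({"store": [1, 1, 2], "weekly_sales": [10.0, 20.0, 30.0]})
stores = pd.DataFrame({"store": [1, 2], "size": [151315, 118221]})

# many_to_one: a key may repeat on the left but must be unique on the right
merged = sales.merge(stores, on="store", how="left", validate="many_to_one")
assert len(merged) == len(sales)

# A duplicated key on the right would silently add rows;
# validate turns that into an explicit error instead
bad_stores = pd.concat([stores, stores.iloc[[0]]])
try:
    sales.merge(bad_stores, on="store", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(duplicate_detected)
```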


In [74]:
df

Unnamed: 0,store,dept,date,weekly_sales,isholiday,temperature,fuel_price,markdown1,markdown2,markdown3,markdown4,markdown5,cpi,unemployment,type,size,month,dayofyear,week_no,week_num
0,1,1,2010-02-05,24924.50,0,42.31,2.572,,,,,,211.096358,8.106,A,151315,2,36,5,1
1,1,1,2010-02-12,46039.49,1,38.51,2.548,,,,,,211.242170,8.106,A,151315,2,43,6,2
2,1,1,2010-02-19,41595.55,0,39.93,2.514,,,,,,211.289143,8.106,A,151315,2,50,7,3
3,1,1,2010-02-26,19403.54,0,46.63,2.561,,,,,,211.319643,8.106,A,151315,2,57,8,4
4,1,1,2010-03-05,21827.90,0,46.50,2.625,,,,,,211.350143,8.106,A,151315,3,64,9,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
536629,45,98,2013-06-28,,0,76.05,3.639,4842.29,975.03,3.00,2449.97,3169.69,,,B,118221,6,179,26,5
536630,45,98,2013-07-05,,0,77.50,3.614,9090.48,2268.58,582.74,5797.47,1514.93,,,B,118221,7,186,27,1
536631,45,98,2013-07-12,,0,79.37,3.614,3789.94,1827.31,85.72,744.84,2150.36,,,B,118221,7,193,28,2
536632,45,98,2013-07-19,,0,82.84,3.737,2961.49,1047.07,204.19,363.00,1059.46,,,B,118221,7,200,29,3
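
The `week_num` column is the least obvious of the date features: it adds the weekday of the month's first day (Monday = 0) to the day of the month and divides by seven, which gives the row of a Monday-based calendar grid that the date falls in. A small sketch reproducing the values shown above for a few of the dates; the function name `week_of_month` is hypothetical.

```python
import numpy as np
import pandas as pd

def week_of_month(dates):
    """Week of the month, counting Monday-based calendar rows from 1."""
    first_weekday = dates.dt.to_period("M").dt.to_timestamp().dt.weekday
    return np.ceil((first_weekday + dates.dt.day) / 7.0).astype(int)

dates = pd.Series(pd.to_datetime(
    ["2010-02-05", "2010-02-26", "2010-03-05", "2013-06-28"]))
print(week_of_month(dates).tolist())  # [1, 4, 1, 5]
```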


In [70]:
df.dayofweek.value_counts()

Friday    536634
Name: dayofweek, dtype: int64
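
Since every observation falls on a Friday, a walk-forward evaluation only needs to step through Fridays; iterating over calendar days would leave six out of seven iterations with nothing to predict. A minimal sketch, assuming the 2012 evaluation window used in `fit_predict`:

```python
import pandas as pd

# Weekly steps anchored on Friday, matching the dates in the data
eval_dates = pd.date_range("2012-01-01", "2012-10-26", freq="W-FRI")

print(eval_dates[0].date(), eval_dates[-1].date(), len(eval_dates))
print(set(eval_dates.day_name()))
```

The window starts on the first Friday on or after the start date, so the range runs from 2012-01-06 to 2012-10-26.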

In [76]:
train.tail(100)

Unnamed: 0,store,dept,date,weekly_sales,isholiday
421470,45,98,2010-12-03,349.38,False
421471,45,98,2010-12-10,665.96,False
421472,45,98,2010-12-17,222.48,False
421473,45,98,2010-12-24,222.48,False
421474,45,98,2010-12-31,74.55,True
421475,45,98,2011-01-07,122.5,False
421476,45,98,2011-01-14,42.0,False
421477,45,98,2011-01-21,2.0,False
421478,45,98,2011-01-28,77.0,False
421479,45,98,2011-02-04,57.25,False
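
The lag-feature method in `SalesPredict` is still a stub. One possible shape for it, offered as an assumption rather than the author's design, is to shift `weekly_sales` within each (`store`, `dept`) series so that every row only sees earlier weeks. The function name `add_lag_features` and the choice of lags are hypothetical; the sample values are the first rows of the output above.

```python
import pandas as pd

def add_lag_features(df, lags=(1, 2)):
    """Hypothetical helper: add lagged weekly_sales per (store, dept) series."""
    df = df.sort_values(["store", "dept", "date"]).copy()
    grouped = df.groupby(["store", "dept"])["weekly_sales"]
    for lag in lags:
        # shift() runs inside each group, so one series never borrows another's history
        df[f"sales_lag_{lag}"] = grouped.shift(lag)
    return df

sample = pd.DataFrame({
    "store": [45] * 4,
    "dept": [98] * 4,
    "date": pd.to_datetime(["2010-12-03", "2010-12-10", "2010-12-17", "2010-12-24"]),
    "weekly_sales": [349.38, 665.96, 222.48, 222.48],
})
lagged = add_lag_features(sample)
print(lagged[["date", "weekly_sales", "sales_lag_1", "sales_lag_2"]])
```

The first rows of each series have no history and come out as `NaN`, which LightGBM can handle natively. When forecasting several weeks ahead, the lags would need to be at least as long as the forecast horizon.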
