## Problem
> Sales in some seasons run significantly above or below average. A company that does not anticipate these swings can lose a great deal of money, so predicting future sales is one of its most important planning tasks. A sales forecast helps the company manage inventory, estimate revenue, and decide whether to make new investments. Knowing future sales also helps it hit the targets set at the start of a season, which can improve the stock price and investors' perceptions; conversely, missing a projected target can significantly damage the stock price. For a company as large as Walmart, that is a serious problem.

## Aim
> The aim of this project is to build a model that predicts store sales. With such a model, Walmart decision-makers can plan ahead: managing inventory, estimating revenue, and deciding whether to make new investments.

## Steps

In [1]:
import logging
import warnings

warnings.filterwarnings("ignore")  # silence library warnings in the notebook output

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import category_encoders as ce
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder

sns.set_style('darkgrid')
sns.set_palette("bright")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [58]:
class SalesPredict:

    """
    Predict sales per store.
     - train: 2010-02-05 ~ 2012-10-26
     - test: 2012-11-02 ~ 2013-07-26
    """
    
    def __init__(self):
        self.file_path = "data"        
        self.datasets = ['features', 'stores', 'train', 'test']
        self.loaded_datasets = {}
        self.label_encoder = LabelEncoder()
        self.target_encoder = ce.TargetEncoder(cols=['warehouse'])        
        self.base_model = LGBMRegressor()
        self.use_vote_regressor = False
        self.opt_param = False  # if True, run hyperparameter optimization
        self.param_space = {
                            'learning_rate': np.logspace(-3, 0, num=100),
                            'n_estimators': np.arange(50, 300, step=10),
                            'max_depth': np.arange(3, 11),
                            'num_leaves': np.arange(10, 101),
                            'min_child_samples': np.arange(5, 31)
                            } 
        
        
    def load_data(self):
        """
        Load the datasets.
        """
        for data in self.datasets:
            df = pd.read_csv(f"{self.file_path}/{data}.csv") 
            df.columns = df.columns.str.lower()
            self.loaded_datasets[data] = df
        
        features = self.loaded_datasets['features']
        stores = self.loaded_datasets['stores']
        train = self.loaded_datasets['train']
        test = self.loaded_datasets['test']
        
        return features, stores, train, test
    
    
    def merge_data(self):
    
        """
        Merge the separate datasets into one frame.
        """
        # Use this instance rather than a throwaway one, so self.loaded_datasets is populated.
        features, stores, train, test = self.load_data()
        y_data = pd.concat([train, test])
            
        df = features.merge(stores, on='store', how='left')\
                     .merge(y_data, on=['store', 'date', 'isholiday'], how='left')
        
        # shape check
        assert len(df) == len(train)+len(test)
        print(df.shape)
        return df

In [59]:
sales_predictor = SalesPredict()

In [60]:
df = sales_predictor.merge_data()

(536634, 16)
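
The row-count assertion in `merge_data` only holds if every row in `features` finds a match and no keys are duplicated unexpectedly. As a hedged sketch, the same merge can be checked more explicitly with the `validate` and `indicator` arguments of `DataFrame.merge`. The three small frames below are made-up stand-ins for the real CSV files.

```python
import pandas as pd

# Toy stand-ins for features.csv, stores.csv and the concatenated train/test
# frame. All values are invented for illustration.
features = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "date": ["2010-02-05", "2010-02-12", "2010-02-05", "2010-02-12"],
    "isholiday": [False, True, False, True],
    "temperature": [42.31, 38.51, 40.19, 35.20],
})
stores = pd.DataFrame({
    "store": [1, 2],
    "type": ["A", "B"],
    "size": [151315, 202307],
})
sales = pd.DataFrame({
    "store": [1, 1, 1, 2, 2, 2],
    "dept": [1, 2, 1, 1, 1, 2],
    "date": ["2010-02-05", "2010-02-05", "2010-02-12",
             "2010-02-05", "2010-02-12", "2010-02-12"],
    "isholiday": [False, False, True, False, True, True],
    "weekly_sales": [100.0, 200.0, 110.0, 300.0, 310.0, 50.0],
})

merged = (
    features
    # Each store appears once in `stores`; raise if that is not the case.
    .merge(stores, on="store", how="left", validate="many_to_one")
    # One feature row per (store, date) fans out to many department rows.
    .merge(sales, on=["store", "date", "isholiday"], how="left",
           validate="one_to_many", indicator=True)
)

# Every feature row found at least one sales row, so no rows were invented.
assert (merged["_merge"] == "both").all()
assert len(merged) == len(sales)
print(merged.shape)
```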


In [61]:
df

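| | store | date | temperature | fuel_price | markdown1 | markdown2 | markdown3 | markdown4 | markdown5 | cpi | unemployment | isholiday | type | size | dept | weekly_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2010-02-05 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | A | 151315 | 1 | 24924.50 |
| 1 | 1 | 2010-02-05 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | A | 151315 | 2 | 50605.27 |
| 2 | 1 | 2010-02-05 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | A | 151315 | 3 | 13740.12 |
| 3 | 1 | 2010-02-05 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | A | 151315 | 4 | 39954.04 |
| 4 | 1 | 2010-02-05 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | A | 151315 | 5 | 32229.38 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 536629 | 45 | 2013-07-26 | 76.06 | 3.804 | 212.02 | 851.73 | 2.06 | 10.88 | 1864.57 | NaN | NaN | False | B | 118221 | 93 | NaN |
| 536630 | 45 | 2013-07-26 | 76.06 | 3.804 | 212.02 | 851.73 | 2.06 | 10.88 | 1864.57 | NaN | NaN | False | B | 118221 | 94 | NaN |
| 536631 | 45 | 2013-07-26 | 76.06 | 3.804 | 212.02 | 851.73 | 2.06 | 10.88 | 1864.57 | NaN | NaN | False | B | 118221 | 95 | NaN |
| 536632 | 45 | 2013-07-26 | 76.06 | 3.804 | 212.02 | 851.73 | 2.06 | 10.88 | 1864.57 | NaN | NaN | False | B | 118221 | 97 | NaN |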
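
Because train and test are concatenated before merging, the combined frame has to be split again before modeling. The sketch below shows one possible way to do that, using the date ranges from the class docstring (train ends on 2012-10-26). The toy frame and the names `cutoff`, `train_part`, and `test_part` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking the merged output. The weekly_sales values are made up,
# and test rows have no target, as in the output above.
df = pd.DataFrame({
    "store": [1, 1, 45, 45],
    "dept": [1, 1, 93, 93],
    "date": ["2012-10-19", "2012-10-26", "2012-11-02", "2013-07-26"],
    "weekly_sales": [100.0, 120.0, None, None],
})
df["date"] = pd.to_datetime(df["date"])

# Last training week according to the SalesPredict docstring.
cutoff = pd.Timestamp("2012-10-26")

train_part = df[df["date"] <= cutoff]
test_part = df[df["date"] > cutoff]

# Sanity check: the target is present in train rows and missing in test rows.
assert train_part["weekly_sales"].notna().all()
assert test_part["weekly_sales"].isna().all()
print(len(train_part), len(test_part))
```

Splitting on the date rather than on `weekly_sales.isna()` keeps the split correct even if some training rows were to have a missing target.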
