This kernel tackles the [Predict Future Sales Competition](https://www.kaggle.com/c/competitive-data-science-predict-future-sales) on Kaggle.

**Competition Description:**


This challenge serves as the final project for the ["How to win a data science competition"](https://www.coursera.org/learn/competitive-data-science/home/welcome) Coursera course.

In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - [1C Company](http://1c.ru/eng/title.htm). 

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

# Import Libraries
First, we import the libraries used throughout the notebook:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# Import The Data

In [None]:
items = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/items.csv')
shops = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
item_categories = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')

train = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv')
test = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/test.csv')

sample_submission = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sample_submission.csv')

# Read The Data

In [None]:
print('item_categories')
display(item_categories.head())

print('items')
display(items.head())

print('shops')
display(shops.head())

print('train')
display(train.head())

print('test')
display(test.head())

print('sample_submission')
display(sample_submission.head())

- Check train info

In [None]:
train.info()

- Check for missing values

In [None]:
print('train')
display(train.isnull().sum())

print('test')
display(test.isnull().sum())

- Quick look using the `describe()` function

In [None]:
print('train')
display(train.describe(include='all'))

print('test')
display(test.describe(include='all'))

**Quick observations:**
- There are no missing values.
- The train and test datasets do not share the same features.
- There are negative values in item_price.
- There are negative values in item_cnt_day.

# Exploratory Data Analysis

### Removing Duplicates

In [None]:
#drop duplicates
subset = ['date','date_block_num','shop_id','item_id','item_cnt_day']
print(train.duplicated(subset=subset).value_counts())
train.drop_duplicates(subset=subset, inplace=True)

### Check negative values in item_price

In [None]:
train[train['item_price'] < 0]

Since there is only one negative value in item_price, we can drop it without materially affecting the predictions.

In [None]:
#keep only rows with a strictly positive item_price (this drops the single negative value)
train = train[train['item_price'] > 0]

In [None]:
#keep only rows with a strictly positive item_cnt_day (this drops the negative values seen above)
train = train[train['item_cnt_day'] > 0]

### Cleaning item_price and item_cnt_day
- Check min and max

In [None]:
#pass the series explicitly as x so that the box is drawn horizontally
sns.boxplot(x=train['item_price']);

In [None]:
#pass the series explicitly as x so that the box is drawn horizontally
sns.boxplot(x=train['item_cnt_day']);

- Drop outliers

In [None]:
#define a drop outliers function
def drop_outliers(df, feature, percentile_high = .99):
    '''Drop rows whose value in `feature` is at or above an upper quantile.

       df (dataframe)           : dataset
       feature (string)         : column
       percentile_high (float)  : upper quantile (between 0 and 1) used as the cutoff
    '''
    #train size before dropping values
    shape_init = df.shape[0]
    
    #get percentile value
    max_value = df[feature].quantile(percentile_high)
    
    #drop outliers
    print('dropping outliers...')
    df = df[df[feature] < max_value]
    
    print(str(shape_init - df.shape[0]) + ' ' + feature + ' values over ' + str(max_value) + ' have been removed' )
    
    return df
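
To make the quantile cutoff concrete, here is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the competition files). It applies the same rule as `drop_outliers`: compute the upper quantile, then keep only rows strictly below it.

```python
import pandas as pd

# Hypothetical toy data: 99 ordinary prices and one extreme price.
toy = pd.DataFrame({"item_price": list(range(1, 100)) + [100_000]})

# Same rule as drop_outliers: compute the upper quantile ...
cutoff = toy["item_price"].quantile(0.99)

# ... and keep only the rows strictly below it.
trimmed = toy[toy["item_price"] < cutoff]

print(f"cutoff={cutoff:.2f}, rows before={len(toy)}, rows after={len(trimmed)}")
```

Because the comparison is strict, any row exactly equal to the cutoff is removed as well.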

In [None]:
#drop outliers in item_price feature
train = drop_outliers(train, 'item_price')

In [None]:
#drop outliers in item_cnt_day
train = drop_outliers(train, 'item_cnt_day')

### Price
Build a dataframe of item_price grouped by shop_id and item_id to get a price for each item per shop (the mean of the last two recorded prices). We can use this dataframe to create the item_price feature for the test dataset.

In [None]:
prices_shop_df = train[['shop_id','item_id','item_price']]
prices_shop_df = prices_shop_df.groupby(['shop_id','item_id']).apply(lambda df: df['item_price'][-2:].mean())
prices_shop_df = prices_shop_df.to_frame(name = 'item_price')

prices_shop_df
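
The "mean of the last two prices per shop and item" idea can be checked on a tiny example. The sketch below uses hypothetical values and an equivalent formulation with `tail(2)`; note that "last" refers to row order, so it only means "most recent" if the rows are sorted by date, which is an assumption here.

```python
import pandas as pd

# Hypothetical toy sales: three prices for item 10 in shop 0, one in shop 1.
toy = pd.DataFrame({
    "shop_id":    [0, 0, 0, 1],
    "item_id":    [10, 10, 10, 10],
    "item_price": [100.0, 120.0, 140.0, 90.0],
})

# Keep the last two rows of each (shop_id, item_id) group ...
last_two = toy.groupby(["shop_id", "item_id"]).tail(2)

# ... and average them to get one price per shop and item.
toy_prices = (
    last_two.groupby(["shop_id", "item_id"])["item_price"]
    .mean()
    .to_frame(name="item_price")
)

print(toy_prices)
```

A pair with a single recorded price simply keeps that price.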

Now we can merge this dataframe with test dataset to create item_price feature in test dataset.

In [None]:
test = pd.merge(test, prices_shop_df, how='left', on=['shop_id','item_id'])

test.head()

In [None]:
#check for missing values
test['item_price'].isnull().sum()

There are still missing values in the test dataset's item_price. We will fill them later using features derived from item_categories.

### Aggregate the Train Dataset by Month

In [None]:
#split content in date into month and year
train['month'] = [date.split('.')[1] for date in train['date']]
train['year'] = [date.split('.')[2] for date in train['date']]

#drop date and date_block_num features
train.drop(['date','date_block_num'], axis=1, inplace=True)

#create month and year features for the test dataset
test['month'] = '11'
test['year'] = '2015'

In [None]:
#change item_cnt_day into item_cnt_month
train_monthly = train.groupby(['year','month','shop_id','item_id'], as_index=False)[['item_cnt_day']].sum()
train_monthly.rename(columns={'item_cnt_day': 'item_cnt_month'}, inplace=True)

train_monthly = pd.merge(train_monthly, prices_shop_df, how='left', left_on=['shop_id','item_id'], right_on=['shop_id','item_id'])

train_monthly.head()

In [None]:
train = train_monthly
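
As a small illustration of the daily-to-monthly aggregation, the sketch below sums hypothetical daily counts per month, shop and item. It parses the `dd.mm.yyyy` dates with `pd.to_datetime` instead of splitting strings, which is an alternative to the approach above and yields integer rather than string month and year columns.

```python
import pandas as pd

# Hypothetical daily sales for one shop and one item.
toy = pd.DataFrame({
    "date":         ["02.01.2013", "15.01.2013", "03.02.2013"],
    "shop_id":      [5, 5, 5],
    "item_id":      [42, 42, 42],
    "item_cnt_day": [1.0, 2.0, 4.0],
})

# Parse the dd.mm.yyyy strings explicitly, then extract year and month.
parsed = pd.to_datetime(toy["date"], format="%d.%m.%Y")
toy["year"] = parsed.dt.year
toy["month"] = parsed.dt.month

# Sum the daily counts within each (year, month, shop, item) group.
toy_monthly = (
    toy.groupby(["year", "month", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

print(toy_monthly)
```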

### Reindex test dataset

In [None]:
test = test.reindex(columns=['ID','year','month','shop_id','item_id','item_price'])

test.head()

### Exploring other datasets
- Exploring Item Categories dataset

In [None]:
#extract main categories
item_categories['main_category'] = [x.split(' - ')[0] for x in item_categories['item_category_name']]

#some categories have no sub-category; for those we use the string 'None' as the sub-category
#(the commented-out line below would reuse the main category instead)
sub_categories = []
for i in range(len(item_categories)):
    try:
        sub_categories.append(item_categories['item_category_name'][i].split(' - ')[1])
        
    except IndexError:
        sub_categories.append('None')
        #sub_categories.append(item_categories['main_category'][i])

item_categories['sub_category'] = sub_categories

#drop item_category_name
item_categories.drop(['item_category_name'], axis=1, inplace=True)

item_categories.head()
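
The same split can also be written without an explicit loop. The following is a sketch on hypothetical category names (not the real Russian ones), using the pandas string accessor; `.str[1]` yields a missing value when there is no second part, which is then replaced by the string `'None'`.

```python
import pandas as pd

# Hypothetical category names: one with a sub-category, one without.
toy = pd.DataFrame({"item_category_name": ["Books - Audiobooks", "Delivery"]})

# Split every name on ' - ' into a list of parts.
parts = toy["item_category_name"].str.split(" - ")

# First part is the main category; a missing second part becomes 'None'.
toy["main_category"] = parts.str[0]
toy["sub_category"] = parts.str[1].fillna("None")

print(toy)
```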

- Exploring Items Dataset

In [None]:
#merge with item_categories
items = pd.merge(items, item_categories, how='left')

#drop item_name and item_category_id
items.drop(['item_name','item_category_id'], axis=1, inplace=True)

items.head()

In [None]:
#merge to train and test datasets
train = pd.merge(train, items, how='left')
test = pd.merge(test, items, how='left')

- Exploring Shops Dataset

In [None]:
from string import punctuation

# replace all the punctuation in the shop_name columns
shops["shop_name_cleaned"] = shops["shop_name"].apply(lambda s: "".join([x for x in s if x not in punctuation]))

# extract the city name
shops["shop_city"] = shops["shop_name_cleaned"].apply(lambda s: s.split()[0])

#extract the type
shops["shop_type"] = shops["shop_name_cleaned"].apply(lambda s: s.split()[1])

#extract shop's name
shops["shop_name"] = shops["shop_name_cleaned"].apply(lambda s: " ".join(s.split()[2:]))

shops.drop(['shop_name_cleaned'], axis=1, inplace=True)

shops.head()
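
The parsing above assumes every shop name has at least two words after punctuation is removed; a one-word name would raise an `IndexError` in `s.split()[1]`. A more forgiving sketch, using illustrative shop names that are assumptions rather than rows from `shops.csv`, relies on `.str.get`, which returns a missing value instead of failing.

```python
import pandas as pd
from string import punctuation

# Illustrative shop names (assumed for this example): "city type name" and a one-word name.
toy = pd.DataFrame({"shop_name": ['Москва ТЦ "Семеновский"', "Выездная"]})

# Remove ASCII punctuation, then split on whitespace into tokens.
table = str.maketrans("", "", punctuation)
tokens = toy["shop_name"].str.translate(table).str.split()

# .str.get returns NaN when the requested position does not exist.
toy["shop_city"] = tokens.str.get(0)
toy["shop_type"] = tokens.str.get(1)

print(toy)
```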

In [None]:
#merge to train and test datasets
train = pd.merge(train, shops, how='left')
test = pd.merge(test, shops, how='left')

Display current train and test datasets

In [None]:
print('train')
display(train.head())

print('test')
display(test.head())

### Fill missing values in item_price (by item categories)

In [None]:
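#fill missing values with the median of each (main_category, sub_category) group
#transform('median') keeps the original index, so the result aligns row by row with test
group_median = test.groupby(['main_category','sub_category'])['item_price'].transform('median')
test['item_price'] = test['item_price'].fillna(group_median)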

In [None]:
test['item_price'].isnull().sum()

In [None]:
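#fill missing values with the median of each sub_category
#transform('median') keeps the original index, so the result aligns row by row with test
sub_median = test.groupby(['sub_category'])['item_price'].transform('median')
test['item_price'] = test['item_price'].fillna(sub_median)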

In [None]:
test['item_price'].isnull().sum()

Show remaining missing values

In [None]:
test[test['item_price'].isnull()]

All remaining missing item_price values share the same main_category and sub_category. This category pair has no priced rows in the test dataset, but it does appear in the train dataset.

In [None]:
#fill missing values with median of main_category and sub_category from train dataset
filler = train[(train['main_category'] == 'PC') & (train['sub_category'] == 'Гарнитуры/Наушники')]['item_price'].median()

#assign the result back instead of calling fillna(inplace=True) on a selected column
test['item_price'] = test['item_price'].fillna(filler)

In [None]:
test['item_price'].isnull().sum()
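
The fallback chain used in this section (finest grouping first, coarser grouping next) can be sketched on a small frame. The data below is hypothetical, and `transform('median')` is used so that each group's median lines up with the original rows.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two gaps.
toy = pd.DataFrame({
    "main_category": ["A", "A", "A", "B"],
    "sub_category":  ["x", "x", "y", "y"],
    "item_price":    [10.0, np.nan, np.nan, 30.0],
})

# Level 1: median within each (main_category, sub_category) pair.
level1 = toy.groupby(["main_category", "sub_category"])["item_price"].transform("median")
toy["item_price"] = toy["item_price"].fillna(level1)

# Level 2: for rows still missing, median within each sub_category.
level2 = toy.groupby("sub_category")["item_price"].transform("median")
toy["item_price"] = toy["item_price"].fillna(level2)

print(toy)
```

The row in group (A, y) has no priced neighbour at level 1, so it stays missing until the coarser sub_category median fills it.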

### Exploratory Data Analysis: Epilogue
- According to the competition's evaluation notes, true target values are clipped into the [0, 20] range.

In [None]:
train['item_cnt_month'] = train['item_cnt_month'].clip(0,20)

- Define target_array

In [None]:
target_array = train['item_cnt_month']
train.drop(['item_cnt_month'], axis=1, inplace=True)

test_id = test['ID']
test.drop(['ID'], axis=1, inplace=True)

- Drop shop_id & item_id

In [None]:
train.drop(['shop_id','item_id'], axis=1, inplace=True)
test.drop(['shop_id','item_id'], axis=1, inplace=True)

- Reduce memory usage

In [None]:
def downcast_dtypes(df):
    '''df (dataframe)  : data
       Changes column types in the dataframe
           `float64` type to `float32`
           `int64`   type to `int32`
    '''
    
    # Select columns to downcast
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols =   [c for c in df if df[c].dtype == "int64"]
    
    # Downcast
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols]   = df[int_cols].astype(np.int32)
    
    return df

In [None]:
#reduce memory
downcast_dtypes(train)
downcast_dtypes(test)
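
To see what the downcast buys, one can compare `memory_usage` before and after. This is a sketch on a hypothetical frame with one `float64` and one `int64` column; halving the width of each value halves the bytes used by the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 1000 rows, one float64 column and one int64 column.
toy = pd.DataFrame({
    "price": np.arange(1000, dtype="float64"),
    "count": np.arange(1000, dtype="int64"),
})

# Bytes used by the columns before downcasting (index excluded).
before = toy.memory_usage(index=False).sum()

# Downcast to 32-bit types, as downcast_dtypes does.
toy = toy.astype({"price": np.float32, "count": np.int32})
after = toy.memory_usage(index=False).sum()

print(f"before={before} bytes, after={after} bytes")
```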

In [None]:
train.info()

- Check for missing values

In [None]:
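#check for any missing data (isnull().any().sum() counts columns that contain at least one missing value)
print('columns with missing data in the train dataset : ', train.isnull().any().sum())
print('columns with missing data in the test dataset : ', test.isnull().any().sum())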

- Normality test

In [None]:
#define a normality test function
def normalityTest(data, alpha=0.05):
    """data (array)   : The array containing the sample to be tested.
       alpha (float)  : Significance level.
       return True if data is normally distributed"""
    
    from scipy import stats
    
    statistic, p_value = stats.normaltest(data)
    
    #null hypothesis: array comes from a normal distribution
    if p_value < alpha:  
        #The null hypothesis can be rejected
        is_normal_dist = False
    else:
        #The null hypothesis cannot be rejected
        is_normal_dist = True
    
    return is_normal_dist

In [None]:
#check normality of all numerical features and log-transform those that are not normally distributed
for feature in train.columns:
    if (train[feature].dtype != 'object'):
        if not normalityTest(train[feature]):
            train[feature] = np.log1p(train[feature])
            test[feature] = np.log1p(test[feature])

In [None]:
#apply numpy.log1p so that target_array is closer to a normal distribution
target_array = np.log1p(target_array)
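
Since the target is transformed with `np.log1p`, predictions have to be mapped back with its inverse, `np.expm1`. The round trip can be verified on a few hypothetical counts:

```python
import numpy as np

# Hypothetical monthly counts inside the [0, 20] range.
counts = np.array([0.0, 1.0, 5.0, 20.0])

# log1p(x) = log(1 + x), which is defined at zero.
transformed = np.log1p(counts)

# expm1(x) = exp(x) - 1 undoes the transformation.
restored = np.expm1(transformed)

print(transformed)
print(restored)
```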

- Encoding

In [None]:
from sklearn.preprocessing import OrdinalEncoder

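#values that appear only in the test data are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

X = enc.fit_transform(train)
y = target_array

#reuse the encoder fitted on train so that each code means the same thing in both datasets
X_predict = enc.transform(test)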
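
One caveat with ordinal encoding is that the integer codes are only comparable between two frames when both are transformed by the same fitted encoder. The sketch below uses hypothetical city names and assumes scikit-learn 0.24 or later, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frames: the test frame contains a city never seen in training.
toy_train = pd.DataFrame({"shop_city": ["Moscow", "Kazan", "Moscow"]})
toy_test = pd.DataFrame({"shop_city": ["Kazan", "Tula"]})

# Fit once on the training frame; unseen categories map to -1.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(toy_train)

# Transform both frames with the same fitted encoder.
encoded_train = enc.transform(toy_train)
encoded_test = enc.transform(toy_test)

print(encoded_train.ravel(), encoded_test.ravel())
```

Fitting a second time on the test frame would assign codes from the test categories alone, so the same city could receive different codes in the two frames.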

# Creating a model

We begin by splitting the data into two subsets: one for training and one held out for validation.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .1, random_state = 0)

We will use an XGBRegressor model to predict total sales for every product and store in the next month.

In [None]:
from xgboost import XGBRegressor

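#create a model
#eval_metric and early_stopping_rounds are passed to the constructor,
#because recent xgboost releases no longer accept them in fit()
model = XGBRegressor(eval_metric="rmse", early_stopping_rounds=20)

#fitting (early stopping monitors the last entry in eval_set)
model.fit(
    X_train, 
    y_train, 
    eval_set=[(X_train, y_train), (X_test, y_test)], 
    verbose=True)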

In [None]:
#calculate Mean Squared Error on the held-out split (in log1p-transformed units)
from sklearn.metrics import mean_squared_error

print('MSE : ', mean_squared_error(y_test, model.predict(X_test)))
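
The MSE above is measured on the log1p-transformed target. To get a number in the original units of monthly counts, one option is to invert the transformation first and then compute the root mean squared error. The sketch below uses hypothetical values and also clips predictions to [0, 20], which is an added assumption mirroring the clipping of the target, not a step taken elsewhere in this notebook.

```python
import numpy as np

# Hypothetical true and predicted values on the log1p scale.
y_true_log = np.log1p(np.array([0.0, 1.0, 3.0]))
y_pred_log = np.log1p(np.array([0.0, 2.0, 3.0]))

# Map both back to the original scale of monthly counts.
y_true = np.expm1(y_true_log)
y_pred = np.clip(np.expm1(y_pred_log), 0, 20)  # assumption: clip to the target range

# Root mean squared error in original units.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"RMSE: {rmse:.4f}")
```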

In [None]:
#make a prediction
y_predict = model.predict(X_predict)

#transform the values back
y_predict = np.expm1(y_predict)

In [None]:
#save results to a file
results = pd.DataFrame({'ID': test_id, 'item_cnt_month': y_predict})
results.to_csv('my_submission.csv', index=False)