# Lab 03 - Predict future sales

## File descriptions
- **sales_train.csv** - the training set. Daily historical data from January 2013 to October 2015.
- **test.csv** - the test set. You need to forecast the sales for these shops and products for November 2015.
- **sample_submission.csv** - a sample submission file in the correct format.
- **items.csv** - supplemental information about the items/products.
- **item_categories.csv** - supplemental information about the item categories.
- **shops.csv** - supplemental information about the shops.

## Data fields
- **ID** - an Id that represents a (Shop, Item) tuple within the test set
- **shop_id** - unique identifier of a shop
- **item_id** - unique identifier of a product
- **item_category_id** - unique identifier of item category
- **item_cnt_day** - number of products sold. You are predicting a monthly amount of this measure
- **item_price** - current price of an item
- **date** - date in format dd.mm.yyyy
- **date_block_num** - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
- **item_name** - name of item
- **shop_name** - name of shop
- **item_category_name** - name of item category
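
The `date_block_num` convention above can be inverted with integer division and modulo. The following is a plain-Python sketch for illustration only; the helper name `block_to_year_month` is hypothetical and is not part of the lab code.

```python
from typing import Tuple


def block_to_year_month(date_block_num: int) -> Tuple[int, int]:
    """Convert a consecutive month number (0 = January 2013) to (year, month)."""
    year = 2013 + date_block_num // 12   # 12 blocks per calendar year
    month = date_block_num % 12 + 1      # calendar month in 1..12
    return year, month


print(block_to_year_month(0))   # (2013, 1)
print(block_to_year_month(33))  # (2015, 10)
print(block_to_year_month(34))  # (2015, 11)
```

Block 34 is therefore November 2015, the month to forecast.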

# Libraries

In [None]:
!pip install findspark
!pip install pyspark

In [None]:
import findspark
findspark.init()

import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession, Window

from pyspark.sql.functions import *
from pyspark.ml.feature import OneHotEncoder, VectorAssembler, StringIndexer

from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor, LinearRegression, RandomForestRegressor

from pyspark.ml.evaluation import RegressionEvaluator

# Create spark session

In [None]:
spark = SparkSession \
    .builder \
    .appName("Predict future sales") \
    .config("spark.ui.showConsoleProgress", "false") \
    .config("spark.driver.memory", "12g") \
    .getOrCreate()
spark.sparkContext.uiWebUrl

# Read data

### Main file

In [None]:
df = spark.read.csv('../input/competitive-data-science-predict-future-sales/sales_train.csv', header=True, inferSchema=True)
df = df.withColumn('date',to_date(df.date,'dd.MM.yyyy'))

df.show(3)
df.printSchema()
print("Count:",df.count())

### Items

In [None]:
items = spark.read.csv('../input/competitive-data-science-predict-future-sales/items.csv',header=True)

items.show(3,truncate=False)
items.printSchema()
print('Count:',items.count())

### Categories

In [None]:
categories = spark.read.csv('../input/competitive-data-science-predict-future-sales/item_categories.csv',header=True)

categories.show(3,truncate=False)
categories.printSchema()
print('Count:',categories.count())

### Shops

In [None]:
shops = spark.read.csv('../input/competitive-data-science-predict-future-sales/shops.csv',header=True)
shops.show(5)
shops.printSchema()

### Join tables

In [None]:
df = df.join(items,on='item_id',how='inner')
df = df.join(categories,on='item_category_id',how='inner')
df = df.join(shops,on='shop_id',how='inner')
df.printSchema()

# Preprocessing

In [None]:
df = df.where(expr(
    'item_price>0 and item_cnt_day>=0'
))

# Feature engineering

In [None]:
def groupCategory(init_category, index):
    # Map ranges of item_category_id to a coarse group name.
    # IDs outside every range below keep their original category name.
    if index in range(1, 8):
        return 'Access'
    elif index in range(10, 18):
        return 'Console'
    elif index in range(18, 25):
        return 'Consoles Game'
    elif index in range(26, 28):
        return 'Phone Game'
    elif index in range(28, 32):
        return 'CD Game'
    elif index in range(32, 37):
        return 'Card'
    elif index in range(37, 43):
        return 'Movie'
    elif index in range(43, 55):
        return 'Book'
    elif index in range(55, 61):
        return 'Music'
    elif index in range(61, 73):
        return 'Gift'
    elif index in range(73, 79):
        return 'Soft'
    else:
        return init_category

categories = categories.rdd.map(lambda r: (
    groupCategory(r[0], int(r[1])),
    r[1]
))
categories = categories.toDF([
    'item_category_name',
    'item_category_id'
])

df = df.drop('item_category_name')
df = df.join(categories,on='item_category_id',how='inner')
df.printSchema()

In [None]:
df = df.withColumnRenamed('item_cnt_day','labels')

df = df.groupby([
    'shop_id',
    'shop_name',
    'item_id',
    'item_name',
    'item_category_name',
    'date_block_num'
]).agg(
    sum('labels').alias('labels'),
    mean('item_price').alias('item_price')
)
df.printSchema()

In [None]:
df = df.where(expr('labels<=500'))

In [None]:
# Calendar month (1-12): date_block_num 0 is January 2013
df = df.withColumn('month',df['date_block_num']%12+1)
df = df.withColumn('isWinter',expr(
    'cast((10<=month and month<=12) as string)'
))
df.printSchema()

In [None]:
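# Lag features per (shop, item) series. lag() steps back by recorded rows,
# so a month without sales is skipped rather than counted as zero.
window = Window.partitionBy('shop_name','item_name').orderBy('date_block_num')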
df = df.withColumn(
    'lag_1', 
    lag('labels',offset=1,default=0).over(window)
)
df = df.withColumn(
    'lag_2', 
    lag('labels',offset=2,default=0).over(window)
)

df.printSchema()

# Train-test split

A time-series problem must be split chronologically rather than randomly: the training set is the earlier part of the data and the test set is the later part. Here, records whose `date_block_num` lies in `[2,27]` form the training set, and the later blocks (28 to 33) form the test set.

In [None]:
train = df.where(expr('2<=date_block_num and date_block_num<28'))
test = df.where(expr('date_block_num>=28'))

backup_train = train
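
As an illustration of the chronological split, here is a minimal plain-Python sketch on toy rows. The function name `time_split` and its parameter names are hypothetical; only the boundaries 2 and 28 come from the cell above.

```python
def time_split(rows, first_train_block=2, first_test_block=28):
    """Split rows by month number: earlier blocks train, later blocks test."""
    train = [
        r for r in rows
        if first_train_block <= r["date_block_num"] < first_test_block
    ]
    test = [r for r in rows if r["date_block_num"] >= first_test_block]
    return train, test


# One toy record per month, blocks 0..33 (January 2013 to October 2015).
rows = [{"date_block_num": b, "labels": 1.0} for b in range(34)]
train_rows, test_rows = time_split(rows)

print(len(train_rows))  # 26
print(len(test_rows))   # 6
```

Every training block precedes every test block, so the model is never evaluated on months earlier than those it was trained on.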

We first index the categorical features into integer values.

In [None]:
cat_col = [
    'shop_name',
    'item_category_name',
    'isWinter'
]
index_col = [c+"_index" for c in cat_col]

indexer = StringIndexer(
    inputCols = cat_col,
    outputCols = index_col,
    handleInvalid = 'keep'
).fit(train)

train = indexer.transform(train)
train.select(index_col).show(5)

The indexed columns are then encoded with one-hot encoding. This prevents the model from treating the category indices as if they had a magnitude.

In [None]:
ohe_col = [c+"_ohe" for c in cat_col]

ohe = OneHotEncoder(
    inputCols = index_col,
    outputCols = ohe_col,
    handleInvalid = 'keep'
).fit(train)

train = ohe.transform(train)
train.select(ohe_col).show(5)
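
To make the two encoding steps concrete, here is a minimal plain-Python sketch of indexing followed by one-hot encoding. It is not Spark's implementation: the helper names `fit_index` and `one_hot` and the sample shop names are made up, and the sketch keeps every category as a dense list, whereas Spark's `OneHotEncoder` drops the last category by default and returns sparse vectors.

```python
from collections import Counter


def fit_index(values):
    """Map each distinct label to an integer, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties are broken alphabetically (a choice of this sketch).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


shop_names = ["A", "B", "A", "C", "A", "B"]
mapping = fit_index(shop_names)
encoded = [one_hot(mapping[s], len(mapping)) for s in shop_names]

print(mapping)     # {'A': 0, 'B': 1, 'C': 2}
print(encoded[3])  # [0.0, 0.0, 1.0]
```

Because each category gets its own position, shop `C` is not treated as "twice" shop `B`, which an integer index alone would suggest to a linear model.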

Finally, the feature columns are combined into a vector called `features`.

In [None]:
feature_col = [
    'shop_name_ohe',
    'item_category_name_ohe',
    'isWinter_ohe',
    'month',
    'lag_1',
    'lag_2',
    'item_price'
]

assembler = VectorAssembler(
    inputCols=feature_col, 
    outputCol='features',
    handleInvalid='keep'
)
train = assembler.transform(train)

train.select('labels','features').show(5)

# Modeling

### Evaluator

In [None]:
evaluator = RegressionEvaluator(
    labelCol="labels", 
    predictionCol="prediction", 
    metricName="rmse"
)
train = backup_train
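
For reference, the `rmse` metric is the square root of the mean squared difference between labels and predictions. The following plain-Python sketch shows the arithmetic only; the function `rmse` is a hypothetical helper, not the evaluator's implementation.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    # math.fsum is used instead of the builtin sum, which the wildcard
    # import from pyspark.sql.functions shadows in this notebook.
    return math.sqrt(math.fsum(squared) / len(squared))


print(rmse([1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]))  # 2.0
```

Squaring penalizes large errors more heavily than small ones, so a few badly predicted (shop, item) pairs can dominate the score.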

### Model 1 - Gradient Boosting Tree

In [None]:
gbt = GBTRegressor(
    featuresCol = 'features', 
    labelCol = 'labels'
)
pipeline_1 = Pipeline(stages=[
    indexer,
    ohe,
    assembler, 
    gbt
])

model_1 = pipeline_1.fit(train)

In [None]:
train_pred = model_1.transform(train)
test_pred = model_1.transform(test)

test_pred.select("prediction", "labels").show(5)

print("RMSE train data = %g" % evaluator.evaluate(train_pred))
print("RMSE test data = %g" % evaluator.evaluate(test_pred))

### Model 2 - Linear regression

In [None]:
lr = LinearRegression(
    regParam = 0.3, 
    elasticNetParam = 0.2, 
    featuresCol = 'features',
    labelCol = 'labels'
)
pipeline_2 = Pipeline(stages=[
    indexer,
    ohe,
    assembler, 
    lr
])

model_2 = pipeline_2.fit(train)

In [None]:
train_pred = model_2.transform(train)
test_pred = model_2.transform(test)

test_pred.select("prediction", "labels").show(5)

print("RMSE train data = %g" % evaluator.evaluate(train_pred))
print("RMSE test data = %g" % evaluator.evaluate(test_pred))

### Model 3 - Random Forest

In [None]:
rf = RandomForestRegressor (
    featuresCol='features', 
    labelCol='labels', 
    maxDepth = 3,
    numTrees = 20
)

pipeline_3 = Pipeline(stages=[
    indexer,
    ohe,
    assembler, 
    rf
])

model_3 = pipeline_3.fit(train)

In [None]:
train_pred = model_3.transform(train)
test_pred = model_3.transform(test)

test_pred.select("prediction", "labels").show(5)

print("RMSE train data = %g" % evaluator.evaluate(train_pred))
print("RMSE test data = %g" % evaluator.evaluate(test_pred))

# Create test set for submission

In [None]:
best_pipeline = pipeline_3
best_model = best_pipeline.fit(df)

In [None]:
kaggle_test = spark.read.csv('../input/competitive-data-science-predict-future-sales/test.csv',header=True,inferSchema=True)

print("Count:",kaggle_test.count())
kaggle_test.show(3)
kaggle_test.printSchema()

In [None]:
# Join other files
kaggle_test = kaggle_test.join(items,on='item_id',how='left')
kaggle_test = kaggle_test.join(categories,on='item_category_id',how='left')
kaggle_test = kaggle_test.join(shops,on='shop_id',how='left')

# Join for price
kaggle_test = kaggle_test.join(
    df.select('shop_id','item_id','item_price')\
        .groupby('shop_id','item_id')\
        .agg(
            mean('item_price').alias('item_price')
        ),
    on = ['shop_id','item_id'],
    how = 'left'
)

# Create feature
kaggle_test = kaggle_test.withColumn('month',lit(11))
# Spark casts booleans to the lowercase strings 'true'/'false'
kaggle_test = kaggle_test.withColumn('isWinter',lit('true'))

# Create lag_1: November 2015 is block 34, so lag_1 is the sales of block 33
kaggle_test = kaggle_test.join(
    df.where('date_block_num==33').select('shop_id','item_id',col('labels').alias('lag_1')),
    on=['shop_id','item_id'],
    how='left'
)

# Create lag_2: the sales of block 32
kaggle_test = kaggle_test.join(
    df.where('date_block_num==32').select('shop_id','item_id',col('labels').alias('lag_2')),
    on=['shop_id','item_id'],
    how='left'
)

# Drop useless columns
kaggle_test = kaggle_test.drop(
    'shop_id',
    'item_id',
    'item_category_id',
)

# Fill missing numeric values with 0
kaggle_test = kaggle_test.fillna(0)

# Verify the number of samples
print("Count:",kaggle_test.count())
kaggle_test.printSchema()
kaggle_test.show(3)

In [None]:
kaggle_test = best_model.transform(kaggle_test)
kaggle_test.printSchema()

In [None]:
kaggle_test.select(
    'ID',
    col('prediction').alias('item_cnt_month')
).repartition(1).write.csv('submission',header=True)

# Conclusion

In general, this problem is hard to treat as an ordinary regression task because we do not know the future. It is actually a time-series problem, in which the data cannot be split randomly: the training set must be the earlier part of the data and the test set the later part.

Since PySpark still has many limitations, we only use the traditional machine learning models it provides (e.g. Random Forest, Linear Regression) together with feature engineering. Overall, the three tested models achieve an RMSE of roughly 4. The submitted model achieved a score of 2.88 on the Kaggle contest.

To improve performance on this problem, we could compute lags over more time steps and the differences between them (trends), or use ARMA, ARIMA, or RNN-like models.
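
As a rough illustration of the lag and trend features mentioned above, the plain-Python sketch below looks up sales by calendar month for a single (shop, item) series. The function `lag_features`, the feature name `trend`, and the sample numbers are all hypothetical. Unlike a row-offset `lag` over a window, which steps to the previous recorded row, this lookup treats a month with no recorded sales as zero.

```python
def lag_features(monthly_sales, block):
    """Build lag and trend features for one (shop, item) series.

    monthly_sales maps date_block_num -> units sold in that month.
    """
    lag_1 = monthly_sales.get(block - 1, 0)  # sales one month earlier
    lag_2 = monthly_sales.get(block - 2, 0)  # sales two months earlier
    return {"lag_1": lag_1, "lag_2": lag_2, "trend": lag_1 - lag_2}


sales = {30: 4, 31: 6, 33: 3}  # block 32 has no recorded sales

print(lag_features(sales, 32))  # {'lag_1': 6, 'lag_2': 4, 'trend': 2}
print(lag_features(sales, 34))  # {'lag_1': 3, 'lag_2': 0, 'trend': 3}
```

A positive `trend` indicates sales were rising going into the target month, which a tree model or linear model can use in addition to the raw lags.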