## Review criteria
Before submitting the assignment, review the following criteria. Try to meet most of them and present your results in a form that can be easily assessed.

### Clarity
- Clear step-by-step instructions on how to produce the final submission file are provided
- Code has comments where they are needed and meaningful function names

### Feature preprocessing and generation with respect to models
- Several simple features are generated
- For non-tree-based models, preprocessing is used or its absence is explained

### Feature extraction from text and images
- Features from text are extracted
- Special preprocessing for text is applied (TF-IDF, stemming, Levenshtein distance...)

### EDA
- Several interesting observations about data are discovered and explained
- Target distribution is visualized, time trend is assessed

### Validation
- Type of train/test split is identified and used for validation
- Type of public/private split is identified
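
One way to mirror a time-based train/test split is to hold out the most recent month(s) for validation. The following is a minimal sketch, assuming a frame with a `date_block_num` month index like the one in `sales_train.csv`; the helper name `time_holdout_split` and the toy data are illustrative only.

```python
import pandas as pd


def time_holdout_split(df, time_col="date_block_num", n_valid_blocks=1):
    """Split df so that the last n_valid_blocks time blocks form the validation set."""
    blocks = sorted(df[time_col].unique())
    valid_blocks = blocks[-n_valid_blocks:]
    is_valid = df[time_col].isin(valid_blocks)
    return df[~is_valid], df[is_valid]


# Toy data: three months, two rows each (hypothetical values)
toy = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2, 2],
    "item_cnt": [1, 2, 3, 4, 5, 6],
})
train_part, valid_part = time_holdout_split(toy)
print(train_part["date_block_num"].max(), valid_part["date_block_num"].min())
```

Every validation row is later in time than every training row, so the validation score is not inflated by information from the future.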

### Data leakages
- The data is investigated for leakage, and the investigation process is described
- Any leaks that are found are exploited

### Metrics optimization
- Correct metric is optimized
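
As an illustration, here is a small sketch of a validation metric. It assumes the metric is RMSE computed on values clipped to [0, 20], which is consistent with the `mse` loss and the `clip(0,20)` call used later in this notebook; the function name `clipped_rmse` is hypothetical.

```python
import numpy as np


def clipped_rmse(y_true, y_pred, low=0.0, high=20.0):
    """RMSE after clipping both targets and predictions to [low, high]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), low, high)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), low, high)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values: 30 and 25 are both clipped to 20, so only the first pair contributes
score = clipped_rmse([0, 5, 30], [1, 5, 25])
print(score)
```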

### Advanced Features I: mean encodings
- Mean-encoding is applied
- Mean-encoding is set up correctly, i.e. KFold or expanding scheme are utilized correctly
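
The KFold scheme can be sketched as follows: each row is encoded with target means computed on the other folds only. This is a minimal illustration, not a reference implementation; the function name `kfold_mean_encode`, the column names, and the toy data are assumptions, and the folds are built with plain NumPy rather than a library splitter.

```python
import numpy as np
import pandas as pd


def kfold_mean_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Out-of-fold mean encoding: a row never sees its own target value."""
    global_mean = df[target_col].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(df)), n_splits)
    for fold_idx in folds:
        in_fold = np.zeros(len(df), dtype=bool)
        in_fold[fold_idx] = True
        # Category means estimated on the remaining folds only
        means = df.loc[~in_fold].groupby(cat_col)[target_col].mean()
        encoded.iloc[fold_idx] = df[cat_col].iloc[fold_idx].map(means).to_numpy()
    # Categories unseen in the other folds fall back to the global mean
    return encoded.fillna(global_mean)


toy = pd.DataFrame({
    "item_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "target": [0, 1, 0, 1, 5, 5, 5, 5],
})
toy["item_id_enc"] = kfold_mean_encode(toy, "item_id", "target", n_splits=4)
print(toy)
```

For time-ordered data, an expanding scheme (using only earlier rows for each estimate) is the alternative named in the criterion above.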

### Advanced Features II
- At least one feature from this topic is introduced

### Hyperparameter tuning
- Parameters of models are roughly optimal

### Ensembles
- Ensembling is utilized (linear combination counts)
- Validation with ensembling scheme is set up correctly, i.e. KFold or Holdout is utilized
- Models from different classes are utilized (at least two from the following: KNN, linear models, RF, GBDT, NN)
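
A linear combination fitted on a holdout set can be sketched like this. The two prediction vectors are synthetic stand-ins for the holdout predictions of two different models trained on the training part only; `fit_blend_weights` is a hypothetical helper built on NumPy's least-squares solver.

```python
import numpy as np


def fit_blend_weights(valid_preds, y_valid):
    """Least-squares weights for a linear combination of model predictions."""
    weights, *_ = np.linalg.lstsq(valid_preds, y_valid, rcond=None)
    return weights


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


rng = np.random.RandomState(0)
y_valid = rng.rand(200)
# Synthetic holdout predictions of two models with independent errors
pred_a = y_valid + rng.normal(scale=0.1, size=200)
pred_b = y_valid + rng.normal(scale=0.1, size=200)

valid_preds = np.column_stack([pred_a, pred_b])
weights = fit_blend_weights(valid_preds, y_valid)
blend = valid_preds @ weights
print(weights, rmse(y_valid, pred_a), rmse(y_valid, pred_b), rmse(y_valid, blend))
```

Because the weights are fitted on the same holdout rows they are scored on here, a fair estimate of the blend's quality needs a further split or KFold.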

In [5]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load
from keras.models import Sequential
from keras.layers import LSTM,Dense,Dropout
from keras import optimizers


import pandas as pd
import numpy as np
import gc
import matplotlib.pyplot as plt
%matplotlib inline 

pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 50)

import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from tqdm import tqdm_notebook

from itertools import product

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

['items.csv', 'sample_submission.csv', 'test.csv', 'sales_train.csv', 'item_categories.csv', 'shops.csv']


Step 0: load and print out the data

In [6]:
input_dir = '../input/'
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_columns', 500)
shops = pd.read_csv(input_dir + 'shops.csv')
print("Shops:\n", shops.head(), '\n', shops.shape, '\n')
item_cat = pd.read_csv(input_dir + 'item_categories.csv')
print("item_categories: \n", item_cat.head(), '\n', item_cat.shape, '\n')
sales_data = pd.read_csv(input_dir + 'sales_train.csv')
print("sales_train: \n", sales_data.head(), '\n', sales_data.shape, '\n')
items = pd.read_csv(input_dir + 'items.csv')
print("items: \n", items.head(), '\n', items.shape, '\n')
sample_submission = pd.read_csv(input_dir + 'sample_submission.csv')
print("sample_submission: \n", sample_submission.head(), '\n', sample_submission.shape, '\n')
test_data = pd.read_csv(input_dir + 'test.csv')
print("test: \n", test_data.head(), '\n', test_data.shape, '\n')

Shops:
              shop_name  shop_id
0  !Якутск Орджоник...        0
1  !Якутск ТЦ "Цент...        1
2     Адыгея ТЦ "Мега"        2
3  Балашиха ТРК "Ок...        3
4  Волжский ТЦ "Вол...        4 
 (60, 2) 

item_categories: 
     item_category_name  item_category_id
0  PC - Гарнитуры/Н...                 0
1     Аксессуары - PS2                 1
2     Аксессуары - PS3                 2
3     Аксессуары - PS4                 3
4     Аксессуары - PSP                 4 
 (84, 2) 

sales_train: 
          date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154      999.00           1.0
1  03.01.2013               0       25     2552      899.00           1.0
2  05.01.2013               0       25     2552      899.00          -1.0
3  06.01.2013               0       25     2554     1709.05           1.0
4  15.01.2013               0       25     2555     1099.00           1.0 
 (2935849, 6) 

items: 
              item_name  it

In [7]:
def simple_eda(df):
    """Print a quick overview of a DataFrame: head, info, statistics, dtypes, nulls, shape."""
    print("----------TOP 5 RECORDS--------")
    print(df.head(5))
    print("----------INFO-----------------")
    # df.info() prints its report itself and returns None, so print() also emits "None"
    print(df.info())
    print("----------Describe-------------")
    print(df.describe())
    print("----------Columns--------------")
    print(df.columns)
    print("----------Data Types-----------")
    print(df.dtypes)
    print("-------Missing Values----------")
    print(df.isnull().sum())
    print("-------NULL values-------------")
    # isna() is an alias of isnull(), so this repeats the counts above
    print(df.isna().sum())
    print("-----Shape Of Data-------------")
    print(df.shape)

In [8]:
print("Sales Data -------------------------> \n")
simple_eda(sales_data)
print("Test data -------------------------> \n")
simple_eda(test_data)
print("Item Categories -------------------------> \n")
simple_eda(item_cat)
print("Items -------------------------> \n")
simple_eda(items)
print("Shops -------------------------> \n")
simple_eda(shops)
print("Sample Submission -------------------------> \n")
simple_eda(sample_submission)

Sales Data -------------------------> 

----------TOP 5 RECORDS--------
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154      999.00           1.0
1  03.01.2013               0       25     2552      899.00           1.0
2  05.01.2013               0       25     2552      899.00          -1.0
3  06.01.2013               0       25     2554     1709.05           1.0
4  15.01.2013               0       25     2555     1099.00           1.0
----------INFO-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              object
date_block_num    int64
shop_id           int64
item_id           int64
item_price        float64
item_cnt_day      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB
None
----------Describe-------------
       date_block_num       shop_id       item_id    item_price  item_cnt_day
count    2.9

ID                0
item_cnt_month    0
dtype: int64
-----Shape Of Data-------------
(214200, 2)


Next step: pivot sales into one column per month for each shop/item pair, then join with the test set

In [9]:
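sales_data['date'] = pd.to_datetime(sales_data['date'], format='%d.%m.%Y')
# One row per (shop_id, item_id), one column per month. aggfunc='sum' turns daily counts
# into monthly totals; pivot_table would otherwise take the mean of the daily counts.
dataset = sales_data.pivot_table(index=['shop_id', 'item_id'], values=['item_cnt_day'],
                                 columns=['date_block_num'], aggfunc='sum', fill_value=0)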

In [10]:
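# Move shop_id and item_id from the index into columns so we can merge on them
dataset.reset_index(inplace=True)
# Keep exactly the (shop, item) pairs of the test set; pairs with no sales history get zeros
dataset = pd.merge(test_data, dataset, on=['item_id', 'shop_id'], how='left')
dataset.fillna(0, inplace=True)
# Drop the identifiers so that only the monthly sales columns remain
dataset.drop(['shop_id', 'item_id', 'ID'], inplace=True, axis=1)
print(dataset.head())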

   (item_cnt_day, 0)  (item_cnt_day, 1)  (item_cnt_day, 2)  (item_cnt_day, 3)  \
0                0.0                0.0                0.0                0.0   
1                0.0                0.0                0.0                0.0   
2                0.0                0.0                0.0                0.0   
3                0.0                0.0                0.0                0.0   
4                0.0                0.0                0.0                0.0   

   (item_cnt_day, 4)  (item_cnt_day, 5)  (item_cnt_day, 6)  (item_cnt_day, 7)  \
0                0.0                0.0                0.0                0.0   
1                0.0                0.0                0.0                0.0   
2                0.0                0.0                0.0                0.0   
3                0.0                0.0                0.0                0.0   
4                0.0                0.0                0.0                0.0   

   (item_cnt_day, 8)  (ite
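
The pivot step can be illustrated on a toy frame. Note that `pivot_table` aggregates with the mean by default, so `aggfunc='sum'` is needed to turn daily counts into monthly totals. The column names follow `sales_train.csv`; the values are made up.

```python
import pandas as pd

toy_sales = pd.DataFrame({
    "shop_id": [0, 0, 0, 1],
    "item_id": [10, 10, 10, 11],
    "date_block_num": [0, 0, 1, 1],
    "item_cnt_day": [1.0, 3.0, 2.0, 5.0],
})

# Monthly totals: two sales records in month 0 for (shop 0, item 10) add up to 4
monthly = toy_sales.pivot_table(index=["shop_id", "item_id"], columns="date_block_num",
                                values="item_cnt_day", aggfunc="sum", fill_value=0)

# Default aggregation is the mean, which gives 2 for the same cell
daily_mean = toy_sales.pivot_table(index=["shop_id", "item_id"], columns="date_block_num",
                                   values="item_cnt_day", fill_value=0)

print(monthly)
print(daily_mean)
```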

In [11]:
# For X we keep all columns except the last one
X_train = np.expand_dims(dataset.values[:,:-1],axis = 2)
# the last column is our label
y_train = dataset.values[:,-1:]

# for the test input we keep all columns except the first one
X_test = np.expand_dims(dataset.values[:,1:],axis = 2)

# let's look at the shapes
print(X_train.shape,y_train.shape,X_test.shape)

(214200, 33, 1) (214200, 1) (214200, 33, 1)
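
The slicing above shifts the input window by one month: the training input ends just before the last known month, and the test input ends on it. A toy sketch with two series of five months (made-up values) makes the shapes easier to follow.

```python
import numpy as np

# Two series, five months each (rows are shop/item pairs, columns are months)
monthly = np.arange(10, dtype=float).reshape(2, 5)

X_train = np.expand_dims(monthly[:, :-1], axis=2)  # months 0..3 as input
y_train = monthly[:, -1:]                          # month 4 as label
X_test = np.expand_dims(monthly[:, 1:], axis=2)    # months 1..4 as input for the next month

print(X_train.shape, y_train.shape, X_test.shape)
```

The last time step of `X_test` is the same column used as the training label, so the model predicts the month that follows it.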


In [12]:
# define the model: one LSTM layer, dropout, and a single-unit regression output
my_model = Sequential()
my_model.add(LSTM(units = 256,input_shape = (33,1)))
my_model.add(Dropout(0.5))
my_model.add(Dense(1))

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
my_model.compile(loss = 'mse',optimizer = sgd, metrics = ['mean_squared_error'])
my_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 256)               264192    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 264,449
Trainable params: 264,449
Non-trainable params: 0
_________________________________________________________________


In [None]:
my_model.fit(X_train,y_train,batch_size = 1000,epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

Final step: create the submission CSV

In [None]:
# predict next month's sales for every row of the test set
submission_pfs = my_model.predict(X_test)
# clip every prediction to the range [0, 20]
submission_pfs = submission_pfs.clip(0,20)
# build a dataframe with the required columns
submission = pd.DataFrame({'ID':test_data['ID'],'item_cnt_month':submission_pfs.ravel()})
# write the dataframe to a csv file without the index column
submission.to_csv('sub_pfs.csv',index = False)
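
Before uploading, a quick sanity check of the file can catch formatting mistakes. The sketch below is one possible check, assuming the `ID` and `item_cnt_month` columns shown in `sample_submission.csv` and the [0, 20] clipping range used above; `check_submission` is a hypothetical helper and the frames are toy data.

```python
import numpy as np
import pandas as pd


def check_submission(submission, expected_ids, low=0.0, high=20.0):
    """Return a list of problems found; an empty list means all checks passed."""
    if list(submission.columns) != ["ID", "item_cnt_month"]:
        return ["columns must be exactly ['ID', 'item_cnt_month']"]
    problems = []
    if not np.array_equal(submission["ID"].to_numpy(), np.asarray(expected_ids)):
        problems.append("IDs do not match the test set")
    preds = submission["item_cnt_month"]
    if preds.isna().any():
        problems.append("predictions contain NaN")
    if ((preds < low) | (preds > high)).any():
        problems.append("predictions fall outside the clipping range")
    return problems


good = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": [0.0, 1.5, 20.0]})
bad = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": [0.0, 1.5, 25.0]})
print(check_submission(good, [0, 1, 2]))
print(check_submission(bad, [0, 1, 2]))
```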