Getting Started


I have learned that data science is not only for systems engineers. It is for enthusiastic, creative people with a thirst for knowledge. Data is currently a gold mine; you just have to learn how to exploit it. There are tools, tutorials and communities to help, so I'm very excited to start my first project on something close to a real problem. Let's go.

My plan for this notebook:

* Understand the data provided by the competition and its data types
* Do EDA
* Pre-process the given data set and generate new features from existing ones, features that add value toward the objective of the challenge:
    numeric, categorical, datetime and coordinate features.
* Make decisions about handling missing data and cleaning data
* Find insights
* See the differences between test and train
* Build models on the dataset; train and test them
* Use different libraries to plot and visualize the data
* Create validations for the models
* Forecast the total amount of products sold in every shop for the test set
* Create a baseline and submit

First, it is necessary to import the libraries. A few of them:
* *Numpy (linear algebra)*: works with multidimensional arrays and has useful linear algebra routines and random number capabilities.
* *Pandas*: lets us process data in a SQL-like way. It is easy to use, intuitive and flexible.
* *os*: a Python standard-library module that provides functions for interacting with the operating system and its file system.
* *matplotlib*: lets us create production-quality graphics and many kinds of plots.
* *seaborn*: lets us create many kinds of plots. It is built on top of matplotlib and designed for advanced statistical graphics.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import scipy.sparse
import matplotlib.pyplot as plt
import seaborn as sns
import base64
from pandas.plotting import scatter_matrix
from keras.models import Sequential
from keras.layers import LSTM,Dense,Dropout
from IPython.display import FileLink, FileLinks
        
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Loading the data provided by the competition:**

In [None]:
# Uncomment if this is for submission
ROOT_FOLDER         = '/kaggle/input/competitive-data-science-predict-future-sales/'
df_items             =  pd.read_csv(os.path.join(ROOT_FOLDER, 'items.csv'))
df_item_categories   =  pd.read_csv(os.path.join(ROOT_FOLDER, 'item_categories.csv'))
df_sales_train       =  pd.read_csv(os.path.join(ROOT_FOLDER, 'sales_train.csv'))
df_shops             =  pd.read_csv(os.path.join(ROOT_FOLDER, 'shops.csv'))
df_test              =  pd.read_csv(os.path.join(ROOT_FOLDER, 'test.csv'))

Let's get to know the datasets:

In [None]:
print(' Dataset Items ')
df_items.head(1)

In [None]:
print(' Dataset item_categories ')
df_item_categories.head(1)

In [None]:
print(' Dataset sales_train ')
df_sales_train.head(1)

In [None]:
print(' Dataset test ')
df_test.head(1)

In [None]:
print(' Dataset shops ')
df_shops.head(1)

Let's analyze the ***sales_train*** data:

In [None]:
print('Train min/max date_block_num: %s / %s' % (df_sales_train.date_block_num.min(), df_sales_train.date_block_num.max()))
print('Sales Train shape: %d rows' % df_sales_train.shape[0])
print('Test: %d rows ' % df_test.shape[0])

I can see that train has roughly 14 times as many rows as test.
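Row counts alone do not say how much of the test set is actually covered by the training history. A minimal sketch of that check follows, using made-up toy frames in place of `df_sales_train` and `df_test` (the names `toy_train`, `toy_test` and `coverage` are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for df_sales_train and df_test (values are made up).
toy_train = pd.DataFrame({"shop_id": [1, 1, 2, 2], "item_id": [10, 11, 10, 10]})
toy_test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [1, 2, 3], "item_id": [10, 10, 10]})

# Unique (shop_id, item_id) pairs that have at least one training row.
train_pairs = toy_train[["shop_id", "item_id"]].drop_duplicates()

# indicator=True adds a _merge column: 'both' means the pair was seen in train.
coverage = toy_test.merge(train_pairs, on=["shop_id", "item_id"], how="left", indicator=True)
n_seen = int((coverage["_merge"] == "both").sum())
n_unseen = int((coverage["_merge"] == "left_only").sum())
print(f"seen in train: {n_seen}, never seen: {n_unseen}")
```

Test pairs that never appear in train end up with an all-zero history later in the notebook, so it can be useful to know how many there are.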

Let's look for missing values (NaN, null):

In [None]:
# Number of NaNs per column
df_sales_train.isnull().sum(axis=0).head(15)

In [None]:
# Number of NaNs per row (first 15 rows)
df_sales_train.isnull().sum(axis=1).head(15)

I group by shop_id, item_id, item_price and month (date_block_num):

In [None]:
index_cols = ['shop_id', 'item_id','item_price','date_block_num']

# sum the daily counts for each (shop_id, item_id, item_price, date_block_num)
gb_train = df_sales_train.groupby(index_cols, as_index=False).agg({'item_cnt_day': 'sum'})
gb_train = gb_train.rename(columns={'item_cnt_day': 'item_cnt_month'})
gb_train = gb_train[['shop_id', 'item_id','item_price','date_block_num','item_cnt_month']]
gb_train.fillna(0, inplace=True)
gb_train.head(2)

In [None]:
# Groups (shop, item, price, month) whose sales sum to zero: 2826 rows
gb_train[(gb_train.item_cnt_month == 0)]
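A group can sum to exactly zero even though it has rows. One plausible cause, stated here as an assumption rather than something verified in this notebook, is that a negative `item_cnt_day` (a return) cancels a sale in the same month. A toy sketch with made-up rows:

```python
import pandas as pd

# Made-up daily rows: one sale and one return of item 100 in the same month.
toy_daily = pd.DataFrame({
    "shop_id": [5, 5, 5],
    "item_id": [100, 100, 200],
    "date_block_num": [0, 0, 0],
    "item_cnt_day": [1.0, -1.0, 3.0],
})

# Same aggregation idea as above: sum the daily counts per group.
toy_monthly = (
    toy_daily.groupby(["shop_id", "item_id", "date_block_num"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# Groups whose monthly total is exactly zero.
zero_rows = toy_monthly[toy_monthly["item_cnt_month"] == 0]
print(toy_monthly)
```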

In [None]:
gb_train[((gb_train.shop_id == 2) & (gb_train.item_id == 835))]

In [None]:
gb_train.count()

Scatter plot of item price versus monthly sales, looking for outliers:

In [None]:
figure, axe = plt.subplots(figsize=(12, 12))
axe.set_title("EDA: Item Price vs. Monthly Sales", weight="bold")

# a fixed color is used, so no colormap is needed
plot = plt.scatter(gb_train.item_price, gb_train.item_cnt_month, marker="o", c="yellow", edgecolor="black", s=30, linewidth=0.5)
plt.xlabel('Item Price')
plt.ylabel('Monthly Sales')

I can see outliers where the price is above 300000 and where monthly sales are above 2000; therefore, I create a data set without these rows.

In [None]:
PRICE_OUT = 300000
SALES_OUT = 2000
gb_train = gb_train[(gb_train.item_price < PRICE_OUT) & (gb_train.item_cnt_month < SALES_OUT)]

In [None]:
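# aggfunc='sum' adds up rows that differ only in price; a string `values` keeps the month columns single-level for the merge
gb_train = gb_train.pivot_table(index=['shop_id', 'item_id'], columns='date_block_num', values='item_cnt_month', aggfunc='sum', fill_value=0).reset_index()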

In [None]:
gb_train.head(1)
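One detail worth checking on a toy frame: `pivot_table` aggregates with the mean by default, so when a (shop, item, month) combination appears on several rows (for instance, one row per price), the default gives the average of those rows rather than their total. The sketch below uses made-up numbers to show the difference:

```python
import pandas as pd

# Made-up rows: item 100 sold at two different prices in month 0.
toy = pd.DataFrame({
    "shop_id": [5, 5, 5],
    "item_id": [100, 100, 100],
    "item_price": [10.0, 12.0, 10.0],
    "date_block_num": [0, 0, 1],
    "item_cnt_month": [4.0, 2.0, 1.0],
})
keys = ["shop_id", "item_id"]

# Default aggregation (mean) versus an explicit sum.
wide_mean = toy.pivot_table(index=keys, columns="date_block_num",
                            values="item_cnt_month", fill_value=0)
wide_sum = toy.pivot_table(index=keys, columns="date_block_num",
                           values="item_cnt_month", aggfunc="sum", fill_value=0)

mean_month0 = float(wide_mean.loc[(5, 100), 0])  # average of 4 and 2
sum_month0 = float(wide_sum.loc[(5, 100), 0])    # total of 4 and 2
print(mean_month0, sum_month0)
```

Since the target is the total number of items sold per month, the summed version is the one that matches it.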

In [None]:
# Merge the test set with the pivoted training data so every test row gets its sales history
df_all_data = pd.merge(df_test, gb_train,on = ['item_id','shop_id'],how = 'left')
# Fill NaNs with 0 (test pairs that never appear in train)
df_all_data.fillna(0, inplace=True)

In [None]:
df_all_data.drop(['ID','shop_id','item_id'],inplace = True, axis = 1)

In [None]:
df_all_data.head(1)

In [None]:
# for X we keep all columns except the last one
X_train = np.expand_dims(df_all_data.values[:,:-1],axis = 2)
# the last column (the most recent month) is our label
y_train = df_all_data.values[:,-1:]

# for test we keep all the columns except the first one
X_test = np.expand_dims(df_all_data.values[:,1:],axis = 2)

# let's have a look at the shapes
print(X_train.shape,y_train.shape,X_test.shape)
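The split above is a one-step sliding window: the model learns to map 33 months to the following month, and the test input is the same window shifted forward by one month. A toy sketch with 5 months instead of 34 (the array `wide` is made up for illustration):

```python
import numpy as np

# Toy wide table: 2 series, 5 months each (the notebook has 34 months).
wide = np.array([[1, 2, 3, 4, 5],
                 [0, 0, 1, 0, 2]], dtype="float32")

X_train = np.expand_dims(wide[:, :-1], axis=2)  # months 0..3 as input
y_train = wide[:, -1:]                          # month 4 as label
X_test = np.expand_dims(wide[:, 1:], axis=2)    # months 1..4, used to predict month 5

print(X_train.shape, y_train.shape, X_test.shape)
```

The trailing axis of size 1 is the feature dimension that the LSTM layer expects: (samples, time steps, features).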

In [None]:
model_one = Sequential()
model_one.add(LSTM(units = 64,input_shape = (33,1)))
model_one.add(Dropout(0.4))
model_one.add(Dense(1))

model_one.compile(loss = 'mse',optimizer = 'adam', metrics = ['mean_squared_error'])
model_one.summary()

In [None]:
model_one.fit(X_train,y_train,batch_size = 4096,epochs = 10)
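To judge whether the network is learning anything, it can help to compare it with a naive baseline such as "next month equals the previous month". The sketch below is only an illustration on made-up numbers: the helper `clipped_rmse` is hypothetical, and clipping to [0, 20] is assumed from the submission step of this notebook.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, low=0.0, high=20.0):
    """RMSE after clipping both arrays to [low, high] (range assumed, see lead-in)."""
    y_true = np.clip(np.asarray(y_true, dtype="float64"), low, high)
    y_pred = np.clip(np.asarray(y_pred, dtype="float64"), low, high)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy wide table: 3 series, 4 months (made-up numbers).
wide = np.array([[1, 2, 3, 3],
                 [0, 0, 1, 0],
                 [30, 25, 22, 21]], dtype="float64")

y_true = wide[:, -1]      # last month, the same column used as the training label
naive_pred = wide[:, -2]  # "same as previous month" baseline
baseline_score = clipped_rmse(y_true, naive_pred)
print(round(baseline_score, 4))
```

A model whose error on the same rows is not clearly below this number is probably not adding much over the naive rule.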

In [None]:
# predict next month's sales for the test set
submission_file = model_one.predict(X_test)
# keep every predicted value between 0 and 20
submission_file = submission_file.clip(0,20)
# create a dataframe with the required columns
submission = pd.DataFrame({'ID':df_test['ID'],'item_cnt_month':submission_file.ravel()})
submission.to_csv("submission_xy.csv",index=False)

In [None]:
submission.head(1)

Finally, I generate a download link for my CSV file:

In [None]:
from IPython.display import HTML

def download_csv(df, title="Download CSV file", filename="submission_xy.csv"):
    # encode the CSV as base64 and embed it in a data URI
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    # HTML renders the anchor tag; FileLink expects a file path, not markup
    return HTML(html)

df = pd.DataFrame(data = submission, columns=['ID', 'item_cnt_month'])
download_csv(df)