<div style="padding:20px;color:white;margin:0;font-size:175%;text-align:center;display:fill;border-radius:5px;background-color:#016CC9;overflow:hidden;font-weight:500">The power of Pandas</div>

In this notebook, we will show how far we can go with Pandas. In this time series prediction competition, we won't use ARIMA or LSTM, only Pandas and some basic arithmetic.

# <b><span style='color:#4B4B4B'>1 |</span><span style='color:#016CC9'>  Looking at the data, understanding the competition.</span></b>

First, we load the train and test DataFrames:

In [None]:
import pandas as pd
train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
test = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')

We will have a look at train first:

In [None]:
print('Number of rows and columns:',train.shape) 
print('')
print ('Columns :')
print(train.dtypes)
print('')
print ('First rows:')
print(train.head(5))

There are only 6 columns but almost 3 million rows. Each row corresponds to a triplet (date, shop_id, item_id): for a given day, shop and item, it records how many units were sold and at what price.

In [None]:
print(train['shop_id'].nunique())
print(train['item_id'].nunique())

There are 60 shops and 21807 items in the training set.

In [None]:
print(train['date'].nunique())
print(train['date_block_num'].nunique())

There are 1034 days in the training set, grouped into months indexed by date_block_num. There are 34 months (0 to 33).

The target of the competition is to predict, for a list of (shop, item) pairs, the number of items sold during the 35th month (November 2015, date_block_num 34).

In [None]:
print(test.shape) 
print(test.dtypes)

The test file indicates all the pairs (shop, item) for which we need to predict the number of items sold during the 35th month. First let's compare the pairs (shop, item) in the test file and in the training file.

In [None]:
print(test['shop_id'].nunique())
print(test['item_id'].nunique())

We can see that there are fewer shops and fewer items in the test file than in the train file. Let's try to understand why.

# <b><span style='color:#4B4B4B'>2 |</span><span style='color:#016CC9'>  Evolution of items and shops</span></b>

As the target of the competition is a prediction of monthly sales, we group the sales by date_block_num and sum over all the days within a month.

In [None]:
train_monthly=train.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day':'sum'})
train_monthly.columns=['item_cnt_month']
train_monthly=train_monthly.reset_index()
print(train_monthly.shape)
print(train_monthly.head(3))

We might expect about 30 times fewer rows after grouping by month, but the count only drops from roughly 3 million to 1.6 million. This suggests that most (shop, item) pairs are sold on only a day or two in any given month.
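
One way to test this explanation is to count how many daily rows sit behind each monthly record. The sketch below does this on a small made-up table (the values are illustrative; only the column names come from sales_train.csv). On the real data, the same ratio would be roughly 3 million / 1.6 million, or about 1.8 daily rows per monthly record.

```python
import pandas as pd

# Toy daily sales log (hypothetical values, same column names as sales_train.csv)
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 0, 1, 1],
    "shop_id":        [5, 5, 5, 7, 5, 7],
    "item_id":        [10, 10, 11, 10, 10, 10],
    "item_cnt_day":   [1, 2, 1, 3, 1, 1],
})

# Number of daily rows behind each monthly (shop, item) record
days_per_group = daily.groupby(["date_block_num", "shop_id", "item_id"]).size()

# Average number of daily rows per monthly record
compression = len(daily) / len(days_per_group)

# Share of monthly records built from a single daily row
single_day_share = (days_per_group == 1).mean()

print(days_per_group)
print("daily rows per monthly record:", compression)
print("share of single-day records:", single_day_share)
```

A ratio close to 1 and a high single-day share would support the idea that most pairs sell on very few days per month.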

We will first analyse the evolution of the number of shops and items during the 34 months in the training file.

In [None]:
import matplotlib.pyplot as plt
shop_item_history=train_monthly.groupby(['date_block_num']).agg({'shop_id':'nunique','item_id':'nunique'})
plt.rcParams["figure.figsize"] = (20,5)
fig, axes = plt.subplots(nrows=1, ncols=2)
shop_item_history.plot(y='shop_id',label='number of shops',ax=axes[0])
shop_item_history.plot(y='item_id',label='number of items',ax=axes[1])
plt.show()

We can see that the number of shops and the number of items have been dropping during the 34 months of the training data. 

We also notice that there are 21807 items in total in the training set, but the graph shows that there are never more than 8500 active in any given month. This indicates a high turnover in the items sold, maybe due to new versions coming out all the time, as the company sells software.
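
To put a number on this turnover, one option is to look at the first and last month in which each item appears. This is only a sketch on invented data; `span_months` is a name introduced here for illustration, and on the real data the input would be train_monthly.

```python
import pandas as pd

# Toy monthly table (hypothetical values): one row per (month, item) with sales
monthly = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 1, 2, 3],
    "item_id":        [1, 1, 1, 2, 3, 3, 4],
})

# First and last month in which each item was sold
lifespan = monthly.groupby("item_id")["date_block_num"].agg(
    first_month="min", last_month="max"
)

# Length of the window between first and last sale, in months
# (the item may not have sold in every month of that window)
lifespan["span_months"] = lifespan["last_month"] - lifespan["first_month"] + 1

print(lifespan)
print("median span in months:", lifespan["span_months"].median())
```

A short median span compared with the 34 months of training data would be consistent with many items entering and leaving the catalogue.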

In [None]:
print(shop_item_history.tail())
print('There are {} shops and {} items in the test data'.format(test['shop_id'].nunique(),test['item_id'].nunique()) )

The train data ends with 44 shops and 5413 items, while the test data has 42 shops and 5100 items.
We can suspect that the business is not doing well: shops are closing and items are being removed.

We need to check whether new items have been introduced or new shops opened in the test month. This would be a problem because we would have nothing in the train data to help us predict their sales.

In [None]:
shops_train = list(train_monthly['shop_id'].unique())
shops_test = list(test['shop_id'].unique())
items_train = list(train_monthly['item_id'].unique())
items_test = list(test['item_id'].unique())

print('All shops in the test file are in the train file : {}'.format(set(shops_test).issubset(set(shops_train))))
print('All items in the test file are in the train file : {}'.format(set(items_test).issubset(set(items_train))))

Some items in the test file are new. We need to deal with them. Let's see how many new items we have.

In [None]:
new_items = set(items_test).difference(items_train)
print ('There are {} new items out of {} in the test file. This is {:.1f}%'.format(len(new_items),len(items_test),len(new_items)/len(items_test)*100.0))

# <b><span style='color:#4B4B4B'>3 |</span><span style='color:#016CC9'>  (Item, shop) pairs</span></b>

Let's now check the (shop_id, item_id) pairs. In the test file, there are 42 shops and 5100 items, which makes 42 x 5100 = 214200 possible pairs, and there are 214200 rows in the test file. So nothing is missing: the target of the competition is to predict the sales of every item in every shop of the test file.
But an important question is, **how many of these pairs are present in the train file?** We have already determined that 100% of the test shops and 92.9% of the test items appear in the train file. But what about the pairs?

In [None]:
train_mean = train_monthly.groupby(['shop_id','item_id']).agg({'item_cnt_month':'mean'}).reset_index()
print ('There are {} pairs (shop_id,item_id) in the train file'.format(len(train_mean)))

In [None]:
shop_item_train = train_mean[['shop_id','item_id']].apply(tuple, axis=1).tolist()
shop_item_test = test[['shop_id','item_id']].apply(tuple, axis=1).tolist()
new_shop_item = set(shop_item_test).difference(shop_item_train)

print ('There are {} new pairs (shop,item) out of {} in the test file. This is {:.1f}%'.format(len(new_shop_item),len(shop_item_test),len(new_shop_item)/len(shop_item_test)*100.0))

This is bad news. **48% of the pairs we have to predict never occur in the train file.** When a (shop, item) pair is in the test file but not in the train file, does it mean that the item was never sold in that shop during the training period? Theoretically, yes. Can we then safely predict zero for a pair we have never seen? The submission file would then contain a lot of zeroes (at least 48%). This is probably not the best strategy, but we will try it first and see how it goes. We can refine it later.
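
The same count can also be obtained without building Python tuples, by letting Pandas flag the unmatched rows during a left merge. This is a minimal sketch on invented pairs, not the competition files; `indicator=True` adds a `_merge` column whose value is `left_only` for test pairs that have no match in train.

```python
import pandas as pd

# Toy (shop, item) pairs, hypothetical values
train_pairs = pd.DataFrame({"shop_id": [1, 1, 2], "item_id": [10, 11, 10]})
test_pairs = pd.DataFrame({
    "ID":      [0, 1, 2, 3],
    "shop_id": [1, 1, 2, 2],
    "item_id": [10, 12, 10, 11],
})

# Left merge keeps every test row; _merge tells us whether a match was found
flagged = test_pairs.merge(
    train_pairs.drop_duplicates(),
    on=["shop_id", "item_id"],
    how="left",
    indicator=True,
)

# Test pairs that never appear in train
unseen = flagged[flagged["_merge"] == "left_only"]
share_unseen = len(unseen) / len(test_pairs)

print(unseen[["ID", "shop_id", "item_id"]])
print("share of unseen pairs: {:.1%}".format(share_unseen))
```

Keeping the flag as a column also makes it easy to treat seen and unseen pairs differently later on.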

Enough bad news for now; let's come back to something easier. How did the total number of items sold across all shops evolve during the training period?

# <b><span style='color:#4B4B4B'>4 |</span><span style='color:#016CC9'> Business evolution</span></b>

In [None]:
train_total_monthly=train_monthly.groupby(['date_block_num']).agg({'item_cnt_month':'sum'}).reset_index()
train_total_monthly.plot(y='item_cnt_month')
plt.show()

This curve is screaming **non-stationarity**. There is seasonality (a 12-month cycle) and a downward trend. Because the training time series is so short (only 34 time steps), we will try to deal with the trend and the seasonality manually, without using ARIMA or LSTM.

# <b><span style='color:#4B4B4B'>5 |</span><span style='color:#016CC9'>  Prediction</span></b>

We will fit a straight line to the curve.

In [None]:
from scipy.optimize import curve_fit

def f(x, A, B): #  'straight line' y=f(x)
    return A*x + B

popt, pcov = curve_fit(f, train_total_monthly['date_block_num'],train_total_monthly['item_cnt_month']) #  data x, y to fit
print ('The fitting line equation is: item_cnt_month = {:.0f} x date_block_num + {:.0f}'.format(popt[0], popt[1]))

In [None]:
plt.plot(train_total_monthly['date_block_num'], train_total_monthly['item_cnt_month'],label='item_cnt_month')

# Plot another line on the same chart/graph
plt.plot(train_total_monthly['date_block_num'], f(train_total_monthly['date_block_num'],popt[0],popt[1]),label='fitting line')
plt.legend(loc="upper right")
plt.show()

In [None]:
print('Based on extrapolation, we expect a total sales count of {:.0f} for the test month'.format(f(34,popt[0],popt[1])))
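
For a straight line, a general-purpose optimizer is not strictly needed. As a minimal sketch on made-up numbers (not the competition data), `numpy.polyfit` with degree 1 returns a slope and an intercept of the same kind and can serve as a cross-check of the fit:

```python
import numpy as np

# Toy series that decreases by 2 units per step (hypothetical values)
x = np.arange(5)
y = 10.0 - 2.0 * x

# Degree-1 least-squares fit: coefficients come back highest power first
slope, intercept = np.polyfit(x, y, deg=1)

# Extrapolate one step past the last observed point
next_value = np.polyval([slope, intercept], 5)

print("slope: {:.2f}, intercept: {:.2f}".format(slope, intercept))
print("extrapolated value at x=5: {:.2f}".format(next_value))
```

On real monthly totals, the two approaches should give matching coefficients up to numerical precision, since both minimize the squared error of a linear model.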

Now, what about the seasonality? The month to predict is November, so let's look at the November/October ratio in our training data. date_block_num=0 for Jan13, so Oct13 is 9 and Nov13 is 10, Oct14 is 21 and Nov14 is 22, Oct15 is 33 and Nov15 is 34. Month 34 is the one we have to predict.

In [None]:
print ('Nov13 / Oct13 is {:.4f}'.format(train_total_monthly.iloc[10]['item_cnt_month']/train_total_monthly.iloc[9]['item_cnt_month']))
print ('Nov14 / Oct14 is {:.4f}'.format(train_total_monthly.iloc[22]['item_cnt_month']/train_total_monthly.iloc[21]['item_cnt_month']))

In November, sales are higher than in October. On average, November/October = 1.0582. This ratio captures trend and seasonality together. We see that the seasonality is stronger than the trend here, as November is higher than October while the general trend is down. December is stronger still, but here we only have to focus on November.
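
Rather than typing the averaged ratio by hand, it can be derived from the monthly totals. A minimal sketch, assuming a Series of totals indexed by date_block_num; the numbers below are invented for illustration and are not the competition totals:

```python
import pandas as pd

# Hypothetical monthly totals indexed by date_block_num (0 = Jan 2013)
totals = pd.Series({9: 100.0, 10: 104.0, 21: 80.0, 22: 86.0, 33: 70.0})

# November months available in the training period: Nov13 (10) and Nov14 (22)
november_blocks = (10, 22)

# November / October ratio for each available year
ratios = [totals.loc[nov] / totals.loc[nov - 1] for nov in november_blocks]
mean_ratio = sum(ratios) / len(ratios)

# Apply the average ratio to the last observed October (block 33)
forecast = mean_ratio * totals.loc[33]

print("ratios:", ratios)
print("mean Nov/Oct ratio: {:.4f}".format(mean_ratio))
print("forecast for block 34: {:.2f}".format(forecast))
```

With only two Novembers available, the average rests on very little evidence, so the resulting forecast should be read as a rough estimate.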

In [None]:
print('Based on extrapolation and seasonality, we expect a total sales count of {:.0f} for the test month'.format(1.0582*train_total_monthly.iloc[33]['item_cnt_month']))

So what should we do with all of this? We can try a very simple idea.
Above, we generated the DataFrame **train_mean**. For each (shop, item) pair in the train set, it holds the average number of items sold per month during the training period.

In [None]:
print(train_mean.head(5))
print ('')
print('train_mean number of rows and columns:', train_mean.shape)

We want to add the column item_cnt_month from train_mean to the test DataFrame wherever the pair is found. So we will merge test and train_mean on the (shop, item) pair but retain all the rows in test, as this is what we have to submit.

In [None]:
test=pd.merge(test,train_mean,on=['shop_id','item_id'],how='left')
print (test.head(10))

We can see a lot of NaN values. They should make up 48% of the rows, as we calculated earlier. Let's check this.

In [None]:
print('there are {} NaN. That is {:.1f}% of the rows.'.format(test['item_cnt_month'].isnull().sum(),test['item_cnt_month'].isnull().sum()/len(test)*100.0))

Good! It is always worth checking that there are no mistakes. We are going to fill the NaN values with zeroes and clip to the range (0, 20), as the competition instructs.

In [None]:
test.fillna(0,inplace=True)
test['item_cnt_month'] = test['item_cnt_month'].clip(0, 20)

Now let's see the total number of items sold in the test DataFrame.

In [None]:
print('There are {:.0f} items sold in total in the test DataFrame.'.format(test['item_cnt_month'].sum()))

Using this method, we get 202k items sold during Nov15, while the average number of items sold per month during the training period is roughly 110k.
There is probably a selection process going on. The items that were not selling have been removed, so the (shop, item) pairs we end up with are the ones with the highest sales. Note also that train_mean averages only over the months in which a pair appears in train_monthly, so months without any sale are not counted as zero, which pushes the averages up.

202k is far too high, as we expect a total of about 75k based on trend and seasonality.
So let's adjust these values proportionally.

In [None]:
test['item_cnt_month']= test['item_cnt_month'] * 75191.0 / 202192.0 
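
The two constants above give a scale factor of about 0.37 (75191 / 202192). To avoid hard-coding them, the factor can be computed from the predictions themselves. A minimal sketch on invented numbers, where `preds` and `target_total` are hypothetical stand-ins for the test DataFrame and the expected monthly total:

```python
import pandas as pd

# Toy predictions (hypothetical values)
preds = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": [4.0, 0.0, 6.0]})

# Total we expect from trend and seasonality (hypothetical value)
target_total = 5.0

# Scale factor that makes the predictions sum to the expected total
scale = target_total / preds["item_cnt_month"].sum()
preds["item_cnt_month"] = preds["item_cnt_month"] * scale

print("scale factor:", scale)
print(preds)
```

Because the factor is below 1 here, the rescaled values stay inside the clipping range; a factor above 1 would require clipping again afterwards.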

# <b><span style='color:#4B4B4B'>6 |</span><span style='color:#016CC9'> Submission</span></b>

Let's try to submit these numbers!

In [None]:
submission=test[['ID','item_cnt_month']]

In [None]:
submission.to_csv('submission.csv', index=False)

RMSE = 1.1413. Not bad, with only a few Pandas manipulations and some arithmetic!

To further improve the score, we would need to look at all the new (shop, item) pairs that appear in the test file but not in the train file. We predicted zero for them, but we need to connect them somehow to the train file using the categories.