# **Predicting Future Sales**
We are provided with daily historical sales data.

### The task is to forecast the total number of products sold in every shop in the test set for November 2015.

We have 6 files, including a sample submission file. Let's look at the files and their data fields.

## File descriptions
* sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
* test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
* sample_submission.csv - a sample submission file in the correct format.
* items.csv - supplemental information about the items/products.
* item_categories.csv - supplemental information about the item categories.
* shops.csv - supplemental information about the shops.

## **Data Fields**
* ID - an Id that represents a (Shop, Item) tuple within the test set
* shop_id - unique identifier of a shop
* item_id - unique identifier of a product
* item_category_id - unique identifier of item category
* item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
* item_price - current price of an item
* date - date in format dd.mm.yyyy
* date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
* item_name - name of item
* shop_name - name of shop
* item_category_name - name of item category
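
As a quick illustration of the `date_block_num` convention described above, the index can be derived from a calendar year and month. This is only a minimal sketch: the helper name `to_date_block_num` is hypothetical and is not part of the dataset or of any library.

```python
def to_date_block_num(year: int, month: int) -> int:
    """Return the consecutive month index, with January 2013 mapped to 0."""
    return (year - 2013) * 12 + (month - 1)


print(to_date_block_num(2013, 1))   # January 2013
print(to_date_block_num(2015, 10))  # October 2015, the last training month
print(to_date_block_num(2015, 11))  # November 2015, the month to forecast
```

Under this convention the month we have to forecast, November 2015, would be the block right after the last training block.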


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

## Loading the Files

In [None]:
sales = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
item_cat = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
test = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')

## Let's explore our sales dataset first

In [None]:
print("Rows     :", sales.shape[0])
print("Columns  :", sales.shape[1])
print("\nFeatures :\n", sales.columns.tolist())
print("\nMissing values :\n", sales.isnull().any())
print("\nUnique values :\n", sales.nunique())

In [None]:
sales.info()

The dtype of our date column is **object**, so we need to convert it to **datetime** for our time-series analysis.

In [None]:
# formatting date column dtype from object to datetime
sales['date'] = pd.to_datetime(sales['date'], format='%d.%m.%Y')

#### Learning materials on how to convert a date column to datetime dtype:

1. https://pandas.pydata.org/docs/user_guide/timeseries.html#providing-a-format-argument
2. https://www.kaggle.com/alexisbcook/parsing-dates
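
To see why the explicit `format` argument matters, here is a small sketch on made-up strings that use the same day.month.year layout as our `date` column. The values in `raw` are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy strings in the same day.month.year layout as the `date` column.
raw = pd.Series(["02.01.2013", "15.10.2015"])

# An explicit format removes any guessing about whether "02.01"
# means 2 January or 1 February.
parsed = pd.to_datetime(raw, format="%d.%m.%Y")

print(parsed.dtype)
print(parsed.dt.month.tolist())
```

With the format given, the first value is read as 2 January 2013 rather than 1 February 2013.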


In [None]:
print('The data type of our date column is converted from object to', sales.date.dtype)
sales.head()


In [None]:
sales.describe().T

In [None]:
plt.figure(figsize=(12,6))
# distplot is deprecated in seaborn; histplot with kde=True gives a comparable plot
sns.histplot(sales['item_id'], kde=True)

In [None]:
# Let's look at the distribution of shops in our sales data
plt.figure(figsize=(15,6))
sns.countplot(x=sales['shop_id'])

In [None]:
plt.figure(figsize=(18,6))
plt.subplot(1,2,1)
# pass the column as x explicitly; newer seaborn versions no longer accept it positionally
sns.boxplot(x=sales['item_cnt_day'])
plt.subplot(1,2,2)
sns.boxplot(x=sales['item_price'])

## Let's explore our test dataset

In [None]:
print("Rows     :", test.shape[0])
print("Columns  :", test.shape[1])
print("\nFeatures :\n", test.columns.tolist())
print("\nMissing values :\n", test.isnull().any())
print("\nUnique values :\n", test.nunique())

In [None]:
plt.figure(figsize=(18,6))
plt.subplot(1,2,1)
sns.histplot(test['shop_id'])
plt.subplot(1,2,2)
sns.histplot(test['item_id'])

### The sales dataset and the test dataset contain different numbers of unique items and shops
 * The test dataset has 5100 unique items, whereas the sales dataset has 21807.
 * The test dataset has 42 unique shops, whereas the sales dataset has 60.

#### So some items and shops that appear in one dataset may be missing from the other.

In [None]:

# Sets make each membership check cheap instead of rescanning an array every time
sales_items = set(sales.item_id.unique())
test_items = set(test.item_id.unique())

missing_items_in_test = sales_items - test_items    # in sales but not in test
missing_items_in_sales = test_items - sales_items   # in test but not in sales

print('Number of unique item_id values present in test but not in sales:', len(missing_items_in_sales))
print('Number of unique item_id values present in sales but not in test:', len(missing_items_in_test))

#### Now let's find the **shop_id** values that are present in our **sales** dataset but not in the **test** dataset

In [None]:
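test_shops = set(test.shop_id.unique())
# use `shop` as the loop variable so the builtin `id` is not shadowed
shop_id_notin_test = [shop for shop in sales.shop_id.unique() if shop not in test_shops]
print('Shops that are present in sales data but not in test data\n', shop_id_notin_test)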

#### We see that some shop_id values are missing from our test data. These may be shops with few months of data, or shops with no data for the last month (date_block_num 33).

In [None]:
shop_id_month_miss = []

# Iterate over (shop_id, months) pairs so the shop id comes from the data
# rather than from a manual counter that assumes ids run 0, 1, 2, ...
for shop_id, months in sales.groupby('shop_id')['date_block_num'].unique().items():
    if 33 not in months:
        shop_id_month_miss.append(shop_id)

print('Shops with missing sales data for the last month\n', shop_id_month_miss)

### So we have got
* shops that are present in **sales data** but not in **test data** - shop_id_notin_test
* shops with **missing sales data** for the **last month** - shop_id_month_miss

In [None]:
# Let's find the shops that have sales data for the last month
# but are still not present in our test data.

for shop in shop_id_notin_test:
    if shop not in shop_id_month_miss:
        print(f'Shop {shop} has sales data for the last month but is still not in our test dataset')
        

### But why?
* Why do some shops have sales data for the last month and yet not appear in the test data?
* Maybe because they have very few months of data, so let's check.


In [None]:
print(sales.groupby(['shop_id'])['date_block_num'].unique()[9])
print(sales.groupby(['shop_id'])['date_block_num'].unique()[20])

#### So shops 9 and 20 have very few months of data, which is likely why the test data does not include them
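
One way to put a number on "how much data a shop has" is to count the distinct months in which it recorded sales. The sketch below uses a made-up toy frame (the name `toy` and its values are hypothetical); on the real data the same `groupby(...).nunique()` call could be applied to `sales`.

```python
import pandas as pd

# Hypothetical toy rows: shop 1 sells in blocks 0-3, shop 2 only in block 3.
toy = pd.DataFrame({
    "shop_id":        [1, 1, 1, 1, 2, 2],
    "date_block_num": [0, 1, 2, 3, 3, 3],
})

# Number of distinct months in which each shop recorded at least one sale.
months_per_shop = toy.groupby("shop_id")["date_block_num"].nunique()
print(months_per_shop)
```

Sorting such a series would put the shops with the shortest history first, which makes it easy to compare against the shops missing from the test set.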

# Question Answer
### Q: Which is the most popular shop, and what are the total sales at each shop?

In [None]:
# pass the aggregation by name; the resulting column is still called 'sum'
popular_shops = sales.groupby('shop_id').item_cnt_day.agg(['sum'])

In [None]:
popular_shops = popular_shops.sort_values(by='sum')

In [None]:

popular_shops.plot.barh(figsize=(16,12))
plt.title('Most popular shop by sales', fontsize=20)
plt.xlabel('Total Sales', fontsize=14)

## Q: Which shop has the largest number of unique items?

In [None]:
sales.groupby('shop_id')['item_id'].nunique().sort_values().plot.barh(figsize=(16,12))
plt.title('Most items available at the shop', fontsize=20)
plt.xlabel('Total number of items at shop', fontsize=14)


## Q: Which is the most sold item at each shop?

In [None]:
df = sales.groupby(['shop_id','item_id']).item_cnt_day.sum()   #.sort_values()  #ascending=False

In [None]:
df.loc[df.groupby(level=0).idxmax()].sort_values()

In [None]:
df.loc[df.groupby(level=0).idxmax()].sort_values().plot.barh(figsize=(16,12))
plt.title('Most sold single item at each shop', fontsize=20)
plt.xlabel('Total Sales', fontsize=14)
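
The `groupby(level=0).idxmax()` pattern used above can be hard to read at first, so here is a small sketch on made-up totals. The series `totals` and its values are hypothetical, and the labels are converted to a plain list of tuples before the lookup purely for explicitness.

```python
import pandas as pd

# Hypothetical totals indexed by (shop_id, item_id).
totals = pd.Series(
    [5, 9, 2, 7],
    index=pd.MultiIndex.from_tuples(
        [(0, 10), (0, 11), (1, 10), (1, 12)], names=["shop_id", "item_id"]
    ),
)

# For each shop (index level 0), idxmax returns the full (shop_id, item_id)
# label of the row with the largest total.
best_labels = totals.groupby(level=0).idxmax()

# Looking those labels up again keeps one row per shop: its best-selling item.
best = totals.loc[best_labels.tolist()]
print(best)
```

The result keeps both index levels, so each row shows the shop together with the item that sold most there.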


### item_id 20949 is the most sold item in the largest number of shops

## Q: Which are the Top 25 sold Items?

In [None]:
sales.groupby('item_id')['item_cnt_day'].sum().sort_values(ascending=False)[:25].sort_values().plot.barh(figsize=(16,12))
plt.title('Top 25 sold Items', fontsize=20)
plt.xlabel('Total Sales', fontsize=14)


### item_id 20949 is the most sold item.

In [None]:
sales.groupby('item_id')['item_cnt_day'].sum().sum()

### Q: What percentage of total sales does each of the top 25 products contribute?

In [None]:
# compute the per-item totals once and reuse them for the percentage
item_totals = sales.groupby('item_id')['item_cnt_day'].sum()
(item_totals / item_totals.sum() * 100).sort_values(ascending=False)[:25].sort_values().plot.barh(figsize=(16,12))
plt.title('Percentage Sale of each Top 25 Product out of Total Sale', fontsize=20)
plt.xlabel('Percentage Sale', fontsize=14)

### Q: What are the total sales for each month?

In [None]:
sales.groupby(["date_block_num"])["item_cnt_day"].sum().plot(figsize=(16,8))
plt.title('Total Sales of the company month wise', fontsize=20)
plt.ylabel('Sales', fontsize=14)


In [None]:
sales.groupby(["date_block_num"])["item_cnt_day"].sum()

### Q: Which shop has the highest sales in each month, and what is its total?

In [None]:
data = sales.groupby(["date_block_num",'shop_id'])["item_cnt_day"].sum()
data.loc[data.groupby(level=0).idxmax()].plot.bar(figsize=(16,10))
plt.title('Shop with highest sale for each month', fontsize=20)
plt.ylabel('Sales', fontsize=14)


### Q: How many total items are sold on each weekday?

In [None]:
sales1 = sales.copy()
sales1['weekday'] = sales1.date.dt.day_name()
sales1


In [None]:
sales1.groupby('weekday')['item_cnt_day'].sum().sort_values().plot.barh(figsize=(10,6))
plt.title('Sales for each day of the week', fontsize=20)
plt.xlabel('Sales', fontsize=14)

### We see that sales are higher on weekends than on weekdays, with Saturday the highest.


## Q: Which months have the highest sales for 2013 and 2014 combined?


In [None]:
sales1['month']=sales1.date.dt.month_name()
sales1[sales1['date'] < '2015-01-01'].groupby('month')['item_cnt_day'].sum().sort_values().plot.barh(figsize=(12,8), color='green')
plt.title('Total sale for each month for the combined year of 2013 and 2014', fontsize=20)
plt.xlabel('Sales', fontsize=14)

## Q: What is the Month wise sale for each Year?

In [None]:
sales1['year'] = sales1.date.dt.year
sales1.groupby(['month','year'])['item_cnt_day'].sum().plot.bar(figsize=(16,10))
plt.title('Month wise sale for each year', fontsize=20)
plt.ylabel('Sales', fontsize=14)

### We see that, month for month, our sales decrease year on year.

In [None]:
sales1.month.unique()

In [None]:
sales2 = sales1[sales1.month.isin(['January', 'February', 'March', 'April', 'May', 'June', 'July','August', 'September', 'October'])]
sales2

In [None]:
yearly_sales = sales2.groupby('year')['item_cnt_day'].sum()
yearly_sales

In [None]:
def percent_of(a, b):
    """Return b as a percentage of a, rounded to a whole percent."""
    return round(100 - ((a - b) / a) * 100, 0)

In [None]:
print(percent_of(yearly_sales[2013],yearly_sales[2014]))
print(percent_of(yearly_sales[2014],yearly_sales[2015]))
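
The expression inside `percent_of`, `100 - ((a - b) / a) * 100`, simplifies algebraically to `b / a * 100`, that is, the later value as a percentage of the earlier one. The sketch below checks this on made-up totals; the helper name `share_of_previous` and the numbers are hypothetical and not taken from the dataset.

```python
def share_of_previous(previous: float, current: float) -> float:
    """Current value expressed as a percentage of the previous value."""
    return round(current / previous * 100, 0)


def percent_of_original(a: float, b: float) -> float:
    """Same formula as the notebook's percent_of, kept here for comparison."""
    return round(100 - ((a - b) / a) * 100, 0)


# Made-up yearly totals, for illustration only.
previous_year, current_year = 200.0, 174.0

print(share_of_previous(previous_year, current_year))
print(percent_of_original(previous_year, current_year))
```

Both functions print the same value, so a result below 100 simply means the later year sold less than the earlier one.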

## Q: On which day of the month are our sales highest?

In [None]:
# Let's first extract the day feature from our date column
sales1['day'] = sales1.date.dt.day
sales1

In [None]:
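# Total items sold on each day of the month, plus the number of sales records for that day.
# Named aggregation replaces the tuple-style column selection, which newer pandas rejects.
sales_day = sales1.groupby('day').agg(item_cnt_day=('item_cnt_day', 'sum'), n_records=('item_cnt_day', 'count'))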

In [None]:
sales_day.tail()

In [None]:
# We multiply the day-31 sum of item_cnt_day by (34/20 * 0.87): only 20 of the 34 months
# in our train data have 31 days, and on average each year's sales are 0.87 of the previous year's.

sales_day.loc[31, 'item_cnt_day'] = round(sales_day.loc[31, 'item_cnt_day'] * (34/20)*0.87,0)
sales_day.tail()

#### We multiply the day-31 sum of item_cnt_day by (34/20 * 0.87) because only 20 of the 34 months in our train data have 31 days, and by 0.87 because our sales are decreasing: on average each year's sales are 0.87 of the previous year's.
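
The "20 out of 34 months" figure can be checked with the standard library alone. This is a small sketch that assumes the training period runs from January 2013 (block 0) to October 2015 (block 33), as stated in the data description.

```python
import calendar

# Training period: January 2013 (block 0) through October 2015 (block 33).
months = [(2013 + block // 12, block % 12 + 1) for block in range(34)]

# calendar.monthrange returns (weekday of the first day, number of days in the month).
months_with_31 = [ym for ym in months if calendar.monthrange(*ym)[1] == 31]

print(len(months), len(months_with_31))  # 34 20
```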

In [None]:
sales_day.sort_values(by='item_cnt_day')['item_cnt_day'].plot.barh(figsize=(15,10))
plt.title('Sales on each day of the month', fontsize=20)
plt.ylabel('Day of month', fontsize=14)

### The 2nd day of the month has the highest sales and the 11th day has the lowest.

# SUMMARY
#### 1. The most popular shop is shop 31 and the least popular is shop 36.
#### 2. Shop 25 has the widest variety of items, whereas shop 36 has the least.
#### 3. Shop 31 has the highest sales in most months, except the last two months of the data.
#### 4. item_id 20949 is the most sold item overall and also the top-selling item in the largest number of shops.
#### 5. item_id 1590 is the least sold item.
#### 6. Weekend sales are higher than weekday sales, with Saturday the highest.
#### 7. Sales are highest in December and lowest in April.
#### 8. The highest sales are recorded on the 2nd day of the month and the lowest on the 11th.
#### 9. Our sales decrease with each passing year.

## Outliers

Let's use box plots to check for outliers in our sales dataset.

In [None]:
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
sns.boxplot(x=sales.item_cnt_day)
plt.subplot(1,2,2)
sns.boxplot(x=sales.item_price)

Let's check item_cnt_day values greater than or equal to 1000.

In [None]:
sales[sales['item_cnt_day'] >= 1000]

In [None]:
sales[sales['item_id'] == 11373].sort_values(by='item_cnt_day', ascending = False)

In [None]:
sales[sales['item_id'] == 20949].sort_values(by='item_cnt_day', ascending = False)

Let's keep only the rows with item_cnt_day < 1001 and item_price < 100000.

In [None]:
sales = sales[sales.item_cnt_day<1001]
# copy so that later in-place assignments do not act on a view of the original frame
sales = sales[sales.item_price<100000].copy()

In [None]:
sales[sales.item_price<=0]

In [None]:
sales.loc[sales.item_price<=0, 'item_price'] = sales[sales.item_id==2973]['item_price'].median()

In [None]:
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
sns.boxplot(x=sales.item_cnt_day)
plt.subplot(1,2,2)
sns.boxplot(x=sales.item_price)

In [None]:
from itertools import product

In [None]:
matrix = []
cols = ['date_block_num','shop_id','item_id']
for i in range(34):
    sales_df = sales[sales.date_block_num==i]
    matrix.append(np.array(list(product([i], sales_df.shop_id.unique(), sales_df.item_id.unique())), dtype='int16'))

In [None]:
matrix

In [None]:
cols = ['date_block_num','shop_id','item_id']
matrix = pd.DataFrame(np.vstack(matrix), columns=cols)
matrix

https://numpy.org/doc/stable/reference/generated/numpy.vstack.html
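
To make the grid construction above easier to follow, here is a sketch of the same `product` plus `np.vstack` idea on a tiny made-up input. The dictionary `active` and its shop and item ids are hypothetical; in the notebook they come from the unique `shop_id` and `item_id` values of each month.

```python
from itertools import product

import numpy as np
import pandas as pd

# Hypothetical toy input: month 0 has shops {1, 2} and items {10, 11};
# month 1 has shop {1} and items {10, 12}.
active = {0: ([1, 2], [10, 11]), 1: ([1], [10, 12])}

blocks = []
for month, (shops, item_ids) in active.items():
    # Every (month, shop, item) combination for this month, as a 2-D array.
    blocks.append(np.array(list(product([month], shops, item_ids)), dtype="int16"))

# vstack concatenates the per-month arrays row-wise into one (n_rows, 3) array.
grid = pd.DataFrame(np.vstack(blocks), columns=["date_block_num", "shop_id", "item_id"])
print(grid.shape)
```

Each month contributes one row for every shop and item pair seen in that month, including pairs with no recorded sale, which is what allows zero-sales rows to be represented later.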

In [None]:
matrix.shop_id.unique()

* https://www.kaggle.com/snanilim/sales-preprocessing-and-prediction-by-xgboost
* https://www.kaggle.com/dlarionov/feature-engineering-xgboost
* https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts