The goal of this notebook is to work out how to format the data for the task at hand: predicting the number of items sold in each store for a given month. The raw data is not set up to answer this question directly, so we need to reorganize it to match the task.

In [None]:
# Imports

import pandas as pd
import numpy as np
import plotly.express as px

## sales_train

In [None]:
# Reading in sales_train

sales_train = pd.read_csv("../input/competitive-data-science-predict-future-sales/sales_train.csv")
sales_train.head()

In [None]:
# Viewing the shape of the dataframe

sales_train.shape

In [None]:
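# Converting the date column to a pandas datetime
# (dates in this file are day-first, e.g. 02.01.2013, so the format is given explicitly)

sales_train['date'] = pd.to_datetime(sales_train['date'], format='%d.%m.%Y')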

In [None]:
# Viewing basic stats on data

sales_train.describe()

In [None]:
# Viewing the info for each column

sales_train.info()

In [None]:
# Viewing the number of unique items and shops in training data

# Viewing the number of items (21807 items in total)
print("Number of unique items: {}".format(len(sales_train["item_id"].unique())))

# Viewing the number of shops (60 shops in total)
print("Number of unique shops: {}".format(len(sales_train["shop_id"].unique())))

In [None]:
# Plotting the number of items sold per month

## Grouping by month and viewing number of items sold in each month
sales_per_month = sales_train.groupby(['date_block_num'])['item_cnt_day'].sum()

## Plotting
fig = px.bar(sales_per_month, title="Number of Items Sold per Month", labels={"date_block_num":"Month", "value":"Count"})
fig.show()

## items

In [None]:
# Reading in items

items = pd.read_csv("../input/competitive-data-science-predict-future-sales/items.csv")
items.head()

In [None]:
# Viewing the number of items and number of categories

# Viewing the number of items
print("Number of unique items: {}".format(len(items["item_id"].unique())))

# Viewing the number of categories
print("Number of unique categories: {}".format(len(items["item_category_id"].unique()))) 

## item_categories

In [None]:
# Reading in item_categories

item_categories = pd.read_csv("../input/competitive-data-science-predict-future-sales/item_categories.csv")
item_categories.head()

In [None]:
# Extracting sub categories from item_category_name

## Splitting the category by '-'
item_categories['categories'] = item_categories['item_category_name'].str.split('-')

## Extracting the first element from split
item_categories['type'] = item_categories['categories'].apply(lambda x: x[0].strip())

## Extracting second element if there is a second element, else return first element 
item_categories['sub_type'] = item_categories['categories'].apply(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())

In [None]:
# Dropping unnecessary columns

item_categories.drop(['item_category_name', 'categories'], inplace=True, axis=1)

In [None]:
# # Creating dummy variables

# ## Creating dummies for type 
# item_categories = pd.concat([item_categories, pd.get_dummies(item_categories['type'], drop_first=True)], axis=1)

# ## Creating dummies for sub_type 
# item_categories = pd.concat([item_categories, pd.get_dummies(item_categories['sub_type'], drop_first=True)], axis=1)

In [None]:
# Plotting 

fig = px.bar(item_categories.groupby('type')['sub_type'].count(), title="Number of Sub_categories in Each Category")
fig.show()

## shops

In [None]:
# Reading in shops

shops = pd.read_csv("../input/competitive-data-science-predict-future-sales/shops.csv")
shops.head()

In [None]:
# Replacing shop ids with other duplicate shop_ids

## Shopnames for 0 and 57 are the same so changing shop_id 0 to 57
shops.loc[shops['shop_id']==0, 'shop_id'] = 57

## Shopnames for 1 and 58 are the same so changing shop_id 1 to 58
shops.loc[shops['shop_id']==1, 'shop_id'] = 58

## Shopnames for 10 and 11 are the same so changing shop_id 10 to 11
shops.loc[shops['shop_id']==10, 'shop_id'] = 11

In [None]:
# Collecting the duplicate names just in case 

duplicate_shop_names = {
    shops.loc[shops['shop_id']==57, 'shop_name'].values[0]:shops.loc[shops['shop_id']==57, 'shop_name'].values[1],
    shops.loc[shops['shop_id']==58, 'shop_name'].values[0]:shops.loc[shops['shop_id']==58, 'shop_name'].values[1],
    shops.loc[shops['shop_id']==11, 'shop_name'].values[0]:shops.loc[shops['shop_id']==11, 'shop_name'].values[1] 
}

In [None]:
shops = shops.drop_duplicates(subset='shop_id')

In [None]:
# Retrieving city from shop_name

shops['city'] = shops['shop_name'].str.split(' ').apply(lambda x: x[0])

In [None]:
# Small corrections

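## Removing space (city was already extracted above, so it is corrected too)
shops.loc[shops.shop_name == 'Сергиев Посад ТЦ "7Я"', 'city'] = 'СергиевПосад'
shops.loc[shops.shop_name == 'Сергиев Посад ТЦ "7Я"', 'shop_name'] = 'СергиевПосад ТЦ "7Я"'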

## Removing ! from '!Якутск'
shops.loc[shops['city'] == '!Якутск', 'city'] = 'Якутск'

In [None]:
shops.head()

In [None]:
# Plotting the number of shops in each city

fig = px.bar(shops.groupby('city')['shop_id'].count(), title='Number of Stores in a City')
fig.show()

In [None]:
# Viewing the number of unique shops

print("Number of unique shops: {}".format(len(shops['shop_id'].unique())))

In [None]:
# Replacing shop ids in sales_train

## Shopnames for 0 and 57 are the same so changing shop_id 0 to 57
sales_train.loc[sales_train['shop_id']==0, 'shop_id'] = 57

## Shopnames for 1 and 58 are the same so changing shop_id 1 to 58
sales_train.loc[sales_train['shop_id']==1, 'shop_id'] = 58

## Shopnames for 10 and 11 are the same so changing shop_id 10 to 11
sales_train.loc[sales_train['shop_id']==10, 'shop_id'] = 11

The task is to **predict the sales of each product in each store for a given month**. Therefore we have to aggregate the daily data into *monthly sales per item per store*.
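As a minimal sketch of what this aggregation means, the toy frame below (hypothetical values, same column names as `sales_train`) is collapsed from daily rows to one row per month, shop and item. The output column name `item_cnt_month` is an assumption borrowed from the submission format; the notebook itself keeps the name `item_cnt_day` after grouping.

```python
import pandas as pd

# Toy daily sales (hypothetical values) using the sales_train column names
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1, 1],
    "shop_id":        [5, 5, 5, 5, 6],
    "item_id":        [10, 10, 11, 10, 10],
    "item_price":     [100.0, 120.0, 50.0, 110.0, 100.0],
    "item_cnt_day":   [1, 2, 1, 3, 1],
})

# One row per (month, shop, item): mean price and total units sold
toy_monthly = (
    toy_sales
    .groupby(["date_block_num", "shop_id", "item_id"], as_index=False)
    .agg(item_price=("item_price", "mean"),
         item_cnt_month=("item_cnt_day", "sum"))
)
print(toy_monthly)
```

The five daily rows become four monthly rows, because the two sales of item 10 in shop 5 during month 0 are merged into one.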

## Combining Data into one Dataframe

In [None]:
# Merging the dataframes

## Merging sales_train with item
df = pd.merge(sales_train, items, how="left", on="item_id")

## Merging df and item_categories
df = pd.merge(df, item_categories, how="left", on="item_category_id")

## Merging df and shops
df = pd.merge(df, shops, how="left", on="shop_id")

In [None]:
df.shape

In [None]:
# Grouping data by month, shop_id, item_id to get total sales

data = df.groupby(['date_block_num', 'shop_id', 'item_id']).agg({'item_price':'mean', 'item_cnt_day':'sum'}).reset_index()
data.head()

In [None]:
# Merging the dataframes

## Merging the monthly data with items
data = pd.merge(data, items, how="left", on="item_id")

## Merging data and item_categories
data = pd.merge(data, item_categories, how="left", on="item_category_id")

## Merging data and shops
data = pd.merge(data, shops, how="left", on="shop_id")

In [None]:
data.shape

In [None]:
data.head()

In [None]:
# Plotting sales by month

data['total_sales'] = data['item_price'] * data['item_cnt_day']

fig = px.line(data.groupby('date_block_num')['total_sales'].sum(), 
              title="Sales by Month", 
              labels={"date_block_num":"Month",
                      "value":"Total Sales"})
fig.show()

The data shows that sales are seasonal. The spikes fall in December, which makes sense given the extra Christmas shopping.
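One way to check the seasonal pattern numerically, rather than only by eye, is to average the monthly totals by calendar month. The sketch below uses made-up totals and assumes that `date_block_num` 0 corresponds to January, so block 11 is December.

```python
import pandas as pd

# Hypothetical monthly totals indexed by date_block_num (0 assumed to be January)
monthly_totals = pd.Series(
    [100, 90, 95, 85, 80, 82, 78, 84, 88, 92, 110, 180,
     98, 88, 93, 83, 79, 80, 76, 82, 86, 90, 108, 175],
    index=range(24),
    name="total_sales",
)

# Map each block to a calendar month in 1..12
calendar_month = monthly_totals.index.to_numpy() % 12 + 1

# Average total per calendar month across the two toy years
seasonal_profile = monthly_totals.groupby(calendar_month).mean()
peak_month = seasonal_profile.idxmax()
print(seasonal_profile)
print("Peak month:", peak_month)
```

With the real data, `monthly_totals` would be the grouped `total_sales` series that is plotted above.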

## Data Processing

In [None]:
# Converting date_block_num to a calendar month (1-12); block 0 is January 2013

data['month'] = data['date_block_num'] % 12 + 1

In [None]:
# Dummifying the categorical columns

## Creating dummies and concatenating
data = pd.concat([data, pd.get_dummies(data['shop_id'], drop_first=True, prefix='shop_')], axis=1)

## Creating dummies and concatenating
data = pd.concat([data, pd.get_dummies(data['type'], drop_first=True, prefix='type')], axis=1)

## Creating dummies and concatenating
data = pd.concat([data, pd.get_dummies(data['sub_type'], drop_first=True, prefix='sub_type')], axis=1)

In [None]:
data.head()

In [None]:
data.shape

In [None]:
# Getting the names of the feature columns

## Collecting shop feature names
shop_columns = [col for col in data.columns if col.startswith('shop__')]

## Collecting type feature names (startswith avoids also matching 'sub_type_')
type_columns = [col for col in data.columns if col.startswith('type_')]

## Collecting sub_type feature names
sub_type_columns = [col for col in data.columns if col.startswith('sub_type_')]

In [None]:
# Setting the feature and target variables

features = ['month', 'shop_id','item_id', 'item_price'] + type_columns + sub_type_columns
target = ['item_cnt_day']

## Baseline Model using Linear Regression

In [None]:
# Preparing data for modeling

## Import for splitting data
from sklearn.model_selection import train_test_split

## Setting feature and target variables
X = data[features].fillna(value=0)
y = data[target].fillna(value=0)

## Splitting train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
# Fitting Linear Regression

## Getting LR function
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

## Fitting on training data
lr.fit(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

# Submission

In [None]:
# Reading in test

test = pd.read_csv("../input/competitive-data-science-predict-future-sales/test.csv")
test.head()

In [None]:
test.shape

In [None]:
# Adding date_block_num and month for the test period

test['date_block_num'] = 34
test['month'] = 11

In [None]:
item_price = data[['item_id', 'item_price']].groupby('item_id')['item_price'].mean().reset_index()

In [None]:
# Merging the test data with dataframes

## Merging the mean item price into test
test = pd.merge(test, item_price, how="left", on="item_id")

## Merging items with test
test = pd.merge(test, items, how="left", on="item_id")

## Merging test and item_categories
test = pd.merge(test, item_categories, how="left", on="item_category_id")

## Merging test and shops
test = pd.merge(test, shops, how="left", on="shop_id")

In [None]:
test.shape

In [None]:
# Dummifying the categorical columns

## Creating dummies and concatenating
test = pd.concat([test, pd.get_dummies(test['shop_id'], drop_first=True, prefix='shop_')], axis=1)

## Creating dummies and concatenating
test = pd.concat([test, pd.get_dummies(test['type'], drop_first=True, prefix='type')], axis=1)

## Creating dummies and concatenating
test = pd.concat([test, pd.get_dummies(test['sub_type'], drop_first=True, prefix='sub_type')], axis=1)

In [None]:
test[features]

The cell above fails with a `KeyError` because some columns in the training data are not in the test data. We need to build a set of features that exists in both the training data and the test data.
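An alternative to intersecting the two feature lists is to force the test dummies onto the training columns with `DataFrame.reindex`. This is only a sketch on hypothetical category values, not what the notebook goes on to do: categories missing from the test set become all-zero columns, and categories seen only in the test set are dropped.

```python
import pandas as pd

# Hypothetical category values; "Gifts" appears only in the test set
train_toy = pd.DataFrame({"type": ["Games", "Books", "Music"]})
test_toy = pd.DataFrame({"type": ["Games", "Gifts"]})

train_dummies = pd.get_dummies(train_toy["type"], prefix="type")
test_dummies = pd.get_dummies(test_toy["type"], prefix="type")

# Give the test frame exactly the training columns, in the same order;
# columns absent from the test data are created and filled with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

The advantage of this design is that the model keeps every training feature; the cost is that test-only categories carry no signal at prediction time.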

In [None]:
# Getting the names of the feature columns

## Collecting shop feature names
test_shop_columns = [col for col in test.columns if col.startswith('shop__')]

## Collecting type feature names (startswith avoids also matching 'sub_type_')
test_type_columns = [col for col in test.columns if col.startswith('type_')]

## Collecting sub_type feature names
test_sub_type_columns = [col for col in test.columns if col.startswith('sub_type_')]

In [None]:
# Setting the feature variables for the test data

test_features = ['month', 'item_id', 'shop_id', 'item_price'] + test_type_columns + test_sub_type_columns

In [None]:
# Comparing the features in train and test data

print(f"Number of predictors in train data: {len(features)}")
print(f"Number of predictors in test data: {len(test_features)}")

There are predictors in the training data that are not in the test data. Why? Because some items in the training data do not appear in the test data, so the dummy columns for their categories are never created for the test set.

In [None]:
common_features = list(set(features) & set(test_features)) 
print(f"Number of common features: {len(common_features)}")

In [None]:
'item_price' in common_features

There are items in the training data that are not in the test data, and vice versa. We therefore keep only the features common to both and fit the model on those.
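A set intersection does not guarantee any particular column order. If a stable order matters (for example when inspecting coefficients), the common features can be collected with a list comprehension instead. The column names below are hypothetical stand-ins for the notebook's `features` and `test_features` lists.

```python
# Hypothetical stand-ins for the train and test feature lists
train_cols = ["month", "shop_id", "item_id", "item_price", "type_Books", "type_Games"]
test_cols = ["month", "item_id", "shop_id", "item_price", "type_Games", "type_Gifts"]

# Keep the training order, but only columns that also exist in the test data
test_col_set = set(test_cols)
common_cols = [col for col in train_cols if col in test_col_set]
print(common_cols)
```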

## Linear Regression using Common Features

In [None]:
# Preparing data for modeling

## Import for splitting data
from sklearn.model_selection import train_test_split

## Setting feature and target variables
X = data[common_features].fillna(value=0)
y = data[target].fillna(value=0)

## Splitting train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
# Fitting Linear Regression

## Getting LR function
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

## Fitting on training data
lr.fit(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

In [None]:
'item_price' in test.columns

In [None]:
test = test.fillna(0)

In [None]:
lr.predict(test[common_features])

In [None]:
test['preds'] = lr.predict(test[common_features])

In [None]:
test.head()

In [None]:
# Creating the submission dataframe

preds = test[['ID', 'preds']]
preds.columns = ['ID', 'item_cnt_month']
preds

In [None]:
# Saving the submission dataframe
preds.to_csv('my_submission.csv', index=False)

The goal of this notebook was to format the data. We applied a Linear Regression only to check that the format was correct, which it was. The next steps are to:
1. Apply time-series models
2. Make better fill-in choices
3. Make better feature engineering choices
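As one possible starting point for these steps, the sketch below adds a one-month lag of the sales count per shop and item, and holds out the most recent month for validation instead of splitting at random. The data is a toy frame with hypothetical values, the column names `item_cnt_month` and `cnt_lag_1` are assumptions, and the lag is only correct when every shop/item pair has a row for each consecutive month.

```python
import pandas as pd

# Toy monthly data (hypothetical values), one row per month, shop and item
toy = pd.DataFrame({
    "date_block_num": [0, 1, 2, 3, 0, 1, 2, 3],
    "shop_id":        [5, 5, 5, 5, 6, 6, 6, 6],
    "item_id":        [10] * 8,
    "item_cnt_month": [3, 4, 2, 5, 1, 0, 2, 1],
}).sort_values(["shop_id", "item_id", "date_block_num"])

# Previous row's count within each shop/item pair
# (equals the previous month only if no months are missing)
toy["cnt_lag_1"] = toy.groupby(["shop_id", "item_id"])["item_cnt_month"].shift(1)

# Time-based holdout: train on earlier months, validate on the latest one
last_block = toy["date_block_num"].max()
train_part = toy[toy["date_block_num"] < last_block]
valid_part = toy[toy["date_block_num"] == last_block]
print(valid_part)
```

Validating on the latest month mirrors the competition setup, where the test period comes after all the training data.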

# ARIMA

In [None]:
data.head()

In [None]:
px.bar(data, x='month', y='item_cnt_day')