# Tabular Playground Series - Jan 2022: A simple average model
I built the first version of this notebook from my limited experience in data science, which included some exploratory data analysis. I then decided to expand my knowledge by learning from other people's solutions and building on them myself. I would like to thank all of these people for sharing their valuable experience.

This notebook builds on the approach in https://www.kaggle.com/mfedeli/tabular-playground-series-jan-2022 - thanks for sharing!
(1) Detailed EDA and Vizualizations [TPS January 2022] https://www.kaggle.com/nishantdhingra/detailed-eda-and-vizualizations-tps-january-2022#1.-Which-countries-buys-most-?
(2) Tabular Playground Series - Jan 2022 https://www.kaggle.com/mfedeli/tabular-playground-series-jan-2022/comments

Acknowledgments:
(1) A Data Science Framework: To Achieve 99% Accuracy https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
(2) Comprehensive data exploration with Python https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python



# A Data Science Framework
## Step 1: Defining the problem
In the Tabular Playground Series - Jan 2022 competition, we are tasked with developing an algorithm to predict sales (numeric values) for two stores (KaggleMart and KaggleRama) that sell three products (the Kaggle Mug, the Kaggle Hat and the Kaggle Sticker, all highly sought-after products) in three countries (Finland, Sweden and Norway) for the year 2019. We are provided with four years of training data, from 2015 to 2018. The available features are date, country, store and product.

Because this is a predictive analysis problem (forecasting sales from existing data), suitable techniques include methods for continuous targets (such as linear regression, decision trees, forest models and boosted models) and time-based methods (such as ARIMA or ETS).

## Step 2: Importing the Data
The data is provided by Kaggle: https://www.kaggle.com/c/tabular-playground-series-jan-2022
    

In [None]:
# Installing PyCaret which is an open-source, low-code machine learning library & end-to-end model management tool. 
#%%capture #suppresses the displays
# install the full version
!pip install pycaret[full]

In [None]:
# Installing Gradio
!pip install gradio

In [None]:
#Step 3: Prepare Data for Consumption
## 3.1 Importing the necessary libraries

import numpy  as np
import pandas as pd

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
# from pandas.tools.plotting import scatter_matrix
%matplotlib inline

#Common Model Algorithms
import sklearn
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier

plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
#from datetime import datetime
#from datetime import timedelta

from pycaret.regression import *
print("Setup complete")

In [None]:
# 3.11 Loading & reading the data for getting a quick and dirty overview of variable datatypes 
train=pd.read_csv('../input/tabular-playground-series-jan-2022/train.csv', index_col='row_id')
test=pd.read_csv('../input/tabular-playground-series-jan-2022/test.csv', index_col='row_id')

In [None]:
print('Training data df shape:',train.shape)
print('Test data df shape:',test.shape)

In [None]:
train.head()

The training data contains five columns: four are of object type and one is numeric, which is the target variable.

In [None]:
test.head()

In [None]:
train.info()

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

Initial thoughts:
- The train and test datasets have no null values.
- The country, store and product features are categorical variables and are not yet numeric.
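
As a minimal sketch of what making these features numeric could look like, the block below one-hot encodes a toy frame with `pd.get_dummies`. The `toy` frame and its values are made up for illustration; the modelling step later in this notebook passes the raw columns to PyCaret rather than using this code.

```python
import pandas as pd

# Toy rows that mimic the competition columns (values are made up)
toy = pd.DataFrame({
    "country": ["Finland", "Norway", "Sweden", "Norway"],
    "store": ["KaggleMart", "KaggleRama", "KaggleMart", "KaggleRama"],
    "product": ["Kaggle Mug", "Kaggle Hat", "Kaggle Sticker", "Kaggle Mug"],
    "num_sold": [10, 20, 30, 40],
})

# One-hot encode the three categorical columns; num_sold is left untouched
encoded = pd.get_dummies(toy, columns=["country", "store", "product"])
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, so three countries, two stores and three products give eight indicator columns next to the target.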

In [None]:
train.describe().T

In [None]:
train.describe(include=['O'])

From the above, there are two stores selling three products across three countries.

In [None]:
# The frequency distribution of the categorical variables country, store, product
train['country'].value_counts()

In [None]:
train['store'].value_counts()


In [None]:
train['product'].value_counts()

The rows are evenly distributed across countries, stores and products.
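
One way to confirm this balance beyond the single-column counts is to count rows per country/store/product combination. The sketch below does so on a made-up toy frame (the `toy` name and the two-day date range are assumptions for illustration); the same `groupby` could be pointed at `train`.

```python
from itertools import product

import pandas as pd

# Build every date/country/store/product combination for two made-up days
dates = ["2015-01-01", "2015-01-02"]
countries = ["Finland", "Norway", "Sweden"]
stores = ["KaggleMart", "KaggleRama"]
products = ["Kaggle Mug", "Kaggle Hat", "Kaggle Sticker"]
toy = pd.DataFrame(
    list(product(dates, countries, stores, products)),
    columns=["date", "country", "store", "product"],
)

# Rows per combination; a balanced panel has the same count everywhere
combo_counts = toy.groupby(["country", "store", "product"]).size()
is_balanced = combo_counts.nunique() == 1
print(combo_counts.head())
print("Balanced:", is_balanced)
```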

In [None]:
# timeframe
print('Training data:')
print('Min date', train['date'].min())
print('Max date', train['date'].max())
print('\n')
print('Test data:')
print('Min date', test['date'].min())
print('Max date', test['date'].max())


# EDA

In the data analysis step, we will try to answer these questions:
- Which store sells more?
- Which store sells more products in each country?
- How do store sales vary by product type?

In [None]:
plt.figure(figsize  = (10,5))
fig = train.groupby(['date','store']).agg(num_sold=('num_sold','sum'))
sns.lineplot(data=fig, x='date', y='num_sold', hue='store')

plt.title('num_sold by each Store')
  

KaggleRama consistently sells more products than KaggleMart.
The number of products sold by both stores oscillates with the time of year (season) and fluctuates rapidly (probably due to weekday vs weekend sales).
There are big spikes towards the end of each year (likely due to Christmas) and some smaller seasonal spikes (perhaps the Easter holidays, etc.).
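
To check the weekday vs weekend guess, one could compare average sales by day type. The following is a minimal sketch on made-up numbers; the `toy` frame, the `day_type` column and the sales values of 100 and 150 are assumptions for illustration, not results from the competition data.

```python
import numpy as np
import pandas as pd

# Two made-up weeks starting on Monday 2015-01-05
toy = pd.DataFrame({"date": pd.date_range("2015-01-05", periods=14, freq="D")})

# weekday is 0 for Monday through 6 for Sunday, so >= 5 marks the weekend
is_weekend = toy["date"].dt.weekday >= 5
toy["day_type"] = np.where(is_weekend, "weekend", "weekday")
toy["num_sold"] = np.where(is_weekend, 150, 100)  # made-up sales

# Average sales per day type and the weekend-to-weekday ratio
mean_by_day_type = toy.groupby("day_type")["num_sold"].mean()
weekend_uplift = mean_by_day_type["weekend"] / mean_by_day_type["weekday"]
print(mean_by_day_type)
print("Weekend uplift:", weekend_uplift)
```

A ratio clearly above 1 on the real data would support the idea that the rapid fluctuation comes from weekend sales.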

In [None]:
count_sell = train.groupby(['country']).num_sold.sum()
print(count_sell.to_string())
sns.barplot(x = count_sell.index, y = count_sell.values)
plt.title('Sales per country')
plt.xlabel('country')
plt.ylabel('No. of Sales')

sns.set(color_codes=True)
pal = sns.color_palette("Blues", 9)
sns.set_palette('dark')

In [None]:
# Use a float format so the wedge labels keep one decimal place
plt.pie(count_sell.values, labels = count_sell.index,  autopct='%0.1f%%')
plt.title('Percentage of Sales for each Country')

From the above visualization, it is clear that Norway has the largest share of Kaggle swag sales at 43%, and Finland has the smallest share of total sales at 26.3%.
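
The shares shown in the pie chart can also be computed directly. This is a small sketch with made-up totals (the `toy` frame and its numbers are assumptions for illustration and do not reproduce the figures above).

```python
import pandas as pd

# Made-up sales rows, several per country
toy = pd.DataFrame({
    "country": ["Finland", "Finland", "Norway", "Norway", "Sweden"],
    "num_sold": [120, 80, 300, 200, 300],
})

# Total sales per country, then each country's share of the grand total
totals = toy.groupby("country")["num_sold"].sum()
share_pct = (100 * totals / totals.sum()).round(1)
print(share_pct.to_string())
```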

In [None]:
#Store sales by country
fig, axes = plt.subplots(2, 1, figsize=(12, 10))

KR=train[train.store=='KaggleRama']
KM=train[train.store=='KaggleMart']
bb=KR.groupby(['date','country']).agg(num_sold=('num_sold','sum'))
cc=KM.groupby(['date','country']).agg(num_sold=('num_sold','sum'))

# Lineplots
ax1=sns.lineplot(ax=axes[0], data=bb, x='date', y='num_sold', hue='country')
ax2=sns.lineplot(ax=axes[1], data=cc, x='date', y='num_sold', hue='country')

# Aesthetics
ax1.title.set_text('KaggleRama')
ax2.title.set_text('KaggleMart')

In [None]:
#Store sales by product type
fig, axes = plt.subplots(2, 1, figsize=(12, 10))

# Groupby
dd=KR.groupby(['date','product']).agg(num_sold=('num_sold','sum'))
ee=KM.groupby(['date','product']).agg(num_sold=('num_sold','sum'))

# Lineplots
ax1=sns.lineplot(ax=axes[0], data=dd, x='date', y='num_sold', hue='product')
ax2=sns.lineplot(ax=axes[1], data=ee, x='date', y='num_sold', hue='product')

# Aesthetics
ax1.title.set_text('KaggleRama')
ax2.title.set_text('KaggleMart')

We see that both stores sell Hats the most, then Mugs, and Stickers the least.
Sales of stickers are fairly constant throughout the year, whereas hat sales (especially) and mug sales are more affected by seasonality.

In [None]:
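def pre_process(df):
    """Add calendar features derived from 'date' in place, then drop 'date'."""
    df['date'] = pd.to_datetime(df['date'])
    # Series.dt.week is deprecated; isocalendar() provides the ISO week number
    df['week'] = df['date'].dt.isocalendar().week.astype(int)
    df['year'] = 'Y'+df['date'].dt.year.astype(str)
    df['quarter'] = 'Q'+df['date'].dt.quarter.astype(str)
    df['day'] = df['date'].dt.day
    df['dayofyear'] = df['date'].dt.dayofyear
    # In leap years, shift days from Feb 29 onwards back by one so that
    # dates after February share the same dayofyear in every year
    df.loc[(df.date.dt.is_leap_year) & (df.dayofyear >= 60),'dayofyear'] -= 1
    df['weekend'] = df['date'].dt.weekday >=5
    df['weekday'] = 'WD' + df['date'].dt.weekday.astype(str)
    df.drop(columns=['date'],inplace=True)

# Both frames are modified in place
pre_process(train)
pre_process(test)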
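
The leap-year adjustment to `dayofyear` in `pre_process` is easy to miss, so here is a small standalone sketch of its effect. The `dates` series and the `aligned` name are chosen for illustration only; the masking logic mirrors the line in `pre_process`.

```python
import pandas as pd

# 2015 is a normal year, 2016 is a leap year
dates = pd.Series(pd.to_datetime(
    ["2015-03-01", "2016-02-29", "2016-03-01", "2016-12-31"]
))

day_of_year = dates.dt.dayofyear.copy()
# In leap years, move every day from Feb 29 (day 60) onwards back by one
leap_shift = dates.dt.is_leap_year & (day_of_year >= 60)
day_of_year.loc[leap_shift] -= 1

aligned = day_of_year.tolist()
print(aligned)
```

After the shift, 1 March maps to day 60 in both years and 31 December to day 365, while 29 February shares day 59 with 28 February.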

In [None]:
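def SMAPE(y_true, y_pred):
    """Symmetric mean absolute percentage error, on a 0-200 scale."""
    # Sum of actual and |predicted| values, divided by 200 so the result is a percentage
    denominator = (y_true + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    # A zero denominator (actual and prediction both zero) counts as zero error
    diff[denominator == 0] = 0.0
    return np.mean(diff)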
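
To make the metric concrete, here is a small worked example. It uses a separate helper, `smape_sketch`, which is a hypothetical name introduced here and not the function registered with PyCaret; it takes absolute values of both inputs and avoids the division-by-zero warning, but is intended to give the same result for non-negative sales. The input numbers are made up.

```python
import numpy as np

def smape_sketch(y_true, y_pred):
    """SMAPE on a 0-200 scale; pairs where both values are zero count as zero error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerator = np.abs(y_true - y_pred)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 200.0
    # Divide only where the denominator is non-zero; other entries stay at 0.0
    ratio = np.divide(numerator, denominator,
                      out=np.zeros_like(numerator), where=denominator != 0)
    return float(np.mean(ratio))

actual = [100, 200, 0]      # made-up true sales
predicted = [110, 180, 0]   # made-up predictions
score = smape_sketch(actual, predicted)
print(round(score, 4))
```

The per-row errors are 200·10/210, 200·20/380 and 0, and the score is their mean. A perfect prediction scores 0, and predicting any positive value for a true zero scores 200.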

In [None]:
# Initialize the PyCaret setup to start model building and inference.

reg = setup(data = train,
            target = 'num_sold',
            normalize=True,
            normalize_method='robust',
            transform_target = True,
            data_split_shuffle = False, 
            create_clusters = False,
            use_gpu = True,
            silent = True,
            fold=10,
            n_jobs = -1)

In [None]:
# Compare models
add_metric('SMAPE', 'SMAPE', SMAPE, greater_is_better = False)
top = compare_models(sort = 'SMAPE', n_select = 10)
compare_model_results = pull()

In [None]:
blend = blend_models(top)
predict_model(blend);

In [None]:
final_blend = finalize_model(blend)
predict_model(final_blend);

In [None]:
plot_model(blend)

In [None]:
#Make predictions on test data
preds = predict_model(final_blend, data=test)

# Build the submission DataFrame from the test row ids and the predicted labels
sub = pd.DataFrame(list(zip(test.index,preds.Label)),columns = ['row_id', 'num_sold'])
sub.to_csv('submission.csv', index = False)
print(sub)

