## EDA AND BASIC DATA VISUALIZATION USING MATPLOTLIB & SEABORN

Source: https://www.kaggle.com/c/tabular-playground-series-jan-2022/overview

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
test_= pd.read_csv('../input/tabular-playground-series-jan-2022/test.csv', index_col = 'row_id')
train_ = pd.read_csv('../input/tabular-playground-series-jan-2022/train.csv', index_col = 'row_id')

In [None]:
display(train_.head())

Display a brief summary of the data.

In [None]:
print(train_.info())
print('-'*50)
print('number of duplicates : {0}'.format(train_.duplicated().sum()))
print('-'*50)
print('total missing values :')
print(train_.isnull().sum())

First, we inspect the structure of the data and check for missing values and duplicate rows. There are no missing values and no duplicates, so we can proceed with the analysis.
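As a minimal sketch of what these two checks count, the example below runs them on a small made-up frame (the `toy` data is hypothetical, not taken from the competition files) that deliberately contains one duplicate row and one missing value.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_ (made-up values)
toy = pd.DataFrame({
    "date": ["2015-01-01", "2015-01-01", "2015-01-01", "2015-01-02"],
    "country": ["Finland", "Norway", "Norway", "Finland"],
    "num_sold": [10, 20, 20, None],
})

# duplicated() flags rows that repeat an earlier row in every column
n_duplicates = int(toy.duplicated().sum())

# isnull().sum() counts missing cells per column; a second sum() gives the total
n_missing = int(toy.isnull().sum().sum())

print(f"duplicates: {n_duplicates}, missing values: {n_missing}")
```

On the real data both counts are zero, which is why no cleaning step follows.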

In [None]:
for i in ['country','store','product']:
    val = train_[i].value_counts()
    print(val)
    print('-'*50)

Next, we list the unique values of each categorical variable ('country', 'store', 'product'). The country variable has 3 unique values ('Finland', 'Norway', 'Sweden'), the store variable has 2 ('KaggleMart', 'KaggleRama'), and the product variable has 3 ('Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker').
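If only the distinct levels are needed rather than their frequencies, `unique()` and `nunique()` give them directly. The sketch below uses a small hypothetical frame with the same column names.

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror the real data
toy = pd.DataFrame({
    "country": ["Finland", "Norway", "Sweden", "Norway"],
    "store": ["KaggleMart", "KaggleRama", "KaggleMart", "KaggleRama"],
})

# Sorted list of distinct values per column
levels = {col: sorted(toy[col].unique()) for col in toy.columns}

# Number of distinct values per column
counts = toy.nunique()

print(levels)
print(counts)
```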

In [None]:
train_['date'] = pd.to_datetime(train_['date'],format = '%Y-%m-%d', errors = 'raise')
print(train_.info())

Convert the date column to a datetime type using pd.to_datetime().
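The `errors` argument decides what happens when a value does not match the format. The sketch below contrasts `errors='raise'` (used above) with `errors='coerce'` on a hypothetical series that contains one bad entry.

```python
import pandas as pd

# Hypothetical raw strings; the last one is deliberately invalid
raw = pd.Series(["2015-01-01", "2015-12-31", "not a date"])

# errors="coerce" replaces unparseable entries with NaT instead of failing
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
n_bad = int(parsed.isna().sum())

# errors="raise" stops at the first entry that does not match the format
try:
    pd.to_datetime(raw, format="%Y-%m-%d", errors="raise")
    raised = False
except ValueError:
    raised = True

print(parsed)
print(f"unparseable entries: {n_bad}, raise mode failed: {raised}")
```

Using `errors='raise'` on the competition data is a reasonable choice because a silent NaT would otherwise drop rows from every later date-based aggregation.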

In [None]:
sorted_ = train_.sort_values(by = 'date')
display(sorted_.head())

In [None]:
stcountry_ = sorted_.groupby('country').resample('Y', on = 'date')['num_sold'].agg(['mean','median','min','max','sum'])
print(stcountry_)

From the table above, we can see that Norway has higher product sales than the other countries (Finland and Sweden) in every year.
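The same kind of yearly summary can also be built by grouping on the calendar year directly. This is a sketch of an alternative route, not the method used above; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy sales data: two countries, two years
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-03-01", "2015-09-01", "2016-03-01", "2016-09-01"] * 2),
    "country": ["Finland"] * 4 + ["Norway"] * 4,
    "num_sold": [10, 20, 30, 40, 50, 70, 90, 110],
})

# Extract the year once, then aggregate per (country, year)
toy["year"] = toy["date"].dt.year
yearly = toy.groupby(["country", "year"])["num_sold"].agg(["mean", "median", "min", "max", "sum"])

print(yearly)
```

The result is indexed by an integer year instead of a year-end timestamp, which can be easier to read and to plot.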

In [None]:
ststore_ = sorted_.groupby('store').resample('Y', on = 'date')['num_sold'].agg(['mean','median','min','max','sum'])
print(ststore_)

From the table above, KaggleRama sold more products than KaggleMart.

In [None]:
stproduct_ = sorted_.groupby('product').resample('Y', on = 'date')['num_sold'].agg(['mean','median','min','max','sum'])
print(stproduct_)

From the table above, Kaggle Hat is the most purchased product in every year.

In [None]:
grouped1 = sorted_.groupby('date')[['num_sold']].mean()
display(grouped1.head())

In [None]:
MAVG = grouped1.rolling(window = 100).mean()
display(MAVG.head())

In [None]:
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize = (17,8))
ax.plot(grouped1.index, grouped1.num_sold, label = 'Product Selling', color = 'black', alpha = 0.35)
ax.plot(MAVG.index, MAVG.num_sold, label = 'MAVG', color = 'red')
ax.set(xlabel = 'Days', ylabel = 'Units', title = 'Average Product Sales 2015-2018')
ax.legend(loc = 'upper left',fontsize = 'medium')
plt.show()


From the chart, we can see a positive trend over the years and a recurring pattern: sales tend to rise at the end of each year and stay elevated for a few months afterwards (until about mid-year).
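The red MAVG line starts later than the raw series because `rolling(window = 100)` returns NaN until 100 observations are available. The sketch below shows the same behaviour on a hypothetical five-day series with a window of 3, and how `min_periods` relaxes it.

```python
import pandas as pd

# Hypothetical daily series (made-up values)
daily = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2015-01-01", periods=5, freq="D"),
)

# Default: the mean is NaN until the window holds 3 observations
strict = daily.rolling(window=3).mean()

# min_periods=1: average over whatever is available so far
relaxed = daily.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"raw": daily, "strict": strict, "relaxed": relaxed}))
```

The early values of the relaxed version average fewer points, so they are noisier than the rest of the line.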

In [None]:
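# Select num_sold before averaging so the string columns are not passed to mean()
finland_ = sorted_[sorted_['country'] == 'Finland'].groupby('date')[['num_sold']].mean()
sweden_ = sorted_[sorted_['country'] == 'Sweden'].groupby('date')[['num_sold']].mean()
norway_ = sorted_[sorted_['country'] == 'Norway'].groupby('date')[['num_sold']].mean()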

#Plots
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize = (15,6))
ax.plot(finland_.index, finland_.num_sold, label = 'Finland', color = 'red', alpha = 0.5)
ax.plot(sweden_.index, sweden_.num_sold, label = 'Sweden', color = 'blue', alpha = 0.5)
ax.plot(norway_.index, norway_.num_sold, label = 'Norway', color = 'green', alpha = 0.5)
ax.set(xlabel = 'Days', ylabel = 'Units', title = 'Average Product Selling by Country 2015-2018')
ax.legend(fontsize = 'medium', loc = 'upper left')
plt.show()

The per-country series show the same positive trend and the same rise at the end of each year as the overall average. The chart also shows clearly that Norway has higher product sales than the other two countries.
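Filtering once per country works, but the daily means for all countries can also be computed in a single pass by grouping on both keys and unstacking. The sketch below uses a hypothetical `toy` frame with made-up values.

```python
import pandas as pd

# Hypothetical toy data: several sales rows per (date, country)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01"] * 3 + ["2015-01-02"] * 3),
    "country": ["Finland", "Finland", "Norway", "Finland", "Norway", "Norway"],
    "num_sold": [10, 20, 40, 30, 50, 70],
})

# One row per date, one column per country, each cell the daily mean
daily_by_country = (
    toy.groupby(["date", "country"])["num_sold"]
       .mean()
       .unstack("country")
)

print(daily_by_country)
```

Each column of the resulting frame can then be passed to `ax.plot` in a loop, and the same pattern applies to 'store' and 'product'.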

In [None]:
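# Select num_sold before averaging so the string columns are not passed to mean()
KaggleMart_ = sorted_[sorted_['store'] == 'KaggleMart'].groupby('date')[['num_sold']].mean()
KaggleRama_ = sorted_[sorted_['store'] == 'KaggleRama'].groupby('date')[['num_sold']].mean()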

#Plots
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize = (15,6))
ax.plot(KaggleMart_.index, KaggleMart_.num_sold, label = 'Kaggle Mart', color = 'red', alpha = 0.5)
ax.plot(KaggleRama_.index, KaggleRama_.num_sold, label = 'Kaggle Rama', color = 'blue', alpha = 0.5)
ax.set(xlabel = 'Days', ylabel = 'Units', title = 'Average Product Sales by Store 2015-2018')
ax.legend(fontsize = 'medium', loc = 'upper left')
plt.show()

Both stores show the same positive trend and the same rise at the end of each year. The chart also shows clearly that KaggleRama has higher product sales than KaggleMart.

In [None]:
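# Select num_sold before averaging so the string columns are not passed to mean()
KaggleHat_ = sorted_[sorted_['product'] == 'Kaggle Hat'].groupby('date')[['num_sold']].mean()
KaggleMug_ = sorted_[sorted_['product'] == 'Kaggle Mug'].groupby('date')[['num_sold']].mean()
KaggleSticker_ = sorted_[sorted_['product'] == 'Kaggle Sticker'].groupby('date')[['num_sold']].mean()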
#Plots
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize = (15,6))
ax.plot(KaggleHat_.index, KaggleHat_.num_sold, label = 'Kaggle Hat', color = 'red', alpha = 0.5)
ax.plot(KaggleMug_.index, KaggleMug_.num_sold, label = 'Kaggle Mug', color = 'blue', alpha = 0.5)
ax.plot(KaggleSticker_.index, KaggleSticker_.num_sold, label = 'Kaggle Sticker', color = 'green', alpha = 0.5)
ax.set(xlabel = 'Days', ylabel = 'Units', title = 'Average Product Sales by Product 2015-2018')
ax.legend(fontsize = 'medium', loc = 'upper left')
plt.show()

All three products show the same positive trend and the same rise at the end of each year. The chart also shows clearly that Kaggle Hat is the most purchased product.

In [None]:
grouped2 = sorted_.groupby(['country','store','product']).resample('Y',on = 'date')['num_sold'].mean()
display(grouped2)

In [None]:
DF_ = pd.DataFrame(grouped2).reset_index()
DF_['year'] = DF_['date'].dt.year
display(DF_.head())

In [None]:
sns.set_style('whitegrid')
cp = sns.catplot(x = 'year', y ='num_sold', col = 'product', row = 'store', hue = 'country', data = DF_, kind = 'point',
           height = 5, aspect = 1.25)
cp.fig.subplots_adjust(top=0.9)
cp.fig.suptitle('Average Product Sales by Years 2015-2018')
plt.show()

The chart above breaks down the average yearly sales in more detail, by country, store, and product.

In [None]:
years_ = [2015, 2016, 2017, 2018]
fig = plt.figure(figsize = (10,10))
for i in range(4):
    # Compare on the year component so every date in that year is selected
    val = sorted_[sorted_['date'].dt.year == years_[i]].groupby('country')['num_sold'].sum()
    ax = fig.add_subplot(2,2,i+1)
    ax.pie(val.values, labels = val.index,autopct = '%0.2f%%')
    ax.set_title(f'Proportion of Product Sales by Country ({years_[i]})', fontdict = {'fontsize' : 10})
plt.show()

We can see that Norway has the largest share of product sales in every year.
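The percentages printed on the pie charts can also be computed as a table, which makes it easier to compare years side by side. This is a sketch on a hypothetical `toy` frame with made-up values.

```python
import pandas as pd

# Hypothetical toy data (made-up values)
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-01-01", "2015-06-01", "2015-06-01", "2016-01-01", "2016-06-01"]
    ),
    "country": ["Finland", "Norway", "Sweden", "Finland", "Norway"],
    "num_sold": [20, 50, 30, 40, 60],
})

# Total units per (year, country), reshaped to one row per year
toy["year"] = toy["date"].dt.year
totals = toy.groupby(["year", "country"])["num_sold"].sum().unstack(fill_value=0)

# Divide each row by its own total to get percentage shares
share = totals.div(totals.sum(axis=1), axis=0) * 100

print(share.round(2))
```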

In [None]:
countries_ = ['Finland','Sweden', 'Norway']
fig = plt.figure(figsize = (10,10))
for i in range(3):
    # Filter by country (not by year) so each pie matches its title
    val = sorted_[sorted_['country'] == countries_[i]].groupby('product')['num_sold'].sum()
    ax = fig.add_subplot(2,2,i+1)
    ax.pie(val.values, labels = val.index,autopct = '%0.2f%%')
    ax.set_title(f'Proportion of Product Sales by Product ({countries_[i]})', fontdict = {'fontsize' : 10})
plt.show()

We can see that Kaggle Hat has a larger share of units sold than the other products.
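For a numeric view of product shares within each country, `pd.crosstab` with `normalize='index'` returns row-wise proportions in one call. The sketch below assumes a hypothetical `toy` frame with made-up values in which every country and product combination is present.

```python
import pandas as pd

# Hypothetical toy data (made-up values)
toy = pd.DataFrame({
    "country": ["Finland", "Finland", "Norway", "Norway"],
    "product": ["Kaggle Hat", "Kaggle Mug", "Kaggle Hat", "Kaggle Mug"],
    "num_sold": [30, 20, 70, 30],
})

# Sum units per (country, product), then normalise each country row to 1
share = pd.crosstab(
    index=toy["country"],
    columns=toy["product"],
    values=toy["num_sold"],
    aggfunc="sum",
    normalize="index",
)

print(share.round(3))
```

Each row sums to 1, so the values correspond to the slices of one pie chart.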