<a href="https://colab.research.google.com/github/subramanya4shenoy/MachineLearningNbs/blob/main/LR_Forecasting_Mini_Course_Sales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Problem statement** ⚔

**Dataset Description**

For this challenge, you will be predicting a full year's worth of sales for various fictitious learning modules from different fictitious Kaggle-branded stores in different (real!) countries. This dataset is completely synthetic, but contains many effects you see in real-world data, e.g., weekend and holiday effects, seasonality, etc. You are given the task of predicting sales for the year 2022.

Good luck!

In [3]:
"""
Basic setup for integrating Kaggle
Make sure the kaggle.json file is available and uploaded in session
"""
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content'

In [4]:
"""
Download the competition dataset from Kaggle
"""
!kaggle competitions download -c playground-series-s3e19

Downloading playground-series-s3e19.zip to /content
  0% 0.00/1.18M [00:00<?, ?B/s]
100% 1.18M/1.18M [00:00<00:00, 136MB/s]


In [5]:
"""
unzipping the files and removing the zip
"""
!unzip \*.zip && rm *.zip

Archive:  playground-series-s3e19.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               




---





**Task**: To predict the sales during the year 2022

**Steps:**
  1. 👁 Look at the data and learn what it contains.
  2. 🎯 Decide which model to use.
  3. 🦖 Do exploratory data analysis (EDA).
  4. 🧹 Prepare the data.

In [26]:
"""
Importing packages for reading data
"""
import pandas as pd

salesdf = pd.read_csv('train.csv')
testdf = pd.read_csv('test.csv')
submissiondf = pd.read_csv('sample_submission.csv')

In [49]:
def get_insights(df):
  print("\n\n===========dataframe==========================")
  print(df.head(3))
  print("\n================================================")

  print("\n\n===========dataframe size=====================")
  print(df.shape)
  print("\n================================================")

  print("\n\n===========dataframe column names=============")
  print(df.columns)
  print("\n================================================")

  print("\n\n===========dataframe Unique value=============")
  print(df.nunique())
  print("\n================================================")

  print("\n\n===========dataframe data types===============")
  df.info()  # info() prints its own report and returns None, so it is not wrapped in print()
  print("\n================================================")

  print("\n\n===========dataframe descriptions=============")
  print(df.describe())
  print("\n================================================")
  print("==================================================")


In [53]:
"""
Understanding the data
"""
get_insights(salesdf)
# get_insights(testdf)
# get_insights(submissiondf)



   id        date    country         store  \
0   0  2017-01-01  Argentina  Kaggle Learn   
1   1  2017-01-01  Argentina  Kaggle Learn   
2   2  2017-01-01  Argentina  Kaggle Learn   

                                          product  num_sold  
0               Using LLMs to Improve Your Coding        63  
1                   Using LLMs to Train More LLMs        66  
2  Using LLMs to Win Friends and Influence People         9  



(136950, 6)



Index(['id', 'date', 'country', 'store', 'product', 'num_sold'], dtype='object')



id          136950
date          1826
country          5
store            3
product          5
num_sold      1028
dtype: int64



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136950 entries, 0 to 136949
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        136950 non-null  int64 
 1   date      136950 non-null  object
 2   country   136950 non-null  object
 3   store     136950 non-nu

### **Insights from data (EDA)**
1. The training set has 6 columns and 136950 rows.
  *   Two columns are int64: id and num_sold.
  *   The dependent variable is num_sold; this is what we need to predict.
  *   The date column is stored as object and needs to be converted to datetime.
  *   The country, store and product columns are categorical and need to be converted into a numerical representation.
  *   We will try one-hot encoding for the product and store columns, as they have very few unique values.


2. The test set has similar data, but the num_sold column is missing (that is what we will be predicting).

3. The submission set has the id and num_sold columns. We will fill in the predictions, convert to CSV and submit.
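
The one-hot encoding idea from point 1 can be sketched on a toy frame. This is only an illustration, not the notebook's actual preprocessing: the frames, the store names "Store B" and "Store C", and the variable names are made up. It also shows one way to keep the test columns aligned with the training columns when a category is absent from the test data.

```python
import pandas as pd

# Toy frames standing in for train.csv / test.csv.
# "Store B" and "Store C" are placeholder names, not real values from the dataset.
train = pd.DataFrame({
    "store": ["Kaggle Learn", "Store B", "Store C"],
    "num_sold": [63, 66, 9],
})
test = pd.DataFrame({"store": ["Store B", "Kaggle Learn"]})

# One indicator column per store value; dtype=int gives 0/1 instead of booleans.
train_encoded = pd.get_dummies(train, columns=["store"], dtype=int)

# Feature columns are everything except the target.
feature_columns = [c for c in train_encoded.columns if c != "num_sold"]

# Encode the test frame, then force it to have exactly the training feature columns.
# Categories missing from the test data become all-zero columns.
test_encoded = (
    pd.get_dummies(test, columns=["store"], dtype=int)
    .reindex(columns=feature_columns, fill_value=0)
)

print(train_encoded)
print(test_encoded)
```

Reindexing against the training columns is a defensive choice: it only matters if the two files do not contain the same set of categories.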

   




In [77]:
"""
Part of EDA: updating the date column,
converting categorical data into numbers,
one-hot encoding.

We will be doing the same for both salesdf (which is later split for training and testing) and testdf,
so we create a function for each operation first.
"""

"""
date column
1. Convert the date column to datetime.
2. We are told: "This dataset is completely synthetic, but contains many effects you see in real-world data,
    e.g., weekend and holiday effect, seasonality, etc"
3. So we may need to introduce more columns based on holidays and weekends.
4. We import the holidays package to look up the public holidays of each country.
"""

import holidays

# Create a dictionary to store holiday classes for different countries
holiday_classes = {}

def generate_holidays(df):
  for country in df['country'].unique():
    try:
        holiday_classes[country] = getattr(holidays, country)()
    except AttributeError:
        holiday_classes[country] = None


def prepare_date_column(tempdf):
  df = tempdf.copy()

  # Converting to datetime format
  df['date'] = pd.to_datetime(df['date'])

  # Check if date is weekend then True else False. [5 = saturday, 6 = sunday]
  df['Is_Weekend'] = df['date'].dt.dayofweek.isin([5, 6])

  # Converting True, False into 1, 0
  df['Is_Weekend'] = df['Is_Weekend'].astype(int)

  # Checking for country specific seasonal holidays
  df['Is_Seasonal_Holiday'] = 0

  # generating special holidays for countries
  generate_holidays(df)

  # updating the seasonal holiday column
  for index, row in df.iterrows():
    country = row['country']
    if country in holiday_classes and holiday_classes[country] is not None and row['date'] in holiday_classes[country]:
        df.at[index, 'Is_Seasonal_Holiday'] = 1

  return df
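
The row-by-row `iterrows` loop above can be slow on 136950 rows. A minimal sketch of an alternative, assuming the holiday dates are already available as a plain dictionary: `add_calendar_flags` and `holiday_dates_by_country` are hypothetical names introduced here, and the dictionary would have to be filled from whatever holiday source is used (for example the `holidays` objects built in `generate_holidays`).

```python
import pandas as pd

def add_calendar_flags(df, holiday_dates_by_country):
    """Add Is_Weekend and Is_Seasonal_Holiday without iterrows.

    holiday_dates_by_country is a hypothetical mapping such as
    {"Argentina": ["2017-01-01", ...]}; it is an assumption of this sketch.
    """
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])

    # dayofweek: Monday=0 ... Saturday=5, Sunday=6
    out["Is_Weekend"] = (out["date"].dt.dayofweek >= 5).astype(int)

    # Flatten the mapping into a set of (country, Timestamp) pairs for O(1) lookups.
    holiday_pairs = {
        (country, pd.Timestamp(day))
        for country, days in holiday_dates_by_country.items()
        for day in days
    }

    # Iterating over two columns with zip is far cheaper than iterrows.
    out["Is_Seasonal_Holiday"] = [
        int(key in holiday_pairs) for key in zip(out["country"], out["date"])
    ]
    return out

# Usage on a toy frame (1 January 2017 was a Sunday and New Year's Day).
toy = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02", "2017-01-07"],
    "country": ["Argentina", "Argentina", "Canada"],
})
flags = add_calendar_flags(toy, {"Argentina": ["2017-01-01"]})
print(flags)
```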

In [95]:
"""
  Country has only a few unique values, so we convert it into numbers with scikit-learn's LabelEncoder.
  Store is similar, so we label-encode it the same way.
  Product also has a limited set of values; we one-hot encode that column via
  get_dummies.
"""
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

def prepare_columns(tempdf):
  df = tempdf.copy()
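  # The shared encoder is refit for every column and on every call, so afterwards
  # label_encoder.classes_ only holds the most recent mapping (store).
  df['country'] = label_encoder.fit_transform(df['country'])
  df['store'] = label_encoder.fit_transform(df['store'])
  # One-hot encode product: one indicator column per product name
  dummies_df = pd.get_dummies(df['product'])
  df = pd.concat([df, dummies_df], axis=1)
  df = df.drop('product', axis = 1)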
  return df

def prepare_data(df):
  df = prepare_date_column(df)
  df = prepare_columns(df)
  return df

In [96]:
# preprocessing the training sales data
salesdf = prepare_data(salesdf)


Unnamed: 0,id,date,country,store,num_sold,Using LLMs to Improve Your Coding,Using LLMs to Train More LLMs,Using LLMs to Win Friends and Influence People,Using LLMs to Win More Kaggle Competitions,Using LLMs to Write Better
0,0,2017-01-01,0,1,63,1,0,0,0,0
1,1,2017-01-01,0,1,66,0,1,0,0,0
2,2,2017-01-01,0,1,9,0,0,1,0,0
3,3,2017-01-01,0,1,59,0,0,0,1,0
4,4,2017-01-01,0,1,49,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...
136945,136945,2021-12-31,4,0,700,1,0,0,0,0
136946,136946,2021-12-31,4,0,752,0,1,0,0,0
136947,136947,2021-12-31,4,0,111,0,0,1,0,0
136948,136948,2021-12-31,4,0,641,0,0,0,1,0


In [None]:
"""
We have now completed the EDA and data preprocessing.
Let's split the data for training and testing.
"""

from sklearn.model_selection import train_test_split
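
The notebook does not show how the split is done. One option for a forecasting task, sketched here on made-up data, is to keep the rows in date order and hold out the most recent part with `shuffle=False`, so the model is validated on dates later than those it was trained on. The frame `toy_df` and the 80/20 ratio are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed salesdf (values are made up).
toy_df = pd.DataFrame({
    "date": pd.date_range("2021-12-22", periods=10, freq="D"),
    "num_sold": list(range(10, 20)),
})
toy_df["Is_Weekend"] = (toy_df["date"].dt.dayofweek >= 5).astype(int)

# Make sure rows are in chronological order before splitting.
toy_df = toy_df.sort_values("date")

X = toy_df.drop(columns=["num_sold", "date"])  # features
y = toy_df["num_sold"]                         # target

# shuffle=False keeps the order: the last 20% of rows become the validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(len(X_train), len(X_valid))
```

A random split (the default, `shuffle=True`) would mix past and future rows, which tends to give an optimistic estimate when the goal is to predict a later year.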
