# REFERENCES

Thank you all for sharing your ideas in notebooks and discussions. I am very new here and am trying to improve my data science skills. This is my first notebook, and I am open to any comments, ideas, and criticism. I used features and models from other notebooks and tried to implement them in my own way, so I want to credit those notebooks here.

1. Thank you for the concise explanations and visualizations. I got the idea of using mean and median values from this notebook, as well as the memory-reduction function. https://www.kaggle.com/code/javigallego/tps-mar22-top-6-solution-eda-fe-blending#3-%7C-Feature-Engineering

1. I was having trouble adding lag features; this notebook helped me figure it out. https://www.kaggle.com/code/martynovandrey/tps-mar-22-fe-the-less-the-better#Time-lags

1. Thanks for generalizing the special values. https://www.kaggle.com/code/ambrosm/tpsmar22-generalizing-the-special-values/notebook

# Introduction

For the March edition of the 2022 Tabular Playground Series you're challenged to forecast twelve hours of traffic flow in a U.S. metropolis. The time series in this dataset are labelled with both location coordinates and a direction of travel, a combination of features that will test your skill at spatio-temporal forecasting within a highly dynamic traffic network.


<font color = "blue">
Content:
    
1. [Read Data](#1)
1. [Reduce Memory Usage](#2)
1. [Getting familiar with Data](#3)
1. [Simple Exploratory Data Analysis](#4)
1. [Feature Engineering](#5)
    * [Time Features](#6)
    * [Mean and Median Values](#7)
    * [Min and Max Congestion](#8)
    * [Morning Averages](#9)
    * [Lag Features](#10)
    * [Label Encoding](#11)
1. [Outliers Detection](#12)
1. [Modelling with CATBOOST](#13)
1. [Submission](#14)

 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import optuna
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf
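# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in matplotlib 3.6
plt.style.use("seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid")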
from collections import Counter

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id = "1"></a><br>

# Read Data

In [None]:
train = pd.read_csv("../input/tabular-playground-series-mar-2022/train.csv")
test = pd.read_csv("../input/tabular-playground-series-mar-2022/test.csv")

In [None]:
df = pd.concat([train, test], axis = 0)

<a id = "2"></a><br>

# Reduce Memory Usage

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtypes

        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()

            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2

    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
 
    return df

df = reduce_mem_usage(df)
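
To see the effect of downcasting on a small scale, here is a minimal sketch on a toy frame. The column names `small` and `big` are made up for illustration, and `pd.to_numeric(..., downcast="integer")` is pandas' built-in alternative to the hand-written range checks, not the function this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy frame; both columns start as 64-bit integers.
toy = pd.DataFrame({
    "small": np.array([0, 1, 2, 3], dtype=np.int64),
    "big": np.array([0, 1000, 40000, 70000], dtype=np.int64),
})

# The representable range decides which dtype is safe.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)

# Downcast each column to the smallest signed integer dtype that fits.
downcast = toy.copy()
for col in downcast.columns:
    downcast[col] = pd.to_numeric(downcast[col], downcast="integer")

print(downcast.dtypes)
print(toy.memory_usage().sum(), "->", downcast.memory_usage().sum())
```

Values up to 3 fit in `int8`, while 70000 exceeds the `int16` range and therefore needs `int32`.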

<a id = "3"></a><br>

# Getting familiar with Data

train.csv - the training set, comprising measurements of traffic congestion across 65 roadways from April through September of 1991.

row_id - a unique identifier for this instance

time - the 20-minute period in which each measurement was taken

x - the east-west midpoint coordinate of the roadway

y - the north-south midpoint coordinate of the roadway

direction - the direction of travel of the roadway. EB indicates "eastbound" travel, for example, while SW indicates a "southwest" direction of travel.

congestion - congestion levels for the roadway during each hour; the target. The congestion measurements have been normalized to the range 0 to 100.

test.csv - the test set; you will make hourly predictions for roadways identified by a coordinate location and a direction of travel on the day of 1991-09-30.

sample_submission.csv - a sample submission file in the correct format
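
Because each row covers a 20-minute period, a day contains 72 time slots. The following is a minimal sketch of turning a timestamp into a slot index; the timestamps are made up and the name `slot` is only illustrative.

```python
import pandas as pd

# Hypothetical timestamps in the same format as the `time` column.
times = pd.to_datetime(pd.Series([
    "1991-04-01 00:00:00",
    "1991-04-01 00:20:00",
    "1991-04-01 12:40:00",
    "1991-04-01 23:40:00",
]))

# Three 20-minute slots per hour, so the slot index runs from 0 to 71.
slot = times.dt.hour * 3 + times.dt.minute // 20
print(slot.tolist())  # [0, 1, 38, 71]
```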

In [None]:
df.isnull().sum()

##### As shown above, there is no missing data apart from `congestion`, which is empty for the test rows because it is the target.

##### Let's check the unique values to get more intuition about what we have.

In [None]:
unique_x = df.x.unique()
unique_y = df.y.unique()
unique_direction = df.direction.unique()

print("Unique values of x: ", unique_x)
print("Unique values of y: ", unique_y)
print("Unique directions: ", unique_direction)

##### There are 3 unique values for x, 4 unique values for y, and 8 different directions.
##### They can be combined into a single new feature later on.
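
As a minimal sketch of that idea, the three columns can be joined into one string key per roadway. The toy rows are made up, and the `_` separator is an assumption added for readability.

```python
import pandas as pd

# Toy rows using the same column names as the dataset.
toy = pd.DataFrame({
    "x": [0, 0, 2, 2],
    "y": [0, 0, 3, 3],
    "direction": ["EB", "NB", "EB", "EB"],
})

# One key per (x, y, direction) combination.
toy["roadway"] = toy["x"].astype(str) + "_" + toy["y"].astype(str) + "_" + toy["direction"]

print(toy["roadway"].tolist())   # ['0_0_EB', '0_0_NB', '2_3_EB', '2_3_EB']
print(toy["roadway"].nunique())  # 3
```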

In [None]:
df["congestion"].describe()

<a id = "4"></a><br>
# Simple Exploratory Data Analysis

In [None]:
temp = df.copy()
temp["DateTime"] = pd.to_datetime(temp["time"])
temp = temp.set_index("DateTime")
temp['date'] = temp.index
# .dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
temp['weekofyear'] = temp['date'].dt.isocalendar().week.astype(int)
temp['day_of_week'] = temp['date'].dt.dayofweek

In [None]:
list1 = ["x", "y", "direction", "congestion"]
# direction is a string column, so restrict the correlation to numeric columns
sns.heatmap(temp[list1].corr(numeric_only=True), annot = True, fmt = ".2f")
plt.show()

In its raw form, the data shows no notable correlation between the coordinates and congestion.

In [None]:
plt.figure(figsize=(12, 2))

df.groupby(by = ["x","y","direction"])["congestion"].mean().plot()


In [None]:
df[(df["x"] == 0) & (df["y"] == 0)]["direction"].unique()

In [None]:
# factorplot was renamed to catplot (and size to height) in newer seaborn versions
g = sns.catplot(x = "weekofyear", y = "congestion", kind = "bar", data  = temp, height = 6)
g.set_ylabels("Congestion")
plt.show()

There is no clear week-to-week trend.

In [None]:
g = sns.catplot(x = "day_of_week", y = "congestion", kind = "bar", data  = temp, height = 6)
g.set_ylabels("Congestion")
plt.show()

Congestion shows no remarkable trend across weekdays, but it is slightly lower at weekends.
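
This observation suggests a weekend flag. A minimal sketch, assuming pandas' convention that `dayofweek` is 0 for Monday and 6 for Sunday, so both Saturday (5) and Sunday (6) need to be covered:

```python
import pandas as pd

# One full week starting on Monday, 1991-04-01.
days = pd.Series(pd.date_range("1991-04-01", periods=7, freq="D"))

day_of_week = days.dt.dayofweek              # Monday=0 ... Sunday=6
is_weekend = (day_of_week >= 5).astype(int)  # Saturday and Sunday

print(day_of_week.tolist())  # [0, 1, 2, 3, 4, 5, 6]
print(is_weekend.tolist())   # [0, 0, 0, 0, 0, 1, 1]
```

A comparison with `> 5` would flag only Sunday.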

In [None]:
# numeric_only=True skips string and datetime columns, which cannot be averaged
temp = temp.groupby(temp['date'].dt.hour).mean(numeric_only=True)
temp["hour"] = range(0,24)
g = sns.catplot(x ="hour" ,y = "congestion", kind = "bar", data  = temp, height = 6)
g.set_ylabels("Congestion")
plt.show()

Now we can see a clear pattern over the course of a day.

In [None]:
plt.figure(figsize=(10, 6))
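# Use the observed values as x positions so the plot still works if some value in 0..100 never occurs
congestion_counts = train.congestion.value_counts().sort_index()
plt.bar(congestion_counts.index, congestion_counts.values, width=1, color = "#093260")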
plt.ylabel('Frequency')
plt.xlabel('Congestion')
plt.show()

In [None]:
decomposed_results = seasonal_decompose(train["congestion"], period=24)

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(decomposed_results.trend, alpha=0.6, color='blue', label='Congestion Trend', linewidth = 1.0)
ax.plot(decomposed_results.trend.rolling(4680).mean().shift(-4680), alpha=1, color='black', label='Congestion Trend (Rolling Mean)', linewidth = 2.0)
ax.legend(loc='best')
ax.set_ylabel("Congestion Trend")
ax.set_xlabel("Row index (time-ordered)")
plt.show()

The plot above shows no long-term (seasonal) trend in congestion, but there is a clear daily pattern.

In [None]:
figure, axis = plt.subplots(3, 2, figsize = (16,10) )

direc_tr = train["direction"]
direc_tr_value = direc_tr.value_counts()

x_tr = train["x"]
x_tr_value = x_tr.value_counts()

y_tr = train["y"]
y_tr_value = y_tr.value_counts()

direc_ts = test["direction"]
direc_ts_value = direc_ts.value_counts()

x_ts = test["x"]
x_ts_value = x_ts.value_counts()

y_ts = test["y"]
y_ts_value = y_ts.value_counts()





axis[0, 0].bar(x_tr_value.index, x_tr_value, color = "#093260")
axis[0, 0].set_title("Frequency of x coordinates in train")
axis[0, 1].bar(x_ts_value.index, x_ts_value, color = "#093260")
axis[0, 1].set_title("Frequency of x coordinates in test")

axis[1, 0].bar(y_tr_value.index, y_tr_value, color= "orange")
axis[1, 0].set_title("Frequency of y coordinates in train")
axis[1, 1].bar(y_ts_value.index, y_ts_value, color= "orange")
axis[1, 1].set_title("Frequency of y coordinates in test")

axis[2, 0].bar(direc_tr_value.index, direc_tr_value, color = "red")
axis[2, 0].set_title("Frequency of Directions in train")
axis[2, 1].bar(direc_ts_value.index, direc_ts_value, color = "red")
axis[2, 1].set_title("Frequency of Directions in test")

plt.show()


For a reasonable model, each feature should be distributed similarly in the train and test data. The plots above show that this is the case for all three features.
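
The same comparison can be made numerically rather than visually. Below is a minimal sketch on made-up data that compares the relative frequency of each category in two sets; `train_toy` and `test_toy` are hypothetical stand-ins for the real columns.

```python
import pandas as pd

# Hypothetical category columns standing in for train["direction"] and test["direction"].
train_toy = pd.Series(["EB", "EB", "NB", "SB", "NB", "EB"], name="direction")
test_toy = pd.Series(["EB", "NB", "SB"], name="direction")

# Relative frequencies side by side; a category missing in one set gets a share of 0.
shares = pd.DataFrame({
    "train": train_toy.value_counts(normalize=True),
    "test": test_toy.value_counts(normalize=True),
}).fillna(0.0)
shares["abs_diff"] = (shares["train"] - shares["test"]).abs()

print(shares.sort_index())
```

Small values in `abs_diff` indicate that a category is equally common in both sets.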

**Inferences**

* The congestion values look roughly Gaussian.
* The raw features are not very informative on their own, so time features might be effective. As the daily congestion figure shows, the hour is a useful feature.
* Mean and median values may be useful, since there is no real week-to-week trend.
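
To illustrate the last point, here is a minimal sketch of computing a per-group mean and median and attaching them to every row. The toy data and the key columns are made up; the missing value mimics a test row whose target is unknown.

```python
import pandas as pd

# Toy data: two roadways observed at the same hour; None mimics a test row.
toy = pd.DataFrame({
    "roadway": ["A", "A", "A", "B", "B", "B"],
    "hour": [8, 8, 8, 8, 8, 8],
    "congestion": [10.0, 20.0, 60.0, 40.0, 50.0, None],
})
keys = ["roadway", "hour"]

# Aggregate the target per group (missing values are skipped), then merge back.
stats = (toy.groupby(keys)["congestion"]
            .agg(["mean", "median"])
            .rename(columns={"mean": "mean congestion", "median": "median congestion"})
            .reset_index())
toy = toy.merge(stats, how="left", on=keys)

print(toy)
```

The row without a target still receives the statistics of its group, which is what makes these columns usable as features for the test set.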

<a id = "5"></a><br>

# Feature Engineering

<a id = "6"></a><br>

## Time features

In [None]:
df["DateTime"] = pd.to_datetime(df["time"])
df = df.set_index(df["DateTime"])
df['date'] = df.index
df['hour'] = df['date'].dt.hour
df['day_of_week'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['dayofyear'] = df['date'].dt.dayofyear
df['dayofmonth'] = df['date'].dt.day
# .dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
df['minute'] = df["date"].dt.minute
df['afternoon'] = df['hour'] >= 12
# Index of the 20-minute slot within the day (0..71)
df['moment']  = df['date'].dt.hour * 3 + df['date'].dt.minute // 20
# dayofweek is 0 for Monday, so Saturday is 5 and Sunday is 6
df["is_weekend"] = np.where(df["day_of_week"] >= 5, 1, 0)

<a id = "7"></a><br>

## Mean and Median Values

In [None]:
df["roadway"] = df.x.astype(str) + df.y.astype(str) + df.direction.astype(str)
keys = ["day_of_week","hour", "minute","roadway" ]

# Aggregate only the target; averaging string or datetime columns fails in recent pandas
temp = df.groupby(by=keys)["congestion"].mean().rename("mean congestion").reset_index()
df = df.merge(temp, how='left', on=keys)

temp = df.groupby(by=keys)["congestion"].median().rename("median congestion").reset_index()
df = df.merge(temp, how='left', on=keys)


<a id = "8"></a><br>
## Min and Max Congestion

In [None]:
temp = df.groupby(by=keys)["congestion"].min().rename("min congestion").reset_index()
df = df.merge(temp, how='left', on=keys)

temp = df.groupby(by=keys)["congestion"].max().rename("max congestion").reset_index()
df = df.merge(temp, how='left', on=keys)



<a id = "9"></a><br>
## Morning Averages

In [None]:
df_mornings = df[(df.hour >= 6) & (df.hour < 12)]
morning_avgs = pd.DataFrame(df_mornings.groupby(['month', 'dayofmonth', 'roadway']).congestion.median().astype(int)).reset_index()
morning_avgs = morning_avgs.rename(columns={'congestion':'morning_avg'})
df = df.merge(morning_avgs, on=['month', 'dayofmonth', 'roadway'], how='left')

<a id = "10"></a><br>
## Lag Features

In [None]:
for delta in range(1,8):
    day = df.copy()
    day['date'] = day['date'] + pd.Timedelta(delta, unit="d")
    name = f'lag_{delta}'
    day = day.rename(columns={'congestion':name})[['date', 'roadway', name]]
    df = df.merge(day, on=['date', 'roadway'], how='left')
df=df.fillna(df["congestion"].median())
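
The shift-and-merge idea behind the loop above can be seen on a tiny example. This is a minimal sketch with made-up values for a single hypothetical roadway `A`: moving each row's date forward by one day and merging back on `date` and `roadway` gives every row the value observed one day earlier.

```python
import pandas as pd

# Toy data: one roadway observed at 08:00 on three consecutive days.
toy = pd.DataFrame({
    "date": pd.to_datetime(["1991-04-01 08:00", "1991-04-02 08:00", "1991-04-03 08:00"]),
    "roadway": ["A", "A", "A"],
    "congestion": [30.0, 40.0, 50.0],
})

# Move each observation one day forward so it lines up with the following day.
shifted = toy.copy()
shifted["date"] = shifted["date"] + pd.Timedelta(days=1)
shifted = shifted.rename(columns={"congestion": "lag_1"})[["date", "roadway", "lag_1"]]

toy = toy.merge(shifted, on=["date", "roadway"], how="left")
print(toy)
```

The first day has no predecessor, so its `lag_1` is missing; this is the kind of gap the `fillna` call above fills with the median.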

<a id = "12"></a><br>

# Outliers Detection

In [None]:
def detect_outliers(df, features):
    outlier_indices = []
    
    for c in features:
        
        # 1st quartile
        q1 = np.percentile(df[c], 25)
        
        # 3rd quartile
        q3 = np.percentile(df[c], 75)
        
        # IQR
        iqr = q3 - q1
        
        # Outlier step
        
        outlier_step = iqr*1.5
        
        # detect outliers and their indices (Tukey fences: below q1 - step or above q3 + step)
        outlier_list_col = df[(df[c] < q1 - outlier_step) | (df[c] > q3 + outlier_step)].index
        
        # store indices
        
        outlier_indices.extend(outlier_list_col)
    
    
    return outlier_indices

In [None]:
outlier_indices = detect_outliers(df, ["congestion"])

In [None]:
print((len(outlier_indices)/len(train))*100)

In the run recorded here, the ratio of outliers to train data was **7.5%**; the exact figure depends on the bounds used. This is large enough that it should be considered. Replacing outliers with the median may work.
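
A minimal sketch of the IQR rule on made-up numbers follows. For simplicity it replaces outliers with the overall median, whereas the notebook uses the per-group `median congestion` column.

```python
import numpy as np
import pandas as pd

# Made-up values with one obvious outlier at the end.
values = pd.Series([40, 42, 45, 47, 48, 50, 52, 55, 58, 100])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# A value is an outlier if it lies outside [lower, upper].
is_outlier = (values < lower) | (values > upper)

# Keep normal values and replace outliers with the median.
cleaned = values.where(~is_outlier, values.median())

print(lower, upper)
print(cleaned.tolist())
```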

In [None]:
# Vectorised replacement with .loc; avoids the slow loop and chained assignment
df.loc[outlier_indices, "congestion"] = df.loc[outlier_indices, "median congestion"]

<a id = "11"></a><br>
## Label Encoding

In [None]:
le = LabelEncoder()
df['roadway'] = le.fit_transform(df['roadway'])
df['afternoon'] = le.fit_transform(df['afternoon'])
df['direction'] = le.fit_transform(df['direction'])


<a id = "13"></a><br>

# Modelling with CATBOOST

In [None]:
x_train = df[:len(train)]
y_train = x_train["congestion"]
x_test = df[len(train):]

In [None]:
features = ["time","DateTime","congestion","date","day_of_week","quarter","month","weekofyear","dayofmonth","morning_avg"]
# drop(columns=...) returns a new frame; the positional axis argument was removed in pandas 2.0
x_train = x_train.drop(columns=features)
x_test = x_test.drop(columns=features)

In [None]:
model = CatBoostRegressor(
    verbose=1000,
    early_stopping_rounds=10,
    random_state = 2022, learning_rate = 0.01, bagging_temperature = 0.02, max_depth = 16, 
    random_strength = 47, l2_leaf_reg = 7.459775961819184e-06, min_child_samples = 49, max_bin = 320, od_type = 'Iter', 
    task_type = 'GPU', loss_function = 'MAE', eval_metric = 'MAE'
).fit(x_train, y_train)   

In [None]:
prediction = model.predict(x_test)

In [None]:
(pd.Series(model.get_feature_importance(), index=x_train.columns)
   .nlargest(20)
   .plot(kind='barh'))

<a id = "14"></a><br>

# Submission

In [None]:
submission = pd.read_csv("../input/tabular-playground-series-mar-2022/sample_submission.csv")
submission["congestion"] = prediction
submission

In [None]:
from sklearn.metrics import mean_absolute_error

# Read and prepare the training data
train = pd.read_csv('../input/tabular-playground-series-mar-2022/train.csv', parse_dates=['time'])
train['hour'] = train['time'].dt.hour
train['minute'] = train['time'].dt.minute

submission_in = submission.copy()
# Compute the quantiles of workday afternoons in September except Labor Day
sep = train[(train.time.dt.hour >= 12) & (train.time.dt.weekday < 5) &
            (train.time.dt.dayofyear >= 246)]
lower = sep.groupby(['hour', 'minute', 'x', 'y', 'direction']).congestion.quantile(0.2).values
upper = sep.groupby(['hour', 'minute', 'x', 'y', 'direction']).congestion.quantile(0.7).values

# Clip the submission data to the quantiles
submission_out = submission_in.copy()
submission_out['congestion'] = submission_in.congestion.clip(lower, upper)

# Display some statistics
mae = mean_absolute_error(submission_in.congestion, submission_out.congestion)
print(f'Mean absolute modification: {mae:.4f}')
print(f"Submission was below lower bound: {(submission_in.congestion <= lower - 0.5).sum()}")
print(f"Submission was above upper bound: {(submission_in.congestion > upper + 0.5).sum()}")

# Round the submission
submission_out['congestion'] = submission_out.congestion.round().astype(int)
submission_out.to_csv('submission.csv', index = False)
submission_out
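
The clipping step above relies on `Series.clip` accepting one bound per row. A minimal sketch with made-up predictions and bounds:

```python
import numpy as np
import pandas as pd

# Made-up predictions with a separate lower and upper bound for each row.
pred = pd.Series([10.0, 50.0, 90.0])
lower = np.array([20.0, 40.0, 30.0])
upper = np.array([60.0, 70.0, 80.0])

# Each prediction is limited to its own [lower, upper] interval.
clipped = pred.clip(lower, upper)
print(clipped.tolist())  # [20.0, 50.0, 80.0]
```

This only works as intended if the bounds are in the same row order as the predictions.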