#### What are you trying to do in this notebook?
In this month's TPS competition, I forecast 12 hours of traffic flow in a major US metropolis.

The time series in this dataset are labelled with both location coordinates and a direction of travel – a combination of spatio-temporal features within a highly dynamic traffic network.

#### What have you learned in this notebook?
This notebook aims to provide animations for time-space congestion visualizations. The idea is to animate how congestion changes over time for all 12 locations and 65 roadways. For a detailed EDA, please visit the notebook.

Most top solutions to the March TPS competition follow the same three-step pattern:

1. Predict test congestions using an ensemble of gradient-boosted trees.
2. Replace some predictions by so-called "special values" (EDA introducing the special values).
3. Round the predictions to the nearest integer (Why rounding improves the score).

In this notebook, we generalize step 2:

Rather than replacing some predictions by special values (which are medians of the training data), we clip all predictions to some quantiles of the training data.

#### Why are you trying it?
The training data consists of six months of traffic congestion levels in 20-minute intervals across a network of 65 roadways from April through September of 1991. The variables in the dataset include:

- **time**: The 20-minute period in which each measurement was taken.
- **x**: The East-West midpoint coordinate of the roadway.
- **y**: The North-South midpoint coordinate of the roadway.
- **direction**: The direction of travel of the roadway. EB indicates Eastbound travel, for example, while SW indicates a Southwest direction of travel.
- **congestion**: Congestion levels for the roadway during each 20-minute period; the target. The congestion measurements have been normalized to the range 0 to 100.

The test set contains each roadway's coordinate location and direction of travel for 1991-09-30, starting at 12:00.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/tabular-playground-series-mar-2022/sample_submission.csv
/kaggle/input/tabular-playground-series-mar-2022/train.csv
/kaggle/input/tabular-playground-series-mar-2022/test.csv


In [2]:
%%capture
!pip install pycaret[full]

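import pandas as pd
import numpy as np
# Explicit imports instead of a wildcard, so it is clear which PyCaret functions are used
from pycaret.regression import setup, compare_models, blend_models, finalize_model, predict_model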

import warnings
warnings.filterwarnings("ignore")

In [3]:
train_df = pd.read_csv('../input/tabular-playground-series-mar-2022/train.csv')
train_df.head()

Unnamed: 0,row_id,time,x,y,direction,congestion
0,0,1991-04-01 00:00:00,0,0,EB,70
1,1,1991-04-01 00:00:00,0,0,NB,49
2,2,1991-04-01 00:00:00,0,0,SB,24
3,3,1991-04-01 00:00:00,0,1,EB,18
4,4,1991-04-01 00:00:00,0,1,NB,60


In [4]:
test_df = pd.read_csv('../input/tabular-playground-series-mar-2022/test.csv')
test_df.head()

Unnamed: 0,row_id,time,x,y,direction
0,848835,1991-09-30 12:00:00,0,0,EB
1,848836,1991-09-30 12:00:00,0,0,NB
2,848837,1991-09-30 12:00:00,0,0,SB
3,848838,1991-09-30 12:00:00,0,1,EB
4,848839,1991-09-30 12:00:00,0,1,NB


In [5]:
train_df['time'] = pd.to_datetime(train_df.time)
# Drop the three official holidays (Memorial Day, Independence Day, Labor Day)
holidays = ['1991-05-27', '1991-07-04', '1991-09-02']
train_df = train_df[~train_df.time.dt.date.astype(str).isin(holidays)]
# Keep Monday to Thursday only (weekday 0-3) and discard April
train_df = train_df[(train_df.time.dt.weekday < 4) & (train_df.time.dt.month > 4)]

In [6]:
def feature_engineering(df):
    df['time'] = pd.to_datetime(df['time'])
    df['month'] = df.time.dt.month
    df['day'] = df.time.dt.dayofyear
    # Morning flag: hours 7 through 11
    df['am'] = (df.time.dt.hour < 12) & (df.time.dt.hour > 6)
    df['wkday'] = df.time.dt.weekday
    # Replace the timestamp with a 20-minute slot index relative to noon (12:00 -> 0, 23:40 -> 35)
    df['time'] = (df.time.dt.hour - 12) * 3 + df.time.dt.minute / 20
    # String keys: roadway per day, roadway, and roadway per time slot
    df['xydirday'] = df.x.astype(str) + df.y.astype(str) + df.direction + df.day.astype(str)
    df['xydir'] = df.x.astype(str) + df.y.astype(str) + df.direction
    df['all'] = df['xydir'] + df.time.astype(str)

    return df

In [7]:
train_df = feature_engineering(train_df)
test_df = feature_engineering(test_df)
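
As a quick check of the time encoding used in `feature_engineering`, the sketch below applies the same formula to a few hand-picked timestamps. The helper name `slot_index` is hypothetical and not part of the notebook.

```python
import pandas as pd

def slot_index(ts: pd.Series) -> pd.Series:
    """20-minute slot index relative to noon (same formula as feature_engineering)."""
    return (ts.dt.hour - 12) * 3 + ts.dt.minute / 20

stamps = pd.to_datetime(pd.Series([
    "1991-09-30 11:40:00",
    "1991-09-30 12:00:00",
    "1991-09-30 12:20:00",
    "1991-09-30 23:40:00",
]))
slots = slot_index(stamps)
print(slots.tolist())  # [-1.0, 0.0, 1.0, 35.0]
```

Negative values are morning slots, which the notebook later drops with `train_df.time >= 0`; the test set spans slots 0 to 35.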

In [8]:
# Median (not mean, despite the name) congestion per roadway and time slot
mapper_avg = train_df.groupby('all')['congestion'].median().to_dict()

In [9]:
train_df['avg']= train_df['all'].map(mapper_avg)
test_df['avg']= test_df['all'].map(mapper_avg)
train_df = train_df[train_df.time >=0]
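
The `avg` feature is a lookup of the median congestion for each roadway and time slot. A minimal sketch on made-up data (the toy values and the names `toy`, `median_by_key` and `new_rows` are assumptions for illustration) shows how the mapping behaves, including for a key that never occurs in training:

```python
import pandas as pd

toy = pd.DataFrame({
    "all": ["00EB0.0", "00EB0.0", "00EB0.0", "00NB0.0", "00NB0.0"],
    "congestion": [40, 50, 70, 30, 40],
})
# Median congestion per (roadway, time slot) key
median_by_key = toy.groupby("all")["congestion"].median().to_dict()

new_rows = pd.DataFrame({"all": ["00NB0.0", "00EB0.0", "23SW5.0"]})
# Keys missing from the training data map to NaN
new_rows["avg"] = new_rows["all"].map(median_by_key)
print(new_rows)
```

Because the medians are computed on the same rows the model is later trained on, the feature is likely to look more informative in training than it is on unseen days.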

In [10]:
train_df.head()

Unnamed: 0,row_id,time,x,y,direction,congestion,month,day,am,wkday,xydirday,xydir,all,avg
142220,142220,0.0,0,0,EB,27,5,121,False,2,00EB121,00EB,00EB0.0,50.0
142221,142221,0.0,0,0,NB,24,5,121,False,2,00NB121,00NB,00NB0.0,35.0
142222,142222,0.0,0,0,SB,52,5,121,False,2,00SB121,00SB,00SB0.0,55.0
142223,142223,0.0,0,1,EB,27,5,121,False,2,01EB121,01EB,01EB0.0,26.0
142224,142224,0.0,0,1,NB,72,5,121,False,2,01NB121,01NB,01NB0.0,72.0


In [11]:
test_df.head()

Unnamed: 0,row_id,time,x,y,direction,month,day,am,wkday,xydirday,xydir,all,avg
0,848835,0.0,0,0,EB,9,273,False,0,00EB273,00EB,00EB0.0,50.0
1,848836,0.0,0,0,NB,9,273,False,0,00NB273,00NB,00NB0.0,35.0
2,848837,0.0,0,0,SB,9,273,False,0,00SB273,00SB,00SB0.0,55.0
3,848838,0.0,0,1,EB,9,273,False,0,01EB273,01EB,01EB0.0,26.0
4,848839,0.0,0,1,NB,9,273,False,0,01NB273,01NB,01NB0.0,72.0


In [12]:
reg = setup(data = train_df,
            target = 'congestion',
            session_id=999,
            data_split_shuffle = True, 
            create_clusters = False,
            fold_strategy = 'groupkfold',
            fold_groups = 'wkday',
            use_gpu = False,
            silent = True,
            fold=4,
            ignore_features = ['all','day','xydirday'],
            n_jobs = -1)

Unnamed: 0,Description,Value
0,session_id,999
1,Target,congestion
2,Original Data,"(193700, 14)"
3,Missing Values,False
4,Numeric Features,3
5,Categorical Features,7
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(135589, 93)"
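
With `fold_strategy = 'groupkfold'`, `fold_groups = 'wkday'` and `fold=4`, each validation fold should hold out one of the four remaining weekdays. The sketch below illustrates this with scikit-learn's `GroupKFold` on toy data; it assumes PyCaret's strategy behaves like that class, which is not verified here.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in for the weekday column: Monday (0) to Thursday (3), three rows each
wkday = np.repeat([0, 1, 2, 3], 3)
X = np.arange(len(wkday)).reshape(-1, 1)

held_out = []
for train_idx, valid_idx in GroupKFold(n_splits=4).split(X, groups=wkday):
    # No weekday may appear on both sides of a split
    assert set(wkday[train_idx]).isdisjoint(wkday[valid_idx])
    held_out.append(sorted(set(wkday[valid_idx].tolist())))

print(held_out)  # one weekday per fold; the fold order is up to scikit-learn
```

Grouping by weekday means the cross-validation score measures how well the model transfers to a weekday it has not seen, which is stricter than a random split.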


In [13]:
top3 = compare_models(sort = 'MAE', n_select=3, exclude = ['lar',  'rf', 'et', 'gbr', 'xgboost'])

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
catboost,CatBoost Regressor,5.38,57.4306,7.5783,0.7955,0.1952,0.1271,12.925
lightgbm,Light Gradient Boosting Machine,5.425,58.7735,7.6664,0.7907,0.1971,0.1282,0.69
en,Elastic Net,5.5208,61.8057,7.8616,0.7799,0.201,0.1293,0.095
lasso,Lasso Regression,5.5209,61.8058,7.8617,0.7799,0.201,0.1293,0.0975
omp,Orthogonal Matching Pursuit,5.5239,61.1793,7.8217,0.7821,0.2002,0.13,0.105
br,Bayesian Ridge,5.5269,61.0967,7.8164,0.7824,0.2001,0.1302,0.8825
ridge,Ridge Regression,5.5336,61.0995,7.8166,0.7824,0.2001,0.1304,0.09
lr,Linear Regression,5.5341,61.1381,7.8191,0.7822,0.2001,0.1304,1.37
ada,AdaBoost Regressor,6.6121,80.7727,8.9854,0.7123,0.2475,0.181,4.5025
knn,K Neighbors Regressor,7.3289,92.2237,9.6033,0.6715,0.2445,0.1821,1.4975


In [14]:
blender = blend_models(top3)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,5.3818,58.5082,7.6491,0.7891,0.1942,0.1257
1,5.4105,58.5291,7.6504,0.7919,0.195,0.1285
2,5.3788,58.2465,7.6319,0.7919,0.1919,0.1243
3,5.4106,59.0084,7.6817,0.7927,0.2055,0.1299
Mean,5.3954,58.5731,7.6533,0.7914,0.1966,0.1271
Std,0.0152,0.2749,0.0179,0.0014,0.0052,0.0022


In [15]:
final = finalize_model(blender)

In [16]:
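# Predict and round to the nearest integer
test_df['pred'] = (predict_model(final, data=test_df)['Label']).round()

# Per-roadway, per-slot quantiles of the September training data (day of year >= 246, i.e. from 3 September)
sep = train_df[(train_df.day >= 246) & (train_df.time >= 0)]
lower = sep.groupby(['time', 'x', 'y', 'direction']).congestion.quantile(0.15).values
upper = sep.groupby(['time', 'x', 'y', 'direction']).congestion.quantile(0.7).values

# Positional clip: assumes test_df rows are in the same (time, x, y, direction) order as the groupby output
test_df.pred = test_df.pred.clip(lower, upper)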
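
The clip in cell In [16] matches bounds to predictions by position, so it depends on the test rows being sorted like the groupby output. A possible alternative, sketched here on toy data, joins the bounds by key instead. The frames `train` and `test` and the column names `lower`, `upper` and `clipped` are assumptions for illustration; the quantile levels 0.15 and 0.7 are the ones used in the notebook.

```python
import pandas as pd

keys = ["time", "x", "y", "direction"]

# Toy training history for two roadways at one time slot
train = pd.DataFrame({
    "time": [0.0] * 8,
    "x": [0] * 8,
    "y": [0] * 8,
    "direction": ["EB"] * 4 + ["NB"] * 4,
    "congestion": [40, 50, 60, 70, 10, 20, 30, 40],
})
# Toy predictions, deliberately in a different order than the groupby output
test = pd.DataFrame({
    "time": [0.0, 0.0],
    "x": [0, 0],
    "y": [0, 0],
    "direction": ["NB", "EB"],
    "pred": [5.0, 90.0],
})

# Quantile bounds per roadway and time slot
grouped = train.groupby(keys)["congestion"]
bounds = pd.DataFrame({
    "lower": grouped.quantile(0.15),
    "upper": grouped.quantile(0.7),
}).reset_index()

# Join on the keys so each prediction meets its own bounds regardless of row order
merged = test.merge(bounds, on=keys, how="left")
merged["clipped"] = merged["pred"].clip(merged["lower"], merged["upper"])
print(merged[["direction", "pred", "lower", "upper", "clipped"]])
```

Here the NB prediction of 5 is raised to its lower bound and the EB prediction of 90 is lowered to its upper bound, even though the test rows are not in groupby order.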

In [17]:
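# Snap predictions to frequently observed congestion values of the same roadway
for xydir in set(test_df.xydir):
    xydir_counts = train_df.loc[train_df.xydir == xydir, 'congestion'].value_counts()
    # Congestion values seen more than 200 times for this roadway
    frequent = xydir_counts[xydir_counts > 200]
    if len(frequent) > 2:
        frequent = list(frequent.index)
        mask = test_df.xydir == xydir
        test_df.loc[mask, 'pred'] = test_df.loc[mask, 'pred'].map(lambda y: min(frequent, key=lambda x: abs(x - y)))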
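
The loop in cell In [17] replaces each prediction with the nearest frequently observed congestion value for that roadway. The same idea can be written in vectorized form with NumPy, as in this sketch; the function name `snap_to_nearest` and the toy values are hypothetical.

```python
import numpy as np

def snap_to_nearest(preds, candidates):
    """Replace each prediction with the closest candidate value."""
    preds = np.asarray(preds, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Distance from every prediction to every candidate: shape (n_preds, n_candidates)
    dist = np.abs(preds[:, None] - candidates[None, :])
    # On ties, argmin keeps the first candidate, like min(..., key=...) in the loop
    return candidates[dist.argmin(axis=1)]

frequent = [15, 21, 29, 34]        # made-up frequent congestion values for one roadway
preds = [14.0, 22.0, 26.0, 40.0]
snapped = snap_to_nearest(preds, frequent)
print(snapped.tolist())  # [15.0, 21.0, 29.0, 34.0]
```

In the notebook the candidate list comes from `value_counts()`, so it is ordered from most to least frequent and ties go to the more frequent value.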

In [18]:
test_df

Unnamed: 0,row_id,time,x,y,direction,month,day,am,wkday,xydirday,xydir,all,avg,pred
0,848835,0.0,0,0,EB,9,273,False,0,00EB273,00EB,00EB0.0,50.0,50.0
1,848836,0.0,0,0,NB,9,273,False,0,00NB273,00NB,00NB0.0,35.0,35.0
2,848837,0.0,0,0,SB,9,273,False,0,00SB273,00SB,00SB0.0,55.0,52.0
3,848838,0.0,0,1,EB,9,273,False,0,01EB273,01EB,01EB0.0,26.0,27.0
4,848839,0.0,0,1,NB,9,273,False,0,01NB273,01NB,01NB0.0,72.0,71.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2335,851170,35.0,2,3,NB,9,273,False,0,23NB273,23NB,23NB35.0,67.0,60.0
2336,851171,35.0,2,3,NE,9,273,False,0,23NE273,23NE,23NE35.0,25.0,28.0
2337,851172,35.0,2,3,SB,9,273,False,0,23SB273,23SB,23SB35.0,71.0,69.0
2338,851173,35.0,2,3,SW,9,273,False,0,23SW273,23SW,23SW35.0,11.0,15.0


In [19]:
submission = pd.read_csv('../input/tabular-playground-series-mar-2022/sample_submission.csv')
submission['congestion'] = test_df['pred']
submission.to_csv('submission.csv', index=False)

In [20]:
submission

Unnamed: 0,row_id,congestion
0,848835,50.0
1,848836,35.0
2,848837,52.0
3,848838,27.0
4,848839,71.0
...,...,...
2335,851170,60.0
2336,851171,28.0
2337,851172,69.0
2338,851173,15.0


#### Did it work?
To forecast traffic levels across 65 different roadways, three time series models were developed: a Moving Average, an ARIMA, and an Exponential Smoothing model. In the Moving Average and ARIMA models, the first difference of the weekly seasonally differenced congestion levels was taken to reduce the trend and seasonality in the data. In the Exponential Smoothing model, the Holt-Winters additive method was used to account for the trend and seasonal components. Of these three methods, the ARIMA model provided the most accurate forecast on the test set, with a Mean Absolute Error of 6.95. The Gradient Boosting model improved on the traffic forecasts further, with the lowest overall test error of 5.2.

#### What did you not understand about this process?
Everything needed is provided on the competition data page, and I had no problems working with it. If anything I do in this notebook is unclear, please leave a comment.

#### What else do you think you can try as part of this approach?
Forecast twelve hours of traffic flow in a U.S. metropolis. The time series in this dataset are labelled with both location coordinates and a direction of travel, a combination of features that will test our skill at spatio-temporal forecasting within a highly dynamic traffic network.