---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Installing & Importing Libraries**](#Section2)<br>
  - **2.1** [**Installing Libraries**](#Section21)
  - **2.2** [**Upgrading Libraries**](#Section22)
  - **2.3** [**Importing Libraries**](#Section23)

**3.** [**Data Acquisition & Description**](#Section3)<br>
  - **3.1** [**Data Description**](#Section31)

**4.** [**Data Pre-Processing**](#Section4)<br>
  - **4.1** [**Pre Profiling Report**](#Section41)
  - **4.2** [**Performing Operations**](#Section42)

**6.** [**Exploratory Data Analysis**](#Section6)<br>
**7.** [**Data Post-Processing**](#Section7)<br>
  - **7.1** [**Feature Encoding**](#Section71)
  - **7.2** [**Feature Scaling**](#Section72)
  - **7.3** [**Data Preparation**](#Section73)

**8.** [**Model Development & Evaluation**](#Section8)<br>
**9.** [**Conclusion**](#Section9)<br>

---
<a name=Section1></a>
# **1. Introduction**
---

- For the **March edition of the 2022 Tabular Playground Series** we're challenged to **forecast twelve hours of traffic flow** in a U.S. metropolis.

- The **time series** in this dataset are labelled with both **location coordinates** and a **direction of travel**.

---
<a name = Section2></a>
# **2. Installing & Importing Libraries**
---

<a name = Section21></a>
### **2.1 Installing Libraries**

In [None]:
!pip install -q datascience                                         # Package that is required by pandas profiling
!pip install -q pandas-profiling                                    # Library to generate basic statistics about data
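!pip install -q yellowbrick                                         # Toolbox for visualizing machine learning model performance
# !pip install -q kaggle                                              # Installing kaggle's API
# !pip install -q kaggle-cli                                          # Supporting kaggle API library that provides cli
!pip install -q catboost                                            # Gradient boosting library (CatBoostRegressor)
!pip install -q xgboost                                             # Gradient boosting library (XGBRegressor)
!pip install -q lightgbm                                            # Gradient boosting library (LGBMRegressor)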

<a name = Section22></a>
### **2.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

In [None]:
!pip install -q --upgrade pandas-profiling
!pip install -q --upgrade yellowbrick

<a name = Section23></a>
### **2.3 Importing Libraries**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis) 
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high      
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.2f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (for Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.neighbors import KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from yellowbrick.model_selection import FeatureImportances
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to control runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings
#-------------------------------------------------------------------------------------------------------------------------------


---
<a name = Section3></a>
# **3. Data Acquisition & Description**
---

- `train.csv` - the training set, comprising measurements of **traffic congestion** across **65 roadways** from **April through September of 1991**.

- `test.csv` - the test set; you will make **hourly predictions for roadways** identified by a coordinate location and a direction of travel on the day of **1991-09-30** (September 30, 1991).

- A **general description of dataset** and **information of its columns** are as follows:

<br>

| File | Records | Features | Dataset Size |
| :--: | :--: | :--: | :--: |
| **Train** | 848835 | 6 | 31.29 MB |
| **Test** | 2340 | 5 | 79.59 kB | 

<br>

|ID|Feature name|Feature description|
|:--|:--|:--|
|1|**row_id**| a unique identifier for this instance |
|2|**time**| the 20-minute period in which each measurement was taken |
|3|**x**| the east-west midpoint coordinate of the roadway |
|4|**y**| the north-south midpoint coordinate of the roadway |
|5|**direction**| the direction of travel of the roadway. EB indicates "eastbound" travel, for example, while SW indicates a "southwest" direction of travel |
|6|**congestion**| congestion levels for the roadway during each hour; the target. The congestion measurements have been normalized to the range 0 to 100 |

In [None]:
data_train = pd.read_csv('../input/tabular-playground-series-mar-2022/train.csv')
print('Train Data Shape:', data_train.shape)
data_train.head()

In [None]:
data_test = pd.read_csv('../input/tabular-playground-series-mar-2022/test.csv')
print('Test Data Shape:', data_test.shape)
data_test.head()

<a name = Section31></a>
### **3.1 Data Description**

- In this section we summarize the data with **descriptive statistics** and note some observations.

In [None]:
data_train.describe()

**Observations:**

- **x** ranges from **0** to **2**, averaging at **1.14**.

- **y** ranges from **0** to **3**, averaging at **1.63**.

- **congestion** ranges from **0** to **100**, averaging at **47.82**.

In [None]:
data_test.describe()

**Observations:**

- **x** ranges from **0** to **2**, averaging at **1.14**.

- **y** ranges from **0** to **3**, averaging at **1.63**.


In [None]:
data_train.info()

In [None]:
data_test.info()

**Observations:**

- Our target feature - **congestion** is of **int64 data type**.

- Among the rest of the features, we have **3 int64 type** features (**row_id**, **x**, **y**) and **2 object type** features (**time**, **direction**).

<a name = Section4></a>

---
# **4. Data Pre-Processing**
---

<a name = Section41></a>
### **4.1 Pre Profiling Report**

- For a **quick analysis**, pandas profiling is very handy.

- It generates a profile report from a pandas DataFrame.

- For each column, **statistics** are presented in an interactive HTML report.

In [None]:
profile_train = ProfileReport(df = data_train)
profile_train.to_file(output_file = 'Pre Profiling Report - Train.html')
print('Accomplished!')
profile_train

In [None]:
profile_test = ProfileReport(df = data_test)
profile_test.to_file(output_file = 'Pre Profiling Report - Test.html')
print('Accomplished!')
profile_test

**Observations:**

- There are **no missing values** in either dataset.

- There are **no duplicated rows** in either dataset.

- We **cannot observe any correlation** (Pearson's) between the features according to the reports.

#### **Validating Report Data**:

- Checking for **null** values:

In [None]:
data_train.isna().sum()

In [None]:
data_test.isna().sum()

- Checking for **duplicate** rows:

In [None]:
data_train.duplicated().sum()

In [None]:
data_test.duplicated().sum()
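
The report's correlation finding can be cross-checked in the same way. The following is a minimal sketch, assuming a small hypothetical frame `sample` in place of `data_train`; on the real data the equivalent call would be `data_train[['x', 'y', 'congestion']].corr()`.

```python
import pandas as pd

# Hypothetical stand-in for data_train with the same numeric columns.
sample = pd.DataFrame({
    'x': [0, 0, 1, 1, 2, 2],
    'y': [0, 3, 1, 2, 0, 3],
    'congestion': [35, 60, 48, 52, 70, 41],
})

# Pearson correlation between every pair of numeric columns.
corr = sample.corr(method='pearson')
print(corr.round(2))
```

Values close to 0 off the diagonal would support the observation that the features are not linearly correlated.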

<a name = Section42></a>
### **4.2 Performing Operations**

- We will first convert the **time** column into a proper datetime type.

- We will extract some important information like **month**, **weekday**, **period of day**, **is_Monday**, **is_Friday**, **is_weekend**, **is_month_start**, **is_month_end** and later check their **significance** w.r.t. **congestion**.

- We will use the following conversion table from **hour of day** to **period of day** where we have basically divided the day into **6 parts**:

<br>

|Start Hour|End Hour|Period|
|:--|:--|:--|
|12 AM|4 AM| **Late Night** |
|4 AM|8 AM| **Early Morning** |
|8 AM|12 PM| **Morning** |
|12 PM|4 PM| **Noon** |
|4 PM|8 PM| **Evening** |
|8 PM|12 AM| **Night** |

<br>

- We are creating a function to perform all the operations:

In [None]:
def time_based_feature_extraction(data=None):
  """Add calendar features derived from the 'time' column to `data` in place."""
  data['time'] = pd.to_datetime(data['time'])
  data['month'] = data['time'].dt.month
  data['weekday'] = data['time'].dt.dayofweek                        # Monday = 0, Sunday = 6
  data['hour'] = data['time'].dt.hour
  # Map the hour to one of six 4-hour periods: hours 0-3 -> 1, 4-7 -> 2, ..., 20-23 -> 6
  period_names = {1: 'Late Night', 2: 'Early Morning', 3: 'Morning', 4: 'Noon', 5: 'Evening', 6: 'Night'}
  data['period'] = ((data['hour'] + 4) // 4).map(period_names)

  data['is_month_start'] = data['time'].dt.is_month_start.astype('int')
  data['is_month_end'] = data['time'].dt.is_month_end.astype('int')
  data['is_weekend'] = (data['weekday'] > 4).astype('int')
  data['is_Friday'] = (data['weekday'] == 4).astype('int')
  data['is_Monday'] = (data['weekday'] == 0).astype('int')

In [None]:
time_based_feature_extraction(data=data_train)
time_based_feature_extraction(data=data_test)
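
To check that the hour arithmetic matches the conversion table, the mapping can be tried on a few boundary hours. This is a small sketch on a hypothetical frame `sample`, not on the competition data; the names `hours` and `period_names` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample: the first and last hour of each 4-hour bucket.
hours = [0, 3, 4, 7, 8, 11, 12, 15, 16, 19, 20, 23]
sample = pd.DataFrame({'time': pd.to_datetime([f'1991-09-30 {h:02d}:00' for h in hours])})

period_names = {1: 'Late Night', 2: 'Early Morning', 3: 'Morning',
                4: 'Noon', 5: 'Evening', 6: 'Night'}

# Same arithmetic as in the function: hours 0-3 -> 1, 4-7 -> 2, ..., 20-23 -> 6.
sample['period'] = ((sample['time'].dt.hour + 4) // 4).map(period_names)
print(sample.assign(hour=sample['time'].dt.hour)[['hour', 'period']])
```

Each pair of boundary hours should land in the same period, in the order given by the table.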

In [None]:
data_train.head()

In [None]:
data_test.head()

**Observations:**

- Our data is clean and we have added more information that can be used for **Exploratory Data Analysis**.

<a name = Section6></a>

---
# **6. Exploratory Data Analysis**
---

In [None]:
data_train.head()

In [None]:
data_test.head()

In [None]:
cat = ['x', 'y', 'direction']
time_data = ['month', 'weekday', 'hour', 'period']
target = ['congestion']

#### **Question 1:** What is the spread of the target feature?

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
sns.histplot(x=data_train['congestion'], ax=ax[0], color='blue', kde=True)
sns.boxplot(x=data_train['congestion'], ax=ax[1], color='red')
plt.suptitle('congestion')
plt.grid(True)
plt.show()

**Observations:**

- The congestion feature is slightly **right-skewed**.

- There are **some outliers** present as well.

- We **won't be changing** any properties or values of this feature.

#### **Question 2:** How does congestion vary with x?

In [None]:
plt.figure(figsize=(15, 7))
df_x = data_train.groupby(['x']).agg({"congestion" : "mean"})
sns.barplot(data=df_x, x=df_x.index, y='congestion', palette='rocket')
plt.title(label='congestion w.r.t. x', fontsize=16)
plt.xlabel(xlabel='x', fontsize=14)
plt.ylabel(ylabel='congestion', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**Observation:**

- The **x = 1** location is the **busiest**, while **x = 0** has the **least** traffic.

#### **Question 3:** How does congestion vary with y?

In [None]:
df_y = data_train.groupby(['y']).agg({"congestion" : "mean"})
plt.figure(figsize=(15, 7))
sns.barplot(data=df_y, x=df_y.index, y='congestion', palette='rocket')
plt.title(label='congestion w.r.t. y', fontsize=16)
plt.xlabel(xlabel='y', fontsize=14)
plt.ylabel(ylabel='congestion', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**Observations:**

- Both the **y = 0** and **y = 2** locations are the **busiest**.

- The difference between these two and the other `y` values is noticeable.

#### **Question 4:** How does congestion vary with the direction of traffic?

In [None]:
df_direction = data_train.groupby(['direction']).agg({"congestion" : "mean"})
plt.figure(figsize=(15, 7))
sns.barplot(data=df_direction, x=df_direction.index, y='congestion', palette='Spectral')
plt.title(label='congestion w.r.t. direction', fontsize=16)
plt.xlabel(xlabel='direction', fontsize=14)
plt.ylabel(ylabel='congestion', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**Observations:**

- The highest mean congestion is found in the **southbound (SB)** and **northbound (NB)** directions.

- Mean congestion from **NE**, **NW**, **SE**, **SW** directions is quite **low**.

#### **Question 5:** How does the average congestion look over time?

In [None]:
train_group = data_train.groupby('time', as_index=False).agg({'congestion': 'mean'})
fig, ax = plt.subplots(figsize=(25, 7))
sns.lineplot(data=train_group, x=train_group['time'].dt.dayofyear, y='congestion', ax=ax, label='daily_congestion')
sns.lineplot(x=train_group['time'].dt.dayofyear, y=train_group['congestion'].mean(), ax=ax, label='mean_congestion')
plt.title(label='congestion w.r.t. days', fontsize=16)
plt.xlabel(xlabel='days of the year', fontsize=14)
plt.ylabel(ylabel='congestion', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**Observations:**

- We observe a **strong weekly seasonality** in the congestion rate.

- Moreover, the **trend** remains **almost constant**, increasing only insignificantly over time.
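
One way to quantify a weekly pattern like this is the autocorrelation of the daily mean at a 7-day lag. The sketch below uses a hypothetical synthetic series `daily` (lower values on weekends plus noise) rather than the competition data; on the real data, the daily mean of `congestion` would take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical daily mean congestion: 8 weeks with lower values on weekends.
days = pd.date_range('1991-04-01', periods=56, freq='D')  # 1991-04-01 is a Monday
rng = np.random.default_rng(42)
values = np.where(days.dayofweek > 4, 40.0, 50.0) + rng.normal(0, 1, size=len(days))
daily = pd.Series(values, index=days)

# Autocorrelation at a 7-day lag is high when the weekly pattern repeats;
# a 3-day lag is shown for contrast.
lag7 = daily.autocorr(lag=7)
lag3 = daily.autocorr(lag=3)
print(f'lag 7: {lag7:.2f}, lag 3: {lag3:.2f}')
```

A lag-7 value near 1 alongside a much lower value at other lags is what a weekly seasonality looks like numerically.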

#### **Question 6:** What does the congestion look like throughout the week?

In [None]:
plt.figure(figsize=(15, 7))
sns.barplot(data=data_train, x='weekday', y='congestion', palette='Dark2')
plt.title(label='congestion w.r.t. weekday', fontsize=16)
plt.xlabel(xlabel='Weekday', fontsize=14)
plt.ylabel(ylabel='congestion', fontsize=14)
plt.xticks(ticks=range(7), labels=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**Observations:**

- As we can see, working days of the week have a **similar congestion rate**. 

- Likewise, we can see that the **weekend days** are the ones with the **least traffic**, with **Sunday being the quietest day**.

#### **Question 7:** How does the congestion differ on Mondays as compared to the rest of the week?

In [None]:
plt.figure(figsize=(15, 7))
sns.boxplot(data=data_train, x='is_Monday', y='congestion', palette='BuPu')
plt.title(label='congestion w.r.t. Monday', fontsize=16)
plt.xlabel(xlabel='Monday or not', fontsize=14)
plt.ylabel(ylabel='congestion', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**Observations:**

- Mondays do have **slightly higher traffic** as compared to the rest of the days of the week.

#### **Question 8:** How does the congestion differ on Fridays as compared to the rest of the week?

In [None]:
plt.figure(figsize=(15, 7))
sns.boxplot(data=data_train, x='is_Friday', y='congestion', palette='BuPu')
plt.title(label='congestion w.r.t. Friday', fontsize=16)
plt.xlabel(xlabel='Friday or not', fontsize=14)
plt.ylabel(ylabel='congestion', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**Observations:**

- Similar to what we observed for Mondays, Fridays also have **slightly higher traffic** than the rest of the days of the week.

#### **Question 9:** What does the congestion look like throughout the day?

In [None]:
plt.figure(figsize=(15, 7))
sns.boxplot(data=data_train, x='period', y='congestion', palette='RdYlBu_r')
plt.title(label='congestion w.r.t. period of day', fontsize=16)
plt.xlabel(xlabel='period', fontsize=14)
plt.ylabel(ylabel='congestion', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**Observations:**

- Traffic **rises sharply after the early morning** and then **dips towards the night**.

- There is also a **sharp drop** from **night** congestion to **late night** (after midnight) congestion.

#### **Question 10:** What does the congestion look like on an hourly basis?

In [None]:
fig, ax = plt.subplots(figsize=(25, 7))
sns.lineplot(data=train_group, x=train_group['time'].dt.hour, y='congestion', ax=ax, label='mean_congestion per hour of the day', linestyle='--', palette='pastel')
plt.title(label='mean congestion w.r.t. hours', fontsize=16)
plt.xlabel(xlabel='hours of day', fontsize=14)
plt.ylabel(ylabel='congestion', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**Observations:**

- We observe that there is an **increase** in traffic at the **beginning of the day**.

- Traffic **peaks at around 8 AM** before dipping down till noon.

- The **busiest hours** are between **13h - 17h (1 PM - 5 PM)**, after which the congestion rate decreases as night falls.

#### **Question 11:** What does the congestion look like on an hourly basis throughout the week?

In [None]:
fig, ax = plt.subplots(figsize=(25, 7))
sns.lineplot(data=train_group, x=train_group['time'].dt.hour, y='congestion', ax=ax, hue=train_group['time'].dt.weekday, palette='Spectral')
plt.title(label='mean congestion w.r.t. hours', fontsize=16)
plt.xlabel(xlabel='hours of day', fontsize=14)
plt.ylabel(ylabel='congestion', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**Observations:**

- On working days, the congestion rate at each hour is **quite similar** across days.

- However, this **changes** when we get into the **weekend**.

- This is likely because most people **do not have to work** on weekends, which brings the congestion rates **down**.

- Moreover, the weekend congestion curve **does not have as many ups and downs** as the working-day curves.

#### **Question 12:** What does the congestion look like across directions with respect to the midpoint coordinates?

In [None]:
# Inspired by https://www.kaggle.com/hasanbasriakcay/tps-mar22-eda-fe-baseline/notebook

f, ax = plt.subplots(figsize=(20, 10))
# Unit offsets (dx, dy) for each direction of travel: east is +x, north is +y
dir_dict = {'EB': (1, 0), 'NB': (0, 1), 'SB': (0, -1), 'WB': (-1, 0), 'NE': (1, 1), 'SE': (1, -1), 'NW': (-1, 1), 'SW': (-1, -1)}
# Mean congestion for every (x, y, direction) roadway
roadway_means = data_train.groupby(['x', 'y', 'direction'], as_index=False)['congestion'].mean()
for _, x, y, d, mean_congestion in roadway_means.itertuples():
    linewidth = mean_congestion/10
    dx, dy = dir_dict[d]
    dx, dy = dx/4, dy/4
    plt.plot([x, x+dx], [y, y+dy], linewidth=linewidth)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('congestion w.r.t. directions and midpoints', size=16)
plt.show()

**Observations:**

- Congestion appears **heaviest** in the **eastbound** direction and can sometimes also be spotted in the southwest direction.

<a name = Section7></a>

---
# **7. Data Post-Processing**
---

In [None]:
X = data_train.drop(columns=['row_id', 'time', 'congestion'])
y = data_train['congestion']

In [None]:
label_encoder = LabelEncoder()
X['direction'] = label_encoder.fit_transform(X['direction'])
X['period'] = label_encoder.fit_transform(X['period'])

In [None]:
model = RandomForestRegressor(random_state=42)
fig = plt.figure(figsize=(15, 7))                                   # Create the figure first so the visualizer draws on it
viz = FeatureImportances(model)
viz.fit(X, y)
viz.show()

In [None]:
train = data_train.copy(deep=True)
test = data_test.copy(deep=True)

In [None]:
train.head()

In [None]:
train.columns

In [None]:
X = train.drop(columns=['row_id', 'time', 'congestion'])
X.head()

In [None]:
X_test = test.drop(columns=['row_id', 'time'])
X_test.head()

In [None]:
y = train['congestion']
y.head()

<a name = Section71></a>
### **7.1 Feature Encoding**

- In this section, we will perform **transformation** over categorical features to get numeric form.

In [None]:
def label_encoding_features(columns=None):
  label_encoder = LabelEncoder()
  for col in columns:
    X[col] = label_encoder.fit_transform(X[col])
    X_test[col] = label_encoder.transform(X_test[col])

In [None]:
label_encoding_features(['direction', 'period'])

In [None]:
X.head()

In [None]:
X_test.head()

<a name = Section72></a>
### **7.2 Feature Scaling**

- In this section, we will perform data scaling over the features that may impact the outcome of models.
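
The notebook imports `MinMaxScaler` but shows no scaling code in this section. The following is only a sketch of how such a step could look, assuming hypothetical stand-in frames `X_demo` and `X_test_demo` in place of the encoded `X` and `X_test`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-ins for X and X_test after encoding.
X_demo = pd.DataFrame({'x': [0, 1, 2], 'y': [0, 3, 1], 'hour': [0, 12, 23]})
X_test_demo = pd.DataFrame({'x': [1, 2], 'y': [2, 3], 'hour': [6, 18]})

scaler = MinMaxScaler()
# Fit on the training features only, then reuse the same min/max for the test set.
X_scaled = pd.DataFrame(scaler.fit_transform(X_demo), columns=X_demo.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_demo), columns=X_test_demo.columns)
print(X_scaled)
print(X_test_scaled)
```

Scaling mainly matters for the linear and neighbor-based models; tree-based models are insensitive to monotonic rescaling of individual features.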

<a name = Section73></a>
### **7.3 Data Preparation**

- Now we will **split** our **data** into **training** and **validation** sets for further development.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42)

print('Training Data Shape:', X_train.shape, y_train.shape)
print('Validation Data Shape:', X_val.shape, y_val.shape)

<a name = Section8></a>

---
# **8. Model Development & Evaluation**
---

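- In this section we will develop a variety of models, ranging from linear regressors to tree ensembles and gradient-boosting libraries, and compare them on the validation set using **R2**, **RMSE** and **MAE**.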

In [None]:
scores = []
clfs = [LinearRegression(),
        Ridge(random_state=42), Lasso(random_state=42),
        ElasticNet(alpha=0.5, random_state=42),
        DecisionTreeRegressor(random_state=42), 
        RandomForestRegressor(random_state=42), 
        KNeighborsRegressor(), GaussianNB(),                        # Note: GaussianNB is a classifier; it treats each congestion value as a class
        GradientBoostingRegressor(random_state=42),
        CatBoostRegressor(random_state=42),
        XGBRegressor(random_state=42),
        BaggingRegressor(random_state=42),
        LGBMRegressor(random_state=42)
        ]

mae_list = []

In [None]:
for clf in clfs:
  # Extracting model name
  model_name = type(clf).__name__

  # Fit the model on train data
  clf.fit(X_train, y_train)

  # Make predictions using validation data
  y_val_pred = clf.predict(X_val)

  # Make predictions using train data
  y_train_pred = clf.predict(X_train)

  # Calculate train R2 score of the model
  clf_train_r2 = r2_score(y_train, y_train_pred)

  # Calculate validation R2 score of the model
  clf_val_r2 = r2_score(y_val, y_val_pred)

  # Calculate train RMSE of the model
  clf_train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))

  # Calculate validation RMSE of the model
  clf_val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

  # Calculate train MAE of the model
  clf_train_mae = mean_absolute_error(y_train, y_train_pred)

  # Calculate validation MAE of the model
  clf_val_mae = mean_absolute_error(y_val, y_val_pred)

  # Display the performance metrics of the model
  print('Performance Metrics for', model_name, ':')

  print()
  
  print('[R2-Score Train]:', clf_train_r2)
  print('[R2-Score Validation]:', clf_val_r2)

  print('[RMSE Train]:', clf_train_rmse)
  print('[RMSE Validation]:', clf_val_rmse)

  print('[MAE Train]:', clf_train_mae)
  print('[MAE Validation]:', clf_val_mae)

  scores.append((model_name, clf_train_r2, clf_val_r2, clf_train_rmse, clf_val_rmse, clf_train_mae, clf_val_mae))

  print('--------------------\n')

In [None]:
models = pd.DataFrame(data=scores, columns=['Model', 'Train R2-Score', 'Val R2-Score', 'Train RMSE', 'Val RMSE', 'Train MAE', 'Val MAE'])
models.sort_values(by='Val MAE')
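
For reference, the three metrics in the table can be computed by hand. This is a small sketch on hypothetical values `y_true` and `y_hat`, using only NumPy, to make the definitions explicit.

```python
import numpy as np

# Hypothetical actual and predicted congestion values.
y_true = np.array([40.0, 50.0, 60.0, 70.0])
y_hat = np.array([42.0, 47.0, 61.0, 66.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))                 # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))          # root mean squared error
# Coefficient of determination: 1 - residual sum of squares / total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f'MAE={mae:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}')
```

RMSE penalizes large errors more heavily than MAE, so the two metrics can rank models differently; the table above is sorted by validation MAE.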

In [None]:
fig = plt.figure(figsize=(15,7))
sns.barplot(x=models['Val MAE'], y=models['Model'], palette='rocket')
plt.grid(True)

In [None]:
# Creating a parameter grid for CatBoost Regressor
param_grid_cat  = {'iterations': [100, 150, 200],
                   'learning_rate': [0.03, 0.1],
                   'depth': [2, 4, 6, 8],
                   'l2_leaf_reg': [0.2, 0.5, 1, 3]}

cbr = CatBoostRegressor(random_state=42)
cbrcv = GridSearchCV(estimator = cbr, param_grid = param_grid_cat, scoring ='neg_mean_absolute_error', cv = 5)
cbrcv.fit(X_train, y_train)

# Printing metrics
print("[Hyperparameters]:", cbrcv.best_params_)

print("Best CV Score (negative MAE):", cbrcv.best_score_)

print("[Train R2 Score]:", r2_score(y_train, cbrcv.predict(X_train)))
print("[Validation R2 Score]:", r2_score(y_val, cbrcv.predict(X_val)))

print("[Train RMSE]:", np.sqrt(mean_squared_error(y_train, cbrcv.predict(X_train))))
print("[Validation RMSE]:", np.sqrt(mean_squared_error(y_val, cbrcv.predict(X_val))))

print('[Train MAE]:', mean_absolute_error(y_train, cbrcv.predict(X_train)))
print('[Validation MAE]:', mean_absolute_error(y_val, cbrcv.predict(X_val)))
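
Note that with `scoring='neg_mean_absolute_error'`, scikit-learn reports the negated MAE so that higher is always better. The sketch below illustrates the sign convention on hypothetical toy data, with `DecisionTreeRegressor` standing in for CatBoost only to keep the example light.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data; the target depends linearly on the first feature plus noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=200)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid={'max_depth': [2, 4, 6]},
    scoring='neg_mean_absolute_error',
    cv=5,
)
search.fit(X_demo, y_demo)

# best_score_ is the negated MAE, so flip the sign to report it as an error.
best_cv_mae = -search.best_score_
print('Best params:', search.best_params_)
print('Best CV MAE:', round(best_cv_mae, 3))
```

The same sign flip applies to `cbrcv.best_score_` when comparing it with the MAE values printed elsewhere in this notebook.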

In [None]:
clf = cbrcv.best_estimator_
clf.fit(X_train, y_train)

In [None]:
y_pred_train = clf.predict(X_train)
y_pred_val = clf.predict(X_val)

In [None]:
print('[MAE Train]:', mean_absolute_error(y_train, y_pred_train))
print('[MAE Validation]:', mean_absolute_error(y_val, y_pred_val))

In [None]:
fig = plt.figure(figsize=(15, 7))
sns.regplot(x=y_val, y=y_pred_val, color='green')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.grid(True)
plt.show()

In [None]:
y_pred = clf.predict(X_test)

In [None]:
clf.save_model('best_model')

In [None]:
my_submission = pd.DataFrame({'row_id': data_test['row_id'], 'congestion': y_pred.ravel()})
my_submission.to_csv('submission.csv', index=False)

<a name = Section9></a>

---
# **9. Conclusion**
---

- This concludes the notebook.

- The test metric achieved with this notebook is **5.347**.


