# Model Fit on Time Series Data
## Subash Chandra Biswal (U77884251)

### Data Set Information:

Hourly Interstate 94 westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN. Hourly weather features and holidays are included so that their impact on traffic volume can be studied.


### Attribute Information:

- **holiday:** Categorical. US national holidays plus a regional holiday, the Minnesota State Fair
- **temp:** Numeric. Average temperature in kelvin
- **rain_1h:** Numeric. Amount of rain, in mm, that fell during the hour
- **snow_1h:** Numeric. Amount of snow, in mm, that fell during the hour
- **clouds_all:** Numeric. Percentage of cloud cover
- **weather_main:** Categorical. Short textual description of the current weather
- **weather_description:** Categorical. Longer textual description of the current weather
- **date_time:** DateTime. Hour at which the data was collected, in local time (CST)
- **traffic_volume:** Numeric. Hourly westbound traffic volume reported by I-94 ATR station 301

**Source:** UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume)


In [79]:
import pandas as pd
import numpy as np

## Load dataset

In [80]:
traffic_data = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")
traffic_data

|       | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|-------|---------|------|---------|---------|------------|--------------|---------------------|-----------|----------------|
| 0     |         | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
| 1     |         | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
| 2     |         | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
| 3     |         | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
| 4     |         | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
| ...   | ...     | ... | ... | ... | ... | ... | ... | ... | ... |
| 48199 |         | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
| 48200 |         | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
| 48201 |         | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
| 48202 |         | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |


In [81]:
traffic_data.isna().sum()

holiday                0
temp                   0
rain_1h                0
snow_1h                0
clouds_all             0
weather_main           0
weather_description    0
date_time              0
traffic_volume         0
dtype: int64

Considering the features of the dataset, we treat temp, rain_1h, snow_1h, and clouds_all as the primary contributors to the target variable, traffic_volume. We also assume that the previous 6 hours of data are enough to predict the next hour's traffic volume, so the step (window size) is 6.
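To make the step of 6 concrete, here is a minimal sketch of how hourly rows could be cut into 6-hour windows, each paired with the traffic volume of the following hour. The helper name `make_windows` is hypothetical, and the sketch assumes that the target is the last column and that past traffic volumes are part of each input window; the notebook itself does not specify either choice.

```python
import numpy as np

def make_windows(values, step=6):
    """Split a 2-D array (rows = hours, last column = target) into (X, y).

    Each X[i] holds `step` consecutive rows; y[i] is the target value of
    the row immediately after that window. Hypothetical helper, not taken
    from the notebook.
    """
    values = np.asarray(values, dtype=float)
    X, y = [], []
    for start in range(len(values) - step):
        X.append(values[start:start + step])   # hours start .. start+step-1
        y.append(values[start + step, -1])     # target at hour start+step
    return np.array(X), np.array(y)

# Toy data: 10 hours x 5 columns, standing in for
# temp, rain_1h, snow_1h, clouds_all, traffic_volume
toy = np.arange(50).reshape(10, 5)
X, y = make_windows(toy, step=6)
print(X.shape, y.shape)  # (4, 6, 5) (4,)
```

With 10 rows and a step of 6, only 4 complete windows have a following hour to predict, which is why the number of samples is `len(values) - step`.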

In [82]:
# Convert the holidays to 1 and non-holidays to 0
traffic_data['holiday'] = traffic_data['holiday'].apply(lambda x: 0 if x == 'None' else 1)
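
The comparison above relies on non-holidays being stored as the literal string `'None'`. Depending on the pandas version and the `read_csv` settings, that string may instead be parsed as a missing value, in which case every row would be encoded as 1. A more defensive variant, sketched under that assumption, treats both cases as non-holidays. The helper name `encode_holiday` and the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

def encode_holiday(holiday):
    """Return 1 for a named holiday, 0 for 'None' or a missing value.

    Hypothetical helper, not part of the original notebook.
    """
    holiday = pd.Series(holiday, dtype="object")
    is_holiday = holiday.notna() & (holiday != "None")
    return is_holiday.astype(int)

# Illustrative values: a 'None' string, a missing value, a named holiday
sample = pd.Series(["None", np.nan, "State Fair"])
encoded = encode_holiday(sample)
print(encoded.tolist())  # [0, 0, 1]
```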

In [83]:
# Extract the hour from the date-time field and create a new field named "hour"
hour = pd.to_datetime(traffic_data['date_time']).dt.hour
traffic_data.insert(loc=5, column='hour', value=hour)

In [84]:
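# Keep only the four weather features and the target; the derived
# 'hour' and 'holiday' columns are dropped along with the text columns
traffic_data = traffic_data.drop(columns=['weather_main', 'weather_description', 'date_time', 'hour', 'holiday'])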

In [63]:
traffic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   holiday         48204 non-null  int64  
 1   temp            48204 non-null  float64
 2   rain_1h         48204 non-null  float64
 3   snow_1h         48204 non-null  float64
 4   clouds_all      48204 non-null  int64  
 5   hour            48204 non-null  int64  
 6   traffic_volume  48204 non-null  int64  
dtypes: float64(3), int64(4)
memory usage: 2.6 MB


In [85]:
traffic_data.head()

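|   | temp | rain_1h | snow_1h | clouds_all | traffic_volume |
|---|------|---------|---------|------------|----------------|
| 0 | 288.28 | 0.0 | 0.0 | 40 | 5545 |
| 1 | 289.36 | 0.0 | 0.0 | 75 | 4516 |
| 2 | 289.58 | 0.0 | 0.0 | 90 | 4767 |
| 3 | 290.13 | 0.0 | 0.0 | 90 | 5026 |
| 4 | 291.14 | 0.0 | 0.0 | 75 | 4918 |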


## Save the processed Data

In [86]:
traffic_data.to_csv('./traffic_data.csv', index=False)
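
As a quick sanity check, the saved file can be read back and compared with the frame that was written. The sketch below uses the first three processed rows shown earlier and a temporary directory rather than the notebook's real output path, so it can run on its own; the variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

# First three rows of the processed table
processed = pd.DataFrame({
    "temp": [288.28, 289.36, 289.58],
    "rain_1h": [0.0, 0.0, 0.0],
    "snow_1h": [0.0, 0.0, 0.0],
    "clouds_all": [40, 75, 90],
    "traffic_volume": [5545, 4516, 4767],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "traffic_data.csv")  # temporary path, for illustration
    processed.to_csv(path, index=False)           # index=False: do not write the row index
    reloaded = pd.read_csv(path)

# Raises an AssertionError if columns, dtypes, or values differ
pd.testing.assert_frame_equal(reloaded, processed)
print(list(reloaded.columns))
```

Because the file is written with `index=False`, the reloaded frame has exactly the five data columns and no extra index column.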