# Task 1: Prediction of tickets for different price categories
__The solution is spread over four notebooks matching `Task1_*.ipynb`:__
- `Task1_data_cleaning` cleans and reshapes the data so that it can be fed to the model.
- `Task1_training` trains neural networks using a hyperparameter grid search.
- `Task1_validation` validates all generated checkpoints to find the best model.
- `Task1_predictions` produces predictions in the required format using the best model.

__We start by reading the training data with pandas, parsing the datetime column at load time.__

The data appears to contain some `NaN` values and some `999` entries.

In [1]:
import pandas as pd
import holidays
df = pd.read_csv('training_data.csv',index_col=0,parse_dates=[1])
df.isnull().sum()

ride_departure     0
capacity           0
tickets_9_eur      0
tickets_12_eur     0
tickets_15_eur    10
tickets_19_eur    10
direction          0
dtype: int64

Since the number of `NaN` values is negligible compared to the total size of the data, the affected rows are dropped, along with the rows where `tickets_12_eur` equals `999`.

In [2]:
df = df[df.tickets_12_eur!=999]
df.dropna(inplace=True)
df.isnull().sum()

ride_departure    0
capacity          0
tickets_9_eur     0
tickets_12_eur    0
tickets_15_eur    0
tickets_19_eur    0
direction         0
dtype: int64
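
The cell above filters the `999` entries only in `tickets_12_eur`. A minimal sketch of a more general check follows, assuming that `999` marks a missing value in any ticket column (an assumption; the notebook only filters one column). It counts the placeholder per column, converts it to `NaN`, and drops all incomplete rows in one step. The toy frame and the `SENTINEL` name are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

SENTINEL = 999  # assumed placeholder for a missing value

# Toy data standing in for training_data.csv
toy = pd.DataFrame({
    "tickets_9_eur": [21.0, 12.0, 999.0, 25.0],
    "tickets_12_eur": [0.0, 999.0, 3.0, 1.0],
    "tickets_15_eur": [0.0, 0.0, np.nan, 2.0],
})

ticket_cols = [c for c in toy.columns if c.startswith("tickets_")]

# How often the placeholder occurs in each ticket column
sentinel_counts = (toy[ticket_cols] == SENTINEL).sum()

# Treat the placeholder as missing, then drop every incomplete row
cleaned = toy.copy()
cleaned[ticket_cols] = cleaned[ticket_cols].replace(SENTINEL, np.nan)
cleaned = cleaned.dropna().reset_index(drop=True)

print(sentinel_counts)
print(cleaned)
```

Working on a copy and reassigning the result of `dropna()` avoids modifying a filtered slice in place.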

__Adding a holiday column for Bavaria (Bayern), Germany, to the dataframe.__

In [3]:
# The holidays package uses the upper-case subdivision code 'BY' for Bayern
de_by_holidays = holidays.CountryHoliday('DE',prov='BY')
df['holiday'] = df['ride_departure'].dt.date.apply(lambda x : 1 if x in de_by_holidays else 0)
df.head()

| index | ride_departure | capacity | tickets_9_eur | tickets_12_eur | tickets_15_eur | tickets_19_eur | direction | holiday |
|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-01 08:15:00 | 82.0 | 21.0 | 0.0 | 0.0 | 0.0 | B->A | 1 |
| 1 | 2015-01-01 09:15:00 | 82.0 | 12.0 | 0.0 | 0.0 | 0.0 | A->B | 1 |
| 2 | 2015-01-01 10:15:00 | 82.0 | 33.0 | 0.0 | 0.0 | 0.0 | B->A | 1 |
| 3 | 2015-01-01 11:45:00 | 82.0 | 25.0 | 0.0 | 0.0 | 0.0 | A->B | 1 |
| 4 | 2015-01-01 12:45:00 | 82.0 | 32.0 | 0.0 | 0.0 | 0.0 | B->A | 1 |
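
The holiday flag is a plain membership test on the departure date. A minimal sketch follows, assuming a hand-written set `sample_holidays` as a hypothetical stand-in for `de_by_holidays` so that the example runs without the `holidays` package; it lists only New Year's Day and Epiphany 2015. It computes the flag both row by row, as in the cell above, and with a vectorized `isin`.

```python
import datetime as dt
import pandas as pd

# Hypothetical stand-in for de_by_holidays: two Bavarian public holidays in 2015
sample_holidays = {dt.date(2015, 1, 1), dt.date(2015, 1, 6)}

rides = pd.DataFrame({
    "ride_departure": pd.to_datetime([
        "2015-01-01 08:15:00",
        "2015-01-02 09:15:00",
        "2015-01-06 10:15:00",
    ])
})

# Row-wise version, mirroring the notebook cell
rides["holiday_apply"] = rides["ride_departure"].dt.date.apply(
    lambda d: 1 if d in sample_holidays else 0
)

# Vectorized version: strip the time of day, then test membership
holiday_index = pd.to_datetime(sorted(sample_holidays))
rides["holiday"] = (
    rides["ride_departure"].dt.normalize().isin(holiday_index).astype(int)
)

print(rides)
```

Both columns should agree; the vectorized form may be preferable on large frames because it avoids calling a Python function per row.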


__`ride_departure` can be split into more informative features.__

In [4]:
df['month'] = df['ride_departure'].dt.month
df['day_of_year'] = df['ride_departure'].dt.dayofyear
df['hour'] = df['ride_departure'].dt.hour
df['minute'] = df['ride_departure'].dt.minute
df['day_of_week'] = df['ride_departure'].dt.dayofweek
df.head()

| index | ride_departure | capacity | tickets_9_eur | tickets_12_eur | tickets_15_eur | tickets_19_eur | direction | holiday | month | day_of_year | hour | minute | day_of_week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-01 08:15:00 | 82.0 | 21.0 | 0.0 | 0.0 | 0.0 | B->A | 1 | 1 | 1 | 8 | 15 | 3 |
| 1 | 2015-01-01 09:15:00 | 82.0 | 12.0 | 0.0 | 0.0 | 0.0 | A->B | 1 | 1 | 1 | 9 | 15 | 3 |
| 2 | 2015-01-01 10:15:00 | 82.0 | 33.0 | 0.0 | 0.0 | 0.0 | B->A | 1 | 1 | 1 | 10 | 15 | 3 |
| 3 | 2015-01-01 11:45:00 | 82.0 | 25.0 | 0.0 | 0.0 | 0.0 | A->B | 1 | 1 | 1 | 11 | 45 | 3 |
| 4 | 2015-01-01 12:45:00 | 82.0 | 32.0 | 0.0 | 0.0 | 0.0 | B->A | 1 | 1 | 1 | 12 | 45 | 3 |
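
The same split can be wrapped in a function so that it can be applied identically to other data, for example at prediction time. This is a minimal sketch; the helper name `add_datetime_features` is hypothetical and does not appear in the original notebooks.

```python
import pandas as pd

def add_datetime_features(frame, column="ride_departure"):
    """Return a copy of `frame` with calendar features derived from `column`."""
    out = frame.copy()
    ts = out[column].dt
    out["month"] = ts.month
    out["day_of_year"] = ts.dayofyear
    out["hour"] = ts.hour
    out["minute"] = ts.minute
    out["day_of_week"] = ts.dayofweek  # Monday=0 ... Sunday=6
    return out

rides = pd.DataFrame({
    "ride_departure": pd.to_datetime(
        ["2015-01-01 08:15:00", "2015-12-31 23:45:00"]
    )
})
features = add_datetime_features(rides)
print(features)
```

In pandas, `dayofweek` counts from Monday as 0, which is why 1 January 2015 (a Thursday) shows up as 3 in the output above.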


__The bus direction is a categorical variable, which is better represented with one-hot encoding.__

In [5]:
df = pd.get_dummies(df,prefix='route',columns=['direction']) 
df.head()

| index | ride_departure | capacity | tickets_9_eur | tickets_12_eur | tickets_15_eur | tickets_19_eur | holiday | month | day_of_year | hour | minute | day_of_week | route_A->B | route_B->A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-01 08:15:00 | 82.0 | 21.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 8 | 15 | 3 | 0 | 1 |
| 1 | 2015-01-01 09:15:00 | 82.0 | 12.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 9 | 15 | 3 | 1 | 0 |
| 2 | 2015-01-01 10:15:00 | 82.0 | 33.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 10 | 15 | 3 | 0 | 1 |
| 3 | 2015-01-01 11:45:00 | 82.0 | 25.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 11 | 45 | 3 | 1 | 0 |
| 4 | 2015-01-01 12:45:00 | 82.0 | 32.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 12 | 45 | 3 | 0 | 1 |
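
The output above shows the indicator columns as 0/1 integers. Recent pandas versions return boolean columns from `get_dummies` unless a `dtype` is given, so a minimal sketch on a toy frame (the frame is illustrative) that requests integers explicitly could look like this:

```python
import pandas as pd

rides = pd.DataFrame({
    "direction": ["B->A", "A->B", "B->A"],
    "capacity": [82.0, 82.0, 82.0],
})

# dtype=int keeps the indicator columns as 0/1 regardless of pandas version
encoded = pd.get_dummies(rides, prefix="route", columns=["direction"], dtype=int)
print(encoded)
```

With only two directions the two indicator columns are complementary, so one of them carries no extra information; keeping both, as the notebook does, is harmless for a neural network input.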


__Rearranging the columns for simplicity when feeding them into the neural networks, and saving the result to `cleaned_data.csv`.__

In [6]:
df = df[[
    'month', 'day_of_year', 'hour', 'minute', 'day_of_week', 'holiday',
    'route_A->B', 'route_B->A', 'capacity',
    'tickets_9_eur', 'tickets_12_eur', 'tickets_15_eur', 'tickets_19_eur',
    'ride_departure',
]]
df.head()

| index | month | day_of_year | hour | minute | day_of_week | holiday | route_A->B | route_B->A | capacity | tickets_9_eur | tickets_12_eur | tickets_15_eur | tickets_19_eur | ride_departure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 8 | 15 | 3 | 1 | 0 | 1 | 82.0 | 21.0 | 0.0 | 0.0 | 0.0 | 2015-01-01 08:15:00 |
| 1 | 1 | 1 | 9 | 15 | 3 | 1 | 1 | 0 | 82.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2015-01-01 09:15:00 |
| 2 | 1 | 1 | 10 | 15 | 3 | 1 | 0 | 1 | 82.0 | 33.0 | 0.0 | 0.0 | 0.0 | 2015-01-01 10:15:00 |
| 3 | 1 | 1 | 11 | 45 | 3 | 1 | 1 | 0 | 82.0 | 25.0 | 0.0 | 0.0 | 0.0 | 2015-01-01 11:45:00 |
| 4 | 1 | 1 | 12 | 45 | 3 | 1 | 0 | 1 | 82.0 | 32.0 | 0.0 | 0.0 | 0.0 | 2015-01-01 12:45:00 |


In [7]:
df.to_csv('cleaned_data.csv',index=False)
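
Because the file is written with `index=False`, the original index is not stored, and the timestamp column comes back as plain strings unless it is parsed again on load. A minimal round-trip sketch follows, using an in-memory buffer in place of `cleaned_data.csv` and a two-row toy frame (both illustrative):

```python
import io
import pandas as pd

cleaned = pd.DataFrame({
    "hour": [8, 9],
    "capacity": [82.0, 82.0],
    "ride_departure": pd.to_datetime(
        ["2015-01-01 08:15:00", "2015-01-01 09:15:00"]
    ),
})

# Write to an in-memory buffer instead of cleaned_data.csv
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype of the timestamp column
reloaded = pd.read_csv(buffer, parse_dates=["ride_departure"])
print(reloaded.dtypes)
```

A notebook that reads `cleaned_data.csv` later would presumably need the same `parse_dates` argument if it uses `ride_departure` as a timestamp.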