# Illustrative Notebook

## Prediction Functions

In this notebook we call and run R functions from a Python environment. This is an important step because the random ferns and RANGER algorithms were originally written in R, whereas most of our processes are written in Python.

The following Python packages are required to run this notebook:

In [6]:
import numpy as np    # Basic numeric and array operations.
import pandas as pd   # Basic numeric, array operations, and data wrangling.
import math           # Basic numeric operations.
import os             # Interaction with the operating system.
import glob           # File-name pattern matching.
import rpy2           # Connection to R from a Python environment.

We propose classifying queries according to whether they require new predictions on a new dataset or can be answered from the current dataset (the latest update is 2021-08-31). If there is no need to make predictions on a new dataset, we load and return a chart with the stored predictions. If we need predictions for date-time values later than 2021-08-31, we load the best model and use it to make predictions on the new dataset.
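The routing rule described above can be sketched as follows. This is only an illustration: the names `LAST_UPDATE` and `needs_new_prediction` are hypothetical and do not appear in the notebook, and the sketch assumes that each query carries a single date-time value and that the whole day of 2021-08-31 is covered by the current dataset.

```python
from datetime import datetime

# Latest dataset update mentioned in the text. Treating the end of that day
# as the cutoff is an assumption made for this sketch.
LAST_UPDATE = datetime(2021, 8, 31, 23, 59, 59)


def needs_new_prediction(query_time: datetime) -> bool:
    """Hypothetical helper that classifies a query by its date-time value.

    True  -> the best model has to be loaded and run on a new dataset.
    False -> a stored chart of predictions can be loaded and returned.
    """
    return query_time > LAST_UPDATE


print(needs_new_prediction(datetime(2021, 3, 15, 15)))  # False
print(needs_new_prediction(datetime(2021, 10, 1, 8)))   # True
```

Keeping this decision in one small function means that the cutoff date only has to be changed in one place when the dataset is updated.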

Let's explore these two types of predictions.

## Prediction on date-time values before 2021-08-31

The function `predict_boarding_py()` returns a dataframe that contains the pre-specified values or conditions and their respective predictions. It requires the following inputs:

* `rt`: `route_id`, passed as a string (`str`)
* `di`: `direction_id`, passed as a string (`str`)
* `st`: `stop_id`
* `part`: `'pre'` or `'post'`
* `serv`: `'weekday'` or `'weekend'`
* `mn`: Month of the year. For example, March is `'3'`.
* `hr`: Hour of the day on a 24-hour clock. For example, 3 pm is `'15'`.

Notice that the inputs do not require any environmental variables. This is a quick query in which the user gets a prediction based only on transit variables (*e.g.*, route and direction) and time variables.
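Because the identifiers are concatenated into a folder path, they have to be strings. The sketch below isolates the path construction used at the top of `predict_boarding_py()`; the helper name `build_model_path` is hypothetical and is introduced here only for illustration.

```python
def build_model_path(rt: str, di: str, st: str) -> str:
    """Assemble the model folder the same way predict_boarding_py() does."""
    # str.join() raises a TypeError when an element is not a string,
    # for example when the route id is passed as an integer.
    return '/'.join(['Training', 'Trained_Models', 'Board_Counts',
                     '_'.join(['route', rt]),
                     ''.join(['direction', di]),   # no underscore, as in the function
                     '_'.join(['bus_stop', st])])


print(build_model_path('1', '0', '12'))
# Training/Trained_Models/Board_Counts/route_1/direction0/bus_stop_12
```

Calling `build_model_path(1, '0', '12')` fails with a `TypeError`, which is why numeric identifiers should be converted with `str()` before the query is made.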

In [18]:
def predict_boarding_py(rt, di, st, part, serv, mn, hr): 
    
    # Define folder path:
    
    path = '/'.join(['Training','Trained_Models','Board_Counts',
                     '_'.join(['route',rt]),''.join(['direction',di]),
                     '_'.join(['bus_stop',st])])
    
    # Select data partition:
    
    if part == 'pre':
        
        # Load Performance data to select 'Best Model':
        
        Pre_GLM_RMSEs_file_path = '/'.join([path,'Pre_All_RMSEs.csv'])
        Pre_GLM_RMSEs = pd.read_csv(Pre_GLM_RMSEs_file_path)
        Pre_RMSEs = pd.DataFrame(Pre_GLM_RMSEs.loc[1]).T.reset_index(drop = True)
                
        Pre_RF_Vanilla_RMSE_file_path = '/'.join([path, 'pre_RF_Vanilla_Performance.csv'])
        Pre_RF_Vanilla_RMSE = pd.read_csv(Pre_RF_Vanilla_RMSE_file_path)
        Pre_RF_Vanilla_RMSE = pd.DataFrame(Pre_RF_Vanilla_RMSE['Test'])
        Pre_RF_Vanilla_RMSE.columns = ['RF_Vanilla']
        
        Pre_RMSEs['RF_Vanilla'] = Pre_RF_Vanilla_RMSE['RF_Vanilla']
        
        Pre_RF_Chart_file_path = '/'.join([path, 'pre_RF_Chart.csv'])
        Pre_RF_Chart = pd.read_csv(Pre_RF_Chart_file_path)
        Pre_ZI_RF_RMSE = math.sqrt(np.mean((Pre_RF_Chart['board_count'] - Pre_RF_Chart['RF_Pred'])**2))
        
        Pre_RMSEs['ZI_RF'] = Pre_ZI_RF_RMSE
        Pre_RMSEs = Pre_RMSEs.iloc[: , 1:]
        
        Pre_All_RMSEs = Pre_RMSEs.T
        Pre_All_RMSEs.columns = ['Test']
        
        Best_Model = np.argmin(Pre_All_RMSEs['Test'])
        
        #return(Best_Model)
        
        if Best_Model == 6:
            
            data_File_path = '/'.join([path, 'pre_RF_Chart.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)]
            out_data = out_data.drop_duplicates()
                        
            return(out_data)
                      
        elif Best_Model == 5:
            
            data_File_path = '/'.join([path, 'pre_RF_Vanilla_Chart.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)]
            out_data = out_data.drop_duplicates()
                
            return(out_data)
        
        elif Best_Model == 4:
            
            data_File_path = '/'.join([path, 'Pre_Chart_Predictions.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)][['month', 'service_kind',
                                                                                                            'hour', 'board_count',
                                                                                                            'mean_temp', 'mean_precip',
                                                                                                            'month_average_board_count',
                                                                                                            'surrounding_board_count',
                                                                                                            'Hurdle_Predictions']]
            
            out_data = out_data.drop_duplicates()
            
            return(out_data)
        
        elif Best_Model == 3:
            
            data_File_path = '/'.join([path, 'Pre_Chart_Predictions.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)][['month', 'service_kind',
                                                                                                            'hour', 'board_count',
                                                                                                            'mean_temp', 'mean_precip',
                                                                                                            'month_average_board_count',
                                                                                                            'surrounding_board_count',
                                                                                                            'ZIP_Predictions']]
            
            out_data = out_data.drop_duplicates()
            
            return(out_data)
        
        elif Best_Model == 2:
            
            data_File_path = '/'.join([path, 'Pre_Chart_Predictions.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)][['month', 'service_kind',
                                                                                                            'hour', 'board_count',
                                                                                                            'mean_temp', 'mean_precip',
                                                                                                            'month_average_board_count',
                                                                                                            'surrounding_board_count',
                                                                                                            'ZINB_Predictions']]
            
            out_data = out_data.drop_duplicates() 
            
            return(out_data)
        
        elif Best_Model == 1:
            
            data_File_path = '/'.join([path, 'Pre_Chart_Predictions.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)][['month', 'service_kind',
                                                                                                            'hour', 'board_count',
                                                                                                            'mean_temp', 'mean_precip',
                                                                                                            'month_average_board_count',
                                                                                                            'surrounding_board_count',
                                                                                                            'NB_Predictions']]
            
            out_data = out_data.drop_duplicates()
            
            return(out_data)
        
        elif Best_Model == 0:
            
            data_File_path = '/'.join([path, 'Pre_Chart_Predictions.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)][['month', 'service_kind',
                                                                                                            'hour', 'board_count',
                                                                                                            'mean_temp', 'mean_precip',
                                                                                                            'month_average_board_count',
                                                                                                            'surrounding_board_count',
                                                                                                            'Poiss_Predictions']]
            
            out_data = out_data.drop_duplicates()
            
            return(out_data)
    
    elif part == 'post':
        
        # Load Performance data to select 'Best Model':
        
        Post_GLM_RMSEs_file_path = '/'.join([path,'Post_All_RMSEs.csv'])
        Post_GLM_RMSEs = pd.read_csv(Post_GLM_RMSEs_file_path)
        Post_RMSEs = pd.DataFrame(Post_GLM_RMSEs.loc[1]).T.reset_index(drop = True)
                
        Post_RF_Vanilla_RMSE_file_path = '/'.join([path, 'post_RF_Vanilla_Performance.csv'])
        Post_RF_Vanilla_RMSE = pd.read_csv(Post_RF_Vanilla_RMSE_file_path)
        Post_RF_Vanilla_RMSE = pd.DataFrame(Post_RF_Vanilla_RMSE['Test'])
        Post_RF_Vanilla_RMSE.columns = ['RF_Vanilla']
        
        Post_RMSEs['RF_Vanilla'] = Post_RF_Vanilla_RMSE['RF_Vanilla']
        
        Post_RF_Chart_file_path = '/'.join([path, 'post_RF_Chart.csv'])
        Post_RF_Chart = pd.read_csv(Post_RF_Chart_file_path)
        Post_ZI_RF_RMSE = math.sqrt(np.mean((Post_RF_Chart['board_count'] - Post_RF_Chart['RF_Pred'])**2))
        
        Post_RMSEs['ZI_RF'] = Post_ZI_RF_RMSE
        Post_RMSEs = Post_RMSEs.iloc[: , 1:]
        
        Post_All_RMSEs = Post_RMSEs.T
        Post_All_RMSEs.columns = ['Test']
        
        Best_Model = np.argmin(Post_All_RMSEs['Test'])
        
        #return(Best_Model)
        
        if Best_Model == 6:
            
            data_File_path = '/'.join([path, 'post_RF_Chart.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)]
            
            out_data = out_data.drop_duplicates()
                
            return(out_data)
        
        elif Best_Model == 5:
            
            data_File_path = '/'.join([path, 'post_RF_Vanilla_Chart.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)]
            
            out_data = out_data.drop_duplicates()
            
            return(out_data)
        
        elif Best_Model == 4:
            
            data_File_path = '/'.join([path, 'Post_Chart_Predictions.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)][['month', 'service_kind',
                                                                                                            'hour', 'board_count',
                                                                                                            'mean_temp', 'mean_precip',
                                                                                                            'month_average_board_count',
                                                                                                            'surrounding_board_count',
                                                                                                            'Hurdle_Predictions']]
            
            out_data = out_data.drop_duplicates()
            
            return(out_data)
        
        elif Best_Model == 3:
            
            data_File_path = '/'.join([path, 'Post_Chart_Predictions.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)][['month', 'service_kind',
                                                                                                            'hour', 'board_count',
                                                                                                            'mean_temp', 'mean_precip',
                                                                                                            'month_average_board_count',
                                                                                                            'surrounding_board_count',
                                                                                                            'ZIP_Predictions']]
            
            out_data = out_data.drop_duplicates()
            
            return(out_data)
        
        elif Best_Model == 2:
            
            data_File_path = '/'.join([path, 'Post_Chart_Predictions.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)][['month', 'service_kind',
                                                                                                            'hour', 'board_count',
                                                                                                            'mean_temp', 'mean_precip',
                                                                                                            'month_average_board_count',
                                                                                                            'surrounding_board_count',
                                                                                                            'ZINB_Predictions']]
            
            out_data = out_data.drop_duplicates()
            
            return(out_data)
        
        elif Best_Model == 1:
            
            data_File_path = '/'.join([path, 'Post_Chart_Predictions.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)][['month', 'service_kind',
                                                                                                            'hour', 'board_count',
                                                                                                            'mean_temp', 'mean_precip',
                                                                                                            'month_average_board_count',
                                                                                                            'surrounding_board_count',
                                                                                                            'NB_Predictions']]
            
            out_data = out_data.drop_duplicates()
            
            return(out_data)
        
        elif Best_Model == 0:
            
            data_File_path = '/'.join([path, 'Post_Chart_Predictions.csv'])
            data = pd.read_csv(data_File_path)
            data = data.iloc[: , 1:]
            
            data.month = data['month'].astype(str)
            data.hour = data['hour'].astype(str)
            data.service_kind = data['service_kind'].astype(str)

            out_data = data[(data['service_kind'] == serv) & (data['month'] == mn) & (data['hour'] == hr)][['month', 'service_kind',
                                                                                                            'hour', 'board_count',
                                                                                                            'mean_temp', 'mean_precip',
                                                                                                            'month_average_board_count',
                                                                                                            'surrounding_board_count',
                                                                                                            'Poiss_Predictions']]
            
            out_data = out_data.drop_duplicates()
            
            return(out_data)
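
Within each partition, the branches of `predict_boarding_py()` differ only in the chart file they read and in the prediction column they keep. A minimal sketch of the same selection logic with a lookup table is shown below. The names `MODEL_TABLE`, `rmse` and `select_chart` are hypothetical, the position-to-file mapping is read off the branches of the function, and the RMSE values in the usage lines are made up for illustration.

```python
import numpy as np

# Position in the RMSE table -> (chart file template, prediction column).
# None means that the branch returns every column of the chart.
MODEL_TABLE = [
    ('{Part}_Chart_Predictions.csv', 'Poiss_Predictions'),   # Best_Model == 0
    ('{Part}_Chart_Predictions.csv', 'NB_Predictions'),      # Best_Model == 1
    ('{Part}_Chart_Predictions.csv', 'ZINB_Predictions'),    # Best_Model == 2
    ('{Part}_Chart_Predictions.csv', 'ZIP_Predictions'),     # Best_Model == 3
    ('{Part}_Chart_Predictions.csv', 'Hurdle_Predictions'),  # Best_Model == 4
    ('{part}_RF_Vanilla_Chart.csv', None),                   # Best_Model == 5
    ('{part}_RF_Chart.csv', None),                           # Best_Model == 6
]


def rmse(observed, predicted):
    """Root mean squared error, as computed for the ZI_RF entry."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def select_chart(test_rmses, part):
    """Return (best position, chart file name, prediction column)."""
    best = int(np.argmin(test_rmses))           # smallest test RMSE wins
    template, pred_col = MODEL_TABLE[best]
    file_name = template.format(part=part, Part=part.capitalize())
    return best, file_name, pred_col


example_rmses = [1.9, 1.7, 1.6, 1.8, 1.75, 1.5, 1.2]   # made-up values
print(select_chart(example_rmses, 'post'))
# (6, 'post_RF_Chart.csv', None)
print(rmse([0, 1, 2, 0], [0.0, 2.0, 2.0, 0.0]))
# 0.5
```

With such a table, adding a model would mean adding one row rather than two new branches, one per partition.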

Let's get the boarding count predictions for route 1, direction 0, bus stop 12 using the post-lockdown data (after 2020-03-05) for any given weekday in March at 3 pm:

In [19]:
predict_boarding_py('1', '0', '12', 'post', 'weekday', '3', '15')

| Unnamed: 0 | month | service_kind | hour | board_count | mean_temp | mean_precip | month_average_board_count | surrounding_board_count | RF_Pred | RF_Lower_Bound | RF_Upper_Bound |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 205 | 3 | weekday | 15 | 0 | 0.18583 | 0.304762 | 0.479523 | 0.950382 | 0.0 | 0 | 0.0 |
| 211 | 3 | weekday | 15 | 1 | 0.18583 | 0.304762 | 0.479523 | 0.950382 | 2.31347 | 1 | 5.05 |
| 297 | 3 | weekday | 15 | 2 | 0.240495 | 0.247845 | 0.592736 | 0.523872 | 2.603125 | 1 | 6.0 |
| 382 | 3 | weekday | 15 | 0 | 0.240495 | 0.247845 | 0.592736 | 0.523872 | 0.0 | 0 | 0.0 |


Notice that, besides the prediction itself (in this case `RF_Pred`), we get a confidence interval for the prediction (`RF_Lower_Bound` to `RF_Upper_Bound`). Also notice that more than one row is returned, because multiple events happened at that bus stop under the given conditions. For example, the first and second rows have the same condition values but different board counts, which suggests that there were two different boarding events within one hour. The same holds for the third and fourth rows.
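
To make the repeated rows easier to read, we can group the returned rows by their shared condition values. The sketch below rebuilds a subset of the columns of the four rows shown above and counts the events per group. The column names `events` and `total_boardings` are hypothetical, and treating rows with identical conditions as events within the same hour is our interpretation rather than something the data states explicitly.

```python
import pandas as pd

# Subset of the columns of the four rows returned by the call above.
out_data = pd.DataFrame({
    'month': ['3', '3', '3', '3'],
    'service_kind': ['weekday'] * 4,
    'hour': ['15'] * 4,
    'board_count': [0, 1, 2, 0],
    'mean_temp': [0.18583, 0.18583, 0.240495, 0.240495],
    'mean_precip': [0.304762, 0.304762, 0.247845, 0.247845],
    'RF_Pred': [0.0, 2.31347, 2.603125, 0.0],
})

# Rows that share the same conditions are counted as separate boarding
# events observed under those conditions (an assumption of this sketch).
summary = (
    out_data
    .groupby(['mean_temp', 'mean_precip'], as_index=False)
    .agg(events=('board_count', 'size'),
         total_boardings=('board_count', 'sum'))
)

print(summary)
```

Each of the two groups contains two events, which matches the pairing of the first and second rows and of the third and fourth rows.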