<h1><center>Tip Prediction Model</center></h1>

With the data analysis complete, we now build the prediction models.

## Loading the Previously Prepared Data

In [1]:
# Connect to Spark
# install.packages("sparklyr")
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
spark_web(sc)



Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Registered S3 method overwritten by 'openssl':
  method      from
  print.bytes Rcpp


In [2]:
# Define the data schema
schema_final <- c( VendorID   = "integer", tpep_pickup_datetime  = "timestamp", tpep_dropoff_datetime="timestamp", passenger_count="integer",
                     trip_distance="numeric", RatecodeID="factor",store_and_fwd_flag="character",PULocationID="character",DOLocationID="character",
                     payment_type="integer",fare_amount="numeric",extra="numeric", mta_tax="numeric", tip_amount="numeric",tolls_amount="integer",
                     improvement_surcharge="numeric",total_amount="numeric") 
# Load the data, keep card payments only, and apply the filters
DataTaxiTarjetaFiltrado<-spark_read_csv(sc, "C:/Users/Sara/Desktop/DatosTaxi/", columns = schema_final, infer_schema=FALSE) %>%
            filter(payment_type==1 & tip_amount>=0 & tip_amount<=50) %>%
            filter(extra==0 | extra==0.5 | extra==1) %>%
            filter(mta_tax==0.5) %>% 
            filter(improvement_surcharge==0.3 ) %>% 
            mutate(WeekDay=dayofweek(tpep_dropoff_datetime)) %>%
            mutate(TripDurationMin=(bigint(to_timestamp(tpep_dropoff_datetime)) - bigint(to_timestamp(tpep_pickup_datetime)))/60 ) %>%
            filter(TripDurationMin>0 && trip_distance>0 ) %>% 
            filter(trip_distance<=15.60)%>%
            filter(fare_amount<=52)%>%
            filter(tip_amount<=10.15)%>%
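            # NOTE: labels kept as in the original run. Under the TLC fare rules the $0.50 surcharge is the
            # overnight one and the $1.00 surcharge the rush-hour one, so these two labels appear to be swapped.
            mutate(TypeTripHour=ifelse(extra==0,'Normal', ifelse(extra==0.5,'Rush Hour', 'OverNight'))) %>%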
            mutate(ZoneChange=ifelse(PULocationID ==DOLocationID ,'N', 'Y')) %>%
            mutate(AvgSpeed=trip_distance/(TripDurationMin/60)) %>%
            select(passenger_count,trip_distance,RatecodeID,fare_amount, tip_amount,tolls_amount,WeekDay,AvgSpeed,TripDurationMin,TypeTripHour, ZoneChange)

       

In [3]:
count(DataTaxiTarjetaFiltrado)

# Source: spark<?> [?? x 1]
         n
     <dbl>
1 18606021

## Data Partitioning

Before fitting the models we partition the data sample: 80% of the data is used to train the models and the remaining 20% is held out for testing.

In [4]:
# Split the data: 80% training, 20% test
particion_datos <- sdf_partition(DataTaxiTarjetaFiltrado,training = 0.8, test = 0.2, seed = 1234)

# Count the rows in the full dataset and in each partition:
count(DataTaxiTarjetaFiltrado)
count(particion_datos$test)
count(particion_datos$training)




# Source: spark<?> [?? x 1]
         n
     <dbl>
1 18606021

# Source: spark<?> [?? x 1]
        n
    <dbl>
1 3720586

# Source: spark<?> [?? x 1]
         n
     <dbl>
1 14885435
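
As a quick sanity check, the counts printed above can be compared with the requested 80/20 split. The snippet below is a minimal sketch in plain R that simply reuses the three numbers from the output; the variable names are illustrative and not part of the original notebook.

```r
# Row counts copied from the outputs above
n_total    <- 18606021
n_test     <- 3720586
n_training <- 14885435

# The two partitions should cover the full dataset with no overlap
stopifnot(n_training + n_test == n_total)

# Realised proportions; the split is random, so they only approximate 0.8 / 0.2
round(c(training = n_training, test = n_test) / n_total, 4)
```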

## Formula

All the models share the following formula.

In [5]:
formula<-tip_amount~passenger_count+fare_amount+WeekDay+AvgSpeed+TripDurationMin+TypeTripHour+ZoneChange

# Modelling

We train three different models and later choose the one that offers the best predictive power while also being efficient enough to use in the app. 
The models are a linear regression, a random forest, and a gradient boosting machine. Ideally each would be trained under several configurations to find the one with the greatest predictive power, but the limitations of the system mean that only a single parameter configuration can be run for each model. 
With more powerful machines, the caret library makes it possible to run different model configurations and test which one is optimal. 
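
To give a sense of what such a search would involve, the sketch below enumerates a small grid of random forest settings with base R's `expand.grid`. The candidate values are hypothetical and chosen only for illustration; each row would correspond to one additional training run.

```r
# Hypothetical candidate values for three random forest hyperparameters
grid <- expand.grid(
  num_trees = c(20, 50, 100),
  max_depth = c(5, 10),
  max_bins  = c(5, 32)
)

# Each row is one configuration that would need its own training run
nrow(grid)
head(grid, 3)
```

Even this small grid multiplies the training cost several times over, which is why a single configuration per model is used here.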

## Linear Regression

In [6]:
modelo_lm<-particion_datos$training %>%
                ml_linear_regression(formula)


In [8]:
summary(modelo_lm)


Deviance Residuals (approximate):
    Min      1Q  Median      3Q     Max 
-8.8501 -0.2669  0.1040  0.3847  9.1094 

Coefficients:
           (Intercept)        passenger_count            fare_amount 
          5.416805e-01          -1.834180e-03           1.631235e-01 
               WeekDay               AvgSpeed        TripDurationMin 
         -1.642960e-03           5.883605e-05          -4.543527e-04 
   TypeTripHour_Normal TypeTripHour_Rush Hour           ZoneChange_Y 
         -1.704779e-01          -9.693720e-02           9.766404e-03 

R-Squared: 0.5527
Root Mean Squared Error: 0.9658
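
To make the coefficients concrete, the sketch below applies them by hand to one hypothetical trip; all feature values are invented for illustration. It assumes that `OverNight` is the reference level of `TypeTripHour` and `N` the reference level of `ZoneChange`, since neither appears among the coefficients.

```r
# Coefficients copied from the summary output above
coefs <- c(
  intercept         =  5.416805e-01,
  passenger_count   = -1.834180e-03,
  fare_amount       =  1.631235e-01,
  WeekDay           = -1.642960e-03,
  AvgSpeed          =  5.883605e-05,
  TripDurationMin   = -4.543527e-04,
  TypeTripHour_Norm = -1.704779e-01,
  TypeTripHour_Rush = -9.693720e-02,
  ZoneChange_Y      =  9.766404e-03
)

# Hypothetical trip; the leading 1 multiplies the intercept and the
# dummy variables encode a 'Normal' trip that changes zone
trip <- c(
  intercept         = 1,
  passenger_count   = 1,
  fare_amount       = 10,
  WeekDay           = 3,
  AvgSpeed          = 12,
  TripDurationMin   = 15,
  TypeTripHour_Norm = 1,
  TypeTripHour_Rush = 0,
  ZoneChange_Y      = 1
)

# Linear prediction: sum of coefficient * feature value
predicted_tip <- sum(coefs * trip[names(coefs)])
round(predicted_tip, 2)
```

The fare term dominates the prediction: each additional dollar of fare adds roughly 16 cents to the predicted tip.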


In [11]:
ml_save(modelo_lm, "ModeloLineal.R")

Model successfully saved.


In [43]:
# Predictions and metrics on the training set
training_ml<-ml_predict(modelo_lm, particion_datos$training)
mse_train_ml<-ml_regression_evaluator( training_ml,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mse"),
  uid = random_string("regression_evaluator_"))
rmse_train_ml<-ml_regression_evaluator( training_ml,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("rmse"),
  uid = random_string("regression_evaluator_"))
r2_train_ml<-ml_regression_evaluator( training_ml,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("r2"),
  uid = random_string("regression_evaluator_"))
mae_train_ml<-ml_regression_evaluator( training_ml,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mae"),
  uid = random_string("regression_evaluator_"))
train_lm<-c(mse_train_ml,rmse_train_ml,r2_train_ml,mae_train_ml)
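
For reference, the four metrics requested from `ml_regression_evaluator` can be reproduced on local vectors with a few lines of base R. This is a minimal sketch on made-up numbers, not the Spark implementation; `regression_metrics` is a hypothetical helper name.

```r
# Compute MSE, RMSE, R2 and MAE for observed and predicted values
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(
    mse  = mse,
    rmse = sqrt(mse),
    r2   = 1 - sum(err^2) / sum((actual - mean(actual))^2),
    mae  = mean(abs(err))
  )
}

# Made-up tips and predictions, for illustration only
actual    <- c(2.0, 0.0, 3.5, 1.0)
predicted <- c(1.5, 0.5, 3.0, 1.0)

regression_metrics(actual, predicted)
```

Because RMSE is the square root of MSE, the two always rank models in the same order; MAE is less sensitive to occasional large errors.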


In [46]:
# Predictions and metrics on the test set
test_ml<-ml_predict(modelo_lm, particion_datos$test) 

mse_test_ml<-ml_regression_evaluator( test_ml,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mse"),
  uid = random_string("regression_evaluator_"))
rmse_test_ml<-ml_regression_evaluator( test_ml,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("rmse"),
  uid = random_string("regression_evaluator_"))
r2_test_ml<-ml_regression_evaluator( test_ml,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("r2"),
  uid = random_string("regression_evaluator_"))
mae_test_ml<-ml_regression_evaluator( test_ml,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mae"),
  uid = random_string("regression_evaluator_"))

test_lm<-c(mse_test_ml,rmse_test_ml,r2_test_ml,mae_test_ml)

## Random Forest

In [None]:
# Random forest regression with a single parameter configuration
modelo_rf <- particion_datos$training %>%
              ml_random_forest(formula,
              type ="regression",
              prediction_col = "prediction",
              probability_col = "probability",       # only relevant for classification
              raw_prediction_col = "rawPrediction",  # only relevant for classification
              feature_subset_strategy = "auto",
              impurity = "auto",
              checkpoint_interval = 10,
              max_bins = 5,                          # Spark's default is 32
              max_depth = 10,
              num_trees = 20,
              min_info_gain = 0,
              min_instances_per_node = 1,
              subsampling_rate = 1,
              seed = 1234,
              thresholds = NULL,                     # only relevant for classification
              cache_node_ids = FALSE,
              max_memory_in_mb = 256,
              uid = random_string("random_forest_")
)

In [None]:
summary(modelo_rf)
ml_save(modelo_rf, "ModeloRF")

In [29]:
modelo_rf<-ml_load(sc,"ModeloRF")

In [44]:
training_rf<-ml_predict(modelo_rf, particion_datos$training)
mse_train_rf<-ml_regression_evaluator( training_rf,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mse"),
  uid = random_string("regression_evaluator_"))
rmse_train_rf<-ml_regression_evaluator( training_rf,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("rmse"),
  uid = random_string("regression_evaluator_"))
r2_train_rf<-ml_regression_evaluator( training_rf,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("r2"),
  uid = random_string("regression_evaluator_"))
mae_train_rf<-ml_regression_evaluator( training_rf,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mae"),
  uid = random_string("regression_evaluator_"))
train_rf<-c(mse_train_rf,rmse_train_rf,r2_train_rf,mae_train_rf)



In [47]:
# Keep the predictions in their own object so the metrics vector below does not overwrite them
pred_test_rf<-ml_predict(modelo_rf, particion_datos$test) 

mse_test_rf<-ml_regression_evaluator( pred_test_rf,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mse"),
  uid = random_string("regression_evaluator_"))
rmse_test_rf<-ml_regression_evaluator( pred_test_rf,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("rmse"),
  uid = random_string("regression_evaluator_"))
r2_test_rf<-ml_regression_evaluator( pred_test_rf,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("r2"),
  uid = random_string("regression_evaluator_"))
mae_test_rf<-ml_regression_evaluator( pred_test_rf,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mae"),
  uid = random_string("regression_evaluator_"))
test_rf<-c(mse_test_rf,rmse_test_rf,r2_test_rf,mae_test_rf)

## Gradient Boosting


In [12]:
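# NOTE: `max.depth` is not an argument of ml_gradient_boosted_trees() (the argument is `max_depth`),
# so it is ignored, as the warning below shows, and the model is trained with the default depth (5).
modelo_gb <- particion_datos$training %>%
              ml_gradient_boosted_trees(formula,
              type ="regression",max.depth=10)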


"Some components of ... were not used: max.depth"

In [13]:
ml_save(modelo_gb, "ModeloGB")

Model successfully saved.


In [15]:
summary(modelo_gb)

               Length Class                   Mode     
pipeline_model  5     ml_pipeline_model       list     
formula         1     -none-                  character
dataset         2     tbl_spark               list     
pipeline        5     ml_pipeline             list     
model          11     ml_gbt_regression_model list     
label_col       1     -none-                  character
features_col    1     -none-                  character
feature_names   8     -none-                  character
response        1     -none-                  character

In [48]:
training_gb<-ml_predict(modelo_gb, particion_datos$training)

mse_train_gb<-ml_regression_evaluator( training_gb,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mse"),
  uid = random_string("regression_evaluator_"))
rmse_train_gb<-ml_regression_evaluator( training_gb,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("rmse"),
  uid = random_string("regression_evaluator_"))
r2_train_gb<-ml_regression_evaluator( training_gb,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("r2"),
  uid = random_string("regression_evaluator_"))
mae_train_gb<-ml_regression_evaluator( training_gb,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mae"),
  uid = random_string("regression_evaluator_"))
train_gb<-c(mse_train_gb,rmse_train_gb,r2_train_gb,mae_train_gb)

In [49]:
# Keep the predictions in their own object so the metrics vector below does not overwrite them
pred_test_gb<-ml_predict(modelo_gb, particion_datos$test) 

mse_test_gb<-ml_regression_evaluator( pred_test_gb,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mse"),
  uid = random_string("regression_evaluator_"))
rmse_test_gb<-ml_regression_evaluator( pred_test_gb,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("rmse"),
  uid = random_string("regression_evaluator_"))
r2_test_gb<-ml_regression_evaluator( pred_test_gb,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("r2"),
  uid = random_string("regression_evaluator_"))
mae_test_gb<-ml_regression_evaluator( pred_test_gb,
   label_col="tip_amount",
   prediction_col = "prediction",
  metric_name = c("mae"),
  uid = random_string("regression_evaluator_"))
test_gb<-c(mse_test_gb,rmse_test_gb,r2_test_gb,mae_test_gb)


# Model Comparison

Once all the models have been fitted, we compare them with one another. 

In [67]:
comparativa_modelos<-as.data.frame(rbind(train_lm,test_lm,train_rf,test_rf,train_gb,test_gb))

In [68]:
names(comparativa_modelos)<-c("mse","rmse","r2","mae")

In [69]:
comparativa_modelos

| | mse | rmse | r2 | mae |
|---|---|---|---|---|
| train_lm | 0.9327011 | 0.9657645 | 0.5526715 | 0.5979458 |
| test_lm | 0.9305075 | 0.9646282 | 0.554596 | 0.5977304 |
| train_rf | 1.0714756 | 1.0351211 | 0.4861144 | 0.6366345 |
| test_rf | 1.0716768 | 1.0352182 | 0.4870228 | 0.6368386 |
| train_gb | 0.9290692 | 0.9638824 | 0.5544133 | 0.5964876 |
| test_gb | 0.9275232 | 0.9630801 | 0.5560244 | 0.5963103 |


The linear regression and gradient boosting machine models are very similar in predictive power: on the test set they reach an R² of 0.5546 and 0.5560 and an RMSE of 0.9646 and 0.9631, respectively. The random forest, with the configuration used here, lags behind both (test R² of 0.4870). 
As noted earlier, the usual approach would be to run the models under different parameterizations and try to improve the fit. 
With this configuration, the gradient boosting machine is slightly better than the linear regression.
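
The figures in the comparison table can be checked and summarised with a short base R sketch. It re-enters the test-set values by hand (the names `resultados_test` and `mejora_gb` are illustrative), confirms that each RMSE is the square root of its MSE, and expresses the gap between gradient boosting and linear regression as a relative difference.

```r
# Test-set metrics copied from the comparison table above
resultados_test <- data.frame(
  modelo = c("lm", "rf", "gb"),
  mse    = c(0.9305075, 1.0716768, 0.9275232),
  rmse   = c(0.9646282, 1.0352182, 0.9630801),
  r2     = c(0.5545960, 0.4870228, 0.5560244)
)

# Internal consistency: RMSE should equal sqrt(MSE) up to rounding
stopifnot(all(abs(sqrt(resultados_test$mse) - resultados_test$rmse) < 1e-6))

# Relative RMSE reduction of gradient boosting with respect to linear regression
rmse <- setNames(resultados_test$rmse, resultados_test$modelo)
mejora_gb <- (rmse[["lm"]] - rmse[["gb"]]) / rmse[["lm"]]
round(100 * mejora_gb, 2)  # in percent
```

The reduction is well under one percent of the RMSE, which is consistent with describing the two models as very similar.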