# Renfe trips scraping exploratory data analysis (I)

The goal of this analysis is to answer the following question:

* **Is our project feasible?** There must be strong variations in ticket price between a ticket's release to market (about 2 months before departure) and the departure date. The goal of the project is to take advantage of those variations to send automatic reminders to users.

## python imports

In [None]:
from IPython.display import Image, display
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import (download_plotlyjs, 
                            init_notebook_mode, 
                            plot, 
                            iplot)
from plotly import io as pio
import random
import seaborn as sns

## config

In [None]:
# plot styling

mpl.rcParams['figure.figsize'] = (19.2, 10.8)
mpl.rcParams['figure.dpi'] = 100
mpl.rcParams['font.size'] = 12

In [None]:
init_notebook_mode(connected=True)

## data loading

In [None]:
# from update_dump import update_dump
# update_dump()
renfe = pd.read_csv('../input/renfe.csv', parse_dates=['insert_date', 'start_date', 'end_date'])

In [None]:
renfe.head()

In [None]:
renfe.shape

In [None]:
renfe.info(memory_usage='deep')

In [None]:
renfe.describe(include='all')

## data wrangling

First, let's create a unique id for every trip. Python's built-in `hash` function can compute it efficiently (note that string hashes are salted per interpreter session, so the resulting ids are only stable within a single run). The trip id is defined as a 'primary key' resulting from the combination of these columns:
* `origin`
* `destination`
* `start_date`
* `end_date`
* `train_type`

In [None]:
renfe.loc[:, 'trip_id'] = renfe[['origin', 
                                 'destination', 
                                 'start_date', 
                                 'end_date',
                                 'train_type']] \
                          .apply(lambda x: hash(tuple(x.tolist())), 
                                 axis=1)
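
Because the built-in `hash` is salted per session for strings, ids computed this way change after a kernel restart. If ids need to be comparable across runs, one possible alternative is `pd.util.hash_pandas_object`, which is deterministic. The sketch below is only an illustration: the helper name `stable_row_id` and the toy rows are made up, not part of the original notebook.

```python
import pandas as pd


def stable_row_id(frame, columns):
    """Return one deterministic uint64 id per row, built from `columns` (hypothetical helper)."""
    # index=False so that the id depends only on the column values
    return pd.util.hash_pandas_object(frame[columns], index=False)


# toy rows, invented for illustration: rows 0 and 1 describe the same trip
sample = pd.DataFrame({
    'origin': ['MADRID', 'MADRID', 'SEVILLA'],
    'destination': ['SEVILLA', 'SEVILLA', 'MADRID'],
    'start_date': pd.to_datetime(['2019-05-01 07:00'] * 3),
    'end_date': pd.to_datetime(['2019-05-01 09:30'] * 3),
    'train_type': ['AVE', 'AVE', 'AVE'],
})

key_columns = ['origin', 'destination', 'start_date', 'end_date', 'train_type']
sample['trip_id'] = stable_row_id(sample, key_columns)
print(sample[['origin', 'destination', 'trip_id']])
```

Rows with identical key columns get the same id, and the id does not depend on the interpreter session.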

Some trains are scheduled periodically; for example, a train departs from Seville to Madrid every day at 7:00. Let's create a `train_id` to identify those periodically scheduled trips. This id can be built by combining these columns:

* `origin`
* `destination`
* `start_date` -> weekday & departure time
* `train_type`


In [None]:
renfe.loc[:, 'start_date_weekday'] = renfe['start_date'].dt.weekday
renfe.loc[:, 'start_date_time'] = renfe['start_date'].dt.strftime("%H:%M")

In [None]:
renfe.loc[:, 'train_id'] = renfe[['origin', 
                                  'destination', 
                                  'start_date_weekday',
                                  'start_date_time',
                                  'train_type']] \
                           .apply(lambda x: hash(tuple(x.tolist())), axis=1)

Second, let's check trip duration:

In [None]:
renfe.loc[:, 'trip_duration'] = renfe['end_date'] - renfe['start_date']
# total_seconds() covers the whole duration, including trips longer than 24 hours
renfe.loc[:, 'trip_duration_hours'] = renfe['trip_duration'].dt.total_seconds() / 3600
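
One thing to keep in mind when converting a timedelta to hours: `dt.components` splits the value into separate fields, so adding only the `hours` and `minutes` fields drops any whole days, while `dt.total_seconds()` covers the full duration. A minimal sketch on two made-up durations (150 minutes and 1500 minutes):

```python
import pandas as pd

# made-up durations: 2h30 and 25h (one day plus one hour)
durations = pd.Series(pd.to_timedelta([150, 1500], unit='min'))

# field-based conversion: the 'days' field is not included
hours_from_fields = (durations.dt.components.hours
                     + durations.dt.components.minutes / 60)

# conversion based on the total duration
hours_total = durations.dt.total_seconds() / 3600

print(pd.DataFrame({'from_fields': hours_from_fields, 'total': hours_total}))
```

With the field-based version, the 25 hour trip shows up as 1 hour and would pass a "less than 4 hours" filter.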

Let's plot the distribution of trip duration (in hours):

In [None]:
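# distplot is deprecated in recent seaborn versions; histplot with kde=True is the closest replacement
sns.histplot(renfe['trip_duration_hours'], kde=True);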

Most 'high speed' trips should take far less than 4 hours. A longer trip means that the train is not really high speed or that a transfer is involved. We are not interested in those trips, so we can set a threshold and filter out any trip at or above it:

In [None]:
high_speed_max_duration = 4
# .copy() avoids SettingWithCopyWarning when new columns are added later
renfe = renfe.loc[renfe['trip_duration_hours'] < high_speed_max_duration, :].copy()

We will also need a variable that represents how much time is left until train departure at the moment the data was scraped. This has a strong impact on price, which tends to increase as departure gets closer (intuition says that it is important to buy tickets well in advance).

In [None]:
renfe.loc[:, 'time_to_departure'] = renfe['start_date'] - renfe['insert_date']
# 86400 seconds per day
renfe.loc[:, 'time_to_departure_days'] = renfe['time_to_departure'].dt.total_seconds() / 86400

Negative values for `time_to_departure` must be filtered out, as they are probably due to scraping errors or maintenance of the Renfe webpage.

In [None]:
renfe = renfe.loc[renfe['time_to_departure_days'] > 0, :].copy()

Let's create an indicator to identify price changes between consecutive observations of the same trip:

In [None]:
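# price difference between consecutive scrapes of the same trip, in chronological order
renfe['price_change'] = renfe.sort_values('insert_date').groupby('trip_id')['price'].diff()
# -1 for a drop, 0 for no change, 1 for a rise
renfe['price_change_direction'] = np.sign(renfe['price_change'])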
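
To make the intended meaning of both columns concrete, here is a small sketch on made-up prices. It uses `groupby(...).diff()` for the change and `np.sign` for the direction; note that clipping to `[-1, 1]` alone would leave a change of 0.5 € as 0.5 instead of mapping it to 1. The toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# made-up observations for two trips
toy = pd.DataFrame({
    'trip_id': ['a', 'a', 'a', 'b', 'b'],
    'insert_date': pd.to_datetime(['2019-04-01', '2019-04-02', '2019-04-03',
                                   '2019-04-01', '2019-04-02']),
    'price': [50.0, 50.5, 50.5, 30.0, 25.0],
})

# observations must be in chronological order within each trip
toy = toy.sort_values(['trip_id', 'insert_date'])

# difference with the previous observation of the same trip (NaN for the first one)
toy['price_change'] = toy.groupby('trip_id')['price'].diff()

# -1 for a drop, 0 for no change, 1 for a rise
toy['price_change_direction'] = np.sign(toy['price_change'])

# clip keeps small changes as they are, so it is not a pure direction flag
toy['clipped'] = toy['price_change'].clip(-1, 1)

print(toy)
```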

## price vs time to departure

Let's make a function that plots price vs time to departure given an origin, a destination, and a minimum number of scraped observations per trip. It is worth making the plot interactive: price changes are expected to come from fare and train class differences, and tooltips with that information will be useful.

In [None]:
def plot_price_vs_time_to_departure(origin, destination, min_obs=256, n_trips=8, dynamic=True):
    trips_obs = renfe['trip_id'].value_counts()
    min_obs_filter = renfe['trip_id'].isin(trips_obs[trips_obs > min_obs].index.tolist())
    filter_origin_destination = (renfe['origin'] == origin) & (renfe['destination'] == destination)

    traces = []

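    trip_ids = list(renfe.loc[filter_origin_destination & min_obs_filter, 'trip_id'].unique())

    # sample at most n_trips, in case fewer trips satisfy the filters
    for trip_id in random.sample(trip_ids, min(n_trips, len(trip_ids))):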

        trip = renfe[renfe['trip_id'] == trip_id] \
        .drop_duplicates(subset='insert_date', keep='first').sort_values('insert_date', ascending=False)

#         filter_departure_filter = trip['time_to_departure_days'] >= 0.0

#         trip = trip.loc[filter_departure_filter, :]

        traces.append(go.Scatter(
                      x=trip['time_to_departure_days'],
                      y=trip['price'],
                      name = f"{origin}-{destination}-{trip['start_date'].iloc[0].strftime('%A-%H:%M')}",
                      text = trip['fare'] + \
                             '_' + trip['train_class'] + \
                             '_' + trip['train_type'],
                      hoverinfo = 'text+y+x',
                      opacity = 0.6))

    layout = dict(
        title='price vs time to departure (days)',
        xaxis=dict(title='time to departure (days)', 
                   rangeslider=dict(visible = True)),
        yaxis=dict(title='price (€)'),
        legend=dict(font=dict(size=10)),
    )

    fig = dict(data=traces, layout=layout)
    
    if dynamic:
        
        iplot(fig, filename = "price_vs_time_to_departure")
    else:
        img_bytes = pio.to_image(fig, format='png', width=1200, height=700, scale=1)
        display(Image(img_bytes))
        
    plot(fig, filename = f"price_vs_time_to_departure_{origin}_{destination}.html", auto_open=False)

---

Let's plot trips with destination Madrid and origin Barcelona:

In [None]:
plot_price_vs_time_to_departure('BARCELONA', 'MADRID', min_obs=400, n_trips=8)

As seen in the plot, price varies, going up and down depending on the situation. The general trend is that the closer the departure time is, the higher the price. Prices change for two reasons:

* Cheaper tickets are sold out: it means that only more expensive tickets are available (for example fare Promo -> Flexible, or train class Turista -> Preferente).
* Due to ticket cancellations or increases in train capacity (longer trains) cheaper tickets are released and price drops.

Gaps in the series mean that either all tickets are sold out or there were problems with our scraping system (the first option is more likely).
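
Since feasibility depends on how often prices actually drop, a number can complement the visual impression. The sketch below computes the share of trips with at least one price drop. It assumes a `price_change` column like the one built in the data wrangling section; the function name `share_of_trips_with_price_drop` and the toy values are hypothetical.

```python
import pandas as pd


def share_of_trips_with_price_drop(frame):
    """Fraction of trips with at least one negative price change (hypothetical helper)."""
    # NaN (first observation of each trip) compares as False
    drops = frame['price_change'] < 0
    return drops.groupby(frame['trip_id']).any().mean()


# made-up price changes for four trips: 'a' and 'd' have a drop
toy = pd.DataFrame({
    'trip_id': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
    'price_change': [None, -5.0, None, 3.0, None, 0.0, None, -1.5],
})

share = share_of_trips_with_price_drop(toy)
print(f"{share:.0%} of trips had at least one price drop")
```

The same function could be applied per route (for example after filtering on `origin` and `destination`) to compare how often drops occur on each one.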

Let's plot other destinations:

---

In [None]:
plot_price_vs_time_to_departure('MADRID', 'SEVILLA')

A similar behavior is shown here, maybe with more stable/predictable prices and fewer price drops.

---

In [None]:
plot_price_vs_time_to_departure('MADRID', 'VALENCIA')

Same feeling here, stable/predictable prices with only a few price drops.

---

In [None]:
plot_price_vs_time_to_departure('MADRID', 'PONFERRADA')

In this case there is no direct high speed train to Ponferrada and a transfer is needed in León.

## conclusions

Up to this point, data looks promising. Copying directly from our project Trello board, our idea is described as:

1. Show tickets from Renfe and 2nd hand stores for user selected day and time period.
2. Give the user the option to set an alarm and be notified when a ticket (second hand or Renfe) experiences a price drop (only drops, not rises).
3. If there are no options available, set up an alarm for when new tickets are released.