# Spanish_High_Speed_Rail_tickets_pricing-Renfe


In [None]:
# import modules 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

In [None]:
data = pd.read_csv('../input/renfe.csv')

In [None]:
data.head()

In [None]:
data.info()

The first step is to clean the dataset.

In [None]:
# remove the first "Unnamed: 0" column
data = data.iloc[:,1:]

In [None]:
# convert "insert_date", "start_date" and "end_date" to datetime format
for col in ["insert_date" ,"start_date" , "end_date"]:
    data[col] = pd.to_datetime(data[col])

In [None]:
# set "insert_date" (already converted to datetime above) as the index
data.set_index('insert_date',inplace=True)

In [None]:
data.info()

In [None]:
data.head()

The next step is to check for missing values.

In [None]:
data.isnull().sum()

Because so many values are missing, dropping every affected row would discard too much data; replacing missing prices with mean values is more appropriate.
<br>origin, destination and train_type have no missing values, and train_class and fare have fewer missing values than price, so grouping on these columns gives a more accurate mean price for each record.
<br>Missing values for train_class and fare can be filled with the mode of the records that share the same origin and destination.
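The mode-based fill mentioned above is not carried out by the cells in this notebook. A minimal sketch of how it could look is given here, on a made-up toy table; the helper name `fill_with_group_mode` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the Renfe table; the rows are made up for illustration.
toy = pd.DataFrame({
    "origin": ["MADRID", "MADRID", "MADRID", "SEVILLA", "SEVILLA"],
    "destination": ["BARCELONA", "BARCELONA", "BARCELONA", "MADRID", "MADRID"],
    "fare": ["Promo", "Promo", None, "Flexible", None],
})

def fill_with_group_mode(df, col, keys):
    """Fill missing values in `col` with the most frequent value of each `keys` group."""
    def _fill(s):
        modes = s.mode()  # mode() ignores missing values by default
        # Leave the group untouched if it has no observed value at all.
        return s.fillna(modes.iloc[0]) if not modes.empty else s
    return df.groupby(keys)[col].transform(_fill)

toy["fare"] = fill_with_group_mode(toy, "fare", ["origin", "destination"])
print(toy)
```

When a group has several equally frequent values, `mode()` returns them sorted and this sketch simply takes the first one.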

The first step is to find the average price for each combination.
<br>This can be done with the groupby function.

In [None]:
# mean price of the records with a known price, per combination of route, train type, class and fare
avg_price_by_type = data.dropna(subset=['price']).groupby(['origin','destination','train_type','train_class','fare']).agg({'price':'mean'})


In [None]:
avg_price_by_type.head()

The second step is to fill the mean values back into the data table.
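As an alternative to the index-based approach used in this notebook, the fill step could also be sketched with `groupby(...).transform('mean')`, which returns the group mean aligned to every row so that `fillna` can use it directly. The table and its fare labels below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy table; "OnlyFare" is a placeholder label, not a real Renfe fare.
toy = pd.DataFrame({
    "train_type": ["AVE", "AVE", "AVE", "R. EXPRES", "R. EXPRES"],
    "fare": ["Promo", "Promo", "Promo", "OnlyFare", "OnlyFare"],
    "price": [80.0, 100.0, np.nan, 43.25, np.nan],
})

keys = ["train_type", "fare"]

# Mean of the known prices in each group, repeated for every row of that group.
group_mean = toy.groupby(keys)["price"].transform("mean")

# Only the missing prices are replaced; known prices stay as they are.
toy["price"] = toy["price"].fillna(group_mean)
print(toy)
```

Rows whose grouping keys are themselves missing still end up without a price, because groupby drops missing keys by default.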

In [None]:
data.head()

In [None]:
data.reset_index(drop=False,inplace=True)
data.set_index(['origin','destination','train_type','train_class','fare'], inplace=True)

In [None]:
data.head()

In [None]:
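# assign the result back: calling fillna(inplace=True) on a selected column may not update the DataFrame
data['price'] = data['price'].fillna(avg_price_by_type['price'])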

In [None]:
data.isnull().sum()

This reduces the number of records with a missing price from over 310k to about 53k, which is only 2% of all records.
<br>For simplicity, these remaining records are dropped.

In [None]:
data.dropna(inplace=True)

In [None]:
data.reset_index( inplace=True)

In [None]:
data.isnull().sum()

Because dropna removes every row that contains a missing value, the records with a missing train_class or fare are eliminated along with those that still lacked a price.
<br>Now we can explore the dataset.

In [None]:
data.set_index('insert_date',inplace=True)

In [None]:
data.head()

In [None]:
data.pivot_table(index='origin',columns='destination',values='price',aggfunc='count')

So Madrid is the hub that connects the other four cities.
<br>Also, the busiest route by record count is between Madrid and Barcelona, while the route between Ponferrada and Madrid has the fewest records.
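To read the busiest route off directly instead of scanning the pivot table, the routes can be ranked by record count. The sketch below uses a made-up toy table with the same column names.

```python
import pandas as pd

# Toy records; the counts are made up and do not reflect the real dataset.
toy = pd.DataFrame({
    "origin": ["MADRID", "MADRID", "BARCELONA", "MADRID", "PONFERRADA"],
    "destination": ["BARCELONA", "BARCELONA", "MADRID", "SEVILLA", "MADRID"],
    "price": [90.0, 85.0, 88.0, 60.0, 35.0],
})

# Number of records per (origin, destination) pair, largest first.
route_counts = (
    toy.groupby(["origin", "destination"])
       .size()
       .sort_values(ascending=False)
)

busiest_route = route_counts.index[0]
print(route_counts)
print("busiest route:", busiest_route)
```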

Since Madrid-Barcelona is the busiest route, the rest of this notebook studies it in more detail.

In [None]:
# take copies so that columns added later do not trigger SettingWithCopyWarning
data_b_m = data[(data['origin']=='BARCELONA')&(data['destination']=='MADRID')].copy()
data_m_b = data[(data['origin']=='MADRID')&(data['destination']=='BARCELONA')].copy()

In [None]:
len(set(data_b_m['start_date']))

In [None]:
len(set(data_m_b['start_date']))

So in total there are 2243 distinct departures from Barcelona to Madrid and 2230 from Madrid to Barcelona.

## From Barcelona to Madrid

This part studies the relation between train_type, fare and price.

In [None]:
data_b_m.groupby('train_type')['start_date'].count()

In [None]:
data_b_m.groupby(['train_type'])['start_date'].nunique()

In [None]:
data_b_m.groupby('train_type')['start_date'].count()/data_b_m.groupby(['train_type'])['start_date'].nunique()

AVE is the most common of the three train types. However, AVE-TGV and R. EXPRES have more records per departure on average than AVE.
<br>Let's see whether there are any differences between the three.

In [None]:
data_b_m['duration'] = data_b_m['end_date']-data_b_m['start_date']

In [None]:
for train in ['AVE','AVE-TGV','R. EXPRES']:
    print('train type: ',train)
    print(data_b_m[data_b_m['train_type']==train].describe())
    print()

R. EXPRES is probably popular because of its cheap tickets: its price is only about half that of the other two train types.
<br>However, the journey takes about 9 hours, while the other two need only around 3 hours.
<br>Note that std=0 for the R. EXPRES price, so we can assume that R. EXPRES has only one fare.

In [None]:
set(data_b_m[data_b_m['train_type']=='R. EXPRES']['fare'])

As expected. 
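The same check can be run for all train types at once by counting distinct fares and prices per type. This is a minimal sketch on made-up data; the fare label "OnlyFare" is a placeholder.

```python
import pandas as pd

# Toy records; prices and fare labels are made up for illustration.
toy = pd.DataFrame({
    "train_type": ["AVE", "AVE", "R. EXPRES", "R. EXPRES"],
    "fare": ["Promo", "Flexible", "OnlyFare", "OnlyFare"],
    "price": [60.0, 107.7, 43.25, 43.25],
})

# Distinct fares and distinct prices per train type.
summary = toy.groupby("train_type").agg(
    n_fares=("fare", "nunique"),
    n_prices=("price", "nunique"),
)

# Train types whose price never varies (equivalent to std = 0).
single_price_types = summary[summary["n_prices"] == 1].index.tolist()
print(summary)
print(single_price_types)
```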

Now we can focus on the remaining two train types. First, let's look at price versus fare for each train type.

In [None]:
# string aggregation names avoid the warnings newer pandas versions raise for np.mean / np.std
data_b_m.groupby(['train_type','fare']).agg({'price':['count','mean','std']})

Promo is the cheapest fare and Flexible is the most expensive one.
<br>This is expected, as a flexible ticket normally has fewer restrictions on use or cancellation.
<br><br>Also, around 80% of tickets are Promo in both the AVE and AVE-TGV train types.
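The fare share quoted above can be computed directly with a row-normalised cross-tabulation. The sketch below uses a made-up toy table constructed so that Promo accounts for 80% of each train type.

```python
import pandas as pd

# Toy records: 4 Promo and 1 Flexible ticket per train type (made-up data).
toy = pd.DataFrame({
    "train_type": ["AVE"] * 5 + ["AVE-TGV"] * 5,
    "fare": ["Promo"] * 4 + ["Flexible"] + ["Promo"] * 4 + ["Flexible"],
})

# normalize="index" turns each row of counts into shares that sum to 1.
fare_share = pd.crosstab(toy["train_type"], toy["fare"], normalize="index")
print(fare_share)
```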

Now let's move to the return trip

## From Madrid to Barcelona

This part studies the number of journeys and looks for patterns.

In [None]:
print(data_m_b['start_date'].min())
print(data_m_b['end_date'].max())

In [None]:
data_m_b['start_day'] =data_m_b['start_date'].dt.day
data_m_b['start_mth'] =data_m_b['start_date'].dt.month
data_m_b['start_hour'] =data_m_b['start_date'].dt.hour
data_m_b['start_weekday'] =data_m_b['start_date'].dt.weekday 

data_m_b['end_day'] =data_m_b['end_date'].dt.day
data_m_b['end_mth'] =data_m_b['end_date'].dt.month
data_m_b['end_hour'] =data_m_b['end_date'].dt.hour
data_m_b['end_weekday'] =data_m_b['end_date'].dt.weekday 

In [None]:
data_m_b.head()

First, let's study the relation between ticket price and the weekday of travel.

In [None]:
# count and mean price per class, fare and weekday, printed separately for each train type
for train in data_m_b['train_type'].unique():
    print(data_m_b[data_m_b['train_type']==train].groupby(['train_type','train_class','fare','start_weekday']).agg({'price':['count','mean']}))

From the above results, we can draw some conclusions:
1. Some combinations have fixed ticket prices, e.g. train_type = R. EXPRES, (AVE-TGV, Preferente, Flexible), ...
2. Normally, Sunday tickets (start_weekday = 6) are the most expensive of the week
3. R. EXPRES seems not to be in service on Sunday (only 3 records)
4. Ranking of fares by ticket price, from cheapest: Promo, Promo+, Flexible, Mesa
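When reading the weekday tables, note that `dt.weekday` numbers the days from Monday = 0 to Sunday = 6. A minimal sketch on made-up dates shows the mapping, with `dt.day_name()` as a more readable grouping key.

```python
import pandas as pd

# Toy departures: 2019-05-05 and 2019-05-12 are Sundays, 2019-05-06 is a Monday.
toy = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2019-05-05 07:00", "2019-05-06 07:00", "2019-05-12 19:00"]
    ),
    "price": [110.0, 85.0, 120.0],  # made-up prices
})

toy["start_weekday"] = toy["start_date"].dt.weekday      # Monday=0 ... Sunday=6
toy["start_day_name"] = toy["start_date"].dt.day_name()  # e.g. "Sunday"

# Mean price per named weekday.
mean_by_day = toy.groupby("start_day_name")["price"].mean()
print(toy)
print(mean_by_day)
```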

Next, let's study how departure time relates to the number of records and the price.

In [None]:
data_m_b.groupby(['train_type','start_hour']).agg({'price':['count','mean']})

Both AVE-TGV and R. EXPRES have only one time slot from Madrid to Barcelona.
<br>AVE has many departure times, and its ticket prices are normally around 90 (except for the first two and the last time slots).
<br>The cells below look further at the AVE records.

In [None]:
# plot the mean AVE price by departure hour, one line per weekday,
# for every (train_class, fare) combination that actually occurs in the data
data_ave = data_m_b[data_m_b['train_type']=='AVE']
for (train_class, fare), group in data_ave.groupby(['train_class','fare']):
    group.pivot_table(index='start_hour',columns='start_weekday',values='price',aggfunc='mean').plot()
    plt.title("train_class: {}, fare: {}".format(train_class,fare))
    plt.legend(loc=0)
    plt.show()
    print()

The prices are clearly related to the departure time.
<br>Some departure hours have higher ticket prices, e.g. 7, 19 and 20.
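Rather than reading the expensive hours off the plots, they can be listed by comparing each hour's mean price with the overall mean. This is only a sketch: the data is made up, and using the overall mean as the threshold is an assumption, not a rule taken from this notebook.

```python
import pandas as pd

# Toy records with made-up prices for four departure hours.
toy = pd.DataFrame({
    "start_hour": [6, 6, 7, 7, 12, 12, 19, 19],
    "price": [70.0, 74.0, 100.0, 104.0, 88.0, 92.0, 101.0, 105.0],
})

mean_by_hour = toy.groupby("start_hour")["price"].mean()
overall_mean = toy["price"].mean()

# Hours whose mean price is above the overall mean (assumed threshold).
peak_hours = mean_by_hour[mean_by_hour > overall_mean].index.tolist()
print(mean_by_hour)
print("peak hours:", peak_hours)
```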