# Spanish High-Speed Rail Ticket Pricing Data Analysis
![](https://www.seat61.com/images/Spain-pato-train-barcelona2.jpg)

# Introduction:
Rail transport in Spain operates on four rail gauges, and services are run by a variety of private and public operators. The total route length in 2012 was 16,026 km (10,182 km electrified).

Most railways are operated by Renfe Operadora; metre and narrow-gauge lines are operated by FEVE and other carriers in individual autonomous communities. There are plans to build or convert more lines to standard gauge, including some dual gauging of broad-gauge lines, especially where these lines link to France, and to raise platform heights.

Spain is a member of the International Union of Railways (UIC).

# Importing Libraries

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import datetime
import matplotlib.pyplot as plt
# from chart_studio.plotly.offline import init_notebook_mode,iplot
# import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)  # load plotly.js into the notebook so iplot() can render

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import math

# Dataset preparation

In [None]:
data = pd.read_csv("../input/spanish-high-speed-rail-system-ticket-pricing/thegurus-opendata-renfe-trips.csv",nrows=3579770)

Dropping irrelevant columns

In [None]:
data = data.drop(columns=['id','seats','meta','company','duration'])
cols = list(data.columns)
cols = [cols[-1]] + cols[:-1]
data = data[cols]

column_dict = {
    "departure":"start_date",
    "arrival":"end_date",
    "vehicle_type":"train_type",
    "vehicle_class":"train_class"
}
data = data.rename(columns = column_dict)

data.to_csv("./renfe.csv",index=False)

# Loading the dataset

In [None]:
data = pd.read_csv('./renfe.csv')
data.head()

In [None]:
data.info()

# DATA PREPARATION

1. Check for null values.
2. Fill the null values with the mean or the mode, whichever suits the data type.
3. Remove unnecessary columns that do not influence the ticket price.
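
The first step can be previewed on a small frame before touching the full dataset. This is a minimal sketch, assuming a hypothetical toy DataFrame `toy` that only borrows the notebook's column names; the helper `null_report` is illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def null_report(df):
    """Return the null count and null percentage per column, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "n_null": counts,
        "pct_null": (counts / len(df) * 100).round(2),
    })
    return report.sort_values("n_null", ascending=False)

# Hypothetical toy data (not taken from the Renfe dataset)
toy = pd.DataFrame({
    "price": [50.0, np.nan, 30.0, np.nan],
    "train_class": ["Turista", None, "Turista", "Preferente"],
    "fare": ["Promo", "Flexible", "Promo", "Promo"],
})
report = null_report(toy)
print(report)
```

A report like this helps decide, column by column, whether filling or dropping is the cheaper choice.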


## Checking for Null values in the dataset

In [None]:
data.isnull().sum()

## Filling the Null values in the price column by taking the mean price of the ticket

In [None]:
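# Assign the result back; fillna(inplace=True) on a selected column may not update data in recent pandas
data['price'] = data['price'].fillna(data['price'].mean())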
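
A single global mean ignores that prices differ a lot between train types. As a possible alternative (not what the notebook does), the sketch below compares the global mean with a per-group mean on hypothetical toy data; the grouping column `train_type` is an assumption about which feature is most informative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the Renfe dataset)
toy = pd.DataFrame({
    "train_type": ["AVE", "AVE", "AVE", "REGIONAL", "REGIONAL"],
    "price": [100.0, np.nan, 80.0, 20.0, np.nan],
})

# Global mean, as in the notebook: every gap receives the same value
global_fill = toy["price"].fillna(toy["price"].mean())

# Alternative: mean of the rows that share the same train_type
group_mean = toy.groupby("train_type")["price"].transform("mean")
group_fill = toy["price"].fillna(group_mean)

print(pd.DataFrame({"global": global_fill, "by_train_type": group_fill}))
```

With the global mean, the missing REGIONAL price is filled with a value far above any observed REGIONAL ticket; the per-group fill stays within the range of its own train type.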

## Dropping the rows containing Null values in the attributes train_class and fare

In [None]:
data.dropna(inplace=True)

## Dropping irrelevant attributes

In [None]:
data.drop('insert_date',axis=1,inplace=True)

In [None]:
data.isnull().sum()

# Exploratory Data Analysis

## Univariate Analysis

### Number of people boarding from different stations

In [None]:
fig,ax = plt.subplots(figsize=(15,5))
ax = sns.countplot(x='origin',data=data)
plt.show()

From the graph above, **"Madrid"** is the most common station of origin.

### Number of people travelling to each destination station

In [None]:
fig,ax = plt.subplots(figsize=(15,5))
ax = sns.countplot(x='destination',data=data)
plt.show()

* From the graph above, **"Madrid"** is also the most common destination station.

### Different types of trains that run in Spain

In [None]:
fig,ax = plt.subplots(figsize=(15,6))
ax = sns.countplot(x='train_type',data=data)
plt.show()

We can see that **"AVE"** trains appear far more often than any other train type.

### Number of trains in each class

In [None]:
fig,ax = plt.subplots(figsize=(15,6))
ax = sns.countplot(x='train_class',data=data)
plt.show()

**"Turista"** is the train_class in which people travel most often.

### Number of tickets bought in each fare category

In [None]:
fig,ax = plt.subplots(figsize=(15,6))
ax = sns.countplot(x='fare',data=data)
plt.show()

### Distribution of the ticket prices

In [None]:
f,ax = plt.subplots(figsize=(15,6))
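# distplot is deprecated in seaborn; histplot with kde=True draws the histogram and density curve (rug marks omitted)
ax = sns.histplot(data['price'],kde=True)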
plt.show()

# Bivariate Analysis

### Train_class vs Price

In [None]:
f,ax = plt.subplots(figsize=(15,6))
ax = sns.boxplot(x='train_class',y='price',data=data)
plt.show()

**"Cama G. Clase"** is the train class with the highest ticket price, and tickets of this class are bought by the fewest people.

### Train_type vs Price

In [None]:
f,ax = plt.subplots(figsize=(15,6))
ax = sns.boxplot(x='train_type',y='price',data=data)
plt.show()

Ticket prices for the train types **AVE and AVE-TGV** are comparatively higher than those of the other train types.

# Feature Engineering

This step extracts, from the existing columns, the features that determine the ticket price.

### Finding the travel time between the place of origin and destination

In [None]:
data = data.reset_index()

In [None]:
datetimeFormat = '%Y-%m-%d %H:%M:%S'
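def fun(a,b):
    # Travel time in hours between two timestamp strings (a = departure, b = arrival)
    diff = datetime.datetime.strptime(b, datetimeFormat)- datetime.datetime.strptime(a, datetimeFormat)
    # total_seconds() includes whole days; .seconds alone would drop them
    return diff.total_seconds()/3600.0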
    

In [None]:
data['travel_time_in_hrs'] = data.apply(lambda x:fun(x['start_date'],x['end_date']),axis=1) 
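
Row-wise `apply` with `strptime` can be slow on millions of rows. A vectorized sketch of the same computation is shown below on hypothetical toy timestamps; it assumes the `start_date` and `end_date` columns follow the `'%Y-%m-%d %H:%M:%S'` format used above.

```python
import pandas as pd

# Hypothetical toy timestamps; the second trip crosses midnight
toy = pd.DataFrame({
    "start_date": ["2019-05-01 08:00:00", "2019-05-01 23:30:00"],
    "end_date":   ["2019-05-01 10:30:00", "2019-05-02 06:00:00"],
})

fmt = "%Y-%m-%d %H:%M:%S"
start = pd.to_datetime(toy["start_date"], format=fmt)
end = pd.to_datetime(toy["end_date"], format=fmt)

# Subtracting two datetime columns gives a timedelta column;
# .dt.total_seconds() converts it to seconds in one vectorized step
toy["travel_time_in_hrs"] = (end - start).dt.total_seconds() / 3600.0
print(toy)
```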

### Removing redundant features

In [None]:
data.drop(['start_date','end_date'],axis=1,inplace=True)
data.head()

### Travelling from MADRID to SEVILLA

In [None]:
df1 = data[(data['origin']=="MADRID") & (data['destination']=="SEVILLA")]
df1.head()

In [None]:
f,ax = plt.subplots(figsize=(15,6))
ax = sns.barplot(x="train_type",y="travel_time_in_hrs",data=df1)
plt.show()

In [None]:
f,ax = plt.subplots(figsize=(15,6))
ax = sns.boxplot(x="train_type",y="price",data=df1)
plt.show()

- The fastest train between MADRID and SEVILLA is AVE, which is also the costliest; it takes approximately 2-2.4 hrs.
- The cheapest train is MD-LD, which is also the slowest; it takes more than 7 hours to reach the destination.

### Travelling from MADRID to BARCELONA

In [None]:
df1 = data[(data['origin']=="MADRID") & (data['destination']=="BARCELONA")]
df1.head()

In [None]:
f,ax = plt.subplots(figsize=(15,6))
ax = sns.barplot(x="train_type",y="travel_time_in_hrs",data=df1)
plt.show()

In [None]:
f,ax = plt.subplots(figsize=(15,6))
ax = sns.boxplot(x="train_type",y="price",data=df1)
plt.show()

- The fastest trains on this route are AVE and AVE-TGV, as they take around 2-3 hours to reach the destination.
- R.Express takes the longest, more than 8 hours, and is one of the cheapest options.

### Travelling from MADRID to VALENCIA

In [None]:
df1 = data[(data['origin']=="MADRID") & (data['destination']=="VALENCIA")]
df1.head()

In [None]:
f,ax = plt.subplots(figsize=(15,6))
ax = sns.barplot(x="train_type",y="travel_time_in_hrs",data=df1)
plt.show()

In [None]:
f,ax = plt.subplots(figsize=(15,6))
ax = sns.boxplot(x="train_type",y="price",data=df1)
plt.show()

- AVE and ALVIA are the two train types that take the least time to reach the destination, while MD-LD takes the most.
- REGIONAL trains have the lowest ticket fare.
- AVE is the best option on this route, as it is the fastest train and has comparatively low ticket prices.

### Travelling from MADRID to PONFERRADA

In [None]:
df1 = data[(data['origin']=="MADRID") & (data['destination']=="PONFERRADA")]
df1.head()

In [None]:
f,ax = plt.subplots(figsize=(15,6))
ax = sns.barplot(x="train_type",y="travel_time_in_hrs",data=df1)
plt.show()

In [None]:
f,ax = plt.subplots(figsize=(15,6))
ax = sns.boxplot(x="train_type",y="price",data=df1)
plt.show()

- It takes a minimum of 4 hours to travel from MADRID to PONFERRADA.
- AVE-MD, AVE-LD and ALVIA take almost the same time, which is the minimum, and their ticket prices are also the same.
- There is no point in travelling via MD-LD, as it takes the longest and has the most expensive tickets.
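
The four route sections repeat the same filter-and-plot pattern. As a numeric companion to the plots, the sketch below summarises one route per call. The helper `route_summary` and the toy data are hypothetical; the column names match the ones used in this notebook.

```python
import pandas as pd

def route_summary(df, origin, destination):
    """Median price and travel time per train type for one route, fastest first."""
    route = df[(df["origin"] == origin) & (df["destination"] == destination)]
    return (route.groupby("train_type")[["price", "travel_time_in_hrs"]]
                 .median()
                 .sort_values("travel_time_in_hrs"))

# Hypothetical toy data (not taken from the Renfe dataset)
toy = pd.DataFrame({
    "origin": ["MADRID"] * 4 + ["SEVILLA"],
    "destination": ["SEVILLA"] * 4 + ["MADRID"],
    "train_type": ["AVE", "AVE", "MD-LD", "MD-LD", "AVE"],
    "price": [70.0, 80.0, 30.0, 40.0, 75.0],
    "travel_time_in_hrs": [2.5, 2.5, 7.0, 7.5, 2.5],
})

summary = route_summary(toy, "MADRID", "SEVILLA")
print(summary)
```

The median is used here because it is the centre line a box plot draws, so the table can be read side by side with the price plots.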

# Route chart 

1. Use pie charts to visualize the different routes and the types of trains available on them

In [None]:
data['route'] = data['origin']+' to '+data['destination']

print('There are {} number of routes in dataframe'.format(data['route'].nunique()))

In [None]:
cnt_ = data['route'].value_counts()

fig = {
  "data": [
    {
      "values": cnt_.values,
      "labels": cnt_.index,
      "domain": {"x": [0, .5]},
      "name": "Routes",
      "hoverinfo":"label+percent+name",
      "hole": .5,
      "type": "pie"
    },],
  "layout": {
        "title":"Pie chart of routes",
        "annotations": [
            { "font": { "size": 20},
              "showarrow": False,
             "text": "Pie Chart",
                "x": 0.50,
                "y": 1
            },
        ]
    }
}
iplot(fig)
cnt_
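
The percentages the pie chart shows on hover can also be computed directly. This is a minimal sketch on hypothetical toy routes, using `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy data (not taken from the Renfe dataset)
toy = pd.DataFrame({
    "origin": ["MADRID", "MADRID", "SEVILLA", "MADRID"],
    "destination": ["SEVILLA", "SEVILLA", "MADRID", "BARCELONA"],
})
toy["route"] = toy["origin"] + " to " + toy["destination"]

# normalize=True returns fractions instead of counts; scale to percent
share = (toy["route"].value_counts(normalize=True) * 100).round(1)
print(share)
```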

# Analysis of train type column 

## Overview of different train types in Spain:

**AVE:** With 3,100 km of track, the Spanish high-speed AVE trains operate on the longest high-speed network in Europe. Running at speeds of up to 310 km/h, this extensive network allows fast connections between cities in Spain; Madrid to Barcelona takes less than 3 hours. The network connects many cities across Spain, from Madrid and Barcelona to Córdoba, Seville, Málaga and Valencia.

**ALVIA:** The Spanish Alvia trains combine long-distance and high-speed service to connect major cities across Spain. The Alvia offers many routes, such as connections from Madrid to Gijón, Alicante and Castellón and from Barcelona to Bilbao, A Coruña and Vigo. With air-conditioned carriages and check-in control before boarding, the Alvia is a comfortable and relaxed way to traverse one of Europe's biggest countries.

**REGIONAL:** Regional and intercity trains in Spain. FEVE trains operate in the north of Spain, connecting cities like Bilbao, Gijón, León and Santander. Cercanías (suburban trains) is a network of trains that operates in and around the larger Spanish cities including Barcelona and Valencia.

**INTERCITY:** Traditional intercity trains travelling at 160 to 250 km/h allow you to reach nearly every corner of Spain. You can choose to travel in 2nd class (Turista) or 1st class (Preferente). The comfort of the carriages is close to that of the high-speed AVE trains. All trains are air-conditioned.

**AV City:** AV City trains are high-speed trains that complement the AVE by offering lower prices. They are marketed in economy class (p) and economy plus (p+).

**Long distance (LD) and medium distance (MD):** The LD-AVE and MD-AVE entries in the list of trains are indirect services that use a combination of the regular trains (either LD, Larga Distancia/Long Distance, or MD, Media Distancia/Medium Distance). These services require a change (usually in Zaragoza or Valencia). Because of the change and the lower-speed trains on part of the way, the journey is longer but the tickets are cheaper.

The **LD, AVE-MD, AVE-LD, LD-MD, MD-AVE, MD and LD-AVE** types broadly fall under the above category.


In [None]:
cnt_ = data['train_type'].value_counts()

fig = {
  "data": [
    {
      "values": cnt_.values,
      "labels": cnt_.index,
      "domain": {"x": [0, .5]},
      "name": "Train types",
      "hoverinfo":"label+percent+name",
      "hole": .7,
      "type": "pie"
    },],
  "layout": {
        "title":"Pie chart Train types",
        "annotations": [
            { "font": { "size": 20},
              "showarrow": False,
             "text": "Pie Chart",
                "x": 0.50,
                "y": 1
            },
        ]
    }
}
iplot(fig)
cnt_

## Observations

* Most records in our dataset are AVE, the high-speed trains; they account for 69.4%.

* People rarely choose the MD-AVE, MD and LD-AVE trains because these journeys take longer.

* This suggests that travellers in Spain prefer high-speed trains.

In [None]:
data.head()

# Label Encoding

* Encode every column that holds categorical string values as equivalent numerical codes


In [None]:
lab_en = LabelEncoder()
# Assign with data[col] = ... so each column takes the integer dtype of the codes;
# data.loc[:,col] = ... can keep the original object dtype in recent pandas
for col in ['origin','destination','train_type','train_class','fare']:
    data[col] = lab_en.fit_transform(data[col])
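
Reusing one `LabelEncoder` for every column, as above, overwrites its fitted classes each time, so only the last column can be decoded afterwards. The sketch below keeps one encoder per column instead; the `encoders` dictionary and the toy data are hypothetical additions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data (not taken from the Renfe dataset)
toy = pd.DataFrame({
    "origin": ["MADRID", "SEVILLA", "MADRID"],
    "train_type": ["AVE", "ALVIA", "AVE"],
})

encoders = {}
for col in ["origin", "train_type"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # codes follow sorted label order
    encoders[col] = enc                     # keep the mapping for later decoding

# Map the integer codes back to the original labels
decoded = encoders["train_type"].inverse_transform(toy["train_type"])
print(toy)
print(list(decoded))
```

Keeping the encoders makes it possible to label plots and model outputs with station and train names rather than bare integers.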

In [None]:
data.head()

In [None]:
data = data.loc[:,data.columns!='route']
data = data.loc[:,data.columns!='index']

In [None]:
data.to_csv("data_prepared_renfe.csv",index=False)

# Correlation Matrix

## Observations

1. Price and origin show a weak negative correlation
2. Price and destination show a weak negative correlation
3. Travel time is positively correlated with train type

In [None]:
def coorelation_matrix_plot(data, title = "Train Ticket Prediction", height = 9, width = 12):
    cor_mat = round(data.corr(method ="spearman"), 2)
    plt.figure(figsize = (width, height))
    ax = sns.heatmap(cor_mat, annot=True, annot_kws={"size": 15}, cmap = sns.color_palette("PuOr_r", 50), 
                     vmin = -1, vmax = 1)
    ax.axes.set_title(title, fontsize = 30)
    ax.title.set_position([.5, 1.03])
    plt.show()
    
# Pass the full frame: the helper 'index' column was already dropped, so iloc[:,1:] would skip a real feature
coorelation_matrix_plot(data, title = "Train Ticket Prediction")
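
One caveat when reading this matrix: label-encoded columns such as `origin` or `train_type` carry arbitrary (alphabetical) integer codes, so the sign and size of their Spearman correlation depend on how the labels happen to sort. The sketch below illustrates this on hypothetical toy data with two different code assignments for the same categories.

```python
import pandas as pd

# Hypothetical toy data (not taken from the Renfe dataset)
toy = pd.DataFrame({
    "train_type": ["AVE", "AVE", "MD", "MD", "ALVIA", "ALVIA"],
    "price": [90.0, 85.0, 20.0, 25.0, 50.0, 55.0],
})

alphabetical = {"ALVIA": 0, "AVE": 1, "MD": 2}  # the order LabelEncoder would assign
by_price = {"MD": 0, "ALVIA": 1, "AVE": 2}      # hypothetical order by typical price

codes = pd.DataFrame({
    "alphabetical": toy["train_type"].map(alphabetical),
    "by_price": toy["train_type"].map(by_price),
    "price": toy["price"],
})

# Same categories, same prices, different integer codes
rho = codes.corr(method="spearman")["price"]
print(rho.round(2))
```

The same data yields a negative coefficient under one coding and a strongly positive one under the other, so correlations involving nominal columns are best read as rough hints only.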

# Splitting the data into training and test set

* Split the dataset in an 80:20 ratio for training and testing

In [None]:
# X = data.iloc[:,[1,2,3,4,6,7]].values
# Y = data.iloc[:,5].values

X = data.loc[:,data.columns!='price']
Y = data.loc[:,data.columns=='price']

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=5)
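
To make the 80:20 split concrete, here is a small sketch on synthetic arrays. It uses the same `test_size` and `random_state` as above; the data itself is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples with two features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=5)

print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

# A fixed random_state makes the shuffle reproducible across runs
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=5)
print(np.array_equal(y_te, y_te2))
```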

# Applying Linear Regression

In [None]:
lr = LinearRegression()
lr.fit(X_train,Y_train)

In [None]:
Y_predicted = lr.predict(X_test)

print("R Squared value ",r2_score(Y_test, Y_predicted))
print("Mean Absolute Error",mean_absolute_error(Y_test, Y_predicted))
print("Root Mean Squared Error ",math.sqrt(mean_squared_error(Y_test, Y_predicted)))
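
The same three metrics are printed after every model, and the third value is the square root of the mean squared error, that is, the root mean squared error (RMSE). A small helper could keep the naming consistent; `regression_report` below is a hypothetical function, shown on made-up numbers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return R squared, MAE and RMSE as a dictionary."""
    y_true = np.ravel(y_true)  # accept column vectors as well as 1-D input
    y_pred = np.ravel(y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    }

# Made-up values for illustration only
report = regression_report([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(report)
```

RMSE is expressed in the same unit as the ticket price, which makes it easier to interpret than the raw mean squared error.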


# Applying Light Gradient Boosting

In [None]:
lg = LGBMRegressor(n_estimators=1000)
lg.fit(X_train,Y_train)

In [None]:
Y_predicted = lg.predict(X_test)

print("R Squared value ",r2_score(Y_test, Y_predicted))
print("Mean Absolute Error",mean_absolute_error(Y_test, Y_predicted))
print("Root Mean Squared Error ",math.sqrt(mean_squared_error(Y_test, Y_predicted)))


# Applying XGBoost

In [None]:
import xgboost as xgb

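# 'reg:squarederror' is the current name of the squared-error objective (formerly 'reg:linear')
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 10)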

xg_reg.fit(X_train,Y_train)

Y_predicted = xg_reg.predict(X_test)



In [None]:

print("R Squared value ",r2_score(Y_test, Y_predicted))
print("Mean Absolute Error",mean_absolute_error(Y_test, Y_predicted))
print("Root Mean Squared Error ",math.sqrt(mean_squared_error(Y_test, Y_predicted)))


# Applying Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(X_train,Y_train)

Y_predicted = dtr.predict(X_test)

In [None]:

print("R Squared value ",r2_score(Y_test, Y_predicted))
print("Mean Absolute Error",mean_absolute_error(Y_test, Y_predicted))
print("Root Mean Squared Error ",math.sqrt(mean_squared_error(Y_test, Y_predicted)))


# Applying SVM

In [None]:
from sklearn.svm import SVR

svr = SVR()


In [None]:
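val = int(X_train.shape[0]/1000)
print(val)
# SVR scales poorly with sample count, so train on a small subsample.
# Separate variable names keep the full training and test sets intact for the later models.
X_train_svr = X_train.iloc[:val,:]
Y_train_svr = Y_train.iloc[:val,:]

In [None]:
# ravel() turns the single-column frame into the 1-D target that SVR expects
svr.fit(X_train_svr,Y_train_svr.values.ravel())


In [None]:

X_test_svr = X_test.iloc[:val,:]
Y_test_svr = Y_test.iloc[:val,:]
Y_predicted = svr.predict(X_test_svr)

In [None]:

print("R Squared value ",r2_score(Y_test_svr, Y_predicted))
print("Mean Absolute Error",mean_absolute_error(Y_test_svr, Y_predicted))
print("Root Mean Squared Error ",math.sqrt(mean_squared_error(Y_test_svr, Y_predicted)))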


# Applying Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
rf.fit(X_train,Y_train)

Y_predicted = rf.predict(X_test)

In [None]:

print("R Squared value ",r2_score(Y_test, Y_predicted))
print("Mean Absolute Error",mean_absolute_error(Y_test, Y_predicted))
print("Root Mean Squared Error ",math.sqrt(mean_squared_error(Y_test, Y_predicted)))
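
Each model section follows the same fit, predict and score pattern, so the results can be collected in one table for comparison. The sketch below does this for two models on synthetic data; the data, the model selection and the `results` table are illustrative assumptions, and the numbers say nothing about the Renfe dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a linear signal plus noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(500, 3))
y_syn = 5.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(0, 1, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=5)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "r2": r2_score(y_te, pred),
        "mae": mean_absolute_error(y_te, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
    })

# One row per model, best (lowest RMSE) first
results = pd.DataFrame(rows).set_index("model").sort_values("rmse")
print(results)
```

Adding another regressor to the comparison only requires a new entry in the `models` dictionary.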
