# Flights Price Analysis 

The goal of this notebook is to analyze a dataset of one-way flights found on Expedia between 2022-04-16 and 2022-10-05 (available at this [link](https://www.kaggle.com/datasets/dilwong/flightprices)).

In [1]:
val bucketname = "unibo-bd2223-paolopenazzi"

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | User | Current session? |
|---|---|---|---|---|---|---|---|
| 1 | application_1677054126516_0002 | spark | idle | Link | Link | | ✔ |


SparkSession available as 'spark'.


bucketname: String = unibo-bd2223-paolopenazzi


## Data Preparation

The following columns are kept from the source file or derived from it:
- flightID: Identifier for the flight.
- searchDate: Date the record was obtained from Expedia.
- searchMonth: Month the record was obtained from Expedia.
- searchDay: Day the record was obtained from Expedia.
- flightDate: The date of the flight.
- flightMonth: The month of the flight.
- flightDay: The day of the flight.
- startingAirport: 3-letter code for the starting airport.
- destinationAirport: 3-letter code for the destination airport.
- duration: Travel duration in minutes.
- isEconomy: Is basic economy?
- isRefundable: Is the ticket refundable?
- isNonStop: Is the flight non-stop?
- baseFare: Price of the ticket (not including taxes).
- totalFare: Price of the ticket, including taxes and fees.
- seatsRemaining: Number of remaining seats.
- travelDistance: The total travel distance in miles.
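
As a minimal sketch of how the derived columns in the list above can be obtained from the raw strings, the snippet below uses `java.time` on plain values, outside Spark. The object and method names (`DerivedColumns`, `monthAndDay`, `durationMinutes`) are hypothetical and not part of the notebook, and it assumes dates are ISO `yyyy-MM-dd` strings and durations are ISO 8601 strings.

```scala
import java.time.{Duration, LocalDate}

// Hypothetical helper, not part of the notebook: derives the extra columns
// from the raw string fields of a single record.
object DerivedColumns {
  // "2022-04-16" -> ("04", "16"), zero-padded two-character strings
  def monthAndDay(isoDate: String): (String, String) = {
    val d = LocalDate.parse(isoDate)
    (f"${d.getMonthValue}%02d", f"${d.getDayOfMonth}%02d")
  }

  // ISO 8601 duration -> whole minutes; each unit (days, hours, minutes)
  // is read explicitly, so a missing unit simply counts as zero.
  def durationMinutes(isoDuration: String): Int =
    Duration.parse(isoDuration).toMinutes.toInt
}
```

For instance, `DerivedColumns.durationMinutes("PT2H29M")` yields 149, and a duration with a day part such as `"P1DT2H"` yields 1560.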

In [2]:
case class FlightData(
    flightID:String,
    searchDate:String,
    searchMonth:String,
    searchDay:String,
    flightDate:String,
    flightMonth:String,
    flightDay:String,
    startingAirport:String,
    destinationAirport:String,
    duration:Int,
    isEconomy:Boolean,
    isRefundable:Boolean,
    isNonStop:Boolean,
    baseFare:Double,
    totalFare:Double,
    seatsRemaining:Int,
    travelDistance:Int
)

object FlightData {

    def parse(line:String) = {
        val input = line.split(",")
        val flightID = input(0)
        val searchDate = input(1)
        val searchMonth = searchDate.substring(5,7)
        val searchDay = searchDate.substring(8,10)
        val flightDate = input(2)
        val flightMonth = flightDate.substring(5,7)
        val flightDay = flightDate.substring(8,10)
        val startingAirport = input(3)
        val destinationAirport = input(4)
        // travelDuration is an ISO 8601 duration (e.g. "PT2H29M", "PT5H", "P1DT2H5M").
        // java.time.Duration reads each unit explicitly, so "PT5H" becomes 300 minutes
        // instead of being mistaken for 5 minutes by a positional split.
        val duration = java.time.Duration.parse(input(6)).toMinutes.toInt
        val isEconomy = input(8).toBoolean
        val isRefundable = input(9).toBoolean
        val isNonStop = input(10).toBoolean
        val baseFare = input(11).toDouble
        val totalFare = input(12).toDouble
        val seatsRemaining = input(13).toInt
        val travelDistance = input(14) match {
            case "" => 0
            case _ => input(14).toInt
        }
        
        new FlightData(flightID,searchDate,searchMonth,searchDay,flightDate,flightMonth,flightDay,startingAirport,
                       destinationAirport,duration,isEconomy,isRefundable,isNonStop,baseFare,totalFare,
                       seatsRemaining,travelDistance)
    }
}

/** Create RDD */
val rdd = sc.textFile("s3a://"+bucketname+"/datasets/itineraries.csv")
/** Extract header from RDD and parse every row */
val header = rdd.first()
val rddFlights = rdd.filter(row => row != header).map(FlightData.parse)
/** Count elements in RDD */
//rddFlights.count() //82138753

defined class FlightData
defined object FlightData
Companions must be defined together; you may wish to use :paste mode for this.
rdd: org.apache.spark.rdd.RDD[String] = s3a://unibo-bd2223-paolopenazzi/datasets/itineraries.csv MapPartitionsRDD[1] at textFile at <console>:29
header: String = legId,searchDate,flightDate,startingAirport,destinationAirport,fareBasisCode,travelDuration,elapsedDays,isBasicEconomy,isRefundable,isNonStop,baseFare,totalFare,seatsRemaining,totalTravelDistance,segmentsDepartureTimeEpochSeconds,segmentsDepartureTimeRaw,segmentsArrivalTimeEpochSeconds,segmentsArrivalTimeRaw,segmentsArrivalAirportCode,segmentsDepartureAirportCode,segmentsAirlineName,segmentsAirlineCode,segmentsEquipmentDescription,segmentsDurationInSeconds,segmentsDistance,segmentsCabinCode
rddFlights: org.apache.spark.rdd.RDD[FlightData] = MapPartitionsRDD[3] at map at <console>:29


#### Visualize data as a DataFrame

In [None]:
import spark.implicits._

val columns = Seq("flightID",
                  "searchDate",
                  "searchMonth",
                  "searchDay",
                  "flightDate",
                  "flightMonth",
                  "flightDay",
                  "startingAirport",
                  "destinationAirport",
                  "duration",
                  "isEconomy",
                  "isRefundable",
                  "isNonStop",
                  "baseFare",
                  "totalFare",
                  "seatsRemaining",
                  "travelDistance")

val flightDataframe = rddFlights.toDF(columns:_*)
flightDataframe.show(60,truncate=40,vertical=true)

#### Keep only records whose searchDate is at least 7 days before the flightDate

In [None]:
import java.time.temporal.ChronoUnit.DAYS
import java.time.LocalDate

/** Number of days between the search date and the flight date of a record. */
def enoughData(x: FlightData): Long = {
    val searchDate = LocalDate.parse(x.searchDate)
    val flightDate = LocalDate.parse(x.flightDate)
    DAYS.between(searchDate, flightDate)
}

val rddFlightsFiltered = rddFlights.filter(x => enoughData(x) >= 7)
// rddFlightsFiltered.count() // 82138753 in RDD, 71823816 in RDD filtered 

In [None]:
rddFlightsFiltered.take(10)

## Data Exploration

In [None]:
val rddFlightsFilteredCached = rddFlightsFiltered.cache()

In [None]:
//"Number of searches performed: " + rddFlightsFilteredCached.count() // 71823816
//"Number of distinct flights: " + rddFlightsFilteredCached.map(x => x.flightID).distinct().count() // 5475345
//"Number of distinct startingAirport: " + rddFlightsFilteredCached.map(x => (x.startingAirport)).distinct().count() //16
//"Number of flights by starting airport: " + rddFlightsFilteredCached.map(x => (x.flightID, x.startingAirport)).distinct().map(x => (x._2, 1)).reduceByKey(_+_)
// Array((LAX,539111), (CLT,358966), (JFK,264186), (BOS,403703), (OAK,284170), (LGA,392553), (ATL,324273), (MIA,310015), (DTW,300891), (PHL,323567), (SFO,467443), (EWR,257195), (ORD,326516), (DEN,300488), (IAD,277081), (DFW,345298))
//"Number of flights by month: " + rddFlightsFilteredCached.map(x => (x.flightID, x.flightMonth)).distinct().map(x => (x._2, 1)).reduceByKey(_+_)
// Array((04,87897), (05,610532), (06,895364), (07,930405), (08,967938), (09,891707), (10,729489), (11,362013))
// "Number of routes: " + rddFlightsFilteredCached.map(x => (x.startingAirport, x.destinationAirport)).distinct().count()
// number of economy and non-economy flights
//"Number of direct flights: " + rddFlightsFilteredCached.map(x => (x.flightID, x.isNonStop)).distinct().filter(x => x._2 == true).count() // 625655

## Jobs

### 1 - Price and seat availability in the 14 days before departure

This job computes the average ticket price for each number of days remaining before departure, limited to searches made at most 14 days before the flight, so that the 1-7 and 8-14 day windows can be compared. Results are computed separately for 'economy' and non-economy tickets.

The correlation between the average price and the average number of seats remaining is then computed for each of the two groups.
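
The two steps described above (a per-key average built from a running `(sum, count)` pair, followed by a Pearson correlation) can be sketched on local collections as follows. This is only an illustration under stated assumptions: `AverageAndCorrelation`, `meanByKey` and `pearson` are hypothetical names, plain `Seq`s stand in for the RDDs, and the notebook itself relies on Spark's `aggregateByKey` and `Statistics.corr`.

```scala
// Illustrative only: local collections stand in for the RDDs.
object AverageAndCorrelation {
  // Per-key mean using a (sum, count) accumulator, the same shape as the
  // zero value (0.0, 0.0) passed to aggregateByKey in the notebook.
  def meanByKey[K](pairs: Seq[(K, Double)]): Map[K, Double] =
    pairs.groupBy(_._1).map { case (k, vs) =>
      val (sum, count) = vs.foldLeft((0.0, 0.0)) { case ((s, c), (_, v)) => (s + v, c + 1) }
      k -> sum / count
    }

  // Pearson correlation coefficient of two equally long, non-empty series.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.nonEmpty, "series must have the same non-zero length")
    val mx = xs.sum / xs.length
    val my = ys.sum / ys.length
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }
}
```

Keeping the sum and the count separate until the end is what lets partial results from different partitions be merged by simple addition, which is why the same accumulator shape works in the distributed version.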

In [5]:
// daysBetween is defined in cell In [4], shown under job 2 below.
val rdd14DaysBefore = rddFlights.
    filter(x => daysBetween(x) <= 14).
    cache()

rdd14DaysBefore: org.apache.spark.rdd.RDD[FlightData] = MapPartitionsRDD[4] at filter at <console>:30


In [6]:
val rddEconomyFlights = rdd14DaysBefore.filter(_.isEconomy)
val rddNonEconomyFlights = rdd14DaysBefore.filter(!_.isEconomy)

rddEconomyFlights: org.apache.spark.rdd.RDD[FlightData] = MapPartitionsRDD[5] at filter at <console>:27
rddNonEconomyFlights: org.apache.spark.rdd.RDD[FlightData] = MapPartitionsRDD[6] at filter at <console>:27


In [7]:
val rddAveragePriceEconomy = rddEconomyFlights.
    map(x => (daysBetween(x), x.baseFare)).
    aggregateByKey((0.0,0.0))((x,y)=>(x._1+y, x._2 + 1) , (x,y)=>(x._1 + y._1, x._2 + y._2)).
    map({case(k,v) => (k, v._1/v._2)})

val rddAveragePriceNonEconomy = rddNonEconomyFlights.
    map(x => (daysBetween(x), x.baseFare)).
    aggregateByKey((0.0,0.0))((x,y)=>(x._1+y, x._2 + 1) , (x,y)=>(x._1 + y._1, x._2 + y._2)).
    map({case(k,v) => (k, v._1/v._2)})

rddAveragePriceEconomy: org.apache.spark.rdd.RDD[(Long, Double)] = MapPartitionsRDD[9] at map at <console>:32
rddAveragePriceNonEconomy: org.apache.spark.rdd.RDD[(Long, Double)] = MapPartitionsRDD[12] at map at <console>:33


In [8]:
val rddAverageSeatsRemainingEconomy = rddEconomyFlights.
    map(x => (daysBetween(x), x.seatsRemaining)).
    aggregateByKey((0.0,0.0))((x,y)=>(x._1+y, x._2 + 1) , (x,y)=>(x._1 + y._1, x._2 + y._2)).
    map({case(k,v) => (k, v._1/v._2)})

val rddAverageSeatsRemainingNonEconomy = rddNonEconomyFlights.
    map(x => (daysBetween(x), x.seatsRemaining)).
    aggregateByKey((0.0,0.0))((x,y)=>(x._1+y, x._2 + 1) , (x,y)=>(x._1 + y._1, x._2 + y._2)).
    map({case(k,v) => (k, v._1/v._2)})

rddAverageSeatsRemainingEconomy: org.apache.spark.rdd.RDD[(Long, Double)] = MapPartitionsRDD[15] at map at <console>:32
rddAverageSeatsRemainingNonEconomy: org.apache.spark.rdd.RDD[(Long, Double)] = MapPartitionsRDD[18] at map at <console>:33


In [16]:
import org.apache.spark.sql.SaveMode

val path_output = "s3a://"+bucketname+"/spark/bdexam"

// Each result goes to its own subfolder (names chosen here for illustration):
// with SaveMode.Overwrite on a shared path, the second write would replace the first.
val rddEconomyResult = rddAverageSeatsRemainingEconomy.join(rddAveragePriceEconomy).
    map({case(k,v) => (k, v._1, v._2)}).
    coalesce(1).toDF().write.format("csv").mode(SaveMode.Overwrite).save(path_output + "/economy")
val rddNonEconomyResult = rddAverageSeatsRemainingNonEconomy.join(rddAveragePriceNonEconomy).
    map({case(k,v) => (k, v._1, v._2)}).
    coalesce(1).toDF().write.format("csv").mode(SaveMode.Overwrite).save(path_output + "/nonEconomy")

import org.apache.spark.sql.SaveMode
path_output: String = s3a://unibo-bd2223-paolopenazzi/spark/bdexam
rddEconomyResult: Unit = ()
rddNonEconomyResult: Unit = ()


In [11]:
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg._

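// rddEconomyResult and rddNonEconomyResult in cell In [16] hold the Unit returned by save,
// so the (seats, price) pairs are joined again here before computing the correlation.
val economyData = rddAverageSeatsRemainingEconomy.join(rddAveragePriceEconomy).
    map({case(days,tuple) => Vectors.dense(tuple._1, tuple._2)}).cache()
val correlMatrix: Matrix = Statistics.corr(economyData)

val nonEconomyData = rddAverageSeatsRemainingNonEconomy.join(rddAveragePriceNonEconomy).
    map({case(days,tuple) => Vectors.dense(tuple._1, tuple._2)}).cache()
val correlMatrix: Matrix = Statistics.corr(nonEconomyData)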

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg._
economyData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[34] at map at <console>:36
correlMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0                  -0.8847960167193927
-0.8847960167193927  1.0
nonEconomyData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[43] at map at <console>:36
correlMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0                  -0.5823660225408362
-0.5823660225408362  1.0


### 2 - Cheapest direct flights by departure airport

We want to identify the cheapest direct flights departing in a given week, grouped by departure airport, by calculating the following values:
- the average price recorded in the 7 days before departure, for direct flights only
- that average divided by the distance traveled by the flight, so that we find the flights that take us as far as possible for the least money
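
The two values listed above can be sketched on a local collection as follows. This is a hedged illustration rather than the notebook's implementation: the `Quote` record and the `PricePerMile.averageByAirport` function are hypothetical, and the notebook performs the equivalent steps with `aggregateByKey` and `join` on RDDs.

```scala
// Hypothetical record type for illustration: one price quote for a direct flight.
final case class Quote(flightID: String, airport: String, baseFare: Double, miles: Int)

object PricePerMile {
  def averageByAirport(quotes: Seq[Quote]): Seq[(String, Double)] = {
    // 1. Average fare per flight, divided by that flight's distance.
    //    Records with an unknown (zero) distance are dropped first.
    val perFlight = quotes.filter(_.miles != 0).groupBy(_.flightID).values.map { qs =>
      val avgFare = qs.map(_.baseFare).sum / qs.size
      (qs.head.airport, avgFare / qs.head.miles)
    }
    // 2. Average those ratios per departure airport, cheapest first.
    perFlight.groupBy(_._1).map { case (airport, rs) =>
      (airport, rs.map(_._2).sum / rs.size)
    }.toSeq.sortBy(_._2)
  }
}
```

Averaging per flight before averaging per airport gives every flight the same weight, regardless of how many times it appeared in the search results.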

In [4]:
import java.time.temporal.ChronoUnit.DAYS
import java.time.LocalDate

def daysBetween(x: FlightData): Long = {
    val searchDate = LocalDate.parse(x.searchDate)
    val flightDate = LocalDate.parse(x.flightDate)
    DAYS.between(searchDate, flightDate)
}

import java.time.temporal.ChronoUnit.DAYS
import java.time.LocalDate
daysBetween: (x: FlightData)Long


In [35]:
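// Keep searches made at most 7 days before departure, matching the
// one-week window in the job description and in the variable name.
val rddData1WeekBeforeDirectFlight = rddFlights.
    filter(x => daysBetween(x) <= 7).
    filter(_.isNonStop).
    filter(_.travelDistance != 0).
    cache()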

rddData1WeekBeforeDirectFlight: org.apache.spark.rdd.RDD[FlightData] = MapPartitionsRDD[148] at filter at <console>:45


In [36]:
val rddAveragePricePerFlight = rddData1WeekBeforeDirectFlight.
    map(x => (x.flightID, x.baseFare)).
    aggregateByKey((0.0,0.0))((x,y)=>(x._1+y, x._2 + 1) , (x,y)=>(x._1 + y._1, x._2 + y._2)).
    map({case(k,v) => (k, v._1/v._2)})

rddAveragePricePerFlight: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[151] at map at <console>:43


In [37]:
val rddAveragePricePerFlightWithDistance = rddData1WeekBeforeDirectFlight.
    map(x => (x.flightID, x.travelDistance)).
    distinct().
    join(rddAveragePricePerFlight)

rddAveragePricePerFlightWithDistance: org.apache.spark.rdd.RDD[(String, (Int, Double))] = MapPartitionsRDD[158] at join at <console>:45


In [38]:
val rddAveragePricePerDistance = rddAveragePricePerFlightWithDistance.
    map({case(x,y) => (x, y._2/y._1)})

rddAveragePricePerDistance: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[159] at map at <console>:41


In [39]:
val rddAveragePricePerAirport = rddData1WeekBeforeDirectFlight.
    map(x => (x.flightID, x.startingAirport)).
    distinct().
    join(rddAveragePricePerDistance)

rddAveragePricePerAirport: org.apache.spark.rdd.RDD[(String, (String, Double))] = MapPartitionsRDD[166] at join at <console>:45


In [45]:
val rddFinal = rddAveragePricePerAirport.
    map({case(x,y) => (y._1, y._2)}).
    aggregateByKey((0.0,0.0))((x,y)=>(x._1+y, x._2 + 1) , (a,b)=>(a._1 + b._1, a._2 + b._2)).
    map({case(k,v) => (v._1/v._2, k)}).
    sortByKey()

rddFinal: org.apache.spark.rdd.RDD[(Double, String)] = ShuffledRDD[181] at sortByKey at <console>:44


In [47]:
// rddFinal holds (average price per mile, airport) pairs, so the column names follow that order.
val df = rddFinal.toDF("AvgPricePerDistance", "Airport")

df: org.apache.spark.sql.DataFrame = [AvgPricePerDistance: double, Airport: string]


In [69]:
import $ivy.`org.vegas-viz::vegas_2.11:0.3.11`


An error was encountered:
<console>:38: error: not found: value $ivy
       import $ivy.`org.vegas-viz::vegas_2.11:0.3.11`
              ^



In [65]:
import `org.vegas-viz:vegas_2.11:0.3.11`
import `org.vegas-viz:vegas-spark_2.11:0.3.11`

import vegas._
import vegas.render.WindowRenderer._
import vegas.sparkExt._

Vegas("Average price/distance per airport")
 .withDataFrame(df)
 .mark(Bar) // Change to .mark(Area)
 .encodeX("Airport", Nom)
 .encodeY("AvgPricePerDistance", Quant)
 .show

An error was encountered:
<console>:2: error: '.' expected but ';' found.
import `org.vegas-viz:vegas-spark_2.11:0.3.11`
^

