## **Airline Data Analysis**

### Aim
    Show a sample of 5 records from the dataset.
    Read the data with data types.
    Make a new column MonthStr, which has months in the form 01, 02, 03, ..., 12.
    Find the number of flights each airline made.
    Find the mean arrival delay per origination airport.
    What is the average departure delay from each airport?

In [5]:
pip install pyspark # install the PySpark package

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=17cc16a485f3f59293e654f20a5f1c9ba6d4f22c6f6ba6cf2d9f1c52cc687708
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [6]:
# Import the required libraries
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

In [7]:
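# Spark configuration: create the SparkContext first, then a SparkSession on top of it
conf = SparkConf().setAppName('airline_conf')
sc = SparkContext(conf=conf)
# getOrCreate() reuses the SparkContext created above
spark = SparkSession.builder.appName('spark_airline').getOrCreate()
# SQLContext takes the SparkContext, not the SparkSession (the class is deprecated since Spark 3.0)
sqlcontext = SQLContext(sc)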



In [8]:
# Load the dataset; no schema is supplied, so every column is read as a string
df = spark.read.csv('/content/Airline_data.csv', header=True)

#### Show a sample of 5 records from the dataset

In [None]:
df.show(5)  # show the first 5 rows

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|1989|    1|        23|        1|   1419|      1230|   1742|      1552|           UA|      183

#### Shape of data

In [9]:
row=df.count()
col=len(df.columns)
print('shape of dataset: (',row,',',col,')')

shape of dataset: ( 426 , 29 )


#### Read the data with data types.

In [10]:
df.printSchema()  # print the dataset schema

root
 |-- Year: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- DayofMonth: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: string (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: string (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: string (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: string (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: string (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: string (nullable = true)
 |-- CarrierDelay:
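
The schema above lists every column as `string`, because `spark.read.csv` was called without a schema. One way to get typed columns at load time is to let Spark infer them. The following is a minimal sketch, not the method used in this notebook: `read_airline_csv` is a hypothetical helper, and `nullValue="NA"` assumes that missing values are written as `NA` in the file, which should be checked against the data.

```python
def read_airline_csv(spark, path):
    """Read the CSV with a header row and let Spark infer the column types."""
    return spark.read.csv(
        path,
        header=True,        # the first line holds the column names
        inferSchema=True,   # Spark scans the data to guess int/double/string
        nullValue="NA",     # assumption: "NA" marks a missing value in this file
    )

# Usage inside the notebook, where `spark` is the existing SparkSession:
# typed_df = read_airline_csv(spark, '/content/Airline_data.csv')
# typed_df.printSchema()
```

Schema inference needs an extra pass over the file, so for large inputs an explicit schema is usually the cheaper choice.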

#### Make a new column MonthStr, which has months in the form 01, 02, 03, ..., 12.

In [12]:
# Concatenate Year, Month and DayofMonth into a date string and parse it with to_date,
# then format the date as a two-digit month number ("MM")
# and store the result in the new column MonthStr
modified_df = df.withColumn(
    "MonthStr",
    f.date_format(f.to_date(f.concat_ws('-', df.Year, df.Month, df.DayofMonth)), "MM"),
)

modified_df.select('MonthStr').show()  # show the first 20 rows

+--------+
|MonthStr|
+--------+
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
|      01|
+--------+
only showing top 20 rows
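
Because the file already has a `Month` column, `MonthStr` can also be derived by zero-padding that value, without building a date first. This is a sketch of that alternative, using the DataFrame's `selectExpr` method with the SQL `lpad` function; `add_month_str` is a hypothetical helper name, and the code assumes `Month` holds values from 1 to 12.

```python
def add_month_str(df):
    """Return df with an extra MonthStr column: Month left-padded with '0' to width 2."""
    # "*" keeps all existing columns; lpad turns "1" into "01" and leaves "12" unchanged.
    return df.selectExpr("*", "lpad(Month, 2, '0') AS MonthStr")

# Usage inside the notebook:
# add_month_str(df).select('Month', 'MonthStr').show(5)
```

This variant does not depend on `Year` and `DayofMonth` forming a valid date, so a row with a bad day value still gets its month string.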



#### Find the number of flights each airline made.

In [13]:
# Group the dataset by UniqueCarrier and count the rows for each carrier,
# then sort by count in descending order and show the first 20 rows
df.groupBy('UniqueCarrier').count().orderBy(f.desc('count')).show()

+-------------+-----+
|UniqueCarrier|count|
+-------------+-----+
|           UA|  426|
+-------------+-----+



#### Feature Update

In [14]:
# Cast the DepDelay and ArrDelay columns from string to integer.
# Build on modified_df (not df) so that the MonthStr column added earlier is kept.
modified_df = modified_df.withColumn("DepDelay", modified_df["DepDelay"].cast(IntegerType()))
modified_df = modified_df.withColumn("ArrDelay", modified_df["ArrDelay"].cast(IntegerType()))

In [15]:
modified_df.printSchema()
# check the data types after the cast

root
 |-- Year: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- DayofMonth: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: string (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: string (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: string (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: string (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: string (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: string (nullable = true)
 |-- CarrierDela
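
The two casts above follow the same pattern, so they can be folded into a loop when more columns need converting. This is a minimal sketch: `cast_to_int` is a hypothetical helper, and it passes the type name `"int"` as a string instead of `IntegerType()` so that it needs no extra import. Under the non-ANSI mode that is the default in Spark 3.x, values that cannot be parsed as integers become null.

```python
def cast_to_int(df, columns):
    """Return df with each named column cast to integer."""
    for name in columns:
        # withColumn with an existing name replaces that column in place.
        df = df.withColumn(name, df[name].cast("int"))
    return df

# Usage inside the notebook:
# modified_df = cast_to_int(modified_df, ["DepDelay", "ArrDelay"])
# modified_df.printSchema()
```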

#### Find the mean arrival delay per origination airport.

In [16]:
# Group the data by Origin and calculate the mean of ArrDelay,
# then rename the mean column and show the first 20 rows
modified_df.groupBy('Origin').mean('ArrDelay').withColumnRenamed('avg(ArrDelay)','avg_ArrDelay').show()

+------+-------------------+
|Origin|       avg_ArrDelay|
+------+-------------------+
|   LIH|0.16666666666666666|
|   HNL|  14.21774193548387|
|   EWR|               9.25|
|   DEN| 20.166666666666668|
|   IAD| 12.966666666666667|
|   SFO| 11.215384615384615|
|   PHL|  6.827586206896552|
|   OGG|  16.24137931034483|
+------+-------------------+



#### What is the average departure delay from each airport?

In [17]:
# Group the data by Origin and calculate the average of DepDelay,
# then rename the average column and show the first 20 rows
modified_df.groupBy('Origin').avg('DepDelay').withColumnRenamed('avg(DepDelay)','avg_DepDelay').show()

+------+-------------------+
|Origin|       avg_DepDelay|
+------+-------------------+
|   LIH|-3.7666666666666666|
|   HNL|  3.217741935483871|
|   EWR|  4.958333333333333|
|   DEN|               27.6|
|   IAD|                8.9|
|   SFO| 19.646153846153847|
|   PHL| 16.137931034482758|
|   OGG|                6.0|
+------+-------------------+
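
Both delay averages use the same grouping, so they can be computed in a single aggregation and compared side by side. This is a sketch of that idea, not part of the original notebook: `mean_delays_by_origin` is a hypothetical helper, and it assumes `DepDelay` and `ArrDelay` have already been cast to integer as in the Feature Update step.

```python
def mean_delays_by_origin(df):
    """Return one row per Origin with the mean arrival and departure delay."""
    return (
        df.groupBy("Origin")
        # The dict form of agg maps a column name to an aggregate function name.
        .agg({"ArrDelay": "avg", "DepDelay": "avg"})
        # Spark names the results avg(<column>); rename them to plain identifiers.
        .withColumnRenamed("avg(ArrDelay)", "avg_ArrDelay")
        .withColumnRenamed("avg(DepDelay)", "avg_DepDelay")
        .orderBy("Origin")
    )

# Usage inside the notebook:
# mean_delays_by_origin(modified_df).show()
```

Sorting by `Origin` makes the output order stable between runs, which the separate queries above do not guarantee.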

