# **Practice: Exploratory Data Analysis (EDA) with Pandas**



In [1]:
# import the libraries
import pandas as pd
import numpy as np

In [2]:
# To display every row and column of a DataFrame without truncation, set the maximum number of displayed columns and rows to `None`
pd.set_option("display.max_columns", None)
pd.set_option('display.max_rows', None)
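Setting these options globally affects every later output in the notebook. As a hedged alternative, the sketch below uses `pd.option_context` to lift the limits only inside a `with` block. The `demo` frame is a hypothetical stand-in, not part of the flights data.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate the scope of the options.
demo = pd.DataFrame({"a": range(100), "b": range(100)})

default_max_rows = pd.get_option("display.max_rows")

# Inside the block the limits are lifted; on exit the previous values return.
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    inside_max_rows = pd.get_option("display.max_rows")
    full_repr = repr(demo)  # rendered without row truncation

restored_max_rows = pd.get_option("display.max_rows")
```

This keeps long outputs untruncated where you need them without changing how the rest of the notebook renders.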

# 1. **Introduction**

### **Exploratory Data Analysis (EDA) on the Flights Dataset**

In this notebook, we will explore a dataset that contains information about flights.  Through various examples, you will learn how to use Python and the Pandas library to understand, clean, and analyze data.

Let's start by loading the dataset. If you are working in Google Colab, upload `flights.csv` to the same Google Drive folder as this notebook, then mount the drive by uncommenting the lines below.


In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

## **Pandas Series**

**Definition:** A Pandas Series object is a one-dimensional labeled array capable of holding data of any type (e.g., integers, strings, floats, Python objects). It is comparable to a single column in an Excel spreadsheet.
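
To make "labeled" concrete, here is a minimal sketch of a Series with explicit index labels. The values and labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical values and labels, for illustration only.
delays = pd.Series(
    [2.0, 4.0, -1.0],
    index=["UA1545", "UA1714", "B6725"],
    name="dep_delay",
)

by_label = delays["UA1714"]   # look up a value by its index label
by_position = delays.iloc[0]  # look up a value by its integer position
```

When no `index` is passed, Pandas assigns the default labels 0, 1, 2, and so on.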


**Important details about `np.nan`**
*  `np.nan` is a special floating-point value that represents "*Not a Number*" in the NumPy library. It is the NumPy equivalent of the IEEE floating-point representation for NaN (Not a Number) and is used to represent undefined or unrepresentable values, especially in datasets.
* Arithmetic operations involving `np.nan` generally propagate it. For instance, 5 + `np.nan` results in `np.nan`.
* In the Pandas library, `np.nan` is used to represent missing data. Methods like `isna()` or `isnull()` can be used to detect such values.
* In data analysis, you will often want to replace `np.nan` values with some other value, and Pandas provides methods like `fillna()` for this purpose.
* In Pandas, you can also drop rows or columns with `np.nan` values using the `dropna()` method.
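
The points above can be sketched on a tiny Series. The name `s_demo` is hypothetical and used only for this illustration.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([np.nan, 1.0, 2.0])

nan_sum = 5 + np.nan            # arithmetic with NaN propagates NaN
nan_equal = (np.nan == np.nan)  # NaN never compares equal, even to itself
missing_mask = s_demo.isna()    # element-wise missing-value detection
filled = s_demo.fillna(0.0)     # replace NaN with a chosen value
dropped = s_demo.dropna()       # remove the entries that are NaN
```

Because `np.nan == np.nan` is `False`, use `isna()` rather than `==` to test for missing values.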


* Create a `pandas` `Series` object of length 5 whose first element is `nan`

In [4]:
s = pd.Series([np.nan,12.03, 42.0, 3.0, 9.05])

* **Return the data in the Series as a [NumPy array](https://numpy.org/doc/stable/user/absolute_beginners.html)**

In [5]:
s.values

array([  nan, 12.03, 42.  ,  3.  ,  9.05])

* **Get summary statistics for the Series (missing values are excluded)**

In [6]:
s.describe()

count     4.000000
mean     16.520000
std      17.397143
min       3.000000
25%       7.537500
50%      10.540000
75%      19.522500
max      42.000000
dtype: float64
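
The `count` of 4 in the summary reflects that `describe()` ignores the `nan` entry. A small sketch of how this skipping behaves, using the same values as the Series above:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 12.03, 42.0, 3.0, 9.05])

n_total = len(s)                      # counts every element, including NaN
n_valid = s.count()                   # counts only non-missing elements
mean_skipna = s.mean()                # NaN is skipped by default (skipna=True)
mean_with_nan = s.mean(skipna=False)  # NaN propagates when skipping is disabled
```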

## **Basic Data Exploration**

* **Create a Pandas Dataframe from `flights.csv`**





In [7]:
df = pd.read_csv('flights.csv', index_col=False)

* **Let's look at what is in the dataset**

In [8]:
# Display the first few rows of the dataframe
df.head()

,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,1/1/2013 5:00
1,1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,1/1/2013 5:00
2,2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,1/1/2013 5:00
3,3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,1/1/2013 5:00
4,4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,1/1/2013 6:00


In [9]:
# Display the first 10 rows of the dataframe
df.head(10)

,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,1/1/2013 5:00
1,1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,1/1/2013 5:00
2,2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,1/1/2013 5:00
3,3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,1/1/2013 5:00
4,4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,1/1/2013 6:00
5,5,2013,1,1,554.0,558,-4.0,740.0,728,12.0,UA,1696,N39463,EWR,ORD,150.0,719,5,58,1/1/2013 5:00
6,6,2013,1,1,555.0,600,-5.0,913.0,854,19.0,B6,507,N516JB,EWR,FLL,158.0,1065,6,0,1/1/2013 6:00
7,7,2013,1,1,557.0,600,-3.0,709.0,723,-14.0,EV,5708,N829AS,LGA,IAD,53.0,229,6,0,1/1/2013 6:00
8,8,2013,1,1,557.0,600,-3.0,838.0,846,-8.0,B6,79,N593JB,JFK,MCO,140.0,944,6,0,1/1/2013 6:00
9,9,2013,1,1,558.0,600,-2.0,753.0,745,8.0,AA,301,N3ALAA,LGA,ORD,138.0,733,6,0,1/1/2013 6:00


In [10]:
# Display the last few rows of the dataframe
df.tail()

,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
995,995,2013,1,2,806.0,810,-4.0,1300.0,1315,-15.0,AA,655,N5FTAA,JFK,STT,193.0,1623,8,10,2/1/2013 8:00
996,996,2013,1,2,807.0,810,-3.0,1133.0,1129,4.0,DL,1271,N322US,JFK,FLL,170.0,1069,8,10,2/1/2013 8:00
997,997,2013,1,2,808.0,810,-2.0,1049.0,1045,4.0,DL,269,N971DL,JFK,ATL,124.0,760,8,10,2/1/2013 8:00
998,998,2013,1,2,808.0,815,-7.0,1020.0,1016,4.0,US,675,N656AW,EWR,CLT,107.0,529,8,15,2/1/2013 8:00
999,999,2013,1,2,809.0,810,-1.0,950.0,948,2.0,B6,1051,N304JB,JFK,PIT,71.0,340,8,10,2/1/2013 8:00


In [11]:
# Check the shape of the dataframe
df.shape

(1000, 20)

In [12]:
# Check the data types of each column
df.dtypes

Unnamed: 0          int64
year                int64
month               int64
day                 int64
dep_time          float64
sched_dep_time      int64
dep_delay         float64
arr_time          float64
sched_arr_time      int64
arr_delay         float64
carrier            object
flight              int64
tailnum            object
origin             object
dest               object
air_time          float64
distance            int64
hour                int64
minute              int64
time_hour          object
dtype: object

In [13]:
# Check for missing values in the dataframe
df.isnull().sum()

Unnamed: 0         0
year               0
month              0
day                0
dep_time           4
sched_dep_time     0
dep_delay          4
arr_time           5
sched_arr_time     0
arr_delay         11
carrier            0
flight             0
tailnum            0
origin             0
dest               0
air_time          11
distance           0
hour               0
minute             0
time_hour          0
dtype: int64
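
Beyond raw counts, it can help to see the share of missing values and the affected rows. A minimal sketch on a hypothetical miniature table (the `toy` frame and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the flights table, for illustration only.
toy = pd.DataFrame({
    "dep_delay": [2.0, np.nan, -1.0, 0.0],
    "arr_delay": [11.0, np.nan, np.nan, -4.0],
    "carrier": ["UA", "AA", "B6", "DL"],
})

missing_share = toy.isna().mean()             # fraction of missing values per column
rows_with_gaps = toy[toy.isna().any(axis=1)]  # rows containing at least one NaN
complete_rows = toy.dropna(subset=["dep_delay", "arr_delay"])  # keep rows with both delays present
```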

In [14]:
# Drop the 'Unnamed: 0' column: it only repeats the row index and is not useful
df.drop('Unnamed: 0', axis=1, inplace=True)
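
An `Unnamed: 0` column typically appears when a CSV was written with its index included. As a hedged sketch, the column can also be absorbed at load time with `index_col=0`. The CSV text below is hypothetical and stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical CSV text that mimics a file written with its index included.
csv_text = ",year,carrier\n0,2013,UA\n1,2013,AA\n"

with_extra = pd.read_csv(io.StringIO(csv_text))             # yields an 'Unnamed: 0' column
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)  # uses that column as the index
dropped_copy = with_extra.drop(columns="Unnamed: 0")        # returns a new frame instead of mutating
```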

### **Slicing**

Slicing in a Pandas DataFrame refers to the practice of selecting specific rows and/or columns from a DataFrame. This operation allows you to extract portions of your dataset, which can be useful for inspection, analysis, or further manipulation.
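
The main selection tools can be sketched on a small frame. The `toy` frame is hypothetical, and its non-default index is chosen so that position-based and label-based selection visibly differ.

```python
import pandas as pd

# Hypothetical frame with a non-default index, for illustration only.
toy = pd.DataFrame(
    {"carrier": ["UA", "AA", "B6", "DL"], "dep_delay": [2.0, 0.0, -1.0, 0.0]},
    index=[10, 20, 30, 40],
)

row_by_position = toy.iloc[1]            # second row, regardless of labels
row_by_label = toy.loc[20]               # row whose index label is 20
first_two_rows = toy.iloc[0:2]           # positional slice, end excluded
label_slice = toy.loc[10:30, "carrier"]  # label slice, end included
on_time = toy[toy["dep_delay"] == 0.0]   # boolean mask keeps matching rows
```

Note that `iloc` slices exclude the end position, while `loc` slices include the end label.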



In [15]:
# Select the row at integer position 300. The result is a Series whose index labels are the column names.
df.iloc[300]

year                        2013
month                          1
day                            1
dep_time                  1157.0
sched_dep_time              1205
dep_delay                   -8.0
arr_time                  1342.0
sched_arr_time              1345
arr_delay                   -3.0
carrier                       MQ
flight                      4431
tailnum                   N723MQ
origin                       LGA
dest                         RDU
air_time                    80.0
distance                     431
hour                          12
minute                         5
time_hour         1/1/2013 12:00
Name: 300, dtype: object

In [16]:
s_300 = df.iloc[300]

In [17]:
# you can index the series using the classic dictionary-like syntax
s_300['carrier']

'MQ'

In [18]:
# Select all the rows where the dep_delay is 0.0
df_no_dep_delays = df[ df['dep_delay'] == 0.0 ]

In [19]:
# Let's see what it looks like by displaying the first few rows
df_no_dep_delays.head()

,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
15,2013,1,1,559.0,559,0.0,702.0,706,-4.0,B6,1806,N708JB,JFK,BOS,44.0,187,5,59,1/1/2013 5:00
17,2013,1,1,600.0,600,0.0,851.0,858,-7.0,B6,371,N595JB,LGA,FLL,152.0,1076,6,0,1/1/2013 6:00
18,2013,1,1,600.0,600,0.0,837.0,825,12.0,MQ,4650,N542MQ,LGA,ATL,134.0,762,6,0,1/1/2013 6:00
24,2013,1,1,607.0,607,0.0,858.0,915,-17.0,UA,1077,N53442,EWR,MIA,157.0,1085,6,7,1/1/2013 6:00
28,2013,1,1,615.0,615,0.0,1039.0,1100,-21.0,B6,709,N794JB,JFK,SJU,182.0,1598,6,15,1/1/2013 6:00


In [20]:
# Select all the rows where the dep_delay = 0.0 AND dep_time = 559.0
df_dep_delay_0_dep_time_559 = df[ (df['dep_delay'] == 0.0) & (df['dep_time'] == 559.0) ]

In [21]:
# Let's see what it looks like by displaying the first few rows
df_dep_delay_0_dep_time_559.head()

,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
15,2013,1,1,559.0,559,0.0,702.0,706,-4.0,B6,1806,N708JB,JFK,BOS,44.0,187,5,59,1/1/2013 5:00


In [22]:
# Select all flights that had no departure delay and whose origin is either JFK or LAX
df_jfk_lax_no_delay = df_no_dep_delays[df_no_dep_delays["origin"].isin(['JFK', 'LAX'])]

In [23]:
df_jfk_lax_no_delay.head()

,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
15,2013,1,1,559.0,559,0.0,702.0,706,-4.0,B6,1806,N708JB,JFK,BOS,44.0,187,5,59,1/1/2013 5:00
28,2013,1,1,615.0,615,0.0,1039.0,1100,-21.0,B6,709,N794JB,JFK,SJU,182.0,1598,6,15,1/1/2013 6:00
54,2013,1,1,655.0,655,0.0,1021.0,1030,-9.0,DL,1415,N3763D,JFK,SLC,294.0,1990,6,55,1/1/2013 6:00
94,2013,1,1,745.0,745,0.0,1135.0,1125,10.0,AA,59,N336AA,JFK,SFO,378.0,2586,7,45,1/1/2013 7:00
111,2013,1,1,805.0,805,0.0,1015.0,1005,10.0,B6,219,N273JB,JFK,CLT,98.0,541,8,5,1/1/2013 8:00
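
The two-step filter above can also be written as a single expression. A minimal sketch on a hypothetical `toy` frame, with `DataFrame.query` shown as one possible alternative:

```python
import pandas as pd

# Hypothetical data, for illustration only.
toy = pd.DataFrame({
    "origin": ["JFK", "LGA", "JFK", "EWR"],
    "dep_delay": [0.0, 0.0, 5.0, 0.0],
})

# Comparisons need parentheses because & binds tighter than ==.
mask = (toy["dep_delay"] == 0.0) & (toy["origin"].isin(["JFK", "LAX"]))
filtered = toy[mask]

# The same filter expressed as a query string.
filtered_query = toy.query("dep_delay == 0.0 and origin in ['JFK', 'LAX']")
```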


## **Groupby and Pivot Tables**


* The `groupby` method in Pandas groups the rows of a DataFrame by the values in one or more columns and then applies a function (such as sum, mean, or count) to each group. It is especially useful for aggregating data in different ways. `groupby` implements the split-apply-combine pattern: split the data into groups, apply a function to each group, and combine the results. See the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html#GroupBy:-Split,-Apply,-Combine) (O'Reilly) for a detailed explanation.

* The `pivot_table` method in Pandas reshapes data between rows and columns, allowing for multi-level indexing (grouping) and complex aggregations. It is a way to summarize and explore data: you choose which columns become the row or column indices and which columns to aggregate. Pivot tables can be thought of as multidimensional groupbys. Read more [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html).
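
Both ideas can be sketched on a hypothetical four-row table: first split-apply-combine written out by hand, then the equivalent `groupby` call, and finally a `pivot_table` that adds a second grouping dimension as columns. The `toy` data is invented for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only.
toy = pd.DataFrame({
    "carrier": ["UA", "UA", "AA", "AA"],
    "origin": ["EWR", "LGA", "JFK", "JFK"],
    "dep_delay": [2.0, 4.0, 2.0, -6.0],
})

# Split-apply-combine by hand: split on carrier, apply mean, combine into a Series.
manual = pd.Series(
    {name: group["dep_delay"].mean() for name, group in toy.groupby("carrier")}
)

# The same result in one groupby expression.
grouped = toy.groupby("carrier")["dep_delay"].mean()

# pivot_table groups by carrier (rows) and origin (columns) at the same time.
pivoted = toy.pivot_table(
    values="dep_delay", index="carrier", columns="origin", aggfunc="mean"
)
```

Carrier and origin combinations with no flights show up as `NaN` in the pivoted result.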

In [24]:
# Using `groupby`, display the average departure delay per carrier
avg_dep_delay_per_carrier = df.groupby('carrier')['dep_delay'].mean()
avg_dep_delay_per_carrier

carrier
9E    15.580645
AA     7.133929
AS    -3.666667
B6     9.808290
DL    -0.544118
EV    30.596899
F9    -8.000000
FL    -3.666667
HA    -3.000000
MQ    20.848837
UA     6.920398
US    -1.418605
VX    -1.142857
WN     4.181818
Name: dep_delay, dtype: float64

In [25]:
# Using groupby, display the total number of flights per carrier
total_flights_per_carrier = df.groupby('carrier').size()
total_flights_per_carrier

carrier
9E     31
AA    114
AS      3
B6    194
DL    136
EV    130
F9      2
FL     12
HA      1
MQ     86
UA    201
US     43
VX     14
WN     33
dtype: int64
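
Note that `size()` counts rows, whereas `count()` counts non-missing values, so the two can differ when a column contains `nan`. A small sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing delay, for illustration only.
toy = pd.DataFrame({
    "carrier": ["UA", "UA", "AA"],
    "dep_delay": [2.0, np.nan, -6.0],
})

rows_per_carrier = toy.groupby("carrier").size()                 # counts rows, NaN included
valid_per_carrier = toy.groupby("carrier")["dep_delay"].count()  # counts non-missing values only
same_as_size = toy["carrier"].value_counts().sort_index()        # another way to count rows per value
```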

In [26]:
# Using pivot_table, display average departure delay per carrier
avg_dep_delay_pivot = df.pivot_table( values='dep_delay', index='carrier', aggfunc='mean')
avg_dep_delay_pivot

carrier,dep_delay
9E,15.580645
AA,7.133929
AS,-3.666667
B6,9.80829
DL,-0.544118
EV,30.596899
F9,-8.0
FL,-3.666667
HA,-3.0
MQ,20.848837


In [27]:
# Using pivot_table, display max and min departure delays per carrier
max_min_delays = df.pivot_table(values='dep_delay', index='carrier', aggfunc=['max', 'min'])
max_min_delays

,max,min
carrier,dep_delay,dep_delay
9E,255.0,-10.0
AA,285.0,-15.0
AS,-1.0,-7.0
B6,156.0,-12.0
DL,105.0,-10.0
EV,379.0,-13.0
F9,-2.0,-14.0
FL,9.0,-11.0
HA,-3.0,-3.0
MQ,853.0,-15.0
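
Passing a list to `aggfunc` produces two-level column labels such as `('max', 'dep_delay')`. The sketch below, on hypothetical data, shows how to select one of those columns and how `groupby(...).agg` gives the same numbers with flat column names.

```python
import pandas as pd

# Hypothetical data, for illustration only.
toy = pd.DataFrame({
    "carrier": ["UA", "UA", "AA", "AA"],
    "dep_delay": [2.0, 4.0, 2.0, -6.0],
})

pivoted = toy.pivot_table(values="dep_delay", index="carrier", aggfunc=["max", "min"])
max_col = pivoted[("max", "dep_delay")]  # select one column by its two-level label

# Equivalent aggregation with single-level column names.
agg = toy.groupby("carrier")["dep_delay"].agg(["max", "min"])
```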
