In [45]:
!pip install plotly
# Import required libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np



In [3]:
airline_data =  pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/airline_data.csv', 
                            encoding = "ISO-8859-1",
                            dtype={'Div1Airport': str, 'Div1TailNum': str, 
                                   'Div2Airport': str, 'Div2TailNum': str})

In [46]:
airline_data.head()

Unnamed: 0.1,Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,...,Div4WheelsOff,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum
0,1295781,1998,2,4,2,4,1998-04-02,AS,19930,AS,...,,,,,,,,,,
1,1125375,2013,2,5,13,1,2013-05-13,EV,20366,EV,...,,,,,,,,,,
2,118824,1993,3,9,25,6,1993-09-25,UA,19977,UA,...,,,,,,,,,,
3,634825,1994,4,11,12,6,1994-11-12,HP,19991,HP,...,,,,,,,,,,
4,1888125,2017,3,8,17,4,2017-08-17,UA,19977,UA,...,,,,,,,,,,


In [5]:
# Shape of the data
airline_data.shape

(27000, 110)

In [22]:
# Randomly sample 500 rows. Fixing random_state at 42 makes the sample reproducible.
data = airline_data.sample(n=500, random_state=42)

In [23]:
data.shape
#data.columns

(500, 110)
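To illustrate why `random_state` is fixed, here is a minimal sketch on a hypothetical toy frame (the `toy` frame and its values are made up and stand in for `airline_data`). Sampling twice with the same seed should return the same rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for airline_data
toy = pd.DataFrame({"Distance": range(100), "DepTime": range(1000, 1100)})

# The same random_state selects the same rows on every call
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)

print(first.equals(second))
print(first.shape)
```

Leaving `random_state` unset would draw a different sample on each run, so the plots further down would change between runs.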

## 1. Scatter Plot


A scatter plot shows the relationship between 2 variables on the x and y-axis. 
The data points here appear scattered when plotted on a two-dimensional plane. Using scatter plots, we can create exciting visualizations to express various relationships, such as:

* Height vs weight of persons
* Engine size vs automobile price
* Exercise time vs body fat

Let us use a scatter plot to show how departure time varies with flight distance.

This plot should contain the following:

* Title as **Distance vs Departure Time**.
* x-axis label should be **Distance**.
* y-axis label should be **DepTime**.
* **Distance** column data from the flight delay dataset should be plotted on the x-axis.
* **DepTime** column data from the flight delay dataset should be plotted on the y-axis.
* Scatter plot markers should be red.

In [48]:
# First, create an empty figure using go.Figure()
fig = go.Figure()
# Next, add a scatter trace with add_trace(), passing it a go.Scatter() object.
# go.Scatter takes the x-axis data, the y-axis data, mode='markers', and a red marker color.
fig.add_trace(go.Scatter(x=data['Distance'], y=data['DepTime'], mode='markers', marker=dict(color='red')))
fig.update_layout(title='Distance vs Departure Time', xaxis_title='Distance', yaxis_title='DepTime')
# Display the figure
fig.show()

Can we rewrite the above code using px.scatter? If yes, how?

In [25]:
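# color_discrete_sequence makes the markers red, matching the go.Figure version above
fig = px.scatter(x=data['Distance'], y=data['DepTime'], title='Distance vs Departure Time',
                 labels=dict(x='Distance', y='DepTime'), color_discrete_sequence=['red'])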
fig.show()

## 2. Line Plot

A line plot shows information that changes continuously with time. Here the data points are connected by straight lines. Like scatter plots, line plots are drawn on a two-dimensional plane. Using line plots, we can create exciting visualizations to illustrate:

  * Annual revenue growth
  * Stock Market analysis over time
  * Product Sales over time


Let us now use a line plot to show the average monthly arrival delay time and see how it changes over the year.

This plot should contain the following:

* Title as **Month vs Average Flight Delay Time**.
* x-axis label should be **Month**
* y-axis label should be **ArrDelay**
* A new dataframe **line_data** should be created with 2 columns: **Month** and the average **ArrDelay** (arrival delay time) for that month.
* **Month** column data from the line_data dataframe should be plotted on the x-axis.
* **ArrDelay** column data from the line_data dataframe should be plotted on the y-axis.
* The plotted line should be green.

In [26]:
line_data = data.groupby('Month')['ArrDelay'].mean().reset_index()

In [16]:
line_data

Unnamed: 0,Month,ArrDelay
0,1,2.232558
1,2,2.6875
2,3,10.868421
3,4,6.229167
4,5,-0.27907
5,6,17.310345
6,7,5.088889
7,8,3.121951
8,9,9.081081
9,10,1.2
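The aggregation behind `line_data` can be sketched on a hypothetical toy frame (values made up for illustration). It shows two details that are easy to miss: `mean()` skips missing values by default, and negative delays pull the average down.

```python
import pandas as pd

# Hypothetical delays: month 1 has a negative value, month 2 has a missing value
toy = pd.DataFrame({
    "Month": [1, 1, 2, 2, 2],
    "ArrDelay": [10.0, -4.0, 3.0, None, 9.0],
})

# One row per month with the mean delay; NaN entries are ignored by mean()
monthly = toy.groupby("Month")["ArrDelay"].mean().reset_index()
print(monthly)
```

Here month 1 averages 10 and -4, and month 2 averages only 3 and 9 because the missing entry is dropped rather than counted as 0.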


In [38]:
fig=go.Figure()
# The line (and its markers) should be green, as required above
fig.add_trace(go.Scatter(x=line_data['Month'], y=line_data['ArrDelay'], mode='lines+markers', line=dict(color='green'), marker=dict(color='green')))
fig.update_layout(title='Month vs Average Flight Delay Time', xaxis_title='Month', yaxis_title='ArrDelay')
fig.show()

## 3. Bar Plot
A bar plot represents categorical data as rectangular bars. Each category is placed on one axis, and the value for that category (for example, a count) is shown on the other axis. Bar charts are generally used to compare values. We can use bar plots to visualize:

 * Pizza delivery time in peak and non peak hours
 * Population comparison by gender
 * Number of views by movie name

Let us use a bar chart to show the number of flights going to each destination state.

This plot should contain the following

* Title as **Total number of flights to the destination state split by reporting air**.
* x-axis label should be **DestState**
* y-axis label should be **Flights**
* Create a new dataframe called **bar_data** which contains 2 columns, **DestState** and **Flights**. Here **Flights** indicates the total number of flights to each destination state.

In [28]:
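# Note: mean() only averages the per-row Flights value (1.0 for every state shown below),
# so it does not give totals; the next cell uses sum() instead.
bar_data = data.groupby(['DestState'])['Flights'].mean().reset_index()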
bar_data

Unnamed: 0,DestState,Flights
0,AK,1.0
1,AL,1.0
2,AZ,1.0
3,CA,1.0
4,CO,1.0
5,CT,1.0
6,FL,1.0
7,GA,1.0
8,HI,1.0
9,IA,1.0


In [50]:
bar_data = data.groupby(['DestState'])['Flights'].sum().reset_index()
fig=px.bar(x=bar_data['DestState'], y=bar_data['Flights'], title='Total number of flights to the destination state split by reporting air', labels=dict(x='DestState', y='Flights'))
fig.show()
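The difference between `mean()` and `sum()` here can be sketched on a hypothetical toy frame. It assumes, as the 1.0 values in the earlier output suggest, that each row records a single flight.

```python
import pandas as pd

# Hypothetical rows: one flight per row, as the Flights column appears to be
toy = pd.DataFrame({
    "DestState": ["CA", "CA", "TX", "CA", "NY"],
    "Flights": [1.0, 1.0, 1.0, 1.0, 1.0],
})

by_mean = toy.groupby("DestState")["Flights"].mean()  # always 1.0 per state
by_sum = toy.groupby("DestState")["Flights"].sum()    # number of flights per state

print(by_mean.to_dict())
print(by_sum.to_dict())
```

Under that assumption, only the sum gives a bar height that reflects how many flights went to each state.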

## 4. Histogram
A histogram is used to represent continuous data in the form of bars. In a bar graph each bar stands for a discrete value, whereas in a histogram each bar represents a range of values. Histograms show frequency distributions. We can use histograms to visualize:
 
 * Distribution of student marks
 * Frequency of customer waiting times in a bank


Let us represent the distribution of arrival delay using a histogram.

This plot should contain the following

* Title as **Total number of flights to the destination state split by reporting air**.
* x-axis label should be **ArrDelay**
* y-axis will show the count of arrival delay

In [51]:
# Treat missing arrival delays as 0 so those rows are still counted in the histogram
data['ArrDelay'] = data['ArrDelay'].fillna(0)
fig=px.histogram(data,x='ArrDelay', title='Total number of flights to the destination state split by reporting air')
fig.show()

**But what if I need the distribution of only the flights that arrived late (i.e., just the positive values)?**

In [15]:
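# Select the delayed rows with a boolean mask; passing the mask itself would plot True/False values
fig = px.histogram(data[data['ArrDelay'] > 0], x='ArrDelay',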
                   nbins=50,
                   title='Distribution of Positive Arrival Delays (minutes)')
fig.show()
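A comparison such as `df['ArrDelay'] > 0` produces a True/False mask, not the delayed rows themselves; indexing the frame with that mask selects the rows. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical arrival delays in minutes (negative = early, 0 = on time)
toy = pd.DataFrame({"ArrDelay": [-5.0, 0.0, 12.0, 30.0, -1.0]})

mask = toy["ArrDelay"] > 0   # boolean Series, one entry per row
delayed = toy[mask]          # only the rows where the mask is True

print(mask.tolist())
print(delayed["ArrDelay"].tolist())
```

The filtered frame `delayed` is what a histogram of positive delays should be built from.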


## 5. Bubble Plot
A bubble plot is used to show the relationship between 3 or more variables. It is an extension of a scatter plot. Bubble plots are ideal for visualizing:

  * Global Economic position of Industries
  * Impact of viruses on Diseases

Let us use a bubble plot to represent the number of flights per reporting airline.

This plot should contain the following:

* Title as **Reporting Airline vs Number of Flights**.
* x-axis label should be **Reporting_Airline**
* y-axis label should be **Flights**
* size of the bubble should be **Flights**, indicating the number of flights
* Set the name of the hover tooltip to `Reporting_Airline` using the `hover_name` parameter.

In [30]:
bub_data = data.groupby('Reporting_Airline')['Flights'].sum().reset_index()
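# size='Flights' scales each bubble by the number of flights; without it this is a plain scatter plot
fig = px.scatter(bub_data, x='Reporting_Airline', y='Flights', size='Flights', hover_name='Reporting_Airline', title='Reporting Airline vs Number of Flights', size_max=60)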
fig.show()

## 6. Pie Plot
A pie plot is a circular chart mainly used to represent the proportion of each part of the data with respect to the whole. Each slice represents a proportion, and the proportions together add up to the whole. We can use pie plots to visualize:
 
 * Sales turnover percentage with respect to different products
 * Monthly expenditure of a family


Let us represent the proportion of distance group by month (month indicated by numbers)

This plot should contain the following

* Title as **Distance group proportion by month**.
* values should be **Month**
* names should be **DistanceGroup**

In [17]:
fig = px.pie(data, values='Month', names='DistanceGroup', title='Distance group proportion by month')
fig.show()

DistanceGroup is a categorical grouping of flight distances into fixed ranges for simplified analysis and reporting.

Flights are grouped by their Great Circle distance (in miles) into predefined buckets. Typically (the exact boundaries in this dataset could differ):

* 1: 1–250 miles
* 2: 251–500 miles
* 3: 501–750 miles
* 4: 751–1000 miles
* 5: 1001–1250 miles

In [31]:
print(data['DistanceGroup'].unique())

[ 1  3  8  2  7  9  4 10  5 11  6]
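As far as the Plotly Express documentation describes it, `px.pie` merges rows that share the same `names` value and sums their `values`. With `values='Month'`, each slice is therefore the sum of month numbers in a distance group, which is not the same as the number of flights. The pandas sketch below, on hypothetical rows, shows how the two quantities can diverge.

```python
import pandas as pd

# Hypothetical flights: group 1 has two late-year flights, group 2 has three early-year flights
toy = pd.DataFrame({
    "DistanceGroup": [1, 1, 2, 2, 2],
    "Month": [12, 11, 1, 1, 2],
})

month_sum = toy.groupby("DistanceGroup")["Month"].sum()   # what values='Month' adds up
flight_count = toy.groupby("DistanceGroup").size()        # number of flights per group

print(month_sum.to_dict())
print(flight_count.to_dict())
```

In this toy data, group 1 has the larger month sum even though group 2 has more flights, so a pie built on summed month numbers should be read with care.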


Now think about how to get a pie chart that shows the **number of flights recorded in each month**.

In [39]:
import plotly.express as px
import pandas as pd
import calendar

# Step 1: Count number of flights per month
month_counts = data['Month'].value_counts().reset_index()
month_counts.columns = ['Month', 'FlightCount']

# Step 2: Convert numeric months to names (optional for better visuals)
month_counts['Month'] = month_counts['Month'].apply(lambda x: calendar.month_name[int(x)])

# Step 3: Create pie chart
fig = px.pie(
    month_counts,
    names='Month',
    values='FlightCount',
    title='Number of Flights Recorded in Each Month'
)

fig.show()






## 7. Sunburst Charts
Sunburst charts represent hierarchical data in the form of concentric circles. The innermost circle is the root node, which defines the parent, and the outer rings move down the hierarchy from the centre. They are also called radial charts. We can use them to plot:

* Worldwide mobile sales, where we can drill down as follows:
    * innermost circle represents total sales
    * first outer circle represents continent-wise sales
    * second outer circle represents country-wise sales within each continent
* Disease outbreak hierarchy
* Real estate industrial chain

In [52]:
ex_data = dict(
    character=["Eve", "Cain", "Seth", "Enos", "Noam", "Abel", "Awan", "Enoch", "Azura"],
    parent=["", "Eve", "Eve", "Seth", "Seth", "Eve", "Eve", "Awan", "Eve" ],
    value=[10, 14, 12, 10, 2, 6, 6, 4, 4])

fig = px.sunburst(
    ex_data,
    names='character',
    parents='parent',
    values='value',
    title="Family chart"
)
fig.show()

Let us represent the hierarchical view in the order of month and destination state, with the number of flights as the value.

This plot should contain the following:

* Define the hierarchy of sectors from root to leaves in the `path` parameter. Here, we go from the `Month` to the `DestStateName` feature.
* Set sector values in the `values` parameter. Here, we can pass in the `Flights` feature.
* Title as **Flight Distribution Hierarchy**.
* Show the figure.

In [53]:
fig = px.sunburst(data, names='DestStateName',parents='Month' ,values='Flights',title='Flight Distribution Hierarchy')
fig.show()

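The call above passes `names` and `parents` instead of `path`. The `parents` column (`Month`) contains numeric month values (1-12), while the `names` column contains only state names (e.g. California, Texas).

Because no entry in `names` matches any entry in `parents`, Plotly has no parent nodes to attach the sectors to and cannot build the hierarchy. Passing the columns through `path`, as the requirements above describe, avoids this.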

In [42]:
data['DestStateName']

5312     Wisconsin
18357      Georgia
6428      Nebraska
15414     Illinois
10610      Indiana
           ...    
18946     Missouri
16291       Nevada
21818     Missouri
24116      Florida
16705      Florida
Name: DestStateName, Length: 500, dtype: object