# Exploratory Data & Business Analysis
The **Data Science Lifecycle** includes steps for performing Exploratory Data Analysis and for evaluating the outputs of machine learning models. Since most data science projects in industry aim to meet specific business objectives, it's important to evaluate the business impact of the models we build.

### Exploratory Data Analysis (EDA)
[Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) is the process of analyzing data to derive insights, often done using visualizations. While this process can be done manually, there are a variety of open source tools which can come in handy. We will explore the usage of [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling) below. 

**Note:** All Python packages required for this course are installed via `conda`. Please refer to [environment.yml](../environment.yml) for a detailed list of packages & versions used in this course.

In [1]:
import pandas as pd
import pandas_profiling
import numpy as np

In [2]:
airlines_df = pd.read_csv("../data/external/allyears2k.csv", encoding="ISO-8859-1", low_memory=False)
target = "IsDepDelayed"
print("Dataset has {} entries and {} features".format(*airlines_df.shape))

# preview data
airlines_df.head()

Dataset has 43978 entries and 31 features


Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,IsArrDelayed,IsDepDelayed
0,1987,10,14,3,741.0,730,912.0,849,PS,1451,...,0,,0,,,,,,YES,YES
1,1987,10,15,4,729.0,730,903.0,849,PS,1451,...,0,,0,,,,,,YES,NO
2,1987,10,17,6,741.0,730,918.0,849,PS,1451,...,0,,0,,,,,,YES,YES
3,1987,10,18,7,729.0,730,847.0,849,PS,1451,...,0,,0,,,,,,NO,NO
4,1987,10,19,1,749.0,730,922.0,849,PS,1451,...,0,,0,,,,,,YES,YES


`pandas-profiling` can automatically generate a report with the following features:

* Essentials: type, unique values, missing values
* Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
* Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* Most frequent values
* Histograms
* Correlations: highlighting of highly correlated variables, Spearman and Pearson matrices

Here's how...

In [3]:
pandas_profiling.ProfileReport(airlines_df)

Dataset info,Value
Number of variables,31
Number of observations,43978
Total Missing (%),20.0%
Total size in memory,10.4 MiB
Average record size in memory,248.0 B

Variable types,Count
Numeric,17
Categorical,7
Boolean,3
Date,0
Text (Unique),0
Rejected,4
Unsupported,0

0,1
Distinct count,417
Unique (%),0.9%
Missing (%),2.7%
Missing (n),1195
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,124.81
Minimum,16
Maximum,475
Zeros (%),0.0%

0,1
Minimum,16
5-th percentile,50
Q1,71
Median,101
Q3,151
95-th percentile,284
Maximum,475
Range,459
Interquartile range,80

0,1
Standard deviation,73.974
Coef of variation,0.59267
Kurtosis,0.97901
Mean,124.81
MAD,57.718
Skewness,1.2929
Sum,5339900
Variance,5472.2
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
70.0,571,1.3%,
75.0,564,1.3%,
65.0,529,1.2%,
73.0,511,1.2%,
77.0,509,1.2%,
68.0,500,1.1%,
67.0,499,1.1%,
69.0,496,1.1%,
72.0,491,1.1%,
76.0,474,1.1%,

Value,Count,Frequency (%),Unnamed: 3
16.0,4,0.0%,
17.0,3,0.0%,
18.0,4,0.0%,
19.0,4,0.0%,
20.0,6,0.0%,

Value,Count,Frequency (%),Unnamed: 3
440.0,1,0.0%,
442.0,1,0.0%,
443.0,1,0.0%,
451.0,1,0.0%,
475.0,1,0.0%,

0,1
Correlation,0.98769

0,1
Distinct count,332
Unique (%),0.8%
Missing (%),2.7%
Missing (n),1195
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,9.3171
Minimum,-63
Maximum,475
Zeros (%),3.4%

0,1
Minimum,-63
5-th percentile,-17
Q1,-6
Median,2
Q3,14
95-th percentile,61
Maximum,475
Range,538
Interquartile range,20

0,1
Standard deviation,29.84
Coef of variation,3.2027
Kurtosis,26.857
Mean,9.3171
MAD,18.087
Skewness,3.9132
Sum,398610
Variance,890.44
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1514,3.4%,
-5.0,1407,3.2%,
-2.0,1321,3.0%,
-1.0,1317,3.0%,
-3.0,1308,3.0%,
-4.0,1299,3.0%,
1.0,1214,2.8%,
3.0,1202,2.7%,
-6.0,1189,2.7%,
-7.0,1170,2.7%,

Value,Count,Frequency (%),Unnamed: 3
-63.0,1,0.0%,
-56.0,1,0.0%,
-55.0,1,0.0%,
-51.0,3,0.0%,
-48.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
399.0,1,0.0%,
410.0,1,0.0%,
460.0,1,0.0%,
470.0,1,0.0%,
475.0,1,0.0%,

0,1
Distinct count,1276
Unique (%),2.9%
Missing (%),2.7%
Missing (n),1195
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1504.6
Minimum,1
Maximum,2400
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,757
Q1,1118
Median,1527
Q3,1917
95-th percentile,2230
Maximum,2400
Range,2399
Interquartile range,799

0,1
Standard deviation,484.35
Coef of variation,0.3219
Kurtosis,-0.56653
Mean,1504.6
MAD,410.32
Skewness,-0.25185
Sum,64373000
Variance,234590
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
1114.0,76,0.2%,
1900.0,76,0.2%,
1850.0,75,0.2%,
1834.0,75,0.2%,
1540.0,67,0.2%,
912.0,67,0.2%,
1446.0,66,0.2%,
1838.0,66,0.2%,
1950.0,66,0.2%,
1124.0,65,0.1%,

Value,Count,Frequency (%),Unnamed: 3
1.0,10,0.0%,
2.0,10,0.0%,
3.0,10,0.0%,
4.0,11,0.0%,
5.0,13,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2356.0,11,0.0%,
2357.0,13,0.0%,
2358.0,10,0.0%,
2359.0,10,0.0%,
2400.0,6,0.0%,

0,1
Distinct count,1041
Unique (%),2.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1485.3
Minimum,0
Maximum,2359
Zeros (%),1.3%

0,1
Minimum,0
5-th percentile,752
Q1,1109
Median,1516
Q3,1903
95-th percentile,2223
Maximum,2359
Range,2359
Interquartile range,794

0,1
Standard deviation,492.75
Coef of variation,0.33175
Kurtosis,-0.29157
Mean,1485.3
MAD,413.73
Skewness,-0.34735
Sum,65320047
Variance,242800
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
0,569,1.3%,
1630,217,0.5%,
1725,190,0.4%,
1810,189,0.4%,
920,183,0.4%,
1555,175,0.4%,
1627,165,0.4%,
1505,162,0.4%,
1215,161,0.4%,
1900,160,0.4%,

Value,Count,Frequency (%),Unnamed: 3
0,569,1.3%,
1,2,0.0%,
5,18,0.0%,
10,12,0.0%,
12,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2352,1,0.0%,
2353,21,0.0%,
2355,34,0.1%,
2357,1,0.0%,
2359,31,0.1%,

0,1
Correlation,0.91498

0,1
Correlation,0.98409

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),99.5%
Missing (n),43757

0,1
B,93
A,81
C,47
(Missing),43757

Value,Count,Frequency (%),Unnamed: 3
B,93,0.2%,
A,81,0.2%,
C,47,0.1%,
(Missing),43757,99.5%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.024694

0,1
0,42892
1,1086

Value,Count,Frequency (%),Unnamed: 3
0,42892,97.5%,
1,1086,2.5%,

0,1
Distinct count,128
Unique (%),0.3%
Missing (%),79.7%
Missing (n),35045
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.0478
Minimum,0
Maximum,369
Zeros (%),16.7%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,22
Maximum,369
Range,369
Interquartile range,0

0,1
Standard deviation,16.206
Coef of variation,4.0036
Kurtosis,137.45
Mean,4.0478
MAD,6.7339
Skewness,9.6258
Sum,36159
Variance,262.63
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,7344,16.7%,
8.0,72,0.2%,
9.0,71,0.2%,
3.0,66,0.2%,
6.0,65,0.1%,
11.0,64,0.1%,
5.0,62,0.1%,
7.0,59,0.1%,
13.0,57,0.1%,
4.0,56,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0.0,7344,16.7%,
1.0,55,0.1%,
2.0,54,0.1%,
3.0,66,0.2%,
4.0,56,0.1%,

Value,Count,Frequency (%),Unnamed: 3
277.0,1,0.0%,
282.0,1,0.0%,
285.0,1,0.0%,
354.0,1,0.0%,
369.0,1,0.0%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.8206
Minimum,1
Maximum,7
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,4
Q3,5
95-th percentile,7
Maximum,7
Range,6
Interquartile range,3

0,1
Standard deviation,1.905
Coef of variation,0.49861
Kurtosis,-1.0826
Mean,3.8206
MAD,1.6113
Skewness,0.16787
Sum,168023
Variance,3.6291
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
4,8109,18.4%,
2,7303,16.6%,
3,7031,16.0%,
1,5802,13.2%,
5,5589,12.7%,
7,5277,12.0%,
6,4867,11.1%,

Value,Count,Frequency (%),Unnamed: 3
1,5802,13.2%,
2,7303,16.6%,
3,7031,16.0%,
4,8109,18.4%,
5,5589,12.7%,

Value,Count,Frequency (%),Unnamed: 3
3,7031,16.0%,
4,8109,18.4%,
5,5589,12.7%,
6,4867,11.1%,
7,5277,12.0%,

0,1
Distinct count,31
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,14.601
Minimum,1
Maximum,31
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,6
Median,14
Q3,23
95-th percentile,30
Maximum,31
Range,30
Interquartile range,17

0,1
Standard deviation,9.1758
Coef of variation,0.62843
Kurtosis,-1.2152
Mean,14.601
MAD,7.9695
Skewness,0.17157
Sum,642126
Variance,84.195
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
3,3206,7.3%,
2,2960,6.7%,
11,2680,6.1%,
12,1800,4.1%,
1,1368,3.1%,
14,1262,2.9%,
19,1255,2.9%,
13,1253,2.8%,
21,1249,2.8%,
18,1247,2.8%,

Value,Count,Frequency (%),Unnamed: 3
1,1368,3.1%,
2,2960,6.7%,
3,3206,7.3%,
4,1215,2.8%,
5,1211,2.8%,

Value,Count,Frequency (%),Unnamed: 3
27,1244,2.8%,
28,1247,2.8%,
29,1219,2.8%,
30,1208,2.7%,
31,1210,2.8%,

0,1
Distinct count,293
Unique (%),0.7%
Missing (%),2.5%
Missing (n),1086
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,10.007
Minimum,-16
Maximum,473
Zeros (%),14.5%

0,1
Minimum,-16
5-th percentile,-6
Q1,-2
Median,1
Q3,10
95-th percentile,56
Maximum,473
Range,489
Interquartile range,12

0,1
Standard deviation,26.439
Coef of variation,2.6419
Kurtosis,36.094
Mean,10.007
MAD,14.993
Skewness,4.812
Sum,429240
Variance,699.01
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,6393,14.5%,
-1.0,3749,8.5%,
-2.0,3083,7.0%,
1.0,2425,5.5%,
-3.0,2228,5.1%,
-4.0,2014,4.6%,
2.0,1719,3.9%,
3.0,1355,3.1%,
-5.0,1139,2.6%,
4.0,1124,2.6%,

Value,Count,Frequency (%),Unnamed: 3
-16.0,3,0.0%,
-15.0,4,0.0%,
-14.0,7,0.0%,
-13.0,14,0.0%,
-12.0,45,0.1%,

Value,Count,Frequency (%),Unnamed: 3
381.0,1,0.0%,
384.0,1,0.0%,
396.0,1,0.0%,
458.0,1,0.0%,
473.0,1,0.0%,

0,1
Distinct count,1144
Unique (%),2.6%
Missing (%),2.5%
Missing (n),1086
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1345.8
Minimum,1
Maximum,2400
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,649
Q1,929
Median,1330
Q3,1735
95-th percentile,2103
Maximum,2400
Range,2399
Interquartile range,806

0,1
Standard deviation,465.34
Coef of variation,0.34576
Kurtosis,-1.1072
Mean,1345.8
MAD,400.83
Skewness,0.080993
Sum,57726000
Variance,216540
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
700.0,235,0.5%,
800.0,155,0.4%,
630.0,137,0.3%,
659.0,135,0.3%,
715.0,126,0.3%,
1250.0,119,0.3%,
1740.0,113,0.3%,
1300.0,112,0.3%,
755.0,110,0.3%,
1420.0,110,0.3%,

Value,Count,Frequency (%),Unnamed: 3
1.0,4,0.0%,
4.0,1,0.0%,
5.0,1,0.0%,
8.0,1,0.0%,
10.0,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2356.0,10,0.0%,
2357.0,5,0.0%,
2358.0,1,0.0%,
2359.0,3,0.0%,
2400.0,4,0.0%,

0,1
Distinct count,134
Unique (%),0.3%
Missing (%),0.0%
Missing (n),0

0,1
PHX,9317
PHL,4482
PIT,3020
Other values (131),27159

Value,Count,Frequency (%),Unnamed: 3
PHX,9317,21.2%,
PHL,4482,10.2%,
PIT,3020,6.9%,
ORD,2103,4.8%,
CLT,1542,3.5%,
DEN,1470,3.3%,
LAX,1440,3.3%,
SFO,1331,3.0%,
BOS,831,1.9%,
OAK,808,1.8%,

0,1
Correlation,0.97816

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0024785

0,1
0,43869
1,109

Value,Count,Frequency (%),Unnamed: 3
0,43869,99.8%,
1,109,0.2%,

0,1
Distinct count,2439
Unique (%),5.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,818.84
Minimum,1
Maximum,3949
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,55
Q1,204
Median,557
Q3,1242
95-th percentile,2602
Maximum,3949
Range,3948
Interquartile range,1038

0,1
Standard deviation,777.4
Coef of variation,0.94939
Kurtosis,0.89335
Mean,818.84
MAD,619.6
Skewness,1.2336
Sum,36011077
Variance,604360
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
55,245,0.6%,
93,210,0.5%,
58,206,0.5%,
76,197,0.4%,
80,155,0.4%,
98,155,0.4%,
39,154,0.4%,
86,154,0.4%,
34,152,0.3%,
159,152,0.3%,

Value,Count,Frequency (%),Unnamed: 3
1,3,0.0%,
2,4,0.0%,
3,5,0.0%,
4,4,0.0%,
5,5,0.0%,

Value,Count,Frequency (%),Unnamed: 3
3943,1,0.0%,
3946,1,0.0%,
3947,2,0.0%,
3948,3,0.0%,
3949,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
YES,24441
NO,19537

Value,Count,Frequency (%),Unnamed: 3
YES,24441,55.6%,
NO,19537,44.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
YES,23091
NO,20887

Value,Count,Frequency (%),Unnamed: 3
YES,23091,52.5%,
NO,20887,47.5%,

0,1
Distinct count,182
Unique (%),0.4%
Missing (%),79.7%
Missing (n),35045
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.6201
Minimum,0
Maximum,373
Zeros (%),16.2%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.0
Q3,0.0
95-th percentile,47.4
Maximum,373.0
Range,373.0
Interquartile range,0.0

0,1
Standard deviation,23.488
Coef of variation,3.0823
Kurtosis,42.69
Mean,7.6201
MAD,12.346
Skewness,5.4563
Sum,68070
Variance,551.67
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,7140,16.2%,
15.0,55,0.1%,
16.0,51,0.1%,
17.0,43,0.1%,
9.0,43,0.1%,
12.0,41,0.1%,
25.0,40,0.1%,
19.0,40,0.1%,
20.0,39,0.1%,
3.0,38,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0.0,7140,16.2%,
1.0,26,0.1%,
2.0,21,0.0%,
3.0,38,0.1%,
4.0,33,0.1%,

Value,Count,Frequency (%),Unnamed: 3
263.0,1,0.0%,
292.0,1,0.0%,
309.0,1,0.0%,
343.0,1,0.0%,
373.0,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,1.4091

0,1
1,41979
10,1999

Value,Count,Frequency (%),Unnamed: 3
1,41979,95.5%,
10,1999,4.5%,

0,1
Distinct count,157
Unique (%),0.4%
Missing (%),79.7%
Missing (n),35045
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.855
Minimum,0
Maximum,323
Zeros (%),16.8%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,28
Maximum,323
Range,323
Interquartile range,0

0,1
Standard deviation,18.62
Coef of variation,3.8352
Kurtosis,71.086
Mean,4.855
MAD,8.1771
Skewness,7.1731
Sum,43370
Variance,346.7
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,7388,16.8%,
2.0,75,0.2%,
1.0,73,0.2%,
4.0,68,0.2%,
6.0,58,0.1%,
15.0,58,0.1%,
8.0,57,0.1%,
3.0,54,0.1%,
5.0,47,0.1%,
16.0,44,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0.0,7388,16.8%,
1.0,73,0.2%,
2.0,75,0.2%,
3.0,54,0.1%,
4.0,68,0.2%,

Value,Count,Frequency (%),Unnamed: 3
269.0,1,0.0%,
285.0,1,0.0%,
293.0,1,0.0%,
307.0,1,0.0%,
323.0,1,0.0%,

0,1
Distinct count,132
Unique (%),0.3%
Missing (%),0.0%
Missing (n),0

0,1
DEN,3558
PIT,3241
ORD,2246
Other values (129),34933

Value,Count,Frequency (%),Unnamed: 3
DEN,3558,8.1%,
PIT,3241,7.4%,
ORD,2246,5.1%,
BUR,2021,4.6%,
CLT,1781,4.0%,
PHL,1632,3.7%,
SFO,1544,3.5%,
LAX,1268,2.9%,
BWI,996,2.3%,
BOS,976,2.2%,

0,1
Distinct count,13
Unique (%),0.0%
Missing (%),79.7%
Missing (n),35045
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.017016
Minimum,0
Maximum,14
Zeros (%),20.3%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,0
Maximum,14
Range,14
Interquartile range,0

0,1
Standard deviation,0.40394
Coef of variation,23.739
Kurtosis,721.18
Mean,0.017016
MAD,0.033959
Skewness,26.096
Sum,152
Variance,0.16317
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,8914,20.3%,
6.0,3,0.0%,
10.0,3,0.0%,
5.0,2,0.0%,
3.0,2,0.0%,
12.0,2,0.0%,
8.0,2,0.0%,
14.0,1,0.0%,
13.0,1,0.0%,
11.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,8914,20.3%,
1.0,1,0.0%,
3.0,2,0.0%,
5.0,2,0.0%,
6.0,3,0.0%,

Value,Count,Frequency (%),Unnamed: 3
10.0,3,0.0%,
11.0,1,0.0%,
12.0,2,0.0%,
13.0,1,0.0%,
14.0,1,0.0%,

0,1
Distinct count,3501
Unique (%),8.0%
Missing (%),36.4%
Missing (n),16024

0,1
UNKNOW,179
000000,124
äNKNOæ,114
Other values (3497),27537
(Missing),16024

Value,Count,Frequency (%),Unnamed: 3
UNKNOW,179,0.4%,
000000,124,0.3%,
äNKNOæ,114,0.3%,
0,66,0.2%,
N912UA,59,0.1%,
N316AW,56,0.1%,
N509DC,55,0.1%,
N922UA,54,0.1%,
N913UA,53,0.1%,
N302AW,51,0.1%,

0,1
Distinct count,68
Unique (%),0.2%
Missing (%),36.4%
Missing (n),16026
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5.3814
Minimum,0
Maximum,128
Zeros (%),1.4%

0,1
Minimum,0
5-th percentile,2
Q1,3
Median,5
Q3,6
95-th percentile,11
Maximum,128
Range,128
Interquartile range,3

0,1
Standard deviation,4.202
Coef of variation,0.78084
Kurtosis,87.754
Mean,5.3814
MAD,2.3734
Skewness,6.196
Sum,150420
Variance,17.657
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,5888,13.4%,
3.0,5297,12.0%,
5.0,4832,11.0%,
6.0,3126,7.1%,
2.0,2021,4.6%,
7.0,1981,4.5%,
8.0,1161,2.6%,
9.0,717,1.6%,
0.0,623,1.4%,
10.0,486,1.1%,

Value,Count,Frequency (%),Unnamed: 3
0.0,623,1.4%,
1.0,90,0.2%,
2.0,2021,4.6%,
3.0,5297,12.0%,
4.0,5888,13.4%,

Value,Count,Frequency (%),Unnamed: 3
79.0,1,0.0%,
87.0,1,0.0%,
88.0,1,0.0%,
116.0,1,0.0%,
128.0,1,0.0%,

0,1
Distinct count,127
Unique (%),0.3%
Missing (%),36.4%
Missing (n),16024
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,14.169
Minimum,0
Maximum,254
Zeros (%),1.3%

0,1
Minimum,0
5-th percentile,6
Q1,9
Median,12
Q3,16
95-th percentile,30
Maximum,254
Range,254
Interquartile range,7

0,1
Standard deviation,9.9051
Coef of variation,0.69909
Kurtosis,46.6
Mean,14.169
MAD,5.9823
Skewness,4.664
Sum,396070
Variance,98.111
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
10.0,2733,6.2%,
9.0,2337,5.3%,
11.0,2335,5.3%,
12.0,2145,4.9%,
8.0,1986,4.5%,
13.0,1889,4.3%,
14.0,1576,3.6%,
7.0,1500,3.4%,
15.0,1331,3.0%,
16.0,1096,2.5%,

Value,Count,Frequency (%),Unnamed: 3
0.0,557,1.3%,
1.0,1,0.0%,
2.0,6,0.0%,
3.0,48,0.1%,
4.0,146,0.3%,

Value,Count,Frequency (%),Unnamed: 3
159.0,1,0.0%,
165.0,1,0.0%,
170.0,1,0.0%,
177.0,1,0.0%,
254.0,1,0.0%,

0,1
Distinct count,10
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
US,18729
UA,9434
WN,6170
Other values (7),9645

Value,Count,Frequency (%),Unnamed: 3
US,18729,42.6%,
UA,9434,21.5%,
WN,6170,14.0%,
HP,3451,7.8%,
PS,3212,7.3%,
DL,935,2.1%,
PI,786,1.8%,
AA,724,1.6%,
TW,329,0.7%,
CO,208,0.5%,

0,1
Distinct count,47
Unique (%),0.1%
Missing (%),79.7%
Missing (n),35045
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.28938
Minimum,0
Maximum,201
Zeros (%),20.1%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,0
Maximum,201
Range,201
Interquartile range,0

0,1
Standard deviation,4.4168
Coef of variation,15.263
Kurtosis,864.92
Mean,0.28938
MAD,0.57273
Skewness,26.234
Sum,2585
Variance,19.508
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,8840,20.1%,
13.0,5,0.0%,
12.0,5,0.0%,
6.0,5,0.0%,
7.0,4,0.0%,
22.0,4,0.0%,
25.0,4,0.0%,
14.0,4,0.0%,
17.0,4,0.0%,
11.0,3,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,8840,20.1%,
1.0,2,0.0%,
3.0,1,0.0%,
4.0,1,0.0%,
5.0,3,0.0%,

Value,Count,Frequency (%),Unnamed: 3
121.0,1,0.0%,
122.0,1,0.0%,
128.0,1,0.0%,
135.0,1,0.0%,
201.0,1,0.0%,

0,1
Distinct count,22
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1997.5
Minimum,1987
Maximum,2008
Zeros (%),0.0%

0,1
Minimum,1987.0
5-th percentile,1988.0
Q1,1992.0
Median,1997.5
Q3,2003.0
95-th percentile,2007.0
Maximum,2008.0
Range,21.0
Interquartile range,11.0

0,1
Standard deviation,6.3444
Coef of variation,0.0031762
Kurtosis,-1.205
Mean,1997.5
MAD,5.5
Skewness,0
Sum,87846055
Variance,40.251
Memory size,343.7 KiB

Value,Count,Frequency (%),Unnamed: 3
2008,1999,4.5%,
2007,1999,4.5%,
1988,1999,4.5%,
1989,1999,4.5%,
1990,1999,4.5%,
1991,1999,4.5%,
1992,1999,4.5%,
1993,1999,4.5%,
1994,1999,4.5%,
1995,1999,4.5%,

Value,Count,Frequency (%),Unnamed: 3
1987,1999,4.5%,
1988,1999,4.5%,
1989,1999,4.5%,
1990,1999,4.5%,
1991,1999,4.5%,

Value,Count,Frequency (%),Unnamed: 3
2004,1999,4.5%,
2005,1999,4.5%,
2006,1999,4.5%,
2007,1999,4.5%,
2008,1999,4.5%,

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,IsArrDelayed,IsDepDelayed
0,1987,10,14,3,741.0,730,912.0,849,PS,1451,,91.0,79.0,,23.0,11.0,SAN,SFO,447.0,,,0,,0,,,,,,YES,YES
1,1987,10,15,4,729.0,730,903.0,849,PS,1451,,94.0,79.0,,14.0,-1.0,SAN,SFO,447.0,,,0,,0,,,,,,YES,NO
2,1987,10,17,6,741.0,730,918.0,849,PS,1451,,97.0,79.0,,29.0,11.0,SAN,SFO,447.0,,,0,,0,,,,,,YES,YES
3,1987,10,18,7,729.0,730,847.0,849,PS,1451,,78.0,79.0,,-2.0,-1.0,SAN,SFO,447.0,,,0,,0,,,,,,NO,NO
4,1987,10,19,1,749.0,730,922.0,849,PS,1451,,93.0,79.0,,33.0,19.0,SAN,SFO,447.0,,,0,,0,,,,,,YES,YES


+ Gather insights using exploratory methods, descriptive & inferential statistics
    + Find median, mode, std dev, min, max, average for each column. Do these make sense in the context of the column?
    + Do values have reasonable lower & upper bounds?
    + Univariate feature distributions (to observe stability & other patterns of a given feature like skew)
    + Feature & target correlations
    + Target analysis (plot of feature vs target)
    + Are there any outliers?
    + Do the column values seem to follow a normal distribution? Uniform? Exponential (i.e. long tail)?
    + Ensure correct data types are inferred
    + Look at the values. Ensure they make sense in the context of each column
    + Look for missing/empty values
    + Determine whether there is a target imbalance, e.g. one class represents less than 15% of the outcomes
    + For categorical fields, what are the unique values in the field?
    + For numeric fields, are all values numbers?
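
Several of the checks above can also be run directly with pandas, without a profiling report. The following is a minimal sketch on a small, made-up DataFrame (`toy_df` and its values are hypothetical stand-ins for `airlines_df`); it illustrates the kind of check, not a complete EDA.

```python
import pandas as pd

# Hypothetical toy frame standing in for airlines_df; all values are made up.
toy_df = pd.DataFrame(
    {
        "Distance": [447.0, 447.0, 1200.0, None, 300.0, 2500.0],
        "UniqueCarrier": ["PS", "PS", "UA", "US", "US", "WN"],
        "IsDepDelayed": ["YES", "NO", "YES", "NO", "YES", "YES"],
    }
)
target = "IsDepDelayed"

# Summary statistics for numeric columns (count, mean, std, min, quartiles, max).
numeric_summary = toy_df.describe()

# Share of missing values per column.
missing_share = toy_df.isna().mean()

# Inferred data types; a numeric field read as a non-numeric type often signals dirty values.
dtypes = toy_df.dtypes

# Unique values of a categorical field.
carriers = sorted(toy_df["UniqueCarrier"].dropna().unique())

# Class balance of the target, expressed as fractions that sum to 1.
class_share = toy_df[target].value_counts(normalize=True)

# Skewness of a numeric feature (positive values indicate a longer right tail).
distance_skew = toy_df["Distance"].skew()

print(numeric_summary)
print(missing_share)
print(dtypes)
print(carriers)
print(class_share)
print(distance_skew)
```

Swapping `toy_df` for `airlines_df` runs the same checks on the course dataset, assuming the column names match.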

If you want to generate an HTML report file, assign the `ProfileReport` to a variable and call its `to_file()` method. You can see an HTML version of the report for our data [here](../data/processed/data_analysis.html).

In [None]:
profile = pandas_profiling.ProfileReport(airlines_df)
profile.to_file(outputfile="../data/processed/data_analysis.html")

When we are building machine learning models, several learners are sensitive to features which are highly correlated. Including such features can make coefficient estimates unstable and the model harder to interpret, and for some learners it may also hurt predictive performance. `pandas-profiling` can automatically retrieve the list of variables which are rejected due to high correlations:

In [None]:
profile = pandas_profiling.ProfileReport(airlines_df)
rejected_variables = profile.get_rejected_variables(threshold=0.9)
rejected_variables

It's no surprise that **Scheduled Elapsed Time** is positively correlated with **Distance**: longer distances take more time to travel, which results in more time spent in the air.
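
The idea of flagging highly correlated features can also be sketched with plain pandas. The helper below (`highly_correlated_pairs`, a hypothetical name) lists numeric column pairs whose absolute Pearson correlation exceeds a threshold. It is an illustration on made-up data, not a reproduction of how `pandas-profiling` decides which variables to reject.

```python
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |Pearson r| > threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Hypothetical toy data: scheduled elapsed time grows almost linearly with distance.
toy_df = pd.DataFrame(
    {
        "Distance": [100.0, 300.0, 450.0, 800.0, 1200.0, 2500.0],
        "CRSElapsedTime": [45.0, 70.0, 80.0, 125.0, 170.0, 320.0],
        "DayOfWeek": [1, 5, 2, 7, 3, 4],
    }
)

pairs = highly_correlated_pairs(toy_df, threshold=0.9)
print(pairs)
```

A common follow-up is to keep only one feature from each flagged pair, choosing the one that is easier to obtain or interpret.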

### Business Analysis
Before a model is released into production, we must understand how the outputs of the machine learning model **impact the business objectives**. One way to assess the impact on the business objective is to define the cost of errors (bad predictions). In the case of a binary classifier, the two types of errors which can result are **Type I (False Positive)** and **Type II (False Negative)**.
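
To make the two error types concrete, here is a minimal sketch that counts each outcome from actual and predicted labels. The function name `count_outcomes` and the labels are hypothetical; `on-time` is treated as the positive class, which matches the way this section describes a true positive.

```python
def count_outcomes(actual, predicted, positive):
    """Count TP, TN, FP and FN for a binary classifier, given the positive label."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual label is also positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical labels, made up for illustration.
actual = ["on-time", "on-time", "delayed", "delayed", "on-time", "delayed"]
predicted = ["on-time", "delayed", "on-time", "delayed", "on-time", "delayed"]

outcomes = count_outcomes(actual, predicted, positive="on-time")
print(outcomes)
```

Which class counts as "positive" is a modeling choice, so it is passed explicitly rather than assumed.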

Consider the following confusion matrix for the airline dataset we've been using in this course...
  
![alt text](../assets/images/Flight_Confusion_Matrix.png)

Let's refer to the test set confusion matrix which resulted from the model development process covered earlier.

```
Confusion Matrix...
[[3768 3178]
 [2009 5558]]
```

```
True Positive (TP, correct prediction): 3,768
Assume an on-time flight results in $700 of revenue per customer.

True Negative (TN, correct prediction): 5,558
Assume a delayed flight results in $700 of revenue per customer.

False Positive (FP, incorrect prediction): 3,178
Assume the inconvenience of a delayed flight, when the customer was notified it would be on time, costs $500 per customer.

False Negative (FN, incorrect prediction): 2,009
Assume the inconvenience of an on-time flight, when the customer was notified it would be delayed, costs $100 per customer.
```

One approach to evaluating the business result of the predictions from this model is using the following formula:


`Business Impact = Revenue_TP * TP + Revenue_TN * TN - Cost_FP * FP - Cost_FN * FN`

In [19]:
business_impact = 700 * 3768 + 700 * 5558 - 3178 * 500 - 2009 * 100
business_impact

4738300

Under these assumptions, the model yields a net positive value of $4,738,300, assuming a single passenger travels on each flight in our test set.
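
The same calculation can be wrapped in a small function so the revenue and cost assumptions are explicit and easy to vary. This is a sketch: `business_impact` and its parameter names are hypothetical, and the dollar values are the ones used in the cell above.

```python
def business_impact(tp, tn, fp, fn, revenue_tp, revenue_tn, cost_fp, cost_fn):
    """Net value of a binary classifier's predictions, given per-outcome revenue and cost."""
    return revenue_tp * tp + revenue_tn * tn - cost_fp * fp - cost_fn * fn


# Counts from the test set confusion matrix; dollar values as used in the cell above.
impact = business_impact(
    tp=3768, tn=5558, fp=3178, fn=2009,
    revenue_tp=700, revenue_tn=700, cost_fp=500, cost_fn=100,
)
print(impact)  # 4738300

# Sensitivity check: how the result moves if a false positive were cheaper or costlier.
for cost_fp in (250, 500, 1000):
    value = business_impact(
        tp=3768, tn=5558, fp=3178, fn=2009,
        revenue_tp=700, revenue_tn=700, cost_fp=cost_fp, cost_fn=100,
    )
    print(cost_fp, value)
```

Varying one assumption at a time like this shows how strongly the conclusion depends on cost estimates that are often rough.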