# Data Summarization

## Basic Operations of Data Manipulation

You have learned several key operations that allow you to solve the vast majority of your data manipulation challenges:

* Pick observations by their values.
* Reorder the rows.
* Pick variables by their names.
* Create new variables with functions of existing variables.
* **Collapse many values down to a single summary.**

These can all be used in conjunction with ``groupby()``, which changes the scope of each function from operating on the entire dataset to operating on it group by group.

In [2]:
import pandas as pd
import numpy as np 

In [3]:
from nycflights13 import flights
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z


## Data frame with columns

- year,month,day
        Date of departure    
- dep_time,arr_time
        Actual departure and arrival times (format HHMM or HMM), local tz.
- sched_dep_time,sched_arr_time
        Scheduled departure and arrival times (format HHMM or HMM), local tz.    
- dep_delay,arr_delay
        Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- hour,minute
        Time of scheduled departure broken into hour and minutes.
- carrier
        Two-letter carrier abbreviation. See airlines() to get the name.
- tailnum
        Plane tail number
- flight
        Flight number
- origin,dest
        Origin and destination. See airports() for additional metadata.
- air_time
        Amount of time spent in the air, in minutes
- distance
        Distance between airports, in miles
- time_hour
        Scheduled date and hour of the flight as a date. Along with origin, can be used to join flights data to weather data.
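
The HHMM / HMM encoding described above stores a clock time as a single number (517 means 5:17). A minimal sketch of how such values could be decoded, using a few made-up values rather than the real `flights` table; the variable names are illustrative only:

```python
import pandas as pd

# Toy values in the HHMM / HMM encoding (not real flight rows).
# None becomes NaN, as it would for a flight with no recorded departure.
dep_time = pd.Series([517.0, 533.0, 1004.0, None])

# Integer-divide by 100 for the hour; the remainder is the minute.
dep_hour = dep_time // 100
dep_minute = dep_time % 100

# Minutes since midnight can be compared and subtracted directly.
minutes_since_midnight = dep_hour * 60 + dep_minute
print(minutes_since_midnight)
```

Because the encoding keeps earlier times numerically smaller, a plain comparison such as `dep_time < 500` still selects departures before 5am without any decoding.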

In [None]:
# Basic descriptive statistics for each column 
flights.describe()

In [None]:
flights.describe(include='all')

In [4]:
# Dimensions of the dataframe
flights.shape

(336776, 19)

In [5]:
# Number of rows in the dataframe
len(flights)

336776

In [6]:
# Number of distinct values in a column.
flights['carrier'].nunique()

16

In [7]:
# Count the number of rows for each unique value of a variable
flights['carrier'].value_counts()

UA    58665
B6    54635
EV    54173
DL    48110
AA    32729
MQ    26397
US    20536
9E    18460
WN    12275
VX     5162
FL     3260
AS      714
F9      685
YV      601
HA      342
OO       32
Name: carrier, dtype: int64

## Summary functions

Pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy) and produce a single value for each object or group. When applied to a DataFrame, the result is typically a pandas Series with one value per column. Examples:
- ``sum()`` Sum values of each object.
- ``count()`` Count non-NA/null values of each object.
- ``median()`` Median value of each object.
- ``quantile([0.25,0.75])`` Quantiles of each object.
- ``min()`` Minimum value in each object.
- ``max()`` Maximum value in each object.
- ``mean()`` Mean value of each object.
- ``var()`` Variance of each object.
- ``std()`` Standard deviation of each object.
- ``apply(function)`` Apply function to each object.

These summary functions can be applied to all the rows in the dataframe. 

In [8]:
# Count the number of flights (rows)
flights['flight'].count()

336776

In [9]:
# Sum up the total distance of all flights. 
flights['distance'].sum()

350217607

In [10]:
# Average/mean of arrival delay
flights['arr_delay'].mean()

6.89537675731489

In [None]:
# Apply a function to multiple columns
flights[['distance','air_time']].max()
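
One detail of the summary functions above is how they treat missing values: `count()` counts only non-NA entries, and `mean()` skips NA by default, so both can differ from what the row count suggests. A small sketch on a made-up frame (the values are illustrative, not taken from `flights`):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing arrival delay (values are made up).
toy = pd.DataFrame({
    "distance": [1400, 1416, 1089, 1576],
    "arr_delay": [11.0, 20.0, np.nan, -18.0],
})

n_rows = len(toy)                     # counts every row, including the NaN row
n_delays = toy["arr_delay"].count()   # counts non-NA values only
mean_delay = toy["arr_delay"].mean()  # NaN is skipped by default

# Applied to the whole DataFrame, a summary returns one value per column.
column_means = toy.mean()

print(n_rows, n_delays, mean_delay)
print(column_means)
```

In the `flights` data, `flight` has no missing values, which is why `flights['flight'].count()` equals `len(flights)`; a column with missing entries would be expected to give a smaller count.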

## Group by

These summary functions are not terribly useful unless we pair them with ``groupby()``. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use a summary function on a grouped data frame, it is automatically applied "by group".
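
As a minimal sketch of that change of scope, the toy frame below (made-up values, not the real `flights` table) computes the same mean once over all rows and once per carrier:

```python
import pandas as pd

# Toy stand-in for the flights table (values are made up).
toy = pd.DataFrame({
    "carrier": ["UA", "UA", "AA", "AA", "DL"],
    "arr_delay": [11.0, 20.0, 33.0, -3.0, -25.0],
})

# Whole dataset: a single number.
overall = toy["arr_delay"].mean()

# Grouped: one number per carrier, indexed by the group key.
by_carrier = toy.groupby("carrier")["arr_delay"].mean()

# reset_index turns the group key back into an ordinary column.
by_carrier_df = by_carrier.reset_index(name="mean_arr_delay")

print(overall)
print(by_carrier_df)
```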

In [11]:
flights.groupby('carrier').size()

carrier
9E    18460
AA    32729
AS      714
B6    54635
DL    48110
EV    54173
F9      685
FL     3260
HA      342
MQ    26397
OO       32
UA    58665
US    20536
VX     5162
WN    12275
YV      601
dtype: int64

In [12]:
flights.groupby(['carrier','flight']).size()

carrier  flight
9E       2900      59
         2901      55
         2902      55
         2903      56
         2904      57
                   ..
YV       3778       3
         3788      23
         3790       9
         3791      15
         3799       1
Length: 5725, dtype: int64

In [13]:
flights.groupby(['year','month','day'])['arr_delay'].mean()

year  month  day
2013  1      1      12.651023
             2      12.692888
             3       5.733333
             4      -1.932819
             5      -1.525802
                      ...    
      12     27     -0.148803
             28     -3.259533
             29     18.763825
             30     10.057712
             31      6.212121
Name: arr_delay, Length: 365, dtype: float64

In [14]:
flights.groupby(['year','month'])['arr_delay'].agg(['mean','std','min','max'])

year,month,mean,std,min,max
2013,1,6.129972,40.423898,-70.0,1272.0
2013,2,5.613019,39.528619,-70.0,834.0
2013,3,5.807577,44.119192,-68.0,915.0
2013,4,11.176063,47.491151,-68.0,931.0
2013,5,3.521509,44.237613,-86.0,875.0
2013,6,16.48133,56.130866,-64.0,1127.0
2013,7,16.711307,57.117088,-66.0,989.0
2013,8,6.040652,42.595142,-68.0,490.0
2013,9,-4.018364,39.710309,-68.0,1007.0
2013,10,-0.167063,32.649858,-61.0,688.0


What if we want to apply different summary functions to different columns? 

In [None]:
flights.groupby('carrier').agg({'flight': 'size',
                                'distance': 'sum', 
                                'arr_delay': ['mean','std'],
                                'hour': lambda x: x.max()-x.min()
                               })

In [None]:
# You can also create a function to include multiple aggregate functions on different columns. 
# In this way, you can give a name for each new column in the resulting dataframe. 
def f(x):
    d = {}
    d['flight_count'] = x['flight'].count()
    d['total_distance'] = x['distance'].sum()
    d['arr_delay_mean'] = x['arr_delay'].mean()
    d['arr_delay_std'] = x['arr_delay'].std()
    d['hour_range'] = x['hour'].max() - x['hour'].min()
    return pd.Series(d)

flights.groupby('carrier').apply(f)
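
Another way to name each output column, without writing a custom function, is pandas' named aggregation syntax, where each keyword argument to `agg` is `new_name=(column, function)`. A sketch on a made-up frame (the values and the output column names are illustrative):

```python
import pandas as pd

# Toy stand-in for the flights table (values are made up).
toy = pd.DataFrame({
    "carrier": ["UA", "UA", "AA", "AA", "DL"],
    "flight": [1545, 1714, 1141, 725, 461],
    "distance": [1400, 1416, 1089, 1576, 762],
    "arr_delay": [11.0, 20.0, 33.0, -18.0, -25.0],
    "hour": [5, 7, 5, 5, 6],
})

# Each keyword becomes a column name in the result.
summary = toy.groupby("carrier").agg(
    flight_count=("flight", "size"),
    total_distance=("distance", "sum"),
    arr_delay_mean=("arr_delay", "mean"),
    hour_range=("hour", lambda x: x.max() - x.min()),
)

print(summary)
```

The result has flat, named columns, whereas the dictionary form with a list of functions produces two-level column labels.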

## Combining multiple operations

Now let's combine the operations we've learned.

### Q1. How many flights left before 5am on each day? (These usually indicate delayed flights from the previous day)

In [20]:
# Write the code below: 
flights[flights.dep_time<500].groupby(['month','day']).size()

month  day
1      2      3
       3      4
       4      3
       5      3
       6      2
             ..
12     27     7
       28     2
       29     3
       30     6
       31     4
Length: 348, dtype: int64

### Q2. For each destination from any of the three airports in NYC, explore the relationship between the distance and average delay. 

Follow the three steps:
- Group flights by destination and summarise to compute number of flights, average distance, and average arrival delay.
- Filter to remove noisy points and Honolulu airport (HNL), which is almost twice as far away as the next closest airport.
- Sort all rows by arrival delay.

In [35]:
# Your code step by step: 
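# Note: Newark is coded as 'EWR' in this dataset (see flights.head() above), so 'NWK' matches
# no rows and the result below covers JFK and LGA departures only. Use 'EWR' to include Newark.
data=flights[(flights.origin.isin(['JFK','LGA','NWK']))& (flights.dest != 'HNL')]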
data=data.groupby('dest').agg ({
    'flight':'count',
    'distance':'mean',
    'arr_delay':'mean'
})
data.sort_values('arr_delay')

dest,flight,distance,arr_delay
LEX,1,604.000000,-22.000000
PSP,19,2378.000000,-12.722222
AVL,10,599.000000,-12.100000
SAV,68,722.000000,-10.514706
STT,333,1623.000000,-6.372727
...,...,...,...
BHM,297,865.996633,16.877323
MHT,142,195.000000,17.527778
EGE,103,1746.572816,18.534653
CAE,12,617.000000,19.666667


In [None]:
# Put everything in one line of code: 


### Q3. Find the planes (identified by the tail number) that have the highest average arrival delays.

In [None]:
# First, find the flights that are not cancelled. Save them in a separate dataframe named "not_cancelled"
# Assumption: flights that are not cancelled should have values for dep_delay and arr_delay. 


In [None]:
# For each plane, find the average arrival delays and the flight count. 
# Save the results in a dataframe named "delays"


In [None]:
# Explore the "delays" dataframe. 
# Try to create a scatter plot to show 'flight' vs. 'arr_delay'
# https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.DataFrame.plot.scatter.html


**What is the range of average arrival delay for these planes?**

**Any outliers?**
 

## Operations within groups

Grouping is most useful in conjunction with aggregate functions. But you can also do other operations within groups:

In [None]:
# Find the worst members of each group:

# Find the three flights with the longest arr_delay on each day.
flights['rank_daily_delay'] = flights.groupby(['year', 'month', 'day'])['arr_delay'].rank(method='min',ascending=False)
flights[flights.rank_daily_delay<=3]

In [None]:
# Find all groups bigger than a threshold:

# Find all flights to popular destinations, i.e. destinations that appear in more than 1000 flights.
flights.groupby('dest').filter(lambda x: x['dest'].count()>1000)

In [None]:
# Standardise to compute per group metrics:

# For all flights that arrived later than scheduled,
# calculate each flight's share of the total arrival delay among delayed flights to the same destination.
# Display year, month, day, destination, flight, arr_delay, and the proportion of arr_delay.
flights['prop_delay'] = flights[flights.arr_delay>0].groupby('dest')['arr_delay'].transform(lambda x: x / x.sum())
flights[flights.arr_delay>0][['year', 'month', 'day', 'dest', 'flight', 'arr_delay', 'prop_delay']]    
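
The three cells above differ mainly in the shape of what they return. A minimal sketch on a made-up frame (values are illustrative, not from `flights`): an aggregation gives one value per group, `transform` and `rank` give one value per original row, and `filter` keeps or drops whole groups.

```python
import pandas as pd

# Toy frame of delayed flights (values are made up).
toy = pd.DataFrame({
    "dest": ["IAH", "IAH", "MIA", "MIA", "MIA"],
    "arr_delay": [10.0, 30.0, 5.0, 15.0, 20.0],
})
grouped = toy.groupby("dest")["arr_delay"]

# Aggregation: one value per group.
totals = grouped.sum()

# transform: one value per original row, aligned on the original index,
# so it can be assigned straight back as a new column.
toy["prop_delay"] = grouped.transform(lambda x: x / x.sum())

# rank: also one value per row; rank 1 is the longest delay in the group.
toy["rank_in_dest"] = grouped.rank(method="min", ascending=False)

# filter: keeps every row of the groups that pass the test.
big = toy.groupby("dest").filter(lambda g: len(g) > 2)

print(totals)
print(toy)
print(big)
```

Because `transform` results are aligned on the index, assigning the result of a transform computed on a subset of rows (as in the `prop_delay` cell above) leaves `NaN` in the rows that were not part of the subset.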