## Top 10 arrival airports in the world in 2013 (using the bookings file)

The arrival airport is in the column arr_port, which holds the airport's IATA code.

To get the total number of passengers for an airport, you can sum the "pax" column, grouping by arr_port.

Note that some pax values are negative; these correspond to cancellations. To get the number of passengers who actually booked, sum the column with the negatives included, so that canceled bookings net out.
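
To make the netting concrete, here is a minimal sketch on a made-up table (the rows are invented for illustration, they are not taken from the bookings file). A booking of 2 passengers to LHR is later canceled with a row of -2, and the plain sum removes it:

```python
import pandas as pd

# Toy bookings (invented values): the third row cancels the first one.
bookings = pd.DataFrame({
    "arr_port": ["LHR", "CDG", "LHR", "LHR"],
    "pax": [2, 1, -2, 3],
})

# Summing with the negatives included nets out the cancellation.
net_pax = bookings.groupby("arr_port")["pax"].sum()
print(net_pax)
```

Filtering out the negative rows first would overcount, because the original booking row would still be in the sum.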

Print the top 10 arrival airports in the standard output, including the number of passengers.

Bonus point: Get the name of the city or airport corresponding to that airport code programmatically (we suggest having a look at GeoBases on GitHub).

Bonus point: Solve this problem using pandas (instead of any other approach)


Suggestion: follow the plan of action below:

* Get familiar with the data
* Select columns of interest
* Decide what to do with NaNs

* Make processing plan
* Develop code that works with a sample

* Adjust the code to work with Big data
* Test big data approach on a sample

* Run program with big data


## 1) Get familiar with data

In [2]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

### What if we don't want to read the whole file?

Options:

* prepare the sample

* read_csv with nrows option

In [3]:
!bzcat ../../../data/Challenge/bookings.csv.bz2 | head -n 100000 >  bookings.sample.csv


bzcat: I/O or other error, bailing out.  Possible reason follows.
bzcat: Broken pipe
	Input file = ../../../data/Challenge/bookings.csv.bz2, output file = (stdout)


In [5]:
!ls -la bookings.sample.csv

-rw-rw-r-- 1 dani dani 42445466 Jan 26 21:03 bookings.sample.csv


In [11]:
sample_df = pd.read_csv('../../../data/Challenge/bookings.csv.bz2', nrows=100000, sep='^')

In [12]:
sample_df.head()

Unnamed: 0,act_date,source,pos_ctry,pos_iata,pos_oid,rloc,cre_date,duration,distance,dep_port,...,route,carrier,bkg_class,cab_class,brd_time,off_time,pax,year,month,oid
0,2013-03-05 00:00:00,1A,DE,a68dd7ae953c8acfb187a1af2dcbe123,1a11ae49fcbf545fd2afc1a24d88d2b7,ea65900e72d71f4626378e2ebd298267,2013-02-22 00:00:00,1708,0,ZRH,...,LHRZRH,VI,T,Y,2013-03-07 08:50:00,2013-03-07 11:33:37,-1,2013,3,
1,2013-03-26 00:00:00,1A,US,e612b9eeeee6f17f42d9b0d3b79e75ca,7437560d8f276d6d05eeb806d9e7edee,737295a86982c941f1c2da9a46a14043,2013-03-26 00:00:00,135270,0,SAL,...,SALATLCLT,NV,L,Y,2013-04-12 13:04:00,2013-04-12 22:05:40,1,2013,3,
2,2013-03-26 00:00:00,1A,US,e612b9eeeee6f17f42d9b0d3b79e75ca,7437560d8f276d6d05eeb806d9e7edee,737295a86982c941f1c2da9a46a14043,2013-03-26 00:00:00,135270,0,SAL,...,CLTATLSAL,NV,U,Y,2013-07-15 07:00:00,2013-07-15 11:34:51,1,2013,3,
3,2013-03-26 00:00:00,1A,AU,0f984b3bb6bd06661c95529bbd6193bc,36472c6dbaf7afec9136ac40364e2794,5ecf00fdcbcec761c43dc7285253d0c1,2013-03-26 00:00:00,30885,0,AKL,...,AKLHKGSVO,XK,G,Y,2013-04-24 23:59:00,2013-04-25 16:06:31,1,2013,3,SYDA82546
4,2013-03-26 00:00:00,1A,AU,0f984b3bb6bd06661c95529bbd6193bc,36472c6dbaf7afec9136ac40364e2794,5ecf00fdcbcec761c43dc7285253d0c1,2013-03-26 00:00:00,30885,0,AKL,...,SVOHKGAKL,XK,G,Y,2013-05-14 20:15:00,2013-05-16 10:44:50,1,2013,3,SYDA82546


Clean the column names by stripping surrounding whitespace:

In [17]:
sample_df.columns = sample_df.columns.str.strip()
sample_df.columns

Index(['act_date', 'source', 'pos_ctry', 'pos_iata', 'pos_oid', 'rloc',
       'cre_date', 'duration', 'distance', 'dep_port', 'dep_city', 'dep_ctry',
       'arr_port', 'arr_city', 'arr_ctry', 'lst_port', 'lst_city', 'lst_ctry',
       'brd_port', 'brd_city', 'brd_ctry', 'off_port', 'off_city', 'off_ctry',
       'mkt_port', 'mkt_city', 'mkt_ctry', 'intl', 'route', 'carrier',
       'bkg_class', 'cab_class', 'brd_time', 'off_time', 'pax', 'year',
       'month', 'oid'],
      dtype='object')
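
The cleaning step suggests the raw header carries padding around the names. A small sketch of what that looks like, using an invented two-line file held in memory (the exact padding in the real file is an assumption):

```python
import io
import pandas as pd

# Hypothetical miniature of the file: '^' separator, padded header names.
raw = (
    "arr_port  ^off_time            ^pax\n"
    "LHR^2013-03-07 11:33:37^-1\n"
)
toy_df = pd.read_csv(io.StringIO(raw), sep="^")

# Before cleaning, the padding is part of each column name.
padded_names = list(toy_df.columns)
print(padded_names)

# After cleaning, columns can be selected by their plain names.
toy_df.columns = toy_df.columns.str.strip()
print(list(toy_df.columns))
```

Without the strip, `toy_df['arr_port']` would raise a `KeyError`, because the stored name still includes the trailing spaces.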

## 2) Select the columns of interest 

In [26]:
sample_df['arr_port']
sample_df['off_time'].iloc[0][:4]

'2013'

In [28]:
sample_cols = sample_df[['arr_port', 'off_time', 'pax']]
sample_cols.head()

Unnamed: 0,arr_port,off_time,pax
0,LHR,2013-03-07 11:33:37,-1
1,CLT,2013-04-12 22:05:40,1
2,CLT,2013-07-15 11:34:51,1
3,SVO,2013-04-25 16:06:31,1
4,SVO,2013-05-16 10:44:50,1


## 3) What to do with NaN?



In [29]:
sample_cols.isnull().sum()

arr_port    0
off_time    0
pax         0
dtype: int64

In [35]:
nulls_per_airport = sample_cols.groupby('arr_port').apply(lambda df: df.isnull().sum())
nulls_per_airport.head()

arr_port (index),arr_port,off_time,pax
AAE,0,0,0
AAL,0,0,0
AAQ,0,0,0
AAR,0,0,0
ABE,0,0,0


In [44]:
out = []

for key, group in sample_cols.groupby('arr_port'):
    group.name = key
    out.append(group.isnull().sum())

pd.concat(out, axis=1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1496,1497,1498,1499,1500,1501,1502,1503,1504,1505
arr_port,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
off_time,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
pax,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The sample contains no NaNs in these columns, but the full file might, so the code should be prepared for them.
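
As a sketch of two possible policies, the toy frame below (invented values) has a missing airport, a missing time and a missing pax. One option is to drop incomplete rows explicitly; the other is to rely on pandas defaults, where `groupby` skips NaN keys and `sum` skips NaN values:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column of interest.
toy = pd.DataFrame({
    "arr_port": ["LHR", "LHR", None, "CDG"],
    "off_time": ["2013-03-07 11:33:37", np.nan,
                 "2013-04-12 22:05:40", "2013-05-16 10:44:50"],
    "pax": [1.0, 2.0, 1.0, np.nan],
})

print(toy.isnull().sum())

# Option 1: drop every row that is incomplete in these columns.
clean = toy.dropna(subset=["arr_port", "off_time", "pax"])

# Option 2: keep the defaults. Rows with a NaN key are left out of the
# groups, and NaN pax values are ignored by sum().
default_totals = toy.groupby("arr_port")["pax"].sum()
print(default_totals)
```

Which policy is right depends on what a missing value means in the data; this sketch only shows how each one behaves.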

## 4) Make processing plan
1) get only the bookings from 2013

2) group by arr_port, sum

3) sort 

4) get top 10
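
The four steps of the plan can be sketched as one small function. The name `top_arrivals` and the toy rows are hypothetical and only serve the illustration:

```python
import pandas as pd

def top_arrivals(df, year="2013", n=10):
    """Hypothetical helper: filter by year, sum pax per airport, sort, keep top n."""
    in_year = df["off_time"].str[:4] == year                # 1) bookings from the year
    totals = df[in_year].groupby("arr_port")["pax"].sum()   # 2) group by arr_port, sum
    return totals.sort_values(ascending=False).head(n)      # 3) sort, 4) top n

# Invented rows: one JFK booking falls in 2014 and must be excluded.
toy = pd.DataFrame({
    "arr_port": ["LHR", "CDG", "LHR", "JFK", "CDG"],
    "off_time": ["2013-03-07 11:33:37", "2013-04-12 22:05:40",
                 "2013-07-15 11:34:51", "2014-01-02 09:00:00",
                 "2013-05-16 10:44:50"],
    "pax": [2, 1, -1, 5, 3],
})

top2 = top_arrivals(toy, n=2)
print(top2)
```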

#### 4.1) Get only the bookings from 2013

In [50]:
arrivals_2013 = sample_cols['off_time'].str[:4] == '2013'
sample_cols = sample_cols[arrivals_2013]
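
The string slice works because off_time is read as text. An alternative sketch, assuming the column always uses the timestamp layout seen in the sample, parses it with `pd.to_datetime` and turns unparseable or missing values into NaT instead of failing:

```python
import pandas as pd

# Invented rows: one 2013 arrival, one 2014 arrival, one missing time.
toy = pd.DataFrame({
    "arr_port": ["LHR", "CDG", "JFK"],
    "off_time": ["2013-03-07 11:33:37", "2014-01-02 09:00:00", None],
    "pax": [1, 2, 3],
})

# errors="coerce" maps anything that cannot be parsed to NaT.
off_time = pd.to_datetime(toy["off_time"], errors="coerce")

# NaT rows compare as False, so they are dropped by the filter.
toy_2013 = toy[off_time.dt.year == 2013]
print(toy_2013)
```

Parsing is slower than slicing strings, so on the full file the slice may remain the more practical choice.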

#### 4.2) group by arr_port, sum

In [53]:
pax_per_airport = sample_cols.groupby('arr_port')['pax'].sum()

#### 4.3, 4.4) Sort, get top 10

In [54]:
pax_per_airport.sort_values(ascending=False).head(10)

arr_port
LHR         1006
MCO          838
JFK          792
LAX          758
BKK          740
LAS          732
SFO          698
ORD          686
CDG          673
DXB          588
Name: pax, dtype: int64

## 5) Adjust the code to work with Big data


Hint: check out https://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

In [55]:
chunks = pd.read_csv('../../../data/Challenge/bookings.csv.bz2', chunksize=100000, sep='^')
chunks

<pandas.io.parsers.TextFileReader at 0x7f0fa1dc02e8>

### We have to read the whole file, but nrows only ever reads the first N rows
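
A minimal sketch of how a chunked reader behaves, using an invented in-memory file instead of the real one. The `usecols` callable is an optional extra shown here as an assumption about what may help: it keeps only the columns of interest even when the header names are padded.

```python
import io
import pandas as pd

# Hypothetical miniature of the bookings file with a padded header.
raw = (
    "arr_port  ^off_time           ^pax ^carrier\n"
    "LHR^2013-03-07 11:33:37^-1^VI\n"
    "CLT^2013-04-12 22:05:40^1^NV\n"
    "CLT^2013-07-15 11:34:51^1^NV\n"
    "SVO^2013-04-25 16:06:31^1^XK\n"
    "SVO^2013-05-16 10:44:50^1^XK\n"
)
wanted = {"arr_port", "off_time", "pax"}

# With chunksize set, read_csv returns an iterator of DataFrames.
reader = pd.read_csv(
    io.StringIO(raw), sep="^", chunksize=2,
    usecols=lambda name: name.strip() in wanted,
)

chunk_sizes = []
for chunk in reader:
    chunk.columns = chunk.columns.str.strip()
    chunk_sizes.append(chunk.shape)

print(chunk_sizes)
```

Unlike `nrows`, the iterator walks through the entire file, a fixed number of rows at a time, so memory use stays bounded by the chunk size.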


#### Now we need to put together the results from all the chunks

Options:

* df.append()

* pd.concat()
    
    
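df.append() is not recommended when many DataFrames have to be concatenated: each call copies all the data accumulated so far, so it can be a lot slower than a single pd.concat(). It was also deprecated in pandas 1.4 and removed in pandas 2.0.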

When adapting our code to use chunks, we'll process each chunk as we did the sample. After that, we'll need a way to merge the results from all chunks. Here, that means concatenating the partial sums and then applying `groupby` and `sum` once more, because the same airport can appear in several chunks.
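
A toy check of this merge step, on invented data: slices of two rows play the role of chunks, and the combined partial sums match the sum computed in one pass.

```python
import pandas as pd

toy = pd.DataFrame({
    "arr_port": ["LHR", "CDG", "LHR", "JFK", "CDG", "LHR"],
    "pax": [2, 1, -1, 5, 3, 4],
})

# Pretend each slice of two rows is one chunk of the big file.
partials = [
    toy.iloc[start:start + 2].groupby("arr_port")["pax"].sum()
    for start in range(0, len(toy), 2)
]

# The same airport appears in several partial results, so group again.
combined = pd.concat(partials).groupby(level=0).sum()

# Reference: the same totals computed on the whole table at once.
direct = toy.groupby("arr_port")["pax"].sum()

print(combined)
```

This works because a sum of sums equals the overall sum. The same trick would not apply directly to statistics such as the median.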

In [73]:
chunks = pd.read_csv('../../../data/Challenge/bookings.csv.bz2', chunksize=1000000, sep='^')


partial_results = []
chunk_num = 0

for chunk in chunks:
    
    chunk.columns = chunk.columns.str.strip()
    chunk = chunk[['arr_port', 'off_time', 'pax']]
    chunk = chunk[chunk['off_time'].str[:4] == '2013']
    
    partial_result = chunk.groupby('arr_port')['pax'].sum()
    
    partial_results.append(partial_result)
    
    print(chunk_num)
    chunk_num += 1


0
1
2
3
4




5
6
7
8
9
10


In [86]:
s = pd.concat(partial_results)
s.shape

(22615,)

In [87]:
s.groupby('arr_port').sum().sort_values(ascending=False).shape

(2261,)
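
Instead of storing every partial result and concatenating at the end, one could keep a single running total. The sketch below uses three invented partial results; `Series.add` with `fill_value=0` treats an airport missing from one side as zero:

```python
import pandas as pd

# Invented partial results, one per chunk.
partials = [
    pd.Series({"LHR": 3, "CDG": 1}),
    pd.Series({"LHR": 2, "JFK": 5}),
    pd.Series({"CDG": 4}),
]

running = None
for partial in partials:
    if running is None:
        running = partial
    else:
        # Airports present on only one side are filled with 0 before adding.
        running = running.add(partial, fill_value=0)

print(running.sort_values(ascending=False))
```

This keeps at most one entry per airport in memory, at the cost of an index alignment on every chunk.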

In [88]:
final_result = s.groupby('arr_port').sum().sort_values(ascending=False).head(10)
final_result

arr_port
LHR         81439.0
LAX         64230.0
LAS         63190.0
MCO         62290.0
JFK         60060.0
CDG         58080.0
SFO         53710.0
MIA         53020.0
BKK         52660.0
DXB         52230.0
Name: pax, dtype: float64
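
For the bonus point, a real solution would look the codes up in a reference dataset such as the one GeoBases provides. The sketch below does not use the GeoBases API; it stands in a small hand-written dictionary for that dataset, only to show how the names could be attached to the result:

```python
import pandas as pd

# First three rows copied from the result above.
top = pd.Series({"LHR": 81439.0, "LAX": 64230.0, "LAS": 63190.0}, name="pax")

# Hand-written stand-in for a real airport reference table.
city_names = {
    "LHR": "London",
    "LAX": "Los Angeles",
    "LAS": "Las Vegas",
}

report = top.rename_axis("arr_port").reset_index()

# strip() guards against padded codes; whether the real values are
# padded is an assumption.
report["city"] = report["arr_port"].str.strip().map(city_names)
print(report)
```

Codes missing from the lookup table come out as NaN, which makes gaps in the reference data easy to spot.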