
# New York City Flights Analysis

This notebook begins our analysis of New York City flights using the [nycflights](https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf) data.

In [0]:
# pandas for data manipulation
# NumPy for computing the histogram
import pandas as pd
import numpy as np

## Bokeh Basics

Let's start with the basics of using Bokeh, a Python data visualization library. For now we are hard-coding the data, but let's step through what is actually happening here.

A line that starts with `from` imports specific names from a module into the current Python file. We could have written `import bokeh.plotting` and called `bokeh.plotting.figure(...)` each time, but since we only need a few pieces of Bokeh, we use the `from module import name` syntax. Python still loads the module behind the scenes either way; the benefit is that only the names we ask for land in our namespace, which keeps the code short and makes it clear what we depend on.
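
As a small illustration of the two import styles, here is a sketch that uses the standard-library `math` module as a stand-in (an assumption made for brevity; the notebook itself imports from Bokeh). The variable names are only for illustration.

```python
import sys

# Style 1: bind the module object itself; members are reached with a dot.
import math
root_a = math.sqrt(16)

# Style 2: bind only the name `sqrt` in the current namespace.
from math import sqrt
root_b = sqrt(16)

# Both styles refer to the same function object...
same_function = sqrt is math.sqrt

# ...and in both cases the module is loaded once and cached in sys.modules.
module_cached = 'math' in sys.modules

print(root_a, root_b, same_function, module_cached)
```

Which style to use is mostly a matter of readability: the `from` form is shorter at the call site, while the plain `import` form makes it obvious where each function comes from.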

We create a figure in Bokeh with the `figure()` function, passing it the plot dimensions, a title, and the x- and y-axis labels. We have access to it because we imported it with `from bokeh.plotting import figure`. You can find everything the Bokeh library offers in its [reference documentation](https://bokeh.pydata.org/en/latest/docs/reference.html). Browsing the reference is often the best way to learn what a library can do.

Then we hard-code some points on the graph with the squares and circles lists. Notice that in Python we don't have to declare data types for variables: a name can hold a string, a list, or an int, and a single list can mix values of different types. This flexibility is one reason many people like Python.
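
A minimal sketch of what this flexibility looks like in practice, and where it stops. All names and values here are made up for illustration.

```python
# A name can be rebound to values of different types; no declaration is needed.
value = 5            # int
value = 'five'       # now a str
value = [5, 'five']  # now a list holding mixed types

# Numeric types combine freely: int + float gives a float.
total = 2 + 0.5

# Types are still enforced at run time: str + int raises a TypeError.
try:
    'delay: ' + 15
except TypeError:
    message = 'delay: ' + str(15)  # convert explicitly instead

print(type(value).__name__, total, message)
```

So Python is flexible about what a variable holds, but it will not silently combine unrelated types such as strings and numbers.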

Also notice that Bokeh gives us some tools out of the box. The toolbar on the right side of the graph lets us save the plot as an image, pan, zoom, and reset the view.

## Basic Glyphs

In [0]:
# bokeh basics
from bokeh.plotting import figure
from bokeh.io import show, output_notebook

# Create a blank figure with labels
# (width/height replace plot_width/plot_height, which Bokeh 3.0 removed)
p = figure(width = 600, height = 600, 
           title = 'Example Glyphs',
           x_axis_label = 'X', y_axis_label = 'Y')

squares_x = [1, 3, 4, 5, 8]
squares_y = [8, 7, 3, 1, 10]

circles_x = [9, 12, 4, 3, 15]
circles_y = [8, 4, 11, 6, 10]

# Squares glyph
p.square(squares_x, squares_y, size = 12, color = 'navy', alpha = 0.6)

# Circle glyph
p.circle(circles_x, circles_y, size = 12, color = 'red')


# Set to output the plot in the notebook
output_notebook()

# Show the plot
show(p)


# Data Inspection

## Importing Data

A lot of data is free and open: you can find datasets on GitHub or on government websites. They usually come as CSV (comma-separated values) files, which follow a format similar to an Excel sheet, with columns and rows.

To import the data we use the [pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) module, which we gave the nickname `pd` when we imported it. Pandas is another Python library that helps us analyze data. Why use pandas? Because I wanted to show how you can use different libraries together and combine their functionality.

Pandas has a function called `read_csv()`, which takes the file path or URL as its first argument and accepts further options such as `index_col`, documented [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv).

This puts our data into a pandas DataFrame, which lets us do a lot with it. Think of it like pasting raw data into Excel: on its own the raw data doesn't do much, but once it is in Excel we can add formulas and manipulate it. We then check that the import worked by calling `head()`, a method available to us because the data is now in a DataFrame.
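
To see `read_csv()` and `head()` in isolation, here is a minimal sketch that reads a tiny, made-up CSV from a string instead of a URL. The rows and the variable names `csv_text` and `sample` are assumptions for illustration, not part of the flights data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file or URL.
csv_text = """,carrier,flight,arr_delay
1,AA,101,12.0
2,BB,202,-5.0
3,CC,303,47.0
"""

# index_col=0 uses the first (unnamed) column as the row index.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# head(n) returns the first n rows (5 by default).
print(sample.head(2))
print(list(sample.columns))
```

Without `index_col=0`, pandas would keep the unnamed first column as ordinary data and add its own 0-based index alongside it.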

In [0]:
flights = pd.read_csv('https://raw.githubusercontent.com/tfaieta/DataScience/master/data/flights.csv?token=ADR5YW65IG7OZJRCPES7ZLS5BKU74', index_col=0)
flights.head()

We are going to focus on a single variable, in this case the arrival delay in minutes (negative values mean the flight arrived early). Before we get into plotting, let's look at the summary statistics for the arrival delay.

In [0]:
flights['arr_delay'].describe()
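
As a rough guide to reading the output of `describe()`, here is a minimal sketch on a made-up Series (the values and the name `toy_delays` are assumptions for illustration). It also shows that missing values are left out of the statistics.

```python
import pandas as pd

# Made-up delays in minutes; None becomes NaN (a missing value).
toy_delays = pd.Series([-10.0, 5.0, 20.0, None])

# describe() reports count, mean, std, min, quartiles, and max.
summary = toy_delays.describe()
print(summary)

# `count` covers only the non-missing values, so check for gaps separately.
n_missing = toy_delays.isna().sum()
print(n_missing)
```

If `count` is lower than the number of rows in the DataFrame, some values in that column are missing.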

# Histogram 

The first graph we will make is a simple histogram of the arrival delay. We will consider all airlines on the same plot.

## Data for plotting

In [0]:
# Bins will be five minutes in width
# Limit delays to [-60, +120] minutes using the range
arr_hist, edges = np.histogram(flights['arr_delay'], 
                               bins = int(180/5), 
                               range = [-60, 120])

# Put the information in a dataframe
delays = pd.DataFrame({'flights': arr_hist, 
                       'left': edges[:-1], 
                       'right': edges[1:]})

In [0]:
delays.head()
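
To make the binning step easier to follow, here is a minimal sketch of `np.histogram` on a handful of made-up values (the data and bin settings are assumptions chosen to keep the arithmetic easy to check).

```python
import numpy as np

# Made-up delays in minutes; -12 and 500 lie outside the range and are not counted.
toy_delays = np.array([-12, -3, 0, 4, 4, 9, 500])

# Two bins of width 10 over [-10, 10]: [-10, 0) and [0, 10].
counts, edges = np.histogram(toy_delays, bins=2, range=(-10, 10))

# There is always one more edge than there are bins, so slicing gives
# the left and right side of each bar.
left = edges[:-1]
right = edges[1:]

print(counts)  # → [1 4]
print(edges)   # → [-10.   0.  10.]
```

This is the same pattern used for the flights data: `edges[:-1]` and `edges[1:]` become the `left` and `right` columns, and the counts become the bar heights.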

In [0]:
# Create the blank plot
p = figure(height = 600, width = 600, 
           title = 'Histogram of Arrival Delays',
           x_axis_label = 'Delay (min)', 
           y_axis_label = 'Number of Flights')

# Add a quad glyph
p.quad(bottom=0, top=delays['flights'], 
       left=delays['left'], right=delays['right'], 
       fill_color='red', line_color='black')

# Show the plot
show(p)

## Matplotlib Equivalent

In [0]:
import matplotlib.pyplot as plt
plt.hist(flights['arr_delay'], bins = int(180/5), range = (-60, 120));
plt.xlabel('Delay (min)'); plt.ylabel('Number of Flights')
plt.title('Histogram of Arrival Delays')
plt.show();

# Add Basic Styling

In [0]:
# Style function that takes in a plot
def style(p):
    # Title 
    p.title.align = 'center'
    p.title.text_font_size = '20pt'
    p.title.text_font = 'serif'
    
    # Axis titles
    p.xaxis.axis_label_text_font_size = '14pt'
    p.xaxis.axis_label_text_font_style = 'bold'
    p.yaxis.axis_label_text_font_size = '14pt'
    p.yaxis.axis_label_text_font_style = 'bold'
    
    # Tick labels
    p.xaxis.major_label_text_font_size = '12pt'
    p.yaxis.major_label_text_font_size = '12pt'
    
    return p

# Add Aesthetics
styled_p = style(p)

# Show plot
show(styled_p)

# Column Data Source

In [0]:
# Import the ColumnDataSource class
from bokeh.models import ColumnDataSource

In [0]:
# Formatted columns for Hover Tooltips (see next section)
delays['f_flights'] = ['%d flights' % count for count in delays['flights']]
delays['f_interval'] = ['%d to %d minutes' % (left, right) for left, right in zip(delays['left'], delays['right'])]

delays.head()
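
The formatted columns rely on `%`-style string formatting combined with `zip`. A minimal sketch with made-up bin edges and counts (all values here are assumptions for illustration):

```python
# Made-up bin edges and counts standing in for the `delays` columns.
lefts = [-60.0, -55.0]
rights = [-55.0, -50.0]
counts = [3, 12]

# %d formats a number as an integer, so -60.0 is shown as -60.
f_flights = ['%d flights' % count for count in counts]

# zip pairs each left edge with its right edge.
f_interval = ['%d to %d minutes' % (left, right)
              for left, right in zip(lefts, rights)]

print(f_flights)   # → ['3 flights', '12 flights']
print(f_interval)  # → ['-60 to -55 minutes', '-55 to -50 minutes']
```

Pre-formatting the text this way keeps the tooltip definition simple, since each tooltip only has to point at a ready-made column.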

In [0]:
# Convert to column data source
src = ColumnDataSource(delays)
src.data.keys()

# Add in Tooltips on Hover

In [0]:
# Import the hover tool class
from bokeh.models import HoverTool

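Here is an example of a hover tool that refers both to a field in our data source and to an attribute of the graph. A name prefixed with `@` is looked up as a column in the data source, while `$x` and `$y` are the coordinates of the mouse cursor on the plot.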

```python
h = HoverTool(tooltips = [('Delay Interval Left ', '@left'),
                          ('(x,y)', '($x, $y)')])
```

In [0]:
h = HoverTool(tooltips = [('Delay Interval Left ', '@left'),
                          ('(x,y)', '($x, $y)')])

In [0]:
# Create the blank plot
p = figure(height = 600, width = 600, 
           title = 'Histogram of Arrival Delays',
          x_axis_label = 'Delay (min)', 
           y_axis_label = 'Number of Flights')

# Add a quad glyph with source this time
p.quad(bottom=0, top='flights', left='left', right='right', source=src,
       fill_color='red', line_color='black', fill_alpha = 0.75,
       hover_fill_alpha = 1.0, hover_fill_color = 'navy')

# Add a hover tool referring to the formatted columns
hover = HoverTool(tooltips = [('Delay', '@f_interval'),
                             ('Num of Flights', '@f_flights')])

# Style the plot
p = style(p)

# Add the hover tool to the graph
p.add_tools(hover)

output_notebook()

# Show the plot
show(p)


## Save the Plot

In [0]:
# from bokeh.io import output_file

# output_file('./visualizations/hist.html')
# show(p)