# Project 7 

In this project, you will implement the clustering techniques that you've learned this week.

#### Step 1: Load the python libraries that you will need for this project 

In [7]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import psycopg2 as psy
import sqlalchemy
import patsy
import seaborn as sns
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.base import TransformerMixin
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

%matplotlib inline

#### Step 2: Examine your data 

In [8]:
df_raw = pd.read_csv("../assets/airport_cancellations.csv")
df = df_raw.dropna() 
df.head()

Unnamed: 0,Airport,Year,Departure Cancellations,Arrival Cancellations,Departure Diversions,Arrival Diversions
0,ABQ,2004.0,242.0,235.0,71.0,46.0
1,ABQ,2005.0,221.0,190.0,61.0,33.0
2,ABQ,2006.0,392.0,329.0,71.0,124.0
3,ABQ,2007.0,366.0,304.0,107.0,45.0
4,ABQ,2008.0,333.0,300.0,79.0,42.0


In [33]:
list(df.columns)

['airport',
 'year',
 'departure cancellations',
 'arrival cancellations',
 'departure diversions',
 'arrival diversions']

In [9]:
df_raw2 = pd.read_csv("../assets/airports.csv")
df2 = df_raw2.dropna() 
df2.info()  # info() prints its report and returns None, so wrapping it in print only adds a stray "None"

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3498 entries, 0 to 5163
Data columns (total 13 columns):
Key                        3498 non-null float64
LocID                      3498 non-null object
AP_NAME                    3498 non-null object
ALIAS                      3498 non-null object
Facility Type              3498 non-null object
FAA REGION                 3498 non-null object
COUNTY                     3498 non-null object
CITY                       3498 non-null object
STATE                      3498 non-null object
AP Type                    3498 non-null object
Latitude                   3498 non-null float64
Longitude                  3498 non-null float64
Boundary Data Available    3498 non-null object
dtypes: float64(3), object(10)
memory usage: 382.6+ KB


In [10]:
df_raw3 = pd.read_csv("../assets/Airport_operations.csv")
df3 = df_raw3.dropna() 
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 841 entries, 0 to 840
Data columns (total 15 columns):
airport                               841 non-null object
year                                  841 non-null int64
departures for metric computation     841 non-null int64
arrivals for metric computation       841 non-null int64
percent on-time gate departures       841 non-null float64
percent on-time airport departures    841 non-null float64
percent on-time gate arrivals         841 non-null float64
average_gate_departure_delay          841 non-null float64
average_taxi_out_time                 841 non-null float64
average taxi out delay                841 non-null float64
average airport departure delay       841 non-null float64
average airborne delay                841 non-null float64
average taxi in delay                 841 non-null float64
average block delay                   841 non-null float64
average gate arrival delay            841 non-null float64
dtypes: float64(11), int64(3), object(1)

### Intro: Write a problem statement / aim for this project

I have been hired by the FAA to analyze the causes of flight delays. The data has no target class, so the analysis must use unsupervised techniques. I will use Principal Component Analysis to identify which operational features account for most of the variation in delays, in preparation for clustering.

### Part 1: Create a PostgreSQL database 

#### 1. Let's create a database where we can house our airport data

In [31]:
%load_ext sql

  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")


In [32]:
%%sql 
postgresql://localhost:5432/

u'Connected: None@'

In [36]:
%%sql
DROP TABLE IF EXISTS airport_cancellations;
-- Column names that contain spaces must be double-quoted, and every column needs a type.
-- idx is an assumed name for the unnamed leading index column shown in the CSV preview above.
CREATE TABLE airport_cancellations
(idx INTEGER,
airport TEXT,
year NUMERIC,
"departure cancellations" NUMERIC,
"arrival cancellations" NUMERIC,
"departure diversions" NUMERIC,
"arrival diversions" NUMERIC);

COPY airport_cancellations FROM
'/Users/Tamara/DSI-NYC-1/projects/01-projects-weekly/project-07/assets/airport_cancellations.csv'
DELIMITER ',' CSV HEADER;


Load our csv files into tables

In [23]:
# convert columns to lowercase 
df.columns = df.columns.str.lower()
df2.columns = df2.columns.str.lower()
df3.columns = df3.columns.str.lower()

Join airport_cancellations.csv and airports.csv into one table

In [24]:
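# create the SQLAlchemy engine that to_sql and read_sql use below
# (assumed to point at the same server as the %%sql connection above)
engine = sqlalchemy.create_engine("postgresql://localhost:5432/")
df.to_sql("airport_cancellations", con=engine, if_exists="replace")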

In [25]:
df2.to_sql("airports", con = engine, if_exists = "replace")

In [26]:
df3.to_sql("airport_operations", con = engine, if_exists="replace")

In [27]:
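# get the data I expect to use for Principal Component Analysis
# note: the join matches on airport only, not year, so each operations row repeats once per matching cancellation row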
joined = pd.read_sql('SELECT o.*  FROM airports as a JOIN airport_cancellations as ac ON a.locid=ac.airport JOIN airport_operations as o ON a.locid=o.airport;',con = engine)

In [28]:
# get all the data
everything = pd.read_sql('SELECT * FROM airports as a INNER JOIN airport_cancellations as ac ON a.locid=ac.airport INNER JOIN airport_operations as o ON a.locid=o.airport;',con = engine)

In [30]:
# creating csv for all data
everything.to_csv("../assets/everything.csv",encoding="utf-8")
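
The two queries above join the yearly tables on airport code only. A minimal sketch of what that likely does to the row count, using pandas merges on hypothetical miniature tables (the delay values are made up for illustration; only the column names come from the data above):

```python
import pandas as pd

# Hypothetical miniature versions of the two yearly tables.
cancellations = pd.DataFrame({
    "airport": ["ABQ", "ABQ", "ABQ"],
    "year": [2004, 2005, 2006],
    "departure cancellations": [242, 221, 392],
})
operations = pd.DataFrame({
    "airport": ["ABQ", "ABQ", "ABQ"],
    "year": [2004, 2005, 2006],
    "average airport departure delay": [10.0, 9.9, 12.0],  # made-up values
})

# Joining on airport alone pairs every operations year with every cancellation year.
by_airport = operations.merge(cancellations, on="airport", suffixes=("", "_ac"))

# Joining on airport and year keeps one row per airport-year.
by_airport_year = operations.merge(cancellations, on=["airport", "year"])

print(len(by_airport), len(by_airport_year))
```

If one row per airport-year is what the analysis needs, adding a year condition to the SQL `ON` clauses would be the equivalent change; whether that is the intended grain is a judgment call for the analysis.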

Query the database for our initial data

#### 1.2 What are the risks and assumptions of our data? 

### Part 2: Exploratory Data Analysis

#### 2.1 Plot and Describe the Data

#### Are there any unique values? 
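
One way to approach this question is to count distinct values per column and check for repeated airport-year pairs. The sketch below uses a small hypothetical frame rather than the project data; the same calls should apply to `df`, `df2`, or `df3`.

```python
import pandas as pd

# Hypothetical sample standing in for one of the project tables.
sample = pd.DataFrame({
    "airport": ["ABQ", "ABQ", "ATL", "ATL"],
    "year": [2004, 2005, 2004, 2005],
})

# Number of distinct values in each column.
unique_counts = sample.nunique()

# The distinct airport codes themselves.
airports = sorted(sample["airport"].unique())

# Count of rows that repeat an earlier airport-year combination.
duplicated_rows = sample.duplicated(subset=["airport", "year"]).sum()

print(unique_counts)
print(airports, duplicated_rows)
```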

### Part 3: Data Mining

#### 3.1 Create Dummy Variables

In [22]:
df = data
categories = ["airport"]
for category in categories:
    series = df[category]
    dummies = pd.get_dummies(series, prefix=category)
    df = pd.concat([df, dummies], axis=1)
dataWithDummies = df.copy(deep=True)

del dataWithDummies["airport"]
del data["airport"]

#### 3.2 Format and Clean the Data

In [25]:
# copying data to a new variable so I don't have to run the whole thing again
data = joined.dropna().copy(deep=True)
del data["index"]
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8789 entries, 0 to 8788
Data columns (total 15 columns):
airport                               8789 non-null object
year                                  8789 non-null int64
departures for metric computation     8789 non-null int64
arrivals for metric computation       8789 non-null int64
percent on-time gate departures       8789 non-null float64
percent on-time airport departures    8789 non-null float64
percent on-time gate arrivals         8789 non-null float64
average_gate_departure_delay          8789 non-null float64
average_taxi_out_time                 8789 non-null float64
average taxi out delay                8789 non-null float64
average airport departure delay       8789 non-null float64
average airborne delay                8789 non-null float64
average taxi in delay                 8789 non-null float64
average block delay                   8789 non-null float64
average gate arrival delay            8789 non-null float64


### Part 4: Define the Data

#### 4.1 Confirm that the dataset has a normal distribution. How can you tell? 
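
A possible way to check this is to look at skewness and run a normality test on each numeric column. The sketch below uses synthetic columns (one drawn from a normal distribution, one from an exponential) and `scipy.stats.normaltest`; the helper name `normality_report` is hypothetical. A small p-value suggests the column is unlikely to be normally distributed.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: one roughly normal column and one right-skewed column.
rng = np.random.RandomState(0)
sample = pd.DataFrame({
    "roughly_normal": rng.normal(loc=10.0, scale=2.0, size=500),
    "right_skewed": rng.exponential(scale=5.0, size=500),
})


def normality_report(frame):
    """Return skewness and the normaltest p-value for each column."""
    rows = {}
    for column in frame.columns:
        statistic, p_value = stats.normaltest(frame[column])
        rows[column] = {"skew": stats.skew(frame[column]), "p_value": p_value}
    return pd.DataFrame(rows).T


report = normality_report(sample)
print(report)
```

Histograms (for example `sample.hist()`) are a useful visual complement to the test.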

#### 4.2 Find correlations in the data
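
A minimal sketch of one approach: compute the pairwise correlation matrix and list the strongly correlated pairs. The values below are synthetic and the 0.8 threshold is an arbitrary assumption; only the column names are borrowed from the operations table.

```python
import numpy as np
import pandas as pd

# Synthetic data: the first two columns are built to be strongly related.
rng = np.random.RandomState(0)
taxi_out = rng.normal(15.0, 3.0, size=200)
sample = pd.DataFrame({
    "average_taxi_out_time": taxi_out,
    "average taxi out delay": taxi_out - 10.0 + rng.normal(0.0, 0.5, size=200),
    "average airborne delay": rng.normal(3.0, 1.0, size=200),
})

# Pearson correlation between every pair of columns.
corr = sample.corr()

# Keep only the upper triangle so each pair appears once, then filter by strength.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs.abs() > 0.8]

print(strong_pairs)
```

With seaborn already imported in Step 1, `sns.heatmap(corr)` would show the same matrix visually.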

#### 4.3 What is the value of understanding correlations before PCA? 

Answer: 

#### 4.4 Validate your findings using statistical analysis

#### 4.5 How can you improve your overall analysis? 

Answer: 

### Part 5: Perform a PCA

#### 5.1 Conduct the PCA
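
A minimal sketch of how the PCA could be run, assuming the numeric features are standardized first so that columns measured on larger scales do not dominate. The data is synthetic (two latent factors driving four columns), and the choice of two components is an assumption for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features: two underlying factors, each driving two columns.
rng = np.random.RandomState(0)
base = rng.normal(size=(300, 2))
features = pd.DataFrame({
    "average_gate_departure_delay": base[:, 0],
    "average airport departure delay": base[:, 0] + 0.1 * rng.normal(size=300),
    "average taxi in delay": base[:, 1],
    "average block delay": base[:, 1] + 0.1 * rng.normal(size=300),
})

# Standardize, then project onto the first two principal components.
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
components = pca_pipeline.fit_transform(features)

# Share of total variance captured by each component.
pca = pca_pipeline.named_steps["pca"]
explained = pca.explained_variance_ratio_

# Loadings: how strongly each original column contributes to each component.
loadings = pd.DataFrame(pca.components_, columns=features.columns, index=["PC1", "PC2"])

print(explained)
print(loadings)
```

On the real data, the loadings table is the place to look for which operational features dominate each component.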

### Part 6: Additional Analysis
Include any other models you'd like to run here. These can include regressions, classifications, or clusters. 
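
As one example of a clustering model that could follow the PCA, here is a sketch of k-means on two-dimensional points, scored with the silhouette coefficient. The points are synthetic, and `n_clusters=2` is an assumption that would need to be tuned on the real principal components.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for PCA output: two well-separated groups of points.
rng = np.random.RandomState(0)
group_a = rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[3.0, 0.0], scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

# Fit k-means and assign each point to a cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Silhouette score ranges from -1 to 1; higher means tighter, better-separated clusters.
score = silhouette_score(points, labels)
print(score)
```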

### Part 7: Write an analysis plan of your findings 

Create a writeup on the interpretation of findings including an executive summary with conclusions and next steps. Put it on your blog, and include the link here.

Which operational features are most correlated with delays?

What should the airport's next steps be?

### Bonus: Copy your Database to AWS 

Make sure to properly document all of the features of your dataset

### Bonus: Create a 3-Dimensional Plot of your new dataset with PCA applied
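
A sketch of one way to draw such a plot, assuming three principal components and matplotlib's built-in 3D axes. The input is random synthetic data, not the airport dataset. The `Agg` backend line lets the sketch run without a display and can be dropped inside a notebook that uses `%matplotlib inline`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the 3d projection on older matplotlib
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric feature matrix.
rng = np.random.RandomState(0)
features = rng.normal(size=(100, 6))

# Standardize, then reduce to three principal components.
scaled = StandardScaler().fit_transform(features)
projected = PCA(n_components=3).fit_transform(scaled)

# Scatter the three components on 3D axes.
fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(projected[:, 0], projected[:, 1], projected[:, 2], s=10)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
```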