# Flights Data Exploration Challenge

In this challenge, you'll explore a real-world dataset containing flights data from the US Department of Transportation.

Let's start by loading and viewing the data.

In [13]:
import pandas as pd

df_flights = pd.read_csv('flights.csv')
df_flights.head()

#print(df_flights.columns)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,Carrier,OriginAirportID,OriginAirportName,OriginCity,OriginState,DestAirportID,DestAirportName,DestCity,DestState,CRSDepTime,DepDelay,DepDel15,CRSArrTime,ArrDelay,ArrDel15,Cancelled
0,2013,9,16,1,DL,15304,Tampa International,Tampa,FL,12478,John F. Kennedy International,New York,NY,1539,4,0.0,1824,13,0,0
1,2013,9,23,1,WN,14122,Pittsburgh International,Pittsburgh,PA,13232,Chicago Midway International,Chicago,IL,710,3,0.0,740,22,1,0
2,2013,9,7,6,AS,14747,Seattle/Tacoma International,Seattle,WA,11278,Ronald Reagan Washington National,Washington,DC,810,-3,0.0,1614,-7,0,0
3,2013,7,22,1,OO,13930,Chicago O'Hare International,Chicago,IL,11042,Cleveland-Hopkins International,Cleveland,OH,804,35,1.0,1027,33,1,0
4,2013,5,16,4,DL,13931,Norfolk International,Norfolk,VA,10397,Hartsfield-Jackson Atlanta International,Atlanta,GA,545,-1,0.0,728,-9,0,0


The dataset contains observations of US domestic flights in 2013, and consists of the following fields:

- **Year**: The year of the flight (all records are from 2013)
- **Month**: The month of the flight
- **DayofMonth**: The day of the month on which the flight departed
- **DayOfWeek**: The day of the week on which the flight departed - from 1 (Monday) to 7 (Sunday)
- **Carrier**: The two-letter abbreviation for the airline.
- **OriginAirportID**: A unique numeric identifier for the departure airport
- **OriginAirportName**: The full name of the departure airport
- **OriginCity**: The departure airport city
- **OriginState**: The departure airport state
- **DestAirportID**: A unique numeric identifier for the destination airport
- **DestAirportName**: The full name of the destination airport
- **DestCity**: The destination airport city
- **DestState**: The destination airport state
- **CRSDepTime**: The scheduled departure time
- **DepDelay**: The number of minutes departure was delayed (flights that left ahead of schedule have a negative value)
- **DepDel15**: A binary indicator that departure was delayed by more than 15 minutes (and therefore considered "late")
- **CRSArrTime**: The scheduled arrival time
- **ArrDelay**: The number of minutes arrival was delayed (flights that arrived ahead of schedule have a negative value)
- **ArrDel15**: A binary indicator that arrival was delayed by more than 15 minutes (and therefore considered "late")
- **Cancelled**: A binary indicator that the flight was cancelled
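
The relationship between a delay column and its binary indicator can be checked directly. The following is a minimal sketch on hypothetical sample rows (not values read from `flights.csv`), assuming "late" means strictly more than 15 minutes as described above; whether a delay of exactly 15 minutes counts as late should be verified against the real data. The `DepDel15_check` column name is made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from flights.csv.
sample = pd.DataFrame({
    "DepDelay": [4, 35, -3, 16],
    "DepDel15": [0.0, 1.0, 0.0, 1.0],
})

# Recompute the indicator from the delay, assuming "late" means > 15 minutes.
sample["DepDel15_check"] = (sample["DepDelay"] > 15).astype(float)

# Rows where the stored flag disagrees with the recomputed one.
mismatches = sample[sample["DepDel15"] != sample["DepDel15_check"]]
print(len(mismatches))
```

The same comparison could be applied to **ArrDelay** and **ArrDel15**.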

Your challenge is to explore the flight data and analyze the factors that may affect departure or arrival delays.

1. Start by cleaning the data.
    - Identify any null or missing data, and impute appropriate replacement values.
    - Identify and eliminate any outliers in the **DepDelay** and **ArrDelay** columns.
2. Explore the cleaned data.
    - View summary statistics for the numeric fields in the dataset.
    - Determine the distribution of the **DepDelay** and **ArrDelay** columns.
    - Use statistics, aggregate functions, and visualizations to answer the following questions:
        - *What are the average (mean) departure and arrival delays?*
        - *How do the carriers compare in terms of arrival delay performance?*
        - *Is there a noticeable difference in arrival delays for different days of the week?*
        - *Which departure airport has the highest average departure delay?*
        - *Do **late** departures tend to result in longer arrival delays than on-time departures?*
        - *Which route (from origin airport to destination airport) has the most **late** arrivals?*
        - *Which route has the highest average arrival delay?*
        
Add markdown and code cells as required to create your solution.
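
One possible way to approach the outlier step is to keep only rows that fall between two percentiles of each delay column. The sketch below uses hypothetical delay values, and the helper `trim_outliers` with its 1st and 90th percentile defaults is an assumption made for illustration, not a prescribed method; suitable thresholds depend on the actual distribution.

```python
import pandas as pd

# Hypothetical delays in minutes; the real values come from flights.csv.
delays = pd.DataFrame({
    "DepDelay": [-5, 0, 3, 4, 10, 12, 20, 35, 60, 900],
    "ArrDelay": [-9, -7, 0, 2, 8, 13, 22, 33, 55, 870],
})

def trim_outliers(df, column, lower=0.01, upper=0.90):
    """Keep rows whose value in `column` lies within the given percentiles.

    The default percentiles are illustrative assumptions.
    """
    low, high = df[column].quantile([lower, upper])
    return df[(df[column] >= low) & (df[column] <= high)]

# Trim each delay column in turn.
trimmed = trim_outliers(delays, "DepDelay")
trimmed = trim_outliers(trimmed, "ArrDelay")
print(trimmed)
```

Plotting a histogram or box plot of each column before and after trimming helps confirm that the chosen percentiles remove only the extreme tail.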

> **Note**: There is no single "correct" solution. A sample solution is provided in [01 - Flights Solution.ipynb](01%20-%20Flights%20Solution.ipynb).
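
Several of the exploration questions reduce to a mean over the whole dataset or a grouped mean. The following sketch shows the pattern on a small hypothetical frame that reuses the column names from the dataset; the values are illustrative only and say nothing about the real carriers.

```python
import pandas as pd

# Hypothetical rows; the real values come from flights.csv.
flights = pd.DataFrame({
    "Carrier": ["DL", "DL", "WN", "WN", "AS"],
    "DayOfWeek": [1, 4, 1, 6, 6],
    "DepDelay": [4, -1, 3, 35, -3],
    "ArrDelay": [13, -9, 22, 33, -7],
})

# Overall mean departure and arrival delays.
mean_delays = flights[["DepDelay", "ArrDelay"]].mean()

# Mean arrival delay per carrier, highest first.
by_carrier = (
    flights.groupby("Carrier")["ArrDelay"].mean().sort_values(ascending=False)
)

# Mean arrival delay per day of the week.
by_day = flights.groupby("DayOfWeek")["ArrDelay"].mean()

print(mean_delays)
print(by_carrier)
print(by_day)
```

Grouping by `["OriginAirportName", "DestAirportName"]` instead of a single column would give per-route figures in the same way.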

In [16]:
# Your code to explore the data
# Count the missing values in each column
df_flights.isnull().sum()

Year                    0
Month                   0
DayofMonth              0
DayOfWeek               0
Carrier                 0
OriginAirportID         0
OriginAirportName       0
OriginCity              0
OriginState             0
DestAirportID           0
DestAirportName         0
DestCity                0
DestState               0
CRSDepTime              0
DepDelay                0
DepDel15             2761
CRSArrTime              0
ArrDelay                0
ArrDel15                0
Cancelled               0
dtype: int64
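
Before deciding how to handle the missing **DepDel15** values, it can help to look at the affected rows and see what the related columns contain. This is a minimal sketch on hypothetical rows; the values below are assumptions for illustration and do not describe the actual missing records.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real values come from flights.csv.
sample = pd.DataFrame({
    "DepDelay": [4, 0, 35, 0],
    "DepDel15": [0.0, np.nan, 1.0, np.nan],
    "Cancelled": [0, 1, 0, 1],
})

# Select only the rows where the indicator is missing.
missing = sample[sample["DepDel15"].isnull()]

# Summarize the related columns for those rows.
print(missing[["DepDelay", "Cancelled"]].describe())
```

If the summary shows a consistent pattern in the related columns, that pattern can guide the choice of replacement value.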

In [18]:
# Drop every row that contains a null value
df_flights = df_flights.dropna(axis=0, how='any')
df_flights.isnull().sum()

Year                 0
Month                0
DayofMonth           0
DayOfWeek            0
Carrier              0
OriginAirportID      0
OriginAirportName    0
OriginCity           0
OriginState          0
DestAirportID        0
DestAirportName      0
DestCity             0
DestState            0
CRSDepTime           0
DepDelay             0
DepDel15             0
CRSArrTime           0
ArrDelay             0
ArrDel15             0
Cancelled            0
dtype: int64
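
Dropping rows is one option. Since the challenge asks for imputed replacement values, an alternative is to fill the missing **DepDel15** entries from **DepDelay**, which keeps every row. The sketch below uses hypothetical rows and assumes the "more than 15 minutes" definition of a late departure given in the field list.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real values come from flights.csv.
sample = pd.DataFrame({
    "DepDelay": [4, 0, 35, 20],
    "DepDel15": [0.0, np.nan, 1.0, np.nan],
})

# Derive the indicator from the delay, assuming "late" means > 15 minutes.
derived = (sample["DepDelay"] > 15).astype(float)

# Fill only the missing entries; existing values are left untouched.
imputed = sample.copy()
imputed["DepDel15"] = imputed["DepDel15"].fillna(derived)

print(imputed)
print(imputed.isnull().sum())
```

Unlike `dropna`, this approach preserves the row count, so later averages are computed over all recorded flights.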