# The Task at Hand: Getting Airline Data In Order

(CC) Creative Commons BY-SA Lynd Bacon & Associates, Ltd. DBA Loma Buena Associates

## Your Data

The data you'll use for this assignment are from [OpenFlights.org](http://www.openflights.org).

You are provided with three data files: one for airports, one for routes, and one for airlines. The data run through January 2012. You'll be using these data for a couple of upcoming tasks, so be sure to keep track of the files and to save your work with them.

The data in the file **airports.dat** look like this. Here are the first four (4) records in this file:

1,"Goroka","Goroka","Papua New Guinea","GKA","AYGA",-6.081689,145.391881,5282,10,"U","Pacific/Port_Moresby" <br/>
2,"Madang","Madang","Papua New Guinea","MAG","AYMD",-5.207083,145.7887,20,10,"U","Pacific/Port_Moresby"<br />
3,"Mount Hagen","Mount Hagen","Papua New Guinea","HGU","AYMH",-5.826789,144.295861,5388,10,"U","Pacific/Port_Moresby" <br />
4,"Nadzab","Nadzab","Papua New Guinea","LAE","AYNZ",-6.569828,146.726242,239,10,"U","Pacific/Port_Moresby" <br />

What you have here is a delimited text file: each record is one line, and the fields are separated by a character (a comma in this case). Note that the file has no header row.

Here are the fields in this file, according to OpenFlights.org:

* Airport ID : Unique OpenFlights identifier for this airport. 
* Name : Name of airport. May or may not contain the City name.
* City : Main city served by airport. May be spelled differently from Name.
* Country : Country or territory where airport is located.
* IATA/FAA : 3-letter FAA code, for airports located in Country "United States of America". 3-letter IATA code, for all other airports. Blank if not assigned.
* ICAO : 4-letter ICAO code. Blank if not assigned.
* Latitude : Decimal degrees, usually to six significant digits. Negative is South, positive is North.
* Longitude : Decimal degrees, usually to six significant digits. Negative is West, positive is East.
* Altitude : In feet.
* Timezone : Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5.
* DST : Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown).
* Tz database time zone : Timezone in "tz" (Olson) format, eg. "America/Los_Angeles".
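
As a minimal sketch of how a headerless file like this could be read, the following uses two of the sample records shown above, inlined so the example runs without the real file. The column names are assumptions based on the field list, not names required by the assignment.

```python
import io

import pandas as pd

# Two of the sample records shown above, inlined so the sketch runs
# without the real file.
sample = io.StringIO(
    '1,"Goroka","Goroka","Papua New Guinea","GKA","AYGA",'
    '-6.081689,145.391881,5282,10,"U","Pacific/Port_Moresby"\n'
    '2,"Madang","Madang","Papua New Guinea","MAG","AYMD",'
    '-5.207083,145.7887,20,10,"U","Pacific/Port_Moresby"\n'
)

# Column names are assumptions based on the field list; choose your own.
airport_columns = [
    "airport_id", "name", "city", "country", "iata_faa", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset", "dst", "tz",
]

# header=None tells pandas that the first line is data, not column names.
airports = pd.read_csv(sample, header=None, names=airport_columns)
print(airports.head(3))
```

For the real data, `sample` would be replaced by the path to **airports.dat**.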







OpenFlights says:
    
The data is ISO 8859-1 (Latin-1) encoded, with no special characters.

Note: Rules for daylight savings time change from year to year and from country to country. The current data is an approximation for 2009, built on a country level. Most airports in DST-less regions in countries that generally observe DST (eg. AL, HI in the USA, NT, QL in Australia, parts of Canada) are marked incorrectly.

The other two files, **routes.dat** and **airlines.dat**, are similar to **airports.dat**.  The fields in **airlines.dat** are:

* Airline ID : Unique OpenFlights identifier for this airline. 
* Name : Name of the airline. 
* Alias : Alias of the airline. For example, All Nippon Airways is commonly known as "ANA". 
* IATA : 2-letter IATA code, if available.
* ICAO : 3-letter ICAO code, if available.
* Callsign : Airline callsign.
* Country : Country or territory where airline is incorporated.
* Active : "Y" if the airline is or has until recently been operational, "N" if it is defunct. This field is not reliable: in particular, major airlines that stopped flying long ago, but have not had their IATA code reassigned (eg. Ansett/AN), will incorrectly show as "Y".


Additional information about the **airlines.dat** data from OpenFlights:
    
The data is ISO 8859-1 (Latin-1) encoded. The special value \N is used for "NULL" to indicate that no value is available, and is understood automatically by MySQL if imported.

Notes: Airlines with null codes/callsigns/countries generally represent user-added airlines. Since the data is intended primarily for current flights, defunct IATA codes are generally not included. For example, "Sabena" is not listed with a SN IATA code, since "SN" is presently used by its successor Brussels Airlines.
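
One way the encoding and the \N marker might be handled when reading with pandas is sketched below. The two rows are made up for illustration (they are not real records), and the column names are assumptions based on the field list above.

```python
import io

import pandas as pd

# Two made-up rows in the airlines.dat layout (not real records),
# encoded as Latin-1 bytes to mimic the file on disk.
raw_bytes = (
    '9001,"Aérolínea Ejemplo",\\N,"EX","EXA","EJEMPLO","Spain","Y"\n'
    '9002,"Sample Air",\\N,\\N,\\N,\\N,\\N,"N"\n'
).encode("latin-1")

# Assumed column names; pick whatever is meaningful to you.
airline_columns = [
    "airline_id", "name", "alias", "iata", "icao",
    "callsign", "country", "active",
]

airlines = pd.read_csv(
    io.BytesIO(raw_bytes),
    header=None,
    names=airline_columns,
    encoding="latin-1",  # the encoding OpenFlights documents
    na_values=["\\N"],   # backslash + capital N marks a missing value
)

# Count the missing values per column.
print(airlines.isna().sum())
```

Passing `na_values` adds \N to the markers pandas already treats as missing, so the affected cells come back as `NaN` instead of as the literal two-character string.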

**routes.dat** has the following data fields:

* Airline : 2-letter (IATA) or 3-letter (ICAO) code of the airline. 
* Airline ID : Unique OpenFlights identifier for airline (see Airline). 
* Source airport : 3-letter (IATA) or 4-letter (ICAO) code of the source airport.
* Source airport ID : Unique OpenFlights identifier for source airport (see Airport) 
* Destination airport : 3-letter (IATA) or 4-letter (ICAO) code of the destination airport.
* Destination airport ID : Unique OpenFlights identifier for destination airport (see Airport) 
* Codeshare : "Y" if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise.
* Stops : Number of stops on this flight ("0" for direct)
* Equipment : 3-letter codes for plane type(s) generally used on this flight, separated by spaces


Here's some additional information about **routes.dat**:

The data is ISO 8859-1 (Latin-1) encoded. The special value \N is used for "NULL" to indicate that no value is available, and is understood automatically by MySQL if imported. (Note: \N is how missing values are coded. It is not to be taken as the _newline_ character.)
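
The distinction between the missing-value marker and a newline can be checked directly in Python. The sketch below also shows, on two made-up rows in the **routes.dat** layout (not real records, with assumed column names), what happens to an ID column when the marker is and is not declared as missing.

```python
import io

import pandas as pd

missing_marker = "\\N"  # two characters: a backslash and a capital N
newline = "\n"          # one character: the line feed

print(len(missing_marker), len(newline))  # 2 1

# Two made-up rows in the routes.dat layout (not real records).
text = "ZZ,\\N,AAA,1,BBB,2,,0,XXX\nZZ,77,BBB,2,AAA,1,,0,XXX\n"
route_columns = [
    "airline", "airline_id", "source_airport", "source_airport_id",
    "dest_airport", "dest_airport_id", "codeshare", "stops", "equipment",
]

# Without na_values the marker survives as an ordinary string.
as_text = pd.read_csv(io.StringIO(text), header=None, names=route_columns)

# With na_values the marker becomes NaN and the column stays numeric.
as_missing = pd.read_csv(
    io.StringIO(text),
    header=None,
    names=route_columns,
    na_values=[missing_marker],
)

print(as_text["airline_id"].tolist())
print(as_missing["airline_id"].tolist())
```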

Notes: 
* Routes are directional: if an airline operates services from A to B and from B to A, both A-B and B-A are listed separately. 
* Routes where one carrier operates both its own and codeshare flights are listed only once. 


## What You Need to Do

Here's what you need to do for this exercise. You'll use Python to do it. You can use the Enthought Canopy distribution, or some other Python distribution, like the Continuum Anaconda scientific Python distribution. For each of the following, provide syntactically correct and commented code, followed by the results that your code produced for what is requested. Your commenting should explain what your code does. Use Python conventions for including your comments with your code.

Here's a style guide for Python code that you might find useful:

[PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/)

Submit your assignment as a PDF file of three or four pages, _and in no case more than six (6) pages_.

#### 1. Read each data file into a Pandas DataFrame.  Add meaningful names (i.e., names that would make sense to other people, given the data) to the columns of each DataFrame.
    
* Provide your syntactically correct, commented code.
* Print the first three rows of each DataFrame.  Provide your code, and the results it produced. 
    
#### 2. Check each DataFrame for duplicate records.  For each, report the number of duplicates you found.
    
* Provide your commented, syntactically correct code and the results it produced.  
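
As a rough illustration of one way to count duplicates, the sketch below uses a toy DataFrame (made-up values) in place of the real data. It treats a duplicate as a row that repeats an earlier row in every column, which is an assumption; other definitions are possible.

```python
import pandas as pd

# Toy frame standing in for one of the three DataFrames (made-up values).
df = pd.DataFrame(
    {"airline": ["ZZ", "ZZ", "YY"], "source": ["AAA", "AAA", "BBB"]}
)

# duplicated() flags each row that repeats an earlier row in full.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)  # 1
```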
    
#### 3. Describe the data types of the columns in each of the DataFrames.
    
* Provide your commented, syntactically correct code and the results it produced.  
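
A minimal sketch of inspecting column types, using a small frame built from values in the sample records above; the column names are assumptions.

```python
import pandas as pd

# Small frame built from values in the sample airport records.
df = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "name": ["Goroka", "Madang"],
        "latitude": [-6.081689, -5.207083],
    }
)

# dtypes gives one data type per column.
print(df.dtypes)

# info() adds non-null counts, which helps spot missing values.
df.info()
```

Columns that contain missing values may come back with a different type than expected, so it is worth comparing the reported types against the field descriptions.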
        
#### 4. Determine how many of the airlines are "defunct." 
    
* Provide your definition of what a defunct airline is.
* Provide your commented, syntactically correct code and the results it produced.  
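
One possible definition, offered only as an assumption, is that an airline is defunct when its Active field is "N". The sketch below counts such rows in a toy frame with made-up airline names. Keep in mind that OpenFlights describes this field as unreliable, so your own definition may need to be more careful.

```python
import pandas as pd

# Toy frame with made-up airlines; "active" mirrors the Active field.
airlines = pd.DataFrame(
    {
        "name": ["Alpha Air", "Beta Air", "Gamma Air"],
        "active": ["Y", "N", "N"],
    }
)

# Assumed definition: defunct means Active == "N".
n_defunct = int((airlines["active"] == "N").sum())
print(n_defunct)  # 2

# value_counts() shows every value that occurs, including unexpected ones.
print(airlines["active"].value_counts())
```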
    

#### 5. Determine how many "routes from nowhere" there are in the data.  These are flights that don't originate from an airport.
    
* Provide your commented, syntactically correct code and the results it produced.  
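
One reading of this question, stated here as an assumption, is that a route comes "from nowhere" when its source airport ID is missing or does not appear among the airport IDs. The sketch below applies that reading to toy frames with made-up IDs and assumed column names.

```python
import pandas as pd

# Toy frames with made-up IDs; column names are assumptions.
airports = pd.DataFrame({"airport_id": [1, 2, 3]})
routes = pd.DataFrame({"source_airport_id": [1.0, 2.0, 99.0, None]})

# True where the route's source ID matches a known airport ID.
# Missing source IDs (NaN) never match, so they count as "nowhere".
has_known_source = routes["source_airport_id"].isin(airports["airport_id"])

n_from_nowhere = int((~has_known_source).sum())
print(n_from_nowhere)  # 2
```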
        
#### 6. Save your DataFrames for future use.  You may pickle them, put them in a shelve db, or in the tables of a SQL db.  Check to make sure that they are saved correctly.
    
 * Provide your commented, syntactically correct code and the results it produced.
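
As a sketch of the pickling option, the following saves a small DataFrame, reads it back, and checks that the two match. The file name is hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

# Small frame built from values in the sample airport records.
df = pd.DataFrame({"airport_id": [1, 2], "name": ["Goroka", "Madang"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "airports.pkl")  # hypothetical file name
    df.to_pickle(path)               # write the DataFrame to disk
    restored = pd.read_pickle(path)  # read it back

# equals() compares shape, column types, and values.
round_trip_ok = df.equals(restored)
print(round_trip_ok)
```

In your own work you would write to a permanent location rather than a temporary directory.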

## Tips and Hints

Before you can use pandas you may need to install it, if you haven't already done so.  You should be able to find it in Canopy and in Anaconda.

You might find the documentation at [pandas.pydata.org](http://pandas.pydata.org) useful in addition to what has already been made available.

Pickling is a basic and venerable Python "serializing" method. Pickling and unpickling convert Python objects in RAM to and from byte streams so that they can persist.  Pickle files should be opened in binary mode: the oldest pickle protocol (protocol 0) produces human-readable ASCII, while newer protocols produce compact binary data.  Here's a nice piece about pickling:
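
A minimal round trip with the standard library's `pickle` module, using a plain dictionary built from values in the first sample airport record, might look like this:

```python
import pickle

# A plain Python object to serialize.
record = {"airport_id": 1, "name": "Goroka", "altitude_ft": 5282}

payload = pickle.dumps(record)    # object -> bytes
restored = pickle.loads(payload)  # bytes -> object

print(type(payload).__name__)  # bytes
print(restored == record)      # True
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.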

(http://python.about.com/od/pythonstandardlibrary/a/pickle_intro.htm)

You can find a notebook about the __shelve__ module on Canvas. It should be in the _Resources_ section of the course home page.
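
As a rough sketch of the shelve option, the following stores a small DataFrame under a key and reads it back. The shelf name and key are hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import shelve
import tempfile

import pandas as pd

# Small frame built from values in the sample airport records.
df = pd.DataFrame({"airport_id": [1, 2], "name": ["Goroka", "Madang"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "flights_shelf")  # hypothetical name

    # A shelf behaves like a persistent dictionary keyed by strings.
    with shelve.open(db_path) as shelf:
        shelf["airports"] = df

    # Reopen the shelf to confirm the data really were written.
    with shelve.open(db_path) as shelf:
        keys = sorted(shelf.keys())
        restored = shelf["airports"]

print(keys)
print(df.equals(restored))
```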