# Background

Accidents are an unfortunate fact of air travel. Although flying is statistically safer than driving, minor and major flying accidents occur daily. In this project, we'll work with a data set of airplane accident statistics to analyze patterns and look for any common threads.

We'll be working with a data set that contains 77,282 aviation accidents that occurred in the U.S., and the metadata associated with them. The data in our AviationData.txt file comes from the __[National Transportation Safety Board (NTSB)](https://www.ntsb.gov/Pages/default.aspx)__. You can download the file at __[data.gov](https://catalog.data.gov/dataset/aviation-data-and-documentation-from-the-ntsb-accident-database-system-05748/resource/4b1e95fe-91a7-4112-85fa-424d2672a906)__. Here's a preview:

Event Id | Investigation Type | Accident Number | Event Date | Location | Country | Latitude | Longitude | Airport Code | Airport Name | Injury Severity | Aircraft Damage | Aircraft Category | Registration Number | Make | Model | Amateur Built | Number of Engines | Engine Type | FAR Description | Schedule | Purpose of Flight | Air Carrier | Total Fatal Injuries | Total Serious Injuries | Total Minor Injuries | Total Uninjured | Weather Condition | Broad Phase of Flight | Report Status | Publication Date |
20150908X74637 | Accident | CEN15LA402 | 09/08/2015 | Freeport, IL | United States | 42.246111 | -89.581945 | KFEP | albertus Airport | Non-Fatal | Substantial | Unknown | N24TL | CLARKE REGINALD W | DRAGONFLY MK |  |  |  | Part 91: General Aviation |  | Personal |  |  | 1 |  |  | VMC | TAKEOFF | Preliminary | 09/09/2015 |
20150906X32704 | Accident | ERA15LA339 | 09/05/2015 | Laconia, NH | United States | 43.606389 | -71.452778 | LCI | Laconia Municipal Airport | Fatal(1) | Substantial | Weight-Shift | N2264X | EVOLUTION AIRCRAFT INC | REVO | No | 1 | Reciprocating | Part 91: General Aviation |  | Personal |  | 1 |  |  |  | VMC | MANEUVERING | Preliminary | 09/10/2015 |
20150908X00229 | Accident | GAA15CA251 | 09/04/2015 | Hayes, SD | United States |  |  |  |  |  |  |  | N321DA | AIR TRACTOR INC | AT 402A |  |  |  |  |  |  |  |  |  |  |  |  |  | Preliminary |  |

As we can see, the file isn't in CSV format; it separates the fields with a pipe character (|) instead. In the following exercise, you'll need to do some custom parsing to read in AviationData.txt. Each row contains data about a single aviation accident. Here are descriptions for some of the columns:

Event Id - The unique id for the incident

Investigation Type - The type of investigation the NTSB conducted

Event Date - The date of the accident

Location - Where the accident occurred

Country - The country where the accident occurred

Latitude - The latitude where the accident occurred

Longitude - The longitude where the accident occurred

Injury Severity - The severity of any injuries

Aircraft Damage - The extent of the damage to the aircraft

Aircraft Category - The type of aircraft

Make - The make of the aircraft

Model - The model of the aircraft

Number of Engines - The number of engines on the plane

Air Carrier - The carrier operating the aircraft

Total Fatal Injuries - The number of fatal injuries

Total Serious Injuries - The number of serious injuries

Total Minor Injuries - The number of minor injuries

Total Uninjured - The number of people who did not sustain injuries

Broad Phase of Flight - The phase of flight during which the accident occurred
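
As a minimal sketch of the custom parsing that the pipe-delimited layout calls for, the snippet below splits one header line and one data line and strips the padding around each field. The helper name `parse_line` is hypothetical, and the two sample lines are shortened to the first four columns of the preview, so treat this as an illustration rather than the project's solution.

```python
# Hypothetical helper: split one pipe-delimited line into clean fields.
def parse_line(line):
    # Fields are separated by " | ", so split on "|" and strip the padding.
    return [field.strip() for field in line.split("|")]

# Shortened sample lines based on the preview (first four columns only).
header_line = "Event Id | Investigation Type | Accident Number | Event Date | "
data_line = "20150908X74637 | Accident | CEN15LA402 | 09/08/2015 | "

header = parse_line(header_line)
row = parse_line(data_line)

# The trailing delimiter produces one empty field at the end of each list.
print(header)
print(row)
```

Because every line ends with a delimiter, each parsed row carries a trailing empty string, which is worth remembering when the rows are later turned into dictionaries.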


## Step 1 - Instructions

1. Use the head and tail commands to explore AviationData.txt on the command line.
2. Open the empty Python script read.py.
3. In read.py, open AviationData.txt and read each line into a list.
        When you're finished, you should have a list of strings, each of which represents one line from AviationData.txt.
        Assign the result to aviation_data.
4. Loop through each item in aviation_data and split it on the pipe character (|). Be careful: the fields are actually separated by " | ", so strip the surrounding whitespace from each value.
        After the loop completes, you should have a list of lists. Each inner list should be a single row.
        Assign the result to aviation_list.
5. Create a list named lax_code.
6. Search through aviation_list for LAX94LA336. This value could be in any column and in any row.
        When you find the value, append the entire row to lax_code.
7. Were there any downsides to the approach you just took to search through AviationData.txt? Write some text explaining your answer.

In [16]:
## read the file
with open('AviationData.txt', 'r') as f:
    text = f.read()
aviation_data = text.split("\n")
aviation_list = []
for row in aviation_data:
    row_split = row.split("|")
    row_split_strip = [item.strip() for item in row_split]
    aviation_list.append(row_split_strip)
    
## Looking at the data
length = len(aviation_list)
head_data = aviation_list[:5]
print(length)
print(head_data)
    

77283
[['Event Id', 'Investigation Type', 'Accident Number', 'Event Date', 'Location', 'Country', 'Latitude', 'Longitude', 'Airport Code', 'Airport Name', 'Injury Severity', 'Aircraft Damage', 'Aircraft Category', 'Registration Number', 'Make', 'Model', 'Amateur Built', 'Number of Engines', 'Engine Type', 'FAR Description', 'Schedule', 'Purpose of Flight', 'Air Carrier', 'Total Fatal Injuries', 'Total Serious Injuries', 'Total Minor Injuries', 'Total Uninjured', 'Weather Condition', 'Broad Phase of Flight', 'Report Status', 'Publication Date', ''], ['20150908X74637', 'Accident', 'CEN15LA402', '09/08/2015', 'Freeport, IL', 'United States', '42.246111', '-89.581945', 'KFEP', 'albertus Airport', 'Non-Fatal', 'Substantial', 'Unknown', 'N24TL', 'CLARKE REGINALD W', 'DRAGONFLY MK', '', '', '', 'Part 91: General Aviation', '', 'Personal', '', '', '1', '', '', 'VMC', 'TAKEOFF', 'Preliminary', '09/09/2015', ''], ['20150906X32704', 'Accident', 'ERA15LA339', '09/05/2015', 'Laconia, NH', 'United S

In [34]:
## Search for "LAX94LA336" - lazy method (check every column of every row)
import time
start_time = time.time()
lax_code = []
for row in aviation_list:
    for item in row:
        if item == "LAX94LA336":
            lax_code.append(row)
            break
print(time.time()-start_time)
print(len(lax_code))
print(lax_code)



0.14743304252624512
1
[['20001218X45447', 'Accident', 'LAX94LA336', '07/19/1962', 'BRIDGEPORT, CA', 'United States', '', '', '', '', 'Fatal(4)', 'Destroyed', '', 'N5069P', 'PIPER', 'PA24-180', 'No', '1', 'Reciprocating', '', '', 'Personal', '', '4', '0', '0', '0', 'UNK', 'UNKNOWN', 'Probable Cause', '09/19/1996', '']]


In [35]:
%%timeit
lax_code = []
for row in aviation_list:
    for item in row:
        if item == "LAX94LA336":
            lax_code.append(row)
            break



72.1 ms ± 4.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Step 2 - Instructions - Linear and Log Time Algorithm

The algorithm you wrote on the previous screen took O(m × n) time, where n is the number of rows and m is the number of columns. That's because it had to loop through each row first, and then each column inside that row.

There are ways to make the search faster, though: checking a single known column in each row takes linear time, O(n), and a binary search on data sorted by that column takes logarithmic time, O(log n).

1. Write a linear time algorithm that searches each row in aviation_data for the string LAX94LA336. (Hint: check only the third column, Accident Number.)
2. Try writing a log(n) time algorithm that searches AviationData.txt for the string LAX94LA336. (Hint: run a binary search on a list sorted by that column.)
3. What are the trade-offs between the different approaches? Write some text explaining your answer.
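
To make the difference between scanning every column and scanning one known column concrete, here is a small sketch that counts comparisons on toy data. The rows, the target value, and the function names `scan_all_columns` and `scan_one_column` are all made up for illustration and do not come from AviationData.txt.

```python
# Toy data: 4 rows x 3 columns (hypothetical values, not from AviationData.txt).
rows = [
    ["id1", "Accident", "AAA01"],
    ["id2", "Accident", "BBB02"],
    ["id3", "Incident", "CCC03"],
    ["id4", "Accident", "DDD04"],
]
target = "DDD04"

def scan_all_columns(rows, target):
    """Check every field of every row; return (row, comparisons made)."""
    comparisons = 0
    for row in rows:
        for item in row:
            comparisons += 1
            if item == target:
                return row, comparisons
    return None, comparisons

def scan_one_column(rows, target, column):
    """Check a single known column of every row; return (row, comparisons made)."""
    comparisons = 0
    for row in rows:
        comparisons += 1
        if row[column] == target:
            return row, comparisons
    return None, comparisons

print(scan_all_columns(rows, target))    # up to rows * columns comparisons
print(scan_one_column(rows, target, 2))  # up to one comparison per row
```

With the target in the last field of the last row, the first function needs 12 comparisons and the second needs 4, which mirrors the O(m × n) versus O(n) difference.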

In [37]:
# Linear Search - fixed column
lax_code_linear = []
for row in aviation_list:
    if row[2] == "LAX94LA336":
        lax_code_linear.append(row)
        break

print(lax_code_linear)

[['20001218X45447', 'Accident', 'LAX94LA336', '07/19/1962', 'BRIDGEPORT, CA', 'United States', '', '', '', '', 'Fatal(4)', 'Destroyed', '', 'N5069P', 'PIPER', 'PA24-180', 'No', '1', 'Reciprocating', '', '', 'Personal', '', '4', '0', '0', '0', 'UNK', 'UNKNOWN', 'Probable Cause', '09/19/1996', '']]


In [54]:
%%timeit ## O(n)
lax_code_linear = []
for row in aviation_list:
    if row[2] == "LAX94LA336":
        lax_code_linear.append(row)
        break

14.2 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [46]:
# Log Search - fixed column and binary search on the dataset ordered by the 3rd column
aviation_list_clean = []
for row in aviation_list:
    if len(row)>=3:
        aviation_list_clean.append(row)
        
aviation_list_sorted = sorted(aviation_list_clean, key=lambda row: row[2])

lax_code_log = []
target = "LAX94LA336"
lower_bound = 0
upper_bound = len(aviation_list_sorted) - 1  ## inclusive upper bound, always a valid index
while lower_bound <= upper_bound:
    index = (lower_bound + upper_bound)//2
    guess = aviation_list_sorted[index][2]
    if guess == target:
        lax_code_log.append(aviation_list_sorted[index])
        break
    elif guess > target:
        upper_bound = index - 1
    else:
        lower_bound = index + 1
else:
    ## the loop ended without a break, so the target is not in the list
    print("Not Found!")

print(lax_code_log)

[['20001218X45447', 'Accident', 'LAX94LA336', '07/19/1962', 'BRIDGEPORT, CA', 'United States', '', '', '', '', 'Fatal(4)', 'Destroyed', '', 'N5069P', 'PIPER', 'PA24-180', 'No', '1', 'Reciprocating', '', '', 'Personal', '', '4', '0', '0', '0', 'UNK', 'UNKNOWN', 'Probable Cause', '09/19/1996', '']]
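
The same kind of lookup can also be sketched with the standard library's `bisect` module instead of a hand-written loop. The rows below are toy values (only `LAX94LA336` appears in the real data), and `find_by_accident_number` is a hypothetical helper name; the sketch keeps a parallel list of keys so it does not depend on the `key=` argument that newer Python versions add to `bisect_left`.

```python
from bisect import bisect_left

# Toy rows (hypothetical), to be sorted by the accident number in column 2.
rows = [
    ["e4", "Accident", "NYC05CA100"],
    ["e1", "Accident", "ANC01LA001"],
    ["e3", "Accident", "LAX94LA336"],
    ["e2", "Accident", "CHI02FA010"],
]
rows_sorted = sorted(rows, key=lambda row: row[2])
keys = [row[2] for row in rows_sorted]  # parallel list of sort keys

def find_by_accident_number(target):
    """Return the matching row, or None if the accident number is absent."""
    i = bisect_left(keys, target)  # leftmost position where target could go
    if i < len(keys) and keys[i] == target:
        return rows_sorted[i]
    return None

print(find_by_accident_number("LAX94LA336"))
print(find_by_accident_number("ZZZ99ZZ999"))
```

The bounds check before comparing `keys[i]` matters: for a target larger than every key, `bisect_left` returns `len(keys)`, which is not a valid index.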


In [51]:
%%timeit ## O(logn)
lax_code_log = []
target = "LAX94LA336"
lower_bound = 0
upper_bound = len(aviation_list_sorted) - 1  ## inclusive upper bound, always a valid index
while lower_bound <= upper_bound:
    index = (lower_bound + upper_bound)//2
    guess = aviation_list_sorted[index][2]
    if guess == target:
        lax_code_log.append(aviation_list_sorted[index])
        break
    elif guess > target:
        upper_bound = index - 1
    else:
        lower_bound = index + 1
else:
    ## the loop ended without a break, so the target is not in the list
    print("Not Found!")

4.29 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [52]:
%%timeit ## O(n)
aviation_list_clean = []
for row in aviation_list:
    if len(row)>=3:
        aviation_list_clean.append(row)

18.3 ms ± 643 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [53]:
%%timeit ## O(nlogn)
aviation_list_sorted = sorted(aviation_list_clean, key=lambda row: row[2])

51.4 ms ± 4.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
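
Taking the timings reported above at face value, a rough back-of-the-envelope calculation shows when the one-time cost of filtering and sorting pays off. This is only a sketch: the linear-search figure depends on where the target happens to sit in the list, and the variable names are made up for this example.

```python
import math

# Timings reported by %%timeit above, converted to milliseconds.
filter_ms = 18.3         # dropping rows with fewer than 3 fields
sort_ms = 51.4           # sorting by the accident number
linear_ms = 14.2         # one fixed-column linear search
binary_ms = 4.29 / 1000  # one binary search (4.29 µs)

setup_ms = filter_ms + sort_ms
saving_per_search_ms = linear_ms - binary_ms

# Number of searches after which sort-then-binary-search is cheaper overall.
break_even = math.ceil(setup_ms / saving_per_search_ms)
print(break_even)
```

Under these numbers the sorted approach comes out ahead from the fifth search onward; for a single lookup, the plain linear search is the cheaper choice.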


## Step 3 - Instructions - Hash Tables

So far, you've stored the data as a list of strings and a list of lists. You can also store the data as a list of dictionaries.

1. Create an empty list and name it aviation_dict_list.
2. Loop through each item in aviation_data and split it on the pipe character (|).
        Convert the split row to a dictionary. The dictionary should use the column names as keys, and the row's values as its own values. Here's an example of a single item:
        {"Event Id": "20150908X74637", "Investigation Type": "Accident", ...}
        Append the result to aviation_dict_list.
3. Create an empty list and name it lax_dict.
4. Search through aviation_dict_list for LAX94LA336. This value could be in any key in any dictionary.
        When you find the value, append the entire dictionary to lax_dict.
5. Was it harder or easier to search through a list of dictionaries than a list of lists? Write your thoughts in a text file.



In [103]:
aviation_dict_list = []
header_split = aviation_data[0].split("|") ## header
header_list = [item.strip() for item in header_split]
for row in aviation_data[1:]: ## Non-header
    row_split = row.split("|")
    row_split_strip = [item.strip() for item in row_split]
    row_dict = dict(zip(header_list, row_split_strip))  ## pair each column name with its value
    aviation_dict_list.append(row_dict)

In [111]:
lax_dict = []
for row in aviation_dict_list:
    try: 
        if row['Accident Number'] == "LAX94LA336":
            lax_dict.append(row)
            break
    except KeyError:  ## malformed rows (e.g. a blank line) lack this key
        pass
    
print(lax_dict)
    

[{'Event Id': '20001218X45447', 'Investigation Type': 'Accident', 'Accident Number': 'LAX94LA336', 'Event Date': '07/19/1962', 'Location': 'BRIDGEPORT, CA', 'Country': 'United States', 'Latitude': '', 'Longitude': '', 'Airport Code': '', 'Airport Name': '', 'Injury Severity': 'Fatal(4)', 'Aircraft Damage': 'Destroyed', 'Aircraft Category': '', 'Registration Number': 'N5069P', 'Make': 'PIPER', 'Model': 'PA24-180', 'Amateur Built': 'No', 'Number of Engines': '1', 'Engine Type': 'Reciprocating', 'FAR Description': '', 'Schedule': '', 'Purpose of Flight': 'Personal', 'Air Carrier': '', 'Total Fatal Injuries': '4', 'Total Serious Injuries': '0', 'Total Minor Injuries': '0', 'Total Uninjured': '0', 'Weather Condition': 'UNK', 'Broad Phase of Flight': 'UNKNOWN', 'Report Status': 'Probable Cause', 'Publication Date': '09/19/1996', '': ''}]


In [112]:
%%timeit ## time calculation
lax_dict = []
for row in aviation_dict_list:
    try: 
        if row['Accident Number'] == "LAX94LA336":
            lax_dict.append(row)
            break
    except KeyError:  ## malformed rows (e.g. a blank line) lack this key
        pass

26.4 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


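### Thoughts on Step 3
| Method | Timeit | Complexity |
| --- | --- | --- |
| Nested loop (list of lists, every column) | 72.1 ms | O(m × n) |
| Linear search (list of lists, fixed column) | 14.2 ms | O(n) |
| Binary search (sorted list of lists) | 4.29 µs | O(log n) |
| Sorting the list (one-time setup for binary search) | 51.4 ms | O(n log n) |
| Linear search (list of dictionaries, fixed key) | 26.4 ms | O(n) |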
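
Searching a list of dictionaries is still a linear scan. One way to get an actual hash-table lookup is to build a dictionary keyed by accident number once and then query it directly. The sketch below assumes toy records and a made-up name, `index_by_accident`; it is not part of the original exercise.

```python
# Toy list of dictionaries (hypothetical rows, not taken from the file).
records = [
    {"Event Id": "e1", "Accident Number": "CEN15LA402"},
    {"Event Id": "e2", "Accident Number": "LAX94LA336"},
    {"Event Id": "e3"},  # malformed row without an accident number
]

# Build the index once: O(n). Rows lacking the key are skipped.
index_by_accident = {}
for record in records:
    number = record.get("Accident Number")
    if number:
        index_by_accident.setdefault(number, []).append(record)

# Each lookup is then a single hash-table access, O(1) on average.
matches = index_by_accident.get("LAX94LA336", [])
print(matches)
```

Each key maps to a list so that repeated accident numbers, if any exist, are all kept. The trade-off is the extra memory for the index and the O(n) pass needed to build it.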


## Step 4 - Instructions - Accidents by U.S. State

You now have two representations of the data - aviation_dict_list and aviation_list. In the analysis on the next few screens, feel free to choose the representation that makes the analysis the easiest.

On this screen, you'll count how many accidents occurred in each U.S. state, then determine which state had the most accidents overall.

1. Count up how many accidents occurred in each U.S. state, and assign the result to state_accidents.
    You can parse the state by splitting the Location field and extracting the state.
2. Sort state_accidents, and extract the name of the state with the most aviation accidents.

In [119]:
state_list = [] ## extract state from each row
for row in aviation_dict_list:
    try:
        state = row['Location'].split(',')[1].strip()
        state_list.append(state)
    except (KeyError, IndexError):  ## no Location key, or no comma to split on
        pass
        

In [120]:
state_accident_counts = {} ## counts for each state with loop
for st in state_list:
    if st in state_accident_counts:
        state_accident_counts[st] += 1
    else:
        state_accident_counts[st] = 1
        
state_accident_counts_sorted = sorted(state_accident_counts.items(), key=lambda x: x[1], reverse=True)  ## sort dict by value    
print(state_accident_counts_sorted[:10]) ## display the 10 states with the most accidents

[('CA', 8029), ('FL', 5117), ('TX', 5112), ('AK', 5049), ('AZ', 2502), ('CO', 2458), ('WA', 2354), ('IL', 1874), ('MI', 1863), ('GA', 1746)]


In [123]:
## counts for each state with "Counter" class
from collections import Counter
counter_state = Counter(state_list)
counter_state.most_common(10)

[('CA', 8029),
 ('FL', 5117),
 ('TX', 5112),
 ('AK', 5049),
 ('AZ', 2502),
 ('CO', 2458),
 ('WA', 2354),
 ('IL', 1874),
 ('MI', 1863),
 ('GA', 1746)]

## Step 5 - Instructions - Fatalities and Injuries by Month

You can also count how many fatalities and serious injuries occurred during each month.

1. Count how many fatalities and serious injuries occurred during each unique month and year, and assign the result to monthly_injuries.
    You can parse the date by splitting the Event Date column and extracting the month number.
    Total the fatalities and serious injuries by adding the numbers in the Total Fatal Injuries and Total Serious Injuries columns.
    These columns are blank for months with no fatalities or serious injuries, so you'll have to replace those empty slots with 0.
2. Turn monthly_injuries into two lists - one with the month names, and one with the counts.
3. Implement a clever way of displaying these lists so you can understand the number of fatalities and serious injuries per month.
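
One possible way to approach these steps is sketched below on a handful of sample rows. The first three rows reuse values from records shown earlier, the fourth (blank) row is made up to exercise the missing-data path, and the `to_int` helper and the `YYYY-MM` key format are assumptions chosen for this example.

```python
from collections import defaultdict

# Sample rows with only the needed keys (last row is a made-up blank record).
sample_rows = [
    {"Event Date": "09/08/2015", "Total Fatal Injuries": "", "Total Serious Injuries": "1"},
    {"Event Date": "09/05/2015", "Total Fatal Injuries": "1", "Total Serious Injuries": ""},
    {"Event Date": "07/19/1962", "Total Fatal Injuries": "4", "Total Serious Injuries": "0"},
    {"Event Date": "", "Total Fatal Injuries": "", "Total Serious Injuries": ""},
]

def to_int(value):
    """Treat blank counts as 0."""
    return int(value) if value.strip() else 0

monthly_injuries = defaultdict(int)
for row in sample_rows:
    parts = row.get("Event Date", "").split("/")
    if len(parts) != 3:
        continue  # skip rows without a MM/DD/YYYY date
    month, _, year = parts
    key = f"{year}-{month}"  # e.g. "2015-09", sorts chronologically
    monthly_injuries[key] += to_int(row.get("Total Fatal Injuries", ""))
    monthly_injuries[key] += to_int(row.get("Total Serious Injuries", ""))

months = sorted(monthly_injuries)
counts = [monthly_injuries[m] for m in months]

# Text bar chart: one "#" per fatality or serious injury.
for month, count in zip(months, counts):
    print(f"{month} {'#' * count} ({count})")
```

Keying on year and month together keeps the labels in chronological order when sorted as plain strings. On the full data set, a plotting library would likely be more readable than the text bars.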


## Step 6 - Instructions - What's Next

1. Map out accidents using the basemap library for matplotlib.
2. Count the number of accidents by air carrier.
3. Count the number of accidents by airplane make and model.
4. Figure out what percentage of accidents occur under adverse weather conditions.
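
As a starting point for the counting tasks above, here is a small sketch on sample rows. It assumes that the Weather Condition column uses `IMC` (instrument meteorological conditions) alongside the `VMC` and `UNK` values seen earlier, and that `IMC` is a reasonable stand-in for adverse weather; both are assumptions to check against the real file. The fourth sample row is made up.

```python
from collections import Counter

# Sample rows with only the needed keys. The first three use values from rows
# shown earlier; the fourth is made up to include an "IMC" entry.
sample_rows = [
    {"Make": "CLARKE REGINALD W", "Model": "DRAGONFLY MK", "Weather Condition": "VMC"},
    {"Make": "EVOLUTION AIRCRAFT INC", "Model": "REVO", "Weather Condition": "VMC"},
    {"Make": "PIPER", "Model": "PA24-180", "Weather Condition": "UNK"},
    {"Make": "PIPER", "Model": "PA24-180", "Weather Condition": "IMC"},
]

# Count accidents per (make, model) pair.
make_model_counts = Counter((row["Make"], row["Model"]) for row in sample_rows)
print(make_model_counts.most_common(1))

# Share of accidents in IMC among rows with a known weather condition.
known = [row for row in sample_rows if row["Weather Condition"] in ("VMC", "IMC")]
imc_share = 100 * sum(row["Weather Condition"] == "IMC" for row in known) / len(known)
print(f"{imc_share:.1f}% of accidents with known weather occurred in IMC")
```

Rows with an unknown weather condition are left out of the denominator here; including them would give a lower percentage, so the choice should be stated alongside any result.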