# Collecting and Selecting Data

In this section, we will cover the basics of collecting and cleaning a dataset. We will load a dataset, assess its features, and glean as much context as possible. We will then use that context to narrow and clean our data down to something of interest.

A common mistake data science professionals make is diving straight into a dataset without asking basic questions, such as: where did the dataset come from? I will argue that bias in data is inevitable, because reality itself is biased. We will keep this in mind as we think through how the data was collected and base our decisions on it accordingly.

## The Bird Strike Dataset 

This dataset covers aircraft bird strikes as reported by the Federal Aviation Administration (FAA) in the United States. A bird strike occurs when a bird collides with an aircraft, and the damage can be severe. On average, there are 13,000 bird strikes each year in the United States alone, costing the aviation industry $400 million in damages. While most bird strikes are minor, some can be dangerous and even fatal.

<a title="Greg L, CC BY 2.0 &lt;https://creativecommons.org/licenses/by/2.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:US_Airways_Flight_1549_(N106US)_after_crashing_into_the_Hudson_River_(crop_2).jpg"><img width="512" alt="US Airways Flight 1549 (N106US) after crashing into the Hudson River (crop 2)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/US_Airways_Flight_1549_%28N106US%29_after_crashing_into_the_Hudson_River_%28crop_2%29.jpg/512px-US_Airways_Flight_1549_%28N106US%29_after_crashing_into_the_Hudson_River_%28crop_2%29.jpg?20200816213116"></a>

**In 2009, US Airways Flight 1549 suffered a major birdstrike resulting in an emergency landing in the Hudson River.**<br>
*Greg L, CC BY 2.0, via Wikimedia Commons*

To get started, use `pip` or `conda` to install `openpyxl`. Pandas needs this library to load Excel files. If you are using plain Python, make sure to install `pandas` too.

In [1]:
!pip install openpyxl pandas



In [2]:
!conda install openpyxl pandas --yes

zsh:1: command not found: conda


## FAA Birdstrike Dataset 

We are going to use the FAA bird strike dataset for this data exploration exercise. This dataset was obtained from the [FAA Wildlife Strike Database](https://wildlife.faa.gov/search). To keep the experience raw, I have not cleaned or converted the exported Excel `xlsx` file into a different format like SQLite, CSV, or Pickle. 

Below, let's load the Excel file into a Pandas `DataFrame`. This file is about 129MB so it may take a moment. 

In [3]:
import pandas as pd

df = pd.read_excel('bird_strike_faa.xlsx')

Let's display the DataFrame with all of its columns. 

In [4]:
with pd.option_context('display.max_columns', None):
    display(df)

Unnamed: 0,INDEX_NR,INCIDENT_DATE,INCIDENT_MONTH,INCIDENT_YEAR,TIME,TIME_OF_DAY,AIRPORT_ID,AIRPORT,LATITUDE,LONGITUDE,RUNWAY,STATE,FAAREGION,LOCATION,OPID,OPERATOR,REG,FLT,AIRCRAFT,AMA,AMO,EMA,EMO,AC_CLASS,AC_MASS,TYPE_ENG,NUM_ENGS,ENG_1_POS,ENG_2_POS,ENG_3_POS,ENG_4_POS,PHASE_OF_FLIGHT,HEIGHT,SPEED,DISTANCE,SKY,PRECIPITATION,AOS,COST_REPAIRS,COST_OTHER,COST_REPAIRS_INFL_ADJ,COST_OTHER_INFL_ADJ,INGESTED_OTHER,INDICATED_DAMAGE,DAMAGE_LEVEL,STR_RAD,DAM_RAD,STR_WINDSHLD,DAM_WINDSHLD,STR_NOSE,DAM_NOSE,STR_ENG1,DAM_ENG1,ING_ENG1,STR_ENG2,DAM_ENG2,ING_ENG2,STR_ENG3,DAM_ENG3,ING_ENG3,STR_ENG4,DAM_ENG4,ING_ENG4,STR_PROP,DAM_PROP,STR_WING_ROT,DAM_WING_ROT,STR_FUSE,DAM_FUSE,STR_LG,DAM_LG,STR_TAIL,DAM_TAIL,STR_LGHTS,DAM_LGHTS,STR_OTHER,DAM_OTHER,OTHER_SPECIFY,EFFECT,EFFECT_OTHER,BIRD_BAND_NUMBER,SPECIES_ID,SPECIES,OUT_OF_RANGE_SPECIES,REMARKS,REMAINS_COLLECTED,REMAINS_SENT,WARNED,NUM_SEEN,NUM_STRUCK,SIZE,ENROUTE_STATE,NR_INJURIES,NR_FATALITIES,COMMENTS,REPORTED_NAME,REPORTED_TITLE,SOURCE,PERSON,LUPDATE,TRANSFER
0,608242,1996-06-22,6,1996,,,KSMF,SACRAMENTO INTL,38.69542,-121.59077,,CA,AWP,,UAL,UNITED AIRLINES,,1768,B-737-300,148,24,10.0,1.0,A,4.0,D,2.0,1.0,1.0,,,Take-off Run,0.0,,0.0,,,,,,,,0,0,,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,UNKBM,Unknown bird - medium,0,BLOOD ON L FWD NOSE AREA SEEN BY CREW AFTER LDG.,0,0,Unknown,,1,Medium,,,,/Legacy Record=100001/,REDACTED,REDACTED,Air Transport Report,Air Transport Operations,2007-12-20,0
1,608243,1996-06-26,6,1996,,,KDEN,DENVER INTL AIRPORT,39.85841,-104.667,,CO,ANM,,UAL,UNITED AIRLINES,,1845,B-737-300,148,24,10.0,1.0,A,4.0,D,2.0,1.0,1.0,,,Take-off Run,0.0,,0.0,,,,,,,,0,0,,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,UNKBM,Unknown bird - medium,0,CREW SUSPECTED BIRDSTRIKE ON T/O. LOOKED LIKE ...,0,0,Unknown,,1,Medium,,,,/Legacy Record=100002/,REDACTED,REDACTED,Air Transport Report,Air Transport Operations,2007-12-20,0
2,608244,1996-07-01,7,1996,,,KOMA,EPPLEY AIRFIELD,41.30252,-95.89417,,NE,ACE,,UAL,UNITED AIRLINES,,306,B-757-200,148,26,34.0,40.0,A,4.0,D,2.0,1.0,1.0,,,Take-off Run,0.0,,0.0,,,,,,,,0,0,N,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,UNKBM,Unknown bird - medium,0,BIRDSTRIKE AT ROTATION. INSPN. NO DMG.,0,0,Unknown,,1,Medium,,,,/Legacy Record=100003/,REDACTED,REDACTED,Air Transport Report,Air Transport Operations,2007-12-20,0
3,608245,1996-07-01,7,1996,,,KIAD,WASHINGTON DULLES INTL ARPT,38.94453,-77.45581,,DC,AEA,,UAL,UNITED AIRLINES,,510,A-320,04A,3,23.0,1.0,A,4.0,D,2.0,1.0,1.0,,,Approach,1000.0,,,,,,,,,,0,0,N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,,,,,UNKBM,Unknown bird - medium,0,"ON FINAL APCH, STRIKE UNDER THE NOSE OF THE CO...",0,0,Unknown,,1,Medium,,,,/Legacy Record=100004/,REDACTED,REDACTED,Air Transport Report,Air Transport Operations,2007-12-20,0
4,608246,1996-07-01,7,1996,,,KLGA,LA GUARDIA ARPT,40.77724,-73.87261,,NY,AEA,,UAL,UNITED AIRLINES,,677,A-320,04A,3,23.0,1.0,A,4.0,D,2.0,1.0,1.0,,,Climb,5000.0,,,,,,,,,,1,1,M,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,,,,,UNKBM,Unknown bird - medium,0,LOUD NOISE WAS HEARD DURING CLIMBOUT THAT SOUN...,0,0,Unknown,,1,Medium,,,,/Legacy Record=100005/,REDACTED,REDACTED,Air Transport Report,Air Transport Operations,2007-12-20,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298196,1516465,2024-03-17,3,2024,07:15,,KSEA,SEATTLE-TACOMA INTL,47.44898,-122.30931,16L,WA,ANM,,UNK,UNKNOWN,,,UNKNOWN,,,,,,,,,,,,,,,,0.0,,,,,,,,0,0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,Z6007,American robin,0,Found dead robin on Runway Inspection. DAWN. 1...,1,0,Unknown,,1,Small,,,,,REDACTED,REDACTED,FAA Form 5200-7-E,Carcass Found,2024-05-01,0
298197,1516467,2024-03-17,3,2024,19:15,,KHYI,SAN MARCOS MUNICIPAL ARPT,29.89361,-97.86469,8,TX,ASW,,EJA,NETJETS,N237QS,237,CL-601/604,188,7,22.0,4.0,A,3.0,D,2.0,5.0,5.0,,,Approach,2000.0,160.0,5.0,Some Cloud,,,,,,,0,0,N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,,,,,ZT000,Meadowlarks,0,"none, we were not aware we hit a bird until af...",1,1,No,1,1,Small,,,,,REDACTED,REDACTED,FAA Form 5200-7-E,Pilot,2024-05-01,0
298198,1516468,2024-03-17,3,2024,16:39,Day,KSBA,SANTA BARBARA MUNICIPAL,34.42621,-119.84037,,CA,AWP,,ASA,ALASKA AIRLINES,N590AS,1141,B-737-800,148,43,10.0,1.0,A,4.0,D,2.0,1.0,1.0,,,Approach,30.0,,0.0,No Cloud,,0.35,,,,,0,0,N,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,ZT002,Western meadowlark,0,"Aircraft out of service, maintenance enroute t...",1,0,Unknown,2-10,2-10,Small,,,,,REDACTED,REDACTED,FAA Form 5200-7-E,Airport Operations,2024-05-01,0
298199,1516469,2024-03-17,3,2024,21:45,Night,ZGGG,BAIYUN AIRPORT,23.392436,113.298786,20L,FN,FGN,,FDX,FEDEX EXPRESS,N103FE,6024,B-767-300,148,97,22.0,7.0,A,4.0,D,2.0,1.0,1.0,,,Approach,3000.0,180.0,10.0,No Cloud,,,,,,,0,0,N,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,UNKB,Unknown bird,0,10 mile final. NOTE: NUMBER STRUCK NOT REPORTE...,0,0,Unknown,,1,,,,,,REDACTED,REDACTED,FAA Form 5200-7-E,Pilot,2024-05-01,0


Yikes. That is a lot of columns. Just how many rows and columns are we dealing with? We can find out by running `df.shape`.

In [5]:
df.shape

(298201, 101)

So we have 298,201 rows and 101 columns. We will thoughtfully go through these and narrow our dataset to only certain columns and rows, then save the result for later. To get there though, let's try to understand our data first.

We have another file that might be useful: `read_me.xlsx`. This Excel workbook includes sheets with descriptions of each column, as well as engine codes, aircraft types, and engine positions. Those last three sheets help make sense of columns that use coding conventions. Let's display all the column descriptions in a `DataFrame`, and make sure no truncation happens by setting `display.max_rows` and `display.max_colwidth` to `None`.

In [6]:
schema = pd.read_excel('read_me.xlsx',sheet_name='Column Name')

with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
    display(schema)

Unnamed: 0,Column Name,Explanation of Column Name and Codes
0,INDEX NR,Individual record number
1,OPID,Airline operator code
2,OPERATOR,"A three letter International Civil Aviation Organization code for aircraft operators. (BUS = business, PVT = private aircraft other than business, GOV = government aircraft, MIL - military aircraft.)"
3,ATYPE,Aircraft
4,AMA,International Civil Aviation Organization code for Aircraft Make
5,AMO,International Civil Aviation Organization code for Aircraft Model
6,EMA,Engine Make Code (see Engine Codes tab below)
7,EMO,Engine Model Code (see Engine Codes tab below)
8,AC_CLASS,Type of aircraft (see Aircraft Type tab below)
9,AC_MASS,"1 = 2,250 kg or less: 2 = ,2251-5700 kg: 3 = 5,701-27,000 kg: 4 = 27,001-272,000 kg: 5 = above 272,000 kg"


Oh man... talk about drinking from a firehose! But at least we have a description of every column. Let's take a look at the data type of every column next.

In [7]:
with pd.option_context('display.max_rows', None):
    display(df.dtypes)

INDEX_NR                   int64
INCIDENT_DATE             object
INCIDENT_MONTH             int64
INCIDENT_YEAR              int64
TIME                      object
TIME_OF_DAY               object
AIRPORT_ID                object
AIRPORT                   object
LATITUDE                  object
LONGITUDE                 object
RUNWAY                    object
STATE                     object
FAAREGION                 object
LOCATION                  object
OPID                      object
OPERATOR                  object
REG                       object
FLT                       object
AIRCRAFT                  object
AMA                       object
AMO                       object
EMA                      float64
EMO                      float64
AC_CLASS                  object
AC_MASS                  float64
TYPE_ENG                  object
NUM_ENGS                 float64
ENG_1_POS                float64
ENG_2_POS                float64
ENG_3_POS                float64
ENG_4_POS 

This is not something we just want to gloss over. We really want to take time to understand this data, ask questions, and consult experts if needed. It is important to understand what we are trying to achieve here: find possible causes for bird strikes. 

**PLEASE ASK QUESTIONS! DON'T WORK WITH DATA YOU DON'T FULLY UNDERSTAND. EVER!**
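In that spirit, here is a minimal sketch of a few first questions to put to any column: how much is missing, how many distinct values there are, and what the most common values look like. The tiny `sample` frame and the `profile` helper are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
sample = pd.DataFrame({
    "PHASE_OF_FLIGHT": ["Approach", "Approach", None, "Climb", "Take-off Run"],
    "LATITUDE": ["38.69542", "39.85841", None, "40.77724", "41.30252"],
})

def profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, share of missing values, and distinct count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing_share": frame.isna().mean(),
        "n_unique": frame.nunique(),
    })

summary = profile(sample)
print(summary)

# Most common values for a single column, counting missing values too.
print(sample["PHASE_OF_FLIGHT"].value_counts(dropna=False))
```

A high share of missing values or an unexpected dtype (such as coordinates stored as text) is exactly the kind of thing worth raising with someone who knows the data.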

## Some Basic Cleaning

Based on these column descriptions, `INDEX_NR` is the unique identifier for each record, so it is safe to make it our `DataFrame` index. Let's get that done.

In [8]:
df.set_index('INDEX_NR', inplace=True)
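Before trusting a column as an index, it can be worth confirming that it really is unique. The following is a small sketch on a made-up frame (the duplicate is deliberate and the values are invented); `verify_integrity=True` asks `set_index` to raise an error if duplicates exist.

```python
import pandas as pd

toy = pd.DataFrame({
    "INDEX_NR": [608242, 608243, 608243],  # deliberate duplicate for the demo
    "STATE": ["CA", "CO", "CO"],
})

print(toy["INDEX_NR"].is_unique)  # False, because 608243 appears twice

try:
    toy.set_index("INDEX_NR", verify_integrity=True)
    duplicate_found = False
except ValueError:
    # pandas raises ValueError when the new index would contain duplicates
    duplicate_found = True

# Dropping exact duplicate rows first lets the index be set safely.
clean = toy.drop_duplicates().set_index("INDEX_NR", verify_integrity=True)
```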

Let's say we talked to a weather-hardened pilot who gave us further context for these descriptions. We also found a friendly retired expert from the FAA who knows this dataset really well. Our curiosity starts to reward us, and we begin to understand the bigger picture of this dataset. We also take the opportunity to ask their opinions on what causes bird strikes. Perhaps we are biasing our future conclusions already, but it is always a good idea to consult the experts on everything you can once you have them.

![](resource/ZPGDGucW.svg)

Because of this knowledge, we start to have a good idea of what data we want to include. For starters, we want to include only recent data, from 2015 onwards. We have deemed there is little value in looking at older data. One can argue that the nature of bird strikes never changes, but the environment certainly does. Schedules and airports grow and shrink, weather patterns change, and airlines come and go. Even the [FAA itself](https://wildlife.faa.gov/home) says:

> Expanding wildlife populations, increases in number of aircraft movements, a trend toward faster and quieter aircraft, and outreach to the aviation community all have contributed to the observed increase in reported wildlife strikes.

So nothing is ever static and things are always changing. For this reason, we make a judgment to limit our data to 2015 onwards. Let's delegate this to Pandas. 

In [9]:
df = df[df['INCIDENT_YEAR'] >= 2015]

For the record, we can use `>=`, `<=`, `<`, `>`, `==`, and `!=` for equality and inequality filtering.
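These comparisons can also be combined. Here is a short sketch on an invented frame showing how conditions are joined with `&` and `|`, plus two convenience methods, `between()` and `isin()`. The data values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "INCIDENT_YEAR": [2014, 2015, 2019, 2023, 2024],
    "STATE": ["CA", "TX", "CA", "NY", "CA"],
})

# Each condition needs its own parentheses; & means AND, | means OR, ~ means NOT.
recent_ca = toy[(toy["INCIDENT_YEAR"] >= 2015) & (toy["STATE"] == "CA")]

# between() is inclusive on both ends by default.
mid_years = toy[toy["INCIDENT_YEAR"].between(2015, 2019)]

# isin() checks membership in a list of values.
coastal = toy[toy["STATE"].isin(["CA", "NY"])]
```

The parentheses matter because `&` and `|` bind more tightly than the comparison operators in Python.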

After removing everything before 2015, how many rows do we have now? 

In [10]:
df.shape

(141069, 100)

So we went from 298,201 rows to 141,069. That removes a substantial amount of data, which leaves less to sift through and makes the dataset easier to work with. Just be mindful and document every decision you make along the way. Every decision introduces a bias, and while bias is not inherently bad, it will steer your conclusions, whether they are right or wrong.

Let's take a look at our `DataFrame` again, this time sorting the columns alphabetically to make them easier to pick through.
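One lightweight way to document filtering decisions is to record the row counts before and after each step. The sketch below is only one possible approach; `audit_log` and `filter_and_log` are hypothetical names introduced here, and the toy data is invented.

```python
import pandas as pd

audit_log = []  # hypothetical list of (description, rows before, rows after)

def filter_and_log(frame, mask, description):
    """Apply a boolean mask and record how many rows it kept."""
    result = frame[mask]
    audit_log.append((description, len(frame), len(result)))
    return result

toy = pd.DataFrame({"INCIDENT_YEAR": [1996, 2007, 2015, 2020, 2024]})
toy = filter_and_log(toy, toy["INCIDENT_YEAR"] >= 2015, "keep 2015 onwards")

for description, before, after in audit_log:
    print(f"{description}: {before} -> {after} rows")
```

Keeping such a log next to the notebook makes it easier to explain later why the dataset shrank and by how much.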

In [11]:
with pd.option_context('display.max_columns', None):
    display(df.sort_index(axis = 1))

INDEX_NR,AC_CLASS,AC_MASS,AIRCRAFT,AIRPORT,AIRPORT_ID,AMA,AMO,AOS,BIRD_BAND_NUMBER,COMMENTS,COST_OTHER,COST_OTHER_INFL_ADJ,COST_REPAIRS,COST_REPAIRS_INFL_ADJ,DAMAGE_LEVEL,DAM_ENG1,DAM_ENG2,DAM_ENG3,DAM_ENG4,DAM_FUSE,DAM_LG,DAM_LGHTS,DAM_NOSE,DAM_OTHER,DAM_PROP,DAM_RAD,DAM_TAIL,DAM_WINDSHLD,DAM_WING_ROT,DISTANCE,EFFECT,EFFECT_OTHER,EMA,EMO,ENG_1_POS,ENG_2_POS,ENG_3_POS,ENG_4_POS,ENROUTE_STATE,FAAREGION,FLT,HEIGHT,INCIDENT_DATE,INCIDENT_MONTH,INCIDENT_YEAR,INDICATED_DAMAGE,INGESTED_OTHER,ING_ENG1,ING_ENG2,ING_ENG3,ING_ENG4,LATITUDE,LOCATION,LONGITUDE,LUPDATE,NR_FATALITIES,NR_INJURIES,NUM_ENGS,NUM_SEEN,NUM_STRUCK,OPERATOR,OPID,OTHER_SPECIFY,OUT_OF_RANGE_SPECIES,PERSON,PHASE_OF_FLIGHT,PRECIPITATION,REG,REMAINS_COLLECTED,REMAINS_SENT,REMARKS,REPORTED_NAME,REPORTED_TITLE,RUNWAY,SIZE,SKY,SOURCE,SPECIES,SPECIES_ID,SPEED,STATE,STR_ENG1,STR_ENG2,STR_ENG3,STR_ENG4,STR_FUSE,STR_LG,STR_LGHTS,STR_NOSE,STR_OTHER,STR_PROP,STR_RAD,STR_TAIL,STR_WINDSHLD,STR_WING_ROT,TIME,TIME_OF_DAY,TRANSFER,TYPE_ENG,WARNED
708307,A,1.0,PA-28,VERO BEACH MUNICIPAL,KVRB,729,23,,,/Legacy Record=1/,,,,,M,0,0,0,0,0,0,0,0,0,0,0,0,0,1,,,,,,7.0,,,,,ASO,,,2015-05-22,5,2015,1,0,0,0,0,0,27.65556,,-80.41794,2019-07-27,,,1.0,,1,BUSINESS,BUS,,0,Tower,Approach,,N9240F,0,0,"N9240F was right base to final on Runway 4, an...",REDACTED,REDACTED,4,,,MOR,Unknown bird,UNKB,,FL,0,0,0,0,0,0,0,0,0,0,0,0,0,1,,,0,A,Unknown
708308,A,3.0,BE-1900,KENAI MUNICIPAL ARPT,PAEN,123,27,,,/Legacy Record=2/,,,,,M,0,0,0,0,1,0,0,0,0,0,0,0,0,0,,,,31.0,4.0,4.0,4.0,,,,AAL,820,,2015-06-18,6,2015,1,0,0,0,0,0,60.572,,-151.24753,2019-07-27,,,2.0,,1,BUSINESS,BUS,,0,Tower,Approach,,,0,0,"ON FINAL, PILOT REPORTED HEARING IMPACT. AFTER...",REDACTED,REDACTED,,,,MOR,Unknown bird,UNKB,,AK,0,0,0,0,1,0,0,0,0,0,0,0,0,0,,,0,C,Unknown
708309,A,1.0,PA-46 MALIBU,DAVID WAYNE HOOKS MEMORIAL ARPT,KDWH,729,45,,,/Legacy Record=4/,,,,,M,1,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,7.0,,,,,ASW,,,2015-09-20,9,2015,1,0,0,0,0,0,30.06186,,-95.55278,2019-07-27,,,1.0,,1,BUSINESS,BUS,,0,Tower,,,N952G,0,0,"N952G, P46T/G, bird strike, no injuries report...",REDACTED,REDACTED,,,,MOR,Unknown bird,UNKB,,TX,1,0,0,0,0,0,0,0,0,0,0,0,0,0,,,0,A,Unknown
708310,A,4.0,B-717-200,LAMBERT-ST LOUIS INTL,KSTL,148,45,,,/Legacy Record=6/,,,,,M,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2.0,,,9.0,6.0,5.0,5.0,,,,ACE,939,,2015-11-07,11,2015,1,0,0,0,0,0,38.74769,,-90.35999,2019-07-27,,,2.0,,1,DELTA AIR LINES,DAL,,0,Tower,Approach,,,0,0,NO EMERGENCY: AFTER THE AIRCRAFT WAS AT THE GA...,REDACTED,REDACTED,30R,,,MOR,Unknown bird,UNKB,,MO,0,0,0,0,0,0,1,0,0,0,0,0,0,1,,,0,D,Unknown
708311,A,2.0,BE-90 KING,POMPANO BEACH AIRPARK,KPMP,,,,,/Legacy Record=7/,,,,,M,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0.0,,,,,4.0,4.0,,,,ASO,MEDFLY1,0.0,2015-12-17,12,2015,1,0,0,0,0,0,26.24714,,-80.11106,2019-07-27,,,2.0,,1,BUSINESS,BUS,,0,Tower,Landing Roll,,,0,0,"MEDFLY1, A BE9L, VFR, LANDED RY 15. UPON CONTA...",REDACTED,REDACTED,15,,,MOR,Unknown bird,UNKB,,FL,0,0,0,0,1,0,0,0,0,0,0,0,0,0,,,0,C,Unknown
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1516465,,,UNKNOWN,SEATTLE-TACOMA INTL,KSEA,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,,,,,,,,,,ANM,,,2024-03-17,3,2024,0,0,0,0,0,0,47.44898,,-122.30931,2024-05-01,,,,,1,UNKNOWN,UNK,,0,Carcass Found,,,,1,0,Found dead robin on Runway Inspection. DAWN. 1...,REDACTED,REDACTED,16L,Small,,FAA Form 5200-7-E,American robin,Z6007,,WA,0,0,0,0,0,0,0,0,0,0,0,0,0,0,07:15,,0,,Unknown
1516467,A,3.0,CL-601/604,SAN MARCOS MUNICIPAL ARPT,KHYI,188,7,,,,,,,,N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0,,,22.0,4.0,5.0,5.0,,,,ASW,237,2000.0,2024-03-17,3,2024,0,0,0,0,0,0,29.89361,,-97.86469,2024-05-01,,,2.0,1,1,NETJETS,EJA,,0,Pilot,Approach,,N237QS,1,1,"none, we were not aware we hit a bird until af...",REDACTED,REDACTED,8,Small,Some Cloud,FAA Form 5200-7-E,Meadowlarks,ZT000,160.0,TX,0,0,0,0,0,1,0,0,0,0,0,0,0,0,19:15,,0,D,No
1516468,A,4.0,B-737-800,SANTA BARBARA MUNICIPAL,KSBA,148,43,0.35,,,,,,,N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,,,10.0,1.0,1.0,1.0,,,,AWP,1141,30.0,2024-03-17,3,2024,0,0,0,0,0,0,34.42621,,-119.84037,2024-05-01,,,2.0,2-10,2-10,ALASKA AIRLINES,ASA,,0,Airport Operations,Approach,,N590AS,1,0,"Aircraft out of service, maintenance enroute t...",REDACTED,REDACTED,,Small,No Cloud,FAA Form 5200-7-E,Western meadowlark,ZT002,,CA,1,0,0,0,0,0,0,1,0,0,0,0,0,0,16:39,Day,0,D,Unknown
1516469,A,4.0,B-767-300,BAIYUN AIRPORT,ZGGG,148,97,,,,,,,,N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10.0,,,22.0,7.0,1.0,1.0,,,,FGN,6024,3000.0,2024-03-17,3,2024,0,0,0,0,0,0,23.392436,,113.298786,2024-05-01,,,2.0,,1,FEDEX EXPRESS,FDX,,0,Pilot,Approach,,N103FE,0,0,10 mile final. NOTE: NUMBER STRUCK NOT REPORTE...,REDACTED,REDACTED,20L,,No Cloud,FAA Form 5200-7-E,Unknown bird,UNKB,180.0,FN,0,0,0,0,0,0,0,0,0,0,0,0,1,0,21:45,Night,0,D,Unknown


Now let's weigh whether we want to omit any columns. Granted, in the next section we will pick and choose columns to aid our decisions and clean the data further. But for now we should consider which columns add only noise and can be excluded immediately.

In [12]:
chosen_cols = ["OPID",
"OPERATOR",
"AIRCRAFT",
"AC_CLASS",
"AC_MASS",
"NUM_ENGS",
"TYPE_ENG",
"INCIDENT_DATE",
"INCIDENT_YEAR", 
"INCIDENT_MONTH",
"TIME_OF_DAY",
"TIME",
"AIRPORT_ID",
"AIRPORT",
"STATE",
"RUNWAY",
"LOCATION",
"LATITUDE",
"LONGITUDE",
"HEIGHT",
"SPEED",
"DISTANCE",
"PHASE_OF_FLIGHT",
"DAMAGE_LEVEL",
"STR_RAD",
"DAM_RAD",
"STR_WINDSHLD",
"DAM_WINDSHLD",
"STR_NOSE",
"STR_PROP",
"DAM_PROP",
"STR_WING_ROT",
"DAM_WING_ROT",
"STR_FUSE",
"DAM_FUSE",
"STR_LG",
"DAM_LG",
"STR_TAIL",
"DAM_TAIL",
"STR_LGHTS",
"DAM_LGHTS",
"STR_OTHER",
"EFFECT",
"SKY",
"PRECIPITATION",
"SPECIES_ID",
"SPECIES",
"SIZE",
"WARNED",
"COST_REPAIRS",
"COST_OTHER",
"COST_REPAIRS_INFL_ADJ",
"COST_OTHER_INFL_ADJ",
"NR_INJURIES",
"NR_FATALITIES",
"INDICATED_DAMAGE"]

df = df[chosen_cols]

with pd.option_context('display.max_columns', None):
    display(df)

INDEX_NR,OPID,OPERATOR,AIRCRAFT,AC_CLASS,AC_MASS,NUM_ENGS,TYPE_ENG,INCIDENT_DATE,INCIDENT_YEAR,INCIDENT_MONTH,TIME_OF_DAY,TIME,AIRPORT_ID,AIRPORT,STATE,RUNWAY,LOCATION,LATITUDE,LONGITUDE,HEIGHT,SPEED,DISTANCE,PHASE_OF_FLIGHT,DAMAGE_LEVEL,STR_RAD,DAM_RAD,STR_WINDSHLD,DAM_WINDSHLD,STR_NOSE,STR_PROP,DAM_PROP,STR_WING_ROT,DAM_WING_ROT,STR_FUSE,DAM_FUSE,STR_LG,DAM_LG,STR_TAIL,DAM_TAIL,STR_LGHTS,DAM_LGHTS,STR_OTHER,EFFECT,SKY,PRECIPITATION,SPECIES_ID,SPECIES,SIZE,WARNED,COST_REPAIRS,COST_OTHER,COST_REPAIRS_INFL_ADJ,COST_OTHER_INFL_ADJ,NR_INJURIES,NR_FATALITIES,INDICATED_DAMAGE
708307,BUS,BUSINESS,PA-28,A,1.0,1.0,A,2015-05-22,2015,5,,,KVRB,VERO BEACH MUNICIPAL,FL,4,,27.65556,-80.41794,,,,Approach,M,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,,,,UNKB,Unknown bird,,Unknown,,,,,,,1
708308,BUS,BUSINESS,BE-1900,A,3.0,2.0,C,2015-06-18,2015,6,,,PAEN,KENAI MUNICIPAL ARPT,AK,,,60.572,-151.24753,,,,Approach,M,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,,,,UNKB,Unknown bird,,Unknown,,,,,,,1
708309,BUS,BUSINESS,PA-46 MALIBU,A,1.0,1.0,A,2015-09-20,2015,9,,,KDWH,DAVID WAYNE HOOKS MEMORIAL ARPT,TX,,,30.06186,-95.55278,,,,,M,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,UNKB,Unknown bird,,Unknown,,,,,,,1
708310,DAL,DELTA AIR LINES,B-717-200,A,4.0,2.0,D,2015-11-07,2015,11,,,KSTL,LAMBERT-ST LOUIS INTL,MO,30R,,38.74769,-90.35999,,,2.0,Approach,M,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,,,,UNKB,Unknown bird,,Unknown,,,,,,,1
708311,BUS,BUSINESS,BE-90 KING,A,2.0,2.0,C,2015-12-17,2015,12,,,KPMP,POMPANO BEACH AIRPARK,FL,15,,26.24714,-80.11106,0.0,,0.0,Landing Roll,M,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,,,,UNKB,Unknown bird,,Unknown,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1516465,UNK,UNKNOWN,UNKNOWN,,,,,2024-03-17,2024,3,,07:15,KSEA,SEATTLE-TACOMA INTL,WA,16L,,47.44898,-122.30931,,,0.0,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,Z6007,American robin,Small,Unknown,,,,,,,0
1516467,EJA,NETJETS,CL-601/604,A,3.0,2.0,D,2024-03-17,2024,3,,19:15,KHYI,SAN MARCOS MUNICIPAL ARPT,TX,8,,29.89361,-97.86469,2000.0,160.0,5.0,Approach,N,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,,Some Cloud,,ZT000,Meadowlarks,Small,No,,,,,,,0
1516468,ASA,ALASKA AIRLINES,B-737-800,A,4.0,2.0,D,2024-03-17,2024,3,Day,16:39,KSBA,SANTA BARBARA MUNICIPAL,CA,,,34.42621,-119.84037,30.0,,0.0,Approach,N,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,,No Cloud,,ZT002,Western meadowlark,Small,Unknown,,,,,,,0
1516469,FDX,FEDEX EXPRESS,B-767-300,A,4.0,2.0,D,2024-03-17,2024,3,Night,21:45,ZGGG,BAIYUN AIRPORT,FN,20L,,23.392436,113.298786,3000.0,180.0,10.0,Approach,N,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,No Cloud,,UNKB,Unknown bird,,Unknown,,,,,,,0
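A hand-typed column list is easy to get subtly wrong, so a quick consistency check can help. For instance, the list above keeps `STR_NOSE` and `STR_OTHER` but not the matching `DAM_NOSE` and `DAM_OTHER`; whether that is intentional is not stated. The sketch below uses a shortened stand-in for `chosen_cols` and lists the strike flags that have no damage partner.

```python
# Shortened stand-in for chosen_cols; only the strike/damage entries matter here.
cols = ["STR_RAD", "DAM_RAD", "STR_NOSE", "STR_PROP", "DAM_PROP", "STR_OTHER"]

# Strip the prefix to get the aircraft part each flag refers to.
struck = {c[len("STR_"):] for c in cols if c.startswith("STR_")}
damaged = {c[len("DAM_"):] for c in cols if c.startswith("DAM_")}

# Parts with a strike flag but no matching damage flag.
unpaired = sorted(struck - damaged)
print(unpaired)  # ['NOSE', 'OTHER']
```

If the gap is deliberate, this is a good place to write down why.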


## Saving Our Data

To save our data, let's just write it to a plain CSV using `to_csv()` in Pandas. Make sure to also give the index column a name, which we will just keep as `INDEX_NR`.

In [13]:
df.to_csv('birdstrike_section1.csv', index_label='INDEX_NR')

For good measure, let's make sure we can read the CSV without issue into Pandas. 

In [14]:
pd.read_csv('birdstrike_section1.csv', index_col='INDEX_NR')


INDEX_NR,OPID,OPERATOR,AIRCRAFT,AC_CLASS,AC_MASS,NUM_ENGS,TYPE_ENG,INCIDENT_DATE,INCIDENT_YEAR,INCIDENT_MONTH,...,SPECIES,SIZE,WARNED,COST_REPAIRS,COST_OTHER,COST_REPAIRS_INFL_ADJ,COST_OTHER_INFL_ADJ,NR_INJURIES,NR_FATALITIES,INDICATED_DAMAGE
708307,BUS,BUSINESS,PA-28,A,1.0,1.0,A,2015-05-22,2015,5,...,Unknown bird,,Unknown,,,,,,,1
708308,BUS,BUSINESS,BE-1900,A,3.0,2.0,C,2015-06-18,2015,6,...,Unknown bird,,Unknown,,,,,,,1
708309,BUS,BUSINESS,PA-46 MALIBU,A,1.0,1.0,A,2015-09-20,2015,9,...,Unknown bird,,Unknown,,,,,,,1
708310,DAL,DELTA AIR LINES,B-717-200,A,4.0,2.0,D,2015-11-07,2015,11,...,Unknown bird,,Unknown,,,,,,,1
708311,BUS,BUSINESS,BE-90 KING,A,2.0,2.0,C,2015-12-17,2015,12,...,Unknown bird,,Unknown,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1516465,UNK,UNKNOWN,UNKNOWN,,,,,2024-03-17,2024,3,...,American robin,Small,Unknown,,,,,,,0
1516467,EJA,NETJETS,CL-601/604,A,3.0,2.0,D,2024-03-17,2024,3,...,Meadowlarks,Small,No,,,,,,,0
1516468,ASA,ALASKA AIRLINES,B-737-800,A,4.0,2.0,D,2024-03-17,2024,3,...,Western meadowlark,Small,Unknown,,,,,,,0
1516469,FDX,FEDEX EXPRESS,B-767-300,A,4.0,2.0,D,2024-03-17,2024,3,...,Unknown bird,,Unknown,,,,,,,0


Great! No issues here. We have successfully narrowed our data down to what we think will be of interest. Again remember to have an auditable trail of how we transformed and used this data. Keep the original data handy in case we ever need additional rows or columns again. But this should give us plenty to work with going forward. 
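A CSV stores only text, so column types are re-inferred when the file is read back. The following sketch, on an invented two-row frame written to an in-memory buffer rather than a file, shows one way to check a round trip and to pass type hints to `read_csv`.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"INCIDENT_DATE": pd.to_datetime(["2015-05-22", "2024-03-17"]),
     "RUNWAY": ["4", "16L"]},
    index=pd.Index([708307, 1516465], name="INDEX_NR"),
)

# Write to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
toy.to_csv(buffer, index_label="INDEX_NR")
buffer.seek(0)

# Without hints, dates come back as plain text. dtype= keeps RUNWAY as text
# even in a file where every runway label happened to look numeric.
restored = pd.read_csv(
    buffer,
    index_col="INDEX_NR",
    parse_dates=["INCIDENT_DATE"],
    dtype={"RUNWAY": str},
)

print(restored.shape == toy.shape)
print(restored.dtypes)
```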

Note we could have also saved the results to a pickle using `to_pickle()` or SQL using `to_sql()`. I personally like using SQLite to store data [and offer an entire Anaconda course on that topic](https://learning.anaconda.cloud/introduction-to-sql). 
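As a rough sketch of the SQLite route, `to_sql()` accepts a connection from Python's built-in `sqlite3` module. The table name `birdstrike` and the toy rows below are assumptions made for illustration.

```python
import sqlite3
import pandas as pd

toy = pd.DataFrame(
    {"STATE": ["FL", "WA"], "INCIDENT_YEAR": [2015, 2024]},
    index=pd.Index([708307, 1516465], name="INDEX_NR"),
)

# ":memory:" keeps this example off disk; a file path would persist the database.
conn = sqlite3.connect(":memory:")
toy.to_sql("birdstrike", conn, if_exists="replace", index_label="INDEX_NR")

# Filtering can now be pushed into SQL before the data reaches pandas.
recent = pd.read_sql(
    "SELECT * FROM birdstrike WHERE INCIDENT_YEAR >= 2020",
    conn,
    index_col="INDEX_NR",
)
conn.close()
print(recent)
```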

## Where Did the Data Come From? 

Before we wrap up this section, let's think carefully and step back. We talked to an experienced pilot and retired FAA official. They were wonderful and shared some great information on this data. But being data experts, we have to truly ask what created this data. 

<a title="See page for author, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:F16_after_bird_strike.jpg"><img width="512" alt="F16 after bird strike" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/F16_after_bird_strike.jpg/512px-F16_after_bird_strike.jpg?20050925122118"></a>

**Bird strike on an F-16 canopy**

Each record is reported by a pilot to the FAA. Now stop and consider that. It is one thing if the pilot works for a big airline like Delta or Southwest Airlines and is well-trained in procedure. Such pilots have little reason to skip filing a bird strike report other than the time it takes. But an independent pilot who owns a small aircraft may be more of a wildcard. If the damage is minor or nonexistent, perhaps they just shrug it off and do not report it. If they hit a protected species like the bald eagle, they may be even less inclined to report, being unsure of the consequences.

Self-reporting always carries a bias with it, because not everybody is going to self-report. This could skew the results in ways that do not reflect reality. For example, if the data shows large airlines are far more prone to bird strikes than general aviation aircraft, that could be due to airlines being better at reporting, not because birds collide more with airliners. 
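To make this concrete, here is a small simulation. It assumes, purely for illustration, that airlines and general aviation experience the same number of real strikes but report them at different rates. All numbers are invented, not taken from the FAA data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

true_strikes = 10_000  # same number of real strikes for both groups (assumed)
report_rate = {"airline": 0.9, "general aviation": 0.3}  # invented rates

# Each real strike is independently reported with the group's probability.
observed = {
    group: int(rng.binomial(true_strikes, rate))
    for group, rate in report_rate.items()
}
print(observed)
```

Even though the underlying strike counts are identical, the reported counts differ by roughly a factor of three, which is the kind of gap that could be misread as airliners attracting more birds.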

In summary, do not get caught up in just what the data says. Also ask where it came from. What could possibly bias it? Asking these questions will help you frame more intelligent conclusions.

## EXERCISE

Your boss wants to see a geographic map of all bird strikes in the year 2023. Filter this `DataFrame` down to the records where `INCIDENT_YEAR` is 2023, keeping only the `LATITUDE` and `LONGITUDE` columns.

In [15]:
# put your code here



### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

First filter for the `INCIDENT_YEAR` being 2023, and then grab the `LATITUDE` and `LONGITUDE` columns on that resulting `DataFrame`. 

In [16]:
df[df["INCIDENT_YEAR"] == 2023][["LATITUDE","LONGITUDE"]]

INDEX_NR,LATITUDE,LONGITUDE
1353124,41.61627,-87.41279
1364684,61.17432,-149.99619
1365179,55.530193,13.371639
1368877,38.17439,-85.736
1369140,38.17439,-85.736
...,...,...
1515811,42.15991,-76.89144
1515812,42.15991,-76.89144
1516015,,
1516016,,
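Notice that the last rows have no coordinates, and the earlier dtype listing showed `LATITUDE` and `LONGITUDE` stored as `object` rather than numbers. Before plotting a map, one reasonable follow-up is to convert the columns to numeric values and drop rows that cannot be placed. This is a sketch on an invented three-row frame, not part of the exercise answer.

```python
import pandas as pd

toy = pd.DataFrame(
    {"LATITUDE": ["41.61627", "61.17432", None],
     "LONGITUDE": ["-87.41279", "-149.99619", None]},
    index=pd.Index([1353124, 1364684, 1516015], name="INDEX_NR"),
)

# Coerce text to numbers; anything unparseable or missing becomes NaN.
coords = toy[["LATITUDE", "LONGITUDE"]].apply(pd.to_numeric, errors="coerce")

# Keep only rows where both coordinates are present.
mappable = coords.dropna(subset=["LATITUDE", "LONGITUDE"])
print(mappable)
```

How many rows this drops is itself worth recording, since strikes without a location simply vanish from the map.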
