# UFO Sightings

#### The objective of this assignment is for you to explain what is happening in each cell in clear, understandable language. 

#### _There is no need to code._ The code is there for you, and it already runs. Your task is only to explain what each line in each cell does.

#### The placeholder cells should describe what happens in the cell below it.

**Example**: The cell below imports `pandas` as a dependency because `pandas` functions will be used throughout the program, such as the Pandas `DataFrame` as well as the `read_csv` function.

In [1]:
import pandas as pd

_[This cell stores the path to a CSV file in the variable `csv_path`, then uses the pandas `read_csv` function to load the file into a DataFrame named `ufo_df`. Calling `head()` displays the first five rows. If the file contained characters that the default encoding could not decode, an `encoding` argument could be passed to `read_csv`; this cell relies on the default.]_

In [2]:
csv_path = "Resources/ufoSightings.csv"

ufo_df = pd.read_csv(csv_path)

ufo_df.head()



Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611
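
As a minimal sketch of the encoding point, the example below reads a small in-memory CSV that was encoded as Latin-1. The sample rows and the names `raw_bytes` and `sample_df` are hypothetical and are not part of the assignment data.

```python
import io

import pandas as pd

# Hypothetical two-row CSV held in memory, encoded as Latin-1 instead of UTF-8.
raw_bytes = "city,shape\nsão paulo,disk\nmálaga,light\n".encode("latin-1")

# Passing the matching encoding lets pandas decode the accented characters.
sample_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(sample_df)
```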


_[Replace this with your clear explanation of what happens in the cell below. Explain why doing a count of values could be helpful.]_

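`count()` returns the number of non-missing values in each column. This is helpful here because it shows which columns have missing values (`state`, `country`, `shape`, and `comments` all have fewer than 80,332 entries) and gives a general sense of how many rows are in the data set.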

In [3]:
ufo_df.count()

datetime                80332
city                    80332
state                   74535
country                 70662
shape                   78400
duration (seconds)      80332
duration (hours/min)    80332
comments                80317
date posted             80332
latitude                80332
longitude               80332
dtype: int64
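
The idea of comparing `count()` against the number of rows can be sketched on a small hypothetical DataFrame. The `sample` data below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few missing values.
sample = pd.DataFrame({
    "city": ["edna", "kaneohe", "bristol"],
    "state": ["tx", None, "tn"],
    "shape": ["circle", "light", np.nan],
})

# count() reports the non-missing values per column.
non_null = sample.count()

# Subtracting from the row count gives the missing values per column,
# which matches what isna().sum() reports.
missing_per_column = len(sample) - non_null
print(missing_per_column)
print(sample.isna().sum())
```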

_[Replace this with your clear explanation of what happens in the cell below. What are some pros and cons of using `any` versus `all` as the parameter for `how` in the `dropna()` function?]_

With `how="any"`, `dropna()` removes every row that has at least one missing value. With `how="all"`, a row is dropped only if every value in it is missing. The advantage of `any` is that the resulting data set contains no missing values anywhere. The drawback is that valuable observations may be dropped when only one value in a row is missing; here the row count falls from 80,332 to 66,516. Using `all` would be better if we only want to remove rows that we know are completely useless. The cell then calls `count()` again to confirm that every column now has the same number of values.

In [4]:
clean_ufo_df = ufo_df.dropna(how="any")
clean_ufo_df.count()

datetime                66516
city                    66516
state                   66516
country                 66516
shape                   66516
duration (seconds)      66516
duration (hours/min)    66516
comments                66516
date posted             66516
latitude                66516
longitude               66516
dtype: int64
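
The difference between `any` and `all` can be sketched on a three-row hypothetical DataFrame. The `subset` variant is shown only as one possible middle ground and is not used in the assignment.

```python
import pandas as pd

# Hypothetical sample: one complete row, one fully empty row, one partial row.
sample = pd.DataFrame({
    "city": ["edna", None, "bristol"],
    "state": ["tx", None, None],
})

# how="any" keeps only rows with no missing values at all.
any_dropped = sample.dropna(how="any")

# how="all" removes only rows in which every value is missing.
all_dropped = sample.dropna(how="all")

# subset limits the check to the listed columns.
city_required = sample.dropna(subset=["city"])

print(len(any_dropped), len(all_dropped), len(city_required))
```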

_[Replace this with your clear explanation of what happens in the cell below. Be sure to describe what defining a list of columns and using that as the second parameter in the `loc` function does. Also, which filter was applied and how as well as the expected outcome of applying the filter.]_

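Defining the list of columns up front makes the `loc` call below easier to read. `loc` takes a row selector first and a column selector second. The condition `clean_ufo_df["country"] == "us"` keeps only the rows for UFO sightings that occurred in the United States, and the `columns` list specifies which columns from the previous DataFrame appear in the new one. Because `latitude` and `longitude` are not in the list, they are left out of `usa_ufo_df`.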

In [5]:
columns = [
    "datetime",
    "city",
    "state",
    "country",
    "shape",
    "duration (seconds)",
    "duration (hours/min)",
    "comments",
    "date posted"
]

usa_ufo_df = clean_ufo_df.loc[clean_ufo_df["country"] == "us", columns]
usa_ufo_df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004
5,10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007
7,10/10/1965 23:45,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,10/2/1999
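
A minimal sketch of the same pattern on hypothetical data: a boolean mask selects the rows and a list selects the columns. The names `sample`, `wanted`, and `is_us` are illustrative only.

```python
import pandas as pd

# Hypothetical sample with one non-US row and an extra column.
sample = pd.DataFrame({
    "city": ["san marcos", "chester", "edna"],
    "country": ["us", "gb", "us"],
    "latitude": [29.88, 53.2, 28.98],
})

wanted = ["city", "country"]

# The comparison produces one True/False value per row.
is_us = sample["country"] == "us"

# loc keeps the rows where the mask is True and only the listed columns.
us_rows = sample.loc[is_us, wanted]
print(us_rows)
```

The original row labels are preserved, which is why the index in the filtered output can skip numbers.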


_[Replace this with your clear explanation of what happens in the cell below. Be sure to describe what `value_counts` does as well as why this can be practical. Also, describe what will this return.]_

The `value_counts` function counts how many times each state occurs in the `state` column. This is practical when we want to know how often a specific value occurs, which gives a better understanding of the data. It returns a Series containing the number of UFO sightings in each state, sorted from most to fewest. The `%whos` magic command then lists the variables currently defined in the notebook along with their types.

In [6]:
state_counts = usa_ufo_df["state"].value_counts()
%whos


Variable       Type         Data/Info
-------------------------------------
clean_ufo_df   DataFrame                   datetime  <...>[66516 rows x 11 columns]
columns        list         n=9
csv_path       str          Resources/ufoSightings.csv
pd             module       <module 'pandas' from '//<...>ages/pandas/__init__.py'>
state_counts   Series       ca    8683\nfl    3754\nw<...>Name: state, dtype: int64
ufo_df         DataFrame                   datetime  <...>[80332 rows x 11 columns]
usa_ufo_df     DataFrame                   datetime  <...>n[63553 rows x 9 columns]
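
As a small illustration, `value_counts` can be run on a hypothetical Series of state codes. The `normalize=True` variant is an optional extra that is not used in the assignment.

```python
import pandas as pd

# Hypothetical list of state codes, one per sighting.
states = pd.Series(["ca", "tx", "ca", "fl", "ca", "tx"])

# Counts per unique value, sorted from most to least frequent.
counts = states.value_counts()

# normalize=True returns each value's share of the total instead.
shares = states.value_counts(normalize=True)

print(counts)
print(shares)
```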


_[Replace this with your clear explanation of what happens in the cell below. Be sure to describe what is the data type of `state_counts` and why. Explain why is this step necessary for continuing your analysis.]_


This creates a DataFrame from the `state_counts` variable. `state_counts` is a pandas Series, because that is what `value_counts` returns; its values are integers (`int64`) and its index holds the state abbreviations. Passing it to `pd.DataFrame` puts the number of UFO sightings in each state into a DataFrame, which is easier to work with in the next steps of our code.

In [7]:
state_ufo_counts_df = pd.DataFrame(state_counts)
state_ufo_counts_df.head()

Unnamed: 0,state
ca,8683
fl,3754
wa,3707
tx,3398
ny,2915
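
The conversion can be sketched with a hand-built Series. The values below are hypothetical, and `to_frame` is shown as an alternative that is not used in the assignment.

```python
import pandas as pd

# Hypothetical counts, built by hand so the Series name is explicit.
state_counts = pd.Series([3, 2, 1], index=["ca", "tx", "fl"], name="state")

# Wrapping a named Series in pd.DataFrame gives a one-column DataFrame
# whose column label is the Series name.
as_frame = pd.DataFrame(state_counts)

# to_frame does the same conversion and can set the column label directly.
named_frame = state_counts.to_frame(name="Sum of Sightings")

print(as_frame.columns.tolist(), named_frame.columns.tolist())
```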


_[Replace this with your clear explanation of what happens in the cell below. Explain what is being manipulated here, and why would this be more user-friendly to do.]_

This renames the column from `state` to `Sum of Sightings`, which better describes what the integer values represent. It is more user-friendly to have the state as the index and a column label that says what the numbers measure.

In [8]:
state_ufo_counts_df = state_ufo_counts_df.rename(
    columns={"state": "Sum of Sightings"})
state_ufo_counts_df.head()

Unnamed: 0,Sum of Sightings
ca,8683
fl,3754
wa,3707
tx,3398
ny,2915
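
A minimal sketch of `rename` on hypothetical data follows. It also shows that a key which matches no column is ignored by default, so it is worth checking the column labels after renaming.

```python
import pandas as pd

# Hypothetical one-column DataFrame indexed by state.
counts_df = pd.DataFrame({"state": [3, 2]}, index=["ca", "tx"])

# rename returns a new DataFrame; the original keeps its old label.
renamed = counts_df.rename(columns={"state": "Sum of Sightings"})

# A mapping key that matches no column leaves the labels unchanged.
unchanged = counts_df.rename(columns={"no_such_column": "Sum of Sightings"})

print(renamed.columns.tolist(), unchanged.columns.tolist())
```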


_[Replace this with your clear explanation of what happens in the cell below. Explain what is happening by calling looking at the `dtypes` property and why this can be helpful.]_

The `dtypes` property shows the data type of every column in the DataFrame. Here every column is `object`, including `duration (seconds)`, which tells us that column must be converted to a numeric type before we can perform calculations on it.

In [9]:
usa_ufo_df.dtypes

datetime                object
city                    object
state                   object
country                 object
shape                   object
duration (seconds)      object
duration (hours/min)    object
comments                object
date posted             object
dtype: object
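
To illustrate why an `object` column is a problem for arithmetic, the sketch below sums a hypothetical column of numeric strings. The `dtype="object"` argument is an assumption made to mirror the output above.

```python
import pandas as pd

# Hypothetical sample stored as plain Python strings (object dtype).
sample = pd.DataFrame(
    {"city": ["edna", "kaneohe"], "duration (seconds)": ["20", "900"]},
    dtype="object",
)
print(sample.dtypes)

# Summing strings joins them together instead of adding the numbers.
text_total = sample["duration (seconds)"].sum()
print(repr(text_total))
```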

_[Replace this with your clear explanation of what happens in the cell below. Be sure to explain why this step is necessary, and what will you now be able to do as a result of performing this step.]_

We cast the `duration (seconds)` column to a float so that we can perform numeric operations on it without encountering an error. This is useful when we take the sum of this column in the next cell.

In [11]:
usa_ufo_df.loc[:, "duration (seconds)"] = usa_ufo_df["duration (seconds)"].astype("float")
usa_ufo_df.dtypes

datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)      float64
duration (hours/min)     object
comments                 object
date posted              object
dtype: object
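
The cast can be sketched on hypothetical data as shown below. Two points are assumptions rather than part of the assignment. First, depending on the pandas version, assigning through `.loc[:, column]` may keep the existing `object` dtype, so the sketch assigns the column directly. Second, `pd.to_numeric` with `errors="coerce"` is shown as one option for values that cannot be parsed.

```python
import pandas as pd

# Hypothetical durations stored as strings.
durations = pd.DataFrame({"duration (seconds)": ["2700", "20", "900"]}, dtype="object")

# Direct column assignment replaces the column, so the new dtype is kept.
durations["duration (seconds)"] = durations["duration (seconds)"].astype("float")
print(durations.dtypes)

# Hypothetical messy values: astype would raise on the malformed entry,
# while to_numeric with errors="coerce" turns it into NaN.
messy = pd.Series(["300", "2`", "1200"], dtype="object")
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```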

_[Replace this with your clear explanation of what happens in the cell below. What is the output and how were we able to accomplish this?]_

This adds up the `duration (seconds)` values across all rows, giving the total duration in seconds of the U.S. sightings. It works because the column was cast to a float in the previous step: we select the column from the DataFrame and call `sum()` on it.

In [13]:
# Now it is possible to find the sum of seconds
usa_ufo_df["duration (seconds)"].sum()

351281285.38

_[Replace this with your clear explanation of what happens in the cell below. How did we group by two columns, and what are we now able to do as a result? Lastly, explain what does this output tell you.]_

We pass a list of two column names, `['state', 'city']`, to `groupby`, which groups the rows by each unique combination of state and city. As a result, we can compute aggregates for each city within each state. Counting the `datetime` column gives the number of UFO sightings recorded in each U.S. city; for example, Anchorage, Alaska has 82.
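
Grouping by two columns can be sketched on a small hypothetical DataFrame. The rows below are made up, and `size()` is shown only as an alternative way to count records.

```python
import pandas as pd

# Hypothetical sightings: two cities in one state, one city in another.
sample = pd.DataFrame({
    "state": ["ak", "ak", "ak", "tx", "tx"],
    "city": ["anchorage", "anchorage", "bethel", "edna", "edna"],
    "datetime": ["1/1/2000 20:00", "1/2/2000 21:00", "1/3/2000 22:00",
                 "1/4/2000 19:00", "1/5/2000 23:00"],
})

# Passing a list of columns creates one group per (state, city) pair.
grouped = sample.groupby(["state", "city"])

# Counting any fully populated column gives the rows in each group.
per_city = grouped["datetime"].count()

# size() counts rows per group without choosing a column.
per_city_size = grouped.size()

print(per_city)
```

The result is a Series with a two-level index, so a single count can be looked up with a tuple such as `per_city[("ak", "anchorage")]`.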

In [14]:
grouped_data = usa_ufo_df.groupby(['state', 'city'])

# Hint: If you are counting records, you can use any column and get the same result. Try it.
grouped_data['datetime'].count()

state  city                                                     
ak     adak                                                          1
       anchor point                                                  1
       anchorage                                                    82
       angoon                                                        1
       auke bay                                                      2
       bethel                                                        8
       big lake                                                      1
       butte                                                         1
       chugiak                                                       2
       clam gulch                                                    1
       cold bay                                                      1
       cordova                                                       2
       council                                                       1
       craig