# UFO Sightings

#### The objective of this assignment is for you to explain what is happening in each cell in clear, understandable language. 

#### _There is no need to code._ The code is there for you, and it already runs. Your task is only to explain what each line in each cell does.

#### The placeholder cells should describe what happens in the cell below it.

**Example**: The cell below imports `pandas` as a dependency because `pandas` functions will be used throughout the program, such as the Pandas `DataFrame` as well as the `read_csv` function.

In [1]:
import pandas as pd

The cell below stores the relative path to the CSV file in `csv_path`, then passes it to `pd.read_csv`, which reads the file into a DataFrame named `ufo_df`. Calling `ufo_df.head()` displays the first five rows of the DataFrame.

In [2]:
csv_path = "Resources/ufoSightings.csv"

ufo_df = pd.read_csv(csv_path)

ufo_df.head()

FileNotFoundError: [Errno 2] File b'Resources/ufoSightings.csv' does not exist: b'Resources/ufoSightings.csv'
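The error above indicates that the file was not found at that relative path when the cell ran. As a minimal sketch of what `read_csv` and `head` do, the example below reads a small in-memory CSV instead of the real file. The three rows and the name `sample_df` are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical sample standing in for Resources/ufoSightings.csv
sample_csv = io.StringIO(
    "datetime,city,state,country,shape\n"
    "1/1/2000 20:30,springfield,tx,us,disk\n"
    "1/2/2000 17:00,rivertown,,gb,light\n"
    "1/3/2000 21:00,lakeside,nv,us,oval\n"
)

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(sample_csv)

# head(2) returns only the first two rows; head() defaults to five
print(sample_df.head(2))
```

The empty `state` field in the second row is read as NaN, which is the kind of missing value the later cells deal with.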

The cell below calls `count()`, which returns the number of non-null values in each column. Comparing these counts shows which columns have missing data, and the count for a fully populated column gives the number of UFO sightings recorded.

In [3]:
ufo_df.count()

NameError: name 'ufo_df' is not defined
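The `NameError` follows from the earlier failure: `ufo_df` was never created because the file could not be read. To show what `count()` returns when it does run, here is a small sketch on a made-up DataFrame (`toy_df` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
toy_df = pd.DataFrame({
    "city": ["austin", "dallas", "reno"],
    "state": ["tx", np.nan, "nv"],
    "shape": [np.nan, np.nan, "disk"],
})

# count() skips NaN, so each column can report a different number
non_null_counts = toy_df.count()
print(non_null_counts)
```

All three rows have a city, two have a state, and only one has a shape, so the counts differ by column even though the DataFrame has three rows.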

The cell below calls `dropna(how="all")`, which drops only the rows in which every value is NaN (null), and then counts the non-null values in each column of the result. With `how="any"`, a row would instead be dropped if it contained even one NaN. The advantage of `"any"` is a consistent row count across all columns, which gives a floor value for the number of sightings based only on complete records. The drawback of `"any"` is that it eliminates rows of data that could have provided additional information. `"all"` keeps partially filled rows, so the user can review the rows with missing values and decide whether the information is useful.

In [None]:
clean_ufo_df = ufo_df.dropna(how="all")
clean_ufo_df.count()
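The difference between the two `how` options can be sketched on a tiny made-up DataFrame (the data and variable names here are hypothetical):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "city": ["austin", np.nan, "reno"],
    "state": ["tx", np.nan, np.nan],
})

# how="all": drop a row only if every value in it is NaN (row 1)
dropped_all = toy_df.dropna(how="all")

# how="any": drop a row if at least one value in it is NaN (rows 1 and 2)
dropped_any = toy_df.dropna(how="any")

print(len(dropped_all), len(dropped_any))
```

Neither call modifies `toy_df`; each returns a new DataFrame, which is why the notebook assigns the result to `clean_ufo_df`.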

_[Replace this with your clear explanation of what happens in the cell below. Be sure to describe what defining a list of columns and using that as the second parameter in the `loc` function does. Also, which filter was applied and how as well as the expected outcome of applying the filter.]_

The `columns` variable defines a list of column names.

The `usa_ufo_df` DataFrame holds the datetime, city, state, country, shape, duration, comments, and date posted columns for all sightings within the US.

Breakdown:

`.loc` is an indexer that selects rows and columns by label or by a Boolean condition. Its first argument selects the rows in which the `country` column equals "us".

`== "us"` is the filter: it produces True for sightings from the US and False for all others, so only the US rows are kept.

The `, columns]` portion of the code is the second argument to `.loc`, which returns only the columns listed in the `columns` variable.

In [None]:
columns = [
    "datetime",
    "city",
    "state",
    "country",
    "shape",
    "duration (seconds)",
    "duration (hours/min)",
    "comments",
    "date posted"
]

usa_ufo_df = clean_ufo_df.loc[clean_ufo_df["country"] == "us", columns]
usa_ufo_df.head()
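As a rough illustration of how the row filter and the column list work together in `.loc`, the sketch below splits the two parts into separate variables. The data, `wanted`, and `mask` are hypothetical names used only for this example.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "city": ["austin", "leeds", "reno"],
    "country": ["us", "gb", "us"],
    "shape": ["disk", "light", "oval"],
    "notes": ["a", "b", "c"],
})

wanted = ["city", "shape"]          # columns to keep
mask = toy_df["country"] == "us"    # Boolean Series: True, False, True

# First argument picks rows, second argument picks columns
us_only = toy_df.loc[mask, wanted]
print(us_only)
```

Rows where the mask is False are left out, and columns not in the list (`country` and `notes` here) do not appear in the result.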

_[Replace this with your clear explanation of what happens in the cell below. Be sure to describe what `value_counts` does as well as why this can be practical. Also, describe what will this return.]_

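The cell below counts the number of sightings that occurred in each state. `value_counts()` returns a Series whose index holds each unique value in the `state` column and whose values are the number of times each one appears, sorted from most to least frequent. This is practical because it summarizes a long column in a single call.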

In [None]:
state_counts = usa_ufo_df["state"].value_counts()
state_counts
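A minimal sketch of `value_counts()` on a made-up list of state codes (not the real data) shows the shape of the result:

```python
import pandas as pd

states = pd.Series(["tx", "nv", "tx", "ca", "tx", "nv"], name="state")

# One entry per unique value, ordered from most to least frequent
counts = states.value_counts()
print(counts)
```

The unique values become the index of the returned Series, so a single count can be looked up with `counts["tx"]`.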

_[Replace this with your clear explanation of what happens in the cell below. Be sure to describe what is the data type of `state_counts` and why. Explain why is this step necessary for continuing your analysis.]_

1. `state_counts` is a Series, because `value_counts()` returns a Series of counts indexed by the unique values. `pd.DataFrame(state_counts)` converts that Series into a DataFrame with one row per state and a single column of sighting counts. This matters for the analysis because a DataFrame has named columns that can be renamed, extended, or combined with other data.
2. `state_ufo_counts_df.head()` displays the first 5 rows of the DataFrame.

In [None]:
state_ufo_counts_df = pd.DataFrame(state_counts)
state_ufo_counts_df.head()
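The type change can be checked directly. This sketch uses made-up values and hypothetical variable names; the label of the resulting column depends on the installed pandas version, so it is printed rather than assumed.

```python
import pandas as pd

counts = pd.Series(["tx", "nv", "tx"], name="state").value_counts()
counts_df = pd.DataFrame(counts)

print(type(counts).__name__)      # Series
print(type(counts_df).__name__)   # DataFrame

# One row per unique state, one column holding the counts
print(counts_df.shape)
print(counts_df.columns.tolist())
```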

_[Replace this with your clear explanation of what happens in the cell below. Explain what is being manipulated here, and why would this be more user-friendly to do.]_

- The cell renames the existing `state` column of `state_ufo_counts_df` to "Sum of Sightings"; no new column is created. This is done with the `rename` method and its `columns` argument, which maps old column names to new ones. It is more user friendly because the header now describes the values in the column.

In [None]:
state_ufo_counts_df = state_ufo_counts_df.rename(
    columns={"state": "Sum of Sightings"})
state_ufo_counts_df.head()
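Here is a small sketch of how `rename` behaves, using a hand-built DataFrame with hypothetical values. Two details are worth noting: `rename` returns a new DataFrame rather than changing the original, and by default a key that matches no column is silently ignored. In recent pandas releases the column produced by `value_counts()` is labeled `count` rather than `state`, so in that case the mapping key would need to match that label.

```python
import pandas as pd

counts_df = pd.DataFrame({"state": [3, 2]}, index=["tx", "nv"])

# Map the old column name to the new one
renamed = counts_df.rename(columns={"state": "Sum of Sightings"})

# A key that matches no column leaves the columns unchanged
unchanged = counts_df.rename(columns={"missing": "Sum of Sightings"})

print(renamed.columns.tolist())
print(unchanged.columns.tolist())
```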

_[Replace this with your clear explanation of what happens in the cell below. Explain what is happening by calling looking at the `dtypes` property and why this can be helpful.]_

The `dtypes` attribute returns the data type of each column in the DataFrame. Knowing the data types is very useful because it allows the user to troubleshoot possible errors or unexpected results, such as numbers that were read in as text.

In [None]:
usa_ufo_df.dtypes

_[Replace this with your clear explanation of what happens in the cell below. Be sure to explain why this step is necessary, and what will you now be able to do as a result of performing this step.]_

- `usa_ufo_df.loc[:, "duration (seconds)"]` selects every row of the "duration (seconds)" column as the target of the assignment
- `usa_ufo_df["duration (seconds)"].astype("float")` returns the values of the "duration (seconds)" column converted to the floating-point data type
- The benefit is that the values in the "duration (seconds)" column are now numeric, so arithmetic such as sums and averages can be applied to them

In [None]:
usa_ufo_df.loc[:, "duration (seconds)"] = usa_ufo_df["duration (seconds)"].astype("float")
usa_ufo_df.dtypes
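The conversion can be sketched on a made-up column of numeric strings. This example assigns with `df[col] = ...` rather than `.loc[:, col] = ...`, which is a deliberate substitution: in recent pandas versions an assignment through `.loc[:, col]` may write into the existing column and keep its original data type, so checking `dtypes` afterwards, as the cell above does, is a sensible safeguard.

```python
import pandas as pd

# Hypothetical durations stored as text
toy_df = pd.DataFrame({"duration (seconds)": ["20", "7200", "0.5"]})
print(toy_df["duration (seconds)"].dtype)

# astype returns a converted copy; assigning it back replaces the column
toy_df["duration (seconds)"] = toy_df["duration (seconds)"].astype("float")
print(toy_df["duration (seconds)"].dtype)

# Numeric operations now work on the column
total = toy_df["duration (seconds)"].sum()
print(total)
```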

_[Replace this with your clear explanation of what happens in the cell below. What is the output and how were we able to accomplish this?]_

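- `usa_ufo_df["duration (seconds)"].sum()` adds up every value in the "duration (seconds)" column and returns the total number of seconds. This works because the column was converted to a numeric data type in the previous cell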

In [None]:
# Now it is possible to find the sum of seconds
usa_ufo_df["duration (seconds)"].sum()

_[Replace this with your clear explanation of what happens in the cell below. How did we group by two columns, and what are we now able to do as a result? Lastly, explain what does this output tell you.]_

- The `grouped_data` variable holds `usa_ufo_df` grouped by state and city
- Passing a list of two column names to the `groupby` method groups on both columns, so each unique state and city pair forms one group
- `grouped_data['datetime'].count()` counts the non-null `datetime` values in each group, which gives the number of sightings for each state and city
- Now the user can view the number of sightings by state and city

In [None]:
grouped_data = usa_ufo_df.groupby(['state', 'city'])

# Hint: If you are counting records, you can use any column and get the same result. Try it.
grouped_data['datetime'].count()
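Grouping by two columns, and the hint about counting any column, can be sketched on a small made-up DataFrame. All values and variable names are hypothetical, and the toy data has no missing values, which is the condition under which every column gives the same count.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "state": ["tx", "tx", "tx", "nv"],
    "city": ["austin", "austin", "dallas", "reno"],
    "datetime": ["d1", "d2", "d3", "d4"],
    "shape": ["disk", "oval", "disk", "light"],
})

# Each unique (state, city) pair becomes one group
grouped = toy_df.groupby(["state", "city"])

# Counting either column gives the same numbers when nothing is missing
by_datetime = grouped["datetime"].count()
by_shape = grouped["shape"].count()

print(by_datetime)
```

The result is a Series with a two-level index (state, then city), so one group can be looked up with `by_datetime.loc[("tx", "austin")]`.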