# UFO Sightings

#### The objective of this assignment is for you to explain what is happening in each cell in clear, understandable language. 

#### _There is no need to code._ The code is there for you, and it already runs. Your task is only to explain what each line in each cell does.

#### The placeholder cells should describe what happens in the cell below it.

**Example**: The cell below imports `pandas` as a dependency because `pandas` functions will be used throughout the program, such as the Pandas `DataFrame` as well as the `read_csv` function.

In [1]:
import pandas as pd

`csv_path` is assigned the relative path of the CSV file we want the program to read.

`ufo_df` is assigned the DataFrame that `pd.read_csv` builds from that file.

`ufo_df.head()` returns the first five rows of `ufo_df`. Because it is the last line of the cell, the notebook displays them so that we can inspect the data.

In [3]:
csv_path = "Resources/ufoSightings.csv"

ufo_df = pd.read_csv(csv_path)

ufo_df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611
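As a small illustration of the read-and-preview step, the sketch below reads CSV text from memory instead of from disk. The three sample rows and the names `sample_csv`, `sample_df`, and `first_two` are made up for this example; they are not part of the assignment.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for Resources/ufoSightings.csv
sample_csv = io.StringIO(
    "datetime,city,state,country,shape\n"
    "10/10/1949 20:30,san marcos,tx,us,cylinder\n"
    "10/10/1949 21:00,lackland afb,tx,,light\n"
    "10/10/1955 17:00,chester (uk/england),,gb,circle\n"
)

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(sample_csv)

# head(n) returns the first n rows; n defaults to 5
first_two = sample_df.head(2)
print(first_two)
```

Empty fields in the CSV text, such as the missing country in the second row, are read in as missing values (`NaN`).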


`ufo_df.count()` counts the non-missing values in each column. Columns such as `state`, `country`, `shape`, and `comments` show lower counts than the others because some of their rows are missing a value.

In [5]:
ufo_df.count()

datetime                80332
city                    80332
state                   74535
country                 70662
shape                   78400
duration (seconds)      80332
duration (hours/min)    80332
comments                80317
date posted             80332
latitude                80332
longitude               80332
dtype: int64
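A minimal sketch of how `count` treats missing values, using a made-up three-row DataFrame (`toy_df` and its contents are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing state and one missing country
toy_df = pd.DataFrame({
    "city": ["san marcos", "lackland afb", "chester"],
    "state": ["tx", "tx", np.nan],
    "country": ["us", np.nan, "gb"],
})

non_null_counts = toy_df.count()      # non-missing values per column
missing_counts = toy_df.isna().sum()  # missing values per column

print(non_null_counts)
print(missing_counts)
```

For every column the two results add up to the number of rows, which is a quick way to see how much data each column is missing.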

`clean_ufo_df` is assigned a copy of the data with incomplete rows removed by `.dropna`. The `how` parameter controls which rows are dropped: `how="any"` drops a row if any of its values is missing, whereas `how="all"` drops a row only if every value is missing. The drawback of `"any"` is that a row is discarded even when its other columns hold relevant information. Calling `.count()` afterwards confirms that every column now has the same number of values (66,516).

In [6]:
clean_ufo_df = ufo_df.dropna(how="any")
clean_ufo_df.count()

datetime                66516
city                    66516
state                   66516
country                 66516
shape                   66516
duration (seconds)      66516
duration (hours/min)    66516
comments                66516
date posted             66516
latitude                66516
longitude               66516
dtype: int64
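The difference between `how="any"` and `how="all"` can be sketched on a made-up DataFrame. The name `toy_df` and its rows are hypothetical and chosen so that one row is complete, one is partly missing, and one is entirely missing:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: complete, partly missing, entirely missing
toy_df = pd.DataFrame({
    "city": ["san marcos", "lackland afb", np.nan],
    "state": ["tx", np.nan, np.nan],
})

dropped_any = toy_df.dropna(how="any")  # keeps only fully populated rows
dropped_all = toy_df.dropna(how="all")  # drops only rows that are entirely empty

print(len(toy_df), len(dropped_any), len(dropped_all))
```

Note that `dropna` returns a new DataFrame and leaves `toy_df` itself unchanged, which is why the notebook assigns the result to a new variable.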

_[Replace this with your clear explanation of what happens in the cell below. Be sure to describe what defining a list of columns and using that as the second parameter in the `loc` function does. Also, which filter was applied and how as well as the expected outcome of applying the filter.]_
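`columns` is a list of the column names we want to keep; it leaves out `latitude` and `longitude`. In `.loc[rows, columns]`, the first argument selects rows and the second selects columns. The row filter `clean_ufo_df["country"] == "us"` compares every value in the `country` column with `"us"` and produces a Boolean mask, so only the rows where the comparison is true are kept. Passing `columns` as the second argument restricts the result to the listed columns. The expected outcome, stored in `usa_ufo_df`, is a DataFrame of US sightings only, without the coordinate columns. `.head()` displays its first five rows.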

In [7]:
columns = [
    "datetime",
    "city",
    "state",
    "country",
    "shape",
    "duration (seconds)",
    "duration (hours/min)",
    "comments",
    "date posted"
]

usa_ufo_df = clean_ufo_df.loc[clean_ufo_df["country"] == "us", columns]
usa_ufo_df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004
5,10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007
7,10/10/1965 23:45,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,10/2/1999
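The row filter and column list can be seen separately in the sketch below. The data and the names `toy_df`, `is_us`, `keep_columns`, and `us_only` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data with one non-US row
toy_df = pd.DataFrame({
    "city": ["san marcos", "chester", "edna"],
    "country": ["us", "gb", "us"],
    "latitude": [29.88, 53.2, 28.98],
})

# Boolean mask: True where the country is "us"
is_us = toy_df["country"] == "us"

# Columns to keep; "latitude" is deliberately left out
keep_columns = ["city", "country"]

# First argument selects rows, second selects columns
us_only = toy_df.loc[is_us, keep_columns]
print(us_only)
```

The filtered rows keep their original index labels (here 0 and 2), which matches the gaps in the index of the `usa_ufo_df` output above.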


_[Replace this with your clear explanation of what happens in the cell below. Be sure to describe what `value_counts` does as well as why this can be practical. Also, describe what will this return.]_
`state_counts` is assigned the result of calling `value_counts` on the `state` column of `usa_ufo_df`. `value_counts` returns a Series whose index holds each unique state and whose values are the number of rows in which that state appears, sorted from most to least frequent. This is practical for quickly seeing how often each value occurs in a column, here which states report the most sightings.

In [8]:
state_counts = usa_ufo_df["state"].value_counts()
state_counts

ca    8683
fl    3754
wa    3707
tx    3398
ny    2915
il    2447
az    2362
pa    2319
oh    2251
mi    1781
nc    1722
or    1667
mo    1431
co    1385
in    1268
va    1248
ma    1238
nj    1236
ga    1235
wi    1205
tn    1091
mn     996
sc     986
ct     865
ky     843
md     818
nv     778
ok     714
nm     693
ia     669
al     629
ut     611
ks     599
ar     578
la     547
me     544
id     508
nh     482
mt     460
wv     438
ne     373
ms     368
ak     311
hi     257
vt     254
ri     224
sd     177
wy     169
de     165
nd     123
pr      24
dc       7
Name: state, dtype: int64
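A minimal sketch of `value_counts` on a made-up Series (`states`, `counts`, and `shares` are hypothetical names). It also shows the `normalize=True` option, which is not used in the notebook but returns proportions instead of raw counts:

```python
import pandas as pd

# Hypothetical sample of state codes
states = pd.Series(["ca", "tx", "ca", "fl", "ca", "tx"], name="state")

counts = states.value_counts()                # occurrences, most frequent first
shares = states.value_counts(normalize=True)  # same, as fractions of the total

print(counts)
print(shares)
```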

_[Replace this with your clear explanation of what happens in the cell below. Be sure to describe what is the data type of `state_counts` and why. Explain why is this step necessary for continuing your analysis.]_
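`state_counts` is a pandas Series, because `value_counts` on a single column returns a one-dimensional, labeled set of values (the states are the index and the counts are the values). `pd.DataFrame(state_counts)` converts it to a DataFrame with one column, stored in `state_ufo_counts_df`. This step matters for the rest of the analysis because a DataFrame has named columns, which the next cell relies on when it renames the count column. `.head()` displays the first five states to confirm the result.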

In [9]:
state_ufo_counts_df = pd.DataFrame(state_counts)
state_ufo_counts_df.head()

Unnamed: 0,state
ca,8683
fl,3754
wa,3707
tx,3398
ny,2915
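The Series-to-DataFrame conversion can be sketched as follows. The Series here is built by hand with made-up counts rather than taken from the data set, and `to_frame` is shown as an equivalent alternative that the notebook itself does not use:

```python
import pandas as pd

# Hypothetical counts, built by hand to mimic the shape of state_counts
state_counts = pd.Series([3, 2, 1], index=["ca", "tx", "fl"], name="state")

as_frame = pd.DataFrame(state_counts)  # the approach used in the notebook
same_frame = state_counts.to_frame()   # equivalent Series method

print(type(state_counts).__name__)  # Series
print(type(as_frame).__name__)      # DataFrame
print(as_frame)
```

In both cases the name of the Series becomes the name of the single column, and the index is carried over unchanged.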


_[Replace this with your clear explanation of what happens in the cell below. Explain what is being manipulated here, and why would this be more user-friendly to do.]_
`.rename` changes index labels or column names. Here the column header `state`, which is misleading because the column holds counts rather than state names, is renamed to `Sum of Sightings`, and the result is assigned back to `state_ufo_counts_df`. A descriptive header makes the table easier to read. `.head()` is used to confirm the result visually.

In [10]:
state_ufo_counts_df = state_ufo_counts_df.rename(
    columns={"state": "Sum of Sightings"})
state_ufo_counts_df.head()

Unnamed: 0,Sum of Sightings
ca,8683
fl,3754
wa,3707
tx,3398
ny,2915
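A short sketch of `rename` on a hand-built DataFrame (`counts_df`, `renamed_df`, and `unchanged_df` are hypothetical names, and the counts are made up):

```python
import pandas as pd

# Hypothetical one-column table of counts indexed by state
counts_df = pd.DataFrame({"state": [3, 2, 1]}, index=["ca", "tx", "fl"])

# rename takes a mapping of old name -> new name and returns a new DataFrame
renamed_df = counts_df.rename(columns={"state": "Sum of Sightings"})

# A key that matches no column is ignored by default, so nothing changes here
unchanged_df = counts_df.rename(columns={"no such column": "x"})

print(renamed_df)
```

Because a non-matching key is ignored without an error by default, it is worth checking the output, as the notebook does with `.head()`, to confirm that the rename actually took effect.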


_[Replace this with your clear explanation of what happens in the cell below. Explain what is happening by calling looking at the `dtypes` property and why this can be helpful.]_
The `dtypes` property returns the data type of each column in the DataFrame. In this case every column is reported as `object`, the type pandas uses here for text. That includes `duration (seconds)`, so its values are stored as strings even though they look like numbers. Checking `dtypes` is helpful because the data type determines which operations are valid: numeric calculations such as sums or averages need a numeric type, so a column that holds numbers as text has to be converted first.

In [12]:
usa_ufo_df.dtypes

datetime                object
city                    object
state                   object
country                 object
shape                   object
duration (seconds)      object
duration (hours/min)    object
comments                object
date posted             object
dtype: object
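To see how `dtypes` distinguishes text from numbers, the sketch below stores the same made-up durations twice, once as strings and once as floats. The column names `duration_text` and `duration_num` are hypothetical:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical data: the same durations as text and as numbers
toy_df = pd.DataFrame({
    "duration_text": ["2700", "20", "900"],   # digits stored as strings
    "duration_num": [2700.0, 20.0, 900.0],    # real floating-point numbers
})

# The text column is typically reported as object; newer pandas versions
# may show a dedicated string dtype instead. The numeric one is float64.
print(toy_df.dtypes)

text_is_numeric = is_numeric_dtype(toy_df["duration_text"])
num_is_numeric = is_numeric_dtype(toy_df["duration_num"])
print(text_is_numeric, num_is_numeric)
```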

_[Replace this with your clear explanation of what happens in the cell below. Be sure to explain why this step is necessary, and what will you now be able to do as a result of performing this step.]_
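The cell below converts the `duration (seconds)` column to floating-point numbers with `.astype("float")` and writes the result back into `usa_ufo_df`. Displaying `dtypes` again confirms that the column is now `float64`. This step is necessary because values stored as `object` (text) cannot be used for numerical analysis; after the conversion we can run calculations such as the sum in the next cell.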

In [13]:
usa_ufo_df.loc[:, "duration (seconds)"] = usa_ufo_df["duration (seconds)"].astype("float")
usa_ufo_df.dtypes

datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)      float64
duration (hours/min)     object
comments                 object
date posted              object
dtype: object
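A minimal sketch of the conversion on made-up values. `astype("float")` works when every entry parses as a number. `pd.to_numeric` with `errors="coerce"` is shown as an alternative for messier data; it is not what the notebook uses, and the `"unknown"` entry is invented for illustration:

```python
import pandas as pd

# Hypothetical durations stored as text
durations = pd.Series(["2700", "20", "900.5"])
as_float = durations.astype("float")  # every entry must be a valid number

# Alternative for messy input: unparseable entries become NaN
messy = pd.Series(["2700", "unknown", "20"])
coerced = pd.to_numeric(messy, errors="coerce")

print(as_float)
print(coerced)
```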

_[Replace this with your clear explanation of what happens in the cell below. What is the output and how were we able to accomplish this?]_
`.sum()` in the cell below returns the total of all values in the `duration (seconds)` column: 351,281,285.38 seconds. This was possible because the column was converted to a float data type. If it were still an `object` column of strings, `.sum()` would not produce a numeric total; pandas would either join the strings together or raise an error.

In [14]:
# Now it is possible to find the sum of seconds
usa_ufo_df["duration (seconds)"].sum()

351281285.38
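The effect of the data type on `.sum()` can be sketched with three made-up durations. The names `as_text` and `as_number` are hypothetical:

```python
import pandas as pd

# Hypothetical durations stored as strings in an object column
as_text = pd.Series(["2700", "20", "900"], dtype=object)
as_number = as_text.astype("float")

text_total = as_text.sum()      # strings are joined, not added
number_total = as_number.sum()  # arithmetic sum

print(repr(text_total))
print(number_total)
```

Summing the text column joins the strings end to end, which is rarely what is wanted, while the float column gives the arithmetic total.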

_[Replace this with your clear explanation of what happens in the cell below. How did we group by two columns, and what are we now able to do as a result? Lastly, explain what does this output tell you.]_
`groupby` groups the rows of the DataFrame by the columns passed to it, which can be one column or several. To group by several columns, pass them as a list in square brackets, `['state', 'city']`, rather than as separate arguments. The grouping follows the order of the list, so rows are grouped first by state and then by city within each state.
Finally, selecting the `'datetime'` column from the grouped data and calling `.count()` aggregates each group into the number of non-missing `datetime` values it contains. The output therefore shows how many sightings were recorded for each city in each state; for example, Anchorage, AK has 82. Because `clean_ufo_df` has no missing values, counting any other column would give the same result.
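Before the real cell, here is a toy version of the same grouping on five made-up rows. The data and the names `toy_df`, `grouped`, `by_datetime`, and `by_shape` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sightings: two cities in "ak", one city in "tx"
toy_df = pd.DataFrame({
    "state": ["ak", "ak", "ak", "tx", "tx"],
    "city": ["anchorage", "anchorage", "bethel", "edna", "edna"],
    "datetime": ["d1", "d2", "d3", "d4", "d5"],
    "shape": ["light", "disk", "light", "circle", "sphere"],
})

# Group by state first, then by city within each state
grouped = toy_df.groupby(["state", "city"])

# With no missing values, counting any column gives the same numbers
by_datetime = grouped["datetime"].count()
by_shape = grouped["shape"].count()

print(by_datetime)
```

The result is a Series with a two-level index (state, then city), which is why the state label in the output is printed only once per group of cities.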

In [15]:
grouped_data = usa_ufo_df.groupby(['state', 'city'])

# Hint: If you are counting records, you can use any column and get the same result. Try it.
grouped_data['datetime'].count()

state  city                                                     
ak     adak                                                          1
       anchor point                                                  1
       anchorage                                                    82
       angoon                                                        1
       auke bay                                                      2
       bethel                                                        8
       big lake                                                      1
       butte                                                         1
       chugiak                                                       2
       clam gulch                                                    1
       cold bay                                                      1
       cordova                                                       2
       council                                                       1
       craig