<h1> UFO Sightings 🛸 </h1>
<h3> Exploratory Data Analysis (EDA) </h3>
<p>Data obtained from Kaggle: </p>
https://www.kaggle.com/datasets/ogunkoya/ufo-1149

Before starting, import all the libraries needed.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

<h4>Section 1: Exploring the Available Data</h4>

In [2]:
# first, import the data into a pandas dataframe
df = pd.read_csv('UFO-1149.csv')

In [3]:
# looking at the first few records in the dataframe
df.head()

,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8831,-97.9411
1,10/10/1949 21:00,lackland afb,tx,,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.3842,-98.5811
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.9167
3,10/10/1956 21:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783,-96.6458
4,10/10/1960 20:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4181,-157.8036


In [4]:
# looking at the last few records in the dataframe
df.tail()

,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
80327,9/9/2013 21:15,nashville,tn,us,light,600.0,10 minutes,Round from the distance/slowly changing colors...,9/30/2013,36.1658,-86.7844
80328,9/9/2013 22:00,boise,id,us,circle,1200.0,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,9/30/2013,43.6136,-116.2025
80329,9/9/2013 22:00,napa,ca,us,other,1200.0,hour,Napa UFO&#44,9/30/2013,38.2972,-122.2844
80330,9/9/2013 22:20,vienna,va,us,circle,5.0,5 seconds,Saw a five gold lit cicular craft moving fastl...,9/30/2013,38.9011,-77.2656
80331,9/9/2013 23:00,edmond,ok,us,cigar,1020.0,17 minutes,2 witnesses 2 miles apart&#44 Red &amp; White...,9/30/2013,35.6528,-97.4778


In [5]:
# number of rows and columns
df.shape

(80332, 11)

In [6]:
# reviewing all columns, data types, and amount of values in each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80332 entries, 0 to 80331
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              80332 non-null  object 
 1   city                  80332 non-null  object 
 2   state                 74535 non-null  object 
 3   country               70662 non-null  object 
 4   shape                 78400 non-null  object 
 5   duration (seconds)    80332 non-null  float64
 6   duration (hours/min)  80332 non-null  object 
 7   comments              80318 non-null  object 
 8   date posted           80332 non-null  object 
 9   latitude              80332 non-null  float64
 10  longitude             80332 non-null  float64
dtypes: float64(3), object(8)
memory usage: 6.7+ MB


In [7]:
# reviewing the amount of missing values in each column
df.isnull().sum()

datetime                   0
city                       0
state                   5797
country                 9670
shape                   1932
duration (seconds)         0
duration (hours/min)       0
comments                  14
date posted                0
latitude                   0
longitude                  0
dtype: int64

In [8]:
# reviewing the number of unique items in each column
df.nunique()

datetime                69586
city                    19900
state                      67
country                     5
shape                      29
duration (seconds)        533
duration (hours/min)     8304
comments                79998
date posted               317
latitude                17834
longitude               19192
dtype: int64

In [9]:
# looking at all unique countries recorded
df['country'].value_counts()

us    65114
ca     3000
gb     1905
au      538
de      105
Name: country, dtype: int64

In [10]:
# looking at all the unique values in the state column
# there are more codes than the 50 US states: the column also holds Canadian province codes such as 'on' and 'bc'
df['state'].unique()

array(['tx', nan, 'hi', 'tn', 'ct', 'al', 'fl', 'ca', 'nc', 'ny', 'ky',
       'mi', 'ma', 'ks', 'sc', 'wa', 'ab', 'co', 'nh', 'wi', 'me', 'ga',
       'pa', 'il', 'ar', 'on', 'mo', 'oh', 'in', 'az', 'mn', 'nv', 'nf',
       'ne', 'or', 'bc', 'ia', 'va', 'id', 'nm', 'nj', 'mb', 'wv', 'ok',
       'ri', 'nb', 'vt', 'la', 'pr', 'ak', 'ms', 'ut', 'md', 'mt', 'sk',
       'wy', 'sd', 'pq', 'ns', 'qc', 'de', 'nd', 'dc', 'nt', 'sa', 'yt',
       'yk', 'pe'], dtype=object)

<hr>

<b>Quick Observations: </b>
- Need to split the 'datetime' column into separate date and time columns <br>
- Don't need the 'date posted' or 'comments' columns <br>
- 80,332 entries, with missing values in state (5,797), country (9,670), shape (1,932), and comments (14) <br>
- Need to clean up the US states; some foreign entries seem to be lumped into the 'city' column, e.g. 'chester (uk/england)' <br>

<br>

<b>Questions I want to answer with this data: </b>
1. What time periods are captured in this dataset? <br>
2. Are there multiple reports from the same day? <br>
3. Do most sightings occur during the day or at night? <br>
4. What states have the most sightings? <br>
5. How long do sightings last on average? <br>
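
Question 3 needs a rule for what counts as "day". The sketch below is one possible approach on a small toy frame, not the analysis itself: it assumes a 'time' column of 'HH:MM' strings and treats 06:00 to 17:59 as day. Both the cutoff hours and the `period` column name are assumptions made for illustration.

```python
import pandas as pd

# Toy 'time' values in the same 'HH:MM' layout as the dataset (not real results).
toy = pd.DataFrame({'time': ['20:30', '21:00', '17:00', '05:15']})

# Take the hour part of each string and convert it to an integer.
hour = toy['time'].str.split(':').str[0].astype(int)

# Assumption: hours 6 through 17 count as day, everything else as night.
toy['period'] = hour.between(6, 17).map({True: 'day', False: 'night'})

# Count how many toy sightings fall into each period.
counts = toy['period'].value_counts()
print(counts)
```

A different cutoff (or one based on sunrise and sunset at each location) would change the split, so the rule should be stated next to any result.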

<hr>

<h4>Section 2: Data Cleaning and Transformation</h4>

In [11]:
# splitting the datetime column into two columns, 'date' and 'time'
df['date'] = df['datetime'].apply(lambda x: x.split(' ')[0])

In [12]:
df['time'] = df['datetime'].apply(lambda x: x.split(' ')[1])
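
The two `apply` calls can also be written as a single vectorized split. This is a minimal sketch on a toy frame whose values are copied from `df.head()`; the `parsed` column and the `'%m/%d/%Y %H:%M'` format string are assumptions about how the strings are laid out, so `errors='coerce'` is used to turn anything that does not fit into `NaT` instead of raising.

```python
import pandas as pd

# Toy frame with the same 'datetime' string layout as the dataset.
toy = pd.DataFrame({'datetime': ['10/10/1949 20:30', '10/10/1955 17:00']})

# Split once on the first space; expand=True returns one column per piece.
parts = toy['datetime'].str.split(' ', n=1, expand=True)
toy['date'] = parts[0]
toy['time'] = parts[1]

# Optional: parse into real timestamps so year, month, and hour are easy to extract.
# Strings that do not match the assumed format become NaT rather than raising.
toy['parsed'] = pd.to_datetime(toy['datetime'], format='%m/%d/%Y %H:%M', errors='coerce')
print(toy)
```

Keeping a parsed timestamp column alongside the strings makes later grouping by year or hour simpler, at the cost of checking how many rows ended up as `NaT`.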

In [13]:
# dropping columns that are not needed for the analysis
df.drop(columns=['datetime', 'comments', 'date posted', 'duration (hours/min)'], inplace=True)

In [14]:
# list of the 50 US state codes plus 'dc'
states = [ 'ak', 'al', 'ar', 'az', 'ca', 'co', 'ct', 'dc', 'de', 'fl', 'ga',
           'hi', 'ia', 'id', 'il', 'in', 'ks', 'ky', 'la', 'ma', 'md', 'me',
           'mi', 'mn', 'mo', 'ms', 'mt', 'nc', 'nd', 'ne', 'nh', 'nj', 'nm',
           'nv', 'ny', 'oh', 'ok', 'or', 'pa', 'ri', 'sc', 'sd', 'tn', 'tx',
           'ut', 'va', 'vt', 'wa', 'wi', 'wv', 'wy']

In [15]:
# setting country to 'us' where the country is missing and the state is in the list above
c1 = df['country'].isna()
c2 = df['state'].isin(states)

df['country'] = df['country'].mask(c1 & c2, 'us')
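
To see what the `mask` call does, here is a small sketch on toy data. The frame and the shortened `us_states` list are made up for illustration; only the logic mirrors the cell above. `Series.mask(cond, other)` replaces values where `cond` is True and leaves the rest untouched.

```python
import numpy as np
import pandas as pd

# Toy rows: a US state with no country, a Canadian province with no country,
# a row missing both, and a row that is already complete.
toy = pd.DataFrame({
    'state':   ['tx', 'on', np.nan, 'hi'],
    'country': [np.nan, np.nan, np.nan, 'us'],
})
us_states = ['tx', 'hi']  # shortened stand-in for the full list of state codes

missing_country = toy['country'].isna()
in_us_state = toy['state'].isin(us_states)

# Only rows where both conditions hold are overwritten with 'us'.
toy['country'] = toy['country'].mask(missing_country & in_us_state, 'us')
print(toy)
```

Only the 'tx' row is filled in; the 'on' row and the row with no state keep their missing country.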

In [17]:
# dropping rows that still have a missing country, shape, or state
df.dropna(subset=['country', 'shape', 'state'], inplace=True)

In [21]:
# confirming updates
df.head()

,city,state,country,shape,duration (seconds),latitude,longitude,date,time
0,san marcos,tx,us,cylinder,2700.0,29.8831,-97.9411,10/10/1949,20:30
1,lackland afb,tx,us,light,7200.0,29.3842,-98.5811,10/10/1949,21:00
3,edna,tx,us,circle,20.0,28.9783,-96.6458,10/10/1956,21:00
4,kaneohe,hi,us,light,900.0,21.4181,-157.8036,10/10/1960,20:00
5,bristol,tn,us,sphere,300.0,36.595,-82.1889,10/10/1961,19:00


In [22]:
df.isnull().sum()

city                  0
state                 0
country               0
shape                 0
duration (seconds)    0
latitude              0
longitude             0
date                  0
time                  0
dtype: int64

In [19]:
# NEXT: convert 'duration (seconds)' into a more meaningful time unit ('duration (hours/min)' was already dropped above)
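
One possible way to do this step, sketched on toy values copied from `df.head()`: divide by 60 for minutes, or convert to a timedelta. The column names `duration_min` and `duration_td` are hypothetical.

```python
import pandas as pd

# Toy durations in seconds, taken from the first rows shown earlier.
toy = pd.DataFrame({'duration (seconds)': [2700.0, 7200.0, 20.0]})

# Minutes as a plain float, rounded to two decimals for readability.
toy['duration_min'] = (toy['duration (seconds)'] / 60).round(2)

# Alternatively, a timedelta column keeps the unit explicit.
toy['duration_td'] = pd.to_timedelta(toy['duration (seconds)'], unit='s')
print(toy)
```

Minutes are convenient for averages and plots, while the timedelta form is easier to read for very long or very short sightings.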

In [20]:
# NEXT: create a subset containing US sightings only
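
A minimal sketch of this step on toy data, assuming the 'country' column has already been cleaned so that US rows carry the code 'us'. The name `us_df` is hypothetical.

```python
import pandas as pd

# Toy rows with a mix of country codes seen in the dataset.
toy = pd.DataFrame({
    'city':    ['san marcos', 'chester (uk/england)', 'kaneohe'],
    'country': ['us', 'gb', 'us'],
})

# Boolean filter; .copy() makes the subset an independent frame,
# so later edits to it do not trigger chained-assignment warnings.
us_df = toy[toy['country'] == 'us'].copy()

# Optional: renumber the rows from 0 after filtering.
us_df = us_df.reset_index(drop=True)
print(us_df)
```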

<h4>Section 3: Analysis & Visualization</h4>