<h1>06. Final project</h1>
<br>

In this notebook, we will complete an end-to-end data visualization project and learn how we can use visualization to solve real-world problems. 

<p class="lead"> 
Table of Contents: 

- <a href="#Come-up-with-questions">Come up with questions</a>
- <a href="#Find-data">Find data</a>    
- <a href="#Refine-questions">Refine questions</a>
- <a href="#Data-cleaning">Data cleaning</a>
- <a href="#Data-exploration-and-visualization">Data exploration and visualization</a>

    
</p>





<div>
<h2 class="breadcrumb">Come up with questions</h2><p>
</div>

- What excites you? What kind of problems would you like to explore and solve with data visualization?

![](assets/UFO1.gif)

![UFO](assets/UFO01.gif)

<div>
<h2 class="breadcrumb">Find data</h2><p>
</div>

- Search the web (e.g., Google)
- Find public datasets (e.g., https://www.kaggle.com/datasets/)
- Use an API (e.g., the Twitter API)
- Scrape websites
- Run a survey


UFO data used in this notebook: https://www.kaggle.com/datasets/NUFORC/ufo-sightings

<div>
<h2 class="breadcrumb">Refine questions</h2><p>
</div>

Given the data we see, we can refine our questions:

- What is the yearly/monthly/daily trend of UFO sightings? Which year/month of the year/day of the month has the highest number of UFO sightings?
- Which country has the highest number of UFO sightings?
- Are there overall trend differences by country?

<div>
<h2 class="breadcrumb">Data cleaning</h2><p>
</div>

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
df = pd.read_csv('scrubbed.csv', low_memory=False)

In [5]:
df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611


In [6]:
df.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration (seconds)',
       'duration (hours/min)', 'comments', 'date posted', 'latitude',
       'longitude '],
      dtype='object')

In [7]:
# the column name "longitude " has a trailing space; rename it to "longitude"
df = df.rename(columns={"longitude ": "longitude"})
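
When more than one column name might carry stray whitespace, renaming them one by one gets tedious. A minimal sketch of a more general approach, using a made-up two-column frame (`toy` is a hypothetical name, not part of the notebook):

```python
import pandas as pd

# Toy frame with the same kind of problem: a trailing space in a column name.
toy = pd.DataFrame({"latitude": [29.88], "longitude ": [-97.94]})

# Strip leading and trailing whitespace from every column name at once.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

The same one-liner could be applied to `df.columns` right after loading the CSV, assuming no column relies on surrounding spaces in its name.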

In [8]:
# check data types
df.dtypes

datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)       object
duration (hours/min)     object
comments                 object
date posted              object
latitude                 object
longitude               float64
dtype: object

In [12]:
# convert duration to float
# (left commented out: this raises an error because of a malformed value, shown in the next cell)
# df['duration (seconds)'] = df['duration (seconds)'].astype(float)

In [11]:
# the malformed value: one duration contains a stray backtick
df[df['duration (seconds)']=='2`']

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
27822,2/2/2000 19:33,bouse,az,us,,2`,each a few seconds,Driving through Plomosa Pass towards Bouse Loo...,2/16/2000,33.9325,-114.005
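
The cell above filters on a value that is already known to be bad. One way to locate such values without knowing them in advance is to let pandas try the conversion and report what it could not parse. The sketch below uses a small made-up series (`durations` is a stand-in for the real column) and is only an illustration:

```python
import pandas as pd

# Toy column mimicking 'duration (seconds)': strings, one with a stray backtick.
durations = pd.Series(["2700", "7200", "2`", "20"], name="duration (seconds)")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
as_numbers = pd.to_numeric(durations, errors="coerce")

# Entries that were present but failed to parse are the ones to inspect.
bad_entries = durations[as_numbers.isna() & durations.notna()]
print(bad_entries)
```

Applied to `df['duration (seconds)']` or `df['latitude']`, the same pattern should list every row that blocks `astype(float)`, not just the first one the error message mentions.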



In [13]:
# clean up duration: strip the stray backtick
df['duration (seconds)'] = df['duration (seconds)'].str.strip('`')

In [14]:
# convert duration to float 
df['duration (seconds)'] = df['duration (seconds)'].astype(float)

In [16]:
# convert latitude to float
# (left commented out: this raises an error because of a malformed value, shown in the next cell)
# df['latitude'] = df['latitude'].astype(float)

In [17]:
# the malformed value: one latitude contains a stray letter 'q'
df[df.latitude=='33q.200088']

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
43782,5/22/1974 05:30,mescalero indian reservation,nm,,rectangle,180.0,two hours,Huge rectangular object emmitting intense whit...,4/18/2012,33q.200088,-105.624152


In [18]:
# check data
df.iloc[43780: 43786]

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
43780,5/2/2014 22:40,iron gate,va,us,light,180.0,3 minutes,3 fiery orange lights in formation maneuvering...,5/8/2014,37.7975000,-79.791389
43781,5/2/2014 22:45,parma,oh,us,fireball,40.0,30-40 seconds,Round orange red fireball over Parma.,5/8/2014,41.4047222,-81.723056
43782,5/22/1974 05:30,mescalero indian reservation,nm,,rectangle,180.0,two hours,Huge rectangular object emmitting intense whit...,4/18/2012,33q.200088,-105.624152
43783,5/22/1977 20:00,marana,az,us,sphere,3600.0,60 min ??,That&#39s NOT the moon&#33,9/29/2004,32.4366667,-111.224722
43784,5/22/1980 02:00,little rock,ar,us,unknown,15.0,15 seconds,Bright intense light awoke us at 2:00 am in Li...,4/16/2005,34.7463889,-92.289444
43785,5/22/1990 23:00,perl island (private),ak,,sphere,2100.0,two @ 35minea,The ship hovered over our runway twice for 35 ...,3/2/2004,59.119149,-151.68817


In [19]:
# check whether any other rows have a latitude containing the letter 'q'
df[df['latitude'].str.contains('q')]

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
43782,5/22/1974 05:30,mescalero indian reservation,nm,,rectangle,180.0,two hours,Huge rectangular object emmitting intense whit...,4/18/2012,33q.200088,-105.624152


In [20]:
# clean up latitude: remove the stray 'q'
df['latitude'] = df['latitude'].str.replace('q','')

In [21]:
# convert latitude to float
df['latitude'] = df['latitude'].astype(float)

In [23]:
# convert datetime to pandas datetime object
# (left commented out: this raises an error because of an invalid time, shown in the next cell)
# df['datetime'] = pd.to_datetime(df['datetime'])

In [24]:
# the invalid time: 24:00 is not a valid hour
df[df['datetime']=='10/11/2006 24:00']

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
388,10/11/2006 24:00,rome,ny,us,oval,120.0,a min or two,I was walking from the garage to the house&#44...,2/1/2007,43.212778,-75.456111


In [25]:
# clean up datetime: replace the invalid time 24:00 with 23:59 (an approximation) so it can be parsed
df['datetime'] = df['datetime'].str.replace('24:00','23:59')

In [26]:
# convert datetime to pandas datetime object
df['datetime'] = pd.to_datetime(df['datetime'])
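
Replacing `24:00` with `23:59` keeps the sighting on the same calendar day but shifts it by a minute. The sketch below shows two related ideas on a made-up series (`raw` and `fmt` are hypothetical names): finding unparseable timestamps with `errors="coerce"`, and an alternative repair that reads `24:00` as midnight of the following day. Which reading the reporter intended is an assumption either way.

```python
import pandas as pd

# Toy timestamps in the dataset's format; the second one uses the invalid hour 24.
raw = pd.Series(["10/10/1949 20:30", "10/11/2006 24:00", "5/22/1974 05:30"])
fmt = "%m/%d/%Y %H:%M"

# errors="coerce" yields NaT for anything that does not parse, which exposes bad rows.
parsed = pd.to_datetime(raw, format=fmt, errors="coerce")
unparseable = raw[parsed.isna()]
print(unparseable.tolist())

# Alternative repair: read 24:00 as 00:00, then move those rows one day forward.
is_24 = raw.str.endswith(" 24:00")
fixed = pd.to_datetime(raw.str.replace(" 24:00", " 00:00", regex=False), format=fmt)
fixed = fixed + pd.to_timedelta(is_24.astype(int), unit="D")
print(fixed[1])
```

For the yearly and monthly counts in this notebook the two repairs rarely differ, but they can differ for sightings on the last day of a month or year.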

In [27]:
# create variables year, month, day 
df['year'] = df['datetime'].dt.year
df['month'] = df['datetime'].dt.month
df['day'] = df['datetime'].dt.day

In [28]:
# check data types again
df.dtypes

datetime                datetime64[ns]
city                            object
state                           object
country                         object
shape                           object
duration (seconds)             float64
duration (hours/min)            object
comments                        object
date posted                     object
latitude                       float64
longitude                      float64
year                             int64
month                            int64
day                              int64
dtype: object

<div>
<h2 class="breadcrumb">Data exploration and visualization</h2><p>
</div>

### What is the overall trend of UFO sightings over time?

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
sns.histplot(data=df, x='datetime', kde=True);

### Which year has the highest number of UFO sightings?

In [None]:
df.year.value_counts()
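
`value_counts()` sorts by count, so the answer is the first row of its output. To pull the answer out programmatically, something like the following sketch may help. The years here are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Made-up years, for illustration only.
years = pd.Series([2010, 2012, 2012, 2013, 2012, 2013], name="year")

counts = years.value_counts()

# idxmax returns the index label (here, the year) with the largest count.
busiest_year = counts.idxmax()
print(busiest_year, counts.max())
```

The same two calls on `df.year`, `df.month`, or `df.day` would answer the corresponding question for the real data.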

In [None]:
fig, ax = plt.subplots(figsize=(15,5), constrained_layout=True)
sns.countplot(data=df[df.year>1980], x='year');
ax.set_xlabel("Year");
plt.xticks(rotation=90);
ax.grid()


In [None]:
df.head()

### Which day of the month has the highest number of UFO sightings?

In [None]:
df.day.value_counts()

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
sns.countplot(data=df, x='day');
ax.set_xlabel("Day of month");
for p in ax.patches:
    # label each bar with its count, formatted as a whole number
    ax.annotate(f"{p.get_height():.0f}", (p.get_x(), p.get_height()));
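
As an alternative to looping over `ax.patches`, matplotlib 3.4 and later provide `Axes.bar_label`, which places one centered label per bar. The sketch below uses plain matplotlib bars and made-up day values so that it stays self-contained:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up day-of-month values, for illustration only.
days = pd.Series([1, 1, 15, 15, 15, 4])
counts = days.value_counts().sort_index()

fig, ax = plt.subplots(figsize=(6, 3), constrained_layout=True)
bars = ax.bar([str(d) for d in counts.index], counts.to_numpy())
ax.set_xlabel("Day of month")

# bar_label (matplotlib >= 3.4) writes each bar's height above it.
labels = ax.bar_label(bars, fmt="%d")
```

With a seaborn `countplot`, passing `ax.containers[0]` to `ax.bar_label` should work the same way, although that is worth checking against the installed seaborn version.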

### Which month of the year has the highest number of UFO sightings?

In [None]:
df.month.value_counts()

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
sns.countplot(data=df, x='month');
ax.set_xlabel("Month of year");
for p in ax.patches:
    # label each bar with its count, formatted as a whole number
    ax.annotate(f"{p.get_height():.0f}", (p.get_x()+0.2, p.get_height()));

### Which country has the highest number of UFO sightings?

In [None]:
df.plot('longitude', 'latitude', kind='scatter', alpha=0.1);
# there are other tools more appropriate for geographical plotting that we will not cover here.

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
sns.countplot(data=df, x='country');
ax.set_xlabel("Country");
for p in ax.patches:
    # label each bar with its count, formatted as a whole number
    ax.annotate(f"{p.get_height():.0f}", (p.get_x()+0.3, p.get_height()));
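
Several rows in the outputs above have an empty `country` field, and a count plot only shows the categories that are present. Before reading too much into the bars, it may be worth checking how many rows have no country at all. A minimal sketch on made-up values (`countries` is a stand-in for `df['country']`):

```python
import pandas as pd

# Made-up country codes with some missing entries, for illustration only.
countries = pd.Series(["us", None, "gb", "us", "us", None], name="country")

# dropna=False keeps missing values as their own row in the counts.
counts = countries.value_counts(dropna=False)
print(counts)

# Fraction of rows with no country recorded.
missing_share = countries.isna().mean()
print(f"{missing_share:.1%} of rows have no country")
```

If the missing share is large in the real data, the country comparison describes only the rows where a country was recorded.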

### Are there overall trend differences by country?

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
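# with hue, the curves share one normalization by default (common_norm=True),
# so countries with few sightings appear almost flat
sns.kdeplot(data=df, x='datetime', hue='country');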

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
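# plot each country separately so every curve is normalized on its own; skip missing countries
for c in df['country'].dropna().unique():
    sns.kdeplot(data=df[df.country==c], x='datetime', label=c, ax=ax);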
ax.legend();
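
Density curves show the shape of each trend but hide the actual counts. A complementary view is a table of sightings per year and country, optionally rescaled so that each country's column sums to 1. The sketch below uses a tiny made-up frame (`toy`) with the same `year` and `country` column names as the notebook:

```python
import pandas as pd

# Made-up sightings, for illustration only.
toy = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011, 2011, 2012],
    "country": ["us", "gb", "us", "us", "gb", "us"],
})

# Rows are years, columns are countries, cells are sighting counts.
per_year = pd.crosstab(toy["year"], toy["country"])
print(per_year)

# Divide each column by its total so countries of different sizes are comparable.
share = per_year / per_year.sum()
print(share)
```

Calling `share.plot()` on the result would draw one line per country, which can then be compared against the KDE curves above.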


It's your turn!

What other insights can you find in this dataset, or in another dataset of your choice?