<h1>Final project </h1>
<br>

In this notebook, we will complete an end-to-end data visualization project and learn how we can use visualization to solve real-world problems. 

<p class="lead"> 
Table of Contents: 

- <a href="#Come-up-with-questions">Come up with questions</a>
- <a href="#Find-data">Find data</a>    
- <a href="#Refine-questions">Refine questions</a>
- <a href="#Data-cleaning">Data cleaning</a>
- <a href="#Data-exploration-and-visualization">Data exploration and visualization</a>

    
</p>





<div>
<h2 class="breadcrumb">Come up with questions</h2><p>
</div>

- What excites you? What kind of problems would you like to explore and solve with data visualization?

![](assets/UFO1.gif)

<div>
<h2 class="breadcrumb">Find data</h2><p>
</div>

- Search Google
- Find public datasets (e.g., https://www.kaggle.com/datasets/)
- Use an API (e.g., Twitter API)
- Scrape the web
- Run a survey


In this notebook, we use the UFO sightings dataset: https://www.kaggle.com/datasets/NUFORC/ufo-sightings

<div>
<h2 class="breadcrumb">Refine questions</h2><p>
</div>

Given the data we see, we can refine our questions:

- What are the yearly, monthly, and daily trends in UFO sightings? Which year, month of the year, and day of the month have the most sightings?
- Which country has the highest number of UFO sightings?
- Are there overall trend differences by country?

<div>
<h2 class="breadcrumb">Data cleaning</h2><p>
</div>

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('archive/scrubbed.csv', low_memory=False)

In [None]:
df.head()

In [None]:
df.columns

In [None]:
# rename column "longitude " to "longitude". There is an extra space. 
df = df.rename(columns={"longitude ": "longitude"})

In [None]:
# check data types
df.dtypes

In [None]:
# first attempt: convert duration to float
# (left commented out because it raises a ValueError on malformed values)
# df['duration (seconds)'] = df['duration (seconds)'].astype(float)

In [None]:
# the offending value: a duration with a stray backtick
df[df['duration (seconds)']=='2`']

In [None]:
# clean up duration 
df['duration (seconds)'] = df['duration (seconds)'].str.strip('`')

In [None]:
# convert duration to float 
df['duration (seconds)'] = df['duration (seconds)'].astype(float)
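
The malformed duration above was found by attempting the conversion and reading the error. A minimal sketch of one way to list every unparseable value at once, using `pd.to_numeric` with `errors="coerce"`. The `durations` series here is made-up toy data standing in for `df['duration (seconds)']`, not values from the dataset.

```python
import pandas as pd

# Toy stand-in for the 'duration (seconds)' column; these values are made up.
durations = pd.Series(["2700", "20", "2`", "0.5", "8`"], name="duration (seconds)")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
as_number = pd.to_numeric(durations, errors="coerce")

# Values that were present in the raw data but failed to parse are the ones to inspect.
bad_values = durations[as_number.isna() & durations.notna()]
print(bad_values.tolist())
```

The same pattern should apply to the `latitude` column, where the stray character is a letter rather than a backtick.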

In [None]:
# first attempt: convert latitude to float
# (left commented out because it raises a ValueError on malformed values)
# df['latitude'] = df['latitude'].astype(float)

In [None]:
# the offending value: a latitude with a stray letter q
df[df.latitude=='33q.200088']

In [None]:
# check data
df.iloc[43780: 43786]

In [None]:
# check rows with latitude containing string q 
df[df['latitude'].str.contains('q')]

In [None]:
# clean up latitude
df['latitude'] = df['latitude'].str.replace('q','')

In [None]:
# convert latitude to float
df['latitude'] = df['latitude'].astype(float)

In [None]:
# first attempt: convert datetime to pandas datetime object
# (left commented out because it fails on invalid times)
# df['datetime'] = pd.to_datetime(df['datetime'])

In [None]:
# the offending value: 24:00 is not a valid time of day
df[df['datetime']=='10/11/2006 24:00']

In [None]:
# clean up datetime: pd.to_datetime rejects 24:00,
# so approximate it as 23:59 on the same date
df['datetime'] = df['datetime'].str.replace('24:00','23:59')

In [None]:
# convert datetime to pandas datetime object
df['datetime'] = pd.to_datetime(df['datetime'])
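
Replacing `24:00` with `23:59` keeps each sighting on its recorded date but moves it by one minute. An alternative sketch, in case exact timestamps matter, treats `24:00` as `00:00` of the following day. The helper name `parse_with_midnight_rollover` and the `%m/%d/%Y %H:%M` format are assumptions made for this illustration, and the sample strings are toy input.

```python
import pandas as pd

def parse_with_midnight_rollover(raw: pd.Series) -> pd.Series:
    """Parse 'm/d/Y H:M' strings, treating 24:00 as 00:00 of the next day."""
    # Remember which rows used the 24:00 notation (missing values count as False).
    is_midnight = raw.str.endswith("24:00", na=False)
    # Swap only a trailing 24:00 so the string becomes parseable.
    fixed = raw.str.replace(r"24:00$", "00:00", regex=True)
    parsed = pd.to_datetime(fixed, format="%m/%d/%Y %H:%M")
    # Add one day to the rows that originally said 24:00.
    return parsed + pd.to_timedelta(is_midnight.astype(int), unit="D")

sample = pd.Series(["10/11/2006 24:00", "10/10/1949 20:30"])
result = parse_with_midnight_rollover(sample)
print(result)
```

For yearly and monthly counts the two approaches differ only for sightings recorded at 24:00 on the last day of a month or year.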

In [None]:
# create variables year, month, day 
df['year'] = df['datetime'].dt.year
df['month'] = df['datetime'].dt.month
df['day'] = df['datetime'].dt.day

In [None]:
# check data types again
df.dtypes

<div>
<h2 class="breadcrumb">Data exploration and visualization</h2><p>
</div>

### What is the overall trend of UFO sightings over time?

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
sns.histplot(data=df, x='datetime', kde=True);

### Which year has the highest number of UFO sightings?

In [None]:
df.year.value_counts()
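
`value_counts()` sorts from most to least frequent, so the first row already answers the question. As a small sketch, the top year and its count can also be pulled out directly with `idxmax()` and `max()`. The `years` series is made-up toy data standing in for `df['year']`.

```python
import pandas as pd

# Made-up years standing in for df['year'].
years = pd.Series([2012, 2013, 2012, 2011, 2012, 2013], name="year")

counts = years.value_counts()   # sorted from most to least frequent
top_year = counts.idxmax()      # index label (the year) with the largest count
top_count = counts.max()        # the largest count itself
print(f"{top_year}: {top_count} sightings")
```

The same two calls work for `df.month` and `df.day` in the sections that follow.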

In [None]:
fig, ax = plt.subplots(figsize=(15,5), constrained_layout=True)
sns.countplot(data=df[df.year>1980], x='year');
ax.set_xlabel("Year");
plt.xticks(rotation=90);
ax.grid()


In [None]:
df.head()

### Which day of the month has the highest number of UFO sightings?

In [None]:
df.day.value_counts()

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
sns.countplot(data=df, x='day');
ax.set_xlabel("Day of month");
# label each bar with its count, formatted as a whole number
for p in ax.patches:
    ax.annotate(f"{p.get_height():.0f}", (p.get_x(), p.get_height()));

### Which month of the year has the highest number of UFO sightings?

In [None]:
df.month.value_counts()

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
sns.countplot(data=df, x='month');
ax.set_xlabel("Month of year");
# label each bar with its count, formatted as a whole number
for p in ax.patches:
    ax.annotate(f"{p.get_height():.0f}", (p.get_x()+0.2, p.get_height()));

### Which country has the highest number of UFO sightings?

In [None]:
df.plot('longitude', 'latitude', kind='scatter', alpha=0.1);
# there are other tools more appropriate for geographical plotting that we will not cover here.

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
sns.countplot(data=df, x='country');
ax.set_xlabel("Country");
# label each bar with its count, formatted as a whole number
for p in ax.patches:
    ax.annotate(f"{p.get_height():.0f}", (p.get_x()+0.3, p.get_height()));

### Are there overall trend differences by country?

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
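# with hue, each country's density is scaled by its share of all sightings
# (common_norm=True by default), so countries with few sightings look flat
sns.kdeplot(data=df, x='datetime', hue='country');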

In [None]:
fig, ax = plt.subplots(figsize=(14,5), constrained_layout=True)
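# one KDE per country, each normalized on its own, so the shapes are
# comparable regardless of sighting volume; skip rows with missing country
for c in df['country'].dropna().unique():
    sns.kdeplot(data=df[df.country==c], x='datetime', label=c, ax=ax);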
ax.legend();
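
The density curves show shape but not numbers. A minimal sketch of a tabular companion: count sightings per year and country with `pd.crosstab`, normalized within each country so that every column sums to 1. The `toy` frame and its country codes are made-up stand-ins for `df[['year', 'country']]`.

```python
import pandas as pd

# Made-up rows standing in for df[['year', 'country']].
toy = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011, 2011, 2010, 2011],
    "country": ["us", "us", "us", "gb", "gb", "gb", "ca"],
})

# Rows are years, columns are countries; normalize="columns" divides each
# column by its total, giving the share of a country's sightings per year.
shares = pd.crosstab(toy["year"], toy["country"], normalize="columns")
print(shares)
```

Comparing columns of `shares` shows whether a country's sightings are concentrated in different years than another's, independent of how many sightings each country has in total.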


It's your turn!

What other insights can you find with this dataset or another dataset you chose? 