# The visualisation of UFO movements over the years in the USA
### A story about UFO sightings over the years in the USA

![](https://media-cldnry.s-nbcnews.com/image/upload/t_nbcnews-fp-1024-512,f_auto,q_auto:best/newscms/2021_20/3476829/210521-ufo-new-mexico-ew-453p.jpg)


## Loading the necessary libraries

In [None]:
import pandas as pd
import seaborn as sns
import re
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

## Data loading and cleaning

### Data loading

In [None]:
df = pd.read_csv('/kaggle/input/ufo-sightings-approx-100000/nuforc_reports.csv')

### Data inspection

In [None]:
print(df.shape)
df.head()

In [None]:
df.info()

In [None]:
print(df.isna().sum())
print(df.isna().sum().sum())

In total, 45,858 values are missing, which is a significant number.<br>
Most of them are in the city latitude and longitude columns; we can either drop these rows or derive the coordinates from the city names.<br>
For the missing date_time values, we'll check whether the date can be extracted from the stats column.<br>
Because the summary and text columns hold similar information, we'll drop the text column, as it has more missing values.
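
The option of deriving missing coordinates from the city names could look roughly like the sketch below. It is not what this notebook does further on; it only illustrates the idea on a toy frame. The helper name `fill_coords_from_city` and the `city` column are assumptions, and a row can only be filled when another row for the same city and state already has coordinates.

```python
import pandas as pd


def fill_coords_from_city(frame, keys=("city", "state"),
                          coord_cols=("city_latitude", "city_longitude")):
    """Fill missing coordinates with the mean known coordinate of the same city/state."""
    out = frame.copy()
    for col in coord_cols:
        # Mean of the known values per city/state, aligned to the original rows
        known = out.groupby(list(keys))[col].transform("mean")
        out[col] = out[col].fillna(known)
    return out


# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "city": ["Austin", "Austin", "Reno"],
    "state": ["TX", "TX", "NV"],
    "city_latitude": [30.27, None, None],
    "city_longitude": [-97.74, None, None],
})
filled = fill_coords_from_city(toy)
print(filled)
```

The second Austin row is filled from the first one, while the Reno row stays empty because no other Reno row has coordinates.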


### Data cleaning

#### Filling missing data with regular expressions

In [None]:
# Extract the date from the stats column and save it in a new column called date.
# The drop column only captures the excess 'Occurred : ' prefix and is dropped later.
df[['drop', 'date']] = df['stats'].str.extract(r'^(?P<drop>Occurred : )(?P<date>[0-9]{1,2}[/][0-9]{1,2}[/][0-9]{4})')

In [None]:
# Check the output to see if we managed to extract the date
df.head()

#### Dropping columns

In [None]:
# Drop all excess columns that do not hold any useful information for further analysis
df.drop(columns=['date_time', 'stats', 'report_link', 'text', 'posted', 'drop'], inplace=True)

#### Setting the correct data type

In [None]:
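# Set the correct data type for the columns
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# Replace missing summaries with empty strings so they don't turn into the literal text 'nan'
df['summary'] = df['summary'].fillna('').astype(str)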

## Exploratory Data Analysis

### Date exploration

In [None]:
sns.set_theme(style='darkgrid')
plt.figure(figsize=(15, 7))

ax = sns.kdeplot(
    data=df,
    x='date',
    fill=True
)
plt.title(
    'UFO sightings over the years',
    fontdict={
        'fontsize': 16
    }
)

In [None]:
df.sort_values(by=['date'], ascending=True).head(10)

The earliest recorded UFO sightings date from the beginning of the 18th century. **We even have a record involving Thomas Jefferson**.

Most UFO sightings were recorded after the year 2000.
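
To put a number on this claim, the share of sightings from 2000 onwards can be computed directly. The sketch below uses a hypothetical helper `share_since` and toy dates; it counts the year 2000 itself as part of the recent period, which is an assumption about what "after the year 2000" should mean.

```python
import pandas as pd


def share_since(dates, year):
    """Fraction of the non-missing dates that fall in `year` or later."""
    valid = pd.to_datetime(pd.Series(dates), errors="coerce").dropna()
    if valid.empty:
        return float("nan")
    return float((valid.dt.year >= year).mean())


# Toy dates (hypothetical); the missing value is ignored
toy_dates = ["1995-07-04", "2004-01-15", "2012-08-30", "2018-11-02", None]
share = share_since(toy_dates, 2000)
print(f"{share:.0%} of the toy sightings are from 2000 or later")
```

On the real data this would be called as `share_since(df['date'], 2000)`.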

### State exploration

In [None]:
plt.figure(figsize=(15, 15))
sns.countplot(
    data=df,
    y='state',
    order=df['state'].value_counts().index
)

plt.title(
    'UFO sightings by state',
    fontdict={
        'fontsize': 16
    }
)

By far the most UFO sightings are reported in California. It would be interesting to find out whether this has always been the case or whether it is a recent development.

In [None]:
state_list = df.groupby('state')['state'].count().reset_index(name='count').nlargest(10, columns=['count'])


plt.figure(figsize=(15, 7))

sns.kdeplot(
    data=df[df['state'].isin(state_list['state'])],
    x='date',
    hue='state'
)

plt.title(
    'UFO sightings over the years by state',
    fontdict={
        'fontsize': 16
    }
)

Most of California's recorded UFO sightings date from after the year 2000, which is similar to the other states. However, we also see a fair number of records from before this period.
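
A density plot only shows the shape of each state's distribution, so a table of raw counts per decade can help to check which state leads in which period. The sketch below is one possible way to build it; the helper name `sightings_per_decade` and the toy rows are hypothetical.

```python
import pandas as pd


def sightings_per_decade(frame, group_col="state", date_col="date"):
    """Count sightings per decade (rows) and group (columns)."""
    valid = frame.dropna(subset=[date_col, group_col])
    # 1995 -> 1990, 2011 -> 2010, ...
    decade = ((valid[date_col].dt.year // 10) * 10).rename("decade")
    return pd.crosstab(decade, valid[group_col])


# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "state": ["CA", "CA", "WA", "CA", "WA", "WA"],
    "date": pd.to_datetime([
        "1995-01-01", "2011-03-05", "2012-06-07",
        "2015-09-09", "1998-02-02", "1999-04-04",
    ]),
})
table = sightings_per_decade(toy)
top_state = table.idxmax(axis=1)  # leading state per decade
print(table)
print(top_state)
```

The same helper works for the shape column by passing `group_col='shape'`.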

### Shape exploration

In [None]:
plt.figure(figsize=(15, 8))

sns.countplot(
    data=df,
    y='shape',
    order=df['shape'].value_counts().index
)

plt.title(
    'UFO sightings by shape',
    fontdict={
        'fontsize': 16
    }
)

In [None]:
shape_list = df.groupby('shape')['shape'].count().reset_index(name='count').nlargest(10, columns=['count'])

plt.figure(figsize=(15, 7))

sns.kdeplot(
    data=df[df['shape'].isin(shape_list['shape'])],
    x='date',
    hue='shape'
)

plt.title(
    'UFO sightings over the years by shape',
    fontdict={
        'fontsize': 16
    }
)

The most frequently reported shapes are a light and a circle.
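
The ranking can also be expressed as relative shares, which makes it easier to see how dominant the top shapes really are. This is a minimal sketch with a hypothetical helper `top_shares` and toy values; missing shapes are excluded from the total.

```python
import pandas as pd


def top_shares(values, n=5):
    """Relative frequency of the n most common values, ignoring missing ones."""
    return pd.Series(values).value_counts(normalize=True).head(n)


# Toy shapes (hypothetical values, not taken from the dataset)
toy_shapes = ["light", "circle", "light", "triangle",
              "light", "circle", None, "disk"]
shares = top_shares(toy_shapes, n=2)
print(shares)
```

On the real data this would be called as `top_shares(df['shape'], n=10)`.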

### Interactive maps

In [None]:
# Create a temporary dataframe without missing dates
df_temp = df.dropna(subset=['date']).copy()
df_temp['year'] = df_temp['date'].dt.year

In [None]:
fig = px.choropleth(
    df_temp.groupby(['state', 'year'])['year'].count().reset_index(name='Sightings').sort_values(by=['year'], ascending=True),
    locations="state",
    color='Sightings',
    color_continuous_scale='aggrnyl',
    locationmode='USA-states',
    scope="usa",
    animation_frame="year",
    animation_group='state',
    height=700
)

fig.update_layout(
    title_text='UFO sightings over the years',

)

fig.show()

The above visualisation shows the number of UFO sightings per state for each year from 1721 to 2019.
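
Grouping by state and year only produces rows for combinations that actually occur, so a state without sightings in a given year has no row in that animation frame. One possible refinement, sketched below, is to build the full state × year grid and fill the gaps with zero. The helper name `complete_state_year_counts` and the toy rows are hypothetical, and only the years present in the data are used.

```python
import pandas as pd


def complete_state_year_counts(frame, state_col="state", year_col="year"):
    """Sightings per state and year, with explicit zeros for missing combinations."""
    counts = frame.groupby([state_col, year_col]).size()
    full_index = pd.MultiIndex.from_product(
        [
            sorted(frame[state_col].dropna().unique()),
            sorted(frame[year_col].dropna().unique()),
        ],
        names=[state_col, year_col],
    )
    return (
        counts.reindex(full_index, fill_value=0)
        .reset_index(name="Sightings")
        .sort_values([year_col, state_col], ignore_index=True)
    )


# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "state": ["CA", "CA", "TX"],
    "year": [2000, 2001, 2001],
})
grid = complete_state_year_counts(toy)
print(grid)
```

The resulting frame could be passed to `px.choropleth` in place of the grouped frame. Passing `range_color=(0, grid['Sightings'].max())` as well would keep the colour scale identical across frames, which makes years easier to compare.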

In [None]:
fig = px.choropleth(
    df_temp.groupby(['state'])['state'].count().reset_index(name='Sightings'),
    locations='state',
    color='Sightings',
    color_continuous_scale='aggrnyl',
    locationmode = 'USA-states',
    height=700
)

fig.update_layout(
    title_text = 'Total UFO sightings by state',
    geo_scope='usa', # limit map scope to the USA
)

fig.show()

The above map shows the total number of UFO sightings by state from 1721 to 2019. It clearly shows that most recorded UFO sightings take place in California.

In [None]:
df_temp = df_temp.dropna(subset=['city_latitude', 'city_longitude'])
fig = px.scatter_geo(
    df_temp.sort_values(by=['year'],ascending=True),
    lat='city_latitude',
    lon='city_longitude',
    locationmode='USA-states',
    scope="usa",
    animation_frame="year",
    animation_group='state',
    height=700
)

fig.update_layout(
    title_text='Recorded UFO location over the years',
)

fig.show()

The above map shows the city-level location of the recorded UFO sightings per year. It illustrates the massive increase in sightings since the year 2000.

### Wordcloud

In [None]:
text = " ".join(df['summary'])
wordcloud = WordCloud(
    background_color="white",
    width=1600, 
    height=800
).generate(text)

plt.figure(figsize=(15, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()

The word cloud shows the words that occur most often in the summaries of the recorded UFO sightings.
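
A word cloud gives an impression but no numbers. As a complement, the most frequent words can be counted with the standard library. The sketch below is a simplified stand-in, not how `WordCloud` tokenises text internally; the helper `top_words`, the small stopword set, and the toy summaries are all hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword set (an assumption, far from complete)
HYPOTHETICAL_STOPWORDS = {"a", "an", "the", "in", "of", "and", "over", "nan"}


def top_words(texts, n=10, stopwords=HYPOTHETICAL_STOPWORDS):
    """Return the n most common lowercase words across all texts, skipping stopwords."""
    counter = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", str(text).lower())
        counter.update(word for word in words if word not in stopwords)
    return counter.most_common(n)


# Toy summaries (hypothetical, not taken from the dataset)
toy_summaries = [
    "Bright light in the sky",
    "A light moving fast",
    "Orange light over the city",
]
most_common = top_words(toy_summaries, n=1)
print(most_common)
```

On the real data this would be called as `top_words(df['summary'], n=20)`.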