# UFO Sightings

## Introduction
UFO sightings have been reported throughout history. Many can be explained scientifically, but some elude explanation. The United States Government has studied UFOs over the years, and in 2021 it showed renewed interest in them on national security grounds.

## Problem Statement
What can we learn from the UFO Sighting Reports?
* Location
    - What are the most common locations that UFOs are sighted?
* What are the most common UFO shapes?
* What times of the day are UFOs seen the most?
* Descriptions
    - What are the topics discussed in UFO sightings reports?
    - What is the sentiment of UFO sightings reports?
    
## Output

1. Sighting Location
2. Sighting Duration
3. Sighting Day and Time
4. UFO Shape
5. Comments Corpus
6. Comments Document Term Matrix (DTM)


## Data Source
__[NATIONAL UFO REPORTING CENTER (NUFORC)](https://www.kaggle.com/datasets/NUFORC/ufo-sightings)__


MIT License

Copyright (c) 2022 UFO Software, LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import spacy
from spacy import displacy
from spacy.language import Language
from spacy.util import minibatch
from textblob import TextBlob
import re
import string
import os
from os.path import exists
import geopandas as gpd
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from tqdm.notebook import tqdm
import pickle
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# declare the directory structure
parent_dir = '/Volumes/data_sets/ufo_sightings'
data_dir = parent_dir+'/data'
temp_dir = parent_dir+'/temp'
# create the temp directory if it does not exist yet
os.makedirs(temp_dir, exist_ok=True)

In [3]:
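# read in the data

# latitude is read as a string because one value is malformed (fixed below);
# the longitude header in the file has a trailing space (renamed below)
col_dtypes = {'city': 'string',
              'state': 'string',
              'country': 'string',
              'shape': 'string',
              'comments': 'string',
              'latitude': 'string',
              'longitude ': 'float32'
             }

# columns passed to parse_dates; only 'date posted' comes back as datetime64,
# the other three stay as object (see df.dtypes below) and are cleaned later
date_cols = ['datetime',
             'duration (seconds)',
             'duration (hours/min)',
             'date posted'
            ]

cols = list(col_dtypes.keys()) + date_cols

df = pd.read_csv(data_dir+'/scrubbed.csv',
                 low_memory=False,
                 usecols=cols,
                 dtype=col_dtypes,
                 parse_dates=date_cols,
                 skipinitialspace=True)

# shorter column names that are valid Python identifiers
df.rename(columns={'longitude ': 'longitude',
                   'duration (seconds)': 'seconds',
                   'duration (hours/min)': 'hours_min',
                   'date posted': 'date_posted'},
          inplace=True)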
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.8830556,-97.941109
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581085
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.9783333,-96.645836
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.4180556,-157.803604
...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.1658333,-86.784447
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.6136111,-116.202499
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.2972222,-122.284447
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.9011111,-77.265556


In [4]:
df.dtypes

datetime               object
city                   string
state                  string
country                string
shape                  string
seconds                object
hours_min              object
comments               string
date_posted    datetime64[ns]
latitude               string
longitude             float32
dtype: object

## Location Data

In [5]:
df.latitude[df.latitude.str.contains('q')]

43782    33q.200088
Name: latitude, dtype: string

In [6]:
# repair the single malformed latitude, then convert the column to float
df.loc[43782, 'latitude'] = '33.200088'
df.latitude = df.latitude.astype(float)
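
A less manual way to find values like this is to coerce the column to numbers and look at what fails to parse. The sketch below is illustrative only: the small Series is made-up sample data, and `pd.to_numeric` with `errors='coerce'` is an assumed alternative to the string search used above, not the notebook's own method.

```python
import pandas as pd

# made-up sample resembling the latitude column, including the malformed value
latitude = pd.Series(['29.8830556', '33q.200088', '53.2'])

# values that cannot be parsed as numbers become NaN instead of raising
parsed = pd.to_numeric(latitude, errors='coerce')

# the rows that failed to parse are the ones that need manual repair
bad_rows = latitude[parsed.isna()]
print(bad_rows)
```

This approach catches any non-numeric value, not only ones containing a particular character.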

## Location Data from GPS Coordinates
Fill in missing values and fix improperly recorded locations by reverse geocoding the GPS coordinates.

In [7]:
# Nominatim reverse geocoder wrapped in a rate limiter that also retries on errors
locator = Nominatim(user_agent='ufo_sightings', timeout=20)
rgeocode = RateLimiter(locator.reverse, min_delay_seconds=0.75, max_retries=10, error_wait_seconds=300.0)

def get_city_state_country(x):
    """Return [city, state, country_code] for a row, looked up from its coordinates."""
    location = rgeocode(str(x.latitude)+','+str(x.longitude), language="en")
    if location is not None:
        address = location.raw['address']
        # address parts that are absent from the response become empty strings
        city = address.get('city', '')
        state = address.get('state', '')
        country_code = address.get('country_code', '')

        return [city, state, country_code]
    else:
        # if the location is not found from the GPS coordinates return the original data
        return [x.city, x.state, x.country]

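## Warning: Long Execution Time
The reverse geocoding in the next cell takes over 15 hours to run. The result is written to a parquet file, so later runs load that file instead of repeating the lookups.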
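
Several reports share exactly the same coordinates (the ocean sightings, for example), so one way to shorten the run might be to look up each distinct coordinate pair only once. The sketch below illustrates that idea with `functools.lru_cache`. The `fake_reverse` function is a hypothetical stand-in for the rate-limited geocoder so the example runs offline, and its return value is invented.

```python
from functools import lru_cache

calls = []  # records every lookup that actually reaches the (fake) geocoder

def fake_reverse(query):
    """Hypothetical stand-in for the geocoder; returns an invented address."""
    calls.append(query)
    return {'address': {'city': 'Example', 'state': 'Example State', 'country_code': 'xx'}}

@lru_cache(maxsize=None)
def cached_lookup(lat, lon):
    """Look up a coordinate pair once and reuse the answer for repeats."""
    address = fake_reverse(f'{lat},{lon}')['address']
    return (address.get('city', ''),
            address.get('state', ''),
            address.get('country_code', ''))

# two of the three coordinate pairs are identical
coords = [(-33.137551, 81.826172), (-33.137551, 81.826172), (29.883056, -97.941109)]
results = [cached_lookup(lat, lon) for lat, lon in coords]

print(len(results), 'rows resolved with', len(calls), 'lookups')
```

How much time this saves depends on how many coordinate pairs repeat in the data, which is not measured here.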

In [8]:
location_file = temp_dir+'/city_state_country.parquet'
if not os.path.isfile(location_file):
    tqdm.pandas()
    df['geo_city'], df['geo_state'], df['geo_country'] = zip(*df.progress_apply(get_city_state_country, axis =1))
    # to_parquet returns None when given a path, so df must not be reassigned
    df.to_parquet(location_file)
else:
    df = pd.read_parquet(location_file)
    
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,,Texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,San Antonio,Texas,us
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,Chester,England,gb
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,,Texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,Kaneohe,Hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,Nashville-Davidson,Tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,Boise,Idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,Napa,California,us
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,,Virginia,us
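
The cell above follows a compute-once-then-cache pattern. A minimal sketch of that pattern is shown below, using a JSON file and the hypothetical helpers `load_or_compute` and `slow_lookup` as stand-ins for the parquet file and the geocoding step. It only illustrates the control flow and is not the notebook's implementation.

```python
import json
import os
import tempfile

def load_or_compute(cache_file, compute):
    """Return cached results if the file exists, otherwise compute and cache them."""
    if os.path.isfile(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    result = compute()
    with open(cache_file, 'w') as f:
        json.dump(result, f)
    # return the computed value itself, not the result of the write call
    return result

runs = []  # counts how often the expensive step actually executes

def slow_lookup():
    """Hypothetical expensive step; the returned data is invented."""
    runs.append(1)
    return {'geo_city': ['san marcos', 'chester']}

cache_file = os.path.join(tempfile.mkdtemp(), 'locations.json')
first = load_or_compute(cache_file, slow_lookup)   # computes and writes the cache
second = load_or_compute(cache_file, slow_lookup)  # reads the cache

print(first == second, len(runs))
```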


## Remove Text Between Parentheses 

In [9]:
# raw strings keep the backslashes in the pattern from being read as string escapes
df.geo_city = df.geo_city.apply(lambda x: re.sub(r"\(.*?\)", "", x))
df.city = df.city.apply(lambda x: re.sub(r"\(.*?\)", "", x))
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,,Texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,San Antonio,Texas,us
2,10/10/1955 17:00,chester,,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,Chester,England,gb
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,,Texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,Kaneohe,Hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,Nashville-Davidson,Tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,Boise,Idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,Napa,California,us
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,,Virginia,us
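
To see what the parenthesis pattern does on individual values, here is a small standalone sketch. The helper name `strip_parenthetical` is hypothetical, and unlike the cell above it also removes the whitespace left behind in front of the parentheses, which is an added assumption about the desired result.

```python
import re

# non-greedy match of "(...)" plus any whitespace directly before it
PAREN = re.compile(r"\s*\(.*?\)")

def strip_parenthetical(name):
    """Remove every parenthesised part from a place name and trim the ends."""
    return PAREN.sub("", name).strip()

print(strip_parenthetical("chester (uk/england)"))
print(strip_parenthetical("a (b) c (d)"))
print(strip_parenthetical("san marcos"))
```

The non-greedy `.*?` matters when a name contains more than one pair of parentheses: a greedy `.*` would remove everything from the first opening parenthesis to the last closing one.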


## Fill in missing city values

In [10]:
df[df.geo_city.isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country


In [11]:
df[df.geo_city == '']

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,,Texas,us
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,,Texas,us
6,10/10/1965 21:00,penarth,,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2006-02-14,51.434722,-3.180000,,Wales,gb
8,10/10/1966 20:00,pell city,al,us,disk,180,3 minutes,Strobe Lighted disk shape object observed clos...,2009-03-19,33.586111,-86.286110,,Alabama,us
9,10/10/1966 21:00,live oak,fl,us,disk,120,several minutes,Saucer zaps energy from powerline as my pregna...,2005-05-11,30.294722,-82.984169,,Florida,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80322,9/9/2013 21:00,aleksandrow,,,light,15,15 seconds,Two points of light following one another in a...,2013-09-30,50.465843,22.891813,,Lublin Voivodeship,pl
80324,9/9/2013 21:00,hamstead,nc,,light,120,2 minutes,8 to ten lights bright orange in color large t...,2013-09-30,34.367594,-77.710548,,North Carolina,us
80325,9/9/2013 21:00,milton,on,ca,fireball,180,3 minutes,Massive Bright Orange Fireball in Sky,2013-09-30,46.300000,-63.216667,,Prince Edward Island,ca
80326,9/9/2013 21:00,woodstock,ga,us,sphere,20,20 seconds,Driving 575 at 21:00 hrs saw a white and green...,2013-09-30,34.101389,-84.519447,,Georgia,us


## If the city found by geolocation is empty, replace it with the city from the original data

In [12]:
df.geo_city = np.where(df.geo_city == '', df.city.str.lower(), df.geo_city.str.lower())
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,Texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,Texas,us
2,10/10/1955 17:00,chester,,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,England,gb
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,Texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,Hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,Tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,Idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,California,us
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,Virginia,us
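
The same fallback can be tried on a tiny frame to check the logic. The sketch below uses made-up rows and `Series.where`, which is an assumed alternative to the `np.where` call above: it keeps a value where the condition holds and takes the replacement elsewhere.

```python
import pandas as pd

# made-up rows: the first has no geocoded city, the second does
demo = pd.DataFrame({'city': ['san marcos', 'lackland afb'],
                     'geo_city': ['', 'San Antonio']})

# keep geo_city where it is non-empty, otherwise fall back to the reported city
filled = demo.geo_city.where(demo.geo_city != '', demo.city).str.lower()

print(filled.tolist())
```

One difference from `np.where` is that the result stays a pandas Series, so the index is preserved.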


## Fill in missing state values

In [13]:
df[df.state.isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
2,10/10/1955 17:00,chester,,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,England,gb
6,10/10/1965 21:00,penarth,,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2006-02-14,51.434722,-3.180000,penarth,Wales,gb
18,10/10/1973 23:00,bermuda nas,,,light,20,20 sec.,saw fast moving blip on the radar scope thin w...,2002-01-11,32.364167,-64.678612,bermuda nas,,bm
20,10/10/1974 21:30,cardiff,,gb,disk,1200,20 minutes,back in 1974 I was 19 at the time and lived i...,2007-02-01,51.500000,-3.200000,cardiff,Wales,gb
24,10/10/1976 22:00,stoke mandeville,,gb,cigar,3,3 seconds,White object over Buckinghamshire UK.,2009-12-12,51.783333,-0.783333,stoke mandeville,England,gb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80217,9/9/2007 19:01,melbourne,,au,circle,600,10 min,Hostile,2007-10-08,-37.813938,144.963425,melbourne,Victoria,au
80234,9/9/2009 03:14,aberdeen,,gb,light,6,6 seconds,Bright light seen over Aberdeen&#44 Scotland&#...,2009-12-12,57.166667,-2.666667,aberdeen,Scotland,gb
80254,9/9/2009 21:15,nottinghamshire,,gb,fireball,600,10 mins,resembled orange flame imagine a transparent h...,2009-12-12,53.166667,-1.000000,newark and sherwood,England,gb
80255,9/9/2009 21:38,kaiserlautern,,de,light,40,about 40 seconds,2 white lights over Kaiserslautern&#44 ramstei...,2009-12-12,49.450000,7.750000,kaiserslautern,Rhineland-Palatinate,de


In [14]:
df[df.geo_state.isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
515,10/1/1970 23:00,indian ocean,,,light,240,3-4 minutes,Bright object seemingly appeared out of nowher...,2004-07-08,-33.137551,81.826172,indian ocean,,
1740,10/15/1968 21:30,pacific ocean,,,circle,30,30 sec.,Bright&#44 white soundless orb with no trajeco...,2003-09-12,-8.783195,-124.508522,pacific ocean,,
3282,10/20/2008 02:00,indian ocean,,,unknown,300,5 minuts,at night in the middle of the ocean ( a light ...,2009-08-27,-33.137551,81.826172,indian ocean,,
4212,10/24/1995 02:00,tyrrhenian sea,,,sphere,30,30sec,blue colour sphere was obsereved from containe...,2006-07-16,40.076986,11.343106,tyrrhenian sea,,
5363,10/29/2010 21:00,indian ocean,,,fireball,5400,1.5 hrs,During the routine bridge watch at sea&#44 on ...,2010-11-21,-33.137551,81.826172,indian ocean,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74542,9/15/1966 01:30,pacific ocean,,,unknown,300,5 mis.,object in water 2 feet from boat made a straig...,2007-10-08,-8.783195,-124.508522,pacific ocean,,
75541,9/18/2005 14:00,atlantic ocean,,,disk,60,1 minute,lenticular cloud to disc,2005-10-11,-14.599413,-28.673147,atlantic ocean,,
76026,9/20/1988 13:00,atlantic ocean,,,unknown,20,20 seconds,The craft was visible at different positions f...,2005-05-11,-14.599413,-28.673147,atlantic ocean,,
76282,9/21/1988 03:00,atlantic ocean,,,fireball,15,15 seconds,The light clearly lit up the bow of the vessel...,2005-05-11,-14.599413,-28.673147,atlantic ocean,,


## Fill in state and country when the UFO was sighted over water

In [15]:
# a water-related place name combined with a missing state or country marks a sighting over water
over_water = df.geo_city.str.contains('ocean|sea|gulf|antarctica')
df.geo_state = np.where(df.geo_state.isna() & over_water, 'over_water', df.geo_state.str.lower())
df.geo_country = np.where(df.geo_country.isna() & over_water, 'over_water', df.geo_country)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,10/10/1955 17:00,chester,,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us
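
The keywords above are matched as substrings, so a name such as "seattle" also contains "sea". Rows only reach that check when the state or country is missing, so false matches may be rare in practice, but a word-boundary pattern is one possible safeguard. The sketch below compares both patterns on made-up names; it is an assumption about how the check could be tightened, not part of the original analysis.

```python
import re

# whole-word version of the keyword list; \b marks a word boundary
WATER = re.compile(r'\b(?:ocean|sea|gulf|antarctica)\b')

# made-up place names for illustration
names = ['indian ocean', 'tyrrhenian sea', 'seattle', 'chelsea']

substring_hits = [n for n in names if re.search('ocean|sea|gulf|antarctica', n)]
whole_word_hits = [n for n in names if WATER.search(n)]

print(substring_hits)
print(whole_word_hits)
```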


In [16]:
df[df.geo_state == '']

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
18,10/10/1973 23:00,bermuda nas,,,light,20,20 sec.,saw fast moving blip on the radar scope thin w...,2002-01-11,32.364167,-64.678612,bermuda nas,,bm
184,10/10/2007 23:20,stord,,,light,600,10 min,Thise could be an ETV case&#44 but it could al...,2008-01-21,59.900209,5.282347,stord,,no
285,10/11/1986 20:30,alice springs,,au,,20,20 seconds,Being of light reported&#44Jesus or another m...,2005-01-19,-23.697479,133.883621,alice springs,,au
296,10/11/1997 22:00,hafnarfjordur,,,sphere,300,5 min,playing with a jet,2008-06-12,64.066667,-21.950001,hafnarfjordur,,is
480,10/1/1952 03:30,fukuoka,,,disk,1200,about 20 mins,UFO seen by multiple U. S. military personnel;...,2006-12-07,33.590355,130.401718,fukuoka,,jp
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78898,9/3/2004 14:54,busan,,,chevron,2,seconds,It has dark brown color&#44 an empennage-shape...,2004-09-09,35.179554,129.075638,busan,,kr
79526,9/6/2002 00:00,mbour,,,light,60,1 minute,In Mbour&#44 Senegal&#44 ( 14deg.&#4425 min. N...,2005-05-11,14.416667,-16.966667,m'bour,,sn
79538,9/6/2002 22:00,kunsan city&#44 south korea,,,triangle,60,1 minute,Triangular &#44Cloud like shape,2002-09-13,35.967677,126.736626,gunsan-si,,kr
79745,9/7/2003 12:03,pecs,,,egg,1500,25min,((NUFORC Note: Hoax. PD)) Small object lands,2005-10-11,46.072735,18.232265,pécs,,hu


## Fill in state when the state is blank

In [17]:
df.geo_state = np.where(df.geo_state == '', 'unknown', df.geo_state)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,10/10/1955 17:00,chester,,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


In [18]:
df[(df.geo_country.isna()) | (df.geo_country == '')]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
46045,5/7/2003 17:00,europe,,,unknown,5,5 sec,RADAR WARNING,2003-05-09,54.525961,15.255119,europe,,


In [19]:
# the one remaining record has no usable state or country
df.loc[46045, ['geo_state', 'geo_country']] = 'unknown'

In [20]:
df.to_parquet(temp_dir+'/clean_locations.parquet')

## Duration
The seconds column holds the duration from the hours_min column expressed in seconds. It is cleaner and easier to work with than the free-text version.

In [21]:
# remove stray backtick characters so that the time in seconds can be converted to a float
df.seconds = df.seconds.str.replace('`', '', regex=False)
df.seconds = df.seconds.astype(float)

In [22]:
df.dtypes

datetime               object
city                   object
state                  object
country                object
shape                  object
seconds               float64
hours_min              object
comments               object
date_posted    datetime64[ns]
latitude              float64
longitude             float32
geo_city               object
geo_state              object
geo_country            object
dtype: object

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80332 entries, 0 to 80331
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     80332 non-null  object        
 1   city         80332 non-null  object        
 2   state        74535 non-null  object        
 3   country      70662 non-null  object        
 4   shape        78400 non-null  object        
 5   seconds      80332 non-null  float64       
 6   hours_min    80332 non-null  object        
 7   comments     80317 non-null  object        
 8   date_posted  80332 non-null  datetime64[ns]
 9   latitude     80332 non-null  float64       
 10  longitude    80332 non-null  float32       
 11  geo_city     80332 non-null  object        
 12  geo_state    80332 non-null  object        
 13  geo_country  80332 non-null  object        
dtypes: datetime64[ns](1), float32(1), float64(2), object(10)
memory usage: 8.3+ MB


In [24]:
# there are no missing values for seconds so the hours minutes column is not needed
df.drop(columns = ['hours_min'], inplace = True)

In [25]:
df.to_parquet(temp_dir+'/clean_duration.parquet')

## DateTime
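Change 24:00 to 00:00 so that midnight can be parsed. Note that this keeps the recorded date, so a sighting logged at 24:00 lands at the start of that day rather than the start of the following day.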

In [26]:
df['datetime'] = df['datetime'].apply(lambda x: re.sub('24:00', '00:00', x))
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700.0,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200.0,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,10/10/1955 17:00,chester,,gb,circle,20.0,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,10/10/1956 21:00,edna,tx,us,circle,20.0,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900.0,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600.0,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200.0,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200.0,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us
80330,9/9/2013 22:20,vienna,va,us,circle,5.0,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us
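
If 24:00 should instead be read as midnight at the end of the recorded day, the date has to roll over to the next day. The sketch below shows one way to do that with the standard library. The function name `parse_sighting_time` is hypothetical, and the `%m/%d/%Y %H:%M` format is assumed from the sample rows shown above.

```python
from datetime import datetime, timedelta

def parse_sighting_time(text):
    """Parse 'M/D/YYYY HH:MM', treating 24:00 as 00:00 on the following day."""
    date_part, time_part = text.strip().split()
    if time_part == '24:00':
        return datetime.strptime(date_part, '%m/%d/%Y') + timedelta(days=1)
    return datetime.strptime(f'{date_part} {time_part}', '%m/%d/%Y %H:%M')

print(parse_sighting_time('12/31/1999 24:00'))
print(parse_sighting_time('9/9/2013 21:15'))
```

Which reading is correct depends on how the reporters used 24:00, which the data itself does not say.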


In [27]:
df[df['datetime'].isnull()]

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country


In [28]:
df['datetime'] = pd.to_datetime(df['datetime'])
df.dtypes

datetime       datetime64[ns]
city                   object
state                  object
country                object
shape                  object
seconds               float64
comments               object
date_posted    datetime64[ns]
latitude              float64
longitude             float32
geo_city               object
geo_state              object
geo_country            object
dtype: object

In [29]:
df.to_parquet(temp_dir+'/clean_datetime.parquet')

## UFO Shape

In [30]:
df['shape'].unique()

array(['cylinder', 'light', 'circle', 'sphere', 'disk', 'fireball',
       'unknown', 'oval', 'other', 'cigar', 'rectangle', 'chevron',
       'triangle', 'formation', None, 'delta', 'changing', 'egg',
       'diamond', 'flash', 'teardrop', 'cone', 'cross', 'pyramid',
       'round', 'crescent', 'flare', 'hexagon', 'dome', 'changed'],
      dtype=object)

In [31]:
df[df['shape'].isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
62,1995-10-10 19:45:00,milwaukee,wi,us,,120.0,Man on Hwy 43 SW of Milwaukee sees large&#44 ...,1999-11-02,43.038889,-87.906387,milwaukee,wisconsin,us
63,1995-10-10 22:40:00,oakland,ca,us,,60.0,Woman repts. bright light in NW sky&#44 sudde...,1999-11-02,37.804444,-122.269722,oakland,california,us
239,2011-10-10 19:30:00,murfeesboro/smyrna,tn,,,2700.0,Multi color oblect over Smyrna/Murfreesboro 10...,2011-10-19,35.947474,-86.488365,murfeesboro/smyrna,tennessee,us
285,1986-10-11 20:30:00,alice springs,,au,,20.0,Being of light reported&#44Jesus or another m...,2005-01-19,-23.697479,133.883621,alice springs,unknown,au
293,1995-10-11 18:30:00,new york city,ny,us,,720.0,Young man&#44 mother witness watch strange red...,1999-11-02,40.714167,-74.006386,new york,new york,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...
80128,1999-09-09 22:00:00,mount shasta,ca,us,,18000.0,multiple anomalious lights&#44white flashes&#4...,1999-10-02,41.310000,-122.309441,mount shasta,california,us
80155,2002-09-09 19:02:00,moriches bay,ny,,,30.0,Two men report witnessing a peculiar object de...,2002-09-13,40.789394,-72.715630,moriches bay,new york,us
80156,2002-09-09 19:02:00,moriches bay,ny,,,60.0,U. S. Coast Guard (Boston) forwards report of ...,2002-09-13,40.789394,-72.715630,moriches bay,new york,us
80179,2003-09-09 22:00:00,prescott,az,us,,2700.0,Bright &quot;stars&quot; flying in sky in Pres...,2007-08-07,34.540000,-112.467781,prescott,arizona,us


In [32]:
df['shape'].value_counts()

light        16565
triangle      7865
circle        7608
fireball      6208
other         5649
unknown       5584
sphere        5387
disk          5213
oval          3733
formation     2457
cigar         2057
changing      1962
flash         1328
rectangle     1297
cylinder      1283
diamond       1178
chevron        952
egg            759
teardrop       750
cone           316
cross          233
delta            7
round            2
crescent         2
pyramid          1
flare            1
hexagon          1
dome             1
changed          1
Name: shape, dtype: int64

## Fold the shapes that occur less often into similar shapes

In [33]:
df.loc[df['shape'] == 'changed', 'shape'] = 'changing'
df.loc[df['shape'] == 'delta', 'shape'] = 'triangle'
df.loc[df['shape'] == 'cigar', 'shape'] = 'cylinder'
df.loc[df['shape'] == 'flare', 'shape'] = 'fireball'
df.loc[df['shape'] == 'round', 'shape'] = 'circle'
df.loc[df['shape'] == 'dome', 'shape'] = 'disk'
df.loc[df['shape'] == 'crescent', 'shape'] = 'teardrop'
df.loc[df['shape'] == 'pyramid', 'shape'] = 'other'
df.loc[df['shape'] == 'hexagon', 'shape'] = 'other'
df.loc[df['shape'].isna(), 'shape'] = 'unknown'
df['shape'].value_counts()

light        16565
triangle      7872
circle        7610
unknown       7516
fireball      6209
other         5651
sphere        5387
disk          5214
oval          3733
cylinder      3340
formation     2457
changing      1963
flash         1328
rectangle     1297
diamond       1178
chevron        952
egg            759
teardrop       752
cone           316
cross          233
Name: shape, dtype: int64
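
The same folding can be written as a single lookup table, which keeps all the rules in one place. The sketch below applies the mapping from the cell above to a made-up Series; the name `SHAPE_FOLDS` is hypothetical, and using `Series.replace` followed by `fillna` is an assumed alternative to the individual assignments.

```python
import pandas as pd

# rare shape -> more common shape it is folded into (same rules as the cell above)
SHAPE_FOLDS = {
    'changed': 'changing',
    'delta': 'triangle',
    'cigar': 'cylinder',
    'flare': 'fireball',
    'round': 'circle',
    'dome': 'disk',
    'crescent': 'teardrop',
    'pyramid': 'other',
    'hexagon': 'other',
}

# made-up sample including a missing shape
shapes = pd.Series(['delta', 'cigar', None, 'light', 'changed'], dtype='object')

# fold the rare shapes, then label missing shapes as unknown
folded = shapes.replace(SHAPE_FOLDS).fillna('unknown')

print(folded.tolist())
```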

In [34]:
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,2013-09-09 21:15:00,nashville,tn,us,light,600.0,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80328,2013-09-09 22:00:00,boise,id,us,circle,1200.0,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80329,2013-09-09 22:00:00,napa,ca,us,other,1200.0,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us
80330,2013-09-09 22:20:00,vienna,va,us,circle,5.0,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


In [35]:
df.to_parquet(temp_dir+'/clean_shape.parquet')

## Comments

## Remove Records Without Comments

In [36]:
df[df.comments.isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
2940,2004-10-19 20:00:00,grand island,ne,us,light,3600.0,,2004-11-02,40.925,-98.341667,grand island,nebraska,us
14317,1996-01-14 17:00:00,chesterfield,va,us,unknown,3600.0,,2005-11-03,37.376944,-77.506111,chesterfield,virginia,us
21844,1996-01-23 20:15:00,minot,nd,us,unknown,900.0,,2011-02-18,48.2325,-101.29583,minot,north dakota,us
24999,1996-01-07 11:30:00,st. george,ut,us,unknown,2.0,,2005-11-03,37.104167,-113.583336,st. george,utah,us
28764,1996-02-27 22:01:00,saginaw,mi,us,unknown,1440.0,,2004-03-02,43.419444,-83.950836,city of saginaw,michigan,us
32337,2004-03-19 12:10:00,atlanta,ga,us,circle,600.0,,2004-06-18,33.748889,-84.388054,atlanta,georgia,us
36089,2001-04-01 19:00:00,bangalore,,,unknown,10.0,,2002-05-14,12.971599,77.594566,bengaluru,karnataka,in
41782,2013-05-01 22:00:00,toledo,oh,us,oval,120.0,,2014-01-24,41.663889,-83.555275,toledo,ohio,us
46558,2002-06-10 03:30:00,chantilly,va,us,unknown,180.0,,2002-08-16,38.894167,-77.431389,chantilly,virginia,us
48599,1957-06-15 02:30:00,atlantic ocean,,,unknown,120.0,,2002-03-19,-14.599413,-28.673147,atlantic ocean,over_water,over_water


In [37]:
df.dropna(subset = ['comments'], inplace = True)
df.reset_index(drop = True, inplace = True)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...
80312,2013-09-09 21:15:00,nashville,tn,us,light,600.0,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80313,2013-09-09 22:00:00,boise,id,us,circle,1200.0,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80314,2013-09-09 22:00:00,napa,ca,us,other,1200.0,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us
80315,2013-09-09 22:20:00,vienna,va,us,circle,5.0,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


## Text Cleanup

In [38]:
df[df.comments.str.contains('&#44')]

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
8,1966-10-10 20:00:00,pell city,al,us,disk,180.0,Strobe Lighted disk shape object observed clos...,2009-03-19,33.586111,-86.286110,pell city,alabama,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...
80296,2012-09-09 21:55:00,charleston,sc,us,flash,900.0,Orb of light flashing reds and blues&#44 stati...,2012-09-24,32.776389,-79.931114,charleston,south carolina,us
80307,2013-09-09 21:00:00,aleksandrow,,,light,15.0,Two points of light following one another in a...,2013-09-30,50.465843,22.891813,aleksandrow,lublin voivodeship,pl
80313,2013-09-09 22:00:00,boise,id,us,circle,1200.0,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80314,2013-09-09 22:00:00,napa,ca,us,other,1200.0,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us


In [39]:
# remove new line, tab and carriage return
df.comments = df.comments.str.translate(str.maketrans('','', '\n\t\r'))
# replace / with space
df.comments = df.comments.str.replace('/',' ')
# replace HTML numeric character references (e.g. &#44) with the character
df.comments = df.comments.str.replace('&#9',chr(9))
df.comments = df.comments.str.replace('&#33',chr(33))
df.comments = df.comments.str.replace('&#39',chr(39))
df.comments = df.comments.str.replace('&#44',chr(44))
df.comments = df.comments.str.replace('&#160;',chr(160))
df.comments = df.comments.str.replace('&#161;',chr(161))
df.comments = df.comments.str.replace('&#167;',chr(167))
df.comments = df.comments.str.replace('&#170;',chr(170))
df.comments = df.comments.str.replace('&#176;',chr(176))
df.comments = df.comments.str.replace('&#180;',chr(180))
df.comments = df.comments.str.replace('&#182;',chr(182))
df.comments = df.comments.str.replace('&#186;',chr(186))
df.comments = df.comments.str.replace('&#188;',chr(188))
df.comments = df.comments.str.replace('&#190;',chr(190))
df.comments = df.comments.str.replace('&#8211;',chr(8211))
df.comments = df.comments.str.replace('&#8212;',chr(8212))
df.comments = df.comments.str.replace('&#8216;',chr(8216))
df.comments = df.comments.str.replace('&#8217;',chr(8217))
df.comments = df.comments.str.replace('&#8220;',chr(8220))
df.comments = df.comments.str.replace('&#8221;',chr(8221))
df.comments = df.comments.str.replace('&#8230;',chr(8230))
# convert all text to lowercase
df.comments = df.comments.str.lower()
# remove numbers
df.comments = df.comments.str.translate(str.maketrans('', '', string.digits))
# remove punctuation
df.comments = df.comments.str.translate(str.maketrans('', '', string.punctuation))
# remove extra spaces
df.comments = df.comments.replace({' +':' '},regex=True)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,this event took place in early fall around it ...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,lackland afb tx lights racing across the sky ...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,green orange circular disc over chester england,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,my older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,as a marine st lt flying an fjb fighter attack...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...
80312,2013-09-09 21:15:00,nashville,tn,us,light,600.0,round from the distance slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80313,2013-09-09 22:00:00,boise,id,us,circle,1200.0,boise id spherical min red lights seen by husb...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80314,2013-09-09 22:00:00,napa,ca,us,other,1200.0,napa ufo,2013-09-30,38.297222,-122.284447,napa,california,us
80315,2013-09-09 22:20:00,vienna,va,us,circle,5.0,saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


In [40]:
df[df.comments.str.contains('&#')]

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
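
The cleanup steps above can also be sketched with the standard library alone. This is a variation, not the notebook's exact pipeline: `clean_comment` is a hypothetical helper, it relies on `html.unescape` to decode numeric references such as `&#44` (and named ones such as `&amp;`) in one call, and it assumes that collapsing every run of whitespace into a single space is acceptable.

```python
import html
import re
import string

# Characters to delete after decoding: ASCII digits and ASCII punctuation.
_DROP = str.maketrans('', '', string.digits + string.punctuation)

def clean_comment(text):
    """Hypothetical helper: decode entities, lowercase, strip digits/punctuation."""
    text = html.unescape(text)        # '&#44' -> ',', '&amp;' -> '&'
    text = text.replace('/', ' ')     # keep words on both sides of a slash apart
    text = text.lower().translate(_DROP)
    # Collapse runs of whitespace (spaces, tabs, newlines) to one space.
    return re.sub(r'\s+', ' ', text).strip()

print(clean_comment('Green/Orange circular disc over Chester&#44 England'))
```

On a DataFrame this could be applied with `df.comments.apply(clean_comment)`. Unlike the cell above, it also trims leading and trailing spaces, so a comment that contains only punctuation becomes an empty string rather than a single space.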


## Remove Blank Comments

In [41]:
df[df.comments == '']

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
22437,2013-12-04 19:00:00,tucson,az,us,fireball,180.0,,2013-12-23,32.221667,-110.925835,tucson,arizona,us
45442,2014-05-03 00:00:00,milford,ct,us,circle,900.0,,2014-05-08,41.222222,-73.056946,milford,connecticut,us
52663,1966-06-30 21:00:00,blocksburg,ca,us,disk,600.0,,2009-03-19,40.276111,-123.635277,blocksburg,california,us


In [42]:
df[df.comments == ' ']

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
60768,2002-07-30 01:25:00,mainville,oh,,light,5.0,,2002-08-16,39.315059,-84.220772,mainville,ohio,us


In [43]:
df.drop(df[df.comments == ''].index, inplace = True)
df.drop(df[df.comments == ' '].index, inplace = True)
df.reset_index(drop = True, inplace = True)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,this event took place in early fall around it ...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,lackland afb tx lights racing across the sky ...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,green orange circular disc over chester england,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,my older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,as a marine st lt flying an fjb fighter attack...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...
80308,2013-09-09 21:15:00,nashville,tn,us,light,600.0,round from the distance slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80309,2013-09-09 22:00:00,boise,id,us,circle,1200.0,boise id spherical min red lights seen by husb...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80310,2013-09-09 22:00:00,napa,ca,us,other,1200.0,napa ufo,2013-09-30,38.297222,-122.284447,napa,california,us
80311,2013-09-09 22:20:00,vienna,va,us,circle,5.0,saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


## Load spaCy

In [44]:
!python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

Collecting en-core-web-lg==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0-py3-none-any.whl (587.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


## Lemmatization

Lemmatization returns the root (dictionary) form of a word. It strips inflection, such as verb tense or plural endings, while keeping the meaning of the word the same.

Examples:

- better -> good
- walking -> walk
- was -> be
- mice -> mouse
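
To make the mapping concrete, here is a minimal sketch built on a hand-written lookup table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names used only for illustration; the notebook itself relies on spaCy's lemmatizer, which covers far more words than this four-entry table.

```python
# Toy lookup table (an assumption for illustration, NOT spaCy's data).
TOY_LEMMAS = {
    "better": "good",
    "walking": "walk",
    "was": "be",
    "mice": "mouse",
}

def toy_lemmatize(text):
    """Replace each known word with its dictionary form; leave others unchanged."""
    tokens = text.lower().split()
    return " ".join(TOY_LEMMAS.get(token, token) for token in tokens)

print(toy_lemmatize("The mice was walking"))
```

Words missing from the table pass through untouched, which is why a real lemmatizer needs rules or a large vocabulary rather than a fixed list.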

In [45]:
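def lemmatize_comments(x):
    doc = nlp(x)
    lemmed_list = []
    for token in doc:
        if not token.is_punct:
            # spaCy 2.x returns the placeholder '-PRON-' as the lemma of pronouns;
            # keep the original text in that case (spaCy 3.x returns a real lemma)
            if token.lemma_ == '-PRON-':
                lemmed_list.append(token.text)
            else:
                lemmed_list.append(token.lemma_)

    return " ".join(lemmed_list)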

In [46]:
lemmed_file = temp_dir+'/lemmatized.parquet'
if exists(lemmed_file):
    df = pd.read_parquet(lemmed_file)
else:
    df.comments = df.comments.apply(lambda x: lemmatize_comments(x))
    df.to_parquet(lemmed_file)
    
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,this event take place in early fall around it ...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,lackland afb tx light race across the sky am...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,green orange circular disc over chester england,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,my old brother and twin sister be leave the on...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,as a marine st lt fly an fjb fighter attack ai...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...
80308,2013-09-09 21:15:00,nashville,tn,us,light,600.0,round from the distance slowly change color an...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80309,2013-09-09 22:00:00,boise,id,us,circle,1200.0,boise i d spherical min red light see by husba...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80310,2013-09-09 22:00:00,napa,ca,us,other,1200.0,napa ufo,2013-09-30,38.297222,-122.284447,napa,california,us
80311,2013-09-09 22:20:00,vienna,va,us,circle,5.0,see a five gold light cicular craft move fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


In [47]:
# write each comment to a text file, separating comments with a \n
lemm_comments_file = temp_dir+'/lemm_comments.txt'
if not exists(lemm_comments_file):
    with open(lemm_comments_file, 'w') as lem_comments_txt_file:
        for comment in df.comments:
            lem_comments_txt_file.write(comment + '\n')
# Read in each comment where one line = one sentence.
sentences_unigrams = LineSentence(lemm_comments_file)

## Phrase Modeling

Detect frequently used phrases and combine them.

## Bigrams

A bigram is a two-word phrase. Find the most frequently occurring two-word phrases and combine them.

## Trigrams

A trigram is a three-word phrase. Find the most frequently occurring three-word phrases and combine them.
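
The idea of detecting and combining frequent phrases can be sketched in plain Python. This is a simplified illustration, not gensim's implementation: `find_phrases` and `merge_phrases` are hypothetical helpers, the score is only loosely modeled on the collocation score used by phrase models, and the `min_count` and `threshold` values are assumptions chosen for the tiny corpus.

```python
from collections import Counter

def find_phrases(sentences, min_count=1, threshold=1.0):
    """Return word pairs that co-occur more often than their parts suggest."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        # Higher when the pair is frequent relative to its individual words.
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases

def merge_phrases(tokens, phrases):
    """Join each detected pair with an underscore, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + '_' + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

corpus = [
    "bright light take place over the lake".split(),
    "the event take place at night".split(),
    "light hover then take place of the star".split(),
]
phrases = find_phrases(corpus)
print(merge_phrases(corpus[1], phrases))
```

Running the same two steps again on the merged sentences treats `take_place` as a single token, so a second pass can join it with a neighbor to form a three-word phrase. That is the reason the cells below train one model on unigram sentences and a second model on the bigram output.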

In [48]:
bigram_model_file = temp_dir+'/bigram_phrase_model'

if not exists(bigram_model_file):
    bigram_phrases = Phrases(sentences_unigrams)
    # Turn the finished Phrases model into a "Phraser" object,
    # which is optimized for speed and memory use
    bigram_phrases = Phraser(bigram_phrases)
    bigram_phrases.save(bigram_model_file)

In [49]:
bigram_phrases = Phraser.load(bigram_model_file)
sentences_bigrams_file = temp_dir+'/sentence_bigram_phrases_all.txt'

if not exists(sentences_bigrams_file):
    with open(sentences_bigrams_file, 'w') as f:

        for sentence_unigrams in sentences_unigrams:
            #print(sentence_unigrams)
            sentence_bigrams = ' '.join(bigram_phrases[sentence_unigrams])

            f.write(sentence_bigrams + '\n')

In [50]:
sentences_bigrams = LineSentence(sentences_bigrams_file)
trigram_model_file = temp_dir+'/trigram_phrase_model'

if not exists(trigram_model_file):
    trigram_phrases = Phrases(sentences_bigrams) 
    # Turn the finished Phrases model into a "Phraser" object,
    # which is optimized for speed and memory use
    trigram_phrases = Phraser(trigram_phrases)
    trigram_phrases.save(trigram_model_file)

In [51]:
trigram_phrases = Phraser.load(trigram_model_file)
sentences_trigrams_file = temp_dir+'/sentence_trigram_phrases_all.txt'

if not exists(sentences_trigrams_file):
    with open(sentences_trigrams_file, 'w') as f:
        
        for sentence_bigrams in sentences_bigrams:
            
            sentence_trigrams = ' '.join(trigram_phrases[sentence_bigrams])
            
            f.write(sentence_trigrams + '\n')   

In [52]:
comments_trigrams_file = temp_dir+'/comments_trigrams_all.txt'

if not exists(comments_trigrams_file):
    # Read in each comment where one line = one sentence.
    comments_lemmatized = LineSentence(lemm_comments_file)

    with open(comments_trigrams_file, 'w') as f:
        
        for comments_unigrams in comments_lemmatized:
                        
            # apply the first-order and second-order phrase models
            comments_bigrams = bigram_phrases[comments_unigrams]
            comments_trigrams = trigram_phrases[comments_bigrams]
            
            # write the transformed comments as a line in the new file
            comments_trigrams = ' '.join(comments_trigrams)
            f.write(comments_trigrams + '\n')
            

In [53]:
trigram_df_file = temp_dir+'/tri_grams.parquet'

if not exists(trigram_df_file):
    tri_df = pd.DataFrame(columns = ['tri_comments'])
    
    with open(comments_trigrams_file) as f:
        
        for comments in f:
            comments = re.sub('\n', '', comments)
            tri_df.loc[len(tri_df)] = comments
            
    tri_df.to_parquet(trigram_df_file)

else:
    tri_df = pd.read_parquet(trigram_df_file)
    
tri_df

Unnamed: 0,tri_comments
0,this_event take_place in early fall around it ...
1,lackland afb tx light race_across the sky amp ...
2,green orange circular disc over chester england
3,my old brother and twin sister be leave the on...
4,as a marine st lt fly an fjb fighter attack ai...
...,...
80308,round from the distance slowly change_color an...
80309,boise i_d spherical min red light see by husba...
80310,napa ufo
80311,see a five gold light cicular craft move fastl...


In [54]:
# concatenate the comments with trigrams to dataframe
df = pd.concat([df, tri_df],axis = 1)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country,tri_comments
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,this event take place in early fall around it ...,2004-04-27,29.883056,-97.941109,san marcos,texas,us,this_event take_place in early fall around it ...
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,lackland afb tx light race across the sky am...,2005-12-16,29.384210,-98.581085,san antonio,texas,us,lackland afb tx light race_across the sky amp ...
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,green orange circular disc over chester england,2008-01-21,53.200000,-2.916667,chester,england,gb,green orange circular disc over chester england
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,my old brother and twin sister be leave the on...,2004-01-17,28.978333,-96.645836,edna,texas,us,my old brother and twin sister be leave the on...
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,as a marine st lt fly an fjb fighter attack ai...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us,as a marine st lt fly an fjb fighter attack ai...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80308,2013-09-09 21:15:00,nashville,tn,us,light,600.0,round from the distance slowly change color an...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us,round from the distance slowly change_color an...
80309,2013-09-09 22:00:00,boise,id,us,circle,1200.0,boise i d spherical min red light see by husba...,2013-09-30,43.613611,-116.202499,boise,idaho,us,boise i_d spherical min red light see by husba...
80310,2013-09-09 22:00:00,napa,ca,us,other,1200.0,napa ufo,2013-09-30,38.297222,-122.284447,napa,california,us,napa ufo
80311,2013-09-09 22:20:00,vienna,va,us,circle,5.0,see a five gold light cicular craft move fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us,see a five gold light cicular craft move fastl...


In [55]:
df[df.tri_comments.isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country,tri_comments


## Remove Stop Words

In [56]:
def remove_stop_words(x):
    doc = nlp(x)
    stopless_list = []
    for token in doc:
        if not token.is_stop:
            stopless_list.append(token.text)
    return " ".join(stopless_list)

In [57]:
df.tri_comments = df.tri_comments.apply(lambda x: remove_stop_words(x))
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country,tri_comments
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,this event take place in early fall around it ...,2004-04-27,29.883056,-97.941109,san marcos,texas,us,this_event take_place early fall occur boy sco...
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,lackland afb tx light race across the sky am...,2005-12-16,29.384210,-98.581085,san antonio,texas,us,lackland afb tx light race_across sky amp maki...
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,green orange circular disc over chester england,2008-01-21,53.200000,-2.916667,chester,england,gb,green orange circular disc chester england
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,my old brother and twin sister be leave the on...,2004-01-17,28.978333,-96.645836,edna,texas,us,old brother twin sister leave edna theater pmw...
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,as a marine st lt fly an fjb fighter attack ai...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us,marine st lt fly fjb fighter attack aircraft s...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80308,2013-09-09 21:15:00,nashville,tn,us,light,600.0,round from the distance slowly change color an...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us,round distance slowly change_color hover
80309,2013-09-09 22:00:00,boise,id,us,circle,1200.0,boise i d spherical min red light see by husba...,2013-09-30,43.613611,-116.202499,boise,idaho,us,boise i_d spherical min red light husband wife
80310,2013-09-09 22:00:00,napa,ca,us,other,1200.0,napa ufo,2013-09-30,38.297222,-122.284447,napa,california,us,napa ufo
80311,2013-09-09 22:20:00,vienna,va,us,circle,5.0,see a five gold light cicular craft move fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us,gold light cicular craft fastly rght leave


In [58]:
df.to_parquet(temp_dir+'/data_clean.parquet')

In [59]:
cv = CountVectorizer()
data_cv = cv.fit_transform(df.tri_comments)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = df.index
data_dtm

Unnamed: 0,a_high_rate,a_zig_zag,aa,aaa,aaaaaaaaauy,aampm,aaron,ab,abadania,aball,...,zore,zs,zthen,ztraacutecelo,zuerich,zukowski,zulu,zvala,zz,zzigzage
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80308,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80309,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80310,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80311,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
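
The table above has one row per comment and one column per vocabulary term, with each cell holding how often the term appears in that comment. A minimal sketch of the same structure in plain Python follows. `build_dtm` is a hypothetical helper that splits on whitespace, so it is not a drop-in replacement for `CountVectorizer`, whose default tokenizer behaves differently (for example, it ignores single-character tokens).

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(doc.split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

docs = ["napa ufo", "red light red light hover", "ufo light"]
vocabulary, rows = build_dtm(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Most cells are zero, as in the output above. For a corpus of this size it may be worth keeping the sparse matrix returned by `fit_transform` instead of calling `toarray()`, which materializes every zero in memory.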


In [60]:
# Save for later use
data_dtm.to_parquet(temp_dir+'/dtm.parquet')
# Pickle the fitted vectorizer for later use
with open(temp_dir+"/cv.pkl", "wb") as f:
    pickle.dump(cv, f)