# UFO Sightings

## Introduction
UFO sightings have been reported throughout history. Many sightings can be explained scientifically, but some elude explanation. The United States Government has studied UFOs over the years, and in 2021 it showed renewed interest in them on national security grounds.

## Problem Statement
What can we learn from the UFO Sighting Reports?
* Location
    - What are the most common locations that UFOs are sighted?
* What are the most common UFO shapes?
* What times of the day are UFOs seen the most?
* Descriptions
    - What are the topics discussed in UFO sightings reports?
    - What is the sentiment of UFO sightings reports?
    
## Output

1. Sighting Location
2. Sighting Duration
3. Sighting Day and Time
4. UFO Shape
5. Comments Corpus
6. Comments Document Term Matrix (DTM)


## Data Source
__[NATIONAL UFO REPORTING CENTER (NUFORC)](https://www.kaggle.com/datasets/NUFORC/ufo-sightings)__


MIT License

Copyright (c) 2022 UFO Software, LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import spacy
from spacy import displacy
from spacy.language import Language
from spacy.util import minibatch
from textblob import TextBlob
import re
import string
import os
from os.path import exists
import geopandas as gpd
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from tqdm.notebook import tqdm
import pickle
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# declare the directory structure
parent_dir = '/Volumes/datasets/ufo_sightings'
data_dir = parent_dir+'/data'
temp_dir = parent_dir+'/temp'
# create the temp directory (and any missing parents) if it does not exist yet
os.makedirs(temp_dir, exist_ok=True)

In [3]:
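# read in the data
# latitude is read as a string because one value is malformed (it is fixed below);
# the raw longitude header has a trailing space, which is renamed after loading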
col_dtypes = {'city': 'string',
              'state': 'string',
              'country': 'string',
              'shape': 'string',
              'comments': 'string',
              'latitude': 'string',
              'longitude ': 'float32'
             }

date_cols = ['datetime',
             'duration (seconds)',
             'duration (hours/min)',
             'date posted'
            ]

cols = list(col_dtypes.keys()) + date_cols


df = pd.read_csv(data_dir+'/scrubbed.csv', low_memory = False, usecols = cols, dtype = col_dtypes, parse_dates = date_cols, skipinitialspace = True)

df.rename(columns = {'longitude ': 'longitude', 'duration (seconds)': 'seconds', 'duration (hours/min)': 'hours_min', 'date posted': 'date_posted'}, inplace = True)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.8830556,-97.941109
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581085
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.9783333,-96.645836
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.4180556,-157.803604
...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.1658333,-86.784447
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.6136111,-116.202499
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.2972222,-122.284447
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.9011111,-77.265556


In [4]:
df.dtypes

datetime               object
city                   string
state                  string
country                string
shape                  string
seconds                object
hours_min              object
comments               string
date_posted    datetime64[ns]
latitude               string
longitude             float32
dtype: object

## Location Data

In [5]:
df.latitude[df.latitude.str.contains('q')]

43782    33q.200088
Name: latitude, dtype: string

In [6]:
# correct the malformed value by label rather than by positional column index
df.loc[43782, 'latitude'] = '33.200088'
df.latitude = df.latitude.astype(float)
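
The malformed latitude above was found by searching for the letter `q`, which only works when the stray character is already known. A more general check, sketched here on a small hypothetical sample rather than the real column, is to let `pd.to_numeric` coerce unparseable values to NaN and then list the rows that failed.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be df.latitude
latitude = pd.Series(['29.8830556', '33q.200088', '53.2'])

# errors='coerce' turns anything that cannot be parsed into NaN
parsed = pd.to_numeric(latitude, errors='coerce')

# rows that held a value in the raw data but failed to parse
bad_rows = latitude[parsed.isna() & latitude.notna()]
print(bad_rows)
```

Any row listed in `bad_rows` can then be corrected by hand, as the notebook does for row 43782.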

## Location Data from GPS Coordinates
Fill in missing values and fix improperly recorded locations by reverse geocoding the GPS coordinates.

In [7]:
locator = Nominatim(user_agent='ufo_sightings', timeout=20)
rgeocode = RateLimiter(locator.reverse, min_delay_seconds=0.75, max_retries = 10, error_wait_seconds =  300.0)
def get_city_state_country(x):
    location = rgeocode(str(x.latitude)+','+str(x.longitude) , language="en")
    if location is not None:
        address = location.raw['address']
        city = address.get('city', '')
        state = address.get('state', '')
        country_code = address.get('country_code', '')

        return [city,state,country_code]
    else:
        # if the location is not found from the GPS coordinates return the original data
        return [x.city, x.state, x.country]
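
The function above reads only the `city` key, and many rows in the output below come back with an empty `geo_city`. One possible reason, stated here as an assumption rather than something the notebook verifies, is that Nominatim files smaller places under other address keys such as `town` or `village`. The helper below is a hypothetical sketch (the name `pick_place_name` and the key order are illustrative) that returns the first non-empty value from a list of candidate keys.

```python
def pick_place_name(address, keys=('city', 'town', 'village', 'hamlet', 'municipality')):
    """Return the first non-empty place name found under the candidate keys."""
    for key in keys:
        name = address.get(key, '')
        if name:
            return name
    # nothing matched; the caller can fall back to the originally reported city
    return ''


# usage with hand-written address dictionaries (not real geocoder responses)
print(pick_place_name({'town': 'Edna', 'state': 'Texas'}))
print(pick_place_name({'state': 'Texas'}))
```

Inside `get_city_state_country`, such a helper could take the place of `address.get('city', '')`. The notebook instead falls back to the reported city in a later step.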

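## Warning: Long Execution Time
The next cell takes over 15 hours to run the first time. The results are cached in `city_state_country.parquet`, so later runs load the file instead of geocoding again.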

In [8]:
location_file = temp_dir+'/city_state_country.parquet'
if not os.path.isfile(location_file):
    tqdm.pandas()
    df['geo_city'], df['geo_state'], df['geo_country'] = zip(*df.progress_apply(get_city_state_country, axis =1))
    # to_parquet returns None, so do not assign its result back to df
    df.to_parquet(location_file)
else:
    df = pd.read_parquet(location_file)
    
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,,Texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,San Antonio,Texas,us
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,Chester,England,gb
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,,Texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,Kaneohe,Hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,Nashville-Davidson,Tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,Boise,Idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,Napa,California,us
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,,Virginia,us
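
Since each lookup is rate limited, one way to shorten the run is to geocode each distinct coordinate pair only once and merge the results back. Several rows share coordinates, such as the repeated `indian ocean` sightings. The sketch below uses a stand-in function, `fake_reverse`, in place of the real geocoder so that it runs offline. The sample frame and the `place` column are also hypothetical.

```python
import pandas as pd


def fake_reverse(lat, lon):
    """Stand-in for a rate-limited reverse geocoder; returns a made-up label."""
    return f'place@{lat:.2f},{lon:.2f}'


sightings = pd.DataFrame({
    'latitude':  [-33.137551, -33.137551, 29.883056],
    'longitude': [81.826172, 81.826172, -97.941109],
})

# one lookup per distinct coordinate pair instead of one per row
unique_coords = sightings[['latitude', 'longitude']].drop_duplicates().copy()
unique_coords['place'] = [
    fake_reverse(lat, lon)
    for lat, lon in zip(unique_coords.latitude, unique_coords.longitude)
]

# broadcast the results back to every sighting
sightings = sightings.merge(unique_coords, on=['latitude', 'longitude'], how='left')
print(sightings)
```

How much time this saves depends on how many coordinate pairs repeat in the full dataset, which is not measured here.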


## Remove Text Between Parentheses 

In [9]:
# raw strings keep the backslashes in the pattern from being read as string escapes
df.geo_city = df.geo_city.apply(lambda x: re.sub(r"\(.*?\)", "", x))
df.city = df.city.apply(lambda x: re.sub(r"\(.*?\)", "", x))
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,,Texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,San Antonio,Texas,us
2,10/10/1955 17:00,chester,,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,Chester,England,gb
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,,Texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,Kaneohe,Hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,Nashville-Davidson,Tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,Boise,Idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,Napa,California,us
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,,Virginia,us
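
The same cleanup can be written with the vectorized string accessor. This is a minimal sketch on a hypothetical sample, not the notebook's code. The pattern also consumes whitespace before the parenthesis, so `chester (uk/england)` does not keep a trailing space, and missing values pass through as missing, whereas `re.sub` raises a `TypeError` on them.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be df.city or df.geo_city
city = pd.Series(['chester (uk/england)', 'san marcos', None], dtype='string')

# drop any parenthesised suffix together with the whitespace in front of it
cleaned = city.str.replace(r'\s*\(.*?\)', '', regex=True).str.strip()
print(cleaned)
```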


## Fill in missing city values

In [10]:
df[df.geo_city.isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country


In [11]:
df[df.geo_city == '']

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,,Texas,us
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,,Texas,us
6,10/10/1965 21:00,penarth,,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2006-02-14,51.434722,-3.180000,,Wales,gb
8,10/10/1966 20:00,pell city,al,us,disk,180,3 minutes,Strobe Lighted disk shape object observed clos...,2009-03-19,33.586111,-86.286110,,Alabama,us
9,10/10/1966 21:00,live oak,fl,us,disk,120,several minutes,Saucer zaps energy from powerline as my pregna...,2005-05-11,30.294722,-82.984169,,Florida,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80322,9/9/2013 21:00,aleksandrow,,,light,15,15 seconds,Two points of light following one another in a...,2013-09-30,50.465843,22.891813,,Lublin Voivodeship,pl
80324,9/9/2013 21:00,hamstead,nc,,light,120,2 minutes,8 to ten lights bright orange in color large t...,2013-09-30,34.367594,-77.710548,,North Carolina,us
80325,9/9/2013 21:00,milton,on,ca,fireball,180,3 minutes,Massive Bright Orange Fireball in Sky,2013-09-30,46.300000,-63.216667,,Prince Edward Island,ca
80326,9/9/2013 21:00,woodstock,ga,us,sphere,20,20 seconds,Driving 575 at 21:00 hrs saw a white and green...,2013-09-30,34.101389,-84.519447,,Georgia,us


## If the city found by geolocation is empty, replace it with the city from the original data

In [12]:
df.geo_city = np.where(df.geo_city == '', df.city.str.lower(), df.geo_city.str.lower())
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,Texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,Texas,us
2,10/10/1955 17:00,chester,,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,England,gb
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,Texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,Hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,Tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,Idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,California,us
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,Virginia,us


## Fill in missing state values

In [13]:
df[df.state.isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
2,10/10/1955 17:00,chester,,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,England,gb
6,10/10/1965 21:00,penarth,,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2006-02-14,51.434722,-3.180000,penarth,Wales,gb
18,10/10/1973 23:00,bermuda nas,,,light,20,20 sec.,saw fast moving blip on the radar scope thin w...,2002-01-11,32.364167,-64.678612,bermuda nas,,bm
20,10/10/1974 21:30,cardiff,,gb,disk,1200,20 minutes,back in 1974 I was 19 at the time and lived i...,2007-02-01,51.500000,-3.200000,cardiff,Wales,gb
24,10/10/1976 22:00,stoke mandeville,,gb,cigar,3,3 seconds,White object over Buckinghamshire UK.,2009-12-12,51.783333,-0.783333,stoke mandeville,England,gb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80217,9/9/2007 19:01,melbourne,,au,circle,600,10 min,Hostile,2007-10-08,-37.813938,144.963425,melbourne,Victoria,au
80234,9/9/2009 03:14,aberdeen,,gb,light,6,6 seconds,Bright light seen over Aberdeen&#44 Scotland&#...,2009-12-12,57.166667,-2.666667,aberdeen,Scotland,gb
80254,9/9/2009 21:15,nottinghamshire,,gb,fireball,600,10 mins,resembled orange flame imagine a transparent h...,2009-12-12,53.166667,-1.000000,newark and sherwood,England,gb
80255,9/9/2009 21:38,kaiserlautern,,de,light,40,about 40 seconds,2 white lights over Kaiserslautern&#44 ramstei...,2009-12-12,49.450000,7.750000,kaiserslautern,Rhineland-Palatinate,de


In [14]:
df[df.geo_state.isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
515,10/1/1970 23:00,indian ocean,,,light,240,3-4 minutes,Bright object seemingly appeared out of nowher...,2004-07-08,-33.137551,81.826172,indian ocean,,
1740,10/15/1968 21:30,pacific ocean,,,circle,30,30 sec.,Bright&#44 white soundless orb with no trajeco...,2003-09-12,-8.783195,-124.508522,pacific ocean,,
3282,10/20/2008 02:00,indian ocean,,,unknown,300,5 minuts,at night in the middle of the ocean ( a light ...,2009-08-27,-33.137551,81.826172,indian ocean,,
4212,10/24/1995 02:00,tyrrhenian sea,,,sphere,30,30sec,blue colour sphere was obsereved from containe...,2006-07-16,40.076986,11.343106,tyrrhenian sea,,
5363,10/29/2010 21:00,indian ocean,,,fireball,5400,1.5 hrs,During the routine bridge watch at sea&#44 on ...,2010-11-21,-33.137551,81.826172,indian ocean,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74542,9/15/1966 01:30,pacific ocean,,,unknown,300,5 mis.,object in water 2 feet from boat made a straig...,2007-10-08,-8.783195,-124.508522,pacific ocean,,
75541,9/18/2005 14:00,atlantic ocean,,,disk,60,1 minute,lenticular cloud to disc,2005-10-11,-14.599413,-28.673147,atlantic ocean,,
76026,9/20/1988 13:00,atlantic ocean,,,unknown,20,20 seconds,The craft was visible at different positions f...,2005-05-11,-14.599413,-28.673147,atlantic ocean,,
76282,9/21/1988 03:00,atlantic ocean,,,fireball,15,15 seconds,The light clearly lit up the bow of the vessel...,2005-05-11,-14.599413,-28.673147,atlantic ocean,,


## Fill in state and country when the UFO was sighted over water

In [15]:
df.geo_state = np.where((df.geo_state.isna()) & ((df.geo_city.str.contains('ocean') | df.geo_city.str.contains('sea') | df.geo_city.str.contains('gulf') | df.geo_city.str.contains('antarctica'))),'over_water', df.geo_state.str.lower())
df.geo_country = np.where((df.geo_country.isna()) & ((df.geo_city.str.contains('ocean') | df.geo_city.str.contains('sea') | df.geo_city.str.contains('gulf') | df.geo_city.str.contains('antarctica'))),'over_water', df.geo_country)
df  

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,10/10/1955 17:00,chester,,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us
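
The four chained `str.contains` calls can be folded into one regular expression with alternation. The sketch below applies that idea to a small hypothetical frame. The name `water_pattern` is illustrative, and the matching is plain substring matching, as in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same column names as the notebook
geo = pd.DataFrame({
    'geo_city':  ['indian ocean', 'tyrrhenian sea', 'fukuoka', 'san marcos'],
    'geo_state': [None, None, None, 'Texas'],
})

# one alternation pattern instead of four separate contains() calls
water_pattern = r'ocean|sea|gulf|antarctica'
over_water = geo.geo_state.isna() & geo.geo_city.str.contains(water_pattern, regex=True)

geo['geo_state'] = np.where(over_water, 'over_water', geo.geo_state.str.lower())
print(geo)
```

Because the pattern matches substrings, a city such as `seattle` would also match `sea`. The `isna()` condition keeps that from mattering for rows that already have a state.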


In [16]:
df[df.geo_state == '']

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
18,10/10/1973 23:00,bermuda nas,,,light,20,20 sec.,saw fast moving blip on the radar scope thin w...,2002-01-11,32.364167,-64.678612,bermuda nas,,bm
184,10/10/2007 23:20,stord,,,light,600,10 min,Thise could be an ETV case&#44 but it could al...,2008-01-21,59.900209,5.282347,stord,,no
285,10/11/1986 20:30,alice springs,,au,,20,20 seconds,Being of light reported&#44Jesus or another m...,2005-01-19,-23.697479,133.883621,alice springs,,au
296,10/11/1997 22:00,hafnarfjordur,,,sphere,300,5 min,playing with a jet,2008-06-12,64.066667,-21.950001,hafnarfjordur,,is
480,10/1/1952 03:30,fukuoka,,,disk,1200,about 20 mins,UFO seen by multiple U. S. military personnel;...,2006-12-07,33.590355,130.401718,fukuoka,,jp
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78898,9/3/2004 14:54,busan,,,chevron,2,seconds,It has dark brown color&#44 an empennage-shape...,2004-09-09,35.179554,129.075638,busan,,kr
79526,9/6/2002 00:00,mbour,,,light,60,1 minute,In Mbour&#44 Senegal&#44 ( 14deg.&#4425 min. N...,2005-05-11,14.416667,-16.966667,m'bour,,sn
79538,9/6/2002 22:00,kunsan city&#44 south korea,,,triangle,60,1 minute,Triangular &#44Cloud like shape,2002-09-13,35.967677,126.736626,gunsan-si,,kr
79745,9/7/2003 12:03,pecs,,,egg,1500,25min,((NUFORC Note: Hoax. PD)) Small object lands,2005-10-11,46.072735,18.232265,pécs,,hu


## Fill in state when the state is blank

In [17]:
df.geo_state = np.where(df.geo_state == '', 'unknown', df.geo_state)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,10/10/1955 17:00,chester,,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us
80330,9/9/2013 22:20,vienna,va,us,circle,5,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


In [18]:
df[(df.geo_country.isna()) | (df.geo_country == '')]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
46045,5/7/2003 17:00,europe,,,unknown,5,5 sec,RADAR WARNING,2003-05-09,54.525961,15.255119,europe,,


In [19]:
# set both geocoded fields by name rather than by positional column index
df.loc[46045, ['geo_state', 'geo_country']] = 'unknown'

In [20]:
df.to_parquet(temp_dir+'/clean_locations.parquet')

## Duration
The seconds column expresses the free-text hours_min column as a number of seconds, which makes it cleaner and easier to work with.

The values range from less than a second to several years. If a UFO was visible for less than a second, it is unlikely that the observer actually saw anything. If a UFO had been continuously visible for hours, days, weeks or years, multiple people would have seen it and it would eventually have been well documented. I adjusted these values to the median and mean since those are more realistic, but doing so drastically altered the data, so the results should be treated with little confidence.
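
To make the caveat above concrete, the sketch below compares the mean and median of a few hypothetical durations, then shows one alternative treatment: masking implausible values as missing instead of overwriting them. The bounds `lower` and `upper` are assumptions chosen for illustration and do not come from the notebook.

```python
import pandas as pd

# Hypothetical durations in seconds, including one multi-year outlier
seconds = pd.Series([0.001, 30.0, 180.0, 600.0, 97836000.0])

print(seconds.mean())    # dominated by the single multi-year value
print(seconds.median())  # unaffected by the extremes

# assumed plausibility bounds: at least one second, at most one day
lower, upper = 1.0, 24 * 60 * 60

# values outside the bounds become NaN and drop out of later statistics
plausible = seconds.where(seconds.between(lower, upper))
print(plausible.mean())
```

Masking keeps the untouched values honest but shrinks the sample, so neither approach removes the need for caution when reading the duration results.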

In [21]:
# remove stray backtick characters so that the time in seconds can be represented as a float
df.seconds = df.seconds.str.replace('`', '', regex=False)
df.seconds = df.seconds.astype(float)

In [22]:
df.dtypes

datetime               object
city                   object
state                  object
country                object
shape                  object
seconds               float64
hours_min              object
comments               object
date_posted    datetime64[ns]
latitude              float64
longitude             float32
geo_city               object
geo_state              object
geo_country            object
dtype: object

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80332 entries, 0 to 80331
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     80332 non-null  object        
 1   city         80332 non-null  object        
 2   state        74535 non-null  object        
 3   country      70662 non-null  object        
 4   shape        78400 non-null  object        
 5   seconds      80332 non-null  float64       
 6   hours_min    80332 non-null  object        
 7   comments     80317 non-null  object        
 8   date_posted  80332 non-null  datetime64[ns]
 9   latitude     80332 non-null  float64       
 10  longitude    80332 non-null  float32       
 11  geo_city     80332 non-null  object        
 12  geo_state    80332 non-null  object        
 13  geo_country  80332 non-null  object        
dtypes: datetime64[ns](1), float32(1), float64(2), object(10)
memory usage: 8.3+ MB


In [24]:
df.seconds.describe()

count    8.033200e+04
mean     9.016889e+03
std      6.202168e+05
min      1.000000e-03
25%      3.000000e+01
50%      1.800000e+02
75%      6.000000e+02
max      9.783600e+07
Name: seconds, dtype: float64

In [25]:
mean_seconds = df.seconds.mean()
mean_seconds

9016.889016344669

In [26]:
median_seconds = df.seconds.median()
median_seconds

180.0

In [27]:
df[df.hours_min.str.contains('day')]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
336,10/11/2003 00:00,san diego,ca,us,,172800.0,2 days?,Lost two days awaken to power out tv fried and...,2003-10-31,32.715278,-117.156387,san diego,california,us
1110,10/12/2007 23:00,rogers,ar,us,unknown,172800.0,1-2 days,((HOAX??)) abduction. 500 Lights On Object0: Yes,2008-03-04,36.331944,-94.118332,rogers,arkansas,us
2157,10/15/2006 22:00,thompson,mb,ca,,432000.0,5 days,Orions Belt *nebula (faint) out in ...,2007-02-01,55.750000,-97.866669,thompson,manitoba,ca
2991,10/19/2008 23:09,laurel,ms,us,light,432000.0,5 days,Lights captured on wild game camera.,2009-01-10,31.693889,-89.130554,laurel,mississippi,us
3725,10/2/2011 21:00,marion,in,us,,172800.0,2 days,Red and green lights over marion indiana,2011-10-10,40.558333,-85.659164,marion,indiana,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77750,9/26/1999 09:00,castlegar,bc,ca,other,7200.0,throughout the day,HBCCUFO CANADIAN REPORT: Chemtrail/contrail s...,2004-03-17,49.316667,-117.666664,castlegar,british columbia,ca
78051,9/27/2009 02:39,randle,wa,us,disk,1200.0,three days in a row,It seems to move left and right and up and dow...,2009-12-12,46.535278,-121.955833,randle,washington,us
79325,9/5/2002 21:45,coloma,wi,us,,172800.0,days,Airplane like object stationary in the north s...,2002-09-06,44.035556,-89.521385,coloma,wisconsin,us
79888,9/8/1999 01:30,andover,ma,us,unknown,86400.0,about a day,emmited a green/white glow,1999-10-02,42.658333,-71.137497,andover,massachusetts,us


In [28]:
df[df.hours_min.str.contains('day')].describe()

Unnamed: 0,seconds,latitude,longitude
count,126.0,126.0,126.0
mean,209001.904762,36.341398,-83.195969
std,148017.212997,15.078277,50.232704
min,60.0,-34.713016,-157.739441
25%,86400.0,33.820139,-113.351873
50%,172800.0,39.034306,-93.708195
75%,259200.0,42.337361,-77.662708
max,777600.0,56.470833,153.194305


### A UFO could not be continuously visible for a day or more, so change the value to the mean number of seconds

In [29]:
df.loc[df.hours_min.str.contains('day'), 'seconds'] = mean_seconds

In [30]:
df[df.hours_min.str.contains('year')]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
559,10/1/1983 17:00,birmingham,,gb,sphere,97836000.0,31 years,Firstly&#44 I was stunned and stared at the ob...,2013-04-12,52.466667,-1.916667,birmingham,england,gb
693,10/1/2001 24:00,chulucanas-piura la vieja,,,other,6312000.0,2 years,go to: http://www.24horas.com.pe/data/videos/...,2003-03-04,-5.129547,-80.120567,chulucanas-piura la vieja,piura,pe
10853,1/1/1977 02:30,new canaan,ct,us,,9468000.0,2-3 years,possible abductions when I was a kid living in...,1998-11-19,41.146667,-73.495277,new canaan,connecticut,us
21527,12/31/2006 23:00,imperial desert,ca,,oval,6312000.0,new years eve,The Object was more round than Oval. It was be...,2007-02-01,32.841179,-115.590172,imperial desert,california,us
21583,12/31/2009 23:30,livingston,la,us,unknown,6312000.0,new years,The 12 UFO we seen brite yellow like a street ...,2010-02-14,30.501944,-90.74778,livingston,louisiana,us
29747,2/7/2004 12:00,tehran,,,light,6312000.0,two years,dear sirs: a few nights ago I was sitting outs...,2004-06-18,35.696111,51.423058,tehran,unknown,ir
30590,3/1/1993 05:30,ganado,az,us,other,6312000.0,20years,For many years since 1978 to 2004 there have b...,2007-08-07,35.711389,-109.541389,ganado,arizona,us
49422,6/15/2012 21:00,huntington,ny,us,light,6312000.0,1-2 years,Strange lights in the sky that move and follow...,2013-10-14,40.868056,-73.426109,huntington,new york,us
52709,6/30/1969 22:45,somerset,,gb,cone,25248000.0,8 years,First time it was a bright light and missing t...,2009-08-05,51.083333,-3.0,sedgemoor,england,gb
53006,6/30/2002 22:00,honolulu,hi,us,circle,6312000.0,years,Green glowing UFOs and some that look like sta...,2007-02-01,21.306944,-157.858337,honolulu,hawaii,us


### A UFO could not be continuously visible for weeks, months or years, so adjust the time to the mean number of seconds

In [31]:
df.loc[df.hours_min.str.contains('year'), 'seconds'] = mean_seconds

In [32]:
df[df.hours_min.str.contains('week')]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
4123,10/23/2011 01:30,alamo,tn,us,triangle,604800.0,one week,Over the past week me and 7-10 of my friends h...,2011-10-25,35.784722,-89.117226,alamo,tennessee,us
7101,10/8/2008 21:00,ukiah,ca,us,light,1209600.0,2 weeks,&quot;Bright-Star or Something Else&quot;&#33 ...,2008-10-31,39.150278,-123.206665,ukiah,california,us
8837,11/12/2013 21:30,mount isa,,au,sphere,1209600.0,2 weeks,Orange orb over mount Isa. ((NUFORC Note: Po...,2013-11-20,-20.725229,139.497269,mount isa,queensland,au
12324,11/22/2005 04:00,hemet/south jacinto,ca,,light,604800.0,1 week,Bright light coming out of a shallow mountain&#33,2005-12-16,33.758728,-116.95871,hemet/south jacinto,california,us
14090,1/13/2007 11:30,rio piedras,,,unknown,604800.0,1 week,Yacimiento cientifico de madre e hijos encontr...,2009-01-10,18.399722,-66.050003,san juan,puerto rico,us
14181,11/3/2011 19:21,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,2011-12-12,44.953056,-92.291115,woodville,wisconsin,us
14588,1/14/2013 02:15,bradford,bc,gb,circle,9016.889,week day,((HOAX)) Circle and big.,2013-02-04,51.0,-3.183333,somerset west and taunton,england,gb
14703,1/15/1995 12:00,gorham,me,us,,1209600.0,2 weeks,Man called to enquire about all the strange si...,1999-11-02,43.679444,-70.444725,gorham,maine,us
19129,1/2/2007 18:30,clinton township,mi,,formation,604800.0,1 week,bright lights in the southwest sky moving&#44s...,2007-02-01,42.586888,-82.919548,clinton township,michigan,us
22347,1/24/2009 18:00,fairbanks,ak,us,oval,604800.0,a little more than a week,Bright light high above my town&#44 very pecul...,2009-03-19,64.837778,-147.716385,fairbanks,alaska,us


In [33]:
df.loc[df.hours_min.str.contains('week'), 'seconds'] = mean_seconds

In [34]:
df[df.hours_min.str.contains('month')]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
3476,10/21/2007 01:00,maysville,ky,us,light,2631600.0,1 month,strange lights over maysville ky,2007-11-28,38.641111,-83.744446,maysville,kentucky,us
6991,10/7/2013 20:00,oklahoma,ok,,circle,10526400.0,4 months,Bright flying orb.,2013-10-14,35.46756,-97.516426,oklahoma city,oklahoma,us
16410,11/9/2005 12:00,tukwila,wa,us,disk,219300.0,several months,((HOAX??)) Appeared solid&#44 silver metallic...,2011-05-12,47.474167,-122.25972,tukwila,washington,us
17388,12/1/2002 24:00,gordes-manisa,,,fireball,2631600.0,one month,We see the same objects even on claudy days. T...,2002-12-23,38.932515,28.290667,gordes-manisa,unknown,tr
20064,12/23/2010 00:00,mccomb,ms,us,cylinder,2631600.0,month,Have you there been any reports from mississip...,2011-01-05,31.243611,-90.453056,mccomb,mississippi,us
21742,12/31/2013 00:00,chesterfield,va,us,oval,2102400.0,>8 months,Collection of orbs&#44 rods and discs sighted ...,2014-01-16,37.376944,-77.506111,chesterfield,virginia,us
25822,2/1/1978 24:00,detroit,mi,us,,5263200.0,1-2 months,Detroit UFO Flap of February 1978,2007-04-27,42.331389,-83.04583,detroit,michigan,us
28355,2/24/2002 17:30,springdale,ut,us,triangle,2631600.0,one month,On Feb. 24th&#44 2002 a report come into our ...,2002-03-19,37.188889,-112.99778,springdale,utah,us
30596,3/1/1994 01:00,menifee,ca,us,unknown,10526400.0,4 months,Sun City / Menifee UFO sightings in 1994,2005-02-22,33.728333,-117.145554,menifee,california,us
30617,3/1/1998 20:00,cebu city,,,other,5263200.0,1 to 2 months,End of the Century UFO,2006-05-15,10.315699,123.885437,cebu city,unknown,ph


In [35]:
df.loc[df.hours_min.str.contains('month'), 'seconds'] = mean_seconds
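
The week and month replacements follow the same pattern, so they could be combined into one mask. A minimal sketch on a toy frame, assuming the notebook's column names; the `toy` data and the `mean_seconds` value here are made up for illustration and stand in for the values defined earlier in the notebook:

```python
import pandas as pd

# Toy frame with the notebook's column names; the values are illustrative only.
toy = pd.DataFrame({
    'hours_min': ['2 weeks', '1 month', '10 minutes', None],
    'seconds': [1209600.0, 2631600.0, 600.0, 30.0],
})

# Stand-in for the notebook's mean_seconds (hypothetical value).
mean_seconds = 900.0

# One regex covers both keywords; na=False treats missing text as "no match".
long_duration = toy['hours_min'].str.contains('week|month', case=False, na=False)
toy.loc[long_duration, 'seconds'] = mean_seconds
```

Passing `na=False` keeps the mask boolean even if `hours_min` contains missing values.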

In [36]:
df[df.hours_min.str.contains('hour')]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
3,10/10/1956 21:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
21,10/10/1974 23:00,hudson,ks,us,light,1200.0,one hour?,The light chased us.,2004-07-25,38.105556,-98.659721,hudson,kansas,us
51,10/10/1992 17:00,panama city,fl,us,formation,3600.0,1 hour(?),During a road trip to Panama City a friend and...,1999-01-28,30.158611,-85.660278,panama city,florida,us
58,10/10/1994 15:00,mercedies,tx,,cigar,3600.0,1 hour,ufo chased by fighter jet over Rio Grande Vall...,2011-12-12,26.149798,-97.913612,mercedies,texas,us
61,10/10/1994 23:00,toronto,on,ca,sphere,3600.0,~1 hour,Large rusty sphere,2013-07-03,43.666667,-79.416664,toronto,ontario,ca
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80276,9/9/2011 10:20,double springs,al,us,sphere,3600.0,1 or more hours,Strange bright spheres in the sky that sometim...,2011-10-10,34.146389,-87.402222,double springs,alabama,us
80286,9/9/2011 23:00,kenmore,wa,us,changing,5400.0,1.5 hours,UFO changing colors&#44 shapes and pulsating -...,2011-10-10,47.757500,-122.242775,kenmore,washington,us
80289,9/9/2012 04:43,murfreesboro,tn,us,triangle,7200.0,2 hours,Triangular shape white light with red and gree...,2012-09-24,35.845556,-86.390274,murfreesboro,tennessee,us
80302,9/9/2012 20:00,wilson,nc,us,light,10800.0,3 hours,Bright orb being chased by a jet along with se...,2012-09-24,35.721111,-77.915833,wilson,north carolina,us


In [37]:
df.loc[(df.hours_min == '1/2 hours') & (df.seconds != 1800)]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
1841,10/15/1986 18:00,bellevue,wa,us,disk,10.0,1/2 hours,The saucer hovered over us&#44 it was huge&#44...,2009-03-19,47.610556,-122.199448,bellevue,washington,us
39914,4/30/2010 20:00,broadview heights,oh,us,circle,120.0,1/2 hours,Me and my family saw weird objects in the sky ...,2010-05-12,41.313889,-81.68528,broadview heights,ohio,us


### 1/2 hours = 1800 seconds

In [38]:
df.loc[(df.hours_min == '1/2 hours') & (df.seconds != 1800), 'seconds'] = 1800

In [39]:
df.loc[(df.hours_min.str.contains('1 1/2 hours')) & (df.seconds != 5400)]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
528,10/1/1973 21:30,hixson,tn,us,oval,37800.0,1 1/2 hours,Oval&#44 brilliant lights&#44 red glow&#44 win...,2005-04-16,35.140556,-85.23278,chattanooga,tennessee,us
816,10/1/2006 21:00,orchard park,ny,us,unknown,37800.0,approx. 1 1/2 hours,Spotted again as before........,2006-10-30,42.7675,-78.744164,orchard park,new york,us
2699,10/18/1998 09:30,jacksonville,il,us,light,37800.0,1 1/2 hours,First one appeared in the eastern sky and move...,1998-11-01,39.733889,-90.228889,jacksonville,illinois,us
4757,10/26/2008 23:17,st. louis,mo,us,triangle,37800.0,1 1/2 hours,Metallic object&#44 square/triangular shape&#4...,2008-10-31,38.627222,-90.197777,saint louis,missouri,us
5850,10/31/2005 18:00,poughkeepsie,ny,us,light,37800.0,1 1/2 hours,Several bright lights moving erratically for e...,2005-11-03,41.700278,-73.921387,city of poughkeepsie,new york,us
7052,10/8/2003 22:36,salmon arm,bc,ca,cigar,37800.0,1 1/2 hours,walking with a cigar shaped object white/yellow,2003-10-15,50.7,-119.283333,salmon arm,british columbia,ca
7764,11/10/2008 22:00,pirenopolis,,,circle,37800.0,1 1/2 hours,Bright white and yellow circular light over de...,2009-01-10,-15.851148,-48.958874,pirenopolis,goiás,br
10146,11/17/2000 17:30,freehold,nj,us,circle,37800.0,1 1/2 hours,an unidentifed flying object along with a smal...,2002-11-20,40.26,-74.27417,freehold,new jersey,us
12524,11/23/2000 22:00,auburn,nh,us,circle,37800.0,1 1/2 hours,Two objects&#44 lots of lights viewed for 1 1/...,2000-12-02,43.004444,-71.348892,auburn,new hampshire,us
14373,1/14/2003 19:00,wisconsin dells,wi,us,light,37800.0,1 1/2 hours,We saw 10 or more&#44 very bright&#44 beautifu...,2003-03-04,43.6275,-89.770836,wisconsin dells,wisconsin,us


### 1.5 hours = 5400 seconds not 37800

In [40]:
df.loc[(df.hours_min.str.contains('1 1/2 hours')) & (df.seconds != 5400), 'seconds'] = 5400
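
Instead of correcting each fractional duration by hand, the text could be parsed directly. The helper below is a hypothetical sketch (it is not part of the notebook) that handles strings such as `1/2 hours`, `1 1/2 hours` and `1-1/2 hrs`; it does not try to cover every spelling in `hours_min`:

```python
import re
from fractions import Fraction

def mixed_hours_to_seconds(text):
    """Convert strings such as '1 1/2 hours' or '1/2 hour' to seconds.

    Returns None when no fraction followed by an hour unit is found.
    """
    match = re.search(r'(?:(\d+)[\s-]+)?(\d+)/(\d+)\s*(?:hours?|hrs?)', text)
    if match is None:
        return None
    whole, numerator, denominator = match.groups()
    if int(denominator) == 0:
        return None  # malformed fraction such as '1/0 hours'
    hours = Fraction(int(whole or 0)) + Fraction(int(numerator), int(denominator))
    return float(hours * 3600)
```

For example, `mixed_hours_to_seconds('approx. 1 1/2 hours')` gives `5400.0`, while a string without a fraction, such as `'10 minutes'`, gives `None`.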

In [41]:
df.loc[(df.hours_min == '2 hours') & (df.seconds != 7200)]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country


In [42]:
df.loc[(df.hours_min == '3 hours') & (df.seconds != 10800)]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country


In [43]:
df.loc[(df.hours_min == '4 hours') & (df.seconds != 14400)]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country


### 23000 hours

In [44]:
df.loc[df.seconds == 8.280000e+07]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
53384,6/3/2010 23:30,ottawa,on,ca,other,82800000.0,23000hrs,((HOAX??)) I was out in a field near mil&#44 ...,2010-07-06,45.416667,-75.699997,ottawa,ontario,ca


In [45]:
df.loc[df.seconds == 8.280000e+07, 'seconds'] = mean_seconds

### 1700 hours

In [46]:
df.loc[df.seconds == 6.120000e+06]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
23721,12/9/1965 24:00,newtown square,pa,us,fireball,6120000.0,1700hrs,DEC 1965 NEWTOWN SQUARE PENN FLAMING BURNIN...,2006-07-16,39.986667,-75.40139,newtown township,pennsylvania,us


In [47]:
df.loc[df.seconds == 6.120000e+06, 'seconds'] = mean_seconds

In [48]:
df.loc[df.seconds > 1.0e+05]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
1224,10/13/1969 21:15,millington,tn,us,formation,259200.0,72 hours,Star-like&#44 8-point inverted V-shape&#44 cha...,2002-01-29,35.341389,-89.897224,millington,tennessee,us
3883,10/22/2009 00:00,deep gap,nc,,disk,109800.0,3 1/2 hrs,Five lighted craft dance about Orion during a ...,2009-12-12,36.20022,-81.531197,deep gap,north carolina,us
10967,1/1/2001 19:00,north whitefield,me,us,light,109800.0,3 1/2 hours,I have seen this craft for years and would lik...,2001-01-03,44.221944,-69.587776,north whitefield,maine,us
13217,11/27/2001 19:30,warner robins,ga,us,triangle,361800.0,10 1/2 hours,There ARE UFOs over Robins AFB&#44 Ga.,2001-12-05,32.620833,-83.599998,warner robins,georgia,us
13220,11/27/2002 04:30,rainier,wa,us,triangle,109800.0,3 1/2 hours,bright lights hovering in the sky,2002-12-23,46.888333,-122.687225,rainier,washington,us
13221,11/27/2002 04:45,san francisco,ca,us,changing,109800.0,hour,My wife and I witnessed a large&#44 extremely ...,2002-12-23,37.775,-122.418335,san francisco,california,us
16889,12/10/1998 01:15,camp pendelton,ca,,sphere,109800.0,3 1/2 hrs,Blue Sphere,2005-05-24,33.317842,-117.320511,camp pendelton,california,us
21793,12/31/2013 21:54,mesa,az,us,fireball,172800.0,48 hours,I&#39m reporting very concerning sightings of ...,2014-01-10,33.422222,-111.821945,mesa,arizona,us
21857,12/3/1999 19:30,lodi,ca,us,fireball,145800.0,approx. 41/2 hrs,Strange fireballs on Dec. 03&#44 1999 on the W...,2000-12-02,38.130278,-121.271385,lodi,california,us
22680,1/25/2008 24:00,plymouth,,gb,triangle,172800.0,48 hrs,just wondered if you seen anything strange in ...,2008-02-14,50.396389,-4.138611,plymouth,england,gb


### It seems unlikely that a UFO would be visible for 72 hrs straight

In [49]:
df.loc[df.hours_min.str.contains('72 hours')]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
1224,10/13/1969 21:15,millington,tn,us,formation,259200.0,72 hours,Star-like&#44 8-point inverted V-shape&#44 cha...,2002-01-29,35.341389,-89.897224,millington,tennessee,us
23460,12/8/2002 18:45,sharpsville,pa,us,egg,259200.0,72 hours (est),Western Pennsylvania,2003-06-10,41.259167,-80.472221,sharpsville,pennsylvania,us
36011,4/1/1978 21:30,green springs,oh,us,cylinder,259200.0,72 hours intermitten,The begining of the lights in the night,2006-10-30,41.256111,-83.051666,green springs,ohio,us


In [50]:
df.loc[df.hours_min.str.contains('72 hours'), 'seconds'] = mean_seconds

### 8.5 to 10.25 hrs 

In [51]:
df.loc[df.seconds == 3654000.0, 'seconds'] = mean_seconds

### 30 hrs

In [52]:
df.loc[df.seconds == 1080000.0, 'seconds'] = mean_seconds

### 3.5 hrs is 12600 seconds not 109800

In [53]:
df.loc[df.seconds == 109800.0, 'seconds'] = 12600

### 10.5 hrs is 37800 seconds not 361800

In [54]:
df.loc[df.seconds == 361800.0, 'seconds'] = 37800

### 96 hrs

In [55]:
df.loc[df.seconds == 345600.0, 'seconds'] = mean_seconds

### 7.5 hrs is 27000 seconds not 253800

In [56]:
df.loc[df.seconds == 253800.0, 'seconds'] = 27000

### 5.5 hrs is 19800 seconds not 181800

In [57]:
df.loc[df.seconds == 181800.0, 'seconds'] = 19800

### 45 hours

In [58]:
df.loc[df.seconds == 162000.0, 'seconds'] = mean_seconds

### 48 hrs

In [59]:
df.loc[df.seconds == 172800.0, 'seconds'] = mean_seconds

### 4.5 hrs is 16200 seconds not 145800

In [60]:
df.loc[df.seconds == 145800.0, 'seconds'] = 16200

### 3 hours is 10800 seconds not 1080000

In [61]:
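# Note: In [52] already set every 1080000.0 value to mean_seconds, so this assignment no longer matches any rows.
df.loc[df.seconds == 1080000.0, 'seconds'] = 10800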

### 30 hrs

In [62]:
df.loc[df.seconds == 108000.0, 'seconds'] = mean_seconds

In [63]:
df.loc[df.seconds > 3.0e+04]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
123,10/10/2003 21:10,crescent beach,sc,us,formation,37800.0,1 1/2 hr.,For two consecutive nights&#44 we watched a pa...,2004-01-17,33.807500,-78.701111,crescent beach,south carolina,us
1557,10/14/2002 20:15,gaysville,vt,us,cone,37800.0,1 1/2 hour,Multicolor flashing cone object in Vermont,2002-10-28,43.778333,-72.699448,gaysville,vermont,us
1651,10/14/2011 18:00,el paso,tx,us,circle,37800.0,1 1/2 hrs,Silver orb floating over far westside of El Pa...,2011-10-25,31.758611,-106.486389,el paso,texas,us
1810,10/15/1981 07:30,stonewall,mb,ca,triangle,36000.0,10 hours,I remember feeling so scared and helpless&#44 ...,2013-11-20,50.133333,-97.316666,stonewall,manitoba,ca
3185,10/20/2002 24:00,raton/pueblo,co,,light,37800.0,1 1/2hrs,2Bright star-like objects follow me for over 1...,2002-10-28,37.939968,-104.819885,raton/pueblo,colorado,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76437,9/21/2009 22:00,bastrop,tx,us,sphere,37800.0,1-1/2 hrs,9/21/0Blinking colored lights forming perfect ...,2009-12-12,30.110278,-97.315002,bastrop,texas,us
76955,9/22/2007 04:00,martin,ga,us,other,73800.0,2-2 1/2 hours,UFO sighting in Georgia&#44 at a lake&#44 3 of...,2008-01-21,34.486944,-83.184998,martin,georgia,us
77332,9/24/1994 20:00,orcas island,wa,,light,37800.0,1-1/2 hrs,Bright blue flashes in the sky and zigzagging ...,2003-11-26,48.633082,-122.928986,orcas island,washington,us
78575,9/30/1966 19:00,hohonfels,,,cigar,50400.0,14 hours,Hohenfels training facility USAEURO,2007-02-24,49.203278,11.849022,hohonfels,bavaria,de


### 1.5 hours is 5400 seconds not 37800

In [64]:
# Caution: this also matches any row set to 37800 (10.5 hours) in In [54].
df.loc[df.seconds == 37800.0, 'seconds'] = 5400

### 2.5 hours is 9000 seconds not 73800

In [65]:
df.loc[df.seconds == 73800.0, 'seconds'] = 9000
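
The half-hour corrections above appear to share one pattern: the wrong value is what you get if "N 1/2 hours" is read as "N0.5 hours" (for example, 3 1/2 hours stored as 30.5 hours = 109800 seconds). This is an observation from the numbers, not something the data source documents. A small sketch, with hypothetical helper names, that rebuilds the wrong/right pairs from that assumption:

```python
def misencoded_seconds(whole_hours):
    """Seconds that result if 'N 1/2 hours' is read as 'N0.5 hours'."""
    return (whole_hours * 10 + 0.5) * 3600

def intended_seconds(whole_hours):
    """Seconds for N and a half hours."""
    return (whole_hours + 0.5) * 3600

# Wrong -> right lookup for the half-hour durations corrected in this section.
corrections = {
    misencoded_seconds(n): intended_seconds(n)
    for n in (1, 2, 3, 4, 5, 7, 10)
}
```

Note that 37800 appears on both sides of this lookup: it is the wrong value for 1 1/2 hours and the right value for 10 1/2 hours, so the order of sequential replacements matters.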

In [66]:
df.seconds.describe()

count    80332.000000
mean       863.008117
std       2484.566619
min          0.001000
25%         30.000000
50%        180.000000
75%        600.000000
max      97200.000000
Name: seconds, dtype: float64

### A duration of 0.001 seconds is too short to be perceptible

In [67]:
df[df.seconds == 0.001000]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
4081,10/23/2008 04:45,remote,wy,,flash,0.001,0.001sec,brilliant strobe light at 4am&#44 moving light...,2009-01-10,-46.163992,169.875046,remote,otago,nz


In [68]:
df.loc[df.seconds < median_seconds, 'seconds'] = median_seconds
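
Raising every value below the median to the median is a lower clip. The sketch below shows the same operation with `Series.clip` on a made-up series; `median_seconds` is recomputed here from the toy data rather than taken from the notebook:

```python
import pandas as pd

# Illustrative durations in seconds (not taken from the dataset).
seconds = pd.Series([0.001, 5.0, 180.0, 600.0, 3600.0])
median_seconds = seconds.median()

# Values below the median are raised to the median; larger values are unchanged.
floored = seconds.clip(lower=median_seconds)
```

A side effect of this choice is that at least half of the values end up equal to the median, which is why the later `describe()` output shows the same value (180) for the minimum, the 25th percentile and the median.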

### 27 hours

In [69]:
df[df.seconds == 97200]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
66880,8/16/1968 03:00,kansas ciy,ks,,changing,97200.0,27 hours,Deep red light changeing to two much larger sp...,2004-01-31,39.114053,-94.627464,kansas city,kansas,us


In [70]:
df.loc[df.seconds == 97200, 'seconds'] = mean_seconds

In [71]:
df.seconds.describe()

count    80332.000000
mean       925.632942
std       2440.870693
min        180.000000
25%        180.000000
50%        180.000000
75%        600.000000
max      86400.000000
Name: seconds, dtype: float64

In [72]:
df[df.seconds > 43200]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
10500,11/18/2008 19:06,washington&#44 d.c.,dc,,other,68814.0,19:06:54,UFO flies into frames taken @ US Capitol Build...,2009-01-10,38.907231,-77.036461,washington,district of columbia,us
19788,12/2/2013 19:00,antioch,ca,us,circle,56000.0,800 milliseconds,((HOAX??)) BIG circular ball of light flashes...,2013-12-05,38.005,-121.804726,antioch,california,us
20655,12/26/1997 20:00,primrose,ga,us,light,86400.0,24 hours,SINGLE WHITE LIGHT JUMPING AROUND IN WEST SKIE...,1998-03-07,33.143333,-84.741943,primrose,georgia,us
21906,12/3/2003 06:55,taylorsville,nc,us,unknown,54000.0,15 hours,Two Clouds moving at the speed of Light,2003-12-09,35.921667,-81.176666,taylorsville,north carolina,us
30113,2/9/2008 14:08,san carlos&#44 sonora,,,oval,64800.0,18 hours,UFO over San Carlos&#44 Sonora&#44 Mexico. ((...,2008-02-14,27.961788,-111.037102,san carlos&#44 sonora,sonora,mx
31563,3/15/1998 21:30,los osos,ca,us,sphere,60600.0,10 hous,Possible UFO abduction&#44,2011-01-31,35.311111,-120.83139,los osos,california,us
31575,3/15/2000 06:39,winthrop,ma,us,changing,50400.0,14 hours,The strang multicolored object was up in the s...,2000-05-11,42.375,-70.98333,winthrop,massachusetts,us
31580,3/15/2000 16:30,boulder,co,us,diamond,50400.0,14 hrs,Convoy of crafts came from the ground one at a...,2005-06-20,40.015,-105.269997,boulder,colorado,us
42696,5/16/1958 24:00,cincinnati,oh,us,diamond,86400.0,24 hours,5/16/1958/ 1500 hours/ just getting dark&#44mo...,2006-05-15,39.161944,-84.456947,cincinnati,ohio,us
43145,5/19/2002 00:30,camarillo,ca,us,sphere,86400.0,24 hours&#44 30 minutes,Liquid craft steals time,2002-06-12,34.216389,-119.036667,camarillo,california,us


### It seems improbable that a UFO would be visible for more than 12 hours, but I had to pick a cutoff point

In [73]:
df.loc[df.seconds > 43200, 'seconds'] = mean_seconds
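
The cutoff can be given a name so the 12-hour choice is visible in the code. A minimal sketch using `Series.mask`; `MAX_SECONDS` is a hypothetical constant and `mean_seconds` is a stand-in value, not the one computed in the notebook:

```python
import pandas as pd

MAX_SECONDS = 12 * 60 * 60  # 43200 seconds, the 12-hour cutoff

# Illustrative durations in seconds (not taken from the dataset).
seconds = pd.Series([600.0, 43200.0, 54000.0, 86400.0])
mean_seconds = 900.0  # stand-in value for illustration

# mask() replaces values where the condition is True and keeps the rest.
capped = seconds.mask(seconds > MAX_SECONDS, mean_seconds)
```

Because the comparison is strict, a duration of exactly 12 hours is kept.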

In [74]:
df.describe()

Unnamed: 0,seconds,latitude,longitude
count,80332.0,80332.0,80332.0
mean,910.848555,38.124416,-86.772881
std,2197.107913,10.469585,39.697205
min,180.0,-82.862752,-176.658051
25%,180.0,34.134722,-112.073334
50%,180.0,39.411111,-87.90361
75%,600.0,42.788333,-78.754997
max,43200.0,72.7,178.441895


In [75]:
df.to_parquet(temp_dir+'/clean_duration.parquet')

## DateTime
Change 24:00 to 00:00 for midnight so that the strings can be parsed as datetimes (hour 24 is not a valid hour).

In [76]:
df['datetime'] = df['datetime'].str.replace('24:00', '00:00', regex=False)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,10/10/1949 21:00,lackland afb,tx,,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,10/10/1955 17:00,chester,,gb,circle,180.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,10/10/1956 21:00,edna,tx,us,circle,180.0,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,10/10/1960 20:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600.0,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80328,9/9/2013 22:00,boise,id,us,circle,1200.0,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80329,9/9/2013 22:00,napa,ca,us,other,1200.0,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us
80330,9/9/2013 22:20,vienna,va,us,circle,180.0,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


In [77]:
df[df['datetime'].isnull()]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country


In [78]:
df['datetime'] = pd.to_datetime(df['datetime'])
df.dtypes

datetime       datetime64[ns]
city                   object
state                  object
country                object
shape                  object
seconds               float64
hours_min              object
comments               object
date_posted    datetime64[ns]
latitude              float64
longitude             float32
geo_city               object
geo_state              object
geo_country            object
dtype: object

In [79]:
df.to_parquet(temp_dir+'/clean_datetime.parquet')
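
Replacing `24:00` with `00:00` keeps the original calendar date. If a reporter meant 24:00 as the end of that day, the timestamp would belong to the following day instead; the data does not say which reading is right. The hypothetical helper below sketches that alternative with the standard library only:

```python
from datetime import datetime, timedelta

def parse_sighting_time(text):
    """Parse 'M/D/YYYY HH:MM', treating 24:00 as 00:00 on the following day."""
    date_part, time_part = text.split(' ')
    if time_part == '24:00':
        midnight = datetime.strptime(date_part + ' 00:00', '%m/%d/%Y %H:%M')
        return midnight + timedelta(days=1)
    return datetime.strptime(text, '%m/%d/%Y %H:%M')
```

With this reading, `'12/1/2002 24:00'` becomes `2002-12-02 00:00` rather than `2002-12-01 00:00`.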

## UFO Shape

In [80]:
df['shape'].unique()

array(['cylinder', 'light', 'circle', 'sphere', 'disk', 'fireball',
       'unknown', 'oval', 'other', 'cigar', 'rectangle', 'chevron',
       'triangle', 'formation', None, 'delta', 'changing', 'egg',
       'diamond', 'flash', 'teardrop', 'cone', 'cross', 'pyramid',
       'round', 'crescent', 'flare', 'hexagon', 'dome', 'changed'],
      dtype=object)

In [81]:
df[df['shape'].isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
62,1995-10-10 19:45:00,milwaukee,wi,us,,180.0,2 min.,Man on Hwy 43 SW of Milwaukee sees large&#44 ...,1999-11-02,43.038889,-87.906387,milwaukee,wisconsin,us
63,1995-10-10 22:40:00,oakland,ca,us,,180.0,1 minute,Woman repts. bright light in NW sky&#44 sudde...,1999-11-02,37.804444,-122.269722,oakland,california,us
239,2011-10-10 19:30:00,murfeesboro/smyrna,tn,,,2700.0,30-45 minutes,Multi color oblect over Smyrna/Murfreesboro 10...,2011-10-19,35.947474,-86.488365,murfeesboro/smyrna,tennessee,us
285,1986-10-11 20:30:00,alice springs,,au,,180.0,20 seconds,Being of light reported&#44Jesus or another m...,2005-01-19,-23.697479,133.883621,alice springs,unknown,au
293,1995-10-11 18:30:00,new york city,ny,us,,720.0,12 min.,Young man&#44 mother witness watch strange red...,1999-11-02,40.714167,-74.006386,new york,new york,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80128,1999-09-09 22:00:00,mount shasta,ca,us,,18000.0,5 hours,multiple anomalious lights&#44white flashes&#4...,1999-10-02,41.310000,-122.309441,mount shasta,california,us
80155,2002-09-09 19:02:00,moriches bay,ny,,,180.0,30 sec.,Two men report witnessing a peculiar object de...,2002-09-13,40.789394,-72.715630,moriches bay,new york,us
80156,2002-09-09 19:02:00,moriches bay,ny,,,180.0,1 minute&#44 or less.,U. S. Coast Guard (Boston) forwards report of ...,2002-09-13,40.789394,-72.715630,moriches bay,new york,us
80179,2003-09-09 22:00:00,prescott,az,us,,2700.0,45 minutes,Bright &quot;stars&quot; flying in sky in Pres...,2007-08-07,34.540000,-112.467781,prescott,arizona,us


In [82]:
df['shape'].value_counts()

light        16565
triangle      7865
circle        7608
fireball      6208
other         5649
unknown       5584
sphere        5387
disk          5213
oval          3733
formation     2457
cigar         2057
changing      1962
flash         1328
rectangle     1297
cylinder      1283
diamond       1178
chevron        952
egg            759
teardrop       750
cone           316
cross          233
delta            7
round            2
crescent         2
pyramid          1
flare            1
hexagon          1
dome             1
changed          1
Name: shape, dtype: int64

## Fold rare and overlapping shapes into similar shapes

In [83]:
df.loc[df['shape'] == 'changed', 'shape'] = 'changing'
df.loc[df['shape'] == 'delta', 'shape'] = 'triangle'
df.loc[df['shape'] == 'cigar', 'shape'] = 'cylinder'
df.loc[df['shape'] == 'flare', 'shape'] = 'fireball'
df.loc[df['shape'] == 'round', 'shape'] = 'circle'
df.loc[df['shape'] == 'dome', 'shape'] = 'disk'
df.loc[df['shape'] == 'crescent', 'shape'] = 'teardrop'
df.loc[df['shape'] == 'pyramid', 'shape'] = 'other'
df.loc[df['shape'] == 'hexagon', 'shape'] = 'other'
df.loc[df['shape'].isna(), 'shape'] = 'unknown'
df['shape'].value_counts()

light        16565
triangle      7872
circle        7610
unknown       7516
fireball      6209
other         5651
sphere        5387
disk          5214
oval          3733
cylinder      3340
formation     2457
changing      1963
flash         1328
rectangle     1297
diamond       1178
chevron        952
egg            759
teardrop       752
cone           316
cross          233
Name: shape, dtype: int64
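
The same folding can be written as a single lookup table, which keeps all the pairs in one place. A sketch on a made-up series; `SHAPE_MAP` is a hypothetical name, and the pairs are the ones used in the cell above:

```python
import pandas as pd

# Same label pairs as the cell above, collected in one dictionary.
SHAPE_MAP = {
    'changed': 'changing', 'delta': 'triangle', 'cigar': 'cylinder',
    'flare': 'fireball', 'round': 'circle', 'dome': 'disk',
    'crescent': 'teardrop', 'pyramid': 'other', 'hexagon': 'other',
}

# Illustrative shapes, including one missing value.
shapes = pd.Series(['delta', 'light', None, 'cigar', 'hexagon'])

# replace() swaps the listed labels; fillna() handles the missing shapes.
folded = shapes.replace(SHAPE_MAP).fillna('unknown')
```

Labels that are not in the dictionary, such as `light`, pass through unchanged.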

In [84]:
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,180.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,180.0,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80327,2013-09-09 21:15:00,nashville,tn,us,light,600.0,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80328,2013-09-09 22:00:00,boise,id,us,circle,1200.0,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80329,2013-09-09 22:00:00,napa,ca,us,other,1200.0,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us
80330,2013-09-09 22:20:00,vienna,va,us,circle,180.0,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


In [85]:
df.to_parquet(temp_dir+'/clean_shape.parquet')

## Comments

## Remove records where there are no comments

In [86]:
df[df.comments.isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
2940,2004-10-19 20:00:00,grand island,ne,us,light,3600.0,over 1 hr.,,2004-11-02,40.925,-98.341667,grand island,nebraska,us
14317,1996-01-14 17:00:00,chesterfield,va,us,unknown,3600.0,1 hour,,2005-11-03,37.376944,-77.506111,chesterfield,virginia,us
21844,1996-01-23 20:15:00,minot,nd,us,unknown,900.0,15 min.,,2011-02-18,48.2325,-101.29583,minot,north dakota,us
24999,1996-01-07 11:30:00,st. george,ut,us,unknown,180.0,2 sec.,,2005-11-03,37.104167,-113.583336,st. george,utah,us
28764,1996-02-27 22:01:00,saginaw,mi,us,unknown,1440.0,24 min.,,2004-03-02,43.419444,-83.950836,city of saginaw,michigan,us
32337,2004-03-19 12:10:00,atlanta,ga,us,circle,600.0,5-10 min.,,2004-06-18,33.748889,-84.388054,atlanta,georgia,us
36089,2001-04-01 19:00:00,bangalore,,,unknown,180.0,5-10 seconds,,2002-05-14,12.971599,77.594566,bengaluru,karnataka,in
41782,2013-05-01 22:00:00,toledo,oh,us,oval,180.0,2:00,,2014-01-24,41.663889,-83.555275,toledo,ohio,us
46558,2002-06-10 03:30:00,chantilly,va,us,unknown,180.0,two hours,,2002-08-16,38.894167,-77.431389,chantilly,virginia,us
48599,1957-06-15 02:30:00,atlantic ocean,,,unknown,180.0,minutes,,2002-03-19,-14.599413,-28.673147,atlantic ocean,over_water,over_water


In [87]:
df.dropna(subset = ['comments'], inplace = True)
df.reset_index(drop = True, inplace = True)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,180.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,180.0,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80312,2013-09-09 21:15:00,nashville,tn,us,light,600.0,10 minutes,Round from the distance/slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80313,2013-09-09 22:00:00,boise,id,us,circle,1200.0,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80314,2013-09-09 22:00:00,napa,ca,us,other,1200.0,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us
80315,2013-09-09 22:20:00,vienna,va,us,circle,180.0,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


## Text Cleanup

In [88]:
df[df.comments.str.contains('&#44')]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,180.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,180.0,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
8,1966-10-10 20:00:00,pell city,al,us,disk,180.0,3 minutes,Strobe Lighted disk shape object observed clos...,2009-03-19,33.586111,-86.286110,pell city,alabama,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80296,2012-09-09 21:55:00,charleston,sc,us,flash,900.0,15 minutes,Orb of light flashing reds and blues&#44 stati...,2012-09-24,32.776389,-79.931114,charleston,south carolina,us
80307,2013-09-09 21:00:00,aleksandrow,,,light,180.0,15 seconds,Two points of light following one another in a...,2013-09-30,50.465843,22.891813,aleksandrow,lublin voivodeship,pl
80313,2013-09-09 22:00:00,boise,id,us,circle,1200.0,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80314,2013-09-09 22:00:00,napa,ca,us,other,1200.0,hour,Napa UFO&#44,2013-09-30,38.297222,-122.284447,napa,california,us


In [89]:
# remove new line, tab and carriage return
df.comments = df.comments.str.translate(str.maketrans('','', '\n\t\r'))
# replace / with space
df.comments = df.comments.str.replace('/',' ')
# replace numeric character references (e.g. &#44) with the characters they encode
df.comments = df.comments.str.replace('&#9',chr(9))
df.comments = df.comments.str.replace('&#33',chr(33))
df.comments = df.comments.str.replace('&#39',chr(39))
df.comments = df.comments.str.replace('&#44',chr(44))
df.comments = df.comments.str.replace('&#160;',chr(160))
df.comments = df.comments.str.replace('&#161;',chr(161))
df.comments = df.comments.str.replace('&#167;',chr(167))
df.comments = df.comments.str.replace('&#170;',chr(170))
df.comments = df.comments.str.replace('&#176;',chr(176))
df.comments = df.comments.str.replace('&#180;',chr(180))
df.comments = df.comments.str.replace('&#182;',chr(182))
df.comments = df.comments.str.replace('&#186;',chr(186))
df.comments = df.comments.str.replace('&#188;',chr(188))
df.comments = df.comments.str.replace('&#190;',chr(190))
df.comments = df.comments.str.replace('&#8211;',chr(8211))
df.comments = df.comments.str.replace('&#8212;',chr(8212))
df.comments = df.comments.str.replace('&#8216;',chr(8216))
df.comments = df.comments.str.replace('&#8217;',chr(8217))
df.comments = df.comments.str.replace('&#8220;',chr(8220))
df.comments = df.comments.str.replace('&#8221;',chr(8221))
df.comments = df.comments.str.replace('&#8230;',chr(8230))
# convert all text to lowercase
df.comments = df.comments.str.lower()
# remove numbers
df.comments = df.comments.str.translate(str.maketrans('', '', string.digits))
# remove punctuation
df.comments = df.comments.str.translate(str.maketrans('', '', string.punctuation))
# remove extra spaces
df.comments = df.comments.replace({' +':' '},regex=True)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,this event took place in early fall around it ...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,1-2 hrs,lackland afb tx lights racing across the sky ...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,180.0,20 seconds,green orange circular disc over chester england,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,180.0,1/2 hour,my older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,as a marine st lt flying an fjb fighter attack...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80312,2013-09-09 21:15:00,nashville,tn,us,light,600.0,10 minutes,round from the distance slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80313,2013-09-09 22:00:00,boise,id,us,circle,1200.0,20 minutes,boise id spherical min red lights seen by husb...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80314,2013-09-09 22:00:00,napa,ca,us,other,1200.0,hour,napa ufo,2013-09-30,38.297222,-122.284447,napa,california,us
80315,2013-09-09 22:20:00,vienna,va,us,circle,180.0,5 seconds,saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


In [90]:
df[df.comments.str.contains('&#')]

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
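
The entity-by-entity replacements above could also be handled by Python's standard library. The following is a minimal sketch, assuming the comments contain only standard HTML character references; the sample strings are hypothetical and are not taken from the dataset.

```python
import html

# Hypothetical sample comments (illustration only, not from the dataset).
samples = [
    "bright light &#8211; moving fast",
    "witness said &#8220;it hovered&#8221; for 1&#189; minutes",
    "temperature was 70&#186;",
]

# html.unescape decodes numeric references such as &#8211;
# as well as named ones such as &amp;
decoded = [html.unescape(text) for text in samples]

for text in decoded:
    print(text)
```

On the DataFrame this would presumably be `df.comments = df.comments.apply(html.unescape)`, provided no comment is missing (NaN). Note that `html.unescape` follows the HTML5 rules, so a few references (for instance codes in the 128-159 range) decode differently from a plain `chr()` call.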


## Remove Blank Comments

In [91]:
df[df.comments == '']

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
22437,2013-12-04 19:00:00,tucson,az,us,fireball,180.0,3 minutes,,2013-12-23,32.221667,-110.925835,tucson,arizona,us
45442,2014-05-03 00:00:00,milford,ct,us,circle,900.0,15 minutes,,2014-05-08,41.222222,-73.056946,milford,connecticut,us
52663,1966-06-30 21:00:00,blocksburg,ca,us,disk,600.0,10 min,,2009-03-19,40.276111,-123.635277,blocksburg,california,us


In [92]:
df[df.comments == ' ']

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
60768,2002-07-30 01:25:00,mainville,oh,,light,180.0,4-5 seconds,,2002-08-16,39.315059,-84.220772,mainville,ohio,us


In [93]:
df.drop(df[df.comments == ''].index, inplace = True)
df.drop(df[df.comments == ' '].index, inplace = True)
df.reset_index(drop = True, inplace = True)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,hours_min,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,this event took place in early fall around it ...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,1-2 hrs,lackland afb tx lights racing across the sky ...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,180.0,20 seconds,green orange circular disc over chester england,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,180.0,1/2 hour,my older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,as a marine st lt flying an fjb fighter attack...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80308,2013-09-09 21:15:00,nashville,tn,us,light,600.0,10 minutes,round from the distance slowly changing colors...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80309,2013-09-09 22:00:00,boise,id,us,circle,1200.0,20 minutes,boise id spherical min red lights seen by husb...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80310,2013-09-09 22:00:00,napa,ca,us,other,1200.0,hour,napa ufo,2013-09-30,38.297222,-122.284447,napa,california,us
80311,2013-09-09 22:20:00,vienna,va,us,circle,180.0,5 seconds,saw a five gold lit cicular craft moving fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


## Load spaCy

In [94]:
!python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 4.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


## Lemmatization

Lemmatization returns the root (dictionary) form of a word. It strips inflection, such as verb tense or plural endings, while keeping the meaning of the word the same.

Examples:

- better -> good
- walking -> walk
- was -> be
- mice -> mouse
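
The mapping in the examples above can be illustrated without a language model by using a toy lookup table. This is only a sketch: spaCy's lemmatizer relies on part-of-speech information and rules rather than a four-entry dictionary, and `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names introduced here.

```python
# Toy lookup table built from the four examples above (hypothetical, illustration only).
LEMMA_TABLE = {
    "better": "good",
    "walking": "walk",
    "was": "be",
    "mice": "mouse",
}


def toy_lemmatize(text):
    """Replace each whitespace-separated word with its lemma when the table knows it."""
    return " ".join(LEMMA_TABLE.get(word, word) for word in text.split())


print(toy_lemmatize("the mice was walking"))  # the mouse be walk
```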

In [95]:
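def lemmatize_comments(x):
    """Return the comment with every non-punctuation token replaced by its lemma."""
    doc = nlp(x)
    lemmed_list = []
    for token in doc:
        if not token.is_punct:
            # spaCy 2.x gives pronouns the placeholder lemma '-PRON-';
            # keep the original text in that case (spaCy 3.x no longer does this)
            if token.lemma_ == '-PRON-':
                lemmed_list.append(token.text)
            else:
                lemmed_list.append(token.lemma_)

    return " ".join(lemmed_list)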

In [96]:
lemmed_file = temp_dir+'/lemmatized.parquet'
if exists(lemmed_file):
    df = pd.read_parquet(lemmed_file)
else:
    df.comments = df.comments.apply(lambda x: lemmatize_comments(x))
    df.to_parquet(lemmed_file)
    
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,this event take place in early fall around it ...,2004-04-27,29.883056,-97.941109,san marcos,texas,us
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,lackland afb tx light race across the sky am...,2005-12-16,29.384210,-98.581085,san antonio,texas,us
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,green orange circular disc over chester england,2008-01-21,53.200000,-2.916667,chester,england,gb
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,my old brother and twin sister be leave the on...,2004-01-17,28.978333,-96.645836,edna,texas,us
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,as a marine st lt fly an fjb fighter attack ai...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us
...,...,...,...,...,...,...,...,...,...,...,...,...,...
80308,2013-09-09 21:15:00,nashville,tn,us,light,600.0,round from the distance slowly change color an...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us
80309,2013-09-09 22:00:00,boise,id,us,circle,1200.0,boise i d spherical min red light see by husba...,2013-09-30,43.613611,-116.202499,boise,idaho,us
80310,2013-09-09 22:00:00,napa,ca,us,other,1200.0,napa ufo,2013-09-30,38.297222,-122.284447,napa,california,us
80311,2013-09-09 22:20:00,vienna,va,us,circle,5.0,see a five gold light cicular craft move fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us


In [97]:
# write each comment to a text file, separating comments with a \n
lemm_comments_file = temp_dir+'/lemm_comments.txt'
if not exists(lemm_comments_file):
    with open(lemm_comments_file, 'w') as lem_comments_txt_file:
        for comment in df.comments:
            lem_comments_txt_file.write(comment + '\n')
# Read the comments back, treating each line (one comment) as one sentence.
sentences_unigrams = LineSentence(lemm_comments_file)

## Phrase Modeling

Detect frequently used phrases and combine them.

## Bigrams

A bigram is a two-word phrase. Find pairs of words that frequently occur together and join them with an underscore (for example, take_place).

## Trigrams

A trigram is a three-word phrase. A second pass over the bigram output finds bigrams that frequently pair with another word and joins them as well (for example, a_high_rate).
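
The phrase detection described above can be sketched in plain Python. This is a simplified illustration, loosely modeled on the default scoring used by gensim's `Phrases` (pair count relative to the counts of the individual words); it is not the library's implementation. The function names, the toy sentences, and the `min_count` and `threshold` values are assumptions chosen for the example.

```python
from collections import Counter


def find_bigrams(sentences, min_count=1, threshold=3.0):
    """Return the set of adjacent word pairs whose collocation score exceeds threshold."""
    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
    vocab_size = len(word_counts)
    phrases = set()
    for (a, b), n_ab in pair_counts.items():
        # Pairs seen together often, relative to each word's own frequency, score high.
        score = (n_ab - min_count) * vocab_size / (word_counts[a] * word_counts[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases


def merge_bigrams(sentence, phrases, delimiter="_"):
    """Join adjacent words that form a detected phrase, scanning left to right."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            merged.append(sentence[i] + delimiter + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


# Hypothetical toy corpus: one tokenized comment per list.
sentences = [
    "this event take place in fall".split(),
    "the sighting take place at night".split(),
    "it take place over the lake".split(),
    "light move over the lake at night".split(),
]

phrases = find_bigrams(sentences)
print(merge_bigrams(sentences[0], phrases))
# ['this', 'event', 'take_place', 'in', 'fall']
```

Running the same two functions again on the merged sentences would be the analogue of the trigram pass, since an already joined token such as take_place can then pair with a neighbouring word.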

In [98]:
bigram_model_file = temp_dir+'/bigram_phrase_model'

if not exists(bigram_model_file):
    bigram_phrases = Phrases(sentences_unigrams)
    # Turn the finished Phrases model into a "Phraser" object,
    # which is optimized for speed and memory use
    bigram_phrases = Phraser(bigram_phrases)
    bigram_phrases.save(bigram_model_file)

In [99]:
bigram_phrases = Phraser.load(bigram_model_file)
sentences_bigrams_file = temp_dir+'/sentence_bigram_phrases_all.txt'

if not exists(sentences_bigrams_file):
    with open(sentences_bigrams_file, 'w') as f:

        for sentence_unigrams in sentences_unigrams:
            #print(sentence_unigrams)
            sentence_bigrams = ' '.join(bigram_phrases[sentence_unigrams])

            f.write(sentence_bigrams + '\n')

In [100]:
sentences_bigrams = LineSentence(sentences_bigrams_file)
trigram_model_file = temp_dir+'/trigram_phrase_model'

if not exists(trigram_model_file):
    trigram_phrases = Phrases(sentences_bigrams) 
    # Turn the finished Phrases model into a "Phraser" object,
    # which is optimized for speed and memory use
    trigram_phrases = Phraser(trigram_phrases)
    trigram_phrases.save(trigram_model_file)

In [101]:
trigram_phrases = Phraser.load(trigram_model_file)
sentences_trigrams_file = temp_dir+'/sentence_trigram_phrases_all.txt'

if not exists(sentences_trigrams_file):
    with open(sentences_trigrams_file, 'w') as f:
        
        for sentence_bigrams in sentences_bigrams:
            
            sentence_trigrams = ' '.join(trigram_phrases[sentence_bigrams])
            
            f.write(sentence_trigrams + '\n')   

In [102]:
comments_trigrams_file = temp_dir+'/comments_trigrams_all.txt'

if not exists(comments_trigrams_file):
    # Read in each comment where one line = one sentence.
    comments_lemmatized = LineSentence(lemm_comments_file)

    with open(comments_trigrams_file, 'w') as f:
        
        for comments_unigrams in comments_lemmatized:
                        
            # apply the first-order and second-order phrase models
            comments_bigrams = bigram_phrases[comments_unigrams]
            comments_trigrams = trigram_phrases[comments_bigrams]
            
            # write the transformed comments as a line in the new file
            comments_trigrams = ' '.join(comments_trigrams)
            f.write(comments_trigrams + '\n')
            

In [103]:
trigram_df_file = temp_dir+'/tri_grams.parquet'

if not exists(trigram_df_file):
    with open(comments_trigrams_file) as f:
        # one comment per line; strip the trailing newline
        tri_comments = [line.rstrip('\n') for line in f]

    # build the DataFrame in one step instead of appending row by row
    tri_df = pd.DataFrame({'tri_comments': tri_comments})
    tri_df.to_parquet(trigram_df_file)

else:
    tri_df = pd.read_parquet(trigram_df_file)
    
tri_df

Unnamed: 0,tri_comments
0,this_event take_place in early fall around it ...
1,lackland afb tx light race_across the sky amp ...
2,green orange circular disc over chester england
3,my old brother and twin sister be leave the on...
4,as a marine st lt fly an fjb fighter attack ai...
...,...
80308,round from the distance slowly change_color an...
80309,boise i_d spherical min red light see by husba...
80310,napa ufo
80311,see a five gold light cicular craft move fastl...


In [104]:
# concatenate the comments with trigrams to dataframe
df = pd.concat([df, tri_df],axis = 1)
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country,tri_comments
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,this event take place in early fall around it ...,2004-04-27,29.883056,-97.941109,san marcos,texas,us,this_event take_place in early fall around it ...
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,lackland afb tx light race across the sky am...,2005-12-16,29.384210,-98.581085,san antonio,texas,us,lackland afb tx light race_across the sky amp ...
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,green orange circular disc over chester england,2008-01-21,53.200000,-2.916667,chester,england,gb,green orange circular disc over chester england
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,my old brother and twin sister be leave the on...,2004-01-17,28.978333,-96.645836,edna,texas,us,my old brother and twin sister be leave the on...
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,as a marine st lt fly an fjb fighter attack ai...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us,as a marine st lt fly an fjb fighter attack ai...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80308,2013-09-09 21:15:00,nashville,tn,us,light,600.0,round from the distance slowly change color an...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us,round from the distance slowly change_color an...
80309,2013-09-09 22:00:00,boise,id,us,circle,1200.0,boise i d spherical min red light see by husba...,2013-09-30,43.613611,-116.202499,boise,idaho,us,boise i_d spherical min red light see by husba...
80310,2013-09-09 22:00:00,napa,ca,us,other,1200.0,napa ufo,2013-09-30,38.297222,-122.284447,napa,california,us,napa ufo
80311,2013-09-09 22:20:00,vienna,va,us,circle,5.0,see a five gold light cicular craft move fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us,see a five gold light cicular craft move fastl...


In [105]:
df[df.tri_comments.isna()]

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country,tri_comments


## Remove Stop Words

In [106]:
def remove_stop_words(x):
    doc = nlp(x)
    stopless_list = []
    for token in doc:
        if not token.is_stop:
            stopless_list.append(token.text)
    return " ".join(stopless_list)

In [107]:
df.tri_comments = df.tri_comments.apply(lambda x: remove_stop_words(x))
df

Unnamed: 0,datetime,city,state,country,shape,seconds,comments,date_posted,latitude,longitude,geo_city,geo_state,geo_country,tri_comments
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,this event take place in early fall around it ...,2004-04-27,29.883056,-97.941109,san marcos,texas,us,this_event take_place early fall occur boy sco...
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200.0,lackland afb tx light race across the sky am...,2005-12-16,29.384210,-98.581085,san antonio,texas,us,lackland afb tx light race_across sky amp maki...
2,1955-10-10 17:00:00,chester,,gb,circle,20.0,green orange circular disc over chester england,2008-01-21,53.200000,-2.916667,chester,england,gb,green orange circular disc chester england
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,my old brother and twin sister be leave the on...,2004-01-17,28.978333,-96.645836,edna,texas,us,old brother twin sister leave edna theater pmw...
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,as a marine st lt fly an fjb fighter attack ai...,2004-01-22,21.418056,-157.803604,kaneohe,hawaii,us,marine st lt fly fjb fighter attack aircraft s...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80308,2013-09-09 21:15:00,nashville,tn,us,light,600.0,round from the distance slowly change color an...,2013-09-30,36.165833,-86.784447,nashville-davidson,tennessee,us,round distance slowly change_color hover
80309,2013-09-09 22:00:00,boise,id,us,circle,1200.0,boise i d spherical min red light see by husba...,2013-09-30,43.613611,-116.202499,boise,idaho,us,boise i_d spherical min red light husband wife
80310,2013-09-09 22:00:00,napa,ca,us,other,1200.0,napa ufo,2013-09-30,38.297222,-122.284447,napa,california,us,napa ufo
80311,2013-09-09 22:20:00,vienna,va,us,circle,5.0,see a five gold light cicular craft move fastl...,2013-09-30,38.901111,-77.265556,vienna,virginia,us,gold light cicular craft fastly rght leave


In [108]:
df.to_parquet(temp_dir+'/corpus.parquet')

In [109]:
cv = CountVectorizer()
data_cv = cv.fit_transform(df.tri_comments)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = df.index
data_dtm

Unnamed: 0,a_high_rate,a_zig_zag,aa,aaa,aaaaaaaaauy,aampm,aaron,ab,abadania,aball,...,zore,zs,zthen,ztraacutecelo,zuerich,zukowski,zulu,zvala,zz,zzigzage
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80308,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80309,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80310,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80311,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
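
To make the structure of the document-term matrix concrete, here is a small plain-Python sketch of the same idea: one row per comment, one column per term, each cell holding a count. It is an illustration under simplifying assumptions (whitespace tokenization, a hypothetical three-comment corpus) and does not reproduce `CountVectorizer`'s own tokenization rules.

```python
from collections import Counter

# Hypothetical mini-corpus in the style of the cleaned comments.
docs = ["green orange circular disc", "napa ufo", "green light ufo ufo"]

# Vocabulary: sorted unique terms across all documents (one column per term).
vocab = sorted({term for doc in docs for term in doc.split()})

# One row per document, holding the count of each vocabulary term.
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Most cells are zero even in this tiny case, which is why the full matrix above is so sparse.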


In [110]:
# Save the document-term matrix for later use
data_dtm.to_parquet(temp_dir+'/dtm.parquet')
# Pickle the fitted CountVectorizer for later use
with open(temp_dir+"/cv.pkl", "wb") as cv_file:
    pickle.dump(cv, cv_file)