Things to do:
* Date/Time
    * verify dates are real (what to do with dates that are strange, e.g., pre-1900?)
    * split date and time (?)
    * convert times to UTC
* duration
    * split number and unit (e.g., seconds, minutes)
    * convert duration to milliseconds
* location
    * pull all non-location information out of location fields (i.e., city, state)
    * add country column
    * validate all 3 columns
* summary information
    * create columns for nouns, adjectives, verbs (?)
    * extract colors (new column)
* factorize
    * colors
    * shapes
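
As a starting point for the Date/Time and duration items above, here is a minimal sketch. It assumes Date_Time strings look like `4/23/21 06:30` and Duration strings look like `9 minutes`, `3-4 seconds`, or `~10-15 minutes`. The helper names `parse_date_time` and `duration_to_ms`, the unit table, and the choice to keep only the first number of a range are all assumptions, not decisions already made in this notebook.

```python
import re
import pandas as pd

# Assumed unit table (milliseconds per unit); extend it as new spellings turn up.
UNIT_MS = {"second": 1_000, "sec": 1_000,
           "minute": 60_000, "min": 60_000,
           "hour": 3_600_000, "hr": 3_600_000}

def duration_to_ms(text):
    """Return the duration in milliseconds, or None when no number and known unit are found."""
    if not isinstance(text, str):
        return None
    # First number in the string, then the first word after it ("3-4 seconds" -> 3, "seconds").
    match = re.search(r"(\d+(?:\.\d+)?)[^a-zA-Z]*([a-zA-Z]+)", text)
    if match is None:
        return None
    unit = match.group(2).lower().rstrip("s")
    if unit not in UNIT_MS:
        return None
    return int(round(float(match.group(1)) * UNIT_MS[unit]))

def parse_date_time(series, latest):
    """Parse Date_Time strings; flag rows that fail to parse or fall after `latest`."""
    parsed = pd.to_datetime(series, format="%m/%d/%y %H:%M", errors="coerce")
    # Two-digit years 00-68 are read as 20xx, so an old sighting can land in the
    # future; comparing against `latest` is one way to catch those rows.
    suspicious = parsed.isna() | (parsed > pd.Timestamp(latest))
    return parsed, suspicious
```

Entries with no number (such as `Seconds`) come back as None and would still need a manual rule. Converting to UTC is not attempted here because it needs a time zone for each location.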
   
References:

https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

https://www.analyticsvidhya.com/blog/2020/12/understanding-text-classification-in-nlp-with-movie-review-example-example/

https://blog.dataiku.com/text-classification-the-first-step-toward-nlp-mastery

https://thinkinfi.com/complete-guide-for-natural-language-processing-in-python/

https://towardsdatascience.com/nlp-in-python-vectorizing-a2b4fc1a339e

In [1]:
import pandas as pd
import numpy as np
import time
import logging
import nltk
import re
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
!pip install contractions
import contractions

!pip install country_list
import country_list

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!




In [2]:
logging.basicConfig(filename='sightings_cleaning.log', format='%(asctime)s - %(message)s', datefmt='%d-%b-%y %H:%M:%S', level=logging.INFO)
logger = logging.getLogger()

In [3]:
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(w) for w in text.split(' ')])

def expand_contractions(text):
    expanded_words = []    
    for word in text.split():
        expanded_words.append(contractions.fix(word))   

    return ' '.join(expanded_words)

def regex_clean(text):
    """
    Applies some pre-processing on the given text.

    Steps :
    - Removing HTML tags
    - Removing punctuation
    - Lowering text
    """
    
    # remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # remove the characters [\], ['] and ["]
    text = re.sub(r"\\", "", text)    
    text = re.sub(r"\'", "", text)    
    text = re.sub(r"\"", "", text)    
    
    # convert text to lowercase
    text = text.strip().lower()
    
    # replace punctuation characters with spaces
    filters='!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    translate_dict = dict((c, " ") for c in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)

    return text

def clean_text(text):
    # Lowercase, expand contractions, drop English stop words, then lemmatize.
    stop = set(stopwords.words('english'))
    cleaned = expand_contractions(text.lower())
    # cleaned = regex_clean(text)
    tokens = word_tokenize(cleaned)
    cleaned = ' '.join([w for w in tokens if w not in stop])
    cleaned = lemmatize_text(cleaned)
    return cleaned

In [4]:
def load_clean_sightings_dataframe():
    file_name = "sightings.pkl"
    sightings = pd.read_pickle(file_name)
    logger.info(f"Data read from {file_name}")

    # Drop rows whose Summary mentions MADAR.
    sightings = sightings[sightings['Summary'].str.contains('MADAR')==False]
    
    # clean_text expects strings, so replace missing values and the
    # "not found" placeholder with empty strings.
    sightings["Detail_Summary"] = sightings["Detail_Summary"].fillna("")
    sightings.loc[sightings["Detail_Summary"] == "Summary detail page not found.", "Detail_Summary"] = ""

    sightings_cleaned = sightings.copy()
    sightings_cleaned['Detail_Summary_nltk'] = sightings_cleaned['Detail_Summary'].apply(clean_text)
    
    # cleaned_file_name is a global that must be defined before this function is called.
    sightings_cleaned.to_pickle(cleaned_file_name)

In [5]:
cleaned_file_name = "sightings_cleaned.pkl"
start_fresh = False

if start_fresh:
    load_clean_sightings_dataframe()

The cleaned data is saved to a file because the initial cleaning runs over more than 95k records and takes some time. The next block reloads the data from that file. During the remaining cleaning, if something gets messed up (as I have done quite a few times), re-running pd.read_pickle() restores the dataframe to the "just processed" state.

In [6]:
sightings_cleaned = pd.read_pickle(cleaned_file_name)
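
The checkpoint pattern described above could also be wrapped in a small helper, so that the expensive step runs only when the file is missing or a fresh start is requested. This is a sketch, not what the notebook currently does: `load_or_build` and its `build` argument are hypothetical names introduced for illustration.

```python
import os
import pandas as pd

def load_or_build(path, build, start_fresh=False):
    """Return the dataframe cached at `path`, rebuilding it with `build()` when needed."""
    if start_fresh or not os.path.exists(path):
        df = build()          # the slow cleaning step
        df.to_pickle(path)    # checkpoint for later runs
        return df
    return pd.read_pickle(path)
```

With this in place, restoring the "just processed" state is a single call with the same path, and `start_fresh=True` forces the cleaning to run again.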

In [7]:
sightings_cleaned.head()

Unnamed: 0,Date_Time,City,State,Shape,Duration,Summary,Posted,Detail_Link,Detail_Summary,Detail_Summary_nltk
0,4/23/21 06:30,Blackshear,GA,Circle,9 minutes,Very strange ((NUFORC Note: Rocket launch f...,4/23/21,http://www.nuforc.org/webreports/162/S162815.html,\nVery strangeI have recorded a video of this ...,strangei recorded video sighting
1,4/23/21 06:00,Mechanicsville,VA,Circle,Seconds,Ball in the sky ((NUFORC Note: Rocket launc...,4/23/21,http://www.nuforc.org/webreports/162/S162814.html,\nBall in the skyObject appears as a white bal...,ball skyobject appears white ball vapor strewi...
2,4/23/21 06:00,Vero Beach,FL,Light,5 minutes,I was driving and saw something strange in the...,4/23/21,http://www.nuforc.org/webreports/162/S162822.html,\nI was driving and saw something strange in t...,driving saw something strange sky pulled car i...
3,4/23/21 05:59,St. Augustine,FL,Light,3 minutes,2 extremely bright lights appeared over east c...,4/23/21,http://www.nuforc.org/webreports/162/S162824.html,\n2 extremely bright lights appeared over east...,2 extremely bright light appeared east coast n...
4,4/23/21 05:58,Durham,NC,Cone,>5 minutes,A cone of light coming from the sky unlike any...,4/23/21,http://www.nuforc.org/webreports/162/S162819.html,\nA cone of light coming from the sky unlike a...,cone light coming sky unlike anything ever see...


In [8]:
len(sightings_cleaned)

95690

Do we want to remove the records below that have NUFORC notes? Perhaps someone could go through them to see what the notes say; things like Starlink sightings should be removed. For now, the following block removes them all, because otherwise "nuforc" and "note" dominate the word cloud.

In [9]:
# Detail_Summary_nltk is lowercased during cleaning, so a single case-insensitive
# match on "nuforc" covers the "nuforc note" and uppercase variants as well.
sightings_cleaned = sightings_cleaned[~sightings_cleaned['Detail_Summary_nltk'].str.contains('nuforc', case=False)]
len(sightings_cleaned)

67841
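
To help answer the question of what the NUFORC notes actually say, one option is to tally the note text before deciding which rows to drop. The sketch below assumes the notes follow the `((NUFORC Note: ...))` pattern visible in the Summary column; `tally_nuforc_notes` is a hypothetical helper, not part of the notebook.

```python
import re
import pandas as pd

def tally_nuforc_notes(summaries):
    """Count each distinct '((NUFORC Note: ...))' annotation in a Series of text."""
    notes = summaries.str.extract(
        r"\(\(\s*NUFORC Note:\s*(.*?)\s*\)\)",
        flags=re.IGNORECASE,
        expand=False,
    )
    # Rows without a note come back as NaN and are left out of the tally.
    return notes.dropna().value_counts()
```

Calling `tally_nuforc_notes(sightings_cleaned['Summary'])` before the filter would list the most common notes first, which makes it easier to pick out categories such as Starlink or rocket launches.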

In [10]:
sightings_cleaned

Unnamed: 0,Date_Time,City,State,Shape,Duration,Summary,Posted,Detail_Link,Detail_Summary,Detail_Summary_nltk
0,4/23/21 06:30,Blackshear,GA,Circle,9 minutes,Very strange ((NUFORC Note: Rocket launch f...,4/23/21,http://www.nuforc.org/webreports/162/S162815.html,\nVery strangeI have recorded a video of this ...,strangei recorded video sighting
1,4/23/21 06:00,Mechanicsville,VA,Circle,Seconds,Ball in the sky ((NUFORC Note: Rocket launc...,4/23/21,http://www.nuforc.org/webreports/162/S162814.html,\nBall in the skyObject appears as a white bal...,ball skyobject appears white ball vapor strewi...
2,4/23/21 06:00,Vero Beach,FL,Light,5 minutes,I was driving and saw something strange in the...,4/23/21,http://www.nuforc.org/webreports/162/S162822.html,\nI was driving and saw something strange in t...,driving saw something strange sky pulled car i...
3,4/23/21 05:59,St. Augustine,FL,Light,3 minutes,2 extremely bright lights appeared over east c...,4/23/21,http://www.nuforc.org/webreports/162/S162824.html,\n2 extremely bright lights appeared over east...,2 extremely bright light appeared east coast n...
4,4/23/21 05:58,Durham,NC,Cone,>5 minutes,A cone of light coming from the sky unlike any...,4/23/21,http://www.nuforc.org/webreports/162/S162819.html,\nA cone of light coming from the sky unlike a...,cone light coming sky unlike anything ever see...
5,4/23/21 05:55,I-16 south,GA,Sphere,10 minutes,Noticed a intense light that was covering a la...,4/23/21,http://www.nuforc.org/webreports/162/S162823.html,\nDriving on I-16 south and noticed a intense ...,driving i-16 south noticed intense light cover...
6,4/23/21 05:54,Parrish,FL,Light,5 minutes,Two bright lights one flashing with a descendi...,4/23/21,http://www.nuforc.org/webreports/162/S162820.html,\nTwo bright lights one flashing with a descen...,two bright light one flashing descending expan...
7,4/23/21 05:45,Champions Gate,FL,Light,~10-15 minutes,Im former military and have never seen aircraf...,4/23/21,http://www.nuforc.org/webreports/162/S162826.html,\nIm former military and have never seen aircr...,I former military never seen aircraft that.inc...
9,4/23/21 02:40,Firestone,CO,Chevron,3-4 seconds,"I witnessed a chevron-shaped object, silent an...",4/23/21,http://www.nuforc.org/webreports/162/S162827.html,"\nI witnessed a chevron-shaped object, silent ...","witnessed chevron-shaped object , silent seven..."
10,4/22/21 22:23,New York City (Brooklyn),NY,Fireball,2 minutes,Saw a steady pulsating fireball above that mov...,4/23/21,http://www.nuforc.org/webreports/162/S162818.html,\nSaw a steady pulsating fireball above that m...,saw steady pulsating fireball moved slowly awa...


The next section cleans up the city, state, and country columns. To begin, we take the cities that have parentheses in them and split out the parenthesized part. Some of the parentheses contain a country or state, but many contain nothing useful. Once cleaned, the result needs to be merged back in, and then the general task of cleaning the remaining location data can go forward.

In [11]:
sightings_cleaned[sightings_cleaned['City'].str.contains(r'\(')].City

10                    New York City (Brooklyn)
11                           Firozabad (India)
24                    New York City (Brooklyn)
33                            Nanaimo (Canada)
95                     Merseyside (UK/England)
148                      Melbourne (Australia)
152                          Winnipeg (Canada)
203                         Gilching (Germany)
205                        Chilliwack (Canada)
215                          Langford (Canada)
223                     Vancouver Bc  (Canada)
306                           Oakbank (Canada)
309                         Rugby (UK/England)
344                      Tamworth (UK/England)
349                      New York City (Bronx)
365                           Calgary (Canada)
382                     Northwich (UK/England)
388                    Rosarito)(Baja)(Mexico)
398                         Monterrey (Mexico)
436                           Wiarton (Canada)
446                 Littlehampton (UK/England)
512          

In [12]:
# df is a temporary dataframe for cleaning cities that contain parentheses.
# Eventually it could be merged into the main dataframe, or this code could be applied
# to that dataframe directly once we are confident it works.
df = sightings_cleaned[sightings_cleaned['City'].str.contains(r'\(')].City.str.split(r"\(([^)]+)", expand=True)
df.columns = ["City", "Country", "EndParenth", "Empty1", "Empty2", "Empty3", "Empty4"]
df.drop(["EndParenth"], axis=1, inplace=True)
df["City"] = df["City"].str.strip()

# Copy State over from the main dataframe; the assignment aligns on the shared index.
df["State"] = sightings_cleaned["State"]
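
An alternative to splitting on a regular expression is to pull the name and the parenthesized parts out separately, which avoids the fixed list of seven column names. This is only a sketch under the assumption that City values look like the ones listed above; `split_city` and the `Notes` column are hypothetical names.

```python
import pandas as pd

def split_city(city):
    """Split 'Name (a)(b)' into the name and a list of parenthesized notes."""
    # Everything before the first "(" is the name; strip spaces and stray ")".
    name = city.str.partition("(")[0].str.strip(" )")
    # Every "(...)" group becomes one entry in the list.
    notes = city.str.findall(r"\(([^()]+)\)")
    return pd.DataFrame({"City": name, "Notes": notes})
```

Because each row keeps a list of however many notes it had, entries such as `Rosarito)(Baja)(Mexico)` need no extra columns.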

In [13]:
countries = dict(country_list.countries_for_language('en'))
cities_file = "us_cities_states_counties.csv"
cities_df = pd.read_csv(cities_file, delimiter="|")
city_list = cities_df.City.unique().tolist()
city_list[0:10]

['Holtsville',
 'Adjuntas',
 'Aguada',
 'Aguadilla',
 'Maricao',
 'Anasco',
 'Angeles',
 'Arecibo',
 'Bajadero',
 'Barceloneta']

In [14]:
countries

{'AF': 'Afghanistan',
 'AX': 'Åland Islands',
 'AL': 'Albania',
 'DZ': 'Algeria',
 'AS': 'American Samoa',
 'AD': 'Andorra',
 'AO': 'Angola',
 'AI': 'Anguilla',
 'AQ': 'Antarctica',
 'AG': 'Antigua & Barbuda',
 'AR': 'Argentina',
 'AM': 'Armenia',
 'AW': 'Aruba',
 'AU': 'Australia',
 'AT': 'Austria',
 'AZ': 'Azerbaijan',
 'BS': 'Bahamas',
 'BH': 'Bahrain',
 'BD': 'Bangladesh',
 'BB': 'Barbados',
 'BY': 'Belarus',
 'BE': 'Belgium',
 'BZ': 'Belize',
 'BJ': 'Benin',
 'BM': 'Bermuda',
 'BT': 'Bhutan',
 'BO': 'Bolivia',
 'BA': 'Bosnia & Herzegovina',
 'BW': 'Botswana',
 'BV': 'Bouvet Island',
 'BR': 'Brazil',
 'IO': 'British Indian Ocean Territory',
 'VG': 'British Virgin Islands',
 'BN': 'Brunei',
 'BG': 'Bulgaria',
 'BF': 'Burkina Faso',
 'BI': 'Burundi',
 'KH': 'Cambodia',
 'CM': 'Cameroon',
 'CA': 'Canada',
 'CV': 'Cape Verde',
 'BQ': 'Caribbean Netherlands',
 'KY': 'Cayman Islands',
 'CF': 'Central African Republic',
 'TD': 'Chad',
 'CL': 'Chile',
 'CN': 'China',
 'CX': 'Christmas Isla

In [15]:
# This record is "NA" for the country. Have to fix that or the next few things will throw an error.
# Ask me how I know that.
df.loc[89338, "Country"] = "USA"

In [16]:
# Do we want these as England, Wales, etc.?
df.loc[df['Country'].str.contains('UK'), "Country"] = "United Kingdom"
df.loc[df['Country'].str.contains('Northern Ireland'), "Country"] = "United Kingdom"
df.loc[df['Country'].str.contains('UK'), 'Country'].unique()

array([], dtype=object)

In [17]:
# If the first parenthesized value is not a recognized country, fall back to the second one.
not_country = ~df['Country'].isin(countries.values())
df.loc[not_country & df['Empty1'].notnull(), "Country"] = df["Empty1"]

# Map boroughs, regions, and alternate spellings to a standard country name.
country_fixes = {
    'Brooklyn': "United States",
    'Bronx': "United States",
    'Brookline': "United States",
    'Westchester County': "United States",
    'Baja': "Mexico",
    'Manhattan': "United States",
    'Queens': "United States",
    'Watts': "United States",
    'USA': "United States",
    'Calgary': "Canada",
    'Wilhelmsburg': "Germany",
    'Czech Republic': "Czechia",
    'Punjab': "India",
    'West Germany': "Germany",
    'Brasil': "Brazil",
    'Macedonia': "North Macedonia",
    'México': "Mexico",
}
for pattern, country in country_fixes.items():
    df.loc[df['Country'].str.contains(pattern), "Country"] = country

# Fixes keyed on columns other than Country.
df.loc[df['City'].str.contains('Warsaw/Clinton'), "Country"] = "United States"
df.loc[df.Empty1 == "Canada", "Country"] = "Canada"

In [18]:
df[(df['Country'].isin(countries.values())==False)&(df['City'].isin(cities_df.City))].Empty1.unique()

array(['in-flight', None, '"Abiquiu"', 'Northern Ireland', 'UK/England',
       'UK/Scotland', 'Victoria', 'pilot report', 'Riverside',
       'Republic of South Africa', 'UK/Wales', 'near', 'in flight'],
      dtype=object)

In [19]:
df[(df['Country'].isin(countries.values())==False)&(df['City'].isin(cities_df.City))].Empty2.unique()

array([')', None, ''], dtype=object)

In [20]:
df[(df['Country'].isin(countries.values())==False)&(df['City'].isin(cities_df.City))].Empty3.unique()

array([None, 'Australia'], dtype=object)

In [21]:
df[(df['Country'].isin(countries.values())==False)&(df['City'].isin(cities_df.City))].Empty4.unique()

array([None, ')'], dtype=object)

In [22]:
# Country not in Country list
# Cities not in the US city list
# cities_df[cities_df.City == "Canada"] ----> Canada is a city in Kentucky
# df[(df['Country'].isin(countries.values())==False)]

# Cities in the US city list
#df[(df['Country'].isin(countries.values())==False)&(df['City'].isin(cities_df.City)), "Country"] = "United States"
# Replace missing values with a space so the str.contains checks below do not fail on None.
empty_cols = ["Empty1", "Empty2", "Empty3", "Empty4"]
df[empty_cols] = df[empty_cols].fillna(" ")
df.loc[df['Empty3'].str.contains('Australia'), "Country"] = "Australia"
df.loc[df['Empty1'].str.contains('Northern Ireland'), "Country"] = "United Kingdom"
df.loc[df['Empty1'].str.contains('UK/England'), "Country"] = "United Kingdom"
df.loc[df['Empty1'].str.contains('UK/Wales'), "Country"] = "United Kingdom"
df.loc[df['Empty1'].str.contains('UK/Scotland'), "Country"] = "United Kingdom"
df.loc[df['Empty1'].str.contains('Republic of South Africa'), "Country"] = "South Africa"

In [23]:
df.drop(["Empty1", "Empty2", "Empty3", "Empty4"], axis=1, inplace=True)

In [24]:
df.loc[(df['Country'].isin(countries.values())==False)&(df['City'].isin(cities_df.City)), "Country"] = "United States"

In [25]:
df.loc[(df['Country'].isin(countries.values())==False)&(df['City'].isin(cities_df.City)==False)]

Unnamed: 0,City,Country,State
585,Thetford/Bradford,between,VT
848,Ras Al Khaimah,UAR,
1101,Urjala,Finland??,AK
1999,St. George,north of,UT
2162,,Unspecified by witness,KY
2363,Las Vegas/Ryolite,between,NV
3274,Lake Kachess,Easbound on I-90,WA
3754,NoLos Angeles,northeast of,CA
4280,Pearl Lake State Park,25 miles north of Steamboat Springs,CO
4445,West Los Angeles,Brentwood,CA


df still needs to be merged back into sightings_cleaned, which is then saved to the file.
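
One way the merge could look is sketched below. It assumes df keeps the same index labels as the rows it was taken from in sightings_cleaned, and that a new Country column is wanted on the main dataframe; `merge_locations` is a hypothetical helper, and this has not been run against the real data.

```python
import pandas as pd

def merge_locations(sightings, fixed):
    """Return a copy of `sightings` with City and Country from `fixed` written back by index."""
    merged = sightings.copy()
    if "Country" not in merged.columns:
        merged["Country"] = None  # rows not covered by `fixed` stay empty for now
    # Values are assigned positionally; the row order comes from fixed.index.
    merged.loc[fixed.index, ["City", "Country"]] = fixed[["City", "Country"]].to_numpy()
    return merged
```

Returning a copy leaves the original dataframe untouched, so the result can be inspected before it replaces sightings_cleaned and is written to the pickle file.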

In [28]:
sightings_cleaned.to_pickle(cleaned_file_name)

In [27]:
sightings_cleaned.head()

Unnamed: 0,Date_Time,City,State,Shape,Duration,Summary,Posted,Detail_Link,Detail_Summary,Detail_Summary_nltk
0,4/23/21 06:30,Blackshear,GA,Circle,9 minutes,Very strange ((NUFORC Note: Rocket launch f...,4/23/21,http://www.nuforc.org/webreports/162/S162815.html,\nVery strangeI have recorded a video of this ...,strangei recorded video sighting
1,4/23/21 06:00,Mechanicsville,VA,Circle,Seconds,Ball in the sky ((NUFORC Note: Rocket launc...,4/23/21,http://www.nuforc.org/webreports/162/S162814.html,\nBall in the skyObject appears as a white bal...,ball skyobject appears white ball vapor strewi...
2,4/23/21 06:00,Vero Beach,FL,Light,5 minutes,I was driving and saw something strange in the...,4/23/21,http://www.nuforc.org/webreports/162/S162822.html,\nI was driving and saw something strange in t...,driving saw something strange sky pulled car i...
3,4/23/21 05:59,St. Augustine,FL,Light,3 minutes,2 extremely bright lights appeared over east c...,4/23/21,http://www.nuforc.org/webreports/162/S162824.html,\n2 extremely bright lights appeared over east...,2 extremely bright light appeared east coast n...
4,4/23/21 05:58,Durham,NC,Cone,>5 minutes,A cone of light coming from the sky unlike any...,4/23/21,http://www.nuforc.org/webreports/162/S162819.html,\nA cone of light coming from the sky unlike a...,cone light coming sky unlike anything ever see...
