### This notebook covers the second step in the Data Science Method workflow after problem identification: data wrangling. It consists of collecting, defining and cleaning the data for further exploratory analysis, so that we end up with a clean, uniform dataset with proper data types

# Project Title - UFO Sightings Analysis

### The idea behind this project is to analyze and mine the documented data on UFO sightings, find interesting trends, and predict some behavior in terms of shape, size, color, etc. Most of the analysis is based on visualizing maps to understand the geography of UFO sightings

In [1]:
## ---- Importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import date, timedelta, datetime
import pycountry
import re
import reverse_geocoder as rg


In [2]:
## Importing the CSV file

ufo_data = pd.read_csv("UFO.csv", low_memory = False, na_values = ['UNKNOWN','UNK'], na_filter = True, skip_blank_lines = True)

### Checking basic attributes of the data

In [3]:
# Shape of Data
ufo_data.shape

(88824, 11)

In [4]:
ufo_data.dtypes

datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)       object
duration (hours/min)     object
comments                 object
date posted              object
latitude                 object
longitude               float64
dtype: object

In [5]:
ufo_data.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration (seconds)',
       'duration (hours/min)', 'comments', 'date posted', 'latitude',
       'longitude'],
      dtype='object')

In [6]:
ufo_data.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611


In [7]:
ufo_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88824 entries, 0 to 88823
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              88824 non-null  object 
 1   city                  88679 non-null  object 
 2   state                 81352 non-null  object 
 3   country               76314 non-null  object 
 4   shape                 85865 non-null  object 
 5   duration (seconds)    88822 non-null  object 
 6   duration (hours/min)  85755 non-null  object 
 7   comments              88789 non-null  object 
 8   date posted           88824 non-null  object 
 9   latitude              88824 non-null  object 
 10  longitude             88824 non-null  float64
dtypes: float64(1), object(10)
memory usage: 7.5+ MB


In [8]:
ufo_1 = ufo_data.copy()

In [9]:
# Dropping some columns
ufo_1 = ufo_1.drop(['duration (seconds)','country', 'state', 'city'], axis = 1)

In [10]:
ufo_1.columns

Index(['datetime', 'shape', 'duration (hours/min)', 'comments', 'date posted',
       'latitude', 'longitude'],
      dtype='object')

In [11]:
## Renaming some columns
ufo_1 = ufo_1.rename({'datetime':"Date_time", "shape":"Shape", "duration (hours/min)":"Duration_minutes", "comments":"Description","date posted" : "Date_posted", "latitude":"Lat", "longitude":"Long"}, axis = 1)

In [12]:
ufo_1.columns

Index(['Date_time', 'Shape', 'Duration_minutes', 'Description', 'Date_posted',
       'Lat', 'Long'],
      dtype='object')

In [13]:
# checking Null values

ufo_1.isnull().sum()

Date_time              0
Shape               2959
Duration_minutes    3069
Description           35
Date_posted            0
Lat                    0
Long                   0
dtype: int64

#### The Shape, Duration_minutes and Description columns have some null values

### Checking and converting data types

In [14]:
ufo_1.dtypes

Date_time            object
Shape                object
Duration_minutes     object
Description          object
Date_posted          object
Lat                  object
Long                float64
dtype: object

##### Converting both date columns to datetime

In [15]:
ufo_1['Date_time'] = ufo_1['Date_time'].map({t:pd.to_datetime(t,errors="coerce") for t in ufo_1.Date_time.unique()})

In [16]:
ufo_1['Date_posted'] = ufo_1['Date_posted'].map({t:pd.to_datetime(t,errors="coerce") for t in ufo_1.Date_posted.unique()})

In [17]:
ufo_1.dtypes

Date_time           datetime64[ns]
Shape                       object
Duration_minutes            object
Description                 object
Date_posted         datetime64[ns]
Lat                         object
Long                       float64
dtype: object
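The two conversion cells above parse each distinct string once and then map the results back onto the column, which avoids re-parsing repeated timestamps. A minimal sketch of the same idea on a toy Series follows; the values are invented for illustration, and `errors="coerce"` turns anything unparseable into `NaT`, which is why a `dropna` on the date column is needed afterwards.

```python
import pandas as pd

# Toy data, invented for illustration; the last entry is deliberately not a date.
raw = pd.Series(["10/10/1949 20:30", "10/10/1949 20:30", "not a date"])

# Parse each distinct string once, then map the results back onto the column.
lookup = {t: pd.to_datetime(t, errors="coerce") for t in raw.unique()}
parsed = raw.map(lookup)

# Count the values that could not be parsed (they are NaT after coercion).
print(parsed.isna().sum())
```

Counting the `NaT` values before dropping them gives a quick check on how many rows the conversion is about to cost.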

In [18]:
# Dropping null values from Datetime column

In [19]:
ufo_1 = ufo_1.dropna(axis = 0, subset = ['Date_time'])

In [20]:
ufo_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 87595 entries, 0 to 88823
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date_time         87595 non-null  datetime64[ns]
 1   Shape             84904 non-null  object        
 2   Duration_minutes  84913 non-null  object        
 3   Description       87565 non-null  object        
 4   Date_posted       87595 non-null  datetime64[ns]
 5   Lat               87595 non-null  object        
 6   Long              87595 non-null  float64       
dtypes: datetime64[ns](2), float64(1), object(4)
memory usage: 5.3+ MB


In [21]:
ufo_a = ufo_1.copy()

In [22]:
# reset_index returns a new DataFrame, so the result has to be assigned back
ufo_a = ufo_a.reset_index(drop = True)
ufo_a

Unnamed: 0,Date_time,Shape,Duration_minutes,Description,Date_posted,Lat,Long
0,1949-10-10 20:30:00,cylinder,45 minutes,This event took place in early fall around 194...,2004-04-27,29.8830556,-97.941111
1,1949-10-10 21:00:00,light,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581082
2,1955-10-10 17:00:00,circle,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667
3,1956-10-10 21:00:00,circle,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.9783333,-96.645833
4,1960-10-10 20:00:00,light,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.4180556,-157.803611
...,...,...,...,...,...,...,...
87590,2013-09-09 22:00:00,other,hour,Napa UFO&#44,2013-09-30,38.2972222,-122.284444
87591,2013-09-09 22:20:00,circle,5 seconds,Saw a five gold lit cicular craft moving fastl...,2013-09-30,38.9011111,-77.265556
87592,2013-09-09 23:00:00,cigar,17 minutes,2 witnesses 2 miles apart&#44 Red &amp; White...,2013-09-30,35.6527778,-97.477778
87593,2013-09-09 23:00:00,diamond,2 nights,On September ninth my wife and i noticed stran...,2013-09-30,34.3769444,-82.695833


In [23]:
# One value in the Lat column is incorrect, so that value is corrected by position

In [24]:
ufo_a.iloc[47931,5] = 33.200088

In [25]:
## Changing Lat column to Float Type

ufo_a['Lat'] = ufo_a['Lat'].astype(float)

In [26]:
ufo_a.dtypes

Date_time           datetime64[ns]
Shape                       object
Duration_minutes            object
Description                 object
Date_posted         datetime64[ns]
Lat                        float64
Long                       float64
dtype: object
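The malformed latitude above is fixed through a hard-coded row position. As a hedged alternative, the offending rows can be located programmatically with `pd.to_numeric(..., errors="coerce")`. The sketch below uses an invented toy column, not the notebook's data.

```python
import pandas as pd

# Toy latitude column (invented values); the third entry has a stray letter.
lat = pd.Series(["29.8830556", "53.2", "33q.200088"])

# Coerce to numbers; anything that cannot be parsed becomes NaN.
lat_numeric = pd.to_numeric(lat, errors="coerce")

# Rows where the original value was present but the conversion failed.
bad = lat[lat_numeric.isna() & lat.notna()]
print(bad)
```

Inspecting `bad` first makes it possible to correct the values by label rather than by a fixed position that may shift when rows are dropped.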

In [27]:
ufo_a.sample(10)

Unnamed: 0,Date_time,Shape,Duration_minutes,Description,Date_posted,Lat,Long
6067,2004-10-30 21:00:00,sphere,1.5-2.0 sec.,While siting in the front yard conversing with...,2004-11-02,31.410833,-96.571944
43241,1998-04-25 21:00:00,formation,5 minnutes,3 circular shaped ufos changing formation more...,1999-01-28,40.425833,-89.779167
66848,2009-07-27 21:20:00,circle,2 minutes,It was moving faster than any other aircraft I...,2009-08-27,43.71,-74.974722
61366,1978-07-01 00:00:00,other,5 minutes,Electric Blue Half dome with possible alient e...,2001-08-05,40.118333,-75.178056
11701,1995-11-19 18:35:00,,3 sec.,MUFON investigator relays rept. of multiple si...,1999-11-02,38.729722,-120.7975
75798,2010-08-21 23:45:00,changing,1 minute,Coming from upstate new york we left here at 5...,2010-08-24,39.008333,-75.578333
33727,1999-03-11 15:00:00,disk,4 minutes,I was looking up because I was carring laungr...,2013-04-12,37.654722,-122.406667
47177,2004-05-15 21:00:00,circle,10 minutes,White orb follows jet aircraft,2004-06-04,39.776389,-74.862778
83318,2004-09-17 21:09:00,triangle,30 sec,Triangle shape craft sighted moving north west...,2004-09-29,29.423889,-98.493333
13331,2007-11-02 20:15:00,other,30 minutes,Object which looked like a lighted jellyfish p...,2007-11-28,37.908611,-121.599167


In [28]:
ufo_a.columns

Index(['Date_time', 'Shape', 'Duration_minutes', 'Description', 'Date_posted',
       'Lat', 'Long'],
      dtype='object')

In [29]:
# Dropping null values from the Duration column
ufo_b = ufo_a.dropna(subset = ['Duration_minutes'])

In [30]:
ufo_b = ufo_b.reset_index(drop = True)

In [31]:
ufo_b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84913 entries, 0 to 84912
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date_time         84913 non-null  datetime64[ns]
 1   Shape             82815 non-null  object        
 2   Duration_minutes  84913 non-null  object        
 3   Description       84898 non-null  object        
 4   Date_posted       84913 non-null  datetime64[ns]
 5   Lat               84913 non-null  float64       
 6   Long              84913 non-null  float64       
dtypes: datetime64[ns](2), float64(2), object(3)
memory usage: 4.5+ MB


### After dropping null from datetime and duration columns we have 84913 rows

### Getting Country, State and City from latitude and longitude

In [32]:
ufo_b['lat_long'] = list(zip(ufo_b.Lat , ufo_b.Long))

In [33]:
ufo_b.head()

Unnamed: 0,Date_time,Shape,Duration_minutes,Description,Date_posted,Lat,Long,lat_long
0,1949-10-10 20:30:00,cylinder,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941111,"(29.8830556, -97.9411111)"
1,1949-10-10 21:00:00,light,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581082,"(29.38421, -98.581082)"
2,1955-10-10 17:00:00,circle,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667,"(53.2, -2.916667)"
3,1956-10-10 21:00:00,circle,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645833,"(28.9783333, -96.6458333)"
4,1960-10-10 20:00:00,light,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803611,"(21.4180556, -157.8036111)"


In [34]:
location = rg.search(list(ufo_b['lat_long']))

Loading formatted geocoded file...


In [35]:
ufo_b['Country'] = [p["cc"] for p in location]

In [36]:
ufo_b['State'] = [p["admin1"] for p in location]

In [37]:
ufo_b['City'] = [p["name"] for p in location]

In [38]:
ufo_b = ufo_b.drop(['Lat','Long'], axis = 1)

In [39]:
ufo_b.sample(10)

Unnamed: 0,Date_time,Shape,Duration_minutes,Description,Date_posted,lat_long,Country,State,City
56960,2013-06-05 05:45:00,unknown,30 seconds,White lights in the morning sky over Lake Onta...,2013-07-03,"(43.15, -79.5)",CA,Ontario,St. Catharines
24467,2012-12-06 19:00:00,fireball,10 minutes,Orange fireballs in the sky.,2012-12-20,"(40.5580556, -90.035)",US,Illinois,Canton
32429,1995-03-01 21:00:00,light,1-2 sec.,Instantaneous flashes of white bars of light. ...,1999-08-30,"(38.8922222, -89.4130556)",US,Illinois,Greenville
41256,2014-04-24 21:35:00,light,60 seconds,This would be the third sighting by me&#44 but...,2014-05-02,"(28.8827778, -81.3088889)",US,Florida,DeBary
16266,2007-11-06 20:30:00,flash,15 seconds,light over san fernando valley,2007-11-28,"(34.1733333, -118.5530556)",US,California,Canoga Park
35669,2012-03-26 21:57:00,triangle,2 minutes,Triangle ship seen,2012-04-18,"(35.2269444, -80.8433333)",US,North Carolina,Charlotte
1636,2002-10-14 22:30:00,light,15-20 min.,UP to 7 silent orange lights seen hovering for...,2002-10-15,"(43.9, -78.866667)",CA,Ontario,Oshawa
42661,1973-04-05 21:30:00,disk,30 seconds,Perfect disk with lights on edge; one side red...,1998-06-18,"(29.7630556, -95.3630556)",US,Texas,Houston
70217,1992-08-15 22:30:00,other,45 minutes,The sky turned blood red along with a loud boo...,1999-11-09,"(29.6236111, -81.8905556)",US,Florida,Interlachen
73240,2003-08-24 20:07:00,circle,3 seconds,Very large circle of white light with a haze f...,2003-08-28,"(0.0, 0.0)",GH,Western,Takoradi


In [40]:
ufo_b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84913 entries, 0 to 84912
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date_time         84913 non-null  datetime64[ns]
 1   Shape             82815 non-null  object        
 2   Duration_minutes  84913 non-null  object        
 3   Description       84898 non-null  object        
 4   Date_posted       84913 non-null  datetime64[ns]
 5   lat_long          84913 non-null  object        
 6   Country           84913 non-null  object        
 7   State             84913 non-null  object        
 8   City              84913 non-null  object        
dtypes: datetime64[ns](2), object(7)
memory usage: 5.8+ MB
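The sample above contains a row with coordinates `(0.0, 0.0)`, which the reverse geocoder resolves to Takoradi, Ghana. Assuming such pairs are placeholders for a missing location rather than real sightings (an assumption, not something the dataset documents), the derived fields could be blanked. A minimal sketch on an invented two-row frame:

```python
import pandas as pd

# Invented two-row frame that mimics the columns created above.
df = pd.DataFrame({
    "lat_long": [(29.8830556, -97.9411111), (0.0, 0.0)],
    "Country": ["US", "GH"],
    "State": ["Texas", "Western"],
    "City": ["San Marcos", "Takoradi"],
})

# Assumption: an exact (0.0, 0.0) pair means the location was never recorded.
is_placeholder = df["lat_long"].apply(lambda p: p == (0.0, 0.0))
df.loc[is_placeholder, ["Country", "State", "City"]] = None

print(df)
```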


### Duration Column Cleaning

In [41]:
# Using the regex package (re) to look for patterns and convert every duration into minutes
# The function looks for a number followed by a unit: (\d+) (\w+)

#--------------------------------------------------------------------

### 1. Replace certain words with a numerical value, such as few by 3.5, several by 7.5, couple by 2, 1/2 by 0.5
### 2. Look for a number of the form digit.digit or digit
### 3. Look for a unit in terms of seconds, minutes, hours

#------------------------------------------------------------------------
# The function takes a string as input and returns the duration in minutes as a float.
# If no number or no unit is found, the string is returned as text (with the word substitutions applied).

#--------------------------------------------------------------------

def duration_clean(string):
    string = string.replace("few", "3.5").replace("1/2","0.5").replace("several","7.5").replace("couple","2").replace("?","").replace("one","1").replace("two","2").replace("five","5").replace("ten","10").replace("three","3").replace('a',"1")
    # raw strings keep the regex escapes intact
    numeric = [r"(\d+\.\d+)", r"(\d+)"]
    unit = ["se[cs]", "secon[ds]", "mi[ns]", "mi[mn]ut[es]", "h[rs]", "hou[rs]"]
    # the first letter of the matched unit selects the conversion factor
    unitConversion = {'s':1/60, 'm':1, 'h':60}
    try:
        m_1 = re.search("|".join(numeric), string).group(0)
        m_2 = re.search("|".join(unit), string.lower()).group(0)[0]
        string = float(m_1)*unitConversion[m_2]
    except AttributeError:
        # re.search returned None (no number or no unit): leave the value as text
        pass
    return string

In [42]:
ufo_b['Duration_minutes'] = ufo_b['Duration_minutes'].apply(duration_clean)

In [43]:
ufo_b.sample(10)

Unnamed: 0,Date_time,Shape,Duration_minutes,Description,Date_posted,lat_long,Country,State,City
15025,2011-11-03 17:03:00,other,5.0,3 fires hovering over the hudson river in a tr...,2011-12-12,"(42.9027778, -73.6877778)",US,New York,Mechanicville
24537,1998-12-07 18:00:00,light,12.0,Bright white light moving swiftly over Anchora...,1999-01-28,"(60.943367, -149.170693)",US,Alaska,Girdwood
68751,2000-08-01 21:00:00,light,1.0,UFO seen over Argos&#44 IN,2003-12-19,"(41.6972222, -86.245)",US,Indiana,Notre Dame
63850,2013-07-27 23:30:00,circle,1.0,Circular&#44 rustic looking craft over Floren...,2013-08-30,"(38.9988889, -84.6266667)",US,Kentucky,Florence
53998,2001-06-22 05:30:00,light,3.0,i saw a light zig zagging in the sky.at one po...,2001-08-05,"(26.9338889, -80.0944444)",US,Florida,Jupiter
60029,1978-07-15 20:30:00,circle,0.166667,Small circular object took out the tops of trees,2006-10-30,"(0.0, 0.0)",GH,Western,Takoradi
18943,2003-12-13 01:30:00,oval,0.033333,I saw it.,2003-12-19,"(33.8702778, -117.9244444)",US,California,Fullerton
56428,2012-06-30 23:50:00,oval,0.5,Kitsap county ufo sighting,2012-07-04,"(47.5675, -122.6313889)",US,Washington,Bremerton
79295,2012-09-15 21:00:00,circle,2.0,Round&#44 bright red lighted objects of 10 or ...,2012-09-24,"(40.6936111, -89.5888889)",US,Illinois,Peoria
7061,2003-10-06 18:30:00,cigar,15.0,Unexplained Objects in a Very Clear Sky &#44 C...,2003-10-15,"(39.055, -78.3683333)",US,Virginia,Strasburg


In [44]:
ufo_b[ufo_b['Duration_minutes'].apply(lambda x: isinstance(x, str))].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6729 entries, 93 to 84912
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date_time         6729 non-null   datetime64[ns]
 1   Shape             6518 non-null   object        
 2   Duration_minutes  6729 non-null   object        
 3   Description       6727 non-null   object        
 4   Date_posted       6729 non-null   datetime64[ns]
 5   lat_long          6729 non-null   object        
 6   Country           6729 non-null   object        
 7   State             6729 non-null   object        
 8   City              6729 non-null   object        
dtypes: datetime64[ns](2), object(7)
memory usage: 525.7+ KB
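The check above shows 6,729 values still stored as text, because `duration_clean` hands back a string whenever it finds no number or no unit. A hedged alternative is to return `NaN` in that case so the column can become numeric. The sketch below is not the notebook's function: `parse_duration_minutes` and its lookup tables are hypothetical names, and the word substitutions are a reduced subset of the ones used above.

```python
import math
import re

# Hypothetical word-to-number table; a reduced subset of the substitutions above.
WORDS = {"few": "3.5", "several": "7.5", "couple": "2", "1/2": "0.5"}
NUMBER = re.compile(r"\d+(?:\.\d+)?")
# A unit must not be preceded by a letter, so "three" does not match "hr".
UNIT = re.compile(r"(?<![a-z])(sec|min|hr|hour)")
TO_MINUTES = {"s": 1 / 60, "m": 1.0, "h": 60.0}

def parse_duration_minutes(text):
    """Return the duration in minutes, or NaN when no number/unit pair is found."""
    text = text.lower()
    for word, value in WORDS.items():
        text = text.replace(word, value)
    number = NUMBER.search(text)
    unit = UNIT.search(text)
    if number is None or unit is None:
        return math.nan
    return float(number.group(0)) * TO_MINUTES[unit.group(1)[0]]

print(parse_duration_minutes("45 minutes"))
print(parse_duration_minutes("1/2 hour"))
print(parse_duration_minutes("2 nights"))
```

With `NaN` for unparseable entries, the result of `apply` is a float column, and the number of failed conversions can be read directly from `isnull().sum()`.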


In [45]:
# Checking null values again

In [46]:
ufo_b.isnull().sum()

Date_time              0
Shape               2098
Duration_minutes       0
Description           15
Date_posted            0
lat_long               0
Country                0
State                  0
City                   0
dtype: int64

In [47]:
## The Shape column has 2098 missing values and the Description column has 15

In [60]:
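# .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
ufo_c = ufo_b.dropna(subset = ['Description']).copy()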

In [62]:
ufo_c.isnull().sum()

Date_time              0
Shape               2091
Duration_minutes       0
Description            0
Date_posted            0
lat_long               0
Country                0
State                  0
City                   0
dtype: int64

### Cleaning the Shape column and checking whether the Description column has information on shape where the Shape value is null

In [63]:
ufo_c['Shape'].unique()

array(['cylinder', 'light', 'circle', 'sphere', 'disk', 'fireball',
       'unknown', 'oval', 'other', 'cigar', 'rectangle', 'chevron',
       'triangle', 'formation', nan, 'delta', 'changing', 'egg', 'flash',
       'diamond', 'cross', 'teardrop', 'cone', 'pyramid', 'round',
       'crescent', 'flare', 'hexagon', 'changed'], dtype=object)

In [64]:
ufo_c['Shape'].value_counts()

light        17420
triangle      8227
circle        8073
fireball      6395
unknown       6040
other         6005
disk          5647
sphere        5621
oval          3959
formation     2604
cigar         2158
changing      2071
flash         1425
rectangle     1368
cylinder      1333
diamond       1252
chevron        990
egg            818
teardrop       787
cone           348
cross          251
delta            7
crescent         2
round            2
flare            1
pyramid          1
changed          1
hexagon          1
Name: Shape, dtype: int64

In [65]:
# Creating a list of all unique shapes mentioned, excluding the placeholders 'unknown' and 'other'

shapes = [i.lower() for i in ufo_c['Shape'].value_counts().index if i not in ['unknown','other']]

In [66]:
print(shapes)

['light', 'triangle', 'circle', 'fireball', 'disk', 'sphere', 'oval', 'formation', 'cigar', 'changing', 'flash', 'rectangle', 'cylinder', 'diamond', 'chevron', 'egg', 'teardrop', 'cone', 'cross', 'delta', 'crescent', 'round', 'flare', 'pyramid', 'changed', 'hexagon']


In [55]:
## Function to check whether words from the Shape column appear in the Description column

#-----------------------------------------------------------------------

# Takes a description string and the shape list, and returns a list of the shapes found (capitalized)

def shape(r, shape_list):
    Desc = r.lower().split()
    shape_count = dict(zip(list(shape_list), [0] * len(shape_list)))
    for word in Desc:
        if word in shape_list:
            shape_count[word] += 1
    shape_count = {k[0].upper()+k[1:]:v for k, v in shape_count.items() if v}
    return list(shape_count.keys())
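Because the function above splits the description on whitespace only, a token such as `triangle,` or `disk.` keeps its punctuation and does not match the shape list. A minimal sketch of a punctuation-tolerant variant follows; `find_shapes` is a hypothetical name, and unlike `shape` it returns lowercase names.

```python
import re

def find_shapes(description, shape_list):
    """Return the shapes from shape_list that occur as whole words, in first-seen order."""
    wanted = set(shape_list)
    found = []
    # [a-z]+ keeps only runs of letters, so trailing commas and periods are dropped
    for token in re.findall(r"[a-z]+", description.lower()):
        if token in wanted and token not in found:
            found.append(token)
    return found

print(find_shapes("Bright triangle, then a disk.", ["triangle", "disk", "oval"]))
# ['triangle', 'disk']
```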


In [67]:
ufo_c['Shape Categories'] = ufo_c['Description'].apply(lambda x: shape(x, shapes))



In [68]:

ufo_c.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84898 entries, 0 to 84912
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date_time         84898 non-null  datetime64[ns]
 1   Shape             82807 non-null  object        
 2   Duration_minutes  84898 non-null  object        
 3   Description       84898 non-null  object        
 4   Date_posted       84898 non-null  datetime64[ns]
 5   lat_long          84898 non-null  object        
 6   Country           84898 non-null  object        
 7   State             84898 non-null  object        
 8   City              84898 non-null  object        
 9   Shape Categories  84898 non-null  object        
dtypes: datetime64[ns](2), object(8)
memory usage: 7.1+ MB


In [69]:
ufo_c.head()

Unnamed: 0,Date_time,Shape,Duration_minutes,Description,Date_posted,lat_long,Country,State,City,Shape Categories
0,1949-10-10 20:30:00,cylinder,45.0,This event took place in early fall around 194...,2004-04-27,"(29.8830556, -97.9411111)",US,Texas,San Marcos,[]
1,1949-10-10 21:00:00,light,60.0,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,"(29.38421, -98.581082)",US,Texas,Lackland Air Force Base,[]
2,1955-10-10 17:00:00,circle,0.333333,Green/Orange circular disc over Chester&#44 En...,2008-01-21,"(53.2, -2.916667)",GB,England,Blacon,[]
3,1956-10-10 21:00:00,circle,30.0,My older brother and twin sister were leaving ...,2004-01-17,"(28.9783333, -96.6458333)",US,Texas,Edna,[]
4,1960-10-10 20:00:00,light,15.0,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,"(21.4180556, -157.8036111)",US,Hawaii,Kane'ohe,[]


In [76]:
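## For rows where Shape is null, 'unknown' or 'other', checking whether the new Shape Categories column has any value,
## and storing the combined result in a Shape_final column
## Note: Shape Categories entries are capitalized while Shape is lowercase, so a shape present in both is listed twice, e.g. [Light, light]

# Named shape_final so the list does not shadow the shape() function defined earlier
shape_final = []
for x, y in zip(ufo_c['Shape'], ufo_c['Shape Categories']):
    if (pd.isnull(x)) and (y != []):
        shape_final.append(y)
    elif (pd.isnull(x)) and (y == []):
        shape_final.append(x)
    elif (x in ['unknown','other']) and (y != []):
        shape_final.append(y)
    elif (x in ['unknown','other']) and (y == []):
        shape_final.append(y + [x])
    elif (x not in ['Unknown','Other']) and (x not in y):
        shape_final.append(y + [x])
    else:
        shape_final.append(y)
ufo_c['Shape_final'] = shape_final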



In [77]:
ufo_c.head()

Unnamed: 0,Date_time,Shape,Duration_minutes,Description,Date_posted,lat_long,Country,State,City,Shape Categories,Shape_final
0,1949-10-10 20:30:00,cylinder,45.0,This event took place in early fall around 194...,2004-04-27,"(29.8830556, -97.9411111)",US,Texas,San Marcos,[],[cylinder]
1,1949-10-10 21:00:00,light,60.0,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,"(29.38421, -98.581082)",US,Texas,Lackland Air Force Base,[],[light]
2,1955-10-10 17:00:00,circle,0.333333,Green/Orange circular disc over Chester&#44 En...,2008-01-21,"(53.2, -2.916667)",GB,England,Blacon,[],[circle]
3,1956-10-10 21:00:00,circle,30.0,My older brother and twin sister were leaving ...,2004-01-17,"(28.9783333, -96.6458333)",US,Texas,Edna,[],[circle]
4,1960-10-10 20:00:00,light,15.0,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,"(21.4180556, -157.8036111)",US,Hawaii,Kane'ohe,[],[light]
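Since the names extracted from the description are capitalized and the reported shape is lowercase, combining them as-is can list the same shape twice. A hedged sketch of a case-normalized merge follows; `merge_shapes` and `PLACEHOLDERS` are hypothetical names, and the function returns `None` instead of `NaN` when nothing is known, which `dropna` also treats as missing.

```python
import math

PLACEHOLDERS = {"unknown", "other"}

def merge_shapes(reported, from_text):
    """Combine the reported shape with shapes found in the text, lowercase and de-duplicated."""
    merged = []
    for name in from_text:
        name = name.lower()
        if name not in merged:
            merged.append(name)
    missing = reported is None or (isinstance(reported, float) and math.isnan(reported))
    if missing:
        # no reported shape: rely on the text, or signal "nothing known" with None
        return merged or None
    reported = reported.lower()
    if reported in PLACEHOLDERS:
        # keep the placeholder only when the text offers nothing better
        return merged or [reported]
    if reported not in merged:
        merged.append(reported)
    return merged

print(merge_shapes("light", ["Light"]))
print(merge_shapes("other", ["Oval", "Round"]))
```

Applied row by row, for example through `zip` over the two columns as in the loop above, this would yield `[light]` where the original logic yields `[Light, light]`.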


### There are 1,506 null values in the Shape_final column, so after dropping those rows the final row count is 83,392

In [96]:
# Dropping null rows from the Shape_final column

ufo_d = ufo_c.dropna(subset = ['Shape_final'])

In [97]:
ufo_d = ufo_d.reset_index(drop = True)

In [99]:
# Dropping Shape and Shape categories columns

ufo_d.drop(['Shape', 'Shape Categories'], axis = 1, inplace = True)

In [101]:
ufo_d.columns

Index(['Date_time', 'Duration_minutes', 'Description', 'Date_posted',
       'lat_long', 'Country', 'State', 'City', 'Shape_final'],
      dtype='object')

In [102]:
ufo_d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83392 entries, 0 to 83391
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date_time         83392 non-null  datetime64[ns]
 1   Duration_minutes  83392 non-null  object        
 2   Description       83392 non-null  object        
 3   Date_posted       83392 non-null  datetime64[ns]
 4   lat_long          83392 non-null  object        
 5   Country           83392 non-null  object        
 6   State             83392 non-null  object        
 7   City              83392 non-null  object        
 8   Shape_final       83392 non-null  object        
dtypes: datetime64[ns](2), object(7)
memory usage: 5.7+ MB


In [103]:
ufo_d.sample(10)

Unnamed: 0,Date_time,Duration_minutes,Description,Date_posted,lat_long,Country,State,City,Shape_final
46388,2011-05-27 14:30:00,240.0,((HOAX??)) The craft made things slow down an...,2011-06-27,"(44.1636111, -93.9991667)",US,Minnesota,Mankato,[changing]
27071,2012-02-01 11:00:00,60.0,My neighbors and I spotted 3 small orange ligh...,2012-02-03,"(38.4327778, -90.3775)",US,Missouri,Arnold,"[Triangle, triangle]"
20344,1999-12-02 17:10:00,0.5,Two lights moving eratically over Lake Washing...,1999-12-16,"(47.6063889, -122.3308333)",US,Washington,Seattle,[light]
18977,1985-12-15 23:00:00,don&#39t know,Stadium sized triangular craft with red blue a...,2004-09-09,"(32.0833333, -81.1)",US,Georgia,Savannah,[triangle]
43335,2007-05-01 21:00:00,30.0,One reddish-orange bright light with others co...,2007-06-12,"(33.6888889, -78.8869444)",US,South Carolina,Myrtle Beach,"[Light, light]"
16426,2009-01-17 21:00:00,10.0,Bright human/diamond shape seen over the Pacif...,2009-03-19,"(32.7152778, -117.1563889)",US,California,San Diego,[flash]
8125,2011-11-10 01:00:00,5.0,Triangular U.F.O. hovers near a main road&#44 ...,2011-12-12,"(27.3361111, -82.5308333)",US,Florida,Sarasota,[triangle]
21325,2004-12-25 06:00:00,1.0,Star? Moving Rapidly Across the sky maintainin...,2005-05-24,"(39.8908333, -75.0733333)",US,New Jersey,Audubon,[unknown]
71519,2008-08-22 19:15:00,40.0,Disappearing daytime star-like object appeared...,2008-10-31,"(30.2419444, -93.2505556)",US,Louisiana,Westlake,[other]
8223,1984-11-11 23:00:00,10.0,My fiance and I&#44 both artist&#44 were getti...,2006-05-15,"(41.3888889, -70.5138889)",US,Massachusetts,Edgartown,[triangle]


## Getting Year and Month from the sighting datetime

In [106]:
ufo_d['Year'] = ufo_d['Date_time'].dt.year

In [107]:
ufo_d.head()

Unnamed: 0,Date_time,Duration_minutes,Description,Date_posted,lat_long,Country,State,City,Shape_final,Year
0,1949-10-10 20:30:00,45.0,This event took place in early fall around 194...,2004-04-27,"(29.8830556, -97.9411111)",US,Texas,San Marcos,[cylinder],1949
1,1949-10-10 21:00:00,60.0,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,"(29.38421, -98.581082)",US,Texas,Lackland Air Force Base,[light],1949
2,1955-10-10 17:00:00,0.333333,Green/Orange circular disc over Chester&#44 En...,2008-01-21,"(53.2, -2.916667)",GB,England,Blacon,[circle],1955
3,1956-10-10 21:00:00,30.0,My older brother and twin sister were leaving ...,2004-01-17,"(28.9783333, -96.6458333)",US,Texas,Edna,[circle],1956
4,1960-10-10 20:00:00,15.0,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,"(21.4180556, -157.8036111)",US,Hawaii,Kane'ohe,[light],1960


In [114]:
ufo_d['Month'] = ufo_d['Date_time'].dt.month_name()

In [116]:
ufo_d.sample(15)

Unnamed: 0,Date_time,Duration_minutes,Description,Date_posted,lat_long,Country,State,City,Shape_final,Year,Month
69591,2005-08-16 22:10:00,0.166667,Flashing light Ascends quickly in the night sky,2005-10-11,"(39.7294444, -104.8313889)",US,Colorado,Aurora,"[Light, light]",2005,August
81583,1999-09-30 21:00:00,5.0,UFO seen - Giant&#44 seven lights&#44 slow&#44...,2007-04-27,"(46.9755556, -123.8144444)",US,Washington,Aberdeen,[triangle],1999,September
72056,2013-08-24 18:04:00,4.0,Just saw a very strange sight. A group of bri...,2013-08-30,"(39.7683333, -86.1580556)",US,Indiana,Indianapolis,[changing],2013,August
24960,2011-12-09 23:32:00,5.0,Large red pulsing light over Sarasota&#44 Flor...,2011-12-12,"(27.3361111, -82.5308333)",US,Florida,Sarasota,"[Light, light]",2011,December
39852,2010-04-02 23:45:00,3.0,Flashing rotating lights,2010-04-13,"(38.8338889, -104.8208333)",US,Colorado,Colorado Springs,[light],2010,April
61406,2009-07-22 22:00:00,7.0,Orange fireball that zigged and zagged across ...,2009-08-27,"(47.723087, -86.940716)",CA,Ontario,Marathon,"[Fireball, fireball]",2009,July
71389,1967-08-22 20:00:00,5.0,i looked up at the stars as it was getting dar...,2003-03-21,"(40.4405556, -79.9961111)",US,Pennsylvania,Pittsburgh,"[Light, circle]",1967,August
61208,2010-07-02 22:50:00,0.5,Orange glowing sphere moving from west to east...,2010-07-06,"(53.533333, -2.616667)",GB,England,Wigan,[Sphere],2010,July
12092,2004-11-20 20:00:00,10.0,Round to oval morphing Characteristics,2004-12-03,"(33.5091667, -111.8983333)",US,Arizona,Scottsdale,"[Oval, Round, changing]",2004,November
26126,2005-01-07 20:51:00,0.166667,Unusual object traveled at rapid speed across ...,2005-01-11,"(25.7738889, -80.1938889)",US,Florida,Miami,[other],2005,January


### Credits to https://github.com/Dascienz/ufo-sightings/blob/master for some of the ideas used here