# Checking column types

Take a look at the UFO dataset's column types using the `.info()` method. Two columns stand out as candidates for conversion: the `seconds` column, which should be numeric, and the `date` column, which is read in as `object` and can be converted to the datetime type. In the output below `seconds` already loads as `float64`, so the cast is a safeguard rather than a fix. Converting both columns will make our feature engineering efforts easier later on.

In [65]:
import pandas as pd
ufo = pd.read_csv("dataset/ufo_sightings_large.csv")
# ufo.head()
ufo.columns

Index(['date', 'city', 'state', 'country', 'type', 'seconds', 'length_of_time',
       'desc', 'recorded', 'lat', 'long'],
      dtype='object')

In [66]:
# Print the DataFrame info
# (info() prints its report and returns None, so print() adds a trailing "None")
print(ufo.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   date            4935 non-null   object 
 1   city            4926 non-null   object 
 2   state           4516 non-null   object 
 3   country         4255 non-null   object 
 4   type            4776 non-null   object 
 5   seconds         4935 non-null   float64
 6   length_of_time  4792 non-null   object 
 7   desc            4932 non-null   object 
 8   recorded        4935 non-null   object 
 9   lat             4935 non-null   object 
 10  long            4935 non-null   float64
dtypes: float64(2), object(9)
memory usage: 424.2+ KB
None


In [67]:


# Change the type of seconds to float
ufo["seconds"] = ufo["seconds"].astype("float")

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

# Check the column types
# (info() returns None, which is why "None" follows the report)
print(ufo.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            4935 non-null   datetime64[ns]
 1   city            4926 non-null   object        
 2   state           4516 non-null   object        
 3   country         4255 non-null   object        
 4   type            4776 non-null   object        
 5   seconds         4935 non-null   float64       
 6   length_of_time  4792 non-null   object        
 7   desc            4932 non-null   object        
 8   recorded        4935 non-null   object        
 9   lat             4935 non-null   object        
 10  long            4935 non-null   float64       
dtypes: datetime64[ns](1), float64(2), object(8)
memory usage: 424.2+ KB
None
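
If a numeric column does arrive as `object` (for example because a few entries contain text), `astype("float")` raises an error on the first value it cannot parse. The following is a minimal sketch of a more forgiving conversion with `pd.to_numeric`. The toy frame and its values are made up for illustration and are not taken from the UFO data.

```python
import pandas as pd

# Toy stand-in for a numeric column that was read in as strings (hypothetical values)
raw = pd.DataFrame({"seconds": ["1200", "30.5", "unknown", None]})

# errors="coerce" turns unparseable entries into NaN instead of raising
raw["seconds"] = pd.to_numeric(raw["seconds"], errors="coerce")

print(raw["seconds"].dtype)
print(raw["seconds"].isna().sum())
```

Coercing silently discards the bad entries, so it is worth counting the resulting missing values before moving on.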


# Dropping missing data

In this exercise, you'll remove some of the rows where certain columns have missing values. You're going to look at the `length_of_time` column, the `state` column, and the `type` column. You'll drop any row that contains a missing value in at least one of these three columns.

In [68]:
# Count the missing values in the length_of_time, state, and type columns, in that order
print(ufo[['length_of_time', 'state', 'type']].isnull().sum())

# Drop rows where length_of_time, state, or type are missing
ufo_no_missing = ufo.dropna(subset=['length_of_time', 'state', 'type'])

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

length_of_time    143
state             419
type              159
dtype: int64
(4283, 11)
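
Note that 4935 - 4283 = 652 rows were dropped, fewer than the 143 + 419 + 159 = 721 missing values counted, because a row that is missing more than one of the three columns is only dropped once. The toy frame below (invented for illustration) sketches how `subset` restricts which columns are checked.

```python
import pandas as pd

# Hypothetical rows; "city" is deliberately left out of the subset
toy = pd.DataFrame({
    "length_of_time": ["2 weeks", None, "5 minutes", "30sec."],
    "state": ["ca", "tx", None, "ny"],
    "type": ["light", "circle", "oval", "disk"],
    "city": [None, "austin", "albany", "buffalo"],
})

cols = ["length_of_time", "state", "type"]

# Rows 1 and 2 each miss one of the subset columns and are dropped;
# row 0 survives because its only missing value is in "city"
kept = toy.dropna(subset=cols)
print(kept.shape)
```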


# Extracting numbers from strings

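The `length_of_time` field in the UFO dataset is a free-text field that usually contains a number. Here, you'll extract the first number from that text using a regular expression and store it in a `minutes` column. The unit is not parsed, so values such as `2 weeks` and `30sec.` yield 2 and 30 even though they are not minutes.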

In [69]:
import re
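def return_minutes(time_string):
    """Return the first run of digits in time_string as an int, or None if there is none."""
    # Search for numbers in time_string (str() keeps missing values from raising)
    num = re.search(r'\d+', str(time_string))
    if num is not None:
        return int(num.group(0))
    return None
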
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

# Take a look at the head of both of the columns
print(ufo[["minutes", "length_of_time"]].head())

   minutes   length_of_time
0      2.0          2 weeks
1     30.0           30sec.
2      NaN              NaN
3      5.0  about 5 minutes
4      2.0                2
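
The same extraction can also be written without a Python-level function by using the pandas string accessor. This is an alternative sketch, not the approach used above; the sample strings are copied from the output shown.

```python
import pandas as pd

lengths = pd.Series(["2 weeks", "30sec.", None, "about 5 minutes", "2"])

# Capture the first run of digits; rows with no match or a missing value become NaN
first_number = lengths.str.extract(r"(\d+)", expand=False).astype(float)
print(first_number)
```

As with `return_minutes`, the unit is ignored: `2 weeks` and `2` both map to 2.0.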


# Identifying features for standardization

In this exercise, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the `seconds` and `minutes` columns, you'll see that the variance of the `seconds` column is extremely high. Because `seconds` and `minutes` are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize only the `seconds` column.

In [70]:
# Check the variance of the seconds and minutes columns
print(ufo[["minutes","seconds"]].var())
ufo[["minutes","seconds"]].info()

minutes    8.425929e+02
seconds    3.156735e+10
dtype: float64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   minutes  4443 non-null   float64
 1   seconds  4935 non-null   float64
dtypes: float64(2)
memory usage: 77.2 KB


In [71]:
import numpy as np

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo["seconds"])

# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())

nan


  result = getattr(ufunc, method)(*inputs, **kwargs)
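
The `nan` variance most likely means that `seconds` contains zeros: `np.log(0)` is `-inf`, and the variance of a column that contains `-inf` is undefined. The truncated line above looks like the tail of the NumPy warning raised by that operation. The sketch below reproduces the effect on made-up values and shows two common ways around it. Neither is what the notebook does; it replaces the `-inf` values with 0 later on.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in seconds, including a zero
seconds = pd.Series([0.0, 30.0, 300.0, 1209600.0])

# log(0) is -inf, which makes the variance undefined (nan)
with np.errstate(divide="ignore"):
    naive = np.log(seconds)
print(naive.var())

# Option A: log1p computes log(1 + x), so zeros map to 0 instead of -inf
log1p_version = np.log1p(seconds)

# Option B: take the log of strictly positive values only; the rest become NaN,
# which pandas skips when computing the variance
masked_version = np.log(seconds.where(seconds > 0))

print(log1p_version.var(), masked_version.var())
```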


# Encoding categorical variables

There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.

In [72]:
# Use pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda val: 1 if val == "us" else 0)

# Print the number of unique type values (unique() counts NaN as a value)
print(len(ufo["type"].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo["type"])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

22
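
Both encodings treat missing values in a particular way, which matters here because `country` and `type` each contain nulls. The toy frame below (invented for illustration) is a small sketch of that behavior. The `dtype=int` argument is used because recent pandas versions return boolean indicator columns by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["us", "ca", None, "us"],
    "type": ["light", "circle", "light", None],
})

# Binary encoding: anything that is not "us", including a missing value, becomes 0
toy["country_enc"] = (toy["country"] == "us").astype(int)

# One-hot encoding: one indicator column per category; a missing type gets all zeros
type_set = pd.get_dummies(toy["type"], dtype=int)

encoded = pd.concat([toy, type_set], axis=1)
print(encoded)
```

If missing countries should stay missing rather than count as 0, they would have to be dropped or handled before encoding.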


# Features from dates

Another feature engineering task to perform is month and year extraction. Perform this task on the `date` column of the `ufo` dataset.

In [73]:
# Look at the first 5 rows of the date column
print(ufo["date"].head())

# Extract the month from the date column
ufo["month"] = ufo["date"].dt.month

# Extract the year from the date column
ufo["year"] = ufo["date"].dt.year

# Take a look at all three columns (pandas truncates the display of long frames)
print(ufo[["date","month","year"]])

0   2011-11-03 19:21:00
1   2004-10-03 19:05:00
2   2009-09-25 21:00:00
3   2002-11-21 05:45:00
4   2010-08-19 12:55:00
Name: date, dtype: datetime64[ns]
                    date  month  year
0    2011-11-03 19:21:00     11  2011
1    2004-10-03 19:05:00     10  2004
2    2009-09-25 21:00:00      9  2009
3    2002-11-21 05:45:00     11  2002
4    2010-08-19 12:55:00      8  2010
...                  ...    ...   ...
4930 2000-07-05 19:30:00      7  2000
4931 2008-03-18 22:00:00      3  2008
4932 2005-06-15 02:30:00      6  2005
4933 1991-11-01 03:00:00     11  1991
4934 2005-12-10 18:00:00     12  2005

[4935 rows x 3 columns]


# Text vectorization

You'll now transform the `desc` column in the UFO dataset into tf-idf vectors, since there's likely something we can learn from this field.

In [74]:
print(ufo["desc"].head())

0    Red blinking objects similar to airplanes or s...
1                 Many fighter jets flying towards UFO
2    Green&#44 red&#44 and blue pulses of light tha...
3    It was a large&#44 triangular shaped flying ob...
4       A white spinning disc in the shape of an oval.
Name: desc, dtype: object


In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Take a look at the head of the desc field
# print(ufo["desc"].head())

# Instantiate the tfidf vectorizer object
vec = TfidfVectorizer()

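# Fit and transform desc using vec
# (astype('U') casts every entry to str, so missing descriptions become the text "nan")
desc_tfidf = vec.fit_transform(ufo["desc"].values.astype('U'))

# Look at the number of rows and columns (shape is available on the sparse matrix directly)
print(desc_tfidf.shape)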

(4935, 6434)
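
The shape reads as one row per description and one column per distinct token in the fitted vocabulary. A small sketch on a made-up corpus shows the same relationship at a size that is easy to inspect:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions, not taken from the dataset
docs = [
    "red light in the sky",
    "bright light over the lake",
    "silent triangle in the sky",
]

vec_demo = TfidfVectorizer()
tfidf_demo = vec_demo.fit_transform(docs)  # sparse matrix, one row per document

# Rows are documents, columns are distinct tokens
print(tfidf_demo.shape)

# vocabulary_ maps each token to its column index
print(sorted(vec_demo.vocabulary_))
```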


# Selecting the ideal dataset

Now to get rid of some of the unnecessary features in the `ufo` dataset. Because the `country` column has been encoded as `country_enc`, you can select it and drop the other columns related to location: `city`, `country`, `lat`, `long`, and `state`.

You've engineered the `month` and `year` columns, so you no longer need the `date` or `recorded` columns. You also log normalized the `seconds` column as `seconds_log`, so you can drop `seconds` and `minutes`.

You can also get rid of the `length_of_time` column, which is unnecessary after extracting minutes, and the raw `desc` text, which is already captured in `desc_tfidf`.

In [76]:
# Make a list of features to drop
to_drop = ['city', 'country', 'lat', 'long', 'state', 'date', 'recorded',
           'seconds', 'minutes', 'desc', 'length_of_time']

# Drop those features
ufo_dropped = ufo.drop(columns=to_drop)

# Let's also filter some words out of the text vector we created
# filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)
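
The commented-out call refers to a `words_to_filter` helper and a `vocab` mapping that are not defined anywhere in this notebook. The sketch below is a hypothetical stand-in, not the original helper: it assumes the goal is to keep, for every document, the column indices of its `top_n` highest-weighted terms. The name `top_term_indices` and the toy corpus are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_term_indices(tfidf_matrix, top_n):
    """Hypothetical helper: column indices of the top_n weighted terms of each row."""
    csr = tfidf_matrix.tocsr()
    keep = set()
    for i in range(csr.shape[0]):
        # Slice out the stored column indices and weights of row i
        start, end = csr.indptr[i], csr.indptr[i + 1]
        cols = csr.indices[start:end]
        weights = csr.data[start:end]
        # Positions of the largest weights, highest first
        order = np.argsort(weights)[::-1][:top_n]
        keep.update(int(c) for c in cols[order])
    return sorted(keep)


docs = [
    "red light in the sky",
    "bright light over the lake",
    "silent triangle in the sky",
]
tfidf = TfidfVectorizer().fit_transform(docs)

filtered_cols = top_term_indices(tfidf, top_n=2)
filtered_tfidf = tfidf[:, filtered_cols]  # keep only the selected columns
print(filtered_tfidf.shape)
```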

# Modeling the UFO dataset, part 1

In this exercise, you're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. The `X` dataset contains the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The `y` labels are the encoded country column, where 1 is "us" and 0 is any other value. As encoded above, 0 also covers rows with a missing country, not only "ca".

In [77]:
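# Replace -inf (produced by taking the log of 0) and NaN with 0
ufo_dropped['seconds_log'] = ufo_dropped['seconds_log'].replace({-np.inf: 0, np.nan: 0})
# Set any remaining value below 0.01, including negative logs, to 0
ufo_dropped['seconds_log'] = ufo_dropped['seconds_log'].where(ufo_dropped['seconds_log'] >= 0.01, 0)
ufo_dropped.head()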


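| | type | seconds_log | country_enc | changing | chevron | cigar | circle | cone | cross | cylinder | ... | light | other | oval | rectangle | sphere | teardrop | triangle | unknown | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | unknown | 14.0058 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 11 | 2011 |
| 1 | circle | 3.401197 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 2004 |
| 2 | cigar | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 2009 |
| 3 | triangle | 5.703782 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 11 | 2002 |
| 4 | oval | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 8 | 2010 |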


In [78]:
# ufo_dropped.columns
X = ufo_dropped.drop(["country_enc","type"], axis=1)
y = ufo_dropped["country_enc"]

In [80]:
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.model_selection import train_test_split
# knn = KNeighborsClassifier()
# # Take a look at the features in the X set of data
# print(X.columns)

# # Split the X and y sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# # Fit knn to the training sets
# knn.fit(X_train, y_train)

# # Print the score of knn on the test sets
# print(knn.score(X_test, y_test))

# Modeling the UFO dataset, part 2

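Finally, you'll build a model using the text vector we created, `desc_tfidf`, using the `filtered_words` list to create a filtered text vector. Let's see if you can predict the type of the sighting based on the text. You'll use a Naive Bayes model for this. Note that `y` as defined in part 1 holds `country_enc`, so to predict the sighting type, `y` needs to be set to the `type` column first.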

In [81]:
# from sklearn.naive_bayes import GaussianNB
# nb= GaussianNB()
# # Use the list of filtered words we created to filter the text vector
# filtered_text = desc_tfidf[:, list(filtered_words)]

# # Split the X and y sets using train_test_split, setting stratify=y 
# X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

# # Fit nb to the training sets
# nb.fit(X_train, y_train)

# # Print the score of nb on the test sets
# nb.score(X_test,y_test)
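
Because this cell is also commented out, here is a minimal runnable sketch of the same idea on a made-up corpus: tf-idf features in, sighting type out, with a Gaussian Naive Bayes classifier. The descriptions and labels are invented, and the corpus is far too small for the score to mean anything.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical descriptions with their sighting types
descriptions = [
    "bright light hovering in the sky", "flashing light moving fast",
    "steady light over the hills", "orange light fading away",
    "dark triangle with three corners", "silent triangle gliding low",
    "huge triangle blocking the stars", "black triangle above the road",
]
types = ["light"] * 4 + ["triangle"] * 4

text_tfidf = TfidfVectorizer().fit_transform(descriptions)

# GaussianNB needs a dense array, hence toarray()
X_train, X_test, y_train, y_test = train_test_split(
    text_tfidf.toarray(), types, stratify=types, test_size=0.5, random_state=42
)

nb_demo = GaussianNB()
nb_demo.fit(X_train, y_train)
accuracy = nb_demo.score(X_test, y_test)
print(accuracy)
```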