## Checking column types

Take a look at the UFO dataset's column types using the .info() method. Two columns jump out for transformation: the seconds column, which is a numeric column but is being read in as object, and the date column, which can be transformed into the datetime type. That will make our feature engineering efforts easier later on.

    Call the .info() method on the ufo dataset.
    Convert the type of the seconds column to the float data type.
    Convert the type of the date column to the datetime data type.
    Call .info() on ufo again to see if the changes worked.

In [None]:
# Print the DataFrame info (.info() prints directly and returns None, so no print() is needed)
ufo.info()

# Change the type of seconds to float
ufo["seconds"] = ufo["seconds"].astype(float)

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

# Check the column types (.info() prints directly and returns None)
ufo.info()
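
The conversion above assumes every value in seconds parses cleanly as a number. As a hedged sketch of a more tolerant variant, the following uses pd.to_numeric with errors="coerce" on a small made-up DataFrame (toy and its values are hypothetical stand-ins, not the real ufo data), so unparseable entries become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data standing in for the ufo DataFrame
toy = pd.DataFrame({
    "seconds": ["1200", "30.5", "unknown"],
    "date": ["2011-11-03 19:21:00", "2004-10-05 19:05:00", "2013-09-28 20:00:00"],
})

# astype(float) would raise on "unknown"; errors="coerce" turns it into NaN
toy["seconds"] = pd.to_numeric(toy["seconds"], errors="coerce")

# Parse the date strings into a datetime column
toy["date"] = pd.to_datetime(toy["date"])

print(toy.dtypes)
```

Whether coercing to NaN is preferable to failing loudly depends on how the bad values should be handled downstream.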

## Dropping missing data

In this exercise, you'll remove some of the rows where certain columns have missing values. You're going to look at the length_of_time column, the state column, and the type column. You'll drop any row that contains a missing value in at least one of these three columns.

    Print out the number of missing values in the length_of_time, state, and type columns, in that order, using .isna() and .sum().
    Drop rows that have missing values in at least one of these columns.
    Print out the shape of the new ufo_no_missing dataset.

In [None]:
# Count the missing values in the length_of_time, state, and type columns, in that order
print(ufo[['length_of_time', 'state', 'type']].isna().sum())

# Drop rows where length_of_time, state, or type are missing
ufo_no_missing = ufo.dropna(subset=['length_of_time', 'state', 'type'])

# Print out the shape of the new dataset
print(ufo_no_missing.shape)
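
To see what the subset argument does, here is a minimal sketch on a made-up four-row DataFrame (toy is hypothetical). A row with a missing city survives because city is not in the subset, while rows missing any of the three listed columns are dropped.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value
toy = pd.DataFrame({
    "length_of_time": ["5 minutes", None, "1 hour", "10 min"],
    "state": ["ca", "ny", None, "tx"],
    "type": ["light", "circle", "disk", None],
    "city": [None, "albany", "reno", "austin"],
})

# Count missing values per column of interest
missing_counts = toy[["length_of_time", "state", "type"]].isna().sum()
print(missing_counts)

# Only the listed columns are considered when deciding which rows to drop
toy_no_missing = toy.dropna(subset=["length_of_time", "state", "type"])
print(toy_no_missing.shape)
```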

## Extracting numbers from strings

The length_of_time field in the UFO dataset is a text field that has the number of minutes within the string. Here, you'll extract that number from that text field using regular expressions.

    Search time_string for numbers using an appropriate RegEx pattern.
    Use the .apply() method to call return_minutes() on every row of the length_of_time column.
    Print out the .head() of both the length_of_time and minutes columns to compare.

In [None]:
import re

def return_minutes(time_string):
    # Search for the first run of digits in time_string
    num = re.search(r'[0-9]+', time_string)
    if num is not None:
        return int(num.group(0))
    # If no digits are found, the function falls through and returns None
        
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

# Take a look at the head of both of the columns
print(ufo[['minutes', 'length_of_time']].head())
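
It can help to try the extraction on a few strings before applying it to a whole column. The sketch below repeats return_minutes and runs it on made-up sample strings (the samples are hypothetical). Note that the pattern captures the first run of digits whatever the unit, so "1 hour" yields 1, and a string without digits yields None.

```python
import re

def return_minutes(time_string):
    # Return the first run of digits as an int, or None if there is none
    num = re.search(r"[0-9]+", time_string)
    if num is not None:
        return int(num.group(0))
    return None

# Hypothetical sample strings
samples = ["about 5 minutes", "1 hour", "10-15 min", "a few seconds"]
results = [return_minutes(s) for s in samples]
print(results)
```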

## Identifying features for standardization

In this exercise, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the seconds and minutes columns, you'll see that the variance of the seconds column is extremely high. Because seconds and minutes are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the seconds column.

    Calculate the variance in the seconds and minutes columns and take a close look at the results.
    Perform log normalization on the seconds column, transforming it into a new column named seconds_log.
    Print out the variance of the seconds_log column.

In [None]:
# Check the variance of the seconds and minutes columns
print(ufo[['seconds', 'minutes']].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo['seconds'])

# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())
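
The effect of the log transform on variance is easy to see on a small made-up Series spanning several orders of magnitude (the values below are hypothetical, not taken from the dataset). One caveat: np.log returns -inf for a value of 0, so this sketch assumes all durations are strictly positive.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in seconds, from a couple of seconds up to a full day
seconds = pd.Series([2.0, 60.0, 300.0, 3600.0, 86400.0])

# Natural log compresses the large values far more than the small ones
seconds_log = np.log(seconds)

print(seconds.var())
print(seconds_log.var())
```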

## Encoding categorical variables

There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.

    Using apply(), write a conditional lambda function that returns 1 if the value is "us" and 0 otherwise.
    Print out the number of .unique() values in the type column.
    Using pd.get_dummies(), create a one-hot encoded set of the type column.
    Finally, use pd.concat() to concatenate the type_set encoded variables to the ufo dataset.

In [None]:
# Use pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda a: 1 if a == 'us' else 0)

# Print the number of unique type values
print(len(ufo['type'].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)
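
Both encodings can be checked on a tiny made-up DataFrame (toy is hypothetical). This sketch shows that pd.get_dummies creates one column per distinct type value and that pd.concat with axis=1 appends those columns alongside the originals.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["us", "ca", "us"],
    "type": ["light", "circle", "light"],
})

# Binary encoding: 1 for "us", 0 for anything else
toy["country_enc"] = toy["country"].apply(lambda value: 1 if value == "us" else 0)

# One-hot encoding: one indicator column per distinct type value
type_set = pd.get_dummies(toy["type"])

# Append the indicator columns to the original columns
toy = pd.concat([toy, type_set], axis=1)
print(toy.columns.tolist())
```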

## Features from dates

Another feature engineering task to perform is month and year extraction. Perform this task on the date column of the ufo dataset.

    Print out the .head() of the date column.
    Retrieve the month attribute of the date column.
    Retrieve the year attribute of the date column.
    Take a look at the .head() of the date, month, and year columns.

In [None]:
# Look at the first 5 rows of the date column
print(ufo['date'].head())

# Extract the month from the date column
ufo["month"] = ufo["date"].dt.month

# Extract the year from the date column
ufo["year"] = ufo["date"].dt.year

# Take a look at the head of all three columns
print(ufo[['month','year','date']].head())
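
The .dt accessor only works on a column that already has a datetime type. A minimal sketch on two made-up timestamps (hypothetical values):

```python
import pandas as pd

# Hypothetical timestamps, parsed into a datetime Series
dates = pd.Series(pd.to_datetime(["2011-11-03 19:21:00", "2004-10-05 19:05:00"]))

# .dt exposes datetime components element-wise
months = dates.dt.month
years = dates.dt.year

print(pd.DataFrame({"date": dates, "month": months, "year": years}))
```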

## Text vectorization

You'll now transform the desc column in the UFO dataset into tf-idf vectors, since there's likely something we can learn from this field.

    Print out the .head() of the desc column.
    Instantiate a TfidfVectorizer() object.
    Fit and transform the desc column using vec.
    Print out the .shape of the desc_tfidf vector, to take a look at the number of columns this created.

In [None]:
# Take a look at the head of the desc field
print(ufo['desc'].head())

# Instantiate the tfidf vectorizer object
vec = TfidfVectorizer()

# Fit and transform desc using vec
desc_tfidf = vec.fit_transform(ufo['desc'])

# Look at the number of columns and rows
print(desc_tfidf.shape)
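
To make the shape of the result concrete, here is a small sketch on three made-up descriptions (docs is hypothetical). The output is a sparse matrix with one row per document and one column per distinct token in the fitted vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sighting descriptions
docs = [
    "bright light in the sky",
    "red light moving fast",
    "silent disk in the sky",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# Rows are documents, columns are vocabulary terms
print(tfidf.shape)
print(sorted(vec.vocabulary_))
```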

## Selecting the ideal dataset

Now it's time to get rid of some of the unnecessary features in the ufo dataset. Because the country column has been encoded as country_enc, you can keep it and drop the other columns related to location: city, country, lat, long, and state.

You've engineered the month and year columns, so you no longer need the date or recorded columns. You also log-normalized the seconds column as seconds_log, so you can drop seconds and minutes.

You vectorized desc, so it can be removed. For now you'll keep type.

You can also get rid of the length_of_time column, which is unnecessary after extracting minutes.

    Make a list of all the columns to drop, to_drop.
    Drop these columns from ufo.
    Use the words_to_filter() function you created previously; pass in vocab, vec.vocabulary_, desc_tfidf, and keep the top 4 words as the last parameter.

In [None]:
# Make a list of features to drop
to_drop = ['date','recorded','seconds', 'minutes', 'city', 'country', 'lat', 'long', 'state', 'desc', 'length_of_time']

# Drop those features
ufo_dropped = ufo.drop(columns=to_drop)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)
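
The words_to_filter() function is defined elsewhere and is not shown here. As a rough illustration of the idea only, the sketch below uses a hypothetical, simplified helper, top_word_indices, that collects the column indices of the top_n highest-weighted terms in each row of a tf-idf matrix. Its name, signature, and tie-breaking are assumptions and may differ from the real function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_word_indices(vector, top_n):
    """Hypothetical stand-in: column indices of the top_n terms per row."""
    coo = vector.tocoo()
    per_row = {}
    # Group (weight, column) pairs by row
    for row, col, weight in zip(coo.row, coo.col, coo.data):
        per_row.setdefault(int(row), []).append((float(weight), int(col)))
    keep = set()
    for pairs in per_row.values():
        # Highest weights first; keep the columns of the top_n entries
        pairs.sort(reverse=True)
        keep.update(col for _, col in pairs[:top_n])
    return keep

# Hypothetical sighting descriptions
docs = [
    "bright light in the sky",
    "red light moving fast",
    "silent disk in the sky",
]
tfidf = TfidfVectorizer().fit_transform(docs)

keep = top_word_indices(tfidf, 2)
filtered = tfidf[:, sorted(keep)]
print(filtered.shape)
```

Returning a set of column indices keeps the result directly usable for column selection on the sparse matrix.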

## Modeling the UFO dataset, part 1

In this exercise, you're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. The X dataset contains the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The y labels are the encoded country column, where 1 is "us" and 0 is "ca".

    Print out the .columns of the X set.
    Split the X and y sets, ensuring that the class distribution of the labels is the same in the training and test sets, and using a random_state of 42.
    Fit knn to the training data.
    Print the test set accuracy of the knn model.

In [None]:
# Take a look at the features in the X set of data
print(X.columns)

# Split the X and y sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit knn to the training sets
knn.fit(X_train, y_train)

# Print the score of knn on the test sets
print(knn.score(X_test, y_test))
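
The X, y, and knn objects above are assumed to exist already. The following self-contained sketch uses made-up random features and an imbalanced label array (all hypothetical) to show what stratify=y does: the share of each class is preserved in both splits. The choice of n_neighbors=5 is an assumption, not something the exercise specifies.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: 200 rows, 80% labeled 1 and 20% labeled 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([1] * 160 + [0] * 40)

# stratify=y keeps the 80/20 class ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(y_train.mean(), y_test.mean())
print(knn.score(X_test, y_test))
```

With imbalanced labels like these, accuracy is best read against the share of the majority class rather than against 50%.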

## Modeling the UFO dataset, part 2

Finally, you'll build a model on the text vector you created, desc_tfidf, using the filtered_words list to create a filtered text vector. Let's see if you can predict the type of the sighting based on the text. You'll use a Naive Bayes model for this.

    Filter the desc_tfidf vector by passing a list of filtered_words into the index.
    Split the filtered_text features and y, ensuring an equal class distribution in the training and test sets; use a random_state of 42.
    Use the nb model's .fit() to fit X_train and y_train.
    Print out the .score() of the nb model on the X_test and y_test sets.

In [None]:
# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the X and y sets using train_test_split, setting stratify=y 
X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

# Fit nb to the training sets
nb.fit(X_train, y_train)

# Print the score of nb on the test sets
print(nb.score(X_test, y_test))
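
The nb object is assumed to exist already, and its exact class is not stated. The sketch below assumes it is a GaussianNB, which needs dense input and would explain the .toarray() call. The sparse matrix, labels, and selected column indices are all made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical sparse feature matrix: 40 rows, 6 columns, roughly half zeros
rng = np.random.default_rng(0)
dense = rng.random((40, 6))
dense[dense < 0.5] = 0.0
features = csr_matrix(dense)
labels = np.array([0, 1] * 20)

# Select a subset of columns by index, as with the filtered word indices
selected = features[:, [0, 2, 5]]

# GaussianNB does not accept sparse input, so convert to a dense array first
X_train, X_test, y_train, y_test = train_test_split(
    selected.toarray(), labels, stratify=labels, random_state=42
)

nb = GaussianNB()
nb.fit(X_train, y_train)
score = nb.score(X_test, y_test)
print(score)
```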