# Preprocessing for Machine Learning in Python


## Course Description
This course covers the basics of how and when to perform data preprocessing. This essential step in any machine learning project is when you get your data ready for modeling. Preprocessing comes into play after you import and clean your data and before you fit your machine learning model. You'll learn how to standardize your data so that it's in the right form for your model, create new features to best leverage the information in your dataset, and select the best features to improve your model fit. Finally, you'll get some practice by preprocessing a dataset on UFO sightings so that it's ready for modeling.

## Introduction to Data Preprocessing

In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [2]:
plt.style.use('seaborn')

In [3]:
volunteer = pd.read_csv('datasets/volunteer_opportunities.csv')
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,


In [4]:
volunteer.isnull().sum().sort_values(ascending=False)

NTA                   665
primary_loc           665
BIN                   665
Census Tract          665
Community Council     665
Community Board       665
Longitude             665
Latitude              665
amsl                  665
amsl_unit             665
BBL                   665
is_priority           603
locality               70
category_id            48
category_desc          48
postalcode              6
display_url             0
end_date_date           0
content_id              0
vol_requests            0
event_time              0
title                   0
hits                    0
summary                 0
status                  0
start_date_date         0
recurrence_type         0
last_modified_date      0
created_date            0
org_title               0
org_content_id          0
addresses_count         0
hours                   0
region                  0
opportunity_id          0
dtype: int64

### Missing data - rows
Taking a look at the volunteer dataset again, we want to drop rows where the category_desc column values are missing. We're going to do this using boolean indexing: first we check how many null values the column has, and then we filter the dataset so that we keep only the rows where category_desc is not null.

In [5]:
# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())

# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)

48
(617, 35)
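
The same filter can also be written with dropna. The sketch below uses a small made-up DataFrame (not the real volunteer data) to show that boolean indexing with notnull() and dropna(subset=...) keep the same rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the volunteer data (values are made up for illustration)
toy = pd.DataFrame({
    "title": ["Web designer", "Stop 'N' Swap", "Ice skating", "Food drive"],
    "category_desc": ["Strengthening Communities", "Environment", np.nan, np.nan],
})

# Boolean indexing: keep rows where category_desc is present
by_mask = toy[toy["category_desc"].notnull()]

# Equivalent: drop rows with a missing value in the listed columns only
by_dropna = toy.dropna(subset=["category_desc"])

print(by_mask.shape)
print(by_mask.equals(by_dropna))
```

Passing subset matters here: without it, dropna would remove every row that has a missing value in any column.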


### Exploring data types
Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we'll be working with as we start to do more preprocessing.

Which data types are present in the volunteer dataset?

The dataset volunteer has been provided.
Use the .dtypes attribute to check the datatypes.

In [6]:
volunteer.dtypes

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL       

### Converting a column type
This exercise assumes the hits column has been read in as type object even though it consists of integers. In the dtypes output above, hits is already int64, so here the conversion simply confirms the type. Let's convert that column to type int.

In [7]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer['hits'].astype(int)

# Look at the dtypes of the dataset
print(volunteer.dtypes)

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
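
When a column really is read in as strings, astype(int) fails as soon as one value can't be parsed. As a hedged sketch, the made-up Series below (not taken from the volunteer data) shows how pd.to_numeric with errors="coerce" turns unparseable entries into NaN instead.

```python
import pandas as pd

# Hypothetical column read in as strings, with one unparseable entry
hits = pd.Series(["737", "22", "n/a", "14"], name="hits")

# astype(int) would raise ValueError on "n/a"; to_numeric can coerce it to NaN instead
hits_num = pd.to_numeric(hits, errors="coerce")

print(hits_num.dtype)           # float64, because NaN forces a float column
print(hits_num.isnull().sum())  # number of values that could not be parsed
```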


### Class imbalance
In the volunteer dataset, we're thinking about trying to predict the category_desc variable using the other features in the dataset. First, though, we need to know what the class distribution (and imbalance) is for that label.

Which descriptions occur fewer than 50 times in the volunteer dataset?

The dataset volunteer has been provided.
The column you want to check is category_desc.
Use the value_counts() method to check variable counts.

In [8]:
volunteer['category_desc'].value_counts()

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64
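
To answer the question programmatically, you can filter the counts directly. This sketch rebuilds the counts from the output above as a Series, so it runs without the CSV file.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "Strengthening Communities": 307,
    "Helping Neighbors in Need": 119,
    "Education": 92,
    "Health": 52,
    "Environment": 32,
    "Emergency Preparedness": 15,
})

# Classes that occur fewer than 50 times
rare = counts[counts < 50]
print(rare.index.tolist())

# Share of each class in the labeled rows
print((counts / counts.sum()).round(3))
```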

### Stratified sampling
We know that the distribution of classes in the category_desc column in the volunteer dataset is uneven. If we wanted to train a model to try to predict category_desc, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.

In [9]:
from sklearn.model_selection import train_test_split


In [10]:
volunteer = volunteer[volunteer['category_desc'].notnull()]

In [11]:

# Create a DataFrame with all columns except category_desc
volunteer_X = volunteer.drop('category_desc', axis=1)

# Create a category_desc labels dataset
volunteer_y = volunteer[['category_desc']]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64
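
To see what stratify does in isolation, here is a minimal sketch on made-up labels with an 80/20 class split. The data and the random_state value are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 of class "a", 20 of class "b"
y = pd.Series(["a"] * 80 + ["b"] * 20)
X = pd.DataFrame({"feature": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# With the default 25% test size, both splits keep the 80/20 ratio
print(y_train.value_counts())
print(y_test.value_counts())
```

Without stratify, a small class such as Emergency Preparedness could end up almost entirely in one of the two splits by chance.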


### Modeling without normalizing
Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the wine dataset. One of the columns, Proline, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.

The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (knn) as well as the X and y sets you need to fit and score on.

In [12]:
wine = pd.read_csv('datasets/wine_types.csv')

In [13]:
X = wine.loc[:, ['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]
# Select Type as a 1-D Series; a one-column DataFrame makes knn.fit emit a DataConversionWarning
y = wine['Type']

In [14]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

In [15]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.6444444444444445




### Log normalization in Python
Now that we know that the Proline column in our wine dataset has a large amount of variance, let's log normalize it.

NumPy has been imported as np in your workspace.

In [16]:
# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())

99166.71735542436
0.17231366191842012


### Scaling data - standardizing columns
Since we know that the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.

In [17]:
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale 
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)
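
fit_transform returns a NumPy array, so the column names are lost. The sketch below, on made-up numbers rather than the wine data, wraps the result back into a DataFrame and checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales
df = pd.DataFrame({
    "small": [2.1, 2.4, 2.6, 2.3],
    "large": [95.0, 110.0, 127.0, 101.0],
})

scaled = StandardScaler().fit_transform(df)  # returns a NumPy array

# Wrap the array back into a DataFrame to keep the column names
scaled_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)

print(scaled_df.mean().round(6))
print(scaled_df.std(ddof=0).round(6))  # StandardScaler uses the population std
```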

### KNN on non-scaled data
Let's first take a look at the accuracy of a k-nearest neighbors model on the wine dataset without standardizing the data. The knn model and the X (features) and y (labels) sets have already been created. Most of this process of creating models in scikit-learn should look familiar to you.

In [18]:
wine = pd.read_csv('datasets/wine_types.csv')

X = wine.iloc[:, 1:]
y = wine.iloc[:, 0]

In [19]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.6666666666666666


### KNN on scaled data
The accuracy score on the unscaled wine dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data. Once again, the knn model and the X (features) and y (labels) sets have already been created for you.

In [20]:
from sklearn.preprocessing import StandardScaler

In [21]:
# Create the scaling method.
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=21)

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)

# Score the model on the test data.
print(knn.score(X_test, y_test))

0.9555555555555556
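
The exercise above scales the full dataset before splitting, which lets the test rows influence the scaler's mean and variance. A common alternative, sketched here, is to put the scaler and the model in a pipeline so the scaler is fit on the training rows only. The sketch uses scikit-learn's bundled wine data as a stand-in for wine_types.csv, so its score is not expected to match the number above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled wine data, used here as a stand-in for wine_types.csv
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)
```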


### Encoding categorical variables - binary
Take a look at the hiking dataset. Several columns here need encoding before they can be modeled, one of which is the Accessible column. Accessible is a binary feature with two values, Y or N, so it needs to be encoded into 1s and 0s. Use scikit-learn's LabelEncoder class to do that transformation.

In [22]:
hiking = pd.read_json('datasets/hiking.json')

In [23]:
from sklearn.preprocessing import LabelEncoder

# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

# Compare the two columns
hiking[['Accessible', 'Accessible_enc']].head()


Unnamed: 0,Accessible,Accessible_enc
0,Y,1
1,N,0
2,N,0
3,N,0
4,N,0
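
If you want to confirm which label became which number, the fitted encoder keeps the mapping. This is a small sketch on a made-up Series rather than the hiking data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

accessible = pd.Series(["Y", "N", "N", "Y"])

enc = LabelEncoder()
encoded = enc.fit_transform(accessible)

# Classes are stored in sorted order, so "N" maps to 0 and "Y" maps to 1
print(enc.classes_)
print(encoded)

# inverse_transform recovers the original labels
print(enc.inverse_transform(encoded))
```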


### Encoding categorical variables - one-hot
One of the columns in the volunteer dataset, category_desc, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use Pandas' get_dummies() function to do so.

In [24]:
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer[['category_desc']])

# Take a look at the encoded columns
category_enc.head()


Unnamed: 0,category_desc_Education,category_desc_Emergency Preparedness,category_desc_Environment,category_desc_Health,category_desc_Helping Neighbors in Need,category_desc_Strengthening Communities
1,0,0,0,0,0,1
2,0,0,0,0,0,1
3,0,0,0,0,0,1
4,0,0,1,0,0,0
5,0,0,1,0,0,0
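
One detail worth knowing: recent pandas versions return True/False columns from get_dummies by default, while the output above shows 0s and 1s. The sketch below, on a made-up column, passes dtype=int to ask for integers explicitly.

```python
import pandas as pd

toy = pd.DataFrame({"category_desc": ["Education", "Health", "Education", "Environment"]})

# dtype=int asks for 0/1 columns explicitly; recent pandas versions
# return booleans (True/False) by default
category_enc = pd.get_dummies(toy[["category_desc"]], dtype=int)

print(category_enc.columns.tolist())
print(category_enc)
```

Each row has exactly one 1, in the column for its category.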


### Engineering numerical features - taking an average
A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, you have a DataFrame of running times named running_times_5k. For each name in the dataset, take the mean of their 5 run times.

In [25]:
running_times_5k = pd.read_csv('datasets/runs.csv')

In [26]:
# Create a list of the columns to average
run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']

# Use apply to create a mean column
running_times_5k["mean"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

# Take a look at the results
running_times_5k

Unnamed: 0,name,run1,run2,run3,run4,run5,mean
0,Sue,20.1,18.5,19.6,20.3,18.3,19.36
1,Mark,16.5,17.1,16.9,17.6,17.3,17.08
2,Sean,23.5,25.1,25.2,24.6,23.9,24.46
3,Erin,21.7,21.1,20.9,22.1,22.2,21.6
4,Jenny,25.8,27.1,26.1,26.7,26.9,26.52
5,Russell,30.9,29.6,31.4,30.4,29.9,30.44
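
The apply call above works row by row. As an alternative sketch, the same mean can be computed with a single vectorized call; the two rows below are copied from the table above so the result can be compared.

```python
import pandas as pd

# First two rows of the table above
running_times_5k = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
    "run3": [19.6, 16.9],
    "run4": [20.3, 17.6],
    "run5": [18.3, 17.3],
})
run_columns = ["run1", "run2", "run3", "run4", "run5"]

# Row-wise mean without a Python-level lambda
running_times_5k["mean"] = running_times_5k[run_columns].mean(axis=1)
print(running_times_5k[["name", "mean"]].round(2))
```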


### Engineering numerical features - datetime
There are several columns in the volunteer dataset comprised of datetimes. Let's take a look at the start_date_date column and extract just the month to use as a feature for modeling.

In [27]:
# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer['start_date_date'])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer['start_date_converted'].apply(lambda row: row.month)

# Take a look at the converted and new month columns
volunteer[['start_date_converted', 'start_date_month']].head()


Unnamed: 0,start_date_converted,start_date_month
1,2011-02-01,2
2,2011-01-29,1
3,2011-02-14,2
4,2011-02-05,2
5,2011-02-12,2
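
Once a column has a datetime type, the .dt accessor gives you the same parts without apply. In this sketch the date strings follow the format seen in the dataset preview (for example "February 01 2011"), and the explicit format string is an assumption based on that preview.

```python
import pandas as pd

# Date strings in the same style as start_date_date
dates = pd.Series(["February 01 2011", "January 29 2011", "March 31 2012"])

# An explicit format avoids relying on pandas guessing the layout
converted = pd.to_datetime(dates, format="%B %d %Y")

# The .dt accessor exposes datetime parts for the whole column at once
print(converted.dt.month.tolist())
print(converted.dt.year.tolist())
```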


### Engineering features from strings - extraction
The Length column in the hiking dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.

In [28]:
import re

In [29]:
# Write a pattern to extract numbers and decimals
def return_mileage(length):
    pattern = re.compile(r"\d+\.\d+")
    
    # Search the text for matches
    mile = re.match(pattern, length)
    
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))

hiking['Length'] = hiking['Length'].fillna('').apply(str)    
# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking['Length'].apply(lambda row: return_mileage(row))
hiking[["Length", "Length_num"]].head()

Unnamed: 0,Length,Length_num
0,0.8 miles,0.8
1,1.0 mile,1.0
2,0.75 miles,0.75
3,0.5 miles,0.5
4,0.5 miles,0.5
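
pandas also has a built-in string method for this kind of extraction. The sketch below uses made-up values, including one string with no number and one missing value, to show how str.extract handles them.

```python
import pandas as pd

lengths = pd.Series(["0.8 miles", "1.0 mile", "0.75 miles", "easy loop", None])

# str.extract returns the first capture group per row, NaN where nothing matches
length_num = lengths.str.extract(r"(\d+\.\d+)", expand=False).astype(float)
print(length_num.tolist())
```

Because missing values pass through as NaN, this version does not need the fillna('') step.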


### Engineering features from strings - tf/idf
Let's transform the volunteer dataset's title column into a text vector, to use in a prediction task in the next exercise.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [31]:
# Take the title text
title_text = volunteer['title'].copy()

# Create the vectorizer method
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

### Text classification using tf/idf vectors
Now that we've encoded the volunteer dataset's title column into tf/idf vectors, let's use those vectors to try to predict the category_desc column.

In [32]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

In [33]:
# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

0.4645161290322581


### Selecting relevant features
Now let's identify the redundant columns in the volunteer dataset and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if you explore the volunteer dataset in the console, you'll see three features which are related to location: locality, region, and postalcode. They contain repeated information, so it would make sense to keep only one of the features.

There are also features that have gone through the feature engineering process: columns like Education and Emergency Preparedness are a product of encoding the categorical variable category_desc, so category_desc itself is redundant now.

Take a moment to examine the features of volunteer in the console, and try to identify the redundant features.

In [36]:
# # Create a list of redundant column names to drop
# to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

# # Drop those columns from the dataset
# volunteer_subset = volunteer.drop(to_drop, axis=1)

# # Print out the head of the new dataset
# print(volunteer_subset.head())

### Checking for correlated features
Let's take a look at the wine dataset again, which is made up of continuous, numerical features. Run Pearson's correlation coefficient on the dataset to determine which columns are good candidates for eliminating. Then, remove those columns from the DataFrame.

In [37]:
# Print out the column correlations of the wine dataset
print(wine.corr())

# Take a minute to find the column where the correlation value is greater than 0.75 at least twice
to_drop = "Flavanoids"

# Drop that column from the DataFrame
wine = wine.drop(to_drop, axis=1)

                                  Type   Alcohol  Malic acid       Ash  \
Type                          1.000000 -0.328222    0.437776 -0.049643   
Alcohol                      -0.328222  1.000000    0.094397  0.211545   
Malic acid                    0.437776  0.094397    1.000000  0.164045   
Ash                          -0.049643  0.211545    0.164045  1.000000   
Alcalinity of ash             0.517859 -0.310235    0.288500  0.443367   
Magnesium                    -0.209179  0.270798   -0.054575  0.286587   
Total phenols                -0.719163  0.289101   -0.335167  0.128980   
Flavanoids                   -0.847498  0.236815   -0.411007  0.115077   
Nonflavanoid phenols          0.489109 -0.155929    0.292977  0.186230   
Proanthocyanins              -0.499130  0.136698   -0.220746  0.009652   
Color intensity               0.265668  0.546364    0.248985  0.258887   
Hue                          -0.617369 -0.071747   -0.561296 -0.074667   
OD280/OD315 of diluted wines -0.788230
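
Reading a large correlation table by eye is error-prone, so it can help to count the strong correlations per column. This is a sketch on synthetic data (not the wine dataset), with the 0.75 threshold taken from the comment in the exercise.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is nearly a copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr().abs()

# Count, per column, how many *other* columns exceed the threshold
threshold = 0.75
high = (corr > threshold).sum() - 1  # subtract the self-correlation of 1.0
print(high)
```

Columns with the highest counts are the first candidates to drop.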

### Exploring text vectors, part 1
Let's expand on the text vector exploration method we just learned about, using the volunteer dataset's title tf/idf vectors. In this first part of text vector exploration, we're going to add to that function we learned about in the slides. We'll return a list of numbers with the function. In the next exercise, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our text_tfidf vector.

In [40]:
# # Add in the rest of the parameters
# def return_weights(vocab, original_vocab, vector, vector_index, top_n):
#     zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
#     # Let's transform that zipped dict into a series
#     zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
#     # Let's sort the series to pull out the top n weighted words
#     zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
#     return [original_vocab[i] for i in zipped_index]

# # Print out the weighted words
# print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3))
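
The cell above is commented out and relies on a vocab variable that isn't defined in this notebook. Below is a runnable sketch on three made-up titles. It assumes vocab is the reversed vocabulary_ mapping (column index to word), and it slices the row instead of indexing it, which is a small change from the original function.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["web designer needed", "ice skating volunteers needed", "web web web designer"]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(docs)

# Assumed definition: reverse the word -> index mapping into index -> word
vocab = {index: word for word, index in tfidf_vec.vocabulary_.items()}

def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    # Slice (rather than index) so the row stays a 2-D sparse object
    row = vector[vector_index:vector_index + 1]
    weights = pd.Series({vocab[i]: w for i, w in zip(row.indices, row.data)})

    # Keep the top_n highest-weighted words and map them back to column indices
    top_words = weights.sort_values(ascending=False).head(top_n).index
    return [original_vocab[word] for word in top_words]

# Top word index for the third document
top = return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 2, 1)
print([vocab[i] for i in top])
```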

### Exploring text vectors, part 2
Using the function we wrote in the previous exercise, we're going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.

In [41]:
# def words_to_filter(vocab, original_vocab, vector, top_n):
#     filter_list = []
#     for i in range(0, vector.shape[0]):
    
#         # Here we'll call the function from the previous exercise, and extend the list we're creating
#         filtered = return_weights(vocab, original_vocab, vector, i, top_n)
#         filter_list.extend(filtered)
#     # Return the list in a set, so we don't get duplicate word indices
#     return set(filter_list)

# # Call the function to get the list of word indices
# filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# # By converting filtered_words back to a list, we can use it to filter the columns in the text vector
# filtered_text = text_tfidf[:, list(filtered_words)]

### Training Naive Bayes with feature selection
Let's re-run the Naive Bayes text classification model we ran at the end of chapter 3, with our selection choices from the previous exercise, on the volunteer dataset's title and category_desc columns.

In [42]:
# # Split the dataset according to the class distribution of category_desc
# train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# # Fit the model to the training data
# nb.fit(train_X, train_y)

# # Print out the model's accuracy
# print(nb.score(test_X, test_y))

### Using PCA
Let's apply PCA to the wine dataset, to see if we can get an increase in our model's accuracy.

In [44]:
from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop("Type", axis=1)

# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)

[9.98098798e-01 1.73593305e-03 9.43282757e-05 4.89438533e-05
 1.04695097e-05 5.60981698e-06 2.79968212e-06 1.44536313e-06
 9.75418873e-07 3.94184513e-07 2.13661389e-07 8.91974959e-08]
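
A first component that explains over 99% of the variance usually means one large-scale column dominates. As a hedged sketch, the code below compares PCA on raw and standardized features, using scikit-learn's bundled wine data as a stand-in for wine_types.csv.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled wine data, used as a stand-in for wine_types.csv
X, _ = load_wine(return_X_y=True)

raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

# Unscaled: the first component is dominated by the largest-variance column
print(raw_ratio[0])

# Scaled: variance is spread across components; cumulative share per component
print(np.cumsum(scaled_ratio).round(3))
```

The cumulative sum is a simple way to decide how many components to keep for a given share of variance.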


### Training a model with PCA
Now that we have run PCA on the wine dataset, let's try training a model with it.

In [47]:
knn = KNeighborsClassifier()

In [80]:
y = wine['Type']

In [81]:
# Split the transformed X and the y labels into training and test sets
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
knn.score(X_wine_test, y_wine_test)

0.6

### Checking column types
Take a look at the UFO dataset's column types using the dtypes attribute. Two columns jump out for transformation: the seconds column, which is a numeric column but is being read in as object, and the date column, which can be transformed into the datetime type. That will make our feature engineering efforts easier later on. Note that the output below already shows seconds as float64 and date as datetime64[ns], which suggests the cell was re-run after the conversion.

In [82]:
ufo = pd.read_csv('datasets/ufo_sightings_large.csv')

In [86]:
# Check the column types
print(ufo.dtypes)



date              datetime64[ns]
city                      object
state                     object
country                   object
type                      object
seconds                  float64
length_of_time            object
desc                      object
recorded                  object
lat                       object
long                     float64
dtype: object


In [87]:
# Change the type of seconds to float
ufo["seconds"] = ufo["seconds"].astype(float)

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

# Check the column types
print(ufo[["seconds", "date"]].dtypes)

seconds           float64
date       datetime64[ns]
dtype: object


### Dropping missing data
Let's remove some of the rows where certain columns have missing values. We're going to look at the length_of_time column, the state column, and the type column. If any of the values in these columns are missing, we're going to drop the rows.

In [88]:
# Check how many values are missing in the length_of_time, state, and type columns
print(ufo[["length_of_time", "state", "type"]].isnull().sum())

# Keep only rows where length_of_time, state, and type are not null
ufo_no_missing = ufo[ufo["length_of_time"].notnull() & 
                      ufo["state"].notnull() & 
                      ufo["type"].notnull()]

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

length_of_time    143
state             419
type              159
dtype: int64
(4283, 11)


### Extracting numbers from strings
The length_of_time field in the UFO dataset is a text field that has the number of minutes within the string. Here, you'll extract that number from that text field using regular expressions.

In [108]:
def return_minutes(time_string):

    # Use \d+ to grab digits
    pattern = re.compile(r"\d+")
    
    # Use match on the pattern and column
    num = re.match(pattern, time_string)
    if num is not None:
        return int(num.group(0))

ufo['length_of_time'] = ufo['length_of_time'].fillna('').apply(str)  
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(lambda row: return_minutes(row))

# Take a look at the head of both of the columns
print(ufo[['length_of_time', 'minutes']].head())

    length_of_time  minutes
0          2 weeks      2.0
1           30sec.     30.0
2                       NaN
3  about 5 minutes      NaN
4                2      2.0
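
Notice that "about 5 minutes" comes back as NaN. That happens because re.match only tests the beginning of the string. The sketch below uses a hypothetical helper, first_number, to show how re.search differs; it is an alternative to the course function, not a replacement for it.

```python
import re

pattern = re.compile(r"\d+")

def first_number(time_string):
    # search scans the whole string; match only tests the beginning
    found = pattern.search(time_string)
    return int(found.group(0)) if found is not None else None

print(pattern.match("about 5 minutes"))  # None: the string starts with "about"
print(first_number("about 5 minutes"))
print(first_number("30sec."))
print(first_number(""))
```

Either way, the extracted number ignores the unit, so "2 weeks" and "30sec." are not really minutes.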


### Identifying features for standardization
In this section, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the seconds and minutes columns, you'll see that the variance of the seconds column is extremely high. Because seconds and minutes are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the seconds column.

In [109]:
# Check the variance of the seconds column
print(ufo['seconds'].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo['seconds'])

# Print out the variance of just the seconds_log column
print(ufo['seconds_log'].var())

31567346180.21189
nan
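
The variance comes out as nan here. One likely cause is that seconds contains zeros: log(0) is -inf, and a single -inf makes the variance undefined. The sketch below reproduces that on made-up durations and shows two common workarounds. Which one suits the UFO data is a judgment call.

```python
import numpy as np
import pandas as pd

# Made-up durations that include a zero
seconds = pd.Series([0.0, 2.0, 30.0, 600.0])

with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log(seconds)   # log(0) is -inf
    logged_var = logged.var()  # nan, because -inf poisons the mean
print(logged_var)

# Option 1: keep only strictly positive durations before taking the log
positive = seconds[seconds > 0]
positive_var = np.log(positive).var()
print(positive_var)

# Option 2: log1p computes log(1 + x), which is defined at zero
log1p_var = np.log1p(seconds).var()
print(log1p_var)
```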




### Encoding categorical variables
There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.

In [123]:
# Use Pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda value: 1 if value == 'us' else 0)

# Print the number of unique type values
print(len(ufo['type'].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([type_set, ufo], axis=1)

22


### Features from dates
Another feature engineering task to perform is month and year extraction. Perform this task on the date column of the ufo dataset.

In [126]:
# Look at the first 5 rows of the date column
ufo['date'].head()



0   2011-11-03 19:21:00
1   2004-10-03 19:05:00
2   2009-09-25 21:00:00
3   2002-11-21 05:45:00
4   2010-08-19 12:55:00
Name: date, dtype: datetime64[ns]

In [127]:
# Extract the month from the date column
ufo["month"] = ufo["date"].apply(lambda m: m.month)

# Extract the year from the date column
ufo["year"] = ufo["date"].apply(lambda y: y.year)

# Take a look at all three columns
ufo[['date','month','year']]

Unnamed: 0,date,month,year
0,2011-11-03 19:21:00,11,2011
1,2004-10-03 19:05:00,10,2004
2,2009-09-25 21:00:00,9,2009
3,2002-11-21 05:45:00,11,2002
4,2010-08-19 12:55:00,8,2010
...,...,...,...
4930,2000-07-05 19:30:00,7,2000
4931,2008-03-18 22:00:00,3,2008
4932,2005-06-15 02:30:00,6,2005
4933,1991-11-01 03:00:00,11,1991


### Text vectorization
Let's transform the desc column in the UFO dataset into tf/idf vectors, since there's likely something we can learn from this field.

In [128]:
# Take a look at the head of the desc field
print(ufo['desc'].head())



0    Red blinking objects similar to airplanes or s...
1                 Many fighter jets flying towards UFO
2    Green&#44 red&#44 and blue pulses of light tha...
3    It was a large&#44 triangular shaped flying ob...
4       A white spinning disc in the shape of an oval.
Name: desc, dtype: object


In [130]:
# Create the tfidf vectorizer object
vec = TfidfVectorizer()
ufo['desc'] = ufo['desc'].fillna('').apply(str)  
# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo['desc'])

# Look at the number of columns this creates
print(desc_tfidf.shape)

(4935, 6433)


### Selecting the ideal dataset
Let's get rid of some of the unnecessary features. Because we have an encoded country column, country_enc, keep it and drop other columns related to location: city, country, lat, long, state.

We have columns related to month and year, so we don't need the date or recorded columns.

We vectorized desc, so we don't need it anymore. For now we'll keep type.

We'll keep seconds_log and drop seconds and minutes.

Let's also get rid of the length_of_time column, which is unnecessary after extracting minutes.

In [None]:
# Check the correlation between the seconds, seconds_log, and minutes columns
print(ufo[['seconds', 'seconds_log','minutes']].corr())

# Make a list of features to drop
to_drop = ['city', 'country', 'date', 'desc', 'lat', 'long', 'length_of_time', 'minutes', 'recorded', 'seconds', 'state']

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
#filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf,4)

### Modeling the UFO dataset, part 1
In this exercise, we're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. Our X dataset has the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The y labels are the encoded country column, where 1 is us and 0 is ca.

In [None]:
# Take a look at the features in the X set of data
print(X.columns)

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the score of knn on the test sets
print(knn.score(test_X, test_y))

### Modeling the UFO dataset, part 2
Finally, let's build a model using the text vector we created, desc_tfidf, using the filtered_words list to create a filtered text vector. Let's see if we can predict the type of the sighting based on the text. We'll use a Naive Bayes model for this.

In [None]:
# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the X and y sets using train_test_split, setting stratify=y 
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit nb to the training sets
nb.fit(train_X, train_y)

# Print the score of nb on the test sets
print(nb.score(test_X, test_y))