# Feature engineering knowledge test

Now that you've learned about feature engineering, which of the following examples are good candidates for creating new features?

- A column of timestamps
- A column of newspaper headlines
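
As a rough illustration of why both kinds of column qualify, the sketch below derives numeric features from a timestamp column and a headline column. The data and column names are hypothetical and are not taken from any dataset used in these notes.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "timestamp": ["2011-01-13 09:30:00", "2011-07-30 18:45:00"],
    "headline": ["Volunteers needed downtown", "City opens new park"],
})

# Timestamps: parse, then pull out components such as month and hour
ts = pd.to_datetime(df["timestamp"])
df["month"] = ts.dt.month
df["hour"] = ts.dt.hour

# Headlines: even a simple word count is a usable numeric feature
df["headline_words"] = df["headline"].str.split().str.len()

print(df[["month", "hour", "headline_words"]])
```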

# Identifying areas for feature engineering

Take an exploratory look at the `volunteer` dataset.

Which of the following columns would you want to perform a feature engineering task on?

In [1]:
import pandas as pd
volunteer = pd.read_csv("dataset/volunteer_opportunities.csv")
print(volunteer.columns)
volunteer[['vol_requests', 'title', 'created_date', 'category_desc']].head()


Index(['opportunity_id', 'content_id', 'vol_requests', 'event_time', 'title',
       'hits', 'summary', 'is_priority', 'category_id', 'category_desc',
       'amsl', 'amsl_unit', 'org_title', 'org_content_id', 'addresses_count',
       'locality', 'region', 'postalcode', 'primary_loc', 'display_url',
       'recurrence_type', 'hours', 'created_date', 'last_modified_date',
       'start_date_date', 'end_date_date', 'status', 'Latitude', 'Longitude',
       'Community Board', 'Community Council ', 'Census Tract', 'BIN', 'BBL',
       'NTA'],
      dtype='object')


|  | vol_requests | title | created_date | category_desc |
| --- | --- | --- | --- | --- |
| 0 | 50 | Volunteers Needed For Rise Up & Stay Put! Home... | January 13 2011 |  |
| 1 | 2 | Web designer | January 14 2011 | Strengthening Communities |
| 2 | 20 | Urban Adventures - Ice Skating at Lasker Rink | January 19 2011 | Strengthening Communities |
| 3 | 500 | Fight global hunger and support women farmers ... | January 21 2011 | Strengthening Communities |
| 4 | 15 | Stop 'N' Swap | January 28 2011 | Environment |


- `title`
- `created_date`
- `category_desc`
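
One possible way to shortlist such columns programmatically is to look for non-numeric dtypes and count distinct values. The sketch below uses a hypothetical three-row miniature of the data, not the real file, and `candidates` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical miniature of the volunteer data
sample = pd.DataFrame({
    "vol_requests": [50, 2, 20],
    "title": ["Web designer", "Stop 'N' Swap", "Ice skating"],
    "created_date": ["January 13 2011", "January 14 2011", "January 19 2011"],
    "category_desc": [None, "Environment", "Environment"],
})

# Non-numeric columns are the usual candidates for encoding or extraction
candidates = sample.select_dtypes(exclude="number").columns.tolist()
print(candidates)

# The number of distinct values hints at categorical vs. free-text columns
print(sample[candidates].nunique())
```

A column with few distinct values (like `category_desc`) suggests encoding, while one with mostly unique strings (like `title`) suggests text vectorization.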

# Encoding categorical variables - binary

Take a look at the `hiking` dataset. Several of its columns need encoding before they can be used in a model, one of which is the `Accessible` column. `Accessible` is a binary feature with two values, `Y` and `N`, which need to be encoded as 1s and 0s. Use scikit-learn's `LabelEncoder` class to perform this transformation.

In [2]:
import pandas as pd

hiking = pd.read_json("dataset/hiking.json")
hiking.head()

|  | Prop_ID | Name | Location | Park_Name | Length | Difficulty | Other_Details | Accessible | Limited_Access | lat | lon |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | B057 | Salt Marsh Nature Trail | Enter behind the Salt Marsh Nature Center, loc... | Marine Park | 0.8 miles |  | \<p>The first half of this mile-long trail foll... | Y | N |  |  |
| 1 | B073 | Lullwater | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 1.0 mile | Easy | Explore the Lullwater to see how nature thrive... | N | N |  |  |
| 2 | B073 | Midwood | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.75 miles | Easy | Step back in time with a walk through Brooklyn... | N | N |  |  |
| 3 | B073 | Peninsula | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.5 miles | Easy | Discover how the Peninsula has changed over th... | N | N |  |  |
| 4 | B073 | Waterfall | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.5 miles | Easy | Trace the source of the Lake on the Waterfall ... | N | N |  |  |


In [3]:
from sklearn.preprocessing import LabelEncoder
# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking["Accessible_enc"] = enc.fit_transform(hiking["Accessible"])

# Compare the two columns
print(hiking[["Accessible_enc", "Accessible"]].head())

   Accessible_enc Accessible
0               1          Y
1               0          N
2               0          N
3               0          N
4               0          N
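
To see which label received which integer, one option is to inspect the fitted encoder. The sketch below uses a small hypothetical series in place of the real `Accessible` column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the Accessible column
accessible = pd.Series(["Y", "N", "N", "Y"])

enc = LabelEncoder()
encoded = enc.fit_transform(accessible)

# classes_ lists the labels in sorted order; each label's position is its code
print(enc.classes_)
print(encoded)

# inverse_transform maps the integer codes back to the original labels
print(enc.inverse_transform(encoded))
```

Because the labels are sorted before codes are assigned, `N` maps to 0 and `Y` maps to 1, which matches the output above.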


# Encoding categorical variables - one-hot

One of the columns in the `volunteer` dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, one-hot encoding is the appropriate way to turn it into numeric features. Use pandas' `pd.get_dummies()` function to do so. Note that a row with a missing category, such as row 0, gets a 0 in every encoded column.

In [4]:
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer["category_desc"])

# Take a look at the encoded columns
print(category_enc.head())

   Education  Emergency Preparedness  Environment  Health  \
0          0                       0            0       0   
1          0                       0            0       0   
2          0                       0            0       0   
3          0                       0            0       0   
4          0                       0            1       0   

   Helping Neighbors in Need  Strengthening Communities  
0                          0                          0  
1                          0                          1  
2                          0                          1  
3                          0                          1  
4                          0                          0  
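
The output above shows 0/1 integers, whereas recent pandas versions return boolean columns by default. The sketch below, on hypothetical data, shows two optional `get_dummies` arguments that may be useful here: `dtype=int` to force 0/1 values and `dummy_na=True` to give missing values their own indicator column.

```python
import pandas as pd

# Hypothetical stand-in for category_desc, including one missing value
cats = pd.Series([None, "Environment", "Health", "Environment"], name="category_desc")

# Default handling of missing values: the first row is all zeros
plain = pd.get_dummies(cats, dtype=int)
print(plain)

# dummy_na=True adds an extra column that flags the missing value
with_na = pd.get_dummies(cats, dtype=int, dummy_na=True)
print(with_na)
```

Whether a separate missing-value indicator helps depends on the model and on why the values are missing, so treat it as an option rather than a default.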


# Aggregating numerical features

A good use case for taking an aggregate statistic to create a new feature is when you have many features with similar, related values. Here, you have a DataFrame of running times named `running_times_5k`. For each name in the dataset, take the mean of their five run times to produce a single feature per runner.

In [5]:
# Data for the DataFrame
data = {
    'name': ['Sue', 'Mark', 'Sean', 'Erin', 'Jenny'],
    'run1': [20.1, 16.5, 23.5, 21.7, 25.8],
    'run2': [18.5, 17.1, 25.1, 21.1, 27.1],
    'run3': [19.6, 16.9, 25.2, 20.9, 26.1],
    'run4': [20.3, 17.6, 24.6, 22.1, 26.7],
    'run5': [18.3, 17.3, 23.9, 22.2, 26.9]
}

# Create DataFrame
running_times_5k = pd.DataFrame(data)
running_times_5k

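|  | name | run1 | run2 | run3 | run4 | run5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Sue | 20.1 | 18.5 | 19.6 | 20.3 | 18.3 |
| 1 | Mark | 16.5 | 17.1 | 16.9 | 17.6 | 17.3 |
| 2 | Sean | 23.5 | 25.1 | 25.2 | 24.6 | 23.9 |
| 3 | Erin | 21.7 | 21.1 | 20.9 | 22.1 | 22.2 |
| 4 | Jenny | 25.8 | 27.1 | 26.1 | 26.7 | 26.9 |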


In [6]:
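# Use .loc to select only the run columns, then take the row-wise mean
# (selecting run1..run5 explicitly keeps the result stable if the cell is re-run)
running_times_5k["mean"] = running_times_5k.loc[:, "run1":"run5"].mean(axis=1)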

# Take a look at the results
print(running_times_5k.head())

    name  run1  run2  run3  run4  run5   mean
0    Sue  20.1  18.5  19.6  20.3  18.3  19.36
1   Mark  16.5  17.1  16.9  17.6  17.3  17.08
2   Sean  23.5  25.1  25.2  24.6  23.9  24.46
3   Erin  21.7  21.1  20.9  22.1  22.2  21.60
4  Jenny  25.8  27.1  26.1  26.7  26.9  26.52
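
The mean is only one possible aggregate. As a sketch, the same row-wise pattern works for other statistics such as the median or the minimum. The example rebuilds two of the rows above so that it runs on its own; the `median` and `best` column names are illustrative choices, not part of the exercise.

```python
import pandas as pd

# Two rows copied from the running_times_5k data above
running_times_5k = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
    "run3": [19.6, 16.9],
    "run4": [20.3, 17.6],
    "run5": [18.3, 17.3],
})

# Select the run columns once, then compute several row-wise aggregates
run_cols = ["run1", "run2", "run3", "run4", "run5"]
runs = running_times_5k[run_cols]

running_times_5k["mean"] = runs.mean(axis=1)
running_times_5k["median"] = runs.median(axis=1)
running_times_5k["best"] = runs.min(axis=1)

print(running_times_5k[["name", "mean", "median", "best"]])
```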


# Extracting datetime components

Several columns in the `volunteer` dataset hold dates stored as strings. Let's take a look at the `start_date_date` column, convert it to a datetime type, and extract just the month to use as a feature for modeling.

In [7]:
# volunteer.head(1)
print(volunteer.columns)
volunteer["start_date_date"].head()

Index(['opportunity_id', 'content_id', 'vol_requests', 'event_time', 'title',
       'hits', 'summary', 'is_priority', 'category_id', 'category_desc',
       'amsl', 'amsl_unit', 'org_title', 'org_content_id', 'addresses_count',
       'locality', 'region', 'postalcode', 'primary_loc', 'display_url',
       'recurrence_type', 'hours', 'created_date', 'last_modified_date',
       'start_date_date', 'end_date_date', 'status', 'Latitude', 'Longitude',
       'Community Board', 'Community Council ', 'Census Tract', 'BIN', 'BBL',
       'NTA'],
      dtype='object')


0        July 30 2011
1    February 01 2011
2     January 29 2011
3    February 14 2011
4    February 05 2011
Name: start_date_date, dtype: object

In [8]:
# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer["start_date_converted"].dt.month

# Take a look at the converted and new month columns
print(volunteer[["start_date_converted", "start_date_month"]].head())

  start_date_converted  start_date_month
0           2011-07-30                 7
1           2011-02-01                 2
2           2011-01-29                 1
3           2011-02-14                 2
4           2011-02-05                 2
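
The `.dt` accessor exposes more than the month. The sketch below, using three date strings copied from the output above, passes an explicit `format` to `pd.to_datetime` and extracts a few other components. The format string `%B %d %Y` is an assumption based on the sample values shown, and `%B` expects English month names under the default locale.

```python
import pandas as pd

# Date strings in the same style as start_date_date
dates = pd.Series(["July 30 2011", "February 01 2011", "January 29 2011"])

# An explicit format makes the expected layout unambiguous
converted = pd.to_datetime(dates, format="%B %d %Y")

# Collect several datetime components as candidate features
features = pd.DataFrame({
    "month": converted.dt.month,
    "year": converted.dt.year,
    "day_of_week": converted.dt.dayofweek,  # Monday=0 ... Sunday=6
    "day_name": converted.dt.day_name(),
})

print(features)
```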


# Extracting string patterns

The `Length` column in the `hiking` dataset is a column of strings, but each string contains the mileage for the hike. We're going to extract this mileage using a regular expression inside a helper function, and then use `apply()` in pandas to run that function over the column.

In [9]:
hiking["Length"].head()


0     0.8 miles
1      1.0 mile
2    0.75 miles
3     0.5 miles
4     0.5 miles
Name: Length, dtype: object

In [10]:
import re

# Extract the first decimal number (digits, a point, digits) from a value
def return_mileage(length):

    # Search the text for a match; str() guards against non-string values
    mile = re.search(r'(\d+\.\d+)', str(length))

    # If a match is found, group(0) holds the matched text
    if mile is not None:
        return float(mile.group(0))

    # No decimal number found (for example, a missing value)
    return None

# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking["Length"].apply(return_mileage)
print(hiking[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50
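
The pattern above only matches numbers that contain a decimal point, so a value such as `2 miles` would come back empty. If whole-number lengths are possible in the data (an assumption; the rows shown here all contain decimals), a variant with an optional fractional part could look like the sketch below. The name `return_mileage_any` is hypothetical.

```python
import re

def return_mileage_any(length):
    """Hypothetical variant: accept whole numbers as well as decimals."""
    # \d+ matches the integer part; (?:\.\d+)? makes the fraction optional
    match = re.search(r"\d+(?:\.\d+)?", str(length))
    return float(match.group(0)) if match else None

# Try it on a few representative values, including a missing one
for text in ["0.8 miles", "2 miles", "1.0 mile", None]:
    print(repr(text), "->", return_mileage_any(text))
```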


# Vectorizing text

You'll now transform the `volunteer` dataset's `title` column into a text vector, which you'll use in a prediction task in the next exercise.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Take the title text
title_text = volunteer['title']

# Create the vectorizer object
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

# Text classification using tf-idf vectors

Now that you've encoded the `volunteer` dataset's `title` column into tf-idf vectors, you'll use those vectors to predict the `category_desc` column.

In [12]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

nb = GaussianNB()

# GaussianNB needs dense input, so convert the sparse tf-idf matrix to a DataFrame
X = pd.DataFrame(text_tfidf.toarray(), columns=tfidf_vec.get_feature_names_out())

# Check how many labels are missing and how the classes are distributed
print(volunteer[["category_desc"]].isna().sum())
volunteer[["category_desc"]].value_counts()

category_desc    48
dtype: int64


category_desc            
Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
dtype: int64

In [13]:
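from sklearn.preprocessing import LabelEncoder

# Encode the category labels as integers (this overwrites the original column)
label_encoder = LabelEncoder()
volunteer["category_desc"] = label_encoder.fit_transform(volunteer["category_desc"])
y = volunteer["category_desc"]

# Note: the 48 missing values are encoded as their own class (6 in the output below)
y.value_counts()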

5    307
4    119
0     92
3     52
6     48
2     32
1     15
Name: category_desc, dtype: int64

In [14]:

# Split the dataset according to the class distribution of category_desc
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

0.437125748502994
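
The workflow above treats the 48 missing categories as a class of their own and converts the tf-idf matrix to a dense array for `GaussianNB`. One alternative to experiment with, sketched below on hypothetical toy data, is to drop unlabeled rows first and fit `MultinomialNB`, which accepts sparse input directly. This is a swapped-in technique for comparison, not the approach used in the exercise, and no claim is made about how it would score on the real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for title and category_desc
df = pd.DataFrame({
    "title": [
        "Tutor students in math", "Tutor reading after school",
        "Teach math class", "Teach reading class",
        "Plant trees in the park", "Clean up the park",
        "Plant a garden", "Clean the river",
        "Help wanted",
    ],
    "category_desc": ["Education"] * 4 + ["Environment"] * 4 + [None],
})

# Drop rows with no label instead of encoding NaN as a class
labeled = df.dropna(subset=["category_desc"])

# Keep the tf-idf matrix sparse; MultinomialNB does not need a dense array
X = TfidfVectorizer().fit_transform(labeled["title"])
y = labeled["category_desc"]

# Stratify so both classes appear in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)

nb = MultinomialNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))
```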
