# Extracting information from features

### What is feature engineering?

Feature engineering is the creation of new features from existing ones. It adds information to your dataset that is useful in some way: features that help with your prediction or clustering task, or insight into the relationships between features. Real-world data is often not neat and tidy, and in addition to preprocessing steps like standardization, you'll likely have to extract and expand information that exists in the columns of your dataset.

Feature engineering depends heavily on the particular dataset you're analyzing, so it's important to have an in-depth understanding of the dataset you want to model.

### Extracting features using regular expressions

Regular expressions are patterns that can be used to extract matches from text data. Here we have a string, and we want to extract the temperature value, `45.6`, from it:

In [1]:
temp_string = "temperature:45.6 F"

Notice that this number is a float. We'll need a pattern to extract this float, which we can create using the `re` Python library.

In [2]:
import re

pattern = re.compile(r"\d+\.\d+")

Let's break down the pattern in `re.compile`. `\d` matches a single digit, and `+` means one or more of the preceding element, as many as possible, so if there are two digits next to each other, we get both (like the `45`). `\.` matches a literal decimal point, and the final `\d+` matches the digits on the right-hand side of the decimal.

To return the matches, we can use `findall()`:

In [3]:
temperature = re.findall(pattern, temp_string)

temperature

['45.6']

Notice that `findall()` returns a list of the matched strings. Here, we want the temperature as a `float`. Since we know there's only a single temperature in our string, we can do the following:

In [4]:
temperature_num = float(re.findall(pattern, temp_string)[0])

temperature_num

45.6

This would be a little trickier in circumstances where we have multiple numerical instances in a string, but this is good enough for now.
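For the trickier case of several numbers in one string, one option is to anchor the pattern on the label that precedes the number and use a capturing group. The sketch below is an illustration rather than part of the original walkthrough: the `reading` string and the `extract_temperature` helper are hypothetical.

```python
import re

# Hypothetical string containing more than one number
reading = "station 12 temperature:45.6 F humidity:30.2"

# Anchor on the "temperature:" label and capture only the float that follows it
temp_pattern = re.compile(r"temperature:(\d+\.\d+)")


def extract_temperature(text):
    """Return the labelled temperature as a float, or None if it is absent."""
    match = temp_pattern.search(text)
    return float(match.group(1)) if match else None


# The unanchored pattern returns every float in the string
all_floats = re.findall(r"\d+\.\d+", reading)
print(all_floats)

# The anchored pattern returns only the labelled value
print(extract_temperature(reading))
```

Using `search()` with a group avoids indexing into a list and makes the no-match case explicit.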

Let's try this out on the `hiking` dataset. The `Length` column in the `hiking` dataset is a column of strings, but each string contains the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in pandas to apply the extraction to the entire column.

In [5]:
import pandas as pd

dir_string = "../../datasets/"
hiking = pd.read_json(dir_string + "hiking.json")

hiking["Length"].head()

0     0.8 miles
1      1.0 mile
2    0.75 miles
3     0.5 miles
4     0.5 miles
Name: Length, dtype: object

First, let's create a function that finds and returns the mileage as a float. Wrapping the logic in a function makes it easier to apply to the whole column. In the function, we use Python's `isinstance()` to make sure the value we're processing is a string; otherwise `findall()` would fail on missing values. We then check the length of the list returned by `findall()` and return the match only when exactly one float was found. In every other case the function returns 0.

In [6]:
def return_mileage(length):
    # Only strings can be searched; missing values skip straight to the fallback
    if isinstance(length, str):
        pattern = re.compile(r"\d+\.\d+")
        mile = re.findall(pattern, length)
        # Return the mileage only when exactly one float was found
        if len(mile) == 1:
            return float(mile[0])
    # Non-strings, and strings without exactly one match, return 0
    return 0

Next, let's apply it on the `Length` column and generate a new column with just the hiking mileage:

In [7]:
hiking["Length Mileage"] = hiking["Length"].apply(lambda row: return_mileage(row))

Finally, let's compare our new column to the original column to make sure it worked:

In [8]:
print(hiking[["Length", "Length Mileage"]].head())
print("\n")
print(hiking[["Length", "Length Mileage"]].dtypes)

       Length  Length Mileage
0   0.8 miles            0.80
1    1.0 mile            1.00
2  0.75 miles            0.75
3   0.5 miles            0.50
4   0.5 miles            0.50


Length             object
Length Mileage    float64
dtype: object
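As an alternative to a custom function, pandas string methods can do the extraction in one step. This is a sketch on a small hypothetical `lengths` series rather than the `hiking` data, and it makes two assumptions that differ from the function above: whole-number mileages such as `2 miles` should also match, and missing values should stay missing instead of becoming 0.

```python
import pandas as pd

# Hypothetical sample mimicking the `Length` column, with a missing value
# and a whole-number mileage
lengths = pd.Series(["0.8 miles", "1.0 mile", None, "2 miles"])

# The decimal part is optional, so "0.8", "1.0", and "2" all match.
# expand=False returns a Series because there is a single capture group.
mileage = (
    lengths.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

print(mileage)
```

Rows that do not match, including the missing value, come back as `NaN`, which keeps "no mileage recorded" distinct from a genuine value of 0.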


### Encoding variables: binary

Because models in scikit-learn require numerical input, if your dataset contains categorical variables, you'll have to encode them. Encoding binary values is straightforward, and can be done in both pandas and scikit-learn. You might want to encode variables in pandas if you're not finished preprocessing, or if you're interested in further exploratory work once you've encoded. On the other hand, you may want to use scikit-learn if, for example, you're implementing encoding as part of scikit-learn's pipeline functionality, which allows you to string different parts of the machine learning process together.
 
Let's look at an example using the `hiking` dataset. One column that needs encoding is the `Accessible` column, which has values of either `Y` or `N`.

In [9]:
hiking["Accessible"].head()

0    Y
1    N
2    N
3    N
4    N
Name: Accessible, dtype: object

In pandas, we can use `apply()` to encode 1s and 0s in a `DataFrame` column, using a simple conditional that returns a 1 if the value in `Accessible` is `Y`, and a 0 if the value is `N`.

In [10]:
hiking["Accessible_enc"] = hiking["Accessible"].apply(lambda val: 1 if val == "Y" else 0)

hiking[["Accessible", "Accessible_enc"]].head()

  Accessible  Accessible_enc
0          Y               1
1          N               0
2          N               0
3          N               0
4          N               0


Looking at a side by side comparison of the columns, you can see that the column is now numerically encoded.
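One thing to keep in mind with the conditional is that anything other than `Y`, including missing or mistyped values, is encoded as 0. The sketch below compares it with an explicit dictionary passed to `map()`, using a small hypothetical `accessible` series rather than the `hiking` data.

```python
import pandas as pd

# Hypothetical sample with a missing value and an unexpected label
accessible = pd.Series(["Y", "N", None, "y"])

# Conditional from the text: everything except "Y" becomes 0
enc_apply = accessible.apply(lambda val: 1 if val == "Y" else 0)

# Explicit mapping: values that are not keys in the dictionary become NaN
enc_map = accessible.map({"Y": 1, "N": 0})

print(pd.DataFrame({"raw": accessible, "apply": enc_apply, "map": enc_map}))
```

Which behaviour is preferable depends on the data; the mapping makes unexpected values visible, while the conditional always produces a clean 0/1 column.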

You can also do this in scikit-learn using `LabelEncoder`. Creating a `LabelEncoder` object also allows you to reuse this encoding on other data, such as on new data or a test set. You can use `fit_transform()` to both fit the encoder to the data as well as transform the column. 

In [11]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
hiking["Accessible_enc_le"] = le.fit_transform(hiking["Accessible"])

hiking[["Accessible", "Accessible_enc", "Accessible_enc_le"]].head()

  Accessible  Accessible_enc  Accessible_enc_le
0          Y               1                  1
1          N               0                  0
2          N               0                  0
3          N               0                  0
4          N               0                  0


Printing out `Accessible` and its encoded counterparts, we can see that the `Y` and `N` values have been encoded to 1s and 0s in the same manner in both pandas and scikit-learn.
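To illustrate reusing a fitted encoder on other data, here is a minimal sketch with hypothetical `train` and `new` series standing in for a training set and new data. It also shows that `LabelEncoder` assigns integers to the classes in sorted order, which is why `N` maps to 0 and `Y` to 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data and new data
train = pd.Series(["Y", "N", "N", "Y"])
new = pd.Series(["N", "Y"])

le = LabelEncoder()
train_enc = le.fit_transform(train)   # learn the classes and encode them
new_enc = le.transform(new)           # reuse the learned encoding

print(le.classes_)                    # classes in sorted order
print(train_enc, new_enc)
print(le.inverse_transform(new_enc))  # map integers back to the labels
```

Note that `transform()` raises an error for labels that were not seen during fitting, and that scikit-learn's documentation describes `LabelEncoder` as intended for target values rather than input features.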

### Encoding variables: one-hot

One-hot encoding encodes a categorical variable into 1s and 0s when it has more than two distinct values. It works by looking at the entire list of unique values in a column, transforming each value into an array, and designating a 1 in the appropriate position to encode that a particular value occurs.

Let's look at a toy example to see how this works, taking a small dataset of colors:

In [12]:
color_ex = pd.Series(["blue", "red", "green", "red"])

Here we have three values: blue, green, and red. In order to one-hot encode these values, we can use `get_dummies()` to generate columns for each.

In [13]:
color_enc = pd.get_dummies(color_ex)
color_enc

   blue  green  red
0     1      0    0
1     0      0    1
2     0      1    0
3     0      0    1


To concatenate these columns back to the original data, you can use `pd.concat`:

In [14]:
pd.concat([color_ex, color_enc], axis=1)

       0  blue  green  red
0   blue     1      0    0
1    red     0      0    1
2  green     0      1    0
3    red     0      0    1


Reading the encoded columns row by row: blue has a 1 in the first position followed by two 0s, green has a 1 in the second position, and red has a 1 in the last position. A value of `1` indicates that the color appeared in the original column in that particular row.
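To make the mechanics concrete, here is a minimal pure-Python sketch of the same idea. It is not how `get_dummies()` is implemented; the `one_hot` helper is hypothetical and only illustrates the position-based encoding.

```python
# Toy data from the color example
colors = ["blue", "red", "green", "red"]

# The sorted list of unique values fixes the position of each category
categories = sorted(set(colors))
position = {cat: i for i, cat in enumerate(categories)}


def one_hot(value):
    """Return a list of 0s with a single 1 at the value's position."""
    vector = [0] * len(categories)
    vector[position[value]] = 1
    return vector


encoded = [one_hot(color) for color in colors]
for color, vector in zip(colors, encoded):
    print(color, vector)
```

Each inner list corresponds to one row of the `get_dummies()` output, with the columns in the order blue, green, red.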

One of the columns in the `volunteer` dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, let's practice using one-hot encoding to transform this column numerically:

In [15]:
volunteer = pd.read_csv(dir_string + "volunteer.csv")

category_enc = pd.get_dummies(volunteer["category_desc"])
category_enc.head()

   Education  Emergency Preparedness  Environment  Health  Helping Neighbors in Need  Strengthening Communities
0          0                       0            0       0                          0                          0
1          0                       0            0       0                          0                          1
2          0                       0            0       0                          0                          1
3          0                       0            0       0                          0                          1
4          0                       0            1       0                          0                          0


Finally, we can concatenate these back onto the `DataFrame` and take a look at a few rows and their encodings:

In [16]:
volunteer_enc = pd.concat([volunteer, category_enc], axis=1)

volunteer_enc[["category_desc", "Strengthening Communities", "Helping Neighbors in Need"]].tail()

                 category_desc  Strengthening Communities  Helping Neighbors in Need
660  Helping Neighbors in Need                          0                          1
661  Strengthening Communities                          1                          0
662  Helping Neighbors in Need                          0                          1
663  Strengthening Communities                          1                          0
664  Strengthening Communities                          1                          0
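Row 0 of the `category_enc.head()` output contains only zeros. One possible explanation, which the data shown here does not confirm, is a missing `category_desc` value: by default `get_dummies()` encodes missing values as all zeros. The sketch below demonstrates that behaviour on a small hypothetical `cats` series, along with the `dummy_na` option that adds an explicit indicator column.

```python
import pandas as pd

# Hypothetical categories with one missing value
cats = pd.Series(["Environment", None, "Health"])

# Default: the missing value gets a 0 in every column
# (dtype=int keeps the output as 1s and 0s rather than booleans)
enc = pd.get_dummies(cats, dtype=int)

# dummy_na=True adds an extra indicator column for missing values
enc_na = pd.get_dummies(cats, dummy_na=True, dtype=int)

print(enc)
print(enc_na)
```

Whether an all-zero row or an explicit missing-value column is more useful depends on how the downstream model should treat missing categories.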


### Aggregate stats

If you have a collection of features that all measure the same thing, like repeated temperature readings or running times, you might want to take an average or median and use that as a feature for modeling instead.

Taking an aggregate of a set of numbers to use in place of the individual features is a common method of feature engineering. It can help reduce the dimensionality of your feature space, and it is often enough when the individual values are close to one another and carry little extra information.

Let's say we have a `DataFrame` of running times named `running_times_5k`:

In [17]:
running_dict = {"name": ["Sue", "Mark", "Sean", "Erin", "Jenny", "Russell"], 
                "run1": [20.1, 16.5, 23.5, 21.7, 25.8, 30.9], 
                "run2": [18.5, 17.1, 25.1, 21.1, 27.1, 29.6], 
                "run3": [19.6, 16.9, 25.2, 20.9, 26.1, 31.4], 
                "run4": [20.3, 17.6, 24.6, 22.1, 26.7, 30.4], 
                "run5": [18.3, 17.3, 23.9, 22.2, 26.9, 29.9]}

running_times_5k = pd.DataFrame(running_dict)
running_times_5k

      name  run1  run2  run3  run4  run5
0      Sue  20.1  18.5  19.6  20.3  18.3
1     Mark  16.5  17.1  16.9  17.6  17.3
2     Sean  23.5  25.1  25.2  24.6  23.9
3     Erin  21.7  21.1  20.9  22.1  22.2
4    Jenny  25.8  27.1  26.1  26.7  26.9
5  Russell  30.9  29.6  31.4  30.4  29.9


Instead of using each individual run time to build our model, let's use the mean of these five runs for each person in the dataset.

Let's create a list of the columns we want to average, just to make things easier:

In [18]:
run_columns = ["run1", "run2", "run3", "run4", "run5"]

We can use `apply()` to apply a function to our dataset. With `axis=1`, the function is called once per row, so here we take the `mean()` of just the columns we want to average and get one value back for each runner:

In [19]:
running_times_5k["mean"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

running_times_5k

      name  run1  run2  run3  run4  run5   mean
0      Sue  20.1  18.5  19.6  20.3  18.3  19.36
1     Mark  16.5  17.1  16.9  17.6  17.3  17.08
2     Sean  23.5  25.1  25.2  24.6  23.9  24.46
3     Erin  21.7  21.1  20.9  22.1  22.2  21.60
4    Jenny  25.8  27.1  26.1  26.7  26.9  26.52
5  Russell  30.9  29.6  31.4  30.4  29.9  30.44


And now we have an aggregate mean column.
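The same result can also be computed without `apply()` by selecting the columns and calling `mean(axis=1)` directly. The sketch below uses a hypothetical `runs` frame holding the first two runners from the table, and adds a median column as a second aggregate.

```python
import pandas as pd

# First two runners from the running-times table
runs = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
    "run3": [19.6, 16.9],
    "run4": [20.3, 17.6],
    "run5": [18.3, 17.3],
})
run_columns = ["run1", "run2", "run3", "run4", "run5"]

# axis=1 aggregates across the selected columns, one value per row
runs["mean"] = runs[run_columns].mean(axis=1)
runs["median"] = runs[run_columns].median(axis=1)

print(runs[["name", "mean", "median"]])
```

Selecting the columns first lets pandas compute the aggregate in a single vectorized call, which tends to matter more as the number of rows grows.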