**[Feature Engineering Home Page](https://www.kaggle.com/learn/feature-engineering)**

---


# Introduction

You can provide more information for your model by creating new features from the data itself. For example, you can calculate the total number of projects in the last week and the duration of the fundraising period. The features you can create are different for every dataset, so it takes a bit of creativity and experimentation. We're a bit limited here since I'm working with only one table. Typically you'll have access to multiple tables with relevant data that you can use to create new features.

First I'll show you how to make new features using categorical features, then a few examples of generated numerical features.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
from sklearn.preprocessing import LabelEncoder

ks = pd.read_csv('../input/kickstarter-projects/ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])

# Drop live projects
ks = ks.query('state != "live"')

# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))

# Timestamp features
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)

# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)

data_cols = ['goal', 'hour', 'day', 'month', 'year', 'outcome']
baseline_data = ks[data_cols].join(encoded)
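
The introduction mentions the duration of the fundraising period as a possible feature. As a minimal sketch, here is how it could be computed from the `deadline` and `launched` timestamps. The toy rows and the `duration_days` column name are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the Kickstarter data (values are illustrative only)
toy = pd.DataFrame({
    "launched": pd.to_datetime(["2020-01-01 00:00", "2020-02-01 12:00"]),
    "deadline": pd.to_datetime(["2020-01-31 00:00", "2020-03-02 00:00"]),
})

# Subtracting two datetime columns gives a timedelta column;
# convert it to fractional days (86400 seconds per day)
toy = toy.assign(
    duration_days=(toy["deadline"] - toy["launched"]).dt.total_seconds() / 86400
)
print(toy)
```

The same expression should work on `ks` directly, since both columns were parsed as dates when the file was read.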

# Interactions

One of the easiest ways to create new features is by combining categorical variables. For example, if one record has the country `"CA"` and category `"Music"`, you can create a new value `"CA_Music"`. This is a new categorical feature that can provide information about correlations between categorical variables. This type of feature is typically called an **interaction**. In general, you would build interaction features from all pairs of categorical features. You can make interactions from three or more features as well, but you'll tend to get diminishing returns.

Pandas lets us simply add string columns together like normal Python strings.

In [None]:
interactions = ks['category'] + "_" + ks['country']
print(interactions.head(10))

Then, label encode the interaction feature and add it to our data.

In [None]:
label_enc = LabelEncoder()
data_interaction = baseline_data.assign(category_country=label_enc.fit_transform(interactions))
data_interaction.head()
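
To make the "all pairs" idea concrete, here is a small sketch on a made-up table. It uses `itertools.combinations` to loop over the pairs, and pandas category codes in place of `LabelEncoder` so the example only depends on pandas. Both assign integer codes to the sorted unique values, so the results should match. The toy rows and the `toy_interactions` name are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy table with the same three categorical columns (values are made up)
toy = pd.DataFrame({
    "category": ["Music", "Games", "Music", "Games"],
    "currency": ["CAD", "USD", "USD", "USD"],
    "country": ["CA", "US", "US", "US"],
})
cat_features = ["category", "currency", "country"]

toy_interactions = pd.DataFrame(index=toy.index)
for left, right in combinations(cat_features, 2):
    # Concatenate the two string columns, e.g. "Music" + "_" + "CA"
    combined = toy[left] + "_" + toy[right]
    # Encode each distinct combined value as an integer code
    toy_interactions[f"{left}_{right}"] = combined.astype("category").cat.codes

print(toy_interactions)
```

With three categorical features this gives three interaction columns. The number of pairs grows quadratically with the number of features, so you may want to be selective on wider tables.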

In the next exercise, you'll build interaction terms for all pairs of categorical features.

# Number of projects in the last week

First up, I'll show you how to count the number of projects launched in the preceding week for each record. To do this I'll use the `.rolling` method on a series with the `"launched"` column as the index. I'll create the series, using `ks.launched` as the index and `ks.index` as the values, then sort the times. Using a time series as the index allows us to define the rolling window size in terms of hours, days, weeks, etc.

In [None]:
# First, create a Series with a timestamp index
launched = pd.Series(ks.index, index=ks.launched, name="count_7_days").sort_index()
launched.head(20)

There are seven projects that have obviously wrong launch dates, but we'll just ignore them. Again, this is something you'd handle when cleaning the data, but it's not the focus of this mini-course.

With a timeseries index, you can use `.rolling` to select time periods as the window. For example `launched.rolling('7d')` creates a rolling window that contains all the data in the previous 7 days. The window contains the current record, so if we want to count all the *previous* projects but not the current one, we'll need to subtract 1. I'll also plot the results so we can make sure it looks right.

In [None]:
count_7_days = launched.rolling('7d').count() - 1
print(count_7_days.head(20))

# Ignore records with broken launch dates
plt.plot(count_7_days[7:]);
plt.title("Projects launched in the last 7 days");
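
To see how the time-based window behaves, here is a small sketch with five made-up launch times. The timestamps and the `toy_counts` name are illustrative only.

```python
import pandas as pd

# Five made-up launch times, already sorted
times = pd.to_datetime([
    "2020-01-01", "2020-01-03", "2020-01-07", "2020-01-09", "2020-01-20",
])
toy_launched = pd.Series(range(5), index=times)

# Each window covers the 7 days up to and including the current row,
# so subtract 1 to count only the earlier projects
toy_counts = toy_launched.rolling('7d').count() - 1
print(toy_counts)
```

The project on January 9 sees only the launches on January 3 and 7, because January 1 is more than 7 days earlier. The last project has no other launches in its window, so its count is 0.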

Now that we have the counts, we need to adjust the index so we can join it with the other training data. 

In [None]:
count_7_days.index = launched.values
count_7_days = count_7_days.reindex(ks.index)

In [None]:
count_7_days.head(10)
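
The index bookkeeping can be easy to lose track of, so here is the same sequence of steps on a three-row toy table. The row labels 10, 11, 12 and the `projects` name are made up for illustration.

```python
import pandas as pd

# Toy table whose rows are NOT in chronological order
projects = pd.DataFrame(
    {"launched": pd.to_datetime(["2020-01-09", "2020-01-01", "2020-01-03"])},
    index=[10, 11, 12],
)

# Values are the original row labels, index is the launch time
launched = pd.Series(projects.index, index=projects.launched).sort_index()

# Count earlier projects in the trailing 7 days
counts = launched.rolling('7d').count() - 1

# Swap the timestamp index for the original row labels,
# then put the rows back in the original order
counts.index = launched.values
counts = counts.reindex(projects.index)
print(counts)
```

After the reindex, `counts` has the same index as `projects`, which is what lets `.join` line the rows up correctly.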

Now join the new feature with the other data again using `.join` since we've matched the index.

In [None]:
baseline_data.join(count_7_days).head(10)

# Time since the last project in the same category

It's possible that projects in the same category compete for donors. If you're trying to fund a video game and another game project was just launched, you might not get as much money. What I'd like to do then is calculate the time since the last project in the same category.

A handy method for performing operations within groups is to use `.groupby` then `.transform`. The `.transform` method takes a function, then passes a series or dataframe to that function for each group. This will return a dataframe with the same indices as the original dataframe. What we can do is perform a groupby on `"category"` and use transform to calculate the time differences for each category.

In [None]:
def time_since_last_project(series):
    # Return the time in hours
    return series.diff().dt.total_seconds() / 3600.

df = ks[['category', 'launched']].sort_values('launched')
timedeltas = df.groupby('category').transform(time_since_last_project)
timedeltas.head(20)

We get `NaN`s here for projects that are the first in their category. We'll need to fill those in with something like the mean or median. We'll also need to reindex the result so it lines up with the other data and can be joined to it.

In [None]:
# Final time since last project
timedeltas = timedeltas.fillna(timedeltas.median()).reindex(baseline_data.index)
timedeltas.head(20)
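
Here is the whole sequence (sort, group, diff, fill, reindex) on a five-row toy table so you can follow the values by hand. The rows and the names `toy`, `gaps`, and `filled` are made up for illustration.

```python
import pandas as pd

# Toy projects, deliberately out of chronological order
toy = pd.DataFrame({
    "category": ["Games", "Music", "Games", "Music", "Games"],
    "launched": pd.to_datetime([
        "2020-01-03 00:00", "2020-01-01 06:00", "2020-01-01 00:00",
        "2020-01-02 06:00", "2020-01-01 12:00",
    ]),
})

def hours_since_previous(series):
    # Difference between consecutive launch times, in hours
    return series.diff().dt.total_seconds() / 3600.

# Sort by time so that diff() compares each project to the previous one
ordered = toy.sort_values("launched")
gaps = ordered.groupby("category").transform(hours_since_previous)["launched"]

# The first project in each category has no predecessor, so it is NaN;
# fill with the median gap, then restore the original row order
filled = gaps.fillna(gaps.median()).reindex(toy.index)
print(filled)
```

The Games gaps are 12 and 36 hours and the Music gap is 24 hours, so the median used for the two missing values is 24.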

# Transforming numerical features

If we look at the distribution of the values in `"goal"` we see most projects have goals less than 5,000 USD. However, there is a long tail of goals going up to 100,000 USD. Some models work better when the features are normally distributed, so it might help to transform the goal values. Common choices for this are the square root and natural logarithm. These transformations can also help constrain outliers.

Here I'll transform the goal feature using the square root and log functions and plot the resulting distributions.

In [None]:
plt.hist(ks.goal, range=(0, 100000), bins=50);
plt.title('Goal');

In [None]:
plt.hist(np.sqrt(ks.goal), range=(0, 400), bins=50);
plt.title('Sqrt(Goal)');

In [None]:
plt.hist(np.log(ks.goal), range=(0, 25), bins=50);
plt.title('Log(Goal)');
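
The histograms show the effect visually. As a rough numeric check, you can compare the sample skewness before and after each transformation. The sketch below uses synthetic long-tailed values rather than the real `goal` column; the log-normal parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly positive, long-tailed values standing in for "goal"
rng = np.random.default_rng(0)
fake_goal = pd.Series(rng.lognormal(mean=8, sigma=1.5, size=10_000))

# Skewness near 0 means a roughly symmetric distribution
raw_skew = fake_goal.skew()
sqrt_skew = np.sqrt(fake_goal).skew()
log_skew = np.log(fake_goal).skew()

print(f"raw: {raw_skew:.2f}, sqrt: {sqrt_skew:.2f}, log: {log_skew:.2f}")
```

On data like this the square root pulls the tail in partway and the log removes most of the skew. Real data can contain zeros, in which case `np.log1p` may be a safer choice than `np.log`.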

The log transformation won't help our model since tree-based models are invariant to monotonic transformations of the input features: the splits depend only on how the values are ordered. However, it should help if we had a linear model or neural network.
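
A quick way to convince yourself of this is to check that a split threshold on the raw values and the matching threshold on the log values send every row to the same side. This is only a sketch with made-up numbers, not a full decision tree.

```python
import numpy as np

# Made-up goal values and an arbitrary split threshold
goal = np.array([100.0, 500.0, 1000.0, 5000.0, 100000.0])
threshold = 750.0

# Rows that would go to the left branch before and after the transform
left_raw = goal <= threshold
left_log = np.log(goal) <= np.log(threshold)

# The log keeps the ordering, so the two partitions are identical
same_split = bool(np.array_equal(left_raw, left_log))
print(same_split)
```

Because every split on the raw feature has an equivalent split on the transformed feature, the tree has the same set of partitions to choose from in both cases.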

Other transformations include squares and other powers, exponentials, etc. These might help the model discriminate, like the kernel trick for SVMs. Again, it takes a bit of experimentation to see what works. One method is to create a bunch of new features and later choose the best ones with feature selection algorithms.

Next up, you'll get practice generating features with the TalkingData ad data.
