# Feature Preprocessing and Generation wrt Models

Always have to preprocess features and create new ones from existing ones.

Cover:
1. Feature preprocessing
2. Feature generation
3. Their dependence on a model type

Will cover: numeric, categorical, datetime, and coordinate features (and missing values).

Different kinds of features: binary (0 or 1), numeric, counts, categorical, id (these are unique on each row, they are _not_ numeric features), text.

**Why do we care about different features having different types?**
There is a strong connection between preprocessing and the model we use, and each feature type has its own common feature generation methods.

## Feature Preprocessing

Usually we can't just take the features, fit our fave model and expect it to get great results.

Each feature has its own ways of being preprocessed to improve the quality of the model. Choice of preprocessing depends on the model we are going to use.

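For example, if the target depends on a categorical feature in a non-linear way, a linear model can't use that feature well until we one-hot encode it (OHE), because OHE gives the model a separate weight for each category. But RF doesn't need you to OHE variables, it can deal with non-linear dependencies just fine by splitting on the feature directly.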

## Feature Generation

If we have a linear trend, we can add a feature that a linear model can use directly. E.g. if we are doing sales predictions and sales grow steadily, we can add the number of weeks passed as a feature (week 1, 2, 3, etc.) as this will give it a linear thing to grab on to.

But a GBDT will use this feature to calculate something like the mean target value for each week. The explanation was a bit confusing, but basically the idea is that a GBDT can't extrapolate a linear trend: for weeks beyond the training range, it can only predict values close to what it saw for the last weeks in the training data.
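To make the point about trends concrete, here is a minimal sketch on made-up data. It uses a single `DecisionTreeRegressor` as a stand-in for the trees inside a GBDT (my assumption, to keep the example small), and the variable names are just for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy data (made up): sales grow by exactly 10 units per week.
weeks_train = np.arange(1, 11).reshape(-1, 1)  # weeks 1..10
sales_train = 10.0 * weeks_train.ravel()       # 10, 20, ..., 100

linear = LinearRegression().fit(weeks_train, sales_train)
tree = DecisionTreeRegressor(random_state=0).fit(weeks_train, sales_train)

# Week 15 lies outside the range seen during training.
week_future = np.array([[15]])
linear_pred = linear.predict(week_future)[0]  # follows the trend
tree_pred = tree.predict(week_future)[0]      # stays at the value of the last training week
```

The linear model continues the trend, while the tree can only return a value it has already seen.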

## Numeric Features

We'll look at preprocessing for tree-based and non-tree based. Then we'll look at feature generation.

Some models depend on feature scale, some do not. Tree-based models don't care about scale; non-tree models (linear models, KNN, NNs) do. If you look at the decision boundaries, trees give hard, square (axis-aligned) boundaries, while non-tree models give smoother, often curved ones.

If we use linear models, we almost always use regularization as it usually gives us a better fit. However, the impact of regularization depends on feature scale (I went through this in detail for Jedha). Intuitively, a feature on a small scale needs a large coefficient to have the same effect, and large coefficients are exactly what regularization penalizes, so features on different scales get regularized by different amounts. Plus, optimization techniques can work differently depending on the feature's scale. Thus, we will want to scale our features when using linear models.

Also, gradient descent methods can go crazy without proper scaling. Thus for NNs we need to apply similar preprocessing techniques as for linear models.

Different feature scalings lead to different model quality. In this sense, it is another hyperparameter that you need to optimize.

Easiest thing to do is scale all features to be on the same scale e.g. using `MinMaxScaler`. Or, of course, we can apply `StandardScaler`. 

After either normalization or standardization, each feature will have a similar level of impact on non-tree-based models. Or rather, each will have the opportunity to have the same amount of impact. Obviously some will be better than others.
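A quick sketch of both scalers on made-up data (the columns and values are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (made up): column 0 is age in years, column 1 is income.
X = np.array([[20.0, 20_000.0],
              [30.0, 50_000.0],
              [40.0, 80_000.0]])

X_minmax = MinMaxScaler().fit_transform(X)      # each column mapped to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each column gets mean 0 and std 1
```

On real data you would normally `fit` the scaler on the training set only and then call `transform` on the test set with the same fitted scaler.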

Note for KNN: we know that the bigger a feature's scale is, the more it contributes to the distances and so the more important it will be for KNN. So, we can boost our best features by actually making their scales _larger_. Very clever!

### Outliers

This is especially important for linear models.

To protect linear models from outliers, we can clip features between two chosen values of lower bound and upper bound. We choose them as some percentile of that feature e.g. the 1st and 99th percentiles. This is a well known practice for financial data and is called winsorization (I did this with Evan!)
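A minimal sketch of clipping at the 1st and 99th percentiles, on synthetic data with one planted outlier (the data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # synthetic feature
x[0] = 50.0                # plant one extreme outlier

# Bounds are chosen as percentiles of the feature itself.
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)  # values outside [lower, upper] are set to the bounds
```

As with scaling, the bounds would normally be computed on the training data and reused for the test data.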

### Rank

The rank transformation sets the spaces between properly sorted values to be equal. This can be better than MinMaxScaler if we have outliers as rank will move the outliers closer to the other values. In other words, it just replaces the feature values with their positions in sorted order: 0, 1, 2, 3, 4 etc.

Linear models, KNN and NNs can benefit from this additional kind of transformation if we have no time to handle outliers manually.

We can do `from scipy.stats import rankdata` to use it. Note that to apply it to test data, you need to store the created mapping from feature values to their rank values. Or, alternatively, you can concatenate train and test data before applying the rank transformation to get the proper ranking. This works in competitions but would obviously not work well in the real world, where the test data is not available up front!
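A small sketch of the rank transform with `rankdata` (note that it starts ranks at 1, not 0). For the test data, interpolating between the stored (value, rank) pairs with `np.interp` is just one possible way to reuse the mapping; that choice is my assumption, not something from the course.

```python
import numpy as np
from scipy.stats import rankdata

# Toy training feature with a huge outlier (made-up values).
x_train = np.array([1.0, 10.0, 100.0, 100000.0])
train_ranks = rankdata(x_train)  # ranks start at 1: [1., 2., 3., 4.]

# Store the mapping value -> rank, sorted by value, and interpolate for unseen values.
order = np.argsort(x_train)
x_test = np.array([5.0, 100.0, 1e6])
test_ranks = np.interp(x_test, x_train[order], train_ranks[order])
```

Test values outside the training range are clamped to the smallest or largest training rank.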


### Transforms

These are especially helpful for non-tree based models and especially NNs.

Log and sqrt transforms.

`np.log(1+x)` and `np.sqrt(1+x)`

Note that both need non-negative feature values. The shift by 1 matters for the log: `log(0)` is undefined, whereas `log(1 + 0) = 0`, so zeros in the dataset are no longer a problem.

Both can be useful as they bring values that are too big closer to the feature's average value. Plus the values near zero become more distinguishable.

Using these can make a big difference to your model's performance!
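A tiny sketch of both transforms on made-up, non-negative values, to see how the big values get pulled in:

```python
import numpy as np

# Made-up, non-negative feature with a very long tail.
x = np.array([0.0, 1.0, 10.0, 1000.0, 100000.0])

x_log = np.log1p(x)      # same as np.log(1 + x), but more accurate for tiny x
x_sqrt = np.sqrt(1 + x)  # milder compression than the log
```

The raw values span five orders of magnitude, while the log-transformed ones all sit between 0 and about 11.5.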

### Important Note

Sometimes it is beneficial to train a model on concatenated dataframes produced by different preprocessings. Or to mix models trained on differently preprocessed data. Linear models, KNN and NNs can benefit hugely from this.

Now we have discussed numeric feature preprocessing, how model choice impacts it and what the most commonly used preprocessing methods are.

## Feature Generation

This is creating new features using knowledge about the features and the task. It makes model training more simple and effective.

Sometimes we can engineer them using prior knowledge and logic. Sometimes we have to dig into the data, create and check hypotheses and use this derived knowledge and our intuition to derive new features.

Having prior knowledge is good. But obvs is not something we can rely on. Being able to do solid EDA and find new and better features is what makes a good competitor a great one. They will go over how to do this in detail in future videos.

One example is if we have real estate data with 2 features, area (in square meters) and price. We can combine these and make a new feature, price per square meter, which is just the price divided by the area.

Another is if we have horizontal distance to a point and vertical distance to a point, we may as well add the direct distance to the point by using Pythagoras' theorem to help us!

By adding features that are sums, differences, products or ratios of other features, we are helping every single one of our models. Even GBDTs would benefit from adding some of this stuff as they would struggle to approximate these kinds of calculations themselves.
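The two examples above, sketched on a made-up dataframe (the column names `price`, `area_m2`, `dx` and `dy` are my own):

```python
import numpy as np
import pandas as pd

# Made-up real estate data; dx/dy are horizontal/vertical distances to some point.
houses = pd.DataFrame({
    "price": [300_000.0, 450_000.0],
    "area_m2": [60.0, 100.0],
    "dx": [3.0, 6.0],
    "dy": [4.0, 8.0],
})

houses["price_per_m2"] = houses["price"] / houses["area_m2"]  # ratio of two features
houses["distance"] = np.hypot(houses["dx"], houses["dy"])     # sqrt(dx**2 + dy**2)
```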

### Fractional Prices

Wow ok so this is cool.

If one of our features is price, we can create a new feature which is `fractional_part` which just takes the value after the decimal point. This lets the model see if people's perception of price impacts what we are modelling.
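A minimal sketch of `fractional_part` on made-up prices. The rounding to 2 decimals is my addition to clean up floating point noise and assumes prices are given in whole cents.

```python
import pandas as pd

# Made-up prices.
prices = pd.Series([2.49, 2.99, 0.99, 5.00])

# Value after the decimal point, e.g. 2.49 -> 0.49.
fractional_part = (prices % 1).round(2)
```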

We can find similar patterns in tasks that require distinguishing between a human and a robot. Humans are irrational, robots (at least for now) are not. For example, if we have auction financial data, we may observe that people tend to set round numbers as prices. Likewise, if we are trying to find spambots on social media sites, we can be sure that no human has ever read loads of messages each with an exact interval of one second.

## Summary

- The distinction is between tree and non-tree models. The latter can be heavily impacted by feature scaling and thus all numeric features need to be preprocessed.
- Most commonly used preprocessing techniques:
  - `MinMaxScaler` to [0, 1]
  - `StandardScaler` to mean 0 and std 1
  - Rank - sets spaces between sorted values to be equal i.e. ranks values from 0, 1, 2, 3...
  - `np.log(1+x)` and `np.sqrt(1+x)`
- Feature generation is powered by an understanding of the data.

## Categorical and Ordinal Features

Ordinal features are _ordered_ categorical features. In the Titanic dataset the feature `Pclass` is ordinal and has three values: 1, 2, 3. Class 1 is 1st class and is more expensive than 3rd class.

Note that ordinal features differ from numeric features. 

If `Pclass` was numeric, we could say that the difference between the first and the second class is equal to the difference between the second and the third class. But, because `Pclass` is ordinal, we don't know which difference is bigger.

The example of class seems a bit odd to me. But if we think about the earthquake damage prediction competition we see that we have ordinal classification task (I made this up but it makes sense) and we have 1 being little damage, 2 moderate and 3 being almost total destruction (should there not be one that is 'quite high'?!). So, I'd say the difference between 1-2 is _smaller_ than between 2-3.

Ordinal feature examples:
- Driver's license type: A, B, C, D
- Education: kindergarten, school, undergraduate, masters, doctoral

The categories are sorted in an increasingly complex order which can prove to be useful.

The simplest way to encode a categorical feature is to map its unique values to different numbers... (I would have thought that this introduces an ordering that we don't want??).

This is called Label Encoding.

It is fine to use Label Encoding for trees as they can split the feature and extract most of the useful values in categories on their own. This makes sense as a tree can just split the feature anywhere it wants, but the other models would have to fit a single nice line through the encoded values, which is not really possible when the numbering is arbitrary.

Some situations where it is better to use label encoding rather than OHE for tree based models:
- When the number of categorical features in the dataset is huge
- When we can create a label encoder that assigns close labels to similar (in terms of target) categories. (this was taken from the quiz and not explained at all either implementation or theory in the videos...)
- When categorical feature is ordinal obvs makes sense to use label encoding (same is true if the model is linear!)

When OHE can be better than Label:
- If the target dependence on the label encoded feature is very non-linear i.e. values that are close to each other in the label encoded feature correspond to target values that aren't close. OHE gives nice clear distinct feature boundaries that our tree can make use of. If a feature is important, a tree would try to make a lot of splits to isolate each important category value on its own. But because trees are built in a greedy way, it can be hard to select one important value in a label encoded vector. This won't be a problem if you use OHE.

Note for linear models it is not necessarily always best to OHE (since if the cat feature is ordinal, it will need to be label encoded).

Note: if the number of cat features in the dataset is huge and you OHE them, you will take up a lot of memory (so use sparse matrices), and if building trees you could have a situation where the numerical features are hardly ever used to build trees (since a random subset of features is usually chosen). You can change this by modifying how many features each tree can use.

Note though that it is not necessarily better to use label encoding over one-hot. It will be less computationally expensive but better results are not guaranteed (duh! Not much is guaranteed in ML).

But non-tree based models won't be able to use this feature effectively if it has been label encoded.

If we have a categorical feature that is not already a number, e.g. given by letters, we need to transform it into numbers before we can use it.

We can apply encoding in:
- Alphabetical/sorted order - [S, C, Q] -> [2, 0, 1] - `sklearn.preprocessing.LabelEncoder` (labels start at 0)
- Order of appearance - [S, C, Q] -> [0, 1, 2] - `pandas.factorize` - can make sense if the data is sorted in a meaningful way beforehand.
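Both encoders side by side on a tiny made-up `Embarked` column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up column in the style of Titanic's `Embarked`.
embarked = pd.Series(["S", "C", "Q", "S"])

# Sorted order: C -> 0, Q -> 1, S -> 2
sorted_codes = LabelEncoder().fit_transform(embarked)

# Order of appearance: S -> 0, C -> 1, Q -> 2
appearance_codes, uniques = pd.factorize(embarked)
```

`uniques` keeps the categories in the order they were first seen, so the codes can be mapped back to the original values.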

### Frequency Encoding

We can map values to their frequencies! Now this is cool.

```python
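# Number of rows in each category (assumes `titanic` is a pandas DataFrame)
encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
# Replace each category with its share of the rows
titanic['enc'] = titanic['Embarked'].map(encoding)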
```

Very clever. I would not have known how to do this!

This is helpful for both tree and non-tree models. If the frequency of a category is correlated with the target value, a linear model will utilize this dependency. And, for the same reason, a tree model will need to use fewer splits.

This preserves info about the distribution of the values and can help both linear and tree models.

Note: if you have multiple categories with the same frequency, they will not be distinguishable in this new feature. So we could apply a rank operation here to deal with such ties using `from scipy.stats import rankdata`.
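One possible reading of the tie-breaking idea, sketched on made-up data: rank the category frequencies with `method="ordinal"`, which gives every category a distinct rank. The choice of `method="ordinal"` is my assumption, and the order among tied categories is arbitrary (here it follows the alphabetical order of the categories).

```python
import pandas as pd
from scipy.stats import rankdata

# Made-up column where B and C have the same frequency.
cats = pd.Series(["A", "A", "A", "B", "B", "C", "C"])

freq = cats.value_counts(normalize=True).sort_index()  # A: 3/7, B: 2/7, C: 2/7
ranks = pd.Series(rankdata(freq.to_numpy(), method="ordinal"), index=freq.index)

freq_enc = cats.map(freq)   # B and C collide here
rank_enc = cats.map(ranks)  # B and C stay distinguishable here
```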

There are other ways to do label encoding and he encourages us to be creative when constructing them.

### One-hot encoding

For non-tree based models, label encoding usually doesn't work well (unless the feature is ordinal). So, we use one-hot encoding instead. Note another benefit of this is that the features are already scaled (since the min is 0 and the max is 1 for each column).

Note: if you have a few numeric features and hundreds of one-hot encoded features, it can become difficult for tree-based methods to use the first ones efficiently (this is the situation I am in atm with the earthquake challenge). Tree models will slow down and not always improve results.

Note that if our categories have loads of unique values, we may add too many new columns with few non-zero values and now our dataset is sparse. My personal opinion is that what counts as 'too many' is going to be learned through trial and error. But either way, it will make sense to store our data as sparse matrices if this happens. This way, we only store the non-zero elements of our array and thus save a lot of memory.

Going with sparse matrices makes sense if the number of non-zero values is _far less than half of all the values_. Sparse matrices are often useful if we work with a lot of categorical features or text data. Most libraries can work with sparse matrices directly.
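A minimal sketch with scikit-learn's `OneHotEncoder`, which returns a SciPy sparse matrix by default (the data is made up):

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up column in the style of Titanic's `Embarked`.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(df[["Embarked"]])  # sparse: only non-zeros are stored

dense = X_onehot.toarray()  # dense copy, only sensible for small data like this
```

Each row has exactly one non-zero entry, so the sparse matrix stores 4 values instead of 12.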

## Categorical Feature Generation

One of the most useful examples of feature generation is feature interaction between several categorical features. This is usually useful for non-tree based models. Especially helpful for linear models and KNNs.

If the target depends on both the sex and the class, a linear model could adjust its predictions for every possible combination of these two features and get a better result. But how can we make this happen?

We could do this by concatenating strings from both columns and one-hot encoding the results i.e. we now have columns 1male, 1female, 2male, 2female, 3male, and 3female. Now our model can find the optimal coefficient for every interaction and improve.
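The concatenate-then-one-hot idea, sketched on a few made-up Titanic-style rows:

```python
import pandas as pd

# A few made-up Titanic-style rows.
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male"],
})

# Concatenate the two columns as strings, e.g. 1 and "male" -> "1male".
df["pclass_sex"] = df["Pclass"].astype(str) + df["Sex"]

# One column per combination that actually occurs in the data.
interactions = pd.get_dummies(df["pclass_sex"])
```

Only combinations present in the data get a column, so with this toy data there is no `1female` or `2male` column.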

## Summary

- Ordinal features are sorted in some meaningful order
- Label encoding maps categories to numbers (this does not automatically mean that the categories are ordered, a tree based model can handle label encoded features no problem but for non-tree based, we must use one-hot encoding).
- Frequency encoding maps categories to their frequencies
- Label and frequency encodings are often used for tree-based models
- One-hot encoding is often used for non-tree based models
- Interactions of categorical features can help linear models and KNN

## Datetime and Coordinates

These differ significantly from numeric and categorical features. We can infer the meaning of both of these very easily and thus can come up with some specific ideas about feature generation.

### Datetime

It is not just year. It can also include day, week, and (of course) time.

Can be divided into two broad categories:
1. Time moments in a period (periodicity)
2. Time since a particular event

**Periodicity**
Day number in week, month, season, year, second, minute, hour.
Useful to capture repetitive patterns in the data. If we know about non-common periods that influence the data, we can add them as well. For example, if patients take medicine once every 3 days, we can consider this a special time period.

**Time Since**
- Row-independent moment e.g. since 00:00:00 UTC 1 January 1970
- Row-dependent important moment e.g. number of days left until the next holiday, or time passed since the last holiday or the last sales campaign (this seems v useful if predicting sales!).

So instead of just date, we have the week day (as a number 0-6), day number since start of 2014 (0-364), is_holiday (binary), days_till_holidays (countdown until the next holiday), number of sales.
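A sketch of how these columns could be built with pandas. The dates and the holiday calendar are made up, and the toy data assumes every date has a holiday on or after it (otherwise the lookup below would run off the end of the calendar).

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({"date": pd.to_datetime(["2014-01-01", "2014-01-04", "2014-01-06"])})
holidays = pd.to_datetime(["2014-01-06", "2014-04-21"])  # hypothetical, sorted calendar

sales["weekday"] = sales["date"].dt.dayofweek          # Monday=0 ... Sunday=6
sales["day_of_year"] = sales["date"].dt.dayofyear - 1  # 0-based day number within the year
sales["is_holiday"] = sales["date"].isin(holidays).astype(int)

# Index of the first holiday on or after each date.
next_idx = np.searchsorted(holidays.to_numpy(), sales["date"].to_numpy())
next_holiday = pd.Series(holidays[next_idx], index=sales.index)
sales["days_till_holiday"] = (next_holiday - sales["date"]).dt.days
```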

Woah yeah we can do SO MUCH with date data.

**Difference Between Dates**
Sometimes we have several datetime columns in our data. The most straightforward idea here is to subtract one feature from another. Or perhaps subtract the features we have just generated.

One example could be calculating the difference between the date someone last purchased something vs. the time we called them. Perhaps big differences lead to more churn?
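The churn example as a sketch, with made-up column names and dates:

```python
import pandas as pd

# Made-up customer data with two datetime columns.
customers = pd.DataFrame({
    "last_purchase_date": pd.to_datetime(["2014-01-01", "2014-02-10"]),
    "last_call_date": pd.to_datetime(["2014-01-15", "2014-02-11"]),
})

# Subtracting datetimes gives timedeltas; .dt.days turns them into a numeric feature.
customers["date_diff"] = (
    customers["last_call_date"] - customers["last_purchase_date"]
).dt.days
```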

Note that once you've processed datetime, you will usually get numeric features (time passed since 2000) or categorical features (day of week). And thus, now _these_ features will need to be treated accordingly with the topics above. 

## Coordinate Data

Let's assume we are doing a house price prediction problem.

Generally you want to calculate distances to important points on the map. For example, could add distance to nearest shop, to the best school in the neighborhood and so on. If you do not have this, you can extract interesting points on the map from all data you have available to you.

For example, you can divide your map into grid squares and, within each square, find the most expensive flat. Then, for every other object in this square, add the distance to that flat.

Or you can organize data points into clusters and use centers of clusters as the important points. 
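A sketch of the cluster-center idea using `KMeans` on synthetic coordinates. The choice of KMeans and of 2 clusters is my assumption; any clustering method would do.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic coordinates: two tight groups of 50 points each.
rng = np.random.default_rng(0)
coords = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.1, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)

dist_to_centers = kmeans.transform(coords)     # distance from each point to every center
dist_to_nearest = dist_to_centers.min(axis=1)  # distance to the closest center
```

Either the full `dist_to_centers` matrix or just `dist_to_nearest` could be added as new features.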

Or find special areas e.g. those with v old buildings and add distance to this one.

Can also calculate aggregated stats for objects surrounding an area e.g. number of lets around a particular point which can then be interpreted as areas of popularity. Or add mean realty price which indicates how expensive the area is around selected points.

Both distances and aggregated statistics are often useful in tasks with coordinates.

Another killer idea: if you are using decision trees, it may make sense to rotate your coordinates slightly so that a diagonal decision boundary becomes a perfectly horizontal or vertical one (since trees can only split parallel to the axes). It can be hard to know what rotations to make, so you can test different ones and see which performs best. Common choices are rotations of 45 or 22.5 degrees.
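A minimal sketch of rotating coordinates with a standard 2D rotation matrix (the helper `rotate` and the points are made up for illustration):

```python
import numpy as np

def rotate(coords, degrees):
    """Rotate 2D points counter-clockwise around the origin."""
    theta = np.deg2rad(degrees)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return coords @ rotation.T

# The first two points lie on the diagonal y = x.
coords = np.array([[1.0, 1.0], [2.0, 2.0], [1.0, 0.0]])
rot45 = rotate(coords, 45)
```

After the 45 degree rotation, the points that were on the diagonal all have x = 0, so a tree can separate the two sides of that diagonal with a single split on the rotated x coordinate.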

## Summary

Summary of most frequent methods used for feature generation from datetime and coordinates

Datetime
- Periodicity
- Time since row-independent/dependent event
- Difference between dates

Coordinates
- Interesting places from train/test data or additional data
- Centers of clusters
- Aggregated statistics