# 3. Feature Engineering

## 3.1. A quick recap

<span style="font-size:14px;line-height:1.5">

In Chapter 1 we performed some basic feature engineering to gain insight into the dataset. The tables below summarise the main observations about the original features and the simple features we derived from them:

<b>Original Features</b>

| Feature  | Notes                 |
| -------- | ------------ |
| Age      | Contains missing data  |
| Sex      |                        |
| Class    |                        |
| Fare     | Contains zero-coded missing data  |
| Ticket   |                                        |
| Embarked | Contains missing data                  |
| Cabin    | Contains inconsistent string formats   |
| Name     | Contains inconsistent string formats   |
| SibSp     | |
| Parch     | |
| PassengerId     | |


<b>Engineered Features</b>

| Feature   | Based On | Notes                 |
| --------  | -------- | ------------ |
| age_measure  | Age  | Indicates confidence in age measurement|
| age_group    | Age  | Rounded age group; not recommended for use in the model |
| log_fare     | Fare | Log-transform reveals additional peaks in distribution, could be useful |
| people_on_ticket | Ticket | Shows number of people travelling on same ticket as passenger |
| ticket_prefix | Ticket | String at start of ticket number, common to multiple tickets (possible subsection?) |
| n_rooms | Cabin | Highlights that some passengers had up to 4 rooms |
| section | Cabin | The major division of the ship in which the passenger was staying (also a refinement on Class) |
| title | Name | 'Mr','Mrs','Dr','Rev' etc. |
| professional_title | Name | Boolean indicating if title indicates a profession (lower survival) |
| noble_title | Name | Boolean indicating if title indicates an aristocratic label (higher survival) |
| family_size | SibSp + Parch | Total number of relatives |
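
A few of the features in the table can be sketched in a handful of lines. This is a minimal illustration on a toy DataFrame, not the Chapter 1 implementation: the rows in `toy` are made up, and using `np.log1p` (rather than a plain log) is an assumption made here so that zero fares stay finite.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for `train` (illustrative values, not real data)
toy = pd.DataFrame({
    'Fare':   [7.25, 71.28, 0.0, 8.05],
    'Ticket': ['A/5 21171', 'PC 17599', 'PC 17599', '373450'],
    'SibSp':  [1, 1, 0, 0],
    'Parch':  [0, 0, 2, 0],
})

# Assumption: log(1 + Fare), so that a zero fare maps to 0 instead of -inf
toy['log_fare'] = np.log1p(toy['Fare'])

# Number of rows that share each ticket string
toy['people_on_ticket'] = toy['Ticket'].map(toy['Ticket'].value_counts())

# Total number of relatives, as defined in the table
toy['family_size'] = toy['SibSp'] + toy['Parch']
```

Note that zero fares are zero-coded missing values, so a zero in `log_fare` under this sketch still marks a missing fare rather than a genuinely free ticket.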




Fare
- log transform

Age
- measurement type (estimated, observed or missing/imputed)
- age group

Ticket information
- people on ticket
- one-hot encoding ticket prefixes
- target encoding ticket prefixes
- ticket embedding

</span>

In [None]:
import numpy as np

def get_age_measurement_type(x):
    """Classify an Age value as 'missing', 'estimated' or 'observed'."""
    if np.isnan(x):
        return 'missing'
    # Estimated ages are recorded in the form xx.5
    if (x - np.floor(x)) == 0.5:
        return 'estimated'
    return 'observed'

# `train` is the training DataFrame loaded earlier
train['age_measure'] = train['Age'].apply(get_age_measurement_type)
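
The same classification can also be written without a row-by-row `apply`. The following is a minimal vectorised sketch using `np.select`; the `ages` Series is a made-up stand-in for `train['Age']`, and the rule (a fractional part of exactly 0.5 means "estimated") is the one used above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train['Age']
ages = pd.Series([22.0, 28.5, np.nan, 0.83, 40.0])

# Fractional part of each age; NaN stays NaN and compares False below
frac = ages % 1

# Conditions are checked in order, so 'missing' takes priority
age_measure = pd.Series(
    np.select(
        [ages.isna(), frac == 0.5],
        ['missing', 'estimated'],
        default='observed',
    ),
    index=ages.index,
)
```

For a dataset of this size the difference in speed is negligible, so the choice between the two forms is mostly a matter of readability.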

### One Hot Encoding

#### The Dummy Variable Trap

One-hot encoding introduces redundancy: for *N* categories, each row has exactly one encoded column set to 1, so the final column can always be determined from the other *N* - 1 columns in the dummy dataset.

This is easiest to see for a column with two categories: one-hot encoding produces two columns whose values are perfectly inversely correlated. This introduces unwanted **multicollinearity**, which we want to avoid.
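
The two-category case can be checked directly. This is a small sketch on made-up values for a `Sex`-like column, using `pd.get_dummies`; the `toy` DataFrame is illustrative only.

```python
import pandas as pd

# Made-up values for a two-category column
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male', 'male']})

# Full one-hot encoding: one column per category
dummies = pd.get_dummies(toy['Sex'], prefix='Sex', dtype=int)

# Each column is exactly 1 minus the other
correlation = dummies['Sex_female'].corr(dummies['Sex_male'])
```

Here `correlation` comes out at -1 (up to floating-point error), since `Sex_female` is always `1 - Sex_male`.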

The solution is simple: drop one of the columns produced by one-hot encoding. The dropped category is still represented in the remaining data, as the rows in which every remaining encoded column is zero.
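
One way to drop a column is the `drop_first` argument of `pd.get_dummies`. The sketch below uses made-up `Embarked` values; which category gets dropped (the first in sorted order) is a property of this particular approach, not a recommendation about which category to choose.

```python
import pandas as pd

# Made-up values for a three-category column
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Full encoding: every row sums to 1, so one column is redundant
full = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)

# Reduced encoding: the first category ('C') is dropped
reduced = pd.get_dummies(toy['Embarked'], prefix='Embarked',
                         drop_first=True, dtype=int)
```

In `reduced`, the passenger who embarked at `C` appears as a row with zeros in both `Embarked_Q` and `Embarked_S`.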

