# 3. Feature Engineering
<span style="font-size:14px;line-height:1.5">
<p>
Some of the approaches we will cover in this chapter may not be well suited to the Titanic dataset. However, that's a feature rather than a bug in our book design. Focusing on a single dataset is intentional: it allows us to discuss not only the tools that are relevant to a particular problem, but also those that are not suitable. This approach mirrors the workflow one normally encounters in the real world, where the problem is centred on a dataset that you need to work with, rather than a technique you need to apply.
</p><p>

## 3.1. A quick recap

In Chapter 1 we performed some basic feature engineering to gain insight into the dataset. The tables below summarise the main insights about the original features and the simple features we computed from them:</p>

<b>Original Features</b>

| Feature     | Notes                                |
| ----------- | ------------------------------------ |
| Age         | Contains missing data                |
| Sex         |                                      |
| Class       |                                      |
| Fare        | Contains zero-coded missing data     |
| Ticket      |                                      |
| Embarked    | Contains missing data                |
| Cabin       | Contains inconsistent string formats |
| Name        | Contains inconsistent string formats |
| SibSp       |                                      |
| Parch       |                                      |
| PassengerId |                                      |


<p><b>Engineered Features</b></p>

| Feature   | Based On | Notes                 |
| --------  | -------- | ------------ |
| age_measure  | Age  | Indicates confidence in age measurement|
| age_group    | Age  | Rounded age group; not recommended for use in the model |
| log_fare     | Fare | Log-transform reveals additional peaks in distribution, could be useful |
| people_on_ticket | Ticket | Shows number of people travelling on same ticket as passenger |
| ticket_prefix | Ticket | String at start of ticket, common to multiple tickets (possible subsection?) |
| ticket_number | Ticket | Numeric part of ticket, shows indications of structure associated with ship |
| n_rooms | Cabin | Highlights that some passengers had up to 4 rooms |
| section | Cabin | The major division of the ship in which the passenger was staying (also a refinement on Class) |
| title | Name | 'Mr','Mrs','Dr','Rev' etc. |
| professional_title | Name | Boolean indicating if title indicates a profession (lower survival) |
| noble_title | Name | Boolean indicating if title indicates an aristocratic label (higher survival) |
| family_size | SibSp + Parch | Total number of relatives |

Rather than duplicate the code used to extract these features, we will load a modified version of the data.
</span>

In [None]:
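import numpy as np

def get_age_measurement_type(x):
    """Label an Age value as 'missing', 'estimated' or 'observed'."""
    if np.isnan(x):
        return 'missing'
    # Ages with a fractional part of exactly .5 are treated as estimates
    if (x - np.floor(x)) == 0.5:
        return 'estimated'
    return 'observed'

# `train` is assumed to be the training DataFrame, already loaded
train['age_measure'] = train['Age'].apply(get_age_measurement_type)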
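
The row-by-row `apply` above can also be written in vectorised form. The following is a minimal sketch of one such alternative using `np.select`; the `ages` series is a hypothetical stand-in for `train['Age']`, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Age column (hypothetical values, not the real data)
ages = pd.Series([22.0, np.nan, 28.5, 0.42, 40.0], name="Age")

# Conditions are checked in order; the first one that matches wins
conditions = [ages.isna(), (ages % 1) == 0.5]
choices = ["missing", "estimated"]

# Rows matching neither condition fall back to 'observed'
age_measure = pd.Series(
    np.select(conditions, choices, default="observed"),
    index=ages.index,
)
print(age_measure.tolist())
```

Both versions apply the same rule, so the choice between them is mostly a matter of readability and speed on larger frames.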

### One Hot Encoding

#### The Dummy Variable Trap

One-hot encoding leads to redundancy: for *N* categories, the value of any one column can be determined from the other *N* − 1 columns in the dummy dataset.

This is easiest to see in a column with two categories: one-hot encoding produces two columns whose values are perfectly negatively correlated. This introduces unwanted **multicollinearity**, which we want to avoid.

The solution to the problem is simple: drop one of the columns resulting from one-hot encoding. The dropped category is still represented in the remaining data, as the rows in which all of the remaining encoded columns are zero.
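
As a minimal sketch of the point above, the example below encodes a small toy frame (hypothetical rows, not the real data) with `pd.get_dummies`, first keeping every column and then with `drop_first=True`.

```python
import pandas as pd

# Toy frame with two categorical columns (hypothetical rows)
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Full one-hot encoding: one column per category
full = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=int)

# Dropping the first category of each feature removes the redundancy
reduced = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())

# The two Sex columns in the full encoding are perfectly negatively correlated
print(full["Sex_female"].corr(full["Sex_male"]))

# Row 1 is (female, C): both dropped categories, so every remaining column is 0
print(reduced.iloc[1].tolist())
```

Which category is dropped matters only for interpretation: it becomes the baseline against which the remaining columns are read.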



## Social Networks
<span style="font-size:14px;line-height:1.5">
Passengers can be connected by a variety of social relationships:
<ul>
    <li>Family (e.g. mother-daughter)</li>
    <li>Friendship </li>
    <li>Employment
        <ul>
        <li>Hierarchical (e.g. boss-employee)</li>
        <li>Equal level (e.g. colleagues)</li>
        </ul>
    </li>
</ul>

The Titanic dataset contains examples of all these types of relationships: families, passengers accompanied by staff, and groups of employees.

</span>

### Women-Child Grouping

[Chris Deotte](https://www.kaggle.com/code/cdeotte/titanic-using-name-only-0-81818/notebook) makes the case that, if we assume that surname can be used as a proxy for family group (i.e. that no two unrelated individuals share a surname), we can estimate the family survival rate for women and children. This then provides a strong predictor of survival for the other women and children within the same social group.
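
The idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Deotte's exact procedure: the toy rows and the `wc_group_survival` column name are hypothetical, children are identified only by the "Master" title, and each passenger's rate is computed from the *other* women and children sharing their surname so that their own outcome does not leak into the feature.

```python
import pandas as pd

# Toy passengers (hypothetical rows in the dataset's "Surname, Title. Given" format)
toy = pd.DataFrame({
    "Name": ["Smith, Mrs. Anna", "Smith, Miss. Beth", "Smith, Mr. John",
             "Jones, Mrs. Clara", "Jones, Master. Tom", "Brown, Miss. Eve"],
    "Sex": ["female", "female", "male", "female", "male", "female"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Surname is the text before the first comma
toy["surname"] = toy["Name"].str.split(",").str[0].str.strip()

# Assumption: women are Sex == 'female', boys carry the title 'Master.'
is_wc = (toy["Sex"] == "female") | toy["Name"].str.contains("Master.", regex=False)

wc = toy[is_wc]
grp = wc.groupby("surname")["Survived"]

# Leave-one-out: survivors and head count among the *other* group members
others_survived = grp.transform("sum") - wc["Survived"]
others_count = grp.transform("count") - 1

# Groups of one have no other members, so their rate is left as NaN;
# adult men are not in `wc`, so index alignment also leaves them as NaN
toy["wc_group_survival"] = others_survived / others_count.where(others_count > 0)

print(toy[["Name", "wc_group_survival"]])
```

On real data the surname assumption will sometimes merge unrelated passengers, so the resulting feature is best treated as a noisy estimate.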
