### Load previously stored datasets
Before we think about how our features can be further improved to make them more suitable for an ML model, let's load the datasets from the previous notebook:

In [36]:
%store -r users_cleared
%store -r features
%store -r labels

OK, so before applying feature engineering techniques grounded in technical expertise, I'd like to think about what kinds of features we are actually interested in for our research.

Even though in the previous module we were not able to conclusively identify strongly correlated or clearly uncorrelated features, let's apply some critical thinking and logic.

For example, do we care about the user's `location`? I'd say not: even if it's somehow related, it's not what we want to use to build a user profile. We want that profile to be location-agnostic so that our model can be used anywhere.
Of course, accessible sporting venues, gyms, workout areas, etc. probably have a significant impact on users' behaviour, but we can't infer that from a location anyway. We don't even have an easy way to tell whether `location` corresponds to a city, a small town, a village, etc.
With that in mind, let's get rid of the `location` column:

In [37]:
features = features.drop('location', axis=1)
features.head(5)

Unnamed: 0,manufacturer,age,training_before[Y/N],gender,body_fat[%],smoking[Y/N],alcohol_times_week,occupation,education,marital_status,kids
0,apple,37,True,F,27,False,2,Clinical cytogeneticist,bachelor,divorced,2
1,apple,19,False,F,29,True,2,,school,single,0
2,oura,40,True,F,28,False,2,"Buyer, retail",bachelor,single,2
3,garmin,42,False,M,29,False,1,"Accountant, chartered",master,married,3
4,xiaomi,22,False,F,39,False,0,"Loss adjuster, chartered",master,divorced,1


OK, anything else that's not relevant for us? I'd say that we are not interested in `manufacturer` either. For one, we don't want manufacturer-specific research. For another, we don't have models, just brands, which, just like with location, is not enough to make any judgement, as many manufacturers have dozens of different wearable models with very different characteristics. Let's remove `manufacturer` for that reason:

In [38]:
features = features.drop('manufacturer', axis=1)
features.head(5)

Unnamed: 0,age,training_before[Y/N],gender,body_fat[%],smoking[Y/N],alcohol_times_week,occupation,education,marital_status,kids
0,37,True,F,27,False,2,Clinical cytogeneticist,bachelor,divorced,2
1,19,False,F,29,True,2,,school,single,0
2,40,True,F,28,False,2,"Buyer, retail",bachelor,single,2
3,42,False,M,29,False,1,"Accountant, chartered",master,married,3
4,22,False,F,39,False,0,"Loss adjuster, chartered",master,divorced,1


Now, I know I didn't use `occupation`, `education`, `marital_status` and `kids` in label generation, but I don't want to remove them just yet, as we might want to do that when retraining our model and comparing its performance. Let's move on to other, more tech-driven, approaches to feature engineering.
Let's start with `occupation`. Our users' `age` starts at 18, so it's likely that the youngest users don't have an occupation filled in. Let's check whether all rows have `occupation` filled:

In [39]:
import pandas as pd

len(features[(pd.isnull(features['occupation']))])

343
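
To put a raw count like this in context, it can help to look at the share of missing values per column instead. A minimal sketch on a toy frame (the `toy` data below is made up for illustration and is not the real `features` dataset):

```python
import pandas as pd

# Toy frame standing in for `features`; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [37, 19, 40, 42],
    "occupation": ["Clinical cytogeneticist", None, "Buyer, retail", None],
})

# isna() marks None/NaN as True; the mean of booleans is the missing fraction.
missing_share = toy.isna().mean()
print(missing_share)
```

Running the same expression on the real `features` frame would give the missing fraction for every column at once.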

OK, they do not. Imputing doesn't make much sense here either: the column is non-numeric, and a filled-in occupation would not be semantically correct. Let's remove `occupation` then, as 343 rows is more than 10% of the overall data, and it can have an impact on our model:

In [40]:
features = features.drop('occupation', axis=1)
features.head(5)

Unnamed: 0,age,training_before[Y/N],gender,body_fat[%],smoking[Y/N],alcohol_times_week,education,marital_status,kids
0,37,True,F,27,False,2,bachelor,divorced,2
1,19,False,F,29,True,2,school,single,0
2,40,True,F,28,False,2,bachelor,single,2
3,42,False,M,29,False,1,master,married,3
4,22,False,F,39,False,0,master,divorced,1


Now, let's see what columns contain categorical data and whether we can encode them.
As it turns out, out of 9 columns only 3 are non-categorical (`age`, `body_fat[%]` and `kids`). Everything else can be represented as a category.
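
One rough way to spot categorical candidates is to combine the column dtypes with the number of distinct values. The sketch below uses a small made-up frame (`toy` is an assumption, not the real dataset). Note that dtype alone misses numeric columns such as `alcohol_times_week`, which is why the distinct counts are worth a look too:

```python
import pandas as pd

# Toy frame with a few columns shaped like the ones in `features` (values invented).
toy = pd.DataFrame({
    "age": [37, 19, 40, 42],
    "gender": ["F", "F", "F", "M"],
    "smoking[Y/N]": [False, True, False, False],
    "education": ["bachelor", "school", "bachelor", "master"],
})

# Everything that is not numeric (strings, booleans) is an obvious candidate.
candidates = toy.select_dtypes(exclude="number").columns.tolist()

# Distinct values per column; a low count hints at categorical data.
distinct = toy.nunique()

print(candidates)
print(distinct)
```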

Overall, in the project we will use two types of encoding:
1. one-hot encoding;
2. label encoding.

We will use one-hot encoding when we know that the categorical column is not ordinal. Label encoding such a column could give our model the false impression that the categories have some weight or order.

The problem with one-hot encoding is that it adds a lot of new columns, so it works best when the number of categories is low. One-hot encoding is also prone to the Dummy Variable Trap; one way to mitigate it is to remove one of the newly added columns.
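
To make the trap concrete, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration). With a full one-hot encoding every row sums to exactly 1, so any single dummy column can be derived from the others; dropping one column removes that redundancy:

```python
import pandas as pd

# Toy column standing in for a non-ordinal categorical feature.
toy = pd.DataFrame({"marital_status": ["divorced", "single", "married", "single"]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(toy, dtype=int)

# Each row has exactly one 1, so the dummy columns are perfectly collinear.
row_sums = full.sum(axis=1)

# drop_first=True drops the first category's column to break the collinearity.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```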

Let's encode `alcohol_times_week` and `education` first.

For `alcohol_times_week` and `education` specifically, if you think about it, both of them are ordinal, so we will use label encoding:

In [41]:
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()

features['education'] = label_encoder.fit_transform(features['education'])
features.head()

Unnamed: 0,age,training_before[Y/N],gender,body_fat[%],smoking[Y/N],alcohol_times_week,education,marital_status,kids
0,37,True,F,27,False,2,0,divorced,2
1,19,False,F,29,True,2,3,single,0
2,40,True,F,28,False,2,0,single,2
3,42,False,M,29,False,1,1,married,3
4,22,False,F,39,False,0,1,divorced,1


Since we decided to use label encoding, we don't need to do anything about `alcohol_times_week`, as it can already be thought of as an encoded numeric value.
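
One caveat on the `education` codes above: `LabelEncoder` numbers the classes in sorted order of their names, which is why `bachelor` became `0` and `school` became `3`, so the integers do not follow the actual education ranking. If that ranking matters for the model, an explicit order is one option. A minimal sketch, assuming a hypothetical ranking (`education_order` below lists only the levels visible in the preview and would need every level present in the real data):

```python
import pandas as pd

# ASSUMPTION: illustrative ranking only; extend it with all levels in the real data.
education_order = ["school", "bachelor", "master"]

# Toy values standing in for features['education'].
toy = pd.Series(["bachelor", "school", "bachelor", "master", "master"], name="education")

# An ordered categorical keeps the ranking; .codes is each value's position in it.
# Values missing from `education_order` would get the code -1.
ordered = pd.Categorical(toy, categories=education_order, ordered=True)
codes = pd.Series(ordered.codes, name="education")

print(codes.tolist())
```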

Let's now encode `gender` and `marital_status`. I will not encode the boolean columns at all for now, as in Python they will most likely be interpreted as `0` and `1` automatically when needed:

In [42]:
import pandas as pd

features = pd.get_dummies(features)
features.head(5)

Unnamed: 0,age,training_before[Y/N],body_fat[%],smoking[Y/N],alcohol_times_week,education,kids,gender_F,gender_M,marital_status_divorced,marital_status_married,marital_status_separated,marital_status_single,marital_status_widowed
0,37,True,27,False,2,0,2,1,0,1,0,0,0,0
1,19,False,29,True,2,3,0,1,0,0,0,0,1,0
2,40,True,28,False,2,0,2,1,0,0,0,0,1,0
3,42,False,29,False,1,1,3,0,1,0,1,0,0,0
4,22,False,39,False,0,1,1,1,0,1,0,0,0,0


As can be seen, we don't even need to specify the columns to encode. `gender` and `marital_status` are picked up automatically by `get_dummies`, as they are the only remaining categorical columns that are not yet encoded.
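
If relying on automatic detection feels too implicit, the columns can also be named explicitly through the `columns` parameter. The sketch below uses a made-up frame (`toy` is an assumption); `dtype=int` is passed because recent pandas versions return boolean dummy columns by default rather than `0`/`1`:

```python
import pandas as pd

# Toy frame shaped like a slice of `features` (values invented for illustration).
toy = pd.DataFrame({
    "age": [37, 19, 42],
    "smoking[Y/N]": [False, True, False],
    "gender": ["F", "F", "M"],
    "marital_status": ["divorced", "single", "married"],
})

# Only the listed columns are encoded; all other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["gender", "marital_status"], dtype=int)

print(encoded.columns.tolist())
```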

`get_dummies` also has a `drop_first` parameter specifically to combat the Dummy Variable Trap, but I want to remove a column later, in the model optimization phase, to see how it impacts model performance.

Let's move to the next module and define our model(s). Before we do, let's store the new features dataset so that it is available across modules:

In [43]:
%store features

Stored 'features' (DataFrame)
