## Missing value processing and Feature engineering


1. Processing missing values

One of the most common problems encountered during the data cleaning and analysis phase is the handling of missing values, i.e. when one of the variables is missing for a given observation.

There is no universally good way to handle missing values: determining the ideal replacement value is a prediction problem in its own right. Several techniques exist, and the right one depends on the type of problem (time series, machine learning, regression...), so it is difficult to propose a general solution. The main methods are summarized here and a structured approach is proposed.



    1. Why do we have missing values

Before delving into the different processing techniques, we first ask why values are missing. Three situations are usually distinguished:



* **Missing Value Completely at Random:** The probability that a value is missing depends neither on its true value nor on the values of the other variables describing the individual.
* **Missing Value at Random:** The probability that a value is missing does not depend on the true value but on the other variables describing the individual. For example, the French tend not to want to talk about their income, unlike the Americans, without this preference necessarily being linked to the level of income.
* **Missing Value Not at Random:** The probability that a value is missing is directly related to the value itself. For example, in an obesity study, participants who notice weight gain tend to leave the study and so generate missing values.

In the first two cases the observations concerned can be removed from the dataset without too great an impact on the distributions of the variables. In the latter case we cannot remove the observations without risking introducing a bias in our data set. Indeed, if rich people, for example, tend not to want to reveal their wealth or income, then these individuals may be under-represented in the dataset.



    2. Interpolation vs. Deletion

When one does not wish to remove observations with missing values, the missing values can be interpolated (this is usually called imputation): each missing value is replaced by a value that depends on the values already filled in. The table below summarizes the different situations in which we may find ourselves and the strategies we can adopt:

![](https://drive.google.com/uc?export=view&id=1v1EZxk4Ox1QOQG6rL6IGaGIRWv9NQjzN)


* **Listwise**

    Listwise deletion (study of complete observations) consists of removing from the dataset all observations that contain at least one missing value. If the number of such observations is small compared to the size of the dataset, it is very convenient to simply remove them. However, missing values are rarely completely random, and when they are not, eliminating these observations introduces a bias.

In Python:

```
mydata.dropna(inplace=True)
```


*   **Pairwise**

Pairwise deletion uses, for each analysis, all observations for which the variables of interest are filled in, and therefore maximizes the number of observations that can be used. This increases the statistical power of the analysis, but the approach has several disadvantages. It assumes that all missing values are completely random, and it means that a different number of observations contributes to each part of the analysis, which can make interpretation difficult.

![](https://drive.google.com/uc?export=view&id=1UWL4Z05tFRIyojNtS82yEAPH3MhtfGQb)
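The difference between listwise and pairwise deletion can be sketched on a small made-up table (the column names `a`, `b`, `c` and all values are hypothetical). pandas' `DataFrame.corr` excludes missing values pair by pair, so each coefficient may rest on a different number of observations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each column has missing values in different rows
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, np.nan, 6.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0, 12.0],
    "c": [1.0, 0.0, np.nan, 1.0, 0.0, 1.0],
})

# Listwise deletion keeps only the rows that are complete for every variable
listwise_n = len(df.dropna())

# Pairwise deletion: DataFrame.corr() ignores missing values pair by pair
pairwise_corr = df.corr()

# Number of observations actually available for each pair of variables
observed = df.notna().astype(int)
pairwise_n = observed.T @ observed

print("Rows kept by listwise deletion:", listwise_n)
print(pairwise_n)
```

Comparing `listwise_n` with the entries of `pairwise_n` shows how many more observations each pairwise computation can draw on.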


* **Dropping Variables**

    It is often more advantageous to keep data than to discard it. However, when a variable has a very high rate of missing values (60%, for example), it can be removed from the dataset provided that it does not contribute information to the analysis. Interpolation is very often a better solution than dropping a variable.


```
# Drop the column in place (equivalent to: del mydata['column_name'])
mydata.drop('column_name', axis=1, inplace=True)
```
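To decide which variables are candidates for removal, one possible approach is to compute the share of missing values per column first. The sketch below uses made-up data; the 60% threshold simply reuses the figure quoted above and is not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "mostly_empty" is missing in 4 rows out of 5
mydata = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "mostly_empty": [np.nan, np.nan, 1.0, np.nan, np.nan],
})

# Share of missing values per column
missing_rate = mydata.isna().mean()

# Assumed threshold, taken from the 60% example in the text
threshold = 0.6
columns_to_drop = missing_rate[missing_rate > threshold].index

# Return a new DataFrame without the sparsely filled columns
reduced = mydata.drop(columns=columns_to_drop)
print(missing_rate)
print(reduced.columns.tolist())
```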




        1. Methods specific to Time Series
* **Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)**

    This is a classic approach for analyzing longitudinal studies when some follow-up data are missing. LOCF replaces each missing value with the last observed value before it, while NOCB replaces it with the next observed value after it. Both methods can introduce bias if the time series shows a trend.

* **Linear interpolation**

    This method replaces each missing value with a point on the straight line joining the observed values on either side of the gap, as if the series followed its local trend exactly.

* **Seasonal Adjustment + Linear Interpolation**

    This method takes into account both the trend and the seasonal component when replacing missing values.
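The first two families of methods map directly onto pandas `Series` methods. Here is a minimal sketch on a made-up daily series (the dates and values are hypothetical); seasonal adjustment is left out because it requires a decomposition step that depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a gap of two missing values
index = pd.date_range("2021-01-01", periods=6, freq="D")
series = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0], index=index)

# Last Observation Carried Forward: repeat the last known value
locf = series.ffill()

# Next Observation Carried Backward: use the next known value
nocb = series.bfill()

# Linear interpolation: straight line between the neighbouring known values
linear = series.interpolate(method="linear")

print(pd.DataFrame({"original": series, "locf": locf, "nocb": nocb, "linear": linear}))
```

On this series, which has a clear upward trend, LOCF and NOCB produce flat plateaus while linear interpolation follows the trend.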

![](https://drive.google.com/uc?export=view&id=1JVYgXZZgJNaPB7hH5Ane3Bl4zCgU3eZn)


![](https://drive.google.com/uc?export=view&id=10ZyuNyCg1gJVlQeKw1DnAdtdEFt-E9hZ)

![](https://drive.google.com/uc?export=view&id=1sN0r-3Xgaq7FC8s4EdsaHF9zOgR8gjlS)

![](https://drive.google.com/uc?export=view&id=11EIWTiQYmGM12mIGh0XuVlrGAxSgF5zB)




        2. Mean, Median and Mode

Interpolation by mean, median or mode are three very simple methods. They are the only ones presented here that use neither the characteristics of time series nor information from other variables. They are very fast but have a major disadvantage: they reduce the variance of the dataset, since all observations with missing values receive the same replacement value and so look more similar.


```
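import numpy as np
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed from scikit-learn
from sklearn.impute import SimpleImputer

values = mydata.values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
transformed_values = imputer.fit_transform(values)
# strategy can be changed to "median" or "most_frequent"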
```




        3. Linear regression

First, candidate predictors of the variable with missing values are identified using a correlation matrix. The variables most strongly related to the variable being interpolated are selected and used in a linear regression model. Observations where the interpolated variable is filled in are used to fit the model, which is then used to predict the most likely values for the missing ones. The procedure can then be repeated, this time using all observations for fitting, predicting new replacement values, and so on until the interpolated values converge.

This method theoretically produces good estimates of the missing values. However, it has disadvantages that tend to outweigh its advantages. The values found reflect the general logic of the dataset, which reduces its variance. The method also tends to strengthen the correlations between the variables of the dataset, even though there is not necessarily a linear relationship between them, and it is in any case preferable that explanatory variables not be strongly correlated with one another.
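A single pass of this procedure might look like the sketch below. The data are simulated and the column names (`age`, `seniority`, `income`) are hypothetical; the iteration until convergence described above is deliberately omitted to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Simulated toy data: "income" has missing values, the other columns do not
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=50)
seniority = age - 20 + rng.normal(0, 2, size=50)
income = 1000 + 30 * age + 20 * seniority + rng.normal(0, 50, size=50)
df = pd.DataFrame({"age": age, "seniority": seniority, "income": income})
df.loc[df.index[:10], "income"] = np.nan

# Pick the predictors most correlated (in absolute value) with the target
correlations = df.corr()["income"].drop("income").abs()
predictors = correlations.nlargest(2).index.tolist()

# Fit the model on the rows where the variable is filled in ...
known = df["income"].notna()
model = LinearRegression().fit(df.loc[known, predictors], df.loc[known, "income"])

# ... then predict replacement values for the rows where it is missing
imputed = df.copy()
imputed.loc[~known, "income"] = model.predict(df.loc[~known, predictors])
print(imputed["income"].isna().sum())
```

Because every replacement value lies exactly on the fitted regression surface, the completed variable has less variance than the real one, which is the drawback noted above.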



        4. Multiple interpolation

Multiple interpolation proceeds in three steps:

* **Interpolation:** The missing values are interpolated _m_ times by drawing from a probability distribution, which gives _m_ datasets with no missing values.
* **Analysis:** Each of the _m_ datasets is analysed separately.
* **Pooling:** The results of the _m_ analyses are combined into a final result.
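The three steps can be illustrated with a deliberately naive sketch. It assumes a single variable, draws replacement values from a normal distribution fitted to the observed values, and uses the mean as the "analysis"; real multiple interpolation procedures use richer models and pooling rules, so this only shows the overall structure. All values are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical variable with two missing values
x = np.array([4.0, 5.0, np.nan, 6.0, np.nan, 5.0, 4.0, 6.0])
missing = np.isnan(x)
observed = x[~missing]

m = 5  # number of completed datasets (an arbitrary choice)
estimates = []
for _ in range(m):
    # 1. Interpolation: draw replacement values from a normal distribution
    #    fitted to the observed values (a simplifying assumption)
    completed = x.copy()
    completed[missing] = rng.normal(
        observed.mean(), observed.std(ddof=1), size=missing.sum()
    )
    # 2. Analysis: here the analysis is simply the mean of the variable
    estimates.append(completed.mean())

# 3. Pooling: combine the m results, here by averaging them
pooled_mean = float(np.mean(estimates))
print(estimates, pooled_mean)
```

The spread of `estimates` across the _m_ datasets gives an idea of the uncertainty introduced by the missing values, which a single interpolation would hide.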



![](https://drive.google.com/uc?export=view&id=1SF0Uo5-enRqQDrMK9WW5nryzQ2Unblf6)



        5. Interpolation of qualitative variables
* Mode interpolation is very simple to implement but can introduce a strong bias.
* Missing values can be treated as a category in their own right.
* One can try to predict the missing values from the other variables, using models such as naive Bayes or decision trees.
* We can do multiple interpolation.
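
The first two options can be sketched in pandas as follows. The variable `colour` and its values are hypothetical, and the label `"missing"` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical qualitative variable with missing values
colour = pd.Series(["red", "blue", None, "red", None, "green"])

# Option 1: replace missing values with the mode (most frequent modality)
mode_filled = colour.fillna(colour.mode()[0])

# Option 2: treat missing values as a category in their own right
category_filled = colour.fillna("missing")

print(mode_filled.value_counts())
print(category_filled.value_counts())
```

The value counts make the bias of mode interpolation visible: the dominant modality becomes even more dominant.
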
        6. KNN (K Nearest Neighbours)

In this method, the _K_ nearest neighbours of an observation, according to a chosen distance, are used to calculate the replacement value, which is their mean for a quantitative variable or their mode for a qualitative variable.

The choice of the distance depends on the type of variables available in the dataset.



* For quantitative variables the Euclidean distance is used to calculate their contribution to distance.
* For qualitative variables the distance is one if the modalities are different and zero if they are equal.

All the components of the distance are summed to calculate the distance between two observations.
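One possible reading of this mixed distance is sketched below. The function name `mixed_distance`, the observations and their values are all hypothetical, and the quantitative variables are assumed to be already on comparable scales.

```python
import numpy as np

def mixed_distance(a, b, quantitative, qualitative):
    """Sum a Euclidean part (quantitative keys) and a 0/1 mismatch part (qualitative keys)."""
    diffs = np.array([a[k] - b[k] for k in quantitative], dtype=float)
    euclidean = float(np.sqrt(np.sum(diffs ** 2)))
    # Each qualitative variable contributes 1 if the modalities differ, 0 otherwise
    mismatches = sum(1 for k in qualitative if a[k] != b[k])
    return euclidean + mismatches

# Hypothetical observations with already scaled quantitative values
obs_1 = {"age": 0.2, "income": 0.5, "city": "Paris"}
obs_2 = {"age": 0.5, "income": 0.9, "city": "Lyon"}

d = mixed_distance(obs_1, obs_2, quantitative=["age", "income"], qualitative=["city"])
print(d)
```

Without scaling, a quantitative variable expressed in large units would dominate the 0/1 contributions of the qualitative variables.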

The main advantage of KNN is that it is very simple to understand and implement. Its non-parametric nature (unlike regressions, for example) gives it an advantage in certain situations where the data are highly contrasted.

One of the main drawbacks of KNN is that it can take a long time on very large datasets, because the distances between all observations must be calculated. Moreover, the method may not work well when many variables are available, because the differences between distances become smaller.


```
from fancyimpute import KNN

# Use the 5 nearest rows that have a given feature to fill in each row's missing features.
# Recent fancyimpute releases expose fit_transform(); older ones used .complete().
knnOutput = KNN(k=5).fit_transform(mydata)
```


Of all the methods described above, multiple interpolation and KNN are often used. Multiple interpolation is often preferred because it is simpler.

**Resources:**

1. [https://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf](https://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf)
2. [https://arxiv.org/pdf/1710.01011.pdf](https://arxiv.org/pdf/1710.01011.pdf)
3. [https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/](https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/)
4. [https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/)
5. Time-Series Lecture at University of San Francisco by Nathaniel Stevens



2. Feature engineering

Feature engineering is the exercise of creating new features (new variables), in most cases from the data already at our disposal, in order to improve the clarity of an analysis, find new correlations or increase the performance of a prediction model.

It is an essential component of data science because it makes the difference between the mechanical application of a model and genuine reflection. As Andrew Ng, professor of machine learning at Stanford University, says:

"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."

Through feature engineering, it is possible to isolate essential information, highlight behavioural models, and provide expertise in certain areas.

Unsurprisingly, it is easy to feel stuck when you start feature engineering because the possibilities are almost endless, so we will discuss about twenty methods here to help guide you in this work.



    1. What is feature engineering

Feature engineering is an informal discipline with multiple definitions. The process of building a machine learning project is fluid, iterative and non-linear, which makes it difficult to construct an unwavering and unique definition.

In summary, feature engineering can be roughly described as **the construction of new variables from existing variables in order to improve the performance of a model**.

The typical process of building a data science project can be broken down as follows:



* Evaluation / Problem definition and data collection
* Descriptive and exploratory analysis (what we did earlier this week)
* Data cleaning (what we did today)
* Feature engineering
* Modeling, or model training (which we'll see next week)
* Delivery of the project: formatting of the conclusions and putting the model into production

    2. What's not feature engineering?

Some steps in the construction of a data science project are not considered feature engineering, for example:



* Initial data collection is not considered feature engineering.
* Likewise, neither is the identification and creation of the target variable.
* Deleting duplicate observations, handling missing data and correcting outliers are considered data cleaning.
* Nor is standardization considered part of feature engineering.
* Finally, the dimensionality reduction methods we'll see in week 4 are also outside the scope of feature engineering.

This categorisation is purely arbitrary. Some data scientists may prefer another arrangement, and this is not a problem: the important thing is that we agree at this stage on a definition of the terms.

Let's now move on to the presentation of feature engineering methods!



    3. Indicator variables

The first feature engineering technique we will discuss isolates key information. One might legitimately ask whether it is not the model's role to work out which information is important. In practice this is not so simple: it depends on the amount of information available and on the presence of conflicting signals that prevent the model from establishing simple prediction rules.

In practice it is very easy to help the model focus on certain information by highlighting it in the data itself. This technique relies on indicator variables (whose value is 0 or 1 according to a criterion). Here are some examples:



* **Threshold indicator:** Let's say that we are looking at the preferences of American consumers for alcoholic beverages and we have a variable that gives the age of the individuals. We can create an indicator variable that is 0 if an individual is under 21 years of age and 1 if the individual is 21 or older, 21 being the legal drinking age in the United States.
* **Multiple feature indicator:** If we are trying to predict the price of real estate in a certain area and we measure the number of bedrooms and bathrooms, we can build an indicator that is worth 1 when a house has 2 bedrooms and 2 bathrooms because we know that this type of property has a much higher value on the market.
* **Special events indicator:** If we model the sales of a store every day of the year, it may be interesting to create dummy variables that are 1 during Black Friday, sales or the Christmas period, so that the model understands that these periods are marked by particular consumption behaviours.
* **Class group indicator:** Imagine that you are analyzing the traffic on a website and you have a variable called traffic_source that tells you the source of your traffic. You could create an indicator variable that tells you when visitors to the site come from a paid source such as "facebook ads" or "google adwords".
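
Three of these indicators can be built in pandas as sketched below. The table is a made-up one that mixes unrelated columns purely for illustration; only `traffic_source` and the source names come from the text, the other column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
df = pd.DataFrame({
    "age": [19, 21, 35],
    "bedrooms": [2, 3, 2],
    "bathrooms": [2, 2, 1],
    "traffic_source": ["facebook ads", "organic", "google adwords"],
})

# Threshold indicator: 1 if the individual is 21 or older
df["is_21_or_older"] = (df["age"] >= 21).astype(int)

# Multiple feature indicator: 1 for exactly 2 bedrooms and 2 bathrooms
df["two_bed_two_bath"] = ((df["bedrooms"] == 2) & (df["bathrooms"] == 2)).astype(int)

# Class group indicator: 1 when the traffic comes from a paid source
paid_sources = ["facebook ads", "google adwords"]
df["is_paid_traffic"] = df["traffic_source"].isin(paid_sources).astype(int)

print(df)
```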

    4. Compound variables

This type of feature corresponds to the association of several existing variables together through operations such as sum, difference, product or quotient. Often in data science there is strength in numbers, and creating new variables that are composed of existing variables can allow a prediction model to take into account information that it was previously unable to understand.

_NB: That said, it is not recommended to create such variables automatically with a loop over all the variables in your dataset: this would explode the number of features, and many of them would not be useful for your model._



* **Sum of two variables:** Say you want to predict revenue from historical sales data. If you have at your disposal the sales of blue and black pens (sales_blue_pens, sales_black_pens), you could sum them to create a variable representing the total sales (sales_pens).
* **Difference of two variables:** Let's say you are interested in real estate and you have the date of construction (house_built_date) of a house and its date of sale (house_sales_date). You can calculate the difference between these two dates to obtain the age of the house at the time of its sale (house_age_at_sales).
* **Product of two variables:** If you test a new price on a merchant site, you can for example multiply this price by a variable that indicates whether a visit to the product page converted into a purchase, and thus obtain a revenue variable.
* **Quotient of two variables:** If you have a dataset on marketing campaigns with the number of clicks (n_clicks) and the number of impressions (n_impressions) of each campaign, you can divide the number of clicks by the number of impressions to obtain a click_through_rate that is comparable across campaigns.
* The link between your target variable and your explanatory variables is regularly non-linear, so it is often useful to plot the target variable against the explanatory variables in order to discover which usual function (polynomial, log, exp, etc.) best represents the link that potentially exists between them.
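
The four operations can be sketched as follows. The variable names in the first, second and fourth examples come from the text; the `price`, `converted` and `revenue` columns and all values are hypothetical.

```python
import pandas as pd

# Sum: total pen sales
sales = pd.DataFrame({"sales_blue_pens": [10, 20], "sales_black_pens": [5, 15]})
sales["sales_pens"] = sales["sales_blue_pens"] + sales["sales_black_pens"]

# Difference: age of the house (in calendar years) at the time of sale
houses = pd.DataFrame({
    "house_built_date": pd.to_datetime(["1990-06-01", "2005-01-15"]),
    "house_sales_date": pd.to_datetime(["2010-06-01", "2020-01-15"]),
})
houses["house_age_at_sales"] = (
    houses["house_sales_date"].dt.year - houses["house_built_date"].dt.year
)

# Product: revenue per visit = price shown x conversion flag (0 or 1)
visits = pd.DataFrame({"price": [9.99, 9.99], "converted": [1, 0]})
visits["revenue"] = visits["price"] * visits["converted"]

# Quotient: click-through rate per campaign
campaigns = pd.DataFrame({"n_clicks": [30, 5], "n_impressions": [1000, 200]})
campaigns["click_through_rate"] = campaigns["n_clicks"] / campaigns["n_impressions"]

print(sales, houses, visits, campaigns, sep="\n\n")
```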


    5. Representation of variables

This method may seem simple but can be very useful. It consists of looking at the same variable in different forms in order to extract the maximum amount of information.

Your data will not always be presented in its ideal form. Some of them can provide very useful information in a different form:



* **Dates and Times :** Let's say you have a purchase_datetime variable. Translating this variable into days of the week is likely to provide relevant information, since some businesses do better on certain days of the week.
* **Transform quantitative variables into qualitative variables:** It is sometimes very useful to transform some quantitative variables into ordinal qualitative variables. For example, we can transform age into categories such as '0-18', '19-25', '26-35', etc., which correspond to different key periods of people's lives, or we can group income into tax brackets.
* Some modalities of a qualitative variable are regularly very poorly represented in the dataset. When this happens, it may be useful to group these modalities together under the generic label "other". The logic behind this is that it is not very useful for modelling and analysis that all individuals are in the same class, nor is it useful that some classes contain too few individuals.
* It is sometimes necessary to transform a categorical variable into a collection of dummy variables, one for each modality of that variable, in order to use certain machine learning algorithms (such as Bernoulli Naive Bayes). Remember to group the modalities that are very poorly represented under an 'other' label before doing this.
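
These four representations might be produced in pandas as sketched below. Only `purchase_datetime` and the age categories come from the text; the `city` column, the values, and the minimum count of 2 used to define a rare modality are assumptions.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "purchase_datetime": pd.to_datetime([
        "2021-03-01 10:00", "2021-03-06 18:30", "2021-03-07 09:15", "2021-03-08 12:00",
    ]),
    "age": [17, 22, 30, 24],
    "city": ["Paris", "Paris", "Lyon", "Nice"],
})

# Dates and times: extract the day of the week
df["purchase_weekday"] = df["purchase_datetime"].dt.day_name()

# Quantitative to ordinal qualitative: bin age into the categories from the text
df["age_group"] = pd.cut(
    df["age"], bins=[0, 18, 25, 35], labels=["0-18", "19-25", "26-35"],
    include_lowest=True,
)

# Group poorly represented modalities (fewer than 2 rows here) under "other"
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "other")

# Dummy variables: one column per remaining modality
dummies = pd.get_dummies(df["city_grouped"], prefix="city")

print(df)
print(dummies)
```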


    6. External data

The provision of data from an external source can lead to the best results in terms of performance. For example, one of the strategies adopted by investment funds to predict the value of certain financial securities is to use several sources of financial data.

Many machine learning problems can benefit greatly from external data input, here are a few examples:



* **Time series:** The main advantage of time series is that only the date associated with each observation is needed to join a secondary source of data.
* **External APIs:** Many APIs can help you build new features. For example, the [Microsoft Computer Vision API](https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/) allows you to find the number of faces in an image.
* **Geolocation:** When you have geolocation data, or a postal address, it is convenient to find data that characterizes that location and turn it into variables using another [dataset](http://www.pitneybowes.com/us/data/demographic-data.html).
* **Ancillary source of the same data:** Some data can be collected in several ways, for example a Facebook advertising campaign can be tracked using Facebook's analytics tool, but also Google Analytics or other third party software. Each source can potentially bring up information that another does not show and they could be, as such, complementary.
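
For the time series case, joining an external source on the date can be sketched as below. Both tables are made up; the `weather` table stands in for any hypothetical external dataset indexed by date.

```python
import pandas as pd

# Hypothetical internal data: daily sales
sales = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "sales": [100, 120, 90],
})

# Hypothetical external data: daily temperature, with one day missing
weather = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "temperature": [3.5, 5.0],
})

# A left join keeps every sales row, even when no external data is available
enriched = sales.merge(weather, on="date", how="left")
print(enriched)
```

Days without a match in the external source end up with missing values, which then need to be handled like any other missing data.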


    7. Error Analysis (Post-modeling)

The last technique for creating variables that we will see together is called error analysis. This technique can only be applied once the first model has been trained.

This term refers to the analysis of observations that the model has failed to classify well (in a classification model) or observations for which the prediction error is very large (in a regression).

This analysis can lead to several conclusions. You may try to collect more data because you realize that essential information is missing. You may need to split the problem into several sub-problems because some groups of observations behave too differently to be modeled together effectively. You may also try to create new variables to reduce the errors you observe, in which case a thorough analysis of the high-error observations is necessary to understand why your model fails. Here is a methodology for carrying out error analysis:



* **Begin by analyzing large errors:** Error analysis is a manual process, and you will not have time to analyze each observation. For this reason, it is recommended that you first analyze the observations for which your model makes the largest errors.
* Another technique is to segment the observations using a well-chosen variable or group of variables and criteria often dictated by the business experience you have gained or will be given in a specific industry.
* **Unsupervised learning:** If you do not have the business knowledge or expertise to segment observations effectively, you can use unsupervised learning techniques to highlight patterns in the data. It is not recommended to use the clusters thus created as an explanatory variable to train the next model, but they will make it easier for you to differentiate between groups of similar observations. Remember that the goal is to understand why the model failed on these observations.
* It is very common for data scientists to work in industries or fields of expertise in which they have limited experience. This is increasingly the case as data collection, processing, analysis and modelling is becoming commonplace in all fields. This is why it is always useful to ask fellow data scientists or experts in a specific field for guidance in your research.
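
The first two steps can be sketched for a regression model as follows. The table of true values and predictions is made up, and the `segment` column stands for any hypothetical business variable used to group observations.

```python
import pandas as pd

# Hypothetical predictions from a first regression model
results = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "y_true": [10.0, 12.0, 30.0, 28.0, 35.0],
    "y_pred": [11.0, 12.5, 20.0, 27.0, 45.0],
})

# Absolute prediction error for each observation
results["abs_error"] = (results["y_true"] - results["y_pred"]).abs()

# Begin with the largest errors: these rows deserve a manual look first
worst = results.sort_values("abs_error", ascending=False).head(2)

# Segment the observations: compare the average error across groups
error_by_segment = results.groupby("segment")["abs_error"].mean()

print(worst)
print(error_by_segment)
```

A segment whose average error is much higher than the others is a natural place to look for a missing variable or a sub-problem worth modelling separately.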


## Conclusion

As you can see, there are many ways to create new variables. We've touched on a few, but the possibilities are almost endless!

If you had to remember the basics, good variables:



* Can be calculated for new observations added to your database
* Can be explained simply
* Must plausibly have predictive power for the target variable; don't create variables for the sake of it
* Are based on expertise in the field of study or on a descriptive analysis of the dataset
* Never touch the target variable. If it is involved in any way in your new variables, it is cheating, and in any case you will not be able to compute those variables for new observations whose target value is unknown.

Finally, don't worry if all this seems overwhelming at the moment. You will get better at this game of variable creation with time, practice, experience and the help of your peers!
