# Pre-course

This course is an applied look at machine learning concepts and practice using the Python programming language.

Please ensure you have read the pre-course material before starting this course.

For more information about machine learning theory please see the Machine Learning Theory course.

The material in this course is broadly separated into four sections:
* **Data Preparation**: Chapter 1
* **Supervised Learning**: Chapter 2 (Regression) & Chapter 3 (Classification)
* **Unsupervised Learning**: Chapter 4 (Dimension Reduction) & Chapter 5 (Clustering)
* **Case Studies**: A-C, at present covering supervised learning

The course is intended to be completed in order, as the concepts in each chapter often build on those of the previous chapter.

## Table of Contents
<a href="#Programming-Refresh"><font size="+0.5">Programming Refresh</font></a>
* Load Relevant Libraries
* Import Raw Data
* `sklearn` Objects

<a href="#Data-Cleaning"><font size="+0.5">Data Cleaning</font></a>
* Converting Types
* Dropping Rows
* Replacing with an Average
* Interpolation - Inferring Missing Values
* `sklearn` alternative

<a href="#Feature-Engineering"><font size="+0.5">Feature Engineering</font></a>
* Data Types
* Encoding Data

<a href="#Data-Scaling"><font size="+0.5">Data Scaling</font></a>
* Standardisation
* Normalization
* Other Scaling

<a href="#Feature-Selection"><font size="+0.5">Feature Selection</font></a>
* Removing Unique Attributes
* Correlation Matrix
* Automatic Feature Selection

<a href="#Data-Structures-in-sklearn"><font size="+0.5">Data Structures in sklearn</font></a>
* numpy and pandas
* Train Test Split
* Random States
* Validation Sets


<center><h1><font size=6>Chapter 1</font></h1></center>
<center><h1><font size=7>Data Preparation</font></h1></center>

It is often said in machine learning that "preparing your data takes up 80% of your time". Whilst it is not the most exciting stage of development, it is crucial. The performance of any model depends on two things: the suitability of the mathematics behind it, and the quality of the data put into it. Even the most technically powerful models are limited by the amount and quality of the data given to them.

Good data preparation gets better performance out of any model, so it is key to use the appropriate steps for the task at hand. The following sections contain broad ideas about preparing your data, some of which may not be relevant to every application, but all of which should be considered somewhere in the design process of your machine learning workflow.

This chapter will be using data involving bike rentals, calendar data and weather patterns. Whilst there will be no prediction yet, we will be manipulating, cleaning and heavily interacting with this data, and therefore it is important to feel comfortable with Python and the **`pandas`** data frame set of classes.

## Learning Objectives
- Understand different methods of handling missing data 
- Be aware of different data types relevant to Machine Learning
- Be able to encode different types of data
- Understand how and why to scale data
- Understand how to select features for models
- Be aware of the different data structures used by **`sklearn`**
- Understand why we use training and test data sets

---
# Programming Refresh


Much of the relevant data preparation required for machine learning can be completed by storing the data as a **`pandas`** data frame and using methods included in the class. 

For a data frame called **`data`**, to call a method (function), **`do_something()`**, on the frame we type in

```python
data.do_something()
```

If we want to store the change from the method we can assign it to a variable:

```python
new_data = data.do_something()
```

Throughout this course we will access the column `"column_name"` from **`data`** as a series object **`column_as_series`** using:


```python
column_as_series = data["column_name"]
```


Combining the two capabilities shown, we can replace the values of a column with the new values returned by the method:

```python
data["column_name"] = data["column_name"].do_something()
```
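
As a concrete instance of the pattern above, here is a minimal sketch that uses the real `pandas` method `round()` in place of the placeholder `do_something()`. The data frame and its values are made up purely for illustration.

```python
import pandas as pd

# A tiny, made-up data frame used only for this illustration.
data = pd.DataFrame({"column_name": [1.26, 2.54, 3.71]})

# Calling a method returns a new object. The original column is unchanged.
rounded_series = data["column_name"].round(1)

# Reassigning stores the result of the method back in the column.
data["column_name"] = data["column_name"].round(1)

print(data)
```

Until the result is assigned back, the frame itself still holds the unrounded values.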

## `sklearn` Objects

Throughout this course we are going to be using the `sklearn` module extensively. `sklearn` contains a wide range of classes and functions.

This course is not an introduction to Object Oriented Programming, but it includes a brief explanation of how we are going to use objects with `sklearn`.

For a range of processes we are going to create objects and use their methods to transform data, fit models or produce evaluations. We first create the object from a class and assign it to a variable. We then use a method of that object to do something to our data. These methods are similar to functions but are associated with a specific object. In this course, class names typically begin with a capital letter and function names do not. Below is an example of a process we might do.

```python
# Genericclass is a placeholder name for illustration, not a real sklearn class.
from sklearn import Genericclass

# create object of Genericclass
first_object = Genericclass()

# do something to our data with a method from Genericclass
new_data = first_object.do_something(original_data)

```
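
The same pattern with a real class might look like the sketch below. `MinMaxScaler` is used here only because its effect is easy to read off (it rescales values to lie between 0 and 1), and the three data values are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: one feature (column) with three values (rows).
toy_data = np.array([[10.0], [15.0], [20.0]])

# Create an object from the class.
scaler = MinMaxScaler()

# Use a method of the object to do something to our data.
new_toy_data = scaler.fit_transform(toy_data)

print(new_toy_data.ravel())
```

The smallest value maps to 0, the largest to 1, and the middle value to 0.5.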


## Load Relevant Libraries

In [None]:
# Not an exhaustive list of libraries, but we will load the others when we need them.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# This snippet allows us to see plots without directly calling them.
%matplotlib inline

# We will be changing our data frames in multiple different ways for demonstration purposes
# and therefore need many copies. This causes a warning that will not affect our results.
# The code below suppresses this warning.
pd.set_option('mode.chained_assignment', None)

## Import Raw Data

This data set is a combination of TfL data where **`count`** denotes the number of bike hires in a given day, combined with the weather and calendar information for that day. In the next chapter we will use the calendar and weather features to predict the corresponding number of bike hires.

In [None]:
# Designate the location of the data set.
bikes_filepath = '../../data/bikes.csv'

# Load the data into a pandas data frame.
original_data = pd.read_csv(filepath_or_buffer=bikes_filepath, delimiter=",")

What does our data look like? How is it formatted?

In [None]:
# .head() allows us to preview some of the data frame.
original_data.head(n=10)

In [None]:
# .info() allows us to get information about the data in the frame.
original_data.info()

We can see here that some of our columns contain missing data: they have fewer than the full 730 non-null records. The output also shows the different data types we have within our data set.

Using knowledge from our data source we also know the units for the features:

* date - date
* real_temperature - degrees Celsius
* feel_temperature - degrees Celsius
* humidity - %
* wind_speed - km/h
* weather_code - name of weather
* is_holiday - Boolean
* is_weekend - Boolean

---
# Data Cleaning

By displaying our data's **`info()`** we can see that there are three columns which contain missing values. These are **`feel_temperature`**, **`weather_code`** and **`season`**. Some machine learning models cannot take missing data as an input and therefore we need to handle it.

There are many different options for filling in missing data, each of which has benefits and restrictions that may affect the model we produce. It is important to use domain knowledge when deciding which method to employ. It helps to know how the data was produced and what may have caused the missing values, as well as the modelling planned for the data.

To demonstrate some common methods for filling in data, each column will be filled using a different method. These methods are merely three options. The possible approaches are almost limitless and range from simple methods, such as dropping rows or columns with missing data, to building neural networks that predict the missing value. The **`pandas`** library conveniently contains methods which allow us to easily implement some of these approaches to data cleaning.

## Converting Types
Sometimes **`pandas`** will convert data types on our behalf when loading the data, but this could be into the wrong format. For example, our date data is currently an "object", whereas we want it to be in a time-based format.

In [None]:
# We can cast one type to another with astype()
original_data['date'] = original_data['date'].astype('datetime64[ns]')

We could also convert our count data to a float type if we wanted to have all our numerics as floats.

In [None]:
# astype() returns a new series. Assign it back to the column to keep the change.
original_data['count'].astype('float')

## Dropping Rows

If we have large amounts of data then we don't necessarily need every single record. Because of this, we can remove the rows that contain missing data. It is important that there is no significant information contained in these rows with missing data. 

For example, if all the missing data was originally one category of target data, by removing these records our overall data will be skewed away from that category. However, if there isn't a relationship between missing data of one variable and other columns then this is an easy method to choose and understand.

We are going to drop rows that contain missing values in the **`weather_code`** column.

This can be done to a data frame using the **`dropna()`** method.

When using a `pandas` data frame it is crucial that we reset our indexes after we drop rows. This is done in order to prevent mis-matched indexes when converting between different data types such as `numpy` arrays.

In [None]:
# Drop the rows that contain missing values in the weather_code column.
clean_data = original_data.dropna(subset=['weather_code'])

# Reset the row index so that the remaining rows are numbered sequentially from 0.
clean_data = clean_data.reset_index(drop=True)
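
To see why resetting the index matters, consider this minimal sketch on a made-up three-row frame. Data that has passed through a `numpy` array comes back with a fresh index starting at 0, so combining it with a frame that still has gaps in its index misaligns the rows.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in the middle row.
toy = pd.DataFrame({"value": [1.0, np.nan, 3.0]})

dropped = toy.dropna(subset=["value"])    # row labels are now 0 and 2
reset = dropped.reset_index(drop=True)    # row labels are now 0 and 1

# A frame built from a numpy array gets a fresh index: 0 and 1.
extra = pd.DataFrame({"extra": np.array([10, 30])})

# concat lines rows up by index label, so mismatched labels create missing values.
misaligned = pd.concat([dropped, extra], axis=1)
aligned = pd.concat([reset, extra], axis=1)

print(misaligned)
print(aligned)
```

Without the reset, the combined frame has three rows and two new missing values. With the reset, the two rows line up as intended.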

## Replacing with an Average

Replacing a missing value with an average value of a variable is often a good approximation for the data. This can be useful because an average will not significantly impact the distribution if there are small amounts of missing data. Which average is used will depend on the data type of the variable in question.  

- **Mean** - Numerical data - If your variable is distributed in a symmetric manner.

- **Median** - Numerical data - If your variable is not distributed in a symmetric manner.

- **Mode** - Categorical data - Replaces missing data with the most common category in the data.

We need to be careful not to overuse this method, as replacing large amounts of our data with the average will change the distribution of our data.

We are going to use the median to fill in the missing data in the **`feel_temperature`** variable.

First, calculate the median of the column. Second, replace the missing values in the column using **`fillna()`**.

In [None]:
# Calculate the median of feel temperature.
feel_temperature_median = clean_data['feel_temperature'].median()

# Fill in the missing values with the median value.
clean_data['feel_temperature'] = clean_data['feel_temperature'].fillna(feel_temperature_median)

# Calling the series displays its values as an output.
clean_data['feel_temperature']


<div class="alert alert-block alert-warning">
<b><i><font size=3>fillna()</font></i>:</b>
<p>
We can also use other options in the <b>fillna()</b> method, which allow us to replace all types of missing data in a variety of ways. You can find more information about the different choices in the <b>fillna()</b> documentation online.
</p>
</div>

## Interpolation - Inferring Missing Values

Using the other variables and values in the data we can utilise different algorithms to infer what the missing data value could be. These methods include but are not limited to:

- **`'pad'`** - fills in the missing values with the previous value going down the column. (Useful when the order of the data matters, e.g. time series data.)

- **`'linear'`** - fills in the missing values using a linear function between the point before and the point after the missing value.

- **`'nearest'`** - fills in the missing values with the value of the nearest other point, as measured along the index.

Below we are going to use the **`'pad'`** method on the missing value column **`season`**.

<div class="alert alert-block alert-warning">
<b><font size=3> Warning</font> </b> 
<p> Depending on the method, interpolation may not fill in all missing values, be sure to check it has worked before you continue using your data. If necessary you can always use multiple methods to fill in your missing data.
</p>
</div>

In [None]:
# The missing data is filled with the previous value going down the column.
clean_data['season'] = clean_data['season'].interpolate(method='pad')

clean_data.isna().sum()
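
The sketch below compares forward filling with linear interpolation on a small made-up series. It uses `ffill()`, which fills forward in the same way as the `'pad'` method and is the spelling that newer `pandas` releases recommend. The series deliberately starts with a missing value to show that a method may leave some gaps unfilled.

```python
import numpy as np
import pandas as pd

# Made-up series with gaps, including one at the very start.
toy = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0])

# Forward fill: copy the previous value down the column.
forward_filled = toy.ffill()

# Linear: draw a straight line between the known neighbours.
linear_filled = toy.interpolate(method="linear")

comparison = pd.DataFrame(
    {"original": toy, "ffill": forward_filled, "linear": linear_filled}
)
print(comparison)
print(comparison.isna().sum())
```

Neither method can fill the first value because there is no earlier point to work from, which is why it is worth checking `isna().sum()` afterwards.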

<div class="alert alert-block alert-danger">
<b><font size=3> Warning</font> </b> 
<p> 
Please ensure you use the designated names for all data and variables in the exercises in this course. This will allow you to copy and paste the exercise answers should you miss one of the exercises as each exercise depends on the previous.
</p>
</div>

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 1:</font></b> <p> Clean all the data from the data frame <b>original_data</b> using methods of your choice. Call the new cleaned data frame <b>clean_data</b>. Use the <b>isna().sum()</b> functions to check your method has filled all missing data.
</p> </div>


In [None]:
# Write your code here


## **`sklearn`** Alternative
In addition to the missing data filling methods discussed, **`sklearn`** includes a range of methods within the **`impute`** module. Some basic methods including mean, median, most frequent and constant missing value filling can be implemented using the **`SimpleImputer`** class.
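
A minimal sketch of median filling with **`SimpleImputer`** is shown below. The frame is made up for illustration and only borrows the column name **`feel_temperature`** from our data set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with one missing value.
toy = pd.DataFrame({"feel_temperature": [4.0, np.nan, 10.0, 7.0]})

# Create the imputer object, choosing the median strategy.
median_imputer = SimpleImputer(strategy="median")

# fit_transform expects two-dimensional input, hence the double brackets.
filled_array = median_imputer.fit_transform(toy[["feel_temperature"]])

# The result is a numpy array, so flatten it before storing it as a column.
toy["feel_temperature"] = filled_array.ravel()

print(toy)
```

The median of the known values (4, 7 and 10) is 7, so the missing entry is replaced with 7.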

## Outliers
Outliers in the data - extreme values - may impact our model's predictive power severely. As a result, these should be found using exploratory techniques such as plotting and using domain knowledge. It may be necessary to remove the outlier data in order to improve model performance.

## Summary

In this section we discussed the importance of handling missing data properly. Without doing so, our model may perform worse or simply not run. We have a range of options when dealing with missing data, each potentially better suited to particular tasks.

Dropping missing data can be useful when we don't want to make assumptions about our data that may skew it, but we need to be careful about the correlation of missing values, i.e. dropping assumes that data are missing at random. Replacing the missing values with an average is an easy-to-understand approach with a clear result, but we must be careful not to overuse it, as replacing too much of our data could cause issues. Finally, there are numerous ready-made methods provided by **`pandas`**. These are preferable to use, but each has limitations that should be understood before it is selected.

In addition to the conventional methods highlighted above, more recent and efficient approaches impute missing values using a model-based correction. These techniques are called **model based methods**. Here, the algorithm models the missing data mechanism and then carries out a likelihood-based analysis via either maximum likelihood or Bayesian estimation. This allows the distribution of the data and its correlation with other variables to be considered when imputing the missing data.

---
# Feature Engineering

So far in this chapter we have been lucky with our data, in that most of it is in a format that both we and a machine learning algorithm can easily understand. Algorithms thrive on numerical input, but we will not always receive data in this format and will often need to convert it into numerical data.

## Data Types for Machine Learning

There is a wide array of different data types that can be used in Python and in programming in general. For machine learning we are going to discuss three general categories.

### Numerical Data


In [None]:
# Generate some random data of different types.
random_floats = np.random.uniform(low=-50, high=50, size=(15,))
random_ints = np.random.randint(low=-50, high=50, size=(15,))

print("Example floats:\n", random_floats)
print("\nExample integers:\n", random_ints)


This data type is quite self-explanatory. It encompasses both float and integer types and is fundamentally anything that is a number at origin. The distance between two numbers has a meaning: a person who is 180cm tall is 10cm taller than a person who is 170cm tall. There is a rigid order to the numbers, with each data point being greater than, less than or equal to another.

For ease of calculation many machine learning algorithms will convert integers to floats within their internal processes, but this will rarely be noticeable for the user.

An example of the numerical data type in our **`clean_data`** set is the **`real_temperature`** feature, which represents the temperature measured in degrees Celsius.

In [None]:
# Selecting only the numerical data columns in the frame.
numerical_data = clean_data.select_dtypes(include='number')

print(numerical_data.head(n=10),'\n')
print(numerical_data.info())

### Categorical Data (Nominal)

Categorical data is typically a string of characters such as a word or single letter. It is often called nominal data. As data of this type are not numerical, they have no quantitative value and we cannot perform arithmetic between different categorical values. Categorical data has no order to it. These pieces of data are commonly qualitative observations, descriptors, or representations of collections of other data types. In **`pandas/numpy`** these are strings, but they will often be reported with the generic data type "object".

In [None]:
# Selecting only the categorical data in the frame.
categorical_data = clean_data.select_dtypes(include="object")

print('The unique categories in "weather_code" are: \n\n\t {}'.format(categorical_data["weather_code"].unique()))

print('\nThe unique categories in "season" are: \n\n\t {}'.format(categorical_data["season"].unique()))

# Display a random sample of the data to show the variation in values.
categorical_data.sample(n=7)

We clearly only have two explicitly categorical features in this set, as shown by the "object" type. However, we can often count boolean values (True, False) as categorical data, as they are non-numerical and unordered.

In [None]:
# Selecting only the boolean data from the frame.
boolean_data = clean_data.select_dtypes(include='bool')

boolean_data.head(n=7)

### Ordinal Data

<img src="../../images/ordinaldata.png"  width="750" height="500" alt="A pain chart which shows increasing levels of pain shown on cartoon faces">

Ordinal data can also be expressed as a string. However, the possible values are related to each other in some order. Each data point can be evaluated against another as equal to, greater than or less than it, but the distance between data points is not explicitly known. Simply put, there is a natural order to ordinal data.

Examples of ordinal data include things to do with ranks, such as education level, positions in a running race (first, second, third etc.) and satisfaction (very unhappy, unhappy, neutral, happy, very happy).

We have two options when we find that we have ordinal data. We can treat it as nominal/categorical data and potentially lose the ordering relationship between the values. Alternatively, we can convert it to integers, but this risks our model treating the differences between the values as exact, real quantities.
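
The second option, converting ordinal data to integers, can be sketched with a simple mapping. The satisfaction responses below are made up, and the integer codes only record the order of the categories, not a true distance between them.

```python
import pandas as pd

# Made-up survey responses using the satisfaction scale mentioned above.
satisfaction = pd.Series(
    ["happy", "very unhappy", "neutral", "very happy", "unhappy"]
)

# List the categories from lowest to highest, then number them in that order.
ordered_levels = ["very unhappy", "unhappy", "neutral", "happy", "very happy"]
level_to_rank = {level: rank for rank, level in enumerate(ordered_levels)}

# Replace each string with its position in the ordering.
satisfaction_encoded = satisfaction.map(level_to_rank)

print(satisfaction_encoded.tolist())
```

Writing the order out by hand keeps the natural ranking, which an automatic alphabetical encoding would not.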

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 2:</font></b> 

<p> Which of the three data types discussed: numerical, ordinal and categorical are the following examples? </p>

<ul>
  <li>Key Stages in school</li>
  <li>Count of pupils</li>
  <li>Colour of a pupil's jumper</li>
</ul>
</div>

## Encoding Data

We have alluded to the fact that our categorical data is not in the format that machine learning algorithms can use. The strings need to be converted into numbers, but how can we do this? 

There are two main routes to basic categorical data encoding: label encoding and one-hot encoding. Each of these has a time and a place to be used.

### Label Encoding

The most obvious method for converting categories to numbers is to do exactly that. The **`LabelEncoder()`** class maps each unique category to a different integer.

Each category is mapped to a value from $0$ to $n-1$, where $n$ is the number of unique categories in the feature.

The code below shows how we can implement label encoding with a boolean variable, although it will also work for categorical data with more than two unique values.

In [None]:
# First look at our "is_weekend" data and make a new frame for it.
is_weekend_data = pd.DataFrame(data=clean_data["is_weekend"])
is_weekend_data.head(n=7)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create the label encoder object
label_encoder = LabelEncoder()

# Add the fitted and transformed data onto the new frame.
is_weekend_data["encoded_is_weekend"] = label_encoder.fit_transform(y=is_weekend_data["is_weekend"])
is_weekend_data.head(n=7)

Our True values have been converted to $1$ and the False values to $0$. If we had had a third value, that would have taken the value of $2$ and so on up to $n-1$.  

Notice how the data passed into the **`LabelEncoder()`** uses the argument "**`y=`**". This is intentional: the label encoder should only be used for the **`target`** variable, in other words the variable we are going to be trying to predict. This is because of the problem we discussed earlier with ordinal data: our model may learn a relationship between the numbers that does not exist, simply because we have given it the data in a numerical order.
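
To see what happens with more than two categories, here is a small sketch using made-up weather labels. The encoder numbers the categories in sorted order, which is exactly the kind of arbitrary ordering a model could mistake for a real relationship.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels with three unique categories.
weather_labels = ["rain", "clear", "snow", "clear"]

weather_encoder = LabelEncoder()
weather_encoded = weather_encoder.fit_transform(weather_labels)

# classes_ lists the categories; each one's position is its integer code.
print(weather_encoder.classes_)
print(weather_encoded)

# inverse_transform converts the integers back to the original labels.
weather_decoded = weather_encoder.inverse_transform(weather_encoded)
print(weather_decoded)
```

Here "clear" becomes 0, "rain" becomes 1 and "snow" becomes 2, purely because of alphabetical sorting.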

### One-Hot Encoding

A different approach to encoding categorical data does not try to replace the character variables with a range of numerical data. Instead one-hot encoding produces a new feature for each category. This means for every unique value within the original feature there is now a new column of data. Sometimes, this is called creating dummy variables. 

But what is this new column of data? It is quite a simple idea: each new column contains 1s and 0s which denote whether or not that row belongs to the category the column represents. So if we have three different categories, we will have three different columns, and each row will have a 1 in the column matching its original value and a 0 in the others.

When I said it was simple, I meant it... but it is much easier to visualise than it is to read. 

<img src="../../images/onehot.png"  width="600" height="500" alt="using a colour category with three different colours each colour becomes a new column when encoded. Two small data tables shown, one before and one after encoding.">
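
The same idea can be reproduced in a few lines on a made-up colour feature. This sketch uses the `pandas` function `get_dummies` as a quick way to build the dummy columns. It is an alternative to the `sklearn` encoder used in the rest of this section, not a replacement for it.

```python
import pandas as pd

# Made-up feature with three unique colours.
colour_data = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Build one 0/1 column per unique colour.
colour_dummies = pd.get_dummies(colour_data["colour"], prefix="colour", dtype=int)

# Put the original and encoded columns side by side to compare them.
colour_comparison = pd.concat([colour_data, colour_dummies], axis=1)

print(colour_comparison)
```

Each row contains exactly one 1, in the column that matches its original colour.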


Let's try to encode the **`season`** feature to show what this method can do.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Fetch the column as a data frame
season_data = pd.DataFrame(data=clean_data["season"], columns=["season"])

# Create the encoder object
one_hot_encoder = OneHotEncoder()

# Fit and transform the data to the new encoded form in the same step.
ohe_season_data_array = one_hot_encoder.fit_transform(season_data[["season"]]).toarray()

# Get the names of the new features created.
# (get_feature_names_out replaces the older get_feature_names in newer sklearn versions.)
column_names = one_hot_encoder.get_feature_names_out(['season'])

# Make a data frame with the new data columns.
ohe_season_data = pd.DataFrame(data=ohe_season_data_array, columns=column_names)

# Combine the original season and encoded season data.
ohe_season_data = pd.concat([ohe_season_data, season_data], axis=1)

ohe_season_data.sample(n=10)

<div class="alert alert-block alert-warning">
<b> <font size=3> Warning </font></b>
<p>
Our results from the one hot encoder are in a <b>numpy</b> array. For this reason it is crucial that we reset the indexes on the original data frame, or else we will have two different sets of indexes when we combine the separate frames later.
</p>
</div>

We can see in our new data frame that the value in the column for each season corresponds to the original **`season`** feature. This is a really powerful tool to allow categorical data to be used by machine learning models, but let us for a moment discuss one of the limitations of this method. We have used a feature with only four unique values. What happens if there are many thousands? How should we handle this?

<div class="alert alert-block alert-info">
<b><font size="3">Exercise 3:</font></b> 
    <p> Using the method shown in this section, one-hot encode the <b>"weather_code"</b> feature from <b>clean_data</b> and store the new data frame with the new <b>"weather_code"</b> columns as <b>weather_encoded_data</b>. Include the original <b>"weather_code"</b> feature in order to compare the results.
</p> </div>


In [None]:
# Write code here


What should we do however if we have incredibly large numbers of categories in our feature? This would generate large numbers of columns which may not be informative for our model to pick out a pattern. 

We have a few options in this case. One would be to use dimension reduction, an area of unsupervised machine learning which reduces the number of features in the data set by combining the features mathematically.

Another option would be to aggregate the different categories together. For example, suppose we have a feature "colours" that contains thousands of different hues such as "turquoise", "navy", "aqua" and so on. By mapping each of these unique categories to a broader group we can reduce the number of different categories. Here, we would categorise each of the listed colours under the umbrella of "blue".
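
A minimal sketch of this kind of aggregation is shown below. The lookup table is hypothetical, and the "scarlet" and "crimson" entries are added here only so that more than one group appears.

```python
import pandas as pd

# Made-up feature containing specific hues.
colours = pd.Series(["turquoise", "navy", "aqua", "scarlet", "crimson"])

# Hypothetical lookup from each specific hue to a broader group.
hue_to_group = {
    "turquoise": "blue",
    "navy": "blue",
    "aqua": "blue",
    "scarlet": "red",
    "crimson": "red",
}

# Replace every hue with the group it belongs to.
colour_groups = colours.map(hue_to_group)

print("Unique categories before:", colours.nunique())
print("Unique categories after: ", colour_groups.nunique())
```

One-hot encoding the grouped feature would now create two columns rather than five. Any hue missing from the lookup would become a missing value, so the mapping needs to cover every category.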

## Summary

We will often be given data in a format that our machine learning algorithms cannot easily handle, so it is important that we engineer usable features from our raw data. The most common type of feature engineering is converting data from one format to another. In this section we have gone into detail about how to convert categorical data to numerical data, using both a label encoder and a one-hot encoder. There are many more methods for extracting features from data, and each type of data will have some natural ways of being converted to a usable feature. However, as always you should check what the model you have chosen requires, and also take computational expense into account.

---
# Data Scaling


## What is it, and why do it?

We have now brought our data into the notebook, cleaned it of any missing values and converted it to numerical data. However, we are not yet able to use it to train a model. Notice what the variables in our dataset are: we have temperatures (Celsius), wind speeds (km/h) and humidity (% moisture). Our machine learning model will not understand that these different variables are describing very different phenomena.

What matters in most machine learning algorithms is the variation in the values of each variable and the correlation between different variables (i.e. the joint distribution of the data space). By scaling variables we can bring them into a common form that expresses the variation in their values on a level playing field. Real data across different variables will have different sizes, ranges and units. For most machine learning algorithms, we can achieve better convergence and can compare features with different units and scales if we transform the data into a common form (i.e. keeping the distribution but removing the effect of sizes, ranges and units).

Scaling our data effectively means transforming each different feature to be on the same scale as each other feature.

## Standardisation

In order to make our data usable for some algorithms we need to make each variable comparable. 

Using what we know about our data can help us to decide which form of scaling is most appropriate. If we assume that our variables follow a normal distribution then there is a scaler designed to use this information. The scaler keeps the shape of the original distribution but changes the values themselves.

To do this, the scaler uses the mean and standard deviation of the data itself to produce new values called the "standard score" $z_i$, given by:

$$ z_i = \frac{x_i - \mu}{s} $$

where $x_i$ is a data value, $\mu$ is the mean of the data set and $s$ is the standard deviation of the data samples. This produces a new dataset that has a mean of $\mu'=0$ and a standard deviation of $s'=1$. In other words, we transform the normally distributed variable, $x_i$, to a **standard normal** distribution.

This method is called standard scaling, and is included within the **`sklearn`** package **`sklearn.preprocessing`**.

Standard Scaling is done **column wise**.
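
The standard score formula can be checked by hand on a few made-up values. The sketch below computes $z_i$ directly with `numpy` and compares the result with **`StandardScaler`**, which uses the population standard deviation (dividing by the number of samples).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for a single feature.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Apply z_i = (x_i - mu) / s by hand.
mu = x.mean()
s = x.std()  # numpy's default is the population standard deviation
z_by_hand = (x - mu) / s

# The scaler expects a 2-D column, so reshape before and flatten after.
z_from_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(z_by_hand)
print(z_from_sklearn)
print("mean:", z_by_hand.mean(), "standard deviation:", z_by_hand.std())
```

Note that the `pandas` method `.std()` defaults to the sample standard deviation (dividing by one fewer than the number of samples), so it reports a value slightly different from 1 on scaled data.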

Firstly, let's have a look at one of our variables, **`real_temperature`**. 

In [None]:
# Creating a figure of the density of a variable across the different values.
ax = clean_data['real_temperature'].plot.density(title="Distribution of real_temperature", label="real_temperature")

mean_real_temperature = clean_data['real_temperature'].mean()

# Plotting the mean value on the distribution against the density.
plt.axvline(x=mean_real_temperature, color='r', linestyle='--', label="mean")
ax.legend();

This particular variable is not a perfect normal distribution, but real world data is rarely going to exactly follow a prescribed distribution. It does look approximately the right shape, so we are going to use the standard scaler on it and see what the result looks like.

In [None]:
# Import the specific class we want to use.
from sklearn.preprocessing import StandardScaler

# We need to create the scaler object before we can use it.
standard_scaler = StandardScaler()

# Making a new data frame containing the data we want to process for handling purposes.
real_temperature_data = pd.DataFrame(data=clean_data['real_temperature'])

# We can both fit our scaler to the data and transform the data in one step with fit_transform
clean_data['scaled_real_temperature'] = standard_scaler.fit_transform(X=real_temperature_data)

One of the often challenging parts of a machine learning workflow can be transforming data into the format that the different functions require. We will go over this, data structures and more later in this chapter.

Let's have a look at our newly scaled data to understand what has happened, first by comparing the two columns and then by plotting both.

In [None]:
# We can now compare the pre and post scaled data, using a sample of the data.
clean_data[["real_temperature", "scaled_real_temperature"]].sample(n=10)

In [None]:
# Generate the plots
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))

# Plot the pre-scaling distribution.
clean_data['real_temperature'].plot.density(
        ax=ax1, color='r', title="Before Scaling", label="real_temperature")

ax1.axvline(x=mean_real_temperature, color='g', linestyle='--', label="mean")
ax1.legend(loc="lower right")

# Calculate the new mean and plot the post scaling distribution together.
mean_scaled_real_temperature = clean_data['scaled_real_temperature'].mean()

clean_data.scaled_real_temperature.plot.density(
        ax=ax2, color='b', title="After Scaling", label="real_temperature_scaled")

ax2.axvline(x=mean_scaled_real_temperature, color='y', linestyle='--', label="mean")
ax2.legend(loc="lower right");

We can now see the effect of standard scaling: the data maintains the same distribution shape, but has a mean $\mu = 0$ and a standard deviation $s \approx 1$.

Let's drop the **`scaled_real_temperature`** column so we have the original cleaned data.

In [None]:
# Remove the scaled data.
clean_data = clean_data.drop(columns=["scaled_real_temperature"])

clean_data.head()

What has happened here may in fact be clearer when you see what happens to multiple features when they are standard scaled at the same time. 

In [None]:
# Take a copy so the assignment further down does not write to a view of clean_data.
scaled_clean_data = clean_data[["real_temperature", "wind_speed"]].copy()

# Generate the plots
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))

# Plot the pre-scaling distribution.
clean_data['real_temperature'].plot.density(
        ax=ax1, color='r', title="Before Scaling", label="real_temperature")
clean_data['wind_speed'].plot.density(
        ax=ax1, color='b', title="Before Scaling", label="wind_speed")

scaled_clean_data[["real_temperature", "wind_speed"]] = standard_scaler.fit_transform(X=scaled_clean_data)

scaled_clean_data['real_temperature'].plot.density(
        ax=ax2, color='r', title="After Scaling", label="real_temperature")
scaled_clean_data['wind_speed'].plot.density(
        ax=ax2, color='b', title="After Scaling", label="wind_speed");

We can see that both features, which measure two incomparable things (temperature and wind speed), are now on the same scale.

## Normalization

If the scale of one feature compared to another is large then we may choose to instead normalize the features. This is particularly crucial for machine learning algorithms that are reliant on distance measurements in the data space (specifically Euclidean distance).

Normalization rescales each row of the data so that its vector of feature values has a length of 1, a unit length. For non-negative data such as ours, every value then falls in the range $0 \leq x_i \leq 1$ (in general, $-1 \leq x_i \leq 1$), so that each variable is comparable. Whereas in scaling we may change the range of the data, in normalization we alter the actual distribution.

This is done by the transformation: $$q_i = \frac{x_i}{\sqrt{x_i^{2}+y_i^{2}+z_i^{2}}} $$ 

where $q_i$ is the new value, $x_i$ is the feature in question, $y_i, z_i$ are other features. More generally the normalizer divides each value of $x_i$ by the magnitude of the data point in $n$-dimensional space where $n$ is the number of features.

Normalization is performed **row wise**.
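As a rough illustration of the row-wise behaviour, the sketch below divides each row of a small hypothetical array (`toy_rows`, not taken from the bikes data) by its own Euclidean length and compares the result with `Normalizer()`, assuming its default `norm="l2"` setting.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical rows with two features each.
toy_rows = np.array([[3.0, 4.0],
                     [1.0, 0.0],
                     [6.0, 8.0]])

# Length (Euclidean norm) of each row, kept as a column so it broadcasts per row.
row_lengths = np.linalg.norm(toy_rows, axis=1, keepdims=True)
manual_normalized = toy_rows / row_lengths

sklearn_normalized = Normalizer().fit_transform(toy_rows)

print(manual_normalized)
# Every row should now have a length of 1.
print(np.linalg.norm(sklearn_normalized, axis=1))
```

The first and third rows point in the same direction, so they become identical after normalization: only the direction of each row is kept, not its size.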

Normalization is fundamentally different to other scaling methods that we will see in this course. 

When features are normalized the resulting distribution is altered, not just shifted or stretched. For some machine learning problems this is very useful. For others this may make the resulting features less useful. 

Machine learning algorithms/models which use the distance between data points can often benefit from normalization. Families of models that use distance, or "metrics", include Kernel based models and Nearest Neighbor algorithms. We will look at distances, and distance based models in Case Study A, and the Clustering Chapter. 

Supervised models using distance measures:

* k-Nearest Neighbours
* Support Vector Machines (SVMs)

These sorts of models are especially good at modelling non-linear relationships.

Normalization is often a poor choice when the distribution of our feature follows a gaussian distribution of some form. In this case, using one of the scalers discussed is frequently better.

Supervised models where we would rarely use normalisation:

* Linear Regression
* Logistic Regression
* Tree-based models

Let's run through an example of normalizing our data using the **`humidity`** and **`wind_speed`** features to explore the output.

In [None]:
# Create a smaller data frame to manipulate certain columns.
humidity_wind_data = clean_data[["humidity", "wind_speed"]]

humidity_wind_data.head()

In [None]:
from sklearn.preprocessing import Normalizer

# Create normalizer object
normal_scaler = Normalizer()

# Fit and transform the data with the normalizer in a new data frame.
humidity_wind_data_normalized = pd.DataFrame(normal_scaler.fit_transform(X=humidity_wind_data), 
                                      columns=["normed_humidity","normed_wind_speed"])

humidity_wind_data_normalized.sample(n=10)

But what does this actually look like in our data space?

In [None]:
# Generate the plots
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5), sharey=False)

# Plot the pre-normalizing distribution.
humidity_wind_data['humidity'].plot.density(ax=ax1, color='r', title="Before Normalizer")

humidity_wind_data['wind_speed'].plot.density(ax=ax1, color='b')
ax1.legend(loc="upper right")

# Plot the normalized data
humidity_wind_data_normalized['normed_humidity'].plot.density(ax=ax2, color='r', title="After Normalizer")

humidity_wind_data_normalized['normed_wind_speed'].plot.density(ax=ax2, color='b')
ax2.legend(loc="upper right");

The two features now have a common range, between 0 and 1. This was achieved by changing the distribution of each feature. They are now comparable to each other on the same unit scale. 

We can somewhat disregard the y-axis in this example. As this is a density the sum of the area underneath each curve must equal 1. As we change the different values of the data, the density changes to keep the sum to 1. 

This plot however does not show us what is happening to each row of the data. The below code plots the two features against each other before and after normalisation, showing that the normaliser relates each feature to the other on a row by row basis and creates new values which have a unit length of 1.

In [None]:
# Generate the plots
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5), sharey=False)

# Plot the pre-normalizing distribution.
ax1.scatter(humidity_wind_data['humidity'], humidity_wind_data['wind_speed'], color='purple')
ax1.set_xlim(0, 100)
ax1.set_ylim(0, humidity_wind_data['wind_speed'].max()+3)
ax1.set_xlabel("humidity")
ax1.set_ylabel("wind_speed")
ax1.set_title("Features before normalization")

# Plot the post-normalizing distribution.
ax2.scatter(humidity_wind_data_normalized['normed_humidity'], 
            humidity_wind_data_normalized['normed_wind_speed'],
            color='green')
ax2.set_xlim(0, 1.1)
ax2.set_ylim(0, 1.1)
ax2.set_xlabel("normed_humidity")
ax2.set_ylabel("normed_wind_speed")

ax2.set_title("Features after normalization");

## Which Scaling Should be Used?

The two examples of scaling above are not exhaustive; there are different methods, each appropriate for different data and combinations of features. Some simple judgements can be made to pick a method.

- **`StandardScaler()`** - When data is normally distributed this is the go-to method. Transforms each feature to have $\mu = 0 , s = 1$.

- **`Normalizer()`** - Each row is rescaled to have unit length, so every data point lies exactly 1 distance unit from the origin of the data space. This is useful for algorithms that use Euclidean distances to make calculations.

- **`MinMaxScaler()`** - This scaler uses the minimum and maximum values of each feature to transform it into a fixed range, $0 \leq q \leq 1$ by default (another range, such as $-1 \leq q \leq 1$, can be requested with the `feature_range` argument). This method is especially useful when the input data is not normally distributed. It preserves the skewness of the data but brings different variables within the same range. 

- **`RobustScaler()`** - Robust scaling is similar in spirit to the `MinMaxScaler()`, but instead of using the minimum and maximum values it subtracts the median and divides by the interquartile range. This makes it resilient to outlier values, which can disrupt the `MinMaxScaler()`.
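The difference between the last two scalers in the list is easiest to see on data containing an outlier. The sketch below uses a hypothetical five-value feature (`toy_feature`) with one extreme value; it is an illustration only and not a substitute for checking the behaviour on your own data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical toy feature with one large outlier (100).
toy_feature = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# MinMaxScaler maps the minimum to 0 and the maximum to 1,
# so the outlier squeezes all the other values close to 0.
min_max_scaled = MinMaxScaler().fit_transform(toy_feature)

# RobustScaler subtracts the median and divides by the interquartile range,
# so the typical values keep a sensible spread and only the outlier is extreme.
robust_scaled = RobustScaler().fit_transform(toy_feature)

print(min_max_scaled.ravel())
print(robust_scaled.ravel())
```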

Not only can data be scaled for standardisation purposes, it can also be converted back from scaled values using the **`inverse_transform(X=scaled_data)`** method. 
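A minimal round-trip sketch, using a hypothetical `toy_data` array, shows scaled values being converted back to their original units. It uses `StandardScaler()`; `Normalizer()` does not provide `inverse_transform`, because the original row lengths are discarded.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two features on very different scales.
toy_data = np.array([[10.0, 200.0],
                     [20.0, 400.0],
                     [30.0, 600.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy_data)

# The fitted scaler remembers the mean and standard deviation of each column,
# so it can undo the transformation.
restored = scaler.inverse_transform(scaled)

print(np.allclose(restored, toy_data))
```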

For a more in depth explanation of each method covered, and accompanying graphics, see [this link](http://benalexkeen.com/feature-scaling-with-scikit-learn/)

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 4:</font></b> 

<p> For all the <i>numerical</i> features in <b>clean_data</b> use the <b><i>RobustScaler()</i></b> to create a new data frame with the scaled data. This scaler has the same syntax as those previously shown. Call this frame <b>scaled_data</b>. Be sure to check what your data looks like after it has been scaled.</p> </div>

In [None]:
# Write your code here


## Summary

In this section we have seen some methods for scaling our data. We want to scale our data so that different variables are within the same range, and can therefore be compared appropriately by our machine learning algorithms. Without scaling, the differences between features' units and ranges could significantly change the sensitivity of our model and its ability to predict. Not all models require scaling, so it is important to determine both whether it should be used and which method is the most appropriate for the data. 

---
# Composite steps in `sklearn`


Once you know how the different techniques we have discussed work, a useful tool for applying several of them at once is the **`ColumnTransformer()`**. It allows you to pair each "transformation" object, such as **`StandardScaler()`**, with the columns it should act on, and will apply each transformation to those columns of the data frame.

An example that would use the standard scaler and one hot encoder on our data would be:

```python
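# Import the helper and the transformations.
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Create the transformer object with the different transformations and their columns.
# Each pair is (transformer, columns); this order assumes scikit-learn 0.22 or later.
column_transform = make_column_transformer(
    (StandardScaler(), ['real_temperature', 'humidity']),
    (OneHotEncoder(), ['weather_code'])
)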
# Fit and transform the data frame.
column_transform.fit_transform(clean_data)
```

## Pipelines

As we move further on in our process design, and beyond exploratory analysis, we will want to be able to perform a number of steps, one after another. In `sklearn` we can do so using the `sklearn.pipeline.Pipeline` object.

This will allow us to apply transformations to our data one step after another.

For more information see the [sklearn documentation](https://scikit-learn.org/stable/modules/compose.html#pipeline).
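As a small sketch of the idea, the pipeline below chains two preparation steps: filling missing values with the column mean and then standard scaling. The array `toy_X` and the step names `"imputer"` and `"scaler"` are hypothetical choices made for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two missing values.
toy_X = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [np.nan, 40.0]])

# Each step is a (name, transformer) pair; the steps run in the order given.
preparation_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# fit_transform runs the imputer first, then passes its output to the scaler.
prepared = preparation_pipeline.fit_transform(toy_X)
print(prepared)
```

A model can be added as the final step of a pipeline in the same way, so that preparation and training happen in one call to `fit`.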

---
# Feature Selection 

The purpose of supervised learning, broadly, is to make good predictions about new data using the features of the data we have. However, not all features are created equally, nor are they necessarily suited for the task at hand. The process of choosing which features of the data to use is not static; there isn't always an absolute correct answer. There are trade-offs with regards to how many features are chosen, which are generally as follows:

- **Many features (more complex)** - More likely to capture complicated phenomena and provide higher accuracy of prediction.

- **Fewer features (less complex)** - Easier to explain model, less computationally expensive, less likely to overfit the data.

In this section we will go through a range of methods for selecting features to pass to a model. Fundamentally, it is about understanding your data and the model you have chosen to use.

## Removing Unique Features

Due to how databases work, data will often come with a unique index feature that allows each row to be distinguished. For example, the rows could be ordered $0$ to $n-1$ where $n$ is the number of rows in the data. It is key that we remove these types of features that uniquely identify each row or else a model may learn the relationship between the specific row and the target variable, rather than actual features and the data itself. 

Other examples of unique data may include: individual names, time series dates, identification numbers etc.

Without removing the unique feature, if we give the model new data to predict, it will not be able to effectively take into account the right relationships as it will not have seen the new data's unique identifier before. Passing unique identifiers to the model will lead to significant overfitting.

Whilst dates can be useful features for some applications, we are going to assume there are no underlying properties of the date strings that would benefit our model. As the date is a unique identifier, we are going to remove this feature from the set, which will prevent future overfitting of the model. 

In [None]:
# Remove the date feature from the frame.
clean_data = clean_data.drop(columns=['date'])

clean_data.head()

## Correlation Matrix

For many machine learning prediction tasks, what we want to know is how good feature A is going to be at predicting feature B. The correlation between two variables allows us to see quantitatively how related two variables are. A strong relationship between two variables will be a good indicator as to whether one will help predict the other.

Basic correlation measures only **linear** relationships between variables. If there is a relationship between the variables, but the relationship is non-linear, this may not be picked up in our correlation coefficients.
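A short sketch on made-up data illustrates the limitation. The hypothetical column `quadratic` is completely determined by `x`, yet because the relationship is not linear (and is symmetric around zero) the correlation coefficient is close to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical data: x runs symmetrically from -5 to 5.
toy = pd.DataFrame({"x": np.linspace(-5, 5, 101)})

# A perfectly linear relationship and a perfectly quadratic one.
toy["linear"] = 2 * toy["x"] + 1
toy["quadratic"] = toy["x"] ** 2

# The correlation with "linear" is 1, but with "quadratic" it is about 0,
# even though x fully determines both columns.
print(toy.corr().round(2))
```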

Pearson's product-moment coefficient, a commonly used correlation coefficient, is defined between two variables $X$ and $Y$ as:

$$corr(X,Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y}$$

Where $cov(X,Y)$ is the [covariance](https://en.wikipedia.org/wiki/Covariance) between $X$ and $Y$, and $\sigma_X, \sigma_Y$ are the standard deviations of $X$ and $Y$. For this reason you may see mentions of covariance matrices, a related concept.
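The formula can be checked numerically. The sketch below computes the coefficient by hand on hypothetical random data (the variables `x` and `y` and the seed are arbitrary choices) and compares it with `numpy`'s built-in `corrcoef`.

```python
import numpy as np

# Hypothetical data: y depends linearly on x, plus some noise.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

# Covariance: the average product of the deviations from each mean.
covariance = np.mean((x - x.mean()) * (y - y.mean()))

# Divide by the product of the two standard deviations.
manual_corr = covariance / (x.std() * y.std())

numpy_corr = np.corrcoef(x, y)[0, 1]

print(manual_corr, numpy_corr)
```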

**`pandas`** allows us to easily find the correlation coefficients between all the variables in our data frame and visualise them with a heatmap to show how strongly they are linked.

Most importantly, if a feature is strongly correlated with our target variable then it is likely to be a good predictor of the target variable. We want to include features in our model that are going to be helpful to predict the target.

The relationships between the features themselves are important to us too. Beyond just understanding our data better, we can use the relationships between features to create new features.

When features are highly correlated this can introduce problems in our models. This is called "multicollinearity", and can reduce the robustness of the resulting model. This is discussed in more depth below.

<br>

A key point about correlations is that they can only be computed between numerical data, yet another reason why our encoding of categorical data is key. 

Here is an example of producing a correlation matrix plot of the numerical data in the **`clean_data`** set. We will be able to see what the relationships between the different features are. 

In [None]:
# Select only the numerical data.
numerical_data = clean_data.select_dtypes(include='number')

# Generate the correlation matrix with pandas.
correlation_matrix = numerical_data.corr()

# Show the matrix as a heatmap, matplotlib can also be used instead.
# format(precision=2) is used because set_precision() was removed in pandas 2.0.
correlation_matrix.style.background_gradient(cmap='coolwarm').format(precision=2)

Our target in this instance is the **`count`** variable, so we are going to look at how the other features are related to the **`count`**. What we are looking for is values that are close to $+1$ or $-1$, as these show strong positive or negative correlations and are likely to be good predictors. The values nearer $0$ are not as useful to us. 

Looking at our map, the variables **`real_temperature`**, **`feel_temperature`** and **`humidity`** are correlated to a significant degree with the **`count`** variable. The **`wind_speed`** is less so, with a value nearer to $0$.  

### Multicollinearity

Unsurprisingly the **`real_temperature`** and the **`feel_temperature`** are nearly perfectly correlated. We do not want to include both of these in our model because that could significantly skew our results. This effect is called multicollinearity, and preventing it is an important step for many algorithms, particularly linear models. 

Multicollinearity occurs when one feature in a data set can be created as a linear combination of other features. For example, if we created a new feature in the data set "average_temperature" which is defined as:

$$ average\_temperature = 0.5 \times feel\_temperature + 0.5 \times real\_temperature$$

This is a linear combination (the features are added) which causes multicollinearity.

<br>

### Multicollinearity Challenges

When we have multicollinearity in the features we give a model, this can cause issues with interpretability and predictive performance.

Consider if we put $feel\_temperature$ and $real\_temperature$ into a simple linear model (covered in the next chapter).

If we use just one of these features to predict $count$, we can interpret the coefficient produced by the model easily.

If we use **both** features, which are highly correlated, then we might get a high coefficient value for the first feature, but not the second. Alternatively, we might get a high coefficient value for the second feature but not the first. Or we could even have middle-sized coefficients for both. Which one of these happens will depend on small changes in the data we give to the model.

Using both highly correlated features in the model will mean that we cannot tell conclusively which feature has a bigger effect.

As a result, our model may not perform hugely worse **overall**, but it will be harder to interpret and will produce more incorrect predictions for **some individual data points**.
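The instability of the coefficients can be sketched with synthetic data. Everything below is hypothetical: `feature_a` and `feature_b` are two nearly identical made-up features (standing in for the two temperature columns), and the model is refitted on random resamples of the rows to mimic "small changes in the data".

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_rows = 200

# Two hypothetical features that are almost perfectly correlated.
feature_a = rng.normal(size=n_rows)
feature_b = feature_a + rng.normal(scale=0.01, size=n_rows)

# The hypothetical target only depends on the shared signal (true effect = 2).
target = 2 * feature_a + rng.normal(scale=0.5, size=n_rows)
features = np.column_stack([feature_a, feature_b])

coefficient_sets = []
for _ in range(5):
    # Resample the rows with replacement to simulate a slightly different data set.
    sample_rows = rng.integers(0, n_rows, size=n_rows)
    model = LinearRegression().fit(features[sample_rows], target[sample_rows])
    coefficient_sets.append(model.coef_)
    print(model.coef_.round(2), "sum:", round(float(model.coef_.sum()), 2))
```

In runs like this the individual coefficients tend to swing widely between resamples, while their sum stays close to the true combined effect. The model knows the two features matter together but cannot decide how to share the effect between them.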

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 5:</font></b> 
<p> 
    Using the <b>weather_encoded_data</b> frame from Exercise 4, plot a correlation map to see the relationship between different weather codes and the bike count. You will need to make sure <i>"count"</i> is in the data frame. 
</p> </div>

In [None]:
# Write code here


## Automatic Feature Selection

Scikit-Learn comes with a function that is useful in selecting the appropriate features for your model. **`SelectKBest()`** does as the name implies, selecting the $K$ best features in your data set, where $K$ is a number the user inputs to the model. But how does **`SelectKBest()`** choose which are the best features?

There are a range of different statistical tests that can be passed to the function which will give us the best features for a certain task. For example, we can use a chi-squared test for a classification problem or an F-test for regression, among many other options. The function then scores each feature based on how well it performs in the statistical test. From these scores it then picks the best performing features. 

The **`SelectKBest()`** function requires a distinction between the features and the target data. These are typically denoted as $X$ (a matrix) and $y$ (a vector). To do this we must separate our data frame as shown below. We are also going to use a subset of the features to compare between. It is important that we get used to the $X$, $y$ notation as it is commonly used in Machine Learning to describe our feature and target sets. 

In [None]:
# Get the target data column into its own series object.
y = clean_data['count']

# Remove the non-numerical data and the target.
X = clean_data.drop(columns=['count', 'is_weekend', 'is_holiday', 'weather_code', 'season'])
X.head()

The target data **`count`** is a continuous variable, so we are going to use regression to predict its value. In order to select the features we should therefore use a regression based statistical test in the **`SelectKBest()`** function. In this case, we will use the **`f_regression`** measure.

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression

# We have four features in the data set, we want to find the best three.
number_of_desired_features = 3

# Create the selector object
fr_selector = SelectKBest(score_func=f_regression, k=number_of_desired_features)

fr_selector.fit(X=X, y=y)

# We then get an object that has True/False values of whether the number_of_desired_features are in the best set.
columns_selected = fr_selector.get_support()

# Get the columns from our frame which correspond to the True/False values.
selected_cols = X.columns[columns_selected].to_list()
print("Top {} Columns are:\n\n\t{}".format(number_of_desired_features, selected_cols))

# Get the frame with only the columns we want.
selected_data = X[selected_cols]

selected_data.head()

We can see that this method agrees with our previous conclusion that the wind speed is not a useful predictor for the count attribute. However, this method has not removed one of the two highly correlated attributes. Multicollinearity will still be an issue, at least in this example.

There are two other common methods that can be used either manually, or by calling an **`sklearn`** implementation. These are based on the concept of adding and removing features one by one to see the change in performance of the model. The methods are as follows:

- **Forward Selection** - The user inputs the number of desired features, $K$, and the selector adds, one by one, the features that improve the model performance the most according to a cross-validated score (chosen with the `scoring` argument). This is implemented using **`sklearn.feature_selection.SequentialFeatureSelector()`** with `direction="forward"`.
- **Recursive Feature Elimination** - The function starts with all features included in a model, then removes the feature that is least beneficial to the model performance. The model then evaluates the new subset of features and removes another; this is repeated until there are $K$ features. This is implemented using **`sklearn.feature_selection.RFE()`**.
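Both methods can be sketched on synthetic data. In the hypothetical example below only the first two of four made-up columns influence the target, and a `LinearRegression` model is assumed as the estimator; `SequentialFeatureSelector` requires scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical data: four features, only the first two affect the target.
toy_X = rng.normal(size=(300, 4))
toy_y = 5 * toy_X[:, 0] - 3 * toy_X[:, 1] + rng.normal(scale=0.1, size=300)

# Forward selection: start with no features and add the most helpful one each round.
forward_selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward")
forward_selector.fit(toy_X, toy_y)

# Recursive feature elimination: start with all features and drop the weakest each round.
rfe_selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe_selector.fit(toy_X, toy_y)

# As with SelectKBest, get_support() gives a True/False mask over the columns.
print(forward_selector.get_support())
print(rfe_selector.get_support())
```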

<div class="alert alert-block alert-info">
<b><font size="4">Exercise 6:</font></b> 
<p> 
    Using the <b>weather_encoded_data</b> previously prepared, find the <i>four</i> most important attributes in the data set using a <b><i>"score_func="</i></b> argument not used in the example above. Store this as the data frame <b>best_four_data</b>.
</p> </div>

(the online documentation for selecting the K best [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) may help!)

In [None]:
# Write your code here 


## Summary

Not all features are created equally, and there are different reasons why we may want to remove different attributes. Firstly, we need to remove unique identifying data, as this helps prevent overfitting. In addition, we want to only use features that are actually useful for our task; this can often be measured by how correlated a feature is with our target attribute. But this correlation is not in and of itself the key: for example, if we have two highly correlated features we may decide to remove one of them in order to prevent multicollinearity. Lastly, there exist automated methods for selecting which attributes to include in our model, which score each feature on how useful it is for a certain task.

## `numpy` and `pandas`
In general, the scikit-learn package of classes is consistent in style, and allows for a variety of widely used data structures to be input into its functions. 

The base data structures used across **`sklearn`** are **`numpy`** arrays. All structures, **`pandas`** data frames, series and lists, are effectively converted into this format. **`numpy`** arrays are fast to use; however, they are not as user friendly at the beginning of your development when compared to **`pandas`** data frames. For example, column names are not readily available for **`numpy`** arrays. As your skill with the various aspects of **`sklearn`** improves, there will be more complicated functions and processes that may require usage of **`numpy`** arrays.

Below is an example of what **`numpy`** array versions of our data would look like.

In [None]:
# Looking at our X and y data.
X.head()

In [None]:
y.head()

In [None]:
# Convert the pandas data frames into numpy arrays.
X_np = X.to_numpy()
y_np = y.to_numpy()

print(X)
print(X_np)

print("\nShape of X: ", X_np.shape)
print("Shape of y: ", y_np.shape)

## The Data Structures We Want

When we pass an array, either `X` or `y`, into our model it needs to be the right shape in order to be accepted. 

For the features array we require the data to be of the shape `(n_samples, n_features)`. This is shown in the example above.

The target array however is required to be of the shape `(n_samples,)`
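The two shapes are easy to confuse when there is only one feature. The sketch below uses a hypothetical three-value array to show how `reshape(-1, 1)` produces the `(n_samples, n_features)` layout expected for features, and how `ravel()` returns to the flat `(n_samples,)` layout expected for a target.

```python
import numpy as np

# A hypothetical one-dimensional array of three values.
toy_values = np.array([10, 20, 30])
print(toy_values.shape)      # (3,)

# As a single feature: 3 samples, 1 feature.
as_feature = toy_values.reshape(-1, 1)
print(as_feature.shape)      # (3, 1)

# Back to a flat target array.
as_target = as_feature.ravel()
print(as_target.shape)       # (3,)
```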

## Train-Test Split

Once we create a model to predict some value, we need to be able to show that it works. In order to do this we need data for which we know the correct target values, so that we can compare our model's predictions against them. Think of this as having a markscheme (the known data) for an exam that you are checking your answers (the model) against. What is key is that we cannot use the same data for creating our model as we do for evaluating it. It wouldn't be a valid exam if you studied the markscheme before you took it!

One of the most common methods we have for separating these data is called a training-test split. We randomly sample a portion of the data and call that the training set, this is what our model will be trained on. The rest of the data is then used as the test data, which is what we will measure our model against. 

<img src="../../images/traintestsplit.png"  width="650" height="600" alt="The splitting of a data set into a training and test set of data with random sampling, done on both the feature and target set">

We typically use $80\%$ of our data as the training set, and $20\%$ as the test set. This is a trade-off: the more data we have to train with, the better our model will typically be. However, we also want enough different test records to evaluate against so we know where our model may be failing. We need the training and test set for both our features X, and our target y. It is important that the same data preparation is performed on both training and test sets; scaling is the special case, where the scaler is usually fitted on the training set only and then used to transform both sets. 

**`sklearn`** comes with a handy method for producing this training test split, unsurprisingly called **`train_test_split()`**.

In [None]:
from sklearn.model_selection import train_test_split

# Defining a random state allows us to reproduce our split in the future.
train_X, test_X, train_y, test_y = train_test_split(X_np, 
                                                    y_np, 
                                                    test_size=0.2, 
                                                    random_state=1234)

Let's look at the sizes of our different arrays. 

In [None]:
print("Original X data shape: {}\n".format(X_np.shape))

print("Training X data shape: {}\n".format(train_X.shape))

print("Test X data shape: {}\n\n".format(test_X.shape))


print("Original y data shape: {}\n".format(y_np.shape))

print("Training y data shape: {}\n".format(train_y.shape))

print("Test y data shape: {}\n".format(test_y.shape))

We can now use our training data to train our model, and then the testing data to evaluate how well our model performed without each influencing the other.
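One common way to keep the test data from influencing the training data is to fit any scaler on the training set only and then reuse it on the test set. The sketch below shows this on hypothetical random data; the `toy_` names are made up so as not to overwrite the arrays created above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features and target.
rng = np.random.default_rng(0)
toy_X = rng.normal(loc=20, scale=5, size=(100, 2))
toy_y = rng.normal(size=100)

toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1234)

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only...
scaled_train_X = scaler.fit_transform(toy_train_X)

# ...then apply the same transformation to the test data without refitting.
scaled_test_X = scaler.transform(toy_test_X)

# The training means are 0; the test means are close to, but not exactly, 0.
print(scaled_train_X.mean(axis=0).round(3))
print(scaled_test_X.mean(axis=0).round(3))
```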

## Random States

In many processes in machine learning we use random processes to do things like shuffle our data, solve equations and produce numbers. This is an important part of machine learning, but with processes being random it means they are not easily reproducible as each time you run the code it will produce a different outcome. 

We combat this by fixing the "random state" of the program (or a function), meaning it still does a random process, but it will do the same random process each time it is run. 

For example if we want to produce a random number then if we run that code again we can get the same random number the second time we run it. In our case, by fixing the random state of the program we can randomly shuffle our data the same way each time we run the program.

In the above code we passed an argument to the `train_test_split` function called `random_state`. By picking a value for the state we set the seed of the random generator meaning the same train test split will be created every time the code is run. 

In this course I will typically be assigning a value (always integer) of 123 or similar, this is a personal choice.

This means that other people viewing our work can get the same answers when they run it, making our results reproducible.
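A quick sketch of this reproducibility, using a hypothetical array of ten values: two splits made with the same `random_state` are identical, while a different value will usually give a different split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: the numbers 0 to 9.
toy_values = np.arange(10)

# Two splits with the same random state...
first_train, first_test = train_test_split(toy_values, test_size=0.2, random_state=123)
second_train, second_test = train_test_split(toy_values, test_size=0.2, random_state=123)

# ...and one with a different random state.
other_train, other_test = train_test_split(toy_values, test_size=0.2, random_state=321)

print(first_test, second_test, other_test)
print(np.array_equal(first_test, second_test))
```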


## Validation Sets

For more robust or advanced processes, such as those used in deep learning, we will often use an additional set of data: the validation set. We therefore have three sets: the training, validation and test sets. We use the training data to learn a model, and the validation set to tune the model and improve the "hyper-parameters" (which we will discuss later). The test set is used to produce an unbiased estimate of the prediction power of the model.
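One simple way to create the three sets is to call `train_test_split()` twice. The sketch below assumes a 60/20/20 split on hypothetical data; the proportions are an illustrative choice, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with two features.
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.arange(100)

# First hold back 20% of the rows as the test set.
remaining_X, toy_test_X, remaining_y, toy_test_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=123)

# Then take 25% of the remaining 80% as the validation set (0.25 * 0.8 = 0.2 of the total).
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    remaining_X, remaining_y, test_size=0.25, random_state=123)

print(len(toy_train_X), len(toy_val_X), len(toy_test_X))
```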



## Summary


From this chapter you should now have an appreciation for the data preparation steps that may be required before features can be given to a model to train. We have covered:

- How to clean data
- How to extract useful features for machine learning
- How to transform data to a different scale
- How to select which features we should use
- Basic data structures useful in **`sklearn`**
- How to split our training and test data

Not all methods shown are required for all models or data sets, however, it is important to consider whether each method is needed, and the effect it will have on the data we pass to our model.

<div class="alert alert-block alert-success">
<b><font size="4"> Next Chapter: Regression</font> </b> 
<p> 
The next chapter will discuss regression, the prediction of numerical values. We will use our *bikes* data in order to predict the count of bikes hired in a day. The chapter will also explain basic evaluation and optimization of models. 
</p>
</div>