<div>
<img src="files/machine_learning.jpg" alt="Machine learning" width="100%" align='center' source="https://www.50a.fr/img/upload/machine%20learning..jpg" /> </div>

# Introduction

Machine learning may seem intimidating with its jargon derived from the realms of computer science and statistics. However, if we start with the basics and progressively increase the complexity, it is entirely possible to grasp the fundamental concepts of this field.

This course gives you an overview of how data scientists design, develop, and implement their ML models. You can then use this knowledge to keep learning on your own, or stop here if you only need enough background to talk with data scientists.

# Practical Case: Real Estate

The first dataset we will use contains real estate data. In real life, real estate agents estimate the value of a property from experience, by associating a price with its characteristics (number of rooms, area, location, etc.).

The program we are going to create will make the same kind of prediction, i.e., estimate a value from known characteristics. This time, however, it is the computer that will "learn" on its own from the data we provide.

## Decision Tree

For now, we will use a model called a "decision tree." These are very intuitive models, easy to understand and analyze, which can be useful in many cases.

Let's start with a very simple example:

<div>
<img src="files/decision_tree_1.png" alt="Decision tree with a single split on the number of bedrooms" width="50%" align='center'/> </div>

This model divides houses into two categories: those with 2 bedrooms or fewer and those with more than 2 bedrooms. It then displays the average price of each group, which serves as the predicted price for any house in that group.

The model uses the dataset to decide how to allocate houses into these two groups, and then again to predict the price within each group. The step of setting a model's parameters from data is called training or fitting. The data used to set up this model is called training data.

The details of how the model is trained (e.g., how to split the data) are quite complex, and we will not discuss this topic in these notebooks. Once the model is fitted, we can apply it to new data to predict the price of a home.
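As a rough illustration of what "fitting" means for the single-split tree above, here is a minimal sketch. The data, the column names (`bedrooms`, `price`) and the helper `predict_price` are all made up for this example, and the split at 2 bedrooms is fixed by hand, whereas a real training procedure would choose it from the data.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the Iowa dataset).
houses = pd.DataFrame({
    "bedrooms": [1, 2, 2, 3, 4, 5],
    "price": [100_000, 120_000, 140_000, 200_000, 220_000, 240_000],
})

# "Training": compute the average price on each side of the split.
is_small = houses["bedrooms"] <= 2
leaf_prices = {
    "2 bedrooms or fewer": houses.loc[is_small, "price"].mean(),
    "more than 2 bedrooms": houses.loc[~is_small, "price"].mean(),
}

def predict_price(bedrooms):
    """Return the average price of the group the house falls into."""
    group = "2 bedrooms or fewer" if bedrooms <= 2 else "more than 2 bedrooms"
    return leaf_prices[group]

print(leaf_prices)
print(predict_price(3))
```

The two averages play the role of the model's learned parameters: they come from the training data, and predicting is then just a lookup.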

We can consider more factors by using a tree with more "splits," meaning it is "deeper."
A decision tree that also takes into account the size of each house's land might look like this.

<div>
<img src="files/decision_tree_2.png" alt="Deeper decision tree that also splits on the size of the land" width="60%" align='center'/> </div>

To predict the price of a house, we go through the decision tree, always choosing the path that corresponds to the features of that house. The predicted price for the house is found at the bottom of the tree, and this point is called a leaf.
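The walk from the root down to a leaf can be sketched in a few lines. The tree below is hypothetical: its thresholds, feature names (`bedrooms`, `lot_size`) and prices are invented for illustration and do not come from the figures or from the dataset.

```python
# Hypothetical tree: every threshold and price here is made up.
toy_tree = {
    "feature": "bedrooms", "threshold": 2,
    "left": {    # bedrooms <= 2
        "feature": "lot_size", "threshold": 8000,
        "left": {"price": 150_000},
        "right": {"price": 190_000},
    },
    "right": {   # bedrooms > 2
        "feature": "lot_size", "threshold": 11000,
        "left": {"price": 230_000},
        "right": {"price": 300_000},
    },
}

def predict_from_tree(node, house):
    """Follow the matching branch at each split until a leaf is reached."""
    while "price" not in node:
        go_left = house[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["price"]

toy_house = {"bedrooms": 3, "lot_size": 9500}
print(predict_from_tree(toy_tree, toy_house))
```

Each inner dictionary is a split and each dictionary holding a `price` is a leaf, so a prediction costs only as many comparisons as the tree is deep.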

<div>
<img src="files/classifier_tree_meme.webp" alt="Decision tree meme" width="50%" align='center' /> </div>

# Exploration with Pandas

In [None]:
import pandas as pd
df = pd.read_csv("data/iowa_housing.csv")

In [None]:
df.shape

In [None]:
df.columns

## Missing values

In [None]:
df.isna()

In [None]:
df.isna().sum()

In [None]:
df.isna().sum().loc[df.isna().sum() > 0]

In [None]:
df.isna().sum().max()

In [None]:
missing = df.isna().sum()  # Number of missing values per column, computed once
max_col_len = max(len(col) for col in df.columns)  # Column widths, just to make sure that the table...
max_val_len = max(len(str(num)) for num in missing)  # ...displays nicely :)

for col, num in missing.items():
    completion = round(100 - (num / df.shape[0] * 100))
    print(f'{col.ljust(max_col_len)} | Missing values : {str(num).ljust(max_val_len)} | Completion : {completion}%')

# Statistics

In [None]:
df['lotarea'].mean()

In [None]:
df['lotarea'].mean().round()

In [None]:
df['saleprice'].mean()

In [None]:
df['saleprice'].mean().round()

In [None]:
df.describe(include='all')

In [None]:
df['yearbuilt'].max()

In [None]:
df['yearbuilt'].min()

In [None]:
df['yearbuilt'].describe()

# Target Variable

The **target variable**, also known as the response variable, dependent variable, outcome variable, or criterion variable, is the variable we want to predict. It is represented by "y" (lower-case).

In this case, it is the last column in our dataframe that contains the sale price of the real estate: `'saleprice'`.

In [None]:
y = df['saleprice']

# Explanatory Variables

The explanatory variables, also known as predictor variables or "features", are the input variables of our model. It is through these variables that the model will determine the value of our output variable. They are represented by "X" (upper-case).

The choice of these variables has a significant impact on the results. Sometimes, we will use all the available variables, while other times we will only use a subset of them. There are many different methods (logical, scientific, statistical, computational, etc.) to help us make this choice.

Here, we will use the following variables as features:

In [None]:
feature_names = [
'lotarea',
'yearbuilt',
'1stflrsf',
'2ndflrsf',
'fullbath',
'bedroomabvgr',
'totrmsabvgrd',
]

In [None]:
X = df[feature_names]

In [None]:
X.describe()

In [None]:
X.isna().sum()

# Modeling

### Model Selection

We will choose a "decision tree," implemented in scikit-learn as *DecisionTreeRegressor*, which we will name 'iowa_model'.

In [None]:
from sklearn.tree import DecisionTreeRegressor

# random_state will allow model reproducibility
iowa_model = DecisionTreeRegressor(random_state=42)

### Model Fitting

Model training is very simple: just one line of code is enough! By convention, we first provide the features and then the target.

In [None]:
iowa_model.fit(X,y)

### Visualization

Once our model is created, we can visualize it in various ways.

In [None]:
from sklearn import tree
# print(tree.export_text(iowa_model))

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(25,20))
tree.plot_tree(iowa_model,
               feature_names=list(X.columns),  # a plain list works across scikit-learn versions
               max_depth=2,
               filled=True)

plt.show()

### Predictions

Our model can now predict values based on a set of variables. Let's try having it predict the results based on the features (X).

In [None]:
y_pred = iowa_model.predict(X)
print(y_pred)

Let's compare the first 5 predictions with the first 5 values of y.

In [None]:
list(y_pred[:5])

In [None]:
list(y[:5])

That's strange: we would expect our model to be a bit off, but it seems to predict our variable y down to the dollar! Let's verify this with some Pandas.

In [None]:
res = pd.DataFrame({'y':y,'y_pred':y_pred})
res['y_pred'] = res['y_pred'].astype(int)
res['diff'] = res['y'] - res['y_pred']
res

In [None]:
res.loc[res['diff'] != 0]

Out of the 2930 predictions made, only 71 are incorrect, and even those are not very far from the expected results. Have we created the best possible model?

# Model Validation

Every model, once trained, must be evaluated with appropriate metrics. When the target is a continuous value, as is the case here, a good starting point is to measure how far each prediction is from the truth. For each property, we calculate the absolute difference between the actual value and the value predicted by the model:
```
error = |actual value - predicted value|
```
Averaging these errors over all properties gives the MAE (Mean Absolute Error). The MAE tells us how far, on average, a prediction is from the actual value.
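To make the formula concrete, here is a small worked example. The three prices and predictions are invented, and the variable names (`toy_actual`, `toy_predicted`) are hypothetical. It computes the MAE by hand and checks that scikit-learn's `mean_absolute_error` gives the same value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual prices and predictions for three houses.
toy_actual = np.array([200_000, 150_000, 320_000])
toy_predicted = np.array([210_000, 140_000, 300_000])

# error = |actual value - predicted value| for each house
toy_errors = np.abs(toy_actual - toy_predicted)

# The MAE is the average of these absolute errors.
mae_by_hand = toy_errors.mean()
mae_sklearn = mean_absolute_error(toy_actual, toy_predicted)

print(toy_errors)
print(mae_by_hand, mae_sklearn)
```

Because the errors are taken in absolute value, an overestimate and an underestimate of the same size count equally and never cancel each other out.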

In [None]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y, y_pred)

The error is very low, but only because the way we evaluated our model is flawed. Since we used the same data to train and to test the model, it is normal to get almost only correct answers: the tree has essentially memorized its training data. This is a classic symptom of **overfitting**.

To evaluate the robustness of a model, we will test it on data it has never seen before. To do this, we will split our data into two groups: one will be used for training (*train*), and the other to test the model (*test*).

The "train_size" parameter will determine the proportion of our data used for training. A value of 0.8 means that we reserve 80% of our data for training and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split
# Now our data are split in 4 different parts
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=0.8)

iowa_model.fit(X_train, y_train)

y_pred = iowa_model.predict(X_test)
print(mean_absolute_error(y_test, y_pred))

The MAE is significantly higher, approximately 200 times higher! Since the average price of a house was around $180,000, this means our model is off by about 1/6 of the price on average. There are, of course, many ways to improve on this.
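The gap between training error and test error can be reproduced on any dataset. The sketch below uses synthetic data from `make_regression` as a stand-in (an assumption made so the example is self-contained), and all variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, not the Iowa dataset.
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# 80% of the rows for training, 20% for testing.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

demo_model = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)

# Error on data the model has already seen vs. data it has never seen.
train_mae = mean_absolute_error(yd_train, demo_model.predict(Xd_train))
test_mae = mean_absolute_error(yd_test, demo_model.predict(Xd_test))

print(f"Rows: {Xd_train.shape[0]} train / {Xd_test.shape[0]} test")
print(f"MAE on training data: {train_mae:.2f}")
print(f"MAE on test data:     {test_mae:.2f}")
```

An unconstrained tree can isolate every training row in its own leaf, so its training error says very little about how it behaves on new data.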

# Model parameters

A decision tree can be configured in many different ways, as you can see by examining the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) for our type of model.

### Difference between parameters and hyperparameters

In machine learning, hyperparameters are the parameters that govern the process of generating the internal parameters of the model.

For example, the parameters of our model include all the branches that lead to the leaves of our tree, among other things. These parameters were determined by the model during its training and changed significantly between the start of the fitting process and when it finished.

Hyperparameters, on the other hand, are parameters that are often set by a human and are not modified during training. They represent high-level directions or settings.
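The distinction can be observed directly on a scikit-learn estimator. In this sketch (synthetic data and hypothetical variable names, chosen only for illustration), `get_params()` returns the hyperparameters we set, while the fitted `tree_` attribute holds what the model learned.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only to have something to fit.
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

demo_model = DecisionTreeRegressor(max_depth=3, random_state=0)
hyper_before = demo_model.get_params()   # hyperparameters: known before training

demo_model.fit(X_demo, y_demo)
hyper_after = demo_model.get_params()    # still the same after training

# Learned parameters: they only exist once the model is fitted.
n_nodes = demo_model.tree_.node_count
root_feature = demo_model.tree_.feature[0]      # index of the feature used at the root
root_threshold = demo_model.tree_.threshold[0]  # split value chosen at the root

print(f"max_depth before/after fit: {hyper_before['max_depth']} / {hyper_after['max_depth']}")
print(f"Learned: {n_nodes} nodes, root splits feature {root_feature} at {root_threshold:.3f}")
```

The hyperparameters are identical before and after `fit`, whereas the nodes, split features and thresholds are produced entirely by training.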

One of the most important hyperparameters for this model is the maximum depth of the tree. So far we have not set it, so the tree was free to grow as deep as the training data required. Let's examine the depth it reached:

In [None]:
iowa_model.tree_.max_depth

Our tree has a maximum of 26 levels of depth. Each additional level can double the number of leaves, which lets the model fit the training data more precisely. However, each leaf then contains fewer houses, so the average price it predicts becomes less reliable. We therefore need to find a balance between precision and reliability.
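The trade-off between the number of leaves and the number of samples per leaf can be checked numerically. The sketch below uses synthetic data as a stand-in for the housing data (an assumption), with hypothetical variable names.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch, not the Iowa dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=0)

depth_report = []
for depth in [1, 2, 4, 8]:
    demo_model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    demo_model.fit(X_demo, y_demo)
    n_leaves = demo_model.get_n_leaves()
    avg_per_leaf = len(X_demo) / n_leaves  # average number of training rows per leaf
    depth_report.append((depth, n_leaves, avg_per_leaf))
    print(f"depth={depth}  leaves={n_leaves}  avg. samples per leaf={avg_per_leaf:.1f}")
```

A tree of depth `d` has at most `2 ** d` leaves, so the average number of training rows behind each predicted value shrinks quickly as the depth grows.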

### Overfitting and Underfitting

The phenomena of overfitting and underfitting are central concepts in machine learning.

- Overfitting occurs when the model's results closely match the data it was trained on, but the model makes significant errors when applied to unknown data. This happens when our decision tree is too deep.

- Underfitting occurs when the model fails to distinguish essential features in our data. It will have a poor score on both the training data and the test data. This occurs when our model is not deep enough.

<div>
<img src="files/underfitting_and_overfitting.png" alt="MAE on the training and validation sets as a function of tree depth" width="75%" align='center'/> </div>

The graph above shows the variation in MAE based on the depth of a decision tree. The term "validation" here refers to the "test" dataset. Here are some observations about the graph:

- On average, the model will always have a better score when predicting data from its training set rather than unknown data from the test set.

- Increasing the depth initially improves the model on both the training and test sets.

- There comes a point where increasing the depth improves precision on the training set, but the MAE starts to increase on the test set. This is the phenomenon of overfitting.

The goal of hyperparameter tuning is to find this balance point, which should allow us to maximize the model's performance on unseen data.
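A curve of this kind can be computed by training one tree per depth and recording both errors. This is a minimal sketch on synthetic data (an assumption, so the numbers will not match the graph above), with hypothetical variable names.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for a real dataset.
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=30.0, random_state=0)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

depths = list(range(1, 16))
train_curve, val_curve = [], []
for depth in depths:
    demo_model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    demo_model.fit(Xd_train, yd_train)
    train_curve.append(mean_absolute_error(yd_train, demo_model.predict(Xd_train)))
    val_curve.append(mean_absolute_error(yd_val, demo_model.predict(Xd_val)))

# The balance point is the depth with the lowest validation error.
best_depth = depths[val_curve.index(min(val_curve))]
print(f"Depth with the lowest validation MAE: {best_depth}")
```

Plotting `train_curve` and `val_curve` against `depths` gives the two lines of the graph: the training error keeps falling with depth, while the validation error eventually stops improving.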

In [None]:
def get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    return mae

In [None]:
# Example
get_mae(50, X_train, X_test, y_train, y_test)

In [None]:
d = {}
for max_leaf_nodes in [2, 5, 25, 30, 40, 50, 100, 200, 400]:
    mae = get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test)
    d[max_leaf_nodes] = mae
    print(f"Max leaf nodes: {max_leaf_nodes} \t\t Mean Absolute Error: {mae}")

In [None]:
(pd.DataFrame(list(d.items()), columns=['max_leaf_nodes', 'mae'])
   .plot(x='max_leaf_nodes', y='mae', color='red'));

Graphically, the best value of `max_leaf_nodes` seems to be around 40. We could find it automatically using certain techniques that we will explore later.
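One common way to automate this search is a grid search with cross-validation, available in scikit-learn as `GridSearchCV`. Whether this is the technique covered later in the course is an assumption. The sketch below runs on synthetic data with hypothetical variable names and a made-up grid of candidate values.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used so that the example is self-contained.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=25.0, random_state=0)

candidate_values = [2, 5, 25, 50, 100]  # illustrative grid, not a recommendation

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_leaf_nodes": candidate_values},
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, hence the sign
    cv=5,                               # 5-fold cross-validation
)
search.fit(X_demo, y_demo)

best_leaf_nodes = search.best_params_["max_leaf_nodes"]
best_cv_mae = -search.best_score_  # flip the sign back to get an MAE

print(f"Best max_leaf_nodes: {best_leaf_nodes} (cross-validated MAE: {best_cv_mae:.2f})")
```

With the default `refit=True`, `search.best_estimator_` is retrained on all the data passed to `fit` using the best value found, so it can be used directly for predictions.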

### What happens next?

Once we have found the best hyperparameters, we can retrain the model but this time using the entire dataset to further improve accuracy. Then, we could test it on real data from a different dataset to see how it performs.

Another possibility would be to use a different model and see if it performs better or worse.

# Let's go deeper

### Column Selection Based on the Rate of Missing Values

Calling `df.dropna()` on the full dataframe would delete all of its rows, because every row has a missing value in at least one column. However, we can first eliminate all columns with a completion rate lower than 94% (for example), and only then drop the remaining incomplete rows, which lets us retain most of our dataset.

In [None]:
# Example 1
(df['garagetype'].isna().sum() / df.shape[0]) < 0.06

In [None]:
# Example 2
(df['miscfeature'].isna().sum() / df.shape[0]) < 0.06

In [None]:
cols_to_keep = [col for col in df.columns if (df[col].isna().sum() / df.shape[0]) < 0.06]

In [None]:
df.shape, df.dropna().shape, df[cols_to_keep].dropna().shape

In [None]:
new_df = df[cols_to_keep].dropna()

In [None]:
(new_df.isna().sum() > 0).sum()
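The same column selection can be written without a list comprehension, since `isna().mean()` directly gives the fraction of missing values per column. The sketch below uses a tiny made-up dataframe (`toy`) and a 50% threshold chosen only to suit such a small example.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataframe with different missing rates per column.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],                   # 0% missing
    "b": [1.0, np.nan, 3.0, 4.0],        # 25% missing
    "c": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
})

# Fraction of missing values per column (the mean of a boolean mask).
missing_rate = toy.isna().mean()

# Keep the columns below the threshold, then drop the remaining incomplete rows.
kept = toy.loc[:, missing_rate < 0.5]
cleaned = kept.dropna()

print(missing_rate)
print(toy.dropna().shape, cleaned.shape)
```

Dropping the mostly empty column first saves three of the four rows here, whereas calling `dropna()` on the original dataframe keeps only one.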