<div>
<img src="files/machine_learning.jpg" alt="ML" width="100%" align='center' source="https://www.50a.fr/img/upload/machine%20learning..jpg" /> </div>

# Introduction

Machine learning may seem intimidating with its jargon derived from the realms of computer science and statistics. However, if we start with the basics and progressively increase the complexity, it is entirely possible to grasp the fundamental concepts of this field.

This course will provide you with an overview of how data scientists design, develop, and implement their ML models. You can then use this knowledge to continue learning on your own, or stop there if you feel you know enough to talk with data scientists.

# Practical Case: Real Estate

The first dataset we will use contains real estate data. In real life, real estate agents estimate the value of a property from its characteristics (number of rooms, area, location, etc.), based on their experience.

The program we are going to create will make the same kind of prediction, i.e., estimate a value from a property's characteristics. This time, however, it is the computer that will "learn" on its own from the data we provide.

## Linear Regression

We will now use a model called "Linear Regression". This is a well-known model that you have probably used in the past, and its formula is:

$f(x) = ax + b$

Where $a$ is the slope coefficient, and $b$ is the intercept. Linear regressions are simple models that are easily explainable.
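
To make the formula concrete, here is a minimal sketch that evaluates $f(x) = ax + b$ for a few inputs. The values of `a` and `b` are hypothetical and chosen only for illustration; they are not the coefficients of any fitted model.

```python
def f(x, a, b):
    """Evaluate the line f(x) = a*x + b."""
    return a * x + b

# Hypothetical coefficients, for illustration only
a = 100.0     # slope: how much f(x) increases when x increases by 1
b = 20000.0   # intercept: the value of f(x) when x = 0

for area in (500, 1000, 1500):
    print(area, f(area, a, b))
```

Changing `a` tilts the line, while changing `b` shifts it up or down.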

<div>
<img src="files/linear_regression.svg" alt="ML" width="100%" align='center' source="https://www.reneshbedre.com/blog/linear-regression.html" /> </div>

# Exploration with Pandas

<div>
<img src="files/pandas_school.png" alt="CPU" width="100%" align='center' source='realpython.com'/> </div>

In [None]:
import pandas as pd
df = pd.read_csv("data/iowa_housing.csv")

In [None]:
df.shape

In [None]:
df.columns

## Missing values

In [None]:
df.isna()

In [None]:
df.isna().sum()

In [None]:
df.isna().sum().loc[df.isna().sum() > 0]

In [None]:
df.isna().sum().max()

In [None]:
missing = df.isna().sum()
max_col_len = max(len(col) for col in df.columns)  # Column widths, so that the table...
max_val_len = max(len(str(num)) for num in missing)  # ...displays nicely :)

for col, num in missing.items():
    completion = round(100 - num / df.shape[0] * 100)
    print(f'{col:<{max_col_len}} | Missing values : {num:<{max_val_len}} | Completion : {completion}%')

# Statistics

In [None]:
df['LotArea'].mean()

In [None]:
df['LotArea'].mean().round()

In [None]:
df['SalePrice'].mean()

In [None]:
df['SalePrice'].mean().round()

In [None]:
df.describe(include='all')

In [None]:
df['YearBuilt'].max()

In [None]:
df['YearBuilt'].min()

In [None]:
df['YearBuilt'].describe()

# Target Variable

The **target variable**, also known as the response variable, dependent variable, outcome variable, or criterion variable, is the variable we want to predict. It is represented by "y" (lower-case).

In this case, it is the last column of our DataFrame, `'SalePrice'`, which contains the sale price of each property.

In [None]:
y = df['SalePrice']

# Explanatory Variables

The explanatory variables, also known as predictor variables or "features", are the input variables of our model. It is through these variables that the model will determine the value of our output variable. They are represented by "X" (upper-case).

The choice of these variables has a significant impact on the results. Sometimes, we will use all the available variables, while other times we will only use a subset of them. There are many different methods (logical, scientific, statistical, computational, etc.) to help us make this choice.
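
As one illustration of a statistical method, the sketch below ranks candidate features by the absolute value of their correlation with the target. The DataFrame `toy` and its values are made up for the example, and correlation is only one heuristic among many; it is an assumption here, not the method used in the rest of this course.

```python
import pandas as pd

# Made-up data standing in for a real dataset
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 1200, 2000, 1800],
    "YearBuilt": [1960, 2005, 1975, 1990, 2010],
    "SalePrice": [120000, 190000, 150000, 260000, 230000],
})

# Pearson correlation of each candidate feature with the target
correlations = toy.corr()["SalePrice"].drop("SalePrice")

# Strongest linear relationship first
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```

A high correlation only indicates a linear relationship with the target; it does not guarantee that the feature will be useful in every model.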

Here, we will use only one variable as a feature: `GrLivArea` (above-grade (ground) living area, in square feet).

In [None]:
df['GrLivArea']

In [None]:
X = df[['GrLivArea']] # double brackets because we want to create a DataFrame

In [None]:
X.describe()

In [None]:
X.isna().sum()
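
The double brackets used to build `X` matter because scikit-learn expects the features as a two-dimensional table (rows are samples, columns are features). The sketch below shows the difference on a small made-up DataFrame named `toy`.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000],
    "SalePrice": [120000, 190000, 260000],
})

as_series = toy["GrLivArea"]    # single brackets: a 1-D Series
as_frame = toy[["GrLivArea"]]   # double brackets: a 2-D DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```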

## Scatter plot

In [None]:
df.plot(kind='scatter', x='GrLivArea', y='SalePrice');

# Modeling

In [None]:
# Proper way to import a class from sklearn:
from sklearn.linear_model import LinearRegression

# Let's instantiate the class
model = LinearRegression()

## Model Fitting

Model training is very simple: just one line of code is enough! By convention, we first provide the features and then the target.

In [None]:
model.fit(X,y)

In [None]:
# Our 'a'
model.coef_ # An attribute of the trained model

In [None]:
# Our 'b'
model.intercept_
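
For a single feature, the slope and intercept of an ordinary least squares fit can also be computed by hand. The sketch below does so with NumPy on made-up data and cross-checks the result with `np.polyfit`; the arrays `x` and `y` are hypothetical and are not taken from the housing dataset.

```python
import numpy as np

# Made-up data: living area (square feet) and sale price
x = np.array([1000, 1200, 1500, 1800, 2000], dtype=float)
y = np.array([120000, 150000, 190000, 230000, 260000], dtype=float)

# Closed-form ordinary least squares for one feature
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# Cross-check with NumPy's degree-1 polynomial fit (returns slope, then intercept)
a_check, b_check = np.polyfit(x, y, deg=1)

print(a, b)
print(a_check, b_check)
```

The slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept makes the line pass through the point of means.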

In [None]:
model.score(X, y) # R²

## The $R^2$ Score

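The coefficient of determination, often referred to as $R^2$ (R-squared), is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variables. For a linear regression with an intercept, evaluated on the data it was fitted on, it ranges from 0 to 1, with higher values indicating a better fit of the model to the data. On other data it can be negative, which means the model does worse than always predicting the mean.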

Mathematically, $R^2$ is calculated as:

$R^2 = 1 - \frac{{\text{SSR}}}{{\text{SST}}}$

Where:
- SSR (Sum of Squared Residuals) represents the sum of the squared differences between the predicted values and the actual values.
- SST (Total Sum of Squares) represents the total sum of squared differences between the actual values and the mean of the dependent variable.

Suppose we have the following data:

- Actual values: [2, 4, 5, 4, 6]
- Predicted values: [3, 3, 5, 4, 7]

First, we calculate SSR, which is the sum of the squared differences between the predicted values and the actual values:

$\text{SSR} = (3-2)^2 + (3-4)^2 + (5-5)^2 + (4-4)^2 + (7-6)^2$
$\text{SSR} = 1 + 1 + 0 + 0 + 1 = 3$

Next, we calculate SST, which is the sum of the squared differences between the actual values and the mean of the dependent variable:

$\text{Mean of dependent variable} = \frac{2 + 4 + 5 + 4 + 6}{5} = \frac{21}{5} = 4.2$

$\text{SST} = (2-4.2)^2 + (4-4.2)^2 + (5-4.2)^2 + (4-4.2)^2 + (6-4.2)^2$
$\text{SST} = 4.84 + 0.04 + 0.64 + 0.04 + 3.24 = 8.8$

Now, we can calculate $R^2$:

$R^2 = 1 - \frac{{\text{SSR}}}{{\text{SST}}} = 1 - \frac{3}{8.8} \approx 0.659$

So, in this example, the $R^2$ score is approximately 0.659, indicating that about 65.9% of the variance in the dependent variable is explained by the independent variables in the model.
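
The same calculation can be reproduced in a few lines of plain Python, which is a convenient way to check the arithmetic. The variable names are chosen for this sketch only; scikit-learn's `sklearn.metrics.r2_score(y_true, y_pred)` computes the same quantity.

```python
actual = [2, 4, 5, 4, 6]
predicted = [3, 3, 5, 4, 7]

mean_actual = sum(actual) / len(actual)

# Sum of squared residuals: predicted vs. actual values
ssr = sum((p - a) ** 2 for a, p in zip(actual, predicted))

# Total sum of squares: actual values vs. their mean
sst = sum((a - mean_actual) ** 2 for a in actual)

r2 = 1 - ssr / sst
print(ssr, sst, r2)
```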

## Predictions

Our model can now predict values based on a single variable.

In [None]:
# We can pass only one value and check the output
model.predict(pd.DataFrame({'GrLivArea': [1000]}))

In [None]:
# Or we can make predictions for all the houses in our dataset
y_pred = model.predict(X)

In [None]:
y_pred
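
Under the hood, a prediction is simply the fitted line applied to the input. The sketch below checks this on a small made-up dataset; `toy`, `toy_model`, and the value 1600 are hypothetical and serve only to illustrate the idea.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data, for illustration only
toy = pd.DataFrame({
    "GrLivArea": [1000, 1200, 1500, 1800, 2000],
    "SalePrice": [120000, 150000, 190000, 230000, 260000],
})
X_toy = toy[["GrLivArea"]]
y_toy = toy["SalePrice"]

toy_model = LinearRegression().fit(X_toy, y_toy)

# Prediction through the model...
new_house = pd.DataFrame({"GrLivArea": [1600]})
from_predict = toy_model.predict(new_house)[0]

# ...and the same prediction computed as a * x + b
by_hand = toy_model.coef_[0] * 1600 + toy_model.intercept_

print(from_predict, by_hand)
```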

## Visualisation

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X, y, color='b')
plt.scatter(X, y_pred, color='r');
# plt.plot(X, y_pred, color='r'); # Or with a line

In [None]:
# We can check for each prediction how off we are
res = pd.DataFrame({'y': y, 'y_pred': y_pred})
res['y_pred'] = res['y_pred'].round().astype(int)  # round to the nearest integer instead of truncating
res['diff'] = res['y'] - res['y_pred']
res
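
Individual differences are hard to read at a glance, so they are often summarized in a single number. The sketch below computes two common summaries, the mean absolute error and the root mean squared error, on a small made-up table named `res_toy`. These metrics are an addition for illustration and are not used elsewhere in this chapter.

```python
import pandas as pd

# Made-up actual and predicted prices, for illustration only
res_toy = pd.DataFrame({
    "y": [120000, 150000, 190000],
    "y_pred": [125000, 140000, 190000],
})
res_toy["diff"] = res_toy["y"] - res_toy["y_pred"]

# Mean absolute error: average size of the error, ignoring its sign
mae = res_toy["diff"].abs().mean()

# Root mean squared error: penalizes large errors more heavily
rmse = (res_toy["diff"] ** 2).mean() ** 0.5

print(mae, rmse)
```

Both values are expressed in the same unit as the target, here a price.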

## Conclusion

What we have done so far is a very simple form of machine learning: we fed an algorithm with data and obtained a fitted model that can predict new values.

You'll see in the next chapters that our journey has just started.