# Linear Regression

In this first exercise, we will create linear regression models, with an emphasis on how to use sklearn to do so.

We will start by modeling two data points and then focus on the practical steps needed to apply the same approach to more complicated situations.

## 1. Predicting Apartment Prices

We will create a linear regression model and apply it to a simple dataset to build intuition about the process and the theory. The idea is to start simple and then add complexity.

[Scikit-Learn](https://scikit-learn.org/stable/) is a Python library containing many machine learning algorithm implementations. You will see that it allows us to do a linear regression in [a few lines of code](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)!

We will use some data about the price of Paris apartments. We took two apartments located in the third arrondissement of Paris from the website [seloger](https://www.seloger.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&idtt=2,5&naturebien=1,2,4&ci=750103), and reported the area (in square meters) and the price (in euros):

Square Meters | Price
:---:|:---:
39   |550000
47   |577000

The goal is to model the relationship between the area (the feature, or independent variable, of our model) and the price (the ground truth, or dependent variable).

Let's start by importing the libraries we'll use:

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### 1.1. Data Structure

We need to create a small dataset from these data. There are many ways to structure your data, but we want it to plug easily into machine learning libraries like sklearn.

You can go to the [LinearRegression documentation of sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and have a look. We will use the `fit()` method of the `LinearRegression` class to get the model's parameters from our data (the two apartments). Your first job is to structure the data of the two apartments in a shape usable as input to a sklearn linear model.


In [None]:
# Your code here
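
One possible structure, given as a minimal sketch rather than the only valid answer: sklearn estimators expect the features as a 2D array of shape `(n_samples, n_features)` and the target as a 1D array of length `n_samples`. The names `X` and `y` follow the convention used in the rest of this notebook.

```python
import numpy as np

# Features: one row per apartment, one column per feature (the area in square meters).
X = np.array([[39], [47]])

# Target: one price (in euros) per apartment.
y = np.array([550000, 577000])

print(X.shape)  # (2, 1)
print(y.shape)  # (2,)
```

If you start from a flat list of areas instead, `np.array(areas).reshape(-1, 1)` gives the same 2D shape.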


### 1.2. Data Visualization

Now that you have created the dataset with variables `X` and `y`, we will have a look at the data. Data visualization is useful to check that everything is what we expected.

Your next job is to create a scatter plot showing the price ($y$ axis) as a function of the area in square meters ($x$ axis).

In [None]:
# Your code here
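
A minimal sketch of such a plot with matplotlib, assuming `X` is a 2D array with a single column and `y` is a 1D array (both are redefined here so the cell runs on its own). The axis labels are our own wording.

```python
import matplotlib.pyplot as plt
import numpy as np

X = np.array([[39], [47]])      # area in square meters
y = np.array([550000, 577000])  # price in euros

fig, ax = plt.subplots()
# X[:, 0] selects the single feature column as a 1D array for plotting.
ax.scatter(X[:, 0], y)
ax.set_xlabel("Area (square meters)")
ax.set_ylabel("Price (euros)")
# In a notebook with the inline backend, the figure is displayed when the cell finishes.
```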


### 1.3. Modeling

Before feeding your data to any machine learning algorithm, ask yourself: "How would I do the task manually?" If the answer is not obvious, the problem might need to be reframed. Here, we want to fit a line to our data. Since there are only two data points, a line can fit the data perfectly: it is the line passing through both points.
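
As a sketch of that manual approach, write the line as $y = ax + b$ (the names $a$ for the slope and $b$ for the intercept are our own notation, not sklearn's). With the two apartments $(x_1, y_1) = (39, 550000)$ and $(x_2, y_2) = (47, 577000)$:

$$
a = \frac{y_2 - y_1}{x_2 - x_1} = \frac{577000 - 550000}{47 - 39} = 3375,
\qquad
b = y_1 - a\,x_1 = 550000 - 3375 \times 39 = 418375
$$

A correctly fitted model should recover these values, up to floating-point rounding.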

It can be easy to get lost while building a machine learning pipeline. Starting with a simple case will allow us to easily check that the algorithm is working as expected and create a healthy workflow.

Your task is now to create a linear regression model with sklearn and fit it to our dataset. To check that it worked, use the `score()` method of `LinearRegression`, which returns the coefficient of determination $R^2$: it should be 1, corresponding to a perfect fit.

In [None]:
# Your code here
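
One way this step might look, as a sketch that assumes the `X` and `y` arrays described earlier (redefined here so the cell is self-contained). The variable names `model` and `r2` are our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[39], [47]])      # area in square meters
y = np.array([550000, 577000])  # price in euros

# Create the estimator, then estimate its parameters from the data.
model = LinearRegression()
model.fit(X, y)

# score() returns R^2 computed on the data passed to it.
r2 = model.score(X, y)
print(r2)
```

Note that the score is computed here on the same two points used for fitting, so it only confirms that the line passes through them; it says nothing about how well the model generalizes.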


### 1.4. Plot the Regression Line

That's all! We trained our first linear regression model! Now you will plot the regression line to be sure that it passes through our two points.

1. Get our model's parameters (corresponding to the intercept and the slope).
2. Plot the regression line along with the scatter plot of our apartments (the two points).

In [268]:
# Your code here
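
The two steps above can be sketched as follows. After fitting, a `LinearRegression` estimator exposes the slope(s) in `coef_` and the intercept in `intercept_`. The plotting range of 30 to 55 square meters is an arbitrary choice of ours to frame the two points.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[39], [47]])      # area in square meters
y = np.array([550000, 577000])  # price in euros

model = LinearRegression().fit(X, y)

# Step 1: read the fitted parameters. coef_ holds one slope per feature.
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)

# Step 2: evaluate the line on a grid of areas and draw it over the data.
x_line = np.linspace(30, 55, 100)  # arbitrary range around the data
y_line = intercept + slope * x_line

fig, ax = plt.subplots()
ax.scatter(X[:, 0], y, label="Apartments")
ax.plot(x_line, y_line, label="Regression line")
ax.set_xlabel("Area (square meters)")
ax.set_ylabel("Price (euros)")
ax.legend()
```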


### 1.5. New Apartment Price Prediction

You will now create the last block of our machine learning pipeline: the price prediction of a new apartment using the area feature as input.

Your friend Bob is asking for advice as he wants to sell his apartment. He gives you the area in square meters and asks you how much he could sell it for. Your task is to use your model to predict the price of Bob's apartment:

Square Meters | Price
:---:|:---:
34   |?

Feel free to use the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). It is great and shows examples.

In [9]:
# Your code here
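
A minimal sketch of the prediction step, assuming the model fitted on the two apartments. `predict()` expects the same 2D layout as `fit()`, so a single apartment is passed as an array of shape `(1, 1)`. The names `bob_area` and `bob_price` are our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[39], [47]])      # area in square meters
y = np.array([550000, 577000])  # price in euros
model = LinearRegression().fit(X, y)

# One sample, one feature: shape (1, 1).
bob_area = np.array([[34]])

# predict() returns one value per input row; take the first (and only) one.
bob_price = model.predict(bob_area)[0]
print(f"Predicted price: {bob_price:.0f} euros")
```

Keep in mind that 34 square meters lies outside the range covered by the two training apartments, so this prediction is an extrapolation.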


<details>
  <summary>Meanwhile, Bob sold his apartment. Here is the true price.</summary>
  He sold it for 445000 euros.
</details>

What do you think of your prediction? Is it accurate?


### 1.6. More Data

Now that we have created a linear regression model and applied it to our two apartments, let's try to improve our model by using more apartments.

You will train a new linear regression model with more data and try again to predict the price of the last apartment.

Square Meters | Price
:---:|:---:
32   |489,000
28   |336,000
47   |494,000
7    |85,000
85   |1,595,000
12   |130,000
16   |173,000
53   |520,000
30   |320,000
41   |660,000

Try to use these new data to train your model. You can use the same kind of workflow:

- Create the input data structure `X` and `y`.
- Check the shape of the input data.
- Have a look at the data with visualization.
- Clarify the problem we are trying to solve.
- Train the model using sklearn (don't include Bob's apartment).
- Predict the price of Bob's apartment again.


In [None]:
# Your code here
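
The workflow above might be sketched like this, using the ten apartments from the table. This is one possible layout under our own naming (`areas`, `prices`, `model`, `bob_price`), not a reference solution; the plotting step is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Data from the table above (Bob's apartment is not included).
areas = [32, 28, 47, 7, 85, 12, 16, 53, 30, 41]
prices = [489000, 336000, 494000, 85000, 1595000,
          130000, 173000, 520000, 320000, 660000]

# Build the input structures and check their shapes.
X = np.array(areas).reshape(-1, 1)  # (n_samples, n_features)
y = np.array(prices)                # (n_samples,)
print(X.shape, y.shape)             # (10, 1) (10,)

# Train the model on the ten apartments.
model = LinearRegression().fit(X, y)
r2 = model.score(X, y)
print("Slope:", model.coef_[0], "Intercept:", model.intercept_, "R^2:", r2)

# Predict the price of Bob's 34 square meter apartment.
bob_price = model.predict(np.array([[34]]))[0]
print(f"Predicted price: {bob_price:.0f} euros")
```

With ten points the line can no longer pass through every apartment, so the score on the training data falls below 1.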
