<font size="+3"><strong>Machine Learning: Linear Regression</strong></font>

# Linear Regression

In [None]:
from IPython.display import YouTubeVideo

In machine learning, a **regression** problem is one where you need to build a model that predicts a continuous, numerical value, like the sale price of an apartment. One of the models that you can use for regression problems is called **linear regression**. In its simplest form, we fit a model that predicts a single output variable (stored in a **target vector**) as a linear function of a single input variable (stored in a **feature matrix**).

Speaking mathematically, if we have input data points $x$ and corresponding measured output $y$, then we find parameters $m$ and $b$ such that $y \approx m\times x + b$ for our measured data points.  We then use the fitted values of $m$ and $b$ to predict values of $y$ for new values of $x$.
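
As a small illustration of that idea, here is a minimal sketch that finds $m$ and $b$ by least squares, which is one common way to do the fit. The five data points are made up for this example, and `np.polyfit` is used only because it is a compact way to fit a straight line.

```python
import numpy as np

# Made-up data points that lie close to a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

# A degree-1 least-squares fit returns the slope first, then the intercept.
m, b = np.polyfit(x, y, deg=1)

# Use the fitted parameters to predict y for a new value of x.
x_new = 6.0
y_new = m * x_new + b
print(f"m = {m:.2f}, b = {b:.2f}, prediction at x = {x_new}: {y_new:.2f}")
```

The fitted line will not pass through every point; it is the line that keeps the squared differences between $m\times x + b$ and the measured $y$ as small as possible.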

## Fitting a Model to Training Data

You'll work on two cases: a model on the raw data set and a model on transformed data. First, use linear regression to predict `price_aprox_usd` as a multiple of `surface_covered_in_m2` plus a constant, using the `mexico-city-real-estate-1.csv` dataset.

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Import data
columns = ["surface_covered_in_m2", "price_aprox_usd"]
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)

# Drop rows with missing values
# (or you could use an imputer ☝️)
mexico_city1.dropna(inplace=True)

# Split data into feature matrix and target vector
X = mexico_city1[["surface_covered_in_m2"]]
y = mexico_city1["price_aprox_usd"]

# Instantiate predictor
lr = LinearRegression()

# Fit predictor to data
lr.fit(X, y)

<font size="+1">Practice</font> 

Fit a linear regression model to the `mexico-city-real-estate-2.csv` data set to relate `"price_aprox_usd"` and `"surface_covered_in_m2"`.

In [None]:
# Import data
columns = ["price_aprox_usd", "surface_covered_in_m2"]
mexico_city2 = ...
# Drop rows with missing values


# Split data into feature matrix and target vector
X = ...
y = ...

# Instantiate predictor
lr = ...

# Fit predictor to data


## Generating Predictions Using a Trained Model

After fitting the model, we want to use it to make predictions. In most applications, you'll want to predict an unknown quantity from data that's different from the data you fitted your model on. To test the accuracy of your fitted model, you'll typically use a different set of data with an outcome you already know. Here, we'll use the dataset from `mexico-city-test-features.csv` and `mexico-city-test-labels.csv`. It's also helpful to plot the actual and predicted values to see if there are any patterns that suggest fitting a different model.

In [None]:
# Import data
mexico_city_features = pd.read_csv(
    "./data/mexico-city-test-features.csv", usecols=["surface_covered_in_m2"]
)
mexico_city_labels = pd.read_csv("./data/mexico-city-test-labels.csv")

# Drop missing values
mexico_city_features.dropna(inplace=True)

# Generate predictions
price_pred_example = lr.predict(mexico_city_features)

# Print predictions
price_pred_example[:5]
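
One thing to keep in mind when the features and labels live in separate tables: if rows are dropped from only one of them, the two can fall out of step. The sketch below shows one way to keep them aligned, by joining them, dropping incomplete rows once, and splitting again. The values are made up, and the label column name `price_aprox_usd` is an assumption; whether the test files actually contain missing values isn't stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical test set in which one feature value is missing.
features = pd.DataFrame({"surface_covered_in_m2": [60.0, np.nan, 85.0, 120.0]})
labels = pd.DataFrame(
    {"price_aprox_usd": [90_000.0, 150_000.0, 130_000.0, 200_000.0]}
)

# Join side by side, drop incomplete rows once, then split again.
combined = pd.concat([features, labels], axis=1).dropna()
features_clean = combined[["surface_covered_in_m2"]]
labels_clean = combined["price_aprox_usd"]

# Both pieces now have the same rows in the same order.
print(len(features_clean), len(labels_clean))
```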

<font size="+1">Practice</font> 

Read the data from `mexico-city-real-estate-4.csv` into a DataFrame and then generate a list of price predictions for the properties using your model `lr`.

In [None]:
# Import data
mexico_city4 = ...

# Drop missing values
mexico_city4.dropna(inplace=True)

# Generate predictions
price_pred = ...

# Print predictions
price_pred[:5]

## Ridge Regression

Sometimes the values of the coefficients and the intercept, both positive and negative, are very large. When you see this in a linear model, especially a high-dimensional one, the model is usually **overfitting** to the training data and so can't generalize to the test data. This is one symptom of what's often called the **curse of dimensionality**. ☠️

The way to solve this problem is to use **regularization**, a group of techniques that prevent overfitting. In this case, we'll change the predictor from `LinearRegression` to `Ridge`, which is a linear regressor with an added tool for keeping model coefficients from getting too big.
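
To make the effect concrete, here is a minimal sketch on synthetic data with many features and few rows. The data, the seed, and the choice `alpha=10.0` are assumptions made for illustration only; `alpha` controls how strongly `Ridge` penalizes large coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)

# Synthetic data: 30 rows, 25 features, but only the first feature matters.
n_samples, n_features = 30, 25
X = rng.normal(size=(n_samples, n_features))
y = 2 * X[:, 0] + rng.normal(scale=1.0, size=n_samples)

# Fit an unregularized model and a ridge model to the same data.
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Compare the overall size of the two coefficient vectors.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"LinearRegression coefficient norm: {ols_norm:.2f}")
print(f"Ridge coefficient norm:            {ridge_norm:.2f}")
```

The ridge coefficient vector is never larger than the unregularized one, because the penalty trades a slightly worse fit on the training data for smaller coefficients.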

Here's a good explanation of what a ridge regression is and why it's important:

In [None]:
YouTubeVideo("Q81RR3yKn30")

## Generalization

Notice that we tested the model with a dataset that's *different* from the one we used to train the model. Machine learning models are useful if they allow you to make predictions about data other than what you used to train your model. We call this concept **generalization**.  By testing your model with different data than you used to train it, you're checking to see if your model can generalize.  Most machine learning models do not generalize to all possible types of input data, so they should be used with care. On the other hand, machine learning models that don't generalize to make predictions for at least a restricted set of data aren't very useful.
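
When you only have one dataset, a common way to check generalization is to hold some rows back for testing. The sketch below uses scikit-learn's `train_test_split` on synthetic stand-in data; the numbers, the 80/20 split, and the seeds are all assumptions for illustration. It compares the average size of the prediction error on the rows the model saw with the rows it didn't.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: price grows roughly linearly with area.
X = rng.uniform(30, 200, size=(200, 1))
y = 1_500 * X[:, 0] + 20_000 + rng.normal(scale=10_000, size=200)

# Hold back 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Average absolute prediction error on seen and unseen rows.
err_train = np.mean(np.abs(y_train - model.predict(X_train)))
err_test = np.mean(np.abs(y_test - model.predict(X_test)))
print(f"Average error on training rows: {err_train:,.0f}")
print(f"Average error on held-out rows: {err_test:,.0f}")
```

If the error on the held-out rows is much larger than on the training rows, that is a sign the model isn't generalizing well.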

## Calculating the Mean Absolute Error for a List of Predictions

Plots are great for displaying information, but a value that tells you the typical error in a prediction is helpful too. This value is called the **mean absolute error**, and it's defined as the average value of the magnitude of the error in the predictions. The closer the MAE is to `0`, the better our model fits the data. scikit-learn will do this for you if you pass it the price predictions from your regression model and the actual prices from the test data set. Let's see how our `lr` model did by comparing its predictions to the true values in `mexico_city_labels`.

In [None]:
from sklearn.metrics import mean_absolute_error

# scikit-learn's convention is true values first, then predictions
mean_absolute_error(mexico_city_labels, price_pred_example)
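
Written as a formula, the mean absolute error for $n$ predictions $\hat{y}_i$ and true values $y_i$ is $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. The sketch below works through it by hand on three made-up values and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true values and predictions.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 180.0, 330.0])

# The absolute errors are 10, 20 and 30, so their mean is 20.
mae_manual = np.mean(np.abs(y_true - y_pred))

# scikit-learn computes the same quantity.
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)
```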

## Access an Attribute of a Trained Model

After training a model that fits a straight line to your data, you can obtain the parameters that define that line. We're particularly interested in the slope, `lr.coef_`, and the axis intercept, `lr.intercept_`.

In [None]:
print(lr.coef_)

In [None]:
print(lr.intercept_)
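
These two attributes are all a fitted single-feature model needs to make a prediction. As a minimal sketch, the example below fits a model to made-up points that lie exactly on the line $y = 2\times x + 1$ and then reproduces `predict` by hand; the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie exactly on the line y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # coef_ holds one value per feature
intercept = model.intercept_  # a single number when there is one target

# Reproduce the model's prediction using slope * x + intercept.
X_new = np.array([[10.0]])
manual = slope * X_new[:, 0] + intercept
print(slope, intercept, manual, model.predict(X_new))
```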

## Multicollinearity

When you're creating a linear model that uses many features to make predictions, some of those features can be highly correlated with each other. This isn't a problem that's going to break your model; it will still make predictions and it might have good performance metrics. But it is an issue if you want to interpret the coefficients for your model because it becomes hard to tell which features are truly important. 

Let's look at `mexico-city-real-estate-1.csv` for an example. First we'll import the data.

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Import data
columns = [
    "price",
    "price_aprox_local_currency",
    "price_aprox_usd",
    "surface_total_in_m2",
    "surface_covered_in_m2",
    "price_per_m2",
]
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)

# Drop missing values
mexico_city1.dropna(inplace=True)

mexico_city1.head()

Now let's find the correlations between the columns.

In [None]:
mexico_city1.corr()

Let's see what happens when we fit a linear regression model for `surface_covered_in_m2` as a function of `price_aprox_usd` and `price_aprox_local_currency`.

In [None]:
lr = LinearRegression()
lr.fit(
    mexico_city1[["price_aprox_usd", "price_aprox_local_currency"]],
    mexico_city1["surface_covered_in_m2"],
)

Let's take a look at the coefficients of the model:

In [None]:
print(lr.coef_)

Ask yourself: Does it make sense that increasing the price of a property by one US dollar would translate to a 6593 m<sup>2</sup> increase in size? Perhaps, though it seems unlikely. And does it make sense that increasing the price by one Mexican peso would translate to a 350 m<sup>2</sup> *decrease* in size? Definitely not. So while this model may perform well when we evaluate it using metrics like mean absolute error, we can't use it to determine which features actually influence our target.
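
The same behavior can be reproduced on synthetic data where the true relationship is known. In the sketch below, everything is made up for illustration: `price_a` and `price_b` stand in for one price expressed in two currencies at a hypothetical fixed rate of 19, and size depends on price at a true rate of 0.0005 m<sup>2</sup> per unit of `price_a`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200

# Two almost perfectly correlated features (hypothetical exchange rate of 19).
price_a = rng.uniform(50_000, 300_000, size=n)
price_b = 19 * price_a + rng.normal(scale=100, size=n)

# The target depends on price only: 0.0005 m2 per unit of price_a.
size_m2 = 0.0005 * price_a + 20 + rng.normal(scale=5, size=n)

model = LinearRegression().fit(np.column_stack([price_a, price_b]), size_m2)
coef_a, coef_b = model.coef_

# Individually the coefficients can be far from 0.0005 and can even differ
# in sign, but their combined effect recovers the true rate.
combined = coef_a + 19 * coef_b
print("Correlation:", np.corrcoef(price_a, price_b)[0, 1])
print("Coefficients:", coef_a, coef_b)
print("Combined effect per unit of price_a:", combined)
```

The model as a whole captures the relationship, but it has no reliable way to divide the effect between two features that carry the same information, which is why the individual coefficients are hard to interpret.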

*References & Further Reading*

- [A primer on linear regression](https://medium.com/data-science-group-iitr/linear-regression-back-to-basics-e4819829d78b)
- [More on resampling from the pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html)
- [More information on rolling averages](https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/moving-average/)
- [More on absolute and mean absolute errors](https://www.statisticshowto.com/absolute-error/)
- [A discussion of the various uses of model fitting in machine learning](https://www.datarobot.com/wiki/fitting/)
- [Wikipedia Page on Multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity)
- [Online Article on Multicollinearity](https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/)
- [Wikipedia Article on Generalization](https://en.wikipedia.org/wiki/Generalization_error)
- [Online Tutorial on Regression with scikit-learn](https://stackabuse.com/linear-regression-in-python-with-scikit-learn/)
- [Official scikit-learn Documentation on Linear Models](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)
- [Wikipedia Article on Logarithm Function](https://en.wikipedia.org/wiki/Logarithm)

---
Copyright © 2022 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
