# Unit 5: Applying Linear Regression to a Real Dataset

## Lesson Introduction

Hi there! Today, we're going to learn how to apply Linear Regression to a real dataset. Working with real data shows us how machine learning solves real problems. We'll use the California Housing Dataset. By the end of this lesson, you'll know how to use Linear Regression on a real dataset and understand the results.

## Understanding the California Housing Dataset

Before diving into the code, let's understand the dataset we'll be working with. The California Housing Dataset is based on data from the 1990 California census. It contains information about various factors affecting housing prices in different districts of California.

Here's a quick overview of the columns in the dataset:

* **MedInc**: Median income in block group
* **HouseAge**: Median house age in block group
* **AveRooms**: Average number of rooms per household
* **AveBedrms**: Average number of bedrooms per household
* **Population**: Block group population
* **AveOccup**: Average household size
* **Latitude**: Block group latitude
* **Longitude**: Block group longitude
* **MedHouseVal**: Median house value for California districts, in units of $100,000 (this is our target variable)

## Loading and Preparing the Data: Part 1

First, let's load our data. Think of this step as getting all the ingredients ready before cooking. Here's the code to load the dataset:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
california = fetch_california_housing(as_frame=True)
df = california.frame

print(df.head())
# Output:
#    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
# 0  8.3252      41.0  6.984127   1.023810      322.0  2.555556     37.88    -122.23        4.526
# 1  8.3014      21.0  6.238137   0.971880     2401.0  2.109842     37.86    -122.22        3.585
# 2  7.2574      52.0  8.288136   1.073446      496.0  2.802260     37.85    -122.24        3.521
# 3  5.6431      52.0  5.817352   1.073059      558.0  2.547945     37.85    -122.25        3.413
# 4  3.8462      52.0  6.281853   1.081081      565.0  2.181467     37.85    -122.25        3.422
```

We called `fetch_california_housing` with `as_frame=True`, so the returned object exposes the data as a Pandas `DataFrame` through its `frame` attribute, with the features and the target in one table.

## Loading and Preparing the Data: Part 2

Now, let's select our features and target. In the California Housing Dataset, we'll use all features except for the target column (MedHouseVal).

Here's the code:

```python
# Drop rows with missing values
df.dropna(inplace=True)

# Select the features (all except the target column)
X = df.drop(columns=['MedHouseVal'])

# Select the target column
y = df['MedHouseVal']

print(X.head())
# Output:
#    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
# 0  8.3252      41.0  6.984127   1.023810      322.0  2.555556     37.88    -122.23
# 1  8.3014      21.0  6.238137   0.971880     2401.0  2.109842     37.86    -122.22
# 2  7.2574      52.0  8.288136   1.073446      496.0  2.802260     37.85    -122.24
# 3  5.6431      52.0  5.817352   1.073059      558.0  2.547945     37.85    -122.25
# 4  3.8462      52.0  6.281853   1.081081      565.0  2.181467     37.85    -122.25

print(y.head())
# Output:
# 0    4.526
# 1    3.585
# 2    3.521
# 3    3.413
# 4    3.422
# Name: MedHouseVal, dtype: float64
```

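We drop rows with missing values and select all features except the target. This dataset ships without missing values, so `dropna` acts as a safeguard here rather than changing the data.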
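Before dropping rows, it can help to check whether there is anything to drop. The sketch below uses a small hypothetical DataFrame (`toy`, not part of the housing data) to show how `isna().sum()` counts missing values per column and what `dropna()` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value (not from the housing dataset)
toy = pd.DataFrame({
    "MedInc": [8.3, np.nan, 7.2],
    "HouseAge": [41.0, 21.0, 52.0],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() returns a new DataFrame without the incomplete rows
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Running the same `isna().sum()` check on the housing DataFrame would show whether `dropna` has any effect there.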

## Initializing and Training the Model

Next, we create and train our Linear Regression model. Think of it as teaching a kid to ride a bike: you show them a few times, and then they get the hang of it.

Here's the code:

```python
from sklearn.linear_model import LinearRegression

# Initialize the linear regression model
model = LinearRegression()

# Train the model
model.fit(X, y)
```

Calling `LinearRegression()` creates an untrained model object, and `model.fit(X, y)` trains it on our data.
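To make "training" less of a black box: `LinearRegression` fits by ordinary least squares, so the same coefficients can be recovered with NumPy's least-squares solver. Here is a minimal sketch on made-up data (the arrays `X_toy` and `y_toy` are assumptions for illustration, not the housing data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3 + 2*x1 - 1*x2 exactly
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

# Fit with scikit-learn
toy_model = LinearRegression().fit(X_toy, y_toy)

# Fit with plain least squares: add a column of ones for the intercept
A = np.column_stack([np.ones(len(X_toy)), X_toy])
solution, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

print("scikit-learn:", toy_model.intercept_, toy_model.coef_)
print("lstsq:       ", solution[0], solution[1:])
```

Both approaches should report an intercept close to 3 and coefficients close to 2 and -1, because the toy data was built from exactly that line.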

## Making Predictions

Once our model is trained, it's ready to make predictions. This is like the kid finally riding the bike on their own.

Here's how you can make predictions:

```python
# Make predictions
y_pred = model.predict(X)
print(y_pred[:5])  # Display the first five predictions
# Output:
# [4.13642691 3.62014328 3.39896532 3.41478061 3.92649121]
```

The `model.predict(X)` method uses the trained model to predict median house values from the feature values. Here we predict on the same rows the model was trained on.

## Understanding the Model Outputs

It's important to understand how the model is making predictions. In a Linear Regression model, we have an intercept and coefficients for each feature. Think of the intercept as a starting point and the coefficients as slopes.

Here's the code to display them:

```python
# Display the intercept
print(f"Intercept: {model.intercept_}")
# Display the coefficients
coefficients = pd.DataFrame(model.coef_, index=X.columns, columns=['Coefficient'])
print(coefficients)
# Output:
# Intercept: -36.94192020749747
#             Coefficient
# MedInc       0.447808
# HouseAge     0.011540
# AveRooms     0.080079
# AveBedrms   -0.144893
# Population  -0.000046
# AveOccup    -0.004605
# Latitude    -0.426464
# Longitude   -0.430478
```

The `model.intercept_` attribute gives us the intercept, and `model.coef_` gives us one coefficient per feature, in the same order as the columns of `X`.
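The intercept and coefficients are all the model needs to predict: each prediction is the intercept plus the sum of every feature value times its coefficient. The sketch below checks this on a tiny made-up dataset (the feature names and numbers are illustrative, not the housing results):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset, for illustration only
X_toy = pd.DataFrame({
    "income": [2.0, 4.0, 6.0, 8.0, 5.0],
    "age": [30.0, 10.0, 20.0, 40.0, 25.0],
})
y_toy = pd.Series([1.1, 2.3, 3.0, 3.8, 2.6])

toy_model = LinearRegression().fit(X_toy, y_toy)

# Prediction by hand: intercept + sum(feature * coefficient)
manual = toy_model.intercept_ + X_toy.to_numpy() @ toy_model.coef_

print(manual)
print(toy_model.predict(X_toy))
```

The two printed arrays should match, which is a quick way to confirm that you are reading the coefficients in the right order.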

## Evaluating the Performance

Finally, we'll calculate the Mean Squared Error (MSE) to evaluate how well our model is doing. Think of it as checking if the kid can ride the bike without falling.

Here's the code to calculate the MSE:

```python
from sklearn.metrics import mean_squared_error

# Calculate the Mean Squared Error
mse = mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse:.4f}")  # Mean Squared Error: 0.5308
```

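The `mean_squared_error` function computes the MSE: the average of the squared differences between the predictions and the actual values. A lower MSE indicates a better fit. Note that we computed it on the same data the model was trained on, so it measures how well the model fits that data, not how well it would predict unseen districts.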
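To see what the function computes, here is a minimal sketch that works out the MSE by hand with NumPy on four made-up values (the numbers are assumptions, not housing results). It also takes the square root (RMSE), which is often easier to read because it is in the same units as the target.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
actual = np.array([3.0, 2.0, 4.0, 5.0])
predicted = np.array([2.5, 2.0, 4.5, 4.0])

# MSE: average of the squared differences
errors = actual - predicted           # [0.5, 0.0, -0.5, 1.0]
mse_by_hand = np.mean(errors ** 2)    # (0.25 + 0 + 0.25 + 1.0) / 4

# RMSE: square root of the MSE, in the same units as the target
rmse = np.sqrt(mse_by_hand)

print(mse_by_hand)  # 0.375
print(rmse)
```

Because squaring magnifies large errors, a single prediction that is far off (the last one here) contributes most of the MSE.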

## Lesson Summary

Great job! Today, we learned how to apply Linear Regression to a real dataset. We loaded the California Housing Dataset, selected features and target, trained a model, made predictions, and evaluated the results. Understanding how to work with real datasets is a key skill in machine learning.

Now it's your turn! Move on to the practice exercises where you'll apply what you've learned to another real dataset. You'll load data, train a model, make predictions, and visualize the results. Happy coding!

## Predicting California House Prices

Hey there, Space Explorer! Would you like to know how house prices in California relate to median income and house age? The given code loads the California Housing Dataset, trains a linear regression model with MedInc (median income) and HouseAge (median house age) as features, and predicts house prices for the first five rows. Click Run to see the magic happen!

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
california = fetch_california_housing(as_frame=True)
df = california.frame.dropna()

# Select features and target
X = df[['MedInc', 'HouseAge']]
y = df['MedHouseVal']

# Initialize and train the model
model = LinearRegression()
model.fit(X, y)

# Predict for the first five rows of the dataset
y_pred = model.predict(X.head(5))

# Calculate and print the MSE
mse = mean_squared_error(y[:5], y_pred)
print("Mean Squared Error:", mse)

```

This code applies a linear regression model to the California Housing Dataset, looking at how median income (`MedInc`) and median house age (`HouseAge`) relate to median house value (`MedHouseVal`).

Here's a breakdown of what the "magic" entails when you run this code:

### Code Explanation

1.  **Load the Dataset**: The code first loads the California Housing Dataset into a Pandas DataFrame. The `.dropna()` method ensures that any rows with missing values are removed, providing a clean dataset for analysis.
2.  **Select Features and Target**:
    * `X = df[['MedInc', 'HouseAge']]`: This line selects two specific features: `MedInc` (median income) and `HouseAge` (median house age). These are the independent variables the model will use to make predictions.
    * `y = df['MedHouseVal']`: This line selects `MedHouseVal` (median house value) as the target variable. This is the dependent variable the model will try to predict.
3.  **Initialize and Train the Model**:
    * `model = LinearRegression()`: An instance of the `LinearRegression` model is created.
    * `model.fit(X, y)`: The model is then trained using the selected features (`X`) and the target variable (`y`). During this training phase, the model learns the relationship between `MedInc`, `HouseAge`, and `MedHouseVal` by finding the optimal coefficients and intercept that best fit the data.
4.  **Predict for the First Five Rows**:
    * `y_pred = model.predict(X.head(5))`: After training, the model uses its learned relationships to predict `MedHouseVal` for the first five entries in your feature set (`X`).
5.  **Calculate and Print MSE**:
    * `mse = mean_squared_error(y[:5], y_pred)`: The Mean Squared Error (MSE) is calculated. This metric quantifies the average squared difference between the actual `MedHouseVal` for the first five rows (`y[:5]`) and the values predicted by your model (`y_pred`).
    * `print("Mean Squared Error:", mse)`: The calculated MSE value is printed to the console.

### How Average House Prices Relate to Income and Age (Model Interpretation)

When you fit a linear regression, the model learns a coefficient for each feature and an intercept. The snippet above does not print them, but here is how to interpret them once you do:

* **Coefficients**: The `model.coef_` attribute holds one numerical value for `MedInc` and one for `HouseAge`.
    * A **positive coefficient for `MedInc`** would suggest that, as the median income in a block group increases, the median house value tends to increase as well (assuming all other factors remain constant). This is a common and expected relationship, as higher incomes generally correlate with higher purchasing power and demand for housing.
    * The **coefficient for `HouseAge`** could be positive or negative.
        * A **positive coefficient** might imply that older houses (perhaps due to historical value, larger plots, or desirable established neighborhoods) tend to have higher values.
        * A **negative coefficient** might suggest that newer houses are valued more (perhaps due to modern amenities, energy efficiency, or current architectural trends), or that very old houses are less desirable due to maintenance needs.
* **Intercept**: The `model.intercept_` attribute represents the predicted `MedHouseVal` when all feature values (`MedInc` and `HouseAge` in this case) are zero. In real-world datasets like housing, interpreting the intercept directly might not be meaningful if zero values for features are outside the practical range of the data.
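To make the sign interpretation concrete, the sketch below fits a model to synthetic data that was deliberately built so the target rises with `income` and falls with `age`. The column names, sample size, and true values (0.4 and -0.01) are assumptions chosen for illustration; they are not estimates from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data built so that value rises with income and falls with age
rng = np.random.default_rng(42)
income = rng.uniform(1, 10, size=200)
age = rng.uniform(1, 50, size=200)
value = 0.5 + 0.4 * income - 0.01 * age + rng.normal(0, 0.05, size=200)

X_toy = pd.DataFrame({"income": income, "age": age})
toy_model = LinearRegression().fit(X_toy, value)

# Pair each coefficient with its feature name for easier reading
coef_by_feature = pd.Series(toy_model.coef_, index=X_toy.columns)
print(coef_by_feature)
print("Intercept:", toy_model.intercept_)
```

The fitted coefficients should come out positive for `income` and negative for `age`, close to the values used to generate the data. On real data the true values are unknown, so the fitted signs are evidence about the relationship rather than a confirmation of it.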

### Understanding Mean Squared Error (MSE)

The **Mean Squared Error (MSE)** is the average squared difference between the predicted values and the actual values.

* Here, the MSE is the average of the squared differences between the predicted and actual house values for the first five rows.
* A **lower MSE indicates a better fit** of the model to the data, meaning the predictions are closer to the actual values. A higher MSE suggests that the model's predictions are less accurate.
* Since you're calculating MSE on only the first five rows, this is a very small sample. Typically, MSE is evaluated on a separate "test set" of data to get a more reliable measure of the model's generalization performance on unseen data.
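The held-out evaluation mentioned in the last bullet can be sketched as follows. This example uses synthetic data and scikit-learn's `train_test_split`; the split size and `random_state` are arbitrary choices for illustration, not part of the original exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 2))
y_toy = 1.0 + 0.5 * X_toy[:, 0] - 0.2 * X_toy[:, 1] + rng.normal(0, 0.3, size=300)

# Hold out 20% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

toy_model = LinearRegression().fit(X_train, y_train)

train_mse = mean_squared_error(y_train, toy_model.predict(X_train))
test_mse = mean_squared_error(y_test, toy_model.predict(X_test))
print("Train MSE:", train_mse)
print("Test MSE: ", test_mse)
```

The test MSE is the more honest number, since it is computed on rows the model did not use to choose its coefficients.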

By inspecting the coefficients, you would gain insight into the strength and direction of the linear relationship between income, age, and house prices in California, and the MSE would give you a quantifiable measure of your model's prediction accuracy.

## Modifying Feature Selection to Evaluate MSE

Great job so far, Space Explorer! Now, let's change which features the model uses and see how that affects it. Change the feature selection to use all the columns (except 'MedHouseVal', which is our target), instead of just MedInc, HouseAge, AveRooms, and Latitude. Observe how this impacts the Mean Squared Error (MSE) of our predictions.

As a reminder, you can select everything except the 'MedHouseVal' column from a DataFrame using the `drop` method.
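As a quick refresher on how `drop` behaves, here is a minimal sketch on a hypothetical three-column DataFrame (the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with two features and a target column
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
    "target": [0.5, 1.5, 2.5],
})

# drop returns a new DataFrame; the original keeps all its columns
features = toy.drop(columns=["target"])

print(list(features.columns))  # ['feature_a', 'feature_b']
print(list(toy.columns))       # ['feature_a', 'feature_b', 'target']
```

Because `drop` does not modify the original by default, the target column is still available afterwards for building `y`.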

Time to code!

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
california = fetch_california_housing(as_frame=True)
df = california.frame.dropna()

# Select features and target
X = df[['MedInc', 'HouseAge', 'AveRooms', 'Latitude']]  # TODO: use all columns except 'MedHouseVal'
y = df['MedHouseVal']

# Initialize and train the model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Calculate and print the MSE
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error:", mse)

```

Here is one way to complete the exercise, using `drop` to keep every column except the target:

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
california = fetch_california_housing(as_frame=True)
df = california.frame.dropna()

# Select features and target
# Changed feature selection to use all columns except 'MedHouseVal'
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']

# Initialize and train the model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Calculate and print the MSE
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error:", mse)
```

## California Housing Model Debugging

Your mission is to identify and fix a bug in the code related to loading and training a Linear Regression model using the California Housing Dataset. The code is almost correct but contains a small mistake that prevents it from working properly. Can you spot the issue and fix it?

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

# Load the California Housing dataset
california = fetch_california_housing(as_frame=True)
df = california.frame

# Drop rows with missing values
df.dropna(inplace=True)

# Wrong code for selecting the features and target
X = df.drop(df['MedHouseVal'])
y = df['MedHouseVal']

# Initialize and train the model
model = LinearRegression()
model.fit(X, y)

# Display the intercept and first coefficient
print(f"Intercept: {model.intercept_}")
print(f"First Coefficient (MedInc): {model.coef_[0]}")

```

Hey there, Space Voyager!

The bug is in the line that tries to drop the `'MedHouseVal'` column to create the feature set `X`.

The `df.drop()` method treats its first positional argument as labels to remove, and by default (`axis=0`) those are row labels. Passing `df['MedHouseVal']` hands it a pandas Series of house values, so pandas looks for those values in the DataFrame's index instead of removing the column, and raises a `KeyError` when they are not index labels. To remove a column, pass its name through the `columns` parameter.
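The failure is easy to reproduce on a small scale. The sketch below uses a hypothetical two-column DataFrame (`toy`, standing in for the housing data) to contrast the buggy call with the intended one:

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "target": [4.5, 3.6, 3.5],
})

# Passing the column's values: pandas searches the row index for 4.5, 3.6, 3.5
try:
    toy.drop(toy["target"])
    outcome = "no error"
except KeyError:
    outcome = "KeyError"

# Passing the column's name removes the column as intended
features = toy.drop(columns=["target"])

print(outcome)
print(list(features.columns))
```

In this toy case the row index is 0, 1, 2, so none of the target values can be found and pandas raises a `KeyError`.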

Here's the corrected code:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

# Load the California Housing dataset
california = fetch_california_housing(as_frame=True)
df = california.frame

# Drop rows with missing values
df.dropna(inplace=True)

# Fixed code for selecting the features and target
X = df.drop(columns=['MedHouseVal']) # Corrected: specify the column to drop using 'columns' parameter
y = df['MedHouseVal']

# Initialize and train the model
model = LinearRegression()
model.fit(X, y)

# Display the intercept and first coefficient
print(f"Intercept: {model.intercept_}")
print(f"First Coefficient (MedInc): {model.coef_[0]}")
```

Space Wanderer, your skills are impressive so far! Now, let's make sure we're using this data correctly. Complete the TODOs in the code below: import `LinearRegression`, prepare the features and the target, and make predictions with the trained model. Happy coding!


```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
# TODO: import LinearRegression from the right place

# Load the California Housing dataset
california = fetch_california_housing(as_frame=True)
df = california.frame

# TODO: Prepare the features and the target

# Train the Linear Regression model once the features and target are prepared
model = LinearRegression()
model.fit(X, y)

# TODO: Make predictions using the trained model

```