# Unit 2 Training a Linear Regression Model

# Lesson Introduction

In this lesson, we'll learn how to train a **linear regression** model. Linear regression helps predict values based on data. Imagine you have house areas and prices and want to predict the price of a house with an unknown area. That's where linear regression helps.

By the end of this lesson, you will understand linear regression, how to generate and handle synthetic data, and how to use **Scikit-Learn** to train a linear regression model. You'll also learn to interpret the model's output.

-----

## Understanding Linear Regression

Linear regression models the relationship between two variables by fitting a linear equation to the data. One variable is the explanatory variable, often called the "**feature**" and denoted by $X$, and the other is the dependent variable, often called the "**target**" and denoted by $y$.

The linear regression formula is:
$y = kX + b$

Where:

  * $b$ (the intercept) is the value of $y$ when $X$ is zero.
  * $k$ (the coefficient, or slope) indicates how much $y$ changes for each unit change in $X$.

Here is an example of some data and two lines trying to fit the data:

<img src="https://i.imgur.com/example_image_path.png" alt="Example of data points with two linear regression lines" title="Data with two potential regression lines">

We aim to find the "best-fit" line, which minimizes the difference between the actual data points and the predicted values. It means that we need to find the optimal line parameters $k$ and $b$. This is typically achieved using the least squares method, which finds the line that minimizes the sum of the squared differences between the observed values and the values predicted by the line.
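As a rough illustration of the least squares idea, the sketch below computes $k$ and $b$ for a single feature using the standard closed-form formulas. The small dataset and the variable names are made up for this example. Scikit-Learn arrives at the same best-fit line, though not necessarily through these exact steps.

```python
import numpy as np

# Hypothetical toy data (not from the lesson): house areas and prices
X = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([260000.0, 340000.0, 455000.0, 560000.0, 640000.0])

# Closed-form least squares for one feature:
#   k = sum((X - mean(X)) * (y - mean(y))) / sum((X - mean(X))**2)
#   b = mean(y) - k * mean(X)
x_mean, y_mean = X.mean(), y.mean()
k = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
b = y_mean - k * x_mean

# Sum of squared differences between observed and predicted values
sse = np.sum((y - (k * X + b)) ** 2)

print(f"k = {k:.2f}, b = {b:.2f}, SSE = {sse:.2f}")
```

Any other choice of $k$ and $b$ gives a larger sum of squared differences on the same data, which is what "best-fit" means here.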

-----

## Generating Synthetic Data

To train our model, we need data. We'll generate synthetic (fake) data for learning, like in the previous lesson.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)  # House area in square feet

# Assume a linear relationship: price = base_price + (area * price_per_sqft)
base_price = 50000.00
price_per_sqft = 200.00
noise = np.random.normal(0, 25000, num_samples)  # Adding noise
price = base_price + (area * price_per_sqft) + noise

# Output example data
print(f"Area: {area[:5].round(2)}")
# Area: [1623.62 3352.14 2695.98 2295.98  968.06]
print(f"Price: {price[:5].round(2)}")
# Price: [376900.25 712953.4  591490.38 459505.87 238119.39]

# Create DataFrame
df = pd.DataFrame({'Area': area.round(2), 'Price': price.round(2)})
```

We use **NumPy** to generate random house areas between 500 and 3500 square feet. The price is calculated based on a base price plus the area multiplied by a fixed rate per square foot, with some random noise.

Again, we use **Pandas** to organize our data into a DataFrame, a table-like structure for easier data manipulation.
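Before training, it can help to take a quick look at the generated table. The sketch below rebuilds the same DataFrame and prints its first rows, its shape, and summary statistics. The inspection calls are only a suggestion and are not part of the original lesson code.

```python
import numpy as np
import pandas as pd

# Recreate the synthetic data with the same seed and parameters as above
np.random.seed(42)
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)
noise = np.random.normal(0, 25000, num_samples)
price = 50000.00 + (area * 200.00) + noise

df = pd.DataFrame({'Area': area.round(2), 'Price': price.round(2)})

print(df.head())      # First five rows of the table
print(df.shape)       # (number of rows, number of columns)
print(df.describe())  # Count, mean, min, max, etc. for each column
```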

-----

## Extracting Features and Target Variables

In machine learning, the inputs are called "**features**," and the output we want to predict is the "**target**" variable. In our case, the house area is the feature, and the house price is the target.

```python
# Extract features and target variable
X = df['Area'].values.reshape(-1, 1)  # Feature
y = df['Price'].values  # Target

# Output example features and target values
print(f"Features (X): {X[:5]}")
# Features (X): [[1623.62]
#  [3352.14]
#  [2695.98]
#  [2295.98]
#  [ 968.06]]
print(f"Target (y): {y[:5]}")
# Target (y): [376900.25 712953.4  591490.38 459505.87 238119.39]
```

We reshape the features to ensure they're in the correct format for our model. The `reshape(-1, 1)` converts `X` to a 2D array with one column.
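To see why the reshape matters, here is a small sketch using a handful of made-up areas and prices. It prints the array shape before and after `reshape(-1, 1)` and shows that `fit` rejects a 1D feature array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small illustrative dataset (values chosen for this example only)
area = np.array([1200.0, 1500.0, 1800.0, 2000.0, 2500.0])
price = np.array([240000.0, 300000.0, 360000.0, 400000.0, 500000.0])

print(area.shape)  # (5,)   -> 1D: five values, no column dimension
X = area.reshape(-1, 1)
print(X.shape)     # (5, 1) -> 2D: five rows (samples), one column (feature)

# Scikit-Learn expects a 2D feature array; a 1D array raises a ValueError
try:
    LinearRegression().fit(area, price)
    fit_1d_failed = False
except ValueError as err:
    fit_1d_failed = True
    print(f"1D input rejected: {type(err).__name__}")

# The reshaped 2D array is accepted
LinearRegression().fit(X, price)
```

The `-1` tells NumPy to infer the number of rows from the array's length, so the same call works for any number of samples.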

-----

## Initializing and Training the Model

Now it's time to use the **Scikit-Learn** library to initialize and train our linear regression model.

```python
from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Output fitted model information
print(f"Model has been trained: Intercept = {model.intercept_:.2f}, Coefficient = {model.coef_[0]:.2f}")
# Model has been trained: Intercept = 57293.24, Coefficient = 196.17
```

We first import the `LinearRegression` class from **Scikit-Learn**. Then we create an instance of the model and train it using the `fit` method, which takes in the features `X` and the target `y`.

-----

## Interpreting the Model Output

After training, we can check the model's coefficients to understand the relationship it found.

```python
# Print the fitted intercept and coefficient
print(f"Intercept: {model.intercept_:.2f}, Coefficient: {model.coef_[0]:.2f}")
# Intercept: 57293.24, Coefficient: 196.17
```

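The **intercept** is the point where the line crosses the y-axis (the predicted price when the area is zero), and the **coefficient** is the slope of the line (how much the price changes per additional square foot). In simpler terms, the intercept corresponds to the base price and the coefficient to the rate per square foot. The fitted values (57293.24 and 196.17) are close to, but not exactly, the 50000 and 200 used to generate the data, because the random noise shifts the best-fit line slightly.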
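One way to check this interpretation is to plug the fitted values into $y = kX + b$ by hand and compare the result with the model's own prediction. The sketch below does that for a 2000 sq ft house. The example area and the variable names `example_area`, `manual_price`, and `model_price` are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the lesson's synthetic data (same seed and parameters)
np.random.seed(42)
area = np.random.uniform(500, 3500, 100)
price = 50000.00 + (area * 200.00) + np.random.normal(0, 25000, 100)

model = LinearRegression()
model.fit(area.reshape(-1, 1), price)

example_area = 2000.0  # Hypothetical house area in square feet

# y = k * X + b, with k = coefficient and b = intercept
manual_price = model.intercept_ + model.coef_[0] * example_area

# The same calculation performed by the model
model_price = model.predict([[example_area]])[0]

print(f"Manual calculation: {manual_price:.2f}")
print(f"model.predict():    {model_price:.2f}")
```

Both lines print the same number, which confirms that the intercept and coefficient fully describe the fitted line.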

-----

## Visualizing the Model Fit

Let's plot the data and the fitted line to visualize how well our model fits the data.

```python
import matplotlib.pyplot as plt

# Plot the data points
plt.scatter(X, y / 1000, alpha=0.5, color='blue', label='Data points')

# Plot the regression line
plt.plot(X, model.predict(X) / 1000, color='red', label='Regression line')

plt.xlabel('Area (sq ft)')
plt.ylabel('Price (thousands of dollars)')
plt.title('Linear Regression: House Price vs. Area')
plt.legend()
plt.grid()
plt.show()
```

Using **Matplotlib**, we plot the original data points and the regression line. Note that we use the `predict` method to create the line: it takes feature values and returns the predicted target for each one using the fitted best-fit line. We will discuss this method in detail in the next lesson.

The blue dots represent the actual data points, and the red line represents our fitted linear regression model:

<img src="https://i.imgur.com/another_example_image_path.png" alt="Plot of data points and linear regression line" title="Linear Regression Fit">

-----

## Multiple Features

So far, we've worked with a single feature, the house area. However, linear regression can also handle multiple features.

For example, suppose we have additional features such as the number of bedrooms and the age of the house. Our new dataset might look like this:

```text
   price     area     age  num_bedrooms
0  1000.00  500.00    3.00          3.00
1  2500.00  700.00    3.00          3.00
2  3000.00  700.00    2.00          2.00
3  4500.00  800.00    5.00          3.00
```

Here, `X` consists of multiple columns representing different features. The model will now learn one coefficient per feature, indicating how each impacts the target variable. Though we can't visualize the four-dimensional data, the principle remains: linear regression finds the best-fitting linear function (a hyperplane rather than a line) for the given data!
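A minimal sketch of the multi-feature case follows. It assumes a made-up, noise-free relationship between price and three features, so the numbers 50000, 200, -1000, and 10000 are illustrative only. Because there is no noise, the model should recover one coefficient per feature almost exactly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50

# Hypothetical features, using the column names from the table above
df = pd.DataFrame({
    'area': rng.uniform(500, 3500, n),
    'age': rng.integers(0, 40, n),
    'num_bedrooms': rng.integers(1, 6, n),
})

# Assumed noise-free relationship, for illustration only
df['price'] = (50000 + 200 * df['area']
               - 1000 * df['age'] + 10000 * df['num_bedrooms'])

feature_names = ['area', 'age', 'num_bedrooms']
X = df[feature_names]  # Already 2D: one column per feature, no reshape needed
y = df['price']

model = LinearRegression()
model.fit(X, y)

# One learned coefficient per feature, plus a single intercept
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {model.intercept_:.2f}")
```

Selecting several columns with a list (`df[feature_names]`) already yields a 2D structure, which is why `reshape(-1, 1)` is only needed in the single-feature case.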

-----

## Lesson Summary

In this lesson, we've covered:

  * Understanding linear regression and its purpose.
  * Generating synthetic data to simulate real-world house areas and prices.
  * Creating a **DataFrame** to organize our data.
  * Extracting features and target variables for model training.
  * Initializing and training a linear regression model using **Scikit-Learn**.
  * Interpreting the model's output.

Now that you understand the theory and steps involved in training a linear regression model, it's time to put this knowledge into practice. In the practice session, you will get hands-on experience with training a model and making predictions based on new data. Get ready to apply what you've learned and build your own linear regression model!

## Training the Linear Regression Model

Galactic Pioneer, let's predict the price of a house based on its area using linear regression. In the given code, you will see how we generate synthetic data, train a model, and print out its coefficients!

Click Run to see the model in action!

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Generate synthetic data
area = np.array([1200, 1500, 1800, 2000, 2500])  # House areas in sqft
price = np.array([240000, 300000, 360000, 400000, 500000])  # House prices in $

# Create a DataFrame
df = pd.DataFrame({'Area': area, 'Price': price})

# Extract features and target
X = df['Area'].values.reshape(-1,1)
y = df['Price'].values

# Train linear regression model
model = LinearRegression()
model.fit(X, y)


# Print model coefficients
print(f"Intercept: {model.intercept_}, Coefficient: {model.coef_}")

```


## Adjust Base Price and Price per Square Foot

Now let's modify the code to see how changing `base_price` and `price_per_sqft` affects our linear regression model. Update `base_price` to 50000 and `price_per_sqft` to 200, then check the new coefficients of the model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Generate synthetic data
num_samples = 50
area = np.random.uniform(500, 3500, num_samples)
base_price = 60000  # Original value
price_per_sqft = 250  # Original value
noise = np.random.normal(0, 30000, num_samples)
price = base_price + (area * price_per_sqft) + noise
price = np.round(price, 2)

# Create DataFrame
df = pd.DataFrame({'Area': area, 'Price': price})

# Extract features and target
X = df['Area'].values.reshape(-1, 1)
y = df['Price'].values

# Initialize model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Print model coefficients
print(f"Intercept: {model.intercept_}, Coefficient: {model.coef_}")

```

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)  # Seed for reproducibility
num_samples = 50
area = np.random.uniform(500, 3500, num_samples)
base_price = 50000  # Updated value
price_per_sqft = 200  # Updated value
noise = np.random.normal(0, 30000, num_samples)
price = base_price + (area * price_per_sqft) + noise
price = np.round(price, 2)

# Create DataFrame
df = pd.DataFrame({'Area': area, 'Price': price})

# Extract features and target
X = df['Area'].values.reshape(-1, 1)
y = df['Price'].values

# Initialize model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Print model coefficients
print(f"Intercept: {model.intercept_}, Coefficient: {model.coef_}")
```

## Predicting House Prices with Linear Regression

Alright, Space Wanderer, are you ready to complete your mission? Add the missing pieces to the code. We need to generate synthetic data and train our model to predict house prices based on house areas. Yep, you got it right: we will make a prediction using our model! More details about predictions are in the next lesson, but let's take a first look now. You've got this!

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Generate synthetic data
num_samples = 50
area = np.random.uniform(600, 3000, num_samples)
base_price = 55000
price_per_sqft = 150
noise = np.random.normal(0, 20000, num_samples)
price = base_price + (area * price_per_sqft) + noise
price = np.round(price, 2)

# Create DataFrame
df = pd.DataFrame({'Area': area, 'Price': price})

# TODO: Extract the features from `df` and reshape them
# X = ...

# TODO: Extract the target variable from `df`
# y = ...

# Train the model
model = LinearRegression()
model.fit(X, y)

# Make a prediction
predicted_price = model.predict([[2500]])
print(predicted_price)

```

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)  # Seed for reproducibility
num_samples = 50
area = np.random.uniform(600, 3000, num_samples)
base_price = 55000
price_per_sqft = 150
noise = np.random.normal(0, 20000, num_samples)
price = base_price + (area * price_per_sqft) + noise
price = np.round(price, 2)

# Create DataFrame
df = pd.DataFrame({'Area': area, 'Price': price})

# Extract the features from `df` and reshape them
X = df['Area'].values.reshape(-1, 1)

# Extract the target variable from `df`
y = df['Price'].values

# Train the model
model = LinearRegression()
model.fit(X, y)

# Make a prediction
predicted_price = model.predict([[2500]])
print(predicted_price)


```

## Training a Linear Regression Model from Scratch

The end of our exploration is near, Galactic Pioneer! Now, it's time to bring it all together and train a linear regression model from scratch. Follow the TODO comments to achieve the final result and print the line equation. You've got this!

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Generating synthetic house data
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)  # House area in square feet
base_price = 50000
price_per_sqft = 200
noise = np.random.normal(0, 25000, num_samples)  # Adding noise
price = base_price + (area * price_per_sqft) + noise
price = np.round(price, 2)

# TODO: Create a DataFrame with 'Area' and 'Price' columns using pandas

# TODO: Extract features (house area) and target variable (house price)
# Ensure to reshape the features to be a 2D array with one column

# TODO: Initialize the LinearRegression model using scikit-learn

# TODO: Train the model with the extracted features and target variable using the fit() method

# TODO: Print the line equation y=kx + b using the model's intercept and coefficient
# Remember that k is model.coef_[0] (the slope) and b is model.intercept_.

```

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Generating synthetic house data
np.random.seed(42)  # Seed for reproducibility
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)  # House area in square feet
base_price = 50000
price_per_sqft = 200
noise = np.random.normal(0, 25000, num_samples)  # Adding noise
price = base_price + (area * price_per_sqft) + noise
price = np.round(price, 2)

# TODO: Create a DataFrame with 'Area' and 'Price' columns using pandas
df = pd.DataFrame({'Area': area, 'Price': price})

# TODO: Extract features (house area) and target variable (house price)
# Ensure to reshape the features to be a 2D array with one column
X = df['Area'].values.reshape(-1, 1)
y = df['Price'].values

# TODO: Initialize the LinearRegression model using scikit-learn
model = LinearRegression()

# TODO: Train the model with the extracted features and target variable using the fit() method
model.fit(X, y)

# TODO: Print the line equation y=kx + b using the model's intercept and coefficient
# Remember that k is model.coef_[0] (the slope) and b is model.intercept_.
# Note: For a single feature, model.coef_ returns an array like [k], so we access k with model.coef_[0]
print(f"Line Equation: y = {model.coef_[0]:.2f}X + {model.intercept_:.2f}")



```

## Train Your Linear Regression Model

Great job so far! Now, train the linear regression model by fitting the model using the given features (X) and the target (y). This time, you will work with a dataset that has multiple features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Create synthetic data for house characteristics
area = np.array([1500, 2000, 2500, 3000, 3500])
bedrooms = np.array([3, 4, 2, 5, 3])
age = np.array([10, 15, 20, 5, 8])
price = np.array([400000, 500000, 600000, 700000, 650000])
price = np.round(price, 2)

# Create a DataFrame
df = pd.DataFrame({
    'Area': area,
    'Bedrooms': bedrooms,
    'Age': age,
    'Price': price
})

# TODO: Extract features and target. Target is the 'Price' column, and other columns are features.

# TODO: Train the linear regression model using `X` as features and `y` as target

# TODO: Output model coefficients

```

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Create synthetic data for house characteristics
area = np.array([1500, 2000, 2500, 3000, 3500])
bedrooms = np.array([3, 4, 2, 5, 3])
age = np.array([10, 15, 20, 5, 8])
price = np.array([400000, 500000, 600000, 700000, 650000])
price = np.round(price, 2)

# Create a DataFrame
df = pd.DataFrame({
    'Area': area,
    'Bedrooms': bedrooms,
    'Age': age,
    'Price': price
})

# TODO: Extract features and target. Target is the 'Price' column, and other columns are features.
X = df[['Area', 'Bedrooms', 'Age']] # Features
y = df['Price'] # Target

# TODO: Train the linear regression model using `X` as features and `y` as target
model = LinearRegression()
model.fit(X, y)

# TODO: Output model coefficients
print("Model Coefficients:")
# Display coefficients with their corresponding feature names
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.2f}")
print(f"Intercept: {model.intercept_:.2f}")


```