# Lesson 5: Linear Regression Analysis

```markdown
# Lesson Introduction

Welcome to our lesson on **Linear Regression Analysis**! This technique is fundamental in machine learning for predicting numeric values from data. By the end of this lesson, you will understand what linear regression is, why it's useful, and how to fit a linear regression model in Python with the popular **scikit-learn** library.

Imagine you're running a lemonade stand and want to predict future sales based on past data. Linear regression helps you figure out the trend and make educated guesses. Let's explore how it works.

---

## Understanding Linear Regression

Linear regression models the relationship between two variables by fitting a straight line to the observed data. The simplest form is **simple linear regression**, where we have:
- One **independent variable** (input) 
- One **dependent variable** (output)

---

## Real-Life Example

Suppose you have the following data on hours studied and the corresponding test scores:

- **Hours studied**: [1, 2, 3, 4, 5]
- **Test scores**: [2, 4, 5, 4, 5]

Our goal is to predict the test score for studying **6 hours**. Let's start by visualizing the data.

### Scatter Plot and Fitting a Line

A scatter plot shows the relationship between hours studied and test scores. Now, let’s introduce a line to approximate this relationship.

---

## Plotting Multiple Lines

The general formula for a line is:

\[
y = mx + c
\]

Where:
- \( y \): Dependent variable (output, e.g., test score)
- \( x \): Independent variable (input, e.g., hours studied)
- \( m \): Slope (determines the steepness of the line)
- \( c \): Intercept (where the line crosses the y-axis)

Many different lines can be drawn through the same data points, but **only one will fit the data best**. The best-fit line is the one that minimizes the total **error**, measured as the sum of the squared vertical distances between the observed values and the values the line predicts.
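
To make "fits the data best" concrete, the sketch below scores a few candidate lines on the hours-studied data by their sum of squared errors. The candidate slopes and intercepts and the helper name `sum_squared_error` are illustrative assumptions, not part of the lesson's code.

```python
# Hours studied and test scores from the example above
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]


def sum_squared_error(m, c, xs, ys):
    """Sum of squared vertical distances between the data and y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical candidate lines as (slope, intercept) pairs
candidates = [(1.0, 0.0), (0.5, 2.5), (0.6, 2.2)]

errors = {line: sum_squared_error(*line, hours, scores) for line in candidates}
for (m, c), err in errors.items():
    print(f"y = {m}x + {c}: sum of squared errors = {err:.2f}")

# The candidate with the smallest error is the best fit among these three
best_line = min(errors, key=errors.get)
print("Best candidate:", best_line)
```

Among these three candidates, the line with slope 0.6 and intercept 2.2 has the smallest total error.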

---

## Using scikit-learn to Calculate the Best-Fit Line

scikit-learn's `LinearRegression` finds the **best-fit line** for us using ordinary least squares, so we don't have to try candidate lines by hand. Here's the Python code:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Data for hours studied and test scores
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # scikit-learn expects a 2D array: (n_samples, n_features)
y = np.array([2, 4, 5, 4, 5])

# Create the model and fit it to the data
model = LinearRegression().fit(X, y)

# Read off the slope (m) and intercept (c)
m = model.coef_[0]
c = model.intercept_

print(f'Calculated slope (m): {m}')  # approximately 0.6
print(f'Calculated intercept (c): {c}')  # approximately 2.2
```

---

### Key Points of the Code
1. **LinearRegression()**: Creates a linear regression model.
2. **.fit(X, y)**: Trains the model to find the best-fit line.
3. **.coef_[0]**: Retrieves the slope (\( m \)).
4. **.intercept_**: Retrieves the intercept (\( c \)).
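
For a single input variable, the least-squares slope and intercept also have a closed form, which gives some intuition for what `.fit` returns. The sketch below applies those textbook formulas in plain Python. The function name `fit_line` is an assumption for illustration, and scikit-learn's internal solver may work differently.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: sum of co-deviations of x and y over the sum of squared x deviations
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx
    # The least-squares line always passes through (x_mean, y_mean)
    c = y_mean - m * x_mean
    return m, c


hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

m, c = fit_line(hours, scores)
print(f"slope = {m:.2f}, intercept = {c:.2f}")
```

For this dataset the formulas give the same slope and intercept as the scikit-learn code above, up to floating-point rounding.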

---

## Plot the Line with the Data

Once we have the slope and intercept, we can plot the **best-fit line** over the original data to visualize how well it fits.
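
A minimal plotting sketch, assuming matplotlib is installed and using the slope and intercept reported above (0.6 and 2.2). The helper names `line_endpoints` and `plot_best_fit` and the output file name `best_fit.png` are hypothetical.

```python
def line_endpoints(m, c, xs):
    """Return the two (x, y) points needed to draw y = m*x + c across the data."""
    x_min, x_max = min(xs), max(xs)
    return (x_min, m * x_min + c), (x_max, m * x_max + c)


def plot_best_fit(xs, ys, m, c, path="best_fit.png"):
    """Draw the data as a scatter plot with the line on top and save it to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file so no display is needed
    import matplotlib.pyplot as plt

    (x0, y0), (x1, y1) = line_endpoints(m, c, xs)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys, color="blue", label="Data Points")
    ax.plot([x0, x1], [y0, y1], color="red", label="Best-Fit Line")
    ax.set_xlabel("Hours Studied")
    ax.set_ylabel("Test Score")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)


hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
start, end = line_endpoints(0.6, 2.2, hours)
print("Line runs from", start, "to", end)
```

Calling `plot_best_fit(hours, scores, 0.6, 2.2)` would then write the figure to `best_fit.png`. A straight line needs only two points, which is why the helper returns just the endpoints.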

---

## Making Predictions for New Values

Using the equation for the best-fit line:

\[
y = 0.6x + 2.2
\]

### Predicting for 6 Hours of Study
\[
y(6) = 0.6 \cdot 6 + 2.2 = 3.6 + 2.2 = 5.8
\]

The raw prediction of 5.8 is higher than any score in the data. If test scores are capped at 5, the reported prediction is **5**.

### Predicting for 0 Hours of Study
\[
y(0) = 0.6 \cdot 0 + 2.2 = 2.2
\]

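The model predicts a score of about **2.2** (roughly **2**) for zero hours of study. Note that 0 hours lies outside the observed range of 1 to 5 hours, so this value is an extrapolation.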
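
The two predictions above can be reproduced with a small helper. This sketch assumes valid scores lie between 0 and 5 (the lesson only mentions the upper cap), and the names `predict_score` and `clamp` are illustrative.

```python
M = 0.6  # slope of the best-fit line
C = 2.2  # intercept of the best-fit line


def predict_score(hours):
    """Raw prediction from the line y = 0.6x + 2.2."""
    return M * hours + C


def clamp(value, low=0.0, high=5.0):
    """Limit a prediction to the assumed valid score range."""
    return max(low, min(high, value))


for hours in (0, 6):
    raw = predict_score(hours)
    print(f"{hours} hours: raw = {raw:.1f}, reported = {clamp(raw):.1f}")
```

Keeping the raw prediction and the clamped value separate makes it clear which part comes from the model and which part comes from the scoring rule.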

---

## Lesson Summary

Fantastic! You've learned the basics of:
- **Linear regression** and its purpose.
- How to calculate the best-fit line using **scikit-learn**.
- Predicting outcomes using the line's equation.

In this lesson, we predicted a test score based on hours studied by calculating and plotting the best-fit line. Now it's time to put this knowledge into practice! In the next session, you'll implement linear regression on a new dataset and make predictions.

Let’s dive into those exercises and solidify your understanding!
```

## Best-Fit Line for House Prices

Hello, Stellar Navigator!

You have data for house sizes (in square feet) and their corresponding prices. Your mission is to calculate and plot the best-fit line using scikit-learn. This will help predict house prices based on their size. Fill in the missing parts to complete the task.

You've got this!

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Data for house sizes (square feet) and house prices (thousands of dollars)
X = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]).reshape(-1, 1)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# Create a Linear Regression model
model = LinearRegression().fit(X, y)

# TODO: Obtain slope (m) and intercept (c) from model

# Plotting the data
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.xlabel('House Size (sq ft)')
plt.ylabel('House Price (in $1000)')
plt.title('House Size vs Price')
plt.show()
```

Here's the completed code with the missing parts filled in, followed by an explanation of the key steps:

```python
# Stellar Navigator: Predict House Prices Based on Size

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Data for house sizes (square feet) and house prices (thousands of dollars)
X = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]).reshape(-1, 1)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# Create and fit a Linear Regression model
model = LinearRegression().fit(X, y)

# Obtain slope (m) and intercept (c) from the model
m = model.coef_[0]
c = model.intercept_

# Print slope and intercept
print(f"Slope (m): {m}")
print(f"Intercept (c): {c}")

# Plot the data and the best-fit line
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X, model.predict(X), color='red', label='Best-Fit Line')
plt.xlabel('House Size (sq ft)')
plt.ylabel('House Price (in $1000)')
plt.title('House Size vs Price')
plt.legend()
plt.show()
```

---

### Explanation of the Key Steps:
1. **Creating the Model:**
   - `LinearRegression().fit(X, y)` trains the linear regression model on the data, finding the optimal slope (`m`) and intercept (`c`).

2. **Obtaining Slope and Intercept:**
   - `.coef_`: An array of coefficients, one per input variable. With a single input, `.coef_[0]` is the slope (`m`) of the line.
   - `.intercept_`: Contains the y-intercept (`c`) of the line.

3. **Visualizing the Data:**
   - `plt.scatter`: Plots the original data points.
   - `plt.plot`: Plots the predicted values using the best-fit line.

---

### Interpreting the Output:
- **Slope (m):** How much the predicted house price rises (in $1000s) per additional square foot. For this data it is approximately 0.11, or about $110 per square foot.
- **Intercept (c):** The predicted price of a house whose size is zero, approximately 98.25 (about $98,250) for this data. That is not a realistic house, but the intercept is still needed to position the line.

Now you can predict house prices for any given size using the equation:
\[
\text{Price} = m \cdot \text{Size} + c
\]
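
As a sketch of how the equation might be used, the snippet below recomputes the slope and intercept for the house data with the closed-form least-squares formulas (in plain Python rather than scikit-learn) and then applies Price = m · Size + c. The 2,000 sq ft query size and the helper name `fit_line` are assumptions for illustration.

```python
sizes = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # sq ft
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s


def fit_line(xs, ys):
    """Least-squares (slope, intercept) for one input variable."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_mean - slope * x_mean


m, c = fit_line(sizes, prices)

new_size = 2000  # hypothetical house size to price
predicted = m * new_size + c

print(f"m = {m:.4f}, c = {c:.2f}")
print(f"Predicted price for {new_size} sq ft: about ${predicted:.1f}K")
print(f"Each extra 100 sq ft adds about ${100 * m:.1f}K")
```

The last line restates the slope in more tangible units. Since the slope is the price change per square foot, multiplying it by 100 gives the change per 100 square feet.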

## Predict Saucer Sightings for Future Months

Hey, Space Wanderer!

You've been tracking flying saucer sightings over different months. The number of sightings over 5 months is as follows:

- **Months**: [1, 2, 3, 4, 5]
- **Sightings**: [5, 6, 7, 8, 9]

Using linear regression, the best-fit line was found to be \( y = x + 4 \).

Use this line equation to predict the number of sightings for months 6 and 7.

You've got this!

```python
# Equation of best-fit line: y = 1.0x + 4.0
m = 1.0
c = 4.0

# Predict sightings for month 6
month_6 = 6
# TODO: Calculate prediction for month 6
prediction_6 = ...

# Predict sightings for month 7
month_7 = 7
# TODO: Calculate prediction for month 7
prediction_7 = ...

print("Predicted sightings for month 6:", prediction_6)
print("Predicted sightings for month 7:", prediction_7)
```


Here’s the complete code with the missing parts filled in to calculate predictions for months 6 and 7 using the given best-fit line equation:

```python
# Space Wanderer: Predict Flying Saucer Sightings

# Equation of best-fit line: y = 1.0x + 4.0
m = 1.0  # Slope of the line
c = 4.0  # Intercept of the line

# Predict sightings for month 6
month_6 = 6
prediction_6 = m * month_6 + c  # y = mx + c for month 6

# Predict sightings for month 7
month_7 = 7
prediction_7 = m * month_7 + c  # y = mx + c for month 7

# Print the predictions
print("Predicted sightings for month 6:", prediction_6)
print("Predicted sightings for month 7:", prediction_7)
```

---

### Explanation:
1. **Equation of the Line:**
   - The best-fit line is given by \( y = mx + c \), where:
     - \( m \) is the slope of the line (1.0 in this case).
     - \( c \) is the y-intercept (4.0 in this case).

2. **Predicting for Month 6:**
   - Substituting \( x = 6 \): \( y = 1.0 \cdot 6 + 4.0 = 10 \).

3. **Predicting for Month 7:**
   - Substituting \( x = 7 \): \( y = 1.0 \cdot 7 + 4.0 = 11 \).

---

### Output:
```
Predicted sightings for month 6: 10.0
Predicted sightings for month 7: 11.0
```

Now you can easily predict sightings for any future month using the same formula! 🚀
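
As a quick sanity check, the sketch below confirms that y = x + 4 reproduces the five observed months exactly and then extends the line to a few later months. The helper name `predict_sightings` and the choice of months 6 to 8 are illustrative.

```python
months = [1, 2, 3, 4, 5]
sightings = [5, 6, 7, 8, 9]

m, c = 1.0, 4.0  # best-fit line from the exercise: y = 1.0x + 4.0


def predict_sightings(month):
    """Predicted sightings for a given month number."""
    return m * month + c


# Every observed point lies on the line, so all residuals are zero
residuals = [y - predict_sightings(x) for x, y in zip(months, sightings)]
print("Residuals:", residuals)

# Extend the same line to later months
future = {month: predict_sightings(month) for month in (6, 7, 8)}
print("Future predictions:", future)
```

Zero residuals mean the data is perfectly linear, which is why the fitted line has such round coefficients.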

## Best-Fit Line for Temperature and Ice Cream Sales

Hey, Space Wanderer!

You've got some cool data on temperature and ice cream sales. Your mission is to figure out and plot the best-fit line for this dataset. Complete the code to calculate the slope and intercept for the best-fit line, and then visualize it on a scatter plot.

Let's find that best-fit line!

```python
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Data for temperature and ice cream sales
X = np.array([30, 35, 40, 45, 50, 55, 60]).reshape(-1, 1)
y = np.array([40, 50, 50, 55, 45, 65, 60])

# TODO: fit the linear regression on the data and obtain its slope and intercept

# Plot the data points
plt.scatter(X, y, color='blue')

plt.plot(X, m * X + c, color='red')
plt.xlabel('Temperature (°F)')
plt.ylabel('Ice Cream Sales')
plt.title('Best-Fit Line for Temperature vs Ice Cream Sales')
plt.show()

print(f'Calculated slope (m): {m}')
print(f'Calculated intercept (c): {c}')
```

Here's the completed code with the missing parts filled in to calculate the slope and intercept, and then plot the best-fit line:

```python
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Data for temperature and ice cream sales
X = np.array([30, 35, 40, 45, 50, 55, 60]).reshape(-1, 1)
y = np.array([40, 50, 50, 55, 45, 65, 60])

# Fit the linear regression model
model = LinearRegression().fit(X, y)

# Obtain slope (m) and intercept (c)
m = model.coef_[0]
c = model.intercept_

# Plot the data points
plt.scatter(X, y, color='blue', label='Data Points')

# Plot the best-fit line
plt.plot(X, model.predict(X), color='red', label='Best-Fit Line')

# Add labels and title
plt.xlabel('Temperature (°F)')
plt.ylabel('Ice Cream Sales')
plt.title('Best-Fit Line for Temperature vs Ice Cream Sales')
plt.legend()
plt.show()

# Print the slope and intercept
print(f'Calculated slope (m): {m}')
print(f'Calculated intercept (c): {c}')
```

---

### Explanation:
1. **Fitting the Model:**
   - `LinearRegression().fit(X, y)` trains the linear regression model using the input \( X \) (temperatures) and output \( y \) (sales).

2. **Extracting the Parameters:**
   - `model.coef_[0]` gives the slope (\( m \)).
   - `model.intercept_` gives the y-intercept (\( c \)).

3. **Visualizing:**
   - `plt.scatter(X, y)` creates a scatter plot for the data points.
   - `plt.plot(X, model.predict(X))` plots the predicted line.

---

### Example Output (trailing digits may vary slightly):
```
Calculated slope (m): 0.6071428571428571
Calculated intercept (c): 24.82142857142857
```

The plot will display the scatter points and a red best-fit line indicating the relationship between temperature and ice cream sales.


## Predict a House's Price

Stellar Navigator, you're given data on house sizes and their prices. Your task is to use linear regression to find the best-fit line and predict the price for a house of a specific size. Complete the code by filling in the missing part, which predicts the price from the slope and intercept calculated on the given dataset.

Let's predict a house's price!

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Data: house sizes (in sq ft) and prices (in $1000s)
house_sizes = np.array([1500, 1600, 1700, 1800, 1900]).reshape(-1, 1)
prices = np.array([300, 320, 340, 360, 380])

# Create and fit the linear regression model
model = LinearRegression().fit(house_sizes, prices)

# Calculate the slope (m) and intercept (c)
slope = model.coef_[0]
intercept = model.intercept_

print(f"Calculated slope (m): {slope}")
print(f"Calculated intercept (c): {intercept}")

# TODO: Predict the price for a house of size 2000 sq ft using obtained slope and intercept
```

Here's the completed code to predict the price for a house of size 2000 square feet:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Data: house sizes (in sq ft) and prices (in $1000s)
house_sizes = np.array([1500, 1600, 1700, 1800, 1900]).reshape(-1, 1)
prices = np.array([300, 320, 340, 360, 380])

# Create and fit the linear regression model
model = LinearRegression().fit(house_sizes, prices)

# Calculate the slope (m) and intercept (c)
slope = model.coef_[0]
intercept = model.intercept_

print(f"Calculated slope (m): {slope}")
print(f"Calculated intercept (c): {intercept}")

# Predict the price for a house of size 2000 sq ft
house_size = 2000
predicted_price = model.predict([[house_size]])
print(f"Predicted price for a house of size {house_size} sq ft: ${predicted_price[0]:.2f}K")
```

---

### Explanation:
1. **Training the Model:**
   - `LinearRegression().fit(house_sizes, prices)` trains the model using the house sizes (\(X\)) and prices (\(y\)).

2. **Extracting the Parameters:**
   - `model.coef_[0]`: The slope (\(m\)) of the best-fit line.
   - `model.intercept_`: The y-intercept (\(c\)) of the best-fit line.

3. **Making Predictions:**
   - `model.predict([[house_size]])` uses the trained model to predict the price for a house of size \(2000 \, \text{sq ft}\).

---

### Example Output (up to floating-point rounding):
```
Calculated slope (m): 0.2
Calculated intercept (c): 0.0
Predicted price for a house of size 2000 sq ft: $400.00K
```

This indicates that the predicted price for a 2000 sq ft house is \$400,000.
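
The same prediction can be made directly from the line equation, as the exercise's TODO suggests. Because this dataset is perfectly linear, the sketch below reads the slope off two data points instead of fitting a model. That shortcut is an assumption that only holds for exactly collinear data.

```python
house_sizes = [1500, 1600, 1700, 1800, 1900]  # sq ft
prices = [300, 320, 340, 360, 380]            # in $1000s

# Rise over run between the first two points (valid because the data is collinear)
slope = (prices[1] - prices[0]) / (house_sizes[1] - house_sizes[0])
intercept = prices[0] - slope * house_sizes[0]

# Check that every point really lies on the line
assert all(abs(slope * x + intercept - y) < 1e-9 for x, y in zip(house_sizes, prices))

# Apply y = mx + c to the new house size
house_size = 2000
predicted_price = slope * house_size + intercept
print(f"slope = {slope}, intercept = {intercept}")
print(f"Predicted price for {house_size} sq ft: ${predicted_price:.2f}K")
```

Prices rise by 20 (thousand dollars) for every 100 sq ft, so the slope is 0.2 and the line passes through the origin. This agrees with the \$400,000 prediction from `model.predict`.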