# Unit 3 Making Predictions and Visualizing Results

# Lesson Introduction
In this lesson, we will explore how to make predictions using a trained machine learning model and visualize the results. Understanding predictions and visualizations helps interpret model performance and make informed decisions.

Our goal is to build on the trained linear regression model from the last lesson, use it to make predictions, and plot those predictions against the actual data. This gives a clear picture of the model's performance.

---

# Making Predictions with a Trained Model: Understanding
First, let's recap. In the last lesson, we trained a linear regression model to understand the relationship between the area of a house and its price using synthetic data. We will use the same code snippet for generating the data as we used in the previous lesson:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)  # House area in square feet

# Assume a linear relationship: price = base_price + (area * price_per_sqft)
base_price = 50000.00
price_per_sqft = 200.00
noise = np.random.normal(0, 25000, num_samples)  # Adding noise
price = base_price + (area * price_per_sqft) + noise

# Create DataFrame
df = pd.DataFrame({'Area': np.round(area, 2), 'Price': np.round(price, 2)})
X = df[['Area']]
y = df['Price']
```
Now, imagine you have new data, like areas of new houses, and want to predict their prices using your trained model. This is where the **predict** method from Scikit-Learn comes into play.

The **predict** method takes input data with one row per sample and returns one predicted value for each row, computed from the parameters the model learned during training.

The examples in this lesson rely on NumPy, Pandas, Matplotlib, and Scikit-Learn, together with the data generated by the script above.
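To make the input and output of **predict** concrete, here is a minimal sketch on a toy dataset. The names `X_toy`, `y_toy`, `toy_model`, and `X_query` are illustrative and not part of the lesson's code, and the toy prices are chosen to be exactly 100 times the area so the result is easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): price is exactly 100 * area
X_toy = np.array([[1.0], [2.0], [3.0]])  # 2D: 3 samples, 1 feature
y_toy = np.array([100.0, 200.0, 300.0])  # 1D: one target per sample

toy_model = LinearRegression().fit(X_toy, y_toy)

# predict takes a 2D input and returns a 1D array, one value per row
X_query = np.array([[4.0], [5.0]])
y_query = toy_model.predict(X_query)

print(X_query.shape)         # (2, 1)
print(y_query.shape)         # (2,)
print(np.round(y_query, 2))  # [400. 500.]
```

The input keeps its two-dimensional shape even with a single feature, while the output is a flat array of predictions.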

---

# Making Predictions with a Trained Model: Application
First, we initialize and train the model using our data.

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
```
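
Once `fit` has run, the line the model learned is stored on the model object. The sketch below rebuilds the lesson's data so it can run on its own, then reads the `coef_` and `intercept_` attributes of `LinearRegression`. Because the data was generated with a price of 200 per square foot and a base price of 50000, the estimates should land near those values, although the added noise means they will not match exactly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the lesson (same seed, same order of random calls)
np.random.seed(42)
area = np.random.uniform(500, 3500, 100)
price = 50000 + 200 * area + np.random.normal(0, 25000, 100)
df = pd.DataFrame({'Area': np.round(area, 2), 'Price': np.round(price, 2)})

model = LinearRegression().fit(df[['Area']], df['Price'])

slope = model.coef_[0]        # estimated price per square foot
intercept = model.intercept_  # estimated base price

print(f"Estimated price per sq ft: {slope:.2f}")
print(f"Estimated base price: {intercept:.2f}")
```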
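Then, let's make predictions with our trained model. Suppose you want to predict prices for houses with areas of 200 and 3500 square feet. Note that 200 square feet lies below the smallest area in the training data (500), so that prediction is an extrapolation beyond what the model has seen.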

```python
X_new = np.array([[200.00], [3500.00]])
# Converting to a DataFrame to include feature name
X_new = pd.DataFrame(X_new, columns=['Area'])
y_predict = model.predict(X_new)
print("Predicted prices:", np.round(y_predict, 2))  # [ 96526.95 743883.09]
```
In real life, you might need to predict prices for several new houses, not just two.
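
As a sketch of that larger case, the example below predicts prices for a small batch of houses and stores the results next to the inputs. The `listings` DataFrame, its area values, and the `PredictedPrice` column name are made up for illustration; the data and model are rebuilt here only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Rebuild the lesson's data and model
np.random.seed(42)
area = np.random.uniform(500, 3500, 100)
price = 50000 + 200 * area + np.random.normal(0, 25000, 100)
df = pd.DataFrame({'Area': np.round(area, 2), 'Price': np.round(price, 2)})
model = LinearRegression().fit(df[['Area']], df['Price'])

# Hypothetical batch of new houses, using the same column name as in training
listings = pd.DataFrame({'Area': [850.0, 1200.0, 1875.5, 2400.0, 3100.0]})

# One call to predict handles every row; store the result as a new column
listings['PredictedPrice'] = np.round(model.predict(listings[['Area']]), 2)

print(listings)
```

Keeping the predictions in the same DataFrame as the inputs makes it easy to sort, filter, or export them later.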

---

# Visualizing Prediction Results: Part 1
Visualizing your predictions against actual data helps you understand your model's performance. We'll use Matplotlib to create a scatter plot for the actual data and a line plot for the predicted data.

A **scatter plot** is ideal for visualizing individual data points, perfect for plotting actual house prices against their areas.

```python
import matplotlib.pyplot as plt

plt.scatter(X, y.round(), color='blue', alpha=0.5)
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.title('House Prices vs. Area')
plt.grid()
plt.show()
```

---

# Visualizing Prediction Results: Part 2
Next, we'll draw a line plot to visualize the predictions. This line shows how the predicted prices vary with house areas.

```python
plt.scatter(X, y, color='blue')  # Plot actual data
plt.plot(X_new, np.round(y_predict, 2), color='red', linewidth=2)  # Plot predicted line
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.title('Linear Regression: Area vs. Price')
plt.grid()
plt.show()
```
In this plot:

* **Blue points** represent actual data.
* The **red line** represents the model's predictions.

The closer the blue points lie to the red line, the better the model fits the data. Here the points cluster tightly around the line because the synthetic data was generated from a linear relationship with moderate noise. Real-world data is often much noisier.
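
The idea of points being "close to the line" can also be expressed as numbers. The following sketch is one possible way to do that, not part of the original lesson code: it computes the residuals (actual price minus predicted price) and two common summaries from `sklearn.metrics`, the mean absolute error and the R² score. The data and model are rebuilt so the snippet is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Rebuild the lesson's data and model
np.random.seed(42)
area = np.random.uniform(500, 3500, 100)
price = 50000 + 200 * area + np.random.normal(0, 25000, 100)
df = pd.DataFrame({'Area': np.round(area, 2), 'Price': np.round(price, 2)})
X = df[['Area']]
y = df['Price']
model = LinearRegression().fit(X, y)

# Predictions for the training areas themselves
y_fitted = model.predict(X)

# Residual = vertical distance between a blue point and the red line
residuals = y - y_fitted

mae = mean_absolute_error(y, y_fitted)  # average absolute distance, in dollars
r2 = r2_score(y, y_fitted)              # share of price variance explained

print(f"Mean absolute error: {mae:,.2f}")
print(f"R^2: {r2:.3f}")
print(f"Largest absolute residual: {residuals.abs().max():,.2f}")
```

Small residuals and an R² close to 1 correspond to points that sit near the line in the plot.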

---

# Lesson Summary
Congratulations! You've learned how to make predictions using a trained machine learning model and visualize results. In this lesson, we covered:

* Generating and preparing data.
* Using the **predict** method to generate new predictions.
* Visualizing actual data and predicted results using Matplotlib.

Visualizing predictions helps you understand and evaluate your model's performance. For a good model, predicted values are close to the actual values, so the data points cluster tightly around the prediction line.

Now, it's time to use what you've learned in practice. In the practice session, you'll generate your predictions and visualize them. This hands-on experience will reinforce your skills and make you more comfortable with making predictions and visualizing results. Happy coding!

## Predicting and Visualizing House Prices

How would you predict house prices based on their areas? Visualizing this can help you see if your predictions make sense.

Space Explorer, let’s predict prices for new house areas and visualize the results! The blue points represent actual house prices, and the red line shows predicted prices for new areas. Click Run to see the magic happen!


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic data
data = {'Area': [700, 750, 800, 850, 900], 'Price': [35000, 38000, 39000, 41000, 45000]}
df = pd.DataFrame(data)

# Define features and target
X = df[['Area']].values
y = df['Price'].values

# Train the linear regression model
model = LinearRegression()
model.fit(X, y)

# Make predictions for new data
X_new = np.array([[600], [1000]]) 
y_predict = model.predict(X_new)

# Plot the results
plt.scatter(X, y, color='blue', alpha=0.5)
plt.plot(X_new, y_predict, color='red', linewidth=2)
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.title('Real Estate: Area vs. Price Prediction')
plt.grid()
plt.show()

```

## Visualize Predicted Prices Using Line Plot

The given code visualizes the linear regression model as a best-fit line. Let's plot the model's individual predictions instead by changing `plt.plot` to `plt.scatter`. Simply change the function and remove the `linewidth` parameter!


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generating synthetic data for training
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)
base_price = 50000.00
price_per_sqft = 200.00
noise = np.random.normal(0, 25000, num_samples)
price = base_price + (area * price_per_sqft) + noise

# Training the linear regression model
df = pd.DataFrame({'Area': np.round(area, 2), 'Price': np.round(price, 2)})
X = df[['Area']]
y = df['Price']
model = LinearRegression()
model.fit(X, y)

# Predict house prices for new areas
X_new = np.array([[400.00], [3500.00]])
X_new = pd.DataFrame(X_new, columns=['Area'])
y_predict = model.predict(X_new)

# Plotting the actual data and predicted prices as a line plot
plt.scatter(X, y, color='blue', alpha=0.5)  # actual data
plt.plot(X_new, y_predict, color='red', linewidth=2)  # predicted prices as line plot
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.title('House Prices: Actual vs Predicted')
plt.grid()
plt.show()

```

With that change applied, the code looks like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generating synthetic data for training
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)
base_price = 50000.00
price_per_sqft = 200.00
noise = np.random.normal(0, 25000, num_samples)
price = base_price + (area * price_per_sqft) + noise

# Training the linear regression model
df = pd.DataFrame({'Area': np.round(area, 2), 'Price': np.round(price, 2)})
X = df[['Area']]
y = df['Price']
model = LinearRegression()
model.fit(X, y)

# Predict house prices for new areas
X_new = np.array([[400.00], [3500.00]])
X_new = pd.DataFrame(X_new, columns=['Area'])
y_predict = model.predict(X_new)

# Plotting the actual data and predicted prices as a scatter plot
plt.scatter(X, y, color='blue', alpha=0.5)  # actual data
plt.scatter(X_new, y_predict, color='red')  # predicted prices as scatter plot
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.title('House Prices: Actual vs Predicted')
plt.grid()
plt.show()

```

## Predict House Prices

Great job so far! Now let's see if you can fill in the missing pieces to make the code work. Use the trained linear regression model to predict house prices for new house areas and visualize the results.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data for house areas (sq ft) and prices ($)
X = np.array([[1500], [2300], [1800], [2700], [3300]])
y = np.array([350000, 480000, 390000, 570000, 650000])

# Train the Linear Regression model
model = LinearRegression()
model.fit(X, y)

# Predict the price for new house areas
X_new = np.array([[1000], [3000]])
# TODO: Use the model to predict prices for the new house areas

# Visualize the result
plt.scatter(X, y, color='blue', label='Actual Prices')
# TODO: Plot the predicted prices for the new house areas
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.legend()
plt.grid()
plt.show()

```

A completed version might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data for house areas (sq ft) and prices ($)
X = np.array([[1500], [2300], [1800], [2700], [3300]])
y = np.array([350000, 480000, 390000, 570000, 650000])

# Train the Linear Regression model
model = LinearRegression()
model.fit(X, y)

# Predict the price for new house areas
X_new = np.array([[1000], [3000]])
# Use the model to predict prices for the new house areas
y_predict = model.predict(X_new)

# Visualize the result
plt.scatter(X, y, color='blue', label='Actual Prices')
# Plot the predicted prices for the new house areas
plt.scatter(X_new, y_predict, color='red', marker='x', s=100, label='Predicted Prices')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.title('House Prices: Actual vs Predicted for New Areas')
plt.legend()
plt.grid(True)
plt.show()
```

## Debugging House Price Predictions

Stellar Navigator, you have code that predicts house prices based on their areas and visualizes the results. There is a small bug that prevents it from working correctly. Can you find and fix the issue to ensure the code runs without errors and produces the correct output?

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generating synthetic data for areas and prices
areas = np.array([500, 1000, 1500, 2000, 2500])
prices = np.array([150000, 280000, 440000, 580000, 760000])

df = pd.DataFrame({'Area': areas, 'Price': prices})

# Train the linear regression model
model = LinearRegression()
X = df['Area'].values
y = df['Price'].values
model.fit(X, y)

# Making predictions (including areas 0 and 3500)
X_new = np.array([[0], [3500]])
predictions = model.predict(X_new)

# Visualizing actual data and predicted results
plt.scatter(X, y, color='blue')  # actual data
plt.plot(X_new, predictions, color='red', linewidth=2)  # predicted line
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.title('Area vs. Price Prediction')
plt.grid()
plt.show()

```

Stellar Navigator, you've almost got it! The small bug lies in how you're preparing your `X` (feature) data for the `model.fit()` method.

Scikit-learn's `fit` method for models like `LinearRegression` expects the input feature array (`X`) to be 2-dimensional: `(n_samples, n_features)`. Even when you only have one feature (like 'Area'), it still needs to be explicitly shaped as a column vector.

Currently, `X = df['Area'].values` creates a 1-dimensional NumPy array.

**The Fix:**
`X` must be 2-dimensional. Either select the column with double brackets, `df[['Area']]`, which returns a one-column DataFrame of shape `(5, 1)`, or reshape the NumPy array with `.reshape(-1, 1)`. The corrected code uses `df[['Area']]`, which also keeps the feature name attached to the data.

Here's the corrected code:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generating synthetic data for areas and prices
areas = np.array([500, 1000, 1500, 2000, 2500])
prices = np.array([150000, 280000, 440000, 580000, 760000])

df = pd.DataFrame({'Area': areas, 'Price': prices})

# Train the linear regression model
model = LinearRegression()
# X must be 2D with shape (n_samples, n_features)
X = df[['Area']]  # double brackets return a DataFrame (2D)
y = df['Price'].values  # y can remain 1D
model.fit(X, y)

# Making predictions (including areas 0 and 3500)
# A DataFrame with the same column name matches the feature name seen during fit
X_new = pd.DataFrame({'Area': [0, 3500]})
predictions = model.predict(X_new)

# Visualizing actual data and predicted results
plt.scatter(X, y, color='blue', label='Actual Data')  # actual data
plt.plot(X_new, predictions, color='red', linewidth=2, label='Predicted Line')  # predicted line
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.title('Area vs. Price Prediction')
plt.legend()
plt.grid(True)
plt.show()
plt.show()
```

**Reasoning for the fix:**
Scikit-learn's `LinearRegression.fit(X, y)` expects `X` to be a 2D array or DataFrame in which each row is a sample and each column is a feature. `df['Area'].values` is a 1D array, so `fit` raises a `ValueError` asking for a 2D array. `df[['Area']]` keeps the DataFrame structure, a single column with one row per sample, which satisfies that requirement.
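
The shape difference behind this bug can be checked directly. The sketch below uses the exercise's data and compares the 1D array with the two 2D alternatives; the variable names `X_1d`, `X_2d`, and `raised` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'Area': [500, 1000, 1500, 2000, 2500],
                   'Price': [150000, 280000, 440000, 580000, 760000]})

X_1d = df['Area'].values     # single brackets: 1D array
X_2d = X_1d.reshape(-1, 1)   # one column, as many rows as needed

print(X_1d.shape)            # (5,)
print(X_2d.shape)            # (5, 1)
print(df[['Area']].shape)    # (5, 1)

# The 1D version is rejected by fit; the 2D version is accepted
try:
    LinearRegression().fit(X_1d, df['Price'])
    raised = False
except ValueError:
    raised = True
    print("fit rejected the 1D input")

model = LinearRegression().fit(X_2d, df['Price'])
print(model.coef_.shape)     # (1,)
```

In `.reshape(-1, 1)`, the `-1` tells NumPy to infer the number of rows from the array's length.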

## Making Predictions from Scratch

Great job so far, Space Explorer! It's time for our final challenge in this lesson. From scratch, predict and visualize house prices using a linear regression model. Follow the steps in the TODO comments to complete the task.

May the Force be with you, and happy coding!

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generating synthetic data
np.random.seed(0)
areas = np.random.rand(50, 1) * 5000  # Areas between 0 and 5000 sq ft
prices = 10000 + 150 * areas + np.random.randn(50, 1) * 20000  # Linear relation
prices = np.round(prices, 2)

# TODO: Use the generated synthetic data to create a DataFrame named 'data' with columns 'Area' and 'Price'.

# TODO: Train a Linear Regression model using the DataFrame 'data'. Features are in 'Area'; target is 'Price'.

# TODO: Create a new DataFrame for prediction with areas: 1000, 2000, 3000, and 4000 sq ft.

# TODO: Predict prices for the new data using the trained Linear Regression model.

# TODO: Visualise results by plotting both data and new predictions with plt.scatter 

```

One possible solution:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generating synthetic data
np.random.seed(0)
areas = np.random.rand(50, 1) * 5000  # Areas between 0 and 5000 sq ft
prices = 10000 + 150 * areas + np.random.randn(50, 1) * 20000  # Linear relation
prices = np.round(prices, 2)

# TODO: Use the generated synthetic data to create a DataFrame named 'data' with columns 'Area' and 'Price'.
data = pd.DataFrame({'Area': areas.flatten(), 'Price': prices.flatten()})

# TODO: Train a Linear Regression model using the DataFrame 'data'. Features are in 'Area'; target is 'Price'.
model = LinearRegression()
model.fit(data[['Area']], data['Price'])

# TODO: Create a new DataFrame for prediction with areas: 1000, 2000, 3000, and 4000 sq ft.
new_areas = pd.DataFrame({'Area': [1000, 2000, 3000, 4000]})

# TODO: Predict prices for the new data using the trained Linear Regression model.
predicted_prices = model.predict(new_areas)

# TODO: Visualise results by plotting both data and new predictions with plt.scatter
plt.figure(figsize=(10, 6))
plt.scatter(data['Area'], data['Price'], label='Original Data', alpha=0.7)
plt.scatter(new_areas['Area'], predicted_prices, color='red', marker='x', s=100, label='Predictions')
plt.plot(data['Area'], model.predict(data[['Area']]), color='green', linestyle='--', label='Regression Line')
plt.title('House Price Prediction using Linear Regression')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()
```