# Unit 1: Introduction to Machine Learning


# Lesson Introduction

**Machine learning!** You've probably heard this term a lot. But what exactly is it? Think of it as teaching a computer to learn from data and make decisions or predictions based on that data. This is like teaching a child to recognize different objects by showing them examples.

In this lesson, our goal is to understand the basics of a machine learning project. We’ll generate data, visualize it, and understand the relationships within it.

## Data Generation

Let’s start by generating some data. In real-life projects, the first step is to collect data, but we'll create synthetic (fake) data for our learning purposes using **NumPy**.

Why random data? It lets us simulate different scenarios in a controlled environment, where we know exactly how the data was produced. Don't worry: by the end of this course we will work with real data as well.

We'll use **NumPy** to generate areas of houses (in square feet) and their prices:

```python
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)  # House area in square feet
area = np.round(area, 2)  # Round to 2 decimal places

# Assume a linear relationship: price = base_price + (area * price_per_sqft)
base_price = 50000
price_per_sqft = 200
noise = np.random.normal(0, 25000, num_samples)  # Adding some noise
price = base_price + (area * price_per_sqft) + noise
price = np.round(price, 2)  # Round to 2 decimal places

# Display a few generated data points for verification
print("Area (sq ft):", area[:5])
print("Price ($):", price[:5])
```

Output from the code:
```
Area (sq ft): [1623.62 3352.14 2695.98 2295.98  968.06]
Price ($): [376900.18 712952.82 591490.02 459506.78 238120.2]
```

**Real-life example:** Imagine you want to predict house prices in your neighborhood. The area of a house affects its price. We simulate this with a simple linear relationship and add noise to make the data more realistic.

Let's break down the data generation:

* **Generate House Areas:** Creates 100 random house areas between 500 and 3500 square feet.
* **Define Price Relationship:**
    * **Base price:** A constant starting price.
    * **Price per square foot:** A fixed price per unit area.
    * **Noise:** Adds variability to simulate real-world data.
* **Calculate Prices:** Computes the final prices based on the area, base price, price per square foot, and added noise.

This method creates a realistic dataset with variable house prices based on their areas.
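
The steps above can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the original lesson: the function name `generate_house_data` and its parameters are hypothetical, and it uses NumPy's newer `np.random.default_rng` generator instead of `np.random.seed`, so the numbers it produces will differ from the output shown earlier.

```python
import numpy as np


def generate_house_data(num_samples=100, base_price=50_000,
                        price_per_sqft=200, noise_std=25_000, seed=42):
    """Return (area, price) arrays with price = base_price + area * price_per_sqft + noise."""
    rng = np.random.default_rng(seed)  # local generator; global random state is untouched
    area = np.round(rng.uniform(500, 3500, num_samples), 2)  # square feet
    noise = rng.normal(0, noise_std, num_samples)            # random variability
    price = np.round(base_price + area * price_per_sqft + noise, 2)
    return area, price


# The same seed gives the same data, which is what makes the experiment reproducible
area_a, price_a = generate_house_data(seed=42)
area_b, price_b = generate_house_data(seed=42)
print("Same seed, same data:", np.array_equal(area_a, area_b))

# With the noise switched off, every point lies exactly on the straight line
area_c, price_c = generate_house_data(noise_std=0)
print("On the line without noise:", np.allclose(price_c, 50_000 + 200 * area_c))
```

Setting `noise_std` to a larger or smaller value is a quick way to see how much the noise term blurs the underlying linear relationship.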

## Creating a Data Structure

Now that we have our data, we need to handle it. This is where **Pandas** comes in handy. **Pandas** provides a powerful data structure called a **DataFrame**.

A **DataFrame** is like a table in an Excel sheet. It helps us organize data in rows and columns, making it easy to manipulate and analyze.

```python
import pandas as pd

# Create DataFrame
data = pd.DataFrame({'Area': area, 'Price': price})

# Display first few rows of the dataset
print(data.head())
```

Output:

```
      Area      Price
0  1623.62  376900.18
1  3352.14  712952.82
2  2695.98  591490.02
3  2295.98  459506.78
4   968.06  238120.20
```
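
Once the data is in a DataFrame, a few common operations make it easy to inspect and manipulate. The sketch below uses a small hand-written table with illustrative values (not the lesson's generated data), and the `PricePerSqFt` column is a hypothetical addition for demonstration.

```python
import pandas as pd

# Small hand-written table (illustrative values only)
data = pd.DataFrame({
    'Area': [800.0, 1500.0, 2200.0, 3000.0],
    'Price': [210000.0, 350000.0, 490000.0, 650000.0],
})

print(data.shape)             # (number of rows, number of columns)
print(data.columns.tolist())  # column names
print(data.describe())        # count, mean, std, min, quartiles, max per column

# Select a single column (a pandas Series)
areas = data['Area']

# Keep only the rows where the house is larger than 2,000 sq ft
large_houses = data[data['Area'] > 2000]
print(large_houses)

# Add a derived column computed from existing ones
data['PricePerSqFt'] = data['Price'] / data['Area']
print(data)
```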

## Data Visualization

To understand our data better, we need to visualize it. This means creating graphs to see patterns and relationships. We use **Matplotlib** for this purpose.

Visualizing data is crucial because it helps us see trends, patterns, and outliers, guiding us in choosing the right algorithms and parameters.

```python
import matplotlib.pyplot as plt

# Plot the data to visualize the relationship
plt.scatter(data['Area'], data['Price'], alpha=0.5)
plt.title('House Area vs. Price')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.grid()
plt.show()
```

*Figure: scatter plot titled "House Area vs. Price", with area (sq ft) on the x-axis and price ($) on the y-axis.*
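
The relationship visible in a scatter plot can also be summarized with a single number. The sketch below rebuilds the lesson's dataset and computes the Pearson correlation coefficient with pandas' `Series.corr`; values close to 1 suggest a strong positive linear relationship. Using correlation here is an addition for illustration, not a step the lesson itself relies on.

```python
import numpy as np
import pandas as pd

# Rebuild the lesson's dataset with the same seed and parameters
np.random.seed(42)
area = np.round(np.random.uniform(500, 3500, 100), 2)
noise = np.random.normal(0, 25000, 100)
price = np.round(50000 + area * 200 + noise, 2)
data = pd.DataFrame({'Area': area, 'Price': price})

# Pearson correlation (the default method of Series.corr)
correlation = data['Area'].corr(data['Price'])
print(f"Correlation between Area and Price: {correlation:.3f}")
```

Because the noise is small compared with the spread of prices caused by area, the coefficient should come out close to 1.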

## Lesson Summary

Great job! Let’s recap what we learned today:

* Introduced the basic idea of machine learning.
* Generated synthetic data using **NumPy**.
* Created a **DataFrame** with **Pandas** to handle and organize data.
* Visualized our data using **Matplotlib**.

By visualizing our data, we gain insights into relationships within it. Understanding these relationships is key to building effective machine learning models.

Now it’s time for hands-on practice. You will create your own synthetic data, construct a **DataFrame**, and plot relationships to understand the data better. This practice will reinforce the concepts we covered and make you more comfortable with data manipulation and visualization before you build your first machine learning model.

Let’s get started!

## House Area and Price Relationship

Hey Space Explorer! In the given code, you can see the relationship between the area and price of houses. This simulation generates synthetic data for house areas and their prices, then visualizes it using a scatter plot.

Run the code to see how house prices vary with their area!

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Another random seed for reproducibility
np.random.seed(10)

# Generate synthetic data
num_samples = 50
area = np.random.uniform(600, 3000, num_samples)  # House area in square feet

# Another relationship: price = base_price + (area * price_per_sqft)
base_price = 45000
price_per_sqft = 180
noise = np.random.normal(0, 20000, num_samples)  # Adding some noise
price = base_price + (area * price_per_sqft) + noise
price = np.round(price, 2)

# Create DataFrame
data = pd.DataFrame({'HouseArea': area, 'HousePrice': price})

# Plot the data to visualize the relationship
plt.scatter(data['HouseArea'], data['HousePrice'], alpha=0.5)
plt.title('House Area vs. House Price')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.grid()
plt.show()

```


## Enhance Data Visualization by Changing Sample Size

Space Explorer, let's enhance our data visualization skills! Change the number of samples from 50 to 100 to see how the scatter plot changes with more data points.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set a seed for reproducibility
np.random.seed(10)

# Generate synthetic data
num_samples = 50
areas = np.random.uniform(500, 3500, num_samples)  # House area in square feet
base_price = 70000
price_per_sqft = 250
noise = np.random.normal(0, 25000, num_samples)

prices = base_price + areas * price_per_sqft + noise

# Create a DataFrame
data = pd.DataFrame({'Area': areas, 'Price': prices})

# Plot the data
plt.scatter(data['Area'], data['Price'], alpha=0.5)
plt.title('Scatter Plot: House Area vs. Price')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.grid()
plt.show()

```

Here is the updated code, with `num_samples` changed from 50 to 100. The scatter plot now contains more data points, so you can observe how the visualization changes with a larger sample size.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set a seed for reproducibility
np.random.seed(10)

# Generate synthetic data
num_samples = 100 # Changed from 50 to 100
areas = np.random.uniform(500, 3500, num_samples)  # House area in square feet
base_price = 70000
price_per_sqft = 250
noise = np.random.normal(0, 25000, num_samples)

prices = base_price + areas * price_per_sqft + noise

# Create a DataFrame
data = pd.DataFrame({'Area': areas, 'Price': prices})

# Plot the data
plt.scatter(data['Area'], data['Price'], alpha=0.5)
plt.title('Scatter Plot: House Area vs. Price')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.grid()
plt.show()


```
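
To compare the two sample sizes directly, both scatter plots can be drawn side by side. This is a minimal sketch under a few assumptions: the helper `make_data` is a hypothetical name, and it uses `np.random.RandomState` so that each panel gets its own seeded generator instead of relying on the global seed.

```python
import numpy as np
import matplotlib.pyplot as plt


def make_data(num_samples, seed=10):
    """Generate areas and prices with the same parameters as the exercise."""
    rng = np.random.RandomState(seed)  # generator with its own seeded state
    areas = rng.uniform(500, 3500, num_samples)
    noise = rng.normal(0, 25000, num_samples)
    prices = 70000 + areas * 250 + noise
    return areas, prices


sample_sizes = (50, 100)
fig, axes = plt.subplots(1, len(sample_sizes), figsize=(12, 5),
                         sharex=True, sharey=True)

# Draw one scatter plot per sample size on shared axes for a fair comparison
for ax, n in zip(axes, sample_sizes):
    areas, prices = make_data(n)
    ax.scatter(areas, prices, alpha=0.5)
    ax.set_title(f'{n} samples')
    ax.set_xlabel('Area (sq ft)')
    ax.grid(True)

axes[0].set_ylabel('Price ($)')
fig.suptitle('House Area vs. Price at two sample sizes')
fig.tight_layout()
# In a script, call plt.show() here to display the figure.
```

Sharing the x- and y-axes keeps both panels on the same scale, so any difference in how clearly the trend shows up comes from the number of points alone.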

## Complete the Data Generation and Visualization

Great work, Space Explorer!

Now, let's add some important pieces to make the code functional. Add the missing lines to complete the data generation and visualization.

Keep shining like a star!

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
num_samples = 50
area = np.random.uniform(500, 3500, num_samples)
base_price = 50000
price_per_sqft = 200
noise = np.random.normal(0, 25000, num_samples)
price = base_price + (area * price_per_sqft) + noise

# Create DataFrame
# TODO: Create a DataFrame named 'data' with columns 'Area (sq ft)' and 'Price ($)' 
# using the variables 'area' and 'price'.

# Plot the data to visualize the relationship
# TODO: Create a scatter plot of 'Area (sq ft)' vs. 'Price ($)'.
plt.title('House Area vs. Price')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.grid()
plt.show()

```

A completed version of the exercise might look like this:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
num_samples = 50
area = np.random.uniform(500, 3500, num_samples)
base_price = 50000
price_per_sqft = 200
noise = np.random.normal(0, 25000, num_samples)
price = base_price + (area * price_per_sqft) + noise

# Create a DataFrame named 'data' with columns 'Area (sq ft)' and 'Price ($)'
# using the variables 'area' and 'price'
data = pd.DataFrame({'Area (sq ft)': area, 'Price ($)': price})

# Plot the data to visualize the relationship
plt.figure(figsize=(10, 6))  # Optional: set figure size for better visualization
plt.scatter(data['Area (sq ft)'], data['Price ($)'], alpha=0.7)  # alpha keeps overlapping points visible
plt.title('House Area vs. Price')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.grid(True) # Ensure grid is visible
plt.show()

```