### Step 1: Import Necessary Packages

Before we begin, we need to import the necessary Python libraries for plotting and performing the regression. We'll use:

- `matplotlib.pyplot` for creating the graph
- `numpy` for numerical operations


In [None]:
import matplotlib.pyplot as plt
import numpy as np

### Step 2: Anscombe's Quartet Dataset

This dataset is known as **Anscombe's Quartet**, created by statistician Francis Anscombe to illustrate the importance of visualizing data. Despite having nearly identical statistical properties (e.g., mean, variance, correlation, and linear regression), each dataset tells a very different story when graphed.

- **x**: The independent variable, common across three datasets.
- **y1, y2, y3**: Three different dependent variables associated with the same `x` values.
- **x4, y4**: A special case where most of the `x` values are identical, with one outlier.

#### Anscombe's Quartet:


In [None]:
# Anscombe's Quartet:
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
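
The claim that the four datasets share nearly identical summary statistics can be checked directly. The following is a minimal sketch rather than part of the original exercise: the names `pairs` and `summary` are introduced here for illustration, the data is repeated so the cell runs on its own, and the use of the sample variance (`ddof=1`) is an assumption about which variance is meant.

```python
import numpy as np

# Anscombe's Quartet (repeated so this cell is self-contained)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# The four pairings that make up the quartet (illustrative name)
pairs = {"x, y1": (x, y1), "x, y2": (x, y2), "x, y3": (x, y3), "x4, y4": (x4, y4)}

summary = {}
for name, (xs, ys) in pairs.items():
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    summary[name] = {
        "mean_x": xs.mean(),
        "var_x": xs.var(ddof=1),   # sample variance (assumption)
        "mean_y": ys.mean(),
        "var_y": ys.var(ddof=1),
        "corr": np.corrcoef(xs, ys)[0, 1],  # Pearson correlation
    }
    s = summary[name]
    print(f"{name}: mean_x={s['mean_x']:.2f}, var_x={s['var_x']:.2f}, "
          f"mean_y={s['mean_y']:.2f}, var_y={s['var_y']:.2f}, corr={s['corr']:.3f}")
```

The printed rows should agree to about two decimal places, which is what makes the plots in the following cells informative.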


In [None]:
plt.scatter(x, y1)


In [None]:
# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m, b = np.polyfit(x, y1, 1)

# Create the regression line
regression_line = m * np.array(x) + b

# Plot the data points and regression line
plt.scatter(x, y1)
plt.plot(x, regression_line, color='red')
plt.xlabel('x')
plt.ylabel('y1')
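
For a degree-1 fit, `np.polyfit` returns the least-squares slope and intercept. As a rough cross-check, the same two numbers can be computed from the standard closed-form formulas. This is a sketch for illustration only; `m_manual` and `b_manual` are names introduced here, and the data is repeated so the cell runs on its own.

```python
import numpy as np

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

xs = np.asarray(x, dtype=float)
ys = np.asarray(y1, dtype=float)

# Closed-form least-squares estimates for the line y = m*x + b:
#   m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)**2)
#   b = mean_y - m * mean_x
x_dev = xs - xs.mean()
m_manual = np.sum(x_dev * (ys - ys.mean())) / np.sum(x_dev ** 2)
b_manual = ys.mean() - m_manual * xs.mean()

# np.polyfit returns coefficients from highest degree to lowest: slope, then intercept
m, b = np.polyfit(xs, ys, 1)

print(f"manual:  m = {m_manual:.4f}, b = {b_manual:.4f}")
print(f"polyfit: m = {m:.4f}, b = {b:.4f}")
```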


### Your Task:

Perform the same linear regression process for the following datasets: y2, y3, and y4.
Modify the code to calculate and plot the regression lines for each of these datasets.
Use distinct colors for each plot and appropriately label the axes (y2, y3, etc.).
Discuss any differences you observe when comparing the results across all datasets.


In [None]:
# Code for x and y2

# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m, b = np.polyfit(x, y2, 1)

# Create the regression line
regression_line = m * np.array(x) + b

# Plot the data points and regression line
plt.scatter(x, y2, color="magenta")
plt.plot(x, regression_line, color='grey')
plt.xlabel('x')
plt.ylabel('y2')



In [None]:
# Code for x and y3

# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m, b = np.polyfit(x, y3, 1)

# Create the regression line
regression_line = m * np.array(x) + b

# Plot the data points and regression line
plt.scatter(x, y3, color="blue")
plt.plot(x, regression_line, color='purple')
plt.xlabel('x')
plt.ylabel('y3')

In [None]:
# Code for x and y4 (note: y4 is paired with x4 in the quartet, so this is not one of its four datasets)

# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m, b = np.polyfit(x, y4, 1)

# Create the regression line
regression_line = m * np.array(x) + b

# Plot the data points and regression line
plt.scatter(x, y4, color="teal")
plt.plot(x, regression_line, color='lavender')
plt.xlabel('x')
plt.ylabel('y4')

In [None]:
# Code for x4 and y4

# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m, b = np.polyfit(x4, y4, 1)

# Create the regression line (evaluated at x4, the same x-values used for the fit and the plot)
regression_line = m * np.array(x4) + b

# Plot the data points and regression line
plt.scatter(x4, y4, color="red")
plt.plot(x4, regression_line, color='black')
plt.xlabel('x4')
plt.ylabel('y4')
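
The cells above repeat the same fitting steps for each dataset. One possible way to compare the results side by side is to loop over the quartet's four pairings (`x` with `y1`, `y2`, `y3`, and `x4` with `y4`) and collect the fitted coefficients. This is a sketch, not the required solution; the names `pairs` and `fits` are introduced here for illustration, and the data is repeated so the cell runs on its own.

```python
import numpy as np

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Each y-series with the x-series it belongs to (illustrative name)
pairs = {"x, y1": (x, y1), "x, y2": (x, y2), "x, y3": (x, y3), "x4, y4": (x4, y4)}

fits = {}
for name, (xs, ys) in pairs.items():
    # Degree-1 polyfit returns (slope, intercept)
    m, b = np.polyfit(xs, ys, 1)
    fits[name] = (m, b)
    print(f"{name}: slope = {m:.3f}, intercept = {b:.3f}")
```

All four fitted lines should come out nearly the same, so any differences between the datasets have to be read from the scatterplots rather than from the coefficients.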

### Reflection Question:

After visualizing the linear regression for all four datasets in Anscombe's Quartet, reflect on the following:

#### What is your reflection?
- How do the datasets visually differ despite having similar summary statistics?
- How did the outlier in the `x4, y4` dataset affect the regression line compared to the other datasets?
- Why is it important to visualize data in addition to calculating summary statistics?

Please provide your insights and discuss the importance of data visualization in understanding relationships between variables.


### My Reflection:

The datasets differ in the shape of the relationship between the variables. The scatterplots for x and y1, x and y2, and x and y3 all show a positive relationship: the points and the regression line rise from left to right. The x and y4 scatterplot shows a negative relationship, with the regression line sloping downward, although y4 belongs with x4 rather than x, so this pairing is not one of the quartet's four datasets. While the summary statistics are similar, the plots are not: in some, the points lie close to the regression line, indicating a stronger linear relationship, whereas in others they are more spread out or follow a curve, so a straight line describes them less well. For the x4 and y4 data, the outlier controls the regression line. Ten of the eleven points share x4 = 8, so the slope is set by the single point at x4 = 19; the fitted line is still straight, but it says little about the rest of the data. It is important to visualize data in addition to calculating summary statistics because a plot shows whether a relationship is strong or weak, positive or negative, linear or curved, and whether an outlier is driving the fit, none of which can be determined from the summary statistics alone.
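
The effect of the outlier in the `x4, y4` dataset can be made concrete. Because `x4` takes only two distinct values, the least-squares line passes through the mean of the ten points at `x4 = 8` and through the single point at `x4 = 19`. The sketch below checks this numerically; the names `is_outlier`, `cluster_mean_y`, and `two_point_slope` are introduced here for illustration.

```python
import numpy as np

x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# Least-squares fit on all eleven points
m, b = np.polyfit(x4, y4, 1)

# Split the data into the single outlier and the cluster at x4 = 8
is_outlier = x4 == 19
x_out, y_out = x4[is_outlier][0], y4[is_outlier][0]
x_cluster = x4[~is_outlier][0]
cluster_mean_y = y4[~is_outlier].mean()

# Slope of the line joining the cluster mean to the outlier
two_point_slope = (y_out - cluster_mean_y) / (x_out - x_cluster)

print(f"polyfit slope:   {m:.4f}")
print(f"two-point slope: {two_point_slope:.4f}")
print(f"fitted y at x4 = 19: {m * x_out + b:.2f} (outlier y = {y_out:.2f})")
```

If the two slopes agree, moving the outlier up or down would tilt the whole line, which is the sense in which one point determines the fit.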