
In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Working with Dataset Files (CSV)

There are two main ways to load data into your notebook:

1. **Direct Upload**: Upload a CSV file directly to Google Colab
   - Good for small to medium files
   - You'll need to re-upload if you close and reopen Colab

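2. **GitHub URL**: Use a direct link to the raw file on GitHub
   - Works for any file that GitHub can host (GitHub rejects single files over 100 MB)
   - No need to upload files manually
   - The link keeps working across sessions as long as the file stays in the repository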
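
Both options end in the same `pd.read_csv` call, which accepts a local path, a URL, or a file-like object. The sketch below uses a small in-memory CSV so it runs without a network connection; the rows and the `sample` name are made up for illustration and are not taken from the happiness dataset.

```python
import io

import pandas as pd

# Made-up CSV text standing in for an uploaded file or a raw GitHub URL
csv_text = """Country,HappinessScore
A,7.5
B,7.4
C,6.9
"""

# read_csv treats the file-like object the same way it treats a path or a URL
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(list(sample.columns))
```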

For this tutorial, we'll be working with 'happiness_2017.csv'.

In [None]:
df=pd.read_csv("https://raw.githubusercontent.com/Reben80/Data110-32008--Sp25/refs/heads/main/dataset/happiness_2017.csv")
# or

# If the CSV file is already uploaded to the Google Colab directory (left side panel), read it from there instead.
# Remember: if you close Colab and come back another day, you need to upload the file again.
#df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/DATA110/happiness.csv") # Prof. Muhammad already cleaned the data, so 'happiness_2017.csv' is not the same as 'happiness.csv'


# Exploring Your Dataset

When working with a new dataset, there are several essential steps to understand your data:

1. **View Sample Data**: Use `df.head()` to see the first few rows
   - `df.head()` shows first 5 rows
   - `df.head(10)` shows first 10 rows
   - `df.tail()` shows last 5 rows

This gives you a quick preview of what your data looks like.
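
As a quick check of how the row counts behave, here is a minimal sketch on a made-up eight-row table (the `toy` DataFrame and its values are hypothetical, not part of the happiness data):

```python
import pandas as pd

# Made-up table with 8 rows, used only to illustrate head() and tail()
toy = pd.DataFrame({
    'Country': list('ABCDEFGH'),
    'Score': [7.5, 7.4, 7.3, 7.2, 7.1, 7.0, 6.9, 6.8],
})

first_five = toy.head()     # no argument: first 5 rows
first_three = toy.head(3)   # first 3 rows
last_two = toy.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```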

In [None]:
df.head()
#df.head(10)

In [None]:
df.tail()

# Understanding Your Data Structure

`df.info()` is a powerful command that tells you:
- How many rows and columns you have
- The name of each column
- The data type of each column (int64, float64, object for text, etc.)
- How many non-null values exist
- How much memory your data is using

This is crucial for identifying missing data and understanding your dataset's structure.
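
To make the missing-data check concrete, the sketch below builds a made-up table with one missing value and counts the gaps per column with `isna().sum()`. The `toy` table is hypothetical; on the real data you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing Score, for illustration only
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Score': [7.5, np.nan, 6.9, 6.1],
    'Rank': [1, 2, 3, 4],
})

toy.info()                             # prints row count, column types, non-null counts

missing_per_column = toy.isna().sum()  # number of missing values in each column
print(missing_per_column)
```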

In [None]:
df.info()

# Statistical Summary

For numerical data, `df.describe()` provides key statistics:
- count: number of non-null values
- mean: average value
- std: standard deviation
- min: minimum value
- 25%, 50%, 75%: quartile values
- max: maximum value

This helps you understand the distribution of your numerical data.
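
The rows of the summary can be read like any other DataFrame. A minimal sketch on four made-up numbers (the `toy` table is an assumption for illustration) shows that the `50%` row is the median:

```python
import pandas as pd

# Four made-up scores so the statistics are easy to check by hand
toy = pd.DataFrame({'Score': [2.0, 4.0, 6.0, 8.0]})

summary = toy.describe()
print(summary)

# Individual statistics are looked up by row label and column name
mean_score = summary.loc['mean', 'Score']
median_score = summary.loc['50%', 'Score']
print(mean_score, median_score, toy['Score'].median())
```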

In [None]:
df.describe()

# Working with Column Names

A helpful tip: Instead of typing column names manually (which can lead to errors), you can:
1. Use `print(df.columns)` to see all column names
2. Copy and paste the exact column names you need
3. This prevents typos that could cause your code to fail
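
Column names are case-sensitive, so even a small typo fails. The sketch below, on a made-up two-column table, checks whether a name exists before using it and looks for a case-insensitive match; the `wanted` value is a deliberate typo chosen for illustration.

```python
import pandas as pd

# Made-up table that borrows two column names from the happiness dataset
toy = pd.DataFrame({
    'HappinessScore': [7.5, 7.4],
    'Log GDP per capita': [11.0, 10.8],
})

wanted = 'Log GDP per Capita'   # typo: capital "C"

column_names = toy.columns.tolist()
print(column_names)
print(wanted in toy.columns)    # exact match is required

# Look for a column that matches when letter case is ignored
matches = [name for name in column_names if name.lower() == wanted.lower()]
print(matches)
```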

In [None]:
print(df.columns)

# Creating Scatter Plots

Now we'll learn how to create scatter plots using matplotlib. A scatter plot is perfect for showing relationships between two variables.

Basic syntax: `plt.scatter(x_data, y_data)`

We'll improve our plots step by step:
1. Start with a basic plot
2. Add proper sizing
3. Include labels and titles
4. Customize the appearance

In [None]:
plt.style.use('default')

In [None]:
plt.scatter(df['Rank'],df['HappinessScore'])

# Customizing Plot Size

The figure size determines how large your plot will appear:
- `plt.figure(figsize=(width, height))`
- Width and height are in inches
- Common sizes: (10,6), (16,10)
- Larger figures are better for presentations
- Smaller figures work well for documents
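
To confirm what a `figsize` setting produces, the sketch below creates two empty figures and reads their sizes back with `get_size_inches()`. The two sizes are arbitrary examples, not recommendations from this notebook.

```python
import matplotlib.pyplot as plt

# Arbitrary sizes chosen only for illustration
fig_small = plt.figure(figsize=(6, 4))
fig_large = plt.figure(figsize=(16, 10))

# get_size_inches() returns the (width, height) pair in inches
small_w, small_h = fig_small.get_size_inches()
large_w, large_h = fig_large.get_size_inches()
print(f"small: {small_w} x {small_h} in, large: {large_w} x {large_h} in")

# The width in pixels is the width in inches multiplied by the figure's dpi
print("large figure width in pixels:", large_w * fig_large.dpi)

plt.close('all')  # discard the empty figures
```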

In [None]:
plt.figure(figsize=(16,10))
plt.scatter(df['Rank'],df['HappinessScore'])

# Essential Plot Components

Every professional plot should include:
1. **X-axis label**: `plt.xlabel('Label Name')`
2. **Y-axis label**: `plt.ylabel('Label Name')`
3. **Title**: `plt.title('Your Title')`
4. **plt.show()**: Always end with this to display the plot cleanly

These elements help others understand your visualization immediately.

In [None]:
plt.figure(figsize=(16,10))
plt.scatter(df['Rank'],df['HappinessScore'])
plt.xlabel('Rank')
plt.ylabel('Happiness Score')
plt.title("Rank vs Happiness")
plt.show()

# info about scatter plot https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

# Advanced Scatter Plot Customization

Let's create a professional-looking scatter plot with all the important customization options:

**Key Parameters:**
- `color`: Choose point color ('blue', 'red', etc.)
- `marker`: Point shape ('o' for circle, 's' for square, '^' for triangle)
- `edgecolors`: Outline color of points
- `alpha`: Transparency (0 to 1)
- `s`: Point size

**Styling Elements:**
- `fontsize`: Control text size
- `fontweight`: Make text bold
- `grid`: Add background grid
- `xticks/yticks`: Customize axis numbers

Below is a complete example using these parameters:

In [None]:
# Explanation of Scatter Plot Parameters


# Set the figure size
plt.figure(figsize=(16,10))  # 16 inches wide and 10 inches tall for better readability

# Scatter plot with styling
plt.scatter(df['Rank'], df['HappinessScore'],
            color='blue',         # Sets marker color to blue
            marker='o',           # Uses circular markers
            edgecolors='black',    # Adds a black outline to markers
            alpha=0.75,           # Makes points slightly transparent (75% opacity)
            s=50)                # change marker size

# X and Y axis labels with styling
plt.xlabel('Rank', fontsize=14, fontweight='bold')  # Bold and larger font for readability
plt.ylabel('Happiness Score', fontsize=14, fontweight='bold')

# Title of the plot
plt.title("Rank vs Happiness Score", fontsize=18, fontweight='bold')  # Larger and bold title

# Grid for better readability
plt.grid(True, linestyle='--', alpha=0.5)  # Dashed grid lines with slight transparency

# Adjust tick label size
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Invert x-axis if lower rank means better happiness
plt.gca().invert_xaxis()

# Show the plot
plt.show()


# Styling Your Plots

Matplotlib offers pre-built styles to make your plots look professional:
1. Check available styles with `plt.style.available`
2. Apply a style with `plt.style.use('style_name')`
3. Popular styles include:
   - 'ggplot': Clean, professional look (from R)
   - 'seaborn-v0_8': Modern, attractive defaults (named 'seaborn' before Matplotlib 3.6)
   - 'classic': Traditional matplotlib style

Try different styles to find what works best for your presentation!
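
`plt.style.use` changes the style for every later plot in the session. If you only want to try a style on one plot, `plt.style.context` applies it temporarily. A minimal sketch with made-up data points:

```python
import matplotlib.pyplot as plt

# Made-up data, for illustration only
x_values = [1, 2, 3, 4, 5]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]

grid_before = plt.rcParams['axes.grid']

# The style applies only inside the with block
with plt.style.context('ggplot'):
    grid_inside = plt.rcParams['axes.grid']   # ggplot switches the grid on
    plt.figure(figsize=(8, 5))
    plt.scatter(x_values, y_values)
    plt.title('Temporary ggplot style')
    plt.show()

# Outside the block the previous settings are restored
grid_after = plt.rcParams['axes.grid']
print(grid_before, grid_inside, grid_after)
```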

In [None]:
plt.style.available

In [None]:
plt.style.use('ggplot')

In [None]:
plt.figure(figsize=(16,10))
plt.scatter(df['Rank'],df['HappinessScore'])
plt.xlabel('Rank')
plt.ylabel('Happiness Score')
plt.title("Rank vs Happiness")
plt.show()

A good place to learn more about styling your scatter plot is the official [Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) website. Check it out and try experimenting with some of the settings.

# Practice Assignments 📊

Let's practice creating scatter plots using different variables from our happiness dataset!

### Assignment 1: Basic Scatter Plot
Create a scatter plot showing the relationship between 'Log GDP per capita' and 'HappinessScore'.
- Use appropriate axis labels
- Add a title
- Set figure size to (12,8)

### Assignment 2: Styled Scatter Plot
Create a scatter plot comparing 'Social support' vs 'Healthy life expectancy at birth'.
- Use red markers with black edges
- Set marker size to 100
- Add a grid
- Make markers semi-transparent (alpha=0.6)

### Assignment 3: Advanced Visualization
Create a scatter plot showing 'Freedom to make life choices' vs 'Positive affect'.
- Use triangle markers ('^')
- Make the plot blue with yellow edges
- Add bold labels
- Include a grid with dashed lines

### Bonus Challenge 🌟
Create a scatter plot comparing any two variables of your choice, but:
- Use a style from `plt.style.available`
- Add custom font sizes for labels
- Include a brief interpretation of what the plot shows



In [None]:
plt.style.use('default')

In [None]:
# Your code should be here

# Assignment 1: Basic Scatter Plot
## Create a scatter plot showing the relationship between 'Log GDP per capita' and 'HappinessScore'.
####### #1 Use appropriate axis labels
####### #2 Add a title
####### #3 Set figure size to (12,8)

# Set the figure size
plt.figure(figsize=(12,8))      #3 Set figure size to (12,8)

# Scatter plot with styling
plt.scatter(df['Log GDP per capita'], df['HappinessScore'])

# X and Y axis labels with styling
plt.xlabel('Log GDP per capita', fontsize=14)  #1 Use appropriate axis labels
plt.ylabel('Happiness Score', fontsize=14)

# Title of the plot
plt.title("Log GDP per capita vs Happiness Score", fontsize=18, fontweight='bold')  #2 Add a title

# Show the plot
plt.show()

In [None]:
# Assignment 2: Styled Scatter Plot
## Create a scatter plot comparing 'Social support' vs 'Healthy life expectancy at birth'.
###### #4 Use red markers with black edges
###### #5 Set marker size to 100
###### #6 Add a grid
###### #7 Make markers semi-transparent (alpha=0.6)

# Set the figure size
plt.figure(figsize=(12,8))      #3 Set figure size to (12,8)

# Scatter plot with styling
plt.scatter(df['Social support'], df['Healthy life expectancy at birth'],
            color='red',   #4 Use red markers with black edges
            edgecolors='black',
            alpha=0.6,     #7 Make markers semi-transparent (alpha=0.6)
            s=100          #5 Set marker size to 100
            )

# X and Y axis labels with styling
plt.xlabel('Social support', fontsize=14)  #1 Use appropriate axis labels
plt.ylabel('Healthy life expectancy at birth', fontsize=14)

# Title of the plot
plt.title("Social support vs Healthy life expectancy at birth", fontsize=18, fontweight='bold')  #2 Add a title

# Grid for better readability
plt.grid(True, alpha=0.5)  #6 Add a grid

# Show the plot
plt.show()



In [None]:
# Assignment 3: Advanced Visualization
## Create a scatter plot showing 'Freedom to make life choices' vs 'Positive affect'.
###### #8 Use triangle markers ('^')
###### #9 Make the plot blue with yellow edges
###### #10 Add bold labels
###### #11 Include a grid with dashed lines

# Set the figure size
plt.figure(figsize=(12,8))      #3 Set figure size to (12,8)

# Scatter plot with styling
plt.scatter(df['Freedom to make life choices'], df['Positive affect'],
            marker='^',          #8 Use triangle markers ('^')
            color='blue',   #9 Make the plot blue with yellow edges
            edgecolors='yellow',
            alpha=0.6,     # Make markers semi-transparent (alpha=0.6), as in Assignment 2
            s=100          # Set marker size to 100, as in Assignment 2
            )

# X and Y axis labels with styling
plt.xlabel('Freedom to make life choices', fontsize=14, fontweight='bold')  #1 Use appropriate axis labels, #10 Add bold labels
plt.ylabel('Positive affect', fontsize=14, fontweight='bold')

# Title of the plot
plt.title("Freedom to make life choices vs Positive affect", fontsize=18, fontweight='bold')  #2 Add a title

# Grid for better readability
plt.grid(True, linestyle='--', alpha=0.5)  #6 Add a grid, #11 Include a grid with dashed lines

# Invert x-axis if lower rank means better happiness
#plt.gca().invert_xaxis()

# Show the plot
plt.show()

In [None]:
# Bonus Challenge 🌟
## Create a scatter plot comparing any two variables of your choice, but:
###### #12 Use a style from plt.style.available
###### #13 Add custom font sizes for labels
###### #14 Include a brief interpretation of what the plot shows

plt.style.use('seaborn-v0_8-poster') #12 Use a style from plt.style.available

# Set the figure size
plt.figure(figsize=(16,10))  # I followed "16 inches wide and 10 inches tall for better readability"

# Scatter plot with styling
plt.scatter(df['Confidence in national government'], df['HappinessScore'], # Choose 'Confidence in national government' and 'HappinessScore'
            color='magenta',         # Set marker color to magenta
            marker='h',           # Use hexagon markers
            edgecolors='black',    # Add a black outline to the markers
            alpha=0.5,           # Make points slightly transparent (50% opacity)
            s=100)                # Set marker size to 100

# X and Y axis labels with styling
plt.xlabel('Confidence in national government', fontsize=14, fontweight='bold')  # Bold and larger font for readability
plt.ylabel('Happiness Score', fontsize=14, fontweight='bold')

# Title of the plot
plt.title("Confidence in national government vs Happiness Score", fontsize=18, fontweight='bold')  # Larger and bold title

# Grid for better readability
plt.grid(True, linestyle='--', alpha=0.5)  # Dashed grid lines with slight transparency

# Adjust tick label size
plt.xticks(fontsize=12) #13 Add custom font sizes for labels
plt.yticks(fontsize=12)

# Invert x-axis if lower rank means better happiness # I think it's better not to invert
# plt.gca().invert_xaxis()

# Set the x and y axis ranges
plt.xlim(0, 1.0)
plt.ylim(0, 10)

# Show the plot
plt.show()


#### A brief interpretation of what the plot shows

# Exploring the Relationship Between Confidence in National Government and Happiness Score: A Scatter Plot Analysis

A scatter plot of 'Confidence in national government' against 'HappinessScore' shows no clear correlation between the two variables. Although higher trust in government is commonly believed to contribute to greater well-being [1][2][3], the data in this analysis show no consistent relationship between confidence in the national government and the happiness score across the sample. A deeper analysis, for example one that accounts for country-specific factors, is needed to better understand the complex relationship between these variables.

> [1] Helliwell, J. F., Layard, R., & Sachs, J. (2020). World Happiness Report 2020.

> [2] Rothstein, B., & Stolle, D. (2003). Social capital, trust, and government performance: A literature review. Scandinavian Political Studies, 26(3), 191-218.

> [3] Delhey, J., & Newton, K. (2005). Predicting cross-national levels of social trust: Global pattern or Nordic exceptionalism? European Sociological Review, 21(4), 311-327.

--------------------------------

###  Anscombe's Quartet Dataset

This dataset is known as **Anscombe's Quartet**, created by statistician Francis Anscombe to illustrate the importance of visualizing data. Despite having nearly identical statistical properties (e.g., mean, variance, correlation, and linear regression), each dataset tells a very different story when graphed.

- **x**: The independent variable, shared by the first three datasets.
- **y1, y2, y3**: Three different dependent variables associated with the same `x` values.
- **x4, y4**: A special case where most of the `x` values are identical, with one outlier.
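
The claim that the four datasets share nearly identical statistics can be checked directly. The sketch below uses the quartet values from this notebook and computes the mean of x, the mean and sample variance of y, and the correlation for each pair; the `summary` dictionary is a helper name introduced here for illustration.

```python
import numpy as np

# Anscombe's Quartet values, as used in this notebook
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = {'y1': (x, y1), 'y2': (x, y2), 'y3': (x, y3), 'y4': (x4, y4)}

summary = {}
for name, (xs, ys) in datasets.items():
    xs = np.array(xs, dtype=float)
    ys = np.array(ys, dtype=float)
    r = np.corrcoef(xs, ys)[0, 1]          # Pearson correlation coefficient
    summary[name] = (xs.mean(), ys.mean(), ys.var(ddof=1), r)
    print(f"{name}: mean x = {xs.mean():.2f}, mean y = {ys.mean():.2f}, "
          f"var y = {ys.var(ddof=1):.2f}, r = {r:.3f}")
```

The printed rows differ only in the last digits, which is why the plots are needed to tell the datasets apart.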

#### Anscombe's Quartet:

In [None]:
plt.style.use('default')

In [None]:
# Anscombe's Quartet:
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

In [None]:
plt.scatter(x, y1)

Let's also fit a linear regression to this dataset. Do not worry about the code for now; just focus on the output. We will come back to this code later.
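
For readers who want a peek at what the fit computes: for a straight line, least squares gives slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope · x̄. The sketch below is an illustration rather than the notebook's method. It evaluates these formulas on `x` and `y1` and compares them with `np.polyfit`; the `*_manual` and `*_fit` names are introduced here.

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Least-squares slope and intercept from the textbook formulas
x_dev = x - x.mean()
y_dev = y1 - y1.mean()
slope_manual = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept_manual = y1.mean() - slope_manual * x.mean()

# The same fit through np.polyfit with degree 1
slope_fit, intercept_fit = np.polyfit(x, y1, 1)

print(f"manual:  slope = {slope_manual:.4f}, intercept = {intercept_manual:.4f}")
print(f"polyfit: slope = {slope_fit:.4f}, intercept = {intercept_fit:.4f}")
```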

In [None]:
# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m, b = np.polyfit(x, y1, 1)

# Create the regression line
regression_line = m * np.array(x) + b

# Plot the data points and regression line
plt.scatter(x, y1)
plt.plot(x, regression_line,color='blue')
plt.xlabel('x')
plt.ylabel('y1')



### Assignment 4: Anscombe's Quartet and Linear Regression

Perform the same linear regression process for the following datasets: y2, y3, and y4. Modify the code to calculate and plot the regression lines for each of these datasets. Use distinct colors for each plot and appropriately label the axes (y2, y3, etc.). Discuss any differences you observe when comparing the results across all datasets.

In [None]:
# For m1, b1
# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m1, b1 = np.polyfit(x, y1, 1)

# Create the regression line
regression_line1 = m1 * np.array(x) + b1

# Plot the data points and regression line
plt.scatter(x, y1, color='blue', label = 'D1')
plt.plot(x, regression_line1,color='blue', label = "Line1")
plt.xlabel('x', fontsize = 16)
plt.ylabel('y1', fontsize = 16)
plt.grid(True, linestyle='--', alpha=0.5)

# Change y axis range
plt.xlim(2, 20)
plt.ylim(0, 15)

# Add title and legend
plt.legend()
plt.title("Anscombe's Quartet with Linear Regression Line: \n y1 = m1x + b1", fontsize = 20)

# Print the slope and intercept for each regression line
text_x = 12
text_y = 4
results_text = f"Linear Regression Results:\n"
results_text += f"y1: Slope = {m1:.6f}, Intercept = {b1:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=12, color='black')



In [None]:
# For m2, b2
# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m2, b2 = np.polyfit(x, y2, 1)

# Create the regression line
regression_line2 = m2 * np.array(x) + b2

# Plot the data points and regression line
plt.scatter(x, y2, color='green', label = 'D2')
plt.plot(x, regression_line2,color='green', label = 'Line2')
plt.xlabel('x', fontsize = 16)
plt.ylabel('y2', fontsize = 16)
plt.grid(True, linestyle='--', alpha=0.5)

# Change y axis range
plt.xlim(2, 20)
plt.ylim(0, 15)

# Add title and legend
plt.legend()
plt.title("Anscombe's Quartet with Linear Regression Line: \n y2 = m2x + b2", fontsize = 20)

# Print the slope and intercept for each regression line
text_x = 12
text_y = 4
results_text = "Linear Regression Results:\n"
results_text += f"y2: Slope = {m2:.6f}, Intercept = {b2:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=12, color='black')


In [None]:
# For m3, b3
# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m3, b3 = np.polyfit(x, y3, 1)

# Create the regression line
regression_line3 = m3 * np.array(x) + b3

# Plot the data points and regression line
plt.scatter(x, y3, color='red', label = 'D3')
plt.plot(x, regression_line3,color='red', label = 'Line3')
plt.xlabel('x', fontsize = 16)
plt.ylabel('y3', fontsize = 16)
plt.grid(True, linestyle='--', alpha=0.5)

# Change y axis range
plt.xlim(2, 20)
plt.ylim(0, 15)

# Add title and legend
plt.legend()
plt.title("Anscombe's Quartet with Linear Regression Line: \n y3 = m3x + b3", fontsize = 20)

# Print the slope and intercept for each regression line
text_x = 12
text_y = 4
results_text = "Linear Regression Results:\n"
results_text += f"y3: Slope = {m3:.6f}, Intercept = {b3:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=12, color='black')



In [None]:


# For m4, b4
# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m4, b4 = np.polyfit(x4, y4, 1)

# Create the regression line
regression_line4 = m4 * np.array(x4) + b4

# Plot the data points and regression line
plt.scatter(x4, y4, color='purple', label = 'D4')
plt.plot(x4, regression_line4,color='purple', label = 'Line4')
plt.xlabel('x4', fontsize = 16)
plt.ylabel('y4', fontsize = 16)
plt.grid(True, linestyle='--', alpha=0.5)

# Change y axis range
plt.xlim(2, 20)
plt.ylim(0, 15)

# Add title and legend
plt.legend()
plt.title("Anscombe's Quartet with Linear Regression Line: \n y4 = m4x4 + b4", fontsize = 20)

# Print the slope and intercept for each regression line
text_x = 12
text_y = 4
results_text = "Linear Regression Results:\n"
results_text += f"y4: Slope = {m4:.6f}, Intercept = {b4:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=12, color='black')


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

fig.suptitle("Anscombe's Quartet with Linear Regression Line", fontsize=16)

# For m1, b1
m1, b1 = np.polyfit(x, y1, 1)
regression_line1 = m1 * np.array(x) + b1
plt.subplot(2, 2, 1)
plt.scatter(x, y1, color='blue', label = 'D1', s = 20)
plt.plot(x, regression_line1,color='blue', label = "Line1", linewidth = 1)
plt.xlabel('x', fontsize = 10)
plt.ylabel('y1', fontsize = 10)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xlim(2, 20)
plt.ylim(0, 15)
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.legend()
plt.title("y1 = m1x + b1", fontsize = 10)
text_x = 15
text_y = 5
results_text = f"Linear Regression Results:\n"
results_text += f"y1: Slope = {m1:.6f}, Intercept = {b1:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=6, color='black')

# For m2, b2
m2, b2 = np.polyfit(x, y2, 1)
regression_line2 = m2 * np.array(x) + b2
plt.subplot(2, 2, 2)
plt.scatter(x, y2, color='green', label = 'D2', s = 20)
plt.plot(x, regression_line2,color='green', label = 'Line2', linewidth = 1)
plt.xlabel('x', fontsize = 10)
plt.ylabel('y2', fontsize = 10)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xlim(2, 20)
plt.ylim(0, 15)
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.legend()
plt.title("y2 = m2x + b2", fontsize = 10)
text_x = 15
text_y = 5
results_text = "Linear Regression Results:\n"
results_text += f"y2: Slope = {m2:.6f}, Intercept = {b2:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=6, color='black')

# For m3, b3
m3, b3 = np.polyfit(x, y3, 1)
regression_line3 = m3 * np.array(x) + b3
plt.subplot(2, 2, 3)
plt.scatter(x, y3, color='red', label = 'D3', s = 20)
plt.plot(x, regression_line3,color='red', label = 'Line3', linewidth = 1)
plt.xlabel('x', fontsize = 10)
plt.ylabel('y3', fontsize = 10)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xlim(2, 20)
plt.ylim(0, 15)
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.legend()
plt.title("y3 = m3x + b3", fontsize = 10)
text_x = 15
text_y = 5
results_text = "Linear Regression Results:\n"
results_text += f"y3: Slope = {m3:.6f}, Intercept = {b3:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=6, color='black')

# For m4, b4
m4, b4 = np.polyfit(x4, y4, 1)
regression_line4 = m4 * np.array(x4) + b4
plt.subplot(2, 2, 4)
plt.scatter(x4, y4, color='purple', label = 'D4', s = 20)
plt.plot(x4, regression_line4,color='purple', label = 'Line4', linewidth = 1)
plt.xlabel('x4', fontsize = 10)
plt.ylabel('y4', fontsize = 10)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xlim(2, 20)
plt.ylim(0, 15)
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.legend()
plt.title("y4 = m4x4 + b4", fontsize = 10)
text_x = 15
text_y = 5
results_text = "Linear Regression Results:\n"
results_text += f"y4: Slope = {m4:.6f}, Intercept = {b4:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=6, color='black')

plt.subplots_adjust(hspace=0.3, wspace=0.3, top=0.9)

y1 and y3: Both of these datasets follow a linear trend. The regression line approximates the y1 points well. In y3, all but one point lie almost exactly on a straight line, and the single outlier (x = 13, y3 = 12.74) pulls the fitted line away from the remaining points.

y2: Unlike y1 and y3, the data in y2 suggests a more complex relationship. A quadratic model might be more appropriate for this dataset as it captures the curvature in the data more effectively.
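
One way to put a number on the curvature in y2 is to compare the coefficient of determination (R²) of a straight-line fit with that of a quadratic fit. This is a sketch under the assumption that R² is an acceptable yardstick here; the `r_squared` helper is introduced for illustration.

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])


def r_squared(y, y_pred):
    """Share of the variance in y explained by the predictions."""
    ss_res = np.sum((y - y_pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return 1 - ss_res / ss_tot


linear_coeffs = np.polyfit(x, y2, 1)          # degree 1: straight line
quadratic_coeffs = np.polyfit(x, y2, 2)       # degree 2: parabola

# np.polyval evaluates the fitted polynomial at the given x values
r2_linear = r_squared(y2, np.polyval(linear_coeffs, x))
r2_quadratic = r_squared(y2, np.polyval(quadratic_coeffs, x))

print(f"R^2 linear:    {r2_linear:.4f}")
print(f"R^2 quadratic: {r2_quadratic:.4f}")
```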

y4: Ten of the eleven points share x4 = 8, and the single outlier at x4 = 19 determines the slope of the regression line on its own. The fit is therefore misleading. It might be advisable to exclude this point or to consider an alternative approach, such as robust regression, to handle the data more effectively.
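
The influence of the point at x4 = 19 can be checked numerically. In the sketch below (variable names such as `x4_rest` are introduced for illustration), the fitted slope equals the slope of the line joining the mean of the ten points at x4 = 8 to the outlier, and without the outlier the x values have no spread at all, so the slope formula would divide by zero.

```python
import numpy as np

x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# Fit on all eleven points
slope_all, intercept_all = np.polyfit(x4, y4, 1)

# Split off the single point at x4 = 19
keep = x4 != 19
x4_rest, y4_rest = x4[keep], y4[keep]

# Slope of the line from (8, mean of the ten y values) to the outlier (19, 12.50)
slope_two_points = (y4[~keep][0] - y4_rest.mean()) / (19 - 8)

# Spread of the remaining x values: the denominator of the least-squares slope
x_spread = np.sum((x4_rest - x4_rest.mean()) ** 2)

print(f"slope from polyfit:      {slope_all:.6f}")
print(f"slope through two means: {slope_two_points:.6f}")
print(f"x spread without outlier: {x_spread}")
```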

The results of fitting a quadratic regression model to y2 and of removing the outlier from y4 are shown below.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

fig.suptitle("Anscombe's Quartet with Linear Regression Line", fontsize=16)

# For m1, b1
m1, b1 = np.polyfit(x, y1, 1)
regression_line1 = m1 * np.array(x) + b1
plt.subplot(2, 2, 1)
plt.scatter(x, y1, color='blue', label = 'D1', s = 20)
plt.plot(x, regression_line1,color='blue', label = "Line1", linewidth = 1)
plt.xlabel('x', fontsize = 10)
plt.ylabel('y1', fontsize = 10)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xlim(2, 20)
plt.ylim(0, 15)
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.legend()
plt.title("y1 = m1x + b1", fontsize = 10)
text_x = 15
text_y = 5
results_text = f"Linear Regression Results:\n"
results_text += f"y1: Slope = {m1:.6f}, Intercept = {b1:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=6, color='black')

# For m2, b2
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
m2, b2, c2 = np.polyfit(x, y2, 2)
plt.subplot(2, 2, 2)
x_fine = np.linspace(x.min(), x.max(), 100)
regression_line2 = m2 * x_fine ** 2 + b2 * x_fine + c2
plt.scatter(x, y2, color='green', label = 'D2', s = 20)
plt.plot(x_fine, regression_line2,color='green', label = 'Line2', linewidth = 1)
plt.xlabel('x', fontsize = 10)
plt.ylabel('y2', fontsize = 10)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xlim(2, 20)
plt.ylim(0, 15)
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.legend()
plt.title("y2 = m2x² + b2x + c2", fontsize=10)
text_x = 15
text_y = 5
results_text = f"Quadratic Regression Results:\n"
results_text += f"y2: m2 = {m2:.6f}, b2 = {b2:.6f}, c2 = {c2:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=6, color='black')


# For m3, b3
m3, b3 = np.polyfit(x, y3, 1)
regression_line3 = m3 * np.array(x) + b3
plt.subplot(2, 2, 3)
plt.scatter(x, y3, color='red', label = 'D3', s = 20)
plt.plot(x, regression_line3,color='red', label = 'Line3', linewidth = 1)
plt.xlabel('x', fontsize = 10)
plt.ylabel('y3', fontsize = 10)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xlim(2, 20)
plt.ylim(0, 15)
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.legend()
plt.title("y3 = m3x + b3", fontsize = 10)
text_x = 15
text_y = 5
results_text = "Linear Regression Results:\n"
results_text += f"y3: Slope = {m3:.6f}, Intercept = {b3:.6f}\n"
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=6, color='black')


# For y4 with the outlier (x4 = 19, y4 = 12.50) removed
# New variable names keep the original x4 and y4 intact for the cells below
x4_rest = np.array([8, 8, 8, 8, 8, 8, 8, 8, 8, 8])
y4_rest = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89])
# np.polyfit is not called here: every remaining x4 value is 8, so the slope is undefined
plt.subplot(2, 2, 4)
plt.scatter(x4_rest, y4_rest, color='purple', label = 'D4', s = 20)
plt.xlabel('x4', fontsize = 10)
plt.ylabel('y4', fontsize = 10)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xlim(2, 20)
plt.ylim(0, 15)
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=10)
plt.legend()
plt.title("y4 with the outlier removed", fontsize = 10)
text_x = 12
text_y = 4
results_text = (
    "When all the values of x4 are the same,\n"
    "the least-squares slope is undefined (division by zero),\n"
    "so no regression line can be fitted.\n"
    "Therefore, it cannot be plotted."
)
plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=6, color='black')

plt.subplots_adjust(hspace=0.3, wspace=0.3)

plt.show()

The following cell shows why I used four separate graphs instead of a single graph.

In [None]:
# challenge for "def" function
# The reason why separate graphs are better

def plt_regression(x, y, color, label, title, text_x, text_y):
    m, b = np.polyfit(x, y, 1)
    regression_line = m * np.array(x) + b
    plt.scatter(x, y, color=color, label=label, s=20)
    plt.plot(x, regression_line, color=color, label=f"{label} Line", linewidth=1)
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.tick_params(axis='x', labelsize=10)
    plt.tick_params(axis='y', labelsize=10)
    plt.legend()
    results_text = f"Linear Regression Results:\n"
    results_text += f"Slope = {m:.6f}, Intercept = {b:.6f}\n"
    plt.text(text_x, text_y, results_text, ha='center', va='top', fontsize=6, color=color)

plt_regression(x, y1, 'blue', 'y1', 'y1 = m1x + b1', 16, 6)
plt_regression(x, y2, 'green', 'y2', 'y2 = m2x + b2', 16, 5)
plt_regression(x, y3, 'red', 'y3', 'y3 = m3x + b3', 16, 4)
plt_regression(x4, y4, 'purple', 'y4', 'y4 = m4x + b4', 16, 3)

plt.title("Anscombe's Quartet with Linear Regression Line", fontsize = 20)
plt.xlabel('x', fontsize = 10)
plt.ylabel('y', fontsize = 10)
plt.xlim(2, 20)
plt.ylim(0, 15)

plt.show()


In [None]:
# challenge for "def" function
# The axes argument is named ax so it does not shadow the plt module
def plt_regression(x, y, ax, color, label, title, text_x, text_y):
    m, b = np.polyfit(x, y, 1)
    regression_line = m * np.array(x) + b
    ax.scatter(x, y, color=color, label=label, s=20)
    ax.plot(x, regression_line, color=color, label=f"{label} Line", linewidth=1)
    ax.set_xlabel('x', fontsize=10)
    ax.set_ylabel(f'{label}', fontsize=10)
    ax.grid(True, linestyle='--', alpha=0.5)
    ax.set_xlim(2, 20)
    ax.set_ylim(0, 15)
    ax.tick_params(axis='x', labelsize=10)
    ax.tick_params(axis='y', labelsize=10)
    ax.legend()
    ax.set_title(title, fontsize=10)
    results_text = "Linear Regression Results:\n"
    results_text += f"Slope = {m:.6f}, Intercept = {b:.6f}\n"
    ax.text(text_x, text_y, results_text, ha='center', va='top', fontsize=6, color='black')

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle("Anscombe's Quartet with Linear Regression Line", fontsize=16)

plt_regression(x, y1, axes[0, 0], 'blue', 'y1', 'y1 = m1x + b1', 15, 5)
plt_regression(x, y2, axes[0, 1], 'green', 'y2', 'y2 = m2x + b2', 15, 5)
plt_regression(x, y3, axes[1, 0], 'red', 'y3', 'y3 = m3x + b3', 15, 5)
plt_regression(x4, y4, axes[1, 1], 'purple', 'y4', 'y4 = m4x + b4', 15, 5)

plt.subplots_adjust(hspace=0.3, wspace=0.3, top=0.9)

plt.show()
