# Python Machine Learning: Simple Linear Regression

**Course:** Python ML for Middle School Students  
**Topic:** Simple Linear Regression with scikit-learn  
**Website:** www.learnandhelp.com  
**Instructor:** Siva.Jasthi@metrostate.edu

---

## What is Linear Regression?

Linear Regression is a way to find the relationship between two things:
- **Independent Variable (X):** The thing we can control (e.g., marketing budget)
- **Dependent Variable (Y):** The thing we want to predict (e.g., sales)

Think of it like this: If you spend more money on advertising, do you get more sales? Linear regression helps us find that pattern!

### The Linear Regression Equation:
```
y = mx + b
```
Where:
- **y** = predicted value (sales)
- **m** = slope (how much sales change per dollar of marketing)
- **x** = input value (marketing budget)
- **b** = intercept (sales when marketing = 0)
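
To see the equation in action, here is a tiny sketch in plain Python. The function name `predict_sales` and the slope and intercept values are made up for illustration; they are not learned from any data.

```python
def predict_sales(marketing, slope, intercept):
    """Apply y = m*x + b for one marketing budget."""
    return slope * marketing + intercept

# Hypothetical line: m = 2.0 and b = 5.0 (made-up values)
example_prediction = predict_sales(marketing=10, slope=2.0, intercept=5.0)
print(example_prediction)  # 2.0 * 10 + 5.0 = 25.0
```

Notice that when `marketing` is 0, the function returns the intercept, which matches the meaning of **b** above.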


---
## Section 1: Understanding the Basics
### Quick Example: How Linear Regression Works

In [None]:
# Import necessary libraries
import numpy as np
from sklearn.linear_model import LinearRegression

# Create simple data: marketing budget (X) and resulting sales (Y)
X = np.array([6, 16, 26, 36, 46, 56]).reshape((-1, 1))  # reshape makes it 2D
y = np.array([4, 23, 10, 12, 22, 35])

print("Marketing Budget (X):")
print(X.flatten())  # flatten makes it easier to read
print("\nSales (y):")
print(y)

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Get the results
print("\n" + "="*50)
print("MODEL RESULTS")
print("="*50)

# R² Score: How well the model fits (1.0 = perfect, 0 = no better than always guessing the average)
r_squared = model.score(X, y)
print(f'\nR² Score (Coefficient of Determination): {r_squared:.4f}')
print('  → This tells us how well our line fits the data')
print('  → Closer to 1.0 = better fit')

# Intercept: Where the line crosses the y-axis
print(f'\nIntercept (b): {model.intercept_:.2f}')
print('  → This is the sales when marketing = 0')

# Slope: How steep the line is
print(f'\nSlope (m): {model.coef_[0]:.2f}')
print('  → For every $1 increase in marketing, sales go up by this amount')

# Make predictions
y_predicted = model.predict(X)
print('\nPredicted Sales:')
print(np.round(y_predicted, 2))

print("\n" + "="*50)
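
Curious how `LinearRegression` finds the slope and intercept? For a single input variable, the least-squares line can be worked out by hand. The sketch below uses plain Python on the same six points. The helper name `fit_line` is made up for this example, and scikit-learn's internal solver may work differently even though it should arrive at the same line.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The best-fit line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

budgets = [6, 16, 26, 36, 46, 56]
sales = [4, 23, 10, 12, 22, 35]
slope, intercept = fit_line(budgets, sales)
print(f"Slope (m): {slope:.2f}")          # 0.44
print(f"Intercept (b): {intercept:.2f}")  # 4.03
```

These values should match the slope and intercept printed by the model in the cell above.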

---
## Section 2: Working with Real Data
### Loading and Exploring the Dataset

Now let's work with a real dataset that shows the relationship between marketing spending and sales!

In [None]:
# Import all the libraries we'll need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# This makes our plots appear in the notebook
%matplotlib inline

print("✓ All libraries imported successfully!")

In [None]:
# Load the dataset from CSV file
df = pd.read_csv('marketing_sales.csv')

# You can find the dataset here: https://github.com/sjasthi/Python-DS-Data-Science/blob/main/datasets/marketing_sales.csv
# Alternatively, you can load from Excel:
# df = pd.read_excel('marketing_sales.xlsx')

print("📊 Marketing and Sales Dataset Loaded!")
print("="*60)

print("\n🔍 First 5 rows:")
display(df.head())

print("\n🔍 Last 5 rows:")
display(df.tail())

print("\n🎲 Random 5 rows:")
display(df.sample(5))

### Data Quality Check
Before we build our model, we need to make sure our data is clean!

In [None]:
# Check for missing values
print("🔎 Checking for Missing Values:")
print(df.isna().sum())

if df.isna().sum().sum() == 0:
    print("\n✓ Great! No missing values found.")
else:
    print("\n⚠ Warning: Missing values detected!")

In [None]:
# Get information about the dataset
print("📋 Dataset Information:")
print("="*60)
print(f'Number of rows and columns: {df.shape}')
print(f'Total data points: {df.shape[0] * df.shape[1]}')

print("\n📊 Data Types:")
df.info()

print("\n📈 Statistical Summary:")
print("="*60)
display(df.describe())

---
## Section 3: Visualizing the Data
### Scatter Plot: Marketing vs Sales

Let's create a scatter plot to see if there's a relationship between marketing and sales!

In [None]:
# Create a scatter plot
# (df.plot makes its own figure, so the size is passed in with figsize)
df.plot(x='marketing', y='sales', kind='scatter', color='blue', s=50, alpha=0.6, figsize=(10, 6))
plt.title('Marketing Budget vs Sales', fontsize=16, fontweight='bold')
plt.xlabel('Marketing Budget ($1000s)', fontsize=12)
plt.ylabel('Sales ($1000s)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

print("💡 Can you see a pattern? As marketing increases, what happens to sales?")

### Correlation Analysis
Correlation tells us how strongly two variables are related:

- **+1:** Perfect positive relationship (both increase together)
- **0:** No relationship
- **-1:** Perfect negative relationship (one increases, other decreases)
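
To make these numbers less mysterious, here is a minimal sketch of the Pearson correlation formula in plain Python. The function name `pearson_correlation` and the small lists are made up for illustration; pandas computes the same kind of value for us in the next cell.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much x and y move together
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each one moves on its own
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # goes up in step with hours
falling = [10, 8, 6, 4, 2]   # goes down in step with hours

print(round(pearson_correlation(hours, rising), 3))   # 1.0
print(round(pearson_correlation(hours, falling), 3))  # -1.0
```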

In [None]:
# Calculate correlation
correlation = df.corr()

print("🔢 Correlation Matrix:")
print(correlation)

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='YlGnBu', fmt='.3f', 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap', fontsize=14, fontweight='bold', pad=20)
plt.show()

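marketing_sales_corr = correlation.loc['marketing', 'sales']
print(f"\n💡 Marketing and Sales correlation: {marketing_sales_corr:.3f}")
if marketing_sales_corr > 0:
    print("   → This shows a positive relationship!")
else:
    print("   → This does not show a positive relationship.")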

---
## Section 4: Building the Linear Regression Model
### Step 1: Prepare the Data

In [None]:
# Separate features (X) and target (y)
# Feature: What we use to make predictions (marketing budget)
# Target: What we want to predict (sales)

X = df[['marketing']]  # Double brackets keep it as DataFrame
y = df['sales']        # Single bracket makes it a Series

print("📦 Data Preparation Complete!")
print("="*60)
print(f"Feature (X) shape: {X.shape}")
print(f"Feature (X) type: {type(X)}")
print(f"\nTarget (y) shape: {y.shape}")
print(f"Target (y) type: {type(y)}")

print("\n✓ We have 1 feature and 1 target variable")
print("✓ This is perfect for Simple Linear Regression!")

### Step 2: Split the Data

We split our data into two parts:
- **Training Set (75%):** Used to teach the model
- **Testing Set (25%):** Used to check if the model learned correctly
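
The idea behind the split can be sketched in plain Python: shuffle the rows, then cut off a quarter for testing. The helper `simple_train_test_split` is a made-up name for this illustration, not scikit-learn's code, and the real `train_test_split` may shuffle and round differently.

```python
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows, then cut it into (train, test) lists."""
    shuffled = list(rows)
    # A fixed seed gives the same shuffle every time, like random_state
    random.Random(seed).shuffle(shuffled)
    test_count = round(len(shuffled) * test_fraction)
    return shuffled[test_count:], shuffled[:test_count]

rows = list(range(1, 21))  # 20 pretend data rows, numbered 1 to 20
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 15 5
```

Every row lands in exactly one of the two sets, so the model is never tested on a row it was trained on.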

In [None]:
# Split the data: 75% for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.25,    # 25% for testing
    random_state=42    # Makes results reproducible
)

print("✂️ Data Split Complete!")
print("="*60)
print(f"Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"Testing set size: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.0f}%)")

print("\n📊 Data Shapes:")
print(f"X_train: {X_train.shape} | y_train: {y_train.shape}")
print(f"X_test:  {X_test.shape} | y_test:  {y_test.shape}")

### Step 3: Create and Train the Model

In [None]:
# Create the Linear Regression model
model = LinearRegression()

print("🤖 Linear Regression Model Created!")
print(f"Model type: {type(model)}")
print(f"Model: {model}")

In [None]:
# Train the model using our training data
model.fit(X_train, y_train)

print("🎓 Model Training Complete!")
print("\n✓ The model has learned the pattern from the training data")
print("✓ Now it's ready to make predictions!")

### Step 4: Examine the Model Results

In [None]:
# Get the model parameters
intercept = model.intercept_
slope = model.coef_[0]
r_squared = model.score(X_train, y_train)

print("📊 MODEL PARAMETERS")
print("="*60)

print(f"\n📐 Intercept (b): {intercept:.4f}")
print("   → This is where the line crosses the y-axis")
print("   → Predicted sales when marketing = $0")

print(f"\n📈 Slope (m): {slope:.4f}")
print("   → This shows how much sales change when marketing changes")
print(f"   → For every $1000 increase in marketing, sales go up by ${slope*1000:.2f}")

print(f"\n🎯 R² Score (training data): {r_squared:.4f}")
print(f"   → The model explains {r_squared*100:.2f}% of the variation in training sales")
if r_squared > 0.7:
    print("   → Excellent fit! 🌟")
elif r_squared > 0.5:
    print("   → Good fit! 👍")
else:
    print("   → Fair fit - could be better 📈")

print(f"\n📝 Equation: Sales = {slope:.4f} × Marketing + {intercept:.4f}")

---
## Section 5: Making Predictions
### Predict on Test Data

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

print("🔮 Predictions Made!")
print("="*60)
print(f"Number of predictions: {len(y_pred)}")
print(f"Predictions type: {type(y_pred)}")

# Show some example predictions
print("\n📋 Sample Predictions:")
comparison_df = pd.DataFrame({
    'Marketing': X_test.values.flatten()[:10],
    'Actual Sales': y_test.values[:10],
    'Predicted Sales': y_pred[:10],
    'Difference': np.abs(y_test.values[:10] - y_pred[:10])
})
display(comparison_df.round(2))

---
## Section 6: Visualizing the Results
### Training Data with Regression Line

In [None]:
# Plot training data with regression line
plt.figure(figsize=(10, 6))

# Scatter plot of actual training data
plt.scatter(X_train, y_train, color='blue', alpha=0.6, s=50, label='Actual Data')

# Plot the regression line
plt.plot(X_train, model.predict(X_train), color='red', linewidth=2, label='Best Fit Line')

plt.xlabel('Marketing Budget ($1000s)', fontsize=12)
plt.ylabel('Sales ($1000s)', fontsize=12)
plt.title('Training Data: Marketing vs Sales with Regression Line', 
          fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("💡 The red line shows our model's predictions!")
print("   The closer the blue points are to the line, the better our model!")

### Testing Data with Regression Line

In [None]:
# Plot testing data with regression line
plt.figure(figsize=(10, 6))

# Scatter plot of actual test data
plt.scatter(X_test, y_test, color='blue', alpha=0.6, s=50, label='Actual Data')

# Plot the regression line
plt.plot(X_test, model.predict(X_test), color='red', linewidth=2, label='Predictions')

plt.xlabel('Marketing Budget ($1000s)', fontsize=12)
plt.ylabel('Sales ($1000s)', fontsize=12)
plt.title('Testing Data: Marketing vs Sales with Predictions', 
          fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("💡 This shows how well our model works on NEW data it hasn't seen before!")

### Actual vs Predicted Comparison

In [None]:
# Create index for plotting
index = range(1, len(y_test) + 1)

# Create the plot
plt.figure(figsize=(12, 6))

# Plot actual values
plt.plot(index, y_test.values, color='blue', linewidth=2, 
         linestyle='-', marker='o', markersize=4, label='Actual Sales')

# Plot predicted values
plt.plot(index, y_pred, color='red', linewidth=2, 
         linestyle='-', marker='s', markersize=4, label='Predicted Sales')

plt.xlabel('Test Sample Index', fontsize=12)
plt.ylabel('Sales ($1000s)', fontsize=12)
plt.title('Actual vs Predicted Sales Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("💡 When the red and blue lines are close together, our predictions are accurate!")

---
## Section 7: Making New Predictions
### Try Different Marketing Budgets!

In [None]:
# Hide only scikit-learn's feature-name warning (shown when predict() gets a plain list)
import warnings
warnings.filterwarnings('ignore', message='X does not have valid feature names')

print("🎯 Making Predictions for Different Marketing Budgets")
print("="*60)

# Test different marketing budgets
test_budgets = [50, 100, 150, 200, 250]

print("\n📊 Prediction Results:")
print("-" * 60)
print(f"{'Marketing Budget':<20} {'Predicted Sales':<20} {'Expected Return'}")
print("-" * 60)

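for budget in test_budgets:
    # Pass a DataFrame so the column name matches the one used in training
    predicted_sales = model.predict(pd.DataFrame({'marketing': [budget]}))[0]
    # Simplified ROI: treats marketing as the only cost
    roi = (predicted_sales - budget) / budget * 100  # Return on Investment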
    
    print(f"${budget:>5,.0f}            {predicted_sales:>15.2f}         {roi:>8.1f}%")

print("-" * 60)
print("\n💡 You can try your own marketing budget values below!")

In [None]:
# Interactive prediction cell - Try your own values!
print("🎮 Try Your Own Prediction!")
print("="*60)

# Change this value to whatever you want!
my_marketing_budget = 175  # <-- Change this number!

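# Make the prediction (a DataFrame keeps the column name the model was trained with)
my_predicted_sales = model.predict(pd.DataFrame({'marketing': [my_marketing_budget]}))[0]

# Calculate some interesting metrics
# (simplified: sales minus marketing, ignoring every other cost)
profit = my_predicted_sales - my_marketing_budget
roi = (profit / my_marketing_budget) * 100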

print(f"\n💰 If you spend ${my_marketing_budget:,.2f} (thousands) on marketing:")
print(f"   📈 Predicted Sales: ${my_predicted_sales:,.2f} (thousands)")
print(f"   💵 Expected Profit: ${profit:,.2f} (thousands)")
print(f"   📊 Return on Investment: {roi:.1f}%")

if roi > 50:
    print("\n🌟 Excellent ROI! This looks like a great investment!")
elif roi > 20:
    print("\n👍 Good ROI! This could be a solid investment.")
else:
    print("\n⚠️  Low ROI. You might want to reconsider this budget.")
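
The ROI arithmetic used above can be checked with round numbers. This is a minimal sketch with a made-up function name, `return_on_investment`, and made-up values. Like the cells above, it assumes marketing is the only cost, which is a simplification.

```python
def return_on_investment(predicted_sales, marketing_budget):
    """ROI in percent, treating marketing as the only cost (a simplification)."""
    profit = predicted_sales - marketing_budget
    return profit / marketing_budget * 100

# Spend 100 and predict 150 in sales: profit is 50, which is 50% of the budget
print(return_on_investment(predicted_sales=150, marketing_budget=100))  # 50.0
```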

---
## Section 8: Model Evaluation
### How Good is Our Model?

Let's calculate some metrics to understand how well our model performs!

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("📊 MODEL EVALUATION METRICS")
print("="*60)

print(f"\n✓ R² Score: {r2:.4f}")
print(f"  → Explains {r2*100:.2f}% of the variance in sales")

print(f"\n✓ Mean Absolute Error (MAE): ${mae:.2f}k")
print(f"  → On average, predictions are off by ${mae:.2f}k")

print(f"\n✓ Root Mean Squared Error (RMSE): ${rmse:.2f}k")
print(f"  → Typical prediction error is ${rmse:.2f}k")

print("\n" + "="*60)
print("💡 Lower MAE and RMSE = Better predictions!")
print("💡 Higher R² = Better model fit!")
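
To see what these metrics measure, the sketch below computes MAE, RMSE, and R² by hand in plain Python. The function name `evaluate` and the four actual/predicted values are made up for illustration; the scikit-learn functions above should give the same results on the same inputs.

```python
import math

def evaluate(actual, predicted):
    """Return (MAE, RMSE, R²) for two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n              # average size of the misses
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)  # big misses count extra
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                   # error left over
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # error of guessing the average
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
mae, rmse, r2 = evaluate(actual, predicted)
print(f"MAE:  {mae:.2f}")   # 2.50
print(f"RMSE: {rmse:.2f}")  # 2.55
print(f"R²:   {r2:.3f}")    # 0.948
```

R² compares the model's squared errors with those of always guessing the average, so it can even drop below 0 when the model does worse than that guess.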

---
## 🎓 Summary and Key Takeaways

### What We Learned:

1. **Linear Regression** finds the straight line that best describes the relationship between two variables
2. **Training Data** teaches the model patterns
3. **Testing Data** checks how well the model works on data it has not seen before
4. **R² Score** tells us how well the line fits the data (closer to 1 is better)
5. **Predictions** help us make data-driven decisions

### Real-World Applications:

- 📈 Predicting sales based on advertising
- 🏠 Estimating house prices based on size
- 🌡️ Forecasting temperature based on historical data
- 📚 Predicting test scores based on study hours

### Next Steps:

1. Try different datasets
2. Experiment with multiple variables (Multiple Linear Regression)
3. Learn about other machine learning algorithms
4. Build your own prediction projects!

---
**Great job! You've successfully built and evaluated a Linear Regression model! 🎉**