# 2. Linear Regression

This notebook uses the same dataset to fit and evaluate a Linear Regression model that predicts `purchased` from `age` and `salary`.

In [10]:
# Load and Preview Data
import pandas as pd
df = pd.read_csv("ml_customer_data.csv")
df.head()

Unnamed: 0,age,salary,purchased
0,56,19000,0
1,46,85588,1
2,32,53304,1
3,60,84449,1
4,25,97986,0


In [12]:
# Prepare Features and Target
from sklearn.model_selection import train_test_split
X = df[['age', 'salary']]
y = df['purchased']

# Split data: 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

In [14]:
# Train the Linear Regression model
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

In [16]:
# Make Predictions
y_pred = model.predict(X_test)
print("First 5 predicted values (continuous):")
print(y_pred[:5])

First 5 predicted values (continuous):
[ 0.65648904 -0.03486507  0.70891562  0.28060955  0.08192613]


In [18]:
# Evaluate Model
from sklearn.metrics import mean_squared_error

# Mean Squared Error (MSE) is the average squared difference between predictions and true values
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# R-squared value shows how well the model explains the variance in the data
r_squared = model.score(X_test, y_test)
print("R-squared:", r_squared)

Mean Squared Error: 0.1655894289448338
R-squared: 0.30176871228284075


In [22]:
# Model Coefficients 
print("Coefficients (age, salary):", model.coef_)
print("Intercept (bias):", model.intercept_)

Coefficients (age, salary): [1.18712185e-02 1.00938693e-05]
Intercept (bias): -0.6384132822577367


## Model Summary

This section summarizes the performance and interpretation of the Linear Regression model trained to predict whether a customer will purchase based on their age and salary.

---

### First 5 Predictions
[0.656, -0.035, 0.709, 0.281, 0.082]

These are continuous values predicted by the model. Since the actual target is binary (0 or 1), they can be loosely read as a score for the likelihood of purchasing: with a 0.5 threshold, predictions above 0.5 indicate likely purchasers and predictions below 0.5 indicate unlikely ones. Linear regression does not constrain its output to the range [0, 1], so values such as -0.035 can occur. They are valid regression outputs, but they cannot be interpreted as probabilities.
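As a minimal sketch of the 0.5 threshold described above, the five printed predictions can be converted to class labels. The array is copied from the prediction cell's output; the names `y_pred_first5`, `labels`, and `clipped` are introduced here for illustration only.

```python
import numpy as np

# First five predictions, copied from the output of the prediction cell
y_pred_first5 = np.array([0.65648904, -0.03486507, 0.70891562, 0.28060955, 0.08192613])

# Apply the 0.5 threshold: 1 = likely purchaser, 0 = unlikely
labels = (y_pred_first5 > 0.5).astype(int)
print(labels)  # [1 0 1 0 0]

# Clipping forces scores into [0, 1], but it does not turn them into
# calibrated probabilities; it only removes out-of-range values.
clipped = np.clip(y_pred_first5, 0.0, 1.0)
print(clipped)
```

Thresholding gives usable class labels, but the scores themselves are still not probabilities.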

---

**Mean Squared Error (MSE):** `0.1656`  
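This measures the average squared difference between predicted values and true values. Lower is better; `0` means perfect prediction. Because the target only takes the values 0 and 1, always predicting 0.5 would already give an MSE of 0.25, so an MSE of 0.1656 indicates a moderate error rather than a small one.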

---

**R-squared (R²):** `0.3018`  
This tells us the proportion of variance in the target (`purchased`) that the model can explain using the input features.  
- `1.0` = perfect prediction  
- `0.0` = explains nothing  
- `< 0` = worse than just guessing the average  

Here, about **30.2%** of the variability in purchase decisions is explained, which is a modest fit for a model with only two features. Roughly 70% of the variability remains unexplained by age and salary.
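R² and MSE are linked: `model.score` computes R² as one minus the ratio of the model's squared error to the squared error of always predicting the test-set mean. A short sketch using only the two numbers printed by the evaluation cell recovers that baseline error; `baseline_mse` is a name introduced here for illustration.

```python
# Values copied from the evaluation cell's output
mse = 0.1655894289448338
r_squared = 0.30176871228284075

# R^2 = 1 - MSE / Var(y_test), so Var(y_test) = MSE / (1 - R^2).
# Var(y_test) is also the MSE of always predicting the test-set mean.
baseline_mse = mse / (1 - r_squared)

# For a 0/1 target this variance is p * (1 - p), which is at most 0.25.
print(f"Baseline MSE (predict the mean): {baseline_mse:.4f}")
print(f"Model MSE:                       {mse:.4f}")
print(f"Relative reduction in MSE:       {1 - mse / baseline_mse:.1%}")
```

Read this way, the model's MSE is about 30% lower than that of the "always predict the average" baseline, which is the same statement R² makes.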

---

**Model Coefficients:**  
- `age`: +0.01187  
- `salary`: +0.00001009  

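These weights tell us how much the prediction changes for a one-unit increase in each feature, with the other feature held fixed. A positive weight means increasing the feature increases the predicted value. Both weights are positive, but their sizes are not directly comparable because the features are on very different scales: age is measured in years, while salary runs into the tens of thousands. In the previewed rows, age (25 to 60) contributes between about 0.30 and 0.71 to the prediction, and salary (19,000 to 97,986) contributes between about 0.19 and 0.99, so salary is not a weaker predictor than age.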
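To put the two weights on a common footing, the sketch below multiplies each coefficient by a plausible change in its feature. The coefficients are copied from the notebook output; the step sizes (10 years, 10,000 salary units) are assumptions chosen only for illustration.

```python
# Coefficients copied from the "Model Coefficients" cell output
coef_age = 1.18712185e-02
coef_salary = 1.00938693e-05

# Hypothetical step sizes, chosen for illustration only
age_step = 10          # years
salary_step = 10_000   # salary units

# Change in the predicted value caused by each step, other feature held fixed
effect_age = coef_age * age_step
effect_salary = coef_salary * salary_step

print(f"+{age_step} years of age  -> prediction changes by {effect_age:+.3f}")
print(f"+{salary_step} in salary   -> prediction changes by {effect_salary:+.3f}")
```

Another common way to make coefficients comparable is to standardize the features before fitting, so that each weight refers to a change of one standard deviation.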

---

**Intercept (Bias Term):** `-0.638`  
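This is the model's output when both age and salary are zero. No real customer has those values, so the intercept has no practical interpretation on its own. Its role is to shift the fitted plane so that predictions for realistic ages and salaries land in the right range.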
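The fitted model is just the intercept plus a weighted sum of the features, so a prediction can be reproduced by hand. The sketch below uses the printed intercept and coefficients; `predict_score` is a hypothetical helper, not part of scikit-learn, and the example customer is row 1 of the `df.head()` preview.

```python
# Parameters copied from the "Model Coefficients" cell output
intercept = -0.6384132822577367
coef_age = 1.18712185e-02
coef_salary = 1.00938693e-05

def predict_score(age, salary):
    """Linear model output: intercept + coef_age * age + coef_salary * salary."""
    return intercept + coef_age * age + coef_salary * salary

# With both features at zero, the output is exactly the intercept
print(predict_score(0, 0))

# Row 1 of the preview: age 46, salary 85588 (actual purchased = 1)
print(round(predict_score(46, 85588), 4))
```

Small differences from `model.predict` are expected because the printed coefficients are rounded.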

---

### ✅ Final Conclusion:
Linear Regression is not ideal for binary classification problems, because its outputs are unbounded and are not probabilities. It is still useful for understanding the **linear relationships** between features and the target, and this model gives interpretable coefficients with a moderate fit (R² ≈ 0.30). For binary prediction, **Logistic Regression** is the more suitable choice.
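A minimal sketch of the Logistic Regression alternative follows. It runs on synthetic data generated in the block, standing in for `ml_customer_data.csv`, so the numbers it prints say nothing about the real dataset. The label-generating rule and the `StandardScaler` step are assumptions added here: scaling is included because age and salary differ in magnitude by several orders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): NOT the notebook's real dataset
rng = np.random.default_rng(42)
n = 500
age = rng.integers(18, 65, size=n)
salary = rng.integers(15_000, 100_000, size=n)

# Hypothetical rule: purchase odds rise with age and salary
logit = 0.05 * (age - 40) + 0.00004 * (salary - 55_000)
purchased = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([age, salary])
X_train, X_test, y_train, y_test = train_test_split(
    X, purchased, test_size=0.3, random_state=42)

# Scale the features, then fit a logistic regression classifier
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

# predict_proba returns probabilities in [0, 1]; column 1 is P(purchased = 1)
proba = clf.predict_proba(X_test)[:, 1]
labels = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)  # for classifiers, score() is accuracy

print("First 5 predicted probabilities:", proba[:5])
print("First 5 predicted labels:", labels[:5])
print("Accuracy on synthetic test set:", accuracy)
```

Unlike the linear model, the predicted values here are bounded between 0 and 1, and `score` reports classification accuracy rather than R².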