
# Homework Assignment: Linear Regression vs SVM Regression

## Instructions
In this assignment, you will analyze a dataset containing information about medical insurance costs (`insurance.csv`). Your goal is to build models that predict medical costs using Linear Regression and Support Vector Machine (SVM) Regression, and to compare their performance.

### Objectives:
1. Explore and preprocess the dataset.
2. Build and evaluate Linear Regression and SVM Regression models.
3. Compare the results and answer interpretive questions.



## Part 1: Import Libraries and Load Dataset


In [1]:

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error

# Load the dataset
df = pd.read_csv("insurance.csv")

# Display the first few rows
print(df.head())


   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520




## Part 2: Data Exploration and Preprocessing


In [2]:

# Basic dataset information
print(df.info())

# Check for missing values
print(df.isnull().sum())

# Encode categorical variables (e.g., 'sex', 'smoker', 'region')
encoder = LabelEncoder()
df['sex'] = encoder.fit_transform(df['sex'])
df['smoker'] = encoder.fit_transform(df['smoker'])
df['region'] = encoder.fit_transform(df['region'])

# Define features (X) and target variable (y)
X = df.drop(columns='charges')
y = df['charges']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

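# Standardize all features, including the encoded categoricals
# (the scaler is fit on the training set only, then applied to the test set)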
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
None
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
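
`LabelEncoder` maps `region` to the integers 0 to 3, which implies an ordering that the regions do not actually have. A minimal sketch of one alternative, one-hot encoding with `pd.get_dummies`, is shown below. The small `toy` frame is an assumption for illustration: it copies the categorical columns of the first four rows printed in Part 1 and is not the full dataset.

```python
import pandas as pd

# Toy frame built from the first four rows of df.head() (illustration only)
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# One indicator column per category; drop_first removes one level per
# variable so the indicators are not perfectly collinear
encoded = pd.get_dummies(
    toy, columns=["sex", "smoker", "region"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

For the binary columns (`sex`, `smoker`) the result matches label encoding; only `region` changes, from one integer column to several indicator columns.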




## Part 3: Linear Regression


In [5]:

# Build and train the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Predict and evaluate
y_pred_lr = linear_model.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
mape_lr = mean_absolute_percentage_error(y_test, y_pred_lr)

# Print evaluation metrics
print(f"Linear Regression - MSE: {mse_lr:.2f}, R²: {r2_lr:.2f}, MAPE: {mape_lr:.2%}")

# Print feature names and their coefficients from linear regression model
feature_names = df.drop(columns='charges').columns
coefficients = pd.DataFrame({'Feature': feature_names, 'Coefficient': linear_model.coef_})
print("\nLinear Regression Coefficients:")
print(coefficients.sort_values(by='Coefficient', ascending=False))

Linear Regression - MSE: 33635210.43, R²: 0.78, MAPE: 47.09%

Linear Regression Coefficients:
    Feature  Coefficient
4    smoker  9557.143383
0       age  3616.108652
2       bmi  2028.308579
3  children   516.662566
1       sex    -9.392954
5    region  -302.387980
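
The MSE above is in squared dollars, which is hard to read directly. As a small worked step, taking its square root gives the RMSE in the same units as `charges`. The value of `mse_lr` below is copied from the printed output rather than recomputed.

```python
import math

# MSE reported for Linear Regression in the output above
mse_lr = 33635210.43

# RMSE is in the same units as the target (charges)
rmse_lr = math.sqrt(mse_lr)
print(f"Linear Regression - RMSE: {rmse_lr:,.2f}")
```

An RMSE of roughly 5,800 means the typical prediction error is on the order of a few thousand dollars.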



## Part 4: SVM Regression



In [6]:

# Build and train an SVM Regression model for each kernel
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

for kernel in kernels:
    svr_model = SVR(kernel=kernel, C=100, gamma=0.1, epsilon=0.1)
    svr_model.fit(X_train, y_train)

    # Predict and evaluate
    y_pred_svr = svr_model.predict(X_test)
    mse_svr = mean_squared_error(y_test, y_pred_svr)
    r2_svr = r2_score(y_test, y_pred_svr)
    mape_svr = mean_absolute_percentage_error(y_test, y_pred_svr)

    # Print evaluation metrics
    print(f"SVM Regression with {kernel} - MSE: {mse_svr:.2f}, R²: {r2_svr:.2f}, MAPE: {mape_svr:.2%}")



SVM Regression with linear - MSE: 54186302.08, R²: 0.65, MAPE: 18.72%
SVM Regression with poly - MSE: 133829406.27, R²: 0.14, MAPE: 92.82%
SVM Regression with rbf - MSE: 98347386.69, R²: 0.37, MAPE: 31.70%
SVM Regression with sigmoid - MSE: 79169155.21, R²: 0.49, MAPE: 20.71%
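
One possible reason the SVR scores trail Linear Regression on MSE and R² is that `epsilon` and `C` act in the units of the target, and `charges` is in the thousands while the features are standardized. The sketch below shows one way to scale the target as well, using `TransformedTargetRegressor`. It runs on synthetic stand-in data (an assumption, not `insurance.csv`), and the values of `C` and `epsilon` are illustrative rather than tuned.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical): three standardized features and
# a target in the thousands, similar in scale to `charges`
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = (13000 + 6000 * X_demo[:, 0] + 3000 * X_demo[:, 1]
          + rng.normal(scale=500, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Standardize y before fitting the SVR; predictions are mapped back
# to the original units automatically
scaled_svr = TransformedTargetRegressor(
    regressor=SVR(kernel="rbf", C=10, epsilon=0.1),
    transformer=StandardScaler(),
)
scaled_svr.fit(X_tr, y_tr)

y_pred_demo = scaled_svr.predict(X_te)
r2_demo = r2_score(y_te, y_pred_demo)
print(f"SVR with scaled target - R²: {r2_demo:.2f}")
```

Whether this helps on the insurance data would need to be checked by swapping in `X_train`, `y_train`, `X_test`, and `y_test` from Part 2.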



## Part 5: Analysis Questions
1. **Feature Importance**:
   - Based on the Linear Regression model, which features are most significant in predicting medical costs? (Hint: Look at the model coefficients.)
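     - Being a smoker is far and away the biggest positive indicator of medical costs (coefficient of about 9557). Age (about 3616) and BMI (about 2028) also have positive coefficients, but to a lesser extent, so costs rise with age and BMI, just not as sharply. Because the features were standardized, each coefficient is the change in predicted charges per one standard deviation of that feature, which is what makes the magnitudes comparable.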
2. **Model Comparison**:
   - Compare the performance of Linear Regression and SVM Regression models based on MSE, R², and MAPE. Which model performed better overall?
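     - It's not 100% clear-cut, but overall Linear Regression seems to perform better, since it has the lowest MSE (33635210.43) and the highest R² (0.78). However, most of the SVM Regression models have a lower MAPE (18.72% for the linear kernel and 31.70% for RBF, versus 47.09% for Linear Regression), indicating that their errors are smaller relative to the actual charges on average. MAPE measures the size of the relative error, not how often errors occur.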
3. **Experimentation**:
   - Try different kernels (e.g., `linear`, `poly`) for the SVM model. How do the results change?
     - Poly was a real stinker: an R² of 0.14 and a MAPE of 92.82%, meaning its predictions were off by about 93% of the actual charge on average. Among the SVM kernels, linear did the best, with the lowest MSE and MAPE and the highest R² (0.65), suggesting that the relationship between the features and charges is largely linear. Sigmoid did pretty well (R² of 0.49, MAPE of 20.71%), but by MSE and R², plain Linear Regression is still the best choice for this data.
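
The answers above note that MSE and MAPE can rank models differently. A minimal worked example, using made-up numbers (an assumption for illustration, not values from `insurance.csv`), shows how this happens: MSE is dominated by large absolute errors on expensive cases, while MAPE weights every case by its relative error.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.25 means 25%)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Hypothetical charges: two cheap cases and one expensive case
y_true = np.array([1000.0, 2000.0, 40000.0])

# Model A: large relative errors on cheap cases, exact on the expensive one
pred_a = np.array([1500.0, 3000.0, 40000.0])
# Model B: exact on cheap cases, misses the expensive one by 10,000
pred_b = np.array([1000.0, 2000.0, 30000.0])

mse_a, mape_a = mse(y_true, pred_a), mape(y_true, pred_a)
mse_b, mape_b = mse(y_true, pred_b), mape(y_true, pred_b)

print(f"Model A - MSE: {mse_a:.2f}, MAPE: {mape_a:.2%}")
print(f"Model B - MSE: {mse_b:.2f}, MAPE: {mape_b:.2%}")
```

Model B has the lower MAPE but a much higher MSE, the same pattern as the linear-kernel SVR compared with Linear Regression.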

## Submission
- Submit your completed Jupyter notebook with answers and observations.
- Include your answers to the analysis questions in Part 5 as comments in the notebook.
