Abstract:

In this study, I compare four regression models (linear regression, support vector regression, a decision tree, and a random forest) on a small dataset from Kaggle titled "Multiple Linear Regression Dataset." The task is to predict income from two input features, age and experience. Each model is trained on the same train/test split and scored with R-squared and Mean Absolute Error (MAE), so that their strengths and limitations on this dataset can be compared directly. The results are intended to help data scientists and practitioners choose a regression technique for similar small, tabular problems.

https://www.kaggle.com/datasets/hussainnasirkhan/multiple-linear-regression-dataset?resource=download

In [None]:
import zipfile

# Unzip the data file into the current working directory
with zipfile.ZipFile("/content/archive (5).zip", "r") as zip_ref:
    zip_ref.extractall()

In [None]:
import pandas as pd

# Read in the file
df = pd.read_csv("/content/multiple_linear_regression_dataset.csv")
df = df.reset_index(drop=True)
df

   age  experience  income
0   25           1   30450
1   30           3   35670
2   47           2   31580
3   32           5   40130
4   43          10   47830
5   51           7   41630
6   28           5   41340
7   33           4   37650
8   37           5   40250
9   39           8   45150


In [None]:
# Separate the features (age, experience) from the target (income)
X = df.iloc[:, 0:2].values
y = df.iloc[:, -1].values
X, y

(array([[25,  1],
        [30,  3],
        [47,  2],
        [32,  5],
        [43, 10],
        [51,  7],
        [28,  5],
        [33,  4],
        [37,  5],
        [39,  8],
        [29,  1],
        [47,  9],
        [54,  5],
        [51,  4],
        [44, 12],
        [41,  6],
        [58, 17],
        [23,  1],
        [44,  9],
        [37, 10]]),
 array([30450, 35670, 31580, 40130, 47830, 41630, 41340, 37650, 40250,
        45150, 27840, 46110, 36720, 34800, 51300, 38900, 63600, 30870,
        44190, 48700]))
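
The positional slices above depend on the column order staying fixed. As a minimal sketch, assuming the columns are named `age`, `experience`, and `income` as in the table shown earlier, the same separation can be written by column name. The small `df_demo` frame is a hypothetical stand-in built from the first five rows, so the snippet runs on its own.

```python
import pandas as pd

# First five rows of the dataset, copied from the table above for illustration
df_demo = pd.DataFrame({
    "age": [25, 30, 47, 32, 43],
    "experience": [1, 3, 2, 5, 10],
    "income": [30450, 35670, 31580, 40130, 47830],
})

# Select features and target by name instead of by position
feature_cols = ["age", "experience"]
X_demo = df_demo[feature_cols].to_numpy()
y_demo = df_demo["income"].to_numpy()

print(X_demo.shape, y_demo.shape)
```

Selecting by name keeps the code correct if a column is later added or reordered in the CSV.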

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
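
Without a seed, `train_test_split` draws a different random partition on every run, so the scores reported later will vary between runs. The sketch below shows one way to make the split repeatable. It assumes a fixed seed (`random_state=42` is an arbitrary choice that is not part of the original notebook) and uses placeholder arrays with the same 20-row shape as the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the dataset: 20 rows, 2 features
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Fixing random_state makes the partition identical on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)
X_tr_again, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te_again))
```

With 20 rows and `train_size=0.8`, the test set holds only four samples, which is worth keeping in mind when reading the metrics.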

In [None]:
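from sklearn.preprocessing import StandardScaler

# Feature scale the data.
# Use separate scalers for the features and the target, fit them on the
# training split only, and reuse the fitted statistics for the test split.
sc_X = StandardScaler()
sc_y = StandardScaler()

X_train_scaled = sc_X.fit_transform(X_train)
y_train_scaled = sc_y.fit_transform(y_train.reshape(-1, 1))

X_test_scaled = sc_X.transform(X_test)
y_test_scaled = sc_y.transform(y_test.reshape(-1, 1))

X_train_scaled, y_train_scaled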

(array([[-1.17524   , -0.23942607],
        [ 0.48392235,  1.3567477 ],
        [ 0.04147906,  0.7182782 ],
        [-0.17974259, -0.23942607],
        [-0.73279671, -0.23942607],
        [ 0.59453318,  1.03751295],
        [-1.72829412, -1.51636508],
        [-1.50707247, -1.51636508],
        [-0.62218588, -0.55866082],
        [ 1.36880894,  0.39904344],
        [ 0.59453318,  1.99521721],
        [ 0.92636565, -1.19713033],
        [ 1.70064142, -0.23942607],
        [ 0.92636565,  1.03751295],
        [-0.95401836, -0.87789557],
        [ 0.26270071,  0.07980869]]),
 array([[ 0.22962234],
        [ 1.32994461],
        [ 0.87557424],
        [ 0.04482245],
        [ 0.02447751],
        [ 0.71281471],
        [-1.54547384],
        [-1.61668113],
        [-0.39598462],
        [ 0.27878928],
        [ 1.91825251],
        [-1.4250996 ],
        [-0.55365792],
        [ 1.03833378],
        [-0.73167616],
        [-0.18405815]]))
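
To make the scaled values above easier to interpret, here is a minimal sketch of what `StandardScaler` computes: each value has the column mean subtracted and is divided by the column's population standard deviation. The five ages are taken from the first rows of the dataset, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ages from the first five rows of the dataset, as a single feature column
ages = np.array([[25.0], [30.0], [47.0], [32.0], [43.0]])

scaler = StandardScaler().fit(ages)
scaled = scaler.transform(ages)

# Same computation by hand: z = (x - mean) / std, with population std (ddof=0)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))

# A new value is scaled with the statistics learned during fit,
# not with statistics of the new data
new_age = np.array([[40.0]])
new_scaled = scaler.transform(new_age)
```

The last step is why a scaler is usually fit once on the training data and then only applied, via `transform`, to anything it has not seen.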

### Linear Model!

In [None]:
from sklearn.linear_model import LinearRegression

# Create a linear regressor and fit it on the scaled training data
linear = LinearRegression()
linear.fit(X_train_scaled, y_train_scaled)

In [None]:
linear_pred = linear.predict(X_test_scaled)
linear_pred

array([[ 0.4500082 ],
       [-0.80723353],
       [ 1.36256719],
       [-1.00534186]])
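
The predictions above are in standardized units, not in units of income. A minimal sketch of mapping them back follows. It assumes a dedicated target scaler (called `y_scaler` here, a hypothetical name) and, to stay self-contained, fits on all 20 rows printed earlier rather than on a train/test split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# The 20 (age, experience) rows and incomes printed earlier in the notebook
X = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10], [51, 7], [28, 5],
              [33, 4], [37, 5], [39, 8], [29, 1], [47, 9], [54, 5], [51, 4],
              [44, 12], [41, 6], [58, 17], [23, 1], [44, 9], [37, 10]], dtype=float)
y = np.array([30450, 35670, 31580, 40130, 47830, 41630, 41340, 37650, 40250,
              45150, 27840, 46110, 36720, 34800, 51300, 38900, 63600, 30870,
              44190, 48700], dtype=float)

# One scaler per role, so the target scaling can be inverted later
x_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(y.reshape(-1, 1))

model = LinearRegression()
model.fit(x_scaler.transform(X), y_scaler.transform(y.reshape(-1, 1)))

# Predict in standardized units, then convert back to income
pred_scaled = model.predict(x_scaler.transform(X[:4]))
pred_income = y_scaler.inverse_transform(pred_scaled)
print(pred_income.ravel())
```

Keeping the target scaler separate from the feature scaler is what makes `inverse_transform` possible here, since a single reused scaler only remembers the last array it was fit on.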

### SVR Model!

In [None]:
from sklearn.svm import SVR

# Create an SVR regressor with an RBF kernel and fit it.
# SVR expects a 1-D target, so flatten the (n, 1) column vector with ravel().
svr_reg = SVR(kernel="rbf")
svr_reg.fit(X_train_scaled, y_train_scaled.ravel())


In [None]:
svr_pred = svr_reg.predict(X_test_scaled)

### Decision Tree!

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Create a Decision Tree Regressor
tree_reg = DecisionTreeRegressor()

# Fit the model
tree_reg.fit(X_train_scaled, y_train_scaled)


In [None]:
# Make predictions using the Decision Tree regressor
tree_pred = tree_reg.predict(X_test_scaled)


### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest Regressor
random_forest_reg = RandomForestRegressor()

# Fit the model; ravel() passes the target as the 1-D array the forest expects
random_forest_reg.fit(X_train_scaled, y_train_scaled.ravel())


In [None]:
# Make predictions using the Random Forest regressor
random_forest_pred = random_forest_reg.predict(X_test_scaled)


In [None]:
from sklearn.metrics import r2_score

# Calculate R-squared (R2) scores for each model
r2_lin = r2_score(y_test_scaled, linear_pred)  # R2 score for Linear Regression
r2_svr = r2_score(y_test_scaled, svr_pred)      # R2 score for Support Vector Regressor
r2_tree = r2_score(y_test_scaled, tree_pred)    # R2 score for Decision Tree Regressor
r2_rf = r2_score(y_test_scaled, random_forest_pred)  # R2 score for Random Forest Regressor

# Print the R-squared (R2) scores
print("R-squared scores:")
print(f"Linear Regression: {r2_lin}")
print(f"Support Vector Regressor: {r2_svr}")
print(f"Decision Tree Regressor: {r2_tree}")
print(f"Random Forest Regressor: {r2_rf}")


R-squared scores:
Linear Regression: 0.9842210967061685
Support Vector Regressor: 0.9254540259013746
Decision Tree Regressor: 0.8965604906052872
Random Forest Regressor: 0.9632345820343007
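
For reference, R-squared compares the model's squared error with that of a baseline that always predicts the mean: R2 = 1 - SS_res / SS_tot. The sketch below computes it by hand and checks the result against `r2_score`. The four values are made-up illustrative numbers, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only (not taken from the notebook's test split)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

Because SS_tot is computed from only the test targets, a four-sample test set makes the score sensitive to which rows happen to land in it.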


In [None]:
from sklearn.metrics import mean_absolute_error

# Calculate Mean Absolute Error (MAE) scores for each model
mae_lin = mean_absolute_error(y_test_scaled, linear_pred)  # MAE for Linear Regression
mae_svr = mean_absolute_error(y_test_scaled, svr_pred)  # MAE for Support Vector Regressor
mae_tree = mean_absolute_error(y_test_scaled, tree_pred)  # MAE for Decision Tree Regressor
mae_rf = mean_absolute_error(y_test_scaled, random_forest_pred)  # MAE for Random Forest Regressor

# Print MAE scores for each model
print("Mean Absolute Error (MAE) scores:")
print(f"Linear Regression: {mae_lin}")
print(f"Support Vector Regressor: {mae_svr}")
print(f"Decision Tree Regressor: {mae_tree}")
print(f"Random Forest Regressor: {mae_rf}")

Mean Absolute Error (MAE) scores:
Linear Regression: 0.12097428145149605
Support Vector Regressor: 0.23430215293203832
Decision Tree Regressor: 0.2884807735350904
Random Forest Regressor: 0.18829005243479247
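
Since the target was standardized, the MAE values above are expressed in standard deviations of income. As a rough sketch, an MAE in scaled units can be converted back to income units by multiplying by the target scaler's `scale_`. The predictions below are hypothetical offsets added to five real incomes from the dataset, purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Five incomes from the dataset and hypothetical predictions for them
income = np.array([30450.0, 35670.0, 31580.0, 40130.0, 47830.0]).reshape(-1, 1)
offsets = np.array([500.0, -300.0, 200.0, -400.0, 100.0]).reshape(-1, 1)
pred_income = income + offsets

y_scaler = StandardScaler().fit(income)

# MAE measured on standardized values (what the notebook reports)
mae_scaled = mean_absolute_error(
    y_scaler.transform(income), y_scaler.transform(pred_income)
)

# Multiplying by the scaler's standard deviation recovers income units
mae_income = mae_scaled * y_scaler.scale_[0]
print(mae_scaled, mae_income)
```

The conversion works because standardization divides every value by the same constant, so absolute differences shrink by exactly that factor.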


In conclusion, all four regression models fit the income dataset well when predicting income from age and experience, with R-squared scores between 0.8966 and 0.9842 on the test split.

Linear Regression gives the strongest fit, with an R-squared score of 0.9842 and the lowest Mean Absolute Error (MAE) of 0.1210. Because the target was standardized before training, the MAE is expressed in standard deviations of income rather than in currency units. These scores suggest that a linear relationship between experience, age, and income captures most of the variation in the data, which makes Linear Regression a suitable choice for this task.

Random Forest Regressor also performs well, with an R-squared score of 0.9632 and an MAE of 0.1883. Its error is larger than that of Linear Regression, but it remains an effective choice for predictive modeling. The Support Vector Regressor (R-squared 0.9254, MAE 0.2343) and the Decision Tree Regressor (R-squared 0.8966, MAE 0.2885) trail the other two models.

These results come from a single random split of 20 rows, which leaves only four test samples, so the ranking may change from run to run. The choice between models may also depend on specific needs, such as interpretability, computational cost, and the trade-off between model performance and simplicity. Further exploration and refinement may improve the accuracy of these predictions for real-world applications.
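
One possible direction for that further exploration is k-fold cross-validation, which scores each model on several different splits instead of one. The sketch below is not part of the original analysis. It assumes 5 folds, fixed seeds, and MAE in income units as the metric, and it uses the 20 rows printed earlier. `TransformedTargetRegressor` and `make_pipeline` handle target and feature scaling inside each fold.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# The 20 (age, experience) rows and incomes printed earlier in the notebook
X = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10], [51, 7], [28, 5],
              [33, 4], [37, 5], [39, 8], [29, 1], [47, 9], [54, 5], [51, 4],
              [44, 12], [41, 6], [58, 17], [23, 1], [44, 9], [37, 10]], dtype=float)
y = np.array([30450, 35670, 31580, 40130, 47830, 41630, 41340, 37650, 40250,
              45150, 27840, 46110, 36720, 34800, 51300, 38900, 63600, 30870,
              44190, 48700], dtype=float)

models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regressor": SVR(kernel="rbf"),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
}

# Assumed settings: 5 shuffled folds with a fixed seed
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mae = {}
for name, model in models.items():
    # Features are scaled inside the pipeline and the target inside the wrapper,
    # so each fold fits its scalers on that fold's training rows only
    estimator = TransformedTargetRegressor(
        regressor=make_pipeline(StandardScaler(), model),
        transformer=StandardScaler(),
    )
    scores = cross_val_score(
        estimator, X, y, cv=cv, scoring="neg_mean_absolute_error"
    )
    # scikit-learn reports negated MAE, so flip the sign back
    cv_mae[name] = -scores.mean()
    print(f"{name}: mean MAE over folds = {cv_mae[name]:.0f}")
```

Averaging over folds gives every row a turn in the test set, which is a more stable basis for ranking models on a dataset this small than a single four-sample test split.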