This notebook predicts student scores from the number of hours studied. The dataset comes from Kaggle: https://www.kaggle.com/datasets/samira1992/student-scores-simple-dataset/data

Linear regression is first implemented from scratch with gradient descent, and the resulting fit is then compared with the one from sklearn's LinearRegression.

In [10]:
## Import required modules
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
## Load dataset
df = pd.read_csv('student_scores.csv')
X = np.array(df['Hours'])
y = np.array(df['Scores'])

**Linear Regression from scratch**
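
The loop in the next cell minimises the mean squared error of the line $\hat{y}_i = m x_i + c$ over the $n$ data points. As a short derivation of the quantities it computes (the symbols $J$, $x_i$, $y_i$ are notation chosen here; `m`, `c`, `lr`, `dm` and `dc` mirror the code):

$$
\begin{aligned}
J(m, c) &= \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \\
\texttt{dm} = \frac{\partial J}{\partial m} &= -\frac{2}{n} \sum_{i=1}^{n} x_i \left(y_i - \hat{y}_i\right) \\
\texttt{dc} = \frac{\partial J}{\partial c} &= -\frac{2}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right) \\
m &\leftarrow m - \texttt{lr} \cdot \texttt{dm}, \qquad c \leftarrow c - \texttt{lr} \cdot \texttt{dc}
\end{aligned}
$$

Each iteration therefore moves `m` and `c` a small step against the gradient, with the step size set by the learning rate `lr`.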

In [6]:
## Initialise parameters
m, c = 0, 0
lr = 0.0025
## Choose the number of iterations by watching for the point where the cost stops decreasing noticeably.
epochs = 10000
n = len(X)

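# Gradient descent
for i in range(epochs):
  y_pred = m*X + c                                      # predictions with the current m and c
  cost = (1/n)*sum([value**2 for value in (y-y_pred)])  # mean squared error before this update
  dm = (-2/n) * sum(X * (y-y_pred))                     # gradient of the cost w.r.t. m
  dc = (-2/n) * sum(y - y_pred)                         # gradient of the cost w.r.t. c
  m -= lr * dm
  c -= lr * dc
  print("m {}, c {}, cost {}, iteration {}".format(m,c,cost, i))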

print(f"From Scratch → m={m:.3f}, c={c:.3f}")

Streaming output truncated to the last 5000 lines.
m 9.777001140866279, c 2.4762531341231826, cost 28.882741319488503, iteration 5001
m 9.776999997114993, c 2.476260219862458, cost 28.88274129885259, iteration 5002
m 9.776998854455895, c 2.476267298835444, cost 28.882741278256074, iteration 5003
m 9.776997712887942, c 2.476274371048602, cost 28.882741257698882, iteration 5004
m 9.77699657241009, c 2.4762814365083874, cost 28.882741237180937, iteration 5005
m 9.776995433021298, c 2.4762884952212487, cost 28.882741216702165, iteration 5006
m 9.776994294720529, c 2.4762955471936285, cost 28.882741196262447, iteration 5007
m 9.776993157506741, c 2.476302592431964, cost 28.882741175861806, iteration 5008
m 9.776992021378899, c 2.4763096309426853, cost 28.88274115550007, iteration 5009
m 9.776990886335962, c 2.476316662732217, cost 28.882741135177213, iteration 5010
m 9.776989752376897, c 2.4763236878069765, cost 28.882741114893147, iteration 5011
m 9.776988619500669, c 2.47633
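
The comment in the cell above suggests choosing the iteration count by watching for the cost to level off, and the log shows the cost changing by only about 2e-8 per iteration around iteration 5000. A minimal sketch of automating that check is given below. It assumes a hypothetical helper `fit_line_gd` with a tolerance parameter `tol`, and it runs on synthetic stand-in data rather than `student_scores.csv`; none of these names or values come from the original notebook.

```python
import numpy as np

def fit_line_gd(X, y, lr=0.0025, max_epochs=50_000, tol=1e-9):
    """Fit y ≈ m*X + c by gradient descent, stopping once the cost plateaus.

    Hypothetical helper: tol and max_epochs are illustrative defaults.
    """
    m, c = 0.0, 0.0
    n = len(X)
    prev_cost = np.inf
    for iteration in range(1, max_epochs + 1):
        residual = y - (m * X + c)
        cost = np.mean(residual ** 2)
        # Stop when the cost changed by less than tol since the last iteration.
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
        m -= lr * (-2 / n) * np.sum(X * residual)
        c -= lr * (-2 / n) * np.sum(residual)
    return m, c, cost, iteration

# Synthetic stand-in data (assumed values, not the Kaggle dataset).
rng = np.random.default_rng(0)
X_demo = np.linspace(1, 9, 25)
y_demo = 9.8 * X_demo + 2.5 + rng.normal(0, 5, size=X_demo.size)

m_gd, c_gd, cost_gd, n_iter = fit_line_gd(X_demo, y_demo)
m_ref, c_ref = np.polyfit(X_demo, y_demo, 1)  # least-squares reference fit

print(f"stopped after {n_iter} iterations: m={m_gd:.3f}, c={c_gd:.3f}, cost={cost_gd:.4f}")
print(f"np.polyfit reference: m={m_ref:.3f}, c={c_ref:.3f}")
```

A looser `tol` stops earlier at the price of a less accurate intercept, since `c` is the slower of the two parameters to converge in the log above.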

In [8]:
## Calculate with sklearn
model = LinearRegression()
model.fit(df[['Hours']], df['Scores'])
print(f"From sklearn → m={model.coef_[0]:.3f}, c={model.intercept_:.3f}")
y_pred_sklearn = model.predict(df[['Hours']])

From sklearn → m=9.776, c=2.484


In [11]:
## Comparing with sklearn
print("MSE (scratch):", mean_squared_error(y, m*X + c))
print("MSE (sklearn):", mean_squared_error(y, y_pred_sklearn))

MSE (scratch): 28.88273051001376
MSE (sklearn): 28.882730509245466
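
The two MSE values differ by less than 1e-9. That is expected: for a straight-line fit on data with more than one distinct `Hours` value, the mean squared error has a single minimum, which gradient descent approaches step by step and ordinary least squares reaches directly. The sketch below shows the closed-form solution for one feature. It assumes a hypothetical helper `fit_line_closed_form` and synthetic stand-in data, and it is not claimed to be the routine sklearn uses internally.

```python
import numpy as np

def fit_line_closed_form(X, y):
    """Least-squares slope and intercept for a single feature (hypothetical helper)."""
    x_mean, y_mean = X.mean(), y.mean()
    # Slope = covariance of X and y divided by the variance of X.
    m = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
    # The fitted line passes through the point of means.
    c = y_mean - m * x_mean
    return m, c

# Synthetic stand-in data (assumed values, not the Kaggle dataset).
rng = np.random.default_rng(0)
X_demo = np.linspace(1, 9, 25)
y_demo = 9.8 * X_demo + 2.5 + rng.normal(0, 5, size=X_demo.size)

m_cf, c_cf = fit_line_closed_form(X_demo, y_demo)

# Cross-check against NumPy's general least-squares solver.
A = np.column_stack([X_demo, np.ones_like(X_demo)])
(m_ls, c_ls), *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(f"closed form:     m={m_cf:.6f}, c={c_cf:.6f}")
print(f"np.linalg.lstsq: m={m_ls:.6f}, c={c_ls:.6f}")
```

Running the same helper on the notebook's `X` and `y` would give a third pair of parameters to set beside the gradient-descent and sklearn results.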
