<a href="https://colab.research.google.com/github/valsson-group/UNT-ChemicalApplicationsOfMachineLearning-Spring2026/blob/main/Examples_scikit-learn/sklearn_linear_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Linear regression using scikit-learn

This notebook shows how to perform linear regression using scikit-learn.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



First we generate a dataset $(x,y)$ of $N$ samples given by
$$
y = \beta_1 x + \beta_0 + \epsilon
$$
where $\epsilon$ is a random number taken from a Normal distribution with mean 0 and some standard deviation $\sigma$.

Thus, the underlying model has a linear relationship between $x$ and $y$, but the random noise component means that the data points will not lie exactly on a straight line. The magnitude of the standard deviation $\sigma$ determines how much the dataset deviates from the true linear relationship.

In [None]:
# Define a random number generator.
# This is the recommended way to use random numbers in NumPy.

rng = np.random.default_rng()

x_range_min = 1.0
x_range_max = 60.0

beta_1=2.54
beta_0=10.05
random_sigma=25
NumberOfValues=50

# sample x values uniformly at random from the range [x_range_min,x_range_max]
# this will give a numpy array of size NumberOfValues
x = rng.uniform(low=x_range_min, high=x_range_max, size=NumberOfValues)

# generate a numpy array of random values from a normal distribution
# with mean of 0 and standard deviation given by random_sigma
noise = rng.normal(loc=0.0, scale=random_sigma, size=x.size)

# calculate the y values from the x values and the noise term
y = beta_1 *x + beta_0 + noise


print("x values")
print(x)
print("----------")

print("y values")
print(y)
print("----------")

plt.plot(x,y,'.')
plt.xlabel("x")
plt.ylabel("y")
plt.show()


# Note that due to the random noise component
# you will get different results each time.

# If you want to obtain reproducible results that are identical
# each time, you can initialize the random number generator
# with a given random seed (that should be a non-negative integer)
# by using rng = np.random.default_rng(seed=__SEED__), where
# __SEED__ is some number that you choose.

# If no random seed is given, then the generator is seeded with
# fresh, unpredictable entropy from the operating system.
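
The effect of seeding can be checked with a minimal sketch like the one below. The seed value `12345` and the names `rng_a`, `rng_b`, and `rng_c` are arbitrary choices made for this illustration and are not part of the notebook above.

```python
import numpy as np

# The seed value is an arbitrary choice for this illustration.
seed = 12345

# Two generators created with the same seed produce the same random stream.
rng_a = np.random.default_rng(seed=seed)
rng_b = np.random.default_rng(seed=seed)
sample_a = rng_a.normal(loc=0.0, scale=25, size=5)
sample_b = rng_b.normal(loc=0.0, scale=25, size=5)
print("Same seed, identical samples:", np.array_equal(sample_a, sample_b))

# A generator created without a seed draws fresh entropy,
# so its samples will almost certainly differ.
rng_c = np.random.default_rng()
sample_c = rng_c.normal(loc=0.0, scale=25, size=5)
print("No seed, identical samples:  ", np.array_equal(sample_a, sample_c))
```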


In [None]:
# Here we split the dataset into training and test sets with a 70/30 split
# using random selection
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)

# This is just to show the sizes of the sets, this is not normally needed
print("Size")
print("- Full: {:d}".format(x.size))
print("- Training: {:d}".format(x_train.size))
print("- Test: {:d}".format(x_test.size))

# plot the split
plt.plot(x_train,y_train,'.',label="Training set")
plt.plot(x_test,y_test,'.',label="Test set")
plt.legend()
plt.show()

# Again, each time you run, you will get different results.
# To get a reproducible split, you can add the parameter random_state=__SEED__, e.g.,
# train_test_split(x,y, test_size=0.30, random_state=__SEED__)
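
As a small sketch of the `random_state` behavior, the example below splits a toy dataset twice with the same seed and compares the results. The arrays `x_demo` and `y_demo` and the seed value `42` are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data used only for this illustration.
x_demo = np.arange(10, dtype=float)
y_demo = 2.0 * x_demo

# Two splits with the same random_state (42 is an arbitrary choice).
split_1 = train_test_split(x_demo, y_demo, test_size=0.30, random_state=42)
split_2 = train_test_split(x_demo, y_demo, test_size=0.30, random_state=42)

# Each split is the list [x_train, x_test, y_train, y_test].
same = all(np.array_equal(a, b) for a, b in zip(split_1, split_2))
print("Identical splits:", same)
print("Training size:", split_1[0].size, "Test size:", split_1[1].size)
```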

In [None]:
# Here we perform the linear regression using
# the LinearRegression class
from sklearn.linear_model import LinearRegression

# create a new LinearRegression() object
linear_reg = LinearRegression()

# Perform the fit to the training set.
# Here, you need to use the reshape(-1,1) method to put the x data
# into the 2D shape (n_samples, 1) that scikit-learn expects;
# y is reshaped in the same way here.
linear_reg.fit(x_train.reshape(-1, 1),y_train.reshape(-1, 1))
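
For a single feature, the least-squares fit has the closed form $\beta_1 = \sum_i (x_i-\bar{x})(y_i-\bar{y}) / \sum_i (x_i-\bar{x})^2$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. The sketch below, which is only a cross-check and not how scikit-learn computes the fit internally, compares these formulas with `LinearRegression` on a synthetic dataset. The names `x_demo`, `y_demo`, `rng_demo`, and the seed are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with the same form as in this notebook (seed chosen arbitrarily).
rng_demo = np.random.default_rng(seed=0)
x_demo = rng_demo.uniform(low=1.0, high=60.0, size=50)
y_demo = 2.54 * x_demo + 10.05 + rng_demo.normal(loc=0.0, scale=25, size=50)

# Closed-form least-squares estimates for a single feature.
x_mean = x_demo.mean()
y_mean = y_demo.mean()
slope = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# The same fit with scikit-learn (y is passed as a 1D array here).
model = LinearRegression()
model.fit(x_demo.reshape(-1, 1), y_demo)

print("Closed form:  slope = {:.4f}, intercept = {:.4f}".format(slope, intercept))
print("scikit-learn: slope = {:.4f}, intercept = {:.4f}".format(model.coef_[0], model.intercept_))
```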



In [None]:
# we can get the linear coefficient using the linear_reg.coef_ attribute
# of the fitted LinearRegression object linear_reg.
# We can do the same for the intercept using linear_reg.intercept_.
# Note that both are arrays so we need to use [0] to get the first value
# (and only value in this case).

print("Fitted parameters")
print("- Linear Coefficient: {:.4f}".format(linear_reg.coef_[0][0]))
print("- Intercept: {:.4f}".format(linear_reg.intercept_[0]))
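
The indexing needed for `coef_` and `intercept_` depends on the shape of the `y` array passed to `fit()`. The sketch below illustrates this on a toy dataset; `x_demo`, `y_demo`, `model_2d`, and `model_1d` are names introduced only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line y = 2x + 1.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0

# y passed as a column vector, as done in this notebook:
# coef_ is 2D and intercept_ is 1D.
model_2d = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo.reshape(-1, 1))
print("2D y:", model_2d.coef_.shape, model_2d.intercept_.shape)

# y passed as a 1D array: coef_ is 1D and intercept_ is a scalar.
model_1d = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)
print("1D y:", model_1d.coef_.shape, np.ndim(model_1d.intercept_))
```

With a 1D `y`, the slope would be read as `coef_[0]` and the intercept used directly, without `[0]`.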

In [None]:
# We can get the R^2 or coefficient of determination value using the
# score() method. You can do this either for the training or the test dataset.
# The R^2 value is at most 1, and a higher value means a better fit. It can be
# negative if the model fits worse than always predicting the mean of y.

r2_coefficient_train = linear_reg.score(x_train.reshape(-1,1),y_train.reshape(-1,1))
r2_coefficient_test  = linear_reg.score(x_test.reshape(-1,1),y_test.reshape(-1,1))

print("Coefficient of determination / R^2")
print("- Training Dataset: {:.4f}".format(r2_coefficient_train))
print("- Test Dataset:     {:.4f}".format(r2_coefficient_test))
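
The coefficient of determination is defined as $R^2 = 1 - SS_{\mathrm{res}}/SS_{\mathrm{tot}}$, where $SS_{\mathrm{res}}$ is the sum of squared residuals and $SS_{\mathrm{tot}}$ is the sum of squared deviations of $y$ from its mean. The following sketch evaluates this by hand and compares it with `sklearn.metrics.r2_score`. The helper `r2_by_hand` and the small arrays are hypothetical and serve only as an illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Small made-up example values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.1, 1.9, 3.2, 3.8])  # predictions close to the true values
y_bad = np.array([4.0, 3.0, 2.0, 1.0])   # predictions in reversed order

print("Good predictions: {:.2f} (by hand), {:.2f} (r2_score)".format(
    r2_by_hand(y_true, y_good), r2_score(y_true, y_good)))
print("Bad predictions:  {:.2f} (by hand), {:.2f} (r2_score)".format(
    r2_by_hand(y_true, y_bad), r2_score(y_true, y_bad)))
```

The reversed predictions give a negative value, because they fit worse than a constant prediction at the mean of `y_true`.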

In [None]:
# Here we create a plot showing the fit as a line

# we first create a dense grid in the range [x_range_min,x_range_max]
x_predict_grid = np.linspace(x_range_min,x_range_max,1000)

# then calculate the y values predicted by the linear regression
# model using the .predict function.
y_predict_grid = linear_reg.predict(x_predict_grid.reshape(-1,1))

# plot dataset as points
plt.plot(x_train,y_train,'.',label="Training set")
plt.plot(x_test,y_test,'.',label="Test set")

# plot linear fit as a line
plt.plot(x_predict_grid,y_predict_grid,'-',label="Linear Regression",linewidth=3.0)

# add text to the plot to show the R^2 values
# note that the x,y coordinates are given in the coordinate
# system of the data.
# In the text string we use the "\n" character, which is a
# special character that indicates a new line.
plt.text(x=1.0, y=140.0,
         s="$R^2$\nTrain: {:.3f}\nTest:  {:.3f}".format(r2_coefficient_train,r2_coefficient_test),
         fontsize=10)

plt.xlabel("x")
plt.ylabel("y")

# this adds an x=0 line and a y=0 line
ax = plt.gca()
ax.axvline(x=0, color='black', linestyle='--', linewidth=0.5)
ax.axhline(y=0, color='black', linestyle='--', linewidth=0.5)

plt.legend()
plt.show()

In [None]:
# Here we perform a linear regression using the RANSAC method,
# which allows for handling outliers. It automatically detects
# possible outliers and does the linear regression fit using only the inliers.

# References
# https://scikit-learn.org/stable/modules/linear_model.html#ransac-regression
# https://scikit-learn.org/stable/auto_examples/linear_model/plot_ransac.html#sphx-glr-auto-examples-linear-model-plot-ransac-py


from sklearn.linear_model import RANSACRegressor

ransac_reg = RANSACRegressor()

# Here we will do the fit on the full (x,y) dataset
ransac_reg.fit(x.reshape(-1, 1),y.reshape(-1, 1))

In [None]:
r2_coefficient_ransac = ransac_reg.score(x.reshape(-1,1),y.reshape(-1,1))

print("Coefficient of determination / R^2:     {:.4f}".format(r2_coefficient_ransac))
print("Fitted parameters")
print("- Linear Coefficient: {:.4f}".format(ransac_reg.estimator_.coef_[0][0]))
print("- Intercept: {:.4f}".format(ransac_reg.estimator_.intercept_[0]))

In [None]:
# we can get a "mask" for inliers by using .inlier_mask_
# this is an array of the same size as x and y with boolean
# elements that are True if the corresponding element is an
# inlier.
inlier_mask = ransac_reg.inlier_mask_

# we can get the corresponding outlier mask by using
# the np.logical_not() function
outlier_mask = np.logical_not(inlier_mask)

print("inlier mask")
print(inlier_mask)
print("---------")

print("outlier mask")
print(outlier_mask)
print("---------")

print("Number of outliers: {:3d}".format(np.sum(outlier_mask)))
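
To see the outlier detection more clearly, it can help to plant a few gross outliers in otherwise clean data. The sketch below does this under stated assumptions: the dataset, the planted indices, the `residual_threshold=5.0` value, and the `random_state=0` seed are all choices made for this illustration, not values used elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Nearly noise-free line y = 2x + 1 (seed chosen arbitrarily).
rng_demo = np.random.default_rng(seed=1)
x_demo = np.linspace(1.0, 60.0, 50)
y_demo = 2.0 * x_demo + 1.0 + rng_demo.normal(loc=0.0, scale=0.1, size=x_demo.size)

# Plant five gross outliers by shifting selected points far from the line.
planted = np.array([5, 15, 25, 35, 45])
y_demo[planted] += 100.0

# residual_threshold sets the largest absolute residual for a point
# to count as an inlier; random_state makes the random sampling reproducible.
ransac_demo = RANSACRegressor(residual_threshold=5.0, random_state=0)
ransac_demo.fit(x_demo.reshape(-1, 1), y_demo)

outlier_mask_demo = np.logical_not(ransac_demo.inlier_mask_)
print("All planted outliers flagged:", outlier_mask_demo[planted].all())
print("Number of flagged outliers:  ", np.sum(outlier_mask_demo))
print("Slope from inliers: {:.3f}".format(ransac_demo.estimator_.coef_[0]))
```

If `residual_threshold` is not given, scikit-learn derives a threshold from the median absolute deviation of `y`, so the set of flagged points depends on the spread of the data.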

In [None]:
# calculate the predicted fit for plotting
x_ransac_predict_grid = np.linspace(x_range_min,x_range_max,1000)
y_ransac_predict_grid = ransac_reg.predict(x_ransac_predict_grid.reshape(-1,1))

# plot inliers and outliers
plt.plot(x[inlier_mask],y[inlier_mask],'.',label="Inliers")
plt.plot(x[outlier_mask],y[outlier_mask],'.',label="Outliers")

# plot the normal linear regression from before
plt.plot(x_predict_grid,y_predict_grid,'-',label="Linear Regression",linewidth=3.0)

# plot the RANSAC regression
plt.plot(x_ransac_predict_grid,y_ransac_predict_grid,'-',label="RANSAC Regression",linewidth=3.0)

plt.text(x=1.0, y=120,
         s="$R^2$\nTrain: {:.3f}\nTest: {:.3f}\nRANSAC: {:.3f}".format(r2_coefficient_train,r2_coefficient_test,r2_coefficient_ransac),
         fontsize=10)
plt.xlabel("x")
plt.ylabel("y")

ax = plt.gca()
ax.axvline(x=0, color='black', linestyle='--', linewidth=0.5)
ax.axhline(y=0, color='black', linestyle='--', linewidth=0.5)

plt.legend()
plt.show()

Try repeating the linear regression after changing the parameters
used to generate the $(x,y)$ dataset, for example by increasing the number of
samples, or by increasing the standard deviation $\sigma$ of the noise term so that the data scatter more widely around the line.
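
One possible way to run such an experiment is sketched below: the same $x$ values are reused with several noise levels, and the $R^2$ of the fit is recorded for each. The seed, the sample size, the list of $\sigma$ values, and the names `rng_demo` and `r2_by_sigma` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Seed and sample size are arbitrary choices for this illustration.
rng_demo = np.random.default_rng(seed=2026)
beta_1, beta_0 = 2.54, 10.05
n_samples = 500
x_demo = rng_demo.uniform(low=1.0, high=60.0, size=n_samples)

# Fit the same linear model for increasing noise levels.
r2_by_sigma = {}
for sigma in [1.0, 25.0, 100.0]:
    noise = rng_demo.normal(loc=0.0, scale=sigma, size=n_samples)
    y_demo = beta_1 * x_demo + beta_0 + noise
    model = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)
    r2_by_sigma[sigma] = model.score(x_demo.reshape(-1, 1), y_demo)
    print("sigma = {:6.1f}: slope = {:.3f}, intercept = {:8.3f}, R^2 = {:.3f}".format(
        sigma, model.coef_[0], model.intercept_, r2_by_sigma[sigma]))
```

Here $R^2$ is evaluated on the same data used for the fit; a train/test split as used earlier in the notebook could be added in the same loop.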