<a href="#Overview"></a>
# Overview
* <a href="#c748a6af-5cde-4671-9ac6-6481d4bc7f8b">Week 11 - Linear Regression</a>
  * <a href="#cb7143f2-a389-49e9-81c4-637d1c7fc381">Exercise 1: Import packages and classes</a>
  * <a href="#b3099a95-3f5a-46c1-98e4-391204a9becb">Exercise 2: Provide data</a>
  * <a href="#f8b0a6c5-b547-4a7f-b96d-2fdff59b2ed6">Exercise 2: Create a model and fit it</a>
  * <a href="#a22643bc-a1dd-4bd6-baee-c62391418af1">Exercise 3: Get results</a>
  * <a href="#b7e333b8-3e7f-423f-b9c1-2d2df3c48933">Exercise 4: Predict response</a>
  * <a href="#4dc0fede-2959-48fd-99ca-472893e8ceb8">Exercise 5: Plotting the data points</a>
* <a href="#62976845-d363-4255-b9c4-254c08c81cb4">Multiple Linear Regression With scikit-learn</a>
  * <a href="#51660e59-a871-4f02-b463-267147382f56">Exercise 6: get results</a>
  * <a href="#46217d92-d820-453c-95d9-38839a993791">Exercise 8: Generate prediction</a>
* <a href="#029cee58-3d17-418a-8bbe-9be1ee77a395">Polynomial Regression With scikit-learn</a>
* <a href="#15a57f7b-64a9-4fca-afd1-cb74f2ee57f2">Perform polynomial regression on EEG data</a>
  * <a href="#594bae76-7352-4287-b814-2b8496641941">Exercise 7: polynomial regression with real data</a>
* <a href="#036711d1-4012-4e7d-bab1-a0eb3a5da7d4">Polynomial Regression With statsmodels</a>

<a id="c748a6af-5cde-4671-9ac6-6481d4bc7f8b"></a>
# Week 11 - Linear Regression
<a href="#Overview">Return to overview</a>

**What Is Regression?**

In statistics, regression is a method used for modeling the relationship between a dependent variable and one or more independent variables. The goal of regression analysis is to understand how the independent variables impact the dependent variable and to predict the value of the dependent variable based on the values of the independent variables.

**When Do You Need Regression?**

Typically, regression is used to answer whether and how one phenomenon influences another, or how several variables are related, including the strength and direction of that relationship. For example, you can use it to determine if and to what extent experience or gender impacts salaries.

Regression is also useful when you want to forecast a response using a new set of predictors. For example, you could use it to study the relationship between brain activity and memory performance.

**What is Linear Regression?**

In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables.

Linear regression calculates the estimators of the regression coefficients or simply the predicted weights, denoted with 𝑏₀, 𝑏₁, …, 𝑏ᵣ. These estimators define the estimated regression function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ. This function should capture the dependencies between the inputs and output sufficiently well.

The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖 = 1, …, 𝑛, should be as close as possible to the corresponding actual response 𝑦ᵢ. The differences 𝑦ᵢ - 𝑓(𝐱ᵢ) for all observations 𝑖 = 1, …, 𝑛, are called the residuals. Regression is about determining the best predicted weights—that is, the weights corresponding to the smallest residuals.

To get the best weights, you usually minimize the sum of squared residuals (SSR) for all observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))². This approach is called the method of ordinary least squares.
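To make the least-squares idea concrete, here is a minimal sketch that fits a straight line to four made-up points with NumPy and computes the residuals and SSR by hand. The data values and variable names are illustrative assumptions, not part of the exercises below.

```python
import numpy as np

# Hypothetical toy data (assumed for illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 8.0])

# Closed-form OLS estimates for a single predictor
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residuals y_i - f(x_i) and their sum of squares
residuals = y - (b0 + b1 * x)
ssr = np.sum(residuals ** 2)

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSR = {ssr:.2f}")
```

Any other choice of b0 and b1 would give a larger SSR on these points, which is what "best weights" means here.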

**Regression Performance**

The variation of actual responses 𝑦ᵢ, 𝑖 = 1, …, 𝑛, occurs partly due to the dependence on the predictors 𝐱ᵢ. However, there’s also an additional inherent variance of the output.

The coefficient of determination, denoted as 𝑅², tells you what proportion of the variation in 𝑦 can be explained by the dependence on 𝐱, using the particular regression model. A larger 𝑅² indicates a better fit and means that the model can better explain the variation of the output with different inputs.

The value 𝑅² = 1 corresponds to SSR = 0. That’s the perfect fit, since the values of predicted and actual responses fit completely to each other.
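As a rough illustration, 𝑅² can be computed by hand as 1 − SSR/SST, where SST is the total sum of squared deviations of 𝑦 from its mean. The responses and predictions in this sketch are assumed values chosen for the example.

```python
import numpy as np

# Hypothetical observed responses and model predictions (assumed values)
y = np.array([1.0, 3.0, 4.0, 8.0])
y_fit = np.array([0.7, 2.9, 5.1, 7.3])

ssr = np.sum((y - y_fit) ** 2)        # variation the model leaves unexplained
sst = np.sum((y - y.mean()) ** 2)     # total variation around the mean
r_squared = 1 - ssr / sst

print(f"R^2 = {r_squared:.3f}")
```

If the predictions matched the observations exactly, `ssr` would be 0 and `r_squared` would be 1.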

The case of one explanatory variable is called simple linear regression.

**Simple Linear Regression**

Simple or single-variate linear regression is the simplest case of linear regression, as it has a single independent variable, 𝐱 = 𝑥.

The following figure illustrates simple linear regression:

<img src="https://files.realpython.com/media/fig-lin-reg.a506035b654a.png" alt="Description of image">



## Simple Linear Regression with scikit-learn

**Step 1: Import packages and classes**

The first step is to import the package numpy and the class LinearRegression from sklearn.linear_model:

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

**Step 2: Provide data**

The second step is defining data to work with. The inputs (regressors, 𝑥) and output (response, 𝑦) should be arrays or similar objects. This is the simplest way of providing data for regression:

In [None]:
x = np.array([5, 15, 25, 30, 35, 45]).reshape((-1, 1))
y = np.array([6, 28, 11, 39, 20, 52])

You should call .reshape() on x because this array must be two-dimensional, or more precisely, it must have one column and as many rows as necessary. That’s what the argument (-1, 1) of .reshape() specifies.
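A quick way to see what .reshape() does is to compare the array shapes before and after the call. This is only a small sketch; the names `flat` and `column` are chosen for illustration.

```python
import numpy as np

flat = np.array([5, 15, 25, 30, 35, 45])
column = flat.reshape((-1, 1))   # -1 lets NumPy infer the number of rows

print(flat.shape)     # one-dimensional: six elements
print(column.shape)   # two-dimensional: six rows, one column
```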

In [None]:
# Plotting the data points
import matplotlib.pyplot as plt
plt.scatter(x, y)

plt.xlabel('X')
plt.ylabel('y')
plt.title('Dummy data for regression')

**Step 3: Create a model and fit it**

The next step is to create a linear regression model and fit it using the existing data.

Create an instance of the class LinearRegression, which will represent the regression model:

In [None]:
model = LinearRegression()

This statement creates the variable model as an instance of LinearRegression. You can provide several optional parameters to LinearRegression:

*   fit_intercept is a Boolean that, if True, decides to calculate the intercept 𝑏₀ or, if False, considers it equal to zero. It defaults to True.
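*   normalize was a Boolean that, if True, normalized the input variables. It was deprecated in scikit-learn 1.0 and removed in version 1.2, so recent versions no longer accept it. If you need scaled inputs, apply a scaler such as StandardScaler before fitting.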
*   copy_X is a Boolean that decides whether to copy (True) or overwrite the input variables (False). It’s True by default.
*   n_jobs is either an integer or None. It represents the number of jobs used in parallel computation. It defaults to None, which usually means one job. -1 means to use all available processors.

Our model as defined above uses the default values of all parameters.
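To see what a non-default parameter does, the sketch below sets fit_intercept=False on a tiny made-up dataset. The names `x_demo`, `y_demo`, and `origin_model` are assumptions chosen so that they do not clash with the variables used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2x (assumed for illustration)
x_demo = np.array([1, 2, 3]).reshape((-1, 1))
y_demo = np.array([2, 4, 6])

# Force the line through the origin instead of estimating an intercept
origin_model = LinearRegression(fit_intercept=False).fit(x_demo, y_demo)

print(origin_model.intercept_)   # fixed at zero because it was not fitted
print(origin_model.coef_)        # the slope, close to 2
```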

It’s time to start using the model. First, we need to call .fit() on model:

In [None]:
model.fit(x, y)

With .fit(), you calculate the optimal values of the weights 𝑏₀ and 𝑏₁, using the existing input and output, x and y, as the arguments. In other words, .fit() fits the model. It returns self, which is the variable model itself. We can combine the last two statements into this one:

In [None]:
model = LinearRegression().fit(x, y)

**Step 4: Get results**

Once you have your model fitted, you can get the results to check whether the model works satisfactorily and to interpret it.

You can obtain the coefficient of determination, 𝑅², with .score() called on model:

In [None]:
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")

When you’re applying .score(), the arguments are also the predictor x and response y, and the return value is 𝑅².

The attributes of the model are .intercept_, which represents the intercept, and .coef_, which represents the slope:

In [None]:
print(f"intercept: {model.intercept_}")
print(f"slope: {model.coef_}")

This illustrates that your model predicts a response of about 3.10 when 𝑥 is zero. The slope of about 0.89 means that the predicted response rises by 0.89 when 𝑥 is increased by one.

**Step 5: Predict response**

Once you have a satisfactory model, then you can use it for predictions with either existing or new data. To obtain the predicted response, use .predict():

In [None]:
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")

You can also use fitted models to calculate the outputs based on new inputs:



In [None]:
x_new = np.arange(5).reshape((-1, 1))
x_new

y_new = model.predict(x_new)
y_new

Here .predict() is applied to the new regressor x_new and yields the response y_new. This example conveniently uses arange() from numpy to generate an array with the elements from 0, inclusive, up to but excluding 5—that is, 0, 1, 2, 3, and 4.
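To connect the intercept and slope back to .predict(), the following sketch recomputes the predictions for the new inputs by hand. It refits the model on the same dummy data so that it runs on its own; the name `manual` is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 30, 35, 45]).reshape((-1, 1))
y = np.array([6, 28, 11, 39, 20, 52])
model = LinearRegression().fit(x, y)

x_new = np.arange(5).reshape((-1, 1))

# Rebuild the fitted line by hand: intercept + slope * x
manual = model.intercept_ + model.coef_[0] * x_new.ravel()

# Check that the hand computation agrees with .predict()
print(np.allclose(manual, model.predict(x_new)))
```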

**Plotting the data points**

You can also visualize linear regression on a plot!



In [None]:
# Plotting the data points
import matplotlib.pyplot as plt
plt.scatter(x, y)

# Plotting the linear regression line
plt.plot(x, model.predict(x), "r-")

plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression with Numpy and model.fit()')
plt.show()

Your turn! Let's try this now with a real dataset! We will be using a simple dataset to implement this algorithm. This dataset contains Head Size (cm^3) and Brain Weight (grams) where Head Size is an independent variable.

Let's begin going through the steps of doing a simple linear regression with the data.

<a id="cb7143f2-a389-49e9-81c4-637d1c7fc381"></a>
## Exercise 1: Import packages and classes
<a href="#Overview">Return to overview</a>

As we did above, the first step is to import the package numpy and the class LinearRegression. This exercise also uses pandas and matplotlib.

In [None]:
# Answer
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

<a id="b3099a95-3f5a-46c1-98e4-391204a9becb"></a>
## Exercise 2: Provide data
<a href="#Overview">Return to overview</a>

Next, load the data from the file `head_brain_dataset.csv` into an array called `head_brain_data`. Remember that the model expects the data as an array. Then assign the independent and dependent variables to `x` and `y`; these are the columns at indices 2 and 3 of the data matrix (the third and fourth columns, since indexing starts at 0).

In [None]:
import os
datafile = os.path.join('simple_linear_regression_sample_data','head_brain_dataset.csv')


In [None]:
# Answer

head_brain_data = np.loadtxt(datafile, delimiter=',', skiprows=1)
x=head_brain_data[:, 2].reshape((-1, 1))
y=head_brain_data[:, 3]
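If the zero-based column indexing is confusing, this small sketch shows the same loading pattern on an in-memory stand-in for the CSV file. The column names and values are hypothetical and only mimic the layout assumed above (two leading columns, then head size and brain weight).

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file: a header row plus numeric columns
csv_text = """gender,age_range,head_size,brain_weight
1,1,4512,1530
1,1,3738,1297
2,2,4261,1335
"""

# skiprows=1 skips the header line, which cannot be parsed as numbers
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

x_demo = data[:, 2].reshape((-1, 1))   # index 2 is the third column
y_demo = data[:, 3]                    # index 3 is the fourth column

print(x_demo.shape, y_demo.shape)
```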


We can get a quick sense of the dataset’s dimensions:

In [None]:
head_brain_data.shape

Let's look at the first five rows:

In [None]:
print(head_brain_data[:5])

<a id="f8b0a6c5-b547-4a7f-b96d-2fdff59b2ed6"></a>
## Exercise 2: Create a model and fit it
<a href="#Overview">Return to overview</a>

Next, create a linear regression model and fit it using the loaded data.

In [None]:
# Answer
model = LinearRegression().fit(x, y)

<a id="a22643bc-a1dd-4bd6-baee-c62391418af1"></a>
## Exercise 3: Get results
<a href="#Overview">Return to overview</a>


Now that we have our model fitted, we can get the results to check whether the model works satisfactorily and to interpret it.

Next, calculate the coefficient of determination (𝑅²), the intercept, and the slope.

Hint: .score()

In [None]:
# Answer
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")

print(f"intercept: {model.intercept_}")
print(f"slope: {model.coef_}")

<a id="b7e333b8-3e7f-423f-b9c1-2d2df3c48933"></a>
## Exercise 4: Predict response
<a href="#Overview">Return to overview</a>

Now that we have a satisfactory model, we can use it for predictions with our data. Calculate the predicted response.

Hint: `.predict()`

In [None]:
# Answer
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")

<a id="4dc0fede-2959-48fd-99ca-472893e8ceb8"></a>
## Exercise 5: Plotting the data points
<a href="#Overview">Return to overview</a>

Plot the data points so that we can visualize this linear regression.

In [None]:
# Answer

# Plotting the data points
import matplotlib.pyplot as plt
plt.scatter(x, y)

# Plotting the linear regression line
plt.plot(x, model.predict(x), "r-")

plt.xlabel('Head Size (cm^3)')
plt.ylabel('Brain Weight (grams)')
plt.title('Linear Regression with Numpy and model.fit()')
plt.show()

<a id="62976845-d363-4255-b9c4-254c08c81cb4"></a>
# Multiple Linear Regression With scikit-learn
<a href="#Overview">Return to overview</a>

You can implement multiple linear regression following the same steps as you would for simple regression. The main difference is that your x array will now have two or more columns.


*  Steps 1 and 2: Import packages and classes, and provide data

First, you import numpy and sklearn.linear_model.LinearRegression and provide known inputs and output:

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

# define the data set
x = [
    [0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]
]
y = [4, 5, 20, 14, 32, 22, 38, 43]

# Convert the dataset into NumPy arrays
x, y = np.array(x), np.array(y)


Now let's print out x and y and see what they look like.

In [None]:
# Print out the values of x and y with labels
print("x =", x)
print("y =", y)

In multiple linear regression, x is a two-dimensional array with at least two columns, while y is usually a one-dimensional array. This is a simple example of multiple linear regression, and x has exactly two columns.

*   Step 3: Create a model and fit it

The next step is to create the regression model as an instance of LinearRegression and fit it with .fit():




In [None]:
model = LinearRegression().fit(x, y)

The result of this statement is the variable model referring to the object of type LinearRegression. It represents the regression model fitted with existing data.

*  Step 4: Get results.

<a id="51660e59-a871-4f02-b463-267147382f56"></a>
## Exercise 6: get results
<a href="#Overview">Return to overview</a>

Obtain the properties of the model the same way as in the case of simple linear regression:

In [None]:
# Answer
r_sq = model.score(x, y)

print(f"coefficient of determination: {r_sq}")

print(f"intercept: {model.intercept_}")

print(f"coefficients: {model.coef_}")

You obtain the value of 𝑅² using .score() and the values of the estimators of regression coefficients with .intercept_ and .coef_. Again, .intercept_ holds the bias 𝑏₀, while now .coef_ is an array containing 𝑏₁ and 𝑏₂.

In this example, the intercept is approximately 5.52, and this is the value of the predicted response when 𝑥₁ = 𝑥₂ = 0. An increase of 𝑥₁ by 1 yields a rise of the predicted response by 0.45. Similarly, when 𝑥₂ grows by 1, the response rises by 0.26.

*   Step 5: Predict response

Predictions also work the same way as in the case of simple linear regression:

<a id="46217d92-d820-453c-95d9-38839a993791"></a>
## Exercise 8: Generate prediction
<a href="#Overview">Return to overview</a>

Remember how we made predictions in simple linear regression?
Try using the method `model.predict()`.

In [None]:
# Answer
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")

The predicted response is obtained with .predict(), but you can also compute it by hand, as in the following example. Here, we predict the output values by multiplying each column of the input with the appropriate weight, summing the results, and adding the intercept to the sum.


In [None]:
y_pred = model.intercept_ + np.sum(model.coef_ * x, axis=1)
print(f"predicted response:\n{y_pred}")

Now let's apply this model to new data. Create an array of numbers from 0 to 9, incrementing by 1.  Then reshape the array into a 2D array with 2 columns. The -1 in the reshape function means that NumPy will automatically determine the number of rows based on the total number of elements in the array. Since there are 10 elements in the array and 2 columns specified, the resulting shape will be (5, 2).


In [None]:
x_new = np.arange(10).reshape((-1, 2))
print('new x:', x_new)

y_new = model.predict(x_new)
print (f"predicted y:\n{y_new}")

Plot the results

In [None]:
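# Predict with the original two-column input (x_ is not defined at this point)
y_predicted = model.predict(x)
plt.figure()
# x has two columns, so plot against the first predictor only
plt.plot(x[:, 0], y, '.', label='data')
plt.plot(x[:, 0], y_predicted, label='fit')
plt.xlabel('x1')
plt.legend();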

<a id="029cee58-3d17-418a-8bbe-9be1ee77a395"></a>
# Polynomial Regression With scikit-learn
<a href="#Overview">Return to overview</a>


Implementing polynomial regression with scikit-learn is very similar to linear regression. There’s only one extra step: you need to transform the array of inputs to include nonlinear terms such as 𝑥².

Now let's go over the basic steps for implementing polynomial regression using scikit-learn.

**Step 1: Import packages and classes**

In addition to numpy and sklearn.linear_model.LinearRegression, you should also import the class PolynomialFeatures from sklearn.preprocessing:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

The import is now done, and you have everything you need to work with.

**Step 2a: Provide data**

This step defines the input and output and is the same as in the case of linear regression:

In [None]:
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([15, 11, 2, 8, 25, 32])

In [None]:
plt.plot(x,y,'.')

Now you have the input and output in a suitable format. Keep in mind that you need the input to be a two-dimensional array. That’s why .reshape() is used.

**Step 2b: Transform input data**

This is the new step that you need to implement for polynomial regression!

As you learned earlier, you need to include 𝑥², and perhaps other terms, as additional features when implementing polynomial regression. For that reason, you should transform the input array x so that it contains additional columns with the values of 𝑥², and possibly more features.
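To see what the transformed input looks like, here is a minimal sketch that builds the extra 𝑥² column by hand with NumPy. The name `x_squared_added` is chosen for illustration; it is one possible manual approach, not the one used in the rest of this section.

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))

# Place the original inputs and their squares side by side
x_squared_added = np.hstack([x, x ** 2])

print(x_squared_added)
```

Each row now holds one observation as the pair (𝑥, 𝑥²), which is the layout a linear model needs in order to fit a quadratic curve.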

It’s possible to transform the input array in several ways, like using insert() from numpy. But the class PolynomialFeatures is very convenient for this purpose. Go ahead and create an instance of this class:

In [None]:
transformer = PolynomialFeatures(degree=2, include_bias=False)

The variable transformer refers to an instance of PolynomialFeatures that you can use to transform the input x.

You can provide several optional parameters to PolynomialFeatures:

*   degree is an integer (2 by default) that represents the degree of the polynomial regression function.
*   interaction_only is a Boolean (False by default) that decides whether to include only interaction features (True) or all features (False).
*   include_bias is a Boolean (True by default) that decides whether to include the bias, or intercept, column of 1 values (True) or not (False).

This example uses the default values of all parameters except include_bias. You’ll sometimes want to experiment with the degree of the function, and it can be beneficial for readability to provide this argument anyway.
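The effect of interaction_only is easiest to see with two input features. The sketch below transforms a single made-up observation (an assumption for illustration) with and without that option.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical observation with two features: x1 = 2, x2 = 3
sample = np.array([[2, 3]])

all_terms = PolynomialFeatures(degree=2, include_bias=False).fit_transform(sample)
cross_only = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(sample)

print(all_terms)    # columns: x1, x2, x1^2, x1*x2, x2^2
print(cross_only)   # columns: x1, x2, x1*x2
```

With interaction_only=True the pure powers 𝑥₁² and 𝑥₂² are dropped and only the product 𝑥₁𝑥₂ is kept alongside the original features.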

Before applying transformer, you need to fit it with .fit():

In [None]:
transformer.fit(x)

Once transformer is fitted, then it’s ready to create a new, modified input array. You apply .transform() to do that:

In [None]:
x_ = transformer.transform(x)

That’s the transformation of the input array with .transform(). It takes the input array as the argument and returns the modified array.

You can also use .fit_transform() to replace the three previous statements with only one:

In [None]:
x_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

With .fit_transform(), you’re fitting and transforming the input array in one statement. This method also takes the input array and effectively does the same thing as .fit() and .transform() called in that order. It also returns the modified array. This is how the new input array looks:

In [None]:
x_

The modified input array contains two columns: one with the original inputs and the other with their squares.

**Step 3: Create a model and fit it**

This step is also the same as in the case of linear regression. You create and fit the model:

In [None]:
model = LinearRegression().fit(x_, y)

The regression model is now created and fitted. It’s ready for application. You should keep in mind that the first argument of .fit() is the modified input array x_ and not the original x.

**Step 4: Get results**

You can obtain the properties of the model the same way as in the case of linear regression:

In [None]:
r_sq = model.score(x_, y)
print(f"coefficient of determination: {r_sq}")


print(f"intercept: {model.intercept_}")


print(f"coefficients: {model.coef_}")

Plot the results

In [None]:
y_predicted = model.predict(x_)
plt.figure()
plt.plot(x, y, '.', label='data')
plt.plot(x, y_predicted, label='fit')
plt.legend();

Again, .score() returns 𝑅². Its first argument is also the modified input x_, not x. The values of the weights are stored in .intercept_ and .coef_. Here, .intercept_ represents 𝑏₀, while .coef_ references the array that contains 𝑏₁ and 𝑏₂.

You can obtain a very similar result with different transformation and regression arguments:

In [None]:
x_ = PolynomialFeatures(degree=2, include_bias=True).fit_transform(x)

If you call PolynomialFeatures with the default parameter include_bias=True, or if you just omit it, then you’ll obtain the new input array x_ with the additional leftmost column containing only 1 values. This column corresponds to the intercept. This is how the modified input array looks in this case:



In [None]:
x_

The first column of x_ contains ones, the second has the values of x, while the third holds the squares of x.

The intercept is already included with the leftmost column of ones, and you don’t need to include it again when creating the instance of LinearRegression. Thus, you can provide fit_intercept=False. This is how the next statement looks:

In [None]:
model = LinearRegression(fit_intercept=False).fit(x_, y)

The variable model again corresponds to the new input array x_. Therefore, x_ should be passed as the first argument instead of x.

This approach yields the following results, which are similar to the previous case:

In [None]:
r_sq = model.score(x_, y)
print(f"coefficient of determination: {r_sq}")


print(f"intercept: {model.intercept_}")


print(f"coefficients: {model.coef_}")

Plot the results

In [None]:
y_predicted = model.predict(x_)
plt.figure()
plt.plot(x, y, '.', label='data')
plt.plot(x, y_predicted, label='fit')
plt.legend();

You see that now .intercept_ is zero, but .coef_ actually contains 𝑏₀ as its first element. Everything else is the same.
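The two variants can be compared side by side. This sketch refits both on the same data and checks that the estimated weights agree up to numerical precision; the names `model_a`, `model_b`, and `weights_a` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([15, 11, 2, 8, 25, 32])

# Variant A: no bias column, LinearRegression estimates the intercept
x_a = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model_a = LinearRegression().fit(x_a, y)

# Variant B: bias column of ones, intercept fitting switched off
x_b = PolynomialFeatures(degree=2, include_bias=True).fit_transform(x)
model_b = LinearRegression(fit_intercept=False).fit(x_b, y)

# Stack variant A's intercept in front of its coefficients and compare
weights_a = np.r_[model_a.intercept_, model_a.coef_]
print(np.allclose(weights_a, model_b.coef_))
```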

**Step 5: Predict response**

If you want to get the predicted response, just use .predict(), but remember that the argument should be the modified input x_ instead of the old x:

In [None]:
y_pred = model.predict(x_)
print(f"predicted response:\n{y_pred}")

As you can see, the prediction works almost the same way as in the case of linear regression. It just requires the modified input instead of the original.

You can apply an identical procedure if you have several input variables. You’ll have an input array with more than one column, but everything else will be the same. Here’s an example:

In [None]:
# Step 1: Import packages and classes
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Step 2a: Provide data
x = [
  [0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]
]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x, y = np.array(x), np.array(y)

# Step 2b: Transform input data
x_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

# Step 3: Create a model and fit it
model = LinearRegression().fit(x_, y)

# Step 4: Get results
r_sq = model.score(x_, y)
intercept, coefficients = model.intercept_, model.coef_

# Step 5: Predict response
y_pred = model.predict(x_)

This regression example yields the following results and predictions:

In [None]:
print(f"coefficient of determination: {r_sq}")


print(f"intercept: {intercept}")


print(f"coefficients:\n{coefficients}")



print(f"predicted response:\n{y_pred}")

In this case, there are six regression coefficients, including the intercept, as shown in the estimated regression function 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂ + 𝑏₃𝑥₁² + 𝑏₄𝑥₁𝑥₂ + 𝑏₅𝑥₂².

You can also notice that polynomial regression yielded a higher coefficient of determination than multiple linear regression for the same problem. At first, you could think that obtaining such a large 𝑅² is an excellent result. It might be.

However, in real-world situations, having a complex model and 𝑅² very close to one might also be a sign of overfitting. To check the performance of a model, you should test it with new data—that is, with observations not used to fit, or train, the model.
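One simple way to check for overfitting is to hold back part of the data and compare 𝑅² on the training and held-out observations. The sketch below does this on synthetic data (a noisy sine wave, assumed purely for illustration) for two polynomial degrees; the split into every second point is a simplification of a proper random split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy data (assumed for illustration): a sine wave plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30).reshape((-1, 1))
y = np.sin(2 * np.pi * x.ravel()) + rng.normal(scale=0.3, size=30)

# Hold out every second observation as unseen test data
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

scores = {}
for degree in (3, 9):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    fitted = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    scores[degree] = (
        fitted.score(poly.transform(x_train), y_train),  # R^2 on training data
        fitted.score(poly.transform(x_test), y_test),    # R^2 on held-out data
    )
    print(degree, scores[degree])
```

The training 𝑅² can only go up as the degree increases. A held-out 𝑅² that is clearly lower than the training 𝑅² suggests the model is fitting noise.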

<a id="15a57f7b-64a9-4fca-afd1-cb74f2ee57f2"></a>
# Perform polynomial regression on EEG data
<a href="#Overview">Return to overview</a>


Now for a more realistic example. We will apply polynomial regression to a 1-second segment of EEG data from the file eeg_data_new.csv.

EEG, or electroencephalography, records the electrical activity of the brain over time, typically capturing complex and dynamic patterns. What can polynomial regression do with EEG data? First, it can capture nonlinear patterns that linear models miss. Second, it can help reconstruct noisy or missing data points within the EEG signal, providing a smoother representation of brain activity. Finally, it can serve as a tool for feature extraction, deriving higher-order features from the original EEG data that may carry useful information for further analysis.

<a id="594bae76-7352-4287-b814-2b8496641941"></a>
## Exercise 7: polynomial regression with real data
<a href="#Overview">Return to overview</a>

We are going to fit a polynomial regression model to the EEG data. Let's recall the steps we will need to take:


*   Step 1: load the data.
*   Step 2: Prepare the data for polynomial regression.  
*   Step 3: Fit the polynomial regression model
*   Step 4: Evaluate the model.
*   Step 5: Visualize the results.



In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

Step 1: Load the EEG data

In [None]:
eeg_data = pd.read_csv("eeg_data_new.csv")

Step 2: Prepare the data for polynomial regression

In [None]:
# Answer
X = eeg_data['time'].values.reshape(-1, 1)  # Independent variable (time)
y = eeg_data['amplitude'].values            # Dependent variable (amplitude)

Step 3: Fit the polynomial regression model

In [None]:
degree = 50  # Degree of the polynomial, you can adjust it
# Answer
poly_features = PolynomialFeatures(degree=degree)
X_poly = poly_features.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)

Step 4: Evaluate the model

In [None]:
# Answer

from sklearn.metrics import r2_score

y_pred = model.predict(X_poly)  # Predicted values
r_squared = r2_score(y, y_pred)
r_sq = model.score(X_poly, y)
print(f"coefficient of determination: {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")

Step 5: Visualize the results

In [None]:
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, model.predict(X_poly), color='red', label='Polynomial Regression')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.title('Polynomial Regression on EEG Data')
plt.legend()
plt.show()



Does your polynomial regression underfit or overfit the data?

<a id="036711d1-4012-4e7d-bab1-a0eb3a5da7d4"></a>
# Polynomial Regression With statsmodels
<a href="#Overview">Return to overview</a>


You can implement linear regression in Python by using the package statsmodels as well. Typically, this is desirable when you need more detailed results.

The procedure is similar to that of scikit-learn. We will keep working on the eeg data that we used in the last section.

**Step 1: Import packages**

First you need to do some imports. In addition to numpy, you need to import statsmodels.api:

In [None]:
import numpy as np
import statsmodels.api as sm

**Step 2: Provide data and transform inputs**

You can provide the inputs and outputs the same way as you did when you were using scikit-learn:

In [None]:
# Load the EEG data from the CSV file
eeg_data = pd.read_csv('eeg_data_new.csv')

# Prepare the data for polynomial regression
X = eeg_data['time']  # Independent variable (time)
y = eeg_data['amplitude']  # Dependent variable (amplitude)

The input and output arrays are created, but the job isn’t done yet.

You need to add the column of ones to the inputs if you want statsmodels to calculate the intercept 𝑏₀. It doesn’t take 𝑏₀ into account by default. This is just one function call:

In [None]:
# Add a constant to the independent variable for the intercept term
X = sm.add_constant(X)

**Step 3: Create a model and fit it**

The regression model based on ordinary least squares is an instance of the class statsmodels.regression.linear_model.OLS. This is how you can obtain one:

In [None]:
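# Fit the polynomial regression model. The design matrix holds time**0 (the
# intercept column of ones), time**1, ..., time**degree. Only the 'time'
# column is raised to powers, so the constant column is not duplicated.
degree = 30
poly = sm.OLS(y, np.column_stack([X['time']**i for i in range(degree+1)]))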


You should be careful here! Notice that the first argument is the output, followed by the input. This is the opposite order of the corresponding scikit-learn functions.

Once your model is created, then you can apply .fit() on it:

In [None]:
result = poly.fit()

By calling .fit(), you obtain the variable result, which is an instance of the class statsmodels.regression.linear_model.RegressionResultsWrapper. This object holds a lot of information about the regression model.

**Step 4: Get results**

The variable result refers to the object that contains detailed information about the results of linear regression. Explaining these results is far beyond the scope of this tutorial, but you’ll learn here how to extract them.

You can call .summary() to get the table with the results of linear regression:

In [None]:
# Print the summary of the regression results
result.summary()

This table is very comprehensive. You can find many statistical values associated with linear regression, including 𝑅² and the estimated coefficients 𝑏₀, 𝑏₁, and so on, one for each column of the input matrix.

If your dataset has fewer than 20 observations, you might obtain a warning saying kurtosistest only valid for n>=20. This is due to the small number of observations.

You can extract any of the values from the table above. Here’s an example:

In [None]:
print(f"coefficient of determination: {result.rsquared}")


print(f"adjusted coefficient of determination: {result.rsquared_adj}")


print(f"regression coefficients: {result.params}")

That’s how you obtain some of the results of linear regression:

*   .rsquared holds 𝑅².
*   .rsquared_adj represents adjusted 𝑅², that is, 𝑅² corrected according to the number of input features.
*   .params refers to the array with the estimated coefficients 𝑏₀, 𝑏₁, and so on.

If you fit a polynomial of the same degree with scikit-learn, you should obtain very similar results for the same problem.
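For reference, adjusted 𝑅² can be computed from 𝑅² with the usual formula 1 − (1 − 𝑅²)(𝑛 − 1)/(𝑛 − 𝑝 − 1), where 𝑛 is the number of observations and 𝑝 the number of predictors, assuming the model includes an intercept. The helper below is a hypothetical sketch with made-up numbers, not a statsmodels function.

```python
def adjusted_r_squared(r_squared, n_obs, n_features):
    """Adjust R^2 for the number of predictors (intercept not counted)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)

# Hypothetical numbers: R^2 = 0.90 from 11 observations and 2 predictors
adj = adjusted_r_squared(0.90, n_obs=11, n_features=2)
print(f"adjusted R^2 = {adj:.3f}")
```

Adding predictors shrinks the denominator, so the adjusted value drops unless the extra features raise 𝑅² enough to compensate.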

**Step 5: Predict response**

You can obtain the predicted response on the input values used for creating the model using .fittedvalues or .predict() with the input array as the argument:

In [None]:
# Predict the values using the fitted model
y_pred = result.predict()
print(f"predicted response:\n{y_pred}")

**Step 6: Visualize the data**

In [None]:
# Visualize the results
plt.scatter(X['time'], y, color='blue', label='Data')
plt.plot(X['time'], y_pred, color='red', label='Polynomial Regression')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.title('Polynomial Regression using Statsmodels')
plt.legend()
plt.show()