# Simple Linear Regression

In [32]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression


## Explore the dataframe


To import a CSV file into a DataFrame using the `pandas` library, you can specify the file path in two ways:

1. **Local Path**: The CSV file is stored locally on your computer.

2. **URL**: The CSV file is hosted online and accessible via a URL.

The cell below shows both: the local-path version is commented out, and the URL version is active.


In [33]:
# Importing the Advertising dataset

# advertising = pd.read_csv('data/Advertising.csv', usecols=[1,2,3,4])
advertising = pd.read_csv('https://raw.githubusercontent.com/vincenzorrei/EDU-Datasets/refs/heads/main/Advertising.csv', usecols=[1,2,3,4])
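The effect of `usecols=[1,2,3,4]` can be sketched on a small in-memory CSV. This is a minimal illustration that assumes the file starts with an unnamed row-index column (which the choice of columns 1 to 4 suggests); the rows below are made up and are not taken from the real dataset.

```python
import io
import pandas as pd

# Toy CSV with an unnamed leading index column (made-up rows, not the real data)
csv_text = """,TV,radio,newspaper,sales
1,100.0,20.0,30.0,15.0
2,50.0,10.0,5.0,8.0
"""

# Without usecols, the leading index column is read as an extra column
with_index = pd.read_csv(io.StringIO(csv_text))

# With usecols=[1, 2, 3, 4], only the four data columns are kept
without_index = pd.read_csv(io.StringIO(csv_text), usecols=[1, 2, 3, 4])

print(list(with_index.columns))
print(list(without_index.columns))
```

Selecting columns by position keeps the DataFrame free of the redundant index column, so later calls such as `corr()` only see the four variables of interest.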


To explore the initial rows of the `advertising` DataFrame, you can use the `head()` method:


In [None]:
# Explore the data
advertising.head()


To generate a summary table of descriptive statistics for the `advertising` dataset, you can use the `describe()` method from the pandas library:

In [None]:
# Generate descriptive statistics of the advertising dataset
advertising.describe()


In [None]:
# Correlation matrix
correlation_matrix = advertising.corr()
correlation_matrix


In [None]:
# Correlation matrix heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix,
            annot=True,
            cmap='coolwarm',
            linewidths=0.5,
            vmin=-1,
            vmax=1)
plt.title('Correlation Matrix of Advertising Data')
plt.show()


The cell below uses Seaborn's `regplot` function to draw a scatter plot with a fitted regression line, showing the relationship between radio advertising expenditures and sales:

In [None]:
# Create a scatter plot with a regression line
sns.regplot(
    x=advertising.radio,
    y=advertising.sales,
    order=1,  # Polynomial order of the fit: 1 (the default) is a straight line; higher orders risk overfitting
    ci=None,  # Omits the confidence interval for the regression line
    scatter_kws={'color': 'r', 's': 9}  # Sets the color and size of the scatter plot points
)

# Display the plot
plt.show()


## Parameter Estimation



In this section, we will estimate parameters using two prominent Python libraries:

- **statsmodels**
- **scikit-learn**

**Comparison between scikit-learn and statsmodels:**

- **scikit-learn**: Primarily designed for predictive modeling, scikit-learn focuses on model performance metrics. It does not automatically provide statistical details such as t-values, p-values, or standard errors. To obtain these, one would need to use statsmodels or additional statistical methods.

- **statsmodels**: Tailored for statistical analysis, statsmodels offers comprehensive outputs, including detailed statistical tests and metrics, making it suitable for in-depth data exploration and inferential statistics.
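
Both libraries solve the same least-squares problem. For a single predictor, the estimates have a closed form: the slope is the ratio of the centered cross-product sum to the centered sum of squares of `x`, and the intercept follows from the means. The sketch below illustrates this with NumPy only; the data are made-up toy values and the names `b0` and `b1` are chosen here for illustration.

```python
import numpy as np

# Made-up toy data, used only to illustrate the formulas
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates for y = b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Cross-check against NumPy's degree-1 polynomial fit
# (np.polyfit returns coefficients from highest degree to lowest)
slope_check, intercept_check = np.polyfit(x, y, deg=1)

print(np.isclose(b1, slope_check), np.isclose(b0, intercept_check))
```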


### 1) Statsmodels library



In this section, we will perform an Ordinary Least Squares (OLS) regression analysis to examine the relationship between sales and radio advertising expenditures using the `statsmodels` library in Python.

**Explanation:**

- **Model Specification**: The formula `'sales ~ radio'` indicates that we are modeling `sales` as a function of `radio` advertising expenditures, where `sales` is the dependent variable and `radio` is the independent variable.

- **Fitting the Model**: The `fit()` method estimates the parameters of the OLS regression model using the provided `advertising` dataset.

- **Displaying the Summary Table**: The `summary()` method generates a comprehensive summary of the regression results. Accessing `tables[1]` specifically retrieves the table containing the estimated coefficients, their standard errors, t-values, and associated p-values.

This analysis helps us understand the influence of radio advertising on sales by quantifying the relationship between the two variables.

In [None]:
# Fit the OLS model
est = smf.ols('sales ~ radio', data=advertising).fit()

# Display the summary table of regression coefficients
est.summary().tables[1]
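
The quantities shown in the coefficient table are also available as attributes of the fitted results object, which is convenient when they are needed programmatically. The following is a minimal sketch on synthetic data (the `toy` DataFrame, its true coefficients, and the name `toy_est` are assumptions made for this example, not values from the Advertising dataset).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known linear relationship plus noise (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'radio': rng.uniform(0, 50, size=200)})
toy['sales'] = 9.0 + 0.2 * toy['radio'] + rng.normal(0, 1, size=200)

toy_est = smf.ols('sales ~ radio', data=toy).fit()

# Rebuild the main columns of the coefficient table from result attributes
coef_table = pd.DataFrame({
    'coef': toy_est.params,      # estimated coefficients
    'std err': toy_est.bse,      # standard errors
    't': toy_est.tvalues,        # t-statistics
    'P>|t|': toy_est.pvalues,    # two-sided p-values
})
print(coef_table)
```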


In [40]:
# Store the regression coefficients
statsmodels_slope = est.params['radio']
statsmodels_intercept = est.params['Intercept']


### 2) Scikit-learn




Scikit-learn, often abbreviated as `sklearn`, is a powerful open-source Python library designed for machine learning and data analysis. Built upon foundational libraries like NumPy and SciPy, it offers efficient implementations of a wide range of machine learning algorithms.

**Explanation:**

- **Data Preparation**: The predictor variable `radio` is extracted from the `advertising` DataFrame and reshaped into a 2D array using `reshape(-1, 1)`, as scikit-learn's `LinearRegression` expects the input features in this format. The target variable `sales` is assigned to `y`.

- **Model Initialization and Fitting**: An instance of `LinearRegression` is created and fitted to the data using `model.fit(X, y)`, which computes the best-fit line for predicting `sales` based on `radio` advertising expenditures.

- **Retrieving Parameters**: The intercept and slope of the regression line are obtained from `model.intercept_` and `model.coef_[0]`, respectively.


Both libraries solve the same least-squares problem, so their estimates should agree up to floating-point precision. The cells below fit the scikit-learn model and then print both sets of estimates, rounded to two decimals, for a side-by-side check.

In [None]:

# Reshape the predictor variable
X = advertising['radio'].values.reshape(-1, 1)
y = advertising['sales']

# Initialize and fit the linear regression model
model = LinearRegression()
model.fit(X, y)
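
The reshape step used above can be seen in isolation on a small array. This sketch uses made-up values and only NumPy; it shows how `reshape(-1, 1)` turns a 1-D array into a single-column 2-D array.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])  # 1-D array, shape (3,)
X_col = x.reshape(-1, 1)          # 2-D column; -1 lets NumPy infer the row count

print(x.shape, X_col.shape)
```

An alternative that avoids the reshape is to select the column with double brackets, for example `advertising[['radio']]`, which returns a one-column DataFrame that is already two-dimensional.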


In [42]:
# Accessing the slope (coefficient) and intercept
sklearn_slope = model.coef_[0]
sklearn_intercept = model.intercept_


In [None]:
print(f'Sklearn - Intercept: {sklearn_intercept:.2f}\nSklearn - Slope: {sklearn_slope:.2f}\n')
print(f'Statsmodels - Intercept: {statsmodels_intercept:.2f}\nStatsmodels - Slope: {statsmodels_slope:.2f}')
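
Rounding to two decimals can hide small discrepancies, and exact equality can fail on values that differ only by floating-point rounding. One possible stricter check is a tolerance-based comparison with `np.isclose`. The helper `estimates_match` below is a hypothetical name introduced for this sketch, and the default tolerance is an assumption.

```python
import numpy as np

def estimates_match(a, b, rtol=1e-8):
    """Return True when two estimates agree within a relative tolerance."""
    return bool(np.isclose(a, b, rtol=rtol))

# Two values that differ only by floating-point rounding
print(0.1 + 0.2 == 0.3)                 # exact comparison
print(estimates_match(0.1 + 0.2, 0.3))  # tolerance-based comparison
```

In this notebook, the same helper could be called as `estimates_match(sklearn_slope, statsmodels_slope)` and `estimates_match(sklearn_intercept, statsmodels_intercept)`.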


### Inspect Regression results

# Multiple Linear Regression


In [None]:
# Parameter estimation with statsmodels (all three predictors)
est = smf.ols('sales ~ TV + radio + newspaper', advertising).fit()
est.summary()


In [None]:
# Parameter estimation with statsmodels (newspaper dropped)
est = smf.ols('sales ~ TV + radio', advertising).fit()
est.summary()


In [None]:
# Parameters estimation with sklearn
regr = LinearRegression()

X = advertising[['radio', 'TV']]
y = advertising.sales

regr.fit(X,y)

# Create a coordinate grid
Radio = np.arange(0,50)
TV = np.arange(0,300)

B1, B2 = np.meshgrid(Radio, TV, indexing='xy')
# Evaluate the fitted plane on the grid (coef_ order matches X: radio, then TV)
Z = regr.intercept_ + B1 * regr.coef_[0] + B2 * regr.coef_[1]

# Create plot
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')
fig.suptitle('Regression: Sales ~ Radio + TV Advertising', fontsize=20)

ax.plot_surface(B1, B2, Z, rstride=1, cstride=1, alpha=0.4)
ax.scatter(advertising['radio'], advertising['TV'], advertising['sales'], c='r')

ax.set_xlabel('Radio')
ax.set_xlim(0, 50)
ax.set_ylabel('TV')
ax.set_ylim(0, 300)
ax.set_zlabel('Sales')

plt.show()
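
The plane fitted above is the least-squares solution for a design matrix with an intercept column and two predictor columns. As a rough illustration, the sketch below recovers known coefficients from noise-free synthetic data using `np.linalg.lstsq`. The data, the true coefficients, and the variable names are assumptions made for this example and are unrelated to the Advertising dataset.

```python
import numpy as np

# Synthetic, noise-free data generated from a known plane (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 50, size=100)    # plays the role of radio
x2 = rng.uniform(0, 300, size=100)   # plays the role of TV
y = 3.0 + 0.2 * x1 + 0.05 * x2

# Design matrix: a column of ones for the intercept, then the two predictors
A = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution: [intercept, coefficient of x1, coefficient of x2]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 3))
```

Because the synthetic data contain no noise, the recovered coefficients should match the ones used to generate `y` up to floating-point precision.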
