# Linear Regression 

Use the iris dataset (from seaborn) to measure the association between continuous variables.

In [2]:
import seaborn as sns

iris = sns.load_dataset('iris')
iris.head(3)

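|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1          | 3.5         | 1.4          | 0.2         | setosa  |
| 1 | 4.9          | 3.0         | 1.4          | 0.2         | setosa  |
| 2 | 4.7          | 3.2         | 1.3          | 0.2         | setosa  |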
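
For two continuous variables, the simplest measure of linear association is the Pearson correlation coefficient r. With a single predictor and an intercept, the R² of an ordinary least squares fit equals r². The sketch below checks this numerically with NumPy. The data are synthetic stand-ins (an assumption made so that the block runs without downloading iris), and the variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-in data (not the iris measurements): y is linear in x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 7.0, size=100)
y = 4.3 + 0.4 * x + rng.normal(0.0, 0.4, size=100)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(x, y)[0, 1]

# Ordinary least squares fit of y on x (degree-1 polynomial, highest power first)
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# Coefficient of determination: 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"r = {r:.3f}, r^2 = {r**2:.3f}, R^2 = {r_squared:.3f}")
```

The identity only holds for one predictor. With several predictors, as in the models in this notebook, R² summarizes the fit of the whole model rather than a single pairwise correlation.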


## 1. Statsmodels

### 1.1. Formula API

In [15]:
import statsmodels.formula.api as smf

mod = smf.ols(formula='sepal_length ~ petal_length + sepal_width*petal_width', data=iris)

res = mod.fit()

print(res.summary())

                            OLS Regression Results                            
Dep. Variable:           sepal_length   R-squared:                       0.860
Model:                            OLS   Adj. R-squared:                  0.856
Method:                 Least Squares   F-statistic:                     222.0
Date:                Tue, 02 May 2023   Prob (F-statistic):           9.58e-61
Time:                        19:21:14   Log-Likelihood:                -36.785
No. Observations:                 150   AIC:                             83.57
Df Residuals:                     145   BIC:                             98.62
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                 
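
In the formula above, `sepal_width*petal_width` is shorthand for `sepal_width + petal_width + sepal_width:petal_width`, that is, both main effects plus their interaction. Together with `petal_length` this gives four predictors plus an intercept, which matches `Df Model: 4` and `Df Residuals: 145` (150 observations minus 5 parameters) in the summary. The following is a minimal NumPy sketch of the equivalent design matrix. It uses synthetic stand-in columns and made-up coefficients (both assumptions, not the iris values) and is not how statsmodels builds the matrix internally.

```python
import numpy as np

# Synthetic stand-in columns (hypothetical values, not the iris data)
rng = np.random.default_rng(0)
n = 150
petal_length = rng.uniform(1.0, 7.0, n)
sepal_width = rng.uniform(2.0, 4.5, n)
petal_width = rng.uniform(0.1, 2.5, n)

# Made-up coefficients, used to build a noise-free response that the fit can recover
beta_true = np.array([2.0, 0.7, 0.5, -0.6, 0.1])

# Columns implied by 'y ~ petal_length + sepal_width*petal_width':
# intercept, petal_length, sepal_width, petal_width, sepal_width:petal_width
X = np.column_stack([
    np.ones(n),
    petal_length,
    sepal_width,
    petal_width,
    sepal_width * petal_width,
])
y = X @ beta_true

# Ordinary least squares solution
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(X.shape)
print(np.round(beta_hat, 3))
```

Because the response here has no noise, the estimated coefficients match the made-up ones up to floating-point error.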

## 2. Sklearn

For this example, we'll split the data into training and test sets so that we can predict values on held-out data.

In [10]:
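# Build the target vector y and the feature matrix X
import numpy as np

y = iris['sepal_length'].to_numpy()
X = iris[['petal_length','sepal_width','petal_width']].to_numpy()
# Prepend a column of ones. LinearRegression fits its own intercept by default
# (fit_intercept=True), so this column is redundant and its coefficient comes out as 0.
X = np.hstack((np.ones((iris.shape[0], 1)), X))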

# Split the data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
    random_state=13, 
    test_size=0.2, 
    shuffle=True)

# Fit a linear regression model
from sklearn.linear_model import LinearRegression

mdl = LinearRegression()
mdl.fit(X_train, y_train)
print(f"coefficients: {mdl.coef_}")

# Predict held-out values and assess performance using R-squared
from sklearn.metrics import r2_score

y_pred = mdl.predict(X_test)
mdl_r2 = r2_score(y_test, y_pred)

print(f"Prediction for held out data: R\N{SUPERSCRIPT TWO} = {mdl_r2}")


coefficients: [ 0.          0.70669925  0.63118587 -0.56821882]
Prediction for held out data: R² = 0.7923918759481171
0.8583556494018153
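
The first coefficient printed above is 0 because the column of ones carries no information once the model fits its own intercept. One way to see this is the sketch below. It assumes the estimator centers the features and target before solving (my understanding of what `fit_intercept=True` does, stated here as an assumption) and reproduces that step with NumPy on synthetic data and made-up coefficients.

```python
import numpy as np

# Synthetic stand-in data with a known intercept and slopes (hypothetical values)
rng = np.random.default_rng(13)
n = 120
features = rng.normal(size=(n, 3))
true_slopes = np.array([0.7, 0.6, -0.5])
y = 1.5 + features @ true_slopes  # noise-free response

# Same construction as in the cell above: prepend a column of ones
X = np.hstack((np.ones((n, 1)), features))

# Centering turns the constant column into a column of zeros
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()

# lstsq returns the minimum-norm solution, so the all-zero column gets coefficient 0
coef, *_ = np.linalg.lstsq(X_centered, y_centered, rcond=None)

# The intercept is recovered from the means after the slopes are known
intercept = y.mean() - X.mean(axis=0) @ coef

print(np.round(coef, 3))
print(round(float(intercept), 3))
```

In practice this means the `np.hstack` step can be dropped when using `LinearRegression` with its default settings. The fitted intercept is then available as `mdl.intercept_`.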


In [5]:
print(mdl.score.__doc__)

Return the coefficient of determination of the prediction.

        The coefficient of determination :math:`R^2` is defined as
        :math:`(1 - \frac{u}{v})`, where :math:`u` is the residual
        sum of squares ``((y_true - y_pred)** 2).sum()`` and :math:`v`
        is the total sum of squares ``((y_true - y_true.mean()) ** 2).sum()``.
        The best possible score is 1.0 and it can be negative (because the
        model can be arbitrarily worse). A constant model that always predicts
        the expected value of `y`, disregarding the input features, would get
        a :math:`R^2` score of 0.0.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Test samples. For some estimators this may be a precomputed
            kernel matrix or a list of generic objects instead with shape
            ``(n_samples, n_samples_fitted)``, where ``n_samples_fitted``
            is the number of samples used in the fitting for the estimato
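
The definition in the docstring above can be written out directly. The sketch below implements 1 - u/v with NumPy and checks the three cases the docstring mentions: a perfect prediction, a constant prediction at the mean, and a prediction worse than the mean. The function name `r2` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - u/v, as defined in the docstring."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1.0 - u / v

# Illustrative values only
y_true = np.array([5.1, 4.9, 6.3, 5.8, 7.1])

perfect = r2(y_true, y_true)                                # exact predictions
constant = r2(y_true, np.full_like(y_true, y_true.mean()))  # always predict the mean
worse = r2(y_true, y_true[::-1])                            # reversed order, a poor prediction

print(perfect, constant, worse)
```

A negative value, as in the last case, means the predictions have a larger squared error than simply predicting the mean of `y_true`.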

## 3. Keras

In [2]:
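# Load the data first; `iris` must be defined in this session before sampling
import seaborn as sns

iris = sns.load_dataset('iris')

# Hold out 20% of the rows: sample 80% for training, keep the rest for testing
train_dataset = iris.sample(frac=0.8, random_state=0)
test_dataset = iris.drop(train_dataset.index)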