# Working with Linear Models the Easy Way

## Linear regression fit

In [1]:
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

from sklearn.datasets import load_boston

from sklearn.preprocessing import scale

boston = load_boston()
print("\nShape:", boston.data.shape)

X = scale(boston.data)

y = boston.target



Shape: (506, 13)


## Preprocessing: "scale"

The Boston housing dataset, loaded here with `load_boston`, describes housing in the Boston area. It has 506 rows, one per census tract, each with 13 features such as the per-capita crime rate, the average number of rooms per dwelling, and the weighted distance to employment centers.

`scale` from the `sklearn.preprocessing` module standardizes data. It takes an input array and returns a new array in which each feature (column) is centered at zero and scaled to a standard deviation of one.

So, `scale(boston.data)` standardizes the 13 features of the Boston housing dataset. This is a common preprocessing step. For a plain least-squares fit like the one below it does not change the predictions, but it puts every coefficient on the same scale, so coefficients can be compared across features. For regularized or gradient-based models, such as the `Ridge` and `SGDRegressor` examples later on, it also keeps features with large magnitudes from dominating the penalty or destabilizing the updates.
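To make the effect of standardization concrete, here is a minimal sketch on a tiny made-up array (the `toy` values are hypothetical, not taken from the Boston data). It compares `scale` with the same calculation done by hand.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical toy data: two features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

scaled = scale(toy)

# The same thing by hand: subtract each column's mean,
# then divide by each column's (population) standard deviation.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled.mean(axis=0))          # approximately 0 for each column
print(scaled.std(axis=0))           # approximately 1 for each column
print(np.allclose(scaled, manual))  # the two calculations agree
```

After scaling, both columns hold the same values even though the raw numbers differed by a factor of a hundred or more.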

## Deprecation

<mark>Function load_boston is deprecated;</mark> `load_boston` is deprecated in 1.0 and will be removed in 1.2.

The Boston housing prices dataset has an ethical problem. You can refer to
the documentation of this function for further details.

The scikit-learn maintainers therefore strongly discourage the use of this
dataset unless the purpose of the code is to study and educate about
ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source:

```py
import pandas as pd
import numpy as np


data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)  # raw string avoids an invalid-escape warning
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
```

Alternative datasets include the California housing dataset (i.e.
`sklearn.datasets.fetch_california_housing`) and the Ames housing
dataset. You can load the datasets as follows:

```py
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
```

for the California housing dataset and:

```py
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
```

for the Ames housing dataset.


In [2]:
from sklearn.linear_model import LinearRegression

regression = LinearRegression()
regression.fit(X, y)


LinearRegression()

## R-squared

In [3]:
import numpy as np

mean_y = np.mean(y)

squared_errors_mean = np.sum((y - mean_y) ** 2)

squared_errors_model = np.sum((y - regression.predict(X)) ** 2)

R2 = 1 - (squared_errors_model / squared_errors_mean)

print("\nR squared:", R2)



R squared: 0.7406426641094095
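To get a feel for the scale of R squared, here is a small sketch with made-up numbers (`y_true` is hypothetical). It uses `r2_score` from `sklearn.metrics`, which applies the same formula as the cell above: one minus the model's squared error divided by the squared error of always predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical target values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions leave no error at all.
perfect = r2_score(y_true, y_true)

# Always predicting the mean is the baseline R squared compares against.
baseline = r2_score(y_true, np.full(4, y_true.mean()))

# Predictions worse than the mean baseline push R squared below zero.
worse = r2_score(y_true, y_true[::-1])

print(perfect, baseline, worse)
```

So a value of 1 means a perfect fit, 0 means no better than the mean, and negative values mean worse than the mean.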


### [sklearn Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [4]:
print("\nScore:", regression.score(X, y))  # Oh look!  It's the same.



Score: 0.7406426641094095


In [5]:
print([a + ": " + str(round(b, 1)) for a, b in
       zip(boston.feature_names, regression.coef_)])


['CRIM: -0.9', 'ZN: 1.1', 'INDUS: 0.1', 'CHAS: 0.7', 'NOX: -2.1', 'RM: 2.7', 'AGE: 0.0', 'DIS: -3.1', 'RAD: 2.7', 'TAX: -2.1', 'PTRATIO: -2.1', 'B: 0.8', 'LSTAT: -3.7']


## What the hecc was that?

This code creates a list of strings that contain the **name of each feature** in the Boston housing dataset and its **corresponding coefficient value** from a regression model.

More specifically, the `zip` function takes two iterables, `boston.feature_names` and `regression.coef_`, and creates tuples from them, pairing the corresponding elements from each iterable.

Here, `boston.feature_names` is a NumPy array holding the names of the 13 features in the Boston housing dataset, and `regression.coef_` is a NumPy array holding the fitted regression coefficients, one per feature and in the same order.

The resulting tuples are then used in a list comprehension to create a list of strings. 

Each string contains the name of a **feature**, followed by a **colon**, and then the **coefficient value** for that feature rounded to one decimal place.

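The `+` operator concatenates the parts of each string together. The `str` call is needed because `round` returns a number, and `+` cannot join a string to a number.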

Finally, the resulting list of strings is printed to the console. This can be useful for interpreting the results of a regression model and understanding which features are most strongly associated with the target variable.
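The same pattern can be tried on a small scale. This is only a sketch: `names` and `coefs` are hypothetical stand-ins for `boston.feature_names` and `regression.coef_`, and an f-string replaces the `+` concatenation.

```python
# Hypothetical stand-ins for boston.feature_names and regression.coef_.
names = ["rooms", "age", "distance"]
coefs = [2.74, 0.02, -3.10]

# zip pairs the first name with the first coefficient, and so on.
pairs = list(zip(names, coefs))

# One string per pair, with the coefficient rounded to one decimal place.
labels = [f"{name}: {round(coef, 1)}" for name, coef in pairs]

print(pairs)
print(labels)
```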

## What's a coefficient?

In the context of a regression model, a coefficient refers to the value that represents the relationship between a predictor variable and the target variable.

In linear regression, for example, a coefficient represents the slope of the line that best fits the data. It indicates the change in the value of the target variable for a one-unit change in the predictor variable, assuming all other variables are held constant.

Here, `regression.coef_` holds the coefficients estimated by the model fitted earlier on the Boston housing dataset. Each coefficient is the change in the predicted housing price for a one-unit change in that feature, with the other features held fixed. Because the features were standardized with `scale`, one unit means one standard deviation. That puts the coefficients on a common scale, so their magnitudes give a rough indication of how much each feature matters to the model's predictions.
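The "one-unit change" reading can be checked directly. The sketch below uses synthetic data built with known slopes (3 and -2) and a known intercept (5); all names and numbers are hypothetical and unrelated to the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with known slopes and intercept (hypothetical).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = 5.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)
print(model.coef_)       # close to the slopes used to build y_demo
print(model.intercept_)  # close to the intercept used to build y_demo

# Raise the first feature by one unit and hold the second fixed.
point = np.array([[0.5, 1.0]])
shifted = point + np.array([[1.0, 0.0]])
delta = model.predict(shifted) - model.predict(point)
print(delta)             # matches the first coefficient
```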

## Order the list of feature names and coefficient values by descending absolute coefficient value

To order the feature names and coefficients from the largest to the smallest absolute coefficient, use the `sorted` function with a `key` that returns the absolute value of the coefficient, together with `reverse=True`:

In [6]:
coef_list = [(a, b) for a, b in zip(boston.feature_names, regression.coef_)]
sorted_coef_list = sorted(coef_list, key=lambda x: abs(x[1]), reverse=True)

for feature, coef in sorted_coef_list:
    print(f"{feature}: {coef:.1f}")


LSTAT: -3.7
DIS: -3.1
RM: 2.7
RAD: 2.7
TAX: -2.1
PTRATIO: -2.1
NOX: -2.1
ZN: 1.1
CRIM: -0.9
B: 0.8
CHAS: 0.7
INDUS: 0.1
AGE: 0.0


## Here's what's happening:

1. First, a list of tuples is created containing the feature names and corresponding coefficient values, using a list comprehension: `[(a, b) for a, b in zip(boston.feature_names, regression.coef_)]`
2. Next, the `sorted` function is called on the list of tuples. The `key` argument is a `lambda` function that takes a tuple and returns the absolute value of its second element (the coefficient), so the tuples are compared by coefficient magnitude. On its own that would sort from smallest to largest, so `reverse=True` flips the order to descending: `sorted_coef_list = sorted(coef_list, key=lambda x: abs(x[1]), reverse=True)`
3. Finally, a `for` loop is used to iterate through the sorted list and print out each feature name and coefficient value, using an f-string to format the output: `for feature, coef in sorted_coef_list: print(f"{feature}: {coef:.1f}")`

This will print out the feature names and corresponding coefficients in descending order of absolute value of the coefficients, making it easy to see which features have the strongest association with the target variable.

## What is a lambda function?

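A lambda function is a small, anonymous function in Python, written as a single expression in the form `lambda arguments: expression`. In the sorting cell above, `lambda x: abs(x[1])` takes one tuple `x` and returns the absolute value of its second element.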
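As an illustration, here is a lambda next to the equivalent named function. The `pairs` list and the helper name `abs_of_second` are hypothetical.

```python
# Hypothetical (name, coefficient) pairs.
pairs = [("a", 0.5), ("b", -3.0), ("c", 2.0)]

# A named function that returns the absolute value of the second element.
def abs_of_second(pair):
    return abs(pair[1])

# The same logic written inline as a lambda.
sorted_with_lambda = sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)
sorted_with_def = sorted(pairs, key=abs_of_second, reverse=True)

print(sorted_with_lambda)
print(sorted_with_lambda == sorted_with_def)
```

The lambda saves defining a function that is only used once, as a sort key.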

## One hot encoding

In [7]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

lbl = LabelEncoder()
enc = OneHotEncoder()

# qualitative data
arr = ['red', 'red', 'green', 'blue',
       'red', 'blue', 'blue', 'green']

labels = lbl.fit_transform(arr).reshape(8, 1)

print("\nEncoded:\n", enc.fit_transform(labels).toarray())



Encoded:
 [[0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]


# Logistic regression

In [8]:
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
c = np.array([1, 2, 3, 4, 5, 6, 7, 8])
print("a shape:", a.shape)

b = c.reshape(8, 1)
print("b reshape:", b.shape)

from sklearn.linear_model import LinearRegression

regression = LinearRegression()
regression.fit(b, a)

b_pred = regression.predict(b) > 0.5  # ndarray, shape (8,)

print("\nPredict b > 0.5:\n", b_pred)


a shape: (8,)
b reshape: (8, 1)

Predict b > 0.5:
 [False False False False  True  True  True  True]


### Why?

In the call `regression.fit(b, a)`, the first argument `b` holds the feature values for a set of examples, and the second argument `a` holds the target values for those examples.

The shape of an array tells us its size along each dimension. So `(8, 1)` means a two-dimensional array with 8 rows and 1 column, while `(8,)` means a one-dimensional array with 8 elements (it has no rows or columns at all).

scikit-learn expects the features as a two-dimensional array of shape `(n_samples, n_features)`, which is why `c` is reshaped into `b` with shape `(8, 1)`: 8 examples, each with 1 feature. The targets in `a` stay one-dimensional, with shape `(8,)`: one value per example.

`fit` does not return an array. It learns a mathematical relationship between the features in `b` and the targets in `a`, and returns the fitted `LinearRegression` object itself. The predictions come from `regression.predict(b)`, which returns an array of shape `(8,)`.

That array holds one predicted target value per row of `b`. Comparing it with `0.5` turns each prediction into `True` or `False`, which is how this cell uses a linear regression as a crude classifier.

So, in summary, the features go in as a two-dimensional array, the targets go in as a one-dimensional array, and the predictions come back as a one-dimensional array with one value per example.
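A minimal sketch of the shapes involved, re-creating the arrays from the cell above under new, hypothetical names (`feature`, `target`, `model`). It checks what `fit` and `predict` each return, and shows that a one-dimensional feature array is rejected.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

target = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 1-D, shape (8,)
feature = np.arange(1, 9).reshape(-1, 1)      # 2-D, shape (8, 1)

model = LinearRegression()
fitted = model.fit(feature, target)    # fit returns the estimator itself
predictions = model.predict(feature)   # predict returns a 1-D array

print(fitted is model)
print(feature.shape, target.shape, predictions.shape)

# Features must be 2-D: a 1-D feature array raises a ValueError.
rejected = False
try:
    LinearRegression().fit(target, target)
except ValueError:
    rejected = True
print(rejected)
```

Using `reshape(-1, 1)` instead of `reshape(8, 1)` lets NumPy work out the number of rows, so the same line works for any number of examples.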

<br>

In [9]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

binary_y = np.array(y >= 40).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, binary_y, test_size=0.33, random_state=5)

logistic = LogisticRegression()
logistic.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

print('\nIn-sample accuracy: %0.3f' % accuracy_score(y_train, logistic.predict(X_train)))

print('\nOut-of-sample accuracy: %0.3f' % accuracy_score(y_test, logistic.predict(X_test)))



In-sample accuracy: 0.979

Out-of-sample accuracy: 0.958


In [10]:
for var, coef in zip(boston.feature_names, logistic.coef_[0]):
    print("%7s : %7.3f" % (var, coef))


   CRIM :   0.086
     ZN :   0.230
  INDUS :   0.580
   CHAS :  -0.029
    NOX :  -0.304
     RM :   1.769
    AGE :  -0.127
    DIS :  -0.539
    RAD :   0.919
    TAX :  -0.165
PTRATIO :  -0.782
      B :   0.077
  LSTAT :  -1.628


In [11]:
print('\nclasses:', logistic.classes_)
print('\nProbs:\n', logistic.predict_proba(X_test)[:3, :])



classes: [0 1]

Probs:
 [[0.33234217 0.66765783]
 [0.97060356 0.02939644]
 [0.99594746 0.00405254]]


# Variable selection

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, test_size=0.33, random_state=42)
check = [2 ** i for i in range(8)]

for i in range(1, 2 ** 7 + 1):  # start at 1 so that i equals the number of random features added so far
    X_train = np.column_stack((X_train, np.random.random(X_train.shape[0])))
    X_test = np.column_stack((X_test, np.random.random(X_test.shape[0])))
    regression.fit(X_train, y_train)
    if i in check:
        print("Random features: %i -> R2: %0.3f" %
              (i, r2_score(y_train, regression.predict(X_train))))


Random features: 1 -> R2: 0.748
Random features: 2 -> R2: 0.749
Random features: 4 -> R2: 0.749
Random features: 8 -> R2: 0.754
Random features: 16 -> R2: 0.758
Random features: 32 -> R2: 0.780
Random features: 64 -> R2: 0.812
Random features: 128 -> R2: 0.864


In [13]:
regression.fit(X_train, y_train)
print('R2 %0.3f' % r2_score(y_test, regression.predict(X_test)))
# Please notice that the R2 result may change from run to run 
# due to the random nature of the experiment


R2 0.443


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=2)
poly_X = pf.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(poly_X,
                                                    y, test_size=0.33, random_state=42)

from sklearn.linear_model import Ridge

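# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call only runs on older versions.
reg_regression = Ridge(alpha=0.1, normalize=True)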
reg_regression.fit(X_train, y_train)

print('\nR2: %0.3f' % r2_score(y_test, reg_regression.predict(X_test)))



R2: 0.820


# Stochastic Gradient Descent

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, test_size=0.33, random_state=42)

SGD = SGDRegressor(penalty=None, learning_rate='invscaling',
                   eta0=0.01, power_t=0.25)


In [16]:
power = 17
check = [2 ** i for i in range(power + 1)]

for i in range(400):
    for j in range(X_train.shape[0]):
        SGD.partial_fit(X_train[j, :].reshape(1, 13),
                        y_train[j].reshape(1, ))

        count = (j + 1) + X_train.shape[0] * i

        if count in check:
            R2 = r2_score(y_test, SGD.predict(X_test))
            print('\nExample %6i R2 %0.3f coef: %s' %
                  (count, R2, ' '.join(map(lambda x: '%0.3f' % x, SGD.coef_))))



Example      1 R2 -6.255 coef: 0.112 -0.071 0.148 -0.040 0.075 -0.021 0.146 -0.113 0.243 0.224 0.118 0.037 0.110

Example      2 R2 -6.168 coef: 0.065 -0.139 0.087 -0.078 0.055 -0.114 0.254 -0.054 0.154 0.140 0.282 0.068 0.152

Example      4 R2 -6.060 coef: -0.074 -0.195 0.319 -0.171 0.064 -0.206 0.527 0.048 -0.041 0.266 0.075 0.219 0.353

Example      8 R2 -5.775 coef: -0.249 -0.504 0.605 -0.343 0.098 0.005 0.807 -0.304 -0.095 0.332 -0.067 0.399 0.024

Example     16 R2 -5.144 coef: -0.441 -0.430 0.298 -0.571 -0.002 0.004 0.519 -0.423 -0.279 0.292 -0.544 0.665 -0.065

Example     32 R2 -4.494 coef: -0.562 -0.308 0.441 1.224 0.051 0.315 0.387 -0.567 0.055 0.629 -0.367 0.726 -0.513

Example     64 R2 -2.947 coef: -0.986 0.419 0.107 1.648 -0.409 1.686 -0.427 -0.201 -0.029 0.448 -1.245 1.166 -1.913

Example    128 R2 -1.791 coef: -0.546 0.863 0.119 1.137 -0.584 1.823 -0.288 -0.179 -0.281 0.096 -1.982 1.165 -2.029

Example    256 R2 -0.608 coef: -0.804 0.619 -0.176 1.368 -0.770 3.135 -0.