# linear regression in python

In this notebook, you will generate linear data and 'fit' a linear model to the data 'by hand' using python.

Linear models underlie the vast majority of classical statistics, form the foundation for machine learning methods and are the basic building blocks for neural networks. So, if you really 'get' linear models, you have a solid foundation for doing most of contemporary AI.

Remember that linear models are just functions or equations of the form

```
y = m * x + b
```

where the variable "x" stores your predictors, and "y" is the value you'd like to predict. The 'parameters' "m" and "b" are free parameters of the linear model, giving the slope (m) and y-intercept (b), respectively.

The "y=mx+b" equation gives you an *infinite* number of lines, one for each pair of values for the parameters, m and b. The purpose of "fitting" the linear model to data (called "training" the model, in the machine-learning and artificial-intelligence fields) is to identify the 'best' specific values for "m" and "b", given a specific data set of x,y pairs.
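
To make the idea of "one line per pair of parameter values" concrete, here is a minimal sketch. It assumes NumPy is available; the helper name `line` and the specific values chosen for m and b are purely illustrative and not part of the original notebook.

```python
import numpy as np

def line(x, m, b):
    """Evaluate the linear model y = m * x + b (illustrative helper)."""
    return m * x + b

x = np.array([0.0, 1.0, 2.0, 3.0])

# Each (m, b) pair picks out one specific line from the infinite family.
y_shallow = line(x, m=0.5, b=1.0)   # gentle slope, y-intercept at 1
y_steep = line(x, m=2.0, b=-1.0)    # steeper slope, y-intercept at -1

print(y_shallow)
print(y_steep)
```

Fitting the model amounts to choosing, out of all such lines, the one that best matches the data.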

## Why fit a linear model to training data?

In most cases, we have gathered x,y data from an experiment or from observations, and we want to understand the relationship between the predictors (x) and the response (y). In some cases, we may just be intellectually curious about the world, and we want to know how y is related to x. For example, we might want to know how polar bear habitat has been impacted by global climate change.

In many cases, we want to fit our model so that we can 'predict' the response that we would likely observe for a *new* x-value. For example, maybe we want to 'predict' whether a patient is at high risk for a certain type of cancer, based on their lab results. Or perhaps we want to build a model to 'predict' how much irrigation we need to apply to a field of strawberries to achieve optimum growth, given weather data over the past three weeks.

Whatever our reason for fitting a model to data, once we begin the process of fitting a *linear* model, we are making the *assumption* that the relationship between x and y follows the equation

```
y = m * x + b
```

and *not* some other equation or functional form. This assumption generates a *bias* associated with the model we have chosen to use. A linear model will *always* be a straight line, whether the relationship between x and y is linear or not. If the relation is (approximately) linear, then fitting a linear model is probably appropriate, and - once fit - our model will probably give us reasonably accurate predictions. However, if the *real* relationship between x and y is *not* linear, assuming a linear model is likely to give us predictions that are *not* valid and could be horribly wrong.
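
As a rough illustration of this bias, the sketch below fits a straight line to data that actually follow a curve (y = x², chosen here only as an example). It uses NumPy's `np.polyfit` as a convenient stand-in for a fitting procedure; that is an assumption made for brevity, not the method this notebook develops.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 50)
y_true = x ** 2          # a relationship that is NOT linear

# Fit the best straight line (a degree-1 polynomial) by least squares.
m, b = np.polyfit(x, y_true, deg=1)
y_line = m * x + b

# Largest gap between the fitted line and the true curve.
max_error = np.max(np.abs(y_true - y_line))
print('m = {:.3f}, b = {:.3f}, max error = {:.3f}'.format(m, b, max_error))
```

The fitted line is still a straight line, so no choice of m and b can follow the curve, and the predictions are far off at the edges of the x range.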

Choosing the appropriate model for the specific x,y data often requires balancing many factors, including our background knowledge of the experiment or observation, how much computing power and time we have available, and how many x,y pairs we have to train or fit our model. The sub-field of data science or statistics that deals with choosing the model is called "model selection".

For now, we'll just assume our data are linear.


## let's get started

To fit a linear model, the first thing we need is a set of x,y coordinate pairs that are 'known'. We can then use this "training data" to find the 'best' values for the parameters m and b.

In the 'real world', our training data often comes from experiments or observations, but for now we just want to practice fitting linear models.

'Simulated' training data is used by data scientists to characterize and understand models, figure out when models work well and when they don't, and develop new methods and techniques that effectively overcome the limitations of previous approaches.

In this case, we want to better understand how linear models are fit to data. The first step in this process is to simulate data that *does* obey the assumptions of the linear model. That is, we want to simulate x,y pairs that *do* follow the relationship

```
y = m * x + b + ε
```

which is a linear equation with an extra "error" term, "ε". If we omit the error term ε, then our data will fall *exactly* on a perfect line. And we all know that there is no such thing as *perfect* data. All observations and measurements have error, and in general, the real world is 'messy' and imperfect; the error term reflects this.

Typically, we assume that the error, ε, is a random draw from a normal distribution with mean=0 and standard deviation equal to some value appropriate for the problem at hand. Mathematically, we can write

```
ε ~ N(0,σ)
```

where N(0,σ) indicates a "normal" distribution with given mean (0) and standard deviation (σ).

Procedurally, to simulate linear data with error, we first choose values for the linear model's parameters, m and b, and then select a value for x (either deterministically or stochastically). Once we have specified values for m, b and x, we can calculate y=mx+b, then stochastically generate a 'random draw' from the normal distribution N(0,σ) and add this value to y to supply the 'error'.
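
The procedure just described can be sketched by hand with NumPy. The values of m, b and σ, the range for x, and the random seed below are all arbitrary choices made for illustration; they do not come from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the draws are repeatable

m, b, sigma = 2.0, 1.0, 0.5      # illustrative parameter choices
n = 100                          # number of x,y pairs to simulate

x = rng.uniform(-2.0, 2.0, size=n)        # choose x stochastically
epsilon = rng.normal(0.0, sigma, size=n)  # error: draws from N(0, sigma)
y = m * x + b + epsilon                   # linear signal plus error

print(x[:3])
print(y[:3])
```

Setting `sigma` to 0 in this sketch would place every point exactly on the line y = m * x + b.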

Fortunately for us, we don't need to write *all* the code to simulate our data! We can use an existing python library to simulate linear data for us.

## simulate linear data with scikit-learn

We will use the "scikit-learn" library to simulate x,y data according to a linear model.

You've actually seen this process in action previously, but we'll go through it in more detail here. For more information about scikit-learn, check out the library's [webpage](https://scikit-learn.org).

To use scikit-learn, we need to "import" the library.

In [1]:
import sklearn

Go ahead and run that code cell to import the scikit-learn library (which is called "sklearn" in python).

You should see Google Colab connect to a computer on the internet and run your code.

Unfortunately, the "import" statement doesn't produce any output. If there is a problem importing the library, you should see an error statement, but you won't see anything printed if the import works.

To check if the sklearn library was imported correctly, you can print its version to the screen

In [2]:
print(sklearn.__version__)

0.22.2.post1


The specific version will change over time, but you should see some numbers separated by periods, and maybe some other stuff. When I ran this code initially, the sklearn version was

```
0.22.2.post1
```

Okay, now that we have the sklearn library imported, let's use it to simulate some linear data!

The function we will use to simulate linear data is called "sklearn.datasets.make_regression". Because it is in the "datasets" submodule of the "sklearn" library, we'll need to import the datasets submodule, first. To do this, run the following code cell.

In [3]:
import sklearn.datasets

Now we should have all the functions in the submodule sklearn.datasets available, including the "make_regression" function.

The documentation for the sklearn.datasets.make_regression function can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html).

The make_regression function is pretty general, with lots of options that you can set to change its behavior. For our simple data simulation, the important options are:

* n_samples - how many x,y pairs to simulate
* n_features - how many x dimensions (in our case, we'll just use 1)
* bias - the y-intercept (default is 0)
* noise - the standard deviation of the error term
* random_state - a random seed, so we can simulate the same data over and over
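
As a small sketch of how these options fit together, the call below sets `bias` and `random_state` explicitly and turns the noise off. It also passes `coef=True`, which asks make_regression to return the slope it generated alongside the data. The particular values (50 samples, a bias of 5.0, a seed of 0) are arbitrary choices for illustration.

```python
import numpy as np
import sklearn.datasets

x, y, slope = sklearn.datasets.make_regression(
    n_samples=50, n_features=1, bias=5.0, noise=0.0,
    coef=True, random_state=0)

# With noise=0, every point should lie exactly on y = slope * x + bias.
on_the_line = np.allclose(y, slope * x[:, 0] + 5.0)

print(slope)
print(on_the_line)
```

Note that there is no option for choosing the slope directly; the function draws it at random, which is why `coef=True` is useful when you want to know the 'true' value.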




In [None]:
import sklearn.datasets
import matplotlib.pyplot as plt

x,y = sklearn.datasets.make_regression(n_samples=100, n_features=1,
                                       n_informative=1, noise=10, 
                                       coef=False)

plt.scatter(x,y, marker='o')

After running the above code cell, you should see a 2-dimensional scatter plot with a bunch of blue dots that fall approximately along a diagonal line. Not too bad for essentially 2 lines of code!

If you are curious about what this code cell is doing...

The first 2 lines start with the word "import"; these lines load Python libraries, so you can use the code in those libraries. The "sklearn.datasets" submodule has a function called "make_regression" that creates random data for you.

The "matplotlib.pyplot" library has a "scatter" function that produces a scatterplot.

If you want to see line numbers in your code cells (hint: you probably do!), click on the gear-shaped icon in the upper-right corner of Google Colab, right next to your user icon. Select the "Editor" tab, click "Show line numbers", and then "Save". You should now see line numbers in the code cells!

Lines 4-6 in the above code cell are actually all part of a single function call to "make_regression"; the formatting is for human readability, only. This function call generates the 'random' x,y points and stores them in the "x" and "y" variables, respectively.

Line 8 is the function call that produces the scatterplot from these points.
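
If you would like to see how the points are stored, the sketch below repeats the same function call and inspects the results instead of plotting them. The use of `.shape` here is an addition for illustration, not part of the original cell.

```python
import sklearn.datasets

x, y = sklearn.datasets.make_regression(n_samples=100, n_features=1,
                                       n_informative=1, noise=10,
                                       coef=False)

print(x.shape)        # one row per sample, one column per feature
print(y.shape)        # one response value per sample
print(x[0][0], y[0])  # the first x,y pair
```

Because "x" is a 2-dimensional array (samples by features), a single x-value is reached with two indices, such as `x[0][0]`, while a single y-value needs only one.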

Again, don't worry about digesting all the Python code yet; we'll get to that later on.

## editing code cells

So, now you can execute Python code cells from within your Jupyter notebook and generate results in real time, which is cool. But what's cooler is that you can also edit the code and see the results in real time.

Try this with the code cell below, which is just a replica of the linear scatter plot. Try running it once, just to see that it works.

Notice that the points aren't exactly the same? That's because the "make_regression" function generates 'random' data, which is (potentially) different every time! In fact, you can keep running the code cell over and over, to produce different data.

### Let's change how the data are displayed

Edit line 8, below, changing the

    marker='o'

to

    marker='*'

and re-run the code cell. Now you have a different type of marker for your data plot.

In [None]:
import sklearn.datasets
import matplotlib.pyplot as plt

x,y = sklearn.datasets.make_regression(n_samples=100, n_features=1,
                                       n_informative=1, noise=10, 
                                       coef=False)

plt.scatter(x,y, marker='o')

## playing with random data generation

Remember how we were saying that the "make_regression" function generates 'random' data that is not the same each time? Let's confirm this.

The code cell below uses the same "make_regression" function call as before, but instead of plotting the resulting points, it just prints the x,y coordinates of the first generated point.

Try running this code multiple times, to see how the x,y location of the first data point changes each time.

Sometimes we want our random data generator to be able to generate the same data more than once. To do this, we can set the "random_state" of the "make_regression" function call to a specific numerical value.

As an example, on line 6 of the code cell below, change the
  
    coef=False)

line to

    coef=False, random_state=2021)

This adds a "random_state" value of 2021 to the function call.

Try running the code cell multiple times.

You should see the same x,y coordinates being printed every time!

Keep a note of these x,y values; you will likely need them to answer a quiz question :)

In [None]:
import sklearn.datasets
import matplotlib.pyplot as plt

x,y = sklearn.datasets.make_regression(n_samples=100, n_features=1,
                                       n_informative=1, noise=10, 
                                       coef=False)

print('x,y = {:.4f},{:.4f}'.format(x[0][0], y[0]))
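
The effect of "random_state" can also be checked directly, without re-running a cell by hand. In the sketch below, `simulate` is a hypothetical helper introduced only for this illustration, and the second seed (2022) is an arbitrary choice.

```python
import numpy as np
import sklearn.datasets

def simulate(seed):
    """Hypothetical helper: one make_regression call with a given seed."""
    return sklearn.datasets.make_regression(n_samples=100, n_features=1,
                                            n_informative=1, noise=10,
                                            coef=False, random_state=seed)

x1, y1 = simulate(2021)
x2, y2 = simulate(2021)   # same seed as the first call
x3, y3 = simulate(2022)   # different seed

# Same seed: the two data sets should match exactly.
print(np.array_equal(x1, x2) and np.array_equal(y1, y2))
# Different seed: the data sets should differ.
print(np.array_equal(x1, x3))
```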

After you take the quiz for this notebook, make sure you terminate your notebook's runtime session by clicking on the downward-facing arrow in the upper-right corner of Google Colab, next to the RAM and Disk usage icon. Select "Manage sessions" and TERMINATE any sessions you have running.