# Project: Linear Regression

Reggie is a scientist hired by the local fast food joint to build the newest ball pit in its play area. To optimize the pit, he is researching the bounciness of different balls: he bounces bouncy balls of different sizes and then fits lines to the data points he records. My goal here is to implement a version of linear regression in Python.

_Linear Regression_ takes a group of points on a graph and finds a line that approximately follows that group of points. A good Linear Regression algorithm minimizes the _error_, which in this project is the vertical distance from each point to the line. The line with the least total error is the line that fits the data best. One calls this a line of _best fit_.

Concepts used: loops, lists, and arithmetic, combined into a function that finds a line of best fit for a given set of data.


## Calculating Error


The formula that produces the `line` is:
```
y = m*x + b
```
where `m` is the slope of the line and `b` is the intercept, the `y` value at which the line crosses the y-axis.

Below I am creating a function `get_y()` that takes in `m`, `b`, and `x` values and returns what the `y` value would be for the given `x`:

In [50]:
def get_y(m, b, x):
    y = m * x + b
    return y

# Test the function
print(get_y(1, 0, 7))
print(get_y(5, 10, 3))

# Test is successful, see the output below:

7
25



Reggie wants to try a bunch of different `m` values and `b` values and see which line produces the least error. 

To calculate the error between a point and a line, I will create a function `calculate_error()` that takes `m`, `b`, and a `point` (an `(x, y)` pair) and returns the vertical distance between the line and the point.

This distance represents the `error` between the line produced by the formula `y = m*x + b` and the given `point`.


In [9]:
def calculate_error(m, b, point):
    x_point, y_point = point
    difference = get_y(m, b, x_point) - y_point
    return abs(difference)

Test this function:

In [56]:
# This line is y = x, so (3, 3) should lie on it. Thus, the error should be 0:
print(calculate_error(1, 0, (3, 3)))

# The point (3, 4) should be 1 unit away from the line y = x:
print(calculate_error(1, 0, (3, 4)))

# The point (3, 3) should be 1 unit away from the line y = x - 1:
print(calculate_error(1, -1, (3, 3)))

# The point (3, 3) should be 5 units away from the line y = -x + 1:
print(calculate_error(-1, 1, (3, 3)))

# Test is successful, see the output below:

0
1
1
5


Reggie's datasets will be sets of points. For example, he ran an experiment comparing the width of bouncy balls to how high they bounce:


In [42]:
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

The first datapoint, `(1, 2)`, means that his 1cm bouncy ball bounced 2 meters. The 4cm bouncy ball bounced 4 meters.

As I try to fit a line to this data, I will need a function called `calculate_all_error`, which takes `m` and `b` that describe a line, and `points` (a set of data like the example datapoints above).

`calculate_all_error` should iterate through each `point` in `points` and calculate the error from that point to the line (using `calculate_error`). Along the way it should accumulate the errors in `total_of_errors` and return that sum after all points have been checked.


In [13]:
def calculate_all_error(m, b, points):
    total_of_errors = 0
    for point in points:
        total_of_errors += calculate_error(m, b, point)
    return total_of_errors
        

Test this function:

In [57]:
# Every point in this dataset lies on y = x, so the total error should be zero:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 0, datapoints))

# Every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:
print(calculate_all_error(1, 1, datapoints))

# Every point in this dataset is 1 unit away from y = x - 1, so the total error should be 4:
print(calculate_all_error(1, -1, datapoints))

# The points in this dataset are 1, 5, 9, and 3 units away from y = -x + 1, respectively,
# so the total error should be 1 + 5 + 9 + 3 = 18:
print(calculate_all_error(-1, 1, datapoints))

# Test is successful, see the output below:

0
4
4
18


Now I have a function that can take in a line and Reggie's data and return how much error that line produces when I try to fit it to the data.
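The role of `abs()` in the error can be illustrated with a small sketch. The helper names `total_signed_error` and `total_abs_error` are hypothetical and not part of Reggie's project; the second one is meant to compute the same quantity as `calculate_all_error`, just written with `sum()`.

```python
def get_y(m, b, x):
    return m * x + b

def total_signed_error(m, b, points):
    # Hypothetical variant: adds up raw differences, without abs()
    return sum(get_y(m, b, x) - y for x, y in points)

def total_abs_error(m, b, points):
    # Same total as calculate_all_error, expressed as a single sum()
    return sum(abs(get_y(m, b, x) - y) for x, y in points)

# One point sits 1 unit above the line y = x, the other 1 unit below it
points = [(3, 4), (3, 2)]

print(total_signed_error(1, 0, points))  # 0: the two misses cancel out
print(total_abs_error(1, 0, points))     # 2: both misses are counted
```

Without the absolute value, a line that misses points in opposite directions could look like a perfect fit, which is why each error is made non-negative before summing.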

The next step is to find the `m` and `b` that minimizes this error, and thus fits the data best.


## Finding Best Slope and Intercept


The way Reggie wants to find a line of best fit is by trial and error. He wants to try a bunch of different slopes (`m` values) and a bunch of different intercepts (`b` values) and see which one produces the smallest error value for his dataset.

To accomplish this I will create a list of possible `m` values to try. It will go from -10 to 10 inclusive, in increments of 0.1.


In [24]:
possible_ms = [m / 10 for m in range(-100, 101, 1)]

I also need a list of `possible_bs` to check: the values from -20 to 20 inclusive, in steps of 0.1.

In [36]:
possible_bs = [b / 10 for b in range(-200, 201, 1)]
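As a quick sanity check on these two lists, the sketch below rebuilds them and prints their sizes and endpoints. Dividing integers by 10, rather than repeatedly adding 0.1, keeps each value as close as possible to the intended decimal, since repeated floating-point addition can accumulate rounding drift.

```python
possible_ms = [m / 10 for m in range(-100, 101)]
possible_bs = [b / 10 for b in range(-200, 201)]

# Number of candidates and the first and last value of each list
print(len(possible_ms), possible_ms[0], possible_ms[-1])  # 201 -10.0 10.0
print(len(possible_bs), possible_bs[0], possible_bs[-1])  # 401 -20.0 20.0

# Number of (m, b) pairs a full trial-and-error search has to evaluate
print(len(possible_ms) * len(possible_bs))                # 80601
```

So a search over every pairing evaluates 80,601 candidate lines, which is still fast for a dataset of five points.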


Now I am going to find the smallest error. First, I will make every possible `y = m*x + b` line by pairing all of the possible `m`s with all of the possible `b`s. Then, I will see which `y = m*x + b` line produces the smallest total error with the set of data stored in `datapoints`.


In [47]:
def calculate_m_b(points):
    smallest_error = float('inf')
    best_m = 0
    best_b = 0

    for m in possible_ms:
        for b in possible_bs:
            errors_sum = calculate_all_error(m, b, points)
            if errors_sum < smallest_error:
                best_m = m
                best_b = b
                smallest_error = errors_sum
        
    return best_m, best_b, smallest_error

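# Test on Reggie's bouncy ball data (redefined here because the tests above reassigned `datapoints`)
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
print(calculate_m_b(datapoints))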

# Test is successful, see the output below:

(0.4, 1.6, 5.0)


What does this model predict?

For the given set of observations on the bouncy balls, the search returns an `m` of 0.4 and a `b` of 1.6 as the line that fits the data best. That is not the only answer, though: the search only compares total errors, and several different pairs of `m` and `b` can produce the same total. The pair 0.3 and 1.7 is one such alternative:

```
y = 0.3x + 1.7
```

This line still produces a total error of 5.
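The tie between the two lines can be checked directly. The sketch below restates `calculate_all_error` compactly with `sum()` so that the block runs on its own, and it compares the totals with `math.isclose` instead of `==` on the assumption that floating-point rounding may change the last digits.

```python
import math

def get_y(m, b, x):
    return m * x + b

def calculate_all_error(m, b, points):
    # Compact restatement of the function defined earlier in this project
    return sum(abs(get_y(m, b, x) - y) for x, y in points)

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

error_a = calculate_all_error(0.4, 1.6, datapoints)  # line returned by the search
error_b = calculate_all_error(0.3, 1.7, datapoints)  # alternative line

# Compare with a tolerance, since the floats may differ in the last digits
print(math.isclose(error_a, 5.0), math.isclose(error_b, 5.0))
print(math.isclose(error_a, error_b))
```

Both lines pass through the point `(1, 2)`, which is one way to see why they can share the same total error on this dataset.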

Using this `m` and this `b`, one can predict the bounce height (`y` value) of a ball with a width of 6cm (`x` value).
In other words, what is the output of `get_y()` (the line formula from the beginning of this project) when it is called with:
* m = 0.3
* b = 1.7
* x = 6

In [58]:
m = 0.3
b = 1.7
x = 6

get_y(m, b, x)

3.5
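The same call can be repeated for several widths at once. The widths in this sketch are hypothetical examples, and anything outside the measured 1cm to 5cm range is an extrapolation that the data does not directly support.

```python
def get_y(m, b, x):
    return m * x + b

m, b = 0.3, 1.7  # the line chosen in the text above

# Hypothetical ball widths in cm; 6 and 8 lie outside the measured 1-5 cm range
for width in [2, 4, 6, 8]:
    # round() hides tiny floating-point artifacts in the printed prediction
    print(width, round(get_y(m, b, width), 2))
```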

## Conclusion

The model predicts that the 6cm ball will bounce 3.5m. Now Reggie can use this model to predict the bounce height of the different sizes of balls he may choose to include in the ball pit!