# Chapter 2 - Introduction to Regression Analysis

* **regression analysis**: the process of finding a mathematical model (an equation) that relates $y$ to a set of independent variables and best fits the data

## General Form of Probabilistic Model in Regression

\begin{equation}
y = E(y) + \epsilon
\end{equation}

* $y$ : Dependent variable
* $E(y)$ : Mean (or expected) value of $y$
* $\epsilon$ : Unexplainable, or random, error


* **dependent** (or response) **variable**: the variable to be predicted or modeled
* **independent variables**: the variables used to predict or model $y$

* this is a **probabilistic model** for $y$: when certain assumptions about the model are satisfied, we can make a probability statement about the magnitude of the deviation between $y$ and $E(y)$
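The decomposition $y = E(y) + \epsilon$ can be illustrated with a small simulation. This is only a sketch: the straight-line form of $E(y)$, the coefficient values, and the normal error distribution are assumptions chosen for illustration, not something the notes specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mean function E(y) = beta0 + beta1 * x (values are made up)
beta0, beta1 = 1.0, 0.5
sigma = 2.0  # assumed standard deviation of the random error

x = rng.uniform(0, 10, size=10_000)
mean_y = beta0 + beta1 * x                 # deterministic part, E(y)
eps = rng.normal(0.0, sigma, size=x.size)  # random error, assumed normal with mean 0
y = mean_y + eps                           # y = E(y) + epsilon

# Probability statement: with normal errors, roughly 95% of the deviations
# y - E(y) should lie within 2 standard deviations of zero.
within_2sigma = np.mean(np.abs(y - mean_y) <= 2 * sigma)
print(f"share of |y - E(y)| <= 2*sigma: {within_2sigma:.3f}")
```

Under these assumptions the printed share should be close to 0.95, which is the kind of probability statement about $y - E(y)$ that the model allows.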

## Overview of Regression Analysis

* branch of statistical methodology concerned with relating a response $y$ to a set of independent, or predictor, variables $x_1$, $x_2$, $\dots$, $x_k$
* resulting equation is a **regression model**
* **response surface**: convenient method for modeling a response $y$ that is a function of two quantitative variables $x_1$ and $x_2$; traces the mean value of the response variable, $E(y)$, for various combinations of $x_1$ and $x_2$.
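A response surface can be tabulated by evaluating $E(y)$ on a grid of $(x_1, x_2)$ combinations. The sketch below assumes a hypothetical first-order model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ with made-up coefficients; the function name `mean_response` is likewise only illustrative.

```python
import numpy as np

def mean_response(x1, x2, b0=10.0, b1=2.0, b2=-1.0):
    """Hypothetical first-order model E(y) = b0 + b1*x1 + b2*x2."""
    return b0 + b1 * x1 + b2 * x2

# Grid of settings for the two quantitative variables
x1_vals = np.linspace(0, 4, 5)   # 0, 1, 2, 3, 4
x2_vals = np.linspace(0, 2, 3)   # 0, 1, 2
X1, X2 = np.meshgrid(x1_vals, x2_vals, indexing="ij")

# surface[i, j] is the mean response at (x1_vals[i], x2_vals[j])
surface = mean_response(X1, X2)
print(surface)
```

Each entry of `surface` is a mean value, not an observed $y$; observed responses would scatter around the surface by the random error $\epsilon$.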

## Regression Modeling: Six-Step Procedure

1. Hypothesize the form of the model for $E(y)$.
2. Collect the sample data.
3. Use the sample data to estimate unknown parameters in the model.
4. Specify the probability distribution of the random error term, and estimate any unknown parameters of this distribution.
5. Statistically check the usefulness of the model.
6. When satisfied that the model is useful, use it for prediction, estimation, and so on.
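The six steps can be walked through on simulated data. This is a minimal sketch, assuming a straight-line model, made-up true parameter values, normal errors, and $R^2$ as a simple stand-in for the usefulness check in step 5; none of these choices come from the notes themselves.

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: hypothesize the form of the model, here E(y) = beta0 + beta1 * x
# Step 2: collect sample data (simulated; the true values 3.0 and 2.0 are made up)
n = 50
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)

# Step 3: estimate the unknown betas by least squares
X = np.column_stack([np.ones(n), x])          # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 4: assume normal errors with mean 0 and constant variance sigma^2,
# and estimate sigma^2 by s^2 = SSE / (n - number of betas)
resid = y - X @ beta_hat
sse = float(resid @ resid)
s2 = sse / (n - X.shape[1])

# Step 5: check usefulness, here with R^2 = 1 - SSE / SS(total)
sst = float(np.sum((y - y.mean()) ** 2))
r2 = 1.0 - sse / sst

# Step 6: use the fitted model for prediction at a new x value
x_new = 5.0
y_pred = beta_hat[0] + beta_hat[1] * x_new

print(f"beta_hat = {beta_hat}, s^2 = {s2:.3f}, R^2 = {r2:.3f}, y_pred = {y_pred:.2f}")
```

In practice step 5 usually also involves formal tests on the $\beta$'s; $R^2$ is used here only to keep the sketch short.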

## Data Collection

* If the values of the independent variables ($x$’s) in regression are uncontrolled (i.e., not set in advance before the value of $y$ is observed) but are measured without error, the data are **observational**.
* If the values of the independent variables ($x$’s) in regression are controlled using a designed experiment (i.e., set in advance before the value of $y$ is observed), the data are **experimental**.

### Sample Size and Observational Data

* Regression involves estimating the mean response, so the required sample size depends on the same 3 factors as estimating any population mean:
    * the (estimated) population standard deviation
    * the confidence level
    * the desired half-width of the confidence interval used to estimate the mean
* $E(y)$ is modeled as a function of a set of independent variables, and the additional parameters in the model (the $\beta$'s) must also be estimated
* Sample size must be large enough so that the $\beta$'s are both estimable and testable
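* Neither is possible unless $n$ is at least as large as the number of $\beta$ parameters in the model; testing additionally requires $n$ to exceed that number, so that degrees of freedom remain for estimating the error variance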
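For a single population mean, the three factors listed above combine in the standard formula $n = (z_{\alpha/2}\,\sigma / H)^2$, where $H$ is the desired half-width. The sketch below applies it; the function name `sample_size_for_mean` and the input values are hypothetical, and the formula covers only the mean-estimation part, not the extra observations needed for the $\beta$'s.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, confidence, half_width):
    """Smallest n with z * sigma / sqrt(n) <= half_width (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    return math.ceil((z * sigma / half_width) ** 2)

# Hypothetical inputs: sigma = 10, 95% confidence, half-width of 2 units
n_wide = sample_size_for_mean(sigma=10.0, confidence=0.95, half_width=2.0)

# Halving the half-width roughly quadruples the required sample size
n_narrow = sample_size_for_mean(sigma=10.0, confidence=0.95, half_width=1.0)

print(n_wide, n_narrow)
```

A larger standard deviation or a higher confidence level increases $n$, while a wider acceptable half-width decreases it.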

<div class="alert alert-info">
<b>Rule of Thumb:</b> Select $n$ greater than or equal to 10 times the number of $\beta$ parameters in the model (excluding $\beta{}_0$)
</div>
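The rule of thumb and the estimability requirement can both be checked numerically. This sketch assumes a hypothetical model with $k = 3$ predictors and randomly generated $x$ values; `rule_of_thumb_n` and `design_rank` are illustrative helper names. When $n$ is smaller than the number of $\beta$ parameters, the design matrix cannot have full column rank, so the least squares estimates are not unique.

```python
import numpy as np

def rule_of_thumb_n(k):
    """Rule-of-thumb sample size: 10 observations per beta, excluding beta_0."""
    return 10 * k

rng = np.random.default_rng(2)
k = 3        # hypothetical number of predictors
p = k + 1    # number of beta parameters, including beta_0

def design_rank(n):
    """Rank of an n x p design matrix with an intercept and k random predictors."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    return int(np.linalg.matrix_rank(X))

rank_small = design_rank(3)                    # n < p: rank is at most n
rank_enough = design_rank(rule_of_thumb_n(k))  # n = 30: full column rank expected

print(rule_of_thumb_n(k), rank_small, rank_enough)
```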

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
import math

from pandas import Series, DataFrame

%matplotlib inline

In [10]:
# load the data
datapath = "./data/ch03/warehouse.dat"
df = pd.read_fwf(datapath, names=["vehicles", "time"], colspecs="infer")

# define the independent and dependent variables
xCol = 'vehicles'
yCol = 'time'

df.head()

   vehicles  time
0         1  0.00
1         2  0.00
2         3  0.02
3         4  0.01
4         5  0.01


In [12]:
df[xCol].describe()

count    15.000000
mean      8.000000
std       4.472136
min       1.000000
25%       4.500000
50%       8.000000
75%      11.500000
max      15.000000
Name: vehicles, dtype: float64
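The summary above is what one would get if `vehicles` simply took the integer values 1 through 15. That is an assumption (only the first five rows are shown), but the sketch below reproduces the reported mean, quartiles, and standard deviation under it, and shows that `describe()` reports the sample standard deviation (divisor $n - 1$).

```python
import numpy as np
import pandas as pd

# Assumption: vehicles = 1, 2, ..., 15 (consistent with the summary, not confirmed)
vehicles = pd.Series(np.arange(1, 16), name="vehicles")

summary = vehicles.describe()
print(summary)

# Sample variance of 1..n is n(n+1)/12, so for n = 15 it is 20
sample_std = float(np.sqrt(15 * 16 / 12))
print(sample_std)
```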

In [13]:
df[yCol].describe()

count    15.000000
mean      0.024667
std       0.015976
min       0.000000
25%       0.010000
50%       0.030000
75%       0.040000
max       0.050000
Name: time, dtype: float64