<a href="https://colab.research.google.com/github/jirvingphd/fsds_100719_cohort_notes/blob/master/mod1_equations_review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Review of Equations - Mod 1


# !pip install fsds_100719


Review Study Group for online-ds-ft-100719 & online-ds-pt-100719 cohorts.<br><br>


Questions?:<br>
Contact me:
James M. Irving, Ph.D.
- [Email](james.irivng@flatironschool.com)
- [Schedule One-on-Ones](https://go.oncehub.com/JamesIrvingOfficeHours)


### Equations Included From:
- Section 03
- Section 07

### Additional Topics (Regression)
- Section 07
- Section 08


### Additional Reference Material:
- [Math notation pdf from Learn lesson](https://drive.google.com/file/d/1iu5pj-1q0KX6YmCX1vneS-uTmbaaZXkD/view?usp=sharing)
- [Covariance — Different Ways to Explain or Visualize It - Found by Devin](https://stats.seandolinar.com/covariance-different-ways-to-explain/)



# Methods of Centrality/Dispersion

## Measures of Central Tendency
### Mean

The **Mean** or **Arithmetic Average** is the value obtained by dividing the sum of all the data by the total number of data points as shown in the formula below:

$$ 
\Large\bar X = \dfrac{\sum X}{N} $$
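As a minimal sketch of the formula above, the mean can be computed with plain Python. The function name `mean` and the sample values are illustrative assumptions, not part of the original lesson.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if len(values) == 0:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


# Hypothetical sample data: (2 + 4 + 4 + 6) / 4
print(mean([2, 4, 4, 6]))  # 4.0
```

In practice, `statistics.mean` from the standard library or `numpy.mean` does the same job.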

### Median 
- The median is the middle value in a sorted list of values/observations

```python
def median(list_of_observations):
    """Finds the median of a list of observations by sorting the values and
    choosing the middle value (odd length) or the mean of the two middle
    values (even length).

    Args:
        list_of_observations (list): Numeric observations; must not be empty.

    Returns:
        median: The middle value, or the mean of the two middle values.
    """

    # Calculate length of list / num of observations
    length = len(list_of_observations)

    # Sort observations by value to locate the median
    sorted_vals = sorted(list_of_observations)

    ## Calc median using methods for odd/even lengths
    # If length is even, calc mean of middle 2 nums:
    if (length % 2) == 0:

        print('There is an even number of values. Calculating mean of middle 2 elements.')

        ## Get the location of middle nums
        ## REMINDER: Python list indices start at 0, not 1,
        ## so the two middle values sit at length//2 - 1 and length//2

        # Index of 2nd middle num is halfway into list
        idx_1 = length // 2

        # Index of 1st middle num is one position earlier
        idx_0 = idx_1 - 1

        # Median is mean of mid nums
        median = (sorted_vals[idx_0] + sorted_vals[idx_1]) / 2

    # If length is odd, select middle num
    else:

        print('There is an odd number of values. Selecting middle element.')

        # Get the location of mid num
        idx_med = length // 2

        # Median is mid num
        median = sorted_vals[idx_med]

    return median

```
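One way to sanity-check any hand-written median is to compare it against the standard library's `statistics.median`. The sample lists below are made up for illustration.

```python
import statistics

# Hypothetical samples: one with an odd length, one with an even length
odd_sample = [7, 1, 5]
even_sample = [7, 1, 5, 3]

# Odd length: sorted values are [1, 5, 7], so the middle value is 5
print(statistics.median(odd_sample))   # 5

# Even length: sorted values are [1, 3, 5, 7], so the median is (3 + 5) / 2
print(statistics.median(even_sample))  # 4.0
```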

### Mode
- The mode is the value that occurs the most
    - highest frequency
    
  
```python
def get_mode(data):
    """Solution from Implementing stats with functions lab.
    Finds mode by making a dictionary of vals = {val:frequency} 
    Loops through data and updates a dictionary with +=1.
    
    Args:
        data (list): observations to search.
        
    Returns:
        modes (list): every value that occurs with the highest frequency.
    """

    # Create and populate frequency distribution
    frequency_dict = {}
    
    # If an element is not in the dictionary , add it with value 1
    # If an element is already in the dictionary , +1 the value
    for i in data:
        
        if i not in frequency_dict:
            frequency_dict[i] = 1
            
        else:
            frequency_dict[i] += 1
    
    # Create a list for mode values
    modes = []
    
    #from the dictionary, add element(s) to the modes list with max frequency
    highest_freq = max(frequency_dict.values())
    
    for key, val in frequency_dict.items():
        if val == highest_freq:
            modes.append(key)
            
    # Return the mode list 
    return modes
```
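The same frequency-counting idea can be sketched more compactly with `collections.Counter` from the standard library. The function name `get_modes_counter` and the sample data are assumptions made for this example.

```python
from collections import Counter


def get_modes_counter(data):
    """Return a list of every value that occurs with the highest frequency."""
    counts = Counter(data)  # maps each value to its frequency
    if not counts:
        return []
    highest_freq = max(counts.values())
    return [value for value, freq in counts.items() if freq == highest_freq]


# 2 and 3 both appear twice, so both are modes
print(get_modes_counter([1, 2, 2, 3, 3]))  # [2, 3]
```

Returning a list rather than a single value keeps ties visible, which matters for multimodal data.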


## Measures of Dispersion

### Absolute Deviation

**Absolute Deviation** is the simplest way of calculating the dispersion of a data set. It is calculated by subtracting the mean of the dataset from a value in the dataset and taking the absolute value of the result. This gives the "distance" between a given value and the mean. In other words, how much a value *deviates* from the mean.  

> $\left|x_i - \bar{x}\right|$

**Average Absolute Deviation** is calculated by taking the mean of all individual absolute deviations in a data set as shown in the formula below:

$$\large \dfrac{1}{n}\sum^n_{i=1}\left|(x_i-\bar x)\right| $$
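A short sketch of the average absolute deviation formula, assuming a plain list of numbers; the function name and sample values are illustrative only.

```python
def average_absolute_deviation(values):
    """Mean of the absolute distances between each value and the mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n


# Mean is 4; absolute deviations are 2, 0, 0, 2, so the average is 4 / 4
print(average_absolute_deviation([2, 4, 4, 6]))  # 1.0
```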

### Variance

Earlier in the course, you learned about __variance__ (represented by $\sigma^2$) as a measure of how far the values of a continuous random variable spread out from their expected (mean) value. Let's quickly revisit it, as the variance formula plays a key role in calculating covariance and correlation.

The formula for calculating variance is shown below:

$$\Large \sigma^2 = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i-\mu)^2$$

- $x_i$ represents an individual data point
- $\mu $ is the mean of the data points
- $n$ is the total number of data points 
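The variance formula can be sketched directly in Python. Note that it divides by $n$, which makes it the population variance; the sample variance divides by $n - 1$ instead. The function name and data below are assumptions for illustration.

```python
def population_variance(values):
    """Average squared distance of each value from the mean (divides by n)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


# Hypothetical data with mean 5; squared deviations sum to 32, and 32 / 8 = 4
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
```

The standard library's `statistics.pvariance` computes the same quantity, while `statistics.variance` uses the $n - 1$ denominator.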


### Standard Deviation

The **Standard Deviation** is another measure of the spread of values within a dataset. 
It is simply the square root of the variance. In the above formula, $\sigma^2$ is the variance so $\sigma$ is the standard deviation. 

$$ \large \sigma = \sqrt{\dfrac{1}{n}\displaystyle\sum^n_{i=1}(x_i-\mu)^2} $$
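As a sketch, the standard deviation is just the square root of the population variance computed above. The function name and data are illustrative assumptions.

```python
import math


def population_std(values):
    """Square root of the population variance (divides by n, not n - 1)."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n
    return math.sqrt(variance)


# The variance of this hypothetical data is 4, so the standard deviation is 2
print(population_std([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```

Unlike the variance, the result is in the same units as the original data, which usually makes it easier to interpret.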

## Measures of Mutual Variation
### Calculating Covariance
Suppose you have two random variables, $X$ and $Y$, with $n$ elements each. You can calculate the covariance ($\sigma_{XY}$) between them using the formula:

$$ \Large \sigma_{XY} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)$$

- $\sigma_{XY}$ = Covariance between $X$ and $Y$
- $x_i$ = ith element of variable $X$
- $y_i$ = ith element of variable $Y$
- $n$ = number of data points (__$n$ must be same for $X$ and $Y$__)
- $\mu_x$ = mean of the independent variable $X$
- $\mu_y$ = mean of the dependent variable $Y$
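The covariance formula translates almost term by term into Python. This is a minimal sketch; the function name `covariance` and the sample lists are assumptions for illustration.

```python
def covariance(x, y):
    """Population covariance: mean product of paired deviations from the means."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of elements")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    return sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / n


# Hypothetical data where y rises with x
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(covariance(x, y))        # 2.5

# Reversing y makes it fall as x rises, which flips the sign
print(covariance(x, y[::-1]))  # -2.5
```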

#### Interpreting covariance values 

Covariance values can range from negative infinity to positive infinity. 

* A **positive covariance** indicates that two variables are **positively related**

* A **negative covariance** indicates that two variables are **inversely related**

* A **covariance equal to or close to 0** indicates that there is **little or no linear relationship** between two variables





### Calculating Correlation Coefficient

The Pearson correlation ($r$) is calculated using the following formula:

$$ \Large r = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i-\mu_y)^2}}$$

Just as with covariance, $X$ and $Y$ are two random variables with $n$ elements each. 


- $x_i$ = ith element of variable $X$
- $y_i$ = ith element of variable $Y$
- $n$ = number of data points (__$n$ must be same for $X$ and $Y$__)
- $\mu_x$ = mean of the independent variable $X$
- $\mu_y$ = mean of the dependent variable $Y$
- $r$ = Calculated Pearson Correlation
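The correlation formula can be sketched the same way as covariance, with the extra normalization in the denominator. The function name `pearson_r` and the sample lists are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-deviation of x and y scaled to the range [-1, 1]."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of elements")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    numerator = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mu_x) ** 2 for xi in x)
    ss_y = sum((yi - mu_y) ** 2 for yi in y)
    return numerator / math.sqrt(ss_x * ss_y)


# Hypothetical data with a perfect linear relationship
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(pearson_r(x, y))                         # 1.0

# Rescaling x changes the covariance but leaves r unchanged
print(pearson_r([100 * xi for xi in x], y))    # 1.0
```

Because the denominator removes the units of both variables, $r$ is comparable across datasets in a way that raw covariance is not.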

## MSE/RMSE 
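To summarize prediction error over all the instances in a data set (whether the training set or the test set), a popular metric is the **(Root) Mean Squared Error**, where $y_i$ is the observed value and $\hat y_i$ is the model's prediction: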

$$ \Large MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2$$

$$ \Large RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2}$$
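A minimal sketch of both metrics, assuming the observed values and predictions are plain lists of equal length. The function names and the numbers are made up for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as y."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical observations and predictions; errors are 1, 0, -1, -2
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 11]
print(mse(y_true, y_pred))   # 1.5
print(rmse(y_true, y_pred))  # roughly 1.22
```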





## R-Squared
> **The $R^2$ or Coefficient of determination is a statistical measure that is used to assess the goodness of fit of a regression model**

Here is how it works. 

R-Squared uses a so-called "baseline" model which is a very simple, naive model. This baseline model does not make use of any independent variables to predict the value of dependent variable Y. Instead, it uses the **mean** of the observed responses of the dependent variable $y$ and always predicts this mean as the value of $y$ for any value of $x$. In the image below, this model is given by the straight orange line.


<img src="https://raw.githubusercontent.com/learn-co-students/dsc-coefficient-of-determination-online-ds-ft-100719/master/images/linreg_rsq.png" width="400">

You can see that, in this plot, the baseline model always predicts the mean of $y$ **irrespective** of the value of the $x$. The red line, however, is our fitted regression line which makes use of $x$ values to predict the values of $y$. Looking at the plot above, R-Squared simply asks the question:

> **Is our fitted regression line better than our baseline (worst) model?**

Any regression model that we fit is compared to this baseline model to understand its **goodness of fit**. Simply put, R-Squared measures how good your model is compared to the baseline model. That's about it. 

#### Calculating R-Squared

The mathematical formula for the R-Squared of a linear regression line is expressed in terms of the **squared errors** of the fitted model and the baseline model. It's calculated as:

$$ \large R^2 = 1- \dfrac{SS_{RES}}{SS_{TOT}} = 1 - \dfrac{\sum_i(y_i - \hat y_i)^2}{\sum_i(y_i - \overline y)^2} $$

* $SS_{RES}$ (also called RSS) is the **Residual** sum of squared errors of our regression model, also known as **$SSE$** (Sum of Squared Errors). $SS_{RES}$ is the sum of the squared differences between each $y_i$ and its prediction $\hat y_i$. For the one highlighted observation in our graph above, its contribution to $SS_{RES}$ is denoted by the red arrow. This part of the error is not explained by our model.


* $SS_{TOT}$ (also called TSS) is the **Total** sum of squared errors. $SS_{TOT}$ is the sum of the squared differences between each $y_i$ and the mean $\overline y$. For the one highlighted observation in our graph above, its contribution to $SS_{TOT}$ is denoted by the orange arrow.
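The two sums of squares can be sketched in a few lines of Python. The function name `r_squared` and the data are illustrative assumptions; the baseline model is represented by predicting the mean of the observed values every time.

```python
def r_squared(y_true, y_pred):
    """Compute 1 - SS_RES / SS_TOT for observed values and model predictions."""
    y_bar = sum(y_true) / len(y_true)
    # Error left over after using the fitted model
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Error of the baseline model that always predicts the mean
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observations and predictions: SS_RES = 6, SS_TOT = 20
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 11]
print(round(r_squared(y_true, y_pred), 3))  # 0.7

# Predicting the mean (6) everywhere reproduces the baseline, so R-Squared is 0
print(r_squared(y_true, [6, 6, 6, 6]))      # 0.0
```

A model that fits worse than the baseline would make $SS_{RES}$ larger than $SS_{TOT}$ and push the result below zero.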


#### Explaining/Phrasing R-Squared values 

An R-squared value of, say, 0.85 can be phrased as:

> ***85% of the variation in the dependent variable $y$ is explained by the independent variable in our model.***

