
# Starting off

Below is a list of weights (kg) of 10 male subjects. How would you describe this data set to another person?


```[55, 56, 56, 58, 60, 61, 63, 64, 70, 78]```

# Introducing Statistics: Measures of Central Tendency, Dispersion, and Correlation

## Aim:
- Be able to describe a large sample of data in a meaningful way that conveys information.

- Be able to describe how two sets of data are related to each other

- Write functions to calculate the descriptive statistics of a data set.


A **population** is the collection of **all** people, plants, animals, or objects of interest about which we wish to make statistical inferences (generalizations). 

A **population parameter** is a numerical characteristic of a population. In nearly all statistical problems we do not know the value of a parameter because we do not measure the entire population. We use sample data to make an inference about the value of a parameter.



A **sample** is the subset of the population that we actually measure or observe.

A **sample statistic** is a numerical characteristic of a sample. A sample statistic estimates the unknown value of a population parameter. A statistic that summarizes a sample is sometimes referred to as a descriptive statistic.
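To make the distinction concrete, here is a minimal sketch. The population is hypothetical (simply the integers 1 to 1000), and the sample size of 100 and the random seed are arbitrary choices made for illustration.

```python
import random

# Hypothetical population: the integers 1..1000 (an assumption for illustration)
population = list(range(1, 1001))

# Population parameter: the true mean, known here only because we built the population
population_mean = sum(population) / len(population)

# Sample: a random subset that we actually "measure"
random.seed(0)  # fixed seed so the sketch is repeatable
sample = random.sample(population, 100)

# Sample statistic: an estimate of the population parameter
sample_mean = sum(sample) / len(sample)

print(population_mean)  # 500.5
print(sample_mean)      # close to the population mean, but generally not equal to it
```

In real problems the population mean is unknown, and only the last two steps are available to us.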

Here is the notation that will be used:

$X_{ij}$ = Observation for variable *j* in subject *i*.

$p$ = Number of variables

$n$ = Number of subjects

In the example to come, we'll have data on 737 people (subjects) and 5 nutritional outcomes (variables). So,

$p$ = 5 variables

$n$ = 737 subjects





In multivariate statistics we will always be working with vectors of observations. So in this case we arrange the data for the $p$ variables on each subject into a vector. In the expression below, $\mathbf{X}_i$ is the vector of observations for the $i^{th}$ subject, $i$ = 1 to $n$ (737). The data for the $j^{th}$ variable is located in the $j^{th}$ element of this subject's vector, $j$ = 1 to $p$ (5).


$$\mathbf{X}_i = \left(\begin{array}{l}X_{i1}\\X_{i2}\\ \vdots \\ X_{ip}\end{array}\right)$$

## Measures of Central Tendency



### Mean
The mean, or average, is the value obtained by dividing the sum of all the data points by the number of data points.


<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/4e3313161244f8ab61d897fb6e5fbf6647e1d5f5' />

## Mathematically Speaking


Throughout this course, we’ll use the ordinary notations for the mean of a variable. 

That is, the symbol $\mu$ is used to represent a (theoretical) population mean and the symbol $\bar{x}$ is used to represent a sample mean computed from observed data. 

In the multivariate setting, we add subscripts to these symbols to indicate the specific variable for which the mean is being given. For instance, $\mu_1$ represents the population mean for variable $x_1$, and $\bar{x}_1$ denotes a sample mean based on observed data for variable $x_1$.



The population mean is the measure of central tendency for the population. Here, the population mean for variable $j$ is:

$$\mu_j = E(X_{ij})$$

and the sample mean for variable $j$ is:

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n}X_{ij}$$
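The sample mean formula can be sketched in plain Python as follows. The data values are made up for illustration (3 subjects, 2 variables), and each inner list plays the role of one subject's vector $\mathbf{X}_i$. Note that Python indices start at 0, while the formula counts from 1.

```python
# Hypothetical data: n = 3 subjects, p = 2 variables (values made up for illustration)
X = [
    [55.0, 160.0],  # X_1: observations for subject 1
    [60.0, 170.0],  # X_2: observations for subject 2
    [65.0, 180.0],  # X_3: observations for subject 3
]

n = len(X)     # number of subjects
p = len(X[0])  # number of variables

# x_bar[j] is the sample mean of variable j: (1/n) * sum over subjects i of X[i][j]
x_bar = [sum(X[i][j] for i in range(n)) / n for j in range(p)]

print(x_bar)  # [60.0, 170.0]
```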

### Median

In a set with an odd number of data points, the median is the middle value; if the number of data points is even, it is the average of the two middle values.

In the weights list above, the number of data points is 10 (even), so the 5th and 6th items are the two middle values.


<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/da59c1e963f56160361fcce819a95f351748630a' />
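
As a worked example for the weights list at the top of this section, the 5th and 6th sorted values are 60 and 61, so:

$$\text{median} = \frac{60 + 61}{2} = 60.5$$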

### Mode

Mode refers to the data item that occurs most frequently in a given data set.

### Questions:

- When would the median be a better measure of central tendency than the mean?

- When is the mode the best measure of central tendency to use?

Questions:


1. We want to calculate the mean, median, and mode for the above list of numbers. Please write a function to calculate each of those statistics.



In [1]:
weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

In [92]:
#write a function that returns the mean 

def calc_mean(dataset):
    return sum(dataset) / len(dataset)
#     total = 0
#     for datapoint in dataset:
#         total += datapoint
#     mean = total / len(dataset)
#     return mean


In [93]:
#call this function on the weights list

calc_mean(weights)

161.44035683283076

In [94]:
#Write a function that returns the median 

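def calc_median(data):
    ordered = sorted(data)  # sort a copy so the caller's list is not modified
    n = len(ordered)
    if n % 2:
        # odd length: the single middle item (integer division gives a valid index)
        return ordered[n // 2]
    else:
        # even length: average of the two middle items
        return (ordered[n // 2] + ordered[(n // 2) - 1]) / 2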

        
#     median_index = dataset[len(dataset)/2]
#     median = dataset[int(median_index)]
#     midpoint = len(dataset) / 2
    
#     if midpoint % 2 == 1:
        
#     print(dataset[int(len(dataset)/2)])
#     return 

In [95]:
calc_median(weights)

161.21292769948298
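
The exercise also asks for the mode, which has no cell above. One possible sketch uses `collections.Counter` from the standard library; the name `calc_mode` and the choice to return a list (so that ties are all reported) are assumptions made here, not part of the original notebook.

```python
from collections import Counter

def calc_mode(data):
    """Return a list of the most frequent value(s) in data."""
    counts = Counter(data)
    highest = max(counts.values())
    # A data set can have more than one mode, so collect every value tied for the top count
    return [value for value, count in counts.items() if count == highest]

print(calc_mode([55, 56, 56, 58, 60, 61, 63, 64, 70, 78]))  # [56]
```

For an empty list, `max()` raises a `ValueError`, which seems reasonable since an empty data set has no mode.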

## Measures of Dispersion
Measures of dispersion quantify the spread of the data. They measure how much variation there is among the data points.



### Range
One simple such measure is the range, which is the difference between the largest and the smallest data item. For our previous dataset,

Range = 78–55 = 23.
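
In code, this is a one-line function. The name `calc_range` is chosen here to match the style of the other helper functions in this notebook.

```python
def calc_range(data):
    """Return the difference between the largest and smallest values."""
    return max(data) - min(data)

print(calc_range([55, 56, 56, 58, 60, 61, 63, 64, 70, 78]))  # 23
```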





### Interquartile Range (IQR)
The quartiles of a data set divide the data into four equal parts, with one-fourth of the data values in each part. The second quartile is the median of the data set, which divides the data set in half, as shown for a simple dataset below:

![IQR](iqr.png)

The interquartile range (IQR) is the difference between the third and first quartiles, so it measures the spread of the “middle fifty” percent of a data set. Where the range measures the distance between the smallest and largest values, the interquartile range measures where the bulk of the values lie. Because it ignores the extremes, it is less sensitive to outliers than the range, which is why it is often preferred when reporting things like retirement ages and test scores.
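
A minimal sketch of one way to compute the IQR is shown below. It assumes the "median of each half" convention for the quartiles; other conventions (for example the interpolation used by `statistics.quantiles` or `numpy.percentile`) can give slightly different values. The name `calc_iqr` is hypothetical.

```python
from statistics import median

def calc_iqr(data):
    """Return Q3 - Q1, taking Q1 and Q3 as the medians of the lower and upper halves."""
    ordered = sorted(data)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # values below the median position
    upper_half = ordered[(n + 1) // 2 :]  # values above the median position
    q1 = median(lower_half)
    q3 = median(upper_half)
    return q3 - q1

print(calc_iqr([55, 56, 56, 58, 60, 61, 63, 64, 70, 78]))  # 8
```

For the weights list, the lower half is `[55, 56, 56, 58, 60]` (median 56) and the upper half is `[61, 63, 64, 70, 78]` (median 64), giving an IQR of 8 compared with a range of 23.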

### Variance
A more complex measure of dispersion is variance. The variance of a population of size $N$ for variable $x_j$ is:

$$\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N}(X_{ij}-\mu_j)^2$$

The population variance $\sigma_j^2$ can be estimated by the sample variance:

$$s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij}-\bar{x}_j)^2$$ 

Variance signifies how much the data items deviate from the mean.

1) Larger variance means the data items deviate more from the mean.

2) Smaller variance means the data items are closer to the mean.

Now let’s calculate the sample variance for the previous dataset, whose mean is 62.1:

*Variance* = 

~~~
[(55–62.1)² + (56–62.1)² + (56–62.1)² + (58–62.1)² + (60–62.1)² + (61–62.1)² + (63–62.1)² + (64–62.1)² + (70–62.1)² + (78–62.1)²]/9

= 466.9/9

= 51.88

~~~

### Standard deviation
It is simply the square root of the variance. In the formulas above, $\sigma^2$ is the population variance and $s^2$ is the sample variance, so $\sigma$ and $s$ are the corresponding standard deviations. Hence, in this example the sample standard deviation is

$s = \sqrt{s^2} = \sqrt{51.88} \approx 7.20$
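
The hand calculation above can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample ($n - 1$) formulas. The variable name `weights_kg` is introduced here only for this sketch.

```python
import statistics

weights_kg = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

sample_variance = statistics.variance(weights_kg)  # divides by n - 1
sample_std = statistics.stdev(weights_kg)          # square root of the sample variance

print(round(sample_variance, 2))  # 51.88
print(round(sample_std, 2))       # 7.2
```

The population versions, which divide by the full count instead, are `statistics.pvariance` and `statistics.pstdev`.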

### Application


- Write a function to calculate the variance of a sample.



In [96]:
def calc_var(data):
    mean = calc_mean(data)  # compute the mean once, not on every iteration
    sum_of_sqs = 0
    for i in data:
        sum_of_sqs += (i - mean) ** 2
    return sum_of_sqs / (len(data) - 1)  # n - 1 gives the sample variance
calc_var(weights)

### List Comprehension


List comprehension is an elegant way to define and create lists based on existing lists.

**Syntax of List Comprehension**

`[expression for item in list]`


Let's take our list of data and create a new list where every data point is multiplied by 2.


In [108]:
# print(weights)
times_2 = [x*2 for x in weights]
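
As a small sketch of what the comprehension is doing, the block below builds the same list with an explicit `for` loop and checks that the two agree. It also shows the optional `if` clause, which is standard Python syntax not covered above. The variable names are chosen for this example only.

```python
weights_kg = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

# Building the list with an explicit for loop
doubled_loop = []
for x in weights_kg:
    doubled_loop.append(x * 2)

# The same list built with a list comprehension
doubled_comp = [x * 2 for x in weights_kg]

print(doubled_loop == doubled_comp)  # True

# An optional `if` clause keeps only the items that satisfy a condition
above_60 = [x for x in weights_kg if x > 60]
print(above_60)  # [61, 63, 64, 70, 78]
```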

Let's replace the for loop in `calc_var()` with a list comprehension.

In [98]:
def calc_var(data):
    mean = calc_mean(data)  # compute the mean once, outside the comprehension
    sum_of_sqs = sum([(i - mean) ** 2 for i in data])
    return sum_of_sqs / (len(data) - 1)

In [99]:
calc_var(weights)

1030.9518554353856

- Write a function to calculate the standard deviation of a sample using the variance function.

***Functions can call other functions***

In [100]:
def increase_mean(data, increase):
    new_mean =  calc_mean(data)+increase
    return new_mean

In [101]:
# increase_mean(data, 4)

In [102]:
#standard deviation function

In [111]:
import math

def calc_std(data):
    # sqrt lives in the math module, so it must be called as math.sqrt
    return math.sqrt(calc_var(data))

In [112]:
calc_std(weights)

32.108439006519546

## Measures of Association

Let’s say we have a dataset of the heights and weights of ten males. Normally we expect a person's height and weight to be correlated, i.e. a taller person is more likely to weigh more than a shorter person. Measures of association quantify the relationship between these kinds of data.

### Co-variance
One such measure is called co-variance, which measures how two variables vary with respect to each other. 

The population covariance $\sigma_{jk}$ between variables $j$ and $k$ can be calculated with the following formula:

$$\sigma_{jk} = \frac{1}{N}\sum_{i=1}^{N}(X_{ij}-\mu_j)(X_{ik}-\mu_k)$$

However, we often use the sample covariance, since we normally do not know the population parameters. This can be calculated using the formula below:


$$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij}-\bar{x}_j)(X_{ik}-\bar{x}_k)$$


### Positive & Negative Covariance.

1) Positive covariance signifies that the higher values of one variable correspond with the higher values of the other variable, and similarly for the lower ones.

2) Negative covariance, on the other hand, signifies that the higher values of one variable correspond to the lower values of the other.

The sign of the covariance therefore shows us the direction of the linear relationship between two variables.
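
A small sketch of the sign behaviour is shown below, using made-up toy data and a helper named `sample_covariance` (a hypothetical name used only in this example) that implements the sample formula above.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises as x rises
y_down = [10, 8, 6, 4, 2]  # falls as x rises

print(sample_covariance(x, y_up))    # 5.0
print(sample_covariance(x, y_down))  # -5.0
```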

#### Question:  
 What does a co-variance of 0 probably mean?

### Pearson Correlation Coefficient

Correlation measures the strength of the linear relationship between two variables. It is a scaled version of covariance and is dimensionless. In other words, the correlation coefficient is always a pure number, not measured in any units.

The values lie between +1 and -1:

- +1 signifies a perfect increasing linear relationship (correlation).

- -1 signifies a perfect decreasing linear relationship (anti-correlation).

The correlation coefficient is obtained by dividing the covariance by the product of the standard deviations of the two variables.  It is defined for the population as,

![correlation](correlation.jpeg)


For the sample it would be:

 $$r_{jk} = \frac{s_{jk}}{s_j s_k}$$


<img src='https://www.mathsisfun.com/data/images/correlation-examples.svg' />
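
To illustrate why correlation is described as dimensionless, the sketch below uses made-up height and weight values and expresses the heights in both centimetres and metres. The helper names `sample_covariance` and `pearson_r` are hypothetical and self-contained, so this block does not rely on the notebook's other functions.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Sample correlation: covariance divided by the product of standard deviations."""
    # The sample variance of a list is its covariance with itself
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

heights_cm = [160, 165, 170, 175, 180]     # hypothetical values
weights_kg = [55, 61, 64, 70, 78]          # hypothetical values
heights_m = [h / 100 for h in heights_cm]  # same heights, different unit

# Covariance depends on the units of measurement
print(sample_covariance(heights_cm, weights_kg))
print(sample_covariance(heights_m, weights_kg))

# Correlation is the same (up to floating-point rounding) in either unit
print(pearson_r(heights_cm, weights_kg))
print(pearson_r(heights_m, weights_kg))
```

Changing the unit rescales the covariance by a factor of 100, while the correlation coefficient is unaffected.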

### Applied 

1. Write a function to calculate the covariance of a dataset.
2. Write a function to calculate correlation using your functions for covariance and standard deviation.

In [116]:
#read in data to use

import csv
with open('weight-height.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)
    weights = []
    heights = []
    for row in readCSV:
        weight = row[2]
        height = row[1]

        weights.append(float(weight))
        heights.append(float(height))

    print(weights)
    print(heights[:5])

[241.893563180437, 162.3104725213, 212.7408555565, 220.042470303077, 206.349800623871, ...

In [114]:
def calc_covar(data1, data2):
    if len(data1) != len(data2):
        # fail loudly instead of printing a message and returning None
        raise ValueError('The datasets are not of equal size')
    data1_mean = calc_mean(data1)
    data2_mean = calc_mean(data2)
    # sum of the products of paired deviations from each mean
    sum_of_product = sum((x - data1_mean) * (y - data2_mean)
                         for x, y in zip(data1, data2))
    return sum_of_product / (len(data1) - 1)  # n - 1 gives the sample covariance

calc_covar(weights, heights)

114.2426564464631

In [122]:
def calc_correlation(data1, data2):
    return calc_covar(data1, data2) / (calc_std(data1) * calc_std(data2))
    
calc_correlation(weights, heights)
calc_var(weights)

1030.951855435386

In [2]:
import numpy as np

# np.cov returns the 2x2 sample covariance matrix (it divides by n - 1 by default);
# its off-diagonal entries should match calc_covar(weights, heights)
np.cov(weights, heights)


#### Question

When we find two variables are highly correlated, does that mean we can say one causes the other?

**Example:** 
*Children that watch a lot of TV are the most violent. Therefore, TV makes children more violent.*

http://www.tylervigen.com/spurious-correlations