
# Starting off

Below is a list of weights (kg) of 10 male subjects. How would you describe this data set to another person?


```[55, 56, 56, 58, 60, 61, 63, 64, 70, 78]```

# Introducing Statistics: Measures of Central Tendency, Dispersion, and Correlation

## Aim:
- Be able to describe a large sample of data in a meaningful way that conveys information.

- Be able to describe how two sets of data are related to each other.

- Write functions to calculate the descriptive statistics of a data set.


A **population** is the collection of **all** people, plants, animals, or objects of interest about which we wish to make statistical inferences (generalizations). 

A **population parameter** is a numerical characteristic of a population. In nearly all statistical problems we do not know the value of a parameter because we do not measure the entire population. We use sample data to make an inference about the value of a parameter.



A **sample** is the subset of the population that we actually measure or observe.

A **sample statistic** is a numerical characteristic of a sample. A sample statistic estimates the unknown value of a population parameter. Information collected from sample statistics is sometimes referred to as descriptive statistics.

Here is the notation that will be used:

$X_{ij}$ = Observation for variable $j$ in subject $i$.

$p$ = Number of variables

$n$ = Number of subjects

In the example to come, we'll have data on 737 people (subjects) and 5 nutritional outcomes (variables). So, 

$p$ = 5 variables

$n$ = 737 subjects





In multivariate statistics we always work with vectors of observations, so we arrange the data for the $p$ variables on each subject into a vector. In the expression below, 
$\mathbf{X}_i$ is the vector of observations for the $i^{th}$ subject, $i$ = 1 to $n$ (737). The data for the $j^{th}$ variable is located in the $j^{th}$ element of this subject's vector, $j$ = 1 to $p$ (5).


$$\mathbf{X}_i = \left(\begin{array}{l}X_{i1}\\X_{i2}\\ \vdots \\ X_{ip}\end{array}\right)$$

## Measures of Central Tendency



### Mean
The mean, or average, is the value obtained by dividing the sum of all the data by the total number of data points. For the weights above, the sum is 621, so the mean is 621 / 10 = 62.1.


<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/4e3313161244f8ab61d897fb6e5fbf6647e1d5f5' />

## Mathematically Speaking


Throughout this course, we’ll use the ordinary notations for the mean of a variable. 

That is, the symbol $\mu$ is used to represent a (theoretical) population mean and the symbol $\bar{x}$ is used to represent a sample mean computed from observed data. 

In the multivariate setting, we add subscripts to these symbols to indicate the specific variable for which the mean is being given. For instance, $\mu_1$ represents the population mean for variable $x_1$, and $\bar{x}_1$ denotes a sample mean based on observed data for variable $x_1$.



The population mean is the measure of central tendency for the population. Here, the population mean for variable $j$ is:

$$\mu_j = E(X_{ij})$$

and the sample mean for variable $j$ is:

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n}X_{ij}$$
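As a minimal sketch of this notation, the data can be held as a list of per-subject vectors, with the sample mean of variable $j$ computed as the average of the $j^{th}$ element across subjects. The three subjects and two variables below are made-up values for illustration only, and `sample_mean_vector` is a hypothetical helper name, not part of the course material.

```python
# Hypothetical data: n = 3 subjects, p = 2 variables (values are made up).
# Each inner list is one subject's observation vector X_i.
X = [
    [60.0, 170.0],
    [70.0, 180.0],
    [80.0, 175.0],
]

def sample_mean_vector(X):
    """Return [x_bar_1, ..., x_bar_p], the sample mean of each variable."""
    n = len(X)      # number of subjects
    p = len(X[0])   # number of variables
    # For each variable j, average the j-th element over all subjects i.
    return [sum(X[i][j] for i in range(n)) / n for j in range(p)]

means = sample_mean_vector(X)
print(means)
```

Each entry of `means` corresponds to one $\bar{x}_j$ in the formula above.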

### Median

In a sorted data set with an odd number of data points, the median is the middle value; if the number of data points is even, it is the average of the two middle items.

In the data set above, the number of data points is 10 (even), so the 5th and 6th items (60 and 61) are the two middle items and the median is (60 + 61) / 2 = 60.5.


<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/da59c1e963f56160361fcce819a95f351748630a' />

### Mode

The mode is the data item that occurs most frequently in a given data set.

### Questions:

- When would median be a better measure of central tendency than mean?

- When is mode the best measure of central tendency to use?

Questions:


1. We want to calculate the mean, median, and mode for the above list of numbers. Please write a function to calculate each of those statistics.



In [1]:
data = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

In [55]:
def calc_mean(data):
    adds = 0
    for d in data:
        adds += float(d)
    return adds / len(data)

calc_mean(data)

62.1

In [46]:
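def calc_median(data):
    ordered = sorted(data)  # the median is defined on sorted data
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        # even count: average the two middle items
        return (ordered[mid - 1] + ordered[mid]) / 2
    # odd count: the single middle item
    return ordered[mid]

calc_median(data)
calc_median(data[:9])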

60

In [47]:
from collections import Counter
def calc_mode(data):
    count = Counter(data)
    return count.most_common(1)[0][0]

calc_mode(data)

56
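One way to sanity-check hand-written functions like these is to compare them against Python's standard-library `statistics` module. The sketch below only cross-checks the three values for the weights list; it is not meant to replace writing the functions yourself.

```python
import statistics

data = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

mean_value = statistics.mean(data)      # arithmetic mean
median_value = statistics.median(data)  # averages the two middle items when n is even
mode_value = statistics.mode(data)      # most frequent value

print(mean_value, median_value, mode_value)
```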

## Measures of Dispersion
Measures of dispersion quantify the spread of the data. They measure how much variation there is among the data points.



### Range
One simple measure is the range, which is the difference between the largest and the smallest data item. For our previous dataset,

Range = 78–55 = 23.
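The same calculation can be sketched in a couple of lines. `calc_range` is a hypothetical helper name chosen to match the naming of the other functions in this notebook.

```python
data = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

def calc_range(values):
    """Difference between the largest and the smallest item."""
    return max(values) - min(values)

data_range = calc_range(data)
print(data_range)  # 23
```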





### Interquartile Range (IQR)
The quartiles of a data set divide the data into four equal parts, with one-fourth of the data values in each part. The second quartile is the median of the data set, which divides the data set in half, as shown for a simple data set below:

![IQR](iqr.png)

The interquartile range (IQR) is the difference between the third and first quartiles, $Q_3 - Q_1$, so it describes where the “middle fifty” percent of a data set lies. Where the range measures the distance between the beginning and end of a set, the interquartile range measures the spread of the bulk of the values. Because it ignores the extremes, it is less sensitive to outliers than the range, which is why it is often preferred when reporting things like retirement ages or test scores.
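A rough sketch of the IQR for the weights list follows. It assumes the "median of each half" convention for the quartiles; other conventions interpolate differently and can give slightly different values. The helper names `median_of` and `calc_iqr` are hypothetical.

```python
data = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

def median_of(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]

def calc_iqr(values):
    """Return (Q1, Q3, IQR) using the median-of-halves convention."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # lower half of the sorted data
    upper = ordered[len(ordered) - half:]   # upper half (middle item skipped if n is odd)
    q1 = median_of(lower)
    q3 = median_of(upper)
    return q1, q3, q3 - q1

q1, q3, iqr = calc_iqr(data)
print(q1, q3, iqr)
```

Under this convention the lower half is `[55, 56, 56, 58, 60]` and the upper half is `[61, 63, 64, 70, 78]`, so Q1 = 56, Q3 = 64, and the IQR is 8.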

### Variance
A more complex measure of dispersion is variance. The variance of a population for variable $x_j$ is:

$\sigma_j^2 = E(X_{ij}-\mu_j)^2$

The population variance $\sigma_j^2$ can be estimated by the sample variance:

$s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij}-\bar{x}_j)^2=\frac{\sum_{i=1}^{n}X_{ij}^2-\left(\sum_{i=1}^{n}X_{ij}\right)^2/n}{n-1}$

Variance signifies how much the data items deviate from the mean.

1) Larger variance means the data items deviate more from the mean.

2) Smaller variance means the data items are closer to the mean.

Now let’s calculate the variance for the previous dataset,

*Variance* = 

~~~
[(55–62.1)² + (56–62.1)² + (56–62.1)² + (58–62.1)² + (60-62.1)² + (61–62.1)² + (63–62.1)² + (64–62.1)² + (70–62.1)² +(78–62.1)²]/9.

= 466.9/9

= 51.88

~~~
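The sample variance formula has two algebraically equivalent forms: the sum of squared deviations, and the shortcut using the sum of squares and the squared sum. The sketch below evaluates both on the weights list to show that they agree with the hand calculation; the variable names are illustrative.

```python
import math

data = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]
n = len(data)
mean = sum(data) / n

# Definitional form: squared deviations from the mean, divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Computational form: (sum of squares - (sum)^2 / n) / (n - 1).
var_shortcut = (sum(x ** 2 for x in data) - sum(data) ** 2 / n) / (n - 1)

print(round(var_definition, 2), round(var_shortcut, 2))
print(math.isclose(var_definition, var_shortcut))
```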

### Standard deviation
It is simply the square root of the variance: $\sigma$ is the population standard deviation and $\sigma^2$ the population variance, and likewise $s = \sqrt{s^2}$ for a sample.

$\sigma = \sqrt{\sigma^2}$

Hence, in this example the sample standard deviation is

$s = \sqrt{51.88} = 7.20$

### Application


- Write a function to calculate the variance of a dataset.
- Write a function to calculate the standard deviation of a dataset using the variance function.


In [48]:
def increase_mean(data, increase):
    # Mean of the given data, shifted up by `increase`
    new_mean = calc_mean(data) + increase
    print(new_mean)
    return new_mean

In [49]:
from math import sqrt

def calc_var(data):
    mean = calc_mean(data)
    return sum([(x - mean)**2 for x in data]) / (len(data) - 1)
#     return sum(list(map(lambda point: (point - mean)**2, data))) / (len(data) - 1)

calc_var(data)

51.87777777777777

In [50]:
def sd(data):
    return round((sqrt(calc_var(data))), 2)

sd(data)

7.2
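As a cross-check, the standard-library `statistics` module offers both the sample versions (dividing by $n-1$) and the population versions (dividing by $n$). This sketch contrasts the two on the weights list; it is only a comparison, not part of the exercise solution.

```python
import statistics

data = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

sample_var = statistics.variance(data)       # divides by n - 1
sample_sd = statistics.stdev(data)           # square root of the sample variance
population_var = statistics.pvariance(data)  # divides by n instead

print(round(sample_var, 2), round(sample_sd, 2), round(population_var, 2))
```

The sample values match the functions above, while the population variance is smaller because it divides the same sum of squared deviations (466.9) by 10 rather than 9.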

### List Comprehension


List comprehension is an elegant way to define and create lists based on existing lists.

**Syntax of List Comprehension**

`[expression for item in list]`


Let's take our list of data and create a new list where every data point is multiplied by 2.

In [51]:
[x*2 for x in data]

[110, 112, 112, 116, 120, 122, 126, 128, 140, 156]
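To make the syntax concrete, the sketch below builds the same doubled list twice, once with an explicit loop and once with a comprehension, and checks that the results are identical. The variable names are illustrative.

```python
data = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

# Loop version: start with an empty list and append one item at a time.
doubled_loop = []
for x in data:
    doubled_loop.append(x * 2)

# Comprehension version: the same expression and loop in a single line.
doubled_comp = [x * 2 for x in data]

print(doubled_loop == doubled_comp)
```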

## Measures of Association

Measures of association describe how two variables are related to each other.

### Correlation

Let’s say we have a dataset of the heights and weights of ten males. Normally we expect a person's weight and height to be correlated, i.e. a taller person is more likely to weigh more than a shorter person. Correlation measures the relationship between these kinds of data.

### Covariance
One such measure is called covariance, which measures how two variables vary with respect to each other.

The population covariance $\sigma_{jk}$ between variables $j$ and $k$ can be estimated by the sample covariance. This can be calculated using the formula below:


$$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij}-\bar{x}_j)(X_{ik}-\bar{x}_k)=\frac{\sum_{i=1}^{n}X_{ij}X_{ik}-(\sum_{i=1}^{n}X_{ij})(\sum_{i=1}^{n}X_{ik})/n}{n-1}$$





### Positive & Negative Covariance.

1) Positive covariance signifies that the higher values of one variable correspond with the higher values of the other variable, and similarly for the lower ones.

2) Negative covariance, on the other hand, signifies that the higher values of one variable correspond to the lower values of the other.

The sign of the covariance therefore shows the direction of the linear relationship between two variables.

#### Question:  
 What does a covariance of 0 probably mean? - no linear relationship (the variables may still be related in a non-linear way)
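The three cases can be illustrated with a small sketch on made-up numbers: one series that rises with `xs`, one that falls, and one that follows a curve. The last case shows that a covariance of 0 rules out a linear trend but not every kind of relationship. The `covariance` helper and the data are hypothetical, written only for this illustration.

```python
def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

xs = [-2, -1, 0, 1, 2]  # made-up values for illustration

cov_up = covariance(xs, [2 * x for x in xs])      # y rises as x rises
cov_down = covariance(xs, [-2 * x for x in xs])   # y falls as x rises
cov_curve = covariance(xs, [x ** 2 for x in xs])  # y = x^2, related but not linearly

print(cov_up, cov_down, cov_curve)
```

Here `cov_up` is positive, `cov_down` is negative, and `cov_curve` is exactly 0 even though `y` is completely determined by `x`.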

### Correlation Coefficient
The correlation coefficient is obtained by dividing the covariance by the product of the standard deviations of the two variables.  It is defined as,

![correlation](correlation.jpeg)

Or in our fancy notation it is: 

$$r_{jk}=\frac{s_{jk}}{s_js_k}=\frac{\sum_{i=1}^{n}X_{ij}X_{ik}-(\sum_{i=1}^{n}X_{ij})(\sum_{i=1}^{n}X_{ik})/n}{\sqrt{\{\sum_{i=1}^{n}X^2_{ij}-(\sum_{i=1}^{n}X_{ij})^2/n\}\{\sum_{i=1}^{n}X^2_{ik}-(\sum_{i=1}^{n}X_{ik})^2/n\}}}$$


The value of $r_{jk}$ lies between -1 and +1.

- +1 signifies a perfect increasing linear relationship (correlation).

- -1 signifies a perfect decreasing linear relationship (anti-correlation).

### Applied 

1. Write a function to calculate the covariance of a dataset.
2. Write a function to calculate correlation using your functions for covariance and standard deviation.

In [52]:
import csv
with open('weight-height.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    
    weights = []
    heights = []
    for row in readCSV:
        weight = row[2]
        height = row[1]

        weights.append(weight)
        heights.append(height)

    print(weights[:5])
    print(heights[:5])

['Weight', '241.893563180437', '162.310472521300', '212.7408555565', '220.042470303077']
['Height', '73.847017017515', '68.7819040458903', '74.1101053917849', '71.7309784033377']


In [53]:
weights = weights[1:]
heights = heights[1:]
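Note that the values read by `csv.reader` are still strings at this point. A possible alternative, sketched below, is `csv.DictReader`, which uses the header row as keys so the header does not need to be sliced off, combined with an explicit `float` conversion. The in-memory sample reuses the four rows printed above; the `Gender` column and its values are an assumption about the file layout, since only the `Height` and `Weight` columns appear in the output.

```python
import csv
import io

# In-memory stand-in for weight-height.csv, built from the rows printed above.
# The "Gender" column and its values are an assumption about the file layout.
sample = io.StringIO(
    "Gender,Height,Weight\n"
    "Male,73.847017017515,241.893563180437\n"
    "Male,68.7819040458903,162.310472521300\n"
    "Male,74.1101053917849,212.7408555565\n"
    "Male,71.7309784033377,220.042470303077\n"
)

heights_f = []
weights_f = []
for row in csv.DictReader(sample):  # the header row becomes the dict keys
    heights_f.append(float(row["Height"]))
    weights_f.append(float(row["Weight"]))

print(weights_f[:2], heights_f[:2])
```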

In [57]:
def calc_cov(data1, data2):
    mean1 = calc_mean(data1)
    mean2 = calc_mean(data2)
    covar_list = []
    for i in range(len(data1)):
        covar_list.append((float(data1[i]) - mean1) * (float(data2[i]) - mean2))
    return sum(covar_list) / (len(data1) - 1)

calc_cov(weights, heights)

114.2426564464631
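For the second task, one possible sketch of a correlation function is shown below. It is self-contained, so it defines its own small helpers (`mean_of`, `cov_of`) rather than reusing the notebook's functions, and it uses the fact that the variance of a variable is its covariance with itself. The test data is made up to show the two extreme values of $r$.

```python
from math import sqrt

def mean_of(values):
    return sum(values) / len(values)

def cov_of(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean_of(xs), mean_of(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def calc_corr(xs, ys):
    """Sample correlation r = s_xy / (s_x * s_y)."""
    # The variance of a variable is its covariance with itself.
    return cov_of(xs, ys) / (sqrt(cov_of(xs, xs)) * sqrt(cov_of(ys, ys)))

xs = [1, 2, 3, 4, 5]  # made-up values for illustration
r_up = calc_corr(xs, [10 * x + 3 for x in xs])  # perfectly increasing line
r_down = calc_corr(xs, [-x for x in xs])        # perfectly decreasing line

print(round(r_up, 6), round(r_down, 6))
```

Unlike the covariance, the result has no units and does not change when either variable is rescaled, which makes it easier to compare across data sets.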

#### Question

Does correlation imply causation?