# Statistics for Data Science

## Central Tendencies

### Mean

When we want a notion of where our data is centered, we use the mean: the sum of the values divided by the number of values.

In [2]:
# Let this list be the batting scores of Virat Kohli
scores = [23, 45, 56, 78, 90, 110, 24, 34, 0, 8, 0, 11, 17, 19, 21, 23, 25]

def mean(x):
    return sum(x)/len(x)

mean(scores)


34.35294117647059

### Median

We are sometimes also interested in the middle-most value instead of the average. For an odd number of points, the median is the middle value of the sorted data; for an even number of points, it is the average of the two middle values.

In [3]:
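def median(x):
    x = sorted(x)
    mid = len(x) // 2
    if len(x) % 2 == 0:
        # even length: average the two middle values (indices mid-1 and mid)
        return (x[mid - 1] + x[mid]) / 2
    else:
        # odd length: the single middle value
        return x[mid]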

median(scores)

23

One thing to note here is that the mean is very sensitive to outliers, so it can paint a misleading picture of the central tendency. The median, by contrast, is largely unaffected by a few extreme values.
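To make the outlier point concrete, here is a small sketch using Python's standard `statistics` module. The two lists are invented for illustration and are not real scores; the second one also happens to show the even-length median case.

```python
import statistics

# Hypothetical scores, invented for illustration only
typical = [20, 22, 23, 25, 27]
with_outlier = typical + [200]  # one unusually large value added

# Mean and median before adding the outlier
mean_before = statistics.mean(typical)
median_before = statistics.median(typical)

# Mean and median after adding the outlier
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)  # even length: average of 23 and 25

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

A single extreme value more than doubles the mean, while the median only moves from 23 to 24.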

### Quantile

A generalisation of the median is the quantile: the value below which a given fraction of the data lies. The median corresponds to a fraction of 0.5.

In [6]:
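def quantile(x, p):
    """returns the value in x below which roughly a fraction p of the data lies (0 <= p <= 1)"""
    p_index = int(p * len(x))
    # clamp the index so that a very small p does not wrap around to the last element
    return sorted(x)[max(p_index - 1, 0)]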

quantile(scores, 0.8)

45
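As a rough illustration of how quantiles generalise the median, the sketch below computes the three quartiles of the same scores. It repeats the indexing rule of the `quantile` function from this section so that the block runs on its own; other libraries use different interpolation rules and may return slightly different values.

```python
def quantile(x, p):
    """Value below which roughly a fraction p of x lies (same indexing rule as above)."""
    p_index = int(p * len(x))
    return sorted(x)[max(p_index - 1, 0)]

scores = [23, 45, 56, 78, 90, 110, 24, 34, 0, 8, 0, 11, 17, 19, 21, 23, 25]

# Quartiles: the values at fractions 0.25, 0.5 and 0.75
quartiles = [quantile(scores, p) for p in (0.25, 0.5, 0.75)]
print(quartiles)  # [11, 23, 34]
```

The middle quartile (fraction 0.5) is 23, which agrees with the median computed earlier.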

### Mode
The mode is the most common value (or values, since several values can tie for the highest count).

In [8]:
from collections import Counter
def mode(x):
    """returns a list of the modes (there can be more than one mode)"""
    counts = Counter(x)
    max_count = max(counts.values())  # compute the highest count once
    return [x_i for x_i, count in counts.items() if count == max_count]

mode(scores)

[23, 0]
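As a cross-check, the standard library offers `statistics.multimode` (available from Python 3.8), which also returns every value tied for the highest count. This is only an alternative to the hand-written function, not a replacement for understanding it.

```python
import statistics

scores = [23, 45, 56, 78, 90, 110, 24, 34, 0, 8, 0, 11, 17, 19, 21, 23, 25]

# multimode returns all most-common values, in the order they first appear
modes = statistics.multimode(scores)
print(modes)  # [23, 0]
```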

## Dispersion
Dispersion measures how spread out the data is.

### Range
This is simply the difference between the maximum and the minimum value.

In [9]:
def data_range(x):
    # named data_range so that it does not shadow Python's built-in range()
    return max(x) - min(x)

data_range(scores)

110

### Variance
Variance is a more involved measure of dispersion. It is almost the average squared deviation from the mean, except that we divide by n-1 instead of n. The reason is that, when we are dealing with a sample from a larger population, x_bar is only an estimate of the true mean. Because x_bar is computed from the sample itself, it sits closer to the sample points than the true mean does, so on average (x_i - x_bar) ** 2 underestimates x_i's squared deviation from the true mean. Dividing by n-1 instead of n compensates for this underestimate.

In [10]:
def variance(x):
    x_bar = mean(x)  # compute the mean once instead of once per element
    devs = [x_i - x_bar for x_i in x]
    return sum(dev**2 for dev in devs) / (len(x) - 1)

variance(scores)

1004.6176470588235
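To see the effect of the n-1 divisor, one option is to compare the two variance functions in Python's standard `statistics` module: `statistics.variance` divides by n-1 (sample variance) and `statistics.pvariance` divides by n (population variance). This is a sketch for comparison only.

```python
import statistics

scores = [23, 45, 56, 78, 90, 110, 24, 34, 0, 8, 0, 11, 17, 19, 21, 23, 25]
n = len(scores)

sample_var = statistics.variance(scores)       # divides by n - 1
population_var = statistics.pvariance(scores)  # divides by n

print(sample_var)      # matches the hand-written variance above
print(population_var)  # smaller, by a factor of (n - 1) / n
```

The two results differ only by the factor (n - 1) / n, so the gap shrinks as the sample grows.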

The unit of variance is the square of the original unit (here, scores squared), which makes it hard to interpret. For that reason, we usually report the standard deviation, which is the square root of the variance.

### Standard Deviation

In [11]:
import math
def standard_deviation(x):
    return math.sqrt(variance(x))

standard_deviation(scores)

31.695703921175557

## Correlation

Let's say we want to examine the scores of Virat Kohli alongside one more metric: the number of protein bars he ate before each match started. Let's find out whether his performance is related to the number of bars he ate.

In [12]:
bars = [1,0,2,4,0,3,4,5,0,1,2,0,1,2,5,7]

We will calculate the covariance first. It is the dot product of the two variables' deviations from their means, divided by n-1.

In [13]:
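def dot(x, y):
    # zip stops at the shorter input, so x and y should have the same length
    return sum(x_i * y_i for x_i, y_i in zip(x, y))

def covariance(x, y):
    x_bar, y_bar = mean(x), mean(y)  # compute each mean once
    x_dev = [x_i - x_bar for x_i in x]
    y_dev = [y_i - y_bar for y_i in y]
    return dot(x_dev, y_dev) / (len(x) - 1)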

covariance(scores,bars)


4.957031249999998

Generally, a positive covariance means that the two variables tend to be large together and small together, while a negative covariance means that one tends to be large when the other is small. But because its unit is the product of the two variables' units, the magnitude is hard to interpret. Therefore, we divide the covariance by the standard deviations of both x and y to obtain the correlation.

In [14]:
def correlation(x,y):
    return covariance(x,y)/standard_deviation(x)/standard_deviation(y)

correlation(scores,bars)

0.07268919291750593

The correlation is unit-less and always lies between -1 and 1. Here, we observe a correlation of about 0.073, which shows that there is very little linear relationship between the number of bars and the scores of Virat Kohli :))
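The sketch below illustrates why correlation is easier to read than covariance: changing the unit of one variable rescales the covariance but leaves the correlation untouched. The data and the figure of 50 grams per bar are invented for illustration, and the helper functions are compact re-statements of the ones in this section so that the block runs on its own.

```python
import math

def mean(x):
    return sum(x) / len(x)

def covariance(x, y):
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    # covariance(x, x) is the variance of x, so its square root is the standard deviation
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data, invented for illustration only
bars_eaten = [1, 2, 3, 4, 5]
runs = [12, 25, 31, 48, 52]
grams_eaten = [50 * b for b in bars_eaten]  # assumed 50 g per bar, same information in another unit

cov_bars = covariance(bars_eaten, runs)
cov_grams = covariance(grams_eaten, runs)
corr_bars = correlation(bars_eaten, runs)
corr_grams = correlation(grams_eaten, runs)

print(cov_bars, cov_grams)    # covariance is 50 times larger in grams
print(corr_bars, corr_grams)  # correlation is the same in both units
```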

One very common saying is that "correlation is not causation". This means that if two variables are highly correlated, it does not necessarily follow that one caused the other. We cannot conclude that eating more bars caused Kohli to score more runs, or vice versa. Correlations must first be confirmed as real, and every possible causative relationship must then be systematically explored.