# Statistics

### Describing single set of data

In [1]:
import random
num_friends = [random.randrange(1, random.randrange(50, 100)) for _ in range(200)]

In [2]:
from matplotlib import pyplot as plt
from collections import Counter
friend_count = Counter(num_friends)
xs = range(101)
ys = [friend_count[x] for x in xs]
plt.axis([0, 100, 0, 10])
plt.title("History of friend counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.bar(xs, ys)


<BarContainer object of 101 artists>

In [3]:
num_points = len(num_friends)
largest = max(num_friends)
lowest = min(num_friends)
print(num_points, largest, lowest)

200 89 1


### Central tendencies

#### Mean, Median, Mode gives notion of where our data is centered

In [4]:
def mean(x):
    return sum(x) / len(x)

mean(num_friends)

37.535

In [5]:
def median(v):
    """Finds middlemost value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    
    if n % 2 == 1:
        # if odd, return middle value
        return sorted_v[midpoint]
    else:
        # if even, return average of middle values
        lo = midpoint + 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2
median(num_friends)

37.5

#### Mean is sensitive to outliers of data while the median remains almost the same for new outliers added to data

In [6]:
y = num_friends.copy()
print(mean(y), median(y))
y += [1000]
print(mean(y), median(y))

37.535 37.5
42.32338308457712 37


*TODO : Efficient tricks to compute median without soriting*

#### Generalization of median is *quantile*, which represents the value less than which certain percentile of data lies.

* Median represents the value less than which 50% of data lies

In [7]:
def quantile(x, p):
    """Return the pth percentile value in x"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]

In [8]:
quantile(num_friends, 0.1)

7

In [9]:
quantile(num_friends, 0.3)

23

In [10]:
quantile(num_friends, 0.6)

44

In [11]:
quantile(num_friends, 0.9)

70

#### Mode is the most commonly occuring value

In [12]:
def mode(x):
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]

In [13]:
mode(num_friends)

[41]

### Dispersion

#### Measure of how spread our data is.
#### Values near 0 signify not spread and large values means very spread out.

In [14]:
def data_range(x):
    # Undispersed means all values are equal and hence max and min are eqaul
    return max(x) - min(x)

def de_mean(x):
    """Returns values deducted by the mean"""
    m = mean(x)
    return [x_i - m for x_i in x]

def variance(x):
    """Requires atleast 2 values in list"""
    n = len(x)
    deviations = de_mean(x)
    sum_of_squared_deviation = sum(x_i**2 for x_i in deviations)
    return sum_of_squared_deviation / (n - 1)

variance is $$ \frac{\sum (x_\mu - x)^2}{n - 1}$$

In [15]:
variance(num_friends)

505.75756281407

#### Standard deviation is square root of variance

In [16]:
import math
def standard_deviation(x):
    return math.sqrt(variance(x))

In [17]:
standard_deviation(num_friends)

22.48905428901069

### Correlation

In [23]:
friends = [100, 10, 20, 80, 50]
num_minutes = [40, 10, 13, 35, 25]

def dot(x, y):
    return sum(i * j for i, j in zip(x, y))

def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n - 1)

covariance(friends, num_minutes)

503.50000000000006

#### Covariance is the value using which we could specify the relationship among the two entities. Higher covariance means the values of two entities are directly propotional else it is inversly propotional

*But it is very hard to interpret covariance and hence we look at correlation*

#### Correlation divides standard deviation of both variables by the covariance

In [24]:
def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0

correlation(friends, num_minutes)

0.997565741535536

#### Correlation always lies between -1(perfect anti-correlation) and 1(perfect correlation).
If number like 0.25 outputs then we say it is weak positive correlation.
Here it has very strong positive correlation of 0.99

#### <font color=red>Correlation can be very sensitive to outliers</font>

### Simpson's Paradox

#### Correlations can sometimes be misleading when cofounding variables are ignored

> *Simpson's paradox occurs when some groups of data show a certain relationship in each group, but when the data is combined, that relationship is reversed*

<img src=https://ds055uzetaobb.cloudfront.net/image_optimizer/2276415d10826ea2bab7429afa215c693fa53fdb.png>