# Descriptive Statistics
## Similarity & Correlation

In [27]:
import math
import copy
import statistics
import scipy.stats
import numpy as np
import pandas as pd
import sklearn.metrics

import matplotlib.pyplot as plt
import seaborn as sns

In [18]:
from stats import *

## Diabetes dataset
Dataset: https://www.kaggle.com/datasets/shantanudhakadd/diabetes-dataset-for-beginners

Source: National Institute of Diabetes and Digestive and Kidney Diseases

The dataset consists of several medical predictor variables and one target variable, `Outcome`. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Download the dataset, extract it into the `data` folder (the code below reads `../data/diabetes.csv`), and rename the file extension from `.xls` to `.csv`.

In [3]:
df = pd.read_csv("../data/diabetes.csv")
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


### Covariance
Covariance is a measure of the joint variability of two random variables. It tells us if the paired values tend to rise together, or if one tends to rise as the other falls.

Consider two sets of data `x` and `y`. 
For example, in the above dataset, let's say `x` is `Age`, and `y` is `Glucose`.
Let the length of the data be `n`.

Algorithm:
1. Find means of `x` and `y`.
2. Subtract mean of `x` from every x value, and call it `a`, and subtract mean of `y` from every y value and call it `b`.
3. Multiply each `a` by its corresponding `b` to get $ab$.
4. Sum up all the $ab$ values.
5. Divide the sum of $ab$ by $n-1$, where $n$ is the total number of pairs.

Equations at each step:
1. $\bar{x} = \frac{\sum{x_i}}{n}$ & $\bar{y} = \frac{\sum{y_i}}{n}$.
2. $a_i = x_i - \bar{x}$ & $b_i = y_i - \bar{y}$.
3. Calculate $ab$ for every value.
4. Find $\sum_i^{n}{a_ib_i}$.
5. $ cov(X, Y) = \frac{1}{n-1}\sum_i^{n}{a_ib_i} $.

One equation for Covariance between two random variables X & Y:

$ cov(X, Y) = \frac{1}{n-1}\sum_i^{n}{(x_i - \bar{x})(y_i - \bar{y})} $
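
As a quick sanity check of the five steps, here is a minimal sketch on a small made-up dataset. The values below are illustrative, not taken from the diabetes data, and the helper name `covariance_steps` is hypothetical. It also shows the sign behaviour described earlier: positive when the pairs rise together, negative when one falls as the other rises.

```python
def covariance_steps(x, y):
    """Sample covariance, following the five steps listed above."""
    n = len(x)
    mean_x = sum(x) / n                      # step 1: means
    mean_y = sum(y) / n
    a = [xi - mean_x for xi in x]            # step 2: deviations from the mean
    b = [yi - mean_y for yi in y]
    ab = [ai * bi for ai, bi in zip(a, b)]   # step 3: pairwise products
    return sum(ab) / (n - 1)                 # steps 4 and 5: sum, divide by n-1

x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 6, 8, 10]    # rises with x
y_falling = [10, 8, 6, 4, 2]   # falls as x rises

cov_rising = covariance_steps(x, y_rising)
cov_falling = covariance_steps(x, y_falling)
print(cov_rising, cov_falling)  # 5.0 -5.0
```

By hand: the deviations are `a = [-2, -1, 0, 1, 2]` and `b = [-4, -2, 0, 2, 4]`, so the products sum to 20 and the covariance is 20 / 4 = 5.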

In [19]:
# Pure python implementation
def covariance(x, y):
    mean_x = calc_mean(x)
    mean_y = calc_mean(y)
    a = [xi - mean_x for xi in x]
    b = [yi - mean_y for yi in y]
    ab = [ai * bi for (ai, bi) in zip(a, b)]
    n = len(ab)
    cov = sum(ab) / (n-1)
    return cov

# Numpy deconstructed implementation
def numpy_covariance(x, y):
    a = (x - np.mean(x))
    b = (y - np.mean(y))
    n = len(a)
    cov = sum(a * b) / (n - 1)
    return cov

In [24]:
# Covariance
print("Covariance: ")
x = df["Age"]
y = df["Glucose"]

# Pure python
cov = covariance(x.to_list(), y.to_list())
print("\tUsing pure python: \t{}".format(cov))

# # Statistics library (new in python version 3.10+)
# cov = statistics.covariance(x.to_list(), y.to_list())
# print("\tUsing statistics: \t{}".format(cov))

# Numpy deconstructed
cov = numpy_covariance(x, y)
print("\tUsing numpy custom: \t{}".format(cov))

# Numpy library
cov = np.cov(x, y)
print("\tUsing numpy: \t\t{}".format(cov[0][1]))

Covariance: 
	Using pure python: 	99.08280536994792
	Using numpy custom: 	99.08280536994792
	Using numpy: 		99.08280536994786
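
The three results agree to about 14 significant digits. The differing trailing digits are most likely floating-point rounding, since the library routine does not perform the arithmetic in exactly the same order as the hand-written versions. When comparing such results in code, a tolerance-based check is safer than `==`. A minimal sketch using the two values printed above (the tolerance `1e-12` is an arbitrary choice for illustration):

```python
import math

# Values copied from the output above
pure_python = 99.08280536994792
numpy_cov = 99.08280536994786

exactly_equal = pure_python == numpy_cov
close_enough = math.isclose(pure_python, numpy_cov, rel_tol=1e-12)
print(exactly_equal, close_enough)  # False True
```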


### Pearson Correlation
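The Pearson correlation coefficient $r$ measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1.

Consider two sets of data `x` and `y`.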
For example, in the above dataset, let's say `x` is `Age`, and `y` is `Glucose`.
Let the length of the data be `n`.

Algorithm:
1. Find means of `x` and `y`.
2. Subtract mean of `x` from every x value, and call it `a`, and subtract mean of `y` from every y value and call it `b`.
3. Calculate: $ab$, $a^2$ and $b^2$ for every value.
4. Sum up $ab$, sum up $a^2$ and sum up $b^2$
5. Divide the sum of $ab$ by the square root of [(sum of $a^2$) × (sum of $b^2$)]

Equations at each step:
1. $\bar{x} = \frac{\sum{x_i}}{n}$ & $\bar{y} = \frac{\sum{y_i}}{n}$.
2. $a_i = x_i - \bar{x}$ & $b_i = y_i - \bar{y}$.
3. Calculate: $ab$, $a^2$ and $b^2$ for every value.
4. Find $\sum_i^{n}{a_ib_i}$, $\sum_i^{n}{a_i^2}$ and $\sum_i^{n}{b_i^2}$.
5. $ r_{xy} = \frac{\sum_i^{n}{a_ib_i}}{\sqrt{\sum_i^{n}{a_i^2} \sum_i^{n}{b_i^2}}} $.

One equation for Pearson correlation r:
$ r_{xy} = \frac{\sum_i^{n}{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum_i^{n}{(x_i - \bar{x})^2}\sum_i^{n}{(y_i - \bar{y})^2}}} $
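
Pearson's r can also be read as the covariance divided by the product of the two sample standard deviations. The $n-1$ factors cancel, which gives the same ratio of sums as in the algorithm above. The sketch below illustrates this on made-up values; the helper names `sample_cov` and `pearson_from_cov` are hypothetical. It also shows that, unlike covariance, r does not change when one variable is rescaled.

```python
import math

def sample_cov(x, y):
    """Sample covariance with the n-1 denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def pearson_from_cov(x, y):
    # The sample variance of x is the covariance of x with itself
    std_x = math.sqrt(sample_cov(x, x))
    std_y = math.sqrt(sample_cov(y, y))
    return sample_cov(x, y) / (std_x * std_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_from_cov(x, y)

# Rescaling x multiplies the covariance by 100 but leaves r unchanged
x_scaled = [100 * xi for xi in x]
r_scaled = pearson_from_cov(x_scaled, y)

print(round(r, 4), round(r_scaled, 4))  # 0.7746 0.7746
```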


In [4]:
# Pure python implementation
def pearson_correlation(x, y):
    mean_x = calc_mean(x)
    mean_y = calc_mean(y)
    a = [xi - mean_x for xi in x]
    b = [yi - mean_y for yi in y]
    ab = [ai * bi for (ai, bi) in zip(a, b)]
    a_sq = [ai**2 for ai in a]
    b_sq = [bi**2 for bi in b]
    r = sum(ab) / math.sqrt((sum(a_sq) * sum(b_sq)))
    return r

# Numpy deconstructed implementation
def numpy_pearson_r(x, y):
    a = (x - np.mean(x))
    b = (y - np.mean(y))
    r = sum(a * b) / np.sqrt(np.sum(np.power(a, 2)) * np.sum(np.power(b, 2)))
    return r

In [28]:
# Pearson Correlation r
print("Pearson Correlation: ")
x = df["Age"]
y = df["Glucose"]

# Pure python
r = pearson_correlation(x.to_list(), y.to_list())
print("\tUsing pure python: \t{}".format(r))

# Scipy library
r = scipy.stats.pearsonr(x.to_list(), y.to_list())
print("\tUsing scipy: \t\t{}".format(r[0]))

# Numpy deconstructed
r = numpy_pearson_r(x, y)
print("\tUsing numpy custom: \t{}".format(r))
# Numpy library
r = np.corrcoef(x, y)
print("\tUsing numpy: \t\t{}".format(r[0][1]))

Pearson Correlation: 
	Using pure python: 	0.26351431982433376
	Using scipy: 		0.2635143198243335
	Using numpy custom: 	0.2635143198243337
	Using numpy: 		0.26351431982433354


In [11]:
display(df.corr())
# Look up the value by label rather than by integer position
print("Correlation b/w Age and Glucose:", df.corr().loc["Age", "Glucose"])

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
Pregnancies,1.0,0.129459,0.141282,-0.081672,-0.073535,0.017683,-0.033523,0.544341,0.221898
Glucose,0.129459,1.0,0.15259,0.057328,0.331357,0.221071,0.137337,0.263514,0.466581
BloodPressure,0.141282,0.15259,1.0,0.207371,0.088933,0.281805,0.041265,0.239528,0.065068
SkinThickness,-0.081672,0.057328,0.207371,1.0,0.436783,0.392573,0.183928,-0.11397,0.074752
Insulin,-0.073535,0.331357,0.088933,0.436783,1.0,0.197859,0.185071,-0.042163,0.130548
BMI,0.017683,0.221071,0.281805,0.392573,0.197859,1.0,0.140647,0.036242,0.292695
DiabetesPedigreeFunction,-0.033523,0.137337,0.041265,0.183928,0.185071,0.140647,1.0,0.033561,0.173844
Age,0.544341,0.263514,0.239528,-0.11397,-0.042163,0.036242,0.033561,1.0,0.238356
Outcome,0.221898,0.466581,0.065068,0.074752,0.130548,0.292695,0.173844,0.238356,1.0


Correlation b/w Age and Glucose: 0.26351431982433327


### Cosine Similarity
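Cosine similarity is the cosine of the angle between two vectors. It compares their direction and ignores their magnitude.

Consider two sets of data `x` and `y`.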
For example, in the above dataset, let's say `x` is `Age`, and `y` is `Glucose`.
Let the length of the data be `n`.

Algorithm:
1. Find euclidean norms of `x` and `y`.
2. Multiply corresponding values of `x` and `y`, and sum the products (this is the dot product).
3. Divide the sum of $xy$ by the product of the norm of `x` and the norm of `y`.

Equations at each step:
1. $ norm(x) = \sqrt{\sum_i^{n}{x_i^2}} $ and $ norm(y) = \sqrt{\sum_i^{n}{y_i^2}} $.
2. Find $ X\cdot Y = \sum_i^{n}{x_iy_i}$.
3. Cosine similarity is $ \cos{\theta} = \frac{X\cdot Y}{norm(x) \times norm(y)} $.

Equation for cosine similarity:
$$ \cos(\theta) = \frac{X \cdot Y}{\|X\| \, \|Y\|} =  
\frac{\sum_i^{n}{x_i y_i}}{\sqrt{\sum_i^{n}{x_i^2}}\sqrt{\sum_i^{n}{y_i^2}}} $$
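
To make the formula concrete, here is a minimal sketch on three hand-picked pairs of vectors (made-up values; the function name `cosine` is hypothetical). Vectors pointing the same way give 1, perpendicular vectors give 0, and opposite vectors give -1, regardless of their lengths.

```python
import math

def cosine(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

same_direction = cosine([3, 4], [6, 8])    # y is x scaled by 2
orthogonal = cosine([1, 0], [0, 1])        # perpendicular axes
opposite = cosine([3, 4], [-3, -4])        # y is x flipped

print(same_direction, orthogonal, opposite)  # 1.0 0.0 -1.0
```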


In [25]:
# Pure python implementation
def cosine_similarity(x, y):
    xy = [xi * yi for (xi, yi) in zip(x, y)]
    x_sq = [xi**2 for xi in x]
    y_sq = [yi**2 for yi in y]
    cos_xy = sum(xy) / (math.sqrt((sum(x_sq))) * math.sqrt((sum(y_sq))))
    return cos_xy

# Numpy deconstructed implementation
def numpy_cosine_sim(x, y):
    cos_xy = np.sum(x * y) / (np.sqrt(np.sum(np.power(x, 2))) * np.sqrt(np.sum(np.power(y, 2))))
    return cos_xy

In [39]:
# Cosine Similarity
print("Cosine Similarity: ")
x = df["Age"]
y = df["Glucose"]

# Pure python
cos_xy = cosine_similarity(x.to_list(), y.to_list())
print("\tUsing pure python: \t{}".format(cos_xy))

# Numpy deconstructed
cos_xy = numpy_cosine_sim(x, y)
print("\tUsing numpy custom: \t{}".format(cos_xy))

# Sklearn library (expects 2D arrays, so reshape each column into a single row)
npx = np.reshape(x.to_numpy(), (1, -1))
npy = np.reshape(y.to_numpy(), (1, -1))
cos_xy = sklearn.metrics.pairwise.cosine_similarity(npx, npy)
print("\tUsing sklearn: \t\t{}".format(cos_xy[0][0]))

Cosine Similarity: 
	Using pure python: 	0.9339545437446973
	Using numpy custom: 	0.9339545437446973
	Using sklearn: 		0.9339545437446972
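
The cosine similarity between `Age` and `Glucose` (about 0.93) is much higher than their Pearson correlation (about 0.26). One likely reason is that both columns contain only non-negative values, so the two vectors point in broadly the same direction from the origin. Pearson correlation uses the same formula as cosine similarity, but applied after subtracting each variable's mean. The sketch below illustrates this on made-up, all-positive values (not rows from the dataset); the helper names `cosine` and `center` are hypothetical.

```python
import math

def cosine(x, y):
    """Cosine of the angle between x and y."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi ** 2 for xi in x)) *
                  math.sqrt(sum(yi ** 2 for yi in y)))

def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Made-up, all-positive values for illustration only
ages = [21, 33, 45, 50, 62]
glucose = [90, 85, 120, 100, 130]

raw_cosine = cosine(ages, glucose)
# Cosine of the mean-centered vectors is the Pearson formula
centered_cosine = cosine(center(ages), center(glucose))

print(raw_cosine > centered_cosine)  # True
```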


## Bibliography
1. https://www.mathsisfun.com/data/covariance.html
2. https://mathworld.wolfram.com/topics/DescriptiveStatistics.html
3. https://www.statology.org/descriptive-inferential-statistics/
4. https://en.wikipedia.org/wiki/Cosine_similarity

## Links
1. Dataset: https://www.kaggle.com/datasets/shantanudhakadd/diabetes-dataset-for-beginners