# Karl Pearson’s correlation

## The variables should have a Gaussian (or Gaussian-like) distribution and a linear relationship

## Covariance

### cov(x,y) = sum((x-mean(x))*(y-mean(y)))/(n-1)
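As a minimal sketch of the formula above (assuming Python 3 and two equal-length lists), the sample covariance can be computed directly. The function name `sample_covariance` is hypothetical, chosen only for this illustration; the data are the score lists used elsewhere in this notebook.

```python
def sample_covariance(x, y):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of paired deviations from the means, divided by n - 1.
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


history = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
physics = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
print(sample_covariance(history, physics))
```

The result is expressed in "history points times physics points", which is why the raw number is hard to read on its own.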

## Covariance on its own is hard to interpret, because its magnitude depends on the units and spread of the two variables. This leads us to Pearson’s correlation coefficient.

## Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviations of the two data samples. It normalizes the covariance to give an interpretable score between -1 and +1.

### Pearson’s correlation coefficient = covariance(x,y)/(stdv(x)*stdv(y))

## The use of mean and standard deviation in the calculation suggests the need for the two data samples to have a Gaussian or Gaussian-like distribution.
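The normalization described above can be sketched as follows, assuming Python 3 and the standard-library `statistics` module. The function name `pearson_from_covariance` is hypothetical. `statistics.stdev` uses the same `n - 1` denominator as the covariance formula, so the two factors are consistent.

```python
from statistics import mean, stdev


def pearson_from_covariance(x, y):
    """Pearson's r as sample covariance divided by the product of sample stdevs."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    # Sample covariance with the n - 1 denominator.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # stdev() is the sample standard deviation (also n - 1).
    return cov / (stdev(x) * stdev(y))


history = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
physics = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
r = pearson_from_covariance(history, physics)
print("{:.3f}".format(r))
```

Because the `1/(n - 1)` factors cancel between numerator and denominator, this gives the same value as working with raw sums of squared deviations.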

# Spearman’s Correlation

## Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables.

## In addition, the two variables being considered may have a non-Gaussian distribution.

## This test of relationship can also be used if there is a linear relationship between the variables, but will have slightly less power (e.g. may result in lower coefficient scores).

## Instead of calculating the coefficient using covariance and standard deviations on the samples themselves, these statistics are calculated from the relative rank of the values in each sample. This is a common approach in non-parametric statistics, i.e. statistical methods that do not assume a particular distribution of the data, such as Gaussian.
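The rank-based idea above can be sketched in plain Python 3: replace each value by its rank, then apply the Pearson formula to the ranks. The helper names `average_ranks`, `pearson` and `spearman` are hypothetical, and giving tied values the mean of their ranks is an assumption of this sketch (a common convention, not something the text above specifies).

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's r from sums of deviations (the n - 1 factors cancel)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return numerator / denominator


def spearman(x, y):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but nonlinear
print(pearson(x, y), spearman(x, y))
```

For this monotonic but nonlinear example, the rank-based coefficient reports a perfect relationship while Pearson's coefficient falls below 1.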



In [6]:
%%time
def findNumber(arr_1, arr_2):
    """Print and return Pearson's correlation coefficient of two score lists."""
    History_Scores = list(arr_1)
    Physics_Scores = list(arr_2)
    avg_history_score = float(sum(History_Scores))/len(History_Scores)
    avg_physics_score = float(sum(Physics_Scores))/len(Physics_Scores)
    # Deviations from the mean, stored as lists so they can be reused below
    # (in Python 3 a map object is exhausted after a single pass).
    difference_history_score = [x-avg_history_score for x in History_Scores]
    difference_physics_score = [y-avg_physics_score for y in Physics_Scores]
    # Numerator: sum of the products of paired deviations.
    coeff = sum(x*y for x, y in zip(difference_history_score, difference_physics_score))
    # Root sums of squared deviations; the 1/(n-1) factors cancel in the ratio.
    stdv__history_score = (sum([i**2 for i in difference_history_score]))**0.5
    stdv__physics_score = (sum([i**2 for i in difference_physics_score]))**0.5
    score = coeff/(stdv__history_score*stdv__physics_score)
    print("Karl Pearson’s coefficient {score:.3f}".format(score=score))
    return score

CPU times: user 10 µs, sys: 2 µs, total: 12 µs
Wall time: 16.9 µs


## Pearson correlation coefficient and p-value for testing non-correlation.

    The Pearson correlation coefficient [1]_ measures the linear relationship
    between two datasets.  The calculation of the p-value relies on the
    assumption that each dataset is normally distributed.  (See Kowalski [3]_
    for a discussion of the effects of non-normality of the input on the
    distribution of the correlation coefficient.)  Like other correlation
    coefficients, this one varies between -1 and +1 with 0 implying no
    correlation. Correlations of -1 or +1 imply an exact linear relationship.
    Positive correlations imply that as x increases, so does y. Negative
    correlations imply that as x increases, y decreases.

    The p-value roughly indicates the probability of an uncorrelated system
    producing datasets that have a Pearson correlation at least as extreme
    as the one computed from these datasets.
    
    
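    The correlation coefficient is calculated as follows:

$$r = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}$$

where $m_x$ is the mean of `x` and $m_y$ is the mean of `y`.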
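To make the meaning of the p-value more concrete, here is a rough sketch of a permutation test in Python 3. This is a swapped-in technique for illustration, not how `scipy.stats.pearsonr` computes its p-value by default (the docstring above describes a calculation that assumes normally distributed data). The names `pearson_r` and `permutation_p_value`, the number of permutations, and the seed are all assumptions of this sketch.

```python
import random


def pearson_r(x, y):
    """Pearson's r from sums of deviations from the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value estimate: the fraction of shuffles of y whose |r|
    is at least as large as the observed |r|."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any pairing with x, simulating "no correlation".
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return hits / n_permutations


history = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
physics = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
p = permutation_p_value(history, physics)
print(p)
```

A large estimate here means that shuffled, unrelated pairings often produce a correlation at least as strong as the observed one, so the observed coefficient is weak evidence of a relationship.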

In [3]:
%%time
import scipy.stats as s
History_Scores = [10,  25,  17,  11,  13,  17,  20,  13,  9,   15]
Physics_Scores = [15,  12,  8,   8,   7,   7,   7,   6,   5,   3]
print(s.pearsonr(History_Scores,Physics_Scores))

(0.14499815458068518, 0.6894014481166955)
CPU times: user 570 µs, sys: 0 ns, total: 570 µs
Wall time: 500 µs


In [4]:
import ast

# Python 3's input() returns a string, so parse the typed list literal
# (e.g. [10, 25, 17]) safely with ast.literal_eval.
History_Scores = list(ast.literal_eval(input("Give History Score list here")))
Physics_Scores = list(ast.literal_eval(input("Give Physics Score list here")))
avg_history_score = float(sum(History_Scores))/len(History_Scores)
avg_physics_score = float(sum(Physics_Scores))/len(Physics_Scores)
# Deviations from the mean, stored as lists so they can be reused below.
difference_history_score = [x-avg_history_score for x in History_Scores]
difference_physics_score = [y-avg_physics_score for y in Physics_Scores]
# Numerator: sum of the products of paired deviations.
coeff = sum(x*y for x, y in zip(difference_history_score, difference_physics_score))
# Root sums of squared deviations; the 1/(n-1) factors cancel in the ratio.
stdv__history_score = (sum([i**2 for i in difference_history_score]))**0.5
stdv__physics_score = (sum([i**2 for i in difference_physics_score]))**0.5
score = coeff/(stdv__history_score*stdv__physics_score)
print("Karl Pearson’s coefficient {score:.3f}".format(score=score))


Give History Score list here[10,  25,  17,  11,  13,  17,  20,  13,  9,   15]
Give Physics Score list here[15,  12,  8,   8,   7,   7,   7,   6,   5,   3]
Karl Pearson’s coefficient 0.145


In [None]:
if __name__ == '__main__' :
    #fptr = open(os.environ['OUTPUT_PATH'], 'w')
    arr_count_1 = int(input().strip())
    arr_count_2 = int(input().strip())
    arr_1 = []
    arr_2 = []
    for _ in range(arr_count_1):
        arr_item = int(input().strip())
        arr_1.append(arr_item)
    for _ in range(arr_count_2):
        arr_item = int(input().strip())
        arr_2.append(arr_item)

    findNumber(arr_1, arr_2)

"10"
10
"10"
10
