# Pearson's r with Health Searches

The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is a statistic that measures the degree of linear correlation between two variables. In this notebook, we'll use the Pearson correlation coefficient (abbreviated as Pearson's r, as in the notebook title) to gauge the degree of linearity in Google health search volume.

First, though, we'll have to understand what Pearson's correlation coefficient is. The following chart (from Wikipedia) shows the correlation coefficients of various distributions:

![](https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg)

The more tightly linear the relationship between two variables X and Y is, the closer Pearson's correlation coefficient (henceforth PCC) will be to either -1, if the relationship is negative, or +1, if the relationship is positive. Variables with no linear relationship at all will receive a PCC of 0.

Mathematically, Pearson's correlation coefficient can be stated as:

$$\rho_{X, Y} = \frac{\text{Cov}(X, Y)}{\sigma_x \sigma_y}$$

Pearson's correlation coefficient is *normalized covariance*. But what is covariance?

To understand this formula, start by looking at variance. The variance ($\text{Var}(X)$ or $\sigma^2_x$) of a distribution measures the average squared distance between a point drawn from the distribution and the distribution mean. A high variance means that the distribution is very spread out, but variance can be difficult to interpret because it is expressed in squared units.
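As a rough illustration of why the squared units make variance awkward to read, here is a minimal sketch. The sample values and the `variance` helper are made up for this example, and the divide-by-n form of the variance is assumed.

```python
import numpy as np

# Hypothetical measurements in metres, chosen only for illustration.
x_m = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

def variance(values):
    # Average squared distance from the mean (divide-by-n form).
    return np.mean((values - values.mean()) ** 2)

var_m = variance(x_m)          # expressed in square metres
var_cm = variance(100 * x_m)   # the same data expressed in centimetres

print(var_m, var_cm)
```

Converting the units by a factor of 100 multiplies the variance by 100 squared, which is the sense in which variance is a squared term.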

The standard deviation of a distribution ($\sigma(X)$ or $\sigma_x$) is the square root of the variance. Variance is a squared property, while the standard deviation is a linear property, expressed in the same units as the data. The utility of the standard deviation is demonstrated very nicely by the normal distribution, where it gives a very effective measure of how unusual a value is:

![](https://upload.wikimedia.org/wikipedia/commons/8/8c/Standard_deviation_diagram.svg)

So, for example, a point three or more standard deviations from the mean has only about a 0.3% chance of occurring.
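The tail probability for a normal distribution can be computed directly with the standard library. This is a minimal sketch that relies on the identity P(|Z| >= k) = erfc(k / sqrt(2)) for a standard normal Z; the helper name `two_sided_tail` is made up for this example.

```python
import math

def two_sided_tail(k):
    # Probability that a standard normal value lies k or more
    # standard deviations away from the mean, in either direction.
    return math.erfc(k / math.sqrt(2))

p3 = two_sided_tail(3)
print(f"{p3:.4%}")
```

For three standard deviations this works out to roughly 0.27%.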

Covariance measures the joint variability of two variables. Where variance averages the squared distance between each point and the mean, covariance averages the product of the two variables' distances from their respective means: $\text{Cov}(X, Y) = E[(X - \mu_x)(Y - \mu_y)]$. It is positive when the two variables tend to fall on the same side of their means together, and negative when they tend to fall on opposite sides.
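A minimal sketch of a covariance computation, assuming the usual definition as the average product of deviations from the two means. The paired samples are hypothetical.

```python
import numpy as np

# Hypothetical paired samples, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Average product of each variable's deviation from its own mean.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov returns a 2x2 matrix; bias=True selects the same divide-by-n convention.
cov_np = np.cov(x, y, bias=True)[0, 1]

print(cov_xy, cov_np)
```

Both values agree (1.2 for this sample), and the covariance of a variable with itself reduces to its variance.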

Pearson's correlation coefficient normalizes covariance by dividing it by the product of the standard deviations of the two variables being measured. Simple!
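To make the normalization concrete, here is a small sketch that divides a covariance by the product of the two standard deviations and compares the result with NumPy's `np.corrcoef`. The sample values are hypothetical.

```python
import numpy as np

# Hypothetical paired samples, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# np.std defaults to the divide-by-n form, matching cov_xy above.
rho = cov_xy / (np.std(x) * np.std(y))

# Off-diagonal entry of the 2x2 correlation matrix.
rho_np = np.corrcoef(x, y)[0, 1]

print(rho, rho_np)
```

The covariance here is 1.2, a number that depends on the units of `x` and `y`; after normalization the result is about 0.77 and always falls between -1 and 1.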

So far we've been talking about a distribution, but when we work with data we don't know the underlying distribution: we just have a bunch of values. So we need to come up with a Pearson's correlation coefficient statistic, using statistics which measure the covariance and standard deviation of the variables in question. Hence we write:

$$\rho_{X, Y} = \frac{\text{Cov}(X, Y)}{\sigma_x \sigma_y} \approx \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}}$$
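One way to see why the right-hand side carries no $\frac{1}{n}$ factors, assuming the divide-by-$n$ estimates of covariance and standard deviation: the factors in the numerator and denominator cancel.

$$\frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}}$$

The same cancellation happens with the $\frac{1}{n-1}$ convention, so the choice between the two does not affect the result.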

Here's a hand implementation of this measurement:

In [None]:
import numpy as np

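def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length arrays."""
    x_bar, y_bar = np.mean(x), np.mean(y)
    # The 1/n factors of the covariance and standard deviation estimates
    # cancel in the ratio, so plain sums are used throughout.
    cov_est = np.sum((x - x_bar) * (y - y_bar))
    std_x_est = np.sqrt(np.sum((x - x_bar)**2))
    std_y_est = np.sqrt(np.sum((y - y_bar)**2))
    return cov_est / (std_x_est * std_y_est)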
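A quick sanity check of this kind of implementation might look like the sketch below. The function is restated so the snippet runs on its own, and the noisy data is synthetic rather than taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    # Same computation as the hand implementation, restated for a standalone run.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_dev, y_dev = x - x.mean(), y - y.mean()
    return np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev**2) * np.sum(y_dev**2))

x = np.arange(10.0)
r_up = pearson_r(x, 3 * x + 1)      # perfectly linear, positive slope
r_down = pearson_r(x, -2 * x + 5)   # perfectly linear, negative slope

# Synthetic noisy data (an assumption for illustration only).
rng = np.random.default_rng(0)
y_noisy = x + rng.normal(scale=2.0, size=x.size)
r_noisy = pearson_r(x, y_noisy)

print(r_up, r_down, r_noisy, np.corrcoef(x, y_noisy)[0, 1])
```

The perfectly linear cases come out at +1 and -1 (up to floating point error), and the noisy case should match the off-diagonal entry of `np.corrcoef`.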

Correlation is a very fundamental property of your variables. It's a general-purpose measurement that can be taken before you do any modeling. As a rule, the stronger the correlation (in absolute value) between your target variable and your predictor variables, the better a linear model will perform.
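One concrete instance of this rule: for a least-squares line with an intercept and a single predictor, the coefficient of determination equals the square of Pearson's r. The sketch below checks this on synthetic data; the data and the use of `np.polyfit` are choices made for this example, not part of the original analysis.

```python
import numpy as np

# Synthetic predictor and target (an assumption for illustration only).
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]

# Fit a straight line by least squares and measure how much variance it explains.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

print(r**2, r_squared)
```

The two printed values should agree up to floating point error. This identity holds only for a single-predictor linear fit, so it is a guide rather than a guarantee for other models.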

## Application

We're now going to look at an application of the Pearson correlation coefficient to real data. We'll use the Google health search dataset, helpfully published on Kaggle by Google's News Lab, to do so.

The health search dataset includes an index of volumes of searches for various common medical topics throughout an assortment of areas in the United States. The data covers the period 2004 through 2017, with a different index value for every place and every year. This dataset is interesting because we naturally expect there to be a relationship between the search volume for a specific term in 2004 and 2017, but we expect this relationship to be *much stronger* the closer we get to 2017, maxing out in 2016.

We can verify that this is indeed the case by plotting the joint distributions of these variables using `seaborn`:

In [None]:
import pandas as pd
searches = pd.read_csv("../input/RegionalInterestByConditionOverTime.csv")
searches.head()

In [None]:
import seaborn as sns
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.jointplot(x='2004+cancer', y='2017+cancer', data=searches)

In [None]:
import seaborn as sns
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.jointplot(x='2016+cancer', y='2017+cancer', data=searches)

As you can see, the second of these distributions is significantly closer to linear than the first one is!

Let's see if our hypothesis holds for the `cancer` search topic for the rest of the years in the dataset.

In [None]:
p_corrs = [pearson_r(searches["{0}+cancer".format(year)], searches["2017+cancer"]) for year 
           in range(2004, 2018)]

In [None]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
pd.Series(p_corrs, index=range(2004, 2018)).plot.line()

We are correct: the Pearson correlation coefficient (within a certain amount of randomness) does go up over time!

The health searches dataset gives us access to nine time series like this one. It would be very taxing to draw a scatter plot for every topic and every year. By plotting the correlation coefficients over time instead, we can summarize how similar medical searches today are to those in the past in a single line chart.

In [None]:
df = pd.DataFrame()
for topic in ['cancer', 'cardiovascular', 'stroke', 'depression', 'rehab', 'vaccine', 
              'diarrhea', 'obesity', 'diabetes']:
    p_corrs = [pearson_r(searches["{0}+{1}".format(year, topic)], 
                         searches["2017+{0}".format(topic)]) for year in range(2004, 2018)]
    df[topic] = p_corrs
    
df.index = range(2004, 2018)

In [None]:
df.plot.line(figsize=(12, 6), cmap='viridis', 
             title='Google Medical Searches by Similarity to 2017')

All of the medical topics included in this dataset behave similarly with respect to the present search pattern over the last few years, but there is an increasing amount of variance going back to 2014. To interpret this chart, it's helpful to look at the distributions image from earlier:

![](https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg)

A comparison between search today and search three to five years ago is much like looking at the blob in the second distribution in the top row of this picture, the one with a Pearson coefficient of 0.8. Pretty tight! Going back ten years, we get a coefficient of around 0.4, like that of the third distribution in the top row; not nearly as good, but still with an identifiable trend.

Diarrhea seems to be the medical topic whose searches are least similar today to what they were in the past, while (more weakly) cancer seems to be the most consistent.

And note that the correlation coefficient of a (non-constant) variable with itself is always exactly 1, which is why every line reaches 1 in 2017.

### Closing remarks

There are a few more things that are worth stating about the Pearson correlation coefficient.

First of all, it's important to note that the Pearson correlation coefficient is both scale and location invariant. Linear changes in the data, like multiplying every value by a positive constant (which changes the slope) or moving every data point up by some number of units, will not change the Pearson correlation coefficient; multiplying by a negative constant only flips its sign. This is a helpful property because it makes the coefficient easier to interpret and compare across datasets in practice, but it also means the coefficient is not a good model fit metric, since predictions that are off by a constant factor or offset would still score perfectly.
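The invariance can be checked numerically. This is a minimal sketch on synthetic data, using `np.corrcoef` as the correlation routine; the shift and scale constants are arbitrary.

```python
import numpy as np

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

r = np.corrcoef(x, y)[0, 1]
r_shifted = np.corrcoef(x, y + 1000.0)[0, 1]   # move every point up
r_scaled = np.corrcoef(x, 25.0 * y)[0, 1]      # stretch by a positive factor
r_flipped = np.corrcoef(x, -3.0 * y)[0, 1]     # stretch by a negative factor

print(r, r_shifted, r_scaled, r_flipped)
```

The first three values should match up to floating point error, while the last one has the same magnitude with the opposite sign.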

Second of all, the Pearson correlation coefficient can only measure linear relationships between variables. More complex relationships, such as a symmetric quadratic one, can elude this measurement entirely and produce a coefficient near 0, even though one variable fully determines the other. This is demonstrated quite nicely in the third row of the diagram above, which shows various shapes that have near-0 correlation. In practice, always scatter plot your data to confirm what you think you are seeing!
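A small sketch of this blind spot, using made-up data: a parabola centered on the data has essentially zero Pearson correlation, while a monotonic curve such as a cubic still scores fairly high even though it is not a line.

```python
import numpy as np

# Evenly spaced points, symmetric around zero (chosen for illustration).
x = np.linspace(-1.0, 1.0, 201)

r_quadratic = np.corrcoef(x, x**2)[0, 1]   # y is fully determined by x, yet r is ~0
r_cubic = np.corrcoef(x, x**3)[0, 1]       # nonlinear but monotonic

print(r_quadratic, r_cubic)
```

So a coefficient near 0 rules out a linear trend, not a relationship, and a high coefficient does not prove the relationship is a straight line.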

Finally, the Pearson coefficient is just one "kind" of correlation coefficient. A correlation coefficient is any measurement of interdependence between variables that ranges from -1 to 1, with -1 and 1 indicating total dependence and 0 indicating none. Because of how popular the Pearson coefficient is, it is sometimes "just" called variable correlation, but it's not the only way to measure this. The second most popular correlation coefficient, the Spearman correlation coefficient, is the subject of [another notebook](https://www.kaggle.com/residentmario/spearman-correlation-with-montreal-bikes/)!
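As a brief taste of the difference, Spearman's coefficient is the Pearson coefficient computed on the ranks of the data rather than on the raw values. The sketch below uses made-up data and a simplified `ranks` helper that assumes there are no tied values (real implementations assign average ranks to ties).

```python
import numpy as np

def ranks(values):
    # Rank positions 0..n-1; assumes no tied values.
    return np.argsort(np.argsort(values))

# A monotonic but strongly nonlinear relationship (chosen for illustration).
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(pearson, spearman)
```

Because `y` always increases with `x`, the rank-based coefficient is 1, while the Pearson coefficient falls short of 1 because the relationship is not linear.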