# Correlation and Regression

In [None]:
from IPython.display import Markdown
base_path = (
    "https://raw.githubusercontent.com/rezahabibi96/GitBook/refs/heads/main/"
    "books/applied-statistics-with-python/.resources"
)

In [None]:
import math
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.utils import resample
from scipy.stats import norm, binom, chi2, chisquare, chi2_contingency, expon, shapiro, t, wilcoxon
from statsmodels.stats.power import TTestPower, TTestIndPower

import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image
from matplotlib.pyplot import figure

import requests
from io import BytesIO

In this chapter, we introduce methods for determining whether there is a linear correlation, or association, between two numerical variables. When such a correlation exists, we learn how to derive a linear model and how to use it for prediction, a task that arises in many fields, from business and finance to biology and medicine.
For example:
* Based on income, consumer debt, and several other predictors about a person, we might be interested in predicting a person's credit rating.
* We might be interested in predicting growth in sales of a company based on current economic conditions and its previous year's performance.
* In Medicine, it would be important to predict patients' blood pressure based on health measurements and the dosage of a new drug.

In fairness, the more interesting examples above involve several predictors. That setting is called multiple regression and is taken up in detail in later chapters. In this chapter, we study correlation in general, but the linear models are limited to a single predictor.

## Correlation

As an example, let's return to the file `MHEALTH.csv` introduced in Chapters 1 and 2. It contains many health measurements for 40 men. Let's investigate how weight (WT) depends on waist size (WAIST).

Whenever you plan a correlation/regression study, you *must always start with a scatterplot* as shown in the figure below.

There is a fair amount of variation, but there seems to be an overall linear trend with a positive slope. Not surprisingly, a man with a larger waist size has a higher weight.
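A scatterplot of this kind can be drawn along the following lines. This is only a sketch: the data are synthetic stand-ins generated for illustration, not the measurements in `MHEALTH.csv`, and the slope, intercept, and noise level are arbitrary assumptions chosen to produce a positive trend. With the real file loaded into a DataFrame, the `WAIST` and `WT` columns would take the place of the two arrays.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (NOT the MHEALTH.csv measurements):
# 40 hypothetical waist sizes and weights with a positive linear trend.
rng = np.random.default_rng(0)
waist = rng.uniform(75, 110, size=40)
weight = 1.1 * waist - 25 + rng.normal(0, 6, size=40)

# Predictor on the horizontal axis, response on the vertical axis.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(waist, weight)
ax.set_xlabel("WAIST")
ax.set_ylabel("WT")
ax.set_title("Weight versus waist size (synthetic data)")
plt.show()
```

Placing the predictor on the horizontal axis and the response on the vertical axis follows the usual convention and makes the direction of the trend easy to read.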

The numerical measure of the strength of this *linear* trend is given by the **Pearson correlation coefficient**:

$$ r = \frac{1}{n-1} \sum_{i=1}^n (z_x)_i \cdot (z_y)_i = \frac{1}{n-1} \sum_{i=1}^n \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i - \bar{y}}{s_y} $$

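where $n$ is the number of data pairs, $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations, and $(z_x)_i$ and $(z_y)_i$ are the corresponding z-scores. The `scipy.stats` module provides a function that performs this rather tedious computation.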
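To make the formula concrete, the sketch below computes $r$ directly from z-scores and compares the result with `scipy.stats.pearsonr`, which is presumably the function meant here. The six data pairs are made-up illustrative values, not taken from `MHEALTH.csv`.

```python
import numpy as np
from scipy import stats

# Hypothetical illustrative values (not from MHEALTH.csv)
x = np.array([80.0, 85.0, 90.0, 95.0, 100.0, 105.0])  # e.g. waist sizes
y = np.array([68.0, 75.0, 74.0, 85.0, 88.0, 97.0])    # e.g. weights

n = len(x)

# z-scores based on the sample standard deviation (ddof=1),
# matching the 1/(n-1) factor in the formula.
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)

# Pearson correlation coefficient from the definition
r_manual = np.sum(z_x * z_y) / (n - 1)

# The same quantity from scipy, which also returns a p-value
r_scipy, p_value = stats.pearsonr(x, y)

print(f"r (manual) = {r_manual:.6f}")
print(f"r (scipy)  = {r_scipy:.6f}")
```

The two values should agree up to floating-point rounding. Note the `ddof=1` argument: NumPy's `std` divides by $n$ by default, which would not match the sample standard deviations used in the formula.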