# Illustration of locally differentially private linear regression

This notebook illustrates that naively applying OLS to locally differentially private data yields an inconsistent estimator, and explains how to adapt the estimator to restore consistency.

This approach is, however, specific to linear regression. The second notebook demonstrates how one should adapt a Generalized Method of Moments estimator in order to deal with locally differentially private data.

## 0. Imports

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from utils import noise_generators
from utils import simulate as sim
import matplotlib.pyplot as plt

## 1. Simulate dataset

We simulate a dataset according to a univariate linear regression model $Y=\alpha + \beta X + \varepsilon$. The regressor is $X\sim N(\mu_X, \sigma_X^2)$, and the innovation $\varepsilon$ is drawn independently of $X$ from a logistic distribution with mean zero. The variance of $\varepsilon$ is set such that the population $R^2$, $R^2= \beta^2 \operatorname{var}(X)/(\beta^2 \operatorname{var}(X) + \operatorname{var}(\varepsilon))$, equals the specified value.
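
The helper `sim.univariate_linear_regression` lives in the local `utils` package and its implementation is not shown in this notebook. As a rough sketch of what such a simulator could look like, the hypothetical function `simulate_univariate_linear_regression` below solves the $R^2$ formula for $\operatorname{var}(\varepsilon)$ and uses the fact that a logistic distribution with scale $s$ has variance $s^2\pi^2/3$. The function name, the `seed` argument, and the column order are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def simulate_univariate_linear_regression(mu_X, variance_X, alpha, beta, R_squared, n, seed=None):
    """Hypothetical stand-in for sim.univariate_linear_regression (assumed behaviour)."""
    rng = np.random.default_rng(seed)
    # Solve R^2 = beta^2 var(X) / (beta^2 var(X) + var(eps)) for var(eps).
    variance_eps = beta**2 * variance_X * (1 - R_squared) / R_squared
    # A Logistic(0, s) variable has variance s^2 * pi^2 / 3, so invert for s.
    scale_eps = np.sqrt(3 * variance_eps) / np.pi
    X = rng.normal(loc=mu_X, scale=np.sqrt(variance_X), size=n)
    eps = rng.logistic(loc=0.0, scale=scale_eps, size=n)
    return pd.DataFrame({"Y": alpha + beta * X + eps, "X": X})


# Check on a large sample that the realised R^2 is close to the requested 0.7.
sketch_df = simulate_univariate_linear_regression(4, 3, 2, 0.5, 0.7, n=200_000, seed=0)
explained = 0.5**2 * sketch_df["X"].var()
residual = (sketch_df["Y"] - 2 - 0.5 * sketch_df["X"]).var()
print(explained / (explained + residual))
```

With a sample this large the printed ratio should land close to the requested value; with `n = 1000`, as used below, it will fluctuate more.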

In [None]:
n = 1000
variance_X = 3
mu_X = 4
alpha = 2
beta = .5
desired_R_squared = .7
data_df = sim.univariate_linear_regression(mu_X, variance_X, alpha, beta, desired_R_squared, n)

In [None]:
data_df.head(5)

In [None]:
data_df.plot.scatter(x="X", y="Y")

## 2. Fit OLS on 'true' dataset

Using statsmodels we estimate the parameters using OLS. Please compare the parameter estimates and the R-squared to the specification above.

In [None]:
ols_results = smf.ols("Y ~ 1 + X", data=data_df).fit()
print(ols_results.summary())

## 3. Generate locally differentially private data

In this section and the next we investigate what happens when OLS is applied to locally differentially private data. Instead of $(Y_i,X_i)$ we observe $(Y_i + \eta_i, X_i + u_i)$, where the noise terms $\eta_i$ and $u_i$ are independent of each other, of the data, and across $i$.
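
The noise-adding step can be sketched as follows. The helper `add_laplace_noise` is hypothetical and is not the notebook's `noise_generators.add_noise_laplace_mechanism`, whose implementation is not shown here. The sketch assumes the standard Laplace mechanism, in which every cell receives independent Laplace noise with scale `sensitivity / epsilon`.

```python
import numpy as np
import pandas as pd


def add_laplace_noise(df, epsilon, sensitivity, seed=None):
    """Hypothetical stand-in for noise_generators.add_noise_laplace_mechanism."""
    rng = np.random.default_rng(seed)
    # Assumed calibration: Laplace scale b = sensitivity / epsilon (variance 2 * b^2).
    scale = sensitivity / epsilon
    # Draw one independent noise term per cell and add it elementwise.
    noise = rng.laplace(loc=0.0, scale=scale, size=df.shape)
    return df + noise


toy_df = pd.DataFrame({"Y": [4.0, 3.5, 5.0], "X": [4.0, 3.0, 6.0]})
toy_private_df = add_laplace_noise(toy_df, epsilon=6, sensitivity=3, seed=123)
print(toy_private_df)
```

Under this assumed calibration, `sensitivity = 3` and `epsilon = 6` give a noise scale of 0.5, so each noise term has variance $2 \cdot 0.5^2 = 0.5$.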

Generate locally differentially private data:

In [None]:
sensitivity = 3
epsilon = 6
private_data_df = noise_generators.add_noise_laplace_mechanism(data_df, epsilon, 
                                sensitivity, seed=123)

Inspect the first three rows of the dataset and compare them to the "true data":

In [None]:
print("Locally differentially private data:")
display(private_data_df.head(3))
print("True data:")
display(data_df.head(3))

## 4. Using OLS on the locally private data does not work!

If we apply OLS to the LDP dataset, the estimates are typically far from the true parameter values: the noise added to $X$ acts as classical measurement error, which biases the slope estimate toward zero.

In [None]:
# Fit regression model 
ols_results = smf.ols("Y ~ 1 + X", data=private_data_df).fit()
print(ols_results.summary())
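
The direction and size of the discrepancy can be checked in a small standalone simulation. The sketch below assumes Laplace noise with scale `sensitivity / epsilon` (here 3 / 6) and uses the classical errors-in-variables result that the OLS slope converges to $\beta \operatorname{var}(X) / (\operatorname{var}(X) + \operatorname{var}(u))$. All variable names are illustrative, and the logistic scale of 0.3 is an arbitrary choice that does not affect the limit of the slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim, beta_true, variance_X = 200_000, 0.5, 3.0
noise_scale = 3 / 6  # assumed Laplace scale: sensitivity / epsilon
variance_u = 2 * noise_scale**2  # variance of Laplace(0, b) is 2 * b^2

X = rng.normal(4.0, np.sqrt(variance_X), n_sim)
Y = 2.0 + beta_true * X + rng.logistic(0.0, 0.3, n_sim)
X_private = X + rng.laplace(0.0, noise_scale, n_sim)
Y_private = Y + rng.laplace(0.0, noise_scale, n_sim)

# The OLS slope equals the sample covariance divided by the sample variance of the regressor.
naive_slope = np.cov(X_private, Y_private)[0, 1] / np.var(X_private, ddof=1)
# Attenuated limit predicted by the errors-in-variables formula.
predicted_slope = beta_true * variance_X / (variance_X + variance_u)
print(naive_slope, predicted_slope)
```

The noise in $Y$ does not bias the slope; it only inflates the residual variance and lowers the $R^2$. The bias comes entirely from the noise in $X$.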

## 5. Adapt the OLS estimator in order to deal with local differential privacy

For linear regression we can use two moment conditions: $0=\mathbb{E}[ Y -\alpha -\beta X]$ and $0=\mathbb{E}[ XY -\alpha X -\beta X^2]$. These moment conditions can be solved analytically. To evaluate them we need data on $Y$, $X$, $XY$, and $X^2$.
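
For reference, the analytical solution can be worked out as follows. The first condition gives $\alpha$ in terms of $\beta$; substituting it into the second condition and collecting terms gives $\beta$:

$$
\alpha = \mathbb{E}[Y] - \beta\,\mathbb{E}[X], \qquad
\beta = \frac{\mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]}{\mathbb{E}[X^2] - \mathbb{E}[X]^2}.
$$

Only the four expectations $\mathbb{E}[Y]$, $\mathbb{E}[X]$, $\mathbb{E}[XY]$, and $\mathbb{E}[X^2]$ enter. Adding mean-zero noise to a column leaves its expectation unchanged, which is why $XY$ and $X^2$ should be privatized as columns of their own. Squaring an already noisy regressor would not work, since for mean-zero noise $u$ independent of $X$

$$
\mathbb{E}[(X+u)^2] = \mathbb{E}[X^2] + \operatorname{var}(u) \neq \mathbb{E}[X^2].
$$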

We first add the additional columns $XY$ and $X^2$ to the true dataset:

In [None]:
data_df["X * Y"] = data_df["X"] * data_df["Y"]
data_df["X^2"] = data_df["X"] * data_df["X"]
data_df.head(5)

Now we assume that we can obtain a locally differentially private version of each column.

In [None]:
private_data_df = noise_generators.add_noise_laplace_mechanism(data_df, epsilon, sensitivity, seed=123)
private_data_df.head(5)

Next we apply our moment estimator. First on the true data (which yields the same outputs as statsmodels) and after that on the locally differentially private dataset.

In [None]:
def linear_regression_MM(Y, X, X_times_Y, X_squared):
    hat_beta = (X_times_Y.mean() - X.mean() * Y.mean()) / (X_squared.mean() - X.mean() ** 2)
    hat_alpha = Y.mean() - hat_beta * X.mean()
    return hat_alpha, hat_beta

In [None]:
linear_regression_MM(data_df["Y"], data_df["X"], data_df["X * Y"], data_df["X^2"])

In [None]:
linear_regression_MM(private_data_df["Y"], private_data_df["X"], private_data_df["X * Y"], private_data_df["X^2"])

The two sets of estimates are quite close. It can indeed be proved that the above estimator, based upon locally differentially private data, is consistent. As you would expect, however, its variance exceeds the variance of OLS based upon the "true data".
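
Consistency can also be illustrated numerically by letting the sample size grow. The standalone sketch below is an illustration under stated assumptions rather than a proof: it redefines the moment estimator as `moment_estimates`, draws the data directly with NumPy instead of the notebook's `utils` helpers, and assumes independent Laplace noise with scale 0.5 on each of the four columns.

```python
import numpy as np


def moment_estimates(Y, X, XY, X2):
    """Moment estimator for (alpha, beta) from the four (possibly privatized) columns."""
    beta_hat = (XY.mean() - X.mean() * Y.mean()) / (X2.mean() - X.mean() ** 2)
    return Y.mean() - beta_hat * X.mean(), beta_hat


rng = np.random.default_rng(2024)
alpha_true, beta_true, noise_scale = 2.0, 0.5, 0.5
slopes = {}
for n_obs in (1_000, 100_000, 1_000_000):
    X = rng.normal(4.0, np.sqrt(3.0), n_obs)
    Y = alpha_true + beta_true * X + rng.logistic(0.0, 0.3, n_obs)
    # Privatize each of the four columns with its own independent Laplace noise.
    private = [col + rng.laplace(0.0, noise_scale, n_obs) for col in (Y, X, X * Y, X * X)]
    slopes[n_obs] = moment_estimates(*private)[1]
    print(n_obs, slopes[n_obs])
```

The slope estimates should settle around the true value of 0.5 as `n_obs` increases, whereas the naive OLS slope on the noisy $(Y, X)$ pair stays biased toward zero regardless of the sample size.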