## Proof of Concept for Differentially Private Exploratory Data Analysis 



**A very basic implementation based on the concept of global differential privacy**

The objective of this notebook is to apply differential privacy (DP) during exploratory data analysis (EDA). It follows the global DP approach: the EDA is performed on the sensitive dataset locally, on the data owner's computer, and noise is then added to the results to make them differentially private.
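The "compute first, add noise afterwards" idea can be sketched with the Laplace mechanism. This is a minimal sketch, not the notebook's own function: `laplace_mechanism` and its `sensitivity` parameter are hypothetical names introduced here, and the statistic value is made up for illustration.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value plus Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if sensitivity < 0:
        raise ValueError("sensitivity must be non-negative")
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon  # smaller epsilon -> larger noise
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Illustrative only: a made-up statistic, not taken from the Titanic data.
rng = np.random.default_rng(seed=0)
noisy = laplace_mechanism(true_value=10.0, sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy)
```

The data owner computes `true_value` on the raw data and releases only the noisy result.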

In this notebook, we use pandas profiling, a popular and well-known library for EDA. Given a dataset, it produces a complete set of analytics in reproducible formats such as .html and .json. In this example, we retrieve an aggregate, specifically the 'mean', from the .json file produced by pandas profiling and add noise to it.
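Reading an aggregate out of the report amounts to a nested dictionary lookup. The sketch below uses a hand-written stand-in for the report, assuming only the layout that the notebook's code indexes (`variables`, then the column name, then `mean`); a real report contains many more fields. `read_mean` and `report_text` are hypothetical names.

```python
import json

# Hand-written stand-in for the report; only the keys used in this notebook are shown.
report_text = json.dumps({"variables": {"Age": {"mean": 29.56239113827349}}})

def read_mean(report_json, column):
    """Return the 'mean' stored for `column` (layout: variables -> column -> mean)."""
    report = json.loads(report_json)
    return report["variables"][column]["mean"]

age_mean = read_mean(report_text, "Age")
print(age_mean)
```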



#### Step 1: Import the required libraries
- In case of any missing packages, install them using 'pip'

In [3]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport # to perform EDA
import json

#### Step 2: Load the data 
- We are using the well known "Titanic Dataset"
- More details about the dataset can be found [here](https://www.kaggle.com/c/titanic/overview)

In [4]:
url = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial_3-Titanic_demo/titanic_clean.csv'
data = pd.read_csv(url,sep=",", index_col=0)
data.head()


| Unnamed: 0 | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | Title | Family_Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 |  | S | 7.25 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 | Mr | 1 |
| 1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 | Mrs | 1 |
| 2 | 26.0 |  | S | 7.925 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 | Miss | 0 |
| 3 | 35.0 | C123 | S | 53.1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 | Mrs | 1 |
| 4 | 35.0 |  | S | 8.05 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 | Mr | 0 |


#### Step 3: Defining the dp_eda function
- It is the heart of this example
- The function takes a data frame containing the loaded dataset, the attribute whose aggregate (the mean in this case) we want, and epsilon (the privacy budget, default: 0.5) as input, and returns the private and normal aggregates

In [8]:
def dp_eda(df, x, epsilon=0.5):
    # Perform EDA on the data using pandas profiling
    profile = ProfileReport(df, title='DP EDA', explorative=False, progress_bar=False, minimal=True)
    profile.to_file("data_report.json")  # Save the result as a .json file
    with open('data_report.json') as json_file:  # Read the report back; the file is closed automatically
        jsondata = json.load(json_file)
    true_mean = jsondata['variables'][x]['mean']
    # Add Laplace noise with scale 1/epsilon (this treats the sensitivity of the mean as 1)
    dp_result = true_mean + np.random.laplace(0, 1.0 / epsilon)
    return dp_result, true_mean
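The noise scale of `1.0/epsilon` in `dp_eda` corresponds to a sensitivity of 1. For a mean, the sensitivity depends on the value range and the number of records, so one possible refinement is to clip values to a known range and scale the noise accordingly. The following is a minimal sketch under stated assumptions, not the notebook's method: `dp_bounded_mean` is a hypothetical helper, the bounds 0 and 100 for age are assumed, the record count `n` is treated as public, and neighbouring datasets are taken to differ by replacing one record.

```python
import numpy as np

def dp_bounded_mean(values, lower, upper, epsilon, rng=None):
    """Differentially private mean of values clipped to [lower, upper]."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    values = np.asarray(values, dtype=float)
    n = values.size
    if n == 0:
        raise ValueError("values must not be empty")
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(values, lower, upper)
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    return clipped.mean() + rng.laplace(0.0, sensitivity / epsilon)

ages = [22.0, 38.0, 26.0, 35.0, 35.0]  # the five 'Age' values shown by data.head()
rng = np.random.default_rng(seed=0)
print(dp_bounded_mean(ages, lower=0.0, upper=100.0, epsilon=0.7, rng=rng))
```

With only five rows the noise is large, because the scale is `(upper - lower) / (n * epsilon)`; on a full column the scale shrinks as `n` grows.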
    

#### Step 4: Results 
- We test the function defined above
- For any numerical column in the dataset, we can retrieve the private and normal mean
- Check the results for various values of epsilon

In [9]:
x = 'Age'
priv_mean, norm_mean = dp_eda(data, x, epsilon = 0.7)
print(f"Private Mean of {x}: {priv_mean}, Normal Mean of {x}: {norm_mean}")

Private Mean of Age: 27.77999073389835, Normal Mean of Age: 29.56239113827349
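To see how epsilon affects the result, it helps to look at the size of the noise itself. For Laplace noise with scale `1/epsilon`, the expected absolute error equals the scale, so a smaller epsilon means a noisier answer. The sketch below estimates this by sampling; `mean_abs_noise` is a hypothetical helper and the list of epsilon values is chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def mean_abs_noise(epsilon, n_draws=100_000):
    """Average absolute Laplace noise for scale 1/epsilon, estimated by sampling."""
    draws = rng.laplace(0.0, 1.0 / epsilon, size=n_draws)
    return float(np.abs(draws).mean())

for eps in (0.1, 0.5, 0.7, 1.0, 5.0):
    print(f"epsilon={eps}: scale={1.0 / eps:.2f}, mean |noise| ~ {mean_abs_noise(eps):.2f}")
```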


### Conclusion

We made the aggregates obtained from EDA with pandas-profiling differentially private. Since this is a POC, only the mean is considered, out of the many aggregates available, on a simple dataset.
