![](https://drive.google.com/uc?id=131WXvMdvlKDxIk0Pa1qHkU3hkey5ChGj)

# **<span style="color:#e76f51;">What is Differential Privacy?</span>**

Differential privacy is a rigorous mathematical definition of privacy. Informally, an algorithm is differentially private if its output changes very little when any one individual's record is added to or removed from the dataset, so an adversary cannot reverse engineer that individual's sensitive data from the output, even with auxiliary information.
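
Stated formally (this is the standard definition from the Dwork and Roth book listed in the resources at the end; the symbols $\mathcal{M}$, $D$, $D'$, $S$ and $\varepsilon$ are notation introduced here for illustration): a randomized algorithm $\mathcal{M}$ is $\varepsilon$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in one individual's record and for every set of possible outputs $S$,

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S].
$$

A smaller $\varepsilon$ forces the two output distributions closer together, so the output reveals less about any single record.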

# **<span style="color:#e76f51;">Why is anonymization not sufficient?</span>**

Private patient data in healthcare has huge potential to transform medical treatment. However, exposing such data carries a substantial risk to privacy. Anonymizing features is no longer sufficient when sensitive data is released.


In the 1990s, the Massachusetts Group Insurance Commission decided to release anonymized data on state employees that showed every single hospital visit. The goal was to help researchers, and the state spent time removing all obvious identifiers such as name, address, and Social Security number. In 1997, Latanya Sweeney, a graduate student at MIT, was able to identify the medical records of Governor William Weld by linking the ZIP code and birth date in the released data with voter registration records from Cambridge, Massachusetts.

![](https://drive.google.com/uc?id=1exheko3vbEd14tLo1bDts8p7l9hgZ4Sk)

In 2006, Netflix released an anonymized viewing dataset for a competition to build a better recommendation engine. Narayanan and Shmatikov were able to re-identify users by cross-referencing it with the IMDb dataset.

![](https://drive.google.com/uc?id=1TvLJyuPMejOCDOTiCFU2sa9qGs-LTH4K)

# **<span style="color:#e76f51;">How does Differential Privacy work?</span>**


Differentially private algorithms add carefully calibrated random noise to the data or to the results computed from it, so that it becomes difficult for an adversary to infer any individual's record.
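
As a rough illustration of adding noise, here is a minimal sketch of the Laplace mechanism applied to a counting query. It uses only the Python standard library; the function names `laplace_noise` and `private_count` and the toy `ages` list are hypothetical, chosen for this example, and are not part of Diffprivlib.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only to make the sketch repeatable
ages = [23, 35, 41, 52, 67, 29, 58, 44]  # hypothetical toy data

true_count = sum(1 for a in ages if a >= 40)
noisy_count = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(f"true count: {true_count}, noisy count: {noisy_count:.2f}")
```

A smaller `epsilon` gives a larger noise scale, which means stronger privacy but a less accurate answer.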

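# **<span style="color:#e76f51;">Privacy Loss</span>**

Privacy loss is the additional risk an individual incurs because their record is included in the data: how much more an adversary, possibly armed with auxiliary knowledge, can learn about them or re-identify them. Differential privacy bounds this loss with a parameter usually called epsilon; a smaller epsilon means a stronger privacy guarantee.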
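
One common way to make this precise (following the standard differential privacy literature, not anything specific to this notebook) is to define the privacy loss for a particular output as the log-ratio of the probabilities of seeing that output on two datasets that differ in one record. The sketch below checks numerically that, for a Laplace-noised count, this quantity never exceeds epsilon in absolute value. The function names and the counts 5 and 4 are hypothetical.

```python
import math


def laplace_log_pdf(x, loc, scale):
    """Log-density of a Laplace(loc, scale) distribution at x."""
    return -abs(x - loc) / scale - math.log(2.0 * scale)


def privacy_loss(output, count_d, count_d_prime, epsilon):
    """Log-ratio of the output densities under two neighbouring datasets."""
    scale = 1.0 / epsilon  # the sensitivity of a count is 1
    return (laplace_log_pdf(output, count_d, scale)
            - laplace_log_pdf(output, count_d_prime, scale))


epsilon = 0.5
# Neighbouring datasets: the true count is 5 with the individual, 4 without.
outputs = [x / 10.0 for x in range(-100, 201)]
losses = [privacy_loss(o, 5, 4, epsilon) for o in outputs]
worst = max(abs(loss) for loss in losses)
print(f"largest |privacy loss| = {worst:.3f}, epsilon = {epsilon}")
```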

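# **<span style="color:#e76f51;">Limitations of Differential Privacy</span>**

Each query answered on the same data leaks a little more information. By making repeated queries and averaging out the noise, an adversary can estimate details of the original data, so the total number of queries has to be limited by a privacy budget.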
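
The effect of repeated queries can be sketched as follows. This is an illustrative simulation using only the standard library, assuming a count query with sensitivity 1 and Laplace noise; `averaging_attack` and the value 5 are hypothetical. The total epsilon is computed with basic sequential composition, under which the per-query epsilons simply add up.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def averaging_attack(true_value, epsilon_per_query, num_queries, rng):
    """Ask the same noisy query repeatedly and average the answers.

    Returns the adversary's estimate and the total epsilon spent
    under basic sequential composition.
    """
    scale = 1.0 / epsilon_per_query  # sensitivity 1 assumed, as for a count
    answers = [true_value + laplace_noise(scale, rng) for _ in range(num_queries)]
    return statistics.fmean(answers), epsilon_per_query * num_queries


rng = random.Random(1)
for num_queries in (1, 100, 10_000):
    estimate, total_epsilon = averaging_attack(5, 0.5, num_queries, rng)
    print(f"{num_queries:>6} queries: estimate = {estimate:.2f}, "
          f"total epsilon = {total_epsilon:g}")
```

As the number of queries grows, the averaged estimate approaches the true value while the total epsilon grows with it, which is why a fixed budget is needed.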

# **<span style="color:#e76f51;">Diffprivlib</span>**

Diffprivlib is a general-purpose library from IBM for experimenting with, investigating and developing applications in differential privacy.

Diffprivlib can be used to:

📌 Experiment with differential privacy.

📌 Explore the impact of differential privacy on machine learning accuracy using classification and clustering models.

📌 Build your own differential privacy applications, using its extensive collection of mechanisms.

Diffprivlib consists of four major components:

**Mechanisms:** These are the building blocks of differential privacy, and are used in all models that implement differential privacy. Mechanisms have little or no default settings, and are intended for use by experts implementing their own models. They can, however, be used outside models for separate investigations, etc.

**Models:** This module includes machine learning models with differential privacy. Diffprivlib currently has models for clustering, classification, regression, dimensionality reduction and pre-processing.

**Tools:** Diffprivlib comes with a number of generic tools for differentially private data analysis. These include differentially private histograms, which follow the same format as NumPy's histogram function.

**Accountant:** The BudgetAccountant class can be used to track privacy budget and calculate total privacy loss using advanced composition techniques.
 


In [None]:
!pip install diffprivlib

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

from diffprivlib.models import GaussianNB

from sklearn.model_selection import train_test_split

import wandb


In [None]:

train = pd.read_csv("../input/tabular-playground-series-nov-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-nov-2021/test.csv")


<img src="https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">

> I will be integrating W&B for visualizations and logging artifacts!
> 
> [TPS November Project on W&B Dashboard](https://wandb.ai/usharengaraju/TPSNovember)
> 
> - To get the API key, create an account on the [W&B website](https://wandb.ai/site).
> - Use Kaggle Secrets to store the API key securely.

In [None]:
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    # The secret must be stored under the label "api_key".
    secret_value_0 = user_secrets.get_secret("api_key")
    wandb.login(key=secret_value_0)
    anony = None
except Exception:
    # Fall back to an anonymous W&B run when no secret is available.
    anony = "must"
    print('If you want to use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as api_key. \nGet your W&B access token from here: https://wandb.ai/authorize')

CONFIG = dict(competition = 'TPSNovember',_wandb_kernel = 'tensorgirl')


In [None]:
train.drop(columns='id', inplace=True)
test.drop(columns='id', inplace=True)

In [None]:
# Code adapted from https://www.kaggle.com/sergiosaharovskiy/tps-nov-2021-a-complete-guide

def downcast(df):
    """Downcast 64-bit numeric columns in place to the smallest dtype that fits."""
    for col in df.columns:
        if df[col].dtype == "float64":
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif df[col].dtype == "int64":
            df[col] = pd.to_numeric(df[col], downcast="integer")

# Downcast the training and test datasets to reduce memory usage.
downcast(train)
downcast(test)
        


# **<span style="color:#e76f51;">W & B Artifacts</span>**

An artifact is a versioned folder of data. Entire datasets can be stored directly as artifacts.

W&B Artifacts are used for dataset versioning and model versioning. They are also used for tracking dependencies and results across machine learning pipelines. Artifact references can be used to point to data in other systems like S3, GCP, or your own system.

You can learn more about W&B Artifacts [here](https://docs.wandb.ai/guides/artifacts).

![](https://drive.google.com/uc?id=1JYSaIMXuEVBheP15xxuaex-32yzxgglV)

In [None]:
# Save train data to W&B Artifacts
run = wandb.init(project='TPSNovember', name='training_data', anonymous=anony,config=CONFIG) 
artifact = wandb.Artifact(name='training_data',type='dataset')
artifact.add_file("../input/tabular-playground-series-nov-2021/train.csv")

wandb.log_artifact(artifact)
wandb.finish()

# **<span style="color:#e76f51;">Basic Statistics of Features</span>**

In [None]:
# Features are named f0 to f99, so start the slice at f0 to include every feature.
train.loc[:, 'f0':'f99'].describe().style.background_gradient(cmap='Pastel1')


# **<span style="color:#e76f51;">Target Variable Distribution</span>**

In [None]:
plt.figure(figsize=(15, 7))
sns.kdeplot(train["target"] ,fill=True, color = "#2a9d8f")

# **<span style="color:#e76f51;">Target Class Balance</span>**

In [None]:
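plt.figure(figsize=(15, 7))
# Compute the class counts from the data instead of hard-coding them.
class_counts = train["target"].value_counts().sort_index()
plt.pie(class_counts, labels=class_counts.index.astype(str), autopct='%1.1f%%', colors=["#2a9d8f", "#e9c46a"])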


# **<span style="color:#e76f51;">Distribution of features</span>**

In [None]:
fig, axes = plt.subplots(10,10, figsize=(20, 12))
axes = axes.flatten()

for idx, ax in enumerate(axes):
    
    sns.kdeplot(
        data=train, ax=ax, hue='target', fill=True,
        x=f'f{idx}', palette=['#4DB6AC', 'red'], legend=idx==0
    )
 
    ax.set_xticks([]); ax.set_yticks([]); ax.set_xlabel('')
    ax.set_ylabel(''); ax.spines['left'].set_visible(False)
    ax.set_title(f'f{idx}', loc='right', weight='bold', fontsize=10)

fig.supxlabel('Distribution of Features', ha='center', fontweight='bold')
fig.tight_layout()
plt.show()


# **<span style="color:#e76f51;">Logging to W & B environment</span>**

In [None]:
# Log Plots to W&B environment
title = "Distribution of Target Feature"
run = wandb.init(project='TPSNovember', name=title, anonymous=anony, config=CONFIG)
# sns.kdeplot returns a matplotlib Axes; log the Figure that contains it.
ax = sns.kdeplot(train["target"], color="#E4916C")
wandb.log({title: ax.figure})
wandb.finish()

In [None]:
X = train.drop('target', axis=1)
y = train['target']

X_train, X_test, y_train, y_test = train_test_split(X, y,train_size=0.8,test_size = 0.2,random_state = 0)

# **<span style="color:#e76f51;">Modelling using Diffprivlib</span>**

In [None]:
clf = GaussianNB()
clf.fit(X_train, y_train)

In [None]:
clf.predict(X_test)

In [None]:
print("Test accuracy: %f" % clf.score(X_test, y_test))


In [None]:
accuracy = list()

epsilons = np.logspace(-2, 2, 50)

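for epsilon in epsilons:
    # Pass the current epsilon so that each model is trained with a different privacy budget.
    clf = GaussianNB(epsilon=epsilon)
    clf.fit(X_train, y_train)
    
    accuracy.append(clf.score(X_test, y_test))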

plt.semilogx(epsilons, accuracy)
plt.title("Differentially private Naive Bayes accuracy")
plt.xlabel("epsilon")
plt.ylabel("Accuracy")
plt.show()

Some example notebooks to experiment with are available in the official GitHub repository:

[GitHub](https://github.com/IBM/differential-privacy-library)

# **<span style="color:#e76f51;">Resources for Differential Privacy</span>**

https://privacytools.seas.harvard.edu/differential-privacy

The Algorithmic Foundations of Differential Privacy (Cynthia Dwork, Aaron Roth)

Deep Learning with Differential Privacy (M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, et al.)

Concentrated Differential Privacy (Cynthia Dwork, Guy N. Rothblum)

References:

https://www.kaggle.com/sergiosaharovskiy/tps-nov-2021-a-complete-guide