# Protection of personal data of customers

The task is to protect the data of customers of the insurance company "Though the Flood". Develop a data transformation method that makes it difficult to recover personal information from the transformed data, and justify that the method works correctly.

You need to protect the data so that the quality of the machine learning models does not deteriorate during the transformation. There is no need to select the best model.

## Loading data

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

In [2]:
path = '/Users/vzuga/Documents/jupyter/'

try:
    df = pd.read_csv('insurance.csv')
except FileNotFoundError:
    # fall back to the local copy of the dataset
    df = pd.read_csv(path + 'datasets/insurance.csv')

In [3]:
def data_check(data):
    data.info()
    print()
    display(data.head())
    print()
    print('Duplicates:', data.duplicated().sum())
    print()
    print('Missing values')
    print(data.isna().mean())   
    print()
    print('Statistics')
    print(data.describe())

In [4]:
data_check(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB



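|   | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|-----|---------|----------|-------------|-------------------|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |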



Duplicates: 153

Missing values
Пол                  0.0
Возраст              0.0
Зарплата             0.0
Члены семьи          0.0
Страховые выплаты    0.0
dtype: float64

Statistics
               Пол      Возраст      Зарплата  Члены семьи  Страховые выплаты
count  5000.000000  5000.000000   5000.000000  5000.000000        5000.000000
mean      0.499000    30.952800  39916.360000     1.194200           0.148000
std       0.500049     8.440807   9900.083569     1.091387           0.463183
min       0.000000    18.000000   5300.000000     0.000000           0.000000
25%       0.000000    24.000000  33300.000000     0.000000           0.000000
50%       0.000000    30.000000  40200.000000     1.000000           0.000000
75%       1.000000    37.000000  46600.000000     2.000000           0.000000
max       1.000000    65.000000  79000.000000     6.000000           5.000000


### Conclusions:
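* there are no missing values in the data
* all features are numerical
* data types and value ranges look fine
* the 153 duplicate rows are not a problem for this project, so I keep them
* I won't rename the columns either: Пол (sex), Возраст (age), Зарплата (salary), Члены семьи (family members), Страховые выплаты (insurance payouts, the target)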

## Matrix multiplication

Notation:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: vector of linear regression weights (the zeroth element is the shift)

Predictions:

$$
a = Xw
$$

Machine learning objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Learning formula:

$$
w = (X^T X)^{-1} X^T y
$$
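As a quick sanity check of the learning formula, here is a minimal sketch on synthetic data (the data, the seed, and the variable names are assumptions made for illustration, not part of the project). It solves the normal equation with NumPy and compares the result with scikit-learn's `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: 100 rows, 4 features (assumption).
features = rng.normal(size=(100, 4))
target = (features @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.7
          + rng.normal(scale=0.1, size=100))

# X gets a leading column of ones, so w[0] plays the role of the shift.
X = np.hstack([np.ones((features.shape[0], 1)), features])

# w = (X^T X)^{-1} X^T y, solved without forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ target)

# Cross-check against scikit-learn's least-squares fit.
reference = LinearRegression().fit(features, target)
print(np.allclose(w[0], reference.intercept_))
print(np.allclose(w[1:], reference.coef_))
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical convenience; the result is the same $w$ as in the formula as long as $X^T X$ is invertible.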

To protect personal data, we multiply the feature matrix $X$ by a random matrix $P$.

Let's use the properties of matrices:
- $(AB)^T = B^TA^T$
- $(AB)C = A(BC)$
- $AA^{-1} = E$
- $(AB)^{-1} = B^{-1}A^{-1}$,

where the last two properties hold only for invertible matrices. We therefore assume that $P$ is square and invertible and that $X^T X$ is invertible as well. Replace $X$ with $X' = XP$ in the linear regression learning formula for $w$. The new learning formula is:

$$
w' = (X'^T X')^{-1} X'^T y
$$
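Substituting $X' = XP$:

$$
w' = ((XP)^T XP)^{-1}(XP)^T y = (P^T X^T X P)^{-1} P^T X^T y
$$

Since $P$, $P^T$ and $X^T X$ are square and invertible, the inverse of the product expands as $(P^T (X^T X) P)^{-1} = P^{-1}(X^T X)^{-1}(P^T)^{-1}$, so

$$
w' = P^{-1}(X^T X)^{-1}(P^T)^{-1}P^T X^T y = P^{-1}(X^T X)^{-1}X^T y = P^{-1}w
$$

Substitute $w'$ into the predictions:

$$
a' = X'w' = XP P^{-1} w = Xw = a
$$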

Thus, multiplying the feature matrix by any invertible matrix does not change the predictions of the model.
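The identity $a' = a$ can also be checked numerically. The sketch below uses random synthetic data and a hypothetical helper `fit_predict` (both are assumptions for illustration); it fits the weights with the learning formula on $X$ and on $XP$ and compares the predictions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption): a ones column plus 3 random features.
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y = rng.normal(size=50)

# Random square matrix; a Gaussian matrix is invertible with probability 1.
P = rng.normal(size=(4, 4))


def fit_predict(X, y):
    """Return predictions a = Xw, with w taken from the normal equation."""
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return X @ w


a = fit_predict(X, y)
a_prime = fit_predict(X @ P, y)

# The two prediction vectors should agree up to floating-point error.
print(np.allclose(a, a_prime, atol=1e-6))
```

The agreement is exact in theory and only approximate in floating point: a badly conditioned $P$ would make the gap larger.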

## Transformation algorithm

**Algorithm**

Multiply the $N \times M$ feature matrix by an invertible $M \times M$ matrix generated with numpy.random.normal.

**Rationale**

As proved above, multiplying the feature matrix by an invertible matrix does not change the predictions of linear regression.

I will write a function that will multiply the features by an invertible matrix.

In [5]:
def protection(features):
    
    M = features.shape[1]  # dimension of matrix P
    P = np.random.normal(size=(M, M))  # random candidate matrix
    
    # redraw until P has full rank; an exact det(P) == 0 test
    # is unreliable in floating-point arithmetic
    while np.linalg.matrix_rank(P) < M:
        P = np.random.normal(size=(M, M))
    
    return np.dot(features, P)

In [6]:
#function check
test_features = np.random.normal(size=(3,1), scale=100)
test_features

array([[ 13.61692576],
       [147.97119311],
       [ 13.96132585]])

In [7]:
test_features_protected = protection(test_features)
test_features_protected

array([[ -20.9200274 ],
       [-227.33188591],
       [ -21.44913796]])
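The `protection` function discards the matrix $P$ after use. If the original values ever need to be restored, $P$ has to be kept as a secret key, because $X = X'P^{-1}$. A minimal sketch of such a variant follows; the function name `protect_with_key`, the `rng` argument, and the condition-number threshold `max_cond` are assumptions introduced here for illustration, and the two sample rows are taken from the head of the dataset.

```python
import numpy as np


def protect_with_key(features, rng=None, max_cond=1e3):
    """Hypothetical variant of protection() that also returns the key P."""
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    m = features.shape[1]
    P = rng.normal(size=(m, m))
    # Redraw until P is well conditioned (the threshold is an assumption).
    while np.linalg.cond(P) > max_cond:
        P = rng.normal(size=(m, m))
    return features @ P, P


# First two rows of the dataset: sex, age, salary, family members.
sample = np.array([[1.0, 41.0, 49600.0, 1.0],
                   [0.0, 46.0, 38000.0, 1.0]])
protected, P = protect_with_key(sample, rng=np.random.default_rng(42))

# Whoever holds P can undo the transformation: X = X' P^{-1}.
recovered = protected @ np.linalg.inv(P)
rel_err = np.linalg.norm(recovered - sample) / np.linalg.norm(sample)
print(rel_err < 1e-6)
```

Bounding the condition number rather than testing the determinant keeps the round trip numerically stable; the protection itself relies on $P$ staying private.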

## Algorithm check

I will extract the features and the target, and also transform the features:

In [8]:
features = df.drop(['Страховые выплаты'], axis=1)
target = df['Страховые выплаты']

features_protected = protection(features)

I will split the data into training and test sets:

In [9]:
features_train, features_test, target_train, target_test, \
features_train_protected, features_test_protected = train_test_split(
    features, target, features_protected, test_size=.25, random_state=12345)

I will check the quality of the model with and without protection:

In [10]:
#features without protection
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(features_train, target_train)

print('The R2 value for the original data:')
print(f'{r2_score(target_test, model.predict(features_test)):.6f}')

The R2 value for the original data:
0.435228


In [11]:
#features with protection
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(features_train_protected, target_train)

print('The R2 value for the transformed data:')
print(f'{r2_score(target_test, model.predict(features_test_protected)):.6f}')

The R2 value for the transformed data:
0.435228


The R2 values match!
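Equal R2 values to six decimal places are strong evidence, and a stricter check is to compare the predictions themselves. The sketch below repeats the same pipeline on synthetic data (the data, the column scales, and the variable names are assumptions, since this block does not read insurance.csv) and reports the largest difference between the two prediction vectors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12345)

# Synthetic stand-in for the features: 4 numeric columns on different scales.
X = rng.normal(size=(1000, 4)) * np.array([0.5, 8.0, 9900.0, 1.0])
y = X @ np.array([0.01, 0.04, 0.0, -0.01]) + rng.normal(scale=0.3, size=1000)

# Protected features: multiply by a random square matrix.
X_protected = X @ rng.normal(size=(4, 4))

X_train, X_test, Xp_train, Xp_test, y_train, y_test = train_test_split(
    X, X_protected, y, test_size=0.25, random_state=12345)

plain_model = make_pipeline(StandardScaler(), LinearRegression())
pred_plain = plain_model.fit(X_train, y_train).predict(X_test)

protected_model = make_pipeline(StandardScaler(), LinearRegression())
pred_protected = protected_model.fit(Xp_train, y_train).predict(Xp_test)

# Largest absolute difference between the two sets of predictions.
max_gap = np.max(np.abs(pred_plain - pred_protected))
print(max_gap < 1e-6)
```

The scaler does not break the argument: standardizing $XP$ is still an invertible affine transformation of $X$, and the regression fits its own intercept.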

### General conclusion:

* the predictions of linear regression do not change when the features are multiplied by an invertible matrix
* the proposed data transformation algorithm multiplies the features by a randomly generated invertible square matrix
* after applying the transformation, the quality of the linear regression did not change: R2 is 0.435228 on both the original and the transformed data