# Protection of personal data of clients

You need to protect the data of clients of the insurance company "Though the Flood". Develop a method for transforming data so that it is difficult to recover personal information from it. Justify the correctness of its operation.

It is necessary to protect the data so that the quality of machine learning models does not deteriorate during conversion. There is no need to select the best model.

## Loading data

Import the necessary libraries and look at the data.

In [12]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv('/datasets/insurance.csv')

print(df.info())
display(df.head())
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB
None


   Пол  Возраст  Зарплата  Члены семьи  Страховые выплаты
0    1     41.0   49600.0            1                  0
1    0     46.0   38000.0            1                  1
2    0     29.0   21000.0            0                  0
3    0     21.0   41700.0            2                  0
4    1     28.0   26100.0            0                  0


               Пол      Возраст      Зарплата  Члены семьи  Страховые выплаты
count  5000.000000  5000.000000   5000.000000  5000.000000        5000.000000
mean      0.499000    30.952800  39916.360000     1.194200           0.148000
std       0.500049     8.440807   9900.083569     1.091387           0.463183
min       0.000000    18.000000   5300.000000     0.000000           0.000000
25%       0.000000    24.000000  33300.000000     0.000000           0.000000
50%       0.000000    30.000000  40200.000000     1.000000           0.000000
75%       1.000000    37.000000  46600.000000     2.000000           0.000000
max       1.000000    65.000000  79000.000000     6.000000           5.000000


There are no missing values: every column has 5000 non-null entries.

Let's split the data into features and the target for further work.

In [13]:
features = df.drop('Страховые выплаты', axis=1)
target = df['Страховые выплаты']

## Matrix multiplication

In this task you can write formulas in *Jupyter Notebook*.

To write a formula inline, surround it with single dollar signs (\$); to put a formula on its own line, use double dollar signs (\$\$). The formulas are written in the *LaTeX* markup language.

As an example, the linear regression formulas are written out below. You can copy and edit them to solve the problem.

Using *LaTeX* is optional.

Notation:

- $X$: the feature matrix (the zeroth column consists of ones)

- $y$: the target vector

- $Z$: the invertible matrix by which the features are multiplied

- $w$: the vector of linear regression weights (the zeroth element is the bias)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
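The training formula can be checked numerically. The sketch below is only an illustration on synthetic data (the arrays `X_raw` and `y` and the coefficients are assumptions made up for the example, not the insurance dataset). It prepends the column of ones, computes $w$ with the closed-form expression and then the predictions $a = Xw$.

```python
import numpy as np

# Synthetic data, used only for illustration (not the insurance dataset).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
y = X_raw @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

# Prepend the column of ones so that w[0] plays the role of the bias.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Training formula: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Predictions: a = Xw
a = X @ w
```

In practice `np.linalg.lstsq` or `np.linalg.solve` is usually preferred over an explicit inverse for numerical stability; the explicit form is kept here only because it mirrors the formula.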

**Answer:** the quality of linear regression does not change, because the predictions $a$ do not change.

**Rationale:** Let $Z$ be an invertible square matrix whose size equals the number of columns of $X$. Replace the features $X$ with $XZ$ and denote the new weights and predictions by $w'$ and $a'$:

$$a' = XZw'$$

$$w' = ((XZ)^T XZ)^{-1}(XZ)^T y$$

Let's substitute $w'$ into $a'$:

$$a' = XZ((XZ)^T XZ)^{-1}(XZ)^T y$$

Let us expand the transposes using the property $(XZ)^T = Z^T X^T$:

$$a' = XZ(Z^T X^T X Z)^{-1} Z^T X^T y$$

The matrix $X$ is not square, so it cannot be inverted on its own. However, $Z^T$, $X^T X$ and $Z$ are all square and invertible ($X^T X$ must be invertible for the training formula to make sense), so we can use the property $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$ with $A = Z^T$, $B = X^T X$, $C = Z$:

$$a' = XZ Z^{-1} (X^T X)^{-1} (Z^T)^{-1} Z^T X^T y$$

We use the definition of an inverse matrix, $Z Z^{-1} = E$ and $(Z^T)^{-1} Z^T = E$:

$$a' = XE(X^TX)^{-1}EX^Ty$$

Multiplying by the identity matrix $E$ changes nothing:

$$a' = X(X^TX)^{-1}X^Ty$$

As you can see, we have returned to the original predictions:

$$a' = X(X^TX)^{-1}X^Ty = Xw = a$$

The same steps applied to $w'$ alone show how the weights are related: $w' = Z^{-1}(X^TX)^{-1}X^Ty = Z^{-1}w$.

Thus, multiplying the features by an invertible matrix does not change the predictions, and therefore does not change the quality of linear regression.
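The result can also be checked numerically. The following is a minimal sketch on synthetic data (the helper `fit_weights`, the array shapes and the seed are assumptions chosen for illustration). It fits the weights with the training formula before and after multiplying by a random matrix `Z`, then compares the predictions and verifies that the new weights equal $Z^{-1}w$.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic feature matrix with a leading column of ones, and a random target.
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 4))])
y = rng.normal(size=200)

def fit_weights(features, target):
    """Closed-form least squares: (F^T F)^{-1} F^T t (hypothetical helper)."""
    return np.linalg.inv(features.T @ features) @ features.T @ target

# A random Gaussian matrix is invertible with probability 1.
Z = rng.normal(size=(5, 5))

w = fit_weights(X, y)          # weights on the original features
w_new = fit_weights(X @ Z, y)  # weights on the transformed features

a = X @ w
a_new = (X @ Z) @ w_new

predictions_match = np.allclose(a, a_new)
weights_related = np.allclose(w_new, np.linalg.inv(Z) @ w)
print(predictions_match, weights_related)
```

The comparison uses `np.allclose` rather than exact equality because the two computations accumulate different floating-point rounding errors.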

## Conversion algorithm

First, let's compute the R² metric on the original data.

In [14]:
model = LinearRegression()
model.fit(features, target)
predictions = model.predict(features)
print('r2:',r2_score(target, predictions))

r2: 0.42494550286668


Let's create a random matrix and check that it is invertible.

In [15]:
Z = np.random.normal(size=(4, 4))
Z_inv = np.linalg.inv(Z)  # raises LinAlgError if Z is singular
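The derivation requires the matrix to be invertible. A random Gaussian matrix is singular with probability zero, but it can still be badly conditioned. One possible way to guard against that is sketched below; the helper `make_invertible_matrix` and the condition-number threshold `1e6` are assumptions chosen for illustration, not part of the original solution.

```python
import numpy as np

def make_invertible_matrix(size, rng, max_cond=1e6):
    """Draw random square matrices until one is safely invertible."""
    while True:
        candidate = rng.normal(size=(size, size))
        # A moderate condition number means the inverse is numerically usable.
        if np.linalg.cond(candidate) < max_cond:
            return candidate

rng = np.random.default_rng(12345)
Z = make_invertible_matrix(4, rng)
Z_inv = np.linalg.inv(Z)
```

Fixing the seed of the generator also makes the transformation reproducible between notebook runs.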

Let's multiply the features by our new matrix

In [16]:
X = features.values
new_features = np.dot(X, Z)

Let's compute the R² metric again, this time on the transformed features.

In [17]:
model = LinearRegression()
model.fit(new_features, target)
predictions = model.predict(new_features)
print('r2 after multiplication by a random matrix:',r2_score(target, predictions))

r2 after multiplication by a random matrix: 0.4249455028666713
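Because the matrix is invertible, whoever holds it can undo the transformation by multiplying by its inverse, while the transformed values themselves no longer look like the original columns. The sketch below illustrates this on the first three feature rows shown in the data preview; the seed and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# First three feature rows from the data preview (sex, age, salary, family members).
X = np.array([
    [1, 41.0, 49600.0, 1],
    [0, 46.0, 38000.0, 1],
    [0, 29.0, 21000.0, 0],
])

Z = rng.normal(size=(4, 4))   # the secret matrix
X_protected = X @ Z           # what is stored or shared

# Only someone who knows Z can invert the transformation.
X_recovered = X_protected @ np.linalg.inv(Z)

recovered_ok = np.allclose(X_recovered, X, atol=1e-6)
print(recovered_ok)
```

This is why the matrix has to be kept secret: the protection rests entirely on `Z` not being known to whoever sees the transformed data.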


## Conclusion

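In this project we showed analytically that multiplying the features by an invertible matrix does not change the predictions of linear regression, and confirmed it experimentally: R² is 0.42494550286668 on the original features and 0.4249455028666713 after multiplication by a random matrix. The two values differ only by floating-point rounding error.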