# An algorithm for protecting customers' personal data.

We need to protect the data of the insurance company "Hot Potop". We will develop a data transformation method that makes it difficult to recover personal information from the transformed data, and we will show that the transformation does not degrade the quality of a linear regression model.

## Contents: 
1. Reviewing and preprocessing of the data. 
2. Matrix multiplication. 
3. Transformation algorithm. 
4. Algorithm verification.

## Description of data:
Features: gender, age, and salary of the insured person, and the number of members in their family.

Target feature: number of insurance payments to the customer over the last 5 years.

## Plan of the project:
* Download and study the data.
* Multiply the features by an invertible matrix.
* Propose an algorithm for data transformation to solve the task.
* Program this algorithm using matrix operations. Check that the quality of the linear regression from sklearn does not differ before and after the transformation. Apply the R2 metric.

## 1. Reviewing and preprocessing of the data.

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [2]:
pd.set_option('display.max_columns', 50)
pd.options.display.float_format = '{:,.2f}'.format

In [3]:
url = 'https://code.s3.yandex.net/datasets/insurance.csv'
data = pd.read_csv(url)

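Let's look at the data. The columns are: Пол (gender), Возраст (age), Зарплата (salary), Члены семьи (family members), Страховые выплаты (insurance payments).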

In [4]:
data.head()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0


Let's look at the missing values.

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


Let's look at the unique values.

In [6]:
data['Пол'].value_counts()

0    2505
1    2495
Name: Пол, dtype: int64

In [7]:
data['Возраст'].value_counts()

19.00    223
25.00    214
31.00    212
26.00    211
22.00    209
27.00    209
32.00    206
28.00    204
29.00    203
30.00    202
23.00    202
21.00    200
20.00    195
36.00    193
33.00    191
24.00    182
35.00    179
34.00    177
37.00    147
39.00    141
38.00    139
41.00    129
18.00    117
40.00    114
42.00     93
43.00     77
44.00     74
45.00     73
46.00     60
47.00     47
49.00     37
50.00     27
48.00     26
52.00     22
51.00     21
53.00     11
55.00      9
54.00      7
56.00      5
59.00      3
57.00      2
58.00      2
60.00      2
61.00      1
65.00      1
62.00      1
Name: Возраст, dtype: int64

In [8]:
data['Члены семьи'].value_counts()

1    1814
0    1513
2    1071
3     439
4     124
5      32
6       7
Name: Члены семьи, dtype: int64

In [9]:
data['Страховые выплаты'].value_counts()

0    4436
1     423
2     115
3      18
4       7
5       1
Name: Страховые выплаты, dtype: int64

In [10]:
data['Возраст'] = data['Возраст'].astype(int)

No errors or anomalies in the data are observed.

Conclusion:
* A dataset of 5000 rows and 5 columns is loaded;
* No missing values are detected;
* No errors or anomalies are detected;
* The age column is converted from float to int.

The data is prepared for further work.

## 2. Matrix multiplication.

We use the following identities:

$$
(AB)^T=B^TA^T
$$

$$
((AB)^T)^{-1}=((AB)^{-1})^{T}
$$

$$
(AB)^{-1}=B^{-1}A^{-1}
$$

$$
A(A)^{-1}=(A)^{-1}A=E
$$

$$
EA=AE=A
$$

$$
\frac{d}{dX}X^Ta=a
$$

$$
\frac{d}{dX}X^TAX=(A+A^T)X
$$

where:
- $E$ is the identity matrix;
- in the last identity, if $A$ is a symmetric matrix, the right-hand side simplifies to $2AX$.
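
The identities above can be checked numerically. The following is a minimal sketch on small matrices, not part of the original notebook; the seed, the matrix size, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Adding 3 to the diagonal makes both matrices strictly diagonally dominant,
# which guarantees that they are invertible.
A = rng.random((3, 3)) + 3 * np.eye(3)
B = rng.random((3, 3)) + 3 * np.eye(3)
E = np.eye(3)  # identity matrix

transpose_ok = np.allclose((A @ B).T, B.T @ A.T)
inv_transpose_ok = np.allclose(np.linalg.inv((A @ B).T), np.linalg.inv(A @ B).T)
inv_product_ok = np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))
identity_ok = np.allclose(A @ np.linalg.inv(A), E) and np.allclose(E @ A, A)

# Gradient of x^T A x, compared with a central finite difference.
x = rng.random(3)
eps = 1e-6

def quadratic_form(v):
    return v @ A @ v

numeric_grad = np.array([
    (quadratic_form(x + eps * E[i]) - quadratic_form(x - eps * E[i])) / (2 * eps)
    for i in range(3)
])
grad_ok = np.allclose(numeric_grad, (A + A.T) @ x, atol=1e-6)

print(transpose_ok, inv_transpose_ok, inv_product_ok, identity_ok, grad_ok)
```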

The dependence of the target indicator $y_i$ on the regressors of the i-th observation can be expressed through the equation of linear regression of the form:

$$
f(w,x_i)=w_0+w_1x_{1,i}+...+w_kx_{k,i}
$$

where: $x_{j,i}$ - the value of the j-th regressor for the i-th observation, with $i$ from 1 to n

k - the number of regressors

w - the regression coefficients (weights); $w_j$ is the amount by which the predicted target changes, on average, when the j-th regressor increases by one unit

The weights are chosen by the least squares method, which minimizes the sum of squared errors:

$$
Err=\sum\limits_{i=1}^n{(y_i-f(x_i))^2} \rightarrow min
$$

Let's write the equation in matrix form:

$$
Err={(\overrightarrow{y}-X\overrightarrow{w})^2} \rightarrow min
$$

The equation with the features multiplied by a transformation matrix $A$ is written in a similar way:
$$
Err={(\overrightarrow{y}-XA\overrightarrow{w})^2} \rightarrow min
$$

Vector and matrix sizes:
$X:(n, k)$, $\overrightarrow{w}:(k, 1)$, $\overrightarrow{y}:(n, 1)$, $A:(k, k)$

Note that the first coefficient $w_0$ in the linear regression equation is not multiplied by any regressor. To write everything in matrix form, a regressor that equals one for every observation is added for $w_0$, that is, a column of ones is prepended to $X$.
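
As an illustration of the column of ones, here is a small sketch on made-up numbers (the data and the weights are hypothetical, not taken from the dataset). It shows that the explicit formula with $w_0$ and the matrix product give the same predictions.

```python
import numpy as np

# Hypothetical toy data: 3 observations, 2 regressors.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w0 = 0.5                   # intercept
w = np.array([2.0, -1.0])  # weights of the two regressors

# Prepend a column of ones so that w0 is treated like any other weight.
X_ext = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
w_ext = np.concatenate(([w0], w))

pred_explicit = w0 + X @ w   # w0 + w1*x1 + w2*x2 for each row
pred_matrix = X_ext @ w_ext  # the same computation as one matrix product

print(np.allclose(pred_explicit, pred_matrix))
```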

Let's expand the brackets in the formula with the transformation matrix:

$$
(\overrightarrow{y}-XA\overrightarrow{w})^2=
(\overrightarrow{y}-XA\overrightarrow{w})^T(\overrightarrow{y}-XA\overrightarrow{w})=
(\overrightarrow{y})^T\overrightarrow{y}-(\overrightarrow{y})^TXA\overrightarrow{w}-(XA\overrightarrow{w})^T\overrightarrow{y}+(XA\overrightarrow{w})^TXA\overrightarrow{w}
$$

The transpose appears because the expression in the brackets is a column vector, so its square is the product of its transpose with itself.

Prepare the equation for differentiation. Each cross term is a scalar, and a scalar equals its own transpose:

$$
(\overrightarrow{y})^TXA\overrightarrow{w}=((\overrightarrow{y})^TXA\overrightarrow{w})^T=(XA\overrightarrow{w})^T\overrightarrow{y}=\overrightarrow{w}^TA^TX^T\overrightarrow{y}
$$

$$
(XA\overrightarrow{w})^TXA\overrightarrow{w}=\overrightarrow{w}^TA^TX^TXA\overrightarrow{w}
$$

Since the two cross terms are equal, they can be combined, and we get:

$$
Err=(\overrightarrow{y})^T\overrightarrow{y}-2\overrightarrow{w}^TA^TX^T\overrightarrow{y}+\overrightarrow{w}^TA^TX^TXA\overrightarrow{w}
$$
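
The expanded form of the error can be compared with the direct squared norm numerically. This is a minimal sketch on random data; the sizes and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed for reproducibility
n, k = 50, 3
X = rng.normal(size=(n, k))  # features
A = rng.normal(size=(k, k))  # transformation matrix
w = rng.normal(size=k)       # weights
y = rng.normal(size=n)       # target

# Direct form: (y - XAw)^T (y - XAw)
residual = y - X @ A @ w
err_direct = residual @ residual

# Expanded form: y^T y - 2 w^T A^T X^T y + w^T A^T X^T X A w
err_expanded = y @ y - 2 * w @ A.T @ X.T @ y + w @ A.T @ X.T @ X @ A @ w

print(np.isclose(err_direct, err_expanded))
```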

The minimum of the error is found by setting its derivative with respect to the weights to zero:
$$
Err(X,\overrightarrow{y},\overrightarrow{w})\rightarrow min_{\overrightarrow{w}}\leftrightarrow \frac{dErr(X,\overrightarrow{y},\overrightarrow{w})}{d\overrightarrow{w}}=0
$$

$$
\frac{dErr(X,\overrightarrow{y},\overrightarrow{w})}{d\overrightarrow{w}}=\frac{d}{d\overrightarrow{w}}(\overrightarrow{y}^T\overrightarrow{y}-2\overrightarrow{w}^TA^TX^T\overrightarrow{y}+\overrightarrow{w}^TA^TX^TXA\overrightarrow{w})=\frac{d(\overrightarrow{y}^T\overrightarrow{y})}{d\overrightarrow{w}} - 
\frac{d(2\overrightarrow{w}^TA^TX^T\overrightarrow{y})}{d\overrightarrow{w}}+
\frac{d(\overrightarrow{w}^TA^TX^TXA\overrightarrow{w})}{d\overrightarrow{w}}
$$

Let's write down each derivative separately:

$$
\frac{d(\overrightarrow{y}^T\overrightarrow{y})}{d\overrightarrow{w}}=0
$$

$$
\frac{d(2\overrightarrow{w}^TA^TX^T\overrightarrow{y})}{d\overrightarrow{w}}=2A^TX^T\overrightarrow{y}
$$

$$
\frac{d(\overrightarrow{w}^TA^TX^TXA\overrightarrow{w})}{d\overrightarrow{w}}=(A^TX^TXA+(A^TX^TXA)^T)\overrightarrow{w}=2A^TX^TXA\overrightarrow{w}
$$

Next:
$$
\frac{dErr(X,\overrightarrow{y},\overrightarrow{w})}{d\overrightarrow{w}}=-2A^TX^T\overrightarrow{y}+2A^TX^TXA\overrightarrow{w}=0
$$


As a result, we get:

$$
A^TX^TXA\overrightarrow{w}=A^TX^T\overrightarrow{y}
$$

Using similar steps, the formula for linear regression without a transformation matrix can be derived:

$$
\sum\limits_{i=1}^n{(y_i-f(x_i))^2} = (\overrightarrow{y}-X\overrightarrow{w})^T(\overrightarrow{y}-X\overrightarrow{w})=
\overrightarrow{w}^TX^TX\overrightarrow{w}-2\overrightarrow{w}^TX^T\overrightarrow{y}+\overrightarrow{y}^T\overrightarrow{y}
$$

Differentiate:
$$
\frac{d(\overrightarrow{w}^TX^TX\overrightarrow{w}-2\overrightarrow{w}^TX^T\overrightarrow{y}+\overrightarrow{y}^T\overrightarrow{y})}{d\overrightarrow{w}}=2X^TX\overrightarrow{w}-2X^T\overrightarrow{y}=0
$$

$$
X^TX\overrightarrow{w}=X^T\overrightarrow{y}
$$
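
The normal equation can be solved directly with NumPy. The sketch below uses synthetic data (the sizes, the seed, and `true_w` are assumptions for illustration) and compares the solution with `np.linalg.lstsq` as a reference.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
n, k = 200, 3
X = rng.normal(size=(n, k))
true_w = np.array([1.5, -2.0, 0.7])       # hypothetical "true" weights
y = X @ true_w + 0.1 * rng.normal(size=n)  # target with a little noise

# Solve X^T X w = X^T y without forming the inverse explicitly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

Using `np.linalg.solve` instead of an explicit inverse is a common choice for numerical stability; the notebook's own class below uses `np.linalg.inv`, which follows the formula literally.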

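Multiply the equation without the transformation matrix by $A^T$ on the left. Its right-hand side then coincides with that of the equation with the transformation matrix, so the left-hand sides can be equated. Here $\overrightarrow{w}_{\text{before}}$ denotes the weights fitted on $X$ and $\overrightarrow{w}_{\text{after}}$ the weights fitted on $XA$:

$$
A^TX^TXA\overrightarrow{w}_{\text{after}}=A^TX^TX\overrightarrow{w}_{\text{before}}
$$

Provided that $A$ and $X^TX$ are invertible, multiplying on the left by $(A^TX^TX)^{-1}$ gives the relationship between the weight vectors:

$$
A\overrightarrow{w}_{\text{after}}=\overrightarrow{w}_{\text{before}}
$$

The weights of the model trained on the transformed features are obtained using the inverse matrix:

$$
\overrightarrow{w}_{\text{after}}=A^{-1}\overrightarrow{w}_{\text{before}}
$$

The predictions therefore coincide, $XA\overrightarrow{w}_{\text{after}}=XAA^{-1}\overrightarrow{w}_{\text{before}}=X\overrightarrow{w}_{\text{before}}$, so the quality of the model does not change when the features are multiplied by an invertible matrix.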
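
The effect of an invertible transformation on the weights can be illustrated on synthetic data. This is a sketch under stated assumptions: the data, the seed, and the helper `fit` are hypothetical, and the intercept is omitted for brevity. It checks that the weights fitted on $XA$ equal $A^{-1}$ times the weights fitted on $X$, and that the predictions are the same.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed for reproducibility
n, k = 300, 4
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -0.5, 2.0, 0.3]) + 0.1 * rng.normal(size=n)

# Adding k to the diagonal makes A strictly diagonally dominant, hence invertible.
A = rng.random((k, k)) + k * np.eye(k)

def fit(features, target):
    """Least squares weights from the normal equation (no intercept)."""
    return np.linalg.solve(features.T @ features, features.T @ target)

w_before = fit(X, y)     # weights on the original features
w_after = fit(X @ A, y)  # weights on the transformed features

weights_related = np.allclose(w_after, np.linalg.inv(A) @ w_before)
predictions_equal = np.allclose(X @ w_before, (X @ A) @ w_after)

print(weights_related, predictions_equal)
```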

Let's check!

Let's separate the training and test samples.

In [11]:
train, test = train_test_split(data, 
                               shuffle=False, 
                               random_state=0)

In [12]:
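class Linear_Regression:
    """Ordinary least squares via the normal equation w = (X^T X)^{-1} X^T y."""
    def fit(self, X_train, y_train):
        # Prepend a column of ones so that the intercept w0 is estimated with the other weights.
        X = np.concatenate((np.ones((X_train.shape[0], 1)), X_train), axis=1)
        y = y_train
        w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        self.w = w[1:]   # feature weights
        self.w0 = w[0]   # intercept
    def predict(self, X_test):
        return X_test @ self.w + self.w0
    def score(self, X, y, set_name):
        # Print R2 for the given sample, labelled with set_name.
        print('r_2 on the {} sample: {:.2f}'.format(set_name, r2_score(y, self.predict(X))))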

In [13]:
model_before = Linear_Regression()

X_train = train.drop('Страховые выплаты', axis=1)
y_train = train['Страховые выплаты']
X_test = test.drop('Страховые выплаты', axis=1)
y_test = test['Страховые выплаты']

model_before.fit(X_train,y_train)
model_before.score(X_train, y_train, 'train')
model_before.score(X_test, y_test, 'test')

r_2 on the train sample: 0.43
r_2 on the test sample: 0.42


## 3. Transformation algorithm.

The feature matrix has 4 columns, and matrix multiplication requires the number of columns of the left matrix to equal the number of rows of the right matrix. An invertible matrix must also be square, so the encryption matrix has size 4x4.

We will generate the encryption matrix using np.random.random().

We will multiply the features by this matrix after checking that it is invertible.

In [14]:
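while True:
    # Draw a random square matrix with as many rows and columns as there are features.
    random_matrix = np.random.random((X_train.shape[1], X_train.shape[1]))

    try:
        # np.linalg.inv raises LinAlgError for a singular matrix; in that case draw again.
        np.linalg.inv(random_matrix)
        break
    except np.linalg.LinAlgError:
        continue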
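
A random matrix is almost never exactly singular, so `np.linalg.inv` rarely fails, but the matrix can still be badly conditioned. The sketch below is one possible variant, not the notebook's method: the helper `make_invertible_matrix`, the threshold `max_cond`, and the seed are assumptions. It rejects poorly conditioned matrices and shows that the original features can be recovered with the inverse matrix.

```python
import numpy as np

def make_invertible_matrix(size, rng, max_cond=1e4):
    """Draw random matrices until one is well conditioned (hypothetical helper)."""
    while True:
        matrix = rng.random((size, size))
        # A finite, moderate condition number implies the matrix is invertible.
        if np.linalg.cond(matrix) < max_cond:
            return matrix

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
key = make_invertible_matrix(4, rng)

# The first two rows of the dataset: gender, age, salary, family members.
features = np.array([[1, 41, 49600, 1],
                     [0, 46, 38000, 1]], dtype=float)

encrypted = features @ key                  # transformed features
restored = encrypted @ np.linalg.inv(key)   # recovery with the inverse matrix

print(np.allclose(restored, features, atol=1e-6))
```

Anyone who holds `key` can restore the data, so in such a scheme the matrix would need to be stored as carefully as the data itself.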

## 4. Algorithm verification.

In [15]:
X_train_changed = X_train @ random_matrix
X_test_changed = X_test @ random_matrix

model_after = Linear_Regression()
model_after.fit(X_train_changed, y_train)

model_after.score(X_train_changed, y_train, 'train')
model_after.score(X_test_changed, y_test, 'test')

r_2 on the train sample: 0.42
r_2 on the test sample: 0.42


Let's check whether the quality of the model has changed after encryption.

The metric of the model before encryption.

In [16]:
model_before.score(X_train, y_train, 'train')
model_before.score(X_test, y_test, 'test')

r_2 on the train sample: 0.43
r_2 on the test sample: 0.42


The metric of the model after encryption.

In [17]:
model_after.score(X_train_changed, y_train, 'train')
model_after.score(X_test_changed, y_test, 'test')

r_2 on the train sample: 0.42
r_2 on the test sample: 0.42
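
The same check can be done with the linear regression from sklearn mentioned in the plan. This is a minimal sketch on synthetic data rather than on the insurance dataset; the data, the seed, and the matrix `A` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)  # arbitrary seed for reproducibility
n, k = 500, 4
X = rng.normal(size=(n, k))
y = 0.5 + X @ np.array([1.0, -2.0, 0.0, 0.3]) + rng.normal(size=n)

# Adding k to the diagonal makes A strictly diagonally dominant, hence invertible.
A = rng.random((k, k)) + k * np.eye(k)
X_changed = X @ A

# Fit one model on the original features and one on the transformed features.
r2_before = r2_score(y, LinearRegression().fit(X, y).predict(X))
r2_after = r2_score(y, LinearRegression().fit(X_changed, y).predict(X_changed))

print(np.isclose(r2_before, r2_after))
```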


Conclusion: 

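The quality of the model is essentially unchanged after the features are multiplied by an invertible matrix: R2 on the test sample is 0.42 both before and after the transformation, and R2 on the train sample is 0.43 before and 0.42 after. The difference in the second decimal place on the train sample is most likely due to floating-point error in the matrix inversion rather than to the transformation itself.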