# Protection of personal data of clients

We need to protect the customer data of the insurance company "Though the flood". Develop a data transformation method that makes it difficult to recover personal information from the transformed data, and justify why the method works correctly.

The data must be protected in such a way that the quality of machine learning models does not deteriorate after the transformation. There is no need to select the best model.

## Data loading

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [2]:
df = pd.read_csv('datasets/insurance.csv')
df

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0
...,...,...,...,...,...
4995,0,28.0,35700.0,2,0
4996,0,34.0,52400.0,1,0
4997,0,20.0,33900.0,2,0
4998,1,22.0,32700.0,3,0


In [5]:
df.describe()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
count,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.499,30.9528,39916.36,1.1942,0.148
std,0.500049,8.440807,9900.083569,1.091387,0.463183
min,0.0,18.0,5300.0,0.0,0.0
25%,0.0,24.0,33300.0,0.0,0.0
50%,0.0,30.0,40200.0,1.0,0.0
75%,1.0,37.0,46600.0,2.0,0.0
max,1.0,65.0,79000.0,6.0,5.0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


Features: `Gender` (Пол), `Age` (Возраст) and `Salary` (Зарплата) of the insured, the `Number of Family Members` (Члены семьи).

Target: the `Number of Insurance Payments` (Страховые выплаты) to the client over the past 5 years.
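
For readers who prefer English identifiers, the columns could be renamed before any further work. The snippet below is only a sketch: the English names in `COLUMN_NAMES` are hypothetical (they do not appear in the dataset), and the two-row `sample` frame is a stand-in built from the first rows of the table above. The rest of this notebook keeps the original column names.

```python
import pandas as pd

# Hypothetical mapping from the original column names to English identifiers
COLUMN_NAMES = {
    'Пол': 'gender',
    'Возраст': 'age',
    'Зарплата': 'salary',
    'Члены семьи': 'family_members',
    'Страховые выплаты': 'insurance_payments',
}

# First two rows of the table above, used as a stand-in for the full dataset
sample = pd.DataFrame({
    'Пол': [1, 0],
    'Возраст': [41.0, 46.0],
    'Зарплата': [49600.0, 38000.0],
    'Члены семьи': [1, 1],
    'Страховые выплаты': [0, 1],
})

renamed = sample.rename(columns=COLUMN_NAMES)

# Split into features and target using the English names
features_en = renamed.drop('insurance_payments', axis=1)
target_en = renamed['insurance_payments']
```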

### Conclusion

The data consists of 5 columns and 5,000 rows.

There are no missing values in the data.

## Matrix multiplication

Notation:

- $X$: feature matrix (column zero consists of ones)

- $y$: vector of the target feature

- $P$: the matrix by which the features are multiplied

- $w$: vector of linear regression weights (element zero is the bias term)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Formula for the weights:

$$
w = (X^T X)^{-1} X^T y
$$

Multiplying the features by an invertible matrix $P$ does not change the predictions. The original predictions can be rewritten as

$$
a = Xw = XEw = XPP^{-1}w = (XP)P^{-1}w = (XP)w'
$$

so it remains to show that the weights $w'$ trained on $XP$ are equal to $P^{-1}w$. Note that $X$ is generally not square and cannot be inverted on its own; only the square matrices $P$, $P^T$ and $X^T X$ are inverted below (with $X^T X$ assumed invertible, as the formula for $w$ already requires).

$$
w' = ((XP)^T XP)^{-1} (XP)^T y
$$
$$
w' = (P^T (X^T X) P)^{-1} (XP)^T y
$$
$$
w' = (P^T (X^T X) P)^{-1} P^T X^T y
$$
$$
w' = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y
$$
$$
w' = P^{-1} (X^T X)^{-1} E X^T y
$$
$$
w' = P^{-1}w
$$

Therefore the predictions on the transformed features are

$$
a' = (XP)w' = XPP^{-1}w = Xw = a
$$
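
The identity $w' = P^{-1}w$ can also be checked numerically. The following is a minimal sketch on synthetic data, not on the insurance dataset: the helper `normal_equation`, the random matrices and the seed are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: a column of ones plus 4 random features
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 4))])
y = rng.normal(size=200)

# Random square matrix; adding a multiple of the identity keeps it well conditioned
P = rng.normal(size=(5, 5)) + 5 * np.eye(5)


def normal_equation(A, b):
    """Solve (A^T A) w = A^T b for w."""
    return np.linalg.solve(A.T @ A, A.T @ b)


w = normal_equation(X, y)            # weights on the original features
w_prime = normal_equation(X @ P, y)  # weights on the transformed features

# w' should equal P^{-1} w, and both models should give the same predictions
weights_match = np.allclose(w_prime, np.linalg.solve(P, w))
predictions_match = np.allclose(X @ w, (X @ P) @ w_prime)
print(weights_match, predictions_match)
```

Using `np.linalg.solve` instead of an explicit inverse is a design choice made here for numerical stability; it computes the same quantities as the formulas above.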

### Model Training:

In [6]:
features = df.drop('Страховые выплаты',axis=1)
target = df['Страховые выплаты']

In [9]:
model = LinearRegression()
model.fit(features,target)
predictions = model.predict(features)
r2_initial = r2_score(target,predictions)
print(f'R2 metric based on initial data: {round(r2_initial,4)}')

R2 metric based on initial data: 0.4249


### Creating an invertible matrix:

In [12]:
array = np.random.randint(100, size=(features.shape[1],features.shape[1]))
print('Random matrix:\n',array)
try:
    # np.linalg.inv raises LinAlgError if the matrix is singular
    inverse = np.linalg.inv(array)
except np.linalg.LinAlgError:
    print('The matrix is not invertible, rerun the cell!')
else:
    print()
    print('The random matrix is invertible, its inverse:\n',inverse)

Random matrix:
 [[59 25 19 30]
 [98 92 34 49]
 [44 91 76 96]
 [34 75 91 28]]

The random matrix is invertible, its inverse:
 [[ 2.13411316e-02  4.07944946e-05 -7.12028956e-03  1.47553291e-03]
 [-3.35938042e-02  2.03750678e-02  3.60747432e-04 -8.99855259e-04]
 [ 1.70873483e-02 -1.43564936e-02 -2.30172044e-03  1.47076035e-02]
 [ 8.53529084e-03 -7.96700636e-03  1.51603696e-02 -1.14668175e-02]]
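
`np.linalg.inv` raises `LinAlgError` only for matrices that are exactly singular; a nearly singular matrix passes silently and can amplify rounding errors. One possible way to avoid rerunning the cell by hand is to redraw the matrix until its condition number is acceptable. This is a sketch under assumptions: the function name `random_invertible_matrix` and the thresholds `max_cond` and `max_tries` are hypothetical choices, not part of the original solution.

```python
import numpy as np


def random_invertible_matrix(n, rng, max_cond=1e6, max_tries=100):
    """Draw random integer matrices until one is safely invertible."""
    for _ in range(max_tries):
        candidate = rng.integers(0, 100, size=(n, n)).astype(float)
        # A huge (or infinite) condition number means the matrix is
        # numerically close to singular, so draw another one
        if np.linalg.cond(candidate) < max_cond:
            return candidate
    raise RuntimeError('No well-conditioned matrix found')


rng = np.random.default_rng(42)
P = random_invertible_matrix(4, rng)

# Sanity check: P times its inverse should give the identity matrix
identity_recovered = np.allclose(P @ np.linalg.inv(P), np.eye(4))
```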


### Multiplying the features by an invertible matrix and training the model:

In [13]:
features_reverse = np.dot(df.drop('Страховые выплаты',axis=1).values,array)
features_reverse

array([[2186511., 4517472., 3771104., 4763667.],
       [1676542., 3462307., 2889655., 3650282.],
       [ 926842., 1913668., 1596986., 2017421.],
       ...,
       [1493628., 3086890., 2577262., 3255436.],
       [1441117., 2977974., 2486240., 3140392.],
       [1789237., 3697276., 3086662., 3899030.]])

In [14]:
model = LinearRegression()
model.fit(pd.DataFrame(features_reverse),target)
predictions = model.predict(pd.DataFrame(features_reverse))
r2_reverse = r2_score(target,predictions)

In [18]:
print(f'R2 metric based on initial data: {round(r2_initial,5)}')
print(f'R2 metric after multiplying the data by an invertible matrix: {round(r2_reverse,5)}')

R2 metric based on initial data: 0.42495
R2 metric after multiplying the data by an invertible matrix: 0.42495


### Justification and answer

As can be seen, multiplying the features by an invertible matrix did not affect the quality of the model.

R2 is 0.42495 in both cases, identical to five decimal places.
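
One detail is worth spelling out. The derivation treats the column of ones as part of $X$, while the code multiplies only the four feature columns and lets `LinearRegression` fit the intercept itself. The result still holds, because adding a column of ones to $XP$ is the same as multiplying the full matrix by a block-diagonal matrix that leaves the intercept column untouched, and that matrix is invertible whenever $P$ is. The sketch below illustrates this on synthetic data; the names `P_ext`, `X1` and `XP1` are introduced here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(100, 4))                # synthetic features, no intercept column
y = rng.normal(size=100)
P = rng.normal(size=(4, 4)) + 4 * np.eye(4)  # random matrix, assumed invertible

ones = np.ones((100, 1))
X1 = np.hstack([ones, X])        # design matrix with the intercept column
XP1 = np.hstack([ones, X @ P])   # intercept column plus transformed features

# Block-diagonal matrix: 1 for the intercept column, P for the features
P_ext = np.eye(5)
P_ext[1:, 1:] = P

# Transforming only the features equals multiplying the full matrix by P_ext
same_design = np.allclose(XP1, X1 @ P_ext)

# Least-squares fits on both design matrices give the same predictions
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_prime, *_ = np.linalg.lstsq(XP1, y, rcond=None)
same_predictions = np.allclose(X1 @ w, XP1 @ w_prime)
```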

## Conversion algorithm

**Algorithm**

In [19]:
def coding(df):
    # Split the features from the target column
    features = df.drop('Страховые выплаты',axis=1)
    target = df['Страховые выплаты']

    # Create a random square matrix and check that it is invertible
    n = features.shape[1]
    array = np.random.randint(100, size=(n,n))
    try:
        np.linalg.inv(array)
    except np.linalg.LinAlgError:
        raise ValueError('The matrix is not invertible, rerun the cell!')

    # Multiply the features by the invertible matrix and train the model
    features_reverse = np.dot(features.values,array)

    model = LinearRegression()
    model.fit(pd.DataFrame(features_reverse),target)
    predictions = model.predict(pd.DataFrame(features_reverse))
    return r2_score(target,predictions)

**Justification**

The conversion algorithm multiplies all the features by a random invertible matrix, which turns the customer data into numbers with no obvious meaning.

At the same time, as proved earlier, the quality of the model does not decrease.
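
Invertibility also means that whoever holds the matrix can restore the original data, so the matrix plays the role of a secret key. The sketch below illustrates this on three rows copied from the feature table; the variable names, the seed and the conditioning threshold are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Three rows from the feature table: gender, age, salary, family members
X = np.array([
    [1, 41.0, 49600.0, 1],
    [0, 46.0, 38000.0, 1],
    [0, 29.0, 21000.0, 0],
])

# Hypothetical secret key: a random integer matrix, redrawn until well conditioned
while True:
    P = rng.integers(0, 100, size=(4, 4)).astype(float)
    if np.linalg.cond(P) < 1e3:
        break

X_masked = X @ P                          # the data that would be shared
X_restored = X_masked @ np.linalg.inv(P)  # recovery requires knowing P

# The restored values match the originals up to floating-point rounding
restored_ok = np.allclose(X_restored, X, atol=1e-6)
```

Without `P`, the masked values do not reveal the original columns directly, which is what makes recovery difficult rather than impossible.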

## Checking the algorithm

### Train the model before the transformation

In [20]:
features = df.drop('Страховые выплаты',axis=1)
target = df['Страховые выплаты']

model = LinearRegression()
model.fit(features,target)
predictions = model.predict(features)
r2_initial = r2_score(target,predictions)

In [23]:
print(f'R2 metric based on initial data: {round(r2_initial,4)}')
print(f'R2 metric after multiplying the data by an invertible matrix: {round(coding(df),4)}')

R2 metric based on initial data: 0.4249
R2 metric after multiplying the data by an invertible matrix: 0.4249


### Conclusion

The proposed conversion method hides the personal data of customers without reducing the prediction quality of the model.

Example of features **before** conversion:

In [24]:
features

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи
0,1,41.0,49600.0,1
1,0,46.0,38000.0,1
2,0,29.0,21000.0,0
3,0,21.0,41700.0,2
4,1,28.0,26100.0,0
...,...,...,...,...
4995,0,28.0,35700.0,2
4996,0,34.0,52400.0,1
4997,0,20.0,33900.0,2
4998,1,22.0,32700.0,3


Example of features **after** conversion:

In [25]:
pd.DataFrame(features_reverse)

Unnamed: 0,0,1,2,3
0,2186511.0,4517472.0,3771104.0,4763667.0
1,1676542.0,3462307.0,2889655.0,3650282.0
2,926842.0,1913668.0,1596986.0,2017421.0
3,1836926.0,3796782.0,3170096.0,4004285.0
4,1151203.0,2377701.0,1984571.0,2507002.0
...,...,...,...,...
4995,1573612.0,3251426.0,2714334.0,3428628.0
4996,2308966.0,4771603.0,3983647.0,5032094.0
4997,1493628.0,3086890.0,2577262.0,3255436.0
4998,1441117.0,2977974.0,2486240.0,3140392.0
