# Protection of personal data of clients

We need to protect the data of the insurance company's customers. We will develop a data transformation method that makes it difficult to recover personal information from the transformed data.

The data must be protected in such a way that the quality of machine learning models does not deteriorate after the transformation. There is no need to select the best model.
  
The key steps of our project will be:

* Loading and preparing data
* Matrix multiplication
* Suggestion of data transformation algorithm for solving the problem
* Algorithm programming by applying matrix operations
* Linear regression quality check. Studying the R2 score
  
The project was made in **Jupyter Notebook** (notebook server version 6.1.4) with **Python** 3.7.8.
Libraries used in the project:
* **Pandas**
* **NumPy**
* **scikit-learn**
* **IPython**

## Loading and preparing data

In [1]:
# Import all required libraries and modules.
from IPython.display import display
import pandas as pd
import numpy as np
from numpy import random
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Read the data.
data = pd.read_csv('insurance.csv')
data.info()
display(data)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0
...,...,...,...,...,...
4995,0,28.0,35700.0,2,0
4996,0,34.0,52400.0,1,0
4997,0,20.0,33900.0,2,0
4998,1,22.0,32700.0,3,0


Translation of the dataset's Russian column names:
* Пол - gender
* Возраст - age
* Зарплата - salary
* Члены семьи - number of family members
* Страховые выплаты - insurance payments
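
If English column names are more convenient, the translation above can be applied with `DataFrame.rename`. This is only a sketch: the English names in `COLUMN_TRANSLATION` are illustrative assumptions, the tiny sample frame is built from the first two rows of the table above, and the rest of this notebook keeps the original Russian names.

```python
import pandas as pd

# Hypothetical mapping from the Russian column names to English ones.
COLUMN_TRANSLATION = {
    'Пол': 'gender',
    'Возраст': 'age',
    'Зарплата': 'salary',
    'Члены семьи': 'family_members',
    'Страховые выплаты': 'insurance_payments',
}

# A tiny sample built from the first two rows shown above.
sample = pd.DataFrame({
    'Пол': [1, 0],
    'Возраст': [41.0, 46.0],
    'Зарплата': [49600.0, 38000.0],
    'Члены семьи': [1, 1],
    'Страховые выплаты': [0, 1],
})

# rename returns a new DataFrame; the original is left untouched.
sample_en = sample.rename(columns=COLUMN_TRANSLATION)
print(list(sample_en.columns))
```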

In [2]:
# Optional: cast the "Age" and "Salary" columns
# from float64 to int64.
data = data.astype({'Возраст': 'int64', 'Зарплата': 'int64'})
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Пол                5000 non-null   int64
 1   Возраст            5000 non-null   int64
 2   Зарплата           5000 non-null   int64
 3   Члены семьи        5000 non-null   int64
 4   Страховые выплаты  5000 non-null   int64
dtypes: int64(5)
memory usage: 195.4 KB
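
Casting floats to integers silently drops any fractional part, so it can be worth checking that the values are whole numbers first. A minimal sketch with NumPy, using the five ages from the head of the table above as sample values (the helper name `is_whole` is hypothetical):

```python
import numpy as np

def is_whole(values):
    """Return True if every value has no fractional part."""
    values = np.asarray(values, dtype=float)
    return bool(np.all(np.mod(values, 1) == 0))

ages = np.array([41.0, 46.0, 29.0, 21.0, 28.0])
# Cast only when no information would be lost.
if is_whole(ages):
    ages = ages.astype('int64')
print(ages.dtype)
```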


### Conclusion

We loaded and reviewed the data, then converted the *Age* and *Salary* columns to `int64`.

## Matrix multiplication

In [3]:
# Define the target and other features.
features = data.drop(['Страховые выплаты'], axis=1)
target = data['Страховые выплаты']

# Prepare the training, validation and test sets
# in the classic 0.6 / 0.2 / 0.2 ratio.
# We wrap the split in a function because we will reuse it later.
def splitting(features, target):
    # First hold out 20% of the data as the validation set.
    features_train, features_valid, target_train, target_valid = train_test_split(
        features,
        target,
        test_size=.2,
        random_state=12345)
    # Then hold out 25% of the remaining 80% (20% of the total) as the test set.
    features_train, features_test, target_train, target_test = train_test_split(
        features_train,
        target_train,
        test_size=.25,
        random_state=12345)
    # Report the sizes of the resulting samples.
    data_kit = [features_train,
            target_train,
            features_valid,
            target_valid,
            features_test,
            target_test]
    print('Note: the listed samples are divided in pairs into three' +
          ' groups in the following order: training, validation and test.')
    print('The first in the pair is the set of features, the second is the set of target features. ')
    for kit in data_kit:
        print('The table size is:', kit.shape)
    return (features_train, target_train, features_valid, 
            target_valid, features_test, target_test)

In [4]:
# Let's use the splitting function.
(features_train, target_train, features_valid, 
 target_valid, features_test, target_test) = splitting(features, target)

Note: the listed samples are divided in pairs into three groups in the following order: training, validation and test.
The first in the pair is the set of features, the second is the set of target features. 
The table size is: (3000, 4)
The table size is: (3000,)
The table size is: (1000, 4)
The table size is: (1000,)
The table size is: (1000, 4)
The table size is: (1000,)
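
The sizes above follow from the two-step split: 20% of the data is held out first, then 25% of the remaining 80%, which is again 20% of the total. A quick arithmetic check, assuming the 5000 rows reported by `data.info()`:

```python
n_total = 5000

# Step 1: hold out 20% of all rows as the validation set.
n_valid = int(n_total * 0.2)
n_remaining = n_total - n_valid

# Step 2: hold out 25% of the remaining rows as the test set.
n_test = int(n_remaining * 0.25)
n_train = n_remaining - n_test

print(n_train, n_valid, n_test)
```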


In [5]:
# Train linear regression and measure its quality.
model = LinearRegression()
model.fit(features_train, target_train)
predictions_valid = pd.Series(
    data=model.predict(features_valid), 
    index=target_valid.index
)
# Now let's measure the quality of the model on the validation set,
# using R2 and MSE scores.
r2_valid = r2_score(target_valid, predictions_valid)
mse_valid = mean_squared_error(target_valid, predictions_valid)
print('R2 score on the validation set is', r2_valid)
print('MSE score on the validation set is', mse_valid)
print()
# Measure the quality on the test sample.
predictions_test = pd.Series(
    data=model.predict(features_test),
    index=target_test.index
)
r2_test = r2_score(target_test, predictions_test)
mse_test = mean_squared_error(target_test, predictions_test)
# Now let's measure the quality of the model on the test set,
# using R2 and MSE scores.
print('The R2 score on the test sample is', r2_test)
print('The MSE score on the test set is', mse_test)

R2 score on the validation set is 0.4119936287730738
MSE score on the validation set is 0.1100159920565579

The R2 score on the test sample is 0.41812098706230283
The MSE score on the test set is 0.12135145002716968
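
For reference, both scores can be computed by hand: MSE is the mean squared error, and R2 compares it with the MSE of a constant model that always predicts the mean of the target. A small sketch on made-up numbers (the arrays `y_true` and `y_pred` are illustrative, not taken from the dataset):

```python
import numpy as np

y_true = np.array([0.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.1, 0.2, 0.8, 0.9])

# Mean squared error of the predictions.
mse = np.mean((y_true - y_pred) ** 2)

# MSE of the constant model that always predicts the mean target.
mse_mean_model = np.mean((y_true - y_true.mean()) ** 2)

# R2 = 1 - MSE(model) / MSE(mean model):
# 1 is a perfect fit, 0 is no better than the mean model.
r2 = 1 - mse / mse_mean_model
print(mse, r2)
```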


In [6]:
# Check whether the quality of the model changes if we transform
# the features by multiplying them by an invertible matrix.
array_features = features.values
# Create a random square matrix whose number of rows equals
# the number of columns of array_features.
invertable_array = np.random.randint(
    0, 
    10, 
    size=(
        array_features.shape[1], 
        array_features.shape[1])
)
print('Arbitrary supposedly invertable matrix invertable_array')
print(invertable_array)
print()
# Check that invertable_array is invertible:
# np.linalg.inv raises LinAlgError for a singular matrix.
try:
    inverse_array = np.linalg.inv(invertable_array)
    print('Matrix inverse of matrix invertable_array')
    print(inverse_array)
    print()
    print('The matrix invertable_array is invertible')
except np.linalg.LinAlgError:
    print('The matrix invertable_array is not invertible')

Arbitrary supposedly invertable matrix invertable_array
[[9 6 3 2]
 [4 3 5 6]
 [2 3 6 5]
 [8 3 5 5]]

Matrix inverse of matrix invertable_array
[[-0.01234568 -0.09876543 -0.08641975  0.20987654]
 [ 0.23868313  0.24279835  0.00411523 -0.3909465 ]
 [-0.07407407 -0.59259259  0.48148148  0.25925926]
 [-0.04938272  0.60493827 -0.34567901 -0.16049383]]

The matrix invertable_array is invertible
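
`np.linalg.inv` only fails on exactly singular matrices, so a nearly singular matrix can slip through and amplify rounding errors. A hedged sketch of complementary checks (rank and condition number), using the matrix printed above:

```python
import numpy as np

A = np.array([
    [9, 6, 3, 2],
    [4, 3, 5, 6],
    [2, 3, 6, 5],
    [8, 3, 5, 5],
])

# A square matrix is invertible exactly when it has full rank.
rank = np.linalg.matrix_rank(A)
# A large condition number signals a numerically fragile inverse.
condition_number = np.linalg.cond(A)
# Sanity check: A times its inverse should be the identity matrix.
is_identity = np.allclose(A @ np.linalg.inv(A), np.eye(A.shape[0]))

print(rank, condition_number, is_identity)
```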


In [7]:
# Multiply the features by the invertable_array matrix.
new_features_array = array_features @ invertable_array
new_features = pd.DataFrame(
    data=new_features_array, 
    index=features.index, 
    columns=features.columns
)
display(new_features)

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи
0,99381,148932,297813,248253
1,76192,114141,228235,190281
2,42116,63087,126145,105174
3,83500,125169,250315,208636
4,52321,78390,156743,130670
...,...,...,...,...
4995,71528,107190,214350,178678
4996,104944,157305,314575,262209
4997,67896,101766,203510,169630
4998,65521,98181,196328,163649


In [8]:
# Let's split the new set of features with the splitting function.
(new_features_train, new_target_train, 
 new_features_valid, new_target_valid, 
 new_features_test, new_target_test) = splitting(new_features, target)

Note: the listed samples are divided in pairs into three groups in the following order: training, validation and test.
The first in the pair is the set of features, the second is the set of target features. 
The table size is: (3000, 4)
The table size is: (3000,)
The table size is: (1000, 4)
The table size is: (1000,)
The table size is: (1000, 4)
The table size is: (1000,)


In [9]:
# Train linear regression and measure its quality.
model = LinearRegression()
model.fit(new_features_train, new_target_train)
predictions_valid_new = pd.Series(
    data=model.predict(new_features_valid), 
    index=new_target_valid.index
)
# Now let's measure the quality of the model on the validation set,
# using R2 and MSE scores.
r2_valid_new = r2_score(new_target_valid, predictions_valid_new)
mse_valid_new = mean_squared_error(new_target_valid, predictions_valid_new)
print('R2 score on the validation modified sample is', 
      r2_valid_new)
print('MSE score on the validation modified sample is', 
      mse_valid_new)

# Test the quality on the test sample.
predictions_test_new = pd.Series(
    data=model.predict(new_features_test),
    index=new_target_test.index
)
# Now let's measure the quality of the model on the test set,
# using R2 and MSE scores.
r2_test_new = r2_score(new_target_test, predictions_test_new)
mse_test_new = mean_squared_error(new_target_test, predictions_test_new)
print()
print('R2 score on the test modified sample is', 
      r2_test_new)
print('MSE score on the test modified sample is', 
      mse_test_new)

R2 score on the validation modified sample is 0.41199362877305856
MSE score on the validation modified sample is 0.11001599205656075

R2 score on the test modified sample is 0.41812098706240963
MSE score on the test modified sample is 0.1213514500271474


In [10]:
results = pd.DataFrame(
    {
        'R2': [r2_test, r2_test_new],
        'MSE': [mse_test, mse_test_new]
    },index=(
        [
            'Initial feature set', 
            'Feature set multiplied by an invertible matrix'
        ]
    )
)
display(results)

Unnamed: 0,R2,MSE
Initial feature set,0.418121,0.121351
Feature set multiplied by an invertible matrix,0.418121,0.121351


### Conclusion

We separated the features from the target and wrote a function that splits the data into samples and reports their sizes. We then trained a linear regression model on the original data and measured its quality with the R2 and MSE scores.

Next, we multiplied the feature matrix by an arbitrary invertible matrix, trained a new linear regression model on the transformed features and measured its quality in the same way. The quality did not change compared to the first model: the scores agree up to floating-point rounding.
The rationale is given below.

## Suggestion of data transformation algorithm for solving the problem

First, let's introduce the notation:

- $X$: feature matrix (its zeroth column consists of ones)

- $y$: target feature vector

- $P$: the matrix by which the features are multiplied

- $w$: vector of linear regression weights (its zeroth element is the shift, i.e. the intercept)
  
Let's revisit the basic formulas.
    
Predictions:

$$
a = Xw
$$

Learning objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Learning formula:

$$
w = (X^T X)^{-1} X^T y
$$

Expanded prediction formula:

$$
a = X (X^T X)^{-1} X^T y
$$
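
The learning formula can be checked numerically. The sketch below builds a small synthetic problem (a random `X` with a leading column of ones and a random `y`, both assumptions made purely for illustration) and compares the closed-form weights with NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 objects, 3 features plus a leading column of ones.
X = np.hstack([np.ones((50, 1)), rng.random((50, 3))])
y = rng.random(50)

# Learning formula: w = (X^T X)^{-1} X^T y.
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Expanded prediction formula: a = X (X^T X)^{-1} X^T y = X w.
a = X @ w
print(np.allclose(w, w_lstsq))
```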

We have seen empirically that multiplying the feature matrix by an arbitrary invertible matrix does not change the quality of the model. Now let's prove why this is so.
In the linear regression prediction formula, we replace the feature matrix $X$ with $X \cdot P$, where $P$ is an arbitrary invertible matrix.

$$
a = (X \cdot P) \cdot ((X \cdot P)^T \cdot (X \cdot P))^{-1} \cdot (X \cdot P)^T \cdot y
$$

Now let's expand the transposes using the rule $(AB)^T = B^T A^T$:

$$
a = X \cdot P \cdot (P^T \cdot X^T \cdot (X \cdot P))^{-1} \cdot P^T \cdot X^T \cdot y
$$
  
Matrix multiplication is associative, so we can regroup the factors inside the inverted product:

$$
a = X \cdot P \cdot (P^T \cdot (X^T \cdot X) \cdot P)^{-1} \cdot P^T \cdot X^T \cdot y
$$

The product $X^T \cdot X$ is a square matrix, and so are $P^T$ and $P$. The inverted part of the formula is therefore a product of three square matrices of the same size:

$$
(P^T \cdot (X^T \cdot X) \cdot P)^{-1}
$$

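Since $P$ is invertible and $X^T \cdot X$ is assumed to be invertible (otherwise the learning formula would not be defined), we can apply the rule $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$:

$$
a = X \cdot P \cdot P^{-1} \cdot (X^T \cdot X)^{-1} \cdot (P^T)^{-1} \cdot P^T \cdot X^T \cdot y
$$
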
The product of a matrix and its inverse is the identity matrix $E$, so $P \cdot P^{-1} = E$ and $(P^T)^{-1} \cdot P^T = E$:

$$
a = X \cdot E \cdot (X^T \cdot X)^{-1} \cdot E \cdot X^T \cdot y
$$

Multiplying any matrix $M$ by the identity matrix gives the same matrix $M$, so the factors $E$ can be dropped:

$$
a = X\cdot (X^T \cdot X)^{-1} \cdot X^T \cdot y
$$

We have obtained the original formula for calculating predictions.
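
The same steps also show how the weights themselves change. Writing $w'$ for the weights learned on the transformed features $X \cdot P$ (a symbol introduced here only for illustration), and assuming as before that $X^T \cdot X$ and $P$ are invertible:

$$
w' = ((X \cdot P)^T \cdot (X \cdot P))^{-1} \cdot (X \cdot P)^T \cdot y = P^{-1} \cdot (X^T \cdot X)^{-1} \cdot (P^T)^{-1} \cdot P^T \cdot X^T \cdot y = P^{-1} \cdot w
$$

The predictions on the transformed features are then $X \cdot P \cdot w' = X \cdot P \cdot P^{-1} \cdot w = X \cdot w = a$.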

Thus we have proved that the predictions do not change when the features are multiplied by an arbitrary invertible matrix, provided the model is retrained on the transformed features. This fact can be used to encrypt data.
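
This result is easy to confirm numerically. A minimal sketch on synthetic data: the matrices `X`, `P` and the vector `y` are random and purely illustrative, and the helper `predict` is hypothetical and simply implements the expanded prediction formula. Adding 4 to the diagonal of `P` is an extra assumption that makes it strictly diagonally dominant and therefore guaranteed to be invertible.

```python
import numpy as np

rng = np.random.default_rng(12345)

X = rng.random((100, 4))   # feature matrix
y = rng.random(100)        # target vector
# Off-diagonal entries lie in [0, 1), so a diagonal of at least 4
# dominates each row and the matrix cannot be singular.
P = rng.random((4, 4)) + 4 * np.eye(4)

def predict(features, target):
    """Fit by the learning formula and return predictions on the same data."""
    w = np.linalg.inv(features.T @ features) @ features.T @ target
    return features @ w

a_original = predict(X, y)
a_transformed = predict(X @ P, y)

# The two prediction vectors agree up to floating-point rounding.
print(np.allclose(a_original, a_transformed))
```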

## Algorithm programming by applying matrix operations

**Algorithm**

An algorithm for transforming (encrypting) customer data is proposed, based on the fact established in the previous part of the project. The personal data are the features of our model. Multiplying the feature matrix by an arbitrary invertible matrix gives a new matrix, which is the encrypted array of personal data. The inverse of that invertible matrix becomes the encryption key, which is needed to decrypt the data. The quality of the linear regression model does not change after the data is encrypted.
  
It is worth noting that strong encryption usually combines substitution and permutation (replacing the original characters or numbers with new ones and rearranging them within the resulting data set). In our case, only substitution (multiplication by an arbitrary invertible matrix) is possible: if the individual elements of the resulting array were subsequently shuffled, the quality of the model would change.

In [11]:
# Let's write a feature encryption function.
# It takes two parameters: the features as a DataFrame
# and the seed of the random number generator.
def cypher(features, seed):
    # Convert the feature DataFrame to a NumPy array.
    array_features = features.values
    # Create a random generator. Passing a fixed seed,
    # for example 12345, makes the result reproducible.
    rng = np.random.default_rng(seed)
    # Generate a random square cipher matrix whose size equals
    # the number of feature columns.
    invertable_array = rng.random((
        array_features.shape[1],
        array_features.shape[1]
    ))
    # Make sure the cipher matrix is invertible.
    # Its inverse is the key that decrypts the data.
    try:
        key = np.linalg.inv(invertable_array)
    except np.linalg.LinAlgError:
        raise ValueError('The cipher matrix is not invertible.' +
                         ' Change the seed of the random number generator')
    # Multiply the features by the cipher matrix.
    new_features_array = array_features @ invertable_array
    # Put the result into a DataFrame with the original index and columns.
    encrypted_features = pd.DataFrame(
        data=new_features_array,
        index=features.index,
        columns=features.columns)
    print('Random number generator seed used',
          seed)
    print('Encryption key')
    print(key)
    return encrypted_features, key

### Conclusion

**Rationale**

In the previous step we proved that multiplying the feature matrix by an arbitrary invertible matrix does not change the quality of the model. This means that customers' personal data can be encrypted by matrix multiplication, and the inverse of the invertible matrix can serve as the encryption key. To decrypt the data, multiply the encrypted features by the encryption key. A detailed theoretical justification of this method is given in the previous step.
  
It is worth noting that the seed parameter of the random number generator, as well as the encryption key, must be protected from unauthorized access.
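
Decryption can be sketched as follows. The sample rows are the first three rows of the feature table shown earlier, and the cipher matrix is generated the same way as inside `cypher`, with seed 12345 chosen for illustration:

```python
import numpy as np

# First three rows of the feature table
# (gender, age, salary, number of family members).
features = np.array([
    [1, 41, 49600, 1],
    [0, 46, 38000, 1],
    [0, 29, 21000, 0],
], dtype=float)

rng = np.random.default_rng(12345)
cipher_matrix = rng.random((4, 4))
# np.linalg.inv raises LinAlgError if the matrix is exactly singular.
key = np.linalg.inv(cipher_matrix)

encrypted = features @ cipher_matrix
# Multiplying by the key undoes the encryption, up to rounding errors.
decrypted = encrypted @ key

print(np.allclose(decrypted, features))
```

Because the result is only exact up to rounding, integer columns may need to be rounded after decryption.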

## Checking the quality of linear regression. Studying the R2 score

In [12]:
# We have the initial data, already split
# into the target and the remaining features.
# Encrypt the features.
encrypted_features, key = cypher(features, 12345)
# Let's see what the encrypted data looks like.
display(encrypted_features)
# Let's see what the encryption key looks like.
display(key)

Random number generator seed used 12345
Encryption key
[[-1.97240014  1.76004024 -0.08309671  1.22285233]
 [ 0.14111106  0.32873452  1.02824721 -1.27752175]
 [ 0.8908452   0.90302415 -0.59501472 -0.23290483]
 [ 1.02530945 -1.81039816  0.24787878  0.46192295]]


Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи
0,33385.629848,46727.480145,12338.757310,47073.723967
1,25583.387949,35803.914219,9461.301198,36066.960022
2,14139.219101,19787.511775,5230.510961,19931.919480
3,28063.474811,39280.360370,10365.294463,39574.038409
4,17570.111152,24590.690332,6496.763162,24771.702875
...,...,...,...,...
4995,24029.676314,33631.872876,8880.008337,33882.058637
4996,35266.381669,49361.881712,13028.859784,49728.607798
4997,22815.586558,31933.965207,8428.379580,32172.578691
4998,22009.956098,30804.880053,8132.920545,31035.857511


array([[-1.97240014,  1.76004024, -0.08309671,  1.22285233],
       [ 0.14111106,  0.32873452,  1.02824721, -1.27752175],
       [ 0.8908452 ,  0.90302415, -0.59501472, -0.23290483],
       [ 1.02530945, -1.81039816,  0.24787878,  0.46192295]])

In [13]:
# And now let's compare the quality of the models 
# before and after the transformation.
# Let's split the samples into parts.
(features_train_enc, target_train_enc, 
 features_valid_enc, target_valid_enc, 
 features_test_enc, target_test_enc) = splitting(
    encrypted_features, 
    target
)

Note: the listed samples are divided in pairs into three groups in the following order: training, validation and test.
The first in the pair is the set of features, the second is the set of target features. 
The table size is: (3000, 4)
The table size is: (3000,)
The table size is: (1000, 4)
The table size is: (1000,)
The table size is: (1000, 4)
The table size is: (1000,)


In [14]:
# Train linear regression and measure its quality.
model = LinearRegression()
model.fit(features_train_enc, target_train_enc)
predictions_valid_enc = pd.Series(
    data=model.predict(features_valid_enc), 
    index=target_valid_enc.index
)
# Now let's measure the quality of the model on the validation set,
# using R2 and MSE scores.
r2_valid_enc = r2_score(target_valid_enc, predictions_valid_enc)
mse_valid_enc = mean_squared_error(
    target_valid_enc, 
    predictions_valid_enc
)
print('R2 score on the validation encrypted sample is', 
      r2_valid_enc
     )
print('MSE score on the validation encrypted sample is', 
      mse_valid_enc
     )
print()
# Measure the quality on the test sample.
predictions_test_enc = pd.Series(
    data=model.predict(features_test_enc),
    index=target_test_enc.index
)
r2_test_enc = r2_score(target_test_enc, predictions_test_enc)
mse_test_enc = mean_squared_error(target_test_enc, predictions_test_enc)
# Now let's measure the quality of the model on the test set,
# using R2 and MSE metrics.
print('R2 score on the test encrypted sample is', 
      r2_test_enc
     )
print('MSE score on the test encrypted sample is', 
      mse_test_enc
     )

R2 score on the validation encrypted sample is 0.41199362877251355
MSE score on the validation encrypted sample is 0.11001599205666272

R2 score on the test encrypted sample is 0.418120987062058
MSE score on the test encrypted sample is 0.12135145002722073


In [15]:
# The results of the comparison will be presented in a table.
results = pd.DataFrame(
    {
        'R2': [r2_test, r2_test_enc],
        'MSE': [mse_test, mse_test_enc]
    },index=(
        ['Initial feature set', 'Encrypted feature set']
    )
)
display(results)

Unnamed: 0,R2,MSE
Initial feature set,0.418121,0.121351
Encrypted feature set,0.418121,0.121351
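
The full-precision scores printed above differ in their last digits, which is expected floating-point noise rather than a real change in quality. A hedged way to compare such values is with a relative tolerance (the tolerance `1e-9` is an arbitrary choice for illustration):

```python
import math

# Test-set scores copied from the outputs above.
r2_test, r2_test_enc = 0.41812098706230283, 0.418120987062058
mse_test, mse_test_enc = 0.12135145002716968, 0.12135145002722073

# Exact equality fails because of rounding in the matrix arithmetic...
print(r2_test == r2_test_enc)
# ...but the values agree within a small relative tolerance.
print(math.isclose(r2_test, r2_test_enc, rel_tol=1e-9))
print(math.isclose(mse_test, mse_test_enc, rel_tol=1e-9))
```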


### Conclusion

We tested our algorithm and confirmed that it works. We also compared the quality of the linear regression models before and after feature encryption: the R2 and MSE scores are unchanged up to floating-point rounding. The goal of the project is achieved.