# Personal Data Security
(linear algebra)

### Content

1. [Introduction](#intro)
2. [General information](#general)
3. [Multiplying Matrices](#xmatrix)
4. [Algorithm](#algorithm)
5. [Checking Algorithm](#checking)
6. [Conclusion](#conclusion)


## Introduction <a href='intro'></a>
An insurance company wants to protect its clients' data. Your task is to develop a data-transformation algorithm that makes it hard to recover personal information from the transformed data. This is called data masking, or data obfuscation. You are also expected to prove that the algorithm works correctly. Additionally, the data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model. Follow these steps to develop the algorithm:

1. construct a theoretical proof using the properties of the model and the given task;
2. formulate an algorithm based on this proof;
3. check that the algorithm works correctly when applied to real data.

- Features: insured person's gender (`Пол`), age (`Возраст`), salary (`Зарплата`), and number of family members (`Члены семьи`).
- Target: number of insurance benefits received by the insured person over the last five years (`Страховые выплаты`).

*Libraries*

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression


from sklearn.preprocessing import StandardScaler

from sklearn.metrics import r2_score

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

## General Information <a href='general'></a>

In [2]:
try:
    insurance = pd.read_csv('insurance.csv')
except FileNotFoundError:
    # Fall back to the alternative dataset location
    insurance = pd.read_csv('/datasets/insurance.csv')

In [3]:
insurance.sample(15)

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
4350,0,41.0,29200.0,2,0
2415,1,30.0,41600.0,1,0
43,0,20.0,33100.0,1,0
352,0,31.0,32900.0,1,0
1192,1,25.0,41600.0,0,0
4506,0,30.0,34700.0,1,0
4776,0,45.0,35200.0,1,1
1208,0,26.0,32100.0,1,0
2240,0,60.0,27900.0,0,4
452,0,34.0,32800.0,0,0


In [4]:
insurance.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Пол                  5000 non-null int64
Возраст              5000 non-null float64
Зарплата             5000 non-null float64
Члены семьи          5000 non-null int64
Страховые выплаты    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [5]:
insurance.isna().sum()

Пол                  0
Возраст              0
Зарплата             0
Члены семьи          0
Страховые выплаты    0
dtype: int64

In [6]:
insurance.duplicated().sum()

153

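#### Intermediate conclusion

We have data with 5 columns and 5000 entries, plus a leading column in the sample output that looks like an id.

1. The "Возраст" column should be converted from float to int.
2. There is no missing data.
3. There are 153 duplicate rows. They do not need to be dropped in this project.

*Note:*

The leading column in the sample output is not part of the dataset: `insurance.info()` lists only 5 columns, so it is the DataFrame row index (`RangeIndex`, 0 to 4999).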

In [7]:
insurance['Возраст'] = insurance['Возраст'].astype('int64')

### Conclusion

We checked the dataset:
- converted the data type of "Возраст" to int;
- found no missing data;
- found duplicates, which are not important for this task and do not need to be dropped.

The dataset looks fine, so we can move on to the next step.


## Multiplying Matrices <a href='xmatrix'></a>

If we multiply the features by an invertible matrix, the quality of linear regression does not change: the matrix only recombines the original features linearly, and the regression weights adjust to compensate.


Notation:

$X$: feature matrix

$y$: target vector

$P$: matrix by which the features are multiplied

$w$: weight vector of linear regression (its zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula (the closed-form solution, assuming $X^T X$ is invertible):

$$
w = (X^T X)^{-1} X^T y
$$
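The training formula can be checked numerically. The sketch below is only an illustration on hypothetical toy data (the names `X_raw`, `true_w` and the noise level are assumptions, not part of the project): it computes the weights with the formula above and compares them with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical toy data: 50 observations, 3 features, known weights plus small noise
X_raw = rng.normal(size=(50, 3))
true_w = np.array([0.5, 2.0, -1.0, 3.0])   # zeroth element is the intercept
X = np.hstack([np.ones((50, 1)), X_raw])   # column of ones for the intercept
y = X @ true_w + rng.normal(scale=0.01, size=50)

# Training formula: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w, w_lstsq))
```

In practice a least-squares solver is usually preferred to an explicit inverse, because it is more stable when $X^T X$ is close to singular.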

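If we multiply the feature matrix by an invertible matrix $P$ and substitute $XP$ for $X$ in the training formula, the new weight vector is:

$$
w_{transformed} = ((XP)^T XP)^{-1} (XP)^T y
$$

Predictions on the transformed data are obtained by multiplying the transformed feature matrix by the new weight vector. Using $(AB)^T = B^T A^T$ and $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$ (valid because $P$, $P^T$ and $X^T X$ are square and invertible):

$$
\begin{aligned}
a_{transformed} &= XPw_{transformed} \\
&= XP((XP)^T XP)^{-1} (XP)^T y \\
&= XP(P^T X^T XP)^{-1} P^T X^T y \\
&= XP P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y \\
&= X (X^T X)^{-1} X^T y = Xw = a
\end{aligned}
$$

So the predictions do not change. The weight vector of linear regression for the transformed data is:

$$
w_{transformed} = P^{-1}w
$$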

Multiplying every feature vector by the same invertible matrix is a change of basis: the same points are described by new coordinates. The coordinate values change, but the points themselves and the linear relationships between them do not, which is why the predictions stay the same.
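The result can be verified numerically. The following is a minimal sketch on hypothetical random data (the helper `fit` and the data shapes are assumptions for illustration): it checks that the predictions coincide and that the new weights equal $P^{-1}w$.

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical data: 100 observations, 4 features
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# Random square matrix; redraw in the unlikely case it is (nearly) singular
P = rng.normal(size=(4, 4))
while abs(np.linalg.det(P)) < 1e-8:
    P = rng.normal(size=(4, 4))

def fit(features, target):
    # Closed-form linear regression weights: (X^T X)^{-1} X^T y
    return np.linalg.inv(features.T @ features) @ features.T @ target

w = fit(X, y)
w_transformed = fit(X @ P, y)

a = X @ w                              # predictions on the original features
a_transformed = X @ P @ w_transformed  # predictions on the transformed features

print(np.allclose(a, a_transformed))
print(np.allclose(w_transformed, np.linalg.inv(P) @ w))
```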

## Algorithm<a href='algorithm'></a>

1. Separate the features and the target
2. Split the dataset into train and test sets
3. Select the numeric features for scaling
4. Convert the train and test features to NumPy arrays
5. Generate a random square matrix whose size equals the number of features
6. Check the determinant of the matrix from step 5
7. If the determinant is not equal to 0:
    7.1. Multiply the train and test feature matrices by the transformation matrix
    7.2. Scale the transformed train and test matrices
    7.3. Fit the model on the train set and predict on the test set
8. If the determinant is equal to 0, repeat from step 5
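Steps 5 to 8 amount to drawing random matrices until one is invertible. One possible sketch is shown below; the function name `make_invertible_matrix` and the parameters `max_tries` and `seed` are hypothetical, and the invertibility check uses the matrix rank instead of the determinant, which is an equivalent condition for a square matrix.

```python
import numpy as np

def make_invertible_matrix(n, max_tries=100, seed=None):
    """Draw random n x n integer matrices until one is invertible."""
    rng = np.random.RandomState(seed)
    for _ in range(max_tries):
        P = rng.randint(100, size=(n, n))
        # A square matrix is invertible exactly when it has full rank
        if np.linalg.matrix_rank(P) == n:
            return P
    raise RuntimeError('no invertible matrix found in %d tries' % max_tries)

P = make_invertible_matrix(4, seed=0)
print(P.shape)
```

A random matrix is singular only rarely, so the loop normally finishes on the first attempt; the cap on attempts just prevents an endless loop.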

## Checking Algorithm<a href='checking'></a>


*1. Features and target*


In [8]:
features = insurance.drop('Страховые выплаты', axis = 1)
target = insurance['Страховые выплаты']

*2. Split dataset into train and test sets*

In [26]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.3, random_state=12345)


*3. Numeric features for scaling*


In [10]:
numeric = features.drop('Пол', axis=1).columns
numeric

Index(['Возраст', 'Зарплата', 'Члены семьи'], dtype='object')

In [11]:
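def linreg(X_train , X_test, y_train, y_test):
    # Scale the numeric columns, fit linear regression on the train set
    # and return the R2 score on the test set.
    # Note: the numeric columns of X_train and X_test are overwritten in place.

    scaler = StandardScaler()
    scaler.fit(X_train.loc[:,numeric])  # fit the scaler on the train set only
    X_train.loc[:,numeric] = scaler.transform(X_train.loc[:,numeric])
    X_test.loc[:,numeric] = scaler.transform(X_test.loc[:,numeric])

    model = LinearRegression()
    model.fit(X_train, y_train)
    predict = model.predict(X_test)
    return r2_score(y_test, predict)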

In [12]:
r2_score_before = linreg(features_train, features_test, target_train, target_test)
r2_score_before

0.4305278542485148

*4. Convert train and test features to NumPy arrays*


In [13]:
X_train, X_test = features_train.values, features_test.values
print(X_train)
print(X_test)



[[ 0.         -0.23862317 -1.98861705 -1.08239912]
 [ 0.          0.70548434 -0.26239699  0.75707499]
 [ 0.          0.35144402 -0.7671397   0.75707499]
 ...
 [ 1.          1.17753809  0.48462222 -0.16266207]
 [ 0.         -1.06471724  1.02974435  2.5965491 ]
 [ 0.         -1.41875756  0.09092291 -1.08239912]]
[[ 0.00000000e+00  2.33430582e-01 -9.07844658e-02  2.59654910e+00]
 [ 0.00000000e+00  2.23965904e+00  3.23104556e-01  7.57074988e-01]
 [ 1.00000000e+00  9.41511213e-01  2.22156014e-01 -1.08239912e+00]
 ...
 [ 1.00000000e+00 -5.92663489e-01  2.62535431e-01 -1.08239912e+00]
 [ 0.00000000e+00 -2.59629565e-03  2.82725139e-01  2.59654910e+00]
 [ 1.00000000e+00  9.41511213e-01 -1.41258737e-01 -1.62662068e-01]]


In [14]:

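def random_matrix(X_train, X_test, count = 100):
    # Multiply the features by a random invertible matrix and score the model.
    # count caps the number of attempts at drawing an invertible matrix
    # (an assumed use for this otherwise unused parameter).

    n = X_train.shape[1]
    rng = np.random.RandomState()
    for _ in range(count):
        # Draw a new candidate on every attempt; a singular one is discarded
        A = rng.randint(100, size=(n, n))
        if np.linalg.det(A) != 0:
            A_train = X_train @ A
            A_test = X_test @ A
            # Wrap the arrays in DataFrames so that linreg can select columns by name
            A_train = pd.DataFrame(A_train, index = features_train.index, columns = features_train.columns)
            A_test = pd.DataFrame(A_test, index = features_test.index, columns = features_test.columns)
            r2 = linreg(A_train, A_test, target_train, target_test)
            print('Transfer Matrix:')
            print(A)
            print('r2_score:', r2)
            return r2, A

    raise RuntimeError('No invertible matrix found in %d attempts' % count)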

In [15]:
r2_score_after = random_matrix(X_train, X_test)[0]

Transfer Matrix:
[[ 4 79 79 83]
 [81 55 90 75]
 [86 92 23 12]
 [67  2 53 72]]
r2_score: 0.4305278542485145


In [16]:
r2_score_before - r2_score_after

3.3306690738754696e-16
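A difference of this size is at the level of floating-point rounding. As a small illustrative check (the two scores are copied from the outputs above; the variable `difference` is introduced here only for the example), the values can be compared with a tolerance and expressed in units of machine epsilon:

```python
import numpy as np

# R2 scores copied from the outputs above
r2_score_before = 0.4305278542485148
r2_score_after = 0.4305278542485145

difference = abs(r2_score_before - r2_score_after)

# Tolerance-based comparison instead of exact equality
print(np.isclose(r2_score_before, r2_score_after))

# Size of the difference relative to double-precision machine epsilon
print(difference / np.finfo(float).eps)
```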

## Conclusion<a href='conclusion'></a>

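The R2 score on the original data (0.4305278542485148) and on the transformed data (0.4305278542485145) differs by about 3.3e-16, which is floating-point rounding error. This means that linear regression performs the same on the masked data as on the original data.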


