# Algorithm for protecting clients' personal data

The task is to create an algorithm that transforms customer tabular data without reducing the quality of a model trained on it.

In [1]:
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

In [2]:
data = pd.read_csv('/Users/alexey_zalesov/Desktop/ya_prakrikum/ds/datasets/insurance.csv')
display(data.head())



Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0


* Features: gender, age and salary of the insured person, and the number of family members.
* Target feature: the number of insurance payments to the client over the past 5 years.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [4]:
data.describe()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
count,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.499,30.9528,39916.36,1.1942,0.148
std,0.500049,8.440807,9900.083569,1.091387,0.463183
min,0.0,18.0,5300.0,0.0,0.0
25%,0.0,24.0,33300.0,0.0,0.0
50%,0.0,30.0,40200.0,1.0,0.0
75%,1.0,37.0,46600.0,2.0,0.0
max,1.0,65.0,79000.0,6.0,5.0


### Pre-analysis

The table has 5000 rows and 5 columns: 4 feature columns and 1 target column.
The next steps are to rename the columns, check for missing values and duplicates, and convert the "Age" and "Salary" columns from float to integer. The "Gender" column is already stored as an integer (int64), so it needs no additional encoding.

In [5]:
# The `inplace` argument of set_axis was removed in pandas 2.0, so reassign the result instead
data = data.set_axis(['gender', 'age', 'salary', 'family_members_count', 'insurance_case'], axis='columns')

display(data.head())





Unnamed: 0,gender,age,salary,family_members_count,insurance_case
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0


In [6]:
data['gender'].isna().sum()

0

In [7]:
data['age'].isna().sum()

0

In [8]:
data['salary'].isna().sum()

0

In [9]:
data['family_members_count'].isna().sum()

0

In [10]:
data['insurance_case'].isna().sum()

0
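
The five per-column checks above can also be written as a single call. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` is not part of the notebook, and the missing value in it is invented for illustration); `DataFrame.isna().sum()` returns the number of missing values for every column at once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the insurance data;
# the NaN in "age" is invented to show a non-zero count.
toy = pd.DataFrame({
    'gender': [1, 0, 0],
    'age': [41.0, np.nan, 29.0],
    'salary': [49600.0, 38000.0, 21000.0],
})

# One Series holding the number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applied to the notebook's `data`, the same call would replace cells 6 to 10.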

In [11]:
data.duplicated().sum()


153

There are 153 duplicate rows in total, which is about 3% of all rows. I propose removing them.

In [12]:
data_cleared = data.drop_duplicates().reset_index(drop=True)

In [13]:
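# Cast age and salary to integers.
# Note: this modifies `data`; `data_cleared`, created in the previous cell, keeps the float columns.
data['age'] = data['age'].astype('int')
data['salary'] = data['salary'].astype('int')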

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   gender                5000 non-null   int64
 1   age                   5000 non-null   int64
 2   salary                5000 non-null   int64
 3   family_members_count  5000 non-null   int64
 4   insurance_case        5000 non-null   int64
dtypes: int64(5)
memory usage: 195.4 KB


### Summary

The data has been loaded and examined, the columns have been renamed, duplicates have been removed (in `data_cleared`), and the "age" and "salary" columns have been cast to integers.

## Matrix multiplication

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target feature vector

- $P$: the invertible matrix by which the features are multiplied

- $w$: vector of linear regression weights (the zeroth element is the intercept)

Predictions (formula 1):

$$
a = Xw
$$

Training objective (formula 2):

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula (formula 3):

$$
w = (X^T X)^{-1} X^T y
$$

**Answer:**

b. The quality of linear regression will not change when the features are multiplied by an invertible matrix $P$.



**Rationale:**

We take formulas 1 to 3 above as already proven. Let $w'$ be the weights trained on the transformed features $XP$, and $a'$ the corresponding predictions. Applying formula 3 to $XP$:

$$
w' = ((XP)^T (XP))^{-1} (XP)^T y =
$$

$$
= (P^T X^T X P)^{-1} P^T X^T y =
$$

[because $(XP)^T = P^T X^T$ and matrix multiplication is associative]

$$
= P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y =
$$

[because $P$, $P^T$ and $X^T X$ are square and invertible, so $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$]

$$
= P^{-1} (X^T X)^{-1} E X^T y =
$$

[where $E$ is the identity matrix]

$$
= P^{-1} (X^T X)^{-1} X^T y =
$$

[by formula 3:]

$$
= P^{-1} w
$$

$$
\Rightarrow w' = P^{-1} w
$$

Substituting this into formula 1 for the transformed features:

$$
a' = XP w' = XP P^{-1} w = XEw = Xw = a
$$

Q.E.D.
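
The identity can also be checked numerically. The following is a minimal sketch on synthetic data, not on the insurance dataset: the seed, the shapes and the helper name `normal_equation` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed chosen for this illustration

# Synthetic data: 100 objects, a column of ones plus 3 random features.
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
y = rng.normal(size=100)

# Random square matrix; a Gaussian matrix is invertible with probability 1.
P = rng.normal(size=(4, 4))


def normal_equation(features, target):
    """Formula 3: w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(features.T @ features) @ features.T @ target


w = normal_equation(X, y)          # weights on the original features
w_new = normal_equation(X @ P, y)  # weights on the transformed features

# w' should equal P^{-1} w, and the predictions should coincide.
weights_match = np.allclose(w_new, np.linalg.inv(P) @ w)
predictions_match = np.allclose(X @ P @ w_new, X @ w)
print(weights_match, predictions_match)
```

The comparison uses `np.allclose` rather than exact equality because the two computations accumulate different floating-point rounding errors.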

## Conversion algorithm

**Algorithm**

##### Let's write a function that:
1. Fixes the random seed so that the result is reproducible
2. Generates a random square matrix with numpy.random.normal()
3. Checks that the matrix is invertible, that is, that its inverse exists
4. If the inverse does not exist (which is very unlikely), returns to step 2
5. Multiplies the original feature matrix by the generated invertible matrix
6. Returns the new, transformed matrix
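
The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's own implementation: the helper names `make_invertible_matrix` and `transform_features` are hypothetical, invertibility is tested through the matrix rank, and the three-row feature array is a hand-made sample.

```python
import numpy as np


def make_invertible_matrix(size, rng):
    """Draw random square matrices until one has full rank (steps 2 to 4)."""
    while True:
        candidate = rng.normal(size=(size, size))
        # A square matrix is invertible exactly when it has full rank.
        if np.linalg.matrix_rank(candidate) == size:
            return candidate


def transform_features(features, matrix):
    """Multiply the feature matrix by the invertible matrix (steps 5 and 6)."""
    return features @ matrix


rng = np.random.default_rng(12345)  # step 1: fixed seed for reproducibility

# Hand-made sample: gender, age, salary, family members.
features = np.array([[1.0, 41.0, 49600.0, 1.0],
                     [0.0, 46.0, 38000.0, 1.0],
                     [0.0, 29.0, 21000.0, 0.0]])

key = make_invertible_matrix(features.shape[1], rng)
masked = transform_features(features, key)

# Because the matrix is invertible, the original features can be recovered.
restored = masked @ np.linalg.inv(key)
print(np.allclose(restored, features))
```

Whoever holds the matrix can restore the original values, which is why the matrix itself has to be kept secret.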

## Check

#### First, let's train the model on the unchanged feature matrix

In [15]:
# Split the data for training.
# We do not need to choose the best model or tune hyperparameters; we only need to check that
# prediction quality does not degrade, so we split the data 0.75 for training and 0.25 for testing.

features = data_cleared.drop('insurance_case', axis=1)
target = data_cleared['insurance_case']

features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25,
                                                                           random_state=12345)


In [16]:
# Train the model and compute the R2 metric

model = LinearRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_test)

score = r2_score(target_test, predictions)

print('R2 of the model on the unchanged data:', score)

R2 of the model on the unchanged data: 0.42307727492147573


In [17]:
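# Implement the algorithm itself:


def change_data(data):
    """Multiply the feature matrix by a random invertible matrix."""
    length = data.shape[1]
    np.random.seed(12345)  # fix the seed for reproducibility
    while True:
        multiplicator = np.random.normal(size=(length, length))
        try:
            # inv raises LinAlgError if the matrix is singular
            np.linalg.inv(multiplicator)
            break
        except np.linalg.LinAlgError:
            # very unlikely: generate a new matrix and check again
            continue
    print(multiplicator)

    return pd.DataFrame(np.dot(data, multiplicator), index=data.index, columns=data.columns)
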

Let's change the original feature matrix:

In [18]:
features_changed = change_data(features)

features_train_changed, features_test_changed, target_train, target_test = train_test_split(
    features_changed, target, test_size=0.25, random_state=12345)


[[-0.20470766  0.47894334 -0.51943872 -0.5557303 ]
 [ 1.96578057  1.39340583  0.09290788  0.28174615]
 [ 0.76902257  1.24643474  1.00718936 -1.29622111]
 [ 0.27499163  0.22891288  1.35291684  0.88642934]]


In [19]:
display(features_changed.head())



Unnamed: 0,gender,age,salary,family_members_count
0,38224.186641,61881.00042,49961.234837,-64280.684721
1,29313.558467,47428.845564,38278.822267,-49242.555394
2,16206.481556,26215.538233,21153.670838,-27212.472653
3,32110.072445,52006.047856,42004.45311,-54044.730722
4,20126.326163,32571.440926,26289.724215,-33824.037786


In [20]:
model_new = LinearRegression()
model_new.fit(features_train_changed, target_train)


LinearRegression()

In [21]:
predictions_new = model_new.predict(features_test_changed)

In [22]:
score_new = r2_score(target_test , predictions_new)
print(score_new)

0.4230772749212812


In [23]:
print('The R2 metric changed by:', (score/score_new - 1)*100, 'percent')

The R2 metric changed by: 4.5985437679973984e-11 percent
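
One detail is worth spelling out: the proof assumes that $X$ contains a column of ones, whereas `change_data` multiplies only the feature columns and `LinearRegression` fits the intercept separately. The result still holds, because appending a column of ones to $XP$ is the same as multiplying $[1, X]$ by a block-diagonal matrix with blocks $1$ and $P$, which is also invertible. The sketch below checks this on synthetic data with NumPy only; the helper names, the seed and the data are assumptions for illustration, and R2 is computed on the training sample rather than on a test split.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed chosen for this illustration

# Synthetic features WITHOUT a column of ones, and a noisy linear target.
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + 3.0 + rng.normal(size=200)
P = rng.normal(size=(4, 4))  # random matrix, invertible with probability 1


def fit_predict_with_intercept(features, target):
    """Least squares with an intercept: prepend ones, fit, predict."""
    design = np.hstack([np.ones((features.shape[0], 1)), features])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ weights


def r2(target, predictions):
    """Coefficient of determination."""
    ss_res = np.sum((target - predictions) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - ss_res / ss_tot


r2_original = r2(y, fit_predict_with_intercept(X, y))
r2_transformed = r2(y, fit_predict_with_intercept(X @ P, y))
print(np.isclose(r2_original, r2_transformed))
```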


### Conclusion

The R2 value on the transformed data differs from the value on the original data by only about $4.6 \cdot 10^{-11}$ percent, which is consistent with floating-point rounding error.
I consider the algorithm to be working and the statement proven.

