## Data Anonymization and Pseudonymization with Python

www.tarikbahar.com/blog/python-ile-veri-anonimlestirme-ve-takma-ad-verme/

You can use https://www.generatedata.com/ to generate sample personal data.

Note: Trying the methods below one after another changes the structure of `df`, so you are likely to run into errors. For that reason, I recommend re-running the cell below to reload the data before trying each new method.


### De-identification vs. Anonymization

In [1]:
import pandas as pd

In [35]:
data = {'name':['Darryl Michael', 'Lillith Guerra', 'Blaine Pate', 'Sarah Haley','Hanna Peterson'],
        'email':['dignissim.magna@gmail.com', 'lilli.guerra@lorem.org', 'pate.pate3@lorem.org', 'hanna.peter@lorem.org','Fusce.dolora@lorem.org'],
        'age':[31, 18, 23, 55, 47],
        'salary':[2800, 7700, 3500, 10400, 5050],
        'country':['United States', 'Germany', 'Somalia', 'Turkey', 'China'],
        'creditcard':['5189421046624896','5581114287768274','5267488300623067','5208235158315490','5279990694405411']} 

df = pd.DataFrame(data)

df

Unnamed: 0,name,email,age,salary,country,creditcard
0,Darryl Michael,dignissim.magna@gmail.com,31,2800,United States,5189421046624896
1,Lillith Guerra,lilli.guerra@lorem.org,18,7700,Germany,5581114287768274
2,Blaine Pate,pate.pate3@lorem.org,23,3500,Somalia,5267488300623067
3,Sarah Haley,hanna.peter@lorem.org,55,10400,Turkey,5208235158315490
4,Hanna Peterson,Fusce.dolora@lorem.org,47,5050,China,5279990694405411


#### Removing

In [3]:
# drop() returns a new DataFrame; df itself stays unchanged unless it is reassigned.
df.drop(columns=['name', 'email'])

Unnamed: 0,age,salary,country,creditcard
0,31,2800,United States,5189421046624896
1,18,7700,Germany,5581114287768274
2,23,3500,Somalia,5267488300623067
3,55,10400,Turkey,5208235158315490
4,47,5050,China,5279990694405411


#### Masking or Suppression


In [4]:
def creditcard_maskeleme(creditCardNumber):
    # Hide everything except the last four digits.
    return "************" + creditCardNumber[-4:]

def email_maskeleme(emailAddress):
    # Replace the local part with asterisks and keep the domain.
    return "**********@" + emailAddress.split('@')[1]


df.creditcard = df.creditcard.map(creditcard_maskeleme)
df.email = df.email.map(email_maskeleme)
df

Unnamed: 0,name,email,age,salary,country,creditcard
0,Darryl Michael,**********@gmail.com,31,2800,United States,************4896
1,Lillith Guerra,**********@lorem.org,18,7700,Germany,************8274
2,Blaine Pate,**********@lorem.org,23,3500,Somalia,************3067
3,Sarah Haley,**********@lorem.org,55,10400,Turkey,************5490
4,Hanna Peterson,**********@lorem.org,47,5050,China,************5411
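
The two helpers above assume a 16-digit card number and always emit a fixed run of asterisks. As a rough sketch of a more general variant, the hypothetical functions `mask_keep_last` and `mask_email_local` below (names and parameters are illustrative, not part of the original notebook) scale the mask to the input length and leave the email domain intact:

```python
def mask_keep_last(value, visible=4, mask_char="*"):
    """Mask every character except the last `visible` ones."""
    value = str(value)
    if visible <= 0:
        return mask_char * len(value)
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


def mask_email_local(email, mask_char="*"):
    """Mask the local part of an email, keeping its first character and the domain."""
    local, sep, domain = email.partition("@")
    if not sep:
        # Not a well-formed address: mask everything rather than leak it.
        return mask_char * len(email)
    return local[:1] + mask_char * (len(local) - 1) + "@" + domain


print(mask_keep_last("5189421046624896"))        # ************4896
print(mask_email_local("lilli.guerra@lorem.org"))  # l***********@lorem.org
```

Both functions can be passed to `map` in the same way as the originals, for example `df.creditcard.map(mask_keep_last)`. Keeping the length and the first character leaks slightly more than the fixed-width mask above, so this is a trade-off between readability and privacy.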


#### Generalization


In [5]:
avg = df.salary.mean()  # 5890.0 for the sample data

def salary_generalization(salary):
    # Collapse the exact salary into one of two coarse categories.
    if salary < avg:
        return 'Below Average'
    else:
        return 'Above Average'

df.salary = df.salary.map(salary_generalization)
df

Unnamed: 0,name,email,age,salary,country,creditcard
0,Darryl Michael,**********@gmail.com,31,Below Average,United States,************4896
1,Lillith Guerra,**********@lorem.org,18,Above Average,Germany,************8274
2,Blaine Pate,**********@lorem.org,23,Below Average,Somalia,************3067
3,Sarah Haley,**********@lorem.org,55,Above Average,Turkey,************5490
4,Hanna Peterson,**********@lorem.org,47,Below Average,China,************5411
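
Splitting on the mean yields only two buckets. Another common form of generalization replaces an exact value with the range it falls into. A minimal sketch for the `age` column, assuming a hypothetical helper `age_generalization` and a 10-year bin width chosen purely for illustration:

```python
def age_generalization(age, width=10):
    """Replace an exact age with the range it falls into, e.g. 31 -> '30-39'."""
    if width <= 0:
        raise ValueError("width must be a positive integer")
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


ages = [31, 18, 23, 55, 47]  # the ages from the sample data above
print([age_generalization(a) for a in ages])
# ['30-39', '10-19', '20-29', '50-59', '40-49']
```

Applied to the DataFrame, this would look like `df.age = df.age.map(age_generalization)`. Wider bins hide more but also make the column less useful for analysis.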


#### Scrambling

In [24]:
import random

def scrambling(creditCardNumber):
    # Shuffle the digits: same characters, random order.
    return ''.join(random.sample(creditCardNumber, len(creditCardNumber)))

df.creditcard = df.creditcard.map(scrambling)
df

Unnamed: 0,name,email,age,salary,country,creditcard
0,Darryl Michael,dignissim.magna@gmail.com,31,2800,United States,9061451926468248
1,Lillith Guerra,lilli.guerra@lorem.org,18,7700,Germany,4672588417157812
2,Blaine Pate,pate.pate3@lorem.org,23,3500,Somalia,3734656876000282
3,Sarah Haley,hanna.peter@lorem.org,55,10400,Turkey,3520908245135158
4,Hanna Peterson,Fusce.dolora@lorem.org,47,5050,China,9509194210475496
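
Scrambling keeps exactly the same characters and only changes their order, so the digit counts of the original number survive. The sketch below checks that property. As an assumption of mine rather than something the notebook does, it draws randomness from `random.SystemRandom` (operating-system entropy) instead of the default seedable generator; the helper name `scrambling_secure` is hypothetical.

```python
import random
from collections import Counter

_rng = random.SystemRandom()  # backed by os.urandom; cannot be seeded or replayed


def scrambling_secure(value):
    """Return the characters of `value` in a random order."""
    chars = list(value)
    _rng.shuffle(chars)
    return "".join(chars)


original = "5189421046624896"
scrambled = scrambling_secure(original)

# The multiset of digits is preserved; only their order changes.
print(Counter(scrambled) == Counter(original))  # True
```

Because the digit counts leak through, scrambling on its own is a weak protection for short or low-entropy values.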


### Pseudonymization

In [25]:
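from faker import Faker

def email_pseudonymization(email):
    # Reuse the stored pseudonym so the same email always maps to the same value.
    if email in key:
        return key[email]
    pseudonym = fake.email()
    # Regenerate if the fake address collides with a pseudonym or a real email.
    while (pseudonym in key.values()) or (pseudonym in key):
        pseudonym = fake.email()
    key[email] = pseudonym
    return pseudonym

key = {}  # original email -> pseudonym
fake = Faker()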
df.email = df.email.map(email_pseudonymization)
df

Unnamed: 0,name,email,age,salary,country,creditcard
0,Darryl Michael,kguzman@marshall.com,31,2800,United States,9061451926468248
1,Lillith Guerra,burnsdavid@wallace-rice.com,18,7700,Germany,4672588417157812
2,Blaine Pate,popelindsey@chavez-diaz.com,23,3500,Somalia,3734656876000282
3,Sarah Haley,sarajohnson@mckenzie.info,55,10400,Turkey,3520908245135158
4,Hanna Peterson,brycewalsh@gmail.com,47,5050,China,9509194210475496


In [26]:
key

{'dignissim.magna@gmail.com': 'kguzman@marshall.com',
 'lilli.guerra@lorem.org': 'burnsdavid@wallace-rice.com',
 'pate.pate3@lorem.org': 'popelindsey@chavez-diaz.com',
 'hanna.peter@lorem.org': 'sarajohnson@mckenzie.info',
 'Fusce.dolora@lorem.org': 'brycewalsh@gmail.com'}
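
What separates pseudonymization from anonymization is that whoever holds `key` can reverse the mapping. A small sketch of that re-identification step, reusing two entries from the output above; the names `reverse_key` and `reidentify` are hypothetical:

```python
# Two entries copied from the `key` dictionary printed above.
key = {
    "dignissim.magna@gmail.com": "kguzman@marshall.com",
    "lilli.guerra@lorem.org": "burnsdavid@wallace-rice.com",
}

# Invert the mapping: pseudonym -> original value.
reverse_key = {pseudonym: original for original, pseudonym in key.items()}


def reidentify(pseudonym):
    """Look up the original email for a pseudonym; None if it is unknown."""
    return reverse_key.get(pseudonym)


print(reidentify("kguzman@marshall.com"))  # dignissim.magna@gmail.com
print(reidentify("unknown@example.com"))   # None
```

This is why the key should be stored separately from the pseudonymized data and access to it restricted.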

#### Tokenization

In [30]:
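import uuid

def email_tokenization(email):
    # Issue one random token per distinct email and remember it in `key`.
    if email not in key:
        token = uuid.uuid4()
        # Regenerate in the (extremely unlikely) event of a collision.
        while token in key.values():
            token = uuid.uuid4()
        key[email] = token
    return key[email]

key = {}  # original email -> token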
df.email = df.email.map(email_tokenization)
df

Unnamed: 0,name,email,age,salary,country,creditcard
0,Darryl Michael,7a739974-2656-495c-b282-508c5b715ea6,31,2800,United States,5189421046624896
1,Lillith Guerra,281bb8da-c693-45fa-8441-20d5ed0168ae,18,7700,Germany,5581114287768274
2,Blaine Pate,2df11300-21bb-4b3d-9a12-f736b83bc41b,23,3500,Somalia,5267488300623067
3,Sarah Haley,8ecddaa3-c8da-4f7b-8263-6bbb6e07de2a,55,10400,Turkey,5208235158315490
4,Hanna Peterson,aa74db77-f228-432d-b978-018fb6a9442b,47,5050,China,5279990694405411


In [31]:
key

{'dignissim.magna@gmail.com': UUID('7a739974-2656-495c-b282-508c5b715ea6'),
 'lilli.guerra@lorem.org': UUID('281bb8da-c693-45fa-8441-20d5ed0168ae'),
 'pate.pate3@lorem.org': UUID('2df11300-21bb-4b3d-9a12-f736b83bc41b'),
 'hanna.peter@lorem.org': UUID('8ecddaa3-c8da-4f7b-8263-6bbb6e07de2a'),
 'Fusce.dolora@lorem.org': UUID('aa74db77-f228-432d-b978-018fb6a9442b')}
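
The tokens above are `uuid.UUID` objects, which the standard `json` module cannot serialize without a custom encoder. If the mapping needs to be stored, one option is to keep the tokens as strings. A minimal sketch, assuming a hypothetical `email_tokenization_str` helper and a dictionary named `token_key`; the collision check from the cell above is left out for brevity:

```python
import json
import uuid

token_key = {}  # original email -> token string


def email_tokenization_str(email):
    """Return a stable random token (as a string) for each distinct email."""
    if email not in token_key:
        token_key[email] = str(uuid.uuid4())
    return token_key[email]


first = email_tokenization_str("lilli.guerra@lorem.org")
again = email_tokenization_str("lilli.guerra@lorem.org")
print(first == again)  # True: the same input always maps to the same token

# The mapping now round-trips through JSON without a custom encoder.
restored = json.loads(json.dumps(token_key))
print(restored == token_key)  # True
```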

#### Hashing

In [36]:
import hashlib

def email_hashing(email):
    # Cache digests so each email is hashed only once.
    if email not in key:
        sha3 = hashlib.sha3_512()
        # Prefix the email with a secret keyword before hashing.
        data = keyword + email
        sha3.update(data.encode('utf-8'))
        key[email] = sha3.hexdigest()
    return key[email]

keyword = 'parola123'  # secret prefix; keep it out of shared code
key = {}               # original email -> hex digest
df.email = df.email.map(email_hashing)
df

Unnamed: 0,name,email,age,salary,country,creditcard
0,Darryl Michael,d951c332528294146ceabd7ad47be92ebbe5efe7be9e7e...,31,2800,United States,5189421046624896
1,Lillith Guerra,1fde1395836a03d5edbd20733fb25c1d2e6b4dd70509c7...,18,7700,Germany,5581114287768274
2,Blaine Pate,f110e286a31f9ef6e2e477f9239ef224cf38b46c2cb123...,23,3500,Somalia,5267488300623067
3,Sarah Haley,71dc21bc949958d0660183542573232aac5e91ffacfbbf...,55,10400,Turkey,5208235158315490
4,Hanna Peterson,fcbd8a72d7cedf243d9f2567bc043f96afa365161ec37e...,47,5050,China,5279990694405411


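Note: the execution counter shows that the cell below (`In [34]`) ran before the hashing cell above (`In [36]`), so these digests presumably come from an earlier run and do not match the truncated values in the table.

In [34]: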
key

{'dignissim.magna@gmail.com': '4e8a2b42c57a0e89eb6be4440dbff4fc58e33d847c1606ea78e894f73d01cd6db1e4601f4a42b3bcda4674754a59c3a50d6bbd702c5d0dfecb2d790cc5197842',
 'lilli.guerra@lorem.org': '64f31e025d3cf79d7ef18a4f0085f9b16154837160691093ad18e24f108bb7f0c519fc46dac91aa9aa253dd99fb671cb9595cc3df9eead31bb6ee971dd7e23d2',
 'pate.pate3@lorem.org': '85953434fa5d896f77af333f6b7651539d6a390298c6e98ccbe13836cbb03ec8fbf4c08a400fc6e6f5b7f725fbb5e3e79a216a680fba601a316826e6cbd1c7c2',
 'hanna.peter@lorem.org': 'd6ba5bf9564fc6166d6b6444573d4622ab2eaaf0764b3b82e6a81b671c263b9e6665c5dde3aedab5a5cdf06ccb49f8ad927a3ca63ed106178bf06137e583af04',
 'Fusce.dolora@lorem.org': '52b1feea7e2731ff098e3e64b83b1afa0ba233afb254b292ab5f6701f8ad61560a780730e96cfd53c74f5e53768d77ab35cadde3b8d2d7f69f84de6bf2e23bf8'}
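
Prepending a secret `keyword` to the input is a simple way to key a hash. The standard library also provides `hmac`, which is designed for keyed hashing. The sketch below is an alternative to the notebook's approach, not a reproduction of it; `email_hmac` and `secret` are hypothetical names, and the secret is hard-coded only to keep the example short.

```python
import hashlib
import hmac

secret = b"parola123"  # illustration only; load a real secret from a secure store


def email_hmac(email, secret_key=secret):
    """Return the HMAC-SHA3-512 hex digest of an email under `secret_key`."""
    return hmac.new(secret_key, email.encode("utf-8"), hashlib.sha3_512).hexdigest()


digest = email_hmac("lilli.guerra@lorem.org")
print(len(digest))                                     # 128 hex characters
print(digest == email_hmac("lilli.guerra@lorem.org"))  # True: deterministic
print(digest == email_hmac("lilli.guerra@lorem.org", b"other"))  # False: key-dependent
```

Because the digest is deterministic for a given secret, the same email always maps to the same value without a lookup dictionary. A dictionary is only needed if the original values must be recoverable later.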