First, let's load the data and look at what type of data we have, so we can choose the attributes for our quasi-identifier:

In [None]:
import numpy as np # linear algebra
import pandas as pd 
import sys

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

credit_file_name = "../input/credit-card-customers/BankChurners.csv"
credit_data = pd.read_csv(credit_file_name) # Read the csv file

print(credit_data.head()) # Print the first 5 rows of credit_data
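
Besides `head()`, it can help to see each column's data type and how many distinct values it holds before picking quasi-identifier attributes. The sketch below is only an illustration: `toy_data` is a made-up stand-in for the real file (only the column names come from the dataset), and `overview` is a name introduced here.

```python
import pandas as pd

# Hypothetical toy rows standing in for BankChurners.csv; only the column
# names are taken from the real dataset.
toy_data = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 45],
    "Gender": ["M", "F", "M", "M"],
    "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0],
})

# One row per column: its data type and how many distinct values it holds.
overview = pd.DataFrame({
    "dtype": toy_data.dtypes.astype(str),
    "distinct_values": toy_data.nunique(),
})
print(overview)
```

Columns with few distinct values that describe a person (age, gender, and so on) are the usual candidates for a quasi-identifier.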

In [None]:
quasi_identifier = ["Customer_Age", "Gender", "Dependent_count", "Education_Level", 
                    "Marital_Status", "Income_Category"]

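Now that we've chosen our quasi-identifier, we just need to loop through each attribute and store how many times each value occurs for that attribute. We will treat the lowest count as the k level for k-anonymity. Strictly speaking, k-anonymity is defined over the combination of all quasi-identifier values in a row, so this per-attribute count is a simplification that gives an upper bound on the true k.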

In [None]:
"""
The following dictionary will store each attribute, and a corresponding dictionary of
{value : amount of times this value exists} for that specific attribute. For example, it 
might look like this {'Education_Level' : {High School : 100, Graduate : 200, etc.}, etc.}
"""
attribute_values = {}

for attribute in quasi_identifier: # Loop through all the attributes in the quasi_identifier
    
    current_column = credit_data[attribute] # Get all values associated with an attribute
    found_values = {} # Dictionary to store {value : amount found for this value} pairs
    
    attribute_values[attribute] = found_values # Store the found_values dictionary at the current attribute
    
    for value in current_column: # Loop through the column
        
        if value in found_values: # If we've seen this value before
            found_values[value] = found_values[value] + 1 # Add one to amount found
        else: # Otherwise
            found_values[value] = 1 # Add the value to the dictionary and set amount found to 1

print(attribute_values)
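
The counting loop above can also be written with pandas' built-in `value_counts()`. This is a minimal sketch on made-up rows, not a replacement for the code above: `count_values`, `toy_data` and `toy_counts` are names introduced here for illustration.

```python
import pandas as pd

def count_values(data, columns):
    """Return {column: {value: count}} for each requested column."""
    # value_counts() skips missing values (NaN) by default.
    return {column: data[column].value_counts().to_dict() for column in columns}

# Hypothetical toy rows; only the column names come from the real dataset.
toy_data = pd.DataFrame({
    "Gender": ["M", "F", "M", "M"],
    "Marital_Status": ["Married", "Single", "Married", "Unknown"],
})

toy_counts = count_values(toy_data, ["Gender", "Marital_Status"])
print(toy_counts)
```

The result has the same {attribute : {value : count}} shape as `attribute_values`, so the rest of the script would work with either.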

So now we've counted how many times each value occurs for each attribute. However, the output is not very easy to read, and we haven't found the minimum count yet. To find it, we just need to loop over every count and keep track of the smallest one we see. That value will be the k-value for our quasi-identifier. The following code does this and prints the counts in a more readable format. Most of it is formatting, so it isn't too important to understand.

Note - since dictionaries in Python keep insertion order, the values are printed in the order they first appear in the data rather than sorted. This can be fixed, but I have not done this yet.

In [None]:
min_amount_for_each_attribute = [] # Name should be self explanatory

for attribute in attribute_values:
    found_values = attribute_values[attribute] # Pull out the found_values dictionary
    
    print(attribute)
    
    min_amount = sys.maxsize # Make sure no number is smaller than this number
    
    for value in found_values:
        print(value, "->", found_values[value])
        if found_values[value] < min_amount: # If the 'amount found' is less than min_amount
            min_amount = found_values[value] # Update min_amount to this 'amount found'
    
    min_values = [] # This is used if multiple values have the same 'amount found'
    
    for value in found_values:
        if found_values[value] == min_amount: # If 'amount found' equals the minimum 'amount found'
            min_values.append(value) # Add it to the min_values list
    
    print("The minimum value from", attribute, "is", min_amount) 
    print("The value(s) at this minimum was(were):", end = " ")
    for value in min_values:
        print(value, end = " ")
    print("")
    print("")
    
    min_amount_for_each_attribute.append(min_amount)

k_value = sys.maxsize # Now we're going to loop through each attribute to find the minimum amount
for amount in min_amount_for_each_attribute:
    if amount < k_value:
        k_value = amount

print("The k-anonymity value is", k_value)
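
The script above looks at one attribute at a time. k-anonymity is usually defined over whole combinations of quasi-identifier values, so here is a hedged sketch of that check using `groupby`. The function name `k_anonymity` and the toy rows are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def k_anonymity(data, quasi_identifier):
    """Size of the smallest group of rows sharing all quasi-identifier values."""
    # observed=True keeps only combinations that actually occur in the data.
    group_sizes = data.groupby(quasi_identifier, observed=True).size()
    return int(group_sizes.min())

# Hypothetical toy rows; only the column names come from the real dataset.
toy_data = pd.DataFrame({
    "Customer_Age": [45, 45, 45, 62, 62],
    "Gender": ["M", "M", "F", "F", "F"],
})
toy_quasi_identifier = ["Customer_Age", "Gender"]

# Combination-based k: the (45, "F") row is alone in its group.
toy_k = k_anonymity(toy_data, toy_quasi_identifier)

# Per-attribute minimum, as computed by the script above.
toy_per_attribute_min = min(
    int(toy_data[column].value_counts().min()) for column in toy_quasi_identifier
)

print("Combination-based k:", toy_k)
print("Per-attribute minimum:", toy_per_attribute_min)
```

On these toy rows the per-attribute minimum is 2 while the combination-based k is 1, which shows why the per-attribute number is only an upper bound.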
    

Well, now we have code to find the k-anonymity level! However, since the k-level is 1, an attacker who knew that a person was in this dataset and was 70 or 73 years old could identify that person. Our next step is obvious: what can we do to increase the anonymity level of this data?

Suppose we notice that as age increases in this dataset, the number of people of that age decreases. Therefore, older people are less anonymous in this dataset, according to k-anonymity. One approach to solving this issue is to group everyone above a certain age. However, we want the column to keep a single numeric data type, so we cannot use a string such as "60+" or "60 - 80". We need an actual integer value. Therefore, we will take the average of all ages 60 and above and replace each of those ages with the average (rounded to the nearest integer).

In [None]:
ages = credit_data["Customer_Age"]
age_sum = 0 # Average = sum / number of values
total_ages = 0
for age in ages:
    if age >= 60:
        age_sum = age_sum + age
        total_ages = total_ages + 1

average_age = int(round(age_sum / total_ages)) # Round to the nearest integer

older_rows = ages >= 60 # Boolean mask, computed before any value is overwritten

# Assign through .loc so the change is written to credit_data itself
# (chained indexing like credit_data["Customer_Age"][index] = ... may act on a copy)
credit_data.loc[older_rows, "Customer_Age"] = average_age

# Make sure all ages of 60 and above are now the average age
print(credit_data.loc[older_rows, "Customer_Age"].unique())
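
To see the age-grouping step in isolation, here is a small sketch on made-up ages. The helper name `top_code_ages` and the toy values are hypothetical. Unlike the cell above, it returns a modified copy and leaves its input untouched.

```python
import pandas as pd

def top_code_ages(data, column, threshold):
    """Return a copy of data where every value >= threshold is replaced by
    the rounded mean of those values."""
    result = data.copy()
    is_high = result[column] >= threshold
    if is_high.any(): # Nothing to replace if no value reaches the threshold
        replacement = int(round(result.loc[is_high, column].mean()))
        result.loc[is_high, column] = replacement
    return result

# Hypothetical toy ages, not taken from the real dataset.
toy_data = pd.DataFrame({"Customer_Age": [45, 59, 60, 65, 70, 73]})
toy_result = top_code_ages(toy_data, "Customer_Age", 60)

# The mean of 60, 65, 70 and 73 is 67, so this prints [45, 59, 67, 67, 67, 67]
print(toy_result["Customer_Age"].tolist())
```

Comparing `toy_data` with `toy_result` makes it easy to confirm that only the ages at or above the threshold changed.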


Now that we've done this (the printout above is only a quick sanity check), let's run our script that finds the k-anonymity level again and see if it improved.

In [None]:
"""
The following dictionary will store each attribute, and a corresponding dictionary of
{value : amount of times this value exists} for that specific attribute. For example, it 
might look like this {'Education_Level' : {High School : 100, Graduate : 200, etc.}, etc.}
"""
attribute_values = {}

for attribute in quasi_identifier: # Loop through all the attributes in the quasi_identifier
    
    current_column = credit_data[attribute] # Get all values associated with an attribute
    found_values = {} # Dictionary to store {value : amount found for this value} pairs
    
    attribute_values[attribute] = found_values # Store the found_values dictionary at the current attribute
    
    for value in current_column: # Loop through the column
        
        if value in found_values: # If we've seen this value before
            found_values[value] = found_values[value] + 1 # Add one to amount found
        else: # Otherwise
            found_values[value] = 1 # Add the value to the dictionary and set amount found to 1

min_amount_for_each_attribute = [] # Name should be self explanatory

for attribute in attribute_values:
    found_values = attribute_values[attribute] # Pull out the found_values dictionary
    
    print(attribute)
    
    min_amount = sys.maxsize # Make sure no number is smaller than this number
    
    for value in found_values:
        print(value, "->", found_values[value])
        if found_values[value] < min_amount: # If the 'amount found' is less than min_amount
            min_amount = found_values[value] # Update min_amount to this 'amount found'
    
    min_values = [] # This is used if multiple values have the same 'amount found'
    
    for value in found_values:
        if found_values[value] == min_amount: # If 'amount found' equals the minimum 'amount found'
            min_values.append(value) # Add it to the min_values list
    
    print("The minimum value from", attribute, "is", min_amount) 
    print("The value(s) at this minimum was(were):", end = " ")
    for value in min_values:
        print(value, end = " ")
    print("")
    print("")
    
    min_amount_for_each_attribute.append(min_amount)

k_value = sys.maxsize # Now we're going to loop through each attribute to find the minimum amount
for amount in min_amount_for_each_attribute:
    if amount < k_value:
        k_value = amount

print("The k-anonymity value is", k_value)
    

Well, that helped a lot! And we didn't have to ruin our data that much. There are definitely more improvements to make, but that's all for now :)