# Re-identification and De-identification

In [290]:
import pandas as pd
import numpy as np
from IPython.display import display  #explicit import so display_df also works outside a notebook

In [473]:
def display_df(df, nrows=10, ncols=None):
    """Display a dataframe, showing at most nrows rows and ncols columns."""
    with pd.option_context('display.max_rows', nrows, 'display.max_columns', ncols):
        display(df)

def print_row(df, row):
    """Print each column name and its value for the row at integer position row."""
    for col, val in zip(df.columns, df.iloc[row]):
        print(str(col) + ": " + str(val))

## Import data

In [2]:
#whole unaltered dataset
df_raw = pd.read_csv("../mid_sample_set.csv")



## Drop Unnecessary Fields and Clean NaNs

In [65]:
def read_config(file):
    """
    Reads a configuration file, a list of strings separated by new lines, and returns
    them as a list.
    """
    with open(file) as f:
        #split() breaks on any whitespace, including new lines;
        #the with block closes the file, so no explicit close() is needed
        config_list = f.read().split()
    return config_list

In [66]:
qis = read_config('config.txt')

In [67]:
qis

['cc_by_ip',
 'countryLabel',
 'continent',
 'city',
 'region',
 'subdivision',
 'postalCode',
 'LoE',
 'YoB',
 'gender',
 'nforum_posts',
 'nforum_votes',
 'nforum_endorsed',
 'nforum_threads',
 'nforum_comments',
 'nforum_pinned',
 'nforum_events']

We only need to keep the `user_id` as a key, the quasi-identifiers, the `completed` field to find the completion rate, `explored` to find the exploration rate, and `grade` for the discussion of L-diversity in question 6. Everything else can be dropped. Then we can clean the dataset.

In [614]:
df_clean = df_raw[['user_id'] + qis + ['completed', 'explored', 'grade']]

Many of the fields contain NaNs when they actually should contain 0. We will replace those values.

In [615]:
def replace_NaNs(df, labels, fill_val):
    """
    Takes list of fields with NaNs and fills NaN values with fill_val. Modifies df.
    """
    for label in labels:
        #assign the filled column back; fillna(inplace=True) on a column selection
        #is not guaranteed to update df in newer pandas versions
        df[label] = df[label].fillna(fill_val)

def stats_NaN(df):
    """
    Gets ratio of NaNs for each column, sorted in ascending order.
    """
    #isna().mean() is the fraction of missing values per column
    return df.isna().mean().sort_values().to_frame("NaN Ratio")

In [616]:
stats_NaN(df_clean)

| | NaN Ratio |
|---|---|
| user_id | 0.0 |
| completed | 0.0 |
| continent | 0.110371 |
| countryLabel | 0.111971 |
| cc_by_ip | 0.112171 |
| gender | 0.131326 |
| LoE | 0.139956 |
| YoB | 0.150226 |
| nforum_events | 0.184851 |
| city | 0.225491 |


In [617]:
NaN_to_0_fields = ['YoB', 'postalCode', 'nforum_posts', 'nforum_votes', 'nforum_endorsed', 
                   'nforum_threads', 'nforum_comments', 'nforum_pinned', 'nforum_events',
                  'explored']
replace_NaNs(df_clean, NaN_to_0_fields, 0)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


In [618]:
#cast to numeric type
df_clean['explored'] = pd.to_numeric(df_clean['explored'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
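The `SettingWithCopyWarning` messages above appear because `df_clean` was produced by selecting columns from `df_raw` and is then modified. The following is a minimal sketch of one way to avoid the ambiguity, not the notebook's own code: take an explicit `.copy()` and fill columns by passing a dict to `fillna`. The toy frame `raw` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_raw; column names are borrowed from the notebook,
# the values are invented for this example.
raw = pd.DataFrame({
    "user_id": [1, 2, 3],
    "YoB": [1990.0, np.nan, 1985.0],
    "nforum_posts": [np.nan, 4.0, np.nan],
    "other": ["x", "y", "z"],
})

# .copy() makes the selection an independent frame, so later writes are unambiguous.
clean = raw[["user_id", "YoB", "nforum_posts"]].copy()

# Passing a dict to fillna fills each listed column with its own value.
clean = clean.fillna({"YoB": 0, "nforum_posts": 0})
```

With this pattern the original frame is left unchanged, and the fill does not rely on `inplace=True` acting on a column selection.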
  


## Add Useful Statistical Fields

In [619]:
df_clean.sort_values('user_id', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In k-anonymizing a dataset, records that share the same quasi-identifiers are collapsed together. Therefore, we must preserve valuable statistics like the completion rate, so we create new columns `nStarted`, `nCompleted`, and `nExplored`, which keep track of the number of classes a given user started, completed, and explored, respectively.

In [620]:
df_clean = df_clean.join(pd.DataFrame(df_clean.groupby('user_id').size(), 
                                      columns=['nStarted']),
                         on='user_id')

In [621]:
df_clean = df_clean.join(pd.DataFrame(df_clean[df_clean['completed']==True].groupby(['user_id']).size(), 
                                      columns=['nCompleted']),
                   on='user_id')

In [622]:
df_clean = df_clean.join(pd.DataFrame(df_clean[df_clean['explored']==True].groupby(['user_id']).size(), 
                                      columns=['nExplored']),
                   on='user_id')
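The per-user counts built with the joins above can also be computed with `groupby(...).transform`, which returns one value per original row. This is a sketch on a made-up enrollment table, not a drop-in replacement for the cells above; the frame `enroll` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy enrollment table: one row per (user, class); values are invented.
enroll = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 3, 3],
    "completed": [True, False, False, True, True, False],
})

# Number of rows (classes started) per user, repeated on each of that user's rows.
enroll["nStarted"] = enroll.groupby("user_id")["user_id"].transform("size")

# Sum of the completed flag per user; casting to int first keeps the result an integer count.
enroll["nCompleted"] = (
    enroll["completed"].astype(int).groupby(enroll["user_id"]).transform("sum")
)
```

One difference from the join approach: users with no completed classes get 0 directly here, whereas the join leaves NaN for them, which is why the notebook fills `nCompleted` and `nExplored` with 0 afterwards.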

In [625]:
#Fix NaNs
replace_NaNs(df_clean, ['nCompleted','nExplored','grade'], 0)

We can drop `completed` and `explored`, as they are no longer needed for the analysis.

In [627]:
df_clean.drop(columns=['completed','explored'], inplace=True)

## Getting Completion Rate

We will write a generalizable function that finds the completion rate of a dataset. It will use the `nStarted` and `nCompleted` columns to tabulate this. It will be general enough to use on the clean dataset without double counting and also able to handle the k-anonymized datasets where we have already handled duplicate values.

In [375]:
def getStats(df):
    """
    Returns completion rate and exploration rate of a dataframe. If user_id is present,
    function counts per unique user_id to avoid double counting. Otherwise assumes that
    duplicates have already been handled. Returns list. First element is
    completion rate, second element is exploration rate.
    """
    if 'user_id' in df.columns:
        #drop_duplicates returns a new frame, so the caller's df is left untouched
        df = df[['user_id', 'nStarted', 'nExplored', 'nCompleted']].drop_duplicates(subset='user_id')
    start_sum = df['nStarted'].sum()
    exp_sum = df['nExplored'].sum()
    comp_sum = df['nCompleted'].sum()
    return [float(comp_sum)/start_sum,float(exp_sum)/start_sum]
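To see why `getStats` deduplicates on `user_id`, consider a small worked example. The per-user totals are repeated on every enrollment row, so summing them without deduplication counts multi-class users more than once. The data below is made up for illustration.

```python
import pandas as pd

# User 1 started 2 classes and completed 1; user 2 started 1 and completed 0.
# User 1 appears on two rows because the totals are repeated per enrollment.
df = pd.DataFrame({
    "user_id":    [1, 1, 2],
    "nStarted":   [2, 2, 1],
    "nCompleted": [1, 1, 0],
})

# Summing every row counts user 1 twice: 2 completed out of 5 started.
naive_rate = df["nCompleted"].sum() / df["nStarted"].sum()

# One row per user gives the intended ratio: 1 completed out of 3 started.
per_user = df.drop_duplicates(subset="user_id")
dedup_rate = per_user["nCompleted"].sum() / per_user["nStarted"].sum()
```

Here the naive rate is 0.4 while the deduplicated rate is 1/3, so skipping the deduplication would overstate completion in this toy case.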

We need a function to check k-anonymity for unit testing.

In [509]:
def getKAnon(df, qis):
    """Returns the size of the smallest group of records sharing the same quasi-identifiers."""
    return df.groupby(qis).size().min()

def checkKAnon(df, qis, k):
    """
    Checks whether the k-anonymity of a dataset meets the threshold k. We use this to test
    whether a dataset is at least k-anonymous. If the dataset is empty, it returns True.
    """
    return (len(df)==0) or (getKAnon(df,qis)>=k)
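A small example may help show what the check measures, along with one caveat. By default, pandas `groupby` leaves out rows whose group key contains NaN, so such rows never show up as a group of size one. The sketch below uses a made-up frame; `dropna=False` is a `groupby` parameter available in pandas 1.1 and later.

```python
import numpy as np
import pandas as pd

# Invented records: two equivalence classes plus one record with a missing city.
toy = pd.DataFrame({
    "gender": ["f", "f", "m", "m", "m", "f"],
    "city":   ["Oslo", "Oslo", "Lima", "Lima", "Lima", np.nan],
})
qi_cols = ["gender", "city"]

# Default behaviour: the row with a NaN city is excluded from every group.
k_default = toy.groupby(qi_cols).size().min()

# dropna=False keeps NaN as its own key, so the lone record forms a group of one.
k_with_nan = toy.groupby(qi_cols, dropna=False).size().min()
```

In this notebook `city` is not among the fields filled with 0 and shows a NaN ratio of about 0.23 in the table above, so it may be worth checking whether records with missing quasi-identifiers are being left out of the groups.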

## Suppression

We write a general function that takes a dataframe, a list of quasi-identifiers, and a value `k`. The dataframe must be prepped with the `nStarted`, `nCompleted`, and `nExplored` fields to maintain analysis. 

In [666]:
"""
Returns a df in which every set of quasi-identifier values shared by fewer than k users
is suppressed, making the df k-anonymous.
"""
def suppressKAnon(df, qis, k=5):
    df_drop = df.drop_duplicates(subset=['user_id']+qis)
    #sum completion statistics and count unique_ids (k) for set of qis
    df_kanon = df_drop.groupby(qis).agg(
        {'nStarted':'sum','nCompleted':'sum','nExplored':'sum',
         'grade':(lambda x:len(set(x))) ,'user_id':'nunique'}).reset_index()
    df_kanon = df_kanon.rename(columns={'user_id' : 'k', 'grade':'l_grades'})
    df_kanon = df_kanon[df_kanon['k'] >= k] #suppresses less than k samples
    
    #print statistics
    stats = getStats(df_kanon)
    cr_anon = stats[0]
    er_anon = stats[1]
    print(str(k)+"-anon dataset completion rate: %.3f%%"%(cr_anon*100))
    print(str(k)+"-anon dataset exploration rate: %.3f%%"%(er_anon*100))
    
    #now must re-add records based on k
    df_kanon = pd.DataFrame(np.repeat(df_kanon.values,df_kanon['k'].values,
                                     axis=0), columns=df_kanon.columns)
    
    #must drop k, nStarted, nCompleted, and nExplored fields as these are artifacts of
    #completion analysis
    df_kanon = df_kanon.drop(columns=['k','nStarted','nCompleted','nExplored'])
    
    #print number of records suppressed
    records_suppressed = len(df)-len(df_kanon)
    print(str(records_suppressed)+" records suppressed for k="+str(k))
    return df_kanon.sort_values(by=qis)
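The core of suppression can be shown on a toy table: count the distinct users in each equivalence class and keep only rows whose class has at least `k` users. This is a simplified sketch that omits the completion statistics and the `l_grades` column handled by `suppressKAnon`; the frame `users` and its columns are made up.

```python
import pandas as pd

# Invented per-user records with two quasi-identifiers.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "country": ["NO", "NO", "NO", "PE", "PE", "IS"],
    "gender":  ["f", "f", "f", "m", "m", "f"],
})
qi_cols = ["country", "gender"]
k = 3

# Attach to every row the number of distinct users sharing its quasi-identifiers.
class_size = users.groupby(qi_cols)["user_id"].transform("nunique")

# Keep only rows from classes with at least k users; the rest are suppressed.
suppressed = users[class_size >= k]
```

Only the three `("NO", "f")` records survive for `k = 3`; the classes of size 2 and 1 are dropped, which is the same trade-off visible in the record counts reported below.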

### Completion Statistics

In [373]:
clean_stats = getStats(df_clean);

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [377]:
print("Clean dataset completion rate: %.3f%%"%(clean_stats[0]*100))
print("Clean dataset exploration rate: %.3f%%"%(clean_stats[1]*100))

Clean dataset completion rate: 2.777%
Clean dataset exploration rate: 13.430%


In [647]:
df_3supp = suppressKAnon(df_clean, qis, 3)

3-anon dataset completion rate: 1.279%
3-anon dataset exploration rate: 10.638%
183150 records suppressed for k=3


In [631]:
assert(getKAnon(df_3supp,qis)>=3)

In [630]:
df_4supp = suppressKAnon(df_clean, qis, 4)

4-anon dataset completion rate: 1.210%
4-anon dataset exploration rate: 10.388%
187797 records suppressed for k=4


In [448]:
assert(getKAnon(df_4supp,qis)>=4)

In [382]:
df_5supp = suppressKAnon(df_clean, qis, 5)

5-anon dataset completion rate: 1.214%
5-anon dataset exploration rate: 10.275%
190473 records suppressed for k=5


In [449]:
assert(getKAnon(df_5supp,qis)>=5)

## Synthetic Records

In [510]:
def syntheticKAnon(df, qis, k=5):
    #drops duplicates to avoid double counting for the completion rate
    df_drop = df.drop_duplicates(subset=['user_id']+qis)
    #sum completion statistics and count unique_ids (k) for set of qis
    df_kanon = df_drop.groupby(qis).agg({'nStarted':'sum','nCompleted':'sum', 
                                    'nExplored':'sum','user_id':'nunique'}).reset_index()
    df_kanon = df_kanon.rename(columns={'user_id' : 'k'})
    df_add_synth = df_kanon[df_kanon['k'] < k] #df to add records to
    df_kanon = df_kanon[df_kanon['k']>=k] #df which doesn't need synthetic records
    
    #df_kanon holds one aggregated row per set of qis at this point, so getKAnon
    #returns 1 and this check fails for k > 1 unless df_kanon is empty
    assert(checkKAnon(df_kanon, qis, k))
    print('passed')
    
    #add synthetic records based on k (doesn't include original record so k+1)
    df_add_synth = pd.DataFrame(np.repeat(df_add_synth.values,
                                          k+1 - df_add_synth['k'].values, 
                                     axis=0), columns=df_add_synth.columns)
    print("kanon synth: "+str(getKAnon(df_kanon, qis)))
    display_df(df_add_synth[df_add_synth['k']<k], nrows=1000)
    assert(getKAnon(df_add_synth,qis)>=k)
    
    #combine datasets (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    df_kanon = pd.concat([df_kanon, df_add_synth], ignore_index=True)
    
    #print statistics
    stats = getStats(df_kanon)
    cr_anon = stats[0]
    er_anon = stats[1]
    print(str(k)+"-anon dataset completion rate: %.3f%%"%(cr_anon*100))
    print(str(k)+"-anon dataset exploration rate: %.3f%%"%(er_anon*100))
    
    #must drop k, nStarted, nCompleted, and nExplored fields as these are artifacts
    #of completion analysis
    df_kanon = df_kanon.drop(columns=['k','nStarted','nCompleted','nExplored'])
    
    #print number of records added
    records_added = len(df_kanon)-len(df)
    print(str(records_added)+" records added for k="+str(k))
    
    return df_kanon.sort_values(by=qis)

### Completion Statistics

In [511]:
df_3synth = syntheticKAnon(df_clean, qis, 3)

AssertionError: 

In [None]:
pd.DataFrame(df_3synth.groupby(qis).size())

In [410]:
getKAnon(df_3synth, qis)

1

In [411]:
assert(getKAnon(df_3synth,qis)==3)

AssertionError: 
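The value 1 returned by `getKAnon` is what one would expect if classes that already have at least `k` users are left as a single aggregated row, while smaller classes are repeated `k + 1 - k_i` times starting from one aggregated row. A possible alternative, sketched here on a toy aggregate and not taken from the notebook, is to expand every class to `max(k_i, k)` rows. The frame `agg` and the column name `k_i` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# One aggregated row per quasi-identifier combination; k_i is the number of real users.
agg = pd.DataFrame({
    "city":   ["Oslo", "Lima", "Reykjavik"],
    "gender": ["f", "m", "f"],
    "k_i":    [1, 2, 6],
})
k = 4

# Each class needs max(k_i, k) rows: its real records plus enough synthetic copies.
n_rows = np.maximum(agg["k_i"].to_numpy(), k)

# Repeat each index label n_rows times, then select those rows to expand the frame.
expanded = agg.loc[agg.index.repeat(n_rows)].reset_index(drop=True)

sizes = expanded.groupby(["city", "gender"]).size()
n_synthetic = int(n_rows.sum() - agg["k_i"].sum())
```

In this toy case the two small classes are padded to 4 rows each (5 synthetic rows in total) and the class of 6 keeps its 6 rows, so the smallest class has exactly `k` records. How the summed statistics such as `nStarted` should be divided among the expanded rows is a separate question this sketch does not address.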

In [440]:
df_4synth = syntheticKAnon(df_clean, qis, 4)

4-anon dataset completion rate: 3.909%
4-anon dataset exploration rate: 15.741%
199575 records added for k=4


In [444]:
pd.DataFrame(df_4synth.groupby(qis).size())

| cc_by_ip | countryLabel | continent | city | region | subdivision | postalCode | LoE | YoB | gender | nforum_posts | nforum_votes | nforum_endorsed | nforum_threads | nforum_comments | nforum_pinned | nforum_events | size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AD | Andorra | Europe | Andorra La Vella | 07 | Andorra la Vella | 0 | m | 1972.0 | f | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| AD | Andorra | Europe | Engordany | 08 | Escaldes-Engordany | 0 | a | 1973.0 | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| AD | Andorra | Europe | Engordany | 08 | Escaldes-Engordany | 0 | m | 1984.0 | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| AE | United Arab Emirates | Asia | Abu Dhabi | AZ | Abu Dhabi | 0 | a | 1988.0 | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| AE | United Arab Emirates | Asia | Abu Dhabi | AZ | Abu Dhabi | 0 | a | 1992.0 | f | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| AE | United Arab Emirates | Asia | Abu Dhabi | AZ | Abu Dhabi | 0 | b | 1954.0 | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| AE | United Arab Emirates | Asia | Abu Dhabi | AZ | Abu Dhabi | 0 | b | 1966.0 | m | 27.0 | 3.0 | 0.0 | 24.0 | 3.0 | 0.0 | 0.0 | 4 |
| AE | United Arab Emirates | Asia | Abu Dhabi | AZ | Abu Dhabi | 0 | b | 1967.0 | f | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
| AE | United Arab Emirates | Asia | Abu Dhabi | AZ | Abu Dhabi | 0 | b | 1969.0 | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| AE | United Arab Emirates | Asia | Abu Dhabi | AZ | Abu Dhabi | 0 | b | 1969.0 | m | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 4.0 | 4 |


In [404]:
assert(getKAnon(df_4synth,qis)==4)

AssertionError: 

In [390]:
df_5synth = syntheticKAnon(df_clean, qis, 5)

5-anon dataset completion rate: 3.888%
5-anon dataset exploration rate: 15.705%
301748 records added for k=5


In [None]:
assert(checkKAnon(df_5synth, qis, 5))

## Generalization, Blurring, and Suppression

### Determining Bins for Continuous Variables

Pandas `qcut` provides convenient quantile-based binning. We adjust the `q` parameter (the number of quantiles requested) for each column until the values are sufficiently generalized.

In [518]:
pd.qcut(df_clean['YoB'],10, duplicates='drop').unique()

[(-0.001, 1960.0], (1973.0, 1980.0], (1984.0, 1987.0], (1980.0, 1984.0], (1990.0, 1992.0], (1960.0, 1973.0], (1992.0, 1995.0], (1987.0, 1990.0], (1995.0, 2018.0]]
Categories (9, interval[float64]): [(-0.001, 1960.0] < (1960.0, 1973.0] < (1973.0, 1980.0] < (1980.0, 1984.0] ... (1987.0, 1990.0] < (1990.0, 1992.0] < (1992.0, 1995.0] < (1995.0, 2018.0]]

In [529]:
pd.qcut(df_clean['nforum_posts'],150, duplicates='drop').unique()

[(-0.001, 1.0], (2.0, 4.0], (4.0, 10.0], (1.0, 2.0], (10.0, 465.0]]
Categories (5, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 4.0] < (4.0, 10.0] < (10.0, 465.0]]

In [554]:
pd.qcut(df_clean['nforum_votes'],200, duplicates='drop').unique()

[(-0.001, 1.0], (1.0, 2.0], (2.0, 636.0]]
Categories (3, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 636.0]]

In [550]:
pd.qcut(df_clean['nforum_endorsed'],5000, duplicates='drop').unique()

[(-0.001, 1.0], (1.0, 46.0]]
Categories (2, interval[float64]): [(-0.001, 1.0] < (1.0, 46.0]]

In [553]:
pd.qcut(df_clean['nforum_threads'],150, duplicates='drop').unique()

[(-0.001, 1.0], (1.0, 3.0], (3.0, 142.0]]
Categories (3, interval[float64]): [(-0.001, 1.0] < (1.0, 3.0] < (3.0, 142.0]]

In [528]:
pd.qcut(df_clean['nforum_comments'],100, duplicates='drop').unique()

[(-0.001, 1.0], (3.0, 444.0], (1.0, 3.0]]
Categories (3, interval[float64]): [(-0.001, 1.0] < (1.0, 3.0] < (3.0, 444.0]]

In [547]:
pd.qcut(df_clean['nforum_pinned'],100, duplicates='drop').unique()

[(-0.001, 18.0]]
Categories (1, interval[float64]): [(-0.001, 18.0]]

In [538]:
pd.qcut(df_clean['nforum_events'],50, duplicates='drop').unique()

[(-0.001, 2.0], (22.0, 9192.0], (6.0, 22.0], (2.0, 6.0]]
Categories (4, interval[float64]): [(-0.001, 2.0] < (2.0, 6.0] < (6.0, 22.0] < (22.0, 9192.0]]

We like the bucketing with the numbers above. It does not provide so much granularity that a bucket becomes a unique identifier, and the buckets themselves seem useful in analysis for educational researchers. Often it comes down to whether a person participated in a forum in a particular way, and whether they did so a lot or a little. Note that `nforum_pinned` collapses into a single bucket, so it no longer distinguishes any records.

In [632]:
def generalizeContinous(df, cols, qs):
    """
    Generalizes continuous variables using the pandas qcut function. Takes a list of q
    values, each applied to the corresponding column in cols. Returns a new dataframe.
    """
    df_g = df.copy()
    for col, q in zip(cols, qs):
        #duplicates='drop' merges quantile edges that coincide
        df_g[col] = pd.qcut(df_g[col], q, duplicates='drop')
    return df_g
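The large `q` values used here yield only a handful of buckets because most forum counts are heavily concentrated at small values: many quantile edges coincide, and `duplicates='drop'` merges them. The following sketch illustrates the effect on made-up, skewed data; the series `posts` is an assumption, not the notebook's data.

```python
import pandas as pd

# Invented skewed counts: 90 users with 0 posts, 10 users with 1 to 10 posts.
posts = pd.Series([0] * 90 + list(range(1, 11)))

# Ask for 10 quantile bins; most quantile edges fall on 0 and are dropped as duplicates.
binned = pd.qcut(posts, 10, duplicates="drop")

# Number of bins actually produced after duplicate edges are merged.
n_bins = len(binned.cat.categories)
```

Far fewer than 10 bins come out, which mirrors how `q=5000` for `nforum_endorsed` above still produces only two intervals. In practice, `q` acts as a knob for how finely the rare, large values get split.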

In [633]:
gen_cols = ['YoB','nforum_posts','nforum_votes','nforum_endorsed','nforum_threads',
'nforum_comments','nforum_pinned','nforum_events']
qs = [10,150,200,5000,150,100,100,50]

In [640]:
df_gen = generalizeContinous(df_clean, gen_cols, qs)

### Generalizing Location Fields by Dropping

Some fields are too specific to generalize usefully, so we drop them. Here that field is `postalCode`.

In [607]:
df_gen2 = df_gen.drop(columns=['postalCode'])

In [608]:
qis_gbs = [q for q in qis if q != 'postalCode'] #keeps the original order of qis

### Suppression

In [667]:
df_3gbs = suppressKAnon(df_gen, qis_gbs, 3)

3-anon dataset completion rate: 1.568%
3-anon dataset exploration rate: 11.882%
149914 records suppressed for k=3


In [668]:
df_4gbs = suppressKAnon(df_gen, qis_gbs, 4)

4-anon dataset completion rate: 1.512%
4-anon dataset exploration rate: 11.834%
157720 records suppressed for k=4


In [669]:
df_5gbs = suppressKAnon(df_gen, qis_gbs, 5)

5-anon dataset completion rate: 1.503%
5-anon dataset exploration rate: 11.678%
162892 records suppressed for k=5


## L-Diversity

Grade is the sensitive attribute we want to analyze with respect to L-diversity. To do so, we look at the number of distinct grades among the records in each group. We added a new field, `l_grades`, to the suppression algorithm; it counts the number of unique grades for a given set of quasi-identifiers.

In [679]:
print("L-diversity for 3-anonymous dataset: "+str(min(df_3gbs['l_grades'])))

L-diversity for 3-anonymous dataset: 1.0


In [680]:
print("L-diversity for 4-anonymous dataset: "+str(min(df_4gbs['l_grades'])))

L-diversity for 4-anonymous dataset: 1.0


In [681]:
print("L-diversity for 5-anonymous dataset: "+str(min(df_5gbs['l_grades'])))

L-diversity for 5-anonymous dataset: 1.0


All three datasets (k = 3, 4, and 5) have only 1-diversity: in each of them, at least one set of quasi-identifiers contains a single distinct grade value. There simply are not enough unique grade values for every set of quasi-identifiers. Using synthetic records would be a good idea to increase L-diversity, as we could add more unique grades for a given set of quasi-identifiers.
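The relationship between k-anonymity and L-diversity can be illustrated on a toy release: a table can be 3-anonymous and still have a class in which every record carries the same grade. The sketch below uses made-up data and computes both measures with plain `groupby` calls; the frame `released` is hypothetical.

```python
import pandas as pd

# Invented 3-anonymous release: two equivalence classes of three records each.
released = pd.DataFrame({
    "country": ["NO", "NO", "NO", "PE", "PE", "PE"],
    "gender":  ["f", "f", "f", "m", "m", "m"],
    "grade":   [0.0, 0.0, 0.0, 0.0, 0.5, 0.9],
})
qi_cols = ["country", "gender"]

# k is the size of the smallest equivalence class.
k_value = released.groupby(qi_cols).size().min()

# l is the smallest number of distinct sensitive values within any class.
distinct_grades = released.groupby(qi_cols)["grade"].nunique()
l_value = distinct_grades.min()
```

Here `k_value` is 3 but `l_value` is 1, because everyone in the `("NO", "f")` class has grade 0.0. Knowing that someone belongs to that class reveals their grade even though the class satisfies k-anonymity.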