# Ethics for NLP: Spring 2022
# Homework 4 Privacy



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Data Overview and Baseline

A major problem with utilizing web data as a source for NLP applications is the increasing concern for privacy, e.g., such as microtargeting. This homework is aimed at developing a method to obfuscate demographic features, in this case (binary) gender and to investigate the trade-off between obfuscating an users identity and preserving useful information.

The given dataset consists of Reddit posts (`post_text`) which are annotated with the gender (`op_gender`) of the user and the corresponding subreddit (`subreddit`) category.

*  `subreddit_classifier.pickle` pretrained subreddit classifier
*  `gender_classifier.pickle` pretrained gender classifier
*  `test.csv` your primary test data
*  `male.txt` a list of words commonly used by men
*  `female.txt` a list of words commonly used by women
*  `background.csv` additional Reddit posts that you may optionally use for training an obfuscation model

In [2]:
from sklearn.metrics import accuracy_score
from pandas.core.frame import DataFrame
from typing import List, Tuple
import pandas
import pickle
import random
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
import os

In [4]:
def get_preds(cache_name: str, test: List[str]) -> List[str]:
    loaded_model, dictionary, transpose, train_bow = pickle.load(open(cache_name, 'rb'))
    X_test = transpose(test, train_bow, dictionary)
    preds = loaded_model.predict(X_test)
    return preds

In [5]:
def run_classifier(test_file: str, googleDrive=False) -> Tuple[float]:
  if googleDrive:
    GoogleDrivePathBase='/content/drive/MyDrive/Ethic_in_NLP/HW04/data'
  else:
    GoogleDrivePathBase=''

  test_file=os.path.join(GoogleDrivePathBase,test_file)
  test_data = pandas.read_csv(test_file)

  cache_name = os.path.join(GoogleDrivePathBase,'gender_classifier.pickle')
  test_preds = get_preds(cache_name, list(test_data["post_text"]))
  gold_test = list(test_data["op_gender"])
  gender_acc = accuracy_score(list(test_preds), gold_test)
  print("Gender classification accuracy", gender_acc)

  cache_name = os.path.join(GoogleDrivePathBase,'subreddit_classifier.pickle')
  test_preds = get_preds(cache_name, list(test_data["post_text"]))
  gold_test = list(test_data["subreddit"])
  subreddit_acc = accuracy_score(list(test_preds), gold_test)
  print("Subreddit classification accuracy", subreddit_acc)
  return gender_acc, subreddit_acc

In [6]:
gender_acc, subreddit_acc = run_classifier("test.csv",True)

assert gender_acc == 0.646
assert subreddit_acc == 0.832

Gender classification accuracy 0.646
Subreddit classification accuracy 0.832


**Default accuracy:**
*   `Gender    classification accuracy: 0.646`
*   `Subreddit classification accuracy: 0.832`

## 2. Obfuscation of the Test Dataset
### 2.1 Random Obfuscated Dataset  (4P)
First, run a random experiment, by randomly swapping gender-specific words that appear in posts with a word from the respective list of words of the opposite gender.

*  Write a function to read the female.txt and male.txt files
*  Tokenize the posts („post_text“) using NLTK (0.5p)
*  For each post, if written by a man („M“) and containing a token from the male.txt, replace that token with a random one from the female.txt (1p)
*  For each post, if written by a woman („W“) and containing a token from the female.txt, replace that token with a random one from the male.txt (1p)
*  Save the obfuscated version of the test.csv in a separate csv file (using pandas and makes sure to name them accordingly) (0.5p)
*  Run the given classifier again, report the accuracy and provide a brief commentary on the results compared to the baseline (1p)

In [7]:
def read_data(file_name: str) -> List[str]:
    """
    
    add your code here

    """
    content=[]
    with open(file_name, 'r') as f:
      for line in f:
        content.append(line.strip())

    return content

In [49]:
BASE='/content/drive/MyDrive/Ethic_in_NLP/HW04/data'

MALE_PATH=os.path.join(BASE, 'male.txt')
FEMELE_PATH=os.path.join(BASE, 'female.txt')

male_words = read_data(MALE_PATH)
female_words = read_data(FEMELE_PATH)

assert len(male_words) == 3000
assert len(male_words) == 3000

In [9]:
from nltk.corpus import stopwords
import random

SEED=123
random.seed(SEED)

In [10]:
def customized_tokenize(sentence:str):
  stopWords = set(stopwords.words('english'))
  if  not sentence:
    return ' '
  else:
    return  [word.lower() for word in nltk.word_tokenize(sentence) if word not in stopWords and word.isalpha()]

def replace(gender_words: List[str], opposite_gender_words: List[str],raw_sequence:List[str]):
  new_sequence=[]
  for word in raw_sequence:
    if word in gender_words:
      while True:
        # avoid being replaced with punctuations
        idx=random.randint(0,len(opposite_gender_words)-1)
        newword=opposite_gender_words[idx]
        if newword.isalpha():
          break
      new_sequence.append(newword)
    else:
      new_sequence.append(word)
  
  return ' '.join(new_sequence)


def randomly_replace(male_words: List[str], female_words: List[str], item:pandas.Series):
  if item.op_gender=='M':
    item['post_text']=replace(male_words, female_words, item['post_text'])

  elif item.op_gender=='W':
    item['post_text']=replace(female_words, male_words, item['post_text'])
  else: 
    raise ValueError('The gender of the user is not explicitly mentioned !!')

  return item

def obfuscate_gender_randomly(male_words: List[str], female_words: List[str], dataset_file_name: str) -> DataFrame:
  """
  
  add your code here
  
  """
  df=pandas.read_csv(dataset_file_name, sep=',', encoding='utf-8',header=0)
  df['post_text']=df['post_text'].apply(lambda x: customized_tokenize(x))

  df=df.apply(lambda x : randomly_replace(male_words, female_words, x),axis=1)
  
  # replace blank value with NaN
  df['post_text'] = df['post_text'].apply(lambda x: x.strip()).replace('', pandas.NA)

  # replace nan with new value
  df['post_text']=df['post_text'].replace(to_replace = pandas.NA, value =-99999)

  if df.post_text.isna().any():
    raise ValueError('There are some NaN in the column "post_text" ')
  
  print(type(df.loc[df['op_id']=='ninepointsix'].post_text))

  

  return df


In [11]:
file_name = "randomReplacement_tokenized_testset.csv"

In [12]:
TEST_PATH=os.path.join(BASE, 'test.csv')
SAVE_PATH=os.path.join(BASE, file_name)
random_replaced_test = obfuscate_gender_randomly(male_words=male_words, female_words=female_words, dataset_file_name=TEST_PATH)
random_replaced_test.to_csv(SAVE_PATH)

<class 'pandas.core.series.Series'>


In [13]:
random_replaced_test = pandas.read_csv(SAVE_PATH)
assert len(random_replaced_test) == 500
assert random_replaced_test["subreddit"][0] == "funny"
assert random_replaced_test["subreddit"][-1:].item() == "relationships"

In [14]:
gender_acc, subreddit_acc = run_classifier(file_name, True)

assert gender_acc <= 0.5
assert subreddit_acc >= 0.7

Gender classification accuracy 0.48
Subreddit classification accuracy 0.788


**Report accuracy:**
*   `Gender    classification accuracy: ` 0.48
*   `Subreddit classification accuracy: ` 0.788
*   `Your commentary: ` In comparison with the baseline, the random word replcement method can obfuscate demographic features, the gender classification accuracy decreased from 64.6% to 48%, however this reduction is achieved at the expense of the change of the original paraphrase, as the subreddit classification accuracy also decrease from 0.83.2 to 0.788 

### 2.2 Similarity Obfuscated Dataset (4P)
In a second approach, refine the swap method. Instead of randomly selecting a word, use a similarity metric.


*  Instead of the first method replace the tokens by semantically similar tokens from the other genders token list. For that you may choose any metric for identifying semantically similar words, but you have to justify your choice. (Recommend: using cosine distance between pre-trained word embeddings) (2p)
*  Save the obfuscated version of the test.csv in a separate CSV file (using pandas and makes sure to name them accordingly) (0.5p)
*  Run the given classifier again, report the accuracy and provide a brief commentary on the results (compared to the baseline and your other results) (1p)
*  The classifiers accuracy for predicting the gender should be below random guessing (50%) and for the subreddit prediction it should be above 80% (0.5p)

In [15]:
import gensim.downloader
import gensim
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [16]:
glove_vectors = gensim.downloader.load('glove-twitter-50')
# from gensim.models import KeyedVectors 
# # glove_vectors=gensim.models.Word2Vec.load('/content/drive/MyDrive/Ethic_in_NLP/HW04/model/word2vec-google-news-300/word2vec-google-news-300.model')
# glove_vectors=KeyedVectors.load('/content/drive/MyDrive/Ethic_in_NLP/HW04/model/word2vec-google-news-300/word2vec-google-news-300.model')


In [17]:
(glove_vectors.most_similar('effort',topn=30))

[('advantage', 0.7885393500328064),
 ('efforts', 0.7689546346664429),
 ('positive', 0.7641745209693909),
 ('difficult', 0.7637898325920105),
 ('success', 0.7618691921234131),
 ('confidence', 0.7613573670387268),
 ('difference', 0.7592354416847229),
 ('expect', 0.7585071325302124),
 ('important', 0.7489951848983765),
 ('enough', 0.748828113079071),
 ('motivation', 0.7486553192138672),
 ('progress', 0.7485166788101196),
 ('patience', 0.7442478537559509),
 ('decision', 0.74277663230896),
 ('willing', 0.7402777671813965),
 ('example', 0.7390508651733398),
 ('commitment', 0.7374849319458008),
 ('push', 0.7373062372207642),
 ('appreciate', 0.7368417978286743),
 ('however', 0.7357004284858704),
 ('timing', 0.7339947819709778),
 ('tough', 0.73281329870224),
 ('although', 0.7287693619728088),
 ('sense', 0.7284727692604065),
 ('matter', 0.7276229858398438),
 ('lack', 0.72706139087677),
 ('opportunity', 0.726543664932251),
 ('focus', 0.7243965268135071),
 ('ability', 0.7236769199371338),
 ('consi

In [18]:
def similarity_replace(gender_words: List[str], opposite_gender_words: List[str],raw_sequence:List[str]):
  new_sequence=[]
  model=glove_vectors
  if not model:
    raise ValueError('There are no pre-trained model!')

  for word in raw_sequence:
    if word in gender_words:
      candidates=[]

      # If the word does not exist in the vocabulary of the model
      # then, use the word itself as the candidate
      try:
        candidates=model.most_similar(word,topn=20)
      except:
        candidates.append(tuple(word))

      replaced_tag=False
      for candidate in candidates:
        if candidate[0] in opposite_gender_words and candidate[0].isalpha(): # avoid being replaced with punctuations
          new_sequence.append(candidate[0])
          replaced_tag=True
          break    
      
      # if there are no such word from opposite_gender_words which is 
      # similar to the word in raw_sequence(original word),
      # the word will be replaced randomly. 
      if not replaced_tag:
        while True:
          # avoid being replaced with punctuations
          idx=random.randint(0,len(opposite_gender_words)-1)
          newword=opposite_gender_words[idx]
          if newword.isalpha():
            break
        new_sequence.append(newword)
        # new_sequence.append(word)


    else:
      new_sequence.append(word)
  
  return ' '.join(new_sequence)


def similarity_replacement(male_words: List[str], female_words: List[str], item:pandas.Series):

  if item.op_gender=='M':
    item['post_text']=similarity_replace(male_words, female_words, item['post_text'])

  elif item.op_gender=='W':
    item['post_text']=similarity_replace(female_words, male_words, item['post_text'])
  else: 
    raise ValueError('The gender of the user is not explicitly mentioned !!')

  return item



def obfuscate_gender_by_similarity(male_words: List[str], female_words: List[str], dataset_file_name: str) -> DataFrame:
  """
  
  add your code here
  
  """
  df=pandas.read_csv(dataset_file_name, sep=',', encoding='utf-8',header=0)
  df['post_text']=df['post_text'].apply(lambda x: customized_tokenize(x))

  df=df.apply(lambda x : similarity_replacement(male_words, female_words, x),axis=1)
  
  # replace blank value with NaN
  df['post_text'] = df['post_text'].apply(lambda x: x.strip()).replace('', pandas.NA)

  # replace nan with new value
  df['post_text']=df['post_text'].replace(to_replace = pandas.NA, value =-99999)

  if df.post_text.isna().any():
    raise ValueError('There are some NaN in the column "post_text" ')



  return df

In [19]:
"""
 you may use gensim models for example word2vec-google-news-300
"""

'\n you may use gensim models for example word2vec-google-news-300\n'

In [20]:
file_name = "similarity_tokenized_testset.csv"

* The below code block would take almost **14 minutes** on the colab w.r.t. no 
restriction on the len(word)
* The below code block would take almost **6 minutes** on the colab w.r.t. len(word) has to be larger than 3

In [21]:
SAVE_Similarity_PATH=os.path.join(BASE,file_name)
similarity_replaced_test = obfuscate_gender_by_similarity(male_words=male_words, female_words=female_words, dataset_file_name=TEST_PATH)
similarity_replaced_test.to_csv(SAVE_Similarity_PATH)

In [22]:
similarity_replaced_test.loc[0].post_text

'patience nevermind kidding cuz pretty sure boys help russian studies'

In [23]:
similarity_replaced_test = pandas.read_csv(SAVE_Similarity_PATH)
assert len(similarity_replaced_test) == 500
assert similarity_replaced_test["subreddit"][0] == "funny"
assert similarity_replaced_test["subreddit"][-1:].item() == "relationships"

In [24]:
gender_acc, subreddit_acc = run_classifier(file_name,True)

assert gender_acc <= 0.5
assert subreddit_acc >= 0.8


Gender classification accuracy 0.478
Subreddit classification accuracy 0.808


**Report accuracy:**
*   `Gender    classification accuracy: ` 0.478
*   `Subreddit classification accuracy: ` 0.808
*   `Your commentary: ` The obfuscation capability of the lexical similarity replacement method is closed to the random method above, as their gender classification accuracies are closed. Besides, the similarity method outperforms the random method in the aspect of persevering the paraphrase of the original sentence. 

By using goolge-news-300 model, the gender_acc, subreddit_acc reaches 0.562, 0.808 respectively

By using glove-twitter-50 model, the gender_acc, subreddit_acc reaches 0.476, 0.806 respectively

### 2.3 Your Own Obfuscated Dataset (4P)
With this last approach, you can experiment by yourself how to obfuscate the posts.

*  Some examples: What if you randomly decide whether or not to replace words instead of replacing every lexicon word? What if you only replace words that have semantically similar enough counterparts? What if you use different word embeddings? (2p)
*  Save the obfuscated version of the test.csv in a separate csv file (using pandas and makes sure to name them accordingly) (0.5p)
*  Describe your modifications and report the accuracy and provide a brief commentary on the results compared to the baseline and your other results (1.5p)

In [25]:
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [26]:
glove_vectors_twitter_100 = gensim.downloader.load('glove-twitter-100')

In [71]:
glove_vectors_twitter_100['word']

array([ 0.57479 ,  0.27959 , -0.17003 ,  1.0926  , -0.5678  ,  0.13946 ,
       -0.22845 ,  0.27979 ,  0.1436  ,  0.25408 ,  0.14175 ,  0.47737 ,
       -4.1063  , -0.45932 , -0.78775 , -0.061295,  0.28098 ,  0.55691 ,
        0.040097, -0.33675 ,  0.10952 ,  0.32482 , -0.60996 ,  0.77837 ,
        1.0855  ,  0.092512, -0.34347 , -0.52561 , -0.32974 , -0.45062 ,
       -0.33763 ,  0.26943 , -0.7608  , -0.013459, -0.097348, -0.40263 ,
        0.22523 ,  0.40602 ,  0.34765 , -1.2264  , -0.81516 , -0.57451 ,
        0.084248,  0.36518 ,  0.24649 , -0.26708 ,  0.074   ,  0.73033 ,
       -0.34619 ,  0.29964 ,  0.49903 ,  0.46251 , -0.68305 , -0.92597 ,
        0.075895, -0.51661 , -0.67615 , -0.017943, -1.1911  , -0.12817 ,
        0.27478 , -0.77928 , -0.35465 ,  0.39712 ,  0.22347 ,  0.38169 ,
       -0.067566, -0.24608 ,  0.34249 , -0.26701 , -0.78815 , -0.79426 ,
       -0.57019 ,  0.14404 ,  0.23621 , -0.067121,  0.31948 ,  0.06233 ,
       -0.3619  , -0.012909,  0.91253 ,  0.21408 , 

In [27]:
glove_vectors_word2vec_google_news_300 = gensim.downloader.load('word2vec-google-news-300')

In [62]:
def customized_replace(gender_words: List[str], opposite_gender_words: List[str],raw_sequence:List[str]):
  new_sequence=[]
  model=glove_vectors_twitter_100
  if not model:
    raise ValueError('There are no pre-trained model!')

  for word in raw_sequence:
    candidates=[]
    trigger==random.choice([0,1])
    if trigger:
      try:
        new_sequence.append(model.most_similar(word, topn=1)[0][0])

      except:
        new_sequence.append(word)
  
  return ' '.join(new_sequence)


def customiezed_replacement(male_words: List[str], female_words: List[str], item:pandas.Series):
  
  
  if item.op_gender=='M':
    item['post_text']=customized_replace(male_words, female_words, item['post_text'])

  elif item.op_gender=='W':
    item['post_text']=customized_replace(female_words, male_words, item['post_text'])
  else: 
    raise ValueError('The gender of the user is not explicitly mentioned !!')

  return item




def obfuscate_gender(male_words: List[str], female_words: List[str], dataset_file_name: str) -> DataFrame:
  """

    add your own implemntation, you may add more functions and arguments
    
  """

  df=pandas.read_csv(dataset_file_name, sep=',', encoding='utf-8',header=0)
  df['post_text']=df['post_text'].apply(lambda x: customized_tokenize(x))

  df=df.apply(lambda x : similarity_replacement(male_words, female_words, x),axis=1)
  
  # replace blank value with NaN
  df['post_text'] = df['post_text'].apply(lambda x: x.strip()).replace('', pandas.NA)

  # replace nan with new value
  df['post_text']=df['post_text'].replace(to_replace = pandas.NA, value =-99999)

  if df.post_text.isna().any():
    raise ValueError('There are some NaN in the column "post_text" ')



  return df

In [63]:
file_name = "costumized_tokenized_testset.csv"

In [64]:
SAVE_Custumized_PATH=os.path.join(BASE,file_name)
your_test = obfuscate_gender(male_words=male_words, female_words=female_words, dataset_file_name=TEST_PATH)
your_test.to_csv(SAVE_Custumized_PATH)

In [66]:
your_test = pandas.read_csv(SAVE_Custumized_PATH)
assert len(your_test) == 500
assert your_test["subreddit"][0] == "funny"
assert your_test["subreddit"][-1:].item() == "relationships"

In [67]:
gender_acc, subreddit_acc = run_classifier(file_name,True)

assert gender_acc <= 0.5
assert subreddit_acc >= 0.6

Gender classification accuracy 0.466
Subreddit classification accuracy 0.812


**Report accuracy:**
*   `Gender    classification accuracy: ` 0.466
*   `Subreddit classification accuracy: ` 0.812
*   `Your commentary: ` 

I tried to use different set of parameters, such as word embedding, only replace words that have semantically similar enough counterparts, i.e. only the most similar word is replaced. The table below shows the performances of different methods. The best one is using 'glove-twitter-50', 7 similar words and randomly decide whether replace word or not. 



| Parameters                                   | Gender classification accuracy | Subreddit classification accuracy |
|----------------------------------------------|--------------------------------|-----------------------------------|
| (glove-twitter-50, 10 similiar word, random) | 0.474                          | 0.808                             |
| (glove-twitter-50, 7 similiar word, random)  | 0.466                          | 0.812                              |
| (glove-twitter-50, 5 similiar word, random)  | 0.472                          | 0.804                             |
| (glove-twitter-50, 7 similiar word, remain)  | 0.47                           | 0.808                             |
| (glove-twitter-100, 7 similiar word, random) | 0.474                          | 0.806                             |


### 3 Advanced Obfuscated Model (5P)
Develop your own obfuscation model using the provided background.csv for training. Your ultimate goal should be to obfuscate text so that the classifier is unable to determine the gender of an user (no better than random guessing) without compromising the accuracy of the subreddit classification task. To train a model that is good at predicting subreddit classification, but bad at predicting gender. The key idea in this approach is to design a model that does not encode information about protected attributes (in this case, gender). In your report, include a description of your model and results.

*  Develop your own classifier (3p)
*  Use only posts from the subreddits „CasualConversation“ and „funny“ (min. 1000 posts for each gender per subreddit) (0.5p)
*  Use sklearn models (MLPClassifier, LogisticRegression, etc.)
*  Use 90% for training and 10% for testing (0.5p)
*  In your report, include a description of your model and report the accuracy on the unmodified train data (your baseline here) as well as the modified train data and provide a brief commentary on the results (1p)

In [72]:
"""

add your code here

The below code are still bugy

"""

# load data and preprocessing
background_data=pandas.read_csv(os.path.join(BASE, 'background.csv'))
background_data['post_text']=background_data['post_text'].apply(lambda x: customized_tokenize(x))



def sentenc_embedding(item,model):
  item['op_gender'] = 1 if item['op_gender']=='M' else 0

  sent= item['post_text']
  sent_vec =[]
  numw = 0
  for w in sent:
      try:
          if numw == 0:
              sent_vec = model[w]
          else:
              sent_vec = np.add(sent_vec, model[w])
          numw+=1
      except:
          pass

  item['post_text']=np.asarray(sent_vec) / numw
    
  return item





In [75]:
background_data=background_data.apply(lambda x: sentenc_embedding(x, glove_vectors_twitter_100))

KeyError: ignored

In [74]:
background_data

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,op_id,op_gender,post_id,post_text,subreddit,op_gender_visible
0,596598,596598,Delts28,M,596598,"[wow, still, going, well, im, game]",CasualConversation,False
1,596599,596599,robertmeowneyjr,M,596599,"[you, guys, nice, cars, heres, mine, link, jus...",CasualConversation,False
2,596600,596600,Delts28,M,596600,"[two, samsungs, an, netbook, seems, indestruct...",CasualConversation,False
3,596601,596601,Delts28,M,596601,"[i, used, marine, engineer, wasnt, life, now, ...",CasualConversation,False
4,596602,596602,Delts28,M,596602,"[well, ive, countries, people, dont, know, exi...",CasualConversation,False
...,...,...,...,...,...,...,...,...
54744,749530,761059,Morc35,M,761059,"[yellow, no]",funny,False
54745,749531,761060,O-shi,M,761060,"[sign, maker, seems, like, fun]",funny,False
54746,749532,761061,Shenko-wolf,M,761061,"[those, trophy, zucchinis]",funny,False
54747,749533,761062,i_forget_my_userids,M,761062,"[lol, youre, wrong, good, resource, community,...",funny,False


**Report accuracy:**
* Baseline:
  * `Gender    classification accuracy: `
  * `Subreddit classification accuracy: `
* Your Model: 
  * `Gender    classification accuracy: `
  * `Subreddit classification accuracy: ` 
*   `Your commentary: ` ...

### 4 Ethical Implications (3P)
Discuss the ethical implications of obfuscation and privacy based on the concepts covered in the lecture. Provide answers to the following points:

1.   What are demographic features (name at least three) and explain shortly some of the privacy violation risks? (1p)
2.   Explain the cultural and social implications and their effects? In this context discuss the information privacy paradox. You may refer to a recent example like the COVID-19 pandemic.  (1.5p)
3.   Name a at least three privacy preserving countermeasures  (0.5p)

Your Answer: ...

**Question 1**

Demographic features are statisitical factors that influence population growth or decline, including age structure, fecundity, sex ratio etc. The privacy violation might lead to the increase of bureaucracy and the abuse of power

**Question 2**

Citizens' personal privacy and security are in a paradoxical relationship. On the one hand, extensive public monitoring and censorship of content on social media is necessary to deter crime and prevent crime. However, centralizing these responsibilities and efforts in one agency can lead to increased bureaucracy and abuse of power. The trade-off between privacy and security are different among different cultures and countries. 

In the COVID19 pandemic, the different policies regarding different countries are an illustrative instance of the above theory. In some countries, the ordinary people believe the government has no right to track them even if their data are promised to be only used for monitoring plague, which results in the unavoidable and significant increase of people injected by the COIVD virus. In some countries, the amount of injected people are under control by large-scale tracking citizen and collective quarantine. However in those countries, there is sometimes many of news that some officials used the tracking technologies to threaten citizens who were going to reveal corruption. 

> Phone tapping and surveillance of suspects is the best way to find terrorist before they perform an attack ---- Sir Humphrey Appleby, *Yes, Minister*


**Question 3**

1. People use PGP, a protocol/framework aiming to protect data privacy by
crptographic signature, to send and receive individual Email. 
2. The ePrivacy Diective, a privacy legislatino that requires sites to get conset from visitors befor placing cookies ontheir devices, prevents the privider  of a website from abusing cookies to track visitors. 
3. Virtual private network(VPN) and Onion Router(Tor) enable anonymous communication. 


