# Ethics for NLP: Spring 2022
# Homework 4 Privacy



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Data Overview and Baseline

A major problem with using web data as a source for NLP applications is the growing concern for privacy, e.g., microtargeting. This homework aims at developing a method to obfuscate demographic features, in this case (binary) gender, and at investigating the trade-off between obfuscating a user's identity and preserving useful information.

The given dataset consists of Reddit posts (`post_text`) which are annotated with the gender (`op_gender`) of the user and the corresponding subreddit (`subreddit`) category.

*  `subreddit_classifier.pickle` pretrained subreddit classifier
*  `gender_classifier.pickle` pretrained gender classifier
*  `test.csv` your primary test data
*  `male.txt` a list of words commonly used by men
*  `female.txt` a list of words commonly used by women
*  `background.csv` additional Reddit posts that you may optionally use for training an obfuscation model

In [2]:
from sklearn.metrics import accuracy_score
from pandas.core.frame import DataFrame
from typing import List, Tuple
import pandas
import pickle
import random
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
import os

In [4]:
def get_preds(cache_name: str, test: List[str]) -> List[str]:
    # Each pickle stores the model together with its vectorisation helpers.
    with open(cache_name, 'rb') as f:
        loaded_model, dictionary, transpose, train_bow = pickle.load(f)
    X_test = transpose(test, train_bow, dictionary)
    preds = loaded_model.predict(X_test)
    return preds

In [26]:
def run_classifier(test_file: str, googleDrive=False) -> Tuple[float, float]:
  if googleDrive:
    GoogleDrivePathBase='/content/drive/MyDrive/Ethic_in_NLP/HW04/data'
  else:
    GoogleDrivePathBase=''

  test_file=os.path.join(GoogleDrivePathBase,test_file)
  test_data = pandas.read_csv(test_file)

  cache_name = os.path.join(GoogleDrivePathBase,'gender_classifier.pickle')
  test_preds = get_preds(cache_name, list(test_data["post_text"]))
  gold_test = list(test_data["op_gender"])
  gender_acc = accuracy_score(list(test_preds), gold_test)
  print("Gender classification accuracy", gender_acc)

  cache_name = os.path.join(GoogleDrivePathBase,'subreddit_classifier.pickle')
  test_preds = get_preds(cache_name, list(test_data["post_text"]))
  gold_test = list(test_data["subreddit"])
  subreddit_acc = accuracy_score(list(test_preds), gold_test)
  print("Subreddit classification accuracy", subreddit_acc)
  return gender_acc, subreddit_acc

In [6]:
gender_acc, subreddit_acc = run_classifier("test.csv",True)

assert gender_acc == 0.646
assert subreddit_acc == 0.832

Gender classification accuracy 0.646
Subreddit classification accuracy 0.832


**Default accuracy:**
*   `Gender    classification accuracy: 0.646`
*   `Subreddit classification accuracy: 0.832`

## 2. Obfuscation of the Test Dataset
### 2.1 Random Obfuscated Dataset  (4P)
First, run a random experiment, by randomly swapping gender-specific words that appear in posts with a word from the respective list of words of the opposite gender.

*  Write a function to read the female.txt and male.txt files
*  Tokenize the posts („post_text“) using NLTK (0.5p)
*  For each post, if written by a man („M“) and containing a token from the male.txt, replace that token with a random one from the female.txt (1p)
*  For each post, if written by a woman („W“) and containing a token from the female.txt, replace that token with a random one from the male.txt (1p)
*  Save the obfuscated version of test.csv in a separate CSV file (using pandas; make sure to name it accordingly) (0.5p)
*  Run the given classifier again, report the accuracy and provide a brief commentary on the results compared to the baseline (1p)
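
The swap described in the steps above can be sketched on a toy example. This is a minimal sketch, not the graded solution: the two tiny lexicons, the `swap_tokens` helper and the whitespace tokenizer are assumptions made for illustration (the homework asks for NLTK tokenization and the full `male.txt` / `female.txt` lists).

```python
import random
from typing import List


def swap_tokens(tokens: List[str], own_words: List[str],
                other_words: List[str], rng: random.Random) -> List[str]:
    """Replace every token found in own_words by a random word from other_words."""
    own = set(own_words)  # set lookup is O(1) per token
    return [rng.choice(other_words) if tok in own else tok for tok in tokens]


# Hypothetical miniature lexicons standing in for male.txt / female.txt.
toy_male = ["bro", "beer"]
toy_female = ["cute", "hubby"]

rng = random.Random(123)  # fixed seed so the run is repeatable
post = "my bro bought beer yesterday".split()  # whitespace split instead of NLTK
obfuscated = swap_tokens(post, toy_male, toy_female, rng)
print(" ".join(obfuscated))
```

Tokens that are not in the author's own lexicon pass through unchanged, so the post keeps its length and most of its topical vocabulary.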

In [7]:
def read_data(file_name: str) -> List[str]:
    """Read a word list with one entry per line and return the stripped lines."""
    content=[]
    with open(file_name, 'r') as f:
      for line in f:
        content.append(line.strip())

    return content

In [8]:
BASE='/content/drive/MyDrive/Ethic_in_NLP/HW04/data'

MALE_PATH=os.path.join(BASE, 'male.txt')
FEMALE_PATH=os.path.join(BASE, 'female.txt')

male_words = read_data(MALE_PATH)
female_words = read_data(FEMALE_PATH)

assert len(male_words) == 3000
assert len(female_words) == 3000

In [56]:
from nltk.corpus import stopwords
import random

SEED=123
random.seed(SEED)

In [10]:
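STOP_WORDS = set(stopwords.words('english'))  # built once instead of on every call

def customized_tokenize(sentence: str) -> List[str]:
  """Tokenize a post, drop stop words and non-alphabetic tokens, lowercase the rest."""
  if not sentence:
    return []
  # Note: the stop-word test runs before lowercasing, so capitalized
  # stop words such as "The" are kept (as "the").
  return [word.lower() for word in nltk.word_tokenize(sentence) if word not in STOP_WORDS and word.isalpha()]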

def replace(gender_words: List[str], opposite_gender_words: List[str],raw_sequence:List[str]):
  new_sequence=[]
  for word in raw_sequence:
    if word in gender_words:
      while True:
        # avoid being replaced with punctuations
        idx=random.randint(0,len(opposite_gender_words)-1)
        newword=opposite_gender_words[idx]
        if newword.isalpha():
          break
      new_sequence.append(newword)
    else:
      new_sequence.append(word)
  
  return ' '.join(new_sequence)


def randomly_replace(male_words: List[str], female_words: List[str], item:pandas.Series):
  if item.op_gender=='M':
    item['post_text']=replace(male_words, female_words, item['post_text'])

  elif item.op_gender=='W':
    item['post_text']=replace(female_words, male_words, item['post_text'])
  else: 
    raise ValueError('The gender of the user is not explicitly mentioned !!')

  return item

def obfuscate_gender_randomly(male_words: List[str], female_words: List[str], dataset_file_name: str) -> DataFrame:
  """Tokenize each post and swap gender-lexicon words for random words from the opposite lexicon."""
  df=pandas.read_csv(dataset_file_name, sep=',', encoding='utf-8',header=0)
  df['post_text']=df['post_text'].apply(lambda x: customized_tokenize(x))

  df=df.apply(lambda x : randomly_replace(male_words, female_words, x),axis=1)
  
  # replace blank value with NaN
  df['post_text'] = df['post_text'].apply(lambda x: x.strip()).replace('', pandas.NA)

  # replace nan with new value
  df['post_text']=df['post_text'].replace(to_replace = pandas.NA, value =-99999)

  if df.post_text.isna().any():
    raise ValueError('There are some NaN in the column "post_text" ')
  
  print(type(df.loc[df['op_id']=='ninepointsix'].post_text))

  

  return df


In [11]:
file_name = "randomReplacement_tokenized_testset.csv"

In [12]:
TEST_PATH=os.path.join(BASE, 'test.csv')
SAVE_PATH=os.path.join(BASE, file_name)
random_replaced_test = obfuscate_gender_randomly(male_words=male_words, female_words=female_words, dataset_file_name=TEST_PATH)
random_replaced_test.to_csv(SAVE_PATH)

<class 'pandas.core.series.Series'>


In [13]:
random_replaced_test = pandas.read_csv(SAVE_PATH)
assert len(random_replaced_test) == 500
assert random_replaced_test["subreddit"][0] == "funny"
assert random_replaced_test["subreddit"][-1:].item() == "relationships"

In [14]:
gender_acc, subreddit_acc = run_classifier(file_name, True)

assert gender_acc <= 0.5
assert subreddit_acc >= 0.7

Gender classification accuracy 0.484
Subreddit classification accuracy 0.796


**Report accuracy:**
*   `Gender    classification accuracy: 0.484`
*   `Subreddit classification accuracy: 0.796`
*   `Your commentary: ` ...

### 2.2 Similarity Obfuscated Dataset (4P)
In a second approach, refine the swap method. Instead of randomly selecting a word, use a similarity metric.


*  Instead of replacing tokens at random, replace them with semantically similar tokens from the other gender's token list. You may choose any metric for identifying semantically similar words, but you have to justify your choice. (Recommended: cosine distance between pre-trained word embeddings) (2p)
*  Save the obfuscated version of test.csv in a separate CSV file (using pandas; make sure to name it accordingly) (0.5p)
*  Run the given classifier again, report the accuracy and provide a brief commentary on the results (compared to the baseline and your other results) (1p)
*  The classifier's accuracy for predicting the gender should be below random guessing (50%) and for the subreddit prediction it should be above 80% (0.5p)
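
The recommended metric can be illustrated in a few lines. The sketch below computes cosine similarity by hand and picks the closest word from the opposite lexicon; the three-dimensional `toy_vectors` and the helper names are hypothetical, and real vectors would come from a pre-trained embedding model.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_in_lexicon(word: str, lexicon: List[str],
                       vectors: Dict[str, Sequence[float]]) -> Optional[str]:
    """Return the lexicon word whose vector is closest to `word`, or None."""
    if word not in vectors:
        return None  # out-of-vocabulary: the caller decides on a fallback
    scored = [(cosine(vectors[word], vectors[w]), w) for w in lexicon if w in vectors]
    return max(scored)[1] if scored else None


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "boyfriend": [0.9, 0.1, 0.3],
    "girlfriend": [0.8, 0.2, 0.35],
    "makeup": [0.1, 0.9, 0.2],
}
nearest = nearest_in_lexicon("boyfriend", ["girlfriend", "makeup"], toy_vectors)
print(nearest)
```

Cosine similarity ignores vector length and compares direction only, which is one common justification for using it with word embeddings.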

In [15]:
import gensim.downloader
import gensim
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [62]:
glove_vectors = gensim.downloader.load('glove-twitter-50')
# from gensim.models import KeyedVectors 
# # glove_vectors=gensim.models.Word2Vec.load('/content/drive/MyDrive/Ethic_in_NLP/HW04/model/word2vec-google-news-300/word2vec-google-news-300.model')
# glove_vectors=KeyedVectors.load('/content/drive/MyDrive/Ethic_in_NLP/HW04/model/word2vec-google-news-300/word2vec-google-news-300.model')




In [63]:
(glove_vectors.most_similar('effort',topn=30))

[('advantage', 0.7885393500328064),
 ('efforts', 0.7689546346664429),
 ('positive', 0.7641745209693909),
 ('difficult', 0.7637898325920105),
 ('success', 0.7618691921234131),
 ('confidence', 0.7613573670387268),
 ('difference', 0.7592354416847229),
 ('expect', 0.7585071325302124),
 ('important', 0.7489951848983765),
 ('enough', 0.748828113079071),
 ('motivation', 0.7486553192138672),
 ('progress', 0.7485166788101196),
 ('patience', 0.7442478537559509),
 ('decision', 0.74277663230896),
 ('willing', 0.7402777671813965),
 ('example', 0.7390508651733398),
 ('commitment', 0.7374849319458008),
 ('push', 0.7373062372207642),
 ('appreciate', 0.7368417978286743),
 ('however', 0.7357004284858704),
 ('timing', 0.7339947819709778),
 ('tough', 0.73281329870224),
 ('although', 0.7287693619728088),
 ('sense', 0.7284727692604065),
 ('matter', 0.7276229858398438),
 ('lack', 0.72706139087677),
 ('opportunity', 0.726543664932251),
 ('focus', 0.7243965268135071),
 ('ability', 0.7236769199371338),
 ('consi

In [69]:
def similarity_replace(gender_words: List[str], opposite_gender_words: List[str],raw_sequence:List[str]):
  new_sequence=[]
  model=glove_vectors
  if not model:
    raise ValueError('There are no pre-trained model!')

  for word in raw_sequence:
    if word in gender_words:
      # If the word does not exist in the vocabulary of the model,
      # use the word itself as the only candidate.
      try:
        candidates=model.most_similar(word,topn=20)
      except KeyError:
        candidates=[(word, 1.0)]

      replaced_tag=False
      for candidate in candidates:
        if candidate[0] in opposite_gender_words and candidate[0].isalpha(): # avoid being replaced with punctuations
          new_sequence.append(candidate[0])
          replaced_tag=True
          break    
      
      # if there are no such word from opposite_gender_words which is 
      # similar to the word in raw_sequence(original word),
      # the word will be replaced randomly. 
      if not replaced_tag:
        while True:
          # avoid being replaced with punctuations
          idx=random.randint(0,len(opposite_gender_words)-1)
          newword=opposite_gender_words[idx]
          if newword.isalpha():
            break
        new_sequence.append(newword)
        # new_sequence.append(word)


    else:
      new_sequence.append(word)
  
  return ' '.join(new_sequence)


def similarity_replacement(male_words: List[str], female_words: List[str], item:pandas.Series):

  if item.op_gender=='M':
    item['post_text']=similarity_replace(male_words, female_words, item['post_text'])

  elif item.op_gender=='W':
    item['post_text']=similarity_replace(female_words, male_words, item['post_text'])
  else: 
    raise ValueError('The gender of the user is not explicitly mentioned !!')

  return item



def obfuscate_gender_by_similarity(male_words: List[str], female_words: List[str], dataset_file_name: str) -> DataFrame:
  """Tokenize each post and swap gender-lexicon words for similar words from the opposite lexicon."""
  df=pandas.read_csv(dataset_file_name, sep=',', encoding='utf-8',header=0)
  df['post_text']=df['post_text'].apply(lambda x: customized_tokenize(x))

  df=df.apply(lambda x : similarity_replacement(male_words, female_words, x),axis=1)
  
  # replace blank value with NaN
  df['post_text'] = df['post_text'].apply(lambda x: x.strip()).replace('', pandas.NA)

  # replace nan with new value
  df['post_text']=df['post_text'].replace(to_replace = pandas.NA, value =-99999)

  if df.post_text.isna().any():
    raise ValueError('There are some NaN in the column "post_text" ')



  return df

In [19]:
"""
 you may use gensim models for example word2vec-google-news-300
"""

'\n you may use gensim models for example word2vec-google-news-300\n'

In [65]:
file_name = "similarity_tokenized_testset.csv"

* The code block below takes almost **14 minutes** on Colab when no restriction is placed on `len(word)`.
* It takes almost **6 minutes** on Colab when `len(word)` has to be larger than 3.
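
One possible way to reduce such runtimes, assuming that repeated neighbour queries and list membership tests account for much of the cost, is to memoise the lookup and keep the lexicon in a set. In the sketch below, `slow_neighbours` is a hypothetical stand-in for an expensive embedding query, and all other names are illustrative as well.

```python
from functools import lru_cache
from typing import FrozenSet, Tuple

CALLS = {"count": 0}  # counts how often the expensive lookup really runs


def slow_neighbours(word: str) -> Tuple[str, ...]:
    """Hypothetical stand-in for an expensive nearest-neighbour query."""
    CALLS["count"] += 1
    return (word + "s", word + "ed")


@lru_cache(maxsize=None)
def cached_neighbours(word: str) -> Tuple[str, ...]:
    """Memoised wrapper: each distinct word is looked up only once."""
    return slow_neighbours(word)


def first_match(word: str, lexicon: FrozenSet[str]) -> str:
    """Return the first cached neighbour found in the lexicon, else the word."""
    for candidate in cached_neighbours(word):
        if candidate in lexicon:  # set membership is O(1)
            return candidate
    return word


lexicon = frozenset({"walked", "talks"})
tokens = ["walk", "talk", "walk", "walk"]
result = [first_match(t, lexicon) for t in tokens]
print(result, CALLS["count"])
```

Four tokens trigger only two expensive lookups here because repeated words hit the cache.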

In [70]:
SAVE_Similarity_PATH=os.path.join(BASE,file_name)
similarity_replaced_test = obfuscate_gender_by_similarity(male_words=male_words, female_words=female_words, dataset_file_name=TEST_PATH)
similarity_replaced_test.to_csv(SAVE_Similarity_PATH)

In [59]:
similarity_replaced_test.loc[0].post_text

'bathroom nevermind kidding im pretty sure guys help russian studies'

In [71]:
similarity_replaced_test = pandas.read_csv(SAVE_Similarity_PATH)
assert len(similarity_replaced_test) == 500
assert similarity_replaced_test["subreddit"][0] == "funny"
assert similarity_replaced_test["subreddit"][-1:].item() == "relationships"

In [72]:
gender_acc, subreddit_acc = run_classifier(file_name,True)

assert gender_acc <= 0.5
assert subreddit_acc >= 0.8


Gender classification accuracy 0.476
Subreddit classification accuracy 0.806


**Report accuracy:**
*   `Gender    classification accuracy: 0.476`
*   `Subreddit classification accuracy: 0.806`
*   `Your commentary: ` ...

With the word2vec-google-news-300 model, `gender_acc` and `subreddit_acc` reach 0.562 and 0.808, respectively.

With the glove-twitter-50 model, `gender_acc` and `subreddit_acc` reach 0.476 and 0.806, respectively.
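
To compare the runs, the accuracies reported in this notebook can be expressed as changes relative to the baseline. The sketch below only rearranges numbers already printed above; the names `BASELINE`, `RUNS` and `delta` are illustrative.

```python
from typing import Dict, Tuple

# (gender accuracy, subreddit accuracy) as reported in this notebook.
BASELINE: Tuple[float, float] = (0.646, 0.832)
RUNS: Dict[str, Tuple[float, float]] = {
    "random swap": (0.484, 0.796),
    "similarity, word2vec-google-news-300": (0.562, 0.808),
    "similarity, glove-twitter-50": (0.476, 0.806),
}


def delta(run: Tuple[float, float]) -> Tuple[float, float]:
    """Change in (gender, subreddit) accuracy relative to the baseline."""
    return (round(run[0] - BASELINE[0], 3), round(run[1] - BASELINE[1], 3))


for name, scores in RUNS.items():
    d_gender, d_subreddit = delta(scores)
    print(f"{name:38s} gender {d_gender:+.3f}  subreddit {d_subreddit:+.3f}")
```

A large negative gender delta combined with a small subreddit delta is the desired outcome: less demographic leakage at little cost to the useful task.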

### 2.3 Your Own Obfuscated Dataset (4P)
With this last approach, you can experiment by yourself how to obfuscate the posts.

*  Some examples: What if you randomly decide whether or not to replace words instead of replacing every lexicon word? What if you only replace words that have semantically similar enough counterparts? What if you use different word embeddings? (2p)
*  Save the obfuscated version of test.csv in a separate CSV file (using pandas; make sure to name it accordingly) (0.5p)
*  Describe your modifications and report the accuracy and provide a brief commentary on the results compared to the baseline and your other results (1.5p)
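
Two of the ideas listed above, replacing only with some probability and only when the counterpart is similar enough, can be combined as follows. This is a sketch under stated assumptions: `toy_matches` is a hypothetical precomputed table of counterparts with made-up similarity scores, and `selective_swap` is an illustrative helper rather than a prescribed solution.

```python
import random
from typing import Dict, List, Tuple


def selective_swap(tokens: List[str],
                   best_match: Dict[str, Tuple[str, float]],
                   min_similarity: float,
                   swap_probability: float,
                   rng: random.Random) -> List[str]:
    """Swap a token only if its counterpart is similar enough and a coin flip agrees."""
    out = []
    for tok in tokens:
        match = best_match.get(tok)
        if (match is not None and match[1] >= min_similarity
                and rng.random() < swap_probability):
            out.append(match[0])
        else:
            out.append(tok)  # keep the original token
    return out


# Hypothetical precomputed counterparts: word -> (replacement, cosine similarity).
toy_matches = {"boyfriend": ("girlfriend", 0.95), "beer": ("wine", 0.55)}
tokens = "my boyfriend likes beer".split()

always = selective_swap(tokens, toy_matches, 0.7, 1.0, random.Random(0))
never = selective_swap(tokens, toy_matches, 0.7, 0.0, random.Random(0))
print(always)
print(never)
```

Raising `min_similarity` or lowering `swap_probability` leaves more of the original text intact, which may help the subreddit classifier while obfuscating gender less.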

In [None]:
def obfuscate_gender(dataset_file_name: str) -> DataFrame:
  """

    add your own implementation, you may add more functions and arguments
    
  """
  return DataFrame()

In [None]:
file_name = "add file name"

In [None]:
your_test = obfuscate_gender(dataset_file_name="./test.csv")
your_test.to_csv(file_name)

In [None]:
your_test = pandas.read_csv(file_name)
assert len(your_test) == 500
assert your_test["subreddit"][0] == "funny"
assert your_test["subreddit"][-1:].item() == "relationships"

In [None]:
gender_acc, subreddit_acc = run_classifier(file_name)

assert gender_acc <= 0.5
assert subreddit_acc >= 0.6

**Report accuracy:**
*   `Gender    classification accuracy: `
*   `Subreddit classification accuracy: ` 
*   `Your commentary: ` ...

## 3. Advanced Obfuscated Model (5P)
Develop your own obfuscation model using the provided background.csv for training. Your ultimate goal should be to obfuscate text so that the classifier is unable to determine the gender of a user (no better than random guessing) without compromising the accuracy of the subreddit classification task. In other words, train a model that is good at predicting the subreddit but bad at predicting gender. The key idea in this approach is to design a model that does not encode information about protected attributes (in this case, gender). In your report, include a description of your model and results.

*  Develop your own classifier (3p)
*  Use only posts from the subreddits „CasualConversation“ and „funny“ (min. 1000 posts for each gender per subreddit) (0.5p)
*  Use sklearn models (MLPClassifier, LogisticRegression, etc.)
*  Use 90% for training and 10% for testing (0.5p)
*  In your report, include a description of your model and report the accuracy on the unmodified train data (your baseline here) as well as the modified train data and provide a brief commentary on the results (1p)
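
The data-preparation steps above (two subreddits, a minimum number of posts per gender and subreddit, a 90/10 split) could be sketched as follows. The column names follow the dataset description; that background.csv uses the same columns is an assumption, and the toy rows and the small `min_per_group` value only keep the example short.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Row = Dict[str, str]


def select_and_split(rows: List[Row], subreddits: Tuple[str, ...],
                     min_per_group: int, seed: int = 123
                     ) -> Tuple[List[Row], List[Row]]:
    """Keep posts from the given subreddits, check group sizes, split 90/10."""
    kept = [r for r in rows if r["subreddit"] in subreddits]
    counts = Counter((r["subreddit"], r["op_gender"]) for r in kept)
    for sub in subreddits:
        for gender in ("M", "W"):
            if counts[(sub, gender)] < min_per_group:
                raise ValueError(f"too few posts for {sub}/{gender}")
    random.Random(seed).shuffle(kept)  # shuffle before cutting
    cut = len(kept) * 9 // 10
    return kept[:cut], kept[cut:]


# Toy rows standing in for background.csv (the homework asks for >= 1000 per group).
toy_rows = [
    {"subreddit": s, "op_gender": g, "post_text": f"post {i}"}
    for s in ("funny", "CasualConversation", "relationships")
    for g in ("M", "W")
    for i in range(10)
]
train_rows, test_rows = select_and_split(
    toy_rows, ("funny", "CasualConversation"), min_per_group=10)
print(len(train_rows), len(test_rows))
```

The split here is a plain shuffle and cut; a stratified split would additionally keep the gender and subreddit proportions equal in both parts.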

In [None]:
"""

add your code here

"""

**Report accuracy:**
* Baseline:
  * `Gender    classification accuracy: `
  * `Subreddit classification accuracy: `
* Your Model: 
  * `Gender    classification accuracy: `
  * `Subreddit classification accuracy: ` 
*   `Your commentary: ` ...

## 4. Ethical Implications (3P)
Discuss the ethical implications of obfuscation and privacy based on the concepts covered in the lecture. Provide answers to the following points:

1.   What are demographic features (name at least three)? Briefly explain some of the privacy violation risks. (1p)
2.   Explain the cultural and social implications and their effects. In this context, discuss the information privacy paradox. You may refer to a recent example like the COVID-19 pandemic.  (1.5p)
3.   Name at least three privacy-preserving countermeasures.  (0.5p)

Your Answer: ...

1. ...
2. ...
3. ...
