# Discovering and Categorising Language Biases in Reddit

## 1. Abstract 
This paper uses word embeddings to automatically discover and categorise language biases in different Reddit communities. The authors' method finds the words most biased towards a pair of target concepts (for example, protected attributes such as gender), clusters them into conceptual biases, and categorises those biases for each subreddit.

## 2. Basic Approach
Given two sets of concepts (c1 = {he, son, his, him, father, male}, and c2 = {she, daughter, her, mother, female}) and a word embedding model, the approach is as follows:

1. Train an embedding model on the corpus and select two sets of target words
2. Select the n most biased words with respect to each concept
3. Cluster the words into k conceptual biases
4. Categorize the discovered biases


In [23]:
 # Import necessary packages
import pandas as pd
import gensim 
from gensim.models import Word2Vec
import numpy as np
import nltk
import time
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from scipy import spatial
from sklearn.cluster import KMeans
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

### 2.1 Dataset 
For demo purposes, we use a toy dataset of Reddit comments. It contains around 1,000,000 rows, one comment per row.

In [3]:
# upload the dataset
from google.colab import files
uploaded = files.upload()

Saving toy_1000_trp.csv to toy_1000_trp (1).csv


In [8]:
df= pd.read_csv("toy_1000_trp.csv")
df.head()



| Unnamed: 0 | idint | idstr | created | author | body |
|---|---|---|---|---|---|
| 0 | 26204410000.0 | t1_c1dfjia | 1295488000.0 | highpowered | First? - Anyway, no that is not normal, and it... |
| 1 | 26204420000.0 | t1_c1dfo0d | 1295490000.0 | [deleted] | > He constantly speaks about his first love an... |
| 2 | 26204420000.0 | t1_c1dfo3c | 1295490000.0 | bluewasabi | I don't know how long you and your boyfriend h... |
| 3 | 26204420000.0 | t1_c1dfpem | 1295491000.0 | IOnlyReadPostTitles | > BF lives in the past...I would like to move ... |
| 4 | 26204420000.0 | t1_c1dfpld | 1295491000.0 | [deleted] | It's normal to retain feelings for past loves,... |


In [9]:
len(df)

999999

### 2.2 Training the embedding model on toy_1000_trp.csv using word2vec

In [10]:
def TrainModel(csv_document, csv_comment_column='body', outputname='outputModel', window = 4, minf=10, epochs=100, ndim=200, lemmatiseFirst = False, verbose = True):
	'''
	Load the documents from csv_document and column csv_comment_column, trains a skipgram embedding model with given parameters and saves it in outputname.
	csv_document <str> : path to reddit csv dataset
	csv_comment_column <str> : column where comments are stored
	outputname <str> : output model name
	window = 4, minf=10, epochs=100, ndim=200, lemmatiseFirst = False : Training and preprocessing parameters
	'''

	def loadCSVAndPreprocess(path, column = 'body', nrowss=None, verbose = True):
		'''
		input:
		path <str> : path to csv file
		column <str> : column with text
		nrowss <int> : number of rows to process, leave None if all
		verbose <True/False> : verbose output
		returns:
		list of preprocessed sentences
		'''
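		trpCom = pd.read_csv(path, lineterminator='\n', nrows=nrowss)
		if lemmatiseFirst:
			# wordnet_lemmatizer was never defined in the original cell; create it here.
			# Requires nltk.download('wordnet').
			from nltk.stem import WordNetLemmatizer
			wordnet_lemmatizer = WordNetLemmatizer()
		documents = []
		for i, row in enumerate(trpCom[column]):
			if i%500000 == 0 and verbose == True:
				print('\t...processing line {}'.format(i))
			try:
				# lowercase and tokenise the comment
				pp = gensim.utils.simple_preprocess(row)
				if lemmatiseFirst:
					pp = [wordnet_lemmatizer.lemmatize(w, pos="n") for w in pp]
				documents.append(pp)
			except Exception:
				# skip rows that cannot be tokenised (for example non-string values)
				if verbose:
					print('\terror with row {}'.format(row))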
		print('Done reading all documents')
		return documents

	def trainWEModel(documents, outputfile, ndim, window, minfreq, epochss):
		'''
		documents list<str> : List of texts preprocessed
		outputfile <str> : final file will be saved in this path
		ndim <int> : embedding dimensions
		window <int> : window when training the model
		minfreq <int> : minimum frequency, words with less freq will be discarded
		epochss <int> : training epochs
		'''
		starttime = time.time()
		print('->->Starting training model {} with dimensions:{}, minf:{}, epochs:{}'.format(outputfile,ndim, minfreq, epochss))
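		# gensim 3.x API: the `size` argument is called `vector_size` in gensim >= 4.
		# sg is left at its default (0, CBOW); pass sg=1 to train skip-gram as the docstrings describe.
		# The constructor already trains once with gensim's default epoch count;
		# train() then runs `epochss` further epochs on the same documents.
		model = gensim.models.Word2Vec (documents, size=ndim, window=window, min_count=minfreq, workers=5)
		model.train(documents,total_examples=len(documents),epochs=epochss)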
		model.save(outputfile)
		print('->-> Model saved in {}'.format(outputfile))     

     
	print('->Starting with {} [{}], output {}, window {}, minf {}, epochs {}, ndim {}'.format(csv_document,csv_comment_column,outputname, window, minf, epochs, ndim))
	docs = loadCSVAndPreprocess(csv_document, csv_comment_column, nrowss=None, verbose=verbose)
	starttime = time.time()
	print('-> Output will be saved in {}'.format(outputname))
	trainWEModel(docs, outputname, ndim, window, minf, epochs)
	print('-> Model creation ended in {} seconds'.format(time.time()-starttime))

### 2.3 Get the top most biased words

$\mathrm{Bias}(w, c_1, c_2) = \cos(\vec{w}, \vec{c_1}) - \cos(\vec{w}, \vec{c_2})$

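$c_1$ and $c_2$ are the target sets defined previously, and $\vec{c_1}$, $\vec{c_2}$ are the centroids of their word embeddings. The bias of a word $w$ is the difference between its cosine similarity to each centroid. `GetTopMostBiasedWords` works with cosine distances ($1 - \cos$) instead, so it computes the same quantity as $d_2 - d_1$. The sign tells us which concept the word leans towards: a positive value means $w$ is closer to $c_1$, a negative value means it is closer to $c_2$.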
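
To make the definition concrete, here is a minimal sketch that evaluates the bias formula on hand-made 2-dimensional vectors. The vectors and the helper names (`centroid`, `cosine`, `bias`) are illustrative assumptions; they are not taken from the trained model or from the original code.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bias(w, c1_vectors, c2_vectors):
    """Bias(w, c1, c2) = cos(w, centroid(c1)) - cos(w, centroid(c2))."""
    return cosine(w, centroid(c1_vectors)) - cosine(w, centroid(c2_vectors))

# Made-up embeddings: c1 words point roughly along x, c2 words along y.
c1_vectors = [[1.0, 0.0], [1.0, 0.2]]
c2_vectors = [[0.0, 1.0], [0.2, 1.0]]
word_near_c1 = [0.9, 0.1]
word_near_c2 = [0.1, 0.9]

print(bias(word_near_c1, c1_vectors, c2_vectors))  # positive: closer to c1
print(bias(word_near_c2, c1_vectors, c2_vectors))  # negative: closer to c2
```

Swapping the two target sets flips the sign of the score, which is why the top of the ranking holds the words biased towards $c_1$ and the bottom holds the words biased towards $c_2$.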

In [11]:
sid = SentimentIntensityAnalyzer()
def GetTopMostBiasedWords(modelpath, topk, c1, c2, pos = ['JJ','JJR','JJS'], verbose = True):
	'''
	modelpath <str> : path to skipgram w2v model
	topk <int> : topk words
	c1 list<str> : list of words for target set 1
	c2 list<str> : list of words for target set 2
	pos list<str> : List of parts of speech we are interested in analysing
	verbose <bool> : True/False
	'''

	def calculateCentroid(model, words):
		# index through model.wv; indexing the model object directly is deprecated in gensim
		embeddings = [np.array(model.wv[w]) for w in words if w in model.wv]
		centroid = np.zeros(len(embeddings[0]))
		for e in embeddings:
			centroid += e
		return centroid/len(embeddings)

	def getCosineDistance(embedding1, embedding2):       
		return spatial.distance.cosine(embedding1, embedding2)


	#select the interesting subset of words based on pos
	model = Word2Vec.load(modelpath)
	words_sorted = sorted( [(k,v.index, v.count) for (k,v) in model.wv.vocab.items()] ,  key=lambda x: x[1], reverse=False)
	words = [w for w in words_sorted if nltk.pos_tag([w[0]])[0][1] in pos]

	if len(c1) < 1 or len(c2) < 1 or len(words) < 1:
		print('[!] Not enough word concepts to perform the experiment')
		return None

	centroid1, centroid2 = calculateCentroid(model, c1),calculateCentroid(model, c2)
	winfo = []
	for i, w in enumerate(words):
		word = w[0]
		freq = w[2]
		rank = w[1]
		word_pos = nltk.pos_tag([word])[0][1]	#separate name so the `pos` argument is not overwritten
		wv = model.wv[word]
		sent = sid.polarity_scores(word)['compound']
		#cosine distance = 1 - cosine similarity, so d2 - d1 equals cos(w, c1) - cos(w, c2)
		d1 = getCosineDistance(centroid1, wv)
		d2 = getCosineDistance(centroid2, wv)
		bias = d2-d1

		winfo.append({'word':word, 'bias':bias, 'freq':freq, 'pos':word_pos, 'wv':wv, 'rank':rank, 'sent':sent} )

		if(i%100 == 0 and verbose == True):
			print('...'+str(i), end="")

	#Get max and min topk biased words...
	biasc1 = sorted( winfo, key=lambda x:x['bias'], reverse=True )[:min(len(winfo), topk)]
	biasc2 = sorted( winfo, key=lambda x:x['bias'], reverse=False )[:min(len(winfo), topk)]
	#move the ts2 biases to the positive space so both lists hold positive bias strengths
	for w2 in biasc2:
		w2['bias'] = w2['bias']*-1
    
	return [biasc1, biasc2]
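
The selection step at the end of `GetTopMostBiasedWords` can be illustrated on a handful of made-up bias scores. This is only a sketch: the words, the scores, and the helper `top_k_biased` are hypothetical. Unlike the function above, it copies each entry before flipping the sign, so an entry that lands in both lists (possible when `topk` exceeds half the candidate words) is not modified in place.

```python
def top_k_biased(scored_words, topk):
    """Split scored words into the topk most biased towards c1 and towards c2.

    scored_words: list of {'word': str, 'bias': float}; a positive bias means
    closer to c1, a negative bias means closer to c2.
    """
    k = min(len(scored_words), topk)
    towards_c1 = sorted(scored_words, key=lambda x: x['bias'], reverse=True)[:k]
    towards_c2 = sorted(scored_words, key=lambda x: x['bias'])[:k]
    # Copy before flipping the sign so the input dictionaries stay untouched.
    towards_c1 = [dict(w) for w in towards_c1]
    towards_c2 = [dict(w, bias=-w['bias']) for w in towards_c2]
    return towards_c1, towards_c2

# Hypothetical bias scores, not taken from the trained model.
scored = [
    {'word': 'alpha', 'bias': 0.30},
    {'word': 'beta', 'bias': -0.25},
    {'word': 'gamma', 'bias': 0.05},
    {'word': 'delta', 'bias': -0.10},
]
b1_demo, b2_demo = top_k_biased(scored, topk=2)
print([w['word'] for w in b1_demo])               # ['alpha', 'gamma']
print([(w['word'], w['bias']) for w in b2_demo])  # [('beta', 0.25), ('delta', 0.1)]
```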

### 2.4 Clustering the biased words into conceptual biases

In [12]:

def Cluster(biasc1, biasc2, r, repeatk, verbose = True):
	'''
	biasc1 list<words> : List of words biased towards target concept1 as returned by GetTopMostBiasedWords
	biasc2 list<words> : List of words biased towards target concept2 as returned by GetTopMostBiasedWords
	r <float> : reduction factor used to determine k for k-means; k = r * (len(biasc1) + len(biasc2)) / 2
	repeatk <int> : number of k-means runs to perform; only the partition with the best intrasim is kept
	'''
	def getCosineDistance(embedding1, embedding2): 
		return spatial.distance.cosine(embedding1, embedding2)
	def getIntraSim(partition):
		iS = 0
		for cluster in partition:
			iS += getIntraSimCluster(cluster)
		return iS/len(partition)
	def getIntraSimCluster(cluster):
		if(len(cluster)==1):
			return 0
		sim = 0; c = 0
		for i in range(len(cluster)):
			w1 = cluster[i]['wv']
			for j in range(i+1, len(cluster)):
				w2 = cluster[j]['wv']
				sim+= 1-getCosineDistance(w1,w2)
				c+=1
		return sim/c
	def createPartition(embeddings, biasw, k):
		preds = KMeans (n_clusters=k).fit_predict(embeddings)
		#first create the proper clusters, then estimate avg intra sim
		all_clusters = []
		for i in range(0, k):
			clust = []
			indexes = np.where(preds == i)[0]
			for idx in indexes:
				clust.append(biasw[idx])
			all_clusters.append(clust)
		score = getIntraSim(all_clusters)
		return [score, all_clusters]


	k = int(r * (len(biasc1)+len(biasc2))/2)
	emb1, emb2  = [w['wv'] for w in biasc1], [w['wv'] for w in biasc2]
	mis1, mis2 = [0,[]], [0,[]]	#here we will save partitions with max sim for both target sets
	for run in range(repeatk):
		p1 = createPartition(emb1, biasc1, k)
		if(p1[0] > mis1[0]):
			mis1 = p1
		p2 = createPartition(emb2, biasc2, k)
		if(p2[0] > mis2[0]):
			mis2 = p2
		if(verbose == True):
			print('New partition for ts1, intrasim: ', p1[0])
			print('New partition for ts2, intrasim: ', p2[0])

	print('[*] Intrasim of best partition found for ts1, ', mis1[0])
	print('[*] Intrasim of best partition found for ts2, ', mis2[0])
	return [mis1[1], mis2[1]]
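
The two quantities that drive `Cluster` are the number of clusters `k` and the intra-cluster similarity used to pick the best partition. The sketch below reproduces both on made-up data. The helper names and the 2-dimensional vectors are assumptions for illustration; the formulas mirror `getIntraSimCluster` and `getIntraSim` above.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def intra_sim_cluster(vectors):
    """Average pairwise cosine similarity; clusters with one word score 0."""
    if len(vectors) < 2:
        return 0.0
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

def intra_sim(partition):
    """Mean of the per-cluster scores over all clusters in a partition."""
    return sum(intra_sim_cluster(c) for c in partition) / len(partition)

# Number of clusters for 300 biased words per target set and r = 0.15.
k = int(0.15 * (300 + 300) / 2)
print(k)  # 45

# Made-up vectors: one aligned pair, one orthogonal pair, one singleton.
partition = [
    [[1.0, 0.0], [2.0, 0.0]],  # same direction, similarity 1
    [[1.0, 0.0], [0.0, 1.0]],  # orthogonal, similarity 0
    [[0.5, 0.5]],              # singleton, 0 by convention
]
print(intra_sim(partition))  # (1 + 0 + 0) / 3
```

Because singleton clusters contribute 0, a partition with many one-word clusters scores lower than one whose clusters group several similar words.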
		

## 3. Demo

In [14]:
'''
Train an embeddings model using word2vec with different parameters.
'''
setup = {'csvfile': "toy_1000_trp.csv", 'outputFile': 'Models', 'w':4, 'minf': 10, 'epochs':10 ,'ndim':200}
    
TrainModel(setup['csvfile'], 
           'body',
           outputname = setup['outputFile'],
           window = setup['w'],
           minf = setup['minf'],
           epochs = setup['epochs'],
           ndim = setup['ndim'],
           verbose = False
           )

->Starting with toy_1000_trp (1).csv [body], output Models, window 4, minf 10, epochs 10, ndim 200




Done reading all documents
-> Output will be saved in Models
->->Starting training model Models with dimensions:200, minf:10, epochs:10
->-> Model saved in Models
-> Model creation ended in 224.60853600502014 seconds


In [15]:
'''
List of target sets used in this work, replace them in GetTopMostBiasedWords to obtain different sets of biases
or create your own target sets to represent a concept!
'''

women=["sister" , "female" , "woman" , "girl" , "daughter" , "she" , "hers" , "her"]
men=["brother" , "male" , "man" , "boy" , "son" , "he" , "his" , "him"]  

islam = ["allah", "ramadan", "turban", "emir", "salaam", "sunni", "koran", "imam", "sultan", "prophet", "veil", "ayatollah", "shiite", "mosque", "islam", "sheik", "muslim", "muhammad"]
christian = ["baptism", "messiah", "catholicism", "resurrection", "christianity", "salvation", "protestant", "gospel", "trinity", "jesus", "christ", "christian", "cross", "catholic", "church"]

white_names = ["harris", "nelson", "robinson", "thompson", "moore", "wright", "anderson", "clark", "jackson", "taylor", "scott", "davis", "allen", "adams", "lewis", "williams", "jones", "wilson", "martin", "johnson"]
hispanic_names= ["ruiz", "alvarez", "vargas", "castillo", "gomez", "soto", "gonzalez", "sanchez", "rivera", "mendoza", "martinez", "torres", "rodriguez", "perez", "lopez", "medina", "diaz", "garcia", "castro", "cruz"]


In [16]:
'''
Call GetTopMostBiasedWords to obtain a list of the topk words with POS = ['JJ','JJR','JJS'] 
most biased towards women and men target sets in the model.

The function returns two word lists, b1 and b2, which contain the topk words from the embedding model most biased towards
women (b1) and men (b2). 
'''

modelpath = 'Models'  #add your model here!
[b1,b2] = GetTopMostBiasedWords(
        modelpath,
        300,
        women,
        men,
        ['JJ','JJR','JJS'],
        True)



...0...100...200...300...400...500...600...700...800...900

In [17]:
'''
List all topk biased words
'''
print('biased towards ', women)
print( [b['word'] for b in b1[:30]] )
print('biased towards ', men)
print( [b['word'] for b in b2[:30]] )

biased towards  ['sister', 'female', 'woman', 'girl', 'daughter', 'she', 'hers', 'her']
['mutual', 'available', 'inexpensive', 'second', 'unplanned', 'formal', 'common', 'laughable', 'single', 'polish', 'innocuous', 'continued', 'local', 'chic', 'ethnic', 'neutral', 'genetic', 'viable', 'compatible', 'informal', 'okcupid', 'small', 'tangible', 'suitable', 'highest', 'enjoyable', 'third', 'specific', 'variable', 'probable']
biased towards  ['brother', 'male', 'man', 'boy', 'son', 'he', 'his', 'him']
['homosexual', 'ouch', 'unfriended', 'glorious', 'unapologetic', 'lest', 'respectable', 'underwear', 'mechanical', 'psychotic', 'overall', 'lustful', 'miserable', 'total', 'neurotic', 'metaphorical', 'enable', 'typical', 'ecstatic', 'obnoxious', 'dependable', 'inexperienced', 'sophisticated', 'stupid', 'ludicrous', 'delighted', 'pathetic', 'hippy', 'nuclear', 'honorable']


In [18]:

'''
Every word returned by GetTopMostBiasedWords has the following attributes:
word : Word 
bias : Bias strength towards target set 1 (in this example) when compared to target set 2
freq : Frequency of word in the vocabulary of the model
pos  : Part of speech as determined by NLTK
wv   : Embedding of the word, used for clustering later
rank : Frequency ranking of the word in model's vocabulary
sent : Sentiment of word [-1,1], as determined by nltk.sentiment.vader

Here we show the first word biased towards women in the toy dataset
'''
b1[0]

{'bias': 0.16688398267389626,
 'freq': 1678,
 'pos': 'JJ',
 'rank': 715,
 'sent': 0.0,
 'word': 'mutual',
 'wv': array([-0.35153508,  2.4677126 ,  1.0783716 ,  1.9673408 ,  1.7866459 ,
        -0.5691257 ,  2.5474048 ,  2.5896962 ,  0.00767454,  2.3252208 ,
        -1.1459427 ,  1.0371476 ,  0.20932662, -1.2491125 , -1.6734018 ,
        -2.2402112 ,  0.63953424,  1.8222603 ,  0.76552784,  0.5116544 ,
        -1.7834792 ,  0.993913  , -1.4597697 ,  1.3452007 ,  0.2182632 ,
         0.20111796, -0.8610432 , -0.81632304,  0.27136183,  0.96343505,
        -1.3894559 ,  0.05518913,  1.5285295 ,  0.9776253 , -0.38843372,
        -2.425641  , -0.8800297 , -0.39002603, -1.0597271 ,  0.337895  ,
        -1.1285799 ,  0.89519095, -1.7965282 , -0.8947466 , -1.5264554 ,
        -0.34988675, -1.8221968 , -0.9596129 , -1.2337615 ,  0.6578687 ,
        -0.7387683 ,  0.10865816,  0.9348661 ,  0.1473172 , -0.2946774 ,
         0.9619846 ,  3.7789378 ,  0.3296489 ,  0.9853502 , -0.72702414,
         1.5

In [19]:
'''
Here we show the sixth word (index 5) biased towards men in the toy dataset
'''
b2[5]

{'bias': 0.17531354396275067,
 'freq': 43,
 'pos': 'JJS',
 'rank': 6389,
 'sent': 0.0,
 'word': 'lest',
 'wv': array([ 0.41634107, -0.44387686, -0.00985347, -0.21305421,  0.5504241 ,
        -0.18487473,  0.2682296 ,  0.49028108,  0.1429584 , -0.52414817,
         0.07344105,  0.34229323,  0.12309747, -0.01019612, -0.3162536 ,
        -0.26175988, -0.01533773,  0.07527173, -0.2821663 , -0.470527  ,
         0.06452382,  0.18299761,  0.1139118 ,  0.72259825, -0.10858902,
         0.26592755,  0.38285235,  0.09760041, -0.28722137, -0.11048467,
        -0.06294207,  0.09396516, -0.16292803,  0.40241387,  0.0678149 ,
        -0.02434599, -0.06206103, -0.4851445 , -0.58348846, -0.24393375,
         0.06535346, -0.20651348, -0.13795072,  0.17772835, -0.22885354,
        -0.12020966, -0.15249254,  0.03381014,  0.15814622, -0.45963234,
        -0.43462208, -0.07656372,  0.16956717, -0.0106379 , -0.6811642 ,
         0.36787388,  0.0735056 ,  0.34158555,  0.5385481 ,  0.3713525 ,
         0.405

In [20]:
'''
Cluster words into concepts by leveraging their embedding distributions, where
b1 : list of biased words towards target set 1
b2 : list of biased words towards target set 2
r  : r parameter for k-means clustering, where k = r*len(b)
100: number of k-means repetitions, keeping the partition with the best intrasim

The function returns:
List of clusters in a partition and words clustered in each cluster, for both target sets (cl1, cl2).
'''
[cl1,cl2] = Cluster(b1,b2, 0.15, 100)

New partition for ts1, intrasim:  0.18748823020857017
New partition for ts2, intrasim:  0.1089224949232526
New partition for ts1, intrasim:  0.1652583090662702
New partition for ts2, intrasim:  0.13907001723510964
New partition for ts1, intrasim:  0.15556529131303576
New partition for ts2, intrasim:  0.13631719497399303
New partition for ts1, intrasim:  0.1646673558580657
New partition for ts2, intrasim:  0.1211439098790006
New partition for ts1, intrasim:  0.17705168786452222
New partition for ts2, intrasim:  0.13905212268457431
New partition for ts1, intrasim:  0.15846734465658824
New partition for ts2, intrasim:  0.10756914874806522
New partition for ts1, intrasim:  0.15938932687253232
New partition for ts2, intrasim:  0.13175428691533567
New partition for ts1, intrasim:  0.15481209208020122
New partition for ts2, intrasim:  0.10627994930023228
New partition for ts1, intrasim:  0.1715843015137376
New partition for ts2, intrasim:  0.08829359400141044
New partition for ts1, intrasim: 

In [21]:
'''
Exploring the conceptual biases from partition biased towards ts1, only printing the words in each cluster.
'''
#conceptual biases for target set 1
print(len(cl1))
for cluster in cl1:
    print( [k['word'] for k in cluster] )

45
['local', 'wide', 'italian', 'uni', 'urban', 'slower', 'musical', 'foreign', 'vegetarian', 'new', 'green', 'fresh']
['ready']
['easier']
['common']
['neutral', 'viable', 'suitable', 'probable', 'intentional', 'plausible', 'broad', 'significant', 'familiar', 'preferable', 'impersonal', 'unreliable', 'invasive', 'unusual', 'applicable', 'beneficial', 'definitive', 'productive', 'false', 'modest', 'disappointed', 'unnatural', 'knowledgeable', 'unable', 'unapproachable', 'impressive', 'risky', 'illegal', 'unimportant', 'satisfied', 'rigid', 'unwilling', 'accepted', 'humorous', 'vocal', 'unclear', 'questionable', 'offensive', 'trivial']
['current', 'dynamic', 'cultural', 'political', 'potential', 'various', 'moral', 'religious', 'asian']
['difficult', 'hard']
['exclusive']
['comfortable', 'uncomfortable']
['similar']
['interested']
['few', 'several']
['enjoyable', 'specific', 'enthusiastic', 'romantic', 'creative', 'approachable', 'adventurous', 'memorable', 'private', 'flirtatious', 'se

In [22]:
'''
Exploring the conceptual biases from partition biased towards ts2, only printing the words in each cluster.
'''
print(len(cl2))
for cluster in cl2:
  print( [k['word'] for k in cluster] )

45
['homosexual', 'ouch', 'unfriended', 'glorious', 'unapologetic', 'lest', 'underwear', 'mechanical', 'psychotic', 'lustful', 'metaphorical', 'enable', 'ecstatic', 'dependable', 'sophisticated', 'ludicrous', 'delighted', 'hippy', 'nuclear', 'bearable', 'tic', 'literary', 'vicious', 'unlovable', 'uncalled', 'argumentative', 'preppy', 'sympathetic', 'notable', 'unkempt', 'enigmatic', 'unfaithful', 'antisocial', 'idiotic', 'electric', 'unexperienced', 'vietnamese', 'furious', 'pompous', 'asynchronous', 'induced', 'devious', 'envious', 'unemotional', 'overdrive', 'pedantic', 'horrendous', 'infamous', 'injured', 'wisest', 'freudian', 'facetious', 'courageous', 'swedish', 'improbable', 'interchangeable', 'psychic', 'invisible', 'optional', 'largest', 'plural', 'mathematical', 'extraordinary', 'unproductive', 'likable', 'unmotivated', 'urbandictionary', 'irritable', 'irresistible', 'unworthy', 'tactical', 'observational', 'offish', 'unfavorable', 'steady', 'inquisitive', 'corporate', 'unsati
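
As a possible aid when reading the clusters, they can be summarised using the fields already stored on each word (`word`, `bias`, `sent`). The sketch below is an assumption for illustration only: `summarise_clusters` and the demo clusters are hypothetical, and this is not the categorisation method of the paper.

```python
def summarise_clusters(clusters):
    """Return one summary dict per cluster, largest cluster first."""
    summaries = []
    for cluster in clusters:
        n = len(cluster)
        summaries.append({
            'words': [w['word'] for w in cluster],
            'size': n,
            'mean_bias': sum(w['bias'] for w in cluster) / n,
            'mean_sent': sum(w['sent'] for w in cluster) / n,
        })
    return sorted(summaries, key=lambda s: s['size'], reverse=True)

# Made-up clusters using the same keys as the dictionaries in b1 and b2.
demo_clusters = [
    [{'word': 'w1', 'bias': 0.2, 'sent': 0.0}],
    [{'word': 'w2', 'bias': 0.1, 'sent': 0.5},
     {'word': 'w3', 'bias': 0.3, 'sent': -0.5}],
]
demo_summaries = summarise_clusters(demo_clusters)
for s in demo_summaries:
    print(s['size'], s['words'], round(s['mean_bias'], 3), round(s['mean_sent'], 3))
```

With the real output, the same call would be `summarise_clusters(cl1)` or `summarise_clusters(cl2)`.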

# Reference

Some of the code is reused from https://github.com/xfold/LanguageBiasesInReddit
