<a href="https://colab.research.google.com/github/spatank/Curiosity/blob/master/wiki_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Make data available to Colab by mounting your Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/My Drive/Curiosity_IGT/')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!ls # run !ls to verify location 

enwiki_20180420_300d.pkl  wiki_1.ipynb	WikiData_Shubhankar.csv


# Import Packages

In [3]:
!pip install wikipedia2vec
import pandas as pd
import numpy as np
import networkx as nx
from scipy.spatial.distance import cosine
from scipy.io import savemat
from wikipedia2vec import Wikipedia2Vec



# Data Wrangling

This section pre-processes the KNOT data into a format suited to network analysis. Each row records one page transition made by a participant: the `SourceName` column holds the Wikipedia page they left and the `TargetName` column holds the page they visited next.


In [4]:
wiki_df = pd.read_csv('WikiData_Shubhankar.csv')
wiki_df.head(3)

Unnamed: 0,ID,SourceName,TargetName,Day,TimeOrder,Hyperlink,DistanceWeights,AgeYears,SexOrient,Race,GenderFactor,EducDeg,Income,JE_5D,DS_5D,ST_5D,SC_5D,TS_5D,Count,Weight
0,101,/wiki/Jeff_Bezos,/wiki/Cloud_infrastructure,1,1,no,1.0,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,1,0.0
1,101,/wiki/Cloud_infrastructure,/wiki/Cloud_computing_security,1,2,yes,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,2,0.8
2,101,/wiki/Cloud_computing_security,/wiki/Cloud_infrastructure,1,3,no,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,3,0.8


In [5]:
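def clean_entity_name(name):
  """Turn a page path such as '/wiki/Jeff_Bezos' into the title 'Jeff Bezos'."""
  name = name.replace('/wiki/', '')  # drop the URL prefix
  name = name.replace('_', ' ')      # underscores stand for spaces in page titles
  return name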

First, we create a unique identifier (UID) for each page so that pages can be used as nodes in a network representation. Then we clean the string associated with each page using `clean_entity_name`, which removes the `/wiki/` prefix and replaces underscores with spaces. The UIDs and clean names are appended to the data frame as new columns.

In [6]:
# create UID for each page that appears as a source or a target
source_nodes = set(wiki_df['SourceName'].tolist())
target_nodes = set(wiki_df['TargetName'].tolist())
source_nodes.update(target_nodes)
# map each page name to an integer UID
node_set = {entity: uid for uid, entity in enumerate(source_nodes)}
wiki_df['SourceUID'] = wiki_df['SourceName'].map(node_set)
wiki_df['SrcNameClean'] = wiki_df['SourceName'].apply(clean_entity_name)
wiki_df['TargetUID'] = wiki_df['TargetName'].map(node_set)
wiki_df['TgtNameClean'] = wiki_df['TargetName'].apply(clean_entity_name)
wiki_df.head(3)

Unnamed: 0,ID,SourceName,TargetName,Day,TimeOrder,Hyperlink,DistanceWeights,AgeYears,SexOrient,Race,GenderFactor,EducDeg,Income,JE_5D,DS_5D,ST_5D,SC_5D,TS_5D,Count,Weight,SourceUID,SrcNameClean,TargetUID,TgtNameClean
0,101,/wiki/Jeff_Bezos,/wiki/Cloud_infrastructure,1,1,no,1.0,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,1,0.0,8289,Jeff Bezos,16904,Cloud infrastructure
1,101,/wiki/Cloud_infrastructure,/wiki/Cloud_computing_security,1,2,yes,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,2,0.8,16904,Cloud infrastructure,375,Cloud computing security
2,101,/wiki/Cloud_computing_security,/wiki/Cloud_infrastructure,1,3,no,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,3,0.8,375,Cloud computing security,16904,Cloud infrastructure
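
One caveat about the UIDs: they come from enumerating a Python `set` of strings, and the iteration order of such a set can change from one interpreter session to the next, so a page may receive a different UID on a re-run. The following is a minimal sketch of a reproducible variant, assuming it is acceptable to sort the page names before numbering them. The `transitions` list and the name `page_to_uid` are illustrative stand-ins, not part of the notebook.

```python
# Toy transitions standing in for the SourceName / TargetName columns.
transitions = [
    ('/wiki/Jeff_Bezos', '/wiki/Cloud_infrastructure'),
    ('/wiki/Cloud_infrastructure', '/wiki/Cloud_computing_security'),
    ('/wiki/Cloud_computing_security', '/wiki/Cloud_infrastructure'),
]

# Collect every page that appears as either a source or a target.
pages = {page for pair in transitions for page in pair}

# Sorting before enumerating gives the same page -> UID mapping on every run.
page_to_uid = {page: uid for uid, page in enumerate(sorted(pages))}

for source, target in transitions:
    print(page_to_uid[source], '->', page_to_uid[target])
```

Since the networks below are keyed by clean page names rather than UIDs, this only matters if the UID columns are used downstream.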


# Wikipedia2Vec

Each row in the data frame represents a transition made by a participant from one Wikipedia page to another. We weight the corresponding edge by the semantic distance between the two pages. To quantify this distance, we use a pre-trained Wikipedia2Vec model that represents each page as a 300-dimensional vector. The distance between two pages, `SemanticDist`, is then computed as the cosine distance (one minus the cosine similarity) between their vector representations.
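
To make the distance measure concrete, here is a small pure-Python sketch of cosine distance, defined as one minus the cosine of the angle between two vectors. The notebook itself relies on `scipy.spatial.distance.cosine`; the helper name `cosine_distance` and the toy vectors are only for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Vectors pointing the same way are at distance ~0, regardless of length.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Orthogonal vectors are at distance 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Opposite vectors are at the maximum distance of 2.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

The measure depends only on direction, not magnitude, so it ranges from 0 (same direction) to 2 (opposite directions).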

In [7]:
model_file = 'enwiki_20180420_300d.pkl'
wiki2vec = Wikipedia2Vec.load(model_file)
# v1 = wiki2vec.get_entity_vector('Jeff Bezos')
# v2 = wiki2vec.get_entity_vector('Bill Gates')

In [8]:
def check_entity_vector(entity):
  """Return 1 if the model has no vector for `entity`, 0 otherwise."""
  try:
    wiki2vec.get_entity_vector(entity)
    return 0
  except KeyError:
    return 1

In [9]:
# one flag per unique page: 1 if the page has no embedding, 0 otherwise
no_vec_entities = []
for page in node_set:
  entity = clean_entity_name(page)
  no_vec_entities.append(check_entity_vector(entity))

In [10]:
len(no_vec_entities)

18378

In [11]:
sum(no_vec_entities)

2207

About 12% of the pages visited by participants in the KNOT data (2,207 of 18,378) have no corresponding vector embedding. We represent each of these pages by a random vector.

In [12]:
def semantic_dist(entity_1, entity_2):
  # get entity 1 vector
  try:
    v1 = wiki2vec.get_entity_vector(entity_1)
  except KeyError:
    v1 = np.random.random(300)
  # get entity 2 vector
  try:
    v2 = wiki2vec.get_entity_vector(entity_2)
  except KeyError:
    v2 = np.random.random(300)

  return cosine(v1, v2)
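
Note that `semantic_dist` draws a fresh random vector on every call, so a page without an embedding is represented by a different vector each time it appears. Also, `np.random.random` samples each component from [0, 1), so two such fallback vectors always point in broadly the same direction and their cosine distance clusters around 0.25. The sketch below shows one possible alternative, not the notebook's method: cache a single zero-mean Gaussian vector per missing page. The names `fallback_vector` and `fallback_vectors`, the seed, and the choice of a Gaussian are all assumptions made for illustration.

```python
import random

DIM = 300                  # matches the 300-dimensional model used above
rng = random.Random(0)     # fixed seed (an assumption) so reruns are reproducible
fallback_vectors = {}      # hypothetical cache: entity name -> random vector

def fallback_vector(entity):
    """Return the same random vector every time `entity` is requested."""
    if entity not in fallback_vectors:
        # Zero-mean components, so fallback vectors have no shared direction.
        fallback_vectors[entity] = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    return fallback_vectors[entity]

first = fallback_vector('Some missing page')
second = fallback_vector('Some missing page')
other = fallback_vector('Another missing page')
print(first is second, first == other)
```

With such a cache, the `except KeyError` branches could call `fallback_vector(entity)` so that repeated visits to the same page yield consistent distances.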

In [13]:
wiki_df['SemanticDist'] = wiki_df.apply(lambda x: semantic_dist(x['SrcNameClean'], x['TgtNameClean']), axis = 1)
wiki_df.head(5)

Unnamed: 0,ID,SourceName,TargetName,Day,TimeOrder,Hyperlink,DistanceWeights,AgeYears,SexOrient,Race,GenderFactor,EducDeg,Income,JE_5D,DS_5D,ST_5D,SC_5D,TS_5D,Count,Weight,SourceUID,SrcNameClean,TargetUID,TgtNameClean,SemanticDist
0,101,/wiki/Jeff_Bezos,/wiki/Cloud_infrastructure,1,1,no,1.0,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,1,0.0,8289,Jeff Bezos,16904,Cloud infrastructure,0.603374
1,101,/wiki/Cloud_infrastructure,/wiki/Cloud_computing_security,1,2,yes,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,2,0.8,16904,Cloud infrastructure,375,Cloud computing security,0.337638
2,101,/wiki/Cloud_computing_security,/wiki/Cloud_infrastructure,1,3,no,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,3,0.8,375,Cloud computing security,16904,Cloud infrastructure,0.337638
3,101,/wiki/Cloud_infrastructure,/wiki/Information_technology,1,4,yes,0.8,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,4,0.2,16904,Cloud infrastructure,13764,Information technology,0.625812
4,101,/wiki/Information_technology,/wiki/Computer_language,1,5,no,0.6,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,5,0.4,13764,Information technology,16030,Computer language,0.708145


# Create Individual Networks

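Next, we split the data set by individual and generate a network representation of each participant's Wikipedia exploration. Nodes are the cleaned page names (`SrcNameClean`, `TgtNameClean`), edges are weighted by `SemanticDist`, and transitions are added in `TimeOrder` so that the adjacency matrix can be recorded after every step.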

In [14]:
# split the data by individual
ID_groups = wiki_df.groupby('ID')
for ID, group in ID_groups:
  print(ID)
  # enforce time ordering (sort into a new frame rather than modifying the group in place)
  group = group.sort_values(by = ['TimeOrder'])
  network_df = group[['TimeOrder', 'SourceUID', 'SrcNameClean', 'TargetUID', 'TgtNameClean', 'SemanticDist']].reset_index(drop = True)
  # create an empty network
  G = nx.Graph()
  all_adj = []
  edge_info = []
  # incrementally add nodes and edges to the network
  for index, row in network_df.iterrows():
    from_node = row.get('SrcNameClean')
    to_node = row.get('TgtNameClean')
    edge_weight = row.get('SemanticDist')
    edge_info_dict = {'from': from_node, 'to': to_node, 'weight': edge_weight}
    edge_info.append(edge_info_dict)
    # add edge to the network
    G.add_edge(from_node, to_node, weight = edge_weight)
    # snapshot the weighted adjacency matrix after this transition
    adj_G = nx.adjacency_matrix(G, weight = 'weight')
    all_adj.append(adj_G)
  # save subject data to .mat file
  filename = 'subj_' + str(ID) + '.mat'
  mdic = {'subj': ID, 'all_adj': all_adj, 'edge_info': edge_info}
  savemat(filename, mdic)

101
104
105
106
107
108
109
112
114
115
117
119
120
121
122
126
127
128
130
131
132
135
138
139
140
141
146
150
153
154
155
156
157
158
159
162
164
165
167
169
171
173
174
176
177
179
183
185
188
189
190
191
192
194
196
197
198
199
201
204
206
207
208
209
210
211
212
214
216
217
219
220
221
223
224
225
226
228
229
231
232
234
235
236
238
239
240
242
243
246
247
248
249
251
253
255
256
258
259
261
263
264
266
267
268
269
271
273
278
280
286
287
288
290
291
293
296
297
304
308
309
310
311
312
313
316
318
319
321
322
323
324
325
327
328
329
335
338
339
340
342
349
351
353
355
356
359
363
340340


In [15]:
len(ID_groups)

149
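
The incremental construction in the loop above can be illustrated on a toy sequence of transitions. This is a dependency-free sketch that mimics, rather than reproduces, what the notebook does with `nx.Graph` and `adjacency_matrix`: edges are undirected, nodes are ordered by first appearance, and one adjacency snapshot is stored per transition. The function name `adjacency_snapshots` is hypothetical; the weights are taken from the first rows of the `SemanticDist` output shown earlier.

```python
def adjacency_snapshots(edges):
    """Return (nodes, snapshots) with one dense adjacency matrix per added edge."""
    nodes = []       # nodes in order of first appearance
    weights = {}     # frozenset({u, v}) -> weight, i.e. an undirected edge
    snapshots = []
    for u, v, w in edges:
        for node in (u, v):
            if node not in nodes:
                nodes.append(node)
        # Revisiting an existing edge overwrites its weight instead of duplicating it.
        weights[frozenset((u, v))] = w
        matrix = [[weights.get(frozenset((a, b)), 0.0) for b in nodes]
                  for a in nodes]
        snapshots.append(matrix)
    return nodes, snapshots

toy_edges = [
    ('Jeff Bezos', 'Cloud infrastructure', 0.603374),
    ('Cloud infrastructure', 'Cloud computing security', 0.337638),
    ('Cloud computing security', 'Cloud infrastructure', 0.337638),
]
nodes, snapshots = adjacency_snapshots(toy_edges)
for step, matrix in enumerate(snapshots, start = 1):
    print(step, len(matrix), 'x', len(matrix))
```

The matrix grows only when a transition introduces a new page. The third transition retraces an existing edge, so the third snapshot is identical to the second.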