<a href="https://colab.research.google.com/github/spatank/Curiosity/blob/master/wiki_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Make data available to Colab by mounting your Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/My Drive/Curiosity_IGT/')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!ls # run !ls to verify location 

enwiki_20180420_300d.pkl	   wiki_1.ipynb
enwiki_20180420_300d.pkl.bz2	   WikiData_Shubhankar.csv
enwiki_20180420_nolg_300d.txt.bz2


# Import Packages

In [3]:
!pip install wikipedia2vec
import pandas as pd
import numpy as np
import networkx as nx
from scipy.spatial.distance import cosine
from wikipedia2vec import Wikipedia2Vec



# Data Wrangling

This section pre-processes the KNOT data to get it into a format suited to network analysis. The data contains the names of Wikipedia pages visited by a participant in two columns: SourceName and TargetName. 


In [4]:
wiki_df = pd.read_csv('WikiData_Shubhankar.csv')
wiki_df.head(3)

Unnamed: 0,ID,SourceName,TargetName,Day,TimeOrder,Hyperlink,DistanceWeights,AgeYears,SexOrient,Race,GenderFactor,EducDeg,Income,JE_5D,DS_5D,ST_5D,SC_5D,TS_5D,Count,Weight
0,101,/wiki/Jeff_Bezos,/wiki/Cloud_infrastructure,1,1,no,1.0,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,1,0.0
1,101,/wiki/Cloud_infrastructure,/wiki/Cloud_computing_security,1,2,yes,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,2,0.8
2,101,/wiki/Cloud_computing_security,/wiki/Cloud_infrastructure,1,3,no,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,3,0.8


In [5]:
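def clean_entity_name(name):
  """Turn a page path such as '/wiki/Jeff_Bezos' into the title 'Jeff Bezos'."""
  # note: percent-encoded characters (e.g. %27) are left as-is
  name = name.replace('/wiki/', '')
  name = name.replace('_', ' ')
  return name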
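
Some page paths in the data are percent-encoded (the node list printed later in this notebook contains names such as `Maxwell%27s equations`). A minimal sketch of a variant that also decodes these sequences, assuming the standard-library `urllib.parse.unquote` is acceptable here. The name `clean_entity_name_decoded` is hypothetical, and whether decoded titles match more entries in the embedding model has not been checked:

```python
from urllib.parse import unquote

def clean_entity_name_decoded(name):
  """Hypothetical variant of clean_entity_name that also decodes %XX sequences."""
  name = name.replace('/wiki/', '')
  name = unquote(name)           # e.g. '%27' becomes an apostrophe
  name = name.replace('_', ' ')
  return name

print(clean_entity_name_decoded('/wiki/Maxwell%27s_equations'))
```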

First, we create unique identifiers (UIDs) for each page so that pages can serve as nodes in a network representation. Then we clean the string associated with each page by stripping the `/wiki/` prefix and replacing underscores with spaces. The UIDs and clean names are appended to the data frame as new columns.

In [6]:
# create UID for each page
all_nodes = set(wiki_df['SourceName'].tolist())
all_nodes.update(wiki_df['TargetName'].tolist())
# note: set iteration order is arbitrary, so UIDs can differ between sessions
node_set = {entity: uid for uid, entity in enumerate(all_nodes)}
wiki_df['SourceUID'] = wiki_df['SourceName'].map(node_set)
wiki_df['SrcNameClean'] = wiki_df['SourceName'].apply(clean_entity_name)
wiki_df['TargetUID'] = wiki_df['TargetName'].map(node_set)
wiki_df['TgtNameClean'] = wiki_df['TargetName'].apply(clean_entity_name)
wiki_df.head(3)

Unnamed: 0,ID,SourceName,TargetName,Day,TimeOrder,Hyperlink,DistanceWeights,AgeYears,SexOrient,Race,GenderFactor,EducDeg,Income,JE_5D,DS_5D,ST_5D,SC_5D,TS_5D,Count,Weight,SourceUID,SrcNameClean,TargetUID,TgtNameClean
0,101,/wiki/Jeff_Bezos,/wiki/Cloud_infrastructure,1,1,no,1.0,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,1,0.0,2552,Jeff Bezos,16296,Cloud infrastructure
1,101,/wiki/Cloud_infrastructure,/wiki/Cloud_computing_security,1,2,yes,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,2,0.8,16296,Cloud infrastructure,1737,Cloud computing security
2,101,/wiki/Cloud_computing_security,/wiki/Cloud_infrastructure,1,3,no,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,3,0.8,1737,Cloud computing security,16296,Cloud infrastructure
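
The UIDs above come from enumerating a Python set, whose iteration order is not guaranteed to be the same from one session to the next. If reproducible UIDs matter, one option is to sort the page names before numbering them. A small sketch on toy data (the frame `toy_df` and the mapping `uid_of` are illustrative names, not part of the notebook):

```python
import pandas as pd

# toy stand-in for the SourceName / TargetName columns
toy_df = pd.DataFrame({
  'SourceName': ['/wiki/B', '/wiki/A', '/wiki/C'],
  'TargetName': ['/wiki/A', '/wiki/C', '/wiki/B'],
})

# sorting the union of both columns makes the numbering deterministic
all_pages = sorted(set(toy_df['SourceName']) | set(toy_df['TargetName']))
uid_of = {page: uid for uid, page in enumerate(all_pages)}

toy_df['SourceUID'] = toy_df['SourceName'].map(uid_of)
toy_df['TargetUID'] = toy_df['TargetName'].map(uid_of)
print(toy_df)
```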


# Wikipedia2Vec

Each row in the data frame represents a transition made by a participant from one Wikipedia page to another. The associated edge weight can be obtained as the semantic distance between the contents of the two pages. To quantify the semantic distance between Wikipedia entities, we use a pre-trained Wikipedia2Vec model that represents each page as a 300-dimensional vector. The distance between two pages, `SemanticDist`, is then computed as the cosine distance (one minus the cosine similarity) between their vector representations.
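
To make the distance concrete, here is a small sketch that writes the cosine distance out with NumPy and compares it with `scipy.spatial.distance.cosine` on toy 2-dimensional vectors. The helper `cosine_distance` and the vectors `u`, `v`, `w` are illustrative only:

```python
import numpy as np
from scipy.spatial.distance import cosine

def cosine_distance(u, v):
  """1 - (u . v) / (|u| |v|), written out with NumPy."""
  return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])   # orthogonal to u
w = np.array([2.0, 0.0])   # same direction as u, different length

print(cosine_distance(u, v), cosine(u, v))
print(cosine_distance(u, w), cosine(u, w))
```

The distance depends only on the angle between the vectors, not on their lengths, so `u` and `w` are at distance zero.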

In [7]:
model_file = 'enwiki_20180420_300d.pkl'
wiki2vec = Wikipedia2Vec.load(model_file)
# v1 = wiki2vec.get_entity_vector('Jeff Bezos')
# v2 = wiki2vec.get_entity_vector('Bill Gates')

In [8]:
def check_entity_vector(entity):
  """Return 0 if the model has a vector for `entity`, 1 otherwise."""
  try:
    wiki2vec.get_entity_vector(entity)
    return 0
  except KeyError:
    return 1

In [9]:
# flag every page (by clean name) that lacks an embedding: 1 = missing
no_vec_entities = []
for page in node_set:
  entity = clean_entity_name(page)
  no_vec_entities.append(check_entity_vector(entity))

In [10]:
len(no_vec_entities)

18378

In [11]:
sum(no_vec_entities)

2207

About 12% of the pages visited by participants in the KNOT data (2,207 of 18,378) do not have corresponding vector embeddings. We represent each of these pages by a random vector, drawn anew every time the page appears in a transition.
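
One property of this fallback is worth keeping in mind. `np.random.random` draws every coordinate from [0, 1), so two such vectors point in broadly the same direction, and their cosine distance is expected to sit near 0.25 rather than near 1. A small check on synthetic vectors, with no model needed (the seed and the number of pairs are arbitrary choices):

```python
import numpy as np
from scipy.spatial.distance import cosine

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

# cosine distance between pairs of independent uniform [0, 1) vectors
dists = [cosine(rng.random(300), rng.random(300)) for _ in range(200)]
mean_dist = float(np.mean(dists))
print(round(mean_dist, 2))
```

In other words, a transition between two pages that both lack embeddings would tend to look semantically close under this scheme.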

In [12]:
def semantic_dist(entity_1, entity_2):
  """Cosine distance between two entities; an entity without an embedding gets a random vector."""
  # get entity 1 vector
  try:
    v1 = wiki2vec.get_entity_vector(entity_1)
  except KeyError:
    v1 = np.random.random(300)
  # get entity 2 vector
  try:
    v2 = wiki2vec.get_entity_vector(entity_2)
  except KeyError:
    v2 = np.random.random(300)

  return cosine(v1, v2)

In [13]:
wiki_df['SemanticDist'] = wiki_df.apply(lambda x: semantic_dist(x['SrcNameClean'], x['TgtNameClean']), axis = 1)
wiki_df.head(5)

Unnamed: 0,ID,SourceName,TargetName,Day,TimeOrder,Hyperlink,DistanceWeights,AgeYears,SexOrient,Race,GenderFactor,EducDeg,Income,JE_5D,DS_5D,ST_5D,SC_5D,TS_5D,Count,Weight,SourceUID,SrcNameClean,TargetUID,TgtNameClean,SemanticDist
0,101,/wiki/Jeff_Bezos,/wiki/Cloud_infrastructure,1,1,no,1.0,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,1,0.0,2552,Jeff Bezos,16296,Cloud infrastructure,0.603374
1,101,/wiki/Cloud_infrastructure,/wiki/Cloud_computing_security,1,2,yes,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,2,0.8,16296,Cloud infrastructure,1737,Cloud computing security,0.337638
2,101,/wiki/Cloud_computing_security,/wiki/Cloud_infrastructure,1,3,no,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,3,0.8,1737,Cloud computing security,16296,Cloud infrastructure,0.337638
3,101,/wiki/Cloud_infrastructure,/wiki/Information_technology,1,4,yes,0.8,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,4,0.2,16296,Cloud infrastructure,15758,Information technology,0.625812
4,101,/wiki/Information_technology,/wiki/Computer_language,1,5,no,0.6,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,5,0.4,15758,Information technology,14457,Computer language,0.708145


# Create Individual Networks

Next, we split the data set by individual, and use the `SourceUID`, `TargetUID`, and `SemanticDist` columns to generate network representations of participants' Wikipedia exploration.

In [14]:
# split the data by individual
network_cols = ['SourceUID', 'SrcNameClean', 'TargetUID', 'TgtNameClean', 'SemanticDist']
networks = {}
for ID, group in wiki_df.groupby('ID'):
  # enforce time ordering
  group = group.sort_values(by = ['TimeOrder'])
  networks[ID] = group[network_cols]

# inspect the last participant's edge list as an example
network_df = networks[ID]
network_df.head(5)

Unnamed: 0,SourceUID,SrcNameClean,TargetUID,TgtNameClean,SemanticDist
27760,6151,Financial market,5429,Asset pricing,0.465183
27761,5429,Asset pricing,6151,Financial market,0.465183
27762,6151,Financial market,5429,Asset pricing,0.465183
27763,5429,Asset pricing,12700,BlackRock,0.558683
27764,12700,BlackRock,18050,The Blackstone Group,0.554321


In [15]:
# alternative keyed on UIDs instead of names:
# G = nx.from_pandas_edgelist(network_df, 'SourceUID', 'TargetUID', edge_attr = 'SemanticDist')
G = nx.from_pandas_edgelist(network_df, 'SrcNameClean', 'TgtNameClean', edge_attr = 'SemanticDist')
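
By default `nx.from_pandas_edgelist` builds an undirected `nx.Graph`, so a repeated transition, or a transition and its reverse (such as Financial market and Asset pricing in the edge list above), collapses into a single edge. Whether that is desirable depends on the analysis. The sketch below, on a toy edge list with illustrative names, compares the edge counts obtained with different `create_using` graph types:

```python
import pandas as pd
import networkx as nx

# toy transitions: A -> B, B -> A, then A -> B again
toy_df = pd.DataFrame({
  'SrcNameClean': ['A', 'B', 'A'],
  'TgtNameClean': ['B', 'A', 'B'],
  'SemanticDist': [0.5, 0.5, 0.5],
})

def build(graph_type):
  """Build a graph of the given type from the toy edge list."""
  return nx.from_pandas_edgelist(toy_df, 'SrcNameClean', 'TgtNameClean',
                                 edge_attr='SemanticDist', create_using=graph_type)

for graph_type in (nx.Graph, nx.DiGraph, nx.MultiDiGraph):
  print(graph_type.__name__, build(graph_type).number_of_edges())
```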

In [16]:
G.nodes

NodeView(('Financial market', 'Asset pricing', 'BlackRock', 'The Blackstone Group', 'Investment management', 'Chartered Financial Analyst', 'Asset management', 'List of asset management firms', 'The Vanguard Group', 'Physics', 'Natural science', 'Quantum mechanics', 'List of quantum-mechanical systems with analytical solutions', 'Discrete mathematics', 'Planck constant', 'Max Planck', 'Theoretical physics', 'Maxwell%27s equations', 'Maxwell relations', 'James Clerk Maxwell', 'Magnetism', 'Paramagnetism', 'Magnetic moment', 'Cross product', 'Executive education', 'Financial Times', 'Founders Brewing Company', 'India pale ale', 'Hops', 'Whey protein', 'Casein', 'Cheese', 'Mozzarella', 'Pizza', 'Pizza Margherita', 'Basil', 'Pizza marinara', 'Neapolitan pizza', 'New York-style pizza', 'Totonno%27s', 'Lombardi%27s Pizza', 'Old ale', 'Pho', 'Goat cheese', 'Fromage', 'Kanye West', 'Yeezus', 'The Yeezus Tour', 'Eric the Actor', 'The Howard Stern Show', 'Private equity', 'Investment fund', 'Inv

In [17]:
G.edges(data = True)

EdgeDataView([('Financial market', 'Asset pricing', {'SemanticDist': 0.4651826024055481}), ('Financial market', 'The Vanguard Group', {'SemanticDist': 0.6656797826290131}), ('Asset pricing', 'BlackRock', {'SemanticDist': 0.5586826503276825}), ('BlackRock', 'The Blackstone Group', {'SemanticDist': 0.5543205440044403}), ('BlackRock', 'Investment management', {'SemanticDist': 0.5791221261024475}), ('Investment management', 'Chartered Financial Analyst', {'SemanticDist': 0.5173569917678833}), ('Chartered Financial Analyst', 'Asset management', {'SemanticDist': 0.566513180732727}), ('Asset management', 'List of asset management firms', {'SemanticDist': 0.43053102493286133}), ('List of asset management firms', 'The Vanguard Group', {'SemanticDist': 0.4881243109703064}), ('List of asset management firms', 'Physics', {'SemanticDist': 0.886945016682148}), ('Physics', 'Natural science', {'SemanticDist': 0.6061553359031677}), ('Natural science', 'Quantum mechanics', {'SemanticDist': 0.79332669079

In [18]:
adj_G = nx.adjacency_matrix(G, weight = 'SemanticDist')
adj_G.todense()

matrix([[0.        , 0.4651826 , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.4651826 , 0.        , 0.55868265, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.55868265, 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.82106788],
        [0.        , 0.        , 0.        , ..., 0.        , 0.82106788,
         0.        ]])
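
The rows and columns of this matrix follow the order of `G.nodes`. A minimal sketch on a toy graph, assuming a dense array is acceptable, that keeps the node order alongside the matrix so entries can be looked up by page name (`toy_G`, `nodes`, and `adj` are illustrative names):

```python
import networkx as nx

toy_G = nx.Graph()
toy_G.add_edge('A', 'B', SemanticDist=0.4)
toy_G.add_edge('B', 'C', SemanticDist=0.7)

nodes = list(toy_G.nodes)   # fix the row/column order explicitly
adj = nx.to_numpy_array(toy_G, nodelist=nodes, weight='SemanticDist')

# look up an entry by page name rather than by position
i, j = nodes.index('A'), nodes.index('B')
print(adj[i, j])
```

A zero entry here means that no edge exists between the two pages, not that their semantic distance is zero.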