<a href="https://colab.research.google.com/github/spatank/Curiosity/blob/master/wiki_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Make data available to Colab by mounting your Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/My Drive/Curiosity_IGT/')

Mounted at /content/drive


In [2]:
!ls # run !ls to verify location

enwiki_20180420_300d.pkl.bz2  wiki_1.ipynb  WikiData_Shubhankar.csv


# Import Packages

In [3]:
import pandas as pd
import networkx as nx

!pip install wikipedia2vec
from wikipedia2vec import Wikipedia2Vec

# Data Wrangling

This section pre-processes the KNOT data to get it into a format suited to network analysis. The data contains the names of Wikipedia pages visited by a participant in two columns: SourceName and TargetName. 


In [6]:
wiki_df = pd.read_csv('WikiData_Shubhankar.csv')
wiki_df.head(3)

Unnamed: 0,ID,SourceName,TargetName,Day,TimeOrder,Hyperlink,DistanceWeights,AgeYears,SexOrient,Race,GenderFactor,EducDeg,Income,JE_5D,DS_5D,ST_5D,SC_5D,TS_5D,Count,Weight
0,101,/wiki/Jeff_Bezos,/wiki/Cloud_infrastructure,1,1,no,1.0,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,1,0.0
1,101,/wiki/Cloud_infrastructure,/wiki/Cloud_computing_security,1,2,yes,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,2,0.8
2,101,/wiki/Cloud_computing_security,/wiki/Cloud_infrastructure,1,3,no,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,3,0.8


In [5]:
def clean_entity_name(name):
  name = name.replace('/wiki/', '')
  name = name.replace('_', ' ')
  return name

Cloud infrastructure
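Wikipedia URLs percent-encode some characters (for example apostrophes and accented letters), so a page name taken from a URL may still contain sequences such as `%27` after the prefix and underscores are handled. Whether this data set contains such names is an assumption; the sketch below shows one way to decode them. The helper name `clean_entity_name_decoded` is hypothetical and is not part of the original notebook.

```python
from urllib.parse import unquote

def clean_entity_name_decoded(name):
    """Hypothetical variant of clean_entity_name that also decodes %XX escapes."""
    prefix = '/wiki/'
    # drop the leading '/wiki/' only when it is actually a prefix
    if name.startswith(prefix):
        name = name[len(prefix):]
    # decode percent-escapes (UTF-8), then turn underscores into spaces
    name = unquote(name)
    return name.replace('_', ' ')

plain = clean_entity_name_decoded('/wiki/Cloud_infrastructure')
encoded = clean_entity_name_decoded('/wiki/Schr%C3%B6dinger%27s_cat')
```

For names without escapes, the result matches what `clean_entity_name` returns.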


First, we create unique identifiers (UIDs) for each page so that they can be used as nodes in a network representation. Then we clean the string associated with each page by removing the `/wiki/` prefix and replacing each `_` with a space. The UIDs and clean names are appended to the data frame as new columns.

In [7]:
# create UID for each page
source_nodes = set(wiki_df['SourceName'].tolist())
target_nodes = set(wiki_df['TargetName'].tolist())
# source_nodes now holds every page that appears as a source or a target
source_nodes.update(target_nodes)
# map each page name to an integer UID
node_set = {entity: uid for uid, entity in enumerate(source_nodes)}
wiki_df['SourceUID'] = wiki_df['SourceName'].map(node_set)
wiki_df['SrcNameClean'] = wiki_df['SourceName'].apply(clean_entity_name)
wiki_df['TargetUID'] = wiki_df['TargetName'].map(node_set)
wiki_df['TgtNameClean'] = wiki_df['TargetName'].apply(clean_entity_name)
wiki_df.head(3)

Unnamed: 0,ID,SourceName,TargetName,Day,TimeOrder,Hyperlink,DistanceWeights,AgeYears,SexOrient,Race,GenderFactor,EducDeg,Income,JE_5D,DS_5D,ST_5D,SC_5D,TS_5D,Count,Weight,SourceUID,SrcNameClean,TargetUID,TgtNameClean
0,101,/wiki/Jeff_Bezos,/wiki/Cloud_infrastructure,1,1,no,1.0,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,1,0.0,2188,Jeff Bezos,12577,Cloud infrastructure
1,101,/wiki/Cloud_infrastructure,/wiki/Cloud_computing_security,1,2,yes,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,2,0.8,12577,Cloud infrastructure,2794,Cloud computing security
2,101,/wiki/Cloud_computing_security,/wiki/Cloud_infrastructure,1,3,no,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,3,0.8,2794,Cloud computing security,12577,Cloud infrastructure
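Because the iteration order of a set of strings can differ between Python sessions, UIDs produced by enumerating a set are not guaranteed to be the same on every run. If reproducible UIDs matter (for example, when saving networks to disk), one option is to sort the page names before enumerating them. The sketch below uses a made-up toy data frame; `toy_df`, `pages`, and `page_to_uid` are illustrative names only.

```python
import pandas as pd

# Toy transitions standing in for wiki_df (hypothetical rows).
toy_df = pd.DataFrame({
    'SourceName': ['/wiki/A', '/wiki/B', '/wiki/A'],
    'TargetName': ['/wiki/B', '/wiki/C', '/wiki/C'],
})

# Sort the union of source and target pages so the mapping is reproducible.
pages = sorted(set(toy_df['SourceName']) | set(toy_df['TargetName']))
page_to_uid = {page: uid for uid, page in enumerate(pages)}

# Attach the UIDs as new columns.
toy_df['SourceUID'] = toy_df['SourceName'].map(page_to_uid)
toy_df['TargetUID'] = toy_df['TargetName'].map(page_to_uid)
```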


# Wikipedia2Vec

Each row in the data frame represents a transition made by a participant from one Wikipedia page to another. The associated edge weight can be obtained as the semantic distance between the contents of the two pages. To quantify the semantic distance between Wikipedia entities, we use a pre-trained model that represents each page as a fixed-length vector (300-dimensional for the `enwiki_20180420_300d` file listed above). The distance between two pages, `SemanticDist`, is then computed as the cosine dissimilarity (one minus the cosine similarity) between their vector representations.
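As a minimal sketch of the distance itself, assuming the embeddings are available as NumPy-compatible vectors, cosine distance can be computed as follows. The vectors here are made-up toy values rather than model output, and `cosine_distance` is a hypothetical helper name; in the notebook the vectors would presumably come from the pre-trained Wikipedia2Vec model.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two non-zero 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional vectors standing in for page embeddings (made up).
a = [1.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0]
c = [2.0, 0.0, 0.0]

d_ab = cosine_distance(a, b)  # orthogonal vectors: maximally dissimilar here
d_ac = cosine_distance(a, c)  # same direction: distance close to zero
```

The function does not guard against zero vectors or pages missing from the model; both cases would need explicit handling before applying it to the full data frame.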

In [None]:
def semantic_dist(x, y):
  # placeholder: always returns 0 until the embedding-based distance is implemented
  return 0

In [None]:
# axis=1 applies the function to each row; the default (axis=0) would pass columns
wiki_df['SemanticDist'] = wiki_df.apply(
    lambda x: semantic_dist(x.SrcNameClean, x.TgtNameClean), axis=1)

# Create Individual Networks

Next, we split the data set by individual, and use the `SourceUID`, `TargetUID`, and `SemanticDist` columns to generate network representations of participants' Wikipedia exploration.

In [None]:
# split the data by individual
ID_groups = wiki_df.groupby('ID')
for ID, group in ID_groups:
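  # enforce time ordering; sort_values returns a sorted copy, so rebind it
  # (sorting a groupby slice in place may raise SettingWithCopyWarning)
  group = group.sort_values(by='TimeOrder')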
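One way to turn each participant's transitions into a network is `nx.from_pandas_edgelist`, sketched below on a made-up toy data frame. Treating the network as directed is an assumption, as are the names `toy_df` and `graphs`. Note that a `DiGraph` keeps a single edge per ordered pair of pages, so repeated transitions between the same two pages collapse into one edge.

```python
import pandas as pd
import networkx as nx

# Toy transitions for two participants (hypothetical values).
toy_df = pd.DataFrame({
    'ID': [101, 101, 101, 102],
    'SourceUID': [0, 1, 2, 0],
    'TargetUID': [1, 2, 1, 2],
    'SemanticDist': [0.9, 0.3, 0.3, 0.5],
    'TimeOrder': [1, 2, 3, 1],
})

# Build one directed graph per participant, weighted by SemanticDist.
graphs = {}
for participant_id, group in toy_df.groupby('ID'):
    group = group.sort_values(by='TimeOrder')
    graphs[participant_id] = nx.from_pandas_edgelist(
        group,
        source='SourceUID',
        target='TargetUID',
        edge_attr='SemanticDist',
        create_using=nx.DiGraph,
    )
```

Keeping the graphs in a dictionary keyed by participant ID makes every participant's network available after the loop, rather than only the last group.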

In [None]:
group.head(10)

Unnamed: 0,ID,SourceName,TargetName,Day,TimeOrder,Hyperlink,DistanceWeights,AgeYears,SexOrient,Race,GenderFactor,EducDeg,Income,JE_5D,DS_5D,ST_5D,SC_5D,TS_5D,Count,Weight
0,101,/wiki/Jeff_Bezos,/wiki/Cloud_infrastructure,1,1,no,1.0,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,1,0.0
1,101,/wiki/Cloud_infrastructure,/wiki/Cloud_computing_security,1,2,yes,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,2,0.8
2,101,/wiki/Cloud_computing_security,/wiki/Cloud_infrastructure,1,3,no,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,3,0.8
3,101,/wiki/Cloud_infrastructure,/wiki/Information_technology,1,4,yes,0.8,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,4,0.2
4,101,/wiki/Information_technology,/wiki/Computer_language,1,5,no,0.6,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,5,0.4
5,101,/wiki/Computer_language,/wiki/Programming_language,1,6,yes,0.6,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,6,0.4
6,101,/wiki/Programming_language,/wiki/Java_(programming_language),1,7,yes,0.8,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,7,0.2
7,101,/wiki/Java_(programming_language),/wiki/Java_compiler,1,8,yes,0.2,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,8,0.8
8,101,/wiki/Java_compiler,/wiki/Java_class_file,1,9,yes,0.5,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,9,0.5
9,101,/wiki/Java_class_file,/wiki/Class_(programming),1,10,yes,0.8,23.27945,Heterosexual,AsiaAm,0,BachDegree,20to49k,4.4,4.25,1.6,2.8,2.0,10,0.2
