# Network Data Processing  

For the network data visualization we need two datasets, which we will load into [Gephi](https://gephi.org/) to process and export to GEXF format:
* **a Node file** - containing a list of all available nodes, along with one node attribute: the URL class
* **an Edge file** - containing a list of all edges between the nodes. Each entry contains an edgeID, the two nodes it connects, a weight (the number of cross-site cookie base domains shared by the two nodes) and a tag (the names of the baseDomains of the cookies that connect the nodes).
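
As a rough illustration of the edge definition above, the sketch below computes the weight and tag for a single pair of sites by intersecting their sets of cookie base domains. The `cookies_by_site` mapping, the `edge_between` helper and the `.example` URLs are made up for illustration; they are not taken from the crawl data.

```python
# Hypothetical input: site URL -> set of cookie baseDomains observed on that site.
cookies_by_site = {
    "http://a.example": {"tracker-one.example", "tracker-two.example", "a.example"},
    "http://b.example": {"tracker-one.example", "tracker-two.example", "b.example"},
}


def edge_between(site1, site2, cookies):
    """Return (weight, tag) for the edge joining two sites, or None if they share nothing."""
    shared = sorted(cookies[site1] & cookies[site2])
    if not shared:
        return None
    # weight = number of shared base domains, tag = their names joined with ' | '
    return len(shared), " | ".join(shared)


weight, tag = edge_between("http://a.example", "http://b.example", cookies_by_site)
print(weight, tag)
```

The two sites share two base domains, so the edge gets weight 2 and a tag listing both names.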

In [1]:
import numpy as np
import pandas as pd
import sqlite3
from itertools import combinations

In [2]:
# Connect to SQLite database
con = sqlite3.connect("../data/top_100_shallow_1.sqlite")
cur = con.cursor()

# Get URL classes from websites-classification.csv
websites = pd.read_csv('../data/websites-classification.csv')

In [3]:
websites.head()

|   | url | class |
|---|-----|-------|
| 0 | http://google.com | Search Engines and Portals |
| 1 | http://youtube.com | Streaming Media |
| 2 | http://facebook.com | Social Networking |
| 3 | http://msn.com | Search Engines and Portals |
| 4 | http://yelp.com | Reference |


This is already the **nodes_network_data** file we are looking for, so we just copy it over to `../data/processed`:

In [4]:
!cp ../data/websites-classification.csv ../data/processed/nodes_network_data.csv
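
The `!cp` line relies on a shell being available and on `../data/processed` already existing. A minimal pure-Python sketch of the same copy, assuming we also want the target directory created when it is missing, could look like this (`copy_nodes_file` is a hypothetical helper name, not part of the notebook):

```python
import shutil
from pathlib import Path


def copy_nodes_file(src, dst):
    """Copy src to dst, creating dst's parent directory if it does not exist yet."""
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return dst


# Usage with the paths from the cell above (commented out so the sketch has no side effects):
# copy_nodes_file("../data/websites-classification.csv",
#                 "../data/processed/nodes_network_data.csv")
```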

Now we need to build the **edge_network_data** file:

In [5]:
_result = pd.read_sql_query('SELECT DISTINCT page_url, baseDomain FROM profile_cookies', con)

In [6]:
# SELECT DISTINCT already returns unique rows; drop_duplicates() is only a safeguard
pre_edges = _result.drop_duplicates()
pre_edges.head()

|   | page_url | baseDomain |
|---|----------|------------|
| 0 | http://google.com | google.com |
| 1 | http://youtube.com | youtube.com |
| 2 | http://youtube.com | doubleclick.net |
| 3 | http://youtube.com | pointroll.com |
| 4 | http://youtube.com | mookie1.com |
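
To make the effect of the `SELECT DISTINCT` query more concrete, here is a small self-contained sketch against an in-memory SQLite database. Only the table name `profile_cookies` and the columns `page_url` and `baseDomain` come from the notebook; the `name` column, the rows and the `ORDER BY` clause are assumptions added for the demonstration.

```python
import sqlite3

# In-memory stand-in for the crawl database (schema reduced to what the demo needs).
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE profile_cookies (page_url TEXT, baseDomain TEXT, name TEXT)"
)
con_demo.executemany(
    "INSERT INTO profile_cookies VALUES (?, ?, ?)",
    [
        ("http://a.example", "tracker.example", "id"),
        ("http://a.example", "tracker.example", "session"),  # same pair, different cookie
        ("http://b.example", "tracker.example", "id"),
    ],
)

# Several cookies from one base domain on one page collapse into a single row.
rows = con_demo.execute(
    "SELECT DISTINCT page_url, baseDomain FROM profile_cookies ORDER BY page_url"
).fetchall()
print(rows)
con_demo.close()
```

Three cookie rows go in, but only two (page, base domain) pairs come out, which is why the edge weight counts shared base domains rather than individual cookies.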


In [7]:
# Group by baseDomain
by_domain = pre_edges.groupby('baseDomain')
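
For readers unfamiliar with pandas grouping, the following sketch shows what iterating over a grouped `page_url` column yields: one `(baseDomain, Series of page URLs)` pair per base domain. The `demo` frame and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for pre_edges with the same two columns.
demo = pd.DataFrame({
    "page_url": ["http://a.example", "http://b.example", "http://a.example"],
    "baseDomain": ["tracker.example", "tracker.example", "a.example"],
})

# Each iteration step gives the group key and the page URLs that set a cookie from it.
groups = {bd: list(urls) for bd, urls in demo.groupby("baseDomain")["page_url"]}
print(groups)
```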

In [11]:
# Initialize edge dictionary: (node1, node2) -> [weight, tag]
edge_dict = {}

# For each baseDomain
for bd, df in by_domain['page_url']:
    edges = combinations(df.values, 2) # iterable of all possible 2 element combinations
    for e in edges:
        label = tuple(e)
        try:
            old_weight, old_doms = edge_dict[label]
            edge_dict[label] = [old_weight + 1, old_doms + ' | ' + bd]
        except KeyError:
            edge_dict[label] = [1, bd]

print("Created a total of: ", len(edge_dict), "edges")

Created a total of:  2143 edges
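
One thing to keep in mind with the loop above: `combinations` emits each pair in the order the URLs appear within the group, so `(a, b)` and `(b, a)` would be stored as separate edges if two groups happened to list the same sites in different orders. Whether that occurs in this dataset is not checked here. A minimal order-insensitive sketch, using a hypothetical `build_edges` helper and made-up input, sorts the URLs before pairing them:

```python
from collections import defaultdict
from itertools import combinations


def build_edges(urls_by_domain):
    """Map each unordered site pair to [weight, tag]."""
    domains_by_pair = defaultdict(list)
    for domain in sorted(urls_by_domain):
        # Sorting (and de-duplicating) the URLs gives every pair one canonical order.
        for pair in combinations(sorted(set(urls_by_domain[domain])), 2):
            domains_by_pair[pair].append(domain)
    return {
        pair: [len(doms), " | ".join(doms)]
        for pair, doms in domains_by_pair.items()
    }


# Hypothetical groups in which the same two sites appear in opposite orders.
demo_groups = {
    "tracker-one.example": ["http://b.example", "http://a.example"],
    "tracker-two.example": ["http://a.example", "http://b.example"],
}
demo_edges = build_edges(demo_groups)
print(demo_edges)
```

Both base domains end up on a single edge of weight 2 instead of two edges of weight 1.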


In [12]:
# Dump the dictionary into a dataframe (-> csv)
pre_df = [ [k[0], k[1], v[0], v[1]] for k,v in edge_dict.items() ]
edge_data = pd.DataFrame(pre_df, columns=['node1', 'node2', 'weight', 'tag'])

In [13]:
edge_data['edgeID'] = edge_data.index
edge_data = edge_data[[ 'edgeID', 'node1', 'node2', 'weight', 'tag' ]]
edge_data.head()

|   | edgeID | node1 | node2 | weight | tag |
|---|--------|-------|-------|--------|-----|
| 0 | 0 | http://aol.com | http://imdb.com | 2 | doubleclick.net \| twitter.com |
| 1 | 1 | http://dose.com | http://guff.com | 1 | scorecardresearch.com |
| 2 | 2 | http://urbandictionary.com | http://topix.com | 16 | adform.net \| adgrx.com \| bidswitch.net \| bluek... |
| 3 | 3 | http://aol.com | http://sbnation.com | 17 | adadvisor.net \| adsrvr.org \| adtechus.com \| ad... |
| 4 | 4 | http://pinterest.com | http://eonline.com | 2 | facebook.com \| pinterest.com |


Great! This is exactly what we need for the **edge_network_data** file. Let's save it into a .csv file.

In [14]:
edge_data.to_csv('../data/processed/edge_network_data.csv', index_label=False)
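
A note on the `to_csv` arguments: `index_label=False` only suppresses the header field for the index, while the index values themselves are still written as the first column of every row. If the goal is a file with exactly the five columns listed in the header, `index=False` may be the better fit. The sketch below compares the two on a made-up one-row frame, writing to in-memory buffers rather than to disk.

```python
import io

import pandas as pd

# Toy frame with a subset of the edge columns.
demo = pd.DataFrame({
    "edgeID": [0],
    "node1": ["http://a.example"],
    "node2": ["http://b.example"],
})

# Index values are written, but the header has no field for them.
with_label_false = io.StringIO()
demo.to_csv(with_label_false, index_label=False)

# The index column is left out entirely.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)

print(with_label_false.getvalue())
print(without_index.getvalue())
```

In the first output each data row has one more field than the header; in the second, header and rows line up.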