# Network Analysis 

Network analysis is an application of graph theory that focuses on the relationships between entities (e.g., individuals, organizations, or webpages). The entities within a network are referred to as nodes (or sometimes vertices), while the relationships between them are called edges or links. I use the terms within each pair interchangeably throughout this notebook.

Recently, a client told me they wanted insight into how consumers use their website. This set me thinking: while there are many tools for examining individual page performance or user behavior, few let you pull back and holistically examine how a site's pages relate to one another. When you represent a website as a network, it becomes easy to see which pages drive traffic to a target page (e.g., a cart or checkout page), which pages people leave the site from, and how people navigate the site. It can even provide clues as to which products people shop for together. Furthermore, when the network is visualized as a graph, you can do this quickly and on the fly with clients.

This notebook provides code and instructions for using Google Analytics data to generate the files needed for network visualizations in Gephi, a network visualization tool.

#### Authenticating and Connecting to the Google Analytics API Endpoint 

We must set up our connection to the Google Analytics API endpoint. First, you must set up a new project and service account in order to generate a private key. [The documentation describing how to do so is here](https://developers.google.com/analytics/devguides/reporting/core/v4/quickstart/service-py). After you create the project and service account, you will download a JSON file containing your key. Save it to your working directory as "client_secrets.json".

Second, an email address is associated with the service account. Someone with admin access will need to grant this email address read access at the *account* (not view) level in GA.
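
To find the address that needs read access, you can read it straight from the downloaded key. This is a minimal sketch, assuming the key is a standard service account JSON file containing a `client_email` field and saved as `client_secrets.json`. The helper name `service_account_email` is illustrative and not part of any Google library.

```python
import json
from pathlib import Path


def service_account_email(key_path="client_secrets.json"):
    """Return the email address stored in a service account JSON key.

    Raises FileNotFoundError if the key file is missing and KeyError if the
    file has no 'client_email' field.
    """
    path = Path(key_path)
    if not path.is_file():
        raise FileNotFoundError("Key file not found: {}".format(path))
    with path.open() as f:
        key = json.load(f)
    return key["client_email"]
```

Calling `service_account_email()` from the working directory should return the address to share with your GA admin.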

Finally, ... install libraries ... you can now run the cell below to get started. 

In [None]:
import csv
import datetime
import httplib2
import json 
import requests 
import pandas as pd
import pprint

from oauth2client.service_account import ServiceAccountCredentials
from googleapiclient.discovery import build 

In [None]:
#Create service credentials
#Rename your JSON key to client_secrets.json and save it to your working folder
credentials = ServiceAccountCredentials.from_json_keyfile_name('client_secrets.json', ['https://www.googleapis.com/auth/analytics.readonly'])
  
#Create a service object
http = credentials.authorize(httplib2.Http())
service = build('analytics', 'v4', http=http, discoveryServiceUrl=('https://analyticsreporting.googleapis.com/$discovery/rest'))


## Configuration

#### Defining Parameters 

The parameters defined below are used for making the Google Analytics API calls.  You can find [documentation for configuring these here](https://developers.google.com/analytics/devguides/reporting/core/v4/reference#ids).  You may find the [Google Analytics query explorer](https://ga-dev-tools.web.app/query-explorer/) helpful when tailoring parameters. 

Note that the edge dimensions should always be pagePath and previousPagePath, and the edge metric should be either uniquePageviews or pageviews. This is necessary to create the structure of the network: the page dimensions represent the two entities being associated (i.e., two webpages), and the pageview metric provides the weight, or strength, of that association. In fact, the simplest representation of a network is an edge list, which can be defined with three columns:

    Source -> Target, Weight. 
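
As an illustration, an edge list with these three columns can be written with nothing but the standard library. The page paths and weights below are made up for the example.

```python
import csv
import io

# Hypothetical edges: (source page, target page, unique pageviews)
edges = [
    ("/", "/products", 120),
    ("/products", "/cart", 45),
    ("/cart", "/checkout", 30),
]

# Write the edge list to an in-memory CSV with a Source, Target, Weight header
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Source", "Target", "Weight"])
writer.writerows(edges)

edge_csv = buffer.getvalue()
print(edge_csv)
```

Each row after the header is one directed edge from the source page to the target page.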

The node parameters are optional, but they are what enable us to really explore the network visually. These are the attributes of a webpage in which we might be interested. If you create a node table, you must always include the pagePath dimension so that we can associate node attributes with the pages in the edge list. A word of warning: this script cleans up URL parameters and combines values from the same page. As a result, you should not include rates or percentages, because percentages cannot be summed or averaged accurately. If you need them, calculate them from the raw counts after the different listings of the same page have been combined. For example, you can request bounces but not the bounce rate. If you want the bounce rate, combine pages after the URL has been standardized and then divide bounces by the number of sessions that hit that page.
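
The point about rates can be sketched with a toy table. The rows below are invented, and using `entrances` as the denominator is an assumption made for the example; substitute whichever session count you actually request.

```python
import pandas as pd

# Hypothetical raw rows: the same page listed under different URL parameters
raw = pd.DataFrame({
    "pagePath": ["/cart?ref=email", "/cart?ref=ad", "/products"],
    "bounces": [10, 5, 8],
    "entrances": [40, 10, 80],
})

# Standardize the URL by dropping everything after the first '?'
raw["pagePath"] = raw["pagePath"].str.split("?").str[0]

# Sum the raw counts per page, then derive the rate from the combined totals
pages = raw.groupby("pagePath", as_index=False)[["bounces", "entrances"]].sum()
pages["bounceRate"] = pages["bounces"] / pages["entrances"]

print(pages)
```

Here the combined `/cart` rate is 15 / 50 = 0.30, whereas averaging the two per-listing rates (0.25 and 0.50) would give 0.375, which is why the counts are combined first.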


In [None]:
# UNIVERSAL PARAMETERS 
WEBSITE_NAME = "bushnell.com"
VIEWID = 'ga:219121486'
START_DATE = '7daysAgo'
END_DATE = 'today'
PAGE_SIZE = 10000 
PAGE_TOKEN = "0" 


#EDGE PARAMETERS 
EDGE_METRICS = [
               {'expression': 'ga:uniquePageviews'}
               ]

EDGE_DIMENSIONS = [
                  {"name": "ga:pagePath"},
                  {"name": "ga:previousPagePath"}
                  ]

EDGE_FILTEREXPRESSIONS = "ga:hostname==www.bushnell.com;ga:previousPagePath!=(entrance)"

EDGE_ORDERBYS= [
               {"fieldName": "ga:uniquePageviews", "sortOrder": "DESCENDING"}
               ] 


# NODE PARAMETERS 
NODE_METRICS = [
                {'expression': 'ga:uniquePageviews'},
                {'expression': 'ga:bounces'},
                {'expression': 'ga:pageValue'},
                {'expression': 'ga:entrances'},
                {'expression': 'ga:exits'},
               ]

NODE_DIMENSIONS = [
                 {"name": "ga:pagePath"}
                 ]

NODE_FILTEREXPRESSIONS = "ga:hostname==www.bushnell.com"

NODE_ORDERBYS= [
               {"fieldName": "ga:uniquePageviews", "sortOrder": "DESCENDING"}
               ] 

## Define Functions 

Here we define two functions. The first makes the API call, while the second is a helper that parses the API response. These are general functions and should not need to be altered. 
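
The first function also pages through results when a query returns more rows than fit on one page. That pattern can be sketched independently of the API. In this sketch `collect_pages` and `fetch_page` are hypothetical names, and the fake two-page source stands in for real API responses.

```python
def collect_pages(fetch_page, first_token="0"):
    """Gather rows from a paginated source into one table.

    fetch_page is a hypothetical callable that takes a page token and returns
    (rows, next_token), where rows[0] is the header row and next_token is
    None on the last page.
    """
    table = []
    token = first_token
    while token is not None:
        rows, token = fetch_page(token)
        # Keep the header row from the first page only
        table.extend(rows if not table else rows[1:])
    return table


# A fake two-page source standing in for the API
fake_pages = {
    "0": ([["pagePath", "uniquePageviews"], ["/", "10"]], "1"),
    "1": ([["pagePath", "uniquePageviews"], ["/cart", "4"]], None),
}

table = collect_pages(lambda token: fake_pages[token])
print(table)
```

The result is a single header row followed by the data rows from every page, which is the shape the later cells pass to `pd.DataFrame`.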

In [None]:
def call_api(viewID, metrics, dimensions, filters=None, 
             startDate='30daysAgo', endDate='today', pageSize=10000,
            pageToken="0", orderBys=None): 
    ''' Constructs and posts Google Analytics API call. 
    
    Calls the google analytics api - mainly used to ease looping over it and 
    to enable default values other than those specified by the API.  
    Please see the following link for information regarding how to format
    arguments:   
    https://developers.google.com/analytics/devguides/reporting/core/v3/reference#ids
  
    Args:
        viewID (str): GA view ID.  
        
        metrics (lst of dict): List of key, value pairs specifying  metrics to 
                               return (e.g., [{'expression': 'ga:uniquePageviews'}]). 
        
        dimensions (lst of dict): List of key, value pairs specifying dimensions 
                                  to return (e.g., [{"name": "ga:pagePath"}]). 
        
        filters (str): String specifying filters to be applied. Strings should 
                       use operator characters, NOT the URL-encoded form of 
                       operators. 
        
        startDate (str): Start date in YYYY-MM-DD string format, or a relative 
                         date such as '7daysAgo'. 
        
        endDate (str): End date in YYYY-MM-DD string format, or a relative 
                       date such as 'today'. 
        
        pageSize (int): Number of records to return per page.  0 to 10,000.  
        
        pageToken (str): Token marking where the next page starts if the 
                         number of records exceeds the page size. 
        
        orderBys (lst of dict): Specifies the metric(s)/dimension(s) by which 
                                values should be ordered and whether the order 
                                is ascending or descending. 
        
    Returns: 
        d (lst of lst): Header row followed by one row per record, combined 
                        across all pages. 
        response (dict): Raw API response for the last page requested. 
    '''
    d = []
    
    while pageToken: 
        response = service.reports().batchGet(
            body={
                'reportRequests': [
                    {
                        'viewId': viewID, #Add View ID from GA
                        'dateRanges': [
                                      {'startDate': startDate,
                                       'endDate': endDate}
                                      ],
                        'metrics': metrics, 
                        'dimensions': dimensions, 
                        "filtersExpression": filters, 
                        'orderBys': orderBys, 
                        'pageSize': pageSize,
                        'pageToken': pageToken 
                    }]
            }
        ).execute()

        if 'nextPageToken' in response['reports'][0]:
            
            pageToken = response['reports'][0]['nextPageToken']
            
            if len(d)>0: 
                d = d+parse_data(response)[1:] 
            else: 
                d = parse_data(response)
        
        elif len(d) > 0: 
            d = d+parse_data(response)[1:] 
            break
        
        else: 
            d = parse_data(response)
            break
    
    print("{} rows of data returned.".format(len(d)))
    
    return d, response

In [None]:
def parse_data(ga_response):
    ''' Parses Google Analytics API response to extract metrics & dimensions.
    
    Takes a google analytics API response dictionary and returns a list of 
    list of dimensions and metrics amenable to conversion to dataframe. 
    
    Args: 
        ga_response(dict): Response object from the google api. 
    
    Returns: 
        data (lst of lst): List with each child list representing a single row 
                           of metrics and dimensions.  
    '''
    
    data = []
    
    for report in ga_response.get('reports', []):
        columnHeader = report.get('columnHeader', {})
        dimensionHeaders = [
            i.split(":")[1]
            for i in columnHeader.get('dimensions', [])
            ]
        metricHeaders = [
            i['name'].split(":")[1]
            for i in columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])
            ] 
        data.append(dimensionHeaders+metricHeaders)

        rows = report.get('data', {}).get('rows', [])
        for row in rows:

            dimensions = row.get('dimensions', [])
            m = row.get('metrics')
            metrics = m[0]['values']
            data.append(dimensions + metrics)
    
    return data


## Create Edges List 

Next, we're making the API call for the edge list.  Our defined functions will return a list of lists with each list representing a row of data.   We then convert this to a pandas dataframe and clean up the data by dropping the url parameters and combining duplicate URLs. Finally, we format this for import into Gephi with the required column names of source, target, and weight.   

In [None]:
# Get edges  
d, _ = call_api(VIEWID, EDGE_METRICS, EDGE_DIMENSIONS, 
                filters=EDGE_FILTEREXPRESSIONS, startDate=START_DATE,
                endDate=END_DATE, pageSize=PAGE_SIZE, pageToken=PAGE_TOKEN,
                orderBys=EDGE_ORDERBYS)

In [None]:
# Create the pandas DataFrame
df_edges = pd.DataFrame(d[1:], columns=d[0])

# Clean up URLs by dropping URL parameters
df_edges['pagePath'] = df_edges['pagePath'].apply(lambda x: x.split('?')[0])
df_edges['previousPagePath'] = df_edges['previousPagePath'].apply(lambda x: x.split('?')[0])

# After dropping URL parameters, combine rows with the same pagePath and
# previousPagePath and sum uniquePageviews
df_edges = df_edges.astype({'uniquePageviews': "int32"})
df_edges = df_edges.groupby(['pagePath', 'previousPagePath']).agg({'uniquePageviews': 'sum'}).reset_index()

# Remove self-references 
df_edges = df_edges[df_edges['pagePath'] != df_edges['previousPagePath']]

# Reordering and renaming columns to conform to Gephi 
new_cols = [
            'previousPagePath', 
            'pagePath', 
            'uniquePageviews'
            ]

df_edges = df_edges[new_cols]

df_edges.rename(columns={'uniquePageviews': 'Weight',
                         'previousPagePath': 'Source',
                         'pagePath': 'Target'}, inplace=True)

# Dump to csv
fn = "{}_edge_list_{}-{}.csv".format(WEBSITE_NAME, START_DATE, END_DATE)
df_edges.to_csv(fn, index=False)

## Create Nodes Table

Finally, we create our nodes table. If you've modified any of the parameters used for the API call above, you may need to modify the logic below that cleans up the data. Additionally, if you want to define any metrics as rates (e.g., conversion, exit, or bounce rate), you should calculate them at this point, after the pages have been combined.

In [None]:
d, _ = call_api(VIEWID, NODE_METRICS, NODE_DIMENSIONS, 
                filters=NODE_FILTEREXPRESSIONS, startDate=START_DATE,
                endDate=END_DATE, pageSize=PAGE_SIZE, pageToken=PAGE_TOKEN,
                orderBys=NODE_ORDERBYS)

In [None]:
# Create the pandas DataFrame
df_nodes = pd.DataFrame(d[1:], columns=d[0])


#Clean up urls by dropping URL parameters
df_nodes['pagePath'] = df_nodes['pagePath'].apply(lambda x: x.split('?')[0])


# Combining pages w/ same pagePath and summing metrics.
# Note: pageValue is cast below but is not part of the aggregation, so it
# does not appear in the final table.
df_nodes = df_nodes.astype({'uniquePageviews': "int32", 
                            'bounces': "int32",
                            'pageValue': "float64",
                            'entrances' : "int32",
                            'exits': "int32"})

df_nodes = df_nodes.groupby(['pagePath']).agg({'uniquePageviews': 'sum',
                                               'bounces': "sum",
                                               'entrances' : "sum",
                                               'exits': "sum"}).reset_index()

# Sorting and relabeling nodes to conform to Gephi

df_nodes = df_nodes.sort_values('uniquePageviews', ascending=False)

df_nodes.rename(columns={'pagePath': 'Id'}, 
                inplace=True
                )

# Dump to csv
fn = "{}_nodes_table_{}-{}.csv".format(WEBSITE_NAME, START_DATE, END_DATE)
df_nodes.to_csv(fn, index=False)
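
Because node attributes are associated with edges through the page path, it can be worth checking that every page in the edge list also appears in the nodes table before importing both files. This is a rough sketch with made-up frames standing in for `df_edges` and `df_nodes`; the helper name `edge_pages_missing_from_nodes` is hypothetical.

```python
import pandas as pd


def edge_pages_missing_from_nodes(edges, nodes):
    """Return sorted page paths used in the edge list but absent from the nodes table."""
    edge_pages = set(edges["Source"]) | set(edges["Target"])
    return sorted(edge_pages - set(nodes["Id"]))


# Hypothetical stand-ins for df_edges and df_nodes
edges = pd.DataFrame({"Source": ["/", "/products"],
                      "Target": ["/products", "/cart"],
                      "Weight": [120, 45]})
nodes = pd.DataFrame({"Id": ["/", "/products"],
                      "uniquePageviews": [300, 150]})

print(edge_pages_missing_from_nodes(edges, nodes))
```

In this toy data `/cart` is reported, because it appears as a target in the edge list but has no row in the nodes table.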

## Scratchpad - Working Notes and Code Go Below Here