# Introduction to simulation

* Collaboration network data
* Simulation and real world data
* Exercises

Social scientists have been using simulation to understand social phenomena for a long time. Although it's hard to imagine today, simulation research was around long before the advent (or at least the wide availability) of computers. In this "prehistoric" era, researchers used coin flips, dice, and massive tomes of random numbers ([seriously](https://www.amazon.com/Million-Random-Digits-Normal-Deviates/dp/0833030477/ref=sr_1_1?dchild=1&keywords=A+Million+Random+Digits&qid=1587477787&s=books&sr=1-1)) to develop simple but often powerful and compelling simulations of phenomena ranging from language to residential segregation. How was it possible to do simulation before the computer era? At least part of the reason is that a whole lot of simulation modeling relies on a few relatively simple pieces of technology: random number generators (for making choices or decisions), loops (for progressing through the simulation), and lists (for keeping track of things). As we've seen previously, it's really easy to work with all three of these tools in Python, which makes it a great programming language for social simulation.
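
To make those three ingredients concrete, here is a minimal sketch of a coin-flip simulation. The seed value and the number of flips are arbitrary choices made for illustration.

```python
import random

# A seeded random number generator, so the run is repeatable.
# (The seed value 42 is an arbitrary choice for this sketch.)
rng = random.Random(42)

# A list for keeping track of things: here, the outcome of each flip.
flips = []

# A loop for progressing through the simulation.
for _ in range(1000):
    # The random number generator makes each "decision".
    flips.append(rng.choice(["heads", "tails"]))

share_heads = flips.count("heads") / len(flips)
print(f"Share of heads after {len(flips)} flips: {share_heads:.3f}")
```

With a fair coin the share of heads should land near 0.5, and it tends to get closer as the number of flips grows.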

Simulation is a general-purpose tool with many uses in research and beyond. In this session, we'll briefly consider how to use simulation to better understand real world data. In particular, we'll consider how simple simulations can give us insight into the processes that generate our empirical data.

# Collaboration network data

To keep things simple, we'll use some data we encountered previously, on patent collaboration networks among researchers at Big 10 Academic Alliance universities. Let's go ahead and load the data, which I've copied over to this notebook for ease of access.

In [1]:
# load some packages
import networkx as nx
import pandas as pd
import glob
import pathlib 
import numpy as np

In [2]:
# load the data
university_networks_df = pd.DataFrame([(pathlib.Path(g).stem, nx.read_graphml(g)) for g in glob.glob("data/*.graphml")], 
                                      columns =["university", "graph"])

In [3]:
# check out the data
university_networks_df.head()

| | university | graph |
|---|---|---|
| 0 | OHIO_STATE | (7722842-1, 7425229-2, 8038907-3, 4172124-2, 4... |
| 1 | INDIANA | (5905258-1, 4451396-1, 7294830-2, 4462685-2, 6... |
| 2 | NEBRASKA | (7126303-5, 7042184-2, 7042184-3, 7842479-2, 7... |
| 3 | WISCONSIN | (6440953-4, 4918178-2, 7704362-2, 7705425-2, 7... |
| 4 | PENN_STATE | (8071132-5, 6165875-3, PP6218-1, 7321854-2, 81... |


# Simulation and real world data

In research on social networks, we typically assume that observed social network structures are the result of some meaningful social process. But is that assumption reasonable? One way we can gain some intuition on this question is to use simulation. Specifically, we can consider whether our observed networks deviate in significant ways from comparable random networks. Note that what we mean by "comparable" is important, but not necessarily obvious. To keep things simple, we'll compare our observed collaboration networks to networks with identical numbers of nodes and ties, generated using a random graph model.
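
As a small illustration of what "comparable" means here, the sketch below builds a random graph with the same number of nodes and ties as an observed one. It uses NetworkX's built-in karate club graph as a stand-in for an observed network (an assumption made so the example runs on its own, not the patent data), and the seed is arbitrary.

```python
import networkx as nx

# Stand-in for an observed network (assumption: not the patent data).
observed = nx.karate_club_graph()

# G(n, m) random graph: same number of nodes and edges as the observed
# network, with the edges placed uniformly at random.
random_graph = nx.gnm_random_graph(
    n=observed.number_of_nodes(),
    m=observed.number_of_edges(),
    seed=1,  # arbitrary seed, only to make the sketch repeatable
)

print("nodes:", observed.number_of_nodes(), random_graph.number_of_nodes())
print("edges:", observed.number_of_edges(), random_graph.number_of_edges())
print("transitivity:", nx.transitivity(observed), nx.transitivity(random_graph))
```

The two graphs match on size and density only. Other features, such as the degree distribution, are free to differ, which is one reason the choice of comparison model matters.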

Before we move forward, we need to think about how we would decide whether an observed network was "significantly" different from a comparable random network. There are different ways you might go about doing this, but one of the most common is to generate many comparable random networks for each observed network. We can then measure various network properties on the observed and random networks, and use a z-score to see whether the observed value differs in a meaningful way from the random ones. You can think of this general approach as a type of Monte Carlo simulation. Let's give it a shot.
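
Written out, the z-score for one observed network and one measure looks like this. The symbols are our own notation, introduced here for illustration.

$$
z = \frac{x_{\mathrm{obs}} - \bar{x}_{\mathrm{rand}}}{s_{\mathrm{rand}}},
\qquad
\bar{x}_{\mathrm{rand}} = \frac{1}{R}\sum_{r=1}^{R} x_r,
\qquad
s_{\mathrm{rand}} = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(x_r - \bar{x}_{\mathrm{rand}}\right)^2}
$$

Here $x_{\mathrm{obs}}$ is the measure computed on the observed network, $x_r$ is the same measure on the $r$-th random network, and $R$ is the number of random networks. The $1/R$ form of the standard deviation matches the default behavior of `np.std`. As a rough guide, and only if the simulated values are approximately normally distributed, a $|z|$ above about 2 suggests the observed value would be unusual for a comparable random network.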

To avoid storing hundreds of random graphs in memory, we'll generate each random graph, compute our measures of interest, and then discard the random graph. So we'll need to decide on a few measures of interest. Again in the interest of keeping things simple, we'll start with transitivity (the share of connected triples of nodes that close into triangles) and then repeat the comparison for degree assortativity. You can always add more measures later as an exercise.

In [4]:
def compare_to_random_network(graph, measure, runs):
  """Compute a given measure on a network and compare 
     to the value for a similar random network. Return
     a z-score as the result."""
  
  # create a list to hold the measure values from the random networks
  random_values = []
  
  # loop over the number of desired random graphs
  for _ in range(runs):
    
    # generate a comparable random network to the observed network
    graph_random = nx.gnm_random_graph(n=graph.number_of_nodes(), 
                                       m=graph.number_of_edges())
    
    # compute the measure on the random network
    random_values.append(measure(graph_random))

  # compute measure on observed network
  observed_value = measure(graph)
    
  # compute mean of random_values
  random_values_mean = np.mean(random_values)

  # compute sd of random_values
  random_values_sd = np.std(random_values)

  # return the zscore
  return (observed_value-random_values_mean)/random_values_sd
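
The function above draws fresh random graphs on every call, so its z-scores change slightly from run to run. The sketch below is a hypothetical seeded variant, tried out on NetworkX's built-in karate club graph rather than the patent data. The name `zscore_vs_random`, the `seed` parameter, and the guard for a zero standard deviation are all additions made for illustration.

```python
import networkx as nx
import numpy as np


def zscore_vs_random(graph, measure, runs, seed=None):
    """Hypothetical seeded variant of compare_to_random_network."""
    rng = np.random.default_rng(seed)
    n_nodes = graph.number_of_nodes()
    n_edges = graph.number_of_edges()

    random_values = []
    for _ in range(runs):
        # draw an integer seed so each random graph differs but the
        # whole sequence is repeatable for a given `seed`
        graph_seed = int(rng.integers(2**32))
        graph_random = nx.gnm_random_graph(n_nodes, n_edges, seed=graph_seed)
        random_values.append(measure(graph_random))

    sd = np.std(random_values)
    if sd == 0:
        # every random graph gave the same value, so a z-score is undefined
        return float("nan")
    return (measure(graph) - np.mean(random_values)) / sd


karate = nx.karate_club_graph()
z_first = zscore_vs_random(karate, nx.transitivity, runs=100, seed=0)
z_second = zscore_vs_random(karate, nx.transitivity, runs=100, seed=0)
print(z_first, z_first == z_second)
```

Calling the function twice with the same seed should give the same z-score, which makes results easier to check and share.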

In [5]:
# transitivity
university_networks_df["transitivity"] = university_networks_df["graph"].apply(nx.transitivity)
university_networks_df["transitivity_zscore"] = university_networks_df["graph"].apply(compare_to_random_network, args=(nx.transitivity, 100))

In [6]:
# degree assortativity coefficient
university_networks_df["degree_assortativity_coefficient"] = university_networks_df["graph"].apply(nx.degree_assortativity_coefficient)
university_networks_df["degree_assortativity_coefficient_zscore"] = university_networks_df["graph"].apply(compare_to_random_network, args=(nx.degree_assortativity_coefficient, 100))

In [7]:
# check out the results
university_networks_df

| | university | graph | transitivity | transitivity_zscore | degree_assortativity_coefficient | degree_assortativity_coefficient_zscore |
|---|---|---|---|---|---|---|
| 0 | OHIO_STATE | (7722842-1, 7425229-2, 8038907-3, 4172124-2, 4... | 0.683871 | 143.894012 | 0.331366 | 7.04227 |
| 1 | INDIANA | (5905258-1, 4451396-1, 7294830-2, 4462685-2, 6... | 0.955102 | 51.539669 | 0.92936 | 9.318455 |
| 2 | NEBRASKA | (7126303-5, 7042184-2, 7042184-3, 7842479-2, 7... | 0.994532 | 209.90389 | 0.996106 | 30.350318 |
| 3 | WISCONSIN | (6440953-4, 4918178-2, 7704362-2, 7705425-2, 7... | 0.657697 | 589.501274 | 0.256103 | 11.450885 |
| 4 | PENN_STATE | (8071132-5, 6165875-3, PP6218-1, 7321854-2, 81... | 0.83783 | 262.507145 | 0.637726 | 19.873248 |
| 5 | UIUC | (8207216-2, 5658947-1, 8334976-4, 5655956-1, 7... | 0.751806 | 418.36618 | 0.61265 | 26.542681 |
| 6 | IOWA | (7603013-2, 5721209-3, 7951781-5, 7474775-3, 4... | 0.60989 | 124.425734 | 0.045951 | 1.295841 |
| 7 | PURDUE | (5292987-3, 5286996-2, 4464188-1, 6355510-1, 7... | 0.772647 | 330.226885 | 0.457781 | 13.045009 |
| 8 | MICHIGAN_STATE | (7354505-1, 6862924-4, 7381625-3, 7666144-2, 8... | 0.667093 | 145.967179 | 0.062185 | 1.962559 |
| 9 | MINNESOTA | (6650116-3, 4888555-1, 4914392-1, 6835394-1, 6... | 0.798369 | 255.871759 | 0.703713 | 18.542403 |


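So at least on these two dimensions, our university networks are (generally) quite different from random graphs. Every transitivity z-score is above 50, and eight of the ten degree assortativity z-scores are above 7. The exceptions are IOWA (about 1.30) and MICHIGAN_STATE (about 1.96), whose degree assortativity is close to what we would expect from a comparable random graph.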

# Exercises

  * Adapt the code above to compare the observed and random networks on several different measures.
  * Rather than using a simple random graph, try a few alternative comparison models (e.g., Watts-Strogatz).
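
As a possible starting point for the second exercise, the sketch below builds a Watts-Strogatz graph that roughly matches an observed network. It again uses the karate club graph as a stand-in, and the rewiring probability `p=0.1` and the seed are arbitrary assumptions. Watts-Strogatz takes a number of ring neighbors `k` rather than an edge count, so the edge count can only be matched approximately.

```python
import networkx as nx

# Stand-in for an observed network (assumption: not the patent data).
observed = nx.karate_club_graph()
n = observed.number_of_nodes()
m = observed.number_of_edges()

# Pick the even k closest to the observed mean degree (2m / n), with a
# floor of 2 so that every node has at least one neighbor on each side.
mean_degree = 2 * m / n
k = max(2, 2 * round(mean_degree / 2))

# p is the probability of rewiring each edge; 0.1 is an arbitrary choice.
ws_graph = nx.watts_strogatz_graph(n=n, k=k, p=0.1, seed=1)

print("observed edges:", m, "Watts-Strogatz edges:", ws_graph.number_of_edges())
print("observed transitivity:", nx.transitivity(observed))
print("Watts-Strogatz transitivity:", nx.transitivity(ws_graph))
```

Because the comparison graph no longer has exactly the same number of ties, any z-scores computed against it answer a slightly different question than the ones above.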