## What's up, my people 😼

In this lesson, we will learn how to condition a network on a variable.

In plain English, this means we will generate the belief network of all people who hold a certain belief (or lack thereof).

The archetypical example from our discussions is to generate the belief network of self-proclaimed liberals and compare it to the belief network of self-proclaimed conservatives. 

This amounts to conditioning the network on the variable 'POLVIEWS', with liberals being those with values <0 and conservatives being those with values >0.

First, we import and clean the data.

In [1]:
import os
import sys
project_root = os.path.dirname(os.path.dirname(os.path.abspath("..")))
if project_root not in sys.path:
    sys.path.append(project_root)

from CLEAN.datasets.import_gss import import_dataset
from CLEAN.datasets.clean_raw_data import clean_datasets
from CLEAN.source_code.generators.corr_make_network import calculate_correlation_matrix, CorrelationMethod, EdgeSuppressionMethod
from CLEAN.source_code.generators.corr_make_conditioned_network import calculate_conditioned_correlation_matrix
from CLEAN.source_code.visualizers.network_visualizer import generate_html_visualization
from CLEAN.source_code.analyzers.graph_similarity import graph_similarity

In [2]:
df, _ = import_dataset()
cleaned_df = clean_datasets()

Loading from cache...
Done! ✨
Loading from cache...
Done! ✨


## OK, next up is... selecting a variable to condition on 🐈

Now we can choose a variable to condition on. 

To do this, we simply refer to the corr_make_conditioned_network.py file and make use of the function calculate_conditioned_correlation_matrix().

This function accepts the usual parameters (as in calculate_correlation_matrix()) plus some extra ones:

1. variable_to_condition  (string telling us which variable to condition on)
2. condition  (string telling us what condition to apply to the variable - one of 'equal_to', 'less_than_zero', 'greater_than_zero')
3. value (the value of the variable to condition on - only used if condition is 'equal_to')
4. return_df (boolean) - whether to return (filtered dataframe, conditioned correlation matrix) or just the conditioned correlation matrix

To understand what less_than_zero and greater_than_zero actually correspond to for a given variable, simply control-F the variable name in clean_raw_data.py and look the variable up in the [GSS data explorer](https://gssdataexplorer.norc.org/variables/vfilter).
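To make the idea of conditioning concrete, here is a minimal sketch on toy data using plain pandas. It is not the implementation inside corr_make_conditioned_network.py: the helper `condition_rows`, the toy columns `A` and `B`, and the toy values are all made up for illustration. The only idea it borrows from the text above is "keep the rows that satisfy the condition, then compute correlations on what is left".

```python
import pandas as pd

def condition_rows(df, variable, condition, value=None):
    """Hypothetical helper: keep only rows where `variable` satisfies `condition`."""
    if condition == "less_than_zero":
        mask = df[variable] < 0
    elif condition == "greater_than_zero":
        mask = df[variable] > 0
    elif condition == "equal_to":
        mask = df[variable] == value
    else:
        raise ValueError(f"unknown condition: {condition!r}")
    return df.loc[mask]

# Toy data (invented): A and B move together for POLVIEWS < 0
# and in opposite directions for POLVIEWS > 0.
toy = pd.DataFrame({
    "POLVIEWS": [-2, -1, -1, 1, 2, 3],
    "A": [1, 2, 3, 3, 2, 1],
    "B": [1, 2, 3, 1, 2, 3],
})

liberal_rows = condition_rows(toy, "POLVIEWS", "less_than_zero")
conservative_rows = condition_rows(toy, "POLVIEWS", "greater_than_zero")

# Correlations computed separately on each subset and on the full sample
liberal_corr = liberal_rows[["A", "B"]].corr(method="pearson")
conservative_corr = conservative_rows[["A", "B"]].corr(method="pearson")
full_corr = toy[["A", "B"]].corr(method="pearson")

print(liberal_corr.loc["A", "B"], conservative_corr.loc["A", "B"], full_corr.loc["A", "B"])
```

In this toy case the two subsets show opposite associations that cancel out in the full sample, which is exactly the kind of structure conditioning is meant to reveal.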

In [3]:
var_to_condition = 'POLVIEWS'
years_of_interest = [2018, 2020, 2022]


conditioned_corr_matrix = calculate_conditioned_correlation_matrix(
    cleaned_df, 
    years_of_interest=years_of_interest,
    method=CorrelationMethod.PEARSON,
    partial=True,
    edge_suppression=EdgeSuppressionMethod.REGULARIZATION,
    suppression_params={'regularization': 0.18},

    variable_to_condition=var_to_condition, 
    condition='less_than_zero'
)

We can now plot this network and see what it looks like. 

You are looking at the network of liberal-leaning individuals: the ideas that they associate with each other.


In [4]:
generate_html_visualization(
    conditioned_corr_matrix,
    output_path='delete_this_file.html'
)

Network visualization has been saved to c:\Users\timbo\Github\BeliefNetworkEvo\CLEAN\notebooks\tutorials\delete_this_file.html


## Further analysis (graph edit distance) 😹

We can do some basic analysis now.

A natural question is: how different are the networks of liberals and conservatives?

To answer this question, we can calculate some sort of graph distance metric between the network of liberals and the network of conservatives.

In this case, we will use the so-called Graph Edit Distance -- the number of edits (deletions, insertions) needed to transform one network into the other.
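As a rough illustration of this definition, the sketch below counts edge insertions and deletions between two tiny graphs. It assumes both graphs share the same node set and that edges are undirected, in which case the count is simply the number of edges present in one graph but not the other. The function `edge_edit_distance` and the node labels are hypothetical, and this is not necessarily how graph_similarity computes its score.

```python
def edge_edit_distance(edges_a, edges_b):
    """Count edge deletions plus insertions needed to turn graph A into graph B.

    Assumes a shared node set and undirected edges, so (u, v) == (v, u).
    """
    a = {frozenset(edge) for edge in edges_a}
    b = {frozenset(edge) for edge in edges_b}
    deletions = a - b   # edges in A that B does not have
    insertions = b - a  # edges in B that A does not have
    return len(deletions) + len(insertions)

# Toy graphs with made-up node labels
graph_one = [("A", "B"), ("A", "C"), ("B", "C")]
graph_two = [("B", "A"), ("A", "D")]

distance = edge_edit_distance(graph_one, graph_two)
print(distance)
```

Here the edge A-B is shared, A-C and B-C must be deleted, and A-D must be inserted, giving three edits in total.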

First, we'll calculate the two networks.

In [5]:
var_to_condition = 'POLVIEWS'
years_of_interest = [2016, 2018, 2020]


liberal_corr_matrix = calculate_conditioned_correlation_matrix(
    cleaned_df, 
    years_of_interest=years_of_interest,
    method=CorrelationMethod.PEARSON,   
    partial=True,
    edge_suppression=EdgeSuppressionMethod.REGULARIZATION,
    suppression_params={'regularization': 0.18},

    variable_to_condition=var_to_condition, 
    condition='less_than_zero',
)

conservative_corr_matrix = calculate_conditioned_correlation_matrix(
    cleaned_df, 
    years_of_interest=years_of_interest,
    method=CorrelationMethod.PEARSON,
    partial=True,
    edge_suppression=EdgeSuppressionMethod.REGULARIZATION,
    suppression_params={'regularization': 0.18},

    variable_to_condition=var_to_condition, 
    condition='greater_than_zero',
)

Now we can call the graph_similarity function to calculate the graph edit distance between the two networks.

The threshold parameter simply determines the correlation strength that is sufficient to be considered an edge (0 means all edges are considered and 1 means no edges are considered).
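To see what the threshold does, here is a small sketch that turns a toy correlation matrix into a set of edges. It assumes an edge is kept when the absolute correlation is strictly greater than the threshold; whether graph_similarity uses exactly this rule is an assumption, and `edges_above_threshold` and the toy matrix are hypothetical.

```python
import numpy as np

def edges_above_threshold(corr, threshold):
    """Return undirected edges (i, j), i < j, with |corr[i, j]| > threshold."""
    corr = np.asarray(corr)
    # Upper triangle only (k=1 skips the diagonal), so each pair appears once
    rows, cols = np.triu_indices_from(corr, k=1)
    keep = np.abs(corr[rows, cols]) > threshold
    return {(int(i), int(j)) for i, j in zip(rows[keep], cols[keep])}

# Toy symmetric correlation matrix (invented values)
toy_corr = np.array([
    [1.00, 0.30, -0.05],
    [0.30, 1.00, 0.15],
    [-0.05, 0.15, 1.00],
])

for threshold in (0.0, 0.1, 0.2, 1.0):
    print(threshold, sorted(edges_above_threshold(toy_corr, threshold)))
```

Raising the threshold removes the weakest edges first, so the resulting distance between two networks depends on where the cut is placed.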

If the similarity score is 0, the networks are identical.

If the similarity score is, say, 250, then 250 edge insertions/deletions are needed to transform one network into the other.


In [6]:
graph_edit_distance = graph_similarity(liberal_corr_matrix, conservative_corr_matrix, 
                                       similarity_method="graph_edit_distance",
                                       edge_threshold=0.1
                                       )
print(graph_edit_distance)

SimilarityResult(
  Score: 67.0000,
  Method: 'graph_edit_distance',
  Metadata: {parameters: {'edge_threshold': 0.1}}
)


And of course we will want to actually see the two graphs.

In [7]:
generate_html_visualization(
    liberal_corr_matrix,
    highlight_nodes='HOMOSEX',
    output_path='delete_this_file_Liberal.html'
)
generate_html_visualization(
    conservative_corr_matrix,
    highlight_nodes='HOMOSEX',
    output_path='delete_this_file_Conservative.html'
)

Network visualization has been saved to c:\Users\timbo\Github\BeliefNetworkEvo\CLEAN\notebooks\tutorials\delete_this_file_Liberal.html
Network visualization has been saved to c:\Users\timbo\Github\BeliefNetworkEvo\CLEAN\notebooks\tutorials\delete_this_file_Conservative.html


For the values I used (threshold=0.1), the similarity score is 67, meaning 67 edge insertions/deletions are needed to transform one network into the other.

To understand how the threshold affects the similarity score, we can plot the score as a function of the threshold.

In [8]:
import numpy as np
import plotly.express as px

# Define thresholds and calculate graph edit distances
thresholds = np.linspace(0, 1, 666)
graph_edit_distances = [graph_similarity(liberal_corr_matrix, conservative_corr_matrix, 
                                         similarity_method="graph_edit_distance",
                                         edge_threshold=threshold).score 
                        for threshold in thresholds]

# Plotting with plotly
fig = px.line(x=thresholds, y=graph_edit_distances, title='Graph Edit Distance as a Function of Threshold Strength',
              labels={'x': 'Threshold Strength', 'y': 'Graph Edit Distance'})
fig.update_layout(xaxis_range=[0, 1], yaxis_range=[min(graph_edit_distances), max(graph_edit_distances)])
fig.add_hline(y=0, line_dash="dash", line_color="red")
fig.show()

## Further analysis (comparing different conditioning types)

It would also be good to put this into the context of the whole network.

Let's see the graph edit distances when we condition on other variables...