#### Sociology 128D: Mining Culture Through Text Data: Introduction to Social Data Science – Summer '22

# Notebook 6: Online Text and Social Networks

In Notebooks 3 and 5, we used [TF-IDF weighting](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to represent documents as vectors that could be compared and plotted. In doing so, we were focusing on the similarity of the language used in the documents. Other research has used different approaches to creating this kind of representation of entire communities on Reddit (e.g., [Martin, 2017](https://aclanthology.org/W17-2904.pdf)), but we may not only want to consider the similarity of *content* when comparing online communities such as subreddits. Depending on the type of data we have, there may be other kinds of information worth considering.

In one of the readings for this week, [Hessel et al. (2015)](https://arxiv.org/abs/1511.03371) leverage two very different types of similarity: text similarity and *user similarity*, or the similarity of two subreddits based on their overlapping users. In their introduction, they write, "By explicitly comparing the differences between the user-based and language-based metrics, we can discover relationships that might not be captured by using only a single similarity metric. For example, we can ask 'what do vegans do when they aren’t talking about veganism?'" This kind of approach can give us leverage on apparent relationships between seemingly unrelated factors like political identity and taste. As [DellaPosta, Shi, and Macy (2015)](https://www.journals.uchicago.edu/doi/full/10.1086/681254) write in their classic paper on 'lifestyle politics':

> In short, the puzzle of lifestyle politics compounds the curious formation of cultural enclaves among seemingly unrelated preferences. Why should liberals and conservatives differ systematically on lifestyle dimensions that have no apparent substantive relevance to political ideology? What are the social mechanisms that could produce a world of “latte liberals” and “bird-hunting conservatives”? (p. 1475)

DellaPosta, Shi, and Macy argue that political identity and taste are linked due to the paired forces of [homophily](https://en.wikipedia.org/wiki/Homophily) and social influence (see also [McPherson, Smith-Lovin, and Cook, 2001](https://www.jstor.org/stable/pdf/2678628.pdf)). In brief, people associate with similar others, and people influence and are influenced by the people with whom they associate. If people seek one another out because they have one trait or set of traits in common and then acquire other traits from one another, the various traits may become correlated within the larger population. Additionally, different ideas or behaviors can become incredibly important for signaling a shared identity or membership in a common group.

In this notebook, we will take a look at a different corpus of Reddit data, and we will see how considering text similarity, user similarity, and the combination of user similarity with text *dissimilarity* can shed light on how identity and taste are linked. For this notebook, we will only need to import `pandas` to manage the data and `matplotlib.pyplot` and `seaborn` to help with visualizations. 

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

The dataset is available via Canvas. It was originally [released by](https://jmhessel.com/projects/latent_interest/latent_interest.html) Hessel et al. so that others could replicate their work. (Dr. Hessel personally approved sharing the corpus via Canvas for this class.) I have uploaded the full dataset, including all of the text, but if you just want to follow along with the core of this notebook, you can save space and time by downloading only the two files we read in below.

The two files, which we will import as dataframes and then merge, correspond to the two similarity metrics used in the original paper: text similarity and user similarity. Each file has two columns naming a pair of subreddits and a third column giving the similarity of that pair. We will use the `read_csv` function from `pandas` with an important twist: we will set the `sep` argument to `"\t"` (tabs) instead of the default (commas). We also pass `names` to label the columns ourselves.

Although we have focused on TF-IDF weighting and [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) so far, Hessel et al. measure text similarity in a different way. Specifically, they train a topic model using [latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) to identify latent themes within the content of the posts to various subreddits. They then compare subreddits based on the probability distributions of the resulting topics using [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence). We will cover topic modeling in Week 7.
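
To make the idea of comparing topic distributions concrete, here is a minimal sketch of Jensen-Shannon divergence in plain Python. The three topic distributions are invented for illustration and do not come from the dataset, and the sketch only computes the divergence itself; how Hessel et al. turn the divergence into the similarity scores stored in the file is not reproduced here.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions (base-2 logs, so 0 <= JSD <= 1)."""
    # m is the midpoint distribution; it is positive wherever p or q is positive
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic distributions for three subreddits over four topics
cooking = [0.70, 0.20, 0.05, 0.05]
baking = [0.60, 0.30, 0.05, 0.05]
politics = [0.05, 0.05, 0.20, 0.70]

print(js_divergence(cooking, baking))    # small: similar topic mix
print(js_divergence(cooking, politics))  # larger: different topic mix
```

Unlike Kullback-Leibler divergence, Jensen-Shannon divergence is symmetric and always finite, which makes it convenient for comparing every pair of subreddits.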

In [None]:
df = pd.read_csv("data/RedditDataRelease/textSims/500-out-all.graph", sep="\t", names=["sub1", "sub2", "text_sim"])

What makes the work of Hessel et al. so interesting is their use of different types of similarity. To measure user similarity, they use the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index). You can think of this as representing each subreddit as a vector of all *users* (or 'redditors') in the dataset, where each element is binary: 1 if the user participates in the subreddit, and 0 if they do not. The Jaccard index quantifies the amount of overlap between two subreddits.
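
As a small illustration of the Jaccard index, the sketch below treats each subreddit as a set of usernames, which is equivalent to the binary vector description above. The usernames and memberships are made up, and returning 0.0 for two empty sets is a convention chosen here, not something taken from the paper.

```python
def jaccard_index(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)

# Hypothetical sets of users who participate in each subreddit
vegan_users = {"ana", "ben", "chloe", "dev"}
running_users = {"ben", "chloe", "eli"}

# 2 shared users out of 5 distinct users overall
print(jaccard_index(vegan_users, running_users))  # 0.4
```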

We read in the file with text similarity as `df` because we read it in first and will merge the user similarity file with it. We'll read the file with user similarity in as `userSims` because the authors use that name for the folder and it's clear enough, but we won't do anything with that file on its own.

In [None]:
userSims = pd.read_csv("data/RedditDataRelease/userSims/jaccardSims-sparse.txt", sep="\t", names=["sub1", "sub2", "user_sim"])

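Below, we combine the two dataframes using the `merge` method. You can read more about it [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html). Because we use a left merge, every pair in the text similarity dataframe is kept; any pair with no match in the user similarity file ends up with a missing value, which we then replace with 0 using `fillna`. As you can see when we look at the head of the dataframe, each row has a pair of subreddits, the text similarity, and the user similarity.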

In [None]:
df = df.merge(userSims, on=["sub1", "sub2"], how="left")
df = df.fillna(0)
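
To see what the left merge and `fillna` do, here is a toy version with three invented subreddit pairs. The `indicator=True` argument is an addition for this sketch: it adds a `_merge` column that lets us count how many pairs had no user similarity before we fill them with 0. Note that the merge matches on the ordered pair (`sub1`, `sub2`), so it assumes both files list each pair in the same order.

```python
import pandas as pd

# Toy stand-ins for the two similarity files (values are invented)
text = pd.DataFrame({
    "sub1": ["a", "a", "b"],
    "sub2": ["b", "c", "c"],
    "text_sim": [0.9, 0.4, 0.2],
})
users = pd.DataFrame({
    "sub1": ["a", "b"],
    "sub2": ["b", "c"],
    "user_sim": [0.30, 0.05],
})

# Keep every row of `text`; rows without a match get NaN for user_sim
merged = text.merge(users, on=["sub1", "sub2"], how="left", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())

# Treat a missing user similarity as no overlap
merged = merged.drop(columns="_merge").fillna({"user_sim": 0.0})

print(merged)
print(unmatched)
```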

In [None]:
df.head()

Taking a look at the shape of the dataframe, we can see there are about 5.2 million rows. Where does this number come from? It is the number of possible undirected ties, that is, unordered pairs, in a network of all subreddits in the dataset. We can get the number of subreddits with built-in set operations, specifically the union of the two columns of subreddit names. As Hessel et al. write, the corpus represents about 3200 subreddits. The number of possible undirected ties in a network is N(N-1)/2, where N is the number of nodes (or vertices), which in this case means unique subreddits. Comparing the result with the number of rows in the dataframe, we can see that the two are the same.

In [None]:
df.shape

In [None]:
NUM_SUBS = len(set(df.sub1).union(df.sub2))
print(NUM_SUBS)

In [None]:
# Integer division keeps the result an int so it compares cleanly with df.shape[0]
NUM_SUBS * (NUM_SUBS - 1) // 2
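
The N(N-1)/2 formula can be checked by brute force on a small network. In this sketch, the five node names are placeholders, and `itertools.combinations` enumerates every unordered pair so we can compare the count with the formula.

```python
from itertools import combinations

def num_undirected_ties(n: int) -> int:
    """Number of unordered pairs among n nodes."""
    return n * (n - 1) // 2

nodes = ["a", "b", "c", "d", "e"]     # hypothetical subreddits
pairs = list(combinations(nodes, 2))  # every unordered pair, no self-ties

print(len(pairs), num_undirected_ties(len(nodes)))  # 10 10
```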

In [None]:
df[["text_sim", "user_sim"]].describe()

In [None]:
df.hist(column="text_sim", bins=300)
plt.show()

The function below implements the rescaling the authors describe for text similarities so that below-average similarities count against the subreddit-subreddit dyad in later calculations. We first calculate the mean (`mu`) of the text similarity column. We then set below-average text similarity scores to 0.0 and rescale the remaining scores using the formula in the paper, so that a score equal to the mean maps to 0 and a score of 1 stays at 1. The function is slightly redundant as written (provided `mu` is below 1, the final `max` cannot change the result once below-average scores have returned early), but it is written this way to show the connection to equation 3 in the paper.

In [None]:
def scale_text_similarity(sim: float, mu: float) -> float:
    """
    Rescale text-based similarity as in equation 3 in
    https://arxiv.org/abs/1511.03371
    """
    if sim < mu:
        return 0.0
    sim = (sim - mu)/(1 - mu)
    return max(0.0, sim)
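
A quick worked example may help. The sketch below repeats the function so it can run on its own, applies it to four invented similarity scores with an invented mean of 0.5, and compares it with a one-line variant (`scale_compact`, a name introduced here) to show where the redundancy lies.

```python
def scale_text_similarity(sim: float, mu: float) -> float:
    """Same logic as the notebook function above."""
    if sim < mu:
        return 0.0
    sim = (sim - mu) / (1 - mu)
    return max(0.0, sim)

def scale_compact(sim: float, mu: float) -> float:
    """One-line variant: clip the rescaled value at zero (assumes mu < 1)."""
    return max(0.0, (sim - mu) / (1 - mu))

mu = 0.5                      # hypothetical mean text similarity
sims = [0.2, 0.5, 0.75, 1.0]  # hypothetical raw scores

rescaled = [scale_text_similarity(s, mu) for s in sims]
print(rescaled)  # [0.0, 0.0, 0.5, 1.0]
print(rescaled == [scale_compact(s, mu) for s in sims])  # True
```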

In [None]:
mu = df["text_sim"].mean()
print(mu)

df["text_sim_rescaled"] = df["text_sim"].apply(lambda x: scale_text_similarity(x, mu))

In [None]:
# Draw one random sample so both curves describe the same rows
sample = df.sample(1000)
sns.kdeplot(data=sample, x="text_sim")
sns.kdeplot(data=sample, x="text_sim_rescaled")
plt.legend(["Original", "Rescaled"])
plt.show()

In [None]:
df[["text_sim", "text_sim_rescaled"]].describe()

In [None]:
df.head()

In [None]:
df.sort_values("text_sim", ascending = False).head(10)

In [None]:
df.sort_values("user_sim", ascending = False).head(10)

Next, we will calculate the latent interest measure put forth in equation 5 of the paper. We do this by multiplying the user similarity score for each subreddit-subreddit dyad by 1 minus the rescaled text similarity. In effect, we assign more weight to pairs of subreddits that share users but are dissimilar in their text.

In [None]:
df["latent_interest"] = df.user_sim * (1 - df.text_sim_rescaled) # equation 5
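
To see how this measure reorders pairs, consider three invented dyads. The subreddit names and scores below are illustrative only and are not taken from the data.

```python
def latent_interest(user_sim: float, text_sim_rescaled: float) -> float:
    """User similarity, discounted by how similar the text already is."""
    return user_sim * (1 - text_sim_rescaled)

# (sub1, sub2, user_sim, text_sim_rescaled) -- hypothetical values
dyads = [
    ("vegan", "vegetarian", 0.20, 0.90),  # many shared users, very similar text
    ("vegan", "running", 0.10, 0.00),     # fewer shared users, dissimilar text
    ("vegan", "cars", 0.01, 0.00),        # almost no shared users
]

ranked = sorted(dyads, key=lambda d: latent_interest(d[2], d[3]), reverse=True)
for sub1, sub2, user_sim, text_sim in ranked:
    print(sub1, sub2, round(latent_interest(user_sim, text_sim), 3))
```

The pair with the highest user similarity drops below the second pair because its text is nearly identical, while the pair with almost no shared users stays at the bottom regardless of its text.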

In [None]:
df[["latent_interest"]].describe()

Let's take a look at the top results for the r/vegan subreddit based on these different measures. Here, we are selecting the subset of rows where either sub1 or sub2 is equal to "vegan" and then sorting the results according to the different measures.

In [None]:
df[(df.sub1=="vegan")|(df.sub2=="vegan")].sort_values("text_sim", ascending=False).head(10)

In [None]:
df[(df.sub1=="vegan")|(df.sub2=="vegan")].sort_values("user_sim", ascending=False).head(10)

In [None]:
df[(df.sub1=="vegan")|(df.sub2=="vegan")].sort_values("latent_interest", ascending=False).head(10)
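
Since the same filter-and-sort pattern is repeated for each measure, it could be wrapped in a small helper. The function name `top_pairs` and the toy dataframe are assumptions made for this sketch; the column names match the ones used in this notebook.

```python
import pandas as pd

def top_pairs(df: pd.DataFrame, subreddit: str, by: str, n: int = 10) -> pd.DataFrame:
    """Rows where `subreddit` appears in either column, sorted by `by` (descending)."""
    mask = (df["sub1"] == subreddit) | (df["sub2"] == subreddit)
    return df[mask].sort_values(by, ascending=False).head(n)

# Toy data with invented scores
toy = pd.DataFrame({
    "sub1": ["vegan", "cars", "running", "cars"],
    "sub2": ["running", "vegan", "cooking", "cooking"],
    "latent_interest": [0.10, 0.01, 0.05, 0.02],
})

result = top_pairs(toy, "vegan", by="latent_interest", n=2)
print(result)
```

With the full dataframe, a call such as `top_pairs(df, "vegan", by="user_sim")` would return the same rows as the corresponding cell above.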

Although patterns of consumption (such as diet) may be driven by factors such as moral beliefs or identity, a lot of research has focused on the centrality of political identity (e.g., [Boutyline and Vaisey, 2017](https://www.journals.uchicago.edu/doi/full/10.1086/691274)). We can use this latent interest detection approach to see what different political groups are up to on Reddit (or *were* up to during the period covered by the data).

In [None]:
df[(df.sub1=="Conservative")|(df.sub2=="Conservative")].sort_values("latent_interest", ascending=False).head(10)

In [None]:
df[(df.sub1=="Liberal")|(df.sub2=="Liberal")].sort_values("latent_interest", ascending=False).head(10)

## Extensions

If you want to take a look at the text for a specific subreddit, you can use code like the first two cells below. If you want to read in all of the text as a dataframe instead, the cells after those will do that for you; reading the full file may take a minute or two.

In [None]:
vegan = []

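# Stream the file line by line so the full corpus never has to sit in memory
with open("data/RedditDataRelease/redditTextBalanced.txt", "r") as reader:
    for line in reader:
        fields = line.strip().split("\t")  # label, subreddit, text
        # Skip any line too short to have a subreddit field
        if len(fields) > 1 and fields[1] == "vegan":
            vegan.append(fields)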

In [None]:
pd.DataFrame(vegan, columns=["label", "subreddit", "text"])

In [None]:
%time textBalanced = pd.read_csv("data/RedditDataRelease/redditTextBalanced.txt", sep="\t", names=["label", "subreddit", "text"] )

In [None]:
textBalanced.head()

In [None]:
textBalanced.shape