# INFO 3350/6350

## Lecture 26: Working with Social Media Data

## What is social media data?

 * Collected from social networks
 * Different kinds:
     * Examples: posts, comments, likes, followers, clicks, shares (reposts and retweets).
     * Numerical or textual format
     * We'll focus on textual data

![](images/socialmediadata.png)

## Why is working with social media data important?
 
 * Literature can tell us about the past
     * Look backward in time
     * We don't have a lot of digitized textual data from the past (letters? birth certificates?)
 * Social media data can tell us about the present moment
     * Look at the present or forward in time
     * We have lots of data from social networks!
 * It can tell us about
     * human behaviour
     * language
     * current events

![](images/interestingpapers.png)

## Differences between literary texts and texts from social media
 
 * Literary texts
     * Historical
     * Long format
     * Formally edited and published
 * Text from social media data
     * Contemporary (more or less)
     * Shorter format
     * Not formally edited and published

# Do students at Cornell talk about student life differently in 2020 vs 2022?

For the scope of this exercise, we will only focus on Reddit posts and comments published in March and April of 2020, and in March and April of 2022.
To investigate this question we will:
- Scrape posts and comments;
- Gather information about the corpus of posts and comments;
- Deduplicate and clean the corpus;
- Perform topic modeling;
- Evaluate topic modeling;
- Perform classification.

## Scraping

### Set up scraping

Install some new packages for this lecture. We have to use `pip`, since none of these are available via `conda`.

**Note that `tomotopy` does not work natively on Apple Silicon Macs.** If you're running python via Rosetta, you'll be fine. If you're running M1-native python, you're out of luck.

In [1]:
import sys
!{sys.executable} -m pip install psaw little_mallet_wrapper Levenshtein

Collecting little_mallet_wrapper
  Using cached little_mallet_wrapper-0.5.0-py3-none-any.whl (19 kB)
Collecting Levenshtein
  Using cached Levenshtein-0.18.1-cp310-cp310-macosx_11_0_arm64.whl (230 kB)
Collecting rapidfuzz<3.0.0,>=2.0.1
  Using cached rapidfuzz-2.0.11-cp310-cp310-macosx_11_0_arm64.whl (1.2 MB)
Collecting jarowinkler<1.1.0,>=1.0.2
  Using cached jarowinkler-1.0.2-cp310-cp310-macosx_11_0_arm64.whl (56 kB)
Installing collected packages: little_mallet_wrapper, jarowinkler, rapidfuzz, Levenshtein
Successfully installed Levenshtein-0.18.1 jarowinkler-1.0.2 little_mallet_wrapper-0.5.0 rapidfuzz-2.0.11


In [None]:
!{sys.executable} -m pip install tomotopy # does not work on M1

In [2]:
from datetime import datetime
import os
import glob
import pandas as pd
from psaw import PushshiftAPI

base_path = os.path.join('reddit_data')  # creating a directory for the data
if not os.path.exists(base_path):  # if it does not exist
    os.makedirs(base_path)         # create it

### Scraping functions

Here are the two functions for scraping posts and comments respectively from the subreddit of choice.

In [3]:
""" Maria Antoniak's code with minor modifications """
def scrape_posts_from_subreddit(subreddit, api, year, month, end_date):
    '''
    Takes the name of a subreddit, the PushshiftAPI client, a year and month to scrape from,
    and the last day of that month
    '''
    start_epoch = int(datetime(year, month, 1).timestamp())  # convert date into Unix timestamp
    end_epoch = int(datetime(year, month, end_date).timestamp())  # midnight at the start of end_date, so that final day is not covered

    gen = api.search_submissions(after=start_epoch,
                                 before=end_epoch,
                                 subreddit=subreddit,
                                 filter=['url', 'author', 'created_utc',  # info we want about the post
                                         'title', 'subreddit', 'selftext',
                                         'num_comments', 'score', 'link_flair_text', 'id'])

    max_response_cache = 100000
    scraped_posts = []
    for _post in gen:
        scraped_posts.append(_post)
        if len(scraped_posts) >= max_response_cache:  # avoid requesting more posts than allowed
            break

    scraped_posts_df = pd.DataFrame([p.d_ for p in scraped_posts])

    return scraped_posts_df

In [4]:
""" Maria Antoniak's code with minor modifications """
def scrape_comments_from_subreddit(subreddit, api, year, month, end_date):
    '''
    Takes the name of a subreddit, the PushshiftAPI client, a year and month to scrape from,
    and the last day of that month
    '''
    start_epoch = int(datetime(year, month, 1).timestamp())  # convert date into Unix timestamp
    end_epoch = int(datetime(year, month, end_date).timestamp())  # midnight at the start of end_date, so that final day is not covered

    gen = api.search_comments(after=start_epoch,
                              before=end_epoch,
                              subreddit=subreddit,
                              filter=['author', 'body', 'created_utc', # info we want about the comment
                                      'id', 'link_id', 'parent_id',
                                      'reply_delay', 'score', 'subreddit'])

    max_response_cache = 100000
    scraped_comments = []
    for _comment in gen:
        scraped_comments.append(_comment)
        if len(scraped_comments) >= max_response_cache:  # avoid requesting more comments than allowed
            break
    scraped_comments_df = pd.DataFrame([p.d_ for p in scraped_comments])

    return scraped_comments_df
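Both scraping functions turn calendar dates into epoch timestamps to define the window to scrape. The following is a minimal sketch of how such a window could be computed so that it covers a whole month. It assumes UTC and looks up the month length with the standard library's `calendar.monthrange`. The helper name `month_window` is hypothetical, and unlike the functions above it ends the window at the start of the following month.

```python
import calendar
from datetime import datetime, timedelta, timezone

def month_window(year, month):
    """Return (start_epoch, end_epoch) covering a whole calendar month in UTC."""
    n_days = calendar.monthrange(year, month)[1]           # number of days in the month
    start = datetime(year, month, 1, tzinfo=timezone.utc)  # midnight on the first day
    end = start + timedelta(days=n_days)                   # midnight on the first day of the next month
    return int(start.timestamp()), int(end.timestamp())

start_epoch, end_epoch = month_window(2020, 3)
print(start_epoch, end_epoch)
print((end_epoch - start_epoch) // 86400, "days")
```

Using an explicit time zone keeps the result independent of the machine the notebook runs on. A naive `datetime`, as used above, is interpreted in local time.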

### Scrape!

Here we will decide:
- which subreddit to scrape,
- which content type to scrape from that subreddit,
- and which dates we want to scrape.

We will then run the scraping functions above accordingly.

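We will save files in **pickle format**. Why?
- To avoid problems when writing the files and reading them back. Posts and comments contain commas, quotes, and line breaks, which can be mistaken for separators or row boundaries if a CSV file is not quoted and parsed consistently. A pickle file stores the dataframe exactly as it is.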
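As a small illustration of this point, the sketch below uses only the standard library rather than pandas. The records are hypothetical stand-ins for scraped posts. It shows that splitting a text on commas breaks it apart, while a pickle round trip returns the objects unchanged.

```python
import os
import pickle
import tempfile

# Hypothetical records standing in for scraped posts
records = [
    {"id": "abc123", "text": 'Prelim moved, again, to "next week"\nAny advice?'},
    {"id": "def456", "text": "Housing, dining, and classes"},
]

# A naive comma split breaks the first text into several fields
naive_fields = records[0]["text"].split(",")
print(len(naive_fields))

# A pickle round trip returns the objects unchanged
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "posts.pkl")
    with open(path, "wb") as f:
        pickle.dump(records, f, protocol=4)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == records)
```

One caveat: unpickling can execute arbitrary code, so only load pickle files that you created yourself or that come from a source you trust.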

NOTE ON DIRECTORIES:
- Our jupyter notebook is in a folder on our machine
  - inside that folder we previously created a folder `reddit_data`
    - inside `reddit_data` we will create a folder named after the subreddit we will scrape, `cornell`
      - inside `cornell` we will create one folder for each of the two content types, `posts` and `comments`
        - inside `posts` we will store all the data about the posts of the Cornell subreddit
        - inside `comments` we will store all the data about the comments of the Cornell subreddit
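The layout described above can be sketched as follows. The helper `build_layout` is hypothetical and writes into a temporary directory so that it does not touch the real `reddit_data` folder.

```python
import os
import tempfile

def build_layout(root, subreddit, content_types=("posts", "comments")):
    """Create root/reddit_data/<subreddit>/<type> folders and return their paths."""
    paths = []
    for content_type in content_types:
        path = os.path.join(root, "reddit_data", subreddit, content_type)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as root:
    created = build_layout(root, "cornell")
    for path in created:
        print(os.path.relpath(path, root))
    all_exist = all(os.path.isdir(p) for p in created)
```

Passing `exist_ok=True` to `os.makedirs` replaces the explicit `os.path.exists` check used elsewhere in this notebook.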

In [5]:
""" Maria Antoniak's code with minor modifications """
def scrape_subreddit(_target_subreddits, _target_types, _years):
    '''
    Takes a list of subreddits, a list of types of content to scrape, and a list of years to scrape from
    '''
    
    api = PushshiftAPI()

    print('Number of PushshiftApi shards that are not working:', api.metadata_.get('shards'))  # check if any Pushshift shards are down!
    
    for _subreddit in _target_subreddits:
        for _target_type in _target_types:
            for _year in _years:
                months = [3, 4]       # months to scrape (March and April)
                end_dates = [31, 30]  # last day of each month

                for _month, _end_date in zip(months, end_dates):
                    _output_directory_path = os.path.join(base_path, _subreddit, _target_type)  # directory to store scraped data
                                                                                                # by subreddit and type of content
                    if not os.path.exists(_output_directory_path):  # if it does not exist
                        os.makedirs(_output_directory_path)         # create it!

                    _file_name = _subreddit + '-' + str(_year) + '-' + str(_month) + '.pkl'  # filename of the pickle file with scraped data

                    # scrape only if output file does not already exist
                    if _file_name not in os.listdir(_output_directory_path):

                        print(str(datetime.now()) + ' ' + ': Scraping r/' + _subreddit + ' ' + str(_year) + '-' + str(_month) + '...')

                        if _target_type == 'posts':
                            _posts_df = scrape_posts_from_subreddit(_subreddit, api, _year, _month, _end_date)
                            if not _posts_df.empty:
                                _posts_df.to_pickle(os.path.join(_output_directory_path, _file_name), protocol=4)

                        if _target_type == 'comments':
                            _comments_df = scrape_comments_from_subreddit(_subreddit, api, _year, _month, _end_date)
                            if not _comments_df.empty:
                                _comments_df.to_pickle(os.path.join(_output_directory_path, _file_name), protocol=4)

    print(str(datetime.now()) + ' ' + ': Done scraping!')

In [6]:
target_subreddits = ['cornell']  # subreddits to scrape
target_types = ['posts', 'comments']  # type of content to scrape
years = [2020, 2022]  # years to scrape
scrape_subreddit(target_subreddits, target_types, years)

Number of PushshiftApi shards that are not working: None
2022-05-02 21:03:03.221418 : Done scraping!


### Combine posts and comments for one subreddit

Here we will combine the pickle files with all the posts from the subreddit and the pickle files with all the comments from the same subreddit into one file.

In [7]:
def combine_one_subreddit(_subreddit):  # creating a pickle file with all of a subreddit's posts and comments

    df_d = {'author': [], 'id': [], 'type': [], 'text': [],   # create a dictionary
            'url': [], 'link_id': [], 'parent_id': [],
            'subreddit': [], 'created_utc': []}
    
    subreddit_pkl_path = os.path.join('reddit_data', _subreddit, f'{_subreddit}.pkl') # file with all the data
    if not os.path.exists(subreddit_pkl_path):  # if file does not exist
        
        for target_type in ['posts', 'comments']:
            files_directory_path = os.path.join('reddit_data', _subreddit, target_type)  # directory where scraped data is depending on subreddit and type of content
            all_target_type_files = glob.glob(os.path.join(files_directory_path, "*.pkl"))  # select all appropriate pickle files
            for f in all_target_type_files:  # we read each pickle file and include the info we want in the dictionary
                df = pd.read_pickle(f)


                if target_type == 'posts':
                    for index, row in df.iterrows():
                        df_d['author'].append(row['author'])
                        df_d['id'].append(f"{row['subreddit']}_{row['id']}_post")  # id of the post, '<subreddit>_xyz123_post'
                        df_d['type'].append('post')
                        df_d['text'].append(row['selftext'])  # textual content of the post
                        df_d['url'].append(row['url'])  # url of the post
                        df_d['link_id'].append('N/A')
                        df_d['parent_id'].append('N/A')
                        df_d['subreddit'].append(row['subreddit'])
                        df_d['created_utc'].append(row['created_utc'])  # utc time stamp of the post


                elif target_type == 'comments':
                    for index, row in df.iterrows():
                        df_d['author'].append(row['author'])
                        df_d['id'].append(f"{row['subreddit']}_{row['id']}_comment")
                        df_d['type'].append('comment')
                        df_d['text'].append(row['body'])  # textual content of the comment
                        df_d['url'].append(f"http://www.reddit.com/r/{row['subreddit']}/comments/{row['link_id'].split('_')[1]}/")  # url of the post the comment belongs to
                        df_d['link_id'].append(row['link_id'])
                        df_d['parent_id'].append(row['parent_id'])
                        df_d['subreddit'].append(row['subreddit'])
                        df_d['created_utc'].append(row['created_utc'])  # utc time stamp of the comment


        subreddit_df = pd.DataFrame.from_dict(df_d)  # create pandas dataframe from dictionary
        subreddit_df.sort_values('created_utc', inplace=True, ignore_index=True)  # order dataframe by date of post
        subreddit_df['time'] = pd.to_datetime(subreddit_df['created_utc'], unit='s').apply(lambda x: x.to_datetime64())  # convert timestamp to date
        subreddit_df['date'] = subreddit_df['time'].apply(lambda x: str(x).split(' ')[0])
        subreddit_df['year'] = subreddit_df['time'].apply(lambda x: str(x).split('-')[0])
        subreddit_df = subreddit_df.drop(columns=['time'])  # drop returns a new dataframe, so we reassign it
        
        subreddit_df.to_pickle(subreddit_pkl_path, protocol=4)  # saving it to pickle format

In [8]:
for subreddit in target_subreddits:
    combine_one_subreddit(subreddit)

## Some info on the corpus

Before performing any analysis, it's important to get to know our texts. The characteristics of our social media texts affect how we carry out our analysis. Let's check:
- how long the texts are,
- how many words are in the vocabulary of the corpus,
- what the most common words in the corpus are, etc.

This information will inform how we will clean the texts and perform topic modeling on them in the next section.
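To make these quantities concrete, here is a minimal standard-library sketch on a hypothetical three-document corpus. It reuses the default token pattern of scikit-learn's `CountVectorizer`, which keeps only tokens of two or more word characters.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the scraped posts and comments
texts = [
    "I don't know which classes to take next semester.",
    "Don't worry, people say the class is fine!",
    "I've heard housing is hard to find.",
]

# Same token pattern as CountVectorizer's default: words of two or more characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

tokenized = [token_pattern.findall(text.lower()) for text in texts]
counts = Counter(token for doc in tokenized for token in doc)

total_tokens = sum(counts.values())
print("Total tokens:", total_tokens)
print("Vocabulary size:", len(counts))
print("Average length:", total_tokens / len(texts))
print(counts.most_common(3))
```

Because the pattern splits on apostrophes and drops one-character tokens, "don't" becomes `don` and "I've" becomes `ve`. This is why fragments such as `don`, `ve`, and `ll` can show up among the most frequent words.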

In [9]:
import re
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
df = pd.read_pickle(os.path.join('reddit_data', 'cornell', 'cornell.pkl'))
df = df.dropna()
print(len(df))

60023


In [11]:
def print_info(df, _type):
    if _type != 'corpus':
        vectorizer = CountVectorizer(        # Token counts, stopwords included
            input = 'content',               # input is a list of strings
            encoding = 'utf-8',
            strip_accents = 'unicode',
            lowercase = True
        )

        texts = df['text'].astype('string').tolist()
        X = vectorizer.fit_transform(texts)
        print(f"Total vectorized words in the corpus of {_type}:", X.sum())
        print(f"Average vectorized {_type} length:", int(X.sum()/X.shape[0]), "tokens")
    
    else:
        vectorizer = CountVectorizer(
            input = 'content',
            encoding = 'utf-8',
            strip_accents = 'unicode',
            lowercase = True,
            stop_words = 'english'          # remove stopwords
        )
        
        texts = df['text'].astype('string').tolist()
        X = vectorizer.fit_transform(texts)
        sum_words = X.sum(axis=0)
        words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
        words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
        print('Top words in the combined corpus of posts and comments after removing stopwords:')
        for word, freq in words_freq[:30]:
            print(word, '\t', freq)

In [12]:
df_posts = df.loc[df['type'] == 'post'].copy()
df_comments = df.loc[df['type'] == 'comment'].copy()
df_2020 = df.loc[df['year'] == '2020'].copy()
df_2022 = df.loc[df['year'] == '2022'].copy()
print(f'Number of posts in r/Cornell:', len(df_posts))
print(f'Number of comments in r/Cornell:', len(df_comments))
print(f'Number of posts and comments from 2020 in r/Cornell:', len(df_2020))
print(f'Number of posts and comments from 2022 in r/Cornell:', len(df_2022))
print_info(df_posts, 'posts')
print_info(df_comments, 'comments')

Number of posts in r/Cornell: 7310
Number of comments in r/Cornell: 52713
Number of posts and comments from 2020 in r/Cornell: 22160
Number of posts and comments from 2022 in r/Cornell: 37863
Total vectorized words in the corpus of posts: 283923
Average vectorized posts length: 38 tokens
Total vectorized words in the corpus of comments: 1362887
Average vectorized comments length: 25 tokens


In [13]:
print_info(df, 'corpus')

Top words in the combined corpus of posts and comments after removing stopwords:
just 	 8224
cornell 	 8148
like 	 7191
people 	 6822
don 	 6465
think 	 4714
know 	 4409
time 	 4061
really 	 3925
class 	 3718
students 	 3680
good 	 3184
classes 	 3178
want 	 3172
ve 	 3113
https 	 2970
school 	 2892
make 	 2570
year 	 2502
going 	 2469
semester 	 2463
ll 	 2433
work 	 2428
need 	 2356
housing 	 2335
campus 	 2318
got 	 2310
lot 	 2291
way 	 2121
sure 	 2047


## Pre-process the corpus


When scraping Reddit or other platforms, it is important to consider how the platform is used by users, to have an idea of the kind of texts we might find.

A few things to keep in mind:
- the content on these platforms is **barely curated**. Moderators and bots designed for content moderation often just remove the most offensive and inflammatory content.
  - Unless you are dealing with a special subreddit/community that enforces very strict norms, you will find funky looking, uninformative, and bot-generated texts.
- In most social platforms, social interaction can revolve around **images**. Unless alt-text is provided (sadly, basically never), we cannot access that information using our NLP tools.
  - Therefore some texts will look funky for that reason. Such documents are generally short.
- On Reddit, content shows up depending on the up- and down-votes it receives. If a user's post gets ignored by their community, they sometimes repost it to receive an answer.
  - Thus, in your corpus, you might find 5, 10, 20 **duplicates** of an individual post.

HOWEVER, how much and whether you need to clean your corpus highly depends on **a few factors**:
- The goal of your analysis, your question
- The community you are analyzing
  - **Be respectful!** This content might look weird to you, but can mean a lot to the members of the community
  - Keep in mind that you are analyzing someone's behavior and interaction online. Put yourself in their shoes :)
- The techniques you are going to use

In [14]:
import json
import little_mallet_wrapper as lmw
import Levenshtein

### Deduplicating function

This is far from an optimal function for getting rid of duplicates. For the sake of time, we only check two things: that a document is not a near-duplicate of earlier content by the same author and, when the author is unknown (`[deleted]`), that it is not a near-duplicate of the chronologically preceding document.

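We will use the Levenshtein distance, which measures how different two strings are: it is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. The code below calls `Levenshtein.ratio`, a normalized similarity score between 0 and 1 derived from this kind of edit distance, where 1 means the two strings are identical. It is useful because it does not require tokenization, so we can get rid of most of the duplicates before cleaning the data, which saves us some time.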
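To make the definition concrete, here is a minimal pure-Python sketch of the edit distance using the standard dynamic-programming approach. The `similarity` helper is a hypothetical normalization for illustration only. It is not necessarily the formula that `Levenshtein.ratio` uses, so the two scores can differ.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning the first i characters of a into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Hypothetical normalized score in [0, 1]; 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein_distance(a, b) / longest

print(levenshtein_distance("kitten", "sitting"))
print(similarity("Any sublets for fall?", "Any sublets for fall??"))
```

A repost that differs from the original by a single character scores close to 1, which is the kind of pair a high threshold is meant to catch.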

In [15]:
def find_duplicates(_df):  # function to find duplicated posts in the data

    prev_doc = ''
    map_dict = {}  # dict of authors' posts
    duplicate_indexes = []  # list of duplicates' indexes for removal from dataframe
    for index, row in _df.iterrows():  # iterate over posts
        author = row['author']
        doc = row['text']

        # if author info is available we compare each post with previous ones by the same author
        # we compare/calculate the similarity between the posts using the Levenshtein distance
        if author != '[deleted]':
            if author in map_dict.keys():
                flag = 0
                idx = 0
                while idx < len(map_dict[author]) and flag == 0:
                    lev = Levenshtein.ratio(doc, map_dict[author][idx])
                    if lev > 0.99:
                        duplicate_indexes.append(index)
                        flag = 1
                    idx += 1
                if flag == 0:
                    map_dict[author].append(doc)
            else:
                map_dict[author] = [doc]

        # if author info is not available we compare each post with the preceding one chronologically
        else:
            lev = Levenshtein.ratio(row['text'], prev_doc)
            if lev > 0.90:
                duplicate_indexes.append(index)

        prev_doc = doc

    return duplicate_indexes

In [16]:
dupes = find_duplicates(df)  # find duplicates
df.drop(dupes, inplace=True)  # removing duplicates
print(f'Number of duplicates: {len(dupes)}')

Number of duplicates: 1453


### Cleaning function

Before we perform topic modeling, it's important to remove messages generated by bots and documents that contain too few distinct words to be informative.
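The filtering logic can be sketched in a few lines. In this sketch, `simple_clean` is a hypothetical stand-in for `lmw.process_string`, the stopword list is a tiny illustrative one, and `keep_document` is an assumed helper name.

```python
import re

# Small hypothetical stopword list; little_mallet_wrapper ships its own, larger one
STOPWORDS = {"i", "a", "the", "to", "is", "and", "of", "in", "it", "this", "for", "am"}

def simple_clean(text):
    """Lowercase, drop punctuation and digits, and remove stopwords (a stand-in for lmw.process_string)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def keep_document(author, text, min_distinct=5):
    """Keep a document unless it looks bot-generated or has too few distinct tokens."""
    if "bot" in author.lower():
        return False
    tokens = simple_clean(text)
    return len(set(tokens)) > min_distinct and "bot" not in tokens

docs = [
    ("student_a", "Does anyone know when course enrollment opens for the fall semester?"),
    ("HelpfulBot", "I am a bot and this action was performed automatically by a script."),
    ("student_b", "Thanks!"),
]
kept = [text for author, text in docs if keep_document(author, text)]
print(kept)
```

Matching on the substring `bot` is a blunt heuristic: it would also drop a human author whose username happens to contain those letters.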

In [17]:
def cleaning_docs(raw_df, _subreddit):
    '''
    Takes the full corpus and the name of a subreddit. It cleans all the documents (removes punctuation and stopwords)
    and saves the clean corpus as JSON. Note that the output file has a .pkl extension but contains JSON.
    '''
    clean_docs_file = os.path.join('reddit_data', _subreddit, f'clean_{_subreddit}.pkl')
    if not os.path.exists(clean_docs_file): 
        
        clean_d = {'id':[], 'clean':[], 'og':[], 'year':[], 'date':[]}

        for index, row in raw_df.iterrows():                               # iterating over posts and comments
            if 'bot' not in row['author'] and 'Bot' not in row['author']:  # if the author's name does not suggest a bot
                clean_doc_st = lmw.process_string(row['text'])             # cleaning documents
                clean_doc_l = clean_doc_st.split(' ')
                if len(set(clean_doc_l))>5 and 'bot' not in clean_doc_l:  # keep only documents with more than 5 distinct tokens
                                                                          # that do not contain the word 'bot'
                    clean_d['clean'].append(clean_doc_l)
                    clean_d['id'].append(row['id'])
                    clean_d['og'].append(row['text'])
                    clean_d['year'].append(row['year'])
                    clean_d['date'].append(row['date'])

        with open(clean_docs_file, 'w') as jsonfile:  # creating a file with the dict of documents to topic model
            json.dump(clean_d, jsonfile)

In [18]:
%%time
for subreddit in target_subreddits:
    cleaning_docs(df, subreddit)

CPU times: user 189 µs, sys: 101 µs, total: 290 µs
Wall time: 266 µs


## Topic modeling

What topics appear in Cornell's subreddit?

To perform LDA, we will use `tomotopy`, a new, fast, and easy-to-use package for topic modeling.

In [None]:
import tomotopy as tp

### Topic modeling functions

In [None]:
"""Mixture of Matthew Wilkens' and Melanie Walsh's code"""
def perform_topic_modeling(_doc_ids, _clean_docs, _num_topics, _rm_top, _topwords_file):
    '''
    Takes a list of document ids, a list of clean docs to perform LDA on, a number of topics, a number of top words to remove,
    a file path for the top words file. It performs topic modeling on the documents, then creates the top words file and a doc-topic matrix.
    '''
                                          # setting and loading the LDA model
    lda_model = tp.LDAModel(k=_num_topics,      # number of topics in the model
                            min_df=3,           # remove words that occur in fewer than 3 documents
                            rm_top=_rm_top)     # remove n most frequent words
    for doc in _clean_docs:
        lda_model.add_doc(doc)  # adding document to the model

    iterations = 10
    for i in range(0, 100, iterations):  # train model 10 times with 10 iterations at each training = 100 iterations
        lda_model.train(iterations)
        print(f'Iteration: {i}\tLog-likelihood: {lda_model.ll_per_word}')

    # Writing the document with the TOP WORDS per TOPIC
    num_top_words = 25                                      # number of top words to print for each topic
    with open(_topwords_file, "w", encoding="utf-8") as file:
        file.write(f"\nTopics in LDA model: {_num_topics} topics {_rm_top} removed top words\n\n")
                                                            # write settings of the model in file
        topic_individual_words = []
        
        for topic_number in range(0, _num_topics):                  # for each topic number in the total number of topics
            topic_words = ' '.join(                                 # string of top words in the topic
                word for word, prob in lda_model.get_topic_words(topic_id=topic_number, top_n=num_top_words))
                                                # get_topic_words is a tomotopy method that returns a list of (word, probability) pairs
            
            topic_individual_words.append(topic_words.split(' '))   # append list of the topic's top words for later
            file.write(f"Topic {topic_number}\n{topic_words}\n\n")  # write topic number and top words in file

            
    # TOPIC DISTRIBUTIONS
    topic_distributions = [list(doc.get_topic_dist()) for doc in lda_model.docs]  # list of lists of topic distributions for each document
    topic_results = []
    for topic_distribution in topic_distributions:
        topic_results.append({'topic_distribution': topic_distribution}) # adding dicts of topic distributions to list
    
    df = pd.DataFrame(topic_results, index=_doc_ids) 
                                                    # df where each row is the list of topic distributions of a document, _doc_ids are the ids of the documents
    column_names = [f"Topic {number} {topic[0]}" for number, topic in enumerate(topic_individual_words)]  # create list of column names from topic numbers and top words
    
    df[column_names] = pd.DataFrame(df['topic_distribution'].tolist(), index=df.index)
                                    # df where topic distributions are not in a list and match the list of column names
    df = df.drop('topic_distribution', axis='columns')  # drop old topic distributions' column
    
    dominant_topic = np.argmax(df.values, axis=1)       # get dominant topic for each document
    df['dominant_topic'] = dominant_topic

    return df
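The last step of `perform_topic_modeling` assigns each document its dominant topic, that is, the topic with the highest weight in the document's topic distribution. A minimal pure-Python sketch of the same idea, on hypothetical distributions:

```python
# Hypothetical topic distributions for three documents over four topics (each row sums to 1)
topic_distributions = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.25, 0.40, 0.25, 0.10],
]

def dominant_topic(distribution):
    """Return the index of the topic with the highest weight."""
    return max(range(len(distribution)), key=lambda k: distribution[k])

dominant = [dominant_topic(dist) for dist in topic_distributions]
print(dominant)
```

The dominant topic is a convenient summary, but it discards information: the third document above is spread fairly evenly across three topics.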

In [None]:
def run_topic_modeling(_subreddit):
    tomo_folder = os.path.join('output', 'topic_modeling')  # results' folder
    if not os.path.exists(tomo_folder):  # create folder if it doesn't exist
        os.makedirs(tomo_folder)
    
    clean_docs_file = os.path.join('reddit_data', _subreddit, f'clean_{_subreddit}.pkl')
    with open(clean_docs_file) as json_file:
        clean_docs_dict = json.load(json_file)
    doc_ids = clean_docs_dict['id']       # list of ids of clean documents
    clean_docs = clean_docs_dict['clean']  # list of clean documents to perform topic modeling on                       

    
    print("Performing Topic Modeling...")     # for loop to run multiple models with different settings with one execution
    for num_topics in [10, 20]:            # for number of topics
        for rm_top in [5]:                 # for number of most frequent words to remove

            topwords_file = os.path.join(tomo_folder, f'{_subreddit}-{num_topics}_{rm_top}.txt')  # path for top words file
            docterm_file = os.path.join(tomo_folder,f'{_subreddit}-{num_topics}_{rm_top}.pkl')  # path for doc-topic matrix file
            if not os.path.exists(topwords_file) or not os.path.exists(docterm_file):         # if result files don't exist, performs topic model
                
                start = datetime.now()
                lda_dtm = perform_topic_modeling(doc_ids, clean_docs, num_topics, rm_top, topwords_file)
                lda_dtm['og_doc'] = clean_docs_dict['og']    # list of original documents for evaluation
                lda_dtm['year'] = clean_docs_dict['year']
                lda_dtm['date'] = clean_docs_dict['date']
                lda_dtm.to_pickle(docterm_file, protocol=4)  # save the doc-topic df as a pickle file
                print(f'{str(datetime.now())}____Topic modeling {num_topics}, {rm_top} time:____{str(datetime.now() - start)}\n')  # print timing of topic modeling

### Run Topic Modeling!

In [None]:
%%time
for subreddit in target_subreddits:
    run_topic_modeling(subreddit)

### Evaluate the models

In [None]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 500)

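def print_top_docs_per_topic(_df, _txtfile):
    
    with open(_txtfile, 'r', encoding='utf-8') as file:
        lines = file.readlines()
        idx = 3  # the file starts with a blank line, a header line, and another blank line
        while idx < len(lines):
            topic_line = lines[idx]    # e.g. 'Topic 0'
            words_line = lines[idx+1]  # top words of that topic
            n = topic_line.split()[1]
            word_1 = words_line.split()[0]
            print(f'{topic_line}{words_line}')
            # sort documents by their weight for this topic; the slice [5:10] prints the documents ranked 6th to 10th
            for doc in _df.sort_values(f'Topic {n} {word_1}', ascending=False).og_doc.tolist()[5:10]:
                print(doc)
                print("_________")
            print('\n\n')
            idx += 3  # each topic block is three lines: topic number, top words, blank line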

In [None]:
n_removed = 5
n_topics = 20
tomo_folder = os.path.join('output', 'topic_modeling')
tomo_pklfile = os.path.join(tomo_folder, f'cornell-{n_topics}_{n_removed}.pkl')
tomo_txtfile = os.path.join(tomo_folder, f'cornell-{n_topics}_{n_removed}.txt')
tomo_df = pd.read_pickle(tomo_pklfile)
print(f'Number of documents in topic model: {len(tomo_df)}')
print_top_docs_per_topic(tomo_df, tomo_txtfile)

## Classification using topic distributions

Let's check whether we can predict if a post or comment is from 2020 or 2022 using its topic distribution.
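When reading accuracy scores for this task, it helps to compare them with a trivial baseline that always predicts the more frequent year. A minimal sketch, using the document counts printed earlier for the raw corpus (22,160 from 2020 and 37,863 from 2022). The counts after deduplication and cleaning will differ, so treat the number as indicative only.

```python
# Counts of posts and comments per year in the raw corpus, as printed earlier in the notebook
n_2020 = 22160
n_2022 = 37863

# Accuracy of a classifier that always predicts the more frequent class
majority_baseline = max(n_2020, n_2022) / (n_2020 + n_2022)
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

A classifier whose accuracy is close to this value has learned little beyond the class imbalance.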

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif

In [None]:
# Examine the performance of our simple classifiers
# Freebie function to summarize and display classifier scores
def compare_scores(scores_dict):
    '''
    Takes a dictionary of cross_validate scores.
    Returns a color-coded Pandas dataframe that summarizes those scores.
    '''
    import pandas as pd
    df = pd.DataFrame(scores_dict).T.applymap(np.mean).style.background_gradient(cmap='RdYlGn')
    return df

### 20 Topics

In [None]:
tomo_shuffled = tomo_df.sample(frac=1)

tomo_shuffled['y_year'] = tomo_shuffled['year'].apply(lambda x: 0 if x == '2020' else 1)
y_year = tomo_shuffled['y_year'].tolist()
x_docterm = tomo_shuffled[tomo_df.columns[:n_topics].tolist()]
X_docterm = StandardScaler().fit_transform(x_docterm)

In [None]:
classifiers = {
    'Logit':LogisticRegression(),
    'Random forest':RandomForestClassifier(),
    'SVM':SVC()
}

scores1 = {} # Store cross-validation results in a dictionary
for classifier in classifiers: 
    scores1[classifier] = cross_validate( # perform cross-validation
        classifiers[classifier], # classifier object
        X_docterm, # feature matrix
        y_year, # gold labels
        cv=10, #number of folds
        scoring=['accuracy','precision', 'recall', 'f1'] # scoring methods
    )
    
compare_scores(scores1)

In [None]:
method = f_classif  # f_classif is much faster than mutual_info_classif, but not as robust
selector = SelectKBest(method, k=5)
X_best = selector.fit_transform(X_docterm, y_year)

In [None]:
scores2 = {} # Store cross-validation results in a dictionary
for classifier in classifiers: 
    scores2[classifier] = cross_validate( # perform cross-validation
        classifiers[classifier], # classifier object
        X_best, # feature matrix
        y_year, # gold labels
        cv=10, #number of folds
        scoring=['accuracy','precision', 'recall', 'f1'] # scoring methods
    )

compare_scores(scores2)

In [None]:
all_features = tomo_df.columns[:n_topics].tolist()
top_features = sorted(zip(all_features, selector.scores_), key=lambda x: x[1], reverse=True)
for top_feature in top_features[:5]:
    print(f'{top_feature[0]} \t\tscore: {top_feature[1]}')

## Permutation test

We use a permutation test to find out whether the differences between the 2020 and 2022 topic distributions are statistically significant.
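The idea can be sketched with the standard library alone: shuffle the group labels many times and count how often a shuffled difference in means is at least as extreme as the observed one. The sketch below is a two-sided variant with a fixed random seed and a +1 correction, which are assumptions of this illustration and differ from the notebook's own function. The topic weights are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=1000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)                  # fixed seed so the sketch is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                    # reassign group labels at random
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # add 1 to numerator and denominator so the estimate is never exactly zero
    return observed, (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical topic weights for one topic in each year
weights_2020 = [0.30, 0.28, 0.35, 0.31, 0.33, 0.29]
weights_2022 = [0.10, 0.12, 0.08, 0.11, 0.09, 0.13]
observed, p_value = permutation_p_value(weights_2020, weights_2022)
print(observed, p_value)
```

A small p-value means that a difference as large as the observed one rarely arises when the year labels are assigned at random.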

In [None]:
from scipy import stats

In [None]:
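def permute(input_array):
    # shuffle is inplace, so copy to preserve input
    permuted = input_array.copy().values  # convert to numpy array, avoiding warning
    np.random.shuffle(permuted)
    # convert back to pandas, keeping the original index so that assigning
    # the result to a dataframe column aligns row by row
    return pd.Series(permuted, index=input_array.index)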

In [None]:
def permutation_test(ddf, raw_column):
    
    # Difference between the mean z-scored topic weight of the 2020 documents and that of the 2022 documents
    column = f'{raw_column}_z'
    ddf[column] = stats.zscore(ddf[raw_column])
    real_mean_before = ddf.loc[ddf['year'] == '2020'][column].mean()
    real_mean_after = ddf.loc[ddf['year'] == '2022'][column].mean()
    diff_real = real_mean_before - real_mean_after 
    
    # Performing 1,000 permutations
    n_permutations = 1000
    flag = 0
    for i in range(n_permutations):
        copy = ddf.copy()  # we copy the original dataframe with the observed data
        copy['year'] = permute(copy['year'])  # we shuffle the 'year' column
        mean_before = copy.loc[copy['year'] == '2020'][column].mean()
        mean_after = copy.loc[copy['year'] == '2022'][column].mean()
        diff_perm = mean_before - mean_after  # we calculate the difference between the means of the two permuted year groups
        if diff_real > 0:  # if real difference is a positive number
            if diff_real > diff_perm:  # we test if the observed difference is greater
                flag += 1
        if diff_real < 0:  # if real difference is a negative number
            if diff_real < diff_perm:  # we test if the observed difference is lesser
                flag += 1  # we keep count of the number of times the observed difference is more extreme
    p = (n_permutations-flag)/n_permutations
    
    return diff_real, flag, p

In [None]:
# Permutation test on the difference between the mean weight of each topic
# in the 2020 and 2022 documents
for _column in tomo_df.columns[:n_topics]:
    diff, flag_value, p_value = permutation_test(tomo_df, _column)
    print(f'{_column} in 2020 vs 2022')
    print(f'Observed difference: {diff}')
    print(f'Number of times observed difference is more extreme than permuted: {flag_value}')
    print(f'P-value: {p_value}\n')