# Comprehensive Analysis of RePEc User and Mastodon Activity

This notebook analyzes RePEc user activity, the follower relationships among RePEc users, and their appearances in Mastodon-related tweets. It builds weekly user engagement datasets, maps follower relationships between RePEc users, and produces a week-by-week record of activity for each user from the Mastodon tweet data.

## Objectives

1. **Reading and Processing RePEc User Information**:
    - Load `cleaned_RePEc_userinfo.csv` and `RePEc_userfollowing.csv` to map Twitter and RePEc IDs.
    - Create a new `repec_following.csv` file that adds RePEc IDs for authors and followers, omitting rows without valid IDs.
    
2. **Generating Mastodon User Weekly Activity**:
    - Analyze the `mastodon_tweets.csv` dataset to create a `week` column representing numbered weeks since January 1, 2022.
    - Identify unique RePEc users tweeting in each week and save the results to `mastodon_user_week.csv`.

3. **Creating Mentions Data**:
    - Use `mastodon_user_week.csv` and `userinfo_df` to record, for each RePEc user, the weeks in which they appear in the weekly user lists.
    - Create a dataset (`mentions.csv`) with one column per week, set to 1 if the user appears in that week's list and 0 otherwise.

4. **Creating Mastodon User Week Interacted Data**:
    - Rebuild the weekly user lists from `mastodon_tweets.csv` and use `repec_following.csv` to extend each week's list with the followers of the users already in it.
    - Save the extended dataset to `mastodon_user_week_interacted.csv`.

5. **Creating Interacted Data**:
    - Use `mastodon_user_week_interacted.csv` to build weekly interaction records for each user.
    - Save the results to `interacted.csv`, where a user is flagged for a week if they tweeted that week or follow a user who did.

**Author: Eric Uehling**  
*Date: 7.13.24*

In [16]:
import pandas as pd
from datetime import datetime

## Step 1: Reading and Processing RePEc User Information and Following Data

### Overview
In this step, we will:
1. Read in two CSV files:
   - `cleaned_RePEc_userinfo.csv`: Contains user information, including Twitter and RePEc IDs.
   - `RePEc_userfollowing.csv`: Contains information about which users are following whom.
2. Create a new CSV file named `repec_following.csv`, which will be a modified version of `RePEc_userfollowing.csv`. This new file will include additional columns `repec_id` and `follower_repec_id`, corresponding to the RePEc IDs of the `author_id` and `follower_id`.
3. Map the corresponding RePEc IDs from the `cleaned_RePEc_userinfo.csv` to each row in the following data.
4. If either `author_id` or `follower_id` does not have a corresponding RePEc ID, we will skip that row.
5. Finally, we will save the output to a new CSV file called `repec_following.csv`.


In [17]:
# Step 1: Load the datasets
userinfo_df = pd.read_csv('../data/csv/cleaned_RePEc_userinfo.csv')
userfollowing_df = pd.read_csv('../data/csv/RePEc_userfollowing.csv')

# Create a dictionary for quick lookup of RePEc IDs
repec_id_dict = userinfo_df.set_index('id')['RePEc_id'].to_dict()

# Step 2: Map author and follower IDs to RePEc IDs (IDs missing from the lookup become NaN)
userfollowing_df['repec_id'] = userfollowing_df['author_id'].map(repec_id_dict)
userfollowing_df['follower_repec_id'] = userfollowing_df['follower_id'].map(repec_id_dict)

# Drop rows where either the author or the follower has no RePEc ID
userfollowing_df.dropna(subset=['repec_id', 'follower_repec_id'], inplace=True)

# Step 3: Output the file
output_path = '../data/csv/repec_following.csv'
userfollowing_df.to_csv(output_path, index=False)

print(f"Successfully saved the file to {output_path}")


Successfully saved the file to ../data/csv/repec_following.csv
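
The map-then-drop pattern used in this step can be checked on a small example. This is a minimal sketch, not part of the original pipeline: the column names follow the notebook, but the IDs and the `toy_` variables are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: three known users and four follow edges.
toy_userinfo = pd.DataFrame({
    "id": [101, 102, 103],
    "RePEc_id": ["pab1", "pcd2", "pef3"],
})
toy_following = pd.DataFrame({
    "author_id": [101, 101, 102, 999],
    "follower_id": [102, 555, 103, 101],
})

# Same lookup as the notebook: account ID -> RePEc ID.
id_to_repec = toy_userinfo.set_index("id")["RePEc_id"].to_dict()

# Series.map with a dict yields NaN for keys that are not in the dict.
toy_following["repec_id"] = toy_following["author_id"].map(id_to_repec)
toy_following["follower_repec_id"] = toy_following["follower_id"].map(id_to_repec)

# Keep only edges where both endpoints have a RePEc ID.
mapped = toy_following.dropna(
    subset=["repec_id", "follower_repec_id"]
).reset_index(drop=True)

print(mapped)
```

Only the edges 101 ← 102 and 102 ← 103 survive, because IDs 555 and 999 have no entry in the lookup. Note that `to_dict()` keeps the last value when an `id` appears more than once, so the sketch assumes IDs are unique.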


## Step 2: Generating the Mastodon User Weekly Activity Data

### Overview
In this step, we will:
1. Use the `mastodon_tweets` dataset to create a new column named `week`. This column counts 7-day blocks starting from January 1, 2022, with the first block numbered 1. For example, a tweet created on '2022-11-18' falls 321 days after the start date, so it is assigned week 46 (`321 // 7 + 1`). The count does not reset at the start of a new year.
2. Determine the maximum week number to define the range of weeks we will be working with. This gives one output row per week (e.g., weeks 1 to 65).
3. For each week, identify the unique `RePEc_id`s that correspond to the tweets created in that week. These IDs will be aggregated into lists, and we will populate the `repec_users` column with these lists.
4. Finally, we will save the resulting dataset to a new CSV file named `mastodon_user_week.csv`.

In [18]:
# Load the Mastodon tweets dataset
mastodon_tweets_df = pd.read_csv('../data/csv/mastodon_tweets.csv')

# Step 1: Create the 'week' column
# Convert 'created_at' to datetime
mastodon_tweets_df['created_at'] = pd.to_datetime(mastodon_tweets_df['created_at'])

# Define the start date
start_date = datetime(2022, 1, 1)

# Calculate the week number
mastodon_tweets_df['week'] = mastodon_tweets_df['created_at'].apply(
    lambda x: (x - start_date).days // 7 + 1
)

# Step 2: Determine the max week number
max_week = mastodon_tweets_df['week'].max()

# Initialize a list to store the data for the new CSV file
week_data = []

# Step 3: Generate the list of unique RePEc IDs for each week
for week_num in range(1, max_week + 1):
    # Filter tweets for the current week
    weekly_tweets = mastodon_tweets_df[mastodon_tweets_df['week'] == week_num]
    
    # Extract unique RePEc IDs
    unique_repec_ids = weekly_tweets['RePEc_id'].unique().tolist()
    
    # Append the result to the list
    week_data.append({
        'week': week_num,
        'repec_users': unique_repec_ids
    })

# Convert the list to a DataFrame
week_df = pd.DataFrame(week_data)

# Step 4: Output the file
output_path = '../data/csv/mastodon_user_week.csv'
week_df.to_csv(output_path, index=False)

print(f"Successfully saved the file to {output_path}")

Successfully saved the file to ../data/csv/mastodon_user_week.csv
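
The week arithmetic can be verified on a few dates. The sketch below is illustrative only: `week_number`, `START_DATE`, and the toy `posts` frame are hypothetical names, and the vectorized expression is one possible alternative to the row-wise `apply`, assuming `created_at` holds timezone-naive timestamps.

```python
from datetime import datetime

import pandas as pd

START_DATE = datetime(2022, 1, 1)

def week_number(ts, start=START_DATE):
    """Return the 1-based count of 7-day blocks elapsed since `start`."""
    return (ts - start).days // 7 + 1

print(week_number(datetime(2022, 1, 1)))    # 1
print(week_number(datetime(2022, 1, 7)))    # 1
print(week_number(datetime(2022, 1, 8)))    # 2
print(week_number(datetime(2022, 11, 18)))  # 46

# Toy posts with invented RePEc IDs.
posts = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2022-01-01", "2022-01-05", "2022-01-08", "2022-11-18"]
    ),
    "RePEc_id": ["pab1", "pab1", "pcd2", "pab1"],
})

# Vectorized form of the same calculation.
posts["week"] = (posts["created_at"] - pd.Timestamp(START_DATE)).dt.days // 7 + 1

# Unique users per week; weeks without posts simply do not appear here,
# whereas the notebook's loop emits an empty list for them.
users_by_week = {
    week: sorted(group["RePEc_id"].dropna().unique())
    for week, group in posts.groupby("week")
}
print(users_by_week)
```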


## Step 3: Creating the Mentions Data File

### Overview
In this step, we will:
1. Identify the maximum week number from the `mastodon_user_week.csv` file to determine how many week columns are needed in the `mentions.csv` file.
2. For each user in the `userinfo_df` dataset, create a row in `mentions.csv` with their `author_id`, `repec_id`, `number_of_followers`, and `following_count` (renamed from `id`, `RePEc_id`, `followers_count`, and `following_count`), followed by week columns (`week1`, `week2`, etc.) initialized to the integer 0.
3. For each week, check whether the user's `repec_id` appears in the `repec_users` list from `mastodon_user_week.csv`. If it does, set the corresponding week column to 1.
4. Finally, we will save the resulting dataset to a new CSV file named `mentions.csv`.

In [19]:
# Step 1: Load the mastodon_user_week and userinfo datasets
mastodon_user_week_df = pd.read_csv('../data/csv/mastodon_user_week.csv')
userinfo_df = pd.read_csv('../data/csv/cleaned_RePEc_userinfo.csv')

# Identify the maximum week number
max_week = mastodon_user_week_df['week'].max()

# Initialize the columns for the mentions.csv file
week_columns = [f'week{week_num}' for week_num in range(1, max_week + 1)]

# Create the mentions DataFrame with initial data
mentions_df = userinfo_df[['id', 'RePEc_id', 'followers_count', 'following_count']].copy()
mentions_df.columns = ['author_id', 'repec_id', 'number_of_followers', 'following_count']

# Add week columns initialized to 0
for week_col in week_columns:
    mentions_df[week_col] = 0

# Step 2: Update the mentions DataFrame based on the mastodon_user_week data
for _, row in mastodon_user_week_df.iterrows():
    week_num = row['week']
    # to_csv stored each list as its string form, e.g. "['a', 'b']";
    # strip the brackets and quotes, then split to recover the IDs
    repec_users = row['repec_users'][1:-1].replace("'", "").split(", ")
    
    # Identify users mentioned in this week
    mentions_df.loc[mentions_df['repec_id'].isin(repec_users), f'week{week_num}'] = 1

# Step 3: Output the file
output_path = '../data/csv/mentions.csv'
mentions_df.to_csv(output_path, index=False)

print(f"Successfully saved the file to {output_path}")

Successfully saved the file to ../data/csv/mentions.csv
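
Because `to_csv` writes each `repec_users` list as a string, the list has to be parsed again after `read_csv`. The sketch below compares the manual parsing used in this step with `ast.literal_eval` from the standard library. It is an illustration on invented data, using an in-memory buffer instead of a file, and is not a change to the notebook's method.

```python
import ast
import io

import pandas as pd

# Toy weekly lists with invented RePEc IDs, including an empty week.
toy_weeks = pd.DataFrame({
    "week": [1, 2, 3],
    "repec_users": [["pab1", "pcd2"], [], ["pef3"]],
})

# Round-trip through CSV in memory.
buffer = io.StringIO()
toy_weeks.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After loading, each cell is a string such as "['pab1', 'pcd2']".
print(repr(loaded.loc[0, "repec_users"]))

# Manual parsing, as in the notebook.
manual = loaded["repec_users"].apply(
    lambda s: s[1:-1].replace("'", "").split(", ")
)

# Parsing the Python literal directly.
parsed = loaded["repec_users"].apply(ast.literal_eval)

print(manual.tolist())
print(parsed.tolist())
```

The two approaches differ for an empty week: the manual split yields `['']`, while `ast.literal_eval` yields `[]`. One caveat is that `ast.literal_eval` raises an error if a list contains a bare `nan`, which could occur if some tweets have no `RePEc_id`, so the lists would need to be free of missing values before this alternative is used.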


## Step 4: Creating the Mastodon User Week Interacted Data File

### Overview
In this step, we will:
1. Use the `mastodon_tweets` dataset to create a new column named `week`, exactly as in Step 2. This column counts 7-day blocks starting from January 1, 2022, so a tweet created on '2022-11-18' (321 days after the start date) is assigned week 46.
2. Determine the maximum week number to define the range of weeks we will be working with. This gives one output row per week (e.g., weeks 1 to 65).
3. For each week, identify and create a list of unique `RePEc_id`s that correspond to the tweets created in that week. These IDs will be aggregated into lists, and we will populate the `repec_users` column with these lists.
4. Read in the `repec_following.csv` file and for each week, extend the `repec_users` list to include `RePEc_id`s of users who follow those already in the list.
5. Finally, we will save the resulting dataset to a new CSV file named `mastodon_user_week_interacted.csv`.

In [20]:
# Step 1: Load the necessary datasets
mastodon_tweets_df = pd.read_csv('../data/csv/mastodon_tweets.csv')
repec_following_df = pd.read_csv('../data/csv/repec_following.csv')

# Convert 'created_at' to datetime
mastodon_tweets_df['created_at'] = pd.to_datetime(mastodon_tweets_df['created_at'])

# Define the start date
start_date = datetime(2022, 1, 1)

# Calculate the week number
mastodon_tweets_df['week'] = mastodon_tweets_df['created_at'].apply(
    lambda x: (x - start_date).days // 7 + 1
)

# Step 2: Determine the max week number
max_week = mastodon_tweets_df['week'].max()

# Initialize a list to store the data for the new CSV file
week_data = []

# Step 3: Generate the list of unique RePEc IDs for each week
for week_num in range(1, max_week + 1):
    # Filter tweets for the current week
    weekly_tweets = mastodon_tweets_df[mastodon_tweets_df['week'] == week_num]
    
    # Extract unique RePEc IDs
    unique_repec_ids = weekly_tweets['RePEc_id'].unique().tolist()
    
    # Append the result to the list
    week_data.append({
        'week': week_num,
        'repec_users': unique_repec_ids
    })

# Convert the list to a DataFrame
week_df = pd.DataFrame(week_data)

# Step 4: Extend each week's repec_users list using repec_following.csv
# Map each RePEc ID to the list of RePEc IDs that follow it
following_dict = repec_following_df.groupby('repec_id')['follower_repec_id'].apply(list).to_dict()

for idx, row in week_df.iterrows():
    week_repec_users = set(row['repec_users'])
    extended_repec_users = set(week_repec_users)  # Start with the current week's users
    
    for repec_id in week_repec_users:
        if repec_id in following_dict:
            extended_repec_users.update(following_dict[repec_id])
    
    # Update the row's repec_users with the extended list
    week_df.at[idx, 'repec_users'] = list(extended_repec_users)

# Step 5: Output the file
output_path = '../data/csv/mastodon_user_week_interacted.csv'
week_df.to_csv(output_path, index=False)

print(f"Successfully saved the file to {output_path}")

Successfully saved the file to ../data/csv/mastodon_user_week_interacted.csv
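
The follower extension in this step is a one-hop expansion: followers of active users are added, but followers of those followers are not. A minimal sketch on invented data, where `expand_with_followers` and `followers_of` are hypothetical names:

```python
import pandas as pd

# Toy follow edges: follower_repec_id follows repec_id.
toy_edges = pd.DataFrame({
    "repec_id": ["pab1", "pab1", "pcd2"],
    "follower_repec_id": ["pcd2", "pef3", "pgh4"],
})

# Same lookup shape as the notebook: user -> list of their followers.
followers_of = (
    toy_edges.groupby("repec_id")["follower_repec_id"].apply(list).to_dict()
)

def expand_with_followers(active_users, followers_of):
    """Return the active users plus their direct followers, sorted."""
    expanded = set(active_users)
    for user in active_users:
        expanded.update(followers_of.get(user, []))
    return sorted(expanded)

print(expand_with_followers(["pab1"], followers_of))  # ['pab1', 'pcd2', 'pef3']
print(expand_with_followers(["pzz9"], followers_of))  # ['pzz9']
```

In the first call, `pgh4` is not included even though it follows `pcd2`, because `pcd2` entered the list only as a follower and was not active itself.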


## Step 5: Creating the Interacted Data File Using `mastodon_user_week_interacted.csv`

### Overview
In this step, we will:
1. Identify the maximum week number from the `mastodon_user_week_interacted.csv` file to determine how many week columns are needed in the `interacted.csv` file.
2. For each user in the `userinfo_df` dataset, create a row in `interacted.csv` with their `author_id`, `repec_id`, `number_of_followers`, and `following_count`, followed by week columns (`week1`, `week2`, etc.) initialized to the integer 0.
3. For each week, check whether the user's `repec_id` appears in the `repec_users` list from `mastodon_user_week_interacted.csv`. Because Step 4 already added the followers of each active user to that list, a user is set to 1 for a week if they tweeted that week or follow a user who did. `repec_following.csv` is not read again in this step.
4. Finally, we will save the resulting dataset to a new CSV file named `interacted.csv`.


In [21]:
# Step 1: Load the necessary datasets
mastodon_user_week_interacted_df = pd.read_csv('../data/csv/mastodon_user_week_interacted.csv')
userinfo_df = pd.read_csv('../data/csv/cleaned_RePEc_userinfo.csv')

# Identify the maximum week number
max_week = mastodon_user_week_interacted_df['week'].max()

# Initialize the columns for the interacted.csv file
week_columns = [f'week{week_num}' for week_num in range(1, max_week + 1)]

# Create the interacted DataFrame with initial data
interacted_df = userinfo_df[['id', 'RePEc_id', 'followers_count', 'following_count']].copy()
interacted_df.columns = ['author_id', 'repec_id', 'number_of_followers', 'following_count']

# Add week columns initialized to 0
for week_col in week_columns:
    interacted_df[week_col] = 0

# Step 2: Update the interacted DataFrame based on the mastodon_user_week_interacted data
for _, row in mastodon_user_week_interacted_df.iterrows():
    week_num = row['week']
    # to_csv stored each list as its string form, e.g. "['a', 'b']";
    # strip the brackets and quotes, then split to recover the IDs
    repec_users = row['repec_users'][1:-1].replace("'", "").split(", ")
    
    # Update the DataFrame if the user exists in the repec_users list
    interacted_df.loc[interacted_df['repec_id'].isin(repec_users), f'week{week_num}'] = 1

# Step 3: Output the file
output_path = '../data/csv/interacted.csv'
interacted_df.to_csv(output_path, index=False)

print(f"Successfully saved the file to {output_path}")


Successfully saved the file to ../data/csv/interacted.csv
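
Since the extended weekly lists contain every user in the original weekly lists, each 1 in `mentions.csv` should also be a 1 in `interacted.csv`. The sketch below shows one way such a consistency check could look. It is an optional sanity check on invented data, not part of the original notebook; `mentions_within_interacted` is a hypothetical helper, and it assumes `repec_id` values are unique and present in both tables.

```python
import pandas as pd

# Hypothetical miniature versions of mentions.csv and interacted.csv.
toy_mentions = pd.DataFrame({
    "repec_id": ["pab1", "pcd2", "pef3"],
    "week1": [1, 0, 0],
    "week2": [0, 0, 1],
})
toy_interacted = pd.DataFrame({
    "repec_id": ["pab1", "pcd2", "pef3"],
    "week1": [1, 1, 0],
    "week2": [0, 0, 1],
})

def mentions_within_interacted(mentions, interacted):
    """Return True if every 1 in `mentions` is also a 1 in `interacted`."""
    week_cols = [c for c in mentions.columns if c.startswith("week")]
    m = mentions.set_index("repec_id")[week_cols]
    # Align rows by repec_id so the comparison is label-safe.
    i = interacted.set_index("repec_id")[week_cols].reindex(m.index)
    return bool((m <= i).all().all())

consistent = mentions_within_interacted(toy_mentions, toy_interacted)
print(consistent)
```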
