# Project on Mastodon
Sebastian Gottschalk, Kerstin Kirchgässner, Rusen Yasar

# Data processing
The primary objective of this notebook is to turn the JSON-formatted data into pandas data frames.

In [221]:
import pandas as pd
import json

## Data frame for statuses
Read files:

In [222]:
with open("../data/raw/trending_statuses_1530.txt") as json_file:
    js_status_1 = json.load(json_file)
with open("../data/raw/trending_statuses_1930.txt") as json_file:
    js_status_2 = json.load(json_file)
with open("../data/raw/trending_statuses_2355.txt") as json_file:
    js_status_3 = json.load(json_file)

Because the data were fetched in batches, each file needs to be flattened. At the top level, every value is a list, and each item in these lists is a status.

In [223]:
status_list_of_dicts = []

for key in js_status_1:
    status_list_of_dicts += js_status_1[key]
for key in js_status_2:
    status_list_of_dicts += js_status_2[key]
for key in js_status_3:
    status_list_of_dicts += js_status_3[key]

In [224]:
df_status = pd.json_normalize(status_list_of_dicts, sep="_")
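
As a minimal sketch of the flattening and normalisation steps above, the toy example below uses invented batch keys and field values (they are assumptions, not the real data). It flattens the per-batch lists with `itertools.chain`, an alternative to the explicit loops, and shows how `sep="_"` produces column names such as `account_id` from nested dictionaries.

```python
from itertools import chain

import pandas as pd

# Toy stand-in for one fetched file: batch key -> list of status dicts.
# The keys and field values here are invented for illustration.
js_batches = {
    "0": [{"id": "1", "account": {"id": "a1", "username": "alice"}}],
    "20": [{"id": "2", "account": {"id": "b2", "username": "bob"}}],
}

# Flatten the per-batch lists into a single list of statuses.
statuses = list(chain.from_iterable(js_batches.values()))

# Nested dicts become columns joined with the separator, e.g. account_id.
df_toy = pd.json_normalize(statuses, sep="_")
print(df_toy.columns.tolist())
```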

In [225]:
df_status.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 75 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        4424 non-null   object 
 1   created_at                4424 non-null   object 
 2   in_reply_to_id            0 non-null      object 
 3   in_reply_to_account_id    0 non-null      object 
 4   sensitive                 4424 non-null   bool   
 5   spoiler_text              4424 non-null   object 
 6   visibility                4424 non-null   object 
 7   language                  4424 non-null   object 
 8   uri                       4424 non-null   object 
 9   url                       4424 non-null   object 
 10  replies_count             4424 non-null   int64  
 11  reblogs_count             4424 non-null   int64  
 12  favourites_count          4424 non-null   int64  
 13  edited_at                 427 non-null    object 
 14  content 

## Data cleaning and processing

Some columns can be dropped, for example those that are (almost) entirely null.

In [226]:
df_status_pr = df_status.drop(columns = [
    "in_reply_to_id", 
    "in_reply_to_account_id", 
    "reblog", 
    "card", 
    "poll", 
    "poll_id", 
    "poll_expires_at", 
    "poll_expired", 
    "poll_multiple", 
    "poll_votes_count", 
    "poll_voters_count",
    "poll_options", 
    "poll_emojis"
    ])
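
One way to spot candidate columns is to compute the share of missing values per column. The sketch below runs on an invented toy frame, and the 95% cut-off is an arbitrary assumption. It is not meant to reproduce the hand-picked list above, which also drops columns for other reasons.

```python
import pandas as pd

# Toy frame; column names mirror the notebook, values are invented.
df_toy = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "in_reply_to_id": [None, None, None, None],
    "edited_at": [None, "2024-05-30T10:00:00.000Z", None, None],
})

# Share of missing values per column, between 0 and 1.
null_share = df_toy.isna().mean()

# Hypothetical cut-off: treat columns with more than 95% missing as droppable.
threshold = 0.95
mostly_null = null_share[null_share > threshold].index.tolist()
print(mostly_null)
```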

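We expect duplicates, since we fetched data at three points on the same day. Because the files were appended in fetch order, keeping the last occurrence retains the most recent version of each status.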

In [227]:
df_status_pr.duplicated(subset="id", keep="last").sum()

1247

In [228]:
df_status_pr = df_status_pr.drop_duplicates(subset="id", keep="last")
df_status_pr = df_status_pr.set_index("id")
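
The effect of `keep="last"` can be checked on a small invented example. Assuming rows are ordered by fetch time, as they are after the flattening step, the later row for a repeated `id` is the one that survives.

```python
import pandas as pd

# Toy data: status "1" was fetched twice, with a higher count the second time.
df_toy = pd.DataFrame({
    "id": ["1", "2", "1"],
    "favourites_count": [10, 5, 42],
})

# keep="last" retains the row from the latest fetch for each id.
df_dedup = df_toy.drop_duplicates(subset="id", keep="last").set_index("id")
print(df_dedup)
```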

### View columns 10 by 10

In [229]:
df_status_pr.iloc[0:5, 0:10]

id,created_at,sensitive,spoiler_text,visibility,language,uri,url,replies_count,reblogs_count,favourites_count
112530190136511498,2024-05-30T13:07:14.460Z,False,,public,en,https://mastodon.social/users/GottaLaff/status...,https://mastodon.social/@GottaLaff/11253019013...,7,64,45
112529682288720853,2024-05-30T10:58:05.315Z,False,,public,en,https://mastodon.social/users/arstechnica/stat...,https://mastodon.social/@arstechnica/112529682...,3,37,43
112529690578217691,2024-05-30T11:00:08.000Z,False,,public,en,https://mastodon.world/users/auschwitzmuseum/s...,https://mastodon.world/@auschwitzmuseum/112529...,3,45,8
112529707596827961,2024-05-30T11:04:31.487Z,False,,public,en,https://mastodon.social/users/gamingonlinux/st...,https://mastodon.social/@gamingonlinux/1125297...,1,11,28
112530249264146971,2024-05-30T13:21:45.000Z,False,,public,en,https://universeodon.com/users/georgetakei/sta...,https://universeodon.com/@georgetakei/11253024...,5,10,10


In [230]:
df_status_pr.iloc[0:5, 10:20]

id,edited_at,content,media_attachments,mentions,tags,emojis,account_id,account_username,account_acct,account_display_name
112530190136511498,,"<p>The amazing Jamie <a href=""https://mastodon...",[],[],"[{'name': 'raskin', 'url': 'https://mastodon.s...",[],109253075745809751,GottaLaff,GottaLaff,Laffy
112529682288720853,,<p>Researchers get modern marines to test Anci...,"[{'id': '112529682187857378', 'type': 'image',...",[],[],[],110266162634306901,arstechnica,arstechnica,Ars Technica
112529690578217691,,"<p>30 maja 1939 | A Dutch Jewish girl, Emma va...","[{'id': '112529690524209257', 'type': 'image',...",[],[],[],109363522574341875,auschwitzmuseum,auschwitzmuseum@mastodon.world,Auschwitz Memorial
112529707596827961,,<p>Athena Crisis looks a lot like Advanced War...,"[{'id': '112529707478655427', 'type': 'image',...",[],"[{'name': 'opensource', 'url': 'https://mastod...",[],10947,gamingonlinux,gamingonlinux,Liam @ GamingOnLinux 🐧🎮
112530249264146971,,<p>Talk about clingy!</p>,"[{'id': '112530249198104425', 'type': 'image',...",[],[],[],109355700962815786,georgetakei,georgetakei@universeodon.com,George Takei :verified: 🏳️‍🌈🖖🏽


In [231]:
bool_list = []
for men in df_status_pr["mentions"]: 
    bool_list.append(len(men) > 0)
sum(bool_list)

275

In [232]:
df_status_pr.iloc[0:5, 20:30]

id,account_locked,account_bot,account_discoverable,account_indexable,account_group,account_created_at,account_note,account_url,account_uri,account_avatar
112530190136511498,False,False,True,True,False,2022-10-29T00:00:00.000Z,"<p>Bite me, Musk.</p><p>Progressive political ...",https://mastodon.social/@GottaLaff,https://mastodon.social/users/GottaLaff,https://files.mastodon.social/accounts/avatars...
112529682288720853,False,False,True,False,False,2023-04-26T00:00:00.000Z,"<p>Original news, reviews, analysis of tech tr...",https://mastodon.social/@arstechnica,https://mastodon.social/users/arstechnica,https://files.mastodon.social/accounts/avatars...
112529690578217691,False,False,True,False,False,2022-11-18T00:00:00.000Z,<p>Former German Nazi concentration &amp; exte...,https://mastodon.world/@auschwitzmuseum,https://mastodon.world/users/auschwitzmuseum,https://files.mastodon.social/cache/accounts/a...
112529707596827961,False,False,True,True,False,2016-11-21T00:00:00.000Z,<p>News + Spicy Opinions. One person mostly. C...,https://mastodon.social/@gamingonlinux,https://mastodon.social/users/gamingonlinux,https://files.mastodon.social/accounts/avatars...
112530249264146971,False,False,True,False,False,2022-11-15T00:00:00.000Z,<p>I boldly went to this new site. Follow for ...,https://universeodon.com/@georgetakei,https://universeodon.com/users/georgetakei,https://files.mastodon.social/cache/accounts/a...


In [233]:
df_status_pr.iloc[0:5, 30:40]

id,account_avatar_static,account_header,account_header_static,account_followers_count,account_following_count,account_statuses_count,account_last_status_at,account_hide_collections,account_emojis,account_fields
112530190136511498,https://files.mastodon.social/accounts/avatars...,https://files.mastodon.social/accounts/headers...,https://files.mastodon.social/accounts/headers...,24284,6108,92375,2024-05-30,False,[],"[{'name': 'WEBSITE', 'value': '<a href=""https:..."
112529682288720853,https://files.mastodon.social/accounts/avatars...,https://files.mastodon.social/accounts/headers...,https://files.mastodon.social/accounts/headers...,154555,6,4635,2024-05-30,False,[],"[{'name': 'Ars Technica', 'value': '<a href=""h..."
112529690578217691,https://files.mastodon.social/cache/accounts/a...,https://files.mastodon.social/cache/accounts/h...,https://files.mastodon.social/cache/accounts/h...,91548,43,2030,2024-05-30,False,[],"[{'name': 'Website', 'value': '<a href=""https:..."
112529707596827961,https://files.mastodon.social/accounts/avatars...,https://files.mastodon.social/accounts/headers...,https://files.mastodon.social/accounts/headers...,60419,239,11213,2024-05-30,False,[],"[{'name': 'Website', 'value': '<a href=""https:..."
112530249264146971,https://files.mastodon.social/cache/accounts/a...,https://files.mastodon.social/cache/accounts/h...,https://files.mastodon.social/cache/accounts/h...,422997,35,5380,2024-05-30,False,"[{'shortcode': 'verified', 'url': 'https://fil...","[{'name': 'Instagram', 'value': '<a href=""http..."


In [234]:
df_status_pr.iloc[0:5, 40:50]

id,application_name,application_website,account_noindex,account_roles,card_url,card_title,card_description,card_language,card_type,card_author_name
112530190136511498,Mona for iPhone,https://mastodon.social/@MonaApp,False,[],https://www.nytimes.com/2024/05/29/opinion/ali...,Opinion | How to Force Justices Alito and Thom...,Can they really decide for themselves whether ...,en,link,Jamie Raskin
112529682288720853,Bot Posts (testing),https://arstechnica.com,False,[],https://arstechnica.com/science/2024/05/resear...,Bizarre armor from Mycenaean Greece turns out ...,People suspected the Dendra armor was ceremoni...,en,link,
112529690578217691,,,,,,,,,,
112529707596827961,Buffer,https://buffer.com,False,[],https://www.gamingonlinux.com/2024/05/athena-c...,Athena Crisis looks a lot like Advanced Wars a...,"Entering Early Access on Steam back in March, ...",en,link,Liam Dawe
112530249264146971,,,,,,,,,,


In [235]:
df_status_pr.iloc[0:5, 50:]

id,card_author_url,card_provider_name,card_provider_url,card_html,card_width,card_height,card_image,card_image_description,card_embed_url,card_blurhash,card_published_at
112530190136511498,https://www.nytimes.com/by/jamie-raskin,The New York Times,,,1050.0,550.0,https://files.mastodon.social/cache/preview_ca...,,,UCIYB|~pD%%M9F?bM|M{%MkBIUMx_NWBM{M{,2024-05-29T18:26:00.000Z
112529682288720853,,Ars Technica,,,640.0,386.0,https://files.mastodon.social/cache/preview_ca...,,,UbECaq02s:xsS7n#ofRkxsRkj[oebIocazfQ,2024-05-30T10:55:57.000Z
112529690578217691,,,,,,,,,,,
112529707596827961,https://www.gamingonlinux.com/profiles/liamd,GamingOnLinux,,,740.0,420.0,https://files.mastodon.social/cache/preview_ca...,,,U7HC3rGFpk-z3txpvEnj]p%YO*n7-Do]R+Rm,2024-05-30T11:03:39.000Z
112530249264146971,,,,,,,,,,,


### Processing tags column

To make the tags easier to work with, first count the tags per status.

In [236]:
df_status_pr["tags_num"] = df_status_pr["tags"].map(lambda x: len(x))

Next, concatenate tags into a single string.

In [237]:
def dict_names_to_string(ls_of_dt):
    tag_name = " "
    for dt in ls_of_dt:
        tag_name += dt["name"] + " "
    return tag_name.strip()

In [238]:
df_status_pr["tags_str"] = df_status_pr["tags"].map(dict_names_to_string)
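
To illustrate what the two derived tag columns contain, here is a small sketch on invented tag lists. It uses `" ".join(...)` as a compact equivalent of the helper above and includes an empty list to show that a status without tags yields a count of 0 and an empty string.

```python
import pandas as pd

# Toy "tags" column: each cell is a list of dicts with a "name" key,
# mirroring the structure seen in the column preview (values invented).
tags = pd.Series([
    [{"name": "raskin"}, {"name": "scotus"}],
    [],
])

# Number of tags per status.
tags_num = tags.map(len)

# Tag names joined into one space-separated string per status.
tags_str = tags.map(lambda ls: " ".join(dt["name"] for dt in ls))
print(tags_num.tolist(), tags_str.tolist())
```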

### Processing date-times

In [239]:
from datetime import datetime

In [240]:
# All full timestamps end in a literal "Z"; they are parsed as timezone-naive datetimes
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

def parse_dt(dt_str, fmt=ISO_FORMAT):
    # Missing (non-string) values are mapped to None
    return datetime.strptime(dt_str, fmt) if isinstance(dt_str, str) else None

df_status_pr["created_at_dt"] = df_status_pr["created_at"].map(parse_dt)
df_status_pr["edited_at_dt"] = df_status_pr["edited_at"].map(parse_dt)
df_status_pr["account_created_at_dt"] = df_status_pr["account_created_at"].map(parse_dt)
# Date-only column; note that it is overwritten in place, not copied to a *_dt column
df_status_pr["account_last_status_at"] = df_status_pr["account_last_status_at"].map(
    lambda dt_str: parse_dt(dt_str, "%Y-%m-%d")
)
df_status_pr["card_published_at_dt"] = df_status_pr["card_published_at"].map(parse_dt)
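
A vectorised alternative is `pd.to_datetime` with an explicit format. This is only a sketch on an invented two-row column, not the approach used in this notebook. The trailing `Z` is matched as a literal character, so the result stays timezone-naive, and `errors="coerce"` turns missing values into `NaT`.

```python
import pandas as pd

# Toy column in the same timestamp format as created_at / edited_at;
# None stands for a status that was never edited.
edited_at = pd.Series(["2024-05-30T13:07:14.460Z", None])

# The trailing "Z" is matched literally, so the result is timezone-naive.
# Values that cannot be parsed (including missing ones) become NaT.
edited_at_dt = pd.to_datetime(
    edited_at, format="%Y-%m-%dT%H:%M:%S.%fZ", errors="coerce"
)
print(edited_at_dt)
```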

One useful piece of information could be the account age in days.

In [241]:
df_status_pr["account_age"] = df_status_pr["account_created_at_dt"].map(
    lambda dt: (datetime.today() - dt).days
)
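
Because `datetime.today()` changes with every run, the computed ages depend on when the notebook is executed. A possible variation, sketched below, measures age against a fixed reference date. The choice of 2024-05-30 (the day seen in the status timestamps) is an assumption made for illustration.

```python
import pandas as pd

# Assumption for illustration: measure age relative to the collection day
# (2024-05-30, the date seen in the status timestamps) instead of "today".
reference_date = pd.Timestamp("2024-05-30")

# Toy account creation dates, taken from the preview rows.
account_created = pd.Series(pd.to_datetime(["2022-10-29", "2016-11-21"]))

# Subtracting datetimes gives Timedeltas; .dt.days extracts whole days.
account_age = (reference_date - account_created).dt.days
print(account_age.tolist())
```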

### Media, mentions, emojis, card

Count media attachments and add a 0/1 indicator for whether any media is attached.

In [242]:
df_status_pr["media_count"] = df_status_pr["media_attachments"].map(lambda ls: len(ls))
df_status_pr["any_media"] = df_status_pr["media_count"].map(lambda ct: 1 if ct>0 else 0)

Count mentions and add a 0/1 indicator for whether any mention is present.

In [243]:
df_status_pr["mention_count"] = df_status_pr["mentions"].map(lambda ls: len(ls))
df_status_pr["any_mention"] = df_status_pr["mention_count"].map(lambda ct: 1 if ct>0 else 0)

Count emojis and add a 0/1 indicator for whether any emoji is present.

In [244]:
df_status_pr["emoji_count"] = df_status_pr["emojis"].map(lambda ls: len(ls))
df_status_pr["any_emoji"] = df_status_pr["emoji_count"].map(lambda ct: 1 if ct>0 else 0)
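
The same count-then-flag pattern is applied to media, mentions and emojis, so it could be wrapped in a small helper. The function `add_count_and_flag` below is hypothetical and not part of this notebook. It is shown on an invented toy frame.

```python
import pandas as pd


def add_count_and_flag(df, list_col, prefix):
    """Add <prefix>_count and any_<prefix> columns for a list-valued column.

    Hypothetical helper, not part of the notebook; it mirrors the
    count-then-flag pattern used for media, mentions and emojis.
    """
    out = df.copy()
    out[f"{prefix}_count"] = out[list_col].map(len)
    # Cast the boolean comparison to 0/1 integers.
    out[f"any_{prefix}"] = (out[f"{prefix}_count"] > 0).astype(int)
    return out


df_toy = pd.DataFrame({"mentions": [[], [{"acct": "alice"}, {"acct": "bob"}]]})
df_toy = add_count_and_flag(df_toy, "mentions", "mention")
print(df_toy[["mention_count", "any_mention"]])
```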

Check cards

In [245]:
df_status_pr["card_type"].value_counts()

card_type
link     1097
video      66
photo       2
Name: count, dtype: int64

Recategorise the card types, mapping null values to a separate category.

In [246]:
def card_categorise(card_type): 
    cat = "No card"
    if card_type == "link":
        cat = "link"
    elif card_type == "video":
        cat = "video/photo"
    elif card_type == "photo":
        cat = "video/photo"
    return cat

In [247]:
df_status_pr["card_categories"] = df_status_pr["card_type"].map(card_categorise)

In [248]:
df_status_pr["card_categories"].value_counts()

card_categories
No card        2012
link           1097
video/photo      68
Name: count, dtype: int64
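
The recategorisation can also be written as a dictionary lookup. This is a sketch on an invented toy column, not a replacement for the function above. Any type that is missing from the dictionary, including nulls, becomes `NaN` and is then filled with "No card".

```python
import pandas as pd

# Toy card_type column; None stands for a status without a preview card.
card_type = pd.Series(["link", "video", None, "photo"])

# Types missing from the dictionary (including nulls) map to NaN,
# which fillna then turns into the "No card" category.
card_map = {"link": "link", "video": "video/photo", "photo": "video/photo"}
card_categories = card_type.map(card_map).fillna("No card")
print(card_categories.tolist())
```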

Add another variable indicating whether there is any card.

In [249]:
df_status_pr["any_card"] = df_status_pr["card_categories"].map(lambda cat: 0 if cat == "No card" else 1)

## Tags data

Read file

In [124]:
with open("../data/raw/trending_tags_2355.txt") as json_file:
    js_tags = json.load(json_file)

In [141]:
df_tags = pd.json_normalize(js_tags)

In [152]:
df_tags_long = df_tags.explode("history")

Now there are 7 entries for each tag, one per day of history, and the history column still holds information to unpack.

In [155]:
import numpy as np
# Label each history entry 0..6; this assumes every tag has exactly 7 entries
days_before = np.tile(np.arange(0,7), df_tags_long.shape[0] // 7)

df_tags_long["days_before"] = days_before

In [159]:
df_tags_long["accounts_using"] = df_tags_long["history"].map(lambda dt : dt["accounts"])
df_tags_long["usage_count"] = df_tags_long["history"].map(lambda dt : dt["uses"])

In [161]:
df_tags_long = df_tags_long.drop(columns = "history")
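
The reshaping steps above can be traced on a small invented payload. This sketch differs from the notebook in two ways. First, it numbers the history entries with `groupby(level=0).cumcount()`, which does not rely on every tag having exactly 7 entries. Second, it casts the counts with `pd.to_numeric`, on the assumption that the API may deliver them as strings.

```python
import pandas as pd

# Toy trending-tags payload with two days of history per tag. Values are
# invented; counts are written as strings on the assumption that the API
# may deliver them that way.
js_toy = [
    {"name": "tag_a", "history": [{"uses": "5", "accounts": "3"},
                                  {"uses": "2", "accounts": "2"}]},
    {"name": "tag_b", "history": [{"uses": "1", "accounts": "1"},
                                  {"uses": "0", "accounts": "0"}]},
]

# One row per (tag, history entry); the original row index is repeated.
df_long = pd.json_normalize(js_toy).explode("history")

# Position within each tag's history, without assuming a fixed length.
df_long["days_before"] = df_long.groupby(level=0).cumcount()

# Unpack the dicts and make sure the counts are numeric.
df_long["accounts_using"] = pd.to_numeric(
    df_long["history"].map(lambda d: d["accounts"])
)
df_long["usage_count"] = pd.to_numeric(
    df_long["history"].map(lambda d: d["uses"])
)
df_long = df_long.drop(columns="history")
print(df_long)
```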

In [162]:
df_tags_long

,name,url,days_before,accounts_using,usage_count
0,musiquinta,https://mastodon.social/tags/musiquinta,0,92,226
0,musiquinta,https://mastodon.social/tags/musiquinta,1,2,2
0,musiquinta,https://mastodon.social/tags/musiquinta,2,2,2
0,musiquinta,https://mastodon.social/tags/musiquinta,3,1,2
0,musiquinta,https://mastodon.social/tags/musiquinta,4,1,1
...,...,...,...,...,...
19,TheUmbrellaAcademy,https://mastodon.social/tags/theumbrellaacademy,2,0,0
19,TheUmbrellaAcademy,https://mastodon.social/tags/theumbrellaacademy,3,0,0
19,TheUmbrellaAcademy,https://mastodon.social/tags/theumbrellaacademy,4,0,0
19,TheUmbrellaAcademy,https://mastodon.social/tags/theumbrellaacademy,5,0,0


## Save files for further use

In [250]:
df_status_pr.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3177 entries, 112530190136511498 to 112530897905137155
Data columns (total 76 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   created_at                3177 non-null   object        
 1   sensitive                 3177 non-null   bool          
 2   spoiler_text              3177 non-null   object        
 3   visibility                3177 non-null   object        
 4   language                  3177 non-null   object        
 5   uri                       3177 non-null   object        
 6   url                       3177 non-null   object        
 7   replies_count             3177 non-null   int64         
 8   reblogs_count             3177 non-null   int64         
 9   favourites_count          3177 non-null   int64         
 10  edited_at                 309 non-null    object        
 11  content                   3177 non-null   object        

In [251]:
df_status_pr.to_csv("../data/processed/trending_statuses.csv")

In [None]:
df_tags.to_csv("../data/processed/trending_tags.csv")
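
When the CSV files are read back, some type information is lost. Unless told otherwise, pandas reads the long numeric ids as integers and the datetimes as plain strings, and list-valued columns such as `tags` come back as their string representation. The sketch below uses an in-memory buffer instead of the real files and shows one way to restore the id and datetime types on an invented one-row frame.

```python
import io

import pandas as pd

# Toy frame with string ids as the index, as in df_status_pr.
df_toy = pd.DataFrame(
    {"created_at_dt": pd.to_datetime(["2024-05-30 13:07:14"]), "tags_num": [1]},
    index=pd.Index(["112530190136511498"], name="id"),
)

buffer = io.StringIO()  # stands in for a file on disk
df_toy.to_csv(buffer)
buffer.seek(0)

# Read ids as strings and parse the datetime column, then restore the index.
df_back = pd.read_csv(
    buffer, dtype={"id": str}, parse_dates=["created_at_dt"]
).set_index("id")
print(df_back.dtypes)
```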