<center><h1>Lab: Exploring the DSA Transparency database</h1></center>

The goal of this lab is to study, with very simple methods, a sample from the [Digital Services Act Transparency Database](https://transparency.dsa.ec.europa.eu/dashboard). The database collects moderation decisions made by online platforms (VLOPs, Very Large Online Platforms), as identified by the [DSA](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX%3A32022R2065#enc_1) and the European Commission.

The main goal of the lab is to gain insight into the differentiated moderation practices of social media platforms subject to the DSA's obligations. To do this, we use a stratified sample of the transparency database, filtered to contain 70,000 decisions by social media platforms.


## Imports

In [1]:
import numpy
import pandas as pd

## Load data

In [5]:
# Pass the path directly so that pandas opens and closes the file itself
data = pd.read_csv("./sample-strat-may2024-socmed-70k.csv", sep=",", encoding="utf-8")

data.head()

socials = data["platform_name"]

## Basic data exploration

The first goal is to get the hang of the data that you have and its variables, and to answer a few basic questions. You can familiarise yourself with the variables by reading the provided document "DSA transparency database - Description of variables.docx".

- Which are the platforms in the sample?
- How many decisions are present for each platform?
- Which variables would you modify, and how (this might include removing variables, changing their type, creating new variables from a combination of existing ones, etc.)?

Write your thoughts and summarised results here.

In [9]:
# Which social media platforms are in the sample, and how many decisions does each have?


socials.unique()

socials.value_counts()

platform_name
Facebook     10000
TikTok       10000
Instagram    10000
Snapchat     10000
LinkedIn     10000
X            10000
YouTube      10000
Name: count, dtype: int64

In general, you can see a variable `x` by typing `data["x"]`, see the different values the variable takes with `set(data["x"].values)`, and see the counts of values for a variable using `data["x"].value_counts()`.
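To get a quick overview of every variable at once, the same ideas can be applied column by column. The sketch below is only an illustration: it runs on a small hand-made frame called `toy` (a hypothetical stand-in for `data`, with made-up rows), and the summary column names `dtype`, `n_unique` and `missing_share` are our own choice.

```python
import pandas as pd

# Toy stand-in for the lab's `data` frame (rows are illustrative, not real data).
toy = pd.DataFrame({
    "platform_name": ["TikTok", "TikTok", "X", "YouTube"],
    "decision_ground": [
        "DECISION_GROUND_INCOMPATIBLE_CONTENT",
        "DECISION_GROUND_INCOMPATIBLE_CONTENT",
        "DECISION_GROUND_ILLEGAL_CONTENT",
        "DECISION_GROUND_INCOMPATIBLE_CONTENT",
    ],
    "source_identity": [None, None, None, None],
})

# One row per variable: its type, number of distinct values, and share of missing values.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
    "missing_share": toy.isna().mean(),
})
print(overview)
```

A variable with a single distinct value, or one that is missing almost everywhere, is a natural candidate for removal. Replacing `toy` with `data` should give the same kind of table for the real sample.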

In [10]:
# What about visibility decisions, such as removing content?

decisions = data["decision_visibility"]

decisions.value_counts()


decision_visibility
["DECISION_VISIBILITY_CONTENT_REMOVED"]                                       34429
["DECISION_VISIBILITY_OTHER"]                                                 14975
["DECISION_VISIBILITY_CONTENT_DISABLED"]                                       5494
["DECISION_VISIBILITY_CONTENT_DEMOTED"]                                        1559
["DECISION_VISIBILITY_CONTENT_AGE_RESTRICTED"]                                  187
["DECISION_VISIBILITY_CONTENT_LABELLED"]                                        183
["DECISION_VISIBILITY_OTHER","DECISION_VISIBILITY_CONTENT_AGE_RESTRICTED"]        1
Name: count, dtype: int64

You should already be able to remove or modify a few variables. Do so before going on, to facilitate your exploration. You can remove a column `x` with `data = data.drop("x", axis=1)`. Keep `platform_name`, since the rest of the lab filters and groups on it.
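Some variables may also need a change of type rather than removal. The output above suggests that `decision_visibility` is stored as text that looks like a JSON list, and date columns are typically read as plain strings. The following is a minimal sketch under those assumptions, run on a toy frame (`toy` and the helper `parse_list` are hypothetical names, not part of the lab material).

```python
import json

import pandas as pd

# Toy rows mimicking the string encoding seen in the value counts above.
toy = pd.DataFrame({
    "decision_visibility": [
        '["DECISION_VISIBILITY_CONTENT_REMOVED"]',
        '["DECISION_VISIBILITY_OTHER","DECISION_VISIBILITY_CONTENT_AGE_RESTRICTED"]',
        None,
    ],
    "application_date": [
        "2024-05-07 00:00:00",
        "2024-05-09 00:00:00",
        "2024-05-08 00:00:00",
    ],
})


def parse_list(cell):
    """Decode a JSON-encoded list; missing cells become an empty list."""
    return json.loads(cell) if isinstance(cell, str) else []


# Turn the text into real Python lists and the date strings into timestamps.
toy["decision_visibility"] = toy["decision_visibility"].apply(parse_list)
toy["application_date"] = pd.to_datetime(toy["application_date"])

# explode() yields one row per individual value, so multi-valued cells are counted per value.
visibility_counts = toy["decision_visibility"].explode().value_counts()
print(visibility_counts)
```

Counting per individual value avoids treating a combination such as "other + age restricted" as a separate category of its own.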

## Filtering by platform

Now that we have a general idea of the database, we would like to get a first idea of differentiated behaviours by platform. You can filter a dataframe like so: `data[data["x"]==y]`. Below, for example, is the code to get all decisions for TikTok.

In [38]:
data[data["platform_name"] == "TikTok"]

Unnamed: 0,decision_visibility,decision_ground,illegal_content_legal_ground,incompatible_content_ground,category,content_type,content_date,territorial_scope,application_date,source_type,source_identity,automated_detection,automated_decision,platform_name,created_at
10000,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Hate Speech and Hateful Behaviors,STATEMENT_CATEGORY_ILLEGAL_OR_HARMFUL_SPEECH,"[""CONTENT_TYPE_TEXT""]",2024-05-07 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-07 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-07 23:10:14
10001,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-09 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-09 02:58:51
10002,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,The ad features restricted or prohibited claim...,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-08 00:00:00,"[""PL""]",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_NOT_AUTOMATED,TikTok,2024-05-10 05:49:37
10003,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-09 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-09 19:31:33
10004,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-09 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-10 11:18:20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-09 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-10 10:15:31
19996,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_IMAGE""]",2024-04-18 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-06 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-06 04:47:09
19997,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-09 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-11 14:11:54
19998,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Animal Abuse,STATEMENT_CATEGORY_ANIMAL_WELFARE,"[""CONTENT_TYPE_TEXT""]",2024-05-08 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-08 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-08 21:09:20


Perform a variable exploration per platform. What can you infer in terms of differentiated moderation practices?
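One possible way to organise this exploration is to compute, for each platform, the share of each value of a given variable, and to place the results side by side. The sketch below uses `automated_decision` as the example variable and a toy frame with made-up rows; `toy`, `shares` and `profile` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for `data` (rows are illustrative, not real data).
toy = pd.DataFrame({
    "platform_name": ["TikTok", "TikTok", "TikTok", "X", "X"],
    "automated_decision": [
        "AUTOMATED_DECISION_FULLY",
        "AUTOMATED_DECISION_FULLY",
        "AUTOMATED_DECISION_NOT_AUTOMATED",
        "AUTOMATED_DECISION_NOT_AUTOMATED",
        "AUTOMATED_DECISION_NOT_AUTOMATED",
    ],
})

# For each platform, the share of each value of the variable.
shares = {}
for platform, subset in toy.groupby("platform_name"):
    shares[platform] = subset["automated_decision"].value_counts(normalize=True)

# Side-by-side table: one column per platform; values a platform never uses get 0.
profile = pd.DataFrame(shares).fillna(0.0)
print(profile)
```

Shares are easier to compare than raw counts when platforms do not all have the same number of rows for a variable, for instance because of missing values.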

## Cross-frequency counts

In the next steps, we go a bit deeper and look at how variables relate to each other. This can be done with a cross-frequency analysis, _i.e._ looking at how often the values of two variables co-occur. Below is an example:

In [50]:
data.groupby("platform_name")["decision_ground"].value_counts()

platform_name  decision_ground                     
Facebook       DECISION_GROUND_INCOMPATIBLE_CONTENT    10000
Instagram      DECISION_GROUND_INCOMPATIBLE_CONTENT     9999
               DECISION_GROUND_ILLEGAL_CONTENT             1
LinkedIn       DECISION_GROUND_INCOMPATIBLE_CONTENT    10000
Snapchat       DECISION_GROUND_INCOMPATIBLE_CONTENT     9993
               DECISION_GROUND_ILLEGAL_CONTENT             7
TikTok         DECISION_GROUND_INCOMPATIBLE_CONTENT    10000
X              DECISION_GROUND_ILLEGAL_CONTENT         10000
YouTube        DECISION_GROUND_INCOMPATIBLE_CONTENT     9957
               DECISION_GROUND_ILLEGAL_CONTENT            43
Name: count, dtype: int64

What can you tell, by platform, reading this kind of analysis?
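The same cross-frequency counts can also be laid out as a two-way table with `pd.crosstab`, which makes empty combinations visible as zeros. This is an alternative presentation rather than the lab's prescribed method, shown here on a toy frame with made-up rows (`toy`, `counts` and `shares` are hypothetical names).

```python
import pandas as pd

# Toy stand-in for `data` (rows are illustrative, not real data).
toy = pd.DataFrame({
    "platform_name": ["YouTube", "YouTube", "YouTube", "YouTube", "X", "X"],
    "decision_ground": [
        "DECISION_GROUND_INCOMPATIBLE_CONTENT",
        "DECISION_GROUND_INCOMPATIBLE_CONTENT",
        "DECISION_GROUND_INCOMPATIBLE_CONTENT",
        "DECISION_GROUND_ILLEGAL_CONTENT",
        "DECISION_GROUND_ILLEGAL_CONTENT",
        "DECISION_GROUND_ILLEGAL_CONTENT",
    ],
})

# Rows are platforms, columns are decision grounds, cells are counts.
counts = pd.crosstab(toy["platform_name"], toy["decision_ground"])

# normalize="index" turns each row into shares that sum to 1.
shares = pd.crosstab(toy["platform_name"], toy["decision_ground"], normalize="index")

print(counts)
print(shares)
```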

## Moderation time

Create a new variable for moderation time, _i.e._ the time between the decision and its application. Is this new variable relevant? What can you conclude from it?

In [None]:
data["moderation_time"] = ... # Write code here
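As a hint on the mechanics only: subtracting two datetime columns in pandas gives a duration (`Timedelta`) column. The sketch below assumes one possible reading of the definition, namely `application_date` minus `content_date`; other column pairs (for example involving `created_at`) could be argued for, and choosing one is part of the exercise. It runs on a toy frame whose two rows reuse dates shown in the TikTok output above; `toy`, `delay` and `delay_days` are hypothetical names.

```python
import pandas as pd

# Two toy rows reusing dates from the TikTok output shown earlier.
toy = pd.DataFrame({
    "content_date": ["2024-04-18 00:00:00", "2024-05-08 00:00:00"],
    "application_date": ["2024-05-06 00:00:00", "2024-05-08 00:00:00"],
})

# Dates are read from CSV as text, so convert them before doing arithmetic.
for col in ["content_date", "application_date"]:
    toy[col] = pd.to_datetime(toy[col])

# Assumption: delay = application date minus content date.
toy["delay"] = toy["application_date"] - toy["content_date"]
toy["delay_days"] = toy["delay"].dt.days

print(toy[["content_date", "application_date", "delay_days"]])
```

In the rows displayed above, both dates appear to carry day-level resolution only (the time is always `00:00:00`), which is worth keeping in mind when judging how relevant the resulting variable is.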