In [4]:
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
df = pd.read_csv("data.csv")
display(df.head())

print("Number of columns:", len(df.columns.tolist()))
for col in df.columns.tolist():
    print(col)

Unnamed: 0,incident_id,report_id,thread,Validated By,Title,Tags,date_coded,url,source,file,...,hero,Needs External Review (Hero),villain,Needs External Review (Villain),plot,Needs External Review (Plot),moral,Needs External Review (Moral),comments_policy_narrative,Validated
0,74,2,,,,,2023-10-29,https://twitter.com/MikaelThalen/status/150412...,,,...,Ukrainian soldiers,,Ukrainain soliders are dying in the war.,,President Zelensky is calling for troops to st...,,Zelenksy thinks Ukraine is losing the war so h...,,,
1,75,3,,,,,2023-10-29,https://twitter.com/ThePatriotOasis/status/163...,,,...,The US military/ freedom for Ukraine and Taiwan,,China and Russia,,Increased tensions in Ukraine and Taiwan have ...,,The opening of the draft is inevitable due to ...,,,
2,76,4,,,,,2023-10-29,https://twitter.com/SabrinaHalper/status/17136...,,,...,Warren Buffet,,Economic inequality,,Warren Buffet is wanting to give away some of ...,,Warren Buffet thinks giving his money away to ...,,,
3,70,5,,,,,2023-10-23,https://twitter.com/MechaOrvo/status/171479783...,,,...,Trump for denouncing Israel,,Israel and US,,Trump is denouncing the actions of Israel.,,Trump is standing up for Palestine.,,Images could be interpreted as Macron being th...,
4,77,6,,,,,2023-10-29,https://twitter.com/NotPoliticians/status/1714...,,,...,AI regulations,,Lack of regulations,,AI is being used to create pThe current lack o...,,AI is being used to create political ads that ...,,,


Number of columns: 191
incident_id
report_id
thread
Validated By
Title
Tags
date_coded
url
source
file
file_with_watermark
screenshot
date_posted
format
transcript
summary_content
type
comment_type
Needs external review (type)
HiveModeration_Score
sharer_faketype
Needs External Review (sharer_faketype)
external_verification
Needs External Review (external_verification)
links_verification
evidence_fake
watermark
Needs External Verification (watermark)
likes
views
comments
shares
original_source_name
original_source_username
Needs External Review (Original Source Name)
original_source_type
Needs External Review (Original Source Type)
original_source_actor_type
Needs External Review (Original Source Actor Type)
original_source_country
other_original_source_country
Needs External Review (Original Source Country)
comments_coding_original_source
sharer_name
other_sharer_name
Needs External Review (Sharer Name)
sharer_username
sharer_type
Needs External Review (Sharer Type)
sharer_job
Needs E

The dataset has a lot of columns (191), so we first check which ones are redundant (almost all NaN) and then keep the most relevant ones.
(We can be fairly aggressive here and remove every column with more than 80% NaN values.)

In [6]:
all_nan_cols = df.columns[df.isna().all()]
print("Columns with all NaN values:", len(list(all_nan_cols)))

threshold = 0.80
high_nan_cols = df.columns[df.isna().mean() > threshold]
print("Columns with more than 80% NaN values:", len(list(high_nan_cols)))

# Find most relevant columns
df_relevant = df.drop(columns=high_nan_cols)
print("\nRemaining columns:", len(df_relevant.columns.to_list()))
for relev in df_relevant.columns.to_list():
    print(relev)

Columns with all NaN values: 61
Columns with more than 80% NaN values: 134

Remaining columns: 57
incident_id
report_id
date_coded
url
file
screenshot
date_posted
format
transcript
summary_content
type
comment_type
sharer_faketype
external_verification
links_verification
evidence_fake
watermark
likes
views
comments
shares
original_source_name
original_source_type
original_source_actor_type
original_source_country
sharer_name
sharer_type
sharer_job
high_impact_comment
sharer_country
sharer_city
target_name
target_one_name
target_one_sentiment
target_one_type_macro
target_one_type_micro
target_one_country
target_two_name
target_two_sentiment
target_two_type_macro
target_two_type_micro
target_two_country
target_type_macro
target_type_micro
target_country
deepfake_content_depicts
harm_depicted
social_policy_sector
context_deepfake
text_around_deepfake
harm_evidence
communication_goal
core_frame
hero
villain
plot
moral


## Social platforms

In [7]:
from urllib.parse import urlparse

# Determine social platform by extracting the domain from the URL
df['domain'] = df['url'].apply(lambda x: urlparse(x).netloc if pd.notnull(x) else None)
print("Social platform counts:\n", df['domain'].value_counts())

Social platform counts:
 domain
x.com                                     601
twitter.com                               159
www.tiktok.com                             39
youtu.be                                   31
www.youtube.com                            22
www.instagram.com                          18
miro.medium.com                            16
www.facebook.com                           14
farid.berkeley.edu                         12
www.reddit.com                              8
truthsocial.com                             7
archive.is                                  6
                                            4
web.archive.org                             3
www.cnn.com                                 3
www.alternet.org                            3
techcrunch.com                              3
nypost.com                                  2
archive.li                                  2
static01.nytimes.com                        2
www.theguardian.com                         1
m.

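We can see that the most popular deepfake/cheapfake platforms are X/Twitter (by a wide margin), TikTok, YouTube, Instagram and Facebook. Note that some platforms are split across several domains (x.com and twitter.com, youtu.be and www.youtube.com). We have some other websites to explore too, such as Medium, Berkeley and Truth Social.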
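Since one platform can appear under several domains, the counts above can be merged per platform. The following is a minimal sketch on toy URLs; the `PLATFORM_BY_DOMAIN` mapping, the `to_platform` helper and the list of stripped prefixes are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical mapping from raw domains to platform labels; extend as needed.
PLATFORM_BY_DOMAIN = {
    "x.com": "X/Twitter",
    "twitter.com": "X/Twitter",
    "tiktok.com": "TikTok",
    "youtu.be": "YouTube",
    "youtube.com": "YouTube",
    "instagram.com": "Instagram",
    "facebook.com": "Facebook",
}


def to_platform(url):
    """Map a URL to a platform label, or 'other' if the domain is unmapped."""
    if pd.isna(url):
        return None
    domain = urlparse(url).netloc.lower()
    # Strip common subdomain prefixes so 'www.tiktok.com' matches 'tiktok.com'.
    for prefix in ("www.", "m.", "mobile."):
        if domain.startswith(prefix):
            domain = domain[len(prefix):]
            break
    return PLATFORM_BY_DOMAIN.get(domain, "other")


# Toy URLs standing in for the real 'url' column.
toy = pd.DataFrame({"url": [
    "https://x.com/user/status/1",
    "https://twitter.com/user/status/2",
    "https://youtu.be/abc",
    "https://www.youtube.com/watch?v=abc",
    "https://www.cnn.com/story",
    None,
]})
toy["platform"] = toy["url"].apply(to_platform)
print(toy["platform"].value_counts())
```

On the real data this would be applied to `df['url']` in place of the toy column.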

In [8]:
# Count rows per year in the date_posted column
df['date_posted'] = pd.to_datetime(df['date_posted'], errors='coerce')
df['year'] = df['date_posted'].dt.year
print("Rows per year:\n", df['year'].value_counts())

Rows per year:
 year
2024.0    617
2023.0    240
2020.0     52
2021.0     35
2022.0     16
2019.0      9
2018.0      2
2016.0      1
2025.0      1
Name: count, dtype: int64


We see that most posts were made in 2024 and 2023 (617 + 240 = 857 rows), which should provide a sufficient basis for our analysis.
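The years print as floats (e.g. `2024.0`) because `errors='coerce'` turns unparseable dates into `NaT`, which forces the year column to a float dtype. A small sketch on toy dates, assuming a nullable integer column is preferred; the toy values are made up for illustration.

```python
import pandas as pd

# Toy dates standing in for the real 'date_posted' column; one is unparseable.
toy = pd.DataFrame({"date_posted": ["2024-08-20", "2023-05-01", "not a date", "2020-01-15"]})
toy["date_posted"] = pd.to_datetime(toy["date_posted"], errors="coerce")

# Nullable integer dtype keeps missing years as <NA> instead of forcing floats.
toy["year"] = toy["date_posted"].dt.year.astype("Int64")

# Count rows inside the 2023-2024 window and rows without a usable date.
in_window = toy["year"].isin([2023, 2024])
print("Rows in 2023-2024:", int(in_window.sum()))
print("Rows with no parseable date:", int(toy["year"].isna().sum()))
```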

In [9]:
# Identify rows with media
media_rows = df[df['file'].notna()]
print("Number of rows with media:", media_rows.shape[0])
print("Proportion with media:", media_rows.shape[0]/len(df))

Number of rows with media: 928
Proportion with media: 0.9508196721311475


Around 95% of the rows have media, which is promising for some image processing analysis.

In [10]:
print("Format counts:\n", df['format'].value_counts())
print("\nType counts:\n", df['type'].value_counts())
print("\nSharer faketype counts:\n", df['sharer_faketype'].value_counts())
print("\nExternal verification counts:\n", df['external_verification'].value_counts())

Format counts:
 format
image    646
video    330
Name: count, dtype: int64

Type counts:
 type
deepfake           803
unclear/unknown     89
cheapfake           82
real                 1
Name: count, dtype: int64

Sharer faketype counts:
 sharer_faketype
fake              468
unclear           350
real_authentic    157
Name: count, dtype: int64

External verification counts:
 external_verification
unknown                  455
verified_fake            325
authenticity_disputed    156
verified_real             39
Name: count, dtype: int64


About two thirds of the media are images (646) and one third are videos (330). We have enough deepfakes (803) to focus on them exclusively (and not worry about the nuances/differences between deepfakes and cheapfakes, etc). The largest group of sharers (468, roughly 48%) labels its posts as fake, but a sizable group (350, roughly 36%) does not state anything explicitly. External verification is unknown for 455 rows, nearly half of the dataset, which is a potential weakness.
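Shares like these can be read off directly with normalized value counts instead of being computed by hand. A minimal sketch on toy labels (the values below are illustrative, not the real column):

```python
import pandas as pd

# Toy labels standing in for the real 'sharer_faketype' column.
toy = pd.Series(["fake", "fake", "unclear", "real_authentic", None], name="sharer_faketype")

# normalize=True returns shares instead of counts;
# dropna=False keeps missing values visible as their own category.
shares = toy.value_counts(normalize=True, dropna=False)
print(shares.round(2))
```

Keeping `dropna=False` makes it obvious when a column has missing labels that plain `value_counts()` would silently skip.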

# Potential ways to filter the data

For all options, we will also drop the high-NaN columns.

In [11]:
df = df.drop(columns=high_nan_cols)

### 1. 2023-2024 analysis of deepfake images/videos that have been verified

In [12]:
df1 = df[
    (df['year'].isin([2023, 2024])) &
    (df['type'] == 'deepfake') &
    (df['file'].notna()) &
    (df['external_verification'] != 'unknown')
]

In [13]:
print(df1.shape)
display(df1)

(340, 59)


Unnamed: 0,incident_id,report_id,date_coded,url,file,screenshot,date_posted,format,transcript,summary_content,...,text_around_deepfake,harm_evidence,communication_goal,core_frame,hero,villain,plot,moral,domain,year
16,203,17,2024-02-12,https://twitter.com/DouglasLucas/status/171631...,Screenshot 2024-02-12 at 12.47.24 PM.png (http...,,2023-10-23,image,,"Two polaroid photographs, the one one the righ...",...,AI and the end of photographic truth? Deceptiv...,political_interference,education,human_interest,Biden,Putin,Putin and Biden hugging,U.S. and Russian leaders can get along,twitter.com,2023.0
18,186,18,2024-02-04,https://twitter.com/21WIRE/status/165300993699...,"""Patrick Henningsen on X_ _⭕️ Granted, this mu...",,2023-05-01,video,"""Today is today, and yesterday was today yeste...",Kamala Harris is speaking at a rally and her s...,...,"⭕️ Granted, this must be a deep fake, but rega...",non_identifiable,"satire,entertainment,harm_reputation",human_interest,A leader who sounds smart and capable during s...,Kamala Harris rambling and not making sense du...,"Kamala's speech does not make sense, which lea...",Kamala Harris's credibility as VP should be qu...,twitter.com,2023.0
20,187,19,2024-02-05,https://www.facebook.com/reel/894486844951526,deepfake 19.html (https://v5.airtableuserconte...,,2023-05-03,video,"“Today is today, and yesterday is today yester...",The sharer posted a video showing her reacting...,...,"""Our VP: 'I can only hit this bong 1 more time...",other,"satire,entertainment,harm_reputation",human_interest,People expect that a VP can communicate with c...,Online users believe that Harris is a rambler.,Kamala Harris is shown to be giving a speech t...,Kamala Harris is portrayed as being unable to ...,www.facebook.com,2023.0
31,201,25,2024-02-09,https://twitter.com/JackPosobiec/status/163383...,Screenshot 2024-02-09 134519.png (https://v5.a...,,2023-03-09,video,I'll tell you something the deepfake that Jack...,AOC announcing her opinions with regards to ho...,...,"""The deepfake that Jack Posobiec made of Presi...",political_interference,"entertainment,harm_political_interference,acti...","responsibility,conflict","Poso's supposters, those who share his ideology","AOC, and other related gov. officials who shar...",Biden and AOC do not understand how to priorit...,we must take a stand against their choices to ...,twitter.com,2023.0
32,283,26,2024-03-04,https://www.reddit.com/r/ChatGPT/comments/156h...,deepfake 26.html (https://v5.airtableuserconte...,,2023-07-22,video,"""Highly anticipated season for reloaded update...",The video shows a news clip from Tucker Carlso...,...,ChatGPT wrote ALL the words coming out of this...,non_identifiable,"satire,entertainment,harm_emotional_psychological",human_interest,,ChatGPT and other AI tech,AI programs give people the ability to create ...,AI has progressed more than a lot of people re...,www.reddit.com,2023.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
917,1002,543,2025-03-04,https://x.com/cattocrunch/status/1825860652847...,GVQyp6pWAAAk6cw.jpg (https://v5.airtableuserco...,Screenshot 2025-03-03 210934.png (https://v5.a...,2024-08-20,image,,The image depicts Donald Trump standing in an ...,...,"The caption says: ""holy shit man look at these...",political_interference,"satire,entertainment",human_interest,Donald Trump,,The post suggests that while some people are c...,AI is being used by politicians to manipulate ...,x.com,2024.0
925,1011,444,2025-03-05,https://x.com/realDonaldTrump/status/182406968...,ScreenRecording_03-05-2025 14-04-06_1.mov (htt...,Screenshot 2025-03-05 at 2.04.31 PM.png (https...,2024-08-15,video,Stayin' Alive by the Bee Gees is playing in th...,Donald Trump and Elon Musk are dancing. \n,...,NA\n,non_identifiable,"entertainment,satire",human_interest,Donald Trump and Elon Musk,,Trump and Musk dancing together,"Trump and Musk work well together, so vote for...",x.com,2024.0
957,1044,594,2025-03-05,https://x.com/Jere_Memez/status/18250285930359...,pdid 594.jpg (https://v5.airtableusercontent.c...,Screenshot 2024-10-30 at 12.54.40 PM.png (http...,2024-08-18,image,,It is a screenshot of Donald Trump's post of K...,...,,political_interference,"harm_political_interference,false_info,harm_re...",human_interest,,Kamala Harris,"If Harris is elected president, she will turn ...","Vote for Trump, not Harris",x.com,2024.0
958,1045,533,2025-03-05,https://x.com/YourMomsHo42215/status/182532763...,pdid 533.jpg (https://v5.airtableusercontent.c...,Screen Shot 2025-03-05 at 4.05.44 PM.png (http...,2024-08-18,image,,It is an image of Taylor Swift dressed in a ty...,...,"Sharer: ""Imagine the group that says “Kamala h...","political_interference,human_rights","harm_reputation,harm_political_interference","human_interest,responsibility",unknown,Donald Trump,"Unclear, but demonstrates that Donald Trump is...",AI can be used but it is important that it mus...,x.com,2024.0
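One subtlety of the `!= 'unknown'` condition used for `df1`: a missing value compares as unequal to `'unknown'`, so rows with NaN in `external_verification` pass the filter, as do rows labelled `authenticity_disputed`. The value counts shown earlier sum to 975 of 976 rows, so at least one such missing row exists. The sketch below contrasts the two behaviours on toy data; the stricter `isin` variant is an assumption about what "verified" could mean, not the notebook's choice.

```python
import pandas as pd

# Toy labels standing in for the real 'external_verification' column.
toy = pd.DataFrame({
    "external_verification": [
        "verified_fake", "unknown", None, "authenticity_disputed", "verified_real",
    ],
})

# NaN != 'unknown' evaluates to True, so the missing row survives this filter.
loose = toy[toy["external_verification"] != "unknown"]

# Stricter alternative (an assumption): keep only explicit verdicts.
strict = toy[toy["external_verification"].isin(["verified_fake", "verified_real"])]

print(len(loose), len(strict))
```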


### 2. 2023-2024 analysis of deepfake images/videos, regardless of verification

In [14]:
df2 = df[
    (df['year'].isin([2023, 2024])) &
    (df['type'] == 'deepfake') &
    (df['file'].notna())
]

In [15]:
print(df2.shape)
display(df2)

(714, 59)


Unnamed: 0,incident_id,report_id,date_coded,url,file,screenshot,date_posted,format,transcript,summary_content,...,text_around_deepfake,harm_evidence,communication_goal,core_frame,hero,villain,plot,moral,domain,year
16,203,17,2024-02-12,https://twitter.com/DouglasLucas/status/171631...,Screenshot 2024-02-12 at 12.47.24 PM.png (http...,,2023-10-23,image,,"Two polaroid photographs, the one one the righ...",...,AI and the end of photographic truth? Deceptiv...,political_interference,education,human_interest,Biden,Putin,Putin and Biden hugging,U.S. and Russian leaders can get along,twitter.com,2023.0
18,186,18,2024-02-04,https://twitter.com/21WIRE/status/165300993699...,"""Patrick Henningsen on X_ _⭕️ Granted, this mu...",,2023-05-01,video,"""Today is today, and yesterday was today yeste...",Kamala Harris is speaking at a rally and her s...,...,"⭕️ Granted, this must be a deep fake, but rega...",non_identifiable,"satire,entertainment,harm_reputation",human_interest,A leader who sounds smart and capable during s...,Kamala Harris rambling and not making sense du...,"Kamala's speech does not make sense, which lea...",Kamala Harris's credibility as VP should be qu...,twitter.com,2023.0
20,187,19,2024-02-05,https://www.facebook.com/reel/894486844951526,deepfake 19.html (https://v5.airtableuserconte...,,2023-05-03,video,"“Today is today, and yesterday is today yester...",The sharer posted a video showing her reacting...,...,"""Our VP: 'I can only hit this bong 1 more time...",other,"satire,entertainment,harm_reputation",human_interest,People expect that a VP can communicate with c...,Online users believe that Harris is a rambler.,Kamala Harris is shown to be giving a speech t...,Kamala Harris is portrayed as being unable to ...,www.facebook.com,2023.0
22,188,20,2024-02-05,https://twitter.com/JebraFaushay/status/165937...,Deepfake 20.html (https://v5.airtableuserconte...,,2023-05-18,video,"""Excuse me, excuse me. I need everyone in the ...","The video shows Donald Trump, Mike Pence, and ...",...,Is this real? I feel like it might be a deep f...,other,"satire,entertainment,harm_reputation","conflict,human_interest",The freedom for Trump and his squad to dress i...,Republicans who want to ban drag.,"Republicans want to ban drag, but Donald Trump...",Republicans should stop calling for drag to be...,twitter.com,2023.0
31,201,25,2024-02-09,https://twitter.com/JackPosobiec/status/163383...,Screenshot 2024-02-09 134519.png (https://v5.a...,,2023-03-09,video,I'll tell you something the deepfake that Jack...,AOC announcing her opinions with regards to ho...,...,"""The deepfake that Jack Posobiec made of Presi...",political_interference,"entertainment,harm_political_interference,acti...","responsibility,conflict","Poso's supposters, those who share his ideology","AOC, and other related gov. officials who shar...",Biden and AOC do not understand how to priorit...,we must take a stand against their choices to ...,twitter.com,2023.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
971,1058,460,2025-03-06,https://x.com/PlanetOfMemes/status/18240700394...,460.png (https://v5.airtableusercontent.com/v3...,Screenshot 2025-03-05 at 8.45.11 PM.png (https...,2024-08-15,image,,"Donald Trump and Elon Musk are sitting, talkin...",...,"The caption says ""Let's do some weekly spaces""",non_identifiable,"boost_reputation,entertainment",human_interest,Donald Trump and Elon Musk,,Trump and Musk are friends,Donald Trump and Elon Musk are a good team,x.com,2024.0
972,1059,599,2025-03-06,https://x.com/KryptoKarma2024/status/182516139...,599.png (https://v5.airtableusercontent.com/v3...,Screenshot 2025-03-06 at 3.01.04 PM.png (https...,2024-08-18,image,,Donald Trump is riding a horse with eagle wing...,...,"The caption says ""@realDonaldTrump\n $RDT CTO ...",political_interference,"boost_reputation,harm_political_interference",human_interest,Donald Trump,Kamala Harris,The 2024 presidential election,Vote for Donald Trump,x.com,2024.0
973,1060,757,2025-03-06,https://x.com/Trump_History45/status/183139475...,pdid 757.jpg (https://v5.airtableusercontent.c...,Screen Shot 2025-03-06 at 3.17.43 PM.png (http...,2024-09-04,image,,It is a picture of Kamala Harris shaking hands...,...,A young Kamala Harris shaking hands with commu...,political_interference,"false_info,harm_political_interference,harm_re...","human_interest,responsibility",Donald Trump,Kamala Harris,Kamala Harris wants to create something like a...,Vote for a president who will keep government ...,x.com,2024.0
974,1061,758,2025-03-06,https://x.com/Trump_History45/status/182900803...,pdid 758.jpg (https://v5.airtableusercontent.c...,Screen Shot 2025-03-06 at 3.37.57 PM.png (http...,2024-08-29,image,,It is an image of Barack Obama's face in front...,...,Why is Barack Obama afraid to come out of the ...,"discrimination,human_rights","false_info,harm_reputation",human_interest,unclear,Barack Obama,Obama only legalized gay marriage for his own ...,Don't support someone who won't openly be hone...,x.com,2024.0


[Cont from Yas]
Let's just look at the _important_ content for now:

In [16]:
df1.columns

Index(['incident_id', 'report_id', 'date_coded', 'url', 'file', 'screenshot',
       'date_posted', 'format', 'transcript', 'summary_content', 'type',
       'comment_type', 'sharer_faketype', 'external_verification',
       'links_verification', 'evidence_fake', 'watermark', 'likes', 'views',
       'comments', 'shares', 'original_source_name', 'original_source_type',
       'original_source_actor_type', 'original_source_country', 'sharer_name',
       'sharer_type', 'sharer_job', 'high_impact_comment', 'sharer_country',
       'sharer_city', 'target_name', 'target_one_name', 'target_one_sentiment',
       'target_one_type_macro', 'target_one_type_micro', 'target_one_country',
       'target_two_name', 'target_two_sentiment', 'target_two_type_macro',
       'target_two_type_micro', 'target_two_country', 'target_type_macro',
       'target_type_micro', 'target_country', 'deepfake_content_depicts',
       'harm_depicted', 'social_policy_sector', 'context_deepfake',
       'text_around_d

Categorizing columns:

<h3>Important for analysis</h3>

Media:
* **Visual:** 'format', 'file'
* **Text:** 'transcript', 'summary_content', 'type'
* **Engagement metrics:** 'likes', 'views', 'comments', 'shares'

Personas: _who is the original source/sharer/target of the deepfake_
* **Source**: 'original_source_type', 'original_source_actor_type', 'original_source_country'
* **Sharer**: 'sharer_type', 'sharer_job', 'high_impact_comment', 'sharer_country', 'sharer_city'
* **Target**: 'target_name', 'target_one_name', 'target_one_sentiment', 'target_one_type_macro', 'target_one_type_micro', 'target_one_country', 'target_two_name', 'target_two_sentiment', 'target_two_type_macro', 'target_two_type_micro', 'target_two_country', 'target_type_macro', 'target_type_micro', 'target_country'

Context \& Content:
* **Content**: 'deepfake_content_depicts', 'harm_depicted', 'communication_goal', 'core_frame'
* **Real-world connection**: 'context_deepfake', 'text_around_deepfake', 'harm_evidence'
* **Politics/policy**: 'social_policy_sector', 'hero', 'villain', 'plot', 'moral'

<h3>Not as important for now</h3>

* **Descriptive**: 'year', 'domain'
* **Extra metadata**: 'url', 'screenshot'
* **Names of distributors**: 'original_source_name', 'sharer_name'
* **For verification**: 'external_verification', 'comment_type', 'links_verification', 'evidence_fake'
* **Misc**: 'incident_id', 'report_id', 'date_coded', 'date_posted', 'type' (just 'deepfake'), 'sharer_faketype', 'watermark'
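The categorization above can be kept in code so that later cells select columns by group rather than by retyping names. This is a minimal sketch with only a subset of the groups; the group keys, the `COLUMN_GROUPS` dict and the `select_groups` helper are hypothetical names introduced for illustration.

```python
import pandas as pd

# Column groups copied from the categorization above (subset shown for brevity).
# The group keys are illustrative labels, not names used in the dataset.
COLUMN_GROUPS = {
    "media": ["format", "file", "transcript", "summary_content"],
    "engagement": ["likes", "views", "comments", "shares"],
    "narrative": ["hero", "villain", "plot", "moral"],
}


def select_groups(frame, groups):
    """Return only the columns of the named groups that exist in the frame."""
    wanted = [col for name in groups for col in COLUMN_GROUPS[name]]
    present = [col for col in wanted if col in frame.columns]
    return frame[present]


# Toy frame standing in for df1/df2.
toy = pd.DataFrame({
    "format": ["image"],
    "likes": [10],
    "hero": ["Biden"],
    "url": ["https://x.com/a"],
})
subset = select_groups(toy, ["media", "engagement"])
print(subset.columns.tolist())
```

Skipping columns that are absent keeps the helper usable on both filtered frames even if one of them later loses a column.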

In [17]:
# Not sure if we're going to use df1 or df2 yet, depending on how complicated it is to process video
# So let's save both for now
df1.to_csv('data-2324-ver.csv', index=False)
df2.to_csv('data-2324-no-ver.csv', index=False)
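When either file is read back later, note that CSV does not store dtypes, so `date_posted` comes back as plain text unless it is parsed again. A small sketch using an in-memory stand-in for the saved file (the rows below are a toy excerpt, not the full export):

```python
import io

import pandas as pd

# In-memory stand-in for a file such as 'data-2324-ver.csv'.
csv_text = (
    "incident_id,date_posted,year\n"
    "203,2023-10-23,2023.0\n"
    "1002,2024-08-20,2024.0\n"
)

# parse_dates restores the datetime dtype that was lost when writing to CSV.
reloaded = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_posted"])
print(reloaded.dtypes)
```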