# READ ME

Read this notebook before you try out this project! It explains how the project works and contains the code needed to retrieve and process the correct data.

**Before you can start this project, you need the following tools:**
- Python 3.8 or higher
- Jupyter Notebook
- pip (Python package manager) or Anaconda (recommended)
- Git (optional, but recommended)
- An IDE (optional, but recommended)
- A Google account (for the Google API)
- A DeepL account (for the DeepL API)
- A YouTube account (for the YouTube API). You also need the right access permissions!

**Notes**:
- The data is NOT tracked on GitHub for privacy reasons. It is stored locally in the `data` folder.
- The data is stored in Excel format, which you can open with Excel or Google Sheets.
- The Python code and comments are written in English (easier for documentation). The Markdown cells were originally written in Dutch and contain the interpretation of the results.
- The interpretation is always a snapshot! If the data changes, the interpretation must be revised as well.

In [None]:
# Import libraries
import importlib
import numpy as np
from dotenv import load_dotenv

In [None]:
# Import .env variables
# You will have to create a .env file in the root of this project with the following variables:
# YT_API_KEY=xxxxx
# YT_CHANNEL_ID=xxxxx
# CLIENT_ID=xxxxx
# CLIENT_SECRET=xxxxx
# PROJECT_ID=xxxxx
# DEEPL_API_KEY=xxxxx
load_dotenv()
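
Every later cell depends on these credentials, so it can help to fail early when one is missing. The sketch below is not part of the project's `lib` package: `REQUIRED_VARS` and `missing_env_vars` are hypothetical names, and the list simply mirrors the variables named in the comment above.

```python
import os

# Variable names copied from the .env template in the cell above.
REQUIRED_VARS = [
    "YT_API_KEY",
    "YT_CHANNEL_ID",
    "CLIENT_ID",
    "CLIENT_SECRET",
    "PROJECT_ID",
    "DEEPL_API_KEY",
]


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Hypothetical check to run right after load_dotenv().
missing = missing_env_vars()
if missing:
    print("Missing .env variables:", ", ".join(missing))
```

Passing a plain dictionary as `environ` makes the check easy to try without touching the real environment.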

In [None]:
# Import custom libraries
from lib import youtube, helpers

# Reload the custom libraries so that edits to them are picked up when this cell is re-run
importlib.reload(youtube)
importlib.reload(helpers);

## Data Collection

### Substudy 1, analysis 1

In [None]:
# Get the ids of all videos published in a specific date range (2019 through 2024)
video_ids = youtube.get_video_ids_in_period('2019-01-01', '2024-12-31')
len(video_ids)

In [None]:
# Get video data based on the video ids
# These are things like title, description, channel name, publish date, etc.
videos = youtube.get_generic_info(np.array(video_ids))

In [None]:
# Get metrics for each video and over multiple time intervals
# These are things like views, likes, comments, etc.
metrics = youtube.get_metrics_over_time(videos)

In [None]:
# Merge the two dataframes and remove/rename duplicate columns
videos = videos.merge(metrics, how='inner', on='id').set_index('id')
videos.drop(columns=['publish_date_y'], inplace=True)
videos.rename(columns={'publish_date_x': 'publish_date'}, inplace=True)

In [None]:
# Save data to Excel file
videos.to_excel('../data/videos.xlsx')
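
Because the `data` folder is kept out of Git, it may not exist in a fresh clone, and writing an Excel file into a missing folder fails. A minimal sketch of a guard, using only the standard library; `ensure_parent_dir` is a hypothetical helper, not something provided by `lib`.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent folder of `file_path` if needed and return it as a Path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a temporary folder, so nothing is written to the project itself.
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "data" / "videos.xlsx")
    print(target.parent.is_dir())
```

In the notebook this could be used as `videos.to_excel(ensure_parent_dir('../data/videos.xlsx'))`.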

### Substudy 1, analysis 2

In [None]:
# Get comments for each video (from the previous step)
comments = youtube.get_video_comments(videos)
len(comments)

In [None]:
# Anonymize comments (hide user handles)
comments = helpers.anonymize_comments(comments)
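
The implementation of `helpers.anonymize_comments` is not shown in this notebook. As a rough illustration of what hiding handles could involve, the sketch below replaces `@handle` mentions with a placeholder and maps author names to stable pseudonyms with a salted hash. All names here (`mask_handles`, `pseudonymize`, the handle pattern, the salt) are hypothetical, and the actual helper may work differently.

```python
import hashlib
import re

# Assumption: a handle is "@" followed by word characters, dots or hyphens.
HANDLE_PATTERN = re.compile(r"@[\w.\-]+")


def mask_handles(text, placeholder="@user"):
    """Replace every @handle mention in `text` with a fixed placeholder."""
    return HANDLE_PATTERN.sub(placeholder, text)


def pseudonymize(author, salt):
    """Map an author name to a stable pseudonym derived from a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + author).encode("utf-8")).hexdigest()
    return "user_" + digest[:10]


masked = mask_handles("Thanks @jan.peeters-1!")
alias = pseudonymize("Jan Peeters", salt="keep-this-secret")
```

The same author always receives the same pseudonym for a given salt, so comments by one person stay linkable without exposing the name. The salt should stay out of version control, for example in the `.env` file.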

In [None]:
# Translate comments to English using Google Translate API
# Please note this will take a long time!
comments_nl = comments['comment_nl'].tolist()
comments_google = [helpers.translate_with_google(comment) for comment in comments_nl]
comments['en_google'] = comments_google

In [None]:
# Translate comments to English using DeepL API
# This will also take a long time (but less than Google)
comments_nl = comments['comment_nl'].tolist()
comments_deepl = [helpers.translate_with_deepl(comment) for comment in comments_nl]
comments['en_deepl'] = comments_deepl
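
Both translation cells call the API once per comment, so identical comments are translated repeatedly. A minimal sketch, assuming the translation of a given text does not change between calls: cache results by text so each distinct comment is sent only once. `translate_unique` is a hypothetical helper, and `fake_translate` merely stands in for `helpers.translate_with_google` or `helpers.translate_with_deepl`.

```python
def translate_unique(texts, translate_fn):
    """Translate `texts` in order, calling `translate_fn` once per distinct text."""
    cache = {}
    results = []
    for text in texts:
        if text not in cache:
            cache[text] = translate_fn(text)
        results.append(cache[text])
    return results


# Stand-in translator for demonstration: it records each call and upper-cases the text.
calls = []


def fake_translate(text):
    calls.append(text)
    return text.upper()


translated = translate_unique(["mooi", "top", "mooi"], fake_translate)
```

In the notebook this could be used as `comments['en_deepl'] = translate_unique(comments_nl, helpers.translate_with_deepl)`. How much time it saves depends on how many comments are exact duplicates.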

In [None]:
# Save data to Excel file
comments.to_excel('../data/comments.xlsx')

### Substudy 2

In [None]:
# Define video_ids for the experiment
video_ids = np.array([
    'GYtUhykvOos', 'PkmUT16Um_0', 'JxWT-zYtcGg', 'rJbiY3S69ek',
    'ewUjvz3nDj4', 'BDq4yJCRFcE', 'ih7RQ5lFwIY', 'Wm-Yk5bK_fk',
    '4efyusOrx14', 'PUztndRNSU8', '1QsVq3vlZsk',
])

In [None]:
# Get information about each video and calculate the corresponding metrics
exp_videos = youtube.get_generic_info(video_ids)
exp_metrics = youtube.get_metrics_over_time(exp_videos)
exp_videos = exp_videos.merge(exp_metrics, how='inner', on='id').set_index('id')
exp_videos.drop(columns=['publish_date_y'], inplace=True)
exp_videos.rename(columns={'publish_date_x': 'publish_date'}, inplace=True)

In [None]:
# Add manual annotations to the data
exp_videos['has_CTA'] = np.array([False, True, True, False, True, True, False, False, True, False, True])
exp_videos['is_beta'] = np.array([None, False, False, None, True, True, None, None, False, None, True])
exp_videos['university'] = np.array(['KUL', 'VUB', 'UA', 'VUB', 'UG', 'UG', 'UA', 'UG', 'VUB', 'UG', 'UA'])
exp_videos['gender'] = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1]) # F=0, M=1
exp_videos['has_ambassador'] = np.array([True, True, False, False, False, False, False, False, False, False, False])
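
The annotation arrays above are matched to rows by position, so they would silently misalign if `exp_videos` came back in a different order than `video_ids`. A more defensive sketch keys each annotation by video id. The helper `annotation_column` is hypothetical, and the three entries shown are the first three ids with their `has_CTA` values, assuming the arrays were written in the same order as `video_ids`.

```python
# Annotations keyed by video id instead of by row position.
has_cta_by_id = {
    "GYtUhykvOos": False,
    "PkmUT16Um_0": True,
    "JxWT-zYtcGg": True,
}


def annotation_column(index_ids, values_by_id):
    """Return annotation values in the order of `index_ids`; fail on missing ids."""
    missing = [vid for vid in index_ids if vid not in values_by_id]
    if missing:
        raise KeyError(f"No annotation for video ids: {missing}")
    return [values_by_id[vid] for vid in index_ids]


# The row order differs from the dictionary order, yet values still follow their ids.
shuffled_ids = ["JxWT-zYtcGg", "GYtUhykvOos", "PkmUT16Um_0"]
column = annotation_column(shuffled_ids, has_cta_by_id)
```

With the full dictionary in place, the assignment could look like `exp_videos['has_CTA'] = annotation_column(exp_videos.index, has_cta_by_id)`. A missing id then raises an error instead of shifting every following value by one row.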

In [None]:
# Save data to Excel file
exp_videos.to_excel('../data/experiment_videos.xlsx')