# Applied Digital Citizen Science
## Session 8

DISCLOSURE: Parts of the code for this notebook were created with GitHub Copilot and are disclosed as such. All code has been tested by the lecturers.

See Canvas for details about the use of Generative AI in assignments.

## Objectives for Session 8

The code here will be relevant for cleaning, aggregating and formatting the data that you will use for your written report. 

The sections of code in this notebook are self-contained, and you may not need every section for your own analysis. The objectives of this session are:

1. Combining the raw dataset (generated with the code from Session 6) with self-reports
2. Combining the raw dataset (generated with the code from Session 6) with the output of YouTube Data Tools
3. Aggregating the dataset at the correct level depending on your RQ

## 1. Combining the donated dataset with self-reports

For these steps, you first need to:
1. Run the steps in Session 6 that generate the consolidated datasets (for example on watch or search)
2. Download the data from Qualtrics as Excel
3. Make sure all datasets are in the same folder as the script you are running

The example below uses the watch history. You can simply load another dataset (e.g., search) and run the same steps.

In [None]:
import pandas as pd

In [None]:
donated_data = pd.read_excel('Watch.xlsx')

In [None]:
donated_data

In [None]:
self_reports = pd.read_excel('ADCS-2025-demo2_September+21,+2025_17.03.xlsx', skiprows=[1])

In [None]:
self_reports

In [None]:
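def extract_participant_id(text):
    # The participant id is the third underscore-separated part of the 'participant' field
    # Missing or non-text values (e.g., NaN) cannot be split, so return None for them
    if not isinstance(text, str):
        return None
    parts = text.split('_')
    if len(parts) > 2:
        return parts[2]
    return None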

In [None]:
self_reports['participantid'] = self_reports['participant'].apply(extract_participant_id)

In [None]:
self_reports

In [None]:
donated_data['participantid'].value_counts()

In [None]:
self_reports['participantid'].value_counts()
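
Comparing the two `value_counts()` outputs by eye gets hard with many participants. As a minimal sketch, the ids in each dataset can also be compared as sets. The toy data frames below are hypothetical stand-ins for `donated_data` and `self_reports`; the ids are made up.

```python
import pandas as pd

# Toy stand-ins for donated_data and self_reports (hypothetical ids)
donated_toy = pd.DataFrame({'participantid': ['p01', 'p01', 'p02', 'p03']})
reports_toy = pd.DataFrame({'participantid': ['p01', 'p02', 'p04']})

# Unique, non-missing ids in each dataset
donated_ids = set(donated_toy['participantid'].dropna())
report_ids = set(reports_toy['participantid'].dropna())

# Ids that appear in only one dataset, and ids that appear in both
only_in_donated = sorted(donated_ids - report_ids)
only_in_reports = sorted(report_ids - donated_ids)
in_both = sorted(donated_ids & report_ids)

print('Only in donated data:', only_in_donated)
print('Only in self-reports:', only_in_reports)
print('In both:', in_both)
```

With your own data, you would replace the toy data frames with `donated_data` and `self_reports`.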

### Options for merging the data

- We could merge the data at the video level (i.e., each participant appears 20 times in the dataset)
- We could merge the data at participant level (i.e., each participant appears once in the dataset)

Which option to choose depends on your RQ.

For now, we will show examples of both: at the video level and at the participant level.
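
To make the difference between the two levels concrete, here is a small sketch on toy data. All values, and the `age` column, are hypothetical. The only point is the number of rows in each result.

```python
import pandas as pd

# Toy donated data: one row per watched video (hypothetical values)
donated_toy = pd.DataFrame({
    'participantid': ['p01', 'p01', 'p01', 'p02', 'p02'],
    'Link': ['a', 'b', 'c', 'd', 'e'],
})
# Toy self-reports: one row per participant ('age' is a hypothetical column)
reports_toy = pd.DataFrame({'participantid': ['p01', 'p02'], 'age': [21, 34]})

# Video level: the self-report answers are repeated on every video row
video_level = pd.merge(donated_toy, reports_toy, on='participantid', how='left')

# Participant level: the videos are first summarised to one row per participant
counts = donated_toy.groupby('participantid')['Link'].count().reset_index(name='videos_watched')
participant_level = pd.merge(reports_toy, counts, on='participantid', how='left')

print(video_level.shape)        # one row per video
print(participant_level.shape)  # one row per participant
```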

### Merging the data at the video level

In [None]:
dataset_video_level = pd.merge(donated_data, self_reports, on='participantid', how='left')

In [None]:
dataset_video_level

In [None]:
dataset_video_level['participantid'].value_counts()

In [None]:
self_reports['participantid'].value_counts()

For discussion: Do we have the same participants in both? If not, why?

### Merging at the participant level

For this dataset, each participant will appear only once. Since each participant appears multiple times in the videos dataset, we need to decide how to aggregate the videos dataset at the participant level.

In this example, I will use the number of videos watched.

In [None]:
videos_watched = pd.pivot_table(donated_data, index='participantid', values='Link', aggfunc='count').reset_index()

In [None]:
videos_watched

For discussion: Why do I only have 19 videos, if each participant appears 20 times in the donated dataset?
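
One thing to keep in mind for this question: `aggfunc='count'` counts only the non-missing values in `Link`, not the number of rows. Whether that explains the result in your data is something to check. The sketch below shows the difference on toy data with hypothetical values.

```python
import pandas as pd
import numpy as np

# Toy donated data: participant p01 has one row without a link (hypothetical values)
toy = pd.DataFrame({
    'participantid': ['p01', 'p01', 'p01', 'p02', 'p02'],
    'Link': ['https://www.youtube.com/watch?v=AAAAAAAAAAA', np.nan,
             'https://www.youtube.com/watch?v=BBBBBBBBBBB',
             'https://www.youtube.com/watch?v=CCCCCCCCCCC',
             'https://www.youtube.com/watch?v=DDDDDDDDDDD'],
})

# 'count' ignores rows where Link is missing
non_missing = pd.pivot_table(toy, index='participantid', values='Link', aggfunc='count').reset_index()

# size() counts all rows, whether or not Link is missing
all_rows = toy.groupby('participantid').size().reset_index(name='rows')

print(non_missing)
print(all_rows)
```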

In [None]:
dataset_participant_level = pd.merge(self_reports, videos_watched, on='participantid', how='left').rename(columns={'Link': 'videos_watched'})

In [None]:
dataset_participant_level

In [None]:
dataset_participant_level.isna().sum()

In [None]:
dataset_participant_level['videos_watched'].describe()

For discussion: Why do some participants have a missing value in videos watched?
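
How to handle these missing values is an analytical choice. One option, sketched below on toy data, is to keep a flag for whether donated data exists and to fill the count with 0. This assumes that a missing value means the participant has no rows in the donated dataset; if the ids simply did not match, 0 would be misleading. The column names `donated` and `videos_watched_filled` are hypothetical.

```python
import pandas as pd
import numpy as np

# Toy participant-level dataset (hypothetical values)
toy = pd.DataFrame({
    'participantid': ['p01', 'p02', 'p04'],
    'videos_watched': [19.0, 20.0, np.nan],
})

# Flag whether a participant has any donated data before changing anything
toy['donated'] = toy['videos_watched'].notna()

# Assumption: missing means "no donated videos", so the count becomes 0
toy['videos_watched_filled'] = toy['videos_watched'].fillna(0).astype(int)

print(toy)
```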

### Exporting the datasets

After completing the work, you may want to export the appropriate datasets. The example below exports dataset_participant_level.

In [None]:
dataset_participant_level.to_excel('dataset_participant_level.xlsx', index=False)

## 2. Combining the donated dataset with YouTube Data Tools

For these steps, you first need to:
1. Run the steps in Session 6 that generate the consolidated datasets (for example on watch or search)
2. Make sure all datasets are in the same folder as the script you are running

The example below uses the watch history. 

In [None]:
import pandas as pd

In [None]:
watch_history = pd.read_excel('Watch.xlsx')

In [None]:
watch_history

We now need to extract the video ids from the YouTube links. Recall the code we saw in Session 4.

(Disclosure: the code below was created with the help of GitHub Copilot and was updated for this session.)

In [None]:
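def extract_youtube_video_id(url):
    # Convert to text so that missing values (NaN) do not raise an error
    url = str(url)
    if 'youtube.com/watch?v=' in url:
        # Standard link: the id follows 'v=' and ends at the next '&'
        return url.split('v=')[1].split('&')[0]
    elif 'youtu.be/' in url:
        # Short link: the id follows 'youtu.be/' and ends at the next '?'
        return url.split('youtu.be/')[1].split('?')[0]
    else:
        # Anything else (other pages, missing values) has no video id
        return None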
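
As an alternative sketch (not the version used in this notebook), the same idea can be written with Python's standard `urllib.parse` module, which splits a URL into its parts instead of relying on string splitting. The function name `extract_video_id_urllib` and the example ids are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id_urllib(url):
    # Hypothetical alternative to extract_youtube_video_id
    parsed = urlparse(str(url))
    host = parsed.netloc.lower()
    if host.endswith('youtube.com') and parsed.path == '/watch':
        # parse_qs returns a dict of lists, e.g. {'v': ['AAAAAAAAAAA'], 't': ['10s']}
        values = parse_qs(parsed.query).get('v')
        return values[0] if values else None
    if host.endswith('youtu.be'):
        # For short links, the id is the first part of the path
        video_id = parsed.path.lstrip('/').split('/')[0]
        return video_id or None
    # Other pages and missing values have no video id
    return None

examples = [
    'https://www.youtube.com/watch?v=AAAAAAAAAAA&t=10s',
    'https://www.youtube.com/watch?t=10s&v=AAAAAAAAAAA',
    'https://youtu.be/BBBBBBBBBBB?si=xyz',
    'https://www.youtube.com/',
    float('nan'),
]
results = [extract_video_id_urllib(u) for u in examples]
print(results)
```

Either version can be used with `.apply()`; the notebook continues with `extract_youtube_video_id`.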


In [None]:
watch_history['videoid'] = watch_history['Link'].apply(extract_youtube_video_id)

In [None]:
watch_history['videoid']

Now I will print the video ids in a format that I can copy and paste into the YouTube Data Tools interface.

In [None]:
# Skip missing ids (links without a video id) and print each unique id once
for videoid in watch_history['videoid'].dropna().unique():
    print(videoid + ',')
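
The loop above prints one id per line. If a single comma-separated line is easier to copy, the ids can also be joined into one string, as in the sketch below. It uses a toy series with hypothetical ids in place of `watch_history['videoid']`, and it assumes the interface accepts ids separated by commas on one line, which you should check before relying on it.

```python
import pandas as pd

# Toy stand-in for watch_history['videoid'] (hypothetical ids, one missing)
videoids = pd.Series(['AAAAAAAAAAA', 'BBBBBBBBBBB', None, 'AAAAAAAAAAA'])

# Drop missing values, keep each id once, and join with commas
unique_ids = videoids.dropna().unique()
id_string = ','.join(unique_ids)

print(id_string)
```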

I will now go to YouTube Data Tools and use these ids to generate a report with information about each of these videos (using the Video List module). After this is done, I will download the report from YouTube Data Tools to my own computer and save the file in the same folder as this script and the other relevant files.

In [None]:
youtube_report = pd.read_csv('videolist_seeds19_2025_09_21-16_11_44.csv')

In [None]:
youtube_report

### Merging both datasets

In [None]:
watch_history_details = pd.merge(watch_history, youtube_report, left_on='videoid', right_on='videoId', how='left')

In [None]:
watch_history_details
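
After a left merge like the one above, it can be worth checking two things: that the number of rows did not change, and how many videos found no match in the report. The sketch below does this on toy data. The ids and the `videoTitle` column are assumptions for illustration; `validate='many_to_one'` is a `pd.merge` option that raises an error if a `videoId` appears more than once in the report.

```python
import pandas as pd

# Toy watch history: one row per watched video (hypothetical values)
watch_toy = pd.DataFrame({
    'participantid': ['p01', 'p01', 'p02', 'p02'],
    'videoid': ['AAAAAAAAAAA', 'BBBBBBBBBBB', 'AAAAAAAAAAA', None],
})
# Toy report: one row per video ('videoTitle' is an assumed column name)
report_toy = pd.DataFrame({
    'videoId': ['AAAAAAAAAAA'],
    'videoTitle': ['Example title'],
})

# validate='many_to_one' fails loudly if the report has duplicate video ids
details_toy = pd.merge(watch_toy, report_toy, left_on='videoid', right_on='videoId',
                       how='left', validate='many_to_one')

# A left merge on unique report ids should keep the number of rows unchanged
rows_unchanged = len(details_toy) == len(watch_toy)

# Rows without a match have a missing videoId after the merge
unmatched = int(details_toy['videoId'].isna().sum())

print(rows_unchanged, unmatched)
```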

### Exporting the datasets

After completing the work, you may want to export the appropriate datasets. The example below exports watch_history_details.

In [None]:
watch_history_details.to_excel('watch_history_details.xlsx', index=False)