# Applied Digital Citizen Science
## Session 4

DISCLOSURE: Parts of the code for this notebook were created with GitHub CoPilot, and disclosed as such. All code has been tested by the lecturers.


**GENERATIVE AI GUIDELINES:**
1. For your assignments, you are allowed to use generative AI (e.g., CoPilot, ChatGPT, or UvA AI Chat) to create code. 
2. You are allowed to upload the (persona) data to the UvA AI Chat. You are **NOT allowed** to upload data to external generative AI services (e.g., CoPilot, ChatGPT or any other generative AI services).
3. You are responsible for testing all the code that you use, and are ultimately responsible for its functioning.
4. You must disclose when you use Generative AI for coding, by including a Markdown note before *every* cell that uses Generative AI (so multiple times in the same notebook, if needed).
5. You **cannot** use any type of Generative AI (including UvA AI Chat) in any written part of any assignments (except for the Python code) in this course.

## Starting up - Recap of the previous session

Challenges from the previous week:
1. Run Python in VS Code
2. Create a new Jupyter notebook
3. Run code to calculate something
4. Print some text in your notebook
5. Create a Markdown cell and write text with bold and italics

*(We will do this together in class)*


## Objectives for Session 4

1. Load the browser history data from your persona
2. Generate a report of the most frequently accessed domains
3. Convert the timestamps ("time_usec") into meaningful fields
4. If time allows: extract one video id from a YouTube URL and connect it to a YouTube Data Tool Report (i.e., data linkage)

### 1. Loading required libraries

Python is a general-purpose programming language, which means that in principle we can build whatever program we want with it. That, of course, is not always efficient: imagine having to program everything, even how to run a correlation, from scratch. This brings us to one of its advantages. As with R, an active community of (open source) developers and volunteers creates Python "extensions" (called libraries) that let us reuse code written by others.

We will use a few of these libraries in this course, and will introduce them in due time. The first one we will use is Pandas, a data analysis library. For more information, see its [documentation](https://pandas.pydata.org/).

To import a library, we use the command "import" followed by the library name. Usually we also provide an abbreviation for the library name, to save typing. Below we import Pandas and refer to it as "pd" from now on. We also import the library json, which helps us handle JSON files. Pandas itself can read many JSON files, but not all of them.

In [None]:
import pandas as pd

import json

The code below uses Python's built-in function open to open the file, and a function from the json library (load) to read its contents. The result is stored in a variable called "data".

*Please note that this is from a sample user, not a real person*

In [36]:
with open('Takeout 2/Chrome/History.json', encoding='utf-8') as f:
    data = json.load(f)

The code below shows which keys the data has.

In [37]:
data.keys()

dict_keys(['Browser History', 'Typed Url', 'Session', 'Shared Tab Group'])
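
Before looking inside a single key, it can help to check what kind of container sits behind each key and how many entries it holds. The sketch below does this on a tiny stand-in dictionary (`toy` and `raw` are illustrative assumptions, not the real Takeout file). The same loop should work on `data`, assuming every value supports `len`, as the lists in the sample file do.

```python
import json

# Toy stand-in for the Takeout file; the real History.json has more keys and fields.
raw = '{"Browser History": [{"url": "https://takeout.google.com/", "time_usec": 1757361757989133}], "Typed Url": []}'
toy = json.loads(raw)

# For each top-level key, record the container type and the number of entries.
summary = {key: (type(value).__name__, len(value)) for key, value in toy.items()}

for key, (kind, size) in summary.items():
    print(f"{key}: {kind} with {size} entries")
```

A key that holds an empty list (like "Typed Url" in this toy example) is a hint that there is nothing to analyse there.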

We want to know what is inside "Browser History", so we will take a brief look at its first entry.

In [66]:
data['Browser History'][:1]

[{'favicon_url': 'https://www.google.com/favicon.ico',
  'page_transition_qualifier': 'CLIENT_REDIRECT',
  'title': 'Google Takeout',
  'url': 'https://takeout.google.com/',
  'time_usec': 1757361757989133,
  'client_id': '0B3RVjuLE/l8vhvu+ARWeQ=='}]

Now we convert only the Browser History data into a dataframe.

In [39]:
history = pd.DataFrame(data['Browser History'])
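
To see what this conversion does, here is a minimal sketch on two hand-written records (the `records` list is an illustrative assumption, and the second record leaves out "time_usec" on purpose). Each dictionary becomes one row, each key becomes a column, and a key that is missing from a record shows up as a missing value.

```python
import pandas as pd

# Two toy records shaped like entries of data['Browser History'].
records = [
    {"title": "Google Takeout", "url": "https://takeout.google.com/", "time_usec": 1757361757989133},
    {"title": "Google Takeout", "url": "https://takeout.google.com/manage"},  # no time_usec on purpose
]

toy = pd.DataFrame(records)

print(toy.shape)                        # (rows, columns)
print(toy.columns.tolist())             # column names come from the dictionary keys
print(toy["time_usec"].isna().tolist())  # True marks the record without a timestamp
```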

In [65]:
history[:3]

Unnamed: 0,favicon_url,page_transition_qualifier,title,url,time_usec,client_id
0,https://www.google.com/favicon.ico,CLIENT_REDIRECT,Google Takeout,https://takeout.google.com/,1757361757989133,0B3RVjuLE/l8vhvu+ARWeQ==
1,https://www.google.com/favicon.ico,CLIENT_REDIRECT,Google Takeout,https://takeout.google.com/manage,1757361755710235,0B3RVjuLE/l8vhvu+ARWeQ==
2,https://www.google.com/favicon.ico,CLIENT_REDIRECT,Google Takeout,https://takeout.google.com/,1757361749586885,0B3RVjuLE/l8vhvu+ARWeQ==


In [41]:
len(history)

46

The raw history is interesting, but not yet particularly useful. Let's see a few functions that we can use to make it more useful.

First, let's extract the domain.

In [42]:
# If you don't have tldextract installed, run the command below to install it into the notebook's Python environment.
%pip install tldextract



In [43]:
import tldextract

def extract_domain(url):
    """Return the registered domain name without subdomain or suffix (e.g. 'google')."""
    result = tldextract.extract(url)
    return result.domain

The code below creates a new column called "domain" in the dataframe history, by applying the function "extract_domain" to the column "url".

In [44]:
history['domain'] = history['url'].apply(extract_domain)
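
Why use a dedicated library instead of splitting the URL ourselves? The sketch below uses only Python's standard library and a hypothetical helper, `naive_domain`, that takes the second-to-last part of the hostname. It works for ".com" addresses but breaks for multi-part suffixes such as ".co.uk". tldextract relies on the Public Suffix List to handle such cases.

```python
from urllib.parse import urlparse

def naive_domain(url):
    """Hypothetical helper: return the second-to-last label of the hostname."""
    hostname = urlparse(url).hostname or ""
    labels = hostname.split(".")
    return labels[-2] if len(labels) >= 2 else hostname

# The full hostname still contains the subdomain.
print(urlparse("https://takeout.google.com/manage").hostname)  # takeout.google.com

# The naive split works here...
print(naive_domain("https://takeout.google.com/manage"))  # google

# ...but picks the wrong label for a multi-part suffix.
print(naive_domain("https://www.bbc.co.uk/news"))  # co
```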

In [64]:
history[:3]

Unnamed: 0,favicon_url,page_transition_qualifier,title,url,time_usec,client_id,domain
0,https://www.google.com/favicon.ico,CLIENT_REDIRECT,Google Takeout,https://takeout.google.com/,1757361757989133,0B3RVjuLE/l8vhvu+ARWeQ==,google
1,https://www.google.com/favicon.ico,CLIENT_REDIRECT,Google Takeout,https://takeout.google.com/manage,1757361755710235,0B3RVjuLE/l8vhvu+ARWeQ==,google
2,https://www.google.com/favicon.ico,CLIENT_REDIRECT,Google Takeout,https://takeout.google.com/,1757361749586885,0B3RVjuLE/l8vhvu+ARWeQ==,google


Let's check the top domains.

In [46]:
history['domain'].value_counts()

domain
google         21
youtube        16
scistarter      4
wikipedia       3
urbanreleaf     1
ucdavis         1
Name: count, dtype: int64
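
For a report, it is often handy to keep only the top few domains and to add their share of all visits. The sketch below rebuilds a toy "domain" column from the counts printed above (the names `counts`, `domains`, `visits`, `share_pct` and `report` are illustrative assumptions) and turns it into a small table.

```python
import pandas as pd

# Rebuild a toy "domain" column from the counts printed above (46 rows in total).
counts = {"google": 21, "youtube": 16, "scistarter": 4,
          "wikipedia": 3, "urbanreleaf": 1, "ucdavis": 1}
domains = pd.Series([d for d, n in counts.items() for _ in range(n)], name="domain")

visits = domains.value_counts()                     # sorted from most to least frequent
share_pct = (visits / visits.sum() * 100).round(1)  # percentage of all visits

# Combine both into one small report and keep the three most visited domains.
report = pd.DataFrame({"visits": visits, "share_pct": share_pct}).head(3)
print(report)
```

On the notebook's own dataframe, the starting point would be `history['domain'].value_counts()` instead of the toy `domains` series.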

Let's now see how we can make the time_usec column useful.

In [47]:
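from datetime import datetime, timedelta, timezone

def filetime_to_datetime(microseconds: int) -> datetime:
    # Despite the function name, time_usec is not treated as a Windows FILETIME:
    # the values are read as microseconds since the Unix epoch (1970-01-01 UTC).
    epoch_start = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch_start + timedelta(microseconds=microseconds)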

In [48]:
history['date_time'] = history['time_usec'].apply(filetime_to_datetime)

In [49]:
history[:3]

Unnamed: 0,favicon_url,page_transition_qualifier,title,url,time_usec,client_id,domain,date_time
0,https://www.google.com/favicon.ico,CLIENT_REDIRECT,Google Takeout,https://takeout.google.com/,1757361757989133,0B3RVjuLE/l8vhvu+ARWeQ==,google,2025-09-08 20:02:37.989133+00:00
1,https://www.google.com/favicon.ico,CLIENT_REDIRECT,Google Takeout,https://takeout.google.com/manage,1757361755710235,0B3RVjuLE/l8vhvu+ARWeQ==,google,2025-09-08 20:02:35.710235+00:00
2,https://www.google.com/favicon.ico,CLIENT_REDIRECT,Google Takeout,https://takeout.google.com/,1757361749586885,0B3RVjuLE/l8vhvu+ARWeQ==,google,2025-09-08 20:02:29.586885+00:00
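
Once the timestamps are real datetimes, we can split them into fields that are easier to analyse, such as the date, the hour, and the day of the week. The sketch below is one possible way to do this, assuming the values are microseconds since 1970-01-01 UTC. It uses a toy dataframe with the three timestamps shown above and Pandas' own `pd.to_datetime` as an alternative to the row-by-row function; the column names "date", "hour" and "weekday" are illustrative choices.

```python
import pandas as pd

# Toy dataframe with the three time_usec values shown above.
toy = pd.DataFrame({"time_usec": [1757361757989133, 1757361755710235, 1757361749586885]})

# Convert the whole column at once; unit="us" means microseconds since 1970-01-01.
toy["date_time"] = pd.to_datetime(toy["time_usec"], unit="us", utc=True)

# Derive fields that are easier to group and count.
toy["date"] = toy["date_time"].dt.date
toy["hour"] = toy["date_time"].dt.hour
toy["weekday"] = toy["date_time"].dt.day_name()

print(toy[["date", "hour", "weekday"]])
```

Note that these fields are in UTC. If the local time of the persona matters for your analysis, the column can be shifted first with `.dt.tz_convert(...)` and a time zone name of your choice.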


Let's extract the ids of YouTube videos in the history.

In [50]:
history[history['domain']=='youtube']

Unnamed: 0,favicon_url,page_transition_qualifier,title,url,time_usec,client_id,domain,date_time
3,https://www.youtube.com/s/desktop/49e708f0/img...,CLIENT_REDIRECT,"Latto, Ice Spice - GYATT (Official Music Video...",https://www.youtube.com/watch?v=h1SdotpjkTU&li...,1757361745075476,0B3RVjuLE/l8vhvu+ARWeQ==,youtube,2025-09-08 20:02:25.075476+00:00
4,https://www.youtube.com/s/desktop/49e708f0/img...,CLIENT_REDIRECT,"Shaboozey, Stephen Wilson Jr. - Took A Walk (f...",https://www.youtube.com/watch?v=xdI_3GdLt8g&li...,1757361742206189,0B3RVjuLE/l8vhvu+ARWeQ==,youtube,2025-09-08 20:02:22.206189+00:00
5,https://www.youtube.com/s/desktop/49e708f0/img...,CLIENT_REDIRECT,Lady Gaga - The Dead Dance (Official Music Vid...,https://www.youtube.com/watch?v=xGaZBfJOyAc&li...,1757361739942513,0B3RVjuLE/l8vhvu+ARWeQ==,youtube,2025-09-08 20:02:19.942513+00:00
6,https://www.youtube.com/s/desktop/49e708f0/img...,CLIENT_REDIRECT,Justin Bieber - LOVE SONG (Audio) - YouTube,https://www.youtube.com/watch?v=BEsAhEAjC8Y&li...,1757361737374452,0B3RVjuLE/l8vhvu+ARWeQ==,youtube,2025-09-08 20:02:17.374452+00:00
7,https://www.youtube.com/s/desktop/49e708f0/img...,CLIENT_REDIRECT,Echo Station | Deep Sci-Fi Outpost Ambience fo...,https://www.youtube.com/playlist?list=RDCLAK5u...,1757361736578717,0B3RVjuLE/l8vhvu+ARWeQ==,youtube,2025-09-08 20:02:16.578717+00:00
8,https://www.youtube.com/s/desktop/49e708f0/img...,CLIENT_REDIRECT,The Music Channel - YouTube,https://www.youtube.com/channel/UC-9-kyTW8ZkZN...,1757361729056287,0B3RVjuLE/l8vhvu+ARWeQ==,youtube,2025-09-08 20:02:09.056287+00:00
9,https://www.youtube.com/s/desktop/49e708f0/img...,CLIENT_REDIRECT,YouTube,https://www.youtube.com/,1757361724449645,0B3RVjuLE/l8vhvu+ARWeQ==,youtube,2025-09-08 20:02:04.449645+00:00
10,https://www.youtube.com/s/desktop/49e708f0/img...,CLIENT_REDIRECT,Echo Station | Deep Sci-Fi Outpost Ambience fo...,https://www.youtube.com/watch?v=1fDP9T9Cagg,1757361716469736,0B3RVjuLE/l8vhvu+ARWeQ==,youtube,2025-09-08 20:01:56.469736+00:00
11,https://www.youtube.com/s/desktop/49e708f0/img...,CLIENT_REDIRECT,Gaming - YouTube,https://www.youtube.com/gaming,1757361709456376,0B3RVjuLE/l8vhvu+ARWeQ==,youtube,2025-09-08 20:01:49.456376+00:00
12,https://www.youtube.com/s/desktop/49e708f0/img...,CLIENT_REDIRECT,YouTube,https://www.youtube.com/,1757361702611504,0B3RVjuLE/l8vhvu+ARWeQ==,youtube,2025-09-08 20:01:42.611504+00:00


In [54]:
history[history['domain']=='youtube'].iloc[3].url

'https://www.youtube.com/watch?v=BEsAhEAjC8Y&list=RDCLAK5uy_k5n4srrEB1wgvIjPNTXS9G1ufE9WQxhnA&index=2'

*Disclosure:* The code below was generated with the help of CoPilot.

In [55]:
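def extract_youtube_video_id(url):
    """Return the video ID from a YouTube watch or youtu.be URL, or None."""
    # Note: the first check only matches URLs where v= is the first query parameter.
    if 'youtube.com/watch?v=' in url:
        return url.split('v=')[1].split('&')[0]
    elif 'youtu.be/' in url:
        return url.split('youtu.be/')[1].split('?')[0]
    else:
        return None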


In [56]:
history['youtube_video_id'] = history['url'].apply(extract_youtube_video_id)
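
String splitting works for the URLs in this sample, but it depends on "v=" being the first parameter after the question mark. As a hedged alternative, the sketch below parses the URL with Python's standard library, so the position of the parameter no longer matters. The function name `extract_video_id_parsed` is a hypothetical choice, and the reordered and youtu.be example URLs are made up from IDs in the sample data.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id_parsed(url):
    """Hypothetical alternative: read the video ID from the parsed URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Regular watch pages: the ID is the value of the "v" query parameter.
    if (host == "youtube.com" or host.endswith(".youtube.com")) and parsed.path == "/watch":
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    # Short links: the ID is the path itself.
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    return None

examples = [
    "https://www.youtube.com/watch?v=BEsAhEAjC8Y&list=RDCLAK5uy_k5n4srrEB1wgvIjPNTXS9G1ufE9WQxhnA&index=2",
    "https://www.youtube.com/watch?list=RDLxJzb61L-7U&v=LxJzb61L-7U",  # made-up order: v= comes second
    "https://youtu.be/1fDP9T9Cagg?t=10",                               # made-up short link
    "https://www.youtube.com/gaming",
]
for example in examples:
    print(example, extract_video_id_parsed(example))
```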

In [61]:
for url, video_id in history[history['domain']=='youtube'][['url','youtube_video_id']].values.tolist():
    print(url, video_id)

https://www.youtube.com/watch?v=h1SdotpjkTU&list=RDCLAK5uy_k5n4srrEB1wgvIjPNTXS9G1ufE9WQxhnA&index=4 h1SdotpjkTU
https://www.youtube.com/watch?v=xdI_3GdLt8g&list=RDCLAK5uy_k5n4srrEB1wgvIjPNTXS9G1ufE9WQxhnA&index=3 xdI_3GdLt8g
https://www.youtube.com/watch?v=xGaZBfJOyAc&list=RDCLAK5uy_k5n4srrEB1wgvIjPNTXS9G1ufE9WQxhnA&index=2 xGaZBfJOyAc
https://www.youtube.com/watch?v=BEsAhEAjC8Y&list=RDCLAK5uy_k5n4srrEB1wgvIjPNTXS9G1ufE9WQxhnA&index=2 BEsAhEAjC8Y
https://www.youtube.com/playlist?list=RDCLAK5uy_k5n4srrEB1wgvIjPNTXS9G1ufE9WQxhnA&playnext=1&index=1 None
https://www.youtube.com/channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ None
https://www.youtube.com/ None
https://www.youtube.com/watch?v=1fDP9T9Cagg 1fDP9T9Cagg
https://www.youtube.com/gaming None
https://www.youtube.com/ None
https://www.youtube.com/watch?v=LxJzb61L-7U&list=RDLxJzb61L-7U&start_radio=1 LxJzb61L-7U
https://www.youtube.com/ None
https://www.youtube.com/ None
https://www.youtube.com/watch?v=w_Rj6HLtl_8&list=RDw_Rj6HLtl_8&start_radio=1 w_