#### Big Data – Exercises

# Fall 2024 - Week 11 - RumbleDB

# Moodle quiz (11.2): querying a bigger git-archive dataset

You will have to submit the results of this exercise to Moodle to obtain the weekly bonus. You will need these things:
- Something related to the query output (we will grade this)
- The query you wrote (ungraded)

This exercise was designed to run on the exam magic box (and was tested there). It should work on other systems as well, but if you run into issues, see the tutorial on running Docker on [GitHub Codespaces](https://github.com/RumbleDB/bigdata-exercises/blob/master/Big_Data/exercise05/GitHub_Codespaces.pdf) or the alternative instructions in [last year's exercises](https://github.com/RumbleDB/bigdata-exercises/tree/08ba6ba6222d96003ad7bd895a71ab6c32bcc872/Big_Data/exercise11).

To get started, run the cell below to connect Jupyter to RumbleDB (don't worry about the cell; we don't expect you to know what it does).

In [1]:
!pip install rumbledb
%load_ext rumbledb
%env RUMBLEDB_SERVER=http://rumble:9090/jsoniq

env: RUMBLEDB_SERVER=http://rumble:9090/jsoniq


### Check the data
We provide a bigger git-archive dataset, [git-archive-big.json](https://www.rumbledb.org/samples/git-archive-big.json). You can already check that you get the correct number of records: the dataset should contain 206978 records. You can either use `wget` to download the dataset and read it locally, or read it directly from the URI with `json-file`.

We recommend running the cell below to download the data (reading it directly from the URL is slow and hard to debug using the notebook interface).

In [2]:
!wget https://www.rumbledb.org/samples/git-archive-big.json

--2025-01-23 13:27:24--  https://www.rumbledb.org/samples/git-archive-big.json
Resolving www.rumbledb.org (www.rumbledb.org)... 3.165.190.127, 3.165.190.108, 3.165.190.49, ...
Connecting to www.rumbledb.org (www.rumbledb.org)|3.165.190.127|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 532404791 (508M) [application/json]
Saving to: ‘git-archive-big.json’


2025-01-23 13:27:29 (112 MB/s) - ‘git-archive-big.json’ saved [532404791/532404791]



In [3]:
%%jsoniq
count(for $i in json-file("git-archive-big.json")
return $i)

Took: 1.5011537075042725 ms
206978
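
As an independent sanity check, the record count can also be reproduced outside RumbleDB. The sketch below assumes the file is in JSON Lines format (one JSON object per line), which is what `json-file` reads. The helper name `count_records` and the tiny temporary file are hypothetical stand-ins for illustration, not part of the exercise.

```python
import json
import os
import tempfile


def count_records(path):
    """Count the JSON objects in a JSON Lines file (one object per line)."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            json.loads(line)  # raises ValueError if the line is not valid JSON
            n += 1
    return n


# Tiny hypothetical file standing in for git-archive-big.json
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"type": "PushEvent"}\n{"type": "ForkEvent"}\n')

sample_count = count_records(tmp.name)
os.remove(tmp.name)
print(sample_count)
```

Calling `count_records("git-archive-big.json")` on the downloaded file should then agree with the count reported by the query above, provided the JSON Lines assumption holds.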


In [4]:
# json-file("https://www.rumbledb.org/samples/git-archive-big.json") # to read it from the URL

In [5]:
%%jsoniq
distinct-values(json-file("git-archive-big.json").type)

Took: 0.8167319297790527 ms
"PullRequestEvent"
"MemberEvent"
"PushEvent"
"IssuesEvent"
"PublicEvent"
"CommitCommentEvent"
"ReleaseEvent"
"IssueCommentEvent"
"ForkEvent"
"GollumEvent"
"WatchEvent"
"PullRequestReviewCommentEvent"
"CreateEvent"
"DeleteEvent"


## Question 1: Give the login name of the two actors that committed to master the most in PushEvent events.

Write the two names, separated by a comma with no space in between them.

Hint: Note that all commits in a push event are stored in a list (you should count those as distinct commits).

In [7]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
return keys($events)

Took: 0.7827973365783691 ms
"repo"
"org"
"actor"
"public"
"type"
"created_at"
"id"
"payload"


In [8]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
return keys($events.actor)

Took: 0.7545630931854248 ms
"gravatar_id"
"login"
"avatar_url"
"id"
"url"


In [9]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
return keys($events.payload)

Took: 0.7732892036437988 ms
"push_id"
"head"
"size"
"pusher_type"
"forkee"
"ref"
"distinct_size"
"pages"
"pull_request"
"ref_type"
"action"
"release"
"issue"
"number"
"member"
"commits"
"description"
"before"
"master_branch"
"comment"


In [36]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
return $events[$$.type eq "PushEvent"].payload[1]

Took: 0.03875374794006348 ms
{"push_id": 536740396, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81", "before": "86ffa724b4d70fce46e760f8cc080f5ec3d7d85f", "commits": [{"sha": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81", "author": {"email": "da8d7d1118ca5befd4d0d3e4f449c76ba6f1ee7e@live.com", "name": "davidjhulse"}, "message": "Altered BingBot.jar\n\nFixed issue with multiple account support", "distinct": true, "url": "https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81"}]}


In [10]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
return keys($events.payload.commits[])

Took: 0.7553353309631348 ms
"message"
"author"
"sha"
"distinct"
"url"


In [17]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
let $event := $events[$$.type eq "PushEvent"][1]
return { "name" : $event.actor.login, "commits": $event.payload.commits }

Took: 0.03885340690612793 ms
{"name": "davidjhulse", "commits": [{"sha": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81", "author": {"email": "da8d7d1118ca5befd4d0d3e4f449c76ba6f1ee7e@live.com", "name": "davidjhulse"}, "message": "Altered BingBot.jar\n\nFixed issue with multiple account support", "distinct": true, "url": "https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81"}]}


In [19]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
for $event in $events
where $event.type eq "PushEvent"
group by $name := $event.actor.login
let $commits := $event.payload.commits
where $name eq "davidjhulse"
count $c
where $c le 2
return { "name" : $name, "commits" : $commits }

Took: 1.796821117401123 ms
{"name": "davidjhulse", "commits": [[{"sha": "8c8e4d569815d7e9a9174d26b8fda883afa11021", "author": {"email": "da8d7d1118ca5befd4d0d3e4f449c76ba6f1ee7e@live.com", "name": "davidjhulse"}, "message": "Altered BingBot.jar Again\n\nSwapped it for the real deal", "distinct": true, "url": "https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/8c8e4d569815d7e9a9174d26b8fda883afa11021"}], [{"sha": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81", "author": {"email": "da8d7d1118ca5befd4d0d3e4f449c76ba6f1ee7e@live.com", "name": "davidjhulse"}, "message": "Altered BingBot.jar\n\nFixed issue with multiple account support", "distinct": true, "url": "https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81"}]]}


In [20]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
for $event in $events
where $event.type eq "PushEvent"
group by $name := $event.actor.login
let $num-commits := count($event.payload.commits)
order by $num-commits descending
count $c
where $c le 2
return { "name" : $name, "num-commits" : $num-commits }

Took: 2.8177366256713867 ms
{"name": "kinlane", "num-commits": 3843}
{"name": "KenanSulayman", "num-commits": 1832}


The result above is wrong because I forgot to keep only the pushes to master (`payload.ref` equal to `"refs/heads/master"`).

In [37]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
for $event in $events
where $event.type eq "PushEvent"
where $event.payload.ref eq "refs/heads/master"
group by $name := $event.actor.login
let $num-commits := count($event.payload.commits)
order by $num-commits descending
count $c
where $c le 2
return { "name" : $name, "num-commits" : $num-commits }

Took: 2.7198431491851807 ms
{"name": "KenanSulayman", "num-commits": 1832}
{"name": "qdm", "num-commits": 673}


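The result above is still wrong: after `group by`, `$event.payload.commits` is a sequence of arrays (one per push), so `count` returns the number of pushes rather than the number of commits. The arrays need to be unboxed into a sequence of commits with `[]`.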
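
To make the difference concrete, here is a minimal Python sketch on a handful of made-up events (the logins, SHAs, and refs are hypothetical, not taken from the dataset). Counting one per event mirrors `count($event.payload.commits)`, while summing the list lengths mirrors `count($event.payload.commits[])`. The ranking can flip between the two.

```python
from collections import defaultdict

# Hypothetical toy events with the same shape as the git-archive records
events = [
    {"type": "PushEvent", "actor": {"login": "alice"},
     "payload": {"ref": "refs/heads/master",
                 "commits": [{"sha": "a1"}, {"sha": "a2"}, {"sha": "a3"}]}},
    {"type": "PushEvent", "actor": {"login": "bob"},
     "payload": {"ref": "refs/heads/master", "commits": [{"sha": "b1"}]}},
    {"type": "PushEvent", "actor": {"login": "bob"},
     "payload": {"ref": "refs/heads/master", "commits": [{"sha": "b2"}]}},
    {"type": "PushEvent", "actor": {"login": "bob"},
     "payload": {"ref": "refs/heads/dev",
                 "commits": [{"sha": "b3"}, {"sha": "b4"}]}},
    {"type": "WatchEvent", "actor": {"login": "alice"}, "payload": {}},
]

pushes = defaultdict(int)   # number of commit arrays per actor
commits = defaultdict(int)  # number of individual commits per actor

for e in events:
    # Keep only pushes to master, as in the corrected query
    if e["type"] == "PushEvent" and e["payload"]["ref"] == "refs/heads/master":
        login = e["actor"]["login"]
        pushes[login] += 1                              # one array per push
        commits[login] += len(e["payload"]["commits"])  # unboxed commits

top_by_pushes = sorted(pushes, key=pushes.get, reverse=True)
top_by_commits = sorted(commits, key=commits.get, reverse=True)
print(top_by_pushes, top_by_commits)
```

Here bob leads when pushes are counted (two pushes to master), but alice leads once the commits inside each push are counted (three commits against two).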

In [38]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
for $event in $events
where $event.type eq "PushEvent"
where $event.payload.ref eq "refs/heads/master"
group by $name := $event.actor.login
let $num-commits := count($event.payload.commits[])
order by $num-commits descending
count $c
where $c le 2
return { "name" : $name, "num-commits" : $num-commits }

Took: 2.660771131515503 ms
{"name": "mirror-updates", "num-commits": 2530}
{"name": "KenanSulayman", "num-commits": 1832}


## Question 2: For how many repos do we have both a creation and deletion event in the data?

Write the number and nothing else.

In [21]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
return keys($events)

Took: 0.7448616027832031 ms
"repo"
"org"
"actor"
"public"
"type"
"created_at"
"id"
"payload"


In [22]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
return keys($events.repo)

Took: 0.7668747901916504 ms
"name"
"id"
"url"


In [23]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
return distinct-values($events.type)

Took: 0.77705979347229 ms
"PullRequestEvent"
"MemberEvent"
"PushEvent"
"IssuesEvent"
"PublicEvent"
"CommitCommentEvent"
"ReleaseEvent"
"IssueCommentEvent"
"ForkEvent"
"GollumEvent"
"WatchEvent"
"PullRequestReviewCommentEvent"
"CreateEvent"
"DeleteEvent"


In [24]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
let $repos :=
    for $event in $events
    group by $repo-id := $event.repo.id
    let $types := $event.type
    where "CreateEvent" = $types
    where "DeleteEvent" = $types
    return $repo-id
return count(distinct-values($repos))

Took: 2.980253219604492 ms
1242


In [27]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
let $repos :=
    for $event in $events
    group by $repo-id := $event.repo.url
    let $types := $event.type
    where "CreateEvent" = $types
    where "DeleteEvent" = $types
    return $repo-id
return count(distinct-values($repos))

Took: 2.44370698928833 ms
1241


In [29]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
let $repos :=
    for $event in $events
    group by $repo-id := $event.repo.name
    let $types := $event.type
    where "CreateEvent" = $types
    where "DeleteEvent" = $types
    return $repo-id
return count(distinct-values($repos))

Took: 2.445394515991211 ms
1241
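
The group-and-filter pattern used in these queries can be mirrored in plain Python. In JSONiq, the general comparison `"CreateEvent" = $types` is existential: it is true if any item in `$types` equals the string. In Python, this corresponds to a set membership test. The events below are hypothetical toy data, and grouping by `repo.id` is just one of the three keys tried above.

```python
from collections import defaultdict

# Hypothetical toy events; only the fields needed for the check are present
events = [
    {"type": "CreateEvent", "repo": {"id": 1}},
    {"type": "DeleteEvent", "repo": {"id": 1}},
    {"type": "CreateEvent", "repo": {"id": 2}},
    {"type": "PushEvent", "repo": {"id": 2}},
    {"type": "DeleteEvent", "repo": {"id": 3}},
    {"type": "CreateEvent", "repo": {"id": 1}},
]

# Group the event types by repository id (the analogue of `group by`)
types_by_repo = defaultdict(set)
for e in events:
    types_by_repo[e["repo"]["id"]].add(e["type"])

# Keep the repositories whose group contains both event types
required = {"CreateEvent", "DeleteEvent"}
both = [repo for repo, types in types_by_repo.items() if required <= types]
num_repos = len(both)
print(num_repos)
```

Only repository 1 has both a creation and a deletion event here, so the count is 1. Repositories 2 and 3 each have just one of the two.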


## Question 3: Which repository has the highest number of people pushing to it?

Give both the repository id and the number of people, separated by a comma with spaces in between.

Hint: Differentiate users (_actors_) using their actor id.

In [25]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
return keys($events)

Took: 0.7698550224304199 ms
"repo"
"org"
"actor"
"public"
"type"
"created_at"
"id"
"payload"


In [26]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
return keys($events.actor)

Took: 0.7647500038146973 ms
"gravatar_id"
"login"
"avatar_url"
"id"
"url"


In [33]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
for $event in $events
where $event.type eq "PushEvent"
group by $repo-name := $event.repo.name
let $num-actors := count(distinct-values($event.actor.id))
order by $num-actors descending
count $c
where $c = 1
return $repo-name

Took: 3.5710954666137695 ms
"odoo-dev/odoo"


In [39]:
%%jsoniq
let $path as string := "git-archive-big.json"
let $events := json-file($path)
for $event in $events
where $event.type eq "PushEvent"
group by $repo-id := $event.repo.id
let $num-actors := count(distinct-values($event.actor.id))
order by $num-actors descending
count $c
where $c = 1
return $repo-id

Took: 3.4483160972595215 ms
19853934
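
The question asks for the number of people as well as the repository id, while the last query returns only the id. As a minimal sketch of computing both, the Python below groups hypothetical toy events by `repo.id`, collects the distinct `actor.id` values of the push events, and returns the largest group with its size. In the JSONiq query, the same idea would presumably amount to returning `$num-actors` alongside the grouping key.

```python
from collections import defaultdict

# Hypothetical toy events; ids are made up for illustration
events = [
    {"type": "PushEvent", "repo": {"id": 10}, "actor": {"id": 1}},
    {"type": "PushEvent", "repo": {"id": 10}, "actor": {"id": 2}},
    {"type": "PushEvent", "repo": {"id": 10}, "actor": {"id": 1}},
    {"type": "PushEvent", "repo": {"id": 20}, "actor": {"id": 3}},
    {"type": "WatchEvent", "repo": {"id": 10}, "actor": {"id": 4}},
]

# Distinct actor ids per repository, restricted to push events
actors_by_repo = defaultdict(set)
for e in events:
    if e["type"] == "PushEvent":
        actors_by_repo[e["repo"]["id"]].add(e["actor"]["id"])

# Repository with the most distinct pushers, together with that number
repo_id, actor_ids = max(actors_by_repo.items(), key=lambda kv: len(kv[1]))
answer = (repo_id, len(actor_ids))
print(answer)
```

Actor 1 pushes twice to repository 10 but is counted once, and the `WatchEvent` by actor 4 is ignored, so the result is repository 10 with 2 pushers.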
