#### Big Data – Exercises

# Fall 2024 - Week 11 - RumbleDB

# Moodle quiz (11.2): querying a bigger git-archive dataset

To obtain the weekly bonus, submit the results of this exercise to Moodle. You will need:
- Something related to the query output (we will grade this)
- The query you wrote (ungraded)

This exercise was designed for, and tested on, the exam magic box. It should work on all systems, but if you run into issues, you can look at the tutorial on how to run Docker on [GitHub codespaces](https://github.com/RumbleDB/bigdata-exercises/blob/master/Big_Data/exercise05/GitHub_Codespaces.pdf), or at the alternative instructions in [last year's exercises](https://github.com/RumbleDB/bigdata-exercises/tree/08ba6ba6222d96003ad7bd895a71ab6c32bcc872/Big_Data/exercise11).

To get started, run the cell below to connect Jupyter to RumbleDB (you are not expected to know what this cell does).

In [1]:
!pip install rumbledb
%load_ext rumbledb
%env RUMBLEDB_SERVER=http://rumble:9090/jsoniq

env: RUMBLEDB_SERVER=http://rumble:9090/jsoniq


### Check the data
We provide you with a bigger git-archive dataset, [git-archive-big.json](https://www.rumbledb.org/samples/git-archive-big.json). You can check right away that you get the correct number of records: the dataset should contain 206978 records. You can either use `wget` to download the dataset and read it locally, or read it directly from the URI with `json-file`.

We recommend running the cell below to download the data (reading it directly from the URL is slow and hard to debug using the notebook interface).

In [2]:
!wget https://www.rumbledb.org/samples/git-archive-big.json

--2024-12-04 15:35:51--  https://www.rumbledb.org/samples/git-archive-big.json
Resolving www.rumbledb.org (www.rumbledb.org)... 3.165.190.49, 3.165.190.75, 3.165.190.108, ...
Connecting to www.rumbledb.org (www.rumbledb.org)|3.165.190.49|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 532404791 (508M) [application/json]
Saving to: ‘git-archive-big.json’


2024-12-04 15:35:56 (112 MB/s) - ‘git-archive-big.json’ saved [532404791/532404791]



In [3]:
%%jsoniq
count(for $i in json-file("git-archive-big.json")
return $i)

Took: 1.0146398544311523 ms
206978
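
The record count can also be cross-checked outside RumbleDB. The following is a minimal sketch in plain Python, assuming the downloaded file is in JSON Lines format (one JSON value per line); the helper name `count_records` is hypothetical and not part of the exercise.

```python
import json


def count_records(path):
    """Count the non-empty lines of a file, checking that each parses as JSON.

    Assumes JSON Lines input (one JSON value per line).
    """
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            json.loads(line)  # raises ValueError if the line is not valid JSON
            n += 1
    return n
```

If the download is complete, `count_records("git-archive-big.json")` would be expected to agree with the 206978 records reported by the query above.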


In [4]:
# json-file("https://www.rumbledb.org/samples/git-archive-big.json") # to read it from the URL

In [5]:
%%jsoniq
distinct-values(json-file("git-archive-big.json").type)

Took: 0.5265016555786133 ms
"PullRequestEvent"
"MemberEvent"
"PushEvent"
"IssuesEvent"
"PublicEvent"
"CommitCommentEvent"
"ReleaseEvent"
"IssueCommentEvent"
"ForkEvent"
"GollumEvent"
"WatchEvent"
"PullRequestReviewCommentEvent"
"CreateEvent"
"DeleteEvent"


In [11]:
%%jsoniq
keys(json-file("git-archive-big.json"))

Took: 0.6636226177215576 ms
"repo"
"org"
"actor"
"public"
"type"
"created_at"
"id"
"payload"


In [14]:
%%jsoniq
keys(json-file("git-archive-big.json").payload)

Took: 0.512007474899292 ms
"push_id"
"head"
"size"
"pusher_type"
"forkee"
"ref"
"distinct_size"
"pages"
"pull_request"
"ref_type"
"action"
"release"
"issue"
"number"
"member"
"commits"
"description"
"before"
"master_branch"
"comment"


In [18]:
%%jsoniq
keys(json-file("git-archive-big.json").payload.commits[])

Took: 0.4760148525238037 ms
"message"
"author"
"sha"
"distinct"
"url"


In [21]:
%%jsoniq
keys(json-file("git-archive-big.json").payload.commits[].author)

Took: 0.5734133720397949 ms
"name"
"email"


In [20]:
%%jsoniq
keys(json-file("git-archive-big.json").actor)

Took: 0.591423749923706 ms
"gravatar_id"
"login"
"avatar_url"
"id"
"url"


In [44]:
%%jsoniq
keys(json-file("git-archive-big.json").repo)

Took: 0.663179874420166 ms
"name"
"id"
"url"


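The `keys` calls above collect the distinct keys over all objects in a sequence. A rough Python equivalent is sketched below on a tiny hand-made sample; the records and the helper name `distinct_keys` are hypothetical, and the ordering shown (first-seen) is a choice of this sketch, not something the queries above guarantee.

```python
def distinct_keys(objects):
    """Return the distinct keys across all dicts, in first-seen order."""
    seen = {}
    for obj in objects:
        if isinstance(obj, dict):  # items that are not objects contribute no keys
            for key in obj:
                seen.setdefault(key, None)
    return list(seen)


# Hypothetical miniature records, far smaller than the real dataset.
sample = [
    {"type": "PushEvent", "actor": {"login": "alice", "id": 1}},
    {"type": "ForkEvent", "repo": {"name": "a/b", "id": 7}},
]

print(distinct_keys(sample))  # ['type', 'actor', 'repo']
print(distinct_keys(r["actor"] for r in sample if "actor" in r))  # ['login', 'id']
```
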
## Question 1: Give the login names of the two actors who committed to master the most in PushEvent events.

Write the two names, separated by a comma with no space in between them.

Hint: All commits in a push event are stored in an array; count each element of that array as a separate commit.

In [41]:
%%jsoniq
let $path := "git-archive-big.json"
for $event in json-file($path)
    where $event.type eq "PushEvent" and $event.payload.ref eq "refs/heads/master"
    group by $actor-login := $event.actor.login
    let $commit-count := count($event.payload.commits[])
    order by $commit-count descending
    count $c
    where $c le 2
    return {"actor-login": $actor-login, "num-commits": $commit-count}

Took: 2.456744432449341 ms
{"actor-login": "mirror-updates", "num-commits": 2530}
{"actor-login": "KenanSulayman", "num-commits": 1832}
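
The filter, group and count steps of this query can be sanity-checked on a small in-memory sample. The sketch below mirrors the same logic in plain Python; the sample events and the helper name `top_committers` are hypothetical and only illustrate the counting, they are not taken from the dataset.

```python
from collections import Counter


def top_committers(events, n=2):
    """Sum commits per actor login over PushEvents to refs/heads/master."""
    counts = Counter()
    for e in events:
        payload = e.get("payload", {})
        if e.get("type") == "PushEvent" and payload.get("ref") == "refs/heads/master":
            # Every element of the commits array counts as one commit.
            counts[e["actor"]["login"]] += len(payload.get("commits", []))
    return counts.most_common(n)


# Hypothetical miniature events.
sample = [
    {"type": "PushEvent", "actor": {"login": "alice"},
     "payload": {"ref": "refs/heads/master", "commits": [{}, {}]}},
    {"type": "PushEvent", "actor": {"login": "bob"},
     "payload": {"ref": "refs/heads/master", "commits": [{}]}},
    {"type": "PushEvent", "actor": {"login": "alice"},
     "payload": {"ref": "refs/heads/master", "commits": [{}]}},
    {"type": "PushEvent", "actor": {"login": "bob"},
     "payload": {"ref": "refs/heads/dev", "commits": [{}, {}, {}, {}, {}]}},
    {"type": "ForkEvent", "actor": {"login": "carol"}, "payload": {}},
]

print(top_committers(sample))  # [('alice', 3), ('bob', 1)]
```

Pushes to other branches (bob's five commits to `refs/heads/dev`) and non-push events are ignored, which matches the `where` clause of the query.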


## Question 2: For how many repos do we have both a creation and deletion event in the data?

Write the number and nothing else.

In [55]:
%%jsoniq
let $path := "git-archive-big.json"
let $repos := (
    for $event in json-file($path)
        group by $repo-id := $event.repo.id
        let $repo-name := $event.repo.name
        return { "repo-id": $repo-id, "repo-name": $repo-name, "types": $event.type }
)
let $repos-with-both := (
    for $repo in $repos
        where "CreateEvent" = $repo.types[] and "DeleteEvent" = $repo.types[]
        return $repo.repo-id
)
return count(distinct-values($repos-with-both))

Took: 2.6752636432647705 ms
1242
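
The idea behind this query (collect the event types seen per repository, then keep the repositories that have both) can be illustrated with sets. This is a minimal Python sketch on hypothetical sample events; `repos_with_create_and_delete` is an illustrative helper name.

```python
from collections import defaultdict


def repos_with_create_and_delete(events):
    """Return sorted repo ids that have at least one CreateEvent and one DeleteEvent."""
    types_by_repo = defaultdict(set)
    for e in events:
        types_by_repo[e["repo"]["id"]].add(e["type"])
    wanted = {"CreateEvent", "DeleteEvent"}
    return sorted(rid for rid, types in types_by_repo.items() if wanted <= types)


# Hypothetical miniature events.
sample = [
    {"type": "CreateEvent", "repo": {"id": 1}},
    {"type": "DeleteEvent", "repo": {"id": 1}},
    {"type": "CreateEvent", "repo": {"id": 2}},
    {"type": "PushEvent", "repo": {"id": 2}},
    {"type": "DeleteEvent", "repo": {"id": 3}},
    {"type": "PushEvent", "repo": {"id": 1}},
]

print(len(repos_with_create_and_delete(sample)))  # 1
```

Only repository 1 has both event types in the sample, so the count is 1.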


## Question 3: Which repository has the highest number of people pushing to it?

Give both the repository id and the number of people, separated by a comma with spaces in between.

Hint: Differentiate users (_actors_) using their actor id.

In [58]:
%%jsoniq
let $path := "git-archive-big.json"
for $event in json-file($path)
    where $event.type = "PushEvent"
    group by $repo-id := $event.repo.id
    let $people-count := count(distinct-values($event.actor.login))
    order by $people-count descending
    count $c
    where $c eq 1
    return {"repo-id": $repo-id, "people-count": $people-count}



Took: 3.186026096343994 ms
{"repo-id": 19853934, "people-count": 22}
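
The same group-and-count-distinct pattern can be sketched in plain Python. Following the hint, this sketch differentiates users by actor id (the query above uses the login instead); the sample events and the helper name `most_pushed_repo` are hypothetical.

```python
from collections import defaultdict


def most_pushed_repo(events):
    """Return (repo_id, number of distinct pushing actor ids), or None if no pushes."""
    pushers = defaultdict(set)
    for e in events:
        if e["type"] == "PushEvent":
            pushers[e["repo"]["id"]].add(e["actor"]["id"])
    if not pushers:
        return None
    # On ties, max() keeps the repository that was seen first.
    repo_id = max(pushers, key=lambda r: len(pushers[r]))
    return repo_id, len(pushers[repo_id])


# Hypothetical miniature events.
sample = [
    {"type": "PushEvent", "repo": {"id": 10}, "actor": {"id": 1}},
    {"type": "PushEvent", "repo": {"id": 10}, "actor": {"id": 2}},
    {"type": "PushEvent", "repo": {"id": 10}, "actor": {"id": 1}},
    {"type": "PushEvent", "repo": {"id": 20}, "actor": {"id": 3}},
    {"type": "WatchEvent", "repo": {"id": 20}, "actor": {"id": 4}},
    {"type": "WatchEvent", "repo": {"id": 20}, "actor": {"id": 5}},
]

print(most_pushed_repo(sample))  # (10, 2)
```

Repeated pushes by the same actor count once, and non-push events do not contribute, so repository 10 wins with two distinct pushers.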
