# Finding which tutorials have been reviewed so far

This notebook fetches submissions and reviews from the SciPy PreTalx database,
in order to perform some simple data exploration.
In particular, we determine which submissions are tutorials
and, of these, which have been reviewed.

Before running this notebook, be sure to [set up](README.md#setup) the necessary computing environment.

In [None]:
from contextlib import closing
import json
import numpy as np
import pandas as pd
import requests as rq
from tqdm.auto import tqdm

<center style="padding-top: 3em; padding-bottom: 3em;">
    <strong style="border: 1px solid; padding: 6pt;">Important: change the next cell so it contains <em>your</em> PreTalx authentication token.</strong>
</center>

You can find your API token on your [profile page](https://cfp.scipy.org/orga/me) on PreTalx, under the **API Access** heading.

In [None]:
TOKEN = "This is *my* token and I don't know you"

The PreTalx API provides the submissions and reviews in paged streams of results.
Each page provides up to 25 results out of the lot.
The following function chains the requests needed to download all items of a given stream.

In [None]:
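def fetch_sequence_cfp_scipy(url1, max_queries=50):
    # `max_queries` is only the initial progress-bar estimate;
    # it is refined once the API reports the total result count.
    sequence = []
    url = url1
    num_queries = 0
    num_results_expected = None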

    with closing(tqdm(total=max_queries)) as progress:
        while True:
            response = rq.get(url, headers={"Authorization": f"Token {TOKEN}"})
            # Fail loudly, with the HTTP status in the message, on 4xx/5xx responses.
            response.raise_for_status()
            data = response.json()
            progress.update()
            num_queries += 1

            assert "results" in data
            assert "next" in data

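            if "count" in data:
                if num_results_expected is None:
                    num_results_expected = data["count"]
                    # Guard against an empty first page to avoid dividing by zero.
                    page_size = max(len(data["results"]), 1)
                    max_queries = int(np.ceil(num_results_expected / page_size))
                    progress.reset(max_queries)
                    progress.update(num_queries)
                else:
                    # The total should not change while we are paging.
                    assert num_results_expected == data["count"]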

            sequence += data["results"]
            url = data["next"]
            if not url:
                break

    return sequence
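
To see the paging logic without touching the network, here is a minimal sketch that follows `next` links over made-up pages.
The names `follow_pages`, `fetch_page` and `FAKE_PAGES` are hypothetical and not part of PreTalx or this notebook; `fetch_page` merely stands in for the authenticated request.
Unlike the function above, this sketch treats `max_queries` as a hard cap rather than a progress-bar estimate.

```python
def follow_pages(first_url, fetch_page, max_queries=50):
    """Collect the `results` of every page, following `next` links.

    `fetch_page` is a hypothetical callable mapping a URL to the decoded
    JSON page; it stands in for the authenticated HTTP request.
    """
    items = []
    url = first_url
    for _ in range(max_queries):
        page = fetch_page(url)
        items.extend(page["results"])
        url = page.get("next")
        if not url:
            # No further page: the stream is exhausted.
            return items
    raise RuntimeError(f"more than {max_queries} pages; raise max_queries")


# Made-up pages shaped like the paged responses described above.
FAKE_PAGES = {
    "page-1": {"count": 3, "next": "page-2", "results": [{"code": "A"}, {"code": "B"}]},
    "page-2": {"count": 3, "next": None, "results": [{"code": "C"}]},
}

items = follow_pages("page-1", FAKE_PAGES.__getitem__)
print([item["code"] for item in items])
```

Passing the fetcher as an argument keeps the loop testable offline, which is handy when the API token is not at hand.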

Let's fetch the raw submissions and reviews.

In [None]:
submissions_ = fetch_sequence_cfp_scipy(
    "https://cfp.scipy.org/api/events/2024/submissions/"
)
len(submissions_)

In [None]:
reviews_ = fetch_sequence_cfp_scipy("https://cfp.scipy.org/api/events/2024/reviews/")
len(reviews_)

Help me, Pandas. You're my only hope.

In [None]:
submissions = pd.DataFrame.from_records(submissions_)
submissions

The data is not flat.
At a glance, I see that talk submissions have the plain string `Talk` in the `submission_type` column, but other submissions store a localizable dictionary there.
I expect all of these dictionaries to have the form `{'en': some_type}`, but let's check.

In [None]:
submissions["submission_type"].value_counts()

The assumption is valid,
so let's flatten the `submission_type` column.

In [None]:
submissions["submission_type"] = submissions["submission_type"].apply(
    lambda x: x["en"] if isinstance(x, dict) else x
)
submissions
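
For illustration, here is the same flattening on a toy frame, preceded by an explicit check of the `{'en': some_type}` assumption.
The codes and types below are made up; only the column name `submission_type` comes from the data above.

```python
import pandas as pd

# Made-up values mimicking the two shapes seen in `submission_type`.
toy = pd.DataFrame(
    {
        "code": ["AAA111", "BBB222", "CCC333"],
        "submission_type": ["Talk", {"en": "Tutorial"}, {"en": "Poster"}],
    }
)

# Check the assumption: every dictionary has exactly one key, "en".
is_dict = toy["submission_type"].apply(lambda x: isinstance(x, dict))
assert toy.loc[is_dict, "submission_type"].apply(lambda d: set(d) == {"en"}).all()

# Unwrap dictionaries; plain strings pass through unchanged.
toy["submission_type"] = toy["submission_type"].apply(
    lambda x: x["en"] if isinstance(x, dict) else x
)
print(toy["submission_type"].tolist())
```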

With `submission_type` flattened, it is easy to extract the tutorials from the whole set of submissions.
Glancing further, I see that some submissions have already been rejected.
Let's look at the set of submission states to determine which to retain for further analysis.

In [None]:
submissions["state"].value_counts()

When checking whether tutorials have been reviewed, I presume we only care about those still in the `submitted` state.
Let's filter.

In [None]:
tutorials = submissions.loc[
    (submissions["submission_type"] == "Tutorial")
    & (submissions["state"] == "submitted")
].copy()
tutorials

As I'm going to list submissions in conjunction with reviews,
I care about eyeballing the title and author list of each tutorial.
The former is in readable form, but the latter is not.
Let's unpack the author lists and represent them the way they are often typeset in paper citations.

In [None]:
def label_authors(speakers):
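    # Use the last word of each name as the surname; fall back to "WHO"
    # when the name is missing or empty.
    surnames = [(speaker.get("name") or "WHO").split()[-1] for speaker in speakers]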
    assert len(surnames) > 0
    if len(surnames) >= 4:
        return f"{surnames[0]} et al."
    elif len(surnames) == 3:
        return f"{surnames[0]}, {surnames[1]} and {surnames[2]}"
    elif len(surnames) == 2:
        return f"{surnames[0]} and {surnames[1]}"
    else:
        return surnames[0]


tutorials["authors"] = tutorials["speakers"].apply(label_authors)
tutorials

Ok,
having what I need to track tutorials,
let's Pandasify reviews.

In [None]:
reviews = pd.DataFrame.from_records(reviews_)
reviews

Tracking which tutorials are targeted by which reviews is a simple join.
We can use the unique submission identifiers as the join key:
column `code` in the `tutorials` frame,
column `submission` in the `reviews` frame.

In [None]:
tutorials_reviewed = (
    tutorials[["code", "authors", "title"]]
    .merge(
        reviews[["submission", "user", "updated", "score"]],
        how="left",
        left_on="code",
        right_on="submission",
    )
    .drop(columns=["submission"])
    .astype({"updated": "datetime64[ns, GMT]"})
)
tutorials_reviewed
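
To make the behavior of the left join concrete, here is a small sketch on made-up frames.
It assumes nothing beyond the column names `code` and `submission` used above; the toy codes, users and scores are invented.
The `indicator=True` option of `merge` adds a `_merge` column that tells matched rows (`both`) apart from tutorials without any review (`left_only`).

```python
import pandas as pd

tutorials_toy = pd.DataFrame(
    {"code": ["AAA111", "BBB222"], "title": ["Intro to X", "Advanced Y"]}
)
reviews_toy = pd.DataFrame(
    {
        # ZZZ999 reviews a submission that is not a retained tutorial.
        "submission": ["AAA111", "AAA111", "ZZZ999"],
        "user": ["ann", "bob", "cy"],
        "score": [4.0, 5.0, 3.0],
    }
)

joined = tutorials_toy.merge(
    reviews_toy,
    how="left",  # keep every tutorial, reviewed or not
    left_on="code",
    right_on="submission",
    indicator=True,  # adds `_merge`: "both" or "left_only"
).drop(columns=["submission"])
print(joined)
```

A tutorial with two reviews appears on two rows, an unreviewed one appears once with missing review columns, and reviews of other submissions are dropped.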

Joins are tricky, so let's sanity-check that the data frame above actually holds data on each retained tutorial.

In [None]:
assert tutorials_reviewed["code"].nunique() == tutorials["code"].nunique()

Which tutorials have not been reviewed yet?

In [None]:
tutorials_reviewed.loc[tutorials_reviewed["updated"].isna()]

Of those that _were_ reviewed, which got fewer than 2 reviews?

In [None]:
num_reviews = (
    tutorials_reviewed.dropna(how="any")
    .groupby(["code", "authors", "title"], as_index=False)
    .size()
    .sort_values("size")
)
num_reviews.loc[num_reviews["size"] < 2]
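
As a possible alternative, both questions above can be answered in one pass by counting non-missing reviewers per tutorial.
This is a sketch on made-up data, not the notebook's method: `joined_toy` is hypothetical and only mimics the shape of `tutorials_reviewed`.

```python
import pandas as pd

joined_toy = pd.DataFrame(
    {
        "code": ["AAA111", "AAA111", "BBB222", "CCC333"],
        # CCC333 has no review, as a left join would leave it.
        "user": ["ann", "bob", "cy", None],
    }
)

# `count` skips missing values, so an unreviewed tutorial gets 0, not 1.
counts = joined_toy.groupby("code")["user"].count().rename("num_reviews")
print(counts[counts < 2])
```

Here tutorials with zero reviews and those with a single review end up in the same listing.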