# Why do I care about tags?

I like to approach feature engineering for an ML problem by asking: if **I** had to make the prediction I'm asking my ML system to make, what information would I want to have?

For predicting whether a given person will answer a given question correctly, I would want to know about two broad areas: 

- **Concepts**: What skills or concepts are being tested by this question, and how well can this person apply each concept?
- **Misconceptions**: Does the person have any specific misconceptions that are relevant to the question? Are there specific things they understand *incorrectly* about the problem at hand?

Getting at misconceptions would be non-trivial with the data provided. Normally, I would look for meaningful patterns in incorrect answers to questions. But in this case, all we know about the incorrect answers is what letter the student chose. We don't have any data about *what* makes that particular answer incorrect. One potential way to go would be to cluster incorrect answers that frequently co-occur, but I'm not sure yet whether I want to go there.

On the other hand, the dataset does seem to provide a direct way of examining the concepts or skills that each question tests: each question has one or more tags associated with it, and each lecture is tagged with exactly one of those tags.

## But what *are* these tags?

The dataset description says:
> The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.

I don't find this super informative or reassuring (doesn't sufficiency sort of depend on what kind of clustering I want to do?), so I dug a bit deeper.

The [arXiv paper](https://arxiv.org/abs/1912.03072) describes the question/tag/lecture part of the dataset as:
> 13,169 problems and 1,021 lectures tagged with 293 types of skills

*(although in a later table it notes that the "level 1" dataset - which seems to be what this challenge dataset is mostly based on - only has 188 tags. This matches the number of unique values in the tag field for lectures, so that checks out!)*

The [GitHub page for the EdNet dataset](https://github.com/riiid/ednet) says that the tags are "expert-annotated", and also notes that:
> Once the number of incorrect answers to questions with particular tags exceeds certain threshold, Santa suggests lectures and questions with corresponding tags. ... It also offers lectures and questions if the average correctness rate of questions with particular tags decreased by more than a certain threshold.

Finally, [a page on the associated AAAI workshop website](https://sites.google.com/view/tipce-2021/shared-task) says:
> Since the table includes educational tags of each learning item, methods like BKT can effectively make use of pedagogical properties to estimate a student's knowledge state. 

So, it does seem like the tags are meant to represent skills or aspects of the knowledge state of the student. And the set of tags seems to have been designed by experts, presumably ones whose expertise is in the subject matter being taught (namely, the TOEIC test).
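
Since BKT (Bayesian Knowledge Tracing) comes up in that quote, here is a minimal sketch of the standard BKT update for a single tag, just to make "knowledge state" concrete. The class name, function names and parameter values are all made up for illustration; nothing here is fitted to the dataset or taken from the EdNet authors.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    """Per-tag BKT parameters (the values used below are made up, not fitted)."""
    p_init: float   # P(skill already known before the first attempt)
    p_learn: float  # P(unknown -> known after one practice opportunity)
    p_slip: float   # P(wrong answer even though the skill is known)
    p_guess: float  # P(right answer even though the skill is unknown)


def predict_correct(p_known: float, params: BKTParams) -> float:
    """Probability that the next answer on this tag is correct."""
    return p_known * (1 - params.p_slip) + (1 - p_known) * params.p_guess


def update(p_known: float, correct: bool, params: BKTParams) -> float:
    """Bayesian posterior after one observed answer, followed by the learning step."""
    if correct:
        evidence = p_known * (1 - params.p_slip)
        posterior = evidence / (evidence + (1 - p_known) * params.p_guess)
    else:
        evidence = p_known * params.p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - params.p_guess))
    return posterior + (1 - posterior) * params.p_learn


params = BKTParams(p_init=0.3, p_learn=0.1, p_slip=0.1, p_guess=0.25)
p_known = params.p_init
for answered_correctly in [True, True, False, True]:
    print(f'P(correct) before answering: {predict_correct(p_known, params):.3f}')
    p_known = update(p_known, answered_correctly, params)
print(f'final P(known): {p_known:.3f}')
```

The point is only that BKT needs each question mapped to a skill, which is exactly the role the tags would play.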


## My plan

In the longer run, I'd like to analyze how useful these tags might be as a representation of atomic skills in a student-knowledge model; and even possibly use the student behavior data to refine the set of tags currently present in the dataset. If and when I've convinced myself that I have a set of tags which represent learnable skills, I can use these tags to model individual users' knowledge state, and use the knowledge state to predict how they will answer a given question.

Specifically, I'm looking at [these](https://scholar.google.com/scholar?cluster=3456251199517922701) [two](https://scholar.google.com/scholar?cluster=8158178723780769138) papers, which describe data-driven methods for evaluating, improving and using expert-designed cognitive models.

But first, I want to explore a couple of interesting patterns I noticed in the tag-related data, which should help me have a better understanding of what these tags actually are.

# The Data

First, I want to get the data from the questions and lectures tables into a useful format.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
from scipy import signal # just in case

In [None]:
questions_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
# Transform tags into lists of ints:
questions_df['tags'] = questions_df['tags'].apply(lambda ts: [int(x) for x in str(ts).split() if x != 'nan'])
questions_df.head()

In [None]:
lectures_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv')
lectures_df.head()


I also want to make a convenient table for tag data. I thought about doing all this the proper pandas way, but then I remembered that I don't care about the pandas way because it does not care about me and my data-processing desires.

In [None]:
tag_to_questions = {}
for i, row in questions_df.iterrows():
    for t in row['tags']:
        if t not in tag_to_questions:
            tag_to_questions[t] = set()
        tag_to_questions[t].add(row['question_id'])
tags_df = pd.DataFrame([{'tag': t, 'questions': qs} for t, qs in tag_to_questions.items()])
tags_df.head()
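
For the record, the "proper pandas way" would look roughly like the sketch below: explode the tag lists into one row per (question, tag) pair, then group by tag. It runs on a toy frame (`toy_questions`, with made-up values) that has the same `question_id` and `tags` columns as `questions_df`; I haven't checked it against the loop above on the real data.

```python
import pandas as pd

# Toy stand-in for questions_df after the tags column has been parsed into lists.
toy_questions = pd.DataFrame({
    'question_id': [0, 1, 2, 3],
    'tags': [[51, 131], [131], [9], []],
})

# explode() gives one row per (question, tag) pair; untagged questions become NaN rows.
pairs = toy_questions.explode('tags').dropna(subset=['tags'])
pairs['tags'] = pairs['tags'].astype(int)

# Collect the set of question ids for each tag.
toy_tags = (
    pairs.groupby('tags')['question_id']
    .agg(set)
    .rename_axis('tag')
    .reset_index(name='questions')
)
print(toy_tags)
```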

# Two types of skills

The histogram below shows a very strongly bimodal distribution of tags.

The per-tag metric in the histogram is **the fraction of questions with this tag that also have other tags**. So for each tag, either all questions with that tag have multiple tags, or (nearly) none do.

In [None]:
questions_df['multitag'] = questions_df['tags'].apply(lambda ts: len(ts)>1)

def calc_fract_multitagged(tag_row):
    tag_qs = questions_df[questions_df['question_id'].isin(tag_row['questions'])]
    # 'multitag' is boolean, so its mean is the fraction of multi-tagged questions
    return tag_qs['multitag'].mean()
tags_df['fraction_multitagged'] = tags_df.apply(calc_fract_multitagged, axis=1)
plt.hist(tags_df['fraction_multitagged'])
plt.show()
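
To make the metric concrete, here is a small sketch that computes the same per-tag fraction on a toy table. The values are made up, and tag 1 is deliberately an "in between" tag of the kind the real data turns out not to have. It uses the explode/groupby route rather than the row-wise `apply` above.

```python
import pandas as pd

# Toy stand-in for questions_df with parsed tag lists (values are made up).
toy_questions = pd.DataFrame({
    'question_id': [0, 1, 2, 3, 4],
    'tags': [[1], [1], [2, 3], [2, 4], [1, 4]],
})
toy_questions['multitag'] = toy_questions['tags'].str.len() > 1

# One row per (question, tag); the mean of a boolean column is the fraction of True values.
fraction_multitagged = (
    toy_questions.explode('tags')
    .groupby('tags')['multitag']
    .mean()
)
print(fraction_multitagged)
```

Here tag 1 appears alone in two of its three questions, so it would land in the middle of the histogram; tags 2, 3 and 4 only ever appear alongside other tags.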


There seem to be two types of tags: ones that only ever appear by themselves in a question, and ones that only ever appear together with other tags.

If we interpret tags as skills, this means that there are two kinds of skills present in this training app:
- a set of skills that are only ever tested/trained by themselves
- a separate set of skills that are only ever tested in conjunction with other skills.

And nothing in between.


From a pedagogical perspective, this was surprising to me at first. I expected to see, for each skill, some exercises that let the user practice that skill separately, and some that challenge the user to use it in conjunction with other skills.

Then I realized that the questions in this app are probably just designed to mimic actual questions on the TOEIC, not necessarily to provide explicit teaching of the individual skills. Indeed, it's probably unreasonable to expect a test-prep app to take on actually teaching the English language; it's probably more focused on test-taking skills and things you can improve by just raw grindy practice, like understanding speakers with various accents. Still, I imagine having interactive exercises that target individual skills couldn't have hurt.

# Skills are (mostly) specific to TOEIC parts

If each part of the TOEIC test was actually tagged with a separate set of skills, that could explain the two different categories of tags we see above: Some parts of the test might only ever contain questions that test one skill, while others only contain questions that test multiple skills at once.

The diagram below shows this isn't *quite* the case - there is *some* overlap in the skills being tested by each part - but there are still meaningful patterns in how skills are split between parts.

In [None]:
def calc_tag_parts(tag_row):
    tag_qs = questions_df[questions_df['question_id'].isin(tag_row['questions'])]
    # sort so that the same combination of parts always yields the same tuple
    return tuple(sorted(tag_qs['part'].unique()))
tags_df['parts'] = tags_df.apply(calc_tag_parts, axis=1)

part_sizes = tags_df.groupby('parts').size()
plt.bar(range(len(part_sizes)), part_sizes)
plt.xticks(range(len(part_sizes)),part_sizes.index, rotation=80)
plt.title('number of tags which span a given combination of test parts')
plt.show()

We can see that:

- each test part does have at least some tags that are specific to that part
- there is very little intersection between parts 1-4 (the Listening section of the test) and parts 5-7 (the Reading section). In fact, the intersection is just one tag, which spans all parts *except* part 5.
- parts 5 & 6 have a lot of tags in common. That makes sense, since these parts of the test tend to ask very similar questions (fill in the blank with the most appropriate response).
- There are a few tags that apply to all listening parts (1-4). These probably have to do with general aural comprehension of English.
- There are also a few tags that are common to parts 3 and 4. These parts ask you to listen to one (in part 4) or several (in part 3) people talking and then answer questions about what was said. So these tags probably have to do with more specifically inferring meaning from spoken conversation.

---

Going back to the "tags for single-tagged questions" vs. "tags for multi-tagged questions" dichotomy, below I've split out the "fraction-multitagged" histogram by test part. As I suspected, the test parts play a big role: 
- parts 1-4 and 7 only ever test skills in conjunction with other skills
- part 5 only ever tests skills in isolation from other skills
- part 6, which is very similar to part 5, also mostly tests skills in isolation, but occasionally has questions with several skill tags. I bet this is caused by that one tag which spans all parts except 5.

In [None]:
fig, axarr = plt.subplots(4, 2, figsize=(15, 15))
flat_axes_list = list(axarr.flat)  # 4x2 grid flattened row by row; the 8th axes stays empty

fig.tight_layout()

for part in range(1,8):
    # I want to specifically limit the logic to questions and tags in this part
    def calc_multitagged_in_part(tag_row):
        part_qs = questions_df[questions_df['part']==part]
        tag_qs = part_qs[part_qs['question_id'].isin(tag_row['questions'])]
        # mean of the boolean column = fraction of multi-tagged questions
        return tag_qs['multitag'].mean()
    part_multitagged = tags_df[tags_df['parts'].apply(lambda ps: part in ps)].apply(calc_multitagged_in_part, axis=1)
    ax = flat_axes_list[part-1]
    ax.set_title(f'part {part}')
    ax.hist(part_multitagged, bins=[x/10 for x in range(11)])


# Relationships between tags

Let's examine how tags are related to each other through questions. Namely, which tags occur together in the same question, and how often do pairs of tags co-occur?

In [None]:
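def tag_relationship_graph(tags, show_disconnected=True, print_values=False, return_graph=False):
    # Nodes are tags; an edge joins two tags that share at least one question,
    # weighted by the number of questions they have in common.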
    G = nx.Graph()
    for i in tags.index:
        first_tag = tags.loc[i]['tag']
        first_qs = tags.loc[i]['questions']
        if show_disconnected:
            G.add_node(first_tag)
        for _, second_row in tags.loc[i:].iterrows():
            second_tag = second_row['tag']
            second_qs = second_row['questions']
            if first_tag != second_tag:
                qs_in_common = len(first_qs.intersection(second_qs))
                if qs_in_common > 0:
                    G.add_edge(first_tag, second_tag, weight=qs_in_common)
                if print_values:
                    print(f'{first_tag} <-> {second_tag}: {qs_in_common}')
            
    pos=nx.spring_layout(G)
    nx.draw(G,  pos=pos, node_color='bisque', with_labels=True)
    nx.draw_networkx_edge_labels(G, pos=pos,edge_labels=nx.get_edge_attributes(G,'weight'))
    plt.show()
    if return_graph:
        return G

    
def tags_used_in_part(part):
    return tags_df[tags_df['parts'].apply(lambda ps: part in ps)]
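
The edge weights in that graph are just pair co-occurrence counts, so they can also be computed in a single pass over the questions instead of comparing every pair of tags. Here is a minimal sketch on made-up tag lists (`toy_question_tags` is hypothetical); it only produces the counts and doesn't draw anything.

```python
from collections import Counter
from itertools import combinations

# Toy tag lists, one per question (values are made up).
toy_question_tags = [[1, 2, 3], [2, 3], [4], [1, 3]]

pair_counts = Counter()
for tags in toy_question_tags:
    # Sort so that (2, 3) and (3, 2) are counted as the same unordered pair.
    for pair in combinations(sorted(set(tags)), 2):
        pair_counts[pair] += 1

for (first_tag, second_tag), n_questions in sorted(pair_counts.items()):
    print(f'{first_tag} <-> {second_tag}: {n_questions}')
```

Each entry corresponds to one weighted edge; tags like 4 that never appear in a pair are the "disconnected" nodes.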

In [None]:
tag_relationship_graph(tags_used_in_part(5))

The graph of all tags used in part 5 offers no surprises, but it's a nice sanity check: None of the tags are related to each other in any way, since they are all "singleton" tags that only ever occur by themselves in questions.

Let's take a look at the similar, but slightly more interesting, part 6:

In [None]:
tag_relationship_graph(tags_used_in_part(6))

The interesting part of the graph is a bit too clumped together in the center, because of all the disconnected tags. Let's look at just those tags that are connected to any other tag:

In [None]:
tag_relationship_graph(tags_used_in_part(6), show_disconnected=False)

Only tag 162 is connected to any other tags! Without it, part 6 would look exactly like part 5.

Tag 162 is the one tag that's present in all parts except 5, so my suspicion that it is the one tag that made the part 6 histogram look weird is confirmed:

In [None]:
tags_df[tags_df['tag']==162]

The more-connected parts of the test all look pretty similar to each other in terms of tag relationships - basically, everything's connected:

In [None]:
tag_relationship_graph(tags_used_in_part(4))

It might be interesting to dig into whether the outlier tags are interesting in some way, and/or what's going on in the middle. But later.

---

One other interesting group of tags is the set of tags that apply to both parts 3 and 4 (but no others):


In [None]:
tag_relationship_graph(tags_df[tags_df['parts']==(3,4)])

Even though we know that these tags are always used in conjunction with other tags, apparently they are never used in conjunction with *each other*. 

I bet that they represent some kind of one-hot encoding of a variable with 7 possible values. In fact, each question in parts 3 and 4 is tagged with exactly one of these tags:

In [None]:
all_qs = set()
for q_set in tags_df[tags_df['parts']==(3,4)]['questions']:
    all_qs.update(q_set)
print(f'{len(questions_df[questions_df["part"].isin((3,4))])} total questions in parts 3 and 4')
print(f'{len(all_qs)} questions tagged with one of these 7 disjoint tags')
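
Matching counts only show that every question in parts 3 and 4 carries *at least* one of these tags; the "exactly one" part leans on the graph above showing that the tags never co-occur. A more direct check would count, per question, how many tags from the group it has. Here is a sketch on toy data, where `group_tags` and all the values are made up:

```python
import pandas as pd

# Toy stand-ins (made-up values): tags 10 and 11 play the role of the disjoint group.
toy_questions = pd.DataFrame({
    'question_id': [0, 1, 2, 3],
    'part': [3, 3, 4, 5],
    'tags': [[10, 7], [11, 7], [10, 8], [9]],
})
group_tags = {10, 11}

# Restrict to parts 3 and 4, then count how many group tags each question carries.
in_parts = toy_questions[toy_questions['part'].isin((3, 4))]
n_group_tags = in_parts['tags'].apply(lambda ts: len(group_tags.intersection(ts)))

print(n_group_tags.value_counts())
print('exactly one group tag per question:', bool((n_group_tags == 1).all()))
```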


# Tags and lectures

Each lecture is about exactly one tag, though there can be many lectures about the same tag, since there are 188 tags and over 1000 lectures.

Tags with more lectures probably represent harder concepts, and possibly even compound or umbrella concepts that can't be logically and concisely explained. 

For example, many questions on the TOEIC test your understanding of colloquialisms and phrases common in spoken English. It might make sense to tag this kind of question with a "colloquialism" tag. But it's not really a concept that is teachable through a few short lectures: the "skill" involved is just remembering and getting used to a bunch of different unrelated phrases.

In [None]:
tags_df.set_index('tag', inplace=True, drop=False)


In [None]:
tags_df['num_lectures'] = lectures_df.groupby('tag').count()['lecture_id']
tags_df['num_lectures'] = tags_df['num_lectures'].fillna(0).astype(int)
tags_df['lectures'] = lectures_df.groupby('tag')['lecture_id'].unique()


In [None]:
tags_df['lectures'] = tags_df['lectures'].fillna('')

In [None]:
plt.hist(tags_df['num_lectures'], bins=range(9))
plt.title('Histogram of the number of lectures a particular tag has')
plt.show()

Interestingly, there are quite a few tags with no lectures at all. These could be tags that represent something other than a "learnable" concept, or they could be concepts like my proposed "colloquialism" tag, that are in principle learnable, but not through lectures. A third possibility is that these tags represent very easy concepts which don't even need a lecture.

It could, of course, be a mix of the three. One way to tell might be to look at the learning curves for each tag: how quickly do people get better at answering questions associated with this tag?
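
A rough sketch of what such a learning curve could look like: for each tag, average correctness as a function of how many times a user has practiced that tag so far. The interaction table isn't loaded in this notebook, so everything below is toy data, and the column names `user_id`, `timestamp` and `answered_correctly` are assumptions about what that table would contain.

```python
import pandas as pd

# Toy interaction log (made up): one row per answered question.
toy_log = pd.DataFrame({
    'user_id':            [1, 1, 1, 2, 2, 2],
    'timestamp':          [0, 10, 20, 0, 10, 20],
    'question_id':        [0, 1, 0, 1, 0, 1],
    'answered_correctly': [0, 1, 1, 0, 0, 1],
})
toy_question_tags = pd.DataFrame({'question_id': [0, 1], 'tags': [[5], [5]]})

# Attach tags to each answer (one row per answer/tag pair), in chronological order per user.
tagged = (
    toy_log.merge(toy_question_tags, on='question_id')
    .explode('tags')
    .sort_values(['user_id', 'timestamp'])
    .reset_index(drop=True)
)

# Opportunity number: how many times this user has practiced this tag so far (1-based).
tagged['opportunity'] = tagged.groupby(['user_id', 'tags']).cumcount() + 1

# Average correctness per tag at each opportunity number.
learning_curve = tagged.groupby(['tags', 'opportunity'])['answered_correctly'].mean()
print(learning_curve)
```

A tag whose curve rises quickly would look "learnable"; a flat curve would be a hint that the tag is something else.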

---

Our mystery 7-value one-hot encoding set of tags has a pretty interesting distribution of lectures:

In [None]:
tags_df[tags_df['parts']==(3,4)].sort_values('num_lectures')

The number of lectures seems to grow almost linearly from tag to tag. This might mean that these 7 values are actually some kind of "difficulty level" encoding, not a categorical variable.

To test that, I would also need to look at how hard the questions in each tag tend to be.
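
One way that test might go, sketched on toy data: compute the fraction of correct answers per tag, then see whether it moves in step with the number of lectures. All the values and names below (`toy_log`, `toy_group_tag`, `toy_num_lectures`) are made up, and `answered_correctly` is again an assumed column of the student data.

```python
import pandas as pd

# Toy data (made up): answers to questions, and the group tag each question carries.
toy_log = pd.DataFrame({
    'question_id':        [0, 0, 1, 1, 2, 2],
    'answered_correctly': [1, 1, 1, 0, 0, 0],
})
toy_group_tag = pd.DataFrame({'question_id': [0, 1, 2], 'tag': [10, 11, 12]})
toy_num_lectures = pd.Series({10: 1, 11: 2, 12: 3}, name='num_lectures')

# Fraction of correct answers per tag, next to that tag's lecture count.
per_tag = (
    toy_log.merge(toy_group_tag, on='question_id')
    .groupby('tag')['answered_correctly']
    .mean()
    .rename('fraction_correct')
    .to_frame()
    .join(toy_num_lectures)
)
print(per_tag)

# Pearson correlation of the ranks, i.e. a Spearman rank correlation.
rank_corr = per_tag['num_lectures'].rank().corr(per_tag['fraction_correct'].rank())
print(f'rank correlation: {rank_corr:.2f}')
```

A strongly negative rank correlation on the real data would be consistent with the "difficulty level" reading, though with only 7 tags it wouldn't prove much.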

# Future work

## Cognitive Modeling

As I mentioned at the beginning of this notebook, my goal is actually to use tags (or a modified set of tags) to model the cognitive state of the student.

I plan on making separate notebooks about each one, so I won't go into details, except to link again the two papers I'm looking at:
- [Learning Factor Analysis](https://scholar.google.com/scholar?cluster=3456251199517922701) 
- [Q-Matrix](https://scholar.google.com/scholar?cluster=8158178723780769138) 
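
As far as I understand, both approaches start from a Q-matrix: a binary question-by-skill table. If tags are taken as the skills, the starting Q-matrix falls straight out of the tag lists. Here is a minimal sketch on toy data (`toy_questions` and its values are made up):

```python
import pandas as pd

# Toy stand-in for questions_df with parsed tag lists (values are made up).
toy_questions = pd.DataFrame({
    'question_id': [0, 1, 2],
    'tags': [[1, 2], [2], [3]],
})

# One row per (question, tag) pair.
pairs = toy_questions.explode('tags')
pairs['tags'] = pairs['tags'].astype(int)

# Rows are questions, columns are tags; entry 1 means "this question exercises this tag".
q_matrix = pd.crosstab(pairs['question_id'], pairs['tags']).clip(upper=1)
print(q_matrix)
```

Refining the tag set would then amount to editing the columns of this matrix (splitting, merging or adding skills) and checking whether the student data fits better.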

## Combining questions/tags/lectures with student data

Aside from analyzing the learnability and/or difficulty of individual tags, there are several other things (of varying importance) I might like to explore that will require pulling in the actual student data:

- Are there specific questions and/or specific wrong answers to questions which seem to trigger lectures?
    - If so, are there some lectures that seem "interchangeable", in that they are equally likely to trigger after a certain question? Or are lectures that cover the same tag nevertheless triggered in different contexts, and therefore probably distinct in their content?
- Diagnostic questions: According to the dataset description, each new user is asked a series of diagnostic questions to determine their current level. 
    - Which questions or tags get asked the most in the diagnostic part? 
    - Is there a separate set of diagnostic questions, or are all questions used for both diagnosis and training? 
    - Is there a deterministic decision tree of what diagnostic question gets asked next, given the answers so far?
- Do different users answer roughly the same sequences of questions? What are common subsequences of questions?
- Do lectures help increase performance in some tags more than others?
- Are there tags that seem to represent specific types of misconceptions? Specifically, are there tags that correlate with clusters of co-occurring wrong answers?
- Are people more likely to view lectures for the listening part of the test? (Because they are already in a position to consume sound and video from their phone/computer - not, for example, in a quiet public place without headphones)
- Is there a lot of variation in lecture elapsed time? (might be a good way to infer lecture length, and/or whether people get bored with lectures and stop them)


# Output

I added some useful information, so I'm going to save the new dataframes (and tag relationship graph) for reuse.

This output is also [available as a dataset](https://www.kaggle.com/yanamal/riiid-questions-tags-lectures-expanded-metadata).

In [None]:
tag_graph = tag_relationship_graph(tags_df, return_graph=True)

In [None]:
from networkx.readwrite import json_graph
import json, numpy

def default_serialize_int64(o):
    if isinstance(o, numpy.integer): return int(o)
    raise TypeError

tag_json = json_graph.node_link_data(tag_graph)
with open('tag_relationships.json', 'w') as tag_file:
    json.dump(tag_json, tag_file, default=default_serialize_int64)

In [None]:
# put tags column back into space_separated string
questions_df['tags'] = questions_df['tags'].apply(lambda t_lst: ' '.join(str(t) for t in t_lst))

In [None]:
questions_df.to_csv('questions_and_tags.csv', index=False)
questions_df.head()


In [None]:
# put columns back into space_separated string
tags_df['questions'] = tags_df['questions'].apply(lambda t_lst: ' '.join(str(t) for t in t_lst))
tags_df['parts'] = tags_df['parts'].apply(lambda t_lst: ' '.join(str(t) for t in t_lst))
tags_df['lectures'] = tags_df['lectures'].apply(lambda t_lst: ' '.join(str(t) for t in t_lst))

In [None]:
tags_df.to_csv('tags.csv', index=False)
tags_df.head()
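
To reuse `tags.csv` later, the space-separated columns need to be parsed back into collections. Here is a sketch of how that might look, reading from an in-memory string with made-up rows instead of the real file; the helper `split_ints` is hypothetical.

```python
import io
import pandas as pd

# Toy stand-in for the contents of tags.csv (made-up rows, same columns as tags_df).
toy_csv = io.StringIO(
    'tag,questions,fraction_multitagged,parts,num_lectures,lectures\n'
    '51,0 7 12,1.0,1 2,2,100 205\n'
    '9,3,0.0,5,0,\n'
)


def split_ints(cell):
    """Turn a space-separated string back into a list of ints; empty cells give []."""
    return [int(x) for x in str(cell).split()]


# keep_default_na=False keeps empty cells as '' instead of NaN.
reloaded = pd.read_csv(toy_csv, keep_default_na=False)
for column in ['questions', 'parts', 'lectures']:
    reloaded[column] = reloaded[column].apply(split_ints)
reloaded['questions'] = reloaded['questions'].apply(set)
print(reloaded)
```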