<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Querying-for-Interviews-with-Person-X" data-toc-modified-id="Querying-for-Interviews-with-Person-X-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Querying for Interviews with Person X</a></span><ul class="toc-item"><li><span><a href="#Interviews-with-Bernie-Sanders" data-toc-modified-id="Interviews-with-Bernie-Sanders-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Interviews with Bernie Sanders</a></span></li><li><span><a href="#Interviews-with-Kellyanne-Conway-and-John-McCain" data-toc-modified-id="Interviews-with-Kellyanne-Conway-and-John-McCain-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Interviews with Kellyanne Conway and John McCain</a></span></li></ul></li></ul></div>

In [None]:
# Imports. Run this first!

from query.models import LabeledInterview, LabeledPanel, LabeledCommercial, Video, FaceIdentity
from esper.rekall import *
from rekall.temporal_predicates import *
from rekall.spatial_predicates import *
from rekall.interval_list import IntervalList
from esper.prelude import esper_widget
#from esper.captions import topic_search
from django.db.models import F, FloatField

sandbox_videos = [529, 763, 2648, 3459, 3730, 3769, 3952, 4143, 4611, 5281, 6185, 7262, 8220,
    8697, 8859, 9215, 9480, 9499, 9901, 10323, 10335, 11003, 11555, 11579, 11792,
    12837, 13058, 13141, 13247, 13556, 13827, 13927, 13993, 14482, 15916, 16215,
    16542, 16693, 16879, 17458, 17983, 19882, 19959, 20380, 20450, 23181, 23184,
    24193, 24847, 24992, 25463, 26386, 27188, 27410, 29001, 31378, 32472, 32996,
    33004, 33387, 33541, 33800, 34359, 34642, 36755, 37107, 37113, 37170, 38275,
    38420, 40203, 40856, 41480, 41725, 42756, 45472, 45645, 45655, 45698, 48140,
    49225, 49931, 50164, 50561, 51175, 52075, 52749, 52945, 53355, 53684, 54377,
    55711, 57384, 57592, 57708, 57804, 57990, 59122, 59398, 60186]

# Querying for Interviews with Person X

We have an annotated sandbox of panels, interviews, and commercials (`query_labeledpanel`, `query_labeledinterview`, `query_commercial`). In this notebook we'll try to write queries for these concepts and compare the results against our labels.

A few notes about the panels:
* Panel segments go from the introduction of panelists to the shot where the host says "thank you" or "goodbye" to the panelists. Sometimes the camera will cut wide and show all the panelists before a commercial break; such shots are *not* included in the labeled segments.
* Panel segments are split up by commercials (i.e., if the same panel appears before and after a commercial break, that will be two panel segments).
* Panel segments do not include segments where the host is just cutting to multiple reporters out in the field to cover some news story.

A few notes about the interviews. See the third bullet point in particular.
* Interview segments go from the first shot of the guest to where the host thanks the guest. Sometimes the host thanks the guest while the guest is on screen, and sometimes the host thanks the guest off-screen. In the former case, the segment continues until the guest is no longer on screen; in the latter case, the segment stops when the host changes the subject after thanking the guest.
* Sometimes the host doesn't thank the guest; in this case, the segment ends when the guest is no longer on screen or when the host changes the subject.
* **Interviews with analysts and correspondents from the same network are *not* included. This is to differentiate between the typical interview and "interviews" where the guest is just presenting a news segment.**
* Interviews with reporters from the same network are also usually not included, *unless* the format of the interview is sufficiently different from the typical "here's a reporter to tell you the news" format. This is a judgment call on my (Dan Fu) part.

This dataset also includes extra annotation of interviews with Kellyanne Conway, Bernie Sanders, and John McCain.
* These interview segments include **any** segment where Kellyanne Conway, Bernie Sanders, or John McCain appears and is being interviewed. This includes segments during commercials and short clips where an interview with one of them is replayed on another channel or show.
* These segments also include clips where the guest appears for a few seconds as a "preview" before a commercial break.
* Each of these segments is annotated with the name of the guest(s) and interviewer(s).

A few notes about the commercials.
* Commercial segments go from the beginning of the commercial break to the end of the commercial break. Sometimes networks insert a short network segment to let viewers know that the commercial break is ending (think of the "this is CNN, the most trusted name in news" segment). Such segments are included in the commercial segments.

Let's define some statistics. Suppose we have a set of intervals `query` that represent our query, and a set of intervals `ground_truth` that represent our ground truth. We are interested in four statistics:
* **precision**: The standard definition of precision, computed over interval durations: `sum(overlap(query, ground_truth)) / sum(query)`
* **recall**: The standard definition of recall, computed over interval durations: `sum(overlap(query, ground_truth)) / sum(ground_truth)`
* **precision_per_item**: We may also be interested in *how many* segments we hit. What fraction of the segments in `query` overlap with *any* segment in `ground_truth`? This is `count(query segments that overlap ground_truth) / count(query)`.
* **recall_per_item**: The analogue of precision_per_item. What fraction of the segments in `ground_truth` overlap with *any* segment in `query`? This is `count(ground_truth segments that overlap query) / count(ground_truth)`.
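
As a quick sanity check of these definitions, here is a small worked example in plain Python. It does not use rekall; the helper names (`total_time`, `overlap`, `count_hits`) and the interval values are made up for illustration, and each list is assumed to contain non-overlapping intervals from a single video.

```python
# Toy intervals as (start, end) pairs; the values are invented for illustration.
def total_time(intervals):
    """Sum of interval durations (assumes the intervals do not overlap)."""
    return sum(end - start for start, end in intervals)

def overlap(a, b):
    """Pairwise intersections between two interval lists."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return out

def count_hits(a, b):
    """Number of intervals in `a` that overlap at least one interval in `b`."""
    return sum(1 for x in a if overlap([x], b))

query = [(0, 10), (20, 30), (50, 60)]
ground_truth = [(5, 25), (40, 45)]

overlap_time = total_time(overlap(query, ground_truth))  # (5, 10) and (20, 25)
precision = overlap_time / total_time(query)             # 10 / 30
recall = overlap_time / total_time(ground_truth)         # 10 / 25
precision_per_item = count_hits(query, ground_truth) / len(query)      # 2 / 3
recall_per_item = count_hits(ground_truth, query) / len(ground_truth)  # 1 / 2
```

Note that the two per-item statistics have different numerators: one counts query segments that hit the ground truth, the other counts ground-truth segments that are hit by the query.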

In [None]:
# Returns precision, recall, precision_per_item, recall_per_item
def compute_statistics(query_intrvllists, ground_truth_intrvllists):
    total_query_time = 0
    total_query_segments = 0
    total_ground_truth_time = 0
    total_ground_truth_segments = 0
    
    for video in query_intrvllists:
        total_query_time += query_intrvllists[video].coalesce().get_total_time()
        total_query_segments += query_intrvllists[video].size()
    for video in ground_truth_intrvllists:
        total_ground_truth_time += ground_truth_intrvllists[video].coalesce().get_total_time()
        total_ground_truth_segments += ground_truth_intrvllists[video].size()
        
    total_overlap_time = 0
    overlapping_query_segments = 0
    overlapping_ground_truth_segments = 0
    
    for video in query_intrvllists:
        if video in ground_truth_intrvllists:
            query_list = query_intrvllists[video]
            gt_list = ground_truth_intrvllists[video]
            
            total_overlap_time += query_list.overlaps(gt_list).coalesce().get_total_time()
            overlapping_query_segments += query_list.filter_against(gt_list, predicate=overlaps()).size()
            overlapping_ground_truth_segments += gt_list.filter_against(query_list, predicate=overlaps()).size()
    
    if total_query_time == 0:
        precision = 1.0
        precision_per_item = 1.0
    else:
        precision = total_overlap_time / total_query_time
        precision_per_item = overlapping_query_segments / total_query_segments
    
    if total_ground_truth_time == 0:
        recall = 1.0
        recall_per_item = 1.0
    else:
        recall = total_overlap_time / total_ground_truth_time
        recall_per_item = overlapping_ground_truth_segments / total_ground_truth_segments
    
    return precision, recall, precision_per_item, recall_per_item

def print_statistics(query_intrvllists, ground_truth_intrvllists):
    precision, recall, precision_per_item, recall_per_item = compute_statistics(
        query_intrvllists, ground_truth_intrvllists)

    print("Precision: ", precision)
    print("Recall: ", recall)
    print("Precision Per Item: ", precision_per_item)
    print("Recall Per Item: ", recall_per_item)

First, let's just visualize all the labeled data. Interviews are in red, panels are in blue, and commercials are in purple.

In [None]:
interviews = LabeledInterview.objects \
        .annotate(fps=F('video__fps')) \
        .annotate(min_frame=F('fps') * F('start')) \
        .annotate(max_frame=F('fps') * F('end'))
panels = LabeledPanel.objects \
        .annotate(fps=F('video__fps')) \
        .annotate(min_frame=F('fps') * F('start')) \
        .annotate(max_frame=F('fps') * F('end'))
commercials = LabeledCommercial.objects \
        .annotate(fps=F('video__fps')) \
        .annotate(min_frame=F('fps') * F('start')) \
        .annotate(max_frame=F('fps') * F('end'))

result = intrvllists_to_result(qs_to_intrvllists(interviews))
add_intrvllists_to_result(result, qs_to_intrvllists(panels), color="blue")
add_intrvllists_to_result(result, qs_to_intrvllists(commercials), color="purple")

esper_widget(result)

## Interviews with Bernie Sanders

In [None]:
# Let's get all interviews of Bernie Sanders in our dataset and display them in black.
# For this task, we won't display any interviews that weren't original appearances.

bernie_interviews = LabeledInterview.objects \
        .annotate(fps=F('video__fps')) \
        .annotate(min_frame=F('fps') * F('start')) \
        .annotate(max_frame=F('fps') * F('end')) \
        .filter(guest1="bernie sanders")

bernie_interviews_intrvllists = qs_to_intrvllists(bernie_interviews)
bernie_interviews_original_intrvllists = qs_to_intrvllists(bernie_interviews.filter(original=True))

# Hide result in a function for namespace reasons
def get_result():
    result = intrvllists_to_result(bernie_interviews_original_intrvllists, color='black')

    return result

esper_widget(get_result(), show_middle_frame=True)

In [None]:
# Helper function to get results with ground truth

def result_with_ground_truth(intrvllists):
    result = intrvllists_to_result(bernie_interviews_original_intrvllists, color='black')
    add_intrvllists_to_result(result, intrvllists, color='red')
    return result

In [None]:
# Let's query for Bernie Sanders interviews. This may take a while to materialize all the data.

identities = FaceIdentity.objects.filter(face__shot__video_id__in=sandbox_videos)
hosts = identities.filter(face__is_host=True)
sanders = identities.filter(identity__name="bernie sanders").filter(probability__gt=0.7)

hosts_intrvllists = qs_to_intrvllists(hosts
    .annotate(video_id=F("face__shot__video_id"))
    .annotate(min_frame=F("face__shot__min_frame"))
    .annotate(max_frame=F("face__shot__max_frame")))
sanders_intrvllists = qs_to_intrvllists(sanders
    .annotate(video_id=F("face__shot__video_id"))
    .annotate(min_frame=F("face__shot__min_frame"))
    .annotate(max_frame=F("face__shot__max_frame")))

In [None]:
# Get all shots with Bernie Sanders and a host
sanders_with_host_intrvllists = {}
for video in sanders_intrvllists:
    if video in hosts_intrvllists:
        sanders_with_host_intrvllists[video] = sanders_intrvllists[video].overlaps(hosts_intrvllists[video]).coalesce()

print_statistics(sanders_with_host_intrvllists, bernie_interviews_original_intrvllists)

esper_widget(result_with_ground_truth(sanders_with_host_intrvllists))

What do we get from those statistics? We cover only about half of the total interview time, but we hit every interview segment at least once. This tells us that part of our problem is that we are not coalescing well enough. We also have a problem where half of our query segments do *not* hit an interview, so we need to cull some. Let's try something else.
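
The next cell stitches each (Bernie Sanders + host) shot to a host-only or Sanders-only shot that sits immediately before or after it. The plain-Python sketch below illustrates that idea on toy data. It is only an approximation of what rekall's `merge` with `before`/`after` predicates does: the helpers `before`, `merge_adjacent`, and `coalesce` are written here for illustration, and the frame numbers are invented.

```python
def before(a, b, max_dist):
    """True if `a` ends at or before `b` starts, with a gap of at most max_dist."""
    return a[1] <= b[0] and b[0] - a[1] <= max_dist

def merge_adjacent(xs, ys, max_dist=10):
    """Span of every (x, y) pair where x is just before or just after y."""
    out = []
    for x in xs:
        for y in ys:
            if before(x, y, max_dist) or before(y, x, max_dist):
                out.append((min(x[0], y[0]), max(x[1], y[1])))
    return out

def coalesce(intervals):
    """Merge overlapping or touching intervals into maximal runs."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# Invented frame ranges for one video.
sanders_with_host = [(100, 200)]
host_only = [(205, 300), (500, 600)]  # the second shot is too far away to merge
sanders_only = [(40, 95)]

candidates = coalesce(
    merge_adjacent(sanders_with_host, host_only)
    + merge_adjacent(sanders_with_host, sanders_only))
print(candidates)
```

In this toy case the three neighbouring shots collapse into one candidate segment, while the distant host shot at frames 500 to 600 is ignored.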

In [None]:
'''
We're going to look for the following patterns:
    (Bernie Sanders + host) -> host OR
    host -> (Bernie Sanders + host) OR
    (Bernie Sanders + host) -> Bernie Sanders OR
    Bernie Sanders -> (Bernie Sanders + host)

We'll coalesce that, and then check in with the Esper widget again.
'''
sanders_interview_intrvllists = {}
for video in sanders_with_host_intrvllists:
    sanders_with_host = sanders_with_host_intrvllists[video]
    hosts = hosts_intrvllists[video]
    sanders = sanders_intrvllists[video]
    
    sanders_interview_intrvllists[video] = sanders_with_host.merge(
        hosts, predicate=or_pred(before(max_dist=10), after(max_dist=10))).set_union(
        sanders_with_host.merge(sanders, predicate=or_pred(before(max_dist=10), after(max_dist=10)))
    ).coalesce()

print_statistics(sanders_interview_intrvllists, bernie_interviews_original_intrvllists)

esper_widget(result_with_ground_truth(sanders_interview_intrvllists))

We're much closer to getting all the interviews, but we still have some large gaps. Let's try to see what's going on.

In [None]:
investigation_result = intrvllists_to_result(bernie_interviews_original_intrvllists, color='black')
add_intrvllists_to_result(investigation_result, sanders_with_host_intrvllists, color='orange')
add_intrvllists_to_result(investigation_result, sanders_intrvllists, color='blue')
add_intrvllists_to_result(investigation_result, sanders_interview_intrvllists, color='red')

esper_widget(investigation_result)

There are some gaps because of consecutive Bernie Sanders or host shots. Let's dilate and coalesce those and have another go at that.
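
Dilating, coalescing, and then shrinking back by the same amount closes small gaps between consecutive shots without changing the outer boundaries. The sketch below shows the effect on toy intervals in plain Python; `dilate`, `coalesce`, and `close_gaps` are illustrative stand-ins rather than rekall's implementation, and they assume that dilation extends both ends of every interval.

```python
def dilate(intervals, n):
    """Grow (or, for negative n, shrink) every interval by n on both sides."""
    return [(s - n, e + n) for s, e in intervals]

def coalesce(intervals):
    """Merge overlapping or touching intervals into maximal runs."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def close_gaps(intervals, n):
    """Dilate by n, coalesce, then shrink back by n."""
    return dilate(coalesce(dilate(intervals, n)), -n)

# Invented shots: a 5-frame gap and a 60-frame gap.
shots = [(0, 100), (105, 200), (260, 300)]
closed = close_gaps(shots, 10)
print(closed)
```

With `n = 10`, the 5-frame gap is closed and the 60-frame gap survives, because only gaps of up to about `2 * n` frames disappear when both neighbours grow by `n`.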

In [None]:
sanders_interview_consec_intrvllists = {}
for video in sanders_with_host_intrvllists:
    sanders_with_host = sanders_with_host_intrvllists[video]
    hosts = hosts_intrvllists[video].dilate(10).coalesce().dilate(-10)
    sanders = sanders_intrvllists[video].dilate(10).coalesce().dilate(-10)
    
    sanders_interview_consec_intrvllists[video] = sanders_with_host.merge(
        hosts, predicate=or_pred(or_pred(overlaps(), before(max_dist=10)), after(max_dist=10))).set_union(
        sanders_with_host.merge(sanders, predicate=or_pred(or_pred(overlaps(), before(max_dist=10)), after(max_dist=10)))
    ).coalesce()

print_statistics(sanders_interview_consec_intrvllists, bernie_interviews_original_intrvllists)

esper_widget(result_with_ground_truth(sanders_interview_consec_intrvllists))

In [None]:
# Close gaps between nearby candidate segments (dilate, coalesce, shrink back),
# then drop any segment shorter than 1350 frames.
sanders_interview_filtered_intrvllists = {}
for video in sanders_interview_consec_intrvllists:
    sanders_interview = sanders_interview_consec_intrvllists[video]
    
    sanders_interview_filtered_intrvllists[video] = sanders_interview \
        .dilate(600) \
        .coalesce() \
        .dilate(-600) \
        .filter_length(min_length=1350)

print_statistics(sanders_interview_filtered_intrvllists, bernie_interviews_original_intrvllists)

esper_widget(result_with_ground_truth(sanders_interview_filtered_intrvllists))

That looks pretty good. We still have some false positives, but it's hard to get rid of those with what we have right now. Let's summarize what we did:

In [None]:
# Show multiple stages of our query process all in one timeline.
summarize_bernie_result = intrvllists_to_result(bernie_interviews_original_intrvllists, color='black')
add_intrvllists_to_result(summarize_bernie_result, sanders_with_host_intrvllists, color='orange')
add_intrvllists_to_result(summarize_bernie_result, sanders_intrvllists, color='blue')
add_intrvllists_to_result(summarize_bernie_result, hosts_intrvllists, color='purple')
add_intrvllists_to_result(summarize_bernie_result, sanders_interview_intrvllists, color='green')
add_intrvllists_to_result(summarize_bernie_result, sanders_interview_consec_intrvllists, color='brown')
add_intrvllists_to_result(summarize_bernie_result, sanders_interview_filtered_intrvllists, color='red')

esper_widget(summarize_bernie_result)

## Interviews with Kellyanne Conway and John McCain

Now that we have a simple query for interviews with Bernie Sanders, let's do the same thing for Kellyanne Conway and John McCain, using our best method from before.

In [None]:
# ground truth for interviews where guest 1 is X
def ground_truth_interviews_intrvllists(name, original=True):
    interviews = LabeledInterview.objects \
        .annotate(fps=F('video__fps')) \
        .annotate(min_frame=F('fps') * F('start')) \
        .annotate(max_frame=F('fps') * F('end')) \
        .filter(guest1=name)
    if original:
        interviews = interviews.filter(original=original)
    return qs_to_intrvllists(interviews)

# intrvllists for shots with a face with identity X
def named_person_intrvllists(name):
    person = identities.filter(identity__name=name).filter(probability__gt=0.7)
    
    return qs_to_intrvllists(person
        .annotate(video_id=F("face__shot__video_id"))
        .annotate(min_frame=F("face__shot__min_frame"))
        .annotate(max_frame=F("face__shot__max_frame")))

# helper function to get hosts
def host_intrvllists():
    host = identities.filter(face__is_host=True)

    return qs_to_intrvllists(host
        .annotate(video_id=F("face__shot__video_id"))
        .annotate(min_frame=F("face__shot__min_frame"))
        .annotate(max_frame=F("face__shot__max_frame")))

# query for interviews of person X
def interview_query(person_intrvllists, host_intrvllists):
    interview_intrvllists = {}
    for video in person_intrvllists:
        if video not in host_intrvllists:
            continue
        person = person_intrvllists[video]
        host = host_intrvllists[video]
        person_with_host = person.overlaps(host).coalesce()
        
        overlaps_before_or_after_pred = or_pred(or_pred(
            overlaps(), before(max_dist=10)), after(max_dist=10))
        
        interview_candidates = person_with_host \
            .merge(host, predicate=overlaps_before_or_after_pred) \
            .set_union(person_with_host.merge(
                person, predicate=overlaps_before_or_after_pred)) \
            .coalesce()
        
        interviews_filtered = interview_candidates \
            .dilate(600) \
            .coalesce() \
            .dilate(-600) \
            .filter_length(min_length=1350)
        
        if interviews_filtered.size() > 0:
            interview_intrvllists[video] = interviews_filtered
    
    return interview_intrvllists

In [None]:
# Helper function to do all the above in one call

def summarize_named_interview(name, original=True):
    gt = ground_truth_interviews_intrvllists(name, original)
    person = named_person_intrvllists(name)
    person_interviews = interview_query(person, hosts_intrvllists)

    print_statistics(person_interviews, gt)

    summarize_result = intrvllists_to_result(gt, color='black')
    add_intrvllists_to_result(summarize_result, person, color='blue')
    add_intrvllists_to_result(summarize_result, hosts_intrvllists, color='purple')
    add_intrvllists_to_result(summarize_result, person_interviews, color='red')

    return summarize_result

In [None]:
# This will take a while to materialize some of the data

kellyanne_result = summarize_named_interview("kellyanne conway")
esper_widget(kellyanne_result)

In [None]:
# This will take a while to materialize some of the data

mccain_result = summarize_named_interview("john mccain")
esper_widget(mccain_result)

In summary, the precision for Kellyanne Conway is pretty good (95%), but it is not as good for John McCain. The main reason is that there are very few interviews with John McCain, so any false positive (there is one) throws the precision off considerably. There is also a big false negative for John McCain: an interview with Jake Tapper that was played verbatim on CNN, but not on Jake Tapper's show. This false negative occurs because Jake Tapper's face wasn't registered as a host during that airing.