<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Enter-a-Topic" data-toc-modified-id="Enter-a-Topic-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Enter a Topic</a></span></li><li><span><a href="#Build-a-Lexicon" data-toc-modified-id="Build-a-Lexicon-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Build a Lexicon</a></span></li><li><span><a href="#Search-for-Segments" data-toc-modified-id="Search-for-Segments-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Search for Segments</a></span></li><li><span><a href="#Validation" data-toc-modified-id="Validation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Validation</a></span><ul class="toc-item"><li><span><a href="#Assert-No-Double-Counting" data-toc-modified-id="Assert-No-Double-Counting-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Assert No Double Counting</a></span></li><li><span><a href="#Sensitivity-of-Total-Segment-Length-to-Window-Size" data-toc-modified-id="Sensitivity-of-Total-Segment-Length-to-Window-Size-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Sensitivity of Total Segment Length to Window Size</a></span></li><li><span><a href="#Sensitivity-of-Total-Segment-Length-to-Threshold" data-toc-modified-id="Sensitivity-of-Total-Segment-Length-to-Threshold-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Sensitivity of Total Segment Length to Threshold</a></span></li><li><span><a href="#Overlap-Between-Topics" data-toc-modified-id="Overlap-Between-Topics-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Overlap Between Topics</a></span></li></ul></li><li><span><a href="#Analysis" data-toc-modified-id="Analysis-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Analysis</a></span><ul class="toc-item"><li><span><a href="#Topic-by-Show" data-toc-modified-id="Topic-by-Show-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Topic by Show</a></span><ul class="toc-item"><li><span><a href="#Topic-by-Show-By-Year" 
data-toc-modified-id="Topic-by-Show-By-Year-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Topic by Show By Year</a></span></li><li><span><a href="#Topic-by-Show-By-Quarter" data-toc-modified-id="Topic-by-Show-By-Quarter-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span>Topic by Show By Quarter</a></span></li></ul></li><li><span><a href="#Multitopic-Comparison" data-toc-modified-id="Multitopic-Comparison-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Multitopic Comparison</a></span></li></ul></li></ul></div>

In [None]:
from esper.prelude import *
from esper.stdlib import *
from esper.topics import *

# Enter a Topic

In [None]:
topic = 'syria'

# Build a Lexicon

In [None]:
lexicon = mutual_info(topic)
lexicon

# Search for Segments

In [None]:
segments = find_segments(lexicon, window_size=100, threshold=50)

In [None]:
show_segments(segments)

# Validation

In [None]:
# Convert the total segment duration (a timedelta) to hours before printing
total_hours = get_total_segment_length(segments).total_seconds() / 3600
print('Coverage of "{}": {:0.2f} hrs'.format(topic, total_hours))

## Assert No Double Counting
Double counting can occur when more than one transcript file is loaded for the same video, so that the same airtime is matched more than once.

In [None]:
check_for_double_counting(segments)
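
To make the idea concrete, here is a minimal sketch of one way such a check could work. It is not the implementation of `check_for_double_counting`. It assumes, purely for illustration, that each segment is a `(video_id, start_seconds, end_seconds)` tuple; `find_overlapping_segments` is a hypothetical helper name.

```python
from collections import defaultdict


def find_overlapping_segments(segments):
    """Return (video_id, earlier, later) for segments of one video that overlap.

    Assumes each segment is a (video_id, start_seconds, end_seconds) tuple;
    this layout is an assumption, not the format used by esper.
    """
    by_video = defaultdict(list)
    for video_id, start, end in segments:
        by_video[video_id].append((start, end))

    overlaps = []
    for video_id, intervals in by_video.items():
        intervals.sort()
        prev = intervals[0]
        for cur in intervals[1:]:
            # After sorting by start, an overlap exists when the current
            # interval starts before the furthest-reaching earlier one ends.
            if cur[0] < prev[1]:
                overlaps.append((video_id, prev, cur))
            if cur[1] > prev[1]:
                prev = cur
    return overlaps


# Toy input: video 1 has two overlapping segments, video 2 has none.
example_segments = [(1, 0, 60), (1, 30, 90), (1, 120, 150), (2, 0, 10)]
example_overlaps = find_overlapping_segments(example_segments)
```

An empty result would mean that no airtime is counted twice under these assumptions.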

## Sensitivity of Total Segment Length to Window Size

We want to know how stable the total segment runtime is as the window size varies. Low variation indicates that the results are not sensitive to the choice of window size.

In [None]:
plot_total_segment_length_vs_window_size(
    lexicon,
    window_sizes=[10, 50, 100, 250, 500, 1000]
)
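
One way to turn "low variation" into a number is the coefficient of variation (standard deviation divided by mean) of the total runtimes across window sizes. The sketch below uses only the standard library. `coefficient_of_variation` is a hypothetical helper, and the runtimes are illustrative placeholders, not measured results.

```python
import statistics


def coefficient_of_variation(totals):
    """Population standard deviation divided by the mean of the totals."""
    mean = statistics.mean(totals)
    if mean == 0:
        raise ValueError('mean of totals is zero; variation is undefined')
    return statistics.pstdev(totals) / mean


# Placeholder total runtimes in hours, one per window size (not real data).
placeholder_totals = [8.0, 12.0]
cv = coefficient_of_variation(placeholder_totals)
print('Coefficient of variation: {:0.2f}'.format(cv))
```

A value near zero would suggest the total runtime barely changes with the window size. What counts as "low enough" is a judgment call that depends on the analysis.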

## Sensitivity of Total Segment Length to Threshold

We want to know how stable the total segment runtime is as the threshold varies. Low variation indicates that the results are not sensitive to the choice of threshold.

In [None]:
plot_total_segment_length_vs_threshold(
    lexicon,
    thresholds=[5, 10, 25, 50, 75, 100, 200]
)
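
To see where the total runtime is stable rather than only whether it is, it can help to look at the relative change between consecutive threshold values. The sketch below assumes the totals have already been collected into a list aligned with the thresholds. `relative_changes` is a hypothetical helper and the numbers are illustrative placeholders, not measured results.

```python
def relative_changes(params, totals):
    """Return (param_from, param_to, fractional_change) for adjacent settings."""
    if len(params) != len(totals):
        raise ValueError('params and totals must have the same length')
    changes = []
    for i in range(1, len(params)):
        prev = totals[i - 1]
        # A zero baseline has no meaningful relative change.
        change = (totals[i] - prev) / prev if prev else float('nan')
        changes.append((params[i - 1], params[i], change))
    return changes


# Placeholder totals in hours for three thresholds (not real data).
example_changes = relative_changes([5, 10, 25], [100.0, 50.0, 50.0])
for low, high, change in example_changes:
    print('{} -> {}: {:+.0%}'.format(low, high, change))
```

A run of small changes marks a range of thresholds over which the total segment length is roughly constant.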

## Overlap Between Topics

Some topics are subtopics of others. For instance, we expect "affordable care act" to be a subtopic of "healthcare", so most of its segments should fall inside "healthcare" segments. This section prints the segment overlap between topics.

In [None]:
related_topics = ['isis', 'terrorism', 'middle east', 'islam']
unrelated_topics = ['baseball', 'healthcare', 'taxes']

In [None]:
# Compare the main topic against both related and unrelated topics
topics = [topic] + related_topics + unrelated_topics
assert len(topics) > 1
topic_overlap = get_overlap_between_topics(
    topics,
    window_size=250
)
topic_overlap
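
For intuition about what an overlap figure means, here is a minimal sketch of overlap between two topics' segments. It is not how `get_overlap_between_topics` is implemented. It assumes segments are `(video_id, start_seconds, end_seconds)` tuples and that segments within one topic do not overlap each other; both helper names are hypothetical.

```python
from collections import defaultdict


def overlap_seconds(segments_a, segments_b):
    """Total seconds during which a segment of A and a segment of B coincide."""
    by_video_b = defaultdict(list)
    for video_id, start, end in segments_b:
        by_video_b[video_id].append((start, end))

    total = 0
    for video_id, start_a, end_a in segments_a:
        for start_b, end_b in by_video_b.get(video_id, []):
            # Length of the intersection of two intervals, or 0 if disjoint.
            total += max(0, min(end_a, end_b) - max(start_a, start_b))
    return total


def overlap_fraction(segments_a, segments_b):
    """Fraction of A's total runtime that also falls inside B's segments."""
    length_a = sum(end - start for _, start, end in segments_a)
    if length_a == 0:
        return 0.0
    return overlap_seconds(segments_a, segments_b) / length_a


# Toy segments: the narrower topic sits entirely inside the broader one.
narrow = [(1, 100, 200)]
broad = [(1, 50, 250), (2, 0, 100)]
narrow_in_broad = overlap_fraction(narrow, broad)
broad_in_narrow = overlap_fraction(broad, narrow)
```

The measure is asymmetric: a subtopic should be almost fully covered by its parent topic, while the parent is only partly covered by the subtopic.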

# Analysis

## Topic by Show

In [None]:
topic_time_by_show = get_topic_time_by_show(segments)
plot_topic_time_by_show(topic, topic_time_by_show)

### Topic by Show By Year

In [None]:
plot_topic_by_show_over_time(topic, segments)

### Topic by Show By Quarter

In [None]:
plot_topic_by_show_over_time(topic, segments, quarters=True)

## Multitopic Comparison

In [None]:
topics_to_compare = ['healthcare', 'election', 'email', 'immigration']

In [None]:
topics = [topic] + topics_to_compare
assert len(topics) > 1

def plot_topic_comparison_by_show(topics, window_size=250, threshold=50):
    """Plot the time each show spends on every topic in `topics`."""
    topic_times_by_show = []
    for topic in topics:
        # Build a lexicon per topic, then find and aggregate its segments
        lexicon = mutual_info(topic)
        segments = find_segments(lexicon, window_size=window_size, threshold=threshold)
        topic_times_by_show.append(get_topic_time_by_show(segments))
    plot_topic_time_by_show(topics, topic_times_by_show)
    
plot_topic_comparison_by_show(topics)