# Internet Archive TV news analysis <a class="tocSkip"></a>
This document contains the code and corresponding visualizations/statistics for answering various questions about the TV news dataset.

Note that "the dataset" currently refers to 100 1-hour videos (50 from CNN and 50 from FOX) plus 120 10-minute videos (60 from CNN and 60 from FOX).

In [1]:
%matplotlib inline
from query.datasets.prelude import *

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Gender" data-toc-modified-id="Gender-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Gender</a></span><ul class="toc-item"><li><span><a href="#Detector-accuracy" data-toc-modified-id="Detector-accuracy-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Detector accuracy</a></span></li><li><span><a href="#Male-vs.-female-face-instances" data-toc-modified-id="Male-vs.-female-face-instances-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Male vs. female face instances</a></span></li><li><span><a href="#Male-vs.-female-face-instances-across-channels" data-toc-modified-id="Male-vs.-female-face-instances-across-channels-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Male vs. female face instances across channels</a></span></li><li><span><a href="#Male-vs.-female-face-instances-across-shows" data-toc-modified-id="Male-vs.-female-face-instances-across-shows-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Male vs. female face instances across shows</a></span></li><li><span><a href="#Male-vs.-female-face-instances-across-time-of-day" data-toc-modified-id="Male-vs.-female-face-instances-across-time-of-day-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Male vs. female face instances across time of day</a></span></li></ul></li></ul></div>

<hr />
# Gender
These queries analyze the distribution of men vs. women along several axes. Faces are detected by [MTCNN](https://github.com/kpzhang93/MTCNN_face_detection_alignment/) and gender is classified by [rude-carnie](https://github.com/dpressel/rude-carnie). To exclude people in the background, we only consider faces whose height is at least 20% of the frame height. Face detection was run at 2 frames per second on all videos, so each detected face counts as 0.5 seconds of screen time. Reported screen times are summed per face rather than per frame: a frame containing 2 women contributes 1 second of female screen time.
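As a rough illustration of the counting rule above, the sketch below applies the height filter and converts face instances to screen time on plain tuples. It assumes bounding-box coordinates are normalized to the range 0 to 1 (inferred from the 0.2 threshold); the names `is_foreground` and `screen_time_seconds` are hypothetical and do not come from the notebook's prelude.

```python
SAMPLE_FPS = 2.0   # detection rate stated in the text
MIN_HEIGHT = 0.2   # minimum face height as a fraction of the frame (assumed normalized)

def is_foreground(y1, y2, min_height=MIN_HEIGHT):
    """Return True if the face is tall enough to count as a foreground person."""
    return (y2 - y1) >= min_height

def screen_time_seconds(num_face_instances, sample_fps=SAMPLE_FPS):
    """Each detected face instance contributes 1 / sample_fps seconds."""
    return num_face_instances / sample_fps

# Hypothetical (y1, y2) pairs for three detected faces.
faces = [(0.10, 0.45), (0.30, 0.40), (0.20, 0.70)]
foreground = [f for f in faces if is_foreground(*f)]

# Two foreground faces at 2 frames per second add up to 1 second.
total_seconds = screen_time_seconds(len(foreground))
```

Because time is summed per face, the total screen time across genders can exceed the video length whenever frames contain more than one foreground face.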

In [2]:
rudecarnie = Labeler.objects.get(name='rudecarnie')
mtcnn = Labeler.objects.get(name='mtcnn')

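def format_time(seconds):
    # Format a duration in seconds as H:MM:SS. Casting to int and using floor
    # division keeps this correct under both Python 2 and Python 3.
    seconds = int(seconds)
    return '{}:{:02d}:{:02d}'.format(seconds // 3600, seconds // 60 % 60, seconds % 60)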

def filter_gender(face_filter, video_filter):
    df = pd.DataFrame(list(Gender.objects \
        .annotate(count=Subquery(
            face_filter(FaceGender.objects \
            .filter(gender=OuterRef('pk'), labeler=rudecarnie, face__labeler=mtcnn) \
            .annotate(height=F('face__bbox_y2') - F('face__bbox_y1')) \
            .filter(height__gte=0.2)) \
            .values('gender').annotate(count=Count('*')) \
            .values('count'), models.IntegerField())).values()))
    row = {
        'male': int(df.loc[df['name'] == 'male']['count'].values[0]),
        'female': int(df.loc[df['name'] == 'female']['count'].values[0]),
    }
    total = float(row['male'] + row['female'])
    row['male_percent'] = '{:.0f}%'.format(row['male'] / total * 100)
    row['male_screentime'] = format_time(row['male'] / 2)
    row['female_percent'] = '{:.0f}%'.format(row['female'] / total * 100)
    row['female_screentime'] = format_time(row['female'] / 2)

    total_length = int(sum([v['length'] for v in video_filter(Video.objects) \
        .annotate(length=Sum(Cast(F('num_frames'), models.FloatField()) / F('fps'))) \
        .values('length')]))
    row['length'] = format_time(total_length)
    
    return row
   

ordering = ['length', 'male', 'male_percent', 'male_screentime', 'female', 'female_percent', 'female_screentime']

## Detector accuracy
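Hand labels cover only the "main person in the frame" (the instruction given to labelers), so face precision is expected to be low: any additional face the detector finds in a frame is counted as a false positive even when it is a real face.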
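The evaluation matches automatic and hand-labeled boxes by intersection over union (IoU) with a threshold of 0.5. The sketch below shows that matching on plain `(x1, y1, x2, y2)` tuples. It is a minimal stand-in, not the `bbox_iou` helper imported from the prelude, and the names `iou` and `precision_recall` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(auto_boxes, hand_boxes, threshold=0.5):
    """Count an automatic box as a true positive if it overlaps any hand box."""
    tp = sum(1 for a in auto_boxes
             if any(iou(a, h) > threshold for h in hand_boxes))
    fp = len(auto_boxes) - tp
    fn = sum(1 for h in hand_boxes
             if not any(iou(a, h) > threshold for a in auto_boxes))
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall

# One hand-labeled "main person"; the detector also finds a second face.
hand = [(0.1, 0.1, 0.5, 0.5)]
auto = [(0.1, 0.1, 0.5, 0.5), (0.6, 0.6, 0.9, 0.9)]
precision, recall = precision_recall(auto, hand)
```

In this toy frame the second detection has no hand label to match, so precision drops to one half while recall stays at 1, which is the effect described above.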

In [3]:
face_labeler = Labeler.objects.get(name='mtcnn')
hand_labeler = Labeler.objects.get(name='handlabeled')
gender_labeler = Labeler.objects.get(name='rudecarnie')

face_tp = 0
face_fp = 0
face_fn = 0

gender_t = 0
gender_f = 0

# IDs of the videos that contain at least one hand-labeled face.
handlabeled = [t['person__frame__video__id'] for t in Face.objects \
    .filter(labeler=hand_labeler) \
    .values('person__frame__video__id') \
    .distinct('person__frame__video__id')]

for i, video in enumerate(Video.objects.filter(id__in=handlabeled)):
    frames_with_faces = Frame.objects \
        .filter(video=video) \
        .annotate(c=Subquery(
            Face.objects.filter(person__frame=OuterRef('pk')) \
            .values('person__frame') \
            .annotate(c=Count('*')).values('c'))) \
        .filter(c__gt=0)
    #print(i, video.path)
    for frame in frames_with_faces:
        handlabeled_faces = list(Face.objects.filter(person__frame=frame, labeler=hand_labeler))
        autolabeled_faces = list(Face.objects.filter(person__frame=frame, labeler=face_labeler))
        
        for autoface in autolabeled_faces:
            good = np.where(np.array([bbox_iou(autoface, handface) > 0.5 for handface in handlabeled_faces]))
            index = good[0][0] if len(good[0]) > 0 else None
            if index is not None:
                face_tp += 1
                auto_gender = FaceGender.objects.get(face=autoface, labeler=gender_labeler)
                hand_gender = FaceGender.objects.get(face=handlabeled_faces[index])
                if auto_gender.gender == hand_gender.gender:
                    gender_t += 1
                else:
                    gender_f += 1
            else:
                face_fp += 1
            
        for handface in handlabeled_faces:
            good = any([bbox_iou(autoface, handface) > 0.5 for autoface in autolabeled_faces])
            if not good:
                face_fn += 1
    
print('Face precision: {:.2f}'.format(face_tp / float(face_tp + face_fp)))
print('Face recall: {:.2f}'.format(face_tp / float(face_tp + face_fn)))
print('Gender accuracy: {:.2f}'.format(gender_t / float(gender_t + gender_f)))

Face precision: 0.51
Face recall: 0.94
Gender accuracy: 0.94


<hr />
## Male vs. female face instances
Count the total number of face instances for each gender (male or female) across the whole dataset.
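For reference, the per-gender shares and screen times in the resulting table follow from the two raw counts alone. The sketch below reproduces that arithmetic on plain integers, assuming the 2 frames per second sampling rate described earlier. The function `gender_share` is hypothetical, and unlike `filter_gender` it guards against an empty selection.

```python
def gender_share(male, female, sample_fps=2.0):
    """Turn raw face-instance counts into percentages and seconds of screen time."""
    total = male + female

    def percent(n):
        # Avoid dividing by zero when a selection contains no faces at all.
        return '{:.0f}%'.format(100.0 * n / total) if total else 'n/a'

    return {
        'male_percent': percent(male),
        'female_percent': percent(female),
        'male_seconds': male / sample_fps,
        'female_seconds': female / sample_fps,
    }

# Hypothetical counts: 3 male and 1 female face instances.
summary = gender_share(3, 1)
empty = gender_share(0, 0)
```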

In [4]:
pd.DataFrame([filter_gender(lambda qs: qs, lambda qs: qs)])[ordering]

<hr />
## Male vs. female face instances across channels

In [5]:
counts = []
for channel in Channel.objects.all():
    c = filter_gender(
        lambda qs: qs.filter(face__person__frame__video__channel=channel), 
        lambda qs: qs.filter(channel=channel))
    c['channel'] = channel.name
    counts.append(c)
    
pd.DataFrame(counts)[['channel'] + ordering]

<hr />
## Male vs. female face instances across shows

In [6]:
counts = []
for show in Show.objects.all():
    c = filter_gender(
        lambda qs: qs.filter(face__person__frame__video__show=show),
        lambda qs: qs.filter(show=show))
    c['show'] = show.name
    counts.append(c)
    
pd.DataFrame(counts)[['show'] + ordering]

<hr />
## Male vs. female face instances across time of day

In [7]:
hours = Video.objects.annotate(hour=Extract('time', 'hour')).distinct('hour').order_by('hour').values('hour')

counts = []
for hour in hours:
    hour = hour['hour']
    c = filter_gender(
        lambda qs: qs.filter(face__person__frame__video__time__hour=hour),
        lambda qs: qs.filter(time__hour=hour))
    c['hour'] = datetime.time(hour, 0).strftime('%I %p')
    counts.append(c)

pd.DataFrame(counts)[['hour'] + ordering]    