
TV Archives cracked open - "AI for IA" Artificial Intelligence for Internet Archive. Talk at MozFest, London Oct 2017. VIEW SLIDES: https://traceypooh.github.io/mozfest17


traceypooh/mozfest17


<!doctype html><script src="eveal.js"></script>

TV Archives cracked Open "AI for IA"

Artificial Intelligence for Internet Archive

MozFest, London Oct 2017

by [traceypooh](https://twitter.com/tracey_pooh)
https://traceypooh.github.io/mozfest17 _?_ for key shortcuts
git clone https://github.com/traceypooh/mozfest17; open mozfest17/index.html

Gist

decentralized research and AI
built on top of
a library of stable, untampered worldwide TV recordings


Intro to archive.org

  • Wayback Machine
    • past copies of 300B+ pages
  • 15M books, lendable
  • ~4M videos, ~4M audio & live concerts
  • 3M images
  • 200K software items & emulation (in JS!)


Library!

  • Absolute browser Privacy
    • no personal data or IP addresses extracted
  • Validation & nontampering
    • keep original versions with 2+ checksums and logs
<file name="commute.mp4" source="derivative">
<title>commute</title>
<format>h.264</format>
<original>commute.avi</original>
<mtime>1325973601</mtime>
<size>11919082</size>
<md5>ff17ed66e7db5693dd208dd6ac488ff8</md5>
<crc32>ad1df03a</crc32>
<sha1>e9f9de8379cd25653d487ab30d198fc61a050091</sha1>
<length>115.61</length>
<height>480</height>
<width>640</width>
</file>
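
A minimal sketch of how a reader might recompute the checksums in a `<file>` record like the one above and compare them against a local copy. The function names `file_checksums` and `matches_record` and the 1 MiB chunk size are assumptions for illustration, not Internet Archive code.

```python
import hashlib
import zlib


def file_checksums(path, chunk_size=1 << 20):
    """Return md5, sha1, and crc32 hex digests plus byte size for one file."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    crc = 0
    size = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
            sha1.update(chunk)
            crc = zlib.crc32(chunk, crc)  # running CRC across chunks
            size += len(chunk)
    return {
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        "crc32": format(crc & 0xFFFFFFFF, "08x"),
        "size": size,
    }


def matches_record(path, record):
    """True when every checksum field present in `record` matches the file."""
    actual = file_checksums(path)
    return all(actual[key] == str(record[key]).lower()
               for key in ("md5", "sha1", "crc32") if key in record)
```

Checking two or more independent digests, as the record does, means a single accidental collision does not pass as a match.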

External Blockchain of Proofs

of file mod times / checksums


archive.org/tv

  • recording 50 - 100 channels
    • 24 x 7
    • around the world
    • since 2000
  • 2 million+ news shows
  • search captions/metadata
  • new Trump Administration and Congress subsets
  • citable reference clips
  • Popcorn editing/mashup clips
  • for AI experiments

Artificial Intelligence

  • text:
    • chyron ("lower third") scanning OCR (Third Eye)
    • caption alignment
    • OCR captions from DVB-S
      • BBC News
    • speech to text (VoiceBase)
      • Al Jazeera English
      • Deutsche Welle English
  • image:
    • public officials facial detection
      (Faceomatic <-- Matroid <-- FaceNet)

Artificial Intelligence


Public Feeds


  • OCR 'lower third'
    • chyrons
    • overlaid text on broadcasts
    • not captions or descriptive text
    • editorial / summarizing in nature
  • 4 TV channels, 24x7, ~1 min from realtime
    • CNN
    • MSNBC
    • Fox News
    • BBC News

  AFTER WH MEETING, SCHUMER DISHES
  WHEN HE THOUGHT NIC WAS OFF
  
bots

  • twitter bots
    • https://twitter.com/tvThirdEye
    • https://twitter.com/tvThirdEyeB
    • https://twitter.com/tvThirdEyeF
    • https://twitter.com/tvThirdEyeM
    • https://twitter.com/tvThirdEye/lists/all


API


Chyron filtering

  • tesseract OCR
    • free; errors
  • simhash
    • groups 'nearly the same'
      • character flips
      • word off in time
  • look for vowels
  • pick 'most seen' group every minute
    • and tweet
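
The filtering steps above can be sketched roughly as follows. This is an illustration only, not the Third Eye code: the character-trigram shingles, the MD5-based token hash, the 20% vowel ratio, and the Hamming threshold of 12 are all assumptions picked for the example.

```python
import hashlib
from collections import Counter


def simhash(text, bits=64, shingle=3):
    """64-bit SimHash over character shingles of the normalized text."""
    text = " ".join(text.upper().split())
    count = max(1, len(text) - shingle + 1)
    weights = [0] * bits
    for i in range(count):
        digest = hashlib.md5(text[i:i + shingle].encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def looks_like_text(line, min_vowel_ratio=0.2):
    """Cheap OCR-garbage filter: real words contain a fair share of vowels."""
    letters = [c for c in line.upper() if c.isalpha()]
    if not letters:
        return False
    return sum(c in "AEIOU" for c in letters) / len(letters) >= min_vowel_ratio


def most_seen(lines, max_distance=12):
    """Group near-duplicate OCR lines; return the commonest text of the largest group."""
    groups = []  # each entry: (hash of first member, [member lines])
    for line in filter(looks_like_text, lines):
        h = simhash(line)
        for rep, members in groups:
            if hamming(rep, h) <= max_distance:
                members.append(line)
                break
        else:
            groups.append((h, [line]))
    if not groups:
        return None
    _, best = max(groups, key=lambda g: len(g[1]))
    return Counter(best).most_common(1)[0][0]
```

Calling `most_seen` on one minute of OCR output would give the single line to tweet; lines that differ only by a flipped character should land in the same group, though the right threshold needs tuning on real data.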

TV AI Examples


clips

  • little JSON annotations
  • associate metadata to program start/end time range
  • auto expands each clip to a "synthetic" document
    • indexed in Elasticsearch
  • JSONPatch for changes
  • track play counts, some referers
  • allows for decentralized annotations to other IA / research

clip

{
    "268.1|269.1": {
        "subject": [
            "Criminal Activity",
            "Crime"
        ],
        "factcheck": [
            "http://www.factcheck.org/2016/07/factchecking-trumps-big-speech/"
        ]
    },
    "266.7|267.2": {
        "ad_id": "PolAd_DonaldTrump_d9dsn",
        "type": "campaign",
        "race": "PRES",
        "cycle": "2016",
        "message": "pro",
        "sponsor": [
            "Republican National Cmte"
        ],
        "sponsor_type": "PAC",
        "subject": [
            "Job Accomplishments"
        ],
        "person": [
            "Donald Trump"
        ]
    },
    "268.1|269.1": {
        "collection": [
            "nancy_pelosi_archive"
        ],
        "subject": [
            "Voting"
        ]
    }
}
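
A rough sketch of how annotations like these could be expanded into one "synthetic" document per clip before indexing. It assumes each key is `start|end` within the program; the output field names (`program`, `start`, `end`) and the identifier `EXAMPLE_PROGRAM` are hypothetical, not the archive's schema.

```python
import json


def expand_clips(program_id, annotations):
    """Turn {"start|end": {...}} annotations into one flat document per clip."""
    docs = []
    for span, fields in annotations.items():
        start, end = (float(x) for x in span.split("|"))
        doc = {"program": program_id, "start": start, "end": end}
        doc.update(fields)  # copy the clip's metadata onto the document
        docs.append(doc)
    return sorted(docs, key=lambda d: (d["start"], d["end"]))


raw = """{
  "268.1|269.1": {"subject": ["Criminal Activity", "Crime"]},
  "266.7|267.2": {"ad_id": "PolAd_DonaldTrump_d9dsn", "type": "campaign"}
}"""
for doc in expand_clips("EXAMPLE_PROGRAM", json.loads(raw)):
    print(json.dumps(doc))
```

Note that Python's `json.loads` keeps only the last value when a key is repeated, so two annotations on the same time range would need distinct keys or a merge step before loading.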

Where We're Going


[Part 2] "There Goes 2 Weeks"

deep dive into Image Matching and
Facial Recognition



An imposter does not have Imposter Syndrome

CNNs

  • Convolutional Neural Network
    • filtered neural network
  • each layer uses output from prior layer as input
  • instead of rule-based learning, use classified datasets to learn
  • multi-node connections (but not "fully connected")
  • "data squashers"
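
A toy illustration of the bullets above, in plain Python with no framework: each output value looks only at a small patch (so not "fully connected"), and each layer consumes the previous layer's output and shrinks it. The kernels here are hand-picked assumptions; a real CNN learns them from classified datasets.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation: each output sees one small patch only."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map):
    """Clamp negative responses to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def forward(image, kernels):
    """Stack layers: every layer takes the prior layer's output as input."""
    x = image
    for kernel in kernels:
        x = relu(conv2d(x, kernel))
    return x


# 5x5 image with a dark-to-light vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]          # responds to left-to-right brightening
blur = [[0.25, 0.25], [0.25, 0.25]]         # averages neighbouring responses
features = forward(image, [vertical_edge, blur])
print(features)  # 3x3 map: 5x5 -> 4x4 -> 3x3, the "data squashing"
```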

CNN Example

  • feed in image
  • node looking for eyelash
  • node looking for iris
    • could feed to node looking for eye
  • meanwhile... nose node
    • all feed to face recognizer node
    • could feed to "is this Barack Obama?"

Guru

Rik Heijdens from jwplayer

  • Demuxed 2017 talk
  • feed in video - for each shot, make 3 vectors:
    • image Inception CNN (tensorflow)
    • audio CNN spectrogram
    • text transcripts/STT into Word2Vec
  • concat vectors, compare (cosine similarity), and graph
  • ... yields scene detection
  • all just for ideal Ad insertion!
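
The concat-and-compare step can be sketched as below. This is a simplified reading of the idea, not the method from the talk: it assumes the three per-shot vectors already exist, compares only adjacent shots, and uses an arbitrary threshold of 0.5.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def shot_vector(image_vec, audio_vec, text_vec):
    """Concatenate the image, audio, and text vectors for one shot."""
    return list(image_vec) + list(audio_vec) + list(text_vec)


def scene_boundaries(shot_vectors, threshold=0.5):
    """Indices i where shot i looks unlike shot i-1, i.e. a likely new scene."""
    return [i for i in range(1, len(shot_vectors))
            if cosine_similarity(shot_vectors[i - 1], shot_vectors[i]) < threshold]


shots = [
    shot_vector([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    shot_vector([0.9, 0.1], [0.9, 0.1], [0.9, 0.1]),  # similar to the first shot
    shot_vector([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),  # very different content
]
print(scene_boundaries(shots))
```

In practice the three vectors have very different lengths and scales, so some normalization per modality is probably needed before concatenating.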

Image Matching

  • pixel diff algorithms (MAE, RMSE, MSE)
  • perceptual hashing (pHash.org)
    • image => 8x8 grayscale
    • transform to 8x8 frequency image with DCT
    • reduce to 64-bit number
    • Hamming distance between Int64 pairs
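
A small sketch of both approaches on 8x8 grayscale inputs, following the simplified steps listed above. It is not the pHash.org implementation: common pHash descriptions start from a larger grayscale image and keep only the low-frequency DCT block, and thresholding against the median is an assumption here.

```python
import math


def pixel_errors(a, b):
    """MAE, MSE, RMSE between two equal-sized grayscale images (lists of rows)."""
    diffs = [pa - pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return mae, mse, math.sqrt(mse)


def dct_2d(block):
    """Unnormalized 2-D DCT-II of an N x N block."""
    n = len(block)
    cos = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
           for u in range(n)]
    return [[sum(block[x][y] * cos[u][x] * cos[v][y]
                 for x in range(n) for y in range(n))
             for v in range(n)]
            for u in range(n)]


def dct_hash(gray8x8):
    """64-bit hash: one bit per DCT coefficient, set when above the median."""
    coeffs = [c for row in dct_2d(gray8x8) for c in row]
    median = sorted(coeffs)[len(coeffs) // 2]
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")
```

Two images are then treated as near-duplicates when `hamming(dct_hash(a), dct_hash(b))` is small; the cut-off is a tuning choice.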

pHash - to gray 8x8




TensorFlow & Training


OpenFace


OpenFace Training

  • 3+ images per person/face
  • avoid 'overfit'
  • align eyes + nose (nostrils?)



Siamese "one shot" CNN recognizers

  • Rik's idea
  • differentiate instead of classify
  • learns similarity of 2 inputs

  • repo / py notebook
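
A bare-bones sketch of the "differentiate instead of classify" idea: one shared embedding function is applied to both inputs, and the distance between the two embeddings is what gets trained. The contrastive loss shown is one common choice for Siamese networks, not necessarily what was proposed; the tiny `tanh` layer, margin, and threshold are assumptions, and real weights would be learned rather than supplied by hand.

```python
import math


def embed(x, weights):
    """Shared tower: the same weights map either input to an embedding."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def distance(a, b, weights):
    """Euclidean distance between the embeddings of two inputs."""
    ea, eb = embed(a, weights), embed(b, weights)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(ea, eb)))


def contrastive_loss(d, same, margin=1.0):
    """Pull matching pairs together; push non-matching pairs past the margin."""
    return d * d if same else max(0.0, margin - d) ** 2


def is_same(a, b, weights, threshold=0.5):
    """Decide 'same face?' from distance alone, with no per-person class."""
    return distance(a, b, weights) < threshold
```

Because the output is a similarity rather than a class label, a new person can be recognized from a single reference image without retraining a classifier, which is where the "one shot" name comes from.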


AI Ethics


Demo Time


Demo Time


help Shape US with YOUR Thoughts

  • extend/shape our APIs
  • AI ideas
  • research, visualizations
  • tag clips with AI metadata or pointers to Decentralized metadata
  • more!

Ergo

decentralized research and AI
built on top of
a library of stable, untampered worldwide TV recordings


The End
