# Setup TheMovieDB Collection

In [1]:
import sys
sys.path.append('..')
from aips import *

## Create Collection

Create a collection for the http://themoviedb.org (TMDB) dataset used in this book. We index only the title, overview, and release_year fields.

In [3]:
tmdb_collection="tmdb"
create_collection(tmdb_collection)
enable_ltr(tmdb_collection)

upsert_text_field(tmdb_collection, "title")
upsert_text_field(tmdb_collection, "overview")
upsert_double_field(tmdb_collection, "release_year")

Wiping 'tmdb' collection
[('action', 'CREATE'), ('name', 'tmdb'), ('numShards', 1), ('replicationFactor', 1)]
Creating 'tmdb' collection
Status: Success
Del/Adding LTR QParser for tmdb collection
<Response [200]>
Status: Success
Status: Success
Adding LTR Doc Transformer for tmdb collection
Status: Success
Status: Success
Adding 'title' field to collection
Status: Success
Adding 'overview' field to collection
Status: Success
Adding 'release_year' field to collection
Status: Success
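
The upsert helpers above come from the book's aips module, and their internals are not shown in this notebook. As a rough sketch of what adding a field can involve, Solr's Schema API accepts an add-field command posted to a collection's schema endpoint. The helper name build_add_field is hypothetical, and the field types text_general and pdouble (both present in Solr's default configset) are assumptions; the real helpers may use different types or options.

```python
import json

def build_add_field(name, field_type, stored=True, indexed=True):
    """Build a Solr Schema API 'add-field' command body (hypothetical helper)."""
    return {"add-field": {"name": name, "type": field_type,
                          "stored": stored, "indexed": indexed}}

# Assumed field types; the aips helpers may choose others
commands = [
    build_add_field("title", "text_general"),
    build_add_field("overview", "text_general"),
    build_add_field("release_year", "pdouble"),
]

for command in commands:
    # Each body would be POSTed as JSON to <solr base>/tmdb/schema
    print(json.dumps(command))
```

This sketch only builds the request bodies, so it runs without a Solr server.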


## Download and index data

Download the TMDB data and index it. We also download a judgment list: movies labeled as relevant or irrelevant for several movie queries.

In [5]:
from ltr.download import download, extract_tgz
from ltr.client.solr_client import SolrClient
from ltr.index import reindex
from ltr.helpers.movies import indexable_movies

# Fetch the movie corpus and the judgment list
dataset = ['https://github.com/ai-powered-search/tmdb/raw/main/judgments.tgz',
           'https://github.com/ai-powered-search/tmdb/raw/main/movies.tgz']
download(dataset, dest='data/')
extract_tgz('data/movies.tgz')    # -> Holds 'tmdb.json', a big JSON dict with the corpus
extract_tgz('data/judgments.tgz') # -> Holds 'ai_pow_search_judgments.txt',
                                  # which is our labeled judgment list

# Index the movies into the 'tmdb' collection
client = SolrClient(solr_base=SOLR_URL)
movies = indexable_movies(movies='data/tmdb.json')
reindex(client, index='tmdb', doc_src=movies)

data/judgments.tgz already exists
data/movies.tgz already exists
Reindexing...
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 200]
500 Docs Sent [Status: 20
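
The log above shows documents being sent in groups of 500. How reindex does this internally is not shown here, but batching of this kind can be sketched as follows. The batches function and the stand-in documents are hypothetical and exist only for illustration.

```python
from itertools import islice

def batches(docs, size=500):
    """Yield lists of at most `size` documents from any iterable."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Hypothetical stand-in documents; the real corpus comes from tmdb.json
fake_docs = ({"id": str(i), "title": f"Movie {i}"} for i in range(1200))

sizes = [len(batch) for batch in batches(fake_docs)]
print(sizes)  # [500, 500, 200]
```

Because the generator pulls from an iterator, the full corpus never has to be held in memory at once, which is one plausible reason to send documents in fixed-size groups.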

## Next Up, judgments and feature logging

Next, we use a _judgment list_: a set of movies labeled as relevant or irrelevant for given search query strings. We then extract features from the search engine to set up a full training set that we can use to train a model.

These examples [come up next](2.ch10-judgments-and-logging.ipynb)
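
As a preview, a judgment list can be pictured as a collection of (query, document, grade) records. The sketch below is only an illustration of that idea: the Judgment type, the 0/1 grading scale, and the placeholder queries and document IDs are all assumptions, and none of it reflects the actual layout of ai_pow_search_judgments.txt.

```python
from collections import defaultdict
from typing import NamedTuple

class Judgment(NamedTuple):
    query: str   # search query string
    doc_id: str  # identifier of the judged movie
    grade: int   # assumed scale: 1 = relevant, 0 = irrelevant

# Made-up placeholder judgments, not taken from the real judgment list
judgments = [
    Judgment("example query a", "doc-1", 1),
    Judgment("example query a", "doc-2", 0),
    Judgment("example query b", "doc-3", 1),
]

# Group judgments by query so each query's labeled movies sit together
by_query = defaultdict(list)
for judgment in judgments:
    by_query[judgment.query].append(judgment)

for query, judged in by_query.items():
    relevant = [j.doc_id for j in judged if j.grade == 1]
    print(query, "->", relevant)
```

Grouping by query matters because ranking models are typically trained and evaluated per query, comparing documents judged for the same search string.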