# [ Chapter 10 - Learning to Rank for Generalizable Search Relevance ]
# Setup TheMovieDB Collection

In [1]:
import sys
sys.path.append('..')
from aips import *
engine = get_engine()

## Create Collection

Create a collection for the TheMovieDB (TMDB) dataset (http://themoviedb.org) used in this book. We will only look at the `title`, `overview`, and `release_year` fields.

In [2]:
tmdb_collection = "tmdb"
engine.create_collection(tmdb_collection)
engine.enable_ltr(tmdb_collection)

Wiping 'tmdb' collection
Status: Success
Creating 'tmdb' collection
Status: Success
Adding LTR QParser for tmdb collection
Status: Success
Adding LTR Doc Transformer for tmdb collection
Status: Success


## Download and index data

Download the TMDB data and index it. We also download a judgment list: movies labeled as relevant or irrelevant for several movie queries.

In [9]:
from ltr.download import download, extract_tgz
from ltr.helpers.movies import indexable_movies
import tarfile
import json

dataset = ['https://github.com/ai-powered-search/tmdb/raw/main/judgments.tgz', 
           'https://github.com/ai-powered-search/tmdb/raw/main/movies.tgz']
download(dataset, dest='data/')
extract_tgz('data/movies.tgz')     # -> extracts 'tmdb.json', a large JSON dict holding the corpus
extract_tgz('data/judgments.tgz')  # -> extracts 'ai_pow_search_judgments.txt',
                                   #    our labeled judgment list

movies = indexable_movies(movies='data/tmdb.json')
engine.add_documents(tmdb_collection, list(movies))

data/judgments.tgz already exists
data/movies.tgz already exists

Adding Documents to 'tmdb' collection
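
To make the three fields concrete, here is a minimal sketch of how a raw movie record might be reduced to `title`, `overview`, and `release_year` before indexing. The helper `to_indexable`, the `release_date` input field, and the sample record are all hypothetical and made up for illustration; this is not the implementation of `indexable_movies`, and the real `tmdb.json` layout may differ.

```python
def to_indexable(movie_id, raw):
    """Reduce a raw movie record to the three fields used in this chapter.

    Hypothetical helper: assumes `raw` is a dict that may carry an
    ISO-style 'YYYY-MM-DD' string under 'release_date'.
    """
    release_date = raw.get("release_date") or ""
    # Keep only the year; leave it as None when no usable date is present.
    release_year = release_date[:4] if len(release_date) >= 4 else None
    return {
        "id": str(movie_id),
        "title": raw.get("title", ""),
        "overview": raw.get("overview", ""),
        "release_year": release_year,
    }


# Made-up records standing in for entries of the corpus dict.
raw_movies = {
    "1": {"title": "Example Movie", "overview": "A placeholder plot.",
          "release_date": "2001-05-17", "extra_field": 42},
    "2": {"title": "Undated Movie", "overview": "No release date given."},
}

docs = [to_indexable(movie_id, raw) for movie_id, raw in raw_movies.items()]
```

Fields that are not needed (such as `extra_field` above) are simply dropped, which keeps the indexed documents small and the later feature definitions easy to follow.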


## Next up: judgments and feature logging

Next, we use a _judgment list_: a set of movies labeled as relevant or irrelevant for given search query strings. We then extract features from the search engine to set up a full training set that we can use to train a model.
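
As a rough illustration of that idea, the sketch below models a judgment as a (query, document, grade) triple and pairs each one with a feature vector to form training rows. Everything here is an assumption made for illustration: the `Judgment` class, the binary grades, the toy titles, and the `toy_features` function are hypothetical stand-ins, not the format of the downloaded judgment file or the features the search engine logs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    query: str
    doc_id: str
    grade: int  # assumed binary labels: 1 = relevant, 0 = irrelevant


# Made-up documents and labels, for illustration only.
toy_titles = {"doc-1": "space adventure", "doc-2": "cooking show"}

judgments = [
    Judgment("space", "doc-1", 1),
    Judgment("space", "doc-2", 0),
    Judgment("cooking", "doc-2", 1),
    Judgment("cooking", "doc-1", 0),
]


def toy_features(query, doc_id):
    """Stand-in for features logged from the search engine:
    a single feature counting query terms that appear in the title."""
    title_terms = set(toy_titles[doc_id].split())
    return [sum(1 for term in query.split() if term in title_terms)]


def group_by_query(judgments):
    """Group judgments by query, since ranking models compare documents
    within the same query."""
    grouped = defaultdict(list)
    for judgment in judgments:
        grouped[judgment.query].append(judgment)
    return dict(grouped)


def to_training_rows(judgments, features_for):
    """Pair each judgment's grade with its feature vector."""
    return [(j.grade, j.query, features_for(j.query, j.doc_id)) for j in judgments]


by_query = group_by_query(judgments)
rows = to_training_rows(judgments, toy_features)
```

Each row carries a label, the query it belongs to, and the features for that query-document pair, which is the general shape of data a ranking model trains on.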

Up next: [Judgments and Logging](2.judgments-and-logging.ipynb)