# CS109B - Project - Milestone 2

### Sathish Angappan, Hannah Bend, Yohann Smadja

In this notebook, we will extract our data from TMDb and transform it into train and test data. At this point, the question we are attempting to answer is: can a neural network be trained to recognize changes in movie poster trends decade over decade and accurately categorize a movie based on its poster? Can it learn or account for the stylistic changes and trends throughout the decades as graphic design improved?

In [1]:
## setup
import tmdbsimple as tmdb
import requests
import urllib
import pandas as pd
import time

In [2]:
tmdb.API_KEY = '248bdba4be72e97807ddd482a6f8d508' #--Hannah's key

## 1 Decide on data to extract

- Discussion about the imbalanced nature of the data and how you want to address it
- Description of your data
- What does your choice of Y look like?
- Which features do you choose for X and why? 
- How do you sample your data, how many samples, and why?

Our initial foray into the data revealed two key insights: 1) movies are classified as having multiple genres and 2) some genres are extremely niche. To circumvent the latter, we are limiting our data universe to the three most common genres: Action, Comedy and Drama. We will only extract from TMDb those movies whose list of genres includes one or more of the three; unfortunately, this doesn't resolve the first issue, as some movies will appear more than once. Take the 2016 movie Sing, for example, which TMDb categorizes first as Animation, then Comedy, then Drama; Sing will appear in our initial data extract twice, once as a Comedy and again as a Drama. We might address this simply by keeping the genre listed first, as TMDb seems to list genres in descending order of relevance.
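
The "keep the genre listed first" rule can be sketched as below. This is a minimal illustration, not the extraction code used in this notebook: `first_target_genre` is a hypothetical helper, and the genre list for Sing simply follows the order described above (16 is TMDb's id for Animation).

```python
# TMDb genre ids for the three genres we keep
TARGET_GENRES = {28: 'Action', 35: 'Comedy', 18: 'Drama'}

def first_target_genre(genre_ids):
    """Return the first genre id in the list that is one of our three targets."""
    for genre_id in genre_ids:
        if genre_id in TARGET_GENRES:
            return genre_id
    return None  # the movie belongs to none of the three genres

# Sing, in the order described above: Animation, Comedy, Drama
sing_genres = [16, 35, 18]
sing_label = first_target_genre(sing_genres)
print(TARGET_GENRES[sing_label])
```

Under this rule Sing would be kept once, labelled with the first of our three genres in its list.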

We will have two downloads: one csv with metadata on the movies, and the corresponding poster images. Once we vectorize the poster images, we should be able to merge the two datasets on movie id. Ultimately, our data will have the following columns:

- movie id: TMDb's movie id
- title: the name of the movie
- release date: date of the movie's release
- release year: year of the movie's release
- popularity: a score of the movie's popularity, scale of 0-100
- vote average: average number of stars per rating, scale of 1-10
- genre id: the corresponding id of the movie's genre
- V1-##: vectorized columns of images, i.e. pixels
    
The genre id column will be our Y, or outcome variable, a factor with three possible values. We chose the release date, release year, popularity and average vote as our X metadata variables, to be used in conjunction with the vectorized images. Release year will allow us to train by decade; release date will allow us to capture seasonality within the year (e.g. romance movies may trend upwards in February). Popularity and average vote will help us capture any trends in genre popularity by or within decade. For example, comedies might be more popular immediately following a stressful period such as the Cuban missile crisis. Transforming the poster images into vectors will allow us to treat each pixel value as a predictor and use an approach like PCA to identify the components that account for most of the variability in those predictors. Lastly, title will not be used as a predictor but as an identifier.
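
To make the PCA idea concrete, here is a minimal sketch on synthetic data. The image sizes, the random pixel values and the choice of five components are all assumptions made for illustration; real posters would be far larger and may call for a dedicated library implementation.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic stand-in for vectorized posters: 20 tiny "images" of 8 x 6 x 3 pixels
n_images, height, width, channels = 20, 8, 6, 3
images = rng.randint(0, 256, size=(n_images, height, width, channels)).astype(np.uint8)

# Flatten each image into one row of pixel predictors (the V1-## columns)
X = images.reshape(n_images, -1).astype(np.float64)

# PCA via the singular value decomposition of the mean-centered matrix
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S ** 2 / np.sum(S ** 2)

# Keep the first few components as a reduced set of predictors
n_components = 5
scores = X_centered.dot(Vt[:n_components].T)
print(scores.shape)
```

Each retained component is a weighted combination of pixels rather than a single pixel column, so the reduced predictors are `scores`, not a subset of the original columns.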

To begin with, we extracted the metadata for all movies of the three genres; we will narrow down this universe first by removing those movies that appear twice and second by merging so that we keep only movies that have corresponding poster images (an inner join, or a left outer join with the poster data as the left table). Depending on the size of the dataset at that point, we may further reduce the number of observations by randomly sampling from each genre. Should we decide to go this route, we would first sample, then split into train and test.
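
The plan in this paragraph (deduplicate, merge with the posters, sample per genre, then split) could look roughly like the following. The two small dataframes, the sample size per genre and the 50/50 split are made up for illustration; only the column names `ids` and `genre_ids` come from our metadata extract.

```python
import pandas as pd

# Toy metadata: movie 3 appears twice, once as Comedy (35) and once as Drama (18)
metadata = pd.DataFrame({
    'ids':       [1, 2, 3, 3, 4, 5, 6, 7, 8],
    'genre_ids': [35, 35, 35, 18, 18, 18, 18, 28, 28],
})
# Toy poster vectors: movie 6 has no poster
posters = pd.DataFrame({'ids': [1, 2, 3, 4, 5, 7, 8],
                        'V1':  [10, 20, 30, 40, 50, 70, 80]})

# 1) Keep the first occurrence of each movie
metadata = metadata.drop_duplicates(subset='ids', keep='first')

# 2) Keep only movies that have a poster
merged = metadata.merge(posters, on='ids', how='inner')

# 3) Sample the same number of movies from each genre
n_per_genre = 2
sampled = pd.concat([
    group.sample(n=min(len(group), n_per_genre), random_state=0)
    for _, group in merged.groupby('genre_ids')
])

# 4) Split the sample into train and test
train = sampled.sample(frac=0.5, random_state=0)
test = sampled.drop(train.index)
print(len(merged), len(sampled), len(train), len(test))
```

Sampling the same number of movies per genre is one simple way to address the class imbalance; other choices, such as class weights, are equally possible.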

## 2 Extract Data

### 2.1 Extract metadata in chunks

Using code such as that below, we split up the extract by genre so as not to overload any one of our API keys.

In [7]:
title = []
ids = []
release_date = []
popularity = []
genre_ids = []
vote_average = []
release_year = []

# TMDb genre id for Comedy
comedy = 35
discover = tmdb.Discover()

# range(60) walks back one release year at a time, from 2016 to 1957
for y in range(60):
    # First request: only used to find out how many result pages this year has
    response = discover.movie(page = 1, primary_release_year=2016-y, with_genres=comedy)
    no_of_pages = discover.total_pages
    for i in range(no_of_pages):
        time.sleep(2)  # pause between requests so as not to overload the API
        response = discover.movie(page = i+1, primary_release_year=2016-y, with_genres=comedy)

        # Record one row per movie returned on this page
        for s in discover.results:
            title.append(s['title'])
            ids.append(s['id'])
            release_date.append(s['release_date'])
            popularity.append(s['popularity'])
            vote_average.append(s['vote_average'])
            genre_ids.append(comedy)
            release_year.append(2016-y)

df_all_movies = pd.DataFrame({'title': title, 'ids': ids, 'release_date': release_date, 'release_year' : release_year, 
                               'popularity': popularity, 'genre_ids': genre_ids, 'vote_average': vote_average})

print(len(df_all_movies))  # number of movies extracted
df_all_movies.to_csv('movies1960_2016_comedy.csv', encoding='utf-8')
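
Since the same loop is repeated for each genre, it could be wrapped in a function that takes the genre id as a parameter. The sketch below is one possible shape, not the code we ran: `collect_movies` and `fetch_page` are hypothetical names, and `fake_fetch` is an offline stand-in that mimics the `total_pages` and `results` fields of the Discover response so the example runs without an API key.

```python
import time

def collect_movies(fetch_page, genre_id, years, pause=0.0):
    """Collect one row per movie of a genre, paging through each release year.

    fetch_page(year, genre_id, page) is a hypothetical callable returning a
    dict with 'total_pages' and 'results', like the Discover response.
    """
    rows = []
    for year in years:
        page, total_pages = 1, 1
        while page <= total_pages:
            response = fetch_page(year, genre_id, page)
            total_pages = response['total_pages']
            for movie in response['results']:
                rows.append({'ids': movie['id'], 'title': movie['title'],
                             'release_year': year, 'genre_ids': genre_id})
            page += 1
            time.sleep(pause)  # pause between requests
    return rows

def fake_fetch(year, genre_id, page):
    """Offline stand-in for the API: two pages per year, one movie per page."""
    return {'total_pages': 2,
            'results': [{'id': year * 10 + page,
                         'title': 'movie %d-%d' % (year, page)}]}

rows = collect_movies(fake_fetch, genre_id=35, years=[2016, 2015])
print(len(rows))
```

Reading `total_pages` from the first page of results avoids requesting page 1 twice. With the real API, `fetch_page` would wrap the `discover.movie(...)` call used in the cell above.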

### 2.2 Extract all posters for 1 movie

We will loop over the code below to download the posters for the movies in our dataset gradually, likely limiting ourselves to posters categorized as English to reduce noise. Given that the posters have different dimensions, we will likely need to scale the images prior to vectorizing. Per the lecture on CNNs, we should be able to do so while maintaining RGB values, which are key for our analysis.

In [11]:
lostintranslation = tmdb.Movies(153)
lostintranslation.info()['imdb_id']

u'tt0335266'

Several sizes are available. We will use the same width of 500 pixels (`w500`) for all posters.

In [18]:
CONFIG_PATTERN = 'http://api.themoviedb.org/3/configuration?api_key={key}'
url = CONFIG_PATTERN.format(key=tmdb.API_KEY)
r = requests.get(url)
config = r.json()

base_url = config['images']['base_url']
sizes = config['images']['poster_sizes']
sizes

[u'w92', u'w154', u'w185', u'w342', u'w500', u'w780', u'original']

In [17]:
# Download all the posters of "Lost in Translation"
IMDBID = lostintranslation.info()["imdb_id"]

IMG_PATTERN = 'http://api.themoviedb.org/3/movie/{imdbid}/images?api_key={key}' 
r = requests.get(IMG_PATTERN.format(key=tmdb.API_KEY,imdbid=IMDBID))
api_response = r.json()
paths = api_response['posters']
urls_posters = []
for path in paths:
    urls_posters.append("{0}{1}{2}".format(base_url, u'w500', path['file_path'])) 
urls_posters

['http://image.tmdb.org/t/p/w500/5T8VvuFTdaawKLJk34i69Utaw7o.jpg',
 'http://image.tmdb.org/t/p/w500/en1D6ETmeeneBInKBwZ2xVjve8r.jpg',
 'http://image.tmdb.org/t/p/w500/y3Fpvs5mYoD3ncPjkxYxwc5TRdU.jpg',
 'http://image.tmdb.org/t/p/w500/oOmkcmzrMKfyMVWQHULJkmp8lY3.jpg',
 'http://image.tmdb.org/t/p/w500/rWYr8wuIDYo7rUXW6uusRNZWBte.jpg',
 'http://image.tmdb.org/t/p/w500/pe4W2mTLHMkBW59XB2EvQ6WCjRU.jpg',
 'http://image.tmdb.org/t/p/w500/xmqRMnS8EVAOkrN5kD6iX4aq2eM.jpg',
 'http://image.tmdb.org/t/p/w500/gNhfI18oPPqmR8RMFq6ciNo9FGy.jpg',
 'http://image.tmdb.org/t/p/w500/ntzUnYFOvyEMCYW4FD2LuPrLPyM.jpg',
 'http://image.tmdb.org/t/p/w500/bmMu4vtQ8zt78sa3uxvGxvrZtwR.jpg',
 'http://image.tmdb.org/t/p/w500/7kkAVAoaVN0YLztG4FMrb5l7aRs.jpg',
 'http://image.tmdb.org/t/p/w500/6tdPmesqRuJhcb6V3QTgxlX6Mog.jpg',
 'http://image.tmdb.org/t/p/w500/xniSkkzLQBbqeHrJYAa6UHdD0Dg.jpg',
 'http://image.tmdb.org/t/p/w500/in8sEXG554zQdm4aMnY9YXoaJuu.jpg',
 'http://image.tmdb.org/t/p/w500/x9mlUUt7YNFHlcqPQycO1p3DX3A.j

In [None]:
moviename = "lostintranslation"
local_path = 'C:\\Users\\yohan\\Dropbox\\Harvard\\CS109b\\Project\\Movie Posters\\'
genre_id = 18 #drama
# Number the posters from 1 so each file name is unique
for k, url in enumerate(urls_posters, 1):
    urllib.urlretrieve(url, local_path+moviename+"_"+str(k)+"_"+str(genre_id)+".jpg")
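
If we limit ourselves to English posters, as suggested at the start of this section, one option is to choose a single entry from the `posters` list of the images response. The sketch below is only an illustration: `pick_poster` is a hypothetical helper, the sample entries are invented, and it assumes each entry carries the `iso_639_1` and `vote_average` fields alongside `file_path`.

```python
def pick_poster(posters, language='en'):
    """Return the file_path of one poster, preferring the given language."""
    if not posters:
        return None
    preferred = [p for p in posters if p.get('iso_639_1') == language]
    candidates = preferred or posters  # fall back to any language
    best = max(candidates, key=lambda p: p.get('vote_average') or 0)
    return best['file_path']

# Invented entries shaped like items of api_response['posters']
sample_posters = [
    {'file_path': '/a.jpg', 'iso_639_1': 'fr', 'vote_average': 9.0},
    {'file_path': '/b.jpg', 'iso_639_1': 'en', 'vote_average': 5.2},
    {'file_path': '/c.jpg', 'iso_639_1': 'en', 'vote_average': 6.1},
]
chosen = pick_poster(sample_posters)
print(chosen)
```

Breaking ties by `vote_average` is an arbitrary choice; taking the first English poster would work just as well.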

### 2.3 Storage considerations

We first noticed that most of the posters we downloaded for Lost in Translation are almost identical: some have different color saturations, some have titles in different languages. Our dataframe of metadata contains 60,244 drama movies, 17,000 action movies and 40,000 comedies. If we assume we download 30 posters per movie and that a poster is about 100 KB, then for drama alone we would end up with 60,244 $\times$ 30 $\times$ 100 KB $\approx$ 180 GB of movie posters. It seems more reasonable to download only one poster per movie.
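
The storage estimate can be reproduced with a few lines of arithmetic. The 100 KB per poster and 30 posters per movie are the rough assumptions stated above, and the conversion uses decimal units (1 GB = 10^6 KB).

```python
KB_PER_POSTER = 100      # assumed average poster size
KB_PER_GB = 10 ** 6      # decimal units

movies = {'drama': 60244, 'action': 17000, 'comedy': 40000}

def storage_gb(n_movies, posters_per_movie):
    """Estimated disk space in GB for the given number of movies."""
    return n_movies * posters_per_movie * KB_PER_POSTER / float(KB_PER_GB)

drama_30_posters = storage_gb(movies['drama'], 30)
all_genres_1_poster = storage_gb(sum(movies.values()), 1)
print(drama_30_posters, all_genres_1_poster)
```

With one poster per movie, all three genres together need roughly 12 GB, compared with about 180 GB for drama alone at 30 posters per movie.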


## 3 Image vectorization

We use matplotlib to read RGB images as arrays in the 'uint8' data format. 'u' stands for 'unsigned' (all values are non-negative), 'int' stands for integer and 8 means each value is stored in 8 bits. Hence all values are between 0 and 255.

Each image is transformed into a three-dimensional array: each pixel is represented by three values for the three additive primary colors, red, green and blue.
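
As a small illustration of the data format and of what vectorizing means here, the sketch below uses a synthetic all-black array with the 750 $\times$ 500 $\times$ 3 shape instead of a real poster file.

```python
import numpy as np

# uint8 is an unsigned 8-bit integer, so it can hold 2**8 = 256 distinct values
info = np.iinfo(np.uint8)
print(info.min, info.max)

# Synthetic stand-in for a poster: 750 rows, 500 columns, 3 color channels
img = np.zeros((750, 500, 3), dtype=np.uint8)
img[0, 0] = [5, 4, 2]  # top-left pixel: red, green, blue

# Vectorizing flattens the array into a single row of pixel values
vector = img.reshape(-1)
print(vector.shape)
```

A single 750 $\times$ 500 poster therefore yields 1,125,000 predictor columns, which is why some form of dimensionality reduction is attractive.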

Although most posters have the same shape (750 $\times$ 500 $\times$ 3), some differ slightly. We show below how to resize the images.

In [13]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

# First poster
img1 = mpimg.imread('C:\\Users\\yohan\\Dropbox\\Harvard\\CS109b\\Project\\Movie Posters\\lostintranslation_1_18.jpg')
print "Image 1 shape:", img1.shape
img1

Image 1 shape: (750L, 500L, 3L)


array([[[  5,   4,   2],
        [  5,   4,   2],
        [  4,   3,   0],
        ..., 
        [  8,   4,   1],
        [  8,   4,   1],
        [  8,   4,   1]],

       [[  5,   4,   2],
        [  5,   4,   2],
        [  4,   3,   0],
        ..., 
        [  8,   4,   1],
        [  8,   4,   1],
        [  8,   4,   1]],

       [[  4,   3,   1],
        [  4,   3,   1],
        [  4,   3,   0],
        ..., 
        [  8,   4,   1],
        [  8,   4,   1],
        [  8,   4,   1]],

       ..., 
       [[138,  58,  49],
        [133,  53,  44],
        [131,  51,  42],
        ..., 
        [151,  65,  50],
        [151,  65,  50],
        [152,  66,  51]],

       [[134,  54,  47],
        [134,  54,  47],
        [132,  52,  45],
        ..., 
        [147,  64,  48],
        [147,  64,  48],
        [147,  64,  48]],

       [[126,  46,  39],
        [133,  53,  46],
        [134,  54,  47],
        ..., 
        [146,  65,  48],
        [145,  64,  47],
        [145,  64,

In [19]:
# Second poster
img2 = mpimg.imread('C:\\Users\\yohan\\Dropbox\\Harvard\\CS109b\\Project\\Movie Posters\\lostintranslation_2_18.jpg')
print "Image 2 shape:", img2.shape

Image 2 shape: (739L, 500L, 3L)


In [32]:
import scipy.misc

img2_resized = scipy.misc.imresize(img2, (750,500), interp='bilinear', mode=None)
print "Image 2 new shape:", img2_resized.shape

Image 2 new shape: (750L, 500L, 3L)
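
Note that `scipy.misc.imresize` was deprecated and later removed from SciPy, so the cell above only runs on older versions. A possible alternative, sketched here with Pillow on a synthetic array rather than the actual poster file, is:

```python
import numpy as np
from PIL import Image

# Synthetic stand-in for the second poster, which has 739 rows instead of 750
img2 = np.zeros((739, 500, 3), dtype=np.uint8)

# The bilinear constant lives in different places depending on the Pillow version
resample = getattr(Image, 'Resampling', Image).BILINEAR

# Pillow expects (width, height), the reverse of NumPy's (rows, columns)
img2_resized = np.asarray(Image.fromarray(img2).resize((500, 750), resample))
print(img2_resized.shape)
```

The result is again a `uint8` array with three channels, so it can be vectorized in the same way as the other posters.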
