# Get Movie Poster Image Links
This file uses the TMDB API [(The Movie Database)](https://www.themoviedb.org/) to fetch the URLs for each movie poster. Due to processing time and API request rate, this data is fetched in batches of 500 and stored to temporary csv files, before being concatenated into one output file which is used directly in Tableau.
The URLs collected reference the location where each movie poster .png file is hosted on TMDB, meaning that we can simply provide Tableau with the URL and an image is displayed.

In [41]:
import pandas as pd
import json
import os
from tmdbv3api import TMDb
from tmdbv3api import Movie

#Load in configs and dependencies
#baseDir=pathlib.Path(__file__).parent.resolve()
baseDir=os.getcwd()
parentDir=os.path.dirname(baseDir)
configLocation=os.path.join(baseDir, "config.json")
with open(configLocation, "r") as f:
    configs = json.load(f)

datafolder=os.path.join(parentDir,configs["inputfolder"])
outputfolder=os.path.join(parentDir,configs["outputfolder"])
tempfolder = os.path.join(parentDir,configs["tempfolder"])




TMDB API requires a per-user API key which should not be stored in public repo. Set up an account and input yours here: 

In [2]:
# get user input
my_api_key = input()


In [3]:
#Initialise TMDB API connection

tmdb = TMDb()
tmdb.api_key = my_api_key
tmdb.language = 'en'
tmdb.debug = True

movie = Movie()
def get_imagelink(tmdbMovie, id):
    try:
        out=tmdbMovie.details(id).poster_path
    except:
        out='NA'
        pass
    return out


In [56]:
movieFact=pd.read_csv(f'{tempfolder}/movie_fact.csv')
links = pd.read_csv(f'{datafolder}/links.csv')
links=links.rename(columns={'movieId': 'movieID'})
links['movieID']=links.movieID.apply(lambda x: 'M_'+str(x))
movieFact=pd.merge(movieFact, links, on='movieID', how='left')
movieFact.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   film_title     9742 non-null   object 
 1   release_year   9742 non-null   int64  
 2   genres_raw     9742 non-null   object 
 3   combined_tags  1572 non-null   object 
 4   description    9618 non-null   object 
 5   movieID        9742 non-null   object 
 6   imdbId         9742 non-null   int64  
 7   tmdbId         9734 non-null   float64
dtypes: float64(1), int64(2), object(5)
memory usage: 609.0+ KB


In [57]:
# info shows 8 cases where tmbd ID is not available
movieFact.head()

Unnamed: 0,film_title,release_year,genres_raw,combined_tags,description,movieID,imdbId,tmdbId
0,Toy Story,1995,"Adventure, Animation, Children, Comedy, Fantasy",pixar pixar fun,"Led by Woody, Andy's toys live happily in his ...",M_1,114709,862.0
1,Jumanji,1995,"Adventure, Children, Fantasy",fantasy magic board game Robin Williams game,When siblings Judy and Peter discover an encha...,M_2,113497,8844.0
2,Grumpier Old Men,1995,"Comedy, Romance",moldy old,A family wedding reignites the ancient feud be...,M_3,113228,15602.0
3,Waiting to Exhale,1995,"Comedy, Drama, Romance",,"Cheated on, mistreated and stepped on, the wom...",M_4,114885,31357.0
4,Father of the Bride Part II,1995,Comedy,pregnancy remake,Just when George Banks has recovered from his ...,M_5,113041,11862.0


In [26]:
## 1000 rows took 10.5 mins
movie = Movie()

mini=movieFact[:1001].copy()
mini['image_link']=mini.tmdbId.apply(lambda x: get_imagelink(movie, x))
mini.to_csv(f'{tempfolder}/images/movie_images_1000.csv', index=False)
mini.head()


Unnamed: 0,film_title,release_year,genres_raw,combined_tags,description,id,movieId,imdbId,tmdbId,image_link
0,Toy Story,1995,"Adventure, Animation, Children, Comedy, Fantasy",pixar pixar fun,"Led by Woody, Andy's toys live happily in his ...",M_1,1,114709,862.0,/uXDfjJbdP4ijW5hWSBrPrlKpxab.jpg
1,Jumanji,1995,"Adventure, Children, Fantasy",fantasy magic board game Robin Williams game,When siblings Judy and Peter discover an encha...,M_2,2,113497,8844.0,/vgpXmVaVyUL7GGiDeiK1mKEKzcX.jpg
2,Grumpier Old Men,1995,"Comedy, Romance",moldy old,A family wedding reignites the ancient feud be...,M_3,3,113228,15602.0,/1FSXpj5e8l4KH6nVFO5SPUeraOt.jpg
3,Waiting to Exhale,1995,"Comedy, Drama, Romance",,"Cheated on, mistreated and stepped on, the wom...",M_4,4,114885,31357.0,/4wjGMwPsdlvi025ZqR4rXnFDvBz.jpg
4,Father of the Bride Part II,1995,Comedy,pregnancy remake,Just when George Banks has recovered from his ...,M_5,5,113041,11862.0,/rj4LBtwQ0uGrpBnCELr716Qo3mw.jpg


In [12]:
# batches of 500
for each in range(1001, len(movieFact), 500):
    mini=movieFact[each:min(each+500, len(movieFact))].copy()
    mini['description']=mini.tmdbId.apply(lambda x: get_imagelink(movie, x))
    print(f'rows {each} to {min(each+500, len(movieFact))} complete')
    mini.to_csv(f'{tempfolder}/images/movie_images_{each}_to_{min(each+500, len(movieFact))}.csv', index=False)

rows 1001 to 1501 complete
rows 1501 to 2001 complete
rows 2001 to 2501 complete
rows 2501 to 3001 complete
rows 3001 to 3501 complete
rows 3501 to 4001 complete
rows 4001 to 4501 complete
rows 4501 to 5001 complete
rows 5001 to 5501 complete
rows 5501 to 6001 complete
rows 6001 to 6501 complete
rows 6501 to 7001 complete
rows 7001 to 7501 complete
rows 7501 to 8001 complete
rows 8001 to 8501 complete
rows 8501 to 9001 complete
rows 9001 to 9501 complete
rows 9501 to 9742 complete


In [38]:
# combine all image files into one output file
movieIms=pd.read_csv(f'{tempfolder}/images/movie_images_1000.csv')
for each in range(1001, len(movieFact), 500):
    mini=pd.read_csv(f'{tempfolder}/images/movie_images_{each}_to_{min(each+500, len(movieFact))}.csv')
    movieIms=pd.concat([movieIms, mini], axis=0)

movieIms=movieIms.reset_index(drop=True)
movieIms['link'] = movieIms['description'].astype(str)
movieIms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   film_title     9742 non-null   object 
 1   release_year   9742 non-null   int64  
 2   genres_raw     9742 non-null   object 
 3   combined_tags  1572 non-null   object 
 4   description    9614 non-null   object 
 5   id             9742 non-null   object 
 6   movieId        9742 non-null   int64  
 7   imdbId         9742 non-null   int64  
 8   tmdbId         9734 non-null   float64
 9   link           9742 non-null   object 
dtypes: float64(1), int64(3), object(6)
memory usage: 761.2+ KB


In [49]:
# drop majority of columns as these are already available in movie_fact

movieIms[['movieID', 'link']].to_csv(f'{outputfolder}/movie_images_all.csv', index=False)