### DATA ANALYSIS WITH MOVIE DATASET

https://codechalleng.es/challenges/13/

Details
There is a great ML article, Predict Movie Rating. In this week's code challenge we use its dataset to find the 20 highest-rated directors, ranked by the average IMDB rating of their movies.

Steps:

As mentioned in the article the dataset is here, but we provided a copy in the repo's 13/ subfolder.

Parse movie_metadata.csv with csv.DictReader. You get a bunch of OrderedDicts, from which you only need the following k,v pairs:

OrderedDict([...
            ('director_name', 'Lawrence Kasdan'),   
            ...
            ('movie_title', 'Mumford\xa0'),
            ...
            ('title_year', '1999'),
            ...
            ('imdb_score', '6.9'),
            ...
Only consider directors with a minimum of 4 movies; otherwise you get unrepresentative data. However, going to a minimum of 5 movies we would miss Sergio Leone :(

Take movies of year >= 1960.

Print the 20 highest-rated directors with their movies, ordered by rating in descending order.

It should look something like this (indeed some awesome movies here!):

In [52]:
# Use the csv module's DictReader: each line (row) of the CSV file becomes a dictionary
# (an OrderedDict on Python 3.6 and 3.7, a plain dict from 3.8 onward).
# Each column name becomes a key, and the row's entry for that column becomes the value,
# so the whole file is read as a sequence of dictionaries, one per row.
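
To make the row-to-dictionary mapping concrete, here is a minimal sketch that runs csv.DictReader over a small in-memory CSV instead of the real file. The name `sample_csv` and the second data row are made up for illustration; only the column names and the Mumford row come from the challenge text.

```python
import csv
import io

# Hypothetical two-row sample; the 'Jane Doe' row is invented for illustration
sample_csv = (
    "director_name,movie_title,title_year,imdb_score\n"
    "Lawrence Kasdan,Mumford,1999,6.9\n"
    "Jane Doe,Example Movie,2001,7.5\n"
)

# DictReader uses the header line as keys; every value is read as a string
rows = list(csv.DictReader(io.StringIO(sample_csv)))

for row in rows:
    print(row['director_name'], row['title_year'], row['imdb_score'])
```

Note that the year and score arrive as strings, so any numeric comparison or averaging needs an explicit conversion first.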

In [53]:
import csv

In [54]:
# Build a dictionary that maps each director to a list of their movies
# Each movie in the list is stored as a named tuple

from collections import namedtuple

# Create a Named Tuple to hold the Movie Attributes
Movie = namedtuple('Movie', 'title year score')
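
As a quick illustration of why a named tuple is convenient here, the sketch below builds one `Movie` and reads it back in three ways. The values come from the sample row shown in the challenge text, converted to `int` and `float` by hand; the variable name `mumford` is only for this example.

```python
from collections import namedtuple

Movie = namedtuple('Movie', 'title year score')

# Values taken from the sample row in the challenge text
mumford = Movie(title='Mumford', year=1999, score=6.9)

print(mumford.title)   # access by field name
print(mumford[2])      # access by position, like a regular tuple

# Named tuples unpack like ordinary tuples
title, year, score = mumford
```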

In [55]:
## Movies Dataset File
movies_csv = 'movie_metadata.csv'

In [56]:
# Function to create Dictionary of directors and respective list of movie details

In [57]:
from collections import defaultdict

In [58]:
def get_movies_director(data=movies_csv):
    
    directors = defaultdict(list)
    # Open the CSV file
    with open(data, encoding="utf8") as csv_file:
        # Parse through the File
        for line in csv.DictReader(csv_file):
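            # Values are kept as the raw strings read from the CSV.
            # Plain dictionary lookups cannot raise ValueError, so no
            # try/except is needed at this point.
            director = line['director_name']
            title = line['movie_title']
            year = line['title_year']
            score = line['imdb_score']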
                
            # Add the movie details to the director dictionary
            directors[director].append(Movie(title=title, year=year, score=score))
    
    return directors

In [59]:
# Get the Movies Dataset as dictionary of Directors
dict_directors = get_movies_director(movies_csv)
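
The function above stores every field as a string. Because the challenge filters on year >= 1960 and averages the scores, a later step needs numeric values. Below is a minimal sketch of one way to do that conversion, assuming rows with a missing or non-numeric year or score can simply be skipped. The names `get_movies_by_director_typed` and `MIN_YEAR`, and the sample rows other than Mumford, are hypothetical and not part of the original notebook.

```python
import csv
import io
from collections import defaultdict, namedtuple

Movie = namedtuple('Movie', 'title year score')
MIN_YEAR = 1960  # hypothetical constant for the "year >= 1960" rule


def get_movies_by_director_typed(csv_file):
    """Group movies by director, converting year to int and score to float."""
    directors = defaultdict(list)
    for line in csv.DictReader(csv_file):
        try:
            # int('') and float('') raise ValueError, so incomplete rows are skipped
            year = int(line['title_year'])
            score = float(line['imdb_score'])
        except ValueError:
            continue
        if year < MIN_YEAR:
            continue
        # Titles in the dataset can end with a non-breaking space ('\xa0')
        title = line['movie_title'].replace('\xa0', '').strip()
        directors[line['director_name']].append(Movie(title, year, score))
    return directors


# Made-up sample rows: one complete, one without a year, one before 1960
sample = io.StringIO(
    "director_name,movie_title,title_year,imdb_score\n"
    "Lawrence Kasdan,Mumford\xa0,1999,6.9\n"
    "Jane Doe,Untitled,,7.0\n"
    "Jane Doe,Old Film,1950,8.0\n"
)
typed = get_movies_by_director_typed(sample)
print(dict(typed))
```

Skipping incomplete rows changes the per-director counts compared with the string-based version, so the two should not be mixed when comparing totals.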

In [60]:
# Find the directors with the most movies
# Build a counter that maps each director to the number of movies they directed
# List the most prolific directors

In [61]:
from collections import Counter

In [62]:
cnt = Counter()
for director, movies in dict_directors.items():
    cnt[director] = len(movies)
    
cnt.most_common(5)

[('', 104),
 ('Steven Spielberg', 26),
 ('Woody Allen', 22),
 ('Martin Scorsese', 20),
 ('Clint Eastwood', 20)]
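
The empty-string key at the top of the tally groups the rows that have no director name, so it should be left out of any ranking. Below is a minimal sketch of excluding it while applying the challenge's minimum of 4 movies and sorting by mean score. It assumes the scores are already numeric (unlike the strings stored by the function above). `top_directors`, `MIN_MOVIES`, and the sample data are hypothetical names and values for illustration only.

```python
from collections import namedtuple
from statistics import mean

Movie = namedtuple('Movie', 'title year score')
MIN_MOVIES = 4  # minimum number of movies, from the challenge text


def top_directors(directors, n=20):
    """Return up to n (director, average score) pairs, best first."""
    averages = {
        director: round(mean(m.score for m in movies), 1)
        for director, movies in directors.items()
        if director and len(movies) >= MIN_MOVIES  # skip the '' key
    }
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)[:n]


# Made-up data: one nameless group, two qualifying directors, one with too few movies
sample_directors = {
    '': [Movie('No Director', 2000, 9.0)] * 5,
    'Director A': [Movie('A1', 2001, 8.0), Movie('A2', 2002, 7.0),
                   Movie('A3', 2003, 8.0), Movie('A4', 2004, 7.0)],
    'Director B': [Movie('B1', 2001, 6.0), Movie('B2', 2002, 6.0),
                   Movie('B3', 2003, 6.0), Movie('B4', 2004, 6.0)],
    'Director C': [Movie('C1', 2001, 9.0)] * 3,
}

ranking = top_directors(sample_directors)
print(ranking)
```

In this sample only Director A and Director B qualify: the nameless group is dropped by the truthiness check, and Director C has fewer than 4 movies.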