## Analysis of Movies from 2000 - 2019 using the TMDB Database 
<b>Author:</b> Rohit Vincent <br>
<b>Source :</b> https://www.themoviedb.org/ <br>
<b>Description of the database:</b> The Movie Database (TMDb) is a community built movie and TV database. Every piece of data has been added by our amazing community dating back to 2008. TMDb's strong international focus and breadth of data is largely unmatched and something we're incredibly proud of. Put simply, we live and breathe community and that's precisely what makes us different.

<b>The following JavaScript disables scrolling in the notebook's output areas</b>

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<b>Install wordcloud if not already installed</b>

In [None]:
%pip install wordcloud

<b>Render plots inline in the notebook</b>

In [None]:
%matplotlib inline

#### Import Required Libraries

In [None]:
import ast
import re
import time

import matplotlib
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import requests  # to make TMDB API calls
import seaborn as sns
from nltk import word_tokenize as tokenize
from nltk.corpus import stopwords
from wordcloud import WordCloud
from IPython.display import Markdown, display

**Define a helper function to print colored Markdown text**

In [None]:
# custom function to display colors & text styles
def printmd(string, color=None):
    colorstr = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(colorstr))

#### Set Global Variables/Parameters
<ul>
    <li><b>REQUEST_LIMIT</b>: The TMDB API has a request limit of 40 requests per 10 seconds. 
To stay under that limit, the program caps the number of requests per 10 seconds at the value of the global parameter below, currently 39.</li>
    <li><b>FILENAME</b>: Name of the file where data fetched from the API is saved and from which it is loaded for further processing.</li>
    <li><b>APIKEY</b>: API key value for user to access the TMDB API</li>
</ul>
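
The request limit described above could also be enforced with a sliding window that remembers the timestamp of each recent call. The following is only a minimal sketch, not the notebook's implementation: the class name `SlidingWindowLimiter`, its `wait` method and the injectable `clock`/`sleep` parameters are hypothetical names introduced for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window (hypothetical helper)."""

    def __init__(self, max_calls=39, period=10.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.calls = deque()  # timestamps of the calls still inside the window

    def _drop_expired(self, now):
        # Forget calls that are older than one full period
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        self._drop_expired(now)
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._drop_expired(now)
        self.calls.append(now)
```

Calling `wait()` immediately before each `requests.get(...)` would keep the number of requests in any 10-second window at or below `max_calls`.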
    
        


In [None]:
# Request Limit for TMDB API per 10 seconds
REQUEST_LIMIT = 39
# Filename to store/load data from API
FILENAME='tmdb_dump.csv'
# API key
APIKEY = 'b67e8d052c594195b61e2533a4968dd7'

### Declare & Set Other Variables/ Utility Functions

<b>Declare PorterStemmer for stemming reviews of movies</b>

In [None]:
porter = nltk.PorterStemmer()

<b>Set Figure Size of Plots</b>

In [None]:
sns.set(rc={'figure.figsize':(15,7)})

<b>Define Functions to plot Bar Graphs & Wordclouds</b>

In [None]:
# Generate a list of random RGB colours
def getColors(length):
    colors = list()
    for i in range(length):
        colors.append(list(np.random.rand(3,)))
    return colors

# Plot bar graph
def plotbar(common_words,title,xlabel,ylabel):
    # Unzip to get labels, values
    labels, ys = zip(*common_words)
    xs = np.arange(len(labels)) 
    # Set title
    plt.title(title, fontsize=15)
    plt.bar(xs, ys,align='center', color=getColors(len(xs)))
    plt.xticks(xs, labels) #Replace default x-ticks with xs, then replace xs with labels
    plt.yticks(ys)
    plt.tight_layout()
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=90)
    plt.show()

# Generate wordcloud
def disp_wordcloud(text,title):
    #Remove stop words
    wordcloud = WordCloud(stopwords=stopwords.words('english'), background_color="white").generate(' '.join(text))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(title, fontsize=15)
    plt.axis("off")
    plt.show()
        

<b>Define functions to clean text using the following steps:</b>
1. Calculate Lexical Diversity of text
2. Clean Text
<ul>
    <li> Remove punctuation </li>
    <li> Convert words to lowercase </li>
    <li> Remove stop words such as a, an, the </li>
    <li> Reduce each word to its root form using PorterStemmer </li>
</ul>
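
To make the steps above concrete, here is a small dependency-free sketch. It is an illustration only: `STOP_WORDS` is a deliberately tiny, hypothetical stop-word set (the notebook uses NLTK's English list), `clean_tokens` is a hypothetical name, and the stemming step is left out because it needs NLTK.

```python
# Hypothetical, deliberately tiny stop-word set; the notebook uses NLTK's English list
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}


def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens; 0.0 for an empty list."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def clean_tokens(tokens):
    """Keep alphanumeric tokens, lowercase them and drop stop words (no stemming)."""
    lowered = (tok.lower() for tok in tokens if tok.isalnum())
    return [tok for tok in lowered if tok not in STOP_WORDS]


tokens = "The plot is thin , the acting is great".split()
cleaned = clean_tokens(tokens)
print(cleaned)                     # remaining content words
print(lexical_diversity(cleaned))  # 1.0 means no word is repeated
```

A lexical diversity close to 1 means almost every word is distinct, while a value close to 0 means the same few words are repeated.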

In [None]:
# Lexical diversity: ratio of distinct tokens to total tokens
def lexical_diversity(text):
    # Guard against empty input to avoid a ZeroDivisionError
    return len(set(text)) / len(text) if text else 0.0

# Clean a list of tokens: drop punctuation, lowercase, remove stop words, stem
def cleanText(text):
    # Build the stop-word set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    # Keep alphanumeric tokens only and convert them to lower case
    lowercase_text = [word.lower() for word in text if word.isalnum()]
    # Remove stop words
    filtered_words = [word for word in lowercase_text if word not in stop_words]
    # Stem
    stemmed_words = [porter.stem(word) for word in filtered_words]
    return stemmed_words

<b>Define Class tmdb for analysis of the dataset</b>

The following parameters are defined in the class:
<ul>
    <li><b> Limit: </b> Limit of requests</li>
    <li><b> Pagelimit: </b> Limit of pages for a request. TMDB allows a maximum of 1000, which is used as the default here</li>
    <li><b> API Key: </b> API Key for TMDB</li>
    <li><b> Start: </b> Start Year of Data Selection & Analysis</li>
    <li><b> End: </b> End Year of Data Selection & Analysis</li>
</ul>

The following functions are defined in the class:
<ul>
    <li> <b>check_requestLimit : </b>Function to check if request has reached limit in the last 10 seconds.</li>
    <li> <b>fetchFilmDetails : </b>Fetch data set from API & Save to CSV.</li>
    <li> <b>processDetails : </b>Function to Preprocess dataset & apply some filtering.</li>
    <li> <b>analyzeAuthors : </b>Function to Analyze Authors & Reviews written by them.</li>
    <li> <b>analyzeGenre : </b> Analyse Genre.</li>
    <li> <b>analyzeCompanyGenre : </b>Analyze company & the genre of movies they produce.</li>
    <li> <b>analyzeMovieTimelines:</b> Analyze movies released across years/month.</li>
    <li> <b>analyzeCompanies: </b> Analyze movies produced across each production company</li>
    <li> <b>analyzeLang: </b> Analyze movie languages</li>
    <li> <b>analyzeRevenue: </b>Analyze Revenue </li>
    <li> <b>analyzeVoteAvg: </b>Analyze Vote Average</li>
</ul>  
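
One of the steps in `processDetails` is turning columns such as genres, which come back from the CSV as the string form of a list of dictionaries, into a comma-separated list of names. A minimal sketch of that idea, assuming a hypothetical helper name `names_from_cell` and a made-up sample cell:

```python
import ast


def names_from_cell(cell):
    """Parse a stringified list of dicts and return the comma-joined 'name' values."""
    items = ast.literal_eval(cell)  # safely evaluates Python literals only
    return ",".join(item["name"] for item in items)


# Made-up sample in the shape the notebook prints for the genres column
cell = "[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}]"
print(names_from_cell(cell))
```

`ast.literal_eval` is used rather than `eval` because it only accepts literals, so a malformed or malicious cell cannot execute code.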

In [None]:
# Define class TMDB
class tmdb:
    
    # Constructor
    def __init__(self, limit, apikey, filename,start_year,end_year):
        self.limit = limit
        self.count = 0
        # Can't fetch more than 1000 pages with the API
        self.pagelimit = 1000
        self.apikey = apikey
        self.filename = filename
        self.start = start_year
        self.end = end_year
        
    # Function to check if request has reached limit in the last 10 seconds.    
    def check_requestLimit(self):
        # If Count is zero, start timer
        if(self.count == 0):
            self.start_time = time.time()
        # Increase request count
        self.count+=1
        # Get Current time
        current_time = time.time()
        # Within the first 9 seconds of the window, enforce the request limit
        if(current_time - self.start_time <= 9):
            # If the limit has been reached, sleep for 10 seconds & reset
            if(self.count >= self.limit):
                printmd("Going to sleep for 10 seconds since the request threshold has been reached",color="red")
                time.sleep(10)
                self.count = 0
        else:
            # Window has elapsed: reset the counter to start again
            self.count = 0

    # Fetch data set from API & Save to CSV               
    def fetchFilmDetails(self):
        # define column names for our new dataframe
        columns = ['film','id','genres','overview','popularity','production_companies','original_title','original_language','release_date','revenue','vote_average','vote_count','reviews']
        # create dataframe with columns
        df = pd.DataFrame(columns=columns)
        # For each year
        for year in list(range(self.start,self.end+1)):
            printmd("<b>Fetching for year:</b>"+str(year),color="green")
            # Fetch Film Details
            response = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  self.apikey+'&release_date.gte='+str(year)+'-01-01&release_date.lte='+str(year)+'-01-01')
            self.check_requestLimit()
            films_json = response.json() # store parsed json response
            # Get total number of pages for the results
            pages = films_json['total_pages']
            # Loop over every page (page numbers are 1-based and inclusive), up to the page limit
            for page in range(1, min(pages, self.pagelimit) + 1):
                # Get Response for current page
                response = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  self.apikey +'&release_date.gte='+str(year)+'-01-01&release_date.lte='+str(year)+'-01-01&page='+str(page))
                # Check if limit has reached
                self.check_requestLimit()
                films_json = response.json() # store parsed json response
                # Stop fetching for current year if page limit is hit
                if(page > self.pagelimit):
                    break
                # Get results from json
                films = films_json['results']
                # for each of the film in the current page get details & Reviews
                for film in films:
                    # Get Film details
                    film_details = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ self.apikey +'&language=en-US')
                    # Check if limit has reached
                    self.check_requestLimit()
                    # Convert to json
                    film_detail = film_details.json()
                    # Get Reviews 
                    film_reviews = requests.get('https://api.themoviedb.org/3/movie/'+str(film['id'])+'/reviews?api_key='+self.apikey+'&language=en-US')
                    # Check if limit has reached
                    self.check_requestLimit()
                    # Convert to json
                    film_review = film_reviews.json()
                    # store all details in dataframe    
                    df.loc[len(df)]=[film['title'],film_detail['id'],film_detail['genres'],film_detail['overview'],film_detail['popularity'],film_detail['production_companies'],film_detail['original_title'],film_detail['original_language'],film_detail['release_date'],film_detail['revenue'],film_detail['vote_average'],film_detail['vote_count'],film_review['results']]
        # Save dataframe as csv
        df.to_csv(self.filename)
        printmd("Document Saved.",color="green")
        printmd("<b>Number of records: </b>"+str(df.shape[0]),color="green")
    
    # Function to Preprocess dataset & apply some filtering.
    def processDetails(self):
        
        # Load CSV
        self.films = pd.read_csv(self.filename, index_col=0)
        printmd("**Cleaning & Filtering Data**")
        
        # Convert date column to date time format
        self.films['release_date'] = pd.to_datetime(self.films['release_date'])
        
        # Add separate columns for month & year for analysis
        printmd("<b>Columns before modification: </b>"+ str(self.films.columns.tolist()),color="red")
        printmd("**Adding Columns month & year for analysis**")
        self.films['month'] = self.films['release_date'].dt.month
        self.films['year'] = self.films['release_date'].dt.year
        printmd("<b>Columns after modification: </b>"+ str(self.films.columns.tolist()),color="green")
        
        # Filter out films for the start & end date
        years = list(self.films['year'].value_counts(sort=False))
        years_ix = list(self.films['year'].value_counts(sort=False).index)
        plotbar(zip(years_ix,years),"Bar plot showing distribution of data over each year",'Year','Count of Movies Released')  
        printmd("**Filter dataset to movies released between "+str(self.start)+" & "+str(self.end)+" (inclusive).**")
        self.films = self.films[(self.films['release_date'].dt.year >= self.start) & (self.films['release_date'].dt.year <= self.end)]
        years = list(self.films['year'].value_counts(sort=False))
        years_ix = list(self.films['year'].value_counts(sort=False).index)
        plotbar(zip(years_ix,years),"Bar plot showing distribution of data over each year after applying filter",'Year','Count of Movies Released')
        printmd("<b>Inference:</b> We can see that the TMDB database is not yet fully updated with data for recent years but has a significant collection of movies from the past 10 years.",color="blue")
        
        # Remove Duplicates
        printmd("**Check for duplicate records**")
        duplicates = self.films[self.films.duplicated(subset ="film",keep="first")]['film']
        print(duplicates)
        printmd("<b>Count of duplicate records: </b>"+str(len(duplicates)),color="red")
        printmd("<b>Count of total records including duplicates: </b>"+str(self.films.shape[0]),color="red")
        printmd("**Remove duplicate records**")
        self.films.drop_duplicates(subset ="film",keep = "first", inplace = True) 
        printmd("<b>Count of total records after removing duplicates: </b>"+str(self.films.shape[0]),color="green")
        printmd("<b>Inference:</b> The TMDB database contains duplicate records, possibly added by different users.",color="blue")
        
        printmd("<b> Analyze genres in the dataset </b>")
        printmd("<b>ISSUE: </b> We can see that genres are stored as bracketed strings & need to be split into a more readable form that Python can parse",color="red")
        genres = self.films.iloc[10:20]['genres']
        print(genres)
        printmd("<b> Analyze production companies in the dataset </b>")
        printmd("<b>ISSUE: </b> We can see that companies are stored as bracketed strings & need to be split into a more readable form that Python can parse",color="red")
        companies = self.films.iloc[:10]['production_companies']
        print(companies)
        printmd("<b> Analyze reviews in the dataset </b>")
        printmd("<b>ISSUE: </b> We can see that reviews & authors are combined. Furthermore, they are stored as bracketed strings & need to be split into a more readable form that Python can parse",color="red")
        reviews = self.films.iloc[-20:]['reviews']
        print(reviews)
        # For each film fetch Genres, Companies, Reviews & Authors
        for index, row in self.films.iterrows():
            # Fetch Genres and convert to list
            genres_list = ast.literal_eval(row['genres'])
            genres = list()
            # For genres of the movie
            for genre in genres_list:
                # Append Genre to list
                genres.append(genre['name'])
            # Update Dataframe
            self.films.loc[index,'genres'] = ','.join(genres)    
            
            # Fetch companies
            prod_comp_list = ast.literal_eval(row['production_companies'])
            prod_companies = list()
            # Add each company of the movie
            for company in prod_comp_list:
                # Append company to list
                prod_companies.append(company['name'])
            # Update Dataframe
            self.films.loc[index,'production_companies'] = ','.join(prod_companies)
            
            # Fetch Reviews
            reviews_list = ast.literal_eval(row['reviews'])
            reviews = list()
            authors = list()
            # Add each review of the movie after removing punctuations & converting to lower case & ascii
            for review in reviews_list:
                stripped_review = re.sub(r"[\[\]\"\',-.;:@#?!&*$()/]+\ *", " ", review['content'])
                stripped_review = ''.join([t.lower() for t in stripped_review if t.isalnum() or t == ' '])
                stripped_review = (stripped_review.encode('ascii', 'ignore')).decode("utf-8")
                authors.append(review['author'])
                reviews.append(stripped_review)
                
            # Update Dataframe
            self.films.loc[index,'reviews'] = ','.join(reviews)
            self.films.loc[index,'authors'] = ','.join(authors)
        
       
        printmd("<b>Genres after processing</b>",color="green")
        genres = self.films.iloc[10:20]['genres']
        print(genres)
        printmd("<b>Companies after processing</b>",color="green")
        companies = self.films.iloc[:10]['production_companies']
        print(companies)
        printmd("<b>Reviews after processing generates two separate columns for reviews & authors</b>",color="green")
        printmd("<b>Reviews are also cleaned of special characters & converted to lower case</b>",color="green")
        reviews = self.films.iloc[-10:]['reviews']
        print(reviews)
        authors = self.films.iloc[-10:]['authors']
        print(authors)

        printmd("<b>ISSUE: </b> Fields have missing entries due to data not being available",color="red")
        printmd("<b>Replace empty fields with NaN</b>; they will be dropped per analysis, depending on the fields used")
        # Replace empty strings with NaN
        self.films = self.films.replace(r'^\s*$', np.nan, regex=True)       
        printmd("<b>Reviews after processing</b>",color="green")
        reviews = self.films.iloc[-10:]['reviews']
        print(reviews)
    
    # Analyze Authors & Reviews written by them
    def analyzeAuthors(self):
        
        printmd("<b> Dataset size before: </b>"+str(self.films.shape[0]),color="red")
        printmd("<b> Remove Missing Entries </b>")
        # Get list of authors & Drop empty rows
        films = self.films.dropna(subset=['authors'])
        printmd("<b> Dataset size after: </b>"+str(films.shape[0]),color="green")
        
        author_list = list()
        review_list = list()
        
        # Display authors with highest number of reviews
        for index,film in films.iterrows():
            film_authors = film['authors']
            film_reviews = film['reviews']
            author_list += film_authors.split(",")
            review_list += film_reviews.split(",")
        fd = nltk.FreqDist(author_list)
        author_reviews = pd.DataFrame(np.column_stack([author_list, review_list]), 
                               columns=['authors', 'reviews'])
        common_authors = fd.most_common(5)
        printmd("<b> Top 5 authors by number of reviews written </b> ")
        plotbar(common_authors,"Authors with the highest number of reviews",'Author','Review count')
        all_reviews = list()
        authors_lexdiv = list()
        authors = list()
        printmd("<b>Inference:</b> Author "+str(common_authors[0][0])+ " has the most reviews ("+str(common_authors[0][1])+") written in the dataset.",color="blue")
        printmd("<b>Inference:</b> Author "+str(list(dict(fd.most_common()[-1:]).keys())[0])+ " has the least reviews ("+str(list(dict(fd.most_common()[-1:]).values())[0])+") written in the dataset.",color="blue")
        
        # Analyse Reviews
        reviews = review_list[0]
        printmd("<b> Example of a review</b>",color="red")
        print(reviews)
        printmd("<b>Clean the sentence by removing stop words & stemming</b>")
        printmd("<b> Cleaned Review </b>",color="green")
        printmd(' '.join(cleanText(reviews.split())))
        
        printmd("<b> Analysis of Reviews </b>")
        printmd("<b> Display common words used by each top author </b>")
        # For each author, analyze words he has used
        for author in common_authors:
            reviews = author_reviews.loc[author_reviews['authors'] == author[0]]['reviews']
            review_total = list()
            for review in reviews:
                review_total += review.split()
            cleaned_review =  cleanText(review_total) 
            all_reviews += cleaned_review
            # Get frequency distribution
            fd = nltk.FreqDist(cleaned_review)
            # Display common 10 words used by top 5 authors
            common_words = fd.most_common(10)
            plotbar(common_words,"Top 10 Common words used by "+author[0],'Words','Count')
            # Get lexical diversity of each of the authors
            authors_lexdiv.append(lexical_diversity(cleaned_review))
            authors.append(author[0])
        # Plot lexical diversity
        printmd("<b> Lexical Diversity </b>")
        printmd("The plot below indicates how varied each author's vocabulary is: a higher value means a more diverse vocabulary, while a lower value means the same words recur across the author's reviews")
        plotbar(zip(authors,authors_lexdiv),"Lexical Diversity of top 5 Reviewers ",'Reviewers','Lexical Diversity')                                  
        # Display wordcloud of common words
        printmd("<b> Word Cloud </b>")
        printmd("The word cloud signifies common words used by the authors with bigger words signifying more usage among the reviews in the dataset.")
        disp_wordcloud(all_reviews, "Common words used by top 5 reviewers")

    # Analyse Genre
    def analyzeGenre(self):
        
        printmd("<b>Number of records: </b>" + str(self.films.shape[0]),color="red")
        genres_list = list()
        prod_genre = list()
        printmd("<b>Remove movies with no genre</b>")
        # Drop empty rows
        films = self.films.dropna(subset=['genres'])
        printmd("<b>Number of records: </b>" + str(films.shape[0]),color="green")
        
        # For each film
        for index,film in films.iterrows():
            # Get genres for the film
            film_genres = film['genres']
            # Add this film's genres only; appending the cumulative genres_list would recount earlier films
            prod_genre += film_genres.split(",")
                
        # Generate Frequency distribution of genres
        fd = nltk.FreqDist(prod_genre)
        genres = dict(fd)
        plt.pie(genres.values(), labels=genres.keys(),autopct='%.2f', startangle=0)
        plt.title("Distribution of genres across the data")
        plt.show()
        printmd("<font color =blue><b> Inference: </b>The above pie chart shows Drama & Documentary movies are most captured in this database with both consisting of more than 16% each. Genres such as Adventure, Mystery are least captured in this dataset.</font>")
    
    # Analyze company & the genre of movies they produce
    def analyzeCompanyGenre(self):  
        # Drop empty rows
        printmd("<b>Number of records:</b>" + str(self.films.shape[0]),color="red")
        company_list = list()
        genres_list = list()
        prod_genre = list()
        printmd("<b>Remove movies with no genre or production company</b>")
        films = self.films.dropna(subset=['production_companies','genres'],how='any')
        printmd("<b>Number of records:</b>" + str(films.shape[0]),color="green")
        
        # For each film, pair every production company with every genre of that film
        prod_comp_genre = list()
        for index,film in films.iterrows():
            film_companies = film['production_companies'].split(",")
            film_genres = film['genres'].split(",")
            # for each production company
            for company in film_companies:
                # for each genre
                for genre in film_genres:
                    prod_comp_genre.append((company, genre))
        # Get frequency of the above combinations
        fd = nltk.FreqDist(prod_comp_genre)
        # Display top 30 companies
        top_companies = dict(fd.most_common(30))
        ser = pd.Series(list(top_companies.values()),
                  index=pd.MultiIndex.from_tuples(top_companies.keys()))
        df = ser.unstack().fillna(0)
        sns.heatmap(df).set_title('Production Companies Vs. Genres - Most Number Of Movies(Top 30)')
        # Display first heatmap
        plt.show()
        
        # Display the combinations ranked 31 to 60
        least_companies = dict(fd.most_common(60)[-30:])
        ser = pd.Series(list(least_companies.values()),
                  index=pd.MultiIndex.from_tuples(least_companies.keys()))
        df = ser.unstack().fillna(0)
        sns.heatmap(df).set_title('Production Companies Vs. Genres - Most Number Of Movies(Top 30-60)')
        # Display second heatmap
        plt.show()
        # All combinations, most frequent first
        ranked = fd.most_common()
        printmd("<b>Inference:</b> Production/Genre "+str(ranked[0][0])+ " has "+str(ranked[0][1])+" movie(s) in the dataset.",color="blue")
        printmd("<b>Inference:</b> Production/Genre "+str(ranked[-1][0])+ " has "+str(ranked[-1][1])+" movie(s) in the dataset.",color="blue")
    
    # Analyze movies released across years/month    
    def analyzeMovieTimelines(self):
        # Plot the number of movies released per month and per year
        printmd("<b> Movies per month </b>")
        months = list(self.films['month'].value_counts(sort=False))
        months_ix = list(self.films['month'].value_counts(sort=False).index)
        plotbar(zip(months_ix,months),"Movies across the Months",'Month','Count of Movies Released')
        printmd("<b>Inference: </b>Movies are skewed towards January since the database has defaulted all movies without a release date to the first day of the year of release",color="blue")
        printmd("<b> Movies per year </b>")
        years = list(self.films['year'].value_counts(sort=False))
        years_ix = list(self.films['year'].value_counts(sort=False).index)
        plotbar(zip(years_ix,years),"Movies across the Years",'Year','Count of Movies Released')  
        printmd("<b>Inference: </b>Data for recent years has not been fully updated, so those years have fewer movies than earlier ones",color="blue")
        
    # Analyze movies produced across each production company
    def analyzeCompanies(self):
        # Filter Data
        printmd("<b>Number of records: </b>" + str(self.films.shape[0]),color="red")
        # Get list of companies & Drop empty rows
        films = self.films.dropna(subset=['production_companies'])
        printmd("<b>Drop rows without production companies</b>")
        printmd("<b>Number of records: </b>" + str(films.shape[0]),color="green")
        
        company_list = list()
        prod_year = list()
        lang = list()
        # For each film add list of years,language for each production company
        for index,film in films.iterrows():
            film_companies = film['production_companies']
            company_list += film_companies.split(",")
            for company in film_companies.split(","):
                prod_year += [film['year']]
                lang += [film['original_language']]
        # Combine company list, production year        
        prod_comp_year = zip(company_list,prod_year)
        prod_comp_lang = zip(company_list,lang)
        
        # Plot top 15 companies with high number of releases
        fd = nltk.FreqDist(company_list)
        top_companies = fd.most_common(15)
        plotbar(top_companies,"Top 15 Production Companies with high releases",'Company','Movie count')
        printmd("<b>Inference:</b> Company "+str(list(dict(fd.most_common()[:1]).keys())[0])+ " has the most releases, with "+str(list(dict(fd.most_common()[:1]).values())[0])+" movie(s) in the dataset.",color="blue")
        printmd("<b>Inference:</b> Company "+str(list(dict(fd.most_common()[-1:]).keys())[0])+ " has the fewest releases, with "+str(list(dict(fd.most_common()[-1:]).values())[0])+" movie(s) in the dataset.",color="blue")
        
        # Plot the 60 most frequent (company, year) combinations
        # Plot top 30 first
        fd = nltk.FreqDist( prod_comp_year)
        top_companies = dict(fd.most_common(30))
        ser = pd.Series(list(top_companies.values()),
                  index=pd.MultiIndex.from_tuples(top_companies.keys()))
        df = ser.unstack().fillna(0)
        sns.heatmap(df).set_title('Production Companies Vs. Count of Movies Year - Top 30')
        plt.show()
        
        # Plot 30-60
        least_companies = dict(fd.most_common(60)[-30:])
        ser = pd.Series(list(least_companies.values()),
                  index=pd.MultiIndex.from_tuples(least_companies.keys()))
        df = ser.unstack().fillna(0)
        sns.heatmap(df).set_title('Production Companies Vs. Count of Movies Year - Top 30 to 60')
        # Show this heatmap before drawing the next one, otherwise both share the same axes
        plt.show()
        
        # Plot the 60 most frequent (company, language) combinations
        # Plot top 30 first
        fd = nltk.FreqDist(prod_comp_lang)
        top_companies = dict(fd.most_common(30))
        ser = pd.Series(list(top_companies.values()),
                  index=pd.MultiIndex.from_tuples(top_companies.keys()))
        df = ser.unstack().fillna(0)
        sns.heatmap(df).set_title('Production Companies Vs. Language - Top 30')
        plt.show()
        
        # Plot 30-60
        least_companies = dict(fd.most_common(60)[-30:])
        ser = pd.Series(list(least_companies.values()),
                  index=pd.MultiIndex.from_tuples(least_companies.keys()))
        df = ser.unstack().fillna(0)
        sns.heatmap(df).set_title('Production Companies Vs. Language - Top 30 to 60')
        plt.show()

    # Analyze languages
    def analyzeLang(self):          
        # Collect the original language of every film
        lang = list()
        for index,film in self.films.iterrows():
            # Get the language of the film
            film_lang = film['original_language']
            # Append it to the list
            lang += [film_lang]
        # Generate frequency distribution of languages
        fd = nltk.FreqDist(lang)
        languages = dict(fd)
        plt.pie(languages.values(), labels=languages.keys(),autopct='%.2f', startangle=0)
        plt.title("Distribution of languages across the data")
        plt.show()
        printmd("<font color =blue><b> Inference: </b>The above pie chart shows 75% of the movies are from the en(English) language</font>")
    
    # Analyze Revenue
    def analyzeRevenue(self):
        # Cast to int: the year column is float if any release date in the CSV was missing
        x_axis_numbering = range(int(self.films['year'].min()), int(self.films['year'].max())+1)
        plt.xticks(x_axis_numbering)
        movies_year = (self.films[self.films.groupby(['year'])['revenue'].transform('max') == self.films['revenue']]).sort_values(by=['year'])
        plt.xlabel('Year')
        plt.ylabel('Revenue')
        plt.plot(movies_year['year'].values, movies_year['revenue'].values, label='Highest Revenue')
        
        # Ignore films with a revenue of 1000 or less before taking the yearly minimum
        films = self.films[self.films.revenue > 1000]
        movies_year = (films[films.groupby(['year'])['revenue'].transform('min') == films['revenue']]).sort_values(by=['year'])
        plt.title('Highest/Lowest Revenue Per Year')
        plt.xlabel('Year')
        plt.ylabel('Revenue')
        plt.plot(movies_year['year'].values, movies_year['revenue'].values, label='Lowest Revenue')
        plt.legend()
        plt.show()
    
    # Analyze Vote Average
    def analyzeVoteAvg(self):
        # Exclude films whose vote average is exactly 10
        films = self.films[self.films.vote_average != 10]
        # Cast to int: the year column is float if any release date in the CSV was missing
        x_axis_numbering = range(int(films['year'].min()), int(films['year'].max())+1)
        plt.xticks(x_axis_numbering)
        movies_year = (films[films.groupby(['year'])['vote_average'].transform('max') == films['vote_average']]).sort_values(by=['year'])
        plt.xlabel('Year')
        plt.ylabel('Vote Avg')
        plt.plot(movies_year['year'].values, movies_year['vote_average'].values, label='Highest Vote Avg')
        
        # Exclude films whose vote average is exactly 0
        films = self.films[self.films.vote_average != 0]
        movies_year = (films[films.groupby(['year'])['vote_average'].transform('min') == films['vote_average']]).sort_values(by=['year'])
        plt.title('Highest/Lowest Vote Average Per Year')
        plt.xlabel('Year')
        plt.ylabel('Vote Avg')
        plt.plot(movies_year['year'].values, movies_year['vote_average'].values, label='Lowest Vote Avg')
        plt.legend()
        plt.show()        


### Create Object for movies from 2000 to 2019

In [None]:
tmdb = tmdb(REQUEST_LIMIT,APIKEY,FILENAME,2000,2019)

### Fetch Data from API & Save to CSV

In [None]:
tmdb.fetchFilmDetails()

### Preprocessing: Clean & Filter Dataset

In [None]:
tmdb.processDetails()

### Analysis: Authors of reviews

Analyse Authors & Reviews in the dataset.
<ul>
    <li> Remove empty rows of authors & reviews. </li>
    <li> Clean the sentence by removing stop words & stemming. </li>
    <li> Fetch top 5 authors with highest number of reviews. </li>
    <li> Display top 10 common words used in their reviews. </li>
    <li> Show the lexical diversity of the reviews published by these authors.</li>
    <li> Show the common words used in the reviews by these authors </li>
</ul>
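
The ranking and word-count steps listed above can be sketched without NLTK using `collections.Counter`. This is an illustration under stated assumptions: the `pairs` list is hypothetical sample data standing in for the notebook's author and review columns, and no stop-word removal or stemming is applied.

```python
from collections import Counter

# Hypothetical (author, review) pairs standing in for the notebook's columns
pairs = [
    ("alice", "great film great cast"),
    ("bob", "slow start"),
    ("alice", "great ending"),
]

# Number of reviews per author, then the single most prolific author
review_counts = Counter(author for author, _ in pairs)
top_author, n_reviews = review_counts.most_common(1)[0]

# Word frequencies across all reviews written by that author
words = Counter()
for author, review in pairs:
    if author == top_author:
        words.update(review.split())

print(top_author, n_reviews)
print(words.most_common(1))
```

`nltk.FreqDist`, which the notebook uses, is a subclass of `Counter`, so `most_common` behaves the same way there.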

In [None]:
tmdb.analyzeAuthors()

### Analysis: Genres

Analyse Genres present in the dataset.

<ul>
    <li> Remove empty rows of genres. </li>
    <li> Plot distribution of genres in the dataset. </li>
    <li> Heatmap showing distribution of genres across different production companies. </li>
</ul>    
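
The company-versus-genre heatmap relies on counting (company, genre) pairs, where each film contributes one pair for every combination of its companies and genres. A minimal sketch of that counting step, using hypothetical sample films and `itertools.product`:

```python
from collections import Counter
from itertools import product

# Hypothetical films with comma-separated fields, as produced by the preprocessing step
films = [
    {"production_companies": "Studio A,Studio B", "genres": "Drama,Comedy"},
    {"production_companies": "Studio A", "genres": "Drama"},
]

pair_counts = Counter()
for film in films:
    companies = film["production_companies"].split(",")
    genres = film["genres"].split(",")
    # Every company of the film is paired with every genre of the same film
    pair_counts.update(product(companies, genres))

for (company, genre), count in pair_counts.most_common():
    print(company, genre, count)
```

Pairing within a single film, rather than across the accumulated lists of all films, keeps each count equal to the number of films a company made in that genre.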

In [None]:
tmdb.analyzeGenre()

In [None]:
tmdb.analyzeCompanyGenre()

### Analysis: Movie Timelines

Analyse Movies based on the release date in the dataset.

<ul>
    <li> Plot movies released each month. </li>
    <li> Plot movies released each year. </li>
</ul>   
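
Before reading too much into a monthly distribution, it can help to check how many release dates fall exactly on January 1, since placeholder dates would inflate that month. A rough sketch with hypothetical sample dates (the variable names are illustrative, not from the notebook):

```python
from collections import Counter
from datetime import date

# Hypothetical release dates
release_dates = [
    date(2009, 1, 1),
    date(2009, 12, 18),
    date(2010, 1, 1),
    date(2010, 7, 16),
]

# Movies per month and per year
per_month = Counter(d.month for d in release_dates)
per_year = Counter(d.year for d in release_dates)

# Share of films dated exactly January 1
jan_first_share = sum(d.month == 1 and d.day == 1 for d in release_dates) / len(release_dates)

print(per_month)
print(per_year)
print(jan_first_share)
```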

In [None]:
tmdb.analyzeMovieTimelines()

### Analysis: Production Companies

Analyse Companies based on the movies they release.

<ul>
    <li> Plot movies released each year for each company. </li>
    <li> Analyze movies released by language for each company. </li>
</ul>   

In [None]:
tmdb.analyzeCompanies()

### Analysis: Language

Analyse the original languages of the movies.

<ul>
    <li> Plot distribution of languages </li>
</ul>  

In [None]:
tmdb.analyzeLang()

### Analysis: Revenue & Vote Average

Analyse revenue & vote average for each year.

<ul>
    <li> Plot highest revenue per year </li>
    <li> Plot lowest revenue per year </li>
    <li> Plot highest vote avg per year </li>
    <li> Plot lowest vote avg per year </li>
</ul>  
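
The per-year extremes listed above come down to grouping values by year and taking the maximum and minimum of each group. A plain-Python sketch with hypothetical sample data; `MIN_REVENUE` mirrors the cut-off of 1000 that the notebook applies before computing the yearly minimum:

```python
from collections import defaultdict

MIN_REVENUE = 1000  # revenues at or below this are ignored for the yearly minimum

# Hypothetical (year, title, revenue) records
films = [
    (2009, "A", 2_000),
    (2009, "B", 500_000),
    (2010, "C", 0),
    (2010, "D", 90_000),
    (2010, "E", 7_000),
]

# Group revenues by year
revenues = defaultdict(list)
for year, _title, revenue in films:
    revenues[year].append(revenue)

# Highest revenue per year uses every film
highest = {year: max(vals) for year, vals in revenues.items()}

# Lowest revenue per year skips missing or implausibly small values
lowest = {
    year: min(v for v in vals if v > MIN_REVENUE)
    for year, vals in revenues.items()
    if any(v > MIN_REVENUE for v in vals)
}

print(highest)
print(lowest)
```

Without the cut-off, the yearly minimum would be dominated by films whose revenue is recorded as 0.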

In [None]:
tmdb.analyzeRevenue()

In [None]:
tmdb.analyzeVoteAvg()

### Tentative Conclusion

Further analysis could be carried out alongside real-time data such as Twitter:
<ul>
    <li> Comparison of tweets to the popularity & ratings </li>
</ul>

Following inferences were made from analysing the dataset:
<ul>
    <li> The highest revenue for a movie was made in 2009 </li>
    <li> The dataset is skewed & incomplete with current real-time data </li>
    <li> Release dates are incorrect or unavailable for many movies and have been defaulted to the first month of the year instead </li>
    <li> Lots of movies are missing production company, genre etc </li>
    <li> More reviews would give an in-depth analysis on the movies </li>
</ul>