# The top 15 movies on IMDb

## Main objectives:
    1. Fetch the top 15 movies that have at least 100 votes
    2. Rank them with: ranking = (numVotes / averageNumberOfVotes) * averageRating
    3. List the titles of these top 15 movies
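
As a quick illustration of the ranking formula, the sketch below applies it to three made-up titles. The function name `ranking_score` and all of the numbers are hypothetical; they are not taken from the IMDb data.

```python
def ranking_score(num_votes, average_rating, average_num_votes):
    """ranking = (numVotes / averageNumberOfVotes) * averageRating"""
    return (num_votes / average_num_votes) * average_rating


# Made-up (title, averageRating, numVotes) rows, for illustration only.
rows = [
    ("A", 9.0, 100),
    ("B", 7.0, 500),
    ("C", 8.0, 300),
]

# Mean number of votes across the sample rows.
average_num_votes = sum(votes for _, _, votes in rows) / len(rows)

scores = {
    title: ranking_score(votes, rating, average_num_votes)
    for title, rating, votes in rows
}

# Titles ordered from highest to lowest score.
ordered = sorted(scores, key=scores.get, reverse=True)
print(ordered)
```

In this toy sample, title B comes first even though it has the lowest rating, because the formula scales the rating by the vote count relative to the mean.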


### Method:
    1. Import the files into a Jupyter notebook
    2. Resolve the file locations with abspath
    3. Read the .tsv file containing the rating information into a DataFrame, then convert it to an RDD
    4. Do the same for the .tsv file containing the title information
    5. Filter the ratings (movies, shorts, TV episodes, etc.) so that only titles with 100 or more votes remain
    6. Join the two datasets with the join function, using 'tconst' as the key
    7. Filter the joined data so that only movies are left
    8. Calculate the ranking using the formula
    9. Loop through the final list and print the top 15 movies

In [1]:
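from pyspark.sql import SparkSession
from functions import filter_100_votes, tuple_conversion, filter_movies, find_mean, rank
# `spark` is used below; reuse the notebook's session if one already exists.
spark = SparkSession.builder.getOrCreate()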

In [2]:
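import os
import pathlib

# Resolve the ratings file to an absolute path.
path_to_ratings = pathlib.Path('raw_data/title.ratings.tsv')
abs_path = os.path.abspath(path_to_ratings)
# No schema is given, so every column is read as a string.
df_ratings = spark.read.csv(abs_path, sep=r'\t', header=True).select('tconst','averageRating','numVotes')
rdd_ratings = df_ratings.rdd.map(list)
# top(3) returns the three largest rows; lists compare by their first field, tconst.
rdd_ratings.top(3)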

[['tt9916778', '7.3', '24'],
 ['tt9916766', '6.9', '14'],
 ['tt9916720', '6.0', '57']]

In [3]:
path_to_titles = pathlib.Path('raw_data/title.basics.tsv')
abs_path = os.path.abspath(path_to_titles)
df_titles = spark.read.csv(abs_path, sep=r'\t', header=True).select('tconst','titleType','primaryTitle')
rdd_titles = df_titles.rdd.map(list)
rdd_titles.top(3)

[['tt9916880', 'tvEpisode', 'Horrid Henry Knows It All'],
 ['tt9916856', 'short', 'The Wind'],
 ['tt9916852', 'tvEpisode', 'Episode #3.20']]

In [4]:
rdd_ratings = filter_100_votes(rdd_ratings)
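
`filter_100_votes` comes from the local `functions` module, which is not shown in this notebook. A minimal sketch of what it might look like, assuming each row is a `[tconst, averageRating, numVotes]` list of strings as in the ratings output above; the helper `has_min_votes` and the `min_votes` parameter are hypothetical:

```python
def has_min_votes(row, min_votes=100):
    """Return True if a [tconst, averageRating, numVotes] row has enough votes."""
    # numVotes was read as a string, so convert it before comparing.
    return int(row[2]) >= min_votes


def filter_100_votes(rdd, min_votes=100):
    """Keep only rows with at least `min_votes` votes (assumed behavior)."""
    return rdd.filter(lambda row: has_min_votes(row, min_votes))
```

The `int` conversion matters: compared as strings, `'24' >= '100'` is true, which would let low-vote titles through.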

In [5]:
rdd_ratings_pair = tuple_conversion(rdd_ratings)
rdd_titles_pair = tuple_conversion(rdd_titles)
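
`join` operates on pair RDDs, whose elements are `(key, value)` tuples. Judging from the joined output in this notebook, `tuple_conversion` keys each row by its first field and keeps the whole row as the value. A sketch under that assumption (`to_pair` is a hypothetical helper name):

```python
def to_pair(row):
    """Turn [tconst, ...] into (tconst, [tconst, ...]) so tconst can serve as the join key."""
    return (row[0], row)


def tuple_conversion(rdd):
    """Map every row to a (key, value) pair keyed by tconst (assumed behavior)."""
    return rdd.map(to_pair)


print(to_pair(['tt0000009', '5.9', '154']))
```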

In [6]:
rdd_joined = rdd_ratings_pair.join(rdd_titles_pair)
rdd_movies = filter_movies(rdd_joined)
rdd_movies.take(1)

[('tt0000009',
  (['tt0000009', '5.9', '154'], ['tt0000009', 'movie', 'Miss Jerry']))]
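
Each joined record has the shape `(tconst, (ratings_row, titles_row))`, so the title type sits at `record[1][1][1]` and the title at `record[1][1][2]`. A possible `filter_movies`, assuming it keeps the records whose `titleType` equals `'movie'` (`is_movie` is a hypothetical helper):

```python
def is_movie(record):
    """record is (tconst, (ratings_row, titles_row)), as produced by the join."""
    _, (_, titles_row) = record
    return titles_row[1] == 'movie'


def filter_movies(rdd):
    """Keep only joined records whose titleType is 'movie' (assumed behavior)."""
    return rdd.filter(is_movie)


# The record shown in the output above.
sample = ('tt0000009',
          (['tt0000009', '5.9', '154'], ['tt0000009', 'movie', 'Miss Jerry']))
print(is_movie(sample))   # True
print(sample[1][1][2])    # Miss Jerry
```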

In [7]:
top_15_ranked = rank(rdd_movies).take(15)
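
`rank` and `find_mean` are also defined in `functions`. One way they could implement the ranking formula is sketched below. It assumes the average number of votes is taken over the RDD passed in (the notebook does not say which set of titles the average covers) and that results are sorted from highest to lowest score. The helpers `votes_of` and `score` are hypothetical.

```python
def votes_of(record):
    """numVotes of a joined (tconst, (ratings_row, titles_row)) record, as an int."""
    return int(record[1][0][2])


def score(record, mean_votes):
    """ranking = (numVotes / averageNumberOfVotes) * averageRating"""
    ratings_row = record[1][0]
    return (int(ratings_row[2]) / mean_votes) * float(ratings_row[1])


def find_mean(rdd):
    """Mean number of votes across the records in `rdd` (assumed scope)."""
    return rdd.map(votes_of).mean()


def rank(rdd):
    """Sort records by ranking score, highest first (assumed behavior)."""
    mean_votes = find_mean(rdd)
    return rdd.sortBy(lambda record: score(record, mean_votes), ascending=False)
```

Because the result is sorted in descending order, `take(15)` on it returns the 15 highest-scoring movies.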

In [8]:
for each_term in top_15_ranked:
    print(each_term[1][1][2])

The Shawshank Redemption
The Dark Knight
Inception
Fight Club
Pulp Fiction
Forrest Gump
The Godfather
The Lord of the Rings: The Return of the King
The Lord of the Rings: The Fellowship of the Ring
The Matrix
The Lord of the Rings: The Two Towers
The Dark Knight Rises
Interstellar
Se7en
Gladiator
