# The top 15 movies on IMDb

## Main objectives:
1. Fetch the top 15 movies with a minimum of 100 votes
2. Ranking = (numVotes / averageNumberOfVotes) * averageRating
3. List the titles of these top 15 movies
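
As a quick check of the ranking formula, here is a minimal sketch in plain Python. The function name `ranking` and the numbers are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from the IMDb data.

```python
def ranking(num_votes, average_rating, average_number_of_votes):
    """Ranking = (numVotes / averageNumberOfVotes) * averageRating."""
    return (num_votes / average_number_of_votes) * average_rating


# Hypothetical values: a title with twice the average vote count and a rating of 8.0.
example = ranking(num_votes=2000, average_rating=8.0, average_number_of_votes=1000)
print(example)  # 16.0
```

Because the vote count is divided by the average vote count, a title with many votes is boosted and a title with few votes is penalized, even when the ratings are equal.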

#### Import Spark and the helper functions into the Jupyter notebook

In [1]:
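from pyspark import SparkContext
from pyspark.sql import SparkSession
from functions import filter_100_votes, tuple_conversion, filter_movies, find_mean, rank
spark = SparkSession.builder.getOrCreate()  # reuses the notebook's session if one already exists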

#### Point to the files with abspath. Then read the .tsv file containing the rating information into a DataFrame and convert it into an RDD.

In [2]:
import os
import pathlib
path_to_ratings = pathlib.Path('raw_data/title.ratings.tsv')
abs_path = os.path.abspath(path_to_ratings)
df_ratings = spark.read.csv(abs_path, sep=r'\t', header=True).select('tconst','averageRating','numVotes')
rdd_ratings = df_ratings.rdd.map(list)
rdd_ratings.top(3)

[['tt9916778', '7.3', '24'],
 ['tt9916766', '6.9', '14'],
 ['tt9916720', '6.0', '57']]

#### Then do the same for the .tsv file containing the title information

In [3]:
path_to_titles = pathlib.Path('raw_data/title.basics.tsv')
abs_path = os.path.abspath(path_to_titles)
df_titles = spark.read.csv(abs_path, sep=r'\t', header=True).select('tconst','titleType','primaryTitle')
rdd_titles = df_titles.rdd.map(list)
rdd_titles.top(3)

[['tt9916880', 'tvEpisode', 'Horrid Henry Knows It All'],
 ['tt9916856', 'short', 'The Wind'],
 ['tt9916852', 'tvEpisode', 'Episode #3.20']]

#### Filter all titles (movies, shorts, TV episodes, etc.) so that only those with 100 or more votes remain

In [4]:
rdd_ratings = filter_100_votes(rdd_ratings)
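
The body of `filter_100_votes` is not shown in this notebook. The sketch below is one way such a filter could be written, assuming each row is `[tconst, averageRating, numVotes]` with string fields, as in the output above. The names `has_min_votes` and `MIN_VOTES` and the second sample row are hypothetical.

```python
MIN_VOTES = 100  # threshold from the objectives


def has_min_votes(row, min_votes=MIN_VOTES):
    """Return True when the numVotes field (index 2) is at least min_votes."""
    # The fields are strings because the file is read without schema inference,
    # so the vote count is converted before comparing.
    return int(row[2]) >= min_votes


rows = [
    ['tt9916778', '7.3', '24'],     # from the sample output above
    ['tt_example', '5.6', '1500'],  # hypothetical row
]
kept = [row for row in rows if has_min_votes(row)]
print(kept)

# On an RDD, the same predicate could be passed to filter:
# rdd_ratings.filter(has_min_votes)
```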

#### Convert each row into a (key, value) pair so that the two RDDs can be joined

In [5]:
rdd_ratings_pair = tuple_conversion(rdd_ratings)
rdd_titles_pair = tuple_conversion(rdd_titles)
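
`tuple_conversion` is defined outside this notebook, so its exact output layout is unknown. A minimal sketch of the idea, assuming the first field (`tconst`) becomes the key and the remaining fields become the value; the name `to_pair` is hypothetical and the real helper may nest the fields differently.

```python
def to_pair(row):
    """Turn [tconst, field1, field2] into (tconst, (field1, field2))."""
    return (row[0], tuple(row[1:]))


pair = to_pair(['tt9916778', '7.3', '24'])
print(pair)  # ('tt9916778', ('7.3', '24'))

# On an RDD, this could be applied with: rdd_ratings.map(to_pair)
```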

#### Join the two RDDs with the join function, using 'tconst' as the key

In [6]:
rdd_joined = rdd_ratings_pair.join(rdd_titles_pair)
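
`RDD.join` performs an inner join: it keeps only the keys present in both RDDs and yields `(key, (left_value, right_value))`. The plain-Python sketch below mimics that behavior on small lists to show the resulting shape. All ids and values are hypothetical, and the dictionary lookup assumes each key appears at most once per side.

```python
ratings_pairs = [('tt_a', ('7.3', '240')), ('tt_b', ('6.0', '150'))]
titles_pairs = [('tt_a', ('movie', 'Example Title')), ('tt_c', ('short', 'Other Title'))]

# Index the right-hand side by key, then keep only keys found on both sides.
titles_by_key = dict(titles_pairs)
joined = [
    (key, (rating_value, titles_by_key[key]))
    for key, rating_value in ratings_pairs
    if key in titles_by_key
]
print(joined)  # [('tt_a', (('7.3', '240'), ('movie', 'Example Title')))]
```

Only `tt_a` survives, since `tt_b` has no title record and `tt_c` has no rating record.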

#### Filter the joined data so that only movies are left

In [None]:
rdd_movies = filter_movies(rdd_joined)
rdd_movies.take(1)
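
The body of `filter_movies` is not shown. One possible predicate, assuming each joined record has the layout `(tconst, ((averageRating, numVotes), (titleType, primaryTitle)))` and that movies carry the `titleType` value `'movie'`. The name `is_movie`, the assumed layout, and the sample records are hypothetical.

```python
def is_movie(record):
    """Return True when the titleType field of a joined record is 'movie'."""
    _rating_fields, title_fields = record[1]
    return title_fields[0] == 'movie'


records = [
    ('tt_a', (('7.3', '240'), ('movie', 'Example Title'))),
    ('tt_b', (('6.0', '150'), ('tvEpisode', 'Episode #1.1'))),
]
movies = [record for record in records if is_movie(record)]
print(movies)

# On an RDD, this could be applied with: rdd_joined.filter(is_movie)
```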

#### Calculate the ranking using the formula

In [None]:
top_15_ranked = rank(rdd_movies).take(15)
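
`rank` and `find_mean` live in the external `functions` module, so the sketch below only illustrates the ranking step in plain Python. It assumes records shaped like `(tconst, ((averageRating, numVotes), (titleType, primaryTitle)))` and takes the average vote count over the records passed in; the notebook does not say which set of titles the average is computed over. The name `rank_records` and the sample data are hypothetical.

```python
def rank_records(records):
    """Sort records by (numVotes / mean numVotes) * averageRating, highest first."""
    votes = [int(record[1][0][1]) for record in records]
    mean_votes = sum(votes) / len(votes)

    def score(record):
        rating, num_votes = record[1][0]
        return (int(num_votes) / mean_votes) * float(rating)

    return sorted(records, key=score, reverse=True)


sample = [
    ('tt_a', (('8.0', '300'), ('movie', 'A'))),
    ('tt_b', (('9.0', '100'), ('movie', 'B'))),
    ('tt_c', (('5.0', '200'), ('movie', 'C'))),
]
# Mean votes = 200, so the scores are A: 12.0, B: 4.5, C: 5.0.
ranked_titles = [record[1][1][1] for record in rank_records(sample)]
print(ranked_titles)  # ['A', 'C', 'B']
```

Title B has the highest rating but the fewest votes, so it ends up last.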

#### Loop through the final list and print out the titles of the top 15 movies

In [None]:
for each_term in top_15_ranked:
    print(each_term[1][1][2])