### Import and Initialize Modules

This cell imports the `Movies` class from the custom `movies` module and initializes it with data from the `movies.csv` file. The other custom classes (`Links`, `Ratings`, and `Tags`) are imported in later cells. The `Movies` class provides the movie-level analyses used below, such as distributions by release year and by genre.

In [None]:
from movies import Movies 
movie_data = Movies("../datasets/movies.csv")

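### Movie Distribution by Release Year

This cell times the `dist_by_release` method of the `movie_data` object and prints the number of movies per release year, ordered from the most to the least frequent year.
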
In [6]:
%timeit movie_data.dist_by_release()
dist_by_release = movie_data.dist_by_release()
print("Top years of movies:")
for i, j in dist_by_release.items():
    print(f"{i} : {j}")

992 μs ± 34.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Top years of movies:
1995 : 224
1994 : 184
1996 : 181
1993 : 101
1992 : 23
1990 : 15
1991 : 15
1989 : 14
1986 : 9
1982 : 8
1940 : 8
1957 : 8
1987 : 8
1980 : 8
1981 : 7
1988 : 7
1979 : 7
1955 : 6
1959 : 6
1968 : 6
1997 : 6
1939 : 6
1985 : 6
1967 : 5
1965 : 5
1951 : 5
1958 : 5
1944 : 5
1941 : 5
1975 : 5
1971 : 5
1984 : 5
1964 : 4
1973 : 4
1954 : 4
1934 : 4
1960 : 4
1963 : 4
1950 : 4
1974 : 4
1983 : 4
1977 : 3
1937 : 3
1972 : 3
1952 : 3
1961 : 3
1953 : 3
1946 : 3
1938 : 3
1956 : 3
1962 : 3
1976 : 2
1969 : 2
1970 : 2
1942 : 2
1945 : 2
1947 : 2
1935 : 2
1936 : 2
1949 : 2
1978 : 2
1943 : 1
1932 : 1
1966 : 1
1948 : 1
1933 : 1
1931 : 1
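
The `Movies` implementation is not shown in this notebook. As a minimal sketch of how such a distribution could be computed, assuming titles end with a four-digit year in parentheses (as in `Toy Story (1995)`), the hypothetical helper `count_by_release_year` below extracts the year with a regular expression and counts occurrences. The helper name and the sample titles are illustrative and not part of the `movies` module.

```python
import re
from collections import Counter

# Matches a trailing "(YYYY)" such as the one in "Toy Story (1995)".
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def count_by_release_year(titles):
    """Return {year: count}, ordered by count from highest to lowest.

    Hypothetical helper: titles without a trailing year are skipped.
    """
    years = []
    for title in titles:
        match = YEAR_PATTERN.search(title)
        if match:
            years.append(int(match.group(1)))
    # most_common() orders the pairs by count, highest first.
    return dict(Counter(years).most_common())


sample_titles = ["Toy Story (1995)", "Heat (1995)", "Akira (1988)", "Wizard of Oz"]
print(count_by_release_year(sample_titles))  # {1995: 2, 1988: 1}
```

Anchoring the pattern at the end of the title keeps parenthesized alternative titles, as in `Ghost in the Shell (Kôkaku kidôtai) (1995)`, from being mistaken for the year.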


### Movie Distribution by Genre

This cell calculates and prints the distribution of movies across different genres using the `dist_by_genres` method of the `movie_data` object.

In [None]:
%timeit movie_data.dist_by_genres()
dist_by_genres = movie_data.dist_by_genres()
print("Top genres of movies:")
for i, j in dist_by_genres.items():
    print(f"{i} : {j}")

493 μs ± 45.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Top genres of movies:
Drama : 507
Comedy : 365
Romance : 208
Thriller : 179
Action : 158
Adventure : 126
Crime : 122
Children : 100
Fantasy : 69
Sci-Fi : 69
Mystery : 58
Musical : 53
Horror : 51
War : 48
Animation : 37
Documentary : 25
Western : 23
Film-Noir : 18
IMAX : 3
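
A genre distribution like the one above can be sketched as follows, assuming each movie stores its genres as a single pipe-separated string such as `Adventure|Animation|Children`. That storage format and the helper name `count_genres` are assumptions; the actual `dist_by_genres` code is not shown here.

```python
from collections import Counter


def count_genres(genre_fields):
    """Count genre occurrences across pipe-separated genre strings."""
    counts = Counter()
    for field in genre_fields:
        # A movie with several genres contributes once to each of them.
        counts.update(genre for genre in field.split("|") if genre)
    # Order the result by count, highest first.
    return dict(counts.most_common())


sample_fields = ["Adventure|Animation|Children", "Comedy|Romance", "Comedy"]
print(count_genres(sample_fields))
```

Because one movie can belong to several genres, the counts in such a distribution sum to more than the number of movies.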


### Movies with the Most Genres

This cell lists the 10 movies tagged with the largest number of genres, together with each movie's genre count, using the `most_genres` method of the `movie_data` object.

In [None]:
%timeit movie_data.most_genres(10)
movie_dist = movie_data.most_genres(10)
print("Most genres of movies:")
for i, j in movie_dist.items():
    print(f"{i} : {j}")

327 μs ± 109 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Most genres of movies:
Strange Days (1995) : 6
Lion King, The (1994) : 6
Getaway, The (1994) : 6
Super Mario Bros. (1993) : 6
Beauty and the Beast (1991) : 6
All Dogs Go to Heaven 2 (1996) : 6
Space Jam (1996) : 6
Aladdin and the King of Thieves (1996) : 6
Toy Story (1995) : 5
Money Train (1995) : 5
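
One way to rank movies by genre count is sketched below, again assuming pipe-separated genre strings. The function name `movies_with_most_genres` and the sample rows are hypothetical and only illustrate the idea.

```python
def movies_with_most_genres(rows, n):
    """Return the n titles with the highest genre count as {title: count}.

    `rows` is an iterable of (title, pipe_separated_genres) tuples.
    """
    genre_counts = {title: len(genres.split("|")) for title, genres in rows}
    # sorted() is stable, so ties keep their original input order.
    ranked = sorted(genre_counts.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


sample_rows = [
    ("Sabrina (1995)", "Comedy|Romance"),
    ("Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
    ("Heat (1995)", "Action|Crime|Thriller"),
]
print(movies_with_most_genres(sample_rows, 2))
# {'Toy Story (1995)': 5, 'Heat (1995)': 3}
```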


### Initialize Links Data

This cell imports the `Links` class, defines a list of movie IDs, specifies the fields to retrieve from the links data, and creates an instance of the `Links` class using the `links.csv` file.

In [None]:
# %%timeit
from links import Links

# Movie IDs to look up and the fields to retrieve for each of them
movie_id = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 123]
fields = ["movie_id", "title", "Director", "imdb_id", "Budget", "Cumulative Worldwide Gross", "Runtime"]
links_data = Links("../datasets/links.csv")

### Retrieve IMDb Data

This cell retrieves and prints IMDb data for the specified movie IDs and fields using the `get_imdb` method of the `links_data` object.

In [None]:
%timeit links_data.get_imdb(movie_id, fields)
result = links_data.get_imdb(movie_id, fields)
print("All imdb data:")
for i in result:
    print(i)

3.69 ms ± 736 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
All imdb data:
[1, 'Toy Story (1995)', 'John Lasseter', '0114709', '$30,000,000 (estimated)', '$394,436,586', '1 hour 21 minutes']
[2, 'Jumanji (1995)', 'Joe Johnston', '0113497', '$65,000,000 (estimated)', '$262,821,940', '1 hour 44 minutes']
[3, 'Grumpier Old Men (1995)', 'Howard Deutch', '0113228', '$25,000,000 (estimated)', '$71,518,503', '1 hour 41 minutes']
[4, 'Waiting to Exhale (1995)', 'Forest Whitaker', '0114885', '$16,000,000 (estimated)', '$81,452,156', '2 hours 4 minutes']
[5, 'Father of the Bride Part II (1995)', 'Charles Shyer', '0113041', '$30,000,000 (estimated)', '$76,594,107', '1 hour 46 minutes']
[6, 'Heat (1995)', 'Michael Mann', '0113277', '$60,000,000 (estimated)', '$187,436,818', '2 hours 50 minutes']
[7, 'Sabrina (1995)', 'Sydney Pollack', '0114319', '$58,000,000 (estimated)', '$53,696,959', '2 hours 7 minutes']
[8, 'Tom and Huck (1995)', 'Peter Hewitt', '0112302', 'N/A', '$23,920,048', '1 h

### Top Directors

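This cell finds and prints the top 10 directors using the `top_directors` method of the `links_data` object; the counts appear to be the number of movies each director has in the dataset. The `Directors : 9` entry in the output is probably a parsing artifact (for example, a label picked up from pages that list several directors) rather than a person's name.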

In [None]:
%timeit links_data.top_directors(10)
top_directors = links_data.top_directors(10)
print("Top directors:")
for i, j in top_directors.items():
    print(f"{i} : {j}")

4.4 ms ± 559 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Top directors:
Alfred Hitchcock : 11
Directors : 9
Woody Allen : 8
Stanley Kubrick : 6
Frank Capra : 6
Robert Stevenson : 6
Rob Reiner : 5
Martin Scorsese : 5
Tony Scott : 5
John Carpenter : 5


### Most Expensive Movies

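This cell identifies and prints the 10 movies with the largest budgets using the `most_expensive` method of the `links_data` object. The budgets appear to be compared as raw numbers without currency conversion, which would explain why `Akira (1988)` tops the list with 1,100,000,000 (its budget is commonly reported in yen).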

In [None]:
%timeit links_data.most_expensive(10)
most_expensive = links_data.most_expensive(10)
print("Most expensive:")
for i, j in most_expensive.items():
    print(f"{i} : {j}")

8.25 ms ± 752 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Most expensive:
Akira (1988) : 1100000000
Ghost in the Shell (Kôkaku kidôtai) (1995) : 330000000
Waterworld (1995) : 175000000
Germinal (1993) : 164000000
Cold Fever (Á köldum klaka) (1995) : 130000000
True Lies (1994) : 115000000
Terminator 2: Judgment Day (1991) : 102000000
Batman Forever (1995) : 100000000
Hunchback of Notre Dame, The (1996) : 100000000
Eraser (1996) : 100000000


### Most Profitable Movies

This cell finds and prints the 10 most profitable movies using the `most_profitable` method of the `links_data` object.

In [None]:
%timeit links_data.most_profitable(10)
most_profitable = links_data.most_profitable(10)
print("Most profitable:")
for i, j in most_profitable.items():
    print(f"{i} : {j}")

11.5 ms ± 615 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Most profitable:
Jurassic Park (1993) : 1041379926
Lion King, The (1994) : 934161373
E.T. the Extra-Terrestrial (1982) : 786807407
Star Wars: Episode IV - A New Hope (1977) : 764398507
Independence Day (a.k.a. ID4) (1996) : 742400891
Forrest Gump (1994) : 623226465
Star Wars: Episode V - The Empire Strikes Back (1980) : 532016086
Ghost (1990) : 483703557
Aladdin (1992) : 476050219
Home Alone (1990) : 458684675
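
The IMDb fields printed earlier are strings such as `'$30,000,000 (estimated)'`, `'$394,436,586'`, and `'N/A'`, so they have to be converted to numbers before any ranking. The sketch below shows one possible conversion and assumes that profit is defined as worldwide gross minus budget. Both that definition and the helper names `parse_money` and `profit` are assumptions, not the confirmed behavior of `most_profitable`.

```python
import re


def parse_money(text):
    """Convert a string such as '$30,000,000 (estimated)' to an int.

    Returns None when the string holds no digits (for example 'N/A').
    The currency symbol is ignored, so amounts are NOT converted
    between currencies.
    """
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


def profit(budget_text, gross_text):
    """Return gross minus budget, or None if either value is missing."""
    budget = parse_money(budget_text)
    gross = parse_money(gross_text)
    if budget is None or gross is None:
        return None
    return gross - budget


# Values taken from the Toy Story row printed by get_imdb above.
print(profit("$30,000,000 (estimated)", "$394,436,586"))  # 364436586
```

Returning `None` for missing values lets a caller skip movies such as `Tom and Huck (1995)`, whose budget is `N/A`, instead of treating the budget as zero.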


### Longest Movies

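This cell finds and prints the 10 longest movies using the `longest` method of the `links_data` object. The first five values (1380 to 1800) are implausible as runtimes in minutes: these titles are short films, and the numbers are consistent with a minutes-only runtime such as `30 minutes` being multiplied by 60, so the runtime parsing is worth checking.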

In [None]:
%timeit links_data.longest(10)
longest = links_data.longest(10)
print("The longest:")
for i, j in longest.items():
    print(f"{i} : {j}")

5.29 ms ± 708 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The longest:
Wallace & Gromit: A Close Shave (1995) : 1800
Wallace & Gromit: The Wrong Trousers (1993) : 1740
Some Folks Call It a Sling Blade (1993) : 1500
Winnie the Pooh and the Blustery Day (1968) : 1500
Grand Day Out with Wallace and Gromit, A (1989) : 1380
Gone with the Wind (1939) : 238
Once Upon a Time in America (1984) : 229
Lawrence of Arabia (1962) : 227
Ben-Hur (1959) : 212
Godfather: Part II, The (1974) : 202
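
Runtimes arrive as text such as `'1 hour 21 minutes'` or `'2 hours 4 minutes'`. A minimal sketch of a conversion to minutes that also handles hours-only and minutes-only values is given below. The function `runtime_to_minutes` is a hypothetical helper, not the parser used by the `Links` class.

```python
import re


def runtime_to_minutes(text):
    """Convert '2 hours 4 minutes', '2 hours', or '30 minutes' to minutes.

    Returns None when neither an hour nor a minute part is found.
    """
    hours = re.search(r"(\d+)\s*hours?", text)
    minutes = re.search(r"(\d+)\s*minutes?", text)
    if hours is None and minutes is None:
        return None
    total = 0
    if hours:
        total += int(hours.group(1)) * 60  # only the hour part is scaled
    if minutes:
        total += int(minutes.group(1))
    return total


print(runtime_to_minutes("1 hour 21 minutes"))  # 81
print(runtime_to_minutes("30 minutes"))         # 30
```

Matching the unit word next to each number, rather than relying on the position of the number in the string, avoids scaling a lone minutes value as if it were hours.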


### Highest Cost Per Minute

This cell finds and prints the top 10 movies with the highest cost per minute, calculated using the `top_cost_per_minute` method of the `links_data` object.

In [None]:
%timeit links_data.top_cost_per_minute(10)
top_cost_per_minute = links_data.top_cost_per_minute(10)
print("Top cost per minute:")
for i, j in top_cost_per_minute.items():
    print(f"{i} : {j}")

11.2 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Top cost per minute:
Akira (1988) : 8870967.74
Ghost in the Shell (Kôkaku kidôtai) (1995) : 3975903.61
Cold Fever (Á köldum klaka) (1995) : 1566265.06
Waterworld (1995) : 1296296.3
Hunchback of Notre Dame, The (1996) : 1098901.1
Germinal (1993) : 1025000.0
Judge Dredd (1995) : 937500.0
Space Jam (1996) : 909090.91
Eraser (1996) : 869565.22
Batman Forever (1995) : 826446.28
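
The printed values are consistent with budget divided by runtime in minutes, rounded to two decimals: for example, 175,000,000 / 135 gives the 1296296.3 shown for `Waterworld (1995)`. The sketch below encodes that assumed formula. The helper name `cost_per_minute` and the 135-minute runtime (inferred from the two printed numbers) are assumptions.

```python
def cost_per_minute(budget, runtime_minutes):
    """Return budget / runtime rounded to 2 decimals, or None if unusable."""
    # Guard against missing values and division by zero.
    if not budget or not runtime_minutes:
        return None
    return round(budget / runtime_minutes, 2)


print(cost_per_minute(175_000_000, 135))  # 1296296.3
```

Since this ratio inherits any error in the budget or runtime, entries with unconverted currencies or misparsed runtimes will also be misranked here.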


### Initialize Ratings Data and Related Classes

This cell imports the names needed for analyzing movie ratings from the `ratings` module: the `Ratings`, `Movie_ratings`, and `User_ratings` classes and the `print_data` helper function. It then creates a `Ratings` instance from the `ratings.csv` and `movies.csv` files and passes it to `Movie_ratings` and `User_ratings`, which provide the per-movie and per-user statistics used in the following cells.

In [None]:
# %%timeit
from ratings import Ratings, print_data, Movie_ratings, User_ratings

ratings_data = Ratings("../datasets/ratings.csv", "../datasets/movies.csv")
movies_instance = Movie_ratings(ratings_data)
users_instance = User_ratings(ratings_data)

### Top Movies by Average Rating

This cell prints the top 10 movies based on their average rating, using the `top_by_ratings` method of the `movies_instance` object and the `print_data` function.

In [None]:
%timeit movies_instance.top_by_ratings(10, metric='average')
print_data(movies_instance.top_by_ratings(10, metric='average'), "Top Movies by Average Rating")

1.1 ms ± 5.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Top Movies by Average Rating:
Bottle Rocket (1996) : 5.0
Canadian Bacon (1995) : 5.0
Star Wars: Episode IV - A New Hope (1977) : 5.0
James and the Giant Peach (1996) : 5.0
Wizard of Oz : 5.0
Citizen Kane (1941) : 5.0
Adventures of Robin Hood : 5.0
Mr. Smith Goes to Washington (1939) : 5.0
Winnie the Pooh and the Blustery Day (1968) : 5.0
Three Caballeros : 5.0



### Top Movies by Median Rating

This cell prints the top 10 movies based on their median rating, using the `top_by_ratings` method of the `movies_instance` object and the `print_data` function.

In [None]:
%timeit movies_instance.top_by_ratings(10, metric='median')
print_data(movies_instance.top_by_ratings(10, metric='median'), "Top Movies by Median Rating")

1.28 ms ± 57.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Top Movies by Median Rating:
Bottle Rocket (1996) : 5.0
Canadian Bacon (1995) : 5.0
Star Wars: Episode IV - A New Hope (1977) : 5.0
Tommy Boy (1995) : 5.0
Forrest Gump (1994) : 5.0
Fugitive : 5.0
Jurassic Park (1993) : 5.0
Tombstone (1993) : 5.0
Dances with Wolves (1990) : 5.0
Pinocchio (1940) : 5.0
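
The difference between the two metrics can be illustrated with a small sketch. It assumes the ratings are available as `(title, rating)` pairs; `rank_titles` is a hypothetical stand-in and not the actual `top_by_ratings` implementation.

```python
from collections import defaultdict
from statistics import mean, median


def rank_titles(pairs, n, metric="average"):
    """Rank titles by their mean or median rating, highest first.

    `pairs` is an iterable of (title, rating) tuples.
    """
    if metric not in ("average", "median"):
        raise ValueError("metric must be 'average' or 'median'")
    aggregate = mean if metric == "average" else median

    # Group all ratings by title before aggregating.
    ratings_by_title = defaultdict(list)
    for title, rating in pairs:
        ratings_by_title[title].append(rating)

    scores = {t: round(aggregate(r), 2) for t, r in ratings_by_title.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


sample_pairs = [
    ("Heat", 4.0), ("Heat", 5.0), ("Heat", 1.0),
    ("Sabrina", 3.0), ("Sabrina", 4.0),
]
print(rank_titles(sample_pairs, 2, metric="average"))  # {'Sabrina': 3.5, 'Heat': 3.33}
print(rank_titles(sample_pairs, 2, metric="median"))   # {'Heat': 4.0, 'Sabrina': 3.5}
```

The median is less sensitive to a single outlying rating, which is why the two rankings can differ. With either metric, a title that has only one or two ratings can reach 5.0, so a minimum rating count may be worth adding before interpreting such lists.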



### Most Common Ratings

This cell prints the 10 most common movie ratings using the `most_ratings` method of the `movies_instance` object and the `print_data` function.

In [None]:
%timeit movies_instance.most_ratings(10)
print_data(movies_instance.most_ratings(10), "Most Common Ratings")

186 μs ± 9.55 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Most Common Ratings:
4.0 : 292
5.0 : 267
3.0 : 253
2.0 : 57
1.0 : 39
4.5 : 33
0.5 : 24
3.5 : 17
1.5 : 11
2.5 : 7



### Movie Ratings Distribution by Year

This cell prints the number of ratings per year using the `dist_by_year` method of the `movies_instance` object and the `print_data` function. The years (1996 to 2015) appear to be the years in which the ratings were submitted, not the movies' release years.

In [None]:
%timeit movies_instance.dist_by_year()
print_data(movies_instance.dist_by_year(), "Distribution by Year")

771 μs ± 38.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Distribution by Year:
1996 : 358
1999 : 82
2000 : 296
2001 : 70
2005 : 121
2006 : 4
2007 : 1
2011 : 39
2015 : 29
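
If the ratings file stores each rating's submission time as a Unix timestamp in seconds (an assumption about `ratings.csv`, which is not shown here), a per-year count can be sketched as follows. The helper name `ratings_per_year` is illustrative.

```python
from collections import Counter
from datetime import datetime, timezone


def ratings_per_year(timestamps):
    """Count ratings per calendar year (UTC), ordered by year."""
    years = (
        datetime.fromtimestamp(ts, tz=timezone.utc).year for ts in timestamps
    )
    return dict(sorted(Counter(years).items()))


# 946684800 is 2000-01-01 00:00:00 UTC; 978307200 is 2001-01-01 00:00:00 UTC.
sample_timestamps = [946684800, 964982703, 978307200]
print(ratings_per_year(sample_timestamps))  # {2000: 2, 2001: 1}
```

Passing an explicit time zone keeps the result independent of the machine's local time, which matters for ratings submitted close to midnight on 31 December.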



### Most Controversial Movies

This cell identifies and prints the 5 most controversial movies based on the variance in their ratings, using the `top_controversial` method of the `movies_instance` object.

In [None]:
%timeit movies_instance.top_controversial(5)
print_data(movies_instance.top_controversial(5), "Most Controversial Movies (by Rating Variance)")

945 μs ± 94 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Most Controversial Movies (by Rating Variance):
Bambi (1942) : 5.06
Rescuers : 5.06
My Fair Lady (1964) : 5.06
Matrix : 4.0
Schindler's List (1993) : 3.42
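
A variance-based ranking can be sketched with the standard library. The example uses the population variance (`statistics.pvariance`), which is an assumption: the notebook does not state which variance `top_controversial` computes. The function name and sample data are hypothetical.

```python
from statistics import pvariance


def most_controversial(ratings_by_title, n):
    """Return the n titles with the highest rating variance."""
    variances = {
        title: round(pvariance(ratings), 2)
        for title, ratings in ratings_by_title.items()
        if ratings  # pvariance needs at least one value
    }
    ranked = sorted(variances.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


sample_ratings = {
    "Movie A": [0.5, 5.0],
    "Movie B": [1.0, 5.0],
    "Movie C": [4.0, 4.0, 4.0],
}
print(most_controversial(sample_ratings, 2))  # {'Movie A': 5.06, 'Movie B': 4.0}
```

Two ratings of 0.5 and 5.0 give a population variance of 5.0625, which rounds to the 5.06 seen in the output above. That is consistent with, though not proof of, the top entries having very few ratings each.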



### User Ratings Count Distribution

This cell prints the distribution of the number of ratings given by each user, using the `ratings_distribution` method of the `users_instance` object.

In [None]:
%timeit users_instance.ratings_distribution()
print_data(users_instance.ratings_distribution(), "Ratings Count Distribution")

291 μs ± 23.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Ratings Count Distribution:
1 : 232
2 : 29
3 : 39
4 : 216
5 : 44
6 : 314
7 : 126



### User Median Ratings Distribution

This cell prints the distribution of users' median ratings, using the `average_median_ratings_distribution` method of the `users_instance` object.

In [None]:
%timeit users_instance.average_median_ratings_distribution("median")
print_data(users_instance.average_median_ratings_distribution("median"), "Median Ratings Distribution")

328 μs ± 5.85 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Median Ratings Distribution:
1 : 5.0
2 : 4.0
3 : 0.5
4 : 4.0
5 : 4.0
6 : 3.0
7 : 4.0



### User Average Ratings Distribution

This cell prints the distribution of users' average ratings, using the `average_median_ratings_distribution` method of the `users_instance` object.

In [None]:
%timeit users_instance.average_median_ratings_distribution(metric='average')
print_data(users_instance.average_median_ratings_distribution(metric='average'), "Average Ratings Distribution")

298 μs ± 8.52 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Average Ratings Distribution:
1 : 4.37
2 : 3.95
3 : 2.44
4 : 3.56
5 : 3.64
6 : 3.49
7 : 3.35



### Users with Highest Rating Variance

This cell prints the users with the highest variance in their ratings, that is, those whose ratings are most spread out, using the `top_n_variance_users` method of the `users_instance` object. Although 10 users are requested, the output lists only 7, which matches the 7 user IDs shown in the cells above.

In [None]:
%timeit users_instance.top_n_variance_users(10)
print_data(users_instance.top_n_variance_users(10), "Top 10 Users by Variance")

456 μs ± 46.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Top 10 Users by Variance:
3 : 4.26
4 : 1.72
7 : 1.65
5 : 0.96
6 : 0.72
1 : 0.64
2 : 0.63



### Initialize Tag Data

This cell imports the `Tags` class and the `print_data` function, defines the path to the tags data file, specifies a substring to search for within the tags, and creates an instance of the `Tags` class from the tags file.

In [None]:
# %%timeit
from tags import Tags, print_data

path_to_file = "../datasets/tags.csv"
word_to_search = "oo"

tags_data = Tags(path_to_file)

### Most Common Tags

This cell prints the 10 most common tags using the `most_words` method of the `tags_data` object and the `print_data` function. The method name `most_words` is somewhat misleading: judging by the output, it returns the most frequent tags with their counts, not the tags that contain the most words.

In [None]:
%timeit tags_data.most_words(10)
print_data(tags_data.most_words(10), "most_common_tags")

1.44 ms ± 337 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
most_common_tags:
funny : 15
sci-fi : 14
twist ending : 12
dark comedy : 12
atmospheric : 10
superhero : 10
comedy : 10
action : 10
suspense : 10
Leonardo DiCaprio : 9



### Longest Tags

This cell finds and prints the 10 longest tags, along with their lengths, using the `longest` method of the `tags_data` object.

In [None]:
%timeit tags_data.longest(10)
longest_tags = tags_data.longest(10)
print("Longest tags:")
for tag in longest_tags:
    print(f"{tag} : {len(tag)}")

1.02 ms ± 45.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Longest tags:
Something for everyone in this one... saw it without and plan on seeing it with kids! : 85
the catholic church is the most corrupt organization in history : 63
audience intelligence underestimated : 36
Oscar (Best Music - Original Score) : 35
assassin-in-training (scene) : 28
Oscar (Best Cinematography) : 27
Everything you want is here : 27
political right versus left : 27
representation of children : 26
Guardians of the Galaxy : 23
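
Ranking tags by length can be sketched as below, assuming duplicates are removed first and length is measured in characters, as `len(tag)` does in the cell above. The helper `longest_tags` and its alphabetical tie-breaking are illustrative choices, not the documented behavior of `Tags.longest`.

```python
def longest_tags(tags, n):
    """Return the n longest distinct tags, longest first."""
    unique_tags = set(tags)
    # Sort by descending length, then alphabetically to make ties stable.
    return sorted(unique_tags, key=lambda tag: (-len(tag), tag))[:n]


sample_tags = ["funny", "too long", "way too long", "funny", "comic book"]
print(longest_tags(sample_tags, 3))  # ['way too long', 'comic book', 'too long']
```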


### Tags Containing 'oo'

This cell finds and prints the tags that contain the substring 'oo', along with their lengths, using the `tags_with` method of the `tags_data` object. The match is case-sensitive, so `High School` and `high school` are listed separately.

In [None]:
%timeit tags_data.tags_with("oo")
tags_with_oo = tags_data.tags_with("oo")
print("Tags with 'oo':")
for tag in tags_with_oo:
    print(f"{tag} : {len(tag)}")

1.01 ms ± 75.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Tags with 'oo':
High School : 11
Hollywood : 9
Orlando Bloom : 13
Poor story : 10
Woody Harrelson : 15
based on a book : 15
bloody : 6
cartoon : 7
comic book : 10
courtroom drama : 15
feel-good : 9
good dialogue : 13
good soundtrack : 15
good writing : 12
goofy : 5
high school : 11
highschool : 10
oldie but goodie : 16
poorly paced : 12
too long : 8
way too long : 12
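
A substring filter that returns distinct, sorted tags, which matches the ordering seen above (uppercase initials before lowercase), can be sketched as follows. `tags_containing` is a hypothetical helper, not the `tags_with` implementation.

```python
def tags_containing(tags, fragment):
    """Return the distinct tags that contain `fragment`, sorted.

    The test is case-sensitive, and the default string sort places
    uppercase letters before lowercase ones.
    """
    return sorted({tag for tag in tags if fragment in tag})


sample_tags = ["goofy", "Hollywood", "funny", "too long", "goofy", "BOOK"]
print(tags_containing(sample_tags, "oo"))  # ['Hollywood', 'goofy', 'too long']
```

For a case-insensitive search, both the tag and the fragment could be lowercased before the comparison.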


### Longest of Most Common Tags

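This cell looks for tags that are both among the 10 most common and among the 10 longest, and prints them with their lengths, using the `most_words_and_longest` method of the `tags_data` object. The output below is empty, presumably because no tag appears in both lists.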

In [None]:
%timeit tags_data.most_words_and_longest(10)
most_words_and_longest = tags_data.most_words_and_longest(10)
print("Longest of most common tags:")
for tag in most_words_and_longest:
    print(f"{tag} : {len(tag)}")

3.14 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Longest of most common tags:
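
The empty result can be reproduced from the two lists printed in the earlier cells, assuming the method intersects the top-10 most common tags with the top-10 longest tags. That reading of `most_words_and_longest` is an assumption based on its name and output, and `in_both` is a hypothetical helper.

```python
def in_both(first, second):
    """Return the items of `first` that also occur in `second`, in order."""
    second_set = set(second)
    return [item for item in first if item in second_set]


# Copied from the "Most Common Tags" output.
most_common = [
    "funny", "sci-fi", "twist ending", "dark comedy", "atmospheric",
    "superhero", "comedy", "action", "suspense", "Leonardo DiCaprio",
]
# Copied from the "Longest Tags" output.
longest = [
    "Something for everyone in this one... saw it without and plan on seeing it with kids!",
    "the catholic church is the most corrupt organization in history",
    "audience intelligence underestimated",
    "Oscar (Best Music - Original Score)",
    "assassin-in-training (scene)",
    "Oscar (Best Cinematography)",
    "Everything you want is here",
    "political right versus left",
    "representation of children",
    "Guardians of the Galaxy",
]
print(in_both(most_common, longest))  # []
```

Frequent tags tend to be short and the longest tags tend to be one-off remarks, so an empty intersection at `n = 10` is unsurprising; a larger `n` would be needed to see any overlap.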


### Most Popular Tags

This cell prints the 10 most popular tags using the `most_popular` method of the `tags_data` object and the `print_data` function.

In [None]:
%timeit tags_data.most_popular(10)
print_data(tags_data.most_popular(10), "Most popular tags")

1.21 ms ± 35.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Most popular tags:
funny : 15
sci-fi : 14
twist ending : 12
dark comedy : 12
atmospheric : 10
superhero : 10
comedy : 10
action : 10
suspense : 10
Leonardo DiCaprio : 9

