# Exercise 1

### Imports and reading data

In [1028]:
import pandas as pd
import numpy as np
path = "..\\..\\Lecture Code (my notes)\\Assets\\Data\\"

# Read in dataframe
df = pd.read_csv(path+"imdb_top_1000.csv")

### Task 1: Get rid of useless columns

In [1029]:
# Drop the unnecessary columns. We won't be using these.
df.drop(columns=["Poster_Link","Certificate","Overview","Runtime","Meta_score","No_of_Votes"], inplace=True)
df.head()

Unnamed: 0,Series_Title,Released_Year,Genre,IMDB_Rating,Director,Star1,Star2,Star3,Star4,Gross
0,The Shawshank Redemption,1994,Drama,9.3,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,28341469
1,The Godfather,1972,"Crime, Drama",9.2,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,134966411
2,The Dark Knight,2008,"Action, Crime, Drama",9.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,534858444
3,The Godfather: Part II,1974,"Crime, Drama",9.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,57300000
4,12 Angry Men,1957,"Crime, Drama",9.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,4360000


### Intermission Task: Analyze and Clean Data

In [1030]:
# Next, we should do an analysis of the data and clean it if possible.
# We'll take a look at various statistics.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   1000 non-null   object 
 1   Released_Year  1000 non-null   object 
 2   Genre          1000 non-null   object 
 3   IMDB_Rating    1000 non-null   float64
 4   Director       1000 non-null   object 
 5   Star1          1000 non-null   object 
 6   Star2          1000 non-null   object 
 7   Star3          1000 non-null   object 
 8   Star4          1000 non-null   object 
 9   Gross          831 non-null    object 
dtypes: float64(1), object(9)
memory usage: 78.3+ KB


In [1031]:
df.head()

Unnamed: 0,Series_Title,Released_Year,Genre,IMDB_Rating,Director,Star1,Star2,Star3,Star4,Gross
0,The Shawshank Redemption,1994,Drama,9.3,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,28341469
1,The Godfather,1972,"Crime, Drama",9.2,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,134966411
2,The Dark Knight,2008,"Action, Crime, Drama",9.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,534858444
3,The Godfather: Part II,1974,"Crime, Drama",9.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,57300000
4,12 Angry Men,1957,"Crime, Drama",9.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,4360000


In [1032]:
# Things of note:
# 1. Gross has 169 null entries. These will need to be filled.
#
# 2. Gross should not be an object either; we would expect an integer here.
# The likely cause is the commas included in the numbers (4,360,000 for example), which make pandas read them as strings.
#
# 3. The release year should ideally not be an object- we should expect an integer here.
#
# Let's investigate each of these.

In [1033]:
# 1. Gross has null entries

# We could simply fill these with averages, but that would distort our values when we compute aggregates later.
# It is simpler to delete these entries.
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 831 entries, 0 to 997
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   831 non-null    object 
 1   Released_Year  831 non-null    object 
 2   Genre          831 non-null    object 
 3   IMDB_Rating    831 non-null    float64
 4   Director       831 non-null    object 
 5   Star1          831 non-null    object 
 6   Star2          831 non-null    object 
 7   Star3          831 non-null    object 
 8   Star4          831 non-null    object 
 9   Gross          831 non-null    object 
dtypes: float64(1), object(9)
memory usage: 71.4+ KB
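
The reasoning for dropping rather than mean-filling can be illustrated on a toy column. This is a minimal sketch with made-up numbers (the `toy` frame is hypothetical, not taken from the IMDB file): mean-filling leaves the mean untouched but shifts the median and shrinks the spread.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from imdb_top_1000.csv).
toy = pd.DataFrame({"Gross": [10.0, 20.0, 90.0, np.nan, np.nan]})

dropped = toy["Gross"].dropna()                    # keep only known values
filled = toy["Gross"].fillna(toy["Gross"].mean())  # replace NaN with the mean

print(dropped.mean(), filled.mean())      # the two means agree
print(dropped.median(), filled.median())  # the medians do not
print(dropped.std(), filled.std())        # filling shrinks the spread
```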


In [1034]:
# The 169 null entries have been deleted- we are left with 831

# 2. Gross should be an integer

# We must remove the commas before converting, as they will cause an error in the cast.
df["Gross"] = df["Gross"].str.replace(',','').astype(int)
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 831 entries, 0 to 997
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   831 non-null    object 
 1   Released_Year  831 non-null    object 
 2   Genre          831 non-null    object 
 3   IMDB_Rating    831 non-null    float64
 4   Director       831 non-null    object 
 5   Star1          831 non-null    object 
 6   Star2          831 non-null    object 
 7   Star3          831 non-null    object 
 8   Star4          831 non-null    object 
 9   Gross          831 non-null    int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 71.4+ KB
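
As a possible alternative to stripping commas after loading, `pd.read_csv` has a `thousands` parameter that handles the separator at parse time. The sketch below uses a hypothetical two-row sample (`csv_text` and the film names are invented for illustration) rather than the real file.

```python
import io

import pandas as pd

# Hypothetical sample shaped like the Gross column (quoted, comma-separated digits).
csv_text = 'Series_Title,Gross\nFilm A,"4,360,000"\nFilm B,"28,341,469"\n'

# thousands="," tells the parser to treat commas inside numbers as separators.
sample = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(sample["Gross"].dtype)
print(sample["Gross"].tolist())
```

With missing values present, pandas would read the column as float64 rather than int64, so the null handling above would still be needed before casting to int.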


In [1035]:
# Gross is now an int64 type!

# 3. Release year should be an integer. 

# Let's investigate its range of values.
df["Released_Year"].value_counts()

Released_Year
2014    31
2004    29
2013    27
2009    25
2007    25
        ..
1947     1
1938     1
1933     1
PG       1
1953     1
Name: count, Length: 95, dtype: int64

In [1036]:
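# The 'PG' entry should not be there. We'll keep only entries that contain numerical characters.
# .copy() makes the filtered frame independent, so the assignment below does not warn about writing to a slice.
df = df[df["Released_Year"].str.contains(r"[0-9]+")].copy()
df["Released_Year"] = df["Released_Year"].astype(int)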

df["Released_Year"].value_counts()

Released_Year
2014    31
2004    29
2013    27
2007    25
2009    25
        ..
1956     1
1945     1
1938     1
1933     1
1953     1
Name: count, Length: 94, dtype: int64

In [1037]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 830 entries, 0 to 997
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   830 non-null    object 
 1   Released_Year  830 non-null    int64  
 2   Genre          830 non-null    object 
 3   IMDB_Rating    830 non-null    float64
 4   Director       830 non-null    object 
 5   Star1          830 non-null    object 
 6   Star2          830 non-null    object 
 7   Star3          830 non-null    object 
 8   Star4          830 non-null    object 
 9   Gross          830 non-null    int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 71.3+ KB


In [1038]:
# We were able to remove the PG entry. Looks like it was the only one that needed removing.
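
A stricter filter is possible if the data ever contained mixed values. The sketch below, on a made-up series (the value "R2D2" is invented to show the difference), compares `str.contains`, which accepts any string with a digit in it, against `str.fullmatch`, which requires the whole string to be four digits.

```python
import pandas as pd

# Hypothetical values; only "PG" resembles what appeared in the real column.
years = pd.Series(["1994", "PG", "2008", "R2D2"], name="Released_Year")

loose = years.str.contains(r"[0-9]+")   # True for any string containing a digit
strict = years.str.fullmatch(r"\d{4}")  # True only for exactly four digits

print(years[loose].tolist())
print(years[strict].astype(int).tolist())
```

In this dataset both filters appear to give the same result, since the cast to int succeeded after the looser one.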

### Task 2: Oldest Movie

In [1039]:
# Find the oldest movie
df.loc[df["Released_Year"] == df["Released_Year"].min()]

# Alternatively:
# df.nsmallest(1, columns=["Released_Year"])

Unnamed: 0,Series_Title,Released_Year,Genre,IMDB_Rating,Director,Star1,Star2,Star3,Star4,Gross
127,The Kid,1921,"Comedy, Drama, Family",8.3,Charles Chaplin,Charles Chaplin,Edna Purviance,Jackie Coogan,Carl Miller,5450000


### Task 3: Newest Movie

In [1040]:
# Find the newest movie
df.loc[df["Released_Year"] == df["Released_Year"].max()]

# This has multiple movies "tied" for first place, since only the year is listed.
# This could be used instead if you only wanted one:
# df.nlargest(1, columns=["Released_Year"])

Unnamed: 0,Series_Title,Released_Year,Genre,IMDB_Rating,Director,Star1,Star2,Star3,Star4,Gross
19,Gisaengchung,2019,"Comedy, Drama, Thriller",8.6,Bong Joon Ho,Kang-ho Song,Lee Sun-kyun,Cho Yeo-jeong,Choi Woo-sik,53367844
33,Joker,2019,"Crime, Drama, Thriller",8.5,Todd Phillips,Joaquin Phoenix,Robert De Niro,Zazie Beetz,Frances Conroy,335451311
59,Avengers: Endgame,2019,"Action, Adventure, Drama",8.4,Anthony Russo,Joe Russo,Robert Downey Jr.,Chris Evans,Mark Ruffalo,858373000
84,1917,2019,"Drama, Thriller, War",8.3,Sam Mendes,Dean-Charles Chapman,George MacKay,Daniel Mays,Colin Firth,159227644
128,Chhichhore,2019,"Comedy, Drama",8.2,Nitesh Tiwari,Sushant Singh Rajput,Shraddha Kapoor,Varun Sharma,Prateik,898575
195,Portrait de la jeune fille en feu,2019,"Drama, Romance",8.1,Céline Sciamma,Noémie Merlant,Adèle Haenel,Luàna Bajrami,Valeria Golino,3759854
217,Ford v Ferrari,2019,"Action, Biography, Drama",8.1,James Mangold,Matt Damon,Christian Bale,Jon Bernthal,Caitriona Balfe,117624028
334,Gully Boy,2019,"Drama, Music, Romance",8.0,Zoya Akhtar,Vijay Varma,Nakul Roshan Sahdev,Ranveer Singh,Vijay Raaz,5566534
463,Knives Out,2019,"Comedy, Crime, Drama",7.9,Rian Johnson,Daniel Craig,Chris Evans,Ana de Armas,Jamie Lee Curtis,165359751
466,Marriage Story,2019,"Comedy, Drama, Romance",7.9,Noah Baumbach,Adam Driver,Scarlett Johansson,Julia Greer,Azhy Robertson,2000000
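
The difference between the two approaches to ties can also be expressed through the `keep` argument of `nlargest`. A minimal sketch on invented data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Toy data with two movies tied for the newest year.
toy = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C", "Film D"],
    "Released_Year": [2019, 2018, 2019, 2017],
})

first_only = toy.nlargest(1, "Released_Year")              # default keep="first"
with_ties = toy.nlargest(1, "Released_Year", keep="all")   # keeps every tied row

print(first_only["Series_Title"].tolist())
print(with_ties["Series_Title"].tolist())
```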


### Task 4: Top 10 movies by IMDB rating

In [1041]:
# Get top 10 movies
df.nlargest(10, columns=["IMDB_Rating"])
# Another option: df.sort_values("IMDB_Rating", ascending=False).head(10)

Unnamed: 0,Series_Title,Released_Year,Genre,IMDB_Rating,Director,Star1,Star2,Star3,Star4,Gross
0,The Shawshank Redemption,1994,Drama,9.3,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,28341469
1,The Godfather,1972,"Crime, Drama",9.2,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,134966411
2,The Dark Knight,2008,"Action, Crime, Drama",9.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,534858444
3,The Godfather: Part II,1974,"Crime, Drama",9.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,57300000
4,12 Angry Men,1957,"Crime, Drama",9.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,4360000
5,The Lord of the Rings: The Return of the King,2003,"Action, Adventure, Drama",8.9,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,377845905
6,Pulp Fiction,1994,"Crime, Drama",8.9,Quentin Tarantino,John Travolta,Uma Thurman,Samuel L. Jackson,Bruce Willis,107928762
7,Schindler's List,1993,"Biography, Drama, History",8.9,Steven Spielberg,Liam Neeson,Ralph Fiennes,Ben Kingsley,Caroline Goodall,96898818
8,Inception,2010,"Action, Adventure, Sci-Fi",8.8,Christopher Nolan,Leonardo DiCaprio,Joseph Gordon-Levitt,Elliot Page,Ken Watanabe,292576195
9,Fight Club,1999,Drama,8.8,David Fincher,Brad Pitt,Edward Norton,Meat Loaf,Zach Grenier,37030102


### Task 5: Top movie for each genre

In [1042]:
# I imagine there are lots of ways to tackle this problem- this is what I came up with.

# The genres are combined into one comma-separated string per movie, so split them into lists
movies = df['Genre'].str.split(', ')

# Creates a list of each unique genre
unique_genres = []
for movie in movies:
    for genre in movie:
        if(genre not in unique_genres):
            unique_genres.append(genre)
unique_genres.sort()

# Find the best in each movie genre
for genre in unique_genres:
    # Get the name of the highest rated movie
    # Maybe this is too dense. It's a nice one-liner though.
    best = df.loc[df["Genre"].str.contains(genre)].nlargest(1, "IMDB_Rating")

    # Print results
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"Best movie in {genre} goes to:\n{best['Series_Title'].values[0]}\n")

Best movie in Action goes to:
The Dark Knight

Best movie in Adventure goes to:
The Lord of the Rings: The Return of the King

Best movie in Animation goes to:
Sen to Chihiro no kamikakushi

Best movie in Biography goes to:
Schindler's List

Best movie in Comedy goes to:
Gisaengchung

Best movie in Crime goes to:
The Godfather

Best movie in Drama goes to:
The Shawshank Redemption

Best movie in Family goes to:
Sen to Chihiro no kamikakushi

Best movie in Fantasy goes to:
Star Wars: Episode V - The Empire Strikes Back

Best movie in Film-Noir goes to:
Double Indemnity

Best movie in History goes to:
Schindler's List

Best movie in Horror goes to:
Psycho

Best movie in Music goes to:
Whiplash

Best movie in Musical goes to:
Singin' in the Rain

Best movie in Mystery goes to:
Se7en

Best movie in Romance goes to:
Forrest Gump

Best movie in Sci-Fi goes to:
Inception

Best movie in Sport goes to:
Bacheha-Ye aseman

Best movie in Thriller goes to:
Gisaengchung

Best movie in War goes to:
S
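
One thing to watch with `str.contains` is that it matches substrings, so a search for "Music" also matches "Musical". A possible alternative, sketched here on invented data (the `toy` frame and film names are hypothetical), is to explode the genre lists into one row per movie and genre, then take the top-rated row per group.

```python
import pandas as pd

toy = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C"],
    "Genre": ["Drama, Music", "Comedy, Musical", "Drama"],
    "IMDB_Rating": [8.5, 8.9, 9.1],
})

# Substring matching: "Music" also picks up the "Musical" movie.
substring_hits = toy.loc[toy["Genre"].str.contains("Music"), "Series_Title"].tolist()

# One row per (movie, genre) pair; reset the index so every row has a unique label.
per_genre = (
    toy.assign(Genre=toy["Genre"].str.split(", "))
    .explode("Genre")
    .reset_index(drop=True)
)

# Label of the highest-rated row in each genre, then look those rows up.
best_idx = per_genre.groupby("Genre")["IMDB_Rating"].idxmax()
best = per_genre.loc[best_idx, ["Genre", "Series_Title", "IMDB_Rating"]]

print(substring_hits)
print(best.to_string(index=False))
```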

### Task 6: Director with the most movies

In [1043]:
# Get the value counts for each director. value_counts() sorts in descending order by default, so the highest will be at the 0th index.
dir_counts = df["Director"].value_counts()
print(f"The director with the most movies is: {dir_counts.keys()[0]}")

The director with the most movies is: Steven Spielberg


### Task 7: Star with the most movies

In [1044]:
# Creates a dataframe of stars and their total movie count for each Star column (Star1/Star2 etc)
stars = pd.concat([df[column].value_counts() for column in df if "Star" in column], axis = 1)
# Sums their count for each Star column
stars = stars.sum(axis=1)

# Print result
print(f"The star that has appeared in the most movies is: {stars.nlargest(1).keys()[0]}")

The star that has appeared in the most movies is: Robert De Niro
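
The same count can be reached by stacking the star columns into a single column first. This is a sketch on invented data (the `toy` frame and actor names are hypothetical), using `melt` followed by one `value_counts` call.

```python
import pandas as pd

toy = pd.DataFrame({
    "Star1": ["Actor X", "Actor Y", "Actor Z"],
    "Star2": ["Actor Y", "Actor X", "Actor X"],
})

# Select every column whose name starts with "Star".
star_cols = [c for c in toy.columns if c.startswith("Star")]

# melt() stacks the star columns into one long "Star" column.
counts = toy[star_cols].melt(value_name="Star")["Star"].value_counts()

print(counts.idxmax(), counts.max())
```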


### Task 8: Highest grossing movie for each genre

In [1045]:
# We can reuse the unique_genres variable from earlier

# Find the highest grossing movie in each genre
for genre in unique_genres:
    # Get the highest grossing movie in the genre
    best = df.loc[df["Genre"].str.contains(genre)].nlargest(1, "Gross")

    # Print results
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"Highest grossing movie in {genre} goes to:\n{best['Series_Title'].values[0]}\n")


Highest grossing movie in Action goes to:
Star Wars: Episode VII - The Force Awakens

Highest grossing movie in Adventure goes to:
Star Wars: Episode VII - The Force Awakens

Highest grossing movie in Animation goes to:
Incredibles 2

Highest grossing movie in Biography goes to:
The Blind Side

Highest grossing movie in Comedy goes to:
Toy Story 4

Highest grossing movie in Crime goes to:
The Dark Knight

Highest grossing movie in Drama goes to:
Avengers: Endgame

Highest grossing movie in Family goes to:
E.T. the Extra-Terrestrial

Highest grossing movie in Fantasy goes to:
Avatar

Highest grossing movie in Film-Noir goes to:
Notorious

Highest grossing movie in History goes to:
Gone with the Wind

Highest grossing movie in Horror goes to:
The Exorcist

Highest grossing movie in Music goes to:
Bohemian Rhapsody

Highest grossing movie in Musical goes to:
Fiddler on the Roof

Highest grossing movie in Mystery goes to:
The Sixth Sense

Highest grossing movie in Romance goes to:
Titanic


### Task 9: Lowest grossing movie for each director

In [1046]:
# Get all unique directors
directors = df["Director"].unique()

# Find the lowest grossing movie for each director
for director in directors:
    # Find the lowest grossing movie for the current director
    # Compare names exactly; str.contains would treat the name as a regex and match substrings
    worst = df.loc[df["Director"] == director].nsmallest(1, "Gross")

    # Print results
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{director}'s lowest grossing movie is:\n{worst['Series_Title'].values[0]}\n")

Frank Darabont's lowest grossing movie is:
The Shawshank Redemption

Francis Ford Coppola's lowest grossing movie is:
The Conversation

Christopher Nolan's lowest grossing movie is:
Memento

Sidney Lumet's lowest grossing movie is:
12 Angry Men

Peter Jackson's lowest grossing movie is:
The Hobbit: The Desolation of Smaug

Quentin Tarantino's lowest grossing movie is:
Reservoir Dogs

Steven Spielberg's lowest grossing movie is:
Empire of the Sun

David Fincher's lowest grossing movie is:
Zodiac

Robert Zemeckis's lowest grossing movie is:
Back to the Future Part II

Sergio Leone's lowest grossing movie is:
Giù la testa

Lana Wachowski's lowest grossing movie is:
The Matrix

Martin Scorsese's lowest grossing movie is:
The King of Comedy

Irvin Kershner's lowest grossing movie is:
Star Wars: Episode V - The Empire Strikes Back

Milos Forman's lowest grossing movie is:
Amadeus

Bong Joon Ho's lowest grossing movie is:
Salinui chueok

Fernando Meirelles's lowest grossing movie is:
Cidade d
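
The loop over directors could also be written as a single grouped lookup. A minimal sketch on invented data (the `toy` frame, film names and director names are hypothetical), using `groupby` with `idxmin`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C", "Film D"],
    "Director": ["Dir One", "Dir One", "Dir Two", "Dir Two"],
    "Gross": [500, 200, 900, 950],
})

# idxmin() returns, for each director, the row label of their lowest Gross.
lowest_idx = toy.groupby("Director")["Gross"].idxmin()
lowest = toy.loc[lowest_idx, ["Director", "Series_Title", "Gross"]]

print(lowest.to_string(index=False))
```

This relies on the frame having unique row labels, which holds for the default index.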

### Task 10: Save the dataframe as a parquet file

In [1047]:
# This line did not work: pandas needs either pyarrow or fastparquet installed to write parquet files (see the error below).
df.to_parquet(path+"imdb_processed")

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
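
As the error message says, installing pyarrow or fastparquet with pip or conda should resolve this. The sketch below checks for an engine before writing; `available_parquet_engine` is a hypothetical helper written for this note, and the toy frame and temporary output path are illustrative only.

```python
import importlib.util
import os
import tempfile

import pandas as pd


def available_parquet_engine():
    """Return the name of the first installed parquet engine, or None (hypothetical helper)."""
    for name in ("pyarrow", "fastparquet"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


toy = pd.DataFrame({"Series_Title": ["Film A"], "Gross": [4360000]})
engine = available_parquet_engine()

if engine is None:
    print("No parquet engine found; install pyarrow or fastparquet first.")
else:
    # Write to a temporary directory and read back to confirm the round trip.
    out_path = os.path.join(tempfile.mkdtemp(), "imdb_processed.parquet")
    toy.to_parquet(out_path, engine=engine)
    print(pd.read_parquet(out_path, engine=engine).shape)
```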

### Task 11: Aggregate the gross revenue for all 1000 movies.

In [None]:
# There will be 830 movies in mine, since I removed the 169 movies with no gross revenue and the one row with an invalid release year
type_stats = df.groupby("IMDB_Rating").agg({"Gross":['mean', 'median']})
type_stats.sort_values(('Gross','mean'), ascending=False)

IMDB_Rating,Gross (mean),Gross (median)
9.0,198839500.0,57300000.0
8.8,196300600.0,292576195.0
8.9,194224500.0,107928762.0
8.7,192668600.0,171479930.0
9.2,134966400.0,134966411.0
8.4,133760900.0,25544867.0
8.6,111256900.0,100125643.0
8.5,89778750.0,23341568.0
7.8,75684710.0,32449205.5
8.0,71311990.0,24475416.0
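
If the task is read as asking for one set of totals across all movies rather than per rating, `agg` can be applied to the column directly. A minimal sketch on invented numbers (the `toy` frame is hypothetical), shown next to the grouped version for comparison:

```python
import pandas as pd

toy = pd.DataFrame({
    "IMDB_Rating": [9.0, 9.0, 8.8],
    "Gross": [100, 300, 200],
})

# Aggregates over the whole column.
overall = toy["Gross"].agg(["sum", "mean", "median"])

# Aggregates per rating, in the same spirit as the grouped cell above.
by_rating = toy.groupby("IMDB_Rating")["Gross"].agg(["mean", "median"])

print(overall)
print(by_rating)
```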


# Challenge 1

### Will attempt if I have time after making my QA presentation