# DATA 1 Practical 4

Simos Gerasimou


## Internet Movie Database Exploration and Analysis

**IMDB** ([http://imdb.com](http://imdb.com)) is an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, fan and critical reviews, and ratings. You can find more information at [Wikipedia](https://en.wikipedia.org/wiki/IMDb).

**DataVision** has scraped the website and acquired data for a large number of movies from IMDB. The company wants to analyse this data to extract insights and answer questions including:
* the most successful movies, directors, actors
* patterns that might lead to predicting successful movies in the future

#### Your tasks are to explore this dataset and generate some actionable knowledge.


#### This Jupyter Notebook will be presented to the main Warner Bros stakeholders, who have limited knowledge of data science. Your findings should therefore be accompanied by a suitable justification explaining what you observe and, where applicable, what the observation means and, possibly, why it occurs.

* For each question (task) a description is provided accompanied (most of the time) by two cells: one for writing the Python code and another for providing the justification. Feel free to add more cells if you feel they are needed, but keep the cells corresponding to the same question close by.

### **Important Information**

(1) To answer these exercises, you **must first read Chapter 3: Data Manipulation with Pandas from the Python Data Science Handbook** (https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html)

(2) If you haven't already done so, complete the exercises in the Pandas tutorial to become familiar with the library.

**Pandas API Reference**: https://pandas.pydata.org/pandas-docs/stable/reference/index.html

***

### Part 1: Reading dataset

The IMDB dataset is available on the VLE (look for IMDB-Movie-Data-Filtered.csv in the Practicals section).

**T1) Load the IMDB dataset using Pandas**

**Note**: You have to download the dataset to your local machine and then load it into the Jupyter Notebook

In [1]:
import pandas as pd 
import numpy as np
datasetName = "IMDB-Movie-Data-Filtered.csv"
movies = pd.read_csv(datasetName) 

***

### Part 2: Cleaning the dataset


**T2) Print (i) the first and (ii) the last five records of the dataframe**

In [2]:
movies.head(5)


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,actor_2_name,gross,genres,actor_1_name,movie_title,num_voted_users,...,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,Joel David Moore,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,...,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,7.9,33000
1,Color,Gore Verbinski,302.0,169.0,Orlando Bloom,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,...,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,7.1,0
2,Color,Sam Mendes,602.0,148.0,Rory Kinnear,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,...,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,6.8,85000
3,Color,Christopher Nolan,813.0,164.0,Christian Bale,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,...,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,8.5,164000
4,Color,Andrew Stanton,462.0,132.0,Samantha Morton,73058679.0,Action|Adventure|Sci-Fi,Daryl Sabara,John Carter,212204,...,alien|american civil war|male nipple|mars|prin...,http://www.imdb.com/title/tt0401729/?ref_=fn_t...,738.0,English,USA,PG-13,263700000.0,2012.0,6.6,24000


In [3]:
movies.tail(5)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,actor_2_name,gross,genres,actor_1_name,movie_title,num_voted_users,...,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
3804,Color,Shane Carruth,143.0,77.0,David Sullivan,424760.0,Drama|Sci-Fi|Thriller,Shane Carruth,Primer,72639,...,changing the future|independent film|invention...,http://www.imdb.com/title/tt0390384/?ref_=fn_t...,371.0,English,USA,PG-13,7000.0,2004.0,7.0,19000
3805,Color,Neill Dela Llana,35.0,80.0,Edgar Tancangco,70071.0,Thriller,Ian Gamazon,Cavite,589,...,jihad|mindanao|philippines|security guard|squa...,http://www.imdb.com/title/tt0428303/?ref_=fn_t...,35.0,English,Philippines,Not Rated,7000.0,2005.0,6.3,74
3806,Color,Robert Rodriguez,56.0,81.0,Peter Marquardt,2040920.0,Action|Crime|Drama|Romance|Thriller,Carlos Gallardo,El Mariachi,52055,...,assassin|death|guitar|gun|mariachi,http://www.imdb.com/title/tt0104815/?ref_=fn_t...,130.0,Spanish,USA,R,7000.0,1992.0,6.9,0
3807,Color,Edward Burns,14.0,95.0,Caitlin FitzGerald,4584.0,Comedy|Drama,Kerry Bishé,Newlyweds,1338,...,written and directed by cast member,http://www.imdb.com/title/tt1880418/?ref_=fn_t...,14.0,English,USA,Not Rated,9000.0,2011.0,6.4,413
3808,Color,Jon Gunn,43.0,90.0,Brian Herzlinger,85222.0,Documentary,John August,My Date with Drew,4285,...,actress name in title|crush|date|four word tit...,http://www.imdb.com/title/tt0378407/?ref_=fn_t...,84.0,English,USA,PG,1100.0,2004.0,6.6,456


**T3) Get general info about the dataset**

In [4]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3809 entries, 0 to 3808
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   color                   3809 non-null   object 
 1   director_name           3809 non-null   object 
 2   num_critic_for_reviews  3809 non-null   float64
 3   duration                3809 non-null   float64
 4   actor_2_name            3809 non-null   object 
 5   gross                   3809 non-null   float64
 6   genres                  3809 non-null   object 
 7   actor_1_name            3809 non-null   object 
 8   movie_title             3809 non-null   object 
 9   num_voted_users         3809 non-null   int64  
 10  actor_3_name            3809 non-null   object 
 11  plot_keywords           3809 non-null   object 
 12  movie_imdb_link         3809 non-null   object 
 13  num_user_for_reviews    3809 non-null   float64
 14  language                3809 non-null   

**T4) Explore the dataset and try to understand the meaning of each column. For each column, write its meaning and its data type. Also, which columns might be irrelevant?**
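
One possible starting point for this task is a per-column overview table. The sketch below is only an illustration: it uses a small invented stand-in (`toy`) rather than the real file, and the name `column_overview` is hypothetical. The meaning of each column still has to be worked out by reading the data.

```python
import pandas as pd

# Toy stand-in for the IMDB data (a few rows, for illustration only).
toy = pd.DataFrame({
    "color": ["Color", "Color", "Black and White"],
    "duration": [178.0, 169.0, 108.0],
    "num_voted_users": [886204, 471220, 422432],
})

# One row per column: its dtype, the number of distinct values,
# and the value found in the first record as an example.
column_overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
    "example": toy.iloc[0],
})
print(column_overview)
```

Columns with almost as many distinct values as rows (for example links) are often candidates for being irrelevant to an aggregate analysis, although that is a judgment call.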

**T5) Get the shape of the dataframe**

In [5]:
movies.shape

(3809, 21)

**T6) Get the name of the columns of the dataframe**

In [6]:
movies.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'actor_2_name', 'gross', 'genres', 'actor_1_name', 'movie_title',
       'num_voted_users', 'actor_3_name', 'plot_keywords', 'movie_imdb_link',
       'num_user_for_reviews', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'imdb_score', 'movie_facebook_likes'],
      dtype='object')

**T7) Give more readable names to the columns with suitable capitalisation and eliminating underscores ("_")**

In [7]:
# rename() returns a new DataFrame; `movies` itself keeps its original column names
movies.rename(columns=dict(
    zip(movies.columns,
        [item.replace('_', ' ').capitalize() for item in movies.columns])
))

Unnamed: 0,Color,Director name,Num critic for reviews,Duration,Actor 2 name,Gross,Genres,Actor 1 name,Movie title,Num voted users,...,Plot keywords,Movie imdb link,Num user for reviews,Language,Country,Content rating,Budget,Title year,Imdb score,Movie facebook likes
0,Color,James Cameron,723.0,178.0,Joel David Moore,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,...,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,7.9,33000
1,Color,Gore Verbinski,302.0,169.0,Orlando Bloom,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,...,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,7.1,0
2,Color,Sam Mendes,602.0,148.0,Rory Kinnear,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,...,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,6.8,85000
3,Color,Christopher Nolan,813.0,164.0,Christian Bale,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,...,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,8.5,164000
4,Color,Andrew Stanton,462.0,132.0,Samantha Morton,73058679.0,Action|Adventure|Sci-Fi,Daryl Sabara,John Carter,212204,...,alien|american civil war|male nipple|mars|prin...,http://www.imdb.com/title/tt0401729/?ref_=fn_t...,738.0,English,USA,PG-13,263700000.0,2012.0,6.6,24000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3804,Color,Shane Carruth,143.0,77.0,David Sullivan,424760.0,Drama|Sci-Fi|Thriller,Shane Carruth,Primer,72639,...,changing the future|independent film|invention...,http://www.imdb.com/title/tt0390384/?ref_=fn_t...,371.0,English,USA,PG-13,7000.0,2004.0,7.0,19000
3805,Color,Neill Dela Llana,35.0,80.0,Edgar Tancangco,70071.0,Thriller,Ian Gamazon,Cavite,589,...,jihad|mindanao|philippines|security guard|squa...,http://www.imdb.com/title/tt0428303/?ref_=fn_t...,35.0,English,Philippines,Not Rated,7000.0,2005.0,6.3,74
3806,Color,Robert Rodriguez,56.0,81.0,Peter Marquardt,2040920.0,Action|Crime|Drama|Romance|Thriller,Carlos Gallardo,El Mariachi,52055,...,assassin|death|guitar|gun|mariachi,http://www.imdb.com/title/tt0104815/?ref_=fn_t...,130.0,Spanish,USA,R,7000.0,1992.0,6.9,0
3807,Color,Edward Burns,14.0,95.0,Caitlin FitzGerald,4584.0,Comedy|Drama,Kerry Bishé,Newlyweds,1338,...,written and directed by cast member,http://www.imdb.com/title/tt1880418/?ref_=fn_t...,14.0,English,USA,Not Rated,9000.0,2011.0,6.4,413
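
Because `rename` returns a new DataFrame, the original column names remain available for the later tasks unless the result is assigned. A minimal sketch of that behaviour, using an invented two-column frame (`toy` and `readable` are hypothetical names) and passing a function instead of a dictionary:

```python
import pandas as pd

toy = pd.DataFrame({"movie_title": ["Avatar"], "imdb_score": [7.9]})

# rename() accepts a function that is applied to every column name.
# The result is a new DataFrame; `toy` is left unchanged.
readable = toy.rename(columns=lambda name: name.replace("_", " ").capitalize())

print(list(toy.columns))
print(list(readable.columns))
```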


**T8) Find and print the duplicated records**

**Note**: A record is duplicated if all its entries are identical to those of another record

In [8]:
movies[movies.duplicated()]

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,actor_2_name,gross,genres,actor_1_name,movie_title,num_voted_users,...,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
133,Color,David Yates,248.0,110.0,Alexander Skarsgård,124051759.0,Action|Adventure|Drama|Romance,Christoph Waltz,The Legend of Tarzan,42372,...,africa|capture|jungle|male objectification|tarzan,http://www.imdb.com/title/tt0918940/?ref_=fn_t...,239.0,English,USA,PG-13,180000000.0,2016.0,6.6,29000
182,Color,Bill Condon,322.0,115.0,Kristen Stewart,292298923.0,Adventure|Drama|Fantasy|Romance,Robert Pattinson,The Twilight Saga: Breaking Dawn - Part 2,185394,...,battle|friend|super strength|vampire|vision,http://www.imdb.com/title/tt1673434/?ref_=fn_t...,329.0,English,USA,PG-13,120000000.0,2012.0,5.5,65000
292,Color,Joe Wright,256.0,111.0,Cara Delevingne,34964818.0,Adventure|Family|Fantasy,Hugh Jackman,Pan,39956,...,1940s|child hero|fantasy world|orphan|referenc...,http://www.imdb.com/title/tt3332064/?ref_=fn_t...,186.0,English,USA,PG,150000000.0,2015.0,5.8,24000
377,Color,Josh Trank,369.0,100.0,Reg E. Cathey,56114221.0,Action|Adventure|Sci-Fi,Tim Blake Nelson,Fantastic Four,110486,...,box office flop|critically bashed|portal|telep...,http://www.imdb.com/title/tt1502712/?ref_=fn_t...,695.0,English,USA,PG-13,120000000.0,2015.0,4.3,41000
383,Color,Rob Cohen,187.0,106.0,Vin Diesel,144512310.0,Action|Crime|Thriller,Paul Walker,The Fast and the Furious,272223,...,eighteen wheeler|illegal street racing|truck|t...,http://www.imdb.com/title/tt0232500/?ref_=fn_t...,988.0,English,USA,PG-13,38000000.0,2001.0,6.7,14000
565,Color,Brett Ratner,245.0,101.0,Rufus Sewell,72660029.0,Action|Adventure,Dwayne Johnson,Hercules,115687,...,army|greek mythology|hercules|king|mercenary,http://www.imdb.com/title/tt1267297/?ref_=fn_t...,269.0,English,USA,PG-13,100000000.0,2014.0,6.0,21000
627,Color,Paul Verhoeven,196.0,113.0,Rachel Ticotin,119412921.0,Action|Sci-Fi,Ronny Cox,Total Recall,240241,...,ambiguous ending|false memory|implanted memory...,http://www.imdb.com/title/tt0100802/?ref_=fn_t...,391.0,English,USA,R,65000000.0,1990.0,7.5,0
758,Color,Joss Whedon,703.0,173.0,Robert Downey Jr.,623279547.0,Action|Adventure|Sci-Fi,Chris Hemsworth,The Avengers,995415,...,alien invasion|assassin|battle|iron man|soldier,http://www.imdb.com/title/tt0848228/?ref_=fn_t...,1722.0,English,USA,PG-13,220000000.0,2012.0,8.1,123000
1155,Color,Angelina Jolie Pitt,322.0,137.0,Jack O'Connell,115603980.0,Biography|Drama|Sport|War,Finn Wittrock,Unbroken,103589,...,emaciation|male nudity|plane crash|prisoner of...,http://www.imdb.com/title/tt1809398/?ref_=fn_t...,351.0,English,USA,PG-13,65000000.0,2014.0,7.2,35000
1237,Color,Paul McGuigan,159.0,110.0,Spencer Wilding,5773519.0,Drama|Horror|Sci-Fi|Thriller,Daniel Radcliffe,Victor Frankenstein,28618,...,assistant|experiment|frankenstein|medical stud...,http://www.imdb.com/title/tt1976009/?ref_=fn_t...,91.0,English,USA,PG-13,40000000.0,2015.0,6.0,11000


**T9) Remove all duplicates**

In [9]:
movies.drop_duplicates(inplace=True)
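
To see what `duplicated` and `drop_duplicates` consider a duplicate, the following sketch runs them on a tiny invented frame (all names and values here are illustrative, not taken from the dataset). By default only the later copies are flagged; `keep=False` flags every copy.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_title": ["Pan", "Hercules", "Pan", "Pan"],
    "title_year": [2015, 2014, 2015, 2015],
})

# Default keep='first': the first occurrence is NOT marked as a duplicate.
repeats = toy[toy.duplicated()]

# keep=False marks every row that has at least one identical twin.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() without inplace=True returns a new, de-duplicated frame.
deduplicated = toy.drop_duplicates()

print(len(toy), "rows before,", len(deduplicated), "rows after")
```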

***

### Part 3: Analysing the dataset


**T10)  Get some descriptive statistics for the dataset**

In [10]:
movies.describe()

Unnamed: 0,num_critic_for_reviews,duration,gross,num_voted_users,num_user_for_reviews,budget,title_year,imdb_score,movie_facebook_likes
count,3774.0,3774.0,3774.0,3774.0,3774.0,3774.0,3774.0,3774.0,3774.0
mean,165.626391,110.085056,51960590.0,104377.6,332.372814,45761200.0,2003.015368,6.462666,9249.777954
std,123.676886,22.652983,69673380.0,151219.6,410.02318,225477800.0,9.890751,1.054046,21459.205923
min,2.0,37.0,162.0,48.0,2.0,218.0,1927.0,1.6,0.0
25%,75.0,95.0,7743569.0,18657.75,106.0,10000000.0,1999.0,5.9,0.0
50%,137.0,106.0,29110160.0,52901.0,207.0,25000000.0,2005.0,6.6,218.0
75%,223.0,120.0,66483150.0,126903.5,395.0,50000000.0,2010.0,7.2,11000.0
max,813.0,330.0,760505800.0,1689764.0,5060.0,12215500000.0,2016.0,9.3,349000.0


**T11)  For each movie, print its title, director and three main actors**

In [11]:
movies[['movie_title', 'director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']]

Unnamed: 0,movie_title,director_name,actor_1_name,actor_2_name,actor_3_name
0,Avatar,James Cameron,CCH Pounder,Joel David Moore,Wes Studi
1,Pirates of the Caribbean: At World's End,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport
2,Spectre,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman
3,The Dark Knight Rises,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt
4,John Carter,Andrew Stanton,Daryl Sabara,Samantha Morton,Polly Walker
...,...,...,...,...,...
3804,Primer,Shane Carruth,Shane Carruth,David Sullivan,Casey Gooden
3805,Cavite,Neill Dela Llana,Ian Gamazon,Edgar Tancangco,Quynn Ton
3806,El Mariachi,Robert Rodriguez,Carlos Gallardo,Peter Marquardt,Consuelo Gómez
3807,Newlyweds,Edward Burns,Kerry Bishé,Caitlin FitzGerald,Daniella Pineda


**T12)  Find the movies with the shortest and longest duration, respectively. What are your observations?**

In [12]:
min(movies.duration), max(movies.duration)

(37.0, 330.0)
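
The cell above returns only the two duration values. To retrieve the corresponding movies, one option is `idxmin` and `idxmax`, sketched here on an invented three-row frame (titles and variable names are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_title": ["Short Film", "Feature", "Epic"],
    "duration": [37.0, 106.0, 330.0],
})

# idxmin()/idxmax() return the index label of the first minimum/maximum;
# .loc then fetches the full record for that label.
shortest = toy.loc[toy["duration"].idxmin()]
longest = toy.loc[toy["duration"].idxmax()]

print(shortest["movie_title"], shortest["duration"])
print(longest["movie_title"], longest["duration"])
```

If several movies share the extreme duration, `idxmin` and `idxmax` report only the first; a boolean filter such as `toy[toy["duration"] == toy["duration"].min()]` would return all of them.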

**T13)  Find the countries in which at least one movie has been recorded**

In [13]:
np.unique(movies.country)

array(['Afghanistan', 'Argentina', 'Aruba', 'Australia', 'Belgium',
       'Brazil', 'Canada', 'Chile', 'China', 'Colombia', 'Czech Republic',
       'Denmark', 'Finland', 'France', 'Georgia', 'Germany', 'Greece',
       'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran',
       'Ireland', 'Israel', 'Italy', 'Japan', 'Mexico', 'Netherlands',
       'New Line', 'New Zealand', 'Norway', 'Official site', 'Peru',
       'Philippines', 'Poland', 'Romania', 'Russia', 'South Africa',
       'South Korea', 'Spain', 'Taiwan', 'Thailand', 'UK', 'USA',
       'West Germany'], dtype=object)
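
The output above contains entries such as 'New Line' and 'Official site', which look more like scraping errors than countries. Counting the records per value can help decide how to treat them. The sketch below assumes a tiny invented `country` column and is only meant to show the two calls side by side:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["USA", "UK", "USA", "New Line", "USA"]})

# unique() lists the distinct values in order of first appearance.
countries = toy["country"].unique()

# value_counts() additionally shows how many records carry each value,
# sorted from most to least frequent.
counts = toy["country"].value_counts()

print(countries)
print(counts)
```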

**T14)  Get the movies with score above 8.3 and duration less than 2h**

In [14]:
movies.query('imdb_score > 8.3 & duration < 120')

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,actor_2_name,gross,genres,actor_1_name,movie_title,num_voted_users,...,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
56,Color,Andrew Stanton,421.0,98.0,Fred Willard,223806889.0,Adventure|Animation|Family|Sci-Fi,John Ratzenberger,WALL·E,718837,...,earth|obesity|plant|robot|soil,http://www.imdb.com/title/tt0910970/?ref_=fn_t...,1043.0,English,USA,G,180000000.0,2008.0,8.4,16000
487,Color,Roger Allers,186.0,73.0,Nathan Lane,422783777.0,Adventure|Animation|Drama|Family|Musical,Matthew Broderick,The Lion King,644348,...,king|prince|scar|uncle|unnecessary guilt,http://www.imdb.com/title/tt0110357/?ref_=fn_t...,656.0,English,USA,G,45000000.0,1994.0,8.5,17000
1988,Color,Steven Spielberg,234.0,115.0,Karen Allen,242374454.0,Action|Adventure,Harrison Ford,Raiders of the Lost Ark,661017,...,archeological dig|archeologist|ark of the cove...,http://www.imdb.com/title/tt0082971/?ref_=fn_t...,771.0,English,USA,PG,18000000.0,1981.0,8.5,16000
2069,Black and White,Alfred Hitchcock,290.0,108.0,Vera Miles,32000000.0,Horror|Mystery|Thriller,Janet Leigh,Psycho,422432,...,money|motel|rain|shower|theft,http://www.imdb.com/title/tt0054215/?ref_=fn_t...,1040.0,English,USA,R,806947.0,1960.0,8.5,18000
2167,Color,Robert Zemeckis,198.0,116.0,Thomas F. Wilson,210609762.0,Adventure|Comedy|Sci-Fi,Lea Thompson,Back to the Future,732212,...,clock tower|delorean|future|time travel|time t...,http://www.imdb.com/title/tt0088763/?ref_=fn_t...,809.0,English,USA,PG,19000000.0,1985.0,8.5,39000
2837,Black and White,Tony Kaye,162.0,101.0,Beverly D'Angelo,6712241.0,Crime|Drama,Ethan Suplee,American History X,782437,...,curb stomping|neo nazi|prison|son dislikes mot...,http://www.imdb.com/title/tt0120586/?ref_=fn_t...,1420.0,English,USA,R,7500000.0,1998.0,8.6,35000
2900,Color,Ridley Scott,392.0,116.0,Yaphet Kotto,78900000.0,Horror|Sci-Fi,Tom Skerritt,Alien,563827,...,alien|creature|future|outer space|spaceship,http://www.imdb.com/title/tt0078748/?ref_=fn_t...,1110.0,English,UK,R,11000000.0,1979.0,8.5,23000
3116,Color,Bryan Singer,162.0,106.0,Chazz Palminteri,23272306.0,Crime|Drama|Mystery|Thriller,Kevin Spacey,The Usual Suspects,740918,...,criminal|dirty cop|flashback|limping|suspect,http://www.imdb.com/title/tt0114814/?ref_=fn_t...,1182.0,English,USA,R,6000000.0,1995.0,8.6,28000
3198,Black and White,Christopher Nolan,274.0,113.0,Thomas Lennon,25530884.0,Mystery|Thriller,Callum Rennie,Memento,845580,...,flashback|memory|murder|short term memory|tele...,http://www.imdb.com/title/tt0209144/?ref_=fn_t...,2067.0,English,USA,R,9000000.0,2000.0,8.5,40000
3274,Color,Darren Aronofsky,234.0,102.0,Mark Margolis,3609278.0,Drama,Ellen Burstyn,Requiem for a Dream,573541,...,addiction|diet pill|drug addiction|fast motion...,http://www.imdb.com/title/tt0180093/?ref_=fn_t...,1916.0,English,USA,R,4500000.0,2000.0,8.4,38000


**T15)  Find the movies directed by Ridley Scott**

Hint: Slicing may be easier for this task; querying is also doable but slightly more difficult

In [15]:
movies.query("director_name == 'Ridley Scott'")

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,actor_2_name,gross,genres,actor_1_name,movie_title,num_voted_users,...,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
21,Color,Ridley Scott,343.0,156.0,William Hurt,105219735.0,Action|Adventure|Drama|History,Mark Addy,Robin Hood,211765,...,1190s|archer|england|king of england|robin hood,http://www.imdb.com/title/tt0955308/?ref_=fn_t...,546.0,English,USA,PG-13,200000000.0,2010.0,6.7,17000
155,Color,Ridley Scott,314.0,150.0,María Valverde,65007045.0,Action|Adventure|Drama,Christian Bale,Exodus: Gods and Kings,128682,...,egypt|exodus|moses|pharaoh|plague,http://www.imdb.com/title/tt1528100/?ref_=fn_t...,657.0,English,UK,PG-13,140000000.0,2014.0,6.1,51000
219,Color,Ridley Scott,775.0,124.0,Charlize Theron,126464904.0,Adventure|Mystery|Sci-Fi,Michael Fassbender,Prometheus,456260,...,cave painting|medical scanner|planet|pregnant ...,http://www.imdb.com/title/tt1446714/?ref_=fn_t...,2326.0,English,USA,R,130000000.0,2012.0,7.0,97000
265,Color,Ridley Scott,239.0,194.0,Orlando Bloom,47396698.0,Action|Adventure|Drama|History|War,Liam Neeson,Kingdom of Heaven,217373,...,12th century|crusader|jerusalem|knight|medieva...,http://www.imdb.com/title/tt0320661/?ref_=fn_t...,942.0,English,USA,R,130000000.0,2005.0,7.2,0
268,Color,Ridley Scott,568.0,151.0,Donald Glover,228430993.0,Adventure|Drama|Sci-Fi,Matt Damon,The Martian,472488,...,astronaut|international cooperation|left for d...,http://www.imdb.com/title/tt3659388/?ref_=fn_t...,1023.0,English,USA,PG-13,108000000.0,2015.0,8.1,153000
272,Color,Ridley Scott,265.0,171.0,Connie Nielsen,187670866.0,Action|Drama|Romance,Djimon Hounsou,Gladiator,982637,...,battlefield|blood|combat|gladiator|roman empire,http://www.imdb.com/title/tt0172495/?ref_=fn_t...,2368.0,English,USA,R,103000000.0,2000.0,8.5,21000
279,Color,Ridley Scott,300.0,176.0,Ruby Dee,130127620.0,Biography|Crime|Drama,Denzel Washington,American Gangster,324671,...,death|heroin|popcorn|smuggling|vietnam,http://www.imdb.com/title/tt0765429/?ref_=fn_t...,458.0,English,USA,R,100000000.0,2007.0,7.8,0
319,Color,Ridley Scott,200.0,152.0,Sam Shepard,108638745.0,Drama|History|War,Ioan Gruffudd,Black Hawk Down,292022,...,army|helicopter|somali|somalia|warlord,http://www.imdb.com/title/tt0265086/?ref_=fn_t...,1103.0,English,USA,R,92000000.0,2001.0,7.7,10000
614,Color,Ridley Scott,238.0,128.0,Simon McBurney,39380442.0,Action|Drama|Thriller,Leonardo DiCaprio,Body of Lies,174248,...,cia|jordan|middle east|spy|terrorist,http://www.imdb.com/title/tt0758774/?ref_=fn_t...,263.0,English,USA,R,70000000.0,2008.0,7.1,0
926,Color,Ridley Scott,97.0,125.0,Demi Moore,48154732.0,Action|Drama|War,Viggo Mortensen,G.I. Jane,60326,...,feminism|hit in the crotch|kicked in the crotc...,http://www.imdb.com/title/tt0119173/?ref_=fn_t...,142.0,English,USA,R,50000000.0,1997.0,5.8,2000
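
The hint mentions slicing as an alternative to `query`. A minimal sketch of that boolean-mask approach, on a three-row stand-in frame (the variable names are illustrative), shows that both routes select the same records:

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_title": ["Alien", "Psycho", "Gladiator"],
    "director_name": ["Ridley Scott", "Alfred Hitchcock", "Ridley Scott"],
})

# Boolean mask: True for every record whose director matches.
is_scott = toy["director_name"] == "Ridley Scott"
by_mask = toy[is_scott]

# Equivalent selection with query(); note the nested quotes.
by_query = toy.query("director_name == 'Ridley Scott'")

print(by_mask)
```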


**T16)  Find the movies in which Hugh Jackman played as 1st, 2nd or 3rd actor**

In [16]:
jackman_in = movies.isin(['Hugh Jackman']).any(axis=1)
movies[jackman_in]

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,actor_2_name,gross,genres,actor_1_name,movie_title,num_voted_users,...,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
33,Color,Brett Ratner,334.0,104.0,Kelsey Grammer,234360014.0,Action|Adventure|Fantasy|Sci-Fi|Thriller,Hugh Jackman,X-Men: The Last Stand,383427,...,battle|mutant|outrage|walking through a wall|x...,http://www.imdb.com/title/tt0376994/?ref_=fn_t...,1912.0,English,Canada,PG-13,210000000.0,2006.0,6.8,0
46,Color,Bryan Singer,539.0,149.0,Peter Dinklage,233914986.0,Action|Adventure|Fantasy|Sci-Fi|Thriller,Jennifer Lawrence,X-Men: Days of Future Past,514125,...,dystopia|super strength|supernatural power|tim...,http://www.imdb.com/title/tt1877832/?ref_=fn_t...,752.0,English,USA,PG-13,200000000.0,2014.0,8.0,82000
119,Color,Gavin Hood,350.0,119.0,Ryan Reynolds,179883016.0,Action|Adventure|Fantasy|Sci-Fi|Thriller,Hugh Jackman,X-Men Origins: Wolverine,361924,...,army|civil war|claw fight|commando|wolverine t...,http://www.imdb.com/title/tt0458525/?ref_=fn_t...,641.0,English,USA,PG-13,150000000.0,2009.0,6.7,0
140,Color,David Bowers,135.0,85.0,Kate Winslet,64459316.0,Adventure|Animation|Comedy|Family,Hugh Jackman,Flushed Away,85086,...,boat|frog|rat|sewer|toad,http://www.imdb.com/title/tt0424095/?ref_=fn_t...,122.0,English,UK,PG,149000000.0,2006.0,6.7,0
141,Color,Joe Wright,256.0,111.0,Cara Delevingne,34964818.0,Adventure|Family|Fantasy,Hugh Jackman,Pan,39956,...,1940s|child hero|fantasy world|orphan|referenc...,http://www.imdb.com/title/tt3332064/?ref_=fn_t...,186.0,English,USA,PG,150000000.0,2015.0,5.8,24000
152,Color,Peter Ramsey,256.0,97.0,Kamil McFadden,103400692.0,Adventure|Animation|Family|Fantasy,Hugh Jackman,Rise of the Guardians,123553,...,belief|box office hit|children|new york city|t...,http://www.imdb.com/title/tt1446192/?ref_=fn_t...,174.0,English,USA,PG,145000000.0,2012.0,7.3,25000
202,Color,Bryan Singer,289.0,134.0,Bruce Davison,214948780.0,Action|Adventure|Fantasy|Sci-Fi|Thriller,Hugh Jackman,X-Men 2,405973,...,mutant|prison|professor|school|x men,http://www.imdb.com/title/tt0290334/?ref_=fn_t...,1055.0,English,Canada,PG-13,110000000.0,2003.0,7.5,0
231,Color,James Mangold,440.0,138.0,Tao Okamoto,132550960.0,Action|Adventure|Sci-Fi|Thriller,Hugh Jackman,The Wolverine,328067,...,healing power|marvel comics|mecha|regeneration...,http://www.imdb.com/title/tt1430132/?ref_=fn_t...,533.0,English,USA,PG-13,120000000.0,2013.0,6.7,68000
255,Color,Shawn Levy,327.0,127.0,Torey Michael Adkins,85463309.0,Action|Drama|Sci-Fi|Sport,Hugh Jackman,Real Steel,254841,...,arena|boxing|boxing movie|robot|robot battle,http://www.imdb.com/title/tt0433035/?ref_=fn_t...,426.0,English,USA,PG-13,110000000.0,2011.0,7.1,36000
385,Color,George Miller,206.0,108.0,Hugh Jackman,197992827.0,Animation|Comedy|Family|Music|Romance,Robin Williams,Happy Feet,132501,...,dance|emperor penguin|friend|penguin|song,http://www.imdb.com/title/tt0366548/?ref_=fn_t...,548.0,English,USA,PG,100000000.0,2006.0,6.5,0
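
The cell above searches every column, so it would also match a record in which the name appears outside the actor columns, for example as director. A possible refinement, sketched on invented data (the first row deliberately lists the name as director only; `actor_columns` is a hypothetical helper name), restricts the test to the three actor columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "director_name": ["Hugh Jackman", "Joe Wright", "George Miller"],  # row 0 is invented
    "actor_1_name": ["Someone Else", "Hugh Jackman", "Robin Williams"],
    "actor_2_name": ["Another Actor", "Cara Delevingne", "Hugh Jackman"],
    "actor_3_name": ["Third Actor", "Extra One", "Extra Two"],
})

actor_columns = ["actor_1_name", "actor_2_name", "actor_3_name"]

# Compare only the actor columns, then keep rows with at least one match.
jackman_acted = toy[actor_columns].eq("Hugh Jackman").any(axis=1)
result = toy[jackman_acted]

print(result)
```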


**T17)  Put the movies into groups depending on their colour and show a summary**

In [28]:
movies.groupby('color').describe()

Unnamed: 0_level_0,num_critic_for_reviews,num_critic_for_reviews,num_critic_for_reviews,num_critic_for_reviews,num_critic_for_reviews,num_critic_for_reviews,num_critic_for_reviews,num_critic_for_reviews,duration,duration,...,imdb_score,imdb_score,movie_facebook_likes,movie_facebook_likes,movie_facebook_likes,movie_facebook_likes,movie_facebook_likes,movie_facebook_likes,movie_facebook_likes,movie_facebook_likes
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
color,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Black and White,124.0,165.064516,109.121082,14.0,80.0,149.0,206.0,576.0,124.0,116.274194,...,7.8,8.9,124.0,6490.056452,15112.656742,0.0,0.0,0.0,2250.0,109000.0
Color,3650.0,165.645479,124.154638,2.0,75.0,136.0,223.0,813.0,3650.0,109.874795,...,7.2,9.3,3650.0,9343.532877,21637.464837,0.0,0.0,229.5,11000.0,349000.0


**T18) Get all movies that were released between 2005 and 2010, have a rating above 8.0, but grossed less than their estimated budget**

In [30]:
movies.query("2005 <= title_year <= 2010 & imdb_score > 8 & gross < budget")

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,actor_2_name,gross,genres,actor_1_name,movie_title,num_voted_users,...,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
3083,Color,José Padilha,142.0,115.0,Fernanda Machado,8060.0,Action|Crime|Drama|Thriller,Wagner Moura,Elite Squad,81644,...,brazil|military police|police|rio de janeiro b...,http://www.imdb.com/title/tt0861739/?ref_=fn_t...,107.0,Portuguese,Brazil,R,4000000.0,2007.0,8.1,11000
3501,Black and White,David Sington,107.0,100.0,Buzz Aldrin,1134049.0,Documentary|History,John F. Kennedy,In the Shadow of the Moon,5475,...,1960s|astronaut|moon|nasa|spacecraft accident,http://www.imdb.com/title/tt0925248/?ref_=fn_t...,44.0,English,UK,PG,2000000.0,2007.0,8.1,0


**T19) Get all movies that were released after 2010 (inclusive), have a rating above 8.0 (exclusive), but made below the 25th percentile in revenue.**

In [39]:
movies.query('title_year >= 2010 & imdb_score > 8 & gross < gross.quantile(.25)')

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,actor_2_name,gross,genres,actor_1_name,movie_title,num_voted_users,...,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
3081,Color,Denis Villeneuve,226.0,139.0,Mélissa Désormeaux-Poulin,6857096.0,Drama|Mystery|War,Lubna Azabal,Incendies,80429,...,brother sister relationship|family relationshi...,http://www.imdb.com/title/tt1255953/?ref_=fn_t...,156.0,French,Canada,R,6800000.0,2010.0,8.2,37000
3323,Color,Ron Fricke,115.0,102.0,Balinese Tari Legong Dancers,2601847.0,Documentary|Music,Collin Alfredo St. Dic,Samsara,22457,...,hall of mirrors|mont saint michel france|palac...,http://www.imdb.com/title/tt0770802/?ref_=fn_t...,69.0,,USA,PG-13,4000000.0,2011.0,8.5,26000
3373,Color,Thomas Vinterberg,349.0,115.0,Alexandra Rapaport,610968.0,Drama,Thomas Bo Larsen,The Hunt,170155,...,deer|gun|gunshot|hunt|kindergarten teacher,http://www.imdb.com/title/tt2106476/?ref_=fn_t...,249.0,Danish,Denmark,R,3800000.0,2012.0,8.3,60000
3630,Color,Joshua Oppenheimer,248.0,96.0,Herman Koto,484221.0,Biography|Crime|Documentary|History,Anwar Congo,The Act of Killing,23836,...,death squad|mass killing|musical number|refere...,http://www.imdb.com/title/tt2375605/?ref_=fn_t...,107.0,Indonesian,UK,Not Rated,1000000.0,2012.0,8.2,20000
3658,Color,Asghar Farhadi,354.0,123.0,Leila Hatami,7098492.0,Drama|Mystery,Shahab Hosseini,A Separation,151812,...,alzheimer's disease|caregiver|divorce|iran|ira...,http://www.imdb.com/title/tt1832382/?ref_=fn_t...,264.0,Persian,Iran,PG-13,500000.0,2011.0,8.4,48000
3705,Color,Marius A. Markevicius,26.0,89.0,Greg Speirs,133778.0,Documentary|Sport,Tommy Sheppard,The Other Dream Team,3086,...,basketball|basketball team|grateful dead|lithu...,http://www.imdb.com/title/tt1606829/?ref_=fn_t...,9.0,English,USA,Not Rated,500000.0,2012.0,8.4,0
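
An alternative way to write this filter is to compute the percentile first and reference it inside `query` with the `@` prefix, which keeps the threshold visible for the justification cell. The data and the names `toy` and `low_gross` below are invented for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E"],
    "title_year": [2011, 2012, 2009, 2013, 2014],
    "imdb_score": [8.4, 8.2, 8.5, 7.0, 8.1],
    "gross": [40.0, 5000.0, 50.0, 80.0, 9000.0],
})

# 25th percentile of gross income over the whole (toy) dataset.
low_gross = toy["gross"].quantile(0.25)

# '@low_gross' refers to the Python variable defined above.
result = toy.query("title_year >= 2010 and imdb_score > 8 and gross < @low_gross")

print(low_gross)
print(result)
```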


**T20) Find out the sum of votes, budget and gross income by year**

In [43]:
movies.groupby('title_year')[['num_voted_users', 'budget', 'gross']].sum()

Unnamed: 0_level_0,num_voted_users,budget,gross
title_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1927.0,111841,6.000000e+06,2.643500e+04
1929.0,4546,3.790000e+05,2.808000e+06
1933.0,7921,4.390000e+05,2.300000e+06
1935.0,13269,6.090000e+05,3.000000e+06
1936.0,143086,1.500000e+06,1.632450e+05
...,...,...,...
2012.0,22741118,7.566431e+09,1.023866e+10
2013.0,22821998,8.355416e+09,1.047177e+10
2014.0,18812088,7.458670e+09,1.021050e+10
2015.0,11997638,7.000195e+09,9.280609e+09


**T21) Find out the top ten directors with the highest average score**

In [50]:
movies.groupby('director_name')['imdb_score'].mean().sort_values(ascending=False).head(10)

director_name
Akira Kurosawa       8.700000
Tony Kaye            8.600000
Charles Chaplin      8.600000
Majid Majidi         8.500000
Alfred Hitchcock     8.500000
Damien Chazelle      8.500000
Ron Fricke           8.500000
Sergio Leone         8.433333
Christopher Nolan    8.425000
Richard Marquand     8.400000
Name: imdb_score, dtype: float64
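
A plain average can be dominated by directors who have only one or two movies in the dataset. One way to check this, sketched below on invented data, is to compute the number of movies alongside the mean and filter on it. The threshold `min_films` is a hypothetical choice, not part of the task.

```python
import pandas as pd

toy = pd.DataFrame({
    "director_name": ["One Hit", "Steady", "Steady", "Steady", "Busy", "Busy"],
    "imdb_score": [9.0, 8.0, 8.2, 8.4, 6.0, 7.0],
})

# Mean score and number of movies per director in one pass.
per_director = toy.groupby("director_name")["imdb_score"].agg(["mean", "count"])

min_films = 2  # hypothetical threshold, chosen only for this illustration
ranked = (
    per_director[per_director["count"] >= min_films]
    .sort_values("mean", ascending=False)
)

print(ranked.head(10))
```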

### Ideas for practicing at home

* Find the average score, budget and gross income of movies in which Liam Neeson played
* Produce a dataset fragment grouped by Language and/or Country
* Any other analysis that you think could generate some useful insight.
* Try to do Practicals 2 and 3 using Pandas