# IMDB data exploration

---

In [None]:
!pip install --upgrade gdown

Collecting gdown
  Downloading gdown-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Downloading gdown-5.2.0-py3-none-any.whl (18 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 5.1.0
    Uninstalling gdown-5.1.0:
      Successfully uninstalled gdown-5.1.0
Successfully installed gdown-5.2.0


In [None]:
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm

Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 7.74MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 35.0MB/s]


In [None]:
import pandas as pd
import numpy as np

movies = pd.read_csv('movies.csv', index_col=0)
directors = pd.read_csv('directors.csv', index_col=0)

data = movies.merge(directors, how='left', left_on='director_id', right_on='id')
data.drop(['director_id','id_y'], axis=1, inplace=True)
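
The `id_x` and `id_y` names come from the merge itself: both files have an `id` column, so pandas appends suffixes to tell them apart. Here is a minimal sketch on toy frames (`movies_toy`, `directors_toy`, and their values are made up for illustration and are not taken from the CSV files):

```python
import pandas as pd

# Toy stand-ins for movies.csv and directors.csv (hypothetical values)
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "director_id": [10, 11, 99],   # 99 has no match in directors_toy
})
directors_toy = pd.DataFrame({
    "id": [10, 11],
    "director_name": ["Dir One", "Dir Two"],
})

merged_toy = movies_toy.merge(
    directors_toy, how="left", left_on="director_id", right_on="id"
)
# Both frames had an `id` column, so pandas renames them id_x / id_y
print(merged_toy.columns.tolist())

# Drop the two columns that only served as join keys
merged_toy = merged_toy.drop(columns=["director_id", "id_y"])
print(merged_toy)
```

With `how='left'`, every movie is kept; a movie whose director is not found simply gets missing values in the director columns.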

Let's explore all the features in the merged dataset.

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1465 entries, 0 to 1464
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id_x           1465 non-null   int64  
 1   budget         1465 non-null   int64  
 2   popularity     1465 non-null   int64  
 3   revenue        1465 non-null   int64  
 4   title          1465 non-null   object 
 5   vote_average   1465 non-null   float64
 6   vote_count     1465 non-null   int64  
 7   year           1465 non-null   int64  
 8   month          1465 non-null   object 
 9   day            1465 non-null   object 
 10  director_name  1465 non-null   object 
 11  gender         1341 non-null   object 
dtypes: float64(1), int64(6), object(5)
memory usage: 148.8+ KB


It looks like only the `gender` column has missing values (1341 non-null out of 1465 rows, so 124 are missing). We will deal with them later.
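
Instead of reading the counts off `info()`, the missing values can also be counted directly. The sketch below uses a small made-up frame (`toy` is hypothetical); on the merged dataset the same call would be `data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing gender
toy = pd.DataFrame({
    "director_name": ["Dir One", "Dir Two", "Dir Three"],
    "gender": ["Male", np.nan, "Female"],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```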

#### How can we describe these features to know more about their range of values?

In [None]:
data.describe()

Unnamed: 0,id_x,budget,popularity,revenue,vote_average,vote_count,year
count,1465.0,1465.0,1465.0,1465.0,1465.0,1465.0,1465.0
mean,45225.191126,48022950.0,30.855973,143253900.0,6.368191,1146.396587,2002.615017
std,1189.096396,49355410.0,34.845214,206491800.0,0.818033,1578.077438,8.680141
min,43597.0,0.0,0.0,0.0,3.0,1.0,1976.0
25%,44236.0,14000000.0,11.0,17380130.0,5.9,216.0,1998.0
50%,45022.0,33000000.0,23.0,75781640.0,6.4,571.0,2004.0
75%,45990.0,66000000.0,41.0,179246900.0,6.9,1387.0,2009.0
max,48395.0,380000000.0,724.0,2787965000.0,8.3,13752.0,2016.0


This gives us summary **statistics** for the numeric columns.

Notice that some columns, such as `title` and `month`, are missing from this output.

How are these missing columns different?

- They are of **object dtype**.

#### How can we include object type in `df.describe()`?

In [None]:
data.describe(include=object)

Unnamed: 0,title,month,day,director_name,gender
count,1465,1465,1465,1465,1341
unique,1465,12,7,199,2
top,Avatar,Dec,Friday,Steven Spielberg,Male
freq,1,193,654,26,1309


If you notice,
- The values in the `revenue` and `budget` columns are very large.
- Budgets and revenues of Hollywood movies are usually quoted in millions of dollars.

#### How can we express `revenue` and `budget` in millions of USD?


In [None]:
data['revenue'] = (data['revenue']/1000000).round(2)
data

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
0,43597,237000000,150,2787.97,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male
1,43598,300000000,139,961.00,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male
2,43599,245000000,107,880.67,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male
3,43600,250000000,112,1084.94,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male
4,43602,258000000,115,890.87,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male
...,...,...,...,...,...,...,...,...,...,...,...,...
1460,48363,0,3,0.32,The Last Waltz,7.9,64,1978,May,Monday,Martin Scorsese,Male
1461,48370,27000,19,3.15,Clerks,7.4,755,1994,Sep,Tuesday,Kevin Smith,Male
1462,48375,0,7,0.00,Rampage,6.0,131,2009,Aug,Friday,Uwe Boll,Male
1463,48376,0,3,0.00,Slacker,6.4,77,1990,Jul,Friday,Richard Linklater,Male


Similarly, we can do it for the `budget` as well.

In [None]:
data['budget']=(data['budget']/1000000).round(2)
data.head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
0,43597,237.0,150,2787.97,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male
1,43598,300.0,139,961.0,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male
2,43599,245.0,107,880.67,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male
3,43600,250.0,112,1084.94,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male
4,43602,258.0,115,890.87,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male
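
Both columns could also be converted in a single step by selecting them together. This is only a sketch on a made-up frame (`toy` and `money_cols` are illustrative names). Note that running such a conversion cell twice would divide the values twice.

```python
import pandas as pd

# Hypothetical raw values in dollars
toy = pd.DataFrame({
    "title": ["A", "B"],
    "budget": [237000000, 27000],
    "revenue": [2787965087, 3151130],
})

money_cols = ["budget", "revenue"]
# Divide both columns at once and keep two decimal places
toy[money_cols] = (toy[money_cols] / 1_000_000).round(2)
print(toy)
```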


Let's say we are interested in fetching all the **highly rated movies**

- movies with **ratings > 7**

#### How can we get movies with ratings greater than 7?

We can use the concept of `masking`.

Let's first create a mask to filter such movies.

- In SQL, we can do `SELECT * FROM movies WHERE vote_average > 7`
- In pandas, the condition alone gives a boolean mask:

In [None]:
data['vote_average'] > 7

0        True
1       False
2       False
3        True
4       False
        ...  
1460     True
1461     True
1462    False
1463    False
1464    False
Name: vote_average, Length: 1465, dtype: bool

But this only tells us which rows satisfy the condition, not the row values themselves.

#### How do we get the row values from this mask?

In [None]:
data.loc[data['vote_average'] > 7]

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
0,43597,237.00,150,2787.97,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male
3,43600,250.00,112,1084.94,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male
14,43616,250.00,120,956.02,The Hobbit: The Battle of the Five Armies,7.1,4760,2014,Dec,Wednesday,Peter Jackson,Male
16,43619,250.00,94,958.40,The Hobbit: The Desolation of Smaug,7.6,4524,2013,Dec,Wednesday,Peter Jackson,Male
19,43622,200.00,100,1845.03,Titanic,7.5,7562,1997,Nov,Tuesday,James Cameron,Male
...,...,...,...,...,...,...,...,...,...,...,...,...
1456,48321,0.01,20,7.00,Eraserhead,7.5,485,1977,Mar,Saturday,David Lynch,Male
1457,48323,0.00,5,0.00,The Mighty,7.1,51,1998,Oct,Friday,Peter Chelsom,Male
1458,48335,0.06,27,3.22,Pi,7.1,586,1998,Jul,Friday,Darren Aronofsky,Male
1460,48363,0.00,3,0.32,The Last Waltz,7.9,64,1978,May,Monday,Martin Scorsese,Male


You can also perform the filtering without using `loc`.




In [None]:
data[data['vote_average'] > 7]

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
0,43597,237.00,150,2787.97,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male
3,43600,250.00,112,1084.94,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male
14,43616,250.00,120,956.02,The Hobbit: The Battle of the Five Armies,7.1,4760,2014,Dec,Wednesday,Peter Jackson,Male
16,43619,250.00,94,958.40,The Hobbit: The Desolation of Smaug,7.6,4524,2013,Dec,Wednesday,Peter Jackson,Male
19,43622,200.00,100,1845.03,Titanic,7.5,7562,1997,Nov,Tuesday,James Cameron,Male
...,...,...,...,...,...,...,...,...,...,...,...,...
1456,48321,0.01,20,7.00,Eraserhead,7.5,485,1977,Mar,Saturday,David Lynch,Male
1457,48323,0.00,5,0.00,The Mighty,7.1,51,1998,Oct,Friday,Peter Chelsom,Male
1458,48335,0.06,27,3.22,Pi,7.1,586,1998,Jul,Friday,Darren Aronofsky,Male
1460,48363,0.00,3,0.32,The Last Waltz,7.9,64,1978,May,Monday,Martin Scorsese,Male


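But this is not recommended. Why?

- It can create confusion between **implicit/explicit indexing** as discussed before.
- `loc` also lets you select rows and columns in a single call. Any speed difference between the two forms for a boolean mask is usually small, so clarity is the main reason to prefer `loc`.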

#### How can we return a subset of columns, say only `title` and `director_name`?

In [None]:
data.loc[data['vote_average'] > 7, ['title','director_name']]

Unnamed: 0,title,director_name
0,Avatar,James Cameron
3,The Dark Knight Rises,Christopher Nolan
14,The Hobbit: The Battle of the Five Armies,Peter Jackson
16,The Hobbit: The Desolation of Smaug,Peter Jackson
19,Titanic,James Cameron
...,...,...
1456,Eraserhead,David Lynch
1457,The Mighty,Peter Chelsom
1458,Pi,Darren Aronofsky
1460,The Last Waltz,Martin Scorsese


So far, we've only seen single condition based filtering.

#### What if we want to filter highly rated movies released after 2014?

Notice that two different conditions are involved here.

1. Movies should be highly rated, i.e. ratings > 7.
2. Movies should be released in 2015 or later.

We can use the `&` operator to combine multiple conditions.

In [None]:
data.loc[(data['vote_average'] > 7) & (data['year'] >= 2015)].head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
30,43641,190.0,102,1506.25,Furious 7,7.3,4176,2015,Apr,Wednesday,James Wan,Male
78,43724,150.0,434,378.86,Mad Max: Fury Road,7.2,9427,2015,May,Wednesday,George Miller,Male
106,43773,135.0,100,532.95,The Revenant,7.3,6396,2015,Dec,Friday,Alejandro González Iñárritu,Male
162,43867,108.0,167,630.16,The Martian,7.6,7268,2015,Sep,Wednesday,Ridley Scott,Male
312,44128,75.0,48,108.15,The Man from U.N.C.L.E.,7.1,2265,2015,Aug,Thursday,Guy Ritchie,Male


Recall how we applied multiple conditions in NumPy?

- Use the **element-wise operators `&` (and) or `|` (or)**

**Note:** When specifying multiple conditions, put each condition within parentheses `()`, because `&` and `|` bind more tightly than comparison operators such as `>`.
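
To see why the parentheses matter, here is a small sketch on a made-up frame (`toy` is hypothetical). Without parentheses, Python evaluates `7 & toy["year"]` first and then tries a chained comparison on whole Series, which pandas rejects.

```python
import pandas as pd

toy = pd.DataFrame({
    "vote_average": [7.3, 6.1, 8.0],
    "year": [2015, 2016, 2009],
})

# Correct: each comparison is evaluated first, then combined element-wise
mask = (toy["vote_average"] > 7) & (toy["year"] >= 2015)
print(mask.tolist())

# Without parentheses, Python parses the expression as
#   toy["vote_average"] > (7 & toy["year"]) >= 2015
# a chained comparison that needs the truth value of a Series
try:
    toy["vote_average"] > 7 & toy["year"] >= 2015
    raised = False
except ValueError:
    raised = True
print(raised)
```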

#### How can we find movies released on either Friday or Saturday?

In [None]:
data.loc[(data['day'] == 'Friday') | (data['day'] == 'Saturday')].head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
1,43598,300.0,139,961.0,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male
12,43614,380.0,135,1045.71,Pirates of the Caribbean: On Stranger Tides,6.4,4948,2011,May,Saturday,Rob Marshall,Male
22,43627,200.0,35,783.77,Spider-Man 2,6.7,4321,2004,Jun,Friday,Sam Raimi,Male
25,43632,150.0,21,836.3,Transformers: Revenge of the Fallen,6.0,3138,2009,Jun,Friday,Michael Bay,Male
40,43656,200.0,45,769.65,2012,5.6,4903,2009,Oct,Saturday,Roland Emmerich,Male


Thus, we can perform complex queries using both `&` and `|` operators.
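
When several values of the same column are being checked, `isin` is a shorter alternative to chaining `|`. The following is a sketch on a made-up frame (`toy` and `fri_or_sat` are illustrative names), not the approach used in the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "day": ["Friday", "Monday", "Saturday", "Friday"],
})

# isin() is True wherever the value appears in the given list
fri_or_sat = toy["day"].isin(["Friday", "Saturday"])

# The same mask written with two comparisons joined by |
either = (toy["day"] == "Friday") | (toy["day"] == "Saturday")

print(toy.loc[fri_or_sat])
```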

Now let's try to answer a few more questions from this data.

#### How will you find the Top 5 most popular movies?

- We can simply sort our data based on values of the `popularity` column.

In [None]:
data.sort_values(['popularity'],ascending=False).head(5)

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
58,43692,165.0,724,675.12,Interstellar,8.1,10867,2014,Nov,Wednesday,Christopher Nolan,Male
78,43724,150.0,434,378.86,Mad Max: Fury Road,7.2,9427,2015,May,Wednesday,George Miller,Male
119,43796,140.0,271,655.01,Pirates of the Caribbean: The Curse of the Bla...,7.5,6985,2003,Jul,Wednesday,Gore Verbinski,Male
120,43797,125.0,206,752.1,The Hunger Games: Mockingjay - Part 1,6.6,5584,2014,Nov,Tuesday,Francis Lawrence,Male
45,43662,185.0,187,1004.56,The Dark Knight,8.2,12002,2008,Jul,Wednesday,Christopher Nolan,Male


Applying `sort_values` to a string column sorts the dataframe **lexicographically**.

In [None]:
data.sort_values(['title'],ascending=False).head(5)

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
436,44364,60.0,36,71.07,xXx: State of the Union,4.7,549,2005,Apr,Wednesday,Lee Tamahori,Male
330,44165,70.0,46,277.45,xXx,5.8,1424,2002,Aug,Friday,Rob Cohen,Male
994,45681,15.0,21,2.86,eXistenZ,6.7,475,1999,Apr,Wednesday,David Cronenberg,Male
547,44594,50.0,37,55.97,Zoolander 2,4.7,797,2016,Feb,Saturday,Ben Stiller,Male
850,45313,28.0,38,60.78,Zoolander,6.1,1337,2001,Sep,Friday,Ben Stiller,Male
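
Lexicographic order here is case-sensitive: lowercase letters sort after uppercase ones, which is why `xXx` and `eXistenZ` appear before `Zoolander` in the descending output above. If a case-insensitive order is wanted, one option is the `key` argument of `sort_values` (available in pandas 1.1 and later). A sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"title": ["xXx", "Zoolander", "eXistenZ", "Avatar"]})

# Default: case-sensitive, lowercase letters rank above uppercase ones
default_order = toy.sort_values("title", ascending=False)["title"].tolist()
print(default_order)

# key= transforms the values only for comparison; the data is unchanged
folded_order = toy.sort_values(
    "title", ascending=False, key=lambda s: s.str.lower()
)["title"].tolist()
print(folded_order)
```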


#### How will you get the list of movies directed by a particular director, say 'Christopher Nolan'?

In [None]:
data.loc[data['director_name'] == 'Christopher Nolan',['title']]

Unnamed: 0,title
3,The Dark Knight Rises
45,The Dark Knight
58,Interstellar
59,Inception
74,Batman Begins
565,Insomnia
641,The Prestige
1341,Memento


**Note:**
- The director's name could have been stored differently, for example with different capitalization or extra spaces, in which case an exact `==` comparison would miss those rows.
- A better way is to use string methods. We will discuss this later.
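
As a brief preview of that idea, here is a sketch using the `.str` accessor on a made-up frame (`toy` and its inconsistent spellings are hypothetical). It only illustrates how normalizing the text makes the match more forgiving.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director_name": ["Christopher Nolan", "christopher nolan ", "Sam Raimi"],
})

# Exact comparison misses the differently formatted name
exact = toy["director_name"] == "Christopher Nolan"
print(exact.tolist())

# Strip surrounding spaces and lowercase before comparing
normalized = toy["director_name"].str.strip().str.lower() == "christopher nolan"
print(normalized.tolist())

# Substring match that ignores case
contains_nolan = toy["director_name"].str.contains("nolan", case=False)
print(toy.loc[contains_nolan, ["title"]])
```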

---