## 1. Problem Statement

The dataset contains 1000 popular movies on IMDb released between 2006 and 2016.

This exploratory data analysis is an exercise in applying Python skills to a structured dataset: loading, inspecting, wrangling, and exploring the data, then drawing conclusions from it.

## 2. Data Loading and Description

This dataset includes 1000 observations across 12 columns. The data fields are:

| Column Name         | Description                                                                              |
| ------------------- |:----------------------------------------------------------------------------------------:|
| Rank                | Movie rank order                                                                         | 
| Title               | The title of the film                                                                    |  
| Genre               | A comma-separated list of genres used to classify the film                               | 
| Description         | Brief one-sentence movie summary                                                         |   
| Director            | The name of the film's director                                                          |
| Actors              | A comma-separated list of the main stars of the film                                     |
| Year                | The year that the film released as an integer.                                           |
| Runtime (Minutes)   | The duration of the film in minutes                                                      |
| Rating              | User rating for the movie 0-10                                                           |
| Votes               | Number of votes                                                                          |
| Revenue (Millions)  | Movie revenue in millions                                                                |
| Metascore           | An aggregated average of critic scores (0-100), higher scores represent positive reviews |

#### Importing Packages

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
sns.set()

from subprocess import check_output

#### Importing the dataset

In [39]:
movie_data = pd.read_csv(r"C:\Users\yjoshi\PythonCodes\DataSet\1000 movies data.csv")

In [11]:
movie_data.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


## 3. Data Profiling

#### Understanding the DataSet

In [13]:
movie_data.shape

(1000, 12)

In [16]:
movie_data.isna().sum()

Rank                    0
Title                   0
Genre                   0
Description             0
Director                0
Actors                  0
Year                    0
Runtime (Minutes)       0
Rating                  0
Votes                   0
Revenue (Millions)    128
Metascore              64
dtype: int64

In [19]:
movie_data.describe()

Unnamed: 0,Rank,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
count,1000.0,1000.0,1000.0,1000.0,1000.0,872.0,936.0
mean,500.5,2012.783,113.172,6.7232,169808.3,82.956376,58.985043
std,288.819436,3.205962,18.810908,0.945429,188762.6,103.25354,17.194757
min,1.0,2006.0,66.0,1.9,61.0,0.0,11.0
25%,250.75,2010.0,100.0,6.2,36309.0,13.27,47.0
50%,500.5,2014.0,111.0,6.8,110799.0,47.985,59.5
75%,750.25,2016.0,123.0,7.4,239909.8,113.715,72.0
max,1000.0,2016.0,191.0,9.0,1791916.0,936.63,100.0


In [20]:
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
Rank                  1000 non-null int64
Title                 1000 non-null object
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.8+ KB


From the above, we can see that __Revenue (Millions)__ and __Metascore__ are the only columns containing null values, with 128 and 64 nulls respectively. Both columns depend solely on a movie's performance and on viewers' choices, so we need to pick the replacement values with care.
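
The null counts can also be read as a share of rows, which makes the two gaps easier to compare. The following is a minimal sketch on a small hypothetical frame: `toy` and its values are made up for illustration, and on the real data `movie_data` would take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_data; the values are illustrative only.
toy = pd.DataFrame({
    "Rating": [8.1, 7.0, 7.3, 6.4],
    "Revenue (Millions)": [333.13, np.nan, 138.12, np.nan],
    "Metascore": [76.0, 65.0, np.nan, 71.0],
})

# isna() marks missing cells; mean() of booleans is the fraction missing.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

On the full dataset, the counts reported by `isna().sum()` correspond to 12.8% missing for Revenue (Millions) and 6.4% for Metascore.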

### Pre-Profiling

In [21]:
import pandas_profiling
profile = pandas_profiling.ProfileReport(movie_data)
profile.to_file(outputfile="movie_data_preprocessing.html")

ModuleNotFoundError: No module named 'pandas_profiling'

### Pre-Processing

Handling missing data:
1. Metascore is related to Rating, so we replace missing Metascore values with `Rating * 10`.
2. For missing Revenue (Millions) values, we need to explore further before deciding what to replace the nulls with.
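
For the second item, one candidate worth exploring is to fill a missing revenue with the median revenue of films released in the same year. This is only a sketch of that idea, not the approach this notebook settles on: the `toy` frame, its values, and the `Revenue_filled` column name are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_data; the values are illustrative only.
toy = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2016, 2016],
    "Revenue (Millions)": [100.0, 300.0, np.nan, 50.0, np.nan],
})

# Median revenue per release year, aligned row by row with the original frame.
# The median ignores missing values within each group.
year_median = toy.groupby("Year")["Revenue (Millions)"].transform("median")

# Keep the original column untouched and write the filled values to a new one.
toy["Revenue_filled"] = toy["Revenue (Millions)"].fillna(year_median)
print(toy)
```

Grouping by another column, such as the primary genre or the director, follows the same pattern; which grouping is appropriate depends on how much revenue actually varies within each group.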

In [88]:
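# Assign the result back instead of using inplace=True on a column selection,
# which may not update movie_data in recent pandas versions.
movie_data["Metascore"] = movie_data["Metascore"].fillna(movie_data["Rating"] * 10)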

In [23]:
movie_data.dtypes

Rank                    int64
Title                  object
Genre                  object
Description            object
Director               object
Actors                 object
Year                    int64
Runtime (Minutes)       int64
Rating                float64
Votes                   int64
Revenue (Millions)    float64
Metascore             float64
dtype: object

In [29]:
movie_data.Genre.unique().shape

(207,)

There are 207 unique values in Genre, but this number reflects combinations of genres rather than individual ones. Let us split the column into separate genres, treating the first genre listed as the film's primary genre.

#### Adding new columns to simplify data

In [40]:
movie_data[["Genre1", "Genre2", "Genre3"]] = movie_data["Genre"].str.split(",", expand=True)

Let's fill the null values in Genre2 with the Genre1 value, and the null values in Genre3 with the Genre2 value.

In [41]:
movie_data["Genre2"] = movie_data["Genre2"].fillna(movie_data["Genre1"])

In [43]:
movie_data["Genre3"] = movie_data["Genre3"].fillna(movie_data["Genre2"])
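
To make the effect of the split and the two fills concrete, here is a small sketch on a hypothetical three-row frame (`toy` and `parts` are illustrative names, not part of the notebook). It shows how films with fewer than three genres end up repeating their last listed genre.

```python
import pandas as pd

# Hypothetical stand-in for the Genre column; values are illustrative only.
toy = pd.DataFrame({
    "Genre": ["Action,Adventure,Sci-Fi", "Horror,Thriller", "Comedy"],
})

# expand=True returns one column per position; shorter lists are padded with missing values.
parts = toy["Genre"].str.split(",", expand=True)
parts.columns = ["Genre1", "Genre2", "Genre3"]

# Fill left to right: Genre2 falls back to Genre1, then Genre3 falls back to Genre2.
parts["Genre2"] = parts["Genre2"].fillna(parts["Genre1"])
parts["Genre3"] = parts["Genre3"].fillna(parts["Genre2"])
print(parts)
```

A two-genre film such as "Horror,Thriller" becomes Horror, Thriller, Thriller, and a single-genre film repeats its only genre in all three columns.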

In [48]:
print(movie_data.Genre1.unique().shape)
print(movie_data.Genre2.unique().shape)
print(movie_data.Genre3.unique().shape)

(13,)
(19,)
(19,)
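
The shapes above count distinct values per genre position. To count distinct genres regardless of position, one alternative (a sketch, using a hypothetical `toy` frame rather than the real data) is to split each string into a list and expand it to one row per genre:

```python
import pandas as pd

# Hypothetical stand-in for the Genre column; values are illustrative only.
toy = pd.DataFrame({
    "Genre": ["Action,Adventure,Sci-Fi", "Horror,Thriller", "Comedy", "Action,Comedy"],
})

# Split into lists, then explode so each genre gets its own row.
individual = toy["Genre"].str.split(",").explode()

# Number of distinct genres, and how often each one appears in any position.
print(individual.nunique())
print(individual.value_counts())
```

Unlike the Genre1 to Genre3 columns, this view does not involve filled-in values, so films with fewer than three genres are not counted more than once per genre.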


#### Two columns represent ratings: Metascore on a scale of 0-100 and Rating on a scale of 0-10. Let's create a new column holding the average of both on a scale of 0-10.

In [91]:
movie_data["Avg_Rating"] = (movie_data["Rating"]+movie_data["Metascore"]/10)/2

In [90]:
movie_data

Unnamed: 0,Rank,Title,Genre,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Genre1,Genre2,Genre3,Avg_Rating
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,Action,Adventure,Sci-Fi,7.85
1,2,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,Adventure,Mystery,Sci-Fi,6.75
2,3,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,Horror,Thriller,Thriller,6.75
3,4,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,Animation,Comedy,Family,6.55
4,5,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,Action,Adventure,Fantasy,5.10
5,6,The Great Wall,"Action,Adventure,Fantasy",Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0,Action,Adventure,Fantasy,5.15
6,7,La La Land,"Comedy,Drama,Music",Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0,Comedy,Drama,Music,8.80
7,8,Mindhorn,Comedy,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,,71.0,Comedy,Comedy,Comedy,6.75
8,9,The Lost City of Z,"Action,Adventure,Biography",James Gray,"Charlie Hunnam, Robert Pattinson, Sienna Mille...",2016,141,7.1,7188,8.01,78.0,Action,Adventure,Biography,7.45
9,10,Passengers,"Adventure,Drama,Romance",Morten Tyldum,"Jennifer Lawrence, Chris Pratt, Michael Sheen,...",2016,116,7.0,192177,100.01,41.0,Adventure,Drama,Romance,5.55
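
As a quick sanity check of the Avg_Rating formula, the same arithmetic can be reproduced by hand for a few rows of the table. The helper name `avg_rating` is hypothetical; the input values are taken from the Rating and Metascore columns shown above.

```python
def avg_rating(rating, metascore):
    """Average of Rating (0-10) and Metascore (0-100), both on a 0-10 scale."""
    return (rating + metascore / 10) / 2


# (Rating, Metascore) pairs copied from the table above.
rows = {
    "Guardians of the Galaxy": (8.1, 76.0),
    "La La Land": (8.3, 93.0),
    "Suicide Squad": (6.2, 40.0),
}

for title, (rating, metascore) in rows.items():
    print(title, round(avg_rating(rating, metascore), 2))
```

For example, Guardians of the Galaxy gives (8.1 + 76.0 / 10) / 2 = 7.85, which matches the Avg_Rating column.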


In [59]:
print(movie_data.Votes.min())
print(movie_data.Votes.max())

movie_data[movie_data.Votes>1000000].Votes.count()

61
1791916


6

In [None]:
movie_data = movie_data.drop(["Description"],axis=1)

In [100]:
movie_data.groupby("Director").filter(lambda x: len(x) >= 2).isnull().sum()

Rank                   0
Title                  0
Genre                  0
Director               0
Actors                 0
Year                   0
Runtime (Minutes)      0
Rating                 0
Votes                  0
Revenue (Millions)    26
Metascore              0
Genre1                 0
Genre2                 0
Genre3                 0
Avg_Rating             0
dtype: int64