# **Disney Movies Data Mining Project**

#  **Problem Definition**

The world-famous Walt Disney productions have had countless successes over the last century. Some movies, however, proved far more successful and profitable than others. Classifying a dataset of the major Disney movies released in theatres can help us identify key elements behind the success of some of these films. We assess a movie's success based on its profitability.




### **Dataset Description**

For this classification task, we are provided with a dataset of Disney movies. A sample row looks like `Academy Award Review of, Walt Disney Productions, United States, English, 41.0, , 45.472, 1937-05-19, 7.2, N/A`. For each movie, the dataset records the title, the production company, the country it was produced in, the release language, the running time, the budget and box office, and the release date. It also contains the imdb, metascore and rotten_tomatoes scores, the names of the directors and producers, the original material, the main cast, and the composers, distributors, cinematographers, editors and screenplay authors.

The Attributes are:

>* title
>* Production company
>* Country
>* Language
>* Running time
>* Budget
>* Box office
>* Release date
>* imdb
>* metascore
>* rotten_tomatoes
>* Directed by
>* Produced by
>* Based on
>* Starring
>* Music by
>* Distributed by
>* Cinematography
>* Edited by
>* Screenplay by

### **Tasks to be performed**


>* Importing Required Libraries
>* Analyzing the data
>* Extracting useful information and potential reasons for a success
>* Visualizing the pertinence of our findings



### **Importing Required Libraries**


In [1]:
import pandas as pd
import numpy as np

### Reading the file

In [2]:
dataset = pd.read_csv("disney_movies.csv")
dataset.head()

Unnamed: 0.1,Unnamed: 0,title,Production company,Country,Language,Running time,Budget,Box office,Release date,imdb,...,rotten_tomatoes,Directed by,Produced by,Based on,Starring,Music by,Distributed by,Cinematography,Edited by,Screenplay by
0,0,Academy Award Review of,Walt Disney Productions,United States,English,41.0,,45.472,1937-05-19,7.2,...,,,,,,,,,,
1,1,Snow White and the Seven Dwarfs,Walt Disney Productions,United States,English,83.0,1490000.0,418000000.0,1937-12-21,7.6,...,,"['David Hand (supervising)', 'William Cottrell...",Walt Disney,"['Snow White', 'by The', 'Brothers Grimm']","['Adriana Caselotti', 'Lucille La Verne', 'Har...","['Frank Churchill', 'Paul Smith', 'Leigh Harli...",RKO Radio Pictures,,,
2,2,Pinocchio,Walt Disney Productions,United States,English,88.0,2600000.0,164000000.0,1940-02-07,7.4,...,73%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...",Walt Disney,"['The Adventures of Pinocchio', 'by', 'Carlo C...","['Cliff Edwards', 'Dickie Jones', 'Christian R...","['Leigh Harline', 'Paul J. Smith']",RKO Radio Pictures,,,
3,3,Fantasia,Walt Disney Productions,United States,English,126.0,2280000.0,83300000.0,1940-11-13,7.7,...,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Walt Disney', 'Ben Sharpsteen']",,"['Leopold Stokowski', 'Deems Taylor']",See program,RKO Radio Pictures,James Wong Howe,,
4,4,The Reluctant Dragon,Walt Disney Productions,United States,English,74.0,600000.0,960000.0,1941-06-20,6.9,...,68%,"['Alfred Werker', '(live action)', 'Hamilton L...",Walt Disney,,"['Robert Benchley', 'Frances Gifford', 'Buddy ...","['Frank Churchill', 'Larry Morey']",RKO Radio Pictures,Bert Giennon,Paul Weatherwax,


## Data cleaning
In data cleaning, we check the dataset for redundant data, filter it down to the important information, and rename the columns to make them convenient to process.

#### 1. Checking for redundancy

In [3]:
dataset.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
439    False
440    False
441    False
442    False
443    False
Length: 444, dtype: bool

Every row displayed above is `False`, so no duplicated rows appear in the output shown.
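Because the printed Series is truncated to its first and last rows, a single aggregated number is easier to read than the full boolean column. The following is a minimal sketch on a toy frame (the `toy` frame and its repeated row are illustrative, not taken from the real file):

```python
import pandas as pd

# Toy frame standing in for the Disney dataset; the repeated row is made up.
toy = pd.DataFrame({
    "title": ["Pinocchio", "Fantasia", "Pinocchio"],
    "Budget": [2600000.0, 2280000.0, 2600000.0],
})

# duplicated() flags every row that exactly repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())
has_duplicates = bool(toy.duplicated().any())
print(n_duplicates, has_duplicates)

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()
print(len(deduplicated))
```

On the real data, `dataset.duplicated().sum()` returning 0 would confirm that no row is repeated.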

#### 2. Removing unnecessary columns from the dataset
Here we remove the columns that are irrelevant to our task, which is assessing a movie's success based on its profitability.
These columns are:
>* Unnamed column
>* Production company - since almost all movies are produced by the Walt Disney company, this column has no effect on assessing a movie's success
>* Language - almost all movies are in English
>* Country
>* Produced by
>* Based on
>* Music by
>* Distributed by
>* Cinematography
>* Edited by
>* Screenplay by

None of these columns is important for our task, so we can get rid of them.

In [4]:
# Dropping unnecessary columns
dataset.drop(columns=['Unnamed: 0', 'Production company', 'Language', 'Country', 'Produced by', 'Based on', 'Music by', 'Distributed by', 'Cinematography', 'Edited by', 'Screenplay by'], inplace=True)


In [5]:
dataset.head()

Unnamed: 0,title,Running time,Budget,Box office,Release date,imdb,metascore,rotten_tomatoes,Directed by,Starring
0,Academy Award Review of,41.0,,45.472,1937-05-19,7.2,,,,
1,Snow White and the Seven Dwarfs,83.0,1490000.0,418000000.0,1937-12-21,7.6,95.0,,"['David Hand (supervising)', 'William Cottrell...","['Adriana Caselotti', 'Lucille La Verne', 'Har..."
2,Pinocchio,88.0,2600000.0,164000000.0,1940-02-07,7.4,99.0,73%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...","['Cliff Edwards', 'Dickie Jones', 'Christian R..."
3,Fantasia,126.0,2280000.0,83300000.0,1940-11-13,7.7,96.0,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Leopold Stokowski', 'Deems Taylor']"
4,The Reluctant Dragon,74.0,600000.0,960000.0,1941-06-20,6.9,,68%,"['Alfred Werker', '(live action)', 'Hamilton L...","['Robert Benchley', 'Frances Gifford', 'Buddy ..."


In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 444 entries, 0 to 443
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   title            444 non-null    object 
 1   Running time     430 non-null    float64
 2   Budget           278 non-null    float64
 3   Box office       358 non-null    float64
 4   Release date     435 non-null    object 
 5   imdb             424 non-null    float64
 6   metascore        293 non-null    float64
 7   rotten_tomatoes  391 non-null    object 
 8   Directed by      443 non-null    object 
 9   Starring         409 non-null    object 
dtypes: float64(5), object(5)
memory usage: 34.8+ KB


#### Renaming columns for convenience.

In [7]:
dataset = dataset.rename(columns={'Running time': 'Running_time',
                       'Box office': 'Box_office',
                       'Release date': 'Release_date',
                       'Directed by': 'Directed_by'})
dataset.head()

Unnamed: 0,title,Running_time,Budget,Box_office,Release_date,imdb,metascore,rotten_tomatoes,Directed_by,Starring
0,Academy Award Review of,41.0,,45.472,1937-05-19,7.2,,,,
1,Snow White and the Seven Dwarfs,83.0,1490000.0,418000000.0,1937-12-21,7.6,95.0,,"['David Hand (supervising)', 'William Cottrell...","['Adriana Caselotti', 'Lucille La Verne', 'Har..."
2,Pinocchio,88.0,2600000.0,164000000.0,1940-02-07,7.4,99.0,73%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...","['Cliff Edwards', 'Dickie Jones', 'Christian R..."
3,Fantasia,126.0,2280000.0,83300000.0,1940-11-13,7.7,96.0,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Leopold Stokowski', 'Deems Taylor']"
4,The Reluctant Dragon,74.0,600000.0,960000.0,1941-06-20,6.9,,68%,"['Alfred Werker', '(live action)', 'Hamilton L...","['Robert Benchley', 'Frances Gifford', 'Buddy ..."


### Extracting the two main actors.
The Starring column lists many actors per movie, but we are only interested in the main actor and the second main actor.
Likewise, some movies have multiple directors and some have only one. For movies with multiple directors we keep only the first-listed (main) director, so that every movie has exactly one director.

In [8]:
def clean_name(raw):
    """Strip list brackets, quotes and surrounding whitespace from one name."""
    for ch in "'[]\"":
        raw = raw.replace(ch, "")
    return raw.strip()

# Fill NA/NaN values with "missing"; numeric columns that contain gaps become object dtype
dataset.fillna('missing', inplace=True)

main_actor = []
second_actor = []
director = []
for starring, directed_by in zip(dataset['Starring'], dataset['Directed_by']):
    # Extracting the main actor and the second actor
    actors = starring.split(",", 2)
    main_actor.append(clean_name(actors[0]))
    second_actor.append(clean_name(actors[1]) if len(actors) > 1 else 'missing')
    # Extracting the main director
    director.append(clean_name(directed_by.split(",", 1)[0]))

dataset['Main_actor'] = main_actor
dataset['Second_actor'] = second_actor
dataset['Director'] = director

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 444 entries, 0 to 443
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            444 non-null    object
 1   Running_time     444 non-null    object
 2   Budget           444 non-null    object
 3   Box_office       444 non-null    object
 4   Release_date     444 non-null    object
 5   imdb             444 non-null    object
 6   metascore        444 non-null    object
 7   rotten_tomatoes  444 non-null    object
 8   Directed_by      444 non-null    object
 9   Starring         444 non-null    object
 10  Main_actor       444 non-null    object
 11  Second_actor     444 non-null    object
 12  Director         444 non-null    object
dtypes: object(13)
memory usage: 45.2+ KB


In [9]:
dataset.head()

Unnamed: 0,title,Running_time,Budget,Box_office,Release_date,imdb,metascore,rotten_tomatoes,Directed_by,Starring,Main_actor,Second_actor,Director
0,Academy Award Review of,41.0,missing,45.472,1937-05-19,7.2,missing,missing,missing,missing,missing,missing,missing
1,Snow White and the Seven Dwarfs,83.0,1490000.0,418000000.0,1937-12-21,7.6,95.0,missing,"['David Hand (supervising)', 'William Cottrell...","['Adriana Caselotti', 'Lucille La Verne', 'Har...",Adriana Caselotti,Lucille La Verne,David Hand (supervising)
2,Pinocchio,88.0,2600000.0,164000000.0,1940-02-07,7.4,99.0,73%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...","['Cliff Edwards', 'Dickie Jones', 'Christian R...",Cliff Edwards,Dickie Jones,Ben Sharpsteen
3,Fantasia,126.0,2280000.0,83300000.0,1940-11-13,7.7,96.0,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Leopold Stokowski', 'Deems Taylor']",Leopold Stokowski,Deems Taylor,Samuel Armstrong
4,The Reluctant Dragon,74.0,600000.0,960000.0,1941-06-20,6.9,missing,68%,"['Alfred Werker', '(live action)', 'Hamilton L...","['Robert Benchley', 'Frances Gifford', 'Buddy ...",Robert Benchley,Frances Gifford,Alfred Werker
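Splitting on commas works for most rows but breaks if a name itself contains a comma. Since the cells look like stringified Python lists, one possible alternative is to parse them with `ast.literal_eval`. The helper below is a hypothetical sketch (the name `first_names` and its parameters are not part of the notebook), and it assumes cells are either list-like strings, plain names, or missing values:

```python
import ast

def first_names(cell, n=2, missing="missing"):
    """Return the first n names from a cell, padded with `missing`.

    `cell` may be a stringified list such as "['A', 'B']", a plain name
    such as "Walt Disney", or a missing value (None / NaN).
    """
    names = []
    if isinstance(cell, str):
        try:
            parsed = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            parsed = cell  # not a Python literal: treat it as a single name
        if isinstance(parsed, (list, tuple)):
            names = [str(p).strip() for p in parsed]
        else:
            names = [str(parsed).strip()]
    # Discard empty entries and the placeholder itself before padding.
    names = [name for name in names if name and name != missing]
    return (names + [missing] * n)[:n]

print(first_names("['Leopold Stokowski', 'Deems Taylor']"))
print(first_names("Walt Disney", n=1))
print(first_names(float("nan")))
```

This only addresses parsing. Placeholder entries that come from the source data, such as "see below", would still need separate handling.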


#### Removing Directed_by and Starring columns
Since we have extracted the main actor, the second main actor and the main director, we no longer need the two source columns, i.e. Directed_by and Starring.


In [10]:
dataset.drop(columns=['Directed_by', 'Starring'], inplace=True)
dataset.head()

Unnamed: 0,title,Running_time,Budget,Box_office,Release_date,imdb,metascore,rotten_tomatoes,Main_actor,Second_actor,Director
0,Academy Award Review of,41.0,missing,45.472,1937-05-19,7.2,missing,missing,missing,missing,missing
1,Snow White and the Seven Dwarfs,83.0,1490000.0,418000000.0,1937-12-21,7.6,95.0,missing,Adriana Caselotti,Lucille La Verne,David Hand (supervising)
2,Pinocchio,88.0,2600000.0,164000000.0,1940-02-07,7.4,99.0,73%,Cliff Edwards,Dickie Jones,Ben Sharpsteen
3,Fantasia,126.0,2280000.0,83300000.0,1940-11-13,7.7,96.0,95%,Leopold Stokowski,Deems Taylor,Samuel Armstrong
4,The Reluctant Dragon,74.0,600000.0,960000.0,1941-06-20,6.9,missing,68%,Robert Benchley,Frances Gifford,Alfred Werker


### Removing rows with missing Data
For our task, the important columns are "Budget", "Box_office", "imdb", "metascore" and "rotten_tomatoes".
As we can see from the dataset, some rows are missing values in these columns, so we need to get rid of such rows.

In [11]:
# Set display.max_rows to None so that every row of the dataset is displayed
pd.set_option('display.max_rows', None)

In [12]:
# Keep only the rows where none of the required columns holds the 'missing' placeholder
required = ['Box_office', 'Budget', 'imdb', 'metascore', 'rotten_tomatoes']
dataset = dataset[~dataset[required].eq('missing').any(axis=1)]

dataset


Unnamed: 0,title,Running_time,Budget,Box_office,Release_date,imdb,metascore,rotten_tomatoes,Main_actor,Second_actor,Director
2,Pinocchio,88.0,2600000.0,164000000.0,1940-02-07,7.4,99.0,73%,Cliff Edwards,Dickie Jones,Ben Sharpsteen
3,Fantasia,126.0,2280000.0,83300000.0,1940-11-13,7.7,96.0,95%,Leopold Stokowski,Deems Taylor,Samuel Armstrong
5,Dumbo,64.0,950000.0,1300000.0,1941-10-23,7.2,96.0,98%,Edward Brophy,Herman Bing,Ben Sharpsteen
6,Bambi,70.0,858000.0,267400000.0,1942-08-09,7.3,91.0,90%,see below,missing,Supervising director
10,Make Mine Music,75.0,1350000.0,3275000.0,1946-04-20,6.3,60.0,70%,Nelson Eddy,missing,Jack Kinney
11,Song of the South,94.0,2125000.0,65000000.0,1946-11-12,7.1,54.0,50%,James Baskett,Bobby Driscoll,Live action:
13,Melody Time,75.0,1500000.0,2560000.0,1948-05-27,6.3,69.0,80%,Roy Rogers,Trigger,Jack Kinney
16,Cinderella,74.0,2900000.0,263600000.0,1950-02-15,6.9,67.0,83%,Ilene Woods,Eleanor Audley,Clyde Geronimi
18,Alice in Wonderland,75.0,3000000.0,5600000.0,1951-07-26,6.4,53.0,51%,Kathryn Beaumont,Ed Wynn,Clyde Geronimi
20,Peter Pan,77.0,4000000.0,87400000.0,1953-02-05,7.3,76.0,81%,Bobby Driscoll,Kathryn Beaumont,Clyde Geronimi


In [13]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 223 entries, 2 to 443
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            223 non-null    object
 1   Running_time     223 non-null    object
 2   Budget           223 non-null    object
 3   Box_office       223 non-null    object
 4   Release_date     223 non-null    object
 5   imdb             223 non-null    object
 6   metascore        223 non-null    object
 7   rotten_tomatoes  223 non-null    object
 8   Main_actor       223 non-null    object
 9   Second_actor     223 non-null    object
 10  Director         223 non-null    object
dtypes: object(11)
memory usage: 29.0+ KB


After removing the rows with missing data in the columns mentioned above, only 223 of the 444 rows are left.
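The same filtering can also be done with `dropna`, provided the gaps are still `NaN` rather than the "missing" placeholder. That is an assumption which differs from the notebook, where every column is filled with "missing" first. A minimal sketch, using three rows shown earlier:

```python
import numpy as np
import pandas as pd

# Three rows copied from the notebook's output, with gaps kept as NaN.
toy = pd.DataFrame({
    "title": ["Academy Award Review of", "Pinocchio", "The Reluctant Dragon"],
    "Budget": [np.nan, 2600000.0, 600000.0],
    "Box_office": [45.472, 164000000.0, 960000.0],
    "imdb": [7.2, 7.4, 6.9],
    "metascore": [np.nan, 99.0, np.nan],
    "rotten_tomatoes": [np.nan, "73%", "68%"],
})

required = ["Budget", "Box_office", "imdb", "metascore", "rotten_tomatoes"]

# Keep only rows where every required column is present, then renumber the index.
complete = toy.dropna(subset=required).reset_index(drop=True)

print(complete["title"].tolist())
print(complete["Budget"].dtype)
```

With this approach the numeric columns stay `float64`, so no later type conversion is needed.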

### Data Transformation
###### Calculating profit and Adding category column to dataset
>* Profit = Box_office-Budget
###### The category
>* If Profit >= 2.5*Budget, then the movie is categorized as Blockbuster
>* If Budget < Profit < 2.5*Budget, then the movie is categorized as Hit
>* If Profit <= Budget, then the movie is categorized as Flop
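As a worked example, the rules above can be applied to three movies whose figures appear in the earlier output. This sketch uses `numpy.select` rather than a row-by-row loop. The `movies` frame is illustrative, and a profit exactly equal to the budget falls through to "Flop" via the default:

```python
import numpy as np
import pandas as pd

# Budget and box-office figures copied from rows shown earlier in the notebook.
movies = pd.DataFrame({
    "title": ["Pinocchio", "Dumbo", "Make Mine Music"],
    "Budget": [2600000.0, 950000.0, 1350000.0],
    "Box_office": [164000000.0, 1300000.0, 3275000.0],
})

profit = movies["Box_office"] - movies["Budget"]

# Conditions are checked in order and the first match wins, so "Hit" only
# applies to rows that did not already qualify as "Blockbuster".
conditions = [profit >= 2.5 * movies["Budget"], profit > movies["Budget"]]
labels = ["Blockbuster", "Hit"]
movies["Category"] = np.select(conditions, labels, default="Flop")

print(movies[["title", "Category"]])
```

Pinocchio's profit of 161,400,000 far exceeds 2.5 times its budget, Dumbo's profit of 350,000 is below its budget of 950,000, and Make Mine Music's profit of 1,925,000 lies between its budget and 2.5 times its budget.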

In [14]:
# Converting Box_office, Budget, imdb and metascore to float
dataset = dataset.astype({'Box_office':'float64', 'Budget': 'float64', 'imdb':'float64', 'metascore':'float64'})
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 223 entries, 2 to 443
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   title            223 non-null    object 
 1   Running_time     223 non-null    object 
 2   Budget           223 non-null    float64
 3   Box_office       223 non-null    float64
 4   Release_date     223 non-null    object 
 5   imdb             223 non-null    float64
 6   metascore        223 non-null    float64
 7   rotten_tomatoes  223 non-null    object 
 8   Main_actor       223 non-null    object 
 9   Second_actor     223 non-null    object 
 10  Director         223 non-null    object 
dtypes: float64(4), object(7)
memory usage: 29.0+ KB
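The `rotten_tomatoes` column remains of type `object` because its values are strings such as "73%". If a numeric score is wanted later, one possible conversion is sketched below. The `scores` Series is a toy example, and treating unparseable entries as `NaN` is an assumption:

```python
import pandas as pd

# Two percentage strings from the notebook's output plus the placeholder value.
scores = pd.Series(["73%", "95%", "missing"], name="rotten_tomatoes")

# Remove the trailing percent sign, then convert; errors="coerce" turns
# anything that is not a number into NaN instead of raising.
numeric = pd.to_numeric(scores.str.rstrip("%"), errors="coerce")
print(numeric)
```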


In [15]:
# Round-trip through CSV: this resets the index to 0..222 and lets pandas
# re-infer numeric dtypes (Running_time becomes float64). The old index is
# read back as an 'Unnamed: 0' column, which is dropped below.
dataset.to_csv('movie_dataset.csv')
dataset = pd.read_csv('movie_dataset.csv')

dataset['Profit'] = dataset['Box_office'] - dataset['Budget']

# Label each movie by comparing its profit with its budget
category = []
for i in range(len(dataset)):
    if dataset.Profit[i] >= (2.5*dataset.Budget[i]):
        category.append('Blockbuster')
    elif dataset.Profit[i] > dataset.Budget[i]:
        category.append('Hit')
    else:
        category.append('Flop')

dataset['Category'] = category
# Profit was only needed to derive Category
dataset.drop(['Unnamed: 0','Profit'], inplace=True, axis=1)

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223 entries, 0 to 222
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   title            223 non-null    object 
 1   Running_time     223 non-null    float64
 2   Budget           223 non-null    float64
 3   Box_office       223 non-null    float64
 4   Release_date     223 non-null    object 
 5   imdb             223 non-null    float64
 6   metascore        223 non-null    float64
 7   rotten_tomatoes  223 non-null    object 
 8   Main_actor       223 non-null    object 
 9   Second_actor     223 non-null    object 
 10  Director         223 non-null    object 
 11  Category         223 non-null    object 
dtypes: float64(5), object(7)
memory usage: 21.0+ KB


In [16]:
dataset.head()

Unnamed: 0,title,Running_time,Budget,Box_office,Release_date,imdb,metascore,rotten_tomatoes,Main_actor,Second_actor,Director,Category
0,Pinocchio,88.0,2600000.0,164000000.0,1940-02-07,7.4,99.0,73%,Cliff Edwards,Dickie Jones,Ben Sharpsteen,Blockbuster
1,Fantasia,126.0,2280000.0,83300000.0,1940-11-13,7.7,96.0,95%,Leopold Stokowski,Deems Taylor,Samuel Armstrong,Blockbuster
2,Dumbo,64.0,950000.0,1300000.0,1941-10-23,7.2,96.0,98%,Edward Brophy,Herman Bing,Ben Sharpsteen,Flop
3,Bambi,70.0,858000.0,267400000.0,1942-08-09,7.3,91.0,90%,see below,missing,Supervising director,Blockbuster
4,Make Mine Music,75.0,1350000.0,3275000.0,1946-04-20,6.3,60.0,70%,Nelson Eddy,missing,Jack Kinney,Hit


In [17]:
dataset.to_csv('final_movie_dataset.csv', index=False)