<a href="https://colab.research.google.com/github/python-noobtopro/data_science_made_fun/blob/main/Disney_allmovie_data_final_cleaned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***********************************************************************

## Task: collect movie info for all Disney movies (nearly 500) by scraping Wikipedia, and save it locally as a JSON file, a CSV file, and a pickle file
### We are also interested in data cleaning, so that the data is easy to analyse in future projects:

1.   Converting date string to datetime object
2.   Removing unwanted elements such as reference markers ([2], [3], etc.)
3.   Converting all analyzable data, such as budget and box office collection, into numerical values (float, int)


### Finally we are interested in saving our data in three formats:


1.   JSON
2.   pickle
3.   csv (using pandas dataframe)





*************************************************************************

**********************************************************************

Libraries/modules used:


*   BeautifulSoup  (Web scraping)
*   requests       (Request/Response cycle)
*   urllib         (URL-encoding the query parameters of API calls)
*   datetime       (Converting date strings to datetime object)
*   pickle         (saving the data)
*   pandas         (saving data as csv)
*   json           (saving data) 


Concepts used:


*   Http/Https protocols
*   Request/Response cycle
*   RESTful APIs and Authorization
*   Web Scraping (more of HTML parsing)
*   Handling JSON objects
*   Data Cleaning
*   Saving data as a pandas DataFrame


******************************************************************

### We start by collecting all the movie links found on the master web page 'https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films', then parse each linked page down to the required HTML/CSS tags and build a JSON-ready dictionary for each movie

In [None]:
import requests
from bs4 import BeautifulSoup as bs

In [None]:
base_url = "https://en.wikipedia.org"
master_url = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"

link = requests.get(master_url)
soup = bs(link.content, 'html.parser')       # making soup object using BeautifulSoup


In [None]:
# Digging to the href link of each movie title

movies_link = soup.select(".wikitable.sortable i a")

In [None]:
# Defining the get_info function to get the info for each movie from the loaded movie webpage
# clear_tags removes reference markers (<sup>) and <span> elements before the text is read



def clear_tags(soup):
    for tags in soup.find_all(["sup", "span"]):
        tags.decompose()
        
        
def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    elif row_data.find("br"):
        # stripped_strings yields each text fragment between tags, already stripped of whitespace
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

def get_info(new_url):
    movie_info = {}
    
    link = requests.get(new_url)
    soup = bs(link.content, 'html.parser')
    search = soup.find(class_="infobox vevent")
    clear_tags(soup)
    tr_tag = search.find_all("tr")
    for index, tags in enumerate(tr_tag):
        if index == 0:
            movie_info['Title'] = tags.find("th").get_text()
        elif index == 1:
            continue
        else:
            keys = tags.find("th").get_text(" ", strip=True)     
            values = get_content_value(tags.find("td"))
            movie_info[keys] = values
    return movie_info
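`clear_tags` drops `<sup>` elements before the text is read, which removes most citation markers. As a fallback for markers that survive as plain text (such as the `[1]` in the sample string in the date-conversion cell), a regular expression can strip them. This is a minimal sketch; `strip_reference_markers` is a hypothetical helper and not part of the notebook.

```python
import re

# Matches citation markers such as "[1]" or "[ 23 ]" plus any whitespace before them
REFERENCE_MARKER = re.compile(r"\s*\[\s*\d+\s*\]")

def strip_reference_markers(text):
    """Remove Wikipedia-style numeric citation markers from a string."""
    return REFERENCE_MARKER.sub("", text).strip()

print(strip_reference_markers("27 February 1954 ( US ) [1]"))
```

The helper could be applied to each string returned by `get_content_value` if stray markers show up in the scraped values.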


In [None]:
## Using get_info function on each link and appending it to master data

master_data = []
for index, links in enumerate(movies_link):         # We defined movies_link above
    if index%10 == 0:                         
        print('Extracted', index+1, 'of', len(movies_link))    # To check the progress
    try:
        relative_url = links['href']
        movie_title = links['title']
        
        new_url = base_url + relative_url
        master_data.append(get_info(new_url))
        
    except Exception as e:
        print(links.get_text())
        print(e) 
    

Extracted 1 of 510
Extracted 11 of 510
Extracted 21 of 510
Extracted 31 of 510
Extracted 41 of 510
Sleeping Beauty
'NoneType' object has no attribute 'get_text'
Zorro the Avenger
'NoneType' object has no attribute 'find'
The Sign of Zorro
'NoneType' object has no attribute 'find'
Extracted 51 of 510
Extracted 61 of 510
Extracted 71 of 510
Extracted 81 of 510
Extracted 91 of 510
Extracted 101 of 510
Extracted 111 of 510
Extracted 121 of 510
True-Life Adventures
'NoneType' object has no attribute 'find_all'
Extracted 131 of 510
Extracted 141 of 510
The London Connection
'NoneType' object has no attribute 'find'
Extracted 151 of 510
Extracted 161 of 510
Extracted 171 of 510
Extracted 181 of 510
Extracted 191 of 510
Extracted 201 of 510
Extracted 211 of 510
Extracted 221 of 510
Extracted 231 of 510
Extracted 241 of 510
Extracted 251 of 510
Extracted 261 of 510
Spirited Away
'NoneType' object has no attribute 'get_text'
Extracted 271 of 510
Extracted 281 of 510
Extracted 291 of 510
Extracte

In [None]:
len(master_data)   # Checking that everything executed well

491

**Out of 510 movie links we have successfully scraped the movie information of 491 movies (about 96%).**


*The failures visible in the log above are `NoneType` errors, which suggests that those pages either have no `infobox vevent` table or contain infobox rows without a `<th>` header.*
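To inspect the failed links later instead of only printing them, the scraping loop could collect them in a list. The sketch below assumes a hypothetical `scrape_all` helper and uses a stand-in `fake_fetch` in place of `get_info`, so that it runs without network access.

```python
def scrape_all(urls, fetch):
    """Apply `fetch` to every URL and return (results, failures)."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:          # keep going, but remember what failed
            failures.append((url, repr(exc)))
    return results, failures


# Stand-in for get_info (hypothetical), so the sketch needs no network access
def fake_fetch(url):
    if "broken" in url:
        raise AttributeError("'NoneType' object has no attribute 'find'")
    return {"Title": url.rsplit("/", 1)[-1]}

results, failures = scrape_all(["/wiki/Bambi", "/wiki/broken_page"], fake_fetch)
print(len(results), "scraped;", len(failures), "failed")
```

With the real `get_info` passed as `fetch`, the `failures` list would hold every URL that raised, together with its error message.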

In [None]:
# Creating save method

import json

def saveas_json(filename, data):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

In [None]:
# Creating reload method

import json

def reload_json(filename):
    with open(filename, encoding='utf-8') as data_file:
        data_loaded = json.load(data_file)
    return data_loaded

In [None]:
# Sample execution of save method

#saveas_json('disney_allmovie_data_sample.json', master_data)

In [None]:
# Sample Reloading file

# reload_json('disney_allmovie_data_sample.json')
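A quick round trip shows why `saveas_json` passes `ensure_ascii=False` and opens the file as UTF-8: accented names are then written as readable text rather than as `\u` escapes. This sketch repeats the same `json.dump` call on a one-record sample and writes to a temporary folder (an assumption made so that it leaves no file behind).

```python
import json
import os
import tempfile

sample = [{"Directed by": "Domee Shi", "Music by": "Ludwig Göransson"}]

# Write to a temporary folder so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()
    reloaded = json.loads(raw_text)

print("Göransson" in raw_text)   # the accented name is stored as-is in the file
print(reloaded == sample)        # the reloaded data equals what was saved
```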

## Cleaning data

*Now that we have movie data for 491 movies, we can go on to clean it so that it is easy to analyze*

*************************************************************
*  Convert the date strings into datetime objects

*For example, a release date in the format ['May 19, 1937'] becomes a date object*

*   Convert running-time strings such as '142 minutes' to an integer number of minutes

*   Convert all Budget and Box office values to a numerical (float) value

*e.g., '$418 million', '$164 million', '$76.4–$83.3 million (United States and Canada)'*

*************************************************************************

In [None]:
from datetime import datetime
import dateutil.parser

In [None]:
# Checking different types of date string that we can encounter

print([items.get('Release date', 'N/A') for items in master_data])

[['May 19, 1937'], ['December 21, 1937 ( Carthay Circle Theatre )'], ['February 7, 1940 ( Center Theatre )', 'February 23, 1940 (United States)'], ['November 13, 1940'], ['June 27, 1941'], ['October 23, 1941 (New York City)', 'October 31, 1941 (U.S.)'], ['August 9, 1942 (World Premiere – London)', 'August 13, 1942 (Premiere – New York City)', 'August 21, 1942 (U.S.)'], ['August 24, 1942 (World Premiere – Rio de Janeiro)', 'February 6, 1943 (U.S. Premiere – Boston)', 'February 19, 1943 (U.S.)'], ['July 17, 1943'], ['December 21, 1944 (Mexico City)', 'February 3, 1945 (US)'], ['April 20, 1946 (New York City premiere)', 'August 15, 1946 (U.S.)'], ['November 12, 1946 (Premiere: Atlanta, Georgia)', 'November 20, 1946', 'March 30, 1947 (Stanford Theatre, Palo Alto, California)'], ['September 27, 1947'], 'May 27, 1948', ['November 29, 1948 (Chicago, Illinois)', 'January 19, 1949 (Indianapolis, Indiana)'], ['October 5, 1949'], ['February 15, 1950 (Boston)', 'March 4, 1950 (United States)'], ['

This code snippet takes the first release date listed, drops any parenthesised note such as a location, and converts the remaining date string to a date object

In [None]:
#str_date = ["26 October 1953 ( 1953-10-26 ) (Premiere- London ) [1]", "27 February 1954 ( 1954-02-27 ) ( US ) [1]"]

def date_convert(str_date):
    if str_date == 'N/A':
        return None

    # A list holds several release dates; we keep only the first one
    date = str_date[0] if isinstance(str_date, list) else str_date
    split_date = date.split("(")[0].strip()            # Dropping the "(location)" note
    return dateutil.parser.parse(split_date).date()    # Creating python date object
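For readers who prefer not to depend on `dateutil`, the same idea can be sketched with the standard library only. The helper name `parse_release_date` and the two-format list are assumptions based on the date strings printed above (month-first such as 'May 19, 1937' and day-first such as '14 July 2017'); other formats would return `None`.

```python
from datetime import datetime

# Formats seen in the 'Release date' values (month-first and day-first)
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y")

def parse_release_date(text):
    """Return a date for the part of `text` before any '(', or None if no format fits."""
    cleaned = text.split("(")[0].strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue                      # try the next format
    return None

print(parse_release_date("December 21, 1937 ( Carthay Circle Theatre )"))
print(parse_release_date("26 October 1953 ( 1953-10-26 ) (Premiere- London ) [1]"))
```

Unlike `dateutil.parser.parse`, this version only accepts the listed formats, which makes unexpected inputs easier to spot.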

This snippet adds a new key-value pair, 'Release date (Standard)', holding the date object to every movie dictionary in the master_data list

In [None]:
for dictionary in master_data:      #Selecting a dictionary from master data and adding a new key to the dictionary
    dictionary['Release date (Standard)'] = date_convert(dictionary.get('Release date', 'N/A'))

In [None]:
# Check

print([items.get('Release date (Standard)') for items in master_data])

[datetime.date(1937, 5, 19), datetime.date(1937, 12, 21), datetime.date(1940, 2, 7), datetime.date(1940, 11, 13), datetime.date(1941, 6, 27), datetime.date(1941, 10, 23), datetime.date(1942, 8, 9), datetime.date(1942, 8, 24), datetime.date(1943, 7, 17), datetime.date(1944, 12, 21), datetime.date(1946, 4, 20), datetime.date(1946, 11, 12), datetime.date(1947, 9, 27), datetime.date(1948, 5, 27), datetime.date(1948, 11, 29), datetime.date(1949, 10, 5), datetime.date(1950, 2, 15), datetime.date(1950, 6, 22), datetime.date(1951, 7, 26), datetime.date(1952, 3, 13), datetime.date(1953, 2, 5), datetime.date(1953, 7, 23), datetime.date(1953, 11, 10), datetime.date(1953, 10, 26), datetime.date(1954, 8, 17), datetime.date(1954, 12, 23), datetime.date(1955, 5, 25), datetime.date(1955, 6, 22), datetime.date(1955, 9, 14), datetime.date(1955, 12, 22), datetime.date(1956, 6, 8), datetime.date(1956, 7, 18), datetime.date(1956, 9, 4), datetime.date(1956, 12, 20), datetime.date(1957, 6, 19), datetime.date

In [None]:
# Checking the different forms of 'Running time' strings that we can encounter

print([items.get('Running time', 'N/A') for items in master_data])

['41 minutes (74 minutes 1966 release)', '83 minutes', '88 minutes', '126 minutes', '74 minutes', '64 minutes', '70 minutes', '42 minutes', '70 min', '71 minutes', '75 minutes', '94 minutes', '73 minutes', '75 minutes', '82 minutes', '68 minutes', '74 minutes', '96 minutes', '75 minutes', '84 minutes', '77 minutes', '92 minutes', '69 minutes', '81 minutes', ['60 minutes (VHS version)', '71 minutes (original)'], '127 minutes', '92 minutes', '76 minutes', '75 minutes', '73 minutes', '85 minutes', '81 minutes', '70 minutes', '90 min.', '80 minutes', '75 minutes', '83 minutes', '83 minutes', '72 minutes', '97 minutes', '104 minutes', '93 minutes', '105 minutes', '95 minutes', '97 minutes', '134 minutes', '69 minutes', '92 minutes', '126 minutes', '79 minutes', '97 minutes', '128 minutes', '73 minutes', '91 minutes', '105 minutes', '98 minutes', '130 minutes', '89 min.', '93 minutes', '67 minutes', '98 minutes', '100 minutes', '118 minutes', '103 minutes', '110 minutes', '80 min.', '79 minu

This snippet extracts the leading number from the 'Running time' value of each movie dictionary and converts it to an integer number of minutes

In [None]:
def int_time_convert(raw_time):
    if raw_time == 'N/A':
        return None
    elif isinstance(raw_time, list):
        return int(raw_time[0].split(" ")[0])      # Grabbing just the digit
    else:
        return int(raw_time.split(" ")[0])
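`int_time_convert` relies on the value starting with a number followed by a space. A slightly more tolerant variant, sketched here with a regular expression, picks the first run of digits wherever it appears. `minutes_from_text` is a hypothetical name, and joining list values into one string is an assumption.

```python
import re

def minutes_from_text(raw_time):
    """Return the first integer found in a running-time value, or None."""
    if isinstance(raw_time, list):
        raw_time = " ".join(raw_time)     # treat a list as one long string
    match = re.search(r"\d+", raw_time or "")
    return int(match.group()) if match else None

print(minutes_from_text("90 min."))
print(minutes_from_text(["60 minutes (VHS version)", "71 minutes (original)"]))
print(minutes_from_text("N/A"))
```

Because a value without digits yields `None` instead of raising, the 'N/A' default needs no special case.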

This snippet adds a new key-value pair, 'Running time (int)', holding the integer running time to every movie dictionary in the master_data list

In [None]:
for dictionary in master_data:
    dictionary['Running time (int)'] = int_time_convert(dictionary.get('Running time', 'N/A'))
  

In [None]:
# Check

print([items.get('Running time (int)', 'N/A') for items in master_data])

[41, 83, 88, 126, 74, 64, 70, 42, 70, 71, 75, 94, 73, 75, 82, 68, 74, 96, 75, 84, 77, 92, 69, 81, 60, 127, 92, 76, 75, 73, 85, 81, 70, 90, 80, 75, 83, 83, 72, 97, 104, 93, 105, 95, 97, 134, 69, 92, 126, 79, 97, 128, 73, 91, 105, 98, 130, 89, 93, 67, 98, 100, 118, 103, 110, 80, 79, 91, 91, 97, 118, 139, 131, 92, 87, 116, 93, 110, 110, 131, 101, 108, 84, 78, 75, 164, 106, 110, 99, 113, 108, 112, 93, 91, 93, 100, 100, 79, 96, 113, 89, 118, 92, 88, 92, 87, 93, 93, 93, 90, 83, 96, 88, 89, 91, 93, 92, 97, 100, 100, 89, 91, 112, 115, 95, 91, 97, 104, 74, 48, 77, 104, 128, 101, 94, 104, 90, 100, 88, 93, 98, 112, 84, 97, 97, 114, 96, 97, 109, 83, 90, 107, 96, 103, 91, 95, 105, 113, 80, 101, 90, 74, 90, 89, 110, 74, 93, 84, 83, 74, 77, 107, 93, 88, 108, 84, 121, 89, 104, 90, 86, 84, 108, 107, 96, 98, 105, 108, 94, 106, 102, 88, 102, 102, 97, 111, 100, 96, 98, 78, 81, 108, 89, 99, 89, 81, 92, 100, 89, 79, 91, 101, 104, 103, 86, 105, 75, 93, 92, 98, 95, 93, 87, 93, 87, 128, 77, 86, 95, 114, 93, 83

In [None]:
# Checking different forms of strings

print([items.get("Budget", 'N/A') for items in master_data])


['N/A', '$1.49 million', '$2.6 million', '$2.28 million', '$600,000', '$950,000', '$858,000', 'N/A', '$788,000', 'N/A', '$1.35 million', '$2.125 million', 'N/A', '$1.5 million', '$1.5 million', 'N/A', '$2.2 million', '$1,800,000', '$3 million', 'N/A', '$4 million', '$2 million', '$300,000', '$1.8 million', 'N/A', '$5 million', 'N/A', '$4 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$700,000', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'under $1 million or $1,250,000', 'N/A', '$2 million', 'N/A', 'N/A', '$2.5 million', 'N/A', 'N/A', '$4 million', '$3.6 million', 'N/A', 'N/A', 'N/A', 'N/A', '$3 million', 'N/A', '$3 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$3 million', 'N/A', 'N/A', 'N/A', 'N/A', '$4.4–6 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$4 million', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$4 million', 'N/A', 'N/A', 'N/A', '$6.3 million',

In [None]:
## Checking different forms of strings

print([items.get("Box office", 'N/A') for items in master_data])

['$45.472', '$418 million', '$164 million', '$76.4–$83.3 million (United States and Canada)', '$960,000 (worldwide rentals)', '>$1.3 million (est. United States/Canada rentals, 1941)', '$267.4 million', '$1.135 million (worldwide rentals)', '$799,000', '$3.355 million (worldwide rentals)', '$3.275 million (worldwide rentals)', '$65 million', '$3.165 million (worldwide rentals)', '$2.56 million (worldwide rentals)', '$3.7 million (U.S. rental) $575,000 (foreign rental)', '$1.625 million (worldwide rentals)', '$182 million', '$4,100,000 (worldwide rentals)', ['$2.4 million (1951, domestic)', '$3.5 million (1974, domestic)'], '$2.1 million (US rentals)', '$87.4 million (United States and Canada)', '$1 million (US)', '$2.6 million (US)', 'N/A', '$1.75 million (US and Canadian rentals)', '$28.2 million', '$2,150,000 (US)', '$187 million', '$2.1 million (US)', '$1.6 million (US)', '$1.7 million (US)', 'N/A', 'N/A', '$2.75 million (US)', 'N/A', '$1.75 million (US rentals)', '$6,250,000 (U.S./

This code snippet handles the edge cases seen above and converts the dollar strings to a numerical value; anything it cannot parse is returned as '-'

In [None]:
def dollars_to_float(dollars):
    try:
        if dollars == 'N/A':
            return None
    
        if isinstance(dollars, list):
            
            if 'Original release' in dollars:
                dollars = dollars[2]
                return float(dollars.split(' ')[0].lstrip('$'))*1000000
                
            if 'million' in dollars:
                dollars = dollars[0]
                return float(dollars.split(' ')[0].lstrip('$'))*1000000
                
            else:
                dollars = "".join([i for i in dollars if i.isdigit()])
                return float(dollars)
        else:
            if '>$' in dollars:         # special edge case
                return float(dollars.split(' ')[0].lstrip('>').lstrip('$'))*1000000
            if 'under' in dollars:      # special edge case
                return float(dollars.split(' ')[1].lstrip('$'))*1000000
            if '$76.4–$83.3 million' in dollars:     # special edge case
                return float(76.4)*1000000
            elif 'million' in dollars:
                return float(dollars.split(' ')[0].lstrip('$'))*1000000
            else:
                dollars = "".join([i for i in dollars if i.isdigit()])
                return float(dollars)
    except Exception:
        # Values that cannot be parsed, e.g. '(US$270 million)', are flagged with '-'
        return '-'
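The special cases in `dollars_to_float` can also be expressed with a single regular expression. The sketch below is an alternative under stated assumptions, not the notebook's method: `parse_dollars`, `MONEY` and `SCALE` are hypothetical names, only the first amount of a range or list is kept, and unparseable values return `None` rather than '-'. It also reads '$45.472' as 45.472, whereas the function above strips the dot and yields 45472.0.

```python
import re

# '$' amount, optional range end ('–$83.3' or '-6'), optional scale word
MONEY = re.compile(
    r"\$\s*(\d[\d,]*(?:\.\d+)?)"        # first amount, e.g. 1.49 or 600,000
    r"(?:\s*[–-]\s*\$?[\d.,]+)?"        # optional upper end of a range (ignored)
    r"\s*(million|billion)?",           # optional scale word
    re.IGNORECASE,
)
SCALE = {"million": 1_000_000, "billion": 1_000_000_000}

def parse_dollars(value):
    """Return the first dollar amount in `value` as a float, or None."""
    if isinstance(value, list):
        value = value[0] if value else None   # keep the first entry of a list
    if not value or value == "N/A":
        return None
    match = MONEY.search(value)
    if match is None:                          # no '$' amount, e.g. '83 crore'
        return None
    amount = float(match.group(1).replace(",", ""))
    scale = match.group(2)
    return amount * SCALE[scale.lower()] if scale else amount

for text in ["$1.49 million", "$600,000 (worldwide rentals)",
             "$76.4–$83.3 million (United States and Canada)", "83 crore"]:
    print(text, "->", parse_dollars(text))
```

Returning `None` for unparseable values would keep the 'Budget (float)' and 'Box office (float)' columns numeric once the data is loaded into pandas.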
        

Adding the updated key-value pairs to the dictionaries in the master_data list

In [None]:
for dictionary in master_data:
    dictionary['Budget (float)'] = dollars_to_float(dictionary.get("Budget", 'N/A'))

In [None]:
for dictionary in master_data:
    dictionary['Box office (float)'] = dollars_to_float(dictionary.get("Box office", 'N/A'))

In [None]:
# Check

master_data[1]

{'Based on': ['Snow White', 'by The', 'Brothers Grimm'],
 'Box office': '$418 million',
 'Box office (float)': 418000000.0,
 'Budget': '$1.49 million',
 'Budget (float)': 1490000.0,
 'Country': 'United States',
 'Directed by': ['David Hand',
  'William Cottrell',
  'Wilfred Jackson',
  'Larry Morey',
  'Perce Pearce',
  'Ben Sharpsteen'],
 'Distributed by': 'RKO Radio Pictures',
 'Language': 'English',
 'Music by': ['Frank Churchill', 'Paul Smith', 'Leigh Harline'],
 'Produced by': 'Walt Disney',
 'Production company': 'Walt Disney Productions',
 'Release date': ['December 21, 1937 ( Carthay Circle Theatre )'],
 'Release date (Standard)': datetime.date(1937, 12, 21),
 'Running time': '83 minutes',
 'Running time (int)': 83,
 'Starring': ['Adriana Caselotti',
  'Lucille La Verne',
  'Harry Stockwell',
  'Roy Atwell',
  'Pinto Colvig',
  'Otis Harlan',
  'Scotty Mattraw',
  'Billy Gilbert',
  'Eddie Collins',
  'Moroni Olsen',
  'Stuart Buchanan'],
 'Title': 'Snow White and the Seven Dwa

### Since a datetime object is not serializable in JSON, we can save this data as a pickle file

This snippet defines functions to save and load a pickle file

*We are just defining the functions here and not saving yet, because we still have to update the data*

In [None]:
import pickle

def saveas_pickle(filename, data):
    with open(filename, 'wb') as fh:
        pickle.dump(data, fh)
        
        
def load_pickle(filename):
    with open(filename, 'rb') as fh:
        return pickle.load(fh)
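The difference between the two formats can be seen on a single record. This sketch uses an in-memory round trip (`pickle.dumps` / `pickle.loads`) instead of a file, which is an assumption made to keep it self-contained.

```python
import json
import pickle
from datetime import date

record = {"Title": "Snow White and the Seven Dwarfs",
          "Release date (Standard)": date(1937, 12, 21)}

# json.dumps cannot encode a date object and raises TypeError
try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False

# pickle round-trips the same record, including the date object
restored = pickle.loads(pickle.dumps(record))

print(json_ok)
print(restored == record)
```

One caveat worth keeping in mind: unpickling can execute arbitrary code, so `load_pickle` should only be used on files from a trusted source.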

# We need to add some more fields to master_data, such as IMDb rating and Metascore, which were not available on Wikipedia

### We will use a RESTful API (in our case http://www.omdbapi.com) to get this additional data and then integrate it into our data

In [None]:
# Importing necessary libraries to start API calls

import requests
import urllib.parse
import os

url = 'http://www.omdbapi.com/?'

In [None]:
# Defining some useful functions
# 1. get_omdb_info(title) to get the info of each movie from an OMDb API call based on its title
# 2. get_rottentomato_score(omdb_info) to find the Rotten Tomatoes score by iterating over the list under the 'Ratings' key

# Also note that the 'API_KEY' environment variable can be set from 'sysdm.cpl' on Windows

def get_omdb_info(title):
    # Read the key from the 'API_KEY' environment variable, falling back to the placeholder
    parameters = {'apikey': os.environ.get('API_KEY', 'Enter Your Key Here'), 't': title}
    query = urllib.parse.urlencode(parameters)     # URL-encode the query parameters
    full_url = url + query
    return requests.get(full_url).json()

def get_rottentomato_score(omdb_info):
    ratings = omdb_info.get('Ratings', [])
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None      # If we have nothing to iterate on inside the 'Ratings' we return None
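To see what these two functions do without calling the API, the sketch below builds the query string for one title and runs the ratings lookup on a hand-written dictionary. The key `'demo-key'` is a placeholder, and `sample_info` is a hypothetical fragment shaped like the 'Ratings' list the function expects; its exact value formats are an assumption.

```python
import urllib.parse

# How the query string is built; 'demo-key' is a placeholder, not a real key
query = urllib.parse.urlencode({'apikey': 'demo-key', 't': 'The Reluctant Dragon'})
print(query)

# Hypothetical response fragment (not real API output)
sample_info = {'Ratings': [{'Source': 'Internet Movie Database', 'Value': '6.8/10'},
                           {'Source': 'Rotten Tomatoes', 'Value': '100%'}]}

def rotten_tomatoes_value(omdb_info):
    """Same lookup as get_rottentomato_score, written as a single expression."""
    return next((rating['Value'] for rating in omdb_info.get('Ratings', [])
                 if rating['Source'] == 'Rotten Tomatoes'), None)

print(rotten_tomatoes_value(sample_info))
print(rotten_tomatoes_value({}))          # no 'Ratings' key at all
```

Spaces in the title are encoded as `+`, which is why the title can be passed to `get_omdb_info` unchanged.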

In [None]:
# Now updating the values out of omdb_info to the master_data

for index, dictionary in enumerate(master_data):
    if index % 50 == 0:
        print('Getting', index+1, 'out of', len(master_data))
    title = dictionary['Title']
    omdb_info = get_omdb_info(title)
    dictionary['Genre'] = omdb_info.get('Genre', None)
    dictionary['Movie Plot'] = omdb_info.get('Plot', None)
    dictionary['Awards'] = omdb_info.get('Awards', None)
    dictionary['Imdb Rating'] = omdb_info.get('imdbRating', None)
    dictionary['Metascore'] = omdb_info.get('Metascore', None)
    dictionary['Rotten Tomatoes Score'] = get_rottentomato_score(omdb_info)
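The values stored by this loop are strings, such as '7.6', '95', '93%' or 'N/A', as the records printed in this notebook show. If numeric values are wanted for analysis, a small converter along these lines could be applied afterwards. `to_number` is a hypothetical helper, not part of the notebook.

```python
def to_number(text):
    """Convert strings like '7.6', '95' or '93%' to float; return None for 'N/A' or None."""
    if text is None or text == 'N/A':
        return None
    try:
        return float(text.rstrip('%'))    # drop a trailing percent sign if present
    except ValueError:
        return None                       # anything else is treated as missing

for value in ['7.6', '95', '93%', 'N/A', None]:
    print(repr(value), '->', to_number(value))
```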
    

Getting 1 out of 491
Getting 51 out of 491
Getting 101 out of 491
Getting 151 out of 491
Getting 201 out of 491
Getting 251 out of 491
Getting 301 out of 491
Getting 351 out of 491
Getting 401 out of 491
Getting 451 out of 491


In [None]:
# Checking that the new entries extracted through the OMDb API calls were added to the record

master_data[-21]

# Notice
# 'Genre': 'Animation, Adventure, Comedy',
#  'Movie Plot': 'The romantic tale of a sheltered uptown Cocker Spaniel dog and a streetwise downtown Mutt.',
#  'Awards': 'Nominated for 1 BAFTA Film Award1 win & 2 nominations total',
#  'Imdb Rating': '7.3',
#  'Metascore': '78',
#  'Rotten Tomatoes Score': '93%'

{'Awards': 'N/A',
 'Box office (float)': None,
 'Budget (float)': None,
 'Cinematography': ['Mahyar Abousaeedi', 'Jonathan Pytko'],
 'Country': 'United States',
 'Directed by': 'Domee Shi',
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Edited by': ['Nicholas C. Smith', 'Steve Bloom'],
 'Genre': 'Animation, Adventure, Comedy',
 'Imdb Rating': 'N/A',
 'Language': 'English',
 'Metascore': 'N/A',
 'Movie Plot': 'A 13-year-old girl named Mei Lee turns into a giant red panda whenever she gets too excited.',
 'Music by': 'Ludwig Göransson',
 'Produced by': 'Lindsey Collins',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Release date': ['March 11, 2022 (United States)'],
 'Release date (Standard)': datetime.date(2022, 3, 11),
 'Rotten Tomatoes Score': None,
 'Running time (int)': None,
 'Screenplay by': ['Julia Cho', 'Domee Shi'],
 'Starring': ['Rosalie Chiang',
  'Sandra Oh',
  'Ava Morse',
  'Maitreyi Ramakrishnan',
  'Hyein Park',
  'Orio

## Saving as pickle file

In [None]:
# Now saving the data as a new pickle file

saveas_pickle('cleaned_and_added_data_from_omdb.pickle', master_data)

# Check the local folder for the pickle file

In [None]:
# Check

master_data[-2]

{'Awards': 'Won 1 Oscar. 11 wins & 6 nominations total',
 'Based on': ['Snow White', 'by The', 'Brothers Grimm'],
 'Box office': '$418 million',
 'Box office (float)': 418000000.0,
 'Budget': '$1.49 million',
 'Budget (float)': 1490000.0,
 'Country': 'United States',
 'Directed by': ['David Hand',
  'William Cottrell',
  'Wilfred Jackson',
  'Larry Morey',
  'Perce Pearce',
  'Ben Sharpsteen'],
 'Distributed by': 'RKO Radio Pictures',
 'Genre': 'Animation, Adventure, Family',
 'Imdb Rating': '7.6',
 'Language': 'English',
 'Metascore': '95',
 'Movie Plot': 'Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household.',
 'Music by': ['Frank Churchill', 'Paul Smith', 'Leigh Harline'],
 'Produced by': 'Walt Disney',
 'Production company': 'Walt Disney Productions',
 'Release date': ['December 21, 1937 ( Carthay Circle Theatre )'],
 'Release date (Standard)': datetime.date(1937, 12, 21),
 'Rotten Tomatoes Score

# Now we are going to save the data in JSON as well as CSV format

## Saving as JSON

In [None]:
master_data_copy = [movie_info.copy() for movie_info in master_data]

In [None]:
master_data_copy[-2]

{'Awards': 'Won 1 Oscar. 11 wins & 6 nominations total',
 'Based on': ['Snow White', 'by The', 'Brothers Grimm'],
 'Box office': '$418 million',
 'Box office (float)': 418000000.0,
 'Budget': '$1.49 million',
 'Budget (float)': 1490000.0,
 'Country': 'United States',
 'Directed by': ['David Hand',
  'William Cottrell',
  'Wilfred Jackson',
  'Larry Morey',
  'Perce Pearce',
  'Ben Sharpsteen'],
 'Distributed by': 'RKO Radio Pictures',
 'Genre': 'Animation, Adventure, Family',
 'Imdb Rating': '7.6',
 'Language': 'English',
 'Metascore': '95',
 'Movie Plot': 'Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household.',
 'Music by': ['Frank Churchill', 'Paul Smith', 'Leigh Harline'],
 'Produced by': 'Walt Disney',
 'Production company': 'Walt Disney Productions',
 'Release date': ['December 21, 1937 ( Carthay Circle Theatre )'],
 'Release date (Standard)': datetime.date(1937, 12, 21),
 'Rotten Tomatoes Score

This snippet replaces every date object in the copy with a date string

In [None]:
for items in master_data_copy:
    datetime_object = items['Release date (Standard)']
    if datetime_object:                                               # Checking if it exists
        items['Release date (Standard)'] = datetime_object.strftime('%B %d, %Y')
    else:
        items['Release date (Standard)'] = None       
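An alternative to converting the copy by hand is to let `json.dumps` call a fallback function for objects it cannot serialize, through its `default` parameter. The sketch below is one way to do that; `encode_date` is a hypothetical name, and it emits ISO strings ('1940-02-07'), which is an assumption that differs from the '%B %d, %Y' format used above.

```python
import json
from datetime import date

def encode_date(obj):
    """Called by json.dumps for objects it cannot serialize on its own."""
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

record = {'Title': 'Pinocchio', 'Release date (Standard)': date(1940, 2, 7)}
text = json.dumps(record, default=encode_date)
print(text)
```

With this approach no copy of master_data is needed, since the original dictionaries are left untouched.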

In [None]:
# Checking that the string is changed in copy

master_data_copy[-2]

{'Awards': 'Won 1 Oscar. 11 wins & 6 nominations total',
 'Based on': ['Snow White', 'by The', 'Brothers Grimm'],
 'Box office': '$418 million',
 'Box office (float)': 418000000.0,
 'Budget': '$1.49 million',
 'Budget (float)': 1490000.0,
 'Country': 'United States',
 'Directed by': ['David Hand',
  'William Cottrell',
  'Wilfred Jackson',
  'Larry Morey',
  'Perce Pearce',
  'Ben Sharpsteen'],
 'Distributed by': 'RKO Radio Pictures',
 'Genre': 'Animation, Adventure, Family',
 'Imdb Rating': '7.6',
 'Language': 'English',
 'Metascore': '95',
 'Movie Plot': 'Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household.',
 'Music by': ['Frank Churchill', 'Paul Smith', 'Leigh Harline'],
 'Produced by': 'Walt Disney',
 'Production company': 'Walt Disney Productions',
 'Release date': ['December 21, 1937 ( Carthay Circle Theatre )'],
 'Release date (Standard)': 'December 21, 1937',
 'Rotten Tomatoes Score': None,

In [None]:
# Now saving as JSON
# If saveas_json() is not defined in the current session, re-run the cell above that defines it

saveas_json('cleaned_and_added_data_from_omdb.json', master_data_copy)

## Saving as CSV

#### We will use pandas to save as csv

In [None]:
import pandas as pd

df = pd.DataFrame(master_data)       # We can pass a list of dicts directly as the argument to make a df object

In [None]:
df.head()

Unnamed: 0,Title,Production company,Distributed by,Release date,Running time,Country,Language,Box office,Release date (Standard),Running time (int),Budget (float),Box office (float),Genre,Movie Plot,Awards,Imdb Rating,Metascore,Rotten Tomatoes Score,Directed by,Written by,Based on,Produced by,Starring,Music by,Budget,Story by,Narrated by,Cinematography,Edited by,Languages,Screenplay by,Countries,Production companies,Color process,Layouts by,Created by,Original work,Owner
0,Academy Award Review of,Walt Disney Productions,RKO Radio Pictures,"[May 19, 1937]",41 minutes (74 minutes 1966 release),United States,English,$45.472,1937-05-19,41.0,,45472.0,"Animation, Short, Comedy",A compilation of five Oscar-winning Disney sho...,,7.1,,,,,,,,,,,,,,,,,,,,,,
1,Snow White and the Seven Dwarfs,Walt Disney Productions,RKO Radio Pictures,"[December 21, 1937 ( Carthay Circle Theatre )]",83 minutes,United States,English,$418 million,1937-12-21,83.0,1490000.0,418000000.0,"Animation, Adventure, Family",Exiled into the dangerous forest by her wicked...,Won 1 Oscar. 11 wins & 6 nominations total,7.6,95.0,,"[David Hand, William Cottrell, Wilfred Jackson...","[Ted Sears, Richard Creedon, Otto Englander, D...","[Snow White, by The, Brothers Grimm]",Walt Disney,"[Adriana Caselotti, Lucille La Verne, Harry St...","[Frank Churchill, Paul Smith, Leigh Harline]",$1.49 million,,,,,,,,,,,,,
2,Pinocchio,Walt Disney Productions,RKO Radio Pictures,"[February 7, 1940 ( Center Theatre ), February...",88 minutes,United States,English,$164 million,1940-02-07,88.0,2600000.0,164000000.0,"Animation, Adventure, Comedy","A living puppet, with the help of a cricket as...",Won 2 Oscars. 7 wins total,7.4,99.0,100%,"[Ben Sharpsteen, Hamilton Luske, Bill Roberts,...",,"[The Adventures of Pinocchio, by, Carlo Collodi]",Walt Disney,"[Cliff Edwards, Dickie Jones, Christian Rub, W...","[Leigh Harline, Paul J. Smith]",$2.6 million,"[Ted Sears, Otto Englander, Webb Smith, Willia...",,,,,,,,,,,,
3,Fantasia,Walt Disney Productions,RKO Radio Pictures,"[November 13, 1940]",126 minutes,United States,English,$76.4–$83.3 million (United States and Canada),1940-11-13,126.0,2280000.0,76400000.0,"Animation, Family, Fantasy",A collection of animated interpretations of gr...,Won 2 Oscars. 8 wins & 1 nomination total,7.7,96.0,95%,"[Samuel Armstrong, James Algar, Bill Roberts, ...",,,"[Walt Disney, Ben Sharpsteen]","[Leopold Stokowski, Deems Taylor]",See program,$2.28 million,"[Joe Grant, Dick Huemer]",Deems Taylor,James Wong Howe,,,,,,,,,,
4,The Reluctant Dragon,Walt Disney Productions,RKO Radio Pictures,"[June 27, 1941]",74 minutes,United States,English,"$960,000 (worldwide rentals)",1941-06-27,74.0,600000.0,960000.0,"Animation, Comedy, Family",Humorist Robert Benchley learns about the anim...,,6.8,,100%,"[Alfred Werker, (live action), Hamilton Luske,...","[Live-action:, Ted Sears, Al Perkins, Larry Cl...",,Walt Disney,"[Robert Benchley, Frances Gifford, Buddy Peppe...","[Frank Churchill, Larry Morey]","$600,000",,,Bert Glennon,Paul Weatherwax,,,,,,,,,


In [None]:
df.to_csv('cleaned_data_final.csv')
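Several columns, such as 'Starring' and 'Music by', hold Python lists. A CSV file can only store text, so a list cell comes back as a string when the file is read again. The sketch below illustrates this with the standard-library `csv` module on an in-memory file (an assumption made to keep it self-contained; to my understanding `DataFrame.to_csv` stores the string form of such cells in the same way).

```python
import ast
import csv
import io

row = {'Title': 'Pinocchio', 'Music by': ['Leigh Harline', 'Paul J. Smith']}

# Write one row to an in-memory CSV; the list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: every cell is now a plain string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded['Music by']).__name__)

# ast.literal_eval turns the "['...', '...']" text back into a list
music = ast.literal_eval(loaded['Music by'])
print(music)
```

For data that must keep its lists and date objects intact, the pickle file saved earlier remains the lossless option.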

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 491 entries, 0 to 490
Data columns (total 38 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Title                    491 non-null    object 
 1   Production company       197 non-null    object 
 2   Distributed by           489 non-null    object 
 3   Release date             485 non-null    object 
 4   Running time             473 non-null    object 
 5   Country                  437 non-null    object 
 6   Language                 472 non-null    object 
 7   Box office               382 non-null    object 
 8   Release date (Standard)  485 non-null    object 
 9   Running time (int)       473 non-null    float64
 10  Budget (float)           300 non-null    object 
 11  Box office (float)       382 non-null    object 
 12  Genre                    473 non-null    object 
 13  Movie Plot               473 non-null    object 
 14  Awards                   4

In [None]:
# Sorting values on the basis of 'Running time (int)'

running_time = df.sort_values(['Running time (int)'], ascending=False)
running_time.head(10)

Unnamed: 0,Title,Production company,Distributed by,Release date,Running time,Country,Language,Box office,Release date (Standard),Running time (int),Budget (float),Box office (float),Genre,Movie Plot,Awards,Imdb Rating,Metascore,Rotten Tomatoes Score,Directed by,Written by,Based on,Produced by,Starring,Music by,Budget,Story by,Narrated by,Cinematography,Edited by,Languages,Screenplay by,Countries,Production companies,Color process,Layouts by,Created by,Original work,Owner
322,Pirates of the Caribbean: At World's End,,Buena Vista Pictures,"[May 19, 2007 ( Disneyland Resort ), May 25, 2...",167 minutes,United States,English,$960.9 million,2007-05-19,167.0,3e+08,9.609e+08,"Action, Adventure, Fantasy","Captain Barbossa, Will Turner and Elizabeth Sw...",Nominated for 2 Oscars. 22 wins & 51 nominatio...,7.1,50.0,44%,Gore Verbinski,"[Ted Elliott, Terry Rossio]",[Characters by Ted Elliott Terry Rossio Stuart...,Jerry Bruckheimer,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Hans Zimmer,$300 million,,,Dariusz Wolski,"[Craig Wood, Stephen Rivkin]",,,,"[Walt Disney Pictures, Jerry Bruckheimer Films]",,,,,
85,The Happiest Millionaire,Walt Disney Productions,Buena Vista Distribution,"[June 23, 1967, November 30, 1967]","[164 minutes, (, Los Angeles, premiere), 144 m...",United States,English,$5 million (U.S./Canada rentals),1967-06-23,164.0,5e+06,5e+06,"Comedy, Family, Musical",Clever yet hapless new butler John Lawless man...,Nominated for 1 Oscar. 2 nominations total,6.8,,50%,Norman Tokar,,"[My Philadelphia Father, by Cordelia Drexel Bi...","[Walt Disney, Bill Anderson]","[Fred MacMurray, Tommy Steele, Greer Garson, G...",Jack Elliott,$5 million,A. J. Carothers,,Edward Colman,Cotton Warburton,,A. J. Carothers,,,,,,,
428,Jagga Jasoos,,UTV Motion Pictures,[14 July 2017],162 minutes,India,Hindi,83 crore,2017-07-14,162.0,-,83,"Action, Adventure, Comedy","Join Jagga, a gifted teenage detective, who al...",11 wins & 17 nominations,6.5,,83%,Anurag Basu,"[Screenplay:, Anurag Basu, Dialogues in Rhyme:...",,"[Siddharth Roy Kapur, Anurag Basu, Ranbir Kapoor]","[Ranbir Kapoor, Katrina Kaif, Saswata Chatterj...","[Pritam, Anirudh Ravichander]",[131 crore],Anurag Basu,,Ravi Varman,Ajay Sharma,,,,"[Walt Disney Pictures India, Picture Shuru Ent...",,,,,
422,Dangal,,UTV Motion Pictures,"[21 December 2016 (United States), 23 December...",161 minutes,India,Hindi,(US$270 million),2016-12-21,161.0,-,-,"Action, Biography, Drama",Former wrestler Mahavir Singh Phogat and his t...,28 wins & 6 nominations,8.4,,88%,Nitesh Tiwari,"[Saeed Aadil, Piyush Gupta, Shreyas Jain, Nikh...",Lives of Mahavir Singh Phogat and Phogat sisters,"[Aamir Khan, Kiran Rao, Siddharth Roy Kapur]","[Aamir Khan, Sakshi Tanwar, Fatima Sana Shaikh...",Pritam,(US$9.3 million),"[Curation:, Nitesh Tiwari, Concept:, Divya V. ...",Aparshakti Khurana,Setu,Ballu Saluja,,,,"[Aamir Khan Productions, Walt Disney Pictures ...",,,,,
453,Hamilton,,Walt Disney Studios Motion Pictures,"[July 3, 2020]",160 minutes,United States,English,,2020-07-03,160.0,1.25e+07,,"Biography, Drama, History",The real life of one of America's foremost fou...,Won 1 Primetime Emmy. 17 wins & 42 nominations...,8.4,90.0,97%,Thomas Kail,Lin-Manuel Miranda,"[Alexander Hamilton, by, Ron Chernow]","[Thomas Kail, Lin-Manuel Miranda, Jeffrey Seller]","[Daveed Diggs, Renée Elise Goldsberry, Jonatha...",Lin-Manuel Miranda,$12.5 million (stage production),,,Declan Quinn,Jonah Moran,,,,"[Walt Disney Pictures, 5000 Broadway Productio...",,,,,
411,ABCD 2,Walt Disney Pictures,UTV Motion Pictures,[19 June 2015],154 minutes,India,Hindi,est.,2015-06-19,154.0,-,-,Music,,,,,,Remo D'Souza,"[Dialogues and Lyrics:, Mayur Puri, Screenplay...","[Suresh & Vernon, of the, Fictitious Crew]",Siddharth Roy Kapur,"[Prabhu Deva, Varun Dhawan, Shraddha Kapoor, L...",Sachin–Jigar,,Remo D'Souza,,Vijay Kumar Arora,Manan Sagar,,,,,,,,,
314,Pirates of the Caribbean: Dead Man's Chest,,Buena Vista Pictures,"[June 24, 2006 ( Disneyland Resort ), July 7, ...",150 minutes,United States,English,$1.066 billion,2006-06-24,150.0,2.25e+08,1066,"Action, Adventure, Fantasy",Jack Sparrow races to recover the heart of Dav...,Won 1 Oscar. 45 wins & 54 nominations total,7.3,53.0,53%,Gore Verbinski,"[Ted Elliott, Terry Rossio]",[Characters by Ted Elliott Terry Rossio Stuart...,Jerry Bruckheimer,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Hans Zimmer,$225 million,,,Dariusz Wolski,"[Craig Wood, Stephen Rivkin]",,,,"[Walt Disney Pictures, Jerry Bruckheimer Films]",,,,,
331,The Chronicles of Narnia: Prince Caspian,,Walt Disney Studios Motion Pictures,"[May 7, 2008 ( New York City ), May 16, 2008 (...",150 minutes,,English,$419.7 million,2008-05-07,150.0,2.25e+08,4.197e+08,"Action, Adventure, Family","The Pevensie siblings return to Narnia, where ...",2 wins & 21 nominations,6.5,62.0,66%,Andrew Adamson,,"[Prince Caspian, by, C. S. Lewis]","[Mark Johnson, Andrew Adamson, Philip Steuer]","[Georgie Henley, Skandar Keynes, William Mosel...",Harry Gregson-Williams,$225 million,,,Karl Walter Lindenlaub,Sim Evan-Jones,,"[Andrew Adamson, Christopher Markus Stephen Mc...","[United States, United Kingdom]","[Walt Disney Pictures, Walden Media]",,,,,
390,The Lone Ranger,,"[Walt Disney Studios, Motion Pictures]","[June 22, 2013 ( Hyperion Theatre ), July 3, 2...",149 minutes,United States,English,$260.5 million,2013-06-22,149.0,-,2.605e+08,"Action, Adventure, Western",Native American warrior Tonto recounts the unt...,Nominated for 2 Oscars. 5 wins & 20 nomination...,6.4,37.0,30%,Gore Verbinski,,"[Fran Striker, George W. Trendle]","[Jerry Bruckheimer, Gore Verbinski]","[Johnny Depp, Armie Hammer, Tom Wilkinson, Wil...",Hans Zimmer,$225–250 million,"[Ted Elliott, Terry Rossio, Justin Haythe]",,Bojan Bazelli,"[James Haygood, Craig Wood]",,"[Justin Haythe, Ted Elliott, Terry Rossio]",,"[Walt Disney Pictures, Jerry Bruckheimer Films...",,,,,
304,"The Chronicles of Narnia:The Lion, the Witch a...",,Buena Vista Pictures,"[December 7, 2005 ( Royal Film Performance ), ...",143 minutes,,English,$745 million,2005-12-07,143.0,1.8e+08,7.45e+08,"Adventure, Family, Fantasy",Four kids travel through a wardrobe to the lan...,Won 1 Oscar. 18 wins & 46 nominations total,6.9,75.0,76%,Andrew Adamson,,"[The Lion, the Witch and the Wardrobe, by, C. ...","[Mark Johnson, Phillip Steuer]","[William Moseley, Anna Popplewell, Skandar Key...",Harry Gregson-Williams,$180 million,,,Donald McAlpine,"[Sim Evan-Jones, Jim May]",,"[Ann Peacock, Andrew Adamson, Christopher Mark...","[United States, United Kingdom]","[Walt Disney Pictures, Walden Media]",,,,,
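
In the output above, `Box office (float)` holds `1066` for "$1.066 billion" but `9.609e+08` for "$960.9 million", and several rows carry `-` instead of a number. This suggests the unit word was not applied consistently during cleaning. The sketch below shows one possible way to parse such strings. The function name `money_to_float` and the unit table are assumptions made for illustration, not the notebook's original cleaning code, and no currency conversion is attempted (crore amounts stay in rupees).

```python
import re

# Assumed multipliers; "crore" (10 million) appears in the Indian releases.
_UNITS = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "crore": 1e7}

# A number, an optional "–upper bound" of a range (ignored), and an optional unit word.
_PATTERN = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)"                 # first amount, e.g. 1.066 or 1,000
    r"(?:\s*[–-]\s*\d[\d,]*(?:\.\d+)?)?"    # optional range end, e.g. –250
    r"\s*(thousand|million|billion|crore)?",
    re.IGNORECASE,
)

def money_to_float(text):
    """Return the first amount in `text` scaled by its unit word, or None."""
    if not isinstance(text, str):
        return None
    match = _PATTERN.search(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    return number * _UNITS.get(unit, 1.0)

for raw in ["$1.066 billion", "$960.9 million", "$225–250 million", "est."]:
    print(raw, "->", money_to_float(raw))
```

For a range such as "$225–250 million" the sketch keeps the lower bound. Returning `None` for unparseable strings such as "est." keeps missing values distinguishable from real amounts.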




***************************************

**So, finally, we have a large dataset covering all the Disney movies along with the necessary information about each one.**

We are now ready to move on and analyze/visualize the data on our chosen parameters.

****************************************