# Wikipedia Disney Movies list - Web Scraping


![](https://i.imgur.com/0EBdH19.jpg)

## 1. Introduction


### 1.1. What is web scraping <a name="subparagraph1"></a>

Web scraping is the process of collecting structured web data in an automated fashion. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research among many others.

In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.

There are a number of tools and methods for web scraping; inspecting a site's network traffic, Scrapy, Selenium, and Beautiful Soup are among the most popular. Each method has its own advantages and drawbacks. This project uses Beautiful Soup.

### 1.2. Problem statement 

Movies are a hobby for consumers. However, data about the main players of the entertainment industry can be valuable in many ways. Competitors can see how other companies' investments have performed, or how a specific director's movies perform on the market. It might also be useful to see which actors bring more revenue to a movie.

These use cases make data sets related to this industry valuable. That is why, in this project, we scrape the Walt Disney Company's movies from Wikipedia and collect the main information on the movies produced by this company.

The Walt Disney Company, commonly known as Disney, is an American diversified multinational mass media and entertainment conglomerate headquartered at the Walt Disney Studios complex in Burbank, California.

Disney was originally founded on October 16, 1923, by brothers Walt and Roy O. Disney as the Disney Brothers Cartoon Studio; it also operated under the names The Walt Disney Studio and Walt Disney Productions before officially changing its name to The Walt Disney Company in 1986. The company established itself as a leader in the American animation industry before diversifying into live-action film production, television, and theme parks.

### 1.3. Tools used in this project

In this project, we use Python as our coding language to scrape the Wikipedia pages.
In Python, we mainly use the Requests library to get the information from the websites.

Then we use the Beautiful Soup library to turn it into a BeautifulSoup object.

Throughout the project we use a knowledge of HTML to inspect the page and find the right tags.

The scraped information is then turned into a Pandas DataFrame, which we save as a CSV file.
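The tool chain described above can be sketched end to end without touching the network. This is only an illustration: the HTML string `sample_html` and the names `sample_soup`, `records`, `sample_df`, and `csv_text` are made up for this sketch, and the HTML stands in for the text that Requests would download.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for the text of a downloaded page
# (in the project this would be the `.text` of a requests response).
sample_html = """
<table class="wikitable sortable">
  <tr><td><i><a href="/wiki/Dumbo" title="Dumbo">Dumbo</a></i></td></tr>
  <tr><td><i><a href="/wiki/Bambi" title="Bambi">Bambi</a></i></td></tr>
</table>
"""

# Parse the HTML into a BeautifulSoup object
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect one record per link found in the table
records = [
    {"movie_name": a.get("title"), "movie_url": a.get("href")}
    for a in sample_soup.select("table a")
]

# Turn the records into a DataFrame and serialise it as CSV text
sample_df = pd.DataFrame(records)
csv_text = sample_df.to_csv(index=False)
print(csv_text)
```

Calling `to_csv` without a file path returns the CSV as a string, which is convenient for a quick check before writing a real file.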



## 2. Project Steps:

1. Assessing the Disney Movies page on Wikipedia: https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films 
2. Scraping all the links to movies on Disney Movies Wikipedia and saving them to a CSV file
3. Scraping Info-Box of a sample movie
4. Extracting all the Info-Boxes of all the movies on Disney Movies Wikipedia automatically (Putting it all together)
5. Saving all the scraped information to a CSV file

#### 2.1. Assessing the Disney Movies page on Wikipedia: https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films 


In [233]:
# Importing necessary libraries

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd


In [3]:
# using requests library to get the info on the page 
r_main = requests.get ("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")

In [4]:
# checking if the page has responded with no error
r_main.status_code

200

In [5]:
# turning what we got from the page through request into a beautiful soup object
main_soup = bs(r_main.text, "html.parser")

We need to find all the links to the movies listed on the page. To do so, we assess the page's HTML source.

We use the browser's inspect option to check a sample movie in the lists on the page.

![](https://i.imgur.com/XTSibPE.jpg)

#### 2.2. Scraping all the links to movies on Disney Movies Wikipedia 


From the above page, we can see that each movie's href and title attributes sit on an "a" tag nested inside an "i" tag,
so we are going to select these tags for all the movies.
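To make the selector logic concrete, here is a small offline sketch. The HTML in `fragment` is invented for illustration and only imitates the structure seen in the inspector; `fragment_soup`, `matched`, and `found` are likewise names used only here. It also uses `urljoin` from the standard library as one possible way to build absolute URLs.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one table on the list page
fragment = """
<table class="wikitable sortable">
  <tr>
    <td><i><a href="/wiki/Dumbo" title="Dumbo">Dumbo</a></i></td>
    <td><a href="/wiki/1941_in_film" title="1941 in film">1941</a></td>
  </tr>
</table>
<p><i><a href="/wiki/Other" title="Other">outside the table</a></i></p>
"""

fragment_soup = BeautifulSoup(fragment, "html.parser")

# Match <a> tags that sit inside an <i> tag, itself inside an element
# carrying both the "wikitable" and "sortable" classes
matched = fragment_soup.select(".wikitable.sortable i a")

# Build absolute URLs from the relative href values
found = {
    a.get("title"): urljoin("https://en.wikipedia.org", a.get("href"))
    for a in matched
}
print(found)
```

Only the italicised link inside the table is kept: the year link is not inside an "i" tag, and the last link is outside the table.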

In [6]:
# getting info of i and a tags from the beautiful soup of the page
links = main_soup.select(".wikitable.sortable i a")

In [15]:
#checking for the length of it
len(links)

454

In [7]:
# automating the process to get all the titles and links to the movies

link_dict = {}
for link in links:
    dict_key = link.get("title")
    dict_value = "https://en.wikipedia.org" + link.get("href")
    link_dict[dict_key] = dict_value

In [17]:
# checking for the content of the dictionary created
link_dict

{'Academy Award Review of Walt Disney Cartoons': 'https://en.wikipedia.org/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons',
 'Snow White and the Seven Dwarfs (1937 film)': 'https://en.wikipedia.org/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 'Pinocchio (1940 film)': 'https://en.wikipedia.org/wiki/Pinocchio_(1940_film)',
 'Fantasia (1940 film)': 'https://en.wikipedia.org/wiki/Fantasia_(1940_film)',
 'The Reluctant Dragon (1941 film)': 'https://en.wikipedia.org/wiki/The_Reluctant_Dragon_(1941_film)',
 'Dumbo': 'https://en.wikipedia.org/wiki/Dumbo',
 'Bambi': 'https://en.wikipedia.org/wiki/Bambi',
 'Saludos Amigos': 'https://en.wikipedia.org/wiki/Saludos_Amigos',
 'Victory Through Air Power (film)': 'https://en.wikipedia.org/wiki/Victory_Through_Air_Power_(film)',
 'The Three Caballeros': 'https://en.wikipedia.org/wiki/The_Three_Caballeros',
 'Make Mine Music': 'https://en.wikipedia.org/wiki/Make_Mine_Music',
 'Song of the South': 'https://en.wikipedia.org/wiki/Song_of_the_Sout

We are going to save this dictionary into a CSV file.

In [18]:
#turning the dictionary into a dataframe to easily save it to csv file
link_dict_df = pd.DataFrame(list(link_dict.items()),columns=['movie_name', 'movie_url'])

In [19]:
# saving the dictionary into a csv file
link_dict_df.to_csv("Disney_links.csv", index = False)

#### 2.3. Scraping Info-Box of a sample movie

Now that we have the links to all Disney movies, we need to scrape these links
in order to get the info-box of every movie page.

The info-boxes look like this:

![](https://i.imgur.com/790h6Be.png)

We want to scrape this info-box for all the Disney movies.
To get familiar with the HTML tags, we will scrape a sample one first and then automate the process.

In [20]:
# using requests library to get the info on the page 

r = requests.get ("https://en.wikipedia.org/wiki/Toy_Story_3")

In [21]:
# checking if the page has responded with no error
r.status_code

200

In [22]:
# turning what we got from the page through request into a beautiful soup object

soup2 = bs (r.text, "html.parser")

We need to find the HTML tags related to the info-box of the page. To do so, we assess the page's HTML source.

We use the browser's inspect option to check the info-box on the page.


![](https://i.imgur.com/dJuLgw3.jpg)

As the picture above shows, the info-box has its own class, all of its content is stored in a tbody tag, and each piece of information is stored in a tr tag. We are going to scrape the info-box based on these findings.
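The structure described above can be tried out on a tiny, hand-written table before running it on a real page. The HTML in `infobox_html` is a heavily shortened, hypothetical stand-in for a real info-box, and `infobox_soup`, `box`, `rows`, and `pairs` are names used only in this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened infobox in the layout described above
infobox_html = """
<table class="infobox vevent">
  <tbody>
    <tr><th colspan="2">Toy Story 3</th></tr>
    <tr><td colspan="2"><img src="poster.jpg"></td></tr>
    <tr><th>Directed by</th><td>Lee Unkrich</td></tr>
    <tr><th>Language</th><td>English</td></tr>
  </tbody>
</table>
"""

infobox_soup = BeautifulSoup(infobox_html, "html.parser")

# Find the table by one of its classes, then list its rows
box = infobox_soup.find("table", class_="infobox")
rows = box.find_all("tr")

# Keep only rows that have both a label (<th>) and a value (<td>)
pairs = {}
for row in rows:
    label = row.find("th")
    value = row.find("td")
    if label is not None and value is not None:
        pairs[label.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
print(pairs)
```

The title row (only a th) and the image row (only a td) are skipped by the check, which is one alternative to starting the loop at a fixed index.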

In [44]:
#getting info of the class infobox
t_tags = soup2.find_all ("table" , {"class": "infobox vevent"})
#getting info of tbody tag
tbody_tags = t_tags[0].find("tbody")
#getting info of all tr tags
tr_tags = tbody_tags.find_all("tr")

In [46]:
# check the number of the tr tags
len(tr_tags)

18

In [56]:
# a for loop to get the infobox information
# the first row is the title, which has a different format and is put into the dictionary on its own
# the second row is the image, which we don't want, so we start the for loop at index 2
movie_info = {}
movie_info["title"] = tr_tags[0].text
for i in range(2, len(tr_tags)) :
    content_key = tr_tags[i].find("th").get_text(" ", strip = True)
    content_value = tr_tags[i].find("td").get_text(" ", strip = True).replace("\xa0", " ").split("\n")
    movie_info [content_key] = content_value

In [53]:
# infobox for one movie
movie_info

{'title': 'Toy Story 3',
 'Directed by': ['Lee Unkrich'],
 'Screenplay by': ['Michael Arndt'],
 'Story by': ['John Lasseter Andrew Stanton Lee Unkrich'],
 'Produced by': ['Darla K. Anderson'],
 'Starring': ['Tom Hanks Tim Allen Joan Cusack Don Rickles Wallace Shawn John Ratzenberger Estelle Harris Ned Beatty Michael Keaton Jodi Benson John Morris'],
 'Cinematography': ['Jeremy Lasky Kim White'],
 'Edited by': ['Ken Schretzmann'],
 'Music by': ['Randy Newman'],
 'Production companies': ['Walt Disney Pictures Pixar Animation Studios'],
 'Distributed by': ['Walt Disney Studios Motion Pictures'],
 'Release date': ['June 12, 2010 ( 2010-06-12 ) ( Taormina Film Fest ) June 18, 2010 ( 2010-06-18 ) (United States)'],
 'Running time': ['103 minutes [1]'],
 'Country': ['United States'],
 'Language': ['English'],
 'Budget': ['$200 million [1]'],
 'Box office': ['$1.067 billion [1]']}

#### 2.4. Extracting all the Info-Boxes of all the movies on Disney Movies Wikipedia automatically (Putting it all together)


In [221]:
# Defining functions to automate the process we went through in the previous sections

def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    elif row_data.find("br"):
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

def clean_tags(soup):
    for tag in soup.find_all(["sup", "span"]):
        tag.decompose()
        
def get_info_box(url):

    r = requests.get(url)
    soup = bs(r.content, "html.parser")
    info_box = soup.find(class_="infobox vevent")
    info_rows = info_box.find_all("tr")
    
    clean_tags(soup)

    movie_info = {}
    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info['title'] = row.find("th").get_text(" ", strip=True)
        else:
            header = row.find('th')
            if header:
                content_key = row.find("th").get_text(" ", strip=True)
                content_value = get_content_value(row.find("td"))
                movie_info[content_key] = content_value
            
    return movie_info

In [222]:
#testing the function with a sample link

get_info_box("https://en.wikipedia.org/wiki/One_Little_Indian_(film)")

{'title': 'One Little Indian',
 'Directed by': 'Bernard McEveety',
 'Written by': 'Harry Spalding',
 'Produced by': 'Winston Hibler',
 'Starring': ['James Garner',
  'Vera Miles',
  'Pat Hingle',
  'Morgan Woodward',
  'Jodie Foster'],
 'Cinematography': 'Charles F. Wheeler',
 'Edited by': 'Robert Stafford',
 'Music by': 'Jerry Goldsmith',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Distribution',
 'Release date': ['June 20, 1973'],
 'Running time': '90 Minutes',
 'Country': 'United States',
 'Language': 'English',
 'Box office': '$2 million'}
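The sample output mixes lists and plain strings because `get_content_value` has three branches. The sketch below repeats that function so it runs on its own and feeds it three hand-written cells; the helper `first_td` and the HTML snippets are made up for this illustration.

```python
from bs4 import BeautifulSoup

def get_content_value(row_data):
    # Same logic as the function defined above, repeated so this sketch runs on its own
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    elif row_data.find("br"):
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

def first_td(html):
    # Helper for this sketch only: parse a fragment and return its first <td>
    return BeautifulSoup(html, "html.parser").find("td")

# Branch 1: the cell contains a bulleted list
as_list_items = get_content_value(
    first_td("<td><ul><li>James Garner</li><li>Vera Miles</li></ul></td>")
)
# Branch 2: the cell separates values with line breaks
as_line_breaks = get_content_value(first_td("<td>Walden Media<br>Klasky Csupo</td>"))
# Branch 3: the cell holds plain text (with a non-breaking space)
as_plain_text = get_content_value(first_td("<td>90\xa0minutes</td>"))

print(as_list_items, as_line_breaks, as_plain_text)
```

The first two branches return a list and the third returns a string, so a single column in the final table can hold both types.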

In [223]:
# getting all the links on the "List of Walt Disney Pictures films" page
# and then getting the info box of each of them

r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")
soup = bs(r.content, "html.parser")
movies = soup.select(".wikitable.sortable i a")

base_path = "https://en.wikipedia.org"

movie_info_list = []
for index, movie in enumerate(movies):
    if index % 10 == 0:
        print(index)
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path
        title = movie['title']
        
        movie_info_list.append(get_info_box(full_path))
        
    except Exception as e:
        print(movie.get_text())
        print(e)

0
10
20
30
40
Zorro the Avenger
'NoneType' object has no attribute 'find'
The Sign of Zorro
'NoneType' object has no attribute 'find'
50
60
70
80
90
100
110
120
True-Life Adventures
'NoneType' object has no attribute 'find_all'
130
140
The London Connection
'NoneType' object has no attribute 'find'
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
Better Nate Than Never
'NoneType' object has no attribute 'find_all'
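The two error messages above appear to come from two situations: a page without an info-box (`find_all` called on `None`) and a row that has a th but no td (`find` called on `None`). A defensive variant could check for both. The function `parse_info_box` below is a hypothetical sketch, not the function used in this project; it works on HTML text rather than a URL, keeps cell values as plain text, and matches on the single class "infobox" as an assumption.

```python
from bs4 import BeautifulSoup

def parse_info_box(html):
    """Hypothetical variant of get_info_box that tolerates missing parts."""
    soup = BeautifulSoup(html, "html.parser")
    info_box = soup.find(class_="infobox")
    if info_box is None:
        return None  # the page has no infobox at all

    movie_info = {}
    for index, row in enumerate(info_box.find_all("tr")):
        header = row.find("th")
        cell = row.find("td")
        if header is None:
            continue  # rows without a label (for example the image row)
        if index == 0:
            movie_info["title"] = header.get_text(" ", strip=True)
        elif cell is not None:
            # rows with a label but no value are skipped instead of raising
            movie_info[header.get_text(" ", strip=True)] = cell.get_text(" ", strip=True)
    return movie_info

no_box = parse_info_box("<p>No infobox here</p>")
partial_box = parse_info_box(
    '<table class="infobox vevent">'
    "<tr><th>Some Film</th></tr>"
    "<tr><th>Section heading</th></tr>"
    "<tr><th>Country</th><td>United States</td></tr>"
    "</table>"
)
print(no_box, partial_box)
```

Returning `None` for pages without an info-box lets the calling loop decide whether to skip or log them, instead of relying on a broad `except`.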


In [224]:
# checking the length 
# 5 of the 454 pages did not have the format required to be scraped
len(movie_info_list)


449

In [230]:
# checking a number of items in the list
movie_info_list[300:302]

[{'title': 'Bridge to Terabithia',
  'Directed by': 'Gábor Csupó',
  'Screenplay by': ['David L. Paterson', 'Jeff Stockwell'],
  'Based on': ['Bridge to Terabithia', 'by', 'Katherine Paterson'],
  'Produced by': ['David L. Paterson', 'Lauren Levine', 'Hal Lieberman'],
  'Starring': ['Josh Hutcherson',
   'AnnaSophia Robb',
   'Bailee Madison',
   'Robert Patrick',
   'Zooey Deschanel',
   'Latham Gaines'],
  'Cinematography': 'Michael Chapman',
  'Edited by': 'John Gilbert',
  'Music by': 'Aaron Zigman',
  'Production companies': ['Walt Disney Pictures',
   'Walden Media',
   'Klasky Csupo'],
  'Distributed by': ['Buena Vista Pictures Distribution',
   '(United States)',
   'Summit Entertainment',
   '(International)'],
  'Release date': ['February 16, 2007'],
  'Running time': '95 minutes',
  'Country': 'United States',
  'Language': 'English',
  'Budget': '$20–25 million',
  'Box office': '$137.6 million'},
 {'title': 'Meet the Robinsons',
  'Directed by': 'Stephen Anderson',
  'Scre

In [231]:
# turning the list into a dataframe
df = pd.DataFrame(movie_info_list)

In [232]:
# checking the content of dataframe
df

Unnamed: 0,title,Production company,Release date,Running time,Country,Language,Box office,Directed by,Written by,Based on,...,Screenplay by,Countries,Production companies,Color process,Japanese,Hepburn,Adaptation by,Animation by,Traditional,Simplified
0,Academy Award Review of,Walt Disney Productions,"[May 19, 1937]",41 minutes (74 minutes 1966 release),United States,English,$45.472,,,,...,,,,,,,,,,
1,Snow White and the Seven Dwarfs,Walt Disney Productions,"[December 21, 1937 ( Carthay Circle Theatre )]",83 minutes,United States,English,$418 million,"[David Hand, William Cottrell, Wilfred Jackson...","[Ted Sears, Richard Creedon, Otto Englander, D...","[Snow White, by The, Brothers Grimm]",...,,,,,,,,,,
2,Pinocchio,Walt Disney Productions,"[February 7, 1940 ( Center Theatre ), February...",88 minutes,United States,English,$164 million,"[Ben Sharpsteen, Hamilton Luske, Bill Roberts,...",,"[The Adventures of Pinocchio, by, Carlo Collodi]",...,,,,,,,,,,
3,Fantasia,Walt Disney Productions,"[November 13, 1940]",126 minutes,United States,English,$76.4–$83.3 million (United States and Canada),"[Samuel Armstrong, James Algar, Bill Roberts, ...",,,...,,,,,,,,,,
4,The Reluctant Dragon,Walt Disney Productions,"[June 27, 1941]",74 minutes,United States,English,"$960,000 (worldwide rentals)","[Alfred Werker, (live action), Hamilton Luske,...","[Live-action:, Ted Sears, Al Perkins, Larry Cl...",,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
444,The Little Mermaid,,,,United States,English,,Rob Marshall,,"[Disney, 's, The Little Mermaid, by, Ron Cleme...",...,"[Jane Goldman, David Magee]",,"[Walt Disney Pictures, Lucamar Productions, Ma...",,,,,,,
445,Shrunk,,,,United States,English,,Joe Johnston,,"[Characters, by, Stuart Gordon, ,, Brian Yuzna...",...,Todd Rosenberg,,"[Walt Disney Pictures, Mandeville Films]",,,,,,,
446,Chip 'n Dale: Rescue Rangers,,,,United States,English,,Akiva Schaffer,,"[Chip 'n Dale: Rescue Rangers, by, Tad Stones,...",...,"[Dan Gregor, Doug Mand]",,"[Walt Disney Pictures, Mandeville Films]",,,,,,,
447,Pinocchio,,,,United States,English,,Robert Zemeckis,,"[Disney, 's, Pinocchio, The Adventures of Pino...",...,"[Chris Weitz, Robert Zemeckis]",,"[Walt Disney Pictures, Depth of Field, ImageMo...",,,,,,,


#### 2.5. Saving all the scraped information to a CSV file

Using Pandas, we save the DataFrame to a CSV file.

Part of the CSV file is shown in the picture below.

In [229]:
df.to_csv("disneyfilms.csv")

![](https://i.imgur.com/uoigjZ8.jpg)

Now we have a file that we can use to further analyze the performance of movies.
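One thing to keep in mind when loading the file again: CSV has no list type, so list-valued cells come back as strings. The sketch below shows this with a small made-up DataFrame (`small_df`) written to an in-memory buffer, and converts the strings back with `ast.literal_eval`. It assumes every cell in the column is a list; a column that mixes lists and plain strings, as in the scraped data, would need an extra check before converting.

```python
import ast
import io

import pandas as pd

# Small stand-in for the scraped data with a list-valued column
small_df = pd.DataFrame([
    {"title": "One Little Indian", "Starring": ["James Garner", "Vera Miles"]},
    {"title": "Bridge to Terabithia", "Starring": ["Josh Hutcherson", "AnnaSophia Robb"]},
])

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
small_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the lists are plain strings
type_after_reload = type(reloaded.loc[0, "Starring"])

# ast.literal_eval turns such a string back into a Python list
reloaded["Starring"] = reloaded["Starring"].apply(ast.literal_eval)
print(type_after_reload, reloaded.loc[0, "Starring"])
```

Passing `index=False` here also avoids writing the row index as an extra unnamed column.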

## 3. Ideas for future work

- Scraping the lists of movie Wikipedia pages of other companies
- Analyzing the factors affecting the performance of movies
- Scraping the list of Academy Award-winning movies and their pages to check their performance