## Web Scraping

*In this notebook I scrape the data used in this project: the **List of Walt Disney Pictures films** from Wikipedia, available at [Walt Disney Pictures Films](https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films).*

In [1]:
# import libraries

from bs4 import BeautifulSoup as bs  # parses the HTML of each page
import requests  # sends the HTTP requests

*The goal of this step is to build a list of dictionaries, one per movie, holding the movie title and the details shown in the infobox on the right-hand side of the page, below the image.*

In [2]:
# first we work through a single example, then iterate over all movies

url = "https://en.wikipedia.org/wiki/Dumbo"  # the Dumbo page serves as the example

r = requests.get(url)

# name the parser explicitly so bs4 does not have to guess one
dumbo_page = bs(r.content, "html.parser")


In [10]:
# after loading the page, we inspect it and look for the tags that hold the data we want
info_movie = dumbo_page.find(class_="infobox vevent")
info_movie_ = info_movie.find_all("tr")

movie = {}

def get_content_value(row_data):
    # cells with several names/strings use the "li" tag on some pages
    # and the "b" or "br" tags on others

    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]

    elif row_data.find("b") or row_data.find("br"):
        # collect the text of every link in the cell; values that are not links are skipped
        return [a.get_text(" ", strip=True).replace("\xa0", " ") for a in row_data.find_all("a")]

    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

for index, info in enumerate(info_movie_):

    if index == 0:
        # the first row of the infobox holds the movie title
        movie['title'] = info.get_text(" ", strip=True)

    else:
        header = info.find('th')
        if header:
            content_key = header.get_text(" ", strip=True)
            content_value = get_content_value(info.find("td"))
            movie[content_key] = content_value

movie

{'title': 'Dumbo',
 'Directed by': ['Ben Sharpsteen',
  'Norman Ferguson',
  'Wilfred Jackson',
  'Jack Kinney'],
 'Produced by': 'Walt Disney',
 'Story by': ['Joe Grant', 'Dick Huemer'],
 'Based on': ['Helen Aberson'],
 'Starring': ['Edward Brophy',
  'Herman Bing',
  'Sterling Holloway',
  'Verna Felton',
  'Cliff Edwards',
  'James Baskett',
  'Nick Stewart',
  'Hall Johnson'],
 'Narrated by': 'John McLeish',
 'Music by': ['Frank Churchill', 'Oliver Wallace'],
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'RKO Radio Pictures',
 'Release date': ['October 23, 1941 ( 1941-10-23 ) (New York City) [1]',
  'October 31, 1941 ( 1941-10-31 ) (U.S.)'],
 'Running time': '64 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$950,000 [2]',
 'Box office': '$1.3 million (est. United States/Canada rentals, 1941) [3]'}
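
*The values above still carry footnote markers such as `[1]` and `[2]`. If one wanted to tidy strings like these after scraping, a regular expression is one option. The sketch below is only an illustration: `strip_reference_markers` is a hypothetical helper that is not part of this notebook, and it assumes the markers always look like a number in square brackets.*

```python
import re


def strip_reference_markers(value):
    """Remove footnote markers like "[1]" from a string or a list of strings."""
    if isinstance(value, list):
        return [strip_reference_markers(item) for item in value]
    cleaned = re.sub(r"\s*\[\d+\]", "", value)
    # collapse any double spaces left behind by the removal
    return re.sub(r"\s{2,}", " ", cleaned).strip()


budget = strip_reference_markers("$950,000 [2]")
release = strip_reference_markers([
    "October 23, 1941 ( 1941-10-23 ) (New York City) [1]",
    "October 31, 1941 ( 1941-10-31 ) (U.S.)",
])
print(budget)
print(release)
```

*The helper accepts both shapes that `get_content_value` returns (a single string or a list of strings), so it could be applied to any value in the dictionary.*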

*Now that it works for one example, we can write a function that scrapes every movie link.*

In [13]:
def clean_tags(soup):
    # remove every "sup" and "span" tag from the page
    # (footnote markers such as "[1]" live in "sup" tags)
    for tag in soup.find_all(["sup", "span"]):
        tag.decompose()


def get_movies(url):

    r = requests.get(url)
    soup = bs(r.content, "html.parser")

    info_movie = soup.find(class_="infobox vevent")
    info_movie_ = info_movie.find_all("tr")

    movie_info = {}

    # decompose() edits the tree in place, so the rows collected above are cleaned too
    clean_tags(soup)

    for index, info in enumerate(info_movie_):

        if index == 0:
            movie_info['title'] = info.find("th").get_text(" ", strip=True)

        else:
            header = info.find('th')
            if header:
                content_key = header.get_text(" ", strip=True)
                content_value = get_content_value(info.find("td"))
                movie_info[content_key] = content_value

    return movie_info




In [14]:
url = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"

r = requests.get(url)  # request the main list page
soup = bs(r.content, "html.parser")

movies = soup.select(".wikitable.sortable i a")  # keep only the links to the movie pages

base_path = "https://en.wikipedia.org"  # no trailing slash: each href already starts with "/"
movie_info_list = []

for index, movie in enumerate(movies):

    try:

        relative_path = movie['href']
        title = movie['title']
        full_path = base_path + relative_path

        movie_info_list.append(get_movies(full_path))

    except Exception as e:
        # report the movies whose page could not be parsed and keep going
        print(movie.get_text())
        print(e)

Zorro the Avenger
'NoneType' object has no attribute 'find'
The Sign of Zorro
'NoneType' object has no attribute 'find'
True-Life Adventures
'NoneType' object has no attribute 'find_all'
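
*The three messages above are `AttributeError`s raised when a lookup returned `None`: the `find_all` one occurs when a page has no `infobox vevent` element, and the `find` ones occur when an infobox row has a `th` but no `td`, so `get_content_value` receives `None`. Instead of only printing them, the failures could be stored for later review. Below is a minimal sketch of that idea, under some assumptions: `collect_movies`, `fake_scrape`, and `BASE_URL` are hypothetical names, the example hrefs are made up for illustration, and `fake_scrape` stands in for `get_movies` so the cell runs without network access.*

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"


def collect_movies(links, scrape):
    """Scrape every (title, href) pair and keep a record of the ones that fail.

    `scrape` is any callable that takes a URL and returns a dict; in this
    notebook that role would be played by get_movies.
    """
    collected, failed = [], []
    for title, href in links:
        url = urljoin(BASE_URL, href)  # joins base and path without doubling the slash
        try:
            collected.append(scrape(url))
        except AttributeError as error:
            # a lookup returned None: no infobox, or a row without a "td"
            failed.append({"title": title, "url": url, "error": str(error)})
    return collected, failed


def fake_scrape(url):
    # Stand-in for get_movies (assumption: used only for this demonstration).
    if url.endswith("True-Life_Adventures"):
        raise AttributeError("'NoneType' object has no attribute 'find_all'")
    return {"url": url}


collected, failed = collect_movies(
    [("Dumbo", "/wiki/Dumbo"), ("True-Life Adventures", "/wiki/True-Life_Adventures")],
    fake_scrape,
)
print(len(collected), "collected,", len(failed), "failed")
```

*Keeping the failed titles and URLs in a list makes it possible to revisit those pages by hand or to retry them with a different selector.*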


In [15]:
print(f'Number of movies collected: {len(movie_info_list)}')

Number of movies collected: 438
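
*Beyond the raw count, it may help to check how many movies actually have each infobox field, since not every page lists the same rows. The sketch below shows one possible way to do this with `collections.Counter`. `count_fields` is a hypothetical helper, and the sample list is a tiny illustration (its second entry is invented); in this notebook the input would be `movie_info_list`.*

```python
from collections import Counter


def count_fields(movies):
    """Count in how many movie dictionaries each key appears."""
    counts = Counter()
    for movie in movies:
        counts.update(movie.keys())
    return counts


# Tiny illustrative sample; the second record is made up for the example.
sample = [
    {"title": "Dumbo", "Running time": "64 minutes", "Budget": "$950,000"},
    {"title": "Hypothetical Movie", "Running time": "90 minutes"},
]

field_counts = count_fields(sample)
for field, count in field_counts.most_common():
    print(f"{field}: {count}")
```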


In [16]:
# Save the movie data

import json

def save_data(title, data):
    with open(title, 'w', encoding = 'utf-8') as f:
        json.dump(data, f, ensure_ascii = False, indent = 2)

In [17]:
save_data('Disney_movies.json', movie_info_list)
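
*To reuse the saved file in a later notebook, the data can be read back with `json.load`. The following is a minimal sketch: `load_data` is a hypothetical counterpart to `save_data`, and the round trip uses a temporary file with a one-record sample so that `Disney_movies.json` is left untouched.*

```python
import json
import os
import tempfile


def load_data(title):
    """Read back a JSON file written with json.dump."""
    with open(title, encoding="utf-8") as f:
        return json.load(f)


# Round trip on a temporary file (sample record taken from the Dumbo example).
sample = [{"title": "Dumbo", "Running time": "64 minutes"}]
path = os.path.join(tempfile.mkdtemp(), "sample_movies.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False, indent=2)

restored = load_data(path)
print(restored == sample)
```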