<a href="https://colab.research.google.com/github/rsh456/SchoolOfAIDemo/blob/master/WebScraping_DataLit_W1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Webscraping Demo

### School of AI - DataLit Week 1

#### Any mistakes made here are the sole property of me, Carson Bentley

### Just for fun, let's collect some movie posters from imdb.com

First, import the necessary libraries.

In Google Colab these come preinstalled, but neither 'requests' nor 'bs4' (Beautiful Soup) is part of the Python standard library, so outside Colab you will need to install them first.

In [0]:
from bs4 import BeautifulSoup

In [0]:
import requests

In [0]:
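def request_webpage(url):
  # Fetch the page; raise_for_status() raises an HTTPError for 4xx/5xx responses.
  res = requests.get(url)
  try:
    res.raise_for_status()
  except requests.exceptions.HTTPError as exc:
    print('There was a problem with the request: %s' % (exc))
  # The response is returned even after an error, so callers may want to check res.ok.
  return res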

Next, request the page and keep the response object in a variable.

In [0]:
coming_soon_page = request_webpage('https://www.imdb.com/movies-coming-soon/2019-01/')

Let's take a look at what we got back

In [0]:
coming_soon_page.text

'\n\n\n\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>New Movies Coming Soon - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex =

Beautiful Soup's 'prettify' method will help to make this a bit more human-readable.

In [0]:
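# name the parser explicitly; otherwise Beautiful Soup guesses one and prints a warning
coming_soon_soup = BeautifulSoup(coming_soon_page.text, 'html.parser')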
print(coming_soon_soup.prettify())

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <title>
   New Movies Coming Soon - IMDb
  </title>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
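To see what 'prettify' does on something small, here is a minimal sketch that parses a short, made-up HTML string with Python's built-in 'html.parser' backend. The sample markup and the variable names are assumptions for illustration only, not IMDb content.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet (an assumption for illustration, not IMDb markup).
sample_html = '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'

# Passing the parser name explicitly avoids the "no parser was explicitly specified" warning.
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns a string with each tag and each piece of text on its own line,
# indented according to nesting depth.
pretty = sample_soup.prettify()
print(pretty)
```

The one-line input comes back spread over several indented lines, which is the same transformation applied to the IMDb page above.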
 

Next, let's find the images.

You can locate an HTML element by right-clicking on the webpage and selecting 'Inspect'.

The code below is looking for an element like this: 

< div class="list detail" >... (content)...< /div >

In [0]:
details = coming_soon_soup.find('div', attrs = {'class': 'list detail'})
image_details = details.find_all('img')

# this list comprehension gets the 'src' attribute (the URL) of each poster,
# while filtering out the rating icons;
# .get('class', []) avoids a KeyError for images that have no class attribute
image_list = [x['src'] for x in image_details if 'poster' in x.get('class', [])]
image_list

['https://m.media-amazon.com/images/M/MV5BMjQ2NDMwMTY3MF5BMl5BanBnXkFtZTgwNDg5OTc1NjM@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMzJlMjU0YjYtYTkxNy00YThmLWJlY2ItMDIxZjZhNGVkOTFiXkEyXkFqcGdeQXVyMTkzNTY1NTU@._V1_UY209_CR11,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BYTAzYzI1ZDYtNDUwYS00OTg4LTg4N2ItMTk5MTUxYzgyMGM4XkEyXkFqcGdeQXVyMTk2MDc1MjQ@._V1_UY209_CR3,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BODI5Y2I0MTktZDRkNS00NWFjLTg1MmEtNWFkMzEwOGM0ZDE3XkEyXkFqcGdeQXVyNTA1NjYyMDk@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BNzY3NzYyNjI0N15BMl5BanBnXkFtZTgwNjYzMDc0NjM@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMTg5MjcwMzY5OV5BMl5BanBnXkFtZTgwMDM0OTI1NjM@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BOTdkNDliOWQtOGE3YS00ZmE4LWE3YTAtZDZkNTU0ZDFlMWNiXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY209_CR3,0,140,209_AL_.jpg',
 'https://m.media-amazon
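The same 'find' / 'find_all' pattern can be tried offline. The sketch below runs it against a made-up snippet that only mimics the structure described above; the markup, the example.com URLs, and the class names other than 'list detail' and 'poster' are assumptions.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above (not real IMDb HTML).
sample_html = '''
<div class="list detail">
  <img class="poster shadowed" src="https://example.com/a.jpg" alt="Movie A (2019) Poster">
  <img class="rating-icon" src="https://example.com/star.png" alt="star">
  <img src="https://example.com/no-class.png" alt="no class attribute">
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching element; find_all() returns every match inside it.
details = soup.find('div', attrs={'class': 'list detail'})
images = details.find_all('img')

# 'class' is a multi-valued attribute, so Beautiful Soup exposes it as a list.
# .get('class', []) avoids a KeyError for tags that have no class at all.
poster_urls = [img['src'] for img in images if 'poster' in img.get('class', [])]
print(poster_urls)
```

Only the image whose class list contains 'poster' survives the filter; the icon and the class-less image are skipped.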

We can get the full-size image URLs by removing everything between '_V1_' and '.jpg'.

In [0]:
image_url = image_list[0]
slice_index = image_url.find('_V1_')
full_size_image_url = image_url[:slice_index] + '_V1_.jpg'

In [0]:
full_size_image_url

'https://m.media-amazon.com/images/M/MV5BMjQ2NDMwMTY3MF5BMl5BanBnXkFtZTgwNDg5OTc1NjM@._V1_.jpg'
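The slicing step can be wrapped in a small function. This is only a sketch: 'full_size_url' is a hypothetical helper name, and the guard for a missing marker is an addition, since 'str.find' returns -1 when '_V1_' is absent and the plain slice would then silently drop the last character of the URL.

```python
def full_size_url(thumbnail_url):
    """Drop the resize/crop suffix that follows '_V1_' (hypothetical helper)."""
    marker = '_V1_'
    index = thumbnail_url.find(marker)
    if index == -1:
        # str.find returns -1 when the marker is missing; leave such URLs untouched.
        return thumbnail_url
    return thumbnail_url[:index] + marker + '.jpg'


thumb = ('https://m.media-amazon.com/images/M/'
         'MV5BMjQ2NDMwMTY3MF5BMl5BanBnXkFtZTgwNDg5OTc1NjM@._V1_UY209_CR0,0,140,209_AL_.jpg')
print(full_size_url(thumb))
```

Applied to the first thumbnail URL from the list above, it produces the same full-size URL as the cell output.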

In [0]:
img_res = request_webpage(full_size_image_url)

In [0]:
# 'with' closes the file automatically, even if a write fails
with open('MoviePoster0.jpg', 'wb') as imageFile:
  for chunk in img_res.iter_content(100000):
    imageFile.write(chunk)

You can find files (in colab) by clicking the '>' icon on the top-left side of the screen. You may need to refresh.
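The chunked write can be tried without any network access. In this sketch, 'save_chunks' is a hypothetical helper and the list of byte strings stands in for what 'img_res.iter_content(100000)' would yield; the bytes themselves are placeholders, not a real image.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path (hypothetical helper)."""
    # The with-statement closes the file even if a write raises.
    with open(path, 'wb') as image_file:
        for chunk in chunks:
            image_file.write(chunk)


# Stand-in for img_res.iter_content(100000): any iterable of bytes objects works.
fake_chunks = [b'\xff\xd8', b'fake image bytes', b'\xff\xd9']

demo_dir = tempfile.mkdtemp()  # scratch folder so the demo leaves the working directory alone
demo_path = os.path.join(demo_dir, 'MoviePoster0.jpg')
save_chunks(fake_chunks, demo_path)
print(os.path.getsize(demo_path))
```

Writing in chunks keeps memory use bounded for large downloads, since only one chunk is held at a time.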

## Challenge 1: 

Find the names of the movies (for this month). 

Put them in a list.

## (Solution)

In [0]:
## hide me!!!!

# these two lines are just a reminder, since we've already run them above:
# details = coming_soon_soup.find('div', attrs = {'class': 'list detail'})
# image_details = details.find_all('img')

name_list = [x['alt'] for x in image_details if 'poster' in x['class']]
name_list

['Escape Room (2019) Poster',
 'Great Great Great (2017) Poster',
 'Yun nan chong gu (2018) Poster',
 'The Nun (1966) Poster',
 'The Upside (2017) Poster',
 "A Dog's Way Home (2019) Poster",
 'Perfectos desconocidos (2017) Poster',
 'Replicas (2018) Poster',
 'The Untold Story (2019) Poster',
 'SGT. Will Gardner (2019) Poster',
 'The Aspern Papers (2018) Poster',
 'Glass (2019) Poster',
 'Las herederas (2018) Poster',
 'An Acceptable Loss (2018) Poster',
 'Adult Life Skills (2016) Poster',
 'Serenity (2019) Poster',
 'The Kid Who Would Be King (2019) Poster',
 'Die Unsichtbaren (2017) Poster',
 'Werk ohne Autor (2018) Poster',
 'The Final Wish (2018) Poster',
 'In Like Flynn (2018) Poster',
 'Jihadists (2016) Poster',
 'Bricked (2018) Poster']
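Each entry still carries the ' Poster' suffix and the release year. If only the bare titles are wanted, one possible cleanup is sketched below; 'clean_title' is a hypothetical helper, and it assumes the alt text always follows the 'Title (YYYY) Poster' pattern seen in the output above.

```python
import re


def clean_title(alt_text):
    """Split 'Title (YYYY) Poster' into (title, year) (hypothetical helper)."""
    text = alt_text
    if text.endswith(' Poster'):
        text = text[:-len(' Poster')]
    # Match 'Title (YYYY)' and capture the two parts separately.
    match = re.fullmatch(r'(.*) \((\d{4})\)', text)
    if match:
        return match.group(1), int(match.group(2))
    # Fall back to the trimmed text when no year is present.
    return text, None


sample_names = ['Escape Room (2019) Poster', "A Dog's Way Home (2019) Poster", 'Glass (2019) Poster']
print([clean_title(name) for name in sample_names])
```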

## Challenge 2: 

Collect all of the movie posters (for this month).

Put them in a folder.

In [0]:
# hint:
import os 

## (Solution)

In [0]:
## hide me!!!!

current_date = '2019-01'
try:
  os.makedirs(current_date)
except FileExistsError:
  # only swallow the "folder already exists" case; other errors should still surface
  print('failed gracefully, you probably already made the folder')

failed gracefully, you probably already made the folder
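An alternative to catching the error is the 'exist_ok' parameter of 'os.makedirs'. The sketch below uses a temporary directory as a scratch location, which is an assumption made so the example does not touch the notebook's working folder.

```python
import os
import tempfile

base_dir = tempfile.mkdtemp()  # scratch location so the demo leaves no clutter
month_dir = os.path.join(base_dir, '2019-01')

# exist_ok=True makes the call a no-op when the folder is already there,
# while other problems (such as permission errors) still raise.
os.makedirs(month_dir, exist_ok=True)
os.makedirs(month_dir, exist_ok=True)  # the second call does not raise

print(os.path.isdir(month_dir))
```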


In [0]:
for i, image_url in enumerate(image_list):
  slice_index = image_url.find('_V1_')
  full_size_image_url = image_url[:slice_index] + '_V1_.jpg'
  img_res = request_webpage(full_size_image_url)
  try:
    # 'with' closes the file even if a write fails
    with open(os.path.join(current_date, name_list[i] + '.jpg'), 'wb') as imageFile:
      for chunk in img_res.iter_content(100000):
        imageFile.write(chunk)
  except Exception as exc:
    print('There was a problem with writing the file for %s: %s' % (name_list[i], exc))

print('All Finished')

All Finished


## Challenge 3:

Collect a year's worth of movie posters, placing them in folders by month.


(Feel free to explore the past; the interface lets you go back as far as January 2011.)

## (Solution)

In [0]:
## hide me!!!!

def collect_media_info(date):
  url = 'https://www.imdb.com/movies-coming-soon/' + date + '/'
  soup = BeautifulSoup(request_webpage(url).text, 'html.parser')
  
  details = soup.find('div', attrs = {'class': 'list detail'})
  image_details = details.find_all('img')
  
  image_list = [x['src'] for x in image_details if 'poster' in x['class']]
  name_list = [x['alt'] for x in image_details if 'poster' in x['class']]
  return (image_list, name_list)

In [0]:
def download_month_of_posters(images, names, date):
  try:
    os.makedirs(date)
  except FileExistsError:
    print('failed gracefully, you probably already made the folder')

  for i, image_url in enumerate(images):
    slice_index = image_url.find('_V1_')
    full_size_image_url = image_url[:slice_index] + '_V1_.jpg'
    img_res = request_webpage(full_size_image_url)
    # file names can't contain a slash, so swap it for a dash
    name = names[i].replace('/', '-')
    try:
      # 'with' closes the file even if a write fails
      with open(os.path.join(date, name + '.jpg'), 'wb') as imageFile:
        for chunk in img_res.iter_content(100000):
          imageFile.write(chunk)
    except Exception as exc:
      print('There was a problem with writing the file for %s: %s' % (names[i], exc))

  print('All Finished with %s' % (date))
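The slash is not the only character that can cause trouble in a file name. As a sketch, the replacement step could be generalized as below; 'safe_filename' is a hypothetical helper, and the character set is an assumption based on the characters Windows reserves in addition to the '/' separator.

```python
import re


def safe_filename(name, replacement='-'):
    """Replace characters that are unsafe in file names (hypothetical helper)."""
    # '/' is the path separator on Linux/macOS; the others are reserved on Windows.
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name)
    return cleaned.strip()


print(safe_filename('Face/Off (1997) Poster'))
```

The example title is made up for the demonstration; any poster name taken from the alt text could be passed through the same function before building the path.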

In [0]:
# month numbers for the URLs
month_nums = [str(x+1).zfill(2) for x in range(12)]
month_nums

['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
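The zero-padded month strings can also be combined with the year in one step using a format spec. In this sketch 'month_labels' is a hypothetical helper name; ':02d' pads to two digits in the same way as 'zfill(2)'.

```python
def month_labels(year):
    """Return the twelve 'YYYY-MM' strings for a year (hypothetical helper)."""
    # The :02d format spec pads the month number to two digits.
    return [f'{year}-{month:02d}' for month in range(1, 13)]


labels = month_labels(2019)
print(labels[0], labels[-1])
```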

In [0]:
year = '2019'
for ii in month_nums:
  date = year + '-' + ii
  images, names = collect_media_info(date)
  download_month_of_posters(images, names, date)

failed gracefully, you probably already made the folder
All Finished with 2019-01
All Finished with 2019-02
All Finished with 2019-03
All Finished with 2019-04
All Finished with 2019-05
All Finished with 2019-06
All Finished with 2019-07
All Finished with 2019-08
All Finished with 2019-09
All Finished with 2019-10
All Finished with 2019-11
All Finished with 2019-12


## Challenge 4:

Zip all the files you have collected so that you can download them to your local machine (without a million clicks).

See this article for how to go about zipping: https://www.geeksforgeeks.org/working-zip-files-python/

In [0]:
# hint 1:
from zipfile import ZipFile 

In [0]:
# hint 2:
# code for deleting a directory
# in case you make a mistake and want to clean up: 
!rm -rf '2018-01' # '2018-01' is a folder of files to delete

## (Solution)

In [0]:
## hide me!!!!

def get_file_paths(directory): 
  file_paths = []
  files = os.listdir(directory)
  for filename in files: 
    filepath = os.path.join(directory, filename)
    file_paths.append(filepath)
  return file_paths 
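Once the paths are collected, writing them into an archive and reading the archive back is a quick way to check the result. The sketch below is self-contained and works in a temporary directory; the folder name, file names, and placeholder bytes are assumptions standing in for the downloaded posters.

```python
import os
import tempfile
from zipfile import ZipFile

# Build a throwaway folder with two small files (stand-ins for downloaded posters).
work_dir = tempfile.mkdtemp()
month_dir = os.path.join(work_dir, '2019-01')
os.makedirs(month_dir)
for file_name in ['a.jpg', 'b.jpg']:
    with open(os.path.join(month_dir, file_name), 'wb') as handle:
        handle.write(b'placeholder')

archive_path = os.path.join(work_dir, 'posters.zip')
with ZipFile(archive_path, 'w') as archive:
    for file_name in sorted(os.listdir(month_dir)):
        full_path = os.path.join(month_dir, file_name)
        # arcname stores a path relative to work_dir instead of the full temp path.
        archive.write(full_path, arcname=os.path.relpath(full_path, work_dir))

# Reopen the archive to confirm what was stored.
with ZipFile(archive_path) as archive:
    stored_names = archive.namelist()
print(stored_names)
```

Passing 'arcname' keeps the month folder inside the archive while leaving out the machine-specific parent directories.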

In [0]:
year = '2019'
cwd_file_paths = []
for i in range(12):
  date = year + '-' + str(i+1).zfill(2)
  cwd_file_paths += get_file_paths(date)

In [0]:
print('The following files will be zipped:') 
for file_name in cwd_file_paths: 
  print(file_name) 
  
# use a name other than 'zip' so the built-in zip() function is not shadowed
with ZipFile(year+'-movie-posters.zip','w') as zip_file: 
  for file in cwd_file_paths:
    zip_file.write(file)
  
print('All files zipped successfully!')

The following files will be zipped:
2019-01/Perfectos desconocidos (2017) Poster.jpg
2019-01/Las herederas (2018) Poster.jpg
2019-01/Serenity (2019) Poster.jpg
2019-01/Adult Life Skills (2016) Poster.jpg
2019-01/Yun nan chong gu (2018) Poster.jpg
2019-01/The Aspern Papers (2018) Poster.jpg
2019-01/The Untold Story (2019) Poster.jpg
2019-01/The Upside (2017) Poster.jpg
2019-01/A Dog's Way Home (2019) Poster.jpg
2019-01/Jihadists (2016) Poster.jpg
2019-01/The Nun (1966) Poster.jpg
2019-01/Werk ohne Autor (2018) Poster.jpg
2019-01/An Acceptable Loss (2018) Poster.jpg
2019-01/Escape Room (2019) Poster.jpg
2019-01/The Kid Who Would Be King (2019) Poster.jpg
2019-01/Glass (2019) Poster.jpg
2019-01/In Like Flynn (2018) Poster.jpg
2019-01/Bricked (2018) Poster.jpg
2019-01/Replicas (2018) Poster.jpg
2019-01/The Final Wish (2018) Poster.jpg
2019-01/Die Unsichtbaren (2017) Poster.jpg
2019-01/SGT. Will Gardner (2019) Poster.jpg
2019-01/Great Great Great (2017) Poster.jpg
2019-02/How to Train Your 