# Web Scraping Lab

This notebook contains a set of web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.
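
The first tip can be sketched with the standard library alone. The helper names `describe_status` and `is_success` are illustrative assumptions, not part of `requests`; with `requests` you would pass `response.status_code` to them.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes that the standard library does not know about
        return f"{code} (unknown status)"
    return f"{code} {status.phrase}"


def is_success(code: int) -> bool:
    """True for any 2xx status code."""
    return 200 <= code < 300


for code in (200, 404, 503):
    verdict = "ok" if is_success(code) else "not ok"
    print(describe_status(code), "->", verdict)
```

Checking the whole 2xx range rather than only `200` and `404` means an unexpected status (a redirect loop, a rate limit, a server error) is reported instead of passing silently.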

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
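# your code here
website = requests.get(url)
if website.status_code == 200:
    print('Success!')
elif website.status_code == 404:
    print('Not Found.')
else:
    # Report any other status instead of printing nothing
    print(f'Unexpected status: {website.status_code}')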

Success!


In [4]:
soup = BeautifulSoup(website.content, "html.parser")  # name the parser explicitly

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-0c343b529849.css" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/assets/dar

#### 1. Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain any HTML tags.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can do this with Chrome DevTools or by clicking 'Inspect' in any browser. Here is an example:

![title](example_1.png)

2. Use BeautifulSoup `find_all()` to extract all the HTML elements that contain the developer names. Hint: pass in the `attrs` parameter to specify the class.

3. Loop through the elements found and get the text for each of them.

4. While you are at it, use string manipulation techniques to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names. Hint: you may also use `.get_text()` instead of `.text` and pass in the desired parameters to do some string manipulation (check the documentation).

5. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
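
A minimal sketch of steps 2 to 4, run on a small hand-written HTML sample rather than the live page. The sample markup is an assumption modelled on the `h1` elements GitHub returned when this notebook was run; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that imitates the structure of the trending page
SAMPLE_HTML = """
<article>
  <h1 class="h3 lh-condensed">
    <a href="/rwightman">
            Ross Wightman
    </a> </h1>
</article>
<article>
  <h1 class="h3 lh-condensed">
    <a href="/atomiks">
            atomiks
    </a> </h1>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 2: select the elements by tag and class
headings = soup.find_all("h1", attrs={"class": "h3 lh-condensed"})

# Steps 3 and 4: get_text(strip=True) trims surrounding whitespace and line breaks
names = [heading.get_text(strip=True) for heading in headings]
print(names)
```

`get_text(strip=True)` removes the padding in one call, so no chain of `replace()` calls is needed.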

In [6]:
# your code here
names = soup.find_all("h1", attrs = {"class":"h3 lh-condensed"})
names

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":5702664,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="04aeedfbc9de260dd52f38f63f45292d51045204103a7d974bfd1bbc2ba6ba1c" data-view-component="true" href="/rwightman">
             Ross Wightman
 </a> </h1>,
 <h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":22450188,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="3a4c76186f636f200e4c5f60e58d95df08ed668fc6a1e7dea3f609335de2c55e" data-view-component="true" href="/atomiks">
             atomiks
 </a> </h1>,
 <h1 class=

In [7]:
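# Caution: `list` shadows Python's built-in; the name is kept because a later cell reads it
list = []
for name in names:
    name2 = name.find_all("a")
    for line in name2:
        # Only line breaks are removed here, so the leading spaces stay in the output
        list.append(line.text.replace("\n", ""))

list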

['            Ross Wightman',
 '            atomiks',
 '            Himself65',
 '            Xiaoyu Zhang',
 '            Florian Rival',
 '            Payton Swick',
 '            Stephen Celis',
 '            Lee Robinson',
 '            Dominic Gannaway',
 '            Azure SDK Bot',
 '            Massimiliano Pippi',
 '            Rafał Cieślak',
 '            Nicolas Gallagher',
 '            Henrik Rydgård',
 '            Casey Rodarmor',
 '            Paul Querna',
 '            Manu MA',
 '            Phil Ewels',
 '            Janosh Riebesell',
 '            Jason2866',
 '            R.I.Pienaar',
 '            Nolan Lawson',
 '            Lianmin Zheng',
 '            Wangchong Zhou',
 '            Pandapip1']

In [8]:
tags = soup.find_all("p", attrs = {"class":"f4 text-normal mb-1"})
tags

[<p class="f4 text-normal mb-1">
 <a class="Link--secondary" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":5702664,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="04aeedfbc9de260dd52f38f63f45292d51045204103a7d974bfd1bbc2ba6ba1c" data-view-component="true" href="/rwightman">
               rwightman
 </a> </p>,
 <p class="f4 text-normal mb-1">
 <a class="Link--secondary" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":14026360,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="86f055e7e27489565206c1828dae543f62e0266f4cb0035352d6a817b977d202" data-view-component="true" href="/Hims

In [9]:
lis = []
for value in tags:
    lis.append(value.text)


test1 = [value.replace('             ', "") for value in lis]
test2 = [value.replace('\n\n', "") for value in test1]
test3 = [value.replace('\n', "") for value in test2]
test3

[' rwightman ',
 ' Himself65 ',
 ' BBuf ',
 ' 4ian ',
 ' sirbrillig ',
 ' stephencelis ',
 ' leerob ',
 ' trueadm ',
 ' azure-sdk ',
 ' masci ',
 ' ravicious ',
 ' necolas ',
 ' hrydgard ',
 ' casey ',
 ' pquerna ',
 ' manucorporat ',
 ' ewels ',
 ' janosh ',
 ' Jason2866 ',
 ' ripienaar ',
 ' nolanlawson ',
 ' merrymercy ',
 ' fffonion ']

In [10]:
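# initializing lists
test_keys = list      # display names (25 entries)
test_values = test3   # usernames (23 entries)

# to convert lists to dictionary
# Caution: zip() pairs items by position and stops at the shorter list. The two
# lists differ in length, so most names end up paired with the wrong username.
res = dict(zip(test_keys, test_values))

res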

{'            Ross Wightman': ' rwightman ',
 '            atomiks': ' Himself65 ',
 '            Himself65': ' BBuf ',
 '            Xiaoyu Zhang': ' 4ian ',
 '            Florian Rival': ' sirbrillig ',
 '            Payton Swick': ' stephencelis ',
 '            Stephen Celis': ' leerob ',
 '            Lee Robinson': ' trueadm ',
 '            Dominic Gannaway': ' azure-sdk ',
 '            Azure SDK Bot': ' masci ',
 '            Massimiliano Pippi': ' ravicious ',
 '            Rafał Cieślak': ' necolas ',
 '            Nicolas Gallagher': ' hrydgard ',
 '            Henrik Rydgård': ' casey ',
 '            Casey Rodarmor': ' pquerna ',
 '            Paul Querna': ' manucorporat ',
 '            Manu MA': ' ewels ',
 '            Phil Ewels': ' janosh ',
 '            Janosh Riebesell': ' Jason2866 ',
 '            Jason2866': ' ripienaar ',
 '            R.I.Pienaar': ' nolanlawson ',
 '            Nolan Lawson': ' merrymercy ',
 '            Lianmin Zheng': ' fffonion '}
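
One way to avoid the positional mismatch visible above is to read the name and the username from the same parent element. The sketch below assumes each developer sits in its own `<article>` element, which is an assumption about the page layout, and runs on a hand-written sample.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second developer has no separate username element
SAMPLE_HTML = """
<article>
  <h1 class="h3 lh-condensed"><a href="/rwightman"> Ross Wightman </a></h1>
  <p class="f4 text-normal mb-1"><a href="/rwightman"> rwightman </a></p>
</article>
<article>
  <h1 class="h3 lh-condensed"><a href="/atomiks"> atomiks </a></h1>
</article>
<article>
  <h1 class="h3 lh-condensed"><a href="/Himself65"> Himself65 </a></h1>
  <p class="f4 text-normal mb-1"><a href="/Himself65"> Himself65 </a></p>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

developers = {}
for card in soup.find_all("article"):
    name_tag = card.find("h1", attrs={"class": "h3 lh-condensed"})
    user_tag = card.find("p", attrs={"class": "f4 text-normal mb-1"})
    if name_tag is None:
        continue  # not a developer card
    name = name_tag.get_text(strip=True)
    # A missing username is recorded as None instead of shifting later entries
    developers[name] = user_tag.get_text(strip=True) if user_tag else None

print(developers)
```

Because both values come from the same card, a developer without a username no longer shifts every later pair.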

#### 1.1. Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [11]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [12]:
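# your code here
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

names1 = soup.find_all("h1", attrs={"class": "h3 lh-condensed"})
list1 = []
for name in names1:
    links = name.find_all("a")  # renamed from `id`, which shadows a built-in
    for line in links:
        # Caution: the trailing [:-1] drops the last character of each name,
        # which is why the names in the output are cut short
        list1.append(line.text.replace("\n", "")[:-1])

list1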

['        openai /      openai-cookboo',
 '        microsoft /      unil',
 '        public-apis /      public-api',
 '        open-mmlab /      mmdetectio',
 '        wagtail /      wagtai',
 '        open-mmlab /      mmclassificatio',
 '        google /      ja',
 '        triple-Mu /      YOLOv8-TensorR',
 '        carson-katri /      dream-texture',
 '        Sanster /      lama-cleane',
 '        cosmicpb /      FascistFre',
 '        OpenEthan /      SMSBoo',
 '        apache /      airflo',
 '        ultralytics /      yolov',
 '        matplotlib /      matplotli',
 '        ansible /      aw',
 '        pandas-dev /      panda',
 '        mli /      autocu',
 '        hwchase17 /      langchai',
 '        unifyai /      iv',
 '        oegedijk /      explainerdashboar',
 '        dagster-io /      dagste',
 '        coqui-ai /      TT',
 '        django /      djang',
 '        Project-MONAI /      MONA']
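
The truncated names above come from slicing off the last character. A safer clean-up, sketched here on a made-up string, collapses every run of whitespace instead; `clean_repo_name` and the example owner and repository are hypothetical.

```python
def clean_repo_name(raw: str) -> str:
    """Collapse all runs of whitespace (spaces, tabs, line breaks) into single spaces."""
    return " ".join(raw.split())


# Hypothetical text shaped like what `.text` returns for a repository link
raw = "\n        example-owner /\n\n      example-repo\n"
print(clean_repo_name(raw))
```

`str.split()` with no arguments splits on any whitespace and discards empty pieces, so nothing other than whitespace is ever removed.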

#### 2. Display all the image links from the Walt Disney Wikipedia page.
Hint: use `.get()` to access information inside tags. Check out the documentation.

In [13]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [14]:
# your code here
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly

In [15]:
images = soup.find_all('img')
print(images)

[<img alt="Featured article" data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>, <img alt="Extended-protected article" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/30px-Extended-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/40px-Extended-protection-shackle.svg.png 2x" width="20"/>, <img alt="Walt Disney 1946.JPG" data-file-height="675" data-file-width="4

In [16]:
img = [tag.get("src") for tag in images]
display(img)

['//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 '//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 '//upload.wikime
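
The `src` values above start with `//`, so they are protocol-relative rather than complete links. A minimal sketch of turning them into absolute URLs with the standard library follows; `absolute_image_urls` is an illustrative helper name and the third sample path is hypothetical.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Walt_Disney"


def absolute_image_urls(srcs, page_url=PAGE_URL):
    """Resolve each src against the page URL, skipping missing values."""
    return [urljoin(page_url, src) for src in srcs if src]


srcs = [
    "//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png",
    None,  # .get("src") returns None for an <img> tag without a src attribute
    "/static/images/example.png",  # hypothetical site-relative path
]

for link in absolute_image_urls(srcs):
    print(link)
```

`urljoin` borrows the scheme from the page URL for `//...` values and the scheme plus host for `/...` values.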

#### 2.1. List all language names and number of related articles in the order they appear in wikipedia.org.

In [17]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly

In [18]:
# your code here
lang_names = soup.find_all('a', class_='link-box')  # find_all replaces the older findAll alias

for lang in lang_names:
    print(lang.get_text())


English
6 585 000+ articles


日本語
1 354 000+ 記事


Русский
1 875 000+ статей


Français
2 477 000+ articles


Deutsch
2 751 000+ Artikel


Español
1 823 000+ artículos


Italiano
1 785 000+ voci


中文
1 323 000+ 条目 / 條目


فارسی
941 000+ مقاله


Português
1 096 000+ artigos
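
To separate each language name from its article count, the printed text can be split into lines and the digits extracted. This is a sketch that assumes every block has the two-line shape shown above; `parse_link_box` is a hypothetical helper.

```python
import re


def parse_link_box(text: str):
    """Split a link-box text block into (language name, article count)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    name, detail = lines[0], lines[1]
    # Keep only the digits before the "+" sign; \D also removes non-breaking spaces
    digits = re.sub(r"\D", "", detail.split("+")[0])
    return name, int(digits)


sample_blocks = [
    "\nEnglish\n6 585 000+ articles\n",
    "\nDeutsch\n2 751 000+ Artikel\n",
]

for block in sample_blocks:
    print(parse_link_box(block))
```

Converting the count to an integer makes it possible to sort or compare the languages later.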



#### 2.2. Display the top 10 languages by number of native speakers stored in a pandas dataframe.
Hint: After finding the correct table you want to analyse, you can use a nested **for** loop to find the elements row by row (check out the 'td' and 'tr' tags). <br>An easier way is to use `pd.read_html()`; see the documentation [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).

In [34]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly

In [44]:
# your code here
table = soup.find_all("a", attrs={"class": "mw-redirect"})
table  # the language links start at index 7 of this list

[<a class="mw-redirect" href="/wiki/Native_speaker" title="Native speaker">native speakers</a>,
 <a class="mw-redirect" href="/wiki/Mutually_intelligible" title="Mutually intelligible">mutually intelligible</a>,
 <a class="mw-redirect" href="/wiki/Wikipedia:NOTRS" title="Wikipedia:NOTRS"><span title="times out (December 2021)">better source needed</span></a>,
 <a class="mw-redirect" href="/wiki/Arabic_language" title="Arabic language">Arabic</a>,
 <a class="mw-redirect" href="/wiki/SIL_Ethnologue" title="SIL Ethnologue"><i>Ethnologue</i></a>,
 <a class="mw-redirect" href="/wiki/Arabic_language" title="Arabic language">Arabic</a>,
 <a class="mw-redirect" href="/wiki/Pashto_language" title="Pashto language">Pashto</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:eng" title="ISO 639:eng">English</a>,
 <a class="mw-redi

In [110]:
languages = [i.text for i in table]
top10 = languages[7:17]
top10

['Mandarin Chinese',
 'Spanish',
 'English',
 'Hindi',
 'Bengali',
 'Portuguese',
 'Russian',
 'Japanese',
 'Yue Chinese',
 'Vietnamese']

In [106]:
import re
pattern = r"<td>\d+(?:\.\d+)?"

a = str(soup.find_all("td"))
result = re.findall(pattern, a)

In [107]:
# `result` already holds the matches from the previous cell
print(result)

['<td>920', '<td>475', '<td>373', '<td>344', '<td>234', '<td>232', '<td>154', '<td>125', '<td>85.2', '<td>84.6', '<td>83.1', '<td>82.7', '<td>82.2', '<td>81.8', '<td>81.7', '<td>79.9', '<td>78.4', '<td>75.6', '<td>74.8', '<td>70.2', '<td>68.3', '<td>66.4', '<td>64.8', '<td>57.0', '<td>56.4', '<td>52.3', '<td>50.8', '<td>1', '<td>12.3', '<td>2', '<td>6.0', '<td>3', '<td>5.1', '<td>3', '<td>5.1', '<td>5', '<td>3.5', '<td>6', '<td>3.3', '<td>7', '<td>3.0', '<td>8', '<td>2.1', '<td>9', '<td>1.7', '<td>10', '<td>1.3', '<td>11', '<td>1.1']


In [113]:
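# Strip the leading "<td>" from every match
result = [value.replace("<td>", "") for value in result]

# The first ten matches are the speaker counts of the ten largest languages
native_10 = result[0:10]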

In [116]:
import pandas as pd

df1 = pd.DataFrame({'Language': top10})
df2 = pd.DataFrame({'Native speakers (millions)': native_10})
df = pd.concat([df1, df2], axis=1)
df

| | Language | Native speakers (millions) |
|---|---|---|
| 0 | Mandarin Chinese | 920.0 |
| 1 | Spanish | 475.0 |
| 2 | English | 373.0 |
| 3 | Hindi | 344.0 |
| 4 | Bengali | 234.0 |
| 5 | Portuguese | 232.0 |
| 6 | Russian | 154.0 |
| 7 | Japanese | 125.0 |
| 8 | Yue Chinese | 85.2 |
| 9 | Vietnamese | 84.6 |
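
The row-by-row approach from the hint keeps each language on the same row as its speaker count, instead of pairing two separately extracted lists. The sketch below runs on a hand-written table; the `wikitable` class and the two-column layout are assumptions about the real page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, simplified table using three values from the output above
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Language</th><th>Native speakers (millions)</th></tr>
  <tr><td><a href="#">Mandarin Chinese</a></td><td>920</td></tr>
  <tr><td><a href="#">Spanish</a></td><td>475</td></tr>
  <tr><td><a href="#">English</a></td><td>373</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", attrs={"class": "wikitable"})

rows = []
for tr in table.find_all("tr"):          # outer loop: one table row at a time
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]  # inner loop: cells
    if len(cells) == 2:                  # the header row has no <td> cells and is skipped
        rows.append(cells)

columns = ["Language", "Native speakers (millions)"]
df = pd.DataFrame(rows, columns=columns)
df[columns[1]] = pd.to_numeric(df[columns[1]])  # store the counts as numbers, not strings
print(df.head(10))
```

Reading both values from the same `<tr>` means a missing or extra link elsewhere on the page cannot shift the pairing.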


#### 3. Display IMDb's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.
Hint: If you hover over the title of the movie, you should see the director's name. Can you find where it's stored in the HTML?

In [136]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [137]:
# your code here
response = requests.get(url)
response.status_code

200

In [138]:
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly
soup.text

"\n\n\n\n\n\nTop 250 Movies - IMDb\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTop 250 Movies\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMenuTrendingTop 250 MoviesMost Popular MoviesTop 250 TV ShowsMost Popular TV ShowsMost Popular Video GamesMost Popular Music VideosMost Popular PodcastsMoviesRelease CalendarBrowse Movies by GenreTop Box OfficeShowtimes & TicketsMovie NewsIndia Movie SpotlightTV ShowsWhat's on TV & StreamingBrowse TV Shows by GenreTV NewsIndia TV SpotlightWatchWhat to WatchLatest TrailersIMDb OriginalsIMDb PicksIMDb PodcastsAwards & EventsOscarsBest Picture WinnersBest Picture WinnersEmmysSTARmeter AwardsSan Diego Comic-ConNew York Comic-ConSundance Film FestivalToronto Int'l Film FestivalAwards CentralFestival CentralAll EventsCelebsBorn TodayMost Popular CelebsMost Popular CelebsCelebrity NewsCommunityHelp CenterContributor ZonePollsFor Industry ProfessionalsAllAllTitlesTV EpisodesCelebsCompaniesKeywordsAdvanced S

In [142]:
table = soup.find_all("td", attrs={"class":"titleColumn"})
table

[<td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Os Condenados de Shawshank</a>
 <span class="secondaryInfo">(1994)</span>
 </td>,
 <td class="titleColumn">
       2.
       <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">O Padrinho</a>
 <span class="secondaryInfo">(1972)</span>
 </td>,
 <td class="titleColumn">
       3.
       <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">O Cavaleiro das Trevas</a>
 <span class="secondaryInfo">(2008)</span>
 </td>,
 <td class="titleColumn">
       4.
       <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">O Padrinho: Parte II</a>
 <span class="secondaryInfo">(1974)</span>
 </td>,
 <td class="titleColumn">
       5.
       <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">Doze Homens em Fúria</a>
 <span class="sec

In [145]:
movie_name = [i.find("a").text for i in table]
movie_name

['Os Condenados de Shawshank',
 'O Padrinho',
 'O Cavaleiro das Trevas',
 'O Padrinho: Parte II',
 'Doze Homens em Fúria',
 'A Lista de Schindler',
 'O Senhor dos Anéis - O Regresso do Rei',
 'Pulp Fiction',
 'O Senhor dos Anéis - A Irmandade do Anel',
 'O Bom, o Mau e o Vilão',
 'Forrest Gump',
 'Clube de Combate',
 'O Senhor dos Anéis - As Duas Torres',
 'A Origem',
 'Star Wars: Episódio V - O Império Contra-Ataca',
 'Matrix',
 'Tudo Bons Rapazes',
 'Voando Sobre Um Ninho de Cucos',
 'Seven - 7 Pecados Mortais',
 'Os Sete Samurais',
 'Do Céu Caiu Uma Estrela',
 'O Silêncio dos Inocentes',
 'Cidade de Deus',
 'O Resgate do Soldado Ryan',
 'A Vida É Bela',
 'Interstellar',
 'À Espera de Um Milagre',
 'A Guerra das Estrelas',
 'Exterminador Implacável 2 - O Dia do Julgamento',
 'Regresso ao Futuro',
 'A Viagem de Chihiro',
 'Psico',
 'O Pianista',
 'Parasitas',
 'Léon, o Profissional',
 'The Lion King',
 'Gladiador',
 'América Proibida',
 'The Departed - Entre Inimigos',
 'Os Suspeitos 

In [146]:
initial_release = [i.find("span").text for i in table]
initial_release

['(1994)',
 '(1972)',
 '(2008)',
 '(1974)',
 '(1957)',
 '(1993)',
 '(2003)',
 '(1994)',
 '(2001)',
 '(1966)',
 '(1994)',
 '(1999)',
 '(2002)',
 '(2010)',
 '(1980)',
 '(1999)',
 '(1990)',
 '(1975)',
 '(1995)',
 '(1954)',
 '(1946)',
 '(1991)',
 '(2002)',
 '(1998)',
 '(1997)',
 '(2014)',
 '(1999)',
 '(1977)',
 '(1991)',
 '(1985)',
 '(2001)',
 '(1960)',
 '(2002)',
 '(2019)',
 '(1994)',
 '(1994)',
 '(2000)',
 '(1998)',
 '(2006)',
 '(1995)',
 '(2006)',
 '(2014)',
 '(1942)',
 '(1962)',
 '(1988)',
 '(2011)',
 '(1936)',
 '(1968)',
 '(1954)',
 '(1988)',
 '(1979)',
 '(1931)',
 '(1979)',
 '(2000)',
 '(2012)',
 '(1981)',
 '(2008)',
 '(2006)',
 '(1950)',
 '(1957)',
 '(1980)',
 '(1940)',
 '(2018)',
 '(1957)',
 '(1986)',
 '(2018)',
 '(1999)',
 '(1964)',
 '(2012)',
 '(2003)',
 '(2009)',
 '(1984)',
 '(2017)',
 '(2019)',
 '(1995)',
 '(1995)',
 '(1981)',
 '(2019)',
 '(1997)',
 '(1984)',
 '(1997)',
 '(2016)',
 '(2000)',
 '(1952)',
 '(2009)',
 '(2010)',
 '(1963)',
 '(2018)',
 '(1983)',
 '(2004)',
 '(1968)',

In [150]:
director_test = [i.find("a").get("title") for i in table]
director_test

['Frank Darabont (dir.), Tim Robbins, Morgan Freeman',
 'Francis Ford Coppola (dir.), Marlon Brando, Al Pacino',
 'Christopher Nolan (dir.), Christian Bale, Heath Ledger',
 'Francis Ford Coppola (dir.), Al Pacino, Robert De Niro',
 'Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb',
 'Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes',
 'Peter Jackson (dir.), Elijah Wood, Viggo Mortensen',
 'Quentin Tarantino (dir.), John Travolta, Uma Thurman',
 'Peter Jackson (dir.), Elijah Wood, Ian McKellen',
 'Sergio Leone (dir.), Clint Eastwood, Eli Wallach',
 'Robert Zemeckis (dir.), Tom Hanks, Robin Wright',
 'David Fincher (dir.), Brad Pitt, Edward Norton',
 'Peter Jackson (dir.), Elijah Wood, Ian McKellen',
 'Christopher Nolan (dir.), Leonardo DiCaprio, Joseph Gordon-Levitt',
 'Irvin Kershner (dir.), Mark Hamill, Harrison Ford',
 'Lana Wachowski (dir.), Keanu Reeves, Laurence Fishburne',
 'Martin Scorsese (dir.), Robert De Niro, Ray Liotta',
 'Milos Forman (dir.), Jack Nicholson, Louise Fletch

In [153]:
director = [credit[:credit.index('(dir.)')] for credit in director_test]  # `credit` avoids shadowing the built-in dir()
director

['Frank Darabont ',
 'Francis Ford Coppola ',
 'Christopher Nolan ',
 'Francis Ford Coppola ',
 'Sidney Lumet ',
 'Steven Spielberg ',
 'Peter Jackson ',
 'Quentin Tarantino ',
 'Peter Jackson ',
 'Sergio Leone ',
 'Robert Zemeckis ',
 'David Fincher ',
 'Peter Jackson ',
 'Christopher Nolan ',
 'Irvin Kershner ',
 'Lana Wachowski ',
 'Martin Scorsese ',
 'Milos Forman ',
 'David Fincher ',
 'Akira Kurosawa ',
 'Frank Capra ',
 'Jonathan Demme ',
 'Fernando Meirelles ',
 'Steven Spielberg ',
 'Roberto Benigni ',
 'Christopher Nolan ',
 'Frank Darabont ',
 'George Lucas ',
 'James Cameron ',
 'Robert Zemeckis ',
 'Hayao Miyazaki ',
 'Alfred Hitchcock ',
 'Roman Polanski ',
 'Bong Joon Ho ',
 'Luc Besson ',
 'Roger Allers ',
 'Ridley Scott ',
 'Tony Kaye ',
 'Martin Scorsese ',
 'Bryan Singer ',
 'Christopher Nolan ',
 'Damien Chazelle ',
 'Michael Curtiz ',
 'Masaki Kobayashi ',
 'Isao Takahata ',
 'Olivier Nakache ',
 'Charles Chaplin ',
 'Sergio Leone ',
 'Alfred Hitchcock ',
 'Giusep

In [154]:
imdb_rating = soup.find_all("td", attrs={"class":"ratingColumn imdbRating"})
stars = [i.find("strong").text for i in imdb_rating]
stars

['9.2',
 '9.2',
 '9.0',
 '9.0',
 '9.0',
 '8.9',
 '8.9',
 '8.8',
 '8.8',
 '8.8',
 '8.8',
 '8.7',
 '8.7',
 '8.7',
 '8.7',
 '8.7',
 '8.7',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',


In [167]:
print(type(movie_name))
print(type(initial_release))
print(type(director))
print(type(stars))

<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>


In [175]:
top_250 = [x for x in zip(movie_name, initial_release, director, stars)]


In [211]:
final = pd.DataFrame.from_records(top_250, columns=["Movie Name", "Initial Release", "Director", "Stars"])
final.head()

| | Movie Name | Initial Release | Director | Stars |
|---|---|---|---|---|
| 0 | Os Condenados de Shawshank | (1994) | Frank Darabont | 9.2 |
| 1 | O Padrinho | (1972) | Francis Ford Coppola | 9.2 |
| 2 | O Cavaleiro das Trevas | (2008) | Christopher Nolan | 9.0 |
| 3 | O Padrinho: Parte II | (1974) | Francis Ford Coppola | 9.0 |
| 4 | Doze Homens em Fúria | (1957) | Sidney Lumet | 9.0 |
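
If "stars" in the exercise is read as the lead actors rather than the rating, both the director and the actors can be taken from the same `title` attribute. This is a sketch assuming the `Director (dir.), Actor, Actor` format seen above; `parse_credits` is a hypothetical helper.

```python
def parse_credits(title_attr: str):
    """Split 'Director (dir.), Actor A, Actor B' into (director, [actors])."""
    director_part, _, cast_part = title_attr.partition(" (dir.)")
    # Whatever follows the marker is a comma-separated list of actors
    stars = [name.strip() for name in cast_part.split(",") if name.strip()]
    return director_part.strip(), stars


sample = "Frank Darabont (dir.), Tim Robbins, Morgan Freeman"
director_name, star_names = parse_credits(sample)
print(director_name)
print(star_names)
```

`str.partition` returns the whole string and two empty strings when the marker is absent, so an entry without `(dir.)` yields an empty actor list instead of raising an error.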


#### 3.1. Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [182]:
#This is the url you will scrape in this exercise
url = 'https://www.imdb.com/list/ls009796553/'

In [183]:
# your code here
response = requests.get(url)
response.status_code

200

In [184]:
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly

In [193]:
geral = soup.find_all("div", attrs={"class":"lister-item-content"})
geral

[<div class="lister-item-content">
 <h3 class="lister-item-header">
 <span class="lister-item-index unbold text-primary">1.</span>
 <a href="/title/tt0087800/">Pesadelo em Elm Street</a>
 <span class="lister-item-year text-muted unbold">(1984)</span>
 </h3>
 <p class="text-muted text-small">
 <span class="certificate">M/18</span>
 <span class="ghost">|</span>
 <span class="runtime">91 min</span>
 <span class="ghost">|</span>
 <span class="genre">
 Horror            </span>
 </p>
 <div class="ipl-rating-widget">
 <div class="ipl-rating-star small">
 <span class="ipl-rating-star__star">
 <svg class="ipl-icon ipl-star-icon" fill="#000000" height="24" viewbox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg">
 <path d="M0 0h24v24H0z" fill="none"></path>
 <path d="M12 17.27L18.18 21l-1.64-7.03L22 9.24l-7.19-.61L12 2 9.19 8.63 2 9.24l5.46 4.73L5.82 21z"></path>
 <path d="M0 0h24v24H0z" fill="none"></path>
 </svg>
 </span>
 <span class="ipl-rating-star__rating">7.4</span>
 </div>
 <di

In [191]:
movie = [i.find("a").text for i in geral]
movie

['Pesadelo em Elm Street',
 'Despertares',
 'Liga de Mulheres',
 'Um Bairro em Nova Iorque',
 'Anjos em Campo',
 'Tempo de Matar',
 'Amistad',
 'Anaconda',
 'A Cool, Dry Place',
 'América Proibida',
 'Uma Questão de Nervos',
 'Quase Famosos',
 'American Psycho',
 'A.I. Inteligência Artificial',
 'Ali',
 'American Pie 2 - O Ano Seguinte',
 'Uma Mente Brilhante',
 'Olhos de Anjo',
 'Coração de Cavaleiro',
 'Antwone Fisher',
 'Era Uma Vez Um Rapaz',
 'Outra Questão de Nervos',
 'A Hora dos Benjamins',
 'Cody Banks - Agente de Palmo e Meio',
 'American Splendor',
 'American Pie - O Casamento',
 'Terapia de Choque',
 'O Repórter: A Lenda de Ron Burgundy',
 'Alfie e as Mulheres',
 "A Man's Gotta Do",
 'Uma História de Violência',
 'Estás Frito, Meu!',
 'Admitido',
 'Estás Cada Vez Mais Frito, Meu!',
 'Adventureland',
 'Uma Fuga Perfeita',
 'Avatar',
 'Alice no País das Maravilhas',
 'Pesadelo em Elm Street',
 'Uma Bela Orgia à Moda Antiga',
 'Mocas Felizes, Meu!',
 'Arena',
 'Força Destruido

In [201]:
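year = [i.find("span").text for i in geral]
year
# First attempt: find("span") returns the first <span> in each item, which holds the list index, not the year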

['1.',
 '2.',
 '3.',
 '4.',
 '5.',
 '6.',
 '7.',
 '8.',
 '9.',
 '10.',
 '11.',
 '12.',
 '13.',
 '14.',
 '15.',
 '16.',
 '17.',
 '18.',
 '19.',
 '20.',
 '21.',
 '22.',
 '23.',
 '24.',
 '25.',
 '26.',
 '27.',
 '28.',
 '29.',
 '30.',
 '31.',
 '32.',
 '33.',
 '34.',
 '35.',
 '36.',
 '37.',
 '38.',
 '39.',
 '40.',
 '41.',
 '42.',
 '43.',
 '44.',
 '45.',
 '46.',
 '47.',
 '48.',
 '49.',
 '50.',
 '51.',
 '52.',
 '53.',
 '54.',
 '55.',
 '56.',
 '57.',
 '58.',
 '59.',
 '60.',
 '61.',
 '62.',
 '63.',
 '64.',
 '65.',
 '66.',
 '67.',
 '68.',
 '69.',
 '70.',
 '71.',
 '72.',
 '73.',
 '74.',
 '75.',
 '76.',
 '77.',
 '78.',
 '79.',
 '80.',
 '81.',
 '82.',
 '83.',
 '84.',
 '85.',
 '86.',
 '87.',
 '88.',
 '89.',
 '90.',
 '91.',
 '92.',
 '93.',
 '94.',
 '95.',
 '96.',
 '97.',
 '98.',
 '99.',
 '100.']
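
The output above shows why `find("span")` is not enough: it returns the first match only. A small sketch on the header markup printed earlier shows the difference once the class is added to the query.

```python
from bs4 import BeautifulSoup

# Header markup copied from the first list item shown earlier
SAMPLE_HTML = """
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0087800/">Pesadelo em Elm Street</a>
<span class="lister-item-year text-muted unbold">(1984)</span>
</h3>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_span = soup.find("span").text                             # first <span>: the index
year_span = soup.find("span", class_="lister-item-year").text   # <span> carrying the year class

print(first_span)
print(year_span)
```

Passing a single class name to `class_` matches any element that has that class among several, so the full class string does not need to be spelled out.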

In [207]:
year = [i.find_all("span", attrs={"class":"lister-item-year text-muted unbold"})[0].text for i in geral]
year

['(1984)',
 '(1990)',
 '(1992)',
 '(1993)',
 '(1994)',
 '(1996)',
 '(1997)',
 '(1997)',
 '(1998)',
 '(1998)',
 '(1999)',
 '(2000)',
 '(2000)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2002)',
 '(2002)',
 '(2002)',
 '(2002)',
 '(2003)',
 '(2003)',
 '(2003)',
 '(2003)',
 '(2004)',
 '(2004)',
 '(2004)',
 '(2005)',
 '(2005)',
 '(2006)',
 '(2007)',
 '(2009)',
 '(2009)',
 '(2009)',
 '(I) (2010)',
 '(2010)',
 '(2011)',
 '(2011)',
 '(2011 Video)',
 '(1988)',
 '(1990)',
 '(1991)',
 '(1991)',
 '(1992)',
 '(1992)',
 '(1994)',
 '(1995)',
 '(1995)',
 '(1995)',
 '(1997)',
 '(1998)',
 '(1998)',
 '(1999)',
 '(2000)',
 '(2000)',
 '(2000)',
 '(2000)',
 '(2000)',
 '(2000)',
 '(2000)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2001)',
 '(2002)',
 '(2002)',
 '(2003)',
 '(2003)',
 '(2003)',
 '(2003)',
 '(2003)',
 '(2003)',
 '(2003)',
 '(2003)',
 '(2004)',
 '(2004)',
 '(2005)',
 '(2005)',
 '(2005)',
 '(2006)',
 '(2007)',
 '(2007)',


In [209]:
summary = [i.find_all("p",attrs={"class":""})[0].text.strip() for i in geral]
summary

['Teenager Nancy Thompson must uncover the dark truth concealed by her parents after she and her friends become targets of the spirit of a serial killer with a bladed glove in their dreams, in which if they die, it kills them in real life.',
 'The victims of an encephalitis epidemic many years ago have been catatonic ever since, but now a new drug offers the prospect of reviving them.',
 'Two sisters join the first female professional baseball league and struggle to help it succeed amid their own growing rivalry.',
 'A father becomes worried when a local gangster befriends his son in the Bronx in the 1960s.',
 'When a boy prays for a chance to have a family if the California Angels win the pennant, angels are assigned to make that possible.',
 'In Clanton, Mississippi, a fearless young lawyer and his assistant defend a black man accused of murdering two white men who raped his ten-year-old daughter, inciting violent retribution and revenge from the Ku Klux Klan.',
 'In 1839, the revolt

In [213]:
rows = [x for x in zip(movie, year, summary)]  # `rows` avoids shadowing the built-in all()
final_2 = pd.DataFrame(rows, columns=["Movie Name", "Release Year", "Summary"])

In [215]:
final_2.head(10)

| | Movie Name | Release Year | Summary |
|---|---|---|---|
| 0 | Pesadelo em Elm Street | (1984) | Teenager Nancy Thompson must uncover the dark ... |
| 1 | Despertares | (1990) | The victims of an encephalitis epidemic many y... |
| 2 | Liga de Mulheres | (1992) | Two sisters join the first female professional... |
| 3 | Um Bairro em Nova Iorque | (1993) | A father becomes worried when a local gangster... |
| 4 | Anjos em Campo | (1994) | When a boy prays for a chance to have a family... |
| 5 | Tempo de Matar | (1996) | In Clanton, Mississippi, a fearless young lawy... |
| 6 | Amistad | (1997) | In 1839, the revolt of Mende captives aboard a... |
| 7 | Anaconda | (1997) | A "National Geographic" film crew is taken hos... |
| 8 | A Cool, Dry Place | (1998) | Russell, single father balances his work as a ... |
| 9 | América Proibida | (1998) | A former neo-nazi skinhead tries to prevent hi... |
