# Web Scraping Lab

This notebook contains web scraping exercises to help you practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, class names and other attributes used in the content you are expected to extract.
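
The tips about printing the response text and looking for patterns can be sketched as below. This is a minimal illustration, not part of the lab: `sample_html` is a hypothetical stand-in for the text returned by `requests.get(url).text`, and `preview` and `common_classes` are helper names invented for this example.

```python
import re
from collections import Counter

# Hypothetical response text standing in for `requests.get(url).text`.
sample_html = """
<article class="Box-row"><h1 class="h3 lh-condensed"><a href="/a">A</a></h1></article>
<article class="Box-row"><h1 class="h3 lh-condensed"><a href="/b">B</a></h1></article>
<footer class="footer">...</footer>
"""

def preview(text, limit=80):
    """Return the first `limit` characters with whitespace collapsed."""
    return ' '.join(text.split())[:limit]

def common_classes(text, top=3):
    """Count how often each class attribute value occurs in the markup."""
    return Counter(re.findall(r'class="([^"]+)"', text)).most_common(top)

print(preview(sample_html))
print(common_classes(sample_html))
```

Class values that repeat once per item are usually good candidates for the elements you want to extract.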

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [3]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [4]:
# your code here
r = requests.get(url)  # `url` was defined in the previous cell
if r.status_code < 300:
    print('request was successful')
elif r.status_code < 400:
    print('request was redirected')
elif r.status_code < 500:
    print('request failed because the resource either does not exist or is forbidden')
else:
    print('request failed because the server encountered an error')

request was successful


In [5]:
r.status_code

200

In [6]:
html = r.content  # reuse the response fetched above instead of requesting the page again
soup = BeautifulSoup(html, "lxml")
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-0669c64f32137d2083f522a555f9e065.css" integrity="sha512-BmnGTzITfSCD9SKlVfngZdzNq8Fa33lRq00rF1eRsg4zcCH3VtX8QtS6687+5GdeaVj1LzKyLj6+oXJLcswj6w==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-217e41a0ce3f099705fabc3eca10cb84.css" integrity="sha512-IX5BoM4/CZcF+rw+yhDLhCjHTA1gz+

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
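
The extraction and clean-up steps listed above can be sketched on a small sample. The markup in `sample_html` is hypothetical and only imitates a name element that contains line breaks and indentation; the tags and classes on the real page may differ and should be checked in DevTools.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is an assumption here.
sample_html = """
<h2 class="f3 text-normal">
  <a href="/torvalds">
    torvalds
    <span class="text-gray">(Linus Torvalds)</span>
  </a>
</h2>
<h2 class="f3 text-normal">
  <a href="/tensorflow">
    tensorflow
  </a>
</h2>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

# str.split() with no arguments splits on any run of whitespace, including
# line breaks, so joining the pieces collapses the text onto a single line.
names = [' '.join(link.get_text().split()) for link in soup_sample.select("h2 a")]
print(names)
```

Joining with an empty string instead of a space would remove the inner spaces as well, which is closer to the expected output shown above.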

In [14]:
# Each developer name is the text of the first <a> inside an <h1 class="h3 lh-condensed">
text = [element.find('a').get_text(strip=True)
        for element in soup.find_all('h1', attrs={'class': 'h3 lh-condensed'})]
text

['Graham Campbell',
 'Seth Vargo',
 'Héctor Ramón',
 'Mike Penz',
 'Tianon Gravi',
 'Shohei Ueda',
 'Kevin Sheppard',
 'isaacs',
 'Steven Loria',
 'José Padilla',
 'Klaus Post',
 'Matheus Teixeira',
 'Chocobozzz',
 'Florent CHAMPIGNY',
 'Sam Sam',
 'Pascal Vizeli',
 'Wei He',
 'Hisham Muhammad',
 'Javier Suárez',
 'Minko Gechev',
 'Jeremy Tuloup',
 'Alisue',
 'XhmikosR',
 'Paul Beusterien',
 'Kévin Dunglas']

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [40]:
# This is the url you will scrape in this exercise
url1 = 'https://github.com/trending/python?since=daily'

In [41]:
# your code here
html1 = requests.get(url1).content
soup1 = BeautifulSoup(html1, "lxml")
soup1

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-0669c64f32137d2083f522a555f9e065.css" integrity="sha512-BmnGTzITfSCD9SKlVfngZdzNq8Fa33lRq00rF1eRsg4zcCH3VtX8QtS6687+5GdeaVj1LzKyLj6+oXJLcswj6w==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-217e41a0ce3f099705fabc3eca10cb84.css" integrity="sha512-IX5BoM4/CZcF+rw+yhDLhCjHTA1gz+

In [18]:
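# Note: this selector matches the language-filter menu entries (and reads `soup`, the developers page,
# rather than `soup1`), so the result below lists languages, not repository names.
text1 = [element.find('span').get_text(strip=True) for element in soup.find_all('a', attrs={'class':'select-menu-item'})]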
text1

['C++',
 'HTML',
 'Java',
 'JavaScript',
 'PHP',
 'Python',
 'Ruby',
 'Unknown languages',
 '1C Enterprise',
 '4D',
 'ABAP',
 'ABNF',
 'ActionScript',
 'Ada',
 'Adobe Font Metrics',
 'Agda',
 'AGS Script',
 'Alloy',
 'Alpine Abuild',
 'Altium Designer',
 'AMPL',
 'AngelScript',
 'Ant Build System',
 'ANTLR',
 'ApacheConf',
 'Apex',
 'API Blueprint',
 'APL',
 'Apollo Guidance Computer',
 'AppleScript',
 'Arc',
 'AsciiDoc',
 'ASN.1',
 'ASP',
 'AspectJ',
 'Assembly',
 'Asymptote',
 'ATS',
 'Augeas',
 'AutoHotkey',
 'AutoIt',
 'Awk',
 'Ballerina',
 'Batchfile',
 'Befunge',
 'BibTeX',
 'Bison',
 'BitBake',
 'Blade',
 'BlitzBasic',
 'BlitzMax',
 'Bluespec',
 'Boo',
 'Brainfuck',
 'Brightscript',
 'Zeek',
 'C',
 'C#',
 'C++',
 'C-ObjDump',
 'C2hs Haskell',
 'Cabal Config',
 "Cap'n Proto",
 'CartoCSS',
 'Ceylon',
 'Chapel',
 'Charity',
 'ChucK',
 'Cirru',
 'Clarion',
 'Clean',
 'Click',
 'CLIPS',
 'Clojure',
 'Closure Templates',
 'Cloud Firestore Security Rules',
 'CMake',
 'COBOL',
 'CodeQL'
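
The list above contains language names, not repositories. One way to get repository names, sketched on a hypothetical sample, is to read the `href` of each heading link. It is an assumption here that repository headings use the same `h1` with class `h3 lh-condensed` as the developer names; the owner and repository names are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and repository names; verify the real structure in DevTools.
sample_html = """
<article class="Box-row">
  <h1 class="h3 lh-condensed">
    <a href="/owner-a/repo-one">
      <span class="text-normal">owner-a / </span>
      repo-one
    </a>
  </h1>
</article>
<article class="Box-row">
  <h1 class="h3 lh-condensed">
    <a href="/owner-b/repo-two">
      <span class="text-normal">owner-b / </span>
      repo-two
    </a>
  </h1>
</article>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

# The href already holds "owner/repository" without stray whitespace,
# so it is easier to clean than the link text.
repos = [heading.find("a")["href"].strip("/")
         for heading in soup_sample.find_all("h1", class_="h3 lh-condensed")]
print(repos)
```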

#### Display all the image links from the Walt Disney Wikipedia page.

In [20]:
# This is the url you will scrape in this exercise
url2 = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [22]:
# your code here
html2 = requests.get(url2).content
soup2 = BeautifulSoup(html2, "lxml")


In [66]:
#tags2 = ['a']
#text2 = [element.text.split() for element in soup2.find_all(tags2)]
#text2
imag = soup2.find_all('a', {'class': 'image'})  # .get('href') could be applied inside a list comprehension
imag

[<a class="image" href="/wiki/File:Walt_Disney_1946.JPG"><img alt="Walt Disney 1946.JPG" data-file-height="675" data-file-width="450" decoding="async" height="330" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/330px-Walt_Disney_1946.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/440px-Walt_Disney_1946.JPG 2x" width="220"/></a>,
 <a class="image" href="/wiki/File:Walt_Disney_1942_signature.svg"><img alt="Walt Disney 1942 signature.svg" data-file-height="218" data-file-width="585" decoding="async" height="56" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/225px-Walt_Disney_1942_signature.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/t

In [54]:
imag1 = [element.find('a').get('href') for element in soup2.find_all('div', attrs={'class':'thumbinner'})]
imag1

['/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/4/4d/Newman_Laugh-O-Gram_%281921%29.webm',
 '/wiki/File:Trolley_Troubles_poster.jpg',
 '/wiki/File:Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg',
 '/wiki/File:Steamboat-willie.jpg',
 '/wiki/File:Walt_Disney_1935.jpg',
 '/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 '/wiki/File:Disney_drawing_goofy.jpg',
 '/wiki/File:DisneySchiphol1951.jpg',
 '/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 '/wiki/File:Walt_disney_portrait_right.jpg',
 '/wiki/File:Walt_Disney_Grave.JPG',
 '/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 '/wiki/File:Disney_Display_Case.JPG',
 '/wiki/File:Disney1968.jpg']

In [50]:
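# Collect the href of every anchor on the page (not only images); anchors without an href yield None
images = []
for a in soup2.find_all('a'):
    images.append(a.get('href'))
print(images)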

[None, '/wiki/Wikipedia:Featured_articles', '/wiki/Wikipedia:Protection_policy#extended', '#mw-head', '#p-search', '/wiki/The_Walt_Disney_Company', '/wiki/Walt_Disney_(disambiguation)', '/wiki/File:Walt_Disney_1946.JPG', '/wiki/Chicago', '/wiki/Illinois', '/wiki/Burbank,_California', '/wiki/The_Walt_Disney_Company', '/wiki/Disney_family', '/wiki/Academy_Awards', '/wiki/Golden_Globe_Award', '/wiki/Emmy_Award', '/wiki/File:Walt_Disney_1942_signature.svg', '/wiki/Help:IPA/English', '#cite_note-OD:_pronunciation-1', '/wiki/Modern_animation_in_the_United_States', '/wiki/Cartoon', '/wiki/Academy_Awards', '/wiki/Golden_Globe_Awards', '/wiki/Emmy_Award', '/wiki/National_Film_Registry', '/wiki/Library_of_Congress', '/wiki/The_Walt_Disney_Company', '/wiki/Roy_O._Disney', '/wiki/Ub_Iwerks', '/wiki/Mickey_Mouse', '/wiki/Technicolor', '/wiki/Feature-length', '/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)', '/wiki/Pinocchio_(1940_film)', '/wiki/Fantasia_(1940_film)', '/wiki/Dumbo', '/wiki/Bambi'
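
The list above mixes every link on the page. A narrower sketch reads the `src` of each `img` tag and resolves relative and protocol-relative addresses with `urllib.parse.urljoin`. The sample is a shortened copy of the first anchor printed earlier plus one ordinary link; `base_url` repeats the page address.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Shortened sample based on the output above.
sample_html = """
<a class="image" href="/wiki/File:Walt_Disney_1946.JPG">
  <img alt="Walt Disney 1946.JPG" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG"/>
</a>
<a href="/wiki/Chicago">Chicago</a>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

# Direct image files: every <img> that has a src attribute.
image_files = [urljoin(base_url, img["src"])
               for img in soup_sample.find_all("img", src=True)]

# File description pages: anchors carrying the "image" class.
image_pages = [urljoin(base_url, a["href"])
               for a in soup_sample.find_all("a", class_="image")]

print(image_files)
print(image_pages)
```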

#### Find the number of titles that have changed in the United States Code since its last release point.

In [56]:
# This is the url you will scrape in this exercise
url3 = 'http://uscode.house.gov/download/download.shtml'

In [65]:
# your code here
html3 = requests.get(url3).content
soup3 = BeautifulSoup(html3, "lxml")
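#updt = soup3.find_all('div.usctitlechanged',{'font-weight':'bold'})
#updt
# Note: find('id') searches for a child tag named <id>, not the `id` attribute, so it returns None
# and the call to get_text() raises the AttributeError shown below.
updt = [element.find('id').get_text() for element in soup3.find_all('div', attrs={'class':'usctitlechanged'})]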
updt

#changes = []
#for a in soup3.findAll('div'):
   # changes.append(div.get('usctitlechanged'))
#print(changes)

AttributeError: 'NoneType' object has no attribute 'get_text'
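
The error comes from `find('id')`, which looks for a child tag called `id` and returns `None`. A possible fix, sketched on a hypothetical sample, is to take the text of each matching `div` directly and count the results. The class name `usctitlechanged` comes from the cell above; the surrounding markup, the `usctitle` class for unchanged titles, and which titles are marked as changed are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the `usctitlechanged` class is taken from the notebook.
sample_html = """
<div class="uscitemlist">
  <div class="uscitem"><div class="usctitle" id="us/usc/t1">Title 1 - General Provisions</div></div>
  <div class="uscitem"><div class="usctitlechanged" id="us/usc/t2">Title 2 - The Congress</div></div>
  <div class="uscitem"><div class="usctitlechanged" id="us/usc/t5">Title 5 - Government Organization and Employees</div></div>
</div>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")
changed = soup_sample.find_all("div", class_="usctitlechanged")

# Read the text of each matched element; attributes are read with .get().
changed_titles = [div.get_text(strip=True) for div in changed]
changed_ids = [div.get("id") for div in changed]

print(len(changed_titles))
print(changed_titles)
```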

#### Build a Python list with the names of the FBI's Ten Most Wanted.

In [67]:
# This is the url you will scrape in this exercise
url4 = 'https://www.fbi.gov/wanted/topten'

In [69]:
# your code here
html4 = requests.get(url4).content
soup4 = BeautifulSoup(html4, "lxml")
soup4

<!DOCTYPE html>
<html data-gridsystem="bs3" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="https://www.fbi.gov/wanted/topten" rel="canonical"/><meta content="Folder" name="DC.type"/>
<meta content="text/plain" name="DC.format"/>
<meta content="2010-07-16T15:30:00+00:00" name="DC.date.created"/>
<meta content="Wanted by the FBI, Top Ten Most Wanted, Ten Most Wanted Fugitives, Top Ten Fugitives, Top Ten, Historical Ten Most Wanted" name="keywords"/>
<meta content="The FBI is offering rewards for information leading to the apprehension of the Ten Most Wanted Fugitives. Select the images of suspects to display more information." name="description"/>
<meta content="2010/07/16 - " name="DC.date.valid_range"/>
<meta content="The FBI is offering rewards for information leading to the apprehension of the Ten Most Wanted Fugitives. Select the images of suspects to

In [71]:
text4 = [element.find('a').get_text(strip=True) for element in soup4.find_all('h3', attrs={'class':'title'})]
text4

['EUGENE PALMER',
 'SANTIAGO VILLALBA MEDEROS',
 'RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'YASER ABDEL SAID',
 'ALEXIS FLORES']

#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [72]:
# This is the url you will scrape in this exercise 
# You will need to append the account handle to this url
url5 = 'https://twitter.com/'

In [73]:
# your code here
html5 = requests.get(url5).content
soup5 = BeautifulSoup(html5, "lxml")
soup5

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head><meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" name="viewport"/>
<link href="//abs.twimg.com" rel="preconnect"/>
<link href="//api.twitter.com" rel="preconnect"/>
<link href="//pbs.twimg.com" rel="preconnect"/>
<link href="//t.co" rel="preconnect"/>
<link href="//video.twimg.com" rel="preconnect"/>
<link href="//abs.twimg.com" rel="dns-prefetch"/>
<link href="//api.twitter.com" rel="dns-prefetch"/>
<link href="//pbs.twimg.com" rel="dns-prefetch"/>
<link href="//t.co" rel="dns-prefetch"/>
<link href="//video.twimg.com" rel="dns-prefetch"/>
<link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/web/polyfills.604422d4.js" nonce="MThhOTU5NDgtODA4NS00NTBhLWI1YzEtNjhjZWUwMTg4ODU1" rel="preload"/>
<link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/web/vendors~main.55bd4704.js" nonce="MThhOTU5NDgtODA4NS00

In [77]:
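# No <div> with these generated class names is present in the HTML returned to requests, so the result is empty
text10 = [element.find('span').get_text(strip=True) for element in soup5.find_all('div', attrs={'class':'css-1dbjc4n r-13awgt0 r-zso239'})]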
text10

[]

In [82]:
handle = input('Input your account name on Twitter: ').lstrip('@')  # drop a leading '@'
temp = requests.get('https://twitter.com/' + handle)
bs = BeautifulSoup(temp.text, 'lxml')

try:
    tweet_box = bs.find('li', {'class': 'ProfileNav-item ProfileNav-item--tweets is-active'})
    tweets = tweet_box.find('a').find('span', {'class': 'ProfileNav-value'})
    print("{} has {} tweets.".format(handle, tweets.get('data-count')))
except AttributeError:
    # find() returned None: the expected profile markup is missing from the response
    print('Account name not found...')


Input your account name on Twitter: @raphaeleon
Account name not found...
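
The try/except logic can be checked without a network request by running it on a saved snippet. The sketch below assumes the `ProfileNav` markup targeted by the cell above; `sample_profile`, its counts and the helper name `profile_count` are hypothetical. The pages fetched in this notebook appear to be a JavaScript application shell, so this markup may be absent even for accounts that exist, and the "not found" message may reflect missing markup and not a missing account.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet in the ProfileNav layout; the counts are made up.
sample_profile = """
<li class="ProfileNav-item ProfileNav-item--tweets is-active">
  <a><span class="ProfileNav-value" data-count="1234">1,234</span></a>
</li>
<li class="ProfileNav-item ProfileNav-item--followers">
  <a><span class="ProfileNav-value" data-count="567">567</span></a>
</li>
"""

def profile_count(html, kind):
    """Return the data-count for `kind` ('tweets' or 'followers'), or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    try:
        item = soup.select_one(f"li.ProfileNav-item--{kind}")
        return int(item.select_one("span.ProfileNav-value")["data-count"])
    except (AttributeError, TypeError, KeyError, ValueError):
        # A missing element, a missing attribute or a non-numeric value all end up here.
        return None

print(profile_count(sample_profile, "tweets"))
print(profile_count(sample_profile, "followers"))
print(profile_count("<html></html>", "tweets"))
```

Catching specific exceptions instead of using a bare `except` keeps unrelated problems, such as a typo in a variable name, from being reported as a missing account.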


#### Number of followers of a given Twitter account
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account handle to this url
url = 'https://twitter.com/'

In [83]:
# your code here
handle = input('Input your account name on Twitter: ').lstrip('@')  # drop a leading '@'
temp = requests.get('https://twitter.com/' + handle)
bs = BeautifulSoup(temp.text, 'lxml')
try:
    follow_box = bs.find('li', {'class': 'ProfileNav-item ProfileNav-item--followers'})
    followers = follow_box.find('a').find('span', {'class': 'ProfileNav-value'})
    print("Number of followers: {} ".format(followers.get('data-count')))
except AttributeError:
    # find() returned None: the expected profile markup is missing from the response
    print('Account name not found...')

Input your account name on Twitter: @prasanta
Account name not found...


#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [86]:
# This is the url you will scrape in this exercise
url6 = 'https://www.wikipedia.org/'
html6 = requests.get(url6).content
soup6 = BeautifulSoup(html6, "lxml")

In [89]:
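# your code here
# Note: find('a') returns only the first link in each matching <div>, so a single name per block is captured
text6 = [element.find('a').get_text(strip=True) for element in soup6.find_all('div', attrs={'class':'langlist langlist-large hlist'})]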
text6

['العربية', 'Asturianu']
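
Only two names appear above because `find('a')` stops at the first link of each block, and no article counts are read. A sketch of one way to pair each language with its count follows. It assumes the featured languages sit in `div` elements with class `central-featured-lang`, with the name in `strong` and the count in `bdi`; this structure and the counts in `sample_html` are assumptions to verify against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and article counts.
sample_html = """
<div class="central-featured">
  <div class="central-featured-lang lang1">
    <a href="//en.wikipedia.org/"><strong>English</strong>
      <small><bdi dir="ltr">5 000 000+</bdi> <span>articles</span></small></a>
  </div>
  <div class="central-featured-lang lang2">
    <a href="//es.wikipedia.org/"><strong>Español</strong>
      <small><bdi dir="ltr">1 000 000+</bdi> <span>artículos</span></small></a>
  </div>
</div>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

languages = []
# find_all returns the blocks in document order, which preserves the order on the page.
for block in soup_sample.find_all("div", class_="central-featured-lang"):
    name = block.find("strong").get_text(strip=True)
    count = block.find("bdi").get_text(strip=True)
    languages.append((name, count))

print(languages)
```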

#### Build a list of the different kinds of datasets available on data.gov.uk.

In [90]:
# This is the url you will scrape in this exercise
url7 = 'https://data.gov.uk/'

In [95]:
# your code here
html7 = requests.get(url7).content
soup7 = BeautifulSoup(html7, "lxml")

# Note: find('a') keeps only the first link of each column, so later links in a column are skipped
text7 = [element.find('a').get_text(strip=True) for element in soup7.find_all('div', attrs={'class':'column-one-third'})]
text7

['Business and economy', 'Environment', 'Mapping']

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [2]:
# This is the url you will scrape in this exercise
url8 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
import pandas as pd

In [12]:
# your code here

html8 = requests.get(url8).content
soup8 = BeautifulSoup(html8, "lxml")

# Read every row of the first table on the page, header cells included
tables = soup8.find_all("table")
table = tables[0]
tab_data = [[cell.text for cell in row.find_all(["th", "td"])]
            for row in table.find_all("tr")]
df = pd.DataFrame(tab_data)
df.drop(index=0, inplace=True)  # the first row holds the header cells
df.replace(r'\s', '', regex=True, inplace=True)  # removes all whitespace, including spaces inside names
df.columns = ['Rank', 'Language', 'Speakers(millions)', '% of World pop.(March 2019)', 'Language familyBranch']
df.head(10)

#text8 = [element.find('a').get_text(strip=True) for element in soup8.find_all('td')]
#text8

#header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}

#r = requests.get(url8, headers=header)
#dataframes=pd.read_html(r.text, header=0)
#dataframesTT = pd.DataFrame(dataframes)
#dataframesTT



| | Rank | Language | Speakers(millions) | % of World pop.(March 2019) | Language familyBranch |
|---|---|---|---|---|---|
| 1 | 1 | MandarinChinese | 918.0 | 11.922 | Sino-TibetanSinitic |
| 2 | 2 | Spanish | 480.0 | 5.994 | Indo-EuropeanRomance |
| 3 | 3 | English | 379.0 | 4.922 | Indo-EuropeanGermanic |
| 4 | 4 | Hindi(SanskritisedHindustani)[9] | 341.0 | 4.429 | Indo-EuropeanIndo-Aryan |
| 5 | 5 | Bengali | 228.0 | 2.961 | Indo-EuropeanIndo-Aryan |
| 6 | 6 | Portuguese | 221.0 | 2.87 | Indo-EuropeanRomance |
| 7 | 7 | Russian | 154.0 | 2.0 | Indo-EuropeanBalto-Slavic |
| 8 | 8 | Japanese | 128.0 | 1.662 | JaponicJapanese |
| 9 | 9 | WesternPunjabi[10] | 92.7 | 1.204 | Indo-EuropeanIndo-Aryan |
| 10 | 10 | Marathi | 83.1 | 1.079 | Indo-EuropeanIndo-Aryan |
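
Replacing `\s` removes every space, which is why the names above run together and footnote markers such as `[9]` remain. A gentler clean-up can be sketched with the standard library alone. The helper name `clean_cell` and the raw cell strings in `raw_rows` are hypothetical; the figures are copied from the table above.

```python
import re

def clean_cell(text):
    """Drop footnote markers such as [9] and collapse runs of whitespace."""
    without_notes = re.sub(r'\[\d+\]', '', text)
    return ' '.join(without_notes.split())

# Hypothetical raw cell texts, as they might come out of `cell.text`.
raw_rows = [
    ['Rank\n', 'Language\n', 'Speakers(millions)\n'],
    ['1\n', 'Mandarin Chinese\n', '918\n'],
    ['4\n', 'Hindi (Sanskritised Hindustani)[9]\n', '341\n'],
]

# Clean every cell, then split the header row from the data rows.
header, *body = [[clean_cell(cell) for cell in row] for row in raw_rows]
print(header)
print(body)
```

The cleaned rows can then be passed to `pd.DataFrame(body, columns=header)`.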


#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account handle to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Display IMDb's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [14]:
# This is the url you will scrape in this exercise 
url9 = 'https://www.imdb.com/chart/top'

In [17]:
# your code here
html9 = requests.get(url9).content
soup9 = BeautifulSoup(html9, "lxml")

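# Note: the chart table exposes only rank/title and rating columns; director and stars are not captured here
tables1 = soup9.find_all("table")
table1 = tables1[0]
tab_data1 = [[cell.text for cell in row.find_all(["th", "td"])]
             for row in table1.find_all("tr")]
df1 = pd.DataFrame(tab_data1)
df1.replace(r'\s', '', regex=True, inplace=True)  # strips all whitespace, including spaces inside titles
df1.columns = df1.iloc[0, :]  # reuse the header row as column names (row 0 itself is kept)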

df1

| | | Rank&Title | IMDbRating | YourRating | |
|---|---|---|---|---|---|
| 0 | | Rank&Title | IMDbRating | YourRating | |
| 1 | | 1.OsCondenadosdeShawshank(1994) | 9.2 | 12345678910NOTYETRELEASEDSeen | |
| 2 | | 2.OPadrinho(1972) | 9.1 | 12345678910NOTYETRELEASEDSeen | |
| 3 | | 3.OPadrinho:ParteII(1974) | 9.0 | 12345678910NOTYETRELEASEDSeen | |
| 4 | | 4.OCavaleirodasTrevas(2008) | 9.0 | 12345678910NOTYETRELEASEDSeen | |
| ... | ... | ... | ... | ... | ... |
| 246 | | 246.FannyeAlexandre(1982) | 8.0 | 12345678910NOTYETRELEASEDSeen | |
| 247 | | 247.TrêsCores:Vermelho(1994) | 8.0 | 12345678910NOTYETRELEASEDSeen | |
| 248 | | 248.Aladdin(1992) | 8.0 | 12345678910NOTYETRELEASEDSeen | |
| 249 | | 249.Koenokatachi(2016) | 8.0 | 12345678910NOTYETRELEASEDSeen | |
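
The table above lacks the director and stars. A possible approach, sketched on one hypothetical row, reads them from the `title` attribute of the link inside each `td` with class `titleColumn`. That the attribute lists the director followed by the stars is an assumption about the chart markup and should be confirmed in DevTools.

```python
from bs4 import BeautifulSoup

# Hypothetical row; the format of the `title` attribute is an assumption.
sample_html = """
<table><tbody>
<tr>
  <td class="titleColumn">1.
    <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
    <span class="secondaryInfo">(1994)</span>
  </td>
</tr>
</tbody></table>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

rows = []
for cell in soup_sample.find_all("td", class_="titleColumn"):
    link = cell.find("a")
    # First entry is the director, the remaining entries are the stars.
    director, *stars = [part.strip() for part in link.get("title", "").split(",")]
    rows.append({
        "movie": link.get_text(strip=True),
        "year": int(cell.find("span").get_text(strip=True).strip("()")),
        "director": director.replace("(dir.)", "").strip(),
        "stars": ", ".join(stars),
    })

print(rows)
```

Passing `rows` to `pd.DataFrame` gives one column per dictionary key.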


#### Display the movie name, year and a brief summary of 10 random movies from the IMDb Top 250 as a pandas dataframe.

In [18]:
#This is the url you will scrape in this exercise
url11 = 'http://www.imdb.com/chart/top'

In [19]:
# your code here
import random
def get_imd_movies(url11):
    page = requests.get(url11)
    soup = BeautifulSoup(page.text, 'html.parser')
    movies = soup.find_all("td", class_="titleColumn")
    random.shuffle(movies)
    return movies
def get_imd_summary(url11):
    movie_page = requests.get(url11)
    soup = BeautifulSoup(movie_page.text, 'html.parser')
    return soup.find("div", class_="summary_text").contents[0].strip()

def get_imd_movie_info(movie):
    movie_title = movie.a.contents[0]
    movie_year = movie.span.contents[0]
    movie_url = 'http://www.imdb.com' + movie.a['href']
    return movie_title, movie_year, movie_url

def imd_movie_picker(count=10):
    print("--------------------------------------------")
    # get_imd_movies() returns the chart entries shuffled, so the first `count` form a random sample
    for movie in get_imd_movies(url11)[:count]:
        movie_title, movie_year, movie_url = get_imd_movie_info(movie)
        movie_summary = get_imd_summary(movie_url)
        print(movie_title, movie_year)
        print(movie_summary)
        print("--------------------------------------------")

imd_movie_picker()

--------------------------------------------
Klaus (2019)
A simple act of kindness always sparks another, even in a frozen, faraway place. When Smeerensburg's new postman, Jesper, befriends toymaker Klaus, their gifts melt an age-old feud and deliver a sleigh full of holiday traditions.
--------------------------------------------
À Procura de Nemo (2003)
After his son is captured in the Great Barrier Reef and taken to Sydney, a timid clownfish sets out on a journey to bring him home.
--------------------------------------------
Vingadores: Guerra do Infinito (2018)
The Avengers and their allies must be willing to sacrifice all in an attempt to defeat the powerful Thanos before his blitz of devastation and ruin puts an end to the universe.
--------------------------------------------
Do Céu Caiu Uma Estrela (1946)
An angel is sent from Heaven to help a desperately frustrated businessman by showing him what life would have been like if he had never existed.
-----------------------------

## Bonus

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here
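
A sketch of how the request URL could be built and the JSON response read follows. It is not tied to a live call: `sample_payload` is a hypothetical response with made-up values that follows the field layout of the current weather endpoint, `YOUR_API_KEY` is a placeholder, and `build_url` and `summarize` are helper names invented for this example.

```python
import json
from urllib.parse import urlencode

def build_url(city, api_key):
    """Build the request URL, encoding spaces and accents in the city name."""
    query = urlencode({"q": city, "APPID": api_key, "units": "metric"})
    return "http://api.openweathermap.org/data/2.5/weather?" + query

def summarize(payload):
    """Pick temperature, wind speed, weather and description from a response."""
    first = payload["weather"][0]
    return {
        "city": payload.get("name"),
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "weather": first["main"],
        "description": first["description"],
    }

# Hypothetical response body; with requests this would come from `requests.get(url).json()`.
sample_payload = json.loads("""
{"weather": [{"main": "Clouds", "description": "broken clouds"}],
 "main": {"temp": 12.3},
 "wind": {"speed": 4.1},
 "name": "Lisbon"}
""")

print(build_url("New York", "YOUR_API_KEY"))
print(summarize(sample_payload))
```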

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here
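
One way to approach this exercise is sketched below on a hypothetical sample. It assumes each book is an `article` with class `product_pod`, with the full name in the `title` attribute of the heading link, the price in `p.price_color` and the stock text in `p.availability`; the book and its values are made up, and the selectors should be checked against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical product card; the name, price and selectors are assumptions.
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/example-book_1/index.html" title="Example Book: A Long Title">Example Book: A ...</a></h3>
  <div class="product_price">
    <p class="price_color">£10.00</p>
    <p class="instock availability">
      <i class="icon-ok"></i>
      In stock
    </p>
  </div>
</article>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

books = []
for article in soup_sample.find_all("article", class_="product_pod"):
    books.append({
        # The link text is truncated, so the full name is read from the attribute.
        "name": article.select_one("h3 a")["title"],
        "price": article.select_one("p.price_color").get_text(strip=True),
        "availability": article.select_one("p.availability").get_text(strip=True),
    })

print(books)
```

Passing `books` to `pd.DataFrame` yields the requested dataframe.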

#### Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [None]:
# your code here