# Web Scraping Lab

This notebook contains a set of web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, class names, and other markers used in the content you are expected to extract.
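
As a minimal sketch of the first two tips, a small helper can check the status code and return a preview of the body before any parsing is attempted. The helper name `preview_response` and the stand-in response object are assumptions made for illustration so that the example runs offline; in the notebook you would pass the result of `requests.get(url)` instead.

```python
from types import SimpleNamespace


def preview_response(response, n_chars=200):
    """Return the first n_chars of the body, or raise if the request failed."""
    if response.status_code != 200:
        raise RuntimeError(f"Unexpected status code: {response.status_code}")
    return response.text[:n_chars]


# Hypothetical stand-in for `requests.get(url)`; only the two attributes used above are provided.
fake_ok = SimpleNamespace(status_code=200, text="<html><title>Demo</title></html>")
print(preview_response(fake_ok, n_chars=12))
```

Failing early on a non-200 status avoids parsing an error page as if it were the intended content.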

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from bs4 import SoupStrainer

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [7]:
page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like the following:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
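
Step 3 of the instructions can be sketched with plain string methods. The raw strings below are hypothetical, shaped like what `get_text()` typically returns for a name element. Note that this version collapses whitespace to single spaces rather than deleting it altogether, which is one possible reading of the step.

```python
# Hypothetical raw strings, shaped like what `get_text()` returns for a name element.
raw_texts = ["\n\n            Anthony Fu\n          ", "\n  antfu\n"]

# str.split() with no arguments splits on any run of whitespace (spaces, tabs, newlines),
# so joining the pieces with a single space normalises the text.
clean_names = [" ".join(text.split()) for text in raw_texts]
print(clean_names)  # ['Anthony Fu', 'antfu']
```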

In [58]:
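# Selectors copied from Chrome DevTools:
# #pa-antfu > div.d-sm-flex.flex-auto > div.col-sm-8.d-md-flex > div:nth-child(1) > h1 > a
# #pa-antfu > div.d-sm-flex.flex-auto > div.col-sm-8.d-md-flex > div:nth-child(1) > p > a

names = soup.select("div:nth-child(1) > h1 > a")
nicks = soup.select('p > a')
result = []

# Pair each display name with its username; the two lists are assumed to line up one-to-one.
for name_tag, nick_tag in zip(names, nicks):
    # Keep only word tokens (optionally wrapped in parentheses) and rejoin them with single spaces.
    # Note: \w+ does not match hyphens, so a handle such as 'dai-shi' comes out as 'dai shi'.
    name = ' '.join(re.findall(r'\(?\w+\)?', name_tag.get_text().replace('\n', '')))
    nick = ' '.join(re.findall(r'\(?\w+\)?', nick_tag.get_text().replace('\n', '')))

    result.append(name + ' (' + nick + ')')

result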

['Anthony Fu (antfu)',
 'Luan Nico (luanpotter)',
 'Tom Payne (twpayne)',
 'Arvid Norberg (arvidn)',
 'Roger Peppe (rogpeppe)',
 'Michael (Parker) Parker (parkervcp)',
 'Kevin Papst (kevinpapst)',
 'Nuno Maduro (nunomaduro)',
 'Florian (1technophile)',
 'Dane Mackier (FilledStacks)',
 'Daishi Kato (dai shi)',
 'Anton Babenko (antonbabenko)',
 'Sylvain Gugger (sgugger)',
 'Ha Thach (hathach)',
 'Josh Bleecher Snyder (josharian)',
 'Ayke (aykevl)',
 'Stephen Haberman (stephenh)',
 'Leo Farias (leoafarias)',
 'Steve Macenski (SteveMacenski)',
 'Tomas Votruba (TomasVotruba)',
 'Adrienne Walker (quisquous)',
 'Stephen Celis (stephencelis)',
 'Jonny Burger (JonnyBurger)',
 'Daniel Agar (dagar)',
 'Welly (wellyshen)']

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [61]:
# This is the url you will scrape in this exercise

url = 'https://github.com/trending/python?since=daily'
page2 = requests.get(url)

soup2 = BeautifulSoup(page2.content,'html.parser')


In [80]:
##js-pjax-container > div.position-relative.container-lg.p-responsive.pt-6 > div > div:nth-child(2) > article:nth-child(1) > h1 > a

repos = soup2.select('h1 > a')

repo_list = ['/'.join(re.findall(r'\S+',re.sub('\n|/','',repo.get_text()))) for repo in repos]

repo_list

['archlinux/archinstall',
 'home-assistant/core',
 'Rapptz/discord.py',
 'oppia/oppia',
 'activeloopai/Hub',
 'willmcgugan/rich',
 'CovidTrackerFr/vitemadose',
 'optuna/optuna',
 'Chia-Network/chia-blockchain',
 'EleutherAI/gpt-neo',
 'swisskyrepo/PayloadsAllTheThings',
 'yasinkuyu/binance-trader',
 'k4yt3x/video2x',
 'lazyprogrammer/machine_learning_examples',
 'PyGithub/PyGithub',
 'aiogram/aiogram',
 'freqtrade/freqtrade',
 'minimaxir/gpt-2-simple',
 'ArchiveBox/ArchiveBox',
 'DIGITALCRIMINAL/OnlyFans',
 'engineer-man/youtube',
 'frappe/erpnext',
 'pydanny/cookiecutter-django',
 'ManimCommunity/manim',
 'PyCQA/flake8']
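
An alternative sketch reads the repository name from the link target rather than cleaning the visible text. It assumes that each link's `href` has the form `/owner/repo`, which is an assumption about the page markup and is not shown in the cells above. The HTML fragment is made up to imitate the structure matched by the `h1 > a` selector.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates the structure targeted by the 'h1 > a' selector.
html = """
<article><h1><a href="/archlinux/archinstall">
    archlinux /

    archinstall
</a></h1></article>
<article><h1><a href="/home-assistant/core">
    home-assistant /

    core
</a></h1></article>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Read 'owner/repo' from the link target instead of cleaning the visible text.
repo_names = [a["href"].strip("/") for a in demo_soup.select("h1 > a[href]")]
print(repo_names)  # ['archlinux/archinstall', 'home-assistant/core']
```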

#### Display all the image links from the Walt Disney Wikipedia page.

In [81]:
# This is the url you will scrape in this exercise

url = 'https://en.wikipedia.org/wiki/Walt_Disney'

page3 = requests.get(url)

soup3 = BeautifulSoup(page3.content,'html.parser')


In [196]:
#mw-content-text > div.mw-parser-output > table.infobox.biography.vcard > tbody > tr:nth-child(2) > td > a
# #mw-content-text > div.mw-parser-output > table.infobox.biography.vcard > tbody > tr:nth-child(2) > td > a > img
# #mw-content-text > div.mw-parser-output > div:nth-child(52) > div > a > img

imgs = soup3.select('div.mw-parser-output > div > div > a > img')
imgs_main = soup3.select('tr:nth-child(2) > td > a > img')

links = ['https:'+imgs_main[0]['src']]

for image in imgs:
    links.append('https:'+image['src'])
    print('https:'+image['src'])
#print(links)

https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg
https://upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Disney_drawing_goofy.jpg/170px-Disney_drawi

#### Retrieve an arbitrary Wikipedia page about "Python" and create a list of the links on that page.

In [227]:
# This is the url you will scrape in this exercise

url4 ='https://en.wikipedia.org/wiki/Python_(programming_language)' 

page4 = requests.get(url4)

#product = SoupStrainer(id =['mw-parser-output'])

soup4 = BeautifulSoup(page4.content,'html.parser') #parse_only=product)

In [241]:
# #mw-content-text > div.mw-parser-output > div:nth-child(3) > a
##mw-content-text
##mw-content-text > div.mw-parser-output > p:nth-child(5)
##mw-content-text > div.mw-parser-output > p:nth-child(5) > a:nth-child(6)
# #mw-content-text > div.mw-parser-output > ul:nth-child(13) > li:nth-child(1) > a

links4 = soup4.find_all('a',href=True)

links_4 = []  # absolute URLs found on the page
trash = []    # everything else, e.g. in-page '#' anchors
for link in links4:
    href = link['href']
    if href.startswith('/') or href.startswith('http'):
        if href.startswith('//'):
            # Protocol-relative URL: add the scheme
            href = 'https:' + href
        elif href.startswith('/w'):
            # Site-relative path (/wiki/..., /w/...): add the Wikipedia host
            href = 'https://en.wikipedia.org' + href
        links_4.append(href)
        print(href)
    else:
        trash.append(href)

https://en.wikipedia.org/wiki/Wikipedia:Good_articles
https://en.wikipedia.org/wiki/Python_(disambiguation)
https://en.wikipedia.org/wiki/File:Python_logo_and_wordmark.svg
https://en.wikipedia.org/wiki/Programming_paradigm
https://en.wikipedia.org/wiki/Multi-paradigm_programming_language
https://en.wikipedia.org/wiki/Object-oriented_programming
https://en.wikipedia.org/wiki/Procedural_programming
https://en.wikipedia.org/wiki/Imperative_programming
https://en.wikipedia.org/wiki/Functional_programming
https://en.wikipedia.org/wiki/Structured_programming
https://en.wikipedia.org/wiki/Reflective_programming
https://en.wikipedia.org/wiki/Software_design
https://en.wikipedia.org/wiki/Guido_van_Rossum
https://en.wikipedia.org/wiki/Software_developer
https://en.wikipedia.org/wiki/Python_Software_Foundation
https://en.wikipedia.org/wiki/Software_release_life_cycle
https://www.wikidata.org/wiki/Q28865?uselang=en#P348
https://en.wikipedia.org/wiki/Software_release_life_cycle#BETA
https://www.wik

https://en.wikipedia.org/wiki/MedCalc
https://en.wikipedia.org/wiki/Microfit
https://en.wikipedia.org/wiki/Minitab
https://en.wikipedia.org/wiki/MLwiN
https://en.wikipedia.org/wiki/NCSS_(statistical_software)
https://en.wikipedia.org/wiki/SHAZAM_(software)
https://en.wikipedia.org/wiki/SigmaStat
https://en.wikipedia.org/wiki/Statistica
https://en.wikipedia.org/wiki/StatsDirect
https://en.wikipedia.org/wiki/StatXact
https://en.wikipedia.org/wiki/SYSTAT_(software)
https://en.wikipedia.org/wiki/The_Unscrambler
https://en.wikipedia.org/wiki/Unistat
https://en.wikipedia.org/wiki/Microsoft_Excel
https://en.wikipedia.org/wiki/Analyse-it
https://en.wikipedia.org/wiki/SigmaXL
https://en.wikipedia.org/wiki/Unistat
https://en.wikipedia.org/wiki/XLfit
https://en.wikipedia.org/wiki/RExcel
https://en.wikipedia.org/wiki/Category:Statistical_software
https://en.wikipedia.org/wiki/Comparison_of_statistical_packages
https://en.wikipedia.org/wiki/Template:Numerical_analysis_software
https://en.wikipedia.
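
The prefix checks in the cell above can also be delegated to the standard library. The sketch below uses `urllib.parse.urljoin` to resolve each `href` against the page URL; the sample `href` values are hypothetical examples of the kinds of links the loop handles, plus one in-page anchor.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Hypothetical href values: site-relative, protocol-relative, absolute, and an in-page anchor.
hrefs = [
    "/wiki/Guido_van_Rossum",
    "//www.wikidata.org/wiki/Q28865",
    "https://www.python.org/",
    "#History",
]

# Skip in-page anchors and resolve everything else against the page URL.
absolute_links = [urljoin(base_url, href) for href in hrefs if not href.startswith("#")]
print(absolute_links)
```

`urljoin` leaves absolute URLs untouched and fills in the scheme or host for the other two forms, so no manual string slicing is needed.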

#### Find the number of titles that have changed in the United States Code since its last release point.

In [242]:
# This is the url you will scrape in this exercise

url = 'http://uscode.house.gov/download/download.shtml'

page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')


In [269]:
# #content > div > div > div.uscitemlist > div:nth-child(14)
# <div class="usctitlechanged" id="us/usc/t10">

# Changed titles are marked with the 'usctitlechanged' class
changes = soup.find_all('div',class_='usctitlechanged')

chan_list = []
print('Titles that have changed: \n')
for change in changes:
    # Drop the '٭' character, then collapse runs of whitespace into single spaces
    change_long = re.sub('٭','',change.get_text())
    change_clean = ' '.join(re.findall(r'\S+',change_long))
    chan_list.append(change_clean)
    print(change_clean)

Titles that have changed: 

Title 10 - Armed Forces
Title 38 - Veterans' Benefits
Title 42 - The Public Health and Welfare
Title 50 - War and National Defense
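
The exercise asks for a number, while the cell above prints the titles themselves. A minimal sketch of the missing step is to take the length of the list. The titles below are copied from the recorded output so that the example runs on its own; in the notebook, `chan_list` can be used directly.

```python
# Titles copied from the recorded output; in the notebook, use `chan_list` directly.
changed_titles = [
    "Title 10 - Armed Forces",
    "Title 38 - Veterans' Benefits",
    "Title 42 - The Public Health and Welfare",
    "Title 50 - War and National Defense",
]

print(f"Number of titles that have changed: {len(changed_titles)}")
```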


#### Create a Python list with the names on the FBI's Top Ten Most Wanted list.

In [4]:
# This is the url you will scrape in this exercise

url = 'https://www.fbi.gov/wanted/topten'

page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')

In [281]:
# #query-results-0f737222c5054a81a120bce207b0446a > ul > li:nth-child(4) > h3

top10 = soup.find_all('h3',class_='title')

top10_list = []
print("Top 10 FBI'S Most Wanted: \n")
for criminal in top10:
    name = re.sub('\n','',criminal.get_text())
    top10_list.append(name)
    print(name)
    
top10_list

Top 10 FBI'S Most Wanted: 

ALEJANDRO ROSALES CASTILLO
ARNOLDO JIMENEZ
JASON DEREK BROWN
ALEXIS FLORES
JOSE RODOLFO VILLARREAL-HERNANDEZ
EUGENE PALMER
RAFAEL CARO-QUINTERO
ROBERT WILLIAM FISHER
BHADRESHKUMAR CHETANBHAI PATEL
YASER ABDEL SAID


['ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'EUGENE PALMER',
 'RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'YASER ABDEL SAID']

#### Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [52]:
# This is the url you will scrape in this exercise

url = 'https://www.emsc-csem.org/Earthquake/'

page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')

In [53]:
# [id="\39 69071"] > td:nth-child(5)

latest_earthqs = soup.tbody.find_all('tr',class_='normal')

In [55]:
earthq_dict = {'Date':[],'Time':[],'Latitude':[],'Longitude':[],'Region name':[]}

for earthq in latest_earthqs[:20]:

    date_time = earthq.find_all('td',class_='tabev6')
    date_time_list = date_time[0].b.a.get_text().split('\xa0\xa0\xa0')
    earthq_dict['Date'].append(date_time_list[0])
    earthq_dict['Time'].append(date_time_list[1])
    
    location = earthq.find_all('td',class_='tabev1')
    dir_and_mag = earthq.find_all('td',class_='tabev2')
    earthq_dict['Latitude'].append(location[0].get_text()+dir_and_mag[0].get_text().strip())
    earthq_dict['Longitude'].append(location[1].get_text()+dir_and_mag[1].get_text().strip())
    
    region = earthq.find_all('td',class_='tb_region')
    earthq_dict['Region name'].append(region[0].get_text().strip())

table_earthq = pd.DataFrame(earthq_dict)
table_earthq

| | Date | Time | Latitude | Longitude | Region name |
|---|---|---|---|---|---|
| 0 | 2021-04-12 | 09:16:09.9 | 36.54 N | 25.56 E | DODECANESE ISLANDS, GREECE |
| 1 | 2021-04-12 | 09:15:19.0 | 31.74 S | 69.53 W | SAN JUAN, ARGENTINA |
| 2 | 2021-04-12 | 09:10:44.2 | 38.11 N | 2.20 W | SPAIN |
| 3 | 2021-04-12 | 09:02:16.9 | 17.81 N | 68.09 W | DOMINICAN REPUBLIC REGION |
| 4 | 2021-04-12 | 08:38:21.3 | 36.94 N | 27.39 E | DODECANESE IS.-TURKEY BORDER REG |
| 5 | 2021-04-12 | 08:38:01.4 | 38.00 N | 2.18 W | SPAIN |
| 6 | 2021-04-12 | 08:36:59.8 | 38.02 N | 2.19 W | SPAIN |
| 7 | 2021-04-12 | 08:19:12.4 | 38.28 N | 38.79 E | EASTERN TURKEY |
| 8 | 2021-04-12 | 07:49:15.4 | 36.20 N | 7.70 W | STRAIT OF GIBRALTAR |
| 9 | 2021-04-12 | 07:32:20.4 | 36.41 N | 27.13 E | DODECANESE IS.-TURKEY BORDER REG |


#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [135]:
# This is the url you will scrape in this exercise

url = 'https://www.wikipedia.org/'

page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')


In [138]:
# #js-link-box-es > strong
# body > div.central-featured //  body > div.central-textlogo


globe = soup.find_all('div',class_='central-featured')

In [139]:
languages = globe[0].find_all('div')

In [82]:
wiki_dict = {}

for language in languages:
    
    wiki_dict[language.a.strong.get_text()] = language.a.small.bdi.get_text().replace('\xa0','.')
    
wiki_dict

{'English': '6.280.000+',
 'Español': '1.673.000+',
 'Deutsch': '2.562.000+',
 '日本語': '1.263.000+',
 'Русский': '1.714.000+',
 'Français': '2.317.000+',
 'Italiano': '1.685.000+',
 '中文': '1.190.000+',
 'Português': '1.065.000+',
 'Polski': '1.467.000+'}
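
If the article counts are needed as numbers rather than display strings, the non-digit characters can be stripped before converting. This is a sketch on two entries shaped like the raw scraped values, that is, before `'\xa0'` is replaced with `'.'`. The trailing `+` means each figure is a lower bound.

```python
import re

# Two entries shaped like the raw scraped values, before '\xa0' is replaced.
raw_counts = {"English": "6\xa0280\xa0000+", "Español": "1\xa0673\xa0000+"}

# Drop every non-digit character (non-breaking spaces and the trailing '+'), then convert.
article_counts = {lang: int(re.sub(r"\D", "", count)) for lang, count in raw_counts.items()}
print(article_counts)  # {'English': 6280000, 'Español': 1673000}
```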

#### Create a list of the different kinds of datasets available on data.gov.uk.

In [142]:
# This is the url you will scrape in this exercise

url = 'https://data.gov.uk/'

page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')

In [143]:
# #main-content > div:nth-child(3) > div > ul class="govuk-list dgu-topics__list" class="govuk-grid-column-full"

datasets = soup.find_all('ul',class_='govuk-list dgu-topics__list')

In [155]:
topics = datasets[0].find_all('li')

topic_dict = {}

for topic in topics:
    
    topic_dict[topic.h3.a.get_text()] =  topic.p.get_text()
    
topic_dict

{'Business and economy': 'Small businesses, industry, imports, exports and trade',
 'Crime and justice': 'Courts, police, prison, offenders, borders and immigration',
 'Defence': 'Armed forces, health and safety, search and rescue',
 'Education': 'Students, training, qualifications and the National Curriculum',
 'Environment': 'Weather, flooding, rivers, air quality, geology and agriculture',
 'Government': 'Staff numbers and pay, local councillors and department business plans',
 'Government spending': 'Includes all payments by government departments over £25,000',
 'Health': 'Includes smoking, drugs, alcohol, medicine performance and hospitals',
 'Mapping': 'Addresses, boundaries, land ownership, aerial photographs, seabed and land terrain',
 'Society': 'Employment, benefits, household finances, poverty and population',
 'Towns and cities': 'Includes housing, urban planning, leisure, waste and energy, consumption',
 'Transport': 'Airports, roads, freight, electric vehicles, parking, 

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [179]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')

In [180]:
# #mw-content-text > div.mw-parser-output > table:nth-child(12)

table = soup.select('div.mw-parser-output > table:nth-child(12)')

rows = table[0].tbody.find_all('tr')

In [253]:
languages_dict = {'Language':[],'Native Speakers (in millions)':[]}

for row in rows[1:11]:
    
    cols = row.find_all('td')
    
    languages_dict['Language'].append(re.sub(r'\[\d+\]','',cols[1].get_text().replace('\n','')))
    languages_dict['Native Speakers (in millions)'].append(cols[2].get_text().replace('\n',''))

lang_df = pd.DataFrame(languages_dict,)
lang_df

| | Language | Native Speakers (in millions) |
|---|---|---|
| 0 | Mandarin Chinese | 918.0 |
| 1 | Spanish | 480.0 |
| 2 | English | 379.0 |
| 3 | Hindi (sanskritised Hindustani) | 341.0 |
| 4 | Bengali | 228.0 |
| 5 | Portuguese | 221.0 |
| 6 | Russian | 154.0 |
| 7 | Japanese | 128.0 |
| 8 | Western Punjabi | 92.7 |
| 9 | Marathi | 83.1 |


## Bonus


#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [254]:
# This is the url you will scrape in this exercise 

url = 'https://www.imdb.com/chart/top'

page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')

In [263]:
#main > div > span > div > div > div.lister > table > tbody lister-list

movies = soup.select('tbody.lister-list tr')

In [294]:
imdb = {'Movie':[],'Release':[],'Director':[],'Stars':[]}

for movie in movies:
    
    info = movie.select('td.titleColumn')[0]
    
    imdb['Movie'].append(info.a.get_text())
    
    # Keep only the digits of the release year
    imdb['Release'].append(re.sub(r'\D','',info.span.get_text()))
    
    # The link's title attribute has the form '<director> (dir.), <star>, <star>'
    dir_n_stars = info.a['title'].split(' (dir.), ')
    
    imdb['Director'].append(dir_n_stars[0])
    imdb['Stars'].append(dir_n_stars[1])
    
imdb_df = pd.DataFrame(imdb)

imdb_df

| | Movie | Release | Director | Stars |
|---|---|---|---|---|
| 0 | Cadena perpetua | 1994 | Frank Darabont | Tim Robbins, Morgan Freeman |
| 1 | El padrino | 1972 | Francis Ford Coppola | Marlon Brando, Al Pacino |
| 2 | El padrino: Parte II | 1974 | Francis Ford Coppola | Al Pacino, Robert De Niro |
| 3 | El caballero oscuro | 2008 | Christopher Nolan | Christian Bale, Heath Ledger |
| 4 | 12 hombres sin piedad | 1957 | Sidney Lumet | Henry Fonda, Lee J. Cobb |
| ... | ... | ... | ... | ... |
| 245 | Sucedió una noche | 1934 | Frank Capra | Clark Gable, Claudette Colbert |
| 246 | Milagro en la celda 7 | 2019 | Mehmet Ada Öztekin | Aras Bulut Iynemli, Nisa Sofiya Aksongur |
| 247 | Mandarinas | 2013 | Zaza Urushadze | Lembit Ulfsak, Elmo Nüganen |
| 248 | Neon Genesis Evangelion: The End of Evangelion | 1997 | Hideaki Anno | Megumi Ogata, Megumi Hayashibara |
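
Splitting on `' (dir.), '` and indexing the second element raises an `IndexError` if a title attribute ever lacks that separator. A more defensive sketch uses `str.partition`, which returns three parts every time. The helper name `split_director_and_stars` is hypothetical, and the first sample string is reconstructed from the first row of the table above.

```python
def split_director_and_stars(title_attr, separator=" (dir.), "):
    """Split '<director> (dir.), <stars>' into its two parts.

    str.partition never raises: if the separator is missing, the whole
    string is returned as the director and the stars are left empty.
    """
    director, _, stars = title_attr.partition(separator)
    return director, stars


print(split_director_and_stars("Frank Darabont (dir.), Tim Robbins, Morgan Freeman"))
print(split_director_and_stars("Unknown format"))
```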


#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [295]:
#This is the url you will scrape in this exercise

url = 'https://www.imdb.com/list/ls091294718/'

page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')

In [308]:
# #main > div > div.lister.list.detail.sub-list

movies_list = soup.find_all('div',class_='lister-list')

In [323]:
movies = movies_list[0].select('div.lister-item-content')

movie_dict = {'Name':[],'Year':[],'Summary':[]}

for movie in movies[:10]:
    
    movie_dict['Name'].append(movie.h3.a.get_text())
    
    movie_dict['Year'].append(movie.h3.select('span')[1].get_text())
    
    movie_dict['Summary'].append(movie.select('p')[1].get_text().strip())

movies_pd = pd.DataFrame(movie_dict)

movies_pd

| | Name | Year | Summary |
|---|---|---|---|
| 0 | American Psycho | (2000) | A wealthy New York City investment banking exe... |
| 1 | La cosa | (1982) | A research team in Antarctica is hunted by a s... |
| 2 | Tiburón | (1975) | When a killer shark unleashes chaos on a beach... |
| 3 | Heat | (1995) | A group of professional bank robbers start to ... |
| 4 | Top Gun (Ídolos del aire) | (1986) | As students at the United States Navy's elite ... |
| 5 | Pulp Fiction | (1994) | The lives of two mob hitmen, a boxer, a gangst... |
| 6 | Jóvenes ocultos | (1987) | After moving to a new town, two brothers disco... |
| 7 | Ed Wood | (1994) | Ambitious but troubled movie director Edward D... |
| 8 | The Game | (1997) | After a wealthy banker is given an opportunity... |
| 9 | Con faldas y a lo loco | (1959) | After two male musicians witness a mob hit, th... |


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [386]:
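# API reference: https://openweathermap.org/current

city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather'
params = {'q': city, 'APPID': 'b35975e18dc93725acb092f7272cc6b8', 'units': 'metric'}

# Passing the query as params lets requests URL-encode city names with spaces or accents
page = requests.get(url, params=params)

# The endpoint returns JSON rather than HTML, so decode it directly instead of going through BeautifulSoup
loc = page.json()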

Enter the city: sevilla


In [387]:
weather_report = {}

weather_report['Temperature'] = loc['main']['temp']
weather_report['Wind speed'] = loc['wind']['speed']
weather_report['Description'] = loc['weather'][0]['description']
weather_report['Weather'] = loc['weather'][0]['main']

weather_report

{'Temperature': 20.33,
 'Wind speed': 2.06,
 'Description': 'clear sky',
 'Weather': 'Clear'}
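
To make the request and the response shape easier to follow offline, here is a standard-library sketch. `'YOUR_API_KEY'` is a placeholder, the city is an arbitrary example containing a space, and the JSON body is hand-written to contain only the four fields read above; a real response carries more.

```python
import json
from urllib.parse import urlencode

base_url = "http://api.openweathermap.org/data/2.5/weather"

# 'YOUR_API_KEY' is a placeholder, and the city is an arbitrary example with a space in it.
query = urlencode({"q": "San Sebastian", "APPID": "YOUR_API_KEY", "units": "metric"})
print(f"{base_url}?{query}")

# Hand-written payload containing only the fields read above; a real response has more.
sample_body = (
    '{"main": {"temp": 20.33}, "wind": {"speed": 2.06}, '
    '"weather": [{"main": "Clear", "description": "clear sky"}]}'
)
sample = json.loads(sample_body)
print(sample["main"]["temp"], sample["weather"][0]["description"])
```

Encoding the query this way keeps the URL valid for city names that contain spaces or non-ASCII characters.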

#### Display the book name, price and stock availability as a pandas dataframe.

In [394]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 

url = 'http://books.toscrape.com/'

page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')


In [395]:
# body > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article

books = soup.find_all('article',class_='product_pod')

In [401]:
book_dict = {'Name':[],'Price':[],'Availability':[]}

for book in books:
    
    book_dict['Name'].append(book.h3.a.get_text())
    
    book_dict['Price'].append(book.select('div')[1].p.get_text())
    
    book_dict['Availability'].append(book.select('div')[1].select('p')[1].get_text().strip())
    
bookspd = pd.DataFrame(book_dict)

In [402]:
bookspd

| | Name | Price | Availability |
|---|---|---|---|
| 0 | A Light in the ... | £51.77 | In stock |
| 1 | Tipping the Velvet | £53.74 | In stock |
| 2 | Soumission | £50.10 | In stock |
| 3 | Sharp Objects | £47.82 | In stock |
| 4 | Sapiens: A Brief History ... | £54.23 | In stock |
| 5 | The Requiem Red | £22.65 | In stock |
| 6 | The Dirty Little Secrets ... | £33.34 | In stock |
| 7 | The Coming Woman: A ... | £17.93 | In stock |
| 8 | The Boys in the ... | £22.60 | In stock |
| 9 | The Black Maria | £52.15 | In stock |
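
The prices above are stored as strings, which prevents numeric sorting or averaging. A minimal sketch of the conversion, assuming every price starts with a `£` symbol as in the rows shown:

```python
# Price strings copied from the first rows of the table above.
prices = ["£51.77", "£53.74", "£50.10"]

# Strip the currency symbol and convert to float so the column can be sorted or averaged.
numeric_prices = [float(price.lstrip("£")) for price in prices]
print(numeric_prices)  # [51.77, 53.74, 50.1]
```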
