# Web Scraping Lab

This notebook contains scraping exercises to help you practise your web scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used for the HTML content you are expected to extract.
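
The first two tips can be sketched as a small helper. This is a minimal sketch rather than part of the lab: `summarize_response` and `fetch` are hypothetical names, and the 200-character preview length and 10-second timeout are arbitrary choices.

```python
import requests


def summarize_response(response, preview_chars=200):
    """Return the status code, an OK flag, and a short preview of the body."""
    return {
        "status": response.status_code,
        "ok": 200 <= response.status_code < 300,
        "preview": response.text[:preview_chars],
    }


def fetch(url):
    """Download a page and print a summary before any parsing is attempted."""
    response = requests.get(url, timeout=10)
    summary = summarize_response(response)
    print(summary["status"], "OK" if summary["ok"] else "unexpected status")
    print(summary["preview"])
    return response
```

Calling `fetch(url)` at the top of each exercise shows early whether the page returned what you expected, before you start looking for patterns in it.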

**Resources:**

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from pprint import pprint
from lxml import html
from lxml.html import fromstring
import urllib.request
from urllib.request import urlopen
import random
import re
#import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
#your code
html=requests.get(url).content
sopa=BeautifulSoup(html,'lxml')
print(sopa)

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-b798bd8bbfd812425a6347adbee0d040.css" integrity="sha512-t5i9i7/YEkJaY0etvuDQQJWuNbLjmHbatMDO3mkDPwa34wYXh7HYaGl2onO8idtPizB95h6SsERiIF+v2vbBPw==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-8f94276ad938bb2b866a7078e1df8040.css" integrity="sha512-j5Qnatk4uyuGanB44d+AQJb0t5hmCK

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain any HTML tags.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like the following:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
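
The instructions above can be sketched on a small stand-in snippet. The markup below is hypothetical: the `article`, `h1` and `p` tags and their class names are assumptions for illustration, and the real ones must be read from Chrome DevTools. `developer_names` is likewise a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the trending page; the real tag and class names
# may differ and should be checked in Chrome DevTools.
SAMPLE_HTML = """
<article>
  <h1 class="h3">
      <a href="/jpsim">
            JP Simard
      </a>
  </h1>
  <p class="f4">
      <a href="/jpsim">
              jpsim
      </a>
  </p>
</article>
<article>
  <h1 class="h3">
      <a href="/frenck">
            Franck Nijhof
      </a>
  </h1>
  <p class="f4">
      <a href="/frenck">
              frenck
      </a>
  </p>
</article>
"""


def developer_names(markup):
    """Return 'handle (Name)' strings with whitespace and line breaks removed."""
    soup = BeautifulSoup(markup, "html.parser")
    names = []
    for card in soup.find_all("article"):
        name = card.find("h1").get_text()
        handle = card.find("p").get_text()
        # str.split() with no arguments splits on any run of whitespace,
        # so joining the pieces drops spaces, tabs and "\n" in one step.
        clean_name = "".join(name.split())
        clean_handle = "".join(handle.split())
        names.append(f"{clean_handle} ({clean_name})")
    return names


print(developer_names(SAMPLE_HTML))
```

Removing every space matches the style of the expected output (for example `'charlax (Charles-AxelDein)'`). If you would rather keep the space inside a name, use `" ".join(name.split())` instead.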

In [4]:
#your code
tags = ['h1','p']
text = [element.text for element in sopa.find_all(tags)]
print('\n'.join(text))

Trending

      These are the developers building the hot tools today.
    


            JP Simard
 


              jpsim
 



      Yams
 


            Franck Nijhof
 


              frenck
 



      home-assistant-config
 


            Michiel Borkent
 


              borkdude
 



      grasp
 


            Anthony Fu
 


              antfu
 



      ni
 


            Fons van der Plas
 


              fonsp
 



      Pluto.jl
 


            Koen Kanters
 


              Koenkk
 



      zigbee2mqtt
 


            Paulus Schoutsen
 


              balloob
 
 @home-assistant, @nabucasa


            Anthony Sottile
 


              asottile
 



      pyupgrade
 


            Jan Karger ツ ☀
 


              punker76
 



      gong-wpf-dragdrop
 


            Javier Suárez
 


              jsuarezruiz
 



      xamarin-forms-goodlooking-UI
 


            Siddharth Dushantha
 


              sdushantha
 



      tmpmail
 


            Sebastián Ramírez
 




#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.
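
As a minimal sketch of the clean-up step, assume the heading text of each repository has already been extracted and looks like `owner /` followed by the repository name, surrounded by line breaks and indentation. `clean_repo_name` is a hypothetical helper name.

```python
def clean_repo_name(raw_text):
    """Collapse 'owner /<whitespace>repo' into 'owner/repo'."""
    return "".join(raw_text.split())


# Raw heading texts as they might come out of element.text; the first one is
# the page title, which has no "/" and is therefore skipped.
raw_headings = [
    "Trending",
    "\n\n\n\n        facebookresearch /\n\n      pifuhd\n \n",
    "\n\n\n\n        geohot /\n\n      tinygrad\n \n",
]

repos = [clean_repo_name(text) for text in raw_headings if "/" in text]
print(repos)
```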

In [5]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'
html=requests.get(url).content
sopa=BeautifulSoup(html,'lxml')
print(sopa)

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-b798bd8bbfd812425a6347adbee0d040.css" integrity="sha512-t5i9i7/YEkJaY0etvuDQQJWuNbLjmHbatMDO3mkDPwa34wYXh7HYaGl2onO8idtPizB95h6SsERiIF+v2vbBPw==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-8f94276ad938bb2b866a7078e1df8040.css" integrity="sha512-j5Qnatk4uyuGanB44d+AQJb0t5hmCK

In [6]:
#your code
tags = ['h1']
text = [element.text for element in sopa.find_all(tags)]
print('\n'.join(text))

Trending




        facebookresearch /

      pifuhd
 




        guardicore /

      monkey
 




        geohot /

      tinygrad
 




        jofpin /

      trape
 




        LonamiWebs /

      Telethon
 




        django /

      django
 




        python-telegram-bot /

      python-telegram-bot
 




        lazyprogrammer /

      machine_learning_examples
 




        teja156 /

      microsoft-teams-class-attender
 




        sherlock-project /

      sherlock
 




        Cog-Creators /

      Red-DiscordBot
 




        iearn-finance /

      yearn-vaults
 




        PrefectHQ /

      prefect
 




        tiangolo /

      fastapi
 




        pennersr /

      django-allauth
 




        TheSpeedX /

      TBomb
 




        aapatre /

      Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE
 




        plotly /

      dash
 




        microsoft /

      playwright-python
 




        thenewboston-developers /

      thenewboston-pyt

#### Display all the image links from the Walt Disney Wikipedia page
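
One way to get links rather than whole `<img>` tags is to read each tag's `src` attribute and resolve it against the page URL. The sketch below runs on a small stand-in snippet. The second and third `<img>` tags are hypothetical, added to show a missing `src` and a relative path, and `image_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://en.wikipedia.org/wiki/Walt_Disney"

# Stand-in markup; only the first tag mirrors the real page.
SAMPLE_HTML = """
<img alt="Featured article" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" width="20"/>
<img alt="No source attribute"/>
<img alt="Relative path" src="/static/images/example.png"/>
"""


def image_links(markup, base_url):
    """Return absolute URLs for every <img> that has a src attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    for img in soup.find_all("img"):
        src = img.get("src")  # None when the attribute is missing
        if src:
            # urljoin fills in the scheme for "//host/path" and the host
            # for "/path" using the page URL.
            links.append(urljoin(base_url, src))
    return links


for link in image_links(SAMPLE_HTML, PAGE_URL):
    print(link)
```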

In [7]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'
html=requests.get(url).content
sopa=BeautifulSoup(html,'lxml')
print(sopa)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Walt Disney - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"f2a25c07-36e7-4bf9-9239-99c22a02a421","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":985305640,"wgRevisionId":985305640,"wgArticleId":32917,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Articles with short description","Short description is different from Wikidata","Wikipedia extended-confirmed-protected pag

In [8]:
#your code
tags = ['img']
text = sopa.find_all(tags)
print(text)

[<img alt="This is a featured article. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>, <img alt="Extended-protected article" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/30px-Extended-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/40px-Extended-protection-shackle.svg.png 2x" width="20"/>, <img alt="Walt Disney 1946.JP

#### Retrieve an arbitrary Wikipedia page for "Python" and create a list of the links on that page

In [9]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 
html=requests.get(url).content
sopa=BeautifulSoup(html,'lxml')
print(sopa)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Python - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3bfcf670-dd73-480c-bbb4-5096cc74a32f","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":987482924,"wgRevisionId":987482924,"wgArticleId":46332325,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short descriptions","Short description is different from Wikidata","All article disambiguation pages","All disambiguation pages","Animal common name disambiguation

In [10]:
#your code
tags = ['a']
text = sopa.find_all(tags)
print(text)

[<a id="top"></a>, <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>, <a class="mw-jump-link" href="#searchInput">Jump to search</a>, <a class="extiw" href="https://en.wiktionary.org/wiki/Python" title="wiktionary:Python">Python</a>, <a class="extiw" href="https://en.wiktionary.org/wiki/python" title="wiktionary:python">python</a>, <a class="mw-redirect" href="/wiki/Pythons" title="Pythons">Pythons</a>, <a href="/wiki/Python_(genus)" title="Python (genus)"><i>Python</i> (genus)</a>, <a href="#Computing"><span class="tocnumber">1</span> <span class="toctext">Computing</span></a>, <a href="#People"><span class="tocnumber">2</span> <span class="toctext">People</span></a>, <a href="#Roller_coasters"><span class="tocnumber">3</span> <span class="toctext">Roller coasters</span></a>, <a href="#Vehicles"><span class="tocnumber">4</span> <span class="toctext">Vehicles</span></a>, <a href="#Weaponry"><span class="tocnumber">5</span> <span class="toctext">Weaponry</span></a>, <a href

#### Number of Titles that have changed in the United States Code since its last release point 
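
A minimal sketch of the counting step, assuming changed titles are the `div` elements with class `usctitlechanged`. The snippet is a stand-in, and the `usctitle` div in it is a hypothetical example of an unchanged title. `changed_titles` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div class="usctitlechanged" id="us/usc/t8">

          Title 8 - Aliens and Nationality

        </div>
<div class="usctitle" id="us/usc/t9">

          Title 9 - Arbitration

        </div>
<div class="usctitlechanged" id="us/usc/t11">

          Title 11 - Bankruptcy <span class="footnote"><a class="fn" href="#fn">٭</a></span>
</div>
"""


def changed_titles(markup):
    """Return the cleaned names of all titles marked as changed."""
    soup = BeautifulSoup(markup, "html.parser")
    titles = []
    for div in soup.find_all("div", class_="usctitlechanged"):
        # Drop the footnote marker so only the title text remains.
        for note in div.find_all("span", class_="footnote"):
            note.decompose()
        titles.append(" ".join(div.get_text().split()))
    return titles


titles = changed_titles(SAMPLE_HTML)
print(len(titles))
print(titles)
```

The number asked for in the exercise is then simply `len(titles)`.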

In [11]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'
html=requests.get(url).content
sopa=BeautifulSoup(html,'lxml')
print(sopa)

<?xml version='1.0' encoding='UTF-8' ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<meta content="no-cache" http-equiv="pragma"/><!-- HTTP 1.0 -->
<meta content="no-cache,must-revalidate" http-equiv="cache-control"/><!-- HTTP 1.1 -->
<meta content="0" http-equiv="expires"/>
<link href="/javax.faces.resource/favicon.ico.xhtml?ln=images" rel="shortcut icon"/><link href="/javax.faces.resource/cssLayout.css.xhtml?ln=css" rel="stylesheet" type="text/css"/><script src="/javax.faces.resource/jsf.js.xhtml?ln=javax.faces" type="text/javascript"></script><link href="/javax.faces.resource/static.css.xhtml?ln=css" rel="stylesheet" type="text/css"/></head><body><script src="/javax.faces.resource/browserPreferences.js.xhtml?ln=scripts" type="text/javascri

In [12]:
#your code

text = sopa.find_all('div',{'class':"usctitlechanged"})
print(text)

[<div class="usctitlechanged" id="us/usc/t8">

          Title 8 - Aliens and Nationality

        </div>, <div class="usctitlechanged" id="us/usc/t11">

          Title 11 - Bankruptcy <span class="footnote"><a class="fn" href="#fn">٭</a></span>
</div>, <div class="usctitlechanged" id="us/usc/t15">

          Title 15 - Commerce and Trade

        </div>, <div class="usctitlechanged" id="us/usc/t18">

          Title 18 - Crimes and Criminal Procedure <span class="footnote"><a class="fn" href="#fn">٭</a></span>
</div>, <div class="usctitlechanged" id="us/usc/t25">

          Title 25 - Indians

        </div>, <div class="usctitlechanged" id="us/usc/t31">

          Title 31 - Money and Finance <span class="footnote"><a class="fn" href="#fn">٭</a></span>
</div>, <div class="usctitlechanged" id="us/usc/t32">

          Title 32 - National Guard <span class="footnote"><a class="fn" href="#fn">٭</a></span>
</div>, <div class="usctitlechanged" id="us/usc/t38">

          Title 38 - Vetera

#### A Python list with the names of the FBI's Ten Most Wanted

In [13]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'
html=requests.get(url).content
sopa=BeautifulSoup(html,'lxml')
print(sopa)

<!DOCTYPE html>
<html data-gridsystem="bs3" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="https://www.fbi.gov/wanted/topten" rel="canonical"/><title>Ten Most Wanted Fugitives — FBI</title>
<link href="https://www.fbi.gov/wanted/topten/RSS" rel="alternate" title="Ten Most Wanted Fugitives - RSS 1.0" type="application/rss+xml"/>
<link href="https://www.fbi.gov/wanted/topten/rss.xml" rel="alternate" title="Ten Most Wanted Fugitives - RSS 2.0" type="application/rss+xml"/>
<link href="https://www.fbi.gov/wanted/topten/atom.xml" rel="alternate" title="Ten Most Wanted Fugitives - Atom" type="application/rss+xml"/>
<meta content="summary_large_image" name="twitter:card"/>
<meta content="Ten Most Wanted Fugitives | Federal Bureau of Investigation" name="twitter:title"/>
<meta content="Federal Bureau of Investigation" property="og:site_name"/>
<meta content="Ten 

In [14]:
#your code 
tags = ['h3']
text = [element.text for element in sopa.find_all(tags)]
print('\n'.join(text))


ARNOLDO JIMENEZ


JASON DEREK BROWN


ALEXIS FLORES


JOSE RODOLFO VILLARREAL-HERNANDEZ


EUGENE PALMER


RAFAEL CARO-QUINTERO


ROBERT WILLIAM FISHER


BHADRESHKUMAR CHETANBHAI PATEL


ALEJANDRO ROSALES CASTILLO


YASER ABDEL SAID



#### Info on the 20 latest earthquakes (date, time, latitude, longitude and region name) from the EMSC as a pandas DataFrame
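
Splitting each row's text on `"\n"` leaves one long string per row, so a per-cell approach is easier to turn into columns. The sketch below assumes the cell classes `tabev6`, `tabev1`, `tabev2` and `tb_region` carry the date/time, the coordinates, the hemisphere letters and the region. The sample rows are trimmed stand-ins, with `&nbsp;` assumed for the non-breaking spaces. `parse_row` and `earthquakes` are hypothetical helper names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed stand-in for the EMSC table body.
SAMPLE_HTML = """
<table><tbody id="tbody">
<tr class="ligne1 normal" id="919468"><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=919468">2020-11-08&nbsp;&nbsp;&nbsp;22:15:46.0</a></b><i class="ago" id="ago0">22min ago</i></td><td class="tabev1">11.30&nbsp;</td><td class="tabev2">N&nbsp;&nbsp;</td><td class="tabev1">85.64&nbsp;</td><td class="tabev2">W&nbsp;&nbsp;</td><td class="tabev3">176</td><td class="tabev5"> M</td><td class="tabev2">3.5</td><td class="tb_region">&nbsp;NICARAGUA</td></tr>
<tr class="ligne2 normal" id="919467"><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=919467">2020-11-08&nbsp;&nbsp;&nbsp;22:02:37.0</a></b><i class="ago" id="ago1">35min ago</i></td><td class="tabev1">23.92&nbsp;</td><td class="tabev2">S&nbsp;&nbsp;</td><td class="tabev1">67.21&nbsp;</td><td class="tabev2">W&nbsp;&nbsp;</td><td class="tabev3">226</td><td class="tabev5">ML</td><td class="tabev2">3.3</td><td class="tb_region">&nbsp;SALTA, ARGENTINA</td></tr>
</tbody></table>
"""

COLUMNS = ["date", "time", "latitude", "longitude", "region"]


def parse_row(row):
    """Pick the requested fields out of one <tr> using the cell classes."""
    # The link text is "date<non-breaking spaces>time"; split() handles both
    # ordinary and non-breaking whitespace.
    date, time = row.find("td", class_="tabev6").find("a").get_text().split()
    values = [td.get_text(strip=True) for td in row.find_all("td", class_="tabev1")]
    signs = [td.get_text(strip=True) for td in row.find_all("td", class_="tabev2")]
    return {
        "date": date,
        "time": time,
        "latitude": f"{values[0]} {signs[0]}",
        "longitude": f"{values[1]} {signs[1]}",
        "region": row.find("td", class_="tb_region").get_text(strip=True),
    }


def earthquakes(markup, limit=20):
    """Build a DataFrame from the first `limit` rows of the table body."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = soup.find("tbody").find_all("tr")[:limit]
    return pd.DataFrame([parse_row(row) for row in rows], columns=COLUMNS)


df = earthquakes(SAMPLE_HTML)
print(df)
```

Because this table body has no header row, the column names are supplied explicitly instead of being taken from the first row.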

In [15]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'
html = requests.get(url).content
sopa = BeautifulSoup(html, "lxml")
print(sopa)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/">
<head><meta content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" name="google-site-verification"/><meta content="BCAA3C04C41AE6E6AFAF117B9469C66F" name="msvalidate.01"/><meta content="43b36314ccb77957" name="y_key"/><!-- 5-Clk8f50tFFdPTU97Bw7ygWE1A -->
<meta content="en" http-equiv="Content-Language"/><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="all" name="robots"/>
<meta content="earthquake,earthquakes,last earthquake,earthquake today,earthquakes today,earth quake,earth quakes,real time seismicity,seismic,seismicity,seismicity map,seismology,sismologie,EMSC,CSEM,seismicity on google earth,sumatra,tsunami,tsunamis,map,maps,richter,mercalli,moment tensors,epicenter,magnitude,seismology,foreshock,aftershock,tremor" name="keywo

In [16]:
table = sopa.find_all('tbody')[0]
table

<tbody id="tbody"><tr class="ligne1 normal" id="919468" onclick="go_details(event,919468);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=919468">2020-11-08   22:15:46.0</a></b><i class="ago" id="ago0">22min ago</i></td><td class="tabev1">11.30 </td><td class="tabev2">N  </td><td class="tabev1">85.64 </td><td class="tabev2">W  </td><td class="tabev3">176</td><td class="tabev5" id="magtyp0"> M</td><td class="tabev2">3.5</td><td class="tb_region" id="reg0"> NICARAGUA</td><td class="comment updatetimeno" id="upd0" style="text-align:right;">2020-11-08 22:21</td></tr>
<tr class="ligne2 normal" id="919467" onclick="go_details(event,919467);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=919467">2020-11-08   22:02:37.0</a></b><i class="ago" id="

In [17]:
#your code
rows = table.find_all('tr')
rows = [row.text.strip().split("\n") for row in rows]
rows


[['earthquake2020-11-08\xa0\xa0\xa022:15:46.022min ago11.30\xa0N\xa0\xa085.64\xa0W\xa0\xa0176 M3.5\xa0NICARAGUA2020-11-08 22:21'],
 ['earthquake2020-11-08\xa0\xa0\xa022:02:37.035min ago23.92\xa0S\xa0\xa067.21\xa0W\xa0\xa0226ML3.3\xa0SALTA, ARGENTINA2020-11-08 22:15'],
 ['earthquake2020-11-08\xa0\xa0\xa022:00:23.037min ago2.67\xa0S\xa0\xa0121.92\xa0E\xa0\xa010 M3.1\xa0SULAWESI, INDONESIA2020-11-08 22:05'],
 ['earthquake2020-11-08\xa0\xa0\xa021:49:23.448min ago36.66\xa0N\xa0\xa0121.28\xa0W\xa0\xa05Md2.8\xa0CENTRAL CALIFORNIA2020-11-08 21:51'],
 ['earthquake2020-11-08\xa0\xa0\xa021:44:20.053min ago27.85\xa0N\xa0\xa084.97\xa0E\xa0\xa010 M4.3\xa0NEPAL-INDIA BORDER REGION2020-11-08 21:55'],
 ['earthquake2020-11-08\xa0\xa0\xa021:42:05.056min ago6.00\xa0S\xa0\xa0122.51\xa0E\xa0\xa010 M4.4\xa0FLORES SEA2020-11-08 21:50'],
 ['earthquake2020-11-08\xa0\xa0\xa021:25:20.01hr 12min ago2.27\xa0S\xa0\xa0119.15\xa0E\xa0\xa010 M2.8\xa0SULAWESI, INDONESIA2020-11-08 21:35'],
 ['earthquake2020-11-08\xa0\xa0

In [18]:
colnames = rows[0]
data = rows[1:]

df = pd.DataFrame(data, columns=colnames)
df


| | earthquake2020-11-08 22:15:46.022min ago11.30 N 85.64 W 176 M3.5 NICARAGUA2020-11-08 22:21 |
|---|---|
| 0 | earthquake2020-11-08 22:02:37.035min ago23.9... |
| 1 | earthquake2020-11-08 22:00:23.037min ago2.67... |
| 2 | earthquake2020-11-08 21:49:23.448min ago36.6... |
| 3 | earthquake2020-11-08 21:44:20.053min ago27.8... |
| 4 | earthquake2020-11-08 21:42:05.056min ago6.00... |
| 5 | earthquake2020-11-08 21:25:20.01hr 12min ago... |
| 6 | earthquake2020-11-08 21:23:30.01hr 14min ago... |
| 7 | earthquake2020-11-08 21:14:33.51hr 23min ago... |
| 8 | earthquake2020-11-08 21:06:53.11hr 31min ago... |
| 9 | earthquake2020-11-08 21:03:39.01hr 34min ago... |


#### Display the date, days, title, city and country of the next 25 hackathon events as a pandas DataFrame

In [19]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'
html = requests.get(url).content
sopa = BeautifulSoup(html, "lxml")
print(sopa)

ConnectionError: HTTPSConnectionPool(host='hackevents.co', port=443): Max retries exceeded with url: /hackathons (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002A286F590D0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

In [20]:
#your code
# The URL does not open

#### Count the number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account
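
A minimal sketch of the try/except part, for any account name. `fetch_profile` is a hypothetical helper, and its `get` parameter exists only so the function can be exercised without a network call. Whether the site actually answers with an error status for an unknown account is an assumption you should check against the real responses.

```python
import requests


def fetch_profile(handle, get=requests.get):
    """Return the profile page HTML, or None if the account cannot be fetched."""
    url = f"https://twitter.com/{handle}"
    try:
        response = get(url, timeout=10)
        # raise_for_status() raises requests.HTTPError on 4xx/5xx responses.
        response.raise_for_status()
    except requests.RequestException as error:
        # RequestException also covers connection errors and timeouts.
        print(f"Could not retrieve {handle!r}: {error}")
        return None
    return response.text
```

A caller can then test `if html is None:` and skip the counting step for accounts that were not found.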

In [21]:
# This is the url you will scrape in this exercise 
# You will need to add the account name to this url
url = 'https://twitter.com/LopezObrador_'
html = requests.get(url).content
sopa = BeautifulSoup(html, "lxml")
print(sopa)

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head><meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" name="viewport"/>
<link href="//abs.twimg.com" rel="preconnect"/>
<link href="//api.twitter.com" rel="preconnect"/>
<link href="//pbs.twimg.com" rel="preconnect"/>
<link href="//t.co" rel="preconnect"/>
<link href="//video.twimg.com" rel="preconnect"/>
<link href="//abs.twimg.com" rel="dns-prefetch"/>
<link href="//api.twitter.com" rel="dns-prefetch"/>
<link href="//pbs.twimg.com" rel="dns-prefetch"/>
<link href="//t.co" rel="dns-prefetch"/>
<link href="//video.twimg.com" rel="dns-prefetch"/>
<link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/polyfills.90d86535.js" nonce="YTc3NGVjMjMtMTcyMi00NzJjLTgxNTItOWIxNTg2NzE0NTcz" rel="preload"/>
<link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/vendors~main.45e01195.js"

In [22]:
#your code


text = sopa.find_all('script')
print(text)

[<script nonce="YTc3NGVjMjMtMTcyMi00NzJjLTgxNTItOWIxNTg2NzE0NTcz">
window.__INITIAL_STATE__ = {"optimist":[],"featureSwitch":{"config":{"2fa_multikey_management_enabled":{"value":false},"account_country_setting_countries_whitelist":{"value":["ad","ae","af","ag","ai","al","am","ao","ar","as","at","au","aw","ax","az","ba","bb","bd","be","bf","bg","bh","bi","bj","bl","bm","bn","bo","bq","br","bs","bt","bv","bw","by","bz","ca","cc","cd","cf","cg","ch","ci","ck","cl","cm","co","cr","cu","cv","cw","cx","cy","cz","de","dj","dk","dm","do","dz","ec","ee","eg","er","es","et","fi","fj","fk","fm","fo","fr","ga","gb","gd","ge","gf","gg","gh","gi","gl","gm","gn","gp","gq","gr","gs","gt","gu","gw","gy","hk","hn","hr","ht","hu","id","ie","il","im","in","io","iq","ir","is","it","je","jm","jo","jp","ke","kg","kh","ki","km","kn","kr","kw","ky","kz","la","lb","lc","li","lk","lr","ls","lt","lu","lv","ly","ma","mc","md","me","mf","mg","mh","mk","ml","mn","mo","mp","mq","mr","ms","mt","mu","mv","mw","mx","my

#### Number of followers of a given Twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [23]:
# This is the url you will scrape in this exercise 
# You will need to add the account name to this url
url = 'https://twitter.com/'

In [24]:
#your code

#### List all language names and number of related articles in the order they appear in wikipedia.org
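
As a minimal sketch of the parsing step, assume the text of each featured language link has already been extracted as a two-line block: the language name, then the article count. How to select those elements on the live page is left to DevTools inspection. `parse_language_block` is a hypothetical helper name.

```python
import re


def parse_language_block(text):
    """Split 'English\\n6 183 000+ articles' into ('English', 6183000)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    name, count_line = lines[0], lines[1]
    # Keep only the digits, which drops the spaces, the "+" and the word
    # for "articles" in whatever language it is written.
    digits = re.sub(r"\D", "", count_line)
    return name, int(digits)


blocks = [
    "\nEnglish\n6 183 000+ articles\n",
    "\nEspañol\n1 637 000+ artículos\n",
    "\n日本語\n1 235 000+ 記事\n",
]

languages = [parse_language_block(block) for block in blocks]
print(languages)
```

Iterating over the blocks in page order keeps the languages in the order they appear on wikipedia.org.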

In [25]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'
html = requests.get(url).content
sopa = BeautifulSoup(html, "lxml")
print(sopa)

<!DOCTYPE html>
<html class="no-js" lang="mul">
<head>
<meta charset="utf-8"/>
<title>Wikipedia</title>
<meta content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation." name="description"/>
<script>
document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
</script>
<meta content="initial-scale=1,user-scalable=yes" name="viewport"/>
<link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>
<link href="/static/favicon/wikipedia.ico" rel="shortcut icon"/>
<link href="//creativecommons.org/licenses/by-sa/3.0/" rel="license"/>
<style>
.sprite{background-image:url(portal/wikipedia.org/assets/img/sprite-46c49284.png);background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-46c49284.svg);background-repeat:no-repeat;display:inline-block;vertical-align:middle}.svg-Commons-logo_sister{background

In [26]:
#your code
tags = ['div']
text = [element.text for element in sopa.find_all(tags)]
print('\n'.join(text))





Wikipedia

The Free Encyclopedia








English
6 183 000+ articles





Español
1 637 000+ artículos





日本語
1 235 000+ 記事





Deutsch
2 495 000+ Artikel





Русский
1 672 000+ статей





Français
2 262 000+ articles





Italiano
1 645 000+ voci





中文
1 155 000+ 條目





Português
1 045 000+ artigos





Polski
1 435 000+ haseł





English
6 183 000+ articles




Español
1 637 000+ artículos




日本語
1 235 000+ 記事




Deutsch
2 495 000+ Artikel




Русский
1 672 000+ статей




Français
2 262 000+ articles




Italiano
1 645 000+ voci




中文
1 155 000+ 條目




Português
1 045 000+ artigos




Polski
1 435 000+ haseł









Search Wikipedia





العربية
Asturianu
Azərbaycanca
Български
Bân-lâm-gú / Hō-ló-oē
Беларуская
Català
Čeština
Cymraeg
Dansk
Deutsch
Eesti
Ελληνικά
English
Español
Esperanto
Euskara
فارسی
Français
Galego
Հայերեն
हिन्दी
Hrvatski
Bahasa Indonesia
Italiano
עברית
ქართული
Latina
Latviešu
Lietuvių
Magyar
Македонски
مصرى
Bahasa Melayu
Bahaso Minangkabau
Nederla

#### A list with the different kinds of datasets available in data.gov.uk 

In [27]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'
html = requests.get(url).content
sopa = BeautifulSoup(html, "lxml")
print(sopa)

<!DOCTYPE html>
<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]--><!--[if gt IE 8]><!--><html lang="en"><!--<![endif]-->
<head>
<meta charset="utf-8"/>
<title>Find open data - data.gov.uk</title>
<meta content="#0b0c0c" name="theme-color"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="/find-assets/application-5c2a7e4c520f496c90cd227b2544108a747f107f1bea7397ea3fccaefa8fe904.css" media="screen" rel="stylesheet"/>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="X9IbsAQqjiO4yVcBTl5uMyPcWloxzMxNqM5xcxKrrvKLj5Z6017Vio8VwzPwLnxkZXpgB6JHQzuqekgD35Xc8w==" name="csrf-token"/>
</head><body class="govuk-template__body">
<script>document.body.className = ((document.body.className) ? document.body.className + ' js-enabled' : 'js-enabled');</script>
<a class="gem-c-skip-link govuk-skip-link" href="#main-content">Skip to main content</a>
<div aria-label="cookie banner" class="gem-c-cookie-banner govuk-clearfix" data-module="cookie-b

In [28]:
#your code 
tags = ['h3']
text = [element.text for element in sopa.find_all(tags)]
print('\n'.join(text))

Business and economy
Crime and justice
Defence
Education
Environment
Government
Government spending
Health
Mapping
Society
Towns and cities
Transport


#### Top 10 languages by number of native speakers stored in a Pandas Dataframe
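
Splitting a row's text on `"\n"` produces an empty string between every pair of cells, and those become empty columns in the DataFrame. Reading the cells one by one avoids that. The sketch below runs on a trimmed stand-in for the table, and `table_to_frame` is a hypothetical helper name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed stand-in with three of the six columns and two data rows.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Language
</th>
<th>Speakers<br/><small>(millions)</small>
</th></tr>
<tr>
<td>1
</td>
<td><a href="/wiki/Mandarin_Chinese" title="Mandarin Chinese">Mandarin Chinese</a>
</td>
<td>918
</td></tr>
<tr>
<td>2
</td>
<td><a href="/wiki/Spanish_language" title="Spanish language">Spanish</a>
</td>
<td>480
</td></tr>
</tbody></table>
"""


def table_to_frame(markup, n_rows=10):
    """Return the first `n_rows` data rows of the table as a DataFrame."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = [
        # One list entry per header or data cell, with surrounding
        # whitespace stripped and inner pieces joined by a single space.
        [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, body = rows[0], rows[1 : n_rows + 1]
    return pd.DataFrame(body, columns=header)


df = table_to_frame(SAMPLE_HTML)
print(df)
```

All values are still strings at this point. Converting the numeric columns, for example with `pd.to_numeric`, is a separate step.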

In [29]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
html = requests.get(url).content
sopa = BeautifulSoup(html, "lxml")
print(sopa)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of languages by number of native speakers - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"5c596dc0-3e19-44f0-a5d2-9152631f4212","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_languages_by_number_of_native_speakers","wgTitle":"List of languages by number of native speakers","wgCurRevisionId":985620308,"wgRevisionId":985620308,"wgArticleId":405385,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia indefinitely semi-protected pages","Articles with short descr

In [30]:
#your code
table = sopa.find_all('table',{'class':'wikitable sortable'})[0]
table

<table class="wikitable sortable">
<caption>Languages with at least 10 million first-language speakers<sup class="reference" id="cite_ref-:0_7-1"><a href="#cite_note-:0-7">[7]</a></sup>
</caption>
<tbody><tr>
<th>Rank
</th>
<th>Language
</th>
<th>Speakers<br/><small>(millions)</small>
</th>
<th>% of World pop.<br/><small>(March 2019)<sup class="reference" id="cite_ref-8"><a href="#cite_note-8">[8]</a></sup></small>
</th>
<th>Language family
</th>
<th>Branch
</th></tr>
<tr>
<td>1
</td>
<td><a href="/wiki/Mandarin_Chinese" title="Mandarin Chinese">Mandarin Chinese</a>
</td>
<td>918
</td>
<td>11.922
</td>
<td><a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>
</td>
<td><a href="/wiki/Varieties_of_Chinese" title="Varieties of Chinese">Sinitic</a>
</td></tr>
<tr>
<td>2
</td>
<td><a href="/wiki/Spanish_language" title="Spanish language">Spanish</a>
</td>
<td>480
</td>
<td>5.994
</td>
<td><a href="/wiki/Indo-European_languages" title="Indo-European language

In [31]:
rows = table.find_all('tr')
rows = [row.text.strip().split("\n") for row in rows]
rows


[['Rank',
  '',
  'Language',
  '',
  'Speakers(millions)',
  '',
  '% of World pop.(March 2019)[8]',
  '',
  'Language family',
  '',
  'Branch'],
 ['1',
  '',
  'Mandarin Chinese',
  '',
  '918',
  '',
  '11.922',
  '',
  'Sino-Tibetan',
  '',
  'Sinitic'],
 ['2',
  '',
  'Spanish',
  '',
  '480',
  '',
  '5.994',
  '',
  'Indo-European',
  '',
  'Romance'],
 ['3',
  '',
  'English',
  '',
  '379',
  '',
  '4.922',
  '',
  'Indo-European',
  '',
  'Germanic'],
 ['4',
  '',
  'Hindi (Sanskritised Hindustani)[9]',
  '',
  '341',
  '',
  '4.429',
  '',
  'Indo-European',
  '',
  'Indo-Aryan'],
 ['5',
  '',
  'Bengali',
  '',
  '228',
  '',
  '2.961',
  '',
  'Indo-European',
  '',
  'Indo-Aryan'],
 ['6',
  '',
  'Portuguese',
  '',
  '221',
  '',
  '2.870',
  '',
  'Indo-European',
  '',
  'Romance'],
 ['7',
  '',
  'Russian',
  '',
  '154',
  '',
  '2.000',
  '',
  'Indo-European',
  '',
  'Balto-Slavic'],
 ['8', '', 'Japanese', '', '128', '', '1.662', '', 'Japonic', '', 'Japanese'],
 

In [32]:
colnames = rows[0]
data = rows[1:11]

df = pd.DataFrame(data, columns=colnames)
df


Unnamed: 0,Rank,Unnamed: 2,Language,Unnamed: 4,Speakers(millions),Unnamed: 6,% of World pop.(March 2019)[8],Unnamed: 8,Language family,Unnamed: 10,Branch
0,1,,Mandarin Chinese,,918.0,,11.922,,Sino-Tibetan,,Sinitic
1,2,,Spanish,,480.0,,5.994,,Indo-European,,Romance
2,3,,English,,379.0,,4.922,,Indo-European,,Germanic
3,4,,Hindi (Sanskritised Hindustani)[9],,341.0,,4.429,,Indo-European,,Indo-Aryan
4,5,,Bengali,,228.0,,2.961,,Indo-European,,Indo-Aryan
5,6,,Portuguese,,221.0,,2.87,,Indo-European,,Romance
6,7,,Russian,,154.0,,2.0,,Indo-European,,Balto-Slavic
7,8,,Japanese,,128.0,,1.662,,Japonic,,Japanese
8,9,,Western Punjabi[10],,92.7,,1.204,,Indo-European,,Indo-Aryan
9,10,,Marathi,,83.1,,1.079,,Indo-European,,Indo-Aryan
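Splitting `row.text` on newlines leaves an empty string between every pair of cells, which is where the `Unnamed` columns in the output above come from. A minimal sketch of one alternative, assuming every row is made of plain `<th>`/`<td>` cells, reads the cells directly. The inline `sample_html` and the helper name `table_to_frame` are illustrative and not part of the lab; `html.parser` is used only so the snippet runs without lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical two-row sample shaped like the Wikipedia table above.
sample_html = """
<table>
<tr><th>Rank</th><th>Language</th><th>Speakers<br/>(millions)</th></tr>
<tr><td>1</td><td><a href="/wiki/Mandarin_Chinese">Mandarin Chinese</a></td><td>918</td></tr>
<tr><td>2</td><td><a href="/wiki/Spanish_language">Spanish</a></td><td>480</td></tr>
</table>
"""

def table_to_frame(table):
    """Build a DataFrame from a <table> tag, one column per header cell."""
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)

table = BeautifulSoup(sample_html, "html.parser").find("table")
languages = table_to_frame(table)
print(languages)
```

In the notebook, the same helper could be called on the `table` object already selected from `sopa`, followed by `.head(10)` to keep the ten most spoken languages.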


### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [33]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'
html = requests.get(url).content
sopa = BeautifulSoup(html, "lxml")
print(sopa)

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head><meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" name="viewport"/>
<link href="//abs.twimg.com" rel="preconnect"/>
<link href="//api.twitter.com" rel="preconnect"/>
<link href="//pbs.twimg.com" rel="preconnect"/>
<link href="//t.co" rel="preconnect"/>
<link href="//video.twimg.com" rel="preconnect"/>
<link href="//abs.twimg.com" rel="dns-prefetch"/>
<link href="//api.twitter.com" rel="dns-prefetch"/>
<link href="//pbs.twimg.com" rel="dns-prefetch"/>
<link href="//t.co" rel="dns-prefetch"/>
<link href="//video.twimg.com" rel="dns-prefetch"/>
<link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/polyfills.90d86535.js" nonce="YmFlYTViMGMtNGI1ZS00ZmUzLTgzYWEtMzEzY2Q5MmY3MWQ4" rel="preload"/>
<link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/vendors~main.45e01195.js"

In [34]:
# your code

#### IMDb's Top 250 data (movie name, initial release, director name and stars) as a pandas DataFrame

In [35]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'
html = requests.get(url).content
sopa = BeautifulSoup(html, "lxml")
print(sopa)

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
            </style>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>IMDb Top 250 - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<s

In [36]:
# your code
table = sopa.find_all('table',{'class':'chart full-width'})[0]
table

<table class="chart full-width" data-caller-name="chart-top250movie">
<colgroup>
<col class="chartTableColumnPoster"/>
<col class="chartTableColumnTitle"/>
<col class="chartTableColumnIMDbRating"/>
<col class="chartTableColumnYourRating"/>
<col class="chartTableColumnWatchlistRibbon"/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Rank &amp; Title</th>
<th>IMDb Rating</th>
<th>Your Rating</th>
<th></th>
</tr>
</thead>
<tbody class="lister-list">
<tr>
<td class="posterColumn">
<span data-value="1" name="rk"></span>
<span data-value="9.222793074760872" name="ir"></span>
<span data-value="7.791552E11" name="us"></span>
<span data-value="2302823" name="nv"></span>
<span data-value="-1.7772069252391276" name="ur"></span>
<a href="/title/tt0111161/"> <img alt="Sueño de fuga" height="67" src="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
      1.
     

In [37]:
rows = table.find_all('tr')
rows = [row.text.strip().split("\n") for row in rows]
rows

[['Rank & Title', 'IMDb Rating', 'Your Rating'],
 ['1.',
  '      Sueño de fuga',
  '(1994)',
  '',
  '',
  '9.2',
  '',
  '',
  '',
  '',
  '',
  '\xa012345678910 ',
  '',
  '',
  '',
  'NOT YET RELEASED',
  ' ',
  '',
  'Seen'],
 ['2.',
  '      El Padrino',
  '(1972)',
  '',
  '',
  '9.1',
  '',
  '',
  '',
  '',
  '',
  '\xa012345678910 ',
  '',
  '',
  '',
  'NOT YET RELEASED',
  ' ',
  '',
  'Seen'],
 ['3.',
  '      El padrino 2a parte',
  '(1974)',
  '',
  '',
  '9.0',
  '',
  '',
  '',
  '',
  '',
  '\xa012345678910 ',
  '',
  '',
  '',
  'NOT YET RELEASED',
  ' ',
  '',
  'Seen'],
 ['4.',
  '      Batman: El Caballero de la Noche',
  '(2008)',
  '',
  '',
  '9.0',
  '',
  '',
  '',
  '',
  '',
  '\xa012345678910 ',
  '',
  '',
  '',
  'NOT YET RELEASED',
  ' ',
  '',
  'Seen'],
 ['5.',
  '      12 hombres en pugna',
  '(1957)',
  '',
  '',
  '8.9',
  '',
  '',
  '',
  '',
  '',
  '\xa012345678910 ',
  '',
  '',
  '',
  'NOT YET RELEASED',
  ' ',
  '',
  'Seen'],
 ['6.',
  '  

In [38]:
# Each row starts with rank, title and year; the rating is the sixth cell
colnames = ['Rank', 'Title', 'Year', '', '', 'IMDb Rating', '', '', '', '', '', '', '', '', '', '', '', '', '']
data = rows[1:]

df = pd.DataFrame(data, columns=colnames)
df

Unnamed: 0,Rank,Title,Year,Unnamed: 4,Unnamed: 5,IMDb Rating,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19
0,1.,Sueño de fuga,(1994),,,9.2,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
1,2.,El Padrino,(1972),,,9.1,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
2,3.,El padrino 2a parte,(1974),,,9.0,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
3,4.,Batman: El Caballero de la Noche,(2008),,,9.0,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
4,5.,12 hombres en pugna,(1957),,,8.9,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,246.,La batalla de Argel,(1966),,,8.0,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
246,247.,Terminator,(1984),,,8.0,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
247,248.,Una voz silenciosa: Koe No Katachi,(2016),,,8.0,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
248,249.,Aladdín,(1992),,,8.0,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
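Most of the columns above are empty strings or text from the rating widget. A small sketch of one way to tidy them, assuming every row keeps the order rank, title, year, rating once blank cells are dropped (as in the output shown). The helper name `tidy_row` is illustrative; the two sample rows are copied from the `rows` output.

```python
import re

# Two rows copied from the `rows` output above (header row omitted).
raw_rows = [
    ['1.', '      Sueño de fuga', '(1994)', '', '', '9.2', '', '', '', '', '',
     '\xa012345678910 ', '', '', '', 'NOT YET RELEASED', ' ', '', 'Seen'],
    ['2.', '      El Padrino', '(1972)', '', '', '9.1', '', '', '', '', '',
     '\xa012345678910 ', '', '', '', 'NOT YET RELEASED', ' ', '', 'Seen'],
]

def tidy_row(parts):
    """Keep rank, title, year and rating; ignore the trailing widget text."""
    cells = [p.strip() for p in parts if p.strip()]
    rank, title, year, rating = cells[:4]
    return {
        "Rank": int(rank.rstrip(".")),
        "Title": title,
        "Year": int(re.sub(r"\D", "", year)),  # '(1994)' -> 1994
        "IMDb Rating": float(rating),
    }

movies = [tidy_row(row) for row in raw_rows]
print(movies)
```

Passing `movies` to `pd.DataFrame` would give a four-column frame. Director and stars do not appear in the row text shown above, so those fields would have to come from somewhere other than `row.text`.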


#### Movie name, year and a brief summary of 10 random movies from the IMDb Top 250, as a pandas DataFrame.

In [39]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'
html = requests.get(url).content
sopa = BeautifulSoup(html, "lxml")
print(sopa)

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
            </style>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>IMDb Top 250 - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<s

In [40]:
#your code
table = sopa.find_all('table',{'class':'chart full-width'})[0]
rows = table.find_all('tr')
rows = [row.text.strip().split("\n") for row in rows]
# Each row starts with rank, title and year; the rating is the sixth cell
colnames = ['Rank', 'Title', 'Year', '', '', 'IMDb Rating', '', '', '', '', '', '', '', '', '', '', '', '', '']
data = rows[1:11]

df = pd.DataFrame(data, columns=colnames)
df

Unnamed: 0,Rank,Title,Year,Unnamed: 4,Unnamed: 5,IMDb Rating,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19
0,1.0,Sueño de fuga,(1994),,,9.2,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
1,2.0,El Padrino,(1972),,,9.1,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
2,3.0,El padrino 2a parte,(1974),,,9.0,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
3,4.0,Batman: El Caballero de la Noche,(2008),,,9.0,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
4,5.0,12 hombres en pugna,(1957),,,8.9,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
5,6.0,La lista de Schindler,(1993),,,8.9,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
6,7.0,El señor de los anillos: El retorno del rey,(2003),,,8.9,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
7,8.0,Tiempos violentos,(1994),,,8.8,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
8,9.0,"El bueno, el malo y el feo",(1966),,,8.8,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
9,10.0,El señor de los anillos: La comunidad de...,(2001),,,8.8,,,,,,12345678910,,,,NOT YET RELEASED,,,Seen
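The slice `rows[1:11]` returns the first ten chart entries rather than a random selection. A minimal sketch of drawing ten distinct entries with the standard `random` module follows. The `chart` list is a hand-copied subset of titles from the outputs above, and `pick_random` is an illustrative helper name.

```python
import random

# Titles and years copied from the outputs above; in the notebook this would be rows[1:].
chart = [
    ("Sueño de fuga", 1994), ("El Padrino", 1972), ("El padrino 2a parte", 1974),
    ("Batman: El Caballero de la Noche", 2008), ("12 hombres en pugna", 1957),
    ("La lista de Schindler", 1993), ("Tiempos violentos", 1994),
    ("El bueno, el malo y el feo", 1966), ("La batalla de Argel", 1966),
    ("Terminator", 1984), ("Una voz silenciosa: Koe No Katachi", 2016),
    ("Aladdín", 1992),
]

def pick_random(entries, k=10, seed=None):
    """Return k distinct entries chosen uniformly at random."""
    rng = random.Random(seed)  # passing a seed makes the draw reproducible
    return rng.sample(entries, k)

picked = pick_random(chart, k=10, seed=0)
for title, year in picked:
    print(year, title)
```

If the data is already in a DataFrame, `df.sample(n=10)` draws rows the same way. The chart rows shown above carry no summary text, so the brief summary would have to be collected separately.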


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [41]:
#https://openweathermap.org/current
city = input('Enter the city:')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

Enter the city:Mexico City


In [42]:
# your code
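One possible way to fill in this cell, sketched under the assumption that the endpoint returns JSON with `weather`, `main` and `wind` keys as described on the page linked in the comment above. The function names and the `sample` payload (with made-up values) are illustrative. The standard library's `urlencode` is used so that city names containing spaces, such as "Mexico City", are encoded in the query string.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"

def build_url(city, api_key, units="metric"):
    """Encode the query so city names with spaces are handled."""
    return BASE_URL + "?" + urlencode({"q": city, "APPID": api_key, "units": units})

def summarize(payload):
    """Pick the four requested fields out of a decoded JSON response."""
    condition = payload["weather"][0]
    return {
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "description": condition["description"],
        "weather": condition["main"],
    }

def fetch_weather(city, api_key):
    """Call the API and summarize the reply (needs network access and a valid key)."""
    with urlopen(build_url(city, api_key), timeout=10) as response:
        return summarize(json.load(response))

# Hypothetical payload with made-up values, shaped like the assumed response.
sample = {
    "weather": [{"main": "Clouds", "description": "scattered clouds"}],
    "main": {"temp": 21.5},
    "wind": {"speed": 3.6},
}
report = summarize(sample)
print(report)
```

In the notebook, `fetch_weather(city, api_key)` would replace the hand-built `url`, with `api_key` read from a variable rather than pasted into the string.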

#### Book name, price and stock availability as a pandas DataFrame.

In [43]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'
html = requests.get(url).content
sopa = BeautifulSoup(html, "lxml")
print(sopa)

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static

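To get from the parsed page to the requested DataFrame, one option is to walk the page one product at a time instead of collecting all `h3` and `p` tags at once. The sketch below assumes each book sits in an `article` tag with class `product_pod`, with the full title in the link's `title` attribute and the price and stock text in `p.price_color` and `p.availability`. Those selectors are assumptions to verify in DevTools, and `sample_html` and `books_to_frame` are illustrative names.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one product card on the page.
sample_html = """
<article class="product_pod">
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
      In stock
  </p>
</article>
"""

def books_to_frame(soup):
    """One row per product card: full title, numeric price, stock text."""
    records = []
    for card in soup.select("article.product_pod"):
        link = card.select_one("h3 a")
        price_text = card.select_one("p.price_color").get_text(strip=True)
        records.append({
            # Fall back to the visible (possibly shortened) text if no title attribute
            "title": link.get("title", link.get_text(strip=True)),
            "price_gbp": float(re.search(r"\d+(?:\.\d+)?", price_text).group()),
            "availability": card.select_one("p.availability").get_text(strip=True),
        })
    return pd.DataFrame(records)

books = books_to_frame(BeautifulSoup(sample_html, "html.parser"))
print(books)
```

Calling `books_to_frame(sopa)` on the soup built above would cover the first catalogue page only.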
In [44]:
#your code
tags = ['h3','p']
text = [element.text for element in sopa.find_all(tags)]
print('\n'.join(text))








A Light in the ...
£51.77


    
        In stock
    








Tipping the Velvet
£53.74


    
        In stock
    








Soumission
£50.10


    
        In stock
    








Sharp Objects
£47.82


    
        In stock
    








Sapiens: A Brief History ...
£54.23


    
        In stock
    








The Requiem Red
£22.65


    
        In stock
    








The Dirty Little Secrets ...
£33.34


    
        In stock
    








The Coming Woman: A ...
£17.93


    
        In stock
    








The Boys in the ...
£22.60


    
        In stock
    








The Black Maria
£52.15


    
        In stock
    








Starving Hearts (Triangular Trade ...
£13.99


    
        In stock
    








Shakespeare's Sonnets
£20.66


    
        In stock
    








Set Me Free
£17.46


    
        In stock
    








Scott Pilgrim's Precious Little ...
£52.29


    
        In stock
    








Rip it Up and ...
£35.02


    
        In stock
    








Our Band C