# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, class names, etc. used for the content you are expected to extract.
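
The first two tips can be wrapped in a small helper. This is only a sketch: `summarize_response` is a hypothetical name, and the stand-in object below merely mimics the `status_code` and `text` attributes of a `requests` response so that the example runs without network access.

```python
from types import SimpleNamespace


def summarize_response(response, preview_chars=80):
    """Report whether a requests-style response succeeded and preview its body."""
    return {
        "ok": 200 <= response.status_code < 300,   # any 2xx code counts as success
        "status": response.status_code,
        "preview": response.text[:preview_chars],  # first characters of the body
    }


# Stand-in for the object returned by requests.get(url) (illustrative only).
fake_response = SimpleNamespace(status_code=200, text="<html><title>Demo</title></html>")
summary = summarize_response(fake_response)
print(summary["status"], summary["ok"])
print(summary["preview"])
```

With a real request, the object returned by `requests.get(url)` can be passed in place of `fake_response`.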

**Resources:**

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries, feel free to uncomment them.

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import re
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
# import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url1 = 'https://github.com/trending/developers'

In [3]:
#your code
resp1 = requests.get(url1)
resp1

<Response [200]>

In [4]:
sopa1 = bs(resp1.content, 'html.parser')


In [5]:
tablaapellido= sopa1.findAll('p',{'class':'f4'})
tablaapellido


[<p class="f4 text-gray col-md-10 mx-auto">
       These are the developers building the hot tools today.
     </p>,
 <p class="f4 text-normal mb-1">
 <a class="link-gray" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":1066253,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="88126d2c2d60cb84ab46ec843ccc865efc18789be40f12dbc5280108dcc1c8d7" href="/robdodson">
               robdodson
 </a> </p>,
 <p class="f4 text-normal mb-1">
 <a class="link-gray" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":2212006,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="a33f1e43be6afe40c82f849854c

In [6]:
tabla1 = sopa1.findAll('h1',{'class':'h3'})

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like this:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
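
One possible way to do the clean-up in step 3 is to split each text on any whitespace and join the pieces back with single spaces. The sketch below uses a hypothetical helper name, `clean_text`, and hand-written sample strings shaped like the `.text` of the scraped elements.

```python
def clean_text(text):
    """Collapse runs of whitespace (spaces, tabs, line breaks) into single spaces."""
    return " ".join(text.split())


# Hypothetical raw strings, shaped like the .text of the scraped elements.
raw_names = ["\n            Rob Dodson\n ", "\n  Gleb\n  Bahmutov  \n"]
clean_names = [clean_text(name) for name in raw_names]
print(clean_names)
```

Unlike `replace('\n', '')`, this also trims the ends and keeps exactly one space between words.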

In [7]:
#your code

lista1 = [i.text.replace('\n','').strip() for i in tabla1]


listaapellido = [i.text.replace('\n','').strip() for i in tablaapellido]


print(lista1)

['Rob Dodson', 'MichaIng', 'Gleb Bahmutov', 'Lukas Taegert-Atkinson', 'Till Krüss', 'Jesse Duffield', 'ᴜɴᴋɴᴡᴏɴ', 'Arve Knudsen', 'Niklas von Hertzen', 'Stephen Celis', 'Damian Dulisz', 'Yufan You', 'Christian Clauss', 'Jirka Borovec', 'Timothy Edmund Crosley', 'James Newton-King', 'Michael Shilman', 'Mike Penz', 'Alex Hall', 'Diego Sampaio', 'Dries Vints', 'JK Jung', 'Steven', 'Daniel Martí', 'Łukasz Magiera']


#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [8]:
# This is the url you will scrape in this exercise
url2 = 'https://github.com/trending/python?since=daily'

In [9]:
#your code
resp2 = requests.get(url2)
resp2

<Response [200]>

In [10]:
sopa2 = bs(resp2.content, 'html.parser')
tabla2 = sopa2.findAll('h1',{'class':'h3'})
tabla2

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_REPOSITORIES_PAGE","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":254755757,"originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="a1f6809e5c7f3d5638a26f98b49d54b71019ee6536c0446be821c66fdab9eb8a" href="/Palashio/libra">
 <svg aria-hidden="true" class="octicon octicon-repo mr-1 text-gray" color="gray" height="16" mr="1" version="1.1" viewbox="0 0 16 16" width="16"><path d="M2 2.5A2.5 2.5 0 014.5 0h8.75a.75.75 0 01.75.75v12.5a.75.75 0 01-.75.75h-2.5a.75.75 0 110-1.5h1.75v-2h-8a1 1 0 00-.714 1.7.75.75 0 01-1.072 1.05A2.495 2.495 0 012 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 011-1h8zM5 12.25v3.25a.25.25 0 00.4.2l1.45-1.087a.25.25 0 01.3 0L8.6 15.7a.25.25 0 00.4-.2v-3.25a.25.25 0 00-.25-.25h-3.5a.25.25 0 00-.25.25z" fill-rule="evenodd"></path

In [11]:
lista2 = [i.text.replace('\n','').strip() for i in tabla2]
lista2

['Palashio /      libra',
 'h1st-ai /      h1st',
 'tgorgdotcom /      locast2plex',
 'horovod /      horovod',
 'PostHog /      posthog',
 'tiangolo /      fastapi',
 'pythonstock /      stock',
 'matplotlib /      matplotlib',
 'Felienne /      hedy',
 'PaddlePaddle /      PaddleDetection',
 'jina-ai /      jina',
 'Dod-o /      Statistical-Learning-Method_Code',
 'openai /      image-gpt',
 'Ha0Tang /      XingGAN',
 'PyGithub /      PyGithub',
 'blackorbird /      APT_REPORT',
 'kangvcar /      InfoSpider',
 'aws-samples /      aws-cdk-examples',
 'connorferster /      handcalcs',
 'huggingface /      transformers',
 'demisto /      content',
 'matrix-org /      synapse',
 'vishnubob /      wait-for-it',
 'open-mmlab /      mmcv',
 'mingrammer /      diagrams']
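
The entries above still carry the padding that surrounded the slash in the page. A minimal sketch of one way to tidy them, assuming owner and repository names never contain spaces, is to drop all whitespace and then split on the slash:

```python
# Two entries copied from the output above.
scraped = ['Palashio /      libra', 'tiangolo /      fastapi']

# Removing every whitespace character leaves the usual owner/repo form.
repos = ["".join(item.split()) for item in scraped]
print(repos)

# Each entry can then be separated into its owner and repository name.
owner, repo = repos[0].split("/")
print(owner, repo)
```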

#### Display all the image links from the Walt Disney Wikipedia page

In [12]:
# This is the url you will scrape in this exercise
url3 = 'https://en.wikipedia.org/wiki/Walt_Disney'
resp3 = requests.get(url3)
resp3


<Response [200]>

In [13]:
sopa3 = bs(resp3.content, 'html.parser')

In [14]:
#your code
tabla3 = sopa3.find_all('a', {'class': 'image'})  # anchors with class "image" wrap each picture
tabla3 

[<a class="image" href="/wiki/File:Walt_Disney_1946.JPG"><img alt="Walt Disney 1946.JPG" data-file-height="675" data-file-width="450" decoding="async" height="330" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/330px-Walt_Disney_1946.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/440px-Walt_Disney_1946.JPG 2x" width="220"/></a>,
 <a class="image" href="/wiki/File:Walt_Disney_1942_signature.svg"><img alt="Walt Disney 1942 signature.svg" data-file-height="218" data-file-width="585" decoding="async" height="56" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/225px-Walt_Disney_1942_signature.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/t

In [15]:
lista3 = [i.get('href') for i in tabla3]
lista3

#for i in tabla3:
   # print(i.get("href"))

['/wiki/File:Walt_Disney_1946.JPG',
 '/wiki/File:Walt_Disney_1942_signature.svg',
 '/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 '/wiki/File:Trolley_Troubles_poster.jpg',
 '/wiki/File:Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg',
 '/wiki/File:Steamboat-willie.jpg',
 '/wiki/File:Walt_Disney_1935.jpg',
 '/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 '/wiki/File:Disney_drawing_goofy.jpg',
 '/wiki/File:DisneySchiphol1951.jpg',
 '/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 '/wiki/File:Walt_disney_portrait_right.jpg',
 '/wiki/File:Walt_Disney_Grave.JPG',
 '/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 '/wiki/File:Disney_Display_Case.JPG',
 '/wiki/File:Disney1968.jpg',
 '/wiki/File:The_Walt_Disney_Company_Logo.svg',
 '/wiki/File:Animation_disc.svg',
 '/wiki/File:P_vip.svg',
 '/wiki/File:Magic_Kingdom_castle.jpg',
 '/wiki/File:Video-x-generic.svg',
 '/wiki/File:Flag_of_Los_Angeles_County,_C
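
The links above are relative to the site root, and the `src` attributes of the images start with `//`. If absolute URLs are wanted, `urllib.parse.urljoin` from the standard library can resolve both forms against the page URL. A short sketch using values copied from the outputs above:

```python
from urllib.parse import urljoin

page_url = 'https://en.wikipedia.org/wiki/Walt_Disney'

# One href and one protocol-relative src taken from the outputs above.
links = [
    '/wiki/File:Walt_Disney_1946.JPG',
    '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
]

# urljoin fills in the missing scheme and host from the page URL.
absolute_links = [urljoin(page_url, link) for link in links]
for link in absolute_links:
    print(link)
```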

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [16]:
# This is the url you will scrape in this exercise
url4 ='https://en.wikipedia.org/wiki/Python' 

In [17]:
#your code
resp4 = requests.get(url4)
resp4


<Response [200]>

In [18]:
sopa4 = bs(resp4.content, 'html.parser')

In [19]:
tabla4 = sopa4.find_all('a')  # every anchor tag, with or without an href
tabla4

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/Python" title="wiktionary:Python">Python</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/python" title="wiktionary:python">python</a>,
 <a href="#Snakes"><span class="tocnumber">1</span> <span class="toctext">Snakes</span></a>,
 <a href="#Ancient_Greece"><span class="tocnumber">2</span> <span class="toctext">Ancient Greece</span></a>,
 <a href="#Media_and_entertainment"><span class="tocnumber">3</span> <span class="toctext">Media and entertainment</span></a>,
 <a href="#Computing"><span class="tocnumber">4</span> <span class="toctext">Computing</span></a>,
 <a href="#Engineering"><span class="tocnumber">5</span> <span class="toctext">Engineering</span></a>,
 <a href="#Roller_coasters"><span class="tocnumber">5.1</span> <span class="toctext">Roller coasters</span></a>,
 <a h

In [20]:
lista4 = [i.get('href') for i in tabla4]
lista42 = []

for i in lista4:
    a= str(i)
    if re.search('Py',a):
        lista42.append(i)
lista42

['https://en.wiktionary.org/wiki/Python',
 '/w/index.php?title=Python&action=edit&section=1',
 '/wiki/Pythonidae',
 '/wiki/Python_(genus)',
 '/w/index.php?title=Python&action=edit&section=2',
 '/wiki/Python_(mythology)',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/w/index.php?title=Python&action=edit&section=3',
 '/wiki/Python_(film)',
 '/wiki/Pythons_2',
 '/wiki/Monty_Python',
 '/wiki/Python_(Monty)_Pictures',
 '/w/index.php?title=Python&action=edit&section=4',
 '/wiki/Python_(programming_language)',
 '/wiki/CPython',
 '/w/index.php?title=Python&action=edit&section=5',
 '/w/index.php?title=Python&action=edit&section=6',
 '/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 '/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 '/wiki/Python_(Efteling)',
 '/w/index.php?title=Python&action=edit&section=7',
 '/wiki/Python_(automobile_maker)',
 '/wiki/Python_(Ford_prototype)',
 '/w/index.php?title=Python&action=edit&section=8',
 '/wiki
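
Some anchors, such as `<a id="top"></a>`, have no `href`, so `.get('href')` returns `None` for them, and others only point to a section of the same page. A possible clean-up, sketched on a small hand-made list, skips both kinds and removes duplicates while keeping the original order:

```python
# Hand-made sample of what [a.get('href') for a in anchors] can contain.
hrefs = [
    None,                                    # anchor without an href
    '#mw-head',                              # in-page jump link
    '/wiki/Python_(genus)',
    '/wiki/Python_(genus)',                  # duplicate
    'https://en.wiktionary.org/wiki/Python',
]

# Keep real links only: not None, not empty, not a same-page fragment.
links = [href for href in hrefs if href and not href.startswith('#')]

# dict.fromkeys keeps the first occurrence of each link, in order.
unique_links = list(dict.fromkeys(links))
print(unique_links)
```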

#### Number of Titles that have changed in the United States Code since its last release point 

In [21]:
# This is the url you will scrape in this exercise
url5 = 'http://uscode.house.gov/download/download.shtml'
resp5 = requests.get(url5)
resp5

<Response [200]>

In [22]:
#your code
sopa5 = bs(resp5.content, 'html.parser')
tabla5 = sopa5.findAll('div',{'class':'uscitem'})
tabla5

[<div class="uscitem">
 <div class="usctitle" id="alltitles">
 
           All titles in the format selected compressed into a zip archive.
 
         </div>
 <div class="itemcurrency">
 
            
 
         </div>
 <div class="itemdownloadlinks">
 <a href="releasepoints/us/pl/116/158/xml_uscAll@116-158.zip" title="All USC Titles in XML">[XML]</a> <a href="releasepoints/us/pl/116/158/htm_uscAll@116-158.zip" title="All USC Titles in XHTML">[XHTML]</a> <a href="releasepoints/us/pl/116/158/pcc_uscAll@116-158.zip" title="All USC Titles in PCC">[PCC]</a> <a href="releasepoints/us/pl/116/158/pdf_uscAll@116-158.zip" title="All USC Titles in PDF">[PDF]</a>
 </div>
 </div>,
 <div class="uscitem">
 <div class="usctitle" id="us/usc/t1">
 
           Title 1 - General Provisions <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>
 <div class="itemcurrency">
 
           116-158
 
         </div>
 <div class="itemdownloadlinks">
 <a href="releasepoints/us/pl/116/158/xml_usc01@1

In [23]:
lista5 = [i.text.replace('\n','').replace('[XML]','').replace('[XHTML]','').replace('[PCC]','').replace('[PDF]','').strip() for i in tabla5]
lista52 = []

for i in lista5:
    if i.startswith('Title'):
        lista52.append(i)
lista52

['Title 1 - General Provisions ٭          116-158',
 'Title 2 - The Congress                  116-158',
 'Title 3 - The President ٭          116-158',
 'Title 4 - Flag and Seal, Seat of Government, and the States ٭          116-158',
 'Title 5 - Government Organization and Employees ٭          116-158',
 'Title 6 - Domestic Security                  116-158',
 'Title 7 - Agriculture                  116-158',
 'Title 8 - Aliens and Nationality                  116-158',
 'Title 9 - Arbitration ٭          116-158',
 'Title 10 - Armed Forces ٭          116-158',
 'Title 11 - Bankruptcy ٭          116-158',
 'Title 12 - Banks and Banking                  116-158',
 'Title 13 - Census ٭          116-158',
 'Title 14 - Coast Guard ٭          116-158',
 'Title 15 - Commerce and Trade                  116-158',
 'Title 16 - Conservation                  116-158',
 'Title 17 - Copyrights ٭          116-158',
 'Title 18 - Crimes and Criminal Procedure ٭          116-158',
 'Title 19 - Customs D
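
Each string above fuses the title number, the title name, an optional footnote marker and the currency value (the text of the `itemcurrency` div). A regular expression can separate them again. This is a sketch tuned to the strings shown above, not a general parser; `TITLE_PATTERN` is a hypothetical name.

```python
import re

# number, name, optional footnote marker, then the currency value such as 116-158
TITLE_PATTERN = re.compile(r"^Title\s+(\S+)\s+-\s+(.*?)\s*٭?\s+(\d+-\d+)$")

# Two strings copied from the output above.
rows = [
    'Title 1 - General Provisions ٭          116-158',
    'Title 2 - The Congress                  116-158',
]

parsed = []
for row in rows:
    match = TITLE_PATTERN.match(row)
    if match:  # skip strings that do not follow the expected layout
        number, name, currency = match.groups()
        parsed.append((number, name, currency))
print(parsed)
```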

#### A Python list with the top ten FBI's Most Wanted names 

In [24]:
# This is the url you will scrape in this exercise
url6 = 'https://www.fbi.gov/wanted/topten'
resp6 = requests.get(url6)
resp6

<Response [200]>

In [25]:
#your code 
sopa6 = bs(resp6.content, 'html.parser')
tabla6 = sopa6.find_all('img')  # every image tag on the page
tabla6


[<img alt="Federal Bureau of Investigation Logo" src="https://www.fbi.gov/++theme++fbigov.theme/images/fbibannerseal.png" title="Federal Bureau of Investigation"/>,
 <img alt="ALEXIS FLORES" class="" src="https://www.fbi.gov/wanted/topten/alexis-flores/@@images/image/preview"/>,
 <img alt="EUGENE PALMER" class="" src="https://www.fbi.gov/wanted/topten/eugene-palmer/@@images/image/preview"/>,
 <img alt="RAFAEL CARO-QUINTERO" class="" src="https://www.fbi.gov/wanted/topten/rafael-caro-quintero/@@images/image/preview"/>,
 <img alt="ROBERT WILLIAM FISHER" class="" src="https://www.fbi.gov/wanted/topten/robert-william-fisher/@@images/image/preview"/>,
 <img alt="BHADRESHKUMAR CHETANBHAI PATEL" class="" src="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel/@@images/image/preview"/>,
 <img alt="ALEJANDRO ROSALES CASTILLO" class="" src="https://www.fbi.gov/wanted/topten/alejandro-castillo/@@images/image/preview"/>,
 <img alt="ARNOLDO JIMENEZ" class="" src="https://www.fbi.gov/w

In [26]:
lista61 = []
for i in tabla6:
    x = str(i.get('alt'))
    lista61.append(x)
lista61.pop(0)
lista61

['ALEXIS FLORES',
 'EUGENE PALMER',
 'RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'YASER ABDEL SAID',
 'SANTIAGO VILLALBA MEDEROS']

####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [27]:
# This is the url you will scrape in this exercise
url7 = 'https://www.emsc-csem.org/Earthquake/'

resp7 = requests.get(url7)
resp7

<Response [200]>

In [28]:
#your code
sopa7 = bs(resp7.content, 'html.parser')

cabecera7 = sopa7.find('tr',{'id':'haut_tableau'}) 
listacabecera7 =[]

for i in cabecera7:
    listacabecera7.append(i.text)
listacabecera7

['CitizenResponse',
 'Date & Time UTC',
 'Latitude degrees',
 'Longitude degrees',
 'Depth km',
 'Mag  [+]',
 'Region name  [+]',
 'Last update [-]']

In [154]:
cuerpo7 = sopa7.find('tbody',{'id':'tbody'})
# split it into tokens
import unicodedata as uni
cuerpo71 = cuerpo7.text.replace(' ','')
cuerpo72 =uni.normalize("NFKD", cuerpo71)
cuerpo73 = cuerpo72.split() 
cuerpo73

['earthquake2020-08-19',
 '23:22:36.106minago36.47',
 'N',
 '117.51',
 'W',
 '-2ML2.9',
 'CENTRALCALIFORNIA2020-08-1923:24',
 'earthquake2020-08-19',
 '23:03:35.025minago1.58',
 'N',
 '126.34',
 'E',
 '10M3.6',
 'MOLUCCASEA2020-08-1923:16',
 'earthquake2020-08-19',
 '22:51:42.937minago40.49',
 'N',
 '1.08',
 'W',
 '0ML2.2',
 'SPAIN2020-08-1923:20',
 'earthquake2020-08-19',
 '22:46:07.042minago3.85',
 'N',
 '126.69',
 'E',
 '32M3.8',
 'KEPULAUANTALAUD,INDONESIA2020-08-1923:05',
 'earthquake2020-08-19',
 '22:38:36.050minago20.67',
 'S',
 '69.01',
 'W',
 '94ML2.8',
 'TARAPACA,CHILE2020-08-1922:53',
 'earthquake2020-08-19',
 '22:19:11.01hr09minago21.51',
 'S',
 '68.82',
 'W',
 '119ML2.7',
 'ANTOFAGASTA,CHILE2020-08-1922:28',
 'earthquake2020-08-19',
 '22:14:52.41hr14minago38.16',
 'N',
 '117.91',
 'W',
 '11ML2.3',
 'NEVADA2020-08-1922:39',
 'earthquake2020-08-19',
 '22:13:50.91hr15minago28.75',
 'N',
 '51.52',
 'E',
 '2ML4.0',
 'SOUTHERNIRAN2020-08-1922:51',
 'earthquake2020-08-19',
 '21:5

In [156]:
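j = 0
cuerpo74 = []
# Group the flat token list into rows of 7 tokens each.
# The first slice, cuerpo73[0:0], is empty and is removed in a later cell;
# the tokens after the last step of the range are never appended.
for i in range(0, len(cuerpo73), 7):
    cuerpo74.append(cuerpo73[j:i])
    j = i
cuerpo74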

[[],
 ['earthquake2020-08-19',
  '23:22:36.106minago36.47',
  'N',
  '117.51',
  'W',
  '-2ML2.9',
  'CENTRALCALIFORNIA2020-08-1923:24'],
 ['earthquake2020-08-19',
  '23:03:35.025minago1.58',
  'N',
  '126.34',
  'E',
  '10M3.6',
  'MOLUCCASEA2020-08-1923:16'],
 ['earthquake2020-08-19',
  '22:51:42.937minago40.49',
  'N',
  '1.08',
  'W',
  '0ML2.2',
  'SPAIN2020-08-1923:20'],
 ['earthquake2020-08-19',
  '22:46:07.042minago3.85',
  'N',
  '126.69',
  'E',
  '32M3.8',
  'KEPULAUANTALAUD,INDONESIA2020-08-1923:05'],
 ['earthquake2020-08-19',
  '22:38:36.050minago20.67',
  'S',
  '69.01',
  'W',
  '94ML2.8',
  'TARAPACA,CHILE2020-08-1922:53'],
 ['earthquake2020-08-19',
  '22:19:11.01hr09minago21.51',
  'S',
  '68.82',
  'W',
  '119ML2.7',
  'ANTOFAGASTA,CHILE2020-08-1922:28'],
 ['earthquake2020-08-19',
  '22:14:52.41hr14minago38.16',
  'N',
  '117.91',
  'W',
  '11ML2.3',
  'NEVADA2020-08-1922:39'],
 ['earthquake2020-08-19',
  '22:13:50.91hr15minago28.75',
  'N',
  '51.52',
  'E',
  '2ML4.
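
Grouping a flat list into fixed-size rows can also be written with one slice per row, which avoids the leading empty list and keeps the final row. A minimal sketch, with `chunk` as a hypothetical helper name and the first 14 tokens from the output above as sample data:

```python
def chunk(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[start:start + size] for start in range(0, len(items), size)]


# First 14 tokens from the output above (two earthquakes, 7 tokens each).
tokens = [
    'earthquake2020-08-19', '23:22:36.106minago36.47', 'N', '117.51', 'W',
    '-2ML2.9', 'CENTRALCALIFORNIA2020-08-1923:24',
    'earthquake2020-08-19', '23:03:35.025minago1.58', 'N', '126.34', 'E',
    '10M3.6', 'MOLUCCASEA2020-08-1923:16',
]
rows = chunk(tokens, 7)
print(len(rows))
print(rows[1])
```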

In [158]:
cuerpo74.pop(0)  # drop the empty first row produced by the slicing loop
cabecera7 = ['date', 'time and latitude', 'N or S', 'longitude', 'E or W', 'depth (km) and magnitude', 'region and last update']

In [159]:
tabla7 = pd.DataFrame(cuerpo74, columns = cabecera7)
tabla7
# this page is awkward to parse: all the information comes back as one very long string


Unnamed: 0,date,time and latitude,N or S,longitude,E or W,depth (km) and magnitude,region and last update
0,earthquake2020-08-19,23:22:36.106minago36.47,N,117.51,W,-2ML2.9,CENTRALCALIFORNIA2020-08-1923:24
1,earthquake2020-08-19,23:03:35.025minago1.58,N,126.34,E,10M3.6,MOLUCCASEA2020-08-1923:16
2,earthquake2020-08-19,22:51:42.937minago40.49,N,1.08,W,0ML2.2,SPAIN2020-08-1923:20
3,earthquake2020-08-19,22:46:07.042minago3.85,N,126.69,E,32M3.8,"KEPULAUANTALAUD,INDONESIA2020-08-1923:05"
4,earthquake2020-08-19,22:38:36.050minago20.67,S,69.01,W,94ML2.8,"TARAPACA,CHILE2020-08-1922:53"
5,earthquake2020-08-19,22:19:11.01hr09minago21.51,S,68.82,W,119ML2.7,"ANTOFAGASTA,CHILE2020-08-1922:28"
6,earthquake2020-08-19,22:14:52.41hr14minago38.16,N,117.91,W,11ML2.3,NEVADA2020-08-1922:39
7,earthquake2020-08-19,22:13:50.91hr15minago28.75,N,51.52,E,2ML4.0,SOUTHERNIRAN2020-08-1922:51
8,earthquake2020-08-19,21:58:08.01hr30minago16.22,N,97.27,W,16M4.3,"OAXACA,MEXICO2020-08-1922:16"
9,earthquake2020-08-19,21:51:17.91hr37minago38.13,N,118.08,W,5ML2.4,NEVADA2020-08-1922:17
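
Several cells in the table above still hold two or three values fused together, for example `23:22:36.106minago36.47` (time, elapsed time and latitude) and `-2ML2.9` (depth, magnitude type and magnitude). Regular expressions can pull them apart. The sketch below assumes the seconds always carry exactly one decimal digit, as in the rows shown; the helper names are hypothetical.

```python
import re

# time with one decimal in the seconds, elapsed text, the word "ago", latitude
TIME_LAT = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d)(.*?)ago(\d+\.\d+)$")
# signed integer depth, magnitude type letters, magnitude
DEPTH_MAG = re.compile(r"^(-?\d+)([A-Za-z]+)(\d+\.\d+)$")


def split_time_latitude(value):
    """Return (time, elapsed, latitude) from a fused cell."""
    time, elapsed, latitude = TIME_LAT.match(value).groups()
    return time, elapsed, float(latitude)


def split_depth_magnitude(value):
    """Return (depth, magnitude type, magnitude) from a fused cell."""
    depth, mag_type, magnitude = DEPTH_MAG.match(value).groups()
    return int(depth), mag_type, float(magnitude)


print(split_time_latitude('22:19:11.01hr09minago21.51'))
print(split_depth_magnitude('-2ML2.9'))
```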


#### Display the date, days, title, city, country of next 25 hackathon events as a Pandas dataframe table

In [43]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'

In [44]:
# your code
# the link has expired

#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account
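
A try/except block for missing accounts could look like the sketch below. It is only an illustration: `fetch_profile_html` and `fake_get` are hypothetical names, the 404 status for an unknown account is an assumption, and `fake_get` stands in for `requests.get` so that the example runs without network access.

```python
from types import SimpleNamespace


def fetch_profile_html(account, http_get):
    """Return the profile page HTML, or None when the account is not found."""
    url = f"https://twitter.com/{account}/"
    try:
        response = http_get(url)
        if response.status_code == 404:  # assumption: unknown accounts answer 404
            raise LookupError(f"Account not found: {account}")
        return response.text
    except LookupError as error:
        print(error)
        return None


def fake_get(url):
    """Stand-in for requests.get: only 'gaymersmexico' exists in this demo."""
    if url.endswith("/gaymersmexico/"):
        return SimpleNamespace(status_code=200, text="<html>profile</html>")
    return SimpleNamespace(status_code=404, text="")


print(fetch_profile_html("gaymersmexico", fake_get))
print(fetch_profile_html("no_such_account_demo", fake_get))
```

Passing `requests.get` as `http_get` would run the same logic against the live site.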

In [45]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url8 = 'https://twitter.com/gaymersmexico/'
resp8 = requests.get(url8)
resp8

<Response [200]>

In [46]:
sopa8 = bs(resp8.content, 'html.parser')
tabla8 = sopa8.findAll('div',{'class':'css-1dbjc4n'})


for i in tabla8:
    print(i.text)
len(tabla8)

Something went wrong, but don’t fret — let’s give it another shot.
Something went wrong, but don’t fret — let’s give it another shot.

Something went wrong, but don’t fret — let’s give it another shot.


4

#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [47]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url9 = 'https://twitter.com/gaymersmexico/'

resp9 = requests.get(url9)
resp9

<Response [200]>

In [48]:
#your code
sopa9 = bs(resp9.content, 'html.parser')
tabla9 = sopa9.findAll('div',{'class':'css-90loao'})
tabla9
#twitter

[]

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [49]:
# This is the url you will scrape in this exercise
url10 = 'https://www.wikipedia.org/'
resp10 = requests.get(url10)
resp10




<Response [200]>

In [50]:
#your code
sopa10 = bs(resp10.content, 'html.parser')
tabla10 = sopa10.findAll('div',{'class':'central-featured'})



In [51]:
lista10 = [i.text.replace('\n','').strip() for i in tabla10]

lista101 = [i.encode('ascii', 'replace') for i in lista10]
lista102 = [i.split() for i in lista101]
lista103 = lista102[0]
lista103

[b'English6?137?000+',
 b'articles???1?222?000+',
 b'??Espa?ol1?617?000+',
 b'art?culosDeutsch2?467?000+',
 b'Artikel???????1?651?000+',
 b'??????Fran?ais2?241?000+',
 b'articlesItaliano1?627?000+',
 b'voci??1?136?000+',
 b'??Portugu?s1?041?000+',
 b'artigosPolski1?423?000+',
 b'hase?']
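
The question marks in the output above come from `encode('ascii', 'replace')`, which replaces every non-ASCII character, and the names are fused with the counts because the line breaks were removed first. One alternative, sketched here on a hand-written sample string (an assumption about the shape of the scraped text), is to keep the line breaks and normalise only the non-breaking spaces:

```python
import unicodedata

# Hypothetical sample shaped like the .text of the language block:
# the name on one line, the count (with non-breaking spaces) on the next.
raw = "English\n6\xa0137\xa0000+ articles\n\n\nEspañol\n1\xa0617\xa0000+ artículos\n"

# NFKC turns non-breaking spaces into plain spaces and keeps accented letters.
text = unicodedata.normalize("NFKC", raw)

# Drop empty lines, then pair each language name with the line after it.
lines = [line.strip() for line in text.splitlines() if line.strip()]
pairs = list(zip(lines[0::2], lines[1::2]))
print(pairs)
```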

#### A list with the different kinds of datasets available in data.gov.uk 

In [57]:
# This is the url you will scrape in this exercise
url11 = 'https://data.gov.uk/'
resp11 = requests.get(url11)
resp11

<Response [200]>

In [66]:
#your code 
sopa11 = bs(resp11.content, 'html.parser')
tabla11 = sopa11.findAll('h3',{'class':'govuk-heading-s'})
tabla11

[<h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Government+spending">Government spending</a></

In [68]:
lista11 = [i.text for i in tabla11]
lista11

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport']

#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [70]:
# This is the url you will scrape in this exercise
url12 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
resp12 = requests.get(url12)
resp12

<Response [200]>

In [127]:
#your code
sopa12 = bs(resp12.content, 'html.parser')
#cabecera12 = sopa12.findAll('div',{'class':'mw-parser-output'})
tabla12 = sopa12.findAll({'tr':'th'})


In [132]:
lista12 = [i.text.replace(' ','').replace('\n',' ').strip() for i in tabla12]
lista12
cabecera12 = lista12[0]
lista121 = [i for i in lista12 if i.startswith(('0', '1', '2', '3', '4', '5', '6', '7', '8', '9'))]
lista121

['1  MandarinChinese  918  11.922  Sino-Tibetan  Sinitic',
 '2  Spanish  480  5.994  Indo-European  Romance',
 '3  English  379  4.922  Indo-European  Germanic',
 '4  Hindi(SanskritisedHindustani)[9]  341  4.429  Indo-European  Indo-Aryan',
 '5  Bengali  228  2.961  Indo-European  Indo-Aryan',
 '6  Portuguese  221  2.870  Indo-European  Romance',
 '7  Russian  154  2.000  Indo-European  Balto-Slavic',
 '8  Japanese  128  1.662  Japonic  Japanese',
 '9  WesternPunjabi[10]  92.7  1.204  Indo-European  Indo-Aryan',
 '10  Marathi  83.1  1.079  Indo-European  Indo-Aryan',
 '11  Telugu  82.0  1.065  Dravidian  South-Central',
 '12  WuChinese  81.4  1.057  Sino-Tibetan  Sinitic',
 '13  Turkish  79.4  1.031  Turkic  Oghuz',
 '14  Korean  77.3  1.004  Koreanic  languageisolate',
 '15  French  77.2  1.003  Indo-European  Romance',
 '16  German  76.1  0.988  Indo-European  Germanic',
 '17  Vietnamese  76.0  0.987  Austroasiatic  Vietic',
 '18  Tamil  75.0  0.974  Dravidian  South',
 '19  YueChine

In [135]:
lista12_top = lista121[0:10]
lista122 = [i.split() for i in lista12_top]
cabecera12_col = cabecera12.split()

In [136]:
pd12 = pd.DataFrame(lista122, columns = cabecera12_col)
pd12

Unnamed: 0,Rank,Language,Speakers(millions),%ofWorldpop.(March2019)[8],Languagefamily,Branch
0,1,MandarinChinese,918.0,11.922,Sino-Tibetan,Sinitic
1,2,Spanish,480.0,5.994,Indo-European,Romance
2,3,English,379.0,4.922,Indo-European,Germanic
3,4,Hindi(SanskritisedHindustani)[9],341.0,4.429,Indo-European,Indo-Aryan
4,5,Bengali,228.0,2.961,Indo-European,Indo-Aryan
5,6,Portuguese,221.0,2.87,Indo-European,Romance
6,7,Russian,154.0,2.0,Indo-European,Balto-Slavic
7,8,Japanese,128.0,1.662,Japonic,Japanese
8,9,WesternPunjabi[10],92.7,1.204,Indo-European,Indo-Aryan
9,10,Marathi,83.1,1.079,Indo-European,Indo-Aryan
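
In the table above the language names have lost their spaces (`MandarinChinese`) and still carry footnote markers such as `[9]`, because all spaces were removed before splitting. A possible alternative is to split each row on line breaks instead. The sketch below assumes every cell sits on its own line in the row text; `row_cells` is a hypothetical helper name and the sample row is hand-written.

```python
import re


def row_cells(row_text):
    """Split the text of one table row into clean cell values."""
    cells = [cell.strip() for cell in row_text.strip().split("\n")]
    # Drop empty pieces and strip footnote markers such as [9].
    return [re.sub(r"\[\d+\]", "", cell) for cell in cells if cell]


# Hand-written sample shaped like the .text of one table row.
sample_row = "\n4\n\nHindi (Sanskritised Hindustani)[9]\n\n341\n\n4.429\n\nIndo-European\n\nIndo-Aryan\n"
cells = row_cells(sample_row)
print(cells)
```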


### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'
# Twitter is blocked

In [None]:
# your code


#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [160]:
# This is the url you will scrape in this exercise 
url13 = 'https://www.imdb.com/chart/top'
resp13 = requests.get(url13)
resp13

<Response [200]>

In [190]:
# your code
import requests 

import re

sopa13 = bs(resp13.content, 'html.parser')

tabla13 = sopa13.findAll('td',{'class':'titleColumn'})
tabla13

[<td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Sueño de fuga</a>
 <span class="secondaryInfo">(1994)</span>
 </td>,
 <td class="titleColumn">
       2.
       <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">El Padrino</a>
 <span class="secondaryInfo">(1972)</span>
 </td>,
 <td class="titleColumn">
       3.
       <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">El padrino 2a parte</a>
 <span class="secondaryInfo">(1974)</span>
 </td>,
 <td class="titleColumn">
       4.
       <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">Batman: El Caballero de la Noche</a>
 <span class="secondaryInfo">(2008)</span>
 </td>,
 <td class="titleColumn">
       5.
       <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 hombres en pugna</a>
 <span class="secondar

In [194]:
lista13 = [i.text.replace('\n','').replace(' ','') for i in tabla13]
lista13

['1.Sueñodefuga(1994)',
 '2.ElPadrino(1972)',
 '3.Elpadrino2aparte(1974)',
 '4.Batman:ElCaballerodelaNoche(2008)',
 '5.12hombresenpugna(1957)',
 '6.LalistadeSchindler(1993)',
 '7.Elseñordelosanillos:Elretornodelrey(2003)',
 '8.Tiemposviolentos(1994)',
 '9.Elbueno,elmaloyelfeo(1966)',
 '10.Elseñordelosanillos:Lacomunidaddelanillo(2001)',
 '11.Elclubdelapelea(1999)',
 '12.ForrestGump(1994)',
 '13.Elorigen(2010)',
 '14.Elimperiocontraataca(1980)',
 '15.Elseñordelosanillos:Lasdostorres(2002)',
 '16.Matrix(1999)',
 '17.Buenosmuchachos(1990)',
 '18.Atrapadosinsalida(1975)',
 '19.Lossietesamurais(1954)',
 '20.Seven,lossietepecadoscapitales(1995)',
 '21.Lavidaesbella(1997)',
 '22.CiudaddeDios(2002)',
 '23.Elsilenciodelosinocentes(1991)',
 '24.¡Québelloesvivir!(1946)',
 '25.Laguerradelasgalaxias(1977)',
 '26.RescatandoalsoldadoRyan(1998)',
 '27.ElviajedeChihiro(2001)',
 '28.Milagrosinesperados(1999)',
 '29.Parásitos(2019)',
 '30.Hamilton(2020)',
 '31.Interestelar(2014)',
 '32.Elperfectoasesino(

In [216]:
lista131 = [i.strip().split('.') for i in lista13]


In [214]:
dt13 = pd.DataFrame(lista131, columns = ['ranking','name','none'])
dt131 = dt13[['ranking','name']]
dt131

Unnamed: 0,ranking,name
0,1,Sueñodefuga(1994)
1,2,ElPadrino(1972)
2,3,Elpadrino2aparte(1974)
3,4,Batman:ElCaballerodelaNoche(2008)
4,5,12hombresenpugna(1957)
...,...,...
245,246,LabatalladeArgel(1966)
246,247,Tronodesangre(1957)
247,248,Unavozsilenciosa:KoeNoKatachi(2016)
248,249,"ShinseikiEvangelionGekijô-ban:Air/Magokorowo,k..."
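
Splitting on `.` breaks for titles that contain a period, and removing every space fuses the words of each title. A regular expression applied to the whitespace-normalised text can return the rank, title and year in one step. This is a sketch with hypothetical names (`ENTRY`, `parse_entry`) and hand-written sample strings shaped like the cell text shown earlier.

```python
import re

# rank, a period, the title, then a four-digit year in parentheses
ENTRY = re.compile(r"^(\d+)\.\s+(.*)\s+\((\d{4})\)$")


def parse_entry(raw_text):
    """Return (rank, title, year) from the text of one title cell, or None."""
    text = " ".join(raw_text.split())  # collapse line breaks and padding
    match = ENTRY.match(text)
    if match is None:
        return None
    rank, title, year = match.groups()
    return int(rank), title, int(year)


print(parse_entry("\n      1.\n      Sueño de fuga\n(1994)\n"))
print(parse_entry("\n      2.\n      El Padrino\n(1972)\n"))
```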


#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [254]:
#your code
dt131['año'] = dt131['name'].apply(lambda x: re.findall(r'\(+\d+', x))

In [255]:
dt132 = dt131[['ranking', 'name','año']]
dt132
# the page has no summary

Unnamed: 0,ranking,name,año
0,1,Sueñodefuga(1994),[(1994]
1,2,ElPadrino(1972),[(1972]
2,3,Elpadrino2aparte(1974),[(1974]
3,4,Batman:ElCaballerodelaNoche(2008),[(2008]
4,5,12hombresenpugna(1957),[(1957]
...,...,...,...
245,246,LabatalladeArgel(1966),[(1966]
246,247,Tronodesangre(1957),[(1957]
247,248,Unavozsilenciosa:KoeNoKatachi(2016),[(2016]
248,249,"ShinseikiEvangelionGekijô-ban:Air/Magokorowo,k...",[(1997]


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city:')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
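
City names with spaces or accents need to be encoded before they go into the query string. The sketch below builds the same kind of URL with `urllib.parse.urlencode` and then reads the four requested values from a response. The helper names and `YOUR_API_KEY` are placeholders, and the sample payload is hand-written for illustration rather than a real API answer.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key):
    """Build the request URL, encoding the city name safely."""
    query = urlencode({"q": city, "APPID": api_key, "units": "metric"})
    return f"{BASE_URL}?{query}"


def summarize_weather(payload):
    """Pick temperature, wind speed, description and weather from the JSON."""
    return {
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "description": payload["weather"][0]["description"],
        "weather": payload["weather"][0]["main"],
    }


weather_url = build_weather_url("New York", "YOUR_API_KEY")
print(weather_url)

# Hand-written sample payload (illustrative values only).
sample_payload = json.loads(
    '{"weather": [{"main": "Clouds", "description": "broken clouds"}],'
    ' "main": {"temp": 21.5}, "wind": {"speed": 3.6}}'
)
report = summarize_weather(sample_payload)
print(report)
```

With a live key, `requests.get(weather_url).json()` would supply the payload.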

In [None]:
# your code
# the page does not load

#### Book name, price and stock availability as a pandas dataframe.

In [258]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url14 = 'http://books.toscrape.com/'

resp14 = requests.get(url14)
resp14

<Response [200]>

In [259]:
#your code
sopa14 = bs(resp14.content, 'html.parser')

tabla14 = sopa14.find_all('ol',{'class':'row'})
tabla14




[<ol class="row">
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>
 </li>
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <

In [269]:
# Remove every space (including the ones inside titles) and turn newlines into spaces
lista14 = [i.text.replace(' ','').replace('\n',' ').strip() for i in tabla14]

lista14

["ALightinthe...  £51.77    Instock    Addtobasket                 TippingtheVelvet  £53.74    Instock    Addtobasket                 Soumission  £50.10    Instock    Addtobasket                 SharpObjects  £47.82    Instock    Addtobasket                 Sapiens:ABriefHistory...  £54.23    Instock    Addtobasket                 TheRequiemRed  £22.65    Instock    Addtobasket                 TheDirtyLittleSecrets...  £33.34    Instock    Addtobasket                 TheComingWoman:A...  £17.93    Instock    Addtobasket                 TheBoysinthe...  £22.60    Instock    Addtobasket                 TheBlackMaria  £52.15    Instock    Addtobasket                 StarvingHearts(TriangularTrade...  £13.99    Instock    Addtobasket                 Shakespeare'sSonnets  £20.66    Instock    Addtobasket                 SetMeFree  £17.46    Instock    Addtobasket                 ScottPilgrim'sPreciousLittle...  £52.29    Instock    Addtobasket                 RipitUpand...  £35.02    Instoc
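Flattening the whole `<ol>` to text removes the spaces inside titles and keeps the abbreviated link text. As an alternative, here is a sketch that reads each `<article class="product_pod">` separately and takes the full name from the link's `title` attribute. The function name `parse_books` is hypothetical, and `sample_html` is a trimmed copy of one entry from the page source shown above, so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one product entry from the page source shown above.
sample_html = '''
<ol class="row">
 <li>
  <article class="product_pod">
   <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
   <div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability">
        In stock
    </p>
   </div>
  </article>
 </li>
</ol>
'''

def parse_books(html):
    """Return one dict per <article class="product_pod"> element."""
    soup = BeautifulSoup(html, 'html.parser')
    books = []
    for article in soup.find_all('article', class_='product_pod'):
        books.append({
            # The title attribute holds the untruncated book name.
            'name': article.h3.a['title'],
            'price': article.find('p', class_='price_color').get_text(strip=True),
            'inventory': article.find('p', class_='availability').get_text(strip=True),
        })
    return books

print(parse_books(sample_html))
```

The resulting list of dicts can be passed straight to `pd.DataFrame`, and calling `parse_books(resp14.content)` would apply the same logic to the downloaded page.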

In [274]:
lista141 = lista14[0].split()
lista141

['ALightinthe...',
 '£51.77',
 'Instock',
 'Addtobasket',
 'TippingtheVelvet',
 '£53.74',
 'Instock',
 'Addtobasket',
 'Soumission',
 '£50.10',
 'Instock',
 'Addtobasket',
 'SharpObjects',
 '£47.82',
 'Instock',
 'Addtobasket',
 'Sapiens:ABriefHistory...',
 '£54.23',
 'Instock',
 'Addtobasket',
 'TheRequiemRed',
 '£22.65',
 'Instock',
 'Addtobasket',
 'TheDirtyLittleSecrets...',
 '£33.34',
 'Instock',
 'Addtobasket',
 'TheComingWoman:A...',
 '£17.93',
 'Instock',
 'Addtobasket',
 'TheBoysinthe...',
 '£22.60',
 'Instock',
 'Addtobasket',
 'TheBlackMaria',
 '£52.15',
 'Instock',
 'Addtobasket',
 'StarvingHearts(TriangularTrade...',
 '£13.99',
 'Instock',
 'Addtobasket',
 "Shakespeare'sSonnets",
 '£20.66',
 'Instock',
 'Addtobasket',
 'SetMeFree',
 '£17.46',
 'Instock',
 'Addtobasket',
 "ScottPilgrim'sPreciousLittle...",
 '£52.29',
 'Instock',
 'Addtobasket',
 'RipitUpand...',
 '£35.02',
 'Instock',
 'Addtobasket',
 'OurBandCouldBe...',
 '£57.25',
 'Instock',
 'Addtobasket',
 'Olio',
 '£2

In [281]:
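j = 0
lista142 = []

# Group the flat token list into rows of four (name, price, stock, button text).
# The first slice, lista141[0:0], is empty, and the final group of four is never appended.
for i in range(0,len(lista141),4):
    lista142.append(lista141[j:i])
    j = i 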
        
        
lista142

[[],
 ['ALightinthe...', '£51.77', 'Instock', 'Addtobasket'],
 ['TippingtheVelvet', '£53.74', 'Instock', 'Addtobasket'],
 ['Soumission', '£50.10', 'Instock', 'Addtobasket'],
 ['SharpObjects', '£47.82', 'Instock', 'Addtobasket'],
 ['Sapiens:ABriefHistory...', '£54.23', 'Instock', 'Addtobasket'],
 ['TheRequiemRed', '£22.65', 'Instock', 'Addtobasket'],
 ['TheDirtyLittleSecrets...', '£33.34', 'Instock', 'Addtobasket'],
 ['TheComingWoman:A...', '£17.93', 'Instock', 'Addtobasket'],
 ['TheBoysinthe...', '£22.60', 'Instock', 'Addtobasket'],
 ['TheBlackMaria', '£52.15', 'Instock', 'Addtobasket'],
 ['StarvingHearts(TriangularTrade...', '£13.99', 'Instock', 'Addtobasket'],
 ["Shakespeare'sSonnets", '£20.66', 'Instock', 'Addtobasket'],
 ['SetMeFree', '£17.46', 'Instock', 'Addtobasket'],
 ["ScottPilgrim'sPreciousLittle...", '£52.29', 'Instock', 'Addtobasket'],
 ['RipitUpand...', '£35.02', 'Instock', 'Addtobasket'],
 ['OurBandCouldBe...', '£57.25', 'Instock', 'Addtobasket'],
 ['Olio', '£23.88', 'Ins

In [286]:
lista142.pop(0)
lista142

[['ALightinthe...', '£51.77', 'Instock', 'Addtobasket'],
 ['TippingtheVelvet', '£53.74', 'Instock', 'Addtobasket'],
 ['Soumission', '£50.10', 'Instock', 'Addtobasket'],
 ['SharpObjects', '£47.82', 'Instock', 'Addtobasket'],
 ['Sapiens:ABriefHistory...', '£54.23', 'Instock', 'Addtobasket'],
 ['TheRequiemRed', '£22.65', 'Instock', 'Addtobasket'],
 ['TheDirtyLittleSecrets...', '£33.34', 'Instock', 'Addtobasket'],
 ['TheComingWoman:A...', '£17.93', 'Instock', 'Addtobasket'],
 ['TheBoysinthe...', '£22.60', 'Instock', 'Addtobasket'],
 ['TheBlackMaria', '£52.15', 'Instock', 'Addtobasket'],
 ['StarvingHearts(TriangularTrade...', '£13.99', 'Instock', 'Addtobasket'],
 ["Shakespeare'sSonnets", '£20.66', 'Instock', 'Addtobasket'],
 ['SetMeFree', '£17.46', 'Instock', 'Addtobasket'],
 ["ScottPilgrim'sPreciousLittle...", '£52.29', 'Instock', 'Addtobasket'],
 ['RipitUpand...', '£35.02', 'Instock', 'Addtobasket'],
 ['OurBandCouldBe...', '£57.25', 'Instock', 'Addtobasket'],
 ['Olio', '£23.88', 'Instock'
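The grouping loop above starts with an empty slice and stops before the final group of four. A minimal sketch of the same chunking with a list comprehension, which avoids both issues, is shown here; the helper name `chunk` is hypothetical and `tokens` is a short excerpt of `lista141`.

```python
def chunk(items, size):
    """Split items into consecutive lists of length size (the last may be shorter)."""
    return [items[start:start + size] for start in range(0, len(items), size)]

tokens = ['ALightinthe...', '£51.77', 'Instock', 'Addtobasket',
          'TippingtheVelvet', '£53.74', 'Instock', 'Addtobasket']
print(chunk(tokens, 4))
```

Because every slice starts at `start` and ends at `start + size`, no `pop(0)` is needed afterwards.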

In [287]:
df14 = pd.DataFrame(lista142, columns = ['name','price', 'inventory', 'add'])

In [288]:
df141 = df14[['name','price', 'inventory']]
df141

Unnamed: 0,name,price,inventory
0,ALightinthe...,£51.77,Instock
1,TippingtheVelvet,£53.74,Instock
2,Soumission,£50.10,Instock
3,SharpObjects,£47.82,Instock
4,Sapiens:ABriefHistory...,£54.23,Instock
5,TheRequiemRed,£22.65,Instock
6,TheDirtyLittleSecrets...,£33.34,Instock
7,TheComingWoman:A...,£17.93,Instock
8,TheBoysinthe...,£22.60,Instock
9,TheBlackMaria,£52.15,Instock
