![Nuclio logo](https://nuclio.school/wp-content/uploads/2018/12/nucleoDS-newBlack.png)

# Optional Web Scraping Exercise

This exercise consists of extracting data from a web page, processing it, and saving it to a `csv` file. To do so, you must:

1. Extract the articles on the home page of [https://slashdot.org/](https://slashdot.org/) using `BeautifulSoup`.
2. Process the data and store it in a `DataFrame`.
3. Create a `csv` file from that `DataFrame`.

## Import libraries

In [14]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [15]:
url = "https://slashdot.org/"

In [16]:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
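
The request above assumes the download always succeeds. A minimal sketch of a more defensive version follows. The helper name `fetch_soup`, the 10-second timeout, and the `User-Agent` string are assumptions made for illustration, not something the site is known to require.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return its parsed HTML (hypothetical helper)."""
    # The User-Agent value is an arbitrary example chosen for this sketch.
    headers = {"User-Agent": "scraping-exercise/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Calling `fetch_soup("https://slashdot.org/")` would then take the place of the two lines in the cell above.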

## Scrape the articles

In [17]:
for title in soup.find_all("h2"):
    print(title.get_text().strip())

Can Invasive Fish Be Scared Off With a Menacing Robot Predator?  (nytimes.com) 


9
Is the Cloud Making Internet Services More Fragile?  (nbcnews.com) 


39
RadioShack Announces Ambitious New Cryptocurrency Exchange  (radioshack.com) 


73
Who's Paying to Fix Open Source Software?  (dev.to) 


82
Forget Dogs: These Rats Could Be the Future of Search and Rescue  (science.org) 


23
Beatings, Doxxings, Harassment: the War Over Chinese Wikipedia  (fastcompany.com) 


37
Imagining an All-Renewable Grid With No Blackouts Without Long-Duration Batteries  (stanford.edu) 


150
Astronomers Nervously Counting Down to Christmas Eve Launch of $10B Webb Telescope  (nytimes.com) 


51
Hospital's Computer System Always Marks Up Costs Automatically, Leaked Records Show  (msn.com) 


154
Guitarist Eric Clapton Successfully Sues Woman For Posting $11 Bootleg  (guitarworld.com) 


133
Trial Ends For Theranos Founder Elizabeth Holmes  (msn.com) 


70
FSF Adopts New Governance Measures: a Board Member Agr
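
As the output shows, the text of each `h2` mixes the headline, the source, and the comment count. One possible way to separate them is sketched below on a small inline sample. The markup is a simplified assumption modelled on the printed output: only the `no extlnk` and `story-sourcelnk` class names come from this notebook, and the live page may nest its elements differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the classes "story", "story-title" and
# "comment-bubble" are assumptions made for this sketch.
sample_html = """
<h2 class="story">
  <span class="story-title">
    <a href="#">Who's Paying to Fix Open Source Software?</a>
    <span class="no extlnk"><a class="story-sourcelnk" href="#"> (dev.to) </a></span>
  </span>
  <span class="comment-bubble"><a href="#">82</a></span>
</h2>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

rows = []
for h2 in soup_sample.find_all("h2"):
    # The first <a> inside the heading holds only the headline text.
    headline = h2.a.get_text(strip=True)
    # select_one returns None when the heading has no source link.
    source_tag = h2.select_one("span.extlnk")
    source = source_tag.get_text(strip=True).strip("()") if source_tag else None
    rows.append((headline, source))

print(rows)
```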

In [18]:

for fuente in soup.find_all('span',class_='no extlnk'):
    print(fuente.get_text().strip())

(nytimes.com)
(nbcnews.com)
(radioshack.com)
(dev.to)
(science.org)
(fastcompany.com)
(stanford.edu)
(nytimes.com)
(msn.com)
(guitarworld.com)
(msn.com)
(fsf.org)
(propublica.org)
(cnbc.com)
(arstechnica.com)


In [19]:


for fecha in soup.find_all('time'):
    print(fecha.get_text().strip())


on Sunday December 19, 2021 @03:34AM
on Sunday December 19, 2021 @12:29AM
on Saturday December 18, 2021 @09:29PM
on Saturday December 18, 2021 @06:34PM
on Saturday December 18, 2021 @05:34PM
on Saturday December 18, 2021 @04:34PM
on Saturday December 18, 2021 @03:34PM
on Saturday December 18, 2021 @02:34PM
on Saturday December 18, 2021 @01:34PM
on Saturday December 18, 2021 @12:34PM
on Saturday December 18, 2021 @11:34AM
on Saturday December 18, 2021 @10:34AM
on Saturday December 18, 2021 @08:00AM
on Saturday December 18, 2021 @05:00AM
on Saturday December 18, 2021 @02:00AM
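
These strings are easier to sort and filter once they are converted to `datetime` objects. The sketch below uses two of the values printed above. The format string is inferred from that output, and `%A`, `%B`, and `%p` assume an English locale.

```python
from datetime import datetime

raw_dates = [
    "on Sunday December 19, 2021 @03:34AM",
    "on Sunday December 19, 2021 @12:29AM",
]

# Format inferred from the printed strings: literal "on ", weekday name,
# month name, day, year, literal "@", 12-hour clock, minutes, AM/PM.
SLASHDOT_FORMAT = "on %A %B %d, %Y @%I:%M%p"

parsed = [datetime.strptime(text, SLASHDOT_FORMAT) for text in raw_dates]
for value in parsed:
    print(value.isoformat())
# 2021-12-19T03:34:00
# 2021-12-19T00:29:00
```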


In [20]:

for parrafo in soup.find_all('div', class_='body'):
    print(parrafo.get_text().strip())

The mosquitofish threatens native fish populations in Australia — including two of the most criticially endangered, reports the New York Times.   And in various parts of the world, "For decades scientists have been trying to figure out how to control it, without damaging the surrounding ecosystem.  

But in a new lab experiment, "the mosquitofish may have finally met its match: A menacing fish-shaped robot."


It's "their worst nightmare," said Giovanni Polverino, a behavioral ecologist at the University of Western Australia and the lead author of a paper published Thursday in iScience, in which scientists designed a simulacrum of the fish's natural predator, the largemouth bass, to strike at the mosquitofish, scaring it away from its prey.  The robot not only freaked the mosquitofish out, but scarred them with such lasting anxiety that their reproduction rates dropped; evidence that could have long term implications for the species' viability, according to the paper.   "You don't need

## Save the DataFrame

In [21]:
articles = soup.find_all("article", class_="card")
for a in articles:
    if a.parent["class"][0] == 'nothumbs':
        print(a.get_text().strip())
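
Note that `a.parent["class"]` raises `KeyError` when the parent element has no `class` attribute, and `[0]` only inspects the first class. A slightly more tolerant variant of the same filter is sketched below on hypothetical sample markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering the three cases the filter can meet.
sample_html = """
<div class="nothumbs"><article class="card">kept</article></div>
<div><article class="card">parent without class</article></div>
<div class="thumbs"><article class="card">skipped</article></div>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

kept = []
for a in soup_sample.find_all("article", class_="card"):
    # .get returns the default [] when the parent has no class attribute,
    # so no KeyError is raised; `in` checks every class, not just the first.
    if "nothumbs" in a.parent.get("class", []):
        kept.append(a.get_text(strip=True))

print(kept)
```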

In [26]:
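articles = soup.select("[class='nothumbs'] > div > article")
data = []
for a in articles:
    dict_df = {
        # Raw heading text; it still contains the source and the comment count.
        "titulo": a.h2.get_text(),
        # This stores the whole <span> Tag; call .get_text() on it to keep only the text.
        "fuente": a.find('span', class_='no extlnk'),
        "datetime": a.time.attrs["datetime"],
    }
    data.append(dict_df)
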

In [23]:
data

[{'titulo': '\n Can Invasive Fish Be Scared Off With a Menacing Robot Predator?  (nytimes.com) \n\n\n9\n',
  'fuente': <span class="no extlnk"><a class="story-sourcelnk" href="https://www.nytimes.com/2021/12/16/science/mosquitofish-robot.html" target="_blank" title="External link - https://www.nytimes.com/2021/12/16/science/mosquitofish-robot.html"> (nytimes.com) </a></span>,
  'datetime': 'on Sunday December 19, 2021 @03:34AM'},
 {'titulo': '\n Is the Cloud Making Internet Services More Fragile?  (nbcnews.com) \n\n\n39\n',
  'fuente': <span class="no extlnk"><a class="story-sourcelnk" href="https://www.nbcnews.com/tech/tech-news/internet-outages-web-concentrations-power-rcna8942" target="_blank" title="External link - https://www.nbcnews.com/tech/tech-news/internet-outages-web-concentrations-power-rcna8942"> (nbcnews.com) </a></span>,
  'datetime': 'on Sunday December 19, 2021 @12:29AM'},
 {'titulo': '\n RadioShack Announces Ambitious New Cryptocurrency Exchange  (radioshack.com) \n\n

In [24]:
df = pd.DataFrame(data)
df

| | titulo | fuente | datetime |
|---|---|---|---|
| 0 | \n Can Invasive Fish Be Scared Off With a Mena... | [[ (nytimes.com) ]] | on Sunday December 19, 2021 @03:34AM |
| 1 | \n Is the Cloud Making Internet Services More ... | [[ (nbcnews.com) ]] | on Sunday December 19, 2021 @12:29AM |
| 2 | \n RadioShack Announces Ambitious New Cryptocu... | [[ (radioshack.com) ]] | on Saturday December 18, 2021 @09:29PM |
| 3 | \n Who's Paying to Fix Open Source Software? ... | [[ (dev.to) ]] | on Saturday December 18, 2021 @06:34PM |
| 4 | \n Forget Dogs: These Rats Could Be the Future... | [[ (science.org) ]] | on Saturday December 18, 2021 @05:34PM |
| 5 | \n Beatings, Doxxings, Harassment: the War Ove... | [[ (fastcompany.com) ]] | on Saturday December 18, 2021 @04:34PM |
| 6 | \n Imagining an All-Renewable Grid With No Bla... | [[ (stanford.edu) ]] | on Saturday December 18, 2021 @03:34PM |
| 7 | \n Astronomers Nervously Counting Down to Chri... | [[ (nytimes.com) ]] | on Saturday December 18, 2021 @02:34PM |
| 8 | \n Hospital's Computer System Always Marks Up ... | [[ (msn.com) ]] | on Saturday December 18, 2021 @01:34PM |
| 9 | \n Guitarist Eric Clapton Successfully Sues Wo... | [[ (guitarworld.com) ]] | on Saturday December 18, 2021 @12:34PM |
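
In the table above, `titulo` still carries newlines and `fuente` holds a whole tag rather than its text. The third step of the exercise, writing the `csv`, is also still pending. A minimal sketch of both follows, run on a small inline sample so it does not depend on the live page. The sample markup is a simplified assumption, and `df_clean` is a name introduced here for illustration.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that follows the selector used in this notebook.
sample_html = """
<div class="nothumbs"><div>
<article class="card">
  <h2><a href="#">Is the Cloud Making Internet Services More Fragile?</a>
      <span class="no extlnk"><a class="story-sourcelnk" href="#"> (nbcnews.com) </a></span></h2>
  <time datetime="on Sunday December 19, 2021 @12:29AM">on Sunday December 19, 2021 @12:29AM</time>
</article>
</div></div>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

rows = []
for a in soup_sample.select(".nothumbs > div > article"):
    fuente_tag = a.find("span", class_="no extlnk")
    rows.append({
        # Headline only, without the source or surrounding whitespace.
        "titulo": a.h2.a.get_text(strip=True),
        # Plain text of the source with the parentheses removed.
        "fuente": fuente_tag.get_text(strip=True).strip("()") if fuente_tag else None,
        "datetime": a.time.attrs["datetime"],
    })

df_clean = pd.DataFrame(rows)

# Write to an in-memory buffer here; index=False leaves out the row index column.
buffer = io.StringIO()
df_clean.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, pass a path instead of the buffer, for example `df_clean.to_csv("slashdot_articles.csv", index=False)`, where the file name is only an example.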
