# Introduction to Scraping

## 1. Importing libraries and fetching the page

In [None]:
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.echojs.com/")
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)

## 2. Extracting the data

In [153]:
data = []

links = soup.select('article > h2 > a')

for link in links:
    data.append({
        'title': link.get_text(),
        'url': link.get('href')
    })
    
for post in data:
    print('Name: {}\nLink: {}\n'.format(post['title'], post['url']))

Name: VS Code Extension: New Remote Development Pack introduced
Link: https://medium.com/@billys.moustakas/vs-code-extension-new-remote-development-pack-introduced-fe3730dde771

Name: Koji is a platform that makes building and deploying full-stack web apps dramatically faster, and easier.
Link: https://gokoji.com/

Name: Licia: Useful Utility Collection with Zero Dependencies:)
Link: https://licia.liriliri.io/

Name: PCI DSS for “Blockchain Based” Crypto Projects
Link: https://2muchcoffee.com/blog/pci-dss-for-crypto-projects-blockchain-based/

Name: Node.js TypeScript #12. Introduction to Worker Threads with TypeScript
Link: https://wanago.io/2019/05/06/node-js-typescript-12-worker-threads/

Name: Open source full-stack solution for fast PWA development
Link: https://bento-starter.netlify.com/

Name: Build a Modern Chat Application with React
Link: https://www.cometchat.com/tutorials/build-a-modern-chat-application-with-react/

Name: Angular JS chat tutorial: Anonymous group chat
Link:
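
To make the `article > h2 > a` selector used above more concrete, here is a minimal sketch run against a hand-written HTML fragment. The fragment and its URLs are made up for illustration and only mimic the structure assumed for the real page. `>` is the CSS child combinator, so only links that are direct children of an `h2`, itself a direct child of an `article`, are matched.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, written for this example only.
SAMPLE_HTML = """
<article>
  <h2><a href="https://example.com/first">First post</a></h2>
  <p><a href="https://example.com/ignored">a link outside the h2</a></p>
</article>
<article>
  <h2><a href="https://example.com/second">Second post</a></h2>
</article>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same selector as in the cell above: <a> directly inside <h2> directly inside <article>.
sample_links = sample_soup.select("article > h2 > a")

sample_data = [
    {"title": a.get_text(strip=True), "url": a.get("href")}
    for a in sample_links
]
print(sample_data)
```

The link inside the `<p>` element is skipped because its parent is not an `h2`.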

## 3. Writing to a .csv file

In [171]:
import csv

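# newline='' lets the csv module control line endings (avoids blank rows on Windows).
with open('./echojs.csv', 'w', newline='', encoding='utf-8') as f: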
    writer = csv.DictWriter(f, fieldnames=['title', 'url'])
    writer.writeheader()
    
    for post in data:
        writer.writerow(post)
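
As a quick check of the file format, the sketch below writes a few rows with `csv.DictWriter` and reads them back with `csv.DictReader`. It uses an in-memory `io.StringIO` buffer and made-up rows rather than the real `echojs.csv`, so it can run without the scraped data.

```python
import csv
import io

# Made-up rows standing in for the scraped `data` list.
rows = [
    {"title": "A title, with a comma", "url": "https://example.com/a"},
    {"title": 'A "quoted" title', "url": "https://example.com/b"},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
writer.writeheader()
writer.writerows(rows)

# Rewind and parse what was written.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back == rows)
```

Commas and quotes inside titles are escaped by the writer and restored by the reader, which is the main reason to prefer the `csv` module over joining strings by hand.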

---

# Reformatting for JSON storage

Let's try to retrieve the following information for each story on [Hacker News](https://news.ycombinator.com):
- id
- url
- title
- points
- author
- comments

We will then store them in JSON format.
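
The target record could look like the sketch below. The values are invented placeholders, not real Hacker News data; only the field names come from the list above.

```python
import json

# Placeholder values; only the keys follow the list above.
example_post = {
    "id": "12345",
    "url": "https://example.com/article",
    "title": "An example title",
    "points": 42,
    "author": "someuser",
    "comments": 7,
}

# A page of results is a list of such records.
print(json.dumps([example_post], indent=2))
```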

## 1. Importing libraries and fetching the page

In [175]:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://news.ycombinator.com")
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.prettify())

<html op="news">
 <head>
  <meta content="origin" name="referrer"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="news.css?0cBqUHCbnKTxrOsi1kz1" rel="stylesheet" type="text/css"/>
  <link href="favicon.ico" rel="shortcut icon"/>
  <link href="rss" rel="alternate" title="RSS" type="application/rss+xml"/>
  <title>
   Hacker News
  </title>
 </head>
 <body>
  <center>
   <table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
    <tr>
     <td bgcolor="#ff6600">
      <table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%">
       <tr>
        <td style="width:18px;padding-right:4px">
         <a href="https://news.ycombinator.com">
          <img height="18" src="y18.gif" style="border:1px white solid;" width="18"/>
         </a>
        </td>
        <td style="line-height:12pt; height:10px;">
         <span class="pagetop">
          <b class="hnname">
           <a href="news">
 

## 2. Extracting and preparing the data

In [261]:
data = []

# Helper functions used by the extraction loop further down.
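def get_comments(elem):
    """Return the number of comments read from the last link of the sibling row."""
    text = elem.get_text().strip()

    # Stories without comments show "discuss" instead of a count.
    if text == 'discuss':
        return 0

    count = text.split('comment')[0].strip()
    # Assumption: a row without a comment link ends with some other link; count it as 0.
    return int(count) if count.isdigit() else 0

def get_user(elem):
    """Return the author's user name, or '' when the row has no author link."""
    if elem is None:
        return ''
    return elem.get_text().strip()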

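for tr in soup.select('tr.athing'):
    # Points, author and comments live in the row that follows each story row.
    sibling = tr.find_next_sibling()
    link = tr.select_one('a.storylink')
    score = sibling.select_one('.score')

    post = {
        'id': tr.get('id'),
        'url': link.get('href'),
        'title': link.get_text(),
        # Splitting on ' point' handles both "1 point" and "N points";
        # rows without a score element fall back to 0.
        'points': (int(score.get_text().split(' point')[0].strip())
                   if score is not None else 0),
        'comments': get_comments(sibling.select('a')[-1]),
        'author': get_user(sibling.select_one('a[href^="user?"]'))
    }
    # Without this line, `data` stays empty and the JSON file would contain [].
    data.append(post)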
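
The text in the row below each story is not uniform: a count can be singular or plural, an entry can show `discuss` instead of a number, and the number may be separated from its label by a non-breaking space. A slightly more defensive parser might look like this sketch. `parse_count` is a hypothetical helper, not part of the notebook, and the sample strings are assumptions about what the page can contain.

```python
import re

def parse_count(text):
    """Return the leading integer in `text`, or 0 when there is none."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else 0

print(parse_count("128 points"))
print(parse_count("35\xa0comments"))
print(parse_count("1 comment"))
print(parse_count("discuss"))
```

Matching the leading digits avoids depending on the exact label that follows the number.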

## 3. Writing to a .json file

In [None]:
import json

with open('./hackernews.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
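
The `ensure_ascii=False` option matters as soon as a title contains non-ASCII characters. The sketch below, using an invented record, compares both settings and checks that the data survives a round trip either way.

```python
import json

record = {"title": "Écriture dans un fichier", "points": 3}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps "É" as-is

print(escaped)
print(readable)

# Both forms decode to the same data.
assert json.loads(escaped) == json.loads(readable) == record
```

With `ensure_ascii=False`, the file should be opened with an explicit encoding such as UTF-8 so the characters are written consistently across platforms.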

---

# Scraping the modern web (PWA)

On the modern web, the spread of PWAs (Progressive Web Apps) gets in the way of classic scraping. These applications are usually rendered client-side: the HTML returned by the server is a nearly empty shell, and the content is built by JavaScript once the page runs in a browser.  
To read that HTML, one option is to execute the JavaScript in an automated browser, for example with [Selenium](https://selenium-python.readthedocs.io/).

In [262]:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://angular2-hn.firebaseapp.com/news/1")
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE doctype html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Angular 2 HN
  </title>
  <base href="/"/>
  <meta content="A Hacker News client built with Angular CLI, RxJS and Webpack" name="description">
   <meta content="summary" name="twitter:card"/>
   <meta content="@hdjirdeh" name="twitter:site"/>
   <meta content="Angular 2 HN" name="twitter:title"/>
   <meta content="A Hacker News client built with Angular CLI, RxJS and Webpack" name="twitter:description"/>
   <meta content="@hdjirdeh" name="twitter:creator"/>
   <meta content="assets/images/logo-loading.png" name="twitter:image"/>
   <meta content="Angular 2 HN" property="og:title">
    <meta content="website" property="og:type">
     <meta content="https://angular2-hn.firebaseapp.com/" property="og:url">
      <meta content="assets/images/logo-loading.png" property="og:image">
       <meta content="A Hacker News client built with Angular CLI, RxJS and Webpack" property="og:description">
        <meta

> When requesting a PWA, we can see that the page content is indeed not returned.

In [267]:
soup.select('li')

[]
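
A minimal sketch of the Selenium approach mentioned above, assuming Selenium 4 and a local Chrome installation are available. The helper name `fetch_rendered_html`, the 10-second timeout, and the choice to wait for an `li` element are assumptions made for this example, not something the notebook prescribes.

```python
def fetch_rendered_html(url, wait_selector="li", timeout=10):
    """Load `url` in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so that defining the function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one matching element exists in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned string can be passed to `BeautifulSoup(html, 'html.parser')` exactly as in the earlier sections, after which `soup.select('li')` is expected to find the rendered list items rather than an empty list.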