# Gathering Web Data

1. Data gathering web data using cURL

One of the most important tools for gathering data from the web is cURL. It's a command line utility that allows you to request and download a webpage or other data over HTTP, like images or documents. cURL is often scripted to scrape web data from various sources or used as part of a web crawler to index pages.
Bring up the webpage, top5ofanything.com and look at a list of data that contains the Top 5 Most Reliable Car Brands for 2021 : https://top5ofanything.com/list/5b6d4671/The-Top-5-Most-Reliable- Car-Brands-for-2021
In a folder called curl-demo, run the command curl (Use sudo apt install curl to install it if required). direct it into a file and save it (which file type should you use?)
use curl --head and then the URL that shows header information. The important thing is that you shuld get a status code 200 as a valid request
to save on bandwidth if you are downloading a lot of data (a lot of pages), you can use the compressed option : use the command, curl --compressed –head and then the URL. the content encoding should come back as gzip
to search an element, you can use the following command: curl –data ‘’q=France’’ and then the url/search that it's posting to (in double quotes, you give the form element q that is the search element). This is similar to top5ofanything.com/search on the browser, you will get the same results from that posted data (to check).
executes the command, `curl --data-urlencode "q=highest mountains" https://top5ofanything.com/search/` and comments the option urlencode.
cURL provides cookie handling. It’s information that gets passed back and forth between the browser and the web server. To store the cookies in file cookies.txt, check the following command:
`curl --cookie-jar cookies.txt --output output.html https://www.google.com` examines the file cookies.txt (using command cat or more).

2. Extracting Spreadsheet Data with in2csv

Having data in Microsoft Excel format is very common, but this is not always a good format for data processing. We'd like tabular data like CSV.
use in2csv to convert an Excel spreadsheet data to CSV format. In a folder in2csv-demo, get an excel file , e.g., https://www.itu.int/ITU-D/ict/statistics/material/excel/EstimatedInternetUsers00-09.xls, which is Estimated based on urban/rural distribution of Internet users. Then use in2csv which is a Python utility that comes part of the CSV kit library from Python (install csvkit first using the pip package manager). Pipe the command into grep, to do grep for France. Save in a csv file and check it.

3. Extracting Spreadsheet Data with Agate

If we use Python for scripting, then extracting spreadsheet data is best done with the Agate library. Agate is a general-purpose data analysis library that can be used for data wrangling and other data science tasks.
Write out a script xls2csv.py on extracting the spreadsheet data with the Python Agate library.

4. Extracting HTML Data using Python and BeautifulSoup

If your data science needs require that you extract web data from HTML documents, you'll need to be able to parse and extract HTML tags from documents.
Write out a script using the Python package BeautifulSoup to download and extract HTML tags from the web.

5. Work with metadata in email headers

write out a script parse-email.py to parse a source file from your email server using a Python library for email message parsing.

6. Connecting to Remote Data

to connect directly to a remote server that hosts your data, the most frequently used utility to connect to remote systems is Secure Shell, or SSH.
On a trusted computer, you can copy your SSH keys to the remote computer so that it trusts you implicitly (without password prompt except when you install it first time). Check the usage of the command ssh-copy-id to have a passwordless connection that uses a key-based authentication.

7. Copying Remote Data

To copy remote data using a secure copy, the command secure copy or SCP for short is used to copy a remote file securely using key-based authentication.
Explain how scp can be used with key-based authentication

8. Synchronizing Remote Data

Keeping remote data synchronized between devices, for example from a server to a single workstation can be complex. You can use rsync for synchronizing remote data (with ssh).
You could be using rsync between data in directories on a single computer. For instance, you can synchronize copies across different mapped drives or storage media.
Check rsync usage and answer this question: What is difference between scp and rsync?

 1. Data gathering web data using cURL

`curl --data-urlencode "q=highest mountains" https://top5ofanything.com/search/`

Le résultat de cette requête est le suivant :

```html
<!doctype html><html lang=en><head><meta charset=utf-8>
<script src="https://privacy.gatekeeperconsent.com/tcf2_stub.js" data-cfasync="false"></script><script>window.__ezWillLoadCnx=1;</script><script data-ezscrex=false data-cfasync=false data-pagespeed-no-defer>var __ez=__ez||{};__ez.stms=Date.now();__ez.evt={};
```

In [None]:
# 2. Extracting Spreadsheet Data with in2csv
import csvkit as csv
import paramiko as paramiko

In [3]:
# 3. Extracting Spreadsheet Data with Agate
import agate

def xls_to_csv(input_file, output_file):
    # Load the spreadsheet using agate
    table = agate.Table.form_xls(input_file)

    # Write the data to a CSV file
    table.to_csv(output_file)

    print(f"Data successfully extracted from {input_file} and saved as {output_file}.")

input_file = "./in2csv-demo/EstimatedInternetUsers00-09.xls"  # Replace with the path to your input spreadsheet
output_file = "./in2csv-demo/agate.csv"  # Replace with the desired path for the output CSV file

xls_to_csv(input_file, output_file)

XLRDError: No sheet named <'sheet'>

In [5]:
# 4. Extracting HTML Data using Python and BeautifulSoup
import requests
from bs4 import BeautifulSoup

# URL du site web à extraire
url = "https://rcck-ardennes.fr"

response = requests.get(url)

if response.status_code == 200:
    html_content = response.content

    soup = BeautifulSoup(html_content, "html.parser")

    links = soup.find_all("a")

    for link in links:
        print(link.text)
        print(link.get("href"))

else:
    print("Erreur lors de la requête HTTP : ", response.status_code)



./
Accueil
./
Nos locations
./locations.php
Nos activités
./activites.php
Nos photos
./photos.php?annee=2023
Vos avis
./avis.php?annee=2023
À-propos
./a_propos.php
Dossier d'inscription >
Uploads/_062822000840.pdf
En savoir plus >
locations.php

locations.php
Nous admirer >
photos.php

photos.php

photos.php

photos.php

photos.php
 
activites.php#comp
 La compétition 
activites.php#comp
 
activites.php#loi
 La rivière 
activites.php#loi
 
activites.php#pisc
 La piscine 
activites.php#pisc
FaceBook
https://www.facebook.com/profile.php?id=100040638682931
Envoyez-nous un e-mail !
mailto:rcck.ardennes@gmail.com
Nous trouver
https://www.google.com/maps/dir//rethel+chateau+canoe+kayak/data=!4m6!4m5!1m1!4e2!1m2!1m1!1s0x47e98907bf9e2ad3:0x377eff8f93d189d4?sa=X&ved=2ahUKEwjBgMmx8tz1AhWD3YUKHZYYC7IQ9Rd6BAg5EAQ
Site map
./site-map.php


In [6]:
# 5. Work with metadata in email headers
import email
from email.header import decode_header

def parse_email(file_path):
    with open(file_path, 'rb') as file:
        message = email.message_from_binary_file(file)

    # Extract metadata from email headers
    subject = decode_header(message['Subject'])[0][0]
    sender = decode_header(message['From'])[0][0]
    recipients = decode_header(message['To'])[0][0]
    date = message['Date']

    print(f"Subject: {subject}")
    print(f"Sender: {sender}")
    print(f"Recipients: {recipients}")
    print(f"Date: {date}")

    # Additional header fields can be extracted in a similar manner
    # For example: message['Cc'], message['Bcc'], message['Message-ID'], etc.

# Example usage
file_path = './email/email.eml'
parse_email(file_path)

Subject: Rappel - Recall
Sender: Jaafar Gaber <jaafar.gaber@utbm.fr>
Recipients: DS54 <DS54@utbm.fr>
Date: Mon, 15 May 2023 11:41:28 +0200 (CEST)


In [11]:
# 6. Connecting to Remote Data
import paramiko

hostname = ''
port = 22
username = ''
password = ''


client = paramiko.SSHClient()

client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

try:
    client.connect(hostname, port, username, password)
    stdin, stdout, stderr = client.exec_command('ls -l')
    output = stdout.read().decode('utf-8')
    print(output)

finally:
    client.close()



NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 127.0.0.1 or ::1

In [None]:
# 7. Copying Remote Data


In [None]:
# 8. Synchronizing Remote Data