# Extract

ETL is the most frequent process for data engineers (DEs). A DE needs to integrate data from various heterogeneous sources for data analysts and data scientists.

In this exercise, the DE needs to get sales data from a URL and then store it locally.

### 1. requests library

A **GET** request fetches data from a resource.

In [37]:
# requests library
import requests
import pandas as pd

# Get the zip file
path = 'https://assets.datacamp.com/production/repositories/5899/datasets/19d6cf619d6a771314f0eb489262a31f89c424c2/ppr-all.zip'
response = requests.get(path)

# Print the status code, 200 means ok
print(response.status_code)

# Print the headers (metadata)
print(pd.DataFrame(response.headers.items(), columns=['Header','Value']))



200
              Header                                              Value
0               Date                      Sun, 30 Apr 2023 08:55:35 GMT
1       Content-Type                           application/octet-stream
2     Content-Length                                             249296
3         Connection                                         keep-alive
4         x-amz-id-2  EcqpYuhELCzFHdZw/qZoDw/Ju0QJ+ABowRlv6WshZQoZg/...
5   x-amz-request-id                                   6KZWC3W4RVWAFY5S
6      Last-Modified                      Sun, 30 May 2021 14:00:42 GMT
7   x-amz-version-id                   yT6365UyrWqhSlRsEPyBehY7HKsnLPmH
8               ETag                 "5840e486b3afdf58267d80163cb5d0cf"
9    CF-Cache-Status                                        REVALIDATED
10     Accept-Ranges                                              bytes
11        Set-Cookie  __cf_bm=DyoFFNs36NrL3Y26b9h3gbyhQOHwwz5_erL465...
12              Vary                                    Acce
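
The status code printed above can be turned into something more readable than a bare number. The sketch below uses the standard library's `http.HTTPStatus` for that. `describe_status` and `is_success` are hypothetical helper names chosen for illustration; they are not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '200 OK' for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes the standard library does not know about
        return f"{code} (unknown status)"
    return f"{status.value} {status.phrase}"


def is_success(code: int) -> bool:
    """2xx codes mean the request succeeded."""
    return 200 <= code < 300


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
print(is_success(200))       # True
```

With a real response, `describe_status(response.status_code)` would give the same kind of message. `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.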

### 2. f-strings to build a local path

An f-string (formatted string literal) is a concise way to format strings: it embeds expressions inside a string literal.
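
As a small illustration of expressions inside the braces, the sketch below derives the folder and file names from a date object instead of hard-coding them. The variable `download_date` is an assumption made for this example, and the byte count is the `Content-Length` value shown in the headers above.

```python
from datetime import date

# Assumed download date, matching the folder name used in this exercise
download_date = date(2023, 4, 30)

# Any expression can go inside the braces, optionally followed by a format spec
file_dir = f"downloaded_at={download_date:%Y-%m-%d}"
file_name = f"{file_dir}.zip"

print(file_dir)   # downloaded_at=2023-04-30
print(file_name)  # downloaded_at=2023-04-30.zip

# Format specs also work for numbers, here bytes converted to KiB
print(f"{249296 / 1024:.1f} KiB")  # 243.5 KiB
```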

In [38]:
# os library
import os

# Set routes (root_dir has no trailing slash, so the joined path has no "//")
root_dir = 'D:/Learn_DS/Git/python-learning/DataCamp'
data_dir = 'data/2023April'
file_dir = 'downloaded_at=2023-04-30'
file_name = 'downloaded_at=2023-04-30.zip'

# Build paths using f-strings
file_folder = f"{root_dir}/{data_dir}/{file_dir}"
file_path = f"{root_dir}/{data_dir}/{file_name}"

# Create the local directory under root_dir if it does not exist yet
os.makedirs(f"{root_dir}/{data_dir}", exist_ok=True)

# Check if the file exists
if os.path.exists(file_path):
    print(f"File {file_name} exists at {file_path}")
else:
    print(f"File {file_name} does not exist at {file_path}")

File downloaded_at=2023-04-30.zip does not exist at D:/Learn_DS/Git/python-learning/DataCamp/data/2023April/downloaded_at=2023-04-30.zip
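
Joining parts by hand with `/` inside an f-string can produce a doubled separator whenever one part already ends in a slash. As an alternative sketch, `pathlib` from the standard library joins the parts and normalizes the separators. `PurePosixPath` is used here only so the printed result looks the same on every operating system; on a real machine `pathlib.Path` would be the usual choice.

```python
from pathlib import PurePosixPath

# root_dir is written with a trailing slash on purpose
root_dir = "D:/Learn_DS/Git/python-learning/DataCamp/"
data_dir = "data/2023April"
file_name = "downloaded_at=2023-04-30.zip"

# The / operator joins parts and drops redundant separators
file_path = PurePosixPath(root_dir) / data_dir / file_name

print(file_path)
# D:/Learn_DS/Git/python-learning/DataCamp/data/2023April/downloaded_at=2023-04-30.zip
print(file_path.name)    # downloaded_at=2023-04-30.zip
print(file_path.suffix)  # .zip
```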


### 3. save online file locally

In [39]:
# Save the file locally. The with statement closes the file after writing;
# mode "wb" writes raw bytes, which a ZIP file needs.
with open(file_path, "wb") as f:
    f.write(response.content)
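
One way to gain some confidence in the saved file is to compare its size on disk with the size the server announced. The sketch below is a minimal version of that idea. `save_bytes` and its `expected_length` parameter are hypothetical names, and the demo uses stand-in bytes rather than a real download.

```python
import os
import tempfile


def save_bytes(content: bytes, path: str, expected_length=None) -> int:
    """Write bytes to path and return the size on disk.

    expected_length is optional; when given (for example from a
    Content-Length header) a mismatch raises ValueError.
    """
    with open(path, "wb") as f:
        f.write(content)
    size = os.path.getsize(path)
    if expected_length is not None and size != int(expected_length):
        raise ValueError(f"expected {expected_length} bytes, wrote {size}")
    return size


# Demo with stand-in bytes instead of a real download
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.zip")
    written = save_bytes(b"not a real zip", demo_path, expected_length="14")
    print(written)  # 14
```

In this exercise the call might look like `save_bytes(response.content, file_path, response.headers.get("Content-Length"))`. The header describes the bytes sent over the wire, so the comparison is only meaningful when the response body is not compressed in transit.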

### 4. unzip the Zip file

In [44]:
# Unzip the ZIP file
from zipfile import ZipFile

with ZipFile(file_path, mode="r") as f:
    # Get the list of files and print it
    file_names = f.namelist()
    print(file_names)
    # Extract the csv file
    csv_file_path = f.extract(file_names[0], path=file_folder)
    print(csv_file_path)



['ppr-all.csv']
D:\Learn_DS\Git\python-learning\DataCamp\data\2023April\downloaded_at=2023-04-30\ppr-all.csv
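
Taking `file_names[0]` works here because the archive holds a single file. For archives with several entries, a sketch like the following selects members by extension instead. `csv_members` is a hypothetical helper, and the in-memory archive only stands in for the downloaded file.

```python
import io
from zipfile import ZipFile


def csv_members(zip_source) -> list:
    """Return the names of all .csv entries in a ZIP archive."""
    with ZipFile(zip_source, mode="r") as zf:
        return [name for name in zf.namelist() if name.lower().endswith(".csv")]


# Build a tiny in-memory archive to stand in for the downloaded file
buffer = io.BytesIO()
with ZipFile(buffer, mode="w") as zf:
    zf.writestr("readme.txt", "not data")
    zf.writestr("ppr-all.csv", "County,Price\nDublin,1\n")

print(csv_members(buffer))  # ['ppr-all.csv']
```

`ZipFile` accepts either a path or a file-like object, so `csv_members(file_path)` would work the same way on the file saved earlier.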


### 5. read the CSV file

In [42]:
# csv module
import csv
from pprint import pprint

# Open the extracted CSV file; newline="" is what the csv module recommends
with open(csv_file_path, mode="r", encoding="windows-1252", newline="") as csv_file:
    reader = csv.DictReader(csv_file)
    # Get the first row
    row = next(reader)
    pprint(row)

{'Address': '16 BURLEIGH COURT, BURLINGTON ROAD, DUBLIN 4',
 'County': 'Dublin',
 'Date of Sale (dd/mm/yyyy)': '03/01/2021',
 'Description of Property': 'Second-Hand Dwelling house /Apartment',
 'Postal Code': 'Dublin 4',
 'Price (€)': '€450,000.00'}
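
The `encoding="windows-1252"` argument matters because of the euro sign in the price column. The sketch below shows the general mechanism on a few hand-written bytes rather than on the real file: in Windows-1252 the euro sign is the single byte `0x80`, which is not valid UTF-8 on its own. `can_decode` is a hypothetical helper.

```python
# In Windows-1252 the euro sign is stored as one byte
euro_cp1252 = "€".encode("windows-1252")
print(euro_cp1252)  # b'\x80'

# Stand-in for a header as it might be stored on disk
header_bytes = b"Price (\x80)"
print(header_bytes.decode("windows-1252"))  # Price (€)


def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(can_decode(header_bytes, "utf-8"))         # False
print(can_decode(header_bytes, "windows-1252"))  # True
```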


In [43]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv(csv_file_path, encoding="windows-1252")

# Print five random rows of the DataFrame
print(df.sample(5))

     Date of Sale (dd/mm/yyyy)                                  Address  \
3915                05/02/2021    14 WENTWORTH PLACE, JIGGINSTOWN, NAAS   
7398                26/02/2021    LOGAN STREET, THOMASTOWN, CO KILKENNY   
807                 13/01/2021  3 TOGHER ROAD, MONASTEREVIN, CO KILDARE   
1700                20/01/2021               DUNBOY, HOLLYMOUNT, LEE RD   
3504                02/02/2021         NO 8 ARD CAOIN, GORT ROAD, ENNIS   

     Postal Code    County    Price (€)                Description of Property  
3915         NaN   Kildare  €192,500.00  Second-Hand Dwelling house /Apartment  
7398         NaN  Kilkenny   €93,000.00  Second-Hand Dwelling house /Apartment  
807          NaN   Kildare  €165,000.00  Second-Hand Dwelling house /Apartment  
1700         NaN      Cork  €167,500.00  Second-Hand Dwelling house /Apartment  
3504         NaN     Clare  €185,000.00  Second-Hand Dwelling house /Apartment  


In [32]:
# Number of rows and columns in the DataFrame
print(df.shape[0], df.shape[1])

11999 6
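
The row count can be cross-checked without pandas by counting records with the `csv` module. This is a minimal sketch: `count_data_rows` is a hypothetical helper, and the small temporary file only stands in for the extracted CSV.

```python
import csv
import os
import tempfile


def count_data_rows(path: str, encoding: str = "windows-1252") -> int:
    """Count the records in a CSV file, excluding the header row."""
    with open(path, mode="r", encoding=encoding, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header, if there is one
        return sum(1 for _ in reader)


# Small stand-in file with one header line and two records
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", encoding="windows-1252", newline="") as f:
        f.write("County,Price (€)\r\nDublin,1\r\nCork,2\r\n")
    n_rows = count_data_rows(sample_path)

print(n_rows)  # 2
```

Calling `count_data_rows(csv_file_path)` on the real file should agree with `df.shape[0]`, provided the file contains no blank lines (pandas skips them by default, while `csv.reader` yields them as empty rows).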
