# Extract

ETL (extract, transform, load) is among the most frequent tasks for data engineers (DEs), who need to integrate data from various heterogeneous sources for data analysts and data scientists.

In this exercise, a DE needs to fetch sales data from a URL and store it locally.

### 1. requests library

A **GET** request fetches data from a resource.

In [1]:
# requests library
import requests
import pandas as pd

# Get the zip file
path = 'https://assets.datacamp.com/production/repositories/5899/datasets/19d6cf619d6a771314f0eb489262a31f89c424c2/ppr-all.zip'
response = requests.get(path)

# Print the status code; 200 means OK
print(response.status_code)

# Print the headers (metadata)
print(pd.DataFrame(response.headers.items(), columns=['Header','Value']))



200
              Header                                              Value
0               Date                      Sun, 30 Apr 2023 09:27:59 GMT
1       Content-Type                           application/octet-stream
2     Content-Length                                             249296
3         Connection                                         keep-alive
4         x-amz-id-2  EcqpYuhELCzFHdZw/qZoDw/Ju0QJ+ABowRlv6WshZQoZg/...
5   x-amz-request-id                                   6KZWC3W4RVWAFY5S
6      Last-Modified                      Sun, 30 May 2021 14:00:42 GMT
7   x-amz-version-id                   yT6365UyrWqhSlRsEPyBehY7HKsnLPmH
8               ETag                 "5840e486b3afdf58267d80163cb5d0cf"
9    CF-Cache-Status                                                HIT
10               Age                                               1944
11     Accept-Ranges                                              bytes
12        Set-Cookie  __cf_bm=.J9ZyMIW4r8iaIzApCVKQ.Ehe2_myP
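Before saving the payload, it can help to confirm that the download is complete and really is a ZIP archive. The following is a minimal sketch, not part of the original notebook: `looks_like_complete_zip` is a hypothetical helper that would be called with `response.headers` and `response.content`. It assumes the server does not apply a compressed `Content-Encoding`, in which case `Content-Length` would not match the decoded body. An in-memory archive stands in for the real response so the sketch runs offline.

```python
import io
import zipfile


def looks_like_complete_zip(headers, content):
    """Return True if the payload length matches Content-Length (when present)
    and the bytes form a readable ZIP archive. Hypothetical helper."""
    declared = headers.get("Content-Length")
    if declared is not None and int(declared) != len(content):
        return False
    return zipfile.is_zipfile(io.BytesIO(content))


# Demo with an in-memory archive standing in for response.content
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("ppr-all.csv", "County\nDublin\n")
payload = buffer.getvalue()

print(looks_like_complete_zip({"Content-Length": str(len(payload))}, payload))
print(looks_like_complete_zip({"Content-Length": "1"}, payload))
```

The first call checks a matching length, the second a deliberately wrong one.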

### 2. os module and f-strings to build a local path

An f-string is a concise way to format strings: it embeds expressions inside a string literal.
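As a small illustration of expressions inside an f-string, the sketch below builds a dated file name and formats a size. The variable names `download_date`, `size_bytes`, and `summary` are illustrative only; the size is the `Content-Length` value shown in the header output earlier.

```python
from datetime import date

download_date = date(2023, 4, 30)  # example date, matching the notebook's file name
size_bytes = 249296                # Content-Length reported in the header output

# Any expression can go inside the braces, including method calls
file_name = f"downloaded_at={download_date.isoformat()}.zip"

# A format spec after the colon controls the number formatting
summary = f"{file_name}: {size_bytes / 1024:.1f} KiB"
print(summary)  # downloaded_at=2023-04-30.zip: 243.5 KiB
```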

In [2]:
# os library
import os

# Set paths
root_dir = 'D:/Learn_DS/Git/python-learning/ETL'
data_dir = 'data/2023April'

# ZIP files go in the source directory, unzipped files in the raw directory
source_dir = 'data/2023April/source'
raw_dir = 'data/2023April/raw'

# file_folder holds the unzipped files
file_name = 'downloaded_at=2023-04-30.zip'
file_folder = 'download_at=2023-04-30'

# Create the local directories under root_dir, so they match the paths built below
# (exist_ok=True makes a separate existence check unnecessary)
for sub_dir in (data_dir, source_dir, raw_dir):
    os.makedirs(os.path.join(root_dir, sub_dir), exist_ok=True)

# build path using f-strings
file_path = f"{root_dir}/{source_dir}/{file_name}"
file_folder = f"{root_dir}/{raw_dir}/{file_folder}"

# check if file exists
if os.path.exists(file_path):
    print(f"File {file_name} exists at {file_path}")
else:
    print(f"File {file_name} does not exist at {file_path}")

File downloaded_at=2023-04-30.zip does not exist at D:/Learn_DS/Git/python-learning/ETL/data/2023April/source/downloaded_at=2023-04-30.zip
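The same directory layout could also be built with `pathlib`, which joins path parts with the `/` operator instead of string formatting. This is an alternative sketch, not the notebook's approach; a temporary directory stands in for `root_dir` so it runs on any machine, and the names `root`, `source`, `raw`, and `zip_path` are illustrative.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for root_dir (assumption for this sketch)
root = Path(tempfile.mkdtemp())
source = root / "data" / "2023April" / "source"
raw = root / "data" / "2023April" / "raw"

# parents=True creates intermediate folders; exist_ok=True tolerates reruns
for directory in (source, raw):
    directory.mkdir(parents=True, exist_ok=True)

zip_path = source / "downloaded_at=2023-04-30.zip"
print(zip_path.exists())  # False: nothing has been downloaded yet
```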


### 3. save online file locally

In [3]:
# Save the file locally; the with statement closes the file after writing, and mode "wb" writes the ZIP as raw bytes
with open(file_path, "wb") as f:
    f.write(response.content)
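For larger files, holding the whole body in `response.content` may use more memory than necessary. `requests` supports `requests.get(url, stream=True)` together with `response.iter_content(chunk_size=...)`, which yields the body piece by piece. The sketch below shows only the writing side: `write_chunks` is a hypothetical helper, and a plain list of bytes stands in for the iterator so the example runs offline.

```python
import os
import tempfile


def write_chunks(chunks, dest_path):
    """Write an iterable of bytes objects to dest_path; return bytes written."""
    written = 0
    with open(dest_path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += f.write(chunk)
    return written


# Demo: a plain list stands in for response.iter_content(chunk_size=8192)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.zip")
total = write_chunks([b"PK", b"", b"\x03\x04"], demo_path)
print(total)  # 4
```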

### 4. unzip the ZIP file

In [4]:
# Unzip the ZIP file
from zipfile import ZipFile

with ZipFile(file_path, mode="r") as f:
    # Get the list of files in the archive and print it
    file_names = f.namelist()
    print(file_names)
    # Extract the CSV file
    csv_file_path = f.extract(file_names[0], path=file_folder)
    print(csv_file_path)



['ppr-all.csv']
D:\Learn_DS\Git\python-learning\ETL\data\2023April\raw\download_at=2023-04-30\ppr-all.csv
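Taking `file_names[0]` works here because the archive holds a single file. When an archive might contain several members, one option is to inspect it with `infolist()` and extract only the CSV members. The sketch below assumes a made-up two-file archive built in memory; `readme.txt` and the temporary `target_dir` are illustrative, not part of the real download.

```python
import io
import os
import tempfile
import zipfile

# Build a small in-memory archive so the sketch runs without the real download
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("ppr-all.csv", "County\nDublin\n")
    archive.writestr("readme.txt", "not a csv")

target_dir = tempfile.mkdtemp()
with zipfile.ZipFile(buffer, mode="r") as archive:
    # infolist() exposes per-member metadata such as the uncompressed size
    for info in archive.infolist():
        print(info.filename, info.file_size)
    # Keep only members whose name ends in .csv, then extract just those
    csv_members = [n for n in archive.namelist() if n.lower().endswith(".csv")]
    archive.extractall(path=target_dir, members=csv_members)

extracted = sorted(os.listdir(target_dir))
print(extracted)  # ['ppr-all.csv']
```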


### 5. read the CSV file

In [5]:
# csv module
import csv
from pprint import pprint

# Open the extracted CSV file with its windows-1252 encoding
with open(csv_file_path, mode="r", encoding="windows-1252") as csv_file:
    reader = csv.DictReader(csv_file)
    # Get the first row
    row = next(reader)
    pprint(row)

{'Address': '16 BURLEIGH COURT, BURLINGTON ROAD, DUBLIN 4',
 'County': 'Dublin',
 'Date of Sale (dd/mm/yyyy)': '03/01/2021',
 'Description of Property': 'Second-Hand Dwelling house /Apartment',
 'Postal Code': 'Dublin 4',
 'Price (€)': '€450,000.00'}
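The `encoding="windows-1252"` argument matters because of the `€` sign: in windows-1252 it is stored as the single byte `0x80`, which is not valid UTF-8 on its own. The sketch below illustrates this with a two-line sample that mimics the header and first row shown above; the sample bytes are constructed for illustration, not read from the real file.

```python
import csv
import io

# Sample mimicking the file's header and first row, encoded as windows-1252
raw_bytes = 'Price (€),County\n"€450,000.00",Dublin\n'.encode("windows-1252")

# Decoding the same bytes as UTF-8 fails on the 0x80 byte
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With the right encoding, the text parses as expected
text = raw_bytes.decode("windows-1252")
sample_row = next(csv.DictReader(io.StringIO(text)))
print(utf8_ok, sample_row)
```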


In [6]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv(csv_file_path, encoding="windows-1252")

# Print five randomly sampled rows of the DataFrame
print(df.sample(5))

      Date of Sale (dd/mm/yyyy)  \
9616                 15/03/2021   
11959                08/04/2021   
3448                 02/02/2021   
9714                 16/03/2021   
9557                 15/03/2021   

                                                 Address Postal Code   County  \
9616                       CNOCAN PODRAIG, BEACH, BANTRY         NaN     Cork   
11959               RIDGE ROAD, PORTLAOISE, COUNTY LAOIS         NaN    Laois   
3448   APT 8 BLOCK H THE CAMPION, MARINA VILLAGE, GRE...         NaN  Wicklow   
9714                     28 SUMMERSEAT DR, CLONEE, MEATH         NaN    Meath   
9557         46 SOUTHBAY POINT, ROSSLARE STRAND, WEXFORD         NaN  Wexford   

         Price (€)                Description of Property  
9616   €130,000.00  Second-Hand Dwelling house /Apartment  
11959  €115,000.00  Second-Hand Dwelling house /Apartment  
3448   €368,274.61  Second-Hand Dwelling house /Apartment  
9714   €259,222.00  Second-Hand Dwelling house /Apartment  
95

In [7]:
# Number of rows and columns in the DataFrame
print(df.shape[0], df.shape[1])

11999 6
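Note that `sample` returns random rows while `head` returns the first rows in file order, and `shape` gives `(rows, columns)`. A minimal sketch of the three, using a tiny made-up CSV (the column names and values are illustrative stand-ins for `ppr-all.csv`):

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for ppr-all.csv
csv_text = "County,Price\nDublin,450000\nCork,130000\nLaois,115000\nMeath,259222\n"
demo_df = pd.read_csv(io.StringIO(csv_text))

# head() always returns the first rows, in file order
first_two = demo_df.head(2)

# sample() returns random rows; random_state makes the draw repeatable
random_two = demo_df.sample(2, random_state=0)

n_rows, n_cols = demo_df.shape
print(n_rows, n_cols)  # 4 2
```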
