# Extract, Transform and Load

ETL is the process data engineers (DEs) run most often: a DE integrates data from various heterogeneous sources so that data analysts and data scientists can use it.

In this exercise, the DE needs to download sales data from a URL and store it locally.

## Extract
To extract data from a storage system, I send a request and the data is sent back.

In this case, the data source is a URL.

### 1. requests library

A **GET** request fetches data from a resource.

In [16]:
# requests library
import requests
import pandas as pd

# Get the zip file
path = "https://assets.datacamp.com/production/repositories/5899/datasets/19d6cf619d6a771314f0eb489262a31f89c424c2/ppr-all.zip"
response = requests.get(path)

# Print the status code, 200 means ok
print(response.status_code)

# Print the headers (metadata)
print(pd.DataFrame(response.headers.items(), columns=["Header","Value"]))



200
              Header                                              Value
0               Date                      Sat, 29 Apr 2023 22:49:27 GMT
1       Content-Type                           application/octet-stream
2     Content-Length                                             249296
3         Connection                                         keep-alive
4         x-amz-id-2  XW89DzPlpbUccYH/eZ5vX66OyaaVjqG0VuwuzCmpovCOU+...
5   x-amz-request-id                                   SDQ1W0F2DF1G27D2
6      Last-Modified                      Sun, 30 May 2021 14:00:42 GMT
7   x-amz-version-id                   yT6365UyrWqhSlRsEPyBehY7HKsnLPmH
8               ETag                 "5840e486b3afdf58267d80163cb5d0cf"
9    CF-Cache-Status                                                HIT
10               Age                                               5902
11     Accept-Ranges                                              bytes
12        Set-Cookie  __cf_bm=cQ5YfeZM7iZm86DOjWUq932JBK4dd5
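
The status code printed above can be read without memorising the numbers. The sketch below uses only the standard library's http.HTTPStatus to turn a code into its reason phrase. The helper names describe_status and is_success are made up for illustration and are not part of requests. With requests itself, calling response.raise_for_status() raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the status code with its standard reason phrase (hypothetical helper)."""
    return f"{code} {HTTPStatus(code).phrase}"


def is_success(code: int) -> bool:
    """Codes in the 2xx range mean the request succeeded (hypothetical helper)."""
    return 200 <= code < 300


for code in (200, 404, 500):
    print(describe_status(code), "| success:", is_success(code))
```

In the notebook, is_success(response.status_code) would be one way to decide whether to go on and save the file.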

### 2. f-strings to build a local path

An f-string (formatted string literal) is a concise way to format strings: expressions written inside curly braces in the literal are evaluated and inserted into the result.
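
As a small illustration of what can go inside the braces, the sketch below formats the file name and the Content-Length value reported in the response headers above. The variable names are chosen for this example only.

```python
file_name = "ETL_Demo.zip"
size_bytes = 249296  # Content-Length from the response headers above

# Plain substitution of a variable
msg_plain = f"Downloaded {file_name}"
# An expression plus a format spec: one decimal place
msg_size = f"{file_name} is {size_bytes / 1024:.1f} KiB"
# A format spec that adds a thousands separator
msg_bytes = f"{size_bytes:,} bytes"
# !r inserts repr(), handy for spotting stray whitespace in a name
msg_repr = f"file_name={file_name!r}"

print(msg_plain, msg_size, msg_bytes, msg_repr, sep="\n")
```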

In [6]:
# os library
import os

# Set routes (no trailing slash on root_dir, the f-string below adds the separators)
root_dir = 'D:/Learn_DS/Git/python-learning/DataCamp'
data_dir = 'data'
file_name = 'ETL_Demo.zip'

# Create the local data directory (relative path, assumes the notebook runs from root_dir)
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

# Build the full path using an f-string
file_path = f"{root_dir}/{data_dir}/{file_name}"

# Check if the file exists
if os.path.exists(file_path):
    print(f"File {file_name} exists at {file_path}")
else:
    print(f"File {file_name} does not exist at {file_path}")

File ETL_Demo.zip does not exist at D:/Learn_DS/Git/python-learning/DataCamp/data/ETL_Demo.zip
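
Joining path parts by hand makes it easy to end up with a doubled or missing separator. As an alternative to the f-string, here is a minimal sketch using pathlib, which normalises separators and can create the directory in the same step. The helper name build_file_path is hypothetical, and a temporary directory stands in for the real project folder.

```python
import tempfile
from pathlib import Path


def build_file_path(root_dir: str, data_dir: str, file_name: str) -> Path:
    """Create root_dir/data_dir if needed and return the full file path (hypothetical helper)."""
    target_dir = Path(root_dir) / data_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / file_name


# A temporary directory stands in for the real project folder
with tempfile.TemporaryDirectory() as tmp:
    # The trailing slash is deliberate: pathlib collapses it
    demo_path = build_file_path(tmp + "/", "data", "ETL_Demo.zip")
    has_double_slash = "//" in demo_path.as_posix()
    parent_exists = demo_path.parent.is_dir()
    file_exists = demo_path.exists()

print(demo_path.name, has_double_slash, parent_exists, file_exists)
```

The directory is created under the root folder rather than under the current working directory, so the path that is built and the directory that exists always match.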


### 3. save online file locally

In [7]:
# Save the file locally; the with statement closes the file after writing, and mode "wb" writes the ZIP content as raw bytes
with open(file_path, "wb") as f:
    f.write(response.content)
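
A quick sanity check after saving is to compare the number of bytes written with the size on disk. The sketch below assumes nothing about the real download: the payload is placeholder bytes, the file lives in a temporary directory, and save_bytes is a hypothetical helper.

```python
import os
import tempfile


def save_bytes(content: bytes, path: str) -> int:
    """Write content to path in binary mode and return the number of bytes written."""
    with open(path, "wb") as f:
        written = f.write(content)
    return written


# Placeholder bytes, not a real archive
payload = b"PK\x03\x04 placeholder bytes"

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    written = save_bytes(payload, demo_path)
    size_on_disk = os.path.getsize(demo_path)

print(written == len(payload), size_on_disk == len(payload))
```

In the notebook, the same check could compare os.path.getsize(file_path) with len(response.content).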

### 4. unzip the ZIP file

In [10]:
# Unzip the ZIP file
from zipfile import ZipFile

# Check the files in the archive

with ZipFile(file_path, mode="r") as f:
    # Get the list of files and print it
    file_names = f.namelist()
    print(file_names)
    # Extract the csv file and keep the path it was written to
    csv_file_path = f.extract(file_names[0], path=data_dir)
    print(csv_file_path)



['ppr-all.csv']
data\ppr-all.csv
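
ZipFile also accepts a file-like object, so an archive held in memory can be read without saving it first. The sketch below builds a tiny archive in memory so that it runs offline. The member name sample.csv and its contents are made up for the example.

```python
import io
from zipfile import ZipFile

# Build a small ZIP archive in memory (made-up member name and contents)
buffer = io.BytesIO()
with ZipFile(buffer, mode="w") as zf:
    zf.writestr("sample.csv", "County,Price\nDublin,450000\n")
zip_bytes = buffer.getvalue()

# Read the archive straight from bytes, as one could with response.content
with ZipFile(io.BytesIO(zip_bytes), mode="r") as zf:
    member_names = zf.namelist()
    with zf.open(member_names[0]) as member:
        text = member.read().decode("utf-8")

print(member_names)
print(text.splitlines())
```

Saving to disk, as done in this notebook, keeps a local copy for later runs. Reading from memory is an option when the archive is only needed once.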


### 5. read the CSV file

In [14]:
# csv module
import csv
from pprint import pprint

# Open the file with the encoding it was saved in; newline="" is recommended for the csv module
with open(csv_file_path, mode="r", encoding="windows-1252", newline="") as csv_file:
    reader = csv.DictReader(csv_file)
    # Get the first row
    row = next(reader)
    pprint(row)

{'Address': '16 BURLEIGH COURT, BURLINGTON ROAD, DUBLIN 4',
 'County': 'Dublin',
 'Date of Sale (dd/mm/yyyy)': '03/01/2021',
 'Description of Property': 'Second-Hand Dwelling house /Apartment',
 'Postal Code': 'Dublin 4',
 'Price (€)': '€450,000.00'}
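
The encoding argument matters here because of the € sign: in windows-1252 it is stored as a single byte that is not valid UTF-8. The sketch below shows this with a shortened version of the row printed above (three of the six columns), encoded in memory so no file is needed.

```python
import csv
import io

# Header and one row, shortened from the output above, encoded as windows-1252
text = (
    "Address,County,Price (€)\n"
    '"16 BURLEIGH COURT, BURLINGTON ROAD, DUBLIN 4",Dublin,"€450,000.00"\n'
)
raw = text.encode("windows-1252")

# Decoding the same bytes as UTF-8 fails on the € byte
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# TextIOWrapper plays the role of open(..., encoding="windows-1252")
with io.TextIOWrapper(io.BytesIO(raw), encoding="windows-1252", newline="") as csv_file:
    reader = csv.DictReader(csv_file)
    first_row = next(reader)

print(utf8_ok)
print(first_row["Price (€)"])
```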


In [23]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv(csv_file_path, encoding="windows-1252")

# Print five randomly sampled rows of the DataFrame
print(df.sample(5))

     Date of Sale (dd/mm/yyyy)                                       Address  \
7611                01/03/2021           CORMAC STREET, TULLAMORE, CO OFFALY   
8382                05/03/2021                   66 PARKLANDS, YOUGHAL, CORK   
6971                25/02/2021  72 MARIAN AVENUE, CLONMINAN ROAD, PORTLAOISE   
8328                05/03/2021            41 MERCHANT SQ, EASTWALL, DUBLIN 3   
3518                02/02/2021           WEAVERS SQ, BALTINGLASS, CO WICKLOW   

     Postal Code   County    Price (€)                Description of Property  
7611         NaN   Offaly  €370,000.00  Second-Hand Dwelling house /Apartment  
8382         NaN     Cork  €210,000.00  Second-Hand Dwelling house /Apartment  
6971         NaN    Laois  €189,000.00  Second-Hand Dwelling house /Apartment  
8328    Dublin 3   Dublin  €315,000.00  Second-Hand Dwelling house /Apartment  
3518         NaN  Wicklow  €151,000.00  Second-Hand Dwelling house /Apartment  


In [26]:
# The shape of the DataFrame: number of rows and number of columns
print(df.shape[0], df.shape[1])

11999 6
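
The row and column counts can be cross-checked without pandas by counting lines with the csv module. This is only a sketch on a three-row stand-in with two simplified columns, not the real file.

```python
import csv
import io

# Three-row stand-in for the real CSV (simplified columns)
sample = io.StringIO("County,Price\nDublin,450000\nCork,210000\nLaois,189000\n")

reader = csv.reader(sample)
header = next(reader)            # the first line holds the column names
n_rows = sum(1 for _ in reader)  # count the remaining data rows

print(n_rows, len(header))
```

On the real file, the same loop over open(csv_file_path, encoding="windows-1252", newline="") should agree with df.shape.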


## Transform
