# Extract

ETL is one of the most common tasks for data engineers (DEs): they integrate data from various heterogeneous sources for data analysts and data scientists.

In this exercise, a DE needs to download sales data from a URL and store it locally.

### 1. requests library

A **GET** request fetches data from a resource.

In [17]:
# requests library
import requests
import pandas as pd

# Get the zip file
path = 'https://assets.datacamp.com/production/repositories/5899/datasets/19d6cf619d6a771314f0eb489262a31f89c424c2/ppr-all.zip'
response = requests.get(path)

# Print the status code, 200 means ok
print(response.status_code)

# Print the headers (metadata)
print(pd.DataFrame(response.headers.items(), columns=['Header','Value']))



200
              Header                                              Value
0               Date                      Sun, 30 Apr 2023 11:50:04 GMT
1       Content-Type                           application/octet-stream
2     Content-Length                                             249296
3         Connection                                         keep-alive
4         x-amz-id-2  EcqpYuhELCzFHdZw/qZoDw/Ju0QJ+ABowRlv6WshZQoZg/...
5   x-amz-request-id                                   6KZWC3W4RVWAFY5S
6      Last-Modified                      Sun, 30 May 2021 14:00:42 GMT
7   x-amz-version-id                   yT6365UyrWqhSlRsEPyBehY7HKsnLPmH
8               ETag                 "5840e486b3afdf58267d80163cb5d0cf"
9    CF-Cache-Status                                               MISS
10     Accept-Ranges                                              bytes
11        Set-Cookie  __cf_bm=MT0AHzcAoycUupPIkmouiAaIkMm8zcSZYoOFxl...
12              Vary                                    Acce
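
The cell above prints the status code but carries on even if the request failed. A minimal sketch of a stricter download, assuming a hypothetical helper named `fetch_content` (not part of the original notebook) and a 30-second timeout chosen only for illustration:

```python
def fetch_content(url, timeout=30, session=None):
    """Return the response body as bytes, raising on HTTP errors.

    `fetch_content` and its parameters are illustrative assumptions.
    `session` can be a requests.Session or any object with a
    compatible .get() method.
    """
    if session is None:
        # Imported lazily so that a stub session can be passed in tests
        import requests
        session = requests
    response = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx instead of silently continuing
    response.raise_for_status()
    return response.content
```

Calling `fetch_content(path)` would return the same bytes as `response.content` above, but a 404 or 500 would stop the pipeline before anything is written to disk.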

### 2. os module and f-strings to build a local path

An f-string is a concise way to format strings: expressions placed inside braces in the string literal are evaluated and inserted into the result.

In [18]:
# os library
import os

# Set routes
root_dir = 'D:/Learn_DS/Git/python-learning/ETL'

# zipped files go in the source folder, unzipped files go in the raw folder
source_dir = 'data/2023April/source'
raw_dir = 'data/2023April/raw'

# file folder holds unzipped files
file_name = 'downloaded_at=2023-04-30.zip'
file_folder = 'download_at=2023-04-30'

# Create the parent directory of the given path if it does not exist yet
def create_folder_if_not_exists(path):
    os.makedirs(os.path.dirname(path), exist_ok=True)

# build path using f-strings
source_path = f"{root_dir}/{source_dir}/{file_name}"
raw_path = f"{root_dir}/{raw_dir}/{file_folder}"
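
The date in the file and folder names is hard-coded above. Because f-strings accept a format spec after a colon, the names could instead be derived from a `date` object. A small sketch, assuming a hypothetical helper `dated_names` and the `downloaded_at=` prefix for both names:

```python
from datetime import date

def dated_names(run_date):
    """Build the zip file name and the folder name for a given date.

    `dated_names` is a hypothetical helper, not part of the notebook.
    """
    # The part after the colon is a strftime-style format spec
    stamp = f"{run_date:%Y-%m-%d}"
    return f"downloaded_at={stamp}.zip", f"downloaded_at={stamp}"

file_name, file_folder = dated_names(date(2023, 4, 30))
print(file_name)    # downloaded_at=2023-04-30.zip
print(file_folder)  # downloaded_at=2023-04-30
```

Passing `date.today()` would stamp each run with the current date.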


### 3. save zipped file in the source path

In [19]:
# Save the file locally; the with block closes the file after writing, and mode "wb" writes the zip as raw bytes
create_folder_if_not_exists(source_path)
with open(source_path, "wb") as source_file:
    source_file.write(response.content)
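
A server can return an error page with a 200 status, so it may be worth confirming that the saved bytes really form a ZIP archive before unzipping. A minimal sketch, assuming a hypothetical helper `save_bytes` that combines the write step with `zipfile.is_zipfile`:

```python
import os
import zipfile

def save_bytes(content, path):
    """Write bytes to `path` and report whether the result is a ZIP archive.

    `save_bytes` is an illustrative helper, not part of the notebook.
    `path` is assumed to include a directory component.
    """
    # Create the parent directory first, as create_folder_if_not_exists does
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as target_file:
        target_file.write(content)
    # is_zipfile checks the file's structure, not its extension
    return zipfile.is_zipfile(path)
```

With the variables above, `save_bytes(response.content, source_path)` would return a truthy value only when the download is a readable archive.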

### 4. unzip the ZIP file and put its contents in the raw path

In [22]:
# Unzip the ZIP file
from zipfile import ZipFile

# Make sure the parent raw directory exists (extract creates the target folder itself)
create_folder_if_not_exists(raw_path)
with ZipFile(source_path, mode="r") as f:
    # Get the list of files and print it
    file_names = f.namelist()
    print(file_names)
    # Extract the csv file
    csv_file_path = f.extract(file_names[0], path=raw_path)
    print(csv_file_path)



['ppr-all.csv']
D:\Learn_DS\Git\python-learning\ETL\data\2023April\raw\download_at=2023-04-30\ppr-all.csv
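
The cell above extracts `file_names[0]`, which works here because the archive holds a single CSV file. If an archive contained several members, a sketch like the following could extract only the CSV files; `extract_csv_members` is a hypothetical helper and the `.csv` suffix check is an assumption about how the files are named:

```python
from zipfile import ZipFile

def extract_csv_members(zip_path, dest_dir):
    """Extract every .csv member of the archive and return the extracted paths.

    `extract_csv_members` is an illustrative helper, not part of the notebook.
    """
    with ZipFile(zip_path, mode="r") as archive:
        # Keep only members whose name ends in .csv (case-insensitive)
        csv_names = [name for name in archive.namelist()
                     if name.lower().endswith(".csv")]
        # extract() creates dest_dir if needed and returns the written path
        return [archive.extract(name, path=dest_dir) for name in csv_names]
```

With the variables above, `extract_csv_members(source_path, raw_path)` would return a one-element list holding the same path as `csv_file_path`.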


### 5. read the CSV file

In [14]:
# csv module
import csv
from pprint import pprint

# open function
with open(csv_file_path, mode="r", encoding="windows-1252") as csv_file:
    reader = csv.DictReader(csv_file)
    # get the first row
    row = next(reader)
    pprint(row)

{'Address': '16 BURLEIGH COURT, BURLINGTON ROAD, DUBLIN 4',
 'County': 'Dublin',
 'Date of Sale (dd/mm/yyyy)': '03/01/2021',
 'Description of Property': 'Second-Hand Dwelling house /Apartment',
 'Postal Code': 'Dublin 4',
 'Price (€)': '€450,000.00'}
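
`next(reader)` returns a single row. To peek at the first few rows without loading the whole file, `itertools.islice` can be applied to the reader. A minimal sketch, assuming a hypothetical helper `head_rows` and the same `windows-1252` encoding as above:

```python
import csv
from itertools import islice

def head_rows(csv_path, n=3, encoding="windows-1252"):
    """Return the first n rows of a CSV file as a list of dicts.

    `head_rows` is an illustrative helper, not part of the notebook.
    """
    # newline="" is what the csv module documentation recommends for files
    with open(csv_path, mode="r", encoding=encoding, newline="") as csv_file:
        # islice stops after n rows, so the rest of the file is never read
        return list(islice(csv.DictReader(csv_file), n))
```

Every value comes back as a string, as in the row printed above, so prices and dates would still need converting in a later step.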


In [15]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv(csv_file_path, encoding="windows-1252")

# Print five randomly sampled rows of the DataFrame
print(df.sample(5))

      Date of Sale (dd/mm/yyyy)                                   Address  \
5011                 12/02/2021      4 MERRY MEETING, RATHNEW, CO.WICKLOW   
2659                 28/01/2021     25 TEMPLEOGUE RD, TERENURE, DUBLIN 6W   
11939                08/04/2021          24 EMERALD SQ, CORK ST, DUBLIN 8   
6198                 19/02/2021  TOWNAMULLOGUE, COURTNACUDDY, ENNISCORTHY   
5502                 16/02/2021  7 CARMANHALL COURT, SANDYFORD, DUBLIN 18   

      Postal Code   County    Price (€)                Description of Property  
5011          NaN  Wicklow  €250,000.00  Second-Hand Dwelling house /Apartment  
2659          NaN   Dublin  €695,000.00  Second-Hand Dwelling house /Apartment  
11939    Dublin 8   Dublin  €277,000.00  Second-Hand Dwelling house /Apartment  
6198          NaN  Wexford  €122,000.00  Second-Hand Dwelling house /Apartment  
5502    Dublin 18   Dublin  €310,000.00  Second-Hand Dwelling house /Apartment  


In [16]:
# Number of rows and columns in the DataFrame
print(df.shape[0], df.shape[1])

11999 6
