# Extract stage of an ETL (Extract-Transform-Load) pipeline
[Link to GitHub](https://github.com/stanislavlia/datascience_club_projects/blob/main/project1_etl_pipeline/extract.py)

Using a `venv` (virtual environment) in Python gives each project an isolated set of dependencies, which provides the following:

1. **Dependency Management**: Each project can have its own packages and versions without interfering with other projects.
2. **Environment Consistency**: Because dependencies are managed independently of the system Python environment, code behaves the same wherever it runs, as long as the same setup is recreated.
3. **Easy Cleanup**: When a project is no longer needed, its dependencies can be "uninstalled" simply by deleting the `venv` directory, with no risk of affecting other projects.

In short, `venv` maintains clean environments across Python projects.

[Documentation](https://docs.python.org/3/library/venv.html)

In [1]:
# Install the Python3 virtual environment package
!apt install python3-venv

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  python3-pip-whl python3-setuptools-whl python3.10-venv
The following NEW packages will be installed:
  python3-pip-whl python3-setuptools-whl python3-venv python3.10-venv
0 upgraded, 4 newly installed, 0 to remove and 49 not upgraded.
Need to get 2,475 kB of archives.
After this operation, 2,891 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-pip-whl all 22.0.2+dfsg-1ubuntu0.5 [1,680 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-setuptools-whl all 59.6.0-1.2ubuntu0.22.04.2 [788 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3.10-venv amd64 3.10.12-1~22.04.6 [5,722 B]
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-venv amd64 3.10.6-1~22.04.1 [1,042 B]
Fetched 2,475 kB in 1s (2

In [2]:
# Create a virtual environment named 'etl_venv'
!python3 -m venv etl_venv

In [3]:
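# Activate the virtual environment 'etl_venv'
# Note: each `!` command runs in its own subshell, so this activation does not carry over to later cells
!source etl_venv/bin/activate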
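
In a notebook, each `!` command runs in its own subshell, so it can be worth checking which environment the Python kernel itself is using. The following is a minimal sketch that relies only on the standard library; the helper name `in_virtual_environment` is hypothetical and not part of the original script.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Running inside a virtual environment:", in_virtual_environment())
print("Interpreter prefix:", sys.prefix)
```

If this prints `False` right after the activation cell, later `!pip` commands are most likely installing into the kernel's global environment rather than into `etl_venv`.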

**pip** is Python's package installer; it lets you download, install, and manage libraries and their dependencies.

[Documentation](https://pip.pypa.io/en/stable/)

In [4]:
# List all installed packages in the current environment
!pip list

Package                            Version
---------------------------------- -------------------
absl-py                            1.4.0
accelerate                         0.34.2
aiohappyeyeballs                   2.4.3
aiohttp                            3.10.10
aiosignal                          1.3.1
alabaster                          0.7.16
albucore                           0.0.19
albumentations                     1.4.20
altair                             4.2.2
annotated-types                    0.7.0
anyio                              3.7.1
argon2-cffi                        23.1.0
argon2-cffi-bindings               21.2.0
array_record                       0.5.1
arviz                              0.20.0
astropy                            6.1.4
astropy-iers-data                  0.2024.10.28.0.34.7
astunparse                         1.6.3
async-timeout                      4.0.3
atpublic                           4.1.0
attrs                              24.2.0
audioread        

In [5]:
# Upgrade pip to the latest version
!pip install --upgrade pip

# Install the modules
!pip install requests
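# Note: the import cell below uses pprint from the standard library, which needs no installation;
# the third-party prettyprint package is a separate project
!pip install prettyprint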
!pip install tqdm
!pip install click

Collecting pip
  Downloading pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 36.3 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-24.3.1
Collecting prettyprint
  Downloading prettyprint-0.1.5.tar.gz (2.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: prettyprint
  Building wheel for prettyprint (setup.py) ... done
  Created wheel for prettyprint: filename=prettyprint-0.1.5-py3-none-any.whl size=3027 sha256=d71328063a5866543d892f85347faacaf54fc0227c2c7b775259db8d5d773ea4
  Stored in directory: /root/.cache/pip/wheels/b2/d0/51/477413885481c635ab7c6400f96f47b8a0971bbc1241ff9c9f
Successfully built prettyprint
Ins

In [6]:
# Show details of the packages
!pip show prettyprint
!echo ""
!pip show requests
!echo ""
!pip show tqdm
!echo ""
!pip show click

Name: prettyprint
Version: 0.1.5
Summary: prettyprint print list/dict/tuple object prettily
Home-page: http://github.com/taichino/prettyprint
Author: Matsumoto Taichi
Author-email: taichino@gmail.com
License: MIT License
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: 

Name: requests
Version: 2.32.3
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache-2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: certifi, charset-normalizer, idna, urllib3
Required-by: bigframes, CacheControl, community, diffusers, earthengine-api, fastai, folium, gcsfs, gdown, geocoder, google-api-core, google-cloud-bigquery, google-cloud-storage, google-colab, huggingface-hub, kaggle, kagglehub, langchain, langsmith, moviepy, music21, pandas-datareader, panel, pooch, pymystem3, requests-oauthlib, requests-toolbelt, spacy, Sphinx, tensorflow, tensorflow-datasets, transformers, twee
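
The version information that `pip show` prints can also be read from inside Python. The sketch below uses `importlib.metadata` from the standard library (Python 3.8 or newer); the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Check the packages installed for this notebook
for name in ("requests", "tqdm", "click"):
    print(name, installed_version(name) or "not installed")
```

This reports on the environment of the running interpreter, which makes it a quick way to confirm that a package is importable from the kernel.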

**Used Python Modules:**

[Requests: HTTP for Humans](https://requests.readthedocs.io/)

[pprint — Data pretty printer](https://docs.python.org/3/library/pprint.html)

[json — JSON encoder and decoder](https://docs.python.org/3/library/json.html)

[tqdm.std - Customisable progressbar decorator for iterators](https://tqdm.github.io/docs/tqdm/)

[datetime - Basic date and time types](https://docs.python.org/3/library/datetime.html)

[Click - package for creating CLI](https://click.palletsprojects.com/)

[Click and Python: Build Extensible and Composable CLI Apps](https://realpython.com/python-click/)

In [7]:
# Import the libraries
import requests
from pprint import pprint
import json
from tqdm import tqdm
from datetime import datetime
import click

**HTTP**, or Hypertext Transfer Protocol, is the foundational protocol for transferring data on the Web. It operates as an application-layer protocol within the TCP/IP suite, enabling communication between clients and servers: a client sends a request, and the server returns a response.

[What is HTTP?](https://www.w3schools.com/whatis/whatis_http.asp)

[HTTP Request Components](https://www.helloapi.co/blog/what-is-http-request/#http-request-components)

[HTTP Response Components](https://brightdata.jp/glossary/http-response)
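
To make the request and response components concrete, here is a small self-contained sketch that starts a throwaway HTTP server on the local machine and queries it with the standard library. The `DemoHandler` class and the `/api/?results=1` path are hypothetical stand-ins used for illustration only; no external service is contacted.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that answers every GET with a small JSON body."""

    def do_GET(self):
        body = json.dumps({"requested_path": self.path}).encode("utf-8")
        self.send_response(200)                               # status line
        self.send_header("Content-Type", "application/json")  # response headers
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                                # response body

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Request components: method, path (with query string), and headers
connection = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
connection.request("GET", "/api/?results=1", headers={"Accept": "application/json"})

# Response components: status code, reason phrase, headers, and body
response = connection.getresponse()
status, reason = response.status, response.reason
content_type = response.getheader("Content-Type")
payload = json.loads(response.read().decode("utf-8"))

connection.close()
server.shutdown()
server.server_close()

print(status, reason)
print(content_type)
print(payload)
```

The `requests` library used later in this notebook wraps the same ideas: `requests.get` sends the request, and the returned object exposes the status code, headers, and body.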

An **API** (Application Programming Interface) is a set of rules and protocols that lets different software applications communicate and interact. It enables developers to request and exchange information, allowing applications to access functionality or data from other software components or services.

[What is API?](https://www.developerupdates.com/blog/what-is-api-learn-about-api-in-5-minutes)

In [8]:
# Set up global variable for API URL
RANDOMUSER_API_URL = "https://randomuser.me/api/"
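
Many web APIs accept options as query parameters appended to the base URL. The sketch below shows one way to build such a URL with the standard library. It assumes the API accepts a `results` parameter for the number of users per response, which should be checked against the API's own documentation; `build_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

RANDOMUSER_API_URL = "https://randomuser.me/api/"


def build_url(base_url: str, **params) -> str:
    """Append URL-encoded query parameters to a base URL."""
    if not params:
        return base_url
    return f"{base_url}?{urlencode(params)}"


# Assumption: "results" is the query parameter that controls the batch size
print(build_url(RANDOMUSER_API_URL, results=5))
```

`urlencode` takes care of escaping special characters, which is safer than concatenating strings by hand.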

In [9]:
# Define function to parse JSON data into a structured dictionary
def parse_json(user_json: dict) -> dict:
    # The API wraps users in a "results" list; work with the first entry
    user = user_json["results"][0]
    location = user["location"]
    login = user["login"]

    # Build the ID from the identifier name and value (user_id avoids shadowing the built-in id)
    user_id = user["id"]["name"] + " " + str(user["id"]["value"])

    # Combine street name and number into a single field
    location_street_info = f"{location['street']['name']}, {location['street']['number']}"

    # Capture the extract date and time
    extract_time = str(datetime.now())

    # Return structured dictionary with extracted information
    return {
        "id": user_id,
        "firstname": user["name"]["first"],
        "lastname": user["name"]["last"],
        "location_city": location["city"],
        "location_country": location["country"],
        "location_state": location["state"],
        "location_latitude": location["coordinates"]["latitude"],
        "location_longitude": location["coordinates"]["longitude"],
        "location_postcode": location["postcode"],
        "location_street_info": location_street_info,
        "email": user.get("email"),
        "gender": user.get("gender"),
        "login_uuid": login.get("uuid"),
        "login_username": login.get("username"),
        "login_password": login.get("password"),
        "phone": user.get("phone"),
        "cell": user.get("cell"),
        "date_of_birth": user["dob"].get("date"),
        "age": user["dob"].get("age"),
        "date_of_registration": user["registered"].get("date"),
        "photo_link": user["picture"].get("large"),
        "extract_time": extract_time,
    }
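
`parse_json` indexes several nested keys directly, so a response with a missing section raises `KeyError`. If tolerant parsing is preferred, a small lookup helper is one option. This is a sketch under that assumption; `get_nested` and the sample payload are hypothetical and not part of the original script.

```python
def get_nested(data, path, default=None):
    """Walk a nested dict/list structure along `path`, returning `default` on any miss."""
    current = data
    for key in path:
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and isinstance(key, int) and 0 <= key < len(current):
            current = current[key]
        else:
            return default
    return current


# Hand-written sample that mimics the shape of an API response (values are made up)
sample = {
    "results": [
        {"name": {"first": "Ada", "last": "Lovelace"}, "location": {"city": "London"}}
    ]
}

print(get_nested(sample, ["results", 0, "name", "first"]))
print(get_nested(sample, ["results", 0, "dob", "age"], default="unknown"))
```

The trade-off is that silent defaults can hide schema changes, so strict indexing may still be preferable for fields the pipeline cannot work without.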

In [10]:
# Define function to fetch user data from the API
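def fetch_user_from_api(url: str) -> dict:

    # Send GET request to the specified URL; the timeout (10 seconds, an arbitrary choice)
    # keeps a stalled connection from hanging the whole batch
    r = requests.get(url=url, timeout=10)

    # Raise an exception on HTTP error responses (4xx/5xx) instead of parsing an error body
    r.raise_for_status()

    # Parse the response as JSON
    user_json = r.json()

    # Process the JSON data into a structured format
    parsed_user = parse_json(user_json)

    # Return the parsed user data
    return parsed_user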
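
A single failed request stops the whole batch, so a retry wrapper around the fetch call may be useful for long runs. The following is a minimal sketch of retrying with exponential backoff; `call_with_retries` and `flaky_fetch` are hypothetical names, and the flaky function only simulates a network call so the example runs offline.

```python
import time


def call_with_retries(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on an exception, wait and try again, up to `attempts` calls in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Give up after the final attempt and re-raise the last error
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky network call: fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"firstname": "Ada"}


# Record the delays instead of actually sleeping, to keep the demo fast
recorded_delays = []
result = call_with_retries(flaky_fetch, sleep=recorded_delays.append)
print(result, recorded_delays)
```

In the notebook, the wrapped call could be something like `lambda: fetch_user_from_api(url=RANDOMUSER_API_URL)`. Catching every `Exception` is a simplification; narrowing it to network-related errors would be a reasonable refinement.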

In [11]:
# Define function to load batch user data from the API and save to file
def load_batch_data(result_path: str, n_users: int):

    # Print start info
    print(f"Collecting data from {RANDOMUSER_API_URL}; n_users = {n_users}")

    # Initialize empty list to store user data
    users = []

    # Loop to fetch specified number of users, with progress bar display
    for _ in tqdm(range(n_users), desc="Fetching users from API..."):

        # Fetch individual user data from API and append to users list
        user = fetch_user_from_api(url=RANDOMUSER_API_URL)
        users.append(user)

    # Save the collected user data to a JSON file
    batch_data = {"n_users": n_users, "users": users}

    print("Saving users to file ", result_path)
    with open(result_path, "w", encoding="utf-8") as file:
        json.dump(batch_data, file, indent=2, ensure_ascii=False)

    # Confirm job completion
    print("JOB DONE")
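
Once a batch file has been written, it can be read back and sanity-checked before the next pipeline stage. The sketch below assumes the file layout produced above (`n_users` plus a `users` list); `read_batch` and the two-user sample are hypothetical and exist only for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def read_batch(path) -> dict:
    """Load a batch file and check that n_users matches the stored users."""
    with open(path, encoding="utf-8") as file:
        batch = json.load(file)
    if batch["n_users"] != len(batch["users"]):
        raise ValueError("n_users does not match the number of stored users")
    return batch


# Hypothetical two-user batch written to a temporary directory for the demo
sample_batch = {"n_users": 2, "users": [{"firstname": "Zoë"}, {"firstname": "Łukasz"}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_batch.json"
    with open(sample_path, "w", encoding="utf-8") as file:
        json.dump(sample_batch, file, indent=2, ensure_ascii=False)
    loaded = read_batch(sample_path)

print(loaded["n_users"], [user["firstname"] for user in loaded["users"]])
```

Opening the file with an explicit `encoding="utf-8"` matters here because `ensure_ascii=False` writes non-ASCII names as-is.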


In [12]:
# Define command-line interface for loading batch user data
@click.command()
@click.option('--result_path', type=str, required=True, help='Path to save loaded batch of users')
@click.option('--n_users', type=int, required=True, help='How many users to fetch from API')
def load_batch_cli(result_path: str, n_users: int):

    # Call the function to load batch data with provided CLI arguments
    load_batch_data(result_path=result_path, n_users=n_users)


**How to use (CLI)**

`python3 extract.py --result_path batch100users.json --n_users 100`

`python3 extract.py --result_path batch15users.json --n_users 15`
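
The command lines above map each option to a function argument. To illustrate that mapping without a shell, the sketch below parses the same two options from an explicit argument list. It uses the standard library's `argparse` as a stand-in for `click` (a swapped-in technique, not the notebook's implementation), and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same two options as the click command."""
    parser = argparse.ArgumentParser(description="Load a batch of users from the API")
    parser.add_argument("--result_path", type=str, required=True,
                        help="Path to save loaded batch of users")
    parser.add_argument("--n_users", type=int, required=True,
                        help="How many users to fetch from API")
    return parser


# Passing a list explicitly avoids reading sys.argv, which is handy inside notebooks
args = build_parser().parse_args(["--result_path", "batch15users.json", "--n_users", "15"])
print(args.result_path, args.n_users)
```

Note that `--n_users` arrives as an `int` thanks to `type=int`, mirroring the `type=int` declared in the click option.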

In [13]:
# load_batch_cli()  # Uncomment to enable CLI functionality when running as a standalone Python script

In [14]:
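# Run the batch load with hard-coded arguments when the script is executed directly.
# This bypasses command-line parsing, which is convenient in notebooks such as Colab.
if __name__ == "__main__":
    import sys

    # Drop any extra arguments injected by the notebook kernel
    sys.argv = sys.argv[:1]

    path = 'output.json'
    num = 1000

    # Invoke the click command's callback directly inside a click context
    with click.Context(load_batch_cli) as ctx:
        ctx.invoke(load_batch_cli, result_path=path, n_users=num)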

Collecting data from https://randomuser.me/api/; n_users = 1000


Fetching users from API...: 100%|██████████| 1000/1000 [03:44<00:00,  4.46it/s]

Saving users to file  output.json
JOB DONE




