# Extract step of an ETL (Extract-Transform-Load) pipeline
[Link to GitHub](https://github.com/stanislavlia/datascience_club_projects/blob/main/project1_etl_pipeline/extract.py)

Use a `venv` (virtual environment) to give each Python project its own isolated set of dependencies. This provides:

1. **Dependency Management**: Each project can pin its own packages and versions without interfering with other projects.
2. **Environment Consistency**: Because dependencies are managed separately from the system Python, the code behaves the same wherever the same environment is recreated.
3. **Easy Cleanup**: When a project is no longer needed, deleting its `venv` directory removes all of its dependencies without affecting other projects.

In short, `venv` keeps the environments of different Python projects clean and separate.

[Documentation](https://docs.python.org/3/library/venv.html)
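A quick way to confirm that code is really running inside a virtual environment is to compare `sys.prefix` with `sys.base_prefix`; the two differ only inside a `venv`. The helper name `in_virtualenv` is illustrative and not part of the project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If the first line prints `False`, packages installed with `pip` go into the base installation rather than into the project environment.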

In [15]:
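# Install the Python3 virtual environment package.
# Note: `apt` is the Debian/Ubuntu package manager; on other systems (the output
# below appears to come from macOS) this command fails, and `venv` normally ships with Python.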
!apt install python3-venv

The operation couldn’t be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com for information on installing Java.



In [16]:
# Create a virtual environment named 'etl_venv'
!python3 -m venv etl_venv

In [17]:
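# Activate the virtual environment 'etl_venv'
# Note: each `!` command runs in its own subshell, so this activation does not
# persist to later cells; in a terminal, run `source etl_venv/bin/activate` directly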
!source etl_venv/bin/activate
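In a notebook, every `!` line is handed to a separate shell process, so environment changes made by one command (such as sourcing an `activate` script) do not carry over to the kernel or to later cells. The sketch below shows the same effect with a child Python process; `ETL_DEMO_VAR` is a made-up variable name used only for this demonstration.

```python
import os
import subprocess
import sys

# The child process changes its own environment...
child_code = "import os; os.environ['ETL_DEMO_VAR'] = '1'"
subprocess.run([sys.executable, "-c", child_code], check=True)

# ...but the parent process (here, the notebook kernel) never sees the change.
leaked_value = os.environ.get("ETL_DEMO_VAR")
print("value seen by the parent:", leaked_value)
```

To work inside a specific environment from a notebook, the usual route is to start Jupyter from the activated environment or to register that environment as a kernel.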

**pip** is Python's package installer; it lets you download, install, and manage libraries and their dependencies.

[Documentation](https://pip.pypa.io/en/stable/)
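Installed versions can also be checked from Python itself with the standard-library `importlib.metadata` module, which reads the same metadata that `pip show` prints. This is a minimal sketch; the helper name `installed_version` is an assumption made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this notebook installs later on
for name in ("requests", "tqdm", "click"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```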

In [18]:
# List all installed packages in the current environment
!pip3 list

Package            Version
------------------ -----------
appnope            0.1.4
asttokens          2.4.1
certifi            2024.8.30
charset-normalizer 3.4.0
click              8.1.7
comm               0.2.2
country-converter  1.2
debugpy            1.8.7
decorator          5.1.1
exceptiongroup     1.2.2
executing          2.1.0
idna               3.10
ipykernel          6.29.5
ipython            8.29.0
jedi               0.19.1
jupyter_client     8.6.3
jupyter_core       5.7.2
matplotlib-inline  0.1.7
nest-asyncio       1.6.0
numpy              2.1.3
packaging          24.1
pandas             2.2.3
parso              0.8.4
pexpect            4.9.0
phonenumbers       8.13.49
pip                24.3.1
platformdirs       4.3.6
prettyprint        0.1.5
prompt_toolkit     3.0.48
psutil             6.1.0
ptyprocess         0.7.0
pure_eval          0.2.3
Pygments           2.18.0
python-dateutil    2.9.0.post0
pytz               2024.2
pyzmq              26.2.0
requests           2.32.3


In [19]:
# Upgrade pip to the latest version
!pip3 install --upgrade pip

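# Install the third-party modules
# (`pprint`, imported below, is part of the standard library; the `prettyprint` package is a separate project)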
!pip3 install requests
!pip3 install prettyprint
!pip3 install tqdm
!pip3 install click



In [20]:
# Show details of the packages
!pip3 show prettyprint
!echo ""
!pip3 show requests
!echo ""
!pip3 show tqdm
!echo ""
!pip3 show click

Name: prettyprint
Version: 0.1.5
Summary: prettyprint print list/dict/tuple object prettily
Home-page: http://github.com/taichino/prettyprint
Author: Matsumoto Taichi
Author-email: taichino@gmail.com
License: MIT License
Location: /Users/yuran653/42_ds_club/.venv/lib/python3.10/site-packages
Requires: 
Required-by: 

Name: requests
Version: 2.32.3
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache-2.0
Location: /Users/yuran653/42_ds_club/.venv/lib/python3.10/site-packages
Requires: certifi, charset-normalizer, idna, urllib3
Required-by: 

Name: tqdm
Version: 4.66.6
Summary: Fast, Extensible Progress Meter
Home-page: https://tqdm.github.io
Author: 
Author-email: 
License: MPL-2.0 AND MIT
Location: /Users/yuran653/42_ds_club/.venv/lib/python3.10/site-packages
Requires: 
Required-by: 

Name: click
Version: 8.1.7
Summary: Composable command line interface toolkit
Home-page: https://palletsproje

**Used Python Modules:**

[Requests: HTTP for Humans](https://requests.readthedocs.io/)

[pprint — Data pretty printer](https://docs.python.org/3/library/pprint.html)

[json — JSON encoder and decoder](https://docs.python.org/3/library/json.html)

[tqdm.std - Customisable progressbar decorator for iterators](https://tqdm.github.io/docs/tqdm/)

[datetime — Basic date and time types](https://docs.python.org/3/library/datetime.html)

[Click - package for creating CLI](https://click.palletsprojects.com/)

[Click and Python: Build Extensible and Composable CLI Apps](https://realpython.com/python-click/)

In [21]:
# Import the libraries
import requests
from pprint import pprint
import json
from tqdm import tqdm
from datetime import datetime
import click

**HTTP** (Hypertext Transfer Protocol) is the foundational protocol for transferring data on the Web. It is an application-layer protocol in the TCP/IP suite: a client sends a request, and a server answers with a response.

[What is HTTP?](https://www.w3schools.com/whatis/whatis_http.asp)

[HTTP Request Components](https://www.helloapi.co/blog/what-is-http-request/#http-request-components)

[HTTP Response Components](https://brightdata.jp/glossary/http-response)
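The request and response components described in the links above can be observed without leaving the machine. The sketch below starts a throwaway local server with the standard library and sends it one GET request; the handler `EchoHandler`, the function `demo_round_trip`, and the path `/api/?results=1` are all illustrative names, not part of the project.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Answer every GET with a JSON body describing the request it received."""

    def do_GET(self):
        body = json.dumps({"method": self.command, "path": self.path}).encode("utf-8")
        self.send_response(200)                              # status line
        self.send_header("Content-Type", "application/json")  # response headers
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                               # response body

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def demo_round_trip() -> dict:
    # Port 0 asks the operating system for any free port
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        # Request = method + path (with query string) + headers
        conn.request("GET", "/api/?results=1", headers={"Accept": "application/json"})
        response = conn.getresponse()
        result = {
            "status": response.status,
            "content_type": response.getheader("Content-Type"),
            "body": json.loads(response.read().decode("utf-8")),
        }
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


round_trip = demo_round_trip()
print(round_trip)
```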

**API** (Application Programming Interface) is a set of rules and protocols through which software applications communicate. It defines how developers request and exchange information, so that one application can use the functionality or data of another component or service.

[What is API?](https://www.developerupdates.com/blog/what-is-api-learn-about-api-in-5-minutes)

In [22]:
# Set up global variable for API URL
RANDOMUSER_API_URL = "https://randomuser.me/api/"
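Query parameters can be appended to the base URL with `urllib.parse.urlencode`. The sketch below assumes that the API accepts a `results` parameter for requesting several users in one call; treat that parameter name as an assumption and check it against the current randomuser.me documentation. `build_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

RANDOMUSER_API_URL = "https://randomuser.me/api/"


def build_url(base_url: str, **params) -> str:
    """Append URL-encoded query parameters to `base_url`."""
    if not params:
        return base_url
    return f"{base_url}?{urlencode(params)}"


# `results` is assumed to be the batch-size parameter of the API
print(build_url(RANDOMUSER_API_URL, results=5))
```

If batching is used, the parsing code would need to iterate over every entry of `"results"` instead of reading only the first one.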

In [23]:
# Define function to parse JSON data into a structured dictionary
def parse_json(user_json: dict) -> dict:
    # The API wraps each user in a "results" list; look up the first entry once
    user = user_json["results"][0]
    location = user["location"]
    login = user["login"]

    # Extract ID (`user_id` avoids shadowing the built-in `id`)
    user_id = user["id"]["name"] + " " + str(user["id"]["value"])

    # Extract name details
    first_name = user["name"]["first"]
    last_name = user["name"]["last"]

    # Extract location details
    location_city = location["city"]
    location_country = location["country"]
    location_latitude = location["coordinates"]["latitude"]
    location_longitude = location["coordinates"]["longitude"]
    location_postcode = location["postcode"]
    location_state = location["state"]
    location_street_info = f"{location['street']['name']}, {location['street']['number']}"

    # Extract other fields
    email = user.get("email")
    gender = user.get("gender")

    # Extract login details
    login_uuid = login.get("uuid")
    login_username = login.get("username")
    login_password = login.get("password")

    # Extract contact details
    phone = user.get("phone")
    cell = user.get("cell")

    # Extract date of birth and registration details
    date_of_birth = user["dob"].get("date")
    age = user["dob"].get("age")
    date_of_registration = user["registered"].get("date")

    # Extract picture link
    photo_link = user["picture"].get("large")

    # Capture the extract date and time
    extract_time = str(datetime.now())

    # Return structured dictionary with extracted information
    return {
        "id": user_id,
        "firstname": first_name,
        "lastname": last_name,
        "location_city": location_city,
        "location_country": location_country,
        "location_state": location_state,
        "location_latitude": location_latitude,
        "location_longitude": location_longitude,
        "location_postcode": location_postcode,
        "location_street_info": location_street_info,
        "email": email,
        "gender": gender,
        "login_uuid": login_uuid,
        "login_username": login_username,
        "login_password": login_password,
        "phone": phone,
        "cell": cell,
        "date_of_birth": date_of_birth,
        "age": age,
        "date_of_registration": date_of_registration,
        "photo_link": photo_link,
        "extract_time": extract_time,
    }
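`parse_json` mixes direct indexing, which raises `KeyError` when a field is missing, with `.get()`, which returns `None`. One way to make that choice explicit is a small lookup helper for nested data. This is only a sketch: `dig` is a hypothetical helper, and `sample` is a hand-written, heavily trimmed payload rather than real API output.

```python
from typing import Any


def dig(data: Any, *keys: Any, default: Any = None) -> Any:
    """Follow `keys` through nested dicts/lists; return `default` if any step is missing."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hypothetical payload in the shape that parse_json expects
sample = {"results": [{"name": {"first": "Ada", "last": "Lovelace"}, "location": {"city": "London"}}]}

print(dig(sample, "results", 0, "name", "first"))                        # present field
print(dig(sample, "results", 0, "location", "coordinates", "latitude"))  # missing branch
print(dig(sample, "results", 0, "login", "uuid", default="unknown"))     # missing, custom default
```

Whether a missing field should stop the run or become `None` depends on how the later transform step treats gaps, so the strict indexing in the original may well be intentional.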

In [24]:
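# Define function to fetch user data from the API
def fetch_user_from_api(url: str) -> dict:

    # Send GET request to the specified URL; the timeout (10 seconds, an arbitrary choice)
    # keeps a stalled connection from hanging the whole batch
    r = requests.get(url=url, timeout=10)

    # Raise an exception on 4xx/5xx responses instead of parsing an error body
    r.raise_for_status()

    # Parse the response as JSON
    user_json = r.json()

    # Process the JSON data into a structured format
    parsed_user = parse_json(user_json)

    # Return the parsed user data
    return parsed_user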
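A long batch makes many network calls, and a single transient failure would otherwise abort the whole run. The sketch below shows one possible retry wrapper with exponential backoff; it is not part of the original script, and `call_with_retries`, `flaky_fetch`, and the delay values are assumptions chosen for illustration. The `sleep` parameter exists only so the demo can run without real pauses.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func` up to `attempts` times, doubling the pause after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Give up and re-raise once the last attempt has failed
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo: a stand-in for the API call that fails twice, then succeeds
calls = {"count": 0}
pauses = []


def flaky_fetch() -> dict:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return {"id": "demo"}


result = call_with_retries(flaky_fetch, attempts=3, base_delay=0.5, sleep=pauses.append)
print(result, calls["count"], pauses)
```

In the pipeline this could wrap the real call, for example `call_with_retries(lambda: fetch_user_from_api(RANDOMUSER_API_URL))`. In practice it is safer to catch only network-related exceptions rather than every `Exception`.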

In [25]:
# Define function to load batch user data from the API and save to file
def load_batch_data(result_path: str, n_users: int):

    # Print start info
    print(f"Collecting data from {RANDOMUSER_API_URL}; n_users = {n_users}")

    # Initialize empty list to store user data
    users = []

    # Loop to fetch specified number of users, with progress bar display
    for _ in tqdm(range(n_users), desc="Fetching users from API..."):

        # Fetch individual user data from API and append to users list
        user = fetch_user_from_api(url=RANDOMUSER_API_URL)
        users.append(user)

    # Save the collected user data to a JSON file
    batch_data = {"n_users": n_users, "users": users}

    print("Saving users to file ", result_path)
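    # ensure_ascii=False writes non-ASCII characters verbatim, so set the file encoding explicitly
    with open(result_path, "w", encoding="utf-8") as file:
        json.dump(batch_data, file, indent=2, ensure_ascii=False)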

    # Confirm job completion
    print("JOB IS DONE")
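After a batch is written, it can be worth reading the file back and checking that the stored `n_users` matches the number of user records. The sketch below does this on a tiny hand-made batch in the same `{"n_users", "users"}` layout; `validate_batch_file` and the demo records are hypothetical.

```python
import json
import os
import tempfile


def validate_batch_file(path: str) -> int:
    """Load a batch file and check that `n_users` matches the number of stored users."""
    with open(path, "r", encoding="utf-8") as file:
        batch = json.load(file)
    if batch["n_users"] != len(batch["users"]):
        raise ValueError(f"expected {batch['n_users']} users, found {len(batch['users'])}")
    return len(batch["users"])


# Write a small hypothetical batch with non-ASCII names, then read it back
demo_batch = {"n_users": 2, "users": [{"firstname": "Zoé"}, {"firstname": "Łukasz"}]}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_batch.json")
    with open(demo_path, "w", encoding="utf-8") as file:
        json.dump(demo_batch, file, indent=2, ensure_ascii=False)
    validated_count = validate_batch_file(demo_path)

print("users in file:", validated_count)
```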


In [26]:
# Define command-line interface for loading batch user data
@click.command()
@click.option('--result_path', type=str, required=True, help='Path to save loaded batch of users')
@click.option('--n_users', type=int, required=True, help='How many users to fetch from API')
def load_batch_cli(result_path: str, n_users: int):

    # Call the function to load batch data with provided CLI arguments
    load_batch_data(result_path=result_path, n_users=n_users)


**How to use (CLI)**

`python3 extract.py --result_path batch100users.json --n_users 100`

`python3 extract.py --result_path batch15users.json --n_users 15`
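The same command-line calls can be exercised without a terminal by using `click.testing.CliRunner`, which invokes a command with a list of arguments and captures its output and exit code. The command `demo_cli` below is a hypothetical stand-in that only echoes its arguments, so no API requests are made.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--result_path", type=str, required=True, help="Path to save the batch")
@click.option("--n_users", type=int, required=True, help="How many users to fetch")
def demo_cli(result_path: str, n_users: int) -> None:
    # Stand-in body: report what would happen instead of calling the API
    click.echo(f"would fetch {n_users} users into {result_path}")


runner = CliRunner()

# Equivalent to: python3 extract.py --result_path batch15users.json --n_users 15
result = runner.invoke(demo_cli, ["--result_path", "batch15users.json", "--n_users", "15"])
print(result.exit_code, result.output.strip())

# Omitting a required option makes click exit with a usage error
missing = runner.invoke(demo_cli, [])
print(missing.exit_code)
```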

In [27]:
# load_batch_cli()  # Uncomment to enable CLI functionality when running as a standalone Python script

In [28]:
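# Run the batch job when the module is executed as `__main__`, which is also the case inside a notebook (for instance in Colab)
if __name__ == "__main__":
    import sys

    # Drop any extra arguments passed in by the notebook kernel
    sys.argv = sys.argv[:1]

    path = 'batch1000users.json'
    num = 1000

    # Invoke the command's callback directly with explicit arguments, bypassing command-line parsing
    with click.Context(load_batch_cli) as ctx:
        ctx.invoke(load_batch_cli, result_path=path, n_users=num)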

Collecting data from https://randomuser.me/api/; n_users = 1000


Fetching users from API...: 100%|██████████| 1000/1000 [10:04<00:00,  1.66it/s]

Saving users to file  batch1000users.json
JOB IS DONE




