# Database Connection and Initial Data Load  

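This notebook is part of the data analysis workflow. Its objective is to connect to the PostgreSQL database, create a table for the raw data, and load the source CSV file into it for further processing.

To begin, the required libraries are imported: `psycopg2` for the PostgreSQL connection, `json` for reading the credentials file, and `os` for working with paths.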

In [1]:
import psycopg2
import json
import os

## Connection Configuration

The credentials are read from a JSON file, and the extracted values are then used to open a connection to the PostgreSQL database.

In [2]:
with open("/home/user/leucemia/Leukemia-Cancer-Risk-ETL/notebooks/credentials.json", "r", encoding="utf-8") as file:
    credentials = json.load(file)
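
The absolute path in the cell above only resolves on the original machine. A minimal sketch of a more portable loader is shown here; `load_credentials` and `REQUIRED_KEYS` are hypothetical names that do not appear in the original notebook, and the key list is simply the set of fields the next cell reads.

```python
import json
from pathlib import Path

# Keys the notebook reads from the credentials file (assumed to be the full set).
REQUIRED_KEYS = ("db_host", "db_name", "db_user", "db_password", "db_port")


def load_credentials(path):
    """Read a JSON credentials file and fail early if a key is missing."""
    with Path(path).open("r", encoding="utf-8") as file:
        credentials = json.load(file)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"missing credential keys: {', '.join(missing)}")
    return credentials
```

Assuming the notebook is started from the `notebooks` directory, a call such as `load_credentials(Path.cwd() / "credentials.json")` would replace the hard-coded path.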

In [3]:
db_host = credentials["db_host"]
db_name = credentials["db_name"]
db_user = credentials["db_user"]
db_password = credentials["db_password"]
db_port = credentials["db_port"] 

conn = psycopg2.connect(
    host=db_host,
    dbname=db_name,
    user=db_user,
    password=db_password,
    port=db_port
)
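
The mapping from credential fields to connection parameters can also be kept in one place. The sketch below is one possible way to do that; `connection_kwargs` is a hypothetical helper, and converting the port to `int` is an assumption about how the JSON file stores it.

```python
def connection_kwargs(credentials):
    """Map the credentials file fields to psycopg2.connect keyword arguments."""
    return {
        "host": credentials["db_host"],
        "dbname": credentials["db_name"],
        "user": credentials["db_user"],
        "password": credentials["db_password"],
        # Assumption: the port may be stored as a string or a number.
        "port": int(credentials["db_port"]),
    }
```

With this helper, the connection would be opened as `conn = psycopg2.connect(**connection_kwargs(credentials))`.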

A cursor is created to execute SQL commands, and a table is then created in the database to store the raw data. In the run recorded below, the statement fails with `DuplicateTable` because the table already exists.

In [4]:
cur = conn.cursor()

cur.execute("""CREATE TABLE leukemia_raw_data (
            patient_id SERIAL PRIMARY KEY,
            age INTEGER,
            gender TEXT,
            country TEXT,
            wbc_count INTEGER,
            rbc_count NUMERIC(5,2),
            platelet_count INTEGER,
            hemoglobin_level NUMERIC(5,2),
            bone_marrow_blasts INTEGER,
            genetic_mutation TEXT,
            family_history TEXT,
            smoking_status TEXT,
            alcohol_consumption TEXT,
            radiation_exposure TEXT,
            infection_history TEXT,
            bmi NUMERIC(5,2),
            chronic_illness TEXT,
            immune_disorders TEXT,
            ethnicity TEXT,
            socioeconomic_status TEXT,
            urban_rural TEXT,
            leukemia_status TEXT
        );
""")
conn.commit()

DuplicateTable: relation "leukemia_raw_data" already exists
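
The error above appears whenever the cell is re-run against a database that already holds the table. One way to make the step safe to repeat is `CREATE TABLE IF NOT EXISTS`, sketched below with the same column definitions. `ensure_table` and `CREATE_TABLE_SQL` are hypothetical names, and `conn` is assumed to be an open psycopg2 connection. Note that `IF NOT EXISTS` only checks the table name; it does not verify that an existing table has these columns.

```python
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS leukemia_raw_data (
    patient_id SERIAL PRIMARY KEY,
    age INTEGER,
    gender TEXT,
    country TEXT,
    wbc_count INTEGER,
    rbc_count NUMERIC(5,2),
    platelet_count INTEGER,
    hemoglobin_level NUMERIC(5,2),
    bone_marrow_blasts INTEGER,
    genetic_mutation TEXT,
    family_history TEXT,
    smoking_status TEXT,
    alcohol_consumption TEXT,
    radiation_exposure TEXT,
    infection_history TEXT,
    bmi NUMERIC(5,2),
    chronic_illness TEXT,
    immune_disorders TEXT,
    ethnicity TEXT,
    socioeconomic_status TEXT,
    urban_rural TEXT,
    leukemia_status TEXT
);
"""


def ensure_table(conn):
    """Create the raw-data table unless it already exists, then commit."""
    cur = conn.cursor()
    try:
        cur.execute(CREATE_TABLE_SQL)
        conn.commit()
    except Exception:
        # Leave the connection usable if the statement fails for another reason.
        conn.rollback()
        raise
    finally:
        cur.close()
```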


The working directory is changed to the repository root so that the dataset can be located under the `data` folder.

In [5]:
os.chdir("..")
print(os.getcwd())

/home/user/leucemia/Leukemia-Cancer-Risk-ETL
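
Changing the working directory affects every later cell and moves up one more level each time the cell is re-run. As an alternative, the dataset path could be built without `os.chdir`, as in the sketch below; `dataset_path` is a hypothetical helper, and the repository layout is assumed to match the output above.

```python
from pathlib import Path


def dataset_path(repo_root):
    """Return the CSV location under <repo_root>/data without changing the cwd."""
    return Path(repo_root) / "data" / "biased_leukemia_dataset.csv"
```

Assuming the notebook runs from the `notebooks` directory, `dataset_path(Path.cwd().parent)` yields the same file as the original cells.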


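The failed `CREATE TABLE` statement left the transaction in an aborted state, so it is rolled back before further commands are sent on this connection.

In [6]:
conn.rollback()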

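Finally, the CSV file path is built and the data is loaded into the previously created table with `COPY ... FROM stdin`. In the run recorded below, the load fails with `UniqueViolation` because the table already contains a row with the same `patient_id`.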

In [7]:
csv_file_path = os.path.join(os.getcwd(), "data", "biased_leukemia_dataset.csv")

table_name = 'leukemia_raw_data'

copy_sql = f"""
           COPY {table_name} FROM stdin 
           DELIMITER as ','
           CSV HEADER
           """
with open(csv_file_path, 'r') as f:
    cur.copy_expert(sql=copy_sql, file=f)

conn.commit()

conn.close()

UniqueViolation: duplicate key value violates unique constraint "leukemia_raw_data_pkey"
DETAIL:  Key (patient_id)=(1) already exists.
CONTEXT:  COPY leukemia_raw_data, line 2
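
The violation above shows that the table already contained the rows being loaded. If the CSV file is meant to be the only source for this table, one option is to empty the table and reload it inside a single transaction, as sketched below. This is an assumption about the intended workflow, since `TRUNCATE` discards all existing rows. `reload_raw_data` and `COPY_SQL` are hypothetical names, `conn` is assumed to be an open psycopg2 connection, and UTF-8 is an assumed file encoding.

```python
COPY_SQL = """
COPY leukemia_raw_data FROM STDIN
WITH (FORMAT csv, HEADER true)
"""


def reload_raw_data(conn, csv_path):
    """Empty the table and load the CSV again, all in one transaction."""
    cur = conn.cursor()
    try:
        # Assumption: existing rows may be discarded before the reload.
        cur.execute("TRUNCATE TABLE leukemia_raw_data")
        with open(csv_path, "r", encoding="utf-8", newline="") as f:
            cur.copy_expert(COPY_SQL, f)
        conn.commit()
    except Exception:
        # Undo the TRUNCATE as well if the load fails.
        conn.rollback()
        raise
    finally:
        cur.close()
```

Because the truncate and the copy share one transaction, a failed load leaves the previous contents of the table in place.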
