# Importing the Customer Data (ETL_RecordStartDate)

Reads every customer file into the CUSTOMER table.
The daily data files are first written to a temporary table.
If a record changes (different master data), a new record is created in the main table and the old record is given an end date.
If a customer no longer appears in a data file, the record is given an end date as well.
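
The rules above can be sketched in plain Python before any database is involved. This is only an illustration under assumptions: the function name `apply_snapshot`, the dictionary layout and the two tiny snapshots are invented for this example and are not the contents of the real text files. The sentinel end date `9999-12-31` and the "file date minus one day" end date follow the SQL used later in this notebook.

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)  # sentinel end date of the currently valid version

def apply_snapshot(history, snapshot, filedate):
    """Apply one daily snapshot (hypothetical helper) to the history list.

    history:  list of dicts with the keys id, start, end, data
    snapshot: dict mapping customer id -> tuple of master data for that day
    """
    day_before = filedate - timedelta(days=1)
    current = {r["id"]: r for r in history if r["end"] == OPEN_END}
    # Close the current version if the customer is missing or the data changed.
    for cid, rec in current.items():
        if snapshot.get(cid) != rec["data"]:
            rec["end"] = day_before
    # Open a new version for customers that are new or were just closed.
    for cid, data in snapshot.items():
        rec = current.get(cid)
        if rec is None or rec["end"] != OPEN_END:
            history.append({"id": cid, "start": filedate, "end": OPEN_END, "data": data})
    return history

# Invented example data: customer 1 moves, customer 3 disappears.
history = []
apply_snapshot(history, {1: ("Fritz", "Baden"), 3: ("Werner", "Wr. Neustadt")}, date(2017, 2, 1))
apply_snapshot(history, {1: ("Fritz", "Wien")}, date(2017, 2, 2))
for rec in history:
    print(rec["id"], rec["start"], rec["end"], rec["data"])
```

After the second snapshot both original versions are closed on 2017-02-01, and customer 1 has a new open version starting on 2017-02-02.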

A running SQL Server container is required:

*docker run -d -p 1433:1433  --name sqlserver2019 -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=SqlServer2019" mcr.microsoft.com/azure-sql-edge*

In addition, the customer text files from *Code_ETL_RecordStartDate.zip* must be located in this folder.

Source: http://www.griesmayer.com/?menu=Business%20Intelligence&semester=Semsester_6&topic=02_ETL_RecordStartDate

![](textfiles_2301.png)

In [1]:
from pathlib import Path
import sqlalchemy, re
import pandas as pd

database="Sales"
connection_url = sqlalchemy.engine.URL.create("mssql+pyodbc", username="sa",
    password="SqlServer2019", host="localhost", database=database,
    query={ "driver": "ODBC Driver 18 for SQL Server" })
# We cannot connect to Sales to create it (the database does not exist yet), so we connect to tempdb.
# Autocommit is necessary for create database and ddl statements.
tempdb_engine = sqlalchemy.create_engine(
    connection_url.set(database="tempdb"), isolation_level="AUTOCOMMIT", 
    connect_args={"TrustServerCertificate": "yes"})
# We drop the database just before connecting, so we set pool_pre_ping=True
engine = sqlalchemy.create_engine(
    connection_url, fast_executemany=True, pool_pre_ping=True,
    connect_args={"TrustServerCertificate": "yes"})


First we drop the database and create it again.
This is for testing only, of course; outside of a test setup, dropping the database is not a good idea...

In [2]:
with tempdb_engine.connect() as conn: 
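    # Kick out other sessions first. This fails if the database does not exist yet, which is fine.
    try: conn.execute(sqlalchemy.text(f"ALTER DATABASE {database} SET SINGLE_USER WITH ROLLBACK IMMEDIATE"))
    except sqlalchemy.exc.SQLAlchemyError: pass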
    conn.execute(sqlalchemy.text(f"DROP DATABASE IF EXISTS {database}"))
    conn.execute(sqlalchemy.text(f"CREATE DATABASE {database}"))
with engine.connect() as conn:
    conn.execution_options(isolation_level="AUTOCOMMIT")
    conn.execute(sqlalchemy.text("""
        CREATE TABLE CUSTOMER
        (
            CustomerID           INTEGER,
            RecordStartDate      DATE,
            RecordEndDate        DATE,
            FirstName            VARCHAR(30),
            LastName             VARCHAR(30),
            ZIP                  CHAR(4),
            City                 VARCHAR(30),
            PRIMARY KEY (CustomerID, RecordStartDate)
        )  
    """))

Now all files whose names start with *Customer_* are read.

In [3]:
def read_customers(filename):
    # Extract date of file from filename with regular expression.
    # The dot is escaped and the pattern is anchored, so only names ending in _YYYYMMDD.txt match.
    match = re.search(r"_(?P<date>\d{8})\.txt$", filename)
    if match is None: raise ValueError(f"Invalid filename: {filename}, no date found.")
    filedate = pd.to_datetime(match.group("date"), format="%Y%m%d")
    # Load file into dataframe
    customers = pd.read_csv(filename, sep="\t", encoding='utf-8',
        dtype={"CustomerID": int, "FirstName": "string", "LastName": "string",
                "ZIP": int, "City": "string"})
    customers["RecordStartDate"] = filedate
    return (filedate, customers)
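
The date extraction can be tried on its own, without pandas or any data file. The following is a minimal sketch using only the standard library; `FILE_PATTERN` and `parse_filedate` are names made up for this example, and `datetime.strptime` stands in for `pd.to_datetime`.

```python
import re
from datetime import datetime

# Hypothetical helper names; the pattern expects names ending in _YYYYMMDD.txt.
FILE_PATTERN = re.compile(r"_(?P<date>\d{8})\.txt$")

def parse_filedate(filename):
    """Return the date encoded in a file name like Customer_20170201.txt."""
    match = FILE_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Invalid filename: {filename}, no date found.")
    # strptime also rejects impossible dates such as 20171301.
    return datetime.strptime(match.group("date"), "%Y%m%d").date()

print(parse_filedate("Customer_20170201.txt"))  # 2017-02-01
```

A name such as `Customer_20170201xtxt` is rejected here because the dot is escaped, whereas an unescaped `.` would match any character.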

In [4]:
filenames = sorted(map(str, Path(".").glob("Customer_*.txt")))
for filename in filenames:
    print(f"Import {filename}...")
    filedate, customers_new = read_customers(filename)
    with engine.connect() as conn:
        # Write the content of the new text file to temp table
        customers_new.to_sql("CUSTOMER_TMP", conn, if_exists="replace", index=False)
        conn.commit()
        # Insert missing customers or customers who have changed one of the fields.
        conn.execute(sqlalchemy.text("""
            INSERT INTO CUSTOMER
            SELECT CustomerID, RecordStartDate, '9999-12-31', FirstName, LastName, ZIP, City
            FROM CUSTOMER_TMP tmp
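            WHERE NOT EXISTS (SELECT * FROM CUSTOMER c
                WHERE 
                    -- Compare only with the currently valid version, so that a customer who
                    -- reverts to earlier values or reappears after a gap gets a new record.
                    c.RecordEndDate = '9999-12-31' AND
                    c.CustomerID = tmp.CustomerID AND c.FirstName = tmp.FirstName AND
                    c.LastName = tmp.LastName AND c.ZIP = tmp.ZIP AND c.City = tmp.City)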
        """))
        conn.commit()
        # Update the enddate if we have inserted a second record (case 1) or if the customer is not
        # present in the textfile (case 2).
        conn.execute(sqlalchemy.text("""
            UPDATE c SET c.RecordEndDate = :enddate
            FROM CUSTOMER c
            WHERE c.RecordEndDate = '9999-12-31' AND (
                NOT EXISTS (SELECT * FROM CUSTOMER_TMP tmp WHERE tmp.CustomerID = c.CustomerID) OR
                EXISTS (SELECT * FROM CUSTOMER c2 WHERE c2.CustomerID = c.CustomerID AND c2.RecordStartDate > c.RecordStartDate))
        """), { "enddate": filedate - pd.Timedelta(1, "day") })
        conn.commit()



Import Customer_20170201.txt...
Import Customer_20170202.txt...
Import Customer_20170203.txt...
Import Customer_20170204.txt...


## Checking the Import

Read the CUSTOMER table:

In [5]:
with engine.connect() as conn:
    customer_check = pd.read_sql(sqlalchemy.text("SELECT * FROM CUSTOMER"), conn)
print(f"{len(customer_check)} records read.")
display(customer_check.sort_values(["CustomerID", "RecordStartDate"]))

13 records read.


| | CustomerID | RecordStartDate | RecordEndDate | FirstName | LastName | ZIP | City |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-02-01 | 2017-02-03 | Fritz | Müller | 2500 | Baden |
| 1 | 1 | 2017-02-04 | 9999-12-31 | Fritz | Müller | 1010 | Wien |
| 2 | 2 | 2017-02-01 | 2017-02-02 | Susi | Berger | 1010 | Wien |
| 3 | 2 | 2017-02-03 | 9999-12-31 | Susi | Müller | 1010 | Wien |
| 4 | 3 | 2017-02-01 | 2017-02-02 | Werner | Mayer | 2700 | Wr. Neustadt |
| 5 | 4 | 2017-02-01 | 9999-12-31 | Gudrun | Wastan | 1050 | Wien |
| 6 | 5 | 2017-02-01 | 2017-02-01 | Markus | Merau | 1010 | Wien |
| 7 | 5 | 2017-02-02 | 2017-02-03 | Markus | Merau | 1050 | Wien |
| 8 | 5 | 2017-02-04 | 9999-12-31 | Markus | Merau | 2500 | Baden |
| 9 | 6 | 2017-02-02 | 9999-12-31 | Alexander | Fleisch | 2700 | Wr. Neustadt |
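
A further way to check the result is a point-in-time lookup: for a given day, exactly one version of a customer should be valid. The sketch below is an assumption-laden illustration, not part of the original notebook. It uses an in-memory SQLite database from the standard library instead of SQL Server, stores dates as ISO text, omits the name columns, and introduces the hypothetical helper `customer_as_of`. The three rows are the versions of customer 5 from the output above.

```python
import sqlite3

# The three versions of customer 5, copied from the result table (names omitted).
rows = [
    (5, "2017-02-01", "2017-02-01", "1010", "Wien"),
    (5, "2017-02-02", "2017-02-03", "1050", "Wien"),
    (5, "2017-02-04", "9999-12-31", "2500", "Baden"),
]

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE CUSTOMER (
        CustomerID INTEGER, RecordStartDate TEXT, RecordEndDate TEXT,
        ZIP TEXT, City TEXT,
        PRIMARY KEY (CustomerID, RecordStartDate))
""")
con.executemany("INSERT INTO CUSTOMER VALUES (?, ?, ?, ?, ?)", rows)

def customer_as_of(con, customer_id, day):
    """Return the (ZIP, City) versions valid on the given ISO date (hypothetical helper)."""
    # ISO dates in text form sort chronologically, so BETWEEN works on strings here.
    return con.execute("""
        SELECT ZIP, City FROM CUSTOMER
        WHERE CustomerID = ? AND ? BETWEEN RecordStartDate AND RecordEndDate
    """, (customer_id, day)).fetchall()

print(customer_as_of(con, 5, "2017-02-03"))  # [('1050', 'Wien')]
```

If such a lookup ever returned more than one row for a customer and day, the validity intervals would overlap, which would point to a problem in the end-date update.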
