# Importing the Sales Data

Loads the sales data into the Sales table without Integration Services. A running SQL Server container is required:

```
docker run -d -p 1433:1433  --name sqlserver2019 -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=SqlServer2019" mcr.microsoft.com/azure-sql-edge
```

In addition, the sales text files from *Code_InternetShop.zip* must be located in this folder.

Source: http://www.griesmayer.com/?menu=Business%20Intelligence&semester=Semsester_6&topic=01_ETL_MLoad

In [6]:
from pathlib import Path
import sqlalchemy
import pandas as pd

connection_url = sqlalchemy.engine.URL.create("mssql+pyodbc", username="sa",
    password="SqlServer2019", host="localhost", database="Sales",
    query={ "driver": "ODBC Driver 18 for SQL Server" })
# The Sales database does not exist yet, so we connect to tempdb to create it.
# AUTOCOMMIT is required because CREATE DATABASE cannot run inside a transaction.
tempdb_engine = sqlalchemy.create_engine(
    connection_url.set(database="tempdb"), isolation_level="AUTOCOMMIT", 
    connect_args={"TrustServerCertificate": "yes"})
# We drop the database just before connecting, so we set pool_pre_ping=True
engine = sqlalchemy.create_engine(
    connection_url, fast_executemany=True, pool_pre_ping=True,
    connect_args={"TrustServerCertificate": "yes"})
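
The way the second engine reuses the connection URL can be checked without a running server. The following is a minimal sketch, assuming SQLAlchemy 1.4 or later; the names `base_url` and `tempdb_url` are only illustrative. `URL.set` returns a modified copy, so the original URL still points at Sales.

```python
import sqlalchemy

# Same URL as above; building it does not open a connection or load pyodbc.
base_url = sqlalchemy.engine.URL.create(
    "mssql+pyodbc", username="sa", password="SqlServer2019",
    host="localhost", database="Sales",
    query={"driver": "ODBC Driver 18 for SQL Server"})

# URL objects are immutable: set() returns a new URL and leaves base_url untouched.
tempdb_url = base_url.set(database="tempdb")

print(base_url.database, tempdb_url.database)
# Render the URL for logging without exposing the password.
print(tempdb_url.render_as_string(hide_password=True))
```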


First we drop the database and create it again.
This is for testing only; dropping a database is not something you would do in a production import.

In [7]:
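with tempdb_engine.connect() as conn:
    # Disconnect other sessions so the drop is not blocked.
    # The statement fails if Sales does not exist yet; that error is ignored.
    try:
        conn.execute(sqlalchemy.text("ALTER DATABASE Sales SET SINGLE_USER WITH ROLLBACK IMMEDIATE"))
    except sqlalchemy.exc.DBAPIError:
        pass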
    conn.execute(sqlalchemy.text("DROP DATABASE IF EXISTS Sales"))
    conn.execute(sqlalchemy.text("CREATE DATABASE Sales"))
# Set AUTOCOMMIT before the connection is used so that CREATE TABLE is persisted.
with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
    conn.execute(sqlalchemy.text("""
        CREATE TABLE [SALES]
        (
            CUSTOMERID  VARCHAR(9),
            SALESDATE   DATE,
            SALESTIME   DECIMAL(6),
            CHANNELID   CHAR(1),
            PRODUCTID   INTEGER,
            DEALERID    INTEGER,
            RECIPIENTID VARCHAR(9),
            EXPLORER    VARCHAR(30),
            RETURNED    CHAR(1),
            DURATION    DECIMAL(5),
            CLICKS      DECIMAL(5),
            PIECES      DECIMAL(5),
            DISCOUNT    DECIMAL(3)
        )    
    """))

Next, all files whose names start with *sales_* are read.
The data is inserted into the SALES table.

In [8]:
def read_sales(filename):
    sales = pd.read_csv(filename, sep="\t", encoding='utf-8',
            engine='python',          # support for skipfooter needed
            skipfooter=2,             # EOF and the number of lines are at the end of each text file.
            dtype={"CustomerID": int, "Date": "string", "Time": int, "ChannelID": "string",
                   "ProductID": int, "DealerID": int, "RecipientID": int,
                   "Explorer": "string", "Returned": "string", "Duration": int,
                   "Clicks": int, "Pieces": int, "Discount": int})
    sales["Date"] = pd.to_datetime(sales.Date, format="%Y%m%d")
    return sales
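
To see what `skipfooter=2` and the date conversion do, here is a minimal sketch on an in-memory sample. The sample content is hypothetical: it assumes a tab-separated file with a header row and two trailer lines (an EOF marker and a line count), and it uses only a few of the real columns.

```python
import io
import pandas as pd

# Hypothetical sample in the assumed layout; the exact trailer text is a guess.
sample = (
    "CustomerID\tDate\tTime\tPieces\n"
    "566667\t20170213\t215100\t1\n"
    "349727\t20170213\t214035\t2\n"
    "EOF\n"
    "2\n"
)

sample_sales = pd.read_csv(
    io.StringIO(sample), sep="\t",
    engine="python",   # the C parser does not support skipfooter
    skipfooter=2,      # drop the two trailer lines before type conversion
    dtype={"CustomerID": int, "Date": "string", "Time": int, "Pieces": int})
# Dates arrive as text such as 20170213 and are parsed with an explicit format.
sample_sales["Date"] = pd.to_datetime(sample_sales["Date"], format="%Y%m%d")
print(sample_sales.dtypes)
```

Without `skipfooter`, the trailer lines would reach the integer conversion and the read would fail.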

In [9]:
filenames = sorted(map(str, Path(".").glob("sales_*.txt")))
for filename in filenames:
    sales = read_sales(filename)
    # Column mapping to match table definition.
    sales = sales.rename({"Date": "SALESDATE", "Time": "SALESTIME"}, axis=1)
    with engine.connect() as conn:
        sales.to_sql("SALES", conn, if_exists="append", index=False)    

# Alternative without a loop: read every file into a DataFrame, concatenate
# the frames, and write all rows at once.
# sales = pd.concat(map(read_sales, filenames))
# sales = sales.rename({"Date": "SALESDATE", "Time": "SALESTIME"}, axis=1)
# with engine.connect() as conn:
#     sales.to_sql("SALES", conn, if_exists="append", index=False)
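
The effect of `if_exists="append"` in the loop can be tried without SQL Server. This is a minimal sketch in which an in-memory SQLite database stands in for the server and two small, hypothetical frames stand in for two sales files; it is not the import used above.

```python
import sqlite3
import pandas as pd

# Two hypothetical chunks standing in for two sales_*.txt files.
chunks = [
    pd.DataFrame({"CUSTOMERID": [566667], "PIECES": [1]}),
    pd.DataFrame({"CUSTOMERID": [349727, 743155], "PIECES": [2, 1]}),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for SQL Server here
for chunk in chunks:
    # With "append", rows are added to the table instead of replacing it.
    # Unlike in the notebook, the table is created on the first call here.
    chunk.to_sql("SALES", conn, if_exists="append", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM SALES").fetchone()[0]
conn.close()
print(row_count)
```

Because every call appends, running the import loop twice without recreating the table would store each row twice.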


## Verifying the Import

Display the last 3 sales from the most recently read text file.

In [10]:
display(sales.sort_values(["SALESDATE", "SALESTIME"], ascending=False).head(3))

| | CustomerID | SALESDATE | SALESTIME | ChannelID | ProductID | DealerID | RecipientID | Explorer | Returned | Duration | Clicks | Pieces | Discount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6845 | 566667 | 2017-02-13 | 215100 | I | 8 | 9998 | 566667 | Netscape | Y | 342 | 51 | 1 | 0 |
| 7043 | 349727 | 2017-02-13 | 214035 | I | 7 | 9996 | 349727 | Internet Explorer | Y | 309 | 18 | 2 | 0 |
| 6428 | 743155 | 2017-02-13 | 205306 | I | 5 | 9995 | 743155 | Netscape | Y | 342 | 17 | 1 | 10 |


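Read the data back. The rows shown must match the rows shown above.
The index differs, of course: the table holds the rows of all files and was read from the database without sorting. The DECIMAL columns come back as floats (215100.0 instead of 215100), which does not change the values.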

In [11]:
with engine.connect() as conn:
    sales_check = pd.read_sql(sqlalchemy.text("SELECT * FROM SALES"), conn)
print(f"{len(sales_check)} records read.")
display(sales_check.sort_values(["SALESDATE", "SALESTIME"], ascending=False).head(3))

45130 records read.


| | CUSTOMERID | SALESDATE | SALESTIME | CHANNELID | PRODUCTID | DEALERID | RECIPIENTID | EXPLORER | RETURNED | DURATION | CLICKS | PIECES | DISCOUNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42957 | 566667 | 2017-02-13 | 215100.0 | I | 8 | 9998 | 566667 | Netscape | Y | 342.0 | 51.0 | 1.0 | 0.0 |
| 43155 | 349727 | 2017-02-13 | 214035.0 | I | 7 | 9996 | 349727 | Internet Explorer | Y | 309.0 | 18.0 | 2.0 | 0.0 |
| 42540 | 743155 | 2017-02-13 | 205306.0 | I | 5 | 9995 | 743155 | Netscape | Y | 342.0 | 17.0 | 1.0 | 10.0 |
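
The visual comparison can also be automated. The following is a minimal sketch on two hypothetical miniature frames that mimic the outputs above; the helper `normalize` is not part of the notebook, and it assumes that only column-name case, numeric types, and the index differ between the two frames.

```python
import pandas as pd

# Hypothetical miniature frames: "loaded" mimics the frame read from the text
# file, "read_back" mimics the frame returned by the database query.
loaded = pd.DataFrame(
    {"CustomerID": [566667, 349727], "SALESTIME": [215100, 214035], "Pieces": [1, 2]},
    index=[6845, 7043])
read_back = pd.DataFrame(
    {"CUSTOMERID": ["566667", "349727"], "SALESTIME": [215100.0, 214035.0],
     "PIECES": [1.0, 2.0]},
    index=[42957, 43155])

def normalize(frame):
    """Upper-case the column names, cast every column to int, and drop the index."""
    result = frame.copy()
    result.columns = result.columns.str.upper()
    return result.astype(int).reset_index(drop=True)

# Raises an AssertionError with a readable diff if the frames differ.
pd.testing.assert_frame_equal(normalize(loaded), normalize(read_back))
```

The real frames also contain date and text columns, so only the numeric columns would be cast to `int` there.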
