### Connecting to a Database on Azure

Prerequisites:
- Install the Microsoft ODBC driver for SQL Server: https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver15
- You may also need pyodbc:\
[pip] https://pypi.org/project/pyodbc/ \
[conda] https://anaconda.org/anaconda/pyodbc
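
Before connecting, it can help to confirm that the ODBC driver is visible to Python. The sketch below is one possible check, not part of the original notebook: `sql_server_drivers` is a hypothetical helper name, and the code assumes that `pyodbc.drivers()` lists the drivers registered on the machine. If pyodbc is not installed, the check falls back to an empty list.

```python
def sql_server_drivers(driver_names):
    """Return the driver names that look like ODBC drivers for SQL Server."""
    return [name for name in driver_names if "SQL Server" in name]


try:
    import pyodbc  # optional: only needed for the real check

    installed = pyodbc.drivers()
except ImportError:
    # pyodbc is missing, so no drivers can be listed from Python
    installed = []

# An empty list suggests the driver (or pyodbc itself) still needs installing.
print(sql_server_drivers(installed))
```

The driver name that appears here should match the `Driver={...}` entry used in the connection string.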


In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from datetime import datetime
from tqdm import tqdm
import urllib.parse  # import the submodule explicitly; `import urllib` alone does not guarantee it is loaded

tqdm.pandas()

In [2]:
# Connect to Azure SQL Database.
password = "cro-r5sweDlVay5t=eta"
# Build the ODBC connection string, reusing `password` instead of repeating it inline.
conn_string_odbc = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:smartspace.database.windows.net,1433;"
    "Database=connectionspace;"
    "Uid=stats170-G6;"
    "Pwd=" + password + ";"
    "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
)
params = urllib.parse.quote_plus(conn_string_odbc)
conn_str_formatted = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str_formatted)
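
The connection URL above can also be assembled by a small helper that takes the password from the environment instead of the notebook source. This is a minimal sketch under stated assumptions: `build_engine_url` is a hypothetical helper and `AZURE_SQL_PASSWORD` is a hypothetical environment variable name, neither of which comes from the original notebook. It uses only the standard library and returns the same kind of string that is passed to `create_engine`.

```python
import os
import urllib.parse


def build_engine_url(server, database, user, password,
                     driver="ODBC Driver 18 for SQL Server"):
    """Return a SQLAlchemy URL string wrapping a URL-encoded ODBC connection string."""
    odbc = (
        f"Driver={{{driver}}};"          # renders as Driver={ODBC Driver 18 ...};
        f"Server=tcp:{server},1433;"
        f"Database={database};"
        f"Uid={user};"
        f"Pwd={password};"
        "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
    )
    return "mssql+pyodbc:///?odbc_connect=" + urllib.parse.quote_plus(odbc)


# Hypothetical variable name; falls back to an empty string if unset.
password = os.environ.get("AZURE_SQL_PASSWORD", "")
url = build_engine_url(
    "smartspace.database.windows.net", "connectionspace", "stats170-G6", password
)
```

The resulting `url` could then be handed to `create_engine(url)` exactly as `conn_str_formatted` is above.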

Loading a large dataset from a cloud database may take a few minutes.
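
For long-running loads, one option is to read the result in chunks rather than in a single call. The sketch below illustrates the idea with the `chunksize` parameter of `pd.read_sql`; it is not the approach used in this notebook. `read_sql_in_chunks` is a hypothetical helper, and an in-memory SQLite table stands in for the Azure database so the example runs anywhere.

```python
import sqlite3

import pandas as pd


def read_sql_in_chunks(query, con, chunksize=50_000):
    """Run `query` and assemble the result from chunks of `chunksize` rows."""
    # With chunksize set, read_sql yields DataFrames instead of returning one.
    chunks = pd.read_sql(query, con=con, chunksize=chunksize)
    return pd.concat(list(chunks), ignore_index=True)


# Stand-in database: ten fake rows in an in-memory SQLite table.
demo_con = sqlite3.connect(":memory:")
pd.DataFrame({"userId": range(10)}).to_sql("Processed_Data", demo_con, index=False)

demo_df = read_sql_in_chunks("SELECT userId FROM Processed_Data", demo_con, chunksize=4)
demo_con.close()
print(len(demo_df))  # 10
```

Against the real database, the same helper would be called with `con=engine`; iterating over the chunks in a loop also gives a natural place to report progress.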

In [8]:
# Load the Spring 2018 records.
# This takes about a minute.
df = pd.read_sql('''SELECT [macAddress]
      ,[userId]
      ,[startTimestamp]
      ,[endTimestamp]
      ,[room_location]
FROM [Processed_Data]
WHERE [startTimestamp] >= '2018-03-28'
  AND [startTimestamp] <= '2018-06-15';''',
                  con=engine)
engine.dispose()
len(df)

1578918

In [9]:
df.head(3)

|   | macAddress | userId | startTimestamp | endTimestamp | room_location |
|---|---|---|---|---|---|
| 0 | 00014c815769ff2a99662e77a228abc50f0ab012 | 174218 | 2018-04-04 12:48:52 | 2018-04-04 12:54:21 | 2019 |
| 1 | 00014c815769ff2a99662e77a228abc50f0ab012 | 174218 | 2018-04-04 12:54:21 | 2018-04-04 13:04:21 | 2019 |
| 2 | 00014c815769ff2a99662e77a228abc50f0ab012 | 174218 | 2018-04-04 13:04:21 | 2018-04-04 13:53:18 | 2019 |


In [11]:
df.dtypes

macAddress                object
userId                     int64
startTimestamp    datetime64[ns]
endTimestamp      datetime64[ns]
room_location             object
dtype: object

### Example operation using tqdm

In [12]:
# eliminate rows with location == 'out'
df = df.loc[df["room_location"] != 'out']

In [14]:
# Keep rows whose start date falls within one week (inclusive);
# progress_map shows a tqdm progress bar while the filter runs.
start_date = datetime(2018, 4, 22)  # Sunday
end_date = datetime(2018, 4, 28)  # Saturday
wdf = df.loc[df["startTimestamp"]
             .progress_map(lambda c: start_date.date() <= c.date() <= end_date.date())]
len(wdf)

100%|██████████| 1492039/1492039 [00:02<00:00, 633098.54it/s] 


147909
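
The row-by-row filter above is a convenient way to demonstrate tqdm. As a point of comparison, the same inclusive date filter can be written with vectorized pandas operations. This is a sketch rather than the notebook's method: `filter_by_date_range` is a hypothetical helper, and the small `demo` frame below is made-up data used only to show the boundary behavior.

```python
from datetime import datetime

import pandas as pd


def filter_by_date_range(frame, column, start, end):
    """Keep rows whose timestamp falls on a calendar day in [start, end]."""
    # normalize() truncates each timestamp to midnight, i.e. its calendar day.
    days = frame[column].dt.normalize()
    first_day = pd.Timestamp(start).normalize()
    last_day = pd.Timestamp(end).normalize()
    # between() is inclusive on both ends by default.
    return frame.loc[days.between(first_day, last_day)]


# Made-up timestamps around the week boundaries.
demo = pd.DataFrame({
    "startTimestamp": pd.to_datetime([
        "2018-04-21 23:59:59",  # Saturday before the week: dropped
        "2018-04-22 00:00:00",  # Sunday: kept
        "2018-04-28 23:30:00",  # Saturday: kept
        "2018-04-29 00:00:00",  # Sunday after the week: dropped
    ])
})
week = filter_by_date_range(demo, "startTimestamp",
                            datetime(2018, 4, 22), datetime(2018, 4, 28))
print(len(week))  # 2
```

Applied to the notebook's data, the call would be `filter_by_date_range(df, "startTimestamp", start_date, end_date)`, with no progress bar since the whole column is processed at once.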