# Azure ML Notebooks - Getting started (Python)


This notebook has the following:

&emsp;[a) Installing Python libraries to workspace](#installing_modules_or_packages)<br>
&emsp;[b) Importing modules or packages](#importing_modules_or_packages)<br>
&emsp;[c) Managing secrets using Azure Key Vault](#managing_secrets)<br>
&emsp;[d) Reading data from data storage](#reading_data_from_data_storage)<br>
&emsp;&emsp;[i) Reading csv file in Azure ML workspace local file system](#reading_data_from_data_storage)<br>
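&emsp;&emsp;[ii) Reading csv file in Azure Blob Storage](#reading_data_from_data_storage)<br>
&emsp;&emsp;[iii) Reading DB table from Datasets](#reading_data_from_data_storage)<br>
&emsp;&emsp;[iv) Reading data directly from the DB using Python](#reading_data_from_data_storage)<br>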
&emsp;[e) Data Wrangling](#managing_secrets)<br>
&emsp;[f) Saving data to data storage](#managing_secrets)<br>
&emsp;&emsp;[i) Saving csv file to Azure ML workspace local file system](#reading_data_from_data_storage)<br>
&emsp;&emsp;[ii) Saving csv file to Azure Blob Storage](#reading_data_from_data_storage)<br>
&emsp;[g) Managing an Azure Machine Learning compute instance](#managing_secrets)<br>


<a id='installing_modules_or_packages'></a>
## a) Installing Python libraries

In [None]:
%pip install pandas==1.1.5
# NB. You will need to restart the Kernel after installing new packages
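
After restarting the kernel, it can be worth confirming that the pinned version is the one actually installed. A minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is illustrative and not part of the original notebook:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Compare the installed version with the one pinned in the %pip cell above
pinned = "1.1.5"
found = installed_version("pandas")
if found is None:
    print("pandas is not installed in this kernel")
elif found != pinned:
    print(f"pandas {found} is installed, expected {pinned}")
else:
    print(f"pandas {pinned} is installed")
```

If the versions differ, the kernel was most likely not restarted after the install, or another environment is active.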

<a id="importing_modules_or_packages"></a>
## b) Importing modules or packages 

In [3]:
# Importing the libraries used in this notebook
from urllib import parse

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

<a id="managing_secrets"></a>
## c) Managing secrets

Your Azure Machine Learning workspace has an associated Azure Key Vault resource. This Key Vault stores and retrieves secrets, such as passwords, that your notebook needs. As a best practice, never hard-code passwords or other secret values in cleartext in notebooks or scripts: doing so exposes the secret to everyone who can read the file. The code below accesses the Key Vault, then writes and retrieves a secret.



In [4]:
# Importing the libraries
from azureml.core import Workspace
from azureml.core import Keyvault

ws = Workspace.from_config()

# Setting the secret ("my_secret" is a placeholder value; never commit a real secret)
keyvault = ws.get_default_keyvault()
keyvault.set_secret(name="mysecret", value="my_secret")

# List the names of all the secrets in the Key Vault
print("The list of secrets are {}".format(keyvault.list_secrets()))

# Retrieve the secret
retrieved_secret = keyvault.get_secret(name="mysecret")


The list of secrets are [{'name': 'mysecret'}, {'name': 'ooi-teamsite-client-secret'}, {'name': 'stg-db-user'}, {'name': 'dev-db-user'}, {'name': 'stg-db-pwd'}, {'name': 'dev-db-pwd'}]
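
Note that `list_secrets()` returns only the secret names, whereas `get_secret()` returns the value itself, so printing `retrieved_secret` in a cell would store it in the notebook output. If you need to confirm that a value was retrieved, one option is to print a masked form. A minimal sketch; `mask_secret` is a hypothetical helper, not part of the Azure ML SDK:

```python
def mask_secret(value, visible=2):
    """Hide all but the last `visible` characters of a secret for safe display."""
    if value is None:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# In the notebook this would be the value returned by keyvault.get_secret(...)
retrieved_secret = "my_secret"
print("Retrieved secret:", mask_secret(retrieved_secret))
```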


<a id="reading_data_from_data_storage"></a>
## d) Reading data from data storage

#### i) Reading csv file in Azure ML Workspace local file system

In [5]:
#using pandas to load a csv file
country_info_from_ws_df = pd.read_csv("data/country_list.csv")

#### ii) Reading csv file in Azure Blob Storage

In [12]:
from azureml.core import Workspace, Datastore, Dataset

# get existing workspace
workspace = Workspace.from_config()
    
# retrieve an existing datastore in the workspace by name
datastore_name = 'saunigiga'
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from a single file path in the datastore
datastore_paths = [(datastore, 'country_list.csv')]

country_info = Dataset.Tabular.from_delimited_files(path=datastore_paths)

#converting the data to a pandas df 
df = country_info.to_pandas_dataframe()

df.head()

Unnamed: 0,country,alias,region,alpha-2-code,alpha-3-code,numeric-code
0,Afghanistan,Afghanistan,ROSA,AF,AFG,4
1,Albania,Albania,ECAR,AL,ALB,8
2,Algeria,Algeria,MENA,DZ,DZA,12
3,American Samoa,American Samoa,,AS,ASM,16
4,Andorra,Andorra,,AD,AND,20


In [14]:
df.dtypes

country         object
alias           object
region          object
alpha-2-code    object
alpha-3-code    object
numeric-code     int64
dtype: object

#### iii) Reading DB table from Datasets

In [18]:
# get existing workspace
workspace = Workspace.from_config()

ds = Dataset.get_by_name(workspace, 'connection_statistics_schooldailystatus')
ds2 = Dataset.get_by_name(workspace, 'connection_statistics_countrydailystatus')


#converting the data to a pandas df 
df = ds2.to_pandas_dataframe()

df.tail()

Unnamed: 0,id,connectivity_latency,country_id,created,modified,date,connectivity_speed
387,390,114,285,2021-08-22 16:01:40.465750,2021-08-23 00:01:54.461422,2021-08-22,26875654
388,388,219,7,2021-08-22 04:01:30.407123,2021-08-23 00:01:57.408973,2021-08-22,7252142
389,391,83,285,2021-08-23 04:01:15.569780,2021-08-23 16:02:25.862390,2021-08-23,53262881
390,393,218,7,2021-08-23 10:01:26.126378,2021-08-23 16:02:30.537349,2021-08-23,1993741
391,392,64,144,2021-08-23 10:00:47.345939,2021-08-23 16:05:08.800860,2021-08-23,58700597


#### iv) Reading data directly from the DB using Python

In [24]:
## CONNECTING TO THE STAGING DB

#getting credentials for the DB from the key vault
stg_db_user = keyvault.get_secret(name="stg-db-user")
stg_db_pwd = keyvault.get_secret(name="stg-db-pwd")

#Connecting to the db
engine = create_engine(f"postgresql://{stg_db_user}%40pgsql-10-shared-uniwebs-tst:{stg_db_pwd}@pgsql-10-shared-uniwebs-prd.postgres.database.azure.com/stg_oia_prjrazordb?sslmode=require")
conn = engine.connect()

#running a query
sql = "SELECT * FROM schools_school LIMIT 10"
df = pd.read_sql(sql, con=engine)
df

Unnamed: 0,id,created,modified,name,timezone,geopoint,gps_confidence,altitude,address,postal_code,...,school_type,country_id,location_id,admin_2_name,admin_3_name,admin_4_name,external_id,admin_1_name,last_weekly_status_id,name_lower
0,53384,2020-09-14 11:21:28.270399+00:00,2021-08-18 01:00:14.386171+00:00,EMEF ENGº WADIH DARWICH ZACARIAS,,0101000020E6100000BEC1172653ED4FC0CB10C7BAB88D...,,0,RUA CIDADE,,...,,144,,,Porto Velho,PORTO VELHO,11001003.0,RONDONIA,4963556,emef engº wadih darwich zacarias
1,170398,2020-10-21 01:35:13.233225+00:00,2021-08-18 01:00:18.132726+00:00,EEEFM RIO BRANCO,,0101000020E6100000280F0BB5A6F14FC0D734EF384587...,,0,RUA RAFAEL VAZ E SILVA,,...,,144,,,Porto Velho,PORTO VELHO,11002549.0,RONDONIA,4949398,eeefm rio branco
2,1564005,2021-03-17 15:30:23.115219+00:00,2021-03-23 13:10:06.650704+00:00,Minanga ES,,0101000020E61000003259DC7F64725E4002D9EBDD1F73...,,0,,,...,,20,,Isabela,San Pablo,,,Region II,3006586,minanga es
3,1500511,2021-03-17 14:37:07.004348+00:00,2021-03-23 12:57:13.983364+00:00,Name unknown,,0101000020E6100000C26D6DE179690A400B462575025A...,,0,,,...,,23,,25003,,,,LA,2943231,name unknown
4,1066996,2021-03-15 13:14:54.854159+00:00,2021-03-23 11:45:23.688261+00:00,ISHILA,,0101000020E6100000064CE0D6DD544040E0DBF4673F72...,,0,,,...,,88,,,,,,,2481909,ishila
5,1201186,2021-03-15 17:00:07.603053+00:00,2021-03-23 12:01:47.705300+00:00,Galai,,0101000020E6100000035DFB027A1923C0EBE1CB441162...,,0,,,...,,37,,,Suacoco,,,,2787231,galai
6,257597,2020-11-10 18:39:04.123229+00:00,2021-08-18 01:00:18.600896+00:00,EMEF SAUL BENNESBY,,0101000020E61000006DE7FBA9F1F24FC00EBE30992A98...,,0,RUA ALGODOEIRO,,...,,144,,Porto Velho,Porto Velho,PORTO VELHO,11002999.0,RONDONIA,4949861,emef saul bennesby
7,2279271,2021-04-13 22:58:18.515385+00:00,2021-08-18 01:25:46.512019+00:00,EE - COLEGIO ESTADUAL MARIA TEOFILA,,0101000020E6100000386744696F6043C07593180456CE...,,0,,,...,,144,,,,AMELIA RODRIGUES,29157803.0,BAHIA,4955667,ee - colegio estadual maria teofila
8,259435,2020-11-10 18:39:04.226389+00:00,2021-08-18 01:11:32.095396+00:00,CENTRO DE ENSINO DRA AGOSTINHA MELO CORREIA LIMA,,0101000020E61000009BE61DA7E89846C0C58F31772D21...,,0,RUA ELIEZER MOREIRA 87,,...,,144,,Igarapé do Meio,Igarapé do Meio,IGARAPE DO MEIO,21252530.0,MARANHAO,4965487,centro de ensino dra agostinha melo correia lima
9,1468615,2021-03-17 14:37:00.225565+00:00,2021-03-23 12:50:39.476632+00:00,Name unknown,,0101000020E6100000C0CE4D9B711A1E403C50A73CBA21...,,0,,,...,,23,,14008,,,,EN,2911340,name unknown
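
The credentials are interpolated into the connection URL above, so a password containing characters such as `@`, `:` or `/` would break the URL unless it is percent-encoded (the user name in the cell above already encodes `@` by hand as `%40`). A minimal sketch using `urllib.parse.quote_plus`; `build_postgres_url` and the example values are hypothetical:

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, database):
    """Build a PostgreSQL URL with percent-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}?sslmode=require"
    )


# Example values only; real credentials should come from the Key Vault
url = build_postgres_url("db_user@my-server", "p@ss:w/rd", "my-server.example.com", "my_db")
print(url)
```

The resulting string can be passed to `create_engine` in place of the hand-built f-string.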


In [25]:
## CONNECTING TO THE DEV DB

#getting credentials for the DB from the key vault
dev_db_user = keyvault.get_secret(name="dev-db-user")
dev_db_pwd = keyvault.get_secret(name="dev-db-pwd")

#Connecting to the db
engine = create_engine(f"postgresql://{dev_db_user}%40pgsql-10-shared-uniwebs-tst:{dev_db_pwd}@pgsql-10-shared-uniwebs-tst.postgres.database.azure.com/dev_oia_prjrazordb?sslmode=require")
conn = engine.connect()

#running a query
sql = "SELECT * FROM schools_school LIMIT 5"
df = pd.read_sql(sql, con=engine)
df

Unnamed: 0,id,created,modified,name,timezone,geopoint,gps_confidence,altitude,address,postal_code,...,school_type,country_id,location_id,admin_2_name,admin_3_name,admin_4_name,external_id,admin_1_name,last_weekly_status_id,name_lower
0,1564005,2021-03-17 15:30:23.115219+00:00,2021-03-23 13:10:06.650704+00:00,Minanga ES,,0101000020E61000003259DC7F64725E4002D9EBDD1F73...,,0,,,...,,20,,Isabela,San Pablo,,,Region II,3006586,minanga es
1,1500511,2021-03-17 14:37:07.004348+00:00,2021-03-23 12:57:13.983364+00:00,Name unknown,,0101000020E6100000C26D6DE179690A400B462575025A...,,0,,,...,,23,,25003,,,,LA,2943231,name unknown
2,1066996,2021-03-15 13:14:54.854159+00:00,2021-03-23 11:45:23.688261+00:00,ISHILA,,0101000020E6100000064CE0D6DD544040E0DBF4673F72...,,0,,,...,,88,,,,,,,2481909,ishila
3,1201186,2021-03-15 17:00:07.603053+00:00,2021-03-23 12:01:47.705300+00:00,Galai,,0101000020E6100000035DFB027A1923C0EBE1CB441162...,,0,,,...,,37,,,Suacoco,,,,2787231,galai
4,2286814,2021-04-13 22:58:20.179990+00:00,2021-08-23 01:35:30.288079+00:00,EE PROFESSOR PLINIO RIBEIRO,,0101000020E61000009CA223B9FCEF45C0098A1F63EEBA...,,0,,,...,,144,,,,MONTES CLAROS,31081264.0,MINAS GERAIS,4955927,ee professor plinio ribeiro


In [30]:
df.columns

Index(['id', 'created', 'modified', 'name', 'timezone', 'geopoint',
       'gps_confidence', 'altitude', 'address', 'postal_code', 'email',
       'education_level', 'environment', 'school_type', 'country_id',
       'location_id', 'admin_2_name', 'admin_3_name', 'admin_4_name',
       'external_id', 'admin_1_name', 'last_weekly_status_id', 'name_lower'],
      dtype='object')

## e) Data Wrangling

In [26]:
country_info_from_ws_df = country_info_from_ws_df[country_info_from_ws_df['region'].notnull()]
country_info_from_ws_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 220 entries, 0 to 331
Data columns (total 6 columns):
country         220 non-null object
alias           220 non-null object
region          220 non-null object
alpha-2-code    218 non-null object
alpha-3-code    220 non-null object
numeric-code    220 non-null int64
dtypes: int64(1), object(5)
memory usage: 12.0+ KB
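
The `Int64Index: 220 entries, 0 to 331` line in the output shows that filtering keeps the original row labels, leaving gaps in the index. If contiguous labels are preferred, `reset_index(drop=True)` renumbers the rows. A small sketch on a few illustrative rows (not the full `country_list.csv`):

```python
import pandas as pd

sample_df = pd.DataFrame(
    {
        "country": ["Afghanistan", "American Samoa", "Albania"],
        "region": ["ROSA", None, "ECAR"],
    }
)

# Keep only the rows that have a region, as in the cell above
filtered_df = sample_df[sample_df["region"].notnull()]
labels_before = list(filtered_df.index)
print(labels_before)  # original labels are kept, so there is a gap

# Renumber the rows from 0 after filtering
filtered_df = filtered_df.reset_index(drop=True)
labels_after = list(filtered_df.index)
print(labels_after)
```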


## f) Saving the data

#### i) Saving csv file to Azure ML Workspace

In [27]:
#This saves a pandas df as csv in the Azure ML workspace
country_info_from_ws_df.to_csv("data/country_list_updated.csv")
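
By default `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. If the index carries no information, passing `index=False` avoids this. A minimal sketch using an in-memory buffer instead of a file; the sample data is illustrative:

```python
import io

import pandas as pd

sample_df = pd.DataFrame({"country": ["Albania", "Algeria"], "region": ["ECAR", "MENA"]})

# Default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)
print(columns_with_index)

# index=False writes only the data columns
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)
print(columns_without_index)
```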

#### ii) Saving csv file to Azure Blob Storage

In [28]:
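# Save the DataFrame locally, then upload the file to the datastore
# An existing target is skipped by default (see the output below); pass overwrite=True to upload_files to replace it
country_info_from_ws_df.to_csv("data/country_list_updated.csv")
datastore.upload_files(["data/country_list_updated.csv"], target_path="/processed_data/")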

Uploading an estimated of 1 files
Target already exists. Skipping upload for processed_data/country_list_updated.csv
Uploaded 0 files


$AZUREML_DATAREFERENCE_894de9202fe54bfab2cdf69f84b22782

## g) Managing an Azure Machine Learning compute instance

An Azure ML compute instance can be managed from code in a notebook, which gives you more control over when the instance is started and stopped. A compute instance does not scale down automatically, so make sure you stop it when it is not in use to prevent ongoing charges. Below is sample code for managing an existing compute instance.

In [None]:
from azureml.core.compute import ComputeInstance
from azureml.core import Workspace, Datastore, Dataset

#setting the workspace
ws = Workspace.from_config()


#defining the instance
instance = ComputeInstance(workspace=ws, name='uni-giga-compute')

#starting the instance
instance.start(wait_for_completion=True, show_output=True)

#stopping the instance
instance.stop(wait_for_completion=True, show_output=True)
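
Because a running compute instance keeps accruing charges, it can help to guarantee that `stop` is called even when the work in between raises an error. A minimal sketch using a context manager; `running` is a hypothetical helper, and `FakeInstance` is a stand-in used here only so the example runs without a workspace. With the real SDK you would pass the `ComputeInstance` object from the cell above:

```python
from contextlib import contextmanager


@contextmanager
def running(instance):
    """Start the instance, yield it, and always stop it afterwards."""
    instance.start(wait_for_completion=True, show_output=True)
    try:
        yield instance
    finally:
        instance.stop(wait_for_completion=True, show_output=True)


class FakeInstance:
    """Stand-in that records calls; not part of the Azure ML SDK."""

    def __init__(self):
        self.calls = []

    def start(self, wait_for_completion=False, show_output=False):
        self.calls.append("start")

    def stop(self, wait_for_completion=False, show_output=False):
        self.calls.append("stop")


fake = FakeInstance()
try:
    with running(fake):
        raise RuntimeError("simulated failure during the workload")
except RuntimeError:
    pass

print(fake.calls)
```

The `finally` clause is what ensures the stop call runs after a failure; the exception itself still propagates to the caller.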

In [3]:
from azureml.core import Workspace, Datastore, Dataset

# get existing workspace
workspace = Workspace.from_config()

datastore_name = 'saunigiga'
datastore = Datastore.get(workspace, datastore_name)

In [7]:
from sqlalchemy import create_engine

In [6]:
%pip install sqlalchemy

Collecting sqlalchemy
  Downloading SQLAlchemy-1.4.23-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
Installing collected packages: sqlalchemy
Successfully installed sqlalchemy-1.4.23
Note: you may need to restart the kernel to use updated packages.


In [9]:
selected_start_date = '2020-4-4'
selected_end_date = '2021-5-1'

sql =f"SELECT * FROM BW.VirtualConferencing_Summary WHERE StartDate>= '{selected_start_date}' and StartDate <= '{selected_end_date}'"
df = pd.read_sql(sql, con=engine)
df['StartDate'] = pd.to_datetime(df['StartDate'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6829 entries, 0 to 6828
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   EntryID                     6829 non-null   int64         
 1   TotalDurationInMinutes      6829 non-null   int64         
 2   TotalNoOfParticipants       6829 non-null   int64         
 3   TotalNoOfMeetingsOrganized  6829 non-null   int64         
 4   OrganizerAlpha3code         3055 non-null   object        
 5   Country                     6829 non-null   object        
 6   StartDate                   6829 non-null   datetime64[ns]
 7   VirtualConferencingService  6829 non-null   object        
 8   WeekNo                      6829 non-null   int64         
 9   Region                      6829 non-null   object        
dtypes: datetime64[ns](1), int64(5), object(4)
memory usage: 533.6+ KB
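
The query above builds the SQL text by interpolating the dates with an f-string. Passing the values as bound parameters keeps the SQL text fixed and lets the database driver handle quoting. A minimal sketch using the standard-library `sqlite3` module as a stand-in database; the table name `summary` and its rows are made up for illustration, and the dates are zero-padded because SQLite compares them as text. `pd.read_sql` accepts values in a similar way through its `params` argument, with the placeholder style depending on the driver:

```python
import sqlite3

# In-memory database standing in for the real one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary (EntryID INTEGER, StartDate TEXT)")
conn.executemany(
    "INSERT INTO summary VALUES (?, ?)",
    [(1, "2020-03-01"), (2, "2020-06-15"), (3, "2021-04-30"), (4, "2021-07-01")],
)

selected_start_date = "2020-04-04"
selected_end_date = "2021-05-01"

# Named placeholders are filled in by the driver, not by string formatting
sql = (
    "SELECT EntryID FROM summary "
    "WHERE StartDate >= :start AND StartDate <= :end "
    "ORDER BY EntryID"
)
rows = conn.execute(sql, {"start": selected_start_date, "end": selected_end_date}).fetchall()
print(rows)
conn.close()
```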
