# Access stereotype data

For MLADS June 2023 tutorial

The attached Lakehouse `embeddingLH` has a shortcut to this container: https://stereotypes4syn.blob.core.windows.net/stereotypes

This notebook demonstrates, in Trident:
- testing a shortcut by writing a file to it
- reading a raw file from GitHub
- writing it to the Lakehouse Files area
- unpickling it
- de-vectorizing the vector column, so it can be converted to a Spark DataFrame
- writing it back to a Table
- saving it in ADLS storage

In [11]:
import numpy as np
import pandas as pd
import json
import time
import pickle
import os
import subprocess
import requests

from numpy.random import Generator, PCG64
rng = Generator(PCG64())


StatementMeta(, 1e090f6f-6214-4af6-b1d7-18adcd1b65d6, 13, Finished, Available)

In [14]:
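# Config from the "02 - Data Transformation" tutorial. As far as I can tell: V-Order is a write-time
# optimization for Parquet files, and optimizeWrite aims for fewer, larger files (binSize is in bytes; 1073741824 = 1 GiB).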
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.binSize", "1073741824")

StatementMeta(, 1e090f6f-6214-4af6-b1d7-18adcd1b65d6, 16, Finished, Available)
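
Spark generally accepts an unknown configuration key without complaint, so a misspelled key is easy to miss. The sketch below is one way to guard against that: it keeps the settings in a dictionary and rejects any key that does not start with `spark.`. The helper `apply_spark_settings` and the `StubConf` class are hypothetical names introduced here for illustration; `StubConf` only stands in for `spark.conf` so the example runs outside a Spark session.

```python
from typing import Dict

# The three settings used in this notebook, collected in one place.
SPARK_SETTINGS: Dict[str, str] = {
    "spark.sql.parquet.vorder.enabled": "true",
    "spark.microsoft.delta.optimizeWrite.enabled": "true",
    "spark.microsoft.delta.optimizeWrite.binSize": "1073741824",
}


def apply_spark_settings(conf, settings: Dict[str, str]) -> None:
    """Set each key on `conf` (any object with a .set(key, value) method)."""
    # A simple prefix check; it catches typos such as "sprk." but not every mistake.
    bad_keys = [key for key in settings if not key.startswith("spark.")]
    if bad_keys:
        raise ValueError(f"Suspicious config keys: {bad_keys}")
    for key, value in settings.items():
        conf.set(key, value)


class StubConf:
    """Hypothetical stand-in for spark.conf, used only to run this sketch."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


stub = StubConf()
apply_spark_settings(stub, SPARK_SETTINGS)
print(sorted(stub.values))
```

In the notebook itself the call would be `apply_spark_settings(spark.conf, SPARK_SETTINGS)`.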

In [15]:
# Check that the shortcut created in the LakeHouse pane exists
os.system('ls -ld /lakehouse/default/Files/stereotypes')

StatementMeta(, 1e090f6f-6214-4af6-b1d7-18adcd1b65d6, 17, Finished, Available)

drw-r----- 3 trusted-service-user trusted-service-user 4096 May  3 17:35 /lakehouse/default/Files/stereotypes


0
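
The `ls -ld` output above shows the directory without a write bit, yet the next cell writes to it successfully, so the mode bits on this mount may not be a reliable guide. A minimal sketch of a more direct check is to try writing a small probe file and report the result as a boolean. The function name `is_writable_dir` is an assumption made for this example, and it is demonstrated on a temporary directory rather than on the real shortcut.

```python
import os
import tempfile
import uuid


def is_writable_dir(path: str) -> bool:
    """Return True if `path` is a directory and a probe file can be written to it."""
    if not os.path.isdir(path):
        return False
    probe = os.path.join(path, f".write_probe_{uuid.uuid4().hex}")
    try:
        with open(probe, "wb") as fh:
            fh.write(b"ok")
    except OSError:
        return False
    os.remove(probe)  # clean up so the probe does not linger in the linked storage
    return True


# Demonstrated on a temporary directory; in the notebook the argument would be
# '/lakehouse/default/Files/stereotypes'.
with tempfile.TemporaryDirectory() as tmp:
    writable = is_writable_dir(tmp)
    missing = is_writable_dir(os.path.join(tmp, "no_such_shortcut"))
print(writable, missing)
```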

In [17]:
# Create a small test dataset and check that the shortcut is writable and that the file appears in the linked storage
tst_df = pd.DataFrame({'data': rng.binomial(10, 0.8, size=30)}, index=range(30))
# Write the pandas DataFrame to the Files folder.
# Because the shortcut points back at the external blob storage, the file also appears there (e.g. in Azure Storage Explorer).
tst_df.to_parquet('/lakehouse/default/Files/stereotypes/tst_df.parquet')
# Test that it is visible in the filesystem
os.system('ls -l /lakehouse/default/Files/stereotypes')

StatementMeta(, 1e090f6f-6214-4af6-b1d7-18adcd1b65d6, 19, Finished, Available)

total 4
-rwxrwx--- 1 trusted-service-user trusted-service-user 1701 May  3 17:56 tst_df.parquet


0

In [18]:
# Use the Python requests package as an alternative to "wget" to read a GitHub file
headers = {'Accept': 'application/vnd.github.v3.raw'}
# Note that the URL needs to end in "?raw=true" or you'll just get an HTML page for the file
stereotype_data_url = "https://github.com/rmhorton/sentence-embedding-demos/blob/main/bias_detection/stereotype_data_long_float16.pkl?raw=true"
r = requests.get(stereotype_data_url, headers=headers)
r.raise_for_status()  # fail early on an HTTP error instead of saving an error page

StatementMeta(, 1e090f6f-6214-4af6-b1d7-18adcd1b65d6, 20, Finished, Available)
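
Since a URL without `?raw=true` returns an HTML page rather than the file, it can help to inspect the downloaded bytes before saving them. The sketch below shows one way to do that. The helpers `looks_like_html` and `looks_like_pickle` are hypothetical names, and the checks are heuristics, not a validation of the file. The pickle check relies on the fact that pickle protocols 2 and later start with the byte `0x80`.

```python
import pickle


def looks_like_html(payload: bytes) -> bool:
    """Heuristic: does the payload start like an HTML document?"""
    head = payload[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def looks_like_pickle(payload: bytes) -> bool:
    """Heuristic: pickle protocols 2 and later begin with the PROTO opcode, 0x80."""
    return payload[:1] == b"\x80"


# Two made-up payloads standing in for r.content.
html_page = b"\n\n<!DOCTYPE html>\n<html lang=\"en\"><head></head></html>"
pickled = pickle.dumps({"vector": [0.1, 0.2]})

print(looks_like_html(html_page), looks_like_pickle(pickled))
```

In the notebook, the check could be applied as `assert not looks_like_html(r.content)` before the file is written.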

In [19]:
# Save the bytes through the local filesystem mount of the Files directory
with open('/lakehouse/default/Files/stereotypes/stereotype_data_long_float16.pkl', 'wb') as sdl:
    sdl.write(r.content)
# Yes, it made it there. You can also see it in the Trident file explorer on the right.
os.system('ls -l /lakehouse/default/Files/stereotypes')

StatementMeta(, 1e090f6f-6214-4af6-b1d7-18adcd1b65d6, 21, Finished, Available)

total 16108
-rwxrwx--- 1 trusted-service-user trusted-service-user 16490524 May  3 18:01 stereotype_data_long_float16.pkl
-rw-r----- 1 trusted-service-user trusted-service-user     1701 May  3 17:56 tst_df.parquet


0

In [21]:
# Unpickle the file; you get the pandas DataFrame back.
with open('/lakehouse/default/Files/stereotypes/stereotype_data_long_float16.pkl', 'rb') as pkl_fd:
    stereotype_data = pickle.load(pkl_fd)
type(stereotype_data)

StatementMeta(, 1e090f6f-6214-4af6-b1d7-18adcd1b65d6, 23, Finished, Available)

pandas.core.frame.DataFrame

### Convert the data into a Spark-usable Table

Having copied the pkl file from GitHub and loaded it as a pandas DataFrame, we then unpack the vector column into one column per embedding dimension,
convert the result to a Spark DataFrame and write it as a table.

In [22]:
# Spark doesn't know what to do with the vector column, so we convert it into multiple columns.
# Unpack the vector column into individual columns, assuming all vectors are the same length.
vectors = stereotype_data['vector']
embedding_size = len(vectors.iloc[0])
cases = stereotype_data.shape[0]
pre_allocated_array = np.empty((cases, embedding_size))

# Copy each row into the NumPy array.
for k, vec in enumerate(vectors):
    pre_allocated_array[k, :] = np.asarray(vec)
v_col_names = [f'v{z}' for z in range(embedding_size)]
# Reuse the original index so the concat below aligns row for row.
par_fd = pd.DataFrame(pre_allocated_array, columns=v_col_names, index=stereotype_data.index)
# Concatenate the new columns side by side (axis=1) in place of the previous vector column.
stereotype_vector_data = pd.concat([stereotype_data.drop('vector', axis=1), par_fd], axis=1)
stereotype_vector_data.info()

StatementMeta(, 1e090f6f-6214-4af6-b1d7-18adcd1b65d6, 24, Finished, Available)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9578 entries, 0 to 9577
Columns: 775 entries, stereotype_id to v767
dtypes: float64(768), int64(1), object(6)
memory usage: 56.7+ MB
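
The row-by-row copy above can also be written without an explicit loop. The sketch below stacks the per-row vectors into a 2-D array with `np.stack` and builds the `v0, v1, ...` columns in one step. The function name `expand_vector_column` and the three-row toy DataFrame are assumptions made for illustration; like the cell above, it assumes every vector has the same length.

```python
import numpy as np
import pandas as pd


def expand_vector_column(df: pd.DataFrame, column: str = "vector", prefix: str = "v") -> pd.DataFrame:
    """Replace a column of equal-length vectors with one float64 column per element."""
    # np.stack raises a ValueError if the vectors differ in length.
    matrix = np.stack(df[column].tolist()).astype(np.float64)
    names = [f"{prefix}{i}" for i in range(matrix.shape[1])]
    # Keep the original index so the concat aligns row for row.
    expanded = pd.DataFrame(matrix, columns=names, index=df.index)
    return pd.concat([df.drop(columns=column), expanded], axis=1)


# Toy data standing in for stereotype_data (float16 vectors of length 2).
toy = pd.DataFrame({
    "stereotype_id": [1, 2, 3],
    "vector": [
        np.array([0.5, 0.25], dtype=np.float16),
        np.array([1.5, 0.75], dtype=np.float16),
        np.array([2.0, 4.0], dtype=np.float16),
    ],
})
wide = expand_vector_column(toy)
print(wide.columns.tolist())  # ['stereotype_id', 'v0', 'v1']
```

In the notebook this would be called as `expand_vector_column(stereotype_data)`.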


In [23]:
# Write the pandas DataFrame to a Spark table.
# Hmm, saving in Delta format to a path under Tables/ looks more like writing files into the Tables section than creating a table,
# but the Lakehouse did convert it to a table automatically, visible in the Tables directory.
sdf_stereo_data = spark.createDataFrame(stereotype_vector_data)
sdf_stereo_data.write.mode("overwrite").format("delta").option("overwriteSchema", "true").save("Tables/stereotype_data")

StatementMeta(, 1e090f6f-6214-4af6-b1d7-18adcd1b65d6, 25, Finished, Available)
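
If a later step needs the embeddings as arrays again (an assumption; this notebook does not do so), the `v0, v1, ...` columns of the table can be packed back into a single vector column once the data is in pandas. The sketch below shows the idea on a toy DataFrame. The function name `pack_vector_columns` is hypothetical, and the `float16` default mirrors the dtype suggested by the pickle file name.

```python
import numpy as np
import pandas as pd


def pack_vector_columns(df: pd.DataFrame, prefix: str = "v", column: str = "vector",
                        dtype=np.float16) -> pd.DataFrame:
    """Collapse columns named prefix + integer into one column of NumPy arrays."""
    v_cols = [c for c in df.columns if c.startswith(prefix) and c[len(prefix):].isdigit()]
    # Sort numerically so that v10 comes after v9, not after v1.
    v_cols.sort(key=lambda c: int(c[len(prefix):]))
    matrix = df[v_cols].to_numpy(dtype=dtype)
    packed = df.drop(columns=v_cols)
    packed[column] = pd.Series(list(matrix), index=packed.index)
    return packed


# Toy wide table standing in for the data read back from the Lakehouse table.
wide_toy = pd.DataFrame({
    "stereotype_id": [1, 2],
    "v0": [0.5, 1.5],
    "v1": [0.25, 0.75],
})
packed_toy = pack_vector_columns(wide_toy)
print(packed_toy.columns.tolist())  # ['stereotype_id', 'vector']
```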



In [24]:
# Instead, let's try writing the Spark DataFrame out when there's a shortcut to a subfolder in Tables
sdf_stereo_data.write.mode("overwrite").format("delta").option("overwriteSchema", "true").save("Tables/stereotypetable/stereotype")
# Strange: it seems to have moved the Spark DataFrame to the new shortcut.


StatementMeta(, 1e090f6f-6214-4af6-b1d7-18adcd1b65d6, 26, Finished, Available)