## Load JSON data

Replace **YOUR_STORAGE_ACCOUNT_NAME_HERE** in the cell below with the name of your Synapse workspace's primary data lake storage account.


In [4]:
account_name = 'YOUR_STORAGE_ACCOUNT_NAME_HERE'

In [5]:
datapath = 'abfss://datalake@'+account_name+'.dfs.core.windows.net/SensorDatasets/sensor/*/*/*/'
df = spark.read.json(datapath)
df.printSchema()
df.count()

root
 |-- Cycle: long (nullable = true)
 |-- DeviceId: string (nullable = true)
 |-- EventEnqueuedUtcTime: string (nullable = true)
 |-- EventProcessedUtcTime: string (nullable = true)
 |-- IoTHub: struct (nullable = true)
 |    |-- ConnectionDeviceGenerationId: string (nullable = true)
 |    |-- ConnectionDeviceId: string (nullable = true)
 |    |-- CorrelationId: string (nullable = true)
 |    |-- EnqueuedTime: string (nullable = true)
 |    |-- MessageId: string (nullable = true)
 |    |-- StreamId: string (nullable = true)
 |-- PartitionId: long (nullable = true)
 |-- Period: long (nullable = true)
 |-- Sensor11: double (nullable = true)
 |-- Sensor14: double (nullable = true)
 |-- Sensor15: double (nullable = true)
 |-- Sensor9: double (nullable = true)

1038971
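
The read path above is assembled by string concatenation. As a minimal sketch, the same URI can be built by a small helper that also fails early if the placeholder was never replaced. The function name `build_abfss_path` and the account name `examplelake` are hypothetical and used only for illustration; the container name and relative path are the ones from the cell above.

```python
def build_abfss_path(account_name, container, relative_path):
    """Return an abfss:// URI for a path inside a data lake container."""
    # Fail early if the placeholder from the earlier cell was never replaced.
    if not account_name or account_name == 'YOUR_STORAGE_ACCOUNT_NAME_HERE':
        raise ValueError('Set account_name to your storage account name first.')
    # Strip a leading slash so the URI does not contain a double slash.
    return 'abfss://{}@{}.dfs.core.windows.net/{}'.format(
        container, account_name, relative_path.lstrip('/'))


# 'examplelake' is a made-up account name used only for illustration.
datapath = build_abfss_path('examplelake', 'datalake', 'SensorDatasets/sensor/*/*/*/')
print(datapath)
```

The resulting string can be passed to `spark.read.json` exactly like the concatenated `datapath` above.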

## Data processing

In [6]:
from pyspark.sql.functions import from_utc_timestamp,to_date,hour

# If necessary, change PST to your own time zone ID
df_datetime_conv = df.withColumn("timestamp_PST",from_utc_timestamp('EventEnqueuedUtcTime', "PST"))
df_date_conv = df_datetime_conv\
    .withColumn("date_PST",to_date("timestamp_PST"))\
    .withColumn("hour_PST",hour("timestamp_PST"))\
    .drop('IoTHub','PartitionId','EventProcessedUtcTime','EventEnqueuedUtcTime')
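
To make the conversion above concrete, here is a plain-Python sketch (not Spark code) of what the three derived columns hold for a single value. The input string is a hypothetical example, and the fixed offset of -7 hours is an assumption that matches Pacific daylight time in late April 2020; Spark resolves the zone ID with its daylight-saving rules, which this sketch does not do.

```python
from datetime import datetime, timedelta, timezone


def convert_from_utc(utc_text, offset_hours):
    """Interpret utc_text as UTC and return (local timestamp, date, hour)."""
    utc_dt = datetime.strptime(utc_text, '%Y-%m-%dT%H:%M:%S').replace(tzinfo=timezone.utc)
    # Shift to the target zone; a fixed offset stands in for a real zone ID here.
    local_dt = utc_dt.astimezone(timezone(timedelta(hours=offset_hours)))
    # Drop the zone info, mirroring the plain timestamp shown in the notebook output.
    return local_dt.replace(tzinfo=None), local_dt.date(), local_dt.hour


# Hypothetical enqueued time, shortly after midnight UTC on 2020-04-28.
ts, day, hr = convert_from_utc('2020-04-28T00:00:30', -7)
print(ts, day, hr)  # 2020-04-27 17:00:30 2020-04-27 17
```

Note that the local date falls on the previous day, which is why `date_PST` and `hour_PST` are derived from the converted timestamp rather than from the UTC string. The Spark documentation for `from_utc_timestamp` recommends region-based zone IDs such as `America/Los_Angeles` over short names, which can be ambiguous.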


In [7]:
# Preview the data (show() prints the first 20 rows by default)
df_date_conv.show()

+-----+---------+------+--------+--------+--------+-------+--------------------+----------+--------+
|Cycle| DeviceId|Period|Sensor11|Sensor14|Sensor15|Sensor9|       timestamp_PST|  date_PST|hour_PST|
+-----+---------+------+--------+--------+--------+-------+--------------------+----------+--------+
|   68|N3172FJ-2|    49|   44.29| 8061.17|  9.1958|8728.68|2020-04-27 17:00:...|2020-04-27|      17|
|   68|N3172FJ-1|    49|   41.98| 8087.09|  9.3633|8328.02|2020-04-27 17:00:...|2020-04-27|      17|
|   68|N4172FJ-2|    49|   41.94|  8074.2|  9.3415|8306.42|2020-04-27 17:00:...|2020-04-27|      17|
|   68|N4172FJ-1|    49|   45.17| 8132.85|  8.6752|8775.17|2020-04-27 17:00:...|2020-04-27|      17|
|   69|N1172FJ-2|    49|   45.47| 8125.21|  8.6166| 8759.7|2020-04-27 17:00:...|2020-04-27|      17|
|   69|N1172FJ-1|    49|   47.35| 8140.41|   8.398|9058.72|2020-04-27 17:00:...|2020-04-27|      17|
|   69|N2172FJ-2|    49|   41.77| 8092.79|  9.3043|8323.55|2020-04-27 17:00:...|2020-04-27|

In [8]:
from pyspark.sql.functions import row_number,col,when
from pyspark.sql.window import Window

# Use windowing to flag the final cycle before maintenance occurs.

df_rownum = df_date_conv\
    .withColumn("row_num",row_number()\
        .over(Window.partitionBy("Period","DeviceId","date_PST")\
            .orderBy(col("Cycle").desc())))

df_curated = df_rownum.withColumn("EndOfPeriod",when(df_rownum.row_num==1,1).otherwise(0))\
    .drop('row_num')
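
The window logic above can be hard to picture, so here is a plain-Python sketch of the same idea on a handful of rows: group by `Period`, `DeviceId`, and `date_PST`, order each group by `Cycle` descending, and flag the first row. The function name `flag_end_of_period` and the sample rows are hypothetical; this illustrates the technique and is not a replacement for the Spark code.

```python
from itertools import groupby
from operator import itemgetter


def flag_end_of_period(rows):
    """Add EndOfPeriod=1 to the highest-Cycle row of each partition, else 0."""
    partition_key = itemgetter('Period', 'DeviceId', 'date_PST')
    flagged = []
    # groupby needs its input sorted by the grouping key.
    for _, group in groupby(sorted(rows, key=partition_key), key=partition_key):
        # Equivalent of ORDER BY Cycle DESC inside the window.
        ordered = sorted(group, key=itemgetter('Cycle'), reverse=True)
        for position, row in enumerate(ordered, start=1):
            # position plays the role of row_number().
            flagged.append(dict(row, EndOfPeriod=1 if position == 1 else 0))
    return flagged


# Made-up rows modeled on the columns used for partitioning and ordering.
sample = [
    {'Cycle': 52, 'DeviceId': 'N3172FJ-2', 'Period': 1, 'date_PST': '2020-04-29'},
    {'Cycle': 54, 'DeviceId': 'N3172FJ-2', 'Period': 1, 'date_PST': '2020-04-29'},
    {'Cycle': 53, 'DeviceId': 'N3172FJ-2', 'Period': 1, 'date_PST': '2020-04-29'},
    {'Cycle': 10, 'DeviceId': 'N3172FJ-1', 'Period': 1, 'date_PST': '2020-04-29'},
    {'Cycle': 11, 'DeviceId': 'N3172FJ-1', 'Period': 1, 'date_PST': '2020-04-29'},
]
result = flag_end_of_period(sample)
print([(r['DeviceId'], r['Cycle']) for r in result if r['EndOfPeriod'] == 1])
```

Because `date_PST` is part of the partition key, a period that spans midnight gets one flagged row per day in this sketch, just as it does with the window definition above.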

In [9]:
df_curated.show()

+-----+---------+------+--------+--------+--------+-------+--------------------+----------+--------+-----------+
|Cycle| DeviceId|Period|Sensor11|Sensor14|Sensor15|Sensor9|       timestamp_PST|  date_PST|hour_PST|EndOfPeriod|
+-----+---------+------+--------+--------+--------+-------+--------------------+----------+--------+-----------+
|   54|N3172FJ-2|     1|   41.94|  8066.8|  9.3881|8312.53|2020-04-29 14:10:...|2020-04-29|      14|          1|
|   53|N3172FJ-2|     1|   36.99| 7864.04| 10.9041|8001.09|2020-04-29 14:10:...|2020-04-29|      14|          0|
|   52|N3172FJ-2|     1|   36.89|  7864.5| 10.9291| 8004.4|2020-04-29 14:10:...|2020-04-29|      14|          0|
|   51|N3172FJ-2|     1|   42.04| 8052.26|  9.3318|8336.03|2020-04-29 14:10:...|2020-04-29|      14|          0|
|   50|N3172FJ-2|     1|   45.54| 8114.55|   8.649|8772.46|2020-04-29 14:10:...|2020-04-29|      14|          0|
|   49|N3172FJ-2|     1|   45.55| 8115.32|  8.6801|8762.43|2020-04-29 14:10:...|2020-04-29|     

## Notebook visualization

Execute the cell below, then select the **Chart** view to display a graph. Use the **view options** button in the Chart view to adjust the chart rendering.

In [10]:
# Render the dataframe in the notebook output (switch to Chart view there)
display(df_curated)

## Write the dataframe to Parquet file(s)

Saving the processed results as files makes the data reusable by other applications, not only by the load into the data warehouse.

In [11]:
# Define the output path for the curated data
curated_path='abfss://datalake@'+account_name+'.dfs.core.windows.net/curated/sensor_spark/'

# Write the dataframe to Parquet
df_curated\
    .write\
    .format('parquet')\
    .mode("overwrite")\
    .save(curated_path)


## Create an external Spark table

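Create a Spark table over the Parquet files written to storage in the previous step. Because the table is defined with a `LOCATION`, Spark treats it as external: dropping the table removes only its metadata and leaves the files in place.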


In [12]:
spark.sql("""
    DROP TABLE IF EXISTS sensortablespark
  """)
spark.sql("""
    CREATE TABLE sensortablespark
    USING Parquet
    LOCATION '{}'
""".format(curated_path)
)

DataFrame[]
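
The statement above is built with `str.format`, so whatever is in `curated_path` ends up inside the SQL text. As a cautious sketch, a helper can reject inputs that would break the statement before it reaches `spark.sql`. The function name `create_table_sql`, the validation rules, and the account name `examplelake` are assumptions made for illustration, not part of Spark or of the original notebook.

```python
import re


def create_table_sql(table_name, location):
    """Build a CREATE TABLE ... USING Parquet statement after basic checks."""
    # Allow only simple identifiers so the table name cannot carry extra SQL.
    if not re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', table_name):
        raise ValueError('Unexpected table name: {!r}'.format(table_name))
    # Reject characters that would end or escape the quoted location literal.
    if "'" in location or '\\' in location:
        raise ValueError('Location must not contain quotes or backslashes.')
    return "CREATE TABLE {} USING Parquet LOCATION '{}'".format(table_name, location)


# 'examplelake' is a made-up account name used only for illustration.
sql = create_table_sql(
    'sensortablespark',
    'abfss://datalake@examplelake.dfs.core.windows.net/curated/sensor_spark/')
print(sql)
```

The returned string would then be passed to `spark.sql(...)` in place of the formatted literal.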

## Selecting a development language with cell magics

You can select a language other than the default on a cell-by-cell basis:

- `%%pyspark` -> Python
- `%%spark` -> Scala
- `%%sql` -> SQL

In [13]:
%%sql
SELECT
    *
FROM
    sensortablespark
LIMIT 10