# A Complete Solution to the BackBlaze.com Kaggle Problem

## Step Four.  Adding features to the data

## Table of Contents

1. [Introduction](#10)<br>
2. [Establish environment and parameters](#20)<br>
3. [Create lagged features](#30)<br>
4. [Create Max, Min, Sum, Mean and Relative Change Features](#40)<br>
5. [Append mean encoded data variables](#50)<br>
6. [Create even more features](#60)<br>
7. [Export to Parquet file for step 5](#70)<br>


### 1.0 Introduction <a id="10"></a>

Note that this is the fourth step of a multi-step solution.

BackBlaze.com, you are the "GOAT." You are the "cat's meow." You "Rock the House." In case you don't know why BackBlaze.com is so totally "kick-ass," they open-sourced a vast set of hard drive information a few years ago and continue updating it each quarter. What a treasure trove of superb data. BackBlaze.com, thank you from the bottom of my heart.

The backblaze.com data includes operational metrics from hard drives with an indicator of a hard-drive failure.  It is an excellent source for teaching techniques related to machine failure.  Again, thank you for making this available to the open-source community.
Here is a link to the data.

https://www.backblaze.com/b2/hard-drive-test-data.html

My goal in this series of articles is not to give the best solution with the highest AUC.  My goal is to show you how to approach equipment failure problems and build solutions that reflect realistic accuracy, and provide an easy transition from the lab to the real world.

I will use a Spark/Python Jupyter notebook inside IBM's Watson Studio on the cloud as a tool in this discussion.

https://www.ibm.com/cloud/watson-studio

I will also be using cloud object storage on the IBM cloud.

https://www.ibm.com/cloud/block-storage


In this fourth article in the series, we will create features and append them to the data. We will also append the features we created in Step Two.

I created these notebooks with a runtime using 1 driver with 1 vCPU and 4 GB RAM, and 2 executors each with 1 vCPU and 4 GB RAM. This is available for free on the IBM Cloud. Some of the notebooks take a few hours to run, so you'll need to schedule them to run as jobs.

https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/schedule-task.html

### 2.0 Establish environment and parameters <a id="20"></a>

Import the relevant libraries, connect to object storage, and import the data from the previous step.

In [1]:
from functools import reduce
from pyspark.sql import DataFrame

import pyspark.sql.functions as F
from pyspark.sql.functions import *

from pyspark.sql.functions import when

from pyspark.sql.functions import rand
from pyspark.sql.functions import lit

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col, round

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number


import pandas as pd


spark.conf.set("spark.sql.broadcastTimeout",  7200000)
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")


In [2]:
# The code was removed by Watson Studio for sharing.

In [3]:
df = spark.read.parquet(cos.url('data_2020_final.parquet', 'backblazedata-donotdelete-pr-cij57grgkoctem'))

Reformat numeric values to double

In [4]:
for c in [ 'REALLOCATED_SECTOR_COUNT_N',
 'REPORTED_UNCORRECTABLE_ERRORS_N',
 'COMMAND_TIMEOUT_N',
 'CURRENT_PENDING_SECTOR_COUNT_N',
 'POWER_ON_HOURS_N',
 'REALLOCATED_SECTOR_COUNT_R',
 'REPORTED_UNCORRECTABLE_ERRORS_R',
 'COMMAND_TIMEOUT_R',
 'CURRENT_PENDING_SECTOR_COUNT_R',
 'POWER_ON_HOURS_R','FAILURE','CAPACITY_BYTES']:
    # cast each listed column to double so later arithmetic and comparisons are numeric
    df=df.withColumn(c, df[c].cast('double'))

### 3.0 Create lagged features  <a id="30"></a>


Number each drive's records consecutively in date order, restarting at 1 for every serial number.

In [5]:

df=df.sort("SERIAL_NUMBER", "DATE")
windowSpec  = Window.partitionBy("SERIAL_NUMBER").orderBy("DATE")

df=df.withColumn("ROW",row_number().over(windowSpec))
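To make the effect of `row_number()` over this window concrete, here is a minimal plain-Python sketch of the same idea on a handful of made-up records. The helper name `add_row_numbers` and the sample serial numbers and dates are illustrative assumptions, not part of the notebook.

```python
from itertools import groupby
from operator import itemgetter


def add_row_numbers(records):
    """Number each drive's records 1, 2, 3, ... in date order (hypothetical helper)."""
    # Sort by drive, then by date, like partitionBy("SERIAL_NUMBER").orderBy("DATE").
    ordered = sorted(records, key=itemgetter("SERIAL_NUMBER", "DATE"))
    numbered = []
    for _, drive_records in groupby(ordered, key=itemgetter("SERIAL_NUMBER")):
        # The counter restarts at 1 for every serial number.
        for row, record in enumerate(drive_records, start=1):
            numbered.append({**record, "ROW": row})
    return numbered


# Made-up sample data: two drives, deliberately out of order.
sample = [
    {"SERIAL_NUMBER": "B", "DATE": "2020-01-02"},
    {"SERIAL_NUMBER": "A", "DATE": "2020-01-03"},
    {"SERIAL_NUMBER": "A", "DATE": "2020-01-01"},
    {"SERIAL_NUMBER": "B", "DATE": "2020-01-01"},
]
numbered = add_row_numbers(sample)
```

Drive `A` in the sample skips a day, yet its records are still numbered 1 and 2. `ROW` counts records, not calendar days, so anything built on it is row-based.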

In [6]:
dfx=df

Create features lagged by 7 rows

In [7]:

df_7 = dfx.withColumn('ROW_7', ( dfx['ROW'] + 7 ) )
df_7 = df_7.drop("ROW")
df_7 = df_7.drop("DATE")
df_7 = df_7.drop("MODEL")
df_7 = df_7.drop("MANUFACTURER")
df_7 = df_7.drop("CAPACITY_BYTES")
df_7 = df_7.drop("FAILURE")

df_7=df_7.withColumnRenamed("REALLOCATED_SECTOR_COUNT_R","REALLOCATED_SECTOR_COUNT_R_7")
df_7=df_7.withColumnRenamed("SERIAL_NUMBER","SERIAL_NUMBER_7")
df_7=df_7.withColumnRenamed("REALLOCATED_SECTOR_COUNT_N","REALLOCATED_SECTOR_COUNT_N_7")
df_7=df_7.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_N","REPORTED_UNCORRECTABLE_ERRORS_N_7")
df_7=df_7.withColumnRenamed("COMMAND_TIMEOUT_N","COMMAND_TIMEOUT_N_7")
df_7=df_7.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_N","CURRENT_PENDING_SECTOR_COUNT_N_7")
df_7=df_7.withColumnRenamed("POWER_ON_HOURS_N","POWER_ON_HOURS_N_7")
df_7=df_7.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_R","REPORTED_UNCORRECTABLE_ERRORS_R_7")
df_7=df_7.withColumnRenamed("COMMAND_TIMEOUT_R","COMMAND_TIMEOUT_R_7")
df_7=df_7.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_R","CURRENT_PENDING_SECTOR_COUNT_R_7")
df_7=df_7.withColumnRenamed("POWER_ON_HOURS_R","POWER_ON_HOURS_R_7")



df=df.join(df_7,(((df.ROW) ==  (df_7.ROW_7)) & ((df.SERIAL_NUMBER) ==  (df_7.SERIAL_NUMBER_7))),"left")
df = df.drop("SERIAL_NUMBER_7")
df = df.drop("ROW_7")
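The cell above builds the lag with a self-join: a copy of the data has its `ROW` shifted forward by 7, so joining on `ROW` and `SERIAL_NUMBER` pairs each record with the one 7 rows earlier on the same drive. The plain-Python sketch below mimics that logic on made-up readings with a lag of 2 so the result is easy to follow. The helper `lag_by_self_join` and the sample values are hypothetical.

```python
def lag_by_self_join(rows, column, k):
    """Attach the value of `column` from k rows earlier on the same drive (hypothetical helper)."""
    # Right side of the join: shift ROW forward by k, mirroring dfx['ROW'] + 7.
    shifted = {(r["SERIAL_NUMBER"], r["ROW"] + k): r[column] for r in rows}
    # Left join on (SERIAL_NUMBER, ROW): rows with no match keep None.
    return [
        {**r, f"{column}_{k}": shifted.get((r["SERIAL_NUMBER"], r["ROW"]))}
        for r in rows
    ]


# Made-up readings for one drive, already numbered 1..4.
rows = [
    {"SERIAL_NUMBER": "A", "ROW": n, "POWER_ON_HOURS_R": 100.0 + 24 * n}
    for n in range(1, 5)
]
lagged = lag_by_self_join(rows, "POWER_ON_HOURS_R", 2)
```

The first `k` records of every drive have nothing to join to, which is why the `"left"` join matters: those records are kept, with nulls in the lagged columns.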


Create features lagged by 6 rows

In [8]:

df_6 = dfx.withColumn('ROW_6', ( dfx['ROW'] + 6 ) )
df_6 = df_6.drop("ROW")
df_6 = df_6.drop("DATE")
df_6 = df_6.drop("MODEL")
df_6 = df_6.drop("MANUFACTURER")

df_6 = df_6.drop("CAPACITY_BYTES")
df_6 = df_6.drop("FAILURE")


df_6=df_6.withColumnRenamed("REALLOCATED_SECTOR_COUNT_R","REALLOCATED_SECTOR_COUNT_R_6")
df_6=df_6.withColumnRenamed("SERIAL_NUMBER","SERIAL_NUMBER_6")
df_6=df_6.withColumnRenamed("REALLOCATED_SECTOR_COUNT_N","REALLOCATED_SECTOR_COUNT_N_6")
df_6=df_6.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_N","REPORTED_UNCORRECTABLE_ERRORS_N_6")
df_6=df_6.withColumnRenamed("COMMAND_TIMEOUT_N","COMMAND_TIMEOUT_N_6")
df_6=df_6.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_N","CURRENT_PENDING_SECTOR_COUNT_N_6")
df_6=df_6.withColumnRenamed("POWER_ON_HOURS_N","POWER_ON_HOURS_N_6")
df_6=df_6.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_R","REPORTED_UNCORRECTABLE_ERRORS_R_6")
df_6=df_6.withColumnRenamed("COMMAND_TIMEOUT_R","COMMAND_TIMEOUT_R_6")
df_6=df_6.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_R","CURRENT_PENDING_SECTOR_COUNT_R_6")
df_6=df_6.withColumnRenamed("POWER_ON_HOURS_R","POWER_ON_HOURS_R_6")

df=df.join(df_6,(((df.ROW) ==  (df_6.ROW_6)) & ((df.SERIAL_NUMBER) ==  (df_6.SERIAL_NUMBER_6))),"left")
df = df.drop("SERIAL_NUMBER_6")
df = df.drop("ROW_6")


Create features lagged by 5 rows

In [9]:

df_5 = dfx.withColumn('ROW_5', ( dfx['ROW'] + 5 ) )
df_5 = df_5.drop("ROW")
df_5 = df_5.drop("DATE")
df_5 = df_5.drop("MODEL")
df_5 = df_5.drop("MANUFACTURER")

df_5 = df_5.drop("CAPACITY_BYTES")
df_5 = df_5.drop("FAILURE")
df_5=df_5.withColumnRenamed("REALLOCATED_SECTOR_COUNT_R","REALLOCATED_SECTOR_COUNT_R_5")
df_5=df_5.withColumnRenamed("SERIAL_NUMBER","SERIAL_NUMBER_5")
df_5=df_5.withColumnRenamed("REALLOCATED_SECTOR_COUNT_N","REALLOCATED_SECTOR_COUNT_N_5")
df_5=df_5.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_N","REPORTED_UNCORRECTABLE_ERRORS_N_5")
df_5=df_5.withColumnRenamed("COMMAND_TIMEOUT_N","COMMAND_TIMEOUT_N_5")
df_5=df_5.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_N","CURRENT_PENDING_SECTOR_COUNT_N_5")
df_5=df_5.withColumnRenamed("POWER_ON_HOURS_N","POWER_ON_HOURS_N_5")
df_5=df_5.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_R","REPORTED_UNCORRECTABLE_ERRORS_R_5")
df_5=df_5.withColumnRenamed("COMMAND_TIMEOUT_R","COMMAND_TIMEOUT_R_5")
df_5=df_5.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_R","CURRENT_PENDING_SECTOR_COUNT_R_5")
df_5=df_5.withColumnRenamed("POWER_ON_HOURS_R","POWER_ON_HOURS_R_5")

df=df.join(df_5,(((df.ROW) ==  (df_5.ROW_5)) & ((df.SERIAL_NUMBER) ==  (df_5.SERIAL_NUMBER_5))),"left")
df = df.drop("SERIAL_NUMBER_5")
df = df.drop("ROW_5")



Create features lagged by 4 rows

In [10]:

df_4 = dfx.withColumn('ROW_4', ( dfx['ROW'] + 4 ) )
df_4 = df_4.drop("ROW")
df_4 = df_4.drop("DATE")
df_4 = df_4.drop("MODEL")
df_4 = df_4.drop("MANUFACTURER")

df_4 = df_4.drop("CAPACITY_BYTES")
df_4 = df_4.drop("FAILURE")
df_4=df_4.withColumnRenamed("REALLOCATED_SECTOR_COUNT_R","REALLOCATED_SECTOR_COUNT_R_4")
df_4=df_4.withColumnRenamed("SERIAL_NUMBER","SERIAL_NUMBER_4")
df_4=df_4.withColumnRenamed("REALLOCATED_SECTOR_COUNT_N","REALLOCATED_SECTOR_COUNT_N_4")
df_4=df_4.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_N","REPORTED_UNCORRECTABLE_ERRORS_N_4")
df_4=df_4.withColumnRenamed("COMMAND_TIMEOUT_N","COMMAND_TIMEOUT_N_4")
df_4=df_4.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_N","CURRENT_PENDING_SECTOR_COUNT_N_4")
df_4=df_4.withColumnRenamed("POWER_ON_HOURS_N","POWER_ON_HOURS_N_4")
df_4=df_4.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_R","REPORTED_UNCORRECTABLE_ERRORS_R_4")
df_4=df_4.withColumnRenamed("COMMAND_TIMEOUT_R","COMMAND_TIMEOUT_R_4")
df_4=df_4.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_R","CURRENT_PENDING_SECTOR_COUNT_R_4")
df_4=df_4.withColumnRenamed("POWER_ON_HOURS_R","POWER_ON_HOURS_R_4")

df=df.join(df_4,(((df.ROW) ==  (df_4.ROW_4)) & ((df.SERIAL_NUMBER) ==  (df_4.SERIAL_NUMBER_4))),"left")
df = df.drop("SERIAL_NUMBER_4")
df = df.drop("ROW_4")


Create features lagged by 3 rows

In [11]:
df_3 = dfx.withColumn('ROW_3', ( dfx['ROW'] + 3 ) )
df_3 = df_3.drop("ROW")
df_3 = df_3.drop("DATE")
df_3 = df_3.drop("MODEL")
df_3 = df_3.drop("MANUFACTURER")
df_3 = df_3.drop("CAPACITY_BYTES")
df_3 = df_3.drop("FAILURE")

df_3=df_3.withColumnRenamed("REALLOCATED_SECTOR_COUNT_R","REALLOCATED_SECTOR_COUNT_R_3")
df_3=df_3.withColumnRenamed("SERIAL_NUMBER","SERIAL_NUMBER_3")
df_3=df_3.withColumnRenamed("REALLOCATED_SECTOR_COUNT_N","REALLOCATED_SECTOR_COUNT_N_3")
df_3=df_3.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_N","REPORTED_UNCORRECTABLE_ERRORS_N_3")
df_3=df_3.withColumnRenamed("COMMAND_TIMEOUT_N","COMMAND_TIMEOUT_N_3")
df_3=df_3.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_N","CURRENT_PENDING_SECTOR_COUNT_N_3")
df_3=df_3.withColumnRenamed("POWER_ON_HOURS_N","POWER_ON_HOURS_N_3")
df_3=df_3.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_R","REPORTED_UNCORRECTABLE_ERRORS_R_3")
df_3=df_3.withColumnRenamed("COMMAND_TIMEOUT_R","COMMAND_TIMEOUT_R_3")
df_3=df_3.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_R","CURRENT_PENDING_SECTOR_COUNT_R_3")
df_3=df_3.withColumnRenamed("POWER_ON_HOURS_R","POWER_ON_HOURS_R_3")

df=df.join(df_3,(((df.ROW) ==  (df_3.ROW_3)) & ((df.SERIAL_NUMBER) ==  (df_3.SERIAL_NUMBER_3))),"left")
df = df.drop("SERIAL_NUMBER_3")
df = df.drop("ROW_3")


Create features lagged by 2 rows

In [12]:
df_2 = dfx.withColumn('ROW_2', ( dfx['ROW'] + 2 ) )
df_2 = df_2.drop("ROW")
df_2 = df_2.drop("DATE")
df_2 = df_2.drop("MODEL")
df_2 = df_2.drop("MANUFACTURER")
df_2 = df_2.drop("CAPACITY_BYTES")
df_2 = df_2.drop("FAILURE")
df_2=df_2.withColumnRenamed("REALLOCATED_SECTOR_COUNT_R","REALLOCATED_SECTOR_COUNT_R_2")
df_2=df_2.withColumnRenamed("SERIAL_NUMBER","SERIAL_NUMBER_2")
df_2=df_2.withColumnRenamed("REALLOCATED_SECTOR_COUNT_N","REALLOCATED_SECTOR_COUNT_N_2")
df_2=df_2.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_N","REPORTED_UNCORRECTABLE_ERRORS_N_2")
df_2=df_2.withColumnRenamed("COMMAND_TIMEOUT_N","COMMAND_TIMEOUT_N_2")
df_2=df_2.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_N","CURRENT_PENDING_SECTOR_COUNT_N_2")
df_2=df_2.withColumnRenamed("POWER_ON_HOURS_N","POWER_ON_HOURS_N_2")
df_2=df_2.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_R","REPORTED_UNCORRECTABLE_ERRORS_R_2")
df_2=df_2.withColumnRenamed("COMMAND_TIMEOUT_R","COMMAND_TIMEOUT_R_2")
df_2=df_2.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_R","CURRENT_PENDING_SECTOR_COUNT_R_2")
df_2=df_2.withColumnRenamed("POWER_ON_HOURS_R","POWER_ON_HOURS_R_2")

df=df.join(df_2,(((df.ROW) ==  (df_2.ROW_2)) & ((df.SERIAL_NUMBER) ==  (df_2.SERIAL_NUMBER_2))),"left")
df = df.drop("SERIAL_NUMBER_2")
df = df.drop("ROW_2")


Create features lagged by 1 row

In [13]:
df_1 = dfx.withColumn('ROW_1', ( dfx['ROW'] + 1 ) )
df_1 = df_1.drop("ROW")
df_1 = df_1.drop("DATE")
df_1 = df_1.drop("MODEL")
df_1 = df_1.drop("MANUFACTURER")
df_1 = df_1.drop("CAPACITY_BYTES")
df_1 = df_1.drop("FAILURE")
df_1=df_1.withColumnRenamed("REALLOCATED_SECTOR_COUNT_R","REALLOCATED_SECTOR_COUNT_R_1")
df_1=df_1.withColumnRenamed("SERIAL_NUMBER","SERIAL_NUMBER_1")
df_1=df_1.withColumnRenamed("REALLOCATED_SECTOR_COUNT_N","REALLOCATED_SECTOR_COUNT_N_1")
df_1=df_1.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_N","REPORTED_UNCORRECTABLE_ERRORS_N_1")
df_1=df_1.withColumnRenamed("COMMAND_TIMEOUT_N","COMMAND_TIMEOUT_N_1")
df_1=df_1.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_N","CURRENT_PENDING_SECTOR_COUNT_N_1")
df_1=df_1.withColumnRenamed("POWER_ON_HOURS_N","POWER_ON_HOURS_N_1")
df_1=df_1.withColumnRenamed("REPORTED_UNCORRECTABLE_ERRORS_R","REPORTED_UNCORRECTABLE_ERRORS_R_1")
df_1=df_1.withColumnRenamed("COMMAND_TIMEOUT_R","COMMAND_TIMEOUT_R_1")
df_1=df_1.withColumnRenamed("CURRENT_PENDING_SECTOR_COUNT_R","CURRENT_PENDING_SECTOR_COUNT_R_1")
df_1=df_1.withColumnRenamed("POWER_ON_HOURS_R","POWER_ON_HOURS_R_1")
df=df.join(df_1,(((df.ROW) ==  (df_1.ROW_1)) & ((df.SERIAL_NUMBER) ==  (df_1.SERIAL_NUMBER_1))),"left")
df = df.drop("SERIAL_NUMBER_1")
df = df.drop("ROW_1")
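The seven lag cells differ only in the lag number, so they could be generated in a loop. The sketch below is one possible alternative, not the notebook's method: it swaps the self-joins for PySpark's `lag` window function, which should yield the same columns as long as `DATE` is unique within each serial number. The names `METRICS`, `lagged_column_specs` and `add_lagged_features` are assumptions introduced for illustration, and the Spark part has not been benchmarked against the original.

```python
# The ten metric columns that the notebook lags.
METRICS = [
    "REALLOCATED_SECTOR_COUNT_N", "REPORTED_UNCORRECTABLE_ERRORS_N",
    "COMMAND_TIMEOUT_N", "CURRENT_PENDING_SECTOR_COUNT_N", "POWER_ON_HOURS_N",
    "REALLOCATED_SECTOR_COUNT_R", "REPORTED_UNCORRECTABLE_ERRORS_R",
    "COMMAND_TIMEOUT_R", "CURRENT_PENDING_SECTOR_COUNT_R", "POWER_ON_HOURS_R",
]


def lagged_column_specs(columns, max_lag=7):
    """Return (source, lag, new_name) triples, e.g. ('POWER_ON_HOURS_R', 7, 'POWER_ON_HOURS_R_7')."""
    return [(c, k, f"{c}_{k}") for k in range(max_lag, 0, -1) for c in columns]


def add_lagged_features(df, columns=METRICS, max_lag=7):
    """Append every lagged column to a Spark DataFrame in one pass (hypothetical helper)."""
    # pyspark is imported lazily so lagged_column_specs can be used without Spark.
    from pyspark.sql import Window
    import pyspark.sql.functions as F

    w = Window.partitionBy("SERIAL_NUMBER").orderBy("DATE")
    lagged = [
        F.lag(src, k).over(w).alias(name)
        for src, k, name in lagged_column_specs(columns, max_lag)
    ]
    return df.select("*", *lagged)


specs = lagged_column_specs(METRICS)
```

With this in place, `df = add_lagged_features(df)` would stand in for cells 7 through 13. Keeping the name generation in its own function makes the naming scheme easy to check before any Spark job runs.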


### 4.0 Create Max, Min, Sum, Mean and Relative Change Features <a id="40"></a>

In [14]:
from pyspark.sql.functions import col

from pyspark.sql.functions import greatest
from pyspark.sql.functions import  least
from pyspark.sql.functions import  mean
# max of last 8
df = df.withColumn('POWER_ON_HOURS_R_MAX', greatest('POWER_ON_HOURS_R','POWER_ON_HOURS_R_7','POWER_ON_HOURS_R_6','POWER_ON_HOURS_R_5',\
                                                    'POWER_ON_HOURS_R_4','POWER_ON_HOURS_R_3','POWER_ON_HOURS_R_2','POWER_ON_HOURS_R_1'))
# min of last 8
df = df.withColumn('POWER_ON_HOURS_R_MIN', least('POWER_ON_HOURS_R','POWER_ON_HOURS_R_7','POWER_ON_HOURS_R_6','POWER_ON_HOURS_R_5',\
                                                    'POWER_ON_HOURS_R_4','POWER_ON_HOURS_R_3','POWER_ON_HOURS_R_2','POWER_ON_HOURS_R_1'))
# sum of last 8
df = df.withColumn('POWER_ON_HOURS_R_SUM', (df.POWER_ON_HOURS_R+df.POWER_ON_HOURS_R_7+df.POWER_ON_HOURS_R_6+df.POWER_ON_HOURS_R_5+\
                                                    df.POWER_ON_HOURS_R_4+df.POWER_ON_HOURS_R_3+df.POWER_ON_HOURS_R_2+df.POWER_ON_HOURS_R_1))

# mean of last 8 periods
df = df.withColumn('POWER_ON_HOURS_R_MEAN', df.POWER_ON_HOURS_R_SUM/8)
# spread of last 8: running max divided by running min (a ratio, not a statistical variance)
df = df.withColumn('POWER_ON_HOURS_R_VAR', df.POWER_ON_HOURS_R_MAX/df.POWER_ON_HOURS_R_MIN)
# current value divided by running mean
df = df.withColumn('POWER_ON_HOURS_R_DELTA', df.POWER_ON_HOURS_R/df.POWER_ON_HOURS_R_MEAN)
# running mean divided by running max
df = df.withColumn('POWER_ON_HOURS_R_VARX', df.POWER_ON_HOURS_R_MEAN/df.POWER_ON_HOURS_R_MAX)
# running mean divided by running min
df = df.withColumn('POWER_ON_HOURS_R_VARN', df.POWER_ON_HOURS_R_MEAN/df.POWER_ON_HOURS_R_MIN)
# current value divided by running max
df = df.withColumn('POWER_ON_HOURS_R_DELTAX', df.POWER_ON_HOURS_R/df.POWER_ON_HOURS_R_MAX)
# current value divided by running min
df = df.withColumn('POWER_ON_HOURS_R_DELTAN', df.POWER_ON_HOURS_R/df.POWER_ON_HOURS_R_MIN)
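The ten features in this cell behave differently when some lags are missing. Spark's `greatest` and `least` skip nulls, while `+` propagates them. The plain-Python sketch below reproduces those semantics for a single row, with `None` standing in for null. The function `window_features` is a hypothetical illustration, and the division rule assumes Spark's non-ANSI mode, where dividing by zero yields null rather than an error.

```python
def window_features(values):
    """values[0] is the current reading, values[1:] the lagged ones; None marks a missing lag."""
    present = [v for v in values if v is not None]
    # greatest()/least() skip nulls and are null only if every input is null.
    vmax = max(present) if present else None
    vmin = min(present) if present else None
    # '+' propagates nulls: a single missing lag makes the whole sum null.
    vsum = sum(values) if len(present) == len(values) else None
    vmean = None if vsum is None else vsum / len(values)

    def ratio(num, den):
        # Assumption: non-ANSI Spark SQL, where dividing by zero yields null.
        if num is None or den is None or den == 0:
            return None
        return num / den

    cur = values[0]
    return {
        "MAX": vmax, "MIN": vmin, "SUM": vsum, "MEAN": vmean,
        "VAR": ratio(vmax, vmin),      # running max / running min
        "DELTA": ratio(cur, vmean),    # current / running mean
        "VARX": ratio(vmean, vmax),    # running mean / running max
        "VARN": ratio(vmean, vmin),    # running mean / running min
        "DELTAX": ratio(cur, vmax),    # current / running max
        "DELTAN": ratio(cur, vmin),    # current / running min
    }


full = window_features([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
young = window_features([5.0, 4.0, None, None, None, None, None, None])
flat_zero = window_features([0.0] * 8)
```

For a drive with only two records (`young`), `MAX`, `MIN` and their ratios are still defined, but `SUM`, `MEAN` and every feature derived from the mean are null. For a counter that stays at zero (`flat_zero`), all six ratios are null because every denominator is zero.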


In [15]:
df = df.withColumn('POWER_ON_HOURS_N_MAX', greatest('POWER_ON_HOURS_N','POWER_ON_HOURS_N_7','POWER_ON_HOURS_N_6','POWER_ON_HOURS_N_5',\
                                                    'POWER_ON_HOURS_N_4','POWER_ON_HOURS_N_3','POWER_ON_HOURS_N_2','POWER_ON_HOURS_N_1'))

df = df.withColumn('POWER_ON_HOURS_N_MIN', least('POWER_ON_HOURS_N','POWER_ON_HOURS_N_7','POWER_ON_HOURS_N_6','POWER_ON_HOURS_N_5',\
                                                    'POWER_ON_HOURS_N_4','POWER_ON_HOURS_N_3','POWER_ON_HOURS_N_2','POWER_ON_HOURS_N_1'))

df = df.withColumn('POWER_ON_HOURS_N_SUM', (df.POWER_ON_HOURS_N+df.POWER_ON_HOURS_N_7+df.POWER_ON_HOURS_N_6+df.POWER_ON_HOURS_N_5+\
                                                    df.POWER_ON_HOURS_N_4+df.POWER_ON_HOURS_N_3+df.POWER_ON_HOURS_N_2+df.POWER_ON_HOURS_N_1))


df = df.withColumn('POWER_ON_HOURS_N_MEAN', df.POWER_ON_HOURS_N_SUM/8)
df = df.withColumn('POWER_ON_HOURS_N_VAR', df.POWER_ON_HOURS_N_MAX/df.POWER_ON_HOURS_N_MIN)
df = df.withColumn('POWER_ON_HOURS_N_DELTA', df.POWER_ON_HOURS_N/df.POWER_ON_HOURS_N_MEAN)
df = df.withColumn('POWER_ON_HOURS_N_VARX', df.POWER_ON_HOURS_N_MEAN/df.POWER_ON_HOURS_N_MAX)
df = df.withColumn('POWER_ON_HOURS_N_VARN', df.POWER_ON_HOURS_N_MEAN/df.POWER_ON_HOURS_N_MIN)
df = df.withColumn('POWER_ON_HOURS_N_DELTAX', df.POWER_ON_HOURS_N/df.POWER_ON_HOURS_N_MAX)
df = df.withColumn('POWER_ON_HOURS_N_DELTAN', df.POWER_ON_HOURS_N/df.POWER_ON_HOURS_N_MIN)


In [16]:
df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_MAX', greatest('REALLOCATED_SECTOR_COUNT_R','REALLOCATED_SECTOR_COUNT_R_7','REALLOCATED_SECTOR_COUNT_R_6','REALLOCATED_SECTOR_COUNT_R_5',\
                                                    'REALLOCATED_SECTOR_COUNT_R_4','REALLOCATED_SECTOR_COUNT_R_3','REALLOCATED_SECTOR_COUNT_R_2','REALLOCATED_SECTOR_COUNT_R_1'))

df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_MIN', least('REALLOCATED_SECTOR_COUNT_R','REALLOCATED_SECTOR_COUNT_R_7','REALLOCATED_SECTOR_COUNT_R_6','REALLOCATED_SECTOR_COUNT_R_5',\
                                                    'REALLOCATED_SECTOR_COUNT_R_4','REALLOCATED_SECTOR_COUNT_R_3','REALLOCATED_SECTOR_COUNT_R_2','REALLOCATED_SECTOR_COUNT_R_1'))

df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_SUM', (df.REALLOCATED_SECTOR_COUNT_R+df.REALLOCATED_SECTOR_COUNT_R_7+df.REALLOCATED_SECTOR_COUNT_R_6+df.REALLOCATED_SECTOR_COUNT_R_5+\
                                                    df.REALLOCATED_SECTOR_COUNT_R_4+df.REALLOCATED_SECTOR_COUNT_R_3+df.REALLOCATED_SECTOR_COUNT_R_2+df.REALLOCATED_SECTOR_COUNT_R_1))


df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_MEAN', df.REALLOCATED_SECTOR_COUNT_R_SUM/8)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_VAR', df.REALLOCATED_SECTOR_COUNT_R_MAX/df.REALLOCATED_SECTOR_COUNT_R_MIN)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_DELTA', df.REALLOCATED_SECTOR_COUNT_R/df.REALLOCATED_SECTOR_COUNT_R_MEAN)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_VARX', df.REALLOCATED_SECTOR_COUNT_R_MEAN/df.REALLOCATED_SECTOR_COUNT_R_MAX)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_VARN', df.REALLOCATED_SECTOR_COUNT_R_MEAN/df.REALLOCATED_SECTOR_COUNT_R_MIN)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_DELTAX', df.REALLOCATED_SECTOR_COUNT_R/df.REALLOCATED_SECTOR_COUNT_R_MAX)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_DELTAN', df.REALLOCATED_SECTOR_COUNT_R/df.REALLOCATED_SECTOR_COUNT_R_MIN)


In [17]:
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_MAX', greatest('REPORTED_UNCORRECTABLE_ERRORS_N','REPORTED_UNCORRECTABLE_ERRORS_N_7','REPORTED_UNCORRECTABLE_ERRORS_N_6','REPORTED_UNCORRECTABLE_ERRORS_N_5',\
                                                    'REPORTED_UNCORRECTABLE_ERRORS_N_4','REPORTED_UNCORRECTABLE_ERRORS_N_3','REPORTED_UNCORRECTABLE_ERRORS_N_2','REPORTED_UNCORRECTABLE_ERRORS_N_1'))

df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_MIN', least('REPORTED_UNCORRECTABLE_ERRORS_N','REPORTED_UNCORRECTABLE_ERRORS_N_7','REPORTED_UNCORRECTABLE_ERRORS_N_6','REPORTED_UNCORRECTABLE_ERRORS_N_5',\
                                                    'REPORTED_UNCORRECTABLE_ERRORS_N_4','REPORTED_UNCORRECTABLE_ERRORS_N_3','REPORTED_UNCORRECTABLE_ERRORS_N_2','REPORTED_UNCORRECTABLE_ERRORS_N_1'))

df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_SUM', (df.REPORTED_UNCORRECTABLE_ERRORS_N+df.REPORTED_UNCORRECTABLE_ERRORS_N_7+df.REPORTED_UNCORRECTABLE_ERRORS_N_6+df.REPORTED_UNCORRECTABLE_ERRORS_N_5+\
                                                    df.REPORTED_UNCORRECTABLE_ERRORS_N_4+df.REPORTED_UNCORRECTABLE_ERRORS_N_3+df.REPORTED_UNCORRECTABLE_ERRORS_N_2+df.REPORTED_UNCORRECTABLE_ERRORS_N_1))


df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_MEAN', df.REPORTED_UNCORRECTABLE_ERRORS_N_SUM/8)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_VAR', df.REPORTED_UNCORRECTABLE_ERRORS_N_MAX/df.REPORTED_UNCORRECTABLE_ERRORS_N_MIN)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_DELTA', df.REPORTED_UNCORRECTABLE_ERRORS_N/df.REPORTED_UNCORRECTABLE_ERRORS_N_MEAN)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_VARX', df.REPORTED_UNCORRECTABLE_ERRORS_N_MEAN/df.REPORTED_UNCORRECTABLE_ERRORS_N_MAX)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_VARN', df.REPORTED_UNCORRECTABLE_ERRORS_N_MEAN/df.REPORTED_UNCORRECTABLE_ERRORS_N_MIN)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_DELTAX', df.REPORTED_UNCORRECTABLE_ERRORS_N/df.REPORTED_UNCORRECTABLE_ERRORS_N_MAX)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_DELTAN', df.REPORTED_UNCORRECTABLE_ERRORS_N/df.REPORTED_UNCORRECTABLE_ERRORS_N_MIN)


In [18]:
df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_MAX', greatest('REALLOCATED_SECTOR_COUNT_N','REALLOCATED_SECTOR_COUNT_N_7','REALLOCATED_SECTOR_COUNT_N_6','REALLOCATED_SECTOR_COUNT_N_5',\
                                                    'REALLOCATED_SECTOR_COUNT_N_4','REALLOCATED_SECTOR_COUNT_N_3','REALLOCATED_SECTOR_COUNT_N_2','REALLOCATED_SECTOR_COUNT_N_1'))

df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_MIN', least('REALLOCATED_SECTOR_COUNT_N','REALLOCATED_SECTOR_COUNT_N_7','REALLOCATED_SECTOR_COUNT_N_6','REALLOCATED_SECTOR_COUNT_N_5',\
                                                    'REALLOCATED_SECTOR_COUNT_N_4','REALLOCATED_SECTOR_COUNT_N_3','REALLOCATED_SECTOR_COUNT_N_2','REALLOCATED_SECTOR_COUNT_N_1'))

df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_SUM', (df.REALLOCATED_SECTOR_COUNT_N+df.REALLOCATED_SECTOR_COUNT_N_7+df.REALLOCATED_SECTOR_COUNT_N_6+df.REALLOCATED_SECTOR_COUNT_N_5+\
                                                    df.REALLOCATED_SECTOR_COUNT_N_4+df.REALLOCATED_SECTOR_COUNT_N_3+df.REALLOCATED_SECTOR_COUNT_N_2+df.REALLOCATED_SECTOR_COUNT_N_1))


df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_MEAN', df.REALLOCATED_SECTOR_COUNT_N_SUM/8)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_VAR', df.REALLOCATED_SECTOR_COUNT_N_MAX/df.REALLOCATED_SECTOR_COUNT_N_MIN)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_DELTA', df.REALLOCATED_SECTOR_COUNT_N/df.REALLOCATED_SECTOR_COUNT_N_MEAN)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_VARX', df.REALLOCATED_SECTOR_COUNT_N_MEAN/df.REALLOCATED_SECTOR_COUNT_N_MAX)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_VARN', df.REALLOCATED_SECTOR_COUNT_N_MEAN/df.REALLOCATED_SECTOR_COUNT_N_MIN)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_DELTAX', df.REALLOCATED_SECTOR_COUNT_N/df.REALLOCATED_SECTOR_COUNT_N_MAX)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_DELTAN', df.REALLOCATED_SECTOR_COUNT_N/df.REALLOCATED_SECTOR_COUNT_N_MIN)


In [19]:
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_MAX', greatest('REPORTED_UNCORRECTABLE_ERRORS_R','REPORTED_UNCORRECTABLE_ERRORS_R_7','REPORTED_UNCORRECTABLE_ERRORS_R_6','REPORTED_UNCORRECTABLE_ERRORS_R_5',\
                                                    'REPORTED_UNCORRECTABLE_ERRORS_R_4','REPORTED_UNCORRECTABLE_ERRORS_R_3','REPORTED_UNCORRECTABLE_ERRORS_R_2','REPORTED_UNCORRECTABLE_ERRORS_R_1'))

df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_MIN', least('REPORTED_UNCORRECTABLE_ERRORS_R','REPORTED_UNCORRECTABLE_ERRORS_R_7','REPORTED_UNCORRECTABLE_ERRORS_R_6','REPORTED_UNCORRECTABLE_ERRORS_R_5',\
                                                    'REPORTED_UNCORRECTABLE_ERRORS_R_4','REPORTED_UNCORRECTABLE_ERRORS_R_3','REPORTED_UNCORRECTABLE_ERRORS_R_2','REPORTED_UNCORRECTABLE_ERRORS_R_1'))

df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_SUM', (df.REPORTED_UNCORRECTABLE_ERRORS_R+df.REPORTED_UNCORRECTABLE_ERRORS_R_7+df.REPORTED_UNCORRECTABLE_ERRORS_R_6+df.REPORTED_UNCORRECTABLE_ERRORS_R_5+\
                                                    df.REPORTED_UNCORRECTABLE_ERRORS_R_4+df.REPORTED_UNCORRECTABLE_ERRORS_R_3+df.REPORTED_UNCORRECTABLE_ERRORS_R_2+df.REPORTED_UNCORRECTABLE_ERRORS_R_1))


df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_MEAN', df.REPORTED_UNCORRECTABLE_ERRORS_R_SUM/8)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_VAR', df.REPORTED_UNCORRECTABLE_ERRORS_R_MAX/df.REPORTED_UNCORRECTABLE_ERRORS_R_MIN)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_DELTA', df.REPORTED_UNCORRECTABLE_ERRORS_R/df.REPORTED_UNCORRECTABLE_ERRORS_R_MEAN)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_VARX', df.REPORTED_UNCORRECTABLE_ERRORS_R_MEAN/df.REPORTED_UNCORRECTABLE_ERRORS_R_MAX)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_VARN', df.REPORTED_UNCORRECTABLE_ERRORS_R_MEAN/df.REPORTED_UNCORRECTABLE_ERRORS_R_MIN)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_DELTAX', df.REPORTED_UNCORRECTABLE_ERRORS_R/df.REPORTED_UNCORRECTABLE_ERRORS_R_MAX)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_DELTAN', df.REPORTED_UNCORRECTABLE_ERRORS_R/df.REPORTED_UNCORRECTABLE_ERRORS_R_MIN)


In [20]:
df = df.withColumn('COMMAND_TIMEOUT_R_MAX', greatest('COMMAND_TIMEOUT_R','COMMAND_TIMEOUT_R_7','COMMAND_TIMEOUT_R_6','COMMAND_TIMEOUT_R_5',\
                                                    'COMMAND_TIMEOUT_R_4','COMMAND_TIMEOUT_R_3','COMMAND_TIMEOUT_R_2','COMMAND_TIMEOUT_R_1'))

df = df.withColumn('COMMAND_TIMEOUT_R_MIN', least('COMMAND_TIMEOUT_R','COMMAND_TIMEOUT_R_7','COMMAND_TIMEOUT_R_6','COMMAND_TIMEOUT_R_5',\
                                                    'COMMAND_TIMEOUT_R_4','COMMAND_TIMEOUT_R_3','COMMAND_TIMEOUT_R_2','COMMAND_TIMEOUT_R_1'))

df = df.withColumn('COMMAND_TIMEOUT_R_SUM', (df.COMMAND_TIMEOUT_R+df.COMMAND_TIMEOUT_R_7+df.COMMAND_TIMEOUT_R_6+df.COMMAND_TIMEOUT_R_5+\
                                                    df.COMMAND_TIMEOUT_R_4+df.COMMAND_TIMEOUT_R_3+df.COMMAND_TIMEOUT_R_2+df.COMMAND_TIMEOUT_R_1))


df = df.withColumn('COMMAND_TIMEOUT_R_MEAN', df.COMMAND_TIMEOUT_R_SUM/8)
df = df.withColumn('COMMAND_TIMEOUT_R_VAR', df.COMMAND_TIMEOUT_R_MAX/df.COMMAND_TIMEOUT_R_MIN)
df = df.withColumn('COMMAND_TIMEOUT_R_DELTA', df.COMMAND_TIMEOUT_R/df.COMMAND_TIMEOUT_R_MEAN)
df = df.withColumn('COMMAND_TIMEOUT_R_VARX', df.COMMAND_TIMEOUT_R_MEAN/df.COMMAND_TIMEOUT_R_MAX)
df = df.withColumn('COMMAND_TIMEOUT_R_VARN', df.COMMAND_TIMEOUT_R_MEAN/df.COMMAND_TIMEOUT_R_MIN)
df = df.withColumn('COMMAND_TIMEOUT_R_DELTAX', df.COMMAND_TIMEOUT_R/df.COMMAND_TIMEOUT_R_MAX)
df = df.withColumn('COMMAND_TIMEOUT_R_DELTAN', df.COMMAND_TIMEOUT_R/df.COMMAND_TIMEOUT_R_MIN)


In [21]:
df = df.withColumn('COMMAND_TIMEOUT_N_MAX', greatest('COMMAND_TIMEOUT_N','COMMAND_TIMEOUT_N_7','COMMAND_TIMEOUT_N_6','COMMAND_TIMEOUT_N_5',\
                                                    'COMMAND_TIMEOUT_N_4','COMMAND_TIMEOUT_N_3','COMMAND_TIMEOUT_N_2','COMMAND_TIMEOUT_N_1'))

df = df.withColumn('COMMAND_TIMEOUT_N_MIN', least('COMMAND_TIMEOUT_N','COMMAND_TIMEOUT_N_7','COMMAND_TIMEOUT_N_6','COMMAND_TIMEOUT_N_5',\
                                                    'COMMAND_TIMEOUT_N_4','COMMAND_TIMEOUT_N_3','COMMAND_TIMEOUT_N_2','COMMAND_TIMEOUT_N_1'))

df = df.withColumn('COMMAND_TIMEOUT_N_SUM', (df.COMMAND_TIMEOUT_N+df.COMMAND_TIMEOUT_N_7+df.COMMAND_TIMEOUT_N_6+df.COMMAND_TIMEOUT_N_5+\
                                                    df.COMMAND_TIMEOUT_N_4+df.COMMAND_TIMEOUT_N_3+df.COMMAND_TIMEOUT_N_2+df.COMMAND_TIMEOUT_N_1))


df = df.withColumn('COMMAND_TIMEOUT_N_MEAN', df.COMMAND_TIMEOUT_N_SUM/8)
df = df.withColumn('COMMAND_TIMEOUT_N_VAR', df.COMMAND_TIMEOUT_N_MAX/df.COMMAND_TIMEOUT_N_MIN)
df = df.withColumn('COMMAND_TIMEOUT_N_DELTA', df.COMMAND_TIMEOUT_N/df.COMMAND_TIMEOUT_N_MEAN)
df = df.withColumn('COMMAND_TIMEOUT_N_VARX', df.COMMAND_TIMEOUT_N_MEAN/df.COMMAND_TIMEOUT_N_MAX)
df = df.withColumn('COMMAND_TIMEOUT_N_VARN', df.COMMAND_TIMEOUT_N_MEAN/df.COMMAND_TIMEOUT_N_MIN)
df = df.withColumn('COMMAND_TIMEOUT_N_DELTAX', df.COMMAND_TIMEOUT_N/df.COMMAND_TIMEOUT_N_MAX)
df = df.withColumn('COMMAND_TIMEOUT_N_DELTAN', df.COMMAND_TIMEOUT_N/df.COMMAND_TIMEOUT_N_MIN)


In [22]:
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_MAX', greatest('CURRENT_PENDING_SECTOR_COUNT_R','CURRENT_PENDING_SECTOR_COUNT_R_7','CURRENT_PENDING_SECTOR_COUNT_R_6','CURRENT_PENDING_SECTOR_COUNT_R_5',\
                                                    'CURRENT_PENDING_SECTOR_COUNT_R_4','CURRENT_PENDING_SECTOR_COUNT_R_3','CURRENT_PENDING_SECTOR_COUNT_R_2','CURRENT_PENDING_SECTOR_COUNT_R_1'))

df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_MIN', least('CURRENT_PENDING_SECTOR_COUNT_R','CURRENT_PENDING_SECTOR_COUNT_R_7','CURRENT_PENDING_SECTOR_COUNT_R_6','CURRENT_PENDING_SECTOR_COUNT_R_5',\
                                                    'CURRENT_PENDING_SECTOR_COUNT_R_4','CURRENT_PENDING_SECTOR_COUNT_R_3','CURRENT_PENDING_SECTOR_COUNT_R_2','CURRENT_PENDING_SECTOR_COUNT_R_1'))

df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_SUM', (df.CURRENT_PENDING_SECTOR_COUNT_R+df.CURRENT_PENDING_SECTOR_COUNT_R_7+df.CURRENT_PENDING_SECTOR_COUNT_R_6+df.CURRENT_PENDING_SECTOR_COUNT_R_5+\
                                                    df.CURRENT_PENDING_SECTOR_COUNT_R_4+df.CURRENT_PENDING_SECTOR_COUNT_R_3+df.CURRENT_PENDING_SECTOR_COUNT_R_2+df.CURRENT_PENDING_SECTOR_COUNT_R_1))


df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_MEAN', df.CURRENT_PENDING_SECTOR_COUNT_R_SUM/8)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_VAR', df.CURRENT_PENDING_SECTOR_COUNT_R_MAX/df.CURRENT_PENDING_SECTOR_COUNT_R_MIN)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_DELTA', df.CURRENT_PENDING_SECTOR_COUNT_R/df.CURRENT_PENDING_SECTOR_COUNT_R_MEAN)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_VARX', df.CURRENT_PENDING_SECTOR_COUNT_R_MEAN/df.CURRENT_PENDING_SECTOR_COUNT_R_MAX)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_VARN', df.CURRENT_PENDING_SECTOR_COUNT_R_MEAN/df.CURRENT_PENDING_SECTOR_COUNT_R_MIN)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_DELTAX', df.CURRENT_PENDING_SECTOR_COUNT_R/df.CURRENT_PENDING_SECTOR_COUNT_R_MAX)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_DELTAN', df.CURRENT_PENDING_SECTOR_COUNT_R/df.CURRENT_PENDING_SECTOR_COUNT_R_MIN)


In [23]:
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_MAX', greatest('CURRENT_PENDING_SECTOR_COUNT_N','CURRENT_PENDING_SECTOR_COUNT_N_7','CURRENT_PENDING_SECTOR_COUNT_N_6','CURRENT_PENDING_SECTOR_COUNT_N_5',\
                                                    'CURRENT_PENDING_SECTOR_COUNT_N_4','CURRENT_PENDING_SECTOR_COUNT_N_3','CURRENT_PENDING_SECTOR_COUNT_N_2','CURRENT_PENDING_SECTOR_COUNT_N_1'))

df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_MIN', least('CURRENT_PENDING_SECTOR_COUNT_N','CURRENT_PENDING_SECTOR_COUNT_N_7','CURRENT_PENDING_SECTOR_COUNT_N_6','CURRENT_PENDING_SECTOR_COUNT_N_5',\
                                                    'CURRENT_PENDING_SECTOR_COUNT_N_4','CURRENT_PENDING_SECTOR_COUNT_N_3','CURRENT_PENDING_SECTOR_COUNT_N_2','CURRENT_PENDING_SECTOR_COUNT_N_1'))

df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_SUM', (df.CURRENT_PENDING_SECTOR_COUNT_N+df.CURRENT_PENDING_SECTOR_COUNT_N_7+df.CURRENT_PENDING_SECTOR_COUNT_N_6+df.CURRENT_PENDING_SECTOR_COUNT_N_5+\
                                                    df.CURRENT_PENDING_SECTOR_COUNT_N_4+df.CURRENT_PENDING_SECTOR_COUNT_N_3+df.CURRENT_PENDING_SECTOR_COUNT_N_2+df.CURRENT_PENDING_SECTOR_COUNT_N_1))


df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_MEAN', df.CURRENT_PENDING_SECTOR_COUNT_N_SUM/8)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_VAR', df.CURRENT_PENDING_SECTOR_COUNT_N_MAX/df.CURRENT_PENDING_SECTOR_COUNT_N_MIN)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_DELTA', df.CURRENT_PENDING_SECTOR_COUNT_N/df.CURRENT_PENDING_SECTOR_COUNT_N_MEAN)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_VARX', df.CURRENT_PENDING_SECTOR_COUNT_N_MEAN/df.CURRENT_PENDING_SECTOR_COUNT_N_MAX)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_VARN', df.CURRENT_PENDING_SECTOR_COUNT_N_MEAN/df.CURRENT_PENDING_SECTOR_COUNT_N_MIN)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_DELTAX', df.CURRENT_PENDING_SECTOR_COUNT_N/df.CURRENT_PENDING_SECTOR_COUNT_N_MAX)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_DELTAN', df.CURRENT_PENDING_SECTOR_COUNT_N/df.CURRENT_PENDING_SECTOR_COUNT_N_MIN)
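
To make the window statistics in the cells above concrete, here is a small plain-Python sketch (not Spark code) that computes the same ten values for a single row. The helper names `safe_div` and `window_features` and the sample values are made up for illustration. It assumes all eight inputs are non-null, and it assumes Spark's default (non-ANSI) behaviour, where dividing by zero yields null; null is represented here as `None`.

```python
def safe_div(num, den):
    """Mimic Spark's default (non-ANSI) division: a null or zero denominator gives null."""
    if num is None or den is None or den == 0:
        return None
    return num / den


def window_features(today, lags):
    """Compute the window statistics for one row.

    today -- the current day's value
    lags  -- the seven lagged values (_1 .. _7), all assumed non-null here
    """
    values = [today] + list(lags)
    feats = {
        "MAX": max(values),
        "MIN": min(values),
        "SUM": sum(values),
    }
    feats["MEAN"] = feats["SUM"] / 8  # fixed divisor, as in the notebook
    feats["VAR"] = safe_div(feats["MAX"], feats["MIN"])
    feats["DELTA"] = safe_div(today, feats["MEAN"])
    feats["VARX"] = safe_div(feats["MEAN"], feats["MAX"])
    feats["VARN"] = safe_div(feats["MEAN"], feats["MIN"])
    feats["DELTAX"] = safe_div(today, feats["MAX"])
    feats["DELTAN"] = safe_div(today, feats["MIN"])
    return feats


# Hypothetical drive: the value climbs from 0 to 8 over the eight-day window.
example = window_features(8, [4, 4, 2, 2, 0, 0, 0])
print(example)
```

In this example the window minimum is 0, so the three features that divide by MIN (VAR, VARN and DELTAN) come out as `None`. Note also that VAR here is a max/min ratio, not a statistical variance.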


In [24]:
#from pyspark.sql.types import DoubleType
#from pyspark.sql.functions import col, round
#df=df.withColumn("ROW",round(df.REALLOCATED_SECTOR_COUNT_R.cast(DoubleType()),2))

Fill missing values with -99.

In [25]:
df=df.na.fill(value=-99)

In [26]:
df=df.sort("SERIAL_NUMBER","DATE")

### 5.0 Append mean encoded data variables <a id="50"></a>

In [27]:
z='model.csv'
input_data = spark.read\
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .load(cos.url(z, 'backblazedata-donotdelete-pr-cij57grgkoctem'))#read the file from object storage
# keep only the key and the mean-encoded failure statistics for each model
df_model = input_data.select("MODEL","MODEL_FAIL_RATE","MODEL_FAIL_CNT","MODEL_FAIL_TOTAL")

In [28]:
# rename the key so the join does not produce two columns named MODEL
df_model=df_model.withColumnRenamed("MODEL","MODELM")

In [29]:
df=df.join(df_model,(((df.MODEL) ==  (df_model.MODELM)) ),"left")
df = df.drop("MODELM")
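
The left join above keeps every drive row and attaches the per-model statistics where the model is found. As a rough plain-Python illustration of that behaviour (not Spark code), the sketch below does the same thing with dictionaries. The helper `left_join`, the serial numbers, the model names and the numbers are all hypothetical.

```python
def left_join(rows, lookup, key, fields):
    """Return copies of rows with `fields` from lookup[row[key]] attached.

    Rows whose key is missing from the lookup are kept, with None in the new fields.
    """
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        joined.append({**row, **{f: match.get(f) for f in fields}})
    return joined


# Hypothetical values, for illustration only.
drives = [
    {"SERIAL_NUMBER": "A1", "MODEL": "M-100"},
    {"SERIAL_NUMBER": "B2", "MODEL": "M-999"},  # model absent from the lookup
]
model_stats = {
    "M-100": {"MODEL_FAIL_RATE": 0.02, "MODEL_FAIL_CNT": 4, "MODEL_FAIL_TOTAL": 200},
}

joined = left_join(
    drives, model_stats, "MODEL",
    ["MODEL_FAIL_RATE", "MODEL_FAIL_CNT", "MODEL_FAIL_TOTAL"],
)
for row in joined:
    print(row)
```

The second drive survives the join but carries `None` in the three statistics columns, which is the situation a later null fill is meant to handle.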

In [30]:
z='manufacturer.csv'
input_data = spark.read\
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .load(cos.url(z, 'backblazedata-donotdelete-pr-cij57grgkoctem'))#read the file from object storage
# keep only the key and the mean-encoded failure statistics for each manufacturer
df_manu = input_data.select("MANUFACTURER","MANU_FAIL_RATE","MANU_FAIL_CNT","MANU_FAIL_TOTAL")

# rename the key so the join does not produce two columns named MANUFACTURER
df_manu=df_manu.withColumnRenamed("MANUFACTURER","MANUFACTURERM")

df=df.join(df_manu,(((df.MANUFACTURER) ==  (df_manu.MANUFACTURERM)) ),"left")
df = df.drop("MANUFACTURERM")

In [31]:
z='df_avg_by_model.csv'
input_data = spark.read\
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .load(cos.url(z, 'backblazedata-donotdelete-pr-cij57grgkoctem'))#read the file from object storage
# select the key and the per-model average of each metric
df_model = input_data.select("MODEL","REALLOCATED_SECTOR_COUNT_N_MOD","REPORTED_UNCORRECTABLE_ERRORS_N_MOD","COMMAND_TIMEOUT_N_MOD",
                             "CURRENT_PENDING_SECTOR_COUNT_N_MOD","POWER_ON_HOURS_N_MOD","REALLOCATED_SECTOR_COUNT_R_MOD",
                             "REPORTED_UNCORRECTABLE_ERRORS_R_MOD","COMMAND_TIMEOUT_R_MOD","CURRENT_PENDING_SECTOR_COUNT_R_MOD",
                             "POWER_ON_HOURS_R_MOD")

# rename the key so the join does not produce two columns named MODEL
df_model=df_model.withColumnRenamed("MODEL","MODELM")

df=df.join(df_model,(((df.MODEL) ==  (df_model.MODELM)) ),"left")
df = df.drop("MODELM")

In [32]:
z='df_avg_by_manu.csv'
input_data = spark.read\
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .load(cos.url(z, 'backblazedata-donotdelete-pr-cij57grgkoctem'))#read the file from object storage
# select the key and the per-manufacturer average of each metric
df_manu = input_data.select("MANUFACTURER","REALLOCATED_SECTOR_COUNT_N_MAN","REPORTED_UNCORRECTABLE_ERRORS_N_MAN",
                            "COMMAND_TIMEOUT_N_MAN","CURRENT_PENDING_SECTOR_COUNT_N_MAN","POWER_ON_HOURS_N_MAN",
                            "REALLOCATED_SECTOR_COUNT_R_MAN","REPORTED_UNCORRECTABLE_ERRORS_R_MAN","COMMAND_TIMEOUT_R_MAN",
                            "CURRENT_PENDING_SECTOR_COUNT_R_MAN","POWER_ON_HOURS_R_MAN")

# rename the key so the join does not produce two columns named MANUFACTURER
df_manu=df_manu.withColumnRenamed("MANUFACTURER","MANUFACTURERM")

df=df.join(df_manu,(((df.MANUFACTURER) ==  (df_manu.MANUFACTURERM)) ),"left")
df = df.drop("MANUFACTURERM")

Fill the null values left by the joins with -99.

In [33]:
df=df.na.fill(value=-99)
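
One detail worth keeping in mind: when `na.fill` is given a number, Spark only fills numeric columns and leaves string columns untouched, and CSV columns read without schema inference arrive as strings. The plain-Python sketch below (not Spark code) mimics that rule so the effect is easy to see. The helper `fill_numeric_nulls`, the type labels and the sample row are assumptions made for illustration; running `df.printSchema()` in the notebook is the way to check the real column types.

```python
def fill_numeric_nulls(rows, schema, value=-99):
    """Replace None with `value`, but only in columns declared numeric in `schema`."""
    numeric = {name for name, typ in schema.items() if typ in ("int", "double")}
    return [
        {k: (value if v is None and k in numeric else v) for k, v in row.items()}
        for row in rows
    ]


# Hypothetical schema: one joined column is numeric, the other is still a string.
schema = {"MODEL_FAIL_RATE": "double", "MANU_FAIL_RATE": "string"}
rows = [{"MODEL_FAIL_RATE": None, "MANU_FAIL_RATE": None}]

filled = fill_numeric_nulls(rows, schema)
print(filled)
```

Only the numeric column receives the sentinel; the string column keeps its null.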

### 6.0 Create even more features <a id="60"></a>

In [34]:

df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_MOD_IDX', df.REALLOCATED_SECTOR_COUNT_N/df.REALLOCATED_SECTOR_COUNT_N_MOD)

df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_MOD_IDX', df.REPORTED_UNCORRECTABLE_ERRORS_N/df.REPORTED_UNCORRECTABLE_ERRORS_N_MOD)
df = df.withColumn('COMMAND_TIMEOUT_N_MOD_IDX', df.COMMAND_TIMEOUT_N/df.COMMAND_TIMEOUT_N_MOD)

df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_MOD_IDX', df.CURRENT_PENDING_SECTOR_COUNT_N/df.CURRENT_PENDING_SECTOR_COUNT_N_MOD)

df = df.withColumn('POWER_ON_HOURS_N_MOD_IDX', df.POWER_ON_HOURS_N/df.POWER_ON_HOURS_N_MOD)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_MOD_IDX', df.REALLOCATED_SECTOR_COUNT_R/df.REALLOCATED_SECTOR_COUNT_R_MOD)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_MOD_IDX', df.REPORTED_UNCORRECTABLE_ERRORS_R/df.REPORTED_UNCORRECTABLE_ERRORS_R_MOD)

df = df.withColumn('COMMAND_TIMEOUT_R_MOD_IDX', df.COMMAND_TIMEOUT_R/df.COMMAND_TIMEOUT_R_MOD)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_MOD_IDX', df.CURRENT_PENDING_SECTOR_COUNT_R/df.CURRENT_PENDING_SECTOR_COUNT_R_MOD)

df = df.withColumn('POWER_ON_HOURS_R_MOD_IDX', df.POWER_ON_HOURS_R/df.POWER_ON_HOURS_R_MOD)


In [35]:
df = df.withColumn('REALLOCATED_SECTOR_COUNT_N_MAN_IDX', df.REALLOCATED_SECTOR_COUNT_N/df.REALLOCATED_SECTOR_COUNT_N_MAN)

df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_N_MAN_IDX', df.REPORTED_UNCORRECTABLE_ERRORS_N/df.REPORTED_UNCORRECTABLE_ERRORS_N_MAN)
df = df.withColumn('COMMAND_TIMEOUT_N_MAN_IDX', df.COMMAND_TIMEOUT_N/df.COMMAND_TIMEOUT_N_MAN)

df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_N_MAN_IDX', df.CURRENT_PENDING_SECTOR_COUNT_N/df.CURRENT_PENDING_SECTOR_COUNT_N_MAN)

df = df.withColumn('POWER_ON_HOURS_N_MAN_IDX', df.POWER_ON_HOURS_N/df.POWER_ON_HOURS_N_MAN)
df = df.withColumn('REALLOCATED_SECTOR_COUNT_R_MAN_IDX', df.REALLOCATED_SECTOR_COUNT_R/df.REALLOCATED_SECTOR_COUNT_R_MAN)
df = df.withColumn('REPORTED_UNCORRECTABLE_ERRORS_R_MAN_IDX', df.REPORTED_UNCORRECTABLE_ERRORS_R/df.REPORTED_UNCORRECTABLE_ERRORS_R_MAN)

df = df.withColumn('COMMAND_TIMEOUT_R_MAN_IDX', df.COMMAND_TIMEOUT_R/df.COMMAND_TIMEOUT_R_MAN)
df = df.withColumn('CURRENT_PENDING_SECTOR_COUNT_R_MAN_IDX', df.CURRENT_PENDING_SECTOR_COUNT_R/df.CURRENT_PENDING_SECTOR_COUNT_R_MAN)

df = df.withColumn('POWER_ON_HOURS_R_MAN_IDX', df.POWER_ON_HOURS_R/df.POWER_ON_HOURS_R_MAN)
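
The index features in this section all follow one naming pattern: a metric, `_N` or `_R`, then `_MOD` or `_MAN`, then `_IDX`, each defined as the drive's value divided by the group average. As a sketch of how that pattern could be generated instead of typed out, the plain-Python helper below builds the column-name triples. The function name `index_feature_specs` and the constant `METRICS` are hypothetical, and the sketch emits the regular pattern for every metric.

```python
METRICS = [
    "REALLOCATED_SECTOR_COUNT",
    "REPORTED_UNCORRECTABLE_ERRORS",
    "COMMAND_TIMEOUT",
    "CURRENT_PENDING_SECTOR_COUNT",
    "POWER_ON_HOURS",
]


def index_feature_specs(metrics=METRICS, kinds=("N", "R"), groups=("MOD", "MAN")):
    """Return (new_column, numerator, denominator) column-name triples."""
    specs = []
    for group in groups:
        for kind in kinds:
            for metric in metrics:
                base = f"{metric}_{kind}"
                specs.append((f"{base}_{group}_IDX", base, f"{base}_{group}"))
    return specs


specs = index_feature_specs()
for new_col, num, den in specs[:3]:
    print(new_col, "=", num, "/", den)
```

In the notebook, a loop such as `for new_col, num, den in specs: df = df.withColumn(new_col, df[num] / df[den])` would then create the same twenty ratio columns.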


### 7.0 Export to Parquet file for step 5 <a id="70"></a>

In [36]:
# The code was removed by Watson Studio for sharing.

In [37]:
df.write.mode("overwrite").parquet(cos.url('data_2020_model.parquet', 'bucketname'))

All data used in this notebook is the property of BackBlaze.com.

For questions regarding use of the data, please see the following website: https://www.backblaze.com/b2/hard-drive-test-data.html