# A Complete Solution to the BackBlaze.com Kaggle Problem

## Step Two.  

### Creating Mean Encoded Features

## Table of Contents

1. [Introduction](#10)<br>

2. [Establish environment and parameters](#20)<br>
3. [Read in data created in step 1](#30)<br>
4. [Create mean encoded features](#40)<br>
    4.1 [Establish a global mean](#41)<br>
    4.2 [Aggregate the failure rate by Manufacturer](#42)<br>
    4.3 [Aggregate the failure rate by Model](#43)<br>
    4.4 [Calculate the values of predictors when a disk fails by model](#44)<br>
    4.5 [Calculate the values of predictors when a disk fails by manufacturer](#45)<br>



### 1.0 Introduction <a id="10"></a>

BackBlaze.com, you are the "GOAT." You are the "cat's meow." You "Rock the House." In case you don't know why BackBlaze.com is so totally "kick-ass," they open-sourced a vast set of hard drive information a few years ago and continue updating it each quarter. What a treasure trove of superb data. BackBlaze.com, thank you from the bottom of my heart.

The BackBlaze.com data includes operational metrics from hard drives, along with an indicator of hard-drive failure. It is an excellent source for teaching techniques related to machine failure. Again, thank you for making this available to the open-source community.

Here is a link to the data.


https://www.backblaze.com/b2/hard-drive-test-data.html

My goal in this series of articles is not to give the best solution with the highest AUC.  My goal is to show you how to approach equipment failure problems and build solutions that reflect realistic accuracy, and provide an easy transition from the lab to the real world. 

I will use a Spark/Python Jupyter notebook inside IBM's Watson Studio on the cloud as a tool in this discussion.

https://www.ibm.com/cloud/watson-studio

I will also be using cloud object storage on the IBM cloud.

https://www.ibm.com/cloud/block-storage


The third article is about designing features for a predictive model, specifically using data from 2017, 2018, and 2019 to build features for our model based on 2020.  For more information on mean encoded features, please see the following article.

https://medium.com/towards-data-science/leveraging-value-from-postal-codes-naics-codes-area-codes-and-other-funky-arse-categorical-be9ce75b6d5a

I created these notebooks with a runtime using 1 driver with 1 vCPU and 4 GB RAM, and 2 executors each with 1 vCPU and 4 GB RAM. This is available for free on the IBM Cloud. Some of the notebooks take a few hours to run, so you'll need to schedule your notebooks to run as jobs.

https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/schedule-task.html

### 2.0 Establish environment and parameters <a id="20"></a>

In [1]:
from functools import reduce
from pyspark.sql import DataFrame

import pyspark.sql.functions as F

from pyspark.sql.functions import when

from pyspark.sql.functions import rand
from pyspark.sql.functions import lit

In [2]:
# The code was removed by Watson Studio for sharing.

### 3.0 Read in data created in step 1 <a id="30"></a>

Read in the data from 2017-2019 that we created in step 1.  There are 107,909,839 observations and 16 fields.

In [3]:
df = spark.read.parquet(cos.url('data2019_2017.parquet', 'backblazedata-donotdelete-pr-cij57grgkoctem'))

#df.show(200)
#print((df.count(), len(df.columns)))

### 4.0 Create mean encoded features. <a id="40"></a>

In the next few steps we will create mean encoded features and export them in a format that our predictive model can use.

One thing you want to avoid when building mean encoded features is using irrelevant aggregations.  This can occur when you don't have a big enough sample for the aggregation to be meaningful.  The failure rate is tiny in this exercise, so a group needs a large number of observations before its failure rate means anything.  In the code below, I use 100,000 as a threshold.  There is no magic number, but I picked 100,000 based on the overall average failure rate across all disk drives.
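
To see why group size matters when failures are rare, here is a small, self-contained sketch. It assumes failures are independent and uses a made-up failure rate (`ASSUMED_FAILURE_RATE` is hypothetical, not a value measured from the BackBlaze data), so treat the numbers as an illustration rather than a result.

```python
import math

# Hypothetical per-observation failure rate, for illustration only.
ASSUMED_FAILURE_RATE = 0.00005

def expected_failures(n_obs, rate=ASSUMED_FAILURE_RATE):
    """Expected number of failures among n_obs observations."""
    return n_obs * rate

def relative_std_error(n_obs, rate=ASSUMED_FAILURE_RATE):
    """Standard error of the estimated rate divided by the rate (binomial assumption)."""
    return math.sqrt(rate * (1 - rate) / n_obs) / rate

# Print group size, expected failures, and relative standard error.
for n_obs in (1_000, 10_000, 100_000, 1_000_000):
    print(n_obs, round(expected_failures(n_obs), 2), round(relative_std_error(n_obs), 2))
```

With a rate this small, a group of a thousand observations is expected to contain far less than one failure, so its observed rate is mostly noise. The relative error only shrinks as the group grows.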

#### 4.1 Establish a global mean <a id="41"></a>

Having a global mean in the data frame allows us to easily replace irrelevant values with an average.

Create a dummy field to use in aggregations.

In [4]:
df = df.withColumn("wookie", lit(1))

Create a global failure rate across all three years of data.  We will use this when we create the features.  The result is a Spark dataframe that expresses the average failure rate across all disks for the three years.

In [5]:
#aggregate the data
total = df.groupBy('wookie').agg(F.mean("failure").alias('avg_failure')).collect()
#convert output to rdd
rdd = spark.sparkContext.parallelize(total)
#convert output to spark
zz=rdd.toDF()
#rename the column
zz=zz.withColumnRenamed("avg_failure","GLOBAL_AVG_FAILURE")
#multiply by 10,000, for formatting purposes.
zz = zz.withColumn("GLOBAL_AVG_FAILURE", zz.GLOBAL_AVG_FAILURE*10000)
#zz.show(200)








Join the global average to the original data frame.

In [6]:
#join on the shared dummy key; joining by column name keeps a single wookie column
df = df.join(zz, on="wookie", how="inner")
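
The dummy-key pattern above boils down to computing one number and attaching it to every row. Here is a plain-Python sketch of the same idea, using a handful of made-up rows. The `model` values and the helper name `add_global_rate` are illustrative assumptions, not part of the notebook.

```python
# Made-up observations: one dict per observation, failure is 0 or 1.
rows = [
    {"model": "A", "failure": 0},
    {"model": "A", "failure": 1},
    {"model": "B", "failure": 0},
    {"model": "B", "failure": 0},
]

def add_global_rate(rows, scale=10_000):
    """Attach the overall failure rate (per `scale` observations) to every row."""
    global_rate = sum(r["failure"] for r in rows) / len(rows) * scale
    # Every row shares the same constant key, so every row gets the same value.
    return [dict(r, wookie=1, GLOBAL_AVG_FAILURE=global_rate) for r in rows]

enriched = add_global_rate(rows)
print(enriched[0])
```

Because the global rate travels with each row, any later aggregation can carry it along and use it as a fallback value.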

#### 4.2 Aggregate the failure rate by Manufacturer. <a id="42"></a>

In [None]:
#Calculate the summaries.
total = df.groupBy('MANUFACTURER').agg(F.mean("failure").alias('avg_failure'),F.count("failure").alias('count_failure'),\
                                       F.sum("failure").alias('sum_failure'),F.mean("GLOBAL_AVG_FAILURE").alias('GLOBAL_AVG_FAILURE')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)
#convert to spark
zz=rdd.toDF()
#rename columns
zz=zz.withColumnRenamed("avg_failure","MANU_FAIL_RATE")
zz=zz.withColumnRenamed("sum_failure","MANU_FAIL_TOTAL")
zz=zz.withColumnRenamed("count_failure","MANU_FAIL_CNT")
#multiply by 10,000 to make them easier to read and deal with
zz = zz.withColumn("MANU_FAIL_RATE", zz.MANU_FAIL_RATE*10000)

#zz.show(200)

We want to avoid situations where aggregations are based on a small number of records.  In the next step, we replace a manufacturer's failure rate with the global failure average if fewer than 100,000 records were used to calculate it.  Again, 100,000 is reasonable based on the overall failure rate.

In [None]:
df_manu = zz.withColumn("MANU_FAIL_RATE", when(zz.MANU_FAIL_CNT<100000,zz.GLOBAL_AVG_FAILURE).otherwise(zz.MANU_FAIL_RATE))

#df_manu.show(200)
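
The replacement rule can be sketched without Spark. The function below is an illustrative stand-in for the aggregation and the `when(...)` step, run on made-up rows. `mean_encode` and its parameters are hypothetical names, and the tiny `min_obs` in the usage line is only there so the toy data triggers both branches.

```python
from collections import defaultdict

SCALE = 10_000  # rates are expressed per 10,000 observations, as in the notebook

def mean_encode(rows, key, min_obs):
    """Return {group: failure rate}, falling back to the global rate for small groups."""
    counts = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
        failures[row[key]] += row["failure"]

    global_rate = sum(failures.values()) / sum(counts.values()) * SCALE

    encoded = {}
    for group, n_obs in counts.items():
        group_rate = failures[group] / n_obs * SCALE
        # Keep the group's own rate only when it rests on enough observations.
        encoded[group] = group_rate if n_obs >= min_obs else global_rate
    return encoded

# Made-up data: "BIG" has 4 observations, "SMALL" has only 1.
rows = (
    [{"MANUFACTURER": "BIG", "failure": f} for f in (0, 0, 0, 1)]
    + [{"MANUFACTURER": "SMALL", "failure": 1}]
)
print(mean_encode(rows, "MANUFACTURER", min_obs=3))
```

Without the fallback, the single failed observation for "SMALL" would give it the highest possible failure rate, which is exactly the kind of irrelevant aggregation we are trying to avoid.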

Convert the aggregated data frame to pandas.

In [None]:
df_manup = df_manu.toPandas()

Define credentials for object storage

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
from ibm_botocore.client import Config
import ibm_boto3
cos = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=credentials['IBM_API_KEY_ID'],
    ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
    ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials['ENDPOINT'])

Export the pandas dataframe to csv and upload to cloud object storage.

In [None]:
#to_csv writes the file and returns None, so don't assign the result back
df_manup.to_csv('manufacturer.csv',index=False)
cos.upload_file(Filename='manufacturer.csv',Bucket=credentials['BUCKET'],Key='manufacturer.csv')

#### 4.3 Aggregate the failure rate by Model. <a id="43"></a>

In [None]:
#Calculate the summaries.
total = df.groupBy('MODEL').agg(F.mean("failure").alias('avg_failure'),F.count("failure").alias('count_failure'),\
                                       F.sum("failure").alias('sum_failure'),F.mean("GLOBAL_AVG_FAILURE").alias('GLOBAL_AVG_FAILURE')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)
#convert output to spark

zz=rdd.toDF()
#rename columns
zz=zz.withColumnRenamed("avg_failure","MODEL_FAIL_RATE")
zz=zz.withColumnRenamed("sum_failure","MODEL_FAIL_TOTAL")
zz=zz.withColumnRenamed("count_failure","MODEL_FAIL_CNT")
#multiply by 10,000 to make them easier to read and deal with
zz = zz.withColumn("MODEL_FAIL_RATE", zz.MODEL_FAIL_RATE*10000)

#replace values when the record count for a summary is less than 100,000
df_model = zz.withColumn("MODEL_FAIL_RATE", when(zz.MODEL_FAIL_CNT<100000,zz.GLOBAL_AVG_FAILURE).otherwise(zz.MODEL_FAIL_RATE))

#convert to Pandas
df_modelp = df_model.toPandas()
#export to csv
df_modelp.to_csv('model.csv',index=False)
#upload to object storage
cos.upload_file(Filename='model.csv',Bucket=credentials['BUCKET'],Key='model.csv')
#zz.show(200)
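
A downstream step will presumably look these rates up by model. The sketch below shows one hedged way to do that lookup, with a fallback for models that never appeared in 2017-2019. The CSV text is made up and holds only a subset of the exported columns (the column names come from the code above), and `load_rates` and `rate_for` are hypothetical helpers.

```python
import csv
import io

# Made-up stand-in for part of the contents of model.csv.
MODEL_CSV = """MODEL,MODEL_FAIL_RATE,GLOBAL_AVG_FAILURE
X1,0.8,0.5
Y2,0.5,0.5
"""

def load_rates(csv_text):
    """Parse the exported table into {model: rate} plus the global average."""
    rates = {}
    global_rate = None
    for record in csv.DictReader(io.StringIO(csv_text)):
        rates[record["MODEL"]] = float(record["MODEL_FAIL_RATE"])
        # The global average is repeated on every row, so any row will do.
        global_rate = float(record["GLOBAL_AVG_FAILURE"])
    return rates, global_rate

def rate_for(model, rates, global_rate):
    """Use the model's encoded rate, or the global average for unseen models."""
    return rates.get(model, global_rate)

rates, global_rate = load_rates(MODEL_CSV)
print(rate_for("X1", rates, global_rate), rate_for("NEW", rates, global_rate))
```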

#### 4.4 Calculate the values of predictors when a disk fails by model. <a id="44"></a>

This aggregation could be a useful predictor when compared with the values a disk reports while it is still healthy.  For example, if a disk fails every time a field is equal to 76.4, you should probably take note.
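
One way to read this, sketched below on made-up numbers: average a metric over the failed observations for each model, then measure how far a new reading sits from that failure-time average. The column name `power_on_hours`, the helper names, and the comparison itself are assumptions for illustration. The notebook only computes the averages.

```python
from collections import defaultdict

def failure_profile(rows, key, feature):
    """Mean of `feature` per group, using only the rows where failure == 1."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row["failure"] == 1:
            totals[row[key]] += row[feature]
            counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in counts}

def gap_to_profile(reading, profile_mean):
    """Signed distance between a current reading and the failure-time average."""
    return reading - profile_mean

# Made-up rows: two failed observations and one healthy one for model "X1".
rows = [
    {"MODEL": "X1", "failure": 1, "power_on_hours": 30000.0},
    {"MODEL": "X1", "failure": 1, "power_on_hours": 34000.0},
    {"MODEL": "X1", "failure": 0, "power_on_hours": 5000.0},
]
profile = failure_profile(rows, "MODEL", "power_on_hours")
print(gap_to_profile(31000.0, profile["X1"]))
```

A small gap means the disk currently looks like disks of the same model did when they failed, which is the signal this aggregation is meant to expose.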

Select disks that failed

In [None]:
df_failure=df.filter(df.FAILURE == 1)

Aggregate the fields by model

In [None]:
#Calculate the summaries.
total = df_failure.groupBy('MODEL').agg(F.mean("REAllOCATED_SECTOR_COUNT_N").alias('REALLOCATED_SECTOR_COUNT_N_MOD'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_N").alias('REPORTED_UNCORRECTABLE_ERRORS_N_MOD'),\
                                F.mean("COMMAND_TIMEOUT_N").alias('COMMAND_TIMEOUT_N_MOD'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_N").alias('CURRENT_PENDING_SECTOR_COUNT_N_MOD'),\
                                F.mean("POWER_ON_HOURS_N").alias('POWER_ON_HOURS_N_MOD'),\
                                F.mean("REAllOCATED_SECTOR_COUNT_R").alias('REALLOCATED_SECTOR_COUNT_R_MOD'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_R").alias('REPORTED_UNCORRECTABLE_ERRORS_R_MOD'),\
                                F.mean("COMMAND_TIMEOUT_R").alias('COMMAND_TIMEOUT_R_MOD'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_R").alias('CURRENT_PENDING_SECTOR_COUNT_R_MOD'),\
                                F.mean("POWER_ON_HOURS_R").alias('POWER_ON_HOURS_R_MOD')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)

#convert to spark
df_avg_by_model=rdd.toDF()
#convert to pandas
df_avg_by_model = df_avg_by_model.toPandas()
#export to csv
df_avg_by_model.to_csv('df_avg_by_model.csv',index=False)
#upload to cloud object storage
cos.upload_file(Filename='df_avg_by_model.csv',Bucket=credentials['BUCKET'],Key='df_avg_by_model.csv')

#### 4.5 Calculate the values of predictors when a disk fails by manufacturer. <a id="45"></a>

In [None]:
#Calculate the summaries.
total = df_failure.groupBy('MANUFACTURER').agg(F.mean("REAllOCATED_SECTOR_COUNT_N").alias('REALLOCATED_SECTOR_COUNT_N_MAN'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_N").alias('REPORTED_UNCORRECTABLE_ERRORS_N_MAN'),\
                                F.mean("COMMAND_TIMEOUT_N").alias('COMMAND_TIMEOUT_N_MAN'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_N").alias('CURRENT_PENDING_SECTOR_COUNT_N_MAN'),\
                                F.mean("POWER_ON_HOURS_N").alias('POWER_ON_HOURS_N_MAN'),\
                                F.mean("REAllOCATED_SECTOR_COUNT_R").alias('REALLOCATED_SECTOR_COUNT_R_MAN'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_R").alias('REPORTED_UNCORRECTABLE_ERRORS_R_MAN'),\
                                F.mean("COMMAND_TIMEOUT_R").alias('COMMAND_TIMEOUT_R_MAN'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_R").alias('CURRENT_PENDING_SECTOR_COUNT_R_MAN'),\
                                F.mean("POWER_ON_HOURS_R").alias('POWER_ON_HOURS_R_MAN')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)

#convert to spark
df_avg_by_manu=rdd.toDF()

#convert to pandas
df_avg_by_manu = df_avg_by_manu.toPandas()
#export to csv
df_avg_by_manu.to_csv('df_avg_by_manu.csv',index=False)
#upload to cloud object storage
cos.upload_file(Filename='df_avg_by_manu.csv',Bucket=credentials['BUCKET'],Key='df_avg_by_manu.csv')

#df_avg_by_manu.head(10)

All data used in this notebook is the property of BackBlaze.com.

For questions regarding use of the data, please see the following website: https://www.backblaze.com/b2/hard-drive-test-data.html