# A Complete Solution to the BackBlaze.com Kaggle Problem

## Step One.  

### File Processing

## Table of contents

1. [Introduction](#10)<br>

2. [Establish environment and parameters](#20)<br>
3. [Process Data](#30)<br>
    3.1 [Create Null Data frame](#31)<br>
    3.2 [Read in Data from csv files](#32)<br>
    3.21 [Read in Months with 31 days](#321)<br>
    3.22 [Read in Months with 30 days](#322)<br>
    3.23 [Read in Months with 29 days](#323)<br>
    3.24 [Read in Months with 28 days](#324)<br>
4. [Prepare Fields ](#40)<br>
5. [Export to Parquet files](#50)<br>

### 1.0 Introduction <a id="10"></a>

BackBlaze.com, you are the "GOAT." You are the "cat's meow." You "Rock the House." In case you don't know why BackBlaze.com is so totally "kick-ass," they open-sourced a vast set of hard drive information a few years ago and continue updating it each quarter. What a treasure trove of superb data. BackBlaze.com, thank you from the bottom of my heart.

The backblaze.com data includes operational metrics from hard drives with an indicator of a hard-drive failure. It is an excellent source for teaching techniques related to machine failure. Again, thank you for making this available to the open-source community.

Here is a link to the data.


https://www.backblaze.com/b2/hard-drive-test-data.html

My goal in this series of articles is not to give the best solution with the highest AUC.  My goal is to show you how to approach equipment failure problems and build solutions that reflect realistic accuracy, and provide an easy transition from the lab to the real world. 

I will use a Spark/Python Jupyter notebook inside IBM's Watson Studio on the cloud as a tool in this discussion.

https://www.ibm.com/cloud/watson-studio

I will also be using cloud object storage on the IBM cloud.

https://www.ibm.com/cloud/object-storage


The first article in this series is, without question, the most mundane. I will be loading data from the CSV files provided by backblaze.com into Spark data frames. File processing is a tedious but necessary part of the process.

Before running this code, I downloaded the .csv files from BackBlaze.com and uploaded them to IBM Cloud object storage. Note that our friends at backblaze.com conveniently name the CSV files in the following format.

"YYYY-MM-DD.csv"

The naming convention makes it easy to load them systematically from cloud object storage into Spark data frames. Also, note that I only use a handful of the fields in the CSV files. And finally, I limited my work here to 2017 through 2020.
 
I created these notebooks with a runtime using 1 driver with 1 vCPU and 4 GB RAM, and 2 executors, each with 1 vCPU and 4 GB RAM. This is available for free on the IBM Cloud. Some of the notebooks take a few hours to run, so you'll need to schedule your notebooks to run as jobs.
 
 https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/schedule-task.html

### 2.0 Establish environment and parameters <a id="20"></a>


Import the Relevant Libraries and connect to object storage.

In [None]:
#Import relevant Libraries
from functools import reduce
from pyspark.sql import DataFrame

import pyspark.sql.functions as F

from pyspark.sql.functions import when

In [None]:
# The code was removed by Watson Studio for sharing.
# This hidden cell holds the object storage credentials and defines the `cos` helper used below.

### 3.0 Process Data <a id="30"></a>

#### 3.1 Create Null Data frame <a id="31"></a>

There are many ways to import CSV files into Spark data frames. I cannot say my approach is the best, but I can say with certainty it works.

The first step is to create a _null_ data frame with the relevant fields and no records. 

Again, I am using only some of the fields available. The schema of the files has changed over time, so I keep only fields that appear in all of the files. I also removed columns that did not seem relevant. The point of this exercise is not to build the ultimate model. Instead, it is to demonstrate how I approach these types of problems.
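
The same list of fields is repeated in every read in this notebook. As a minimal sketch, the list could be built once from the SMART attribute numbers. The names `ID_FIELDS`, `SMART_IDS`, and `selected_columns` are my own illustrative choices and are not part of the notebook.

```python
# Hypothetical helper: build the column list once instead of retyping it.
ID_FIELDS = ["date", "serial_number", "model", "capacity_bytes", "failure"]
SMART_IDS = [5, 187, 188, 197, 9]  # SMART attributes kept in this notebook


def selected_columns():
    """Return the kept columns: id fields, then normalized, then raw SMART values."""
    normalized = [f"smart_{i}_normalized" for i in SMART_IDS]
    raw = [f"smart_{i}_raw" for i in SMART_IDS]
    return ID_FIELDS + normalized + raw


columns = selected_columns()
print(columns)
```

With a live Spark session, `base.select(*selected_columns())` should be equivalent to the explicit `select` call in the next cell.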

In [None]:
# Read one specific CSV file to pick up the column names
base = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('2020-01-02.csv', 'backblazedata-donotdelete-pr-cij57grgkoctem'))
# Empty the data frame so there are no records:
# the filter value is a placeholder that is not expected to match any serial number
base = base.filter(base.serial_number == 'DAN PASTORINI')
# Limit the fields to those we are interested in
base = base.select("date", "serial_number", "model", "capacity_bytes", "failure",
                   "smart_5_normalized", "smart_187_normalized", "smart_188_normalized",
                   "smart_197_normalized", "smart_9_normalized",
                   "smart_5_raw", "smart_187_raw", "smart_188_raw",
                   "smart_197_raw", "smart_9_raw")

#### 3.2 Read in Data from csv files  <a id="32"></a>

In the following blocks of code, I will read each .csv file for 2020, 2019, 2018, and 2017. I created a separate block of code for months based on the number of days in the month: 31, 30, 29, and 28.

Note that looping through the files one at a time, from CSV import to Spark, minimizes the amount of memory used. The final result is a set of Parquet files I can use in the second notebook of this series.
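
The four blocks below differ only in how many days each month has. As an alternative sketch (not the approach used in this notebook), Python's standard `datetime` module can generate every daily file name, leap years included, so a single loop could cover all months. The function name `daily_file_names` is hypothetical.

```python
from datetime import date, timedelta


def daily_file_names(first_year, last_year):
    """Return 'YYYY-MM-DD.csv' names for every day from Jan 1 of first_year to Dec 31 of last_year."""
    current = date(first_year, 1, 1)
    end = date(last_year, 12, 31)
    names = []
    while current <= end:
        # isoformat() gives 'YYYY-MM-DD' with zero-padded month and day
        names.append(current.isoformat() + ".csv")
        current += timedelta(days=1)
    return names


names_2020 = daily_file_names(2020, 2020)
print(len(names_2020), names_2020[0], names_2020[-1])
```

Because the calendar logic lives in `datetime`, February 29 appears for 2020 and is skipped for the other years without any special handling.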


##### 3.21 Read in Months with 31 days.  <a id="321"></a>

In [1]:
# define the years
for d in ['2017', '2020', '2019', '2018']:
    # define the months with 31 days
    for t in ['01', '03', '05', '07', '08', '10', '12']:
        # loop over days 1 to 31; the :02d format adds the leading 0 that days 1 to 9 need
        for q in range(1, 32):
            # define the file name based on d, t and q
            z = f"{d}-{t}-{q:02d}.csv"

            # read the file from object storage
            input_data = spark.read\
              .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
              .option('header', 'true')\
              .load(cos.url(z, 'backblazedata-donotdelete-pr-cij57grgkoctem'))
            # select the relevant fields
            input_data = input_data.select("date", "serial_number", "model", "capacity_bytes", "failure",
                   "smart_5_normalized", "smart_187_normalized", "smart_188_normalized",
                   "smart_197_normalized", "smart_9_normalized",
                   "smart_5_raw", "smart_187_raw", "smart_188_raw",
                   "smart_197_raw", "smart_9_raw")
            # append the current file to the running data frame, which started as the empty frame
            base = reduce(DataFrame.unionAll, [base, input_data])


##### 3.22 Read in Months with 30 days.  <a id="322"></a>

In [None]:
# define the years
for d in ['2020', '2019', '2018', '2017']:
    # define the months with 30 days
    for t in ['04', '06', '09', '11']:
        # loop over days 1 to 30; the :02d format adds the leading 0 that days 1 to 9 need
        for q in range(1, 31):
            # define the file name based on d, t and q
            z = f"{d}-{t}-{q:02d}.csv"

            # read the file from object storage
            input_data = spark.read\
              .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
              .option('header', 'true')\
              .load(cos.url(z, 'backblazedata-donotdelete-pr-cij57grgkoctem'))
            # select the relevant fields
            input_data = input_data.select("date", "serial_number", "model", "capacity_bytes", "failure",
                   "smart_5_normalized", "smart_187_normalized", "smart_188_normalized",
                   "smart_197_normalized", "smart_9_normalized",
                   "smart_5_raw", "smart_187_raw", "smart_188_raw",
                   "smart_197_raw", "smart_9_raw")
            # append the current file to the running data frame
            base = reduce(DataFrame.unionAll, [base, input_data])

##### 3.23 Read in Months with 29 days.  <a id="323"></a>

In [None]:
# define the years (2020 is the only leap year in the data)
for d in ['2020']:
    # define the months with 29 days
    for t in ['02']:
        # loop over days 1 to 29; the :02d format adds the leading 0 that days 1 to 9 need
        for q in range(1, 30):
            # define the file name based on d, t and q
            z = f"{d}-{t}-{q:02d}.csv"

            # read the file from object storage
            input_data = spark.read\
              .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
              .option('header', 'true')\
              .load(cos.url(z, 'backblazedata-donotdelete-pr-cij57grgkoctem'))
            # select the relevant fields
            input_data = input_data.select("date", "serial_number", "model", "capacity_bytes", "failure",
                   "smart_5_normalized", "smart_187_normalized", "smart_188_normalized",
                   "smart_197_normalized", "smart_9_normalized",
                   "smart_5_raw", "smart_187_raw", "smart_188_raw",
                   "smart_197_raw", "smart_9_raw")
            # append the current file to the running data frame
            base = reduce(DataFrame.unionAll, [base, input_data])

##### 3.24 Read in Months with 28 days.  <a id="324"></a>

In [None]:
# define the years
for d in ['2019', '2018', '2017']:
    # define the months with 28 days
    for t in ['02']:
        # loop over days 1 to 28; the :02d format adds the leading 0 that days 1 to 9 need
        for q in range(1, 29):
            # define the file name based on d, t and q
            z = f"{d}-{t}-{q:02d}.csv"

            # read the file from object storage
            input_data = spark.read\
              .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
              .option('header', 'true')\
              .load(cos.url(z, 'backblazedata-donotdelete-pr-cij57grgkoctem'))
            # select the relevant fields
            input_data = input_data.select("date", "serial_number", "model", "capacity_bytes", "failure",
                   "smart_5_normalized", "smart_187_normalized", "smart_188_normalized",
                   "smart_197_normalized", "smart_9_normalized",
                   "smart_5_raw", "smart_187_raw", "smart_188_raw",
                   "smart_197_raw", "smart_9_raw")
            # append the current file to the running data frame
            base = reduce(DataFrame.unionAll, [base, input_data])

### 4.0 Prepare Fields <a id="40"></a>

In the next section I clean up the fields.



In [None]:
# rename the dataframe for simplicity
df=base


Give the fields descriptive names
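
The renames follow a regular pattern: each SMART attribute number maps to a descriptive name, with `_N` for the normalized value and `_R` for the raw value. A minimal sketch of that mapping in plain Python follows. `SMART_NAMES` and `rename_map` are illustrative names of my own, and the spellings are copied from the cell below.

```python
# Descriptive names per SMART attribute number (spellings copied from the notebook).
SMART_NAMES = {
    5: "REAllOCATED_SECTOR_COUNT",
    187: "REPORTED_UNCORRECTABLE_ERRORS",
    188: "COMMAND_TIMEOUT",
    197: "CURRENT_PENDING_SECTOR_COUNT",
    198: "OFFLINE_UNCORRECTABLE",
    9: "POWER_ON_HOURS",
}


def rename_map():
    """Map each original column name to its descriptive, upper-case name."""
    mapping = {}
    for smart_id, name in SMART_NAMES.items():
        mapping[f"smart_{smart_id}_normalized"] = name + "_N"
        mapping[f"smart_{smart_id}_raw"] = name + "_R"
    # The identifying fields are simply upper-cased
    for field in ["date", "serial_number", "model", "capacity_bytes", "failure"]:
        mapping[field] = field.upper()
    return mapping


for old, new in rename_map().items():
    print(old, "->", new)
```

With a Spark data frame, each pair could be passed to `withColumnRenamed(old, new)` in a loop instead of chaining the calls by hand.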

In [None]:
# rename each field so it is more descriptive
# note: the smart_198 columns were not selected above, so those two renames have no effect
df = df.withColumnRenamed("smart_5_normalized","REAllOCATED_SECTOR_COUNT_N") \
    .withColumnRenamed("smart_187_normalized","REPORTED_UNCORRECTABLE_ERRORS_N")\
    .withColumnRenamed("smart_188_normalized","COMMAND_TIMEOUT_N")\
    .withColumnRenamed("smart_197_normalized","CURRENT_PENDING_SECTOR_COUNT_N")\
    .withColumnRenamed("smart_198_normalized","OFFLINE_UNCORRECTABLE_N")\
    .withColumnRenamed("smart_9_normalized","POWER_ON_HOURS_N")\
    .withColumnRenamed("smart_5_raw","REAllOCATED_SECTOR_COUNT_R")\
    .withColumnRenamed("smart_187_raw","REPORTED_UNCORRECTABLE_ERRORS_R")\
    .withColumnRenamed("smart_188_raw","COMMAND_TIMEOUT_R")\
    .withColumnRenamed("smart_197_raw","CURRENT_PENDING_SECTOR_COUNT_R")\
    .withColumnRenamed("smart_198_raw","OFFLINE_UNCORRECTABLE_R")\
    .withColumnRenamed("smart_9_raw","POWER_ON_HOURS_R")\
    .withColumnRenamed("date","DATE")\
    .withColumnRenamed("serial_number","SERIAL_NUMBER")\
    .withColumnRenamed("model","MODEL")\
    .withColumnRenamed("capacity_bytes","CAPACITY_BYTES")\
    .withColumnRenamed("failure","FAILURE").cache()

The model field needs a little cleaning up. The Seagate brand, for example, has 5 different values in the data. We will create a new field called "MANUFACTURER" to correct this.
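
The rule is "first matching substring wins," so the order of the checks matters. The plain-Python sketch below mirrors that logic so it can be tried without Spark. The function name `manufacturer_from_model` and the sample model strings are my own illustrative assumptions. Note that it tests `HGST` before `ST`, because the string `HGST` itself contains `ST`.

```python
# Ordered (substring, label) rules; the first substring found in the model wins.
RULES = [
    ("TOSHIBA", "TOSHIBA"),
    ("HGST", "HGST"),  # must precede 'ST', since 'HGST' contains 'ST'
    ("SG", "SEAGATE"),
    ("ST", "SEAGATE"),
    ("Sea", "SEAGATE"),
    ("WD", "WD"),
    ("DELL", "DELL"),
    ("Hit", "HITACHI"),
]


def manufacturer_from_model(model):
    """Return the manufacturer label for a model string, or 'pp' if no rule matches."""
    for substring, label in RULES:
        if substring in model:  # case-sensitive substring test
            return label
    return "pp"


# Hypothetical sample model strings, for illustration only
samples = ["ST4000DM000", "HGST HMS5C4040ALE640", "WDC WD60EFRX",
           "Hitachi HDS722020ALA330", "unknown model"]
for sample in samples:
    print(sample, "->", manufacturer_from_model(sample))
```

Keeping the rules in an ordered list makes the precedence explicit, which is easy to lose sight of in a long `when` chain.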

In [None]:


# HGST is tested before 'ST' because the string 'HGST' itself contains 'ST'
df = df.withColumn('MANUFACTURER',when (F.instr(df.MODEL, 'TOSHIBA') > 0,'TOSHIBA')\
                   .when(F.instr(df.MODEL, 'HGST') > 0,'HGST')\
                   .when(F.instr(df.MODEL, 'SG') > 0,'SEAGATE')\
                   .when(F.instr(df.MODEL, 'ST') > 0,'SEAGATE')\
                   .when(F.instr(df.MODEL, 'Sea') > 0,'SEAGATE')\
                   .when(F.instr(df.MODEL, 'WD') > 0,'WD')\
                   .when(F.instr(df.MODEL, 'DELL') > 0,'DELL')\
                   .when(F.instr(df.MODEL, 'Hit') > 0,'HITACHI').otherwise('pp')).cache()

### 5.0 Export to Parquet files <a id="50"></a>

As we progress through this exercise, I will use the 2017 through 2019 data to create mean encoded dummy variables, and I will use the 2020 data to build my model. Given that the years in the available data serve different functions, I will create two different data frames: one for 2020 and another for 2017 through 2019.

For more info on mean encoded dummy variables, please see the following article.

https://medium.com/towards-data-science/leveraging-value-from-postal-codes-naics-codes-area-codes-and-other-funky-arse-categorical-be9ce75b6d5a
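
The split in the next cell compares the `DATE` column, which is read from CSV as text, against string literals. That works because dates written as `YYYY-MM-DD` sort the same way as the calendar dates they represent. A small plain-Python sketch of the idea, using made-up sample dates:

```python
from datetime import date

# Made-up sample dates in the same YYYY-MM-DD text form as the DATE column
samples = ["2020-01-01", "2019-12-31", "2017-06-15", "2020-11-30", "2018-02-28"]

# Comparing the strings directly gives the same split as comparing real dates
by_string = [s for s in samples if s >= "2020-01-01"]
by_date = [s for s in samples if date.fromisoformat(s) >= date(2020, 1, 1)]

print(by_string)
print(by_string == by_date)
```

This only holds while every date keeps its zero-padded month and day. A value such as `2020-1-5` would break the ordering.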

In [None]:
df_2020=df.filter(df.DATE>='2020-01-01').cache()
df_2019_2017=df.filter(df.DATE<='2019-12-31').cache()

Now I will export these data frames to Parquet files that we can use in subsequent steps.

Write the 2020 data to a Parquet file.

In [None]:
#cos.url(filenametowrite,bucketnameforyourproject). For example:

df_2020.write.mode("overwrite").parquet(cos.url('data2020.parquet', 'backblazedata-donotdelete-pr-cij57grgkoctem'))

Write 2019, 2018 and 2017 data to a parquet file.

In [None]:
#cos.url(filenametowrite,bucketnameforyourproject). For example:

df_2019_2017.write.mode("overwrite").parquet(cos.url('data2019_2017.parquet', 'backblazedata-donotdelete-pr-cij57grgkoctem'))

All data used in this notebook is the property of BackBlaze.com. 

For questions regarding use of data please see the following website.
https://www.backblaze.com/b2/hard-drive-test-data.html