# Verifying Data Quality
This notebook checks the raw data for problems before it is inserted into the database. There are 7 datasets in total: driver_data, safety_labels, and 5 sensor_data parts.

Things to check:
* Are there any duplicates in the data?
* Is any of the data in an incorrect format?
* Do all the sensor datasets have the same columns?
* Does each column have the same datatype across all 5 sensor datasets?
* Is `id` unique for driver_data?
* Is `bookingID`/`driver_id` unique for safety_labels?
* Is `bookingID` unique for sensor_data?

In [1]:
import numpy as np
import pandas as pd

In [2]:
# forward slashes avoid invalid escape sequences such as '\d' and work on every OS
driver_data = pd.read_csv('../data/driver_data.csv')
driver_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             148 non-null    int64  
 1   name           148 non-null    object 
 2   date_of_birth  148 non-null    object 
 3   gender         148 non-null    object 
 4   car_model      148 non-null    object 
 5   car_make_year  148 non-null    int64  
 6   rating         148 non-null    float64
dtypes: float64(1), int64(2), object(4)
memory usage: 8.2+ KB


In [31]:
driver_data.head()

| | id | name | date_of_birth | gender | car_model | car_make_year | rating |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Tressa | 1/12/1992 | Female | Mazda | 2011 | 4.5 |
| 1 | 2 | Virgilio | 10/23/1992 | Male | Mazda | 2004 | 3.5 |
| 2 | 3 | Bert | 8/10/1989 | Male | Nissan | 2008 | 4.5 |
| 3 | 4 | Mahmoud | 8/14/1981 | Male | Toyota | 2008 | 4.5 |
| 4 | 5 | Felecia | 7/20/1990 | Female | Hyundai | 2010 | 4.5 |
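
The "incorrect format" question from the checklist can be probed on `date_of_birth`, which is loaded as plain strings. The sketch below assumes the dates are written month/day/year (a value such as `10/23/1992` suggests this) and uses a small made-up frame, `sample`, rather than the real file. The malformed last value is invented for illustration.

```python
import pandas as pd

# Toy rows shaped like driver_data; the last date is deliberately malformed (hypothetical).
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "date_of_birth": ["1/12/1992", "10/23/1992", "1992-13-45"],
})

# Parse with an explicit month/day/year format (an assumption about the data).
# Values that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(sample["date_of_birth"], format="%m/%d/%Y", errors="coerce")

# Rows whose date_of_birth could not be parsed.
bad_rows = sample[parsed.isna()]
print(bad_rows)
```

An empty `bad_rows` on the real data would indicate that every date follows the assumed format.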


In [3]:
safety_labels = pd.read_csv('../data/safety_labels.csv')
safety_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   bookingID  20000 non-null  int64
 1   driver_id  20000 non-null  int64
 2   label      20000 non-null  int64
dtypes: int64(3)
memory usage: 468.9 KB


In [32]:
safety_labels.head()

| | bookingID | driver_id | label |
|---|---|---|---|
| 0 | 111669149733 | 140 | 0 |
| 1 | 335007449205 | 15 | 1 |
| 2 | 171798691856 | 61 | 0 |
| 3 | 1520418422900 | 97 | 0 |
| 4 | 798863917116 | 92 | 0 |


In [4]:
sensor1 = pd.read_csv('../data/sensor_data/sensor_data_part-1.csv')
sensor2 = pd.read_csv('../data/sensor_data/sensor_data_part-2.csv')
sensor3 = pd.read_csv('../data/sensor_data/sensor_data_part-3.csv')
sensor4 = pd.read_csv('../data/sensor_data/sensor_data_part-4.csv')
sensor5 = pd.read_csv('../data/sensor_data/sensor_data_part-5.csv')

In [5]:
sensor1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1613554 entries, 0 to 1613553
Data columns (total 11 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   bookingID       1613554 non-null  int64  
 1   Accuracy        1549012 non-null  float64
 2   Bearing         1549012 non-null  float64
 3   acceleration_x  1581283 non-null  float64
 4   acceleration_y  1581283 non-null  float64
 5   acceleration_z  1549012 non-null  float64
 6   gyro_x          1565147 non-null  float64
 7   gyro_y          1597418 non-null  float64
 8   gyro_z          1613554 non-null  float64
 9   second          1613554 non-null  float64
 10  Speed           1565147 non-null  float64
dtypes: float64(10), int64(1)
memory usage: 135.4 MB


In [63]:
sensor1.iloc[33,8]

6.4751715e-05

In [34]:
sensor1.head()

| | bookingID | Accuracy | Bearing | acceleration_x | acceleration_y | acceleration_z | gyro_x | gyro_y | gyro_z | second | Speed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1202590843006 | 3.0 | 353.0 | 1.228867 | 8.9001 | 3.986968 | 0.008221 | 0.002269 | -0.009966 | 1362.0 | 0.0 |
| 1 | 274877907034 | 9.293 | 17.0 | 0.032775 | 8.659933 | 4.7373 | 0.024629 | 0.004028 | -0.010858 | 257.0 | 0.19 |
| 2 | 884763263056 | 3.0 | 189.0 | 1.139675 | 9.545974 | 1.951334 | -0.006899 | -0.01508 | 0.001122 | 973.0 | 0.667059 |
| 3 | 1073741824054 | 3.9 | 126.0 | 3.871542 | 10.386364 | -0.136474 | 0.001344 | -0.339601 | -0.017956 | 902.0 | 7.913285 |
| 4 | 1056561954943 | 3.9 | 50.0 | -0.112882 | 10.55096 | -1.56011 | 0.130568 | -0.061697 | 0.16153 | 820.0 | 20.419409 |


## Duplicates Check
For driver_data and safety_labels only

In [6]:
# check for duplicates in driver_data
driver_data[driver_data.duplicated()].shape[0]

0

In [7]:
# check for duplicates in safety_labels
safety_labels[safety_labels.duplicated()].shape[0]

0

<b>Results: </b>

There are no duplicate rows in driver_data or safety_labels.

## Sensor data Column consistency check
Check whether the column names differ between the sensor datasets.

In [9]:
# check if column names are the same in all 5 sensor datasets (using sensor1 as reference)
# Index.equals returns False when the number of columns differs, whereas `==` would raise
for i, other in enumerate([sensor2, sensor3, sensor4, sensor5], start=2):
    print(f"Any different columns (1 and {i}): {not sensor1.columns.equals(other.columns)}")

Any different columns (1 and 2): False
Any different columns (1 and 3): False
Any different columns (1 and 4): False
Any different columns (1 and 5): False


<b>Results: </b>

Column names and count are all consistent throughout the 5 sensor datasets

## Sensor data datatype consistency check
Using sensor1 as the benchmark, check whether the other 4 sensor datasets have the same datatype for each column.

In [10]:
sensor1.dtypes == sensor2.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

In [11]:
sensor1.dtypes == sensor3.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

In [12]:
sensor1.dtypes == sensor4.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

In [13]:
sensor1.dtypes == sensor5.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

<b>Results: </b>

All the sensor datasets have the same datatype for each column.
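
The four pairwise comparisons above could also be folded into one helper that reports only the columns whose dtype differs from the reference. This is a minimal sketch on made-up frames. `dtype_mismatches` and the toy frames `ref`, `same`, and `drifted` are hypothetical names, not part of this notebook.

```python
import pandas as pd

def dtype_mismatches(reference, others):
    """Return {name: [columns]} for frames whose column dtypes differ from the reference."""
    mismatches = {}
    for name, frame in others.items():
        # A column missing from `frame` is also reported as a mismatch.
        differing = [
            col for col in reference.columns
            if col not in frame.columns or reference[col].dtype != frame[col].dtype
        ]
        if differing:
            mismatches[name] = differing
    return mismatches

# Toy frames: `drifted` stores bookingID as float instead of int.
ref = pd.DataFrame({"bookingID": [1, 2], "Speed": [0.0, 1.5]})
same = pd.DataFrame({"bookingID": [3], "Speed": [2.0]})
drifted = pd.DataFrame({"bookingID": [4.0], "Speed": [3.0]})

result = dtype_mismatches(ref, {"same": same, "drifted": drifted})
print(result)
```

With the real data, the call would look like `dtype_mismatches(sensor1, {"sensor2": sensor2, ...})`, and an empty dict would correspond to the all-`True` outputs shown above.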

## Unique ID check (driver_data)

In [14]:
len(np.unique(driver_data['id'])) == driver_data.shape[0]

True

<b>Results: </b>

The number of unique ids in driver_data equals the number of rows in the dataset, so there are no duplicated ids.

## Unique ID check (safety_labels)

In [15]:
len(np.unique(safety_labels['bookingID'])) == safety_labels.shape[0]

True

In [16]:
len(np.unique(safety_labels['driver_id'])) == safety_labels.shape[0]

False

<b>Results: </b>

`bookingID` is unique in the safety_labels dataset, but `driver_id` is not.
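
A repeated `driver_id` is expected if it refers to `id` in driver_data (an assumption, not verified here): safety_labels has 20000 rows while driver_data has only 148 drivers, so one driver must account for many bookings. The sketch below shows the same uniqueness checks with `Series.is_unique`, plus a per-driver count, on a small made-up frame called `labels`.

```python
import pandas as pd

# Toy stand-in for safety_labels (values are invented for illustration).
labels = pd.DataFrame({
    "bookingID": [101, 102, 103, 104],
    "driver_id": [7, 7, 9, 7],
    "label": [0, 1, 0, 0],
})

# is_unique is True only when no value repeats in the column.
booking_unique = labels["bookingID"].is_unique
driver_unique = labels["driver_id"].is_unique

# Number of bookings recorded for each driver.
trips_per_driver = labels["driver_id"].value_counts()

print(booking_unique, driver_unique)
print(trips_per_driver)
```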

## Duplicates Check (sensor_data)
Check whether there are duplicate rows across all 5 sensor datasets.

In [52]:
# Combine all sensor datasets to form one sensor DataFrame
# (this cell must run first: the cells below use all_sensors)
all_sensors = pd.concat([sensor1, sensor2, sensor3, sensor4, sensor5], ignore_index=True)
all_sensors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7469656 entries, 0 to 7469655
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   bookingID       int64  
 1   Accuracy        float64
 2   Bearing         float64
 3   acceleration_x  float64
 4   acceleration_y  float64
 5   acceleration_z  float64
 6   gyro_x          float64
 7   gyro_y          float64
 8   gyro_z          float64
 9   second          float64
 10  Speed           float64
dtypes: float64(10), int64(1)
memory usage: 626.9 MB


In [58]:
# Range of gyro_x values in the combined data
col = 'gyro_x'
print(np.max(all_sensors[col]))
print(np.min(all_sensors[col]))

38.708088
-48.45575


In [59]:
# Largest number of characters after the decimal point in the string form of gyro_x
all_sensors['gyro_x'].apply(lambda x: len(str(x).split('.')[1]) if len(str(x).split('.')) > 1 else 0).max()

20


In [28]:
# check for duplicates in all 5 sensor data
print(f"Any duplicates: {all_sensors[all_sensors.duplicated()].shape[0]}")

Any duplicates: 0


<b>Results: </b>

There are no duplicates in sensor data

## Unique ID check (sensor data)
As there are 5 sensor datasets to work with, the check runs on `all_sensors`, the single DataFrame that combines them, and compares the number of unique bookingIDs with the number of rows.

In [27]:
len(np.unique(all_sensors['bookingID'])) == all_sensors.shape[0]

False

<b>Initial Results: </b>

There are duplicate bookingIDs in sensor data

### Hypothesis:
No two rows with the same bookingID share the same `second` value.

In [None]:
# stop running notebook
import sys
sys.exit()

In [92]:
# For each bookingID, flag it if any `second` value appears more than once
is_clean = True
for booking_id, seconds in all_sensors.groupby('bookingID')['second']:
    if seconds.nunique(dropna=False) != len(seconds):
        print(f"Anomaly detected: {booking_id}")
        is_clean = False
if is_clean:
    print("Sensor data is clean")

Sensor data is clean


<b>Results:</b>

The bookingIDs are not unique because each trip has multiple sensor readings recorded under the same bookingID. However, the composite key of `bookingID` and `second` is unique.
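
The composite-key check can also be written without a per-booking loop by asking pandas for rows that share a `(bookingID, second)` pair. The sketch below runs on a small made-up frame, `readings`, which deliberately contains one clash so the output is not empty. The real sensor data showed none.

```python
import pandas as pd

# Toy sensor rows (invented): booking 2 has two readings at second 0.0.
readings = pd.DataFrame({
    "bookingID": [1, 1, 1, 2, 2],
    "second": [0.0, 1.0, 2.0, 0.0, 0.0],
    "Speed": [0.0, 1.2, 3.4, 0.0, 0.5],
})

# keep=False marks every row that shares its (bookingID, second) pair with another row.
clashes = readings[readings.duplicated(subset=["bookingID", "second"], keep=False)]

# bookingIDs that violate the composite key.
bad_bookings = clashes["bookingID"].unique().tolist()
print(bad_bookings)
```

On the real data, `all_sensors.duplicated(subset=['bookingID', 'second']).any()` returning `False` would match the "Sensor data is clean" result above.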