# Verifying Data Quality
This step checks the data for problems and must be completed before the data is inserted into the database. There are 7 datasets in total.

Things to check:
* Are there any duplicates in the data?
* Is any of the data in an incorrect format?
* Do all the sensor datasets have the same columns?
* Do all data in each column have the same datatype across all 5 sensor datasets?
* Is `id` unique for driver_data?
* Is `bookingID`/`driver_id` unique for safety_labels?
* Is `bookingID` unique for sensor_data?
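
One way the format question in the list above might be checked is sketched here. It assumes `date_of_birth` is meant to hold `YYYY-MM-DD` strings, which this notebook does not confirm, and it uses a small invented frame named `sample` in place of `driver_data`.

```python
import pandas as pd

# Hypothetical stand-in for driver_data; the values are invented for illustration.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "date_of_birth": ["1985-04-12", "12/04/1985", "1990-13-01"],
})

# Parse with an explicit (assumed) format; values that do not match become NaT.
parsed = pd.to_datetime(sample["date_of_birth"], format="%Y-%m-%d", errors="coerce")

# Rows whose date could not be parsed are the ones in an unexpected format.
bad_rows = sample[parsed.isna()]
print(bad_rows)
```

The same pattern could be pointed at the real column once the expected format is known; an empty `bad_rows` would suggest the column is consistently formatted.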

In [3]:
import numpy as np
import pandas as pd

In [4]:
# raw string so the backslashes in the Windows path are not treated as escape sequences
driver_data = pd.read_csv(r'..\data\driver_data.csv')
driver_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             148 non-null    int64  
 1   name           148 non-null    object 
 2   date_of_birth  148 non-null    object 
 3   gender         148 non-null    object 
 4   car_model      148 non-null    object 
 5   car_make_year  148 non-null    int64  
 6   rating         148 non-null    float64
dtypes: float64(1), int64(2), object(4)
memory usage: 8.2+ KB


In [5]:
# raw string so the backslashes in the Windows path are not treated as escape sequences
safety_labels = pd.read_csv(r'..\data\safety_labels.csv')
safety_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   bookingID  20000 non-null  int64
 1   driver_id  20000 non-null  int64
 2   label      20000 non-null  int64
dtypes: int64(3)
memory usage: 468.9 KB


In [15]:
# raw strings so the backslashes in the Windows paths are not treated as escape sequences
sensor1 = pd.read_csv(r'..\data\sensor_data\sensor_data_part-1.csv')
sensor2 = pd.read_csv(r'..\data\sensor_data\sensor_data_part-2.csv')
sensor3 = pd.read_csv(r'..\data\sensor_data\sensor_data_part-3.csv')
sensor4 = pd.read_csv(r'..\data\sensor_data\sensor_data_part-4.csv')
sensor5 = pd.read_csv(r'..\data\sensor_data\sensor_data_part-5.csv')

In [16]:
sensor1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1613554 entries, 0 to 1613553
Data columns (total 11 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   bookingID       1613554 non-null  int64  
 1   Accuracy        1549012 non-null  float64
 2   Bearing         1549012 non-null  float64
 3   acceleration_x  1581283 non-null  float64
 4   acceleration_y  1581283 non-null  float64
 5   acceleration_z  1549012 non-null  float64
 6   gyro_x          1565147 non-null  float64
 7   gyro_y          1597418 non-null  float64
 8   gyro_z          1613554 non-null  float64
 9   second          1613554 non-null  float64
 10  Speed           1565147 non-null  float64
dtypes: float64(10), int64(1)
memory usage: 135.4 MB


## Duplicates Check

In [9]:
# count fully duplicated rows in driver_data
driver_data.duplicated().sum()

0

In [10]:
# count fully duplicated rows in safety_labels
safety_labels.duplicated().sum()

0

In [17]:
# count fully duplicated rows in each of the 5 sensor datasets
for name, frame in [("Sensor1", sensor1), ("Sensor2", sensor2), ("Sensor3", sensor3),
                    ("Sensor4", sensor4), ("Sensor5", sensor5)]:
    print(f"{name} duplicates: {frame.duplicated().sum()}")

Sensor1 duplicates: 0
Sensor2 duplicates: 0
Sensor3 duplicates: 0
Sensor4 duplicates: 0
Sensor5 duplicates: 0


<b>Results: </b>

There are no duplicate rows in any of the datasets.

## Column consistency check


In [32]:
# check if column names are the same in all 5 sensor data (using sensor1 as reference)
# Index.equals compares names, order and count, and does not raise if the counts differ
print(f"Any different columns (1 and 2): {not sensor1.columns.equals(sensor2.columns)}")
print(f"Any different columns (1 and 3): {not sensor1.columns.equals(sensor3.columns)}")
print(f"Any different columns (1 and 4): {not sensor1.columns.equals(sensor4.columns)}")
print(f"Any different columns (1 and 5): {not sensor1.columns.equals(sensor5.columns)}")

Any different columns (1 and 2): False
Any different columns (1 and 3): False
Any different columns (1 and 4): False
Any different columns (1 and 5): False


<b>Results: </b>

Column names and column counts are consistent across all 5 sensor datasets.
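
If the check had reported a difference, the next question would be which columns differ. A minimal sketch of that follow-up, using two invented frames (`ref` and `other`, with a hypothetical `speed_kmh` column) rather than the real sensor files:

```python
import pandas as pd

# Hypothetical frames standing in for two sensor files.
ref = pd.DataFrame(columns=["bookingID", "Accuracy", "Speed"])
other = pd.DataFrame(columns=["bookingID", "Speed", "speed_kmh"])

# Columns present in one frame but not in the other.
missing_in_other = sorted(set(ref.columns) - set(other.columns))
extra_in_other = sorted(set(other.columns) - set(ref.columns))

# Index.equals additionally requires the same order and the same length.
same_layout = ref.columns.equals(other.columns)
print(missing_in_other, extra_in_other, same_layout)
```

Set differences name the offending columns, while `Index.equals` gives a single yes/no answer that is also sensitive to column order.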

## Datatype consistency check
Using sensor1 as the benchmark, check whether the other 4 sensor datasets have the same datatype for each column.

In [38]:
sensor1.dtypes == sensor2.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

In [39]:
sensor1.dtypes == sensor3.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

In [40]:
sensor1.dtypes == sensor4.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

In [42]:
sensor1.dtypes == sensor5.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

<b>Results: </b>

All the sensor datasets have the same datatype for each column.
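
The four comparisons above could also be collapsed into one loop that reports only the columns whose datatype differs from the benchmark. The sketch below uses invented frames (`ref`, `same`, `drifted`) to show the idea; the names are illustrative, not part of the notebook.

```python
import pandas as pd

# Hypothetical benchmark and two frames to compare against it.
ref = pd.DataFrame({"bookingID": [1, 2], "Speed": [0.5, 1.5]})
same = pd.DataFrame({"bookingID": [3, 4], "Speed": [2.0, 3.0]})
drifted = pd.DataFrame({"bookingID": [5, 6], "Speed": ["2.0", "n/a"]})

parts = {"same": same, "drifted": drifted}

mismatches = {}
for name, frame in parts.items():
    # Keep only the columns whose dtype differs from the benchmark.
    diff = frame.dtypes[frame.dtypes != ref.dtypes]
    if not diff.empty:
        mismatches[name] = list(diff.index)

print(mismatches)
```

An empty `mismatches` dictionary corresponds to the all-`True` outputs shown above. Note that comparing `dtypes` this way assumes the frames share the same column labels, which the column consistency check establishes first.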

## Unique ID check (driver_data)

In [45]:
len(np.unique(driver_data['id'])) == driver_data.shape[0]

True

<b>Results: </b>

The number of unique ids in driver_data equals the number of rows in the dataset, so there are no duplicated ids.

## Unique ID check (safety_labels)

In [93]:
len(np.unique(safety_labels['bookingID'])) == safety_labels.shape[0]

True

In [94]:
len(np.unique(safety_labels['driver_id'])) == safety_labels.shape[0]

False

<b>Results: </b>

`bookingID` is unique in the safety_labels dataset, but `driver_id` is not.
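
A repeated `driver_id` would be expected if one driver can have several bookings, although the notebook does not state that relationship explicitly. A small sketch of how the repeats might be inspected, using an invented `labels` frame in place of `safety_labels`:

```python
import pandas as pd

# Hypothetical stand-in for safety_labels; the values are invented.
labels = pd.DataFrame({
    "bookingID": [10, 11, 12, 13],
    "driver_id": [1, 1, 2, 3],
    "label": [0, 1, 0, 0],
})

# Series.is_unique gives the same answer as comparing unique count to row count.
print(labels["bookingID"].is_unique, labels["driver_id"].is_unique)

# Count bookings per driver and keep the drivers that appear more than once.
trips_per_driver = labels["driver_id"].value_counts()
repeat_drivers = trips_per_driver[trips_per_driver > 1]
print(repeat_drivers)
```

Looking at `repeat_drivers` helps distinguish an expected one-to-many relationship from accidental duplication.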

## Unique ID check (sensor data)
Because there are 5 sensor datasets, the bookingIDs from all of them are first combined into a single DataFrame and then checked for duplicates.

In [70]:
# bookingID column from each of the 5 sensor datasets
s1_ids = sensor1[['bookingID']]
s2_ids = sensor2[['bookingID']]
s3_ids = sensor3[['bookingID']]
s4_ids = sensor4[['bookingID']]
s5_ids = sensor5[['bookingID']]

# combine all 5 sensors' IDs into one data frame
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
all_ids = pd.concat([s1_ids, s2_ids, s3_ids, s4_ids, s5_ids], ignore_index=True)
len(np.unique(all_ids['bookingID'])) == all_ids.shape[0] # same number 1,000,667 as before

False

In [71]:
# get all rows for one booking id and sort them by second
sensor1[(sensor1['bookingID']==1202590843006)].sort_values(by='second', ignore_index=True)

Unnamed: 0,bookingID,Accuracy,Bearing,acceleration_x,acceleration_y,acceleration_z,gyro_x,gyro_y,gyro_z,second,Speed
0,1202590843006,6.181,316.0,1.314675,9.502998,3.751053,0.014940,-0.002269,-0.031730,25.0,7.998814
1,1202590843006,10.368,326.0,5.035418,7.967118,5.272981,-0.024662,-0.485324,-0.205059,29.0,5.132751
2,1202590843006,10.397,51.0,4.970152,8.340805,4.861608,-0.171828,-0.892771,-0.370045,31.0,2.432286
3,1202590843006,14.742,130.0,1.714325,8.839757,3.273697,-0.034121,-0.044558,0.073880,43.0,14.513086
4,1202590843006,17.825,134.0,2.229098,8.312774,5.560706,-0.177675,0.027890,0.037437,45.0,14.171288
...,...,...,...,...,...,...,...,...,...,...,...
114,1202590843006,4.700,345.0,0.311007,9.754408,4.960901,0.029828,0.146346,0.027437,1453.0,10.651598
115,1202590843006,6.413,338.0,1.291404,9.502691,,0.060388,-0.087389,-0.068120,1465.0,15.859739
116,1202590843006,6.312,359.0,2.120133,7.969474,5.537856,0.039689,-0.052447,-0.042062,1480.0,9.832068
117,1202590843006,6.131,3.0,1.171483,8.417190,1.362291,-0.071297,0.054664,0.010193,1481.0,8.770092


In [79]:
sensor1.head()

Unnamed: 0,bookingID,Accuracy,Bearing,acceleration_x,acceleration_y,acceleration_z,gyro_x,gyro_y,gyro_z,second,Speed
0,1202590843006,3.0,353.0,1.228867,8.9001,3.986968,0.008221,0.002269,-0.009966,1362.0,0.0
1,274877907034,9.293,17.0,0.032775,8.659933,4.7373,0.024629,0.004028,-0.010858,257.0,0.19
2,884763263056,3.0,189.0,1.139675,9.545974,1.951334,-0.006899,-0.01508,0.001122,973.0,0.667059
3,1073741824054,3.9,126.0,3.871542,10.386364,-0.136474,0.001344,-0.339601,-0.017956,902.0,7.913285
4,1056561954943,3.9,50.0,-0.112882,10.55096,-1.56011,0.130568,-0.061697,0.16153,820.0,20.419409


In [80]:
len(np.unique(sensor1[(sensor1['bookingID']==1056561954943)]['second'])) == len(sensor1[(sensor1['bookingID']==1056561954943)])

True

### Hypothesis:
No two rows share the same `second` within a given bookingID.

In [90]:
# Combine all sensor dataset to form one sensor data
all_sensors = pd.concat([sensor1, sensor2, sensor3, sensor4, sensor5])
all_sensors.tail()

Unnamed: 0,bookingID,Accuracy,Bearing,acceleration_x,acceleration_y,acceleration_z,gyro_x,gyro_y,gyro_z,second,Speed
1613554,884763262985,3.9,226.0,0.260422,9.907822,3.162796,0.013733,0.010056,0.017792,98.0,9.98
1613555,1571958030347,5.0,341.78299,-1.168625,-9.396103,-0.009271,0.032545,0.009954,0.038534,509.0,6.44
1613556,584115552361,6.0,50.0,6.186806,6.809318,0.234639,0.505468,0.255951,0.202501,519.0,
1613557,1073741824126,10.72,324.0,-0.274582,8.512177,3.903046,-0.037451,-0.044601,-0.033173,2289.0,8.77
1613558,884763263001,12.0,357.002563,0.989182,-9.599023,-6.042905,0.035069,-0.031591,0.021383,310.0,28.19813


In [None]:
# stop running notebook
import sys
sys.exit()

In [92]:
# rows that repeat a (bookingID, second) pair; avoids filtering the full frame once per bookingID
dupes = all_sensors[all_sensors.duplicated(subset=['bookingID', 'second'], keep=False)]
for bookingID in np.unique(dupes['bookingID']):
    print(f"Anomaly detected: {bookingID}")
if dupes.empty:
    print("Sensor data is clean")

Sensor data is clean


<b>Results:</b>

The bookingIDs are not unique in the sensor data, because each bookingID has multiple readings taken throughout the trip. However, the composite key of `bookingID` and `second` is unique.
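
Since this key is what the database load will presumably rely on, it may be worth asserting it just before insertion. The sketch below is one possible guard, not the notebook's method: `has_unique_key` is a hypothetical helper, and `readings` is an invented frame. It relies on the `verify_integrity` option of `DataFrame.set_index`, which raises `ValueError` when the chosen key contains duplicates.

```python
import pandas as pd

# Hypothetical readings; the last two rows share the same (bookingID, second).
readings = pd.DataFrame({
    "bookingID": [1, 1, 2, 2],
    "second": [0.0, 1.0, 0.0, 0.0],
    "Speed": [3.1, 3.4, 0.0, 0.2],
})

def has_unique_key(frame, key):
    """Return True if the `key` columns uniquely identify each row."""
    try:
        frame.set_index(key, verify_integrity=True)
        return True
    except ValueError:
        return False

with_collision = has_unique_key(readings, ["bookingID", "second"])
without_collision = has_unique_key(readings.iloc[:3], ["bookingID", "second"])
print(with_collision, without_collision)
```

A guard like this fails fast if a later data refresh breaks the composite key, rather than leaving the problem to surface as a constraint error in the database.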