# Verifying Data Quality
This step checks the data for quality problems and must be completed before any data is inserted into the database. There are 7 datasets in total: the driver data, the safety labels, and 5 sensor data files.

Things to check:
* Does any dataset contain duplicate rows?
* Is any of the data in an incorrect format?
* Do all the sensor datasets have the same columns?
* Does each column have the same datatype across all 5 sensor datasets?
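
The format question in the list above can be approached by trying to parse a column and counting the values that fail. The following is a minimal sketch, assuming `date_of_birth` holds date strings; the sample frame and the helper name `count_unparseable_dates` are hypothetical and stand in for the real `driver_data`.

```python
import pandas as pd

# Hypothetical sample standing in for driver_data; the real values are not shown here.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "date_of_birth": ["1990-01-31", "not a date", "1985-12-01"],
})


def count_unparseable_dates(series: pd.Series) -> int:
    """Count non-null entries that pandas cannot parse as a date."""
    # errors="coerce" turns unparseable values into NaT instead of raising.
    parsed = pd.to_datetime(series, errors="coerce")
    # Only values that were present but failed to parse count as format errors.
    return int((parsed.isna() & series.notna()).sum())


invalid_dates = count_unparseable_dates(sample["date_of_birth"])
print(f"Unparseable date_of_birth values: {invalid_dates}")
```

A count above zero would point to rows that need cleaning before the column can be stored as a date.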

In [3]:
import numpy as np
import pandas as pd

In [4]:
# Forward slashes avoid invalid escape sequences such as '\d' in the path
driver_data = pd.read_csv('../data/driver_data.csv')
driver_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             148 non-null    int64  
 1   name           148 non-null    object 
 2   date_of_birth  148 non-null    object 
 3   gender         148 non-null    object 
 4   car_model      148 non-null    object 
 5   car_make_year  148 non-null    int64  
 6   rating         148 non-null    float64
dtypes: float64(1), int64(2), object(4)
memory usage: 8.2+ KB


In [5]:
# Forward slashes avoid invalid escape sequences such as '\d' and '\s' in the path
safety_labels = pd.read_csv('../data/safety_labels.csv')
safety_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   bookingID  20000 non-null  int64
 1   driver_id  20000 non-null  int64
 2   label      20000 non-null  int64
dtypes: int64(3)
memory usage: 468.9 KB


In [15]:
# Load the 5 sensor data parts (forward slashes avoid invalid escape sequences)
sensor1 = pd.read_csv('../data/sensor_data/sensor_data_part-1.csv')
sensor2 = pd.read_csv('../data/sensor_data/sensor_data_part-2.csv')
sensor3 = pd.read_csv('../data/sensor_data/sensor_data_part-3.csv')
sensor4 = pd.read_csv('../data/sensor_data/sensor_data_part-4.csv')
sensor5 = pd.read_csv('../data/sensor_data/sensor_data_part-5.csv')

In [16]:
sensor1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1613554 entries, 0 to 1613553
Data columns (total 11 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   bookingID       1613554 non-null  int64  
 1   Accuracy        1549012 non-null  float64
 2   Bearing         1549012 non-null  float64
 3   acceleration_x  1581283 non-null  float64
 4   acceleration_y  1581283 non-null  float64
 5   acceleration_z  1549012 non-null  float64
 6   gyro_x          1565147 non-null  float64
 7   gyro_y          1597418 non-null  float64
 8   gyro_z          1613554 non-null  float64
 9   second          1613554 non-null  float64
 10  Speed           1565147 non-null  float64
dtypes: float64(10), int64(1)
memory usage: 135.4 MB
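
Several columns in the output above have fewer non-null values than the 1,613,554 rows, so `sensor1` contains missing values. A per-column count makes this easier to read than `info()`. The sketch below uses a hypothetical miniature frame (`sensor_sample`) rather than the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for sensor1.
sensor_sample = pd.DataFrame({
    "bookingID": [1, 1, 2, 2],
    "Accuracy": [5.0, np.nan, 8.0, 10.0],
    "Speed": [3.2, 4.1, np.nan, np.nan],
})

# Number of missing values per column.
missing_counts = sensor_sample.isna().sum()
# Fraction of rows that are missing per column.
missing_share = sensor_sample.isna().mean()

summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary)
```

Applied to the real data, the same two calls would run on `sensor1` through `sensor5` directly.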


## Duplicates check

In [9]:
# check for duplicates in driver_data
driver_data[driver_data.duplicated()].shape[0]

0

In [10]:
# check for duplicates in safety_labels
safety_labels[safety_labels.duplicated()].shape[0]

0

In [17]:
# check for duplicates in all 5 sensor datasets
for i, sensor in enumerate([sensor1, sensor2, sensor3, sensor4, sensor5], start=1):
    print(f"Sensor{i} duplicates: {sensor.duplicated().sum()}")

Sensor1 duplicates: 0
Sensor2 duplicates: 0
Sensor3 duplicates: 0
Sensor4 duplicates: 0
Sensor5 duplicates: 0


<b>Results: </b>None of the datasets contains fully duplicated rows.
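
`duplicated()` with no arguments only flags rows that match on every column. If a smaller set of columns is meant to identify a record, the `subset` argument checks that as well. The sketch below assumes, hypothetically, that `bookingID` and `second` together identify one sensor reading; that key is not confirmed by the data shown here.

```python
import pandas as pd

# Hypothetical miniature frame standing in for one sensor dataset.
sensor_sample = pd.DataFrame({
    "bookingID": [1, 1, 1, 2],
    "second": [0.0, 1.0, 1.0, 0.0],
    "Speed": [3.2, 4.1, 4.3, 0.5],
})

# Rows identical in every column.
full_row_dupes = int(sensor_sample.duplicated().sum())

# Rows repeating the assumed key (bookingID, second), even if other values differ.
key_dupes = int(sensor_sample.duplicated(subset=["bookingID", "second"]).sum())

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Key duplicates: {key_dupes}")
```

In this sample the two readings for booking 1 at second 1.0 differ in `Speed`, so only the key-based check reports them.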

## Column consistency check


In [32]:
# check if column names are the same in all 5 sensor datasets (using sensor1 as reference)
# Index.equals compares names and order, and does not fail when the column counts differ
print(f"Any different columns (1 and 2): {not sensor1.columns.equals(sensor2.columns)}")
print(f"Any different columns (1 and 3): {not sensor1.columns.equals(sensor3.columns)}")
print(f"Any different columns (1 and 4): {not sensor1.columns.equals(sensor4.columns)}")
print(f"Any different columns (1 and 5): {not sensor1.columns.equals(sensor5.columns)}")

Any different columns (1 and 2): False
Any different columns (1 and 3): False
Any different columns (1 and 4): False
Any different columns (1 and 5): False


<b>Results: </b>Column names, order, and count are consistent across all 5 sensor datasets.
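
If a check like this ever reports a difference, it helps to know which columns are involved. One possible approach is a set difference against the reference dataset. The helper name `column_diff` and the sample frames are hypothetical.

```python
import pandas as pd


def column_diff(reference: pd.DataFrame, other: pd.DataFrame):
    """Return (missing, extra) column names of `other` relative to `reference`."""
    missing = sorted(set(reference.columns) - set(other.columns))
    extra = sorted(set(other.columns) - set(reference.columns))
    return missing, extra


# Hypothetical frames: the second one lacks 'second' and adds 'gyro_x'.
ref = pd.DataFrame(columns=["bookingID", "Speed", "second"])
oth = pd.DataFrame(columns=["bookingID", "Speed", "gyro_x"])

missing, extra = column_diff(ref, oth)
print(f"Missing from other: {missing}")
print(f"Extra in other: {extra}")
```

Because it compares sets, this helper ignores column order, so it complements an order-sensitive equality check rather than replacing it.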

## Datatype consistency check
Using sensor1 as the reference, check whether each of the other 4 sensor datasets has the same datatype for every column.

In [38]:
sensor1.dtypes == sensor2.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

In [39]:
sensor1.dtypes == sensor3.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

In [40]:
sensor1.dtypes == sensor4.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool

In [42]:
sensor1.dtypes == sensor5.dtypes

bookingID         True
Accuracy          True
Bearing           True
acceleration_x    True
acceleration_y    True
acceleration_z    True
gyro_x            True
gyro_y            True
gyro_z            True
second            True
Speed             True
dtype: bool
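
The four pairwise comparisons above can also be collapsed into a single table with one column per dataset. This is a sketch on hypothetical miniature frames (the `frames` dictionary and its keys are illustrative), with one deliberately mismatched column to show what an inconsistency looks like.

```python
import pandas as pd

# Hypothetical miniature frames; part-3 stores Speed as text on purpose.
frames = {
    "part-1": pd.DataFrame({"bookingID": [1], "Speed": [1.0]}),
    "part-2": pd.DataFrame({"bookingID": [2], "Speed": [2.5]}),
    "part-3": pd.DataFrame({"bookingID": [3], "Speed": ["fast"]}),
}

# One row per column name, one column per dataset, values are dtype names.
dtype_table = pd.DataFrame(
    {name: df.dtypes.astype(str) for name, df in frames.items()}
)

# Keep only the rows where the datasets disagree on the dtype.
inconsistent = dtype_table[dtype_table.nunique(axis=1) > 1]

print(dtype_table)
print(f"Columns with inconsistent dtypes: {inconsistent.index.tolist()}")
```

An empty `inconsistent` table would correspond to the all-`True` outputs shown above.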