# Introduction
Before starting the data quality report (DQR), I imported the dataset and removed all rows containing weather data outside the required time period (the year 2018). Although this step could have been done later, doing it first greatly reduced the size of the dataset, making it quicker and easier to work with throughout the DQR.

* Station Name: PHOENIX PARK
* Station Height: 48 m
* Latitude: 53.364, Longitude: -6.350

|Feature|Description|Unit|
|---|---|---|
|date|Date and Time|UTC|
|rain|Precipitation Amount|mm|
|temp|Air Temperature|°C|
|wetb|Wet Bulb Temperature|°C|
|dewpt|Dew Point Temperature|°C|
|vappr|Vapour Pressure|hPa|
|rhum|Relative Humidity|%|
|msl|Mean Sea Level Pressure|hPa|
|ind|Indicator| |

In [60]:
import pandas as pd
import numpy as np
import json
import requests
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.backends.backend_pdf import PdfPages

In [61]:
df = pd.read_csv('data/raw_data/met-2018.csv', parse_dates=[0])


In [62]:
df.head()

Unnamed: 0,date,ind,rain,ind.1,temp,ind.2,wetb,dewpt,vappr,rhum,msl
0,2003-08-16 01:00:00,0,0.0,0,9.2,0,8.9,8.5,11.1,95,1021.9
1,2003-08-16 02:00:00,0,0.0,0,9.0,0,8.7,8.5,11.1,96,1021.7
2,2003-08-16 03:00:00,0,0.0,0,8.2,0,8.0,7.7,10.5,96,1021.2
3,2003-08-16 04:00:00,0,0.0,0,8.4,0,8.1,7.9,10.7,97,1021.2
4,2003-08-16 05:00:00,0,0.0,0,7.7,0,7.5,7.3,10.2,97,1021.1


In [63]:
# drop rows not in 2018
# https://sparkbyexamples.com/pandas/pandas-delete-rows-based-on-column-value/
df.drop(df[df['date'].dt.year != 2018].index, inplace=True)

In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8760 entries, 126047 to 134806
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    8760 non-null   datetime64[ns]
 1   ind     8760 non-null   int64         
 2   rain    8760 non-null   object        
 3   ind.1   8760 non-null   int64         
 4   temp    8760 non-null   object        
 5   ind.2   8760 non-null   int64         
 6   wetb    8760 non-null   object        
 7   dewpt   8760 non-null   object        
 8   vappr   8760 non-null   object        
 9   rhum    8760 non-null   object        
 10  msl     8760 non-null   object        
dtypes: datetime64[ns](1), int64(3), object(7)
memory usage: 821.2+ KB


In [65]:
df.to_csv('data/met-2018-data.csv', index=False)

I decided to use the data.gov weather data because the OpenWeather API omits the 'rain' field when there is no rain, so more work would be required to use that dataset.
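
To illustrate the extra work a missing 'rain' field would involve, here is a minimal sketch. The records below are hypothetical, and the nested `"rain": {"1h": ...}` layout is an assumption made for illustration rather than a confirmed description of the API response. The helper `rain_mm` is likewise an illustrative name.

```python
# Hypothetical hourly records; the "rain" -> "1h" layout is an assumption.
records = [
    {"dt": 1514764800, "main": {"temp": 4.6}, "rain": {"1h": 0.1}},
    {"dt": 1514768400, "main": {"temp": 4.7}},  # dry hour: no "rain" key at all
]


def rain_mm(record):
    """Return the hourly rainfall in mm, treating a missing field as 0.0."""
    return record.get("rain", {}).get("1h", 0.0)


rain_values = [rain_mm(r) for r in records]
print(rain_values)
```

Every record would need this kind of defaulting before the rainfall could be treated as a complete numeric column, whereas the CSV used here already records 0.0 for dry hours.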

In [66]:
# preview the frame with a fresh index; the result is not assigned, so df keeps its original index
df.reset_index()

Unnamed: 0,index,date,ind,rain,ind.1,temp,ind.2,wetb,dewpt,vappr,rhum,msl
0,126047,2018-01-01 00:00:00,0,0.0,0,4.6,0,3.5,1.8,6.9,82,991.0
1,126048,2018-01-01 01:00:00,0,0.1,0,4.7,0,3.6,1.8,7.0,81,991.1
2,126049,2018-01-01 02:00:00,0,0.0,0,4.8,0,3.7,1.9,7.0,81,991.1
3,126050,2018-01-01 03:00:00,0,0.0,0,4.9,0,3.8,2.2,7.2,82,990.7
4,126051,2018-01-01 04:00:00,0,0.0,0,5.3,0,4.1,2.3,7.2,81,990.3
...,...,...,...,...,...,...,...,...,...,...,...,...
8755,134802,2018-12-31 19:00:00,0,0.0,0,9.9,0,7.9,5.5,9.0,74,1034.9
8756,134803,2018-12-31 20:00:00,0,0.0,0,9.9,0,8.0,5.8,9.2,75,1035.0
8757,134804,2018-12-31 21:00:00,0,0.0,0,9.9,0,7.9,5.7,9.1,75,1035.0
8758,134805,2018-12-31 22:00:00,0,0.0,0,9.9,0,8.0,5.9,9.3,76,1035.1


In [67]:
# check the data types of the columns
df.dtypes

date     datetime64[ns]
ind               int64
rain             object
ind.1             int64
temp             object
ind.2             int64
wetb             object
dewpt            object
vappr            object
rhum             object
msl              object
dtype: object

In [68]:
# print some descriptive statistics of the df
df.describe(datetime_is_numeric=True).T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
date,8760.0,2018-07-02 11:30:00,2018-01-01 00:00:00,2018-04-02 05:45:00,2018-07-02 11:30:00,2018-10-01 17:15:00,2018-12-31 23:00:00,
ind,8760.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind.1,8760.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind.2,8760.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [69]:
df.nunique()

date     8760
ind         1
rain       44
ind.1       1
temp      304
ind.2       1
wetb      243
dewpt     257
vappr     171
rhum       76
msl       584
dtype: int64

As seen above, the columns ind, ind.1 and ind.2 each contain just one unique value. Constant columns like these add little information, so I will consider dropping them from the dataset going forward.
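
If the constant columns are dropped, one possible approach is sketched below. It uses a small stand-in frame named `sample` (an assumption for illustration) rather than the real `df`, and it finds the constant columns programmatically instead of hard-coding their names.

```python
import pandas as pd

# Small stand-in frame; the real df has 8760 rows and more columns.
sample = pd.DataFrame({
    "ind": [0, 0, 0],
    "rain": [0.0, 0.1, 0.0],
    "ind.1": [0, 0, 0],
    "temp": [4.6, 4.7, 4.8],
})

# Collect every column that holds a single distinct value.
constant_cols = [col for col in sample.columns if sample[col].nunique() == 1]

# drop() returns a new frame, so the original stays available for comparison.
trimmed = sample.drop(columns=constant_cols)

print(constant_cols)
print(trimmed.columns.tolist())
```

Detecting the columns with `nunique()` means the same cell keeps working if another constant column turns up later.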

In [70]:
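# names of the columns that were read in as object dtype
object_cols = ['rain', 'temp', 'wetb', 'dewpt', 'vappr', 'rhum', 'msl']
for col in object_cols:
    # print the range and a summary of each column
    print("Minimum value: ", df[col].min())
    print("Maximum value: ", df[col].max())
    print(df[col].describe())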

Minimum value:  0.0
Maximum value:  8.6
count     8760.0
unique      44.0
top          0.0
freq      7570.0
Name: rain, dtype: float64
Minimum value:  -4.5
Maximum value:  27.5
count     8760.0
unique     304.0
top          9.4
freq        87.0
Name: temp, dtype: float64
Minimum value:  -4.6
Maximum value:  20.4
count     8760.0
unique     243.0
top          6.9
freq        89.0
Name: wetb, dtype: float64
Minimum value:  -9.8
Maximum value:  18.3
count     8760.0
unique     257.0
top          6.3
freq        85.0
Name: dewpt, dtype: float64
Minimum value:  2.9
Maximum value:  21.0
count     8760.0
unique     171.0
top          9.1
freq       128.0
Name: vappr, dtype: float64
Minimum value:  24
Maximum value:  99
count     8760
unique      76
top         93
freq       327
Name: rhum, dtype: int64
Minimum value:  979.5
Maximum value:  1041.7
count     8760.0
unique     584.0
top       1017.9
freq        53.0
Name: msl, dtype: float64


I will convert all of these columns to the float64 dtype.
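
A direct cast fails if any entry cannot be parsed as a number. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"` on a stand-in frame. The blank string is a hypothetical bad entry added for illustration, not something observed in this dataset.

```python
import pandas as pd

# Stand-in for columns read from CSV as strings; " " is a hypothetical bad entry.
sample = pd.DataFrame({
    "rain": ["0.0", "0.1", " "],
    "temp": ["4.6", "4.7", "4.8"],
})

for col in ["rain", "temp"]:
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
# Count the values that could not be converted, per column.
print(sample.isna().sum())
```

Counting the NaN values afterwards shows how many entries were silently coerced, which a plain cast cannot report.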

In [71]:
for col in object_cols:
    df[col] = df[col].astype('float64')

In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8760 entries, 126047 to 134806
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    8760 non-null   datetime64[ns]
 1   ind     8760 non-null   int64         
 2   rain    8760 non-null   float64       
 3   ind.1   8760 non-null   int64         
 4   temp    8760 non-null   float64       
 5   ind.2   8760 non-null   int64         
 6   wetb    8760 non-null   float64       
 7   dewpt   8760 non-null   float64       
 8   vappr   8760 non-null   float64       
 9   rhum    8760 non-null   float64       
 10  msl     8760 non-null   float64       
dtypes: datetime64[ns](1), float64(7), int64(3)
memory usage: 821.2 KB


## Checking the logical integrity of the data
### Rain
* Should have no negative values

In [73]:
df['rain'].describe()

count    8760.000000
mean        0.078208
std         0.342674
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         8.600000
Name: rain, dtype: float64
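
The minimum of 0.0 in the summary above is consistent with the rule that rainfall should never be negative. An explicit check could look like the sketch below, which runs on a stand-in series. The -0.2 reading is hypothetical and included only to show what a failing case looks like.

```python
import pandas as pd

# Stand-in series; -0.2 is a hypothetical bad reading, not a value from the dataset.
rain_sample = pd.Series([0.0, 0.1, -0.2, 8.6], name="rain")

# Boolean mask marking the readings that break the "no negative values" rule.
negative_mask = rain_sample < 0

print("Negative readings:", int(negative_mask.sum()))
print(rain_sample[negative_mask])
```

Keeping the mask makes it easy to inspect the offending rows rather than only counting them.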

In [74]:
df.nunique()

date     8760
ind         1
rain       44
ind.1       1
temp      304
ind.2       1
wetb      243
dewpt     257
vappr     171
rhum       76
msl       584
dtype: int64

In [77]:
rain = df['rain']

In [78]:
rain

126047    0.0
126048    0.1
126049    0.0
126050    0.0
126051    0.0
         ... 
134802    0.0
134803    0.0
134804    0.0
134805    0.0
134806    0.0
Name: rain, Length: 8760, dtype: float64