# Data collecting and cleaning
  
   
My DIY sensor based on NodeMCU and DHT22 recorded temperature and humidity every 10 minutes. All data was saved in a .csv file. Let's start by importing raw data from this file and preparing a dataset for further analysis.

In [69]:
import pandas as pd

df = pd.read_csv('data/nodemcu_log.csv', parse_dates=["Date"])

df

Unnamed: 0,Date,Time,Temperature,Humidity
0,2021-02-11,19:30,23.4,21.4
1,2021-02-11,19:40,23.3,22.8
2,2021-02-11,19:50,22.9,20.8
3,2021-02-11,20:00,19.3,25.5
4,2021-02-11,20:10,17.9,28.2
...,...,...,...,...
76712,2022-09-25,10:20,13.7,99.9
76713,2022-09-25,10:30,14.0,99.9
76714,2022-09-25,10:40,14.2,99.9
76715,2022-09-25,10:50,14.3,99.9


Data from the sensor starts in February 2021 and ends in September 2022. We are only interested in 2021. The first readings are also clearly too high (23.4 °C at 19:30 on 11 February, dropping to 17.9 °C within 40 minutes), so they have to be excluded when choosing the final dataset. The selection below therefore starts on 15 February 2021.

In [70]:
df = df.loc[(df['Date'] <= '2021-12-31') & (df['Date'] >= '2021-02-15')]
df

Unnamed: 0,Date,Time,Temperature,Humidity
452,2021-02-15,00:00,-2.0,63.8
453,2021-02-15,00:10,-2.1,63.8
454,2021-02-15,00:20,-1.9,64.7
455,2021-02-15,00:30,-2.1,64.5
456,2021-02-15,00:40,-1.9,64.8
...,...,...,...,...
43593,2021-12-31,23:10,11.2,99.9
43594,2021-12-31,23:20,11.0,99.9
43595,2021-12-31,23:30,11.3,99.9
43596,2021-12-31,23:40,10.9,99.9
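
The same inclusive date filter can also be written with `Series.between`. This is a minimal sketch on a small made-up frame; the `toy` name and its values are assumptions for illustration and do not come from the sensor log.

```python
import pandas as pd

# Toy data (made up for illustration), using the same column names as the log
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-02-11", "2021-02-15", "2021-07-01",
                            "2021-12-31", "2022-01-01"]),
    "Temperature": [23.4, -2.0, 21.5, 10.9, 3.2],
})

# between() is inclusive on both ends by default, like the >= / <= pair above
in_range = toy["Date"].between("2021-02-15", "2021-12-31")
toy_2021 = toy.loc[in_range]
print(toy_2021)
```

Only the three rows dated from 15 February to 31 December 2021 remain; the row from 11 February and the row from 2022 are dropped.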


Next, we group the data by day and calculate the maximum, minimum and mean temperature for each day. Humidity is not used in the further analysis.

In [71]:
df = df.groupby(['Date'])['Temperature'].agg(['max','min','mean']).reset_index()
df

Unnamed: 0,Date,max,min,mean
0,2021-02-15,1.0,-4.3,-1.611111
1,2021-02-16,1.4,-9.9,-2.673611
2,2021-02-17,9.5,-1.7,1.112676
3,2021-02-18,-1.7,-12.6,-5.995804
4,2021-02-19,1.6,-12.9,-4.474074
...,...,...,...,...
308,2021-12-27,-1.7,-12.3,-6.922378
309,2021-12-28,2.0,-4.7,-0.955556
310,2021-12-29,3.6,0.1,1.690278
311,2021-12-30,6.6,2.4,4.747222


In [72]:
# Changing column names

df.rename(columns = {'Date':'date','max':'t_max','min':'t_min','mean':'t_mean'}, inplace = True)

In [73]:
df

Unnamed: 0,date,t_max,t_min,t_mean
0,2021-02-15,1.0,-4.3,-1.611111
1,2021-02-16,1.4,-9.9,-2.673611
2,2021-02-17,9.5,-1.7,1.112676
3,2021-02-18,-1.7,-12.6,-5.995804
4,2021-02-19,1.6,-12.9,-4.474074
...,...,...,...,...
308,2021-12-27,-1.7,-12.3,-6.922378
309,2021-12-28,2.0,-4.7,-0.955556
310,2021-12-29,3.6,0.1,1.690278
311,2021-12-30,6.6,2.4,4.747222
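
The grouping and renaming steps above could also be combined using pandas named aggregation, which sets the output column names directly. The sketch below runs on a tiny made-up frame; `toy` and `daily` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy data (made up): two readings on each of two days
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-02-15", "2021-02-15",
                            "2021-02-16", "2021-02-16"]),
    "Temperature": [-2.0, 1.0, -9.9, 1.4],
})

# Named aggregation: keyword = output column, value = aggregation function
daily = (
    toy.groupby("Date")["Temperature"]
       .agg(t_max="max", t_min="min", t_mean="mean")
       .reset_index()
       .rename(columns={"Date": "date"})
)
print(daily)
```

The result has the columns `date`, `t_max`, `t_min` and `t_mean`, so a separate rename of the aggregated columns is no longer needed.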


The dataset now covers the period ***15/02/21-31/12/21***, which is 320 days. However, as you can see in the cell above, the dataframe contains only 313 days. Let's check which dates are missing.

In [21]:
for i in pd.date_range(start="2021-02-15", end="2021-12-31").difference(df['date']):
    print(i.strftime('%Y-%m-%d'))

2021-11-04
2021-11-05
2021-11-10
2021-11-11
2021-11-12
2021-11-13
2021-11-16


The sensor was probably not working on these days and recorded no data. To keep the daily series continuous, we add one row for each missing date; the temperature columns in these rows are left as NaN.

In [75]:
# Append one row per missing date; the temperature columns stay NaN
for i in pd.date_range(start="2021-02-15", end="2021-12-31").difference(df['date']):
    df = pd.concat([df, pd.DataFrame([[i]], columns=['date'])], axis=0)
    

In [77]:
df.tail(10)

Unnamed: 0,date,t_max,t_min,t_mean
310,2021-12-29,3.6,0.1,1.690278
311,2021-12-30,6.6,2.4,4.747222
312,2021-12-31,11.4,6.5,10.25
0,2021-11-04,,,
0,2021-11-05,,,
0,2021-11-10,,,
0,2021-11-11,,,
0,2021-11-12,,,
0,2021-11-13,,,
0,2021-11-16,,,


In [87]:
# Sorting newly added values and resetting index

df = df.sort_values(by='date', ignore_index=True)

In [88]:
# The rows for the missing dates are now in chronological order, filled with NaNs and correctly indexed

df.loc[df['date'] > '2021-11-03'].head(10)

Unnamed: 0,date,t_max,t_min,t_mean
262,2021-11-04,,,
263,2021-11-05,,,
264,2021-11-06,12.3,6.0,9.974074
265,2021-11-07,11.1,3.5,7.723704
266,2021-11-08,9.9,7.8,8.834028
267,2021-11-09,8.4,5.7,6.341176
268,2021-11-10,,,
269,2021-11-11,,,
270,2021-11-12,,,
271,2021-11-13,,,
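
As an alternative to appending rows in a loop and sorting afterwards, the same result can be reached by reindexing on the full date range. This is only a sketch on made-up data, not the method used in this notebook; `toy`, `full_range` and `filled` are hypothetical names.

```python
import pandas as pd

# Toy daily data (made up) with 4 and 5 November missing
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-11-03", "2021-11-06", "2021-11-07"]),
    "t_max": [10.0, 12.3, 11.1],
})

# Every calendar day the frame should contain
full_range = pd.date_range(start="2021-11-03", end="2021-11-07")

# reindex() inserts NaN rows for dates absent from the index and keeps
# the result in the order of full_range, so no extra sort is needed
filled = (
    toy.set_index("date")
       .reindex(full_range)
       .rename_axis("date")
       .reset_index()
)
print(filled)
```

The output has five rows with a fresh 0 to 4 index, and the two missing dates carry NaN in `t_max`.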


In [99]:
# Saving cleaned data to a csv file

df.to_csv('data/nodemcu_data_final.csv')
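
One thing to keep in mind when reading the saved file back: by default `to_csv` also writes the integer index, which comes back from `read_csv` as a column named `Unnamed: 0`. The sketch below shows the difference on made-up data, using an in-memory buffer instead of a real file; whether to pass `index=False` is a choice left to the reader.

```python
import io
import pandas as pd

# Toy frame (made up) with one missing value
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-11-03", "2021-11-04"]),
    "t_max": [10.0, float("nan")],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'date', 't_max']

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index, parse_dates=["date"])
print(restored.columns.tolist())  # ['date', 't_max']
```

In both cases the missing value is written as an empty field and read back as NaN.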