In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
pd.__version__

'0.25.3'

## Introduction
Here we're trying to extract 6 hours of data that was missing from the InfluxDB server due to a mistake that I (Robin Wilson) made while configuring InfluxDB imports. The data was missing between 2020-01-23 13:45 and 2020-01-23 19:30. We need to extract this data and get it into the right format for importing manually.

## Read the full past data set from Flo
This contains all of the past data available at the time of export. None of this has been corrected.

In [3]:
# Load all past data, deal with nesta-2 vs nesta-2-1 issue
all_past_data = pd.read_csv('../Data/BS Sensors/Back data for 6hrs missing data - Jan 2019/aq.csv', names=['location', 'timestamp',
                                                           'temperature', 'humidity',
                                                           'pm25', 'pm10', 'count',
                                                           'pm_sensor_count', 'temphum_sensor_count', 'unknown'])
# Use .loc so the relabel is applied to all_past_data itself rather than to a temporary copy
all_past_data.loc[all_past_data.location == "nesta-2", 'location'] = 'nesta-2-1'
all_past_data['timestamp'] = pd.to_datetime(all_past_data.timestamp)

# Set temperature/humidity to NaN where temphum_sensor_count is 0
all_past_data.loc[all_past_data.temphum_sensor_count == 0, 'temperature'] = np.nan
all_past_data.loc[all_past_data.temphum_sensor_count == 0, 'humidity'] = np.nan

In [4]:
# Time range of full past data
all_past_data.timestamp.min(), all_past_data.timestamp.max()

(Timestamp('2019-03-15 16:56:27.243513'),
 Timestamp('2020-01-24 08:30:12.865538'))

In [5]:
# Sensors included in full past data
all_past_data.location.value_counts()

nesta-6          26094
nesta-7          22521
nesta-5          22365
nesta-4          21378
aurn-3           20227
nesta-2-1        20185
aurn-4           19360
aurn-2           18522
b2-new-forest    17022
nesta-1          15507
aurn-1           14650
nesta-8           9076
nesta-11          7273
b1-lanchester     7085
nesta-12          6154
nesta-9           5916
nesta-13          1029
nesta-10           672
Name: location, dtype: int64

In [31]:
final_result = all_past_data

In [32]:
# Add useful field for later - 'display' = True, so we can eventually use it for filtering for display if necessary
final_result['display'] = True

## Subset to just data that isn't already in InfluxDB
The automatic MQTT import to InfluxDB means that some data is already in there (the earliest data points available in InfluxDB are from the 12th Dec 2019 at 10:00). Only the window between 2020-01-23 13:45 and about 19:30 is missing, so we just want the rows whose timestamps fall inside that window.

In [39]:
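# Keep only rows inside the missing window; .copy() avoids SettingWithCopyWarning on later assignments
final_result_subset = final_result[(final_result.timestamp < pd.to_datetime('2020-01-23 19:28')) &
                                   (final_result.timestamp > pd.to_datetime('2020-01-23 13:45'))].copy()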
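
The window filter can be tried out on a small made-up frame before running it on the real data. This is only a sketch: `toy`, its rows and the `checked` column are invented for illustration, while the two bounds are the ones used in the cell above.

```python
import pandas as pd

# Toy frame standing in for final_result (values are made up for illustration)
toy = pd.DataFrame({
    "location": ["nesta-1", "nesta-2-1", "nesta-5", "nesta-7"],
    "timestamp": pd.to_datetime([
        "2020-01-23 13:30:00",
        "2020-01-23 13:45:06",
        "2020-01-23 19:15:07",
        "2020-01-23 19:30:00",
    ]),
})

start = pd.Timestamp("2020-01-23 13:45")
end = pd.Timestamp("2020-01-23 19:28")

# Strict inequalities on both ends, as in the cell above
in_window = (toy.timestamp > start) & (toy.timestamp < end)

# .copy() gives an independent frame, so adding or changing columns
# on the subset never touches (or warns about) the original
toy_subset = toy[in_window].copy()
toy_subset["checked"] = True
```

Only the two middle rows survive the filter, and `toy` itself is left without the `checked` column.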

In [40]:
final_result_subset

| | location | timestamp | temperature | humidity | pm25 | pm10 | count | pm_sensor_count | temphum_sensor_count | unknown | display |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 254019 | nesta-11 | 2020-01-23 13:45:06.316070 | 11.0 | 65.0 | 14 | 13 | 230 | 6 | 1 | False | True |
| 254020 | nesta-1 | 2020-01-23 13:45:06.436135 | NaN | NaN | 10 | 9 | 176 | 8 | 0 | False | True |
| 254021 | nesta-12 | 2020-01-23 13:45:06.757751 | 10.0 | 70.0 | 12 | 10 | 187 | 8 | 1 | False | True |
| 254022 | nesta-8 | 2020-01-23 13:45:06.946875 | 17.0 | 46.0 | 13 | 12 | 182 | 6 | 1 | False | True |
| 254023 | nesta-5 | 2020-01-23 13:45:07.039675 | NaN | NaN | 15 | 12 | 208 | 6 | 0 | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 254322 | b2-new-forest | 2020-01-23 19:15:07.682294 | 9.0 | 100.0 | 50 | 38 | 43 | 4 | 1 | False | True |
| 254323 | nesta-7 | 2020-01-23 19:15:07.690126 | 15.0 | 49.0 | 27 | 22 | 239 | 4 | 1 | False | True |
| 254324 | aurn-1 | 2020-01-23 19:15:07.795391 | 8.0 | 35.0 | 16 | 17 | 213 | 7 | 3 | False | True |
| 254325 | b1-lanchester | 2020-01-23 19:15:10.837283 | 7.0 | 100.0 | 41 | 31 | 61 | 5 | 1 | False | True |


In [41]:
final_result_subset.timestamp.min(), final_result_subset.timestamp.max()

(Timestamp('2020-01-23 13:45:06.316070'),
 Timestamp('2020-01-23 19:15:13.122339'))

## Convert into the right format for importing into InfluxDB
We need the CSV in the right format, with column names and types that match what we've already got in InfluxDB for the live MQTT data.

Useful docs:
 - https://www.influxdata.com/blog/how-to-write-points-from-csv-to-influxdb/
 - https://github.com/influxdata/telegraf/tree/master/plugins/parsers/csv
 - https://docs.influxdata.com/influxdb/v1.7/write_protocols/line_protocol_tutorial/
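
To make the target format concrete, here is a rough sketch of how one row could be written as a line of InfluxDB line protocol (`measurement,tags fields timestamp`). Several things here are assumptions rather than facts about this setup: the measurement name `aq`, the choice of `dev_id` as a tag and the `p_*` columns as fields, and the helper name `to_line_protocol`. It also skips the escaping that line protocol requires for spaces, commas and equals signs in tag values.

```python
import math


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as a single line of InfluxDB line protocol.

    Hypothetical helper: no escaping of special characters is done.
    """
    tag_part = ",".join(f"{key}={value}" for key, value in sorted(tags.items()))

    field_items = []
    for key, value in fields.items():
        # Line protocol has no null value, so missing fields are left out
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        # bool must be tested before int, because bool is a subclass of int
        if isinstance(value, bool):
            field_items.append(f"{key}={'true' if value else 'false'}")
        elif isinstance(value, int):
            field_items.append(f"{key}={value}i")  # trailing i marks an integer
        else:
            field_items.append(f"{key}={float(value)}")

    return f"{measurement},{tag_part} {','.join(field_items)} {timestamp_ns}"


# Nanoseconds since the epoch for 2020-01-23 13:45:06.316070,
# treating the naive timestamp as UTC (an assumption)
line = to_line_protocol(
    "aq",  # assumed measurement name
    {"dev_id": "nesta-11"},
    {"p_temperature": 11.0, "p_humidity": float("nan"), "p_pm25": 14, "p_display": True},
    1579787106316070000,
)
print(line)
```

The missing humidity value is simply dropped from the line, which is one reason the type of each column matters when preparing the CSV.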

In [42]:
final_result_subset.columns

Index(['location', 'timestamp', 'temperature', 'humidity', 'pm25', 'pm10',
       'count', 'pm_sensor_count', 'temphum_sensor_count', 'unknown',
       'display'],
      dtype='object')

In [44]:
final_result_subset.columns = ['dev_id', 'timestamp', 'p_temperature', 'p_humidity', 'p_pm25', 'p_pm10',
                               'p_count', 'p_pm_sensor_count', 'p_temphum_sensor_count', 'p_corrected', 'p_display']
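
Assigning a list to `.columns` relies on the column order never changing. A possible alternative, sketched here on a made-up three-column frame, is to rename through an explicit old-to-new mapping. The `toy` frame and the `column_map` name are illustrative; the three name pairs are taken from the rename above.

```python
import pandas as pd

# Toy frame with a few of the original column names (values are made up)
toy = pd.DataFrame({"location": ["nesta-1"], "pm25": [10], "unknown": [False]})

# Explicit old name -> new name mapping, so column order does not matter
column_map = {"location": "dev_id", "pm25": "p_pm25", "unknown": "p_corrected"}

# Fail early if an expected column is absent, instead of silently skipping it
missing = set(column_map) - set(toy.columns)
if missing:
    raise KeyError(f"columns not found: {sorted(missing)}")

renamed = toy.rename(columns=column_map)
```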

In [45]:
# Convert timestamps to integer nanoseconds since the epoch, then to strings for the CSV
final_result_subset['timestamp'] = final_result_subset.timestamp.astype('int64').astype(str)


In [46]:
# Nullable integer dtype, so counts stay integers even if a value is missing
final_result_subset['p_count'] = final_result_subset['p_count'].astype('Int16')
final_result_subset['p_temphum_sensor_count'] = final_result_subset['p_temphum_sensor_count'].astype('Int16')
final_result_subset['p_pm_sensor_count'] = final_result_subset['p_pm_sensor_count'].astype('Int16')

In [47]:
final_result_subset['p_pm25'] = final_result_subset['p_pm25'].round(0).astype('Int16')
final_result_subset['p_pm10'] = final_result_subset['p_pm10'].round(0).astype('Int16')


In [48]:
len(final_result_subset)

308
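
The type conversions in the cells above (timestamps to integer nanoseconds as strings, readings to the nullable `Int16` dtype) can be checked on a toy frame. The frame and its values are made up for illustration. The explicit cast to `datetime64[ns]` is an extra precaution that is not in the original cells, added because some pandas versions can store datetimes at resolutions other than nanoseconds.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for final_result_subset (values are made up)
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-01-23 13:45:06.316070",
                                 "2020-01-23 13:45:07.039675"]),
    "p_pm25": [14.4, np.nan],
    "p_count": [230, 208],
})

# Nanoseconds since the epoch as an integer, then as a string
toy["timestamp"] = (
    toy["timestamp"].astype("datetime64[ns]").astype("int64").astype(str)
)

# Round first, then cast: 'Int16' (capital I) is the nullable integer dtype,
# so the missing reading becomes <NA> instead of forcing the column to float
toy["p_pm25"] = toy["p_pm25"].round(0).astype("Int16")
toy["p_count"] = toy["p_count"].astype("Int16")

print(toy.dtypes)
```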

## Write outputs to CSVs

In [50]:
final_result_subset.to_csv('../Data/BS Sensors/2020-01-23_6hrsMissingDataForImport.csv', index=False, na_rep="null")
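
Before importing, it can be worth checking how missing values end up in the file. The sketch below writes a made-up frame to an in-memory buffer with the same `index=False, na_rep="null"` options and reads it back. The frame contents are invented; it only shows how pandas handles `null` and says nothing about how the InfluxDB import treats it.

```python
import io

import numpy as np
import pandas as pd

# Toy frame with one missing temperature (values are made up)
toy = pd.DataFrame({
    "dev_id": ["nesta-1", "nesta-11"],
    "p_temperature": [np.nan, 11.0],
    "p_pm25": pd.array([10, 14], dtype="Int16"),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False, na_rep="null")
csv_text = buffer.getvalue()
print(csv_text)

# "null" is one of the strings pandas treats as missing by default,
# so it comes back as NaN when the CSV is read again
round_trip = pd.read_csv(io.StringIO(csv_text))
```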