In [1]:
import pandas
print('pandas',pandas.__version__)

pandas 0.23.4


In [2]:
!head RollingSystemDemand_20180901_0129.csv

HDR,ROLLING SYSTEM DEMAND
VD,20180601000000,25152
VD,20180601000500,25231
VD,20180601001000,25070
VD,20180601001500,25019
VD,20180601002000,24943
VD,20180601002500,24727
VD,20180601003000,24716
VD,20180601003500,24815
VD,20180601004000,24877


By default, Pandas uses the "VD" entries as the index, because the header row has two fields while each data row has three.
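
The default behaviour can be reproduced on a small in-memory sample. This is a minimal sketch: the two-row `sample` string is a hypothetical stand-in for the real file, and `default_frame` is a name introduced here for illustration.

```python
import io
import pandas

# Hypothetical two-row sample in the same layout as the CSV above.
sample = (
    "HDR,ROLLING SYSTEM DEMAND\n"
    "VD,20180601000000,25152\n"
    "VD,20180601000500,25231\n"
)

# With default arguments, a header that is one field shorter than the
# data rows makes read_csv treat the first column as the index.
default_frame = pandas.read_csv(io.StringIO(sample))
print(default_frame.index.tolist())
print(default_frame.columns.tolist())
```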

Tell Pandas not to use the first column as the index:

In [3]:
dframe = pandas.read_csv("RollingSystemDemand_20180901_0129.csv",
                         index_col=False)
dframe.head()

Unnamed: 0,HDR,ROLLING SYSTEM DEMAND
0,VD,20180601000000
1,VD,20180601000500
2,VD,20180601001000
3,VD,20180601001500
4,VD,20180601002000


Hmm, that's not quite what I intended: the demand values in the third column have been dropped.

Rather than letting Pandas guess the layout, tell it to skip the first row.

`skiprows` : Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

In [5]:
dframe = pandas.read_csv("RollingSystemDemand_20180901_0129.csv",
                         index_col=False,
                         skiprows=1)
dframe.head()

Unnamed: 0,VD,20180601000000,25152
0,VD,20180601000500,25231.0
1,VD,20180601001000,25070.0
2,VD,20180601001500,25019.0
3,VD,20180601002000,24943.0
4,VD,20180601002500,24727.0


Now Pandas treats the first data row as the header. Let's disable that with `header=None`.

In [6]:
dframe = pandas.read_csv("RollingSystemDemand_20180901_0129.csv",
                         index_col=False,
                         skiprows=1, 
                         header=None)
dframe.head()

Unnamed: 0,0,1,2
0,VD,20180601000000,25152.0
1,VD,20180601000500,25231.0
2,VD,20180601001000,25070.0
3,VD,20180601001500,25019.0
4,VD,20180601002000,24943.0


Now we can set the column labels

In [7]:
dframe.columns=['VD','time of measurement','value']
dframe.head()

Unnamed: 0,VD,time of measurement,value
0,VD,20180601000000,25152.0
1,VD,20180601000500,25231.0
2,VD,20180601001000,25070.0
3,VD,20180601001500,25019.0
4,VD,20180601002000,24943.0
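
The skip, header, and labelling steps can probably be folded into a single call by passing `names` to `read_csv`. This is a sketch on a hypothetical in-memory sample rather than the real file; `named_frame` is a name introduced here.

```python
import io
import pandas

# Hypothetical sample in the same layout as the CSV above.
sample = (
    "HDR,ROLLING SYSTEM DEMAND\n"
    "VD,20180601000000,25152\n"
    "VD,20180601000500,25231\n"
)

# Skip the HDR line, treat every remaining row as data,
# and supply the column labels up front.
named_frame = pandas.read_csv(
    io.StringIO(sample),
    skiprows=1,
    header=None,
    names=["VD", "time of measurement", "value"],
)
print(named_frame.head())
```

Supplying three names for three data fields also leaves no spare column for Pandas to promote to an index.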


Let's check on the status of the data types in each column

In [8]:
dframe.dtypes

VD                      object
time of measurement      int64
value                  float64
dtype: object

_Lesson_: abstraction frameworks are convenient when the assumptions they make are correct. 

(Reading the file manually would probably have caused a different kind of confusion. Here we are using `read_csv`.)

Change the type from "int" to string so that we can then convert to datetime

In [9]:
# https://stackoverflow.com/questions/17950374/converting-a-column-within-pandas-dataframe-from-int-to-string
dframe['time of measurement']=dframe['time of measurement'].apply(str)

Check the data type of the columns

In [10]:
dframe.dtypes

VD                      object
time of measurement     object
value                  float64
dtype: object
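
An equivalent and arguably more idiomatic way to do the same int-to-string conversion is `Series.astype(str)`. The sketch below uses a hypothetical two-element series (`stamps`) instead of the real column.

```python
import pandas

# Hypothetical integer timestamps in the same YYYYMMDDHHMMSS layout.
stamps = pandas.Series([20180601000000, 20180601000500])

# astype(str) converts each integer to its decimal string form.
as_text = stamps.astype(str)
print(as_text.tolist())
```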

In [11]:
dframe.head()

Unnamed: 0,VD,time of measurement,value
0,VD,20180601000000,25152.0
1,VD,20180601000500,25231.0
2,VD,20180601001000,25070.0
3,VD,20180601001500,25019.0
4,VD,20180601002000,24943.0


Now we can apply the conversion of the time column from string to datetime

In [12]:
pandas.to_datetime(dframe['time of measurement'],format='%Y%m%d%H%M%S')

ValueError: time data '2000' does not match format '%Y%m%d%H%M%S' (match)

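Still getting errors! The value '2000' is not a timestamp, so at least one entry in the column is not a measurement time.

A similar issue was solved here:
https://www.kaggle.com/najagumbi/data-cleaning-challenge-parsing-dates-v2

Passing `errors="coerce"` turns unparseable values into NaT instead of raising.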

In [13]:
pandas.to_datetime(dframe['time of measurement'],
                   format='%Y%m%d%H%M%S',
                   errors="coerce")

0      2018-06-01 00:00:00
1      2018-06-01 00:05:00
2      2018-06-01 00:10:00
3      2018-06-01 00:15:00
4      2018-06-01 00:20:00
5      2018-06-01 00:25:00
6      2018-06-01 00:30:00
7      2018-06-01 00:35:00
8      2018-06-01 00:40:00
9      2018-06-01 00:45:00
10     2018-06-01 00:50:00
11     2018-06-01 00:55:00
12     2018-06-01 01:00:00
13     2018-06-01 01:05:00
14     2018-06-01 01:10:00
15     2018-06-01 01:15:00
16     2018-06-01 01:20:00
17     2018-06-01 01:25:00
18     2018-06-01 01:30:00
19     2018-06-01 01:35:00
20     2018-06-01 01:40:00
21     2018-06-01 01:45:00
22     2018-06-01 01:50:00
23     2018-06-01 01:55:00
24     2018-06-01 02:00:00
25     2018-06-01 02:05:00
26     2018-06-01 02:10:00
27     2018-06-01 02:15:00
28     2018-06-01 02:20:00
29     2018-06-01 02:25:00
               ...        
1971   2018-06-07 20:15:00
1972   2018-06-07 20:20:00
1973   2018-06-07 20:25:00
1974   2018-06-07 20:30:00
1975   2018-06-07 20:35:00
1976   2018-06-07 20:40:00
1
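
Coercion hides the bad entries, so it can help to list which rows became NaT. A minimal sketch, assuming a hypothetical three-element series (`raw`) that contains one stray value like the one in the error message:

```python
import pandas

# Hypothetical values: two well-formed timestamps and one stray entry.
raw = pandas.Series(["20180601000000", "20180601000500", "2000"])

parsed = pandas.to_datetime(raw, format="%Y%m%d%H%M%S", errors="coerce")

# Rows that failed to parse are NaT after coercion;
# index the original series with that mask to see the offending text.
print(raw[parsed.isna()])
```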

To confirm where the '2000' comes from, inspect the bottom of the CSV

In [14]:
!tail RollingSystemDemand_20180901_0129.csv

VD,20180607215500,28216
VD,20180607220000,27823
VD,20180607220500,27555
VD,20180607221000,27258
VD,20180607221500,26900
VD,20180607222000,26618
VD,20180607222500,26332
VD,20180607223000,26038
VD,20180607223500,25798
FTR,2000
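
The last line, `FTR,2000`, is a footer record rather than a measurement, which would explain both the '2000' in the error and the float64 `value` column (the footer has no third field). One possible way to avoid coercion altogether is to drop the footer while reading, using `skipfooter`, which requires the Python parsing engine. This is a sketch on a hypothetical in-memory sample; `clean_frame` is a name introduced here, and the timestamp column is read as a string so no separate int-to-string step is needed.

```python
import io
import pandas

# Hypothetical sample with a header record and a footer record.
sample = (
    "HDR,ROLLING SYSTEM DEMAND\n"
    "VD,20180601000000,25152\n"
    "VD,20180601000500,25231\n"
    "FTR,2\n"
)

clean_frame = pandas.read_csv(
    io.StringIO(sample),
    skiprows=1,        # drop the HDR line
    skipfooter=1,      # drop the FTR line
    header=None,
    names=["VD", "time of measurement", "value"],
    dtype={"time of measurement": str},
    engine="python",   # skipfooter is not supported by the C engine
)

# Every remaining row is a measurement, so no coercion is needed.
clean_frame["time of measurement"] = pandas.to_datetime(
    clean_frame["time of measurement"], format="%Y%m%d%H%M%S"
)
print(clean_frame.dtypes)
```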