Incorrect record length #189
Hi, |
Thanks for always answering so fast. I am sorry, I should have checked with MDFValidator earlier, but I don't usually work on a Windows machine. MDFValidator issues a warning that is related to the problem:
The cycle count of the CGBlock is indeed 137646. This would also explain why there is a part of the data array that no data is written to, leaving it with random values. I assume you allocate an array of size 137646 (since this is the cycle count in the DG header), but only the first 137501 elements are filled when the chunks of data are read one after the other. This is definitely not an issue with the mdfreader, but do you see a way to handle this with the current implementation? Even if the read simply failed in such a case, it would be helpful. Again, this is no bug, but would it be possible to use "PyMem_Calloc" instead of "PyMem_Malloc" in dataread? It would at least consistently return 0 in such a case, which would be easier to deal with. |
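The malloc/calloc distinction is easy to see from Python with numpy, whose np.empty and np.zeros behave analogously (an illustration only; the actual change would have to happen inside the dataread C extension):

```python
import numpy as np

declared, filled = 137646, 137501  # cycle count in the file vs. records present

# np.empty behaves like PyMem_Malloc: the memory is uninitialized, so the
# unfilled tail holds arbitrary leftover values that change run to run.
buf_malloc = np.empty(declared)

# np.zeros behaves like PyMem_Calloc: the memory is zero-initialized, so
# the unfilled tail is consistently 0 and easy to detect afterwards.
buf_calloc = np.zeros(declared)
buf_calloc[:filled] = 1.0  # pretend these records were actually read

tail = buf_calloc[filled:]
print(tail.size, np.count_nonzero(tail))  # 145 0
```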
A chunk error could have come from mdfreader handling the chunks wrongly at read time, but since numberOfRecords is as expected, there is no issue there. |
Thanks again for dealing with this, since it's not really a problem of the mdfreader. |
Hello guys, I think this happens when the writing application pre-allocates the space on disk for the data block, so when the measurement is stopped the final block may not be completely filled. In this case you should consider the channel group cycle count when loading the data from the file. Of course the nice thing would be if the writing application would also update the block_len field of the last data block when finishing the measurement. |
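That suggestion could be sketched as follows; trim_to_cycle_count is a hypothetical helper (not mdfreader or asammdf API) that drops the pre-allocated padding at the end of a data block before the records are parsed:

```python
def trim_to_cycle_count(raw_block: bytes, cycle_count: int, record_size: int) -> bytes:
    """Keep only cycle_count records, dropping pre-allocated padding at the end."""
    expected = cycle_count * record_size
    if len(raw_block) < expected:
        raise ValueError("data block holds fewer bytes than the cycle count implies")
    return raw_block[:expected]


# A pre-allocated block: 5 records of 4 bytes plus 8 bytes of unused padding.
block = b"\x01" * 20 + b"\x00" * 8
print(len(trim_to_cycle_count(block, cycle_count=5, record_size=4)))  # 20
```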
Hi, |
I'll be on vacation till August 11th. I'll try then. Thanks a lot
(quoted from Aymeric Rateau's email notification of July 31, 2020:)
Hi,
Did you try to use the apply_all_invalid_bit() method? It should convert your channel into a masked array. Then you can identify and get the mask with .mask and check the last elements.
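For reference, the masked-array check described above looks like this in plain numpy with illustrative data; in mdfreader the mask would come from apply_all_invalid_bit() rather than being built by hand:

```python
import numpy as np

# Illustrative channel: 10 samples, the last 2 flagged invalid.
data = np.ma.masked_array(
    np.arange(10, dtype=float),
    mask=[False] * 8 + [True] * 2,
)

valid_count = data.count()          # samples whose invalid bit is not set
last_valid = data.compressed()[-1]  # last sample carrying real data
print(valid_count, last_valid)      # 8 7.0
```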
|
Do you also consider the cycle count when you read the raw data blocks? |
Only the cycle count from the CG block, cg_cycle_count; there is no deduction from the block size or cross-check of the size against block start positions (conceivable for advanced recovery functions, but not implemented). |
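Such a cross-check could look roughly like this (a sketch, not mdfreader code; the 24-byte header length follows the MDF4 block layout of id, reserved, length, and link count fields):

```python
HEADER_LEN = 24  # MDF4 block header: id, reserved, length, link count


def implied_cycle_count(block_len: int, record_size: int) -> int:
    """Deduce how many records a DT block can actually hold from its length."""
    payload = block_len - HEADER_LEN
    records, leftover = divmod(payload, record_size)
    if leftover:
        raise ValueError("block payload is not a whole number of records")
    return records


print(implied_cycle_count(block_len=24 + 10 * 196, record_size=196))  # 10
```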
This was what I was trying to explain: the raw block size is not consistent with the cycle count, so you end up reading extra bytes compared to the expected array size. |
But after checking the numbers in the comments above, the cycle count matches the block size and record size. The wrong data are inside the block, which is why I rather suspect wrong handling of the invalid bits, either by mdfreader or by the logger. |
This MDFValidator output makes me think that the problem is the one I've stated |
Yes, correct Daniel. I understood that the cycle count is higher than what the data block contains, so at the end the read continues into the following block; it is not random data. Another block could have been written by the logger on top of the data block by mistake, or the logger did not reflect the actual cycle count after some corrections... there could be a lot of reasons, but it is most probably a bug in the mdf writer. |
I have such a function for detecting the start addresses of all the blocks https://github.com/danielhrisca/asammdf/blob/master/asammdf/blocks/utils.py#L1442 but I only use it for the finalization steps.
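A much-simplified version of that idea, scanning a byte buffer for the 4-byte block ids; the linked asammdf helper handles unfinalized files far more carefully:

```python
import re

# Subset of MDF4 block ids, matched directly in the raw bytes.
BLOCK_ID = re.compile(rb"##(HD|DG|CG|CN|TX|MD|DT|DZ|DL|SI|CC)")


def block_addresses(data: bytes):
    """Return (offset, block id) pairs for every block id found in data."""
    return [(m.start(), m.group(0).decode()) for m in BLOCK_ID.finditer(data)]


print(block_addresses(b"xxxx##DTyyyy##TX"))  # [(4, '##DT'), (12, '##TX')]
```

Note that a naive scan like this can also hit id-like byte sequences inside record data, which is why the real implementation cross-checks the block headers.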
I did not catch this problem at first glance. I thought the cycle count was actually lower than the block size. |
@danielhrisca , thanks for the tip. |
In the dev branch, I added a new function to finalise mdf files and, where necessary, correct cycle counts. Even if the file is not flagged as unfinalised, you can force the scan with the argument force_file_integrity_check=True. Could you give it a try with your file? |
Sorry for being inactive for a while, I was swamped with other stuff.
When I check the function definition of read_dz in |
Hi @ludwig-nc , |
Hi @ratal, the errors are gone, but unfortunately it doesn't solve the problem. Maybe the files are more broken than I thought. There is one warning:
There are multiple warnings like this:
Since the latter appears multiple times, I assume there are multiple data blocks that aren't read, leading to the far smaller data channel? |
Hi @ludwig-nc |
Hi @ratal, It seems that when I use
The calculated length of the MDF-Validator (26978616) corresponds to Here is also the output for
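The figures quoted in this thread can be cross-checked arithmetically. Assuming the record size is what the numbers imply (196 bytes is an inference, not stated anywhere above), the CG cycle count of 137646 accounts exactly for the 26978616 bytes MDFValidator calculated:

```python
validator_len = 26978616  # byte count calculated by MDFValidator
cycle_count = 137646      # cg_cycle_count from the CG block

# The validator length divides evenly by the cycle count, which suggests a
# record size of 196 bytes with no leftover bytes.
record_size, remainder = divmod(validator_len, cycle_count)
print(record_size, remainder)  # 196 0
```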
I hope this is helpful in any way. |
@ludwig-nc what about if you could scramble all the text blocks so that the channel names and comments are just random letters? This way you can anonymize the data |
Hi @ludwig-nc , |
Hi @ratal,
@danielhrisca I am afraid I wouldn't know how to do that :) short of reading the file with the mdfreader and then writing it again. |
try this

```python
import re
import struct
from pathlib import Path
from random import randint
from shutil import copy  # was missing; copy() is used below


def randomized_string(size):
    """get a \0 terminated string of size length

    Parameters
    ----------
    size : int
        target string length

    Returns
    -------
    string : bytes
        randomized string
    """
    return bytes(randint(65, 90) for _ in range(size - 1)) + b"\0"


def scramble_mf4(name):
    """scramble text blocks and keep original file structure

    Parameters
    ----------
    name : str | pathlib.Path
        file name

    Returns
    -------
    name : pathlib.Path
        scrambled file name
    """
    name = Path(name)

    pattern = re.compile(
        rb"(?P<block>##(TX|MD))",
        re.DOTALL | re.MULTILINE,
    )

    texts = {}
    with open(name, "rb") as stream:
        stream.seek(0, 2)
        file_limit = stream.tell()
        stream.seek(0)

        for match in re.finditer(pattern, stream.read()):
            start = match.start()
            if file_limit - start >= 24:
                # block length is the uint64 at offset 8 of the block header
                stream.seek(start + 8)
                size = struct.unpack("<Q", stream.read(8))[0]  # was struct.unpac
                if start + size <= file_limit:
                    # overwrite the payload after the 24 byte header
                    texts[start + 24] = randomized_string(size - 24)

    dst = name.with_suffix(".scrambled.mf4")
    copy(name, dst)

    with open(dst, "rb+") as mdf:
        for addr, bts in texts.items():
            mdf.seek(addr)
            mdf.write(bts)

    return dst
```
|
Hi @ludwig-nc , |
Hi @ratal |
Python version
Python 3.7.4
Platform information
OSX and Linux
Numpy version
'1.16.2'
mdfreader version
4.1
Description
Hi Aymeric,
sorry for bothering you again. I think I narrowed down my problem but I need your help to investigate further.
I am reading an MDF4 file, but the output is of incorrect length. It doesn't matter whether I use the dataRead module or not, or whether I use a channel_list or not; the resulting length is the same:
I changed the channel names, since I am not allowed to share the data:
I am pretty certain that this is wrong, because I get a record length of 137501 when using the asammdf package, and it's exactly the last 145 entries in the array that contain incorrect information.
These last 145 entries are either 0 (without the dataRead module) or, when dataRead is used, contain random data, i.e. if I repeat the read I get different results for those values, indicating that the array is allocated in C with a length of 137646 but only the first 137501 elements get filled.
I know this is not enough information to identify exactly what is going wrong, but I hope you can tell me again which is the best part of the code to debug this.
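The diagnosis above can be verified mechanically: samples that change between two reads of the same channel were never written by the reader. A minimal sketch with a hypothetical helper (not part of mdfreader):

```python
import numpy as np


def first_unstable_index(a: np.ndarray, b: np.ndarray) -> int:
    """Index of the first sample that differs between two reads, or -1."""
    diff = np.flatnonzero(a != b)
    return int(diff[0]) if diff.size else -1


# Simulate two reads of the same channel whose uninitialized tail
# (the last 145 of 137646 samples) came out differently each time.
read1 = np.zeros(137646)
read2 = np.zeros(137646)
read1[137501:] = 1.0  # stand-ins for leftover memory contents
read2[137501:] = 2.0
print(first_unstable_index(read1, read2))  # 137501
```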