SDCard corruption on RPI2 #397

Closed
NitroG42 opened this Issue Mar 19, 2015 · 363 comments

Lots of people seem to be affected by an issue with SD cards.
I'm using a Samsung Evo 16 GB micro SD card, and with Raspbian I get corruption on the SD card every time.
It's easy to reproduce:

  • Get the Raspbian image 2015_02_16 from the website
  • I dd it from my Mac to the SD card (using sudo dd bs=1m if=2015-02-16-raspbian-wheezy.img of=/dev/rdisk4)
  • Boot the Pi with the freshly written micro SD card
  • Install Raspbian, and once I have the console, I run sudo touch /forcefsck
  • On the next reboot, lots of errors are found, and after a few minutes it ends with the screen dying

I need to check on my linux system (I'm at work) if I can fix the card at this step or not.
I flashed the raspbian img multiple times and it doesn't work.

It can also be reproduced just by making the RPi reboot multiple times from the terminal (using sudo reboot).

I have 3 of them, so I hope it's just a firmware bug (I'll try with another one to be sure it's not the sdcard itself).

Here are two threads that gathered reports of this issue (without creating a post here, though :/ ):
http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=101183&p=703772&hilit=error+110#p703772
http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=98935

One post is interesting :

I am using Transcend UHS-I 1U 16GB Class 10 i have tried 4 of this card and same error with all four, i have also tried with 3 different Rpi2 and i could reproduce this error on all of them.

If you want card info, I can provide it, but you need to tell me what to run on which system, because I didn't find a way to print sdcard characteristics from Mac OS X.

Contributor

popcornmix commented Mar 19, 2015

@ghollingworth does have a Samsung EVO sdcard that he can provoke into corrupting data.
He's built an fpga based sdcard analyser that can produce a log of all commands and responses and he's caught an error coming back from the sdcard.
He's just got to work out what exactly it is the card is unhappy about and how to avoid it.

For now, Transcend and Samsung EVO cards are best avoided. Other cards don't seem to suffer in the same way.
We're pretty sure that a future kernel update will make these cards reliable again and we'll post here when there is something to test.

Holy **** that was a freaking fast answer.
Thank you for the update, I'll watch the topic for future updates.

@lurch lurch referenced this issue in raspberrypi/noobs Mar 19, 2015

Open

Install fails on some SD cards with RPi2 #230

lurch commented Mar 19, 2015

Duplicate of #372 ?

If your failure is 100% reproducible then it would be interesting to see. Currently I'm having trouble reproducing the problem (I've only seen it twice in the last week), which makes it very difficult to understand what's going wrong.

Gordon

NitroG42 commented Apr 2, 2015

In my first post, I explain how to reproduce it on my card. Basically, after a fresh install of Raspbian, I create a file ( sudo touch /forcefsck ) to run fsck on next boot, I reboot and then lots of errors are found (and it crashes in a very beautiful way).

My question is: is it 100% reproducible? Does it happen without fail every time you boot in this way?

NitroG42 commented Apr 2, 2015

Well I tried 3 or 4 times in row (with a fresh install each time) at the time I created the issue.
I'll try again tonight if it does the same, but I didn't see a "no-error" install.

CyrussM commented Apr 3, 2015

Hi (and sorry for my bad English) ;)
I have a Transcend UHS-I 1U 16GB Class 10 and a Sandisk 16GB. I installed the Raspbian image 2-3 weeks ago and it works fine 24/7. But after 1-2, sometimes 3, reboots the RPi2 produces lots of (filesystem) errors and can't boot. After removing the power, the RPi2 boots without errors. After this happened again and again, I cloned the system from the Transcend to a Sandisk 16GB Class 10 with dd. All problems are solved: reboots are no longer a problem (with the same system, only cloned from one SD card to the other).

I have RPi2 and Openelec and got similar problem with Kingston 32GB class 10.
After a successful install and setup, a reboot of the Pi2 failed with:

*** Error in mount_storage: mount_common: could not mount /dev/mmcblk0p2 ***

Starting debugging shell... type exit to quit

sh: can't access tty; job control turned off

Now using a different card.

Are you using NOOBS to install the software or an image?

Are you updating the image before rebooting?

Gordon

Using an image. Openelec image (for RPi2) from http://openelec.tv/get-openelec
Write to Kingston 16GB class 10 U-1 card using win32diskimager.exe on Windows7
After initial Pi2 startup, Openelec goes through the set up.
I installed a few Kodi options then did a reboot and that is when the error message appeared.

What are the minimal steps required to guarantee it will fail? Please include versions of software and links. Do you actually need to install stuff, or will it corrupt without this?

Thanks
Gordon

The minimal steps are: install the official Openelec or Raspbian image on a Samsung Evo 16 U-1 (it doesn't matter how you install it), and switch off the Pi while it is writing to the SD card.

I saw a possible solution in another forum, but I haven't tested it yet: http://openelec.tv/forum/124-raspberry-pi/75281-openelec-5-0-3-still-corrupts-sd-card-on-pi2?start=15#132032

Contributor

popcornmix commented Apr 13, 2015

@moskichi Switching the Pi off while writing to the sdcard is expected to cause corruption (with any memory device on any platform). Always shut down before removing power.
We're interested here in repeatable causes of corruption that involve shutting down cleanly.

Contributor

popcornmix commented Apr 13, 2015

@NitroG42
I wonder if you could test this kernel: https://dl.dropboxusercontent.com/u/3669512/temp/kernel7.img
By default it should be the same as the current "rpi-update" kernel, but supports some debug options that can be enabled through cmdline.txt.
Can you add to cmdline.txt

bcm2835_mmc.mmc_debug=0x1fff

You should see: "mmc_debug:1fff" and "Forcing PIO mode" in dmesg log, and see a reduction in performance. I'd like to know if you still see corruption.
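For anyone unsure how to add the option: cmdline.txt has to stay a single line, so the parameter goes on the end of that line, separated by a space. A minimal sketch, not an official tool; the function name add_cmdline_param is made up for this example. The file should be /boot/cmdline.txt on Raspbian and /flash/cmdline.txt on OpenELEC (where /flash may need remounting read-write first).

```shell
# Append a kernel parameter to a cmdline.txt-style file unless it is already
# present. The file is rewritten as exactly one line.
# add_cmdline_param is a hypothetical helper name, not part of any Pi tooling.
add_cmdline_param() {
    file=$1
    param=$2
    # Flatten the file to a single line and drop trailing spaces.
    line=$(tr '\n' ' ' < "$file" | sed 's/ *$//')
    # Nothing to do if the exact parameter is already on the line.
    case " $line " in
        *" $param "*)
            return 0 ;;
    esac
    if [ -z "$line" ]; then
        printf '%s\n' "$param" > "$file"
    else
        printf '%s %s\n' "$line" "$param" > "$file"
    fi
}

# Example (run as root):
#   add_cmdline_param /boot/cmdline.txt bcm2835_mmc.mmc_debug=0x1fff
```

Note that this only appends. If an mmc_debug option with a different value is already on the line, remove it by hand first.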

Integral Ultima Pro 8GB Class 10 (up to 20MB/s) cards are brilliant; I've hardly had any issues with them in 2 years of use. They were out of stock one time, so I bought two different batches of Kingston cards from different suppliers. One of the suppliers was Scan Computers in Bolton, so it's not as if they were clone cards. I've had almost every one of them back over 3 months, and 50% of them can't be recognised by any device I put them in.

Cy4n1d3 commented Apr 17, 2015

I've bought a 16GB Samsung Evo MicSD when I first got my RPi2 (was still running my good old RPi1 /w normal SD) and experienced the same issues as NitroG42 and johalareewi - after a certain (few, 1 to 3 were sufficient) amount of reboots, the system wouldn't boot up any longer due to mounting errors.

I was able to reproduce the issue pretty much reliably any time back then: install an image (doesn't matter if I used a fresh OE image or my 'old' backuped RPi1 image with RPi2 kernel replacing the Pi1 kernel), do the usual setup stuff like config, addon installations, reboot. I even tried manual 'sync'ing and rebooting over SSH to be safe but after one to three reboots I encountered corruption anyways. As long as the Pi stayed powered on I was able to watch movies, tv shows, youtube and amazon prime without a hitch though... problems arose after the nightly power down or the aforementioned reboots.
Didn't matter if I just did a fresh install from image and simply rebooted a few times afterwards or if I did a fresh install and updated to Milhouse's testbuilds using the tar-file - sooner or later I ended up with a corrupted SD card. I was in fact even backing up the whole system to a USB hard drive before rebooting due to the sheer reliability of system corruption occurring ;)
After the dreaded card corruption I would then restore my backup which brought the system back to work until the next (or the one following that..) reboot.

I fixed the problem for now by buying a fresh Sandisk card which works without a sign of any flaws until today... got nearly mad at my RPi2 until I tried a 64GB Sandisk MicSD which worked flawlessly right from the first image install. At first I thought it might be power related but after testing 4 different power adapters (Nexus 10, Galaxy S5, generic 5V 2A adapter and a known brand adapter from a local electronics store) I kinda ruled that one out.

If there's anything I can do to maybe help debugging this one please don't hesitate to ask. I'd gladly use the Samsung Evo for my Pi, as it shows nearly doubled 4k writes in comparison to the Sandisk while retaining good 4k reads - I'd really like to run some real world usage scenario performance comparisons on the RPi2 using those cards :)

Hi Cy4n1d3 ,

If I understand correctly, you can help by testing the kernel that popcornmix posted in this thread or that is posted on OpenELEC's forum: http://openelec.tv/forum/124-raspberry-pi/75281-openelec-5-0-3-still-corrupts-sd-card-on-pi2?start=210#137875

I hope to test this kernel this weekend.

Contributor

popcornmix commented Apr 17, 2015

Yes, if anyone who is suffering corruption issues can test the kernel linked earlier, or try the OpenELEC test build, that would be very helpful.

Cy4n1d3 commented Apr 17, 2015

I'll try and see if I can still reproduce the error tomorrow.

For what it's worth I've installed the patched version of OpenElec 5.0.8 with bcm2835_mmc.mmc_debug=0x1fff set and have had the RasPi 2 go through 17 reboots (remotely triggered using SSH and a loop) without any issues (after 17 I stopped it because I thought that wasn't bad at all and wanted to get on with something else). I have a Transcend 16GB card (described as "Transcend Ultimate 16GB Micro SD (SDHC) Card - Class 10" at purchase, useful IDs from it are man:0x000074 oem:0x4a45 name:USD hwrev:0x0 fwrev:0x2) which previously would corrupt and refuse to boot after 3 or so boot ups (sometimes shutting it down overnight, others just shutting it down long enough to move the PSU to another socket).

OpenELEC:~ # cat /flash/cmdline.txt
boot=/dev/mmcblk0p1 disk=/dev/mmcblk0p2 quiet bcm2835_mmc.mmc_debug=0x1fff
OpenELEC:~ # dmesg | grep mmc-bcm
[    1.273284] mmc-bcm2835 3f300000.mmc: mmc_debug:1fff
[    1.273295] mmc-bcm2835 3f300000.mmc: Forcing PIO mode
OpenELEC:~ # 

Tomorrow or Sunday I'll disable the debug option and send it back through some reboots and see how many I get before corruption occurs...

Thank you to everyone who is working to fix this! :)

Alec

Contributor

popcornmix commented Apr 17, 2015

If you are happy with
bcm2835_mmc.mmc_debug=0x1fff
Then I'd be interested if
bcm2835_mmc.mmc_debug=0xfff
is also good.

Sadly I am not happy with bcm2835_mmc.mmc_debug=0xfff as I got four reboots before the RasPi froze at the initial OpenElec screen. Another power cycle and I get dropped to the debugging shell and an fsck is needed to fix it (lots of things it had to fix).

Don't get me wrong, it could be pure luck that 17 reboots passed without issue with 0x1fff and the 18th might have killed it just the same, but 17 compared to 4 is a big difference!

Since switching back to 0x1fff I have just survived a further 10 reboots without issue. Now bed beckons (early start in the morning). If there's any further testing you would like doing then let me know, not sure I'll be able to do any until Sunday now but fire away nevertheless and I'll do my best to oblige.

Hi AlecEdworthy, could you share the script you use to remotely reboot the RPi2? That would save me a lot of time. Thanks!

Contributor

popcornmix commented Apr 18, 2015

Okay if bcm2835_mmc.mmc_debug=0xfff doesn't work, can you confirm if bcm2835_mmc.mmc_debug=0x1000 is okay.

Cy4n1d3 commented Apr 18, 2015

I can still confirm the corruption issues I encountered when I first tried the Samsung EVO card.

Results so far:

  • no cmdline: corrupted file system after first reboot, didn't even install any addons or stuff like that

OpenELEC:/var/log # dmesg | grep -i mmc-bcm2835
[    1.421993] mmc-bcm2835 3f300000.mmc: mmc_debug:1fff
[    1.425168] mmc-bcm2835 3f300000.mmc: Forcing PIO mode

  • no corruption after 10 reboots

OpenELEC:~ # dmesg | grep -i mmc-bcm2835
[    1.301944] mmc-bcm2835 3f300000.mmc: mmc_debug:fff
[    1.301956] mmc-bcm2835 3f300000.mmc: DMA channels allocated

  • corrupted after the 4th reboot

OpenELEC:~ # dmesg | grep -i mmc-bcm2835
[    1.301936] mmc-bcm2835 3f300000.mmc: mmc_debug:1000
[    1.301947] mmc-bcm2835 3f300000.mmc: Forcing PIO mode

  • system froze after first reboot while displaying CEC Adapter notification (so different behaviour than before, where corruption happened / was noticed while booting, which is where it then got stuck), power cycling brought the system back through a second and third reboot without subsequent system freezes. I was unable to reproduce the freeze until the sixth reboot, where it got stuck while booting. Another hard power cycle however brought the system back up and allowed further reboots.
    I did 10 reboots here as well without encountering corruption issues - just a few freezes it seems.

So 0x1fff seems best so far. If someone is willing to share a reboot loop script, I will put those modes through further testing.

If you need further information or have more precise instructions please don't hesitate to ask @popcornmix !

Contributor

popcornmix commented Apr 18, 2015

For automatic rebooting (or running other command) with raspbian,
sudo nano /etc/rc.local
and add reboot just before the exit.

For openelec I would suggest looking here: http://wiki.openelec.tv/index.php/Autostart.sh

Contributor

popcornmix commented Apr 18, 2015

@ernstblaauw Can you run the script from a linux machine (which could be another pi)? That's a little easier than from windows.

Contributor

popcornmix commented Apr 18, 2015

I've updated the kernel links above, and the links in OpenELEC thread. New build also has another debug option. Can you try:

bcm2708-dmaengine.dma_debug=0x1f

instead of the bcm2835_mmc.mmc_debug option and report the results.

@popcornmix, I'm running Linux Mint on my main desktop. I already tried
sshpass -p "openelec" ssh root@192.168.0.60 'reboot'
but it seems the ssh connection is not closed then. Therefore, it is difficult to count the number of (successful) reboots. Do you have an idea how to do this?

hertzg commented Apr 19, 2015

@ernstblaauw I believe you could add && exit to the command to make sure it closes the ssh session after successfully executing the reboot

sshpass -p "openelec" ssh root@192.168.0.60 'reboot && exit'

Hi,

OK for those running Linux or Mac OS X and wanting to do remote reboots of their RasPi running OpenElec you can do this,

  1. In a terminal window run ssh-keygen -t rsa -b 1024 -f ~/.ssh/id_rsa_openelec and when prompted for a passphrase either enter something memorable but secure (there are many guides on the Internet for passphrase creation) or just hit enter (i.e. don't set a passphrase). The former is more secure, the latter is more convenient for this testing (but you should stop the key from being a valid login token on the RasPi once your testing is complete if your RasPi is accessible to untrusted hosts, e.g. the Internet, see the end of the posting). If you choose to set a passphrase then you will need to use an SSH key agent to simplify logging in to the RasPi.
  2. You now need to copy the public half of the key over to the RasPi and put it in the appropriate file to permit remote access, you can do this with this command, cat ~/.ssh/id_rsa_openelec.pub | ssh root@192.168.2.7 "cat - >> .ssh/authorized_keys" when prompted enter openelec as the password (you should substitute 192.168.2.7 in that command and subsequent ones for the IP address of the RasPi or its hostname if appropriate)
  3. If you decided to add a passphrase to the SSH key then you now need to load your key agent and add the key to the key agent,
    user@your-mac:~$ eval `/usr/bin/ssh-agent`
    Agent pid 45403
    user@your-mac:~$ ssh-add .ssh/id_rsa_openelec
    Enter passphrase for .ssh/id_rsa_openelec: 
    Identity added: .ssh/id_rsa_openelec (.ssh/id_rsa_openelec)
    user@your-mac:~$ 
  4. You should now be able to log in to the RasPi using the SSH key without entering a password using the command ssh root@192.168.2.7 -i ~/.ssh/id_rsa_openelec and you should be able to issue remote commands automatically too, e.g.
     user@your-mac:~$ ssh root@192.168.2.7 -i .ssh/id_rsa_openelec whoami
     root
     user@your-mac:~$ 

The command I actually used to do the reboot cycles was,

RUN=0; while $(true); do RUN=$[$RUN+1]; echo Reboot cycle $RUN; ssh -i ~/.ssh/id_rsa_openelec root@192.168.2.7 reboot; sleep 10; CHK=1; while [[ $CHK -eq 1 ]]; do echo Checking if back; sleep 1; (ping -c 1 -t 1 192.168.2.7 2>&1 > /dev/null) && CHK=0; done; echo Openelec is back; sleep 20; done

That is a very long line so be careful with cutting and pasting etc. Basically what it breaks down into is,

Establish a variable to count the runs RUN and start a loop which never ends while $(true); do, which,

  • Increments the RUN count RUN=$[$RUN+1];
  • Echos to the terminal which reboot cycle it is on echo Reboot cycle $RUN
  • Sends the reboot command to the RasPi ssh -i ~/.ssh/id_rsa_openelec root@192.168.2.7 reboot;
  • Pauses for 10 seconds sleep 10;
  • Establishes a variable to track whether the RasPi is back online or not, CHK=1;
  • Enters a loop which exits when the tracking variable is changed to something other than 1 while [[ $CHK -eq 1 ]]; do
    • Lets you know what it's doing echo Checking if back;
    • Pauses for a second sleep 1;
    • Sends a single ICMP ping to the RasPi to see if it is back on the network (a sign that the reboot is almost over) and if it is it sets the check variable to zero (ping -c 1 -t 1 192.168.2.7 2>&1 > /dev/null) && CHK=0;
  • Once the loop ends it lets you know the RasPi is back online echo Openelec is back;
  • Then pauses for 20 seconds (to let the reboot finish properly and allow everything to settle) sleep 20
  • Before starting the cycle again done

To exit the loop and stop the reboot cycle you need to press CTRL-C a few times - depending on which stage it's at the first CTRL-C will only exit part of the loop so press it three or four times, it won't harm anything to press it more than is necessary.

To stop the SSH key you created from being allowed to log into the RasPi you need to delete it from /storage/.ssh/authorized_keys or delete the file in its entirety.
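The same loop, spread over several lines, may be easier to adapt. This is only a sketch of the one-liner above with a few deliberate changes: it stops after a given number of cycles instead of running forever, it drops ping's -t 1 (on OS X that is a timeout, but on Linux -t sets the TTL), and it redirects with >/dev/null 2>&1 so ping errors are hidden too. The function names and the DOWN_WAIT and SETTLE_WAIT variables are invented for this example; the host and key path are the ones used in this post.

```shell
# Readable form of the one-line reboot loop. HOST and KEY default to the
# values from the post; all function and variable names are hypothetical.
HOST=${HOST:-192.168.2.7}
KEY=${KEY:-$HOME/.ssh/id_rsa_openelec}

# Ask the Pi to reboot over SSH using the key created earlier.
send_reboot() {
    ssh -i "$KEY" "root@$HOST" reboot
}

# Succeeds once the Pi answers a single ping.
probe() {
    ping -c 1 "$HOST" >/dev/null 2>&1
}

# Poll once a second until the Pi is back on the network.
wait_until_up() {
    until probe; do
        echo "Checking if back"
        sleep 1
    done
}

# Run the given number of reboot cycles, printing progress as it goes.
reboot_cycles() {
    max=$1
    run=0
    while [ "$run" -lt "$max" ]; do
        run=$((run + 1))
        echo "Reboot cycle $run"
        send_reboot
        sleep "${DOWN_WAIT:-10}"     # give the Pi time to go down
        wait_until_up
        echo "Openelec is back"
        sleep "${SETTLE_WAIT:-20}"   # let the boot finish and settle
    done
}

# Example: reboot_cycles 14
```

If the Pi hangs or corrupts during a boot, the script simply keeps printing "Checking if back", and the last "Reboot cycle" number shows how many reboots it survived.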

Hope that helps, Alec

Using the latest update (which I imaged onto the card and then restored my settings from backup) and bcm2708-dmaengine.dma_debug=0x1f I just managed three reboots before the system froze at boot and needed fscking. Fsck'd it, booted successfully and restarted the reboot cycle testing and it failed on the first reboot.

Contributor

popcornmix commented Apr 19, 2015

@AlecEdworthy
Does bcm2835_mmc.mmc_debug=0x1000 work for you?

Just used the same image but bcm2835_mmc.mmc_debug=0x1fff and got 13 reboots without issue. Now trying 0x1000...

Looks like 0x1000 is fine from a corruption point of view. Just survived 13 reboots without issue. Getting around 7.2MB/sec write and 16.6MB/sec read speeds with that option FWIW.

Contributor

popcornmix commented Apr 19, 2015

@AlecEdworthy any hangs with that setting? (@Cy4n1d3 reporting some hanging on boot, but no corruption).
0x1000 disables DMA in the sdcard driver (so uses PIO mode).
I had hoped the DMA wait states might produce a workaround. Can you confirm you see:
bcm2708-dmaengine soc:dma@7e007000: dma_debug:1f
in dmesg log when using bcm2708-dmaengine.dma_debug=0x1f?

When running with bcm2708-dmaengine.dma_debug=0x1f I was getting the line you asked me to check for,

[    0.829979] bcm2708-dmaengine soc:dma@7e007000: dma_debug:1f

To recap,

  • With bcm2835_mmc.mmc_debug=0x1000 I have had no corruption, no hanging, 13 perfect reboots one after another using the script I posted earlier today.
  • With bcm2835_mmc.mmc_debug=0x1fff I had no corruption, no hangs, again around 13 reboots without issue.
  • With bcm2708-dmaengine.dma_debug=0x1f (and no bcm2835_mmc.mmc_debug option) I was getting corruption within 4 reboots (one time it was on first reboot, another I got three reboots before the hang).

Each time I get corruption what happens is the box reboots fine, freezes at the first page of the reboot (where it lists the OpenElec version) and sits there. If I then power it off and on again it boots to the debug console and requires extensive repairs using fsck (leading to lost settings quite frequently, I have had to restore from backup a couple of times).

A

Contributor

popcornmix commented Apr 19, 2015

There was another test bit added in last update. Try:
bcm2708-dmaengine.dma_debug=0x1f bcm2835_mmc.mmc_debug=0x2000
which disables the MMC_QUIRK_BLK_NO_CMD23 quirk.

Looks good, 13 reboots without issue using bcm2708-dmaengine.dma_debug=0x1f bcm2835_mmc.mmc_debug=0x2000, write speeds of 6.9MB/sec and read speeds of 14.9MB/sec (off one test using dd, /dev/zero and a 500MB file).

Contributor

popcornmix commented Apr 19, 2015

@AlecEdworthy that is interesting. Can you reduce dma_debug?
e.g. Try 0x10, then 0x8 then 0x4 then 0x2 then 0x1 and then with it removed?

@popcornmix I'll give them a try. Probably won't be until later this evening but I'll try to get some testing in today. I assume I keep mmc_debug set to 0x2000 while altering dma_debug?

Contributor

popcornmix commented Apr 19, 2015

Yes

Cy4n1d3 commented Apr 19, 2015

Using bcm2708-dmaengine.dma_debug=0x1f while running the latest .img does not produce good results for me. I made a fresh setup, it rebooted once to partition the storage-partition and then it hung on the initial version display screen, didn't even get the mounting error / debug shell. After a few seconds the screen goes dark, nothing being displayed anymore. Same result after power cycling.

I've then changed the cmdline to bcm2708-dmaengine.dma_debug=0x1f bcm2835_mmc.mmc_debug=0x2000 using the same install, which let the system boot up. I configured SSH, logged in and verified the following lines inside dmesg:
[    0.857419] bcm2708-dmaengine soc:dma@7e007000: dma_debug:1f
[    1.302699] mmc-bcm2835 3f300000.mmc: mmc_debug:2000
Updated the addons and started rebooting, which allowed 3 reboots, after which I got a freeze after CEC adapter detection. Power cycling then once again allowed booting and another series of 13 reboots without bootup-corruption or system hangs.

Quick and dirty speed testing (/dev/zero, dd /w 1024 block size and 500 mb file size) revealed the following numbers on this scenario (average of three samples):
Write: 7.96 MB/s
Read: 14.46 MB/s
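To make numbers from different testers easier to compare, here is a sketch of this kind of dd test. It assumes a dd that accepts bs=1M and conv=fsync (GNU coreutils dd does; other builds may not), and the function names and the test file path are made up for this example.

```shell
# Write SIZE_MB megabytes of zeros to FILE. conv=fsync makes dd flush the
# data to the card before it prints its throughput line (on stderr).
# write_test and read_test are hypothetical helper names.
write_test() {
    dd if=/dev/zero of="$1" bs=1M count="$2" conv=fsync
}

# Read FILE back and discard it; dd again prints the throughput on stderr.
read_test() {
    dd if="$1" of=/dev/null bs=1M
}

# Example, using a 500 MB file on the storage partition:
#   write_test /storage/ddtest.bin 500
#   read_test  /storage/ddtest.bin
#   rm /storage/ddtest.bin
```

Unless the page cache is dropped between the two steps (as root: sync; echo 3 > /proc/sys/vm/drop_caches), the read figure may partly measure RAM rather than the card.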

Did another reboot afterwards which also succeeded.

OK,

  • 0x10, 14 reboots without issue
  • 0x8, 14 reboots without issue
  • 0x4, 14 reboots without issue
  • 0x2, 14 reboots without issue
  • 0x1, 14 reboots without issue, 7.7MB/sec write, 17.3MB/sec read (one off stats, therefore hellishly unreliable)

Where to next @popcornmix?
A

Contributor

popcornmix commented Apr 20, 2015

@AlecEdworthy Thanks. For completeness running with 0x0 (or dma_debug removed) would be useful.

Also, with dma_debug removed, I'd be interested in:

bcm2835_mmc.mmc_debug=0xffff0000

So just to confirm you want tests running with,

  • bcm2835_mmc.mmc_debug=0x2000 on its own
  • bcm2835_mmc.mmc_debug=0xffff0000 on its own

The dma_debug can be removed completely in both cases.

Kind regards, Alec

EDIT: "and" replaced with "can" in last line above.

Contributor

popcornmix commented Apr 20, 2015

Yes

Contributor

popcornmix commented Apr 20, 2015

I think 0xfff doesn't help. dma_debug doesn't help.
mmc_debug=0x1000 is known to help, but is undesirable as a solution as it disables DMA, and so increases CPU usage.

mmc_debug=0x2000 is the most promising one, which seems to help without performance issues. It does disable a fix that was added for a specific sdcard, so it can't just be enabled as a default, but it is a setting we'd like to gather as much information on as possible.

mmc_debug=0xffff0000 is currently unconfirmed, but we'd like to know if it helps.

So, for now, please test:
mmc_debug=0x2000 and then mmc_debug=0xffff0000
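As a quick aid for reading the values being passed around, here is a sketch that splits an mmc_debug value into the parts discussed in this thread. Only two bit meanings are stated here: 0x1000 forces PIO mode and 0x2000 disables the MMC_QUIRK_BLK_NO_CMD23 quirk. The upper 16 bits are later described only as delays, and the lower 12 bits are not explained at all, so the script just prints those fields raw. The function and field names are invented for this example and do not come from the driver.

```shell
# Break an mmc_debug value into the fields mentioned in this thread.
# describe_mmc_debug and the output field names are hypothetical.
describe_mmc_debug() {
    value=$(($1))
    # 0x1000: disables DMA in the sdcard driver (PIO mode), per the thread.
    if [ $((value & 0x1000)) -ne 0 ]; then pio=1; else pio=0; fi
    # 0x2000: disables the MMC_QUIRK_BLK_NO_CMD23 quirk, per the thread.
    if [ $((value & 0x2000)) -ne 0 ]; then cmd23=1; else cmd23=0; fi
    printf 'force_pio=%d\n' "$pio"
    printf 'cmd23_quirk_disabled=%d\n' "$cmd23"
    # Upper 16 bits: delay settings; their exact layout is not given here.
    printf 'delay_bits=0x%04x\n' $(( (value >> 16) & 0xffff ))
    # Lower 12 bits: meaning not documented in this thread.
    printf 'low_bits=0x%03x\n' $((value & 0xfff))
}

# Example: describe_mmc_debug 0x1fff
```

This shows at a glance why 0x1fff and 0x1000 both report "Forcing PIO mode" in dmesg while 0xfff does not: only the first two have bit 0x1000 set.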

@ernstblaauw With bcm2835_mmc.mmc_debug=0x1fff did you have issues (corruption, freezing etc) on the 21st reboot or just stop there because it was all going OK?

OK @popcornmix,

  • bcm2835_mmc.mmc_debug=0x2000 on its own ran wonderfully, 14 reboots without freezes or corruption issues.
  • bcm2835_mmc.mmc_debug=0xffff0000 on its own is running very badly. From issuing the reboot command to getting the rainbow loading screen took 60 seconds, from the rainbow screen to the networking stack being up was another 132 seconds and the OpenElec main screen finally appeared a total of 296 seconds (almost five minutes) after the rainbow screen. During the boot the animated Kodi splash screen stuttered a number of times. I might try some reboot cycles later tonight but given the six minute cycle time it may be a little tricky...

Is there a way to force the boot process to pause at the initial screen and give me the debug terminal instead of going through a normal boot? I ask because I would like to force an fsck of the SD card but can't from the normal OpenElec screen because I can't unmount /storage. I've tried break=load_modules and break=check_disks in /flash/cmdline.txt but neither worked (both booted as normal).

Alec

Contributor

popcornmix commented Apr 20, 2015

Okay, if bcm2835_mmc.mmc_debug=0xffff0000 is too slow, try bcm2835_mmc.mmc_debug=0x7f7f0000 or bcm2835_mmc.mmc_debug=0x3f3f0000 or bcm2835_mmc.mmc_debug=0x1f1f0000 until you get a usable sort of speed.

Following up on my earlier post, with bcm2835_mmc.mmc_debug=0xffff0000 after four or so reboots it looked like some of my settings had been damaged (sound effects were turned back on, overscan was disabled). The system was however still booting at that point. I aborted after 9 reboots (but it was still slowly cycling) and am doing a full reformat (using the SD Association's Formatter tool) before re-imaging, restoring from backup and then trying the other options. Might not get around to the other options tonight I'm afraid :-(

A

I've tried break=load_modules and break=check_disks in /flash/cmdline.txt but neither worked (both booted as normal).

Add debugging to cmdline.txt as well.

Also, textmode (add to cmdline.txt) is useful if you don't want to load Kodi and just want OpenELEC to boot into a console (although /storage will be mounted).

Hi,
Below you'll find my test results, including my earlier reported findings.

default cmdline
3 times corrupt during resizing

bcm2835_mmc.mmc_debug=0x1fff

# dmesg | grep mmc-bcm
[    1.302113] mmc-bcm2835 3f300000.mmc: mmc_debug:1fff
[    1.302124] mmc-bcm2835 3f300000.mmc: Forcing PIO mode

20 reboots: no crash

bcm2835_mmc.mmc_debug=0x2000

# dmesg | grep mmc-bcm
[    1.302496] mmc-bcm2835 3f300000.mmc: mmc_debug:2000
[    1.302509] mmc-bcm2835 3f300000.mmc: DMA channels allocated

20 reboots: no crash

bcm2835_mmc.mmc_debug=0xffff0000
Boots really slowly, I aborted this one

bcm2835_mmc.mmc_debug=0x7f7f0000
Boots into Kodi, but not very fast. Wifi did not come up, so I couldn't test this via ssh

bcm2835_mmc.mmc_debug=0x3f3f0000

# dmesg | grep mmc-bcm
[    1.302822] mmc-bcm2835 3f300000.mmc: mmc_debug:3f3f0000
[    1.302834] mmc-bcm2835 3f300000.mmc: DMA channels allocated

It looks quite slow (for sure it boots much slower than 0x2000).
10 reboots: no crash

To be precise: no crash means I stopped the testing by hand and thus no corruption took place.
I used the following test command:

RUN=0; while $(true); do RUN=$[$RUN+1]; echo Reboot cycle $RUN; sshpass -p "openelec" ssh root@192.168.0.60 '({ sleep 2; reboot; } >/dev/null &) ; exit '  ; sleep 10; CHK=1; while [[ $CHK -eq 1 ]]; do echo Checking if back; sleep 1; (ping -c 1 -t 1 192.168.0.60 2>&1 > /dev/null) && CHK=0; done; echo Openelec is back; sleep 20; done
  • bcm2835_mmc.mmc_debug=0xffff0000
    Painfully slow, no corruption or complete freezes (over 10 reboots) but unusable really (five minutes from rainbow to interface).
  • bcm2835_mmc.mmc_debug=0x2000
    No freezes or corruption (14 reboots), 13 seconds between rainbow pixels and OpenElec main interface, used for quite a while yesterday and the interface felt fine
  • bcm2835_mmc.mmc_debug=0x7f7f0000
    No freezes or corruption (14 reboots), 51 seconds between rainbow pixels and OpenElec main interface (one timing, but all reboots felt appreciably slower than 0x2000 and those below), interface was occasionally a little lumpy (scrolling between menu items would pause for a moment every now and again, perhaps one in 10 pings would take around 500ms instead of sub-5ms).
  • bcm2835_mmc.mmc_debug=0x3f3f0000
    Not done in-depth reboots for corruption tests yet, 15 seconds between rainbow pixels and OpenElec main interface (one try), not tested interface really but felt OK.
  • bcm2835_mmc.mmc_debug=0x1f1f0000
    Not done in-depth reboots for corruption tests yet, 13 seconds between rainbow pixels and OpenElec main interface (one try), interface feels responsive, all pings sub-20ms, most sub-5ms.

Do you need/want repeated reboot tests with 0x3f3f0000 or 0x1f1f0000?

A

Contributor

popcornmix commented Apr 21, 2015

I'd like to find out the smallest delays that avoid the corruption.
Assuming 0x1f1f0000 doesn't corrupt, then continue with 0x0f0f0000, 0x08080000, 0x04040000, 0x02020000, 0x01010000. The lower numbers will be better performance, but I'd imagine at some point you'll start seeing corruption. Hopefully it will be at a small enough number that performance isn't measurably affected.

Contributor

pelwell commented Mar 1, 2016

Except for an issue when using gpu_freq to overclock (core_freq is fine). A patch to solve that is currently only available via rpi-update, but if you think it might be affecting you then the workaround is to set core_freq to the same value as gpu_freq.

OK thanks, that's what I thought. I updated many Pi2s yesterday and I will try to reproduce the issue on another one, then update it and test whether that fixes our problem. Our Pi2s are not overclocked, so I should be fine with dist-upgrade.

We have about 10 Pi2s currently running at customers. We updated all of them to the latest firmware 2 weeks ago, and 2 days ago one got corrupted again, then yesterday another one, and I just got a call about a third one that is not working anymore. We also have many Pi1s in the field, and those don't seem to be affected by the issue.

Contributor

pelwell commented Mar 15, 2016

Which cards are you using? There was an rpi-update pushed this afternoon that includes a workaround for an issue seen with some old cards.

Also, how are the Pis being shut down?

We are using Kingston SDCA10/32GB and they never shut down except if the power is lost...

[Edit] Our software on the Pi is a webserver (mono) processing/displaying serial data.

We got 2 more defective PI2 since yesterday. Our Pi1 are using Kingston SDC10/32GB the slower model. Now we are replacing our Pi2 with the Pi1 but with the same SDCA cards.

After mounting a defective SD card on Linux, we were able to recover all files, and badblocks does not report any issues. We do get a lot of "failed" messages on boot, and the first one is "Failed start Restore / save the current clock.". Do you have any advice on how to diagnose what is happening?

Contributor

popcornmix commented Mar 18, 2016

@JocPelletier what does

vcgencmd version
uname -a

report on the Pi's with issues?

We updated all of our PI2 2 weeks ago to:

Linux raspberrypi 4.1.18-v7+ #846 SMP Thu Feb 25 14:22:53 GMT 2016 armv7l GNU/Linux

I've a status page with all of them including the "uname -a" result, to make sure they are all updated. For the other command, the result on one of them is:

Feb 25 2016 14:25:47
Copyright (c) 2012 Broadcom
version dea971b793dd6cf89133ede5a8362eb77e4f4ade (clean) (release)

It's the same image on all SDCard so it should be the same on all of them

BTW we have another one dead

Contributor

pelwell commented Mar 18, 2016

  1. Can you say a little more about the corruption process? What is the transition from working to corrupted?
  2. Do the "failed" messages on boot happen all the time, at the point of corruption, or after corruption?
  3. Is there any other hardware attached?
  4. How good are the power supplies?
  1. AFAIK, the Pi2 is working normally: Webserver + SQLAPI + NLog. Then after a while it starts to behave strangely: the VPN no longer opens, network failures, etc. So we ask the customer to reboot it, OR we do it ourselves through ssh (if it works), and it doesn't reboot.
  2. The failed messages happen after the corruption. We tested a fresh image and it's working without any issue. Right now I only have 1 corrupted SDCard, waiting for the other ones to see if it's the same problem.
  3. Yes, the power is attached to a custom board (5V 2Amp power supply) and the Pi is powered by that board through GPIO pins: 5V on Pin 2, ground on 6, Tx and Rx on 8 and 10. It's the exact same board for the Pi1 and Pi2.
  4. They all seem good.

What is weird is that it's happening approx. at the same time for each customer and they don't have the same setup/number of devices talking through serial.

[Edit] Our Pi1 are B+ model

Contributor

pelwell commented Mar 18, 2016

I suspect that you are running out of space in the filing system, since it runs well for some time (and approximately the same time). Is that possible?

Another potential explanation is that the cards aren't actually as large as they say they are, i.e. that not every reported sector is backed by real flash storage and that they are fakes. This is a real problem that has happened to many Pi users. It might be worth running something like h2testw (Windows) or f3 (Linux, Mac, Cygwin and others) just to be sure.
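
For anyone curious what such a capacity test does, the principle can be sketched in a few lines of Python. This is only an illustration of the write-then-verify idea, not how h2testw or f3 are implemented; the function names and the 1 MiB block size are assumptions made for the example. Each block gets content derived from its index, so a card that silently wraps around or drops writes returns data that no longer matches.

```python
import hashlib
import os
from typing import List

BLOCK_SIZE = 1024 * 1024  # assumed block size for this sketch (1 MiB)


def block_data(index: int, size: int = BLOCK_SIZE) -> bytes:
    """Deterministic pseudo-random content for block number `index`."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{index}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])


def write_blocks(path: str, count: int, size: int = BLOCK_SIZE) -> None:
    """Write `count` blocks to `path` and force them out to the device."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(block_data(i, size))
        f.flush()
        os.fsync(f.fileno())


def verify_blocks(path: str, count: int, size: int = BLOCK_SIZE) -> List[int]:
    """Return the indices of blocks whose content does not match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(count):
            if f.read(size) != block_data(i, size):
                bad.append(i)
    return bad
```

Calling `write_blocks` with a path on the mounted card and enough blocks to fill it, then `verify_blocks` with the same arguments, gives a list of mismatching blocks. Note that reading back immediately may be served from the page cache rather than the card, so the card should be removed and reinserted (or remounted) between the two steps for the result to mean anything.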

No we are not running out of space, I'm also monitoring that. I will run h2testw and report the result

Just ran the h2testw on one of the card: Test finished without errors.

Contributor

pelwell commented Mar 18, 2016

I'm running out of ideas. Clearly you are having a problem that appears to be filing system corruption, although it would be nice to see some dmesg output that supports that. You may be able to retrieve something from /var/log/syslog using your Linux PC. But it is also clear that there are millions of Pi2s out there that seem to be working happily, because our Forums aren't filled with screaming.

I'm currently connected by ssh on another Pi2 that just got corrupted. I can connect through ssh but if I do "htop" or "sudo nano /etc/resolv.conf" I get the same message:

Inconsistency detected by ld.so: get-dynamic-info.h: 115: elf_get_dynamic_info: Assertion `info[20]->d_un.d_val == 17 || info[20]->d_un.d_val == 7' failed!

And for the other ones:

pi@raspberrypi:~ $ uname -a
Linux raspberrypi 4.1.18-v7+ #846 SMP Thu Feb 25 14:22:53 GMT 2016 armv7l GNU/Linux
pi@raspberrypi:~ $ vcgencmd version
Feb 25 2016 14:25:47
Copyright (c) 2012 Broadcom
version dea971b793dd6cf89133ede5a8362eb77e4f4ade (clean) (release)

pi@raspberrypi:~ $ df -H
Filesystem Size Used Avail Use% Mounted on
/dev/root 32G 4.2G 26G 14% /
devtmpfs 481M 0 481M 0% /dev
tmpfs 486M 8.2k 486M 1% /dev/shm
tmpfs 486M 50M 437M 11% /run
tmpfs 5.3M 4.1k 5.3M 1% /run/lock
tmpfs 486M 0 486M 0% /sys/fs/cgroup
/dev/mmcblk0p1 63M 21M 43M 34% /boot
tmpfs 98M 0 98M 0% /run/user/1000
pi@raspberrypi:~ $ free -mh
total used free shared buffers cached
Mem: 925M 495M 430M 46M 81M 229M
-/+ buffers/cache: 184M 741M
Swap: 99M 0B 99M
pi@raspberrypi:~ $ top
top - 14:47:07 up 6 days, 23:12, 1 user, load average: 2.22, 2.31, 2.26
Tasks: 88 total, 1 running, 87 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.4 us, 0.2 sy, 0.0 ni, 80.5 id, 17.8 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 948060 total, 508132 used, 439928 free, 83580 buffers
KiB Swap: 102396 total, 0 used, 102396 free. 235156 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
528 root 20 0 64256 37620 12912 S 6.0 4.0 172:41.94 mono
4977 pi 20 0 5132 2556 2164 R 6.0 0.3 0:00.03 top
1 root 20 0 5484 3892 2740 S 0.0 0.4 0:35.65 systemd

Contributor

pelwell commented Mar 20, 2016

If you are certain that you aren't seeing undervoltage conditions (the Pi2 can draw a lot more current than the Pi1 under load - you should try and measure it during a representative peak workload), then perhaps you are tickling a threading problem in some of the code you are running - 4 cores will tickle any issues.

Going back to that error from ld.so, can you try running md5sum on all .so files on "good" Pi, also after a clean install if you have the time, and then the same again once it starts failing - although that may have to be on your Linux machine. This simple command ought to do the job - you can presumably capture the results from an ssh session:

sudo find / -name "*.so" -exec md5 {} \;

Comparing the output from the 2 or 3 runs should tell us something.
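
Comparing the listings by eye gets tedious with thousands of .so files, so here is a minimal sketch of how the runs could be diffed. It assumes each run was saved to a text file in the usual md5sum layout (digest, whitespace, path); the function names are made up for this example.

```python
def parse_md5sum(text):
    """Map path -> digest for text in md5sum's 'digest  path' layout."""
    sums = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, path = parts
        if path.startswith("*"):
            path = path[1:]  # md5sum marks binary mode with a leading '*'
        sums[path] = digest.lower()
    return sums


def compare_runs(good_text, suspect_text):
    """Report files whose digest changed, plus files present in only one run."""
    good = parse_md5sum(good_text)
    suspect = parse_md5sum(suspect_text)
    common = good.keys() & suspect.keys()
    return {
        "changed": sorted(p for p in common if good[p] != suspect[p]),
        "missing": sorted(good.keys() - suspect.keys()),
        "added": sorted(suspect.keys() - good.keys()),
    }
```

Reading the two saved listings with `open(...).read()` and passing them to `compare_runs` yields three sorted lists. Files in "changed" are only evidence of corruption if both cards started from the same image and received the same updates.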

I'll check whether it's a threading issue. About your command: do I have to install a package for md5? I tried to run it and all I get is:

find: `md5': No such file or directory

I tried to install md5 / md5sum but it's not in the repo

Contributor

pelwell commented Mar 20, 2016

It should be md5sum on the Pi - I was testing on a Mac. Is it really not installed as standard? dpkg itself uses md5 hashes.

On the pi /usr/bin/md5sum is in package coreutils

As a rule of thumb, faster (Class 10) is not better. Since I switched to a Class 4 SD card, no file system corruption has occurred. Maybe the firmware timings are fragile and work better with conservative speeds? I could trigger corruption with one or more runs of "apt-get update; apt-get upgrade", since these commands perform many write operations on the file system in a short time.

Note, there was a time a year back when I was seeing corruption on the SD card no matter what I did if I had SQLite running... It would never shut down cleanly no matter what I did.

I can't remember the details now, but it would be interesting to see if the problem still exists (i.e. whether it always requires 'fixing' the disk when booting up).

Prior to the sdcard driver switch I would see this style of corruption
quite often, where the device booted Ok but started to appear to show
corruption a little later.

At the time the problem seemed to be helped most by changing the power
supply (from one 2A supply to another 2A supply), and possibly also the
power supply cables (thicker wire to avoid voltage drop).

Changing the power supply seemed to help a lot, so I tend to think it was
ONE of the root causes. The sdcard driver appeared to be the other root
cause, but the latter root cause seemed more crucial to avoid
shutdown/reboot corruption.

Unfortunately, my change was just from a Samsung 2A 5V supply, to another
2A 5V supply that was reported to work well with the Pi, so exactly what
the Pi is intolerant of I'm still not sure.

The inevitable problem I tended to see is that the corruption roulette
would eventually make the PI unbootable, because the corruption, although
fsckable, would break the boot process first. Partitioning the sdcard into
a read-only root and read-write data partitions (again, something I did
prior to the sdcard driver update) helped that a lot too.

That's a general problem with all linux distributions with one big root
partition and no UPS of course, not just the Pi. Ubuntu is just as capable
of becoming unbootable after a power outage.
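
For anyone trying the read-only root layout described above, a small check like the following can confirm what actually ended up mounted writable. It is a sketch that parses text in the /proc/mounts layout (device, mount point, filesystem type, options); the function names and the list of ignored pseudo-filesystems are choices made for this example.

```python
PSEUDO_FS = ("proc", "sysfs", "devtmpfs", "tmpfs", "devpts", "cgroup")


def writable_mounts(mounts_text, ignore_fstypes=PSEUDO_FS):
    """Return (mount_point, device, fstype) for every read-write mount."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a valid mounts line
        device, mount_point, fstype, options = fields[:4]
        if fstype in ignore_fstypes:
            continue
        if "rw" in options.split(","):
            result.append((mount_point, device, fstype))
    return result


def root_is_read_only(mounts_text):
    """True if the last entry mounted on '/' carries the 'ro' option."""
    read_only = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "/":
            read_only = "ro" in fields[3].split(",")
    return read_only
```

On the Pi, the input would be the contents of /proc/mounts, e.g. `writable_mounts(open("/proc/mounts").read())`.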

Here are the results after running sudo find / -name "*.so" -exec /usr/bin/md5sum {} \; Both SD was at Linux raspberrypi 4.1.18-v7+ #846 SMP Thu Feb 25 14:22:53 GMT 2016 armv7l GNU/Linux but one may have additional updates.

Good: http://pastebin.com/X6BErQG7
Corrupted: http://pastebin.com/Tk3xPMNg

I think that we really have to create a read-only root partition

Contributor

pelwell commented Mar 21, 2016

The md5sums are only useful if the images match before the corruption, which isn't going to be the case if "one may have additional updates." But despite that, the way the differences are grouped makes me believe that all of them could be due to an upgrade. The few singletons are in less critical code, except perhaps /lib/arm-linux-gnueabihf/security/pam_systemd.so.

I did a test program writing data with SQLite and some log with NLog. The program was running the last 2 weeks on 2 different PI2: one GPIO powered and the other one using a 5V 2A Micro USB power supply. Today, both are corrupted.

This is the power supply that I switched to when I was getting file system
corruption. I've got months of uptime off it on a reasonably
filesystem-intensive application.

https://nicegear.co.nz/electronics-gear/5v-2a-power-supply-w-20awg-6-microusb-cable-international/

Note that there MAY be some confounding variables. I'm pretty sure the PSU
made an important difference, but I also switched to the new drivers and
started using a read-only root partition around the same time in an effort
to ward off file system corruption.


lurch commented Apr 7, 2016

Today, both are corrupted.

Both of them became corrupted at the same time? Perhaps there was a power cut/glitch/surge?
Or did you only notice they were corrupted on the same day, and they could have each corrupted at different times previous to that?

They became corrupted on different days. The first one on Friday and the other one later in the weekend; I noticed it on Monday morning. I will start another test with a new SD card (SanDisk Ultra or Extreme 32GB) instead of the Kingston SDCA10/32GB.

JocPelletier commented Apr 20, 2016

A little update on my tests:
PI B+ with Kingston SDCA10/32GB -> Corrupted
PI2 with Kingston SDCA10/32GB -> Corrupted
PI2 with Sandisk Ultra 32GB -> Corrupted (From Costco)

I sent my test program to your email pelwell about 2 weeks ago if it can help investigate the issue. The SanDisk Ultra on the PI2 got corrupted after only 1 week with my test program running.

I had two PI2's, both with the same Kingston SDCA10/16GB card.
Both got repeatedly corrupted (especially after reboot, but the system also often became unreachable without power glitches or reboots). At a certain moment even the reboot after installation was enough to get the thing corrupted...
I ran h2testw on both cards and both clearly showed physical corruption of the data (fake cards??). After several tests I even started to get block errors in the Windows logs and the software fails to do any further testing.
I returned both cards to the shop (RMA). They confirmed the cards were bad and refunded them.

Currently the two PI2's run on a Sandisk Ultra 8GB and a Transcend Premium 8GB, and have done so for several months without an issue.

First I was thinking the repeated small writes of the mySQL DB caused the cards to get corrupted. But for the moment I'm convinced those cards are just crap, especially because both PI's are running fine now for a long time...

DrCord commented May 4, 2016

I just got this happening on a completely new samsung SDHC 32GB class 10 card.

lylesvendsen commented May 15, 2016

I am able to produce this consistently on a RPI3 and Samsung 16 & 64gb mSD cards by trying to use NPM to install new node red nodes and they fail. GPIO Node for example ($sudo npm install node-red-contrib-gpio or $sudo npm install node-red-contrib-mpr121) I hope this helps debug this. I'm downloading the 2016-05-10 build of Raspbian, I hope it fixes the issue.

Contributor

pelwell commented May 15, 2016

The Kingston SDC10G2/16GB cards seem to have a controller fault that makes the ERASE operation both slow and hazardous. The ext4 filesystem will attempt to use ERASE on nodes as they are deleted, and this can lead to corruption. The rpi-4.4.y tree contains a patch that disables erase on such cards (name="SD16G", manfid=0x41, oemid=0x3432) to improve performance and stop the corruption; a kernel including this patch is now available via rpi-update.

If you have a card that seems to have the advertised capacity (passes h2testw, etc.) but gets corrupted during an update then please post your card details here, as found using:

grep . /sys/class/mmc_host/mmc0/mmc0:*/* 2>/dev/null

For example, my failing card returns:

cid:4134325344313647307e3002af0102b5
csd:400e005a5b59000073677f800a400027
date:02/2016
erase_size:512
fwrev:0x0
hwrev:0x3
manfid:0x000041
name:SD16G
oemid:0x3432
preferred_erase_size:12582912
scr:0235800300000000
serial:0x7e3002af
type:SD
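
As a cross-check on a listing like this, most of the fields can be recovered from the cid value alone. The sketch below decodes the 16-byte SD CID register following the standard field layout (manufacturer ID, OEM ID, 5-character product name, product revision, serial number, manufacturing date); the function name and the returned dictionary keys are choices made for this example, picked to mirror the sysfs attribute names.

```python
def decode_cid(cid_hex):
    """Decode the 32 hex digits of an SD card CID register."""
    raw = bytes.fromhex(cid_hex)
    if len(raw) != 16:
        raise ValueError("a CID is 16 bytes (32 hex digits)")
    # Manufacturing date: 12 bits, year offset from 2000 then month.
    mdt = int.from_bytes(raw[13:15], "big") & 0x0FFF
    return {
        "manfid": raw[0],
        "oemid": int.from_bytes(raw[1:3], "big"),
        "name": raw[3:8].decode("ascii", errors="replace"),
        "hwrev": raw[8] >> 4,      # high nibble of the product revision
        "fwrev": raw[8] & 0x0F,    # low nibble of the product revision
        "serial": int.from_bytes(raw[9:13], "big"),
        "date": "%02d/%d" % (mdt & 0x0F, 2000 + (mdt >> 4)),
    }
```

Feeding it the cid above should reproduce the manfid, oemid, name, serial and date lines of the same listing, which is a quick way to confirm a pasted report has not been mangled.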

I imagine I'll need to enable the same "quirk" for additional capacities of the Kingston cards - it sounds like 32GB is affected, so perhaps "SD32G" - but other manufacturers could have the same controller.

popcornmix added a commit that referenced this issue May 19, 2016

kernel: bump to 4.4.11
kernel: mmc: Apply QUIRK_BROKEN_ERASE to other capacities
See: #397 (comment)

kernel: New driver for the AudioInjector audio input and output card
See: raspberrypi/linux#1476

popcornmix added a commit to Hexxeh/rpi-firmware that referenced this issue May 19, 2016

kernel: bump to 4.4.11
kernel: mmc: Apply QUIRK_BROKEN_ERASE to other capacities
See: raspberrypi/firmware#397 (comment)

kernel: New driver for the AudioInjector audio input and output card
See: raspberrypi/linux#1476
Contributor

pelwell commented May 19, 2016

There is a new rpi-update release that adds a quirk to disable ERASE commands on cards with:

name:SD16G or SD32G or SD64G
manfid:0x00000041
oemid:0x3432
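
To check whether a given card falls under these criteria, the key:value listing produced by the grep command earlier in this thread can be parsed and compared. This is only an illustration of the three conditions listed above, not the kernel's own matching code; the function names are made up for this example, and it expects the trimmed key:value form as pasted in this thread.

```python
BROKEN_ERASE_NAMES = {"SD16G", "SD32G", "SD64G"}
BROKEN_ERASE_MANFID = 0x41
BROKEN_ERASE_OEMID = 0x3432


def parse_card_info(text):
    """Turn 'key:value' lines into a dict (later duplicates overwrite earlier)."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key:
            info[key] = value
    return info


def matches_broken_erase(info):
    """True if name, manfid and oemid all match the criteria listed above."""
    try:
        manfid = int(info.get("manfid", ""), 16)
        oemid = int(info.get("oemid", ""), 16)
    except ValueError:
        return False  # attribute missing or not a hex number
    return (
        info.get("name") in BROKEN_ERASE_NAMES
        and manfid == BROKEN_ERASE_MANFID
        and oemid == BROKEN_ERASE_OEMID
    )
```

A card that does not match can still be affected by the same fault; the check only says whether the workaround would be applied to it.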

JocPelletier commented May 19, 2016

Thanks for the update, I'll try that now on a Pi 2 using a 32G Kingston:

uname -a
Linux raspberry 4.4.11-v7+ #887 SMP Thu May 19 16:24:03 BST 2016 armv7l GNU/Linux
cid:41343253443332473000238f1700fa05
csd:400e00325b590000ef1f7f800a4000cf
date:10/2015
erase_size:512
fwrev:0x0
hwrev:0x3
manfid:0x000041
name:SD32G
oemid:0x3432
preferred_erase_size:4194304
scr:0235800201000000
serial:0x00238f17
type:SD
uevent:DRIVER=mmcblk
uevent:MMC_TYPE=SD
uevent:MMC_NAME=SD32G
uevent:MODALIAS=mmc:block

But we also got corruptions on Sandisk Ultra 32G cards, if it's the same issue you should add:

cid:035344534c3332478043d82f1500f8e5
csd:400e00325b590000e6487f800a404095
date:08/2015
erase_size:512
fwrev:0x0
hwrev:0x8
manfid:0x000003
name:SL32G
oemid:0x5344
preferred_erase_size:4194304
scr:0235800300000000
serial:0x43d82f15
type:SD
uevent:DRIVER=mmcblk
uevent:MMC_TYPE=SD
uevent:MMC_NAME=SL32G
uevent:MODALIAS=mmc:block

TheSin- commented May 19, 2016

I've been following this issue, although I have not had any issues with my SanDisk Ultra 8G or Kingston 8G. I'm just curious: why limit the ERASE patch to specific cards? Why not just apply it to everything; is there a benefit to keeping ERASE at all? Wouldn't it be safer to patch it out for anything that is an SD card, period, and only enable it for eMMC/USB? Again, I haven't had any issues; I'm just curious, as an include list might be harder to maintain than an exclude list, is all.

Contributor

pelwell commented May 19, 2016

When it works, ERASE should be a performance boost because it reduces any subsequent write time. I don't think this is a Pi-specific problem, so it can't be that widespread; otherwise ERASE would already be disabled in the kernel for all SD cards. For now I'd rather be cautious.

TheSin- commented May 19, 2016

makes sense, just thought I'd ask I know you guys work very hard on this and the last thing you want is a list with hundreds of cards on it that you have to maintain ;) So thought I'd bring it up to see if it made sense.

Contributor

pelwell commented May 19, 2016

It's not a bad idea, just one I'd rather keep up my sleeve for now.

JocPelletier commented May 27, 2016

Still have corruption issue with 4.4.11-v7+ and Kingston SD32G after 1 week using my test program.

I've been running archlinux arm for a long time now and I've been using the following card with a Raspberry Pi Model B Rev 2 and I haven't seen any problems.

cid:4134325344333247300051861800e519
csd:400e00325b590000ef177f800a400059
date:05/2014
erase_size:512
fwrev:0x0
hwrev:0x3
manfid:0x000041
name:SD32G
oemid:0x3432
preferred_erase_size:4194304
scr:0235800001000000
serial:0x00518618
type:SD
uevent:DRIVER=mmcblk
uevent:MMC_TYPE=SD
uevent:MMC_NAME=SD32G
uevent:MODALIAS=mmc:block

I guess the difference might be that I've been using f2fs for the root fs for as long as it has been supported in the archlinux arm kernel. If I recall correctly mmc erase support was introduced after that.

As pelwell says, if mmc erase works it should be used as it helps increase write speed by avoiding an erase before a write and might also help with sd card longevity.

SigiK commented Nov 9, 2016

@JocPelletier Could you please tell me what caused the corruption of your SD cards? I am having similar issues with Samsung MB-SS32SD, but don't know what's causing it.

JocPelletier commented Nov 9, 2016

@SigiK We decided to stop using the RPi in our products because of this issue, so I've not investigated further. I sent @pelwell a program that reproduces the issue but got no reply or feedback, so I'm not sure they used it. It seems to be "caused" by SQLite.

SigiK commented Nov 9, 2016

@JocPelletier Thx for the fast answer! In your opinion, could it have to do with mono and serial communication (we are using RS485)? We are using both.

@SigiK I don't think so, because we have 2 services running, one managing all RS485 communications and one running a webserver with data logging. We tried to disable the data-logging on some RPi and it seems to "fix" the corruption issue.

Also, my test program reproducing the issue is not using the RS485

@JocPelletier If you enable sqlite, does this give you 100% reliable corruption? I've seen this in the past and similarly put it down to something about SQLite, hoping it would be fixed at some time in the future. When you say corruption, does this result in a non-booting system or just a system that needs to do an e2fsck at startup?

@ghollingworth 99% Yes but it can take more than a week. But it can be NLog too because we are using it too. I've not tried to remove NLog in my test program that's why I can't say 100% yes.

By corruption I mean a non-booting system.

SigiK commented Nov 9, 2016

@JocPelletier Thx for this useful information! We used swap before, which probably also corrupted SD cards. That said, we didn't have such issues with smaller (4GB) SD cards before (with swap and logging activated). In our case it seems to be a combination of SD-card type and heavy writes to SD-card.

When you say non-booting, how far does it get? Do you get:

  1. Nothing, no lights flashing etc
  2. Multicoloured splash screen
  3. Boot to kernel looking for filesystem but cannot find valid ext4 partition
  4. Found ext4 partition but failed e2fsck
  5. Tried to fix partition (using e2fsck) but failed and crashed...

JocPelletier commented Nov 9, 2016

@ghollingworth I don't remember exactly, and I'm not sure it was always the same result, but it was not 1) or 2); it was a kernel panic.

@SigiK Yes, "In our case it seems to be a combination of SD-card type and heavy writes to SD-card." is true. SQLite is not the cause; every process that performs many write operations will cause corruption over time. How long it takes depends on the SD card model (actually the hardware used for that model at the time of production) and the RPi firmware version. It may take anywhere from 1 day to 3 months or longer. The RPi3 is able to boot without an SD card, and the RPi2 can boot from a read-only partition on an SD card and use a USB stick or hard drive.

Ferroin commented Nov 10, 2016

It's worth noting that while SQLite isn't the direct cause, it and any other database files (including BerkDB, as well as many backends for most RDBMS software, and (rather significantly) systemd journal files) will tend to accelerate this process on many SD cards because they involve lots of internal rewrites, and most SD cards have really poorly designed FTL's that don't do a good job of wear-leveling.

Ruffio commented Jan 7, 2017

@NitroG42 Can this issue be closed?

NitroG42 commented Jan 7, 2017

Oh, I haven't played with my Raspberry Pi for a long time, but since users don't seem to have the issue anymore (it was working for me after the first patch, I think), it can be closed!
If anybody runs into this issue again, don't be afraid to open another one (I think).

@NitroG42 NitroG42 closed this Jan 7, 2017

neuschaefer pushed a commit to neuschaefer/raspi-binary-firmware that referenced this issue Feb 27, 2017

kernel: Bump to 3.18.12
kernel: alsa: Make interrupted close paths quieter
See: raspberrypi/linux#931

kernel: bcm2835-mmc: Add range of debug options for slowing things down
kernel: bcm2835-mmc: Default to disabling MMC_QUIRK_BLK_NO_CMD23
kernel: bcm2708-dmaengine: Add debug option for setting wait states
See: raspberrypi#397

firmware: arm_loader: Changes to support bcm2835_sdhost driver

neuschaefer pushed a commit to neuschaefer/raspi-binary-firmware that referenced this issue Feb 27, 2017

kernel: Adding bcm2835-sdhost driver, and an overlay to enable it
See: raspberrypi#397

kernel: bcm2835-sdhost: Adding overclocking option
kernel: bcm2835-mmc: Adding overclocking option
See: http://forum.kodi.tv/showthread.php?tid=224025&pid=2005396#pid2005396

kernel: config: Add CONFIG_CIFS_UPCALL
See: raspberrypi/linux#968

kernel: config: Add CONFIG_FB_SSD1307=m
See: raspberrypi/linux#969

firmware: di_adv: Fix memory leak of converted buffers
See: raspberrypi#429

firmware: arm_display: Fix initialisation of framebuffer struct when framebuffer base is passed in

firmware: hdmi: Tweak hdmi_mai_thresh for 192kHz audio
See: https://discourse.osmc.tv/t/rp2-multichannel-flac-playback/2627/28

firmware: vcsm: Update to header from kernel side