New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cortex-m4: move hardfault handler to arch #1051

Merged
merged 1 commit into from Jul 6, 2018

Conversation

Projects
None yet
4 participants
@bradjc
Copy link
Contributor

bradjc commented Jun 29, 2018

So far it has just been copied a couple times.

Testing Strategy

This pull request was tested by compiling.

TODO or Help Wanted

n/a

Documentation Updated

  • Updated the relevant files in /docs, or no updates are required.

Formatting

  • Ran make formatall.
cortex-m4: move hardfault handler to arch
So far it has just been copied a couple times.

@ppannuto ppannuto force-pushed the hardfault-cortexm4 branch from 7bfd43c to eee268d Jun 29, 2018

@ppannuto
Copy link
Member

ppannuto left a comment

I imagine a bunch of the m0/m4 code overlaps too, though that may be less worth doing

@phil-levis
Copy link
Collaborator

phil-levis left a comment

I don't want to put a hold on this particular PR, but I do want to comment on my experience trying to maintain a board outside the mainline -- it's hard! Between releases there are tons of tiny refactors that make staying in sync difficult, because just to compile you have to change a lot of little things, then when it doesn't work figuring out why is hard.

In this particular case, rather than these several incremental refactors of the cortex-M code, perhaps we should do a one-time design of what the breakdown between -m and -mN should be? Then we can be done with it.

@bradjc

This comment has been minimized.

Copy link
Contributor

bradjc commented Jun 30, 2018

My take is everything that is shared by more than one cortex-mX architecture should be in cortex-m and exported properly by the specific cortex-mX crate.

Having copies is very hard to maintain, and makes getting tock working on new hw harder.

@ppannuto

This comment has been minimized.

Copy link
Member

ppannuto commented Jun 30, 2018

w.r.t. to the small refactors vs one big shot, I think I'd argue the opposite. It's hard to review giant PRs, if things can be merged in digestible chunks, it's much easier to reason about. I've opened and closed #1012 probably a dozen times over the last week, meaning to look at it and then deciding I didn't have enough time.

I think releases serve the goal of the big "one-shot" refactors. If you're maintaining an out-of-tree board, it probably makes sense to only merge tock mainline on a slower cycle rather than tracking master. Maybe you don't have to wait the full two months, but you also aren't obligated to track master every day, out-of-tree ports can wait until all of the small cortex-m refactors land to merge.

@phil-levis

This comment has been minimized.

Copy link
Collaborator

phil-levis commented Jun 30, 2018

I'd argue that bigger PRs are better for design, while smaller PRs are better for details. Smaller PRs are easier for us, but more focused evolution is the system is better for users, and ultimately our emphasis should be on users. We should be driven by what makes the system easier to use and maintain, not some code development process pushing us in the wrong direction.

I'm suggesting that we come up with a design -- there can be lots of discussion around that, that considers all of the complications -- then implement it. Lots of small changes here and there don't necessarily lead to a well-designed system.

My argument for having more structured, complete, and consistent evolution is based on trying to bring the Hotel port from 1.0 to master: in the past four months, there have been changes to everything from Driver to the linker to debug output to stack sizes to cortex-m factoring to process to breaking things out into crates. But what this means is that suddenly no code compiles, and then, when you do get it to compile and it doesn't work, trying to figure out why is extremely hard. This could even all be fine if I thought that these things wouldn't just change more between now and the next release. But based on the current development process I have no doubt that it won't happen again, and I'm not going to do it again and again.

I'd suggest this as a trial run to see what I mean: take the 1.0 version of the sam4l and imix code, as well as a handful of capsules (which chips that have different peripherals may have), and drop it into 1.2. It'll give you a sense of how everything is creeping forward at once, which is a nightmare for keeping in sync with.

@bradjc

This comment has been minimized.

Copy link
Contributor

bradjc commented Jun 30, 2018

I have roughly done that, from a commit in November, 2017: tock/tock-teensy#12

From looking at that PR, what changed:

  • We no longer need Xargo. Removed Xargo.toml, but it doesn't hurt if that file is there.
  • The console interface changed to allow reading. That PR was open for a month.
  • The basic board-specific panic code was moved to kernel/src/debug.rs. This saves us from having a copy for each board, and should make it easier to update boards, in tree and out of tree, in the future.
  • We don't need compiler builtins explicitly any more. That is a rust update.
  • main was renamed to kernel_loop. This came along with a slight re-org of the kernel crate that otherwise wouldn't have affected the naming :/.
  • A default copy of the load_processes() function was added to the kernel. Boards are free to not use this and use their own.
  • The process functions from the kernel were added to the procs namespace. This was necessary for Tock safety guarantees (too much was being publicly exported from the tock kernel).
  • The cells moved to a consistent namespace. Yeah, fixing this one is annoying, but it doesn't make sense for there to be 3 different ways to use the Tock cells. Also this made moving the cells to a new crate as was requested by users transparent to all Tock boards.
  • Some kernel syscall APIs added Option<>s. That PR was also open for a month.
  • Interrupts are no longer stored in explicit data structures and instead use the NVIC controller. That PR was open for 12 days but had been mentioned several times before as the possible culprit for some hard to find bugs. It did however mean removing any notion of defining specific interrupt handlers in chip peripheral drivers.
  • The atomic() function was added to the chip interface. That PR was open for only a day, but is part of a larger effort to make the kernel crate architecture agnostic. Yeah, this one is an annoying fix, but is luckily copy and paste.
  • We merged the new register interface. That PR was open for almost four months, and if it hadn't, then the current design may have matched the version in the teensy repo, making it much, much easier to update to tock master. But, in the general case, the VolatileCell approach still works.

So, based on my experience, we are doing design reviews since major PRs that affect out-of-tree boards are largely remaining open for a month or months, and we are serving users by making our designs available for out-of-tree projects. Other changes are explicitly driving projects that are tracking issues (which recently have been more of a focus which I think is a good thing, so there is an obvious place to comment on the design before an implementation exists), or are needed for the safety guarantees we (are trying to) promise our users.

Is there something concrete we can change to make this easier? One thing would be to actually delay releases until the projects we want to get in them are actually done, rather than just push them to the next release. Then there wouldn't be so much churn on specific parts of the kernel between releases. This may, however, mean we don't do another release this year.

I think having more comments on the RFC and tracking PRs would be helpful. No response is easily interpreted as an implicit approval of the proposed design.

@phil-levis

This comment has been minimized.

Copy link
Collaborator

phil-levis commented Jun 30, 2018

The issue is less about the length of PRs, or their size, but rather a development process that encourages making lots of small changes across the entire system. It means that when you need to update, the changes are small and pervasive, which is much harder to manage and debug than more deliberate, concentrated ones. I'll take #1049 as an example, since Pat argued for small pull requests.

First off, we very recently had a discussion on the UART HIL and its advanced features: in just the past two months we have #936 and #943. Why do we have three incremental changes to a HIL in such a small space of time? If I thought that #1049 was the end of it, that would be one thing, but it almost certainly isn't. Rather than making many small incremental changes as their needs come up, we should design the HIL carefully and be done with it. Many small point changes is the antithesis of good design.

Second, why should the UART HIL have ReturnCodes but some other HILs don't? We need to be consistent. We've heard this complaint before, when people ask what examples to look at. Some HILs have setting the client in the trait; others have it in the structure. Some HILs return ReturnCodes; others don't. Some capsules use Cell<Option<>>; others use OptionalCell<>. If all of the HILs should have return codes, then we should change all of them, and change all of the capsules. This would change a lot of code, but it's an example of a specific change repeated many times, rather than many small different changes.

@ppannuto

This comment has been minimized.

Copy link
Member

ppannuto commented Jun 30, 2018

I find myself agreeing and disagreeing with this at the same time.

I agree that a series of small changes, particularly to HILs which should be somewhat principled interfaces, reflects at least some absence of a clear over-arching design. However, at the same time, I very much like that #936 and #1049 are separate PRs. They are two completely separate conceptual ideas, and muddying them in one PR would make each change harder to reason about. (As an aside #943 probably should've been P-Significant, I'm not sure I even saw that one till now -- that was while I was in Ghana and I was definitely triaging Tock tracking -- which is in part why #1049 changes some of #943)

Perhaps the tracking issues coupled with release milestones could help address this. I imagine something like at the start of a new Tock release cycle, we craft a few tracking issue for major fixes (i.e. reconsidering the UART HIL), and the next Tock release then includes either all or none of the affiliated PRs that fix the individual problems.


why should the UART HIL have ReturnCodes but some other HILs don't?

I think this is a historical anachronism. Loosely, it looks like the HILs that were created after the kernel had ReturnCode support include ReturnCodes and the older ones don't. I think all the HILs should have ReturnCodes (for methods that can fail), and it's just technical debt that they don't.

Edit: Opened #1052 to track this.

@phil-levis

This comment has been minimized.

Copy link
Collaborator

phil-levis commented Jul 2, 2018

The idea that the UART HIL is continually changing across multiple PRs without a conscious design stage is untenable. That's programming, not software engineering. You design first, in a complete sense, then implement. You do not continually change core APIs when itches arise.

@alevy

This comment has been minimized.

Copy link
Member

alevy commented Jul 2, 2018

Can we technically get rid of cortexm3 vs cortexm4? They both have the same (optional) core peripherals, don't they?

@@ -179,6 +191,199 @@ pub unsafe extern "C" fn switch_to_user(
user_stack as *mut u8
}

pub unsafe extern "C" fn hard_fault_handler() {

This comment has been minimized.

@alevy

alevy Jul 2, 2018

Member

If this is moving into a common crate, I'd prefer for either there to be some validation that the code overhead of this implementation goes away if panic doesn't actually print anything, or have the "stack trace" functionality be feature gated at compile time.

This comment has been minimized.

@bradjc

bradjc Jul 5, 2018

Contributor

I changed Hail's io.rs to:

/// Panic handler.
#[cfg(not(test))]
#[no_mangle]
#[panic_implementation]
pub unsafe extern "C" fn panic_fmt(pi: &PanicInfo) -> ! {
    // turn off the non panic leds, just in case
    let led_green = &sam4l::gpio::PA[14];
    led_green.enable_output();
    led_green.set();
    let led_blue = &sam4l::gpio::PA[15];
    led_blue.enable_output();
    led_blue.set();

    let led_red = &mut led::LedLow::new(&mut sam4l::gpio::PA[13]);
    // let writer = &mut WRITER;
    // debug::panic(&mut [led_red], writer, pi, &cortexm4::support::nop)
    // Just blink, do not print anything
    debug::panic_blink_forever(&mut [led_red])
}

to make sure it would only blink the LEDs and not print anything.

Before:

   text	   data	    bss	    dec	    hex	filename
  86018	   4412	  61120	 151550	  24ffe	target/thumbv7em-none-eabi/release/hail

After:

   text	   data	    bss	    dec	    hex	filename
  51202	   4412	  61120	 116734	  1c7fe	target/thumbv7em-none-eabi/release/hail

Also, this function was already in a shared crate, so I'm not sure what has changed (except I guess more boards will be affected). If you wanted to do something differently in the hardfault handler on your board I think you would have the same problem after this PR as you did before.

@ppannuto ppannuto merged commit 55f9cc4 into master Jul 6, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
deploy/netlify Deploy preview ready!
Details

@ppannuto ppannuto deleted the hardfault-cortexm4 branch Jul 6, 2018

@ppannuto

This comment has been minimized.

Copy link
Member

ppannuto commented Jul 6, 2018

Re m3 vs m4, the difference is m4 has DSP extensions and optional FPU. It's possible the DSP stuff could add something that the hardfault handler might print some state of, but I don't know it off hand, and it'd be easy enough to compose them.

I think for clarity, a chip that has an m3 should use the m3 arch crate. Internally to the crate, it probably makes sense for one to largely re-export the other.

@phil-levis

This comment has been minimized.

Copy link
Collaborator

phil-levis commented Jul 6, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment