Skip to content

Conversation

xuhancn
Copy link
Collaborator

@xuhancn xuhancn commented Jul 22, 2024

We have tried to switch AotCodeCompiler to new cpp_builder in the PR: #130127
But, it has some compatibility issue to fb_code, and it was reverted: #130127 (comment)

I re-created this PR and add some debug code to help fix compatibility issue: 20035b1

@desertfire could please help me to import this PR to fb_code?

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

Copy link

pytorch-bot bot commented Jul 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131304

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 20035b1 with merge base 1439bd3 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@xuhancn xuhancn added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 22, 2024
@xuhancn xuhancn marked this pull request as ready for review July 22, 2024 04:39
@xuhancn xuhancn changed the title switch AotCodeCompiler to new cpp_builder. [inductor] switch AotCodeCompiler to new cpp_builder. (take 2) Jul 22, 2024
@xuhancn xuhancn added topic: not user facing topic category intel This tag is for PR from Intel labels Jul 22, 2024
@xuhancn xuhancn requested a review from desertfire July 22, 2024 05:33
@facebook-github-bot
Copy link
Contributor

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@albanD albanD added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 23, 2024
@henrylhtsang
Copy link
Contributor

Hi @xuhancn, I will be helping with fixing fbcode tests for this PR.

One thing I noticed so far: Should we add cuda_passthough_args = [] after https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L1101

That seems to be the logic here https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codecache.py#L1735

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 3, 2024

Hi @xuhancn, I will be helping with fixing fbcode tests for this PR.

One thing I noticed so far: Should we add cuda_passthough_args = [] after https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L1101

That seems to be the logic here https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codecache.py#L1735

Hi @henrylhtsang ,
I tried to simulate it by: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L955-L957
You can re-create a new PR by your-self, and try to switch AotCodeCompiler.

@henrylhtsang
Copy link
Contributor

Hi @xuhancn, I will be helping with fixing fbcode tests for this PR.
One thing I noticed so far: Should we add cuda_passthough_args = [] after https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L1101
That seems to be the logic here https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codecache.py#L1735

Hi @henrylhtsang , I tried to simulate it by: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L955-L957 You can re-create a new PR by your-self, and try to switch AotCodeCompiler.

Hi, one problem is that passthough flags is not None when cuda and fbcode is used
https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L1056

Are there any problems with adding that cuda_passthough_args = []? Can we try it? I seem to have good success with adding that line for the fbcode tests.

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 3, 2024

Hi @henrylhtsang , could you please try #132594?

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 5, 2024

@henrylhtsang you can also reference to my design doc: #124245

@henrylhtsang
Copy link
Contributor

@xuhancn just to be clear, should I test #132594 and this pull request together?

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 5, 2024

@xuhancn just to be clear, should I test #132594 and this pull request together?

@henrylhtsang Yes, test them together.
Cherry-pick into this PR or create a new one, it is up to you.

@henrylhtsang
Copy link
Contributor

henrylhtsang commented Aug 5, 2024

@xuhancn I will circle back in a few hours. I think after this fix it might be good to go.

edit: nevermind celebrated too early. it fixes some errors but not all errors

@henrylhtsang
Copy link
Contributor

@xuhancn just to update a bit more:

  • the fix helps with a few tests
  • a few remaining tests are still failing. Problem seems to be due to using relative path instead of complete path, and it seems to be related to using re environment (probably related to fbcode_aot_cpu_re flag?!?)

Let me know if you have more ideas. Otherwise will continue to investigate.

@henrylhtsang
Copy link
Contributor

@xuhancn

Looks like output_o used to be absolute path, but now is relative path.

Before:
os.path.splitext(input_path)[0] + ".o" -> absolute path

After:
output_o = object_builder.get_target_file_path() -> relative path

@chenyang78
Copy link
Contributor

@xuhancn

Looks like output_o used to be absolute path, but now is relative path.

Before: os.path.splitext(input_path)[0] + ".o" -> absolute path

After: output_o = object_builder.get_target_file_path() -> relative path

Thanks for the investigation. So, looks like we might want to keep using the absolution path for fbcode, e.g. differentiating fbcode from oss in object_builder.get_target_file_path()?

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 5, 2024

@xuhancn

Looks like output_o used to be absolute path, but now is relative path.

Before: os.path.splitext(input_path)[0] + ".o" -> absolute path

After: output_o = object_builder.get_target_file_path() -> relative path

Hi @henrylhtsang could you please try to add some logical to fix the target path here: #131304 (comment)
CC: @chenyang78

@henrylhtsang
Copy link
Contributor

@xuhancn
Looks like output_o used to be absolute path, but now is relative path.
Before: os.path.splitext(input_path)[0] + ".o" -> absolute path
After: output_o = object_builder.get_target_file_path() -> relative path

Hi @henrylhtsang could you please try to add some logical to fix the target path here: #131304 (comment) CC: @chenyang78

#132708 seems to fix the test those errors. Do you know a way to incorporate them?

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 5, 2024

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 5, 2024

@henrylhtsang
#131304 (comment) for output_o.
#131304 (comment) for output_so

@henrylhtsang
Copy link
Contributor

henrylhtsang commented Aug 5, 2024

@xuhancn sorry I am not getting it. Can you elaborate a bit more for both cases?

edit: okay maybe I get the output_so part. But not sure about the output_o part (where to add logical?)

edit2: okay still confused about both parts

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 5, 2024

Hi @henrylhtsang

For output_o, it is rel/abs path issue. https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L1251-L1257 Adjust the control logical by fb_code, self._aot_mode, and self._use_absolute_path.

For output_so, in https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L1152-L1173, add a parameter specified_so_name: str = "". and example:

def get_name_and_dir_from_output_file_path(
    file_path: str, specified_so_name: str = ""
) -> Tuple[str, str]:

    if len(specified_so_name) > 0
        file_path = config.aot_inductor.output_path

    name_and_ext = os.path.basename(file_path)
    name, ext = os.path.splitext(name_and_ext)
    dir = os.path.dirname(file_path)

    return name, dir

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 5, 2024

Hi @henrylhtsang

For output_o, it is rel/abs path issue. https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L1251-L1257 Adjust the control logical by fb_code, self._aot_mode, and self._use_absolute_path.

For output_so, in https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L1152-L1173, add a parameter specified_so_name: str = "". and example:

def get_name_and_dir_from_output_file_path(
    file_path: str, specified_so_name: str = ""
) -> Tuple[str, str]:

    if len(specified_so_name) > 0
        file_path = config.aot_inductor.output_path

    name_and_ext = os.path.basename(file_path)
    name, ext = os.path.splitext(name_and_ext)
    dir = os.path.dirname(file_path)

    return name, dir

Additional, for control abs/rel only impact output_obut not output_so, please check self._compile_only, ref: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cpp_builder.py#L1239-L1244

@henrylhtsang
Copy link
Contributor

@xuhancn update:

it seems like we can always add output_so = os.path.splitext(input_path)[0] + ".so" and output_o = os.path.splitext(input_path)[0] + ".o"

For example:

log.debug("aot linkage command: %s", link_cmd)
output_so = os.path.splitext(input_path)[0] + ".so"
if fbcode_aot_cpu_re:
    compile_file([output_o, consts_o], output_so, link_cmd.split())
    os.chmod(output_so, 0o755)
else:
    run_command_and_check(link_cmd)

As a matter of fact, we do that for consts_o already. Let me know if this works.

I am waiting for test signals now and will circle back in a few hours.

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 5, 2024

@xuhancn update:

it seems like we can always add output_so = os.path.splitext(input_path)[0] + ".so" and output_o = os.path.splitext(input_path)[0] + ".o"

For example:

log.debug("aot linkage command: %s", link_cmd)
output_so = os.path.splitext(input_path)[0] + ".so"
if fbcode_aot_cpu_re:
    compile_file([output_o, consts_o], output_so, link_cmd.split())
    os.chmod(output_so, 0o755)
else:
    run_command_and_check(link_cmd)

As a matter of fact, we do that for consts_o already. Let me know if this works.

I am waiting for test signals now and will circle back in a few hours.

It is also a solution, if we never plan to support AoT for Windows.

@henrylhtsang
Copy link
Contributor

@xuhancn update:
it seems like we can always add output_so = os.path.splitext(input_path)[0] + ".so" and output_o = os.path.splitext(input_path)[0] + ".o"
For example:

log.debug("aot linkage command: %s", link_cmd)
output_so = os.path.splitext(input_path)[0] + ".so"
if fbcode_aot_cpu_re:
    compile_file([output_o, consts_o], output_so, link_cmd.split())
    os.chmod(output_so, 0o755)
else:
    run_command_and_check(link_cmd)

As a matter of fact, we do that for consts_o already. Let me know if this works.
I am waiting for test signals now and will circle back in a few hours.

It is also a solution, if we never plan to support AoT for Windows.

Well I guess we can move it inside the if fbcode_aot_cpu_re:

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 6, 2024

@xuhancn update:
it seems like we can always add output_so = os.path.splitext(input_path)[0] + ".so" and output_o = os.path.splitext(input_path)[0] + ".o"
For example:

log.debug("aot linkage command: %s", link_cmd)
output_so = os.path.splitext(input_path)[0] + ".so"
if fbcode_aot_cpu_re:
    compile_file([output_o, consts_o], output_so, link_cmd.split())
    os.chmod(output_so, 0o755)
else:
    run_command_and_check(link_cmd)

As a matter of fact, we do that for consts_o already. Let me know if this works.
I am waiting for test signals now and will circle back in a few hours.

It is also a solution, if we never plan to support AoT for Windows.

Well I guess we can move it inside the if fbcode_aot_cpu_re:

Sounds good, it's only impact fbcode_aot_cpu_re.

@henrylhtsang
Copy link
Contributor

Internal signals LGTM (D60773405)

I think it is good to land as long as #132594 and the relative path thing is patched. cc @desertfire @chenyang78

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 6, 2024

@henrylhtsang so, could you please help on merge these PRs?

@henrylhtsang
Copy link
Contributor

@henrylhtsang so, could you please help on merge these PRs?

Yes. Can you add the commits to this PR?

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 6, 2024

Added comments and request review: #132594 @henrylhtsang , @desertfire

@henrylhtsang
Copy link
Contributor

@xuhancn is that a good reason to split it into a few PRs? It will add a few more cycles to merging this PR.

@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 6, 2024

@xuhancn is that a good reason to split it into a few PRs? It will add a few more cycles to merging this PR.

My suggest is you re-create a new PR and land it. @henrylhtsang

@henrylhtsang
Copy link
Contributor

@xuhancn is that a good reason to split it into a few PRs? It will add a few more cycles to merging this PR.

My suggest is you re-create a new PR and land it. @henrylhtsang

sure I can do that

henrylhtsang added a commit to henrylhtsang/pytorch that referenced this pull request Aug 6, 2024
…ch#132766)

Summary:
Pull Request resolved: pytorch#132766

This is basically pytorch#131304 together with pytorch#132594 and absolute path fix for fbcode.

Test Plan: ci

Differential Revision: D60773405
pytorchmergebot pushed a commit that referenced this pull request Aug 6, 2024
Summary: This is basically #131304 together with #132594 and absolute path fix for fbcode.

Test Plan: ci

Differential Revision: D60773405

Pull Request resolved: #132766
Approved by: https://github.com/xuhancn, https://github.com/chenyang78, https://github.com/desertfire
@xuhancn
Copy link
Collaborator Author

xuhancn commented Aug 7, 2024

close by #132766

@xuhancn xuhancn closed this Aug 7, 2024
pytorch-bot bot pushed a commit that referenced this pull request Aug 8, 2024
…ilder. (take 3)

Summary:
Forward fix of a test failure caused by D60773405.

The idea of D60773405 is that we need to use absolute path. So we will want to use the older version of path for output_so and output_o.

However, when I was copying the older definitions of output_so and output_o, I thought it was okay to simplify it a bit. See #131304 (comment)

Turns out I was wrong.

Test Plan: ci

Differential Revision: D60990594
pytorchmergebot pushed a commit that referenced this pull request Aug 9, 2024
…ilder. (take 3) (#133042)

Summary:
Forward fix of a test failure caused by D60773405.

The idea of D60773405 is that we need to use absolute path. So we will want to use the older version of path for output_so and output_o.

However, when I was copying the older definitions of output_so and output_o, I thought it was okay to simplify it a bit. See #131304 (comment)

Turns out I was wrong.

Test Plan: ci

Differential Revision: D60990594

Pull Request resolved: #133042
Approved by: https://github.com/hl475, https://github.com/desertfire
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request intel This tag is for PR from Intel module: inductor open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants