
Error building Astro project with 30k MDX files #11683

Open
1 task
chrome99 opened this issue Aug 12, 2024 · 2 comments
Assignees
Labels
- P2: has workaround Bug, but has workaround (priority)

Comments


chrome99 commented Aug 12, 2024

Astro Info

Astro                    v4.13.3
Node                     v21.4.0
System                   Linux (x64)
Package Manager          npm
Output                   static
Adapter                  none
Integrations             @astrojs/mdx
                         @astrojs/tailwind

If this issue only occurs in one browser, which browser is a problem?

This is not related to a specific browser.

Describe the Bug

I'm working on an Astro project with 30,000 blog posts using content collections, where both the frontmatter data and the content are rendered into the page. This setup works well for a smaller number of pages (100-1000). However, when I scaled up and added more files, I hit a JavaScript heap out of memory error. To address this, I started running NODE_OPTIONS='--max-old-space-size=8192' npm run astro build instead of the regular npm run astro build. While this allowed me to build more files, the process is very slow, and I now face a new issue where the build fails abruptly with a "Killed" message, providing no additional information for debugging.

Full output using `NODE_OPTIONS='--max-old-space-size=8192' npm run astro build`:

[screenshot: build output ending abruptly with "Killed"]

What's the expected result?

Finish build successfully.

Link to Minimal Reproducible Example

https://stackblitz.com/~/github.com/chrome99/build-error-astro

Participation

  • I am willing to submit a pull request for this issue.
@github-actions github-actions bot added the needs triage Issue needs to be triaged label Aug 12, 2024
@matthewp (Contributor) commented:

Thanks, are most of these posts mdx?

chrome99 (Author) commented Aug 14, 2024

Yes, all of these posts are MDX files. In a real project I would also have some aggregations of the posts based on various factors (a tag page, a category page, etc.) - these would be JSON files.

I eventually found a solution / workaround by splitting the files to be built into different chunks, and then merging them together. I built on agamm's solution here.

This solution works for now, but I have some concerns about potential issues. One is with generating a sitemap, and another is when a page queries the entire collection (e.g., await getCollection("blog")). Currently, this would only return the last chunk, not the full data set. While these issues can be manually addressed, or possibly automated with script improvements, I'm debating whether to refine this into a PR or if it might be too much of a workaround.
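For the getCollection concern, one possible direction (a hypothetical sketch, not part of the workaround script below): have each chunk's build write the frontmatter of its posts to a per-chunk JSON manifest, then merge those manifests so sitemap and tag/category pages can be generated from the full data set. The manifest shape and the `mergeManifests` helper here are assumptions for illustration:

```javascript
// merge-manifests.js — hypothetical helper, not part of the workaround script.
// Each chunk build is assumed to emit an array of { slug, data } entries
// (e.g. serialized from getCollection("blog") during that chunk's build).
// Merging the per-chunk arrays reconstructs the full collection.

function mergeManifests(manifests) {
  const bySlug = new Map();
  for (const manifest of manifests) {
    for (const entry of manifest) {
      // Later chunks win on duplicate slugs, so re-running a chunk
      // does not produce duplicate posts in the merged set.
      bySlug.set(entry.slug, entry);
    }
  }
  return [...bySlug.values()];
}
```

Aggregation pages (tags, categories, the sitemap) would then read the merged manifest instead of calling getCollection, which only sees the chunk currently on disk.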

// build.js
// At package.json, add a new script to run this file:
// "build": "node build.js"
// This script will copy the files from the /data directory to the src/content/blog directory,
// build the Astro project, and then copy the build files to a temporary directory.
// After all chunks are processed, it will merge the build files from all chunks into the final dist/blog directory.

import fs from "fs/promises";
import path from "path";
import { promisify } from "util";
import { exec } from "child_process";

const buildChunkSize = 1000;
const sourceDir = "./data";
const blogDir = "./src/content/blog";
const distDir = "/tmp/dist";

const execAsync = promisify(exec);

function chunkArray(array, chunkSize) {
  const chunks = [];
  for (let i = 0; i < array.length; i += chunkSize) {
    chunks.push(array.slice(i, i + chunkSize));
  }
  return chunks;
}

async function copyDirectoryContents(sourceDir, destDir) {
  const entries = await fs.readdir(sourceDir, { withFileTypes: true });

  await fs.mkdir(destDir, { recursive: true });

  await Promise.all(
    entries.map(async (entry) => {
      const srcPath = path.join(sourceDir, entry.name);
      const destPath = path.join(destDir, entry.name);
      if (entry.isDirectory()) {
        await copyDirectoryContents(srcPath, destPath);
      } else {
        await fs.copyFile(srcPath, destPath);
      }
    })
  );
}

async function emptyDir(directory) {
  if (await fs.stat(directory).catch(() => false)) {
    const entries = await fs.readdir(directory, { withFileTypes: true });
    await Promise.all(
      entries.map(async (entry) => {
        const fullPath = path.join(directory, entry.name);
        if (entry.isDirectory()) {
          await emptyDir(fullPath);
          await fs.rmdir(fullPath);
        } else {
          await fs.unlink(fullPath);
        }
      })
    );
  } else {
    await fs.mkdir(directory, { recursive: true });
  }
}

async function processChunk(sourceDir, blogDir, chunk, chunkNum, totalChunks) {
  console.log(
    `Processing chunk ${chunkNum}/${totalChunks} with ${chunk.length} files.`
  );

  await emptyDir(blogDir);
  console.log(`Cleared ${blogDir}.`);

  await Promise.all(
    chunk.map(async (file) => {
      const srcPath = path.join(sourceDir, file);
      const destPath = path.join(blogDir, file);
      await fs.copyFile(srcPath, destPath);
    })
  );
  console.log(`Copied chunk ${chunkNum} files to ${blogDir}.`);

  await execAsync("npm run astro build");
  console.log(`Build for chunk ${chunkNum} complete.`);
}

async function mergeChunks(distDir, fileChunks) {
  await emptyDir("./dist/blog");
  console.log("Prepared the final dist/blog directory.");

  for (let i = 1; i <= fileChunks.length; i++) {
    const chunkDistDir = `${distDir}_${i}/blog`;
    if (await fs.stat(chunkDistDir).catch(() => false)) {
      await copyDirectoryContents(chunkDistDir, "./dist/blog");
      console.log(`Merged files from ${chunkDistDir} into ./dist/blog.`);
    }
  }

  console.log(
    "Successfully merged all chunks into the final dist/blog directory."
  );
}

async function main() {
  try {
    const allFiles = (await fs.readdir(sourceDir)).filter(
      (file) => file.endsWith(".mdx") || file === "index.astro"
    );
    console.log(`Found ${allFiles.length} content files in ${sourceDir}.`);

    const fileChunks = chunkArray(allFiles, buildChunkSize);
    console.log(`Chunked files into ${fileChunks.length} groups.`);

    for (let i = 0; i < fileChunks.length; i++) {
      await processChunk(
        sourceDir,
        blogDir,
        fileChunks[i],
        i + 1,
        fileChunks.length
      );

      const chunkDistDir = `${distDir}_${i + 1}`;
      await copyDirectoryContents("./dist", chunkDistDir);
      console.log(`Copied build files to ${chunkDistDir}.`);
    }

    await mergeChunks(distDir, fileChunks);
  } catch (error) {
    console.error("Error:", error);
  }
}

main();
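For the sitemap concern, a post-merge step could walk the final dist/blog output and emit one `<url>` entry per generated HTML page, so the sitemap reflects all chunks rather than the last one. This is a hypothetical sketch; the `SITE` origin and the .html file layout are assumptions:

```javascript
// sitemap-sketch.js — hypothetical post-merge step, not part of the script above.
// Recursively collects generated .html pages under a dist directory and
// writes a minimal sitemap covering every merged chunk.
import fs from "fs/promises";
import path from "path";

const SITE = "https://example.com"; // assumption: your deployed origin

async function collectPages(dir, base = "") {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  const pages = [];
  for (const entry of entries) {
    const rel = path.posix.join(base, entry.name);
    if (entry.isDirectory()) {
      pages.push(...(await collectPages(path.join(dir, entry.name), rel)));
    } else if (entry.name.endsWith(".html")) {
      pages.push(rel);
    }
  }
  return pages;
}

async function writeSitemap(distDir, outFile) {
  const pages = await collectPages(distDir);
  const urls = pages
    .map((p) => `  <url><loc>${SITE}/${p}</loc></url>`)
    .join("\n");
  const xml =
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${urls}\n</urlset>\n`;
  await fs.writeFile(outFile, xml);
  return pages.length;
}
```

Running this once after mergeChunks() (e.g. `await writeSitemap("./dist", "./dist/sitemap.xml")`) sidesteps the per-chunk sitemap problem entirely.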

@ematipico ematipico self-assigned this Aug 15, 2024
@ematipico ematipico added - P2: has workaround Bug, but has workaround (priority) and removed needs triage Issue needs to be triaged labels Aug 15, 2024
Development

No branches or pull requests

3 participants