Migration script from MediaWiki to Hugo
This is a shell script I wrote to migrate my MediaWiki instance into a Hugo static website.
In my use case, my MediaWiki instance was hosting several notes and HowTo's, so it was a rather small instance (256 pages in 8 namespaces and 16 categories, illustrated by 61 images). Even so, it takes ~25 sec to process. This is certainly not the best approach for migrating a huge MediaWiki instance. Nevertheless, as I didn't find anything similar, I think this script could really help others, even if it could still be improved.
The useful variables are customizable via arguments of the shell script.
xenlo@red-carpet:~/Scripts/$ ./mediawiki-2-hugo.sh -h
Usage: ./mediawiki-2-hugo.sh [-v] [-i mediawiki_dir] [-o out_dir] [-t timezone] [-c charset]
-i mediawiki_dir Specify the MediaWiki root directory as input (default: /var/www/mediawiki/)
-o out_dir Specify output directory (default: ./out)
-t timezone Specify your timezone offset (default: +02:00)
-c charset Specify the DB charset (default: binary)
-f frontmatter Specify a file with extra Front Matter entries
-w format_script Specify a script which pre-processes the wiki content text,
any script taking wiki text from standard input and returning the edited wiki text on standard output
-m format_script Specify a script which post-processes the MD content text,
any script taking MarkDown text from standard input and returning the edited MarkDown text on standard output
-M md_format Specify the destination MarkDown format, as accepted by pandoc for its --to argument,
(default: `markdown_strict+backtick_code_blocks`)
-v Verbose
So simply run the script, possibly with sudo as the Apache user (if your user doesn't have read access to the MediaWiki directory).
sudo -u www-data ./mediawiki-2-hugo.sh -i /var/www/my_wiki -o /tmp/output -w my-pre-format-script.py
The script first loads the credentials from the LocalSettings.php file and uses them to read all the wiki page data from the database. From there, it generates a structure of subdirectories (matching the MediaWiki namespaces) in a destination directory (DEST_DIR is ./out by default).
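As a rough sketch of that first step ($wgDBname, $wgDBuser and $wgDBpassword are the standard MediaWiki settings; the actual parsing done by the script may differ):

# Sketch: extract the DB credentials from LocalSettings.php (standard MediaWiki variables)
WIKI_WEB_DIR="/var/www/mediawiki"
DB_NAME=$(sed -n 's/^\$wgDBname *= *"\(.*\)";/\1/p' "${WIKI_WEB_DIR}/LocalSettings.php")
DB_USER=$(sed -n 's/^\$wgDBuser *= *"\(.*\)";/\1/p' "${WIKI_WEB_DIR}/LocalSettings.php")
DB_PASS=$(sed -n 's/^\$wgDBpassword *= *"\(.*\)";/\1/p' "${WIKI_WEB_DIR}/LocalSettings.php")
# Sketch: check that the credentials actually give access to the wiki database
mysql -u "${DB_USER}" -p"${DB_PASS}" "${DB_NAME}" -e "SELECT COUNT(*) FROM page;"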
Then, for each wiki page (each record of page in the SQL DB), it will generate a .md file. Each file is filled in with several attributes (front matter, in Hugo terms) and with the page's content, as sketched below.
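A minimal sketch of that per-page step (the namespace-to-directory mapping and the file naming here are simplified assumptions, reusing the DB_* variables from the previous sketch):

# Sketch: for each page, create its namespace directory and an (empty) .md file
OUT_DIR="./out"
mysql -N -B -u "${DB_USER}" -p"${DB_PASS}" "${DB_NAME}" \
      -e "SELECT page_namespace, page_title FROM page;" |
while IFS=$'\t' read -r ns title; do
    mkdir -p "${OUT_DIR}/${ns}"
    touch "${OUT_DIR}/${ns}/${title}.md"
done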
This script fetches and fills in the following data (a sketch of the resulting front matter follows the list):
- title: the page.page_title, with underscores ('_') replaced by spaces
- author: the user.user_name of the MediaWiki user who created the page
- date: the oldest revision.rev_timestamp in the database for that page
- lastmod: the latest revision.rev_timestamp in the database for that page
- draft: always false... (I am thinking of setting it to true for pages that are not in the main namespace)
- categories: the list of categories linked to that page (categorylinks.cl_to)
- tags: left empty for now...
- aliases: the list of redirected/renamed page names (page.page_title when page.page_is_redirect = 1)
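For illustration, here is a minimal sketch of what such a generated front matter could look like; the page name and all values below are invented, and the real script builds the block from the database fields listed above:

# Hypothetical example: front matter emitted for one page (all values are made up)
OUT_DIR="./out"
mkdir -p "${OUT_DIR}/HowTo"
cat > "${OUT_DIR}/HowTo/Install_Nginx.md" <<EOF
---
title: "Install Nginx"
author: "xenlo"
date: 2018-05-12T10:24:00+02:00
lastmod: 2019-01-03T18:02:00+02:00
draft: false
categories: ["HowTo"]
tags: []
aliases: ["Setup_Nginx"]
---
EOF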
And of course, it fetches the content of each page in its latest revision and converts it to MarkDown (GitHub flavour) with pandoc, so pandoc must be installed and accessible (a sketch of the conversion call follows the requirements list below).
To run the script you need:
- Pandoc installed
- Read access to the file ${WIKI_WEB_DIR}/LocalSettings.php
- Read access to the directory ${WIKI_WEB_DIR}/images/
- SQL connectivity (credentials collected from LocalSettings.php)
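The wiki-to-MarkDown conversion itself boils down to a pandoc call along these lines (a simplified sketch; the real script feeds it the page text fetched from the database and honours the -w/-m/-M options):

# Sketch: convert MediaWiki markup read from standard input into Markdown
echo "== My section ==" | pandoc --from mediawiki --to markdown_strict+backtick_code_blocks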
It's working, but far from perfect. Lots of things are not handled, as you can see in the list below.
- Ensure it generates a nicer post header
- Handle the verbosity
- Manage template into html snippet
- Further test of output in Hugo
- Ensure that the initial logo respects the licences of the Hugo and MediaWiki images
- Correct the known bugs
- Complete the tags with ...?
- Refactor a bit into functions
- Better exploit possibilities of pandoc (metadata variables)??
- Copy the files: except for the images, I don't have any files on my wiki.
- Slug: I took the page's name from MediaWiki (which is the title with spaces substituted by underscores) as the file name, so in the end the URLs remain the same.
- Front matter of the theme: in the theme I plan to use (Tranquilpeak), I didn't see any interesting attribute that could be fed with data from the MediaWiki.
Copyright 2019 Laurent G (xenlo)
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.