This is a set of Python programs and scripts for downloading a selected list of topics from Wikipedia to a local machine. It can be used to create an offline repository, such as a CD/DVD distribution of selected Wikipedia articles. The program was originally written to release a selection of 500 Malayalam Wikipedia articles on CD. It is written so that it can be reused with any wiki project for the same kind of work.
“The Malayalam wikipedia: selected 500 articles” generated by this program is available here.
Before running the program, make sure that python-lxml is installed on your machine. If it is not, install it using your distribution's package manager. The program ships with a sample set of topics from the English Wikipedia and is preconfigured to run on that sample. All you need to do is download the program and run wiki2cd.sh. A folder named samplewiki will be generated in the parent folder, and you can open index.html in your browser.
$ tar -xvzf wiki2cd_20100414.tar.gz
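As a quick check before running wiki2cd.sh, the following minimal sketch tests whether the lxml module (provided by the python-lxml package) can be imported. The helper name `has_module` is hypothetical and not part of wiki2cd.

```python
import importlib.util


def has_module(name):
    """Return True if the module `name` can be imported by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("lxml"):
        print("lxml is available")
    else:
        print("lxml is missing; install python-lxml first")
```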
How to Customize?
- Prepare a topic list. The input to the program is a plain text file in which each line contains the title of a page. See the sample topicslist supplied with the program. If you want to categorize the topics, put == before the title. This is used to create a navigation tree in the offline version. The number of = signs prefixed to a title determines its position in the tree.
- Open wiki2cd.sh in any text editor. Change outputfolder="../wiki" to the preferred output folder. The extracted content is written to that directory. If the directory does not exist, the program will create it. Change baseurl="http://en.wikipedia.org" to your wiki's base URL. Change topics="topicslist.txt" to the topic list you prepared.
#Change the following properties as per your requirement
- You also need to edit some of the pages, such as the banner, titles, and credits, as per your requirement. You might also need to edit the banner images, the main page image, etc. to fit your preference.
- The ISO9660 file system has many limitations when it comes to Unicode file names and directory depth. The first part of the shell script creates a repository in which file names match the article titles and image names match the original wiki image names. This causes problems most of the time when you try to burn the repository to a CD/DVD. So the shell script renames all the files to numbers and moves all images to the wikiimages folder to reduce the directory depth. By default, the script makes the repository suitable for writing to a CD/DVD. If you prefer to keep the file names and image names as they are and do not plan to write to a CD, you can comment out the section of the script that does the renaming. More details are available as comments in the script wiki2cd.sh.
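The topic list format from the first step can be illustrated with a small parser. This is only a sketch, not the code used by wiki2cd. It assumes that a line starting with = signs opens a category whose depth is the number of leading = signs, and that a plain line is an article belonging to the most recently opened category. The names `Node` and `parse_topics` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    is_category: bool = False
    children: list = field(default_factory=list)


def parse_topics(lines):
    """Build a navigation tree from topic-list lines (assumed format)."""
    root = Node("root", is_category=True)
    # Stack of (depth, node) pairs for the categories that are currently open.
    stack = [(0, root)]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Depth is the number of leading '=' characters.
        depth = len(line) - len(line.lstrip("="))
        title = line.strip("=").strip()
        if depth == 0:
            # Plain article title: attach to the innermost open category.
            stack[-1][1].children.append(Node(title))
            continue
        # Close categories at the same depth or deeper before opening this one.
        while stack[-1][0] >= depth:
            stack.pop()
        node = Node(title, is_category=True)
        stack[-1][1].children.append(node)
        stack.append((depth, node))
    return root
```

For example, the lines `==Science`, `Physics`, `===Biology`, `Cell` would give a Science category containing the article Physics and a nested Biology category containing Cell.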
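The renaming step can be sketched in Python as follows. wiki2cd.sh does this work itself, and its actual logic may differ. The function name `number_pages` and the choice to leave index.html untouched are assumptions made for illustration. Rewriting the links inside the pages and moving images to wikiimages are not shown.

```python
import os


def number_pages(folder, keep=("index.html",)):
    """Rename every .html file in `folder` to 1.html, 2.html, ...

    Returns a dict mapping old file names to new ones, which could later
    be used to rewrite the links inside the pages.
    """
    names = sorted(
        n for n in os.listdir(folder)
        if n.endswith(".html") and n not in keep
    )
    mapping = {old: f"{i}.html" for i, old in enumerate(names, start=1)}
    for old, new in mapping.items():
        src = os.path.join(folder, old)
        dst = os.path.join(folder, new)
        if old == new:
            continue
        # Refuse to overwrite a page that already has a numeric name.
        if os.path.exists(dst):
            raise FileExistsError(dst)
        os.rename(src, dst)
    return mapping
```

Numeric ASCII names avoid the Unicode file name limits of ISO9660 mentioned above, at the cost of needing the mapping to keep links working.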
For any assistance, contact the author at santhosh dot thottingal at gmail dot com. If you find any bugs or have feature suggestions or feedback, please write to the same address.
- Hiran Venugopal for the artworks
- Shiju Alex for testing and feature suggestions. Read his guidelines on creating a CD version of Wikipedia: Creating Wikipedia CD
- Malayalam Wikipedia
The program is licensed under GPLv3+