Added test files for UTF-8 / umlaut testing:
These 3 files contain the same text in different HTML encodings. We use these documents to test whether the parser and indexer create the same set of word hashes for all three texts.
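
The intended check can be sketched outside of YaCy as follows. This is only an illustration under stated assumptions: `word_hashes` is a hypothetical helper, the tokenizer is a plain regular expression, and MD5 stands in for YaCy's real word hash, which this commit does not show.

```python
import hashlib
import html
import re

# Body text of the three test files, as it looks once the bytes are decoded.
NAMED = ("In M&uuml;nchen steht ein Hofbr&auml;uhaus. "
         "Dort gibt es Bier aus Ma&szlig;kr&uuml;gen.")
RAW = "In München steht ein Hofbräuhaus. Dort gibt es Bier aus Maßkrügen."
NUMERIC = ("In M&#252;nchen steht ein Hofbr&#228;uhaus. "
           "Dort gibt es Bier aus Ma&#223;kr&#252;gen.")


def word_hashes(text):
    """Return the set of hashes for all words in an HTML text fragment.

    Assumption: MD5 over the lowercased UTF-8 word is a stand-in hash,
    not the hash function YaCy uses.
    """
    plain = html.unescape(text)                 # resolve &uuml; and &#252; alike
    words = re.findall(r"\w+", plain.lower())   # \w matches ü, ä, ß in Python 3
    return {hashlib.md5(w.encode("utf-8")).hexdigest() for w in words}


# All three encodings should yield the same set of word hashes.
same = word_hashes(NAMED) == word_hashes(RAW) == word_hashes(NUMERIC)
print("identical word hashes:", same)
```

If a parser handled one of the encodings incorrectly, the corresponding set would differ, and a search for a word with an umlaut would miss that document.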

To use these files, run an indexing/crawling job on them. To make the files available under the localhost path, do the following:

cd <yacy-home>
rmdir DATA/HTDOCS/repository
# use an absolute target; a relative one would be resolved from DATA/HTDOCS
ln -s "$PWD/test/parsertest" DATA/HTDOCS/repository

You have now linked the test directory as the repository directory, which you can reach in YaCy if you switch to intranet indexing mode. The next step is to start YaCy, then:
- switch to the intranet use case
- go to the crawl start page
- check that the repository directory is the default crawl start path
- start the crawl
- search for any word that appears in the demo texts
- search not only for words with umlauts but also for words without umlauts, to ensure that you find _all_ three documents
- see how YaCy presents the snippet for text containing umlauts

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5293 6c8d7289-2bf4-0310-a012-ef5d649a1542
orbiter committed Oct 22, 2008
1 parent 2e53cbc commit 204220e
Showing 3 changed files with 30 additions and 0 deletions.
10 changes: 10 additions & 0 deletions test/parsertest/umlaute_html.html
@@ -0,0 +1,10 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body>
In M&uuml;nchen steht ein Hofbr&auml;uhaus.
Dort gibt es Bier aus Ma&szlig;kr&uuml;gen.<br>
</body>
</html>
10 changes: 10 additions & 0 deletions test/parsertest/umlaute_iso.html
@@ -0,0 +1,10 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body>
In München steht ein Hofbräuhaus.
Dort gibt es Bier aus Maßkrügen.<br>
</body>
</html>
10 changes: 10 additions & 0 deletions test/parsertest/umlaute_utf8.html
@@ -0,0 +1,10 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body>
In M&#252;nchen steht ein Hofbr&#228;uhaus.
Dort gibt es Bier aus Ma&#223;kr&#252;gen.<br>
</body>
</html>
