Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing problem with Polish characters #2

Closed
safesploit opened this issue Nov 2, 2022 · 8 comments
Closed

Indexing problem with Polish characters #2

safesploit opened this issue Nov 2, 2022 · 8 comments

Comments

@safesploit
Copy link
Member

Email body below.

Hi, I use your doogle and I have a problem with Polish characters, i.e. partially indexed pages have normally Polish characters, but some pages do not exist, do you know how to solve it? Thank you and best regards Thank you

image

@safesploit
Copy link
Member Author

safesploit commented Nov 2, 2022

I am not currently able to test indexing https://pogoda.wp.pl/ at this moment, but I have included a StackOverflow post Store Polish characters in MySQL which suggests possibly an SQL Unicode issue.

Polish characters not encoding properly likely occurred because I used DEFAULT CHARSET=latin1. See doogle-tables-no-data.sql. When ideally I should have used UTF-8.

Please let me know.

@ArcyXander
Copy link

Hello, I did as you said and it is still the same ;/
doogleUTF

@safesploit
Copy link
Member Author

Modified SQL code you provided by email

-- phpMyAdmin SQL Dump - No Data
-- version 5.1.1
-- https://www.phpmyadmin.net/
--
-- Host: 192.168.5.240
-- Generation Time: Apr 24, 2022 at 09:25 AM
-- Server version: 8.0.28-0ubuntu0.20.04.3
-- PHP Version: 7.4.24

SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET AUTOCOMMIT = 0;
START TRANSACTION;
SET time_zone = "+00:00";


/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;


USE `doogle`;

-- --------------------------------------------------------

--
-- Table structure for table `images`
--

CREATE TABLE `images` (
`id` int(11) NOT NULL,
`siteUrl` varchar(512) NOT NULL,
`imageUrl` varchar(512) NOT NULL,
`alt` varchar(512) NOT NULL,
`title` varchar(512) NOT NULL,
`clicks` int(11) NOT NULL DEFAULT '0',
`broken` tinyint(4) NOT NULL DEFAULT '0'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- --------------------------------------------------------

--
-- Table structure for table `sites`
--

CREATE TABLE `sites` (
`id` int(11) NOT NULL,
`url` varchar(512) NOT NULL,
`title` varchar(512) NOT NULL,
`description` varchar(512) NOT NULL,
`keywords` varchar(512) NOT NULL,
`clicks` int(11) NOT NULL DEFAULT '0'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

--
-- Indexes for dumped tables
--

--
-- Indexes for table `images`
--
ALTER TABLE `images`
ADD PRIMARY KEY (`id`);

--
-- Indexes for table `sites`
--
ALTER TABLE `sites`
ADD PRIMARY KEY (`id`);

--
-- AUTO_INCREMENT for dumped tables
--

--
-- AUTO_INCREMENT for table `images`
--
ALTER TABLE `images`
MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=13003;

--
-- AUTO_INCREMENT for table `sites`
--
ALTER TABLE `sites`
MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=5297;
COMMIT;

/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;

I do agree there is an issue, as I have used the above and index using crawl-formSubmit.php and am still having issues with the encoding of Polish characters.

MySQL reference manual supports the usage of utf8mb4. So, possibly an issue relating to PHP submitting the data to SQL. I haven't tested this theory, but I am using prepared statements in PHP to insert data into the database. But, I also disagree with the point I just made, as the prepare() function (PDO::prepare) is meant to treat the variables as a data object (no sanitisation); unless I have misunderstood the usage of this function.

crawl-formSubmit.php

function insertLink($url, $title, $description, $keywords)
{
	global $con;

	$query = $con->prepare("INSERT INTO sites(url, title, description, keywords)
							VALUES(:url, :title, :description, :keywords)");

	$query->bindParam(":url", $url);
	$query->bindParam(":title", $title);
	$query->bindParam(":description", $description);
	$query->bindParam(":keywords", $keywords);

	return $query->execute();
}

@safesploit
Copy link
Member Author

I may have a possible solution. mysqli uses the default charset=latin1, as discussed Polish characters in MySQL response. But I suspect PDO may also do the same with its default charset.

In this case, config.php file needs to be modified when initiating a new PDO. PDO::_construct comment 113498 discussed how to change the charset to UTF-8.

@ArcyXander
Copy link

I'm keeping my fingers crossed for a successful resolution of this problem. Warm regards

@safesploit
Copy link
Member Author

safesploit commented Nov 7, 2022

I may have a possible solution. mysqli uses the default charset=latin1, as discussed Polish characters in MySQL response. But I suspect PDO may also do the same with its default charset.

In this case, config.php file needs to be modified when initiating a new PDO. PDO::_construct comment 113498 discussed how to change the charset to UTF-8.

I have tested with the tables configured as utf8 (which MySQL calls 'utf8_general_ci') and PDO using charset=utf8 with no success. I will need to revisit this issue.

$con = new PDO("mysql:dbname=$dbname;host=$dbhost;charset=utf8", "$dbuser", "$dbpass");

Erroneous data inserted into the database via PDO.

reference error image

@safesploit
Copy link
Member Author

safesploit commented Nov 9, 2022

I have discovered a possible fix (more work is needed) for PHP submitting data into the database.
Within crawl-formSubmit.php I am using json_encode() function for the title, description and keywords.

$title = json_encode($titleArray->item(0)->nodeValue); title
$description = json_encode($meta->getAttribute("content")); description
$description = json_encode($meta->getAttribute("content")); keywords

Now, this only fixes the values inserted into the database. These values still need to be decoded using json_decode() function in SiteResultsProvider.php. However, I am having issues with UTF-8 characters not rendering properly in the browser now.

$title = json_decode($row["title"]);
$description = json_decode($row["description"]);

I am including an article relating to UTF-8 charset for PHP as I suspect the choice of using JSON to encode and decode is the wrong choice for UTF-8. But this is still a work in progress.

JSON encoded strings in database

JSON encoded strings in database

JSON decoded strings on frontend

JSON decoded strings on frontend

@safesploit
Copy link
Member Author

Non-ASCII Character Fix

image

/classes/DomDocumentParser.php must be modified to reflect the UTF-8 changes

	public function __construct($url) 
	{
		...

		$this->doc = new DomDocument('1.0', 'utf-8');
		@$this->doc->loadHTML('<?xml encoding="UTF-8">' . file_get_contents($url, false, $context));

		...
	}

Currently, the database tables sites and images must be using CHARSET=utf8mb4 (as shown above) and /classes/DomDocumentParser.php must be modified to support UTF-8.

Bug explained

Inside $metasArray = $parser->getMetatags() the function getMetatags() was fetching values using a non-UTF-8 character set.

However, I am not necessarily happy using CHARSET=utf8mb4 as this datatype requires 4 bytes. As a result, I will not release a new version of Doogle (supporting UTF-8) until I have decided whether to use encoding/decoding for database values or the fix described above (changing database charset and DOM).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants