
ScraperUrl: detect and use charset even if web server doesn't report it

commit 872de5f2fb21d155a0c1a51d7aadf8a70133cbe4 1 parent 99c8594
Karlson2k authored
Showing with 52 additions and 12 deletions.
  1. +52 −12 xbmc/utils/ScraperUrl.cpp
xbmc/utils/ScraperUrl.cpp
@@ -23,10 +23,14 @@
#include "settings/AdvancedSettings.h"
#include "HTMLUtil.h"
#include "CharsetConverter.h"
+#include "utils/CharsetDetection.h"
+#include "utils/StringUtils.h"
#include "URL.h"
#include "filesystem/CurlFile.h"
#include "filesystem/ZipFile.h"
#include "URIUtils.h"
+#include "utils/XBMCTinyXML.h"
+#include "utils/FileUtils.h"
#include <cstring>
#include <sstream>
@@ -233,27 +237,63 @@ bool CScraperUrl::Get(const SUrlEntry& scrURL, std::string& strHTML, XFILE::CCur
return false;
strHTML = strHTML1;
- std::string fileCharset(http.GetServerReportedCharset());
- if (scrURL.m_url.find(".zip") != std::string::npos)
+ std::string mimeType(http.GetMimeType());
+ CFileUtils::EFileType ftype = CFileUtils::GetFileTypeFromMime(mimeType);
+
+ if (ftype == CFileUtils::FileTypeZip || ftype == CFileUtils::FileTypeGZip)
Voyager1 (Collaborator) added a note

The fact that you removed the check for the .zip extension now causes problems with the TheTVDB scraper: the downloaded zips do not always come with the correct ZIP mimetype, leading to log lines like this

CScraperUrl::Get: Assuming "UTF-8" charset for content of "http://thetvdb.com/api/1D62F2F90030C444/series/77526/all/en.zip"

and to no TV show scraping at all. Could you please fix this? Thanks!

DjSlash (Rutger van Sleen) added a note

Correct me if I'm wrong, but shouldn't the line that replaces the old one correctly check whether the file is a .zip?

But I also suspect that something among these changed lines is what currently breaks the tvdb scraper.

Actually, I've replaced the new code with the old code, and that makes the tvdb scraper at least scrape the episodes and fill the database with information. However, it doesn't create the data as it should, though I'm not sure whether that is related to this. Anyway, there is most definitely something in these changes that doesn't work as you thought.

For completeness, the side effect: it scrapes the episodes fine, but all found episodes are put into one "series". In my case that means 7000 episodes end up there, separated into several seasons.

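A minimal sketch of one possible fallback for the problem reported above (an assumption on my part, not code from this commit): keep the mimetype-based detection, but also accept the payload as an archive when it starts with the ZIP or GZip magic bytes, or, as a last resort, when the URL contains ".zip" as the removed check did. The enum below is a standalone stand-in for CFileUtils::EFileType from the diff, and GuessArchiveType is a hypothetical helper name.

#include <string>

// Stand-in for the CFileUtils::EFileType values used in this diff.
enum EFileType { FileTypeNone, FileTypeZip, FileTypeGZip };

// Hypothetical helper: prefer the mimetype, then magic bytes, then the
// old URL-extension check that this commit removed.
static EFileType GuessArchiveType(const std::string& url,
                                  const std::string& data,
                                  EFileType typeFromMime)
{
  if (typeFromMime == FileTypeZip || typeFromMime == FileTypeGZip)
    return typeFromMime; // a correct mimetype is still the best signal
  if (data.size() >= 4 && data.compare(0, 4, "PK\x03\x04") == 0)
    return FileTypeZip;  // ZIP local file header magic bytes
  if (data.size() >= 2 &&
      (unsigned char)data[0] == 0x1F && (unsigned char)data[1] == 0x8B)
    return FileTypeGZip; // GZip magic bytes
  if (url.find(".zip") != std::string::npos)
    return FileTypeZip;  // last resort: the pre-commit extension check
  return typeFromMime;
}

Sniffing the magic bytes before trusting the mimetype alone would make the scraper robust against servers, such as thetvdb.com here, that report a generic or wrong Content-Type for zip downloads.
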
{
XFILE::CZipFile file;
- CStdString strBuffer;
- int iSize = file.UnpackFromMemory(strBuffer,strHTML,scrURL.m_isgz);
- if (iSize)
+ std::string strBuffer;
+ int iSize = file.UnpackFromMemory(strBuffer,strHTML,scrURL.m_isgz); // FIXME: use FileTypeGZip instead of scrURL.m_isgz?
+ if (iSize > 0)
+ strHTML = strBuffer;
+ }
+
+ std::string reportedCharset(http.GetServerReportedCharset());
+ if (ftype == CFileUtils::FileTypeHtml)
+ {
+ std::string realHtmlCharset, converted;
+ if (!CCharsetDetection::ConvertHtmlToUtf8(strHTML, converted, reportedCharset, realHtmlCharset))
+ CLog::Log(LOGWARNING, "%s: Can't find precise charset for \"%s\", using \"%s\" as fallback", __FUNCTION__, scrURL.m_url.c_str(), realHtmlCharset.c_str());
+ else
+ CLog::Log(LOGDEBUG, "%s: Using \"%s\" charset for \"%s\"", __FUNCTION__, realHtmlCharset.c_str(), scrURL.m_url.c_str());
+
+ strHTML = converted;
+ }
+ else if (ftype == CFileUtils::FileTypeXml)
+ {
+ CXBMCTinyXML xmlDoc;
+ xmlDoc.Parse(strHTML, reportedCharset);
+
+ std::string realXmlCharset(xmlDoc.GetUsedCharset());
+ if (!realXmlCharset.empty())
{
- fileCharset.clear();
- strHTML.clear();
- strHTML.append(strBuffer.c_str(),strBuffer.data()+iSize);
+ CLog::Log(LOGDEBUG, "%s: Using \"%s\" charset for \"%s\"", __FUNCTION__, realXmlCharset.c_str(), scrURL.m_url.c_str());
+ std::string converted;
+ g_charsetConverter.ToUtf8(realXmlCharset, strHTML, converted);
+ strHTML = converted;
}
}
-
- if (!fileCharset.empty() && fileCharset != "UTF-8")
+ else if (ftype == CFileUtils::FileTypePlainText || StringUtils::CompareNoCase(mimeType.substr(0, 5), "text/") == 0)
+ {
+ std::string realTextCharset, converted;
+ CCharsetDetection::ConvertPlainTextToUtf8(strHTML, converted, reportedCharset, realTextCharset);
+ strHTML = converted;
+ if (reportedCharset != realTextCharset)
+ CLog::Log(LOGWARNING, "%s: Using \"%s\" charset for \"%s\" instead of server reported \"%s\" charset", __FUNCTION__, realTextCharset.c_str(), scrURL.m_url.c_str(), reportedCharset.c_str());
+ else
+ CLog::Log(LOGDEBUG, "%s: Using \"%s\" charset for \"%s\"", __FUNCTION__, realTextCharset.c_str(), scrURL.m_url.c_str());
+ }
+ else if (!reportedCharset.empty() && reportedCharset != "UTF-8")
{
+ CLog::Log(LOGDEBUG, "%s: Using \"%s\" charset for \"%s\"", __FUNCTION__, reportedCharset.c_str(), scrURL.m_url.c_str());
std::string converted;
- if (g_charsetConverter.ToUtf8(fileCharset, strHTML, converted) && !converted.empty())
- strHTML = converted;
+ g_charsetConverter.ToUtf8(reportedCharset, strHTML, converted);
+ strHTML = converted;
}
+ else
+ CLog::Log(LOGDEBUG, "%s: Assuming \"UTF-8\" charset for content of \"%s\"", __FUNCTION__, scrURL.m_url.c_str());
if (!scrURL.m_cache.empty())
{
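For readers following the new control flow: the commit dispatches on the detected file type and converges everything to UTF-8, preferring evidence found in the content itself over the server-reported charset. Below is a self-contained sketch of that detection order; DetectCharset and its heuristics are illustrative assumptions, not the actual CCharsetDetection API.

#include <string>

// Illustrative sketch of the precedence used in this commit:
// BOM -> in-document declaration -> server-reported charset -> UTF-8.
static std::string DetectCharset(const std::string& content,
                                 const std::string& serverReported)
{
  // 1. Byte-order marks are unambiguous, so they win.
  if (content.size() >= 3 && content.compare(0, 3, "\xEF\xBB\xBF") == 0)
    return "UTF-8";
  if (content.size() >= 2 && content.compare(0, 2, "\xFE\xFF") == 0)
    return "UTF-16BE";
  if (content.size() >= 2 && content.compare(0, 2, "\xFF\xFE") == 0)
    return "UTF-16LE";

  // 2. An explicit declaration inside the document, e.g.
  //    <?xml version="1.0" encoding="ISO-8859-1"?>
  //    or <meta charset="windows-1251">
  static const std::string keys[] = { "encoding=\"", "charset=\"" };
  for (const std::string& key : keys)
  {
    std::string::size_type pos = content.find(key);
    if (pos != std::string::npos)
    {
      std::string::size_type end = content.find('"', pos + key.size());
      if (end != std::string::npos)
        return content.substr(pos + key.size(), end - pos - key.size());
    }
  }

  // 3. Fall back to the server-reported charset, then to UTF-8.
  return serverReported.empty() ? "UTF-8" : serverReported;
}

For HTML and XML the committed code goes further (CCharsetDetection::ConvertHtmlToUtf8 and CXBMCTinyXML::GetUsedCharset in the diff above), but the precedence is the same: document evidence first, server header second, UTF-8 as the final assumption.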