ScraperUrl: detect and use charset even if web server don't report it

Karlson2k committed Dec 10, 2013
1 parent 99c8594 commit 872de5f2fb21d155a0c1a51d7aadf8a70133cbe4
Showing with 52 additions and 12 deletions.
  1. +52 −12 xbmc/utils/ScraperUrl.cpp
@@ -23,10 +23,14 @@
#include "settings/AdvancedSettings.h"
#include "HTMLUtil.h"
#include "CharsetConverter.h"
+#include "utils/CharsetDetection.h"
+#include "utils/StringUtils.h"
#include "URL.h"
#include "filesystem/CurlFile.h"
#include "filesystem/ZipFile.h"
#include "URIUtils.h"
+#include "utils/XBMCTinyXML.h"
+#include "utils/FileUtils.h"
#include <cstring>
#include <sstream>
@@ -233,27 +237,63 @@ bool CScraperUrl::Get(const SUrlEntry& scrURL, std::string& strHTML, XFILE::CCur
return false;
strHTML = strHTML1;
- std::string fileCharset(http.GetServerReportedCharset());
- if (scrURL.m_url.find(".zip") != std::string::npos)
+ std::string mimeType(http.GetMimeType());
+ CFileUtils::EFileType ftype = CFileUtils::GetFileTypeFromMime(mimeType);
+
+ if (ftype == CFileUtils::FileTypeZip || ftype == CFileUtils::FileTypeGZip)

@Voyager1

Voyager1 Jan 7, 2014

Member

Removing the check for the .zip extension now causes problems with the TheTVDB scraper: the downloaded zips do not always come with the correct ZIP MIME type, which leads to log lines like this:

CScraperUrl::Get: Assuming "UTF-8" charset for content of "http://thetvdb.com/api/1D62F2F90030C444/series/77526/all/en.zip"

As a result, no TV shows are scraped. Could you please fix this? Thanks!
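
One possible way to make the archive check robust against a wrong MIME type is sketched below. It is only an illustration under stated assumptions, not the fix that was applied: the helper names `HasZipSignature`, `UrlHasZipExtension` and `ShouldUnpackAsZip` are hypothetical and do not exist in XBMC, and the `mimeSaysZip` flag stands in for the result of the `CFileUtils::GetFileTypeFromMime` comparison in the diff. The idea is to fall back to the ZIP local file header signature and then to the URL extension when the server reports an unhelpful type.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of XBMC): true if the buffer starts with the
// ZIP local file header signature "PK\x03\x04".
bool HasZipSignature(const std::string& data)
{
  return data.size() >= 4 && data.compare(0, 4, "PK\x03\x04") == 0;
}

// Hypothetical helper: true if the URL path (query and fragment stripped)
// ends with ".zip", compared case-insensitively.
bool UrlHasZipExtension(const std::string& url)
{
  std::string path = url.substr(0, url.find_first_of("?#"));
  std::transform(path.begin(), path.end(), path.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  const std::string ext = ".zip";
  return path.size() >= ext.size() &&
         path.compare(path.size() - ext.size(), ext.size(), ext) == 0;
}

// Hypothetical decision function: trust the MIME type first, then the
// content signature, and only then the URL extension.
bool ShouldUnpackAsZip(bool mimeSaysZip, const std::string& url, const std::string& body)
{
  return mimeSaysZip || HasZipSignature(body) || UrlHasZipExtension(url);
}
```

With a check like this, the TheTVDB URL from the log line above would still be treated as an archive even when the reported MIME type is wrong, because its path ends in `.zip`.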

@DjSlash

DjSlash Jan 7, 2014

Correct me if I'm wrong, but shouldn't the line that replaces the old one correctly check whether the file is a .zip?

I also suspect that something in these changes is what currently breaks the TVDB scraper.

In fact, I replaced the new code with the old code, and with that the TVDB scraper at least scrapes the episodes and fills the database with information. However, it doesn't create the data as it should, and I'm not sure whether that is related to this. In any case, something in these changes most definitely doesn't work as you intended.

For completeness, the side effect: it scrapes the episodes fine, but all episodes found are put into a single series. In my case that means 7000 episodes end up there, separated into several seasons.

{
XFILE::CZipFile file;
- CStdString strBuffer;
- int iSize = file.UnpackFromMemory(strBuffer,strHTML,scrURL.m_isgz);
- if (iSize)
+ std::string strBuffer;
+ int iSize = file.UnpackFromMemory(strBuffer,strHTML,scrURL.m_isgz); // FIXME: use FileTypeGZip instead of scrURL.m_isgz?
+ if (iSize > 0)
+ strHTML = strBuffer;
+ }
+
+ std::string reportedCharset(http.GetServerReportedCharset());
+ if (ftype == CFileUtils::FileTypeHtml)
+ {
+ std::string realHtmlCharset, converted;
+ if (!CCharsetDetection::ConvertHtmlToUtf8(strHTML, converted, reportedCharset, realHtmlCharset))
+ CLog::Log(LOGWARNING, "%s: Can't find precise charset for \"%s\", using \"%s\" as fallback", __FUNCTION__, scrURL.m_url.c_str(), realHtmlCharset.c_str());
+ else
+ CLog::Log(LOGDEBUG, "%s: Using \"%s\" charset for \"%s\"", __FUNCTION__, realHtmlCharset.c_str(), scrURL.m_url.c_str());
+
+ strHTML = converted;
+ }
+ else if (ftype == CFileUtils::FileTypeXml)
+ {
+ CXBMCTinyXML xmlDoc;
+ xmlDoc.Parse(strHTML, reportedCharset);
+
+ std::string realXmlCharset(xmlDoc.GetUsedCharset());
+ if (!realXmlCharset.empty())
{
- fileCharset.clear();
- strHTML.clear();
- strHTML.append(strBuffer.c_str(),strBuffer.data()+iSize);
+ CLog::Log(LOGDEBUG, "%s: Using \"%s\" charset for \"%s\"", __FUNCTION__, realXmlCharset.c_str(), scrURL.m_url.c_str());
+ std::string converted;
+ g_charsetConverter.ToUtf8(realXmlCharset, strHTML, converted);
+ strHTML = converted;
}
}
-
- if (!fileCharset.empty() && fileCharset != "UTF-8")
+ else if (ftype == CFileUtils::FileTypePlainText || StringUtils::CompareNoCase(mimeType.substr(0, 5), "text/") == 0)
+ {
+ std::string realTextCharset, converted;
+ CCharsetDetection::ConvertPlainTextToUtf8(strHTML, converted, reportedCharset, realTextCharset);
+ strHTML = converted;
+ if (reportedCharset != realTextCharset)
+ CLog::Log(LOGWARNING, "%s: Using \"%s\" charset for \"%s\" instead of server reported \"%s\" charset", __FUNCTION__, realTextCharset.c_str(), scrURL.m_url.c_str(), reportedCharset.c_str());
+ else
+ CLog::Log(LOGDEBUG, "%s: Using \"%s\" charset for \"%s\"", __FUNCTION__, realTextCharset.c_str(), scrURL.m_url.c_str());
+ }
+ else if (!reportedCharset.empty() && reportedCharset != "UTF-8")
{
+ CLog::Log(LOGDEBUG, "%s: Using \"%s\" charset for \"%s\"", __FUNCTION__, reportedCharset.c_str(), scrURL.m_url.c_str());
std::string converted;
- if (g_charsetConverter.ToUtf8(fileCharset, strHTML, converted) && !converted.empty())
- strHTML = converted;
+ g_charsetConverter.ToUtf8(reportedCharset, strHTML, converted);
+ strHTML = converted;
}
+ else
+ CLog::Log(LOGDEBUG, "%s: Assuming \"UTF-8\" charset for content of \"%s\"", __FUNCTION__, scrURL.m_url.c_str());
if (!scrURL.m_cache.empty())
{
