Blacklist not working #66

Closed
Kristiansky opened this issue Jan 19, 2018 · 9 comments

Comments

@Kristiansky

Kristiansky commented Jan 19, 2018

I have added a few pages from my site to the blacklist array, but they still appear in the sitemap every time.

@Kristiansky
Author

Kristiansky commented Jan 19, 2018

$blacklist = array(
	"/de/",
	"/de/*",
	"/private/",
	"/private/*",
	"*.jpg",
	"*.png",
);
This is my blacklist array. When I open the XML file, the /de pages are still listed:
<url>
  <loc>https://www.mywebsite.com/de</loc>
  <changefreq>daily</changefreq>
  <priority>1</priority>
</url>
<url>
  <loc>https://www.mywebsite.com/de/sonnenschirme</loc>
  <changefreq>daily</changefreq>
  <priority>1</priority>
</url>

@vezaynk
Owner

vezaynk commented Jan 19, 2018

$blacklist needs absolute URLs. For /de/, you would want either https://website.com/de/ or the wildcard pattern */de/.
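
To illustrate why a bare /de/ entry has no effect, here is a minimal sketch. It assumes the blacklist is compared against the full URL using shell-style wildcards, as PHP's built-in `fnmatch()` does; the crawler's actual matching code is not shown in this thread, and `matches_blacklist()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: the crawler's real matching code is not shown in this thread.
// Assumption: each blacklist entry is a shell-style pattern tested against the full URL.
function matches_blacklist(string $url, array $blacklist): bool
{
    foreach ($blacklist as $pattern) {
        // fnmatch() must match the whole string; with no flags, "*" also spans "/".
        if (fnmatch($pattern, $url)) {
            return true;
        }
    }
    return false;
}

$url = "https://website.com/de/";

var_dump(matches_blacklist($url, ["/de/"]));                    // bool(false): not an absolute URL
var_dump(matches_blacklist($url, ["*/de/"]));                   // bool(true): wildcard covers the scheme and host
var_dump(matches_blacklist($url, ["https://website.com/de/"])); // bool(true): exact absolute URL

// A URL without the trailing slash is a different string under this kind of matching.
var_dump(matches_blacklist("https://website.com/de", ["*/de/"])); // bool(false)
```

Under these assumptions, a pattern only takes effect when it can match the entire absolute URL, which is why either the full address or a leading `*` is needed.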

@Kristiansky
Author

I have changed the array as you told me:
$blacklist = array( "https://www.mywebsite.com/de/", "https://www.mywebsite.com/de/*", "https://www.mywebsite.com/private/", "https://www.mywebsite.com/private/*", "*.jpg", "*.png", );
But the /de links still appear in the sitemap. 😞

@vezaynk
Owner

vezaynk commented Jan 19, 2018

Post the full config, I'll take a look.

@Kristiansky
Author

Kristiansky commented Jan 19, 2018

<?php
/*
Sitemap Generator by Slava Knyazev. Further acknowledgements in the README.md file.

Website: https://www.knyz.org/
I also live on GitHub: https://github.com/knyzorg
Contact me: Slava@KNYZ.org
*/

//Make sure to use the latest revision by downloading from github: https://github.com/knyzorg/Sitemap-Generator-Crawler

/* Usage
Usage is pretty straightforward:
- Configure the crawler by editing this file.
- Select the file to which the sitemap will be saved
- Select URL to crawl
- Configure blacklists, accepts the use of wildcards (example: http://example.com/private/* and *.jpg)
- Generate sitemap
- Either send a GET request to this script or run it from the command line (refer to README file)
- Submit to Google
- Set up a cron job to execute this script every so often

It is recommended you don't remove the above for future reference.
*/

// Default site to crawl
$site = "https://www.may-online.com/en";

// Default sitemap filename
$file = "../sitemap-generated.xml";
$permissions = 0644;

// Depth of the crawl, 0 is unlimited
$max_depth = 0;

// Show changefreq
$enable_frequency = true;

// Show priority
$enable_priority = true;

// Default values for changefreq and priority
$freq = "daily";
$priority = "1";

// Add lastmod based on server response. Unreliable and disabled by default.
$enable_modified = false;

// Disable this for misconfigured, but tolerable SSL server.
$curl_validate_certificate = true;

// The pages will be excluded from crawl and sitemap.
// Use for excluding non-html files to increase performance and save bandwidth.
$blacklist = array(
	"https://www.may-online.com/de/",
	"https://www.may-online.com/de/*",
	"https://www.may-online.com/private/",
	"https://www.may-online.com/private/*",
	"*.jpg",
	"*.png",
);

// Enable this if your site do requires GET arguments to function
$ignore_arguments = false;

// Not yet implemented. See issue #19 for more information.
$index_img = false;

//Index PDFs
$index_pdf = true;

// Set the user agent for crawler
$crawler_user_agent = "Mozilla/5.0 (compatible; Sitemap Generator Crawler; +https://github.com/knyzorg/Sitemap-Generator-Crawler)";

// Header of the sitemap.xml
$xmlheader ='<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">';

// Optionally configure debug options
$debug = array(
	"add" => true,
	"reject" => true,
	"warn" => true
);


//Modify only if configuration version is broken
$version_config = 2;

@Kristiansky
Author

Kristiansky commented Jan 19, 2018

Here's what's generated

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>https://www.may-online.com/de</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/impressum</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/agb</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/datenschutz</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme/restaurant-cafe</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme/ampelschirme</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme/ampelschirme/mezzo</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/unternehmen/referenzen</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
</urlset>

@vezaynk
Owner

vezaynk commented Jan 20, 2018

Found the issue. The blacklist seems to be ignored when a URL is reached through a redirect. Oh, the pleasures of parsing the web!

By the way, to format code blocks, use three backticks. The initial $site is trusted and is never checked against the blacklist.

For some reason, the crawler refuses to go to the /en site. The reformatter chokes on it, probably because of the redirection.

FYI, redirecting from the root is bad practice.
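
A rough sketch of the behaviour described here, not the project's actual code: the start URL is exempt from the blacklist, while any other URL is checked both as it was requested and as it ended up after redirects. The function names, the example.com addresses, and the use of `fnmatch()` are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch; none of these names come from the crawler itself.
function url_is_blacklisted(string $url, array $blacklist): bool
{
    foreach ($blacklist as $pattern) {
        if (fnmatch($pattern, $url)) { // assumption: shell-style wildcard matching
            return true;
        }
    }
    return false;
}

// $requested is the URL as found in a page; $effective is where the request
// ended up after redirects (with cURL this could be read via
// curl_getinfo($ch, CURLINFO_EFFECTIVE_URL)).
function should_index(string $requested, string $effective, string $seed, array $blacklist): bool
{
    // The start URL is trusted and skips the blacklist entirely.
    if ($requested === $seed) {
        return true;
    }
    // Checking only $requested would let a redirect slip past the blacklist.
    foreach ([$requested, $effective] as $candidate) {
        if (url_is_blacklisted($candidate, $blacklist)) {
            return false;
        }
    }
    return true;
}

$seed = "https://www.example.com/en";
$blacklist = ["https://www.example.com/de", "https://www.example.com/de/*"];

// The root is not blacklisted itself, but it redirects into a blacklisted path.
var_dump(should_index("https://www.example.com/", "https://www.example.com/de", $seed, $blacklist)); // bool(false)
var_dump(should_index($seed, $seed, $seed, $blacklist));                                             // bool(true)
```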

@vezaynk
Owner

vezaynk commented Jan 20, 2018

I was wrong. It is not related to the redirect. Your link looks like this: <a href=" https://www.may-online.com/en">en</a>. That is not okay. The space before the https:// makes it invalid. Web browsers are smart enough to remove it; my script is not.
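
As a small illustration of the whitespace problem, the sketch below shows a strict prefix check rejecting the href as written and accepting it once it has been trimmed. Both helper names are hypothetical, and this is not necessarily how the script validates links.

```php
<?php
// Hypothetical helpers for illustration; not taken from the crawler.
function is_absolute_http_url(string $href): bool
{
    // Strict check: the value must begin with the scheme, with nothing before it.
    return preg_match('#^https?://#i', $href) === 1;
}

function clean_href(string $href): string
{
    // Strip leading and trailing whitespace, as browsers do before resolving a link.
    return trim($href);
}

$raw = " https://www.may-online.com/en"; // note the leading space

var_dump(is_absolute_http_url($raw));                           // bool(false)
var_dump(is_absolute_http_url(clean_href($raw)));               // bool(true)
var_dump(clean_href($raw) === "https://www.may-online.com/en"); // bool(true)
```

Trimming before any validation or blacklist comparison keeps a stray space in the markup from changing how the URL is treated.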

@vezaynk
Owner

vezaynk commented Jan 20, 2018

I was supposed to close this issue via commit.
The redirection bug was fixed in b894362.

vezaynk closed this as completed Jan 20, 2018