# **LOGIC EXPLANATION**
To categorize the websites, we will follow this logic:


1.   First, we scrape the content of each website's main page with Cloudscraper and parse it with BeautifulSoup.
2.   Second, we translate this content into English with the Googletrans module, because the Google NLP API classification used in this notebook only accepts English text.
3.   Third, we send the English text to the Google NLP API to get each website's category and a confidence score.

**Note**: if the websites are already in English, we can skip the translation step and pass the scraped content directly to the NLP API.
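
The three steps above can be sketched as one small function. This is only an illustration of the control flow, not the notebook's actual code: `categorize` and the `fake_*` helpers are hypothetical names, and the three callables stand in for Cloudscraper/BeautifulSoup, Googletrans and the NLP API.

```python
from typing import Callable, Tuple


def categorize(
    url: str,
    scrape: Callable[[str], str],
    translate: Callable[[str], str],
    classify: Callable[[str], Tuple[str, float]],
    is_english: bool = False,
) -> Tuple[str, float]:
    """Run the scrape -> translate -> classify steps for one URL."""
    content = scrape(url)                 # step 1: text of the main page
    if not is_english:
        content = translate(content)      # step 2: skipped for English sites
    return classify(content)              # step 3: (category, confidence)


# Stand-in callables so the sketch runs without network access (all hypothetical).
def fake_scrape(url: str) -> str:
    return "noticias de " + url


def fake_translate(text: str) -> str:
    return text.replace("noticias de", "news from")


def fake_classify(text: str) -> Tuple[str, float]:
    return ("/News", 0.9) if "news" in text else ("/Unknown", 0.0)


result = categorize("https://example.com", fake_scrape, fake_translate, fake_classify)
print(result)
```

Passing the steps in as callables keeps the order of operations visible and makes the translation step easy to skip with `is_english=True`.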


# **REQUIREMENTS**
To use this notebook, we first need to install the cloudscraper and googletrans modules.

1.   **[Cloudscraper](https://pypi.org/project/cloudscraper/)**: this library is similar to Requests. It lets us send HTTP requests and scrape the content of the URLs, and it works better than Requests on websites protected by Cloudflare.
2.   **[Googletrans](https://pypi.org/project/googletrans/)**: a free Python library that uses the Google Translate AJAX API. It has no official quota, although heavy use can get you blocked. Note that this is not the official API, so if you need a more stable solution, it is better to use the [official one](https://cloud.google.com/translate/docs), which is not free.

You also need to register with Google Cloud, create a project, enable the Natural Language service and create a JSON credentials file to use this notebook. More information: [https://cloud.google.com/natural-language/docs/setup](https://cloud.google.com/natural-language/docs/setup)
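
Before calling the API, it can help to check that the credentials file is where you expect it to be. The following is a minimal sketch, assuming the file is a service-account key; `use_credentials` is a hypothetical helper, and the demo file written below is a throwaway stand-in for your real key.

```python
import json
import os
import tempfile


def use_credentials(path: str) -> str:
    """Check a service-account key file and point the Google client libraries at it."""
    if not os.path.isfile(path):
        raise FileNotFoundError("Credentials file not found: " + path)
    with open(path, encoding="utf-8") as f:
        info = json.load(f)
    # Service-account key files carry a "type" and a "project_id" field.
    if info.get("type") != "service_account" or "project_id" not in info:
        raise ValueError("This does not look like a service-account key file: " + path)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path
    return info["project_id"]


# Usage with a throwaway file; the real path would be your downloaded key.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "key.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"type": "service_account", "project_id": "demo-project"}, f)
    project = use_credentials(demo_path)

print(project)
```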



In [1]:
#First, we install the cloudscraper and googletrans modules, which we use to scrape and translate the content.

!pip install cloudscraper
!pip install googletrans

Collecting cloudscraper
  Downloading https://files.pythonhosted.org/packages/83/e4/f1d3872ce822f52f4133cc04960185f64676ca5ec87f05a5350fc0c0a92f/cloudscraper-1.2.48-py2.py3-none-any.whl (94kB)
Collecting requests-toolbelt>=0.9.1
  Downloading https://files.pythonhosted.org/packages/60/ef/7681134338fc097acef8d9b2f8abe0458e4d87559c689a8c306d0957ece5/requests_toolbelt-0.9.1-py2.py3-none-any.whl (54kB)
Installing collected packages: requests-toolbelt, cloudscraper
Successfully installed cloudscraper-1.2.48 requests-toolbelt-0.9.1
Collecting googletrans
  Downloading https://files.pythonhosted.org/packages/71/3a/3b19effdd4c03958b90f40fe01c93de6d5280e03843cc5adf6956bfc9512/googletrans-3.0.0.tar.gz
Collecting httpx==0.13.3
  Downloading https://files.pythonhosted.org/packages/54/b4/698b284c6aed4d7c2b4fe3ba5df1fcf6093612423797e76fbb24890dd22f/httpx-0.13.3-py3-non

# **INDIVIDUAL URL Mode**
In this section, we input a single URL as a string through a form and categorize only that URL. This mode handles one URL at a time, so it is of limited use: for a handful of websites we could just as well check them by hand.

In [5]:
#@title URL { output-height: 30 }
url = "https://www.elpais.com" #@param {type:"string"}
english_content = "No" #@param ["Yes", "No"]
print(url)
print("Is the content in English? " + english_content)



https://www.elpais.com
Is the content in English? No


In [6]:
#With cloudscraper we scrape the URL and with Beautiful Soup we parse the returned HTML.
#We concatenate the text in order of importance: title, meta description, h1, h2, h3 and paragraphs.


import cloudscraper
from bs4 import BeautifulSoup


def join_text(tags):
    #Join the text of all the tags, separated by ". ".
    return ". ".join(tag.text for tag in tags)


scraper = cloudscraper.create_scraper()
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

#Stays empty if scraping fails, so the print below cannot raise a NameError.
allthecontent = ""

try:
    r = scraper.get(url, headers=headers)

    soup = BeautifulSoup(r.text, 'html.parser')

    #A page without a <title> or a meta description contributes an empty string.
    title_tag = soup.find('title')
    title = title_tag.text if title_tag else ""
    description_tag = soup.find('meta', attrs={'name': 'description'})
    description = description_tag.get("content", "") if description_tag else ""

    h1_all = join_text(soup.find_all('h1'))
    h2_all = join_text(soup.find_all('h2'))
    h3_all = join_text(soup.find_all('h3'))
    paragraphs_all = join_text(soup.find_all('p'))

    allthecontent = " ".join([title, description, h1_all, h2_all, h3_all, paragraphs_all])
    #We keep only the first 1,000 characters.
    allthecontent = allthecontent[:1000]

except Exception as e:
    print(e)

print(allthecontent)

EL PAÍS: el periódico global Noticias de última hora sobre la actualidad en España y el mundo: política, economía, deportes, cultura, sociedad, tecnología, gente, opinión, viajes, moda, televisión, los blogs y las firmas de EL PAÍS. Además especiales, vídeos, fotos, audios, gráficos, entrevistas, promociones y todos los servicios de EL PAÍS. Actualiza tu navegador España e Irlanda, los dos extremos en la UE de la crisis del coronavirus . Alemania inmunizará a su población vulnerable en 60 centros repartidos por el país . Viaje a Mbour, la costa senegalesa de los naufragios olvidados: “Este lugar está muerto”
. El Gobierno despliega campamentos para la crisis migratoria en Canarias . Iker Casillas, el guardameta que fue santo . Las cartas de Sylvia Plath que relatan su niñez y adolescencia. La sombra aplastante del general De Gaulle . El futuro según Jenny Kleeman: robots sexuales, filetes clonados y úteros externos . Trump recurre a sus poderes presidenciales para tratar de subvertir 


In [7]:
#We use the Googletrans module to translate the content into English.
#Once the content is translated, we cap it at 1,000 characters so that we do not overspend on the Google NLP API.
#If the content is already in English, we do not translate it.

from googletrans import Translator
translator = Translator()

#Stays empty if the translation fails, so the print below cannot raise a NameError.
translation = ""

if english_content == "No":

  try:
      translation = translator.translate(allthecontent, dest='en').text
      translation = str(translation)[:1000]

  except Exception as e:
      print(e)

else:

  translation = allthecontent

print(translation)


EL PAÍS: the global newspaper Breaking news about current affairs in Spain and the world: politics, economy, sports, culture, society, technology, people, opinion, travel, fashion, television, EL PAÍS blogs and firms. Also specials, videos, photos, audios, graphics, interviews, promotions and all EL PAÍS services. Update your browser Spain and Ireland, the two extremes in the EU of the coronavirus crisis. Germany will immunize its vulnerable population in 60 centers throughout the country. Travel to Mbour, the Senegalese coast of forgotten shipwrecks: "This place is dead"
. The Government deploys camps for the migratory crisis in the Canary Islands. Iker Casillas, the goalkeeper who was a saint. Sylvia Plath's letters recounting her childhood and adolescence. The crushing shadow of General de Gaulle. The future according to Jenny Kleeman: sex robots, cloned steaks and external wombs. Trump uses his presidential powers to try to subvert


In [10]:
#@title Path to your Google NLP API credentials
path_credentials = "/content/gdrive/MyDrive/Colab Notebooks/Website Categorization/NLPAPI.json" #@param {type:"string"}


In [26]:
#We use the Google NLP API module to categorize the website.
#First, we need to provide the Google Application Credentials.
#Then we make the request to the NLP API.
#Finally, we print the top category and its confidence score.

import os
from google.cloud import language_v1
#Note: the enums module only exists in google-cloud-language releases before 2.0.
#From 2.0 on, the same values are exposed as language_v1.Document.Type.
from google.cloud.language_v1 import enums
from google.colab import drive

drive.mount('/content/gdrive')
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = path_credentials

try:
    client = language_v1.LanguageServiceClient()
    document = {"content": str(translation), "type": enums.Document.Type.PLAIN_TEXT, "language": "en"}
    response = client.classify_text(document)

    #Categories come sorted by confidence, so the first one is the best match.
    top = response.categories[0]
    #round(confidence * 100) avoids the off-by-one that int() truncation causes (for example 0.57 * 100 = 56.99...).
    print(url + " categorized under " + top.name + " with " + str(round(top.confidence * 100)) + "% of confidence")

except Exception as e:
    print(e)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
https://www.elpais.com categorized under /News with 97% of confidence


# **BULK MODE**
This is where the fun starts. We can categorize a whole batch of websites at once by reading them from a CSV file.
However, as we are using an unofficial Google Translate API, it is advisable to keep each batch to 50 URLs or fewer to avoid getting banned.
Once the categorization is done, we can save the results as an Excel file with Pandas.
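
A longer list of URLs can be split into batches that respect the 50-URL recommendation. Below is a minimal sketch; `chunked` and `sample_urls` are hypothetical names used only for illustration.

```python
from typing import Iterator, List


def chunked(urls: List[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


sample_urls = ["https://example.com/" + str(i) for i in range(120)]
batches = list(chunked(sample_urls))
print([len(batch) for batch in batches])
```

Each batch could then be run through the cells of this section separately, with a pause in between.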

In [2]:
#@title Path to your CSV file
path_csv = "/content/gdrive/MyDrive/Colab Notebooks/Website Categorization/Sample Websites.csv" #@param {type:"string"}
english_content = "No" #@param ["Yes", "No"]
print("Is the content in English? " + english_content)


Is the content in English? No


In [3]:
#To start with, we import and read the websites from the CSV file with the csv module.
#The reader function iterates over the lines of the file.
#We transform the reader object into a list, where each row is a one-element list containing the URL.

import csv

with open(path_csv, newline='') as f:
    reader = csv.reader(f)
    data = list(reader)

print(data)

[['https://www.larazon.es/'], ['https://www.bezzia.com'], ['https://www.ahorradoras.com/'], ['https://pix-geeks.com'], ['https://www.ciclismoafondo.es/'], ['https://www.revistaoxigeno.es/'], ['https://www.triatlonweb.es/'], ['https://www.trailrun.es/'], ['https://www.actualidadmotor.com'], ['https://estudio-27.com'], ['https://www.etapainfantil.com'], ['https://www.mexicodesconocido.com.mx'], ['https://ingenieriareal.com'], ['https://www.nacion321.com/'], ['https://www.pasala.com.mx/'], ['https://www.elfinanciero.com.mx/']]
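
Blank lines or stray spaces in the CSV file would end up in the list as empty or malformed rows. A possible way to filter them out is sketched below; `read_urls` is a hypothetical helper, and an in-memory string stands in for the real file.

```python
import csv
import io


def read_urls(lines) -> list:
    """Return one [url] list per non-empty CSV row, with surrounding whitespace removed."""
    rows = []
    for row in csv.reader(lines):
        if row and row[0].strip():
            rows.append([row[0].strip()])
    return rows


# The second line is blank and the third has extra spaces around the URL.
sample_csv = io.StringIO("https://www.larazon.es/\n\n  https://www.bezzia.com \n")
urls = read_urls(sample_csv)
print(urls)
```

The function accepts any iterable of lines, so it also works with the file object returned by `open(path_csv, newline='')`.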


In [4]:
#With cloudscraper we scrape each URL and with Beautiful Soup we parse the returned HTML.
#We concatenate the text in order of importance: title, meta description, h1, h2, h3 and paragraphs.
#The main difference from the individual URL section is that we append the scraped content to each row of the data list.


import cloudscraper
from bs4 import BeautifulSoup


def join_text(tags):
    #Join the text of all the tags, separated by ". ".
    return ". ".join(tag.text for tag in tags)


scraper = cloudscraper.create_scraper()
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

for row in data:

  try:
      print("Scraping text from: " + row[0])
      r = scraper.get(row[0], headers=headers)

      soup = BeautifulSoup(r.text, 'html.parser')

      #A page without a <title> or a meta description contributes an empty string.
      title_tag = soup.find('title')
      title = title_tag.text if title_tag else ""
      description_tag = soup.find('meta', attrs={'name': 'description'})
      description = description_tag.get("content", "") if description_tag else ""

      h1_all = join_text(soup.find_all('h1'))
      h2_all = join_text(soup.find_all('h2'))
      h3_all = join_text(soup.find_all('h3'))
      paragraphs_all = join_text(soup.find_all('p'))

      allthecontent = " ".join([title, description, h1_all, h2_all, h3_all, paragraphs_all])
      #We keep only the first 1,000 characters.
      row.append(allthecontent[:1000])

  except Exception as e:
      #If scraping fails, the row keeps only its URL.
      print(e)

print(data)

Scraping text from: https://www.larazon.es/
Scraping text from: https://www.bezzia.com
Scraping text from: https://www.ahorradoras.com/
Scraping text from: https://pix-geeks.com
Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.
Scraping text from: https://www.ciclismoafondo.es/
Scraping text from: https://www.revistaoxigeno.es/
Scraping text from: https://www.triatlonweb.es/
Scraping text from: https://www.trailrun.es/
Scraping text from: https://www.actualidadmotor.com
Scraping text from: https://estudio-27.com
Scraping text from: https://www.etapainfantil.com
Scraping text from: https://www.mexicodesconocido.com.mx
Scraping text from: https://ingenieriareal.com
Scraping text from: https://www.nacion321.com/
Scraping text from: https://www.pasala.com.mx/
Scraping text from: https://www.elfinanciero.com.mx/
[['https://www.larazon.es/', 'La Razón - Diario de Noticias de España y Actualidad La Razón - Diario de Noticias de España y

In [5]:
#We iterate over our list and translate the content into English. We append the translation to each row.
#If the content is already in English, we append the scraped content again so that every row keeps the same layout.
#If you receive an error like 'NoneType' object has no attribute 'group', try running the cell again.

from googletrans import Translator
import time

translator = Translator()


for row in data:

  try:
      if english_content == "No":
          print("Translating content for: " + row[0])
          translation = translator.translate(row[1], dest='en').text
          row.append(str(translation)[:1000])
          #Pause between requests to reduce the risk of being blocked by the unofficial API.
          time.sleep(5)
      else:
          row.append(row[1])

  except Exception as e:
      #A row without scraped content raises "list index out of range" here.
      print(e)


print(data)

Translating content for: https://www.larazon.es/
Translating content for: https://www.bezzia.com
Translating content for: https://www.ahorradoras.com/
Translating content for: https://pix-geeks.com
list index out of range
Translating content for: https://www.ciclismoafondo.es/
Translating content for: https://www.revistaoxigeno.es/
Translating content for: https://www.triatlonweb.es/
Translating content for: https://www.trailrun.es/
Translating content for: https://www.actualidadmotor.com
Translating content for: https://estudio-27.com
Translating content for: https://www.etapainfantil.com
Translating content for: https://www.mexicodesconocido.com.mx
Translating content for: https://ingenieriareal.com
Translating content for: https://www.nacion321.com/
Translating content for: https://www.pasala.com.mx/
Translating content for: https://www.elfinanciero.com.mx/
[['https://www.larazon.es/', 'La Razón - Diario de Noticias de España y Actualidad La Razón - Diario de Noticias de España y Ac
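
The advice to re-run the cell after a `'NoneType' object has no attribute 'group'` error can be automated with a small retry loop. The sketch below is only an illustration: `translate_with_retry` is a hypothetical helper, and `flaky_translate` is a stand-in that fails twice before succeeding, so the example runs without calling Googletrans.

```python
import time
from typing import Callable, Optional


def translate_with_retry(
    translate: Callable[[str], str],
    text: str,
    attempts: int = 3,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `translate(text)` up to `attempts` times; return None if every call fails."""
    for attempt in range(1, attempts + 1):
        try:
            return translate(text)
        except Exception as e:
            print("Attempt " + str(attempt) + " failed: " + str(e))
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer after each failure
    return None


# Stand-in translator that fails twice, then succeeds (for demonstration only).
calls = {"count": 0}


def flaky_translate(text: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise AttributeError("'NoneType' object has no attribute 'group'")
    return text.upper()


# The sleep is replaced with a no-op so the demonstration finishes immediately.
translated = translate_with_retry(flaky_translate, "hola", sleep=lambda seconds: None)
print(translated)
```

In the notebook, the `translate` argument could be something like `lambda text: translator.translate(text, dest='en').text`.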

In [6]:
#@title Path to your NLP API credentials
path_credentials_bulk = "/content/gdrive/MyDrive/Colab Notebooks/Website Categorization/NLPAPI.json" #@param {type:"string"}


In [7]:
#We iterate over the data list and categorize all the URLs in bulk.
#We append the top category and its confidence score to each row.

import os
from google.cloud import language_v1
#Note: the enums module only exists in google-cloud-language releases before 2.0.
#From 2.0 on, the same values are exposed as language_v1.Document.Type.
from google.cloud.language_v1 import enums
from google.colab import drive

drive.mount('/content/gdrive')
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = path_credentials_bulk

#The client is created once and reused for every request.
client = language_v1.LanguageServiceClient()

for row in data:

  try:
      print("Categorizing: " + row[0])

      document = {"content": str(row[2]), "type": enums.Document.Type.PLAIN_TEXT, "language": "en"}
      response = client.classify_text(document)

      #Categories come sorted by confidence, so the first one is the best match.
      top = response.categories[0]
      row.append(top.name)
      #round(confidence * 100) avoids the off-by-one that int() truncation causes (for example 0.57 * 100 = 56.99...).
      row.append(str(round(top.confidence * 100)) + "%")

  except Exception as e:
      #Rows without a translation, or responses without categories, raise an index error here.
      print(e)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Categorizing: https://www.larazon.es/
Categorizing: https://www.bezzia.com
Categorizing: https://www.ahorradoras.com/
Categorizing: https://pix-geeks.com
list index out of range
Categorizing: https://www.ciclismoafondo.es/
Categorizing: https://www.revistaoxigeno.es/
Categorizing: https://www.triatlonweb.es/
Categorizing: https://www.trailrun.es/
Categorizing: https://www.actualidadmotor.com
Categorizing: https://estudio-27.com
Categorizing: https://www.etapainfantil.com
Categorizing: https://www.mexicodesconocido.com.mx
Categorizing: https://ingenieriareal.com
Categorizing: https://www.nacion321.com/
list index (0) out of range
Categorizing: https://www.pasala.com.mx/
Categorizing: https://www.elfinanciero.com.mx/
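
The `list index out of range` messages in the output above appear when a row has no translation (because scraping failed earlier) or when the response contains no categories. One way to handle both cases explicitly is sketched below. `top_category` and `pad_row` are hypothetical helpers, and categories are simplified to `(name, confidence)` tuples instead of the API's response objects.

```python
from typing import List, Sequence, Tuple


def top_category(categories: Sequence[Tuple[str, float]]) -> Tuple[str, str]:
    """Return (name, confidence as a percentage string), or empty strings if there is none."""
    if not categories:
        return "", ""
    name, confidence = categories[0]
    # round() rather than int(): int(0.57 * 100) would give 56 because of float error.
    return name, str(round(confidence * 100)) + "%"


def pad_row(row: List[str], length: int = 5) -> List[str]:
    """Return a copy of `row` extended with empty strings up to `length` items."""
    return row + [""] * (length - len(row))


print(top_category([("/News", 0.97)]))
print(top_category([]))
print(pad_row(["https://example.com"]))
```

Padding every row to five items keeps the list aligned with the five DataFrame columns used in the next cell.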


In [9]:
#With pandas we can see what the categorization looks like.

import pandas as pd
 
df = pd.DataFrame(data,columns=['website', 'content','translation','category','confidence'])
df[["website","category","confidence"]]

| | website | category | confidence |
|---|---|---|---|
| 0 | https://www.larazon.es/ | /News | 72% |
| 1 | https://www.bezzia.com | /Online Communities/Dating & Personals | 74% |
| 2 | https://www.ahorradoras.com/ | /Shopping/Consumer Resources/Coupons & Discoun... | 71% |
| 3 | https://pix-geeks.com | | |
| 4 | https://www.ciclismoafondo.es/ | /Hobbies & Leisure | 99% |
| 5 | https://www.revistaoxigeno.es/ | /Sports/Extreme Sports | 50% |
| 6 | https://www.triatlonweb.es/ | /Hobbies & Leisure/Water Activities/Surf & Swim | 89% |
| 7 | https://www.trailrun.es/ | /Hobbies & Leisure | 97% |
| 8 | https://www.actualidadmotor.com | /Autos & Vehicles | 99% |
| 9 | https://estudio-27.com | /Jobs & Education | 52% |


In [10]:
#@title File name and path to save your results as an Excel file
final_file = "/content/gdrive/MyDrive/Colab Notebooks/Website Categorization/finalfile.xlsx" #@param {type:"string"}


In [12]:
#Finally, with pandas we can also save the whole list as an Excel file, including the scraped content, translated content, assigned category and confidence score.
df.to_excel(final_file, header=True, index=False)
print("Export has been done successfully!")

Export has been done successfully!
