## Importing necessary libraries

In [9]:
import io
import json
import os

import numpy
import six
from google.cloud import language
from google.cloud import language_v1
# The enums module is available in google-cloud-language releases before 2.0.
from google.cloud.language_v1 import enums

## Setting path to json key

In [5]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/home/affine/GCP/downloaded_key.json"
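Before making any request, it can help to confirm that the environment variable actually points to a key file. The following is a minimal sketch; the helper name `credentials_file_exists` is hypothetical and is not part of the Google client library.

```python
import os


def credentials_file_exists(environ=os.environ):
    """Return True if GOOGLE_APPLICATION_CREDENTIALS names an existing file.

    Hypothetical helper for illustration; it only checks the path and
    does not validate the contents of the key.
    """
    key_path = environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(key_path) and os.path.isfile(key_path)


# Prints True only when the variable is set and the file is present.
print(credentials_file_exists())
```

A False result usually means the path in the cell above needs to be adjusted for the machine running the notebook.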

**Content classification:**

- Content Classification analyzes a document and returns a list of content categories that apply to the text found in the document.

- To use the Cloud Natural Language API, you must import the language module from the google-cloud-language library.

- The language.types module contains classes that are required for creating requests.

- The language.enums module is used to specify the type of the input text.

## 1) Classifying content provided a string: first method

- We use the Python client library to send a classification request to the Natural Language API. The client library encapsulates the details of the requests to and responses from the API.


- The function below calls the Natural Language API's 'classifyText' method by first creating an instance of the LanguageServiceClient class and then calling the classify_text method of that instance.


- The function below classifies plain-text content only. You can also classify the content of a web page by passing the source HTML of the page as the text and setting the type parameter to language.enums.Document.Type.HTML.

In [75]:
# Classify a text string passed by the user

def sample_classify_text(text_content):
    """
    Classifying Content in a String

    Args:
      text_content The text content to analyze. Must include at least 20 words.
    """

    client = language_v1.LanguageServiceClient()

    # Available types: PLAIN_TEXT, HTML
    type_ = enums.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type": type_, "language": language}

    response = client.classify_text(document)
    print('response:\n ',response)
    # Loop through classified categories returned from the API
    for category in response.categories:
        # Get the name of the category representing the document.
        # See the predefined taxonomy of categories:
        # https://cloud.google.com/natural-language/docs/categories
        print(u"Category name: {}".format(category.name))
        # Get the confidence. Number representing how certain the classifier
        # is that this category represents the provided text.
        print(u"Confidence: {}".format(category.confidence))

In [25]:
# calling function:

sample_classify_text('Golf is a club-and-ball sport in which players use various clubs to hit balls into a series of holes on a course in as few strokes as possible.')

response:
  categories {
  name: "/Sports/Individual Sports/Golf"
  confidence: 0.9900000095367432
}

Category name: /Sports/Individual Sports/Golf
Confidence: 0.9900000095367432
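The docstring of sample_classify_text asks for at least 20 words. A rough local pre-check can avoid sending text that is too short. The sketch below counts whitespace-separated words, which is only an approximation of how the API itself counts; the helper name `has_enough_words` is hypothetical.

```python
def has_enough_words(text, minimum=20):
    """Approximate length check based on whitespace-separated words.

    Assumption: the service may tokenize differently, so treat this
    as a cheap guard rather than an exact rule.
    """
    return len(text.split()) >= minimum


golf = ('Golf is a club-and-ball sport in which players use various clubs '
        'to hit balls into a series of holes on a course in as few strokes '
        'as possible.')

print(len(golf.split()), has_enough_words(golf))
print(has_enough_words('Too short to classify.'))
```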


In [100]:
# example:

sample_classify_text("Google Home enables users to speak voice commands to interact with services through the Home's intelligent personal assistant called Google Assistant. A large number of services, both in-house and third-party, are integrated, allowing users to listen to music, look at videos or photos, or receive news updates entirely by voice.")

response:
  categories {
  name: "/Computers & Electronics/Software"
  confidence: 0.550000011920929
}
categories {
  name: "/Internet & Telecom"
  confidence: 0.5099999904632568
}

Category name: /Computers & Electronics/Software
Confidence: 0.550000011920929
Category name: /Internet & Telecom
Confidence: 0.5099999904632568


**Output:**

The response lists the content categories that apply to the given text string, as returned by the **classifyText** method.

Each category contains:

name: the name of the category

confidence: the confidence score, a number representing how certain the classifier is that the category represents the text

The given text can belong to more than one category.
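The long decimal tails in the confidence values (for example 0.9900000095367432) look like 32-bit floats widened to Python's 64-bit float. The sketch below, which assumes that explanation, reproduces the effect with the standard library and shows that rounding gives a cleaner value for display; `as_float32` is a hypothetical helper.

```python
import struct


def as_float32(value):
    """Round-trip a Python float through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", value))[0]


raw = 0.9900000095367432  # confidence printed for the golf example

# Rounding to two decimals recovers a readable score.
print(round(raw, 2))

# Storing 0.99 in 32 bits gives (almost exactly) the raw value above.
print(abs(as_float32(0.99) - raw) < 1e-9)
```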

## 2) Classify content provided a string: second method

In [6]:
def classify(text, verbose=True):
    """Classify the input text into categories. """

    language_client = language.LanguageServiceClient()

    document = language.types.Document(
        content=text,
        type=language.enums.Document.Type.PLAIN_TEXT)
    response = language_client.classify_text(document)
    print('response: \n',response)
    categories = response.categories

    result = {}

    for category in categories:
        # Turn the categories into a dictionary of the form:
        # {category.name: category.confidence}, so that they can
        # be treated as a sparse vector.
        result[category.name] = category.confidence

    if verbose:
        print(text)
        for category in categories:
            print(u'=' * 20)
            print(u'{:<16}: {}'.format('category', category.name))
            print(u'{:<16}: {}'.format('confidence', category.confidence))

    return result

In [102]:
# calling function

classify("Google Home enables users to speak voice commands to interact with services through the Home's intelligent personal assistant called Google Assistant. A large number of services, both in-house and third-party, are integrated, allowing users to listen to music, look at videos or photos, or receive news updates entirely by voice.")

response: 
 categories {
  name: "/Computers & Electronics/Software"
  confidence: 0.550000011920929
}
categories {
  name: "/Internet & Telecom"
  confidence: 0.5099999904632568
}

Google Home enables users to speak voice commands to interact with services through the Home's intelligent personal assistant called Google Assistant. A large number of services, both in-house and third-party, are integrated, allowing users to listen to music, look at videos or photos, or receive news updates entirely by voice.
category        : /Computers & Electronics/Software
confidence      : 0.550000011920929
category        : /Internet & Telecom
confidence      : 0.5099999904632568


{'/Computers & Electronics/Software': 0.550000011920929,
 '/Internet & Telecom': 0.5099999904632568}

**Output:**

The returned result is a dictionary with the category labels as keys, and confidence scores as values.
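Because the result is a plain dictionary, ordinary Python tools work on it. As a small sketch, the hypothetical helper `top_category` below picks the label with the highest confidence from the dictionary shown in the output above.

```python
result = {
    '/Computers & Electronics/Software': 0.550000011920929,
    '/Internet & Telecom': 0.5099999904632568,
}


def top_category(categories):
    """Return (name, confidence) of the most confident category, or None."""
    if not categories:
        return None
    name = max(categories, key=categories.get)
    return name, categories[name]


print(top_category(result))
```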

## 3) Classifying Content from Google Cloud Storage

In [127]:
def sample_classify_text_uri(gcs_content_uri):
    """
    Classifying Content in text file stored in Cloud Storage

    Args:
      gcs_content_uri Google Cloud Storage URI where the file content is located.
      e.g. gs://[Your Bucket]/[Path to File]
      The text file must include at least 20 words.
    """

    client = language_v1.LanguageServiceClient()

    # Available types: PLAIN_TEXT, HTML
    type_ = enums.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"gcs_content_uri": gcs_content_uri, "type": type_, "language": language}

    response = client.classify_text(document)
    print('response: \n',response)
    result = {}
    # Loop through classified categories returned from the API
    for category in response.categories:
        # Get the name of the category representing the document.
        # See the predefined taxonomy of categories:
        # https://cloud.google.com/natural-language/docs/categories
        print(u"Category name: {}".format(category.name))
        # Get the confidence. Number representing how certain the classifier
        # is that this category represents the provided text.
        print(u"Confidence: {}".format(category.confidence))
        
        # Turn the categories into a dictionary of the form:
        # {category.name: category.confidence}, so that they can
        # be treated as a sparse vector.
        result[category.name] = category.confidence
        
    return result
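The function expects a URI of the form gs://[Your Bucket]/[Path to File]. A minimal sketch of checking that shape locally before calling the API is shown below; `split_gcs_uri` is a hypothetical helper, not a client library function.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/file' into (bucket, path).

    Raises ValueError when the URI does not have that shape.
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError("expected a URI starting with gs://")
    bucket, _, path = uri[len(prefix):].partition("/")
    if not bucket or not path:
        raise ValueError("expected gs://[bucket]/[path to file]")
    return bucket, path


print(split_gcs_uri("gs://bucket0406/string1.txt"))
```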

### a) Classifying content stored in a text file on Google Cloud Storage:

Text stored in the .txt file at the URI:

Your food choices each day affect your health — how you feel today, tomorrow, and in the future. Good nutrition is an important part of leading a healthy lifestyle.

In [128]:
# calling function by giving txt file uri

sample_classify_text_uri("gs://bucket0406/string1.txt")

response: 
 categories {
  name: "/Health/Nutrition"
  confidence: 0.7799999713897705
}
categories {
  name: "/Food & Drink"
  confidence: 0.5400000214576721
}

Category name: /Health/Nutrition
Confidence: 0.7799999713897705
Category name: /Food & Drink
Confidence: 0.5400000214576721


{'/Health/Nutrition': 0.7799999713897705, '/Food & Drink': 0.5400000214576721}

Here we can see that the input text belongs to two categories, and that the confidence score for the first category is higher than for the second.

### b) Classifying content stored in a PDF file on Google Cloud Storage:

In [129]:
# calling function by giving pdf file uri

sample_classify_text_uri("gs://bucket0406/text.pdf")

response: 
 categories {
  name: "/Computers & Electronics/Software"
  confidence: 0.7099999785423279
}

Category name: /Computers & Electronics/Software
Confidence: 0.7099999785423279


{'/Computers & Electronics/Software': 0.7099999785423279}

## 4) Index multiple text files

- Classify multiple text files and write the result to an index file.


- The index function below takes as input a directory containing multiple text files and the path to a file where it stores the indexed output (index.json in the call below).


- It reads the content of each text file in the input directory, and then passes the text files to the Cloud Natural Language API to be classified into content categories.

In [7]:
def index(path, index_file):
    """Classify each text file in a directory and write
    the results to the index_file.
    """

    results = {}
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)

        if not os.path.isfile(file_path):
            continue

        try:
            with io.open(file_path, 'r') as f:
                text = f.read()
                print('for file: ', filename)
                categories = classify(text, verbose=False)
                

                results[filename] = categories
        except Exception:
            print('Failed to process {}'.format(file_path))

    with io.open(index_file, 'w', encoding='utf-8') as f:
        f.write(json.dumps(results, ensure_ascii=False))

    print('Texts indexed in file: {}'.format(index_file))
    return results

In [8]:
index('/home/affine/GCP/NLP_Content_Classification/text_files','index.json')

for file:  string1.txt
response: 
 categories {
  name: "/Health/Nutrition"
  confidence: 0.7799999713897705
}
categories {
  name: "/Food & Drink"
  confidence: 0.5400000214576721
}

for file:  string5.txt
response: 
 categories {
  name: "/Books & Literature/Children\'s Literature"
  confidence: 0.9900000095367432
}
categories {
  name: "/People & Society/Kids & Teens/Children\'s Interests"
  confidence: 0.9700000286102295
}

for file:  string4.txt
response: 
 categories {
  name: "/Books & Literature/Children\'s Literature"
  confidence: 0.9200000166893005
}
categories {
  name: "/People & Society/Kids & Teens/Children\'s Interests"
  confidence: 0.8100000023841858
}

for file:  string6.txt
response: 
 categories {
  name: "/Books & Literature/Children\'s Literature"
  confidence: 0.9200000166893005
}
categories {
  name: "/People & Society/Kids & Teens/Children\'s Interests"
  confidence: 0.8500000238418579
}
categories {
  name: "/People & Society/Subcultures & Niche Interests"
  

{'string1.txt': {'/Health/Nutrition': 0.7799999713897705,
  '/Food & Drink': 0.5400000214576721},
 'string5.txt': {"/Books & Literature/Children's Literature": 0.9900000095367432,
  "/People & Society/Kids & Teens/Children's Interests": 0.9700000286102295},
 'string4.txt': {"/Books & Literature/Children's Literature": 0.9200000166893005,
  "/People & Society/Kids & Teens/Children's Interests": 0.8100000023841858},
 'string6.txt': {"/Books & Literature/Children's Literature": 0.9200000166893005,
  "/People & Society/Kids & Teens/Children's Interests": 0.8500000238418579,
  '/People & Society/Subcultures & Niche Interests': 0.5600000023841858},
 'string3.txt': {'/Internet & Telecom/Mobile & Wireless': 0.6299999952316284,
  '/Internet & Telecom/Service Providers': 0.5199999809265137},
 'string2.txt': {'/Internet & Telecom': 0.7099999785423279}}

**Output:**

For each file in the directory, index calls the classify function defined above, which sends the text of the file to the Cloud Natural Language API and returns the content categories for that file.


The results from the Cloud Natural Language API for each file are organized into a single dictionary, serialized as a JSON string, and written to the index file.
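The write-then-read cycle can be sketched without calling the API. The example below uses one entry copied from the output above and a temporary directory (an assumption made so the sketch does not touch real files) to show that the JSON file loads back into the same dictionary.

```python
import json
import os
import tempfile

results = {
    'string1.txt': {'/Health/Nutrition': 0.7799999713897705,
                    '/Food & Drink': 0.5400000214576721},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    index_file = os.path.join(tmp_dir, 'index.json')

    # Same serialization as in index().
    with open(index_file, 'w', encoding='utf-8') as f:
        f.write(json.dumps(results, ensure_ascii=False))

    # Reading it back, as the query functions do.
    with open(index_file, 'r', encoding='utf-8') as f:
        loaded = json.load(f)

print(loaded == results)
```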

## 5) Query with category labels

- This step processes a query category label to find similar text files.


- It finds the files in a directory that are most similar to a query label passed by the user.


- Here, a category label serves as the query.

#### first splitting the categories into individual levels:

In [119]:
def split_labels(categories):
    """The category labels are of the form "/a/b/c" up to three levels,
    for example "/Computers & Electronics/Software", and these labels
    are used as keys in the categories dictionary, whose values are
    confidence scores.
    The split_labels function splits the keys into individual levels
    while duplicating the confidence score, which allows a natural
    boost in how we calculate similarity when more levels are in common.
    Example:
    If we have
    x = {"/a/b/c": 0.5}
    y = {"/a/b": 0.5}
    z = {"/a": 0.5}
    Then x and y are considered more similar than y and z.
    """
    _categories = {}
    for name, confidence in six.iteritems(categories):
        labels = [label for label in name.split('/') if label]
        for label in labels:
            _categories[label] = confidence
    return _categories
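A short worked example may make the splitting concrete. The sketch below re-implements split_labels with the built-in dict.items() instead of six (an equivalent on Python 3) and applies it to the docstring example and to the string3.txt categories from the index output. Note that when two categories share a level, the confidence of the one processed last overwrites the earlier value.

```python
def split_labels(categories):
    """Same logic as above, written without six."""
    levels = {}
    for name, confidence in categories.items():
        for label in name.split('/'):
            if label:
                levels[label] = confidence
    return levels


print(split_labels({'/a/b/c': 0.5}))

string3 = {
    '/Internet & Telecom/Mobile & Wireless': 0.6299999952316284,
    '/Internet & Telecom/Service Providers': 0.5199999809265137,
}
# 'Internet & Telecom' appears in both keys; the second value wins.
print(split_labels(string3))
```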

#### Then finding similarity between text based on their resulting content classification by using numpy for vector calculations.

In [120]:
def similarity(categories1, categories2):
    """Cosine similarity of the categories treated as sparse vectors."""
    categories1 = split_labels(categories1)
    categories2 = split_labels(categories2)

    norm1 = numpy.linalg.norm(list(categories1.values()))
    norm2 = numpy.linalg.norm(list(categories2.values()))

    # Return the smallest possible similarity if either categories is empty.
    if norm1 == 0 or norm2 == 0:
        return 0.0

    # Compute the cosine similarity.
    dot = 0.0
    for label, confidence in six.iteritems(categories1):
        dot += confidence * categories2.get(label, 0.0)

    return dot / (norm1 * norm2)
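To check the claim in the split_labels docstring that x and y are more similar than y and z, here is a self-contained sketch. It swaps numpy.linalg.norm for math.sqrt over a sum of squares, which computes the same Euclidean norm, so that it runs with the standard library only.

```python
import math


def split_labels(categories):
    levels = {}
    for name, confidence in categories.items():
        for label in name.split('/'):
            if label:
                levels[label] = confidence
    return levels


def similarity(categories1, categories2):
    """Cosine similarity of the split category labels."""
    v1 = split_labels(categories1)
    v2 = split_labels(categories2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    dot = sum(c * v2.get(label, 0.0) for label, c in v1.items())
    return dot / (norm1 * norm2)


x = {'/a/b/c': 0.5}
y = {'/a/b': 0.5}
z = {'/a': 0.5}

print(similarity(x, y))  # two shared levels
print(similarity(y, z))  # one shared level
```

The first score is larger because x and y share two levels while y and z share only one.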

#### Then finding the indexed files that are the most similar to the query label

In [121]:
def query_category(index_file, category_string, n_top=3):
    """Find the indexed files that are the most similar to
    the query label.

    The list of all available labels:
    https://cloud.google.com/natural-language/docs/categories
    """

    with io.open(index_file, 'r') as f:
        index = json.load(f)
    # Make the category_string into a dictionary so that it is
    # of the same format as what we get by calling classify.
    query_categories = {category_string: 1.0}

    similarities = []
    for filename, categories in six.iteritems(index):
        similarities.append(
            (filename, similarity(query_categories, categories)))

    similarities = sorted(similarities, key=lambda p: p[1], reverse=True)

    print('=' * 20)
    print('Query: {}\n'.format(category_string))
    print('\nMost similar {} indexed texts:'.format(n_top))
    for filename, sim in similarities[:n_top]:
        print('\tFilename: {}'.format(filename))
        print('\tSimilarity: {}'.format(sim))
        print('\n')

    return similarities

In [122]:
query_category('/home/affine/GCP/NLP_Content_Classification/index.json','/Books & Literature')

Query: /Books & Literature


Most similar 3 indexed texts:
	Filename: string4.txt
	Similarity: 0.48081945867711245


	Filename: string6.txt
	Similarity: 0.4741386255611049


	Filename: string5.txt
	Similarity: 0.4526781570345511




[('string4.txt', 0.48081945867711245),
 ('string6.txt', 0.4741386255611049),
 ('string5.txt', 0.4526781570345511),
 ('string1.txt', 0.0),
 ('string3.txt', 0.0),
 ('string2.txt', 0.0)]
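The top score can be checked by hand. For string4.txt, splitting the two indexed categories gives two labels with confidence of about 0.92 and three with about 0.81; the query contributes a single label with weight 1.0, and only 'Books & Literature' is shared. The sketch below repeats that arithmetic with the values from the index output.

```python
import math

books = 0.9200000166893005  # Books & Literature, Children's Literature
kids = 0.8100000023841858   # People & Society, Kids & Teens, Children's Interests

norm_query = 1.0  # the query vector is {'Books & Literature': 1.0}
norm_file = math.sqrt(2 * books ** 2 + 3 * kids ** 2)

# Only 'Books & Literature' is shared between query and file.
dot = 1.0 * books

score = dot / (norm_query * norm_file)
print(score)
```

The files with similarity 0.0 share no label with the query, so their dot product is zero.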

## 6) Query with text

- This step processes a query text to find similar text files.


- It ranks the files in a directory by how similar their categories are to the categories of the query text.


- Here we query with text that may not be part of the indexed text. The query function below is similar to the query_category function, with the added step of making a classifyText request for the text input and using the results to query the index file.

#### Finding the indexed files that are the most similar to the query text

In [123]:
def query(index_file, text, n_top=3):
    """Find the indexed files that are the most similar to
    the query text.
    """

    with io.open(index_file, 'r') as f:
        index = json.load(f)

    # Get the categories of the query text.
    query_categories = classify(text, verbose=False)

    similarities = []
    for filename, categories in six.iteritems(index):
        similarities.append(
            (filename, similarity(query_categories, categories)))

    similarities = sorted(similarities, key=lambda p: p[1], reverse=True)

    print('=' * 20)
    print('Query: {}\n'.format(text))
    for category, confidence in six.iteritems(query_categories):
        print('\tCategory: {}, confidence: {}'.format(category, confidence))
    print('\nMost similar {} indexed texts:'.format(n_top))
    for filename, sim in similarities[:n_top]:
        print('\tFilename: {}'.format(filename))
        print('\tSimilarity: {}'.format(sim))
        print('\n')

    return similarities

In [124]:
query('/home/affine/GCP/NLP_Content_Classification/index.json',"Healthy children learn better. People with adequate nutrition are more productive and can create opportunities to gradually break the cycles of poverty and hunger")

response: 
 categories {
  name: "/Health"
  confidence: 0.5899999737739563
}
categories {
  name: "/People & Society"
  confidence: 0.5299999713897705
}

Query: Healthy children learn better. People with adequate nutrition are more productive and can create opportunities to gradually break the cycles of poverty and hunger

	Category: /Health, confidence: 0.5899999737739563
	Category: /People & Society, confidence: 0.5299999713897705

Most similar 3 indexed texts:
	Filename: string1.txt
	Similarity: 0.472457802007375


	Filename: string5.txt
	Similarity: 0.29639893032291637


	Filename: string4.txt
	Similarity: 0.282897926749342




[('string1.txt', 0.472457802007375),
 ('string5.txt', 0.29639893032291637),
 ('string4.txt', 0.282897926749342),
 ('string6.txt', 0.19286617819127827),
 ('string3.txt', 0.0),
 ('string2.txt', 0.0)]
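The ranking step can be reproduced offline, without a classifyText request, by reusing the categories printed above for the query and the dictionary printed by index. This is a sketch under those assumptions: `rank` is a hypothetical helper, and the norm is computed with math instead of numpy.

```python
import math


def split_labels(categories):
    levels = {}
    for name, confidence in categories.items():
        for label in name.split('/'):
            if label:
                levels[label] = confidence
    return levels


def similarity(categories1, categories2):
    v1, v2 = split_labels(categories1), split_labels(categories2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    dot = sum(c * v2.get(label, 0.0) for label, c in v1.items())
    return dot / (norm1 * norm2)


def rank(index, query_categories):
    """Sort indexed files by similarity to the query, highest first."""
    scores = [(name, similarity(query_categories, categories))
              for name, categories in index.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Values copied from the index output and the query response above.
index = {
    'string1.txt': {'/Health/Nutrition': 0.7799999713897705,
                    '/Food & Drink': 0.5400000214576721},
    'string5.txt': {"/Books & Literature/Children's Literature": 0.9900000095367432,
                    "/People & Society/Kids & Teens/Children's Interests": 0.9700000286102295},
    'string4.txt': {"/Books & Literature/Children's Literature": 0.9200000166893005,
                    "/People & Society/Kids & Teens/Children's Interests": 0.8100000023841858},
    'string6.txt': {"/Books & Literature/Children's Literature": 0.9200000166893005,
                    "/People & Society/Kids & Teens/Children's Interests": 0.8500000238418579,
                    '/People & Society/Subcultures & Niche Interests': 0.5600000023841858},
    'string3.txt': {'/Internet & Telecom/Mobile & Wireless': 0.6299999952316284,
                    '/Internet & Telecom/Service Providers': 0.5199999809265137},
    'string2.txt': {'/Internet & Telecom': 0.7099999785423279},
}
query_categories = {'/Health': 0.5899999737739563,
                    '/People & Society': 0.5299999713897705}

ranked = rank(index, query_categories)
for filename, score in ranked:
    print(filename, score)
```

string1.txt ranks first because it shares the 'Health' level with the query, while the children's literature files match only through 'People & Society'.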