# Google Cloud Natural Language API Demo

## Introduction to Jupyter Notebook
Jupyter Notebooks are a staple in any data scientist's toolkit. Jupyter is a free, open-source, interactive data science environment that can function as both an IDE and a visualisation tool. A Jupyter Notebook is a single document where you can run code, display the output, and add equations and explanations. Each notebook is a `.ipynb` file, which is a text file that describes the content of the notebook in JSON format.
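To make the JSON structure concrete, here is a minimal sketch that builds a hand-written, simplified notebook as a Python dictionary, serialises it to the text a `.ipynb` file would hold, and reads the cell types back. The cell contents are made up for illustration, and real notebooks usually carry more metadata than shown here.

```python
import json

# A hand-written, simplified notebook: one markdown cell and one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Serialise to the text that would be stored in a .ipynb file, then parse it again.
ipynb_text = json.dumps(notebook, indent=1)
loaded = json.loads(ipynb_text)

# List the type of every cell in document order.
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```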

Each Jupyter Notebook runs on a kernel, which can be thought of as a "computational engine" that executes the code within the notebook. Notebooks are made up of a number of cells, which are either markdown cells that display text in place or code cells. For example, this piece of text you are reading resides in the first cell of this notebook. When a code cell is run, the output is displayed below the cell. The order in which cells are run matters! Cells that define functions or variables have to be run before those functions or variables can be used in a subsequent cell.

How to use a Jupyter Notebook:
- To run a cell, either click the arrow to the left of the cell or press `Ctrl + Enter` after selecting the cell. When a cell is run, a number appears in square brackets (e.g. `[1]`) showing the order in which the cells were run.
- To interrupt a cell while it is running, press the button with the black square in the toolbar at the top.
- To restart the kernel, open the `Kernel` menu and choose from the list of restart options available.


## Introduction to NLP API

The Natural Language API has several methods for performing analysis and annotation on your text. Each level of analysis provides valuable information for language understanding. These methods are listed below:

**Sentiment analysis** inspects the given text and identifies the prevailing emotional opinion within the text, especially to determine a writer's attitude as positive, negative, or neutral. This method returns the sentiment of the text as a whole as well as the sentiment of individual sentences within it. Sentiment analysis is performed through the analyzeSentiment method.

**Entity analysis** inspects the given text for known entities (proper nouns such as public figures and landmarks, and common nouns such as restaurant and stadium) and returns information about those entities. This includes a Wikipedia link (if applicable), the entity type, and the salience (a measure of the relevance of the entity to the entire text). Entity analysis is performed with the analyzeEntities method.

**Entity sentiment analysis** inspects the given text for known entities (proper nouns and common nouns), returns information about those entities, and identifies the prevailing emotional opinion of the entity within the text, especially to determine a writer's attitude toward the entity as positive, negative, or neutral. This is useful for a sentence that expresses different opinions about different entities, such as "I liked the food but the service was terrible". Entity sentiment analysis is performed with the analyzeEntitySentiment method.
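The rest of this notebook does not call entity sentiment analysis, so here is a minimal sketch of how it might look with the Python client. The function names `gcp_analyze_entity_sentiment` and `summarize_entity_sentiment` are hypothetical, and the `SimpleNamespace` objects are made-up stand-ins for an API response so that the flattening step can be tried without credentials.

```python
from types import SimpleNamespace


def gcp_analyze_entity_sentiment(text):
    """Call analyzeEntitySentiment and return the entities in the response."""
    # Imported lazily so the helper below also works without the client library.
    from google.cloud import language_v1 as language

    client = language.LanguageServiceClient()
    document = language.Document(content=text, type_=language.Document.Type.PLAIN_TEXT)
    response = client.analyze_entity_sentiment(document=document)
    return response.entities


def summarize_entity_sentiment(entities):
    """Flatten entities into dicts holding name, salience, score and magnitude."""
    return [
        {
            "name": entity.name,
            "salience": entity.salience,
            "sentiment score": entity.sentiment.score,
            "sentiment magnitude": entity.sentiment.magnitude,
        }
        for entity in entities
    ]


# Made-up stand-ins shaped like API entities (the values are illustrative only).
fake_entities = [
    SimpleNamespace(name="food", salience=0.6,
                    sentiment=SimpleNamespace(score=0.8, magnitude=0.8)),
    SimpleNamespace(name="service", salience=0.4,
                    sentiment=SimpleNamespace(score=-0.8, magnitude=0.8)),
]
entity_rows = summarize_entity_sentiment(fake_entities)
print(entity_rows)
```

With credentials configured, `summarize_entity_sentiment(gcp_analyze_entity_sentiment(text))` would produce the same row shape from a real response.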

**Syntactic analysis** extracts linguistic information, breaking up the given text into a series of sentences and tokens (generally, word boundaries), providing further analysis on those tokens. For each word in the text, the API tells you the word's part of speech (noun, verb, adjective, etc.) and how it relates to other words in the sentence. Syntactic Analysis is performed with the analyzeSyntax method.

**Content classification** analyzes text content and returns the content categories that apply to it. Content classification is performed by using the classifyText method.

Each API call also detects and returns the language, if a language is not specified by the caller in the initial request. A full list of supported languages can be found here: https://cloud.google.com/natural-language/docs/languages

Additionally, if you wish to perform several natural language operations on given text using only one API call, the annotateText request can also be used to perform sentiment analysis and entity analysis.
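As a rough sketch of a combined request, the snippet below selects which analyses to run and passes them to `annotate_text` in one call. `build_features` and `gcp_annotate_text` are hypothetical helper names; the feature keys are assumed to match the fields of `AnnotateTextRequest.Features` in the `language_v1` client.

```python
def build_features(sentiment=True, entities=True, syntax=False):
    """Map simple flags to the feature switches of an annotateText request."""
    return {
        "extract_document_sentiment": sentiment,
        "extract_entities": entities,
        "extract_syntax": syntax,
    }


def gcp_annotate_text(text, features=None):
    """Run several analyses on `text` with a single annotateText call."""
    # Imported lazily so build_features can be used without the client library.
    from google.cloud import language_v1 as language

    client = language.LanguageServiceClient()
    document = {"content": text, "type_": language.Document.Type.PLAIN_TEXT}
    return client.annotate_text(
        request={"document": document, "features": features or build_features()}
    )


# Show the default selection: sentiment and entities, but no syntax.
default_features = build_features()
print(default_features)
```

The response of `gcp_annotate_text` would then expose the requested parts together, for example `document_sentiment` and `entities`.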

#### Further Documentation:
https://cloud.google.com/natural-language/docs
https://cloud.google.com/natural-language/docs/basics
https://cloud.google.com/natural-language/docs/how-to

## The Natural Language API: Set Up And Examples

#### Setup

Before running this notebook, ensure that you have enabled billing and the Cloud Natural Language API, and that you have a service account.

You may also need to restart your kernel ('Kernel' in the menu). 

In [None]:
pip install --user --upgrade google-cloud-language
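
After installing, it can help to confirm which version of the package the kernel actually sees, since a stale kernel may not pick up the upgrade. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("google-cloud-language"))
```

If this prints `None` straight after a successful install, restarting the kernel is a reasonable first thing to try.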

In [None]:
# Import google-cloud-language
# Make sure that you have installed or upgraded to the latest google-cloud-language using pip
from google.cloud import language_v1 as language
import pandas as pd
# Print all columns and all rows in a pandas DataFrame
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

#### Set up functions to call Google Natural Language API
Here are some examples of the API in action <br>
Sentiment Analysis:

In [None]:
# Adapted from the Google codelab at https://codelabs.developers.google.com/codelabs/cloud-natural-language-python3#7

def analyze_text_sentiment(text):
    """Return the sentiment score and magnitude of each sentence in `text`."""
    client = language.LanguageServiceClient()
    document = language.Document(content=text, type_=language.Document.Type.PLAIN_TEXT)

    response = client.analyze_sentiment(document=document)

    # Get sentiment for all sentences in the document
    sentence_sentiment = []
    for sentence in response.sentences:
        item = {
            "text": sentence.text.content,
            "sentiment score": sentence.sentiment.score,
            "sentiment magnitude": sentence.sentiment.magnitude,
        }
        sentence_sentiment.append(item)

    return sentence_sentiment

In [None]:
text = "Stocks are going down on the NASDAQ"
analyze_text_sentiment(text)

Syntactic Analysis:

In [None]:
# Syntax Analysis
def gcp_analyze_syntax(text, debug=0):
    """
    Analyze the syntax of a string

    Args:
      text: The text content to analyze
      debug: If truthy, print the details of each token
    """

    client = language.LanguageServiceClient()
    document = language.Document(content=text, type_=language.Document.Type.PLAIN_TEXT)
    response = client.analyze_syntax(document=document)
    
    output = []   
    # Loop through tokens returned from the API
    for token in response.tokens:
        word = {}
        # Get the text content of this token. Usually a word or punctuation.
        text = token.text  

        # Get the part of speech information for this token.
        # Parts of speech are as defined in:
        # http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf
        part_of_speech = token.part_of_speech
        # Get the tag, e.g. NOUN, ADJ for Adjective, et al.
        
        # Get the dependency tree parse information for this token.
        # For more information on dependency labels:
        # http://www.aclweb.org/anthology/P13-2017
        dependency_edge = token.dependency_edge   
        
        word["word"]=text.content
        word["begin_offset"]=text.begin_offset        
        word["part_of_speech"]=language.PartOfSpeech.Tag(part_of_speech.tag).name
        
        # Get the voice, e.g. ACTIVE or PASSIVE
        word["Voice"]=language.PartOfSpeech.Voice(part_of_speech.voice).name
        word["Tense"]=language.PartOfSpeech.Tense(part_of_speech.tense).name
        
        # See API reference for additional Part of Speech information available
        # Get the lemma of the token. Wikipedia lemma description
        # https://en.wikipedia.org/wiki/Lemma_(morphology)        
        word["Lemma"]=token.lemma
        word["index"]=dependency_edge.head_token_index
        word["Label"]=language.DependencyEdge.Label(dependency_edge.label).name
        
        if debug:
            print(u"Token text: {}".format(text.content))
            print(
                u"Location of this token in overall document: {}".format(text.begin_offset)
            ) 
            print(
                u"Part of Speech tag: {}".format(
                    language.PartOfSpeech.Tag(part_of_speech.tag).name
                )
            )        

            print(u"Voice: {}".format(language.PartOfSpeech.Voice(part_of_speech.voice).name))
            # Get the tense, e.g. PAST, FUTURE, PRESENT, et al.
            print(u"Tense: {}".format(language.PartOfSpeech.Tense(part_of_speech.tense).name))

            print(u"Lemma: {}".format(token.lemma))

            print(u"Head token index: {}".format(dependency_edge.head_token_index))
            print(
                u"Label: {}".format(language.DependencyEdge.Label(dependency_edge.label).name)
            )
        
        output.append(word)
        

    # Get the language of the text, which will be the same as
    # the language specified in the request or, if not specified,
    # the automatically-detected language.
    if debug:
        print(u"Language of the text: {}".format(response.language))
    return (output)

In [None]:
gcp_analyze_syntax(text)

Entity Analysis:

In [None]:
# Entity Analysis
def gcp_analyze_entities(text, debug=0):
    """
    Analyze the entities in a string

    Args:
      text: The text content to analyze
      debug: If truthy, print the details of each entity
    """

    client = language.LanguageServiceClient()
    document = language.Document(content=text, type_=language.Document.Type.PLAIN_TEXT)
    response = client.analyze_entities(document=document)
    output = []   
    
    # Loop through entities returned from the API
    for entity in response.entities:
        item = {}
        item["name"]=entity.name
        item["type"]=language.Entity.Type(entity.type_).name
        item["Salience"]=entity.salience
        
        if debug:
            print(u"Representative name for the entity: {}".format(entity.name))

            # Get entity type, e.g. PERSON, LOCATION, ADDRESS, NUMBER, et al
            print(u"Entity type: {}".format(language.Entity.Type(entity.type_).name))

            # Get the salience score associated with the entity in the [0, 1.0] range
            print(u"Salience score: {}".format(entity.salience))

        # Loop over the metadata associated with entity. For many known entities,
        # the metadata is a Wikipedia URL (wikipedia_url) and Knowledge Graph MID (mid).
        # Some entity types may have additional metadata, e.g. ADDRESS entities
        # may have metadata for the address street_name, postal_code, et al.
        for metadata_name, metadata_value in entity.metadata.items():
            item[metadata_name]=metadata_value
            if debug:
                print(u"{}: {}".format(metadata_name, metadata_value))

        # Loop over the mentions of this entity in the input document.
        # The API currently supports proper noun mentions.
        if debug:
            for mention in entity.mentions:
                print(u"Mention text: {}".format(mention.text.content))
                # Get the mention type, e.g. PROPER for proper noun
                print(
                    u"Mention type: {}".format(language.EntityMention.Type(mention.type_).name)
                )
        output.append(item)
    
    # Get the language of the text, which will be the same as
    # the language specified in the request or, if not specified,
    # the automatically-detected language.
    if debug:
        print(u"Language of the text: {}".format(response.language))
    
    return(output)

In [None]:
gcp_analyze_entities(text)

Content Classification:

In [None]:
# Content Classification

def gcp_classify_text(text):
    client = language.LanguageServiceClient()
    document = language.Document(content=text, type_=language.Document.Type.PLAIN_TEXT)

    response = client.classify_text(document=document)

    for category in response.categories:
        print("=" * 80)
        print(f"category  : {category.name}")
        print(f"confidence: {category.confidence:.0%}")

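Content classification requires a longer piece of text than the headline used above; the API documentation asks for at least twenty tokens (words).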
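
A rough pre-check can avoid sending text that is too short to classify. This is a sketch only: it counts whitespace-separated words as a stand-in for API tokens, and the threshold of 20 is an assumption taken from the classifyText documentation. `has_enough_tokens` is a hypothetical helper name.

```python
MIN_TOKENS_FOR_CLASSIFICATION = 20  # assumption based on the classifyText docs


def has_enough_tokens(text, minimum=MIN_TOKENS_FOR_CLASSIFICATION):
    """Approximate check: whitespace-separated words stand in for API tokens."""
    return len(text.split()) >= minimum


# The headline used earlier in this notebook has only seven words.
short_text = "Stocks are going down on the NASDAQ"
print(has_enough_tokens(short_text))
```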

In [None]:
text="Although most people consider piranhas to be quite dangerous, they are, for the most part, entirely harmless. \n\
Piranhas rarely feed on large animals; they eat smaller fish and aquatic plants. When confronted with humans, piranhas’ \n\
first instinct is to flee, not attack. Their fear of humans makes sense. Far more piranhas are eaten by people than people \n\
are eaten by piranhas. If the fish are well-fed, they won’t bite humans."

gcp_classify_text(text)

## Demo 1 - Process a single news article

#### Analyze Syntax
Syntactic analysis breaks up the given text into a series of sentences and tokens and provides linguistic information about those tokens.

In [None]:
text_syntax=gcp_analyze_syntax(text)
df_syntax = pd.DataFrame(text_syntax)
df_syntax

#### Analyze Entities
Entity Analysis inspects the given text for known entities (proper nouns such as public figures, landmarks, etc.), and returns information about those entities.

In [None]:
entities=gcp_analyze_entities(text)
df_entities = pd.DataFrame(entities)
df_entities

#### Classify Documents
The Google Natural Language API classifies documents into these major categories:

- Adult
- Arts & Entertainment
- Autos & Vehicles
- Beauty & Fitness
- Books & Literature
- Business & Industrial
- Computers & Electronics
- Finance
- Food & Drink
- Games
- Health
- Hobbies & Leisure
- Home & Garden
- Internet & Telecom
- Jobs & Education
- Law & Government
- News
- Online Communities
- People & Society
- Pets & Animals
- Real Estate
- Reference
- Science
- Sensitive Subjects
- Shopping
- Sports
- Travel

A full list of categories and subcategories can be found here: https://cloud.google.com/natural-language/docs/categories
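
The API returns each category as a path-like name, with subcategories separated by `/`. The sketch below extracts the major category from such a name; `top_level_category` is a hypothetical helper, and the sample string is illustrative rather than taken from an actual response.

```python
def top_level_category(category_name):
    """Return the first segment of a category path such as '/A/B'."""
    segments = [segment for segment in category_name.split("/") if segment]
    return segments[0] if segments else ""


# Illustrative category name in the path style used by the API.
major = top_level_category("/Pets & Animals/Wildlife")
print(major)
```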

#### Analyze Sentiment
Interpreting Google Sentiment Analysis Values:

Sentiment Score - a number from -1.0 to 1.0 indicating how positive or negative the statement is.

Sentiment Magnitude - a number ranging from 0 to infinity that represents the weight of sentiment expressed in the statement, regardless of being positive or negative. This value is often proportional to the length of the document.
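
One way to read the two values together is sketched below. A score near zero with a low magnitude suggests neutral text, while a score near zero with a high magnitude suggests mixed positive and negative content. The thresholds are arbitrary assumptions for illustration and would need tuning for real data; `describe_sentiment` is a hypothetical helper name.

```python
def describe_sentiment(score, magnitude, score_threshold=0.25, magnitude_threshold=1.0):
    """Label a (score, magnitude) pair using illustrative, untuned thresholds."""
    if score >= score_threshold:
        return "positive"
    if score <= -score_threshold:
        return "negative"
    # Score is close to zero: a high magnitude hints at mixed feelings,
    # a low magnitude at little sentiment of any kind.
    return "mixed" if magnitude >= magnitude_threshold else "neutral"


for pair in [(0.8, 3.0), (-0.6, 4.0), (0.1, 0.0), (0.0, 4.0)]:
    print(pair, describe_sentiment(*pair))
```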

In [None]:
# Get the sentiment of each sentence in the text
sentence_sentiment = analyze_text_sentiment(text)

In [None]:
df_sentiment = pd.DataFrame(sentence_sentiment)
df_sentiment

## Demo 2 - Process sample news articles from Refinitiv

In [None]:
from google.cloud import storage

# news_sample="github/gcp/FinancialServicesHeadline100.csv" 
# news_sample="gs://ml-core-shared-standard-bucket/data/FinancialServicesHeadline100.csv"
# df = pd.read_csv(news_sample)
df = pd.read_csv('reuters_headlines.csv')
print(df.shape)
df.head()

In [None]:
text=df["Headlines"]
print("size of document:", text.shape)
text.head()

In [None]:
# Combine all news into one document
text_all= df["Headlines"].to_string(index=False)
#print(text_all)
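
`to_string` formats the column for display, so the combined text can contain alignment whitespace. An alternative sketch is to join the headlines directly, one per line. The list below is made-up sample data standing in for `df["Headlines"]`.

```python
# Made-up headlines standing in for df["Headlines"].
headlines = ["  Stocks slide on rate fears ", "Oil climbs", "Banks report earnings"]

# Strip stray whitespace and put each headline on its own line.
text_joined = "\n".join(headline.strip() for headline in headlines)
print(text_joined)
```

In the notebook itself, the same join could be applied to the column: `"\n".join(df["Headlines"].str.strip())`.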

#### Analyze Syntax
Syntactic analysis breaks up the given text into a series of sentences and tokens and provides linguistic information about those tokens.

In [None]:
# Process each headline as a separate document

syntax_frames = []
for text in df["Headlines"]:
    item = gcp_analyze_syntax(text)
    syntax_frames.append(pd.DataFrame(item))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df_text_syntax = pd.concat(syntax_frames)


In [None]:
print("size of output:", df_text_syntax.shape)
df_text_syntax.head(50)
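
When every headline is sent as its own request, it is useful to know which headline each output row came from and to keep going if one request fails. The sketch below shows one way to do that. `analyze_each` and `fake_analyze` are hypothetical names; `fake_analyze` is an offline stand-in for `gcp_analyze_syntax` so the example runs without credentials.

```python
def analyze_each(texts, analyze_fn):
    """Run analyze_fn on every text and tag each result row with its source index."""
    rows, failures = [], []
    for index, text in enumerate(texts):
        try:
            results = analyze_fn(text)
        except Exception as error:  # deliberately broad: keep the batch going
            failures.append((index, str(error)))
            continue
        for row in results:
            rows.append({"headline_index": index, **row})
    return rows, failures


def fake_analyze(text):
    """Offline stand-in for an API call: one row per word, error on empty text."""
    if not text:
        raise ValueError("empty text")
    return [{"word": word} for word in text.split()]


rows, failures = analyze_each(["Oil climbs", "", "Banks rally"], fake_analyze)
print(len(rows), failures)
```

In the notebook, `analyze_each(df["Headlines"], gcp_analyze_syntax)` would return rows that can be passed straight to `pd.DataFrame`.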

#### Analyze Entities
Entity Analysis inspects the given text for known entities (proper nouns such as public figures, landmarks, etc.), and returns information about those entities.

In [None]:
# Process each article independently

entity_frames = []
for text in df["Headlines"]:
    entity_frames.append(pd.DataFrame(gcp_analyze_entities(text)))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df_entities = pd.concat(entity_frames, ignore_index=True)
# entities=gcp_analyze_entities(text_all)
# df_entities2 = pd.DataFrame(entities)

In [None]:
print("size of output:", df_entities.shape)
df_entities.head(50)
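
A natural next step is to see which entities appear most often across the headlines. The sketch below counts `(name, type)` pairs. `most_common_entities` is a hypothetical helper, and the sample rows are made up to mirror the shape of the `gcp_analyze_entities` output.

```python
from collections import Counter


def most_common_entities(entity_rows, top_n=10):
    """Count how often each (name, type) pair appears across all rows."""
    counts = Counter((row["name"], row["type"]) for row in entity_rows)
    return counts.most_common(top_n)


# Made-up rows shaped like the dictionaries returned by gcp_analyze_entities.
sample_rows = [
    {"name": "NASDAQ", "type": "ORGANIZATION", "Salience": 0.7},
    {"name": "Stocks", "type": "OTHER", "Salience": 0.3},
    {"name": "NASDAQ", "type": "ORGANIZATION", "Salience": 0.9},
]
top_entities = most_common_entities(sample_rows, top_n=1)
print(top_entities)
```

With the real data, `most_common_entities(df_entities.to_dict("records"))` would apply the same count to every entity row.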

#### Classify Documents
The Google Natural Language API classifies documents into these major categories:

- Adult
- Arts & Entertainment
- Autos & Vehicles
- Beauty & Fitness
- Books & Literature
- Business & Industrial
- Computers & Electronics
- Finance
- Food & Drink
- Games
- Health
- Hobbies & Leisure
- Home & Garden
- Internet & Telecom
- Jobs & Education
- Law & Government
- News
- Online Communities
- People & Society
- Pets & Animals
- Real Estate
- Reference
- Science
- Sensitive Subjects
- Shopping
- Sports
- Travel

A full list of categories and subcategories can be found here: https://cloud.google.com/natural-language/docs/categories

In [None]:
## Overall document classification
gcp_classify_text(text_all)

#### Analyze Sentiment

In [None]:
# Process each headline independently, keeping the sentiment of its first sentence

sentiment_rows = []
for text in df["Headlines"]:
    sentiment_output = analyze_text_sentiment(text)
    if sentiment_output:  # skip headlines for which no sentences were returned
        sentiment_rows.append(sentiment_output[0])

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df_sentiment = pd.DataFrame(
    sentiment_rows, columns=["text", "sentiment score", "sentiment magnitude"]
)

In [None]:
df_sentiment.head(100)

In [None]:
# Plot Sentiment Scores
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import colors
plt.rcParams.update({'figure.figsize':(16,8)})

# Cast to float so the plotting functions receive numeric data
x = df_sentiment["sentiment score"].astype(float)
y = df_sentiment["sentiment magnitude"].astype(float)

# Scatter plot of magnitude against score, one point per headline
sns.scatterplot(x=x, y=y)
plt.show()

n_bins=30

fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True)
# We can set the number of bins with the `bins` kwarg
axs[0].set_xlabel("Sentiment Score")
axs[0].set_ylabel("Count")
axs[0].set_title('Histogram of Sentiment Score')
axs[1].set_xlabel("Sentiment Magnitude")
axs[1].set_title('Histogram of Sentiment Magnitude')

axs[0].hist(x, bins=n_bins)
axs[1].hist(y, bins=n_bins)
plt.show()


fig, ax = plt.subplots(tight_layout=True)
hist = ax.hist2d(x, y, norm=colors.LogNorm())
plt.title("Sentiment Score and Magnitude 2-D Distribution")
ax.set_xlabel("Sentiment Score")
ax.set_ylabel("Sentiment Magnitude")

plt.show()