## Importing necessary libraries

In [7]:
import os
import matplotlib.pyplot as plt
import cv2
import argparse
import io
import json
import numpy
import six
import re
from google.cloud import storage
import pandas as pd

## Setting path to json key

In [8]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/home/affine/GCP/downloaded_key.json"
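
A wrong key path usually only surfaces later as an authentication error, so a quick check up front can help. The sketch below is a minimal helper under stated assumptions: the name `credentials_path` is hypothetical, and it only checks that the variable is set and points to an existing file, not that the key itself is valid.

```python
import os

def credentials_path(env=None):
    """Return the key path from GOOGLE_APPLICATION_CREDENTIALS, or raise.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Key file not found: {path}")
    return path
```

Calling `credentials_path()` right after setting the variable fails early if the path is mistyped.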

- We use AutoML Natural Language to create a custom machine learning model.


- We can create a model to:

    - classify documents
    - identify entities in documents
    - analyze the prevailing emotional attitude in a document.
    
    
- To use AutoML Natural Language, enable the Cloud AutoML and Cloud Storage APIs for your Google Cloud project.

### Model objectives

AutoML Natural Language can train custom models for four distinct tasks, known as model objectives:

- **Single label classification** classifies documents by assigning a label to them
- **Multi-label classification** allows a document to be assigned multiple labels
- **Entity extraction** identifies entities in documents
- **Sentiment analysis** analyzes attitudes within documents


### Steps to follow:

1) Create a dataset

- Open the AutoML Natural Language UI and select Get started in the box corresponding to the type of model you plan to train.
- Click the New Dataset button in the title bar.
- Enter a name for the dataset and select the model objective that matches the sample dataset you chose.
- Leave the Location set to Global.
- In the Import text items section, choose Select a CSV file on Cloud Storage, and enter the path to the dataset you want to use into the text box.
- If you choose the sentiment dataset, AutoML Natural Language asks for the maximum sentiment value. 
- Click Create dataset.
- You're returned to the Datasets page, where your dataset shows an in-progress animation while the documents are imported. Expect roughly 10 minutes per 1000 documents, although the import may take more or less time.
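
The CSV file referenced in the import step can be generated with the standard library. This is a minimal sketch, assuming the simplest classification layout of one text item followed by one label per row; the exact column layout should be checked against the AutoML Natural Language documentation. The example rows and the helper name `build_import_csv` are hypothetical.

```python
import csv
import io

def build_import_csv(rows):
    """Serialize (text, label) pairs as CSV, quoting fields where needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for text, label in rows:
        writer.writerow([text, label])
    return buf.getvalue()

# Hypothetical training rows; the labels reuse class names from this notebook
rows = [
    ("I finally finished the report I had been putting off.", "achievement"),
    ("My daughter called me, and we talked for an hour.", "affection"),
]
csv_text = build_import_csv(rows)
print(csv_text)
```

Using `csv.writer` rather than string concatenation matters here because document text often contains commas and quotes, which must be escaped for the import to parse correctly.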

2) Train your model

- After your training data has been successfully imported, select the dataset from the dataset listing page to see the details about the dataset. The name of the selected dataset appears in the title bar, and the page lists the individual documents in the dataset along with their labels.

- When you are done reviewing the dataset, click the Train tab just below the title bar.
- Click Start Training.
- Enter a name for the new model and check the Deploy model after training finishes check box.
- Click Start Training.
- Training a model can take several hours to complete. 
- After training, the bottom of the Train page shows high-level metrics for the model, such as precision and recall.
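
For reference, precision and recall can be computed from counts of true positives, false positives, and false negatives. The sketch below uses made-up counts purely for illustration; it does not reproduce how the AutoML UI aggregates its metrics.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for one label
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.75
```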

3) Use the custom model

- After your model has been successfully trained, you can use it to analyze other documents.
- Click the Test & Use tab just below the title bar. Enter text in the Input text box or the URL of a PDF or TIFF file in a Cloud Storage bucket, then click Predict.
- AutoML Natural Language analyzes the text using your model and displays the annotations.

## Code for our custom model

In [36]:
import sys

from google.api_core.client_options import ClientOptions
from google.cloud import automl_v1
from google.cloud.automl_v1.proto import service_pb2

def inline_text_payload(file_path):
    with open(file_path, 'rb') as ff:
        content = ff.read()
    return {'text_snippet': {'content': content, 'mime_type': 'text/plain'} }

def pdf_payload(file_path):
    return {'document': {'input_config': {'gcs_source': {'input_uris': [file_path] } } } }

def get_prediction(file_path, model_name):
    options = ClientOptions(api_endpoint='automl.googleapis.com')
    prediction_client = automl_v1.PredictionServiceClient(client_options=options)

    payload = inline_text_payload(file_path)
    print('payload : \n',payload)
    # To predict on a PDF stored in Cloud Storage, use pdf_payload instead of inline_text_payload:
    # payload = pdf_payload(file_path)

    params = {}
    # predict() blocks until the service returns a response
    response = prediction_client.predict(model_name, payload, params)
    return response

if __name__ == '__main__':
    file_path = sys.argv[1]
    model_name = sys.argv[2]
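
The cell above relies on positional arguments to `predict` and on the `automl_v1.proto` module, which, as far as we can tell, are only available in `google-cloud-automl` releases before 2.0. The following is a sketch of an equivalent call for 2.x releases, assuming `predict` accepts a `request` dict with `name`, `payload`, and `params`; the helper names `build_predict_request` and `get_prediction_v2` are hypothetical and the call has not been run against a live model here.

```python
def build_predict_request(model_name, text):
    """Build the request dict for a plain-text prediction."""
    if isinstance(text, bytes):
        # Assumption: newer clients expect the snippet content as str, not bytes
        text = text.decode("utf-8")
    return {
        "name": model_name,
        "payload": {"text_snippet": {"content": text, "mime_type": "text/plain"}},
        "params": {},
    }

def get_prediction_v2(file_path, model_name):
    # Imported here so build_predict_request works without the library installed
    from google.cloud import automl_v1

    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    client = automl_v1.PredictionServiceClient()
    return client.predict(request=build_predict_request(model_name, text))
```

Keeping the request construction in a separate function makes it possible to inspect or test the payload without calling the service.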

### 1) Prediction on custom single label classification model

To create a single-label classification model, use the "happy moments" dataset derived from the Kaggle open-source dataset HappyDB. The resulting model classifies happy moments into categories reflecting the causes of happiness.

In [29]:
get_prediction('SL_test1.txt','projects/project-001-285307/locations/us-central1/models/TCN3799597725268312064')

payload : 
 {'text_snippet': {'content': b'I cooked a new dish for dinner than turned out great.\n', 'mime_type': 'text/plain'}}


payload {
  annotation_spec_id: "4809009872706207744"
  classification {
    score: 0.9595863819122314
  }
  display_name: "achievement"
}
payload {
  annotation_spec_id: "8267774386526748672"
  classification {
    score: 0.02703445963561535
  }
  display_name: "leisure"
}
payload {
  annotation_spec_id: "3656088368099360768"
  classification {
    score: 0.01262345165014267
  }
  display_name: "enjoy_the_moment"
}
payload {
  annotation_spec_id: "1350245358885666816"
  classification {
    score: 0.0006985074724070728
  }
  display_name: "affection"
}
payload {
  annotation_spec_id: "5961931377313054720"
  classification {
    score: 3.316345828352496e-05
  }
  display_name: "exercise"
}
payload {
  annotation_spec_id: "7114852881919901696"
  classification {
    score: 2.2405794879887253e-05
  }
  display_name: "bonding"
}
payload {
  annotation_spec_id: "2503166863492513792"
  classification {
    score: 1.5677270539526944e-06
  }
  display_name: "nature"
}

### Output:

For the text "I cooked a new dish for dinner than turned out great", the model returns a confidence score for each class (rounded here):
- achievement: 0.9596
- leisure: 0.0270
- enjoy_the_moment: 0.0126
- affection: 0.0007
- exercise: 3.32e-05
- bonding: 2.24e-05
- nature: 1.57e-06

The highest-scoring class is **achievement**, so the model assigns the text to that class.
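
Picking the winning class by eye works for one document, but it can also be done in code. This is a minimal sketch, assuming each payload item exposes `display_name` and `classification.score` as the printed output suggests; the `SimpleNamespace` objects are stand-ins built from the first two results above, and `top_label` is a hypothetical helper.

```python
from types import SimpleNamespace

def top_label(payload):
    """Return (display_name, score) of the highest-scoring classification."""
    best = max(payload, key=lambda item: item.classification.score)
    return best.display_name, best.classification.score

# Stand-ins mimicking the shape of the response payload
sample = [
    SimpleNamespace(display_name="achievement",
                    classification=SimpleNamespace(score=0.9595863819122314)),
    SimpleNamespace(display_name="leisure",
                    classification=SimpleNamespace(score=0.02703445963561535)),
]
print(top_label(sample))
```

With a real response, the call would be `top_label(response.payload)`.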

### 2) Prediction on custom Sentiment analysis model

To create a sentiment analysis model, use the open dataset from FigureEight that analyzes Twitter mentions of the allergy medicine Claritin.

In [30]:
get_prediction('SA_test1.txt','projects/project-001-285307/locations/us-central1/models/TST1499947165542252544')

payload : 
 {'text_snippet': {'content': b'"If she was on a Claritin clear commercial I\'d buy it" - Kyle Tyler\n', 'mime_type': 'text/plain'}}


payload {
  text_sentiment {
    sentiment: 3
  }
}
metadata {
  key: "sentiment_score"
  value: "0.32662117"
}

### Output:

For the text "If she was on a Claritin clear commercial I'd buy it" - Kyle Tyler, the model predicts a sentiment value of 3, and the response metadata reports a `sentiment_score` of 0.32662117.
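
The two values in the sentiment response live in different places: the predicted value in the payload and the score in the metadata map. The sketch below shows one way to read both, assuming the response shape matches the printed output; `sentiment_summary` is a hypothetical helper and the `SimpleNamespace` object is a stand-in for the real response.

```python
from types import SimpleNamespace

def sentiment_summary(response):
    """Return (sentiment value, sentiment score) from a sentiment prediction."""
    value = response.payload[0].text_sentiment.sentiment
    # The metadata value is printed as a string, so convert it to float
    score = float(response.metadata["sentiment_score"])
    return value, score

# Stand-in mimicking the response printed above
sample = SimpleNamespace(
    payload=[SimpleNamespace(text_sentiment=SimpleNamespace(sentiment=3))],
    metadata={"sentiment_score": "0.32662117"},
)
print(sentiment_summary(sample))
```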

### 3) Prediction on custom Entity extraction model

To create an entity extraction model, use a corpus of biomedical research abstracts that mention hundreds of diseases and concepts. The resulting model identifies these medical entities in other documents.

In [32]:
get_prediction('EE_test1.txt','projects/project-001-285307/locations/us-central1/models/TEN512532947241271296')

payload : 
 {'text_snippet': {'content': b'2390095\tTotal deficiency of plasma cholesteryl ester transfer protein in subjects homozygous and heterozygous for the intron 14 splicing defect.\tThe molecular basis of cholesteryl ester transfer protein ( CETP ) deficiency was investigated in 4 unrelated CETP-deficient families . The high density lipoprotein-cholesterol levels of the probands exceeded 150 mg / dl . The plasma of the probands was totally deficient in CETP activity and mass . The genomic DNA of the patients was amplified by polymerase chain reaction , using two oligonucleotide primers located in the intron 12 and 14 of the CETP gene , and the amplified products were directly sequenced . Two patients were homozygous for a G-to-A change at the 5-splice donor site of the intron 14 . The G-to-A change would cause impaired splicing of pre-messenger RNA . The other two probands were heterozygous for the mutation , but totally lacked CETP . Their lipoprotein patterns were also simila

payload {
  annotation_spec_id: "8185302218350526464"
  display_name: "SpecificDisease"
  text_extraction {
    score: 0.9990716576576233
    text_segment {
      start_offset: 168
      end_offset: 222
      content: "cholesteryl ester transfer protein ( CETP ) deficiency"
    }
  }
}
payload {
  annotation_spec_id: "2997155447619715072"
  display_name: "Modifier"
  text_extraction {
    score: 0.9969088435173035
    text_segment {
      start_offset: 255
      end_offset: 269
      content: "CETP-deficient"
    }
  }
}
payload {
  annotation_spec_id: "5302998456833409024"
  display_name: "DiseaseClass"
  text_extraction {
    score: 0.9987812638282776
    text_segment {
      start_offset: 1004
      end_offset: 1019
      content: "genetic defects"
    }
  }
}
payload {
  annotation_spec_id: "8185302218350526464"
  display_name: "SpecificDisease"
  text_extraction {
    score: 0.9991596937179565
    text_segment {
      start_offset: 1158
      end_offset: 1173
      content: "CETP 

### Output:

In the above text, the model extracted the following entities (scores rounded to four decimal places):

- "cholesteryl ester transfer protein ( CETP ) deficiency" is labeled "SpecificDisease" with score 0.9991

- "CETP-deficient" is labeled "Modifier" with score 0.9969

- "genetic defects" is labeled "DiseaseClass" with score 0.9988

- "CETP deficiency" is labeled "SpecificDisease" with score 0.9992
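
The `start_offset` and `end_offset` fields in the output can be used to recover each entity from the input. For this ASCII text they behave like a zero-based, end-exclusive Python slice; that reading is inferred from the values printed above, not from the API documentation. The sketch uses only the opening of the input and the first two segments, and `extract_entities` is a hypothetical helper.

```python
# Opening of the input text shown above (tabs separate ID, title, and abstract)
text = (
    "2390095\tTotal deficiency of plasma cholesteryl ester transfer protein "
    "in subjects homozygous and heterozygous for the intron 14 splicing defect."
    "\tThe molecular basis of cholesteryl ester transfer protein ( CETP ) "
    "deficiency was investigated in 4 unrelated CETP-deficient families ."
)

# (label, start_offset, end_offset) copied from the first two results
segments = [("SpecificDisease", 168, 222), ("Modifier", 255, 269)]

def extract_entities(text, segments):
    """Slice each entity out of the text using its offsets."""
    return [(label, text[start:end]) for label, start, end in segments]

for label, content in extract_entities(text, segments):
    print(f"{label}: {content}")
```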