In this example, we show how to use MMLSpark (Microsoft Machine Learning for Apache Spark) to build a Machine Learning pipeline that uses Microsoft Cognitive Services to process data. To complete this demo you will need the following cognitive services:
- Text Analytics
- Computer Vision
- Bing Search

You will need the access keys for these services, which are available from the Azure portal. Pay attention to the region where the services are provisioned: this example assumes they are in East US.
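
Because every endpoint URL in this example embeds the region, it can help to derive the URLs from a single region setting. The following is a minimal sketch and not part of the original notebook: `cognitive_endpoint` and the constant names are hypothetical, and it assumes the region appears in the hostname in lowercase without spaces, as in the `eastus` URLs used later.

```python
# Hypothetical helper (not part of MMLSpark): builds regional endpoint URLs.
BASE_TEMPLATE = "https://{region}.api.cognitive.microsoft.com/{path}"


def cognitive_endpoint(region, path):
    """Return the endpoint URL for a service path in the given Azure region."""
    # Assumption: hostnames use the lowercase region name without spaces,
    # e.g. "East US" -> "eastus".
    normalized = region.replace(" ", "").lower()
    return BASE_TEMPLATE.format(region=normalized, path=path.lstrip("/"))


REGION = "East US"  # change this if your services are provisioned elsewhere

VISION_BASE_URL = cognitive_endpoint(REGION, "vision/v2.0/")
OCR_URL = cognitive_endpoint(REGION, "vision/v2.0/recognizeText")
SENTIMENT_URL = cognitive_endpoint(REGION, "text/analytics/v2.0/sentiment")

print(OCR_URL)
```

If your services live in another region, only `REGION` needs to change, and the resulting strings can be passed to the `setUrl` calls in the cells that follow.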

In [2]:
from mmlspark.cognitive import *
from pyspark.ml import PipelineModel
from pyspark.sql.functions import col, udf
from pyspark.ml.feature import SQLTransformer
import os

# Azure Cognitive Services keys
TEXT_API_KEY          = os.environ["TEXT_API_KEY"]
VISION_API_KEY        = os.environ["VISION_API_KEY"]
BING_IMAGE_SEARCH_KEY = os.environ["BING_IMAGE_SEARCH_KEY"]
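
Reading the keys with `os.environ[...]` raises a `KeyError` on the first missing variable. As a minimal sketch, assuming you would rather see all missing variables at once, a small check such as the one below can run before the cell above. The helper name `missing_keys` is hypothetical.

```python
import os

# The three environment variables read by this notebook.
REQUIRED_KEYS = ["TEXT_API_KEY", "VISION_API_KEY", "BING_IMAGE_SEARCH_KEY"]


def missing_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```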

## Extracting celebrity quote images using Bing Image Search on Spark

Here we define two Transformers to extract celebrity quote images.

<img src="https://camo.githubusercontent.com/6352eb0f5144aff091d8409a1fc9d60739fdc473/68747470733a2f2f6d6d6c737061726b2e626c6f622e636f72652e77696e646f77732e6e65742f67726170686963732f436f67253230536572766963652532304e422f73746570253230312e706e67" width="900" />

In [4]:
imgsPerBatch = 10 #the number of images Bing will return for each query
offsets = [(i*imgsPerBatch,) for i in range(100)] # A list of offsets, used to page into the search results
bingParameters = spark.createDataFrame(offsets, ["offset"])

bingSearch = BingImageSearch()\
  .setSubscriptionKey(BING_IMAGE_SEARCH_KEY)\
  .setOffsetCol("offset")\
  .setQuery("celebrity quotes")\
  .setCount(imgsPerBatch)\
  .setOutputCol("images")

# Transformer that extracts and flattens the richly structured output of Bing Image Search into a simple URL column
getUrls = BingImageSearch.getUrlTransformer("images", "url")
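
To make the paging scheme concrete, the sketch below reproduces the offset list in plain Python, without Spark. Each one-element tuple becomes one row of the `offset` column, and each row asks for the next page of `imgsPerBatch` results. The variable names here are illustrative only.

```python
imgs_per_batch = 10   # images requested per query
num_batches = 100     # number of pages, i.e. rows in the parameter DataFrame

# One single-column row per page: offsets 0, 10, 20, ..., 990
offsets = [(i * imgs_per_batch,) for i in range(num_batches)]

first_offset = offsets[0][0]
last_offset = offsets[-1][0]
max_images = num_batches * imgs_per_batch  # upper bound on images requested

print(offsets[:3])   # [(0,), (10,), (20,)]
print(first_offset, last_offset, max_images)
```

With these settings the pipeline requests at most 1000 images. The search may return fewer if it runs out of results.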

#### Recognizing Images of Celebrities
This block identifies the celebrities that appear in each image returned by Bing Image Search.

<img src="https://camo.githubusercontent.com/c38313a61569e972c566f49e44349bc4b2cc7d99/68747470733a2f2f6d6d6c737061726b2e626c6f622e636f72652e77696e646f77732e6e65742f67726170686963732f436f67253230536572766963652532304e422f73746570253230322e706e67" width="900" />

In [6]:
celebs = RecognizeDomainSpecificContent()\
          .setSubscriptionKey(VISION_API_KEY)\
          .setModel("celebrities")\
          .setUrl("https://eastus.api.cognitive.microsoft.com/vision/v2.0/")\
          .setImageUrlCol("url")\
          .setOutputCol("celebs")

#Extract the first celebrity we see from the structured response
firstCeleb = SQLTransformer(statement="SELECT *, celebs.result.celebrities[0].name as firstCeleb FROM __THIS__")
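
The SQL expression `celebs.result.celebrities[0].name` walks a nested response. As a rough illustration, here is a plain-Python equivalent operating on a dictionary. The sample payload is a simplified assumption containing only the fields that the expression touches, and `first_celebrity` is a hypothetical helper.

```python
def first_celebrity(response):
    """Return the name of the first recognized celebrity, or None."""
    if response is None:
        return None
    celebrities = (response.get("result") or {}).get("celebrities") or []
    # An image without a recognized celebrity yields an empty list.
    return celebrities[0]["name"] if celebrities else None


# Simplified, hand-written sample payloads (assumed shape).
with_celebrity = {"result": {"celebrities": [{"name": "Patrick Stewart"}]}}
without_celebrity = {"result": {"celebrities": []}}

print(first_celebrity(with_celebrity))
print(first_celebrity(without_celebrity))
```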

#### Reading the quote from the image
This stage performs OCR on the images to recognize the quotes.

<img src="https://camo.githubusercontent.com/1a36ace1632502996724041c6f3c4d9f00b7ad36/68747470733a2f2f6d6d6c737061726b2e626c6f622e636f72652e77696e646f77732e6e65742f67726170686963732f436f67253230536572766963652532304e422f73746570253230332e706e67" width="900" />

In [8]:
from mmlspark.stages import UDFTransformer 

recognizeText = RecognizeText()\
  .setSubscriptionKey(VISION_API_KEY)\
  .setUrl("https://eastus.api.cognitive.microsoft.com/vision/v2.0/recognizeText")\
  .setImageUrlCol("url")\
  .setMode("Printed")\
  .setOutputCol("ocr")\
  .setConcurrency(5)

def getTextFunction(ocrRow):
    if ocrRow is None: return None
    return "\n".join([line.text for line in ocrRow.recognitionResult.lines])

# This transformer will extract a simpler string from the structured output of RecognizeText
getText = UDFTransformer().setUDF(udf(getTextFunction)).setInputCol("ocr").setOutputCol("text")
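
The behavior of `getTextFunction` can be checked locally without calling the OCR service. In the sketch below, `SimpleNamespace` objects stand in for the Spark rows that the UDF receives. That substitution is an assumption made only for illustration, and the sample lines are taken from the results shown at the end of this notebook.

```python
from types import SimpleNamespace


def getTextFunction(ocrRow):
    # Same logic as the notebook cell: join the recognized lines with newlines.
    if ocrRow is None:
        return None
    return "\n".join([line.text for line in ocrRow.recognitionResult.lines])


# Stand-in for the structured OCR output (attribute access like a Spark Row).
fake_ocr_row = SimpleNamespace(
    recognitionResult=SimpleNamespace(
        lines=[
            SimpleNamespace(text="Dream as if you'll live forever."),
            SimpleNamespace(text="Live as if you'll die today."),
        ]
    )
)

print(getTextFunction(fake_ocr_row))
print(getTextFunction(None))
```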

#### Understanding the Sentiment of the Quote

In [10]:
sentimentTransformer = TextSentiment()\
    .setTextCol("text")\
    .setUrl("https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment")\
    .setSubscriptionKey(TEXT_API_KEY)\
    .setOutputCol("sentiment")

#Extract the sentiment score from the API response body
getSentiment = SQLTransformer(statement="SELECT *, sentiment[0].score as sentimentScore FROM __THIS__")
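
The sentiment score is a number between 0 and 1, where values near 1 indicate positive text and values near 0 indicate negative text. If a coarse label is easier to read than a raw score, a helper like the following could be applied afterwards. The function name and both thresholds are arbitrary assumptions, not values defined by the service.

```python
def sentiment_label(score, negative_below=0.4, positive_above=0.6):
    """Map a 0..1 sentiment score to a coarse label (thresholds are arbitrary)."""
    if score is None:
        return None
    if score < negative_below:
        return "negative"
    if score > positive_above:
        return "positive"
    return "neutral"


# Scores taken from the results shown at the end of this notebook.
print(sentiment_label(0.7781869))
print(sentiment_label(0.16331878))
```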

#### Tying it all together

<img src="https://camo.githubusercontent.com/0e5ff00f47339673fc2bbf80a8a0c39a856d2ff6/68747470733a2f2f6d6d6c737061726b2e626c6f622e636f72652e77696e646f77732e6e65742f67726170686963732f436f67253230536572766963652532304e422f66756c6c253230706970652e706e67" width="900" />

In [12]:
from mmlspark.stages import SelectColumns
# Select the final columns
cleanupColumns = SelectColumns().setCols(["url", "firstCeleb", "text", "sentimentScore"])

celebrityQuoteAnalysis = PipelineModel(stages=[
  bingSearch, getUrls, celebs, firstCeleb, recognizeText, getText, sentimentTransformer, getSentiment, cleanupColumns])

celebrityQuoteAnalysis.transform(bingParameters).createOrReplaceTempView('tmp_predictions')
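
Conceptually, the pipeline applies its stages in order, each one consuming the output of the previous stage and usually adding a column. The toy sketch below mimics that flow with plain Python functions over a list of dictionaries. It is only an analogy: real Spark stages operate on DataFrames, and all names here are hypothetical.

```python
def add_length(row):
    # Stage that adds a derived column, like the transformers above.
    return {**row, "length": len(row["text"])}


def add_upper(row):
    # Another stage adding an intermediate column.
    return {**row, "upper": row["text"].upper()}


def select_columns(cols):
    # Final cleanup stage: keep only the requested columns.
    return lambda row: {c: row[c] for c in cols}


def run_pipeline(stages, rows):
    """Apply each stage to every row, feeding the result to the next stage."""
    for stage in stages:
        rows = [stage(row) for row in rows]
    return rows


stages = [add_length, add_upper, select_columns(["text", "length"])]
result = run_pipeline(stages, [{"text": "hi there"}])
print(result)  # [{'text': 'hi there', 'length': 8}]
```

As in the notebook, intermediate columns (here `upper`) are available to later stages and dropped only by the final selection step.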

In [13]:
%sql
SELECT * FROM tmp_predictions LIMIT 5

url,firstCeleb,text,sentimentScore
https://worldwideinterweb.com/wp-content/uploads/2015/10/funniest-quotes-of-all-time.jpg,Patrick Stewart,"I am not the archetypal leading man. This is mainly for one reason: as you may have noticed, I have no hair. - Patrick Stewart",0.7781869
https://quotereel.com/wp-content/uploads/2017/08/Best-Celebrity-Quotes-15.jpg,Kim Kardashian,"Quotereel.com ""Well, a bear can juggle and stand on a ball and he's talented, but he's not famous. Kim Kardashian imgflip.com",0.9645388
https://thechive.files.wordpress.com/2017/06/famous-people-simplify-our-complex-world-with-these-quotes-225.jpg?quality=85&strip=info&w=600,James Dean,"""Dream as if you'll live forever. Live as if you'll die today.' JAMES DEAN",0.91431165
https://keyassets-p2.timeincuk.net/wp/prod/wp-content/uploads/sites/30/2016/03/AngelinaJolieQuote.jpg,Angelina Jolie,Now ANGELINA JOLIE Find who you are in this world and what you need to feel good alone. I think that's the most important thing in life.,0.87862855
http://quotesideas.com/wp-content/uploads/2015/03/34110-celebrities-quotes-we-heart-it-wallpaper-2400x1350.jpg,Zayn Malik,"celebrities-quotes | tumblr ""BECAUSE YOU CAN'T FIND A PRINCE DOESN'T MEAN YOU'RE NOT A PRINCESS' Zayn Malik",0.16331878
