# Introduction

In this notebook we shall use Google's Search API and the IBM Watson API to compare reviews of competing products among new-generation Mesh Routers. Here we compare reviews of 3 Mesh Routers: eero, Netgear's Orbi and Luma.
This notebook uses Spark's Rest Data Source to call Google's Search API and the IBM Watson API.
The same notebook can be used for other competing products by changing the Google Search API configuration and the key words.

## Getting search results using key words in Google Search API

First we define the search phrases, one per Router, that will be used to look up reviews of the 3 selected products through Google's Search API.

In [5]:
inputSW1 = ('eero wifi')
inputSW2 = ('Luma wifi')
inputSW3 = ('Orbi wifi')

Now we define the URI for Google's Search API. This URI uses a specific Custom Search Engine instance (cx) and access key (key) created for this use case in Google's developer cloud. For other use cases a similar Custom Search Engine instance can be defined. We leave these two parameters blank here so that you can insert the values specific to your own Custom Search Engine.

In [17]:
gKey = ''
gCx = ''

In [7]:
gapi = 'https://www.googleapis.com/customsearch/v1?cx=' + gCx + '&key='+ gKey
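The concatenation above works as long as the cx and key values contain only URL-safe characters. A slightly more defensive sketch, assuming the same endpoint, percent-encodes both values with the standard library. The helper name `build_search_uri` and the placeholder credentials are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_search_uri(cx, key):
    """Return the Custom Search base URI with cx and key percent-encoded."""
    return GOOGLE_SEARCH_ENDPOINT + "?" + urlencode({"cx": cx, "key": key})

# Placeholder values for illustration only; substitute your own credentials.
example_uri = build_search_uri("my-engine-id", "my-api-key")
print(example_uri)
```

The result can be assigned to `gapi` in place of the string concatenation.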

Now we create a dataframe out of those 3 sets of search words. Note that we use 'q' as the column name when creating the dataframe, because 'q' is the key Google Search API expects for the search words.

In [8]:
gDf = sc.parallelize([[inputSW1], [inputSW2], [inputSW3]]).toDF(['q'])

In [9]:
gDf.show()

+---------+
|        q|
+---------+
|eero wifi|
|Luma wifi|
|Orbi wifi|
+---------+



The Rest Data Source reads its inputs from a Temporary Spark Table, so we register this dataframe as one (named 'gtbl').

We also create a parameter map to be passed to the Rest Data Source. In it we provide the Google Search API url, the input table name (holding the search words for the different Routers) and the method (which is GET for Google Search API).

In [10]:
parmg = {'url' : gapi, 'input' : 'gtbl', 'method' : 'GET'}

In [11]:
gDf.createOrReplaceTempView("gtbl")

Now we create the Rest Data Source using these parameters. Internally it calls the Google Search API in parallel for the 3 sets of search words (one per Router) and returns the results in a dataframe.

In [12]:
grDf = spark.read.format("org.apache.dsext.spark.datasource.rest.RestDataSource").options(**parmg).load().cache()
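Conceptually, a GET call through the Rest Data Source amounts to one request per input row, with each column sent as a query parameter. The plain-Python sketch below illustrates that idea only; it is an assumption based on the description above, not the data source's actual implementation, and it does not model the parallelism. The function name `request_urls` and the placeholder cx/key values are hypothetical.

```python
from urllib.parse import urlencode

def request_urls(base_uri, rows):
    """Build one GET URL per input row, appending each column as a query parameter."""
    separator = "&" if "?" in base_uri else "?"
    return [base_uri + separator + urlencode(row) for row in rows]

# The three rows of the 'gtbl' table, written as plain dictionaries.
rows = [{"q": "eero wifi"}, {"q": "Luma wifi"}, {"q": "Orbi wifi"}]

# Placeholder credentials; the real values come from gCx and gKey.
base_uri = "https://www.googleapis.com/customsearch/v1?cx=CX_PLACEHOLDER&key=KEY_PLACEHOLDER"

urls = request_urls(base_uri, rows)
for url in urls:
    print(url)
```

This is why the column has to be named 'q': the column name becomes the parameter name in each request.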

We shall first investigate the schema of the result returned by Google Search API.

In [13]:
grDf.printSchema()

root
 |-- output: struct (nullable = true)
 |    |-- context: struct (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- cacheId: string (nullable = true)
 |    |    |    |-- displayLink: string (nullable = true)
 |    |    |    |-- formattedUrl: string (nullable = true)
 |    |    |    |-- htmlFormattedUrl: string (nullable = true)
 |    |    |    |-- htmlSnippet: string (nullable = true)
 |    |    |    |-- htmlTitle: string (nullable = true)
 |    |    |    |-- kind: string (nullable = true)
 |    |    |    |-- link: string (nullable = true)
 |    |    |    |-- pagemap: struct (nullable = true)
 |    |    |    |    |-- article: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- alternativeheadline: string (nullable = true)
 |    |    |    |    |    |    |-- datemodified: string (null

The links (output.items.element.link) returned in the search results are important for our analysis, because those sites contain the content (feedback, reviews, descriptions, etc.) about the Routers. Let us first check which links Google Search returned.

In [14]:
grDf.createOrReplaceTempView("grtbl")

In [15]:
spark.sql("select q, inline(output.items) from grtbl").createOrReplaceTempView("grtbl2")

In [16]:
spark.sql("select q, link from grtbl2").show(50, False)

+---------+--------------------------------------------------------------------------------------------------------------------------+
|q        |link                                                                                                                      |
+---------+--------------------------------------------------------------------------------------------------------------------------+
|eero wifi|https://eero.com/                                                                                                         |
|eero wifi|https://www.amazon.com/eero-Home-WiFi-System-Pack/dp/B00XEW3YD6                                                           |
|eero wifi|https://www.cnet.com/products/eero-wi-fi-system/                                                                          |
|eero wifi|https://www.amazon.com/eero-Home-WiFi-System-Beacons/dp/B0713ZCT4N                                                        |
|eero wifi|https://www.cnet.com/products/eero-home-wi-f
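The `inline(output.items)` call in the query above turns each element of the `items` array into its own row, with one column per struct field, which is what makes `link` selectable as a plain column. A minimal plain-Python sketch of the same flattening follows. The sample rows are trimmed, illustrative stand-ins for the real API response (the `displayLink` values are assumptions), and `inline_items` is a hypothetical helper name.

```python
def inline_items(rows):
    """Flatten each row's output.items array into one flat record per item."""
    flattened = []
    for row in rows:
        # Rows whose items array is missing or null produce no output records.
        items = (row.get("output") or {}).get("items") or []
        for item in items:
            record = {"q": row["q"]}
            record.update(item)  # each struct field becomes its own column
            flattened.append(record)
    return flattened

sample = [
    {"q": "eero wifi", "output": {"items": [
        {"link": "https://eero.com/", "displayLink": "eero.com"},
        {"link": "https://www.cnet.com/products/eero-wi-fi-system/", "displayLink": "www.cnet.com"},
    ]}},
    {"q": "Luma wifi", "output": {"items": None}},
]

flat = inline_items(sample)
for record in flat:
    print(record["q"], record["link"])
```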

Now we are going to pass these links to the IBM Watson Natural Language Understanding API (Watson NLU API), which can give us the sentiment (positive, negative or neutral, with a score) expressed in the content of these sites. Note that we rename the column 'link' to 'url', because 'url' is the key the Watson NLU API expects for each site it is going to analyze.

Before that, let us filter the list down to links that point to a review or an article. We also keep links from Amazon, as they contain user reviews.

In [17]:
spark.sql("select q, link from grtbl2 where link like '%review%' or link like '%article%' or link like '%amazon%'").createOrReplaceTempView("grtbl3")

In [18]:
spark.sql("select q, link as url from grtbl3").createOrReplaceTempView("nltbl")
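The `like '%...%'` conditions above are simple substring matches on the link text. The sketch below reproduces the same rule in plain Python on three links taken from the search output shown earlier; `is_review_link` is a hypothetical helper name used only for illustration.

```python
KEYWORDS = ("review", "article", "amazon")

def is_review_link(link):
    """Return True when the link contains any of the filter keywords (case-sensitive)."""
    return any(keyword in link for keyword in KEYWORDS)

links = [
    "https://eero.com/",
    "https://www.amazon.com/eero-Home-WiFi-System-Pack/dp/B00XEW3YD6",
    "https://www.cnet.com/products/eero-wi-fi-system/",
]

kept = [link for link in links if is_review_link(link)]
print(kept)
```

Because the rule only looks at the URL text, a page that is a review but does not say so in its address (such as the cnet.com product page here) is dropped.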

The Watson API also needs a username and password for basic authentication, so we define them here. You obtain these values when you subscribe to the Watson NLU API service on IBM Cloud. We leave them blank because they are sensitive data; use the credentials you create for your own Watson NLU API service.

We also define the URI for the Watson NLU API, specifying 'sentiment' as the feature we want it to return.

In [1]:
nluUsername = ''
nluPassword = ''

In [29]:
nluurisent = "https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze?version=2017-02-27&features=sentiment"

We now create a parameter map for calling the Watson NLU API through the Rest Data Source, which will call the API for all of the links in parallel. We pass the url, the name of the Temporary Spark Table that holds the inputs, the username, the password and the number of partitions (to increase the parallelism). The map also sets 'callStrictlyOnce' to 'Y'. Finally, we raise 'readTimeout' from its default of 5000 to 20000, because analyzing the content of a site can take time.

In [73]:
parmw = {'url' : nluurisent, 'input' : 'nltbl', 'callStrictlyOnce' : 'Y', 'userId' : nluUsername, 'userPassword' : nluPassword, 'partitions' : '10', 'readTimeout' : '20000'}

In [31]:
nlusentDf = spark.read.format("org.apache.dsext.spark.datasource.rest.RestDataSource").options(**parmw).load().cache()
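To make the authentication step concrete, here is a hedged sketch of what a single basic-auth request for one link could look like using only the standard library. Several details are assumptions: the notebook does not show whether the Rest Data Source sends each row as a JSON body or as query parameters (a JSON body sent with POST is assumed here), 'readTimeout' is assumed to be in milliseconds, and `build_nlu_request` is a hypothetical helper. The request is only constructed, not sent.

```python
import base64
import json
from urllib.request import Request

def build_nlu_request(api_uri, page_url, username, password, read_timeout_ms=20000):
    """Build (but do not send) a basic-auth POST request for one page URL."""
    # Basic authentication: base64("username:password") in the Authorization header.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    # Assumption: the page to analyze is passed as a JSON body with the key 'url'.
    body = json.dumps({"url": page_url}).encode("utf-8")
    request = Request(api_uri, data=body, method="POST")
    request.add_header("Authorization", "Basic " + token)
    request.add_header("Content-Type", "application/json")
    return request, read_timeout_ms / 1000.0

nlu_uri = ("https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze"
           "?version=2017-02-27&features=sentiment")
# Placeholder credentials for illustration only.
request, timeout_seconds = build_nlu_request(nlu_uri, "https://eero.com/", "user", "pass")
print(request.get_method(), request.full_url, timeout_seconds)
```

Sending it would be a matter of calling `urllib.request.urlopen(request, timeout=timeout_seconds)` with real credentials.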

Let us check the schema of the results returned by Watson NLU API

In [32]:
nlusentDf.printSchema()

root
 |-- output: struct (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- retrieved_url: string (nullable = true)
 |    |-- sentiment: struct (nullable = true)
 |    |    |-- document: struct (nullable = true)
 |    |    |    |-- label: string (nullable = true)
 |    |    |    |-- score: double (nullable = true)
 |    |-- usage: struct (nullable = true)
 |    |    |-- features: long (nullable = true)
 |    |    |-- text_characters: long (nullable = true)
 |    |    |-- text_units: long (nullable = true)
 |-- q: string (nullable = true)
 |-- url: string (nullable = true)



We are going to use the sentiment label and score from output.sentiment.document.

In [33]:
nlusentDf.createOrReplaceTempView("nlusenttbl")

In [34]:
sent2df = spark.sql("select q, url, output.sentiment.document.label as label, output.sentiment.document.score as score from nlusenttbl")

In [35]:
sent2df.createOrReplaceTempView("nlusent2tbl")

In [36]:
spark.sql("select q, label , avg(score) from nlusent2tbl group by q, label order by q").show(30, False)

+---------+--------+----------+
|q        |label   |avg(score)|
+---------+--------+----------+
|Luma wifi|negative|-0.0262883|
|Luma wifi|positive|0.2155561 |
|Orbi wifi|positive|0.185361  |
|eero wifi|positive|0.2959509 |
+---------+--------+----------+
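The query above averages the scores separately for each (q, label) pair, so a "positive" average only covers the pages Watson labelled positive, not all pages for that Router. A small plain-Python sketch of the same grouping follows; the input scores are made-up toy values chosen for easy arithmetic, not results from the API, and `average_score_by_label` is a hypothetical helper name.

```python
from collections import defaultdict

def average_score_by_label(records):
    """Average the score for each (q, label) pair, like GROUP BY q, label with avg(score)."""
    totals = defaultdict(lambda: [0.0, 0])  # (q, label) -> [sum of scores, count]
    for q, label, score in records:
        totals[(q, label)][0] += score
        totals[(q, label)][1] += 1
    return {key: total / count for key, (total, count) in sorted(totals.items())}

# Toy records of the form (q, label, score); illustrative values only.
records = [
    ("eero wifi", "positive", 0.5),
    ("eero wifi", "positive", 0.25),
    ("Luma wifi", "negative", -0.125),
]

averages = average_score_by_label(records)
for (q, label), avg in averages.items():
    print(q, label, avg)
```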



It looks like "eero" has the most positive feedback, with a higher average positive score than "Luma" or "Orbi". I use "eero" in my house, and it is really good.
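The comparison can be made explicit by ranking the average positive scores from the result table. The sketch below simply copies the three "positive" rows of that table into a dictionary and sorts them; the variable names are illustrative.

```python
# Average positive scores copied from the result table.
positive_avg = {
    "Luma wifi": 0.2155561,
    "Orbi wifi": 0.185361,
    "eero wifi": 0.2959509,
}

# Sort the Routers from highest to lowest average positive score.
ranking = sorted(positive_avg.items(), key=lambda item: item[1], reverse=True)
for q, score in ranking:
    print(q, score)
```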