
# Related keywords detection

**NOTE**: This notebook depends on the Retrotech dataset. If you have any issues, please rerun the [Setting up the Retrotech Dataset](../ch04/1.setting-up-the-retrotech-dataset.ipynb) notebook.

In [1]:
import pandas as pd
from pyspark.sql import SparkSession
from aips import *
spark = SparkSession.builder.appName("AIPS").getOrCreate()
engine = get_engine()

### Step 1: Prepare the data using PySpark and DataFrames


## Listing 6.4

In [2]:
signals_collection = engine.get_collection("signals")
create_view(signals_collection, "signals")
spark.sql("""
SELECT LOWER(searches.target) AS keyword, searches.user
FROM signals AS searches WHERE searches.type='query'
""").createOrReplaceTempView("user_searches")

In [3]:
#Show Results:
spark.sql("""SELECT COUNT(*) FROM user_searches""").show(1)
spark.sql("""SELECT * FROM user_searches""").show(3)

+--------+
|count(1)|
+--------+
|  725459|
+--------+

+-------------------+-------+
|            keyword|   user|
+-------------------+-------+
|internal hard drive|u584986|
|               bose|u177210|
|        young jeezy|u515980|
+-------------------+-------+
only showing top 3 rows



### Step 2: Create Cooccurrence & PMI2 Model based on user searches

## Listing 6.5

In [4]:
#Calculation:
spark.sql("""
SELECT k1.keyword AS keyword1, k2.keyword AS keyword2,
COUNT(DISTINCT k1.user) users_cooc FROM user_searches k1
JOIN user_searches k2 ON k1.user = k2.user WHERE k1.keyword > k2.keyword
GROUP BY k1.keyword, k2.keyword
""").createOrReplaceTempView("keywords_users_cooc")

spark.sql("""
SELECT keyword, COUNT(DISTINCT user) users_occ FROM
user_searches GROUP BY keyword
""").createOrReplaceTempView("keywords_users_oc")


In [8]:
#Show Results
spark.sql("""SELECT * FROM keywords_users_oc
             ORDER BY users_occ DESC""").show(10)
spark.sql("SELECT COUNT(1) AS keywords_users_cooc FROM keywords_users_cooc").show()
spark.sql("""SELECT * FROM keywords_users_cooc
             ORDER BY users_cooc DESC""").show(10)

+-----------+---------+
|    keyword|users_occ|
+-----------+---------+
|     lcd tv|     8449|
|       ipad|     7749|
|hp touchpad|     7144|
|  iphone 4s|     4642|
|   touchpad|     4019|
|     laptop|     3625|
|    laptops|     3435|
|      beats|     3282|
|       ipod|     3164|
| ipod touch|     2992|
+-----------+---------+
only showing top 10 rows

+-------------------+
|keywords_users_cooc|
+-------------------+
|             244876|
+-------------------+

+-------------+---------------+----------+
|     keyword1|       keyword2|users_cooc|
+-------------+---------------+----------+
|green lantern|captain america|        23|
|    iphone 4s|         iphone|        21|
|       laptop|      hp laptop|        20|
|         thor|captain america|        18|
|         bose|          beats|        17|
|   skullcandy|          beats|        17|
|    iphone 4s|       iphone 4|        17|
|      macbook|            mac|        16|
|         thor|  green lantern|        16|
|      lapt
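
The self-join in Listing 6.5 can be hard to read at a glance. The following is a minimal pure-Python sketch of the same idea on a handful of made-up `(user, keyword)` pairs: each keyword is counted once per user, and each keyword pair is counted once per user who searched for both. The function name `user_cooccurrences` and the sample data are illustrative assumptions and are not part of the notebook or the `aips` library.

```python
from collections import Counter, defaultdict
from itertools import combinations


def user_cooccurrences(searches):
    """Count per-user keyword occurrences and pairwise co-occurrences.

    searches: iterable of (user, keyword) pairs (hypothetical input shape).
    """
    # Collect the distinct, lowercased keywords each user searched for.
    keywords_by_user = defaultdict(set)
    for user, keyword in searches:
        keywords_by_user[user].add(keyword.lower())

    occurrences = Counter()    # keyword -> number of distinct users
    cooccurrences = Counter()  # (keyword1, keyword2) -> number of distinct users
    for keywords in keywords_by_user.values():
        occurrences.update(keywords)
        # combinations() over a sorted list yields (smaller, larger) pairs;
        # store them as (larger, smaller) to mirror k1.keyword > k2.keyword.
        for smaller, larger in combinations(sorted(keywords), 2):
            cooccurrences[(larger, smaller)] += 1
    return occurrences, cooccurrences


# Made-up sample data, for illustration only.
sample_searches = [
    ("u1", "Bose"), ("u1", "beats"), ("u1", "bose"),
    ("u2", "bose"), ("u2", "beats"), ("u2", "skullcandy"),
    ("u3", "beats"),
]
occ, cooc = user_cooccurrences(sample_searches)
print(occ.most_common())
print(cooc.most_common())
```

Storing each pair in one fixed order avoids counting `(bose, beats)` and `(beats, bose)` as separate rows, which is what the `k1.keyword > k2.keyword` condition does in the SQL.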

## Listing 6.6

In [9]:
#Calculation:
spark.sql("""
SELECT k1.keyword AS k1, k2.keyword AS k2, k1_k2.users_cooc,
k1.users_occ AS n_users1, k2.users_occ AS n_users2,
LOG(POW(k1_k2.users_cooc, 2) / (k1.users_occ * k2.users_occ)) AS pmi2
FROM keywords_users_cooc AS k1_k2 
JOIN keywords_users_oc AS k1 ON k1_k2.keyword1 = k1.keyword
JOIN keywords_users_oc AS k2 ON k1_k2.keyword2 = k2.keyword
""").createOrReplaceTempView("user_related_keywords_pmi")

In [10]:
#Show Results:
spark.sql("""SELECT * FROM user_related_keywords_pmi
             WHERE users_cooc > 5 ORDER BY pmi2 DESC""").show(10)

+-----------------+--------------------+----------+--------+--------+------------------+
|               k1|                  k2|users_cooc|n_users1|n_users2|              pmi2|
+-----------------+--------------------+----------+--------+--------+------------------+
|  iphone 4s cases|      iphone 4 cases|        10|     158|     740|-7.064075033237091|
|     sony laptops|          hp laptops|         8|     209|     432|-7.251876756849249|
|otterbox iphone 4|            otterbox|         7|     122|     787|-7.580428995040033|
|    green lantern|     captain america|        23|     963|    1091|-7.593914965772897|
|          kenwood|              alpine|        13|     584|     717|-7.815078108504774|
|      sony laptop|         dell laptop|        10|     620|     451|-7.936016631553724|
|   wireless mouse|           godfather|         6|     407|     248|-7.938722993151467|
|       hp laptops|        dell laptops|         6|     432|     269| -8.07961802938984|
|      mp3 players|  
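
As a sanity check on the `pmi2` column, the SQL expression from Listing 6.6 can be reproduced in plain Python. This is a small sketch, assuming that Spark's single-argument `LOG` is the natural logarithm (which `math.log` also computes); the helper name `pmi2` is chosen here for illustration. The input values are copied from the first and fourth rows of the output above.

```python
import math


def pmi2(users_cooc, n_users1, n_users2):
    """Mirror LOG(POW(users_cooc, 2) / (n_users1 * n_users2)) from Listing 6.6."""
    return math.log(users_cooc ** 2 / (n_users1 * n_users2))


# Rows taken from the output table above.
score_iphone_cases = pmi2(10, 158, 740)    # iphone 4s cases / iphone 4 cases
score_superheroes = pmi2(23, 963, 1091)    # green lantern / captain america
print(score_iphone_cases, score_superheroes)
```

Both values agree with the table to several decimal places (about -7.064 and -7.594). The scores are negative here because the squared co-occurrence count is much smaller than the product of the two occurrence counts.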

## Listing 6.7

In [14]:
#Calculation:
spark.sql("""
SELECT *, (r1 + r2 / (r1 * r2)) / 2 AS comp_score FROM (
  SELECT *, 
    RANK() OVER (PARTITION BY 1 ORDER BY users_cooc DESC) r1,
    RANK() OVER (PARTITION BY 1 ORDER BY pmi2 DESC) r2  
  FROM user_related_keywords_pmi)
""").createOrReplaceTempView("users_related_keywords_comp_score")

In [15]:
#Show Results:
spark.sql("""SELECT k1, k2, users_cooc, pmi2, r1, r2, comp_score 
FROM users_related_keywords_comp_score ORDER BY comp_score ASC""").show(20)

+-------------+---------------+----------+-------------------+---+------+------------------+
|           k1|             k2|users_cooc|               pmi2| r1|    r2|        comp_score|
+-------------+---------------+----------+-------------------+---+------+------------------+
|green lantern|captain america|        23| -7.593914965772897|  1|  8626|               1.0|
|    iphone 4s|         iphone|        21|-10.216737746029027|  2| 56156|              1.25|
|       laptop|      hp laptop|        20| -9.132682838345458|  3| 20383|1.6666666666666667|
|         thor|captain america|        18| -8.483026598234463|  4| 13190|             2.125|
|         bose|          beats|        17|-10.074222345094169|  5| 51916|               2.6|
|    iphone 4s|       iphone 4|        17| -10.07559536143275|  5| 51964|               2.6|
|   skullcandy|          beats|        17|  -9.00066454587719|  5| 18792|               2.6|
|         thor|  green lantern|        16| -8.593796095512285|  8| 140
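
The `comp_score` arithmetic can be checked by hand with a short Python sketch. The helper name `comp_score` below is illustrative; the expression is copied from the SQL in Listing 6.7, and the rank pairs come from the output above.

```python
def comp_score(r1, r2):
    """Same expression as the SQL: (r1 + r2 / (r1 * r2)) / 2."""
    return (r1 + r2 / (r1 * r2)) / 2


# (r1, r2) pairs copied from the first rows of the output above.
scores = [comp_score(r1, r2) for r1, r2 in [(1, 8626), (2, 56156), (3, 20383), (4, 13190)]]
print(scores)  # starts with 1.0, 1.25

# Rows tied on r1 get the same score even though r2 differs.
tied = {comp_score(5, r2) for r2 in (51916, 51964, 18792)}
print(tied)
```

Because division binds tighter than addition, `r2 / (r1 * r2)` reduces to `1 / r1`, so as written the score depends only on `r1`. That is consistent with the output above, where the three rows with `r1 = 5` all score 2.6. If a blend of both ranks is wanted, the expression would need different parentheses; that is an assumption about intent, not something the notebook states.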

### Step 3: Create Cooccurrence & PMI2 Model based on product interaction

## Listing 6.8

In [18]:
#Calculation:
spark.sql("""
SELECT LOWER(searches.target) AS keyword, searches.user AS user,
clicks.target AS product FROM signals AS searches
RIGHT JOIN signals AS clicks ON searches.query_id = clicks.query_id 
WHERE searches.type = 'query' AND clicks.type = 'click'
""").createOrReplaceTempView("keyword_click_product")


In [19]:
#Show Results:
print("Original signals format: ")
spark.sql("""SELECT * FROM signals WHERE type = 'query'""").show(3)
print("Simplified signals format: ")
spark.sql("""SELECT * FROM keyword_click_product""").show(3)

Original signals format: 
+--------------------+-----------+--------------------+-------------------+-----+-------+
|                  id|   query_id|         signal_time|             target| type|   user|
+--------------------+-----------+--------------------+-------------------+-----+-------+
|000090eb-aedc-4a6...|u584986_0_1|2019-10-08 09:25:...|internal hard drive|query|u584986|
|0002c1ad-2715-4e2...|u177210_0_1|2020-01-18 12:59:...|               Bose|query|u177210|
|00033270-d36e-4b9...|u515980_0_1|2020-04-10 07:35:...|        young jeezy|query|u515980|
+--------------------+-----------+--------------------+-------------------+-----+-------+
only showing top 3 rows

Simplified signals format: 
+-----------+-------+------------+
|    keyword|   user|     product|
+-----------+-------+------------+
|dc universe|u100011|883929194629|
|     dazzle|u100024|613570226642|
|       macs|u100031|885909431618|
+-----------+-------+------------+
only showing top 3 rows
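
To make the join in Listing 6.8 concrete, here is a minimal pure-Python sketch that pairs each click signal with the query signal sharing its `query_id`. It assumes each `query_id` has at most one query signal, which the notebook does not state; the function name `join_queries_to_clicks` and the sample signals (including the product IDs `P1` to `P3`) are made up for illustration.

```python
def join_queries_to_clicks(signals):
    """Return (keyword, user, product) rows for clicks that have a matching query."""
    # Index query signals by query_id (assumption: one query signal per query_id).
    queries = {s["query_id"]: s for s in signals if s["type"] == "query"}

    rows = []
    for signal in signals:
        if signal["type"] != "click":
            continue
        query = queries.get(signal["query_id"])
        if query is None:
            # A click without a matching query is dropped, as the
            # WHERE searches.type = 'query' filter does in the SQL.
            continue
        rows.append((query["target"].lower(), query["user"], signal["target"]))
    return rows


# Made-up signals, for illustration only.
sample_signals = [
    {"query_id": "u1_0_1", "type": "query", "target": "Bose", "user": "u1"},
    {"query_id": "u1_0_1", "type": "click", "target": "P1", "user": "u1"},
    {"query_id": "u1_0_1", "type": "click", "target": "P2", "user": "u1"},
    {"query_id": "u2_0_1", "type": "query", "target": "ipad", "user": "u2"},
    {"query_id": "u9_0_1", "type": "click", "target": "P3", "user": "u9"},
]
click_rows = join_queries_to_clicks(sample_signals)
print(click_rows)
```

Queries with no clicks and clicks with no query both disappear from the result, so although the SQL uses a `RIGHT JOIN`, the `WHERE` clause makes it behave like an inner join.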



## Listing 6.9

In [20]:
#Calculation:
spark.sql("""
SELECT k1.keyword AS k1, k2.keyword AS k2, SUM(p1) n_users1, SUM(p2) n_users2,
SUM(p1 + p2) AS users_cooc, COUNT(1) n_products FROM (
  SELECT keyword, product, COUNT(1) AS p1 FROM keyword_click_product
    GROUP BY keyword, product) AS k1 JOIN (
  SELECT keyword, product, COUNT(1) AS p2 FROM keyword_click_product
    GROUP BY keyword, product) AS k2 ON k1.product = k2.product
WHERE k1.keyword > k2.keyword GROUP BY k1.keyword, k2.keyword
""").createOrReplaceTempView("keyword_click_product_cooc")

In [21]:
#Show Results:
spark.sql("""SELECT COUNT(1) AS keyword_click_product_cooc FROM keyword_click_product_cooc""").show()
spark.sql("""SELECT * FROM keyword_click_product_cooc ORDER BY n_products DESC""").show(20)

+--------------------------+
|keyword_click_product_cooc|
+--------------------------+
|                   1579710|
+--------------------------+

+--------------+-------------+--------+--------+----------+----------+
|            k1|           k2|n_users1|n_users2|users_cooc|n_products|
+--------------+-------------+--------+--------+----------+----------+
|       laptops|       laptop|    3251|    3345|      6596|       187|
|       tablets|       tablet|    1510|    1629|      3139|       155|
|        tablet|         ipad|    1468|    7067|      8535|       146|
|       tablets|         ipad|    1359|    7048|      8407|       132|
|       cameras|       camera|     637|     688|      1325|       116|
|          ipad|        apple|    6706|    1129|      7835|       111|
|      iphone 4|       iphone|    1313|    1754|      3067|       108|
|    headphones|  head phones|    1829|     492|      2321|       106|
|        ipad 2|         ipad|    2736|    6738|      9474|        98|
| 
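
The aggregation in Listing 6.9 groups clicks by `(keyword, product)` and then pairs up keywords that led to clicks on the same product. The sketch below reproduces that logic in plain Python on made-up data; the function name `product_cooccurrences` and the sample clicks are illustrative assumptions.

```python
from collections import Counter, defaultdict


def product_cooccurrences(clicks):
    """clicks: iterable of (keyword, product) pairs, one per click."""
    # Equivalent of the inner subqueries: clicks per (keyword, product).
    clicks_per_pair = Counter(clicks)

    # Regroup so each product maps to {keyword: click count}.
    keywords_by_product = defaultdict(dict)
    for (keyword, product), count in clicks_per_pair.items():
        keywords_by_product[product][keyword] = count

    result = {}
    for counts in keywords_by_product.values():
        for k1, p1 in counts.items():
            for k2, p2 in counts.items():
                if k1 <= k2:
                    continue  # keep only k1 > k2, as in the SQL WHERE clause
                row = result.setdefault(
                    (k1, k2),
                    {"n_users1": 0, "n_users2": 0, "users_cooc": 0, "n_products": 0},
                )
                row["n_users1"] += p1
                row["n_users2"] += p2
                row["users_cooc"] += p1 + p2
                row["n_products"] += 1  # one shared product per joined row
    return result


# Made-up clicks, for illustration only.
sample_clicks = [
    ("laptops", "A"), ("laptops", "A"), ("laptop", "A"),
    ("laptops", "B"), ("laptop", "B"), ("laptop", "B"),
    ("ipad", "C"),
]
product_cooc = product_cooccurrences(sample_clicks)
print(product_cooc)
```

Note that `users_cooc` here is a sum of click counts over shared products rather than a count of distinct users, so it can exceed either keyword's individual count in later steps.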

## Listing 6.10

In [22]:
#Calculation:
spark.sql("""SELECT keyword, COUNT(1) AS n_users FROM keyword_click_product
GROUP BY keyword """).createOrReplaceTempView("keyword_click_product_oc")

In [None]:
#Show Results:
spark.sql("""SELECT COUNT(1) AS keyword_click_product_oc FROM keyword_click_product_oc""").show()
spark.sql("""SELECT * FROM keyword_click_product_oc ORDER BY n_users DESC""").show(20)

+------------------------+
|keyword_click_product_oc|
+------------------------+
|                   13744|
+------------------------+

+------------+-------+
|     keyword|n_users|
+------------+-------+
|        ipad|   7554|
| hp touchpad|   4829|
|      lcd tv|   4606|
|   iphone 4s|   4585|
|      laptop|   3554|
|       beats|   3498|
|     laptops|   3369|
|        ipod|   2949|
|  ipod touch|   2931|
|      ipad 2|   2842|
|      kindle|   2833|
|    touchpad|   2785|
|   star wars|   2564|
|      iphone|   2430|
|beats by dre|   2328|
|     macbook|   2313|
|  headphones|   2270|
|        bose|   2071|
|         ps3|   2041|
|         mac|   1851|
+------------+-------+
only showing top 20 rows



## Listing 6.11

In [23]:
# calculate PMI2, per Listing 6.6

#Calculation:
spark.sql("""
SELECT k1.keyword AS k1, k2.keyword AS k2, k1_k2.users_cooc,
k1.n_users AS n_users1, k2.n_users AS n_users2,
LOG(POW(k1_k2.users_cooc, 2) / (k1.n_users * k2.n_users)) AS pmi2
FROM keyword_click_product_cooc AS k1_k2 
JOIN keyword_click_product_oc AS k1 ON k1_k2.k1 = k1.keyword
JOIN keyword_click_product_oc AS k2 ON k1_k2.k2 = k2.keyword
""").createOrReplaceTempView("product_related_keywords_pmi")

In [25]:
#Show Results:
spark.sql("""SELECT COUNT(1) AS related_keywords_pmi FROM product_related_keywords_pmi""").show()
spark.sql("""SELECT * FROM product_related_keywords_pmi ORDER BY pmi2 DESC""").show(20)

+--------------------+
|related_keywords_pmi|
+--------------------+
|             1579710|
+--------------------+

+-------------------+-------------------+----------+--------+--------+------------------+
|                 k1|                 k2|users_cooc|n_users1|n_users2|              pmi2|
+-------------------+-------------------+----------+--------+--------+------------------+
|     hp touchpad 32|        hp touchpad|      4022|       1|    4829| 8.116674454791653|
|          pad pivot|        hp touchpad|      4022|       1|    4829| 8.116674454791653|
|        hp touchpad|     hp tablet 32gb|      4022|    4829|       1| 8.116674454791653|
|        hp touchpad|    hp tablet 32 gb|      4022|    4829|       1| 8.116674454791653|
|           touchpad|          pad pivot|      2350|    2785|       1| 7.592338061915025|
|           touchpad|     hp touchpad 32|      2350|    2785|       1| 7.592338061915025|
|           touchpad|    hp tablet 32 gb|      2350|    2785|       1| 7.5

In [26]:
# calculate comp_score, per Listing 6.7

#Calculation:
spark.sql("""
SELECT *, (r1 + r2 / (r1 * r2)) / 2 AS comp_score FROM (
  SELECT *,
    RANK() OVER (PARTITION BY 1 ORDER BY users_cooc DESC) r1,
    RANK() OVER (PARTITION BY 1 ORDER BY pmi2 DESC) r2
  FROM product_related_keywords_pmi)
""").createOrReplaceTempView("product_related_keywords_comp_score")

In [27]:
#Show Results:
spark.sql("""SELECT COUNT(1) product_related_keywords_comp_scores
             FROM product_related_keywords_comp_score""").show()
spark.sql("""
SELECT k1, k2, n_users1, n_users2, pmi2, comp_score 
FROM product_related_keywords_comp_score
ORDER BY comp_score ASC
""").show(20)

+------------------------------------+
|product_related_keywords_comp_scores|
+------------------------------------+
|                             1579710|
+------------------------------------+

+----------+-----------+--------+--------+------------------+------------------+
|        k1|         k2|n_users1|n_users2|              pmi2|        comp_score|
+----------+-----------+--------+--------+------------------+------------------+
|      ipad|hp touchpad|    7554|    4829|1.2318940540272372|               1.0|
|    ipad 2|       ipad|    2842|    7554| 1.430517155037946|              1.25|
|    tablet|       ipad|    1818|    7554|1.6685364924472557|1.6666666666666667|
|  touchpad|       ipad|    2785|    7554|1.2231908670315748|             2.125|
|   tablets|       ipad|    1627|    7554|1.7493143317791537|               2.6|
|     ipad2|       ipad|    1254|    7554|1.9027023623302282|3.0833333333333335|
|      ipad|      apple|    7554|    1814|1.4995901756327583|3.571428571428

Up next: [Misspelling detection and correction](../ch06/3.spell-correction.ipynb)