## JOB 2

In this job we are going to look into:

- joins
- getting unique values from a selected column
- a light touch on window functions

In [1]:
import sys; sys.path.insert(0, '..')
import findspark; findspark.init()

In [2]:
import json
import configparser
from os import environ, listdir, path

from pyspark import SparkConf
from pyspark import SparkFiles
from pyspark.sql import SparkSession

from src.commons import utils
from src.cross_domain_reviews import etl
from src.amazon_reviews.etl import to_stats_aggregation

In [3]:
import pyspark.sql.functions as F

In [4]:
# Targets EMR 6.10. Spark runs on the JVM, so the package versions below matter.
environ['PYSPARK_SUBMIT_ARGS'] = "--packages=com.amazonaws:aws-java-sdk:1.11.900,org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell"
environ['DEBUG'] = "1"

In [6]:
session, logger = utils.start_spark()

First we read the cleaned review datasets from S3. The cleaning is carried out in the `amazon_reviews` job (see the source code at https://github.com/zdjohn/spark-setup-workshop/tree/master/src/amazon_reviews).

In [7]:
music_df = utils.extract_parquet_data(session, etl.DIG_MUSIC)
video_df = utils.extract_parquet_data(session, etl.DIG_VIDEO)

In [8]:
music_df.printSchema()

root
 |-- customer_id: string (nullable = true)
 |-- marketplace: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: date (nullable = true)
 |-- year: integer (nullable = true)



In [9]:
video_df.printSchema()

root
 |-- customer_id: string (nullable = true)
 |-- marketplace: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: date (nullable = true)
 |-- year: integer (nullable = true)



In [12]:
music_df.agg(
    F.countDistinct('customer_id').alias('users_count')
    , F.countDistinct('product_id').alias('items_count')).show()

+-----------+-----------+
|users_count|items_count|
+-----------+-----------+
|     144240|     401484|
+-----------+-----------+



In [13]:
video_df.agg(
    F.countDistinct('customer_id').alias('users_count')
    , F.countDistinct('product_id').alias('items_count')).show()

+-----------+-----------+
|users_count|items_count|
+-----------+-----------+
|     501158|     127873|
+-----------+-----------+



We want to make video recommendations based on the user-item interaction information we learned from music (search for "transfer learning" if you would like to know more).

As a result, we need to find the overlapping users who purchased both video and music products, so that a model can learn the correlation between the `video` and `music` domains.

In [14]:
music_corss_reviews = etl.to_overlapping_customers(music_df, video_df)
music_corss_reviews.count()

20582

Here we see that **20582** customers (out of 144240 music customers and 501158 video customers) have shopped for both Amazon music and video products. That is roughly 14% of the music customers and 4% of the video customers.

In [15]:
corss_music_products = etl.to_overlapping_reviews(music_df, music_corss_reviews)
corss_music_products.agg(F.countDistinct('product_id')).show()

+-----------------+
|count(product_id)|
+-----------------+
|            96198|
+-----------------+



In [16]:
corss_video_products = etl.to_overlapping_reviews(video_df, music_corss_reviews)
corss_video_products.agg(F.countDistinct('product_id')).show()

+-----------------+
|count(product_id)|
+-----------------+
|            41764|
+-----------------+



In [None]:
session.stop()

## submit job from your local machine

Run the command `tox -e pack`. A releasable artifact will be generated inside the `./dist` folder:
```
dist/
  ├── dist_files.zip
  └── main.py
```  

Step into the `dist` folder and invoke `spark-submit`:

```
$: spark-submit \
    --master local[3] \
    --deploy-mode client \
    --packages=com.amazonaws:aws-java-sdk:1.11.900,org.apache.hadoop:hadoop-aws:3.2.0 \
    --py-files ./dist_files.zip \
    main.py --job=cross_domain --source_domain=music --target_domain=video --local_run=1
```
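The flags after `main.py` are passed to the job entry point rather than to Spark. The actual `main.py` is in the repository and is not reproduced here. As a rough illustration only, an entry point could read these flags with `argparse` as follows. The function name `parse_args`, the defaults, and the types are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse job flags like the ones in the spark-submit command (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Job launcher sketch")
    parser.add_argument("--job", required=True, help="name of the job to run")
    parser.add_argument("--source_domain", default=None)
    parser.add_argument("--target_domain", default=None)
    # Assumed to be an integer flag: 1 for a local run, 0 otherwise.
    parser.add_argument("--local_run", type=int, default=0)
    return parser.parse_args(argv)


# The same arguments as in the spark-submit command.
args = parse_args(
    ["--job=cross_domain", "--source_domain=music", "--target_domain=video", "--local_run=1"]
)
print(args.job, args.source_domain, args.target_domain, args.local_run)
```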