# <center>DATA643: Recommender System </center>
## <center> Final Project </center>
### <i> <center> Harpreet Shoker, Rose Koh, Summer 2018 </center> </i>

## Notebook5_TFIDF

---

## Get Data

In [1]:
import os
datasets_path = os.path.join(os.getcwd(), 'data')
dt_path = os.path.join(datasets_path, 'instacart_2017_05_01.tar.gz')

In [2]:
from subprocess import check_output
print(check_output(["ls", "./data/instacart_2017_05_01"]).decode("utf8"))

aisles.csv
departments.csv
order_products__prior.csv
order_products__train.csv
orders.csv
products.csv
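
The cells above assume the archive has already been unpacked and rely on the Unix `ls` command. A minimal, portable sketch of both steps using only the standard library follows. The helper names `extract_archive` and `list_csv_files` are hypothetical and not part of the original notebook, and the layout inside the archive is assumed rather than known.

```python
import os
import tarfile


def extract_archive(archive_path, target_dir):
    """Unpack a .tar.gz archive into target_dir (created if missing)."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)


def list_csv_files(directory):
    """Return the sorted names of the CSV files directly inside directory."""
    return sorted(
        name for name in os.listdir(directory) if name.endswith(".csv")
    )
```

With the notebook's variables, this would be called as `extract_archive(dt_path, datasets_path)` followed by `list_csv_files(os.path.join(datasets_path, 'instacart_2017_05_01'))`, assuming the archive unpacks into that folder.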



---

## ALS with Spark ML library 

Spark's MLlib machine learning library implements collaborative filtering with Alternating Least Squares (ALS). The MLlib implementation has these parameters:

* numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
* rank is the number of latent factors in the model.
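* iterations is the number of iterations to run (exposed as `maxIter` in the DataFrame-based `pyspark.ml.recommendation.ALS`).
* lambda specifies the regularization parameter in ALS (exposed as `regParam` in `pyspark.ml.recommendation.ALS`).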
* implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
* alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
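
To make the roles of rank, iterations, and lambda concrete, here is a minimal single-machine sketch of explicit-feedback ALS in plain Python. It is an illustration under stated assumptions, not Spark's implementation: Spark distributes the work across blocks and, according to its documentation, scales the regularization by the number of ratings, neither of which is done here. The toy `ratings` dictionary and all function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the small square system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def update_factors(ratings_by_row, fixed, rank, reg):
    """Regularized least-squares update of one side, other side held fixed."""
    updated = {}
    for row_id, observed in ratings_by_row.items():
        # Normal equations: (reg * I + sum f f^T) x = sum r f
        a = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for col_id, r in observed.items():
            f = fixed[col_id]
            for i in range(rank):
                b[i] += r * f[i]
                for j in range(rank):
                    a[i][j] += f[i] * f[j]
        updated[row_id] = solve(a, b)
    return updated


def als(ratings, rank=2, iterations=10, reg=0.1, seed=0):
    """Alternate user and item updates for a fixed number of iterations."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for (u, i), r in ratings.items():
        by_user.setdefault(u, {})[i] = r
        by_item.setdefault(i, {})[u] = r
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    users = {}
    for _ in range(iterations):
        users = update_factors(by_user, items, rank, reg)
        items = update_factors(by_item, users, rank, reg)
    return users, items


def predict(users, items, u, i):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(x * y for x, y in zip(users[u], items[i]))


def regularized_loss(ratings, users, items, reg):
    """Squared error on observed ratings plus reg times squared factor norms."""
    err = sum((r - predict(users, items, u, i)) ** 2
              for (u, i), r in ratings.items())
    norm = sum(x * x for vec in list(users.values()) + list(items.values())
               for x in vec)
    return err + reg * norm


# Hypothetical toy data: (user, product) -> rating.
ratings = {
    ("u1", "apples"): 5.0, ("u1", "bread"): 3.0,
    ("u2", "apples"): 4.0, ("u2", "milk"): 1.0,
    ("u3", "bread"): 2.0, ("u3", "milk"): 5.0,
}
users, items = als(ratings, rank=2, iterations=10, reg=0.1)
```

Each half-step solves one small `rank` by `rank` linear system per user or item, so more iterations can only lower (or hold) the regularized loss, while a larger `reg` shrinks the factors toward zero.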

See documentation at https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html
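
For implicit feedback such as purchase counts, the role of alpha can be sketched with the formulation from the implicit-feedback ALS paper by Hu, Koren, and Volinsky, which the Spark documentation cites: an observed count becomes a binary preference plus a confidence weight that grows with the count. The function name below is hypothetical, and Spark's internal handling may differ in detail.

```python
def implicit_targets(count, alpha=1.0):
    """Map an interaction count to (preference, confidence).

    preference is 1.0 if the user interacted with the item at all,
    confidence is 1 + alpha * count, so unobserved pairs keep the
    baseline confidence of 1.0.
    """
    preference = 1.0 if count > 0 else 0.0
    confidence = 1.0 + alpha * count
    return preference, confidence
```

Under this sketch, a product bought three times with `alpha=2.0` yields a preference of 1.0 with confidence 7.0, while a product never bought yields a preference of 0.0 with confidence 1.0.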

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('recommend').getOrCreate()

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 10 µs


In [4]:
aisles_path = os.path.join(datasets_path, 'instacart_2017_05_01', 'aisles.csv')
departments_path = os.path.join(datasets_path, 'instacart_2017_05_01', 'departments.csv')
order_products__prior_path = os.path.join(datasets_path, 'instacart_2017_05_01', 'order_products__prior.csv')
order_products__train_path = os.path.join(datasets_path, 'instacart_2017_05_01', 'order_products__train.csv')
orders_path = os.path.join(datasets_path, 'instacart_2017_05_01', 'orders.csv')
products_path = os.path.join(datasets_path, 'instacart_2017_05_01', 'products.csv')

CPU times: user 34 µs, sys: 1e+03 ns, total: 35 µs
Wall time: 40.3 µs


In [5]:
aisles_df = spark.read.csv(aisles_path, inferSchema=True, header=True)
aisles_df.printSchema()

root
 |-- aisle_id: integer (nullable = true)
 |-- aisle: string (nullable = true)



In [6]:
departments_df = spark.read.csv(departments_path, inferSchema=True, header=True)
departments_df.printSchema()

root
 |-- department_id: integer (nullable = true)
 |-- department: string (nullable = true)



In [7]:
order_products_prior_df = spark.read.csv(order_products__prior_path, inferSchema=True, header=True)
order_products_prior_df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- add_to_cart_order: integer (nullable = true)
 |-- reordered: integer (nullable = true)



In [8]:
order_products_train_df = spark.read.csv(order_products__train_path, inferSchema=True, header=True)
order_products_train_df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- add_to_cart_order: integer (nullable = true)
 |-- reordered: integer (nullable = true)



In [9]:
orders_df = spark.read.csv(orders_path, inferSchema=True, header=True)
orders_df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- eval_set: string (nullable = true)
 |-- order_number: integer (nullable = true)
 |-- order_dow: integer (nullable = true)
 |-- order_hour_of_day: integer (nullable = true)
 |-- days_since_prior_order: double (nullable = true)



In [10]:
products_df = spark.read.csv(products_path, inferSchema=True, header=True)
products_df.printSchema()

root
 |-- product_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- aisle_id: string (nullable = true)
 |-- department_id: string (nullable = true)
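
The schema above reports `aisle_id` and `department_id` as strings even though they hold integer IDs. One plausible cause, stated here as an assumption because the notebook does not diagnose it, is that some product names contain commas or quote characters, so a reader that mishandles the quoting shifts fields into the wrong columns. The sketch below uses a made-up row (not taken from `products.csv`) to show the effect with Python's standard `csv` module.

```python
import csv
import io

# Hypothetical row: the product name holds both a comma and doubled quotes.
line = '1,"Chocolate ""Chip"" Cookies, Large",61,19'

# Splitting on every comma ignores the quoting and yields an extra field,
# which pushes text into the columns that should hold the integer IDs.
naive_fields = line.split(",")

# A quote-aware parser keeps the product name together as one field.
parsed_fields = next(csv.reader(io.StringIO(line)))
```

Spark's CSV reader exposes `quote` and `escape` options that control this behaviour; whether adjusting them fixes the inferred types for this dataset is not verified here.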



In [11]:
# Merge prior order products with product metadata
ta = order_products_prior_df
tb = products_df
# Joining on an expression keeps product_id from both sides in the result
merged_order_products_prior_df = ta.join(tb, ta.product_id == tb.product_id, how='left')  # 'left_outer' is equivalent
merged_order_products_prior_df.show()
merged_order_products_prior_df.printSchema()

+--------+----------+-----------------+---------+----------+--------------------+--------+-------------+
|order_id|product_id|add_to_cart_order|reordered|product_id|        product_name|aisle_id|department_id|
+--------+----------+-----------------+---------+----------+--------------------+--------+-------------+
|       2|     33120|                1|        1|     33120|  Organic Egg Whites|      86|           16|
|       2|     28985|                2|        1|     28985|Michigan Organic ...|      83|            4|
|       2|      9327|                3|        0|      9327|       Garlic Powder|     104|           13|
|       2|     45918|                4|        1|     45918|      Coconut Butter|      19|           13|
|       2|     30035|                5|        0|     30035|   Natural Sweetener|      17|           13|
|       2|     17794|                6|        1|     17794|             Carrots|      83|            4|
|       2|     40141|                7|        1|     4