<a href="https://colab.research.google.com/github/ralsouza/apache_spark_real_time_analytics/blob/master/notebooks/recommendation_system/01_pyspark_mllib_recommendation_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark MLlib - Recommendation System
*   Also known as collaborative filtering;
*   Analyzes data to understand the behavior of people or entities;
*   Recommendations are made by behavioral similarity;
*   Recommendations can be user-based or item-based;
*   Recommendation algorithms expect to receive data in a specific format: `[user_ID,item_ID,score]`;
*   The `score`, also known as the `rating`, indicates a user's preference for an item. It can be a boolean value, a rating, or even a sales volume.
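As a small illustration of the `[user_ID,item_ID,score]` format, the plain-Python sketch below parses comma-separated lines into typed tuples. The helper name `parse_rating` and the boolean example lines are hypothetical; the explicit-rating lines follow the sample file loaded later in this notebook.

```python
def parse_rating(line):
    """Parse one 'user_ID,item_ID,score' line into a typed (user, item, score) tuple."""
    user_id, item_id, score = line.strip().split(",")
    return int(user_id), int(item_id), float(score)

# Explicit ratings, as in the sample file used in this notebook
explicit = [parse_rating(l) for l in ["1001,9001,10", "1001,9002,1"]]

# Boolean preferences (hypothetical example): 1 = purchased, 0 = not purchased
implicit = [parse_rating(l) for l in ["1001,9001,1", "1001,9002,0"]]

print(explicit)
print(implicit)
```

Whatever the score represents, the shape stays the same: two integer IDs followed by a numeric value.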



# PySpark Setup


In [None]:
!apt-get update

In [2]:
# Install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [3]:
# Environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [4]:
# Make pyspark "importable" (use the absolute SPARK_HOME path set above)
import findspark
findspark.init(os.environ["SPARK_HOME"])

In [5]:
# Libraries and Context Setup
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

In [6]:
# Spark configuration (serve the Spark UI on port 4050)
conf = SparkConf().set("spark.ui.port", "4050")

# Create the Spark context
sc = pyspark.SparkContext(conf=conf)

# Create the Spark session
spark = SparkSession.builder.master('local').appName('spark_ml_lib').getOrCreate()

# Create the SQL context
sqlContext = pyspark.SQLContext(sc)

# Recommendation System

In [7]:
# Imports
from pyspark.ml.recommendation import ALS

In [8]:
# Spark session (getOrCreate() returns the session already created during setup)
sp_session = SparkSession.builder.master('local').appName('app_recom_system').getOrCreate()

In [9]:
# Load data
rdd_ratings = sc.textFile('/content/drive/My Drive/Colab Notebooks/08-apache-spark/data/mllib/user-item.txt')

In [12]:
# Show the first 20 lines; the data is in the format ALS (Alternating Least Squares) expects: user,item,rating
rdd_ratings.take(20)

['1001,9001,10',
 '1001,9002,1',
 '1001,9003,9',
 '1002,9001,3',
 '1002,9002,5',
 '1002,9003,1',
 '1002,9004,10',
 '1003,9001,2',
 '1003,9002,6',
 '1003,9003,2',
 '1003,9004,9',
 '1003,9005,10',
 '1003,9006,8',
 '1003,9007,9',
 '1004,9001,9',
 '1004,9002,2',
 '1004,9003,8',
 '1004,9004,3',
 '1004,9010,10',
 '1004,9011,9']

In [15]:
# Parse each line into a typed tuple: (user: int, item: int, rating: float)
rdd_ratings2 = rdd_ratings.map(lambda l: l.split(',')).map(lambda l: (int(l[0]), int(l[1]), float(l[2])))

In [17]:
# Check the converted data
rdd_ratings2.take(20)

[(1001, 9001, 10.0),
 (1001, 9002, 1.0),
 (1001, 9003, 9.0),
 (1002, 9001, 3.0),
 (1002, 9002, 5.0),
 (1002, 9003, 1.0),
 (1002, 9004, 10.0),
 (1003, 9001, 2.0),
 (1003, 9002, 6.0),
 (1003, 9003, 2.0),
 (1003, 9004, 9.0),
 (1003, 9005, 10.0),
 (1003, 9006, 8.0),
 (1003, 9007, 9.0),
 (1004, 9001, 9.0),
 (1004, 9002, 2.0),
 (1004, 9003, 8.0),
 (1004, 9004, 3.0),
 (1004, 9010, 10.0),
 (1004, 9011, 9.0)]

In [18]:
# Create a dataframe
df_rating = sp_session.createDataFrame(rdd_ratings2,['user','item','rating'])

In [19]:
df_rating.show()

+----+----+------+
|user|item|rating|
+----+----+------+
|1001|9001|  10.0|
|1001|9002|   1.0|
|1001|9003|   9.0|
|1002|9001|   3.0|
|1002|9002|   5.0|
|1002|9003|   1.0|
|1002|9004|  10.0|
|1003|9001|   2.0|
|1003|9002|   6.0|
|1003|9003|   2.0|
|1003|9004|   9.0|
|1003|9005|  10.0|
|1003|9006|   8.0|
|1003|9007|   9.0|
|1004|9001|   9.0|
|1004|9002|   2.0|
|1004|9003|   8.0|
|1004|9004|   3.0|
|1004|9010|  10.0|
|1004|9011|   9.0|
+----+----+------+
only showing top 20 rows



# Creating the model
`ALS` (Alternating Least Squares) is a matrix factorization algorithm for recommendation systems. It minimizes a `loss function` by alternately solving for the user factors and the item factors, and it works very well in parallel environments.
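To make the alternating idea concrete, here is a minimal pure-Python sketch with a single latent factor per user and item (rank 1). It is not Spark's implementation: the regularization value `REG`, the starting values, and the helper names `loss` and `solve_side` are assumptions made for illustration. The ratings are the first rows of the sample data shown above.

```python
# Observed (user, item, rating) triples from the first rows of the sample data
ratings = [
    (1001, 9001, 10.0), (1001, 9002, 1.0), (1001, 9003, 9.0),
    (1002, 9001, 3.0), (1002, 9002, 5.0), (1002, 9003, 1.0),
]

REG = 0.1  # L2 regularization strength (assumed value)

def loss(user_f, item_f):
    """Regularized squared error over the observed ratings."""
    err = sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings)
    penalty = REG * (sum(v * v for v in user_f.values())
                     + sum(v * v for v in item_f.values()))
    return err + penalty

def solve_side(fixed, own_idx, other_idx):
    """Closed-form least-squares update for one side while the other is held fixed."""
    num, den = {}, {}
    for row in ratings:
        own, other, r = row[own_idx], row[other_idx], row[2]
        num[own] = num.get(own, 0.0) + r * fixed[other]
        den[own] = den.get(own, 0.0) + fixed[other] ** 2
    return {k: num[k] / (den[k] + REG) for k in num}

# Start every factor at 1.0 (arbitrary starting point)
user_f = {u: 1.0 for u, _, _ in ratings}
item_f = {i: 1.0 for _, i, _ in ratings}
initial_loss = loss(user_f, item_f)

for _ in range(5):
    user_f = solve_side(item_f, 0, 1)  # fix items, solve for users
    item_f = solve_side(user_f, 1, 0)  # fix users, solve for items

final_loss = loss(user_f, item_f)
print(initial_loss, final_loss)
```

Each half-step is an ordinary least-squares problem with a closed-form solution, and the updates for different users (or different items) are independent of each other, which is why the method parallelizes well.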

In [20]:
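# rank: number of latent factors per user/item; maxIter: number of ALS iterations
# ALS defaults to the column names 'user', 'item' and 'rating', which match df_rating
als = ALS(rank=10, maxIter=5)
model = als.fit(df_rating)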

In [21]:
# Inspect the learned latent factors: one vector of length `rank` per user
model.userFactors.orderBy('id').collect()

[Row(id=1001, features=[0.9619649648666382, 0.6547734141349792, -0.26652851700782776, -0.009926089085638523, -0.21235449612140656, -0.2957074046134949, -0.08845337480306625, 1.1089266538619995, -0.3647739291191101, 0.3443385064601898]),
 Row(id=1002, features=[-0.1572585105895996, -0.917250394821167, 0.6700908541679382, -0.39680248498916626, -0.5441147685050964, -0.6537660956382751, 0.3588864505290985, 0.6562268137931824, 0.8832180500030518, -0.26746198534965515]),
 Row(id=1003, features=[-0.056894171983003616, -0.7523425221443176, 0.3572603166103363, -0.4809357523918152, 0.1021844670176506, -0.8870552182197571, 0.5259473323822021, 0.15355284512043, 0.8263186812400818, -0.21205703914165497]),
 Row(id=1004, features=[0.474994421005249, 0.3911818265914917, 0.010068115778267384, -0.3692135512828827, -0.20011432468891144, -0.4847201704978943, -0.1547415554523468, 1.2674859762191772, 0.056014977395534515, 0.3388582468032837]),
 Row(id=1005, features=[1.1054517030715942, 0.08229895681142807,
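Each `features` vector in the output above is a user's latent factor vector; the model holds a matching vector per item in `model.itemFactors`. The predicted rating for a (user, item) pair is the dot product of the two vectors. The sketch below illustrates this with short made-up vectors (not values from this model); the helper name `predict_rating` is hypothetical.

```python
def predict_rating(user_vec, item_vec):
    """Dot product of a user's and an item's latent factor vectors."""
    if len(user_vec) != len(item_vec):
        raise ValueError("factor vectors must have the same length (the model rank)")
    return sum(u * i for u, i in zip(user_vec, item_vec))

# Hypothetical rank-3 factors, chosen only to keep the arithmetic easy to follow
user_vec = [0.5, -1.0, 2.0]
item_vec = [2.0, 1.0, 0.5]

score = predict_rating(user_vec, item_vec)  # 0.5*2.0 + (-1.0)*1.0 + 2.0*0.5
print(score)
```

Because nothing constrains the dot product to the original rating scale, predictions can fall outside it, including below zero.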

In [22]:
# Create a test dataset of (user, item) pairs to score
df_test = sp_session.createDataFrame([(1001,9003),(1001,9004),(1001,9005)],['user','item'])

In [24]:
df_test.show()

+----+----+
|user|item|
+----+----+
|1001|9003|
|1001|9004|
|1001|9005|
+----+----+



In [25]:
# Make predictions
# The lower the predicted value, the less likely the customer is to buy the product
pred = model.transform(df_test).collect()
pred

[Row(user=1001, item=9004, prediction=-0.7371293306350708),
 Row(user=1001, item=9005, prediction=-2.5490427017211914),
 Row(user=1001, item=9003, prediction=9.006467819213867)]

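Reading the outcomes:
*   User 1001 is unlikely to buy item 9004: the predicted rating is negative;
*   The same applies to item 9005;
*   User 1001 will probably buy item 9003: the predicted rating is about 9, close to the rating of 9 this user gave the item in the training data.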
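The reading above can be automated by ranking the predictions. Below is a small plain-Python sketch that uses the three predicted values printed above. The cut-off of 0 for "unlikely to buy" is an assumption taken from the interpretation in this notebook, not a rule of ALS.

```python
# (user, item, prediction) values copied from the output above
predictions = [
    (1001, 9004, -0.7371293306350708),
    (1001, 9005, -2.5490427017211914),
    (1001, 9003, 9.006467819213867),
]

THRESHOLD = 0.0  # assumed cut-off: predictions at or below it count as "unlikely to buy"

# Highest predicted rating first
ranked = sorted(predictions, key=lambda row: row[2], reverse=True)

likely = [item for _, item, score in ranked if score > THRESHOLD]
unlikely = [item for _, item, score in ranked if score <= THRESHOLD]

print("likely:", likely)
print("unlikely:", unlikely)
```

In practice the cut-off would be chosen from the rating scale of the data rather than fixed at zero.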

