# Collaborative Filtering - Spark ML

#### Machine Learning - the science of getting computers to act without being explicitly programmed

Learn how to create a recommendation engine using the Alternating Least Squares (ALS) algorithm in Spark's machine learning library.

<img src='https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/FullFile.png' width="80%" height="80%"></img>

## Prepare and shape the data: "80% of a Data Scientist's job"

In [3]:
# This function includes credentials to your Object Storage.
# You might want to remove those credentials before you share your notebook.
def set_hadoop_config(name):
    """This function sets the Hadoop configuration so it is possible to
    access data from Bluemix Object Storage V3 using Spark"""

    prefix = 'fs.swift.service.' + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/v3/auth/tokens')
    hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
    hconf.set(prefix + '.tenant', 'XXXXXXXXX')
    hconf.set(prefix + '.username', 'XXXXXXXXX')
    hconf.set(prefix + '.password', 'XXXXXXXXX')
    hconf.setInt(prefix + '.http.port', 8080)
    hconf.set(prefix + '.region', 'dallas')
    hconf.setBoolean(prefix + '.public', True)
name = 'keystone'
set_hadoop_config(name)

#Load and clean data
loadRetailData = sc.textFile("swift://XXXXXXXXX." + name + "/OnlineRetail.csv.gz")

import re
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

header = loadRetailData.first()
#Remove the header
#Split by comma
#Remove bad data
#Convert to dataframe
loadRetailData = (loadRetailData
    .filter(lambda line: line != header)
    .map(lambda l: l.split(","))
    # Keep rows with a positive quantity, a stock code containing digits, and a customer id
    .filter(lambda l: int(l[3]) > 0
            and len(re.sub(r"\D", "", l[1])) != 0
            and len(l[6]) != 0)
    .map(lambda l: Row(inv=int(l[0]),
                       stockCode=int(re.sub(r"\D", "", l[1])), description=l[2],
                       quant=int(l[3]), invDate=l[4], price=float(l[5]),
                       custId=int(l[6]), country=l[7])))
retailDf = sqlContext.createDataFrame(loadRetailData)
retailDf.registerTempTable("retailPurchases")

query = """
SELECT
    custId, stockCode, 1 AS purch
FROM
    retailPurchases
GROUP BY
    custId, stockCode"""
retailDf = sqlContext.sql(query)

In [4]:
print(retailDf.take(3))

[Row(custId=12838, stockCode=22941, purch=1),
 Row(custId=17968, stockCode=22731, purch=1),
 Row(custId=16210, stockCode=20977, purch=1)]
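
To make the cleaning rules concrete without a Spark cluster, the plain-Python sketch below applies the same three filters and the same (custId, stockCode) deduplication to a handful of rows. The rows are made up for illustration and are not taken from OnlineRetail.csv, and `keep` and `to_pair` are hypothetical helper names. One side effect worth noticing: because non-digits are stripped, stock codes such as 85123A and 85123B collapse into the same integer.

```python
import re

# Made-up rows in the column order the notebook assumes:
# invoice, stock code, description, quantity, invoice date, unit price, customer id, country
SAMPLE_LINES = [
    "100001,85123A,HANGING LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom",
    "100002,85123B,HANGING LIGHT HOLDER BLUE,2,12/1/2010 9:01,2.55,17850,United Kingdom",
    "100003,84879,BIRD ORNAMENT,32,12/1/2010 8:34,1.69,13047,United Kingdom",
    "C100004,D,Discount,-1,12/1/2010 9:41,27.5,14527,United Kingdom",  # negative quantity
    "100005,22139,,56,12/1/2010 11:52,0,,United Kingdom",              # no customer id
    "100006,POST,POSTAGE,3,12/1/2010 8:45,18,12583,France",            # no digits in stock code
]


def keep(fields):
    """Mirror the notebook's filter on one split row."""
    return (int(fields[3]) > 0
            and len(re.sub(r"\D", "", fields[1])) != 0
            and len(fields[6]) != 0)


def to_pair(fields):
    """Reduce a row to (custId, stockCode) with the stock code's digits only."""
    return (int(fields[6]), int(re.sub(r"\D", "", fields[1])))


rows = [line.split(",") for line in SAMPLE_LINES]

# A set plays the role of GROUP BY custId, stockCode; every surviving pair gets purch = 1.
pairs = sorted({to_pair(fields) for fields in rows if keep(fields)})
purchases = [(cust, stock, 1) for cust, stock in pairs]
print(purchases)
```

The sketch splits on commas just as the notebook does, so a description that itself contains a comma would shift the columns; Python's `csv` module is one way to handle quoted fields if that turns out to matter for the data.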

## Build recommendation models

In [5]:
from pyspark.mllib.recommendation import ALS, Rating
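# Each row becomes Rating(user, product, rating); train with 15 latent factors for 15 iterations
model = ALS.trainImplicit(retailDf.rdd.map(lambda r: Rating(*r)), rank=15, iterations=15)
print("The model has been trained")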

The model has been trained
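
As a rough picture of what the trained model holds, ALS factorizes the customer-by-product matrix into one short vector per customer and one per product, and a recommendation score is the dot product of the two. The sketch below ranks products for one customer that way. The factor values are invented for illustration (and use rank 3 rather than 15 for brevity); `score` and `top_n` are hypothetical helpers, not part of the MLlib API.

```python
# Invented rank-3 latent factors; a trained model would learn these values.
user_factors = {
    12838: [0.9, 0.1, 0.0],
}
item_factors = {
    22941: [1.0, 0.0, 0.0],
    22731: [0.0, 1.0, 0.0],
    20977: [0.5, 0.5, 0.0],
}


def score(user_vec, item_vec):
    """Predicted preference: dot product of a user vector and an item vector."""
    return sum(u * i for u, i in zip(user_vec, item_vec))


def top_n(user_id, n):
    """Return the n highest-scoring (stockCode, score) pairs for one customer."""
    user_vec = user_factors[user_id]
    scored = [(item, score(user_vec, vec)) for item, vec in item_factors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


recs = top_n(12838, 2)
for stock_code, value in recs:
    print(stock_code, round(value, 2))
```

A higher rank gives the model more room to capture distinct tastes at the cost of more computation, which is the trade-off behind the `15` passed to `trainImplicit`.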


## Implement the model

In [6]:
!pip install pymongo --user



In [13]:
from pymongo import MongoClient
import json
import ssl


USERNAME = 'XXXXXXXXX'
PASSWORD = 'XXXXXXXXX'
MONGODB_URL = "mongodb://"+USERNAME+":"+PASSWORD+"@sl-us-dal-9-portal.3.dblayer.com:15511/recs?ssl=true"

# ssl.CERT_NONE skips certificate verification; acceptable for a demo, not for production
client = MongoClient(MONGODB_URL,ssl_cert_reqs=ssl.CERT_NONE)
db = client['recs']
collection = db['retail']
recDf = model.recommendProductsForUsers(5).flatMap(lambda l: l[1]).toDF()
allRecs = recDf.toJSON().collect()
jsonRecs = [json.loads(rec) for rec in allRecs]
result = collection.insert_many(jsonRecs)
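
The reshaping in this cell can be traced without Spark or MongoDB. The sketch below assumes the per-user output has the shape (user, [Rating, ...]), flattens it the way `flatMap(lambda l: l[1])` does, and builds one dictionary per recommendation. The `Rating` namedtuple is a local stand-in with the same three fields as MLlib's `Rating`, and the scores are invented.

```python
import json
from collections import namedtuple

# Local stand-in for pyspark.mllib.recommendation.Rating (same field names).
Rating = namedtuple("Rating", ["user", "product", "rating"])

# Invented output shaped like the per-user recommendations: (user, [Rating, ...]) pairs.
recs_by_user = [
    (12838, [Rating(12838, 22941, 0.91), Rating(12838, 20977, 0.52)]),
    (17968, [Rating(17968, 22731, 0.77)]),
]

# Equivalent of flatMap(lambda l: l[1]): drop the user key, keep one flat list of ratings.
flat = [rating for _, ratings in recs_by_user for rating in ratings]

# One plain dict per recommendation, the shape a document store expects.
docs = [dict(rating._asdict()) for rating in flat]

# The JSON round trip mirrors toJSON() followed by json.loads().
json_lines = [json.dumps(doc, sort_keys=True) for doc in docs]
print(json_lines[0])
```

Because `collect()` pulls every recommendation onto the driver before the insert, this approach suits a dataset of this size; a much larger catalogue would likely need to write from the executors instead.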

##### Data Citation
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).