# <font color='blue'>Data Science Academy Big Data Real-Time Analytics with Python and Spark</font>

# <font color='blue'>Chapter 9</font>

## <font color='blue'>Spark MLlib - Recommendation System</font>

<strong> Description </strong>
<ul style="list-style-type:square">
  <li>Also called collaborative filtering.</li>
  <li>Analyzes past data to understand the behavior of people/entities.</li>
  <li>Recommendations are made based on similarity of behavior.</li>
  <li>Recommendations can be user-based or item-based.</li>
  <li>Recommendation algorithms expect the data in a specific format: [user_ID, item_ID, score].</li>
  <li>The score, also called the rating, indicates a user's preference for an item. It can be a boolean value, a rating, or even sales volume.</li>
</ul>
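
To make "similarity of behavior" concrete, here is a minimal plain-Python sketch that compares users with cosine similarity over the items they both rated. The triples are copied from the `user-item.csv` sample used in this notebook; the helper names (`ratings_by_user`, `cosine_similarity`) are illustrative assumptions, and cosine similarity is only one of several possible similarity measures. It is not what ALS computes internally.

```python
from math import sqrt

# (user_ID, item_ID, score) triples taken from the notebook's sample data
ratings = [
    (1001, 9001, 10.0), (1001, 9002, 1.0), (1001, 9003, 9.0),
    (1002, 9001, 3.0), (1002, 9002, 5.0), (1002, 9003, 1.0),
    (1004, 9001, 9.0), (1004, 9002, 2.0), (1004, 9003, 8.0),
]

def ratings_by_user(triples):
    """Group the triples into {user: {item: score}}."""
    grouped = {}
    for user, item, score in triples:
        grouped.setdefault(user, {})[item] = score
    return grouped

def cosine_similarity(a, b):
    """Cosine similarity restricted to the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

by_user = ratings_by_user(ratings)
sim_1001_1004 = cosine_similarity(by_user[1001], by_user[1004])
sim_1001_1002 = cosine_similarity(by_user[1001], by_user[1002])

# User 1004 rated the shared items much like user 1001 did; user 1002 did not.
print(sim_1001_1004 > sim_1001_1002)
```

In a user-based approach, items liked by the most similar users (here, user 1004 for user 1001) would be the natural candidates to recommend.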

In [2]:
from pyspark.ml.recommendation import ALS

In [3]:
# Hadoop access and reading of the source file
from pyspark.sql import SparkSession

# @hidden_cell
# This function is used to setup the access of Spark to your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def set_hadoop_config_with_credentials_f0d6ce325e0f4bc08812229b8d429dbe(name):
    """This function sets the Hadoop configuration so it is possible to
    access data from Bluemix Object Storage using Spark"""

    prefix = 'fs.swift.service.' + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/v3/auth/tokens')
    hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
    hconf.set(prefix + '.tenant', '9a0cc60102244d368e96a83f25d4ca89')
    hconf.set(prefix + '.username', '0caf8026c98a4342ac027a05416e6dee')
    hconf.set(prefix + '.password', 'D[Cvr1bgf9DM^I{C')
    hconf.setInt(prefix + '.http.port', 8080)
    hconf.set(prefix + '.region', 'dallas')
    hconf.setBoolean(prefix + '.public', False)

# you can choose any name
name = 'keystone'
set_hadoop_config_with_credentials_f0d6ce325e0f4bc08812229b8d429dbe(name)

spark = SparkSession.builder.getOrCreate()

# Load the data in the format ALS expects (user, item, rating)
ratingsRDD = sc.textFile('swift://CursoSpark.' + name + '/user-item.csv')
ratingsRDD.collect()

[u'1001,9001,10',
 u'1001,9002,1',
 u'1001,9003,9',
 u'1002,9001,3',
 u'1002,9002,5',
 u'1002,9003,1',
 u'1002,9004,10',
 u'1003,9001,2',
 u'1003,9002,6',
 u'1003,9003,2',
 u'1003,9004,9',
 u'1003,9005,10',
 u'1003,9006,8',
 u'1003,9007,9',
 u'1004,9001,9',
 u'1004,9002,2',
 u'1004,9003,8',
 u'1004,9004,3',
 u'1004,9010,10',
 u'1004,9011,9',
 u'1004,9012,8',
 u'1005,9001,8',
 u'1005,9002,3',
 u'1005,9003,7',
 u'1005,9004,1',
 u'1005,9010,9',
 u'1005,9011,10',
 u'1005,9012,9',
 u'1005,9013,8',
 u'1005,9014,1',
 u'1005,9015,1',
 u'1006,9001,7',
 u'1006,9002,4',
 u'1006,9003,8',
 u'1006,9004,1',
 u'1006,9010,7',
 u'1006,9011,6',
 u'1006,9012,9']

In [4]:
# Parse each CSV line into (user: int, item: int, rating: float)
ratingsRDD2 = ratingsRDD.map(lambda l: l.split(',')).map(lambda l:(int(l[0]), int(l[1]), float(l[2])))
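
The same parsing logic can be checked locally without a Spark cluster. The sketch below applies it to a few lines copied from the output above; `parse_rating` and `sample_lines` are hypothetical names introduced only for this illustration.

```python
def parse_rating(line):
    """Split one CSV line into (user: int, item: int, rating: float)."""
    user, item, rating = line.strip().split(',')
    return int(user), int(item), float(rating)

# Lines copied from the notebook's sample data
sample_lines = ['1001,9001,10', '1001,9002,1', '1001,9003,9']
parsed = [parse_rating(line) for line in sample_lines]
print(parsed)
```

A named function like this could also be passed to `ratingsRDD.map(...)` in place of the two chained lambdas, which may make it easier to test.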

In [6]:
# Creating a DataFrame
ratingsDF = spark.createDataFrame(ratingsRDD2, ["user", "item", "rating"])

In [6]:
ratingsDF.show()

+----+----+------+
|user|item|rating|
+----+----+------+
|1001|9001|  10.0|
|1001|9002|   1.0|
|1001|9003|   9.0|
|1002|9001|   3.0|
|1002|9002|   5.0|
|1002|9003|   1.0|
|1002|9004|  10.0|
|1003|9001|   2.0|
|1003|9002|   6.0|
|1003|9003|   2.0|
|1003|9004|   9.0|
|1003|9005|  10.0|
|1003|9006|   8.0|
|1003|9007|   9.0|
|1004|9001|   9.0|
|1004|9002|   2.0|
|1004|9003|   8.0|
|1004|9004|   3.0|
|1004|9010|  10.0|
|1004|9011|   9.0|
+----+----+------+
only showing top 20 rows



In [7]:
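# Building the model
# ALS = Alternating Least Squares --> Recommendation algorithm that factorizes the user-item rating matrix
# by alternately solving least-squares problems for the user factors and the item factors,
# minimizing the squared-error loss; it works very well in parallelized environments
# rank = number of latent factors, maxIter = number of alternating iterations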
als = ALS(rank = 10, maxIter = 5)
modelo = als.fit(ratingsDF)
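
The alternating idea behind ALS can be sketched in plain Python for the simplest case of a single latent factor (rank 1). This is a simplified illustration under stated assumptions, not Spark's implementation: the regularization value `reg`, the initial factors, and the subset of ratings are chosen only for the example. With the item factors held fixed, each user factor has a closed-form ridge-regression solution, and vice versa, so the regularized loss never increases from one sweep to the next.

```python
# Subset of the notebook's (user, item, rating) triples
ratings = [
    (1001, 9001, 10.0), (1001, 9002, 1.0), (1001, 9003, 9.0),
    (1002, 9001, 3.0), (1002, 9002, 5.0), (1002, 9003, 1.0), (1002, 9004, 10.0),
    (1004, 9001, 9.0), (1004, 9002, 2.0), (1004, 9003, 8.0), (1004, 9004, 3.0),
]
reg = 0.1  # hypothetical regularization strength

users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})
user_factor = {u: 1.0 for u in users}  # rank-1: one number per user
item_factor = {i: 1.0 for i in items}  # rank-1: one number per item

def loss():
    """Squared error on observed ratings plus an L2 penalty on all factors."""
    squared_error = sum((r - user_factor[u] * item_factor[i]) ** 2
                        for u, i, r in ratings)
    penalty = reg * (sum(x * x for x in user_factor.values())
                     + sum(x * x for x in item_factor.values()))
    return squared_error + penalty

losses = [loss()]
for _ in range(5):
    # Step 1: hold item factors fixed, solve each user's 1-D least squares
    for u in users:
        num = sum(r * item_factor[i] for uu, i, r in ratings if uu == u)
        den = reg + sum(item_factor[i] ** 2 for uu, i, _ in ratings if uu == u)
        user_factor[u] = num / den
    # Step 2: hold user factors fixed, solve each item's 1-D least squares
    for i in items:
        num = sum(r * user_factor[u] for u, ii, r in ratings if ii == i)
        den = reg + sum(user_factor[u] ** 2 for u, ii, _ in ratings if ii == i)
        item_factor[i] = num / den
    losses.append(loss())

print(losses[0] > losses[-1])
```

With `rank = 10`, as in the cell above, each user and item gets a 10-dimensional vector and each step solves a small linear system instead of a scalar division, but the alternation is the same.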

In [8]:
# Inspecting the learned user latent factors (the affinity score is the dot product of a user's and an item's factors)
modelo.userFactors.orderBy("id").collect()

[Row(id=1001, features=[0.3847953975200653, -1.4223843812942505, 0.5573229789733887, -0.08178026974201202, -0.4382185637950897, -0.25812771916389465, -0.2802934944629669, -0.5862515568733215, 0.06860082596540451, -0.49479079246520996]),
 Row(id=1002, features=[0.29569128155708313, 0.9334215521812439, 0.9352413415908813, -0.059310633689165115, -0.26951032876968384, 0.7360463738441467, -0.9229599833488464, 0.36304736137390137, 0.774177074432373, -0.1584303230047226]),
 Row(id=1003, features=[-0.006305389571934938, 1.0576293468475342, 0.7409896850585938, 0.11010909080505371, -0.3228450119495392, 0.5235232710838318, -0.0926164761185646, 0.32665541768074036, 0.9997615814208984, -0.425375372171402]),
 Row(id=1004, features=[0.39390310645103455, -0.6805849671363831, 0.7696443200111389, -0.20997174084186554, -0.5397744178771973, -0.14010731875896454, -0.591958224773407, -0.631860613822937, 0.19471824169158936, -0.5572392344474792]),
 Row(id=1005, features=[0.20535527169704437, -1.1405640840530
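
Each row above is a user's latent-factor vector; the model also exposes item vectors through `modelo.itemFactors`. A prediction for a (user, item) pair is the dot product of the two vectors. The sketch below shows that arithmetic with short, made-up 3-dimensional vectors (hypothetical values, not taken from the model output).

```python
def predict(user_vector, item_vector):
    """Predicted score = dot product of the user and item latent factors."""
    return sum(u * v for u, v in zip(user_vector, item_vector))

# Hypothetical factors, chosen only to keep the arithmetic easy to follow
user_vector = [0.5, -1.0, 2.0]
item_vector = [2.0, 1.0, 3.0]

score = predict(user_vector, item_vector)
print(score)  # 0.5*2.0 + (-1.0)*1.0 + 2.0*3.0 = 6.0
```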

In [9]:
# Creating a test dataset with user/item pairs to be scored
testeDF = spark.createDataFrame([(1001, 9003),(1001,9004),(1001,9005)], ["user", "item"])

In [10]:
# Predictions
# The prediction is the estimated rating (affinity score): the higher it is, the more likely the user is to accept the recommendation
previsoes = (modelo.transform(testeDF).collect())
previsoes

[Row(user=1001, item=9004, prediction=-0.6144895553588867),
 Row(user=1001, item=9005, prediction=-3.166208505630493),
 Row(user=1001, item=9003, prediction=8.998315811157227)]
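
Note that item 9003 was already rated 9.0 by user 1001 in the training data, so its prediction of about 9.0 mostly reflects how well the model fits known ratings. To turn predictions into recommendations, one plausible approach (an assumption for illustration, not something the notebook does) is to drop items the user has already rated and sort the rest by predicted score. The values below are copied from the output above; the variable names are hypothetical.

```python
# (user, item, prediction) values copied from the output above
predictions = [
    (1001, 9004, -0.6144895553588867),
    (1001, 9005, -3.166208505630493),
    (1001, 9003, 8.998315811157227),
]

# Items user 1001 already rated in the training data
already_rated = {9001, 9002, 9003}

# Keep only unseen items, highest predicted score first
candidates = [(item, score) for _, item, score in predictions
              if item not in already_rated]
ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
ranked_items = [item for item, _ in ranked]
print(ranked_items)
```

Both remaining predictions are negative, which suggests that neither unseen item is a strong recommendation for this user given so little data.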

# End

### Thank you - Data Science Academy - <a href=http://facebook.com/dsacademy>facebook.com/dsacademybr</a>