# Name: Umad ul hassan Rai

# Experiment 4

Note: The preprocessing was run on all papers, but only 20000 TF-IDF vectors and 20000 LDA vectors were then selected so that the remaining steps could run on my laptop. I could not run the task on the cluster because of an nltk error.

PS: Since I did not run this on the cluster, I am uploading only the .ipynb file.

In [6]:
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.types import StringType
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql import Row
from pyspark.sql.functions import concat, col, lit, split, udf, size, avg
from pyspark.sql import functions as sf
from operator import add
import pandas as pd
import pyspark
import time
import csv
import re



In [7]:
# Initialise the Spark session, setting the broadcast timeout to 36000 seconds.
sparkSession = SparkSession.builder.appName("Exercise1").config("spark.sql.broadcastTimeout", "36000").getOrCreate()
sc = sparkSession.sparkContext
sqlC = SQLContext(sc)

# Vectors Representation

In [8]:
# Read papers.csv line by line and parse each line with csv.reader,
# stripping null bytes first because csv.reader may reject them.
# Build paperTermsRDD as (paper_id, text) pairs, where text is the
# concatenation of the title and abstract fields: fields[13] + " " + fields[14]
rdd2 = sparkSession.sparkContext.textFile("papers.csv")
paperTermsRDD = rdd2.map(lambda line: next(csv.reader(x.replace("\x00", "") for x in [line]))) \
                    .map(lambda fields: (fields[0], fields[13] + " " + fields[14]))
paperTermsRDD.take(1)

[('80546',
  "the arbitrariness of the genetic code the genetic code has been regarded as arbitrary in the sense that the codon-amino acid assignments could be different than they actually are. this general idea has been spelled out differently by previous, often rather implicit accounts of arbitrariness. they have drawn on the frozen accident theory, on evolutionary contingency, on alternative causal pathways, and on the absence of direct stereochemical interactions between codons and amino acids. it has also been suggested that the arbitrariness of the genetic code justifies attributing semantic information to macromolecules, notably to {dna}. i argue that these accounts of arbitrariness are unsatisfactory. i propose that the code is arbitrary in the sense of jacques monod's concept of chemical arbitrariness: the genetic code is arbitrary in that any codon requires certain chemical and structural properties to specify a particular amino acid, but these properties are not required in 

In [9]:
# Converting RDD to DF.
sentenceDataFrame = paperTermsRDD.toDF()
sentenceDataFrame = sentenceDataFrame.withColumnRenamed("_1", "Paper_Id").withColumnRenamed("_2", "Text")
sentenceDataFrame.show(1)

+--------+--------------------+
|Paper_Id|                Text|
+--------+--------------------+
|   80546|the arbitrariness...|
+--------+--------------------+
only showing top 1 row



In [10]:
# To speed up processing, a smaller dataset can be used by uncommenting the limit below.
# Remove this cell when running for the final results.
#sentenceDataFrame = sentenceDataFrame.select("*").limit(10000)
sentenceDataFrame.count()

172079

In [11]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# Using a regex tokenizer to split on every character that is not an English letter, (-) or (_)
regexTokenizer = RegexTokenizer(inputCol="Text", outputCol="Words", pattern=r"[^a-zA-Z_\-]")

# Counting tokens to get a sense of the length of each paper.
countTokens = udf(lambda words: len(words), IntegerType())

regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized = regexTokenized.select("Paper_Id", "Words") \
    .withColumn("tokens", countTokens(col("Words")))
regexTokenized.show(1, False)

+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
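
To see what the splitting pattern does without starting Spark, the same idea can be sketched with Python's `re` module. This is only an approximation of `RegexTokenizer`, which with its default settings lowercases the text and treats the pattern as a delimiter. The helper name `tokenize` and the sample sentence are made up for illustration.

```python
import re

# Same character class as the tokenizer pattern: anything that is not
# a letter, an underscore, or a hyphen acts as a delimiter.
SPLIT_PATTERN = re.compile(r"[^a-zA-Z_\-]")

def tokenize(text):
    """Lowercase the text, split on the delimiter pattern, drop empty tokens."""
    return [tok for tok in SPLIT_PATTERN.split(text.lower()) if tok]

sample = "The codon-amino acid assignments (in {DNA}) of 1998."
print(tokenize(sample))
```

Hyphenated words such as `codon-amino` survive this step as single tokens, which is why the next cells strip the hyphens separately.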

In [12]:
# Creating function to remove (-) and (_) from words
def remove_hyp_uscore(x):
    final = []
    for word in x:
        removed = word.replace("-","").replace("_","")
        final.append(removed)
    return final

# Creating function to remove words with length smaller than 3 to be used with udf
def rem_2len(x):
    final = []
    for word in x:
        if len(word)>2:
            final.append(word)
    return final

In [13]:
# Creating udf for removing (-) and (_), applying it on respective df
rem_hyp_udf = udf(remove_hyp_uscore, ArrayType(StringType()))

removed_hyp_df = regexTokenized.withColumn("Removed_Hyp", rem_hyp_udf(regexTokenized["words"])).\
                    select("Paper_Id","Removed_Hyp")

# Creating udf for removing words less than length 3 and applying it on respective df
rem_2len_words = udf(rem_2len, ArrayType(StringType()))

removed_2len_df = removed_hyp_df.withColumn("Rem_2len", rem_2len_words(removed_hyp_df["Removed_Hyp"])).\
                    select("Paper_Id", "Rem_2len")


In [14]:
removed_hyp_df.show(1,False)
removed_2len_df.show(1,False)

+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
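
The two helper functions can also be written as a single pass over the tokens. The sketch below is an alternative formulation in plain Python, not the code that produced the output above; `clean_tokens` and its `min_length` parameter are hypothetical names.

```python
def clean_tokens(tokens, min_length=3):
    """Strip '-' and '_' from each token, then keep tokens with at least min_length characters."""
    stripped = (tok.replace("-", "").replace("_", "") for tok in tokens)
    return [tok for tok in stripped if len(tok) >= min_length]

print(clean_tokens(["codon-amino", "of", "a_b", "x-y-z", "dna"]))
```

Filtering after stripping matters: a token like `a_b` has three characters before the underscore is removed and only two afterwards, so it is dropped.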

In [15]:
# Removing stopwords using Spark Stopwords remover.
from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="Rem_2len", outputCol="Filtered")
stopwords_removed = remover.transform(removed_2len_df).select("Paper_Id", "Filtered")

In [16]:
stopwords_removed.show(1,False)

+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [17]:
# Performing stemming by using PorterStemmer() from nltk
from nltk.stem import PorterStemmer

# Creating an object of PorterStemmer()
porter = PorterStemmer()

# Function for performing stemming of words by iterating over list of each paper
def stemming(x):
    porter = PorterStemmer()
    final = []
    for word in x:
        stemmed = porter.stem(word)
        final.append(stemmed)
    return final

# Stemming udf
stem_udf = udf(stemming, ArrayType(StringType()))

# Stemming the DataFrame obtained after removing stopwords.
stemmedDF = stopwords_removed.withColumn("Stemmed", stem_udf(stopwords_removed["Filtered"])).select("Paper_Id", "Stemmed")


In [18]:
stemmedDF.show(1,False)

+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Paper_Id|Stemmed                                                                                                          

In [19]:
stemmedDF.where(col('Stemmed').isNull()).count()

0

In [20]:
stemmedDF.count()

172079

In [21]:
# Attempt to write the DataFrame to CSV so it can be read from a file instead of running PorterStemmer on the cluster,
# because nltk 3.2.2 on the cluster raises an "index out of range" error and I cannot update it.
# def array_to_string(my_list):
#     return '[' + ','.join([str(elem) for elem in my_list]) + ']'

# array_to_string_udf = udf(array_to_string, StringType())

# test = stemmedDF.withColumn('column_as_str', array_to_string_udf(stemmedDF["Stemmed"]))

# test.drop('Stemmed').write.csv('stemmed.csv', sep='|')


In [22]:
# #Creating Stemmed words DF to use with cluster
# schema = StructType([
#     StructField("Paper_Id", StringType(), False),
#     StructField("Stemmed", StringType(), True)
# ])

# # Creating DataFrame for the stemmed words
# stem_test = sqlC.read.schema(schema)\
#         .option("header", 'False').option("delimiter", "|").csv("test1.csv")

# stem_test.printSchema()
# stem_test.show(2, False)


In [23]:
# Converting the Stemmed string column of stem_test back to an array
# so that the explode function can be applied to it.
# def str_to_arr(my_list):
#     my_list = my_list.strip('[]').split(",")
#     return [x for x in my_list]

# # String to array udf
# str_to_arr_udf = udf(str_to_arr,ArrayType(StringType()))

# stem_array = stem_test.withColumn('Stemmed_array',str_to_arr_udf(stem_test["Stemmed"]))

In [24]:
# stemmedDF = stem_array.drop('Stemmed').withColumnRenamed('Stemmed_array', 'Stemmed')
stemmedDF.show(1)

+--------+--------------------+
|Paper_Id|             Stemmed|
+--------+--------------------+
|   80546|[arbitrari, genet...|
+--------+--------------------+
only showing top 1 row



In [25]:
# Exploding the Stemmed column to get a paper_id|word DataFrame on which groupBy and other aggregate functions can be used.
exploded_stems = stemmedDF.select(stemmedDF.Paper_Id, sf.explode(stemmedDF.Stemmed)).withColumnRenamed("col", "Word")
exploded_stems.show(5, False)

+--------+---------+
|Paper_Id|Word     |
+--------+---------+
|80546   |arbitrari|
|80546   |genet    |
|80546   |code     |
|80546   |genet    |
|80546   |code     |
+--------+---------+
only showing top 5 rows



In [26]:
# Group by Word and collect the paper IDs both as a list and as a set, so that we can filter out
# words occurring in more than 10% of the papers and keep only words occurring in at least 20 papers.
grouped = exploded_stems.groupBy("Word").agg(sf.collect_list("Paper_Id"), sf.collect_set("Paper_Id")).\
            withColumnRenamed("collect_list(Paper_Id)", "Papers").withColumnRenamed("collect_set(Paper_Id)", "Papers_set")
grouped.show(2, False)

+-------+---------+----------+
|Word   |Papers   |Papers_set|
+-------+---------+----------+
|aaomega|[1270054]|[1270054] |
|abbrev |[758630] |[758630]  |
+-------+---------+----------+
only showing top 2 rows



In [27]:
# Counting the size of each paper set so we know in how many papers every word appeared.
grouped_withCount = grouped.withColumn("Count", size(("Papers_set")))

In [28]:
# The size of the set collected in the previous step gives the number of papers each word appears in.
# This makes it easy to drop words that appear in more than 10% of the papers and to keep only words
# that appear in at least 20 papers; sorting by it afterwards gives the top 1000 words.
# The list is kept for now because it helps in preparing the paper-term vectors.
grouped_withCount.show(1,False)

+-------+---------+----------+-----+
|Word   |Papers   |Papers_set|Count|
+-------+---------+----------+-----+
|aaomega|[1270054]|[1270054] |1    |
+-------+---------+----------+-----+
only showing top 1 row



In [29]:
# Remove the words that appear in more than 10% of the papers; the total number of papers is 172079.
# The Count column holds the number of papers each word appears in, so a filter on it
# is enough and no sorting is needed. The same column is used to keep only the
# words that appear in at least 20 papers.

final_words = grouped_withCount.filter(grouped_withCount.Count <= (0.1 * 172079) ).filter(grouped_withCount.Count >= 20)

In [30]:
final_words.show(2,False)

+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
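
The selection logic (count each word once per paper, keep words within the frequency band, then take the most frequent ones) can be illustrated on a toy corpus in plain Python. This is a sketch of the idea only, not the Spark job; `select_vocabulary`, its parameters, and the alphabetical tie-break are assumptions made for the example.

```python
from collections import Counter

def select_vocabulary(docs, min_docs=20, max_fraction=0.1, top_n=1000):
    """docs maps paper_id -> list of stemmed tokens; returns the selected words."""
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per paper
    max_docs = max_fraction * len(docs)
    kept = [(word, count) for word, count in doc_freq.items()
            if min_docs <= count <= max_docs]
    # Most frequent first; ties are broken alphabetically (an assumption).
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in kept[:top_n]]

toy_docs = {
    "1": ["genet", "code", "code"],
    "2": ["genet", "network"],
    "3": ["genet", "network"],
    "4": ["genet", "motif"],
}
# With these toy thresholds, "genet" is too common and "code"/"motif" too rare.
print(select_vocabulary(toy_docs, min_docs=2, max_fraction=0.5))
```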

In [31]:
final_1000 = final_words.orderBy(final_words.Count.desc()).select("*").limit(1000)

In [32]:
final_1000.printSchema()
# The count below is only a sanity check and can be removed.
final_words.count()

root
 |-- Word: string (nullable = true)
 |-- Papers: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- Papers_set: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- Count: integer (nullable = false)



14026

In [33]:
# Using the selected words to create the paper-term vectors, dropping the Papers_set and Count columns.
# The result is later converted to a DataFrame of the form paper_id | list of words in the paper.
paper_term = final_1000.select("Word", "Papers")

In [34]:
paper_term.printSchema()
paper_term.show(2,False)

root
 |-- Word: string (nullable = true)
 |-- Papers: array (nullable = true)
 |    |-- element: string (containsNull = false)

+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [35]:
# Exploding the Papers column so we get a one-to-one word-paper DataFrame
exploded_paperTerm = paper_term.select("Word", sf.explode("Papers")).withColumnRenamed("col", "Papers")

In [36]:
# Words repeat here because the collected list of papers (with duplicates) was exploded;
# this helps later in counting how often a specific term appears in a paper.
exploded_paperTerm.show(2)

+-----+-------+
| Word| Papers|
+-----+-------+
|activ|8823677|
|activ| 138401|
+-----+-------+
only showing top 2 rows



In [37]:
# Group exploded_paperTerm by the "Papers" column and collect all words as a list. This
# gives a DataFrame of the form paper: list of words appearing in that paper.
grouped_paperTerm = exploded_paperTerm.groupBy("Papers").agg(sf.collect_list("Word")).withColumnRenamed("collect_list(Word)", "Words")

In [38]:
grouped_paperTerm.show(2,False)

+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Papers|Words                                                                                                                                                                                                                                                                                          |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100   |[network, evolv, network, block, network, motif, design, complex, simpl, scienc, build]              

In [39]:
# Using CountVectorizer to create a sparse vector with the frequency of each term in a paper.
from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="Words", outputCol="features")
model = cv.fit(grouped_paperTerm)

result = model.transform(grouped_paperTerm)
result.show(2,truncate=False)

+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Papers|Words                                                                                                                                                                                                                                                                                          |features                                                                                                                                                                                              

In [40]:
finalModel = result.select("Papers", "features").\
            withColumnRenamed("features", "TermFrequencyVector").withColumnRenamed("Papers", "paper_id")

In [41]:
finalModel.show(2,truncate = False)

+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|paper_id|TermFrequencyVector                                                                                                                                                                                             |
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100     |(1000,[2,6,17,184,226,406,499,546,790],[3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                                                                                                                                   |
|100001  |(1000,[2,5,9,14,70,81,88,122,139,262,302,338,342,405,409,414,462,535,550,560,749,805,808,954,961,997],[1.0,2.0
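
The printed vectors use the sparse notation `(size, [indices], [values])`: the vocabulary size, the positions of the terms that occur, and their counts. The plain-Python sketch below rebuilds that triple for a word list; the vocabulary ordering is made up, so the indices will not match the ones Spark assigned above.

```python
from collections import Counter

def term_frequency_vector(words, vocabulary):
    """Return (size, indices, values) for the words that are in the vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[w] for w in words if w in index)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return len(vocabulary), indices, values

vocab = ["model", "data", "network", "evolv", "block"]  # hypothetical ordering
words = ["network", "evolv", "network", "block", "network"]
print(term_frequency_vector(words, vocab))
```

Here `network` occurs three times, so its position carries the value 3.0, just as index 2 does for paper 100 in the output above.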

In [42]:
finalModel.count()

166796

# TF-IDF

In [43]:
# Creating TF-IDF dataframe.
from pyspark.ml.feature import IDF, Tokenizer

idf = IDF(inputCol="TermFrequencyVector", outputCol="features")
idfModel = idf.fit(finalModel)
rescaledData = idfModel.transform(finalModel)

rescaledData.select("paper_id", "features").show(2,truncate = False)

+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|paper_id|features                                                                                                                                                                                                                                                                                                                                                                                                          

In [44]:
# Creating tf_idf dataframe and renaming features column to ptf_idf
tf_idf = rescaledData.select("paper_id", "features").withColumnRenamed("features", "ptf_idf")

In [45]:
tf_idf.show(2, False)

+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|paper_id|ptf_idf                                                                                                                                                                                                                                                                                                                                                                                                           
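
As a rough check on the numbers, the weighting can be reproduced by hand. Spark's `IDF` is documented as using the smoothed form log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term, and each term frequency is multiplied by that weight. The sketch below assumes that formula; the function names and toy counts are hypothetical.

```python
import math

def idf_weights(doc_freq, num_docs):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return {term: math.log((num_docs + 1) / (df + 1)) for term, df in doc_freq.items()}

def tf_idf(term_counts, idf):
    """Scale each raw term frequency by the IDF weight of its term."""
    return {term: tf * idf[term] for term, tf in term_counts.items()}

idf = idf_weights({"network": 1, "genet": 3}, num_docs=3)
print(tf_idf({"network": 3, "genet": 1}, idf))
```

A term that appears in every document gets weight log(1) = 0, so it contributes nothing to the TF-IDF vector.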

In [None]:
# Selecting a subset of 20000 paper vectors.
# Note: limit() without an ordering returns an arbitrary subset of rows.
tf_idf = tf_idf.select('*').limit(20000)
#tf_idf.count()

# LDA

In [48]:
# LDA
from pyspark.ml.clustering import LDA

# Train an LDA model.
lda = LDA(k=40, maxIter=10, featuresCol="TermFrequencyVector")
model = lda.fit(finalModel)

# Describe topics.
topics = model.describeTopics(5)


# Compute the topic distribution of each paper
transformed = model.transform(finalModel)


In [49]:
# Showing topics and transformed finalModel.
topics.show(5, truncate=False)
transformed.show(truncate=False)

+-----+-------------------------+-------------------------------------------------------------------------------------------------------------+
|topic|termIndices              |termWeights                                                                                                  |
+-----+-------------------------+-------------------------------------------------------------------------------------------------------------+
|0    |[283, 56, 297, 12, 470]  |[0.027895802667851813, 0.019167510377243725, 0.01899309527737269, 0.015175380480696126, 0.01369718699147568] |
|1    |[94, 1, 91, 92, 17]      |[0.0310701560095516, 0.030700210636843753, 0.023722260844310162, 0.02002468628135826, 0.018382978624044814]  |
|2    |[20, 9, 58, 955, 12]     |[0.01929833408705472, 0.016211025498438413, 0.013396872858057119, 0.012891429148514915, 0.00792607392030447] |
|3    |[287, 375, 365, 688, 962]|[0.09742876874785035, 0.09128346913908153, 0.0354628065160916, 0.027340198578862787, 0.0200232166844361

In [50]:
lda = transformed.select("paper_id", "topicDistribution")

In [51]:
lda.printSchema()

root
 |-- paper_id: string (nullable = false)
 |-- topicDistribution: vector (nullable = true)



In [52]:
lda.show(2,False)

+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|paper_id|topicDistribution                                                                                                        

In [53]:
lda.count()

166796

In [54]:
# Taking a subset of 20000 LDA vectors.
lda = lda.select('*').limit(20000)
lda.count()

20000


# User Profiling

In [55]:
# Creating the users DataFrame from users_libraries.txt
# with two columns: the user hash and the papers.
# Using the same DataFrames as created in the previous exercise.
# Both columns are read as StringType (the second column, Papers_ID, is converted to an array later).
schema = StructType([
    StructField("User", StringType(), False),
    StructField("Papers_ID", StringType(), True)
])

# Creating DataFrame df for users_libraries
userDF = sqlC.read.schema(schema)\
        .option("header", 'False').option("delimiter", ";").csv("users_libraries.txt")

userDF.printSchema()
userDF.show(2, False)


root
 |-- User: string (nullable = true)
 |-- Papers_ID: string (nullable = true)

+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [56]:
# Cast from string to array so that the explode function can be applied to the Papers_ID column.
def str_to_arr(my_list):
    return my_list.split(",")

# String to array udf
str_to_arr_udf = udf(str_to_arr,ArrayType(StringType()))

# Converting from string to array
userDF = userDF.withColumn('PapersID',str_to_arr_udf(userDF["Papers_ID"]))
userDF = userDF.drop("Papers_ID")
userDF.printSchema()
userDF.show(2)

root
 |-- User: string (nullable = true)
 |-- PapersID: array (nullable = true)
 |    |-- element: string (containsNull = true)

+--------------------+--------------------+
|                User|            PapersID|
+--------------------+--------------------+
|f05bcffe7951de9e5...|[1158654, 478707,...|
|28d3f81251d94b097...|[3929762, 503574,...|
+--------------------+--------------------+
only showing top 2 rows



In [57]:
# Exploding PapersID column.
exploded_UserDF = userDF.select("User", sf.explode("PapersID")).withColumnRenamed("col", "paper_id")

In [58]:
exploded_UserDF.show(2,False)

+--------------------------------+--------+
|User                            |paper_id|
+--------------------------------+--------+
|f05bcffe7951de9e5a32fff4a42eb088|1158654 |
|f05bcffe7951de9e5a32fff4a42eb088|478707  |
+--------------------------------+--------+
only showing top 2 rows



In [59]:
# Creating function to convert from dense to sparse vector after summation of vectors
import scipy.sparse
from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT
from pyspark.ml.linalg import DenseVector

#Convert Dense to sparse vector
def dense_to_sparse(vector):
    return _convert_to_vector(scipy.sparse.csc_matrix(vector.toArray()).T)

# User Profiling using TF-IDF

In [60]:
# Requires 2 DFs, 1: paper_id|User_hash, 2: paper_id|ptf_idf
# Performs join on paper_id column of both dfs and then sum respective vectors for users.
def idf_user_profile(df1, df2, column1):
    joined = df1.join(df2, column1)
    # Sum the sparse vectors of each user's papers.
    # Drop "paper_id" so that only (User, vector) pairs remain.
    joined = joined.drop("paper_id")
    # Convert to an RDD of arrays so the vectors can be added, then convert back to a DF.
    summed = joined.rdd.mapValues(lambda v: v.toArray()).reduceByKey(lambda x, y: x+y)\
                .mapValues(lambda x: DenseVector(x)).toDF(["User", "sum_idf"])
    to_sparse = udf(dense_to_sparse, VectorUDT())
    summed = summed.withColumn("TF_IDF_summed", to_sparse(col("sum_idf")))
    summed = summed.drop("sum_idf")
    
    return summed

In [61]:
user_profile_idf = idf_user_profile(exploded_UserDF, tf_idf, "paper_id")

In [62]:
user_profile_idf.show(2,False)

+--------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
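
The profile construction is a join followed by a per-user vector sum. The plain-Python sketch below shows the same idea with sparse vectors stored as `{index: weight}` dictionaries; it is an illustration under that assumption, and `build_user_profiles` is a hypothetical name, not part of the notebook.

```python
from collections import defaultdict

def build_user_profiles(user_papers, paper_vectors):
    """user_papers: iterable of (user, paper_id); paper_vectors: paper_id -> {index: weight}."""
    profiles = defaultdict(lambda: defaultdict(float))
    for user, paper_id in user_papers:
        vector = paper_vectors.get(paper_id)
        if vector is None:
            continue  # inner-join semantics: papers without a vector are skipped
        for index, weight in vector.items():
            profiles[user][index] += weight
    return {user: dict(sorted(vec.items())) for user, vec in profiles.items()}

library = [("u1", "p1"), ("u1", "p2"), ("u2", "p2"), ("u2", "p9")]
vectors = {"p1": {0: 1.0, 2: 0.5}, "p2": {2: 1.5, 3: 2.0}}
print(build_user_profiles(library, vectors))
```

Because the join is an inner join, a user whose papers all fall outside the 20000-vector subset gets no profile at all.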

# User Profiling using LDA

In [64]:
def lda_user_profile(df1, df2, column1):
    joined = df1.join(df2, column1)
    # Sum the topic-distribution vectors of each user's papers.
    # Drop "paper_id" so that only (User, vector) pairs remain.
    joined = joined.drop("paper_id")
    # Convert to an RDD of arrays so the vectors can be added, then convert back to a DF.
    summed = joined.rdd.mapValues(lambda v: v.toArray()).reduceByKey(lambda x, y: x+y)\
                .mapValues(lambda x: DenseVector(x)).toDF(["User", "sum_lda"])
    to_sparse = udf(dense_to_sparse, VectorUDT())
    summed = summed.withColumn("LDA_summed", to_sparse(col("sum_lda")))
    summed = summed.drop("sum_lda")

    return summed


In [65]:
user_profile_lda = lda_user_profile(exploded_UserDF, lda, "paper_id")

In [66]:
user_profile_lda.show(5, False)


# Exercise 4.2 Cosine Similarity

In [67]:
# Function for cosine similarity.
def CosineSim(x , y):
    similarity = x.dot(y)/(x.norm(2)*y.norm(2))
    return similarity
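
As a worked illustration of the formula used in `CosineSim`, the following is a minimal pure-Python sketch. The function name `cosine_similarity` is hypothetical, and the zero-norm guard is an assumption added here; `CosineSim` as written divides by zero when either vector has zero norm.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_x * norm_y)


# Vectors pointing in the same direction, orthogonal vectors, and a zero vector.
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
zero_vector = cosine_similarity([0.0, 0.0], [1.0, 1.0])
print(same_direction, orthogonal, zero_vector)
```

Because the measure depends only on direction, summed user profiles of very different magnitude can still be compared with single-paper vectors.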

# Content-Based Recommendations Systems 

### CBRS TF-IDF 

In [68]:
# Collecting Papers as list
# filter out rated papers instead of list.
all_pap_vecs = tf_idf.select('*').rdd.map(lambda x: (x[0], x[1])).collect()

# Collecting the user vectors from the TF-IDF user profiles
all_userss_vecs = user_profile_idf.select('*').rdd.map(lambda x: (x[0], x[1])).collect()

# All users together with their libraries (the list of paper ids of every user).
all_users_paper_vec = userDF.select('*').rdd.map(lambda x: (x[0], x[1])).collect()

In [204]:
# Taking a small slice of the user vectors so that the computation stays feasible.
small_user_vecs = all_userss_vecs[100:120]

In [71]:
def get_SimilarPaper_idf(user_profiles, k):
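    # Schema of the empty DataFrame that accumulates the top-k rows of every user.
    # The similarity is a float, so the column is declared as DoubleType.
    schema = StructType([
                        StructField("User", StringType(), True)
                        ,StructField("paper_id", StringType(), True)
                        ,StructField("Cosine Similarity", DoubleType(), True)

                    ])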
  
    user_paper_similarity = sqlC.createDataFrame([], schema)

    for user_vec in user_profiles:
        # Look up the library (list of paper ids) of this user.
        papers_of_user = [item[1] for item in all_users_paper_vec if item[0] == user_vec[0]]
        # Skip users without a library entry to avoid an IndexError below.
        if not papers_of_user:
            continue
        papers_of_user = papers_of_user[0]

        # Papers that are not in the user's library.
        # NOTE: `removed` is not used below; the similarities are computed against all_pap_vecs,
        # so papers from the user's own library can appear in the recommendations.
        removed = [item for item in all_pap_vecs if item[0] not in papers_of_user]

        # Cosine similarity between the user profile and every paper vector.
        sim_rdd = sc.parallelize((user_vec[0], i[0], float(CosineSim(user_vec[1], i[1]))) for i in all_pap_vecs)

        tempDF = sqlC.createDataFrame(sim_rdd) \
                .withColumnRenamed('_1', 'User') \
                .withColumnRenamed('_2', 'paper_id') \
                .withColumnRenamed('_3', 'Cosine Similarity')\
                .orderBy("Cosine Similarity", ascending = False)

        # Keep the k most similar papers for this user and append them to the result.
        tempDF = tempDF.select("*").limit(k)
        user_paper_similarity = user_paper_similarity.union(tempDF)
    return user_paper_similarity
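
The function above builds the list `removed` but scores `all_pap_vecs`. The step of excluding library papers before taking the top k can be sketched in isolation as follows. This is a plain-Python illustration under stated assumptions, not the notebook's implementation: vectors are dense lists, and the names `top_k_unseen` and `cosine` as well as the toy data are hypothetical.

```python
import heapq
import math


def cosine(x, y):
    """Cosine similarity of two equal-length lists (0.0 if a norm is zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def top_k_unseen(user_vec, paper_vecs, library, k):
    """Return the k (paper_id, similarity) pairs with the highest similarity,
    skipping every paper that is already in the user's library."""
    seen = set(library)
    scored = (
        (paper_id, cosine(user_vec, vec))
        for paper_id, vec in paper_vecs
        if paper_id not in seen
    )
    # nlargest avoids sorting the full candidate list when k is small.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical toy data: four papers, the user already has "p1".
papers = [
    ("p1", [1.0, 0.0]),
    ("p2", [0.0, 1.0]),
    ("p3", [1.0, 1.0]),
    ("p4", [2.0, 0.1]),
]
recs = top_k_unseen([1.0, 0.0], papers, library=["p1"], k=2)
print(recs)
```

Whether library papers should be excluded depends on the evaluation: the offline metrics in this notebook compare the recommendations against the user's library, so excluding the whole library would leave nothing to hit.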

In [205]:
user_similarity_df = get_SimilarPaper_idf(small_user_vecs, 10)

In [206]:
user_similarity_df.show(5,False)

+--------------------------------+--------+-------------------+
|User                            |paper_id|Cosine Similarity  |
+--------------------------------+--------+-------------------+
|306726df6d202f66d0c6895c83e9660f|1084904 |0.9918598292575735 |
|306726df6d202f66d0c6895c83e9660f|1104951 |0.40213849515053124|
|306726df6d202f66d0c6895c83e9660f|1216168 |0.3983613821016407 |
|306726df6d202f66d0c6895c83e9660f|1118739 |0.38211302015569776|
|306726df6d202f66d0c6895c83e9660f|100333  |0.38211302015569776|
+--------------------------------+--------+-------------------+
only showing top 5 rows



In [294]:
# Give user hash to filter the recommendation for that user only
user_hash_id = '306726df6d202f66d0c6895c83e9660f'
user_similarity_df.filter(user_similarity_df.User == user_hash_id).show()


+--------------------+--------+-------------------+
|                User|paper_id|  Cosine Similarity|
+--------------------+--------+-------------------+
|306726df6d202f66d...| 1084904| 0.9918598292575735|
|306726df6d202f66d...| 1104951|0.40213849515053124|
|306726df6d202f66d...| 1216168| 0.3983613821016407|
|306726df6d202f66d...| 1118739|0.38211302015569776|
|306726df6d202f66d...|  100333|0.38211302015569776|
|306726df6d202f66d...|  122336|0.38211302015569776|
|306726df6d202f66d...| 1136810| 0.3787940258743861|
|306726df6d202f66d...| 1110399| 0.3787940258743861|
|306726df6d202f66d...|   12395| 0.3616762586391641|
|306726df6d202f66d...| 1123054|0.36094365596631467|
+--------------------+--------+-------------------+



### CBRS LDA

In [75]:
# Creating a list of all selected lda vectors for papers
all_ldaa_vecs = lda.select('*').rdd.map(lambda x: (x[0], x[1])).collect()

#Creating a list of all users in lda user profile
all_users_vecs = user_profile_lda.select('*').rdd.map(lambda x: (x[0], x[1])).collect()

# Selecting the list of users against their papers.
all_users_paper_vec = userDF.select('*').rdd.map(lambda x: (x[0], x[1])).collect()

In [194]:
# Taking a small slice of the user vectors so that the computation stays feasible.
small_user_vecs_lda = all_users_vecs[100:120]
small_user_vecs_lda

[('306726df6d202f66d0c6895c83e9660f',
  SparseVector(40, {0: 0.0088, 1: 0.0087, 2: 0.0087, 3: 0.0086, 4: 0.0088, 5: 0.0087, 6: 0.0087, 7: 0.0087, 8: 0.0087, 9: 0.0087, 10: 0.0087, 11: 0.0088, 12: 0.0086, 13: 0.0087, 14: 0.0089, 15: 0.0087, 16: 0.0087, 17: 0.0086, 18: 0.0087, 19: 0.431, 20: 0.0086, 21: 0.0085, 22: 0.0087, 23: 0.0086, 24: 0.0086, 25: 0.0086, 26: 0.0086, 27: 0.0087, 28: 0.2076, 29: 0.0086, 30: 0.0087, 31: 0.0087, 32: 0.0087, 33: 0.0088, 34: 0.0087, 35: 0.6772, 36: 0.0089, 37: 0.3716, 38: 0.0086, 39: 0.0087})),
 ('8a84cb70e9d95e5c6ee19ff950a4adf7',
  SparseVector(40, {0: 0.0004, 1: 0.0004, 2: 0.0004, 3: 0.0004, 4: 0.0004, 5: 0.0004, 6: 0.0004, 7: 0.0004, 8: 0.0004, 9: 0.0004, 10: 0.0004, 11: 0.0004, 12: 0.0004, 13: 0.0004, 14: 0.0004, 15: 0.0004, 16: 0.0613, 17: 0.0004, 18: 0.0004, 19: 0.1043, 20: 0.0004, 21: 0.0004, 22: 0.0004, 23: 0.0004, 24: 0.0004, 25: 0.0004, 26: 0.0004, 27: 0.0004, 28: 0.0004, 29: 0.0004, 30: 0.0004, 31: 0.0004, 32: 0.0004, 33: 0.0004, 34: 0.0004, 35

In [183]:
def get_lda_SimilarPaper(user_profiles, k):
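    # Schema of the empty DataFrame that accumulates the top-k rows of every user.
    # The similarity is a float, so the column is declared as DoubleType.
    schema = StructType([
                        StructField("User", StringType(), True)
                        ,StructField("paper_id", StringType(), True)
                        ,StructField("Cosine Similarity", DoubleType(), True)

                    ])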
  
    user_paper_lda_similarity = sqlC.createDataFrame([], schema)

    for user_vec in user_profiles:
        # Look up the library (list of paper ids) of this user.
        papers_of_user = [item[1] for item in all_users_paper_vec if item[0] == user_vec[0]]
        # Skip users without a library entry to avoid an IndexError below.
        if not papers_of_user:
            print(papers_of_user)
            continue
        papers_of_user = papers_of_user[0]

        # Papers that are not in the user's library.
        # NOTE: `removed` is not used below; the similarities are computed against all_ldaa_vecs,
        # so papers from the user's own library can appear in the recommendations.
        removed = [item for item in all_ldaa_vecs if item[0] not in papers_of_user]

        # Cosine similarity between the user profile and every paper vector.
        sim_rdd = sc.parallelize((user_vec[0], i[0], float(CosineSim(user_vec[1], i[1]))) for i in all_ldaa_vecs)

        tempDF = sqlC.createDataFrame(sim_rdd) \
                .withColumnRenamed('_1', 'User') \
                .withColumnRenamed('_2', 'paper_id') \
                .withColumnRenamed('_3', 'Cosine Similarity')\
                .orderBy("Cosine Similarity", ascending = False)

        # Keep the k most similar papers for this user and append them to the result.
        tempDF = tempDF.select("*").limit(k)
        user_paper_lda_similarity = user_paper_lda_similarity.union(tempDF)
    return user_paper_lda_similarity

In [195]:
user_paper_lda_similarity = get_lda_SimilarPaper(small_user_vecs_lda, 10)

In [197]:
user_paper_lda_similarity.show(5)

+--------------------+--------+------------------+
|                User|paper_id| Cosine Similarity|
+--------------------+--------+------------------+
|306726df6d202f66d...|11683049|0.8870038555652144|
|306726df6d202f66d...|10610384|0.8842058365798049|
|306726df6d202f66d...|10430837|0.8833521263517805|
|306726df6d202f66d...| 1135017|0.8809248162669576|
|306726df6d202f66d...|10235078|0.8763203715772356|
+--------------------+--------+------------------+
only showing top 5 rows



In [296]:
# Give user hash to filter the recommendation for that user only
user_hash_id = '306726df6d202f66d0c6895c83e9660f'
user_paper_lda_similarity.filter(user_paper_lda_similarity.User == user_hash_id).show()

+--------------------+--------+------------------+
|                User|paper_id| Cosine Similarity|
+--------------------+--------+------------------+
|306726df6d202f66d...|11683049|0.8870038555652144|
|306726df6d202f66d...|10610384|0.8842058365798049|
|306726df6d202f66d...|10430837|0.8833521263517805|
|306726df6d202f66d...| 1135017|0.8809248162669576|
|306726df6d202f66d...|10235078|0.8763203715772356|
|306726df6d202f66d...| 1056526|0.8743865314270477|
|306726df6d202f66d...|  122285|0.8554939098164116|
|306726df6d202f66d...| 1023689|0.8507553315403571|
|306726df6d202f66d...| 1079299| 0.847201626968957|
|306726df6d202f66d...|12389684|0.8356212706363085|
+--------------------+--------+------------------+



# Exercise 4.4 Offline Evaluation Metrics 

In [83]:
def precision_recall(actual, predicted, k):
    act_set = set(actual)
    pred_set = set(predicted[:k])
    #print((act_set & pred_set))
    precision = len(act_set & pred_set) / float(k)
    recall = len(act_set & pred_set) / float(len(act_set))
    return [precision,recall]

def recall(actual, predicted, k):
    act_set = set(actual)
    pred_set = set(predicted[:k])
    recall = len(act_set & pred_set) / float(len(act_set))
    return recall
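
A small worked example may make the two metrics concrete. The sketch below restates the same computation as a standalone function; the name `precision_recall_at_k` and the toy lists are hypothetical. Like `precision_recall` above, it divides by `k` even when fewer than `k` recommendations are available.

```python
def precision_recall_at_k(actual, predicted, k):
    """Precision@k and recall@k for one user.

    actual    -- papers the user really has (the test set)
    predicted -- ranked list of recommended papers
    """
    relevant = set(actual)
    hits = len(relevant & set(predicted[:k]))
    precision = hits / float(k)
    # Guard against an empty test set (an assumption added for this sketch).
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


actual = ["a", "b", "c", "d"]
predicted = ["a", "x", "c", "y", "z"]

# Top 5 contains "a" and "c": 2 hits out of 5 recommended, 2 of 4 relevant.
p_at_5, r_at_5 = precision_recall_at_k(actual, predicted, 5)
# Top 2 contains only "a": 1 hit out of 2 recommended, 1 of 4 relevant.
p_at_2, r_at_2 = precision_recall_at_k(actual, predicted, 2)
print(p_at_5, r_at_5, p_at_2, r_at_2)
```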


In [84]:
returnType = StructType([
    StructField("Precision", FloatType(), False),
    StructField("Recall", FloatType(), False)
])

In [85]:
# Reciprocal rank of the first recommended paper that is in the test set.
def mrr(test, rec):
    # enumerate() is zero-based, so a hit at index i has rank i + 1.
    for i, val in enumerate(rec):
        if val in test:
            return 1.0 / (i + 1)
    # No recommended paper is in the test set.
    return 0.0

In [86]:
# Creating UDFs for the metrics
#pre_udf = udf(precision_recall, FloatType())
#recall_udf = udf(recall, FloatType())
# precision_recall_udf returns precision and recall together as one struct (see returnType);
# the two fields are selected as separate columns later.
precision_recall_udf = udf(precision_recall, returnType)
# Declare the return type so that Mrr is a float column instead of the default string.
mrr_udf = udf(mrr, FloatType())
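
The reciprocal rank and its mean over several users can be checked by hand on a toy example. This is a minimal standalone sketch; the function names `reciprocal_rank` and `mean_reciprocal_rank` and the test cases are hypothetical.

```python
def reciprocal_rank(relevant, ranked):
    """1 / rank of the first relevant item in `ranked`, or 0.0 if there is none."""
    relevant = set(relevant)
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases):
    """Average reciprocal rank over (relevant, ranked) pairs, one pair per user."""
    scores = [reciprocal_rank(relevant, ranked) for relevant, ranked in cases]
    return sum(scores) / len(scores) if scores else 0.0


cases = [
    (["a"], ["a", "b", "c"]),       # first hit at rank 1 -> 1.0
    (["b"], ["a", "b", "c"]),       # first hit at rank 2 -> 0.5
    (["d"], ["a", "b", "c", "d"]),  # first hit at rank 4 -> 0.25
    (["z"], ["a", "b"]),            # no hit              -> 0.0
]
overall = mean_reciprocal_rank(cases)
print(overall)
```

Users without any hit contribute 0.0 to the mean in this sketch; dropping them instead would raise the reported value.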

# Offline Evaluation of TF-IDF

In [207]:
# Creating DF as user_hash, test_set, Recommendations
metrics_df = user_similarity_df.select('User', 'paper_id').withColumnRenamed('User', 'user_hash')

metrics_df.show(2,False)

# Grouping on user_hash and collecting recommended papers as list 
grouped_metrics_df = metrics_df.groupBy('user_hash').agg(sf.collect_list('paper_id'))\
                    .withColumnRenamed('collect_list(paper_id)','top_k')

grouped_metrics_df.show(2,False)

#grouped_metrics_df.printSchema()

# Creating a df by joining grouped_metrics_df with userDf to get top_k and test set of user.
user_test_topk = grouped_metrics_df.join(userDF, grouped_metrics_df.user_hash == userDF.User).drop('User')\
                .withColumnRenamed('PapersID', 'test')

#user_test_topk.printSchema()
# Computing precision and recall with one UDF, which returns a struct containing both values
with_precision_recall = user_test_topk.withColumn('result', precision_recall_udf(user_test_topk.test, user_test_topk.top_k, lit(10)))\
        .select('user_hash', 'test', 'top_k', 'result.precision', 'result.recall')

In [97]:
with_precision_recall.printSchema()

root
 |-- user_hash: string (nullable = true)
 |-- test: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- top_k: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- precision: float (nullable = true)
 |-- recall: float (nullable = true)



In [218]:
with_precision_recall.show(10)

+--------------------+--------------------+--------------------+---------+-----------+
|           user_hash|                test|               top_k|precision|     recall|
+--------------------+--------------------+--------------------+---------+-----------+
|7d5bd7a7207191b8e...|[1187625, 1374369...|[1187625, 1077201...|      0.1| 0.14285715|
|1a41cac66e94aa5b8...|[114704, 114633, ...|[114704, 114633, ...|      0.2|  0.2857143|
|75cadcb96870032b0...|[6629397, 5882183...|[10659057, 101052...|      0.4|0.015564202|
|5db29434ebd7751e0...|[7229975, 1064495...|[1064495, 1105832...|      0.1|      0.125|
|f04de1b295a1c0759...|[13162824, 156946...|[10250503, 111552...|      0.5|   0.078125|
|c88f84bdcec4d30f3...|[1224201, 1205659...|[1205659, 1074262...|      0.2| 0.33333334|
|46e75423781402aeb...|[457955, 635695, ...|[1187394, 1116360...|      0.2| 0.10526316|
|ddbb4df7bb43ad490...|[2838703, 1122005...|[11220055, 119213...|      0.2|      0.125|
|2115a974c04051623...|[11086640, 245920...|

In [219]:
# Computing MRR from the test set and the top-k recommendations of every user
mrr_df = with_precision_recall.withColumn('Mrr', mrr_udf(with_precision_recall.test ,with_precision_recall.top_k))

In [220]:
mrr_df.show(1)

+--------------------+--------------------+--------------------+---------+----------+---+
|           user_hash|                test|               top_k|precision|    recall|Mrr|
+--------------------+--------------------+--------------------+---------+----------+---+
|7d5bd7a7207191b8e...|[1187625, 1374369...|[1187625, 1077201...|      0.1|0.14285715|1.0|
+--------------------+--------------------+--------------------+---------+----------+---+
only showing top 1 row



In [221]:
# Calculating Average Precision, Recall and MRR....

avg_precision = mrr_df.agg(avg(col("precision")))
avg_recall = mrr_df.agg(avg(col("recall")))
avg_mrr = mrr_df.agg(avg(col('Mrr')))

In [222]:
avg_precision.show()
avg_recall.show()
avg_mrr.show()

+-------------------+
|     avg(precision)|
+-------------------+
|0.23500000275671482|
+-------------------+

+-------------------+
|        avg(recall)|
+-------------------+
|0.16262027905322612|
+-------------------+

+--------+
|avg(Mrr)|
+--------+
|     1.0|
+--------+



# Offline Evaluation of LDA 

In [198]:
# Creating DF as user_hash, test_set, Recommendations
lda_metrics_df = user_paper_lda_similarity.select('User', 'paper_id').withColumnRenamed('User', 'user_hash')

# Collecting the list of recommended papers on basis of lda.
lda_grouped_metrics_df = lda_metrics_df.groupBy('user_hash').agg(sf.collect_list('paper_id'))\
                    .withColumnRenamed('collect_list(paper_id)','top_k')

# Creating Df as user_hash, test and top_k
lda_user_test_topk = lda_grouped_metrics_df.join(userDF, lda_grouped_metrics_df.user_hash == userDF.User).drop('User')\
                .withColumnRenamed('PapersID', 'test')

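# Calculating precision and recall using the UDF.
# NOTE: k is passed as 40 here although only 10 recommendations per user were generated above,
# so the precision below is the number of hits divided by 40.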
lda_with_precision_recall = lda_user_test_topk.withColumn('result', precision_recall_udf(lda_user_test_topk.test, lda_user_test_topk.top_k, lit(40)))\
        .select('user_hash', 'test', 'top_k', 'result.precision', 'result.recall')

In [203]:
lda_with_precision_recall.show()

+--------------------+--------------------+--------------------+---------+------------+
|           user_hash|                test|               top_k|precision|      recall|
+--------------------+--------------------+--------------------+---------+------------+
|7d5bd7a7207191b8e...|[1187625, 1374369...|[1187625, 1214655...|    0.025|  0.14285715|
|1a41cac66e94aa5b8...|[114704, 114633, ...|[114633, 114704, ...|     0.05|   0.2857143|
|75cadcb96870032b0...|[6629397, 5882183...|[10082432, 103611...|    0.025|0.0038910506|
|5db29434ebd7751e0...|[7229975, 1064495...|[1064495, 1032045...|    0.025|       0.125|
|f04de1b295a1c0759...|[13162824, 156946...|[1068200, 1096026...|      0.0|         0.0|
|c88f84bdcec4d30f3...|[1224201, 1205659...|[10041867, 106401...|    0.025|  0.16666667|
|46e75423781402aeb...|[457955, 635695, ...|[1093250, 1062263...|      0.0|         0.0|
|ddbb4df7bb43ad490...|[2838703, 1122005...|[1074536, 1192135...|     0.05|       0.125|
|2115a974c04051623...|[11086640,

In [223]:
lda_mrr_df = lda_with_precision_recall.withColumn('Mrr', mrr_udf(lda_with_precision_recall.test ,lda_with_precision_recall.top_k))

In [224]:
lda_mrr_df.show(2)

+--------------------+--------------------+--------------------+---------+----------+---+
|           user_hash|                test|               top_k|precision|    recall|Mrr|
+--------------------+--------------------+--------------------+---------+----------+---+
|7d5bd7a7207191b8e...|[1187625, 1374369...|[1187625, 1214655...|    0.025|0.14285715|1.0|
|1a41cac66e94aa5b8...|[114704, 114633, ...|[114633, 114704, ...|     0.05| 0.2857143|1.0|
+--------------------+--------------------+--------------------+---------+----------+---+
only showing top 2 rows



In [225]:
# Calculating Averages of Precision Recall and MRR for LDA
lda_avg_precision = lda_mrr_df.agg(avg(col("precision")))
lda_avg_recall = lda_mrr_df.agg(avg(col("recall")))
lda_avg_mrr = lda_mrr_df.agg(avg(col('Mrr')))

In [226]:
# Showing Results
lda_avg_precision.show()
lda_avg_recall.show()
lda_avg_mrr.show()

+--------------------+
|      avg(precision)|
+--------------------+
|0.022500000428408384|
+--------------------+

+-------------------+
|        avg(recall)|
+-------------------+
|0.07931858979864045|
+-------------------+

+------------------+
|          avg(Mrr)|
+------------------+
|0.8208333333333332|
+------------------+



# Exercise 4.5 Off-Line Evaluation

# Sampler from exercise 3.5

In [235]:
# Sampler function
# userDF  = users with their libraries (columns "User" and "PapersID")
# tf_idf  = DataFrame with the TF-IDF vectors of the papers
# lda     = DataFrame with the LDA topic vectors of the papers
# column1 = column used to join the training set with the tf_idf and lda DataFrames ("paper_id")
def sampler(userDF, tf_idf, lda, column1):
    # Random sample of roughly 70% of the users.
    samplee = userDF.sample(0.7)

    # One row per (User, paper_id) pair.
    exploded_sample = samplee.select("User", sf.explode("PapersID")).withColumnRenamed("col", "paper_id")

    # Sampling fraction per user, set to 0.8 for every user, for stratified sampling with sampleBy.
    fraction = exploded_sample.select("User").distinct().withColumn("fraction", lit(0.8)).rdd.collectAsMap()
    # Training DF: roughly 80% of each user's papers from the exploded sample (seed 0).
    training = exploded_sample.sampleBy("User", fraction, 0)
    #print(training.distinct().count())
    # Test DF: the rows of exploded_sample that are not in the training DF.
    test = exploded_sample.select("*").subtract(training.select("*"))
    # Collect the held-out papers of each user into one list.
    test_set_df = test.groupBy('User').agg(sf.collect_list('paper_id'))\
                    .withColumnRenamed('collect_list(paper_id)','test_set')
    #print(test.distinct().count())
    # Build the TF-IDF user profiles from the training set.
    sampled_tr_idf = idf_user_profile(training, tf_idf, column1)
    # Build the LDA user profiles from the training set.
    sampled_tr_lda = lda_user_profile(training, lda, column1)
    
    # Writing to file, which can be used in next experiment.
    #sampled_tr_idf.rdd.coalesce(1).saveAsTextFile("Result_TF-DF")
    #sampled_tr_lda.rdd.coalesce(1).saveAsTextFile("Result_LDA")
    
    return sampled_tr_idf, sampled_tr_lda, test_set_df
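
The per-user train/test split performed by `sampler` can be sketched in plain Python. Note one deliberate simplification: `sampleBy` keeps each row independently with the given probability, so the 0.8 fraction is only approximate per user, whereas this sketch cuts every library at an exact position. The function name `split_user_library` and the toy libraries are hypothetical.

```python
import random


def split_user_library(library, train_fraction=0.8, seed=0):
    """Split every user's paper list into a training part and a test part.

    library -- dict mapping user -> list of paper ids
    Returns two dicts (train, test) with the same keys.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = {}, {}
    for user, papers in library.items():
        shuffled = list(papers)
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train[user] = shuffled[:cut]
        test[user] = shuffled[cut:]
    return train, test


# Hypothetical toy libraries.
library = {
    "u1": ["p%d" % i for i in range(10)],
    "u2": ["q1", "q2", "q3", "q4", "q5"],
}
train, test = split_user_library(library)
print(len(train["u1"]), len(test["u1"]), len(train["u2"]), len(test["u2"]))
```

The user profiles are then built from the training part only, and the test part serves as ground truth for precision, recall and MRR.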
    

In [236]:
# Calling The sampler function which returns sampled tf-idf user profiles and lda user profiles
# Along with test_set df.
sampled_20_idf, sampled_20_lda, sampled_test= sampler(userDF, tf_idf, lda, "paper_id")

# Using Sampled data for TF-IDF 

In [None]:
# Getting List of users from Sampled Test.
user_list = sampled_test.select('User').limit(20).collect()
user_list = [row.User for row in user_list]
#user_list

In [270]:
# Common users between Sampled Test and Sampled_20_idf
common_user = sampled_20_idf.filter(sf.col('User').isin(user_list))

In [272]:
common_users_vecs = common_user.select('*').limit(20).rdd.map(lambda x: (x[0], x[1])).collect()

common_users_vecs

[('aa5363b8b326ed29637fc196e97a8145',
  SparseVector(1000, {2: 2.3101, 20: 16.7136, 27: 2.5361, 33: 2.4134, 39: 2.9581, 42: 2.4618, 50: 2.5352, 54: 7.8767, 71: 2.8645, 72: 2.5369, 78: 2.5215, 87: 25.6212, 89: 2.6325, 99: 9.0259, 106: 2.7027, 112: 2.701, 119: 2.89, 131: 3.3454, 146: 2.8032, 166: 2.8624, 180: 5.9062, 186: 3.0186, 196: 2.9976, 202: 5.995, 211: 3.0972, 222: 3.8017, 231: 3.3737, 233: 3.1757, 247: 6.4302, 263: 3.2616, 269: 3.1312, 325: 3.2884, 331: 3.2273, 362: 7.8067, 368: 3.3969, 375: 3.8431, 379: 3.8055, 385: 3.5735, 406: 3.5225, 443: 3.7297, 463: 3.6612, 474: 3.5216, 475: 3.5641, 485: 3.6547, 502: 3.724, 505: 11.7886, 509: 3.849, 538: 4.0232, 559: 3.7589, 587: 4.0695, 588: 8.4901, 601: 3.8204, 610: 3.8423, 621: 3.9471, 629: 3.9087, 653: 4.0972, 657: 4.2641, 672: 3.8689, 680: 4.0138, 699: 26.5268, 715: 8.1326, 733: 4.1092, 752: 4.2017, 760: 4.1659, 777: 4.0528, 783: 4.1475, 784: 4.1905, 791: 4.5452, 834: 4.5368, 849: 13.6818, 865: 4.1929, 902: 4.317, 921: 4.4382, 945: 4.5

#### For k = 5

In [273]:
# Sampled Users vectors and number of recommendations in form of k.
# returns DF containing recommendations and Precision, Recall, MRR.
def sampled_precision_recall_mrr(user_vectors, sampled_test, k):
    #Getting Recommendations 
    temp_recs = get_SimilarPaper_idf(user_vectors, k)

    #temp_recs.printSchema()

    # Creating DF as user_hash, Recommendations
    temp_metrics_df = temp_recs.select('User', 'paper_id').withColumnRenamed('User', 'user_hash')

    #temp_metrics_df.printSchema()
    
    # Creating DF, user_hash, top_k recommendations
    temp_grouped_metrics_df = temp_metrics_df.groupBy('user_hash').agg(sf.collect_list('paper_id'))\
                        .withColumnRenamed('collect_list(paper_id)','top_k')

    temp_grouped_metrics_df.show(2)

#     test_set_df = sampled_test.groupBy('User').agg(sf.collect_list('paper_id'))\
#                     .withColumnRenamed('collect_list(paper_id)','test_set')
    
#     test_set_df.show(2)
    
    # Creating DF (user_hash, top_k, test_set) by joining with sampled_test.
    # sampled_test already has a "test_set" column, so no rename is needed.
    temp_user_test_topk = temp_grouped_metrics_df.join(sampled_test, temp_grouped_metrics_df.user_hash == sampled_test.User).drop('User')

    temp_user_test_topk.show(2)
    
    # Creating DF with precision and recall as new columns
    temp_with_precision = temp_user_test_topk.withColumn('result', precision_recall_udf(temp_user_test_topk.test_set, temp_user_test_topk.top_k, lit(k)))\
            .select('user_hash', 'test_set', 'top_k', 'result.precision', 'result.recall')

    temp_with_precision.show(5)
    
    # Creating DF with MRR value
    temp_mrr_df = temp_with_precision.withColumn('Mrr', mrr_udf(temp_with_precision.test_set ,temp_with_precision.top_k))

    temp_mrr_df.show(5)
    
    return temp_mrr_df
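
The evaluation is repeated below for several values of k. The following plain-Python sketch shows how the three averages change with k on toy data. It is an illustration under stated assumptions, not the notebook's Spark pipeline: the name `evaluate_at_k` and the data are hypothetical, users without held-out papers are skipped, and the reciprocal rank only considers the top k.

```python
def evaluate_at_k(recommendations, test_sets, k):
    """Average precision@k, recall@k and reciprocal rank over all users.

    recommendations -- dict user -> ranked list of recommended paper ids
    test_sets       -- dict user -> list of held-out paper ids
    """
    precisions, recalls, ranks = [], [], []
    for user, ranked in recommendations.items():
        relevant = set(test_sets.get(user, []))
        if not relevant:
            continue  # users without held-out papers cannot be scored
        top_k = ranked[:k]
        hits = [item for item in top_k if item in relevant]
        precisions.append(len(hits) / float(k))
        recalls.append(len(hits) / float(len(relevant)))
        first = next((i for i, item in enumerate(top_k, start=1) if item in relevant), None)
        ranks.append(1.0 / first if first else 0.0)
    n = len(precisions)
    if n == 0:
        return {"precision": 0.0, "recall": 0.0, "mrr": 0.0}
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "mrr": sum(ranks) / n,
    }


# Hypothetical toy data: "u1" has two held-out papers, "u2" is never hit.
recommendations = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y", "z", "w"]}
test_sets = {"u1": ["b", "d"], "u2": ["q"]}
results = {k: evaluate_at_k(recommendations, test_sets, k) for k in (2, 4)}
print(results)
```

In this toy case recall grows with k while precision stays flat, because the extra hit at rank 4 is offset by the larger denominator.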

In [274]:
# Creating recommendations for k = 5 and calculating precision, recall and MRR for every sampled user.
# prm = precision, recall and MRR
prm = sampled_precision_recall_mrr(common_users_vecs, sampled_test, 5)

+--------------------+--------------------+
|           user_hash|               top_k|
+--------------------+--------------------+
|aa5363b8b326ed296...|[111664, 1021931,...|
|7f4fbc40acceaa303...|[1181259, 1181253...|
+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+
|           user_hash|               top_k|            test_set|
+--------------------+--------------------+--------------------+
|aa5363b8b326ed296...|[111664, 1021931,...|[5864612, 4469058...|
|7f4fbc40acceaa303...|[1181259, 1181253...|[1817028, 3356924...|
+--------------------+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+---------+------+
|           user_hash|            test_set|               top_k|precision|recall|
+--------------------+--------------------+--------------------+---------+------+
|aa5363b8b326ed296...|[5864612, 4469058...|[111664

In [279]:
# Calculating Average of Precision, Recall and Mrr for k = 5
sample_avg_precision = prm.agg(avg(col("precision")))
sample_avg_recall = prm.agg(avg(col("recall")))
sample_avg_mrr = prm.agg(avg(col('Mrr')))

# Showing Results
sample_avg_precision.show()
sample_avg_recall.show()
sample_avg_mrr.show()

+--------------+
|avg(precision)|
+--------------+
|           0.0|
+--------------+

+-----------+
|avg(recall)|
+-----------+
|        0.0|
+-----------+

+--------+
|avg(Mrr)|
+--------+
|     1.0|
+--------+



#### For k = 10 

In [280]:
# Creating recommendations for k = 10 and calculating precision, recall and MRR for every sampled user.
prm_10 = sampled_precision_recall_mrr(common_users_vecs, sampled_test, 10)

+--------------------+--------------------+
|           user_hash|               top_k|
+--------------------+--------------------+
|aa5363b8b326ed296...|[111664, 1021931,...|
|7f4fbc40acceaa303...|[1181259, 1181253...|
+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+
|           user_hash|               top_k|            test_set|
+--------------------+--------------------+--------------------+
|aa5363b8b326ed296...|[111664, 1021931,...|[5864612, 4469058...|
|7f4fbc40acceaa303...|[1181259, 1181253...|[1817028, 3356924...|
+--------------------+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+---------+------+
|           user_hash|            test_set|               top_k|precision|recall|
+--------------------+--------------------+--------------------+---------+------+
|aa5363b8b326ed296...|[5864612, 4469058...|[111664

In [281]:
# Calculating average precision, recall and MRR for k = 10
sample_avg_precision = prm_10.agg(avg(col("precision")))
sample_avg_recall = prm_10.agg(avg(col("recall")))
sample_avg_mrr = prm_10.agg(avg(col('Mrr')))

# Showing Results
sample_avg_precision.show()
sample_avg_recall.show()
sample_avg_mrr.show()

+-------------------+
|     avg(precision)|
+-------------------+
|0.02307692342079603|
+-------------------+

+--------------------+
|         avg(recall)|
+--------------------+
|0.009056877344846725|
+--------------------+

+----------------+
|        avg(Mrr)|
+----------------+
|0.86996336996337|
+----------------+



#### For k = 30 

In [282]:
# Creating recommendations for k = 30 and calculating precision, recall and MRR for every sampled user.
prm_30 = sampled_precision_recall_mrr(common_users_vecs, sampled_test, 30)

+--------------------+--------------------+
|           user_hash|               top_k|
+--------------------+--------------------+
|aa5363b8b326ed296...|[111664, 1021931,...|
|7f4fbc40acceaa303...|[1181259, 1181253...|
+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+
|           user_hash|               top_k|            test_set|
+--------------------+--------------------+--------------------+
|aa5363b8b326ed296...|[111664, 1021931,...|[5864612, 4469058...|
|7f4fbc40acceaa303...|[1181259, 1181253...|[1817028, 3356924...|
+--------------------+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+---------+------+
|           user_hash|            test_set|               top_k|precision|recall|
+--------------------+--------------------+--------------------+---------+------+
|aa5363b8b326ed296...|[5864612, 4469058...|[111664

In [283]:
# Calculating average precision, recall and MRR for k = 30
sample_avg_precision = prm_30.agg(avg(col("precision")))
sample_avg_recall = prm_30.agg(avg(col("recall")))
sample_avg_mrr = prm_30.agg(avg(col('Mrr')))

# Showing Results
sample_avg_precision.show()
sample_avg_recall.show()
sample_avg_mrr.show()

+--------------------+
|      avg(precision)|
+--------------------+
|0.010256410791323734|
+--------------------+

+--------------------+
|         avg(recall)|
+--------------------+
|0.014974037328591714|
+--------------------+

+----------------+
|        avg(Mrr)|
+----------------+
|0.86996336996337|
+----------------+



# Using Sampled data for LDA 

In [285]:
# Getting List of Users from sample test
lda_user_list = sampled_test.select('User').limit(20).collect()
lda_user_list = [row.User for row in lda_user_list]
#lda_user_list

['5b0c4c6f84f09ea85bdd9528ce8f9e42',
 'f0baca0d108f1490138ad98e4d685a1d',
 '3e5e03ed6a16af2a0434ac284d465bb8',
 'dcb0db5e7f1e041d8f4c9765af80b9fa',
 '188b2723f7349804c6a237f47b089982',
 'fc0d51c63591e5b0b12289c002b065c7',
 '5a4b773f33cfc52cf4514db72afab665',
 '04e7b99118265af2550eb8aa585fda4e',
 'f8d5bbf7acbe1aff888408c83b40571b',
 '95aad69fcfd8bdae62a905240c4a5d0f',
 'b070c44bb07ed8da88afc638303052f3',
 'f1e1cd4ff25018273aafc0c68fbb5a2f',
 '3feb00e5f45b1cccfb002a7189af5546',
 '663788a67f42e8d3568b59c04c9b3320',
 '1675f231acf9d1cbb821f58484910420',
 'aa5363b8b326ed29637fc196e97a8145',
 '7f4fbc40acceaa303c9f0f75a0047bec',
 '6be56c8606878196bdc7af44975382d0',
 '3f17484bdd9b3d883abbd6d98d5a39f3',
 '665c39f48d4ab4e4cfd0f52523d54739']

In [286]:
# Finding common users between sampled_test and sampled_20_lda
common_user = sampled_20_lda.filter(sf.col('User').isin(lda_user_list))
lda_common_user_vecs = common_user.select('*').limit(20).rdd.map(lambda x: (x[0], x[1])).collect()

#lda_common_user_vecs

[('f1e1cd4ff25018273aafc0c68fbb5a2f',
  SparseVector(40, {0: 0.0343, 1: 0.0031, 2: 0.0031, 3: 0.003, 4: 0.3505, 5: 0.0031, 6: 0.0031, 7: 0.0031, 8: 0.1177, 9: 0.0031, 10: 0.0031, 11: 2.2199, 12: 0.003, 13: 0.0031, 14: 0.1955, 15: 0.1529, 16: 0.0031, 17: 0.0415, 18: 0.1758, 19: 0.7808, 20: 0.0031, 21: 0.003, 22: 0.0031, 23: 0.0031, 24: 0.056, 25: 0.003, 26: 0.0031, 27: 0.0685, 28: 0.1105, 29: 0.003, 30: 0.0031, 31: 0.2642, 32: 0.0031, 33: 0.0496, 34: 0.4644, 35: 0.0031, 36: 0.9958, 37: 0.1667, 38: 0.4182, 39: 0.2731})),
 ('aa5363b8b326ed29637fc196e97a8145',
  SparseVector(40, {0: 0.0018, 1: 0.0018, 2: 0.0018, 3: 0.5463, 4: 0.0018, 5: 0.0018, 6: 0.0018, 7: 0.0018, 8: 0.0018, 9: 0.0018, 10: 0.0018, 11: 0.5386, 12: 0.3052, 13: 0.0018, 14: 0.0018, 15: 0.0018, 16: 0.0018, 17: 0.0018, 18: 0.0018, 19: 0.0018, 20: 0.0018, 21: 0.0018, 22: 0.0018, 23: 0.0018, 24: 0.0018, 25: 0.0018, 26: 0.0018, 27: 0.0621, 28: 0.0018, 29: 0.0018, 30: 0.0018, 31: 0.0919, 32: 0.0018, 33: 0.0018, 34: 0.0018, 35: 0.0

In [287]:
# Sampled Users vectors and number of recommendations in form of k.
# returns DF containing recommendations and Precision, Recall, MRR as columns.
def sampled_precision_recall_mrr_lda(user_vectors, sampled_test, k):
    
    #Getting k Recommendations 
    temp_recs = get_lda_SimilarPaper(user_vectors, k)

    #temp_recs.printSchema()

    # Creating DF as user_hash, Recommendations
    temp_metrics_df = temp_recs.select('User', 'paper_id').withColumnRenamed('User', 'user_hash')
    
    # Creating DF, user_hash, top_k recommendations
    temp_grouped_metrics_df = temp_metrics_df.groupBy('user_hash')\
                        .agg(sf.collect_list('paper_id').alias('top_k'))

    temp_grouped_metrics_df.show(2)

    
    # Creating DF, user, top_k, test_set by joining on sampled_test
    temp_user_test_topk = temp_grouped_metrics_df.join(sampled_test, temp_grouped_metrics_df.user_hash == sampled_test.User).drop('User')\
                    .withColumnRenamed('PapersID', 'test_set')

    temp_user_test_topk.show(2)
    
    # Creating DF, with precision and recall as new columns
    # (columns are referenced by name on the joined DF, which is where 'test_set' and 'top_k' live)
    temp_with_precision = temp_user_test_topk.withColumn('result', precision_recall_udf(col('test_set'), col('top_k'), lit(k)))\
            .select('user_hash', 'test_set', 'top_k', 'result.precision', 'result.recall')

    temp_with_precision.show(5)
    
    # Creating DF with MRR value
    temp_mrr_df = temp_with_precision.withColumn('Mrr', mrr_udf(temp_with_precision.test_set ,temp_with_precision.top_k))

    temp_mrr_df.show(5)
    
    return temp_mrr_df
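
The definitions of `precision_recall_udf` and `mrr_udf` are not shown in this part of the notebook. As a reference point, here is a minimal plain-Python sketch of the standard per-user definitions of precision@k, recall@k and reciprocal rank. The function names `precision_recall_at_k` and `reciprocal_rank` are hypothetical and are not the notebook's UDFs; the sketch assumes `top_k` is ordered from best to worst recommendation.

```python
def precision_recall_at_k(test_set, top_k, k):
    """Return (precision@k, recall@k) for one user.

    test_set: paper ids the user actually has in the test split
    top_k:    recommended paper ids, best first
    k:        number of recommendations that were requested
    """
    relevant = set(test_set)
    recommended = list(top_k)[:k]
    # A hit is a recommended paper that also appears in the test set
    hits = sum(1 for paper_id in recommended if paper_id in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def reciprocal_rank(test_set, top_k):
    """Return 1 / rank of the first hit in top_k, or 0.0 if there is no hit."""
    relevant = set(test_set)
    for rank, paper_id in enumerate(top_k, start=1):
        if paper_id in relevant:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    # Made-up ids for illustration only, not taken from the dataset
    example_test_set = [1, 2, 3, 4]
    example_top_k = [9, 2, 8, 3, 7]
    print(precision_recall_at_k(example_test_set, example_top_k, 5))
    print(reciprocal_rank(example_test_set, example_top_k))
```

MRR is then the mean of the per-user reciprocal ranks, which is what the `avg` over the `Mrr` column computes.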

#### For k = 5

In [289]:
# Precision Recall MRR values for LDA Sample for k = 5
# lda_prm = lda precision, recall and mrr
lda_prm = sampled_precision_recall_mrr_lda(lda_common_user_vecs, sampled_test, 5)

+--------------------+--------------------+
|           user_hash|               top_k|
+--------------------+--------------------+
|f1e1cd4ff25018273...|[12007316, 113955...|
|aa5363b8b326ed296...|[1088914, 115884,...|
+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+
|           user_hash|               top_k|            test_set|
+--------------------+--------------------+--------------------+
|f1e1cd4ff25018273...|[12007316, 113955...|[136656, 843446, ...|
|aa5363b8b326ed296...|[1088914, 115884,...|[5864612, 4469058...|
+--------------------+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+---------+------+
|           user_hash|            test_set|               top_k|precision|recall|
+--------------------+--------------------+--------------------+---------+------+
|f1e1cd4ff25018273...|[136656, 843446, ...|[120073

In [290]:
sample_avg_precision = lda_prm.agg(avg(col("precision")))
sample_avg_recall = lda_prm.agg(avg(col("recall")))
sample_avg_mrr = lda_prm.agg(avg(col('Mrr')))

# Showing Results
print('Sample Results for k = 5')
sample_avg_precision.show()
sample_avg_recall.show()
sample_avg_mrr.show()

Sample Results for k = 5
+--------------+
|avg(precision)|
+--------------+
|           0.0|
+--------------+

+-----------+
|avg(recall)|
+-----------+
|        0.0|
+-----------+

+--------+
|avg(Mrr)|
+--------+
|     1.0|
+--------+
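
These averages can be cross-checked against each other. Under the standard definitions, a user with no hits in the top k has precision 0, recall 0 and reciprocal rank 0, so an average MRR of 1.0 next to an average precision of 0.0 may point to a swapped argument order or a non-zero fallback value in `mrr_udf`. This is only a guess, since the UDF is not shown here. The sketch below flags rows whose metrics contradict each other; `find_inconsistent_rows` and the sample rows are hypothetical and are not notebook output.

```python
def find_inconsistent_rows(rows):
    """Return the user hashes whose per-user metrics contradict each other.

    rows: iterable of (user_hash, precision, recall, mrr) tuples,
          all computed from the same top-k list.
    """
    flagged = []
    for user_hash, precision, recall, mrr in rows:
        no_hits = precision == 0.0
        if no_hits and (recall != 0.0 or mrr != 0.0):
            # Zero hits in the top k implies recall and reciprocal rank are also zero
            flagged.append(user_hash)
        elif not no_hits and mrr == 0.0:
            # At least one hit implies a positive reciprocal rank
            flagged.append(user_hash)
    return flagged


if __name__ == "__main__":
    # Hypothetical rows for illustration only
    sample_rows = [
        ("user_a", 0.0, 0.0, 1.0),
        ("user_b", 0.2, 0.1, 0.5),
        ("user_c", 0.0, 0.0, 0.0),
    ]
    print(find_inconsistent_rows(sample_rows))
```

In the notebook, the rows could presumably be obtained with `lda_prm.select('user_hash', 'precision', 'recall', 'Mrr').collect()`.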



#### For k = 10 

In [291]:
# Calculating the Precision Recall and MRR value for k = 10 using LDA
lda_prm_10 = sampled_precision_recall_mrr_lda(lda_common_user_vecs, sampled_test, 10)

#Calculating Averages of Precision Recall and MRR
sample_avg_precision = lda_prm_10.agg(avg(col("precision")))
sample_avg_recall = lda_prm_10.agg(avg(col("recall")))
sample_avg_mrr = lda_prm_10.agg(avg(col('Mrr')))

# Showing Results
print('Sample Results for k = 10')
sample_avg_precision.show()
sample_avg_recall.show()
sample_avg_mrr.show()

+--------------------+--------------------+
|           user_hash|               top_k|
+--------------------+--------------------+
|f1e1cd4ff25018273...|[12007316, 113955...|
|aa5363b8b326ed296...|[1088914, 115884,...|
+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+
|           user_hash|               top_k|            test_set|
+--------------------+--------------------+--------------------+
|f1e1cd4ff25018273...|[12007316, 113955...|[136656, 843446, ...|
|aa5363b8b326ed296...|[1088914, 115884,...|[5864612, 4469058...|
+--------------------+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+---------+------+
|           user_hash|            test_set|               top_k|precision|recall|
+--------------------+--------------------+--------------------+---------+------+
|f1e1cd4ff25018273...|[136656, 843446, ...|[120073

#### For k = 30 

In [292]:
# Calculating the Precision Recall and MRR value for k = 30 using LDA
lda_prm_30 = sampled_precision_recall_mrr_lda(lda_common_user_vecs, sampled_test, 30)

#Calculating Averages of Precision Recall and MRR
sample_avg_precision = lda_prm_30.agg(avg(col("precision")))
sample_avg_recall = lda_prm_30.agg(avg(col("recall")))
sample_avg_mrr = lda_prm_30.agg(avg(col('Mrr')))

# Showing Results
print('Sample Results for k = 30')
sample_avg_precision.show()
sample_avg_recall.show()
sample_avg_mrr.show()

+--------------------+--------------------+
|           user_hash|               top_k|
+--------------------+--------------------+
|f1e1cd4ff25018273...|[12007316, 113955...|
|aa5363b8b326ed296...|[1088914, 115884,...|
+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+
|           user_hash|               top_k|            test_set|
+--------------------+--------------------+--------------------+
|f1e1cd4ff25018273...|[12007316, 113955...|[136656, 843446, ...|
|aa5363b8b326ed296...|[1088914, 115884,...|[5864612, 4469058...|
+--------------------+--------------------+--------------------+
only showing top 2 rows

+--------------------+--------------------+--------------------+---------+------+
|           user_hash|            test_set|               top_k|precision|recall|
+--------------------+--------------------+--------------------+---------+------+
|f1e1cd4ff25018273...|[136656, 843446, ...|[120073

#### Observations based on values of Precision, Recall and MRR for k = [5,10,30]

For k = 5, CBRS tf-idf and CBRS LDA perform similarly: there is little difference in their precision, recall and MRR values. For k = 10, CBRS tf-idf has more hits than CBRS LDA, so it performs better. For k = 30, CBRS tf-idf again has more hits and a higher precision with almost the same recall, whereas the MRR value of LDA is greater in this case.

Overall observation: CBRS tf-idf performed better than CBRS LDA.