# Introduction

The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.

In this notebook, I will demonstrate two different approaches to building a recommender system.

1. **Collaborative Filtering**: This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.

2. **Content-Based Filtering**: This method uses only the descriptions and attributes of the items a user has previously consumed to model that user's preferences. In other words, these algorithms try to recommend items that are similar to those the user liked in the past (or is examining in the present). In particular, candidate items are compared with items the user previously rated, and the best-matching items are recommended.
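
To make the two ideas concrete, here is a minimal sketch on made-up toy data (the rating matrix and the item texts below are hypothetical and are not taken from the Deskdrop dataset). It is only an illustration of the two principles, not the models built later in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings: rows are users A, B, C; columns are four items.
# 0.0 means "no interaction yet".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],  # user A (has not seen item 2)
    [5.0, 4.0, 5.0, 1.0],  # user B (agrees with A on items 0, 1 and 3)
    [1.0, 0.0, 1.0, 5.0],  # user C (different taste)
])

# Collaborative filtering: find the user most similar to A and
# borrow that user's opinion on the item A has not seen.
user_sim = cosine_similarity(ratings)
np.fill_diagonal(user_sim, 0.0)          # ignore self-similarity
nearest_to_a = int(np.argmax(user_sim[0]))
cf_prediction = ratings[nearest_to_a, 2]

# Hypothetical toy item descriptions for the content-based case.
item_texts = [
    "bitcoin blockchain currency",
    "blockchain ethereum currency",
    "google data center cloud",
    "cloud data center tour",
]

# Content-based filtering: compare item descriptions and pick the
# item whose text is closest to one the user already liked.
tfidf = TfidfVectorizer().fit_transform(item_texts)
item_sim = cosine_similarity(tfidf)
liked_item = 0
scores = item_sim[liked_item].copy()
scores[liked_item] = -1.0                # never recommend the liked item itself
most_similar_item = int(np.argmax(scores))

print('Nearest neighbour of user A:', nearest_to_a)
print('Borrowed rating for item 2:', cf_prediction)
print('Item most similar to item 0:', most_similar_item)
```

The collaborative part never looks at item text, and the content-based part never looks at other users, which is the essential difference between the two approaches.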



In this notebook, I will use a dataset shared on Kaggle Datasets: Articles Sharing and Reading from CI&T Deskdrop.
I will demonstrate how to implement Collaborative Filtering and Content-Based Filtering methods in Python, for the task of providing personalized recommendations to the users.

# Loading Data

In [0]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

In this section, I load the Deskdrop dataset, which contains a real sample of 12 months of logs (Mar. 2016 - Feb. 2017) from CI&T's internal communication platform (DeskDrop). It contains about 73k logged user interactions on more than 3k public articles shared on the platform. It is composed of two CSV files:

1. shared_articles.csv
2. users_interactions.csv


## shared_articles.csv

Contains information about the articles shared on the platform. Each article has its sharing date (timestamp), the original URL, title, content in plain text, the article's language (lang; Portuguese: pt or English: en), and information about the user who shared the article (author).

There are two possible event types at a given timestamp:

* **CONTENT SHARED**: The article was shared in the platform and is available for users.

* **CONTENT REMOVED**: The article was removed from the platform and is no longer available for recommendation.


For simplicity, I will only consider the **"CONTENT SHARED"** event type, assuming (naively) that all articles were available during the whole one-year period.

For a more precise evaluation (and higher accuracy), only articles that were available at a given time should be recommended.
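
A minimal sketch of such an availability check is shown below. It is not used in the rest of this notebook. It runs on a hypothetical toy event log, and it assumes that a removal event carries the same contentId as the original share event; the helper name `available_articles` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy event log, using column names from shared_articles.csv.
toy_events = pd.DataFrame({
    'timestamp': [100, 200, 300, 400],
    'eventType': ['CONTENT SHARED', 'CONTENT SHARED',
                  'CONTENT REMOVED', 'CONTENT SHARED'],
    'contentId': [1, 2, 1, 3],
})

def available_articles(events_df, at_time):
    """Return the contentIds whose latest event up to `at_time` is CONTENT SHARED."""
    past = events_df[events_df['timestamp'] <= at_time]
    # After sorting by time, the last event per article decides its status.
    latest = past.sort_values('timestamp').groupby('contentId')['eventType'].last()
    return {int(c) for c in latest[latest == 'CONTENT SHARED'].index}

print(available_articles(toy_events, 250))
print(available_articles(toy_events, 350))
```

With a check like this, the candidate set for a recommendation made at time t would be restricted to `available_articles(events, t)`.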

In [5]:
path = '/content/drive/My Drive/2234_3774_bundle_archive/shared_articles.csv'
articles_df = pd.read_csv(path)
# Keep only the share events; removal events are ignored for simplicity.
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']
articles_df.head()

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,CONTENT SHARED,2448026894306402386,4340306774493623681,8940341205206233829,,,,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en
5,1459194522,CONTENT SHARED,-2826566343807132236,4340306774493623681,8940341205206233829,,,,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en


## users_interactions.csv

Contains logs of user interactions on **shared articles**. It can be joined to **shared_articles.csv** on the ***contentId*** column.

The eventType values are:

* **VIEW**: The user has opened the article.
* **LIKE**: The user has liked the article.
* **COMMENT CREATED**: The user created a comment on the article.
* **FOLLOW**: The user chose to be notified of any new comment on the article.
* **BOOKMARK**: The user has bookmarked the article for easy return in the future.

In [6]:
path = '/content/drive/My Drive/2234_3774_bundle_archive/users_interactions.csv'
userinter_df = pd.read_csv(path)
userinter_df.head()

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,
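
The join on contentId mentioned above can be sketched as follows. The two frames are hypothetical toy data with only a few of the real columns, so that the example runs on its own; with the real data the same call would take `userinter_df` and `articles_df`.

```python
import pandas as pd

# Hypothetical toy frames with a subset of the columns from the two CSV files.
toy_articles = pd.DataFrame({
    'contentId': [10, 20],
    'title': ['Article A', 'Article B'],
    'lang': ['en', 'pt'],
})
toy_interactions = pd.DataFrame({
    'personId': [1, 1, 2],
    'contentId': [10, 20, 10],
    'eventType': ['VIEW', 'LIKE', 'VIEW'],
})

# A left join keeps every interaction and attaches the article metadata to it.
toy_joined = toy_interactions.merge(toy_articles, on='contentId', how='left')
print(toy_joined[['personId', 'eventType', 'title', 'lang']])
```

A left join is chosen here so that interactions on articles missing from the metadata are kept (with empty metadata) instead of being silently dropped.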


# Data joining

Because there are different interaction types, we associate each one with a weight (strength), assuming that, for example, a comment on an article indicates higher user interest in the item than a like or a simple view.

In [9]:
event_type_weight = {
    'VIEW': 1.0,
    'LIKE': 2.0,
    'BOOKMARK': 2.5,
    'FOLLOW': 3.0,
    'COMMENT CREATED': 4.0
}

userinter_df['eventWeight'] = userinter_df['eventType'].apply(lambda x:event_type_weight[x])
userinter_df.head()

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry,eventWeight
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,,1.0
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US,1.0
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,,1.0
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,,3.0
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,,1.0


I will keep only users who interacted with at least 5 distinct articles, to mitigate the **user cold-start** problem.


In [12]:
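# Count the distinct articles each user interacted with
# (repeated events on the same article count once).
user_interaction_count = userinter_df.groupby(['personId','contentId']).size().groupby(['personId']).size()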
print('Number of users: ', len(user_interaction_count))

user_5_interaction_count = user_interaction_count[user_interaction_count >= 5].reset_index()[['personId']]
print('Number of users with min. 5 interactions: ', len(user_5_interaction_count))

Number of users:  1895
Number of users with min. 5 interactions:  1140


In [14]:
print('Number of total users interactions: ', len(userinter_df))

# Keep only the interactions of users with at least 5 interactions
interactions = userinter_df.merge(user_5_interaction_count, how='right', on='personId')
print('Number of interactions from users with min. 5 interactions: ', len(interactions))

Number of total users interactions:  72312
Number of interactions from users with min. 5 interactions:  69868


To model a user's interest in a given article, we aggregate all the interactions the user has performed on that item as a weighted sum of interaction-type strengths, then apply a log transformation to smooth the distribution.

In [15]:
def smooth_user_preference(x):
    return math.log(1+x, 2)

interactions_full = interactions.groupby(['personId',  'contentId'])['eventWeight'].sum().apply(smooth_user_preference).reset_index()
print('Number of unique user per item interaction: ', len(interactions_full))
interactions_full.head()

Number of unique user per item interaction:  39106


Unnamed: 0,personId,contentId,eventWeight
0,-9223121837663643404,-8949113594875411859,1.0
1,-9223121837663643404,-8377626164558006982,1.0
2,-9223121837663643404,-8208801367848627943,1.0
3,-9223121837663643404,-8187220755213888616,1.0
4,-9223121837663643404,-7423191370472335463,3.169925
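
The values in this output can be checked by hand. A smoothed weight of 1.0 equals log2(1 + 1), which is what a single VIEW produces, and 3.169925 equals log2(1 + 8), so the weighted sum for that user and article was 8.0. The output does not show which events added up to 8.0; the combination used below (one VIEW, one FOLLOW, one COMMENT CREATED) is only one hypothetical possibility.

```python
import math

# Weights copied from the event_type_weight mapping defined earlier.
event_type_weight = {
    'VIEW': 1.0,
    'LIKE': 2.0,
    'BOOKMARK': 2.5,
    'FOLLOW': 3.0,
    'COMMENT CREATED': 4.0,
}

def smooth_user_preference(x):
    return math.log(1 + x, 2)

# A single VIEW: log2(1 + 1.0)
single_view = smooth_user_preference(event_type_weight['VIEW'])

# Hypothetical event combination whose weights sum to 8.0.
hypothetical_events = ['VIEW', 'FOLLOW', 'COMMENT CREATED']
weighted_sum = sum(event_type_weight[e] for e in hypothetical_events)
combined = smooth_user_preference(weighted_sum)

print(single_view)
print(weighted_sum, round(combined, 6))
```

The log transform compresses heavy activity: going from a weighted sum of 1.0 to 8.0 raises the score only from 1.0 to about 3.17.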


# Evaluation