# Recommendations with IBM

In this notebook, you will be putting your recommendation skills to use on real data from the IBM Watson Studio platform. 


You may either submit your notebook through the workspace here, or you may work from your local machine and submit through the next page.  Either way, ensure that your code passes the project [RUBRIC](https://review.udacity.com/#!/rubrics/3325/view).  **Please save regularly.**

By following the table of contents, you will build out a number of different methods for making recommendations that can be used for different situations. 


## Table of Contents

I. [Exploratory Data Analysis](#Exploratory-Data-Analysis)<br>
II. [Rank Based Recommendations](#Rank)<br>
III. [User-User Based Collaborative Filtering](#User-User)<br>
IV. [Content Based Recommendations (EXTRA - NOT REQUIRED)](#Content-Recs)<br>
V. [Matrix Factorization](#Matrix-Fact)<br>
VI. [Extras & Concluding](#conclusions)

At the end of the notebook, you will find directions for how to submit your work.  Let's get started by importing the necessary libraries and reading in the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import project_tests as t
import pickle
import seaborn as sns
%matplotlib inline

df = pd.read_csv('data/rating.csv')
df_content = pd.read_csv('data/anime.csv')
#del df['Unnamed: 0']

df["rating"] = df["rating"].fillna(0)
df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


In [2]:
df = df.head(100000)

In [3]:
df["user_id"].value_counts()

342    1259
478     863
54      834
958     783
446     745
       ... 
824       1
731       1
388       1
555       1
212       1
Name: user_id, Length: 1028, dtype: int64

In [4]:
len(df)

100000

In [None]:
df = df.merge(df_content[["anime_id","name"]], on="anime_id").dropna().drop(["name"], axis=1)

In [None]:
# Show df_content to get an idea of the data
df_content.head()

In [None]:
df_content["rating"] = df_content["rating"].fillna(0)
df_content.head()

### <a class="anchor" id="Exploratory-Data-Analysis">Part I : Exploratory Data Analysis</a>

Use the dictionary and cells below to provide some insight into the descriptive statistics of the data.

`1.` What is the distribution of how many articles a user interacts with in the dataset?  Provide a visual and descriptive statistics to assist with giving a look at the number of times each user interacts with an article.  
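
As a minimal sketch of the summary this question asks for, the snippet below counts interactions per user on a small made-up table and then describes that distribution. The toy frame and the name `interactions_per_user` are illustrative assumptions, not part of the project data.

```python
import pandas as pd

# Toy interaction log (hypothetical values, not the project data)
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "anime_id": [20, 24, 79, 20, 79, 24],
})

# Number of rows (interactions) recorded for each user
interactions_per_user = toy.groupby("user_id")["anime_id"].count()

# Descriptive statistics of that per-user distribution
summary = interactions_per_user.describe()
print(summary)
```

Calling `interactions_per_user.plot(kind="hist")` on the same series would give the matching visual.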

In [None]:
df["anime_id"].value_counts().describe()

In [None]:
df.groupby(["user_id"]).count().describe()

In [None]:
df.groupby(["user_id","anime_id"]).count().describe()

In [None]:
plt.hist(df["user_id"].value_counts(), bins=50)  # number of interactions per user

`2.` Explore and remove duplicate articles from the **df_content** dataframe.  

In [None]:
# Find and explore duplicate articles

In [None]:
# Remove any rows that have the same anime_id - only keep the first
df_content = df_content.drop_duplicates(subset=["anime_id"], keep="first")

`3.` Use the cells below to find:

**a.** The number of unique articles that have an interaction with a user.  
**b.** The number of unique articles in the dataset (whether they have any interactions or not).<br>
**c.** The number of unique users in the dataset. (excluding null values) <br>
**d.** The number of user-article interactions in the dataset.
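
The four counts above can be sketched on toy data as follows. The frames `toy_interactions` and `toy_catalog` are hypothetical stand-ins for **df** and **df_content**; the point is only to show which pandas call answers which question.

```python
import pandas as pd

# Hypothetical stand-ins for df (interactions) and df_content (catalog)
toy_interactions = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, None],   # one interaction has no user id
    "anime_id": [20, 24, 20, 79, 79],
})
toy_catalog = pd.DataFrame({"anime_id": [20, 24, 79, 226]})

# (a) unique items that appear in at least one interaction
unique_items_with_interaction = toy_interactions["anime_id"].nunique()
# (b) unique items in the catalog, interacted with or not
unique_items_total = toy_catalog["anime_id"].nunique()
# (c) unique users; nunique() skips null values by default
unique_users = toy_interactions["user_id"].nunique()
# (d) total number of interaction rows
total_interactions = len(toy_interactions)
```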

In [None]:
print(len(df[df.user_id !=""]["anime_id"].unique()))
print(len(df_content["anime_id"]))
print(len(df[df.user_id !=""]["user_id"].dropna().unique()))
print(len(df[["anime_id","user_id"]]))

### <a class="anchor" id="Rank">Part II: Rank-Based Recommendations</a>

Unlike in the earlier lessons, we don't actually have ratings for whether a user liked an article or not.  We only know that a user has interacted with an article.  In these cases, the popularity of an article can really only be based on how often an article was interacted with.

`1.` Fill in the function below to return the **n** top articles ordered with most interactions as the top. Test your function using the tests below.
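
For reference, a count-based ranking of the kind described here could look like the sketch below. The helper name `top_n_by_interactions` and the toy data are assumptions for illustration; the notebook's own `get_top_anime` ranks by the `rating` column of **df_content** instead.

```python
import pandas as pd

def top_n_by_interactions(interactions, n, item_col="anime_id"):
    """Return the n item ids with the most rows in `interactions`, most popular first."""
    return interactions[item_col].value_counts().head(n).index.tolist()

# Toy interaction log: item 20 has 3 interactions, item 24 has 2, item 79 has 1
toy = pd.DataFrame({
    "user_id":  [1, 2, 3, 1, 2, 1],
    "anime_id": [20, 20, 20, 24, 24, 79],
})
top_two = top_n_by_interactions(toy, 2)
```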

In [None]:
df_content["rating"].sort_values(ascending=False)

In [None]:
# def get_top_articles(n, df=df):
#     '''
#     INPUT:
#     n - (int) the number of top articles to return
#     df - (pandas dataframe) df as defined at the top of the notebook 
    
#     OUTPUT:
#     top_articles - (list) A list of the top 'n' article titles 
    
#     '''
#     top_articles= list(set(df[(df.article_id).apply(lambda x:int(x)).isin(
#         list(df["article_id"].value_counts().head(n).index))]["title"]))
    
    
#     return top_articles # Return the top article titles from df (not df_content)


# def get_top_article_ids(n, df=df):
#     '''
#     INPUT:
#     n - (int) the number of top articles to return
#     df - (pandas dataframe) df as defined at the top of the notebook 
    
#     OUTPUT:
#     top_articles - (list) A list of the top 'n' article titles 
    
#     '''
#     top_articles= list(set(df[(df.article_id).apply(lambda x:int(x)).isin(list(df["article_id"].value_counts().head(n).index))]["article_id"].astype('str')))
 
#     return top_articles # Return the top article ids

In [None]:

def get_top_anime(n, df_content=df_content):
    '''
    INPUT:
    n - (int) the number of top anime to return
    df_content - (pandas dataframe) df_content as defined at the top of the notebook 
    
    OUTPUT:
    top_anime - (list) the names of the top 'n' anime by rating
    top_anime_id - (list) the matching anime ids, in the same order
    
    '''
    # sort once by rating (highest first) and keep the first n rows
    ranked = df_content.sort_values("rating", ascending=False).head(n)
    top_anime = list(ranked["name"])
    top_anime_id = list(ranked["anime_id"])
    return top_anime, top_anime_id # Return the top names and ids from df_content


In [None]:
top_anime , top_anime_id= get_top_anime(10)
print(top_anime)
print(top_anime_id)

### <a class="anchor" id="User-User">Part III: User-User Based Collaborative Filtering</a>


`1.` Use the function below to reformat the **df** dataframe to be shaped with users as the rows and articles as the columns.  

* Each **user** should only appear in each **row** once.


* Each **article** should only show up in one **column**.  


* **If a user has interacted with an article, then place a 1 where the user-row meets for that article-column**.  It does not matter how many times a user has interacted with the article, all entries where a user has interacted with an article should be a 1.  


* **If a user has not interacted with an item, then place a zero where the user-row meets for that article-column**. 

Use the tests to make sure the basic structure of your matrix matches what is expected by the solution.
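
One way to build such a 0/1 matrix is sketched below on toy data. It uses `pd.crosstab` with `clip`, which is an alternative to the `groupby`/`unstack` approach in the next cell and is offered only as an illustration; the frame `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 3],
    "anime_id": [20, 20, 24, 24, 79],   # user 1 interacted with item 20 twice
})

# crosstab counts interactions per (user, item) pair;
# clip(upper=1) caps repeated interactions at 1 so every entry is 0 or 1
toy_user_item = pd.crosstab(toy["user_id"], toy["anime_id"]).clip(upper=1)
print(toy_user_item)
```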

In [None]:
# create the user-article matrix with 1's and 0's
def create_user_item_matrix(df):
    
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    
    OUTPUT:
    user_item - user item matrix 
    
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with 
    an article and a 0 otherwise
    '''
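    # count rows per (user, anime) pair, then spread the anime ids across the columns
    user_item = df.groupby(['user_id', 'anime_id']).count().unstack()
    user_item = user_item.fillna(0)
    # cap repeated interactions at 1 so the matrix is strictly 0/1, as the docstring states
    user_item[user_item > 1] = 1
    return user_item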

In [None]:
user_item = create_user_item_matrix(df)

In [None]:
import pickle
with open("user_item_.p","wb") as p:
    pickle.dump(user_item, p)
user_item  = pd.read_pickle('user_item_.p')

`2.` Complete the function below which should take a user_id and provide an ordered list of the most similar users to that user (from most similar to least similar).  The returned result should not contain the provided user_id, as we know that each user is similar to him/herself. Because the results for each user here are binary, it (perhaps) makes sense to compute similarity as the dot product of two users. 

Use the tests to test your function.
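
To make the dot-product idea concrete: for binary rows, the dot product of two users equals the number of items both have interacted with. The sketch below ranks neighbours on a hypothetical 4-user matrix; `user_ids`, `matrix`, and `most_similar` are illustrative names, not part of the notebook.

```python
import numpy as np

# Rows are users, columns are items (hypothetical 0/1 interactions)
user_ids = np.array([10, 11, 12, 13])
matrix = np.array([
    [1, 1, 0, 1],   # user 10
    [1, 1, 0, 0],   # user 11
    [0, 0, 1, 0],   # user 12
    [1, 1, 1, 1],   # user 13
])

def most_similar(user_id):
    """Ids of the other users, ordered by shared-item count; ties keep row order."""
    row = np.where(user_ids == user_id)[0][0]
    scores = matrix @ matrix[row]               # dot product with every user
    order = np.argsort(-scores, kind="stable")  # highest score first
    return [int(user_ids[i]) for i in order if i != row]

neighbours_of_10 = most_similar(10)
```

User 13 shares three items with user 10, user 11 shares two, and user 12 shares none, so they come out in that order.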

In [None]:
def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of every pair of users based on the dot product
    Returns an ordered list of the other users, most similar first
    
    '''
    # compute similarity of each user to the provided user
    scores = np.asarray(np.dot(user_item[user_item.index == user_id], user_item.T)).ravel()
    similarity = pd.DataFrame({"user_id": user_item.index, "similarity_score": scores})
    # sort by similarity, breaking ties by the smaller user id
    similarity = similarity.sort_values(["similarity_score", "user_id"], ascending=[False, True])
    # create list of just the ids
    most_similar_users = list(similarity["user_id"])
    # remove the own user's id
    most_similar_users_exclude_own = [x for x in most_similar_users if x != user_id]  
    return most_similar_users_exclude_own # return a list of the users in order from most to least similar

In [None]:
# Do a spot check of your function
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 33 are: {}".format(find_similar_users(33)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))

In [None]:
user_item.head()

`3.` Now that you have a function that provides the most similar users to each user, you will want to use these users to find articles you can recommend.  Complete the functions below to return the articles you would recommend to each user. 
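
The recommendation loop itself can be sketched without pandas. In the snippet below, `seen_by_user` is a hypothetical dictionary mapping each user id to the item ids they have seen, and `neighbors_in_order` is assumed to be already sorted from most to least similar.

```python
def recommend_from_neighbors(seen_by_user, target, neighbors_in_order, m):
    """Collect up to m unseen item ids, visiting neighbours from most to least similar."""
    already_seen = set(seen_by_user[target])
    recs = []
    for neighbor in neighbors_in_order:
        for item in seen_by_user[neighbor]:
            if len(recs) >= m:
                return recs
            # skip items the target has seen and items already recommended
            if item not in already_seen and item not in recs:
                recs.append(item)
    return recs[:m]

# Hypothetical viewing histories
seen = {1: [20, 24], 2: [20, 79, 226], 3: [24, 241, 79]}
recs = recommend_from_neighbors(seen, target=1, neighbors_in_order=[2, 3], m=3)
```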

In [None]:
def get_anime_names(anime_ids, df_content=df_content):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''
    # Your code here
    anime_names = []
    for id in anime_ids :
        
     #   article_names.append( df[df.article_id==id]["title"].unique()[0])    
      #  print(id)
        anime_names.append(df_content[df_content.anime_id.astype("float")==id]["name"].values[0])
    
    return anime_names # Return the article names associated with list of article ids


def get_user_animes(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the doc_full_name column in df_content)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''
    mask = user_item[user_item.index == user_id]
    mask = list(mask.iloc[0])
    anime_id = list(user_item.rating.reset_index().columns[1:])
    id_freq_df = pd.DataFrame([anime_id, mask]).T
    anime_ids = list(id_freq_df[id_freq_df[1]==1][0])
   # article_ids = [str(x) for x in article_ids]
    anime_names = get_anime_names(anime_ids)
    
    return anime_ids, anime_names # return the ids and names


def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    # Your code here
    recs = []
    own_anime_ids, own_anime_names = get_user_animes(user_id, user_item=user_item)
    similar_users = find_similar_users(user_id)
    # walk the neighbours from most to least similar (the list already excludes user_id)
    for neighbor_id in similar_users:
        similar_users_anime_ids, anime_names = get_user_animes(neighbor_id)
        for i in [x for x in similar_users_anime_ids if (x not in own_anime_ids and x not in recs)]:
            recs.append(i)
        # stop as soon as at least m recommendations have been collected
        if len(recs) >= m:
            break
    return recs[0:m] # return your recommendations for this user_id    

In [None]:
user_user_recs(user_id=4, m=20)

In [None]:
# Check Results
get_anime_names(user_user_recs(4, 20))# Return the names of 20 recommendations for user 4

`4.` Now we are going to improve the consistency of the **user_user_recs** function from above.  

* Instead of arbitrarily choosing when we obtain users who are all the same closeness to a given user - choose the users that have the most total article interactions before choosing those with fewer article interactions.


* Instead of arbitrarily choosing articles from the user where the number of recommended articles starts below m and ends exceeding m, choose the articles with the most total interactions before choosing those with fewer total interactions. This ranking should be what would be obtained from the **top_articles** function you wrote earlier.
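
The tie-breaking rule for neighbours amounts to a two-key sort: similarity first, then total interactions. A minimal sketch on made-up numbers follows; the column names `similarity` and `num_interactions` follow the docstring further down, and the values are hypothetical.

```python
import pandas as pd

neighbors = pd.DataFrame({
    "neighbor_id":      [11, 12, 13, 14],
    "similarity":       [3, 5, 3, 1],
    "num_interactions": [40, 7, 90, 300],
})

# Highest similarity first; among equally similar users, the more active one first
ranked = neighbors.sort_values(
    ["similarity", "num_interactions"], ascending=[False, False]
)
ranked_ids = ranked["neighbor_id"].tolist()
```

Users 11 and 13 tie on similarity, so user 13 (90 interactions) is placed ahead of user 11 (40 interactions).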

In [None]:
#ORIGINAL

# def get_top_sorted_users(user_id, df=df, user_item=user_item):
#     '''
#     INPUT:
#     user_id - (int)
#     df - (pandas dataframe) df as defined at the top of the notebook 
#     user_item - (pandas dataframe) matrix of users by articles: 
#             1's when a user has interacted with an article, 0 otherwise
    
            
#     OUTPUT:
#     neighbors_df - (pandas dataframe) a dataframe with:
#                     neighbor_id - is a neighbor user_id
#                     similarity - measure of the similarity of each user to the provided user_id
#                     num_interactions - the number of articles viewed by the user - if a u
                    
#     Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
#                     highest of each is higher in the dataframe
     
#     '''
#     # Your code here

#     neighbors_df = pd.DataFrame(np.dot(user_item[user_item.index == user_id],user_item.T).T, columns=["similiarity_score"])
#     user_id_list = dict(zip(pd.DataFrame(user_item.index).index,pd.DataFrame(user_item.index)["user_id"]))
#     neighbors_df.index = neighbors_df.index.map(user_id_list)    
#     neighbors_df =neighbors_df.reset_index().sort_values(["similiarity_score","index"],ascending=[False,True])
#     neighbors_df.rename(columns = {'index':'neighbor_id'}, inplace = True)
#     neighbors_df= neighbors_df.merge(df,left_on="neighbor_id",right_on="user_id").drop(["user_id","anime_id"], axis=1)
#     neighbors_df.rename(columns = {'rating':'num_interactions'}, inplace = True)
#     neighbors_df =neighbors_df.sort_values(["similiarity_score","num_interactions"], ascending=False)  
#     neighbors_df = neighbors_df[neighbors_df["neighbor_id"] != user_id]
#     return neighbors_df # Return the dataframe specified in the doc_string

# def user_user_recs_part2(user_id, m=10):
#     '''
#     INPUT:
#     user_id - (int) a user id
#     m - (int) the number of recommendations you want for the user
    
#     OUTPUT:
#     recs - (list) a list of recommendations for the user by article id
#     rec_names - (list) a list of recommendations for the user by article title
    
#     Description:
#     Loops through the users based on closeness to the input user_id
#     For each user - finds articles the user hasn't seen before and provides them as recs
#     Does this until m recommendations are found
    
#     Notes:
#     * Choose the users that have the most total article interactions 
#     before choosing those with fewer article interactions.

#     * Choose articles with the articles with the most total interactions 
#     before choosing those with fewer total interactions. 
   
#     '''
#     # Your code here
  
#     user_count = 0
#     recs = []
#     own_article_ids, own_article_names = get_user_articles(user_id, user_item=user_item)
#     top_users = list(get_top_sorted_users(user_id)["neighbor_id"])
    
#     for j in range (1,1000):
#         if len(recs)>=m:
#             break
#         similar_users_article_ids, article_names = get_user_articles(top_users[j])
#         similar_users_article_ids_int = [float(x) for x in similar_users_article_ids]
#         df_sub = df[df["anime_id"].apply(lambda x: x in similar_users_article_ids_int)]
#         similar_users_article_ids = get_top_article_ids(len(similar_users_article_ids_int), df=df_sub)
#         for i in [x for x in similar_users_article_ids if (x not in own_article_ids and x not in recs)]:
#             if len(recs)>=m:
#                 break
#             else:
#                 recs.append(i)
#     rec_names = get_article_names(recs, df=df)
    
    
#     return recs, rec_names

In [None]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
            1's when a user has interacted with an article, 0 otherwise
    
            
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    neighbor_id - is a neighbor user_id
                    similarity - measure of the similarity of each user to the provided user_id
                    num_interactions - the number of articles viewed by the user - if a u
                    
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
                    highest of each is higher in the dataframe
     
    '''
    # Your code here

    # similarity of every user to user_id (dot product of their rows)
    neighbors_df = pd.DataFrame(np.dot(user_item[user_item.index == user_id],user_item.T).T, columns=["similiarity_score"])
    user_id_list = dict(zip(pd.DataFrame(user_item.index).index,pd.DataFrame(user_item.index)["user_id"]))
    neighbors_df.index = neighbors_df.index.map(user_id_list)    
    neighbors_df =neighbors_df.reset_index().sort_values(["similiarity_score","index"],ascending=[False,True])
    neighbors_df.rename(columns = {'index':'neighbor_id'}, inplace = True)
    # attach each neighbour's rated anime, one row per (neighbour, anime) pair
    neighbors_df= neighbors_df.merge(df,left_on="neighbor_id",right_on="user_id")
    # drop the user's own rows, then order by similarity and rating (highest first)
    neighbors_df = neighbors_df[neighbors_df["neighbor_id"] != user_id]
    neighbors_df =neighbors_df.sort_values(["similiarity_score","rating"], ascending=False)  
    return neighbors_df # Return the dataframe specified in the doc_string

def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    * Choose the users that have the most total article interactions 
    before choosing those with fewer article interactions.

    * Choose articles with the articles with the most total interactions 
    before choosing those with fewer total interactions. 
   
    '''
    # Your code here
  
    recs = []
    own_anime_ids, own_anime_names = get_user_animes(user_id, user_item=user_item)
    # anime ids watched by the neighbours, already ordered by similarity and then rating
    top_anime_list = list(get_top_sorted_users(user_id)["anime_id"])
    
    for i in top_anime_list:
        if len(recs) >= m:
            break
        # skip anime the user has already seen and anime already recommended
        if i not in own_anime_ids and i not in recs:
            recs.append(i)
    rec_names = get_anime_names(recs)
    return recs, rec_names

In [None]:
# Quick spot check - don't change this code - just use it to test your functions
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)

`5.` Use your functions from above to correctly fill in the solutions to the dictionary below.  Then test your dictionary against the solution.  Provide the code you need to answer each following the comments below.

# NEW USERS

In [None]:
new_user = '0.0'

# What would your recommendations be for this new user '0.0'?  As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to this new user
_, new_user_recs = get_top_anime(10)# get_top_anime returns (names, ids); keep the ids

In [None]:
get_top_anime(10)[1]

### <a class="anchor" id="Matrix-Fact">Part V: Matrix Factorization</a>

In this part of the notebook, you will use matrix factorization to make article recommendations to the users on the IBM Watson Studio platform.

`1.` You should have already created a **user_item** matrix in **question 1** of **Part III** above.  This first question just requires that you run the cells to get things set up for the rest of **Part V** of the notebook. 
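
Before running the cells, it may help to see the core idea of the factorization on a tiny example. The sketch below fits two small matrices `U` and `V` to the known entries of a hypothetical 2 x 2 ratings matrix by stochastic gradient descent, skipping missing (NaN) entries. It is a simplified illustration under those assumptions, and it uses the pre-update user vector when updating the item vector, which differs slightly from the `FunkSVD` cell further down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings: NaN marks a missing (unrated) entry
ratings = np.array([[8.0, np.nan],
                    [np.nan, 6.0]])
n_users, n_items = ratings.shape
k = 2                                   # number of latent features
U = rng.random((n_users, k))            # user x latent-feature matrix
V = rng.random((k, n_items))            # latent-feature x item matrix

def sse(U, V):
    """Sum of squared errors over the known ratings only."""
    known = ~np.isnan(ratings)
    return float(((ratings[known] - (U @ V)[known]) ** 2).sum())

before = sse(U, V)
learning_rate = 0.01
for _ in range(50):
    for i in range(n_users):
        for j in range(n_items):
            if np.isnan(ratings[i, j]):
                continue                # nothing to learn from a missing rating
            diff = ratings[i, j] - U[i] @ V[:, j]
            u_old = U[i].copy()
            # move both vectors along the negative gradient of diff**2
            U[i] += learning_rate * 2 * diff * V[:, j]
            V[:, j] += learning_rate * 2 * diff * u_old
after = sse(U, V)
print(before, after)
```

The product `U @ V` also fills in the entries that were missing, which is what makes the factorization usable for prediction.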

In [None]:
user_item_matrix_by_rating= df.groupby(['user_id', 'anime_id'])['rating'].max().unstack()
import pickle
with open("user_item_matrix.p","wb") as p:
    pickle.dump(user_item_matrix_by_rating, p)

In [None]:
# Load the matrix here
user_item_matrix = pd.read_pickle('user_item_matrix.p')

In [None]:
# quick look at the matrix that was just loaded
user_item_matrix.head()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import sparse
#import svd_tests as t
%matplotlib inline


In [None]:
listed_anime = list(df.anime_id.value_counts().index[0:5])

In [None]:
user_movie_subset = user_item_matrix_by_rating[listed_anime].dropna(axis=0)

In [None]:
rating_mat = np.matrix(user_movie_subset[0:10])

In [None]:
def FunkSVD(ratings_mat, latent_features=4, learning_rate=0.0001, iters=100):
    '''
    This function performs matrix factorization using a basic form of FunkSVD with no regularization
    
    INPUT:
    ratings_mat - (numpy array) a matrix with users as rows, movies as columns, and ratings as values
    latent_features - (int) the number of latent features used
    learning_rate - (float) the learning rate 
    iters - (int) the number of iterations
    
    OUTPUT:
    user_mat - (numpy array) a user by latent feature matrix
    movie_mat - (numpy array) a latent feature by movie matrix
    '''
    
    # Set up useful values to be used through the rest of the function
    n_users = ratings_mat.shape[0]
    n_movies = ratings_mat.shape[1]
    # count only the ratings the training loop uses (values > 0; NaN and -1 are skipped there)
    num_ratings = np.count_nonzero(ratings_mat > 0)
    
    # initialize the user and movie matrices with random values
    user_mat = np.random.rand(n_users, latent_features)
    movie_mat = np.random.rand(latent_features, n_movies)
    
    # initialize sse at 0 for first iteration
    sse_accum = 0
    
    # keep track of iteration and MSE
    print("Optimization Statistics")
    print("Iterations | Mean Squared Error ")
    
    # for each iteration
    for iteration in range(iters):

        # update our sse
        old_sse = sse_accum
        sse_accum = 0
        
        # For each user-movie pair
        for i in range(n_users):
            for j in range(n_movies):
                
                # if the rating exists
                if ratings_mat[i, j] > 0:
                    
                    # compute the error as the actual minus the dot product of the user and movie latent features
                    diff = ratings_mat[i, j] - np.dot(user_mat[i, :], movie_mat[:, j])
                    
                    # Keep track of the sum of squared errors for the matrix
                    sse_accum += diff**2
                    
                    # update the values in each matrix in the direction of the gradient
                    for k in range(latent_features):
                        user_mat[i, k] += learning_rate * (2*diff*movie_mat[k, j])
                        movie_mat[k, j] += learning_rate * (2*diff*user_mat[i, k])

        # print results
        print("%d \t\t %f" % (iteration+1, sse_accum / num_ratings))
        
    return user_mat, movie_mat 

In [None]:
# Small sample run for illustration only; the full training run is further below

In [None]:
user_mat, movie_mat = FunkSVD(rating_mat, 
                              latent_features=5, 
                              learning_rate=0.008, 
                              iters=300)

In [None]:
print(np.dot(user_mat, movie_mat))
print(rating_mat)

In [None]:
preds = np.dot(user_mat, movie_mat)
print("The predicted value for the missing rating is: {}".format(preds[0,0]))

In [None]:
def create_train_test(reviews, training_size=0.7):
    #reviews_new = reviews.sort_values(order_by)
    reviews = reviews.sample(frac=1).reset_index(drop=True)
    train_ratio= int(len(reviews)*training_size)
    training_df = reviews.head(train_ratio)
    validation_df = reviews[train_ratio:]

    return training_df, validation_df

In [None]:
train_df, val_df = create_train_test(df, 0.8)
train_user_item = train_df[['user_id', 'anime_id', 'rating']]
train_data_df = train_user_item.groupby(['user_id', 'anime_id'])['rating'].max().unstack()
train_data_np = np.array(train_data_df )

In [None]:
user_mat, movie_mat = FunkSVD(train_data_np, latent_features=3, learning_rate=0.005, iters=100)

In [None]:
def predict_rating(user_matrix, movie_matrix, user_id, movie_id):
    '''
    INPUT:
    user_matrix - user by latent factor matrix
    movie_matrix - latent factor by movie matrix
    user_id - the user_id from the reviews df
    movie_id - the movie_id according the movies df
    
    OUTPUT:
    pred - the predicted rating for user_id-movie_id according to FunkSVD
    '''
    # Create series of users and movies in the right order
    user_ids_series = np.array(train_data_df.index)
    movie_ids_series = np.array(train_data_df.columns)
    
    # User row and Movie Column
    user_row = np.where(user_ids_series == user_id)[0][0]
    movie_col = np.where(movie_ids_series == movie_id)[0][0]
    
    # Take dot product of that row and column in U and V to make prediction
    pred = np.dot(user_matrix[user_row, :], movie_matrix[:, movie_col])
    pred =  max(min(pred,10),-1)
    return pred

In [None]:
train_data_df.columns

In [None]:
pred_val = predict_rating(user_mat, movie_mat, 1, 1)
pred_val

In [None]:
def print_prediction_summary(user_id, movie_id, prediction):
    '''
    INPUT:
    user_id - the user_id from the reviews df
    movie_id - the movie_id according the movies df
    prediction - the predicted rating for user_id-movie_id
    '''
    prediction = max(min(predict_rating(user_mat, movie_mat, user_id, movie_id),10),-1)
    movie_name = str(df_content[df_content['anime_id'] == movie_id]['name'].values[0])
    movie_name = movie_name.replace('\nName: anime, dtype: object', '').strip()
    print("For User {}, we predict a {}/10 rating for the anime '{}'.".format(user_id, round(prediction, 2), str(movie_name)))

In [None]:
print_prediction_summary( 2, 12189, pred_val)

In [None]:
df[df.user_id == 2]

In [None]:
def validation_comparison(val_df, user_mat=user_mat, movie_mat=movie_mat):
    '''
    INPUT:
    val_df - the validation dataset created in the third cell above
    user_mat - U matrix in FunkSVD
    movie_mat - V matrix in FunkSVD
        
    OUTPUT:
    rmse - RMSE of how far off each value is from its predicted value
    perc_rated - percent of predictions out of all possible that could be rated
    actual_v_pred - a 10 x 10 grid with counts for actual vs predicted values
    '''
        
    val_users = np.array(val_df['user_id'])
    val_movies = np.array(val_df['anime_id'])
    val_ratings = np.array(val_df['rating'])
    
    sse = 0
    num_rated = 0
    preds, acts = [], []
    actual_v_pred = np.zeros((10,10))
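    for idx in range(len(val_users)):
        try:
            pred = predict_rating(user_mat, movie_mat, val_users[idx], val_movies[idx])
        except IndexError:
            # user or anime not present in the training matrix: no prediction possible
            continue
        sse += (pred - val_ratings[idx])**2
        num_rated+=1
        preds.append(pred)
        acts.append(val_ratings[idx])
        # only values on the 1-10 scale fit the 10 x 10 grid (-1 means "watched, not rated")
        actual, predicted = int(val_ratings[idx]), int(round(pred))
        if 1 <= actual <= 10 and 1 <= predicted <= 10:
            actual_v_pred[actual - 1, predicted - 1]+=1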
    rmse = np.sqrt(sse/num_rated)
    perc_rated = num_rated/len(val_users) * 100
    return rmse, perc_rated, actual_v_pred, preds, acts

In [None]:
# How well did we do?
rmse, perc_rated, actual_v_pred, preds, acts = validation_comparison(val_df)
print(rmse, perc_rated)
sns.heatmap(actual_v_pred);
plt.xticks(np.arange(10), np.arange(1,11));
plt.yticks(np.arange(10), np.arange(1,11));
plt.xlabel("Predicted Values");
plt.ylabel("Actual Values");
plt.title("Actual vs. Predicted Values");

In [None]:
plt.figure(figsize=(8,8))
plt.hist(acts, density=True, alpha=.5, label='actual');
plt.hist(preds, density=True, alpha=.5, label='predicted');
plt.legend(loc=2, prop={'size': 15});
plt.xlabel('Rating');
plt.title('Predicted vs. Actual Rating');

In [None]:
not_rated = int(len(val_df['rating'])*(1-perc_rated/100))
rated = int(len(val_df['rating'])*perc_rated/100)

print("Number not rated {}".format(not_rated))
print("Number rated {}.".format(rated))

# NEW MOVIE 

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
df_content_to_select = df_content[["anime_id","genre","type"]].copy()
df_content_to_select["genre"] = df_content_to_select["genre"].fillna("Unknown")
# split on the comma plus any following spaces so "Action" and " Action" are not treated as different genres
df_content_to_select["genre"] = df_content_to_select["genre"].str.split(r",\s*")
# keep the original index so the concat below lines up row by row
df_model = pd.DataFrame(df_content_to_select["genre"].tolist(), index=df_content_to_select.index)
df_content_to_select = pd.concat([df_content_to_select, df_model], axis=1)
df_content_to_select = df_content_to_select.drop("genre", axis=1).melt(id_vars=["anime_id","type"]).dropna(subset=["value"])
df_content_to_select["variable"] = 1
df_content_to_select = df_content_to_select.drop_duplicates()
df_content_to_select = df_content_to_select.pivot(index = ["anime_id","type"], columns=["value"]).fillna(0)
df_content_to_select.columns = [' '.join(col).strip().replace("variable ","").replace("-","_") for col in df_content_to_select.columns.values]
df_content_to_select.reset_index(inplace=True)
df_content_to_select = pd.get_dummies(df_content_to_select, columns=["type"])

In [None]:
# df_main = df_new.reset_index()
# df_sub = df_new[["name","dob","traits"]].reset_index()
# df2_sub = pd.DataFrame(df_sub["traits"].str.split(";"))
# df2_sub = pd.DataFrame(df2_sub["traits"].values.tolist())
# df3_sub = pd.concat([df_sub[["index","name","dob"]],df2_sub], axis=1)
# df3_sub = df3_sub.melt(id_vars=["name","dob","index"])
# df3_sub["variable"] = 1
# df3_sub = df3_sub.drop_duplicates()
# df3_sub = df3_sub.dropna()
# df3_sub= df3_sub.pivot(index=["name","dob","index"], columns=["value"]).fillna(0).reset_index()
# df3_sub.columns = [' '.join(col).strip().replace("variable ","trait_").replace("-","_") for col in df3_sub.columns.values]
# df3_sub = df3_sub.drop(["name","dob"], axis=1)
# df_merged = df_main.merge(df3_sub, on="index")

In [None]:
df_content_to_select

In [None]:
# Use every indicator column (genre and type dummies) as the content features for each title
movie_content = np.array(df_content_to_select.drop("anime_id", axis=1), dtype=float)

# Take the dot product to obtain a movie x movie matrix of similarities
dot_prod_movies = movie_content.dot(np.transpose(movie_content))


def find_similar_movies(movie_id, n=10):
    '''
    INPUT
    movie_id - an anime_id present in df_content_to_select
    n - number of similar titles to return
    OUTPUT
    similar_movies - an array of the n most similar titles by name
    '''
    # rows of dot_prod_movies follow df_content_to_select, not df_content
    movie_idx = np.where(df_content_to_select['anime_id'] == movie_id)[0][0]

    # rank all titles by similarity, highest first, and skip the title itself
    ranked_idxs = np.argsort(dot_prod_movies[movie_idx])[::-1]
    similar_idxs = [idx for idx in ranked_idxs if idx != movie_idx][:n]

    # map row positions back to anime_ids, then to names in similarity order
    similar_ids = df_content_to_select['anime_id'].iloc[similar_idxs].values
    names = df_content.drop_duplicates('anime_id').set_index('anime_id')['name']
    similar_movies = np.array(names.reindex(similar_ids))
    
    return similar_movies
    
    
def get_movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of anime_ids
    OUTPUT
    movies - a list of names associated with the anime_ids
    
    '''
    movie_lst = list(df_content[df_content['anime_id'].isin(movie_ids)]['name'])
   
    return movie_lst
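
To illustrate the ranking idea behind `find_similar_movies`, here is a small self-contained sketch. The feature matrix, the ids, and the helper name `top_n_similar` are all made up for the example. It ranks rows by their dot product with the query row and skips the query itself.

```python
import numpy as np

# Hypothetical indicator matrix: one row per title, one column per genre/type flag
toy_ids = np.array([10, 20, 30, 40])
toy_content = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
])

# Title x title similarity: the number of flags two titles share
toy_sim = toy_content.dot(toy_content.T)

def top_n_similar(item_id, n=2):
    """Return the ids of the n titles sharing the most flags with item_id."""
    row = np.where(toy_ids == item_id)[0][0]
    # Stable sort on negated scores: highest similarity first, ties in row order
    order = np.argsort(-toy_sim[row], kind="stable")
    order = order[order != row][:n]
    return toy_ids[order]

print(top_n_similar(10))
```

Raw dot products favour titles that carry many flags; dividing by the row norms (cosine similarity) is one possible way to reduce that effect.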

In [None]:
def make_recommendations(_id, _id_type='movie', train_data=train_data_df, 
                         train_df=train_df, movies=df_content, rec_num=10, user_mat=user_mat):
    '''
    INPUT:
    _id - either a user or movie id (int)
    _id_type - "movie" or "user" (str)
    train_data - dataframe of data as user-movie matrix
    train_df - dataframe of training data reviews
    movies - movies df
    rec_num - number of recommendations to return (int)
    user_mat - the U matrix of matrix factorization
    movie_mat - the V matrix of matrix factorization (read from the notebook-level movie_mat, not passed in)
    
    OUTPUT:
    recs - (array) a list or numpy array of recommended movies like the 
                   given movie, or recs for a user_id given
    '''
    # if the user is available from the matrix factorization data, 
    # I will use this and rank movies based on the predicted values
    # For use with user indexing
    val_users = train_data.index
    # Default to the top-rated titles; these are returned for users not in train_data
    rec_names, rec_ids = get_top_anime(rec_num)
    
    if _id_type == 'user':
        if _id in train_data.index:
            # Get the index of which row the user is in for use in U matrix
            idx = np.where(val_users == _id)[0][0]
            
            # take the dot product of that row and the V matrix
            preds = np.dot(user_mat[idx,:],movie_mat)
            
            # pull the top movies according to the prediction
            indices = preds.argsort()[-rec_num:][::-1] #indices
            rec_ids = train_data.columns[indices]
            rec_names = get_anime_names(rec_ids)
    # Find similar movies if it is a movie that is passed
    else:
        rec_names = find_similar_movies(_id, rec_num)
        # look up the ids for those names, keeping the similarity order
        name_to_id = movies.drop_duplicates('name').set_index('name')['anime_id']
        rec_ids = list(name_to_id.reindex(rec_names))
    
    return rec_ids, rec_names
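
The user branch of `make_recommendations` boils down to one row of U times V, followed by an `argsort`. The sketch below replays that step on made-up factor matrices. Every name prefixed with `toy_` is hypothetical, and the popularity fallback is a stand-in for whatever ranking `get_top_anime` returns.

```python
import numpy as np

# Hypothetical factor matrices: 3 users x 2 latent features, 2 features x 4 titles
toy_user_mat = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [0.5, 0.5]])
toy_movie_mat = np.array([[5.0, 1.0, 3.0, 2.0],
                          [1.0, 4.0, 2.0, 5.0]])
toy_user_ids = np.array([101, 102, 103])
toy_item_ids = np.array([1, 2, 3, 4])
toy_popular = [4, 1]  # stand-in for a precomputed popularity ranking

def toy_recommend(user_id, rec_num=2):
    """Top rec_num item ids by predicted rating, or popular items for unknown users."""
    if user_id not in toy_user_ids:
        return toy_popular[:rec_num]
    idx = np.where(toy_user_ids == user_id)[0][0]
    # Predicted ratings for every title: the user's row of U times V
    preds = toy_user_mat[idx, :].dot(toy_movie_mat)
    # Indices of the rec_num largest predictions, best first
    top = preds.argsort()[-rec_num:][::-1]
    return list(toy_item_ids[top])

print(toy_recommend(101))
print(toy_recommend(999))
```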

In [None]:
user_recs_dict_with_top = dict()
for user_id in set(val_df['user_id']):
    user_recs_dict_with_top[user_id] = make_recommendations(user_id, 'user')[1]

In [None]:
cnter = 0
for user, rec in user_recs_dict_with_top.items():
    if cnter < 12:
        print("For user {}, our recommendations are: \n {}".format(user, rec))
        cnter+=1
    else:
        break

In [None]:
# Read in user and movie matrices
with open("user_matrix", 'rb') as user_file:
    user_mat = pickle.load(user_file)

with open("movie_matrix", 'rb') as movie_file:
    movie_mat = pickle.load(movie_file)

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Recommendations_with_IBM.ipynb'])

In [8]:
import recommender as r

#instantiate recommender
rec = r.Recommender()

# fit recommender
#rec.fit(rating_pth='data/rating.csv', content_pth= 'data/anime.csv', learning_rate=.01, iters=10)

# predict
#rec.predict_rating(user_id=8, movie_id=2844)

In [9]:
rec.fit(rating_pth='data/rating.csv', content_pth= 'data/anime.csv', learning_rate=.002, iters=1)

Optimizaiton Statistics
Iterations | Mean Squared Error 
1 		 12.651666


In [10]:
rec.predict_rating(user_id=1, movie_id=1)

For user 1 we predict a 7.18 rating for the movie Cowboy Bebop.


7.177752048954211

In [13]:
a = rec.make_recommendations(8, _id_type="New_Viewer")

In [15]:
rec.make_recommendations(8, _id_type="Premium_Viewer")

['Shinsekai yori',
 'Diamond no Ace: Second Season',
 'Toradora!',
 'Mobile Suit Gundam Unicorn',
 'SKET Dance',
 'Gyakkyou Burai Kaiji: Ultimate Survivor',
 'Summer Wars',
 'ef: A Tale of Melodies.',
 'Tonari no Kaibutsu-kun',
 'Macross Zero',
 'Yowamushi Pedal: Special Ride',
 'Minami-ke Omatase']

In [None]:
rec.make_recommendations(1)

In [None]:
print(rec.make_recommendations(8,'user')) # user in the dataset
print(rec.make_recommendations(1,'user')) # user not in dataset
print(rec.make_recommendations(1853728)) # movie in the dataset
print(rec.make_recommendations(1)) 

In [None]:
df_content['anime_id'].sort_values()

In [None]:
df_content_to_select = df_content.loc[:,["anime_id","genre","type"]]
df_content_to_select["genre"] = df_content_to_select.loc[:,"genre"].fillna("Unknown").copy()
# The regex split also drops any spaces after the commas, so "Action" and " Action" share a column
df_content_to_select["genre1"] = df_content_to_select.loc[:,"genre"].str.split(r",\s*").copy()
df_content_to_select["genre"] = df_content_to_select.loc[:,"genre1"].copy()
df_content_to_select = df_content_to_select.drop("genre1", axis=1)
df_model = pd.DataFrame(df_content_to_select["genre"].tolist())
df_content_to_select = pd.concat([df_content_to_select, df_model], axis=1)
df_content_to_select = df_content_to_select.drop("genre", axis=1).melt(id_vars=["anime_id","type"]).dropna(subset=["value"])
df_content_to_select["variable"] = 1
df_content_to_select = df_content_to_select.drop_duplicates()
df_content_to_select = df_content_to_select.pivot(index = ["anime_id","type"], columns=["value"]).fillna(0)
df_content_to_select.columns = [' '.join(col).strip().replace("variable ","").replace("-","_") for col in df_content_to_select.columns.values]
df_content_to_select.reset_index(inplace=True)
df_content_to_select = pd.get_dummies(df_content_to_select, prefix=["type"])
movie_content = np.array(df_content_to_select.drop("anime_id", axis=1))
dot_prod_movies = movie_content.dot(np.transpose(movie_content))

In [None]:
movie_id = 1
similar_idxs = []



In [None]:
similar_idxs[0]

In [None]:
np.where(dot_prod_movies[movie_idx] >= n2ndlargest)

In [None]:
np.where(dot_prod_movies[70] == np.max(dot_prod_movies[70]))[0][0]

In [None]:
np.max(dot_prod_movies[70])

In [None]:
np.partition(dot_prod_movies[70].flatten(), -2)[-2]

In [None]:
# find the second largest value (partitioning at -1 would return the largest)
np.partition(dot_prod_movies[movie_idx].flatten(), -2)[-2]
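
A quick check of how `np.partition` indexes from the end, on a made-up array. The values are arbitrary and only serve to show how ties behave.

```python
import numpy as np

scores = np.array([3, 7, 1, 7, 5])

# np.partition(a, -k)[-k] is the k-th largest element of a
largest = np.partition(scores, -1)[-1]
second_largest = np.partition(scores, -2)[-2]

# Duplicates count separately, so the runner-up equals the maximum here
print(largest, second_largest)

# Second-largest distinct value, if ties should be collapsed first
second_distinct = np.unique(scores)[-2]
print(second_distinct)
```

For a similarity row this matters because the maximum is often shared by several titles, so the "second largest" entry may not be a strictly smaller score.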

In [None]:
np.where(df_content['anime_id'] == 2197)

In [None]:
df_content[df_content['anime_id'].astype("int") == movie_id]

In [None]:
df_content.sort_values("anime_id").head(20)

In [None]:
dot_prod_movies.shape

In [None]:
len(df_content)

In [None]:
movie_content.dot(np.transpose(movie_content))

In [None]:
len(df_content)

In [None]:
movie_content 

In [None]:
import pickle
with open("df_content_matrix.pkl","wb") as p :
    pickle.dump(dot_prod_movies, p)

In [None]:
with open("df_content_matrix.pkl","rb") as p :
    dot_prod_movies_2 = pickle.load(p)

In [None]:
dot_prod_movies_2 

In [7]:
df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


1     5114
3     9253
6    11061
Name: anime_id, dtype: int64

In [204]:
anime_list = ['Fullmetal Alchemist: Brotherhood', 'Steins;Gate', 'Hunter x Hunter (2011)']

In [205]:
anime_id = list(df_content[df_content["name"].isin(anime_list)]["anime_id"])

In [206]:
anime_id

[5114, 9253, 11061]

In [207]:
df[df.anime_id.isin(anime_id)]

Unnamed: 0,user_id,anime_id,rating
183,3,5114,10
516,5,9253,9
932,7,11061,9
1165,10,5114,10
1166,10,9253,-1
...,...,...,...
99095,1024,5114,-1
99146,1025,5114,9
99303,1026,5114,10
99562,1027,9253,-1


In [208]:
df_to_pivot = df[df.anime_id.isin(anime_id)]

In [209]:
df_to_pivot 

Unnamed: 0,user_id,anime_id,rating
183,3,5114,10
516,5,9253,9
932,7,11061,9
1165,10,5114,10
1166,10,9253,-1
...,...,...,...
99095,1024,5114,-1
99146,1025,5114,9
99303,1026,5114,10
99562,1027,9253,-1


In [210]:
#df_to_pivot["rating_2"] = df_to_pivot["rating"]

In [211]:
# df_to_pivot is already restricted to anime_id, so it can be pivoted directly
df_pivot = df_to_pivot.pivot(index=["user_id"], columns="anime_id").fillna(0)

In [212]:
#df_pivot = df_pivot.reset_index()
# Flatten the ("rating", anime_id) column pairs to the anime_id as a string
df_pivot.columns = [str(col[1]) for col in df_pivot.columns.values]
df_pivot.reset_index(inplace=True)

In [213]:
df_pivot.columns.values

array(['user_id', '5114', '9253', '11061'], dtype=object)

In [214]:
df_pivot

Unnamed: 0,user_id,5114,9253,11061
0,3,10.0,0.0,0.0
1,5,0.0,9.0,0.0
2,7,0.0,0.0,9.0
3,10,10.0,-1.0,0.0
4,11,8.0,7.0,0.0
...,...,...,...,...
530,1023,0.0,10.0,0.0
531,1024,-1.0,0.0,0.0
532,1025,9.0,0.0,0.0
533,1026,10.0,0.0,0.0


In [215]:
# Copy the slice so the column added below does not raise SettingWithCopyWarning
df_pivot_id = df_pivot[["user_id"]].copy()

In [216]:
df_pivot_id_var = df_pivot.drop(["user_id"], axis=1)

In [217]:
df_pivot_id.loc[:,"count"] = np.count_nonzero(df_pivot_id_var, axis=1)


In [218]:
df_pivot_id

Unnamed: 0,user_id,count
0,3,1
1,5,1
2,7,1
3,10,2
4,11,2
...,...,...
530,1023,1
531,1024,1
532,1025,1
533,1026,1


In [220]:
df_final = df_pivot_id.merge(df_to_pivot.groupby('user_id').agg("mean").reset_index().drop("anime_id", axis=1), on="user_id")

In [221]:
df_final["score"] = df_final["count"] * df_final["rating"] 

In [222]:
df_final = df_final.sort_values("score", ascending=False).reset_index()

In [223]:
df_final["user_id"][0]

702
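
The last few cells score each user as (number of the selected titles rated) × (mean rating over those titles) and pick the top scorer. A compact sketch of the same computation on made-up ratings follows. The `toy_` names and values are hypothetical, and a named aggregation replaces the pivot and merge.

```python
import pandas as pd

# Hypothetical ratings for three titles; the values are made up
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 3],
    "anime_id": [5114, 9253, 5114, 5114, 9253, 11061],
    "rating":   [10, 8, 9, 7, 7, 7],
})

# Count and mean rating per user in a single groupby
toy_scores = (
    toy_ratings.groupby("user_id")["rating"]
    .agg(count="count", mean="mean")
    .reset_index()
)
toy_scores["score"] = toy_scores["count"] * toy_scores["mean"]

# Highest score first; the first row is the closest-matching user
toy_scores = toy_scores.sort_values("score", ascending=False).reset_index(drop=True)
best_user = toy_scores.loc[0, "user_id"]
print(best_user)
```

On this toy data, count × mean is simply each user's rating total, so a user who rated all three titles moderately outranks one who rated two of them highly. Whether that trade-off is wanted depends on the use case.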