In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
#Dataset url: https://grouplens.org/datasets/movielens/latest/
! ls ../input

**Recommendation system**

A recommendation engine filters data using different algorithms and recommends the most relevant items to users. It first captures a customer's past behavior and, based on that, recommends products the customer is likely to buy.

**Types of Recommender System**

Every recommendation system has two main components: users and items. Items are the entities that the system recommends to users. A couple of examples make this concrete.

Netflix recommends movies to people, so movies are the items and people are the users. Facebook recommends people you may know to other people, so there people are both the users and the items.

Three types of recommender systems are most commonly used:

**● Popularity Based Recommender System**

A popularity based recommender system recommends the most popular items to every user. The most popular items are those used by the largest number of users. For example, YouTube's trending list recommends the most popular videos of the day.


**● Content Based Recommender System**

A content based recommender system recommends items similar to those the user has used in the past.
For example, Netflix recommends movies similar to the one we recently watched.
Similarly, YouTube recommends videos similar to those in our watch history.


**● Collaborative Filtering based Recommender System**

A collaborative filtering based recommender system builds a profile of each user from the items that user likes. It then recommends items liked by one user to other users with a similar profile.

For example, Google builds a profile from our browsing history and then shows us relevant ads.
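
Of the three types, the popularity based one is the simplest to sketch. The example below is only an illustration on a made-up interaction log (the `views` frame and the `most_popular` helper are hypothetical, not part of the datasets used later): it counts how many distinct users touched each item and returns the items with the highest counts.

```python
import pandas as pd

# Hypothetical toy interaction log: one row per (user, item) view.
views = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u1", "u2", "u1"],
    "item": ["A",  "A",  "A",  "B",  "B",  "C"],
})

def most_popular(log, n=2):
    """Return the n items used by the largest number of distinct users."""
    users_per_item = log.groupby("item")["user"].nunique()
    return users_per_item.sort_values(ascending=False).head(n)

top_items = most_popular(views)
print(top_items)
```

Every user receives the same list here, which is the main limitation of this approach compared with the content based and collaborative filtering systems built in the rest of the notebook.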


Next we'll build a content based Hollywood movie recommender system in Python.

**CountVectorizer and cosine_similarity**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text = ["London Paris London","Paris Paris London"]


In [None]:
cv = CountVectorizer()
count_matrix = cv.fit_transform(text)
count_matrix.toarray()

In [None]:
similarity_scores = cosine_similarity(count_matrix)
print(similarity_scores)
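
To see where the off-diagonal score comes from, the same number can be computed by hand. This is a minimal sketch that assumes the vocabulary is ordered alphabetically as `['london', 'paris']`, which is how `CountVectorizer` orders its columns; the `cosine` helper is written here only for illustration.

```python
import numpy as np

# Count vectors over the alphabetical vocabulary ['london', 'paris'].
doc_a = np.array([2, 1])  # "London Paris London"
doc_b = np.array([1, 2])  # "Paris Paris London"

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

manual_score = cosine(doc_a, doc_b)
print(round(manual_score, 4))  # 0.8
```

The dot product is 2·1 + 1·2 = 4 and each vector has norm √5, so the similarity is 4 / 5 = 0.8.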

**TfidfVectorizer short tutorial**

For more information, check the link below:

https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.Xc4cwdIzbtQ

Scikit-learn's TfidfTransformer and TfidfVectorizer aim to do the same thing, which is to convert a collection of raw documents to a matrix of TF-IDF features. The differences between the two classes can be confusing, and it is hard to know when to use which. The linked article shows how to use each one correctly, how they differ, and gives some guidelines on which to use when.

In [None]:
# importing the libraries
import pandas as pd
 
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
 
# this is a very toy example, do not try this at home unless you want to understand the usage differences
docs=["the house had a tiny little mouse",
      "the cat saw the mouse",
      "the mouse ran away from the house",
      "the cat finally ate the mouse",
      "the end of the mouse story"
     ]

In [None]:
 
#instantiate CountVectorizer()
cv=CountVectorizer()
 
# this step generates word counts for the words in your docs
word_count_vector=cv.fit_transform(docs)
word_count_vector.toarray()

Let's check the shape. We should have 5 rows (5 docs) and 16 columns (16 unique words, not counting single-character words, which CountVectorizer ignores by default):

In [None]:
word_count_vector.shape
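
The count of 16 depends on the default tokenizer, which only keeps tokens of two or more word characters, so "a" is dropped. The sketch below checks this on the same documents; the second vectorizer uses an alternative `token_pattern` chosen here purely for comparison, and the names `default_cv` and `single_char_cv` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Default tokenizer: only tokens with at least two word characters are kept.
default_cv = CountVectorizer()
default_cv.fit(docs)

# Alternative pattern (an assumption for comparison) that keeps one-character tokens.
single_char_cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_cv.fit(docs)

print(len(default_cv.vocabulary_), "a" in default_cv.vocabulary_)
print(len(single_char_cv.vocabulary_), "a" in single_char_cv.vocabulary_)
```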

In [None]:
#compute IDF Values
 
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector) 

In [None]:

# print idf values
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])
 
# sort ascending
df_idf.sort_values(by=['idf_weights'])

In [None]:
# With TfidfVectorizer you compute the word counts, idf and tf-idf values all at once. It's really simple
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(use_idf=True)
 
# just send in all your docs here
fitted_vectorizer=tfidf_vectorizer.fit(docs)
tfidf_vectorizer_vectors=fitted_vectorizer.transform(docs)

**TfidfTransformer vs. TfidfVectorizer**


With TfidfTransformer you first compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the TF-IDF scores.

With TfidfVectorizer, by contrast, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and TF-IDF scores all from the same dataset.

Notice in the sorted IDF table above that the words 'mouse' and 'the' have the lowest IDF values. This is expected, as these words appear in every document in our collection. The lower the IDF value of a word, the less unique it is to any particular document.
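
Both claims can be checked numerically. The sketch below is a small verification, not part of the original tutorial: it confirms that the two routes give the same matrix on the toy documents, and recomputes the IDF values by hand using the smoothed formula that scikit-learn documents for `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1. Variable names such as `two_step` and `manual_idf` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Route 1: word counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(counts)

# Route 2: counts, IDF and TF-IDF in a single object.
vectorizer = TfidfVectorizer(smooth_idf=True, use_idf=True)
one_step = vectorizer.fit_transform(docs)

same_result = np.allclose(two_step.toarray(), one_step.toarray())

# Smoothed IDF recomputed by hand: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
doc_freq = (counts.toarray() > 0).sum(axis=0)  # documents containing each word
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
idf_matches = np.allclose(manual_idf, vectorizer.idf_)

vocab = vectorizer.vocabulary_
idf_the = vectorizer.idf_[vocab["the"]]
idf_mouse = vectorizer.idf_[vocab["mouse"]]
print(same_result, idf_matches, idf_the, idf_mouse)
```

Since 'the' and 'mouse' occur in all 5 documents, their IDF is ln(6 / 6) + 1 = 1, the smallest value the smoothed formula can produce.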

**Content Based Recommender System Working**

For more information:
https://www.academyofdatascience.com/blog-by-prashant/

In [None]:
import pandas as pd
import numpy as np

In [None]:
credits = pd.read_csv("../input/tmdb-movie-metadata/tmdb_5000_credits.csv")
movies_df = pd.read_csv("../input/tmdb-movie-metadata/tmdb_5000_movies.csv")

In [None]:
credits.head()

In [None]:
movies_df.head()

In [None]:
print("Credits:",credits.shape)
print("Movies Dataframe:",movies_df.shape)

Renaming the movie_id column to id in the credits dataframe and merging it with the movies_df dataframe

In [None]:
credits_column_renamed = credits.rename(index=str, columns={"movie_id": "id"})
movies_df_merge = movies_df.merge(credits_column_renamed, on='id')
movies_df_merge.head()

Dropping unnecessary columns from the dataframe

In [None]:
movies_cleaned_df = movies_df_merge.drop(columns=['homepage', 'title_x', 'title_y', 'status','production_countries'])
movies_cleaned_df.head()

In [None]:
movies_cleaned_df.info()

Now let's make recommendations based on the movies' plot summaries given in the overview column. If our user gives us a movie title, our goal is to recommend movies that share similar plot summaries.

In [None]:
movies_cleaned_df.head(1)['overview']

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3,  max_features=None, 
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3),
            stop_words = 'english')

# Filling NaNs with empty string
movies_cleaned_df['overview'] = movies_cleaned_df['overview'].fillna('')

In [None]:
tfv_matrix = tfv.fit_transform(movies_cleaned_df['overview'])
tfv_matrix.shape

In [None]:
from sklearn.metrics.pairwise import sigmoid_kernel

# Compute the sigmoid kernel
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)
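
For reference, `sigmoid_kernel` computes tanh(gamma · ⟨x, y⟩ + coef0), with `gamma` defaulting to 1 / n_features and `coef0` to 1. The sketch below reproduces that on a tiny made-up matrix (the rows of `X` are hypothetical unit-length vectors standing in for TF-IDF rows) and checks that ranking by kernel value gives the same order as ranking by the plain dot product.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Hypothetical L2-normalised rows standing in for TF-IDF vectors.
X = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])

gamma = 1.0 / X.shape[1]  # scikit-learn's default when gamma is None
coef0 = 1.0               # scikit-learn's default

dots = X @ X.T
manual = np.tanh(gamma * dots + coef0)
library = sigmoid_kernel(X, X)
kernel_matches = np.allclose(manual, library)

# tanh is increasing, so sorting by kernel value or by dot product gives the same order.
same_ranking = np.array_equal(
    np.argsort(-library[0], kind="stable"),
    np.argsort(-dots[0], kind="stable"),
)
print(kernel_matches, same_ranking)
```

Because TfidfVectorizer L2-normalises each row by default, the dot product of two rows is their cosine similarity, so the kernel scores used below should rank movies the same way cosine similarity would.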

In [None]:
# Map each movie title to its row index; keep the first row when a title appears more than once
indices = pd.Series(movies_cleaned_df.index, index=movies_cleaned_df['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]

In [None]:
indices

In [None]:
indices['Newlyweds']

In [None]:
sig[4799]

In [None]:
list(enumerate(sig[indices['Newlyweds']]))

In [None]:
def give_rec(title, sig=sig):
    # Get the index corresponding to original_title
    idx = indices[title]

    # Get the pairwise similarity scores
    sig_scores = list(enumerate(sig[idx]))

    # Sort the movies in descending order of similarity score
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

    # Scores of the 10 most similar movies
    sig_scores = sig_scores[1:11]

    # Movie indices
    movie_indices = [i[0] for i in sig_scores]

    # Top 10 most similar movies
    return movies_cleaned_df['original_title'].iloc[movie_indices]

In [None]:
# Testing our content-based recommendation system with the film Avatar
give_rec('Avatar')

In [None]:
! ls ../input/movie-rating

**Nearest Neighbor item based Collaborative Filtering**

In [None]:
import pandas as pd
import numpy as np

In [None]:
movies_df = pd.read_csv('../input/movie-rating/movies.csv',usecols=['movieId','title'],dtype={'movieId': 'int32', 'title': 'str'})
rating_df=pd.read_csv('../input/movie-rating/ratings.csv',usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

In [None]:
movies_df.head()

In [None]:
rating_df.head()

In [None]:
# merging the datasets on movieId

df = pd.merge(rating_df,movies_df,on='movieId')
df.head()

In [None]:

combine_movie_rating = df.dropna(axis = 0, subset = ['title'])
movie_ratingCount = (combine_movie_rating.groupby(by = ['title'])['rating'].count().reset_index().rename(columns = {'rating': 'totalRatingCount'})
     [['title', 'totalRatingCount']]
    )
movie_ratingCount.head()

In [None]:
rating_with_totalRatingCount = combine_movie_rating.merge(movie_ratingCount, left_on = 'title', right_on = 'title', how = 'left')
rating_with_totalRatingCount.head()

In [None]:

print(movie_ratingCount['totalRatingCount'].describe())

In [None]:
popularity_threshold = 50
rating_popular_movie= rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
rating_popular_movie.head()

In [None]:
rating_popular_movie.shape

In [None]:

## First let's create a pivot matrix (movies as rows, users as columns, missing ratings filled with 0)

movie_features_df=rating_popular_movie.pivot_table(index='title',columns='userId',values='rating').fillna(0)
movie_features_df.head()


In [None]:
from scipy.sparse import csr_matrix

movie_features_df_matrix = csr_matrix(movie_features_df.values)

from sklearn.neighbors import NearestNeighbors


model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(movie_features_df_matrix)

In [None]:
movie_features_df.shape

In [None]:
query_index = np.random.choice(movie_features_df.shape[0])
print(query_index)
distances, indices = model_knn.kneighbors(movie_features_df.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 6)

In [None]:
movie_features_df.head()

In [None]:

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(movie_features_df.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}'.format(i, movie_features_df.index[indices.flatten()[i]], distances.flatten()[i]))
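
The item based nearest neighbour steps above can be condensed into a toy example that is easy to check by hand. This is a sketch on made-up data: the `toy_ratings` table and the `similar_movies` helper are hypothetical, and only the `NearestNeighbors` settings are taken from the cells above.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical ratings: movies as rows, users as columns, 0 = not rated.
toy_ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 0, 0],
     [0, 0, 5, 4],
     [1, 0, 4, 5]],
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
    columns=["user1", "user2", "user3", "user4"],
    dtype="float32",
)

knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(csr_matrix(toy_ratings.values))

def similar_movies(title, n=2):
    """Return (title, cosine distance) for the n nearest movies, excluding the query."""
    row = toy_ratings.loc[[title]].values
    # Ask for one extra neighbour because the query movie matches itself.
    distances, neighbours = knn.kneighbors(row, n_neighbors=n + 1)
    pairs = zip(toy_ratings.index[neighbours.ravel()], distances.ravel())
    return [(name, float(d)) for name, d in pairs if name != title][:n]

recs = similar_movies("Movie A")
for name, dist in recs:
    print(name, round(dist, 3))
```

Movie B is rated highly by the same users as Movie A, so it comes out as the closest neighbour, while Movie D, which shares only one low rating, is further away.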

If you like this kernel, please share and upvote it.

For an average weight recommendation system, check the kernel below.

https://www.kaggle.com/uttam94/average-weight-recommedation-system