# Building a Recommender System

## What is a Recommender System?

Recommender systems, as the name suggests, are systems or techniques that recommend or suggest a particular product, service, or entity. They can be classified into the following two categories, based on how they frame the task of providing recommendations:
- The Prediction Problem
- The Ranking Problem

## The Prediction Problem

In this version of the problem, we are given a matrix of m users and n items. Each row of the matrix represents a user and each column represents an item. The value in the i<sup>th</sup> row and the j<sup>th</sup> column denotes the rating given by user i to item j. This value is usually denoted as r<sub>ij</sub>. Most users rate only a few items, so the task is to predict the entries that are missing.
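
As a minimal sketch of this matrix, the snippet below pivots a handful of ratings into a user-by-item table. The users, items, and ratings are made up for illustration and do not come from the dataset used later.

```python
import pandas as pd

# Hypothetical ratings in long form: one row per (user, item, rating).
ratings = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3", "u3"],
    "item": ["i1", "i2", "i1", "i2", "i3"],
    "rating": [5, 3, 4, 2, 5],
})

# Pivot into an m x n matrix: rows are users, columns are items.
# Missing entries (NaN) are the ratings the prediction problem tries to fill in.
rating_matrix = ratings.pivot(index="user", columns="item", values="rating")

print(rating_matrix)
print(rating_matrix.loc["u1", "i2"])  # r_ij for user u1 and item i2
```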

## Types of Recommender Systems
- ### Collaborative Filtering
> Collaborative Filtering leverages the power of community to provide recommendations. Collaborative filters are one of the most popular recommender models used in the industry. Collaborative filtering can be broadly classified into two types:
> - User-Based Collaborative Filtering
> > The main idea behind user-based collaborative filtering is that users who have bought and liked similar items in the past are more likely to buy similar items in the future too. Therefore, these models recommend items to a user that similar users have also liked.
> - Item-Based Collaborative Filtering
> > If a group of people have rated two items similarly, then the two items must be similar. Therefore, if a person likes one particular item, they're likely to be interested in the other item too. This is the principle on which item-based filtering works.
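
Both variants come down to measuring similarity in a rating matrix. The sketch below uses cosine similarity, which is one common choice rather than the only one. The ratings are invented, and treating 0 as "not rated" is a simplifying assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical user x item rating matrix; 0 means "not rated".
ratings = pd.DataFrame(
    [[5, 4, 0],
     [4, 5, 1],
     [1, 0, 5],
     [0, 1, 4]],
    index=["u1", "u2", "u3", "u4"],
    columns=["A", "B", "C"],
)

def cosine_similarity(matrix):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    normalized = matrix / norms
    return normalized @ normalized.T

values = ratings.to_numpy(dtype=float)

# Item-based: compare columns (items) by the ratings users gave them.
item_sim = pd.DataFrame(
    cosine_similarity(values.T),
    index=ratings.columns, columns=ratings.columns,
)

# User-based: compare rows (users) by the items they rated.
user_sim = pd.DataFrame(
    cosine_similarity(values),
    index=ratings.index, columns=ratings.index,
)

print(item_sim.round(2))
print(user_sim.round(2))
```

Here items A and B are rated alike by the same users, so they score as more similar to each other than either does to C. Likewise, u1 is closer to u2 than to u3.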


> <font color='red'> Note: One of the biggest prerequisites of a collaborative filtering system is the availability of data on past activity. Amazon is able to leverage collaborative filtering so well because it has access to data concerning millions of purchases from millions of users. When such data is missing, as it is for a new user or a new item, collaborative filters have little to work with. This is what we call the <b>Cold Start Problem</b>.</font>

- ### Content-Based Systems
> Content-based systems do not require data relating to past activity. Instead, they provide recommendations based on a user profile and the metadata they hold about particular items. Example: Netflix
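
To make the idea of recommending from item metadata concrete, here is a small sketch that ranks items by how many genres they share. The titles and genre sets are hypothetical, and Jaccard overlap is just one simple similarity measure chosen for illustration.

```python
# Hypothetical item metadata: each title maps to a set of genre names.
genres = {
    "Film A": {"Animation", "Comedy", "Family"},
    "Film B": {"Adventure", "Fantasy", "Family"},
    "Film C": {"Comedy"},
}

def jaccard(a, b):
    """Share of genres two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(title, catalog):
    """Rank every other title by genre overlap with `title`."""
    scores = {
        other: jaccard(catalog[title], tags)
        for other, tags in catalog.items() if other != title
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(most_similar("Film A", genres))
```

No ratings or purchase history are involved: the ranking depends only on what is known about the items themselves.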

# Starting Off
> We go ahead and import the pandas library as a first step.

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('../RecommendorSystems/movies_metadata.csv')



Let's check what a sample of this dataset looks like.

In [3]:
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


Check the shape and the number of columns in this dataset

In [4]:
df.shape

(45466, 24)

In [5]:
len(df.columns) 

24

We have 24 columns in this dataset. Each column represents a particular feature of the dataset. Now, let's see how we can access the details of a particular movie.

In [6]:
df.iloc[1]

adult                                                                False
belongs_to_collection                                                  NaN
budget                                                            65000000
genres                   [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
homepage                                                               NaN
id                                                                    8844
imdb_id                                                          tt0113497
original_language                                                       en
original_title                                                     Jumanji
overview                 When siblings Judy and Peter discover an encha...
popularity                                                         17.0155
poster_path                               /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
production_companies     [{'name': 'TriStar Pictures', 'id': 559}, {'na...
production_countries     

In [7]:
type(df)

pandas.core.frame.DataFrame

We can also access a particular movie by its index label. Let's set the title as the index.

In [8]:
df=df.set_index('title')

In [9]:
df.loc['Jumanji']

adult                                                                False
belongs_to_collection                                                  NaN
budget                                                            65000000
genres                   [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
homepage                                                               NaN
id                                                                    8844
imdb_id                                                          tt0113497
original_language                                                       en
original_title                                                     Jumanji
overview                 When siblings Judy and Peter discover an encha...
popularity                                                         17.0155
poster_path                               /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
production_companies     [{'name': 'TriStar Pictures', 'id': 559}, {'na...
production_countries     
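
The difference between the two lookups can be seen on a tiny frame. This is a sketch with made-up titles and budgets, not rows from the dataset.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the movies data.
movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "budget": [30, 65, 0],
})

by_position = movies.iloc[1]          # second row, by integer position
indexed = movies.set_index("title")   # title becomes the row label
by_label = indexed.loc["Film B"]      # same row, looked up by label

print(by_position["budget"], by_label["budget"])

# reset_index() moves the title back into an ordinary column.
restored = indexed.reset_index()
print(list(restored.columns))
```

If a label is not unique, `.loc` returns every matching row rather than a single one.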

In [37]:

df=df.reset_index()

## Creating a smaller dataset with only some features

In [38]:
small_df=df[['title','release_date','budget','revenue','runtime','genres']]

In [39]:
small_df.head()

Unnamed: 0,title,release_date,budget,revenue,runtime,genres
0,Toy Story,1995-10-30,30000000,373554033.0,81.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,Jumanji,1995-12-15,65000000,262797249.0,104.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,Grumpier Old Men,1995-12-22,0,0.0,101.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,Waiting to Exhale,1995-12-22,16000000,81452156.0,127.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,Father of the Bride Part II,1995-02-10,0,76578911.0,106.0,"[{'id': 35, 'name': 'Comedy'}]"
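
Assigning new values into a subset selected like this can trigger pandas' `SettingWithCopyWarning`, because pandas cannot always tell whether the subset is meant to be independent of `df`. One common way to make the intent explicit is to take a copy. The frame below is a hypothetical stand-in, and `budget_musd` is a column name invented for the example.

```python
import pandas as pd

# Hypothetical stand-in for the full movies frame.
df = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "budget": [30000000, 65000000],
    "adult": [False, False],
})

# .copy() makes the subset an independent DataFrame, so later
# assignments are unambiguous and never touch `df`.
small_df = df[["title", "budget"]].copy()
small_df["budget_musd"] = small_df["budget"] / 1e6

print(small_df)
print(list(df.columns))  # the original frame is unchanged
```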


Now that we have extracted the relevant features out of the dataset we can go ahead and get the information about this dataset.

In [40]:
small_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 6 columns):
title           45460 non-null object
release_date    45379 non-null object
budget          45466 non-null object
revenue         45460 non-null float64
runtime         45203 non-null float64
genres          45466 non-null object
dtypes: float64(2), object(4)
memory usage: 2.1+ MB


We can see that the budget column has been assigned the object dtype, whereas we want float. Let's try a direct cast first and fall back to a helper function if that does not work.

In [41]:
small_df['budget']=small_df['budget'].astype['float']

TypeError: 'method' object is not subscriptable

The attempt fails because `astype` is a method and has to be called with parentheses, not subscripted with square brackets. The column also contains a few values that cannot be parsed as numbers, so instead we write a small helper that converts what it can and maps everything else to NaN.

In [42]:
import numpy as np

In [43]:
def to_float(x):
    """Convert x to float, returning NaN when it cannot be parsed."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return np.nan

In [44]:
small_df['budget']=small_df['budget'].apply(to_float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [45]:
small_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 6 columns):
title           45460 non-null object
release_date    45379 non-null object
budget          45463 non-null float64
revenue         45460 non-null float64
runtime         45203 non-null float64
genres          45466 non-null object
dtypes: float64(3), object(3)
memory usage: 2.1+ MB
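
The same conversion can also be sketched with `pd.to_numeric`, which handles the whole column at once. The sample values below are invented to show how unparseable entries are treated.

```python
import pandas as pd

# Hypothetical budget column read as strings, with one unparseable entry.
budget = pd.Series(["30000000", "65000000", "not-a-number", "0"])

# errors="coerce" turns anything that cannot be parsed into NaN,
# which mirrors what the to_float helper does element by element.
budget_numeric = pd.to_numeric(budget, errors="coerce")

print(budget_numeric.dtype)
print(budget_numeric.isna().sum())  # number of values that failed to parse
```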


In [46]:
small_df.head()

Unnamed: 0,title,release_date,budget,revenue,runtime,genres
0,Toy Story,1995-10-30,30000000.0,373554033.0,81.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,Jumanji,1995-12-15,65000000.0,262797249.0,104.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,Grumpier Old Men,1995-12-22,0.0,0.0,101.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,Waiting to Exhale,1995-12-22,16000000.0,81452156.0,127.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,Father of the Bride Part II,1995-02-10,0.0,76578911.0,106.0,"[{'id': 35, 'name': 'Comedy'}]"


Now, we want to convert the release date to a year.

In [47]:
small_df['release_date']=pd.to_datetime(small_df['release_date'], errors='coerce')



In [48]:
small_df['year']=small_df['release_date'].apply(lambda x: str(x).split('-') if x!=np.nan else np.nan)



In [49]:
small_df.head()

Unnamed: 0,title,release_date,budget,revenue,runtime,genres,year
0,Toy Story,1995-10-30,30000000.0,373554033.0,81.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[1995, 10, 30 00:00:00]"
1,Jumanji,1995-12-15,65000000.0,262797249.0,104.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[1995, 12, 15 00:00:00]"
2,Grumpier Old Men,1995-12-22,0.0,0.0,101.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[1995, 12, 22 00:00:00]"
3,Waiting to Exhale,1995-12-22,16000000.0,81452156.0,127.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[1995, 12, 22 00:00:00]"
4,Father of the Bride Part II,1995-02-10,0.0,76578911.0,106.0,"[{'id': 35, 'name': 'Comedy'}]","[1995, 02, 10 00:00:00]"
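
Note that the `year` column above holds the full list of split parts rather than just the year. The `x != np.nan` guard also never filters anything, because NaN does not compare equal to any value, itself included. A sketch of one way to extract the year directly, using the `.dt` accessor on made-up dates:

```python
import numpy as np
import pandas as pd

# Hypothetical release dates, including two that cannot be parsed.
dates = pd.Series(["1995-10-30", "1995-12-15", "not a date", None])
release_date = pd.to_datetime(dates, errors="coerce")

# NaN never compares equal to anything, so this check is always True.
print(np.nan != np.nan)

# .dt.year reads the year directly; unparseable dates (NaT) become NaN.
year = release_date.dt.year
print(year.tolist())
```

Because of the missing values, the resulting column is float rather than integer.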


In [50]:
small_df=small_df.sort_values('year',ascending=False)

In [51]:
small_df.head()

Unnamed: 0,title,release_date,budget,revenue,runtime,genres,year
44296,Neither Wolf Nor Dog,NaT,0.0,0.0,110.0,"[{'id': 18, 'name': 'Drama'}]",[NaT]
42573,Whn the day had no name,NaT,0.0,0.0,,[],[NaT]
39604,Digital Dharma,NaT,0.0,0.0,90.0,[],[NaT]
43962,Irwin & Fran 2013,NaT,0.0,0.0,83.0,[],[NaT]
19322,Endeavour,NaT,0.0,0.0,98.0,[],[NaT]


In [52]:
small_df=small_df.set_index('title')

Let's check all the movies that have earned more than 1 billion USD.

In [53]:
new = small_df[small_df['revenue']>1e9]

In [54]:
new.head()

title,release_date,budget,revenue,runtime,genres,year
Despicable Me 3,2017-06-15,80000000.0,1020063000.0,96.0,"[{'id': 28, 'name': 'Action'}, {'id': 16, 'nam...","[2017, 06, 15 00:00:00]"
The Fate of the Furious,2017-04-12,250000000.0,1238765000.0,136.0,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...","[2017, 04, 12 00:00:00]"
Beauty and the Beast,2017-03-16,160000000.0,1262886000.0,129.0,"[{'id': 10751, 'name': 'Family'}, {'id': 14, '...","[2017, 03, 16 00:00:00]"
Rogue One: A Star Wars Story,2016-12-14,200000000.0,1056057000.0,133.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...","[2016, 12, 14 00:00:00]"
Finding Dory,2016-06-16,200000000.0,1028571000.0,97.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, '...","[2016, 06, 16 00:00:00]"


We can also combine multiple conditions. Say we want the movies that have made more than 1 billion USD on a budget of less than 100 million USD.

In [55]:
new2=small_df[(small_df['revenue']>1e9)&(small_df['budget']<1e8)]

In [56]:
new2.head()

title,release_date,budget,revenue,runtime,genres,year
Despicable Me 3,2017-06-15,80000000.0,1020063000.0,96.0,"[{'id': 28, 'name': 'Action'}, {'id': 16, 'nam...","[2017, 06, 15 00:00:00]"
Minions,2015-06-17,74000000.0,1156731000.0,91.0,"[{'id': 10751, 'name': 'Family'}, {'id': 16, '...","[2015, 06, 17 00:00:00]"
The Lord of the Rings: The Return of the King,2003-12-01,94000000.0,1118889000.0,201.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[2003, 12, 01 00:00:00]"
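
The filter works by combining boolean masks. The following sketch shows the mechanics on a hypothetical three-row frame, along with an equivalent `DataFrame.query` call.

```python
import pandas as pd

# Hypothetical revenue and budget figures in USD.
films = pd.DataFrame(
    {"revenue": [1.2e9, 1.1e9, 5.0e8], "budget": [2.5e8, 8.0e7, 9.0e7]},
    index=["Film A", "Film B", "Film C"],
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
mask = (films["revenue"] > 1e9) & (films["budget"] < 1e8)
print(films[mask])

# The same filter written with DataFrame.query.
queried = films.query("revenue > 1e9 and budget < 1e8")
print(queried)
```

Only the row that satisfies both conditions survives. Use `|` in place of `&` to keep rows that satisfy either condition.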


# The Pandas Series
Now that we have seen how to operate on the pandas DataFrame type, let's turn our attention to the pandas Series.
So, what exactly is a Series? Without knowing it, we have already used Series in our previous code: the .iloc and .loc operations on a DataFrame each returned one, and when we call apply on a particular column we are actually calling it on a Series. Basically, a pandas Series is a labelled one-dimensional array. Let's compute some statistics for a particular column of our dataset with the help of a Series.

In [80]:
revenue=small_df['revenue']

We can now get various statistics about this Series, such as the mean, median, mode, max, min, and quantiles. Let's go ahead and find the 90th percentile of this column.

In [81]:
revenue.quantile(0.9)

8267610.399999982

Above, we have calculated the 90th percentile of the revenue of all the movies. Only 10 percent of the movies have earned more than about $8.27 million.

In [82]:
revenue.quantile(0.98)

154932102.33999997
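
To see how a quantile relates to the rest of the distribution, the sketch below computes a few statistics on a small made-up Series. The numbers are chosen so that a single large value dominates, loosely imitating a skewed revenue column.

```python
import pandas as pd

# Hypothetical revenues: mostly small values and one very large one.
revenue = pd.Series([0, 0, 0, 0, 10, 20, 30, 40, 50, 1000], dtype=float)

print(revenue.min(), revenue.max())
print(revenue.mean(), revenue.median())

# The 90th percentile is the value below which 90% of the data falls
# (pandas interpolates linearly between neighbouring values by default).
p90 = revenue.quantile(0.9)
share_above = (revenue > p90).mean()

print(p90, share_above)
```

With a skewed column like this, the mean sits far above the median, which is why quantiles are often a more informative summary.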