# In this Python script, we look again at the movie metadata used earlier in the course. This time, however, we build a knowledge-based recommender system, and the script is interactive!

##### Summary about Knowledge-based (KB) recommenders
- Knowledge-based recommenders are used for items that are bought very rarely (cars, computers, houses, ...). It is hard to recommend such items from past purchasing activity or by building a user profile.
- KB systems rely on explicitly soliciting the user's requirements for such items.
- KB systems use interactive feedback, which lets the user explore the inherently complex product space and learn about the trade-offs between the available options.
- The retrieval and exploration process is facilitated by knowledge bases that describe the utilities and/or trade-offs between various features in the product domain.
- KB systems are appropriate in the following situations:

    - Customers want to explicitly specify their requirements. Therefore, interactivity is a crucial component of such systems.
    - It is difficult to obtain ratings for a specific type of item because of the greater complexity of the product domain in terms of the types of items and options available.
    - In some domains, such as computers, the ratings may be time-sensitive. Ratings of an old car or computer are not very useful for recommendations because they evolve with changing product availability and corresponding user requirements.
- KB systems can be categorized by the type of user interaction:

    - Constraint-based RS: Users specify requirements or constraints on the item attributes. Domain-specific rules are used to match the user requirements or attributes to item attributes.
    - Case-based RS: Specific cases are specified by the user as targets or anchor points. Similarity metrics are defined on the item attributes to retrieve items similar to these targets.
- KB systems draw on highly heterogeneous, domain-specific sources of knowledge, whereas content-based and collaborative systems work with broadly similar types of input data across domains.

<img src="KB.png">
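To make the difference between the two families concrete, here is a minimal sketch in plain Python. The laptop catalogue, the attribute names, and the helper functions `constraint_based` and `case_based` are all made up for illustration; real systems would use richer domain rules and similarity metrics.

```python
# Toy catalogue; items and attribute names are hypothetical.
laptops = [
    {"name": "A", "price": 900, "ram_gb": 8, "weight_kg": 1.3},
    {"name": "B", "price": 1500, "ram_gb": 16, "weight_kg": 1.8},
    {"name": "C", "price": 1200, "ram_gb": 16, "weight_kg": 1.2},
]

def constraint_based(items, max_price, min_ram_gb):
    """Keep only the items that satisfy every stated requirement."""
    return [i for i in items
            if i["price"] <= max_price and i["ram_gb"] >= min_ram_gb]

def case_based(items, target, k=1):
    """Rank items by their distance to a target case and return the k closest."""
    def distance(item):
        # Scale each attribute by the target value so the units are comparable
        return sum(abs(item[a] - target[a]) / target[a] for a in target)
    return sorted(items, key=distance)[:k]

constraint_hits = constraint_based(laptops, max_price=1300, min_ram_gb=16)
nearest = case_based(laptops, target={"price": 1000, "ram_gb": 8, "weight_kg": 1.3})

print([i["name"] for i in constraint_hits])  # ['C']
print([i["name"] for i in nearest])          # ['A']
```

The movie recommender built in the rest of this script is of the constraint-based kind: the user states a genre, a runtime range, and a year range, and only movies satisfying all of them are kept.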

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('./movies_metadata.csv')

#Print all the features (or columns) of the DataFrame
# list(df.columns)
df.columns



Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

### We will do four things in this script
##### 1. Ask the user for the genre of movies she is looking for
##### 2. Ask the user for the duration (shortest and longest runtime)
##### 3. Ask the user for the timeline (earliest and latest release year) of the movies recommended
##### 4. Using the information collected, recommend movies to the user that have a high weighted rating (according to the IMDB formula) and that satisfy the preceding conditions
You can, of course, change these inputs to design a different kind of interaction.

In [10]:
list(df.columns)

['adult',
 'belongs_to_collection',
 'budget',
 'genres',
 'homepage',
 'id',
 'imdb_id',
 'original_language',
 'original_title',
 'overview',
 'popularity',
 'poster_path',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'title',
 'video',
 'vote_average',
 'vote_count']

In [12]:
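#Only keep those features that we require: title, genres, release_date, runtime, vote_average, vote_count
#.copy() makes df an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
df = df[['title','genres','release_date','runtime','vote_average','vote_count']].copy()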

df.head()

Unnamed: 0,title,genres,release_date,runtime,vote_average,vote_count
0,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",1995-10-30,81.0,7.7,5415.0
1,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",1995-12-15,104.0,6.9,2413.0
2,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",1995-12-22,101.0,6.5,92.0
3,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",1995-12-22,127.0,6.1,34.0
4,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",1995-02-10,106.0,5.7,173.0


In [13]:
#Convert release_date into pandas datetime format, and invalid parsing into NaT
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

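#Extract the year as a string. Missing dates (NaT) become the string 'NaT', which is handled further below.
#(A check such as x != np.nan is always True, so it cannot be used to detect missing values.)
df['year'] = df['release_date'].apply(lambda x: str(x).split('-')[0])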



In [4]:
# look at years
df['year']

0        1995
1        1995
2        1995
3        1995
4        1995
         ... 
45461     NaT
45462    2011
45463    2003
45464    1917
45465    2017
Name: year, Length: 45466, dtype: object
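As an aside, the same year extraction can be done without string handling by using the pandas `.dt` accessor. The sketch below runs on a small made-up Series rather than on the real `release_date` column, and it adopts the convention used later in this script of mapping missing years to 0.

```python
import pandas as pd

# Small stand-in for the release_date column; the values are illustrative.
dates = pd.Series(['1995-10-30', 'not a date', None, '2017-01-01'])

# errors='coerce' turns unparseable or missing entries into NaT
parsed = pd.to_datetime(dates, errors='coerce')

# .dt.year gives NaN for NaT rows; replace those with 0 and cast to int
years = parsed.dt.year.fillna(0).astype(int)

print(years.tolist())  # [1995, 0, 0, 2017]
```

This yields integer years directly, so no separate conversion helper would be needed with this approach.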

In [14]:
# Check whether any rows still have the string 'NaT' as their year after pre-processing
df.loc[df.year == 'NaT'].any()

title            True
genres           True
release_date    False
runtime          True
vote_average     True
vote_count       True
year             True
dtype: bool

In [15]:
#Helper function to convert 'NaT' to 0 and all other years to integers.

def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        #Values that cannot be parsed as an integer (e.g. 'NaT') are mapped to 0
        return 0

In [16]:
#Apply convert_int to the year feature
df['year'] = df['year'].apply(convert_int)



In [17]:
# Check again: no 'NaT' years should remain after the conversion
df.loc[df.year == 'NaT'].any()

title           False
genres          False
release_date    False
runtime         False
vote_average    False
vote_count      False
year            False
dtype: bool

In [19]:
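#Drop the release_date column.
#Note: index=1 additionally drops the row with index label 1 (Jumanji), which is why that movie is absent from the outputs below.
#Omit index=1 to keep all rows.
df = df.drop(columns='release_date', index=1)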

#Display the dataframe
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",81.0,7.7,5415.0,1995
2,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",101.0,6.5,92.0,1995
3,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",127.0,6.1,34.0,1995
4,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",106.0,5.7,173.0,1995
5,Heat,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",170.0,7.7,1886.0,1995


In [20]:
#Print genres of the third row (Waiting to Exhale)
df.iloc[2]['genres']

"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]"

In [21]:
#Import the literal_eval function from ast
from ast import literal_eval 

## literal_eval safely parses a string containing a Python literal (list, dict, number, ...) into the corresponding Python object, so that it can be manipulated as a data structure in the program.

#Define a stringified list and output its type
a = "[1,2,3]"
print(type(a))

#Apply literal_eval and output type
b = literal_eval(a)
print(type(b))

<class 'str'>
<class 'list'>
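The sketch below applies the same idea to a string shaped like one entry of the `genres` column, and illustrates why `literal_eval` is preferred over the built-in `eval` for data read from a file: it accepts only literals and rejects arbitrary expressions. The sample string is illustrative.

```python
from ast import literal_eval

# A string in the same shape as one entry of the genres column
raw = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"

parsed = literal_eval(raw)  # now a list of dictionaries
names = [g['name'].lower() for g in parsed]
print(names)  # ['comedy', 'drama']

# literal_eval only accepts literals; a function call is rejected with ValueError
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```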


In [22]:
#Convert all NaN into stringified empty lists
df['genres'] = df['genres'].fillna('[]')

#Apply literal_eval to convert the stringified lists into list objects
df['genres'] = df['genres'].apply(literal_eval)

df['genres']

0        [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
2        [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...
3        [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4                           [{'id': 35, 'name': 'Comedy'}]
5        [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...
                               ...                        
45461    [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...
45462                        [{'id': 18, 'name': 'Drama'}]
45463    [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...
45464                                                   []
45465                                                   []
Name: genres, Length: 45465, dtype: object

In [23]:
#Convert list of dictionaries to a list of strings
df['genres'] = df['genres'].apply(lambda x: [i['name'].lower() for i in x] if isinstance(x, list) else [])

In [24]:
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"[animation, comedy, family]",81.0,7.7,5415.0,1995
2,Grumpier Old Men,"[romance, comedy]",101.0,6.5,92.0,1995
3,Waiting to Exhale,"[comedy, drama, romance]",127.0,6.1,34.0,1995
4,Father of the Bride Part II,[comedy],106.0,5.7,173.0,1995
5,Heat,"[action, crime, drama, thriller]",170.0,7.7,1886.0,1995


In [25]:
#Create a new feature by exploding genres: one row per (movie, genre) pair
s = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)

#pandas also provides the DataFrame.explode() method for exploding list-valued columns

#Name the new feature as 'genre'
s.name = 'genre'

#Create a new dataframe gen_df by dropping the old 'genres' feature and adding the new 'genre'.
gen_df = df.drop('genres', axis=1).join(s)

#Print the head of the new gen_df
# gen_df.head()



In [26]:
gen_df.head()

Unnamed: 0,title,runtime,vote_average,vote_count,year,genre
0,Toy Story,81.0,7.7,5415.0,1995,animation
0,Toy Story,81.0,7.7,5415.0,1995,comedy
0,Toy Story,81.0,7.7,5415.0,1995,family
2,Grumpier Old Men,101.0,6.5,92.0,1995,romance
2,Grumpier Old Men,101.0,6.5,92.0,1995,comedy
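For comparison, here is a minimal sketch of the same exploding step using `DataFrame.explode`. It runs on a small made-up frame, not on the real data. With this approach the `genre` column keeps the position of the original `genres` column instead of being appended last, and a movie with an empty genre list is kept as a single row with a missing genre.

```python
import pandas as pd

# Toy frame in the same shape as df after the genre names were extracted
toy = pd.DataFrame({
    'title': ['Toy Story', 'Heat', 'Unknown'],
    'genres': [['animation', 'comedy'], ['action'], []],
})

# explode creates one row per list element and repeats the index label
exploded = toy.explode('genres').rename(columns={'genres': 'genre'})

print(exploded)
```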


#### You can ask for input in Python. This way, you can ask your users directly what they are looking for, which lets them search more 'directly'.

The recommendation is driven mainly by explicit preference elicitation, combined with domain-expert knowledge. To rank the movies that match the user's preferences, we use the following weighted-rating formula:
<img src="weithed_mean.png">
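As used in the code of this script, the weighted rating is WR = v/(v+m) * R + m/(v+m) * C, where v is a movie's vote count, R its average rating, m the minimum vote count required, and C the mean rating across the movies considered. The sketch below works through two cases with made-up numbers (not taken from the dataset); `weighted_rating` is a helper name introduced here for illustration.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating.

    v: number of votes, R: average rating of the movie,
    m: vote-count threshold, C: mean rating over all movies considered.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only
m, C = 100, 6.0
few_votes = weighted_rating(v=100, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=9900, R=9.0, m=m, C=C)

print(round(few_votes, 2), round(many_votes, 2))  # 7.5 8.97
```

Both movies have the same average rating of 9.0, but the one with few votes is pulled towards the overall mean C, while the one with many votes keeps a score close to its own average.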

In [27]:
def build_chart(gen_df, percentile=0.8):
    #Ask for preferred genre (normalized to lowercase to match the values in the genre column)
    print("Input preferred genre")
    genre = input().strip().lower()
    
    #Ask for lower limit of duration
    print("Input shortest duration")
    low_time = int(input())
    
    #Ask for upper limit of duration
    print("Input longest duration")
    high_time = int(input())
    
    #Ask for lower limit of timeline
    print("Input earliest year")
    low_year = int(input())
    
    #Ask for upper limit of timeline
    print("Input latest year")
    high_year = int(input())
    
    #Define a new movies variable to store the preferred movies. Copy the contents of gen_df to movies
    movies = gen_df.copy()
    
    #Filter based on the condition
    movies = movies[(movies['genre'] == genre) & 
                    (movies['runtime'] >= low_time) & 
                    (movies['runtime'] <= high_time) & 
                    (movies['year'] >= low_year) & 
                    (movies['year'] <= high_year)]
    
  
    #Compute the values of C and m for the filtered movies
    C = movies['vote_average'].mean()
    m = movies['vote_count'].quantile(percentile)
    

    #Only consider movies that have at least m votes. Save this in a new dataframe q_movies
    q_movies = movies.loc[movies['vote_count'] >= m].copy()

    #Calculate score using the IMDB weighted rating formula (vectorized over all rows)
    v = q_movies['vote_count']
    R = q_movies['vote_average']
    q_movies['score'] = (v / (v + m)) * R + (m / (v + m)) * C
    
    #Sort movies in descending order of their scores
    q_movies = q_movies.sort_values('score', ascending=False)
    
    return q_movies

#### This is the interactive part of the script. We now generate a user dialog in which the user must enter values that actually occur in the dataset. You can't just type random words: the genre must match one of the genres in the dataset (e.g., comedy), and the other inputs must be integers, otherwise the int() conversion raises a ValueError. Try it!

In [28]:
#Generate the chart for the genre entered by the user (comedy in the example below) and display the top 5.
build_chart(gen_df).head()

# Input preferred genre
#  comedy
# Input shortest duration
#  50
# Input longest duration
#  100
# Input earliest year
#  2012
# Input latest year
#  2017

Input preferred genre
Input shortest duration
Input longest duration
Input earliest year
Input latest year


Unnamed: 0,title,runtime,vote_average,vote_count,year,genre,score
22841,The Grand Budapest Hotel,99.0,8.0,4644.0,2014,comedy,7.951675
30315,Inside Out,94.0,7.9,6737.0,2015,comedy,7.867831
38186,Perfect Strangers,97.0,7.8,803.0,2016,comedy,7.564755
19016,Moonrise Kingdom,94.0,7.6,1701.0,2012,comedy,7.492829
22718,The Lego Movie,100.0,7.5,3127.0,2014,comedy,7.44316
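Because `build_chart` reads its values with `input()`, it is awkward to test or reuse. A possible non-interactive variant is sketched below. The function name `build_chart_from_values`, its validation rules, and the toy data are assumptions made for illustration, not part of the original script.

```python
import pandas as pd

def build_chart_from_values(gen_df, genre, low_time, high_time,
                            low_year, high_year, percentile=0.8):
    """Hypothetical non-interactive variant of build_chart with input checks."""
    genre = genre.strip().lower()
    if genre not in set(gen_df['genre'].dropna()):
        raise ValueError(f"Unknown genre: {genre!r}")
    if low_time > high_time or low_year > high_year:
        raise ValueError("Lower bounds must not exceed upper bounds")

    # Constraint filtering; between() is inclusive on both ends
    movies = gen_df[(gen_df['genre'] == genre) &
                    gen_df['runtime'].between(low_time, high_time) &
                    gen_df['year'].between(low_year, high_year)]

    # C and m are computed on the filtered movies, as in build_chart
    C = movies['vote_average'].mean()
    m = movies['vote_count'].quantile(percentile)

    q_movies = movies[movies['vote_count'] >= m].copy()
    v, R = q_movies['vote_count'], q_movies['vote_average']
    q_movies['score'] = (v / (v + m)) * R + (m / (v + m)) * C
    return q_movies.sort_values('score', ascending=False)

# Toy data in the same shape as gen_df; the values are made up
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'runtime': [90.0, 95.0, 200.0, 90.0],
    'vote_average': [8.0, 7.0, 9.0, 8.5],
    'vote_count': [1000.0, 10.0, 500.0, 300.0],
    'year': [2014, 2015, 2014, 2014],
    'genre': ['comedy', 'comedy', 'comedy', 'drama'],
})

chart = build_chart_from_values(toy, ' Comedy ', 50, 100, 2012, 2017)
print(chart[['title', 'score']])
```

An interactive front end could then collect the five values with `input()`, convert them, and pass them to this function, reporting any `ValueError` back to the user instead of stopping the script.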


In [8]:
#Convert the cleaned (non-exploded) dataframe df into a CSV file and save it in the current directory
#Set parameter index to False as the index of the DataFrame has no inherent meaning.

df.to_csv('./metadata_clean.csv', index=False)
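One thing to keep in mind when loading such a file again: CSV stores the `genres` lists as text, so they come back as strings unless they are parsed on load. The sketch below demonstrates this with a made-up one-row frame and an in-memory buffer instead of a real file; the `converters` argument of `pd.read_csv` applies `literal_eval` to every value of the column.

```python
import io
from ast import literal_eval

import pandas as pd

# One-row stand-in for the cleaned dataframe
toy = pd.DataFrame({'title': ['Toy Story'],
                    'genres': [['animation', 'comedy', 'family']]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without a converter, the list comes back as a plain string
buffer.seek(0)
as_text = pd.read_csv(buffer)

# With a converter, each genres value is parsed back into a list
buffer.seek(0)
restored = pd.read_csv(buffer, converters={'genres': literal_eval})

print(type(as_text.loc[0, 'genres']).__name__)   # str
print(type(restored.loc[0, 'genres']).__name__)  # list
```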

###### Example

[Live constraint based recommender](http://158.39.201.22:5438/)

[Boosting Health? Examining the Role of Nutrition Labels and Preference Elicitation Methods in Food Recommendation](https://www.researchgate.net/profile/Ayoub-El-Majjodi/publication/363700837_Boosting_Health_Examining_the_Role_of_Nutrition_Labels_and_Preference_Elicitation_Methods_in_Food_Recommendation/links/632af731071ea12e364e8d31/Boosting-Health-Examining-the-Role-of-Nutrition-Labels-and-Preference-Elicitation-Methods-in-Food-Recommendation.pdf)

<img src='Agerwal.png'>