## Part I: Scrape data from IMDB Top 100 Movies

#### Import Libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

%load_ext nb_black


In [2]:
movie_rating = pd.read_csv(
    "https://raw.githubusercontent.com/ycwang15/Rec_Sys_assignments/Data/movie_rating.csv"
)


#### Scrape data from the IMDb top 100 greatest movies list

In [3]:
url = "https://www.imdb.com/list/ls006266261/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")


#### Create three lists (name, year, and plot)

In [4]:
movie_name = []
year = []
plot = []


In [5]:
movie_data = soup.find_all("div", attrs={"class": "lister-item mode-detail"})


In [6]:
for i in movie_data:
    name = i.h3.a.text
    movie_name.append(name)

    year_of_release = (
        i.h3.find("span", class_="lister-item-year text-muted unbold")
        .text.replace("(", "")
        .replace(")", "")
    )
    year.append(year_of_release)

    plot_des = i.find("p", class_="").text[1:]
    plot.append(plot_des)
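
Stripping the parentheses works for labels such as `(1972)`. A minimal sketch of a more defensive alternative, assuming a year label might carry extra text (for example a disambiguator like `(I) (2004)`) or be empty; `extract_year` is a hypothetical helper, not part of the scrape above:

```python
import re


def extract_year(label):
    """Return the first 4-digit year (1800-2099) found in the label, or None."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", label)
    return int(match.group(1)) if match else None


print(extract_year("(1972)"))
print(extract_year("(I) (2004)"))
print(extract_year(""))
```

Returning `None` for an unparseable label keeps the three lists the same length, which the DataFrame constructor requires.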


#### Check the first 10 items in each list

In [7]:
print(movie_name[0:10])

['The Godfather', 'Goodfellas', 'Pulp Fiction', 'The Usual Suspects', 'Apocalypse Now', 'Trainspotting', 'Fight Club', "Schindler's List", 'Boogie Nights', 'Reservoir Dogs']



In [8]:
print(year[0:10])

['1972', '1990', '1994', '1995', '1979', '1996', '1999', '1993', '1997', '1992']



In [9]:
print(plot[0:10])

['The aging patriarch of an organized crime dynasty in postwar New York City transfers control of his clandestine empire to his reluctant youngest son.', 'The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.', 'The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.', 'A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.', 'A U.S. Army officer serving in Vietnam is tasked with assassinating a renegade Special Forces Colonel who sees himself as a god.', 'Renton, deeply immersed in the Edinburgh drug scene, tries to clean up and get out, despite the allure of the drugs and influence of friends.', 'An insomniac office worker and a devil-may-care soap maker f


#### Store the data in a pandas DataFrame

In [10]:
movies_df = pd.DataFrame({"name": movie_name, "year": year, "plot": plot})


In [11]:
movies_df.head()

Unnamed: 0,name,year,plot
0,The Godfather,1972,The aging patriarch of an organized crime dyna...
1,Goodfellas,1990,The story of Henry Hill and his life in the mo...
2,Pulp Fiction,1994,"The lives of two mob hitmen, a boxer, a gangst..."
3,The Usual Suspects,1995,A sole survivor tells of the twisty events lea...
4,Apocalypse Now,1979,A U.S. Army officer serving in Vietnam is task...



#### Combine the two attributes (year and plot) into one column so the text can be analyzed together.

In [12]:
movies_df["year_plot"] = movies_df["year"].astype(str) + " " + movies_df["plot"]


In [13]:
movies_df = movies_df.drop(["year", "plot"], axis=1)


In [14]:
movies_df.rename(columns={"year_plot": "Overview"}, inplace=True)


#### Check the dataset again

In [15]:
movies_df.head()

Unnamed: 0,name,Overview
0,The Godfather,1972 The aging patriarch of an organized crime...
1,Goodfellas,1990 The story of Henry Hill and his life in t...
2,Pulp Fiction,"1994 The lives of two mob hitmen, a boxer, a g..."
3,The Usual Suspects,1995 A sole survivor tells of the twisty event...
4,Apocalypse Now,1979 A U.S. Army officer serving in Vietnam is...



## Part II. Calculate TF-IDF word weights for the movie plots and the cosine similarity between movies

#### Use TF-IDF to calculate the weight of each word in the plot.

In [16]:
tfidf = TfidfVectorizer(stop_words="english")
movies_df["Overview"] = movies_df["Overview"].fillna("")
tfidf_matrix = tfidf.fit_transform(movies_df["Overview"])
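
To make the weighting concrete, here is a toy TF-IDF in plain Python. It is a sketch, assuming scikit-learn's default settings as I understand them (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). The whitespace tokenizer, the missing stop-word removal, and the three sample documents are simplifications of mine, so the numbers will not match `TfidfVectorizer` exactly.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Toy TF-IDF: raw term counts times smoothed IDF, L2-normalized per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, vectors


docs = ["mob boss crime family", "crime detective city", "toy cowboy space ranger"]
vocab, vectors = tfidf_vectors(docs)
weights = dict(zip(vocab, vectors[0]))
# "crime" occurs in two documents, so it weighs less than "mob", which occurs in one.
print(weights["crime"] < weights["mob"])
```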


#### Calculate the cosine similarity

In [17]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
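
`linear_kernel` computes plain dot products. That equals the cosine similarity here only because the TF-IDF rows are already unit length (assuming the vectorizer's default L2 normalization is left on). A small check of that identity in plain Python, with made-up vectors:

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def normalize(u):
    """Scale a vector to unit (L2) length."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]


u, v = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
# Both values agree up to floating-point rounding.
print(cosine(u, v))
print(dot(normalize(u), normalize(v)))
```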


#### Get the index of each movie

In [18]:
indices = pd.Series(movies_df.index, index=movies_df["name"])
indices

name
The Godfather              0
Goodfellas                 1
Pulp Fiction               2
The Usual Suspects         3
Apocalypse Now             4
                          ..
Men in Black              95
No Country for Old Men    96
Airplane!                 97
There Will Be Blood       98
Inception                 99
Length: 100, dtype: int64


#### Define a function that will allow us to automatically generate movie recommendations

In [19]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]
    sim_scores = enumerate(cosine_sim[idx])
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]

    sim_index = [i[0] for i in sim_scores]
    print(movies_df["name"].iloc[sim_index])
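
The slice `[1:11]` assumes the queried movie itself sorts first. That could drop a different movie if another overview tied at similarity 1.0. A hedged variant that excludes the movie by position and returns the result instead of printing it; `top_n_similar` and the sample data are hypothetical:

```python
def top_n_similar(names, similarity_row, idx, n=10):
    """Return up to n (name, score) pairs most similar to item idx, excluding idx."""
    ranked = sorted(
        (i for i in range(len(names)) if i != idx),
        key=lambda i: similarity_row[i],
        reverse=True,
    )
    return [(names[i], similarity_row[i]) for i in ranked[:n]]


names = ["A", "B", "C", "D"]
row_for_a = [1.0, 0.2, 0.7, 0.0]  # similarity of "A" to every item, itself included
print(top_n_similar(names, row_for_a, idx=0, n=2))  # [('C', 0.7), ('B', 0.2)]
```

Returning the pairs makes it possible to filter or re-rank the recommendations later.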


---

## Part III. Recommend movies to a given viewer based on the ratings they provided

* This is a content-based recommendation system: for each viewer, I select the movie they rated highest and recommend the movies most similar to it.

#### Check the real data that we have

In [20]:
movie_rating

Unnamed: 0,Critics,The Godfather,Goodfellas,Pulp Fiction,The Usual Suspects,Apocalypse Now
0,Kyler,5.0,3.0,4.0,,
1,Kyle,5.0,2.0,2.0,4.0,
2,Tyler,3.0,4.0,5.0,,2.0
3,Dustin,1.0,3.0,5.0,4.0,3.0
4,Alex,3.0,2.0,,,1.0
5,Roy,5.0,4.0,2.0,,2.0
6,Yan,,2.0,3.0,5.0,
7,Yang,5.0,,3.0,3.0,
8,Jessie,,,4.0,3.0,2.0
9,Frank,5.0,,4.0,,3.0



#### Tidy the data

In [21]:
formatted_movie_rating = pd.melt(
    movie_rating, ["Critics"], var_name="movie_name", value_name="rating"
)
formatted_movie_rating = formatted_movie_rating.sort_values(by=["Critics"])
formatted_movie_rating.head()

Unnamed: 0,Critics,movie_name,rating
24,Alex,Pulp Fiction,
44,Alex,Apocalypse Now,1.0
4,Alex,The Godfather,3.0
34,Alex,The Usual Suspects,
14,Alex,Goodfellas,2.0



In [22]:
movie_rating_final = formatted_movie_rating.pivot_table(
    index="Critics", columns="movie_name", values="rating"
)
movie_rating_final

Critics,Apocalypse Now,Goodfellas,Pulp Fiction,The Godfather,The Usual Suspects
Alex,1.0,2.0,,3.0,
Dustin,3.0,3.0,5.0,1.0,4.0
Frank,3.0,,4.0,5.0,
Jessie,2.0,,4.0,,3.0
Kyle,,2.0,2.0,5.0,4.0
Kyler,,3.0,4.0,5.0,
Roy,2.0,4.0,2.0,5.0,
Tyler,2.0,4.0,5.0,3.0,
Yan,,2.0,3.0,,5.0
Yang,,,3.0,5.0,3.0



In [23]:
movie_rating_final = movie_rating_final.fillna(0)
movie_rating_final

Critics,Apocalypse Now,Goodfellas,Pulp Fiction,The Godfather,The Usual Suspects
Alex,1.0,2.0,0.0,3.0,0.0
Dustin,3.0,3.0,5.0,1.0,4.0
Frank,3.0,0.0,4.0,5.0,0.0
Jessie,2.0,0.0,4.0,0.0,3.0
Kyle,0.0,2.0,2.0,5.0,4.0
Kyler,0.0,3.0,4.0,5.0,0.0
Roy,2.0,4.0,2.0,5.0,0.0
Tyler,2.0,4.0,5.0,3.0,0.0
Yan,0.0,2.0,3.0,0.0,5.0
Yang,0.0,0.0,3.0,5.0,3.0



#### Choose the viewer to recommend movies to

In [24]:
target_viewer = input("We want to recommend the movies to:")

We want to recommend the movies to:Alex



#### Select the ratings of the viewer entered above

In [25]:
target_df = movie_rating_final.loc[target_viewer].to_frame()
target_df

movie_name,Alex
Apocalypse Now,1.0
Goodfellas,2.0
Pulp Fiction,0.0
The Godfather,3.0
The Usual Suspects,0.0



#### Reset the index of the data frame above (this makes it easier to analyze)

In [26]:
target_df = target_df.reset_index()
target_df

Unnamed: 0,movie_name,Alex
0,Apocalypse Now,1.0
1,Goodfellas,2.0
2,Pulp Fiction,0.0
3,The Godfather,3.0
4,The Usual Suspects,0.0



#### Get the name of the movie the target viewer rated highest.

In [27]:
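# idxmax returns the first top-rated row, so a tie for the highest rating
# no longer raises an error the way .item() does on multiple matches.
movie_title = target_df.loc[target_df[target_viewer].idxmax(), "movie_name"]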
movie_title

'The Godfather'


#### Pass the target viewer's highest-rated movie to the function to get recommendations for this viewer.

In [28]:
get_recommendations(movie_title)

15         The Godfather Part II
12                   Taxi Driver
88                  12 Angry Men
94               American Psycho
58                 Batman Begins
45                        Casino
63                   The Shining
76    Terminator 2: Judgment Day
80                      Sin City
36                     Toy Story
Name: name, dtype: object



In [29]:
print(f"The movies that should be recommended to {target_viewer} are:")
get_recommendations(movie_title)

The movies that should be recommended to Alex are:
15         The Godfather Part II
12                   Taxi Driver
88                  12 Angry Men
94               American Psycho
58                 Batman Begins
45                        Casino
63                   The Shining
76    Terminator 2: Judgment Day
80                      Sin City
36                     Toy Story
Name: name, dtype: object



#### Possible future improvements
**The final result does not exclude movies the user has already rated. A better version would recommend the movies most similar to the user's highest-rated film that the user has not rated yet.**
* This should not be difficult to add.
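
One way this filter could look, as a minimal sketch: `recommend_unseen` is a hypothetical helper, and the ranked list below is made up for illustration (only Alex's rated titles come from the ratings table).

```python
def recommend_unseen(similar_titles, rated_titles, n=3):
    """Keep the ranking order but skip titles the viewer has already rated."""
    rated = set(rated_titles)
    return [title for title in similar_titles if title not in rated][:n]


# Hypothetical ranking, most similar first.
ranked = ["The Godfather Part II", "Goodfellas", "Taxi Driver", "Apocalypse Now", "Casino"]
# Movies Alex has a rating for in the table above.
alex_rated = ["The Godfather", "Goodfellas", "Apocalypse Now"]
print(recommend_unseen(ranked, alex_rated))
# ['The Godfather Part II', 'Taxi Driver', 'Casino']
```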

---

## Part IV. Item to Item Collaborative Filtering

#### Check the dataset again

In [30]:
df = movie_rating
df

Unnamed: 0,Critics,The Godfather,Goodfellas,Pulp Fiction,The Usual Suspects,Apocalypse Now
0,Kyler,5.0,3.0,4.0,,
1,Kyle,5.0,2.0,2.0,4.0,
2,Tyler,3.0,4.0,5.0,,2.0
3,Dustin,1.0,3.0,5.0,4.0,3.0
4,Alex,3.0,2.0,,,1.0
5,Roy,5.0,4.0,2.0,,2.0
6,Yan,,2.0,3.0,5.0,
7,Yang,5.0,,3.0,3.0,
8,Jessie,,,4.0,3.0,2.0
9,Frank,5.0,,4.0,,3.0



#### Tidy the data

In [31]:
formatted_df = pd.melt(df, ["Critics"], var_name="movie", value_name="rating")
formatted_df = formatted_df.sort_values(by=["Critics"])


In [32]:
formatted_df.head(10)

Unnamed: 0,Critics,movie,rating
24,Alex,Pulp Fiction,
44,Alex,Apocalypse Now,1.0
4,Alex,The Godfather,3.0
34,Alex,The Usual Suspects,
14,Alex,Goodfellas,2.0
43,Dustin,Apocalypse Now,3.0
3,Dustin,The Godfather,1.0
33,Dustin,The Usual Suspects,4.0
23,Dustin,Pulp Fiction,5.0
13,Dustin,Goodfellas,3.0



In [33]:
matrix = formatted_df.pivot_table(index="movie", columns="Critics", values="rating")
matrix

movie,Alex,Dustin,Frank,Jessie,Kyle,Kyler,Roy,Tyler,Yan,Yang
Apocalypse Now,1.0,3.0,3.0,2.0,,,2.0,2.0,,
Goodfellas,2.0,3.0,,,2.0,3.0,4.0,4.0,2.0,
Pulp Fiction,,5.0,4.0,4.0,2.0,4.0,2.0,5.0,3.0,3.0
The Godfather,3.0,1.0,5.0,,5.0,5.0,5.0,3.0,,5.0
The Usual Suspects,,4.0,,3.0,4.0,,,,5.0,3.0



#### Mean-center the ratings (actual rating minus each movie's average rating)

In [34]:
df_norm = matrix.subtract(matrix.mean(axis=1), axis=0)
df_norm

movie,Alex,Dustin,Frank,Jessie,Kyle,Kyler,Roy,Tyler,Yan,Yang
Apocalypse Now,-1.166667,0.833333,0.833333,-0.166667,,,-0.166667,-0.166667,,
Goodfellas,-0.857143,0.142857,,,-0.857143,0.142857,1.142857,1.142857,-0.857143,
Pulp Fiction,,1.444444,0.444444,0.444444,-1.555556,0.444444,-1.555556,1.444444,-0.555556,-0.555556
The Godfather,-1.0,-3.0,1.0,,1.0,1.0,1.0,-1.0,,1.0
The Usual Suspects,,0.2,,-0.8,0.2,,,,1.2,-0.8



* I calculated the cosine similarity between movie plots in Part II; let's look at it again.

In [35]:
cosine_sim

array([[1.        , 0.03139379, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.03139379, 1.        , 0.12829673, ..., 0.        , 0.04084095,
        0.        ],
       [0.        , 0.12829673, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.04084095, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])


#### Convert the array into a dataframe so it is easier to read

In [36]:
df_array = pd.DataFrame(cosine_sim)
df_array

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,1.000000,0.031394,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.10028,0.000000,0.000000,0.0,0.000000,0.0
1,0.031394,1.000000,0.128297,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,0.0,0.040841,0.0
2,0.000000,0.128297,1.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.00000,0.000000,0.046228,0.0,0.000000,0.0
3,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.0,0.0,0.000000,0.093354,...,0.0,0.032510,0.0,0.000000,0.00000,0.041294,0.000000,0.0,0.000000,0.0
4,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.108547,0.0,0.000000,0.00000,0.057557,0.000000,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.000000,0.000000,0.000000,0.041294,0.057557,0.000000,0.0,0.0,0.050081,0.049068,...,0.0,0.090097,0.0,0.000000,0.00000,1.000000,0.000000,0.0,0.000000,0.0
96,0.000000,0.000000,0.046228,0.000000,0.000000,0.049011,0.0,0.0,0.000000,0.065818,...,0.0,0.000000,0.0,0.000000,0.00000,0.000000,1.000000,0.0,0.063004,0.0
97,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,1.0,0.000000,0.0
98,0.000000,0.040841,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.099394,0.000000,...,0.0,0.000000,0.0,0.048815,0.00000,0.000000,0.063004,0.0,1.000000,0.0



#### Rename the index and columns with the movie names

In [37]:
reset_name = movies_df["name"].to_list()


In [38]:
# set_axis no longer accepts inplace= in pandas 2.0+, so reassign the result.
df_array = df_array.set_axis(reset_name, axis=1)


In [39]:
df_array.index = reset_name


In [40]:
df_array

Unnamed: 0,The Godfather,Goodfellas,Pulp Fiction,The Usual Suspects,Apocalypse Now,Trainspotting,Fight Club,Schindler's List,Boogie Nights,Reservoir Dogs,...,Rain Man,Minority Report,Goldfinger,The Social Network,American Psycho,Men in Black,No Country for Old Men,Airplane!,There Will Be Blood,Inception
The Godfather,1.000000,0.031394,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.10028,0.000000,0.000000,0.0,0.000000,0.0
Goodfellas,0.031394,1.000000,0.128297,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,0.0,0.040841,0.0
Pulp Fiction,0.000000,0.128297,1.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.00000,0.000000,0.046228,0.0,0.000000,0.0
The Usual Suspects,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.0,0.0,0.000000,0.093354,...,0.0,0.032510,0.0,0.000000,0.00000,0.041294,0.000000,0.0,0.000000,0.0
Apocalypse Now,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.108547,0.0,0.000000,0.00000,0.057557,0.000000,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Men in Black,0.000000,0.000000,0.000000,0.041294,0.057557,0.000000,0.0,0.0,0.050081,0.049068,...,0.0,0.090097,0.0,0.000000,0.00000,1.000000,0.000000,0.0,0.000000,0.0
No Country for Old Men,0.000000,0.000000,0.046228,0.000000,0.000000,0.049011,0.0,0.0,0.000000,0.065818,...,0.0,0.000000,0.0,0.000000,0.00000,0.000000,1.000000,0.0,0.063004,0.0
Airplane!,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,1.0,0.000000,0.0
There Will Be Blood,0.000000,0.040841,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.099394,0.000000,...,0.0,0.000000,0.0,0.048815,0.00000,0.000000,0.063004,0.0,1.000000,0.0



#### Find the movies each audience member has not rated

In [41]:
matrix

movie,Alex,Dustin,Frank,Jessie,Kyle,Kyler,Roy,Tyler,Yan,Yang
Apocalypse Now,1.0,3.0,3.0,2.0,,,2.0,2.0,,
Goodfellas,2.0,3.0,,,2.0,3.0,4.0,4.0,2.0,
Pulp Fiction,,5.0,4.0,4.0,2.0,4.0,2.0,5.0,3.0,3.0
The Godfather,3.0,1.0,5.0,,5.0,5.0,5.0,3.0,,5.0
The Usual Suspects,,4.0,,3.0,4.0,,,,5.0,3.0



In [42]:
not_rating_all_audiences = []
for audience in matrix.columns:
    not_rating_before = matrix[matrix[audience].isnull()].index.to_list()
    not_rating_all_audiences.append(not_rating_before)
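
The loop above stores the unrated movies in a list ordered by column position. Keying the result by critic name makes later lookups more direct. A small sketch of that idea in plain Python, with two critics copied from the matrix and `None` standing in for a missing rating (the dictionary layout is my assumption, not the notebook's data structure):

```python
# Ratings for two critics, copied from the matrix above.
ratings = {
    "Alex": {"Apocalypse Now": 1.0, "Goodfellas": 2.0, "Pulp Fiction": None,
             "The Godfather": 3.0, "The Usual Suspects": None},
    "Dustin": {"Apocalypse Now": 3.0, "Goodfellas": 3.0, "Pulp Fiction": 5.0,
               "The Godfather": 1.0, "The Usual Suspects": 4.0},
}

# Map each critic to the movies they have not rated.
unrated_by_critic = {
    critic: [movie for movie, rating in movies.items() if rating is None]
    for critic, movies in ratings.items()
}
print(unrated_by_critic["Alex"])    # ['Pulp Fiction', 'The Usual Suspects']
print(unrated_by_critic["Dustin"])  # []
```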


In [43]:
not_rating_all_audiences

[['Pulp Fiction', 'The Usual Suspects'],
 [],
 ['Goodfellas', 'The Usual Suspects'],
 ['Goodfellas', 'The Godfather'],
 ['Apocalypse Now'],
 ['Apocalypse Now', 'The Usual Suspects'],
 ['The Usual Suspects'],
 ['The Usual Suspects'],
 ['Apocalypse Now', 'The Godfather'],
 ['Apocalypse Now', 'Goodfellas']]


#### Who do we want to recommend movies to?

In [44]:
target_audience = input("Target audience is:")

Target audience is:Alex



#### Which movies has the target audience member not rated?

In [45]:
never_watched = not_rating_all_audiences[matrix.columns.get_loc(target_audience)]
never_watched

['Pulp Fiction', 'The Usual Suspects']


#### Get the names and similarity values of the 10 most similar movies for each movie

In [46]:
similarity_list = []
for movie in df_array.columns:
    sm = df_array[movie].sort_values(ascending=False)[1:].head(10)
    similarity_list.append(sm)


In [47]:
top10_for_never_watched = []
for movie in never_watched:
    top10 = similarity_list[df_array.columns.get_loc(movie)]
    top10_for_never_watched.append(top10)


In [48]:
top10_for_never_watched

[Raging Bull                    0.128354
 Goodfellas                     0.128297
 The Shawshank Redemption       0.128235
 The Green Mile                 0.070035
 Once Upon a Time in America    0.064009
 The French Connection          0.059898
 Bowling for Columbine          0.059064
 Memento                        0.055797
 True Romance                   0.048038
 Django Unchained               0.047284
 Name: Pulp Fiction, dtype: float64,
 Reservoir Dogs           0.093354
 Bowling for Columbine    0.070771
 Evil Dead II             0.059994
 The Shining              0.053937
 The Prestige             0.051033
 Forrest Gump             0.049732
 Se7en                    0.047319
 Blue Velvet              0.044860
 Platoon                  0.044373
 Toy Story                0.043016
 Name: The Usual Suspects, dtype: float64]


#### Convert the list above into a dataframe to make the next step easier

In [49]:
df_movie_sm = pd.DataFrame(top10_for_never_watched).T
df_movie_sm

Unnamed: 0,Pulp Fiction,The Usual Suspects
Raging Bull,0.128354,
Goodfellas,0.128297,
The Shawshank Redemption,0.128235,
The Green Mile,0.070035,
Once Upon a Time in America,0.064009,
The French Connection,0.059898,
Bowling for Columbine,0.059064,0.070771
Memento,0.055797,
True Romance,0.048038,
Django Unchained,0.047284,



#### Check the target audience's rating history again, filling null values with 0 so the scores can be calculated.

In [50]:
df_target_audience = matrix[target_audience].to_frame().fillna(0)
df_target_audience

movie,Alex
Apocalypse Now,1.0
Goodfellas,2.0
Pulp Fiction,0.0
The Godfather,3.0
The Usual Suspects,0.0



#### Combine the dataframes above, filling null values with 0 so the scores can be calculated.

In [51]:
df_final = df_target_audience.join(df_movie_sm).fillna(0)
df_final

movie,Alex,Pulp Fiction,The Usual Suspects
Apocalypse Now,1.0,0.0,0.0
Goodfellas,2.0,0.128297,0.0
Pulp Fiction,0.0,0.0,0.0
The Godfather,3.0,0.0,0.0
The Usual Suspects,0.0,0.0,0.0



#### Calculate the score for each movie the target audience has not rated, using a loop.

In [52]:
final_rating = []
for i in never_watched:
    sum_rating = sum(df_final[i] * df_final[target_audience])
    final_rating.append(sum_rating)
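
The sum above is a similarity-weighted total, so it is not on the 1 to 5 rating scale. Dividing by the total similarity weight turns it into a predicted rating. A sketch under that assumption, using Alex's ratings and the one non-zero similarity from the tables above; `predict_rating` is a hypothetical helper:

```python
def predict_rating(similarities, user_ratings):
    """Similarity-weighted average of the user's known ratings.
    Returns None when no rated movie has a non-zero similarity."""
    pairs = [(similarities.get(movie, 0.0), rating)
             for movie, rating in user_ratings.items()]
    total_weight = sum(abs(sim) for sim, _ in pairs)
    if total_weight == 0:
        return None
    return sum(sim * rating for sim, rating in pairs) / total_weight


alex = {"Apocalypse Now": 1.0, "Goodfellas": 2.0, "The Godfather": 3.0}
# Only Goodfellas appears among Pulp Fiction's top-10 neighbours that Alex rated.
pulp_fiction_neighbours = {"Goodfellas": 0.128297}
# None of Alex's rated movies appear among The Usual Suspects' top-10 neighbours.
usual_suspects_neighbours = {}

print(predict_rating(pulp_fiction_neighbours, alex))    # 2.0
print(predict_rating(usual_suspects_neighbours, alex))  # None
```

With a single rated neighbour the prediction simply equals that neighbour's rating, which shows how little evidence this small dataset provides.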


#### Take the maximum score and use its index to find the corresponding movie name.

In [53]:
movie_recommend = never_watched[final_rating.index(max(final_rating))]


#### Print the result: the movie to recommend to the target audience and its score.

In [54]:
print(
    f"The movie we should recommend to {target_audience} is {movie_recommend}; "
    f"the similarity-weighted score is {round(max(final_rating), 6)}"
)

The movie we should recommend to Alex is Pulp Fiction; the similarity-weighted score is 0.256593



---

## Part V. Comparison

* A content-based recommender suggests movies similar to those a viewer rated highly. Its advantage is that it still works when no known viewer has rated a movie, because similarity comes from the movie's own description; this helps with the cold-start problem. The drawback is that every new movie needs its attributes defined and labeled, and this never-ending attribute assignment makes scaling difficult and time-consuming.
* Collaborative filtering, in contrast, does not handle cold start well (in my example above, I cannot recommend a movie that no audience member has rated), but more sophisticated algorithms can help users discover new interests. Compared with content-based filtering (CBF), it has the advantage of recommending items from each user's rating history, regardless of the content attributes of the items themselves.
* That is why a hybrid of the two methods is useful.
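
One simple form a hybrid could take is a weighted blend of the two scores. This is only a sketch: `hybrid_score` and the weight `alpha` are hypothetical, and it assumes both scores have already been put on the same scale.

```python
def hybrid_score(content_score, collaborative_score, alpha=0.5):
    """Blend a content-based and a collaborative score.
    Falls back to the content score when no collaborative score exists,
    for example for a movie nobody has rated yet (cold start)."""
    if collaborative_score is None:
        return content_score
    return alpha * content_score + (1 - alpha) * collaborative_score


print(hybrid_score(0.8, 0.4))           # blend of both signals
print(hybrid_score(0.8, None))          # cold start: content score only
```

A larger `alpha` leans on plot similarity, while a smaller one leans on rating history.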

---