## Introduction to Project

Business case: LiveWire, a new streaming company, is entering a market in which most of its prospective viewers already use services such as Netflix, Amazon Prime, and Hulu. LiveWire needs a data scientist to determine which movies to include in its library so that it appeals to a wide variety of viewers, and to build a recommendation system, like those of the larger brands, that keeps customers engaged with the product.

Questions to answer throughout the modeling project:
1. Which five genres should LiveWire focus on when adding to the library?
2. How can a similar user help the recommendation system suggest movies to a new user?
3. What are the top ten movies of all time according to the ratings in this dataset?

In [1]:
# The library is published as scikit-surprise; a plain `conda install surprise`
# fails with PackagesNotFoundError on the default channels
!conda install -y -c conda-forge scikit-surprise




## Importing Libraries

In [64]:
# import libraries
import numpy as np
import pandas as pd

from surprise import Dataset, Reader
from surprise import KNNBaseline, SVD, SVDpp, NMF
from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split

## EDA

In [3]:
df1 = pd.read_csv('movies.csv', index_col=0)
df1.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [7]:
df1.describe()

,title,genres
count,58098,58098
unique,58020,1643
top,Berlin Calling (2008),Drama
freq,2,8402


In [10]:
df2 = pd.read_csv('ratings.csv', index_col=0)
# movieId stays a regular column: the later merge joins on it
df2



userId,movieId,rating,timestamp
1,307,3.5,1256677221
1,481,3.5,1256677456
1,1091,1.5,1256677471
1,1257,4.5,1256677460
1,1449,4.5,1256677264
...,...,...,...
283228,8542,4.5,1379882795
283228,8712,4.5,1379882751
283228,34405,4.5,1379882889
283228,44761,4.5,1354159524


In [14]:
# Combine the ratings and movies DataFrames into a single DataFrame
df = df2.merge(df1, right_index=True, left_on='movieId')
df = df.reset_index()
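
The merge above assumes every `movieId` appears at most once in the movies table. A minimal sketch of making that assumption explicit with the `validate` argument of `DataFrame.merge`, on small made-up frames (the names `ratings`, `movies`, and `merged` and all values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for df2 (ratings) and df1 (movies); values are made up
ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [3.5, 4.0, 5.0]}
)
movies = pd.DataFrame(
    {"title": ["A", "B"]}, index=pd.Index([10, 20], name="movieId")
)

# validate="many_to_one" raises MergeError if a movieId is duplicated on the right
merged = ratings.merge(
    movies, left_on="movieId", right_index=True, validate="many_to_one"
)
print(merged)
```

If the check fails on the real data, the duplicated keys would need to be resolved before merging; otherwise ratings rows would be silently multiplied.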

In [15]:
df

,userId,movieId,rating,timestamp,title,genres
0,1,307,3.5,1256677221,Three Colors: Blue (Trois couleurs: Bleu) (1993),Drama
1,6,307,4.0,832059248,Three Colors: Blue (Trois couleurs: Bleu) (1993),Drama
2,56,307,4.0,1383625728,Three Colors: Blue (Trois couleurs: Bleu) (1993),Drama
3,71,307,5.0,1257795414,Three Colors: Blue (Trois couleurs: Bleu) (1993),Drama
4,84,307,3.0,999055519,Three Colors: Blue (Trois couleurs: Bleu) (1993),Drama
...,...,...,...,...,...,...
27753439,282403,167894,1.0,1524243885,Stranglehold (1994),Action
27753440,282732,161572,3.5,1504408070,The Great Houdini (1976),Drama
27753441,283000,117857,3.5,1417317969,Hotline (2014),Documentary
27753442,283000,133409,3.5,1431539331,Barnum! (1986),(no genres listed)


In [30]:
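# Average the numeric columns per movie; movieId becomes the index.
# numeric_only=True skips title and genres, which recent pandas versions refuse to average.
avg_rating = df.groupby('movieId').mean(numeric_only=True)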
avg_rating

movieId,userId,rating,timestamp
1,141939.237859,3.886649,1.128094e+09
2,142395.293962,3.246583,1.105961e+09
3,140371.877575,3.173981,9.692933e+08
4,140527.990632,2.874540,9.405874e+08
5,141254.322735,3.077291,9.970812e+08
...,...,...,...
193876,103565.000000,3.000000,1.537875e+09
193878,176871.000000,2.000000,1.537875e+09
193880,81710.000000,2.000000,1.537886e+09
193882,33330.000000,2.000000,1.537891e+09


In [29]:
# Add title and genres to the averaged DataFrame and drop the irrelevant 'timestamp' column
df_avg_rtng = avg_rating.merge(df1, right_index=True, left_on='movieId')
df_avg_rtng = df_avg_rtng.drop(columns='timestamp')
df_avg_rtng

movieId,userId,rating,title,genres
1,141939.237859,3.886649,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,142395.293962,3.246583,Jumanji (1995),Adventure|Children|Fantasy
3,140371.877575,3.173981,Grumpier Old Men (1995),Comedy|Romance
4,140527.990632,2.874540,Waiting to Exhale (1995),Comedy|Drama|Romance
5,141254.322735,3.077291,Father of the Bride Part II (1995),Comedy
...,...,...,...,...
193876,103565.000000,3.000000,The Great Glinka (1946),(no genres listed)
193878,176871.000000,2.000000,Les tribulations d'une caissière (2011),Comedy
193880,81710.000000,2.000000,Her Name Was Mumu (2016),Drama
193882,33330.000000,2.000000,Flora (2017),Adventure|Drama|Horror|Sci-Fi


In [51]:
df_avg_rtng.drop_duplicates(inplace=True)
df_avg_rtng.shape

(53889, 4)

In [52]:
print(df_avg_rtng.info())
print(df_avg_rtng.describe())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53889 entries, 1 to 193886
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   userId  53889 non-null  float64
 1   rating  53889 non-null  float64
 2   title   53889 non-null  object 
 3   genres  53889 non-null  object 
dtypes: float64(2), object(2)
memory usage: 2.1+ MB
None
              userId        rating
count   53889.000000  53889.000000
mean   142750.849686      3.068593
std     42134.337536      0.736242
min       277.000000      0.500000
25%    125275.870370      2.687500
50%    142078.448485      3.156250
75%    158224.727273      3.500000
max    283000.000000      5.000000


In [72]:
# Sort movies by average rating, from 5.0 down to 0.5
five_to_zero_rating = df_avg_rtng.sort_values(['rating'], ascending=False)
five_to_zero_rating

movieId,userId,rating,title,genres
169338,246452.0,5.0,Brad Williams: Daddy Issues (2016),Comedy
187729,84400.0,5.0,Ab-normal Beauty (2004),Horror
172149,225652.0,5.0,Back to You and Me (2005),Drama|Romance
160966,65081.0,5.0,You're Human Like the Rest of Them (1967),(no genres listed)
134387,123100.0,5.0,At Ellen’s Age (2011),Comedy|Drama
...,...,...,...,...
133810,48470.0,0.5,The Mad (2007),Comedy|Horror|Thriller
170255,74136.5,0.5,Junior (2011),Documentary
160614,22482.0,0.5,Big Man - A Policy for Hell (1988),(no genres listed)
170259,74136.5,0.5,Lily & the Snowman (2015),Animation
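
Sorting by mean rating alone puts titles at the top and bottom that were probably rated by only one or two users, which makes the list a weak basis for a "top ten". One possible remedy, sketched here on a small made-up frame, is to require a minimum number of ratings before ranking. The threshold `min_ratings`, the names `toy`, `stats`, and `top_movies`, and all values are assumptions for illustration:

```python
import pandas as pd

# Toy ratings frame with the same kind of columns as `df` (values are made up)
toy = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3, 3],
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4.0, 4.5, 5.0, 5.0, 3.0, 4.0],
})

# Mean rating and number of ratings per movie
stats = (
    toy.groupby(["movieId", "title"])["rating"]
    .agg(mean_rating="mean", n_ratings="count")
    .reset_index()
)

min_ratings = 2  # hypothetical threshold; would need tuning on the full dataset
top_movies = (
    stats[stats["n_ratings"] >= min_ratings]
    .sort_values("mean_rating", ascending=False)
    .head(10)
)
print(top_movies)
```

In this toy frame, movie "B" has a perfect mean but only one rating, so it is excluded from the ranking.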


In [73]:
# A Series has no .split method; string operations go through the .str accessor
five_to_zero_rating['genres'].str.split('|')

In [78]:
five_to_zero_rating['genres'].astype(str)

movieId
169338                    Comedy
187729                    Horror
172149             Drama|Romance
160966        (no genres listed)
134387              Comedy|Drama
                   ...          
133810    Comedy|Horror|Thriller
170255               Documentary
160614        (no genres listed)
170259                 Animation
148749        (no genres listed)
Name: genres, Length: 53889, dtype: object

In [56]:
five_to_zero_rating['genres'].describe()

count     53889
unique     1610
top       Drama
freq       7837
Name: genres, dtype: object

In [57]:
five_to_zero_rating['genres'].value_counts().head(10)

Drama                   7837
Comedy                  4919
Documentary             4082
(no genres listed)      3732
Comedy|Drama            2102
Drama|Romance           1868
Comedy|Romance          1384
Horror                  1333
Comedy|Drama|Romance     963
Drama|Thriller           803
Name: genres, dtype: int64
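
The counts above treat each genre combination (for example `Comedy|Drama`) as its own category. To count individual genres instead, one option is to split the strings and give each genre its own row. A minimal sketch on a made-up Series; `genres` and `genre_counts` are illustrative names:

```python
import pandas as pd

# Toy stand-in for the `genres` column (values are made up)
genres = pd.Series(["Comedy|Drama", "Drama", "Comedy|Romance", "Drama|Romance"])

# Split each string into a list, then expand every list element into its own row
genre_counts = genres.str.split("|").explode().value_counts()
print(genre_counts.head(5))
```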

In [63]:
# Condition (compare against the number 4.0, not the string '4.0')
# top_votes = (five_to_zero_rating['rating'] >= 4.0)
avg_rating_genre = df.groupby('genres').mean(numeric_only=True)
avg_rating_genre

genres,userId,movieId,rating,timestamp
(no genres listed),142593.787808,152624.712926,3.291397,1.495403e+09
Action,142542.081940,23659.359368,2.926899,1.138760e+09
Action|Adventure,141905.843425,8229.209946,3.704253,1.200243e+09
Action|Adventure|Animation,143066.895695,118835.582277,3.630601,1.472296e+09
Action|Adventure|Animation|Children,143043.954940,106931.324101,3.547609,1.470138e+09
...,...,...,...,...
Thriller|War,132602.653722,47241.543689,3.346278,1.320435e+09
Thriller|Western,148589.830508,101134.169492,3.067797,1.390964e+09
War,144040.075584,19553.551565,3.624725,1.136855e+09
War|Western,166741.717391,30940.804348,3.250000,1.261606e+09


In [79]:
df_avg_rtng['genres'].str.get_dummies(sep='|')

movieId,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193876,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
193878,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
193880,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
193882,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0
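
The indicator columns can be combined with the per-movie average rating to estimate how well each individual genre is rated. A minimal sketch on a made-up frame shaped like `df_avg_rtng`; the names `movies`, `dummies`, and `genre_mean` are illustrative, and the result is an unweighted mean over movies, not over individual ratings:

```python
import pandas as pd

# Toy per-movie frame shaped like `df_avg_rtng` (values are made up)
movies = pd.DataFrame({
    "rating": [4.0, 3.0, 2.0],
    "genres": ["Comedy|Drama", "Drama", "Comedy"],
})

dummies = movies["genres"].str.get_dummies(sep="|")

# Sum of movie ratings per genre divided by the number of movies carrying that genre
genre_mean = dummies.mul(movies["rating"], axis=0).sum() / dummies.sum()
print(genre_mean.sort_values(ascending=False))
```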


## Creating a Recommendation Model

In [None]:
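# Build a Surprise dataset from the merged ratings and hold out a test split.
# Assumption: RMSE on a 20% hold-out is the metric of interest for rating prediction.
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)


def model_accuracy(model):
    # Fit on the training split, then score the held-out ratings
    model.fit(trainset)
    predictions = model.test(testset)
    return accuracy.rmse(predictions)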

In [None]:
KNNB = KNNBaseline()
model_accuracy(KNNB)
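
Neighbourhood models such as `KNNBaseline` rest on the idea behind question 2: find users whose past ratings resemble the new user's and borrow their preferences. The sketch below illustrates that idea only, in plain pandas and NumPy on made-up ratings. It is not Surprise's implementation, which also uses baseline estimates and its own similarity options. The helper `cosine_on_common` and all names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy ratings (made up): rows are users, columns are movies, NaN = not rated
ratings = pd.DataFrame(
    {"m1": [5.0, 4.0, 1.0], "m2": [4.0, 5.0, 2.0], "m3": [np.nan, 5.0, 1.0]},
    index=["new_user", "u1", "u2"],
)


def cosine_on_common(a, b):
    """Cosine similarity over the movies both users have rated."""
    both = a.notna() & b.notna()
    if not both.any():
        return 0.0
    x, y = a[both].to_numpy(), b[both].to_numpy()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


target = ratings.loc["new_user"]
others = ratings.drop(index="new_user")

# Similarity of every other user to the new user
sims = others.apply(lambda row: cosine_on_common(target, row), axis=1)

# Recommend the nearest neighbour's best-rated movie that the new user has not seen
nearest = sims.idxmax()
unseen = target[target.isna()].index
recommended = others.loc[nearest, unseen].idxmax()
print(nearest, recommended)
```

With only a handful of ratings per user, raw cosine similarity discriminates poorly; mean-centering the ratings or using more neighbours are common refinements.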

In [None]:
# Lowercase instance names so the imported classes are not overwritten
svd = SVD()
model_accuracy(svd)

In [None]:
svdpp = SVDpp()
model_accuracy(svdpp)

In [None]:
nmf = NMF()
model_accuracy(nmf)