# Content-Based Recommender System for Netflix Movies

**Group 1:** <br>
Theen, Johannes (TH München)<br>
Utz, Elisabeth (OTH Amberg-Weiden)<br>
Yaruchyk, Oleg (TH München)

(Notes for the report)
# Text Processing
To process the text data, it must first be mapped to vectors or matrices.
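
As a minimal sketch of what "mapping text to vectors" means here, the following example turns two toy documents into a count matrix with `CountVectorizer`. The documents are hypothetical and not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "documents" (hypothetical, not from netflix_titles.csv)
docs = ["space robots fight", "robots love space robots"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Column labels, ordered by their column index
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = bow.toarray()

print(vocabulary)  # ['fight', 'love', 'robots', 'space']
print(dense)       # rows: [1 0 1 1] and [0 1 2 1]
```

Each column corresponds to one word of the vocabulary, and each cell counts how often that word occurs in the document.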

## 1. Importing Libraries, Classes, and Functions

In [1]:
import pandas as pd
import numpy as np
import string

# Bag-of-words vectorization (token counts and TF-IDF)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Sparse matrices and dimensionality reduction
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import sigmoid_kernel
from sklearn.metrics.pairwise import polynomial_kernel

## 2. Loading and Inspecting the Dataset

In [2]:
raw_data = pd.read_csv('netflix_titles.csv')
print(raw_data.shape)
raw_data.head()

(6234, 12)


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [3]:
raw_data.describe(include='all')

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
count,6234.0,6234,6234,4265,5664,5758,6223,6234.0,6224,6234,6234,6234
unique,,2,6172,3301,5469,554,1524,,14,201,461,6226
top,,Movie,Tunnel,"Raúl Campos, Jan Suter",David Attenborough,United States,"January 1, 2020",,TV-MA,1 Season,Documentaries,A surly septuagenarian gets another chance at ...
freq,,4265,3,18,18,2032,122,,2027,1321,299,3
mean,76703680.0,,,,,,,2013.35932,,,,
std,10942960.0,,,,,,,8.81162,,,,
min,247747.0,,,,,,,1925.0,,,,
25%,80035800.0,,,,,,,2013.0,,,,
50%,80163370.0,,,,,,,2016.0,,,,
75%,80244890.0,,,,,,,2018.0,,,,


## 3. Preprocessing

The dataset has 12 columns and 6,234 rows. The "director" column already shows that some cells contain "NaN". The "count" row of the `describe` output confirms that not every column has 6,234 entries. In the next step, all "NaN" entries are therefore replaced with an empty string.

In [4]:
raw_data = raw_data.fillna('')
raw_data.describe(include='all')

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
count,6234.0,6234,6234,6234.0,6234.0,6234,6234,6234.0,6234,6234,6234,6234
unique,,2,6172,3302.0,5470.0,555,1525,,15,201,461,6226
top,,Movie,Tunnel,,,United States,"January 1, 2020",,TV-MA,1 Season,Documentaries,A surly septuagenarian gets another chance at ...
freq,,4265,3,1969.0,570.0,2032,122,,2027,1321,299,3
mean,76703680.0,,,,,,,2013.35932,,,,
std,10942960.0,,,,,,,8.81162,,,,
min,247747.0,,,,,,,1925.0,,,,
25%,80035800.0,,,,,,,2013.0,,,,
50%,80163370.0,,,,,,,2016.0,,,,
75%,80244890.0,,,,,,,2018.0,,,,
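
The effect of `fillna('')` can be checked on a small example. The sketch below uses a hypothetical three-row frame whose column names mimic the dataset. It counts the missing values per column before and after filling.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same kind of gaps as the dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", np.nan, np.nan],
    "cast": [np.nan, "John Roe", "Ann Poe"],
})

missing_before = toy.isna().sum()         # NaN count per column
toy_filled = toy.fillna("")               # replace NaN with an empty string
missing_after = toy_filled.isna().sum()   # no NaN left
empty_strings = (toy_filled == "").sum()  # the gaps are now empty strings

print(missing_before)
print(empty_strings)
```

This matches the summary above: after filling, the empty string counts as a regular value, so it becomes the most frequent "director" entry with 1,969 occurrences (6,234 minus the 4,265 non-missing entries).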


So that actors and directors can be included in the calculations, first and last names are joined and lowercased (e.g., Richard Finn becomes richardfinn). Each full name is then treated as a single token during vectorization. Multi-word country names and categories are changed in the same way.

In [5]:
# Removes all spaces, lowercases everything, then re-inserts a space after each comma
def organize_data(data):
    data = data.str.replace(' ', '', regex=False)
    data = data.str.lower()
    data = data.str.replace(',', ', ', regex=False)
    return data

raw_data['type'] = organize_data(raw_data['type'])
raw_data['director'] = organize_data(raw_data['director'])
raw_data['cast'] = organize_data(raw_data['cast'])
raw_data['country'] = organize_data(raw_data['country'])
raw_data['listed_in'] = organize_data(raw_data['listed_in'])

raw_data.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,movie,Norm of the North: King Sized Adventure,"richardfinn, timmaltby","alanmarriott, andrewtoth, briandobson, colehow...","unitedstates, india, southkorea, china","September 9, 2019",2019,TV-PG,90 min,"children&familymovies, comedies",Before planning an awesome wedding for his gra...
1,80117401,movie,Jandino: Whatever it Takes,,jandinoasporaat,unitedkingdom,"September 9, 2016",2016,TV-MA,94 min,stand-upcomedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,tvshow,Transformers Prime,,"petercullen, sumaleemontano, frankwelker, jeff...",unitedstates,"September 8, 2018",2013,TV-Y7-FV,1 Season,kids'tv,"With the help of three human allies, the Autob..."
3,80058654,tvshow,Transformers: Robots in Disguise,,"willfriedle, darrencriss, constancezimmer, kha...",unitedstates,"September 8, 2018",2016,TV-Y7,1 Season,kids'tv,When a prison ship crash unleashes hundreds of...
4,80125979,movie,#realityhigh,fernandolebrija,"nestacooper, katewalsh, johnmichaelhiggins, ke...",unitedstates,"September 8, 2017",2017,TV-14,99 min,comedies,When nerdy high schooler Dani finally attracts...
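
The following sketch shows why the names are joined. It runs the default `CountVectorizer` analyzer on the same string before and after the transformation. The helper name `tokens` is introduced here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

def tokens(text):
    """Return the tokens CountVectorizer extracts with its default settings."""
    return CountVectorizer().build_analyzer()(text)

raw_tokens = tokens("Richard Finn, Tim Maltby")
joined_tokens = tokens("richardfinn, timmaltby")

print(raw_tokens)     # ['richard', 'finn', 'tim', 'maltby']
print(joined_tokens)  # ['richardfinn', 'timmaltby']
```

Without joining, two different people who share a first name would share a token; with joined names, only the same full name produces a match.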


In the next step, remaining punctuation and special characters such as "&" are removed from the columns "director", "cast", "country", "rating", and "listed_in".

In [6]:
# Remove punctuation from the selected columns
punctuation_table = str.maketrans("", "", string.punctuation)
for column in ['cast', 'listed_in', 'director', 'country', 'rating']:
    raw_data[column] = raw_data[column].str.translate(punctuation_table)
raw_data.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,movie,Norm of the North: King Sized Adventure,richardfinn timmaltby,alanmarriott andrewtoth briandobson colehoward...,unitedstates india southkorea china,"September 9, 2019",2019,TVPG,90 min,childrenfamilymovies comedies,Before planning an awesome wedding for his gra...
1,80117401,movie,Jandino: Whatever it Takes,,jandinoasporaat,unitedkingdom,"September 9, 2016",2016,TVMA,94 min,standupcomedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,tvshow,Transformers Prime,,petercullen sumaleemontano frankwelker jeffrey...,unitedstates,"September 8, 2018",2013,TVY7FV,1 Season,kidstv,"With the help of three human allies, the Autob..."
3,80058654,tvshow,Transformers: Robots in Disguise,,willfriedle darrencriss constancezimmer kharyp...,unitedstates,"September 8, 2018",2016,TVY7,1 Season,kidstv,When a prison ship crash unleashes hundreds of...
4,80125979,movie,#realityhigh,fernandolebrija,nestacooper katewalsh johnmichaelhiggins keith...,unitedstates,"September 8, 2017",2017,TV14,99 min,comedies,When nerdy high schooler Dani finally attracts...
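
A short sketch of why this step matters for the later vectorization: with its default settings, `CountVectorizer` splits text at punctuation, so a category such as `children&familymovies` would otherwise fall apart into two tokens. The example string is taken from the output above.

```python
import string
from sklearn.feature_extraction.text import CountVectorizer

# Translation table that deletes every ASCII punctuation character
strip_punct = str.maketrans("", "", string.punctuation)
analyze = CountVectorizer().build_analyzer()

category = "children&familymovies, comedies"
before = analyze(category)
after = analyze(category.translate(strip_punct))

print(before)  # ['children', 'familymovies', 'comedies']
print(after)   # ['childrenfamilymovies', 'comedies']
```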


## 4. Vectorizing the Documents

So that the data can be evaluated and compared, it is now transformed into vectors using scikit-learn's `CountVectorizer` class. The "description" column is vectorized with scikit-learn's `TfidfVectorizer` instead. This still takes the frequency of words within a document into account, but words that occur in every document (e.g., "is" or "and") receive a lower weight because they carry no useful information for us.

In [7]:
count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

# Frame used later for the title lookup
final_data = pd.DataFrame(data=raw_data)

# Each fit_transform call refits the vectorizer, so every column gets its own vocabulary
data_type = count_vectorizer.fit_transform(raw_data['type'])
data_title = count_vectorizer.fit_transform(raw_data['title'])
data_director = count_vectorizer.fit_transform(raw_data['director'])
data_cast = count_vectorizer.fit_transform(raw_data['cast'])
data_country = count_vectorizer.fit_transform(raw_data['country'])
data_date_added = count_vectorizer.fit_transform(raw_data['date_added'])
data_release_year = raw_data['release_year']  # numeric column, kept as-is
data_rating = count_vectorizer.fit_transform(raw_data['rating'])
data_duration = count_vectorizer.fit_transform(raw_data['duration'])
data_listed_in = count_vectorizer.fit_transform(raw_data['listed_in'])
data_description = tfidf_vectorizer.fit_transform(raw_data['description'])
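
The TF-IDF weighting can be illustrated on a toy corpus. The sketch below assumes the default `TfidfVectorizer` settings, under which the inverse document frequency is computed as `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, with `n` the number of documents and `df(t)` the number of documents containing term `t`. The three sentences are hypothetical.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions; "the" occurs in every one of them
docs = [
    "a robot saves the city",
    "a detective solves the case",
    "a chef opens the restaurant",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

# Map each term to its learned idf value
idf = {term: tfidf.idf_[col] for term, col in tfidf.vocabulary_.items()}

# Weights of the first document for a common and a rare word
row0 = weights[0].toarray().ravel()
weight_the = row0[tfidf.vocabulary_["the"]]
weight_robot = row0[tfidf.vocabulary_["robot"]]

print(idf["the"], idf["robot"])
print(weight_the, weight_robot)
```

A word that appears in every document receives the minimum idf of 1.0, while a word that appears in a single document receives a higher value, so it contributes more to the similarity between descriptions. Common words are down-weighted, not removed.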

In [8]:
# Merge the feature matrices column-wise (date_added, release_year, and duration are left out)
all_matrix = sparse.hstack((data_type, data_title, data_director, data_cast, data_country, data_rating, data_listed_in, data_description), format='csr')
all_matrix

<6234x54665 sparse matrix of type '<class 'numpy.float64'>'
	with 234240 stored elements in Compressed Sparse Row format>
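
As a small sketch of what `sparse.hstack` does, the example below joins two hypothetical feature blocks for three titles. The blocks must have the same number of rows, and their columns are simply placed side by side.

```python
import numpy as np
from scipy import sparse

# Hypothetical feature blocks for 3 titles
block_a = sparse.csr_matrix(np.array([[1, 0],
                                      [0, 1],
                                      [1, 1]]))            # 2 count features
block_b = sparse.csr_matrix(np.array([[0.5, 0.0, 0.0],
                                      [0.0, 0.0, 0.2],
                                      [0.0, 0.7, 0.0]]))   # 3 TF-IDF-like features

combined = sparse.hstack((block_a, block_b), format="csr")

print(combined.shape)  # (3, 5)
print(combined.nnz)    # 7
print(combined.dtype)  # float64
```

The 54,665 columns reported above are therefore the sum of the vocabulary sizes of the eight blocks, and mixing integer counts with TF-IDF weights yields a `float64` matrix.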

In [11]:
from sklearn.preprocessing import MaxAbsScaler
all_matrix_scale = MaxAbsScaler().fit_transform(all_matrix)
all_matrix_scale

<6234x54665 sparse matrix of type '<class 'numpy.float64'>'
	with 234240 stored elements in Compressed Sparse Row format>
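
`MaxAbsScaler` divides every column by its largest absolute value, so all features end up in the range [-1, 1] while zero entries stay zero. A minimal sketch on a hypothetical 3x3 matrix:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical matrix: a count-like column, an empty column, a TF-IDF-like column
X = sparse.csr_matrix(np.array([[2.0, 0.0, 0.50],
                                [0.0, 0.0, 0.25],
                                [4.0, 0.0, 0.00]]))

X_scaled = MaxAbsScaler().fit_transform(X)

print(X_scaled.toarray())
# rows: [0.5 0. 1.], [0. 0. 0.5], [1. 0. 0.]
print(X.nnz, X_scaled.nnz)  # 4 4
```

Because no entry is shifted, sparsity is preserved, which is consistent with the unchanged count of stored elements in the output above.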

In [14]:
svd = TruncatedSVD(n_components=5100, n_iter=5)
# Fit the SVD and project the scaled matrix in one step
all_matrix_svd = svd.fit_transform(all_matrix_scale)
explained_variance = svd.explained_variance_ratio_.sum()
print(explained_variance)

0.9488362595204387
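
One way to choose `n_components` is to look at the cumulative explained variance. The sketch below uses a random sparse matrix as a hypothetical stand-in for the scaled feature matrix; the 80% target is an arbitrary value chosen for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical random sparse matrix standing in for all_matrix_scale
X = sparse.random(200, 50, density=0.1, format="csr", random_state=0)

svd = TruncatedSVD(n_components=40, n_iter=5, random_state=0)
X_reduced = svd.fit_transform(X)  # shape: (200, 40)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest number of components reaching the target, or None if it is not reached
target = 0.80
reached = cumulative >= target
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None

print(X_reduced.shape)
print(cumulative[-1], n_needed)
```

If the target is not reached, `n_components` has to be increased and the decomposition refitted.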


In [15]:
# generating the cosine similarity matrix
similarity = cosine_similarity(all_matrix, all_matrix)
similarity_scale = cosine_similarity(all_matrix_scale, all_matrix_scale)
similarity_svd = cosine_similarity(all_matrix_svd,all_matrix_svd)

indices = pd.Series(final_data['title'])
indices[:5]

0    Norm of the North: King Sized Adventure
1                 Jandino: Whatever it Takes
2                         Transformers Prime
3           Transformers: Robots in Disguise
4                               #realityhigh
Name: title, dtype: object
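
The cosine similarity between two rows is their dot product divided by the product of their norms. The sketch below compares `cosine_similarity` with a manual computation on a hypothetical 3x3 feature matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature vectors for three titles
X = np.array([[1.0, 1.0, 0.0],
              [2.0, 2.0, 0.0],   # same direction as row 0
              [0.0, 0.0, 3.0]])  # orthogonal to rows 0 and 1

sim = cosine_similarity(X)

# Manual computation: dot products divided by the product of the row norms
norms = np.linalg.norm(X, axis=1, keepdims=True)
manual = (X @ X.T) / (norms @ norms.T)

print(np.round(sim, 3))
```

Rows pointing in the same direction get a similarity of 1 regardless of their length, and rows without shared features get 0.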

In [17]:
def recommendations(title, cosine_sim=similarity_scale):
    # Get the index of the movie that matches the title
    idx = indices[indices == title].index[0]

    # Create a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)

    # Get the indexes of the 10 most similar movies
    # (position 0 is skipped on the assumption that it is the movie itself)
    top_10_indexes = list(score_series.iloc[1:11].index)

    # Return the titles of the 10 best-matching movies
    return [raw_data['title'].iloc[i] for i in top_10_indexes]

recommendations('Avengers: Infinity War')

['Thor: Ragnarok',
 'Black Panther',
 'Godzilla',
 'Scorpion King 5: Book of Souls',
 'Limitless',
 'Solo: A Star Wars Story',
 'Solo: A Star Wars Story (Spanish Version)',
 'War Horse',
 '9',
 'Her']
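
The lookup skips the first position of the sorted scores on the assumption that it is the queried title itself. With ties, for example when two entries have identical features, that assumption may not hold. The following sketch is a hypothetical variant, not the notebook's function: it removes the queried row explicitly and raises a clear error for unknown titles. The names `recommend`, `titles`, and `features` are made up for this toy example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and feature vectors
titles = pd.Series(["A", "A (Director's Cut)", "B", "C"])
features = np.array([[1.0, 0.0],
                     [1.0, 0.0],   # identical to "A", similarity 1.0
                     [0.8, 0.6],
                     [0.0, 1.0]])
sim = cosine_similarity(features)

def recommend(title, titles, sim, n=10):
    """Return up to n titles most similar to `title`, never the title itself."""
    matches = titles.index[titles == title]
    if len(matches) == 0:
        raise KeyError(f"Title not found: {title!r}")
    idx = matches[0]
    # Drop the queried row by label instead of relying on its sort position
    scores = pd.Series(sim[idx], index=titles.index).drop(idx)
    top = scores.sort_values(ascending=False, kind="stable").head(n).index
    return titles.loc[top].tolist()

result = recommend("A", titles, sim, n=2)
print(result)  # ["A (Director's Cut)", 'B']
```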