# Data preprocessing 

## Data Sources
https://grouplens.org/datasets/movielens - ml-32m.zip

https://datasets.imdbws.com/ - title.ratings.tsv.gz / title.basics.tsv.gz



## Imports

In [40]:
import os

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, avg, count, mean, stddev, min, max,  # note: min/max shadow the Python builtins
    regexp_extract, split, size, format_string, lower, when,
    from_unixtime, date_format, year, month, dayofmonth, dayofweek, hour,
    explode, array_distinct, array_union, flatten,
)
from pyspark.sql.types import IntegerType

## Spark session initialization

In [41]:
spark = SparkSession.builder \
    .appName("MovieLens ETL") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000/") \
    .getOrCreate()

## MovieLens

### Import Data

In [42]:
movies = spark.read.option("header", True).csv("/data/ml-32m/movies.csv")
ratings = spark.read.option("header", True).csv("/data/ml-32m/ratings.csv")
links = spark.read.option("header", True).csv("/data/ml-32m/links.csv")
tags = spark.read.option("header", True).csv("/data/ml-32m/tags.csv")

### Data overview

In [43]:
def overview_data(df, rows=10):
    pandas_df = pd.DataFrame(df.head(rows), columns=df.columns)
    return pandas_df

#### Movies

In [44]:
overview_data(movies)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


3 columns:
- `movieId` - id of a movie
- `title` - movie title, followed by the release year in parentheses
- `genres` - list of genres, separated by the `|` character

#### Ratings

In [45]:
overview_data(ratings)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,17,4.0,944249077
1,1,25,1.0,944250228
2,1,29,2.0,943230976
3,1,30,5.0,944249077
4,1,32,5.0,943228858
5,1,34,2.0,943228491
6,1,36,1.0,944249008
7,1,80,5.0,944248943
8,1,110,3.0,943231119
9,1,111,5.0,944249008


4 columns:
- `userId` - id of the user who posted the rating
- `movieId` - id of the rated movie
- `rating` - rating given by the user
- `timestamp` - time when the rating was posted, in Unix seconds

#### Links

In [46]:
overview_data(links)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862
1,2,113497,8844
2,3,113228,15602
3,4,114885,31357
4,5,113041,11862
5,6,113277,949
6,7,114319,11860
7,8,112302,45325
8,9,114576,9091
9,10,113189,710


Dataframe mapping each `movieId` to the ids used by other data sources (IMDb, TMDB)

#### Tags

In [47]:
overview_data(tags)

Unnamed: 0,userId,movieId,tag,timestamp
0,22,26479,Kevin Kline,1583038886
1,22,79592,misogyny,1581476297
2,22,247150,acrophobia,1622483469
3,34,2174,music,1249808064
4,34,2174,weird,1249808102
5,34,8623,Steve Martin,1249808497
6,55,5766,the killls and the score,1319322078
7,58,7451,bullying,1672551536
8,58,7451,clique,1672551510
9,58,7451,coming of age,1672551502


Dataframe with tags assigned to movies by users:

4 columns:
1. `userId` - id of the user who added the tag
2. `movieId` - id of the tagged movie
3. `tag` - free-text tag
4. `timestamp` - time when the tag was added, in Unix seconds

### Data Preprocessing

#### Movies

- Extract year from movie title
- Clean the title (strip the trailing year)
- Split genres into an array

In [48]:
movies = movies.withColumn("year", regexp_extract(col("title"), r"\((\d{4})\)", 1))
movies = movies.withColumn("clean_title", regexp_extract(col("title"), r"^(.*)\s+\(\d{4}\)$", 1))
movies = movies.withColumn("genres", split("genres", "\\|"))


In [49]:
overview_data(movies)

Unnamed: 0,movieId,title,genres,year,clean_title
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995,Toy Story
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995,Jumanji
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",1995,Grumpier Old Men
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995,Waiting to Exhale
4,5,Father of the Bride Part II (1995),[Comedy],1995,Father of the Bride Part II
5,6,Heat (1995),"[Action, Crime, Thriller]",1995,Heat
6,7,Sabrina (1995),"[Comedy, Romance]",1995,Sabrina
7,8,Tom and Huck (1995),"[Adventure, Children]",1995,Tom and Huck
8,9,Sudden Death (1995),[Action],1995,Sudden Death
9,10,GoldenEye (1995),"[Action, Adventure, Thriller]",1995,GoldenEye
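
The two patterns used above can be tried outside Spark. The following is a minimal sketch in plain Python, assuming that `regexp_extract` returns an empty string when the pattern does not match; the helper names and the last two sample titles are hypothetical.

```python
import re

# Same patterns as in the Spark cell above
YEAR_RE = re.compile(r"\((\d{4})\)")
CLEAN_TITLE_RE = re.compile(r"^(.*)\s+\(\d{4}\)$")

def extract(pattern, text):
    """Mimic regexp_extract(..., 1): group 1 of the first match, or "" if none."""
    match = pattern.search(text)
    return match.group(1) if match else ""

def parse_title(title):
    """Return (year, clean_title) for a MovieLens-style title."""
    return extract(YEAR_RE, title), extract(CLEAN_TITLE_RE, title)

print(parse_title("Toy Story (1995)"))              # ('1995', 'Toy Story')
print(parse_title("2001: A Space Odyssey (1968)"))  # ('1968', '2001: A Space Odyssey')
print(parse_title("Untitled Project"))              # ('', '')
```

A title without a year in parentheses yields empty strings for both columns, so `year` becomes null once it is cast to a number later on.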


#### Ratings

- Parse timestamps
- Aggregate ratings per movie (average rating and number of ratings)

In [50]:
ratings = ratings.withColumn("timestamp", col("timestamp").cast("long"))
ratings = ratings.withColumn("rating_datetime", from_unixtime(col("timestamp")))

In [51]:
overview_data(ratings)

Unnamed: 0,userId,movieId,rating,timestamp,rating_datetime
0,1,17,4.0,944249077,1999-12-03 19:24:37
1,1,25,1.0,944250228,1999-12-03 19:43:48
2,1,29,2.0,943230976,1999-11-22 00:36:16
3,1,30,5.0,944249077,1999-12-03 19:24:37
4,1,32,5.0,943228858,1999-11-22 00:00:58
5,1,34,2.0,943228491,1999-11-21 23:54:51
6,1,36,1.0,944249008,1999-12-03 19:23:28
7,1,80,5.0,944248943,1999-12-03 19:22:23
8,1,110,3.0,943231119,1999-11-22 00:38:39
9,1,111,5.0,944249008,1999-12-03 19:23:28
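
As a sanity check, the conversion can be reproduced with the standard library. This sketch assumes the Spark session time zone was UTC, which is consistent with the values shown above; `from_unixtime` formats in the session time zone, so other settings would shift the result. The helper name is hypothetical.

```python
from datetime import datetime, timezone

def from_unixtime_utc(seconds):
    """Format Unix seconds as 'yyyy-MM-dd HH:mm:ss' in UTC."""
    moment = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")

print(from_unixtime_utc("944249077"))  # 1999-12-03 19:24:37
print(from_unixtime_utc("943228858"))  # 1999-11-22 00:00:58
```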


In [52]:
rating_agg = ratings.groupBy("movieId").agg(
    avg("rating").alias("avg_rating"),
    count("rating").alias("num_ratings")
)

In [53]:
overview_data(rating_agg)

Unnamed: 0,movieId,avg_rating,num_ratings
0,80,3.739496,595
1,110,3.98863,69482
2,260,4.099824,85010
3,302,3.753191,1645
4,909,4.028898,6921
5,1080,3.987304,27292
6,1090,3.905313,19876
7,1150,3.889724,1664
8,1178,4.178506,5490
9,1196,4.130352,72151


#### Links


- Build the IMDb `tconst` key from the `imdbId` column for the join

In [54]:
links = links.withColumn("tconst", format_string("tt%07d", col("imdbId").cast("int")))
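
The `tt%07d` format pads the numeric id to at least seven digits and leaves longer ids untouched. A small plain-Python sketch of the same formatting (the function name is hypothetical):

```python
def to_tconst(imdb_id):
    """Build an IMDb title id: 'tt' followed by the id, zero-padded to 7 digits."""
    # int() mirrors the cast in the Spark cell, so "0114709" and "114709" give the same key
    return "tt%07d" % int(imdb_id)

print(to_tconst("114709"))    # tt0114709
print(to_tconst("19755170"))  # tt19755170
```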

#### Tags

In [55]:
from pyspark.sql import functions as F

# Lowercase tags so that case variants collapse, then collect the distinct tags per movie
tags = tags.withColumn("tag", lower(col("tag")))
tags = tags.groupBy("movieId") \
    .agg(F.collect_set("tag").alias("tags"))

In [56]:
overview_data(tags)

Unnamed: 0,movieId,tags
0,1,"[friendship, match, girl, toys come to life, h..."
1,100042,"[texas, warpath]"
2,100103,[writer]
3,100155,"[iran, middle east]"
4,100163,"[better than expected, badass female character..."
5,100169,"[interesting jobs, abuse, pornography, subcult..."
6,1003,"[latex gloves, based on novel, secret laborato..."
7,100336,"[coming of age, drama, growing up, abusive fat..."
8,100344,"[based on true events, comedy, campaign, funny..."
9,100370,[woman director]
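
The aggregation above amounts to building one set of lowercased tags per movie. A minimal plain-Python sketch with made-up sample rows (the rows and the function name are hypothetical):

```python
from collections import defaultdict

def collect_tags(rows):
    """Group (movie_id, tag) pairs into a set of lowercased tags per movie."""
    grouped = defaultdict(set)
    for movie_id, tag in rows:
        grouped[movie_id].add(tag.lower())
    return dict(grouped)

sample_rows = [
    ("1", "Pixar"),
    ("1", "pixar"),       # collapses with "Pixar" after lowercasing
    ("1", "friendship"),
    ("2", "Board Game"),
]
print(collect_tags(sample_rows))
```

As with `collect_set`, the order of tags inside each set is not guaranteed.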


### Merge data

In [57]:
movies = movies.join(rating_agg, on="movieId", how="left")
movies = movies.join(links, on="movieId", how="left")
movies = movies.join(tags, on="movieId", how="left")

In [58]:
overview_data(movies)

Unnamed: 0,movieId,title,genres,year,clean_title,avg_rating,num_ratings,imdbId,tmdbId,tconst,tags
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995,Toy Story,3.897438,68997,114709,862,tt0114709,"[friendship, match, girl, toys come to life, h..."
1,289049,Why We Laugh: Funny Women (2013),"[Comedy, Documentary]",2013,Why We Laugh: Funny Women,4.0,1,2741452,184159,tt2741452,
2,289029,Bawaal (2023),"[Action, Comedy, Drama, Romance]",2023,Bawaal,1.833333,3,19755170,1142127,tt19755170,
3,5,Father of the Bride Part II (1995),[Comedy],1995,Father of the Bride Part II,3.059602,13154,113041,11862,tt0113041,"[confidence, father, parent child relationship..."
4,289047,The Rebellious Life of Mrs. Rosa Parks (2022),[Documentary],2022,The Rebellious Life of Mrs. Rosa Parks,4.0,1,15976536,965403,tt15976536,
5,289053,Waterlife (2009),[Documentary],2009,Waterlife,3.0,1,1436049,69720,tt1436049,
6,3,Grumpier Old Men (1995),"[Comedy, Romance]",1995,Grumpier Old Men,3.139447,13134,113228,15602,tt0113228,"[old, comedinha de velhinhos engraãƒâ§ada, old..."
7,8,Tom and Huck (1995),"[Adventure, Children]",1995,Tom and Huck,3.115563,1510,112302,45325,tt0112302,"[bridge, friendship, girl, swearing an oath, v..."
8,10,GoldenEye (1995),"[Action, Adventure, Thriller]",1995,GoldenEye,3.42785,32474,113189,710,tt0113189,"[007, sean bean, bill tanner character, gun ba..."
9,289035,Have You Got It Yet? The Story of Syd Barrett ...,[Documentary],2023,Have You Got It Yet? The Story of Syd Barrett ...,4.0,1,7182482,1118723,tt7182482,


## IMDB Data

In [59]:
imdb_basics = spark.read.option("header", True).option("sep", "\t").csv("/data/title.basics.tsv")
imdb_ratings = spark.read.option("header", True).option("sep", "\t").csv("/data/title.ratings.tsv")

In [60]:
overview_data(imdb_basics)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short"
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
9,tt0000010,short,Leaving the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short"


In [61]:
imdb_basics = imdb_basics.withColumnRenamed("genres", "IMDB_genres")
imdb_basics = imdb_basics.withColumn("IMDB_genres", split("IMDB_genres", "\\,"))

In [62]:
overview_data(imdb_basics)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,IMDB_genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"[Documentary, Short]"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"[Animation, Short]"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0,1892,\N,5,"[Animation, Comedy, Romance]"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"[Animation, Short]"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,[Short]
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,[Short]
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"[Short, Sport]"
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"[Documentary, Short]"
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,[Romance]
9,tt0000010,short,Leaving the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"[Documentary, Short]"


In [63]:
overview_data(imdb_ratings)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2159
1,tt0000002,5.5,294
2,tt0000003,6.5,2204
3,tt0000004,5.3,188
4,tt0000005,6.2,2944
5,tt0000006,5.0,212
6,tt0000007,5.3,911
7,tt0000008,5.4,2301
8,tt0000009,5.4,224
9,tt0000010,6.8,7968


### Filtering

In [64]:
imdb_basics = imdb_basics.filter((col("titleType") == "movie"))
imdb_joined = imdb_basics.join(imdb_ratings, on="tconst", how="left")

### Merging

In [65]:
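# Note: `links` (including `tconst`) was already joined in the MovieLens merge step,
# so this join adds a second `tconst` column to the result.
movies = movies.join(links.select("movieId", "tconst"), on="movieId", how="left")
# Attach the IMDb attributes and ratings by title id
movies = movies.join(imdb_joined, on="tconst", how="left")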

In [66]:
movies = movies.withColumn(
    "combined_genres",
    array_distinct(array_union(col("genres"), col("IMDB_genres")))
)
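
The behaviour of the genre merge can be sketched in plain Python. This assumes, as far as I know, that `array_union` already removes duplicates while keeping first-seen order, and that it returns null when either input is null; the function below is an illustration, not Spark's implementation.

```python
def array_union(left, right):
    """Order-preserving union without duplicates; None if either side is None."""
    if left is None or right is None:
        return None
    merged = []
    for item in left + right:
        if item not in merged:
            merged.append(item)
    return merged

# MovieLens genres first, then any IMDb genres not seen yet
print(array_union(["Comedy", "Romance"], ["Comedy", "Drama", "Romance"]))
# ['Comedy', 'Romance', 'Drama']

# No IMDb match: the union is None even though MovieLens genres exist
print(array_union(["Animation"], None))  # None
```

Under that assumption, movies without an IMDb match end up with a null `combined_genres`, even when MovieLens lists genres for them.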

In [67]:
overview_data(movies)

Unnamed: 0,tconst,movieId,title,genres,year,clean_title,avg_rating,num_ratings,imdbId,tmdbId,...,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,IMDB_genres,averageRating,numVotes,combined_genres
0,tt27458968,289043,Blondi (2023),"[Comedy, Drama]",2023,Blondi,3.5,1,27458968,982502,...,Blondi,Blondi,0,2023,\N,87,"[Comedy, Drama]",6.7,1198,"[Comedy, Drama]"
1,tt7182482,289035,Have You Got It Yet? The Story of Syd Barrett ...,[Documentary],2023,Have You Got It Yet? The Story of Syd Barrett ...,4.0,1,7182482,1118723,...,Have You Got It Yet? The Story of Syd Barrett ...,Have You Got It Yet? The Story of Syd Barrett ...,0,2023,\N,94,"[Biography, Documentary, Music]",7.3,586,"[Documentary, Biography, Music]"
2,tt0114319,7,Sabrina (1995),"[Comedy, Romance]",1995,Sabrina,3.363968,13585,114319,11860,...,Sabrina,Sabrina,0,1995,\N,127,"[Comedy, Drama, Romance]",6.3,45941,"[Comedy, Romance, Drama]"
3,tt0114885,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995,Waiting to Exhale,2.845331,2806,114885,31357,...,Waiting to Exhale,Waiting to Exhale,0,1995,\N,124,"[Comedy, Drama, Romance]",6.0,12888,"[Comedy, Drama, Romance]"
4,tt0114576,9,Sudden Death (1995),[Action],1995,Sudden Death,2.987723,4154,114576,9091,...,Sudden Death,Sudden Death,0,1995,\N,111,"[Action, Crime, Thriller]",5.9,38414,"[Action, Crime, Thriller]"
5,tt1436049,289053,Waterlife (2009),[Documentary],2009,Waterlife,3.0,1,1436049,69720,...,Waterlife,Waterlife,0,2009,\N,109,[Documentary],7.4,96,[Documentary]
6,tt0113277,6,Heat (1995),"[Action, Crime, Thriller]",1995,Heat,3.868277,29490,113277,949,...,Heat,Heat,0,1995,\N,170,"[Action, Crime, Drama]",8.3,758869,"[Action, Crime, Thriller, Drama]"
7,tt19755170,289029,Bawaal (2023),"[Action, Comedy, Drama, Romance]",2023,Bawaal,1.833333,3,19755170,1142127,...,Bawaal,Bawaal,0,2023,\N,137,"[Action, Comedy, Drama]",6.6,17090,"[Action, Comedy, Drama, Romance]"
8,tt4076258,289051,Paper Tigers (2015),[Documentary],2015,Paper Tigers,4.0,1,4076258,355036,...,Paper Tigers,Paper Tigers,0,2015,\N,102,"[Documentary, Family]",7.4,152,"[Documentary, Family]"
9,tt7613194,289057,Rock Sugar (2021),[Thriller],2021,Rock Sugar,3.0,1,7613194,539044,...,Bullied,Rock Sugar,0,2021,\N,70,"[Drama, Thriller]",5.1,199,"[Thriller, Drama]"


In [68]:
print(movies.columns)

['tconst', 'movieId', 'title', 'genres', 'year', 'clean_title', 'avg_rating', 'num_ratings', 'imdbId', 'tmdbId', 'tags', 'tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'IMDB_genres', 'averageRating', 'numVotes', 'combined_genres']


## Filling missing values

In [69]:
from pyspark.sql.functions import array, col, when
from pyspark.ml.feature import CountVectorizer, VectorAssembler, MinMaxScaler

# Cast columns to the appropriate types
movies = movies.withColumn("avg_rating", col("avg_rating").cast("double"))
movies = movies.withColumn("num_ratings", col("num_ratings").cast("double"))
movies = movies.withColumn("averageRating", col("averageRating").cast("double"))
movies = movies.withColumn("numVotes", col("numVotes").cast("double"))
movies = movies.withColumn("runtimeMinutes", col("runtimeMinutes").cast("double"))
movies = movies.withColumn("year", col("year").cast("double"))

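# Replace nulls (missing ratings or IMDb matches, and values such as "\N" that
# fail the cast) with defaults; 2000 is a placeholder year
movies = movies.fillna({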
    "avg_rating": 0.0,
    "num_ratings": 0,
    "averageRating": 0.0,
    "numVotes": 0,
    "runtimeMinutes": 0,
    "year": 2000
})

movies = movies.withColumn(
    "combined_genres",
    when(col("combined_genres").isNull(), array().cast("array<string>")).otherwise(col("combined_genres"))
).withColumn(
    "tags",
    when(col("tags").isNull(), array().cast("array<string>")).otherwise(col("tags"))
)

## CountVectorizer on tags

In [70]:
movies.select(explode("tags")).distinct().count()

132172

In [71]:
count_vectorizer = CountVectorizer(inputCol="tags", outputCol="tags_vec", vocabSize=8000, minDF=10)
model = count_vectorizer.fit(movies)
movies = model.transform(movies)
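
With 132172 distinct tags and `vocabSize=8000`, most tags are dropped from the vocabulary. The selection can be sketched in plain Python as follows. This is an approximation of the idea, not Spark's implementation: it keeps terms that occur in at least `min_df` documents and then takes the `vocab_size` most frequent ones; the alphabetical tie-break and the sample data are assumptions.

```python
from collections import Counter

def build_vocabulary(docs, vocab_size, min_df):
    """Pick the most frequent terms that appear in at least min_df documents."""
    term_freq, doc_freq = Counter(), Counter()
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    eligible = [term for term in term_freq if doc_freq[term] >= min_df]
    eligible.sort(key=lambda term: (-term_freq[term], term))
    return eligible[:vocab_size]

def vectorize(doc, vocabulary):
    """Sparse count vector as {index: count}; terms outside the vocabulary are ignored."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(term for term in doc if term in index)
    return {index[term]: float(n) for term, n in sorted(counts.items())}

docs = [["pixar", "toys"], ["pixar"], ["toys", "rare"], ["pixar"]]
vocabulary = build_vocabulary(docs, vocab_size=2, min_df=2)
print(vocabulary)                                 # ['pixar', 'toys']
print(vectorize(["toys", "rare"], vocabulary))    # {1: 1.0}
print(vectorize(["rare"], vocabulary))            # {}
```

A movie whose tags all fall outside the vocabulary gets an all-zero vector, like the `(8000,[],[])` rows in the output further down.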

## CountVectorizer on genres

In [72]:
cv_genres = CountVectorizer(inputCol="combined_genres", outputCol="genres_vec")
cv_model_genres = cv_genres.fit(movies)
movies = cv_model_genres.transform(movies)

## MinMaxScaler for numeric features

In [73]:
num_cols = ["avg_rating", "num_ratings", "averageRating", "numVotes", "runtimeMinutes", "year"]
vec_assembler_num = VectorAssembler(inputCols=num_cols, outputCol="numeric_features")
movies = vec_assembler_num.transform(movies)

scaler = MinMaxScaler(inputCol="numeric_features", outputCol="scaled_numeric")
scaler_model = scaler.fit(movies)
movies = scaler_model.transform(movies)
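
Min-max scaling maps each column to the range [0, 1] using that column's own minimum and maximum. A plain-Python sketch for a single column, using the default output range; the constant-column rule follows the behaviour described in the Spark documentation as I understand it, and the sample values are made up.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale values linearly so that min -> lo and max -> hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant column: every value maps to the middle of the output range
        return [0.5 * (hi + lo)] * len(values)
    scale = (hi - lo) / (v_max - v_min)
    return [(v - v_min) * scale + lo for v in values]

print(min_max_scale([0.0, 2.5, 5.0]))           # [0.0, 0.5, 1.0]
print(min_max_scale([1894.0, 2000.0, 2023.0]))  # oldest year -> 0.0, newest -> 1.0
```

Because missing values were filled with 0 before scaling, those zeros can become the column minimum and stretch the range for every other row.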

In [74]:
from pyspark.sql.functions import udf

def get_vec_size(v):
    # Total dimensionality of the vector, not the number of non-zero entries
    return v.size

vec_size_udf = udf(get_vec_size, IntegerType())

movies.select("tags_vec").withColumn("tags_vec_size", vec_size_udf("tags_vec")).show()
movies.select("genres_vec").withColumn("genres_vec_size", vec_size_udf("genres_vec")).show()

+--------------------+-------------+
|            tags_vec|tags_vec_size|
+--------------------+-------------+
|(8000,[6,25,42,57...|         8000|
|(8000,[22,30,50,6...|         8000|
|(8000,[7,20,28,31...|         8000|
|        (8000,[],[])|         8000|
|        (8000,[],[])|         8000|
|(8000,[1,2,3,6,13...|         8000|
|(8000,[6,7,11,12,...|         8000|
|(8000,[9,20,21,27...|         8000|
|        (8000,[],[])|         8000|
|        (8000,[],[])|         8000|
|(8000,[1112,1714,...|         8000|
|(8000,[1372,5783]...|         8000|
|        (8000,[],[])|         8000|
|(8000,[4,20,29,43...|         8000|
|(8000,[1,3,4,5,17...|         8000|
|        (8000,[],[])|         8000|
|(8000,[1,4,5,9,10...|         8000|
|(8000,[1,4,15,19,...|         8000|
|(8000,[1,4,8,15,2...|         8000|
|        (8000,[],[])|         8000|
+--------------------+-------------+
only showing top 20 rows

+--------------------+---------------+
|          genres_vec|genres_vec_size|
+-------

In [79]:
bad_movies_nulls = movies.filter(
    col("tags_vec").isNull() | col("genres_vec").isNull()
)

bad_movies_nulls.select("movieId", "clean_title").show(truncate=False)

+-------+-----------+
|movieId|clean_title|
+-------+-----------+
+-------+-----------+



## Combining all features into a single vector

In [80]:
final_assembler = VectorAssembler(
    inputCols=["scaled_numeric", "tags_vec", "genres_vec"],
    outputCol="final_features"
)
movies = final_assembler.transform(movies)

In [81]:
# Expected length: 6 scaled numeric features + 8000 tag dimensions + 30 genre dimensions = 8036
bad_movies = movies.withColumn("final_features_size", vec_size_udf(col("final_features"))) \
    .filter((col("final_features_size") != 8036) | col("final_features").isNull())

bad_movies.select("movieId", "clean_title", "final_features_size").show(truncate=False)

+-------+-----------+-------------------+
|movieId|clean_title|final_features_size|
+-------+-----------+-------------------+
+-------+-----------+-------------------+



In [144]:
overview_data(movies)

Unnamed: 0,tconst,movieId,title,genres,year,clean_title,avg_rating,num_ratings,imdbId,tmdbId,...,runtimeMinutes,IMDB_genres,averageRating,numVotes,combined_genres,tags_vec,genres_vec,numeric_features,scaled_numeric,final_features
0,tt0000015,238360,Autour d’une cabine ou Mésaventures d’un copur...,[Animation],1894.0,Autour d’une cabine ou Mésaventures d’un copur...,3.0,6.0,15,159896,...,0.0,,0.0,0.0,[],"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[3.0, 6.0, 0.0, 0.0, 0.0, 1894.0]","[0.6000000000000001, 5.8292609468662865e-05, 0...","(0.6000000000000001, 5.8292609468662865e-05, 0..."
1,tt0000016,199716,Boat Leaving the Port (1895),"[Documentary, Drama]",1895.0,Boat Leaving the Port,2.615385,13.0,16,129436,...,0.0,,0.0,0.0,[],"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[2.6153846153846154, 13.0, 0.0, 0.0, 0.0, 1895.0]","[0.5230769230769231, 0.00012630065384876954, 0...","(0.5230769230769231, 0.00012630065384876954, 0..."
2,tt0000026,225523,Partie d'écarté (1896),[Documentary],1896.0,Partie d'écarté,2.5,3.0,26,163064,...,0.0,,0.0,0.0,[],"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[2.5, 3.0, 0.0, 0.0, 0.0, 1896.0]","[0.5, 2.9146304734331433e-05, 0.0, 0.0, 0.0, 0...","(0.5, 2.9146304734331433e-05, 0.0, 0.0, 0.0, 0..."
3,tt0000029,167498,Baby's Dinner (1895),[(no genres listed)],1895.0,Baby's Dinner,2.8,35.0,29,122134,...,0.0,,0.0,0.0,[],"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[2.8, 35.0, 0.0, 0.0, 0.0, 1895.0]","[0.5599999999999999, 0.0003400402219005334, 0....","(0.5599999999999999, 0.0003400402219005334, 0...."
4,tt0000301,210793,Faust and Marguerite (1900),"[Fantasy, Horror]",1900.0,Faust and Marguerite,2.0,6.0,301,195606,...,0.0,,0.0,0.0,[],"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[2.0, 6.0, 0.0, 0.0, 0.0, 1900.0]","[0.4, 5.8292609468662865e-05, 0.0, 0.0, 0.0, 0...","(0.4, 5.8292609468662865e-05, 0.0, 0.0, 0.0, 0..."
5,tt0000358,174249,History of a Crime (1901),"[Crime, Drama]",1901.0,History of a Crime,3.3,10.0,358,171430,...,0.0,,0.0,0.0,[],"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[3.3, 10.0, 0.0, 0.0, 0.0, 1901.0]","[0.66, 9.715434911443811e-05, 0.0, 0.0, 0.0, 0...","(0.66, 9.715434911443811e-05, 0.0, 0.0, 0.0, 0..."
6,tt0000410,182537,Demolishing and Building Up the Star Theatre (...,[Documentary],1901.0,Demolishing and Building Up the Star Theatre,2.7,15.0,410,129863,...,0.0,,0.0,0.0,[],"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[2.7, 15.0, 0.0, 0.0, 0.0, 1901.0]","[0.54, 0.00014573152367165716, 0.0, 0.0, 0.0, ...","(0.54, 0.00014573152367165716, 0.0, 0.0, 0.0, ..."
7,tt0000455,140551,The Melomaniac (1903),[Comedy],1903.0,The Melomaniac,2.957143,35.0,455,49273,...,0.0,,0.0,0.0,[],"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[2.9571428571428573, 35.0, 0.0, 0.0, 0.0, 1903.0]","[0.5914285714285715, 0.0003400402219005334, 0....","(0.5914285714285715, 0.0003400402219005334, 0...."
8,tt0000465,174171,The Kingdom of Fairies (1903),"[Adventure, Fantasy]",1903.0,The Kingdom of Fairies,3.675,20.0,465,32673,...,0.0,,0.0,0.0,[],"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[3.675, 20.0, 0.0, 0.0, 0.0, 1903.0]","[0.735, 0.00019430869822887622, 0.0, 0.0, 0.0,...","(0.735, 0.00019430869822887622, 0.0, 0.0, 0.0,..."
9,tt0000854,200968,Edgar Allan Poe (1909),[Drama],1909.0,Edgar Allan Poe,3.0,6.0,854,194066,...,0.0,,0.0,0.0,[],"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[3.0, 6.0, 0.0, 0.0, 0.0, 1909.0]","[0.6000000000000001, 5.8292609468662865e-05, 0...","(0.6000000000000001, 5.8292609468662865e-05, 0..."


## Save preprocessed data

In [82]:
movies.select("movieId", "clean_title", "final_features", "combined_genres", "tags", "year", "avg_rating", "num_ratings") \
  .write.mode("overwrite") \
  .parquet("/output/movie_vectors.parquet")

In [83]:
movies.select(
    vec_size_udf("scaled_numeric").alias("scaled_numeric_size"),
    vec_size_udf("genres_vec").alias("genres_vec_size"),
    vec_size_udf("tags_vec").alias("tags_vec_size"),
    vec_size_udf("final_features").alias("final_features_size")
).groupBy("scaled_numeric_size", "genres_vec_size", "tags_vec_size", "final_features_size").count().show(truncate=False)

+-------------------+---------------+-------------+-------------------+-----+
|scaled_numeric_size|genres_vec_size|tags_vec_size|final_features_size|count|
+-------------------+---------------+-------------+-------------------+-----+
|6                  |30             |8000         |8036               |87585|
+-------------------+---------------+-------------+-------------------+-----+
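
Since `VectorAssembler` concatenates its inputs in the order given in `inputCols`, the sizes in the table above also determine where each block sits inside `final_features`. A small sketch that derives the total length and the index range of each block; the helper name is hypothetical and the sizes are copied from the table.

```python
from itertools import accumulate

# (column, size) in the order passed to the final VectorAssembler
BLOCK_SIZES = [("scaled_numeric", 6), ("tags_vec", 8000), ("genres_vec", 30)]

def block_offsets(blocks):
    """Return {name: (start, end)} half-open index ranges in the concatenated vector."""
    ends = list(accumulate(size for _, size in blocks))
    starts = [0] + ends[:-1]
    return {name: (start, end) for (name, _), start, end in zip(blocks, starts, ends)}

offsets = block_offsets(BLOCK_SIZES)
print(offsets)
# {'scaled_numeric': (0, 6), 'tags_vec': (6, 8006), 'genres_vec': (8006, 8036)}
print(offsets["genres_vec"][1])  # 8036, the total length
```

Deriving the expected length this way avoids hardcoding 8036, which would change whenever the tag or genre vocabulary changes.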

