# Steps to conduct

1. Setup (Databricks/AWS/GitHub) <-- We are currently here!
2. EDA (within Databricks per PySpark/SQL)
3. Feature Selection (Databricks)
4. Write to S3
5. ML case on AWS Sagemaker (model, scores, etc.)
6. Deployment on AWS Sagemaker
7. Create presentation
8. (20:80 or optional task)

# Import Data

In [0]:
# connect to s3 bucket
# get credentials
import os

ACCESS_KEY = os.getenv("AWS_ACCESS_KEY")
SECRET_KEY = os.getenv("AWS_SECRET_KEY")
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "aida-project"
MOUNT_NAME = "data"

# mount data
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
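The mount URI above only escapes "/" in the secret key. A minimal sketch of a more general encoding, assuming the key may also contain other reserved characters such as "+"; the helper name `encode_secret` and the example key are hypothetical:

```python
from urllib.parse import quote


def encode_secret(secret_key: str) -> str:
    """Percent-encode every reserved character so the key is safe inside a URI."""
    # safe="" means that "/" is encoded as well (quote keeps it by default)
    return quote(secret_key, safe="")


# Hypothetical key, for illustration only
example_key = "ab/cd+ef"
print(encode_secret(example_key))  # ab%2Fcd%2Bef
```

For keys that contain only "/" as a special character, this gives the same result as `SECRET_KEY.replace("/", "%2F")`.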

In [0]:
display(dbutils.fs.ls("/mnt/%s/TSV" % MOUNT_NAME))

In [0]:
df_names = spark.read.load("dbfs:/mnt/data/TSV/name.basics.tsv",
                           format="csv", sep="\t", inferSchema="true", header="true")
df_akas = spark.read.load("dbfs:/mnt/data/TSV/title.akas.tsv",
                           format="csv", sep="\t", inferSchema="true", header="true")
df_basics = spark.read.load("dbfs:/mnt/data/TSV/title.basics.tsv",
                           format="csv", sep="\t", inferSchema="true", header="true")
df_principals = spark.read.load("dbfs:/mnt/data/TSV/title.principals.tsv",
                           format="csv", sep="\t", inferSchema="true", header="true")
df_ratings = spark.read.load("dbfs:/mnt/data/TSV/title.ratings.tsv",
                           format="csv", sep="\t", inferSchema="true", header="true")

In [0]:
list_dfs = [df_names, df_akas, df_basics, df_principals, df_ratings]

for df in list_dfs:
  df.printSchema()

# EDA

## Possible list of questions (non-exhaustive):
1. What is the range of our ratings (including distribution)? (DONE)
2. How many votes does a movie have on average? (Box-Plot)
3. What timeframe does our dataset span? (e.g. oldest and newest movie)
4. Who are the most popular actors and directors?
5. What genres are represented the most?
6. Distribution of films / series / shows?
    ----> Focus on movies!
7. What genres are in the dataset?
8. Which genres have the highest rating?
9. Which actors play in the high rated films?
10. In which countries were the most high-rated films made and when?
11. Dependence on high rated film:
    - Country of origin ---> Actor?
    - Genre -> Actor
    - Year of creation -> Actor -> Genre
12. Which parameters go into the rating?

.....

## Imports and Functions

In [0]:
from pyspark.sql.functions import mean as _mean, \
                                  min as _min, \
                                  max as _max, \
                                  count as _count, \
                                  stddev as _stddev, col

import matplotlib.pyplot as plt
from pyspark.sql import functions as F

In [0]:
percentiles = [0.1, 0.25, 0.5, 0.75, 0.9]

In [0]:
# https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.histogram

def viz_histogram(dataframe, column, buckets):
  bins, counts = dataframe.select(column).rdd.flatMap(lambda x: x).histogram(buckets)
  plt.hist(bins[:-1], bins=bins, weights=counts)
  plt.title(f'Histogram of {column}')
  
  
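def show_quantiles(dataframe, list_columns, percentiles):
  # accept a single column name as well as a list of column names
  if isinstance(list_columns, str):
    list_columns = [list_columns]

  # approxQuantile returns one list of quantiles per column (relative error 0.1)
  quantiles = dataframe.approxQuantile(list_columns, percentiles, 0.1)
  df_quantile = spark.createDataFrame(
      list(zip(percentiles, *quantiles)),
      ["Percentile"] + list_columns
  )
  df_quantile.show()

  return df_quantile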
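When `RDD.histogram` receives an integer, it splits the range between the minimum and the maximum into that many evenly spaced buckets, with the last bucket closed on the right. A plain-Python sketch of that bucketing logic, independent of Spark; the function name and the sample values are made up for illustration:

```python
def equal_width_histogram(values, num_buckets):
    """Return (edges, counts) for evenly spaced buckets between min and max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # all values identical: a single bucket holds everything
        return [lo, hi], [len(values)]
    width = (hi - lo) / num_buckets
    edges = [lo + i * width for i in range(num_buckets)] + [hi]
    counts = [0] * num_buckets
    for v in values:
        # the maximum value belongs to the last bucket (closed on the right)
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return edges, counts


# Made-up ratings, not taken from the dataset
sample_ratings = [1.0, 2.5, 5.0, 5.5, 7.0, 7.5, 8.0, 10.0]
edges, counts = equal_width_histogram(sample_ratings, 3)
print(edges)   # [1.0, 4.0, 7.0, 10.0]
print(counts)  # [2, 2, 4]
```

`viz_histogram` passes exactly such edges and counts to `plt.hist`, using the left edges as data points and the counts as weights.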

## What is the range of our ratings (including distribution)

In [0]:
stats_ratings = df_ratings.select(
    _mean(col('averageRating')).alias('mean'),
    _min(col('averageRating')).alias('min'),
    _max(col('averageRating')).alias('max'),
    _stddev(col('averageRating')).alias('std')).collect()

min_ratings = stats_ratings[0]['min']
mean_ratings = stats_ratings[0]['mean']
max_ratings = stats_ratings[0]['max']
std_ratings = stats_ratings[0]['std']

stats_ratings

In [0]:
df_ratings.describe().show()

In [0]:
viz_histogram(df_ratings, 'averageRating', 9)

Our ratings range from **1** to **10**. The mean rating is **6.89** (maybe people are a little bit kinder than expected), with a standard deviation of roughly **1.40**. We have roughly **1 million** entries.

## How many votes does a movie have on average? (Box-Plot)

In [0]:
# title.ratings has one row per tconst, so no grouping is needed
df_ratings.select('numVotes').describe().show()

In [0]:
quantiles_votes = show_quantiles(df_ratings, 'numVotes', percentiles)

In [0]:
viz_histogram(df_ratings, 'numVotes', 1000)

# Feature Selection & Data Cleaning

# Write to S3

In [0]:
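# TO DEFINE: SHAPE OF TARGET DATASET

# id, rating, [list of features]

# write a sample of the target dataset (e.g. 20%) to S3
# NOTE: df_final is not defined yet; output path and Parquet format are placeholders
df_final.sample(fraction=0.2, seed=42).write.mode("overwrite").parquet("/mnt/%s/output" % MOUNT_NAME)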

In [0]:
spark.version

In [0]:
#Import of the dataset

df_basics_original = spark.read.load("dbfs:/mnt/data/TSV/title.basics.tsv",
                           format="csv", sep="\t", inferSchema="true", header="true")




In [0]:
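# 50% sample without replacement (seed 42)
# NOTE: df_basis is not used further below; the following cells work on df_basics
df_basis = df_basics_original.sample(False, 0.5, 42)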

In [0]:
# What kind of data are in the columns
df_basics.printSchema()

In [0]:
display(df_basics)

tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short"
tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
tt0000010,short,Exiting the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short"


In [0]:
display(df_basics.describe())

summary,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
count,6326545,6326545,6326545,6326545,6326545.0,6326545,6326545,6326545,6326535
mean,,,,,0.0330847247589324,2001.4857464536149,2001.6832648116879,45.00799321813517,
stddev,,,,,2.536020251863344,21.183997012090863,18.883630674696988,76.33630531590723,
min,tt0000001,movie,!Next?,!Next?,0.0,1874,1924,0,Action
max,tt9916880,videoGame,Šiška Deluxe,üç,2019.0,\N,\N,\N,\N


In [0]:
df_basics.show(vertical=True)

- Do all tconst values start with tt?
- Only 'isAdult' is an integer. What is its range, i.e. which values occur? (1)
- What does titleType contain? (2)
- What is the difference between titleType and genres?
- What does nullable=true mean?
- Column 'genres' contains several genres per title: separate them!

(1) Only 'isAdult' is an integer. What is its range, i.e. which values occur?

In [0]:
display(
  df_basics.select('isAdult').distinct()
)

isAdult
1
0
2019
2004
2014
2018
2015
1994
2005


In [0]:
# Distribution of 0, 1, and the year numbers
groupBy_output = df_basics.groupBy("isAdult").count()
display(groupBy_output)

isAdult,count
1,189186
0,6137349
2019,4
2004,1
2014,1
2018,1
2015,1
1994,1
2005,1


The column isAdult contains the values 0 and 1, but 10 rows hold a year number instead ----> delete all rows with year numbers!
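One plausible cause for year numbers in isAdult (an assumption, not verified against this dataset) is a title that contains an unbalanced double quote: a quote-aware parser then merges two fields and shifts every later column one position to the left. The sketch below reproduces the effect with Python's `csv` module on a hypothetical row. Spark's CSV reader has a `quote` option that can be set to an empty string to turn quoting off, which may be worth trying here.

```python
import csv

HEADER = ["tconst", "titleType", "primaryTitle", "originalTitle", "isAdult",
          "startYear", "endYear", "runtimeMinutes", "genres"]

# Hypothetical row: both titles start with a double quote that is never closed
line = 'tt0000000\tmovie\t"Foo\t"Foo\t0\t2019\t\\N\t90\tDrama'


def parse(raw_line, quoting):
    """Parse one tab-separated line with the given csv quoting mode."""
    return next(csv.reader([raw_line], delimiter="\t", quoting=quoting))


with_quotes = parse(line, csv.QUOTE_MINIMAL)   # default: quotes are interpreted
without_quotes = parse(line, csv.QUOTE_NONE)   # quotes are ordinary characters

is_adult = HEADER.index("isAdult")
print(len(with_quotes), with_quotes[is_adult])        # 8 2019
print(len(without_quotes), without_quotes[is_adult])  # 9 0


def has_valid_flag(row):
    """Keep only rows whose isAdult field is 0 or 1."""
    return len(row) == len(HEADER) and row[is_adult] in {"0", "1"}
```

Whatever the cause turns out to be, a filter like `has_valid_flag` corresponds to the cleaning step noted above: rows with anything other than 0 or 1 in isAdult are dropped.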

In [0]:
# Group by titleType and count the rows

groupBy_output = df_basics.groupBy("titleType").count()
display(groupBy_output)

titleType,count
tvSeries,174601
tvMiniSeries,28413
tvMovie,120424
tvEpisode,4445415
movie,536248
tvSpecial,26137
video,247411
videoGame,24550
tvShort,11569
short,711777


---> most titles have the titleType tvEpisode (about 4.4 of 6.3 million rows)

In [0]:
# Group by titleType, genres and isAdult and count the rows

groupBy_output = df_basics.groupBy("titleType", "genres","isAdult").count()
display(groupBy_output)

titleType,genres,isAdult,count
short,"Biography,Romance,Short",0,17
movie,"Fantasy,Sci-Fi",0,117
movie,"Comedy,Family,Musical",0,89
movie,"Film-Noir,Mystery,Thriller",0,6
movie,"Animation,Documentary,History",0,48
tvSeries,Comedy,0,26951
tvSeries,"Adventure,Mystery",0,13
movie,"Action,Horror,Thriller",0,271
tvSeries,"Animation,Family,Western",0,3
tvSeries,"Adventure,Drama,Sci-Fi",0,24


In [0]:
from pyspark.sql.functions import split


In [0]:
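# Split the genres column into genre 1, 2, 3

split_col = split(df_basics['genres'], ',')
# chain on df_basics_test so that each new column is kept
df_basics_test = df_basics.withColumn('NAME1', split_col.getItem(0))
df_basics_test = df_basics_test.withColumn('NAME2', split_col.getItem(1))
df_basics_test = df_basics_test.withColumn('NAME3', split_col.getItem(2))

# NOTE: this shows the unchanged df_basics; use df_basics_test.display() to see the new columns
df_basics.display()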

tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short"
tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
tt0000010,short,Exiting the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short"


In [0]:
# Alternative: split the genres column into genre 1, 2, 3
genre_parts = split(col("genres"), ",")
df_basics_encode = (df_basics
                    .withColumn("genre1", genre_parts.getItem(0))
                    .withColumn("genre2", genre_parts.getItem(1))
                    .withColumn("genre3", genre_parts.getItem(2)))
df_basics_encode.display()

tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre1,genre2,genre3
tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",Documentary,Short,
tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",Animation,Short,
tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance",Animation,Comedy,Romance
tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short",Animation,Short,
tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short",Comedy,Short,
tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short,Short,,
tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport",Short,Sport,
tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short",Documentary,Short,
tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance,Romance,,
tt0000010,short,Exiting the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short",Documentary,Short,


In [0]:
# select the three genre columns (the name avoids shadowing the built-in input)
genre_cols = df_basics_encode.select('genre1', 'genre2', 'genre3')
#output =['genre1index']
genre_cols


In [0]:
# Index the genre1 column with StringIndexer

from pyspark.ml.feature import StringIndexer

# handleInvalid='keep' puts missing genres into an extra index instead of raising an error
indexer = StringIndexer(inputCol='genre1', outputCol='genre1index', handleInvalid='keep')

indexed = indexer.fit(df_basics_encode).transform(df_basics_encode)
indexed.show()

In [0]:
# One-hot encoding of the indexed genre1 column (genre2 and genre3 still to do)

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCols=["genre1index"],
                        outputCols=["genre1_onehot"])
model = encoder.fit(indexed)
encoded = model.transform(indexed)
encoded.show()




In [0]:
df_basics_encode.printSchema()

Next steps:

root
- |-- tconst: string (nullable = true)
- |-- titleType: string (nullable = true) ---> delete missing values
- |-- primaryTitle: string (nullable = true) ---> delete missing values
- |-- originalTitle: string (nullable = true) ---> delete missing values
- |-- isAdult: integer (nullable = true) ---> delete year numbers
- |-- startYear: string (nullable = true) ---> delete \N
- |-- endYear: string (nullable = true) ---> delete \N
- |-- runtimeMinutes: string (nullable = true) ---> delete \N
- |-- genres: string (nullable = true) ---> delete \N
- |-- genre1: string (nullable = true) ---> delete \N
- |-- genre2: string (nullable = true) ---> delete \N
- |-- genre3: string (nullable = true) ---> delete \N
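A plain-Python sketch of the \N cleaning listed above, independent of Spark. It assumes that \N is replaced by a missing value rather than the whole row being dropped (the notes leave that open); the function name `clean_row` is hypothetical, and the example row uses values from the df_basics output shown earlier:

```python
NULL_MARKER = "\\N"  # the literal two characters backslash + N
NUMERIC_COLUMNS = ("startYear", "endYear", "runtimeMinutes")


def clean_row(row):
    """Replace the \\N marker with None and cast the numeric columns to int."""
    cleaned = {key: (None if value == NULL_MARKER else value)
               for key, value in row.items()}
    for key in NUMERIC_COLUMNS:
        if cleaned.get(key) is not None:
            cleaned[key] = int(cleaned[key])
    return cleaned


# Row tt0000004 from the df_basics output above
row = {"tconst": "tt0000004", "startYear": "1892", "endYear": "\\N",
       "runtimeMinutes": "\\N", "genres": "Animation,Short"}
print(clean_row(row))
```

Once \N is gone, the year and runtime columns no longer have to stay strings, which would also make `describe()` report meaningful maxima for them.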


- Encoding of Genres
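
One possible way to encode the genres, sketched in plain Python: a multi-hot vector per title, which does not depend on whether a genre appears in position 1, 2 or 3. This is an alternative to the per-column one-hot encoding started above, not the approach the notebook has settled on; the function name `multi_hot` is hypothetical.

```python
def multi_hot(genres_per_title):
    """Build a sorted genre vocabulary and one 0/1 vector per title."""
    vocab = sorted({genre
                    for genres in genres_per_title
                    for genre in genres.split(",")})
    position = {genre: i for i, genre in enumerate(vocab)}
    vectors = []
    for genres in genres_per_title:
        vector = [0] * len(vocab)
        for genre in genres.split(","):
            vector[position[genre]] = 1
        vectors.append(vector)
    return vocab, vectors


# Genre strings taken from the df_basics output above
sample = ["Documentary,Short", "Animation,Comedy,Romance", "Romance"]
vocab, vectors = multi_hot(sample)
print(vocab)    # ['Animation', 'Comedy', 'Documentary', 'Romance', 'Short']
print(vectors)  # [[0, 0, 1, 0, 1], [1, 1, 0, 1, 0], [0, 0, 0, 1, 0]]
```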