# Notebook for the data source
Here we download the data and perform the preprocessing to make the dataset ready for the ML models.

In [18]:
# Import packages
import os, yaml
import pandas as pd

## Define functions for the notebooks

In [2]:
# Creating a small function to load the data sheet by ID and sheet name
def load_google_sheet(sheet_id:str, sheet_name:str) -> pd.DataFrame:
    url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
    df = pd.read_csv(url)
    return df
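
The function above places `sheet_name` directly into the query string, so a name containing spaces or other reserved characters could produce a broken URL. A minimal sketch, assuming the same `gviz` CSV endpoint: `build_sheet_url` is a hypothetical helper (not part of the original notebook) that percent-encodes the sheet name with the standard library before building the URL.

```python
from urllib.parse import quote


def build_sheet_url(sheet_id: str, sheet_name: str) -> str:
    """Build the CSV export URL for one sheet (hypothetical helper)."""
    # quote() percent-encodes spaces and other reserved characters
    safe_name = quote(sheet_name, safe="")
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={safe_name}"
    )


# "SHEET_ID" and "movie night" are placeholders used only for illustration
url = build_sheet_url("SHEET_ID", "movie night")
```

The resulting string can be passed to `pd.read_csv` exactly as in `load_google_sheet`.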

# Reading from Google sheet

In [3]:
# Defining the ID of the Google Sheet with the movie ratings
sheet_id = '1-8tdDUtm0iBrCdCRAsYCw2KOimecrHcmsnL-aqG-l0E'

# Loading all the sheets and joining them together
df_main = load_google_sheet(sheet_id, 'main')
df_patreon = load_google_sheet(sheet_id, 'patreon')
df_mnight = load_google_sheet(sheet_id, 'movie_night')
df = pd.concat([df_main, df_patreon, df_mnight], axis = 0)
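
Note that `pd.concat` keeps each frame's own row labels, so the combined index can contain duplicates. A sketch with made-up toy frames (not the real sheets) showing `ignore_index=True`, plus a hypothetical `Source` column that remembers which sheet each row came from:

```python
import pandas as pd

# Toy stand-ins for the sheets; the values are invented for illustration
frames = {
    "main": pd.DataFrame({"Name": ["A", "B"], "Rating": [7.0, 8.5]}),
    "patreon": pd.DataFrame({"Name": ["C"], "Rating": [6.0]}),
}

# Tag each frame with its sheet of origin (the `Source` column is hypothetical)
tagged = [frame.assign(Source=name) for name, frame in frames.items()]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, ...
combined = pd.concat(tagged, axis=0, ignore_index=True)
```

A unique index makes later label-based selection with `.loc` unambiguous.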

Extract basic info and stats from the unified dataset

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 323 entries, 0 to 6
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Name            323 non-null    object 
 1   Category        323 non-null    object 
 2   Rating          318 non-null    float64
 3   Flickable       322 non-null    object 
 4   Episode Number  323 non-null    object 
 5   Notes           53 non-null     object 
dtypes: float64(1), object(5)
memory usage: 17.7+ KB


No surprises with the dtypes. The dataset has few entries (323 rows), so let's see how the training performs. The `Notes` column has a significant number of empty entries (null values): only 53 of 323 are non-null. There is no need to call `df.describe()` because we only have one numerical column (`Rating`).
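
The null counts can also be computed directly rather than read off `df.info()`. A minimal sketch on a toy frame with invented values (the column names mirror the dataset, the data does not):

```python
import pandas as pd

# Toy frame with invented values, mimicking two of the columns
toy = pd.DataFrame(
    {
        "Rating": [7.0, None, 8.0, 10.0],
        "Notes": ["first", None, None, None],
    }
)

# Number of missing values per column
null_counts = toy.isna().sum()

# Share of missing values per column, as a fraction between 0 and 1
null_share = toy.isna().mean()

# Summary statistics for the single numerical column
rating_stats = toy["Rating"].describe()
```

Applied to the real `df`, the same calls would report the gaps in `Rating`, `Flickable`, and `Notes` in one place.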

In [5]:
df.head()

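|   | Name | Category | Rating | Flickable | Episode Number | Notes |
|---|------|----------|--------|-----------|----------------|-------|
| 0 | Zoolander 2 | Movie | 7.0 | Yes | 10 | The very first flickin! |
| 1 | Dope | Movie | 8.5 | Yes | 11 | |
| 2 | The Big Short | Movie | 8.0 | Yes | 12 | Gary had to read Caelan's notes since Caelan h... |
| 3 | Deadpool | Movie | 10.0 | Yes | 13 | |
| 4 | Vinyl | TV Show | 7.5 | Yes | 15 | |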


In [9]:
# Save the raw dataset as a CSV file
path_file = os.path.join(os.getcwd(), 'raw_dataset.csv')
df.to_csv(path_file)
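
By default `to_csv` also writes the row index as an extra, unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A small sketch on a toy frame with invented values, using a temporary directory so nothing is left on disk, that compares the default with `index=False`:

```python
import os
import tempfile

import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({"Name": ["A", "B"], "Rating": [7.0, 8.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path_with_index = os.path.join(tmp_dir, "with_index.csv")
    path_no_index = os.path.join(tmp_dir, "no_index.csv")

    # Default behaviour: the index is written as the first column
    toy.to_csv(path_with_index)
    # index=False writes only the data columns
    toy.to_csv(path_no_index, index=False)

    cols_with_index = pd.read_csv(path_with_index).columns.tolist()
    cols_no_index = pd.read_csv(path_no_index).columns.tolist()
```

Whether to keep the index is a design choice: here the concatenated index repeats labels across sheets, so it carries little information.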


The dataset contains a lot more than just movie reviews. Therefore, we should filter the dataset using the `Category` column.

In [13]:
# Keeping only the data entries within Movie category
df_movies = df[df.Category == 'Movie']

# df_movies.info()
df_movies.head()

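|   | Name | Category | Rating | Flickable | Episode Number | Notes |
|---|------|----------|--------|-----------|----------------|-------|
| 0 | Zoolander 2 | Movie | 7.0 | Yes | 10 | The very first flickin! |
| 1 | Dope | Movie | 8.5 | Yes | 11 | |
| 2 | The Big Short | Movie | 8.0 | Yes | 12 | Gary had to read Caelan's notes since Caelan h... |
| 3 | Deadpool | Movie | 10.0 | Yes | 13 | |
| 5 | The Martian | Movie | 8.0 | Yes | 17 | |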


Removing unnecessary columns for the ML model

In [14]:
df_movies.drop(columns = ['Category', 'Episode Number', 'Notes'], inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(
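
The warning above appears because `df_movies` was created by filtering `df`, and pandas cannot tell whether the in-place `drop` is meant to affect the original frame. One common way to avoid it, sketched here on a toy frame with invented values, is to take an explicit `.copy()` after filtering and to reassign the result of `drop` instead of using `inplace=True`:

```python
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Category": ["Movie", "TV Show", "Movie"],
        "Rating": [7.0, 7.5, 8.0],
    }
)

# .copy() makes the filtered frame independent of `toy`
movies = toy[toy["Category"] == "Movie"].copy()

# drop() returns a new frame; reassigning avoids inplace=True altogether
movies = movies.drop(columns=["Category"])
```

The original frame keeps all of its columns, and the filtered copy can be modified freely.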


In [15]:
df_movies.head()

|   | Name | Rating | Flickable |
|---|------|--------|-----------|
| 0 | Zoolander 2 | 7.0 | Yes |
| 1 | Dope | 8.5 | Yes |
| 2 | The Big Short | 8.0 | Yes |
| 3 | Deadpool | 10.0 | Yes |
| 5 | The Martian | 8.0 | Yes |


In [16]:
# Save the movie dataset
path_file = os.path.join(os.getcwd(), 'movies_dataset.csv')
df_movies.to_csv(path_file)

# Gathering extra data
We all know how to build ML models for `plug-and-play` datasets like the ones available in Kaggle competitions. However, this is **not** our case here, and gathering extra data to enrich the dataset is as important as building a highly advanced ML model.