# About the dataset
Two datasets are provided: Netflix user ratings and IMDb movie names.
- `movies.csv`: contains movies from IMDb
- `ratings.csv`: contains user ratings from Netflix

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
movies = pd.read_csv('./movies.csv')
ratings = pd.read_csv('./ratings.csv')

# Inspecting the Dataset
Upon inspecting the datasets, we can see that:
- `movies.csv` contains the columns `movieId`, `title`, and `genres`
- `ratings.csv` contains the columns `userId`, `movieId`, `rating`, and `timestamp`

These two datasets are linked via the `movieId` attribute (the movie's ID in IMDb, which coincides with its ID in the Netflix ratings).
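
One way to check this link is to test whether every `movieId` in the ratings also appears in the movies. The sketch below uses two tiny hand-made frames rather than the real files; `toy_movies`, `toy_ratings`, and all values in them are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (illustrative values only)
toy_movies = pd.DataFrame({"movieId": [1, 2, 3],
                           "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({"userId": [1, 1, 2],
                            "movieId": [1, 3, 4],
                            "rating": [4.0, 1.5, 3.0]})

# True where the rated movie has a matching row in the movies frame
has_match = toy_ratings["movieId"].isin(toy_movies["movieId"])

# Ratings whose movieId has no counterpart would be dropped by an inner merge
unmatched = toy_ratings.loc[~has_match, "movieId"].tolist()
print(unmatched)
```

Running the same check on the real `ratings` and `movies` frames would show whether any ratings are lost when the two are merged.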

In [None]:
movies

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
10324,146684,Cosmic Scrat-tastrophe (2015),Animation|Children|Comedy
10325,146878,Le Grand Restaurant (1966),Comedy
10326,148238,A Very Murray Christmas (2015),Comedy
10327,148626,The Big Short (2015),Drama


In [None]:
ratings

,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523
...,...,...,...,...
105334,668,142488,4.0,1451535844
105335,668,142507,3.5,1451535889
105336,668,143385,4.0,1446388585
105337,668,144976,2.5,1448656898


# Activity 1:

## Data cleaning and preprocessing
We use `df.drop_duplicates()` to drop duplicate rows from our dataframe.

In [None]:
movies = movies.drop_duplicates() #remove duplicates from dataframe
movies
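
To illustrate what counts as a duplicate here, the following sketch runs `drop_duplicates()` on a small made-up frame (the `toy` frame and its values are assumptions, not rows from `movies.csv`). Called without arguments, it removes a row only when every column matches an earlier row.

```python
import pandas as pd

# Made-up rows; the last row repeats the second one exactly
toy = pd.DataFrame({"movieId": [1, 2, 2],
                    "title": ["A", "B", "B"],
                    "genres": ["Comedy", "Drama", "Drama"]})

# With no arguments, a row is dropped only if all columns match an earlier row
deduped = toy.drop_duplicates()

# Number of fully duplicated rows, useful to inspect before dropping
n_dupes = int(toy.duplicated().sum())
print(n_dupes, len(deduped))
```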

We use `OneHotEncoder` (from sklearn) to convert `movies['genres']` into one-hot encoded form.


> A one-hot encoder takes categorical data and turns it into numerical data, most commonly one binary (0/1) column per category.



In [None]:
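enc = OneHotEncoder(handle_unknown='ignore')  # categories unseen during fit are encoded as all zeros
enc.fit(pd.DataFrame(movies['genres']))  # learn the distinct values of the genres column
encoded_df = enc.transform(pd.DataFrame(movies['genres']))  # returns a SciPy sparse matrix by default
print(pd.DataFrame(encoded_df))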

                     0
0        (0, 337)\t1.0
1        (0, 395)\t1.0
2        (0, 697)\t1.0
3        (0, 650)\t1.0
4        (0, 600)\t1.0
...                ...
10324    (0, 499)\t1.0
10325    (0, 600)\t1.0
10326    (0, 600)\t1.0
10327    (0, 779)\t1.0
10328      (0, 0)\t1.0

[10329 rows x 1 columns]
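
The column indices in the output above run into the hundreds, which suggests that the encoder treats each full `genres` string (for example `Comedy|Romance`) as a single category. If one column per individual genre is wanted instead, a possible alternative, which is not the approach used in this notebook, is `Series.str.get_dummies`. A minimal sketch on a toy frame with made-up rows:

```python
import pandas as pd

# Made-up rows in the same 'A|B' format as the genres column
toy = pd.DataFrame({"title": ["A", "B", "C"],
                    "genres": ["Comedy|Romance", "Comedy", "Drama|Romance"]})

# Split each string on '|' and create one 0/1 column per individual genre
genre_flags = toy["genres"].str.get_dummies(sep="|")
print(genre_flags)
```

Each row can then have several flags set at once, one for every genre the movie belongs to.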


# Activity 2:

## Numpy & Pandas Operations

We are asked to compute the average rating per genre.

If we look at our datasets, we see that the `genres` column of `movies.csv` contains entries separated by `|`.

So we first merge the two datasets, split the genres on `|`, and then compute the average rating per genre on the "exploded" dataframe, which has one row for each (rating, genre) pair.

In [None]:
merged_df = pd.merge(ratings, movies, on='movieId', how='inner')

merged_df['genres'] = merged_df['genres'].str.split('|')
exploded_df = merged_df.explode('genres')

average_rating_by_genre = exploded_df.groupby('genres')['rating'].mean()

average_rating_by_genre.sort_values(ascending=False)

genres,rating
Film-Noir,3.913636
War,3.783202
Mystery,3.652043
Drama,3.650266
Documentary,3.643035
Crime,3.642392
IMAX,3.641821
Animation,3.63535
Musical,3.571962
Western,3.565687
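
To make the split-and-explode step concrete, here is the same pipeline on a two-row toy frame. The rows are made up for illustration and stand in for the merged dataframe.

```python
import pandas as pd

# Stand-in for the merged frame: one rating per movie, genres joined by '|'
toy = pd.DataFrame({"movieId": [1, 2],
                    "rating": [4.0, 2.0],
                    "genres": ["Comedy|Romance", "Comedy"]})

# 'Comedy|Romance' becomes ['Comedy', 'Romance'], then one row per list element
toy["genres"] = toy["genres"].str.split("|")
exploded = toy.explode("genres")

# Comedy averages (4.0 + 2.0) / 2; Romance has the single rating 4.0
means = exploded.groupby("genres")["rating"].mean()
print(means)
```

Note that a rating for a movie with several genres counts once toward each of those genres.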


We create random values for the budget since it is not available in our dataset.

In [None]:
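import numpy as np  # needed for np.random below
# one random placeholder budget per movie (10329 rows); `high` is exclusive
budget_np = np.random.randint(low=500000, high=20000000, size=10329)
tdf = pd.DataFrame(budget_np)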
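
Because these budgets are random, they change on every run. If repeatable values are preferred, one option is a seeded NumPy `Generator`. The sketch below is an assumption-laden variant, not the notebook's method: the seed `42`, the column name `budget`, and the small `n_movies` placeholder are all illustrative choices.

```python
import numpy as np
import pandas as pd

n_movies = 5  # placeholder; in the notebook this would be len(movies)

# A seeded Generator makes the placeholder budgets repeatable across runs
rng = np.random.default_rng(seed=42)
budgets = rng.integers(low=500_000, high=20_000_000, size=n_movies)

# Naming the column avoids the default integer column label 0
budget_df = pd.DataFrame({"budget": budgets})
print(budget_df)
```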