# Samples of library usage
Here you can find a set of use cases of the library.

## Imports

In [1]:
from movie_lens_lib import *
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Constants

The following constants were tuned to give the best test-set performance for the Hybrid model (random_state and train_size were, of course, not chosen that way).

In [2]:
n_movie_clusters = 5
rating_multiplier = 5
year_multiplier = 0.05
weight_genre, weight_cluster, weight_movie = 0.35, 0.45, 0.2
train_size = 0.9
random_state = 42

## Import & Split of the dataset

The library works with the MovieLens dataset.

In [3]:
ratings_df = pd.read_csv("data/ratings.csv")
movies_df = pd.read_csv("data/movies.csv", index_col="movieId")

X = ratings_df.drop(["rating"], axis=1)
y = ratings_df["rating"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=random_state)
ratings_train_df = pd.concat([X_train, y_train], axis=1)
ratings_test_df = pd.concat([X_test, y_test], axis=1)

## Preprocess

There are two preprocessing transformers.

* **PreProcessingBase()** is used for GenreBasedRegressor
* **PreProcessingAggregated()** is required for ClusterBasedRegressor and MovieBasedRegressor, and is also compatible with GenreBasedRegressor

In [4]:
PreProcessingBase().fit_transform(movies_df).head()

| movieId | Genres_Split | (no genres listed) | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | ... | Film-Noir | Horror | IMAX | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | [Adventure, Animation, Children, Comedy, Fantasy] | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | [Adventure, Children, Fantasy] | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | [Comedy, Romance] | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | [Comedy, Drama, Romance] | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | [Comedy] | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
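
To make the output above easier to read, here is a minimal sketch of the multi-hot genre encoding it appears to contain. It is not the implementation of PreProcessingBase; the function `multi_hot_genres` and the toy dictionary are hypothetical, and the only assumption taken from the dataset is that MovieLens stores genres as a pipe-separated string.

```python
# Toy rows in the MovieLens movies.csv layout: movieId -> pipe-separated genres.
# The helper below is a hypothetical illustration, not part of movie_lens_lib.
movies = {
    1: "Adventure|Animation|Children|Comedy|Fantasy",
    3: "Comedy|Romance",
    5: "Comedy",
}


def multi_hot_genres(genres_by_movie):
    """Return (sorted genre vocabulary, {movieId: list of 0/1 indicators})."""
    # Step 1: split the raw string into a list (the Genres_Split column).
    split = {
        movie_id: genres.split("|")
        for movie_id, genres in genres_by_movie.items()
    }
    # Step 2: collect every genre seen, in a stable (alphabetical) order.
    vocabulary = sorted({g for genres in split.values() for g in genres})
    # Step 3: one indicator per vocabulary entry for each movie.
    encoded = {
        movie_id: [int(g in genres) for g in vocabulary]
        for movie_id, genres in split.items()
    }
    return vocabulary, encoded


vocabulary, encoded = multi_hot_genres(movies)
print(vocabulary)
for movie_id, row in encoded.items():
    print(movie_id, row)
```

Each movie becomes a fixed-length vector of zeros and ones, which is the kind of input a genre-based model can work with directly.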


In [5]:
movies_hot_df = PreProcessingAggregated().transform((movies_df, ratings_train_df))
movies_hot_df.head()

| movieId | Genres_Split | (no genres listed) | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | ... | IMAX | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | rating_mean | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | [Adventure, Animation, Children, Comedy, Fantasy] | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.893497 | 1995.0 |
| 2 | [Adventure, Children, Fantasy] | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.278157 | 1995.0 |
| 3 | [Comedy, Romance] | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3.16946 | 1995.0 |
| 4 | [Comedy, Drama, Romance] | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2.866337 | 1995.0 |
| 5 | [Comedy] | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.079414 | 1995.0 |
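
The two extra columns, rating_mean and year, can be pictured with the sketch below. It is an assumption about how such columns could be derived, not the code of PreProcessingAggregated: the helpers `extract_year` and `mean_rating_per_movie` are hypothetical, and the year is assumed to sit in parentheses at the end of the movie title, as it does in MovieLens. Note that the cell above passes only the training ratings, so the mean rating does not see the test set.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumption: the release year is the last "(YYYY)" group in the title.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the trailing four-digit year as a float, or None if absent."""
    match = YEAR_PATTERN.search(title)
    return float(match.group(1)) if match else None


def mean_rating_per_movie(ratings):
    """Average the ratings of each movieId from (movieId, rating) pairs."""
    grouped = defaultdict(list)
    for movie_id, rating in ratings:
        grouped[movie_id].append(rating)
    return {movie_id: mean(values) for movie_id, values in grouped.items()}


# Toy data standing in for movies_df titles and ratings_train_df.
titles = {1: "Toy Story (1995)", 2: "Jumanji (1995)"}
train_ratings = [(1, 4.0), (1, 3.5), (2, 3.0)]

rating_mean = mean_rating_per_movie(train_ratings)
year = {movie_id: extract_year(title) for movie_id, title in titles.items()}
print(rating_mean)
print(year)
```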


## Regression

Examples of how models can be fit.

### GenreBasedRegressor

In [6]:
genre_based_regressor = GenreBasedRegressor(movies_hot_df).fit(X_train, y_train)

### ClusterBasedRegressor

In [7]:
cluster_based_regressor = ClusterBasedRegressor(
    movies_hot_df,
    n_movie_clusters,
    rating_multiplier,
    year_multiplier,
    random_state
).fit(X_train, y_train)

### MovieBasedRegressor

In [8]:
movie_based_regressor = MovieBasedRegressor().fit(movies_hot_df)

### HybridRegressor

HybridRegressor can be initialized in two ways:
* create HybridRegressor with default parameters
* pass parametrized regressors as parameters

The cell below initializes HybridRegressor with **default parameters**.

In [9]:
hybrid_regressor_with_default = HybridRegressor(movies_hot_df).fit(X_train, y_train)

The cell below initializes HybridRegressor **parametrized** with the regressors created above. We could fit the model, but since all three regressors are already fit, we can skip that call.

**Note:** As **HybridRegressor** combines GenreBasedRegressor, ClusterBasedRegressor and MovieBasedRegressor, we may also assign a weight to each regressor. Each weight sets the importance of that regressor's output in the final prediction.

In [10]:
hybrid_regressor_parametrized = HybridRegressor(
    movies_hot_df,
    (weight_genre, weight_cluster, weight_movie),
    genre_based_regressor,
    cluster_based_regressor, 
    movie_based_regressor
)

## Prediction

Predictions will be made based on the test set.

### GenreBasedRegressor

In [11]:
genre_predictions = genre_based_regressor.predict(X_test, False)

### ClusterBasedRegressor

In [12]:
cluster_predictions = cluster_based_regressor.predict(X_test, False)

### MovieBasedRegressor

In [13]:
movie_predictions = movie_based_regressor.predict(X_test, False)

### HybridRegressor

The cell below reproduces the logic behind HybridRegressor by hand, without calling it: the final prediction is the weighted sum of the three regressors' predictions.

In [14]:
weights = np.array([weight_genre, weight_cluster, weight_movie])
hybrid_predictions_implicit = np.column_stack([genre_predictions, cluster_predictions, movie_predictions]).dot(weights)
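
The same weighted combination can be checked on a tiny example. The sketch below uses made-up prediction values (the `toy_` names are hypothetical) together with the weights from the constants cell, and confirms that stacking the columns and taking the dot product equals the term-by-term weighted sum.

```python
import numpy as np

# Weights in the order (genre, cluster, movie), as in the constants cell.
toy_weights = np.array([0.35, 0.45, 0.2])

# Made-up predictions of the three regressors for two test rows.
toy_genre = np.array([3.0, 4.0])
toy_cluster = np.array([3.5, 4.5])
toy_movie = np.array([4.0, 2.0])

# One row per test sample, one column per regressor: shape (2, 3).
toy_stacked = np.column_stack([toy_genre, toy_cluster, toy_movie])

# Matrix-vector product: each row is reduced to its weighted sum.
toy_blended = toy_stacked.dot(toy_weights)

# The same computation written out term by term.
toy_manual = (
    toy_weights[0] * toy_genre
    + toy_weights[1] * toy_cluster
    + toy_weights[2] * toy_movie
)

print(toy_blended)
print(np.allclose(toy_blended, toy_manual))
```

Because the three weights sum to 1, the blended value always stays between the smallest and the largest of the individual predictions for that row.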

In the cell below we get predictions from HybridRegressor with default parameters.

In [15]:
hybrid_predictions_default = hybrid_regressor_with_default.predict(X_test, False)

In the cell below we get predictions from HybridRegressor which was parametrized.

In [16]:
hybrid_predictions_parametrized = hybrid_regressor_parametrized.predict(X_test, False)

## Evaluation

In [18]:
print("-" * 20)
print("Genre based prediction")
print_stats(get_performance_stats(y_test, genre_predictions))

print("-" * 20)
print("Clustering based prediction")
print_stats(get_performance_stats(y_test, cluster_predictions))

print("-" * 20)
print("Movie based prediction")
print_stats(get_performance_stats(y_test, movie_predictions))

print("-" * 20)
print("Hybrid prediction (by logic)")
print_stats(get_performance_stats(y_test, hybrid_predictions_implicit))

print("-" * 20)
print("Hybrid prediction (default parameters)")
print_stats(get_performance_stats(y_test, hybrid_predictions_default))

print("-" * 20)
print("Hybrid prediction (parametrized)")
print_stats(get_performance_stats(y_test, hybrid_predictions_parametrized))

--------------------
Genre based prediction
MSE: 0.856
MAE: 0.711
ACCURACY: 0.751
--------------------
Clustering based prediction
MSE: 0.802
MAE: 0.669
ACCURACY: 0.788
--------------------
Movie based prediction
MSE: 0.932
MAE: 0.745
ACCURACY: 0.727
--------------------
Hybrid prediction (by logic)
MSE: 0.75
MAE: 0.664
ACCURACY: 0.78
--------------------
Hybrid prediction (default parameters)
MSE: 0.755
MAE: 0.667
ACCURACY: 0.778
--------------------
Hybrid prediction (parametrized)
MSE: 0.75
MAE: 0.664
ACCURACY: 0.78
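
For reference, the first two figures in each block can be computed as in the sketch below. This is a plain-Python illustration of the standard MSE and MAE definitions on made-up values, not the code of get_performance_stats; the helper names `mse` and `mae` are hypothetical. ACCURACY is left out, since the text does not say how it is defined for rating predictions.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ratings and predictions, for illustration only.
toy_true = [4.0, 3.0, 5.0, 2.0]
toy_pred = [3.5, 3.0, 4.0, 3.0]

toy_mse = mse(toy_true, toy_pred)
toy_mae = mae(toy_true, toy_pred)
print("MSE:", round(toy_mse, 3))
print("MAE:", round(toy_mae, 3))
```

MSE penalizes large misses more strongly than MAE, which is why the two metrics can rank models differently.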


As we can see, the results for the parametrized and the logic-based Hybrid models are identical, since both use the same fitted regressors. Note that the parametrized version may give different results from the model with default parameters, because random_state is not set in the default version.
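
The effect of a fixed seed can be shown with the standard library alone. The sketch below is only an analogy for any randomized step inside a model and says nothing about how the library draws its random numbers; the function `draw_initial_values` is hypothetical.

```python
import random


def draw_initial_values(seed=None, n=3):
    """Draw n pseudo-random numbers; seed=None uses a system-provided seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Same seed: the sequence is reproduced exactly on every run.
first = draw_initial_values(seed=42)
second = draw_initial_values(seed=42)
print(first == second)

# No seed: two calls will almost certainly differ, which is why a model
# with an unset random_state can score slightly differently between runs.
print(draw_initial_values() == draw_initial_values())
```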