<a href="https://colab.research.google.com/github/raulbenitez/postgrau_IML_exploratory/blob/master/RECOMENDADORES/Recomendadores_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook describes how to define a dataset in Surprise, load (or download) it, and explore its data.


As with any Python library, we start with the imports. In this case we only need the `Dataset` class from the dataset module: <br>
https://surprise.readthedocs.io/en/stable/dataset.html

In [None]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1618278 sha256=822533c3aff376e4e4ba270d97117ee293e3b1cdaab0c782530738b5c697b879
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [None]:
from surprise import Dataset

The first step in training a model is to obtain the data needed for the experiment. <br>
In Surprise there are two options (more information at https://surprise.readthedocs.io/en/stable/dataset.html):
* define your own dataset (user-defined dataset)
* use a predefined one (built-in dataset) <br>



## Load a predefined dataset
In this course we will work with Surprise's dataset structure and its built-in datasets, which make data handling very easy. <br>
Right now, three built-in datasets are available: ml-100k, ml-1m and jester. <br>
*  ml-100k: MovieLens 100K movie ratings. 100,000 ratings from 943 users on 1,682 movies <br>
*  ml-1m: MovieLens 1M movie ratings. About 1 million ratings from about 6,000 users on about 4,000 movies <br>
*  jester: Jester dataset 2 (joke ratings). Over 115,000 new ratings from 82,366 total users <br>

In this first notebook we will explore the MovieLens dataset that we will use throughout the course: ml-100k.

In [None]:
# valid options: ml-100k, ml-1m, jester.
dataset_experiment = "ml-100k"  


Load the ml-100k dataset. <br>
The first time it is loaded, Surprise asks for confirmation and then downloads it automatically.


In [None]:
data = Dataset.load_builtin(dataset_experiment)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


Surprise exposes the raw contents of a dataset through the `raw_ratings` attribute:


In [None]:
raw_data = data.raw_ratings

Let's print the first 3 entries of raw_data:

In [None]:
print("Each entry has the columns: user_id, item_id,  rating and timestamp")
print(raw_data[0])
print(raw_data[1])
print(raw_data[2])
print("\n")

Each entry has the columns: user_id, item_id,  rating and timestamp
('196', '242', 3.0, '881250949')
('186', '302', 3.0, '891717742')
('22', '377', 1.0, '878887116')
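
Note that the user ID, item ID and timestamp are strings, while the rating is a float. In the MovieLens data the timestamp holds Unix time (seconds since 1970-01-01 UTC), so it can be turned into a readable date with the standard library. A small sketch using the first entry printed above (the variable names are illustrative):

```python
from datetime import datetime, timezone

# First raw rating from ml-100k, as printed above.
entry = ('196', '242', 3.0, '881250949')
user_id, item_id, rating, timestamp = entry

# The timestamp is a string of Unix seconds; convert it to a UTC datetime.
rated_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
print(f"user {user_id} rated item {item_id} with {rating} on {rated_at:%Y-%m-%d}")
```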




## Pandas

To explore the dataset further we will use the pandas module, which handles tabular data organized in rows and columns. <br>
**Pandas is outside the scope of the course**, but it helps to view the contents of the dataset ( https://github.com/NicolasHug/Surprise/issues/254 )

In [None]:
import pandas as pd

# Reload the dataset and build a DataFrame from its raw ratings.
data = Dataset.load_builtin('ml-100k')
dataframe = pd.DataFrame(data.raw_ratings, columns=['user_id', 'item_id', 'rating', 'timestamp'])

Let's print all the information in the dataset:

In [None]:
print("Dataframe contents:")
print(dataframe)
print("\n")

Dataframe contents:
      user_id item_id  rating  timestamp
0         196     242     3.0  881250949
1         186     302     3.0  891717742
2          22     377     1.0  878887116
3         244      51     2.0  880606923
4         166     346     1.0  886397596
...       ...     ...     ...        ...
99995     880     476     3.0  880175444
99996     716     204     5.0  879795543
99997     276    1090     1.0  874795795
99998      13     225     2.0  882399156
99999      12     203     3.0  879959583

[100000 rows x 4 columns]




Let's access a particular row in the dataset:

In [None]:
print("Sample row in the dataset:")
print(dataframe.loc[0])
print("\n")

Sample row in the dataset:
user_id            196
item_id            242
rating               3
timestamp    881250949
Name: 0, dtype: object




Let's access a particular column in the dataset:

In [None]:
print("Sample column (item_id) in the dataset (first 10 values):")
print(dataframe["item_id"].head(10))
print("\n")

Sample column (item_id) in the dataset (first 10 values):
0    242
1    302
2    377
3     51
4    346
5    474
6    265
7    465
8    451
9     86
Name: item_id, dtype: object




### Let's check a few things in the dataset:

(you can change the numbers to obtain different items)

How many users have evaluated item "51"? 

In [None]:
print("Number of ratings for item 51 in the database:")
print(dataframe.loc[dataframe['item_id'] == "51"])  # [81 rows x 4 columns]


Number of ratings for item 51 in the database:
      user_id item_id  rating  timestamp
3         244      51     2.0  880606923
780        85      51     2.0  879454782
1614      201      51     2.0  884140751
4469      198      51     3.0  884208455
5158      330      51     5.0  876546753
...       ...     ...     ...        ...
94280     551      51     5.0  892784780
96205     632      51     4.0  879459166
96650     846      51     4.0  883949121
97126     711      51     4.0  879994778
98569     878      51     4.0  880869239

[81 rows x 4 columns]
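
The cell above prints the matching rows, so the count appears only in the summary line. One possible way to get the number directly is to sum the boolean mask. The sketch below uses a small hand-made frame with a few rows copied from the outputs above, so the count it prints is for this sample only, not for the full dataset. It also illustrates that the IDs are strings: comparing against the integer 51 matches nothing.

```python
import pandas as pd

# A few rows copied from the ml-100k outputs shown above (sample only).
sample = pd.DataFrame(
    [("244", "51", 2.0, "880606923"),
     ("85", "51", 2.0, "879454782"),
     ("196", "242", 3.0, "881250949")],
    columns=["user_id", "item_id", "rating", "timestamp"],
)

# Summing a boolean mask counts the rows where the condition holds.
n_ratings_item_51 = int((sample["item_id"] == "51").sum())
print(n_ratings_item_51)

# IDs are stored as strings, so an integer comparison finds no rows.
n_wrong_type = int((sample["item_id"] == 51).sum())
print(n_wrong_type)
```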


How many ratings has user "780" provided?

In [None]:
print("Number of ratings provided by user 780:")
print(len(dataframe.loc[dataframe['user_id'] == "780"]))  # 55 films evaluated

Number of ratings provided by user 780:
55
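
To answer the same question for every user at once, `value_counts()` is one option. The following sketch runs on a hypothetical miniature table rather than the real ml-100k data, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical miniature ratings table (not the real ml-100k data).
sample = pd.DataFrame({
    "user_id": ["780", "780", "12", "780", "12", "13"],
    "item_id": ["51", "242", "203", "377", "51", "225"],
})

# value_counts() returns the number of rows per distinct user, largest first.
ratings_per_user = sample["user_id"].value_counts()
print(ratings_per_user)

# Look up a single user by ID (a string, as in the dataset).
print(ratings_per_user["780"])
```

Applied to the full `dataframe`, the same call would list the most active users first.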


## Stats

Let's check some values for the rating column:


In [None]:
print("Mean value of the rating column: " + str(dataframe["rating"].mean()))  # 3.52986
print("Max value of the rating column: " + str(dataframe["rating"].max()))
print("Min value of the rating column: " + str(dataframe["rating"].min())+"\n")
print("Number of different values in the rating column = " + str(dataframe["rating"].nunique()))

Mean value of the rating column: 3.52986
Max value of the rating column: 5.0
Min value of the rating column: 1.0

Number of different values in the rating column = 5
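
Beyond the mean and the extremes, it can help to see how often each rating value occurs. A minimal sketch, using only the ten ratings visible in the dataframe printout earlier in this notebook (so the counts are for that sample, not for all 100,000 ratings):

```python
import pandas as pd

# The ten ratings visible in the dataframe printout (first and last five rows).
ratings = pd.Series([3.0, 3.0, 1.0, 2.0, 1.0, 3.0, 5.0, 1.0, 2.0, 3.0], name="rating")

# Count how often each rating value occurs, ordered by rating value.
distribution = ratings.value_counts().sort_index()
print(distribution)

# normalize=True gives relative frequencies instead of counts.
relative = ratings.value_counts(normalize=True).sort_index()
print(relative)
```

On the full dataset the equivalent call would be `dataframe["rating"].value_counts().sort_index()`.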


Let's also check the number of users and items in the database:


In [None]:
print("Number of users in the database = " + str(dataframe["user_id"].nunique()))
print("Number of items in the database = " + str(dataframe["item_id"].nunique()))

Number of users in the database = 943
Number of items in the database = 1682
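
These counts also show how sparse the data is: most users have rated only a small fraction of the items. A quick back-of-the-envelope calculation with the numbers reported above (the variable names are illustrative):

```python
# Counts reported above for ml-100k.
n_ratings = 100_000
n_users = 943
n_items = 1682

# Density: share of all possible (user, item) pairs that actually have a rating.
density = n_ratings / (n_users * n_items)
print(f"Density:  {density:.2%}")
print(f"Sparsity: {1 - density:.2%}")
```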
