## 1.0 Warm up
Start by watching <a href="https://www.youtube.com/watch?v=4Ws0oPH350U">this YouTube video</a> and code along to build a simple movie recommender system using KNN. The dataset used in the video is MovieLens small, which consists of 100,000 ratings on 9,000 movies by 600 users. The cells below load the larger `ml-latest` files instead (about 27.8 million ratings), which is why memory becomes a problem further down.
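
To make the idea concrete before loading any real data, here is a minimal sketch of item-based KNN on a tiny rating matrix. It is not the code from the video: the titles, the ratings and the helper `recommend` are hypothetical, and cosine distance with brute-force search is just one common choice for sparse rating data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: rows are movies, columns are users, 0 means "not rated".
titles = ["Toy Story", "Jumanji", "Heat", "Casino"]
toy_matrix = np.array([
    [5.0, 4.0, 0.0, 0.0, 5.0],   # Toy Story
    [4.0, 5.0, 0.0, 1.0, 4.0],   # Jumanji
    [0.0, 0.0, 5.0, 4.0, 0.0],   # Heat
    [0.0, 1.0, 4.0, 5.0, 0.0],   # Casino
], dtype=np.float32)

mat = csr_matrix(toy_matrix)

# Movies whose rating vectors point in the same direction count as similar.
model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(mat)


def recommend(title, n=1):
    """Return the n movies closest to `title` (hypothetical helper)."""
    idx = titles.index(title)
    # Ask for n + 1 neighbours because the closest match is the movie itself.
    _, indices = model.kneighbors(mat[idx], n_neighbors=n + 1)
    return [titles[i] for i in indices[0] if i != idx][:n]


print(recommend("Toy Story"))  # ['Jumanji']
print(recommend("Heat"))       # ['Casino']
```

The same pattern should carry over to the real data once a movie-by-user sparse matrix exists, which is what the rest of this section tries to build.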

In [1]:
import pandas as pd
import polars as pl
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from fuzzywuzzy import process
import time

movies='../Data/ml-latest/movies.csv'
ratings='../Data/ml-latest/ratings.csv'

In [2]:
t1 = time.time()

df_movies=pd.read_csv(movies, usecols=['movieId','title'],
    dtype={
        'movieId':'int32',
        'title':'str'
        }
    )

df_ratings=pd.read_csv(ratings, usecols=['userId','movieId','rating'],
    dtype={
        'userId':'int32',
        'movieId':'int32',
        'rating':'float32'
        }
    )

t2 = time.time()
print(f'Took {t2-t1} seconds.')

Took 6.339543104171753 seconds.


In [3]:
# polars
t1 = time.time()

dfp_movies=pl.read_csv(movies, infer_schema_length=0, columns=['movieId','title'])
dfp_ratings=pl.read_csv(ratings, columns=['userId','movieId','rating'],
    dtypes={
        'userId': pl.Int32,
        'movieId': pl.Int32,
        'rating': pl.Float32
        }
    )

t2 = time.time()
print(f'Took {t2-t1} seconds, {type(dfp_movies)}.')

Took 1.140707015991211 seconds, <class 'polars.internals.dataframe.frame.DataFrame'>.


Reading the CSV files is about 5.6 times faster with polars: 1.14 s versus 6.34 s (!)
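
The `t1`/`t2` pattern repeats in most cells, so a small timing helper may be worth having. The sketch below is only a suggestion: `timed` and `speedup` are hypothetical names, and `time.sleep` stands in for the actual `read_csv` calls. The last line recomputes the ratio from the two timings printed above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast run was."""
    return slow_seconds / fast_seconds


results = {}
with timed("demo", results):
    time.sleep(0.01)  # stand-in for pd.read_csv(...) or pl.read_csv(...)

print(f"demo took {results['demo']:.3f} seconds")

# Ratio of the pandas and polars timings reported in the cells above.
print(round(speedup(6.339543104171753, 1.140707015991211), 2))  # 5.56
```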

In [4]:
dfp_ratings.describe()

describe,userId,movieId,rating
str,f64,f64,f64
count,27753444.0,27753444.0,27753444.0
null_count,0.0,0.0,0.0
mean,141942.015571,18487.999834,3.530445
std,81707.400091,35102.625248,1.067863
min,1.0,1.0,0.5
max,283228.0,193886.0,5.0
median,142022.0,2716.0,3.5


In [5]:
df_ratings.describe()

,userId,movieId,rating
count,27753440.0,27753440.0,27753440.0
mean,141942.0,18488.0,3.530446
std,81707.4,35102.63,1.066353
min,1.0,1.0,0.5
25%,71176.0,1097.0,3.0
50%,142022.0,2716.0,3.5
75%,212459.0,7150.0,4.0
max,283228.0,193886.0,5.0


---
Pivoting with pandas and polars

In [6]:
t1 = time.time()
movies_users = df_ratings.pivot(index='movieId', columns='userId', values='rating')

t2 = time.time()
print(f'Took {t2-t1} seconds.')


Took 128.32663083076477 seconds.


In [7]:
t1 = time.time()
movies_users = movies_users.fillna(0)  # fillna returns a new frame, so keep the result
t2 = time.time()
print(f'Took {t2-t1} seconds.')


In [None]:
mat_movies_users=csr_matrix(movies_users.values)

NameError: name 'movies_users' is not defined
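
One way to sidestep the dense pivot altogether is to build the sparse matrix straight from the long-format ratings, treating every missing rating as an implicit zero. The sketch below uses a hypothetical toy frame in place of `df_ratings`; categorical codes map the ids onto consecutive row and column positions. I have not run this on the full dataset, so treat it as a starting point rather than a measured fix.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings in the same long format as ratings.csv.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 30, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Categorical codes map arbitrary ids onto positions 0..n-1.
movie_cat = toy_ratings["movieId"].astype("category")
user_cat = toy_ratings["userId"].astype("category")

# csr_matrix((values, (row_positions, col_positions)), shape=...)
mat_movies_users = csr_matrix(
    (
        toy_ratings["rating"].to_numpy(dtype="float32"),
        (movie_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy()),
    ),
    shape=(len(movie_cat.cat.categories), len(user_cat.cat.categories)),
)

# Row i of the matrix corresponds to movie_ids[i].
movie_ids = movie_cat.cat.categories

print(mat_movies_users.shape)  # (3, 3)
print(mat_movies_users.nnz)    # 5
print(mat_movies_users.toarray())
```

Only the ratings that exist are stored, so the result matches `csr_matrix(movies_users.fillna(0).values)` on the toy data without ever materialising the NaN-filled table.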

In [None]:
t1 = time.time()
#movies_users = dfp_ratings.pivot(index='movieId', columns='userId', values='rating')

t2 = time.time()
#print(f'Took {t2-t1} seconds.')


**Cannot pivot with polars**

Might have to look into the documentation on <a href="https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.pivot.html#polars.DataFrame.pivot">polars.DataFrame.pivot</a>.

`Canceled future for execute_request message before replies were done
The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details.`

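For some reason the pivot exhausts my computer's RAM and crashes the kernel in the .ipynb notebook. Running it as a plain .py script gives the same result. The likely cause is that the pivoted table is dense, with one column for each of the 283,228 users, so I will have to look into how to reduce the memory footprint of pivoting in polars.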
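
A rough back-of-the-envelope estimate helps explain the crash. In the sketch below, the user count and the number of ratings come from the outputs in this notebook, while the movie count of 50,000 is a hypothetical round figure and the 4-byte value and index sizes are assumptions (float32 ratings, 32-bit indices).

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=4):
    """Memory for a dense table that stores every cell."""
    return n_rows * n_cols * bytes_per_value


def csr_bytes(n_rows, nnz, bytes_per_value=4, bytes_per_index=4):
    """Memory for a CSR matrix: values, column indices and row pointers."""
    return nnz * (bytes_per_value + bytes_per_index) + (n_rows + 1) * bytes_per_index


n_users = 283_228        # number of distinct users in the ratings
n_ratings = 27_753_444   # row count reported by describe()
n_movies = 50_000        # hypothetical round figure, not taken from the data

dense = dense_bytes(n_movies, n_users)
sparse = csr_bytes(n_movies, n_ratings)

print(f"dense pivot: {dense / 1e9:.1f} GB")   # 56.6 GB
print(f"CSR matrix:  {sparse / 1e9:.2f} GB")  # 0.22 GB
```

Under these assumptions the dense table is a few hundred times larger than the sparse one, which would be consistent with the kernel running out of memory.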

In [None]:
# Trying chunks

chunk_size = 10000
df_chunks = pl.read_csv(ratings, chunksize=chunk_size, usecols=['userId','movieId','rating'],dtype={'userId':int,'movieId':int,'rating':float})

movies_users = pl.DataFrame()

for df_chunk in df_chunks:
    pivot_table = df_chunk.pivot(index='movieId', columns='userId', values='rating')
    movies_users = movies_users.append(pivot_table)


TypeError: read_csv() got an unexpected keyword argument 'chunksize'
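
The keywords `chunksize`, `usecols` and `dtype` belong to `pandas.read_csv`, which explains the `TypeError` from polars. Pivoting each chunk separately would also give tables with different user columns. Below is a sketch of a chunked variant in pandas that accumulates a sparse matrix instead. The in-memory CSV and the id limits are hypothetical stand-ins, and using the raw ids as positions is an assumption that leaves unused rows and columns empty.

```python
import io

import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical stand-in for ratings.csv so the sketch is self-contained.
toy_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,100\n"
    "1,3,5.0,101\n"
    "2,1,3.0,102\n"
    "3,2,2.0,103\n"
    "3,3,4.5,104\n"
)

# Shape from the largest ids; position 0 stays unused because ids start at 1.
max_movie_id, max_user_id = 3, 3
shape = (max_movie_id + 1, max_user_id + 1)

chunks = pd.read_csv(
    toy_csv,
    chunksize=2,  # rows per chunk
    usecols=["userId", "movieId", "rating"],
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)

# Start from an empty sparse matrix and add one sparse piece per chunk.
movies_users = csr_matrix(shape, dtype="float32")
for chunk in chunks:
    part = csr_matrix(
        (
            chunk["rating"].to_numpy(),
            (chunk["movieId"].to_numpy(), chunk["userId"].to_numpy()),
        ),
        shape=shape,
    )
    movies_users = movies_users + part

print(movies_users.nnz)    # 5
print(movies_users[3, 1])  # 5.0
```

Adding the pieces is safe here because each (movie, user) pair occurs once; duplicate pairs would be summed.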

In [None]:
df_ratings['userId'].value_counts()

123100    23715
117490     9279
134596     8381
212343     7884
242683     7515
          ...  
188125        1
117282        1
127062        1
241836        1
265726        1
Name: userId, Length: 283228, dtype: int64

In [None]:
movies_users


movieId / userId,1,2,3,4,5,6,7,8,9,10,...,283219,283220,283221,283222,283223,283224,283225,283226,283227,283228
1,,,,4.0,,,,,,5.0,...,4.0,,,,,,,,,4.5
2,,,,4.0,,,,,,,...,,,,,,,,,,
3,,,,,,,,3.0,,,...,,,,,,4.0,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,2.0,,,,3.0,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193876,,,,,,,,,,,...,,,,,,,,,,
193878,,,,,,,,,,,...,,,,,,,,,,
193880,,,,,,,,,,,...,,,,,,,,,,
193882,,,,,,,,,,,...,,,,,,,,,,
