# Introduction

This notebook processes the dataset from [Scrape Google Play Reviews]().

Once processed, this dataset can be used to predict the rating of an app.

The main processing step is to embed every review for a given app and average the resulting vectors.
- We embed with the `sbintuitions/sarashina-embedding-v1-1b` embedding model, which is designed for Japanese input.
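The averaging step described above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's final implementation: the function name `mean_embedding_per_app`, the app identifiers, and the 2-dimensional vectors are all assumptions made for the example.

```python
from collections import defaultdict

import numpy as np


def mean_embedding_per_app(app_ids, vectors):
    """Average the review vectors that belong to the same app.

    app_ids: one identifier per review (hypothetical, for illustration).
    vectors: one embedding per review, in the same order as app_ids.
    """
    vectors = np.asarray(vectors, dtype=np.float32)

    # Collect the row indices of the reviews that belong to each app.
    rows_by_app = defaultdict(list)
    for row_index, app_id in enumerate(app_ids):
        rows_by_app[app_id].append(row_index)

    # Mean over the review axis gives one vector per app.
    return {
        app_id: vectors[rows].mean(axis=0)
        for app_id, rows in rows_by_app.items()
    }


# Toy data: three reviews across two hypothetical apps.
toy_ids = ["app_a", "app_a", "app_b"]
toy_vectors = [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]]
app_means = mean_embedding_per_app(toy_ids, toy_vectors)
```

A plain mean weights every review equally, so apps with very few reviews will have noisier vectors than apps with many.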

# Import Data
First let's import the data from the [Scrape Google Play Reviews]() dataset.

In [1]:
from pathlib import Path
import csv

csv_file_path = Path("/kaggle/input/scrape-google-play-reviews/review_data.csv")

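# Assume the file is UTF-8 encoded; newline='' is what the csv module recommends.
with csv_file_path.open('r', encoding='utf-8', newline='') as f: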
    csv_reader = csv.reader(f)
    datafile = list(csv_reader)
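Since `datafile` keeps the header as its first row, columns can also be looked up by name rather than by position. The sketch below shows the idea on an in-memory sample; the column names `app_id` and `review_text` are hypothetical and may not match the real header of `review_data.csv`.

```python
import csv
import io

# Hypothetical two-column sample standing in for review_data.csv.
sample_csv = "app_id,review_text\ncom.example.app,とても便利\n"

rows = list(csv.reader(io.StringIO(sample_csv)))
header, records = rows[0], rows[1:]

# Map each column name to its position in a row.
column_index = {name: i for i, name in enumerate(header)}

first_review = records[0][column_index["review_text"]]
```

Looking columns up by name keeps later cells working if the column order of the scraped file changes.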

# Compute the embeddings
Note that **GPU accelerators need to be enabled** for this to work.

In [2]:
from sentence_transformers import SentenceTransformer
    
print("Loading the Sarashina embedding model...")
model = SentenceTransformer(
    "sbintuitions/sarashina-embedding-v1-1b",
    device="cpu"  # Reserve the GPU power for computing the embeddings
)
print("Model loaded successfully!")

Loading the Sarashina embedding model...


Model loaded successfully!


In [None]:
# Column 10 of each row holds the review text; datafile[0] is the header row.
reviews = [
    row[10]
    for row in datafile[1:]
]

print("Starting multi-process pool...")
pool = model.start_multi_process_pool()

review_vectors = model.encode_multi_process(
    reviews,
    pool=pool,
    batch_size=8
)

print("Stopping multi-process pool...")
model.stop_multi_process_pool(pool)

Starting multi-process pool...



In [None]:
import pickle

with open('vectors.pkl', 'wb') as f:  # open the file in binary write mode
    pickle.dump(review_vectors, f)  # serialize the review embeddings
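To reuse the saved vectors in a later session, the pickle file can be read back in binary mode. The following round-trip is a small sketch on a toy array written to a temporary directory; the array contents and the temporary path are assumptions for illustration only.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-in for review_vectors: 3 "reviews", 2 dimensions each.
toy_vectors = np.arange(6, dtype=np.float32).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "vectors.pkl"

    # Write in binary mode, as in the cell above.
    with pkl_path.open("wb") as f:
        pickle.dump(toy_vectors, f)

    # Read back in binary mode to recover the same array.
    with pkl_path.open("rb") as f:
        restored = pickle.load(f)
```

Only unpickle files you created yourself, because `pickle.load` can execute arbitrary code from an untrusted file.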

# Create a dataframe for this new dataset

In [None]:
import pandas as pd
import numpy as np

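# The first row of datafile is the header; the remaining rows are the records.
df = pd.DataFrame(datafile[1:], columns=datafile[0])
# Group on every column to collapse exact duplicate rows and count them.
collapsed_df = df.groupby(list(df.columns), sort=False).size()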
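To make the effect of grouping on every column concrete, here is a small sketch on toy rows. The column names and values are hypothetical; the point is only that identical rows collapse into one entry whose value is the number of occurrences.

```python
import pandas as pd

# Toy rows in the same shape as datafile: a header row, then records.
toy_rows = [
    ["app_id", "review_text"],
    ["app_a", "good"],
    ["app_a", "good"],  # exact duplicate of the previous row
    ["app_b", "bad"],
]

toy_df = pd.DataFrame(toy_rows[1:], columns=toy_rows[0])

# One entry per distinct row, indexed by the full row contents.
counts = toy_df.groupby(list(toy_df.columns), sort=False).size()
```

The result is a `Series` with a `MultiIndex`, so calling `counts.reset_index(name="count")` would turn it back into a regular dataframe if one is needed.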

