# Rating Data Prep
When a scrape is restarted, there is no way to know which apps' requests failed previously. As a result, a restart re-requests previously failed ids, which is inefficient and time-consuming. To avoid this, we need a response history of sorts. Rather than creating a new table and duplicating the data, we will post all responses, successful or not, with a boolean status. This also provides a record of the problematic ids for further debugging, should that be needed.
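
As a rough illustration of how a status column lets a restart skip ids that were already requested, here is a minimal sketch. The `history` dataframe, the `candidates` list, and the id values are hypothetical stand-ins, not the project's actual data or scraper logic.

```python
import pandas as pd

# Hypothetical response history: one row per requested id, failed or not.
history = pd.DataFrame({"id": [101, 102, 103], "status": [True, False, True]})

# Hypothetical list of ids the scraper would request on restart.
candidates = [101, 102, 103, 104, 105]

# Any id with a recorded response, successful or failed, is skipped.
seen = set(history["id"])
pending = [app_id for app_id in candidates if app_id not in seen]

# Failed ids stay on record for later debugging.
failed = history.loc[~history["status"], "id"].tolist()

print(pending)
print(failed)
```

Because failed responses are stored alongside successful ones, a single lookup covers both the skip list and the debugging record.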

This brief notebook simply adds the status column, with a value of True, to all existing rating data. We won't retroactively add failed ids for categories already processed.

This will be done in four steps.
1. Read all rating data into a dataframe.
2. Add a status column to the dataframe.
3. Drop the rating table.
4. Add the data back to the repository, which will create the new table with the status column.
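
The steps above can be sketched end to end with plain pandas and an in-memory SQLite database. This is only an illustration under stated assumptions: the SQLite connection, the `rating` table name, and the three sample rows are hypothetical and stand in for the project's repository, whose internals are not shown in this notebook.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the rating table: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"id": [1, 2, 3], "name": ["a", "b", "c"], "rating": [4.5, 3.0, 5.0]}
).to_sql("rating", conn, index=False)

# 1. Read all rating data into a dataframe.
ratings = pd.read_sql("SELECT * FROM rating", conn)

# 2. Add a status column; every existing row came from a successful response.
ratings["status"] = True

# 3. Drop the rating table.
conn.execute("DROP TABLE rating")

# 4. Write the data back, which recreates the table with the status column.
ratings.to_sql("rating", conn, index=False)

restored = pd.read_sql("SELECT * FROM rating", conn)
print(restored.columns.tolist())
```

Depending on the database backend, a boolean column may be read back as an integer type, so the dtype of `status` after the round trip can differ from the dtype that was written.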


In [1]:
import pandas as pd
from appvoc.infrastructure.file.io import IOService
from appvoc.container import AppVoCContainer

In [2]:
container = AppVoCContainer()
container.init_resources()
container.wire(packages=["appvoc.data.acquisition"])

## Update Ratings with Status

In [3]:
repo = container.data.rating_repo()
ratings = repo.getall()
#ratings['status'] = True
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26509 entries, 0 to 26508
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           26509 non-null  int64  
 1   name         26509 non-null  object 
 2   category_id  26509 non-null  int64  
 3   category     26509 non-null  object 
 4   rating       26509 non-null  float64
 5   reviews      26509 non-null  int64  
 6   ratings      26509 non-null  int64  
 7   onestar      26509 non-null  int64  
 8   twostar      26509 non-null  int64  
 9   threestar    26509 non-null  int64  
 10  fourstar     26509 non-null  int64  
 11  fivestar     26509 non-null  int64  
 12  status       26509 non-null  bool   
dtypes: bool(1), float64(1), int64(9), object(2)
memory usage: 2.5+ MB


## Drop Rating Table

In [7]:
#repo.replace(data=ratings)
#ratings = repo.getall()
#ratings.info()
ratings.export()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26509 entries, 0 to 26508
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           26509 non-null  int64  
 1   name         26509 non-null  object 
 2   category_id  26509 non-null  int64  
 3   category     26509 non-null  object 
 4   rating       26509 non-null  float64
 5   reviews      26509 non-null  int64  
 6   ratings      26509 non-null  int64  
 7   onestar      26509 non-null  int64  
 8   twostar      26509 non-null  int64  
 9   threestar    26509 non-null  int64  
 10  fourstar     26509 non-null  int64  
 11  fivestar     26509 non-null  int64  
 12  status       26509 non-null  int64  
dtypes: float64(1), int64(10), object(2)
memory usage: 2.6+ MB
