## Create Test Data
This notebook creates test datasets. At present, two datasets are produced.
1. **Raw Augmented Dataset**: The development set, extended with additional samples for testing purposes.
2. **Preconditioned Dataset**: The raw augmented dataset with newlines in the text replaced by spaces, and the text content cast to string and UTF-8 encoded.

Some preprocessing methods require preconditioned text.

In [1]:
import warnings
import pandas as pd
import numpy as np
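# A hyphen is not valid in an import statement, so the module is loaded by its name as written
from importlib import import_module
IOService = import_module("appvocai-genailabslm.shared.io").IOService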
warnings.filterwarnings("ignore")

In [2]:
TEXT_COLUMN = "content"

## Get Data
Reads the development dataset and the augmentation samples.

In [3]:
# Raw Data
FP1 = "data/dev/00_raw/reviews.pkl"
df1 = IOService.read(filepath=FP1)
print(f"Development dataset has shape {df1.shape}")

# Augment Samples
FP2 = "tests/data/00_review_noise.csv"
df2 = IOService.read(filepath=FP2)
print(f"Augment samples has shape {df2.shape}")

Development dataset has shape (183010, 13)
Augment samples has shape (5, 13)
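
Because the two frames are concatenated row-wise in the next step, it can help to confirm first that they share the same columns. The following is a minimal sketch on small stand-in frames rather than the real data; `check_same_columns` is a hypothetical helper introduced here for illustration and is not part of `IOService`.

```python
import pandas as pd


def check_same_columns(left: pd.DataFrame, right: pd.DataFrame) -> None:
    """Raise ValueError if the two frames do not have the same set of columns."""
    mismatched = set(left.columns) ^ set(right.columns)
    if mismatched:
        raise ValueError(f"Column mismatch: {sorted(mismatched)}")


# Stand-in frames (the real datasets have 13 columns each)
dev = pd.DataFrame({"content": ["good app"], "rating": [5.0], "author": ["a"]})
extra = pd.DataFrame({"content": ["bad\napp"], "rating": [1.0], "author": ["b"]})

check_same_columns(dev, extra)  # passes silently when the columns agree
```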


## Augment Data
Combines the development dataset with the augmentation samples, then injects noise: `rating` is set to NaN for a random 10% of rows, and `author` is set to an empty string for a separately sampled 10%.

In [4]:
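# ignore_index=True gives each row a unique label, so the label-based
# noise assignments below touch exactly the sampled rows
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)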
df3.loc[df3.sample(frac=0.1).index, "rating"] = np.nan  # Add additional NA values
df3.loc[df3.sample(frac=0.1).index, "author"] = ""  # Add additional missing authors
print(f"Raw Test dataset has shape {df3.shape}")


Raw Test dataset has shape (183015, 13)
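
The effect of the noise step can be checked on a small stand-in frame. The sketch below is illustrative only: the toy data, the `random_state` seeds, and the `ignore_index=True` argument are assumptions added here so that the result is reproducible and row labels are unique.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the development data (25 rows) and the augmentation samples (5 rows)
dev = pd.DataFrame({"rating": [5.0] * 25, "author": ["u"] * 25})
extra = pd.DataFrame({"rating": [1.0] * 5, "author": ["v"] * 5})

# Unique row labels ensure each sampled label maps to exactly one row
toy = pd.concat([dev, extra], axis=0, ignore_index=True)

# Sample 10% of rows for each kind of noise; the seeds are arbitrary
toy.loc[toy.sample(frac=0.1, random_state=0).index, "rating"] = np.nan
toy.loc[toy.sample(frac=0.1, random_state=1).index, "author"] = ""

n_missing_ratings = int(toy["rating"].isna().sum())
n_empty_authors = int((toy["author"] == "").sum())
print(n_missing_ratings, n_empty_authors)
```

Without unique labels, a label sampled from one source frame would also match the row with the same label from the other frame, so slightly more rows than intended could be modified.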


## Save Raw Test Data

In [5]:
FP3 = "tests/data/01_reviews_raw.pkl"
IOService.write(filepath=FP3, data=df3)
print(f"Raw Test dataset saved at {FP3}")

Raw Test dataset saved at tests/data/01_reviews_raw.pkl


## Precondition Data
Next, we precondition the data for downstream method testing. The text column is cast to string, newlines are replaced with spaces, and the text is round-tripped through UTF-8 encoding.

In [6]:
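# Cast the text column to str
df3[TEXT_COLUMN] = df3[TEXT_COLUMN].astype(str)
# Replace newlines with a single space (literal match, not a regex)
df3[TEXT_COLUMN] = df3[TEXT_COLUMN].str.replace("\n", " ", regex=False)
# Round-trip through UTF-8; valid text is unchanged, and text that
# cannot be encoded raises an error
df3[TEXT_COLUMN] = df3[TEXT_COLUMN].apply(
    lambda x: x.encode("utf-8").decode("utf-8"))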
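
The same steps can be wrapped in a function and tried on a few strings. This is a minimal sketch, not the project's own implementation; `precondition_text` and the sample strings are hypothetical names and data introduced for illustration.

```python
import pandas as pd


def precondition_text(series: pd.Series) -> pd.Series:
    """Cast to str, replace newlines with spaces, and round-trip through UTF-8."""
    out = series.astype(str)
    out = out.str.replace("\n", " ", regex=False)
    return out.apply(lambda x: x.encode("utf-8").decode("utf-8"))


sample = pd.Series(["great\napp", "café ☕\nnice"])
cleaned = precondition_text(sample)
print(cleaned.tolist())
```

Note that the UTF-8 round trip leaves valid text such as accented characters and emoji untouched; it acts as a validity check rather than a transformation.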


## Save Preconditioned Test Data

In [7]:
FP4 = "tests/data/02_reviews_preconditioned.pkl"
IOService.write(filepath=FP4, data=df3)
print("Preconditioned Test Data Complete.")

Preconditioned Test Data Complete.
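
After saving, a quick round-trip check can confirm that the file reads back unchanged and that no newlines remain in the text column. The sketch below uses plain pandas pickling and a temporary directory on a stand-in frame; it assumes nothing about how `IOService` stores data, and the file name is illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the preconditioned frame
frame = pd.DataFrame({"content": ["great app", "nice"], "rating": [5.0, 4.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reviews_preconditioned.pkl"
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame should match the original, with no newlines left in the text
round_trip_ok = restored.equals(frame)
has_newline = bool(restored["content"].str.contains("\n", regex=False).any())
print(round_trip_ok, has_newline)
```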
