# Synthetic text generation

Synthetic text generation can improve the performance of generative AI applications by providing abundant, high-quality training data without exposing real-world private information. This is particularly valuable in sectors such as healthcare and finance, where sensitive information must be protected. Models trained on synthetic data can produce better text that augments human creativity and automates routine tasks, while privacy stays safeguarded.

This pipeline walks through the YData Fabric flow for augmenting existing unstructured datasets while protecting sensitive and private data through PII identification and masking.

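As a rough illustration of what PII identification and masking means for free text, the sketch below masks email addresses and US-style phone numbers with two simple regular expressions. It is not how YData Fabric performs this step: the `PII_PATTERNS` table and the `mask_pii` helper are hypothetical names introduced only for this example, and real PII detection covers many more entity types (names, addresses, identifiers).

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def mask_pii(text: str) -> str:
    """Replace each matched span with a placeholder naming the entity type."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


sample = "Contact Jane at jane.doe@example.com or 555-123-4567."
masked = mask_pii(sample)
print(masked)
```

Note that the person's name is left untouched here, which is one reason pattern matching alone is usually not enough for unstructured text.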
## Read an existing datasource

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='{datasource-id}')
dataset = datasource.dataset
# Getting the calculated Metadata to get the profile overview information in the labs
metadata = datasource.metadata
print(metadata)



Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 1
Number of rows: 75316
Duplicate rows: 20712
Target column:

Column detail:
  Column Data type Variable type Characteristics
0   text  longtext        string

0  duplicates  [dataset]


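The metadata reports a large number of duplicate rows. A minimal sketch of how such a count could be cross-checked with pandas is shown here on a toy single-column DataFrame, which is a stand-in and not the real data. It assumes that a duplicate means every repeat after the first occurrence of a row, which may differ from how the metadata computes its figure.

```python
import pandas as pd

# Toy stand-in for the single-column text dataset (not the real data).
df = pd.DataFrame(
    {"text": ["a note", "another note", "a note", "a note", "third"]}
)

# duplicated() flags every repeat after the first occurrence of a row.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduplicated = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

Whether to drop duplicates before training a synthesizer depends on whether the repeats carry signal (for example, frequent templated messages) or are simply collection noise.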

## Calculating the Metadata & the Profile

In [None]:
from ydata.profiling import ProfileReport

report = ProfileReport(dataset)
report_html = report.to_html()


## Setting the pipeline step outputs

In [None]:
# Build the pipeline step outputs: a table preview of the data and the profiling report
import json

df = dataset.head(15)

profile_pipeline_output = {
    'outputs' : [
        {
            'type': 'table',
            'storage': 'inline',
            'format': 'csv',
            'header': list(df.columns),
            'source': df.to_csv(header=False, index=False)  # omit the index so each row matches 'header'
        },
        {
            'type': 'web-app',
            'storage': 'inline',
            'source': report_html,
        },
    ]
}

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(profile_pipeline_output, metadata_file)

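The output step can be tried outside the pipeline with stand-in data. The sketch below builds the same two-entry structure from a toy DataFrame and a placeholder HTML string (both are assumptions, not the real dataset or report), writes it to a temporary directory, and reads it back to confirm the file is valid JSON. The index is left out of the CSV here so that each row has as many fields as the `header` list.

```python
import json
import os
import tempfile

import pandas as pd

# Toy stand-ins for the dataset preview and the profiling report HTML.
df = pd.DataFrame({"text": ["first note", "second, with a comma"]})
report_html = "<html><body>placeholder report</body></html>"

outputs = {
    "outputs": [
        {
            "type": "table",
            "storage": "inline",
            "format": "csv",
            "header": list(df.columns),
            "source": df.to_csv(header=False, index=False),
        },
        {"type": "web-app", "storage": "inline", "source": report_html},
    ]
}

# Written to a temporary directory here; the pipeline step writes to its working directory.
path = os.path.join(tempfile.mkdtemp(), "mlpipeline-ui-metadata.json")
with open(path, "w") as metadata_file:
    json.dump(outputs, metadata_file)

# Read the file back to confirm it round-trips.
with open(path) as metadata_file:
    loaded = json.load(metadata_file)

print([output["type"] for output in loaded["outputs"]])
```

Text fields that contain commas are quoted by `to_csv`, so the inline CSV stays parseable even for free-text columns.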