# Read & Profile a database

This Jupyter notebook provides an in-depth analysis of the Berka database, which contains detailed financial transaction and card records. The dataset includes information on customer demographics, account balances, transactions, and loans. The primary use case of this analysis is to generate a synthetic database that mirrors the characteristics and patterns of the original data. 
Through this notebook, we explore how to build a pipeline that reads the most up-to-date data, trains a synthetic data generator, and writes the generated data to a new database.  

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='f1a18828-bb2b-442d-b0cd-b581ad96b1e9', namespace='4993afef-5f60-40a7-a61d-b42ceb77016c')
dataset = datasource.dataset
# Getting the precomputed Metadata that backs the profile overview shown in the labs
metadata = datasource.metadata
print(metadata)

This may cause some slowdown.
Consider scattering data ahead of time and using futures.


MultiMetadata Summary

Tables Summary
Number of tables: 9
 
  Table name  # cols  # nrows  Primary keys             Foreign keys PK characteristics                           FK characteristics Notes
0     append       3       20            []                                                                                               
1   district      16       77          [a1]                                        [id]                                                   
2    account       4     4500  [account_id]            [district_id]               [id]                      {'district_id': ['id']}      
3     client       6     5369   [client_id]            [district_id]               [id]                      {'district_id': ['id']}      
4       disp       4     5369     [disp_id]  [client_id, account_id]               [id]  {'client_id': ['id'], 'account_id': ['id']}      
5       loan       9      682     [loan_id]             [account_id]          
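
The summary above lists a primary key per table and the foreign keys that point at it. As a rough illustration of what such a relation implies, the sketch below checks key uniqueness and looks for orphaned foreign-key values using plain pandas. The two frames are small made-up stand-ins, not the real Berka rows, and the mapping of `district_id` onto `a1` is an assumption read off the summary; `orphan_keys` is a hypothetical helper, not part of the YData SDK.

```python
import pandas as pd

# Hypothetical stand-ins for two Berka tables; the values are made up.
district = pd.DataFrame({"a1": [1, 2, 3]})
account = pd.DataFrame(
    {"account_id": [10, 11, 12, 13], "district_id": [1, 2, 2, 9]}
)


def orphan_keys(child, fk, parent, pk):
    """Return the sorted foreign-key values in `child` with no match in `parent`."""
    missing = ~child[fk].isin(parent[pk])
    return sorted(child.loc[missing, fk].unique().tolist())


# A primary key should contain no repeated values.
pk_is_unique = district["a1"].is_unique

# Assumed relation: account.district_id references district.a1.
orphans = orphan_keys(account, "district_id", district, "a1")
print(pk_is_unique, orphans)
```

An empty list from `orphan_keys` means every child row points at an existing parent row, which is the property a synthetic copy of the database would also need to preserve.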

In [17]:
import pandas as pd

tables_info = []
for k, table in metadata.items():
    tables_info.append({
        "Table name": k,
        "# cols": table.ncols,
        "# nrows": table.summary['nrows'],
    })

tables_info = pd.DataFrame(tables_info)
tables_info

Unnamed: 0,Table name,# cols,# nrows
0,append,3,20
1,district,16,77
2,account,4,4500
3,client,6,5369
4,disp,4,5369
5,loan,9,682
6,order,6,6471
7,trans,10,135000
8,card,4,892


In [None]:
tables_info.to_csv('og_tables_info.csv', index=True)
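
Writing with `index=True` stores the row index as a first column with an empty header, which pandas labels `Unnamed: 0` when the file is read back without further options. The sketch below shows the round trip on an in-memory buffer; `tables_info_demo` is a two-row stand-in using figures from the summary above, and the buffer replaces the real file only to keep the example self-contained.

```python
import io

import pandas as pd

# Two-row stand-in for tables_info (values taken from the summary above).
tables_info_demo = pd.DataFrame(
    {"Table name": ["append", "district"], "# cols": [3, 16], "# nrows": [20, 77]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
tables_info_demo.to_csv(buffer, index=True)

# Reading back naively exposes the index as an extra, unnamed column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 restores the first column as the index instead.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```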

## Profile the trans and card tables

In [2]:
from ydata.profiling import ProfileReport

report_trans = ProfileReport(dataset['trans'], title='Profiling Berka transactions') 
report_card = ProfileReport(dataset['card'], title='Profiling Berka cards')

report_trans_html = report_trans.to_html()
report_card_html = report_card.to_html()


INFO: 2024-08-07 09:46:37,482 generated new fontManager
INFO: 2024-08-07 09:46:37,813 Pandas backend loaded 1.5.3
INFO: 2024-08-07 09:46:37,820 Numpy backend loaded 1.23.5
INFO: 2024-08-07 09:46:37,822 Pyspark backend NOT loaded
INFO: 2024-08-07 09:46:37,823 Python backend loaded





(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'cannot reindex on an axis with duplicate labels')
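
The pandas error quoted in the warning, `cannot reindex on an axis with duplicate labels`, is typically raised when an index or column axis contains repeated labels. Whether that is the cause here cannot be told from the output alone, so the following is only a sketch of how one might check a frame for repeated index labels before profiling. The frame is made up and stands in for a table such as `trans`.

```python
import pandas as pd

# Made-up frame whose index repeats the label 1.
df = pd.DataFrame({"amount": [100.0, 250.0, 75.0]}, index=[0, 1, 1])

# Detect whether any index label occurs more than once.
has_duplicate_index = not df.index.is_unique

# List the labels that are repeated.
duplicated_labels = df.index[df.index.duplicated()].unique().tolist()

# One possible remedy: replace the index with a fresh 0..n-1 range.
clean = df.reset_index(drop=True)

print(has_duplicate_index, duplicated_labels, clean.index.is_unique)
```

The same check applies to columns via `df.columns.is_unique`. Resetting the index discards the original labels, so it only fits when those labels carry no meaning.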



### Pipeline outputs

In [7]:
# Assemble the pipeline outputs: the table summary and both profile reports
import json

profile_pipeline_output = {
    'outputs': [
        {
            'type': 'table',
            'storage': 'inline',
            'format': 'csv',
            'header': list(tables_info.columns),
            # index=False keeps each CSV row aligned with the header columns
            'source': tables_info.to_csv(header=False, index=False)
        },
        {
            'type': 'web-app',
            'storage': 'inline',
            'source': report_trans_html,
        },
        {
            'type': 'web-app',
            'storage': 'inline',
            'source': report_card_html,
        },
    ]
}

# Write the outputs to the pipeline UI metadata file
with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(profile_pipeline_output, metadata_file)
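
A useful sanity check on an inline table output of this shape is that every CSV row has as many fields as the declared `header`. The sketch below does this on a made-up payload that reuses the same keys. `row_widths_match` is a hypothetical helper, and the payload goes through a JSON round trip in memory rather than through the real `mlpipeline-ui-metadata.json` file.

```python
import csv
import io
import json


def row_widths_match(table_output):
    """Check that every inline CSV row has as many fields as the header."""
    expected = len(table_output["header"])
    rows = csv.reader(io.StringIO(table_output["source"]))
    return all(len(row) == expected for row in rows if row)


# Made-up payload using the same keys as the table entry above.
payload = {
    "outputs": [
        {
            "type": "table",
            "storage": "inline",
            "format": "csv",
            "header": ["Table name", "# cols", "# nrows"],
            "source": "append,3,20\ndistrict,16,77\n",
        }
    ]
}

# Round-trip through JSON, as the file on disk would be.
reloaded = json.loads(json.dumps(payload))
ok = row_widths_match(reloaded["outputs"][0])

# A row that carries an extra leading index field no longer matches.
misaligned = dict(reloaded["outputs"][0], source="0,append,3,20\n")
print(ok, row_widths_match(misaligned))
```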
