
Memory Usage - Gaussian Copula Trained Model consuming high memory when generating synthetic data #304

Closed
AnupamaGangadhar opened this issue Jan 27, 2021 · 2 comments


AnupamaGangadhar commented Jan 27, 2021

Environment details

  • SDV version: sdv==0.4.5
  • Python version: Python 3.7.4
  • Operating System: Catalina 10.15.7 (19H2)

Problem description

The trained model is unable to generate synthetic data for certain sample sizes.

What I already tried

I am able to train the model.
I am unable to generate synthetic data: the Python process is killed before completion.

from memory_profiler import profile  # provides the @profile decorator; also injected when run via `python -m memory_profiler`
from sdv.tabular import GaussianCopula


@profile
def my_func():
    # Load a previously fitted GaussianCopula model and draw synthetic rows from it.
    model_file = "./100.pkl"
    sample_size = 5000
    model = GaussianCopula.load(model_file)
    syndata = model.sample(sample_size)
    #syndata.to_csv(syn_csv_file, header=True, index=False)


if __name__ == '__main__':
    my_func()

$ python -m memory_profiler testdatagen.py

Sampling fails at 5000 records. The memory profile of a successful run with 2000 records is given below.

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    10  115.004 MiB  115.004 MiB           1   @profile
    11                                         def my_func():
    12  115.004 MiB    0.000 MiB           1   	model_file = "/tmp/100.pkl"
    13  115.004 MiB    0.000 MiB           1   	sample_size = 2000
    14  115.004 MiB    0.000 MiB           1   	syn_csv_file = "/tmp/100.csv"
    15  403.266 MiB  288.262 MiB           1   	model = GaussianCopula.load(model_file)
    16  138.746 MiB -264.520 MiB           1   	syndata = model.sample(sample_size)

Data used for training:
JSON of the format below, 500 records.

        {
          "row": {
            "ID": "491",
            "Card Type Code": "JC",
            "Card Type Full Name": "xxx",
            "Issuing Bank": "xxx",
            "Card Number": "354268752674xxxx",
            "Card Holder's Name": "yyy yyyyy",
            "Issue Date": "xx/xxxx",
            "Expiry Date": "xx/xxxx",
            "Billing Date": "20",
            "Card PIN": "xxxx",
            "Credit Limit": "102300",
            "Age": "xx",
            "SSN": "xxx-xx-xxxx",
            "JobTitle": "Software Engineer",
            "Additional Details": "xxxx 1510 7973 1294 45 lteenxxx@xxx.edu 509-xxx-8996 sixxxlemachines.xxx Wolf:7009751759485477"
          }
        }

I am able to generate the synthetic data using the CTGAN model. The memory usage is given below.

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    11  191.926 MiB  191.926 MiB           1   @profile
    12                                         def my_func():
    13  191.926 MiB    0.000 MiB           1   	model_file = "/tmp/102.pkl"
    14  191.926 MiB    0.000 MiB           1   	sample_size = 10000
    15  191.926 MiB    0.000 MiB           1   	syn_csv_file = "/tmp/102.csv"
    16 1123.863 MiB  931.938 MiB           1   	model = CTGAN.load(model_file)
    17  678.570 MiB -445.293 MiB           1   	syndata = model.sample(sample_size)
    18  679.195 MiB    0.625 MiB           1   	syndata.to_csv(syn_csv_file, header=True,index=False)

One of the papers I read about CTGAN says
A Gaussian copula with appropriate margins generates the features, and the different parts of the development process are modeled with successive neural nets. The simulation machine accommodates only a few covariates; the generation of a large number of features with the Gaussian copula could lead to unrealistic combinations of factor levels.

Could this be the reason for the behaviour seen?
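
To get a feel for how large the model's internal representation becomes when every categorical column is one-hot encoded, here is a rough back-of-the-envelope estimate. The column cardinalities are assumptions for illustration (columns such as Card Number, SSN and Card Holder's Name are close to unique across the 500 training rows), not measurements:

n_samples = 5000
# Assumed number of distinct values per categorical column (illustrative only).
assumed_cardinalities = {
    "Card Number": 500,
    "SSN": 500,
    "Card Holder's Name": 500,
    "Additional Details": 500,
    "Issuing Bank": 50,
    "Card Type Code": 5,
}

dummy_columns = sum(assumed_cardinalities.values())
bytes_per_value = 8  # float64

matrix_mib = n_samples * dummy_columns * bytes_per_value / 1024 ** 2
print(f"{dummy_columns} one-hot columns")
print(f"~{matrix_mib:.0f} MiB for a single dense {n_samples} x {dummy_columns} float64 matrix")

One matrix of that size is still manageable, but the copula also has to model correlations across all of those columns, and the transformers keep intermediate copies while reversing the encoding, so a handful of near-unique columns multiplies the footprint quickly as the sample size grows.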

@pvk-developer (Member) commented:

Hi @AnupamaGangadhar , thank you for reporting this issue!

I have been able to reproduce the problem when the dataset has a large number of unique categorical values, like yours. Here is a screen capture of the memory usage while sampling in such a scenario, which matches what you are describing.

[Screenshot from 2021-03-11 18-30-47: memory usage while sampling]

We are working on a fix for this in RDT to reduce the memory usage (sdv-dev/RDT#156), but in the meantime I recommend changing the categorical transformer to categorical instead of the default one_hot_encoding.

This may slightly reduce how well the model learns the correlations of some of the categorical columns, but it completely gets rid of the memory usage problem.

model = GaussianCopula(categorical_transformer='categorical')
model.fit(data)
model.sample()
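
One note on applying this to the script above: the transformer is chosen when the model is fitted, so an already pickled model such as 100.pkl has to be re-fitted and re-saved with the new setting. A minimal sketch, assuming the 500 training records are stored as a JSON list of {"row": {...}} objects (the file names are placeholders):

import json

import pandas as pd
from sdv.tabular import GaussianCopula

# Load the training records into a flat DataFrame.
with open("training_data.json") as f:  # placeholder path
    records = [entry["row"] for entry in json.load(f)]
data = pd.DataFrame(records)

# Re-fit with the lighter 'categorical' transformer and save the new model.
model = GaussianCopula(categorical_transformer='categorical')
model.fit(data)
model.save("./100_categorical.pkl")  # placeholder file name

# Later, sampling works exactly as in the original script.
model = GaussianCopula.load("./100_categorical.pkl")
syndata = model.sample(5000)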

@pvk-developer added this to the 0.9.0 milestone Mar 16, 2021
@pvk-developer (Member) commented:

Hi @AnupamaGangadhar , we have solved issue sdv-dev/RDT#156 in RDT, and this problem has been fixed, as you can see in the screenshot below: the process kept the same RAM usage while fitting and sampling, with only a small increase during fitting that later decreases, and no increase at all while sampling.

[Screenshot from 2021-03-16 19-09-27: memory usage while fitting and sampling after the fix]

The fix will be included in the next release. Meanwhile, you can install RDT's release candidate to try it out.
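
If the release candidate is published on PyPI as a pre-release, pip's --pre flag is one way to pick it up (a generic command, not official install instructions from the maintainers):

$ pip install --upgrade --pre rdt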

If the issue persists, please feel free to reopen it.

@csala added the bug (Something isn't working) label and removed the question (General question about the software) label Mar 31, 2021