Problem Statement

The objective was to:

1. Generate a CSV file containing the following columns:
   - first_name
   - last_name
   - address
   - date_of_birth
2. Process the generated CSV file to anonymize sensitive information. The columns to be anonymized are:
   - first_name
   - last_name
   - address
3. Ensure the solution works with a large dataset, specifically around 2GB in size, and demonstrate that it can handle even larger datasets efficiently.
4. Utilize a distributed computing platform to process large datasets effectively. In this project, Snowflake was chosen for this purpose.
Approach

- Python: for generating synthetic data using the Faker library.
- Snowflake: a cloud-based data warehousing platform used for large-scale data processing and anonymization.
- SQL: for data manipulation and anonymization within Snowflake.
- GitHub: for sharing the project code.
- Google Drive: for sharing the large datasets, as GitHub has file size limitations.
Step 1: Data Generation

Python’s Faker library was used to generate synthetic first names, last names, addresses, and dates of birth. The full script is in the repository as samplefakedatagenerator.py; a sketch of the approach is shown below.
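In this sketch, the row count is an assumption chosen to land near the ~2GB target; samplefakedatagenerator.py in the repository is the authoritative script.

```python
# Illustrative sketch of a Faker-based CSV generator; see
# samplefakedatagenerator.py in the repository for the actual script.
import csv

from faker import Faker

fake = Faker()

with open("large_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["first_name", "last_name", "address", "date_of_birth"])
    # Assumed row count; adjust until the file reaches the target size.
    for _ in range(30_000_000):
        writer.writerow([
            fake.first_name(),
            fake.last_name(),
            fake.address().replace("\n", ", "),  # keep each record on one line
            fake.date_of_birth().isoformat(),
        ])
```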
Step 2: Loading Data into Snowflake

Since Snowflake’s web UI has a file upload limit of 250MB, the generated dataset needed to be split into smaller parts before loading.
Splitting the Large CSV File (large_dataset.csv)

To split the large CSV file into manageable parts, the following command was used in the terminal:

```bash
split -b 200m large_dataset.csv part_
```

This produced files named part_aa, part_ab, ..., part_aj. (Note that `split -b` divides the file at byte boundaries, so a CSV row can be cut in half where one part ends and the next begins; splitting by line count with `split -l` avoids this.) The parts can then be uploaded through the web UI, or staged and loaded programmatically, as sketched below.
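As an alternative to uploading each part manually, the split files can be staged and loaded with the snowflake-connector-python package. The sketch below is illustrative rather than the exact procedure used in this project; the connection parameters, the raw_data table name, and the local file path are assumptions.

```python
# Sketch: stage and load the split CSV parts with the Snowflake
# Python connector. Connection parameters, the raw_data table name,
# and the local path are placeholders/assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>", password="<password>", account="<account>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# Target table matching the generated CSV columns (assumed name).
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_data (
        first_name STRING,
        last_name STRING,
        address STRING,
        date_of_birth DATE
    )
""")

# Upload all split parts to the table's internal stage, then load them.
# Assumes the header line was stripped before splitting (after a
# byte-based split only the first part would contain it anyway).
cur.execute("PUT file:///path/to/part_* @%raw_data")
cur.execute("""
    COPY INTO raw_data
    FROM @%raw_data
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")
conn.close()
```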
Step 3: Data Anonymization

A target table was created and the sensitive columns (first_name, last_name, address) were anonymized using the SHA-256 hashing algorithm; the full SQL script is attached as Anonymization.sql. The anonymized data was then exported to a final CSV file named anonymized_data.csv. A sketch of the core steps follows.
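Below is a minimal sketch of the anonymization and export, again via the Python connector, using Snowflake’s built-in SHA2 function (SHA2(col, 256) computes a SHA-256 digest). The table names, single-file export settings, and local download path are assumptions; Anonymization.sql in the repository is the authoritative script.

```python
# Sketch of the anonymization and export steps; Anonymization.sql in
# the repository is the authoritative version. Connection parameters
# and table names are assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>", password="<password>", account="<account>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# Hash the sensitive columns with SHA-256; date_of_birth stays as-is.
cur.execute("""
    CREATE OR REPLACE TABLE anonymized_data AS
    SELECT
        SHA2(first_name, 256) AS first_name,
        SHA2(last_name, 256)  AS last_name,
        SHA2(address, 256)    AS address,
        date_of_birth
    FROM raw_data
""")

# Unload the table to its internal stage as one uncompressed CSV file,
# then download it locally. SINGLE = TRUE needs a MAX_FILE_SIZE large
# enough for the whole result (5 GB is Snowflake's upper bound).
cur.execute("""
    COPY INTO @%anonymized_data/anonymized_data.csv
    FROM anonymized_data
    FILE_FORMAT = (TYPE = CSV COMPRESSION = NONE)
    HEADER = TRUE
    SINGLE = TRUE
    MAX_FILE_SIZE = 5368709120
    OVERWRITE = TRUE
""")
cur.execute("GET @%anonymized_data/anonymized_data.csv file:///path/to/output/")
conn.close()
```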
Google Drive link to the large dataset files: https://drive.google.com/drive/folders/1tnv5quKsPEqH7pZsuEiPP-go8kLkg8sO?usp=sharing