<a href="https://colab.research.google.com/github/ygautomo/02-Prospera-Datawarehouse/blob/master/01-DW-Project-Data-PreProcessing-20210801.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Warehouse Project - Google Drive, Google Cloud Storage & AWS S3 PreProcessing Files**
## Data Warehouse Project Steps and Code
Status : Last Update 20210801

## **Python Environment Setup**
We will be using several different libraries throughout these steps. If you've successfully completed the [installation instructions](https://github.com/cs109/content/wiki/Installing-Python), all of the following statements should run.

### Setup Python Environment

In [None]:
# Final Update 20201201
# Reference https://docs.python.org/3/py-modindex.html
# Reference https://towardsdatascience.com/10-tips-for-a-better-google-colab-experience-33f8fe721b82

# Access system-specific parameters and functions. This module provides a portable way of using operating system dependent functionality
import sys
print("Python version:       %6.6s (need at least 3.5.0)" % sys.version)              # (need at least 3.5.0)

# IPython: tools for interactive and parallel computing in Python
import IPython
print("IPython version:      %6.6s (need at least 6.0.0)" % IPython.__version__)      # (need at least 6.0.0)

# Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python
import matplotlib
import matplotlib.pyplot as plt
print("Matplotlib version:   %6.6s (need at least 3.0.0)" % matplotlib.__version__)   # (need at least 3.0.0)

# NumPy is the fundamental package for scientific computing with Python
import numpy as np
print("Numpy version:        %6.6s (need at least 1.15.0)" % np.__version__)          # (need at least 1.15.0)

# Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
# built on top of the Python programming language
import pandas as pd
print("Pandas version:       %6.6s (need at least 0.20.0)" % pd.__version__)          # (need at least 0.20.0)

# Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities
import sklearn as sk
print("Scikit-Learn version: %6.6s (need at least 0.15.0)" % sk.__version__)          # (need at least 0.15.0)

# Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures
import seaborn as sns
print("Seaborn version:      %6.6s (need at least 0.5.0)" % sns.__version__)          # (need at least 0.5.0)

In [None]:
# Customized python environment Setup
pd.set_option('display.precision', 2)

# Pure python package for reading/writing dBase, FoxPro, and Visual FoxPro .dbf files (including memos)
!pip install dbf
import dbf

# Higher-order functions and operations on callable objects. The functools module is for higher-order functions: functions that act on or return other functions
from functools import reduce

# Unix style pathname pattern expansion. The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order
import glob

# Core tools for working with streams. The io module provides Python’s main facilities for dealing with various types of I/O
from io import BytesIO

# Encode and decode the JSON format- standard library module
import json

# Mathematical functions (sin() etc.). This module provides access to the mathematical functions defined by the C standard
# import math

# Miscellaneous operating system interfaces. This module provides a portable way of using operating system dependent functionality
import os

# Generate pseudo-random numbers. This module implements pseudo-random number generators for various distributions
import random

# Convert DBF files to CSV, DataFrames, HDF5 tables, and SQL tables. Python3 compatible
!pip install simpledbf
from simpledbf import Dbf5

# Urllib is a package that collects several modules for working with URLs
import urllib.request

# **Machine Learning Pipeline:**
![alt text](https://drive.google.com/uc?id=1zUK9aLiPk1zReXV19RMUQjqe3BrcvbyM)

# **Step 01 - Project Goals & Problems**
* Develop a data warehouse for Prospera: take the data from Egnyte and transform it into Google BigQuery, which serves as the data warehouse platform.

# **Step 02 - Data Retrieval**
Data retrieval: This is mainly data collection, extraction, and acquisition from various data sources and data stores.

Data retrieval process: 
1. Take raw data from Egnyte
2. Standardize file name (linux file system)
3. Convert into csv files
4. Check and review data
5. Convert data types if necessary
6. Merge data if necessary
7. Create data description and save into json
8. Put raw data into Google Cloud Storage
9. Upload and transform the data into Google BigQuery

Final Update 20210801
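
Step 2 of the process above (standardizing file names for a Linux file system) can be sketched as follows. This is a minimal illustration, not the project's actual renaming rule: the function name `standardize_name` and the lowercase, hyphen-separated convention are assumptions made for the example.

```python
import re
from pathlib import PurePosixPath


def standardize_name(filename: str) -> str:
    """Return a lowercase, hyphen-separated file name (hypothetical convention)."""
    path = PurePosixPath(filename)
    # Collapse every run of characters other than a-z and 0-9 into one hyphen.
    stem = re.sub(r"[^a-z0-9]+", "-", path.stem.lower()).strip("-")
    # Keep the extension, lowercased, so the file type stays recognizable.
    return stem + path.suffix.lower()


print(standardize_name("Data SE2006 Listing.zip"))
```

Names without spaces or uppercase letters are convenient here because they can be passed to `!ls` and `!gsutil` without shell escaping.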

## 0201 01 Data Retrieval Environment Variables

In [None]:
# Create Google Drive Environment Variables
# Sakernas data pre-processing
workingDirectory = r'/content/drive/MyDrive/04\ Rawdata/04-sakernas'   # raw string: the backslash escapes the space for the shell

# Susenas data pre-processing
# workingDirectory = "/content/drive/MyDrive/04\ Rawdata/06-susenas"

print(workingDirectory, type(workingDirectory))

## 0202 Mount Google Drive, Google Cloud Storage & AWS S3

### 0202 01 Mount Google Drive

In [None]:
# Mount Google Drive

# Colaboratory-specific python libraries- non-standard library module
# !pip install google-colab
from google.colab import drive

drive.mount('/content/drive')

# drive.flush_and_unmount()
# drive.mount('/content/drive', force_remount=True)

!pwd
!ls -1

In [None]:
!ls drive/My\ Drive/*/ -d
# !ls drive/My\ Drive/* -d
# !stat drive/My\ Drive/04\ Rawdata/04-sakernas/sakernas-2000 --format=%n:%s *
# os.chdir(workingDirectory)
# !du -ak

In [None]:
# Create Google Drive dictionary
# import json                     # Encode and decode the JSON format- standard library module
# import os                       # Miscellaneous operating system interfaces- standard library module

# Sakernas data pre-processing
# workingDirectory = '/content/drive/MyDrive/04\ Rawdata/04-sakernas'
# print(workingDirectory, type(workingDirectory))
# os.chdir(workingDirectory)
# !ls -1

# Susenas data pre-processing
# workingDirectory = "/content/drive/MyDrive/04\ Rawdata/06-susenas"
# print(workingDirectory, type(workingDirectory))
# os.chdir(workingDirectory)
# !ls -1

lDirectoryKeys = !ls {workingDirectory} -1
lDirectoryValues = []
for idx, el in enumerate(lDirectoryKeys):
  strDrive = workingDirectory
  # strDrive = workingDirectory.replace("\\","\")
  lDirectoryValues.append(strDrive+"/"+el)
  # print(idx, el)

dictDrive = dict(zip(lDirectoryKeys, lDirectoryValues))
# print(dictDrive)
print(json.dumps(dictDrive, indent = 4))

# Copy data using gsutil cp command
# !gsutil -m cp -r /content/drive/My\ Drive/04\ Rawdata/06-susenas/susenas-2000/ gs://bucket-prospera-datawarehouse-201014-01/01-rawdata/01-bps/06-susenas/
# !gsutil -m cp gs://prospera-spending-bucket-201115/DataRKASAwal/v_rkas.csv /content/drive/MyDrive/04\ Rawdata/07-bos/
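
The same dictionary can also be built without shelling out to `ls`, which avoids the backslash-escaped path. The sketch below is an alternative, not the method used above: the helper name `build_directory_dict` is hypothetical, it expects a plain (unescaped) path, and unlike `ls` it keeps sub-directories only.

```python
import os
import tempfile


def build_directory_dict(root: str) -> dict:
    """Map each sub-directory name under root to its full path."""
    return {
        name: os.path.join(root, name)
        for name in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, name))
    }


# Demonstration on a throwaway directory that mimics the Drive layout.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "04 Rawdata")
    for name in ("sakernas-2000", "sakernas-2001"):
        os.makedirs(os.path.join(root, name))
    demo = build_directory_dict(root)

print(sorted(demo))
```

The resulting paths contain literal spaces, so they would need quoting when interpolated into a shell command.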

In [None]:
# Sakernas data pre-processing
!ls {dictDrive["sakernas-2000"]}/data/*

# Susenas data pre-processing
# !ls {dictDrive["susenas-2007"]}/data\ tnp2k/*.csv

### 0202 02 Mount Google Cloud Storage

In [None]:
# Mount Google Cloud Storage

# Colaboratory-specific python libraries non-standard library module
# !pip install google-colab
from google.colab import auth
auth.authenticate_user()

# Project Datawarehouse Prospera {billing: prospera}
# gcpProjectID = 'datawarehouse-001'
# gcpBucketID = 'bucket-prospera-01'

# Project Datawarehouse Prospera {acct: ygautomo; billing: GCP Billing Account 02}
gcpProjectID = 'prospera-datawarehouse-201014'
# gcpBucketID = 'bucket-prospera-datawarehouse-201014-01'

# Project Datawarehouse Prospera {acct: ygautomo; billing: GCP Billing Account 02}
# gcpProjectID = 'smile-database-210122'
# gcpBucketID = 'bucket-smile-database-01'

# !gcloud config set project {gcpProjectID}
# !gsutil ls gs://{gcpBucketID}/01-rawdata/

# Project Datawarehouse Prospera Spending
# gcpProjectID = 'prospera-spending-201115'
# gcpBucketID = 'prospera-spending-bucket-201115'

!gcloud config set project {gcpProjectID}
!gsutil ls
# !gsutil ls gs://{gcpBucketID}/

In [None]:
# Create Google Storage Bucket dictionary
lBucketValues = !gsutil ls
lBucketKeys = []
for idx, el in enumerate(lBucketValues):
  strBucket = lambda x: "bucket0" if x < 10 else "bucket"
  lBucketKeys.append(strBucket(idx+1)+str(idx+1))
  # print(idx, el)

dictBucket = dict(zip(lBucketKeys, lBucketValues))
print(dictBucket)
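
The zero-padded keys (`bucket01`, `bucket02`, ...) can be produced with a format specification instead of a lambda. A minimal sketch, assuming the input is the list of bucket URLs returned by `gsutil ls`; the function name `build_bucket_dict` is hypothetical.

```python
def build_bucket_dict(bucket_urls):
    """Key each bucket URL as bucket01, bucket02, ... in listing order."""
    # {idx:02d} pads single-digit numbers with a leading zero.
    return {
        f"bucket{idx:02d}": url
        for idx, url in enumerate(bucket_urls, start=1)
    }


print(build_bucket_dict(["gs://bucket-a/", "gs://bucket-b/"]))
```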

In [None]:
# Mount Google Cloud Storage -- Setup Boto files

# Overview current credentials
!cat '/content/.config/legacy_credentials/{{email}}/.boto'

In [None]:
# Cloud Storage Client Libraries

# Create/interact with Google Cloud Storage binary large objects (blob)
from google.cloud import storage
client = storage.Client(project=gcpProjectID)
bucket = client.get_bucket(gcpBucketID)

# all_blobs = list(bucket.list_blobs())
blobs = bucket.blob('DataRKASAwal/v_rkas.zip')

### 0202 03 Access AWS S3
Setup boto files

In [None]:
# Access AWS S3 -- Boto files

# Boto3 is the Amazon Web Services (AWS) SDK for Python - non-standard Python library
# !pip install boto3
# import boto3

# reference https://realpython.com/python-boto3-aws-s3/
# Add Credentials within .boto files
# [OAuth2]
# client_id = 
# client_secret = 
#
# [Credentials]
# gs_oauth2_refresh_token = 
# aws_access_key_id = 
# aws_secret_access_key = 
#
# [s3]
# use-sigv4=True
# host=s3.us-east-2.amazonaws.com

# !gcloud init
# !gsutil config
# !gsutil version -l
# !gcloud config list

# !cat '/content/drive/My Drive/07 Google Colab/.boto'
# !cat '/content/.config/legacy_credentials/{{email}}/.boto'

# !gsutil cp '/content/drive/My Drive/07 Google Colab/.boto' '/content/.config/legacy_credentials/{{email}}/.boto'

awsBucketID = '01-rawdata'
!gsutil ls s3://{awsBucketID}

### 0202 04 Access Google BigQuery


In [None]:
# Access Google Big Query

# Google Colaboratory tools
# !pip install google-colab
from google.colab import auth
auth.authenticate_user()

# Project Datawarehouse Prospera {billing: prospera} -- Set BQ working directory
# gcpProjectID = 'datawarehouse-001'
# gcpBucketID = 'bucket-prospera-01'
dictBQDirectory = {
  'ifls': 'datawarehouse-001:01_ifls',
  'mfg': 'datawarehouse-001:02_manufacturing',
  'podes': 'datawarehouse-001:03_podess',
  'sakernas': 'datawarehouse-001:04_sakernas',
  'se': 'datawarehouse-001:05_sensus_ekonomi',
  'susenas': 'datawarehouse-001:06_susenas'
}

# Project Datawarehouse Prospera {billing: ygautomo} -- Set BQ working directory
gcpProjectID = 'prospera-datawarehouse-201014'
gcpBucketID = 'bucket-prospera-datawarehouse-201014-01'
dictBQDirectory = {}

!gcloud config set project {gcpProjectID}
!bq ls --project_id={gcpProjectID}
# !bq ls --project_id={gcpProjectID} {dictBQDirectory['susenas']}
# !bq help

In [None]:
# Create BigQuery Dataset dictionary
lDataset = !bq ls --project_id={gcpProjectID}
lDataset = lDataset.fields(0)
lDataset = lDataset[2:len(lDataset)]
print(lDataset)
lDatasetKeys = []
lDatasetValues = []
for el in lDataset:
  lDatasetKeys.append(el[3:])
  lDatasetValues.append(gcpProjectID+":"+el)

print(lDatasetKeys)
dictBQDirectory = dict(zip(lDatasetKeys, lDatasetValues))
print(dictBQDirectory)
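
The parsing above relies on slicing off the first two lines of the `bq ls` output and the first three characters of each dataset name. A more explicit sketch is shown below. It assumes the output consists of a `datasetId` header, a dashed separator, and one dataset per line, and that dataset names carry a numeric prefix such as `06_`; the function name `parse_bq_datasets` is hypothetical.

```python
import re


def parse_bq_datasets(lines, project_id):
    """Turn `bq ls` output lines into {short_name: 'project:dataset'}."""
    datasets = {}
    for line in lines:
        name = line.strip()
        # Skip blank lines, the "datasetId" header, and the dashed separator.
        if not name or name == "datasetId" or set(name) == {"-"}:
            continue
        # Drop a leading numeric prefix such as "06_" to get the short key.
        key = re.sub(r"^\d+_", "", name)
        datasets[key] = f"{project_id}:{name}"
    return datasets


sample = ["  datasetId  ", " ----------- ", "  04_sakernas  ", "  06_susenas  "]
print(parse_bq_datasets(sample, "prospera-datawarehouse-201014"))
```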

In [None]:
!bq ls 06_susenas

## Step 0203 Google Drive PreProcessing Files

In [None]:
# List Google Drive directory contents

# Encode and decode the JSON format standard library module
# import json

# print(dictDrive)
print(json.dumps(dictDrive, indent = 4))
!ls {dictDrive["sakernas-2000"]}/*
# !ls {dictDrive["susenas-2000"]}/*

In [None]:
# List Google Cloud Storage directory contents 
print(dictBucket["bucket01"]+"01-rawdata/")
# !gsutil ls {dictBucket["bucket01"]}"*"
!gsutil ls {dictBucket["bucket01"]}"01-rawdata/"

In [None]:
# Copy data from Google Drive to Google Cloud Storage
# Miscellaneous operating system interfaces
import os

# workingDirectory = '/content/drive/My Drive/04 Rawdata/06-susenas/susenas-2000'
workingDirectory = '/content/drive/My Drive/99 Shared/Database Plan'
print(workingDirectory)
os.chdir(workingDirectory)
!ls

# Copy data using gsutil cp command
# !gsutil -m cp -r {dictDrive["susenas-2006"]} {dictBucket["bucket01"]}"01-rawdata/01-bps/06-susenas/"
# !gsutil -m cp -r /content/drive/My\ Drive/04\ Rawdata/06-susenas/susenas-2000/ gs://bucket-prospera-datawarehouse-201014-01/01-rawdata/01-bps/06-susenas/
# !gsutil -m cp /content/drive/MyDrive/04\ Rawdata/07-bos/v_rkas.dta gs://bucket-prospera-datawarehouse-201014-01/01-rawdata/03-bos/
# !gsutil -m cp gs://prospera-spending-bucket-201115/DataRKASAwal/v_rkas.csv /content/drive/MyDrive/04\ Rawdata/07-bos/
!gsutil -m cp /content/drive/My\ Drive/99\ Shared/Database\ Plan/202101240724_smile_prod.sql gs://bucket-smile-database-01/01-database/

In [None]:
# Copy data from Google Drive to Google Cloud Storage
# Miscellaneous operating system interfaces
import os

'''workingDirectory = '/content/drive/My Drive/04 Rawdata/06-susenas/susenas-2000'
print(workingDirectory)
os.chdir(workingDirectory)
!ls'''

# Copy data using gsutil cp command
# !gsutil -m cp /content/drive/MyDrive/04\ Rawdata/07-bos/v_rkas.dta gs://bucket-prospera-datawarehouse-201014-01/01-rawdata/03-bos/
# !gsutil -m cp -r /content/drive/MyDrive/04\ Rawdata/04-sakernas/sakernas-2012/ gs://bucket-prospera-datawarehouse-201014-01/01-rawdata/01-bps/04-sakernas/
!gsutil -m cp -r /content/drive/MyDrive/10\ Egnyte/ gs://bucket-prospera-datawarehouse-201014-01/01-rawdata/
# !gsutil -m cp /content/drive/MyDrive/04\ Rawdata/04-sakernas/sakernas-2008/data/sakernas0808-dbf-xxx.csv gs://bucket-prospera-datawarehouse-201014-01/01-rawdata/01-bps/04-sakernas/sakernas-2008/data/
# !gsutil -m cp gs://prospera-spending-bucket-201115/DataRKASAwal/v_rkas.csv /content/drive/MyDrive/04\ Rawdata/07-bos/


In [None]:
# Check the directory and put into a list
arrDirectory00 = !ls '/content/drive/My Drive/04 Rawdata'
arrDirectory01 = []
arrDirectory02 = []


for e in arrDirectory00:
  print(type(e))
  (head, tail) = os.path.splitdrive(e)
  print(head)
  if not (tail):
    arrDirectory01.append(e)

for e in arrDirectory01:
  arrDirectory00 = !gsutil ls {e}
  print(e, arrDirectory00)

In [None]:
# Import PyDrive and associated libraries.
# This only needs to be done once per notebook.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


# List ZIP files in Drive, excluding trashed files.
#
# Search query reference:
# https://developers.google.com/drive/v2/web/search-parameters
listed = drive.ListFile({'q': "mimeType = 'application/zip' and trashed=false"}).GetList()
# listed = drive.ListFile().GetList()
for file in listed:
  print('title {}, id {}'.format(file['title'], file['id']))

In [None]:
# List Rar & Zip Files
arrZipFiles = !ls '/content/drive/My Drive/04 Rawdata/07-bos/'*.zip

arrZipFiles
# print(len(arrZipFiles))

In [None]:
# %%time

# Read and write ZIP-format archive files. This module provides tools to create, read, write, append, and list a ZIP file
import zipfile

# !unzip '/content/drive/My Drive/04 Rawdata/07-bos/v_rkas.zip' -d '/content/drive/My Drive/04 Rawdata/07-bos/'
!zip '/content/drive/My Drive/04 Rawdata/07-bos/v_rkas.zip' -d '/content/drive/My Drive/04 Rawdata/07-bos/'

"""
for i in range(0,1):
  print(arrZipFiles[i])
  (head, tail) = os.path.split(arrZipFiles[i])
  !unzip '/content/drive/My Drive/04 Rawdata/07-bos/DataRKASAwal.zip' -d '/content/drive/My Drive/04 Rawdata/07-bos/'
"""      
print('Process Completed')

In [None]:
!# cd '/content/drive/My Drive/Database/susenas/susenas-2007'
workingDirectory = '/content/drive/My Drive'
print(workingDirectory)
os.chdir(workingDirectory)
!ls -d */


In [None]:
# from google.colab import drive
# drive.mount('/gdrive')
import glob

# "**" only matches nested directories when recursive=True
file_path = glob.glob("/content/drive/My Drive/**/*.zip", recursive=True)
for file in file_path:
    print(file)

In [None]:
!gsutil ls gs://content/drive
# !gsutil ls https://www.googleapis.com/drive/v2/files

## Step 0203 AWS S3 PreProcessing Files

In [None]:
# Miscellaneous operating system interfaces- standard library module
import os

# Check the directory and put into a list
arrDirectory00 = !gsutil ls s3://{awsBucketID}
arrDirectory01 = []
arrDirectory02 = []


for e in arrDirectory00:
  (head, tail) = os.path.split(e)
  if not (tail):
    arrDirectory01.append(e)

for e in arrDirectory01:
  arrDirectory00 = !gsutil ls {e}
  print(e, arrDirectory00)

In [None]:
# Copy data from AWS S3 to Google Drive (Not Used)

# Copy data using gsutil cp command
!gsutil -m cp s3://{awsBucketID}/01-bps/03-sakernas/sakernas_2003.dta /content/drive/My\ Drive/04\ Rawdata/04-sakernas/sakernas-2003

In [None]:
!ls

In [None]:
!ls --help

In [None]:
# List Rar & Zip Files

# !gsutil ls --help
arrDirectory = ['01-bps', '02-rand', '03-tnp2k']

print('01-bps')
print('01-bps/01-industri')
!gsutil ls -h -l s3://{awsBucketID}/01-bps/01-industri/**.zip
!gsutil ls -h -l s3://{awsBucketID}/01-bps/01-industri/**.rar

print('01-bps/02-podes')
!gsutil ls -h -l s3://{awsBucketID}/01-bps/02-podes/**.zip
!gsutil ls -h -l s3://{awsBucketID}/01-bps/02-podes/**.rar

print('01-bps/03-sakernas')
!gsutil ls -h -l s3://{awsBucketID}/01-bps/03-sakernas/**.zip
!gsutil ls -h -l s3://{awsBucketID}/01-bps/03-sakernas/**.rar

print('01-bps/04-sensus-ekonomi')
!gsutil ls -h -l s3://{awsBucketID}/01-bps/04-sensus-ekonomi/**.zip
!gsutil ls -h -l s3://{awsBucketID}/01-bps/04-sensus-ekonomi/**.rar
# !gsutil ls s3://{awsBucketID}/01-bps/04-sensus-ekonomi/se-2016/**.zip
# !gsutil ls s3://{awsBucketID}/01-bps/04-sensus-ekonomi/se-2016/**.rar

print('01-bps/05-susenas')
!gsutil ls -h -l s3://{awsBucketID}/01-bps/05-susenas/**.zip
!gsutil ls -h -l s3://{awsBucketID}/01-bps/05-susenas/**.rar

## Step 0204A Unzip Files

In [None]:
# List Zip Files
# arrZipFiles = !gsutil ls s3://{awsBucketID}/01-bps/05-susenas/**.zip
arrZipFiles = !gsutil ls gs://{gcpBucketID}/DataRKASAwal/**.zip

arrZipFiles
# print(len(arrZipFiles))

In [None]:
# %%time

# Read and write ZIP-format archive files. This module provides tools to create, read, write, append, and list a ZIP file
import zipfile

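# Boto3 is the AWS SDK for Python (its import is commented out in section 0202 03)
import boto3

# Read the credentials from environment variables instead of hardcoding them
s3_resource = boto3.resource('s3',
         aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
         aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'])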

for i in range(4,7):
  print(arrZipFiles[i])
  (head, tail) = os.path.split(arrZipFiles[i])
  zip_obj = s3_resource.Object(bucket_name=awsBucketID, key=arrZipFiles[i][16:])
  # zip_obj = s3_resource.Object(bucket_name=awsBucketID, key='01-bps/04-sensus-ekonomi/se-2006/Data SE2006 Listing.zip')
  buffer = BytesIO(zip_obj.get()["Body"].read())

  z = zipfile.ZipFile(buffer)
  for filename in z.namelist():
      file_info = z.getinfo(filename)
      Key=head[16:]+'/'+f'{filename}'
      s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=awsBucketID,
        Key=head[16:]+'/'+f'{filename}'
      )
      print(file_info, Key)

print('Process Completed')
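
The core of the cell above is reading a ZIP archive from an in-memory buffer and iterating over its members. That part can be tried without any cloud access. The sketch below is illustrative only: `read_zip_members` is a hypothetical helper, and the demo archive is built in memory rather than downloaded from S3.

```python
import io
import zipfile


def read_zip_members(zip_bytes: bytes) -> dict:
    """Return {member_name: content_bytes} for every file in the archive."""
    members = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Directory entries have no content to upload.
            if info.is_dir():
                continue
            members[info.filename] = archive.read(info)
    return members


# Build a small archive in memory to stand in for a downloaded object.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/v_rkas.csv", "id,value\n1,10\n")

demo_members = read_zip_members(buffer.getvalue())
print(list(demo_members))
```

Each value in the returned dictionary could then be uploaded under a key derived from the member name. Note that this approach holds the whole archive in memory, which may not suit very large files.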

In [None]:
# %%time

# Read and write ZIP-format archive files. This module provides tools to create, read, write, append, and list a ZIP file
import io
import zipfile
from google.cloud import storage

storage_client = storage.Client(project=gcpProjectID)
bucket = storage_client.get_bucket(gcpBucketID)
# all_blobs = list(bucket.list_blobs())

filesource = 'DataRKASAwal/v_rkas.zip'
(head, tail) = os.path.split(filesource)
zip_blob = bucket.blob(filesource)
# zip_bytes = BytesIO(zip_blob.get()["Body"].read())
zip_bytes = io.BytesIO(zip_blob.download_as_string())

z = zipfile.ZipFile(zip_bytes)
for filename in z.namelist():
      file_info = z.getinfo(filename)
      file_content = z.read(filename)
      blob = bucket.blob(head + "/" + filename)
      blob.upload_from_string(file_content)
      print(file_info)            

In [None]:
from google.cloud import storage
from zipfile import ZipFile
from zipfile import is_zipfile
import io

def zipextract(bucketname, zipfilename_with_path):

    storage_client = storage.Client(project=gcpProjectID)
    bucket = storage_client.get_bucket(bucketname)

    destination_blob_pathname = zipfilename_with_path

    blob = bucket.blob(destination_blob_pathname)
    zipbytes = io.BytesIO(blob.download_as_string())

    if is_zipfile(zipbytes):
        with ZipFile(zipbytes, 'r') as myzip:
            for contentfilename in myzip.namelist():
                contentfile = myzip.read(contentfilename)
                blob = bucket.blob(zipfilename_with_path + "/" + contentfilename)
                blob.upload_from_string(contentfile)

zipextract('prospera-spending-bucket-201115', 'DataRKASAwal/v_rkas.zip') # if the file is gs://mybucket/path/file.zip

In [None]:
(head, tail) = os.path.split(filename)

In [None]:
for i in range(4,7):
  print(arrZipFiles[i])

  '''
    (head, tail) = os.path.split(arrZipFiles[i])
    print(head[16:])
    print(tail)
    (drive, tail) = os.path.splitext(arrZipFiles[i])
    print(drive)
    print(tail)
    print(os.path.dirname(head))
    print(arrZipFiles[i][16:])
  '''

In [None]:
# Copy data from AWS S3 to Google Drive (Not Used)

# Copy data using gsutil cp command
# !gsutil -m cp s3://{awsBucketID}/01-bps/04-sensus-ekonomi/se-2006/sensus_ekonomi_2006__listing_1/L1.zip /content/drive/My\ Drive/07\ Google\ Colab\

In [None]:
head[16:]+'/'+f'{filename}'
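
The slice `[16:]` works only because `s3://01-rawdata/` happens to be 16 characters long. A sketch that splits the bucket and key explicitly, so it keeps working if the bucket name changes, is shown below; `split_object_uri` is a hypothetical helper name.

```python
def split_object_uri(uri: str):
    """Split 's3://bucket/key' (or 'gs://bucket/key') into (bucket, key)."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"not an object URI: {uri!r}")
    # Everything up to the first slash is the bucket; the remainder is the key.
    bucket, _, key = rest.partition("/")
    return bucket, key


uri = "s3://01-rawdata/01-bps/04-sensus-ekonomi/se-2006/Data SE2006 Listing.zip"
print(split_object_uri(uri))
```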

## Step 0204B Unrar Files

In [None]:
# List Rar Files
arrRarFiles = !gsutil ls s3://{awsBucketID}/01-bps/04-sensus-ekonomi/se-2006/**.rar

print(arrRarFiles)
print(len(arrRarFiles))

In [None]:
for i in range(len(arrRarFiles)):
  print(arrRarFiles[i])

  (head, tail) = os.path.split(arrRarFiles[i])
  print(head[16:])
  print(tail)
  
  (drive, tail) = os.path.splitext(arrRarFiles[i])
  print(drive)
  print(tail)
  
  print(os.path.dirname(head))
  print(arrRarFiles[i][16:])

In [None]:
# Copy data from AWS S3 to Google Drive (Not Used)

# Copy data using gsutil cp command
!gsutil -m cp s3://{awsBucketID}/01-bps/04-sensus-ekonomi/se-2006/sensus_ekonomi_2006__listing_1/L1.rar /content/drive/My\ Drive/07\ Google\ Colab\

In [None]:
# %%time

# RAR archive reader for Python
!pip install rarfile
import rarfile

# Wrapper for UnRAR library, ctypes-based
# !pip install unrar
# from unrar import rarfile

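# Boto3 is the AWS SDK for Python (its import is commented out in section 0202 03)
import boto3

# Read the credentials from environment variables instead of hardcoding them
s3_resource = boto3.resource('s3',
         aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
         aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'])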

for i in range(len(arrRarFiles)):
  print(arrRarFiles[i])
  (head, tail) = os.path.split(arrRarFiles[i])
  rar_obj = s3_resource.Object(bucket_name=awsBucketID, key=arrRarFiles[i][16:])
  # rar_obj = s3_resource.Object(bucket_name=awsBucketID, key='01-bps/04-sensus-ekonomi/se-2006/sensus_ekonomi_2006__listing_1/L1.rar')

  buffer = BytesIO(rar_obj.get()["Body"].read())
  z = rarfile.RarFile(buffer)
  for filename in z.namelist():
      file_info = z.getinfo(filename)
      # Build the destination key once so the printed key matches the uploaded one
      Key=head[16:]+'/L1-rar/'+f'{filename}'
      s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=awsBucketID,
        Key=Key
      )
      print(file_info, Key)

print('Process Completed')

In [None]:
head[16:]
filename

### List File within working Directory

In [None]:
%%time
# Set working directory 01
workingDirectory = dictDirectory['sakernas']
# workingDirectory = dictDirectory['sakernas'] + '/data'
os.chdir(workingDirectory)
path = !pwd
print(path)

# List file on working directory 02
listFile = [f for f in glob.glob('*.csv')]
# listFile = [f for f in glob.glob('*.dta')]
# listFile = [f for f in glob.glob('*.*')]

# Prints all files within directory 03
loop = 0
for e in listFile:
  print(e)
  loop += 1

print(loop)

In [None]:
# Copy data from Google Cloud Storage into AWS S3
bucketName = 'bucket-prospera-01'
bucketDirectory = dictGCSDirectory['se-2016-direktori']

# !gsutil ls s3://{bucketDirectory}

# Delete data using gsutil rm command
# !gsutil rm gs://{bucketName}/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-umb-jk/**

# Copy data using gsutil cp command
# !gsutil -m cp -r /content/drive/My\ Drive/Database/se-2016-umb-keuangan/data/* gs://{bucketName}/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-umb-jk/data

# !gsutil cp /content/drive/My\ Drive/Database/se-2016-listing/data/se2016-listing-merge.csv gs://{bucketName}/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-listing/data
# !gsutil cp /content/drive/My\ Drive/Database/se-2016-listing/data/se2016-listing-33-convert.csv gs://{bucketName}/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-listing/data
# !gsutil -m cp -r /content/drive/My\ Drive/Data/* gs://bucket-prospera-01/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-listing/data

# Copy data using gsutil rsync command exclude directories
# !gsutil rsync -d /content/drive/My\ Drive/Database/se-2016-umb-nonkeuangan/ gs://{bucketName}/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-umb-nonkeuangan/
# !gsutil rsync -d /content/drive/My\ Drive/Database/se-2016-umb-produksi/ gs://{bucketDirectory}

# Copy data using gsutil rsync command include directories
!gsutil rsync -d -r gs://{bucketDirectory} s3://{bucketDirectory}

### Google Big Query Command

In [None]:
# Set working directory on Google Big Query 01
projectId = 'datawarehouse-001'
directoryBQ = ['datawarehouse-001:04_sensus_ekonomi', 'datawarehouse-001:04_sensus_ekonomi']

!gcloud config set project {projectId}
# List file on Google Big Query working directory 02
# !bq show datawarehouse-001:04_sensus_ekonomi.se_2016_listing_merge
!bq ls  {directoryBQ[0]}
#  !bq ls --max_results=1000 {directoryBQ[0]}
# !bq rm --help

Big Query Delete Table

In [None]:
%%time
# Big Query delete table se2016-listing
listBQFile = [
  'se_2016_listing_11', 'se_2016_listing_12', 'se_2016_listing_13', 'se_2016_listing_14', 'se_2016_listing_15', 
  'se_2016_listing_16', 'se_2016_listing_17', 'se_2016_listing_18', 'se_2016_listing_19', 'se_2016_listing_21', 
  'se_2016_listing_31', 'se_2016_listing_32', 'se_2016_listing_33', 'se_2016_listing_34', 'se_2016_listing_35', 
  'se_2016_listing_36', 'se_2016_listing_51', 'se_2016_listing_52', 'se_2016_listing_53', 'se_2016_listing_61', 
  'se_2016_listing_62', 'se_2016_listing_63', 'se_2016_listing_64', 'se_2016_listing_65', 'se_2016_listing_71', 
  'se_2016_listing_72', 'se_2016_listing_73', 'se_2016_listing_74', 'se_2016_listing_75', 'se_2016_listing_76', 
  'se_2016_listing_81', 'se_2016_listing_82', 'se_2016_listing_91', 'se_2016_listing_94', 'se_2016_listing_merge' 
]

# Big Query delete table se2016-direktori
listBQFile = [
]

# Big Query delete table se2016-umk
listBQFile = [
]

# Big Query delete table se2016-umb-jk
listBQFile = [
]

# Big Query delete table se2016-umb-jnk
listBQFile = [
]

# Big Query delete table se2016-umb-sp
listBQFile = [
]

# Set working directory on Google Big Query 01
projectId = 'datawarehouse-001'
directoryBQ = ['datawarehouse-001:03_sakernas']

# List file on Google Big Query working directory 02
# !bq ls --max_results=1000 {directoryBQ[0]}

# !bq rm --help

loop = 0
for e in listBQFile:
  bqFileName = directoryBQ[0] + "." + e
  !bq rm -f -t {bqFileName}
  # print("delete", e)
  print("delete", bqFileName)
  loop += 1

print(loop)
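
Because table deletion is irreversible, it can help to print the commands before running them. The sketch below builds the same `bq rm -f -t` command used in the loop above and executes it only when `dry_run` is switched off. The function name `delete_tables` and the `dry_run` parameter are assumptions for illustration.

```python
import subprocess


def delete_tables(dataset, tables, dry_run=True):
    """Build one `bq rm -f -t` command per table; run them unless dry_run."""
    commands = [["bq", "rm", "-f", "-t", f"{dataset}.{table}"] for table in tables]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if bq exits with an error.
            subprocess.run(command, check=True)
    return commands


planned = delete_tables(
    "datawarehouse-001:03_sakernas",
    ["sakernas_1994", "sakernas_1995"],
)
```

With the default `dry_run=True` nothing is deleted; the returned list only shows what would be run.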

Big Query Create Table

In [None]:
# Upload file on working directory to Google Big Query

# Set Working Directory 01
workingDirectory = dictDirectory['sakernas']

!bq load \
    --source_format=CSV \
    --skip_leading_rows=1 \
    datawarehouse-001:04_sensus_ekonomi.se_2016_umb_jk_02_21 \
    gs://bucket-prospera-01/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-umb-jk/data/data-01/se2016-umb-jk-01-21.csv \
    ./se2016-umb-jka-layout.json

In [None]:
%%time
# Big Query create table susenas
listBQFile = [
  'susenas00_ki', 'susenas00-ki.csv', 'susenas00-ki-layout-prospera.json', 'susenas00_kr', 'susenas00-kr.csv', 'susenas00-kr-layout-prospera.json', 'susenas00_kna', 'susenas00-kna.csv', 'susenas00-kna-layout-prospera.json'
]

# Set working directory 01
workingDirectory = dictDirectory['susenas-2000']
os.chdir(workingDirectory)
path = !pwd

pathData = 'gs://bucket-prospera-01/01-rawdata/01-bps/05-susenas/susenas-2000/data/'
fileLayout = ''

loop = 0
for i in range (0, len(listBQFile), 3):
  pathSource = pathData + listBQFile[i+1]
  fileLayout = listBQFile[i+2]
  pathDestination = 'datawarehouse-001:05_susenas.' + listBQFile[i]

  # Upload file on working directory to Google Big Query
  !bq load \
    --source_format=CSV \
    --skip_leading_rows=1 \
    --replace=True \
    {pathDestination} \
    {pathSource} \
    ./{fileLayout}

  print(pathDestination, pathSource, fileLayout)
  loop += 1

print(loop)
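
The flat list above stores each load job as three consecutive items (table name, CSV file, layout file). One way to make that grouping explicit is to slice the list into triples, as in this sketch. The generator name `iter_load_jobs` and its parameters are hypothetical.

```python
def iter_load_jobs(items, dataset, data_prefix):
    """Yield (destination_table, source_uri, layout_file) from a flat list."""
    if len(items) % 3 != 0:
        raise ValueError("expected table, csv and layout entries in groups of three")
    # items[0::3] are table names, items[1::3] CSV files, items[2::3] layouts.
    for table, csv_name, layout in zip(items[0::3], items[1::3], items[2::3]):
        yield f"{dataset}.{table}", data_prefix + csv_name, layout


jobs = list(iter_load_jobs(
    ["susenas00_ki", "susenas00-ki.csv", "susenas00-ki-layout-prospera.json",
     "susenas00_kr", "susenas00-kr.csv", "susenas00-kr-layout-prospera.json"],
    dataset="datawarehouse-001:05_susenas",
    data_prefix="gs://bucket-prospera-01/01-rawdata/01-bps/05-susenas/susenas-2000/data/",
))
for job in jobs:
    print(job)
```

The length check catches a missing entry early, before any load command is issued.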

In [None]:
%%time
# Big Query create table
listBQFile = [
	'sakernas_1994', 'sakernas_1994.csv',
	'sakernas_1995', 'sakernas_1995.csv',
	'sakernas_1996', 'sakernas_1996.csv',
	'sakernas_1997', 'sakernas_1997.csv',
	'sakernas_1998', 'sakernas_1998.csv',
	'sakernas_1999', 'sakernas_1999.csv',
	'sakernas_2000', 'sakernas_2000.csv',
	'sakernas_2001', 'sakernas_2001.csv',
	'sakernas_2002', 'sakernas_2002.csv',
	'sakernas_2003', 'sakernas_2003.csv',
	'sakernas_2004', 'sakernas_2004.csv',
	'sakernas_2005nov', 'sakernas_2005nov.csv',
	'sakernas_2006aug', 'sakernas_2006aug.csv',
	'sakernas_2007aug', 'sakernas_2007aug.csv',
	'sakernas_2007feb', 'sakernas_2007feb.csv',
	'sakernas_2008aug', 'sakernas_2008aug.csv',
	'sakernas_2008feb', 'sakernas_2008feb.csv',
	'sakernas_2009aug', 'sakernas_2009aug.csv',
	'sakernas_2009feb', 'sakernas_2009feb.csv',
	'sakernas_2010aug', 'sakernas_2010aug.csv',
	'sakernas_2010feb', 'sakernas_2010feb.csv',
	'sakernas_2011aug_rev', 'sakernas_2011aug_rev.csv',
	'sakernas_2011feb', 'sakernas_2011feb.csv',
	'sakernas_2012aug_rev', 'sakernas_2012aug_rev.csv',
	'sakernas_2012feb', 'sakernas_2012feb.csv',
	'sakernas_2013aug_rev', 'sakernas_2013aug_rev.csv',
	'sakernas_2013feb', 'sakernas_2013feb.csv',
	'sakernas_2014aug', 'sakernas_2014aug.csv',
	'sakernas_2014feb', 'sakernas_2014feb.csv',
	'sakernas_2015aug', 'sakernas_2015aug.csv',
	'sakernas_2015feb', 'sakernas_2015feb.csv',
	'sakernas_2016aug', 'sakernas_2016aug.csv',
	'sakernas_2016feb', 'sakernas_2016feb.csv',
	'sakernas_2017aug', 'sakernas_2017aug.csv',
	'sakernas_2017feb', 'sakernas_2017feb.csv',
	'sakernas_2018aug', 'sakernas_2018aug.csv',
	'sakernas_2018feb', 'sakernas_2018feb.csv',
	'sakernas_2019aug', 'sakernas_2019aug.csv'
]

# Set working directory 01
os.chdir(workingDirectory)
path = !pwd

loop = 0
for i in range(0, len(listBQFile), 2):
  pathData = 'gs://bucket-prospera-01/01-rawdata/01-bps/03-sakernas/'

  if listBQFile[i] == 'se2016_umb_jk_01_11':
    pathData = 'gs://bucket-prospera-01/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-umb-jk/data/data-01/'
    # fileLayout = 'se2016-umb-jka-layout.json'
  elif listBQFile[i] == 'se2016_umb_jk_02_11':
    pathData = 'gs://bucket-prospera-01/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-umb-jk/data/data-02/'
    # fileLayout = 'se2016-umb-jkb-layout.json'
  elif listBQFile[i] == 'se2016_umb_jk_03_11':
    pathData = 'gs://bucket-prospera-01/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-umb-jk/data/data-03/'
    # fileLayout = 'se2016-umb-jkc-layout.json'
  elif listBQFile[i] == 'se2016_umb_jk_11':
    pathData = 'gs://bucket-prospera-01/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-umb-jk/data/'
    # fileLayout = 'se2016-umb-jk-layout.json'

  pathSource = pathData + listBQFile[i+1]
  pathDestination = 'datawarehouse-001:03_sakernas.' + listBQFile[i]

  # Upload file on working directory to Google Big Query
  !bq load \
      --autodetect \
      --source_format=CSV \
      --skip_leading_rows=1 \
      {pathDestination} \
      {pathSource}      # \
      #./{fileLayout}

  print(listBQFile[i])
  loop += 1

print(loop)
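
The flat list above alternates table names and file names, and the loop walks it two items at a time. A minimal sketch of an alternative, assuming the same alternating layout, pairs the entries up front so the loop no longer depends on index arithmetic. `pair_up` and `sample` are hypothetical names used only for illustration.

```python
def pair_up(flat_list):
    """Turn [table1, file1, table2, file2, ...] into [(table1, file1), ...]."""
    if len(flat_list) % 2 != 0:
        raise ValueError("expected an even number of entries (table, file pairs)")
    return list(zip(flat_list[0::2], flat_list[1::2]))

# Hypothetical two-entry sample in the same layout as listBQFile
sample = ['sakernas_1994', 'sakernas_1994.csv', 'sakernas_1995', 'sakernas_1995.csv']
for table_name, file_name in pair_up(sample):
    print(table_name, '<-', file_name)
```

The length check catches a missing table or file name before any load starts.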

## Step 0205 Standardize File Names

In [None]:
!ls {dictDirectory['susenas-1979']}

In [None]:
%%time
dictFile = {
  'susenas-2000': {'source': 'susenas-2000-', 'dest': 'susenas00-'},
  'susenas-2001': {'source': 'susenas-2001-', 'dest': 'susenas01-'},
  'susenas-2002': {'source': 'susenas-2002-', 'dest': 'susenas02-'},
  'susenas-2003': {'source': 'se-2016-umk-', 'dest': 'susenas-umk-'},
  'susenas-2004': {'source': '_data1_umk_v1', 'dest': 'susenas-umk-01-'},
  'susenas-2005': {'source': '_data2_umk_v1', 'dest': 'susenas-umk-02-'},
  'susenas-2006': {'source': 'se-2016-umb-jk', 'dest': 'susenas-umb-jk'},
  'susenas-2007': {'source': '_data1_umb-jk_v1', 'dest': 'susenas-umb-jk-01-'},
  'susenas-2008': {'source': '_data2_umb-jk_v1', 'dest': 'susenas-umb-jk-02-'},
  'susenas-2009': {'source': '_data3_umb-jk_v1', 'dest': 'susenas-umb-jk-03-'},
  'susenas-2010': {'source': 'se-2016-umb-jnk', 'dest': 'susenas-umb-jnk'},
  'susenas-2011': {'source': '_data1_umb-jnk_v1', 'dest': 'susenas-umb-jnk-01-'},
  'susenas-2012': {'source': '_data2_umb-jnk_v1', 'dest': 'susenas-umb-jnk-02-'},
  'susenas-2013': {'source': '_data3_umb-jnk_v1', 'dest': 'susenas-umb-jnk-03-'},
  'susenas-2014': {'source': 'se-2016-umb-sp', 'dest': 'susenas-umb-sp'},
  'susenas-2015': {'source': '_data1_umb-sp_v1', 'dest': 'susenas-umb-sp-01-'},
  'susenas-2016': {'source': '_data2_umb-sp_v1', 'dest': 'susenas-umb-sp-02-'},
  'susenas-2017': {'source': '_data3_umb-sp_v1', 'dest': 'susenas-umb-sp-03-'},
  'susenas-2018': {'source': '_data3_umb-sp_v1', 'dest': 'susenas-umb-sp-03-'}
}

# Set working directory 01
workingDirectory = dictDirectory['susenas-2002']
os.chdir(workingDirectory)
path = !pwd

# List all files within directory 02
listFile = glob.glob('*.*')
fileSource = dictFile['susenas-2002']['source']
fileDestination = dictFile['susenas-2002']['dest']

# Rename files within directory 03
loop = 0
for e in listFile:
  pathSource = path[0] + '/' + e
  pathDestination = path[0] + '/' + e
  # pathDestination = path[0] + '/' + fileDestination + e
  # print('renaming ' + pathSource)
  fname,ext = os.path.splitext(pathDestination)
  fname = fname.replace(fileSource,fileDestination)

  os.rename(pathSource, fname + ext)
  print(e, fname)
  loop += 1

print(loop)
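
The rename loop above applies `replace` to the full path, so a matching substring in a directory name would be rewritten too. A minimal sketch, assuming only the file name should change, restricts the replacement to the name and adds a preview mode. `rename_with_prefix`, `dry_run` and the demo file name are hypothetical.

```python
import os
import tempfile

def rename_with_prefix(directory, old, new, dry_run=True):
    """Replace `old` with `new` in each entry name; return (before, after) pairs."""
    changes = []
    for name in sorted(os.listdir(directory)):
        stem, ext = os.path.splitext(name)
        new_name = stem.replace(old, new) + ext
        if new_name != name:
            changes.append((name, new_name))
            if not dry_run:
                os.rename(os.path.join(directory, name),
                          os.path.join(directory, new_name))
    return changes

# Demonstration in a throwaway directory with a hypothetical file name
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, 'susenas-2002-ki.dbf'), 'w').close()
preview = rename_with_prefix(demo_dir, 'susenas-2002-', 'susenas02-')
applied = rename_with_prefix(demo_dir, 'susenas-2002-', 'susenas02-', dry_run=False)
print(preview, applied, os.listdir(demo_dir))
```

Running with `dry_run=True` first shows what would change without touching the files.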

In [None]:
!pwd

## Step 0206A Convert File from stata into csv

In [None]:
%%time
# Unix style pathname pattern expansion. The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order
# import glob

# Miscellaneous operating system interfaces
import os

# Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
# built on top of the Python programming language
import pandas as pd


# List dta files within directory 01
# listFile = [f for f in glob.glob('*.dta')]
listFile = !ls {dictDrive["sakernas-2012"]}/data/*.dta
# listFile = ['/content/drive/MyDrive/04 Rawdata/04-sakernas/sakernas-2002/data/sakernas02.dta']
# listFile = !ls {dictDrive["susenas-2007"]}/data\ tnp2k/*.dta
# listFile = ['/content/drive/MyDrive/04 Rawdata/06-susenas/susenas-2007/data tnp2k/susenas07panel-modul-blokq-egnyte.dta']

# Convert dta file into csv 02
loop = 0
for e in listFile:
  # fileSource = path[0] + '/' + e
  fileSource = e.replace("'","")
  fname,ext = os.path.splitext(fileSource)
  fileDestination = fname + '-dta.csv'

  print(fileSource, type(fileSource))
  
  # Convert file from stata into csv
  dfStata = pd.read_stata(fileSource, convert_categoricals=False)
  dfStata.to_csv(fileDestination, encoding='utf-8', index=False)

  # Read and import csv file dataset into pandas data frame, change paths if needed
  dfBPSData = pd.read_csv(fileDestination)

  print("dfBPSData.shape :", dfBPSData.shape)
  print("type(dfBPSData) :", type(dfBPSData))
  print("converting", e, "to", fileDestination)
  loop += 1

print("Number of converted files: ",loop)

In [None]:
%%time
# Unix style pathname pattern expansion. The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order
import glob

# Miscellaneous operating system interfaces
import os

# Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
# built on top of the Python programming language
import pandas as pd

listFile = [f for f in glob.glob('*.dta')]
# listFile = ['ind96a.dta', 'ind96b.dta']

path = !pwd
loop = 0
for e in listFile:
  pathSource = path[0] + '/' + e
  fname,ext = os.path.splitext(pathSource)
  pathDestination = fname + '.csv'
  print("converting", e, "to", pathDestination)
  
  # Convert file from stata into csv
  dfStata = pd.read_stata(pathSource, convert_categoricals=False)
  dfStata.to_csv(pathDestination, encoding='utf-8', index=False)

  # Read and import csv file dataset into pandas data frame, change paths if needed
  dfBPSData = pd.read_csv(pathDestination)

  print("dfBPSData.shape :", dfBPSData.shape)
  print("type(dfBPSData) :", type(dfBPSData))
  print("converted", e, "to", pathDestination)
  loop += 1

print(loop)
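
Both cells above re-read the CSV they just wrote only to print its shape. A minimal sketch, assuming the shape of the Stata frame is enough for that check, wraps the conversion in one function. `stata_to_csv` and the demo table are hypothetical and exist only for illustration.

```python
import os
import tempfile
import pandas as pd

def stata_to_csv(dta_path, suffix='.csv'):
    """Convert one Stata file to a CSV next to it; return (csv_path, shape)."""
    stem, _ = os.path.splitext(dta_path)
    csv_path = stem + suffix
    df = pd.read_stata(dta_path, convert_categoricals=False)
    df.to_csv(csv_path, encoding='utf-8', index=False)
    return csv_path, df.shape

# Round trip on a tiny, made-up table written to a temporary directory
demo_dir = tempfile.mkdtemp()
demo_dta = os.path.join(demo_dir, 'demo.dta')
pd.DataFrame({'prov': [11, 12], 'weight': [1.5, 2.0]}).to_stata(demo_dta, write_index=False)
csv_path, shape = stata_to_csv(demo_dta)
print(csv_path, shape)
```

Passing `suffix='-dta.csv'` would reproduce the naming used in the first cell.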

## Step 0206B Convert File from dbf into csv

### 0206B 01 Convert file dbf to csv using dbfread

In [None]:
!ls {dictDrive["sakernas-2003"]}/data/
# !ls {dictDrive["susenas-2007"]}/data\ tnp2k/

In [None]:
listFile = !ls {dictDrive["sakernas-2008"]}/data/*.dbf
# listFile = !ls {dictDrive["susenas-2007"]}/data\ tnp2k/*.dbf

In [None]:
listFile
# listFile = listFile[1]

In [None]:
workingDirectory = dictDrive["susenas-2007"]
print(workingDirectory)
bebek = workingDirectory.replace("\\","")
print(bebek)

In [None]:
# %%time
# Unix style pathname pattern expansion. The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order
import glob

# Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
# built on top of the Python programming language
import pandas as pd

# Miscellaneous operating system interfaces. This module provides a portable way of using operating system dependent functionality
import os

# import sys
import csv
!pip install dbfread
from dbfread import DBF
from dbfread import FieldParser

class MyFieldParser(FieldParser):
    def parseN(self, field, data):
        data = data.strip().strip(b'*\x00')  # Had to strip out the other characters first before \x00, as per super function specs.
        return super(MyFieldParser, self).parseN(field, data)

    def parseD(self, field, data):
        data = data.strip(b'\x00')
        return super(MyFieldParser, self).parseD(field, data)

# Set working directory 01
"""
workingDirectory = dictDrive["susenas-2007"]+"/data tnp2k/"
workingDirectory = workingDirectory.replace("\\","")
print(type(workingDirectory))
os.chdir(workingDirectory)
path = !pwd
print(path)
!ls -1
"""
# List dbf file within directory 02
# listFile = [f for f in glob.glob('*.dbf')]
listFile = !ls {dictDrive["sakernas-2011"]}/data/*.dbf
# listFile = !ls {dictDrive["susenas-2007"]}/data\ tnp2k/*.dbf
# listFile = ['/content/drive/MyDrive/04 Rawdata/06-susenas/susenas-2007/data tnp2k/susenas07panel-modul-blokvl-egnyte.dbf']
# Convert dbf file into csv 03
loop = 0
for e in listFile:
  # fileSource = path[0] + '/' + e
  fileSource = e.replace("'","")
  fname,ext = os.path.splitext(fileSource)
  fileDestination = fname + '-dbf.csv'

  print(fileSource, type(fileSource))
  
  # Convert file from dbf into csv (using dbfread)
  table = DBF(fileSource, parserclass=MyFieldParser)     # table variable is a DBF object
  print(table.header)
  # print(table.field_names)
  # print(table.fields)
  
  with open(fileDestination, 'w', newline = '') as f:               # create a csv file, fill it with dbf content
    writer = csv.writer(f)
    writer.writerow(table.field_names)
    for record in table:
      # print (record.values())
      writer.writerow(list(record.values()))

    # for i, record in enumerate(table):
      # print('bebek')
      # writer.writerow(list(record.values()))

  # Read and import csv file dataset into pandas data frame, change paths if needed
  dfBPSData = pd.read_csv(fileDestination)

  print(fileDestination, fname)
  print("dfBPSData.shape :", dfBPSData.shape)
  # print("type(dfBPSData) :", type(dfBPSData))
  
  # print("converting", e, "to", fileDestination)
  loop += 1

print("Number of converted files: ",loop)
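
The inner loop above writes each record's values in field order. A minimal sketch of the same step, assuming each record is a mapping from field name to value as that loop treats it, uses `csv.DictWriter` and counts rows while writing. `write_records` and the sample rows are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def write_records(field_names, records, handle):
    """Write mapping-style records to an open text handle as CSV; return row count."""
    writer = csv.DictWriter(handle, fieldnames=field_names)
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count

# Made-up records standing in for the rows of a DBF table
fields = ['PROV', 'KAB']
rows = [{'PROV': 11, 'KAB': 1}, {'PROV': 12, 'KAB': 5}]
buffer = io.StringIO(newline='')
row_count = write_records(fields, rows, buffer)
print(row_count)
print(buffer.getvalue())
```

Returning the row count gives a quick sanity check without loading the CSV back into pandas.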

In [None]:
print(workingDirectory)

In [None]:
# %%time
import pandas as pd
import os
import glob
# import sys
import csv
from dbfread import DBF
from dbfread import FieldParser

class MyFieldParser(FieldParser):
    def parseN(self, field, data):
        data = data.strip().strip(b'*\x00')  # Had to strip out the other characters first before \x00, as per super function specs.
        return super(MyFieldParser, self).parseN(field, data)

    def parseD(self, field, data):
        data = data.strip(b'\x00')
        return super(MyFieldParser, self).parseD(field, data)

# Set working directory 01
"""
workingDirectory = dictDrive["susenas-2007"]+"/data tnp2k/"
workingDirectory = workingDirectory.replace("\\","")
print(type(workingDirectory))
os.chdir(workingDirectory)
path = !pwd
print(path)
!ls -1
"""
# List dbf file within directory 02
# listFile = [f for f in glob.glob('*.dbf')]
listFile = !ls {dictDrive["susenas-2007"]}/data\ tnp2k/*.dbf
listFile = ['/content/drive/MyDrive/04 Rawdata/06-susenas/susenas-2007/data tnp2k/susenas07panel-modul-blokvl-egnyte.dbf']
# Convert dbf file into csv 03
loop = 0
for e in listFile:
  # fileSource = path[0] + '/' + e
  fileSource = e.replace("'","")
  fname,ext = os.path.splitext(fileSource)
  fileDestination = fname + '-dbf.csv'

  print(fileSource, type(fileSource))
  
  # Convert file from dbf into csv (using dbfread)
  table = DBF(fileSource, parserclass=MyFieldParser)     # table variable is a DBF object
  print(table.header)
  # print(table.field_names)
  # print(table.fields)
  
  with open(fileDestination, 'w', newline = '') as f:               # create a csv file, fill it with dbf content
    writer = csv.writer(f)
    writer.writerow(table.field_names)
    for record in table:
      # print (record.values())
      writer.writerow(list(record.values()))

    # for i, record in enumerate(table):
      # print('bebek')
      # writer.writerow(list(record.values()))

  # Read and import csv file dataset into pandas data frame, change paths if needed
  dfBPSData = pd.read_csv(fileDestination)

  print(fileDestination, fname)
  print("dfBPSData.shape :", dfBPSData.shape)
  # print("type(dfBPSData) :", type(dfBPSData))
  
  # print("converting", e, "to", fileDestination)
  loop += 1

print(loop)

In [None]:
import pandas as pd
dfBPSData = pd.read_csv(fileDestination)

print("dfBPSData.shape :", dfBPSData.shape)
print("type(dfBPSData) :", type(dfBPSData))

In [None]:
fileDestination2 = '/content/drive/MyDrive/04 Rawdata/06-susenas/susenas-2007/data tnp2k backup/susenas07-ki.csv'
dfBPSData = pd.read_csv(fileDestination2)

print("dfBPSData.shape :", dfBPSData.shape)
print("type(dfBPSData) :", type(dfBPSData))

In [None]:
print(sys.version)

In [None]:
# %%time
import os
# import glob
# import sys
# import csv
# from dbfread import DBF
# from dbfread import FieldParser

"""
class MyFieldParser(FieldParser):
    def parseN(self, field, data):
        data = data.strip().strip(b'*\x00')  # Had to strip out the other characters first before \x00, as per super function specs.
        return super(MyFieldParser, self).parseN(field, data)

    def parseD(self, field, data):
        data = data.strip(b'\x00')
        return super(MyFieldParser, self).parseD(field, data)
"""

# Set working directory 01
# workingDirectory = dictDrive["susenas-2007"]+"/data\ tnp2k/"
# workingDirectory = workingDirectory.replace("\\","")
# os.chdir(workingDirectory)
# path = !pwd
# print(path)

# List dbf file within directory 01
listFile = !ls {dictDrive["susenas-2007"]}/data\ tnp2k/*.dbf

# Convert dbf file into csv 03
loop = 0
for e in listFile:
  fileSource = e
  fname,ext = os.path.splitext(fileSource)
  fileDestination = fname + '.csv'

  # Convert file from dbf into csv (using Dbf5)
  # dbfFile = Dbf5(fileSource, codec='utf-8')
  # dfDbf = dbfFile.to_dataframe()
  # dfDbf.to_csv(fileDestination, encoding='utf-8', index=False)
  # dbfFile.to_csv(fileDestination)

  # Convert file from dbf into csv (using dbf)
  # with dbf.Table(fileSource) as table:
  #  dbf.export(table, fileDestination)
  """
  # Convert file from dbf into csv (using dbfread)
  table = DBF(fileSource, load=True, parserclass=MyFieldParser)     # table variable is a DBF object
  print(table.header)
  print(table.field_names)
  # print(table.fields)
  with open(fileDestination, 'w', newline = '') as f: # create a csv file, fill it with dbf content
    writer = csv.writer(f)
    writer.writerow(table.field_names)
    for record in table:
      # print (record.values())
    # for i, record in enumerate(table):
      # print('bebek')
      writer.writerow(list(record.values()))

  # Read and import csv file dataset into pandas data frame, change paths if needed
  # dfBPSData = pd.read_csv(pathDestination)

  # print("dfBPSData.shape :", dfBPSData.shape)
  # print("type(dfBPSData) :", type(dfBPSData))
  """
  print("converting", e, "to", fileDestination)
  loop += 1

print(loop)

In [None]:
!pip install dbfread
!pip install dbfpy

In [None]:
import csv
from dbfread import DBF

dbf_table_pth = '/content/drive/My Drive/04 Rawdata/07-bos/v_rkas2.dbf'
# dbf_table_pth = '/content/drive/My Drive/04 Rawdata/07-bos/v_sumber_dana.dbf'

def dbf_to_csv(dbf_table_pth):
    """Convert a dbf file to a csv file with the same name and path, except the extension."""
    csv_fn = dbf_table_pth[:-4] + ".csv"  # set the csv file name
    table = DBF(dbf_table_pth)  # table variable is a DBF object
    with open(csv_fn, 'w', newline='') as f:  # create a csv file, fill it with dbf content
        writer = csv.writer(f)
        writer.writerow(table.field_names)  # write the column names
        for record in table:  # write the rows
            writer.writerow(list(record.values()))
    return csv_fn  # return the csv name


dbf_to_csv(dbf_table_pth)

In [None]:
import csv
from dbfread import DBF

dbf_table = DBF('/content/drive/My Drive/04 Rawdata/07-bos/v_rkas2.dbf')

for record in dbf_table:
  print(record)

In [None]:
import csv
from dbfpy import dbf
import os
import sys

# filename = sys.argv[1]
filename = '/content/drive/My Drive/04 Rawdata/07-bos/v_rkas2.dbf'
if filename.endswith('.dbf'):
    print("Converting %s to csv" % filename)
    """csv_fn = filename[:-4]+ ".csv"
    with open(csv_fn,'wb') as csvfile:
        in_db = dbf.Dbf(filename)
        out_csv = csv.writer(csvfile)
        names = []
        for field in in_db.header.fields:
            names.append(field.name)
        #out_csv.writerow(names)
        for rec in in_db:
            row = [i.decode('utf8').encode('cp1250') if isinstance(i, str) else i for i in rec.fieldData]
            out_csv.writerow(rec.fieldData)
        in_db.close()
        print("Done...")"""
else:
  print("Filename does not end with .dbf")

In [None]:
import pandas as pd

from simpledbf import Dbf5

dbf = Dbf5('/content/drive/My Drive/04 Rawdata/07-bos/v_rkas2.dbf')
df = dbf.to_dataframe()

In [None]:
# !pip install pysal
!pip install giddy

In [None]:
!pip install pysal
import pysal as ps

In [None]:
import libpysal
import pandas as pd

def dbf2DF(dbfile, upper=True):
    '''
    Read a DBF file and return a pandas DataFrame.

    Arguments
    ---------
    dbfile  : DBF file - Input to be imported
    upper   : Condition - If true, make column heads upper case
    '''
    # PySAL 2.x exposes the file reader as libpysal.io.open (ps.open is PySAL 1.x only)
    db = libpysal.io.open(dbfile)
    d = {col: db.by_col(col) for col in db.header}  # Convert dbf to dictionary
    pandasDF = pd.DataFrame(d)  # Convert to Pandas DF
    if upper:  # Make columns uppercase if wanted
        pandasDF.columns = [col.upper() for col in db.header]
    db.close()
    return pandasDF

In [None]:
df = dbf2DF('../input/afrbeep020.dbf')

## Step 0207 Check & Review Data

In [None]:
# Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
# built on top of the Python programming language- non-standard python libraries
import pandas as pd

# Read and import csv file dataset into pandas data frame, change paths if needed
dictFileReview = {
  'susenas00-ki': 'susenas00-ki.csv', 'susenas00-kr': 'susenas00-kr.csv', 'susenas00-kna': 'susenas00-kna.csv', 'susenas00-mod-ki': 'susenas00-mod-ki.csv', 'susenas00-mod-kr': 'susenas00-mod-kr.csv',
  'susenas01-ki': 'susenas01-ki.csv', 'susenas01-kr': 'susenas01-kr.csv', 'susenas01-ind-km': 'susenas01-ind-km.csv', 'susenas01-rt-km': 'susenas01-rt-km.csv',
  'susenas02-ki': 'susenas02jul-ki.csv', 'susenas02-kr': 'susenas02jul-kr.csv'
  
}

# Set working directory 01
# workingDirectory = dictDirectory['sakernas']
# workingDirectory = dictDirectory['sakernas'] + '/data'
# os.chdir(workingDirectory)
path = !pwd

# Review dataset 02
# fname = dictFileReview['susenas00-kna']
fname = 'v_rkas.csv'
# fname = 'susenas02jul-module-consumption.csv'

print(workingDirectory)
pathSource = path[0] + '/' + fname
print(pathSource)
dfBPSData = pd.read_csv(pathSource, sep=';', engine='python', nrows=100)

print(fname)
print("dfBPSData.shape  :", dfBPSData.shape)
print("type(dfBPSData)  :", type(dfBPSData))

In [None]:
dfBPSData

### 0207 01 Check & review data Sakernas

In [None]:
%%time
# Sakernas data pre-processing

# Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
# built on top of the Python programming language- non-standard python libraries
import pandas as pd

# Set working directory 01
# workingDirectory = dictDirectory['sakernas']
# os.chdir(workingDirectory)
# path = !pwd

# List csv file within directory 02
listFile = !ls {dictDrive["sakernas-2000"]}/data/*.csv

# Prints all files within directory 03
loop = 0
for fname in listFile:
  fileSource = fname.replace("'","")
  # pathDestination = pathSource
  dfBPSData = pd.read_csv(fileSource)

  print(fname, ".shape  :", dfBPSData.shape)
  dfBPSData.info(verbose=True, show_counts=True)  # info() prints itself and returns None; show_counts replaces null_counts (pandas >= 1.2)
  loop += 1

print("Number of files: ",loop)

### 0207 02 Check & review data Susenas

In [None]:
%%time
# Set working directory 01
# workingDirectory = dictDirectory['sakernas']
# os.chdir(workingDirectory)
# path = !pwd

# List csv file within directory 02
listFile = !ls {dictDrive["susenas-2007"]}/data\ tnp2k/*.csv

# Prints all files within directory 03
loop = 0
for fname in listFile:
  fileSource = fname.replace("'","")
  # pathDestination = pathSource
  dfBPSData = pd.read_csv(fileSource)

  print(fname, ".shape  :", dfBPSData.shape)
  loop += 1

print(loop)

In [None]:
# Examine dataset, see data type
print(fileSource)
dfBPSData.info(verbose=True, show_counts=True)  # show_counts replaces null_counts (pandas >= 1.2)

In [None]:
# Examine dataset, see data type
print(dfBPSData.describe(percentiles=[], include='all').transpose().to_string())

In [None]:
# Examine dataset, see data values
dfBPSData.head()
# dfBPSData.tail()
# dfBPSData.sort_values(by=['psid'], ascending=False)

In [None]:
# basic info about columns in each dataset
for name, df in dfs.items():
    print("df: %s\n" %name)
    print("df:", name, "type:", type(df), "\n")
    print("shape: %d rows, %d cols\n" %df.shape)
    
    print("column info:")
    for col in df.columns:
        print("* %s: %d nulls, %d nans, %d unique vals, most common: %s" % (
            col, 
            df[col].isnull().sum(),
            df[col].isna().sum(),
            df[col].nunique(),
            df[col].value_counts().head(2).to_dict()
        ))
    print("\n------\n")
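
The loop above expects a dictionary named `dfs` that maps a label to a DataFrame, but this section does not define it. A minimal sketch of one way to build it, assuming one CSV per label, is shown here. The in-memory CSV text and the labels `tableA` and `tableB` are made up for illustration. Note that `isnull()` and `isna()` are aliases in pandas, so the "nulls" and "nans" counts printed above are always equal.

```python
import io
import pandas as pd

# Made-up CSV text standing in for files on disk
sources = {
    'tableA': io.StringIO("id,value\n1,10\n2,\n"),
    'tableB': io.StringIO("id,name\n1,x\n1,y\n"),
}

# One DataFrame per label, keyed the way the review loop expects
dfs = {name: pd.read_csv(src) for name, src in sources.items()}

for name, df in dfs.items():
    print(name, df.shape, df.isna().sum().to_dict())
```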

In [None]:
# Examine dataset
# print(dfBPSData.describe(percentiles=[], include='all').transpose().to_string())
print(dfBPSData.count().transpose().to_string())

In [None]:
pd.reset_option('display.show_dimensions')
pd.set_option('display.show_dimensions', False)
print(pd.options.display.max_rows, pd.options.display.show_dimensions)

In [None]:
# Examine dataset, flag missing values in each cell
# dfBPSData['DDESA94'].isna().any()
# dfBPSData.sort_values(by='psid')
# dfBPSData.tail(10)
dfBPSData.isna()

## Step 0208 Convert Data Type
Convert float columns to nullable integer columns.

In [None]:
%%time
# Read and import csv file dataset into pandas data frame, change paths if needed
fname = 'se2016-listing-11.csv'

path = !pwd
# print(path)

pathSource = path[0] + '/' + fname
pathDestination = pathSource
# print(pathSource)
dfBPSData = pd.read_csv(pathSource)

print(fname)
print("dfBPSData.shape  :", dfBPSData.shape)
print("type(dfBPSData)  :", type(dfBPSData))

In [None]:
# Examine dataset, found isna & maximum values
# dfBPSData.columns.isna().any()
# dfBPSData['psid'].max()

In [None]:
# Examine dataset, create data type dictionary
dfBPSTypeSeries  = dict(dfBPSData.dtypes)
print(dfBPSTypeSeries)

In [None]:
# Convert data type float into integer
for (key, values) in dfBPSTypeSeries.items():
  if values=='float64':
    print(key, values)
    dfBPSData[key] = dfBPSData[key].astype('Int64')

    # Special case on certain field
    # if key!='D94_VNOB':
      # dfBPSData[key] = dfBPSData[key].astype('Int64')

In [None]:
# Save data from convert data type operation
print(pathDestination)
dfBPSData.to_csv(pathDestination, encoding='utf-8', index=False)
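
The conversion above casts every `float64` column to `Int64`, which raises an error when a column holds fractional values such as weights. A minimal sketch, assuming only whole-number columns should be converted, checks the values first. `floats_to_nullable_int` and the sample frame are hypothetical.

```python
import pandas as pd

def floats_to_nullable_int(df):
    """Return a copy where whole-number float columns become nullable Int64."""
    out = df.copy()
    for col in out.select_dtypes(include='float').columns:
        non_null = out[col].dropna()
        # Convert only when every present value is a whole number
        if (non_null == non_null.round()).all():
            out[col] = out[col].astype('Int64')
    return out

# Made-up frame: one countable column with a gap, one fractional column
sample = pd.DataFrame({'count': [1.0, None, 3.0], 'weight': [0.5, 1.25, None]})
converted = floats_to_nullable_int(sample)
print(converted.dtypes)
```

Missing values survive the cast as `<NA>`, which is why the nullable `Int64` type is used rather than plain `int64`.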

Convert data type float into integer (Loop)

In [None]:
%%time
# list csv file within directory
listFile = [f for f in glob.glob('*.csv')]

# data type to convert
dDataType = {
  'provinsi':'object',
  'nama_prov':'object',
  'kabupaten':'object',
  'nama_kab':'object',
  'idperkab':'Int64',
  'b1r11d':'object',
  'b1r13':'object',
  'b1r14a':'object',
  'b1r14b':'object',
  'kategori':'object',
  'b1r15c':'object',
  'b1r15d':'object',
  'b1r16':'object',
  'b1r19a':'Int64',
  'b1r21':'object',
  'b1r22a':'object',
  'b1r22b':'object',
  'kat_omset':'Int64',
  'skalausaha':'object',
  'penimbang':'object',
  'renum':'Int64'
    }

path = !pwd
print(path)
loop = 0

for e in listFile:
  pathSource = path[0] + '/' + e
  fname,ext = os.path.splitext(pathSource)
  pathDestination = fname + "-convert" + ext
  dfBPSData = pd.read_csv(pathSource)

  # print(fname, ".shape  :", dfBPSData.shape)
  # print(pathSource)
  print(pathDestination)
  
  # Examine dataset, create data type dictionary
  dfBPSTypeSeries  = dict(dfBPSData.dtypes)

  # Convert each column to its target data type from dDataType
  for (key, values) in dfBPSTypeSeries.items():
    dfBPSData[key] = dfBPSData[key].astype(dDataType[key.lower()])
    # print(key, values, "convert to", dDataType[key.lower()])

  # Save data from convert data type operation
  # print(pathDestination)
  dfBPSData.to_csv(pathDestination, encoding='utf-8', index=False)
  loop += 1

print(loop)
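
The loop above looks up every column in `dDataType` and stops with a `KeyError` if a file contains a column the mapping does not list. A minimal sketch, assuming unlisted columns should simply keep their type, builds a per-file mapping first. `dtype_map_for` and the sample column names are hypothetical.

```python
def dtype_map_for(columns, type_by_lower_name):
    """Map each column to its target dtype, matching names case-insensitively.

    Columns missing from the mapping are skipped instead of raising KeyError.
    """
    return {
        col: type_by_lower_name[col.lower()]
        for col in columns
        if col.lower() in type_by_lower_name
    }

# Small excerpt of the mapping above plus a column it does not cover
target_types = {'provinsi': 'object', 'idperkab': 'Int64'}
columns = ['PROVINSI', 'IDPERKAB', 'extra_col']
selected = dtype_map_for(columns, target_types)
print(selected)
# With pandas this could be applied as:
#   dfBPSData = dfBPSData.astype(dtype_map_for(dfBPSData.columns, dDataType))
```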

In [None]:
# save data into google cloud storage
bucket_name = 'bucket-prospera-01'
!gsutil cp /content/drive/My\ Drive/Database/se-2016-listing/data/se2016-listing-33.csv gs://{bucket_name}/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-listing/data
!gsutil cp /content/drive/My\ Drive/Database/se-2016-listing/data/se2016-listing-33-convert.csv gs://{bucket_name}/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-listing/data
# !gsutil -m cp -r /content/drive/My\ Drive/Data/* gs://bucket-prospera-01/01-rawdata/01-bps/04-sensus-ekonomi/se-2016/se-2016-listing/data

## Step 0209 Merge Dataset
Merge datasets if required.

In [None]:
%%time
# Read and import csv file dataset into pandas data frame, change paths if needed
fname  = 'ind95.csv' # Merged output file
fnameA = 'ind95a.csv'
fnameB = 'ind95b.csv'

path = !pwd
pathSourceA = path[0] + '/' + fnameA
pathSourceB = path[0] + '/' + fnameB
pathDestination = path[0] + '/' + fname
print(pathSourceA)
print(pathSourceB)
dfBPSDataA = pd.read_csv(pathSourceA)
dfBPSDataB = pd.read_csv(pathSourceB)

print("dfBPSDataA.shape  :", dfBPSDataA.shape)
print("type(dfBPSDataA)  :", type(dfBPSDataA))
print("dfBPSDataB.shape  :", dfBPSDataB.shape)
print("type(dfBPSDataB)  :", type(dfBPSDataB))

In [None]:
# Rename joining keys
dfBPSDataA.rename({'nomor': 'NOMOR_A'}, axis='columns', inplace=True)
dfBPSDataB.rename({'nomor': 'NOMOR_B'}, axis='columns', inplace=True)

In [None]:
# Merge data files
dfBPSData = dfBPSDataA.merge(dfBPSDataB, left_on='NOMOR_A', right_on='NOMOR_B')

In [None]:
dfBPSData[['NOMOR_A','NOMOR_B']]

In [None]:
# Save data from merge operation
print(pathDestination)
dfBPSData.to_csv(pathDestination, encoding='utf-8', index=False)
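
The merge above uses the default inner join, so rows whose key is missing on either side are dropped without notice. A minimal sketch, assuming it is useful to count those rows before saving, runs an outer join with `indicator=True`. The two small frames are made up and only reuse the key names from above.

```python
import pandas as pd

left = pd.DataFrame({'NOMOR_A': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'NOMOR_B': [2, 3, 4], 'b': [20, 30, 40]})

# how='outer' keeps unmatched rows; indicator=True adds a `_merge` column
# whose values are 'left_only', 'right_only' or 'both'
checked = left.merge(right, how='outer', left_on='NOMOR_A',
                     right_on='NOMOR_B', indicator=True)
match_counts = checked['_merge'].value_counts().to_dict()
print(match_counts)
```

If the `left_only` and `right_only` counts are zero, the inner join loses nothing.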

### 0209 01 List Files within Working Directory

In [None]:
%%time
# Set working directory 01
workingDirectory = dictDirectory['se-2016-direktori-data']
os.chdir(workingDirectory)
path = !pwd
print(path)

# List file on working directory 02
listFile = [f for f in glob.glob('*.csv')]
# listFile = [f for f in glob.glob('*.dbf')]
# listFile = [f for f in glob.glob('*.*')]

# Prints all files within directory 03
loop = 0
for e in listFile:
  print(e)
  loop += 1

print(loop)

### 0209 02 Merge File 2 Tables tableA + tableB -> tableMerge

In [None]:
%%time
# Merge table se2016-umk
from functools import reduce  # reduce() is used below to merge the list of data frames
listFile = [
  'se2016-umk-11.csv', 'se2016-umk-01-11.csv', 'se2016-umk-02-11.csv', 'se2016-umk-12.csv', 'se2016-umk-01-12.csv', 'se2016-umk-02-12.csv', 'se2016-umk-13.csv', 'se2016-umk-01-13.csv', 'se2016-umk-02-13.csv', 'se2016-umk-14.csv', 'se2016-umk-01-14.csv', 'se2016-umk-02-14.csv', 'se2016-umk-15.csv', 'se2016-umk-01-15.csv', 'se2016-umk-02-15.csv', 
  'se2016-umk-16.csv', 'se2016-umk-01-16.csv', 'se2016-umk-02-16.csv', 'se2016-umk-17.csv', 'se2016-umk-01-17.csv', 'se2016-umk-02-17.csv', 'se2016-umk-18.csv', 'se2016-umk-01-18.csv', 'se2016-umk-02-18.csv', 'se2016-umk-19.csv', 'se2016-umk-01-19.csv', 'se2016-umk-02-19.csv', 'se2016-umk-21.csv', 'se2016-umk-01-21.csv', 'se2016-umk-02-21.csv', 
  'se2016-umk-31.csv', 'se2016-umk-01-31.csv', 'se2016-umk-02-31.csv', 'se2016-umk-32.csv', 'se2016-umk-01-32.csv', 'se2016-umk-02-32.csv', 'se2016-umk-33.csv', 'se2016-umk-01-33.csv', 'se2016-umk-02-33.csv', 'se2016-umk-34.csv', 'se2016-umk-01-34.csv', 'se2016-umk-02-34.csv', 'se2016-umk-35.csv', 'se2016-umk-01-35.csv', 'se2016-umk-02-35.csv', 
  'se2016-umk-36.csv', 'se2016-umk-01-36.csv', 'se2016-umk-02-36.csv', 'se2016-umk-51.csv', 'se2016-umk-01-51.csv', 'se2016-umk-02-51.csv', 'se2016-umk-52.csv', 'se2016-umk-01-52.csv', 'se2016-umk-02-52.csv', 'se2016-umk-53.csv', 'se2016-umk-01-53.csv', 'se2016-umk-02-53.csv', 'se2016-umk-61.csv', 'se2016-umk-01-61.csv', 'se2016-umk-02-61.csv', 
  'se2016-umk-62.csv', 'se2016-umk-01-62.csv', 'se2016-umk-02-62.csv', 'se2016-umk-63.csv', 'se2016-umk-01-63.csv', 'se2016-umk-02-63.csv', 'se2016-umk-64.csv', 'se2016-umk-01-64.csv', 'se2016-umk-02-64.csv', 'se2016-umk-65.csv', 'se2016-umk-01-65.csv', 'se2016-umk-02-65.csv', 'se2016-umk-71.csv', 'se2016-umk-01-71.csv', 'se2016-umk-02-71.csv', 
  'se2016-umk-72.csv', 'se2016-umk-01-72.csv', 'se2016-umk-02-72.csv', 'se2016-umk-73.csv', 'se2016-umk-01-73.csv', 'se2016-umk-02-73.csv', 'se2016-umk-74.csv', 'se2016-umk-01-74.csv', 'se2016-umk-02-74.csv', 'se2016-umk-75.csv', 'se2016-umk-01-75.csv', 'se2016-umk-02-75.csv', 'se2016-umk-76.csv', 'se2016-umk-01-76.csv', 'se2016-umk-02-76.csv', 
  'se2016-umk-81.csv', 'se2016-umk-01-81.csv', 'se2016-umk-02-81.csv', 'se2016-umk-82.csv', 'se2016-umk-01-82.csv', 'se2016-umk-02-82.csv', 'se2016-umk-91.csv', 'se2016-umk-01-91.csv', 'se2016-umk-02-91.csv', 'se2016-umk-94.csv', 'se2016-umk-01-94.csv', 'se2016-umk-02-94.csv'
]

# Set working directory 01
workingDirectory = dictDirectory['se-2016-umk-data']
os.chdir(workingDirectory)
path = !pwd
# print(path)

loop = 0
for i in range (0, len(listFile), 3):
  fname = listFile[i]     # Merged output file
  fnameA = listFile[i+1]
  fnameB  = listFile[i+2]
  pathSourceA = path[0] + '/data-01/' + fnameA
  pathSourceB = path[0] + '/data-02/' + fnameB
  pathDestination = path[0] + '/' + fname

  # print(pathSourceA, pathSourceB, pathDestination)
  dfBPSDataA = pd.read_csv(pathSourceA)
  dfBPSDataB = pd.read_csv(pathSourceB)

  # Rename joining keys
  dfBPSDataA.rename({'IDPERUSAHA': 'PERUSAHAAN_ID', 'JENISKUESI': 'JENISKUESIONER_A', 'PROV': 'PROVINSI_IDA', 'SKALAUSAHA': 'SKALAUSAHA_A', 'WEIGHT': 'WEIGHT_A'}, axis='columns', inplace=True)
  dfBPSDataB.rename({'IDPERUSAHA': 'PERUSAHAAN_ID', 'JENISKUESI': 'JENISKUESIONER_B', 'PROV': 'PROVINSI_IDB', 'SKALAUSAHA': 'SKALAUSAHA_B', 'WEIGHT': 'WEIGHT_B'}, axis='columns', inplace=True)

  dfBPSData = [dfBPSDataA, dfBPSDataB]

  # Merge data files
  # dfBPSData = dfBPSDataA.merge(dfBPSDataB, left_on='IDPERUSAHAAN_A', right_on='IDPERUSAHAAN_B')
  dfBPSDataMerge = reduce(lambda left,right: pd.merge(left,right,on='PERUSAHAAN_ID'), dfBPSData)
  
  # print(pathDestination, "dfBPSDataA.shape:", dfBPSDataA.shape, "dfBPSDataB.shape:", dfBPSDataB.shape)
  # print(pathDestination, "dfBPSDataA.shape:", dfBPSDataA.shape, "dfBPSDataB.shape:", dfBPSDataB.shape, "dfBPSDataMerge.shape:", dfBPSData.shape)
  # print("dfBPSDataA.shape:", dfBPSDataA.shape[0], "dfBPSDataB.shape:", dfBPSDataB.shape[0], "dfBPSDataMerge.shape:", dfBPSData.shape[0])

  # Save data from merge operation (the write below is disabled; uncomment it to save)
  print(pathDestination, dfBPSDataMerge.shape)
  # dfBPSDataMerge.to_csv(pathDestination, encoding='utf-8', index=False)

  loop += 1

print(loop)
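
The hand-typed list above repeats one naming pattern per province code. A minimal sketch, assuming every province follows the pattern visible in that list, generates the three file names from the code instead. `merge_triples` and its `stem` parameter are hypothetical.

```python
def merge_triples(province_codes, stem='se2016-umk'):
    """Return (merged, part 01, part 02) file names for each province code."""
    return [
        (f'{stem}-{code}.csv', f'{stem}-01-{code}.csv', f'{stem}-02-{code}.csv')
        for code in province_codes
    ]

# First two province codes from the list above
triples = merge_triples(['11', '12'])
for merged, part_a, part_b in triples:
    print(merged, part_a, part_b)
```

Looping over the tuples directly also removes the need for the `range(0, len(listFile), 3)` indexing.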

In [None]:
%%time
# Merge all table
# Set working directory 01
workingDirectory = dictDirectory['se-2016-direktori-data']
os.chdir(workingDirectory)
path = !pwd
# print(path)

# List file on working directory 02
listFile = [f for f in glob.glob('*.csv')]

pathDestination = path[0] + '/' + 'se2016-direktori-merge.csv'
dfMerges = []
totalRows = 0
loop = 0
for e in listFile:
  pathSource = path[0] + '/' + e
  # print('merge ' + pathSource)

  # Read and import csv file dataset into pandas data frame, change paths if needed
  dfBPSData = pd.read_csv(pathSource)

  dfMerges.append(dfBPSData)
  print("dfBPSData.shape :", e, dfBPSData.shape)
  totalRows += dfBPSData.shape[0]
  loop += 1

print(loop)


dfBPSDataMerge = pd.concat(dfMerges)
print("dfBPSDataMerge.shape :", dfBPSDataMerge.shape, totalRows)

print(pathDestination)
dfBPSDataMerge.to_csv(pathDestination, encoding='utf-8', index=False)

### 0209 02 Merge File 3 Tables tableA + tableB + tableC -> tableMerge

In [None]:
%%time
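# NOTE: the three listFile assignments in this cell overwrite one another, so only
# the last one (se2016-umb-sp) is used by the loop. To merge another table, comment
# out the other lists and set the matching working directory.
# Merge table se2016-umb-jk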
listFile = [
  'se2016-umb-jk-11.csv', 'se2016-umb-jk-01-11.csv', 'se2016-umb-jk-02-11.csv', 'se2016-umb-jk-03-11.csv', 'se2016-umb-jk-12.csv', 'se2016-umb-jk-01-12.csv', 'se2016-umb-jk-02-12.csv', 'se2016-umb-jk-03-12.csv', 'se2016-umb-jk-13.csv', 'se2016-umb-jk-01-13.csv', 'se2016-umb-jk-02-13.csv', 'se2016-umb-jk-03-13.csv', 'se2016-umb-jk-14.csv', 'se2016-umb-jk-01-14.csv', 'se2016-umb-jk-02-14.csv', 'se2016-umb-jk-03-14.csv', 'se2016-umb-jk-15.csv', 'se2016-umb-jk-01-15.csv', 'se2016-umb-jk-02-15.csv', 'se2016-umb-jk-03-15.csv', 
  'se2016-umb-jk-16.csv', 'se2016-umb-jk-01-16.csv', 'se2016-umb-jk-02-16.csv', 'se2016-umb-jk-03-16.csv', 'se2016-umb-jk-17.csv', 'se2016-umb-jk-01-17.csv', 'se2016-umb-jk-02-17.csv', 'se2016-umb-jk-03-17.csv', 'se2016-umb-jk-18.csv', 'se2016-umb-jk-01-18.csv', 'se2016-umb-jk-02-18.csv', 'se2016-umb-jk-03-18.csv', 'se2016-umb-jk-19.csv', 'se2016-umb-jk-01-19.csv', 'se2016-umb-jk-02-19.csv', 'se2016-umb-jk-03-19.csv', 'se2016-umb-jk-21.csv', 'se2016-umb-jk-01-21.csv', 'se2016-umb-jk-02-21.csv', 'se2016-umb-jk-03-21.csv', 
  'se2016-umb-jk-31.csv', 'se2016-umb-jk-01-31.csv', 'se2016-umb-jk-02-31.csv', 'se2016-umb-jk-03-31.csv', 'se2016-umb-jk-32.csv', 'se2016-umb-jk-01-32.csv', 'se2016-umb-jk-02-32.csv', 'se2016-umb-jk-03-32.csv', 'se2016-umb-jk-33.csv', 'se2016-umb-jk-01-33.csv', 'se2016-umb-jk-02-33.csv', 'se2016-umb-jk-03-33.csv', 'se2016-umb-jk-34.csv', 'se2016-umb-jk-01-34.csv', 'se2016-umb-jk-02-34.csv', 'se2016-umb-jk-03-34.csv', 'se2016-umb-jk-35.csv', 'se2016-umb-jk-01-35.csv', 'se2016-umb-jk-02-35.csv', 'se2016-umb-jk-03-35.csv', 
  'se2016-umb-jk-36.csv', 'se2016-umb-jk-01-36.csv', 'se2016-umb-jk-02-36.csv', 'se2016-umb-jk-03-36.csv', 'se2016-umb-jk-51.csv', 'se2016-umb-jk-01-51.csv', 'se2016-umb-jk-02-51.csv', 'se2016-umb-jk-03-51.csv', 'se2016-umb-jk-52.csv', 'se2016-umb-jk-01-52.csv', 'se2016-umb-jk-02-52.csv', 'se2016-umb-jk-03-52.csv', 'se2016-umb-jk-53.csv', 'se2016-umb-jk-01-53.csv', 'se2016-umb-jk-02-53.csv', 'se2016-umb-jk-03-53.csv', 'se2016-umb-jk-61.csv', 'se2016-umb-jk-01-61.csv', 'se2016-umb-jk-02-61.csv', 'se2016-umb-jk-03-61.csv', 
  'se2016-umb-jk-62.csv', 'se2016-umb-jk-01-62.csv', 'se2016-umb-jk-02-62.csv', 'se2016-umb-jk-03-62.csv', 'se2016-umb-jk-63.csv', 'se2016-umb-jk-01-63.csv', 'se2016-umb-jk-02-63.csv', 'se2016-umb-jk-03-63.csv', 'se2016-umb-jk-64.csv', 'se2016-umb-jk-01-64.csv', 'se2016-umb-jk-02-64.csv', 'se2016-umb-jk-03-64.csv', 'se2016-umb-jk-65.csv', 'se2016-umb-jk-01-65.csv', 'se2016-umb-jk-02-65.csv', 'se2016-umb-jk-03-65.csv', 'se2016-umb-jk-71.csv', 'se2016-umb-jk-01-71.csv', 'se2016-umb-jk-02-71.csv', 'se2016-umb-jk-03-71.csv', 
  'se2016-umb-jk-72.csv', 'se2016-umb-jk-01-72.csv', 'se2016-umb-jk-02-72.csv', 'se2016-umb-jk-03-72.csv', 'se2016-umb-jk-73.csv', 'se2016-umb-jk-01-73.csv', 'se2016-umb-jk-02-73.csv', 'se2016-umb-jk-03-73.csv', 'se2016-umb-jk-74.csv', 'se2016-umb-jk-01-74.csv', 'se2016-umb-jk-02-74.csv', 'se2016-umb-jk-03-74.csv', 'se2016-umb-jk-75.csv', 'se2016-umb-jk-01-75.csv', 'se2016-umb-jk-02-75.csv', 'se2016-umb-jk-03-75.csv', 'se2016-umb-jk-76.csv', 'se2016-umb-jk-01-76.csv', 'se2016-umb-jk-02-76.csv', 'se2016-umb-jk-03-76.csv', 
  'se2016-umb-jk-81.csv', 'se2016-umb-jk-01-81.csv', 'se2016-umb-jk-02-81.csv', 'se2016-umb-jk-03-81.csv', 'se2016-umb-jk-82.csv', 'se2016-umb-jk-01-82.csv', 'se2016-umb-jk-02-82.csv', 'se2016-umb-jk-03-82.csv', 'se2016-umb-jk-91.csv', 'se2016-umb-jk-01-91.csv', 'se2016-umb-jk-02-91.csv', 'se2016-umb-jk-03-91.csv', 'se2016-umb-jk-94.csv', 'se2016-umb-jk-01-94.csv', 'se2016-umb-jk-02-94.csv', 'se2016-umb-jk-03-94.csv'
]

# Merge table se2016-umb-jnk
listFile = [
  'se2016-umb-jnk-11.csv', 'se2016-umb-jnk-01-11.csv', 'se2016-umb-jnk-02-11.csv', 'se2016-umb-jnk-03-11.csv', 'se2016-umb-jnk-12.csv', 'se2016-umb-jnk-01-12.csv', 'se2016-umb-jnk-02-12.csv', 'se2016-umb-jnk-03-12.csv', 'se2016-umb-jnk-13.csv', 'se2016-umb-jnk-01-13.csv', 'se2016-umb-jnk-02-13.csv', 'se2016-umb-jnk-03-13.csv', 'se2016-umb-jnk-14.csv', 'se2016-umb-jnk-01-14.csv', 'se2016-umb-jnk-02-14.csv', 'se2016-umb-jnk-03-14.csv', 'se2016-umb-jnk-15.csv', 'se2016-umb-jnk-01-15.csv', 'se2016-umb-jnk-02-15.csv', 'se2016-umb-jnk-03-15.csv', 
  'se2016-umb-jnk-16.csv', 'se2016-umb-jnk-01-16.csv', 'se2016-umb-jnk-02-16.csv', 'se2016-umb-jnk-03-16.csv', 'se2016-umb-jnk-17.csv', 'se2016-umb-jnk-01-17.csv', 'se2016-umb-jnk-02-17.csv', 'se2016-umb-jnk-03-17.csv', 'se2016-umb-jnk-18.csv', 'se2016-umb-jnk-01-18.csv', 'se2016-umb-jnk-02-18.csv', 'se2016-umb-jnk-03-18.csv', 'se2016-umb-jnk-19.csv', 'se2016-umb-jnk-01-19.csv', 'se2016-umb-jnk-02-19.csv', 'se2016-umb-jnk-03-19.csv', 'se2016-umb-jnk-21.csv', 'se2016-umb-jnk-01-21.csv', 'se2016-umb-jnk-02-21.csv', 'se2016-umb-jnk-03-21.csv', 
  'se2016-umb-jnk-31.csv', 'se2016-umb-jnk-01-31.csv', 'se2016-umb-jnk-02-31.csv', 'se2016-umb-jnk-03-31.csv', 'se2016-umb-jnk-32.csv', 'se2016-umb-jnk-01-32.csv', 'se2016-umb-jnk-02-32.csv', 'se2016-umb-jnk-03-32.csv', 'se2016-umb-jnk-33.csv', 'se2016-umb-jnk-01-33.csv', 'se2016-umb-jnk-02-33.csv', 'se2016-umb-jnk-03-33.csv', 'se2016-umb-jnk-34.csv', 'se2016-umb-jnk-01-34.csv', 'se2016-umb-jnk-02-34.csv', 'se2016-umb-jnk-03-34.csv', 'se2016-umb-jnk-35.csv', 'se2016-umb-jnk-01-35.csv', 'se2016-umb-jnk-02-35.csv', 'se2016-umb-jnk-03-35.csv', 
  'se2016-umb-jnk-36.csv', 'se2016-umb-jnk-01-36.csv', 'se2016-umb-jnk-02-36.csv', 'se2016-umb-jnk-03-36.csv', 'se2016-umb-jnk-51.csv', 'se2016-umb-jnk-01-51.csv', 'se2016-umb-jnk-02-51.csv', 'se2016-umb-jnk-03-51.csv', 'se2016-umb-jnk-52.csv', 'se2016-umb-jnk-01-52.csv', 'se2016-umb-jnk-02-52.csv', 'se2016-umb-jnk-03-52.csv', 'se2016-umb-jnk-53.csv', 'se2016-umb-jnk-01-53.csv', 'se2016-umb-jnk-02-53.csv', 'se2016-umb-jnk-03-53.csv', 'se2016-umb-jnk-61.csv', 'se2016-umb-jnk-01-61.csv', 'se2016-umb-jnk-02-61.csv', 'se2016-umb-jnk-03-61.csv', 
  'se2016-umb-jnk-62.csv', 'se2016-umb-jnk-01-62.csv', 'se2016-umb-jnk-02-62.csv', 'se2016-umb-jnk-03-62.csv', 'se2016-umb-jnk-63.csv', 'se2016-umb-jnk-01-63.csv', 'se2016-umb-jnk-02-63.csv', 'se2016-umb-jnk-03-63.csv', 'se2016-umb-jnk-64.csv', 'se2016-umb-jnk-01-64.csv', 'se2016-umb-jnk-02-64.csv', 'se2016-umb-jnk-03-64.csv', 'se2016-umb-jnk-65.csv', 'se2016-umb-jnk-01-65.csv', 'se2016-umb-jnk-02-65.csv', 'se2016-umb-jnk-03-65.csv', 'se2016-umb-jnk-71.csv', 'se2016-umb-jnk-01-71.csv', 'se2016-umb-jnk-02-71.csv', 'se2016-umb-jnk-03-71.csv', 
  'se2016-umb-jnk-72.csv', 'se2016-umb-jnk-01-72.csv', 'se2016-umb-jnk-02-72.csv', 'se2016-umb-jnk-03-72.csv', 'se2016-umb-jnk-73.csv', 'se2016-umb-jnk-01-73.csv', 'se2016-umb-jnk-02-73.csv', 'se2016-umb-jnk-03-73.csv', 'se2016-umb-jnk-74.csv', 'se2016-umb-jnk-01-74.csv', 'se2016-umb-jnk-02-74.csv', 'se2016-umb-jnk-03-74.csv', 'se2016-umb-jnk-75.csv', 'se2016-umb-jnk-01-75.csv', 'se2016-umb-jnk-02-75.csv', 'se2016-umb-jnk-03-75.csv', 'se2016-umb-jnk-76.csv', 'se2016-umb-jnk-01-76.csv', 'se2016-umb-jnk-02-76.csv', 'se2016-umb-jnk-03-76.csv', 
  'se2016-umb-jnk-81.csv', 'se2016-umb-jnk-01-81.csv', 'se2016-umb-jnk-02-81.csv', 'se2016-umb-jnk-03-81.csv', 'se2016-umb-jnk-82.csv', 'se2016-umb-jnk-01-82.csv', 'se2016-umb-jnk-02-82.csv', 'se2016-umb-jnk-03-82.csv', 'se2016-umb-jnk-91.csv', 'se2016-umb-jnk-01-91.csv', 'se2016-umb-jnk-02-91.csv', 'se2016-umb-jnk-03-91.csv', 'se2016-umb-jnk-94.csv', 'se2016-umb-jnk-01-94.csv', 'se2016-umb-jnk-02-94.csv', 'se2016-umb-jnk-03-94.csv'
]

# Merge table se2016-umb-sp
listFile = [
  'se2016-umb-sp-11.csv', 'se2016-umb-sp-01-11.csv', 'se2016-umb-sp-02-11.csv', 'se2016-umb-sp-03-11.csv', 'se2016-umb-sp-12.csv', 'se2016-umb-sp-01-12.csv', 'se2016-umb-sp-02-12.csv', 'se2016-umb-sp-03-12.csv', 'se2016-umb-sp-13.csv', 'se2016-umb-sp-01-13.csv', 'se2016-umb-sp-02-13.csv', 'se2016-umb-sp-03-13.csv', 'se2016-umb-sp-14.csv', 'se2016-umb-sp-01-14.csv', 'se2016-umb-sp-02-14.csv', 'se2016-umb-sp-03-14.csv', 'se2016-umb-sp-15.csv', 'se2016-umb-sp-01-15.csv', 'se2016-umb-sp-02-15.csv', 'se2016-umb-sp-03-15.csv', 
  'se2016-umb-sp-16.csv', 'se2016-umb-sp-01-16.csv', 'se2016-umb-sp-02-16.csv', 'se2016-umb-sp-03-16.csv', 'se2016-umb-sp-17.csv', 'se2016-umb-sp-01-17.csv', 'se2016-umb-sp-02-17.csv', 'se2016-umb-sp-03-17.csv', 'se2016-umb-sp-18.csv', 'se2016-umb-sp-01-18.csv', 'se2016-umb-sp-02-18.csv', 'se2016-umb-sp-03-18.csv', 'se2016-umb-sp-19.csv', 'se2016-umb-sp-01-19.csv', 'se2016-umb-sp-02-19.csv', 'se2016-umb-sp-03-19.csv', 'se2016-umb-sp-21.csv', 'se2016-umb-sp-01-21.csv', 'se2016-umb-sp-02-21.csv', 'se2016-umb-sp-03-21.csv', 
  'se2016-umb-sp-31.csv', 'se2016-umb-sp-01-31.csv', 'se2016-umb-sp-02-31.csv', 'se2016-umb-sp-03-31.csv', 'se2016-umb-sp-32.csv', 'se2016-umb-sp-01-32.csv', 'se2016-umb-sp-02-32.csv', 'se2016-umb-sp-03-32.csv', 'se2016-umb-sp-33.csv', 'se2016-umb-sp-01-33.csv', 'se2016-umb-sp-02-33.csv', 'se2016-umb-sp-03-33.csv', 'se2016-umb-sp-34.csv', 'se2016-umb-sp-01-34.csv', 'se2016-umb-sp-02-34.csv', 'se2016-umb-sp-03-34.csv', 'se2016-umb-sp-35.csv', 'se2016-umb-sp-01-35.csv', 'se2016-umb-sp-02-35.csv', 'se2016-umb-sp-03-35.csv', 
  'se2016-umb-sp-36.csv', 'se2016-umb-sp-01-36.csv', 'se2016-umb-sp-02-36.csv', 'se2016-umb-sp-03-36.csv', 'se2016-umb-sp-51.csv', 'se2016-umb-sp-01-51.csv', 'se2016-umb-sp-02-51.csv', 'se2016-umb-sp-03-51.csv', 'se2016-umb-sp-52.csv', 'se2016-umb-sp-01-52.csv', 'se2016-umb-sp-02-52.csv', 'se2016-umb-sp-03-52.csv', 'se2016-umb-sp-53.csv', 'se2016-umb-sp-01-53.csv', 'se2016-umb-sp-02-53.csv', 'se2016-umb-sp-03-53.csv', 'se2016-umb-sp-61.csv', 'se2016-umb-sp-01-61.csv', 'se2016-umb-sp-02-61.csv', 'se2016-umb-sp-03-61.csv', 
  'se2016-umb-sp-62.csv', 'se2016-umb-sp-01-62.csv', 'se2016-umb-sp-02-62.csv', 'se2016-umb-sp-03-62.csv', 'se2016-umb-sp-63.csv', 'se2016-umb-sp-01-63.csv', 'se2016-umb-sp-02-63.csv', 'se2016-umb-sp-03-63.csv', 'se2016-umb-sp-64.csv', 'se2016-umb-sp-01-64.csv', 'se2016-umb-sp-02-64.csv', 'se2016-umb-sp-03-64.csv', 'se2016-umb-sp-65.csv', 'se2016-umb-sp-01-65.csv', 'se2016-umb-sp-02-65.csv', 'se2016-umb-sp-03-65.csv', 'se2016-umb-sp-71.csv', 'se2016-umb-sp-01-71.csv', 'se2016-umb-sp-02-71.csv', 'se2016-umb-sp-03-71.csv', 
  'se2016-umb-sp-72.csv', 'se2016-umb-sp-01-72.csv', 'se2016-umb-sp-02-72.csv', 'se2016-umb-sp-03-72.csv', 'se2016-umb-sp-73.csv', 'se2016-umb-sp-01-73.csv', 'se2016-umb-sp-02-73.csv', 'se2016-umb-sp-03-73.csv', 'se2016-umb-sp-74.csv', 'se2016-umb-sp-01-74.csv', 'se2016-umb-sp-02-74.csv', 'se2016-umb-sp-03-74.csv', 'se2016-umb-sp-75.csv', 'se2016-umb-sp-01-75.csv', 'se2016-umb-sp-02-75.csv', 'se2016-umb-sp-03-75.csv', 'se2016-umb-sp-76.csv', 'se2016-umb-sp-01-76.csv', 'se2016-umb-sp-02-76.csv', 'se2016-umb-sp-03-76.csv', 
  'se2016-umb-sp-81.csv', 'se2016-umb-sp-01-81.csv', 'se2016-umb-sp-02-81.csv', 'se2016-umb-sp-03-81.csv', 'se2016-umb-sp-82.csv', 'se2016-umb-sp-01-82.csv', 'se2016-umb-sp-02-82.csv', 'se2016-umb-sp-03-82.csv', 'se2016-umb-sp-91.csv', 'se2016-umb-sp-01-91.csv', 'se2016-umb-sp-02-91.csv', 'se2016-umb-sp-03-91.csv', 'se2016-umb-sp-94.csv', 'se2016-umb-sp-01-94.csv', 'se2016-umb-sp-02-94.csv', 'se2016-umb-sp-03-94.csv'
]

# Set working directory 01
workingDirectory = dictDirectory['se-2016-umb-sp-data']
os.chdir(workingDirectory)
path = !pwd
# print(path)

loop = 0
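for i in range(0, len(listFile), 4):
  # print(listFile[i+1], listFile[i+2], listFile[i+3], "merge into", listFile[i])
  fname = listFile[i]      # Destination (merged) file name
  fnameA = listFile[i+1]   # Source file in data-01
  fnameB = listFile[i+2]   # Source file in data-02
  fnameC = listFile[i+3]   # Source file in data-03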
  pathSourceA = path[0] + '/data-01/' + fnameA
  pathSourceB = path[0] + '/data-02/' + fnameB
  pathSourceC = path[0] + '/data-03/' + fnameC
  pathDestination = path[0] + '/' + fname
  
  # print(pathSourceA, pathSourceB, pathSourceC, pathDestination)
  dfBPSDataA = pd.read_csv(pathSourceA)
  dfBPSDataB = pd.read_csv(pathSourceB)
  dfBPSDataC = pd.read_csv(pathSourceC)

  # Rename joining keys
  dfBPSDataA.rename({'IDPERUSAHA': 'PERUSAHAAN_ID', 'PROV': 'PROVINSI_IDA', 'SKALAUSAHA': 'SKALAUSAHA_A', 'WEIGHT': 'WEIGHT_A'}, axis='columns', inplace=True)
  dfBPSDataB.rename({'IDPERUSAHA': 'PERUSAHAAN_ID', 'JENISKUESI': 'JENISKUESIONER_B', 'PROV': 'PROVINSI_IDB', 'SKALAUSAHA': 'SKALAUSAHA_B', 'WEIGHT': 'WEIGHT_B'}, axis='columns', inplace=True)
  dfBPSDataC.rename({'IDPERUSAHA': 'PERUSAHAAN_ID', 'JENISKUESI': 'JENISKUESIONER_C', 'PROV': 'PROVINSI_IDC', 'SKALAUSAHA': 'SKALAUSAHA_C', 'WEIGHT': 'WEIGHT_C'}, axis='columns', inplace=True)
  
  dfBPSData = [dfBPSDataA, dfBPSDataB, dfBPSDataC]

  # Merge data files
  # dfBPSData = dfBPSDataA.merge(dfBPSDataB, left_on='IDPERUSAHAAN_A', right_on='IDPERUSAHAAN_B')
  dfBPSDataMerge = reduce(lambda left,right: pd.merge(left,right,on='PERUSAHAAN_ID'), dfBPSData)
  
  # print(pathDestination, "dfBPSDataA.shape:", dfBPSDataA.shape, "dfBPSDataB.shape:", dfBPSDataB.shape)
  # print(pathDestination, "dfBPSDataA.shape:", dfBPSDataA.shape, "dfBPSDataB.shape:", dfBPSDataB.shape, "dfBPSDataMerge.shape:", dfBPSDataMerge.shape)
  # print("dfBPSDataA.shape:", dfBPSDataA.shape[0], "dfBPSDataB.shape:", dfBPSDataB.shape[0], "dfBPSDataMerge.shape:", dfBPSDataMerge.shape[0])

  # Save data from merge data type operation
  print(pathDestination, dfBPSDataA.shape, dfBPSDataMerge.shape)
  # print(fname, dfBPSDataA.shape, dfBPSDataMerge.shape)
  dfBPSDataMerge.to_csv(pathDestination, encoding='utf-8', index=False)

  loop += 1

print(loop)
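
The hard-coded lists above follow a regular pattern: one destination file per province code, followed by its three part files. As a sketch only, the same names could be generated instead of typed out. The helper `build_file_groups` is hypothetical; the province codes are copied from the file names above.

```python
# Province codes as they appear in the hard-coded file names above.
provinceCodes = [
    11, 12, 13, 14, 15, 16, 17, 18, 19, 21,
    31, 32, 33, 34, 35, 36, 51, 52, 53, 61,
    62, 63, 64, 65, 71, 72, 73, 74, 75, 76,
    81, 82, 91, 94,
]


def build_file_groups(prefix, codes, parts=3):
    """Return one (destination, [part files]) tuple per province code."""
    groups = []
    for code in codes:
        destination = f"{prefix}-{code}.csv"
        sources = [f"{prefix}-{part:02d}-{code}.csv" for part in range(1, parts + 1)]
        groups.append((destination, sources))
    return groups


fileGroups = build_file_groups("se2016-umb-sp", provinceCodes)
print(len(fileGroups), fileGroups[0])
```

Passing `"se2016-umb-jk"` or `"se2016-umb-jnk"` as the prefix would cover the other two tables, which avoids keeping three near-identical lists in one cell.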

In [None]:
%%time
# Merge table se2016-umb-sp
# Set working directory 01
workingDirectory = dictDirectory['se-2016-umb-sp-data']
os.chdir(workingDirectory)
path = !pwd
# print(path)

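# List file on working directory 02
# Skip the merge output so that re-running this cell does not read it back in
listFile = [f for f in glob.glob('*.csv') if f != 'se2016-umb-sp-merge.csv']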

pathDestination = path[0] + '/' + 'se2016-umb-sp-merge.csv'
dfMerges = []
totalRows = 0
loop = 0
for e in listFile:
  pathSource = path[0] + '/' + e
  # print('merge ' + pathSource)

  # Read and import csv file dataset into pandas data frame, change paths if needed
  dfBPSData = pd.read_csv(pathSource)

  dfMerges.append(dfBPSData)
  print("dfBPSData.shape :", e, dfBPSData.shape)
  totalRows += dfBPSData.shape[0]
  loop += 1

print(loop)


dfBPSDataMerge = pd.concat(dfMerges)
print("dfBPSDataMerge.shape :", dfBPSDataMerge.shape, totalRows)

print(pathDestination)
dfBPSDataMerge.to_csv(pathDestination, encoding='utf-8', index=False)
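
The concatenation cells in this section all repeat the same steps: list the CSV files, read them, concatenate, and write the result into the same directory. Below is a minimal sketch of that pattern as one function, assuming the output file should never be read back as an input and that the row count must match. The helper `concat_csv_files` and the file names in the demonstration are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd


def concat_csv_files(directory, outputName):
    """Concatenate every CSV in `directory` except `outputName` and write the result."""
    pattern = os.path.join(directory, "*.csv")
    sources = sorted(f for f in glob.glob(pattern) if os.path.basename(f) != outputName)
    frames = [pd.read_csv(f) for f in sources]
    merged = pd.concat(frames, ignore_index=True)
    # The concatenated frame must hold exactly the rows of its inputs.
    assert len(merged) == sum(len(df) for df in frames)
    merged.to_csv(os.path.join(directory, outputName), encoding="utf-8", index=False)
    return merged


# Demonstration on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmpDir:
    pd.DataFrame({"PERUSAHAAN_ID": [1, 2]}).to_csv(os.path.join(tmpDir, "part-11.csv"), index=False)
    pd.DataFrame({"PERUSAHAAN_ID": [3]}).to_csv(os.path.join(tmpDir, "part-12.csv"), index=False)
    firstRun = concat_csv_files(tmpDir, "merge.csv")
    secondRun = concat_csv_files(tmpDir, "merge.csv")  # re-run: merge.csv is skipped
    print(firstRun.shape, secondRun.shape)
```

Because the output name is excluded from the file list, running the function twice yields the same row count rather than doubling it.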

In [None]:
# Examine dataset, see data type
# `null_counts` was deprecated in pandas 1.2 and removed in 2.0; `show_counts` replaces it
dfBPSDataMerge.info(verbose=True, show_counts=True)

In [None]:
pwd

In [None]:
%%time
# Read and import csv file dataset into pandas data frame, change paths if needed

# Set working directory 01
os.chdir(dictDirectory['se-2016-umb-jk'])
path = !pwd
print(path)

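# List file on working directory 02
# Skip the merge output so that re-running this cell does not read it back in
listFile = [f for f in glob.glob('*.csv') if f != 'se-2016-umb-jk-merge.csv']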

pathDestination = path[0] + '/' + 'se-2016-umb-jk-merge.csv'
dfMerges = []
totalRows = 0
loop = 0
for e in listFile:
  pathSource = path[0] + '/' + e
  # print('merge ' + pathSource)

  # Read and import csv file dataset into pandas data frame, change paths if needed
  dfBPSData = pd.read_csv(pathSource)

  dfMerges.append(dfBPSData)
  print("dfBPSData.shape :", dfBPSData.shape)
  totalRows += dfBPSData.shape[0]
  # print("type(dfBPSData) :", type(dfBPSData))
  loop += 1

print(loop)


dfBPSDataMerge = pd.concat(dfMerges)
print("dfBPSDataMerge.shape :", dfBPSDataMerge.shape, totalRows)

print(pathDestination)
dfBPSDataMerge.to_csv(pathDestination, encoding='utf-8', index=False)

## Step 0208 Create Data Description

In [None]:
# Sample JSON file for raw data IBS 1993
[
	{
		"name": "DSTATS93",
		"type": "String",
		"description": "Status Permodalan"
	},
	{
		"name": "DETYPE93",
		"type": "String",
		"description": "Bentuk Badan Hukum"
	},
	{
		"name": "DPROVI93",
		"type": "String",
		"description": "Propinsi"
	},
	{
		"name": "DKABUP93",
		"type": "String",
		"description": "Kabupaten/Kotamadya"
	},
	{
		"name": "DSRVYR93",
		"type": "String",
		"description": "Tahun Survei"
	},
	{
		"name": "DYRSTR93",
		"type": "String",
		"description": "Tahun Mulai Produksi Komersial di Propinsi ini"
	},
 
...

	{
		"name": "LPDNOU93",
		"type": "Integer",
		"description": "Jumlah Banyaknya Pekerja/Karyawan Pekerja (Produksi + Lainnya) (Laki-laki + Perempuan) dibayar rata-rata setiap bulan"
	},
	{
		"name": "LTLNOU93",
		"type": "Integer",
		"description": "Jumlah Banyaknya Pekerja/Karyawan Pekerja (Produksi + Lainnya) (dibayar + tidak dibayar) (Laki-laki + Perempuan) rata-rata setiap bulan"
	},

 ...

	{
		"name": "EWOVCE93",
		"type": "Integer",
		"description": "Nilai Kayu Bakar dipakai selama tahun 1993 (Pembangkit Listrik)"
	},
	{
		"name": "ENCVCE93",
		"type": "Integer",
		"description": "Nilai Bahan Bakar Lainnya dipakai selama tahun 1993 (Pembangkit Listrik)"
	},
	{
		"name": "ETLQUE93",
		"type": "Integer",
		"description": "Banyaknya Bahan Bakar Lainnya dipakai selama tahun 1993 (Pembangkit Listrik)"
	},
	{
		"name": "NST93",
		"type": "String",
		"description": "NST93 Variabel tidak digunakan"
	},
	{
		"name": "PSID",
		"type": "Integer",
		"description": "PSID Variabel"
	}
]
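
A description file in this shape can also drive how the raw data is read. The following is a minimal sketch, assuming the `type` labels map onto pandas dtypes as shown. The mapping `TYPE_TO_DTYPE`, the two-entry excerpt, and the raw rows are illustrative assumptions, not part of the project files.

```python
import io
import json

import pandas as pd

# A two-entry excerpt in the same shape as the sample above.
descriptionJson = """
[
  {"name": "DPROVI93", "type": "String", "description": "Propinsi"},
  {"name": "PSID", "type": "Integer", "description": "PSID Variabel"}
]
"""

# Assumed mapping from the description's type labels to pandas dtypes.
TYPE_TO_DTYPE = {"String": str, "Integer": "Int64"}

fields = json.loads(descriptionJson)
dtypeMap = {field["name"]: TYPE_TO_DTYPE[field["type"]] for field in fields}

# Hypothetical raw rows; the leading zero survives because DPROVI93 is read as text.
rawCsv = io.StringIO("DPROVI93,PSID\n011,1001\n032,1002\n")
dfRaw = pd.read_csv(rawCsv, dtype=dtypeMap)
print(dfRaw.dtypes.to_dict())
```

Reading code columns such as province identifiers as text keeps leading zeros intact, which is a common reason to declare types up front rather than rely on inference.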

# Step 03 - Data Preparation
In this step, we pre-process, clean, wrangle, and manipulate the data as needed. Initial exploratory data analysis is also carried out.
* **Data Processing & Wrangling**: Processing, cleaning, munging, and wrangling the data, plus initial descriptive and exploratory data analysis.
* **Feature Extraction & Engineering**: Extracting important features or attributes from the raw data and engineering new features from existing ones.
* **Feature Scaling & Selection**: Features often need to be normalized or scaled so that Machine Learning algorithms are not biased toward features with larger ranges. We also often need to select a subset of the available features based on feature importance and quality.
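
As an illustration of the scaling mentioned in the last item, here is a minimal sketch of two common approaches in plain pandas. The frame `dfFeatures` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical numeric features on very different scales.
dfFeatures = pd.DataFrame({"workers": [10, 50, 250], "output_value": [1e6, 5e6, 9e6]})

# Min-max scaling: map each column onto the [0, 1] range.
dfMinMax = (dfFeatures - dfFeatures.min()) / (dfFeatures.max() - dfFeatures.min())

# Standardization: zero mean and unit (sample) standard deviation per column.
dfStandard = (dfFeatures - dfFeatures.mean()) / dfFeatures.std()

print(dfMinMax.round(3))
```

Min-max scaling fails with a division by zero for constant columns, so such columns would need to be dropped or handled separately first.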

Final Update 20200315

## Step 0301 Dataset Summary Analysis

In [None]:
# Examine dataset, shape, rows and columns
print("dfTrain shape   :", dfTrain.shape)
print("type(dfTrain)   :", type(dfTrain))
print("dfTrain.index   :", dfTrain.index)
print("dfTrain.columns :", dfTrain.columns, "\n")

In [None]:
# Examine dataset, first 5 rows
# dfTrain.head()
dfTrain.head().T

In [None]:
# Examine dataset, types of all features and total dataframe size in memory
dfTrain.info()

In [None]:
# Examine dataset, summary statistics of the numeric features
dfTrain.describe().T
# dfTrain.describe(include='all').T

In [None]:
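# Check whether any column label itself is missing (NaN);
# for missing values inside the columns, use dfTrain.isna().any() instead
dfTrain.columns.isna().any()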
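
To go one step further than a single yes/no answer, a per-column summary of missing values can be built as follows. This is a sketch on a hypothetical stand-in frame `dfSample`, since `dfTrain` is not defined in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfTrain.
dfSample = pd.DataFrame({"PROV": ["11", None, "13"], "WEIGHT": [1.0, np.nan, np.nan]})

# Count and share of missing values for every column.
missingSummary = pd.DataFrame({
    "missing": dfSample.isna().sum(),
    "share": dfSample.isna().mean(),
})
print(missingSummary)
```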


# Step 04 - Deployment and Monitoring
The data warehouse is deployed to production and constantly monitored for its performance and transformations.

Final Update 20200315

In [None]:
%%time
# BigQuery delete table se2016-listing
listBQFile = [
  'se_2016_listing_11', 'se_2016_listing_12', 'se_2016_listing_13', 'se_2016_listing_14', 'se_2016_listing_15', 
  'se_2016_listing_16', 'se_2016_listing_17', 'se_2016_listing_18', 'se_2016_listing_19', 'se_2016_listing_21', 
  'se_2016_listing_31', 'se_2016_listing_32', 'se_2016_listing_33', 'se_2016_listing_34', 'se_2016_listing_35', 
  'se_2016_listing_36', 'se_2016_listing_51', 'se_2016_listing_52', 'se_2016_listing_53', 'se_2016_listing_61', 
  'se_2016_listing_62', 'se_2016_listing_63', 'se_2016_listing_64', 'se_2016_listing_65', 'se_2016_listing_71', 
  'se_2016_listing_72', 'se_2016_listing_73', 'se_2016_listing_74', 'se_2016_listing_75', 'se_2016_listing_76', 
  'se_2016_listing_81', 'se_2016_listing_82', 'se_2016_listing_91', 'se_2016_listing_94', 'se_2016_listing_merge' 
]

# Set working directory on Google BigQuery 01
projectId = 'datawarehouse-001'
directoryBQ = ['datawarehouse-001:04_sensus_ekonomi', 'datawarehouse-001:04_sensus_ekonomi_rawdata']

# List file on Google BigQuery working directory 02
# !bq ls --max_results=1000 {directoryBQ[0]}

# !bq rm --help

loop = 0
for e in listBQFile:
  bqFileName = directoryBQ[0] + "." + e
  # !bq rm -f -t {bqFileName}
  print(bqFileName)
  loop += 1

print(loop)
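
The loop above only prints the table names while the `bq rm -f -t` line stays commented out. The same dry-run idea can be sketched outside notebook shell magic: build the commands first and execute them only on request. The helpers `build_bq_rm_commands` and `delete_tables` are hypothetical; the `bq rm -f -t` form is taken from the cell above.

```python
import subprocess


def build_bq_rm_commands(dataset, tableNames):
    """Build one `bq rm -f -t` argument list per table in `dataset`."""
    return [["bq", "rm", "-f", "-t", f"{dataset}.{name}"] for name in tableNames]


def delete_tables(dataset, tableNames, dryRun=True):
    """Print the commands; run them only when dryRun is False."""
    commands = build_bq_rm_commands(dataset, tableNames)
    for command in commands:
        print(" ".join(command))
        if not dryRun:
            # Requires the bq CLI to be installed and authenticated.
            subprocess.run(command, check=True)
    return commands


commands = delete_tables(
    "datawarehouse-001:04_sensus_ekonomi",
    ["se_2016_listing_11", "se_2016_listing_12"],
)
```

Keeping `dryRun=True` as the default means an accidental re-run of the cell only prints the commands and deletes nothing.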

In [None]:
bebek = 100