Data Profiling
1. Understand the purpose of data profiling
2. Understand the steps of manual data profiling
3. Data profiling using pandas_profiling (kept in a separate folder)
4. Dataset: 'https://storage.googleapis.com/dqlab-dataset/retail_raw_reduced_data_quality.csv'

Purpose of Data Profiling:
1. Summarize the dataset using descriptive statistics
2. Build a solid understanding of the data so that an analysis framework can be drafted and the data can be visualized

Manual Data Profiling:
1. Importing data
2. Inspecting data
3. Understanding the data structure
The len function counts every observation, whether or not it is null (missing values are included).
The count method counts only the non-NA/non-null observations in a series/column.
4. Calculating the percentage of missing values
5. Descriptive statistics (max, min, mean, mode, median, std)
6. Quantile statistics
7. Correlation between numeric variables
The correlation coefficient ranges from -1 to 1. A correlation of 1 is a perfect positive linear correlation, a correlation of -1 is a perfect negative linear correlation, and a correlation of 0 means there is no linear correlation (a non-linear relationship may still exist).
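
To illustrate the two ends and the middle of that range, here is a minimal sketch on a hypothetical toy frame (the names toy, y_linear, y_negative and y_squared are illustrative and not part of the dataset). It suggests that a coefficient near 0 only rules out a linear relationship: y_squared is fully determined by x, yet its Pearson correlation with x comes out at approximately zero.

```python
import pandas as pd

# Hypothetical toy data: x is symmetric around zero.
toy = pd.DataFrame({'x': [-3, -2, -1, 0, 1, 2, 3]})
toy['y_linear'] = 2 * toy['x'] + 1      # perfect positive linear relationship
toy['y_negative'] = -2 * toy['x'] + 1   # perfect negative linear relationship
toy['y_squared'] = toy['x'] ** 2        # deterministic, but not linear

# Pearson correlation (the pandas default) of every column with x.
corr_with_x = toy.corr()['x']
print(corr_with_x)
```

The printed values should be close to 1 for y_linear, close to -1 for y_negative, and close to 0 for y_squared.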

In [2]:
# Manual Data Profiling
# Importing data
import pandas as pd
retail_raw = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/retail_raw_reduced_data_quality.csv')

# Inspecting data
# Print the data type of every column in retail_raw
print(retail_raw.dtypes)

# Understanding the data structure
# len: counts all rows, including missing values
# Column city
length_city = len(retail_raw['city'])
print('Length of column city:', length_city)

# Column product_id
length_product_id = len(retail_raw['product_id'])
print('Length of column product_id:', length_product_id)

# count: counts only the non-null rows
# Column city
count_city = retail_raw['city'].count()
print('Count of column city:', count_city)

# Column product_id
count_product_id = retail_raw['product_id'].count()
print('Count of column product_id:', count_product_id)


order_id         int64
order_date      object
customer_id      int64
city            object
province        object
product_id      object
brand           object
quantity       float64
item_price     float64
dtype: object
Length of column city: 5000
Length of column product_id: 5000
Count of column city: 4984
Count of column product_id: 4989
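
The gap between the len and count results above is the number of missing values in each column. As a small self-contained sketch of the same idea, the block below uses a hypothetical five-row column (the values are made up for illustration) and cross-checks the difference against isna().sum().

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries (None and NaN).
city = pd.Series(['Jakarta', None, 'Bandung', np.nan, 'Surabaya'], name='city')

total_rows = len(city)            # counts every row, missing or not
non_null_rows = city.count()      # counts only non-NA/non-null rows
missing_rows = city.isna().sum()  # counts only the missing rows

print(total_rows, non_null_rows, missing_rows)  # 5 3 2
```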


In [3]:
# Manual Data Profiling
# Calculating the percentage of missing values
# Using len and count
import pandas as pd
retail_raw = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/retail_raw_reduced_data_quality.csv')

# Column city
length_city = len(retail_raw['city'])
count_city = retail_raw['city'].count()
# Column product_id
length_product_id = len(retail_raw['product_id'])
count_product_id = retail_raw['product_id'].count()

# Missing values in column city
number_of_missing_values_city = length_city - count_city
ratio_of_missing_values_city = number_of_missing_values_city/length_city
pct_of_missing_values_city = '{0:.1f}%'.format(ratio_of_missing_values_city * 100)
print('Missing value percentage of column city:', pct_of_missing_values_city)

# Exercise: missing values in column product_id
number_of_missing_values_product_id = length_product_id - count_product_id
ratio_of_missing_values_product_id = number_of_missing_values_product_id/length_product_id
pct_of_missing_values_product_id = '{0:.1f}%'.format(ratio_of_missing_values_product_id * 100)
print('Missing value percentage of column product_id:', pct_of_missing_values_product_id)

Missing value percentage of column city: 0.3%
Missing value percentage of column product_id: 0.2%
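
The same percentage can be computed for every column at once. The sketch below is one possible shortcut, not the method used in the cell above: it runs on a hypothetical stand-in frame named sample (values invented for illustration) and checks that isna().mean() agrees with the len/count arithmetic.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for retail_raw with a few missing values.
sample = pd.DataFrame({
    'city': ['Jakarta', None, 'Bandung', 'Medan'],
    'product_id': ['P0001', 'P0002', None, None],
    'quantity': [1.0, 2.0, np.nan, 4.0],
})

# isna() marks missing cells as True; the mean of booleans is the missing ratio.
pct_missing = sample.isna().mean() * 100

# The len/count approach, applied to all columns at once, for comparison.
pct_missing_manual = (len(sample) - sample.count()) / len(sample) * 100

print(pct_missing.round(1))
```

On this toy frame the result is 25.0 for city, 50.0 for product_id and 25.0 for quantity.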


In [4]:
# Manual Data Profiling
# Descriptive statistics (max, min, mean, mode, median, std)
import pandas as pd
retail_raw = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/retail_raw_reduced_data_quality.csv')

# Descriptive statistics of column quantity
print('Column quantity')
print('Minimum value: ', retail_raw['quantity'].min())
print('Maximum value: ', retail_raw['quantity'].max())
print('Mean value: ', retail_raw['quantity'].mean())
print('Mode value: ', retail_raw['quantity'].mode())
print('Median value: ', retail_raw['quantity'].median())
print('Standard Deviation value: ', retail_raw['quantity'].std())

# Exercise: descriptive statistics of column item_price
print('')
print('Column item_price')
print('Minimum value: ', retail_raw['item_price'].min())
print('Maximum value: ', retail_raw['item_price'].max())
print('Mean value: ', retail_raw['item_price'].mean())
print('Median value: ', retail_raw['item_price'].median())
print('Standard Deviation value: ', retail_raw['item_price'].std())

Column quantity
Minimum value:  1.0
Maximum value:  720.0
Mean value:  11.423987164059366
Mode value:  0    1.0
Name: quantity, dtype: float64
Median value:  5.0
Standard Deviation value:  29.442025010811317

Column item_price
Minimum value:  26000.0
Maximum value:  29762000.0
Mean value:  933742.7311008623
Median value:  604000.0
Standard Deviation value:  1030829.8104242863

In [5]:
# Manual Data Profiling
# Quantile statistics
import pandas as pd
retail_raw = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/retail_raw_reduced_data_quality.csv')

# Quantile statistics of column quantity
print('Column quantity:')
print(retail_raw['quantity'].quantile([0.25, 0.5, 0.75]))

# Exercise: quantile statistics of column item_price
print('')
print('Column item_price:')
print(retail_raw['item_price'].quantile([0.25, 0.5, 0.75]))

Column quantity:
0.25     2.0
0.50     5.0
0.75    12.0
Name: quantity, dtype: float64

Column item_price:
0.25     450000.0
0.50     604000.0
0.75    1045000.0
Name: item_price, dtype: float64
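
In the output above, the 0.50 quantile of each column matches the median reported earlier. The sketch below, on four hypothetical values, illustrates that match and how the default interpolation='linear' setting of Series.quantile can return values that lie between observations, whereas interpolation='lower' returns actual observations.

```python
import pandas as pd

# Hypothetical values chosen so the quartiles fall between observations.
values = pd.Series([1.0, 2.0, 3.0, 4.0], name='values')

# Default interpolation='linear' interpolates between neighbouring values.
linear_q = values.quantile([0.25, 0.5, 0.75])

# interpolation='lower' picks the nearest observation at or below the position.
lower_q = values.quantile([0.25, 0.5, 0.75], interpolation='lower')

print(linear_q)
print(lower_q)
print('Median equals 0.5 quantile:', values.median() == linear_q.loc[0.5])
```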


In [6]:
# Manual Data Profiling
# Correlation between quantity and item_price
import pandas as pd
retail_raw = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/retail_raw_reduced_data_quality.csv')

print('Correlation between quantity and item_price')
print(retail_raw[['quantity', 'item_price']].corr())

Correlation between quantity and item_price
            quantity  item_price
quantity    1.000000   -0.133936
item_price -0.133936    1.000000
