# **Libraries**

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# **Data Loading**

In [8]:
df_train = pd.read_csv("train_features.csv")

In [9]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3817 entries, 0 to 3816
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   tahun_kelahiran          3817 non-null   int64  
 1   pendidikan               3628 non-null   object 
 2   status_pernikahan        3605 non-null   object 
 3   pendapatan               3627 non-null   float64
 4   jumlah_anak_balita       3627 non-null   float64
 5   jumlah_anak_remaja       3613 non-null   float64
 6   terakhir_belanja         3645 non-null   float64
 7   belanja_buah             3636 non-null   float64
 8   belanja_daging           3639 non-null   float64
 9   belanja_ikan             3624 non-null   float64
 10  belanja_kue              3603 non-null   float64
 11  pembelian_diskon         3639 non-null   float64
 12  pembelian_web            3652 non-null   float64
 13  pembelian_toko           3648 non-null   float64
 14  keluhan                 

**Insight:**
1. The dataset has 3817 rows in total.
2. Only the tahun_kelahiran column has no missing values; every other column has some.
3. The tangga_menjadi_anggota column should have a datetime dtype.
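
The missing-value check in point 2 and the dtype fix in point 3 can be sketched as follows. This is a minimal illustration on toy data, not the notebook's own code: `toy`, `missing_summary`, and the column name `join_date` are hypothetical stand-ins, and the date strings are assumed to follow the `YYYY-MM-DD` format.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    n_missing = df.isna().sum()
    pct_missing = (n_missing / len(df) * 100).round(2)
    summary = pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})
    return summary.sort_values("n_missing", ascending=False)


# Toy data standing in for df_train; "join_date" is a hypothetical column name.
toy = pd.DataFrame(
    {
        "tahun_kelahiran": [1970, 1985, 1962, 1990],
        "pendapatan": [1.2e8, np.nan, 9.0e7, np.nan],
        "join_date": ["2014-01-05", "2013-11-20", None, "not a date"],
    }
)

summary = missing_summary(toy)
print(summary)

# Assumed format; errors="coerce" turns unparseable strings into NaT
# instead of raising an exception.
toy["join_date"] = pd.to_datetime(
    toy["join_date"], format="%Y-%m-%d", errors="coerce"
)
print(toy.dtypes)
```

Coercing invalid dates to NaT keeps the conversion from failing, but it also adds new missing values, so it is worth re-running the summary afterwards.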

In [10]:
def is_decimal(df, columns):
    """Print, for each column, whether any non-missing value has a fractional part."""
    for col in columns:
        # A value is fractional when its remainder modulo 1 is non-zero.
        decimals_exist = (df[col].dropna() % 1 != 0).any()
        if decimals_exist:
            print("decimal exist")
        else:
            print("decimal not exist")

# Example usage
cols = [
    'pendapatan', 'jumlah_anak_balita', 'jumlah_anak_remaja', 'terakhir_belanja',
    'belanja_buah', 'belanja_daging', 'belanja_ikan', 'belanja_kue',
    'pembelian_diskon', 'pembelian_web', 'pembelian_toko', 'keluhan',
]
is_decimal(df_train, cols)

decimal not exist
decimal not exist
decimal not exist
decimal not exist
decimal not exist
decimal not exist
decimal not exist
decimal not exist
decimal not exist
decimal not exist
decimal not exist
decimal not exist


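**Insight:**
The purpose of checking for decimal values in the float-typed features is to make sure the data is consistent and valid. Some features, such as jumlah_anak_balita, cannot have decimal values. None of the checked columns contains a fractional value, so the float dtype is most likely a side effect of the missing values, which pandas stores as NaN in float64 columns.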
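
Because a column that mixes whole numbers and NaN is stored as float64 by default, one way to make its integer nature explicit is pandas' nullable `Int64` dtype. The sketch below is an illustration only: `toy` is made-up data, `to_nullable_int` is a hypothetical helper, and `ratio` is a hypothetical column added to show a case that is left untouched.

```python
import numpy as np
import pandas as pd


def to_nullable_int(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Cast float columns that hold only whole numbers to nullable Int64."""
    out = df.copy()
    for col in columns:
        values = out[col].dropna()
        # Convert only when every non-missing value is a whole number.
        if (values % 1 == 0).all():
            out[col] = out[col].astype("Int64")
    return out


toy = pd.DataFrame(
    {
        "jumlah_anak_balita": [0.0, 1.0, np.nan, 2.0],
        "ratio": [0.5, 1.0, np.nan, 2.0],  # hypothetical column with real decimals
    }
)

converted = to_nullable_int(toy, ["jumlah_anak_balita", "ratio"])
print(converted.dtypes)
```

Missing entries survive the cast as `<NA>`, so the conversion does not hide or impute anything.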

In [11]:
df_train.describe()

| | tahun_kelahiran | pendapatan | jumlah_anak_balita | jumlah_anak_remaja | terakhir_belanja | belanja_buah | belanja_daging | belanja_ikan | belanja_kue | pembelian_diskon | pembelian_web | pembelian_toko | keluhan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3817.0 | 3627.0 | 3627.0 | 3613.0 | 3645.0 | 3636.0 | 3639.0 | 3624.0 | 3603.0 | 3639.0 | 3652.0 | 3648.0 | 3621.0 |
| mean | 1967.823946 | 114483200.0 | 0.29308 | 0.353723 | 47.23155 | 59804.239824 | 438574.8 | 81428.997792 | 63377.97058 | 2.125584 | 4.436473 | 5.767818 | 0.004971 |
| std | 11.768131 | 43460420.0 | 0.473063 | 0.493014 | 27.068512 | 74024.976109 | 512042.7 | 99976.226855 | 79435.457282 | 2.100133 | 3.002522 | 3.210738 | 0.07034 |
| min | 1899.0 | 5073000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1959.0 | 81125120.0 | 0.0 | 0.0 | 25.0 | 7907.0 | 49479.5 | 10115.0 | 7947.0 | 0.0 | 2.0 | 3.0 | 0.0 |
| 50% | 1968.0 | 115621400.0 | 0.0 | 0.0 | 47.0 | 26456.0 | 221993.0 | 36054.5 | 27795.0 | 2.0 | 4.0 | 5.0 | 0.0 |
| 75% | 1976.0 | 150496000.0 | 1.0 | 1.0 | 69.0 | 86162.0 | 686355.5 | 121380.0 | 89502.5 | 3.0 | 7.0 | 8.0 | 0.0 |
| max | 2000.0 | 332884000.0 | 2.0 | 2.0 | 128.0 | 396508.0 | 3489675.0 | 621600.0 | 542164.0 | 20.0 | 30.0 | 17.0 | 1.0 |


**Insight:**
1. Judging by the gap between the mean and the median and by the quartiles, some features are likely to be approximately normally distributed.
2. All spending features (belanja_*) are skewed; their means sit well above their medians, which points to right skew.
3. Judging by the distance between the quartiles and the maximum, every feature except the child-count features (jumlah_anak_*) appears to have outliers.
4. The variance of every feature is positive, so no column is constant.
5. Some features have very large values, so the data needs to be scaled.
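
The observations above about skew, outliers, and scaling can be checked numerically. The following is a minimal sketch, not the notebook's own code: the helpers `iqr_outlier_count` and `standardize` are hypothetical, the 1.5 × IQR fence is a common convention rather than something the source specifies, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())


def standardize(series: pd.Series) -> pd.Series:
    """Rescale to zero mean and unit (population) standard deviation."""
    return (series - series.mean()) / series.std(ddof=0)


# Toy data standing in for two df_train columns.
toy = pd.DataFrame(
    {
        "belanja_buah": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 100000.0, np.nan],
        "jumlah_anak_balita": [0.0, 0.0, 1.0, 1.0, 0.0, 2.0, 1.0],
    }
)

# Sample skewness per column; a clearly positive value indicates right skew.
skewness = toy.skew()
print(skewness)

# Number of IQR-rule outliers per column.
outliers = {col: iqr_outlier_count(toy[col]) for col in toy.columns}
print(outliers)

# Standardized copy of one column; NaN entries stay NaN.
scaled = standardize(toy["belanja_buah"])
print(scaled)
```

Standardization is sensitive to the same outliers it is applied to, so for heavily skewed columns a log transform or a median/IQR-based scaling may be worth comparing.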


In [19]:
df_train.duplicated().sum()

0

**Insight:**
There are no duplicate rows.

In [20]:
y_train = pd.read_csv('train_labels.csv')

In [21]:
y_train

| | jumlah_promosi |
|---|---|
| 0 | 2 |
| 1 | 0 |
| 2 | 1 |
| 3 | 4 |
| 4 | 4 |
| ... | ... |
| 3812 | 5 |
| 3813 | 1 |
| 3814 | 0 |
| 3815 | 0 |


# **EDA**

1. How do the missing values arise?
2. How is the data distributed in each feature?
3. Are there outliers?
4. Is the data balanced?
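
Questions 1 and 4 can be approached with a few pandas calls. The sketch below is only an illustration under stated assumptions: `features` and `labels` are toy stand-ins for `df_train` and `y_train`, and `class_balance` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each label value, sorted by label."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy stand-ins for df_train and the jumlah_promosi target.
features = pd.DataFrame(
    {
        "pendapatan": [1.0e8, np.nan, 9.0e7, np.nan, 1.1e8],
        "belanja_buah": [5000.0, 7000.0, np.nan, np.nan, 3000.0],
    }
)
labels = pd.Series([0, 0, 1, 0, 4], name="jumlah_promosi")

# Question 1: how many fields are missing in each row?
missing_per_row = features.isna().sum(axis=1)
print(missing_per_row.value_counts().sort_index())

# Question 4: how are the target values distributed?
balance = class_balance(labels)
print(balance)
```

If most rows miss only one or two fields, the gaps are probably scattered across columns rather than concentrated in a few incomplete records, which affects whether dropping rows or imputing is the more sensible choice.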