# Hands-On Sessions 12 and 13: Data Cleaning, Preparation, and Visualization

## Objectives:
- **Session 12**: Master data cleaning and preparation techniques using Pandas.
- **Session 13**: Develop skills in data visualization using Matplotlib and Seaborn for effective data analysis.


## Session 12: Data Cleaning and Preparation using Pandas

### Topics Covered
- Identifying and handling missing data.
- Data transformation and normalization.
- Data filtering and deduplication.
- Standardization of categorical data.
- Outlier detection and handling.


In [None]:
# Exercise 1: Identifying and Handling Missing Data
import pandas as pd

# Sample dataset with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [24, 30, None, 22, 35],
    'Salary': [48000, None, 57000, None, 60000]
}
df = pd.DataFrame(data)
print('Before cleaning:', df)

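# Fill missing numeric values, then drop rows without a name.
# The fillna result is assigned back to the column: calling
# fillna(inplace=True) on df['Age'] triggers a FutureWarning and
# will stop working in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())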
df.dropna(subset=['Name'], inplace=True)
print('\nAfter cleaning:\n', df)


Before cleaning:       Name   Age   Salary
0    Alice  24.0  48000.0
1      Bob  30.0      NaN
2  Charlie   NaN  57000.0
3    David  22.0      NaN
4     None  35.0  60000.0

After cleaning:
       Name    Age   Salary
0    Alice  24.00  48000.0
1      Bob  30.00  57000.0
2  Charlie  27.75  57000.0
3    David  22.00  57000.0


Note: the pattern `df[col].fillna(value, inplace=True)` emits a FutureWarning in recent pandas releases. The behavior will change in pandas 3.0, where this in-place call will never update the original DataFrame because the intermediate object `df[col]` always behaves as a copy. To perform the operation on the original object, use `df[col] = df[col].fillna(value)` or `df.fillna({col: value}, inplace=True)` instead.
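The dictionary form that the pandas warning suggests can fill several columns in one call. The following is a minimal sketch that reuses the Exercise 1 sample data; the names `fill_values` and `df_filled` are introduced here for illustration only.

```python
import pandas as pd

# Same sample data as Exercise 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [24, 30, None, 22, 35],
    'Salary': [48000, None, 57000, None, 60000],
})

# Map each column to its own fill value (hypothetical helper name)
fill_values = {
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median(),
}

# fillna returns a new DataFrame, so the original df is left untouched;
# rows without a name are dropped afterwards
df_filled = df.fillna(fill_values).dropna(subset=['Name'])
print(df_filled)
```

Because nothing is modified in place, the original `df` stays available for comparison with the cleaned result.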


In [None]:
# Exercise 2: Standardizing Categorical Data
# Sample dataset with inconsistent categorical values
data = {
    'Product': ['Laptop', 'Laptop', 'Desktop', 'Tablet', 'Tablet'],
    'Category': ['Electronics', 'electronics', 'Electronics', 'Gadgets', 'gadgets']
}
df = pd.DataFrame(data)

# Standardize category values
df['Category'] = df['Category'].str.capitalize()
print('Standardized Data:\n', df)


Standardized Data:
    Product     Category
0   Laptop  Electronics
1   Laptop  Electronics
2  Desktop  Electronics
3   Tablet      Gadgets
4   Tablet      Gadgets
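After standardization, the Exercise 2 data contains rows that are now exact duplicates. The sketch below, which assumes the same sample data, suggests why deduplication is usually done after standardizing; the `str.strip()` call and the variable names are additions for illustration, not part of the exercise.

```python
import pandas as pd

# Same sample data as Exercise 2
df = pd.DataFrame({
    'Product': ['Laptop', 'Laptop', 'Desktop', 'Tablet', 'Tablet'],
    'Category': ['Electronics', 'electronics', 'Electronics', 'Gadgets', 'gadgets'],
})

# Deduplicating first finds nothing: case differences hide the duplicates
rows_if_deduplicated_first = len(df.drop_duplicates())

# Standardize (trim whitespace, unify case), then deduplicate
df['Category'] = df['Category'].str.strip().str.capitalize()
df_unique = df.drop_duplicates().reset_index(drop=True)

print('Rows when deduplicating before standardizing:', rows_if_deduplicated_first)
print('Rows when deduplicating after standardizing:', len(df_unique))
print(df_unique)
```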


### Practice Tasks
- Load a dataset of your choice and identify missing values.
- Implement data transformations to normalize numerical columns.
- Standardize categorical columns and remove duplicates.
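For the first task, a per-column summary of missing values can help decide between filling and dropping. This is a rough sketch on a made-up DataFrame; the column names `city` and `temp_c` and the name `missing_summary` are hypothetical.

```python
import pandas as pd

# Small illustrative dataset (not from the exercises)
df = pd.DataFrame({
    'city': ['Bandung', None, 'Medan', 'Bandung'],
    'temp_c': [30.5, None, 29.5, None],
})

# isna() marks missing cells; sum() counts them and mean() gives their share
missing_summary = pd.DataFrame({
    'missing_count': df.isna().sum(),
    'missing_share': df.isna().mean(),
})
print(missing_summary)
```

A column with a high missing share may be a candidate for dropping, while a low share is often handled by imputation.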


In [None]:
import pandas as pd

# Weather dataset with missing values
data_cuaca = {
    'Tanggal': ['2024-11-01', '2024-11-02', None, '2024-11-04', '2024-11-05', None, '2024-11-07', '2024-11-08', '2024-11-01', '2024-11-02'],
    'Suhu (°C)': [30.5, 31.0, None, 29.5, 25.6, 28.0, 27.5, None, 30.5, 31.0],
    'Kelembapan (%)': [85, 80, 78, None, 76, 74, 75, None, 85, 80],
    'Kondisi': ['Cerah', None, 'Mendung', 'Hujan', 'mendung', 'Cerah', None, 'Badai', 'cerah', None]
}

# Create a DataFrame from the data
df_cuaca = pd.DataFrame(data_cuaca)
print("Weather Dataset before cleaning:")
print(df_cuaca)

# Task 1: identify missing values by counting nulls per column
print("\nMissing values in column:")
print(df_cuaca.isnull().sum())

# Task 2: impute missing values (mean for temperature, median for humidity,
# mode for condition) and drop rows without a date.
# Results are assigned back to avoid chained in-place fillna calls.
df_cuaca['Suhu (°C)'] = df_cuaca['Suhu (°C)'].fillna(df_cuaca['Suhu (°C)'].mean())
df_cuaca['Kelembapan (%)'] = df_cuaca['Kelembapan (%)'].fillna(df_cuaca['Kelembapan (%)'].median())
df_cuaca['Kondisi'] = df_cuaca['Kondisi'].fillna(df_cuaca['Kondisi'].mode()[0])
df_cuaca.dropna(subset=['Tanggal'], inplace=True)
print("\nWeather dataset after  Implement data transformations:")
print(df_cuaca)

# Task 3: standardize the categorical column, then remove duplicate rows
df_cuaca['Kondisi'] = df_cuaca['Kondisi'].str.capitalize()
df_cuaca.drop_duplicates(inplace=True)
print("\nWeather dataset after standardizing categorical columns and removing duplicates:")
print(df_cuaca)



Weather Dataset before cleaning:
      Tanggal  Suhu (°C)  Kelembapan (%)  Kondisi
0  2024-11-01       30.5            85.0    Cerah
1  2024-11-02       31.0            80.0     None
2        None        NaN            78.0  Mendung
3  2024-11-04       29.5             NaN    Hujan
4  2024-11-05       25.6            76.0  mendung
5        None       28.0            74.0    Cerah
6  2024-11-07       27.5            75.0     None
7  2024-11-08        NaN             NaN    Badai
8  2024-11-01       30.5            85.0    cerah
9  2024-11-02       31.0            80.0     None

Missing values in column:
Tanggal           2
Suhu (°C)         2
Kelembapan (%)    2
Kondisi           3
dtype: int64

Weather dataset after  Implement data transformations:
      Tanggal  Suhu (°C)  Kelembapan (%)  Kondisi
0  2024-11-01       30.5            85.0    Cerah
1  2024-11-02       31.0            80.0    Cerah
3  2024-11-04       29.5            79.0    Hujan
4  2024-11-05       25.6            76.0 


## Homework for Students
- **Session 12**: Clean a real-world dataset (from Kaggle or another source), perform normalization, handle outliers, and prepare the data for analysis.
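One possible reading of "handle outliers" is to cap extreme values at the IQR fences instead of only flagging them. The sketch below illustrates that idea on made-up fare values (not taken from any real dataset); capping with `Series.clip` is an assumption about what handling could mean here, and dropping or transforming the values are equally valid choices.

```python
import pandas as pd

# Illustrative values only; one value is far larger than the rest
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 512.33], name='Fare')

# IQR fences: 1.5 * IQR below the first and above the third quartile
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences, then cap them at the fences
is_outlier = ~fares.between(lower, upper)
fares_capped = fares.clip(lower=lower, upper=upper)

print('Outliers flagged:', int(is_outlier.sum()))
print(fares_capped)
```

Capping keeps every row, which matters when the other columns of an outlier row are still useful for analysis.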

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
df = pd.read_csv('titanic.csv')
print('Dataset before cleaning:\n', df)

# Clean the data: assign each fillna result back to its column
# (fillna(inplace=True) on df[col] will stop working in pandas 3.0)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna('Unknown')
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Cabin'] = df['Cabin'].fillna('No Cabin')
df['Name'] = df['Name'].fillna('Unknown')
print("Dataset after Cleaning:\n", df)


Dataset before cleaning:
      PassengerId  Survived  ...  Cabin Embarked
0              1         0  ...    NaN        S
1              2         1  ...    C85        C
2              3         1  ...    NaN        S
3              4         1  ...   C123        S
4              5         0  ...    NaN        S
..           ...       ...  ...    ...      ...
886          887         0  ...    NaN        S
887          888         1  ...    B42        S
888          889         0  ...    NaN        S
889          890         1  ...   C148        C
890          891         0  ...    NaN        Q

[891 rows x 12 columns]
Dataset after Cleaning:
      PassengerId  Survived  ...     Cabin Embarked
0              1         0  ...  No Cabin        S
1              2         1  ...       C85        C
2              3         1  ...  No Cabin        S
3              4         1  ...      C123        S
4              5         0  ...  No Cabin        S
..           ...       ...  ...       ... 


In [None]:
# Perform min-max normalization: rescale each column to the range [0, 1]
# (Fare first, matching the column order shown in the output)
df['Normalized Fare'] = (df['Fare'] - df['Fare'].min()) / (df['Fare'].max() - df['Fare'].min())
df['Normalized Age'] = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())
print('Data Perform Normalization:\n',  df)

Data Perform Normalization:
      PassengerId  Survived  ...  Normalized Fare Normalized Age
0              1         0  ...         0.014151       0.271174
1              2         1  ...         0.139136       0.472229
2              3         1  ...         0.015469       0.321438
3              4         1  ...         0.103644       0.434531
4              5         0  ...         0.015713       0.434531
..           ...       ...  ...              ...            ...
886          887         0  ...         0.025374       0.334004
887          888         1  ...         0.058556       0.233476
888          889         0  ...         0.045771       0.346569
889          890         1  ...         0.058556       0.321438
890          891         0  ...         0.015127       0.396833

[891 rows x 14 columns]
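The min-max formula used above divides by `max - min`, which is zero for a constant column. A small helper can guard against that case; the following is a sketch, with the function name `min_max_scale` and the sample values chosen for illustration rather than taken from the homework.

```python
import pandas as pd

def min_max_scale(series: pd.Series) -> pd.Series:
    """Rescale a numeric Series to [0, 1]; a constant Series maps to 0.0."""
    value_range = series.max() - series.min()
    if value_range == 0:
        # Avoid division by zero when every value is identical
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - series.min()) / value_range

# Illustrative values only
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 80.0], name='Age')
scaled = min_max_scale(ages)
print(scaled)

constant = pd.Series([5.0, 5.0, 5.0], name='Constant')
print(min_max_scale(constant))
```

Checking that the scaled column has a minimum of 0 and a maximum of 1 is a quick sanity test after normalization.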


In [None]:
# Flag Fare outliers with the 1.5 * IQR rule (rows are marked, not removed)
fare_threshold = 1.5 * (df['Fare'].quantile(0.75) - df['Fare'].quantile(0.25))
upper_limit = df['Fare'].quantile(0.75) + fare_threshold
lower_limit = df['Fare'].quantile(0.25) - fare_threshold
df['Is_Outlier_Fare'] = ~df['Fare'].between(lower_limit, upper_limit)
print('Data after handle outliers  for Fare:\n', df)

# Flag Age outliers with the same 1.5 * IQR rule
age_threshold = 1.5 * (df['Age'].quantile(0.75) - df['Age'].quantile(0.25))
age_upper_limit = df['Age'].quantile(0.75) + age_threshold
age_lower_limit = df['Age'].quantile(0.25) - age_threshold
df['Is_Outlier_Age'] = ~df['Age'].between(age_lower_limit, age_upper_limit)
print('\nData after handle outlier for Age:\n', df)

Data after handle outliers  for Fare:
      PassengerId  Survived  ...  Is_Outlier_Fare Is_Outlier_Age
0              1         0  ...            False          False
1              2         1  ...             True          False
2              3         1  ...            False          False
3              4         1  ...            False          False
4              5         0  ...            False          False
..           ...       ...  ...              ...            ...
886          887         0  ...            False          False
887          888         1  ...            False          False
888          889         0  ...            False          False
889          890         1  ...            False          False
890          891         0  ...            False          False

[891 rows x 16 columns]

Data after handle outlier for Age:
      PassengerId  Survived  ...  Is_Outlier_Fare Is_Outlier_Age
0              1         0  ...            False          False
1  