### **Data Wrangling**

#### What is Data Wrangling?
Data wrangling (also known as data munging) is the process of cleaning, organizing, and transforming raw data into a format that is easier to analyze.

Think of it like preparing ingredients before cooking — cutting, cleaning, and measuring everything so it’s ready to be used.

#### Data Wrangling Process
1. Gathering Data
2. Assessing Data
3. Cleaning Data

---


In [1]:
"""
Execute this cell before continuing
""" 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

#### Gathering Data

In [None]:
"""
Reading various data types using Pandas
""" 

students_df = pd.read_csv("Data Wrangling Dataset/students.csv")
students_df

# TODO: Explore pandas input/output API references here https://pandas.pydata.org/docs/reference/io.html
# TODO: Load the student grades from a JSON file in the same folder and store them as grades_df
# TODO: Extract the first table from an HTML file containing attendance data and store it as attendance_df
# TODO: Connect to an SQLite database file, read the "enrollment" table into a DataFrame, and store it as enrollment_df
# Hint: Use the sqlite3 module to connect, and pandas to run a SQL SELECT query.
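
As a rough illustration of the readers mentioned in the TODOs, the sketch below loads JSON and SQLite data with pandas. Everything in it is made up for the example: the in-memory JSON string, the in-memory database, and the names `demo_grades_df` and `demo_enrollment_df` are assumptions, not the files in the dataset folder.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical JSON payload standing in for a grades file on disk
json_text = '[{"student_id": 101, "grade": 88}, {"student_id": 102, "grade": 90}]'
demo_grades_df = pd.read_json(io.StringIO(json_text))

# Hypothetical in-memory SQLite database standing in for a database file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment (student_id INTEGER, course TEXT)")
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?)",
    [(101, "Math"), (102, "Physics")],
)
conn.commit()

# pandas runs the SELECT statement and returns the result as a DataFrame
demo_enrollment_df = pd.read_sql_query("SELECT * FROM enrollment", conn)
conn.close()

print(demo_grades_df)
print(demo_enrollment_df)
```

With real files, the same calls would take a file path instead of the in-memory objects used here.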

Before continuing, please take 5 minutes to read https://pandas.pydata.org/docs/user_guide/merging.html

In [None]:
"""
Combining Multiple Data: Join
""" 
print("Student data before join:")
print(students_df)
print("============================================================")

print("Additional data before join:")
print(grades_df)
print("============================================================")

# We need to set the index to 'student_id' before joining
students_df_reindexed = students_df.set_index("student_id")
grades_df_reindexed = grades_df.set_index("student_id")

joined_df = students_df_reindexed.join(grades_df_reindexed)
print(joined_df)

# TODO: Join students.csv and students2.csv using pandas join().
# Hint: Set the index to 'student_id' before joining and add suffixes "_left" and "_right".
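
When both frames share a non-key column name, `join()` needs suffixes to tell the two copies apart. A minimal sketch, assuming two small made-up frames (`left_df` and `right_df` are hypothetical, not the CSV files):

```python
import pandas as pd

# Hypothetical frames that share the non-key column "major"
left_df = pd.DataFrame(
    {"student_id": [101, 102, 103], "major": ["CS", "IS", "DS"]}
).set_index("student_id")
right_df = pd.DataFrame(
    {"student_id": [102, 103, 104], "major": ["IS", "DS", "CS"]}
).set_index("student_id")

# join() aligns on the index and keeps every row of the left frame by default;
# the suffixes disambiguate the overlapping "major" column
demo_joined_df = left_df.join(right_df, lsuffix="_left", rsuffix="_right")
print(demo_joined_df)
```

Student 101 has no match on the right, so its `major_right` value is missing; student 104 exists only on the right and is not included.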

In [None]:
"""
Combining Multiple Data: Concatenate
"""

print("Student data before concatenate:")
print(students_df)
print("============================================================")

print("Grades data before concatenate:")
print(grades_df)
print("============================================================")

print("Student data after concatenate:")
concatenated_df = pd.concat([students_df, grades_df])
print(concatenated_df)

# TODO: Try changing the axis to 1
# TODO: Concatenate data from students.csv and students2.csv using pandas concat().
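
The effect of the `axis` argument is easiest to see on small frames. The sketch below uses two made-up frames (`top_df` and `bottom_df` are hypothetical) with the same columns:

```python
import pandas as pd

top_df = pd.DataFrame({"student_id": [101, 102], "major": ["CS", "IS"]})
bottom_df = pd.DataFrame({"student_id": [103, 104], "major": ["DS", "CS"]})

# axis=0 (the default) stacks rows; ignore_index=True renumbers them from 0
stacked_df = pd.concat([top_df, bottom_df], axis=0, ignore_index=True)

# axis=1 places the frames side by side, aligned on the row index
side_by_side_df = pd.concat([top_df, bottom_df], axis=1)

print(stacked_df.shape)       # 4 rows, 2 columns
print(side_by_side_df.shape)  # 2 rows, 4 columns
```

Note that `axis=1` repeats the column names, so stacking rows is usually the better fit when both frames describe the same kind of record.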

In [None]:
"""
Combining Multiple Data: Merge
""" 
print("Student data before merge:")
print(students_df)
print("============================================================")

print("Grades data before merge:")
print(grades_df)
print("============================================================")

print("Student data after merge:")
merged_df = pd.merge(students_df, grades_df)
print(merged_df)

# TODO: Merge data from students.csv and students2.csv using pandas merge().
# Expected Output:
#   student_id   full_name         gender   major               semester   course   grade   attendance_percent
# 0  101         Ali Ahmad         Male     Computer Science    NaN        NaN      NaN     NaN
# 1  102         Siti Nurhaliza    Female   Information Systems NaN        NaN      NaN     NaN
# 2  103         John Doe          Male     Data Science        NaN        NaN      NaN     NaN
# 3  104         Aisha Yusuf       Female   Computer Science    NaN        NaN      NaN     NaN
# 4  105         Muhammad Rizki    Male     Information Systems 2023A      Math     90.0    92.0
# 5  106         Nur Aini          Female   Data Science        2023A      Math     78.0    87.0
# 6  107         Kevin Lim         Male     Computer Science    2023A      Math     82.0    93.0
# 7  108         Melati Dewi       Female   Data Science        2023A      Math     95.0    89.0
# 8  109         Arif Rahman       Male     Information Systems 2023A      Math     87.0    84.0
# 9  110         Putri Lestari     Female   Computer Science    2023A      Math     89.0    90.0
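
By default `pd.merge()` performs an inner merge, which keeps only keys found in both frames. The sketch below contrasts it with `how="left"`, which keeps every row of the left frame and fills the gaps with NaN. The frames `people_df` and `scores_df` are made up for illustration.

```python
import pandas as pd

people_df = pd.DataFrame({"student_id": [101, 102, 103], "full_name": ["A", "B", "C"]})
scores_df = pd.DataFrame({"student_id": [102, 103, 104], "grade": [90, 85, 70]})

# Inner merge (default): only student_id values present in both frames
inner_df = pd.merge(people_df, scores_df, on="student_id")

# Left merge: all rows from people_df; indicator=True adds a "_merge" column
# that records where each row came from
left_merged_df = pd.merge(
    people_df, scores_df, on="student_id", how="left", indicator=True
)

print(inner_df)
print(left_merged_df)
```

The `_merge` column is a quick way to check which rows found a match before deciding how to handle the unmatched ones.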

In [None]:
"""
Class Activity: Gathering Data
"""

# TODO: Gather data from E-Commerce Public Dataset
# TODO: Read data from orders_item_dataset.csv
# TODO: Add product_category_name in English to orders_item_dataset
# Hint: Use the products_dataset.csv and product_category_name_translation.csv

#### Assessing & Cleaning Data

In [None]:
"""
Execute this cell before continuing
"""

dirty_data = pd.DataFrame({
    'student_id': [101, 102, 103, 104, 105, 106, 106, 107, 108],
    'full_name': [
        'Ali Ahmad', 'Siti Nurhaliza', 'John Doe', 'Aisha Yusuf',
        'Muhammad Rizki', 'Nur Aini', 'Nur Aini', 'Kevin Lim', None
    ],
    'gender': [
        'Male', 'Female', 'Unknown', 'Female',
        'Male', 'Female', 'Female', 'Male', None
    ],
    'age': [20, 21, 22, 20, 21, 20, 20, 200, 21],
    'major': [
        'Computer Science', 'Data Science', 'Data Science',
        'Computer Science', None, 'Data Science', 'Data Science',
        'Computer Science', 'Data Science'
    ],
    'grade': [88, 90, 100, 80, 87, 92, 92, 76, 81],
    'attendance_percent': [95, 97, 100, 89, 88, None, None, 90, 94],
    'final_score': [
        83.7, 85.4, 94.9, 87.2, 82.8, 
        87, 87, 71.8, 77.8
    ],
    'study_hours': [3, 3, 3, 3, 1, 2, 2, 1, 2]
})

dirty_data

In [None]:
"""
Assessing data: Duplicate data
"""

# Identify duplicate data
dirty_data.duplicated().sum()

# Drop duplicate data
clean_data = dirty_data.drop_duplicates()
clean_data.duplicated().sum()
clean_data

# Quiz:
# 106	Nur Aini	Female	20
# 106	Nur Aini	Female	21
# Is it duplicate data?	
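
Whether the two quiz rows count as duplicates depends on which columns define a record. A small sketch, assuming a made-up frame `quiz_df` that holds just those two rows:

```python
import pandas as pd

quiz_df = pd.DataFrame({
    "student_id": [106, 106],
    "full_name": ["Nur Aini", "Nur Aini"],
    "gender": ["Female", "Female"],
    "age": [20, 21],
})

# Comparing all columns: the ages differ, so no row is flagged
full_row_dupes = int(quiz_df.duplicated().sum())

# Comparing only the key column: the second row repeats student_id 106
key_dupes = int(quiz_df.duplicated(subset=["student_id"]).sum())

print(full_row_dupes, key_dupes)
```

If `student_id` is supposed to be unique, the second check reveals a conflict that the first one misses.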

In [None]:
"""
Assessing data: Missing value
"""

# Identify missing value
clean_data.isna().sum()

In [None]:
"""
Cleaning data: Drop missing value
"""

# df.dropna()                     --> Drop rows with at least one missing value
# df.dropna(how="all")            --> Drop rows where all values are missing
# df.dropna(axis=1)               --> Drop columns with any missing values
# df.dropna(subset=["column1"])   --> Drop rows where column1 has a missing value
# df.dropna(thresh=5)             --> Drop rows with fewer than 5 non-NaN values

# TODO: Drop the rows that have 2 or more missing values
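
To see how the `dropna()` options differ, the sketch below applies them to a small made-up frame (`toy_df` is hypothetical) whose rows have 3, 2, 1, and 0 non-missing values:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "a": [1, 2, 3, None],
    "b": ["x", "y", None, None],
    "c": [1.0, None, None, None],
})

any_missing_dropped = toy_df.dropna()            # keeps only the complete row
all_missing_dropped = toy_df.dropna(how="all")   # drops only the fully empty row
at_least_two = toy_df.dropna(thresh=2)           # keeps rows with >= 2 non-NaN values
b_present = toy_df.dropna(subset=["b"])          # keeps rows where "b" is not missing

print(len(any_missing_dropped), len(all_missing_dropped), len(at_least_two), len(b_present))
```

Because `thresh` counts non-missing values, the number to pass depends on how many columns the frame has.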

In [None]:
"""
Cleaning data: Fill missing value for categorical data
"""

# Find the mode
major_mode = clean_data["major"].mode()[0]

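# Fill missing values using the mode
# Assign the result back instead of calling fillna(..., inplace=True) on a selected column,
# which pandas warns about and which may leave clean_data unchanged
clean_data["major"] = clean_data["major"].fillna(major_mode)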
clean_data


In [None]:
"""
Cleaning data: Fill missing value using interpolation
"""

clean_data["attendance_percent"] = clean_data["attendance_percent"].interpolate(method="linear")
clean_data
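
Linear interpolation fills each gap from its neighbouring values, so it assumes the row order is meaningful. A minimal sketch on a made-up series (`attendance` is hypothetical):

```python
import pandas as pd

attendance = pd.Series([80.0, None, 90.0, None, 100.0])

# Each missing value becomes the midpoint of the values on either side
filled = attendance.interpolate(method="linear")
print(filled.tolist())
```

When rows are independent records with no natural order, a statistic such as the median may be a safer fill value; which one to use is a judgment call.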

In [None]:
"""
Assessing data: Invalid data type
"""

# Check data type
print("Before:")
print(clean_data.dtypes)

# Convert data type
clean_data["age"] = pd.to_numeric(clean_data["age"])

print("After:")
clean_data.dtypes

In [None]:
"""
Assessing data: Invalid value
"""

# Check unique value
for col in clean_data.columns:
    print(col)
    print(clean_data[col].unique())
    print("\n")

# Check which data is invalid
# clean_data[clean_data["gender"] == "Unknown"]

# How would you prefer to handle it?

In [None]:
"""
Assessing data: Correlation
"""

clean_data.corr(numeric_only=True)

# TODO: Identify highly correlated features
# TODO: Drop one of them using the drop() method
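
One possible way to find highly correlated pairs is to scan the upper triangle of the correlation matrix. The sketch below is an illustration only: `demo_df` is made-up data and the 0.9 threshold is an assumed cut-off, not a rule.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    "grade": [60, 70, 80, 90, 100],
    "final_score": [62, 69, 81, 88, 101],
    "study_hours": [3, 1, 2, 1, 3],
})

# Absolute correlations, so strong negative relationships are caught too
corr = demo_df.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced_df = demo_df.drop(columns=to_drop)
print(to_drop)
print(reduced_df.columns.tolist())
```

This approach drops the later column of each correlated pair; which of the two to keep in practice depends on which one is more useful for the analysis.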

In [None]:
"""
Assessing data: Outlier
"""

# An easy way to detect outliers
sns.boxplot(data=clean_data.select_dtypes(include=[np.number]))
plt.xticks(rotation=45)
plt.title("Boxplot of Numeric Columns")
plt.show()

In [105]:
"""
Cleaning data: Outlier
"""

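# Identify outliers using the z-score
# z = (x - mean) / std
# A common rule flags |z| > 3. With n rows, |z| cannot exceed sqrt(n - 1)
# (about 2.65 for n = 8), so this small dataset uses a threshold of 2 instead.

zscores = stats.zscore(clean_data["age"])
zscores

# Drop the outliers (np.abs catches both unusually low and unusually high values)
# clean_data = clean_data[np.abs(zscores) <= 2]
# clean_data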

# TODO: Identify outlier in grade using IQR
# TODO: Drop the outlier
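
The IQR rule flags values that fall more than 1.5 × IQR below the first quartile or above the third quartile. A minimal sketch, using a standalone series (`values`) that mirrors the ages above rather than the grade column from the TODO:

```python
import pandas as pd

values = pd.Series([20, 21, 22, 20, 21, 20, 200, 21], name="age")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_outlier = (values < lower_bound) | (values > upper_bound)
filtered = values[~is_outlier]

print(values[is_outlier].tolist())
print(len(filtered))
```

Unlike the z-score, the quartiles are barely affected by the extreme value itself, which makes this rule more dependable on small samples.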

In [None]:
"""
Assessing data: Skewness
"""

clean_data.hist(figsize=(12, 6))

In [None]:
"""
Assessing data: Skewness
"""

# Skewness is only defined for numeric columns
for col in clean_data.select_dtypes(include="number").columns:
    print(col)
    print(stats.skew(clean_data[col]))
    print(" ")



In [None]:
"""
Cleaning data: Transformation
"""

x = clean_data['study_hours']

print(stats.skew(x))

clean_data['log_transformed'] = np.log1p(x)  # log(x + 1)
print(stats.skew(clean_data['log_transformed']))

clean_data
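
A log transform has the most visible effect on strongly right-skewed data. The sketch below uses a made-up series (`hours`) that doubles at each step, and measures skewness with pandas' `Series.skew()`, which applies a bias correction, so its numbers differ slightly from the default of `scipy.stats.skew`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: each one is double the previous
hours = pd.Series([1, 2, 4, 8, 16, 32, 64, 128], dtype="float64")

skew_before = hours.skew()

# log1p computes log(x + 1), which is safe when a value is 0
log_hours = np.log1p(hours)
skew_after = log_hours.skew()

print(skew_before, skew_after)
```

The transformed values are spread far more evenly, so the skewness moves much closer to 0.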

In [None]:
""" 
Assessing data: Statistical summary
"""

dirty_data.describe()

In [None]:
""" 
Class Activity: Data Wrangling
"""

# TODO: Perform data wrangling on the dataset in the class activity folder inside the Data Wrangling Dataset folder

### **Reflection**
If you encounter a lot of missing values, how do you handle them?

(answer here)

### **Exploration**
Next, we will learn how to gain deeper insights from data through Exploratory Data Analysis (EDA). Explore EDA notebooks on Kaggle to see practical examples. Remember, as you work with more diverse datasets, your skills will continue to sharpen.