# Data Wrangling Exercise

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## Overview

Data wrangling or data munging is the process of cleaning, transforming, and mapping data from one form to another to utilize it for tasks such as analytics, summarization, reporting, visualization, and so on.

Data wrangling is one of the most important and involved steps in the whole data science workflow. The output of this process directly affects all downstream steps such as exploration, summarization, visualization, analysis, and even the final result. This is why data scientists spend so much of their time on data collection and wrangling.

## Learning Objectives

- Learn to clean and transform data using pandas
- Master key data wrangling operations:
  - Renaming and rearranging columns
  - Filtering data
  - Handling missing values
  - Managing duplicates
  - Encoding categorical variables
  - Normalizing numeric values
- Perform data summarization and aggregation

### Tasks to be completed

- Clean column names
- Sort and filter data
- Handle missing values
- Remove duplicates
- Encode categorical variables
- Normalize numeric features
- Perform data aggregation

## Prerequisites

- Python programming environment
- Basic understanding of statistical and machine learning concepts
- Familiarity with common ML libraries


## Get Started

- Select the "conda_python3" kernel in your SageMaker notebook instance.

### Import necessary libraries


In [None]:
# Import required libraries
import numpy as np  # NumPy for numerical operations
import pandas as pd  # Pandas for handling tabular data
from IPython.display import display  # Allows displaying objects in Jupyter Notebook outputs
from sklearn import preprocessing  # Import preprocessing utilities from scikit-learn

# Suppress chained assignment warning in Pandas
# This warning occurs when modifying a slice of a DataFrame, which can sometimes lead to unexpected behavior.
# Setting it to None disables the warning but requires careful handling to avoid unintended data modifications.
pd.options.mode.chained_assignment = None  

## Data wrangling utility functions


In [None]:
def describe_dataframe(df=pd.DataFrame()):
    """
    This function generates descriptive statistics for a given dataframe.
    
    Args:
        df (pd.DataFrame): The dataframe to be analyzed. Defaults to an empty DataFrame.
    
    Returns:
        None
    """

    # Print section header
    print("\n\n")
    print("*" * 30)
    print("About the Data")
    print("*" * 30)

    # Print the number of rows and columns in the dataframe
    print("Number of rows::", df.shape[0])
    print("Number of columns::", df.shape[1])
    print("\n")

    # Print column names
    print("Column Names::", df.columns.values.tolist())
    print("\n")

    # Print data types of each column
    print("Column Data Types::\n", df.dtypes)



def cleanup_column_names(df, rename_dict={}, do_inplace=True):
    """This function renames columns of a pandas dataframe
       It converts column names to snake case if rename_dict is not passed.
    Args:
        rename_dict (dict): keys represent old column names and values point to
                            newer ones
        do_inplace (bool): flag to update existing dataframe or return a new one
    Returns:
        pandas dataframe if do_inplace is set to False, None otherwise

    """
    # If rename_dict is empty or None, apply automatic column renaming
    if not rename_dict:  
        return df.rename(
            columns={
                # Convert column names to lowercase and replace spaces and slashes with underscores
                col: col.lower().replace(" ", "_").replace(r"/", "_")  
                for col in df.columns.values.tolist()
            },
            inplace=do_inplace,  # Apply renaming in place if do_inplace is True
        )
    else:
        # If rename_dict is provided, use it directly for renaming
        return df.rename(columns=rename_dict, inplace=do_inplace)
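
As a rough alternative sketch, the same snake-case conversion can also be written with pandas' vectorized string methods on the column index. The toy DataFrame and its column names below are made up for illustration and are not part of the wine data.

```python
import pandas as pd

# Hypothetical toy frame with messy column names (not the wine dataset)
toy = pd.DataFrame({"Malic Acid": [1.7, 2.4], "OD280/OD315 Ratio": [3.9, 3.4]})

# Lowercase the names, then replace spaces and slashes with underscores
toy.columns = (
    toy.columns.str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("/", "_", regex=False)
)

print(toy.columns.tolist())
```

Unlike `cleanup_column_names` with `do_inplace=True`, this version assigns a new column index rather than calling `rename`, so there is no return value to keep track of.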

## Wine recognition dataset

This is the UCI ML Wine recognition dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. Thirteen measurements were taken of different constituents found in the three types of wine.

Original Owners:

Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Dataset characteristics:

- Number of Instances: 178 (59, 71, and 48 in the three classes; see the class counts below)
- Number of Attributes: 13 numeric, predictive attributes and the class
- Attribute info:

1. **Alcohol**: alcohol content, reported in units of ABV (alcohol by volume).

1. **Malic acid**: one of the principal organic acids found in wine. Although found in nearly every fruit and berry, its flavor is most prominent in green apples; likewise, it projects this sour flavor into wine. For more information, feel free to read about acids in wine.

1. **Ash**: yep, wine has ash in it. Ash is simply the inorganic matter left after evaporation and incineration.

1. **Alcalinity of ash**: the alkalinity of ash determines how basic (as opposed to acidic) the ash in a wine is.

1. **Magnesium**: magnesium is a metal that affects the flavor of wine.

1. **Total phenols**: Phenols are chemicals that affect the taste, color, and mouthfeel (i.e., texture) of wine. For some (very) in-depth information about phenols, we refer you to phenolic content in wine.

1. **Flavanoids**: flavonoids are a type of phenol.

1. **Nonflavanoid phenols**: nonflavonoids are another type of phenol.

1. **Proanthocyanins**: proanthocyanidins are yet another type of phenol.

1. **Color intensity**: the color intensity of a wine: i.e., how dark it is.

1. **Hue**: the hue of a wine, which is typically determined by the color of the cultivar used (although this is not always the case).

1. **OD280/OD315 of diluted wines**: protein content measurements.

1. **Proline**: an amino acid present in wines.

- Class
  - Class 0: 59
  - Class 1: 71
  - Class 2: 48

'messy_wine_data.csv' is a modified version of the 'Wine recognition dataset' with some missing values introduced.


## Import data

In [None]:
# Define the file path for the messy wine dataset
messy_wine_data = "../../Data/messy_wine_data.csv"

# Load the dataset into a Pandas DataFrame
df = pd.read_csv(messy_wine_data)

# Display the first few rows of the dataset to inspect its contents
df.head()

In [None]:
# Call the describe_dataframe function to print an overview of the given
# DataFrame (df), including its number of rows and columns and its
# column names.
describe_dataframe(df)

In [None]:
# Print the shape (dimensions) of the DataFrame `df`
# `df.shape` returns a tuple (number of rows, number of columns)
print("Shape of df={}".format(df.shape))

## Rename Columns


In [None]:
# Print the column names of the DataFrame 'df'.
print("Dataframe columns:\n{}".format(df.columns.tolist()))
# - `df.columns` accesses the column index of the DataFrame 'df'.
# - `.tolist()` converts the column index (which is a Pandas Index object) into a Python list.
# - `"Dataframe columns:\n{}".format(...)` is a string formatting method that inserts the list of column names after the label, on a new line for better readability.
# - `print(...)` function displays the formatted string to the console, showing the list of column names.

In [None]:
# Call cleanup_column_names function on a Pandas DataFrame.
cleanup_column_names(df)

In [None]:
# Print the list of column names of the DataFrame 'df'.
print("Dataframe columns:\n{}".format(df.columns.tolist()))

## Sort Rows on defined attributes


In [None]:
# Display the first 5 rows of the DataFrame 'df'.
# This function is useful for quickly inspecting the structure and sample data of the DataFrame.
df.head()

In [None]:
# Sort the DataFrame `df` based on two columns:  
# 1. `malic_acid` in ascending order (smallest to largest)  
# 2. `ash` in descending order (largest to smallest)  
sorted_df = df.sort_values(["malic_acid", "ash"], ascending=[True, False])

# Display the first few rows of the sorted DataFrame  
display(sorted_df.head())

In [None]:
# Sort the DataFrame by the "alcohol" column in descending order
# The highest alcohol values will appear first
sorted_df =  # Your code goes here

# Display the first few rows of the sorted DataFrame
display(sorted_df.head())
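
For reference, here is a minimal sketch of single-key and multi-key sorting with `sort_values`. It runs on a made-up toy DataFrame (the `name`, `score`, and `age` columns are hypothetical), so it illustrates the call pattern rather than the answer for the wine data.

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "score": [3.0, 9.5, 1.2, 9.5],
        "age": [30, 25, 41, 22],
    }
)

# Single key: largest score first
by_score = toy.sort_values("score", ascending=False)

# Two keys: score descending, ties broken by age ascending
by_score_age = toy.sort_values(["score", "age"], ascending=[False, True])

print(by_score_age)
```

`sort_values` returns a new DataFrame and leaves `toy` unchanged, which is why the result is assigned to a new name.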

## Rearrange Columns in a Dataframe


In [None]:
# Display the first 5 rows of the DataFrame 'df'.
# This function is useful for quickly inspecting the structure and sample data of the DataFrame.
df.head()

In [None]:
# Rearrange the columns in a specific order for better readability and analysis.
# The columns are ordered as: 'class', followed by various chemical properties of the wine.

display(
    df[
        [   # List of columns arranged in the desired order
            "class",  # Target variable representing the wine class/category
            "alcohol",  # Alcohol content in the wine
            "malic_acid",  # Malic acid concentration
            "ash",  # Ash content in the wine
            "alcalinity_of_ash",  # Alkalinity of ash
            "magnesium",  # Magnesium content
            "total_phenols",  # Total phenolic compounds
            "flavanoids",  # Flavonoid content
            "nonflavanoid_phenols",  # Non-flavonoid phenols
            "proanthocyanins",  # Proanthocyanins (a type of phenolic compound)
            "color_intensity",  # Intensity of the wine color
            "hue",  # Hue of the wine
            "od280_od315_of_diluted_wines",  # OD280/OD315 ratio (indicator of wine quality)
            "proline",  # Proline content (an amino acid relevant to wine properties)
        ]
    ].head()  # Display the first few rows of the reordered DataFrame
)

In [None]:
# Display the first few rows of the dataframe (head) with columns rearranged in a specified order

# Rearrange columns in the order of 'alcohol', 'color_intensity',   'hue',
# 'malic_acid',   'ash',  'alcalinity_of_ash', 'magnesium',  'total_phenols',
# 'flavanoids',  'nonflavanoid_phenols', 'proanthocyanins',
# 'od280_od315_of_diluted_wines','proline', 'class'.

# The `display()` function is used to show the top rows of the dataframe.

# Your code goes here

## Filtering Columns

### Using Column Index


In [None]:
# Access the DataFrame's 4th column (zero-based index 3) using iloc
# iloc[:, 3] selects all rows from the column at index 3
# .values returns the underlying NumPy array for that column
# [0:10] slices the first 10 values from the array

# Print the first 10 values from the 4th column

# Your code goes here

### Using Column Name


In [None]:
# Print the first 10 values of the 'total_phenols' column from the DataFrame
# 'df' is assumed to be the DataFrame containing the data

# Accesses 'total_phenols' column and prints the first 10 values

# Your code goes here

### Using Column Datatype


In [None]:
# Select columns with data type 'float64' from the DataFrame (df)
# `select_dtypes(include=["float64"])` filters columns that are of type float64
# This will return a DataFrame containing only the float64 columns
float_columns = df.select_dtypes(include=["float64"])

# Print the first 10 values of the first float64 column (from the filtered DataFrame)
# `values` gives the underlying NumPy array of the DataFrame
# `[:10, 0]` selects the first 10 rows and the first column (0-indexed) from the NumPy array
print(float_columns.values[:10, 0])
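
The sketch below contrasts positional selection with dtype-based selection on a small, made-up DataFrame (the column names are hypothetical). It is only meant to show that `iloc` positions are zero-based and that `select_dtypes` accepts both `include` and `exclude`.

```python
import pandas as pd

# Hypothetical toy data with one text, one integer, and one float column
toy = pd.DataFrame({"label": ["x", "y"], "count": [1, 2], "ratio": [0.5, 0.25]})

# Positions are zero-based, so index 1 is the second column ("count")
second_col = toy.iloc[:, 1]

# "number" matches both integer and float columns
numeric = toy.select_dtypes(include="number")

# Restrict to 64-bit floats only
floats = toy.select_dtypes(include=["float64"])

# Everything that is not numeric
non_numeric = toy.select_dtypes(exclude="number")

print(numeric.columns.tolist(), floats.columns.tolist(), non_numeric.columns.tolist())
```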

## Filtering Rows

### Select Specific Rows


In [None]:
# Select specific rows by their index positions: 21, 45, and 100
# The .iloc[] method is used to access rows by their integer index positions (0-based index)

# Display rows 21, 45, and 100 from the DataFrame 'df'

# Your code goes here

### Exclude Specific Row indices


In [None]:
# drop the first and third rows (index 0 and index 2)
# The 'axis=0' specifies that we are working with rows (not columns).
# 'drop' method removes the rows from the DataFrame based on their index values.

# Display the first few rows of the modified DataFrame

# Your code goes here

### Conditional Filtering


In [None]:
# Filter the rows where the 'ash' column value is greater than 2
# Then display the first 5 rows of the resulting DataFrame

# Your code goes here

### Offset from top of the dataframe


In [None]:
# Skip the top 100 rows of the DataFrame `df` and display the first 5 rows of the remaining data

# Your code goes here

### Offset from bottom of the dataframe


In [None]:
# Skip the last 10 rows of the DataFrame 'df' and display the first few rows of the remaining data
# The slicing `df[:-10]` skips the last 10 rows by specifying a range that goes up to the 10th-to-last row

# Display the first few rows of the modified DataFrame

# Your code goes here
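
The row-filtering patterns in this section can be sketched on a toy DataFrame as follows. The `value` column and the row positions used here are made up for illustration; the exercises above use different rows and columns.

```python
import pandas as pd

# Hypothetical toy data with a default 0..5 integer index
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]})

picked = toy.iloc[[0, 3]]           # rows at integer positions 0 and 3
dropped = toy.drop([1, 2], axis=0)  # remove rows with index labels 1 and 2
large = toy[toy["value"] > 35]      # boolean mask: keep rows where the condition holds
skip_top = toy.iloc[2:]             # skip the first 2 rows
skip_bottom = toy.iloc[:-2]         # skip the last 2 rows

print(large)
```

Note that `iloc` works on positions while `drop` works on index labels; the two coincide here only because the toy frame uses the default integer index.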

## TypeCasting/Data Type Conversion


In [None]:
# Print the data types of the columns in the original dataframe
print("Old dtypes:\n", df.dtypes)

# Change the data type of the 'hue' column from float (float64) to integer (int)

# Your code goes here

# Print the data types of the columns in the dataframe after the change
print("New dtypes:\n", df.dtypes)
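
One thing to watch when casting floats to integers: a plain integer dtype cannot hold missing values. Whether the column you are converting contains any is not stated here, so the sketch below uses a made-up Series (named `ratio`, hypothetical) to show what happens and two possible ways around it.

```python
import numpy as np
import pandas as pd

# Hypothetical float data with one missing value
s = pd.Series([1.04, 0.86, np.nan], name="ratio")

# A plain integer cast cannot represent NaN, so pandas raises an error
try:
    s.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Option 1: drop (or fill) missing values first; the cast then truncates toward zero
truncated = s.dropna().astype(int)

# Option 2: round first, then use the nullable integer dtype, which keeps <NA>
nullable = s.round().astype("Int64")

print(cast_failed, truncated.tolist(), nullable.tolist())
```

Either way, the fractional part is lost, so it is worth checking that an integer representation actually makes sense for the column.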

## Missing Values

_Note: you may get some `FutureWarning` notifications in the following cells. They shouldn't cause problems._


In [None]:
# Drop rows with missing values in the 'malic_acid' column
# This will remove any rows where 'malic_acid' is NaN

df_dropped = # Your code goes here

# Display the shape of the DataFrame after dropping rows
df_dropped.shape  # This will return the number of rows and columns in the new DataFrame

In [None]:
# Fill missing 'magnesium' values in the df_dropped DataFrame with the mean of 'magnesium' column
# We use np.round to round the mean to 2 decimal places

# Your code goes here

In [None]:
# Fill missing values in the "flavanoids" column using the previous row's value (forward fill)
# This method propagates the last valid observation forward to the next missing value

# Your code goes here

In [None]:
# Fill missing values in the "flavanoids" column by using the value from the next row (backward fill)

# Your code goes here
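
The four strategies above (dropping rows, mean imputation, forward fill, and backward fill) can be sketched on a toy DataFrame as shown below. The columns `a` and `b` are made up for illustration. The sketch uses the `.ffill()` and `.bfill()` methods; in recent pandas releases, `fillna(method=...)` is deprecated and is one possible source of `FutureWarning` messages.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns
toy = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, 4.0], "b": [np.nan, 2.0, np.nan, 8.0]}
)

# Drop only the rows where column "a" is missing
kept = toy.dropna(subset=["a"])

# Replace missing "b" values with the column mean, rounded to 2 decimals
mean_filled = toy["b"].fillna(np.round(toy["b"].mean(), 2))

# Forward fill: carry the last valid value forward (a leading NaN stays NaN)
forward = toy["b"].ffill()

# Backward fill: pull the next valid value backward
backward = toy["b"].bfill()

print(mean_filled.tolist(), backward.tolist())
```

Forward fill cannot repair a gap in the very first row, and backward fill cannot repair one in the last row, because there is no neighboring value to copy from.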

## Duplicates


In [None]:
# Display the first few rows of the dataframe before dropping duplicates
display(df_dropped.head())

# Print the shape (number of rows and columns) of the dataframe before dropping duplicates
print("Shape of df before dropping duplicates ={}".format(df_dropped.shape))

In [None]:
# Drop duplicate rows based on the 'alcohol' column

# Your code goes here

# Display the updated dataframe (after duplicates are dropped) to verify the changes
display(df_dropped.head())

# Print the shape (number of rows and columns) of the dataframe after dropping duplicates
print("Shape of df after dropping duplicates ={}".format(df_dropped.shape))
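
As a small illustration of how duplicates are detected and dropped on a subset of columns, consider the toy DataFrame below (the `key` and `val` columns are hypothetical). The `keep` argument controls which of the duplicated rows survives.

```python
import pandas as pd

# Hypothetical toy data where "key" repeats
toy = pd.DataFrame({"key": [1, 1, 2, 3, 3], "val": ["a", "b", "c", "d", "e"]})

# Flag rows whose "key" has already appeared earlier
flags = toy.duplicated(subset=["key"])

# Keep the first occurrence of each key (the default behavior)
first_kept = toy.drop_duplicates(subset=["key"])

# Keep the last occurrence instead
last_kept = toy.drop_duplicates(subset=["key"], keep="last")

print(first_kept)
print(last_kept)
```

Dropping duplicates on a single measurement column treats any two rows with the same value as duplicates, even if their other columns differ, so the choice of `subset` matters.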

## Encode Categoricals


In [None]:
# Get One Hot Encoding using get_dummies() for 'class'
# The get_dummies() function converts categorical variable(s) into dummy/indicator variables
# Here, we are applying it to the 'class' column of the DataFrame 'df'

# 'columns=["class"]' specifies that the 'class' column will be encoded

# Your code goes here

# The 'head()' function displays the first 5 rows of the resulting DataFrame after encoding
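
To show what one-hot encoding produces, here is a minimal sketch on a made-up DataFrame (the `grade` and `score` columns are hypothetical). Each distinct value of the encoded column becomes its own indicator column, named with the original column as a prefix.

```python
import pandas as pd

# Hypothetical toy data with a categorical column stored as integers
toy = pd.DataFrame({"grade": [0, 1, 2, 1], "score": [9.1, 7.4, 8.0, 6.5]})

# Encode "grade"; untouched columns come first, indicator columns are appended
encoded = pd.get_dummies(toy, columns=["grade"])

print(encoded.columns.tolist())
print(encoded.head())
```

Depending on the pandas version, the indicator columns may be boolean or integer typed; both carry the same information.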

## Random Sampling data from DataFrame


In [None]:
# Randomly sample 30% of the rows from the DataFrame (with replacement)
# frac=0.3: Specifies the fraction of rows to sample (30%)
# replace=True: Sampling is done with replacement, meaning the same row can be sampled multiple times
# random_state=42: Ensures reproducibility by setting a fixed seed for the random number generator
# display(): Used to show the first few rows of the randomly sampled DataFrame

# Your code goes here
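
The effect of `replace` and `random_state` can be seen in a short sketch on a toy DataFrame (the `value` column, the fractions, and the seed are all made up for illustration).

```python
import pandas as pd

# Hypothetical toy data: ten rows numbered 0..9
toy = pd.DataFrame({"value": range(10)})

# Sample half of the rows with replacement: a row may appear more than once
with_repl = toy.sample(frac=0.5, replace=True, random_state=0)

# Sample a fixed number of rows without replacement: every row is distinct
without_repl = toy.sample(n=5, random_state=0)

# Re-using the same seed reproduces the same sample
again = toy.sample(frac=0.5, replace=True, random_state=0)

print(with_repl)
```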

## Normalizing Numeric Values

Normalizing values is often an important step in machine learning because it keeps features on comparable scales. Features with larger magnitudes can dominate a model, particularly in algorithms that are sensitive to scale, such as models trained with gradient descent (e.g., neural networks) and distance-based algorithms (e.g., k-Nearest Neighbors, k-Means). Normalization rescales features to a standard range such as [0, 1] (min-max scaling) or to a mean of 0 and a standard deviation of 1 (standardization), which can speed up convergence and makes features directly comparable. Not every scaler handles outliers well: min-max scaling is sensitive to extreme values, whereas a robust scaler, which centers on the median and scales by the interquartile range, is much less affected by them.

### Normalize 'alcohol' values using **Min-Max Scaler**


In [None]:
# Normalize 'alcohol' values using Min-Max Scaler

# Create a copy of the dataframe without missing values to avoid altering the original data
df_normalized = df.dropna().copy()

# Create an instance of MinMaxScaler from sklearn.preprocessing to scale data between 0 and 1
min_max_scaler = preprocessing.MinMaxScaler()

# Transform the 'alcohol' column values:
# - Reshape the data to (-1, 1) to ensure it’s in the correct 2D format for the scaler
# - Fit the MinMaxScaler to the 'alcohol' values and then transform them into scaled values
alcohol_scaled = min_max_scaler.fit_transform(
    df_normalized["alcohol"].values.reshape(-1, 1)
)

# Replace the original 'alcohol' column with the scaled values
# (ravel() flattens the (n, 1) scaler output back to a 1D array)
df_normalized["alcohol"] = alcohol_scaled.ravel()

In [None]:
# Display the first few rows of the normalized DataFrame to check the results
display(df_normalized.head())

### Normalize 'magnesium' values using **Robust Scaler**


In [None]:
# Normalize 'magnesium' values using Robust Scaler

# Create a copy of the DataFrame with any NaN values dropped for clean data
df_normalized = df.dropna().copy()

# Create a RobustScaler instance (this scales data by removing the median and scaling according to the interquartile range)

# Your code goes here

# Apply RobustScaler to the 'magnesium' column by reshaping it into a 2D array (required by the scaler)
# The .fit_transform() method fits the scaler and applies it to normalize the 'magnesium' values

# Your code goes here

# Replace the original 'magnesium' values in the DataFrame with the scaled values
# (ravel() flattens the (n, 1) scaler output back to a 1D array)
df_normalized["magnesium"] = magnesium_scaled.ravel()

In [None]:
# Display the first few rows (default 5 rows) of the DataFrame to inspect the normalized data
display(df_normalized.head())
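
To make the difference between the two scalers concrete, the sketch below re-implements their arithmetic with plain NumPy on a made-up array containing one outlier. It is intended to mirror what `MinMaxScaler` and `RobustScaler` compute with their default settings (range [0, 1]; median centering and the 25th to 75th percentile range), not to replace them.

```python
import numpy as np

# Hypothetical values; the last one is a deliberate outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: map the minimum to 0 and the maximum to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: subtract the median and divide by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(min_max)
print(robust)
```

With min-max scaling, the outlier squeezes the four ordinary values into a narrow band near 0. With robust scaling, the ordinary values keep a usable spread and only the outlier itself ends up far from 0.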

## Data Summarization

### Condition-Based Aggregation


In [None]:
# Get the mean 'hue' of class 1 wine
# Use .loc to select the rows where the 'class' column is equal to 1 and the
# 'hue' column in a single step, then calculate the mean.
mean_hue = df.loc[df["class"] == 1, "hue"].mean()

# Print the mean 'hue' value for class 1 wine
print("Mean 'hue' of class 1 wine :: {}".format(mean_hue))

In [None]:
# Filter the rows where the 'class' column is equal to 0 (class 0 wine)
# Then, select the 'alcohol' column from these filtered rows
# Finally, get the maximum value from the 'alcohol' column for class 0 wine

# Your code goes here

# Print the result, showing the maximum 'alcohol' value for class 0 wine
print("Max 'alcohol' of class 0 wine :: {}".format(max_alcohol))
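
Condition-based aggregation computes one statistic for one group at a time. When the same statistics are needed for every group, `groupby` is a common alternative. The sketch below uses a made-up DataFrame (the `group` and `x` columns are hypothetical) to show both approaches side by side.

```python
import pandas as pd

# Hypothetical toy data with two groups
toy = pd.DataFrame({"group": [0, 0, 1, 1, 1], "x": [1.0, 3.0, 2.0, 4.0, 6.0]})

# Condition-based: mean of x for group 1 only
mean_x_group1 = toy.loc[toy["group"] == 1, "x"].mean()

# Group-based: mean and max of x for every group at once
summary = toy.groupby("group")["x"].agg(["mean", "max"])

print(mean_x_group1)
print(summary)
```

The `summary` frame is indexed by group, so a single value can be looked up with `summary.loc[1, "mean"]`.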

## Conclusion

Through this exercise, we learned essential data wrangling techniques including:

- Data cleaning and transformation
- Column manipulation
- Row filtering
- Missing value handling
- Data type conversion
- Categorical encoding
- Numeric value normalization
- Data summarization

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
