# Data Wrangling Exercise

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## Overview

Data wrangling or data munging is the process of cleaning, transforming, and mapping data from one form to another to utilize it for tasks such as analytics, summarization, reporting, visualization, and so on.

Data wrangling is one of the most important and involved steps in the whole data science workflow. The output of this process directly affects all downstream steps, such as exploration, summarization, visualization, analysis, and even the final result. This is why data scientists spend so much of their time on data collection and wrangling.

## Learning Objectives

- Learn to clean and transform data using pandas
- Master key data wrangling operations:
  - Renaming and rearranging columns
  - Filtering data
  - Handling missing values
  - Managing duplicates
  - Encoding categorical variables
  - Normalizing numeric values
- Perform data summarization and aggregation

### Tasks to be completed

- Clean column names
- Sort and filter data
- Handle missing values
- Remove duplicates
- Encode categorical variables
- Normalize numeric features
- Perform data aggregation

## Prerequisites

- Python programming environment
- Basic understanding of statistical and machine learning concepts
- Familiarity with common ML libraries


## Get Started

### Set up conda environment

TODO: use official file

Ensure that you have created the conda environment using the environment file included in this repository (the command below uses `conda_env_submodule_4.yml`). E.g.,

```
# Create conda environment
conda env create -f conda_env_submodule_4.yml

# Register the kernel
python -m ipykernel install --user \
    --name=nigms_sandbox_ud__submodule_4 \
    --display-name "Python (NIGMS Sandbox UD, Submodule 4)"
```

Then, when starting the notebook, select the `"Python (NIGMS Sandbox UD, Submodule 4)"` kernel from the list.

Note that you may need to restart Jupyter Lab for these changes to take effect.

### Import necessary libraries


In [None]:
# import required libraries
import numpy as np
import pandas as pd
from IPython.display import display  # Display a Python object in all frontends
from sklearn import preprocessing

pd.options.mode.chained_assignment = (
    None  # ignoring the warning when working on slices of dataframes
)

## Import data

TODO: move the data import here


## Data wrangling utility functions


In [None]:
def describe_dataframe(df=pd.DataFrame()):
    """This function generates descriptive stats of a dataframe
    Args:
        df (dataframe): the dataframe to be analyzed
    Returns:
        None

    """
    print("\n\n")
    print("*" * 30)
    print("About the Data")
    print("*" * 30)

    print("Number of rows::", df.shape[0])
    print("Number of columns::", df.shape[1])
    print("\n")

    print("Column Names::", df.columns.values.tolist())
    print("\n")

    print("Column Data Types::\n", df.dtypes)
    print("\n")

    print("Columns with Missing Values::", df.columns[df.isnull().any()].tolist())
    print("\n")

    print("Number of rows with Missing Values::", df.isna().any(axis=1).sum())
    print("\n")

    print("Sample Indices with missing data::", df[df.isna().any(axis=1)].index[0:5])
    print("\n")

    print("General Stats::")
    df.info()  # info() prints its report directly and returns None
    print("\n")

    print("Summary Stats::")
    print(df.describe())
    print("\n")

    print("Dataframe Sample Rows::")
    display(df.head(5))


def cleanup_column_names(df, rename_dict=None, do_inplace=True):
    """This function renames columns of a pandas dataframe
       It converts column names to snake case if rename_dict is not passed.
    Args:
        df (dataframe): the dataframe whose columns are renamed
        rename_dict (dict): keys represent old column names and values point to
                            newer ones
        do_inplace (bool): flag to update existing dataframe or return a new one
    Returns:
        pandas dataframe if do_inplace is set to False, None otherwise

    """
    if not rename_dict:
        return df.rename(
            columns={
                col: col.lower().replace(" ", "_").replace(r"/", "_")
                for col in df.columns.values.tolist()
            },
            inplace=do_inplace,
        )
    else:
        return df.rename(columns=rename_dict, inplace=do_inplace)
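
As a quick illustration of the default renaming rule, the sketch below applies the same lower-case and underscore substitutions to a small, made-up dataframe. The toy column names are hypothetical and only chosen to resemble raw wine headers.

```python
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration
toy = pd.DataFrame({"Malic Acid": [1.71], "OD280/OD315 of Diluted Wines": [3.92]})

# Same rule as cleanup_column_names without a rename_dict:
# lower-case the name, then replace spaces and slashes with underscores
renamed = toy.rename(
    columns=lambda col: col.lower().replace(" ", "_").replace("/", "_")
)
print(renamed.columns.tolist())
```

Because `inplace` is left at its default here, `toy` keeps its original column names and `renamed` is a new dataframe.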

## Wine recognition dataset

This is the UCI ML Wine recognition dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data are the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. Thirteen measurements were taken of different constituents found in the three types of wine.

Original Owners:

Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Dataset characteristics:

- Number of Instances: 178 (59, 71, and 48 in the three classes; see the class counts below)
- Number of Attributes: 13 numeric, predictive attributes and the class
- Attribute info:

1. **Alcohol**: alcohol content, reported in units of ABV (alcohol by volume).

1. **Malic acid**: one of the principal organic acids found in wine. Although found in nearly every fruit and berry, its flavor is most prominent in green apples; likewise, it projects this sour flavor into wine. For more information, feel free to read about acids in wine.

1. **Ash**: yep, wine has ash in it. Ash is simply the inorganic matter left after evaporation and incineration.

1. **Alcalinity of ash**: the alkalinity of ash determines how basic (as opposed to acidic) the ash in a wine is.

1. **Magnesium**: magnesium is a metal that affects the flavor of wine.

1. **Total phenols**: Phenols are chemicals that affect the taste, color, and mouthfeel (i.e., texture) of wine. For some (very) in-depth information about phenols, we refer you to phenolic content in wine.

1. **Flavanoids**: flavonoids are a type of phenol (the dataset spells the column `flavanoids`).

1. **Nonflavanoid phenols**: nonflavonoids are another type of phenol.

1. **Proanthocyanins**: proanthocyanidins are yet another type of phenol.

1. **Color intensity**: the color intensity of a wine: i.e., how dark it is.

1. **Hue**: the hue of a wine, which is typically determined by the color of the cultivar used (although this is not always the case).

1. **OD280/OD315 of diluted wines**: protein content measurements.

1. **Proline**: an amino acid present in wines.

- Class
  - Class 0: 59
  - Class 1: 71
  - Class 2: 48

'messy_wine_data.csv' is a modified version of the 'Wine recognition dataset' in which some missing values have been introduced.


In [None]:
# Load messy wine dataset
messy_wine_data = "../../Data/messy_wine_data.csv"

df = pd.read_csv(messy_wine_data)
df.head()

In [None]:
# describe the stats of dataframe
describe_dataframe(df)

In [None]:
print("Shape of df={}".format(df.shape))

## Rename Columns


In [None]:
print("Dataframe columns:\n{}".format(df.columns.tolist()))

In [None]:
cleanup_column_names(df)

In [None]:
print("Dataframe columns:\n{}".format(df.columns.tolist()))

## Sort Rows on defined attributes


In [None]:
df.head()

In [None]:
# Sort data by ascending malic_acid and decreasing ash
display(df.sort_values(["malic_acid", "ash"], ascending=[True, False]).head())

In [None]:
# Sort data by decreasing alcohol
display(df.sort_values(["alcohol"], ascending=[False]).head())
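
To make the multi-key sort easier to follow, here is a minimal sketch on a made-up three-row dataframe (the values are invented for illustration). The second key only matters where the first key is tied.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 tie on malic_acid
toy = pd.DataFrame({"malic_acid": [1.5, 1.5, 0.9], "ash": [2.1, 2.6, 2.3]})

# Ascending malic_acid first; ties are broken by ash, largest first
sorted_toy = toy.sort_values(["malic_acid", "ash"], ascending=[True, False])
print(sorted_toy)
```

Note that `sort_values` returns a new dataframe and keeps the original index labels, so the row order here becomes 2, 1, 0.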

## Rearrange Columns in a Dataframe


In [None]:
df.head()

In [None]:
# Rearrange columns in this order: 'class', 'alcohol', 'malic_acid', 'ash',
# 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids',
# 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue',
# 'od280_od315_of_diluted_wines', 'proline'.
display(
    df[
        [
            "class",
            "alcohol",
            "malic_acid",
            "ash",
            "alcalinity_of_ash",
            "magnesium",
            "total_phenols",
            "flavanoids",
            "nonflavanoid_phenols",
            "proanthocyanins",
            "color_intensity",
            "hue",
            "od280_od315_of_diluted_wines",
            "proline",
        ]
    ].head()
)

In [None]:
# Rearrange columns in this order: 'alcohol', 'color_intensity', 'hue',
# 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols',
# 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
# 'od280_od315_of_diluted_wines', 'proline', 'class'.
display(
    df[
        [
            "alcohol",
            "color_intensity",
            "hue",
            "malic_acid",
            "ash",
            "alcalinity_of_ash",
            "magnesium",
            "total_phenols",
            "flavanoids",
            "nonflavanoid_phenols",
            "proanthocyanins",
            "od280_od315_of_diluted_wines",
            "proline",
            "class",
        ]
    ].head()
)

## Filtering Columns

### Using Column Index


In [None]:
# print 10 values from column at index 3
print(df.iloc[:, 3].values[0:10])

### Using Column Name


In [None]:
# print 10 values of total_phenols
print(df.total_phenols.values[0:10])

### Using Column Datatype


In [None]:
# print the first 10 values of the first column with data type float64
print(df.select_dtypes(include=["float64"]).values[:10, 0])

## Filtering Rows

### Select specific rows


In [None]:
# Select the rows at positions 21, 45, and 100
display(df.iloc[[21, 45, 100]])

### Exclude Specific Row indices


In [None]:
# drop the first and third rows
display(df.drop([0, 2], axis=0).head())
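
One detail worth keeping in mind: `iloc` selects by position, while `drop` removes rows by index label. With the default 0..n-1 index the two coincide, but they diverge once the index changes. The sketch below uses a made-up dataframe with a hypothetical non-default index to show the difference.

```python
import pandas as pd

# Hypothetical toy frame whose index labels do not start at 0
toy = pd.DataFrame({"ash": [2.4, 1.9, 2.7, 2.1]}, index=[10, 11, 12, 13])

by_position = toy.iloc[[0, 2]]         # first and third rows, whatever their labels
by_label = toy.drop([10, 12], axis=0)  # removes the rows labelled 10 and 12

# Labels 0 and 2 do not exist here, so drop raises a KeyError
missing_label_error = False
try:
    toy.drop([0, 2], axis=0)
except KeyError:
    missing_label_error = True

print(by_position.index.tolist(), by_label.index.tolist(), missing_label_error)
```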

### Conditional Filtering


In [None]:
# Get those wines with ash > 2
display(df[df.ash > 2].head())

### Offset from top of the dataframe


In [None]:
# Skip the top 100 rows
display(df[100:].head())

### Offset from bottom of the dataframe


In [None]:
# Skip the last 10 rows
display(df[:-10].head())

## TypeCasting/Data Type Conversion


In [None]:
print("Old dtypes:\n", df.dtypes)
# change the data type of 'hue' to int (note: this truncates the fractional part)
df["hue"] = df["hue"].astype(int)
# compare dtypes of the original df with this one
print("New dtypes:\n", df.dtypes)
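
Casting floats to `int` has two side effects that are easy to miss: the fractional part is dropped, and the cast fails if the column contains missing values. The following sketch shows both on made-up values; rounding first and using pandas' nullable `Int64` dtype is shown as one possible workaround, not as the approach this notebook takes.

```python
import numpy as np
import pandas as pd

# Hypothetical hue-like values
hue = pd.Series([1.04, 0.70, 1.23])
truncated = hue.astype(int)  # fractional part is dropped, not rounded
print(truncated.tolist())

# A plain int column cannot hold NaN, so this cast raises a ValueError
hue_with_gap = pd.Series([1.04, np.nan, 1.23])
cast_failed = False
try:
    hue_with_gap.astype(int)
except ValueError as err:
    cast_failed = True
    print("Cannot cast:", err)

# One option: round first, then cast to the nullable integer dtype
hue_nullable = hue_with_gap.round().astype("Int64")
print(hue_nullable.tolist())
```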

## Missing Values

*Note: you may get some `FutureWarning` notifications in the following cells. They shouldn't cause problems.*


In [None]:
# Drop rows with missing values in 'malic_acid' column
df_dropped = df.dropna(subset=["malic_acid"])
df_dropped.shape

In [None]:
# Fill Missing 'magnesium' values with mean 'magnesium'
df_dropped["magnesium"] = df_dropped["magnesium"].fillna(
    value=np.round(df.magnesium.mean(), decimals=2)
)

In [None]:
# Fill Missing flavanoids values with value from previous row (forward fill)
df_dropped["flavanoids"] = df_dropped["flavanoids"].ffill()

In [None]:
# Fill any remaining missing flavanoids values with value from next row (backward fill)
df_dropped["flavanoids"] = df_dropped["flavanoids"].bfill()
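
To see how the three filling strategies differ, the sketch below applies them to a short made-up series. It also shows why a backward fill after a forward fill can still be useful: a forward fill cannot fill a gap at the very start of a column.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two consecutive gaps
s = pd.Series([1.0, np.nan, np.nan, 4.0])

mean_filled = s.fillna(s.mean())  # mean of the observed values, here 2.5
forward_filled = s.ffill()        # carry the last seen value forward
backward_filled = s.bfill()       # pull the next seen value backward

# A leading gap survives a forward fill; a following backward fill closes it
leading_gap = pd.Series([np.nan, 2.0, np.nan])
after_ffill = leading_gap.ffill()
after_both = leading_gap.ffill().bfill()

print(mean_filled.tolist(), forward_filled.tolist(), backward_filled.tolist())
```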

## Duplicates


In [None]:
# Before dropping Duplicate 'alcohol' rows
display(df_dropped.head())
print("Shape of df before dropping duplicates ={}".format(df_dropped.shape))

In [None]:
# After dropping Duplicate 'alcohol' rows
df_dropped.drop_duplicates(subset=["alcohol"], inplace=True)
# updated dataframe
display(df_dropped.head())
print("Shape of df after dropping duplicates ={}".format(df_dropped.shape))
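
Keep in mind that `subset=["alcohol"]` compares only that column, so rows that differ elsewhere are still treated as duplicates. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share the same alcohol but differ in hue
toy = pd.DataFrame({"alcohol": [13.2, 13.2, 12.4], "hue": [1.05, 0.98, 1.10]})

# Default keep="first" retains the first row of each duplicate group
deduped_first = toy.drop_duplicates(subset=["alcohol"])

# keep="last" retains the last row of each group instead
deduped_last = toy.drop_duplicates(subset=["alcohol"], keep="last")

print(deduped_first.index.tolist(), deduped_last.index.tolist())
```

Whether dropping on a single measurement is appropriate depends on the data; two different wines can legitimately share the same alcohol value.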

## Encode Categoricals


In [None]:
# Get One Hot Encoding using get_dummies() for 'class'
display(pd.get_dummies(df, columns=["class"]).head())
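
The following sketch shows what one-hot encoding produces on a tiny made-up dataframe: the `class` column is replaced by one indicator column per distinct value, and each row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical toy data with three class labels
toy = pd.DataFrame({"class": [0, 1, 2, 1], "alcohol": [14.2, 12.4, 13.1, 12.0]})

encoded = pd.get_dummies(toy, columns=["class"])
print(encoded.columns.tolist())

# Each row belongs to exactly one class, so the indicators sum to 1
indicator_cols = ["class_0", "class_1", "class_2"]
row_sums = encoded[indicator_cols].astype(int).sum(axis=1)
print(row_sums.tolist())
```

The dtype of the indicator columns depends on the pandas version, which is why the sketch casts them to `int` before summing.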

## Random Sampling data from DataFrame


In [None]:
# Randomly sample 30% of the rows, with replacement (the same row may appear more than once)
display(df.sample(frac=0.3, replace=True, random_state=42).head())
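
The `replace` flag changes what kind of sample you get. The sketch below, on a made-up ten-row dataframe, draws 30% of the rows both ways; only the sample without replacement is guaranteed to contain distinct rows.

```python
import pandas as pd

# Hypothetical toy frame with 10 rows
toy = pd.DataFrame({"alcohol": range(10)})

# With replacement: a row may be drawn more than once (bootstrap-style sampling)
with_repl = toy.sample(frac=0.3, replace=True, random_state=42)

# Without replacement: every sampled row is distinct
without_repl = toy.sample(frac=0.3, replace=False, random_state=42)

print(len(with_repl), len(without_repl), without_repl.index.is_unique)
```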

## Normalizing Numeric Values

### Normalize 'alcohol' values using **Min-Max Scaler**


In [None]:
# Normalize 'alcohol' values using Min-Max Scaler
df_normalized = df.dropna().copy()
# Create a min_max_scaler
min_max_scaler = preprocessing.MinMaxScaler()
# Transform data, reshape your data using array.reshape(-1, 1) if your data has a single feature
alcohol_scaled = min_max_scaler.fit_transform(
    df_normalized["alcohol"].values.reshape(-1, 1)
)
df_normalized["alcohol"] = alcohol_scaled.ravel()  # flatten the (n, 1) result back to one column

In [None]:
display(df_normalized.head())
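
For intuition, min-max scaling with the default range of [0, 1] computes `(x - min) / (max - min)` for each value. The sketch below reproduces that formula by hand in pandas on made-up numbers; it is an illustration of the arithmetic, not a replacement for the scaler used above.

```python
import pandas as pd

# Hypothetical alcohol values
alcohol = pd.Series([11.0, 12.0, 13.0, 15.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1
scaled = (alcohol - alcohol.min()) / (alcohol.max() - alcohol.min())
print(scaled.tolist())
```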

### Normalize 'magnesium' values using **Robust Scaler**


In [None]:
# Normalize 'magnesium' values using Robust Scaler
df_normalized = df.dropna().copy()
# Create a RobustScaler
robust_scaler = preprocessing.RobustScaler()
magnesium_scaled = robust_scaler.fit_transform(
    df_normalized["magnesium"].values.reshape(-1, 1)
)
df_normalized["magnesium"] = magnesium_scaled.ravel()  # flatten the (n, 1) result back to one column

In [None]:
display(df_normalized.head())
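
Assuming `RobustScaler`'s default settings (centering on the median and scaling by the range between the 25th and 75th percentiles), the transformation is `(x - median) / IQR`. The sketch below computes this by hand on made-up values with one extreme outlier and compares it with min-max scaling.

```python
import pandas as pd

# Hypothetical magnesium-like values with one extreme outlier
magnesium = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Robust scaling: center on the median, scale by the interquartile range
median = magnesium.median()
iqr = magnesium.quantile(0.75) - magnesium.quantile(0.25)
robust = (magnesium - median) / iqr

# Min-max scaling of the same values, for comparison
minmax = (magnesium - magnesium.min()) / (magnesium.max() - magnesium.min())

print(robust.tolist())
print(minmax.round(3).tolist())
```

With min-max scaling the outlier squeezes the four ordinary values into a narrow band near 0, whereas robust scaling keeps them evenly spread and leaves the outlier far away.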

## Data Summarization

### Condition-based aggregation


In [None]:
# Get the mean 'hue' of class 1 wine
mean_hue = df["hue"][df["class"] == 1].mean()
print("Mean 'hue' of class 1 wine :: {}".format(mean_hue))

In [None]:
# Get the max 'alcohol' of class 0 wine
max_alcohol = df["alcohol"][df["class"] == 0].max()
print("Max 'alcohol' of class 0 wine :: {}".format(max_alcohol))
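
The condition-based aggregates above handle one class at a time. As a complementary sketch, `groupby` computes the same kind of summary for every class in one step. The dataframe and the output column names `mean_hue` and `max_alcohol` below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with three classes
toy = pd.DataFrame(
    {
        "class": [0, 0, 1, 1, 2],
        "hue": [1.0, 1.5, 0.75, 1.25, 0.5],
        "alcohol": [14.0, 13.5, 12.2, 12.6, 13.0],
    }
)

# One row per class: mean hue and maximum alcohol
summary = toy.groupby("class").agg(
    mean_hue=("hue", "mean"),
    max_alcohol=("alcohol", "max"),
)
print(summary)
```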

## Conclusion

Through this exercise, we learned essential data wrangling techniques including:

- Data cleaning and transformation
- Column manipulation
- Row filtering
- Missing value handling
- Data type conversion
- Categorical encoding
- Numeric value normalization
- Data summarization

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
