#**Data wrangling Exercise**

Data wrangling, also called data munging, is the process of cleaning, transforming, and mapping data from one
form to another so that it can be used for tasks such as analytics, summarization, reporting, and visualization.

Data wrangling is one of the most important and involved steps in the whole data science workflow. The output
of this process directly affects all downstream steps, such as exploration, summarization, visualization,
analysis, and even the final result. This is why data scientists spend a lot of time on data
collection and wrangling.

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).


In [None]:
# import required libraries
import numpy as np
import pandas as pd
from sklearn import preprocessing

from IPython.display import display # Display a Python object in all frontends

pd.options.mode.chained_assignment = None # ignoring the warning when working on slices of dataframes 


##Data wrangling utility functions

In [None]:
def describe_dataframe(df=pd.DataFrame()):
    """This function generates descriptive stats of a dataframe
    Args:
        df (dataframe): the dataframe to be analyzed
    Returns:
        None

    """
    print("\n\n")
    print("*"*30)
    print("About the Data")
    print("*"*30)
    
    print("Number of rows::",df.shape[0])
    print("Number of columns::",df.shape[1])
    print("\n")
    
    print("Column Names::",df.columns.values.tolist())
    print("\n")
    
    print("Column Data Types::\n",df.dtypes)
    print("\n")
    
    print("Columns with Missing Values::",df.columns[df.isnull().any()].tolist())
    print("\n")
    
    print("Number of rows with Missing Values::",df.isna().any(axis=1).sum())
    print("\n")
    
    print("Sample Indices with missing data::",df[df.isna().any(axis=1)].index[0:5])
    print("\n")
    
    print("General Stats::")
    df.info()  # info() prints its report directly and returns None
    print("\n")
    
    print("Summary Stats::")
    print(df.describe())
    print("\n")
    
    print("Dataframe Sample Rows::")
    display(df.head(5))
    
def cleanup_column_names(df,rename_dict=None,do_inplace=True):
    """This function renames columns of a pandas dataframe
       It converts column names to snake case if rename_dict is not passed. 
    Args:
        df (dataframe): the dataframe whose columns are renamed
        rename_dict (dict): keys represent old column names and values point to 
                            newer ones
        do_inplace (bool): flag to update existing dataframe or return a new one
    Returns:
        pandas dataframe if do_inplace is set to False, None otherwise

    """
    if not rename_dict:
        return df.rename(columns={col: col.lower().replace(' ','_').replace(r'/','_') 
                    for col in df.columns.values.tolist()}, 
                  inplace=do_inplace)
    else:
        return df.rename(columns=rename_dict,inplace=do_inplace)
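
To make the default renaming rule concrete, here is a minimal sketch that applies the same lower-case and underscore substitution to a tiny frame. The frame `toy` and its values are made up for illustration; only the column names mimic the raw wine headers.

```python
import pandas as pd

# Hypothetical toy frame; the column names mimic the raw wine headers.
toy = pd.DataFrame({"Malic acid": [1.78], "OD280/OD315 of diluted wines": [3.4]})

# Same rule as the default branch of cleanup_column_names:
# lower-case, then replace spaces and slashes with underscores.
renamed = toy.rename(
    columns={col: col.lower().replace(" ", "_").replace("/", "_") for col in toy.columns}
)
print(renamed.columns.tolist())
```

Note that `DataFrame.rename(..., inplace=True)` returns `None`, which is why calling `cleanup_column_names(df)` with the default `do_inplace=True` shows no output.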

##Wine recognition dataset

This is the UCI ML Wine recognition dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data is the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. There are thirteen different measurements taken for different constituents found in the three types of wine.

Original Owners:

Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Dataset characteristics:
* Number of Instances: 178 (59, 71, and 48 in the three classes)
* Number of Attributes: 13 numeric, predictive attributes and the class
* Attribute info:
1. **Alcohol**: alcohol content, reported in units of ABV (alcohol by volume).

1. **Malic acid**: one of the principal organic acids found in wine. Although found in nearly every fruit and berry, its flavor is most prominent in green apples; likewise, it projects this sour flavor into wine. For more information, feel free to read about acids in wine.

1. **Ash**: yep, wine has ash in it. Ash is simply the inorganic matter left after evaporation and incineration.

1. **Alcalinity of ash**: the alkalinity of ash determines how basic (as opposed to acidic) the ash in a wine is.

1. **Magnesium**: magnesium is a metal that affects the flavor of wine.

1. **Total phenols**: Phenols are chemicals that affect the taste, color, and mouthfeel (i.e., texture) of wine. For some (very) in-depth information about phenols, we refer you to phenolic content in wine.

1. **Flavanoids**: flavonoids are a type of phenol.

1. **Nonflavanoid phenols**: nonflavonoids are another type of phenol.

1. **Proanthocyanins**: proanthocyanidins are yet another type of phenol.

1. **Color intensity**: the color intensity of a wine: i.e., how dark it is.

1. **Hue**: the hue of a wine, which is typically determined by the color of the cultivar used (although this is not always the case).

1. **OD280/OD315 of diluted wines**: protein content measurements.

1. **Proline**: an amino acid present in wines.
  
* Class
  * Class 0: 59
  * Class 1: 71
  * Class 2: 48

'messy_wine_data.csv' is a modified version of the 'Wine recognition dataset' with some missing values introduced.

In [None]:
# Download 'messy_wine_data.csv'
!pip install wget
!python -m wget -o messy_wine_data.csv "https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/messy_wine_data.csv"

df = pd.read_csv('messy_wine_data.csv')
df.head() 




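| Unnamed: 0 | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | NaN | 2.43 | 15.6 | 127.0 | 2.8 | NaN | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | NaN | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | NaN | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | NaN | 2.8 | 2.69 | 0.39 | -1.0 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |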


In [None]:
# describe the stats of dataframe
describe_dataframe(df)

In [None]:
print("Shape of df={}".format(df.shape))

##Rename Columns

In [None]:
print("Dataframe columns:\n{}".format(df.columns.tolist()))

In [None]:
cleanup_column_names(df)

In [None]:
print("Dataframe columns:\n{}".format(df.columns.tolist()))

##Sort Rows on defined attributes

In [None]:
df.head()

In [None]:
# Sort data by ascending malic_acid and decreasing ash
display(df.sort_values(['malic_acid', 'ash'], 
                         ascending=[True, False]).head())

In [None]:
# Sort data by decreasing alcohol

# Your code goes here
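
As a hint for this exercise, the following is a minimal sketch of a descending sort on a small made-up frame (`toy` is hypothetical and not the wine data). It relies on the `ascending` argument of `sort_values`, the same argument used in the previous cell.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data.
toy = pd.DataFrame({"alcohol": [13.2, 14.4, 12.9], "ash": [2.1, 2.5, 2.3]})

# ascending=False sorts from the largest value to the smallest.
sorted_toy = toy.sort_values("alcohol", ascending=False)
print(sorted_toy)
```

`sort_values` returns a new frame and keeps the original index labels, so `toy` itself is left unchanged.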

##Rearrange Columns in a Dataframe

In [None]:
df.head()

In [None]:
# Rearrange columns in the order of 'class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash',
# 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
# 'color_intensity', 'hue', 'od280_od315_of_diluted_wines', 'proline'.
display(df[['class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash',
            'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
            'color_intensity', 'hue', 'od280_od315_of_diluted_wines', 'proline']].head())

In [None]:
# Rearrange columns in the order of 'alcohol', 'color_intensity', 'hue', 'malic_acid', 'ash', 'alcalinity_of_ash',
# 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
# 'od280_od315_of_diluted_wines', 'proline', 'class'.

# Your code goes here
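
Reordering is just selection with a list of column names in the desired order. The sketch below shows this on a made-up three-column frame; `toy` and `new_order` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data.
toy = pd.DataFrame({"alcohol": [13.2, 14.4], "hue": [1.05, 0.86], "class": [0, 1]})

# Indexing with a list of names returns the columns in exactly that order.
new_order = ["class", "hue", "alcohol"]
reordered = toy[new_order]
print(reordered.columns.tolist())
```

Any column left out of the list is dropped from the result, which is how the previous cell omits the unnamed index column.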

##Filtering Columns

Using Column Index

In [None]:
# print 10 values from column at index 3

# Your code goes here (hint: use 'iloc[]')

Using Column Name

In [None]:
# print 10 values of total_phenols

# Your code goes here
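
The two selection styles above (by position and by name) can be compared on a small made-up frame. This is only a sketch: `toy` is hypothetical, and it takes 3 values instead of 10 to keep the output short.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data.
toy = pd.DataFrame({
    "alcohol": [13.2, 14.4, 12.9, 13.7],
    "ash": [2.1, 2.5, 2.3, 2.4],
    "total_phenols": [2.65, 3.85, 2.8, 2.95],
})

by_position = toy.iloc[:3, 1]            # first 3 rows of the column at position 1
by_name = toy["total_phenols"].head(3)   # first 3 rows of a named column
print(by_position.tolist(), by_name.tolist())
```

`iloc` is an indexer, so it takes square brackets rather than parentheses, and positions start at 0.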

Using Column Datatype

In [None]:
# print 10 values of columns with data type float

# Your code goes here (hint: use 'select_dtypes()')
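
A minimal sketch of dtype-based selection, assuming a made-up frame with one float, one integer, and one string column (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with mixed column types.
toy = pd.DataFrame({"alcohol": [13.2, 14.4], "class": [0, 1], "label": ["a", "b"]})

# Keep only the columns stored as 64-bit floats.
float_cols = toy.select_dtypes(include=["float64"])
print(float_cols.columns.tolist())
```

`select_dtypes` also accepts an `exclude` argument, which is convenient when it is easier to name the types that are not wanted.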

##Filtering Rows
Select specific rows

In [None]:
# Select the rows at positions 21, 45, and 100

# Your code goes here (hint: use 'iloc[]')
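
Passing a list of integers to `iloc` selects rows by position. The sketch below uses a made-up frame with letter labels (`toy` is hypothetical) to show that the labels play no role in the lookup.

```python
import pandas as pd

# Hypothetical toy frame with non-numeric index labels.
toy = pd.DataFrame({"alcohol": [13.2, 14.4, 12.9, 13.7, 12.1]}, index=list("abcde"))

# A list of positions picks the 1st, 3rd, and 5th rows, whatever their labels are.
picked = toy.iloc[[0, 2, 4]]
print(picked)
```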

Exclude Specific Row indices

In [None]:
# drop the first and third rows

# Your code goes here (hint: use 'drop()')
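
`drop` removes rows by index label, so one way to drop by position is to look the labels up first. This sketch works on a made-up frame (`toy` is hypothetical) and drops the second and fourth rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data.
toy = pd.DataFrame({"alcohol": [13.2, 14.4, 12.9, 13.7]})

# toy.index[[1, 3]] translates positions 1 and 3 into their index labels.
remaining = toy.drop(toy.index[[1, 3]])
print(remaining)
```

The call returns a new frame; `toy` still holds all four rows afterwards.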

Conditional Filtering

In [None]:
# Get those wines with ash > 2

# Your code goes here
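
Conditional filtering uses a boolean mask inside the square brackets. The following sketch uses a made-up frame and an arbitrary threshold of 2.2 (both hypothetical); it also includes a missing value to show how the comparison treats it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing 'ash' value.
toy = pd.DataFrame({"ash": [1.9, 2.4, np.nan, 2.7]})

# The comparison yields True/False per row; NaN compares as False.
mask = toy["ash"] > 2.2
high_ash = toy[mask]
print(high_ash)
```

Because a comparison with `NaN` is `False`, rows with a missing value in the filtered column are silently excluded.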

Offset from top of the dataframe

In [None]:
# Skip the top 100 rows

# Your code goes here

Offset from bottom of the dataframe

In [None]:
# Skip the last 10 rows

# Your code goes here
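
Both offsets can be written as positional slices. This is a sketch on a made-up ten-row frame (`toy` is hypothetical), skipping 3 rows from the top and 2 from the bottom rather than the counts asked for in the exercises.

```python
import pandas as pd

# Hypothetical toy frame with ten rows.
toy = pd.DataFrame({"alcohol": range(10)})

without_top = toy.iloc[3:]      # everything after the first 3 rows
without_bottom = toy.iloc[:-2]  # everything except the last 2 rows
print(len(without_top), len(without_bottom))
```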

##TypeCasting/Data Type Conversion

In [None]:
print("Old dtypes:\n", df.dtypes)
# change the data type of the 'hue' column to 'int'

# Your code goes here 

# compare dtypes of the original df with this one
print("New dtypes:\n", df.dtypes)
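
A minimal sketch of a float-to-integer cast with `astype`, using a made-up frame (`toy` and the `hue_int` column are hypothetical). It writes the result to a new column so the effect on the values is visible side by side.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data.
toy = pd.DataFrame({"hue": [1.04, 0.86, 1.95]})

# Casting float to int truncates the fractional part (it does not round).
toy["hue_int"] = toy["hue"].astype(int)
print(toy)
```

A column that still contains missing values cannot be cast to a plain integer dtype this way, so missing values would need to be handled first.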

##Missing Values


In [None]:
# Drop rows with missing values in 'malic_acid' column
df_dropped = # Your code goes here
df_dropped.shape
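
`dropna` accepts a `subset` argument that limits the missing-value check to the listed columns. The sketch below shows the effect on a made-up frame (`toy` is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two different columns.
toy = pd.DataFrame({"malic_acid": [1.7, np.nan, 2.4], "ash": [np.nan, 2.1, 2.7]})

# Only rows missing 'malic_acid' are removed; a missing 'ash' is tolerated.
kept = toy.dropna(subset=["malic_acid"])
print(kept.shape)
```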

In [None]:
# Fill Missing 'magnesium' values with mean 'magnesium'

# Your code goes here
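
Mean imputation can be sketched with `fillna` on a single column. The frame `toy` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing 'magnesium' value.
toy = pd.DataFrame({"magnesium": [100.0, np.nan, 120.0]})

# mean() skips NaN by default, so the fill value is the mean of the observed values.
toy["magnesium"] = toy["magnesium"].fillna(toy["magnesium"].mean())
print(toy)
```

Assigning the result back to the column avoids relying on in-place modification of a column view.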

In [None]:
# Fill Missing 'flavanoids' values with value from previous row (forward fill)

# Your code goes here

In [None]:
# Fill Missing 'flavanoids' values with value from next row (backward fill)

# Your code goes here
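
Forward and backward fill behave differently at the edges of a column. This sketch uses a made-up column (`toy` is hypothetical) whose first value is missing, as is the case for 'flavanoids' in the sample rows shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame whose first value is missing.
toy = pd.DataFrame({"flavanoids": [np.nan, 3.2, np.nan, 2.7]})

forward = toy["flavanoids"].ffill()   # copy the previous observed value downwards
backward = toy["flavanoids"].bfill()  # copy the next observed value upwards
print(forward.tolist(), backward.tolist())
```

Forward fill leaves a leading missing value untouched because there is no earlier row to copy from; backward fill has the same limitation at the end of the column.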

##Duplicates


In [None]:
# Before dropping Duplicate 'alcohol' rows
display(df_dropped.head())
print("Shape of df before dropping duplicates ={}".format(df_dropped.shape))

In [None]:
# After dropping Duplicate 'alcohol' rows

# Your code goes here

# updated dataframe
display(df_dropped.head())
print("Shape of df after dropping duplicates ={}".format(df_dropped.shape))
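
Duplicates can be judged on a single column through the `subset` argument of `drop_duplicates`. The sketch below is illustrative only; `toy` is a made-up frame.

```python
import pandas as pd

# Hypothetical toy frame in which rows 0 and 2 share the same 'alcohol' value.
toy = pd.DataFrame({"alcohol": [13.2, 14.4, 13.2], "ash": [2.1, 2.5, 2.9]})

# Rows count as duplicates when 'alcohol' matches; keep="first" retains the first one.
deduped = toy.drop_duplicates(subset=["alcohol"], keep="first")
print(deduped)
```

Other columns are ignored when deciding what is a duplicate, so row 2 is removed even though its 'ash' value differs.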

##Encode Categoricals


In [None]:
# Get One Hot Encoding using get_dummies() for 'class'

# Your code goes here (hint: use get_dummies())
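
A minimal sketch of one-hot encoding with `pd.get_dummies`, assuming a made-up frame with three classes (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with an integer class label.
toy = pd.DataFrame({"alcohol": [13.2, 14.4, 12.9], "class": [0, 1, 2]})

# Each distinct value of 'class' becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["class"], prefix="class")
print(encoded.columns.tolist())
```

The indicator columns may be boolean or integer depending on the pandas version; `astype(int)` gives a consistent 0/1 representation if that matters downstream.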

##Random Sampling data from DataFrame

In [None]:
# Randomly sample 30% of the rows

# Your code goes here (hint: use sample())
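
Fractional sampling can be sketched as follows on a made-up ten-row frame (`toy` is hypothetical). The `random_state` value is an arbitrary choice that makes the draw repeatable.

```python
import pandas as pd

# Hypothetical toy frame with ten rows.
toy = pd.DataFrame({"alcohol": range(10)})

# frac=0.3 draws 30% of the rows without replacement.
subset = toy.sample(frac=0.3, random_state=0)
print(len(subset))
```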

##Normalizing Numeric Values
Normalize 'alcohol' values using **Min-Max Scaler**

In [None]:
# Normalize 'alcohol' values using Min-Max Scaler
df_normalized = df.dropna().copy()
# Create a min_max_scaler
min_max_scaler = preprocessing.MinMaxScaler()
# Transform data, reshape your data using array.reshape(-1, 1) if your data has a single feature
alcohol_scaled = min_max_scaler.fit_transform(df_normalized['alcohol'].values.reshape(-1,1))
df_normalized['alcohol'] = alcohol_scaled.ravel()  # flatten the (n, 1) array back to one dimension

In [None]:
display(df_normalized.head())

Normalize 'magnesium' values using **Robust Scaler**

In [None]:
# Normalize 'magnesium' values using Robust Scaler
df_normalized = df.dropna().copy()
# Create a RobustScaler

robust_scaler = # Your code goes here

magnesium_scaled = # Your code goes here

df_normalized['magnesium'] = magnesium_scaled.ravel()  # flatten the (n, 1) array back to one dimension

In [None]:
display(df_normalized.head())
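
For intuition about what robust scaling does, here is a sketch on a made-up column that contains one outlier (`toy` and the `magnesium_scaled` column are hypothetical). With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical toy frame; 300.0 plays the role of an outlier.
toy = pd.DataFrame({"magnesium": [80.0, 90.0, 100.0, 110.0, 300.0]})

scaler = RobustScaler()
# Double brackets keep the input two-dimensional, as scikit-learn expects.
scaled = scaler.fit_transform(toy[["magnesium"]])
toy["magnesium_scaled"] = scaled.ravel()
print(toy)
```

Here the median is 100 and the interquartile range is 20, so the typical values land between -1 and 0.5 while the outlier stays clearly separated instead of compressing the rest of the column.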

##Data Summarization
Condition based aggregation

In [None]:
# Get the mean 'hue' of class 1 wine
mean_hue = df.loc[df['class'] == 1, 'hue'].mean()
print("Mean 'hue' of class 1 wine :: {}".format(mean_hue))

In [None]:
# Get the max 'alcohol' of class 0 wine

# Your code goes here

print("Max 'alcohol' of class 0 wine :: {}".format(max_alcohol))
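
Condition-based aggregation generalizes naturally to one aggregate per group. The sketch below shows both forms on a made-up frame (`toy` is hypothetical): a boolean mask for a single class, and `groupby` for all classes at once.

```python
import pandas as pd

# Hypothetical toy frame with two classes.
toy = pd.DataFrame({"alcohol": [13.2, 14.4, 12.9, 13.7], "class": [0, 1, 0, 1]})

# Aggregate for one class, selected with a boolean mask.
max_by_mask = toy.loc[toy["class"] == 1, "alcohol"].max()

# The same aggregate for every class in a single call.
per_class_max = toy.groupby("class")["alcohol"].max()
print(max_by_mask)
print(per_class_max)
```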