# Data Processing and Analysis

Data processing is typically the most time-consuming component of a Machine Learning project lifecycle, and often the most important one.

In this notebook, we will analyze a dummy dataset to understand the issues we commonly face with real-world datasets and the steps to handle them.

## Import 

In [None]:
# import required libraries
import numpy as np
import pandas as pd
from IPython.display import display
from sklearn import preprocessing

from utils import generate_sample_data

pd.options.mode.chained_assignment = None

## Generate Dataset

+ Question: Generate 1000 sample rows

In [None]:
## Generate a dataset with 1000 rows
df = generate_sample_data(row_count=1000)
df.shape

### Analyze generated Dataset

In [None]:
df.head()

### Dataframe Stats

Determine the following:

* The number of data points (rows). (*Hint:* check out the dataframe `.shape` attribute.)
* The column names. (*Hint:* check out the dataframe `.columns` attribute.)
* The data types for each column. (*Hint:* check out the dataframe `.dtypes` attribute.)

In [None]:
print("Number of rows::",df.shape[0])

### Question
+ Get the number of columns

In [None]:
print("Number of columns::",df.shape[1])

In [None]:
print("Column Names::",df.columns.values.tolist())

In [None]:
print("Column Data Types::\n",df.dtypes)

In [None]:
print("Columns with Missing Values::",df.columns[df.isnull().any()].tolist())

In [None]:
print("Number of rows with Missing Values::",df.isnull().any(axis=1).sum())

#### General Stats

In [None]:
print(df.info())

In [None]:
print(df.describe())

## Standardize Columns

### Question
+ Use ```columns``` attribute and ```tolist()``` method to get the list of all columns

In [None]:
# list all columns
print("Dataframe columns:\n{}".format(df.columns.tolist()))

### Utility to Standardize Columns

+ Question: We usually use lowercase snake_case column names in Python. Write a utility method to do the same. You may use methods like ```lower``` and ```replace```. Setting ```inplace``` = ```True``` modifies the existing dataframe instead of returning a new one


*Hint:* there are multiple ways to do this, but you could use either the [string processing methods](http://pandas.pydata.org/pandas-docs/stable/text.html) or the [apply method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html).

In [None]:
def cleanup_column_names(df,rename_dict=None,do_inplace=True):
    """This function renames columns of a pandas dataframe
       It converts column names to snake case if rename_dict is not passed. 
    Args:
        df (pandas.DataFrame): dataframe whose columns are to be renamed
        rename_dict (dict): keys represent old column names and values point to 
                            newer ones
        do_inplace (bool): flag to update existing dataframe or return a new one
    Returns:
        pandas dataframe if do_inplace is set to False, None otherwise

    """
    if not rename_dict:
        # lower case and replace <space> with <underscore>
        rename_dict = {col: col.lower().replace(' ','_') 
                       for col in df.columns.values.tolist()}
    # honor do_inplace in both cases
    return df.rename(columns=rename_dict,inplace=do_inplace)

In [None]:
cleanup_column_names(df)

In [None]:
# Updated column names
print("Dataframe columns:\n{}".format(df.columns.tolist()))
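The hint above also mentions the pandas string processing methods. A minimal sketch of that route follows, using a made-up two-column frame (`toy` and its column names are illustrative assumptions, not the generated dataset):

```python
import pandas as pd

# Toy frame with hypothetical column names, for illustration only
toy = pd.DataFrame({"Serial No": [1, 2], "User Type": ["a", "b"]})

# Vectorized string methods on the column Index:
# strip whitespace, lower-case, then replace spaces with underscores
toy.columns = (toy.columns.str.strip()
                          .str.lower()
                          .str.replace(" ", "_", regex=False))

print(toy.columns.tolist())
```

This variant always modifies the frame it is given, so it does not offer the `do_inplace` switch of the utility above.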

## Basic Manipulation

### Sort by specific attributes

+ Question: Sort serial_no in ascending and price in descending order.

In [None]:
# Ascending for Serial No and Descending for Price
display(df.sort_values(['serial_no', 'price'], 
                         ascending=[True, False]).head())

### Reorder columns

In [None]:
display(df[['serial_no','date','user_id','user_type',
              'product_id','quantity_purchased','price']].head())

### Select Attributes

In [None]:
# Using Column Index
# print 10 values from column at index 3
print(df.iloc[:,3].values[0:10])

In [None]:
# Using Column Name
# print 10 values of quantity purchased
print(df.quantity_purchased.values[0:10])

In [None]:
# Using Datatype
# print 10 values of columns with data type float
print(df.select_dtypes(include=['float64']).values[:10,0])

### Select Rows

In [None]:
# Using Row Index
display(df.iloc[[10,501,20]])

In [None]:
# Exclude specific rows
display(df.drop([0,24,51], axis=0).head())

### Question
+ Show only rows which have quantity purchased greater than 25

In [None]:
# Conditional Filtering
# Quantity_Purchased greater than 25
display(df[df.quantity_purchased > 25].head())

In [None]:
# Offset from Top
display(df[100:].head())

In [None]:
# Offset from Bottom
display(df[-10:].head())

### Type Casting

In [None]:
# Existing Datatypes
df.dtypes

In [None]:
# Set datetime as dtype for date column
df['date'] = pd.to_datetime(df.date)
print(df.dtypes)

### Map/Apply Functionality

### Question
+ Write a utility method to create a new column ```user_class``` from ```user_type``` using the following mapping:
    - ```user_type``` __a__ and __b__ map to ```user_class``` __new__
    - ```user_type``` __c__ maps to ```user_class``` __existing__
    - ```user_type``` __d__ maps to ```user_class``` __loyal_existing__
    - map all other ```user_type``` values as __error__

In [None]:
def expand_user_type(u_type):
    """This function maps user types to user classes
    Args:
        u_type (str): user type value
    Returns:
        (str) user_class value

    """
    if u_type in ['a','b']:
        return 'new'
    elif u_type == 'c':
        return 'existing'
    elif u_type == 'd':
        return 'loyal_existing'
    else:
        return 'error'

In [None]:
# Map User Type to User Class
df['user_class'] = df['user_type'].map(expand_user_type)
display(df.tail())
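The same mapping can also be expressed as a dictionary lookup. The sketch below is one possible alternative, run on a small hypothetical Series rather than the notebook's `df`; values missing from the dictionary (including NaN) come back as NaN from `map`, and `fillna` then turns them into `error`:

```python
import numpy as np
import pandas as pd

# Hypothetical user types, including an unknown value and a missing one
user_type = pd.Series(["a", "b", "c", "d", "n", np.nan])

class_map = {"a": "new", "b": "new", "c": "existing", "d": "loyal_existing"}

# Unmapped or missing values become NaN after map, then 'error' after fillna
user_class = user_type.map(class_map).fillna("error")

print(user_class.tolist())
# ['new', 'new', 'existing', 'loyal_existing', 'error', 'error']
```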

### Question
+ Get range for each numeric attribute, i.e. max-min

In [None]:
# Apply: Using apply to get attribute ranges
display(df.select_dtypes(include=[np.number]).apply(lambda x: 
                                                        x.max()- x.min()))

In [None]:
# Apply-Map: Extract Week from Date
df['purchase_week'] = df[['date']].applymap(lambda dt:dt.week 
                                                if not pd.isnull(dt.week) 
                                                else 0)

In [None]:
display(df.head())

## Handle Missing Values

In [None]:
# Drop Rows with Missing Dates
df_dropped = df.dropna(subset=['date'])
display(df_dropped.head())

In [None]:
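# Fill missing price with mean price
# (assign the result back; calling fillna with inplace=True on a selected
#  column may not update the dataframe in recent pandas versions)
df_dropped['price'] = df_dropped['price'].fillna(
    value=np.round(df.price.mean(),decimals=2))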

In [None]:
# Fill missing user types using values from previous row
df_dropped['user_type'] = df_dropped['user_type'].ffill()
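To see the three steps on data small enough to check by eye, here is a sketch on a hypothetical four-row frame (the values are invented for illustration and are not from the generated dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", None, "2016-01-03", "2016-01-04"]),
    "price": [10.0, 20.0, np.nan, 30.0],
    "user_type": ["a", "b", np.nan, "c"],
})

# 1. Drop rows without a date (explicit copy so later assignments are safe)
cleaned = toy.dropna(subset=["date"]).copy()

# 2. Fill missing prices with the mean of the remaining prices
cleaned["price"] = cleaned["price"].fillna(round(cleaned["price"].mean(), 2))

# 3. Forward-fill missing user types from the previous row
cleaned["user_type"] = cleaned["user_type"].ffill()

print(cleaned)
```

Note that this sketch takes the mean from the rows that survive the drop, whereas the cell above uses the mean of the full `df`; which one is appropriate depends on the analysis.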

## Handle Duplicates

### Question
+ Identify duplicates only for column ```serial_no```

In [None]:
# Show sample duplicate rows, identified by serial_no
display(df_dropped[df_dropped.duplicated(subset=['serial_no'])].head())
print("Shape of df={}".format(df_dropped.shape))

In [None]:
# Drop Duplicates
df_dropped.drop_duplicates(subset=['serial_no'],inplace=True)
display(df_dropped.head())
print("Shape of df={}".format(df_dropped.shape))

### Question
+ Remove rows which have fewer than 3 attributes with non-missing data
+ Print the shape of dataframe thus prepared

In [None]:
# Remove rows which have fewer than 3 attributes with non-missing data
display(df.dropna(thresh=3).head())
print("Shape of df={}".format(df.dropna(thresh=3).shape))
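The `thresh` argument is the minimum number of non-missing values a row needs in order to be kept. A small sketch on an invented frame (column names `a` to `d` are placeholders):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 5, np.nan],
    "c": [3, 6, 7],
    "d": [4, 8, np.nan],
})

# Row 0 has 4 non-missing values, row 1 has 3, row 2 has only 1
kept = toy.dropna(thresh=3)

print(kept.index.tolist())
# [0, 1]
```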

## Handle Categoricals

### One Hot Encoding

In [None]:
display(pd.get_dummies(df,columns=['user_type']).head())

### Label Encoding

### Question
+ Use a dictionary to encode user_types as a sequence of numbers. Replace missing values (NaNs) with -1

In [None]:
# np.nan is the canonical spelling; the np.NAN alias was removed in NumPy 2.0
type_map = {'a': 0, 'b': 1, 'c': 2, 'd': 3, np.nan: -1}
df['encoded_user_type'] = df.user_type.map(type_map)
display(df.tail())
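When the codes do not need to be chosen by hand, pandas can derive them from a categorical dtype. This is an alternative to the dictionary approach, sketched on a hypothetical Series; categories are ordered lexically and missing values receive the code -1:

```python
import numpy as np
import pandas as pd

# Hypothetical user types with one missing value
user_type = pd.Series(["a", "b", "c", "d", np.nan, "a"])

# Convert to a categorical and read the integer codes
codes = user_type.astype("category").cat.codes

print(codes.tolist())
# [0, 1, 2, 3, -1, 0]
```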

## Handle Numerical Attributes

### Min-Max Scaler
### Question
+ Rescale the numerical attribute price to a fixed range by using the ```MinMaxScaler``` transformer

In [None]:
df_normalized = df.dropna().copy()
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df_normalized['price'].values.reshape(-1,1))
df_normalized['price'] = np_scaled.ravel()  # flatten (n, 1) output to 1-D

In [None]:
display(df_normalized.head())
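With its default settings, min-max scaling maps each value to `(x - min) / (max - min)`, so the result lies in the range 0 to 1. The sketch below applies that formula by hand to three invented prices, without scikit-learn, to make the arithmetic visible:

```python
import pandas as pd

# Hypothetical prices
price = pd.Series([10.0, 20.0, 40.0])

# (x - min) / (max - min): smallest value maps to 0, largest to 1
scaled = (price - price.min()) / (price.max() - price.min())

print(scaled.round(3).tolist())
# [0.0, 0.333, 1.0]
```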

### Robust Scaler

In [None]:
df_normalized = df.dropna().copy()
robust_scaler = preprocessing.RobustScaler()
rs_scaled = robust_scaler.fit_transform(df_normalized['quantity_purchased'].values.reshape(-1,1))
df_normalized['quantity_purchased'] = rs_scaled.ravel()  # flatten (n, 1) output to 1-D

In [None]:
display(df_normalized.head())
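With its default settings, robust scaling subtracts the median and divides by the interquartile range (75th minus 25th percentile), which makes it less sensitive to outliers than min-max scaling. A hand-computed sketch on invented quantities, including one outlier:

```python
import pandas as pd

# Hypothetical quantities; 100 is an outlier
qty = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1, q3 = qty.quantile(0.25), qty.quantile(0.75)

# (x - median) / IQR
scaled = (qty - qty.median()) / (q3 - q1)

print(scaled.tolist())
# [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier stays far from the rest, but it does not compress the other values towards zero as it would under min-max scaling.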

## Group-By

### Question
+ Group by attribute ```user_class``` and get the sum of quantity_purchased

*Hint:* you may want to use Pandas [`groupby` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) to group by certain attributes before calculating the statistic.

Try calculating multiple statistics (mean, median, etc) in a single table (i.e. with a single groupby call). See the section of the Pandas documentation on [applying multiple functions at once](http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once) for a hint.

In [None]:
# Group By attributes user_class and get sum of quantity_purchased
print(df.groupby(['user_class'])['quantity_purchased'].sum())

In [None]:
# Aggregate Functions. Sum, Mean and Non Zero Row Count
display(
    df.groupby(['user_class'])['quantity_purchased'].agg(
        [np.sum, np.mean, np.count_nonzero]))

In [None]:
# Aggregate Functions specific to columns
display(df.groupby(['user_class','user_type']).agg({'price':np.mean,
                                                        'quantity_purchased':np.max}))

In [None]:
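# Multiple Aggregate Functions
# Nested renaming dictionaries are no longer accepted by agg (pandas >= 1.0),
# so named aggregation is used: output_name=(column, function)
display(
    df.groupby(['user_class', 'user_type']).agg(
        total_price=('price', 'sum'),
        mean_price=('price', 'mean'),
        std_price=('price', 'std'),
        count=('price', np.count_nonzero),
        total_quantity=('quantity_purchased', 'sum')))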
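Named aggregation, where each keyword is an output column and each value is a `(column, function)` pair, is one way to compute several statistics in a single `groupby` call. A sketch on an invented three-row frame whose results are easy to verify:

```python
import pandas as pd

# Hypothetical purchases
toy = pd.DataFrame({
    "user_class": ["new", "new", "existing"],
    "price": [10.0, 30.0, 5.0],
    "quantity_purchased": [1, 3, 2],
})

# output_name=(input_column, aggregation)
summary = toy.groupby("user_class").agg(
    total_price=("price", "sum"),
    mean_price=("price", "mean"),
    total_quantity=("quantity_purchased", "sum"),
)

print(summary)
```

For the `new` class this gives a total price of 40.0, a mean price of 20.0 and a total quantity of 4.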

## Pivot Tables

In [None]:
display(df.pivot_table(index='date', columns='user_type', 
                         values='price',aggfunc=np.mean))
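A pivot table places one attribute on the rows, another on the columns, and aggregates a third into the cells. The sketch below uses an invented four-row frame; combinations with no data come out as NaN:

```python
import pandas as pd

# Hypothetical purchases
toy = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-01", "2016-01-01", "2016-01-02"],
    "user_type": ["a", "a", "b", "a"],
    "price": [10.0, 20.0, 5.0, 8.0],
})

# Mean price per date (rows) and user type (columns)
pivot = toy.pivot_table(index="date", columns="user_type",
                        values="price", aggfunc="mean")

print(pivot)
```

Here the cell for `2016-01-01` and type `a` is 15.0 (the mean of 10.0 and 20.0), and the cell for `2016-01-02` and type `b` is NaN.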

## Stacking

In [None]:
print(df.stack())
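Stacking moves the column labels into an inner level of the row index, turning a wide frame into a long Series indexed by (row, column). A sketch on an invented two-row frame (`r1`, `r2` and the column names are placeholders):

```python
import pandas as pd

# Hypothetical wide frame
toy = pd.DataFrame({"price": [10.0, 20.0], "quantity": [1.0, 2.0]},
                   index=["r1", "r2"])

# Each (row label, column label) pair becomes one entry
stacked = toy.stack()
print(stacked)

# unstack reverses the operation
restored = stacked.unstack()
```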