# Data Processing and Analysis

Data processing is often the most important and most time-consuming part of the lifecycle of a machine learning project.

In this notebook, we analyze a dummy dataset to understand the issues that come up with real-world datasets and the steps to handle them.

## Utilities

We add some utility functions here that we will use throughout this notebook. They are also packaged in a `utils.py` file that you can use offline. Since we will be using Colab for the tutorials, all the functions are defined in this notebook to avoid the hassle of file uploads and Drive connections.

In [0]:
import datetime
import random
from random import randrange
import numpy as np
import pandas as pd


def _random_date(start,date_count):
    """This generator yields random dates based on params
    Args:
        start (date object): the base date
        date_count (int): number of dates to be generated
    Yields:
        a random date between 0 and 41 days after start

    """
    while date_count > 0:
        # randrange(42) draws an offset of 0 to 41 days
        yield start + datetime.timedelta(days=randrange(42))
        date_count-=1
        
        

def generate_sample_data(row_count=100):
    """This function generates a random transaction dataset
    Args:
        row_count (int): number of rows for the dataframe
    Returns:
        a pandas dataframe

    """

    # sentinels
    startDate = datetime.datetime(2016, 1, 1, 13)
    serial_number_sentinel = 1000
    user_id_sentinel = 5001
    product_id_sentinel = 101
    price_sentinel = 2000

    # base list of attributes
    data_dict = {
        'Serial No':
        np.arange(row_count) + serial_number_sentinel,
        'Date':
        np.random.permutation(
            pd.to_datetime([
                x.strftime("%d-%m-%Y")
                for x in _random_date(startDate, row_count)
            ]).date),
        'User ID':
        np.random.permutation(
            np.random.randint(0, row_count, size=int(row_count / 10)) +
            user_id_sentinel).tolist() * 10,
        'Product ID':
        np.random.permutation(
            np.random.randint(0, row_count, size=int(row_count / 10)) +
            product_id_sentinel).tolist() * 10,
        'Quantity Purchased':
        np.random.permutation(np.random.randint(1, 42, size=row_count)),
        'Price':
        np.round(
            np.abs(np.random.randn(row_count) + 1) * price_sentinel,
            decimals=2),
        'User Type':
        np.random.permutation(
            [chr(random.randrange(97, 97 + 3 + 1)) for i in range(row_count)])
    }

    # introduce missing values
    for index in range(int(np.sqrt(row_count))):
        data_dict['Price'][np.argmax(
            data_dict['Price'] == random.choice(data_dict['Price']))] = np.nan
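        # note: 'User Type' is a fixed-width string array, so np.nan is
        # stored here as the string 'n' rather than as a missing value
        data_dict['User Type'][np.argmax(
            data_dict['User Type'] == random.choice(
                data_dict['User Type']))] = np.nan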
        data_dict['Date'][np.argmax(
            data_dict['Date'] == random.choice(data_dict['Date']))] = np.nan
        data_dict['Product ID'][np.argmax(data_dict['Product ID'] == random.
                                          choice(data_dict['Product ID']))] = 0
        data_dict['Serial No'][np.argmax(data_dict['Serial No'] == random.
                                         choice(data_dict['Serial No']))] = -1
        data_dict['User ID'][np.argmax(data_dict['User ID'] == random.choice(
            data_dict['User ID']))] = -101

    # create data frame
    df = pd.DataFrame(data_dict)

    return df

## Import Dependencies

In [0]:
# import required libraries
import numpy as np
import pandas as pd
from IPython.display import display
from sklearn import preprocessing

pd.options.mode.chained_assignment = None

## Generate Dataset

+ Question: Generate 1000 sample rows

In [4]:
## Generate a dataset with 1000 rows
df = generate_sample_data(row_count=1000)
df.shape

(1000, 7)

### Analyze generated Dataset

In [5]:
df.head()

Unnamed: 0,Serial No,Date,User ID,Product ID,Quantity Purchased,Price,User Type
0,1000,2016-05-01,-101,0,8,,n
1,-1,2016-09-01,5362,375,32,2698.38,n
2,1002,,5022,419,39,426.02,n
3,1003,2016-01-25,5811,219,6,4047.37,n
4,1004,2016-11-02,5403,158,9,1171.53,n


### Dataframe Stats

Determine the following:

* The number of data points (rows). (*Hint:* check out the dataframe `.shape` attribute.)
* The column names. (*Hint:* check out the dataframe `.columns` attribute.)
* The data types for each column. (*Hint:* check out the dataframe `.dtypes` attribute.)

In [6]:
print("Number of rows::",df.shape[0])

Number of rows:: 1000


### Question
+ Get the number of columns

In [7]:
print("Number of columns::",df.shape[1])

Number of columns:: 7


In [8]:
print("Column Names::",df.columns.values.tolist())

Column Names:: ['Serial No', 'Date', 'User ID', 'Product ID', 'Quantity Purchased', 'Price', 'User Type']


In [9]:
print("Column Data Types::\n",df.dtypes)

Column Data Types::
 Serial No               int64
Date                   object
User ID                 int64
Product ID              int64
Quantity Purchased      int64
Price                 float64
User Type              object
dtype: object


In [10]:
print("Columns with Missing Values::",df.columns[df.isnull().any()].tolist())

Columns with Missing Values:: ['Date', 'Price']


In [11]:
print("Number of rows with Missing Values::",df.isnull().any(axis=1).sum())

Number of rows with Missing Values:: 60
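
The two checks above can be combined into a per-column count. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and stand in for the notebook's `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `df`
toy = pd.DataFrame({
    "date": ["2016-01-05", None, "2016-01-09", None],
    "price": [10.5, np.nan, 7.25, 3.0],
    "user_type": ["a", "b", "c", "d"],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Number of rows with at least one missing value
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_per_column)
print("Rows with missing values:", rows_with_missing)
```

A per-column count is often more informative than a single total, because it shows which attributes need imputation and which can be left alone.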



#### General Stats

In [12]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
Serial No             1000 non-null int64
Date                  969 non-null object
User ID               1000 non-null int64
Product ID            1000 non-null int64
Quantity Purchased    1000 non-null int64
Price                 969 non-null float64
User Type             1000 non-null object
dtypes: float64(1), int64(4), object(2)
memory usage: 54.8+ KB
None


In [13]:
print(df.describe())

         Serial No      User ID   Product ID  Quantity Purchased        Price
count  1000.000000  1000.000000  1000.000000         1000.000000   969.000000
mean   1454.778000  5423.365000   591.819000           20.538000  2348.003581
std     383.426818   331.940079   304.103468           11.758181  1637.761952
min      -1.000000  -101.000000     0.000000            1.000000     4.240000
25%    1225.750000  5176.000000   294.000000           11.000000   969.560000
50%    1483.500000  5401.000000   536.000000           20.000000  2112.670000
75%    1743.250000  5672.250000   883.000000           30.250000  3447.960000
max    1999.000000  5968.000000  1087.000000           41.000000  9393.630000


## Standardize Columns

### Question
+ Use ```columns``` attribute and ```tolist()``` method to get the list of all columns

In [14]:
# list all columns
print("Dataframe columns:\n{}".format(df.columns.tolist()))

Dataframe columns:
['Serial No', 'Date', 'User ID', 'Product ID', 'Quantity Purchased', 'Price', 'User Type']


### Utility to Standardize Columns

+ Question: We usually use lowercase, snake_case column names in Python. Write a utility method to do the same. You may use methods like ```lower, replace```. Setting ```inplace``` = ```True``` modifies the existing dataframe instead of returning a new one


*Hint:* there are multiple ways to do this, but you could use either the [string processing methods](http://pandas.pydata.org/pandas-docs/stable/text.html) or the [apply method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html).
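
As one possible reading of the string-processing hint, the sketch below renames columns through the `.str` accessor on the column index. The toy frame is hypothetical and only reuses three of the column names from this notebook:

```python
import pandas as pd

# Hypothetical empty frame that only carries column names
toy = pd.DataFrame(columns=["Serial No", "User ID", "Quantity Purchased"])

# Lower-case every name and replace spaces with underscores
toy.columns = (toy.columns
               .str.strip()
               .str.lower()
               .str.replace(" ", "_", regex=False))

print(toy.columns.tolist())
```

This variant assigns a new column index instead of calling `rename`, so it always modifies the dataframe it is given.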

In [0]:
def cleanup_column_names(df,rename_dict=None,do_inplace=True):
    """This function renames columns of a pandas dataframe
       It converts column names to snake case if rename_dict is not passed. 
    Args:
        df (pandas dataframe): dataframe whose columns are to be renamed
        rename_dict (dict): keys represent old column names and values point to 
                            newer ones
        do_inplace (bool): flag to update existing dataframe or return a new one
    Returns:
        pandas dataframe if do_inplace is set to False, None otherwise

    """
    if not rename_dict:
        # lower case and replace <space> with <underscore>
        rename_dict = {col: col.lower().replace(' ','_') 
                       for col in df.columns.values.tolist()}
    return df.rename(columns=rename_dict,inplace=do_inplace)

In [0]:
cleanup_column_names(df)

In [17]:
# Updated column names
print("Dataframe columns:\n{}".format(df.columns.tolist()))

Dataframe columns:
['serial_no', 'date', 'user_id', 'product_id', 'quantity_purchased', 'price', 'user_type']


## Basic Manipulation

### Sort by specific attributes

+ Question: Sort serial_no in ascending and price in descending order.

In [18]:
# Ascending for Serial No and Descending for Price
display(df.sort_values(['serial_no', 'price'], 
                         ascending=[True, False]).head())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
502,-1,2016-01-30,5022,419,24,9393.63,d
384,-1,2016-01-29,5597,1000,27,7312.06,a
821,-1,2016-09-02,5384,1016,20,7246.72,a
604,-1,2016-08-02,5403,158,24,5955.69,d
514,-1,2016-06-01,5336,329,25,4979.46,c


### Reorder columns

In [19]:
display(df[['serial_no','date','user_id','user_type',
              'product_id','quantity_purchased','price']].head())

Unnamed: 0,serial_no,date,user_id,user_type,product_id,quantity_purchased,price
0,1000,2016-05-01,-101,n,0,8,
1,-1,2016-09-01,5362,n,375,32,2698.38
2,1002,,5022,n,419,39,426.02
3,1003,2016-01-25,5811,n,219,6,4047.37
4,1004,2016-11-02,5403,n,158,9,1171.53


### Select Attributes

In [20]:
# Using Column Index
# print 10 values from column at index 3
print(df.iloc[:,3].values[0:10])

[  0 375 419 219 158 909 615 505 408 887]


In [21]:
# Using Column Name
# print 10 values of quantity purchased
print(df.quantity_purchased.values[0:10])

[ 8 32 39  6  9  3 33  3 12 37]


In [22]:
# Using Datatype
# print 10 values of columns with data type float
print(df.select_dtypes(include=['float64']).values[:10,0])

[    nan 2698.38  426.02 4047.37 1171.53 1621.08 4545.24 4335.78 2821.18
 3017.22]


### Select Rows

In [23]:
# Using Row Index
display(df.iloc[[10,501,20]])

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
10,1010,2016-01-13,5834,752,20,1913.9,n
501,1501,2016-01-31,5362,375,9,3107.75,c
20,1020,2016-01-14,5841,1052,29,3549.63,a


In [24]:
# Exclude specific rows
display(df.drop([0,24,51], axis=0).head())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
1,-1,2016-09-01,5362,375,32,2698.38,n
2,1002,,5022,419,39,426.02,n
3,1003,2016-01-25,5811,219,6,4047.37,n
4,1004,2016-11-02,5403,158,9,1171.53,n
5,1005,,5414,909,3,1621.08,n


### Question
+ Show only rows which have quantity purchased greater than 25

In [25]:
# Conditional Filtering
# Quantity_Purchased greater than 25
display(df[df.quantity_purchased > 25].head())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
1,-1,2016-09-01,5362,375,32,2698.38,n
2,1002,,5022,419,39,426.02,n
6,1006,,5395,615,33,4545.24,n
9,1009,2016-01-26,5301,887,37,3017.22,n
11,1011,2016-01-26,5011,261,41,1581.44,n
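
Conditions can also be combined. The sketch below uses a hypothetical toy frame and an arbitrary price threshold of 2000 (both are illustrative assumptions) to show a boolean mask next to the equivalent `query` call:

```python
import pandas as pd

# Hypothetical toy frame with two of the notebook's columns
toy = pd.DataFrame({
    "quantity_purchased": [8, 32, 39, 6, 41],
    "price": [150.0, 2698.38, 426.02, 4047.37, 1581.44],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (toy["quantity_purchased"] > 25) & (toy["price"] < 2000)
filtered = toy[mask]

# The same filter written as a query string
same_with_query = toy.query("quantity_purchased > 25 and price < 2000")

print(filtered)
```

The parentheses matter because `&` binds more tightly than the comparison operators in Python.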


In [26]:
# Offset from Top
display(df[100:].head())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
100,1100,2016-01-31,5544,1011,11,2413.95,d
101,1101,2016-07-02,5362,375,20,1022.67,d
102,1102,,5022,419,33,,c
103,1103,2016-01-15,5811,219,9,1702.96,b
104,1104,2016-07-02,5403,158,41,253.14,d


In [27]:
# Offset from Bottom
display(df[-10:].head())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type
990,1990,2016-10-01,5190,622,19,2070.26,a
991,1991,2016-11-02,5058,1060,15,3465.3,b
992,1992,2016-01-15,5454,506,4,413.36,b
993,1993,2016-03-02,5762,168,5,2801.6,a
994,1994,2016-01-28,5184,647,9,5036.47,b


### Type Casting

In [28]:
# Existing Datatypes
df.dtypes

serial_no               int64
date                   object
user_id                 int64
product_id              int64
quantity_purchased      int64
price                 float64
user_type              object
dtype: object

In [29]:
# Set Datetime as dtype for date column
df['date'] = pd.to_datetime(df.date)
print(df.dtypes)

serial_no                      int64
date                  datetime64[ns]
user_id                        int64
product_id                     int64
quantity_purchased             int64
price                        float64
user_type                     object
dtype: object
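
Real-world columns often contain values that cannot be cast. A minimal sketch, assuming a hypothetical toy frame with deliberately malformed entries, of how `errors="coerce"` turns unparseable values into missing values instead of raising an error:

```python
import pandas as pd

# Hypothetical toy frame with malformed entries
toy = pd.DataFrame({
    "date": ["2016-01-25", "not a date", None],
    "quantity": ["8", "32", "x"],
})

# Unparseable dates become NaT, unparseable numbers become NaN
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d", errors="coerce")
toy["quantity"] = pd.to_numeric(toy["quantity"], errors="coerce")

print(toy.dtypes)
print(toy)
```

Coercing is convenient for exploration, but it hides bad input, so it is worth counting the resulting missing values afterwards.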


### Map/Apply Functionality

### Question
+ Write a utility method to create a new column ```user_class``` from ```user_type``` using the following mapping:
    - ```user_type``` __a__ and __b__ map to ```user_class``` __new__
    - ```user_type``` __c__ maps to ```user_class``` __existing__
    - ```user_type``` __d__ maps to ```user_class``` __loyal_existing__
    - map all other ```user_type``` values as __error__

In [0]:
def expand_user_type(u_type):
    """This function maps user types to user classes
    Args:
        u_type (str): user type value
    Returns:
        (str) user_class value

    """
    if u_type in ['a','b']:
        return 'new'
    elif u_type == 'c':
        return 'existing'
    elif u_type == 'd':
        return 'loyal_existing'
    else:
        return 'error'
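
The same mapping can be sketched with a dictionary instead of a function. This is an alternative, not the notebook's own approach; `USER_CLASS_BY_TYPE` and the sample series are hypothetical names introduced for illustration:

```python
import pandas as pd

# Hypothetical lookup table mirroring the mapping described above
USER_CLASS_BY_TYPE = {
    "a": "new",
    "b": "new",
    "c": "existing",
    "d": "loyal_existing",
}

user_type = pd.Series(["a", "c", "d", "n", None, "b"])

# Values missing from the dictionary map to NaN, which is then labelled 'error'
user_class = user_type.map(USER_CLASS_BY_TYPE).fillna("error")

print(user_class.tolist())
```

A dictionary keeps the mapping in one place as data, which can be easier to extend than a chain of `if`/`elif` branches.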

In [31]:
# Map User Type to User Class
df['user_class'] = df['user_type'].map(expand_user_type)
display(df.tail())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class
995,1995,2016-02-02,5186,1033,18,4584.31,a,new
996,1996,2016-01-21,5104,886,27,1100.8,b,new
997,1997,2016-01-21,5668,892,41,815.71,a,new
998,1998,2016-01-16,5096,732,1,1321.48,c,existing
999,1999,2016-05-02,5219,237,24,328.34,a,new


### Question
+ Get range for each numeric attribute, i.e. max-min

In [32]:
# Apply: Using apply to get attribute ranges
display(df.select_dtypes(include=[np.number]).apply(lambda x: 
                                                        x.max()- x.min()))

serial_no             2000.00
user_id               6069.00
product_id            1087.00
quantity_purchased      40.00
price                 9389.39
dtype: float64

In [0]:
# Map: Extract Week from Date (0 for missing dates)
df['purchase_week'] = df['date'].map(lambda dt: dt.week 
                                         if not pd.isnull(dt) 
                                         else 0)

In [34]:
display(df.head())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week
0,1000,2016-05-01,-101,0,8,,n,error,17
1,-1,2016-09-01,5362,375,32,2698.38,n,error,35
2,1002,NaT,5022,419,39,426.02,n,error,0
3,1003,2016-01-25,5811,219,6,4047.37,n,error,4
4,1004,2016-11-02,5403,158,9,1171.53,n,error,44
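
A vectorized way to derive the week is sketched below. It assumes pandas 1.1 or later, where `Series.dt.isocalendar()` is available, and uses a hypothetical three-element series rather than the notebook's `df`:

```python
import pandas as pd

# Hypothetical date series with one missing entry
dates = pd.Series(pd.to_datetime(["2016-01-25", None, "2016-05-01"]))

# ISO week number per row; missing dates are filled with 0
purchase_week = dates.dt.isocalendar().week.fillna(0).astype(int)

print(purchase_week.tolist())
```

Using 0 as the week for missing dates follows the convention of the cell above; it is a placeholder, not a valid week number.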


## Handle Missing Values

In [35]:
# Drop Rows with Missing Dates
df_dropped = df.dropna(subset=['date'])
display(df_dropped.head())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week
0,1000,2016-05-01,-101,0,8,,n,error,17
1,-1,2016-09-01,5362,375,32,2698.38,n,error,35
3,1003,2016-01-25,5811,219,6,4047.37,n,error,4
4,1004,2016-11-02,5403,158,9,1171.53,n,error,44
8,1008,2016-01-22,5968,408,12,2821.18,n,error,3


In [0]:
# Filling missing price with mean price
df_dropped['price'] = df_dropped['price'].fillna(
    value=np.round(df.price.mean(),decimals=2))

In [0]:
# Fill missing user types using values from previous row
df_dropped['user_type'] = df_dropped['user_type'].ffill()
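
Missing data is not always stored as NaN. The sample generator in this notebook writes -1, -101 and 0 into the serial number, user ID and product ID columns, and `isnull()` does not flag these. A minimal sketch, on a hypothetical toy frame and with a hypothetical `SENTINELS` lookup, of converting such placeholders into real missing values:

```python
import pandas as pd

# Hypothetical toy frame using the same placeholder values as the generator
toy = pd.DataFrame({
    "serial_no": [1000, -1, 1002],
    "user_id": [-101, 5362, 5022],
    "product_id": [0, 375, 419],
})

# Placeholder value per column (taken from the generator code)
SENTINELS = {"serial_no": -1, "user_id": -101, "product_id": 0}

cleaned = toy.copy()
for column, sentinel in SENTINELS.items():
    # mask() replaces matching entries with NaN
    cleaned[column] = cleaned[column].mask(cleaned[column] == sentinel)

print(cleaned.isnull().sum())
```

After this step the usual tools (`dropna`, `fillna`) apply to these columns as well. Note that the columns become floating point, because integer columns cannot hold NaN.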

## Handle Duplicates

### Question
+ Identify duplicates only for column ```serial_no```

In [38]:
# sample duplicates. Identify for serial_no
display(df_dropped[df_dropped.duplicated(subset=['serial_no'])].head())
print("Shape of df={}".format(df_dropped.shape))

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week
50,-1,2016-04-02,5098,412,22,4954.93,d,loyal_existing,13
99,-1,2016-03-01,5219,237,25,222.57,c,existing,9
134,-1,2016-03-02,5056,861,11,508.33,c,existing,9
183,-1,2016-01-15,5032,721,27,2693.6,c,existing,2
280,-1,2016-06-02,5685,985,21,4426.05,c,existing,22


Shape of df=(969, 9)


In [39]:
# Drop Duplicates
df_dropped.drop_duplicates(subset=['serial_no'],inplace=True)
display(df_dropped.head())
print("Shape of df={}".format(df_dropped.shape))

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week
0,1000,2016-05-01,-101,0,8,2348.0,n,error,17
1,-1,2016-09-01,5362,375,32,2698.38,n,error,35
3,1003,2016-01-25,5811,219,6,4047.37,n,error,4
4,1004,2016-11-02,5403,158,9,1171.53,n,error,44
8,1008,2016-01-22,5968,408,12,2821.18,n,error,3


Shape of df=(941, 9)
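
By default `drop_duplicates` keeps the first occurrence of each duplicated value, which is why one row with serial number -1 survives above. The sketch below, on a hypothetical toy frame, compares the three settings of the `keep` parameter:

```python
import pandas as pd

# Hypothetical toy frame with a repeated serial number
toy = pd.DataFrame({
    "serial_no": [-1, 1001, -1, 1003, -1],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
})

first_kept = toy.drop_duplicates(subset=["serial_no"], keep="first")
last_kept = toy.drop_duplicates(subset=["serial_no"], keep="last")
none_kept = toy.drop_duplicates(subset=["serial_no"], keep=False)

print(first_kept.index.tolist())
print(last_kept.index.tolist())
print(none_kept.index.tolist())
```

When the duplicated value is itself a placeholder for bad data, `keep=False` may be the better fit, since no copy of the row is trustworthy.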


### Question
+ Remove rows which have fewer than 3 attributes with non-missing data
+ Print the shape of dataframe thus prepared

In [42]:
# Remove rows which have fewer than 3 attributes with non-missing data
display(df.dropna(thresh=3).head())
print("Shape of df={}".format(df.dropna(thresh=3).shape))

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week
0,1000,2016-05-01,-101,0,8,,n,error,17
1,-1,2016-09-01,5362,375,32,2698.38,n,error,35
2,1002,NaT,5022,419,39,426.02,n,error,0
3,1003,2016-01-25,5811,219,6,4047.37,n,error,4
4,1004,2016-11-02,5403,158,9,1171.53,n,error,44


Shape of df=(1000, 9)
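
The shape is unchanged because every row already has at least three non-missing attributes. The `thresh` argument is the minimum number of non-missing values a row needs in order to be kept, as this sketch on a hypothetical toy frame illustrates:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows have 4, 2 and 1 non-missing values
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 6.0, 7.0],
})

kept_with_3 = toy.dropna(thresh=3)  # keeps rows with >= 3 non-missing values
kept_with_2 = toy.dropna(thresh=2)  # keeps rows with >= 2 non-missing values

print(kept_with_3.index.tolist())
print(kept_with_2.index.tolist())
```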


## Handle Categoricals

### One Hot Encoding

In [43]:
display(pd.get_dummies(df,columns=['user_type']).head())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_class,purchase_week,user_type_a,user_type_b,user_type_c,user_type_d,user_type_n
0,1000,2016-05-01,-101,0,8,,error,17,0,0,0,0,1
1,-1,2016-09-01,5362,375,32,2698.38,error,35,0,0,0,0,1
2,1002,NaT,5022,419,39,426.02,error,0,0,0,0,0,1
3,1003,2016-01-25,5811,219,6,4047.37,error,4,0,0,0,0,1
4,1004,2016-11-02,5403,158,9,1171.53,error,44,0,0,0,0,1
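
`get_dummies` ignores missing values unless asked otherwise. A minimal sketch, assuming a hypothetical toy column with one missing entry, of the `dummy_na` and `dtype` options:

```python
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.DataFrame({"user_type": ["a", "b", None, "a"]})

# dummy_na=True adds an extra indicator column for missing values;
# dtype=int yields 0/1 integers instead of booleans
encoded = pd.get_dummies(toy, columns=["user_type"], dummy_na=True, dtype=int)

print(encoded)
```

With the extra indicator column, every row has exactly one indicator set, including the row whose type is missing.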


### Label Encoding

### Question
+ Use a dictionary to encode user_types as a sequence of numbers. Replace missing values (NaNs) with -1

In [44]:
type_map = {'a': 0, 'b': 1, 'c': 2, 'd': 3, np.nan: -1}
df['encoded_user_type'] = df.user_type.map(type_map)
display((df.tail()))

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week,encoded_user_type
995,1995,2016-02-02,5186,1033,18,4584.31,a,new,5,0.0
996,1996,2016-01-21,5104,886,27,1100.8,b,new,3,1.0
997,1997,2016-01-21,5668,892,41,815.71,a,new,3,0.0
998,1998,2016-01-16,5096,732,1,1321.48,c,existing,2,2.0
999,1999,2016-05-02,5219,237,24,328.34,a,new,18,0.0
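
The encoded column above is floating point, which suggests that some values did not match any dictionary key and became NaN. One way to get integer codes with -1 for anything unmapped is sketched below; `TYPE_MAP` and the sample series are hypothetical:

```python
import pandas as pd

# Hypothetical mapping for the known user types only
TYPE_MAP = {"a": 0, "b": 1, "c": 2, "d": 3}

user_type = pd.Series(["a", "d", None, "n", "b"])

# Unmapped and missing values become NaN, are filled with -1,
# and the result is cast back to integers
encoded = user_type.map(TYPE_MAP).fillna(-1).astype(int)

print(encoded.tolist())
```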


## Handle Numerical Attributes

### Min-Max Scaler
### Question
+ Control the range of numerical attribute price by using ```MinMaxScaler``` transformer

In [0]:
df_normalized = df.dropna().copy()
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df_normalized['price'].values.reshape(-1,1))
df_normalized['price'] = np_scaled.ravel()

In [46]:
display(df_normalized.head())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week,encoded_user_type
20,1020,2016-01-14,5841,1052,29,0.377595,a,new,2,0.0
25,1025,2016-09-01,5167,1077,26,0.354658,d,loyal_existing,35,3.0
30,1030,2016-03-02,5434,956,22,0.066092,d,loyal_existing,9,3.0
36,1036,2016-01-14,5611,586,39,0.017376,a,new,2,0.0
37,1037,2016-05-01,5836,530,20,0.442558,d,loyal_existing,17,3.0
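
With its default settings, `MinMaxScaler` maps each value `x` to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below reproduces that formula in plain pandas on a hypothetical three-value series:

```python
import pandas as pd

# Hypothetical price values
price = pd.Series([4.24, 2112.67, 9393.63])

# Min-max scaling to the range [0, 1]
scaled = (price - price.min()) / (price.max() - price.min())

print(scaled.round(6).tolist())
```

Because the minimum and maximum define the scale, a single extreme value compresses all other values towards 0.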


### Robust Scaler

In [0]:
df_normalized = df.dropna().copy()
robust_scaler = preprocessing.RobustScaler()
rs_scaled = robust_scaler.fit_transform(df_normalized['quantity_purchased'].values.reshape(-1,1))
df_normalized['quantity_purchased'] = rs_scaled.ravel()

In [48]:
display(df_normalized.head())

Unnamed: 0,serial_no,date,user_id,product_id,quantity_purchased,price,user_type,user_class,purchase_week,encoded_user_type
20,1020,2016-01-14,5841,1052,0.473684,3549.63,a,new,2,0.0
25,1025,2016-09-01,5167,1077,0.315789,3334.26,d,loyal_existing,35,3.0
30,1030,2016-03-02,5434,956,0.105263,624.8,d,loyal_existing,9,3.0
36,1036,2016-01-14,5611,586,1.0,167.39,a,new,2,0.0
37,1037,2016-05-01,5836,530,0.0,4159.59,d,loyal_existing,17,3.0
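
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (75th percentile minus 25th percentile). The sketch below applies that formula in plain pandas to a hypothetical five-value series:

```python
import pandas as pd

# Hypothetical quantities
quantity = pd.Series([1, 11, 20, 30, 41])

median = quantity.median()
iqr = quantity.quantile(0.75) - quantity.quantile(0.25)

# Center on the median and scale by the interquartile range
scaled = (quantity - median) / iqr

print(scaled.round(6).tolist())
```

The median and quartiles are much less sensitive to extreme values than the minimum and maximum, which is the reason to prefer this scaler when outliers are present.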


## Group-By

### Question
+ Group by attribute ```user_class``` and get sum of quantity_purchased

*Hint:* you may want to use Pandas [`groupby` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) to group by certain attributes before calculating the statistic.

Try calculating multiple statistics (mean, median, etc) in a single table (i.e. with a single groupby call). See the section of the Pandas documentation on [applying multiple functions at once](http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once) for a hint.

In [49]:
# Group By attributes user_class and get sum of quantity_purchased
print(df.groupby(['user_class'])['quantity_purchased'].sum())

user_class
error              656
existing          5878
loyal_existing    4134
new               9870
Name: quantity_purchased, dtype: int64


In [50]:
# Aggregate Functions. Sum, Mean and Non Zero Row Count
display(
    df.groupby(['user_class'])['quantity_purchased'].agg(
        [np.sum, np.mean, np.count_nonzero]))

user_class,sum,mean,count_nonzero
error,656,21.16129,31
existing,5878,21.220217,277
loyal_existing,4134,19.779904,209
new,9870,20.434783,483


In [51]:
# Aggregate Functions specific to columns
display(df.groupby(['user_class','user_type']).agg({'price':np.mean,
                                                        'quantity_purchased':np.max}))

user_class,user_type,price,quantity_purchased
error,n,2768.931333,41
existing,c,2323.129515,41
loyal_existing,d,2297.724335,41
new,a,2368.391376,41
new,b,2347.20584,41


In [52]:
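# Multiple Aggregate Functions
# Note: the nested-dict renaming used here is deprecated and was removed in
# pandas 1.0; named aggregation is the replacement. 'variance_price' holds
# np.std, i.e. a standard deviation.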
display(
    df.groupby(['user_class', 'user_type']).agg({
        'price': {
            'total_price': np.sum,
            'mean_price': np.mean,
            'variance_price': np.std,
            'count': np.count_nonzero
        },
        'quantity_purchased': np.sum
    }))


,,price,price,price,price,quantity_purchased
user_class,user_type,total_price,mean_price,variance_price,count,sum
error,n,83067.94,2768.931333,1445.762191,31.0,656
existing,c,622598.71,2323.129515,1632.362506,277.0,5878
loyal_existing,d,466438.04,2297.724335,1638.207794,209.0,4134
new,a,516309.32,2368.391376,1673.862628,225.0,4497
new,b,586801.46,2347.20584,1638.116872,258.0,5373
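
Renamed aggregates like these can also be written with named aggregation, where each output column is given as `name=(column, function)`. The sketch below assumes pandas 0.25 or later and uses a hypothetical toy frame. The output names are illustrative, and `"count"` counts non-missing values, which may differ from `np.count_nonzero`:

```python
import pandas as pd

# Hypothetical toy frame with two user classes
toy = pd.DataFrame({
    "user_class": ["new", "new", "existing", "existing"],
    "price": [100.0, 300.0, 50.0, 150.0],
    "quantity_purchased": [1, 2, 3, 4],
})

summary = toy.groupby("user_class").agg(
    total_price=("price", "sum"),
    mean_price=("price", "mean"),
    std_price=("price", "std"),
    price_count=("price", "count"),
    total_quantity=("quantity_purchased", "sum"),
)

print(summary)
```

The result has flat column names rather than a two-level header, which is usually easier to work with downstream.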


## Pivot Tables

In [53]:
display(df.pivot_table(index='date', columns='user_type', 
                         values='price',aggfunc=np.mean))

user_type,a,b,c,d,n
date,,,,,
2016-01-01,3041.698333,3023.85,2614.0325,1276.478,
2016-01-02,1121.17,2442.1775,2020.081429,1578.444286,
2016-01-13,2978.797778,2318.06,1671.386667,4222.955,1913.9
2016-01-14,1790.955,1591.11,2304.13,3595.118,
2016-01-15,3023.073333,1951.65,1837.246667,1824.73,2604.705
2016-01-16,2420.365,2990.763333,2635.99,2351.9375,
2016-01-17,,1861.55,3580.0,2514.778889,
2016-01-18,2530.961667,1780.6525,1483.172,2199.813333,
2016-01-19,2824.6,757.11,2499.586667,1587.3575,920.16
2016-01-20,2324.895,2534.811429,1536.6825,2313.14625,
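
The empty cells above are date and user type combinations with no purchases. A minimal sketch, on a hypothetical toy frame, of how `fill_value` replaces them in the pivoted result:

```python
import pandas as pd

# Hypothetical toy frame: no purchases by user type 'b' on the second date
toy = pd.DataFrame({
    "date": ["2016-01-13", "2016-01-13", "2016-01-14", "2016-01-14"],
    "user_type": ["a", "b", "a", "a"],
    "price": [100.0, 200.0, 300.0, 500.0],
})

pivot = toy.pivot_table(index="date", columns="user_type",
                        values="price", aggfunc="mean", fill_value=0)

print(pivot)
```

Whether 0 is a sensible fill depends on the analysis: for a mean price, leaving the cell empty may be more honest than reporting a price of zero.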


## Stacking

In [54]:
print(df.stack())

0    serial_no                            1000
     date                  2016-05-01 00:00:00
     user_id                              -101
     product_id                              0
     quantity_purchased                      8
                                  ...         
999  price                              328.34
     user_type                               a
     user_class                            new
     purchase_week                          18
     encoded_user_type                       0
Length: 9907, dtype: object
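
`stack()` moves the column labels into an inner index level, giving one value per (row, column) pair. The reported length of 9907 is below 1000 × 10 = 10000, which is consistent with missing cells being dropped in that run. The sketch below uses a hypothetical toy frame without missing values to show the round trip through `unstack()`:

```python
import pandas as pd

# Hypothetical toy frame without missing values
toy = pd.DataFrame({"price": [10.0, 20.0], "quantity": [1.0, 2.0]})

# One entry per (row label, column label) pair
stacked = toy.stack()

# unstack() moves the inner index level back into columns
restored = stacked.unstack()

print(stacked)
print(restored)
```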
