# Background

## Top-Down Approach 

The coursebook is part of the **Data Analytics Specialization** offered by [Algoritma](https://algorit.ma). It takes a more accessible approach than Algoritma's core educational products by getting participants over the "how" barrier first, rather than starting with a detailed breakdown of the "why".

This translates to a gentler learning curve overall: the reader is prompted to write short snippets of code at frequent intervals before being offered an explanation of the underlying theory. The conventional route would be to master the syntax of the Python programming language, then data structures, then the `pandas` library, then the mathematical details of an imputation algorithm, and finally its code implementation. We do the opposite: implement the imputation first, then give a succinct explanation of why it works and the practical considerations (what to look out for, which assumptions it makes, when _not_ to use it, and so on).

## Training Objectives

This coursebook is intended for participants who have completed the preceding courses offered in the **Data Analytics Developer** Specialization. This is the second course, **Exploratory Data Analysis**.

The coursebook focuses on:
- Why and What: Exploratory Data Analysis
- Date Time objects
- Categorical data types
- Cross Tabulation and Pivot Table
- Treating Duplicates and Missing Values 

At the end of this course is a Learn-by-Building section, where you are expected to apply all that you've learned on a new dataset, and attempt the given questions.

# Data Preparation and Exploration

About 60 years ago, John Tukey defined data analysis as the "procedures for analyzing data, techniques for interpreting the results of such procedures ... and all the machinery of mathematical statistics which apply to analyzing data". His championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs (which later inspired R).

He wrote a book titled _Exploratory Data Analysis_ arguing that too much emphasis in statistics was placed on hypothesis testing (confirmatory data analysis) while not enough was placed on the discovery of the unexpected. 

> Exploratory data analysis isolates patterns and features of the data and reveals these forcefully to the analyst.

This course aims to present a selection of EDA techniques, some developed by John Tukey himself, with a special emphasis on their application to modern business analytics.

In the previous course, we've got our hands on a few common techniques:

- `.head()` and `.tail()`
- `.describe()`
- `.shape` and `.size`
- `.axes`
- `.dtypes`

In the following chapters, we'll expand our EDA toolset with the following additions:  

- Tables
- Cross-Tables and Aggregates
- Using `aggfunc` for aggregate functions
- Pivot Tables

In [26]:
import pandas as pd
import numpy as np
print(pd.__version__)

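# pandas output display setup
# note: both lines set the same option, so the second one (thousands separator) is the one in effect
pd.set_option('display.float_format', lambda x: '%.2f' % x) 
pd.options.display.float_format = '{:,}'.format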

1.0.5


## Working with Datetime

Given the program's special emphasis on business-driven analytics, one data type of particular interest to us is the `datetime`. In the first part of this coursebook, we've seen an example of `datetime` in the section introducing data types (`employees.joined`).

A large portion of the data science work performed by business executives involves time series and/or dates. Compare the kind of work done by computer vision researchers with the work done by credit rating analysts or marketing executives, and this special relationship between business and datetime data becomes apparent. Building familiarity with this format will serve you well in the long run.

As a start, let's read our data, `household.csv`:

In [27]:
# read `household.csv` in data_input folder
household = pd.read_csv('data_input/household.csv')

# check `household` data types
household.dtypes

receipt_id            int64
receipts_item_id      int64
purchase_time        object
category             object
sub_category         object
format               object
unit_price          float64
discount              int64
quantity              int64
yearmonth            object
dtype: object

In [28]:
# household['yearmonth'].astype('datetime64')

Notice that most columns already have a workable data type, but `purchase_time` is stored as `object`; the correct data type for this column is `datetime`. In the previous module, you learned how to use `.astype()` to change the data type of a column. `pandas` also provides a function dedicated to converting values to datetime.

To convert a column `x` to a datetime, we would use:

    `x = pd.to_datetime(x)`
    

In [29]:
# household.purchase_time
# household['purchase_time']

In [30]:
# use pd.to_datetime() to convert `purchase_time`
household['purchase_time'] = pd.to_datetime(household['purchase_time'])
    
# check dtypes
household.dtypes

receipt_id                   int64
receipts_item_id             int64
purchase_time       datetime64[ns]
category                    object
sub_category                object
format                      object
unit_price                 float64
discount                     int64
quantity                     int64
yearmonth                   object
dtype: object

Unlike `astype()`, `pd.to_datetime()` accepts additional arguments that control the conversion. Why does this matter? Suppose we have a column that stores daily sales dates from the end of January to the beginning of February:

In [31]:
date = pd.Series(['30-01-2020', '31-01-2020', '01-02-2020','02-02-2020'])
date

0    30-01-2020
1    31-01-2020
2    01-02-2020
3    02-02-2020
dtype: object

The legal and cultural expectations for datetime formats vary between countries. In Indonesia, for example, most people write dates in DMY order. Let's see what happens when we convert `date` to a datetime object:

In [32]:
date.astype('datetime64')

0   2020-01-30
1   2020-01-31
2   2020-01-02
3   2020-02-02
dtype: datetime64[ns]

In [33]:
date = pd.Series(['01/30/2020', '31-01-2020', '01-02-2020','02/02/20', '01/11/20','Feb-01-2020'])

date.astype('datetime64')

0   2020-01-30
1   2020-01-31
2   2020-01-02
3   2020-02-02
4   2020-01-11
5   2020-02-01
dtype: datetime64[ns]

Take a look at the third observation: rather than representing February 1st as intended, the value was converted to January 2nd. The thing to note here is that when a date string is ambiguous, `pandas` assumes a month-first order by default.

Using `pd.to_datetime`, you can specify your date formatting with parameters such as `format`* or `dayfirst`:

In [34]:
# pd.to_datetime(date, format='%d-%m-%Y')

pd.to_datetime(date, dayfirst=True)

0   2020-01-30
1   2020-01-31
2   2020-02-01
3   2020-02-02
4   2020-11-01
5   2020-02-01
dtype: datetime64[ns]

In [35]:
# pd.to_datetime(date, format='%d-%m-%Y')

\*_`pandas` interprets the `format` string using Python's strptime directives, the same ones used by `.strptime()` in Python's `datetime` module. The full list of directives can be found in this [documentation](https://strftime.org/)._
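
To make the `format` parameter concrete, here is a minimal sketch using the same day-first strings as the example above. The `messy` series and the use of `errors='coerce'` are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Day-first strings, as in the daily sales example above.
dates = pd.Series(['30-01-2020', '31-01-2020', '01-02-2020', '02-02-2020'])

# An explicit format removes any guessing:
# %d = day, %m = month, %Y = four-digit year.
parsed = pd.to_datetime(dates, format='%d-%m-%Y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())

# errors='coerce' turns strings that do not match the format into NaT
# instead of raising an exception (hypothetical messy input).
messy = pd.Series(['30-01-2020', 'not a date'])
coerced = pd.to_datetime(messy, format='%d-%m-%Y', errors='coerce')
print(coerced.isna().tolist())
```

An explicit `format` is usually the safer choice when every value in the column is known to follow one layout, while `dayfirst=True` is more forgiving with mixed inputs.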

---

**Note** 

Several ways to convert a column to datetime:

1. `data['column'].astype('datetime64')`: when no additional parameters are needed for the conversion.
2. `pd.to_datetime(data['column'])`: when you want extra control over how the values are parsed. An example of a parameter you can add:
    - `dayfirst=True`: when the date strings start with the day (e.g.: `[11-01-2020,12-01-2020,1-01-2020]`)
3. `pd.read_csv('data.csv', parse_dates=['column'])`: when you already know which columns in the file hold datetime values.

In [36]:
pd.read_csv('data_input/household.csv', parse_dates=['purchase_time']).dtypes

receipt_id                   int64
receipts_item_id             int64
purchase_time       datetime64[ns]
category                    object
sub_category                object
format                      object
unit_price                 float64
discount                     int64
quantity                     int64
yearmonth                   object
dtype: object

In [37]:
household.head()

| | receipt_id | receipts_item_id | purchase_time | category | sub_category | format | unit_price | discount | quantity | yearmonth |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9622257 | 32369294 | 2018-07-22 21:19:00 | Rice | Rice | supermarket | 128000.0 | 0 | 1 | 2018-07 |
| 1 | 9446359 | 31885876 | 2018-07-15 16:17:00 | Rice | Rice | minimarket | 102750.0 | 0 | 1 | 2018-07 |
| 2 | 9470290 | 31930241 | 2018-07-15 12:12:00 | Rice | Rice | supermarket | 64000.0 | 0 | 3 | 2018-07 |
| 3 | 9643416 | 32418582 | 2018-07-24 08:27:00 | Rice | Rice | minimarket | 65000.0 | 0 | 1 | 2018-07 |
| 4 | 9692093 | 32561236 | 2018-07-26 11:28:00 | Rice | Rice | supermarket | 124500.0 | 0 | 1 | 2018-07 |


---

Other than `to_datetime`, `pandas` has a number of utilities for working with `datetime` objects. These are convenient when we need to extract the month, year, or day name from a `datetime`. Some common applications in business analysis include:

- `household['purchase_time'].dt.month`: month (as a number)
- `household['purchase_time'].dt.month_name()`: month (as a name)
- `household['purchase_time'].dt.year`: year
- `household['purchase_time'].dt.day`: day of the month
- `household['purchase_time'].dt.dayofweek`: day of the week as a number, from 0 (Monday) to 6 (Sunday)
- `household['purchase_time'].dt.hour`: hour
- `household['purchase_time'].dt.day_name()`: day of the week (as a name)
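
The accessors listed above can be tried on a tiny, self-contained series before applying them to the full dataset. This is only a sketch: the three timestamps are copied from the first rows of `household`, and the `parts` frame is a hypothetical name used for illustration.

```python
import pandas as pd

# Three timestamps taken from the first rows of the dataset shown earlier.
ts = pd.Series(pd.to_datetime(['2018-07-22 21:19:00',
                               '2018-07-24 08:27:00',
                               '2018-07-26 11:28:00']))

# Collect several datetime components side by side.
parts = pd.DataFrame({
    'year': ts.dt.year,
    'month': ts.dt.month,
    'month_name': ts.dt.month_name(),
    'day': ts.dt.day,
    'hour': ts.dt.hour,
    'dayofweek': ts.dt.dayofweek,   # Monday=0 ... Sunday=6
    'day_name': ts.dt.day_name(),
})
print(parts)
```

Note that `dayofweek` counts from Monday as 0, so a Sunday is reported as 6.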

In [38]:
# household['purchase_time'].dt.day_name()
# household['purchase_time'].dt.hour

There are also other methods that can be helpful in certain situations. Suppose we want to transform the existing `datetime` column into period values; we can use the `.to_period` method:

- `household['purchase_time'].dt.to_period('D')`
- `household['purchase_time'].dt.to_period('W')`
- `household['purchase_time'].dt.to_period('M')`
- `household['purchase_time'].dt.to_period('Q')`

In [39]:
# household['purchase_time'].dt.to_period('M')
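
As a small illustration of `.to_period`, the sketch below converts two made-up timestamps to monthly and quarterly periods. The sample values and variable names are assumptions chosen for the example.

```python
import pandas as pd

# Two hypothetical purchase timestamps.
ts = pd.Series(pd.to_datetime(['2018-07-22 21:19:00', '2018-03-05 08:27:00']))

monthly = ts.dt.to_period('M')     # month-level periods
quarterly = ts.dt.to_period('Q')   # quarter-level periods

print(monthly.astype(str).tolist())
print(quarterly.astype(str).tolist())

# A period knows its own boundaries, which is handy for reporting.
print(monthly.dt.start_time.tolist())
```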

**Knowledge Check:** Date time types  
_Est. time required: 10-15 min_

1. In the following cell, start again by reading in the `household.csv` dataset. Drop `receipt_id` and `sub_category` columns as we won't use the columns for our analysis.  
2. Make sure the `purchase_time` column has been converted to a datetime object.
3. Use `x.dt.day_name()`, assuming `x` is a datetime column, to get the day of the week. Assign this to a new column in your `household` DataFrame and name it `weekday`.
4. The `yearmonth` column stores the year and month of `purchase_time`. Using `dt.to_period()`, how would you recreate the column if you needed the same information?
5. Print the first 5 rows of your data to verify that your preprocessing steps are correct.

Tips: In the cell below, start from:

`household = pd.read_csv("data_input/household.csv")`

Inspect the first 5 rows of your data and pay close attention to the `weekday` column. 

```
# column manipulation: change existing values / create a new column:

data['column'] = _________

```

In [40]:
household = pd.read_csv("data_input/household.csv", parse_dates=['purchase_time'])
household = household.drop(columns=['receipt_id','sub_category'])
household['weekday'] = household['purchase_time'].dt.day_name()
household['yearmonth'] = household['purchase_time'].dt.to_period('M')

household.head()

| | receipts_item_id | purchase_time | category | format | unit_price | discount | quantity | yearmonth | weekday |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32369294 | 2018-07-22 21:19:00 | Rice | supermarket | 128000.0 | 0 | 1 | 2018-07 | Sunday |
| 1 | 31885876 | 2018-07-15 16:17:00 | Rice | minimarket | 102750.0 | 0 | 1 | 2018-07 | Sunday |
| 2 | 31930241 | 2018-07-15 12:12:00 | Rice | supermarket | 64000.0 | 0 | 3 | 2018-07 | Sunday |
| 3 | 32418582 | 2018-07-24 08:27:00 | Rice | minimarket | 65000.0 | 0 | 1 | 2018-07 | Tuesday |
| 4 | 32561236 | 2018-07-26 11:28:00 | Rice | supermarket | 124500.0 | 0 | 1 | 2018-07 | Thursday |


**Bonus challenge:**  

Suppose that shipping is estimated to take around 2 days after a product is purchased. Create a new column named `shipdate_est` that stores the estimated shipping date of each transaction!

In [41]:
household['shipdate_est'] = household['purchase_time'] + pd.Timedelta(days=2)
household.head()

| | receipts_item_id | purchase_time | category | format | unit_price | discount | quantity | yearmonth | weekday | shipdate_est |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32369294 | 2018-07-22 21:19:00 | Rice | supermarket | 128000.0 | 0 | 1 | 2018-07 | Sunday | 2018-07-24 21:19:00 |
| 1 | 31885876 | 2018-07-15 16:17:00 | Rice | minimarket | 102750.0 | 0 | 1 | 2018-07 | Sunday | 2018-07-17 16:17:00 |
| 2 | 31930241 | 2018-07-15 12:12:00 | Rice | supermarket | 64000.0 | 0 | 3 | 2018-07 | Sunday | 2018-07-17 12:12:00 |
| 3 | 32418582 | 2018-07-24 08:27:00 | Rice | minimarket | 65000.0 | 0 | 1 | 2018-07 | Tuesday | 2018-07-26 08:27:00 |
| 4 | 32561236 | 2018-07-26 11:28:00 | Rice | supermarket | 124500.0 | 0 | 1 | 2018-07 | Thursday | 2018-07-28 11:28:00 |


Extra note on timedelta: subtracting one datetime from another gives a `Timedelta`:

In [42]:
t1 = pd.to_datetime('1/1/2020 01:00')
t2 = pd.to_datetime('1/1/2020 03:00')

t2 - t1

Timedelta('0 days 02:00:00')
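
Building on the note above, the following sketch shows that `pd.Timedelta` accepts several units at once and that the difference between two datetime columns can be turned into plain numbers. The `purchase` series is a hypothetical stand-in for `household['purchase_time']`.

```python
import pandas as pd

# Hypothetical stand-in for a purchase_time column.
purchase = pd.Series(pd.to_datetime(['2018-07-22 21:19:00',
                                     '2018-07-24 08:27:00']))

# Timedelta accepts several units at once.
shifted = purchase + pd.Timedelta(hours=2, minutes=6, seconds=30)
print(shifted.tolist())

# Subtracting two datetime Series gives a timedelta Series;
# .dt.total_seconds() converts it to plain numbers.
elapsed = shifted - purchase
print(elapsed.dt.total_seconds().tolist())
```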

## Working with Categories

From the output of `dtypes`, we see that there are three variables currently stored as `object` type where a `category` is more appropriate. This is a common diagnostic step, and one that you will employ in almost every data analysis project.

In [43]:
household.dtypes

receipts_item_id             int64
purchase_time       datetime64[ns]
category                    object
format                      object
unit_price                 float64
discount                     int64
quantity                     int64
yearmonth                period[M]
weekday                     object
shipdate_est        datetime64[ns]
dtype: object

Recall what you learned in the previous module about pandas data types. Which columns do you think should be converted to `category`?

In [44]:
## Your code here

#household['category'] = household['category'].astype('category')

household[['category', 'format','weekday']] = household[['category', 'format','weekday']].astype('category')
household.dtypes

receipts_item_id             int64
purchase_time       datetime64[ns]
category                  category
format                    category
unit_price                 float64
discount                     int64
quantity                     int64
yearmonth                period[M]
weekday                   category
shipdate_est        datetime64[ns]
dtype: object

### [Optional]: Alternative Solutions

In [45]:
household = pd.read_csv("data_input/household.csv", parse_dates=['purchase_time'])
household.dtypes

receipt_id                   int64
receipts_item_id             int64
purchase_time       datetime64[ns]
category                    object
sub_category                object
format                      object
unit_price                 float64
discount                     int64
quantity                     int64
yearmonth                   object
dtype: object

#### Solution 1:

In [46]:
data1 = household.select_dtypes(exclude='object')
data2 = household.select_dtypes('object').astype('category')

data2.dtypes

category        category
sub_category    category
format          category
yearmonth       category
dtype: object

In [47]:
household_clean = pd.concat([data1, data2], axis=1)
household_clean.dtypes

receipt_id                   int64
receipts_item_id             int64
purchase_time       datetime64[ns]
unit_price                 float64
discount                     int64
quantity                     int64
category                  category
sub_category              category
format                    category
yearmonth                 category
dtype: object

#### Solution 2

In [48]:
objectcols = household.select_dtypes(include='object')
household[objectcols.columns] = objectcols.astype('category')
household.dtypes

receipt_id                   int64
receipts_item_id             int64
purchase_time       datetime64[ns]
category                  category
sub_category              category
format                    category
unit_price                 float64
discount                     int64
quantity                     int64
yearmonth                 category
dtype: object

### Extra notes on `category`: Why is it important?

From [pandas documentation](https://pandas.pydata.org/pandas-docs/version/0.15.1/categorical.html), the categorical data type is useful in the following cases:

- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see [here](https://pandas.pydata.org/pandas-docs/version/0.15.1/categorical.html#categorical-memory).
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
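
The memory argument in the first point can be checked directly. The sketch below uses a made-up column with only three distinct strings; the exact byte counts depend on the platform and pandas version, so only the comparison matters.

```python
import pandas as pd

# A long column with only three distinct strings (hypothetical toy data).
formats = pd.Series(['minimarket', 'supermarket', 'hypermarket'] * 10_000)

as_object = formats.astype('object')
as_category = formats.astype('category')

# deep=True also counts the memory held by the strings themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(category_bytes < object_bytes)
```

A categorical column stores each distinct string once plus a small integer code per row, which is why the saving grows with the number of repeated values.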

Now, pay attention to `weekday`. The main difference between `object` and `category` is that a categorical column also stores the set of groups (categories) that each observation can belong to. You can access the categories using `.cat.categories`:

In [49]:
household['weekday'] = household['purchase_time'].dt.day_name().astype('category')
household['weekday']

0           Sunday
1           Sunday
2           Sunday
3          Tuesday
4         Thursday
           ...    
71995    Wednesday
71996    Wednesday
71997    Wednesday
71998     Thursday
71999      Tuesday
Name: weekday, Length: 72000, dtype: category
Categories (7, object): [Friday, Monday, Saturday, Sunday, Thursday, Tuesday, Wednesday]

In [50]:
# .cat.categories: access the categorical values

household['weekday'].cat.categories

Index(['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
       'Wednesday'],
      dtype='object')

We can reorder the categories to follow the true day order by using `.cat.reorder_categories`:

In [51]:
day_order = ['Monday', 'Tuesday', 'Wednesday','Thursday','Friday','Saturday','Sunday']
household['weekday'] = household['weekday'].cat.reorder_categories(day_order)

In [52]:
household['weekday'].cat.categories

Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'],
      dtype='object')

**Why is the order important?**

You might wonder why we should order the levels of our categorical data. Any analysis you run on a categorical column refers to its `categories`, so the order set here is the order in which the levels appear when you present a plot or a table.
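
To see the effect of category order, consider this minimal sketch on a toy series. It uses `.cat.set_categories` with `ordered=True` (rather than `.cat.reorder_categories`) because the toy sample does not contain every day of the week; that choice, and the sample values, are assumptions made for the example.

```python
import pandas as pd

days = pd.Series(['Sunday', 'Tuesday', 'Monday', 'Sunday']).astype('category')
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# set_categories also registers days that do not occur in this small sample;
# ordered=True makes sorting and min/max follow the calendar order.
days = days.cat.set_categories(day_order, ordered=True)

sorted_days = days.sort_values().tolist()
print(sorted_days)               # calendar order, not alphabetical
print(days.min(), days.max())
```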

*End of Day 1*

---

## *Inclass Questions*  

- **Q**: How do we handle dates written in Indonesian?
- **A**: Use the `dateparser` library

```py
import dateparser

dateparser.parse('2015-Februari-09')
pd.Series(['2015-Februari-09','05-Januari-2010']).apply(dateparser.parse)
```

- **Q**: How do we add 2 hours, 6 minutes, and 30 seconds?  
- **A**: `data['col'] + pd.Timedelta(hours = 2, minutes = 6, seconds = 30)`


- **Q**: How do we rename a column?
- **A**: Use `.rename()` to rename columns or index labels:

```py
data.rename(columns={'existing_name' : 'new_name'})
```

- **Q**: How do we group times into several categories?
- **A**: (For more detail, see [Extra Material: Function & Conditional Value](https://drive.google.com/open?id=10YbEAQf2nLbF24r6vPHn2WyxbmvpeuNQ&authuser=0))

```py
# Solution 1: using an if-else condition

## create the function
def hour_grouping(x):
    if x >= 0 and x <= 7:
        return "12am to 7am"
    elif x >= 8 and x <= 15:
        return "8am to 3pm"
    else:
        return "4pm to 11pm"
        
## apply the function to the column
data['hour_group'] = data['hour'].apply(hour_grouping)
```

```py

# Solution 2: using the `select` function from `numpy`
import numpy as np

cond1 = data['day'].isin(["Saturday","Sunday"])
choice1 = "Weekend"

data['day_group']= np.select(condlist = [cond1], choicelist = [choice1], default = "Weekday")
```
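
As a further alternative for the hour-grouping question (not part of the original answers), `pd.cut` can bin numeric values into labelled intervals in one call. The bin edges and labels below are assumptions chosen to cover the hours 0 to 23.

```python
import pandas as pd

# Hypothetical hour values, one at each bin boundary.
hours = pd.Series([0, 7, 8, 15, 16, 23])

# Bins are right-inclusive by default: (-1, 7], (7, 15], (15, 23].
hour_group = pd.cut(hours,
                    bins=[-1, 7, 15, 23],
                    labels=['12am to 7am', '8am to 3pm', '4pm to 11pm'])
print(hour_group.tolist())
```

The result is an ordered categorical column, so it works directly with the cross-tabulation tools introduced in the next chapter.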

# Contingency Tables

One of the simplest tools in the EDA toolkit is the frequency table (contingency table), along with the cross-tabulation table. It is familiar, convenient, and practical for a wide array of statistical tasks. The simplest form of a table displays the counts of a `categorical` column.

In `pandas`, each column of a `DataFrame` is a `Series`. To get the count of each unique level in a categorical column, we can use `.value_counts()`. The resulting object is a `Series` sorted in descending order, so that the most frequent element is on top.

In [55]:
household['sub_category'].value_counts()

Detergent    36000
Sugar        24000
Rice         12000
Name: sub_category, dtype: int64

Try and perform `.value_counts()` on the `format` column, adding either:

- `sort=False` as a parameter to prevent any sorting of elements, or
- `ascending=True` as a parameter to sort in ascending order instead

How do you think each parameter works?

In [68]:
# sort=True (default): the most frequent value comes first
# sort=False: order follows the order of the categories (.cat.categories)

## Your code here
household['sub_category'].value_counts(sort= False)


Detergent    36000
Rice         12000
Sugar        24000
Name: sub_category, dtype: int64

In [65]:
# ascending=True: sort from the smallest count
# ascending=False (default): sort from the largest count

## Your code here
household['sub_category'].value_counts(sort=True, ascending= False)

Detergent    36000
Sugar        24000
Rice         12000
Name: sub_category, dtype: int64

In [73]:
type(household['sub_category'].value_counts())

pandas.core.series.Series
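
The `value_counts()` options discussed above can be compared side by side on a toy series. The values below are made up; `normalize=True` is an additional parameter of `value_counts()` that returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical toy data with three levels.
sub_category = pd.Series(['Detergent', 'Sugar', 'Detergent', 'Rice',
                          'Detergent', 'Sugar'])

descending = sub_category.value_counts()                 # most frequent first
ascending = sub_category.value_counts(ascending=True)    # least frequent first
shares = sub_category.value_counts(normalize=True)       # proportions

print(descending)
print(ascending)
print(shares)
```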

`crosstab` is a very versatile solution to producing frequency tables on a `DataFrame` object. Its utility really goes further than that but we'll start with a simple use-case.

Consider the following code: we use `pd.crosstab()` passing in the values to group by in the rows (`index`) and columns (`columns`) respectively. 

In [75]:
household.head(3)

| | receipt_id | receipts_item_id | purchase_time | category | sub_category | format | unit_price | discount | quantity | yearmonth | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9622257 | 32369294 | 2018-07-22 21:19:00 | Rice | Rice | supermarket | 128000.0 | 0 | 1 | 2018-07 | Sunday |
| 1 | 9446359 | 31885876 | 2018-07-15 16:17:00 | Rice | Rice | minimarket | 102750.0 | 0 | 1 | 2018-07 | Sunday |
| 2 | 9470290 | 31930241 | 2018-07-15 12:12:00 | Rice | Rice | supermarket | 64000.0 | 0 | 3 | 2018-07 | Sunday |


In [81]:
pd.crosstab(index = household['sub_category'], columns = "jumlah_baris")

| sub_category | jumlah_baris |
|---|---|
| Detergent | 36000 |
| Rice | 12000 |
| Sugar | 24000 |


Note that in the code above, we set the rows (index) to be `sub_category`, and the function computes a frequency table by default.

In [84]:
pd.crosstab(index = household['sub_category'],
            columns = "jumlah_baris",
           normalize=True) # convert the frequencies to proportions

| sub_category | jumlah_baris |
|---|---|
| Detergent | 0.5 |
| Rice | 0.1666666666666666 |
| Sugar | 0.3333333333333333 |


In the cell above, `normalize=True` divides each value by the sum of all values in the table. This is equivalent to a manual calculation:

In [87]:
catego = pd.crosstab(index=household['sub_category'], columns="count")

catego / catego.sum()

| sub_category | count |
|---|---|
| Detergent | 0.5 |
| Rice | 0.1666666666666666 |
| Sugar | 0.3333333333333333 |


We can also use the same `crosstab` method to compute a cross-tabulation of two factors. In the following cell, the `index` references the sub-category column while the `columns` references the format column:

In [88]:
pd.crosstab(index = household['sub_category'],
            columns = household['format'])

| sub_category \ format | hypermarket | minimarket | supermarket |
|---|---|---|---|
| Detergent | 2611 | 24345 | 9044 |
| Rice | 999 | 7088 | 3913 |
| Sugar | 1761 | 15370 | 6869 |


In [89]:
pd.crosstab(index = household['sub_category'],
            columns = household['format'],
           normalize = True)

| sub_category \ format | hypermarket | minimarket | supermarket |
|---|---|---|---|
| Detergent | 0.0362638888888888 | 0.338125 | 0.1256111111111111 |
| Rice | 0.013875 | 0.0984444444444444 | 0.0543472222222222 |
| Sugar | 0.0244583333333333 | 0.2134722222222222 | 0.0954027777777777 |


**More on `normalize`:**


In [96]:
pd.crosstab(index = household['sub_category'],
            columns = household['format'],
           normalize = "all").round(2) # equivalent to `normalize=True` (proportions over all values)

| sub_category \ format | hypermarket | minimarket | supermarket |
|---|---|---|---|
| Detergent | 0.04 | 0.34 | 0.13 |
| Rice | 0.01 | 0.1 | 0.05 |
| Sugar | 0.02 | 0.21 | 0.1 |


In [93]:
# each row sums to 100%
## example interpretation: for each product, which store format has the most transactions?
## out of 100% of detergent transactions, most take place in minimarkets (68%)
pd.crosstab(index = household['sub_category'],
            columns = household['format'],
           normalize = "index").round(2)

| sub_category \ format | hypermarket | minimarket | supermarket |
|---|---|---|---|
| Detergent | 0.07 | 0.68 | 0.25 |
| Rice | 0.08 | 0.59 | 0.33 |
| Sugar | 0.07 | 0.64 | 0.29 |


In [97]:
# each column sums to 100%
## example interpretation: for each store format, which product is bought the most?
pd.crosstab(index = household['sub_category'],
            columns = household['format'],
           normalize = "columns").round(2)

| sub_category \ format | hypermarket | minimarket | supermarket |
|---|---|---|---|
| Detergent | 0.49 | 0.52 | 0.46 |
| Rice | 0.19 | 0.15 | 0.2 |
| Sugar | 0.33 | 0.33 | 0.35 |


This is intuitive in a way: We use `crosstab()` which, we recall, computes the count and we pass in `index` and `columns` which correspond to the row and column respectively.
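
The three `normalize` modes shown earlier (`"all"`, `"index"`, `"columns"`) differ only in what counts as 100%. A quick way to convince yourself is to check the sums on a toy table; the data below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy transactions.
toy = pd.DataFrame({
    'sub_category': ['Rice', 'Rice', 'Sugar', 'Sugar', 'Sugar', 'Detergent'],
    'format': ['minimarket', 'supermarket', 'minimarket',
               'minimarket', 'supermarket', 'minimarket'],
})

by_all = pd.crosstab(toy['sub_category'], toy['format'], normalize='all')
by_row = pd.crosstab(toy['sub_category'], toy['format'], normalize='index')
by_col = pd.crosstab(toy['sub_category'], toy['format'], normalize='columns')

print(by_all.values.sum())          # the whole table sums to 1
print(by_row.sum(axis=1).tolist())  # every row sums to 1
print(by_col.sum(axis=0).tolist())  # every column sums to 1
```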

When we add `margins=True` to our method call, then an extra row and column of margins (subtotals) will be included in the output:

In [106]:
pd.crosstab(index = household['sub_category'],
            columns = household['format'],
           margins=True, # margins=True adds an "All" row and column holding the totals
           margins_name="Grand Total") # margins_name: renames the generated "All" label

| sub_category \ format | hypermarket | minimarket | supermarket | Grand Total |
|---|---|---|---|---|
| Detergent | 2611 | 24345 | 9044 | 36000 |
| Rice | 999 | 7088 | 3913 | 12000 |
| Sugar | 1761 | 15370 | 6869 | 24000 |
| Grand Total | 5371 | 46803 | 19826 | 72000 |


The `margins=True` parameter in `pd.crosstab()`:
   - For a frequency table (`pd.crosstab(index, columns)`): returns the `sum`
   - For an aggregation table (`pd.crosstab(index, columns, values, aggfunc)`): returns the `aggfunc` value
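
The two behaviours of `margins=True` can be compared on a small example. This is a sketch on made-up prices: in the frequency table the `All` entries are totals, while in the aggregation table they hold the `aggfunc` (here the mean) computed over the underlying rows.

```python
import pandas as pd

# Hypothetical toy transactions with prices.
toy = pd.DataFrame({
    'sub_category': ['Rice', 'Rice', 'Sugar', 'Sugar', 'Sugar'],
    'format': ['minimarket', 'supermarket', 'minimarket',
               'minimarket', 'supermarket'],
    'unit_price': [60000.0, 64000.0, 12000.0, 13000.0, 12500.0],
})

# Frequency table: "All" holds row and column totals.
counts = pd.crosstab(toy['sub_category'], toy['format'], margins=True)

# Aggregation table: "All" holds the mean of the underlying rows, not a sum.
means = pd.crosstab(toy['sub_category'], toy['format'],
                    values=toy['unit_price'], aggfunc='mean', margins=True)

print(counts)
print(means)
```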

**Knowledge Check:**

Use `pd.crosstab()` to answer following business question:

- Analyze the number of transactions per month for each store `format`. In which month did the lowest number of transactions occur for each `format`?

*Tips:*

- You can use `.idxmin()` on the table you produce to get the index of its lowest value.

---

*Bonus Challenge*

- To increase the number of transactions at `minimarket`, the company plans to hold a *flash sale* at the hour with the fewest transactions. Use the `minimarket` transaction data from its lowest month (recall conditional subsetting!) and create a new column that stores the hour of each transaction. At what hour should the *flash sale* be held?

In [174]:
## Your code below

# note: idxmin() returns the index of the minimum value in each column
pd.crosstab(index=household['yearmonth'],columns=household['format']).idxmin() 

# Bonus challenge
flash_sale = household[(household['format'] == 'minimarket') & (household['yearmonth'] == '2018-03')].copy()
flash_sale['purchase_hour'] = flash_sale['purchase_time'].dt.hour 

pd.crosstab(index = flash_sale['purchase_hour'], columns = 'Count').idxmin()



col_0
Count    1
dtype: int64

## Aggregation Table

In the following section, we introduce another parameter to perform aggregation on our table. The `aggfunc` parameter, when present, requires the `values` parameter to be specified as well. `values` holds the values to aggregate according to the factors in our index and columns:

In [175]:
pd.crosstab(
    index=household['yearmonth'],
    columns=household['format'],
    values = household['unit_price'],
    aggfunc = 'mean'
).round(2)

| yearmonth \ format | hypermarket | minimarket | supermarket |
|---|---|---|---|
| 2017-10 | 27999.69 | 23515.6 | 28357.9 |
| 2017-11 | 27366.44 | 23377.05 | 28014.8 |
| 2017-12 | 26140.6 | 22633.66 | 28794.75 |
| 2018-01 | 26376.15 | 23229.15 | 28976.18 |
| 2018-02 | 29083.82 | 23441.28 | 27142.76 |
| 2018-03 | 28015.66 | 23605.62 | 25725.31 |
| 2018-04 | 28742.96 | 23303.87 | 27629.84 |
| 2018-05 | 26392.83 | 24159.44 | 27184.82 |
| 2018-06 | 26334.65 | 23960.48 | 27152.3 |
| 2018-07 | 26217.67 | 23123.03 | 29395.02 |


**Knowledge Check**: Cross tabulation  

Create a cross-tab using `sub_category` as the index (row) and `format` as the column. Fill the values with the median of `unit_price` across each row and column. Add a subtotal to both the row and column by setting `margins=True`.

1. On average, Sugar is cheapest at...?
2. On average, Detergent is most expensive at...?

Create a new cell for your code and answer the questions above.

In [176]:
## Your code below


## -- Solution code
pd.crosstab(index = household['sub_category'],
            columns = household['format'],
            values= household['unit_price'], 
            aggfunc= 'median',margins=True).round(2)

| sub_category \ format | hypermarket | minimarket | supermarket | All |
|---|---|---|---|---|
| Detergent | 16900.0 | 16800.0 | 16500.0 | 16800.0 |
| Rice | 64000.0 | 62900.0 | 64000.0 | 63500.0 |
| Sugar | 12250.0 | 12500.0 | 12400.0 | 12500.0 |
| All | 15990.0 | 15500.0 | 14907.5 | 15472.5 |


In [183]:
pd.crosstab(index = household['sub_category'],
            columns = household['format'],
            values= household['unit_price'], 
            aggfunc= 'median').round(2).idxmin(axis=1)

sub_category
Detergent    supermarket
Rice          minimarket
Sugar        hypermarket
dtype: object

### Higher-dimensional Tables

If we need to inspect our data in higher resolution, we can create cross-tabulation using more than one factor. This allows us to yield insights on a more granular level yet have our output remain relatively compact and structured:

In [177]:
pd.crosstab(index=household['yearmonth'], 
            columns= [household['format'], household['sub_category']], 
            values=household['unit_price'],
            aggfunc='median')

# pd.crosstab(index =household['yearmonth'], 
#             columns= household['format'], 
#             values=household['unit_price'],
#             aggfunc=['median','sum'])


| yearmonth | hypermarket / Detergent | hypermarket / Rice | hypermarket / Sugar | minimarket / Detergent | minimarket / Rice | minimarket / Sugar | supermarket / Detergent | supermarket / Rice | supermarket / Sugar |
|---|---|---|---|---|---|---|---|---|---|
| 2017-10 | 17400.0 | 64000.0 | 12500.0 | 16800.0 | 62500.0 | 12500.0 | 16925.0 | 64000.0 | 12500.0 |
| 2017-11 | 16770.0 | 64000.0 | 12400.0 | 16800.0 | 62500.0 | 12500.0 | 16500.0 | 64000.0 | 12400.0 |
| 2017-12 | 17500.0 | 64000.0 | 12000.0 | 16600.0 | 62500.0 | 12500.0 | 16600.0 | 64000.0 | 12400.0 |
| 2018-01 | 16800.0 | 64000.0 | 12275.0 | 16200.0 | 62500.0 | 12500.0 | 16700.0 | 64000.0 | 12400.0 |
| 2018-02 | 17500.0 | 64000.0 | 11990.0 | 17000.0 | 63500.0 | 12500.0 | 16200.0 | 64000.0 | 12290.0 |
| 2018-03 | 16900.0 | 64000.0 | 12000.0 | 16300.0 | 63500.0 | 12500.0 | 15680.0 | 64000.0 | 12400.0 |
| 2018-04 | 16815.0 | 64000.0 | 11990.0 | 16800.0 | 63500.0 | 12500.0 | 15700.0 | 64000.0 | 12400.0 |
| 2018-05 | 16950.0 | 64000.0 | 12000.0 | 16800.0 | 63000.0 | 12500.0 | 16700.0 | 64000.0 | 12400.0 |
| 2018-06 | 16550.0 | 64000.0 | 12300.0 | 17300.0 | 63500.0 | 12500.0 | 16700.0 | 64000.0 | 12400.0 |
| 2018-07 | 16550.0 | 64000.0 | 12325.0 | 16800.0 | 63500.0 | 12500.0 | 16600.0 | 64000.0 | 12300.0 |


In `pandas`, a higher-dimensional table like this is called a Multi-Index DataFrame. We will dive deeper into the structure of this object in the next chapter.
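
As a small preview, the sketch below builds a two-factor cross-tabulation and selects from its columns. The `toy` DataFrame is hypothetical: its column names mirror `household`, but the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the real `household` dataset)
toy = pd.DataFrame({
    'yearmonth': ['2018-01', '2018-01', '2018-01', '2018-02', '2018-02', '2018-02'],
    'format': ['minimarket', 'minimarket', 'supermarket',
               'minimarket', 'supermarket', 'supermarket'],
    'sub_category': ['Rice', 'Sugar', 'Rice', 'Rice', 'Rice', 'Sugar'],
    'unit_price': [62500, 12500, 64000, 63500, 64000, 12400],
})

table = pd.crosstab(index=toy['yearmonth'],
                    columns=[toy['format'], toy['sub_category']],
                    values=toy['unit_price'],
                    aggfunc='median')

# The columns now carry two levels: format and sub_category
print(table.columns.names)

# Selecting an outer level returns the sub-table for that format
minimarket_only = table['minimarket']
print(minimarket_only)

# A tuple selects one (format, sub_category) column
supermarket_rice = table[('supermarket', 'Rice')]
print(supermarket_rice)
```
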

## Pivot Tables

If our data is already in a `DataFrame`, `pd.pivot_table` can sometimes be more convenient than `pd.crosstab`.

Fortunately, most of the parameters of `pivot_table()` are the same as those of `pd.crosstab()`. The noticeable difference is an additional `data` parameter, which allows us to specify the `DataFrame` used to construct the pivot table.

We create a `pivot_table` by passing in the following:
- `data`: our `DataFrame`
- `index`: the column to be used as rows
- `columns`: the column to be used as columns
- `values`: the values used to fill in the table
- `aggfunc`: the aggregation function

In [178]:
pd.crosstab(index = household['sub_category'],
            columns = household['format'],
            values= household['unit_price'], 
            aggfunc= 'median')

| sub_category | hypermarket | minimarket | supermarket |
|---|---|---|---|
| Detergent | 16900.0 | 16800.0 | 16500.0 |
| Rice | 64000.0 | 62900.0 | 64000.0 |
| Sugar | 12250.0 | 12500.0 | 12400.0 |


Differences between `crosstab` and `pivot_table`:

   - `data` parameter: the data does not need to be repeated when filling in `index`, `columns`, and so on.
   - `index`/`columns` parameters: no longer both required (to summarise by a single category, passing just one of them is enough).
   - `values` parameter: no longer required; if omitted, `aggfunc` is applied to every numeric column in the data.
   - `aggfunc` parameter: no longer required; by default the mean (`mean`) is computed.
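
The defaults described above can be checked on a small example. This is only a sketch: `toy` is a hypothetical DataFrame with made-up rows. It keeps only numeric columns for the index-only call, on the assumption that recent `pandas` versions may raise an error instead of silently skipping text columns when averaging.

```python
import pandas as pd

# Hypothetical toy data (not the real `household` dataset)
toy = pd.DataFrame({
    'sub_category': ['Rice', 'Rice', 'Sugar', 'Sugar', 'Sugar'],
    'format': ['minimarket', 'supermarket', 'minimarket', 'minimarket', 'supermarket'],
    'unit_price': [62500, 64000, 12500, 12300, 12400],
    'quantity': [1, 2, 3, 1, 2],
})

# crosstab without values/aggfunc counts the rows in each combination
counts = pd.crosstab(index=toy['sub_category'], columns=toy['format'])

# pivot_table with only an index averages every remaining (numeric) column
numeric_only = toy[['sub_category', 'unit_price', 'quantity']]
means = numeric_only.pivot_table(index='sub_category')

# pivot_table with values but no aggfunc still defaults to the mean
price_mean = toy.pivot_table(index='sub_category', values='unit_price')

print(counts)
print(means)
print(price_mean)
```
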

In [165]:
# with pivot_table, the data can be passed as a parameter, or pivot_table can be called as a DataFrame method
pd.pivot_table(
    data = household,
    index = 'sub_category',
    columns = 'format',
    values = 'unit_price',
    aggfunc = 'median'
).round(2)

household.pivot_table(
    index = 'sub_category',
    columns = 'format',
    values = 'unit_price',
    aggfunc = 'median'
)

household[household['discount'] != 0].pivot_table(
    index = 'sub_category',
    columns = 'format',
    values = 'unit_price',
    aggfunc = 'median'
)



| sub_category | hypermarket | minimarket | supermarket |
|---|---|---|---|
| Detergent | 15444.75 | 14900.0 | 14500.0 |
| Rice | 60000.0 | 59500.0 | 60000.0 |
| Sugar | 11990.0 | 11900.0 | 11900.0 |


In [181]:
# show the mean of every numeric column per sub_category
household.pivot_table(
    index='sub_category'
)

# show the total of every numeric column per sub_category
household.pivot_table(
    index='sub_category',
    aggfunc='sum'
)

# show the mean unit_price per sub_category
household.pivot_table(
    index = 'sub_category',
    values = 'unit_price'
)

| sub_category | discount | quantity | receipt_id | receipts_item_id | sales | unit_price |
|---|---|---|---|---|---|---|
| Detergent | 47395824 | 49660 | 275654753656 | 885522501282 | 866068295.98006 | 644176555.7035401 |
| Rice | 10023669 | 15995 | 91801621879 | 294953949760 | 1084398659.009 | 840157755.753 |
| Sugar | 3938557 | 41111 | 183642451566 | 590005122697 | 511914528.0564 | 303481584.57210004 |


**Group Discussion:**

Use either `pd.crosstab()` or `pd.pivot_table()`:

1. What is the total `sales` (created by multiplying the `unit_price` and `quantity` columns) of each `sub_category` on each day? Do weekdays bring in higher sales, or weekends?

2. What about the number of transactions? Do people tend to shop on weekdays or on weekends?

3. Come up with a business question based on the `household` data, and use any EDA method you have learned to answer it.



---

*End of Day 2*

### Inclass Questions

- **Q**: How do we sort by a particular column?
- **A**: Use `.sort_values()`. By default the values are sorted in *ascending* order (smallest to largest); to sort from largest to smallest, add the parameter `ascending=False`.

In [101]:
# household.sort_values('unit_price', ascending=False)

In [184]:
# pd.crosstab(index = household['sub_category'],
#             columns = household['format']).\
# sort_values('hypermarket', ascending=False)
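
A minimal sketch of both sort directions, using a hypothetical three-row DataFrame rather than the full `household` data:

```python
import pandas as pd

# Hypothetical toy data (not the real `household` dataset)
toy = pd.DataFrame({
    'sub_category': ['Rice', 'Sugar', 'Detergent'],
    'unit_price': [63500, 12500, 16800],
})

# Default: smallest to largest
ascending = toy.sort_values('unit_price')

# Largest to smallest
descending = toy.sort_values('unit_price', ascending=False)

print(ascending)
print(descending)
```
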

- **Q**: How does `axis` work in `idxmin()`?
- **A**: By default (`axis=0`), `idxmin()` returns the index label of the smallest value in each column. To find the column holding the smallest value in each row, use `axis=1`.
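
The sketch below illustrates both axes. The `prices` table is typed in by hand from the median `unit_price` cross-tabulation shown earlier, so it runs without the `household` data.

```python
import pandas as pd

# Median unit_price per sub_category and format, copied from the earlier output
prices = pd.DataFrame(
    {'hypermarket': [16900.0, 64000.0, 12250.0],
     'minimarket': [16800.0, 62900.0, 12500.0],
     'supermarket': [16500.0, 64000.0, 12400.0]},
    index=['Detergent', 'Rice', 'Sugar'])

# axis=0 (default): for each column, the row label holding the smallest value
cheapest_row_per_column = prices.idxmin()

# axis=1: for each row, the column label holding the smallest value
cheapest_column_per_row = prices.idxmin(axis=1)

print(cheapest_row_per_column)
print(cheapest_column_per_row)
```
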

## Missing Values and Duplicates

During the data exploration and preparation phase, it is likely we come across some problematic details in our data. This could be the value of _-1_ for the _age_ column, a value of _blank_ for the _customer segment_ column, or a value of _None_ for the _loan duration_ column. All of these are examples of "untidy" data, which is rather common depending on the data collection and recording process in a company.

In `pandas`, we use `NaN` (not a number) to denote missing data; the equivalent for datetime is `NaT`, and the two are essentially compatible with each other. From the docs:
> The choice of using `NaN` internally to denote missing data was largely for simplicity and performance reasons. We are hopeful that NumPy will soon be able to provide a native NA type solution (similar to R) performant enough to be used in pandas.

### Missing Values

When reading data, if missing entries in a column are not truly empty but are filled with a placeholder (for example "Missing", " ", or "-"), use the `na_values` parameter to declare them:

```
data = pd.read_csv("data.csv",na_values = ["Missing"," ","-"])
```

In [78]:
#pd.read_csv("na.csv",na_values=["-"," ","Missing"])
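
To try `na_values` without a file on disk, the sketch below reads a small in-memory CSV. The CSV text is made up for illustration; only the `na_values` list comes from the example above.

```python
import io
import pandas as pd

# Hypothetical CSV content where missing entries use placeholder strings
raw = io.StringIO(
    "category,format,unit_price\n"
    "Rice,minimarket,63500\n"
    "Missing,-,12500\n"
    "Sugar, ,-\n"
)

data = pd.read_csv(raw, na_values=["Missing", " ", "-"])

# Placeholders are now treated as NaN, so unit_price is parsed as a number
print(data)
print(data.isna().sum())
```
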

In [53]:
household = pd.read_csv("data_input/sample_household.csv",
                        index_col = 'receipts_item_id',
                       parse_dates=['purchase_time'])
household.head()

Notice from the output that between rows 3 and 8 there are at least a few rows with missing data. We can use `isna()` and `notna()` to detect missing values. Try them out in the cells below:

In [10]:
# return the number of rows with missing values


In [11]:
# return the number of rows WITHOUT missing values
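
One possible way to count missing and non-missing values is sketched here on a hypothetical `toy` DataFrame (made-up rows, not the real `household` data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries
toy = pd.DataFrame({
    'category': ['Rice', np.nan, 'Sugar', np.nan],
    'unit_price': [63500.0, np.nan, 12500.0, 16800.0],
    'weekday': ['Tuesday', np.nan, np.nan, 'Sunday'],
})

# isna() returns a table of True/False; summing counts the True values per column
na_per_column = toy.isna().sum()

# notna() is the opposite
notna_per_column = toy.notna().sum()

# Number of rows that contain at least one missing value
rows_with_any_na = toy.isna().any(axis=1).sum()

print(na_per_column)
print(notna_per_column)
print(rows_with_any_na)
```
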


A common way of using the `.isna()` method is to combine it with the subsetting methods we've learned in previous lessons:

In [12]:
# return the rows that contain NA values


In [13]:
# return the rows where the weekday column is NA


Go ahead and use `notna()` to extract all the rows where the `weekday` column is not missing:

In [14]:
# return the rows where the weekday column is NOT NA
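
The subsetting patterns above could look like the following sketch, again on a hypothetical `toy` DataFrame rather than the real `household` data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries
toy = pd.DataFrame({
    'category': ['Rice', np.nan, 'Sugar', np.nan],
    'unit_price': [63500.0, np.nan, 12500.0, 16800.0],
    'weekday': ['Tuesday', np.nan, np.nan, 'Sunday'],
})

# Rows with at least one NA in any column
rows_any_na = toy[toy.isna().any(axis=1)]

# Rows where weekday is NA
weekday_na = toy[toy['weekday'].isna()]

# Rows where weekday is NOT NA
weekday_ok = toy[toy['weekday'].notna()]

print(rows_any_na)
print(weekday_na)
print(weekday_ok)
```
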


### Missing Values Treatment

Once you've identified the missing values, there are 3 common ways to deal with them:

- Use `dropna` with a reasonable threshold to remove any rows that contain too few values to be helpful to your analysis
- Replace the missing values with a central value (mean or median)
- Imputation through a predictive model
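
As an illustration of the second option, the sketch below fills missing prices with the median or the mean of the observed values. The `prices` Series is hypothetical. Which central value is appropriate depends on the data; the median is less affected by extreme values.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing entries
prices = pd.Series([12500.0, np.nan, 12400.0, 64000.0, np.nan], name='unit_price')

# median() and mean() skip NaN by default
filled_median = prices.fillna(prices.median())
filled_mean = prices.fillna(prices.mean())

print(filled_median)
print(filled_mean)
```
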

#### NA Deletion

When we are certain that the rows with `NA`s can be safely dropped, we can use `dropna()`, optionally specifying a threshold. By default, this method drops the row if any NA value is present (`how='any'`), but it can be set to do this only when all values are NA in that row (`how='all'`).

```
    # drops row if all values are NA
    household.dropna(how='all')
    
    # drops row if it doesn't have at least 5 non-NA values
    household.dropna(thresh=5) 
```
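
To see how these options differ, here is a sketch on a hypothetical four-row `toy` DataFrame (made-up values). The `subset` parameter, not mentioned above, restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 1 is entirely NA
toy = pd.DataFrame({
    'category': ['Rice', np.nan, 'Sugar', np.nan],
    'format': ['minimarket', np.nan, np.nan, np.nan],
    'unit_price': [63500.0, np.nan, 12500.0, 16800.0],
})

drop_any = toy.dropna()                          # default how='any'
drop_all = toy.dropna(how='all')                 # only fully empty rows are dropped
drop_thresh = toy.dropna(thresh=2)               # keep rows with at least 2 non-NA values
drop_subset = toy.dropna(subset=['unit_price'])  # only look at unit_price

print(len(drop_any), len(drop_all), len(drop_thresh), len(drop_subset))
```
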

#### NA Imputation

Other common methods for working with missing values are demonstrated in the following section. We make a copy of the NA-included DataFrame and name it `household2`:

In [97]:
household2 = household.copy()
household2.head(1)

The following cell is deliberately repetitive, even verbose, to give us an idea of the different options we can pick from.

You may observe, for example, that the following two calls are functionally identical:
- `.fillna(0)`
- `.replace(np.nan, 0)`
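
A quick check of that equivalence on a hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

with_fillna = s.fillna(0)
with_replace = s.replace(np.nan, 0)

# equals() compares values and dtype
print(with_fillna.equals(with_replace))
```
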

In [116]:
# replace NA in category, format and discount with 'Missing'
household2[['category', 'format','discount']] = household2[['category', 'format','discount']].fillna('Missing')

# replace NA unit_price with 0
household2.unit_price = household2.unit_price.fillna(0)

# fill NA purchase_time with the next valid value (backward fill);
# bfill() is applied to the column itself, not to the whole DataFrame
household2.purchase_time = pd.to_datetime(household2.purchase_time)
household2.purchase_time = household2.purchase_time.bfill()

# replace NA quantity with -1
household2.quantity = household2.quantity.fillna(-1)

household2.head()

| receipts_item_id | purchase_time | category | format | unit_price | discount | quantity | weekday |
|---|---|---|---|---|---|---|---|
| 32000000 | 2018-07-17 18:05:00 | Missing | Missing | 0.0 | Missing | -1.0 | NaN |
| 32000001 | 2018-07-17 18:05:00 | Missing | Missing | 0.0 | Missing | -1.0 | NaN |
| 32030785 | 2018-07-17 18:05:00 | Rice | minimarket | 63500.0 | 0.0 | 1.0 | Tuesday |
| 32000002 | 2018-07-22 21:19:00 | Missing | Missing | 0.0 | Missing | -1.0 | NaN |
| 32000003 | 2018-07-22 21:19:00 | Missing | Missing | 0.0 | Missing | -1.0 | NaN |
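
To make the backward fill used for `purchase_time` more concrete, the sketch below compares it with a forward fill on a hypothetical datetime Series. The two timestamps are borrowed from the output above; the gaps between them are made up.

```python
import pandas as pd

# Hypothetical purchase times with two gaps (None becomes NaT)
times = pd.Series(pd.to_datetime(
    ['2018-07-17 18:05', None, None, '2018-07-22 21:19']))

# Backward fill: each gap takes the NEXT valid value
back = times.bfill()

# Forward fill: each gap takes the PREVIOUS valid value
forward = times.ffill()

print(back)
print(forward)
```
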


### Duplicates

To check whether our data has any duplicates, we can use the `.duplicated()` method.

When duplicated observations have been recorded, we can use `.drop_duplicates()`, specifying through its `keep` parameter whether the first or the last occurrence should be kept:

In [137]:
print(household2.shape)
print(household2.drop_duplicates().shape)

(20, 7)
(16, 7)
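
The sketch below shows `.duplicated()` and the `keep` options of `.drop_duplicates()` on a hypothetical `toy` DataFrame with made-up rows:

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 3 repeat row 0
toy = pd.DataFrame({
    'category': ['Rice', 'Rice', 'Sugar', 'Rice'],
    'unit_price': [63500, 63500, 12500, 63500],
})

# True for every row that repeats an earlier row
mask = toy.duplicated()

keep_first = toy.drop_duplicates(keep='first')  # default: keep the first occurrence
keep_last = toy.drop_duplicates(keep='last')    # keep the last occurrence
keep_none = toy.drop_duplicates(keep=False)     # drop every row that has a duplicate

print(mask)
print(keep_first)
print(keep_last)
print(keep_none)
```
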


**Knowledge Check:**   

Duplicates may mean different things from a data point of view and from a business analyst's point of view. You want to be extra careful about whether the duplicates are an intended characteristic of your data, or whether they violate the business logic.

Would you drop the "duplicated" values in the following cases?

   - a. A medical center collects anonymized heart rate monitoring data from patients. It has duplicate observations collected across a span of 3 months.
   - b. An insurance company uses machine learning to deliver dynamic pricing to its customers. Each row contains the customer's name, occupation / profession and historical health data. It has duplicate observations collected across a span of 3 months.


A key difference between `crosstab` and `pivot_table` is that `crosstab` uses `len` (or `count`) as the default aggregation function while `pivot_table` uses the mean. Copy the code from the cell above and make one change: use `sum` as the aggregation function instead: