## 1. Introduction
Data wrangling is the process of transforming and mapping data from a "raw" form into another format that is more appropriate and valuable for downstream purposes such as analytics. This report illustrates how the `EPISODE_201314`, `EPISODE_201415`, and `EPISODE_201516` cancer datasets are wrangled using `pandas`. The goal of data wrangling is to make the structure of the data clean and consistent and to eliminate redundant and erroneous values, so that the dataset is easier to analyze and visualize. The data wrangling process includes three steps:
- Step 1: Gather all of the datasets and combine them into one, making some adjustments along the way.
- Step 2: Assess the dataset visually and programmatically, and identify quality issues, tidiness issues, and potential issues that will not be resolved in this document.
- Step 3: Clean all of the identified issues and check the result. The cleaned dataset is stored in a separate `.csv` file. Note that the issues are not necessarily resolved in the order listed in Step 2, because it is more convenient to resolve some issues before others.

## 2. Methodology
### 2.1 Data Quality Principle
Before we begin the data wrangling, the principles of good data quality and tidiness should be clarified. _Quality_ issues pertain to the content of data. Good quality has four dimensions:
- Completeness: The dataset has no missing rows, columns, or cells.
- Validity: The dataset should conform to a schema, that is, a set of rules related to real-world and table-specific constraints.
- Accuracy: The dataset should not contain valid but wrong data.
- Consistency: The format representing the data should be standard.
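
Three of these dimensions can be probed with short `pandas` checks; accuracy generally cannot, because it needs an external source of truth. The sketch below uses a tiny made-up frame. The values, the allowed set `{1, 2}` for `SEX`, and the `DDMMMYYYY` date pattern are assumptions chosen for illustration, not rules taken from the dataset documentation.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'SEX': [1, 2, 3, 1],
    'ADM_DT': ['10JUL2013', '02SEP2013', None, '2013-09-07'],
})

# Completeness: count missing cells per column
missing_per_column = toy.isna().sum()

# Validity: rows whose SEX falls outside the assumed allowed set {1, 2}
invalid_sex = toy[~toy['SEX'].isin([1, 2])]

# Consistency: non-missing dates that do not follow the assumed DDMMMYYYY pattern
pattern = r'^\d{2}[A-Z]{3}\d{4}$'
follows_pattern = toy['ADM_DT'].str.match(pattern, na=False)
inconsistent_dates = toy[toy['ADM_DT'].notna() & ~follows_pattern]

print(missing_per_column)
print(invalid_sex)
print(inconsistent_dates)
```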

### 2.2 Data Tidiness Principle
_Tidiness_ issues pertain to the structure of data. Tidy data satisfies three criteria:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
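
As a small illustration of the first two criteria, the sketch below reshapes a made-up wide table, in which the fiscal year is spread across column headers, into a tidy one. The table and the column name `EPISODES` are hypothetical and do not come from the `EPISODE` files.

```python
import pandas as pd

# Made-up wide table: one column per fiscal year (values are invented)
wide = pd.DataFrame({
    'CARE_TYP': ['4', '9'],
    '13/14': [10, 2],
    '14/15': [12, 3],
})

# After melting, each variable is a column and each observation is a row
tidy = wide.melt(id_vars='CARE_TYP', var_name='FIS_YEAR', value_name='EPISODES')
print(tidy)
```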

## 3. Data Gathering

In [1]:
# import the package needed for analysis and visualizations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# set the option to see all of columns in this notebook
pd.set_option('display.max_columns', 200)

In [2]:
# read the csv file and combine the dataset
df_1314 = pd.read_csv('EPISODE_201314.csv')
df_1415 = pd.read_csv('EPISODE_201415.csv')
df_1516 = pd.read_csv('EPISODE_201516.csv')

df_1314['FIS_YEAR'] = '13/14'
df_1415['FIS_YEAR'] = '14/15'
df_1516['FIS_YEAR'] = '15/16'

df = pd.concat([df_1314, df_1415, df_1516], ignore_index=True)
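
The same tag-then-concatenate pattern can be written more compactly with a dictionary keyed by fiscal year. This is only a sketch: the three tiny frames below are invented stand-ins for the yearly CSV files so that the example runs without them.

```python
import pandas as pd

# Invented stand-ins for the three yearly files
frames = {
    '13/14': pd.DataFrame({'UNIQUE_KEY': [1, 2]}),
    '14/15': pd.DataFrame({'UNIQUE_KEY': [3]}),
    '15/16': pd.DataFrame({'UNIQUE_KEY': [4, 5]}),
}

# Tag each frame with its fiscal year, then stack them with a fresh index
combined = pd.concat(
    [frame.assign(FIS_YEAR=year) for year, frame in frames.items()],
    ignore_index=True,
)
print(combined)
```

Using `ignore_index=True` avoids duplicate index labels, since each source frame starts counting from 0.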

## 4. Data Assessment
### 4.1 Visual Assessment

In [3]:
df.head()

Unnamed: 0,USI,UNIQUE_KEY,SEX,DATE_OF_BIRTH,ADM_DT,SEP_DT,CARE_TYP,ADM_TYP,ADM_SRC,FIS_YEAR
0,,6387796,2,10JUL2013,10JUL2013,10JUL2013,4,Y,Y,13/14
1,,6401408,2,02SEP2013,02SEP2013,02SEP2013,4,Y,Y,13/14
2,,6402862,1,01JAN1910,07SEP2013,07SEP2013,4,C,H,13/14
3,,6405955,1,26NOV1982,12SEP2013,12SEP2013,10,K,K,13/14
4,,6405894,1,10JUN1941,16SEP2013,17SEP2013,10,K,K,13/14


In [4]:
df.tail()

Unnamed: 0,USI,UNIQUE_KEY,SEX,DATE_OF_BIRTH,ADM_DT,SEP_DT,CARE_TYP,ADM_TYP,ADM_SRC,FIS_YEAR
261628,2729079.0,6755051,1,09FEB1938,30JUN2016,07JUL2016,4,P,T,15/16
261629,2729200.0,6755246,1,14JAN1955,30JUN2016,30JUN2016,4,P,H,15/16
261630,2729527.0,6756100,2,06JUL1938,24JUN2016,24JUN2016,4,P,T,15/16
261631,2729527.0,6756101,2,06JUL1938,27JUN2016,27JUN2016,4,P,T,15/16
261632,2729527.0,6756102,2,06JUL1938,29JUN2016,29JUN2016,4,P,T,15/16


In [5]:
df.sample(10)

Unnamed: 0,USI,UNIQUE_KEY,SEX,DATE_OF_BIRTH,ADM_DT,SEP_DT,CARE_TYP,ADM_TYP,ADM_SRC,FIS_YEAR
114526,816390.0,6565865,1,06JAN1938,08AUG2014,08AUG2014,4,P,H,14/15
202588,893526.0,6675835,2,18JAN1940,29SEP2015,30SEP2015,4,C,H,15/16
151874,2408116.0,6597884,1,11OCT1970,09DEC2014,11DEC2014,4,P,H,14/15
131950,1897807.0,6614082,1,15JAN1960,12FEB2015,13FEB2015,4,C,H,14/15
242302,2470141.0,6752010,2,08SEP1956,21JUN2016,21JUN2016,4,P,H,15/16
221879,1927121.0,6731168,1,20MAR1956,12APR2016,12APR2016,4,P,H,15/16
5728,6078.0,6427896,1,03MAY1944,24OCT2013,24OCT2013,4,X,H,13/14
128136,1175104.0,6622096,2,28APR1974,13MAR2015,13MAR2015,4,C,H,14/15
145359,2183442.0,6585745,1,14JAN1948,23OCT2014,25OCT2014,4,O,H,14/15
61111,2176463.0,6551505,1,03MAR1986,22JUN2014,24JUN2014,4,C,H,13/14


From the visual assessment, some issues are found:
- Firstly, `DATE_OF_BIRTH`, `ADM_DT`, and `SEP_DT` are stored as strings such as `10JUL2013` rather than as `datetime` values.
- Secondly, the data type of `USI` is `float` rather than `int`.
- Thirdly, `DATE_OF_BIRTH` and `SEP_DT` contain `NaN` values.

It should be noted that although `USI` also contains `NaN` values, these arise because some clinics record data per clinic rather than per individual, so they should not be seen as a quality issue in themselves. However, since only 87 of the 261,633 rows lack a `USI` (see the `df.info()` output in Section 4.2), I will drop them to simplify further analysis.

### 4.2 Programmatic Assessment

In [6]:
# get the information of the table
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261633 entries, 0 to 261632
Data columns (total 10 columns):
USI              261546 non-null float64
UNIQUE_KEY       261633 non-null int64
SEX              261633 non-null int64
DATE_OF_BIRTH    261629 non-null object
ADM_DT           261633 non-null object
SEP_DT           261625 non-null object
CARE_TYP         261633 non-null object
ADM_TYP          261633 non-null object
ADM_SRC          261633 non-null object
FIS_YEAR         261633 non-null object
dtypes: float64(1), int64(2), object(7)
memory usage: 20.0+ MB


In [7]:
# check the duplication in the dataset
df.duplicated().sum()

0

In [8]:
# check the value of `SEX`
df.SEX.value_counts()

1    139931
2    121702
Name: SEX, dtype: int64

In [9]:
# check the value of `CARE_TYP`
df.CARE_TYP.value_counts()

4     248406
9       4262
5A      3095
6       2309
5S      2061
8       1176
2        144
R1       129
10        50
1          1
Name: CARE_TYP, dtype: int64

In [10]:
# check the value of `ADM_TYP`
df.ADM_TYP.value_counts()

P    110767
C     89516
X     32525
L     21303
O      4482
S      2965
K        50
M        17
Y         8
Name: ADM_TYP, dtype: int64

In [11]:
# check the value of `ADM_SRC`
df.ADM_SRC.value_counts()

H    243166
T     14031
S      2965
N      1227
B       117
A        69
K        50
Y         8
Name: ADM_SRC, dtype: int64

It seems that all of the issues were already identified in _Section 4.1_. Since the numeric columns (`USI`, `UNIQUE_KEY`, and `SEX`) hold identifiers and category codes rather than measurements, there is no need to use the `.describe()` method.

### 4.3 Issue Summary
- `USI`, `DATE_OF_BIRTH` and `SEP_DT` contain `NaN` values.
- The data type of `DATE_OF_BIRTH`, `ADM_DT`, and `SEP_DT` is `object` (string) rather than `datetime`.
- The data type of `USI` is `float` rather than `int`.


## 5. Data Cleaning
### 5.1 Issue 1: `USI`, `DATE_OF_BIRTH`, and `SEP_DT` contain `NaN` values.
#### Define
Drop the rows that have null values in `USI`, `DATE_OF_BIRTH`, or `SEP_DT`.

#### Code

In [12]:
df_1 = df.copy()
df_1.dropna(inplace=True)
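
Calling `dropna()` without arguments removes a row if any column is missing. In this dataset the other columns have no missing values (per `df.info()` in Section 4.2), so the result is the same, but naming the columns makes the intent explicit. The sketch below shows the difference on an invented frame; the values are made up.

```python
import numpy as np
import pandas as pd

# Invented rows: row 1 lacks USI, row 2 lacks only ADM_SRC
toy = pd.DataFrame({
    'USI': [98.0, np.nan, 112.0],
    'DATE_OF_BIRTH': ['07APR1962', '10JUL2013', '08MAR1939'],
    'SEP_DT': ['15NOV2013', '10JUL2013', '27MAR2014'],
    'ADM_SRC': ['H', 'Y', None],
})

# Only the listed columns decide whether a row is dropped
cleaned = toy.dropna(subset=['USI', 'DATE_OF_BIRTH', 'SEP_DT'])
dropped = len(toy) - len(cleaned)
print(cleaned)
print('rows dropped:', dropped)
```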

#### Test

In [13]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 261534 entries, 21 to 261632
Data columns (total 10 columns):
USI              261534 non-null float64
UNIQUE_KEY       261534 non-null int64
SEX              261534 non-null int64
DATE_OF_BIRTH    261534 non-null object
ADM_DT           261534 non-null object
SEP_DT           261534 non-null object
CARE_TYP         261534 non-null object
ADM_TYP          261534 non-null object
ADM_SRC          261534 non-null object
FIS_YEAR         261534 non-null object
dtypes: float64(1), int64(2), object(7)
memory usage: 21.9+ MB


### 5.2 Issue 2: The data type of `DATE_OF_BIRTH`, `ADM_DT`, and `SEP_DT` is `object` rather than `datetime`.
#### Define
Convert the data type of `DATE_OF_BIRTH`, `ADM_DT`, and `SEP_DT` to `datetime`.
#### Code

In [14]:
from datetime import datetime

# map the three-letter month abbreviation to its two-digit number
MONTHS = {
    'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
    'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
    'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12',
}

def convert_month(month):
    # raises KeyError for an unrecognized abbreviation instead of returning None
    return MONTHS[month]

def convert_datetime(x):
    # split a string such as '10JUL2013' into day, month, and year
    day = x[:2]
    month = convert_month(x[2:5])
    year = x[-4:]
    date_str = year + '-' + month + '-' + day
    return datetime.strptime(date_str, '%Y-%m-%d')

df_2 = df_1.copy()
for col in ['DATE_OF_BIRTH', 'ADM_DT', 'SEP_DT']:
    df_2[col] = df_2[col].apply(convert_datetime)
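
The manual parsing above could also be expressed with `pd.to_datetime` and an explicit format string, which is vectorized and avoids the month lookup. This is an alternative sketch rather than the method used in this report; it assumes an English locale so that `%b` matches abbreviations such as `JUL`, and the sample strings are taken from the `df.head()` and `df.tail()` outputs.

```python
import pandas as pd

# Sample strings in the DDMMMYYYY layout used by the dataset
dates = pd.Series(['10JUL2013', '01JAN1910', '30JUN2016'])

# %d = day, %b = abbreviated month name, %Y = four-digit year
parsed = pd.to_datetime(dates, format='%d%b%Y')
print(parsed)
```

A string that does not match the format raises an error by default, which surfaces malformed dates rather than hiding them.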

#### Test

In [15]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 261534 entries, 21 to 261632
Data columns (total 10 columns):
USI              261534 non-null float64
UNIQUE_KEY       261534 non-null int64
SEX              261534 non-null int64
DATE_OF_BIRTH    261534 non-null datetime64[ns]
ADM_DT           261534 non-null datetime64[ns]
SEP_DT           261534 non-null datetime64[ns]
CARE_TYP         261534 non-null object
ADM_TYP          261534 non-null object
ADM_SRC          261534 non-null object
FIS_YEAR         261534 non-null object
dtypes: datetime64[ns](3), float64(1), int64(2), object(4)
memory usage: 21.9+ MB


### 5.3 Issue 3: The data type of `USI` is `float` rather than `int`.
#### Define
Convert the data type of `USI` from `float` to `int`.
#### Code

In [16]:
df_3 = df_2.copy()
df_3.USI = df_3.USI.astype('int')
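
The conversion to `int` only works here because the rows with a missing `USI` were dropped in Section 5.1. Section 4.1 notes that a missing `USI` is not really an error, so if those rows ever need to be kept, the nullable `Int64` dtype in `pandas` is one option. The sketch below uses invented values to show the idea; it is not part of the cleaning performed in this report.

```python
import numpy as np
import pandas as pd

# Invented USI values, one of them missing
usi = pd.Series([98.0, np.nan, 2729527.0])

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>
usi_nullable = usi.astype('Int64')
print(usi_nullable)
```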

#### Test

In [17]:
df_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 261534 entries, 21 to 261632
Data columns (total 10 columns):
USI              261534 non-null int64
UNIQUE_KEY       261534 non-null int64
SEX              261534 non-null int64
DATE_OF_BIRTH    261534 non-null datetime64[ns]
ADM_DT           261534 non-null datetime64[ns]
SEP_DT           261534 non-null datetime64[ns]
CARE_TYP         261534 non-null object
ADM_TYP          261534 non-null object
ADM_SRC          261534 non-null object
FIS_YEAR         261534 non-null object
dtypes: datetime64[ns](3), int64(3), object(4)
memory usage: 21.9+ MB


### 5.4 Summary
All of the issues have been resolved. The characteristics of the cleaned table are shown below:

In [18]:
df_clean = df_3.copy()

In [19]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 261534 entries, 21 to 261632
Data columns (total 10 columns):
USI              261534 non-null int64
UNIQUE_KEY       261534 non-null int64
SEX              261534 non-null int64
DATE_OF_BIRTH    261534 non-null datetime64[ns]
ADM_DT           261534 non-null datetime64[ns]
SEP_DT           261534 non-null datetime64[ns]
CARE_TYP         261534 non-null object
ADM_TYP          261534 non-null object
ADM_SRC          261534 non-null object
FIS_YEAR         261534 non-null object
dtypes: datetime64[ns](3), int64(3), object(4)
memory usage: 21.9+ MB


In [20]:
df_clean.head()

Unnamed: 0,USI,UNIQUE_KEY,SEX,DATE_OF_BIRTH,ADM_DT,SEP_DT,CARE_TYP,ADM_TYP,ADM_SRC,FIS_YEAR
21,98,6447017,2,1962-04-07,2013-11-11,2013-11-15,4,L,H,13/14
22,98,6554513,2,1962-04-07,2014-06-27,2014-06-27,4,C,H,13/14
23,112,6518907,2,1939-03-08,2014-03-27,2014-03-27,4,L,H,13/14
24,131,6489769,2,1948-10-20,2014-02-11,2014-02-11,4,L,H,13/14
25,156,6387322,2,1951-05-01,2013-07-07,2013-08-11,4,X,H,13/14


## 6. Data Storage

In [21]:
# store the data to the .csv file
df_clean.to_csv('episode_clean.csv', index=False)
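
Note that a CSV file stores only text, so the `datetime` columns come back as strings unless they are parsed again on load. The sketch below demonstrates this round trip on an invented one-row frame, using an in-memory buffer in place of `episode_clean.csv`.

```python
import io
import pandas as pd

# Invented one-row frame with a datetime column
clean = pd.DataFrame({
    'USI': [98],
    'ADM_DT': pd.to_datetime(['2013-11-11']),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Without parse_dates the column is read back as plain strings
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates the datetime dtype is restored
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=['ADM_DT'])

print(plain.dtypes)
print(restored.dtypes)
```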

## 7. Conclusion
This report records the whole wrangling process of the `EPISODE` dataset, from gathering to storage. During this process, three quality issues have been identified and resolved. Since data wrangling in industry is an iterative rather than a one-step process, this wrangling is an initial pass and cannot guarantee that all potential issues are resolved. This document will be updated when other issues are found during analysis and visualization.