# Cleaning Data (ETL)
----
- ETL: Extract the required data, Transform it into the required format, and Load it into another DataFrame.
1.	It is often said that 80% of data analysis is spent on cleaning and preparing the data (Dasu and Johnson 2003).
2.	Data preparation is not just a first step; it must be repeated many times over the course of an analysis as new problems come to light or new data is collected.
3.	The principles of **tidy** data provide a standard way to organize data values within a dataset. See the [tidy data white paper](https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf).
4.	Some common problems are:

    a.	Inconsistent column names
    
    b.	Missing data
    
    c.	Outliers
    
    d.	Duplicate rows
    
    e.	Untidy data
    
    f.	Column data types
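
Several of the problems listed above can be checked with a few pandas calls. The following is a minimal sketch, not a complete cleaning recipe: the `raw` DataFrame and its column names are invented purely for illustration, and outliers and untidy layouts are left out.

```python
import pandas as pd

# Hypothetical messy data, invented only for this illustration
raw = pd.DataFrame({
    " Name ": ["Ann", "Bob", "Bob", "Cid"],
    "AGE": ["31", "45", "45", None],
    "Score": [88.0, 92.5, 92.5, 990.0],
})

# a. Inconsistent column names: strip whitespace and lower-case them
raw.columns = raw.columns.str.strip().str.lower()

# b. Missing data: count missing values per column
missing_per_column = raw.isna().sum()

# d. Duplicate rows: count them, then drop them
n_duplicates = int(raw.duplicated().sum())
clean = raw.drop_duplicates().reset_index(drop=True)

# f. Column data types: 'age' holds text, so convert it to numbers
clean["age"] = pd.to_numeric(clean["age"])

print(list(clean.columns))
print(missing_per_column)
print(n_duplicates)
print(clean.dtypes)
```

Which checks matter, and in what order, depends on the dataset; this only shows one plausible sequence.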


## 1. Melt (unpivoting data)

1.	Melting helps data follow the principles of **tidy** data:

    a.	Columns represent separate variables
    
    b.	Rows represent individual observations (i.e. 1 row = 1 observation)
    
    c.	Observational units form tables
    
2.	The `melt` function "unpivots" a DataFrame from `wide format` to `long format`.
3.	The exercises below show this in practice.


### 1.1 Exercise 1: Wide format to long format

In [2]:
# create a dataframe 
import pandas as pd 
df = pd.DataFrame({'A':{0:'a',1:'b',2:'c'},
                   'B':{0:1,1:3,2:5},
                   'C':{0:2,1:4,2:6},
                   'D':{0:7,1:9,2:11},
                   'E':{0:8,1:10,2:12}})
df

Unnamed: 0,A,B,C,D,E
0,a,1,2,7,8
1,b,3,4,9,10
2,c,5,6,11,12


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 5 columns):
A    3 non-null object
B    3 non-null int64
C    3 non-null int64
D    3 non-null int64
E    3 non-null int64
dtypes: int64(4), object(1)
memory usage: 144.0+ bytes


- The DataFrame has 5 columns and 3 rows

In [4]:
df.melt?

In [7]:
# convert dataframe using melt
df_melt = df.melt(id_vars = ["A"],
        value_vars=["B","C","D","E"],
        var_name = "my_var",
        value_name ="my_val")
df_melt

Unnamed: 0,A,my_var,my_val
0,a,B,1
1,b,B,3
2,c,B,5
3,a,C,2
4,b,C,4
5,c,C,6
6,a,D,7
7,b,D,9
8,c,D,11
9,a,E,8
10,b,E,10
11,c,E,12


In [9]:
df_melt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
A         12 non-null object
my_var    12 non-null object
my_val    12 non-null int64
dtypes: int64(1), object(2)
memory usage: 368.0+ bytes
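
The jump from 3 rows to 12 rows follows from how melting works: every original row produces one long-format row per melted column. The sketch below rebuilds the same DataFrame and checks that count. It also relies on the fact that, when `value_vars` is omitted, `melt` unpivots every column not named in `id_vars`; the name `expected_rows` is only illustrative.

```python
import pandas as pd

# Same data as in the exercise, written with plain lists
df = pd.DataFrame({"A": ["a", "b", "c"],
                   "B": [1, 3, 5],
                   "C": [2, 4, 6],
                   "D": [7, 9, 11],
                   "E": [8, 10, 12]})

# value_vars is omitted, so all columns except "A" are melted
df_melt = df.melt(id_vars=["A"], var_name="my_var", value_name="my_val")

# One long row per (original row, melted column) pair: 3 rows * 4 columns
expected_rows = len(df) * (df.shape[1] - 1)

print(df_melt.shape)
print(expected_rows)
```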


### 1.2 Exercise 2: Columns containing values instead of variables (melt())

In [11]:
import os 

In [12]:
os.chdir("C:\\Users\\Hi\\Google Drive\\01 DS ML DL NLP and AI With Python Lab Copy\\02 Lab Data\\Python")

In [12]:
# 01. import required module and load csv file
import pandas as pd
df = pd.read_csv("Py_cleaning_Tidydata.csv")
# 02. Observe data before melting
df


Unnamed: 0,name,treatment a,treatment b
0,John Smith,,2
1,Jane Doe,16.0,11
2,Mary Johnson,3.0,1


In [13]:
# 03. melt and print data
df_melt = pd.melt(frame = df,
                  id_vars = 'name', 
                  value_vars = ['treatment a','treatment b'],
                  var_name = 'treatment',
                  value_name = 'result')
# 04. Observe data after melting
df_melt


Unnamed: 0,name,treatment,result
0,John Smith,treatment a,
1,Jane Doe,treatment a,16.0
2,Mary Johnson,treatment a,3.0
3,John Smith,treatment b,2.0
4,Jane Doe,treatment b,11.0
5,Mary Johnson,treatment b,1.0
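
If the CSV file is not at hand, the exercise can be reproduced with a DataFrame typed in by hand. This sketch assumes the data is exactly what the output above displays. It also shows why `treatment b` appears as `2.0`, `11.0`, `1.0` after melting: an integer column and a float column are stacked into one `result` column, which ends up as float.

```python
import numpy as np
import pandas as pd

# Rebuilt by hand from the displayed output; the CSV itself is not read here
df = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment a": [np.nan, 16.0, 3.0],   # float column with one missing value
    "treatment b": [2, 11, 1],            # integer column
})

df_melt = pd.melt(frame=df,
                  id_vars="name",
                  value_vars=["treatment a", "treatment b"],
                  var_name="treatment",
                  value_name="result")

print(df_melt)
print(df_melt["result"].dtype)  # both treatments now share a single dtype
```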


## 2. Pivot (un-melting data)

- `Opposite of melting`
- In melting, we turned columns into rows
- Pivoting: turn unique values into separate columns
- Pivoting helps when data `violates the tidy data principle` that rows contain observations:
    - for example, when multiple variables are stored in the same column
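
To make the "opposite of melting" idea concrete, here is a small round-trip sketch: a wide table is melted and then pivoted back. The data and the names `wide_df`, `long_df` and `back` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Hypothetical wide table, invented for this illustration
wide_df = pd.DataFrame({"name": ["Ann", "Bob"],
                        "x": [1, 2],
                        "y": [3, 4]})

# Melt: columns x and y become rows
long_df = wide_df.melt(id_vars="name", var_name="variable", value_name="value")

# Pivot: the unique values of "variable" become columns again
back = long_df.pivot(index="name", columns="variable", values="value")

# Move "name" out of the index and drop the leftover columns label
back = back.reset_index()
back.columns.name = None

print(long_df)
print(back)
```

The `reset_index()` step is optional; it is used here only so the pivoted result has the same layout as the starting table.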

### 2.1 Exercise 3: Pivot data

In [13]:
# 01. import required modules
import pandas as pd
# 02. load csv file as dataframe 
df = pd.read_csv("py_cleaning_pivot_data.csv")
# 03. Observe the dataframe data
df

Unnamed: 0,name,treatment,result
0,John Smith,treatment a,
1,Jane Doe,treatment a,16.0
2,Mary Johnson,treatment a,3.0
3,John Smith,treatment b,2.0
4,Jane Doe,treatment b,11.0
5,Mary Johnson,treatment b,1.0


In [15]:
df.pivot?

In [14]:
# 04. pivot the data
df_pivot = df.pivot(index = "name",columns="treatment",values = "result")
# 05. after pivot observe the dataframe 
df_pivot

treatment,treatment a,treatment b
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Jane Doe,16.0,11.0
John Smith,,2.0
Mary Johnson,3.0,1.0


### 2.2 Exercise 4: Pivot data with duplicate entries

In [15]:
# 01. import required modules
import pandas as pd
import numpy as np
# 02. load csv file as dataframe 
df = pd.read_csv("py_cleaning_pivot_data1.csv")
# 03. Observe the dataframe data
df

Unnamed: 0,date,temperature,value
0,2017-09-15,tmax,30
1,2017-09-15,tmin,14
2,2017-09-16,tmax,30
3,2017-09-15,tmax,28
4,2017-09-15,tmin,15


In [18]:
df.pivot_table?

In [16]:
# 04. pivot the data
df_pivot = df.pivot_table(index = "date",
                          columns="temperature",
                          values = "value",
                          aggfunc = "mean")
# 05. after pivot observe the dataframe 
df_pivot

temperature,tmax,tmin
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2017-09-15,29.0,14.5
2017-09-16,30.0,
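
The reason for using `pivot_table` here is that plain `pivot` cannot handle two rows with the same `date` and `temperature` pair. The sketch below rebuilds the data by hand from the rows shown above (the CSV is not read), shows `pivot` failing, and then aggregates the duplicates with `pivot_table`. The flag `pivot_failed` is only an illustrative helper.

```python
import pandas as pd

# Rebuilt by hand from the displayed rows
df = pd.DataFrame({
    "date": ["2017-09-15", "2017-09-15", "2017-09-16", "2017-09-15", "2017-09-15"],
    "temperature": ["tmax", "tmin", "tmax", "tmax", "tmin"],
    "value": [30, 14, 30, 28, 15],
})

# pivot() has no rule for choosing between duplicate entries, so it raises
try:
    df.pivot(index="date", columns="temperature", values="value")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table() aggregates the duplicates instead (here with the mean)
df_pivot = df.pivot_table(index="date",
                          columns="temperature",
                          values="value",
                          aggfunc="mean")
print(df_pivot)
```

Other aggregations such as `"max"` or `"min"` may suit temperature data better than the mean; which one to choose depends on what the duplicates represent.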


## Functions we learned
---------
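- melt (unpivot) --> converts a wide table into a long table --> follows tidy rules
- pivot (un-melting) --> converts a long table into a wide table --> does not follow tidy rules when the new columns hold values rather than variables (as in Exercise 3)
- pivot_table --> like pivot, but handles duplicate entries by aggregating them (for example with the mean)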