# Data Cleaning with Pandas

## Introduction

The aim of this notebook is to introduce you to basic data cleaning using Python and Pandas. Most of the content follows the ideas presented in the excellent report by de Jonge and van der Loo, <cite>[Introduction to data cleaning with R][1]</cite>.

As explained in [1][1], most of a Data Scientist's time is spent cleaning and preparing data before any statistical analysis or model application. It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu T, Johnson T (2003). Exploratory Data Mining and Data Cleaning. Wiley-IEEE.). Of course, one can find data sources that are ready to go, but these are usually data sets that have already been explored and are used for partial examples. The reality, however, is that data is full of errors, lacks format consistency and is potentially incomplete. The Data Scientist's mission is to convert these raw data sources into consistent data sets that can be used as input for further analysis.

Even technically correct data sets, after hard work on cleaning, checking and filling, can lack a standard way to organise data values within a dataset. Hadley Wickham defined this standardization as Tidy Data.

[1]:https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf


## Statistical Analysis in Five Steps

De Jonge and van der Loo define statistical data analysis in five steps:
1. Raw data
       | type checking, normalizing         
       v                                    
2. Technically correct data                 
       | fix and impute                     
       v                                    
3. Consistent data                          
       | estimate, analyze, derive, etc.
       v
4. Statistical results
       | tabulate, plot
       v
5. Formatted output

The diagram above is a numbered list with five items. Each item represents data in a different state, and the arrows represent the actions needed to move from one state to the next. Note that data gains value incrementally at each step as it is transformed from the first state to the fifth.

At the first stage, we have the data as is. Raw data is a rough diamond that is going to be cut and polished at each step. Among the errors we can find are wrong types, inconsistent variable encodings, data without labels, etc.

Technically correct data is data that can be loaded into Pandas structures; that is, it has the proper "shape", with correct names, types, labels and so on. However, variables may still be out of range or inconsistent with one another (relations between variables).
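As a minimal sketch of the step from raw to technically correct data, the example below coerces text columns into proper types. The column names and values are invented for illustration; real data will need its own rules.

```python
import pandas as pd

# Hypothetical raw data: every value arrives as text, as it often does from a CSV file.
raw = pd.DataFrame({
    "age": ["21", "34", "unknown", "45"],
    "gender": ["F", "M", "M", "F"],
})

clean = raw.copy()
# Convert to numbers; entries that cannot be parsed become NaN instead of raising.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")
# Store a variable with a small, fixed set of values as a categorical.
clean["gender"] = clean["gender"].astype("category")

print(clean.dtypes)
```

After this step the types are right, but nothing yet guarantees that the values themselves are plausible: an age of 250 would pass unchanged.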

In the consistent data stage, data is ready for statistical inference. For example, the total income in a year equals the sum of the incomes of all its months.
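The income rule can be checked with a few lines of Pandas. This is only a sketch: the table layout, the column names (`month_01` to `month_12`, `year_total`) and the figures are assumptions made for the example.

```python
import pandas as pd

# Hypothetical records: twelve monthly incomes and a reported yearly total per person.
months = [f"month_{i:02d}" for i in range(1, 13)]
incomes = pd.DataFrame(
    [
        [1000] * 12 + [12000],  # consistent: 12 * 1000 == 12000
        [1500] * 12 + [17000],  # inconsistent: 12 * 1500 == 18000
    ],
    columns=months + ["year_total"],
    index=["person_a", "person_b"],
)

# A row is consistent when the monthly incomes add up to the yearly total.
is_consistent = incomes[months].sum(axis=1) == incomes["year_total"]

# Show the rows that break the rule.
print(incomes.loc[~is_consistent, "year_total"])
```

The exact comparison works here because the values are integers; with floating point amounts a tolerance-based comparison such as `numpy.isclose` would be a safer choice.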

The later stages contain the statistical results derived from the analysis, which can ultimately be formatted to provide a synthetic layout.

**Best Practice**
It is a good idea to keep the input of each step in a local file, and to keep the methods applied at each stage ready to be re-run, so that (at least) every stage can be reproduced.

**Why?**
We will see that at each stage we can potentially lose or modify the initial data. This loss or modification of data can influence the final analysis. All operations performed on a data set should be reproducible.

Python offers a good interactive environment that facilitates the transformation and computation of datasets, while also providing a nice scripting framework to reproduce procedures.
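One possible way to follow this practice is sketched below: the input of a step is written to a file before the step runs, and the step itself is a function that can be re-run on the stored input. The file names, the `to_technically_correct` helper and the data are hypothetical, and a temporary directory stands in for a real project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def to_technically_correct(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: coerce the 'weight' column to numbers."""
    out = raw.copy()
    out["weight"] = pd.to_numeric(out["weight"], errors="coerce")
    return out


# Hypothetical raw data, with one value that cannot be parsed as a number.
raw = pd.DataFrame({"name": ["Ann", "Bob"], "weight": ["61.5", "unknown"]})
workdir = Path(tempfile.mkdtemp())

# Store the input of the step, run the step, then store its result.
raw.to_csv(workdir / "01_raw.csv", index=False)
correct = to_technically_correct(raw)
correct.to_csv(workdir / "02_technically_correct.csv", index=False)

# Later: reload the stored input as text and re-run the same step on it.
reloaded = pd.read_csv(workdir / "01_raw.csv", dtype=str)
rerun = to_technically_correct(reloaded)
print(rerun)
```

Keeping the raw file untouched means that the value lost in this step (the unparseable weight) can always be traced back to its original form.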

## Kind Reminder on (statistical) Variables

Data cleaning can be seen as the first step of statistical analysis, and as programmers we tend to forget or mix up the statistical terms. What does a statistician mean by a variable? For a computer programmer, a variable is a memory space that can be filled with a known (or unknown) quantity of information (a.k.a. a value). Moreover, this space has an associated name that can be used in a program at run time to modify the value of the variable. Don't take this as an exact definition, but it serves as a general refresher of what a variable is for us computer scientists.

Statisticians have their own variables, so let's give another informal definition. In statistics, a variable is an attribute that describes a person, place, thing, or idea (often referred to as a feature).

As an example, we can take a table of the physical characteristics of 10 people. The objects (the rows of the matrix) are the people, and the variables are the measured properties, such as weight or eye color.
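In Pandas terms, such a table could look like the sketch below, with one row per object and one column per variable. Three people are used instead of ten to keep it short, and the names and measurements are made up.

```python
import pandas as pd

# Hypothetical measurements: one row per person (object), one column per variable.
people = pd.DataFrame(
    {
        "weight_kg": [61.5, 80.2, 72.0],
        "eye_color": ["brown", "blue", "green"],
    },
    index=["person_1", "person_2", "person_3"],
)

n_objects, n_variables = people.shape
print(n_objects, n_variables)
print(people.columns.tolist())  # the variable names
```

Note that the two variables are of different kinds: `weight_kg` is numeric, while `eye_color` takes one of a limited set of labels.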

Basically, Pandas has two fundamental data structures: Series and DataFrame. The reference page is (http://pandas.pydata.org/pandas-docs/stable/dsintro.html), and we will try to provide different and complementary examples to understand their purpose.
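As a first, minimal illustration of how the two structures relate, the sketch below builds two Series and combines them into a DataFrame. The labels and values are invented for the example.

```python
import pandas as pd

labels = ["ann", "bob", "eve"]

# A Series is a one-dimensional labelled array: one variable measured on three objects.
weight = pd.Series([61.5, 80.2, 72.0], index=labels, name="weight_kg")
height = pd.Series([1.65, 1.80, 1.72], index=labels, name="height_m")

# A DataFrame is a two-dimensional table whose columns are Series sharing one index.
table = pd.DataFrame({"weight_kg": weight, "height_m": height})

# Selecting one column gives back a Series; selecting by row and column label gives a value.
column = table["weight_kg"]
print(type(column).__name__)
print(table.loc["bob", "height_m"])
```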

# Bibliography

[1] de Jonge E, van der Loo M. Introduction to data cleaning with R. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf

[2] Dasu T, Johnson T (2003). Exploratory Data Mining and Data Cleaning. Wiley-IEEE.

[3] Wickham H. Tidy Data. http://vita.had.co.nz/papers/tidy-data.pdf