
# Intro to Data Cleaning

***

Week 2 | Lesson 2.3

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*

- Inspect data types and values
- Diagram a data processing workflow
- Clean up a column using df.apply()

### LESSON GUIDE
| TIMING  | TYPE  | TOPIC  |
|:-:|---|---|
| 5 min  | [Introduction](#introduction)   | Inspect data types, df.apply(), .value_counts()  |
| 10 min  | [Demo /Guided Practice](#common_steps)  | Common cleaning |
| 10 min  | [Demo /Guided Practice](#dfd)  | Data flow diagrams |
| 10 min  | [Demo /Guided Practice](#inspect_data_types)  | Inspecting data types |
| 10 min  | [Demo /Guided Practice](#apply)  | Applying functions |
| 10 min  | [Demo /Guided Practice](#value_counts)  | .value_counts() |
| 20 min  | [Independent Practice](#ind-practice)  |   |
| 5 min  | [Conclusion](#conclusion)  |   |

<a name="introduction"></a>
## Introduction: data cleaning (5 mins)

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add some more tools to our toolbox.




### Conceptually: what do you look for, and how do you stay organized?

There's no magic formula, but we'll list common cleaning operations.

Reproducibility matters, so document! For high-level planning and documentation, **data flow diagrams** are helpful.



### Technically: once you know what you want to do, how do you do it in pandas?

Pandas has many functions to help process and manipulate your data. You learned some this morning. We'll take a second look at .dtypes, .apply() and .value_counts().

**.dtypes** is the data type attribute of numpy/pandas objects. On a DataFrame it returns one dtype per column.

**df.apply()** applies a function along an axis of a DataFrame: to each column (axis=0, the default) or to each row (axis=1).

**pandas.Series.value_counts()** returns a Series containing the count of each unique value. NaN values are excluded by default.

<a name="common_steps"></a>

## Common steps in cleaning data

- Drop outliers

For numerical data, consider dropping values > 3 SDs from the mean.

For categorical data, consider dropping cases accounting for < 1% of items.
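As a rough illustration of both rules of thumb, here is a minimal sketch. The data is made up for the example, and the 3 SD and 1% cutoffs are simply the ones suggested above, not fixed rules.

```python
import pandas as pd

# Hypothetical numeric data: 19 typical values plus one extreme value.
values = pd.Series([10.0] * 10 + [12.0] * 9 + [500.0])

# Keep only values within 3 standard deviations of the mean.
z_scores = (values - values.mean()) / values.std()
values_trimmed = values[z_scores.abs() <= 3]

# Hypothetical categorical data: one label accounts for under 1% of rows.
labels = pd.Series(["Rock"] * 150 + ["Country"] * 49 + ["Gospel"])

# Share of rows per category, then keep categories at or above the 1% cutoff.
shares = labels.value_counts(normalize=True)
common = shares[shares >= 0.01].index
labels_trimmed = labels[labels.isin(common)]

print(len(values), len(values_trimmed))
print(len(labels), len(labels_trimmed))
```

Note that one extreme value inflates the standard deviation itself, so in very small samples a 3 SD rule may never flag anything.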

- Normalize

Options include:

Max-min: $X_{norm} = (X - X_{min}) / (X_{max} - X_{min})$

Z-score: $Z_i = (X_i - \mathrm{mean}(X)) / \mathrm{sd}(X)$
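Both formulas translate almost directly into pandas arithmetic. The sketch below uses a small made-up Series; note that pandas' `.std()` computes the sample standard deviation (ddof=1) by default, which is an implementation detail the formula above does not specify.

```python
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0])  # hypothetical values

# Max-min scaling: results fall between 0 and 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: subtract the mean, divide by the standard deviation.
x_z = (x - x.mean()) / x.std()

print(x_minmax.tolist())
print(x_z.round(3).tolist())
```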

- Relabel

*("Bachelor's", "BSc", "Bachelor of Arts") -> "BA"*

- Decode

*1 -> "EU", 2 -> "Asia-Pacific", 3 -> "MENA"*
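Relabeling and decoding are both lookups from an old value to a new one, so a dictionary is a natural fit. A minimal sketch using the example values above (the variable names are just for illustration):

```python
import pandas as pd

# Relabel: collapse several spellings into one label.
degrees = pd.Series(["Bachelor's", "BSc", "Bachelor of Arts"])
degrees_clean = degrees.replace(
    {"Bachelor's": "BA", "BSc": "BA", "Bachelor of Arts": "BA"}
)

# Decode: turn numeric codes into readable names.
region_codes = pd.Series([1, 2, 3, 2])
region_names = region_codes.map({1: "EU", 2: "Asia-Pacific", 3: "MENA"})

print(degrees_clean.tolist())
print(region_names.tolist())
```

One difference worth knowing: `.replace()` leaves values that are not in the dictionary untouched, while `.map()` turns them into NaN.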



- Recast

*("1.0", "2.0", "3.0") -> (1,2,3)*

- Handle null values

Pandas does not impute missing values for you; it usually represents them as NaN when it loads your data. What you do next is your call. Drop them? Replace with estimates?
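Here is a minimal sketch of both steps on made-up data. Filling with the mean is only one possible estimate, chosen here to keep the example short.

```python
import numpy as np
import pandas as pd

# Recast: strings that look like numbers become floats, then integers.
raw = pd.Series(["1.0", "2.0", "3.0"])
as_int = raw.astype(float).astype(int)

# Null handling: count the nulls first, then either drop or fill them.
scores = pd.Series([4.0, np.nan, 8.0])
n_missing = scores.isnull().sum()
dropped = scores.dropna()
filled = scores.fillna(scores.mean())  # mean of the non-null values

print(as_int.tolist())
print(n_missing, dropped.tolist(), filled.tolist())
```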

- Binarize (dummy variables)

*(Blue, Green, Blue, Red, Red, Green) ->*
*IsBlue: (1, 0, 1, 0, 0, 0); IsGreen: (0, 1, 0, 0, 0, 1); IsRed: (0, 0, 0, 1, 1, 0)*

- Discretization

*(20, 56, 7, 2, 14, 89, 70, 40) -> (Adult, Adult, Child, Child, Child, Senior, Senior, Adult)*
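Pandas has helpers for both operations: `pd.get_dummies` for binarizing and `pd.cut` for discretizing. The sketch below reproduces the two examples above. The age cutoffs (Child up to 17, Adult 18 to 64, Senior 65 and over) are an assumption, since the example does not state them.

```python
import pandas as pd

# Binarize: one 0/1 column per category.
colors = pd.Series(["Blue", "Green", "Blue", "Red", "Red", "Green"])
dummies = pd.get_dummies(colors, prefix="Is", prefix_sep="").astype(int)

# Discretize: bin ages into labeled ranges (right edge of each bin included).
ages = pd.Series([20, 56, 7, 2, 14, 89, 70, 40])
groups = pd.cut(ages, bins=[0, 17, 64, 120], labels=["Child", "Adult", "Senior"])

print(dummies)
print(groups.tolist())
```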

> Look at the Billboard dataset (below). What kinds of cleaning might it require?

In [80]:
import pandas as pd
bb = pd.read_csv('assets/datasets/billboard.csv')
bb["genre"].value_counts()

Rock           103
Country         74
Rap             58
Rock'n'roll     34
R&B             13
R & B           10
Pop              9
Latin            9
Electronica      4
Gospel           1
Jazz             1
Reggae           1
Name: genre, dtype: int64

In [81]:
bb.columns.values

array(['year', 'artist.inverted', 'track', 'time', 'genre', 'date.entered',
       'date.peaked', 'x1st.week', 'x2nd.week', 'x3rd.week', 'x4th.week',
       'x5th.week', 'x6th.week', 'x7th.week', 'x8th.week', 'x9th.week',
       'x10th.week', 'x11th.week', 'x12th.week', 'x13th.week',
       'x14th.week', 'x15th.week', 'x16th.week', 'x17th.week',
       'x18th.week', 'x19th.week', 'x20th.week', 'x21st.week',
       'x22nd.week', 'x23rd.week', 'x24th.week', 'x25th.week',
       'x26th.week', 'x27th.week', 'x28th.week', 'x29th.week',
       'x30th.week', 'x31st.week', 'x32nd.week', 'x33rd.week',
       'x34th.week', 'x35th.week', 'x36th.week', 'x37th.week',
       'x38th.week', 'x39th.week', 'x40th.week', 'x41st.week',
       'x42nd.week', 'x43rd.week', 'x44th.week', 'x45th.week',
       'x46th.week', 'x47th.week', 'x48th.week', 'x49th.week',
       'x50th.week', 'x51st.week', 'x52nd.week', 'x53rd.week',
       'x54th.week', 'x55th.week', 'x56th.week', 'x57th.week',
       'x58th.week', '

<a name="dfd"></a>


## Planning your system with data flow diagrams (10 mins)


<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/24/Data-flow-diagram-notation.svg/220px-Data-flow-diagram-notation.svg.png">



- Function / process (ellipse or circle): takes in data from one or more sources, transforms it, and outputs it to one or more destinations 

- Store / file (parallel lines): place where data persists; takes an input and ought to have an output

- Input-output (rectangle): external entity that produces or consumes data (a source or a sink)

- Flow (arrow): shows the flow of specific data
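One way to carry a diagram over into code is to give each symbol its own small function, so the script reads like the diagram. The sketch below is only an illustration: the inlined CSV text, the column names, and the function names are all made up.

```python
import io
import pandas as pd

# Input (rectangle): a hypothetical raw source, inlined so the sketch runs.
RAW_CSV = "genre,weeks\nRock,10\nR & B,*\nR&B,7\n"

def load(source):
    """Flow: raw text in, DataFrame out."""
    return pd.read_csv(io.StringIO(source))

def clean(df):
    """Process (circle): standardize labels and null markers."""
    out = df.copy()
    out["genre"] = out["genre"].replace({"R & B": "R&B"})
    # Anything that is not a number (such as '*') becomes NaN.
    out["weeks"] = pd.to_numeric(out["weeks"], errors="coerce")
    return out

def store(df):
    """Store (parallel lines): persist the cleaned data, here as CSV text."""
    return df.to_csv(index=False)

cleaned = clean(load(RAW_CSV))
stored = store(cleaned)
print(stored)
```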


> What is an example of data processing you may want to do? Sketch a data flow diagram and walk your tablemates through it.

<a name="inspect_data_types"></a>
## Demo /Guided Practice: Inspect data types  (10 mins)

Let's create a small dictionary with different data types in it. 

### Import Pandas + Numpy

In [3]:
import pandas as pd
import numpy as np

### Create Test Data

In [14]:
test_data = dict( 
    A = np.random.rand(3),
    B = 1,
    C = 'foo',
    D = pd.Timestamp('20010102'),
    E = pd.Series([1.0]*3).astype('float32'),
    F = False,
    G = pd.Series([1]*3,dtype='int8')
)

In [15]:
test_data

{'A': array([ 0.49037759,  0.38078186,  0.83204938]),
 'B': 1,
 'C': 'foo',
 'D': Timestamp('2001-01-02 00:00:00'),
 'E': 0    1.0
 1    1.0
 2    1.0
 dtype: float32,
 'F': False,
 'G': 0    1
 1    1
 2    1
 dtype: int8}

### Create our DataFrame

In [16]:
dft = pd.DataFrame(test_data)
dft

|   | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | 0.490378 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 1 | 0.380782 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 2 | 0.832049 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |


In [231]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

**What dtype should we expect when a single column holds values of mixed types?**

e.g.:  [2, 3, 4, 5, 6, 7, 8.9]

If a pandas object contains data of multiple dtypes in a single column, pandas chooses a column dtype that can accommodate all of the values (object is the most general).

### Ints are cast to floats

In [18]:
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

### String elements are cast to ``object`` dtype

In [27]:
pd.Series([1, 2, 3, 'foo'])

0      1
1      2
2      3
3    foo
dtype: object

In [234]:
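# get_dtype_counts() was removed in pandas 1.0; counting the dtypes directly gives the same summary
dft.dtypes.astype(str).value_counts().sort_index()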

bool              1
datetime64[ns]    1
float32           1
float64           1
int64             1
int8              1
object            1
dtype: int64

> If you turn these into a pd.Series, what will be the dtype?

    [1, 3, 9, .33, False, '03-20-1978', np.arange(22)]



In [235]:
pd.Series([1, 3, 9, .33, False, '03-20-1978', np.arange(22)])

0                                                    1
1                                                    3
2                                                    9
3                                                 0.33
4                                                False
5                                           03-20-1978
6    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
dtype: object

## Why do you think it might be important to know what the column dtypes are? 

In [30]:
dft

|   | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | 0.490378 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 1 | 0.380782 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 2 | 0.832049 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |


In [33]:
print(dft.select_dtypes(include=['bool']))
print(dft.select_dtypes(include=['bool'])*2)
print(dft.select_dtypes(include=['object']))
print(dft.select_dtypes(include=['object'])*2)
# Division is not defined for strings, so uncommenting this line raises a TypeError:
#dft.select_dtypes(include=['object'])/2

       F
0  False
1  False
2  False
   F
0  0
1  0
2  0
     C
0  foo
1  foo
2  foo
        C
0  foofoo
1  foofoo
2  foofoo
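The same thing can happen silently when numbers are stored as strings. The sketch below uses a made-up price column, created with `dtype=object` to mimic numbers that were read in as text. Multiplying it repeats the text, and `pd.to_numeric` is one way to recover real numbers.

```python
import pandas as pd

# Hypothetical column where the numbers are stored as strings.
prices = pd.Series(["1.5", "2.5", "3.0"], dtype=object)

# Multiplying strings repeats them instead of doubling the amounts.
doubled_text = prices * 2

# Convert to a numeric dtype first, then the arithmetic behaves as expected.
prices_num = pd.to_numeric(prices)
doubled_num = prices_num * 2

print(doubled_text.tolist())
print(doubled_num.tolist())
```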


<a name="apply"></a>
## Demo / Guided Practice:  df.apply(), df.applymap(), Series.map() (20 mins)

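df.apply() applies a function to each column of your dataframe (*"column-wise"*, axis=0, the default) or to each row (axis=1).

df.applymap() applies a function *element-wise*, to every cell of a dataframe. (In pandas 2.1 and later it is deprecated in favor of the equivalent DataFrame.map().)

Series.map() is a pd.Series method that applies a function element-wise on a series.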

> Check: why would these be useful in data cleaning?

In [61]:
# Create test data: a 7x7 frame of random integers in [-10, 10)
y = 10
df = pd.DataFrame(np.random.randint(-y, y, (7, 7)))
df


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | -7 | -3 | -4 | 0 | 7 | -10 | 9 |
| 1 | -3 | -4 | 8 | 9 | 2 | -4 | -4 |
| 2 | 3 | -6 | 9 | 8 | 0 | 0 | -4 |
| 3 | -8 | 5 | 1 | 4 | -10 | -10 | -1 |
| 4 | -3 | -8 | -6 | 6 | -1 | -5 | -6 |
| 5 | 5 | -2 | 9 | -3 | -1 | -4 | -1 |
| 6 | 5 | -7 | 4 | -1 | -5 | -6 | 4 |


In [18]:
df.apply(min,axis=0)

0    -8
1    -8
2    -6
3    -3
4   -10
5   -10
6    -6
dtype: int64

In [16]:
df

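|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | -7 | -3 | -4 | 0 | 7 | -10 | 9 |
| 1 | -3 | -4 | 8 | 9 | 2 | -4 | -4 |
| 2 | 3 | -6 | 9 | 8 | 0 | 0 | -4 |
| 3 | -8 | 5 | 1 | 4 | -10 | -10 | -1 |
| 4 | -3 | -8 | -6 | 6 | -1 | -5 | -6 |
| 5 | 5 | -2 | 9 | -3 | -1 | -4 | -1 |
| 6 | 5 | -7 | 4 | -1 | -5 | -6 | 4 |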


In [17]:
df.apply(min, axis = 1)

0   -10
1    -4
2    -6
3   -10
4    -8
5    -4
6    -7
dtype: int64

In [20]:
df.applymap(np.square)


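|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 49 | 9 | 16 | 0 | 49 | 100 | 81 |
| 1 | 9 | 16 | 64 | 81 | 4 | 16 | 16 |
| 2 | 9 | 36 | 81 | 64 | 0 | 0 | 16 |
| 3 | 64 | 25 | 1 | 16 | 100 | 100 | 1 |
| 4 | 9 | 64 | 36 | 36 | 1 | 25 | 36 |
| 5 | 25 | 4 | 81 | 9 | 1 | 16 | 1 |
| 6 | 25 | 49 | 16 | 1 | 25 | 36 | 16 |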


In [25]:
df[1].map(np.square)

0     9
1    16
2    36
3    25
4    64
5     4
6    49
Name: 1, dtype: int64

> Check: what happens if we try to use the min function in .applymap()?

In [49]:
df.applymap(lambda x: np.sqrt(x**2-x)+x*5)


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | -27.516685 | -11.535898 | -15.527864 | 0.0 | 41.480741 | -39.511912 | 53.485281 |
| 1 | -11.535898 | -15.527864 | 47.483315 | 53.485281 | 11.414214 | -15.527864 | -15.527864 |
| 2 | 17.44949 | -23.519259 | 53.485281 | 47.483315 | 0.0 | 0.0 | -15.527864 |
| 3 | -31.514719 | 29.472136 | 5.0 | 23.464102 | -39.511912 | -39.511912 | -3.585786 |
| 4 | -11.535898 | -31.514719 | -23.519259 | 35.477226 | -3.585786 | -19.522774 | -23.519259 |
| 5 | 29.472136 | -7.55051 | 53.485281 | -11.535898 | -3.585786 | -15.527864 | -3.585786 |
| 6 | 29.472136 | -27.516685 | 23.464102 | -3.585786 | -19.522774 | -23.519259 | 23.464102 |


In [41]:
df.apply(np.std,axis=1)

0    6.490181
1    5.394631
2    5.233331
3    6.040678
4    4.333072
5    4.403153
6    4.823412
dtype: float64
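To connect these methods back to cleaning: the Billboard genre counts earlier list "R&B" and "R & B" as separate labels. A minimal sketch of folding them together with Series.map() follows. The short Series and the helper name `tidy_genre` are made up for illustration.

```python
import pandas as pd

# A few genre labels like those in the Billboard value counts.
genres = pd.Series(["Rock", "R&B", "R & B", "Rock'n'roll", "R & B"])

def tidy_genre(label):
    """Fold the 'R & B' spelling into 'R&B'; leave other labels alone."""
    return "R&B" if label == "R & B" else label

genres_clean = genres.map(tidy_genre)
print(genres_clean.value_counts())
```

Whether "Rock'n'roll" should also be folded into "Rock" is a judgment call about the data, not something the code can decide for you.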

### Further Reading

For more advanced `.apply` usage, check out these links:

["Why Not"'s Gist Examples](https://gist.github.com/why-not/4582705)

[Chris Albon's Map + Apply Examples](http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html)


### **Check:** How would you find the std of the columns and rows? 

<a name="value_counts"></a>
## Demo /Guided Practice: .value_counts() (< 10 mins)

Why is this important? value_counts() tells us how many times each unique value occurs. Looking at it series by series gives a quick overview of the values in our data and helps identify anything unexpected:

 - Strings inside of mostly numeric / continuous data
 - Non-numeric values
 - General counts of values that we might expect to see
 - Most common / least common values

Let's create some random data

In [63]:
data = np.random.randint(0, 7, size=50)
data

array([4, 4, 3, 1, 6, 6, 4, 4, 0, 3, 4, 6, 0, 6, 2, 1, 4, 3, 5, 2, 6, 2, 2,
       0, 0, 2, 3, 3, 5, 6, 1, 3, 2, 5, 6, 3, 0, 3, 1, 3, 3, 6, 3, 4, 4, 6,
       5, 4, 4, 4])

In [69]:
s = pd.Series(data)
s.head()

0    4
1    4
2    3
3    1
4    6
dtype: int64

In [48]:
# Count how many times each value occurs (the output below comes from a different random draw than the array above)
s.value_counts()

3    9
1    9
0    9
2    8
4    6
5    5
6    4
dtype: int64
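Two optional arguments are handy when hunting for unexpected values: `dropna=False` shows how many entries are missing, and `normalize=True` reports proportions instead of raw counts. A small sketch on made-up data, where a stray "three" hides among digit strings:

```python
import numpy as np
import pandas as pd

# Hypothetical column: mostly digit strings, one stray word, one missing value.
s = pd.Series(["3", "3", "three", np.nan, "1", "3"], dtype=object)

counts = s.value_counts()                       # NaN is left out by default
counts_with_nan = s.value_counts(dropna=False)  # NaN gets its own row
shares = s.value_counts(normalize=True)         # proportions of non-null values

print(counts_with_nan)
print(shares)
```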

### Lab preview: let's munge the Billboard dataset!

In [49]:
bb.head(10)

|   | year | artist.inverted | track | time | genre | date.entered | date.peaked | x1st.week | x2nd.week | x3rd.week | ... | x67th.week | x68th.week | x69th.week | x70th.week | x71st.week | x72nd.week | x73rd.week | x74th.week | x75th.week | x76th.week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | Destiny's Child | Independent Women Part I | 3,38,00 AM | Rock | September 23, 2000 | November 18, 2000 | 78 | 63 | 49 | ... | \* | \* | \* | \* | \* | \* | \* | \* | \* | \* |
| 1 | 2000 | Santana | Maria, Maria | 4,18,00 AM | Rock | February 12, 2000 | April 8, 2000 | 15 | 8 | 6 | ... | \* | \* | \* | \* | \* | \* | \* | \* | \* | \* |
| 2 | 2000 | Savage Garden | I Knew I Loved You | 4,07,00 AM | Rock | October 23, 1999 | January 29, 2000 | 71 | 48 | 43 | ... | \* | \* | \* | \* | \* | \* | \* | \* | \* | \* |
| 3 | 2000 | Madonna | Music | 3,45,00 AM | Rock | August 12, 2000 | September 16, 2000 | 41 | 23 | 18 | ... | \* | \* | \* | \* | \* | \* | \* | \* | \* | \* |
| 4 | 2000 | Aguilera, Christina | Come On Over Baby (All I Want Is You) | 3,38,00 AM | Rock | August 5, 2000 | October 14, 2000 | 57 | 47 | 45 | ... | \* | \* | \* | \* | \* | \* | \* | \* | \* | \* |
| 5 | 2000 | Janet | Doesn't Really Matter | 4,17,00 AM | Rock | June 17, 2000 | August 26, 2000 | 59 | 52 | 43 | ... | \* | \* | \* | \* | \* | \* | \* | \* | \* | \* |
| 6 | 2000 | Destiny's Child | Say My Name | 4,31,00 AM | Rock'n'roll | December 25, 1999 | March 18, 2000 | 83 | 83 | 44 | ... | \* | \* | \* | \* | \* | \* | \* | \* | \* | \* |
| 7 | 2000 | Iglesias, Enrique | Be With You | 3,36,00 AM | Latin | April 1, 2000 | June 24, 2000 | 63 | 45 | 34 | ... | \* | \* | \* | \* | \* | \* | \* | \* | \* | \* |
| 8 | 2000 | Sisqo | Incomplete | 3,52,00 AM | Rock'n'roll | June 24, 2000 | August 12, 2000 | 77 | 66 | 61 | ... | \* | \* | \* | \* | \* | \* | \* | \* | \* | \* |
| 9 | 2000 | Lonestar | Amazed | 4,25,00 AM | Country | June 5, 1999 | March 4, 2000 | 81 | 54 | 44 | ... | \* | \* | \* | \* | \* | \* | \* | \* | \* | \* |


Where do we start? Let's start with the null value sentinels.

In [89]:
def replace_nulls(value):
    if value == '*':
        return np.nan
    else:
        return value



In [90]:
bb.applymap(replace_nulls)

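|   | year | artist.inverted | track | time | genre | date.entered | date.peaked | x1st.week | x2nd.week | x3rd.week | ... | x67th.week | x68th.week | x69th.week | x70th.week | x71st.week | x72nd.week | x73rd.week | x74th.week | x75th.week | x76th.week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | Destiny's Child | Independent Women Part I | 3,38,00 AM | Rock | September 23, 2000 | November 18, 2000 | 78 | 63 | 49 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2000 | Santana | Maria, Maria | 4,18,00 AM | Rock | February 12, 2000 | April 8, 2000 | 15 | 8 | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2000 | Savage Garden | I Knew I Loved You | 4,07,00 AM | Rock | October 23, 1999 | January 29, 2000 | 71 | 48 | 43 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2000 | Madonna | Music | 3,45,00 AM | Rock | August 12, 2000 | September 16, 2000 | 41 | 23 | 18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2000 | Aguilera, Christina | Come On Over Baby (All I Want Is You) | 3,38,00 AM | Rock | August 5, 2000 | October 14, 2000 | 57 | 47 | 45 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 2000 | Janet | Doesn't Really Matter | 4,17,00 AM | Rock | June 17, 2000 | August 26, 2000 | 59 | 52 | 43 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 2000 | Destiny's Child | Say My Name | 4,31,00 AM | Rock'n'roll | December 25, 1999 | March 18, 2000 | 83 | 83 | 44 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 2000 | Iglesias, Enrique | Be With You | 3,36,00 AM | Latin | April 1, 2000 | June 24, 2000 | 63 | 45 | 34 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 2000 | Sisqo | Incomplete | 3,52,00 AM | Rock'n'roll | June 24, 2000 | August 12, 2000 | 77 | 66 | 61 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 2000 | Lonestar | Amazed | 4,25,00 AM | Country | June 5, 1999 | March 4, 2000 | 81 | 54 | 44 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |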
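Writing a function and applying it element-wise works, but pandas also has built-in shortcuts for sentinel values. The sketch below shows two of them on a made-up three-row extract shaped like the Billboard file. The inlined CSV text is an assumption that keeps the example self-contained.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical three-row extract in the same shape as the Billboard file.
csv_text = "track,x1st.week,x2nd.week\nMusic,41,23\nAmazed,81,*\nIncomplete,77,*\n"

# Option 1: replace the sentinel after loading.
raw = pd.read_csv(io.StringIO(csv_text))
replaced = raw.replace("*", np.nan)

# Option 2: declare the sentinel while loading, so the column parses as numeric.
parsed = pd.read_csv(io.StringIO(csv_text), na_values="*")

print(replaced["x2nd.week"].isnull().sum())
print(parsed.dtypes)
```

With option 1 the remaining values in the affected column are still strings, so a recast would be needed before doing arithmetic. Option 2 avoids that extra step.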


<a name="ind-practice"></a>
## Independent Practice: Data cleaning (20 mins)

Using our old friend, the [sales_info.csv](assets/datasets/sales_info.csv) dataset:

- Inspect the data types
- Let's say all your values in the first column are too low by 1: use Series.map() (or .apply()) on that column to add 1 to each value in it
- Use .value_counts() to count the values of each column in the dataset

**Bonus** 
- Write functions to bin the numerical values in each column into 'low', 'medium' and 'high' categories (hint: pandas has a built-in [quantile](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.quantile.html) function)
- Use .value_counts again on each column of the dataset, and check that the ratios of the new values are what you expect
- Look at the advanced reading, and rewrite your binning functions as lambda functions



In [124]:
#bb.dtypes
df=pd.read_csv("assets/datasets/sales_info.csv")
df.columns

Index(['volume_sold', '2015_margin', '2015_q1_sales', '2016_q1_sales'], dtype='object')

In [115]:
df=pd.read_csv("assets/datasets/sales_info.csv")
df['volume_sold']=df['volume_sold'].map(lambda x: x+1)
df

|   | volume_sold | 2015_margin | 2015_q1_sales | 2016_q1_sales |
|---|---|---|---|---|
| 0 | 19.420760 | 93.802281 | 337166.53 | 337804.05 |
| 1 | 5.776510 | 21.082425 | 22351.86 | 21736.63 |
| 2 | 17.602401 | 93.612494 | 277764.46 | 306942.27 |
| 3 | 5.296111 | 16.824704 | 16805.11 | 9307.75 |
| 4 | 9.156023 | 35.011457 | 54411.42 | 58939.90 |
| 5 | 6.005122 | 31.877437 | 255939.81 | 332979.03 |
| 6 | 15.606750 | 76.518973 | 319020.69 | 302592.88 |
| 7 | 5.456466 | 19.337345 | 45340.33 | 55315.23 |
| 8 | 6.047530 | 26.142470 | 57849.23 | 42398.57 |
| 9 | 6.388070 | 22.427024 | 51031.04 | 56241.57 |


In [120]:
print([df[x].value_counts() for x in df])

[7.841363     1
11.086030    1
9.176668     1
7.792889     1
8.196779     1
8.824354     1
8.124444     1
7.433606     1
9.783937     1
10.849660    1
11.270185    1
11.637769    1
6.781266     1
4.147741     1
12.505838    1
7.657733     1
7.618174     1
11.252870    1
7.309813     1
9.555078     1
8.785867     1
6.882779     1
8.200364     1
46.556096    1
11.118018    1
11.331430    1
8.682494     1
9.453647     1
10.421713    1
8.560549     1
            ..
16.697651    1
9.124182     1
12.826536    1
10.347349    1
11.260836    1
8.611698     1
12.129382    1
8.930415     1
8.343509     1
52.800686    1
5.904447     1
8.695312     1
9.092883     1
7.447040     1
13.581695    1
6.965175     1
4.725095     1
8.211490     1
8.790503     1
9.500445     1
5.296111     1
6.324497     1
9.753142     1
9.686518     1
12.019652    1
7.686406     1
13.076785    1
15.439435    1
5.557712     1
8.437252     1
Name: volume_sold, dtype: int64, 32.732539     1
36.282033     1
33.417852     1
25.

In [146]:
df.quantile([.33,.67])

|   | volume_sold | 2015_margin | 2015_q1_sales | 2016_q1_sales |
|---|---|---|---|---|
| 0.33 | 6.899212 | 31.164171 | 58976.5653 | 57502.0276 |
| 0.67 | 9.62224 | 45.09512 | 159671.988 | 167126.3768 |
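Quantile cutoffs like these can drive the binning functions from the bonus exercise. The following is one possible sketch, not the only solution. It uses a short made-up Series in place of a real column, and the helper name `bin_value` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for one numeric column of the dataset.
volume = pd.Series([5.3, 5.8, 6.0, 6.4, 9.2, 15.6, 17.6, 19.4, 7.1])

# Cutoffs at the 33rd and 67th percentiles of this column.
low_cut, high_cut = volume.quantile([0.33, 0.67])

def bin_value(value):
    """Label a value 'low', 'medium' or 'high' relative to the cutoffs."""
    if value <= low_cut:
        return "low"
    elif value <= high_cut:
        return "medium"
    return "high"

binned = volume.map(bin_value)

# Roughly a third of the values should land in each bin.
print(binned.value_counts())
```

To bin every column against its own cutoffs, the same idea can be wrapped in a function that takes a whole column and passed to df.apply().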


<a name="conclusion"></a>
## Conclusion (5 mins)
So far we've used pandas to look at the head and tail of a data set. We've also looked at summary stats and data types, and we've selected and sliced data.

Today we added inspecting data types, df.apply(), and .value_counts() to
our pandas arsenal.

### Advanced reading (optional)

.apply() functions are a typical use case for [lambda](http://www.secnetix.de/olli/Python/lambda_functions.hawk) [functions](https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/) -- we'll go over them later in the class, but dive in now if you'd like!