<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px;">

# Intro to Data Cleaning

***

Week 2 | Lesson 2.3

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Inspect data types
- Clean up a column using df.apply()
- Know which situations call for .value_counts() in your code

### LESSON GUIDE
| TIMING  | TYPE  | TOPIC  |
|:-:|---|---|
| 5 min  | [Introduction](#introduction)   | Inspect data types, df.apply(), .value_counts()  |
| 20 min  | [Demo /Guided Practice](#demo)  | Inspect data types |
| 20 min  | [Demo /Guided Practice](#demo)  | df.apply() |
| 20 min  | [Demo /Guided Practice](#demo)  | .value_counts() |
| 20 min  | [Independent Practice](#ind-practice)  |   |
| 5 min  | [Conclusion](#conclusion)  |   |

<a name="introduction"></a>
## Introduction: Topic (5 mins)

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add a
couple more tools to our toolbox. 

The main data types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta[ns], 
category, and object. 

df.apply() will apply a function along any axis of the DataFrame. We'll see it in action below. 

pandas.Series.value_counts returns a Series containing counts of unique values. The resulting 
Series is in descending order, so the first element is the most frequently occurring 
value. NA values are excluded by default.

[Pandas: dtypes](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf)

[Pandas Series: value_counts](http://nullege.com/codes/search/pandas.Series.value_counts)


<a name="Inpsect data types "></a>
## Demo /Guided Practice: Inspect data types  (20 mins)

Let's create a small dictionary with different data types in it. 

> [demo code](https://github.com/generalassembly-studio/dsi-course-materials/blob/W2-L2.3/curriculum/04-lessons/week-02/2.3-lesson/code/W2%20L2.3%20Intro%20to%20Data%20Cleaning%20demo%20code.ipynb)
can be found in the code folder and contains all the code in this lesson in a Jupyter
notebook. Follow along or create a new notebook.


### Import Pandas + Numpy

In [3]:
import pandas as pd
import numpy as np

### Create Test Data

In [4]:
test_data = dict( 
    A = np.random.rand(3),
    B = 1,
    C = 'foo',
    D = pd.Timestamp('20010102'),
    E = pd.Series([1.0]*3).astype('float32'),
    F = False,
    G = pd.Series([1]*3,dtype='int8')
)

In [5]:
test_data

{'A': array([ 0.70630261,  0.98246752,  0.9984217 ]),
 'B': 1,
 'C': 'foo',
 'D': Timestamp('2001-01-02 00:00:00'),
 'E': 0    1.0
 1    1.0
 2    1.0
 dtype: float32,
 'F': False,
 'G': 0    1
 1    1
 2    1
 dtype: int8}

### Create our DataFrame

In [6]:
dft = pd.DataFrame(test_data)
dft

|   | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | 0.706303 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 1 | 0.982468 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 2 | 0.998422 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |


In [7]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

**What dtype might we expect when a single column holds values of mixed types?**

e.g.  [2, 3, 4, 5, 6, 7, 8.9]

If a pandas object contains data of multiple dtypes IN A SINGLE COLUMN, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).

### Ints are cast to floats

In [8]:
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

### String elements are cast to ``object`` dtype

In [9]:
pd.Series([1, 2, 3, 'foo'])

0      1
1      2
2      3
3    foo
dtype: object

In [10]:
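# Count how many columns have each dtype.
# Note: get_dtype_counts() was deprecated in pandas 0.25 and removed in 1.0;
# on newer versions, dft.dtypes.value_counts() gives the same counts.
dft.get_dtype_counts().astype(list)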

bool              1
datetime64[ns]    1
float32           1
float64           1
int64             1
int8              1
object            1
dtype: object

### With a partner, take 3 minutes to discuss:

*Without* running this code in a Python interpreter, which common `dtype` would you expect pandas to select?

    [1, 3, 9, .33, False, '03-20-1978', np.arange(22)]



You can do a lot more with dtypes.  Check out 
[Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf).
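
As a small illustration of what else dtypes allow, the sketch below filters columns by dtype with `select_dtypes` and converts a column with `astype`. The frame `example` and its column names are made up for this illustration and are not part of the lesson data.

```python
import pandas as pd

# Hypothetical frame, made up for this illustration
example = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "units": [3, 1, 2],
    "label": ["a", "b", "c"],
})

# Keep only the numeric columns (ints and floats)
numeric_only = example.select_dtypes(include="number")
print(numeric_only.columns.tolist())

# Convert the integer column to float; astype returns a new Series
# and leaves the original column untouched
units_as_float = example["units"].astype("float64")
print(units_as_float.dtype)
```

Because `astype` returns a new object, the result has to be assigned back to the column if the conversion should persist.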

## Why do you think it might be important to know what kind of dtypes you're working with? 
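
One possible answer, sketched under assumptions: when numbers are stored as strings, operations such as `sum()` still run but do something different from arithmetic. The Series `raw` below is hypothetical, and `dtype=object` is set explicitly so the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical column where numbers were read in as strings
raw = pd.Series(["1", "2", "3"], dtype=object)

# With object dtype holding strings, sum() concatenates instead of adding
print(raw.sum())

# Convert to a numeric dtype before doing arithmetic
clean = pd.to_numeric(raw)
print(clean.sum())
```

Checking `dtypes` first would have flagged the column as `object` before any misleading result was produced.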

<a name=" df.apply()"></a>
## Demo /Guided Practice:  df.apply() (20 mins)

Generally, df.apply() passes each column of the DataFrame (or each row, with axis=1) to a single function.  When that function works element-wise, as np.sqrt does, the effect is that every cell is transformed.  

Conversely, Series.map() is available when you only want to work with a single column of your dataset, e.g.  df['a'].map(my_func)
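
To make the difference concrete, here is a minimal sketch using a made-up frame called `toy`. It shows that the function given to `apply` on a DataFrame receives a whole column, while the function given to `map` on a single column receives one value at a time.

```python
import pandas as pd

# Hypothetical frame, made up for this illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# DataFrame.apply hands the function one whole column (a Series) at a time
col_types = toy.apply(lambda col: type(col).__name__)
print(col_types)

# Series.map hands the function one value at a time
doubled_a = toy["a"].map(lambda value: value * 2)
print(doubled_a.tolist())
```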

In [11]:
# Create some more test data
df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
df

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | -1.200009 | 0.605222 | 1.868761 | -0.604397 |
| 1 | -2.132443 | 0.921174 | -1.654215 | -2.898519 |
| 2 | -0.090855 | -0.763516 | 0.804579 | 0.24159 |
| 3 | -0.502342 | -0.906263 | -0.791563 | 0.050626 |
| 4 | 0.033142 | 0.17681 | 1.330263 | -0.477527 |


### Some Examples

In [12]:
# Square root of ALL CELLS; negative inputs give NaN (Not a Number)
df.apply(np.sqrt)

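|   | a | b | c | d |
|---|---|---|---|---|
| 0 | NaN | 0.77796 | 1.367026 | NaN |
| 1 | NaN | 0.959778 | NaN | NaN |
| 2 | NaN | NaN | 0.896983 | 0.491518 |
| 3 | NaN | NaN | NaN | 0.225002 |
| 4 | 0.182048 | 0.420488 | 1.15337 | NaN |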


_Note: Illustrate with whiteboard DataFrame, with blank axis labels._

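### Apply method to only one axis, 0 (columns)
With axis=0, the function receives one column at a time, e.g.  `df['a'] == [-1.369438  , -0.76541512,  1.75835588,  1.17270527,  0.02630271]` (these example values come from a different random draw than the table printed above).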

In [14]:
df.apply(np.mean, axis=0)

a   -0.778501
b    0.006686
c    0.311565
d   -0.737646
dtype: float64

### Apply method to only axis 1 (rows)
With axis=1, the function receives one row at a time.  Here is the first row as `.apply()` would see it:

`df.iloc[0].values == [-1.369438  ,  0.0804468 , -1.22457047,  0.42207757]`

_We are calculating the mean of each "row", not each "column".  (These values, and the output below, come from a different random draw than the table printed earlier.)_

In [32]:
df.apply(np.mean, axis=1)

0   -0.522871
1    0.215635
2    0.134356
3   -0.088068
4   -0.386371
dtype: float64
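
`apply` is not limited to NumPy functions. As a rough sketch, the hypothetical helper `value_range` below (not part of pandas) is applied along each axis of a small made-up frame, so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical frame with easy-to-verify values
grid = pd.DataFrame([[1.0, 5.0], [2.0, 9.0], [4.0, 6.0]], columns=["x", "y"])

def value_range(values):
    """Return the spread (max minus min) of a row or column."""
    return values.max() - values.min()

# axis=0: the function sees each column, giving one result per column
col_ranges = grid.apply(value_range, axis=0)

# axis=1: the function sees each row, giving one result per row
row_ranges = grid.apply(value_range, axis=1)

print(col_ranges.tolist())  # [3.0, 4.0]
print(row_ranges.tolist())  # [4.0, 7.0, 2.0]
```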

### Further Reading

For more advanced `.apply` usage, check out these links:

["Why Not"'s Gist Examples](https://gist.github.com/why-not/4582705)

[Chris Albon's Map + Apply Examples](http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html)


### **Check:** How would you find the std of the columns and rows? 

<a name=".value_counts()"></a>
## Demo /Guided Practice: .value_counts() (< 20 mins)

Why is this important?  value_counts() tells us how often each unique value occurs, which helps identify anything unexpected.  Looking at value_counts() per Series gives a quick overview of the values present in our data, such as:

 - Strings inside of mostly numeric / continuous data
 - Non-numeric values
 - General counts of values that we might expect to see
 - Most common / least common values
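
As a minimal sketch of the first two cases, the made-up Series `readings` below should be numeric but contains a stray placeholder string and a missing value. Passing `dropna=False` makes `value_counts` report missing values too, which it leaves out by default.

```python
import numpy as np
import pandas as pd

# Hypothetical column that should be numeric but has stray entries
readings = pd.Series([3, 3, 5, "n/a", 5, 3, np.nan, "n/a"], dtype=object)

# Default behaviour: missing values are not counted
default_counts = readings.value_counts()

# dropna=False also reports how many values are missing
all_counts = readings.value_counts(dropna=False)
print(all_counts)
```

The `"n/a"` entry shows up as its own row in the counts, which is usually the quickest hint that a column needs cleaning before it can be treated as numeric.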

Let's create some random data

In [64]:
data = np.random.randint(0, 7, size = 50)
data

array([0, 4, 6, 0, 0, 0, 2, 0, 3, 2, 0, 5, 0, 4, 5, 1, 4, 6, 1, 1, 2, 3, 6,
       2, 6, 2, 2, 5, 0, 2, 4, 4, 2, 4, 6, 6, 6, 1, 0, 3, 6, 0, 6, 2, 4, 3,
       6, 5, 1, 6])

In [65]:
s = pd.Series(data)
s.head()

0    0
1    4
2    6
3    0
4    0
dtype: int64

In [66]:
# List how many times each number occurs in our array
s.value_counts()

6    11
0    10
2     9
4     7
1     5
5     4
3     4
dtype: int64
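
Two variations are often handy: proportions instead of raw counts, and ordering by value instead of by frequency. The sketch below uses a short made-up Series `rolls` so the numbers can be verified by hand.

```python
import pandas as pd

# Hypothetical data: six values drawn from {0, 2, 6}
rolls = pd.Series([6, 0, 6, 2, 0, 6])

# normalize=True returns the share of each value instead of its count
shares = rolls.value_counts(normalize=True)
print(shares)

# sort_index() orders the result by the value itself, not by frequency
by_value = rolls.value_counts().sort_index()
print(by_value)
```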

<a name="ind-practice"></a>
## Independent Practice: Topic (20+ minutes)
- Use the sales.csv data set, which we've seen a few times in previous lessons
- Inspect the data types
- You've found out that all your values in column 1 are off by 1. Use df.apply to add 1 to column 1 of the dataset
- Use .value_counts to count the values of 1 column of the dataset

**Bonus** 
- Add 3 to column 2
- Use .value_counts for each column of the dataset

**Bonus Bonus -- COMPLETELY OPTIONAL!!!**
<img src="http://vignette3.wikia.nocookie.net/erbparodies/images/a/a3/Troll_Based_On.png/revision/latest?cb=20151109194505" style="width: 100px;">

Ruining data should give you a better sense of how to clean it.  Don't feel like you need to attempt this: it's completely optional and meant to be _extraneous_.  Real-life datasets will not be like this.  The solution isn't as important as the process and thinking behind your approach.  Another way to approach this is to map out a step-by-step plan in pseudocode, without actually coding anything.

- Add an extra column to your dataframe that is a copy of an existing column with continuous data
    - Randomly change the value of continuous data cells within it to the following:
      - NaN
      - A blank string
      - A numeric string
      - The same value
    - Report value_counts post-"random data troll" processing. Does it seem random?
    - Convert blank strings and NaN values to float(0)
    - Convert numeric strings to floats with two decimal places (`.2f` precision)
    - Divide by 2 if cell value is prime, use remainder as value
    - Post solution as Gist with comments to Slack
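
If you want a starting point for the cleaning steps only, here is one possible sketch, not the intended solution. It assumes a made-up Series `messy` and uses `pd.to_numeric` with `errors="coerce"`, which turns entries it cannot parse into NaN rather than raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical "trolled" column mixing floats, NaN, blanks and numeric strings
messy = pd.Series([1.234, np.nan, "", "7.891", 2.5], dtype=object)

# Numeric strings become floats; blank strings become NaN
as_numbers = pd.to_numeric(messy, errors="coerce")

# Replace NaN with 0.0, then round to two decimal places
cleaned = as_numbers.fillna(0.0).round(2)
print(cleaned.tolist())
```

The randomizing and prime-number steps are left out on purpose, since working those out is the point of the exercise.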

In [38]:
sales = pd.read_csv('/Users/Stav/github-repos/Stav-Grossfeld-DSI/week-01/3.4-lab/assets/datasets/sales_info.csv')



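# Every value in volume_sold is off by 1, so add 1 to each value
sales['volume_sold'] = sales['volume_sold'].apply(lambda value: value + 1)
# Count how often each value occurs in the corrected column
sales['volume_sold'].value_counts()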

7.841363     1
11.086030    1
9.176668     1
7.792889     1
8.196779     1
8.824354     1
8.124444     1
7.433606     1
9.783937     1
10.849660    1
11.270185    1
11.637769    1
6.781266     1
4.147741     1
12.505838    1
7.657733     1
7.618174     1
11.252870    1
7.309813     1
9.555078     1
8.785867     1
6.882779     1
8.200364     1
46.556096    1
11.118018    1
11.331430    1
8.682494     1
9.453647     1
10.421713    1
8.560549     1
            ..
16.697651    1
9.124182     1
12.826536    1
10.347349    1
11.260836    1
8.611698     1
12.129382    1
8.930415     1
8.343509     1
52.800686    1
5.904447     1
8.695312     1
9.092883     1
7.447040     1
13.581695    1
6.965175     1
4.725095     1
8.211490     1
8.790503     1
9.500445     1
5.296111     1
6.324497     1
9.753142     1
9.686518     1
12.019652    1
7.686406     1
13.076785    1
15.439435    1
5.557712     1
8.437252     1
Name: volume_sold, dtype: int64

In [43]:
# Bonus: add 3 to the second column (position 1)
sales.iloc[:, 1] = sales.iloc[:, 1].apply(lambda value: value + 3)
sales

|   | volume_sold | 2015_margin | 2015_q1_sales | 2016_q1_sales |
|---|---|---|---|---|
| 0 | 19.420760 | 105.802281 | 337166.53 | 337804.05 |
| 1 | 5.776510 | 33.082425 | 22351.86 | 21736.63 |
| 2 | 17.602401 | 105.612494 | 277764.46 | 306942.27 |
| 3 | 5.296111 | 28.824704 | 16805.11 | 9307.75 |
| 4 | 9.156023 | 47.011457 | 54411.42 | 58939.90 |
| 5 | 6.005122 | 43.877437 | 255939.81 | 332979.03 |
| 6 | 15.606750 | 88.518973 | 319020.69 | 302592.88 |
| 7 | 5.456466 | 31.337345 | 45340.33 | 55315.23 |
| 8 | 6.047530 | 38.142470 | 57849.23 | 42398.57 |
| 9 | 6.388070 | 34.427024 | 51031.04 | 56241.57 |
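
For simple arithmetic like adding a constant, `apply` with a lambda is not the only option: plain column arithmetic gives the same result. The sketch below uses a small made-up frame `toy_sales` as a stand-in, since the real sales file is not loaded here.

```python
import pandas as pd

# Hypothetical stand-in for the sales data; values are made up
toy_sales = pd.DataFrame({"volume_sold": [4.0, 7.5, 9.0]})

# Element-by-element with apply, as in the exercise
with_apply = toy_sales["volume_sold"].apply(lambda value: value + 1)

# The same operation written as column arithmetic
vectorized = toy_sales["volume_sold"] + 1

print(with_apply.equals(vectorized))
```

`apply` remains the better fit when the per-value logic is too involved to express as a single arithmetic expression.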


<a name="conclusion"></a>
## Conclusion (5 mins)
So far we've used pandas to look at the head and tail of a data set. We've also taken a look at summary stats and different
data types. We've selected and sliced data too. Today we added inspecting data types, df.apply(), and .value_counts() to
our pandas arsenal. Nice!

In [62]:
   
[sales[i].value_counts() for i in sales]

[7.841363     1
 11.086030    1
 9.176668     1
 7.792889     1
 8.196779     1
 8.824354     1
 8.124444     1
 7.433606     1
 9.783937     1
 10.849660    1
 11.270185    1
 11.637769    1
 6.781266     1
 4.147741     1
 12.505838    1
 7.657733     1
 7.618174     1
 11.252870    1
 7.309813     1
 9.555078     1
 8.785867     1
 6.882779     1
 8.200364     1
 46.556096    1
 11.118018    1
 11.331430    1
 8.682494     1
 9.453647     1
 10.421713    1
 8.560549     1
             ..
 16.697651    1
 9.124182     1
 12.826536    1
 10.347349    1
 11.260836    1
 8.611698     1
 12.129382    1
 8.930415     1
 8.343509     1
 52.800686    1
 5.904447     1
 8.695312     1
 9.092883     1
 7.447040     1
 13.581695    1
 6.965175     1
 4.725095     1
 8.211490     1
 8.790503     1
 9.500445     1
 5.296111     1
 6.324497     1
 9.753142     1
 9.686518     1
 12.019652    1
 7.686406     1
 13.076785    1
 15.439435    1
 5.557712     1
 8.437252     1
 Name: volume_sold, dtyp