<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Guided Practice: Inspecting Data Types and Applying Functions

_Authors: Dave Yerrington (SF)_

---

In [1]:
import pandas as pd
import numpy as np

**1. Create a small DataFrame with different data types (provided).**

In [3]:
# create a small DataFrame with a different data type in each column

dft = pd.DataFrame(dict(A = np.random.rand(3),
                        B = 1,
                        C = 'foo',
                        D = pd.Timestamp('20010102'),
                        E = pd.Series([1.0]*3).astype('float32'),
                        F = False,
                        G = pd.Series([1]*3,dtype='int8')))

dft

Unnamed: 0,A,B,C,D,E,F,G
0,0.159207,1,foo,2001-01-02,1.0,False,1
1,0.385402,1,foo,2001-01-02,1.0,False,1
2,0.550123,1,foo,2001-01-02,1.0,False,1


**2. Examine the data types of the columns.**

In [4]:
# The .dtypes attribute is an easy way to see the data type
# of each column.
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object
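
Once the dtypes are known, they can also be used to pick out columns. The following is a minimal sketch, assuming a small stand-in frame (the name `frame` and its values are illustrative, not part of the exercise), that uses `select_dtypes` to keep only the numeric columns:

```python
import pandas as pd

# Illustrative stand-in frame with numeric, text, and boolean columns.
frame = pd.DataFrame({
    "A": [0.1, 0.2],
    "B": [1, 2],
    "C": ["foo", "bar"],
    "F": [False, True],
})

# "number" matches the integer and float columns; text and bool are left out.
numeric_cols = frame.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['A', 'B']
```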

**3. Create a Series object with the integers 1-5 and float 6.0. What data type is the Series?**

In [5]:
# If a pandas object contains data of multiple dtypes IN A
# SINGLE COLUMN, the dtype of the column will be chosen
# to accommodate all of the data types (object is the
# most general).
# Here the ints are coerced to floats.

pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64
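
To see the upcasting in isolation, here is a small sketch comparing an all-integer Series with one that contains a single float. The variable names are illustrative.

```python
import pandas as pd

# All integers: pandas keeps an integer dtype.
ints_only = pd.Series([1, 2, 3, 4, 5])

# A single float in the data is enough to upcast every value to float64.
mixed = pd.Series([1, 2, 3, 4, 5, 6.0])

print(ints_only.dtype)
print(mixed.dtype)  # float64
```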

**4. Create a Series with data: `[1, 2, 3, 6., 'foo']`. What data type is the series?**

In [6]:
# mixing a string in with the numbers forces an object dtype

pd.Series([1, 2, 3, 6., 'foo'])

0      1
1      2
2      3
3      6
4    foo
dtype: object
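
With an `object` dtype nothing is coerced: each element keeps its original Python type. A minimal sketch that checks this (the name `element_types` is illustrative):

```python
import pandas as pd

mixed = pd.Series([1, 2, 3, 6.0, "foo"])

# Collect the Python type name of every element in the Series.
element_types = [type(value).__name__ for value in mixed]

print(mixed.dtype)    # object
print(element_types)  # ['int', 'int', 'int', 'float', 'str']
```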

**5. Find how many columns of each type there are with the `.get_dtype_counts()` function.**

In [7]:
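# The method get_dtype_counts() will return the number
# of columns of each type in a DataFrame.
# Note: it was deprecated in pandas 0.25 and removed in 1.0;
# on newer versions use dft.dtypes.value_counts() instead.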

dft.get_dtype_counts()

bool              1
datetime64[ns]    1
float32           1
float64           1
int64             1
int8              1
object            1
dtype: int64
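
On pandas 1.0 and later, the same tally can be obtained by counting the values of `.dtypes`. This is a sketch on a hypothetical stand-in for `dft` (the name `frame` and its columns are illustrative). The row order of the result may differ from the output above.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for `dft` with four differently typed columns.
frame = pd.DataFrame({
    "A": np.random.rand(3),
    "B": 1,
    "C": "foo",
    "F": False,
})

# .dtypes is a Series of dtype objects, so value_counts()
# tallies how many columns share each dtype.
dtype_counts = frame.dtypes.value_counts()
print(dtype_counts)
```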

**6. Create another small DataFrame (provided).**

In [8]:
# create a small data frame. 

df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
df

Unnamed: 0,a,b,c,d
0,-0.986421,1.278467,-1.325367,0.380944
1,1.172461,0.435899,0.09641,0.400349
2,0.10009,0.261444,-0.174855,-0.008861
3,3.296258,-0.444708,0.071411,0.298726
4,-0.02181,1.146638,1.090202,-0.139361


**7. Use the `.apply()` function to find the square root of all the cells.**

In [9]:
# Use df.apply to find the square root of all the values.
# Negative inputs have no real square root, so they come back
# as NaN ("not a number"), shown as empty cells below.

df.apply(np.sqrt)

Unnamed: 0,a,b,c,d
0,,1.130693,,0.617206
1,1.082803,0.660227,0.3105,0.632731
2,0.31637,0.511316,,
3,1.81556,,0.267228,0.546558
4,,1.070812,1.044128,
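
The NaN behaviour is easier to see on a small deterministic frame. The sketch below uses made-up values and also shows that calling `np.sqrt` on the frame directly gives the same result as `.apply(np.sqrt)`:

```python
import numpy as np
import pandas as pd

# Illustrative values with one negative entry.
frame = pd.DataFrame({"a": [4.0, 9.0], "b": [-1.0, 16.0]})

# Silence NumPy's "invalid value" warning for the negative input.
with np.errstate(invalid="ignore"):
    via_apply = frame.apply(np.sqrt)  # applied column by column
    direct = np.sqrt(frame)           # ufunc applied to the whole frame

print(via_apply)
print(via_apply.equals(direct))  # True
```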


**8. Use `.apply()` to find the mean of the columns.**

In [10]:
# find the mean of each column (axis=0 passes one column at a time)

df.apply(np.mean, axis=0)

a    0.712116
b    0.535548
c   -0.048440
d    0.186359
dtype: float64

**9. Find the mean of the rows.**

In [11]:
# find the mean of each row (axis=1 passes one row at a time)

df.apply(np.mean, axis=1)

0   -0.163095
1    0.526280
2    0.044454
3    0.805422
4    0.518917
dtype: float64
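
The difference between `axis=0` and `axis=1` can be checked by hand on a tiny frame. The values below are illustrative, and the built-in `DataFrame.mean` is included only for comparison:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 3.0], "b": [5.0, 7.0]})

col_means = frame.apply(np.mean, axis=0)  # one value per column
row_means = frame.apply(np.mean, axis=1)  # one value per row

print(col_means.tolist())  # [2.0, 6.0]
print(row_means.tolist())  # [3.0, 5.0]

# The built-in reduction gives the same numbers.
builtin_col_means = frame.mean(axis=0)
builtin_row_means = frame.mean(axis=1)
```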

**10. Use numpy to create a random vector of 50 numbers ranging from 0 to 6.**

*Hint: This can be done with `np.random.randint()`.*

In [12]:
# Let's create a random array with 50 integers ranging
# from 0 to 6. The upper bound of randint is exclusive,
# so we pass 7.

data = np.random.randint(0, 7, size=50)
data

array([6, 4, 3, 6, 5, 5, 5, 0, 5, 0, 0, 1, 1, 6, 4, 3, 6, 2, 4, 0, 4, 4, 4,
       1, 4, 1, 6, 6, 0, 0, 0, 4, 1, 4, 4, 4, 6, 4, 1, 3, 4, 2, 5, 2, 6, 0,
       1, 4, 2, 4])
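
A quick sketch to confirm the bounds. It uses a seeded `RandomState` purely so the check is repeatable; the seed and variable names are assumptions, not part of the exercise.

```python
import numpy as np

# Seeded generator so repeated runs draw the same numbers.
rng_state = np.random.RandomState(0)

# low=0 is inclusive, high=7 is exclusive, so values fall in 0..6.
sample = rng_state.randint(0, 7, size=50)

print(sample.shape)                          # (50,)
print(sample.min() >= 0, sample.max() <= 6)  # True True
```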

**11. Convert the vector to a Series and count the occurrences of each number.**

In [13]:
# convert the array into a series

s = pd.Series(data)

In [14]:
# How many times does each number occur in the series?
# Use the Series method value_counts().

s.value_counts()

4    15
6     8
0     8
1     7
5     5
2     4
3     3
dtype: int64
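
`value_counts()` sorts by frequency by default. The sketch below, on a short made-up Series, shows two common variations: ordering by the values themselves with `sort_index()` and returning proportions with `normalize=True`.

```python
import pandas as pd

s = pd.Series([4, 6, 4, 0, 4, 6])  # illustrative data

counts = s.value_counts()                 # most frequent value first
by_value = s.value_counts().sort_index()  # ordered by the values
shares = s.value_counts(normalize=True)   # proportions instead of counts

print(by_value.index.tolist(), by_value.tolist())  # [0, 4, 6] [1, 3, 2]
print(shares[4])                                   # 0.5
```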

---

# Independent practice: sales data

---
**1. Load the `sales.csv` data set from the datasets directory.**

In [22]:
sales = pd.read_csv('../datasets/sales.csv')
sales.head()

Unnamed: 0,volume_sold,2015_margin,2015_q1_sales,2016_q1_sales
0,18.42076,93.802281,337166.53,337804.05
1,4.77651,21.082425,22351.86,21736.63
2,16.602401,93.612494,277764.46,306942.27
3,4.296111,16.824704,16805.11,9307.75
4,8.156023,35.011457,54411.42,58939.9


**2. Inspect the data types.**

In [7]:
sales.dtypes

volume_sold      float64
2015_margin      float64
2015_q1_sales    float64
2016_q1_sales    float64
dtype: object

**3. Imagine you've found out that all your values in the first column are off by 1. Use `.apply()` or `.map()` to add 1 to the first column of the dataset.**

In [26]:
# pretty simple function to add 1 to a value.
def one_up(value):
    return value + 1


# Using Series.map()
sales['volume_sold'] = sales['volume_sold'].map(one_up)
sales.head()

Unnamed: 0,volume_sold,2015_margin,2015_q1_sales,2016_q1_sales
0,19.42076,93.802281,337166.53,337804.05
1,5.77651,21.082425,22351.86,21736.63
2,17.602401,93.612494,277764.46,306942.27
3,5.296111,16.824704,16805.11,9307.75
4,9.156023,35.011457,54411.42,58939.9


In [27]:
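# With Series.apply(). Run this cell INSTEAD of the .map() cell above:
# running both adds 1 twice, which is why the output below is up by 2.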
sales['volume_sold'] = sales['volume_sold'].apply(one_up)
sales.head()

Unnamed: 0,volume_sold,2015_margin,2015_q1_sales,2016_q1_sales
0,20.42076,93.802281,337166.53,337804.05
1,6.77651,21.082425,22351.86,21736.63
2,18.602401,93.612494,277764.46,306942.27
3,6.296111,16.824704,16805.11,9307.75
4,10.156023,35.011457,54411.42,58939.9


In [29]:
# This can also be done without .map() or .apply(), using
# vectorized arithmetic (shown here on the 2015_margin column).
sales['2015_margin'] = sales['2015_margin'] + 1
sales.head()

Unnamed: 0,volume_sold,2015_margin,2015_q1_sales,2016_q1_sales
0,20.42076,94.802281,337166.53,337804.05
1,6.77651,22.082425,22351.86,21736.63
2,18.602401,94.612494,277764.46,306942.27
3,6.296111,17.824704,16805.11,9307.75
4,10.156023,36.011457,54411.42,58939.9
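
The three approaches are interchangeable for this task. As a sketch, the snippet below applies each one to the same untouched starting Series (made-up values, not taken from `sales.csv`) so that 1 is added exactly once in every case:

```python
import pandas as pd

# Illustrative values standing in for the volume_sold column.
volume = pd.Series([18.5, 4.75, 16.5])

def one_up(value):
    """Return the value increased by 1."""
    return value + 1

via_map = volume.map(one_up)      # element-wise with Series.map
via_apply = volume.apply(one_up)  # element-wise with Series.apply
vectorized = volume + 1           # vectorized arithmetic, no Python loop

print(vectorized.tolist())  # [19.5, 5.75, 17.5]
```

On large columns the vectorized form is usually the fastest of the three, since it avoids calling a Python function once per element.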


**4. Use `.value_counts()` to count the values of the first column of the dataset.**

In [31]:
sales['volume_sold'].value_counts()

18.059971    1
9.200364     1
7.531061     1
12.800347    1
10.156023    1
15.635150    1
7.047530     1
7.906274     1
11.292241    1
6.200122     1
5.210727     1
13.997117    1
13.232324    1
7.778097     1
47.556096    1
6.385455     1
10.017945    1
8.630904     1
17.697651    1
10.300180    1
6.776510     1
12.186622    1
8.255355     1
13.975769    1
7.294400     1
13.697456    1
9.930415     1
11.697422    1
11.014870    1
13.129382    1
            ..
11.313785    1
8.048228     1
11.347349    1
9.682494     1
11.348477    1
6.456466     1
13.505838    1
9.437252     1
53.800686    1
13.625606    1
12.677295    1
12.252870    1
13.840780    1
6.941294     1
4.794631     1
31.878030    1
12.331430    1
9.434821     1
8.447040     1
9.196779     1
9.785867     1
13.091709    1
14.509967    1
7.147729     1
14.350027    1
7.388070     1
10.555078    1
8.618174     1
7.723896     1
12.705327    1
Name: volume_sold, dtype: int64

**BONUS: The `.value_counts()` output in question four isn't very insightful. Use [Pandas.DataFrame.round](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.round.html) to round the column to the nearest tenth before running `.value_counts()`.**

In [38]:
sales['volume_sold'].round(decimals=1).value_counts()

8.3     7
10.6    6
9.8     6
12.3    6
11.3    5
9.4     5
6.8     4
12.1    4
7.2     4
7.8     3
12.7    3
7.9     3
9.6     3
10.2    3
8.7     3
10.8    3
10.5    3
14.5    3
8.0     3
12.6    3
9.2     3
7.0     3
6.9     3
13.1    3
10.1    3
6.4     3
10.4    3
10.0    2
7.3     2
7.1     2
       ..
10.7    1
8.9     1
13.5    1
20.7    1
15.0    1
6.5     1
8.1     1
5.7     1
52.3    1
4.8     1
11.7    1
5.6     1
7.4     1
18.1    1
53.7    1
8.2     1
6.6     1
5.2     1
15.8    1
18.6    1
20.4    1
13.6    1
5.1     1
11.9    1
15.6    1
53.8    1
78.2    1
7.7     1
13.2    1
17.7    1
Name: volume_sold, dtype: int64
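
Why rounding helps: raw floats are almost never exactly equal, so every value is counted once, while rounding groups nearby values together. A minimal sketch with made-up numbers (not from `sales.csv`):

```python
import pandas as pd

volume = pd.Series([8.26, 8.31, 8.34, 10.62, 10.58])  # illustrative values

raw_counts = volume.value_counts()               # every value is unique
rounded_counts = volume.round(1).value_counts()  # nearby values collapse

print(raw_counts.max())         # 1
print(rounded_counts.tolist())  # [3, 2]
```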