# Solutions

1. [Changing Data Types](#1.-Changing-Data-Types)
1. [Categorical Data Type](#2.-Categorical-Data-Type)
1. [Nullable Integer Data Type](#3.-Nullable-Integer-Data-Type)

## 1. Changing Data Types

In [1]:
import pandas as pd
import numpy as np

### Exercise 1

<span  style="color:green; font-size:16px">Find the maximum value of a 16-bit integer. Then verify it with NumPy's `iinfo` function.</span>

In [2]:
2 ** 15 - 1

32767

In [3]:
np.iinfo('int16')

iinfo(min=-32768, max=32767, dtype=int16)
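
The same arithmetic generalizes to other widths. The sketch below assumes two's-complement signed integers; the helper name `signed_int_range` is made up for illustration and is not part of NumPy.

```python
import numpy as np

def signed_int_range(bits):
    """Return (min, max) for a two's-complement signed integer with `bits` bits."""
    # One bit encodes the sign, leaving bits - 1 bits for the magnitude.
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

# Check the formula against NumPy for the common widths.
for bits in (8, 16, 32, 64):
    info = np.iinfo(f'int{bits}')
    assert signed_int_range(bits) == (info.min, info.max)
    print(bits, signed_int_range(bits))
```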

### Exercise 2

<span  style="color:green; font-size:16px">Read in the bikes data and select the `tripduration` column. Find its data type and then use the `memory_usage` method to find how much memory (in bytes) it is using. Change its data type to the smallest possible type so that no information is lost. What percentage of memory has been saved?</span>

In [4]:
bikes = pd.read_csv('../data/bikes.csv')
td = bikes['tripduration']
td.head()

0     993
1     623
2    1040
3     667
4     130
Name: tripduration, dtype: int64

In [5]:
td.memory_usage()

400840

Find the minimum and maximum values.

In [6]:
td.agg(['min', 'max'])

min       60
max    86188
Name: tripduration, dtype: int64

Unfortunately, a uint16 cannot hold the maximum: its upper limit of 65,535 is below 86,188.

In [7]:
np.iinfo('uint16')

iinfo(min=0, max=65535, dtype=uint16)

We need to use 32 bits. Although you can use uint32, it's probably best to stick with int32, as it is much more common.

In [8]:
np.iinfo('int32')

iinfo(min=-2147483648, max=2147483647, dtype=int32)
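
The manual search through `iinfo` can be automated. This is a rough sketch rather than the course's method: `smallest_int_dtype` is a hypothetical helper, and the two-element `sample` Series merely stands in for the column's reported extremes.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(lo, hi):
    """Return the name of the narrowest signed NumPy integer type holding [lo, hi]."""
    for name in ('int8', 'int16', 'int32', 'int64'):
        info = np.iinfo(name)
        if info.min <= lo and hi <= info.max:
            return name
    raise OverflowError('range does not fit in a 64-bit signed integer')

# Min and max of tripduration as reported above.
print(smallest_int_dtype(60, 86188))   # int32

# pandas can run a similar search itself with downcast='integer'.
sample = pd.Series([60, 86188])        # stand-in values, not the full column
print(pd.to_numeric(sample, downcast='integer').dtype)
```

Only signed types are considered here, matching the preference for int32 over uint32.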

In [9]:
td_32 = td.astype('int32')
td_32.head()

0     993
1     623
2    1040
3     667
4     130
Name: tripduration, dtype: int32

In [10]:
td_32.memory_usage()

200484

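We went from using 64 bits for every value to 32 bits, which should cut memory use by about 50%. We verify this below; the ratio is slightly above 0.5 because `memory_usage` also counts the index, which is the same size for both Series.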

In [11]:
td_32.memory_usage() / td.memory_usage()

0.5001596647041213
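
To see where the small excess over 0.5 comes from, the index can be left out of the measurement with `index=False`. The example uses a synthetic Series with an arbitrary row count, not the bikes data.

```python
import pandas as pd

n = 1_000                                   # hypothetical row count
s64 = pd.Series(range(n), dtype='int64')
s32 = s64.astype('int32')

# Excluding the index, the data alone shrinks by exactly half.
data_ratio = s32.memory_usage(index=False) / s64.memory_usage(index=False)
print(data_ratio)                           # 0.5

# Including the index (the default) adds the same fixed cost to both sides,
# which pushes the ratio slightly above 0.5.
index_bytes = s64.memory_usage() - s64.memory_usage(index=False)
total_ratio = s32.memory_usage() / s64.memory_usage()
print(index_bytes, total_ratio)
```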

### Exercise 3

<span  style="color:green; font-size:16px">Create three different Series. Make them each have a different data type and have a different number of items. Make a fourth Series that has these three Series as the values. Output the fourth Series. Can you make sense of it?</span>

In [12]:
s1 = pd.Series([1, 2])
s2 = pd.Series([4.3, 2.1, 1.554])
s3 = pd.Series(['python', 'pandas', 'numpy'])
s4 = pd.Series([s1, s2, s3])
s4

0                           0    1
1    2
dtype: int64
1      0    4.300
1    2.100
2    1.554
dtype: float64
2    0    python
1    pandas
2     numpy
dtype: object
dtype: object

It's very hard to decipher what is going on. Series objects are not designed to contain other Series.
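
A quick inspection shows what pandas actually stored, and `pd.concat` with `keys` is one possible alternative that keeps the three groups distinguishable. The key labels below are arbitrary names chosen for this sketch.

```python
import pandas as pd

s1 = pd.Series([1, 2])
s2 = pd.Series([4.3, 2.1, 1.554])
s3 = pd.Series(['python', 'pandas', 'numpy'])
s4 = pd.Series([s1, s2, s3])

# Each element of s4 is an entire Series stored as a generic Python object.
print(s4.dtype)            # object
print(type(s4.iloc[0]))

# One alternative: stack the values and label each group with an outer index level.
combined = pd.concat([s1, s2, s3], keys=['ints', 'floats', 'strings'])
print(combined.loc['floats'])
```

The combined Series still has the object data type because it mixes integers, floats, and strings, but each value is now a scalar rather than a nested Series.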

### Exercise 4

<span  style="color:green; font-size:16px">What month is it 1 million minutes after the Unix epoch?</span>

In [13]:
s = pd.Series([1000000]).astype('datetime64[m]')
s

0   1971-11-26 10:40:00
dtype: datetime64[ns]

In [14]:
s.dt.month_name()

0    November
dtype: object

In the time series part, you will learn how to do this in a more direct manner.

In [15]:
pd.Timestamp(1000000, unit='m').month_name()

'November'
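
The result can also be cross-checked with explicit date arithmetic, which avoids relying on a unit-specific cast such as `datetime64[m]` (newer pandas releases may not accept that cast). This is only a sketch of an equivalent calculation.

```python
import pandas as pd

epoch = pd.Timestamp('1970-01-01')              # the Unix epoch
moment = epoch + pd.Timedelta(minutes=1_000_000)

print(moment)                # 1971-11-26 10:40:00
print(moment.month_name())   # November
```

One million minutes is 694 days plus 10 hours 40 minutes, which lands in late November 1971.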

### Exercise 5

<span  style="color:green; font-size:16px">Convert the following Series to float.</span>

In [16]:
s = pd.Series(['1.9', '43', 'python'])
s

0       1.9
1        43
2    python
dtype: object

In [17]:
pd.to_numeric(s, errors='coerce')

0     1.9
1    43.0
2     NaN
dtype: float64
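
For contrast, a plain `astype` call does not tolerate the unparseable string. The sketch below assumes the failure surfaces as a `ValueError` (or a subclass of it), which is the behavior I would expect but which may vary with the pandas version and string backend.

```python
import pandas as pd

s = pd.Series(['1.9', '43', 'python'])

# astype refuses to guess: the unparseable string raises an error.
try:
    s.astype('float64')
    astype_failed = False
except ValueError as err:
    astype_failed = True
    print('astype failed:', err)

# errors='coerce' turns anything unparseable into NaN instead.
converted = pd.to_numeric(s, errors='coerce')
print(converted.isna().tolist())   # [False, False, True]
```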

### Exercise 6

<span  style="color:green; font-size:16px">Take a look at the `dpcapacity_start` column from the bikes dataset. It contains the capacity of the bike rack when the ride began. This number should be an integer but it is a float. Why do you think pandas read this in as a float? Do something to the DataFrame as a whole so that you can convert just this column to an integer. Choose the smallest integer type possible.</span>

In [18]:
bikes['dpcapacity_start'].head()

0    11.0
1    31.0
2    15.0
3    19.0
4    19.0
Name: dpcapacity_start, dtype: float64

There must be missing values. NumPy integer types cannot represent them, so pandas stores the column as floats.

In [19]:
bikes['dpcapacity_start'].isna().sum()

6

The values easily fit within an 8-bit integer.

In [20]:
bikes['dpcapacity_start'].agg(['min', 'max'])

min     0.0
max    55.0
Name: dpcapacity_start, dtype: float64

Let's drop the missing values just for that column and then convert it to an `int8`.

In [21]:
bikes2 = bikes.dropna(subset=['dpcapacity_start']) \
              .astype({'dpcapacity_start': 'int8'})
bikes2['dpcapacity_start'].head()

0    11
1    31
2    15
3    19
4    19
Name: dpcapacity_start, dtype: int8
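
The need for `dropna` can be demonstrated on a tiny frame. The data below is invented for illustration (only the column name is borrowed from the bikes dataset), and the example assumes the failed cast raises a `ValueError` or one of its subclasses.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the bikes data; only the column name is borrowed.
df = pd.DataFrame({'dpcapacity_start': [11.0, np.nan, 15.0],
                   'rider': ['a', 'b', 'c']})

# NaN has no integer representation, so a direct cast is rejected.
try:
    df.astype({'dpcapacity_start': 'int8'})
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print('cast failed:', err)

# Dropping the affected rows first makes the cast legal.
clean = df.dropna(subset=['dpcapacity_start']).astype({'dpcapacity_start': 'int8'})
print(clean['dpcapacity_start'].tolist())   # [11, 15]
print(len(df) - len(clean), 'row(s) removed')
```

Note the trade-off: `dropna` removes entire rows, so the other columns lose those observations too.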

## 2. Categorical Data Type

Execute the cell below to read in the diamonds dataset and use it to answer the following questions.

In [22]:
diamonds = pd.read_csv('../data/diamonds.csv')
diamonds.head(3)

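| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |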


In [23]:
pd.set_option('display.max_colwidth', 100)
pd.read_csv('../data/dictionaries/diamonds_dictionary.csv')

| | Column Name | Description |
|---|---|---|
| 0 | carat | weight of the diamond (0.2--5.01) |
| 1 | clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
| 2 | color | diamond colour, from J (worst) to D (best) |
| 3 | cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| 4 | depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79) |
| 5 | price | price in US dollars ($326--$18,823) |
| 6 | table | width of top of diamond relative to widest point (43--95) |
| 7 | x | length in mm (0--10.74) |
| 8 | y | width in mm (0--58.9) |
| 9 | z | depth in mm (0--31.8) |


### Exercise 1

<span  style="color:green; font-size:16px">Create a new DataFrame `diamonds2` that has the columns clarity, color, and cut as ordered categoricals.</span>

In [24]:
clarity_cats = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
clarity_dtype = pd.CategoricalDtype(clarity_cats, ordered=True)

color_cats = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
color_dtype = pd.CategoricalDtype(color_cats, ordered=True)

cut_cats = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cut_dtype = pd.CategoricalDtype(cut_cats, ordered=True)

diamonds2 = diamonds.astype({'clarity': clarity_dtype, 'color': color_dtype, 'cut': cut_dtype})
diamonds2.head(3)

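| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |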


In [25]:
diamonds2.dtypes

carat       float64
cut        category
color      category
clarity    category
depth       float64
table       float64
price         int64
x           float64
y           float64
z           float64
dtype: object
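
The data types alone do not show what `ordered=True` buys you. As a small illustration on hand-picked values (not rows from the diamonds file), an ordered categorical supports comparisons, `min`/`max`, and sorting by the declared order.

```python
import pandas as pd

cut_cats = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cut_dtype = pd.CategoricalDtype(cut_cats, ordered=True)

# A few hand-picked values, not rows from the diamonds file.
cuts = pd.Series(['Ideal', 'Good', 'Premium', 'Fair'], dtype=cut_dtype)

# Comparisons follow the category order rather than alphabetical order.
better_than_good = cuts > 'Good'
print(better_than_good.tolist())    # [True, False, True, False]

# min, max, and sorting also respect the declared order.
print(cuts.min(), cuts.max())       # Fair Ideal
print(cuts.sort_values().tolist())  # ['Fair', 'Good', 'Premium', 'Ideal']
```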

### Exercise 2

<span  style="color:green; font-size:16px">Find the number of occurrences for each cut type sorted by its cut order.</span>

In [26]:
diamonds2['cut'].value_counts().sort_index()

Fair          1610
Good          4906
Very Good    12082
Premium      13791
Ideal        21551
Name: cut, dtype: int64
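
The reason `sort_index` yields quality order here is the ordered categorical index; with plain strings the same call sorts alphabetically. The sketch below uses made-up observations to show the difference.

```python
import pandas as pd

cut_cats = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
# Made-up observations for illustration only.
raw = pd.Series(['Ideal', 'Good', 'Ideal', 'Premium', 'Fair', 'Ideal'])

# Plain strings: sort_index orders the labels alphabetically.
print(raw.value_counts().sort_index().index.tolist())
# ['Fair', 'Good', 'Ideal', 'Premium']

# Ordered categorical: sort_index follows the declared order, and unused
# categories such as 'Very Good' still appear with a count of 0.
ordered = raw.astype(pd.CategoricalDtype(cut_cats, ordered=True))
counts = ordered.value_counts().sort_index()
print(counts.tolist())          # [1, 1, 0, 1, 3]
print(counts.index.tolist())    # ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
```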

## 3. Nullable Integer Data Type