# Solutions

1. [Integer, Float, and Boolean Data types](#1.-Integer,-Float,-and-Boolean-Data-types)
1. [Object, String, and Categorical Data Types](#2.-Object,-String,-and-Categorical-Data-Types)
1. [Datetime, Timedelta, and Period Data Types](#3.-Datetime,-Timedelta,-and-Period-Data-Types)
1. [DataFrame Data Type Conversion](#4.-DataFrame-Data-Type-Conversion)

## 1. Integer, Float, and Boolean Data types

In [1]:
import pandas as pd
import numpy as np

### Exercise 1

<span style="color:green; font-size:16px">Find the maximum number of a 16-bit integer using arithmetic operations. Then verify it with numpy's `iinfo` function.</span>

In [2]:
2 ** 15 - 1

32767

In [3]:
np.iinfo('int16')

iinfo(min=-32768, max=32767, dtype=int16)
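
The same arithmetic applies to other widths: one bit of an n-bit signed integer is reserved for the sign, so the maximum is `2 ** (n - 1) - 1`. The sketch below checks this against `np.iinfo` for the standard widths. The helper name `signed_max` is hypothetical and used only for illustration.

```python
import numpy as np

def signed_max(bits):
    """Largest value of a signed integer with the given number of bits."""
    return 2 ** (bits - 1) - 1

for bits in (8, 16, 32, 64):
    info = np.iinfo(f'int{bits}')
    # the arithmetic result should agree with numpy's reported maximum
    assert signed_max(bits) == info.max
    # the minimum is one further from zero than the maximum
    assert -(2 ** (bits - 1)) == info.min
```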

### Exercise 2

<span style="color:darkgreen; font-size:16px">Construct a Series that has the nullable integer data type. Make sure it has a mix of integers and missing values.</span>

In [4]:
s = pd.Series([pd.NA, np.nan, 5, -999], dtype='Int16')
s

0    <NA>
1    <NA>
2       5
3    -999
dtype: Int16

### Exercise 3

<span style="color:darkgreen; font-size:16px">Take a look at the Series below. Change its data type such that it uses the least amount of memory and preserves the numbers as they are.</span>

In [5]:
s = pd.Series([1_000, 60_000])
s

0     1000
1    60000
dtype: int64

In [6]:
s.astype('uint16')

0     1000
1    60000
dtype: uint16
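
One way to confirm that `uint16` is the smallest type that fits is to compare the data against the limits from `np.iinfo`, or to let `pd.to_numeric` pick the type with its `downcast` parameter. This is a minimal sketch of both checks; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1_000, 60_000])

# range check: both values must fit between the dtype's min and max
info = np.iinfo('uint16')
fits = info.min <= s.min() and s.max() <= info.max

# let pandas pick the smallest unsigned integer type that holds the data
smallest = pd.to_numeric(s, downcast='unsigned')
```

An 8-bit unsigned integer tops out at 255, so it cannot hold either value, and a signed `int16` stops at 32767, below 60000.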

### Exercise 4

<span style="color:green; font-size:16px">Find the precision of a 32-bit float and then create a numpy array with values that have decimal places past that precision.</span>

In [7]:
np.finfo('float32')

finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)

In [8]:
np.array([1.00000001, 1.123456789], dtype='float32')

array([1.       , 1.1234568], dtype=float32)
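
The output above can be checked directly. In this sketch, `precision` and `eps` are attributes of the object returned by `np.finfo`, and the comparison shows that a change in the eighth decimal place is lost in `float32` but kept in `float64`.

```python
import numpy as np

info = np.finfo('float32')
# number of decimal digits float32 can represent reliably
digits = info.precision
# eps is the gap between 1.0 and the next representable float32
eps = info.eps

# 1e-8 is far smaller than eps, so the value collapses to exactly 1.0
collapsed = np.float32(1.00000001) == np.float32(1.0)
# the same value survives in float64
preserved = np.float64(1.00000001) != np.float64(1.0)
```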

### Exercise 5

<span style="color:green; font-size:16px">Create a Series of numbers that have decimal places. Use the `astype` method to convert it to an integer and then back to a float. Are the decimals from the original Series preserved?</span>

In [9]:
# No. Converting to an integer truncates the decimals, and they cannot be recovered.
s = pd.Series([4.98, -23.123])
s

0     4.980
1   -23.123
dtype: float64

In [10]:
s.astype('int64')

0     4
1   -23
dtype: int64

In [11]:
s.astype('int64').astype('float64')

0     4.0
1   -23.0
dtype: float64

### Exercise 6

<span style="color:green; font-size:16px">Create a numpy array with 8-bit unsigned integers. Use negative numbers in the construction along with numbers greater than 255. Does the output make sense?</span>

In [12]:
np.array([-1, 0, 1, 2, 255, 256], dtype='uint8')

array([255,   0,   1,   2, 255,   0], dtype=uint8)
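
The output makes sense once you see that out-of-range values wrap around modulo 256, the number of values an 8-bit integer can hold. The sketch below reproduces the result by casting an existing integer array with `astype`. That choice is an assumption made for portability, since newer NumPy releases may reject out-of-range Python integers passed straight to the constructor.

```python
import numpy as np

values = np.array([-1, 0, 1, 2, 255, 256])

# cast an existing integer array; out-of-range values wrap around
wrapped = values.astype('uint8')

# the wrapped result matches the remainder after dividing by 256
expected = values % 256
```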

### Exercise 7

<span style="color:green; font-size:16px">Create a numpy array that has the values 50 and 100 in it, but do so without actually using those two values (or any operations that create them).</span>

In [13]:
# use the ability of numpy to wrap numbers around to the start of the range.
# there are 256 numbers in the range beginning at 0
np.array([306, 356], dtype='uint8')

array([ 50, 100], dtype=uint8)

### Exercise 8

<span style="color:green; font-size:16px">Create a numpy array that contains two integers and the numpy nan missing value. Assign it to a variable name and output it to the screen. What data type is it?</span>

In [14]:
# notice the decimals in the output
a = np.array([99, -88, np.nan])
a

array([ 99., -88.,  nan])

In [15]:
a.dtype

dtype('float64')

### Exercise 9

<span style="color:green; font-size:16px">Construct a Series from the array created in exercise 8. What data type is it? Construct a new Series with the same array forcing it to be a nullable integer.</span>

In [16]:
pd.Series(a)

0    99.0
1   -88.0
2     NaN
dtype: float64

In [17]:
pd.Series(a, dtype='Int64')

0      99
1     -88
2    <NA>
dtype: Int64

### Exercise 10

<span style="color:green; font-size:16px">Construct a Series of 32-bit nullable integers using the data type object itself (and not the string).</span>

In [18]:
pd.Series([1, 5, pd.NA], dtype=pd.Int32Dtype())

0       1
1       5
2    <NA>
dtype: Int32

## 2. Object, String, and Categorical Data Types

In [19]:
import pandas as pd
import numpy as np

### Exercise 1

<span style="color:green; font-size:16px">Using its constructor, create a Series containing three two-item lists of integers. Then call the `sum` method on the Series. What is returned?</span>

In [20]:
s = pd.Series([[3, 4], [-9, 99], [5, 59]])
s

0      [3, 4]
1    [-9, 99]
2     [5, 59]
dtype: object

A single list with all values together.

In [21]:
s.sum()

[3, 4, -9, 99, 5, 59]

### Exercise 2

<span style="color:green; font-size:16px">Use the constructor to create a Series of integers, floats, and booleans. Do not set the `dtype` parameter. What data type is your Series?</span>

In [22]:
# object
pd.Series([1, 3.2, True])

0       1
1     3.2
2    True
dtype: object

### Exercise 3

<span style="color:green; font-size:16px">Construct a Series with the same values but force the data type to be a float. Does it work? What happens to the non-float values?</span>

In [23]:
# Yes, it works. The non-float values become floats, and True becomes 1.0.
pd.Series([1, 3.2, True], dtype='float64')

0    1.0
1    3.2
2    1.0
dtype: float64

### Exercise 4

<span style="color:green; font-size:16px">Construct a Series containing three strings and the four missing values `None`, `np.nan`, `pd.NA`, and `pd.NaT` assigning the result to a variable.</span>

In [24]:
s = pd.Series(['Houston', 'Rockets', 'Basketball', None, np.nan, pd.NA, pd.NaT])
s

0       Houston
1       Rockets
2    Basketball
3          None
4           NaN
5          <NA>
6           NaT
dtype: object

In [25]:
s.astype('string')

0       Houston
1       Rockets
2    Basketball
3          <NA>
4          <NA>
5          <NA>
6          <NA>
dtype: string

### Exercise 5

<span style="color:green; font-size:16px">Using pandas, count the number of missing values in exercise 4.</span>

In [26]:
# pandas treats each one as a missing value
s.isna().sum()

4

### Exercise 6

<span style="color:green; font-size:16px">Convert the Series from exercise 4 to the new string data type. Notice what happens to the missing values.</span>

In [27]:
# pandas only uses pd.NA for missing values in string columns,
# unlike object columns, which keep each missing value as it was given
s.astype('string')

0       Houston
1       Rockets
2    Basketball
3          <NA>
4          <NA>
5          <NA>
6          <NA>
dtype: string

### Read in the movie dataset

Execute the cell below to read in the first 10 columns of the movie dataset setting the index to be the title.

In [28]:
pd.set_option('display.max_columns', 100)
movie = pd.read_csv('../data/movie.csv', index_col='title', usecols=range(10))
movie.head(3)

| title | year | color | content_rating | duration | director_name | director_fb | actor1 | actor1_fb | actor2 |
|---|---|---|---|---|---|---|---|---|---|
| Avatar | 2009.0 | Color | PG-13 | 178.0 | James Cameron | 0.0 | CCH Pounder | 1000.0 | Joel David Moore |
| Pirates of the Caribbean: At World's End | 2007.0 | Color | PG-13 | 169.0 | Gore Verbinski | 563.0 | Johnny Depp | 40000.0 | Orlando Bloom |
| Spectre | 2015.0 | Color | PG-13 | 148.0 | Sam Mendes | 0.0 | Christoph Waltz | 11000.0 | Rory Kinnear |


### Exercise 7

<span style="color:green; font-size:16px">Which of the columns above are good candidates for the categorical data type?</span>

It can be helpful to use the `value_counts` or `nunique` methods to get more information on the columns.

In [29]:
movie['color'].value_counts()

Color              4693
Black and White     204
Name: color, dtype: int64

In [30]:
movie['content_rating'].nunique()

18

In [31]:
movie['director_name'].nunique()

2397

In [32]:
movie['actor1'].nunique()

2095

In [33]:
movie['actor2'].nunique()

3030

I would make `color` and `content_rating` categorical, as both have known, limited, and discrete values. Although the `year` column is discrete, it is not exactly limited, as more years of data will arrive in the future. The `director_name`, `actor1`, and `actor2` columns are discrete and do repeat, but each has a substantial number of unique values that make up a large percentage of the overall values. I would leave those as objects.
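
One rough way to compare candidates is the share of rows holding a distinct value. The sketch below uses a small made-up DataFrame as a stand-in for the movie columns; the data and the name `unique_ratio` are hypothetical.

```python
import pandas as pd

# hypothetical stand-in for a few movie columns
df = pd.DataFrame({
    'color': ['Color', 'Color', 'Black and White', 'Color', 'Color', 'Color'],
    'director_name': ['A', 'B', 'C', 'D', 'E', 'A'],
})

# fraction of rows that hold a distinct value; lower suggests a better
# candidate for the categorical data type
unique_ratio = df.nunique() / len(df)
```

A column with a ratio close to 1 repeats very little, so converting it to categorical is unlikely to save much memory.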

### Exercise 8

<span style="color:green; font-size:16px">Select the `content_rating` column as a Series and convert it to categorical. Assign the result to the variable `rating`.</span>

In [34]:
rating = movie['content_rating'].astype('category')
rating.head(3)

title
Avatar                                      PG-13
Pirates of the Caribbean: At World's End    PG-13
Spectre                                     PG-13
Name: content_rating, dtype: category
Categories (18, object): [Approved, G, GP, M, ..., TV-Y, TV-Y7, Unrated, X]

### Exercise 9

<span style="color:green; font-size:16px">Write an expression that returns the number of categories.</span>

In [35]:
len(rating.cat.categories)

18

### Exercise 10

<span style="color:green; font-size:16px">Prove that the `str` accessor still works with categorical columns by making the ratings lowercase.</span>

In [36]:
rating.str.lower().head()

title
Avatar                                        pg-13
Pirates of the Caribbean: At World's End      pg-13
Spectre                                       pg-13
The Dark Knight Rises                         pg-13
Star Wars: Episode VII - The Force Awakens      NaN
Name: content_rating, dtype: object

### Exercise 11

<span style="color:green; font-size:16px">Assign the rating 'GGG' as the first value.</span>

In [37]:
rating = rating.cat.add_categories('GGG')
rating.loc['Avatar'] = 'GGG'
rating.head(3)

title
Avatar                                        GGG
Pirates of the Caribbean: At World's End    PG-13
Spectre                                     PG-13
Name: content_rating, dtype: category
Categories (19, object): [Approved, G, GP, M, ..., TV-Y7, Unrated, X, GGG]

### Exercise 12

<span style="color:green; font-size:16px">Convert the following Series to integer.</span>

In [38]:
s = pd.Series(['1', '2'])

In [39]:
s.astype('int64')

0    1
1    2
dtype: int64

### Exercise 13

<span style="color:green; font-size:16px">Convert the following Series to integer.</span>

In [40]:
s = pd.Series(['1', '2', 'BAD DATA'])

In [41]:
pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    NaN
dtype: float64
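
The result above is a float because `NaN` cannot be stored in a numpy integer column. If an integer data type is required, one option is to cast the coerced result to the nullable integer type, sketched here:

```python
import pandas as pd

s = pd.Series(['1', '2', 'BAD DATA'])

# unparseable strings become NaN, which forces a float result
as_float = pd.to_numeric(s, errors='coerce')

# casting to the nullable integer type turns NaN into <NA>
as_int = as_float.astype('Int64')
```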

### Read in the diamonds dataset

Execute the next cell to read in the diamonds dataset.

In [42]:
diamonds = pd.read_csv('../data/diamonds.csv')
diamonds.head(3)

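| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |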


### Exercise 14

<span style="color:green; font-size:16px">Select the `cut` column as a Series and convert it to an ordered categorical. Use the data dictionary from above. Assign it to the variable `cut_cat`.</span>

In [43]:
categories = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cut_dtype = pd.CategoricalDtype(categories, ordered=True)
cut_cat = diamonds['cut'].astype(cut_dtype)
cut_cat.head()

0      Ideal
1    Premium
2       Good
3    Premium
4       Good
Name: cut, dtype: category
Categories (5, object): [Fair < Good < Very Good < Premium < Ideal]

### Exercise 15

<span style="color:green; font-size:16px">By only knowing that `cut_cat` is an ordered categorical, write an expression to get the percentage of diamonds that have the lowest category.</span>

In [44]:
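# sort_index orders the result from the lowest to the highest category, so
# .iloc[-1] selects the highest category (Ideal); use .iloc[0] for the lowest (Fair)
cut_cat.value_counts(normalize=True).sort_index().iloc[-1]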

0.3995365220615499
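
After `sort_index`, position 0 holds the lowest category and position -1 the highest. The sketch below shows both on a small made-up sample that uses the same category order; the sample values are hypothetical.

```python
import pandas as pd

# hypothetical sample; the categories follow the order used above
categories = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cut_dtype = pd.CategoricalDtype(categories, ordered=True)
sample = pd.Series(['Ideal', 'Fair', 'Ideal', 'Good', 'Ideal'], dtype=cut_dtype)

shares = sample.value_counts(normalize=True).sort_index()
lowest_share = shares.iloc[0]    # share of 'Fair'
highest_share = shares.iloc[-1]  # share of 'Ideal'
```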

## 3. Datetime, Timedelta, and Period Data Types

In [45]:
import numpy as np, pandas as pd

### Exercise 1

<span style="color:green; font-size:16px">Create a numpy array of datetimes with year precision for the years 2000, 2010, and 2020. Assign the result to a variable.</span>

In [46]:
# epoch is 1970
a = np.array([30, 40, 50], dtype='datetime64[Y]')
a

array(['2000', '2010', '2020'], dtype='datetime64[Y]')

### Exercise 2

<span style="color:green; font-size:16px">Staying in numpy, convert the array created in exercise 1 to a data type with second precision and assign the result to a new variable.</span>

In [47]:
b = a.astype('datetime64[s]')
b

array(['2000-01-01T00:00:00', '2010-01-01T00:00:00',
       '2020-01-01T00:00:00'], dtype='datetime64[s]')

### Exercise 3

<span style="color:green; font-size:16px">Staying in numpy, use the `astype` method to return the number of seconds after the epoch for each value from the array created in exercise 2.</span>

In [48]:
b.astype('int64')

array([ 946684800, 1262304000, 1577836800])

### Exercise 4

<span style="color:green; font-size:16px">Use the integers from exercise 3 within the numpy array constructor to get the same result as exercise 2.</span>

In [49]:
np.array([ 946684800, 1262304000, 1577836800], dtype='datetime64[s]')

array(['2000-01-01T00:00:00', '2010-01-01T00:00:00',
       '2020-01-01T00:00:00'], dtype='datetime64[s]')

### Exercise 5

<span style="color:green; font-size:16px">Construct a Series of integers for the years 2000, 2010, and 2020. Then convert it to datetime with the `astype` method.</span>

In [50]:
s = pd.Series([30, 40, 50])
s

0    30
1    40
2    50
dtype: int64

In [51]:
s.astype('datetime64[Y]')

0   2000-01-01
1   2010-01-01
2   2020-01-01
dtype: datetime64[ns]
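
Support for non-nanosecond units in `astype` has changed across pandas releases, so `datetime64[Y]` may not be accepted in every version. As a hedged alternative, the sketch below does the year conversion in numpy first and then builds the Series. The resulting precision (`ns` or `s`) is assumed to depend on the installed pandas version.

```python
import numpy as np
import pandas as pd

# years since the 1970 epoch, converted with numpy's year precision
years = np.array([30, 40, 50], dtype='datetime64[Y]')

# pandas stores the values at a precision it supports
s = pd.Series(years)
year_values = s.dt.year.tolist()
```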

### Exercise 6

<span style="color:green; font-size:16px">What month is it 1 million minutes after the unix epoch?</span>

In [52]:
s = pd.Series([1000000]).astype('datetime64[m]')
s

0   1971-11-26 10:40:00
dtype: datetime64[ns]

In [53]:
# get month as a string
s.dt.month_name()

0    November
dtype: object

In the time series part, you will learn how to do this in a more direct manner using the Timestamp constructor.

In [54]:
pd.Timestamp(1000000, unit='m').month_name()

'November'

### Exercise 7

<span style="color:green; font-size:16px">Construct a datetime Series using strings with precision down to nanoseconds (9 digits after the decimal)</span>

In [55]:
pd.Series(['2020-01-31 15:45:59.123456789', 
           '2020-02-29 15:45:59.123456789'], dtype='datetime64[ns]')

0   2020-01-31 15:45:59.123456789
1   2020-02-29 15:45:59.123456789
dtype: datetime64[ns]

### Exercise 8

<span style="color:green; font-size:16px">Using only arithmetic operations, find the amount of time 1 million seconds is. Report your answer as 'W days, X hours, Y minutes, Z seconds'.</span>

In [56]:
num = 1_000_000
seconds_in_day = 24 * 60 * 60
days, seconds_remaining = divmod(num, seconds_in_day)
days, seconds_remaining

(11, 49600)

In [57]:
seconds_in_hour = 60 * 60
hours, seconds_remaining = divmod(seconds_remaining, seconds_in_hour)
hours, seconds_remaining

(13, 2800)

In [58]:
seconds_in_minutes = 60
minutes, seconds = divmod(seconds_remaining, seconds_in_minutes)
minutes, seconds

(46, 40)

In [59]:
f'{days} days, {hours} hours, {minutes} minutes, {seconds} seconds'

'11 days, 13 hours, 46 minutes, 40 seconds'
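
The three `divmod` steps can be collected into one function. This is a sketch of the same arithmetic; the function name `breakdown` is hypothetical.

```python
def breakdown(total_seconds):
    """Split a number of seconds into (days, hours, minutes, seconds)."""
    days, rest = divmod(total_seconds, 24 * 60 * 60)
    hours, rest = divmod(rest, 60 * 60)
    minutes, seconds = divmod(rest, 60)
    return days, hours, minutes, seconds

days, hours, minutes, seconds = breakdown(1_000_000)
message = f'{days} days, {hours} hours, {minutes} minutes, {seconds} seconds'
```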

### Exercise 9

<span style="color:green; font-size:16px">Verify the results of exercise 8 by creating a pandas timedelta Series.</span>

In [60]:
pd.Series([1_000_000]).astype('timedelta64[s]')

0   11 days 13:46:40
dtype: timedelta64[ns]

The `to_timedelta` function will be covered in the time series part.

In [61]:
pd.to_timedelta(1_000_000, 's')

Timedelta('11 days 13:46:40')

### Exercise 10

<span style="color:green; font-size:16px">Construct a Series with the data type period that has the hour 10 a.m. through 11 a.m. as the time period on January 1st for the years 2019, 2020, and 2021.</span>

In [62]:
pd.Series(['2019-01-01 10', '2020-01-01 10', '2021-01-01 10'], dtype='Period[h]')

0    2019-01-01 10:00
1    2020-01-01 10:00
2    2021-01-01 10:00
dtype: period[H]

## 4. DataFrame Data Type Conversion

In [63]:
import pandas as pd, numpy as np

### Exercise 1

<span style="color:green; font-size:16px">Read in the bikes dataset and select the `tripduration` column. Find its data type and then use the `memory_usage` method to find how much memory (in bytes) it is using. Change its data type to the smallest possible type so that no information is lost. What percentage of memory has been saved?</span>

In [64]:
bikes = pd.read_csv('../data/bikes.csv')
td = bikes['tripduration']
td.head()

0     993
1     623
2    1040
3     667
4     130
Name: tripduration, dtype: int64

In [65]:
td.memory_usage()

400840

Find the min and max values

In [66]:
td.agg(['min', 'max'])

min       60
max    86188
Name: tripduration, dtype: int64

Unfortunately, a 16-bit unsigned integer, `uint16`, is not quite large enough to hold the maximum value of 86188.

In [67]:
np.iinfo('uint16')

iinfo(min=0, max=65535, dtype=uint16)

We need to use 32 bits. Although you can use `uint32`, it's probably best to stick with `int32`, as it is much more common.

In [68]:
np.iinfo('int32')

iinfo(min=-2147483648, max=2147483647, dtype=int32)

In [69]:
td2 = td.astype('int32')

32-bit integers take up half as much space as 64-bit integers. Let's verify this.

In [70]:
td2.memory_usage(index=False) / td.memory_usage(index=False)

0.5
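
The ratio of 0.5 means 50% of the memory has been saved. The sketch below turns the ratio into a percentage, using a small made-up Series in place of the `tripduration` column; the values are hypothetical apart from the maximum of 86188 shown above.

```python
import pandas as pd

# hypothetical stand-in for the tripduration column
td = pd.Series([993, 623, 1040, 667, 130, 86188], dtype='int64')

before = td.memory_usage(index=False)
after = td.astype('int32').memory_usage(index=False)

# percentage of the original memory that is no longer needed
pct_saved = (1 - after / before) * 100
```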

### Exercise 2

<span style="color:green; font-size:16px">Read in the diamonds dataset and convert the data types of each column so they use the least amount of memory without losing any information.</span>

In [71]:
diamonds = pd.read_csv('../data/diamonds.csv')
diamonds.head(3)

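| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |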


In [72]:
np.finfo('float16')

finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)

In [73]:
np.iinfo('uint16')

iinfo(min=0, max=65535, dtype=uint16)

In [74]:
diamonds.agg(['min', 'max'])

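| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| min | 0.2 | Fair | D | I1 | 43.0 | 43.0 | 326 | 0.0 | 0.0 | 0.0 |
| max | 5.01 | Very Good | J | VVS2 | 79.0 | 95.0 | 18823 | 10.74 | 58.9 | 31.8 |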


In [75]:
clarity_cats = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
clarity_dtype = pd.CategoricalDtype(clarity_cats, ordered=True)

color_cats = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
color_dtype = pd.CategoricalDtype(color_cats, ordered=True)

cut_cats = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cut_dtype = pd.CategoricalDtype(cut_cats, ordered=True)

# each column appears once; a repeated key would silently overwrite the earlier entry
dtype_dict = {'carat': 'float32',
              'cut': cut_dtype,
              'color': color_dtype,
              'clarity': clarity_dtype,
              'depth': 'float32',
              'table': 'float32',
              'price': 'uint16',
              'x': 'float32',
              'y': 'float32',
              'z': 'float32'}

diamonds2 = diamonds.astype(dtype_dict).round(3)
diamonds2.head(3)

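| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.799999 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.900002 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |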
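
To see how much a conversion like this saves, the total memory before and after can be compared with `memory_usage(deep=True)`. The sketch below uses a small made-up DataFrame rather than the diamonds data, so the column contents are hypothetical.

```python
import pandas as pd

# hypothetical data: a low-cardinality string column repeated many times
cuts = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
df = pd.DataFrame({'cut': cuts * 1_000, 'price': range(5_000)})

small = df.astype({'cut': 'category', 'price': 'uint16'})

# deep=True includes the memory held by the strings themselves
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()

# the values themselves are unchanged by the conversion
same_values = (small['cut'].tolist() == df['cut'].tolist()
               and small['price'].tolist() == df['price'].tolist())
```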
