# Introduction

This notebook will explain the following topics and concepts:


1) Converting to Strings 

2) Converting to Numerics

3) Categorical Data

4) Converting to Categories

5) Manipulating Categories


# Import the data

In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

import pandas as pd


df_tips = pd.read_csv(filepath_or_buffer='../Data/tips.csv')

print(df_tips.dtypes)

ID              int64
total_bill    float64
tip           float64
gender         object
smoker         object
day            object
time           object
size            int64
dtype: object


# Converting to Strings

Use `astype(...)`:
- The main parameter is `dtype`, which can be any built-in Python type: `str`, `float`, `int`, `bool`, `complex`.
- It can also be any dtype defined in the NumPy library.
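
As a minimal sketch of the two kinds of `dtype` argument described above, the following converts a small Series using a built-in Python type and a NumPy dtype. The Series `s` and its values are hypothetical and are not taken from `tips.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series, used only for illustration
s = pd.Series([1, 2, 3])

as_float = s.astype(float)    # built-in Python type
as_int8 = s.astype(np.int8)   # NumPy dtype
as_str = s.astype(str)        # each value becomes a Python string

print(as_float.dtype)
print(as_int8.dtype)
print(as_str.tolist())
```

Note that `astype` returns a new Series; the original `s` is left unchanged unless the result is assigned back.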


In [2]:
# Before
print(df_tips['gender'].dtypes)

# After
df_tips['gender_str'] = df_tips['gender'].astype(dtype=str)
print(df_tips['gender_str'].dtypes)

object
object


# Converting to Numeric Values

Use the same `astype(...)` method.

In [3]:
# Before
print(df_tips.dtypes)

# Change total_bill to string
df_tips['total_bill'] = df_tips['total_bill'].astype(str)
print(df_tips.dtypes)

# Change total_bill back to a float
df_tips['total_bill'] = df_tips['total_bill'].astype(float)
print(df_tips.dtypes)

ID              int64
total_bill    float64
tip           float64
gender         object
smoker         object
day            object
time           object
size            int64
gender_str     object
dtype: object
ID              int64
total_bill     object
tip           float64
gender         object
smoker         object
day            object
time           object
size            int64
gender_str     object
dtype: object
ID              int64
total_bill    float64
tip           float64
gender         object
smoker         object
day            object
time           object
size            int64
gender_str     object
dtype: object


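Use `pandas.to_numeric()` when a column contains values that `astype` cannot convert; its `errors` parameter controls how those values are handled.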

In [4]:
# Artificially construct some data with a value of 'missing' for some of the total_bill values
df_tmp = df_tips.head(10).copy()

df_tmp.loc[[1, 3, 5, 7], 'total_bill'] = 'missing'

# Before: total_bill is now an object column
print(df_tmp.dtypes)


# astype(float) raises a ValueError: cannot convert 'missing' to a float
#df_tmp['total_bill'].astype(float)

# to_numeric also raises a ValueError by default (errors='raise')
#pd.to_numeric(df_tmp['total_bill'])

# errors='ignore' returns the input unchanged when conversion fails
# (this option is deprecated since pandas 2.2)
pd.to_numeric(df_tmp['total_bill'], errors='ignore')

# errors='coerce' converts 'missing' to NaN
pd.to_numeric(df_tmp['total_bill'], errors='coerce')


ID              int64
total_bill     object
tip           float64
gender         object
smoker         object
day            object
time           object
size            int64
gender_str     object
dtype: object


0    16.99
1      NaN
2    21.01
3      NaN
4    24.59
5      NaN
6     8.77
7      NaN
8    15.04
9    14.78
Name: total_bill, dtype: float64
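
After coercing, it is often useful to check which raw values were turned into NaN. The following is a minimal sketch of one way to do that; the Series `raw` and its values are hypothetical and stand in for a column like `total_bill` read as text.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, with a placeholder string mixed in
raw = pd.Series(['16.99', 'missing', '21.01', 'missing', '24.59'])

# errors='coerce' turns anything that cannot be parsed into NaN
converted = pd.to_numeric(raw, errors='coerce')

# Values that were present in the raw data but became NaN failed to parse
failed = raw[converted.isna() & raw.notna()]

print(converted.dtype)
print(failed.unique().tolist())
```

Inspecting `failed` before discarding or filling the NaN values helps confirm that only expected placeholders were coerced.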

# Categorical Data

Some values naturally fall into categories, e.g.:

- gender (male, female)
- risk (high, medium, low)
- asset class (equity, fixed income, commodity)

Use `astype('category')` to have pandas re-code a Series or a DataFrame column as a categorical.
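
One practical reason to use a categorical is memory: each distinct value is stored once and the rows hold small integer codes. The sketch below illustrates this with a hypothetical `risk` Series (the values and the repetition factor are made up for illustration).

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times
risk = pd.Series(['high', 'low', 'medium', 'low', 'high'] * 1000)

risk_cat = risk.astype('category')

# The distinct values, stored once
print(risk_cat.cat.categories.tolist())

# Compare memory footprint of the text column and the categorical column
print(risk.memory_usage(deep=True))
print(risk_cat.memory_usage(deep=True))
```

The saving depends on how many distinct values there are relative to the number of rows; a column where almost every value is unique gains little from this conversion.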

In [5]:
# change the gender column to a string
df_tips['gender'] = df_tips['gender'].astype('str')

# before
print(df_tips.info())

# after - change gender to a category
df_tips['gender'] = df_tips['gender'].astype('category')
print(df_tips.info())

# Display the Categories
df_tips['gender'].cat.categories

# Are they ordered
df_tips['gender'].cat.ordered

# Display the codes
df_tips['gender'].cat.codes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          244 non-null    int64  
 1   total_bill  244 non-null    float64
 2   tip         244 non-null    float64
 3   gender      244 non-null    object 
 4   smoker      244 non-null    object 
 5   day         244 non-null    object 
 6   time        244 non-null    object 
 7   size        244 non-null    int64  
 8   gender_str  244 non-null    object 
dtypes: float64(2), int64(2), object(5)
memory usage: 17.3+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   ID          244 non-null    int64   
 1   total_bill  244 non-null    float64 
 2   tip         244 non-null    float64 
 3   gender      244 non-null    category
 4   smoker      244 

0      0
1      1
2      1
3      1
4      0
      ..
239    1
240    0
241    1
242    1
243    0
Length: 244, dtype: int8
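
The integer codes shown above are positions into `cat.categories`. As a minimal sketch of that relationship, the following rebuilds the original labels from the codes; the Series `gender` here is a hypothetical four-row example, not the full `df_tips` column.

```python
import pandas as pd

# Hypothetical four-row categorical Series
gender = pd.Series(['Female', 'Male', 'Male', 'Female'], dtype='category')

codes = gender.cat.codes            # one small integer per row
categories = gender.cat.categories  # the distinct labels

# Each code is the position of the row's label within the categories index
decoded = categories.take(codes.to_numpy())

print(codes.tolist())
print(decoded.tolist())
```

Categories inferred from string data are sorted, so `'Female'` gets code 0 and `'Male'` gets code 1 in this example.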