***Tara's Note***: Related to [this repo](https://github.com/lighthouse-labs/Formatting-Data)

# Converting data to a numeric type in Pandas

This is a notebook for the Medium article [Converting data to a numeric type in Pandas](https://bindichen.medium.com/converting-data-to-a-numeric-type-in-pandas-db9415caab0b).

Please check out the article for instructions.

**License**: [BSD 2-Clause](https://opensource.org/licenses/BSD-2-Clause)

In [1]:
import pandas as pd
import numpy as np

In [2]:
def load_df(): 
  return pd.DataFrame({
    'string_col': ['1','2','3','4'],
    'int_col': [1,2,3,4],
    'float_col': [1.1,1.2,1.3,4.7],
    'mix_col': ['a', 2, 3, 4],
    'missing_col': [1.0, 2, 3, np.nan],
    'money_col': ['£1,000.00', '£2,400.00', '£2,400.00', '£2,400.00'],
    'boolean_col': [True, False, True, True],
    'custom': ['Y', 'Y', 'N', 'N']
  })

In [3]:
df = load_df()
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom
0,1,1,1.1,a,1.0,"£1,000.00",True,Y
1,2,2,1.2,2,2.0,"£2,400.00",False,Y
2,3,3,1.3,3,3.0,"£2,400.00",True,N
3,4,4,4.7,4,,"£2,400.00",True,N


### Display data types

In [4]:
df.dtypes

string_col      object
int_col          int64
float_col      float64
mix_col         object
missing_col    float64
money_col       object
boolean_col       bool
custom          object
dtype: object

In [19]:
## Showing data type of a column
df.int_col.dtypes

dtype('int64')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   string_col   4 non-null      object 
 1   int_col      4 non-null      int64  
 2   float_col    4 non-null      float64
 3   mix_col      4 non-null      object 
 4   missing_col  3 non-null      float64
 5   money_col    4 non-null      object 
 6   boolean_col  4 non-null      bool   
 7   custom       4 non-null      object 
dtypes: bool(1), float64(2), int64(1), object(4)
memory usage: 356.0+ bytes


## 1. Converting string/int to int/float

In [21]:
# string to int
df.string_col = df.string_col.astype('int')
df.dtypes

string_col       int32
int_col          int64
float_col      float64
mix_col         object
missing_col    float64
money_col       object
boolean_col       bool
custom          object
dtype: object

In [22]:
# For more memory efficiency
df['string_col'] = df['string_col'].astype('int8')
df['string_col'] = df['string_col'].astype('int16')
df.string_col = df.string_col.astype('int32')
df.dtypes

string_col       int32
int_col          int64
float_col      float64
mix_col         object
missing_col    float64
money_col       object
boolean_col       bool
custom          object
dtype: object

In [23]:
# string to float
df['string_col'] = df['string_col'].astype('float')
df.dtypes

string_col     float64
int_col          int64
float_col      float64
mix_col         object
missing_col    float64
money_col       object
boolean_col       bool
custom          object
dtype: object

In [25]:
# For more precision (float128 is platform-dependent; it is not available on Windows)
df['string_col'] = df['string_col'].astype('float128')
df.dtypes

string_col     float64
int_col          int64
float_col      float64
mix_col         object
missing_col    float64
money_col       object
boolean_col       bool
custom          object
dtype: object

In [26]:
# For more memory efficiency
df['string_col'] = df['string_col'].astype('float16')
df['string_col'] = df['string_col'].astype('float32')

## 2. Converting float to int

In [27]:
df['float_col'] = df['float_col'].astype('int')
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom
0,1.0,1,1,a,1.0,"£1,000.00",True,Y
1,2.0,2,1,2,2.0,"£2,400.00",False,Y
2,3.0,3,1,3,3.0,"£2,400.00",True,N
3,4.0,4,4,4,,"£2,400.00",True,N


In [28]:
df = load_df()
df['float_col'] = df['float_col'].round(0).astype('int')
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom
0,1,1,1,a,1.0,"£1,000.00",True,Y
1,2,2,1,2,2.0,"£2,400.00",False,Y
2,3,3,1,3,3.0,"£2,400.00",True,N
3,4,4,5,4,,"£2,400.00",True,N


## 3. Converting a column of mixed data types

In [57]:
# Getting ValueError
df['mix_col'] = df['mix_col'].astype('int')

ValueError: invalid literal for int() with base 10: 'a'

In [29]:
df['mix_col'] = pd.to_numeric(df['mix_col'], errors='coerce')
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom
0,1,1,1,,1.0,"£1,000.00",True,Y
1,2,2,1,2.0,2.0,"£2,400.00",False,Y
2,3,3,1,3.0,3.0,"£2,400.00",True,N
3,4,4,5,4.0,,"£2,400.00",True,N


In [None]:
# The output is float value
df['mix_col'].dtypes

dtype('float64')

In [30]:
# To convert it to integer
df['mix_col'] = pd.to_numeric(
    df['mix_col'], 
    errors='coerce'
).astype('Int64')

df['mix_col'].dtypes

Int64Dtype()

In [None]:
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom
0,1,1,1,,1.0,"£1,000.00",True,Y
1,2,2,1,2.0,2.0,"£2,400.00",False,Y
2,3,3,1,3.0,3.0,"£2,400.00",True,N
3,4,4,5,4.0,,"£2,400.00",True,N


In [31]:
df['mix_col'] = pd.to_numeric(
    df['mix_col'], 
    errors='coerce'
).fillna(0).astype('int')
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom
0,1,1,1,0,1.0,"£1,000.00",True,Y
1,2,2,1,2,2.0,"£2,400.00",False,Y
2,3,3,1,3,3.0,"£2,400.00",True,N
3,4,4,5,4,,"£2,400.00",True,N


## 4. Handling missing values

In [32]:
df.missing_col.dtypes

dtype('float64')

In [33]:
# Raises IntCastingNaNError (a subclass of ValueError) because of the NaN
df['missing_col'].astype('int')

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

In [34]:
# Convert to Pandas Int64
df['missing_col'] = df['missing_col'].astype('Int64')
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom
0,1,1,1,0,1.0,"£1,000.00",True,Y
1,2,2,1,2,2.0,"£2,400.00",False,Y
2,3,3,1,3,3.0,"£2,400.00",True,N
3,4,4,5,4,,"£2,400.00",True,N


In [None]:
# Replacing NaN with 0 (note: the result is assigned to mix_col, so missing_col itself is left unchanged)
df['mix_col'] = df['missing_col'].fillna(0).astype('int')
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom
0,1,1,1,1,1.0,"£1,000.00",True,Y
1,2,2,1,2,2.0,"£2,400.00",False,Y
2,3,3,1,3,3.0,"£2,400.00",True,N
3,4,4,5,0,,"£2,400.00",True,N


## 5. Converting money column to float

In [35]:
df['money_replace'] = df['money_col'].str.replace('£', '').str.replace(',','')
df['money_replace'] = pd.to_numeric(df['money_replace'])
df['money_replace']

0    1000.0
1    2400.0
2    2400.0
3    2400.0
Name: money_replace, dtype: float64

In [None]:
# Raw string with a character class: strip both '£' and ',' in one pass
df['money_regex'] = df['money_col'].str.replace(r'[£,]', '', regex=True)
df['money_regex'] = pd.to_numeric(df['money_regex'])
df['money_regex']

0    1000.0
1    2400.0
2    2400.0
3    2400.0
Name: money_regex, dtype: float64

## 6. Converting boolean to 0/1

In [None]:
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom,money_replace,money_regex
0,1,1,1,1,1.0,"£1,000.00",True,Y,1000.0,1000.0
1,2,2,1,2,2.0,"£2,400.00",False,Y,2400.0,2400.0
2,3,3,1,3,3.0,"£2,400.00",True,N,2400.0,2400.0
3,4,4,5,0,,"£2,400.00",True,N,2400.0,2400.0


In [None]:
df['boolean_col'] = df['boolean_col'].astype('int') 
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom,money_replace,money_regex
0,1,1,1,1,1.0,"£1,000.00",1,Y,1000.0,1000.0
1,2,2,1,2,2.0,"£2,400.00",0,Y,2400.0,2400.0
2,3,3,1,3,3.0,"£2,400.00",1,N,2400.0,2400.0
3,4,4,5,0,,"£2,400.00",1,N,2400.0,2400.0


## 7. Convert multiple data columns at once

In [None]:
# Reset DataFrame
df = load_df()

# Convert one column at a time
df['string_col'] = df['string_col'].astype('float16')
df['int_col'] = df['int_col'].astype('float16')
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom
0,1.0,1.0,1.1,a,1.0,"£1,000.00",True,Y
1,2.0,2.0,1.2,2,2.0,"£2,400.00",False,Y
2,3.0,3.0,1.3,3,3.0,"£2,400.00",True,N
3,4.0,4.0,4.7,4,,"£2,400.00",True,N


In [None]:
# Reset DataFrame
df = load_df()

# Converting multiple columns at once
df = df.astype({
    'string_col': 'float16',
    'int_col': 'float16'
})
df

Unnamed: 0,string_col,int_col,float_col,mix_col,missing_col,money_col,boolean_col,custom
0,1.0,1.0,1.1,a,1.0,"£1,000.00",True,Y
1,2.0,2.0,1.2,2,2.0,"£2,400.00",False,Y
2,3.0,3.0,1.3,3,3.0,"£2,400.00",True,N
3,4.0,4.0,4.7,4,,"£2,400.00",True,N


## 8. Defining data type of each column when reading a CSV file

In [None]:
df = pd.read_csv(
    'data.csv', 
    dtype={
        'string_col': 'float16',
        'int_col': 'float16'
    }
)
df.dtypes

Unnamed: 0       int64
string_col     float16
int_col        float16
float_col      float64
mix_col         object
missing_col    float64
money_col       object
boolean_col       bool
custom          object
dtype: object
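Since the contents of `data.csv` are not shown in this notebook, the following is a minimal, self-contained sketch of the same idea. It assumes a small in-memory CSV (the `csv_text` string and the `df_csv` name are hypothetical stand-ins, not the original file) and passes the same kind of `dtype` mapping to `pd.read_csv()`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv (an assumption for illustration)
csv_text = (
    "string_col,int_col,float_col\n"
    "1,1,1.1\n"
    "2,2,1.2\n"
    "3,3,1.3\n"
    "4,4,4.7\n"
)

# Columns named in `dtype` are cast while reading; the rest are inferred
df_csv = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        'string_col': 'float16',
        'int_col': 'float16'
    }
)
print(df_csv.dtypes)
```

Columns that are not listed in the `dtype` mapping, such as `float_col` here, keep the type that Pandas infers for them.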

## 9. Creating a custom function to convert data to numeric

In [None]:
df = load_df()

In [None]:
def convert_money(value):
    value = value.replace('£','').replace(',', '')
    return float(value)

df['money_col'].apply(convert_money)

0    1000.0
1    2400.0
2    2400.0
3    2400.0
Name: money_col, dtype: float64

In [None]:
# lambda function
df['money_col'].apply(lambda v: v.replace('£','').replace(',', '')).astype('float')

0    1000.0
1    2400.0
2    2400.0
3    2400.0
Name: money_col, dtype: float64

## 10. `astype()` vs `to_numeric()`

The simplest way to convert one data type to another is the `astype()` method, which is supported by both Pandas **DataFrame** and **Series**. If you already have a numeric data type (int8, int16, int32, int64, float16, float32, float64, float128, or boolean), you can also use `astype()` to:

* convert it to another numeric data type (int to float, float to int, etc.)
* downcast to a smaller or upcast to a larger byte size
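The two uses listed above can be sketched on a small Series. This is only an illustration under simple assumptions: the Series `s` and the variable names are made up for the example, and the values are small enough to fit in `int8`.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], dtype='int64')

# Convert to another numeric type: int -> float
as_float = s.astype('float64')

# Downcast to a smaller byte size, then upcast again
as_small = s.astype('int8')
back_up = as_small.astype('int64')

print(as_float.dtype, as_small.dtype, back_up.dtype)

# Bytes used by the values only (index excluded): 8 bytes vs 1 byte per element
print(s.memory_usage(index=False), as_small.memory_usage(index=False))  # 32 4
```

Downcasting only makes sense when every value fits in the smaller type, so it is worth checking the value range of a column first.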

However, `astype()` won't work for a column of mixed types. For instance, `mix_col` contains the string `a` and `missing_col` contains `NaN`. If we try to convert either of them to an integer with `astype()`, we get a **ValueError**. As of Pandas 0.20.0, this error can be suppressed by setting the argument `errors='ignore'`, but the original data is then returned untouched.
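A short sketch of this behaviour follows. The two Series are stand-ins modelled on `mix_col` and `missing_col`, and the `failures` dictionary is just a helper for the example. It assumes a Pandas version in which `astype()` still accepts `errors='ignore'`.

```python
import numpy as np
import pandas as pd

mixed = pd.Series(['a', 2, 3, 4])               # modelled on mix_col
with_nan = pd.Series([1.0, 2.0, 3.0, np.nan])   # modelled on missing_col

# Both conversions fail; the NaN case raises a subclass of ValueError
failures = {}
for name, series in {'mixed': mixed, 'with_nan': with_nan}.items():
    try:
        series.astype('int')
    except ValueError as exc:
        failures[name] = type(exc).__name__
print(failures)

# errors='ignore' suppresses the error but returns the data unconverted
ignored = mixed.astype('int', errors='ignore')
print(ignored.dtype)  # object
```

Because `errors='ignore'` hands back the original values, the column is still not numeric afterwards, which is easy to miss.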

The Pandas `to_numeric()` function handles these values more gracefully. Rather than failing, we can set the argument `errors='coerce'` to coerce invalid values to `NaN`.
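As a minimal sketch, here is the same coercion on a standalone Series modelled on `mix_col` (the variable names are chosen for this example only). The result is a float column because `NaN` is a float value, and it can then be cast to the nullable `Int64` type if integers are needed.

```python
import pandas as pd

mixed = pd.Series(['a', 2, 3, 4])   # modelled on mix_col

# Invalid values become NaN instead of raising an error
coerced = pd.to_numeric(mixed, errors='coerce')
print(coerced.dtype)                # float64
print(coerced.isna().tolist())      # [True, False, False, False]

# Optional: nullable integer type that can hold the missing value
as_nullable_int = coerced.astype('Int64')
print(as_nullable_int.dtype)        # Int64
```

Counting the missing values before and after the conversion is a simple way to see how many entries were coerced.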