# PyRasgo Data Type Mismatch

This notebook shows how to use `pyrasgo` to identify values in a column that have the wrong data type.

### Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)
* [PyRasgo](https://app.gitbook.com/@rasgo/s/rasgo-docs/pyrasgo-0.1/dataframe-prep)

In [1]:
import pandas as pd
import pyrasgo

## Connect to Rasgo

NB: This cell does not run yet, as this functionality has not yet been built.

In [None]:
api_key = pyrasgo.register(email='<your email>')
rasgo = pyrasgo.connect(api_key)

## Creating the data

For this example, we will create a dataframe in which each column holds a single data type: text, numeric, Boolean, and datetime.

In [3]:
df = pd.DataFrame({'A': ['text']*20,
                   'B': [1, 2.2]*10,
                   'C': [True, False]*10,
                   'D': pd.to_datetime('2020-01-01')
                  })

Next, add some mistyped data to the dataframe.

In [4]:
df.iloc[0,0] = 1
df.iloc[1,0] = -2
df.iloc[10,0] = pd.to_datetime('2021-01-01')
df.iloc[5,1] = '2.2'
df.iloc[7,1] = 'A+B'
df.iloc[4,2] = 1
df.iloc[5,2] = 'False'
df.iloc[9,2] = -12.6
df.iloc[12,2] = 'text'
df.iloc[2,3] = 12
df.iloc[12,3] = '2020-01-01'
df

Unnamed: 0,A,B,C,D
0,1,1.0,True,2020-01-01 00:00:00
1,-2,2.2,False,2020-01-01 00:00:00
2,text,1.0,True,12
3,text,2.2,False,2020-01-01 00:00:00
4,text,1.0,1,2020-01-01 00:00:00
5,text,2.2,False,2020-01-01 00:00:00
6,text,1.0,True,2020-01-01 00:00:00
7,text,A+B,False,2020-01-01 00:00:00
8,text,1.0,True,2020-01-01 00:00:00
9,text,2.2,-12.6,2020-01-01 00:00:00
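
Before calling `pyrasgo`, it can help to confirm that a column really holds mixed types. The snippet below is a minimal pandas-only sketch; the `sample` frame and `type_counts` name are hypothetical and only mirror a few values from column `B` above.

```python
import pandas as pd

# Hypothetical mini-frame mirroring a few values from column 'B' above.
sample = pd.DataFrame({'B': [1, 2.2, '2.2', 'A+B']})

# Mixed content makes pandas fall back to the generic object dtype.
print(sample['B'].dtype)

# Count the Python type of each element to see what the column really holds.
type_counts = sample['B'].map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```

A column whose count shows more than one type is a candidate for the mismatch check in the next section.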


## Identify mistyped data

The function `evaluate.type_mismatches` casts `column` to `data_type` and returns a one-column dataframe holding the recast values. Any element that cannot be converted to the requested type is returned as a missing value (**NaN**, or **NaT** for datetimes).
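
The cast-and-mark-as-missing idea can be illustrated with plain pandas. This is only a sketch of the general technique using `pd.to_numeric(errors='coerce')`; it is an assumption that `pyrasgo` behaves similarly, not a description of its implementation. The `values` series is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample: two numbers, a numeric string, and a non-numeric string.
values = pd.Series([1, 2.2, '2.2', 'A+B'], name='B')

# errors='coerce' converts what it can and replaces the rest with NaN.
as_numeric = pd.to_numeric(values, errors='coerce')
print(as_numeric)

# Only the element that could not be parsed as a number is flagged.
print(as_numeric.isna())
```

Note that the string `'2.2'` converts cleanly here, so only `'A+B'` ends up missing.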

### Cast to numeric

In [5]:
new_column_df = rasgo.evaluate.type_mismatches(df, column='B', data_type='numeric')
new_column_df

0.95%: 1 rows of 20 rows cannot convert.


Unnamed: 0,BCastToNumeric
0,1.0
1,2.2
2,1.0
3,2.2
4,1.0
5,2.2
6,1.0
7,
8,1.0
9,2.2


Convert this to a Boolean series using the pandas function `isnull` and use that series to return the rows that hold non-numeric data.

In [6]:
df[new_column_df.isnull().iloc[:,0]]

Unnamed: 0,A,B,C,D
7,text,A+B,False,2020-01-01 00:00:00
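
An `isnull` mask flags every missing value in the recast column, which would presumably also include values that were already missing before the cast. The pandas-only sketch below shows one way to separate the two cases; the `original`, `converted`, and `failed` names are hypothetical, and `pd.to_numeric` stands in for the `pyrasgo` call.

```python
import pandas as pd

# Hypothetical column with one pre-existing missing value and one bad value.
original = pd.Series([1, None, 'A+B', '3'], name='B')

# Stand-in for the recast column: unparseable values become NaN.
converted = pd.to_numeric(original, errors='coerce')

# A failed cast is missing after conversion but was present before it.
failed = converted.isna() & original.notna()
print(original[failed])
```

Here only `'A+B'` is reported; the `None` entry is left out because it was missing to begin with.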


### Cast to datetime

In [7]:
new_column_df = rasgo.evaluate.type_mismatches(df, column='D', data_type='datetime')
new_column_df

0.95%: 1 rows of 20 rows cannot convert.


Unnamed: 0,DCastToDatetime
0,2020-01-01
1,2020-01-01
2,NaT
3,2020-01-01
4,2020-01-01
5,2020-01-01
6,2020-01-01
7,2020-01-01
8,2020-01-01
9,2020-01-01


Convert this to a Boolean series using the pandas function `isnull` and use that series to return the data that is not a datetime.

In [8]:
df[new_column_df.isnull().iloc[:,0]]

Unnamed: 0,A,B,C,D
2,text,1.0,True,12
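
For comparison, the same pattern can be sketched in plain pandas with `pd.to_datetime(errors='coerce')`. This is an illustration under assumptions, not `pyrasgo`'s implementation, and the `raw` series is hypothetical. One caveat: pandas by default interprets bare integers as offsets from the Unix epoch, so a value such as `12` may convert to a 1970 timestamp rather than **NaT** with this approach, unlike the output shown above.

```python
import pandas as pd

# Hypothetical sample: two valid date strings and one that is not a date.
raw = pd.Series(['2020-01-01', 'not a date', '2021-06-30'], name='D')

# errors='coerce' replaces unparseable entries with NaT.
as_datetime = pd.to_datetime(raw, errors='coerce')

# Use the missing-value mask to pull out the original offending entries.
bad_rows = raw[as_datetime.isna()]
print(bad_rows)
```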
