![title banner](../banners/start_banner.png)

# Topic : Datatype Inconsistencies

This notebook provides sample code for dealing with datatype inconsistencies.

Datatype inconsistencies in this context are situations where a feature is stored as a type that is not the most useful one for analysis. For example, a feature holding money amounts is best used when it is numeric. However, if a symbol like '$' is included in the values, the feature's type becomes non-numeric, and you will need to write separate code to clean it up.
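
The money example above can be sketched as follows. The `prices` Series and its values are made up for illustration and are not part of the NBA dataset; the approach (strip the symbols, then convert with `pd.to_numeric`) is one reasonable option, not the only one.

```python
import pandas as pd

# Hypothetical data: amounts stored as text because of '$' and ','
prices = pd.Series(['$1,200.50', '$35', '$0.99'], name='price')

# Strip the currency symbol and thousands separator as literal characters
stripped = (
    prices.str.replace('$', '', regex=False)
          .str.replace(',', '', regex=False)
)

# Convert the remaining digit strings to a numeric dtype
cleaned = pd.to_numeric(stripped)

print(pd.api.types.is_numeric_dtype(prices))   # text column before cleaning
print(pd.api.types.is_numeric_dtype(cleaned))  # numeric column after cleaning
```

Passing `regex=False` matters here because '$' is a special character in regular expressions.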

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../test_datasets/nba.csv')
df.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


Look at the **'Height'** feature. Its values are written as feet and inches separated by a hyphen (for example, 6-2 means 6 feet 2 inches), so it is non-numeric. Let's clean this feature up and convert it to numeric.

- Step 1 : Split all values in **'Height'** by the hyphen
- Step 2 : Calculate height in inches as (feet * 12) + inches
- Step 3 : Add the result to the DataFrame as a new feature

In [3]:
# split by hyphen
hght = df['Height'].str.split("-", expand=True)
hght.columns = ['feet', 'inches']

# calculate height in inches
hght = hght.astype('float')
hght['hght_in'] = (hght['feet'] * 12) + hght['inches']

# add new feature
df['height_in'] = hght['hght_in'].copy()
df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,height_in
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,74.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,78.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,77.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,77.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,82.0
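
The cell above assumes every non-missing value has exactly the feet-inches form; `astype('float')` raises an error if a value cannot be parsed. As a more defensive sketch, the same three steps can be wrapped in a helper that turns malformed entries into `NaN` instead. The function name `height_to_inches` and the sample values are hypothetical, chosen only for illustration.

```python
import pandas as pd

def height_to_inches(heights: pd.Series) -> pd.Series:
    """Convert 'feet-inches' strings such as '6-2' to total inches."""
    # Capture feet and inches; rows that do not match become NaN
    parts = heights.str.extract(r'^\s*(\d+)-(\d+)\s*$')
    feet = pd.to_numeric(parts[0], errors='coerce')
    inches = pd.to_numeric(parts[1], errors='coerce')
    return feet * 12 + inches

# Hypothetical sample including a missing and a malformed value
sample = pd.Series(['6-2', '6-10', None, 'unknown'])
result = height_to_inches(sample)
print(result)
```

On the notebook's data this could be used as `df['height_in'] = height_to_inches(df['Height'])`.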


The inconsistency has now been resolved: **'height_in'** is a numeric feature.

![end banner](../banners/finish_banner.png)