# Cleaning Data

# Changing Datatype of Columns with astype()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Quick Summary

- pd.to_numeric(titanic.Fare)
- titanic.Fare.astype("float")
- summer.Athlete_Name.str.strip()
- titanic.Survived = titanic.Survived.astype("int")
- titanic.Age = titanic.Age.astype("float")

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Titanic Dataset

In [6]:
import pandas as pd

The titanic_imp file contains an extra index column, so we load titanic_imp2.csv here.

In [7]:
titanic = pd.read_csv("titanic_imp2.csv")
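
If a CSV was saved together with its index, one common way to avoid the extra column is the `index_col` parameter of `pd.read_csv`. The following is a minimal sketch on a hypothetical three-row sample (the `csv_text` string is made up for illustration and is not the real file):

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for a CSV saved with its index.
csv_text = """,Survived,Class,Fare
0,0,3,$7.25
1,1,1,$71.2833
2,1,3,$7.925
"""

# Default read: the unnamed first column becomes a regular column "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))
print(with_extra.columns.tolist())

# index_col=0 tells pandas to use the first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

The second read keeps only the data columns, which is usually what you want before checking dtypes with `info()`.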

In [8]:
titanic.head()

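| | Survived | Class | Gender | Age | SipSp | ParCH | Fare | Emb | Deck |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | $7.25 | S | NaN |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | $71.2833 | C | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | $7.925 | S | NaN |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | $53.1 | S | C |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | $8.05 | S | NaN |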


In [9]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 894 entries, 0 to 893
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Survived  894 non-null    object
 1   Class     894 non-null    int64 
 2   Gender    894 non-null    object
 3   Age       758 non-null    object
 4   SipSp     894 non-null    int64 
 5   ParCH     894 non-null    int64 
 6   Fare      894 non-null    object
 7   Emb       892 non-null    object
 8   Deck      203 non-null    object
dtypes: int64(3), object(6)
memory usage: 63.0+ KB


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## __Fare__ column

In [10]:
pd.to_numeric(titanic.Fare)

ValueError: Unable to parse string "$7.25" at position 0

The conversion fails at position 0 because the "$" sign prevents "$7.25" from being parsed as a number, so we remove it first.

In [11]:
titanic.Fare.str.replace("$", "", regex=False)

0         7.25
1      71.2833
2        7.925
3         53.1
4         8.05
        ...   
889       30.0
890       7.75
891       10.5
892       14.4
893     7.8958
Name: Fare, Length: 894, dtype: object

In [12]:
titanic.Fare = titanic.Fare.str.replace("$", "", regex=False)

In [13]:
titanic.Fare.head()

0       7.25
1    71.2833
2      7.925
3       53.1
4       8.05
Name: Fare, dtype: object

## pd.to_numeric()

In [14]:
pd.to_numeric(titanic.Fare)

0       7.2500
1      71.2833
2       7.9250
3      53.1000
4       8.0500
        ...   
889    30.0000
890     7.7500
891    10.5000
892    14.4000
893     7.8958
Name: Fare, Length: 894, dtype: float64

## astype("float")

In [15]:
titanic.Fare.astype("float")

0       7.2500
1      71.2833
2       7.9250
3      53.1000
4       8.0500
        ...   
889    30.0000
890     7.7500
891    10.5000
892    14.4000
893     7.8958
Name: Fare, Length: 894, dtype: float64

In [16]:
titanic.Fare = titanic.Fare.astype("float")

In [23]:
titanic.Fare

0       7.2500
1      71.2833
2       7.9250
3      53.1000
4       8.0500
        ...   
889    30.0000
890     7.7500
891    10.5000
892    14.4000
893     7.8958
Name: Fare, Length: 894, dtype: float64
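
The Fare steps above can be sketched end to end on a toy Series. The three values are copied from the first rows shown earlier; the variable names are only illustrative. Passing `regex=False` is a defensive choice here, because "$" is also the end-of-string anchor in regular expressions and the default of the `regex` parameter differs between pandas versions.

```python
import pandas as pd

# Toy values copied from the first rows of the Fare column shown above.
fare = pd.Series(["$7.25", "$71.2833", "$7.925"], name="Fare")

# regex=False treats "$" as a literal character, not as a regex anchor.
stripped = fare.str.replace("$", "", regex=False)

# Both calls produce a float64 Series for clean numeric strings.
as_numeric = pd.to_numeric(stripped)
as_float = stripped.astype("float")

print(as_numeric.dtype)
print(as_numeric.equals(as_float))
```

For strings that are already clean, `pd.to_numeric` and `astype("float")` give the same result; they differ mainly in how they let you handle values that cannot be parsed.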

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## __Survived__ column

This column needs a replace operation before it can be converted, as the failed conversion below shows.

In [17]:
titanic.Survived = titanic.Survived.astype("int")

ValueError: invalid literal for int() with base 10: 'yes'

In [18]:
titanic.Survived.value_counts()

0      551
1      341
yes      1
no       1
Name: Survived, dtype: int64

In [19]:
titanic.Survived = titanic.Survived.replace(to_replace=["yes", "no"], value=[1, 0])

In [20]:
titanic.Survived.value_counts()

0    551
1    341
1      1
0      1
Name: Survived, dtype: int64

In [21]:
titanic.Survived = titanic.Survived.astype("int")

In [22]:
titanic.Survived.value_counts()

0    552
1    342
Name: Survived, dtype: int64
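
The same replace-then-cast idea can be tried on a small, hypothetical sample. This sketch replaces the labels with digit strings rather than integers so the column stays uniformly text until the cast, and it requests `"int64"` explicitly, since plain `"int"` can resolve to a 32-bit type on some platforms.

```python
import pandas as pd

# Hypothetical sample mixing digit strings with "yes"/"no", like Survived.
survived = pd.Series(["0", "1", "yes", "no", "1"], name="Survived")

# Map the text labels to digit strings first, then cast the whole column.
survived = survived.replace({"yes": "1", "no": "0"}).astype("int64")

print(survived.tolist())
print(survived.dtype)
```

After the cast, `value_counts()` groups all zeros and ones together, because every entry is now an integer.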

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## __Age__ column

In [24]:
titanic.Age = titanic.Age.astype("float")

ValueError: could not convert string to float: 'Missing Data'

The conversion fails because the column contains the string 'Missing Data', which cannot be converted to a float.

We will solve this case in the missing values topic.
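
As a preview, one possible approach (not necessarily the one used later in this course) is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN instead of raising. A minimal sketch on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample: numeric strings plus the "Missing Data" placeholder.
age = pd.Series(["22.0", "38.0", "Missing Data", "35.0"], name="Age")

# errors="coerce" converts anything that cannot be parsed into NaN.
age_numeric = pd.to_numeric(age, errors="coerce")

print(age_numeric.dtype)
print(age_numeric.isna().sum())
```

The trade-off is that coercion silently hides every unparseable value, so it is worth inspecting them with `value_counts()` first.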

In [25]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 894 entries, 0 to 893
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  894 non-null    int32  
 1   Class     894 non-null    int64  
 2   Gender    894 non-null    object 
 3   Age       758 non-null    object 
 4   SipSp     894 non-null    int64  
 5   ParCH     894 non-null    int64  
 6   Fare      894 non-null    float64
 7   Emb       892 non-null    object 
 8   Deck      203 non-null    object 
dtypes: float64(1), int32(1), int64(3), object(4)
memory usage: 59.5+ KB


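- The Survived column is now an integer type (int32).
- The Fare column is now float64.
- The Age column is still object, because its conversion failed.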

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Olympic Dataset

In [26]:
summer = pd.read_csv("summer_imp.csv")

In [28]:
summer.head()

| | Year | City | Sport | Discipline | Athlete Name | Country | Gender | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1896 | Athens | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100M Freestyle | Gold Medal |
| 1 | 1896 | Athens | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100M Freestyle | Silver |
| 2 | 1896 | Athens | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100M Freestyle For Sailors | Bronze |
| 3 | 1896 | Athens | Aquatics | Swimming | Malokinis, Ioannis | GRE | Men | 100M Freestyle For Sailors | Gold Medal |
| 4 | 1896 | Athens | Aquatics | Swimming | Chasapis, Spiridon | GRE | Men | 100M Freestyle For Sailors | Silver |


In [29]:
summer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31170 entries, 0 to 31169
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Year          31170 non-null  int64 
 1   City          31170 non-null  object
 2   Sport         31170 non-null  object
 3   Discipline    31170 non-null  object
 4   Athlete Name  31170 non-null  object
 5   Country       31166 non-null  object
 6   Gender        31170 non-null  object
 7   Event         31170 non-null  object
 8   Medal         31170 non-null  object
dtypes: int64(1), object(8)
memory usage: 2.1+ MB


Nothing needs to change for the summer dataset: every datatype is already correct.