# Cleaning Data

## Removing missing values

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Quick Summary

- titanic[titanic.Emb.isna()]
- titanic.Age.value_counts(dropna = False) # with dropna=False the NaN count (136) is included in the result
----
- titanic.dropna( axis = 0 , how= "any").shape # drops every row that contains at least one NaN
- titanic.dropna( axis = 1 , how= "any").shape
----
- titanic.dropna( axis = 1 , thresh= 500).shape
----
- titanic.dropna( axis = 0 , subset = ["Survived", "Class", "Gender", "Age"], thresh= 4).shape

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Titanic Dataset

In [76]:
import pandas as pd

In [77]:
titanic = pd.read_csv("titanic_imp2.csv")

In [78]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 894 entries, 0 to 893
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Survived  894 non-null    object
 1   Class     894 non-null    int64 
 2   Gender    894 non-null    object
 3   Age       758 non-null    object
 4   SipSp     894 non-null    int64 
 5   ParCH     894 non-null    int64 
 6   Fare      894 non-null    object
 7   Emb       892 non-null    object
 8   Deck      203 non-null    object
dtypes: int64(3), object(6)
memory usage: 63.0+ KB


In [79]:
titanic[titanic.Emb.isna()]

Unnamed: 0,Survived,Class,Gender,Age,SipSp,ParCH,Fare,Emb,Deck
61,1,1,female,38.0,0,0,$80.0,,B
829,1,1,female,62.0,0,0,$80.0,,B


In [80]:
titanic.Age.value_counts(dropna = False)

NaN             136
Missing Data     41
24.0             31
22.0             27
18.0             26
               ... 
80.0              1
70.5              1
34.5              1
53.0              1
102               1
Name: Age, Length: 93, dtype: int64
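
Note that the output above lists both `NaN` (136) and the string `Missing Data` (41). `isna()` and `dropna()` only recognise real missing values, so a placeholder string like this survives `dropna()`. A minimal sketch on a made-up Series (the values and the names `ages` and `cleaned` are illustrative, not taken from the CSV) of one possible way to turn the placeholder into NaN first:

```python
import numpy as np
import pandas as pd

# Made-up sample mimicking the Age column: strings, a placeholder and a real NaN
ages = pd.Series(["22.0", "Missing Data", np.nan, "38.0"], name="Age")

# Only the real NaN is counted; the placeholder string is not
print(ages.isna().sum())

# mask() replaces the entries where the condition is True with NaN
cleaned = ages.mask(ages == "Missing Data")
print(cleaned.isna().sum())

# Now dropna() removes both kinds of missing entries
print(cleaned.dropna().tolist())
```

Whether the placeholder should be treated as missing depends on the dataset; this is only one option.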

In [81]:
titanic.shape

(894, 9)

In [82]:
titanic.dropna().shape #default axis=0, how='any', thresh=None, subset=None, inplace=False

(187, 9)

In [83]:
titanic.dropna() # drops every row that contains at least one NaN

Unnamed: 0,Survived,Class,Gender,Age,SipSp,ParCH,Fare,Emb,Deck
1,1,1,female,38.0,1,0,$71.2833,C,C
3,1,1,female,35.0,1,0,$53.1,S,C
6,0,1,male,54.0,0,0,$51.8625,S,E
10,1,3,female,4.0,1,1,$16.7,S,G
11,1,1,female,58.0,0,0,$26.55,S,C
...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,$52.5542,S,D
872,0,1,male,33.0,0,0,$5.0,S,B
879,1,1,female,56.0,0,1,$83.1583,C,C
887,1,1,female,19.0,0,0,$30.0,S,B


## how

- 'any' : If any NA values are present, drop that row or column.
- 'all' : If all values are NA, drop that row or column.
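
The difference between the two options is easier to see on a tiny frame. A minimal sketch, assuming a hypothetical toy DataFrame `df` (not the Titanic data) in which column `c` and the last row are entirely NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "c" and row 2 contain only NaN
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [2.0, 5.0, np.nan],
        "c": [np.nan, np.nan, np.nan],
    }
)

rows_any = df.dropna(axis=0, how="any")  # drop rows with at least one NaN
rows_all = df.dropna(axis=0, how="all")  # drop rows where every value is NaN
cols_any = df.dropna(axis=1, how="any")  # drop columns with at least one NaN
cols_all = df.dropna(axis=1, how="all")  # drop columns where every value is NaN

print(rows_any.shape)
print(rows_all.shape)
print(cols_any.shape)
print(cols_all.shape)
```

Because every row and every column of this toy frame has at least one NaN, `how="any"` leaves nothing behind, while `how="all"` only removes the fully empty row or column.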

In [84]:
titanic.dropna( axis = 0 , how= "any").shape # drops every row that contains at least one NaN

(187, 9)

In [85]:
titanic.dropna( axis = 1 , how= "any").shape # Age, Emb and Deck contain NaN values, so the column count drops from 9 to 6

(894, 6)

In [86]:
titanic.dropna( axis = 0 , how= "all").shape

(894, 9)

In [87]:
titanic.dropna( axis = 1 , how= "all").shape

(894, 9)

## thresh

In [88]:
titanic.dropna( axis = 0 , thresh= 8).shape # keeps rows with at least 8 non-NaN values (at most 1 NaN across the 9 columns)

(772, 9)
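
`thresh` is the minimum number of non-NaN values a row (or column) needs in order to be kept. A minimal sketch on a hypothetical toy frame (the name `df` and its values are made up for illustration), including an equivalent boolean filter:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, np.nan],
        "b": [1.0, np.nan, np.nan, np.nan],
        "c": [1.0, 2.0, 3.0, np.nan],
    }
)

# Non-NaN counts per row are 3, 2, 1, 0: rows with at least 2 are kept
rows_kept = df.dropna(axis=0, thresh=2)
print(rows_kept.index.tolist())

# Non-NaN counts per column are a=2, b=1, c=3: columns with at least 2 are kept
cols_kept = df.dropna(axis=1, thresh=2)
print(cols_kept.columns.tolist())

# The row version written as an explicit filter on the non-NaN count
manual = df[df.notna().sum(axis=1) >= 2]
print(manual.equals(rows_kept))
```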

In [89]:
titanic.dropna( axis = 1 , thresh= 500).shape

(894, 8)

In [90]:
titanic.dropna( axis = 1 , thresh= 500, inplace = True)

In [91]:
titanic.head() # Deck has only 203 non-NaN values, fewer than the 500 required by thresh, so it was dropped

Unnamed: 0,Survived,Class,Gender,Age,SipSp,ParCH,Fare,Emb
0,0,3,male,22.0,1,0,$7.25,S
1,1,1,female,38.0,1,0,$71.2833,C
2,1,3,female,26.0,0,0,$7.925,S
3,1,1,female,35.0,1,0,$53.1,S
4,0,3,male,35.0,0,0,$8.05,S


In [92]:
titanic.shape

(894, 8)
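
With `inplace = True` the DataFrame itself is changed and the call returns `None`, which is why the cell above shows no output. A minimal sketch, assuming hypothetical toy frames `df` and `df2`, comparing it with reassigning the returned copy:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" is entirely NaN
df = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, np.nan]})

# inplace=True modifies df itself and returns None
result = df.dropna(axis=1, how="all", inplace=True)
print(result)
print(df.columns.tolist())

# Same effect without inplace: reassign the returned copy
df2 = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, np.nan]})
df2 = df2.dropna(axis=1, how="all")
print(df2.columns.tolist())
```

Either style works; reassigning keeps the original frame available until the new one is bound to the name.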

## subset

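Keep only the rows that have no NaN in the columns listed in `subset`; NaN values in other columns are ignored. With four columns in `subset`, `thresh= 4` (at least 4 non-NaN values) and `how= "any"` drop the same rows.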

In [93]:
titanic.dropna( axis = 0 , subset = ["Survived", "Class", "Gender", "Age"], thresh= 4).shape

(758, 8)

In [94]:
titanic.dropna( axis = 0 , subset = ["Survived", "Class", "Gender", "Age"], how= "any").shape

(758, 8)
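
The two cells above return the same shape because requiring all subset columns to be non-NaN is the same as forbidding any NaN in them. A minimal sketch on a hypothetical toy frame (column names borrowed from the Titanic data, values made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Deck has NaN values that should be ignored
df = pd.DataFrame(
    {
        "Survived": [1, 0, 1],
        "Age": [22.0, np.nan, 35.0],
        "Deck": [np.nan, "C", np.nan],
    }
)

cols = ["Survived", "Age"]

# Only NaN values inside the subset columns cause a row to be dropped
by_how = df.dropna(axis=0, subset=cols, how="any")

# thresh equal to the number of subset columns means "all of them non-NaN"
by_thresh = df.dropna(axis=0, subset=cols, thresh=len(cols))

print(by_how.index.tolist())
print(by_how.equals(by_thresh))
```

Rows 0 and 2 stay even though `Deck` is NaN there, because `Deck` is not part of `subset`.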

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Olympic Dataset

In [44]:
summer = pd.read_csv("summer_imp.csv")

In [45]:
summer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31170 entries, 0 to 31169
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Year          31170 non-null  int64 
 1   City          31170 non-null  object
 2   Sport         31170 non-null  object
 3   Discipline    31170 non-null  object
 4   Athlete Name  31170 non-null  object
 5   Country       31166 non-null  object
 6   Gender        31170 non-null  object
 7   Event         31170 non-null  object
 8   Medal         31170 non-null  object
dtypes: int64(1), object(8)
memory usage: 2.1+ MB


In [47]:
summer[summer.isna().any(axis = 1)]

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Country,Gender,Event,Medal
29608,2012,London,Athletics,Athletics,Pending,,Women,1500M,Gold
31077,2012,London,Weightlifting,Weightlifting,Pending,,Women,63KG,Gold
31096,2012,London,Weightlifting,Weightlifting,Pending,,Men,94KG,Silver
31115,2012,London,Wrestling,Wrestling Freestyle,"KUDUKHOV, Besik",,Men,Wf 60 KG,Silver
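
The filter above works in two steps: `isna()` gives a True/False frame, and `any(axis = 1)` collapses it to one value per row. A minimal sketch on a hypothetical toy frame (names and values made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with NaN in two different columns
df = pd.DataFrame(
    {
        "Athlete": ["A", "B", "C"],
        "Country": ["USA", np.nan, "GBR"],
        "Medal": ["Gold", "Silver", np.nan],
    }
)

# True for every row that has at least one NaN
mask = df.isna().any(axis=1)
print(mask.tolist())

# Rows selected by the mask are exactly the ones dropna() would remove
incomplete = df[mask]
print(incomplete.index.tolist())
print(len(df.dropna()) == len(df) - mask.sum())

# NaN count per column, as a quick overview
print(df.isna().sum().to_dict())
```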


In [48]:
summer.dropna(inplace = True)

In [49]:
summer.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31166 entries, 0 to 31169
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Year          31166 non-null  int64 
 1   City          31166 non-null  object
 2   Sport         31166 non-null  object
 3   Discipline    31166 non-null  object
 4   Athlete Name  31166 non-null  object
 5   Country       31166 non-null  object
 6   Gender        31166 non-null  object
 7   Event         31166 non-null  object
 8   Medal         31166 non-null  object
dtypes: int64(1), object(8)
memory usage: 2.4+ MB
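
After `dropna` the index still runs from 0 to 31169 although only 31166 rows remain, because the labels of the dropped rows are simply missing. If a contiguous index is wanted, `reset_index(drop = True)` is one option. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with NaN in rows 1 and 3
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 5.0]})

# The surviving rows keep their original labels
dropped = df.dropna()
print(dropped.index.tolist())

# drop=True discards the old labels instead of adding them as a column
renumbered = dropped.reset_index(drop=True)
print(renumbered.index.tolist())
```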
