## Cleaning and Transforming Data

* Data cleaning means fixing bad data in your data set.

* Bad data could be:

    Empty cells \
    Data in wrong format \
    Wrong data \
    Duplicates 
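As a rough illustration of the four categories above, the sketch below checks a small made-up DataFrame for each kind of problem. The `toy` data, the expected `YYYY/MM/DD` date layout, and the 120-minute threshold are assumptions chosen for the example; they are not taken from `sample_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the contents of sample_data.csv.
toy = pd.DataFrame({
    "Duration": [60, 45, 450, 60, 60],
    "Date": ["2020/12/01", "2020/12/02", "2020/12/03", "20201204", "2020/12/01"],
    "Calories": [409.1, np.nan, 253.3, 300.0, 409.1],
})

# Empty cells: count missing values per column.
missing_per_column = toy.isna().sum()

# Data in wrong format: dates that do not match the assumed YYYY/MM/DD layout
# become NaT when parsing errors are coerced.
parsed = pd.to_datetime(toy["Date"], format="%Y/%m/%d", errors="coerce")
bad_dates = int(parsed.isna().sum())

# Wrong data: durations above 120 minutes are flagged as suspicious
# (an assumed threshold for this example only).
suspicious = toy[toy["Duration"] > 120]

# Duplicates: rows that repeat an earlier row exactly.
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print(bad_dates, len(suspicious), duplicate_rows)
```

What counts as "wrong data" depends on the data set, so a threshold like the one used here would normally come from domain knowledge rather than from pandas.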

In [1]:
import pandas as pd

df = pd.read_csv('sample_data.csv')

df.shape


(31, 5)

### Empty Cells
* Empty cells can potentially give you a wrong result when you analyze data.
* One way to deal with empty cells is to remove rows that contain empty cells.
* This is usually acceptable when the data set is large and only a few rows are affected, because removing them has little impact on the result.

In [4]:
new_df = df.dropna()

new_df

# By default, the dropna() method returns a new DataFrame, and will not change the original.

# If you want to change the original DataFrame, use the inplace = True argument

df.dropna(inplace = True)

df.shape


(28, 5)
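Before dropping rows, it can help to see how many would be lost, and to drop only on the columns that matter. The sketch below uses a made-up `toy` DataFrame (an assumption for illustration) and the `subset` parameter of `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "Duration": [60, 45, np.nan, 30],
    "Calories": [409.1, np.nan, 300.0, 195.1],
})

# Count rows that have at least one empty cell before dropping anything.
rows_with_missing = int(toy.isna().any(axis=1).sum())

# Default behaviour: drop a row if any column is empty.
dropped_any = toy.dropna()

# Drop a row only when "Calories" is empty; gaps in other columns are kept.
dropped_subset = toy.dropna(subset=["Calories"])

print(rows_with_missing, len(dropped_any), len(dropped_subset))
```

Restricting the drop to the columns used in the analysis keeps more rows than dropping on every column.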

### Replace Empty Values

* Another way of dealing with empty cells is to insert a new value instead.
* This way you do not have to delete entire rows just because of some empty cells.
* The fillna() method allows us to replace empty cells with a value:

* A common way to replace empty cells is to use the mean, median, or mode of the column.

* Pandas provides the mean(), median(), and mode() methods to calculate these values for a specified column:

* Mean = the average value (the sum of all values divided by the number of values).

* Median = the value in the middle, after you have sorted all values in ascending order.

* Mode = the value that appears most frequently.
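The three statistics can be worked through by hand on a small example. The `calories` Series below is made up for illustration; note that pandas skips empty values by default when computing each statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one empty cell.
calories = pd.Series([300.0, 250.0, 300.0, 400.0, np.nan])

# Mean: (300 + 250 + 300 + 400) / 4, the empty cell is ignored.
mean_value = calories.mean()

# Median: sorted values are 250, 300, 300, 400; with an even count
# the median is the average of the two middle values.
median_value = calories.median()

# Mode: mode() returns a Series because there can be ties; take the first.
mode_value = calories.mode()[0]

# Replace the empty cell with one of the statistics, here the median.
filled = calories.fillna(median_value)

print(mean_value, median_value, mode_value)
```

The median and mode are less sensitive to extreme values than the mean, which is one reason to prefer them when a column contains outliers.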

In [6]:
import pandas as pd

df = pd.read_csv('sample_data.csv')
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


In [7]:
import pandas as pd

df = pd.read_csv('sample_data.csv')

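# To replace empty cells in the whole DataFrame:
# df.fillna(130, inplace=True)

# To replace empty cells in a specific column only, assign the result back.
# Calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas versions.
df["Calories"] = df["Calories"].fillna(130)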

df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


In [8]:
import pandas as pd

df = pd.read_csv('sample_data.csv')

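# Choose one statistic as the replacement value; the alternatives are commented out.
# x = df["Calories"].mean()
# x = df["Calories"].median()
x = df["Calories"].mode()[0]  # mode() returns a Series, so take the first value
print(x)
# df["Calories"] = df["Calories"].fillna(x)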

300.0
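When several numeric columns have gaps, each can be filled with its own statistic in one call, since `fillna()` accepts a Series keyed by column name. The sketch below is one possible approach using a made-up `toy` DataFrame (an assumption for illustration), with the median as the replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "Pulse": [110.0, np.nan, 103.0],
    "Calories": [409.1, 300.0, np.nan],
    "Date": ["2020/12/01", "2020/12/02", None],
})

# Median of each numeric column; the text column "Date" is skipped.
medians = toy.median(numeric_only=True)

# Each numeric column is filled with its own median.
# "Date" has no entry in medians, so its empty cell is left alone.
filled = toy.fillna(medians)

print(filled)
```

Because `fillna()` returns a new DataFrame here, `toy` itself is unchanged and the original gaps remain available for comparison.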
