# 🧹🧼 CLEANING DATA
Ralph Cajipe 2022


The health-related data set contains the following anomalies:

- The data set contains some **empty cells** ("Date" in row 22, and "Calories" in rows 18 and 28).

- The data set contains a value in the **wrong format** ("Date" in row 26).

- The data set contains **wrong data** ("Duration" in row 7).

- The data set contains **duplicates** (rows 11 and 12).
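Each kind of anomaly listed above can be checked with a short pandas call. The following is a minimal sketch on a tiny hand-made frame; the `sample` data, the 120-minute upper bound, and the date pattern are assumptions made for illustration and are not taken from `fixed-data.csv`.

```python
import pandas as pd

# Hypothetical miniature of the workout data with one anomaly of each kind.
sample = pd.DataFrame({
    "Duration": [60, 450, 60, 60, 45, 60],
    "Date": ["2020/12/01", "2020/12/02", "2020/12/03", "2020/12/03", None, "20201226"],
    "Calories": [409.1, 253.3, 250.7, 250.7, 282.0, None],
})

# Empty cells: count missing values per column.
missing_per_column = sample.isna().sum()

# Wrong format: dates that are present but do not look like YYYY/MM/DD.
looks_valid = sample["Date"].str.match(r"\d{4}/\d{2}/\d{2}$", na=False)
wrong_format = sample["Date"].notna() & ~looks_valid

# Wrong data: durations above an assumed upper bound of 120 minutes.
suspicious = sample[sample["Duration"] > 120]

# Duplicates: rows that repeat an earlier row exactly.
duplicate_count = int(sample.duplicated().sum())

print(missing_per_column)
print(sample[wrong_format])
print(suspicious)
print(duplicate_count)
```

Running these checks before cleaning gives a quick inventory of what needs fixing and where.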


In [1]:
import pandas as pd

# Load the dataset

### Create from CSV

In [2]:
df = pd.read_csv('fixed-data.csv')

# Read

### Check the Shape of Dataset

In [3]:
df.shape

(32, 5)

32 rows and 5 columns

### Show the First 10 Rows

In [4]:
df.head(10)

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


### Summary Statistics

In [5]:
df.describe()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
count,32.0,32.0,32.0,30.0
mean,66.75,103.5,128.5,304.68
std,70.894743,7.832933,12.998759,66.003779
min,6.0,90.0,101.0,195.1
25%,56.25,100.0,120.0,250.7
50%,60.0,102.5,127.5,291.2
75%,60.0,106.5,132.25,343.975
max,450.0,130.0,175.0,479.0


### Show Column Names

In [6]:
df.columns

Index(['Duration', 'Date', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')

### Print entire DataFrame

In [7]:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


# 1. Cleaning Empty Cells


### 1.1 (A) Return a new DataFrame with no empty cells:

In [8]:
new_df = df.dropna()
new_df.to_string()

"    Duration         Date  Pulse  Maxpulse  Calories\n0         60  2020/12/01'    110       130     409.1\n1         60  2020/12/02'    117       145     479.0\n2         60  2020/12/03'    103       135     340.0\n3         45  2020/12/04'    109       175     282.4\n4         45  2020/12/05'    117       148     406.0\n5         60  2020/12/06'    102       127     300.0\n6         60  2020/12/07'    110       136     374.0\n7        450  2020/12/08'    104       134     253.3\n8         30  2020/12/09'    109       133     195.1\n9         60  2020/12/10'     98       124     269.0\n10        60  2020/12/11'    103       147     329.3\n11        60  2020/12/12'    100       120     250.7\n12        60  2020/12/12'    100       120     250.7\n13        60  2020/12/13'    106       128     345.3\n14        60  2020/12/14'    104       132     379.3\n15        60  2020/12/15'     98       123     275.0\n16        60  2020/12/16'     98       120     215.2\n17        60  2020/12/17'  

### 1.2 Print the new data set

In [9]:
new_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


**Note:** By default, the dropna() method returns a new DataFrame, and will not change the `df` (original data set).
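The behaviour described in the note can be verified on a small example. This is a sketch with a hypothetical three-row frame (`original` is not the real data set):

```python
import pandas as pd

# Hypothetical three-row frame with one missing value.
original = pd.DataFrame({
    "Pulse": [110, 117, 103],
    "Calories": [409.1, None, 340.0],
})

# dropna() without inplace returns a new DataFrame.
cleaned = original.dropna()

print(len(original))  # the original still has all of its rows
print(len(cleaned))   # the copy has lost the row with the missing value
```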

Question: What happened to the old data set? Explain your answer and print the new data set.

**Answer: The original `df` is unchanged. `new_df` is a new DataFrame with no empty cells: rows 18 and 28 (empty "Calories") and row 22 (empty "Date") were dropped.**

### 1.3 (B) Change the original DataFrame using the `inplace = True` argument to remove all rows with NULL values

In [10]:
df = pd.read_csv('fixed-data.csv')

df.dropna(inplace = True)
df.to_string()

"    Duration         Date  Pulse  Maxpulse  Calories\n0         60  2020/12/01'    110       130     409.1\n1         60  2020/12/02'    117       145     479.0\n2         60  2020/12/03'    103       135     340.0\n3         45  2020/12/04'    109       175     282.4\n4         45  2020/12/05'    117       148     406.0\n5         60  2020/12/06'    102       127     300.0\n6         60  2020/12/07'    110       136     374.0\n7        450  2020/12/08'    104       134     253.3\n8         30  2020/12/09'    109       133     195.1\n9         60  2020/12/10'     98       124     269.0\n10        60  2020/12/11'    103       147     329.3\n11        60  2020/12/12'    100       120     250.7\n12        60  2020/12/12'    100       120     250.7\n13        60  2020/12/13'    106       128     345.3\n14        60  2020/12/14'    104       132     379.3\n15        60  2020/12/15'     98       123     275.0\n16        60  2020/12/16'     98       120     215.2\n17        60  2020/12/17'  

### 1.4 Print the updated DataFrame

In [11]:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


**Note:** With `inplace = True`, `dropna()` does NOT return a new DataFrame; instead, it removes all rows containing NULL values from the original DataFrame.
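One side effect worth knowing: rows removed in place leave gaps in the row labels. The sketch below uses a hypothetical three-row frame and shows how `reset_index(drop=True)` can renumber the rows if contiguous labels are wanted; whether to renumber is a design choice, since keeping the old labels makes it easier to trace rows back to the source file.

```python
import pandas as pd

# Hypothetical frame; the middle row has a missing value.
frame = pd.DataFrame({
    "Pulse": [110, 117, 103],
    "Calories": [409.1, None, 340.0],
})

# Modify the frame itself instead of creating a copy.
frame.dropna(inplace=True)
print(list(frame.index))  # [0, 2]: the surviving rows keep their old labels

# Optional: renumber the rows from zero.
renumbered = frame.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```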


Question: What happened to the old data set? Explain your answer and print the new data set.

**Answer: The original DataFrame contained NULL values in some rows. These rows were removed when the inplace = True argument was used. This left the DataFrame with only rows that did not contain NULL values.**

### 1.5 (C) Replace Empty Values

In [12]:
# Replace NULL values with the number 130:

df = pd.read_csv('fixed-data.csv')
df.fillna(130, inplace = True)

### 1.6 Print the updated DataFrame

In [13]:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


Question: What happened to the old data set? Explain your answer and what happened to the updated DataFrame?

**Answer: The CSV file is unchanged, but because `inplace = True` was used, the DataFrame itself was updated. The `fillna()` method replaces empty cells with the specified value, so every empty cell, in any column, now holds the number 130.**


### 1.7 (D) Replace Only for Specified Columns


In [14]:
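# Replace NULL values in the "Calories" column with the number 130:

df = pd.read_csv('fixed-data.csv')
# Assign the result back: fillna(inplace = True) on a selected column may not update df in newer pandas versions
df["Calories"] = df["Calories"].fillna(130)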

### Print the updated DataFrame

In [15]:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


Question: What happened to the old data set? Explain your answer and take a screenshot of the new data set.

**Answer: The DataFrame was updated: the NULL values in the "Calories" column (rows 18 and 28) have been replaced with the number 130. Other columns are untouched, so the empty "Date" in row 22 remains.**
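Another way to limit the replacement to specific columns is to pass `fillna()` a dictionary that maps column names to fill values. A minimal sketch, using a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame with a missing value in each column.
frame = pd.DataFrame({
    "Date": ["2020/12/01", None, "2020/12/03"],
    "Calories": [409.1, None, 340.0],
})

# Fill only "Calories"; the missing "Date" is left alone.
filled = frame.fillna({"Calories": 130})

print(filled)
```

This form avoids putting the number 130 into a date column, which is what a plain `fillna(130)` on the whole DataFrame would do.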

### 1.8 (E) Replace Using Mean, Median, or Mode

In [16]:
# Calculate the MEAN, and replace any empty values with it:

df = pd.read_csv('fixed-data.csv')
x = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(x)

In [17]:
# Print MEAN (x)
x


304.68

In [18]:
# Print the new dataframe when using the Mean:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


**Note:** Mean refers to the average value (the sum of all values divided by the number of values).

In [19]:
# Calculate the MEDIAN, and replace any empty values with it:

df = pd.read_csv('fixed-data.csv')
x = df["Calories"].median()
df["Calories"] = df["Calories"].fillna(x)


In [20]:
# Print the MEDIAN (x)
x

291.2

In [21]:
# Print the new dataframe when using the Median:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


**Note:** Median refers to the value in the middle, after you have sorted all values in ascending order.

In [22]:
# Calculate the MODE, and replace any empty values with it:

df = pd.read_csv('fixed-data.csv')
x = df["Calories"].mode()[0]
df["Calories"] = df["Calories"].fillna(x)

In [23]:
# Print the MODE (x)
x

300.0

In [24]:
# Print the new dataframe when using the Mode:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


Question: What happened to the old data set? Explain your answer.

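**Answer: In each case the empty "Calories" cells (rows 18 and 28) were filled with a single statistic computed from the non-empty values: the mean (304.68), the median (291.2), or the mode (300.0, the most frequent value).**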
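The three statistics can give quite different fill values when the column contains an outlier. The sketch below uses a made-up `calories` series (not the real column) to show the difference:

```python
import pandas as pd

# Hypothetical values: one outlier (1000.0) and one missing entry.
calories = pd.Series([300.0, 300.0, 250.0, 1000.0, None])

mean_value = calories.mean()      # 462.5, pulled upward by the outlier
median_value = calories.median()  # 300.0, the middle of the sorted values
mode_value = calories.mode()[0]   # 300.0, the most frequent value

# Fill the gap with the median, which the outlier barely affects.
filled = calories.fillna(median_value)

print(mean_value, median_value, mode_value)
print(filled.tolist())
```

All three methods skip missing values when computing the statistic. `mode()` returns a Series because several values can tie, which is why `[0]` is used to take the first one.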

# 2. Data of Wrong Format

### 2.1 Look at the original DataFrame

In [25]:
df = pd.read_csv('fixed-data.csv')
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


In our DataFrame, the "Date" column has two problem cells: **row 22** is empty and **row 26** is in the wrong format. Every cell in the "Date" column should be a string that represents a date.

### 2.2 Try to convert all cells in the 'Date' column into dates.
#### Pandas has a to_datetime() method for this:



In [26]:
# Convert to date:
df = pd.read_csv('fixed-data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())


    Duration       Date  Pulse  Maxpulse  Calories
0         60 2020-12-01    110       130     409.1
1         60 2020-12-02    117       145     479.0
2         60 2020-12-03    103       135     340.0
3         45 2020-12-04    109       175     282.4
4         45 2020-12-05    117       148     406.0
5         60 2020-12-06    102       127     300.0
6         60 2020-12-07    110       136     374.0
7        450 2020-12-08    104       134     253.3
8         30 2020-12-09    109       133     195.1
9         60 2020-12-10     98       124     269.0
10        60 2020-12-11    103       147     329.3
11        60 2020-12-12    100       120     250.7
12        60 2020-12-12    100       120     250.7
13        60 2020-12-13    106       128     345.3
14        60 2020-12-14    104       132     379.3
15        60 2020-12-15     98       123     275.0
16        60 2020-12-16     98       120     215.2
17        60 2020-12-17    100       120     300.0
18        45 2020-12-18     90 

### 2.3 Print the updated Date column in the DataFrame

In [27]:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,450,2020-12-08,104,134,253.3
8,30,2020-12-09,109,133,195.1
9,60,2020-12-10,98,124,269.0


The conversion above produced a *NaT* (Not a Time) value for row 22, which can be handled as a **NULL** value, so we can remove the row with the `dropna()` method.
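Depending on the pandas version, a single `to_datetime()` call may not accept a column that mixes date layouts. The sketch below is one hedged alternative: it parses each layout explicitly and uses `errors="coerce"` so that anything unparseable becomes *NaT*. The `dates` series and the two format strings are assumptions chosen to resemble the "Date" column.

```python
import pandas as pd

# Hypothetical values: a normal date, an empty cell, a compact date, and junk.
dates = pd.Series(["2020/12/01", None, "20201226", "not a date"])

# Try the slash-separated layout first; anything else becomes NaT.
parsed = pd.to_datetime(dates, format="%Y/%m/%d", errors="coerce")

# Retry with the compact layout and use it where the first attempt failed.
compact = pd.to_datetime(dates, format="%Y%m%d", errors="coerce")
parsed = parsed.fillna(compact)

print(parsed)
```

Cells that match neither layout stay *NaT* and can then be removed with `dropna()`.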

In [28]:
# Remove rows with a NaT or NULL value in the "Date" column:
df.dropna(subset=['Date'], inplace = True)


### 2.4 Print the updated DataFrame

In [29]:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,450,2020-12-08,104,134,253.3
8,30,2020-12-09,109,133,195.1
9,60,2020-12-10,98,124,269.0


Question: What happened to the old data set? Explain your answer and what happened to the new data set?

**Answer: `df.dropna(subset=['Date'], inplace = True)` modified the DataFrame in place: row 22, whose "Date" was NaT after the conversion, has been removed.**



# 3. Cleaning Wrong Data

### 3.1 Look at the original DataFrame

In [30]:
df = pd.read_csv('fixed-data.csv')
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


### 3.2  Fix wrong values, like the one for "Duration" in row 7

In [31]:
# Set "Duration" = 45 in row 7
df.loc[7, 'Duration'] = 45
df


Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,45,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


Question: What happened to the old data set?

**Answer: The updated DataFrame now has the value for "Duration" in row 7 changed from 450 to 45.**


### 3.3 Replace wrong data for larger data sets

#### Reset data set to original state

In [36]:
# Reset DataFrame:
df = pd.read_csv('fixed-data.csv')
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


**Row 7** has the wrong value for "Duration" again.

For small data sets you might be able to replace the wrong data one by one, but not for big data sets.

To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries for legal values, and replace any values that are outside of the boundaries.

Loop through all values in the "Duration" column.


In [39]:
# If the value is higher than 120, set it to 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120

### 3.4 Look at the updated DataFrame

In [40]:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,120,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


Question: What happened to the old data set? Explain your answer.


**Answer: The wrong value in the "Duration" column has been capped. The value in row 7 (450) is above the upper bound of 120 set in the conditional statement, so it has been changed from 450 to 120 this time.**
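The same rule can be applied without an explicit loop. This is a sketch of two vectorized alternatives on a hypothetical `frame`; the upper bound of 120 is the one used in the loop above.

```python
import pandas as pd

# Hypothetical durations; 450 breaks the upper bound of 120.
frame = pd.DataFrame({"Duration": [60, 450, 30, 120]})

# Alternative 1: clip() returns a capped copy of the column.
capped = frame["Duration"].clip(upper=120)

# Alternative 2: a Boolean mask selects the offending rows for assignment.
too_long = frame["Duration"] > 120
frame.loc[too_long, "Duration"] = 120

print(capped.tolist())
print(frame["Duration"].tolist())
```

Both forms operate on the whole column at once, which is usually faster and shorter than looping over `df.index` on large data sets.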

### 3.5 Remove the rows that contain wrong data in Duration

Another way of handling wrong data is to remove the rows that contain wrong data.

This way you do not have to find out what to replace them with, and there is a good chance you do not need them to do your analyses.

#### Reset data set to original state

In [43]:
# Reset DataFrame:
df = pd.read_csv('fixed-data.csv')
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


In [44]:
# Delete rows where "Duration" is higher than 120:
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)

### 3.6 Look at the DataFrame

In [45]:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0
10,60,2020/12/11',103,147,329.3


Question: What happened to the old data set? Explain your answer.

**Answer: The row with wrong data in the "Duration" column has been deleted. The value in row 7 (450) is above the upper bound of 120 set in the conditional statement, so row 7 has been removed from the DataFrame and the row labels now jump from 6 to 8.**
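Row removal can also be written as a Boolean filter instead of a loop. A minimal sketch on a hypothetical `frame`, keeping every row that does not break the upper bound of 120:

```python
import pandas as pd

# Hypothetical rows; the second one has an implausible duration.
frame = pd.DataFrame({
    "Duration": [60, 450, 30],
    "Pulse": [110, 104, 109],
})

# Keep only the rows whose duration is NOT above 120.
kept = frame[~(frame["Duration"] > 120)]

print(list(kept.index))  # [0, 2]
```

The condition is negated rather than rewritten as `<= 120` so that rows with a missing duration are kept, just as the loop would keep them.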

# 4. Removing Duplicates

*Duplicate rows* are rows that have been registered **more than one time**.

By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.

To discover duplicates, we can use the `duplicated()` method.

The `duplicated()` method returns a Boolean value for each row:


### 4.1 Reset data set to original state

In [53]:
# Reset DataFrame:
df = pd.read_csv('fixed-data.csv')
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


Rows 11 and 12 appear to be duplicates; let's confirm that with the `duplicated()` method.

### 4.2 Check for duplicates

In [54]:
# Returns True for every row that is a duplicate, otherwise False:
df.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
dtype: bool

Question: What happened to the old data set? Explain your answer.

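**Answer: The data set itself was not changed; `duplicated()` returned a Boolean value for every row. Only row 12 is `True`: it repeats row 11, and by default the first occurrence (row 11) is not marked as a duplicate.**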
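To see every row that belongs to a duplicate group, including the first occurrence, `duplicated()` accepts a `keep` parameter. A sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are identical.
frame = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Pulse": [100, 100, 90],
})

# Default (keep="first"): only the later copy is flagged.
later_copies = frame.duplicated()

# keep=False: every member of a duplicate group is flagged.
all_copies = frame.duplicated(keep=False)

print(later_copies.tolist())  # [False, True, False]
print(all_copies.tolist())    # [True, True, False]
```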

### 4.3 Remove duplicates

To remove duplicates, use the `drop_duplicates()` method.

In [55]:
# Remove all duplicates:
df.drop_duplicates(inplace = True)


**Note:** The `(inplace = True)` will make sure that the method does NOT return a new DataFrame, but it will remove all duplicates from the original DataFrame.


### 4.4 Check if there are duplicates again

In [56]:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0


In [57]:
df.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
dtype: bool

Question: What happened to the old data set? Explain your answer.


**Answer: We have removed the duplicate (row 12), and the `duplicated()` method now returns `False` for every row.**
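Instead of reading through one Boolean per row, the check can be collapsed into a single value with `any()`. A short sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are identical.
frame = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Pulse": [100, 100, 90],
})

# Without inplace, drop_duplicates() returns a new DataFrame.
deduplicated = frame.drop_duplicates()

# True if at least one duplicate row remains.
has_duplicates = bool(deduplicated.duplicated().any())
rows_removed = len(frame) - len(deduplicated)

print(has_duplicates)  # False
print(rows_removed)    # 1
```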

# END