<a href="https://vigneashpandiyan.github.io/publications/Codes/" target="_blank" rel="noopener noreferrer">
  <img src="https://vigneashpandiyan.github.io/images/Link.png"
       style="max-width: 800px; width: 100%; height: auto;">
</a>

### Data Cleaning

Data cleaning is the process of correcting or removing inaccurate, incomplete, duplicate, or irrelevant data from a dataset. It is a crucial step in data analysis and machine learning because clean data leads to more accurate results, more meaningful insights, and more reliable models. Common tasks include handling missing values, fixing formatting issues, and removing duplicates.

The dataset below is used in the next few cells. The cell can be run again at any point to reset the dataset.

In [None]:
import pandas as pd

data = [
    [60, '2020/12/01', 110, 130, 409.1],
    [60, '2020/12/02', 117, 145, 479.0],
    [60, '2020/12/03', 103, 135, 340.0],
    [45, '2020/12/04', 109, 175, 282.4],
    [45, '2020/12/05', 117, 148, 406.0],
    [60, '2020/12/06', 102, 127, 300.0],
    [60, '2020/12/07', 110, 136, 374.0],
    [40, '2020/12/08', 104, 134, 253.3],
    [30, '2020/12/09', 109, 133, 195.1],
    [60, '2020/12/10', 98, 124, 269.0],
    [60, '2020/12/11', 103, 147, 329.3],
    [60, '2020/12/12', 100, 120, 250.7],
    [60, '2020/12/12', 100, 120, 250.7],
    [60, '2020/12/13', 106, 128, 345.3],
    [60, '2020/12/14', 104, 132, 379.3],
    [60, '2020/12/15', 98, 123, 275.0],
    [60, '2020/12/16', 98, 120, 215.2],
    [60, '2020/12/17', 100, 120, 300.0],
    [45, '2020/12/18', 90, 112, None],
    [60, '2020/12/19', 103, 123, 323.0],
    [45, '2020/12/20', 97, 125, 243.0],
    [60, '2020/12/21', 108, 131, 364.2],
    [45, None, 100, 119, 282.0],
    [60, '2020/12/23', 130, 101, 300.0],
    [45, '2020/12/24', 105, 132, 246.0],
    [60, '2020/12/25', 102, 126, 334.5],
    [60, '2020/12/26', 100, 120, 250.0],
    [60, '2020/12/27', 92, 118, 241.0],
    [60, '2020/12/28', 103, 132, None],
    [60, '2020/12/29', 100, 132, 280.0],
    [60, '2020/12/30', 102, 129, 380.3],
    [60, '2020/12/31', 92, 115, 243.0]
]

columns = ["Duration", "Date", "Pulse", "Maxpulse", "Calories"]
df = pd.DataFrame(data, columns=columns)
df.to_csv("calories.csv", index=False)


In [None]:
df.info()  # info() prints its summary directly and returns None, so no print() is needed

In [None]:
print(df.to_string())

The dataset contains some empty cells ("Date" in row 22, and "Calories" in rows 18 and 28).

The dataset also contains a duplicate: rows 11 and 12 are identical.
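Rather than reading the printed table by eye, the empty cells and repeated rows can also be located programmatically. The following is a minimal sketch on a small stand-in frame (`sample` and its values are illustrative assumptions, not the full dataset above); the same calls should work on `df`.

```python
import pandas as pd

# Small illustrative frame with the same columns as the dataset above
sample = pd.DataFrame(
    {
        "Duration": [60, 60, 45, 60],
        "Date": ["2020/12/12", "2020/12/12", None, "2020/12/28"],
        "Pulse": [100, 100, 100, 103],
        "Maxpulse": [120, 120, 119, 132],
        "Calories": [250.7, 250.7, 282.0, None],
    }
)

# Number of empty cells in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Index labels of rows that contain at least one empty cell
rows_with_missing = sample.index[sample.isna().any(axis=1)].tolist()
print(rows_with_missing)

# Index labels of rows that repeat an earlier row
duplicate_rows = sample.index[sample.duplicated()].tolist()
print(duplicate_rows)
```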

### Missing Data
Empty cells can lead to wrong results when the data is analyzed.
One way to deal with missing data is simply to remove the rows that contain empty cells. This is usually fine for large datasets, as removing a few rows will not have a big impact on the result.

In [None]:
import pandas as pd

df = pd.read_csv('calories.csv')

new_df = df.dropna()  # Remove rows with empty cells

print(new_df.to_string())

Note that dropna() returns a new DataFrame and leaves the original DataFrame unchanged.
To modify the original DataFrame in place instead, use:

> df.dropna(inplace = True)
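The difference between the returned copy and the original can be checked directly. The sketch below uses a three-row stand-in frame (`sample` is an illustrative assumption) and also shows the `subset` parameter of dropna(), which is not used elsewhere in this notebook but limits the check to chosen columns.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Date": ["2020/12/18", None, "2020/12/28"],
        "Calories": [None, 282.0, 280.0],
    }
)

# dropna() returns a new frame; the original keeps all of its rows
cleaned = sample.dropna()
print(len(sample), len(cleaned))

# Drop rows only when "Calories" is empty; gaps in other columns are kept
only_calories = sample.dropna(subset=["Calories"])
print(only_calories)
```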

Another way of dealing with empty cells is to insert a new value instead. This is called data imputation: missing values in a dataset are replaced with estimated values. One common approach replaces each missing value with the mean of that variable over all other cases, which has the benefit of not changing the sample mean for that variable. Other statistical quantities may also be used.

The fillna() method replaces empty cells with a single value. This may be a precalculated mean, mode or median.


In [None]:
x = df["Calories"].mean()
new_df = df.fillna(x)  # Replace every missing value, in any column, with the mean of Calories

print(new_df.to_string())

However, note that the above method also replaced the missing date with the mean of Calories!
To replace empty values in one column only, specify the column name:


In [None]:
x = df["Calories"].median()
new_df = df.fillna({"Calories": x})  # Replace missing Calories with the median of Calories
print(new_df.to_string())
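The mode can be used in the same way, with one difference: mode() returns a Series rather than a single number, because several values may tie for most frequent. The sketch below takes the first entry; the `calories` Series and its values are illustrative assumptions, not the dataset above.

```python
import pandas as pd

# Illustrative values with one empty cell
calories = pd.Series([300.0, 250.0, None, 300.0, 410.0], name="Calories")

mean_value = calories.mean()            # empty cells are skipped by default
median_value = calories.median()
mode_value = calories.mode().iloc[0]    # first of the most frequent values

filled_with_mode = calories.fillna(mode_value)
print(mean_value, median_value, mode_value)
print(filled_with_mode.to_string())
```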

Lastly, it makes sense to check the dataset for duplicated records.
The duplicated() method returns a Boolean value for each row: True if the row repeats an earlier row, and False otherwise.

In [None]:
print(new_df.duplicated())

Finally, to remove all duplicates:

In [None]:
new_df.drop_duplicates(inplace=True)  # The inplace parameter modifies the DataFrame directly
print(new_df.to_string())
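After duplicates are removed, the index keeps a gap where the dropped row used to be. As a hedged sketch (the `sample` frame is an illustrative assumption), the non-in-place form of drop_duplicates() can be chained with reset_index() to renumber the rows:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Date": ["2020/12/11", "2020/12/12", "2020/12/12", "2020/12/13"],
        "Calories": [329.3, 250.7, 250.7, 345.3],
    }
)

# Keep the first occurrence of each row, then renumber the index from 0
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(deduplicated.to_string())
```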

### Filtering

Filtering a dataset extracts specific rows based on conditions applied to one or more columns, making it easier to work with relevant subsets of the data.

In [None]:
filtered_df = new_df.loc[new_df['Pulse'] > 99, ['Maxpulse', 'Calories']]
print(filtered_df)
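Conditions on several columns can be combined with `&` (and) or `|` (or), with each condition wrapped in parentheses. The following is a minimal sketch on a stand-in frame; `sample`, its values, and the Calories threshold of 400 are illustrative assumptions.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Pulse": [110, 98, 103, 92],
        "Maxpulse": [130, 124, 147, 115],
        "Calories": [409.1, 269.0, 329.3, 243.0],
    }
)

# Rows where Pulse is above 99 AND Calories is below 400
mask = (sample["Pulse"] > 99) & (sample["Calories"] < 400)
subset = sample.loc[mask, ["Maxpulse", "Calories"]]
print(subset)
```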