
# Pandas - Data Cleaning Tutorial
In this lecture, we will explore how to clean data using the Pandas library in Python.

We will cover the following topics:
- Removing Empty Cells
- Fixing Data with Wrong Format
- Fixing Wrong Data
- Removing Duplicates



## 1. Removing Empty Cells
Empty cells can lead to incorrect results when you analyze data.
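Before removing anything, it helps to see how many cells are actually empty. The following is a minimal sketch on a small, made-up frame (the values are illustrative and are not the tutorial's dataset):

```python
import pandas as pd

# Small illustrative frame; None marks an empty cell
sample = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Date": ["2020/12/01", None, "2020/12/03"],
    "Calories": [409.1, 282.0, None],
})

# isna() flags empty cells; sum() counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows that contain at least one empty cell
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)
```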

### Example Data:
Here is an example dataset with some empty cells:
```plaintext
      Duration          Date  Pulse  Maxpulse  Calories
  0         60  '2020/12/01'    110       130     409.1
  1         60  '2020/12/02'    117       145     479.0
  2         60  '2020/12/03'    103       135     340.0
  3         45  '2020/12/04'    109       175     282.4
  4         45  '2020/12/05'    117       148     406.0
  5         60  '2020/12/06'    102       127     300.0
  6         60  '2020/12/07'    110       136     374.0
  7        450  '2020/12/08'    104       134     253.3
  8         30  '2020/12/09'    109       133     195.1
  9         60  '2020/12/10'     98       124     269.0
 10         60  '2020/12/11'    103       147     329.3
 11         60  '2020/12/12'    100       120     250.7
 12         60  '2020/12/12'    100       120     250.7
 13         60  '2020/12/13'    106       128     345.3
 14         60  '2020/12/14'    104       132     379.3
 15         60  '2020/12/15'     98       123     275.0
 16         60  '2020/12/16'     98       120     215.2
 17         60  '2020/12/17'    100       120     300.0
 18         45  '2020/12/18'     90       112       NaN
 19         60  '2020/12/19'    103       123     323.0
 20         45  '2020/12/20'     97       125     243.0
 21         60  '2020/12/21'    108       131     364.2
 22         45           NaN    100       119     282.0
 23         60  '2020/12/23'    130       101     300.0
 24         45  '2020/12/24'    105       132     246.0
 25         60  '2020/12/25'    102       126     334.5
 26         60      20201226    100       120     250.0
 27         60  '2020/12/27'     92       118     241.0
 28         60  '2020/12/28'    103       132       NaN
 29         60  '2020/12/29'    100       132     280.0
 30         60  '2020/12/30'    102       129     380.3
 31         60  '2020/12/31'     92       115     243.0
```

### Removing Empty Cells
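You can remove rows with empty cells using the `dropna()` method. It returns a new DataFrame and leaves the original unchanged. The sample below uses only the `Duration`, `Date`, and `Calories` columns, and the dates in rows 0 and 2 are also left empty: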


In [2]:

import pandas as pd

# Sample data
data = {
    "Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30, 60, 60, 60, 60, 60, 60, 60, 60, 60, 45, 60, 45, 60, 45, 60, 45, 60, 60, 60, 60, 60, 60, 60],
    "Date": [None, '2020/12/02', None, '2020/12/04', '2020/12/05', '2020/12/06', '2020/12/07', '2020/12/08', '2020/12/09', '2020/12/10',
             '2020/12/11', '2020/12/12', '2020/12/12', '2020/12/13', '2020/12/14', '2020/12/15', '2020/12/16', '2020/12/17', '2020/12/18', '2020/12/19',
             '2020/12/20', '2020/12/21', None, '2020/12/23', '2020/12/24', '2020/12/25', '20201226', '2020/12/27', '2020/12/28', '2020/12/29', '2020/12/30', '2020/12/31'],
    "Calories": [409.1, 479.0, 340.0, 282.4, 406.0, 300.0, 374.0, 253.3, 195.1, 269.0, 329.3, 250.7, 250.7, 345.3, 379.3, 275.0, 215.2, 300.0, None, 323.0,
                 243.0, 364.2, 282.0, 300.0, 246.0, 334.5, 250.0, 241.0, None, 280.0, 380.3, 243.0]
}

df = pd.DataFrame(data)

# Remove rows with empty cells
df_cleaned = df.dropna()
print(df_cleaned.to_string())


    Duration        Date  Calories
1         60  2020/12/02     479.0
3         45  2020/12/04     282.4
4         45  2020/12/05     406.0
5         60  2020/12/06     300.0
6         60  2020/12/07     374.0
7        450  2020/12/08     253.3
8         30  2020/12/09     195.1
9         60  2020/12/10     269.0
10        60  2020/12/11     329.3
11        60  2020/12/12     250.7
12        60  2020/12/12     250.7
13        60  2020/12/13     345.3
14        60  2020/12/14     379.3
15        60  2020/12/15     275.0
16        60  2020/12/16     215.2
17        60  2020/12/17     300.0
19        60  2020/12/19     323.0
20        45  2020/12/20     243.0
21        60  2020/12/21     364.2
23        60  2020/12/23     300.0
24        45  2020/12/24     246.0
25        60  2020/12/25     334.5
26        60    20201226     250.0
27        60  2020/12/27     241.0
29        60  2020/12/29     280.0
30        60  2020/12/30     380.3
31        60  2020/12/31     243.0


In [3]:
df_cleaned
# What do you notice about the index?

    Duration        Date  Calories
1         60  2020/12/02     479.0
3         45  2020/12/04     282.4
4         45  2020/12/05     406.0
5         60  2020/12/06     300.0
6         60  2020/12/07     374.0
7        450  2020/12/08     253.3
8         30  2020/12/09     195.1
9         60  2020/12/10     269.0
10        60  2020/12/11     329.3
11        60  2020/12/12     250.7
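
`dropna()` keeps the original row labels, so `df_cleaned` has gaps in its index where rows 0, 2, 18, 22, and 28 were removed. If a continuous index is preferred, one option is `reset_index(drop=True)`. A minimal sketch on a small illustrative frame (not the tutorial's dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Calories": [409.1, None, 340.0, None],
})

# dropna() removes rows 1 and 3 but keeps the labels of the remaining rows
cleaned = sample.dropna()
print(list(cleaned.index))

# drop=True discards the old labels instead of adding them as a new column
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))
```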


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  32 non-null     int64  
 1   Date      29 non-null     object 
 2   Calories  30 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 900.0+ bytes
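
The `info()` output shows that the empty cells are spread over two columns. If only one column matters for an analysis, `dropna()` accepts a `subset` argument so that rows are removed only when that column is empty. A minimal sketch, again on illustrative values:

```python
import pandas as pd

sample = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Date": [None, "2020/12/04", "2020/12/05"],
    "Calories": [409.1, None, 406.0],
})

# Only rows with an empty 'Calories' cell are removed;
# the row with an empty 'Date' is kept
calories_known = sample.dropna(subset=["Calories"])
print(calories_known)
```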



## 2. Fixing Data with Wrong Format
Cells with data in the wrong format can make analysis difficult or even impossible. For instance, every value in the 'Date' column should be a date written in the same format.

### Example:
```plaintext
Row 22 has an empty date.
Row 26 has a date in the wrong format ('20201226').
```

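We can use the `pd.to_datetime()` function to convert the 'Date' column into datetime values. Because the column mixes two formats, each format is parsed separately:


In [None]:

# Flag the dates written as eight digits with no separators, e.g. '20201226'
# (na=False treats empty cells as non-matching)
mask = df['Date'].str.match(r'^\d{8}$', na=False)
print(mask)

# Parse each group with its own format; empty cells become NaT
compact = pd.to_datetime(df.loc[mask, 'Date'], format='%Y%m%d')
slashed = pd.to_datetime(df.loc[~mask, 'Date'], format='%Y/%m/%d')

# Recombine by index so the whole column has a datetime dtype
df['Date'] = pd.concat([compact, slashed])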


In [None]:
df

In [None]:
df.info()
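
When it is not known in advance which formats occur, another option is to pass `errors='coerce'` so that `to_datetime()` turns anything it cannot parse into `NaT` (pandas' marker for a missing date) instead of raising an error. Those rows can then be inspected or dropped. A minimal sketch on illustrative values:

```python
import pandas as pd

dates = pd.Series(["2020/12/25", "20201226", None, "not a date"])

# Values that do not match the given format become NaT
parsed = pd.to_datetime(dates, format="%Y/%m/%d", errors="coerce")
print(parsed)

# Original values that still need attention (empty or unparseable)
needs_attention = dates[parsed.isna()]
print(needs_attention)
```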


## 3. Fixing Wrong Data
Sometimes, data values may be wrong, such as a duration of 450 minutes in a dataset where most durations are between 30 and 60 minutes.

### Example:
We will cap every duration above 120 minutes at 120, which changes the value of 450 in row 7:


In [None]:

# Fix wrong data in 'Duration' column
df.loc[df['Duration'] > 120, 'Duration'] = 120

# Show the updated DataFrame
print(df.to_string())


In [None]:
df
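
Capping is one choice. When a suspicious value cannot be trusted at all, the affected rows can be removed instead. The sketch below shows both options on an illustrative frame, reusing the 120-minute threshold from the cell above:

```python
import pandas as pd

sample = pd.DataFrame({"Duration": [60, 450, 30, 45]})

# Option 1: cap values above the threshold
capped = sample["Duration"].clip(upper=120)
print(capped.tolist())

# Option 2: keep only the rows at or below the threshold
trusted = sample[sample["Duration"] <= 120]
print(trusted)
```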


## 4. Removing Duplicates
Duplicate rows are rows that have been registered more than once in the dataset.

### Example:
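Rows 11 and 12 are duplicates: they have the same duration, date, and calories.

To remove duplicates, we use the `drop_duplicates()` method, which by default keeps the first occurrence of each repeated row: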


In [None]:

# Remove duplicates
df_no_duplicates = df.drop_duplicates()

# Show the cleaned DataFrame
print(df_no_duplicates.to_string())
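
To see which rows would be removed before dropping them, `duplicated()` returns a boolean Series that is `True` for every row repeating an earlier one. The `keep` argument of `drop_duplicates()` controls which copy survives. A minimal sketch on a three-row illustrative frame:

```python
import pandas as pd

sample = pd.DataFrame({
    "Duration": [60, 60, 60],
    "Date": ["2020/12/12", "2020/12/12", "2020/12/13"],
    "Calories": [250.7, 250.7, 345.3],
})

# True for every row that repeats an earlier row
is_repeat = sample.duplicated()
print(is_repeat)

# keep='first' (the default) keeps the first copy of each repeated row
first_kept = sample.drop_duplicates(keep="first")
print(first_kept)

# keep=False removes every copy of a repeated row
none_kept = sample.drop_duplicates(keep=False)
print(none_kept)
```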
