## 2nd Exercise

Let's work through a practical example of data cleaning and preprocessing using Python's pandas library. 

We'll use an illustrative 'real-world' dataset of temperature measurements, where each row holds a date and a temperature value.

In this example, we'll cover checking for missing values, handling duplicate dates, and identifying non-date entries.

## Step 1: Import Required Libraries

In [13]:
import pandas as pd

## Step 2: Load the Dataset

Assume you have a CSV file named "sample_temperature_data.csv" with the columns "Date" and "Temperature". Load the dataset into a pandas DataFrame. The data also contains NaN (missing) values; see the [pandas guide to working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) for background.

In [14]:
# Load the dataset
data = pd.read_csv("sample_temperature_data.csv")

# Display the first few rows of the dataset
print(data.head())

         Date  Temperature
0   2023-01-1         15.5
1   2023-01-2         16.2
2   2023-01-3         14.8
3  2023-01-04          NaN
4  2023-01-05         17.2


## Step 3: Data Cleaning and Preprocessing

### 3a. Checking for Missing Values:

In [15]:
# Check for missing values
missing_values = data.isnull().sum()

# Display the count of missing values for each column
print(missing_values)

Date           1
Temperature    1
dtype: int64
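
The per-column counts tell you how many values are missing, but not which rows they sit in. A minimal sketch of how those rows could be listed, assuming a small stand-in DataFrame (the values below are invented for illustration and are not the contents of "sample_temperature_data.csv"):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial dataset (invented values)
data = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", None, "2023-01-04"],
    "Temperature": [15.5, np.nan, 18.2, 17.2],
})

# isnull() gives a True/False table; any(axis=1) collapses it to one flag per row
rows_with_missing = data[data.isnull().any(axis=1)]

# Show only the rows that contain at least one missing value
print(rows_with_missing)
```

Seeing the affected rows makes it easier to decide whether to fill a value or drop the row.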


### 3b. Handling Duplicate Dates:

In [16]:
# Check for duplicate date entries
duplicate_dates = data[data.duplicated('Date')]

# Display the duplicate rows based on the "Date" column
print(duplicate_dates)


          Date  Temperature
6   2023-01-06         16.5
10  2023-01-10         16.1
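
By default, `duplicated` marks only the second and later occurrences of a date, so the first occurrence of each repeated date is not shown above. The sketch below, on a small stand-in DataFrame with invented values, compares the default with `keep=False`, which marks every copy:

```python
import pandas as pd

# Hypothetical stand-in for the tutorial dataset (invented values)
data = pd.DataFrame({
    "Date": ["2023-01-05", "2023-01-06", "2023-01-06", "2023-01-07"],
    "Temperature": [17.2, 16.0, 16.5, 15.9],
})

# Default (keep='first'): only the later repeats are flagged
later_repeats = data[data.duplicated("Date")]

# keep=False: every row whose date occurs more than once is flagged
all_copies = data[data.duplicated("Date", keep=False)]

print(later_repeats)
print(all_copies)
```

Viewing all copies side by side helps when the duplicated rows carry different temperatures and you need to choose which one to keep.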


### 3c. Identifying Non-Date Entries:

In [17]:
# Identify entries in the "Date" column that cannot be parsed as dates
# (errors='coerce' turns them into NaT, which isna() then flags)
non_date_entries = data[pd.to_datetime(data['Date'], errors='coerce').isna()]

# Display the rows with non-date entries
print(non_date_entries)

   Date  Temperature
13  NaN         18.2
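
The check above flags any entry that fails to parse, which covers both missing dates and text that is not a date. A possible way to tell the two apart is sketched below. The DataFrame is a stand-in with invented values, and the explicit `format="%Y-%m-%d"` is an assumption about how the dates are written:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial dataset (invented values)
data = pd.DataFrame({
    "Date": ["2023-01-04", "2023-01-05", "not a date", np.nan],
    "Temperature": [15.5, 17.2, 16.0, 18.2],
})

# Unparseable or missing entries become NaT instead of raising an error
parsed = pd.to_datetime(data["Date"], format="%Y-%m-%d", errors="coerce")

# Missing: the original cell was already empty
missing = data[data["Date"].isna()]

# Malformed: the cell had a value, but it could not be parsed as a date
malformed = data[parsed.isna() & data["Date"].notna()]

print(missing)
print(malformed)
```

Malformed entries can often be corrected by hand, whereas missing ones usually have to be dropped or reconstructed from context.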


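## Step 4: Save Processed Data

Save the data to a new CSV file. Note that the checks in Step 3 only inspect the data and do not modify it, so the saved file has the same content as the original until you complete the tasks below.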

In [18]:
# Save cleaned data to a new CSV file
data.to_csv("cleaned_temperature_data.csv", index=False)

# Display the first few rows of the cleaned data
print(data.head())


         Date  Temperature
0   2023-01-1         15.5
1   2023-01-2         16.2
2   2023-01-3         14.8
3  2023-01-04          NaN
4  2023-01-05         17.2
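
To confirm that a file was written as expected, it can be read back and compared with the DataFrame. This is a minimal sketch on a stand-in DataFrame with invented values. It writes to a temporary directory, which is an assumption made here so the example leaves no files behind:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial dataset (invented values)
data = pd.DataFrame({
    "Date": ["2023-01-04", "2023-01-05"],
    "Temperature": [np.nan, 17.2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_temperature_data.csv")
    # index=False keeps the row index out of the file
    data.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same columns and shape, and the NaN survives the round trip
print(list(reloaded.columns))
print(reloaded.shape)
print(reloaded["Temperature"].isna().sum())
```

Without `index=False`, the reloaded file would contain an extra unnamed column holding the old row index.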


### TASK 1: Delete the identified duplicates and display the new table (hint: create a code cell and use the code below)

```
# Drop duplicate date entries, keeping the first occurrence of each date
data.drop_duplicates(subset='Date', inplace=True)

# Display the dataset after removing duplicates
print("\nDataset after Removing Duplicates:")
print(data)
```

### TASK 2: Try to reuse some of the previous code to fill the missing values here.
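
As a starting point, here are two common ways to fill a missing temperature. This is only a sketch on a stand-in DataFrame with invented values. It is not necessarily the approach used in the earlier exercise, and which option is appropriate depends on your data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial dataset (invented values)
data = pd.DataFrame({
    "Date": ["2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"],
    "Temperature": [14.8, np.nan, 17.2, 17.5],
})

# Option 1: replace missing values with the column mean
mean_filled = data["Temperature"].fillna(data["Temperature"].mean())

# Option 2: linear interpolation between the neighbouring measurements
interpolated = data["Temperature"].interpolate()

print(mean_filled)
print(interpolated)
```

Interpolation assumes the rows are in date order and evenly spaced, so sort by date first if that is not guaranteed.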

### TASK 3: Calculate and display the percentage of missing values for each column

In the code below, we calculate the percentage of missing values for each column by dividing the count of missing values, `data.isnull().sum()`, by the total number of rows in the dataset, `len(data)`, and multiplying by 100. Finally, we display the resulting percentages for each column.

This will provide you with insights into which columns have the highest percentage of missing values, helping you prioritize your data cleaning efforts.

```
# Calculate the percentage of missing values for each column
missing_percentage = (data.isnull().sum() / len(data)) * 100

# Display the percentage of missing values for each column
print("Percentage of Missing Values for Each Column:")
print(missing_percentage)
```









## Summary

In this example, we:
- loaded a temperature dataset,
- checked for missing values,
- handled duplicate date entries, and
- identified non-date entries.

The cleaned and preprocessed data is then saved to a new CSV file.

More advanced techniques may be appropriate, depending on the specific nature of your dataset.