# Introduction to Data Preprocessing

## Importance of Data Quality
Data quality is crucial because the performance of machine learning models depends heavily on the quality of the data used for training. Poor quality data can lead to inaccurate models, misleading conclusions, and suboptimal decision-making. High-quality data should be accurate, complete, consistent, and relevant.

Key aspects of data quality include:
1. Accuracy: Data values are correct.
2. Completeness: All necessary data is present.
3. Consistency: Data agrees across different sources.
4. Relevance: Data pertains to the problem being solved.
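Some of these aspects can be checked mechanically. The following is a minimal sketch, not a complete audit: the toy table and the `quality_report` helper are hypothetical names introduced only for illustration, and relevance is left out because it is a judgment about the problem rather than a property that code can measure.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table used only for illustration.
toy = pd.DataFrame({
    "Date": ["2020-03-01", "2020-03-02", "2020-03-02", "2020-03-03"],
    "Country": ["A", "A", "A", "A"],
    "Confirmed": [10, 12, 12, np.nan],
    "Deaths": [0, 1, 1, -2],
})

def quality_report(df, count_cols):
    """Summarize a few simple data-quality indicators."""
    return {
        # Completeness: number of missing entries per column
        "missing": df.isnull().sum().to_dict(),
        # Consistency: number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Accuracy: count columns should never contain negative values
        "negative_counts": {c: int((df[c] < 0).sum()) for c in count_cols},
    }

report = quality_report(toy, ["Confirmed", "Deaths"])
print(report)
```

Here the report flags one missing `Confirmed` value, one duplicated row, and one negative `Deaths` value.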


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.ndimage import gaussian_filter

## Load the COVID-19 Dataset
We'll use a COVID-19 dataset containing country-wise case statistics. The cell below reads `countries-aggregated.csv` directly from the URL given in the code, so no manual download is needed.

A related COVID-19 dataset is also available on Kaggle: [COVID-19 Dataset](https://www.kaggle.com/imdevskp/corona-virus-report)

In [2]:
# Load the COVID-19 dataset
url = "https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv"
df = pd.read_csv(url)

# Display the first few rows of the dataset
print("First few rows of the COVID-19 dataset:")
df.head()
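Before cleaning, it usually helps to confirm column types and parse dates at load time. The sketch below runs on a small in-memory sample instead of the remote file. The column layout (`Date`, `Country`, `Confirmed`, `Recovered`, `Deaths`) is an assumption about `countries-aggregated.csv`, and the rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample rows in the layout assumed for countries-aggregated.csv.
sample_csv = io.StringIO(
    "Date,Country,Confirmed,Recovered,Deaths\n"
    "2020-01-22,Afghanistan,0,0,0\n"
    "2020-01-23,Afghanistan,0,0,0\n"
    "2020-01-22,Albania,0,0,0\n"
)

# parse_dates converts the Date column to datetime64 instead of plain strings.
sample = pd.read_csv(sample_csv, parse_dates=["Date"])

print(sample.dtypes)                 # column types after parsing
print(sample.shape)                  # (rows, columns)
print(sample["Country"].nunique())   # number of distinct countries
```

If the real file has the same layout, passing `parse_dates=["Date"]` to the `pd.read_csv(url)` call above should work the same way.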

## Common Data Issues

### 1. Missing Values
Missing values can occur due to various reasons like data entry errors, sensor malfunction, or data corruption. It's essential to handle missing values appropriately to avoid bias and inaccuracies in the model.

Example:

In [3]:
# Checking for missing values
print("\nMissing values in each column:")
print(df.isnull().sum())

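# Since there are no missing values in this dataset, we'll simulate some missing values for demonstration
df_missing = df.copy()
df_missing['Confirmed'] = df_missing['Confirmed'].astype(float)  # NaN needs a float column
df_missing.loc[0:10, 'Confirmed'] = np.nan  # .loc slicing is inclusive: rows 0 through 10

# Handling missing values by mean imputation.
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not update df_missing in recent pandas versions.
df_missing['Confirmed'] = df_missing['Confirmed'].fillna(df_missing['Confirmed'].mean())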

print("\nData after handling missing values:")
print(df_missing.isnull().sum())
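Mean imputation over the whole column ignores which country a row belongs to, which can matter for cumulative counts. As a hedged alternative (a different technique from the mean imputation above), the sketch below interpolates linearly within each country. The toy values are hypothetical, and the rows are assumed to be already sorted by date within each country.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: cumulative counts for two countries, with gaps.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Confirmed": [10.0, np.nan, np.nan, 40.0, 1000.0, np.nan, 1200.0, 1300.0],
})

# Global mean imputation: every gap gets the same value, regardless of country.
global_mean = toy["Confirmed"].fillna(toy["Confirmed"].mean())

# Linear interpolation within each country fills gaps from neighbouring rows.
per_country = toy.groupby("Country")["Confirmed"].transform(
    lambda s: s.interpolate(method="linear")
)

print(pd.DataFrame({"global_mean": global_mean, "per_country": per_country}))
```

In this toy example the global mean is 710, so country A's gaps are filled with a value far above its observed range, while per-country interpolation yields 20 and 30.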

### 2. Outliers
Outliers are data points that differ significantly from other observations. They can distort statistical analyses and models if not handled properly.

Example:

In [4]:
# Visualizing outliers in the 'Confirmed' cases using box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['Confirmed'])
plt.title("Box plot of Confirmed Cases with outliers")
plt.show()

# Handling outliers by capping
q1 = df['Confirmed'].quantile(0.25)
q3 = df['Confirmed'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Cap values outside [lower_bound, upper_bound] at the nearest bound
df['Confirmed'] = df['Confirmed'].clip(lower=lower_bound, upper=upper_bound)

print("\nData after capping outliers in Confirmed Cases:")
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['Confirmed'])
plt.title("Box plot of Confirmed Cases after capping outliers")
plt.show()
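Capping overwrites the original values. It can be useful to flag outliers first and decide afterwards how to treat them. The sketch below wraps the same 1.5 × IQR rule in a small helper; `iqr_bounds` and the toy series are hypothetical names introduced for illustration.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy series with one obviously extreme value.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

lower, upper = iqr_bounds(s)
is_outlier = (s < lower) | (s > upper)      # flag without modifying the data
capped = s.clip(lower=lower, upper=upper)   # capped copy; s itself is unchanged

print(f"bounds: ({lower}, {upper})")
print(s[is_outlier])
```

For this series Q1 = 3 and Q3 = 7, so the fences are -3 and 13 and only the value 100 is flagged.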

### 3. Noise
Noise refers to random variations in data that can obscure patterns. It can result from errors in data collection or processing. Smoothing techniques can help reduce noise.

Example:

In [5]:
# Sample noisy signal
np.random.seed(0)
time = np.linspace(0, 4*np.pi, 500)
signal = np.sin(time) + np.random.normal(0, 0.2, 500)

plt.plot(time, signal, label='Noisy Signal')
plt.title("Noisy Signal")
plt.legend()
plt.show()

# Reducing noise using Gaussian filter
smoothed_signal = gaussian_filter(signal, sigma=2)
plt.plot(time, smoothed_signal, label='Smoothed Signal')
plt.title("Smoothed Signal")
plt.legend()
plt.show()
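A plot shows the effect of smoothing qualitatively. When the clean signal is known, as with a synthetic sine wave, the improvement can also be measured. The sketch below swaps in a centered moving average (a different technique from the Gaussian filter above) and compares root-mean-square error against the clean sine. The window size of 15 and the `rmse` helper are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic signal: a clean sine wave plus Gaussian noise.
rng = np.random.default_rng(0)
time = np.linspace(0, 4 * np.pi, 500)
clean = np.sin(time)
noisy = clean + rng.normal(0, 0.2, 500)

# Centered moving average; min_periods=1 keeps values at the edges.
smoothed = (
    pd.Series(noisy)
    .rolling(window=15, center=True, min_periods=1)
    .mean()
    .to_numpy()
)

def rmse(a, b):
    """Root-mean-square error between two equally sized arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

print(f"RMSE noisy:    {rmse(noisy, clean):.3f}")
print(f"RMSE smoothed: {rmse(smoothed, clean):.3f}")
```

A larger window removes more noise but also flattens genuine peaks, so the window size is a trade-off rather than a fixed rule.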