# Module 1: Data Analysis and Data Preprocessing

## Section 1: Handling missing data

### Part 1: Dropping missing data

In this part, we discuss dropping missing data, a preprocessing technique for datasets that contain missing values. The idea is simple: remove the rows or columns that contain missing values. Use this technique with caution, because it discards data and can bias the analysis. Let's explore it in more detail.

### 1.1 Understanding dropping missing data

Dropping missing data involves removing rows or columns that contain one or more missing values from the dataset. This approach is a simple and quick way to handle missing data, and it can be effective when the amount of missing data is small and random.
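
Before dropping anything, it helps to measure how much data is actually missing. The following is a minimal sketch using pandas; the DataFrame and its column names are made-up illustration data, not a real dataset.

```python
import pandas as pd

# Hypothetical toy data; None is stored as NaN in numeric columns
df = pd.DataFrame({
    "A": [1, 2, None, 4, 5],
    "B": [10, 20, 30, None, 50],
    "C": [100, 200, 300, 400, None],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Fraction of rows that contain at least one missing value
rows_affected = df.isna().any(axis=1).mean()

print(missing_per_column)
print(f"Rows with at least one missing value: {rows_affected:.0%}")
```

In this toy data each column is missing only 20% of its values, yet 60% of the rows contain at least one gap, so dropping rows would discard more than half of the data.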

There are two main strategies for dropping missing data:

- Row Dropping: If the missing values are primarily in a small number of rows, dropping those rows may not significantly impact the overall dataset.

- Column Dropping: If a feature has a large number of missing values or a high proportion of missing data, it might be better to drop the entire column.

### 1.2 Pros and cons of dropping missing data

Pros:

- Simple and quick to implement.
- Suitable when only a small proportion of the data is missing.
- Removes problematic rows or columns that could affect the analysis.

Cons:

- Leads to data loss, which can reduce the overall sample size.
- May introduce bias, especially if the missing data is not missing completely at random (MCAR).
- Discards the valid values in every dropped row or column, so the information loss can be significant when missing values are spread across many rows.
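
The bias risk is easiest to see with a small example. The sketch below uses made-up survey data and assumes that respondents with high incomes skip the income question, so the missingness depends on the value itself and is not MCAR.

```python
import pandas as pd

# Hypothetical survey data with no missing values (the "truth")
df = pd.DataFrame({
    "age": [25, 32, 41, 48, 55, 60],
    "income": [30_000, 40_000, 50_000, 60_000, 90_000, 120_000],
})
true_mean = df["income"].mean()

# Simulate non-random missingness: incomes above 80,000 are not reported
observed = df.assign(income=df["income"].where(df["income"] <= 80_000))

# Dropping incomplete rows shrinks the sample and shifts the estimate
complete = observed.dropna()
dropped_mean = complete["income"].mean()

print(f"Rows kept: {len(complete)} of {len(observed)}")
print(f"Mean income, full data:    {true_mean:.0f}")
print(f"Mean income, after dropna: {dropped_mean:.0f}")
```

The mean computed after dropping rows (45,000) underestimates the true mean (65,000), because the dropped rows are exactly the high-income ones.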

### 1.3 Dropping missing data using pandas

Scikit-learn does not provide a dedicated function for dropping missing values; its missing-data tools, such as SimpleImputer, focus on imputation instead. To drop missing values, you can use pandas, a popular library for data manipulation, or NumPy.
Here's an example using pandas:

In [None]:
import pandas as pd

# Suppose we have a DataFrame df with some missing values (None is stored as NaN):
data = {
    'A': [1, 2, None, 4, 5],
    'B': [10, 20, 30, None, 50],
    'C': [100, 200, 300, 400, None]
}

df = pd.DataFrame(data)

# The DataFrame df looks like this
print("Original DataFrame:")
print(df)

# Use pandas' dropna() to drop every row that contains at least one missing value
df_dropped_rows = df.dropna()
print("\nDropped rows DataFrame:")
print(df_dropped_rows)
# Rows 2, 3, and 4 each contained a missing value, so only rows 0 and 1 remain.

# Pass axis=1 to drop columns that contain at least one missing value instead
df_dropped_columns = df.dropna(axis=1)
print("\nDropped columns DataFrame:")
print(df_dropped_columns)
# Every column in this example contains a missing value, so all three columns
# are dropped and the resulting DataFrame has no columns left.
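
Dropping on any missing value is often too aggressive. dropna() also accepts the how, subset, and thresh parameters for finer control. The sketch below shows each one on a small made-up DataFrame; the data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 4 is entirely missing, rows 2 and 3 are partly missing
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, np.nan],
    "B": [10, 20, 30, np.nan, np.nan],
    "C": [100, 200, 300, np.nan, np.nan],
})

# how="all": drop a row only if every value in it is missing
non_empty_rows = df.dropna(how="all")

# subset: look for missing values only in the listed columns
complete_in_a = df.dropna(subset=["A"])

# thresh: keep rows that have at least this many non-missing values
at_least_two = df.dropna(thresh=2)

print(non_empty_rows.index.tolist())
print(complete_in_a.index.tolist())
print(at_least_two.index.tolist())
```

These options let you keep rows that are only partly incomplete, which limits the data loss described in section 1.2.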

### Dropping missing data using NumPy

In this NumPy example, we create a NumPy array arr with missing values (represented as np.nan). We then use np.isnan(arr) to identify the locations of missing values and np.isnan(arr).any(axis=1) to check if any values are missing in each row. The ~ operator is used to negate the condition, and we use it to index the array and drop rows with any missing values.

In [None]:
import numpy as np

# Sample array with missing values
arr = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
print("Original Array:")
print(arr)

# Dropping rows with any missing values
arr_dropped_rows = arr[~np.isnan(arr).any(axis=1)]
print("\nArray after dropping rows with any missing values:")
print(arr_dropped_rows)
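
The same masking idea extends to columns by reducing along axis=0 instead of axis=1. This is a minimal sketch with a made-up array, in which only the last column contains a missing value.

```python
import numpy as np

# Hypothetical array: only the last column contains a missing value
arr = np.array([[1, 2, np.nan], [4, 5, 6], [7, 8, 9]])

# Boolean mask with one entry per column: True where the column has a NaN
cols_with_nan = np.isnan(arr).any(axis=0)

# Keep all rows, and only the columns without missing values
arr_dropped_cols = arr[:, ~cols_with_nan]
print(arr_dropped_cols)
```

Note that np.isnan works only on numeric arrays, so this approach assumes the data is stored with a floating-point dtype.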

### 1.4 Summary

Dropping missing data is a straightforward data preprocessing technique used to handle missing values. It can be effective when the amount of missing data is small and random. However, it's essential to carefully consider the impact of data loss and potential bias before using this approach.

Remember that dropping missing data is just one of the methods to handle missing values. Depending on your dataset and the extent of missing data, other imputation or interpolation techniques may be more appropriate. Always consider the characteristics of your data and the implications of each method before making a decision.

In the next part, we will explore imputation, which can be more suitable for handling missing data in certain scenarios.