Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.

Solution:

**Missing values** in a dataset are data that are not stored for certain variables or participants. They can occur due to various reasons, such as incomplete data entry, equipment malfunctions, lost files, etc. Missing values can be classified into three types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).


Handling missing values is crucial because they reduce the precision of statistical analysis and, when the missingness is not random, can bias the results. A machine learning model trained on such data may itself be biased and produce incorrect predictions. Handling missing values properly makes it possible to draw accurate inferences from the data.


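Some machine learning algorithms can tolerate missing values. For example, the **k-nearest neighbors (KNN)** algorithm works on the principle of a distance measure, and that distance can be computed while ignoring a column whose value is missing. **Naive Bayes** can likewise skip a missing feature when making a prediction, because each feature contributes to the class probability independently. Tree-based gradient boosting libraries such as XGBoost and LightGBM also handle missing values natively. Whether this works in practice depends on the implementation: many library versions of KNN and Naive Bayes still expect complete input.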



Q2: List down techniques used to handle missing data. Give an example of each with python code.

Solution:

There are several techniques to handle missing data in a dataset:

1. Deleting Rows with Missing Values: This method involves removing the rows that contain missing values. However, it is not generally advised as it might result in loss of information from other columns which do not have missing values.

2. Mean/Median/Mode Imputation: This method involves filling the missing values with the mean, median, or mode of the non-missing data in the same column.


```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8], 'C': [9, 10, 10, 12]})

# Deleting rows with missing values.
# dropna() returns a new DataFrame; df itself is left unchanged.
df.dropna()
```

|   | A   | B   | C  |
|---|-----|-----|----|
| 0 | 1.0 | 5.0 | 9  |
| 3 | 4.0 | 8.0 | 12 |


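```python
# Filling missing values in column A with the column mean.
# int() truncates the mean (7/3, about 2.33) to 2; drop int() to impute the exact mean.
df['A'] = df['A'].fillna(int(df['A'].mean()))
df.head()
```

|   | A   | B   | C  |
|---|-----|-----|----|
| 0 | 1.0 | 5.0 | 9  |
| 1 | 2.0 | NaN | 10 |
| 2 | 2.0 | 7.0 | 10 |
| 3 | 4.0 | 8.0 | 12 |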
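The example above covers only the mean. A minimal sketch of the median and mode variants follows; the `age` and `city` columns are hypothetical and chosen so that the median is visibly more robust than the mean and the mode is well defined.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a numeric column with an extreme value, a categorical column.
people = pd.DataFrame({
    'age': [25, 30, np.nan, 100],
    'city': ['Pune', None, 'Pune', 'Delhi'],
})

filled = people.copy()

# Median imputation: less sensitive to the extreme value 100 than the mean.
filled['age'] = filled['age'].fillna(filled['age'].median())

# Mode imputation: mode() returns a Series (ties are possible), so take the first entry.
filled['city'] = filled['city'].fillna(filled['city'].mode()[0])
```

Median imputation is usually preferred for skewed numeric columns, and mode imputation for categorical ones.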


Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Solution: 

**Imbalanced data** refers to a situation in classification where the target classes are represented very unequally: one class has a much higher or lower number of observations than the other class(es). Imbalanced data can pose challenges for machine learning algorithms and degrade their performance. It can occur in various domains such as finance, healthcare, and the public sector.

If imbalanced data is not handled properly, it can lead to models that are **biased** toward the majority class, resulting in poor performance on the minority class. This is because many machine learning algorithms are designed to maximize overall accuracy. As a result, the minority class observations might look like noise to the model and are often ignored, which produces misleadingly high accuracy scores. Moreover, imbalanced data can cause **overfitting**, where the model learns to memorize the majority class and fails to generalize to new, unseen data. This results in poor performance in real-world applications, as the model cannot adapt to variations in the data.
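The misleading-accuracy effect can be illustrated with a small sketch. The labels below are made up (95 majority, 5 minority), and the "model" is simply a rule that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 95 negatives (majority) and 5 positives (minority).
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Overall accuracy looks high...
accuracy = (y_pred == y_true).mean()

# ...but none of the minority cases are found.
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

Here the accuracy is 0.95 while the recall on the minority class is 0, which is why accuracy alone is a poor guide on imbalanced data.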

Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Solution: 

**Up-sampling** is the process of increasing the frequency or size of the data by adding more data points between or alongside the existing ones. In image processing, up-sampling increases the resolution and size of the image. In an imbalanced classification dataset, it means adding instances to the minority class.

**Down-sampling**, on the other hand, is the process of decreasing the frequency or size of the data. In image processing, down-sampling reduces the number of pixels in an image, thereby decreasing its resolution and size. In an imbalanced classification dataset, it means removing instances from the majority class.

**Example when up-sampling and down-sampling are required:**

In **image processing**, down-sampling can be used to reduce the storage and/or transmission requirements of images. For instance, if you have a high-resolution image that you need to send over a network with bandwidth limitations, you might down-sample the image to reduce its size.

Up-sampling, on the other hand, can be used when you want to increase the resolution of an image. For example, if you have a low-resolution image that you want to print in a large format, you might up-sample the image to increase its resolution and avoid pixelation.
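A minimal sketch of both operations on an image-like array, assuming a hypothetical 4x4 grayscale image and the simplest possible schemes (pixel dropping and nearest-neighbor repetition). Real image libraries use smoother interpolation and anti-aliasing filters.

```python
import numpy as np

# Hypothetical 4x4 grayscale "image".
image = np.arange(16).reshape(4, 4)

# Down-sample by keeping every second pixel along each axis: 4x4 -> 2x2.
down = image[::2, ::2]

# Up-sample by repeating each pixel twice along each axis: 4x4 -> 8x8.
up = np.repeat(np.repeat(image, 2, axis=0), 2, axis=1)

print(down.shape, up.shape)
```

Down-sampling discards information that up-sampling cannot recover; up-sampling only spreads the existing pixels over a larger grid.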

Q5: What is data Augmentation? Explain SMOTE.

Solution:

Data augmentation is a set of techniques used to increase the amount and diversity of data by generating new data points from existing data. This process does not involve collecting new data, but rather transforming the already present data. In the context of image processing, for example, data augmentation might involve operations such as rotation, shearing, zooming, cropping, flipping, and changing the brightness level.

SMOTE (Synthetic Minority Over-sampling Technique) is a technique used in machine learning to address imbalanced datasets, where the minority class has significantly fewer instances than the majority class. SMOTE generates synthetic instances of the minority class by interpolating between an existing minority instance and one of its nearest minority-class neighbors.
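The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not a reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all assumptions. In practice, the `SMOTE` class from the imbalanced-learn package is the usual choice.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min (needs len(X_min) > k)."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))   # pick a minority sample
        nb = rng.choice(neighbors[base])  # pick one of its k nearest neighbors
        gap = rng.random()                # random position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class samples with two features.
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_sketch(X_min, n_new=5)
```

Each synthetic point lies on the line segment between two real minority samples, so no point falls outside the region the minority class already occupies.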

Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Solution:

Outliers in a dataset are data points that deviate significantly from the rest of the observations. They are extreme values that stand out greatly from the overall pattern of values in a dataset or graph. Outliers can occur due to various reasons such as variability in the data, experimental errors, or human errors.

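Handling outliers is crucial for several reasons:
1. Detecting Errors: an outlier may point to a data entry or measurement mistake that needs correcting.
2. Understanding Natural Variation: some outliers are genuine observations and reveal how much the data can really vary.
3. Impacting Statistical Measures: statistics such as the mean and standard deviation are pulled strongly toward extreme values.
4. Ensuring Accurate Results: conclusions drawn from distorted statistics can be wrong.
5. Affecting Machine Learning Models: models that are sensitive to extreme values, such as linear regression, can be skewed by a few outliers.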
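The effect on statistical measures can be seen in a short sketch. The measurements are made up, and the 1.5 × IQR rule used to flag outliers is one common convention among several, not the only option.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 95])

# The mean is pulled toward the extreme value; the median is not.
mean_with = values.mean()
median_with = np.median(values)

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

cleaned = values[~is_outlier]
print(mean_with, median_with, cleaned.mean())
```

Whether a flagged point should be removed, capped, or kept depends on whether it is an error or a genuine observation.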


Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

Solution:

There are several techniques to handle missing data in a dataset:
1. Deleting Rows with missing values
2. Mean/Median/Mode Imputation
3. Random Sample Imputation
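Random sample imputation fills each gap with a value drawn at random from the observed values of the same column, which preserves the column's distribution better than a single constant. A minimal sketch, assuming a hypothetical `income` column:

```python
import numpy as np
import pandas as pd

# Hypothetical customer column with gaps.
income = pd.Series([40.0, np.nan, 55.0, 62.0, np.nan, 48.0], name='income')

missing = income.isna()

# Draw replacements, with replacement, from the observed values.
draws = income.dropna().sample(n=int(missing.sum()), replace=True, random_state=0)

imputed = income.copy()
# to_numpy() drops the index so values are assigned by position, not by label.
imputed[missing] = draws.to_numpy()
```

Fixing `random_state` keeps the imputation reproducible between runs.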

Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?

Solution:

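Determining whether data is missing at random or follows a pattern comes down to understanding why each value is missing, for example by checking whether missingness in one column is related to the values in other columns. Once that is understood, there are several strategies for dealing with the missing data: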

1. **Try to obtain the missing data**: If possible, try to collect the missing data. This could involve reaching out to the data source or conducting further research.

2. **Leave out incomplete cases and use only those for which all variables are available**: This strategy involves analyzing only the complete cases in your dataset.

3. **Replace missing data by a conservative estimate, e.g., the sample mean**: This involves imputing the missing values with a conservative estimate like the mean, median, or mode.

4. **Try to estimate the missing data from the other data on the person**: If you have other related variables in your dataset, you can use them to estimate the missing values.

5. **Mean or Median Imputation**: This method involves filling the missing values with the mean or median of the non-missing data in the same column.

6. **Multivariate Imputation by Chained Equations (MICE)**: This is a more advanced method that involves using multiple variables in your dataset to estimate the missing values.

7. **Random Forest**: This method involves using a Random Forest model to predict and fill in missing values based on other variables.

Remember that it's important to understand why your data is missing before deciding on a strategy. The reason for the missing data can help you determine whether it's missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR).
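One simple way to look for a pattern is to build a missingness indicator and compare the other variables between the rows where a value is missing and the rows where it is present. The sketch below uses made-up `age` and `income` columns. A clear difference hints that the data is not MCAR, but observed data alone cannot confirm or rule out MNAR.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which income is missing more often for older customers.
df = pd.DataFrame({
    'age':    [22, 25, 31, 38, 45, 52, 58, 63, 67, 71],
    'income': [30, 32, 41, np.nan, 50, np.nan, 58, np.nan, np.nan, np.nan],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Indicator column: 1 where income is missing, 0 where it is observed.
df['income_missing'] = df['income'].isna().astype(int)

# Compare an observed variable between the two groups.
age_by_missing = df.groupby('income_missing')['age'].mean()
print(age_by_missing)
```

In this toy data the rows with missing income are noticeably older on average, which suggests the missingness depends on age.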



Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Solution: 

In terms of evaluation metrics, accuracy might not be a good indicator due to the imbalance. Instead, consider using:

* Precision: It tells us what proportion of patients that we diagnosed as having the condition, actually had the condition.

* Recall (Sensitivity): It tells us what proportion of patients that actually had the condition were diagnosed by the algorithm as having the condition.

* F1-Score: It is the harmonic mean of precision and recall and provides a balance between them.

* Area Under the Receiver Operating Characteristic Curve (AUROC): It tells us how much the model is capable of distinguishing between classes.

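* Confusion Matrix: It is a table that counts the true positives, false positives, true negatives, and false negatives of a classification model; precision, recall, and the F1-score are all computed from these counts.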
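The metrics above can be computed by hand from the four confusion-matrix counts. The labels below are invented for illustration; in practice `sklearn.metrics` provides `precision_score`, `recall_score`, `f1_score`, `roc_auc_score`, and `confusion_matrix`.

```python
import numpy as np

# Hypothetical results for 12 patients (1 = has the condition).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

tp = int(((y_pred == 1) & (y_true == 1)).sum())  # correctly diagnosed
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # false alarms
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # missed cases
tn = int(((y_pred == 0) & (y_true == 0)).sum())  # correctly cleared

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Rows are actual classes, columns are predicted classes.
confusion = np.array([[tn, fp], [fn, tp]])
print(confusion, precision, recall, f1)
```

In a medical setting recall often matters most, because a missed case (false negative) is usually costlier than a false alarm.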

Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

Solution: 

When dealing with an imbalanced dataset where the majority of customers are satisfied, you can use the following methods to down-sample the majority class and balance the dataset:

1. **Random Under-sampling**: This involves randomly eliminating instances from the majority class to achieve a more balanced dataset. However, this method might discard potentially useful data.

2. **Cluster-Based Under-sampling**: In this method, the majority class is divided into several clusters. Instances in each cluster are then under-sampled. This ensures that the under-sampling process does not result in loss of data diversity.

3. **Tomek Links**: Tomek links are pairs of instances from opposite classes that are each other's nearest neighbors. In other words, they are instances that are very close to each other but belong to different classes. You can remove the majority-class instance from each pair to increase the separation between classes.

4. **Edited Nearest Neighbors (ENN)**: This method removes any majority-class instance whose actual class disagrees with the class predicted by its nearest neighbors (typically the three nearest).

Remember, while these methods can help balance the classes, they might also remove potentially important information. It's essential to keep this in mind and consider using a combination of under-sampling and over-sampling (for the minority class) techniques for best results.
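Random under-sampling, the simplest of the methods above, can be sketched with pandas alone. The survey data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical survey: 8 satisfied customers, 2 unsatisfied.
survey = pd.DataFrame({
    'score': [9, 8, 9, 7, 8, 9, 10, 8, 2, 3],
    'satisfied': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
})

majority = survey[survey['satisfied'] == 1]
minority = survey[survey['satisfied'] == 0]

# Keep a random subset of the majority class, the same size as the minority.
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine the two classes and shuffle the rows.
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
print(balanced['satisfied'].value_counts())
```

The cluster-based, Tomek-link, and ENN variants are available in the imbalanced-learn package rather than in pandas.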

Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

Solution: 

When dealing with an imbalanced dataset where the minority class represents a rare event, you can use the following methods to up-sample the minority class and balance the dataset:

1. **Random Over-sampling**: This involves randomly duplicating instances from the minority class to achieve a more balanced dataset. However, this method might lead to overfitting due to the exact replication of data points.

2. **Synthetic Minority Over-sampling Technique (SMOTE)**: This method creates synthetic instances of the minority class by interpolating between existing ones. This can increase the diversity of data, reducing the risk of overfitting compared to random over-sampling.

3. **Adaptive Synthetic (ADASYN) Sampling Method**: This is an extension of SMOTE. It uses a density distribution as a criterion to automatically decide how many synthetic samples to generate for each minority sample, creating more of them for minority samples that are harder to learn (those surrounded by more majority-class neighbors).

4. **Borderline-SMOTE**: This is a variant of SMOTE that generates synthetic instances only from minority-class instances that lie near the decision boundary.

Remember, while these methods can help balance the classes, they might also introduce noise into the dataset. It's essential to keep this in mind and consider using a combination of under-sampling (for the majority class) and over-sampling techniques for best results.
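Random over-sampling, the first method above, amounts to sampling the minority class with replacement. A minimal pandas sketch with hypothetical data and column names:

```python
import pandas as pd

# Hypothetical log: 8 normal records and 2 rare events.
events = pd.DataFrame({
    'reading': [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.9, 0.8],
    'rare_event': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = events[events['rare_event'] == 0]
minority = events[events['rare_event'] == 1]

# Duplicate minority rows at random until both classes have the same size.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)

# Recombine the two classes and shuffle the rows.
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
print(balanced['rare_event'].value_counts())
```

Resampling should be applied only to the training split; over-sampling before the train/test split leaks duplicated rows into the test set.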