## Ans1

Missing values in a dataset are fields of a record for which no data is present. They can arise from data entry errors, data corruption, or data that was never collected for some samples. Handling missing values is essential because they can significantly affect the accuracy and reliability of any analysis or model built on the dataset. If they are not dealt with properly, the result can be biased estimates, reduced statistical power, and incorrect conclusions.<br>
Some algorithms that can tolerate missing values, depending on the implementation, are:<br>
Decision Trees: Some decision tree implementations handle missing values natively, for example by choosing splits from the available data or by using surrogate splits.<br>
Random Forests: Random Forests built on such trees are also robust to missing values because they use multiple decision trees to make a prediction, and each decision tree can handle missing values independently.<br>
Gaussian Mixture Models: When fitted with the EM algorithm, Gaussian Mixture Models can handle missing values by estimating the missing data from the available data and the model parameters.<br>
K-Nearest Neighbors: K-Nearest Neighbors can handle missing values when the distance metric skips the missing features and uses only the features available in both samples to calculate distances.
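
As a rough illustration of the K-Nearest Neighbors point, the sketch below computes a Euclidean distance that skips coordinates missing in either sample and rescales the result by the share of coordinates that were usable. The function name `nan_euclidean` and the sample vectors are assumptions made for this example, not part of any particular library.

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance that ignores coordinates missing in either vector."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # A coordinate is usable only if it is present in both samples
    present = ~np.isnan(a) & ~np.isnan(b)
    if not present.any():
        return float("nan")  # no shared features, distance is undefined
    squared = np.sum((a[present] - b[present]) ** 2)
    # Rescale so that distances based on fewer coordinates stay comparable
    return float(np.sqrt(squared * len(a) / present.sum()))

# Hypothetical samples with one missing feature each
sample_a = [1.0, np.nan, 3.0]
sample_b = [4.0, 5.0, np.nan]
distance = nan_euclidean(sample_a, sample_b)
print(distance)
```

Only the first feature is present in both samples here, so the distance is based on that single coordinate.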

## Ans2

Techniques to handle missing data are:<br>
1 Replacing with mean<br>
2 Replacing with median<br>
3 Replacing with mode<br>
4 Deletion of missing values<br>
5 Interpolation of data

```python
# Example: mean, median, and mode imputation
import pandas as pd

# Load the dataset
df = pd.read_csv("data.csv")

# Replace missing values in numeric columns with the column mean
df_mean = df.fillna(df.mean(numeric_only=True))

# Replace missing values in numeric columns with the column median
df_median = df.fillna(df.median(numeric_only=True))

# Replace missing values with the most frequent value (mode) of each column
df_mode = df.fillna(df.mode().iloc[0])

# Check the number of missing values that remain
print(df_mean.isnull().sum())


# Deleting the missing values

# Load the dataset
df = pd.read_csv("data.csv")

# Drop every row that contains at least one missing value
df = df.dropna()

# Check the number of missing values
print(df.isnull().sum())


# Interpolation of values

# Load the dataset
df = pd.read_csv("data.csv")

# Interpolate missing values in the numeric columns using the linear method
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].interpolate(method="linear")

# Check the number of missing values
print(df.isnull().sum())
```


## Ans3

Imbalanced data refers to a dataset where the number of instances in each class is not equal: one class has significantly more or fewer samples than the others. For example, a binary classification problem with 95% of the samples belonging to class A and only 5% to class B is an imbalanced dataset.

If imbalanced data is not handled, it can lead to several issues in machine learning models. One of the main problems is that the model may become biased towards the majority class, which may lead to poor performance on the minority class. In other words, the model may predict the majority class for most of the instances, resulting in high accuracy for the majority class but low accuracy for the minority class.

Another issue is that the evaluation metrics may be misleading. For example, if the model predicts the majority class for all instances, it will have an accuracy of 95%, which may seem good at first. However, this model is not useful for the minority class, which is the one we may be interested in.

## Ans4

Up-sampling and down-sampling are two common techniques used to address imbalanced data in machine learning.

Up-sampling involves increasing the number of instances in the minority class to balance the class distribution. This can be done by randomly duplicating instances from the minority class or by generating synthetic samples using techniques such as SMOTE.

Down-sampling involves reducing the number of instances in the majority class to balance the class distribution. This can be done by randomly removing instances from the majority class.

Example:

Suppose you are working on a binary classification problem to predict whether a customer will churn or not. You have a dataset with 10,000 samples, out of which only 10% belong to the positive class (churned customers). This is an imbalanced dataset, as the majority class (non-churned customers) accounts for 90% of the samples.

In this case, up-sampling can be used to increase the number of samples in the positive class, which can improve the model's ability to detect the minority class. This can be done by randomly duplicating instances from the positive class or by using synthetic data generation techniques.

On the other hand, down-sampling may be used when the majority class has far more samples than the minority class and the dataset is large enough that some majority samples can be spared. In the churn example, down-sampling would reduce the 9,000 non-churned samples until the class distribution is balanced. However, down-sampling discards some information, which can affect the model's performance.
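
The churn example can be sketched with pandas as follows. The DataFrame is synthetic and the column names `customer_id` and `churned` are assumptions made for illustration. This shows random resampling only, not SMOTE.

```python
import pandas as pd

# Synthetic stand-in for the churn dataset: 10,000 rows, 10% churned
df = pd.DataFrame({
    "customer_id": range(10_000),
    "churned": [1] * 1_000 + [0] * 9_000,
})

majority = df[df["churned"] == 0]
minority = df[df["churned"] == 1]

# Up-sampling: draw minority rows with replacement until both classes match
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)

# Down-sampling: keep a random subset of the majority class
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)

print(upsampled["churned"].value_counts())
print(downsampled["churned"].value_counts())
```

The final `sample(frac=1)` call only shuffles the rows so that the classes are not stored in contiguous blocks.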

## Ans5

Data augmentation is a technique used in machine learning to increase the size and diversity of a dataset by generating new examples from the existing ones. This can help to improve the performance of machine learning models by reducing overfitting and improving generalization.

One popular data augmentation technique is SMOTE. SMOTE is used to address the problem of imbalanced data by generating synthetic examples of the minority class.

SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic examples by interpolating between existing minority class examples. For each chosen example in the minority class, SMOTE finds its k nearest neighbors within the minority class and creates a new example at a random point on the line segment between the original example and one of those neighbors. The number of synthetic examples generated can be controlled by adjusting the sampling strategy and the k value.

For example, suppose you have a dataset with two classes, one of which has significantly fewer samples than the other. You can use SMOTE to generate synthetic examples of the minority class to balance the dataset. By applying SMOTE to the minority class, you can create new examples that are similar to the existing ones but with small variations. 
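
The interpolation step described above can be sketched in NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the toy data are assumptions, and it expects at least two minority samples.

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating towards neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise distances within the minority class, excluding self-matches
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its k nearest neighbors for each new point
    base = rng.integers(0, n, size=n_synthetic)
    chosen = neighbors[base, rng.integers(0, k, size=n_synthetic)]

    # New point = base + lam * (neighbor - base), with lam drawn from [0, 1)
    lam = rng.random((n_synthetic, 1))
    return X_minority[base] + lam * (X_minority[chosen] - X_minority[base])

# Hypothetical minority class with four two-dimensional samples
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_synthetic=10, k=2)
print(synthetic.shape)
```

Because every new point lies on a segment between two existing minority samples, the synthetic data stays inside the region already covered by the minority class.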



## Ans6

Outliers in a dataset refer to observations that are significantly different from other observations in the dataset. These observations may be unusually high or low in value or may have extreme values compared to the rest of the data. Outliers can occur due to errors in data collection or entry, measurement error, or natural variation in the data.

Handling outliers is essential for several reasons. First, outliers can skew statistical analyses, leading to incorrect conclusions about the data. For example, if outliers are not handled, they may significantly impact the mean and standard deviation, which are commonly used in statistical analyses. This can lead to incorrect interpretations of the data, and the resulting models may not be representative of the underlying distribution.

Second, outliers can also have a significant impact on machine learning algorithms. Many machine learning algorithms are sensitive to outliers, and models trained on datasets with outliers may have poor predictive performance. For example, in regression analysis, outliers can significantly impact the coefficients of the model, leading to incorrect predictions.

Third, outliers can also impact data visualization, making it difficult to understand the underlying distribution. Outliers can cause the scale of the data to be distorted, leading to a misleading visualization of the data.
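
A small sketch of the first point: the values below are made up, with one deliberately extreme observation, to show how a single outlier shifts the mean and standard deviation while the median barely moves. The 1.5 × IQR rule used to flag the outlier is one common convention, assumed here for illustration.

```python
import numpy as np

# Made-up measurements with one extreme value
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 300.0])

mean_with = values.mean()
std_with = values.std()
median_with = np.median(values)

# Flag points outside 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

cleaned = values[~is_outlier]
print("mean with / without outlier:", mean_with, cleaned.mean())
print("std with / without outlier:", std_with, cleaned.std())
print("median with / without outlier:", median_with, np.median(cleaned))
```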

## Ans7

Deletion: This technique involves removing observations or variables that contain missing values. This can lead to loss of data.

Imputation: This technique involves filling in the missing values with an estimate based on the available data. There are several imputation techniques, such as mean imputation, median imputation, regression imputation, and k-nearest neighbor imputation.

Hot Deck imputation: This technique involves filling in missing values by randomly selecting a value from a similar observation in the dataset. Because the imputed values are real observed values, hot deck imputation tends to preserve the distribution of the original data, and it is a reasonable choice when the amount of missing data is relatively low.

Multiple imputation: This technique involves creating several imputed datasets, analyzing each of them separately, and pooling the results to obtain a final estimate. Multiple imputation is preferred when the amount of missing data is high and imputing values with a single technique may introduce substantial bias or understate the uncertainty in the final results.
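
A minimal sketch of hot deck imputation with pandas is shown below. The helper `hot_deck_impute` and the columns `segment` and `monthly_spend` are hypothetical. "Similar observation" is taken here to mean "same group", which is only one of several possible definitions.

```python
import numpy as np
import pandas as pd

def hot_deck_impute(df, target, group, seed=0):
    """Fill missing `target` values with a random observed value from the same group."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _, idx in out.groupby(group).groups.items():
        col = out.loc[idx, target]
        donors = col.dropna().to_numpy()   # observed values in this group
        missing = col[col.isna()].index    # rows that need a value
        if len(donors) > 0 and len(missing) > 0:
            out.loc[missing, target] = rng.choice(donors, size=len(missing))
    return out

# Hypothetical customer data with two missing spend values
df = pd.DataFrame({
    "segment": ["a", "a", "a", "b", "b"],
    "monthly_spend": [1.0, np.nan, 3.0, 10.0, np.nan],
})
imputed = hot_deck_impute(df, target="monthly_spend", group="segment")
print(imputed)
```

Groups with no observed donor values are left untouched, so a fallback such as median imputation may still be needed for them.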

## Ans8

Visual analysis: A simple technique to identify the pattern of missing data is to use visual analysis. For example, you can create a histogram or a bar chart of the data, with missing values plotted separately. This can help identify if there is any systematic pattern to the missing data.

Correlation analysis: Correlation analysis can help identify the relationship between the missingness of a variable and the other variables in the dataset. If whether a value is missing is systematically related to other variables, it is an indication that the data is not missing completely at random.

Missing data tests: Statistical tests can be used to check whether the data is missing completely at random (MCAR). For example, Little's MCAR test checks whether the observed data is consistent with the MCAR assumption. The other two mechanisms, Missing at Random (MAR) and Missing Not at Random (MNAR), generally cannot be told apart from the observed data alone, so distinguishing them relies on domain knowledge and sensitivity analysis.

Imputation techniques: Imputation results can offer a rough, informal check on the missingness pattern. For example, if the distribution of the imputed values is similar to that of the observed values, this is consistent with the data being missing at random. If the imputed values differ clearly from the observed values, it suggests that the missingness depends on other variables and deserves a closer look. Neither outcome is conclusive on its own.
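
The correlation analysis described above can be sketched by turning missingness into a 0/1 indicator and comparing it with another variable. The columns `age` and `income` and their values are made up for this example, with income deliberately missing mostly for older customers.

```python
import numpy as np
import pandas as pd

# Made-up data: income is missing mostly for the older customers
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 38, 29, 55],
    "income": [30.0, 42.0, np.nan, np.nan, np.nan, 45.0, 33.0, np.nan],
})

# Share of missing values per column
missing_share = df.isna().mean()

# 1 where income is missing, 0 where it is observed
income_missing = df["income"].isna().astype(int)

# Correlation between the missingness indicator and age
corr = income_missing.corr(df["age"])

# Average age of rows with and without income
mean_age = df.groupby(income_missing)["age"].mean()

print(missing_share)
print("correlation with age:", corr)
print(mean_age)
```

A strong correlation like the one in this toy data suggests that the missingness depends on age, so the data would not be MCAR.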

## Ans9

Use appropriate evaluation metrics: Using appropriate evaluation metrics is crucial when dealing with imbalanced datasets. Metrics such as accuracy can be misleading as they can give high scores even when the model is not performing well on the minority class. Instead, metrics such as precision, recall, F1-score, and AUC-ROC can provide a better understanding of how well the model is performing on the minority class.

Resampling techniques: Resampling techniques such as oversampling and undersampling can be used to balance the dataset. Oversampling adds duplicated or synthetic examples to the minority class, while undersampling removes examples from the majority class. However, it is essential to be cautious when using these techniques, as oversampling can cause overfitting and undersampling can discard useful information.

Class weight balancing: Many machine learning algorithms allow the class weights to be adjusted so that errors on the minority class are penalized more heavily. This technique can help the model to focus more on the minority class, giving more importance to its performance.

Ensemble techniques: Ensemble techniques such as bagging, boosting, and stacking can help to improve the model's performance on the minority class. By combining several models, these techniques can help to reduce bias and improve generalization.
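
To make the points about evaluation metrics and class weights concrete, the sketch below scores a model that always predicts the majority class on a 95/5 split, then computes inverse-frequency class weights. The labels are synthetic, and the weight formula `n_samples / (n_classes * class_count)` is one common heuristic, assumed here for illustration.

```python
import numpy as np

# Synthetic labels: 95 majority (0) and 5 minority (1) samples
y_true = np.array([0] * 95 + [1] * 5)
# A useless model that always predicts the majority class
y_pred = np.zeros_like(y_true)

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive (minority) class."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

metrics = binary_metrics(y_true, y_pred)
print(metrics)

# Inverse-frequency class weights: rarer classes get larger weights
counts = np.bincount(y_true)
class_weights = len(y_true) / (len(counts) * counts)
print(class_weights)
```

The accuracy looks high while recall and F1 for the minority class are zero, which is why accuracy alone is misleading on imbalanced data.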

## Ans10

Random Under-Sampling: Randomly removing instances from the majority class can balance the dataset. However, this method can lead to information loss and reduce the representativeness of the majority class.

Cluster Centroids: The Cluster Centroids method clusters the majority class, typically with k-means, and replaces each cluster of majority instances with its centroid. The centroids represent the majority class without including all the majority class instances.

Tomek Links: A Tomek link is a pair of instances from opposite classes that are each other's nearest neighbors. The Tomek Links method finds these pairs and removes the majority class instance from each one. This method helps to clean up the decision boundary between the classes.

Edited Nearest Neighbors: The Edited Nearest Neighbors method removes the majority class instances that are misclassified by the k-nearest neighbors algorithm. This method can help to remove noisy instances from the majority class.
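
As an illustration of the Tomek Links idea, the sketch below finds pairs of opposite-class points that are each other's nearest neighbor and returns the majority-class member of each pair. The function name `tomek_link_majority_indices` and the one-dimensional toy data are assumptions made for this example.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Indices of majority samples that form a Tomek link with a minority sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise distances, with self-distances excluded
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i          # i and j are each other's nearest neighbor
        opposite = y[i] != y[j]           # and they belong to different classes
        if mutual and opposite and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Toy data on a line: class 0 is the majority, class 1 the minority
X = np.array([[0.0], [1.0], [2.0], [4.0], [4.2], [6.0]])
y = np.array([0, 0, 0, 0, 1, 1])

remove_idx = tomek_link_majority_indices(X, y, majority_label=0)
X_clean = np.delete(X, remove_idx, axis=0)
y_clean = np.delete(y, remove_idx)
print(remove_idx)
```

Only the majority point sitting right next to a minority point is removed. Majority points far from the boundary are kept.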

## Ans11

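Random Over-Sampling: Randomly replicating instances from the minority class can balance the dataset. However, this method can lead to overfitting because the model sees exact copies of the same instances repeatedly. If over-sampling is applied before the train/test split, copies of the same instance can also end up in both the training and testing sets, which inflates the evaluation scores.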

SMOTE: Synthetic Minority Over-Sampling Technique (SMOTE) creates synthetic examples for the minority class by placing new points along the line segments that join existing minority class instances. This method helps to create a more diverse and balanced dataset.

ADASYN: The Adaptive Synthetic Sampling (ADASYN) method is similar to SMOTE, but it generates more synthetic examples around the minority class instances that are harder to learn, that is, those with more majority class instances among their nearest neighbors.

Synthetic Sampling using GANs: Generative Adversarial Networks (GANs) can be trained on the minority class and then used to generate new synthetic examples. This approach has shown promising results in producing realistic and diverse synthetic examples, although it usually requires more data and tuning than SMOTE-style methods.
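
One common precaution with any of these over-sampling methods is to split the data first and over-sample only the training portion, so that no copy of a test instance reaches the training set. The sketch below does this with random over-sampling in NumPy. The data is synthetic, and the hand-rolled 80/20 stratified split is an assumption made to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: each row has a unique value so copies can be traced
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Hold out 20% of each class as the test set
test_parts = []
for label in (0, 1):
    label_idx = np.flatnonzero(y == label)
    test_parts.append(rng.choice(label_idx, size=len(label_idx) // 5, replace=False))
test_idx = np.concatenate(test_parts)

train_mask = np.ones(len(y), dtype=bool)
train_mask[test_idx] = False
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]

# Over-sample the minority class in the training set only
minority_idx = np.flatnonzero(y_train == 1)
n_extra = int(np.sum(y_train == 0)) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
X_balanced = np.vstack([X_train, X_train[extra_idx]])
y_balanced = np.concatenate([y_train, y_train[extra_idx]])

print(np.bincount(y_balanced), np.bincount(y_test))
```

The test set keeps its original class distribution, so the evaluation reflects the imbalance the model will face in practice.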