### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

* Missing values in a dataset refer to the absence of data points for certain observations or features. 
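* Handling missing values is essential because they can bias results, shrink the effective sample size, and cause many algorithms to fail outright. Some tree-based implementations tolerate missing values natively, for example the gradient-boosted trees in XGBoost and LightGBM, and decision trees that use surrogate splits (as in CART). Support depends on the library, so a plain decision tree or random forest may still require imputation first.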

### Q2: List down techniques used to handle missing data. Give an example of each with python code.

* Dropping missing values: removing any rows or columns with missing values
    * **Example code:** <code>df.dropna(axis=0)</code>  drops any rows with missing values in dataframe df
* Mean/median imputation: replacing missing values with the mean or median of the feature
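    * **Example code:**  <code>df.fillna(df.mean(numeric_only=True))</code> replaces missing values in each numeric column of dataframe df with that column's mean (use <code>df.median(numeric_only=True)</code> for median imputation)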
* Mode imputation: replacing missing values with the mode (most common value) of the feature
    * **Example code:** <code>df.fillna(df.mode().iloc[0])</code> replaces missing values in dataframe df with the mode of each feature
* Forward/backward fill: replacing missing values with the previous or next value in the sequence
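    * **Example code:** <code>df.ffill()</code> replaces missing values in dataframe df with the previous value in the sequence, and <code>df.bfill()</code> uses the next value instead (the older <code>df.fillna(method='ffill')</code> form is deprecated in recent pandas versions)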
* Interpolation: using a function to estimate missing values based on surrounding data points
    * **Example code:** <code>df.interpolate()</code> estimates missing values in dataframe df using linear interpolation
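
The one-liners above can be tried together on a small frame. This is a minimal sketch, assuming pandas and NumPy are installed; the DataFrame `df` and its columns `age` and `city` are made up for illustration and are not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column, both with gaps.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, np.nan, 60.0],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

dropped = df.dropna(axis=0)                                  # keep only complete rows
mean_filled = df["age"].fillna(df["age"].mean())             # mean imputation
median_filled = df["age"].fillna(df["age"].median())         # median imputation
mode_filled = df["city"].fillna(df["city"].mode().iloc[0])   # most frequent value
forward_filled = df["age"].ffill()                           # carry previous value forward
interpolated = df["age"].interpolate()                       # linear interpolation (default)

print(interpolated.tolist())
```

Mean and median imputation are applied to the numeric column only, while mode imputation is the natural choice for the categorical one.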

### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

* Imbalanced data refers to a situation where one class of the target variable is far more prevalent than the others. In the binary case this means a large majority class and a small minority class.
* If imbalanced data is not handled, machine learning models may be biased towards the majority class, leading to poor performance on the minority class.
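
The bias towards the majority class can be seen with a trivial baseline. The sketch below uses made-up labels (95 negatives, 5 positives) and a "model" that always predicts the majority class; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

class_counts = Counter(y_true)
majority_class = class_counts.most_common(1)[0][0]

# A "model" that ignores its input and always predicts the majority class.
y_pred = [majority_class] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = (
    sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / class_counts[1]
)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

The baseline reaches 95% accuracy while never detecting a single minority case, which is why accuracy alone says little on imbalanced data.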

### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

> **Up-sampling**
>
> * Up-sampling increases the number of instances in the minority class.
> * It may be required when the minority class is too small to train a model effectively, for example a fraud dataset with only a few hundred fraudulent transactions among many thousands of legitimate ones.

> **Down-sampling**
>
> * Down-sampling decreases the number of instances in the majority class.
> * It may be required when the majority class is heavily overrepresented and the dataset is large enough that discarding some majority rows costs little, for example millions of routine log records against a few thousand error records.
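
Both ideas can be sketched with plain pandas. The frame below is hypothetical (8 rows of class 0, 2 rows of class 1), and `DataFrame.sample` is used here as one simple way to resample at random, not as the only way to do it.

```python
import pandas as pd

# Hypothetical imbalanced frame: 8 rows of class 0, 2 rows of class 1.
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority.
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
upsampled = pd.concat([majority, minority_up], ignore_index=True)

# Down-sampling: keep a random subset of majority rows, without replacement.
majority_down = majority.sample(n=len(minority), replace=False, random_state=0)
downsampled = pd.concat([majority_down, minority], ignore_index=True)

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

Resampling should normally be applied to the training split only, so that the test set keeps the original class distribution.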

### Q5: What is data Augmentation? Explain SMOTE.

* Data augmentation is a technique used to increase the size of a dataset by adding modified versions of existing data. The modified data is generated by applying various transformations such as rotation, scaling, cropping, or flipping to the original data. This technique is commonly used in deep learning to prevent overfitting and improve the generalization of models.

* SMOTE (Synthetic Minority Over-sampling Technique) is a data augmentation method used to address the imbalanced class problem in classification tasks. It creates synthetic samples by interpolating between minority class examples. SMOTE selects one minority class sample at random and finds its k-nearest neighbors. It then generates new examples by randomly selecting points along the line segments connecting the minority class sample and its k-nearest neighbors. This process is repeated until the desired number of synthetic samples is generated.
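
The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not a reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all hypothetical, and it uses brute-force distances that only suit small arrays.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min (2-D array)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))        # pick a minority sample at random
        nb = rng.choice(neighbours[base])      # pick one of its k nearest neighbours
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority class with four 2-D samples (k must be smaller than this).
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_sketch(X_minority, n_new=6)
print(new_points.shape)
```

In practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package is usually preferable to hand-written code.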

### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

* Outliers are data points that are significantly different from the rest of the data in a dataset. They can be caused by measurement errors, data corruption, or extreme values that are outside the typical range of the data. 
* Outliers can significantly affect the statistical properties of a dataset and the performance of machine learning models. It is essential to handle outliers because they can skew statistical analyses and affect the accuracy of predictive models.
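
A small numeric example shows how a single outlier skews a summary statistic. The values below are made up, and comparing the mean with the median is just one simple way to see the effect.

```python
import statistics

# Hypothetical measurements; the appended value 500 is an obvious outlier.
values = [10, 11, 12, 13, 14]
with_outlier = values + [500]

mean_clean = statistics.mean(values)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(values)
median_outlier = statistics.median(with_outlier)

print(f"mean:   {mean_clean} -> {mean_outlier:.2f}")
print(f"median: {median_clean} -> {median_outlier}")
```

The mean jumps from 12 to about 93, while the median barely moves (12 to 12.5), which is why robust statistics or explicit outlier handling matter before modeling.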


### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

* There are several techniques that can be used to handle missing data in a dataset. 
1. **Remove the rows or columns** that contain missing values. However, this approach may lead to a loss of valuable information. 
2. **Impute missing values** by using statistical methods such as mean imputation, median imputation, or regression imputation. Additionally, if the amount of missing data is small, it may be possible to use machine learning models that can handle missing values.

### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

* To determine if the missing data is missing at random or if there is a pattern to the missing data, several techniques can be used. 
1. Calculate the percentage of missing values for each variable and compare it with the overall percentage. Similar rates across variables are consistent with data missing at random but do not prove it, so also check whether missingness in one variable is related to the values of other variables.
2. Visualize the missingness pattern using heatmaps or dendrograms. These visualizations can help identify variables or rows whose missing values tend to occur together.
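
A first check along these lines can be done with pandas alone. The sketch below uses invented customer data in which `income` is missing only for the youngest customers; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing only for the three youngest customers.
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 60, 67],
    "income": [np.nan, np.nan, np.nan, 55.0, 61.0, 58.0, 49.0, 40.0],
})

# Percentage of missing values per column.
missing_share = df.isna().mean() * 100

# Compare another variable between rows with and without a missing income.
status = df["income"].isna().map({True: "missing", False: "observed"})
age_by_status = df["age"].groupby(status).mean()

print(missing_share)
print(age_by_status)
```

A large gap in average age between the two groups, as here, suggests that the missingness depends on age rather than occurring completely at random.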

### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

* Accuracy can be misleading on an imbalanced dataset, because a model that labels every patient as healthy still scores highly. More informative options include precision, recall (sensitivity), the F1 score, precision-recall curves, and the area under the ROC curve. In a diagnosis setting recall on the positive class is usually the priority, since a missed case is costly. When positives are very rare, the ROC curve can look optimistic, so the precision-recall curve is often the more informative of the two.
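
The metrics named above follow directly from the confusion-matrix counts. The sketch below computes them by hand on made-up labels (4 patients with the condition, 16 without) so that the arithmetic stays visible.

```python
# Hypothetical labels: 1 = has the condition, 0 = does not.
y_true = [1] * 4 + [0] * 16
y_pred = [1, 1, 0, 0] + [1] + [0] * 15   # 2 hits, 2 misses, 1 false alarm

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)   # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)   # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)   # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)   # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Accuracy is 0.85 even though half of the sick patients are missed (recall 0.50). With scikit-learn available, `precision_score`, `recall_score`, `f1_score`, `roc_auc_score`, and `average_precision_score` in `sklearn.metrics` compute the same kinds of quantities.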

### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

* When dealing with an unbalanced dataset where the majority of customers report being satisfied, there are a few methods that can be employed to balance the dataset and down-sample the majority class:

* **Random under-sampling:** This involves randomly selecting a subset of the majority class samples so that the number of samples in the majority and minority classes are closer in number. However, this method can lead to loss of valuable data.

* **Tomek links:** A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbours. Removing the majority class sample from each such pair cleans up the region around the decision boundary, but it usually removes only a few samples, so on its own it rarely balances the classes.

* **Cluster-based under-sampling:** This involves identifying the clusters of the majority class and reducing the number of samples in each cluster. This method can help retain the information contained in the majority class samples.

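* **Synthetic minority over-sampling technique (SMOTE):** Strictly an over-sampling method rather than a down-sampling one: it creates synthetic samples for the minority class by interpolating between existing minority class samples. It can be used instead of, or combined with, under-sampling of the majority class.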
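
Of the methods above, Tomek links are the least obvious, so here is a small NumPy sketch. It assumes Euclidean distance and brute-force neighbour search, and the function name `tomek_link_majority_indices` and the one-dimensional data are hypothetical.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that take part in a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)        # ignore distance of a point to itself
    nearest = dists.argmin(axis=1)         # nearest neighbour of every sample
    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i           # i and j are each other's nearest
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return to_remove

# Hypothetical 1-D data: class 0 is the majority, class 1 the minority.
X = np.array([[0.0], [1.0], [2.0], [2.9], [3.0], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1])

removed = tomek_link_majority_indices(X, y, majority_label=0)
X_clean = np.delete(X, removed, axis=0)
y_clean = np.delete(y, removed)
print(removed)
```

Only the majority sample at 2.9, which sits right next to the minority sample at 3.0, is dropped. The imbalanced-learn package offers maintained versions of these samplers, such as `RandomUnderSampler` and `TomekLinks`.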

### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

* When dealing with an unbalanced dataset where the minority class has a low percentage of occurrences, there are several methods that can be employed to balance the dataset and up-sample the minority class:

* **Random over-sampling:** This involves randomly duplicating minority class samples so that the number of samples in the majority and minority classes are closer in number. However, this method can lead to overfitting and poor performance.

* **Synthetic minority over-sampling technique (SMOTE):** This method involves creating synthetic samples for the minority class by interpolating between existing minority class samples.

* **Adaptive synthetic (ADASYN):** This is an extension of SMOTE that generates more synthetic examples for minority samples that are harder to learn, that is, those whose nearest neighbours are mostly majority class samples.

* **Minority class augmentation:** This involves augmenting the minority class data by adding noise or modifying the existing data to generate more examples.

* **Ensemble methods:** Combining several models can give a more robust classifier. This is not an up-sampling method in itself, but it is often paired with resampling, for example by training each model on a differently balanced sample of the data, which can reduce the overfitting that a single resampled dataset may cause.
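
As one illustration of the noise-based augmentation mentioned in the list, the sketch below resamples minority rows and adds small Gaussian noise scaled by each feature's standard deviation. The function name `jitter_minority`, the `noise_scale` parameter, and the data are assumptions made for this example; a suitable noise level depends on the dataset.

```python
import numpy as np

def jitter_minority(X_min, n_new, noise_scale=0.05, seed=0):
    """Create n_new noisy copies of randomly chosen minority samples."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    picks = rng.integers(len(X_min), size=n_new)      # sample rows with replacement
    feature_std = X_min.std(axis=0)                   # per-feature spread
    noise = rng.normal(0.0, 1.0, size=(n_new, X_min.shape[1]))
    return X_min[picks] + noise * noise_scale * feature_std

# Hypothetical rare-event samples with two features.
X_rare = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]])
augmented = jitter_minority(X_rare, n_new=5)
print(augmented.shape)
```

With `noise_scale=0` this reduces to plain random over-sampling, since the new rows are exact duplicates of existing ones.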