### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

* Missing values in a dataset refer to the absence of data points for certain observations or features. 
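* Handling missing values is essential because they can bias estimates, shrink the effective sample size, and cause many algorithms to fail or produce incorrect results. Some algorithms tolerate missing values natively, chiefly tree-based models: decision trees that use surrogate splits, and gradient-boosted tree libraries such as XGBoost and LightGBM. Support depends on the implementation rather than the algorithm family, so a given decision tree or random forest library may still require imputation first.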

### Q2: List down techniques used to handle missing data. Give an example of each with python code.

* Dropping missing values: removing any rows or columns with missing values
    * **Example code:** <code>df.dropna(axis=0)</code>  drops any rows with missing values in dataframe df
* Mean/median imputation: replacing missing values with the mean or median of the feature
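    * **Example code:** <code>df.fillna(df.mean(numeric_only=True))</code> replaces missing values in the numeric columns of dataframe df with the mean of each feature; use <code>df.median(numeric_only=True)</code> for median imputation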
* Mode imputation: replacing missing values with the mode (most common value) of the feature
    * **Example code:** <code>df.fillna(df.mode().iloc[0])</code> replaces missing values in dataframe df with the mode of each feature
* Forward/backward fill: replacing missing values with the previous or next value in the sequence
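    * **Example code:** <code>df.ffill()</code> replaces missing values in dataframe df with the previous value in the sequence, and <code>df.bfill()</code> uses the next value; the older <code>df.fillna(method='ffill')</code> form is deprecated in recent pandas releases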
* Interpolation: using a function to estimate missing values based on surrounding data points
    * **Example code:** <code>df.interpolate()</code> estimates missing values in dataframe df using linear interpolation
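
The one-liners above can be tried together on a small frame. The following is a minimal sketch, assuming a hypothetical toy dataframe (the column names `age`, `income`, and `city` are made up for illustration); it is one way to apply each technique, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric features and one categorical feature.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "income": [50.0, 60.0, np.nan, 80.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# 1. Dropping: keep only the rows that have no missing values.
dropped = df.dropna(axis=0)

# 2. Mean imputation, applied to the numeric columns only.
num_cols = ["age", "income"]
mean_filled = df.copy()
mean_filled[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# 3. Mode imputation, which also works for the categorical column.
mode_filled = df.fillna(df.mode().iloc[0])

# 4. Forward fill: carry the previous row's value forward.
ffilled = df.ffill()

# 5. Linear interpolation on a single numeric column.
interpolated = df["age"].interpolate()
```

Interpolation is applied to one numeric column here because it has no meaning for text columns such as `city`.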

### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

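* Imbalanced data refers to a classification dataset in which one class is significantly more prevalent than the others, for example 95% negative and 5% positive labels in a binary problem. The same issue can occur with more than two classes.
* If imbalanced data is not handled, machine learning models tend to be biased towards the majority class, leading to poor performance on the minority class. Overall accuracy can still look high, because a model that always predicts the majority class is right most of the time.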
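
A small sketch of the effect described above, assuming hypothetical labels with a 95/5 split and a "model" that always predicts the majority class:

```python
import numpy as np

# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Recall on the minority class: fraction of true positives that were found.
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

The accuracy is high even though not a single minority sample is detected, which is why accuracy alone is a poor guide on imbalanced data.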

### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

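> **Up-sampling**
>
> * Up-sampling increases the number of instances in the minority class, either by duplicating existing samples or by generating synthetic ones.
> * It may be required when the minority class is underrepresented and more data is needed to train a model effectively, for example a fraud-detection dataset in which only a small fraction of transactions are fraudulent.

> **Down-sampling**
>
> * Down-sampling decreases the number of instances in the majority class by keeping only a subset of them.
> * It may be required when the majority class is overrepresented and the dataset is too large for effective modeling, for example a click log with millions of non-click rows and comparatively few clicks.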
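
Both operations can be sketched with `DataFrame.sample`. This is a minimal illustration on a hypothetical frame with 90 majority and 10 minority rows; the column names and the choice to fully equalize the classes are assumptions.

```python
import pandas as pd

# Hypothetical imbalanced data: 90 majority rows (label 0), 10 minority rows (label 1).
df = pd.DataFrame({"x": range(100), "label": [0] * 90 + [1] * 10})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: keep a random subset of majority rows, without replacement.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])
```

In practice, resampling is applied to the training split only, so that the test set keeps the original class distribution.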

### Q5: What is data Augmentation? Explain SMOTE.

* Data augmentation is a technique used to increase the size of a dataset by adding modified versions of existing data. The modified data is generated by applying various transformations such as rotation, scaling, cropping, or flipping to the original data. This technique is commonly used in deep learning to prevent overfitting and improve the generalization of models.

* SMOTE (Synthetic Minority Over-sampling Technique) is a data augmentation method used to address class imbalance in classification tasks. It creates synthetic samples by interpolating between minority class examples. SMOTE selects a minority class sample at random and finds its k nearest neighbors within the minority class. It then picks one of those neighbors and generates a new example at a random point on the line segment connecting the two. This process is repeated until the desired number of synthetic samples has been generated.
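
The interpolation step can be sketched in a few lines of NumPy. This is an illustrative simplification, not the implementation of any particular library; the function name `smote_sketch`, its parameters, and the sample data are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))   # pick a minority sample at random
        nb = rng.choice(neighbors[base])  # pick one of its k nearest neighbors
        gap = rng.random()                # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical 2-D minority class with five samples.
X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8], [1.8, 1.9]])
new_points = smote_sketch(X_min, n_new=4)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class.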

### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

* Outliers are data points that differ significantly from the rest of the data in a dataset. They can be caused by measurement errors, data entry or data corruption problems, or genuine but rare extreme observations.
* It is essential to handle outliers because they can distort statistics such as the mean, variance, and correlation, and can pull the fit of models that are sensitive to extreme values, such as linear regression, reducing predictive accuracy. Handling does not always mean deleting: a genuine extreme value may carry useful information and can instead be capped or transformed.
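
As a rough illustration, the sketch below flags outliers with the common 1.5 × IQR rule on hypothetical measurements. The rule and its threshold are one convention among several, chosen here as an assumption.

```python
import numpy as np

# Hypothetical measurements with one extreme value (120.0).
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 120.0])

# The mean is pulled toward the outlier; the median barely moves.
mean_with, median_with = values.mean(), np.median(values)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Keep only the points that were not flagged.
cleaned = values[~is_outlier]
```

Comparing `mean_with` against `cleaned.mean()` shows how strongly a single extreme point can shift a summary statistic.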


### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

There are several techniques for handling missing data in an analysis. Some common approaches are:

1. **Deletion:** This involves removing any rows or columns with missing data. This method can be useful if the amount of missing data is relatively small, but it can lead to a loss of valuable information.

2. **Imputation:** This involves filling in missing values with estimated values based on the available data. There are several techniques for imputation, including mean imputation, regression imputation, and multiple imputation.

3. **Prediction:** This involves using machine learning algorithms to predict missing values based on the available data. This approach can be useful if the amount of missing data is relatively small and the available data is highly predictive of the missing data.

4. **Ignore missingness:** In some cases the missing values can be left as they are, for example when the affected column is not used in the analysis or when the chosen algorithm accepts missing values natively.
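
Regression imputation, mentioned above, can be sketched as follows. The customer columns `visits` and `spend` are hypothetical, and a straight-line fit with `np.polyfit` stands in for whatever predictive model suits the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: 'spend' is missing for two customers.
customers = pd.DataFrame({
    "visits": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "spend": [10.0, 20.0, np.nan, 40.0, np.nan, 60.0],
})

known = customers["spend"].notna()

# Fit spend = slope * visits + intercept on the rows where spend is observed.
slope, intercept = np.polyfit(
    customers.loc[known, "visits"].to_numpy(),
    customers.loc[known, "spend"].to_numpy(),
    deg=1,
)

# Predict spend for the rows where it is missing and fill those cells.
imputed = customers.copy()
imputed.loc[~known, "spend"] = slope * customers.loc[~known, "visits"] + intercept
```

The filled values sit exactly on the fitted line, so this approach understates the natural variability of the data; multiple imputation is one way to account for that.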

### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

To determine if the missing data is missing at random or if there is a pattern to the missing data, some strategies that can be used are:

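1. **Statistical tests:** Hypothesis tests can check whether missingness is related to the observed data. For example, split the rows by whether a variable is missing and compare the distribution of the other variables between the two groups with a t-test or chi-squared test, or apply Little's MCAR test to the dataset as a whole.

2. **Visualization:** Plotting a missingness indicator against other variables, or drawing a heatmap of missing cells across rows and columns, can reveal whether the gaps cluster in particular records, time periods, or value ranges.

3. **Machine learning:** Train a classifier to predict whether a value is missing from the other features. If the classifier does clearly better than chance, the missingness depends on the observed data and is not completely at random.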
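
A quick first check along these lines is to build a missingness indicator and compare another variable across the two groups. The sketch below assumes hypothetical survey data in which `income` is missing only for older respondents.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which 'income' is missing only for older respondents.
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 58, 63, 67, 71],
    "income": [30.0, 32.0, 40.0, 45.0, 50.0, 55.0, np.nan, np.nan, np.nan, np.nan],
})

# Indicator column: 1 where income is missing, 0 where it is observed.
df["income_missing"] = df["income"].isna().astype(int)

# Compare another variable across the two groups. A large gap suggests
# the data is not missing completely at random.
age_by_missing = df.groupby("income_missing")["age"].mean()

# Correlation between the indicator and the other variable.
corr = df["income_missing"].corr(df["age"])
```

A clear difference in mean age between the groups, or a strong correlation, points to a pattern worth testing formally.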

### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

1. **Precision and Recall:** Precision and recall are better suited to imbalanced datasets than accuracy, which can be high even when every patient with the condition is missed. Precision measures the proportion of true positives among the predicted positives, while recall measures the proportion of true positives among all actual positives. In a diagnostic setting, recall is often the priority because a missed case is usually costlier than a false alarm.

2. **Confusion matrix:** A confusion matrix displays the number of true positives, true negatives, false positives, and false negatives, which shows directly how the model performs on each class.

3. **Resampling:** Resampling techniques such as oversampling and undersampling can be used to balance the training data and improve the performance of the model on the minority class. This is a training-time remedy rather than a metric, and the model should still be evaluated on data that keeps the original class distribution.

4. **Cost-sensitive learning:** Cost-sensitive learning assigns different costs to misclassifying different classes. This can help the model prioritize the minority class, and the same costs can be used to weight errors when scoring the model.
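
The metrics above can be computed directly from the four confusion-matrix counts. The labels and predictions below are hypothetical, and the F1 score (the harmonic mean of precision and recall) is included as a common single-number summary.

```python
import numpy as np

# Hypothetical ground truth and predictions: 1 = has the condition, 0 = does not.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Confusion-matrix counts.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)
```

Libraries such as scikit-learn provide the same quantities, but writing them out makes the definitions explicit.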

### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

When dealing with an unbalanced dataset where the majority of customers report being satisfied, there are a few methods that can be employed to balance the dataset and down-sample the majority class:

* **Random under-sampling:** This involves randomly selecting a subset of the majority class samples so that the majority and minority classes end up closer in size. However, this method can discard valuable data.

* **Tomek links:** A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor. Removing the majority class sample from each such pair can sharpen the decision boundary between the classes, but it usually removes only a few samples and can also discard important information.

* **Cluster-based under-sampling:** This involves clustering the majority class and reducing the number of samples in each cluster, for example by keeping a few representatives per cluster. This method helps retain the variety of information contained in the majority class samples.

* **Synthetic minority over-sampling technique (SMOTE):** This method creates synthetic samples for the minority class by interpolating between existing minority class samples. It is an over-sampling method rather than a down-sampling one, but it is often combined with under-sampling so that fewer majority samples need to be removed.
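
Tomek-link removal can be sketched with plain NumPy. This is an illustrative version that assumes a small dataset (it builds the full distance matrix); the function name `tomek_link_majority_indices` and the 1-D sample data are hypothetical.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label=0):
    """Return indices of majority samples that take part in a Tomek link."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)  # nearest neighbor of each sample

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i  # i and j are each other's nearest neighbor
        opposite = y[i] != y[j]   # and they belong to different classes
        if mutual and opposite and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Hypothetical 1-D data: the majority point at 2.9 sits next to the minority point at 3.0.
X = np.array([[0.0], [0.4], [1.0], [2.9], [3.0], [3.6]])
y = np.array([0, 0, 0, 0, 1, 1])

remove = tomek_link_majority_indices(X, y)
keep = np.setdiff1d(np.arange(len(X)), remove)
X_resampled, y_resampled = X[keep], y[keep]
```

Only the majority sample that borders the minority class is dropped, which is why this method cleans the boundary more than it balances the class counts.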

### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

When dealing with an unbalanced dataset where the minority class has a low percentage of occurrences, there are several methods that can be employed to balance the dataset and up-sample the minority class:

* **Random over-sampling:** This involves randomly duplicating minority class samples, drawing with replacement, so that the majority and minority classes end up closer in size. However, because the copies are exact duplicates, this method can lead to overfitting on the minority class.

* **Synthetic minority over-sampling technique (SMOTE):** This method involves creating synthetic samples for the minority class by interpolating between existing minority class samples.

* **Adaptive synthetic (ADASYN):** This is an extension of SMOTE that generates more synthetic examples for minority samples that are harder to learn, that is, those surrounded by more majority class neighbors.

* **Minority class augmentation:** This involves augmenting the minority class data by adding noise or modifying the existing data to generate more examples.

* **Ensemble methods:** This involves combining several models to create a more robust classifier. It does not up-sample the data by itself, but it pairs well with resampling: for example, each model can be trained on a differently resampled, balanced subset, which helps reduce overfitting and improve predictions on the rare class.
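
Noise-based augmentation of the minority class can be sketched as below. The feature matrix is hypothetical, and the noise level (5% of each feature's standard deviation) is an arbitrary assumption that would need tuning on real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class feature matrix: 3 samples, 2 features.
X_min = np.array([[1.0, 2.0], [1.5, 1.8], [0.9, 2.2]])

# Draw existing samples with replacement, then perturb each copy slightly.
n_new = 6
picks = rng.integers(0, len(X_min), size=n_new)
noise_scale = 0.05 * X_min.std(axis=0)  # assumed noise level; tune per dataset
noise = rng.normal(0.0, 1.0, size=(n_new, X_min.shape[1])) * noise_scale
X_aug = X_min[picks] + noise

# Original minority samples followed by the perturbed copies.
X_min_upsampled = np.vstack([X_min, X_aug])
```

Unlike plain random over-sampling, the added rows are not exact duplicates, which can reduce the risk of the model memorizing individual minority samples.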