## Q1

Missing values in a dataset refer to the absence of data for one or more variables or features in some of the data points or observations. These missing values are typically denoted as NaN (Not a Number) or NULL in various programming languages.

Handling missing values is essential for two main reasons:

1. Data Integrity: Missing values can lead to incomplete or inaccurate analyses, potentially causing incorrect conclusions or predictions.
2. Bias: Ignoring missing data can introduce bias because the missing values may not be missing at random. Their absence could be related to specific patterns or reasons that are important to understand.

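Algorithms commonly cited as not being affected by missing values (actual support depends on the implementation):
1. K nearest neighbours (when the distance metric skips missing coordinates)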
2. Decision Tree
3. Random Forest

## Q2

There are several techniques for handling missing data in a dataset.
1. Delete the rows (or columns) that contain null/NaN values.
2. Imputation:
    1. Mean Imputation
    2. Median Imputation
    3. Mode Imputation

In [1]:
import seaborn as sns

In [27]:
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [24]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [25]:
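# axis=1 drops every column that contains a missing value; use axis=0 (the default) to drop rows instead
df.dropna(axis=1,inplace=True)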

In [26]:
df.isnull().sum()

survived      0
pclass        0
sex           0
sibsp         0
parch         0
fare          0
class         0
who           0
adult_male    0
alive         0
alone         0
dtype: int64

In [28]:
## Mean Imputation

In [29]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [31]:
age_mean = df['age'].mean()

In [32]:
age_mean

29.69911764705882

In [33]:
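# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and may not apply to the original DataFrame
df['age'] = df['age'].fillna(age_mean)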

In [34]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [35]:
## Median Imputation

In [42]:
df = sns.load_dataset('titanic')
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [43]:
age_median = df['age'].median()

In [45]:
# Assign the result back rather than relying on inplace=True on a column selection
df['age'] = df['age'].fillna(age_median)

In [46]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [53]:
## Mode Imputation

In [52]:
df['deck'].dtype

CategoricalDtype(categories=['A', 'B', 'C', 'D', 'E', 'F', 'G'], ordered=False)

In [57]:
mode_deck = df['deck'].mode()[0]

In [58]:
mode_deck

'C'

In [59]:
# 'C' is already one of the column's categories, so the fill is valid for this categorical dtype
df['deck'] = df['deck'].fillna(mode_deck)

In [60]:
df.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
deck           0
embark_town    2
alive          0
alone          0
dtype: int64

## Q3

Imbalanced data refers to a situation in a classification problem where the classes you are trying to predict are not represented equally. In other words, one class (the minority class) has significantly fewer examples than the other class or classes (the majority class or classes). 

It is essential to handle imbalanced data for the following reason:

1. Model Bias: Machine learning models, including traditional ones like logistic regression or decision trees, tend to be biased towards the majority class. Because they are usually trained to maximize overall accuracy, they may predict the majority class for most cases and ignore the minority class. The result is a model that performs poorly on the minority class, which is often the class of interest.

## Q4

Up-sampling and down-sampling are techniques used to address the issue of imbalanced data in machine learning. They are used when you have a dataset with significantly unequal representation of classes, i.e., one class (the minority class) has very few examples compared to another class (the majority class).

1. Upsampling:
    1. Up-sampling involves increasing the number of instances in the minority class to balance the dataset.
    2. This is typically done by duplicating or generating synthetic samples for the minority class until its representation is closer to that of the majority class.
    3. Example of Up-sampling:
        1. Suppose you're working on a medical diagnosis task where you want to predict a rare disease. If you have a dataset with 1000 healthy patients (majority class) and only 50 patients with the disease (minority class), you might up-sample the minority class to have a similar number of samples as the majority class, say 1000, by creating synthetic samples or duplicating the existing ones.
        
2. Downsampling:
    1. Down-sampling involves reducing the number of instances in the majority class to balance the dataset.
    2. This is typically done by randomly removing samples from the majority class until its representation is closer to that of the minority class.
    3. Example of Down-sampling:
        1. Consider a credit card fraud detection task where the majority of transactions are legitimate, and only a small fraction is fraudulent. If you have a dataset with 100,000 legitimate transactions (majority class) and 500 fraudulent transactions (minority class), you might down-sample the majority class to have, say, 1000 samples, by randomly removing instances.
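
The two strategies above can be sketched with pandas alone. This is a minimal illustration, not a recommended pipeline: the toy DataFrame, the column names `feature` and `label`, and the choice to match class sizes exactly are all assumptions made for the example. It reuses the 1000 vs. 50 proportions from the medical example.

```python
import pandas as pd

# Toy dataset (hypothetical): 1000 healthy patients (label 0), 50 with the disease (label 1).
df = pd.DataFrame({
    "feature": range(1050),
    "label": [0] * 1000 + [1] * 50,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Up-sampling: draw minority rows WITH replacement until they match the majority size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
df_upsampled = pd.concat([majority, minority_upsampled])

# Down-sampling: draw majority rows WITHOUT replacement until they match the minority size.
majority_downsampled = majority.sample(n=len(minority), replace=False, random_state=42)
df_downsampled = pd.concat([majority_downsampled, minority])

print(df_upsampled["label"].value_counts().to_dict())
print(df_downsampled["label"].value_counts().to_dict())
```

Up-sampling by duplication keeps every original row but repeats minority rows many times, while down-sampling discards most of the majority class; which trade-off is acceptable depends on how much data is available.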

## Q5

Data augmentation is a technique used in machine learning and data preprocessing to artificially increase the diversity and quantity of your training dataset by applying various transformations to the existing data. The goal is to create additional training examples that are similar to the original data but with slight variations, which can help improve the generalization and robustness of machine learning models.
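
As a rough illustration of "applying transformations to existing data", the sketch below augments a single array that stands in for a grayscale image. The array contents, the two transformations (horizontal flip and small Gaussian noise), and the noise scale of 0.1 are assumptions chosen for the example; real projects pick transformations that preserve the label for their specific data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a 3x4 grayscale image (hypothetical data).
image = np.arange(12, dtype=float).reshape(3, 4)

# Transformation 1: mirror the image left to right.
flipped = np.fliplr(image)

# Transformation 2: add small Gaussian noise to every pixel.
noisy = image + rng.normal(loc=0.0, scale=0.1, size=image.shape)

# One original example has become three training examples.
augmented_batch = np.stack([image, flipped, noisy])
print(augmented_batch.shape)
```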


SMOTE (Synthetic Minority Over-sampling Technique) is a specific data augmentation technique designed to address the issue of imbalanced datasets in classification tasks. SMOTE generates synthetic samples for the minority class to balance the class distribution. 

## Q6

Outliers in a dataset are data points or observations that significantly differ from the majority of the data. These data points are usually distant from the central tendency of the data, such as the mean or median.


It is essential to handle outliers for several reasons:

1. Data Quality: Outliers can be the result of errors in data collection, measurement, or entry. Removing or correcting these errors improves the overall quality of the dataset.
2. Model Performance: Outliers can have a disproportionate impact on the performance of statistical models and machine learning algorithms. They can lead to model instability, overfitting, or bias in parameter estimates.
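
One common way to flag outliers is the interquartile range (IQR) rule, sketched below with the standard library only. The sample values and the conventional 1.5 multiplier are assumptions for illustration; other rules (for example z-scores) or domain knowledge may be more appropriate for a given dataset.

```python
import statistics

# Hypothetical measurements; 95 is an injected outlier.
values = [12, 13, 12, 14, 13, 15, 14, 13, 12, 95]

# Quartiles and the IQR fences.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
cleaned = [v for v in values if lower <= v <= upper]

# The mean shifts noticeably when the outlier is removed; the median barely moves.
print("outliers:", outliers)
print("mean with / without:", statistics.mean(values), statistics.mean(cleaned))
print("median with / without:", statistics.median(values), statistics.median(cleaned))
```

The comparison at the end shows why outliers matter for model performance: estimates based on the mean are pulled towards the extreme value, while the median is far more robust.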

## Q7

Handling missing data is a crucial step in data analysis, as missing values can lead to inaccurate or biased results. When you encounter missing data in a customer data analysis project, you can employ various techniques to handle it effectively.

1. Imputation: 
    1. Mean Imputation
    2. Median Imputation
    3. Mode Imputation
    
2. Deletion (entire Rows)
3. Predictive Modeling:
    1. Build predictive models (e.g., regression, decision trees, or machine learning models) to estimate missing values based on other features in the dataset.
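
The predictive-modeling option can be sketched with a simple linear fit. Everything here is assumed for illustration: the `income` and `spend` arrays are made-up customer data, and a straight-line model fitted with `np.polyfit` stands in for whatever regression or tree model a real project would use.

```python
import numpy as np

# Hypothetical customer data: income is always known, spend has gaps (np.nan).
income = np.array([20.0, 30.0, 40.0, 50.0, 60.0, 70.0])
spend = np.array([5.0, 7.0, np.nan, 11.0, np.nan, 15.0])

# Rows where the target value was actually observed.
observed = ~np.isnan(spend)

# Fit spend = slope * income + intercept on the complete rows only.
slope, intercept = np.polyfit(income[observed], spend[observed], deg=1)

# Predict the missing entries from the fitted line and fill them in.
spend_imputed = spend.copy()
spend_imputed[~observed] = slope * income[~observed] + intercept
print(spend_imputed)
```

Unlike mean or median imputation, this approach uses the relationship between features, so imputed values vary from row to row instead of all being the same constant.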

## Q8

When dealing with missing data in a large dataset, it's essential to determine whether the data is missing completely at random (MCAR) or whether there is a pattern to the missingness, for example when it depends on other observed variables. Understanding the missing data mechanism can help you decide on the appropriate strategies for handling the missing values.


1. Visualization:
    1. Create data visualizations such as heatmaps, scatter plots, or histograms to visualize the distribution of missing values across features. You can use libraries like Matplotlib, Seaborn, or Plotly.
    2. Plot missing data patterns over time, if applicable, in time-series data.
    
2. Summary Statistics:
    1. Calculate summary statistics for the missing and non-missing data separately to identify potential patterns. Compare means, medians, variances, and other relevant statistics.
    2. Perform statistical tests (e.g., t-tests, chi-square tests) to check if there are significant differences between missing and non-missing groups.
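
The summary-statistics check can be sketched with pandas by building a missingness indicator and comparing another variable across the two groups. The survey-style data and the column names `age`, `income`, and `income_missing` are hypothetical and chosen so that older respondents skip the income question more often.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with gaps in the income column.
df = pd.DataFrame({
    "age":    [22, 25, 31, 38, 45, 52, 58, 63, 67, 71],
    "income": [30, 32, 41, np.nan, 55, np.nan, np.nan, 60, np.nan, np.nan],
})

# Share of missing values per column.
missing_share = df.isnull().mean()

# Indicator column: True where income is missing.
df["income_missing"] = df["income"].isnull()

# Compare an observed variable between the missing and non-missing groups.
age_by_missingness = df.groupby("income_missing")["age"].agg(["count", "mean", "median"])

print(missing_share)
print(age_by_missingness)
```

A clear gap between the two groups, as in this toy data, suggests the values are not missing completely at random; a formal test such as a t-test would be the natural next step.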

## Q9

When dealing with an imbalanced medical diagnosis dataset where only a small percentage of patients have the condition of interest (the positive class) and the majority do not, it's important to use appropriate strategies to evaluate the performance of your machine learning model. The standard accuracy metric may not be informative in such cases because the model could achieve high accuracy by simply predicting the majority class for all instances.


1. Resampling Techniques:

    1. Consider resampling techniques like oversampling the minority class or undersampling the majority class to balance the dataset before training the model.
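    2. Resample only the training split. Evaluate the model on a held-out test set that keeps the original class distribution, and compare it against a model trained on the imbalanced data using precision, recall, and F1-score rather than accuracy alone.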
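
To make the limitation of accuracy concrete, the sketch below computes accuracy, precision, recall, and F1 for two sets of predictions on a 95/5 split. The labels and both prediction lists are invented for illustration, and `evaluate` is a hypothetical helper written for this example, not a library function.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical test set: 95 patients without the condition, 5 with it.
y_true = [0] * 95 + [1] * 5

# Baseline that always predicts the majority class.
always_negative = [0] * 100

# Hypothetical model: 6 false alarms, and it catches 4 of the 5 positive cases.
model_pred = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1

baseline_scores = evaluate(y_true, always_negative)
model_scores = evaluate(y_true, model_pred)
print(baseline_scores)
print(model_scores)
```

The always-negative baseline scores higher on accuracy than the hypothetical model while finding none of the positive cases, which is why recall and precision on the positive class are the more useful numbers here.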

## Q10

When dealing with an unbalanced dataset where the majority of customers report being satisfied, you can employ down-sampling techniques to balance the dataset. Down-sampling involves reducing the number of instances in the majority class to match the size of the minority class (dissatisfied customers, in this case). One common method to down-sample the majority class is:

1. Random Under-Sampling:

    1. Randomly remove a subset of instances from the majority class until it matches the size of the minority class.
    2. While this method is simple, it may lead to a loss of valuable information if important patterns are present in the majority class.

## Q11

When working with an imbalanced dataset where a rare event is underrepresented, you can employ up-sampling techniques to balance the dataset by increasing the number of instances in the minority class. Up-sampling adds duplicated or synthetic samples to the minority class to bring it closer in size to the majority class. One common method to up-sample the minority class is:

1. SMOTE (Synthetic Minority Over-sampling Technique):

    1. SMOTE generates synthetic samples by interpolating between existing minority class instances.
    2. For each instance in the minority class, SMOTE selects k-nearest neighbors and creates new samples by blending the selected instance with one or more of its neighbors.
    3. SMOTE can be effective in creating realistic synthetic samples and in reducing the overfitting that plain duplication of minority samples can cause.
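
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference algorithm or a library implementation: the function name `smote_sketch`, its parameters, the random choice of base samples, and the toy minority points are all assumptions, and it requires `k` to be smaller than the number of minority samples.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours for every sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))                  # pick a minority sample
        neighbour = neighbours[base, rng.integers(k)]  # pick one of its neighbours
        gap = rng.random()                           # position along the segment, in [0, 1)
        synthetic[i] = X[base] + gap * (X[neighbour] - X[base])
    return synthetic

# Hypothetical minority-class points with two features.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_points = smote_sketch(X_min, n_new=10, k=3)
print(new_points.shape)
```

Because each synthetic point lies on the line segment between two existing minority samples, the new data stays inside the region the minority class already occupies instead of repeating identical rows.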