Why is it important to handle missing values in a dataset?
Many machine learning algorithms fail if the dataset contains missing values, although some, such as k-nearest neighbors and Naive Bayes, can support data with missing values. If missing values are not handled properly, you may end up building a biased machine learning model that produces incorrect results.

What are missing values in a dataset?
Missing data, or missing values, occur when you don't have data stored for certain variables or participants. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons. In any dataset, there are usually some missing data.

What algorithms can handle missing values?
k-NN and Random Forest algorithms can also support missing values. The k-NN algorithm handles a missing value by taking the majority of the K nearest values.

##q2:
How to Handle Missing Data with Python
by Jason Brownlee on March 20, 2017 in Data Preparation
Last Updated on August 28, 2020

Real-world data often has missing values.

Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.

Handling missing data is important as many machine learning algorithms do not support data with missing values.

In this tutorial, you will discover how to handle missing data for machine learning with Python.

Specifically, after completing this tutorial you will know:

How to mark invalid or corrupt values as missing in your dataset.
How to remove rows with missing data from your dataset.
How to impute missing values with mean values in your dataset.

Let’s get started.

Note: The examples in this post assume that you have Python 3 with Pandas, NumPy and Scikit-Learn installed, specifically scikit-learn version 0.22 or higher. If you need help setting up your environment see this tutorial.

Update Mar/2018: Changed link to dataset files.
Update Dec/2019: Updated link to dataset to GitHub version.
Update May/2020: Updated code examples for API changes. Added references.

Overview
This tutorial is divided into 6 parts:

Diabetes Dataset: where we look at a dataset that has known missing values.
Mark Missing Values: where we learn how to mark missing values in a dataset.
Missing Values Cause Problems: where we see how a machine learning algorithm can fail when the data contains missing values.
Remove Rows With Missing Values: where we see how to remove rows that contain missing values.
Impute Missing Values: where we replace missing values with sensible values.
Algorithms that Support Missing Values: where we learn about algorithms that support missing values.
First, let’s take a look at our sample dataset with missing values.

1. Diabetes Dataset
The Diabetes Dataset involves predicting the onset of diabetes within 5 years, given medical details.

Dataset File.
Dataset Details
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.

A sample of the first 5 rows is listed below.

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
...
This dataset is known to have missing values.

Specifically, there are missing observations for some columns that are marked as a zero value.

We can corroborate this by the definition of those columns and the domain knowledge that a zero value is invalid for those measures, e.g. a zero for body mass index or blood pressure is invalid.

Download the dataset from here and save it to your current working directory with the file name pima-indians-diabetes.csv.

2. Mark Missing Values
Most data has missing values, and the likelihood of having missing values increases with the size of the dataset.

Missing data are not rare in real data sets. In fact, the chance that at least one data point is missing increases as the data set size increases.

— Page 187, Feature Engineering and Selection, 2019.

In this section, we will look at how we can identify and mark values as missing.

We can use plots and summary statistics to help identify missing or corrupt data.

We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.

# load and summarize the dataset
from pandas import read_csv
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# summarize the dataset
print(dataset.describe())
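
Once the summary shows zeros where they cannot be valid, the next step is to mark them as missing. The following is a minimal sketch, not the tutorial's own code: it inlines the five sample rows listed earlier so that it runs without the CSV file, and it assumes that a zero is invalid in columns 1 through 5 (glucose, blood pressure, skinfold thickness, insulin and BMI). The names `sample` and `zero_invalid` are made up for illustration.

```python
from io import StringIO

import numpy as np
from pandas import read_csv

# First five rows of the dataset, inlined so the sketch runs without the file
sample = StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
)
dataset = read_csv(sample, header=None)

# Assumption: a zero is invalid (i.e. missing) in columns 1 to 5 only.
# Column 0 (number of pregnancies) can legitimately be zero.
zero_invalid = [1, 2, 3, 4, 5]
dataset[zero_invalid] = dataset[zero_invalid].replace(0, np.nan)

# Count the values now marked as missing in each column
missing_counts = dataset.isnull().sum()
print(missing_counts)
```

With the real file, the same `replace` call would be applied to the full DataFrame loaded above.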

##Q3:
What happens when data is imbalanced?
A classification dataset with skewed class proportions is called imbalanced. Classes that make up a large proportion of the dataset are called majority classes. Those that make up a smaller proportion are minority classes.

An imbalanced dataset lets an algorithm score well simply by always predicting the majority class. That is a problem if the minority class is the one you care about most. Balancing is a way to force the algorithm to give more weight to the minority class.
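
A small worked example may make this concrete. The labels below are hypothetical (95 majority and 5 minority examples), and the "model" is just a rule that always returns the majority class; it is a sketch of the effect, not a real classifier.

```python
# Hypothetical labels: 95 majority-class (0) and 5 minority-class (1) examples
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true 1s that were predicted as 1
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```

The accuracy looks good, yet not a single minority example is found.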

#### Q4:
What are upsampling and downsampling?
Upsampling is a zero-padding procedure that increases the number of samples of a discrete-time (DT) signal. More specifically, when upsampling, zeros are inserted between the samples of the signal. Downsampling decreases the number of samples.
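
The signal-processing definition above can be sketched with NumPy. The function names `upsample` and `downsample` are made up for this note, and no lowpass filtering is applied, so this shows only the sample bookkeeping.

```python
import numpy as np

def upsample(x, factor):
    """Insert (factor - 1) zeros after every sample of x."""
    y = np.zeros(len(x) * factor, dtype=x.dtype)
    y[::factor] = x
    return y

def downsample(x, factor):
    """Keep every factor-th sample of x, starting with the first."""
    return x[::factor]

x = np.array([1, 2, 3, 4])
up = upsample(x, 3)
down = downsample(x, 2)
print(up)    # [1 0 0 2 0 0 3 0 0 4 0 0]
print(down)  # [1 3]
```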

What is the need for filtering in upsampling and downsampling?
Upsampling requires a lowpass filter after increasing the data rate, and downsampling requires a lowpass filter before decimation. Therefore, when the two are combined to change the rate by a rational factor, both operations can be accomplished by a single filter with the lower of the two cutoff frequencies.


#### Q5:
SMOTE is an algorithm that performs data augmentation by creating synthetic data points based on the original data points. SMOTE can be seen as an advanced version of oversampling, or as a specific algorithm for data augmentation.
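
The core idea can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions, not the reference SMOTE implementation: the function name `smote_sketch`, its parameters and the sample points are all hypothetical. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbors.

```python
import numpy as np

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        # Pick a random minority sample as the base point
        base = minority[rng.integers(len(minority))]
        # Euclidean distances from the base point to every minority sample
        dists = np.linalg.norm(minority - base, axis=1)
        # The k nearest neighbors; position 0 is the base itself (distance 0)
        neighbors = np.argsort(dists)[1:k + 1]
        neighbor = minority[rng.choice(neighbors)]
        # Place the new point at a random spot on the segment base -> neighbor
        synthetic[i] = base + rng.random() * (neighbor - base)
    return synthetic

# Hypothetical minority-class samples with two features each
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.0]])
new_points = smote_sketch(minority, n_new=6)
print(new_points.shape)  # (6, 2)
```

In practice a maintained implementation, such as the one in the imbalanced-learn package, would be the safer choice.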


Outliers are the observations in a dataset that deviate significantly from the rest of the data. In any data science project, it is essential to identify and handle outliers, as they can have a significant impact on many statistical measures, such as the mean and standard deviation, and on the performance of ML models.

##Q6:
How do you handle missing data in data analysis?
Mean, Median and Mode

This is one of the most common methods of imputing values when dealing with missing data. In cases where there are a small number of missing observations, data scientists can calculate the mean or median of the existing observations and use it to fill in the missing ones.
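
A minimal pandas sketch of mean and median imputation follows. The DataFrame and its column names are hypothetical, chosen so that one column contains an extreme value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0],
    "income": [40.0, np.nan, 60.0, 500.0],
})

# Fill each column with its own mean (computed from the observed values only)
mean_filled = df.fillna(df.mean())

# The median is less affected by extreme values such as the 500.0 above
median_filled = df.fillna(df.median())

print(mean_filled)
print(median_filled)
```

Here the missing income becomes 200.0 under the mean but 60.0 under the median, which is why the median is often preferred for skewed columns.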


How do you find the percentage of missing data?
To find the percentage of missing values in each column of an R data frame, we can use the colMeans function with the is.na function. This gives the fraction of missing values in each column. We can then multiply the output by 100 to get the percentage.
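
The answer above is for R. Since the rest of these notes use Python, here is what I believe is the equivalent pandas idiom, shown on a small hypothetical DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" is half missing, "b" a quarter, "c" complete
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# isna() gives True/False per cell; the column mean of those flags is the
# fraction missing, and multiplying by 100 turns it into a percentage
percent_missing = df.isna().mean() * 100
print(percent_missing)
```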

What would be your strategy to handle a situation indicating an imbalanced dataset?
Approach to deal with the imbalanced dataset problem
Choose Proper Evaluation Metric. The accuracy of a classifier is the total number of correct predictions by the classifier divided by the total number of predictions. ...
Resampling (Oversampling and Undersampling) ...
SMOTE. ...
BalancedBaggingClassifier. ...
Threshold moving.
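
Of the approaches listed, threshold moving is the easiest to sketch without any library. The probabilities and labels below are hypothetical, and `predict` and `recall` are small helpers written for this note.

```python
# Hypothetical predicted probabilities for the positive (minority) class
probabilities = [0.10, 0.35, 0.45, 0.60, 0.20]
y_true = [0, 1, 1, 1, 0]

def predict(probs, threshold):
    """Label an example positive when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return hits / sum(y_true)

default_preds = predict(probabilities, 0.5)  # [0, 0, 0, 1, 0]
moved_preds = predict(probabilities, 0.3)    # [0, 1, 1, 1, 0]

print(recall(y_true, default_preds))
print(recall(y_true, moved_preds))
```

Lowering the threshold from 0.5 to 0.3 recovers the two positives that the default cut-off missed. On real data this usually costs some false positives, so the threshold should be chosen on a validation set.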

Q9:
Which techniques can be used to deal with a dataset having imbalanced classes?
When we are using an imbalanced dataset, we can randomly sample rows from the minority class with replacement until it matches the majority class. This technique is called oversampling. Similarly, we can randomly delete rows from the majority class until it matches the minority class, which is called undersampling.
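
Both techniques can be sketched with pandas alone. The DataFrame below is hypothetical (8 majority rows and 2 minority rows), and the fixed `random_state` is only there to make the sketch repeatable.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 majority rows (label 0), 2 minority rows (label 1)
df = pd.DataFrame({"x": range(10), "label": [0] * 8 + [1] * 2})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Oversampling: draw minority rows with replacement until they match the majority
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

# Undersampling: keep a random subset of majority rows the size of the minority
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

print(oversampled["label"].value_counts())
print(undersampled["label"].value_counts())
```

Resampling should be applied to the training split only, so that the test set keeps the real class proportions.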

Q11:
SMOTE (an oversampling technique) is the best to use here.