In [3]:
import pandas as pd
import seaborn as sns
import numpy as np

#### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Ans: Missing data is a missing piece of information in our dataset. It can be anywhere in our data. 'NaN' or 'Null' values are in place of missing data and hence denote missing values. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons.

Missing data can reduce the accuracy of the model. During data preprocessing, especially EDA, the visualization of a feature can be misleading when null values are present, and the model built at the end can be biased.

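Some algorithms can work with missing values. Tree-based methods are the usual examples: decision trees and Random Forest can handle them in implementations that support it, and gradient-boosting libraries such as XGBoost and LightGBM accept NaN directly. k-NN can also be adapted, for instance by measuring distance only on the features that are present; a missing value can then be filled in from the majority (or average) of the K nearest values.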

#### Q2: List down techniques used to handle missing data.  Give an example of each with python code.

Some techniques to handle missing data are:

a) Deleting the missing value:
- In this technique we use the dropna function of pandas to drop the rows (or columns) that contain missing values, either for particular columns or for the whole dataset. As a rule of thumb, this is done only when less than about 5% of the data is missing. Keeping only the fully observed rows is known as complete case analysis.

b) Imputing the missing value: Imputing the missing value involves various techniques as follows:
 - Replacing with an arbitrary value.
 - Replacing with mean value.
 - Replacing with median value.
 - Replacing with mode value.
 - Replacing with forward value, backward value also known as forward fill and backward fill.

In [74]:
df = sns.load_dataset('titanic')
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [75]:
# Deleting the missing value: here the whole 'embark_town' column is dropped
df.drop(['embark_town'], axis = 1).head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,no,True
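
The cell above removes a whole column. The row-wise deletion described earlier (complete case analysis with dropna) can be sketched as follows. This is a minimal sketch on a small hypothetical frame, `toy`, which is not part of the original notebook; the column names only mimic the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (assumption)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "embarked": ["S", "C", None, "S"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Complete case analysis: keep only rows with no missing value at all
complete_cases = toy.dropna()

# Drop rows only when a specific column is missing
age_known = toy.dropna(subset=["age"])

# Drop every column that contains at least one missing value
no_missing_cols = toy.dropna(axis=1)

# Share of missing values per column, useful for the "less than 5%" check
missing_share = toy.isnull().mean()
print(missing_share)
```

Checking `missing_share` first is one way to decide whether deletion is acceptable or whether imputation is the safer choice.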


In [79]:
# Replacing with an arbitrary value: missing ages are replaced with 25
df['age_replace'] = df['age'].fillna(25)
df[['age_replace', 'age']].head()

Unnamed: 0,age_replace,age
0,22.0,22.0
1,38.0,38.0
2,26.0,26.0
3,35.0,35.0
4,35.0,35.0


In [82]:
# Replacing with mean value.
df['age_mean'] = df['age'].fillna(df['age'].mean())
df[['age_mean', 'age']].head()

Unnamed: 0,age_mean,age
0,22.0,22.0
1,38.0,38.0
2,26.0,26.0
3,35.0,35.0
4,35.0,35.0


In [84]:
# Replacing with median value.
df['age_median'] = df['age'].fillna(df['age'].median())
df[['age_median', 'age']].head()

Unnamed: 0,age_median,age
0,22.0,22.0
1,38.0,38.0
2,26.0,26.0
3,35.0,35.0
4,35.0,35.0


In [86]:
# Replacing with mode value.
df['deck_mode'] = df['deck'].fillna(df['deck'].mode()[0])
df[['deck_mode', 'deck']].head()

Unnamed: 0,deck_mode,deck
0,C,
1,C,C
2,C,
3,C,C
4,C,


In [89]:
# Replacing with the previous (forward fill) or next (backward fill) observed value.
df['for_deck'] = df['deck'].ffill()
df['back_deck'] = df['deck'].bfill()

df[['for_deck', 'back_deck', 'deck']].head()

Unnamed: 0,for_deck,back_deck,deck
0,,C,
1,C,C,C
2,C,C,
3,C,C,C
4,C,E,


#### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Ans: A classification dataset with skewed class proportions is called an imbalanced dataset. Classes that make up a large proportion of the dataset are called majority classes, and those that make up a smaller proportion are minority classes. If the imbalance is not handled, the model learns mostly the features of the majority class and its predictions become biased towards that class.

The following issues may occur with an imbalanced dataset:

- The model has too few examples to learn the feature relationships of the under-represented classes.
- It becomes harder to detect the features that separate the classes, i.e. to identify the relevant features unique to each class.
- “Standard” evaluation metrics such as accuracy, which are generally designed for similar class sizes, become heavily biased.


#### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and downsampling are required.

Ans: Down-sampling: Down-sampling balances the dataset by reducing the size of the abundant/majority class. This method is used when the quantity of data is sufficient. By keeping all samples in the rare class and randomly selecting an equal number of samples from the abundant class, a balanced new dataset can be constructed for further modelling.

Up-sampling: In contrast, up-sampling is used when the quantity of data is insufficient. It balances the dataset by increasing the number of rare samples. Rather than getting rid of abundant samples, new rare samples are generated by using e.g. repetition, bootstrapping or SMOTE (Synthetic Minority Over-Sampling Technique).

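Neither sampling technique is better than the other in general; often we employ both to achieve stable and successful results. For example, if the majority class has hundreds of thousands of rows, down-sampling it still leaves plenty of data, whereas if the whole dataset has only a few hundred rows, up-sampling the minority class avoids throwing data away.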
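
Both approaches can be illustrated with random sampling in pandas. This is a minimal sketch on a hypothetical frame `data` with 90 majority and 10 minority rows; the names and the `random_state` value are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical imbalanced data: 90 rows of class 0, 10 rows of class 1
data = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,
})
majority = data[data["label"] == 0]
minority = data[data["label"] == 1]

# Down-sampling: draw as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=42)
balanced_down = pd.concat([majority_down, minority])

# Up-sampling: draw minority rows with replacement until they match the majority
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced_up = pd.concat([majority, minority_up])

print(balanced_down["label"].value_counts())
print(balanced_up["label"].value_counts())
```

Resampling is normally applied to the training split only, so that the test data keeps the original class proportions.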

#### Q5: What is data Augmentation? Explain SMOTE.

Data Augmentation: Data augmentation is a technique of artificially increasing the training set by creating modified copies of a dataset using existing data. It includes making minor changes to the dataset or using deep learning to generate new data points.  

SMOTE: SMOTE (Synthetic Minority Oversampling Technique) is an oversampling method for balancing the class distribution of a dataset. It selects minority examples that are close to each other in the feature space, draws a line between them, and creates a new sample at a point along that line.

In simple words, the algorithm picks a random example from the minority class and then picks, at random, one of its K nearest neighbors within that class. The synthetic example is created at a random point between these two examples in the feature space.
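
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference SMOTE implementation: the function name `smote_sketch`, its parameters, and the sample points are hypothetical, and it assumes `k` is smaller than the number of minority examples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of each point
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))    # random minority example
        nb = rng.choice(neighbors[base])   # one of its k nearest neighbors
        gap = rng.random()                 # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points in a 2-D feature space
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_sketch(X_minority, n_new=6)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already covered by the minority class.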

#### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Ans: In simple terms, an outlier is a data point that is extremely high or extremely low relative to the other values in the dataset we are working with; it stands out greatly from the overall pattern of values in a dataset or graph. Outliers can appear for several reasons, for example a measurement error or an input error.

Understanding outliers is critical in data analysis for at least two reasons:

a) The outliers may negatively bias the entire result of an analysis.

b) The behavior of the outliers may be precisely what is being sought.
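
Point a) can be illustrated by flagging outliers and comparing summary statistics with and without them. The sketch below uses the 1.5 × IQR rule, which is a common convention and an assumption here (the text above does not prescribe a detection method); the `ages` values are hypothetical.

```python
import pandas as pd

# Hypothetical ages with one implausible entry
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 120])

# Interquartile range and the usual 1.5 * IQR fences
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
is_outlier = (ages < lower) | (ages > upper)
outliers = ages[is_outlier]

# Effect on summary statistics: the mean shifts far more than the median
mean_with, mean_without = ages.mean(), ages[~is_outlier].mean()
median_with, median_without = ages.median(), ages[~is_outlier].median()
print(outliers.tolist(), mean_with, mean_without, median_with, median_without)
```

Whether a flagged value should be removed, capped, or kept depends on point b): if the extreme cases are the object of study, they should stay in the data.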

#### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?



Ans: It depends on the data type. If numerical data is missing, we can use mean or median imputation; if categorical data is missing, we can use mode imputation. If only a small share of the rows is affected, deleting those rows is also an option.

#### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?



Ans: When data is missing at random, the probability of a value being missing depends only on the observed data. For example, in survey data we may find that everyone has answered 'Gender', but 'Age' is mostly missing for people who answered 'Gender' as 'female'. The missing values of the 'Age' variable can then be explained by the 'Gender' variable. To look for such a pattern, we can compare how often a column is missing across the groups of the other columns; if the missing rate differs clearly between groups, there is a pattern in the missing data.
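
The survey example can be checked with a missingness indicator and a group-wise comparison. This is a minimal sketch on a hypothetical frame `survey`; the values are made up to mirror the example, not taken from real data.

```python
import numpy as np
import pandas as pd

# Hypothetical survey: 'age' is missing mostly for respondents marked 'female'
survey = pd.DataFrame({
    "gender": ["female"] * 5 + ["male"] * 5,
    "age": [np.nan, np.nan, np.nan, 31, np.nan, 25, 40, np.nan, 36, 52],
})

# Indicator column: True where 'age' is missing
survey["age_missing"] = survey["age"].isnull()

# Missing rate per group; similar rates would suggest no dependence on 'gender'
missing_rate = survey.groupby("gender")["age_missing"].mean()
print(missing_rate)
```

A large gap between the groups suggests that the missingness depends on 'gender'. This kind of check can only reveal dependence on observed columns; it cannot show whether the missingness depends on the unobserved values themselves.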

#### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?



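Ans: Accuracy alone is misleading on this dataset, because a model that predicts "no condition" for every patient would still score high. We should evaluate with metrics that focus on the minority class, such as the confusion matrix, precision, recall and F1-score. We can also use a sampling technique to balance the majority and minority classes in the training data. For our particular case we will use up-sampling to increase the number of data points of the minority class and balance our medical dataset.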
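
Metrics computed from the confusion matrix show why accuracy alone is not enough when most patients do not have the condition. The sketch below uses hypothetical labels and predictions (`y_true`, `y_pred`) and computes the metrics by hand with NumPy.

```python
import numpy as np

# Hypothetical labels: 1 = has the condition (minority), 0 = does not
y_true = np.array([0] * 90 + [1] * 10)
# Hypothetical model: 4 false alarms among the negatives, 6 of 10 positives found
y_pred = np.array([0] * 86 + [1] * 4 + [1] * 6 + [0] * 4)

# Confusion-matrix cells
tp = int(((y_true == 1) & (y_pred == 1)).sum())  # true positives
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # false positives
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # false negatives
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

In this made-up case the accuracy is high while recall shows that a sizeable share of the patients with the condition is missed, which is the number that matters most in a diagnosis setting.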

#### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?



Ans: We will randomly down-sample the majority class (the satisfied customers) until it has as many rows as the minority class, which balances the classes in the data.

#### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

Ans: We will up-sample the minority class (the rare event), for example with the SMOTE technique, which creates synthetic examples of the minority class.