#Assessment Question 1
###Q Explain how you would handle missing data in a given dataset and provide a code snippet demonstrating this.

Missing data is common in data analysis, and the right way to handle it depends on the nature of the data, the reason the values are missing, and the goal of the analysis. Common techniques include:

**Deleting Rows:** The simplest approach is to drop the rows that contain missing values. It is advisable only when few rows are affected and the values are plausibly missing completely at random; otherwise, dropping rows shrinks the sample and can bias the results.

**Imputation:** Imputation fills missing values with a substitute, such as a measure of central tendency (the mean or median for numeric data, the mode for categorical data) or a fixed constant. Variants include single, multiple, and regression imputation.

**Prediction Models:** In this case, the missing values are predicted using machine learning models. The features with complete data are used to model the feature that has missing data, and this model is then used to predict the missing values. Techniques such as regression, KNN (K-nearest neighbors), and random forests can be used for prediction.

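**Interpolation:** This method assumes that the rows are ordered (for example, a time series) and that values follow a trend between neighboring observations. Missing values are filled from that trend, for instance by drawing a straight line between the nearest known values on either side of the gap.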

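**Advanced Imputation Techniques:** MICE (Multivariate Imputation by Chained Equations) models each feature that has missing values as a function of the other features and cycles through the features repeatedly. It is typically run several times to produce multiple imputed datasets, and the analysis results are then pooled across them. Deep learning-based techniques, which can capture non-linear relationships in high-dimensional data, are another option.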


In [1]:
import seaborn as sns
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Check for missing values in 'age'
print(df['age'].isnull().sum())  # Prints the number of missing values




177


In [4]:
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


##Deleting Rows

In [5]:
# 1. Deleting Rows: keep only the rows where 'age' is present
# dropna returns a new DataFrame, so df itself is left unchanged
df1 = df.dropna(subset=['age'])

df1

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,0,3,female,39.0,0,5,29.1250,Q,Third,woman,False,,Queenstown,no,False
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
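
Whether dropping rows is safe depends on how much of the column is missing. The sketch below uses a small made-up frame rather than the Titanic data: it measures the share of missing values per column and drops rows only when that share is under a threshold. The helper name `drop_rows_if_sparse` and the `max_share` cutoff are illustrative assumptions, not a standard rule.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for a real dataset (values are illustrative only).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "deck": [None, "C", None, "C", None, "E"],
})

# Share of missing values in each column (0.0 = complete, 1.0 = entirely missing).
missing_share = toy.isna().mean()


def drop_rows_if_sparse(frame, column, max_share=0.05):
    """Hypothetical helper: drop rows missing `column` only if few rows are affected."""
    share = frame[column].isna().mean()
    if share <= max_share:
        return frame.dropna(subset=[column])
    # Too many gaps: return the frame untouched so another technique can be tried.
    return frame


# With a deliberately loose threshold, the two rows without an age are dropped.
kept = drop_rows_if_sparse(toy, "age", max_share=0.5)
print(missing_share)
print(len(toy), "->", len(kept))
```

With the default `max_share` the helper leaves the frame as it is, which is a prompt to consider one of the imputation techniques instead.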


##Imputation

In [6]:
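# 2. Imputation using Mean
df2 = df.copy()
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array of shape (n_rows, 1); flatten it before assigning to one column
df2['age'] = imputer.fit_transform(df2[['age']]).ravel()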

df2

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.000000,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.000000,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.000000,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.000000,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.000000,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.000000,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.000000,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,29.699118,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.000000,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
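
Mean imputation is sensitive to outliers and hides which values were originally missing. As one possible variation, the sketch below uses the median and asks `SimpleImputer` to append a missing-value indicator via `add_indicator=True`. The toy ages and the column name `age_was_missing` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages with one large value (80) that would pull the mean upward.
ages = pd.DataFrame({"age": [22.0, 38.0, np.nan, 35.0, 80.0, np.nan]})

# add_indicator=True appends one extra column per feature that had missing values.
median_imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = median_imputer.fit_transform(ages)  # column 0: imputed age, column 1: indicator

result = pd.DataFrame({
    "age": filled[:, 0],
    "age_was_missing": filled[:, 1].astype(bool),
})
print(result)
```

Keeping the indicator lets a downstream model treat imputed rows differently from observed ones, at the cost of one extra column.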


##Prediction Models

In [7]:
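# 3. Prediction Model using KNN: each missing 'age' becomes the mean age of the 5 rows nearest in 'pclass' and 'sibsp'
# ('pclass' and 'sibsp' have no missing values; they come back unchanged but cast to float)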
df3 = df.copy()
knn_imputer = KNNImputer(n_neighbors=5)
df3[['age', 'pclass', 'sibsp']] = knn_imputer.fit_transform(df3[['age', 'pclass', 'sibsp']])

df3

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3.0,male,22.0,1.0,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1.0,female,38.0,1.0,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3.0,female,26.0,0.0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1.0,female,35.0,1.0,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3.0,male,35.0,0.0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2.0,male,27.0,0.0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1.0,female,19.0,0.0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3.0,female,27.4,1.0,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1.0,male,26.0,0.0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
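
`KNNImputer` hides the modeling step. To make the idea explicit, the sketch below fits a regressor on the rows where the target is known and predicts the rows where it is missing. The choice of `RandomForestRegressor`, its settings, and the toy values are assumptions made for illustration; any regressor could be substituted.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up rows shaped like the Titanic columns used above.
toy = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "sibsp": [1, 1, 0, 1, 0, 0, 0, 1],
    "age": [22.0, 38.0, 26.0, 35.0, np.nan, 27.0, 19.0, np.nan],
})

features = ["pclass", "sibsp"]   # complete columns used as predictors
known = toy["age"].notna()       # True where the target is observed

# Train only on rows with an observed age.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(toy.loc[known, features], toy.loc[known, "age"])

# Predict the missing ages and write them into a copy of the frame.
imputed = toy.copy()
imputed.loc[~known, "age"] = model.predict(toy.loc[~known, features])
print(imputed)
```

Observed ages are left untouched; only the rows flagged by `~known` receive predictions.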


##Interpolation

In [8]:
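# 4. Interpolation
df4 = df.copy()
# Assign the result back: interpolate(inplace=True) on a selected column is chained
# assignment and may not update df4 in recent pandas versions
df4['age'] = df4['age'].interpolate(method='linear')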

df4

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,22.5,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
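
Interpolation is most meaningful when the rows form an ordered sequence. The sketch below uses a made-up, unevenly spaced time series to contrast `method="linear"`, which treats the rows as equally spaced, with `method="time"`, which weights by the actual gaps in a `DatetimeIndex`. The dates and readings are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up readings; note the uneven gaps between the last three timestamps.
readings = pd.Series(
    [10.0, np.nan, np.nan, 16.0, np.nan, 20.0],
    index=pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03",
        "2024-01-04", "2024-01-08", "2024-01-10",
    ]),
)

# 'linear' ignores the index and spaces the rows evenly.
by_position = readings.interpolate(method="linear")

# 'time' uses the elapsed time between timestamps.
by_time = readings.interpolate(method="time")

print(pd.DataFrame({"linear": by_position, "time": by_time}))
```

The two methods agree on the evenly spaced days but differ on 2024-01-08: by position the gap sits halfway between 16 and 20, while by time it sits four days into a six-day gap.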


##Advanced Imputation

In [9]:
# 5. Advanced Imputation - MICE-style iterative imputation
# (IterativeImputer returns a single imputed dataset by default)
df5 = df.copy()
iter_imputer = IterativeImputer(random_state=0)
df5[['age', 'pclass', 'sibsp']] = iter_imputer.fit_transform(df5[['age', 'pclass', 'sibsp']])

df5

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3.0,male,22.000000,1.0,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1.0,female,38.000000,1.0,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3.0,female,26.000000,0.0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1.0,female,35.000000,1.0,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3.0,male,35.000000,0.0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2.0,male,27.000000,0.0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1.0,female,19.000000,0.0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3.0,female,22.970765,1.0,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1.0,male,26.000000,0.0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
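
The cell above produces one completed dataset. To approximate the "multiple" part of multiple imputation, one option is to run `IterativeImputer` several times with `sample_posterior=True` and different seeds, compute the quantity of interest on each completed dataset, and pool the results. The sketch below does this on synthetic data; the data-generating formula, the five repetitions, and the choice of mean age as the analysis step are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: age loosely depends on passenger class (made-up relationship).
rng = np.random.default_rng(0)
n = 200
pclass = rng.integers(1, 4, size=n).astype(float)
age = 45.0 - 7.0 * pclass + rng.normal(0.0, 5.0, size=n)
toy = pd.DataFrame({"age": age, "pclass": pclass})

# Remove roughly 20% of the ages at random.
missing = rng.random(n) < 0.2
toy.loc[missing, "age"] = np.nan

# Each run draws imputations from the posterior, so the completed datasets differ.
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(toy)
    estimates.append(completed[:, 0].mean())  # analysis step: mean age

# Pool the per-dataset estimates; their variance reflects imputation uncertainty.
pooled_mean_age = float(np.mean(estimates))
between_variance = float(np.var(estimates, ddof=1))
print(pooled_mean_age, between_variance)
```

Pooling the analysis results, rather than averaging the imputed values themselves, keeps the uncertainty introduced by imputation visible in the final estimate.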
