<span style='color:red'> NOTE: You can only pass the lab, when you provide both code and markdown </span>

Use Code for your analysis
Use Markdown to document and elaborate on your findings, conclusions, assertions, etc.

# DS_ML_I_P2: Exercise: Working with missing data
This exercise should give you some practice in working with missing data of different feature types.

## Load the iris dataset with missing values into a dataframe 
File: datasets.zip/datasets/iris/data_someMissing.all

Hint: When data is missing, pandas might not be able to determine the proper type of columns by itself. Look carefully at the data types and act accordingly! You have different options to change the types of columns:
* When reading, have a look at [pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html), especially the parameter *na_filter*
* In memory, have a look at [dataframe.astype](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) and the transformation functions [dataframe.apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) in combination with [pandas.to_numeric](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html)
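The in-memory route from the hint can be sketched as follows. This is a minimal example on made-up toy rows (the `raw` string and the `toy` dataframe are assumptions for illustration, not the real file): columns that contain `?` are read as strings and are then coerced to floats with `pandas.to_numeric`.

```python
import io
import pandas as pd

# Toy stand-in for the real file (assumption: space-separated, '?' marks a gap).
raw = io.StringIO(
    "5.1 3.5 1.4 0.2 Iris-setosa\n"
    "4.9 ? 1.4 0.2 Iris-setosa\n"
    "6.3 3.3 ? 2.5 Iris-virginica\n"
)
cols = ["sl", "sw", "pl", "pw", "class"]

# Without na_values, columns that contain '?' are read as strings, not floats.
toy = pd.read_csv(raw, sep=" ", header=None, names=cols)
print(toy.dtypes)

# Coerce the feature columns: anything that is not a number becomes NaN.
num_cols = ["sl", "sw", "pl", "pw"]
toy[num_cols] = toy[num_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
print(toy.isna().sum())
```

Passing `na_values='?'` to `read_csv` reaches the same result in one step; the coercion route is mainly useful when the data is already loaded.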

In [233]:
import pandas as pd

Read the `.all` file with a single space as the separator and treat `?` as a missing value. The file has no header row, so `header=None` is passed and the column names are supplied through `names`.

In [234]:
df = pd.read_csv("datasets/iris/data_someMissing.all", sep=' ', header=None, na_filter=True, na_values='?', names=['sl','sw','pl','pw','class'])

In [235]:
df.columns

Index(['sl', 'sw', 'pl', 'pw', 'class'], dtype='object')

In [236]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   sl      149 non-null    float64
 1   sw      149 non-null    float64
 2   pl      149 non-null    float64
 3   pw      148 non-null    float64
 4   class   149 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


The following cells show the number of missing values per feature, the number of rows with at least one missing value, and the affected rows themselves.

In [237]:
missing_values_counts = df.isna().sum()
print(f"Number of missing values per column: \n{missing_values_counts}")

Number of missing values per column: 
sl       1
sw       1
pl       1
pw       2
class    1
dtype: int64


In [238]:
missing_val_rows = df[df.isna().any(axis=1)]
print(f"Amount of rows with missing values: {len(missing_val_rows)}")

Amount of rows with missing values: 5


In [239]:
missing_val_rows

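| | sl | sw | pl | pw | class |
|---|---|---|---|---|---|
| 6 | 4.6 | 3.4 | NaN | 0.3 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | NaN |
| 13 | 4.3 | 3.0 | 1.1 | NaN | Iris-setosa |
| 14 | 5.8 | 4.0 | 1.2 | NaN | Iris-setosa |
| 18 | NaN | NaN | 1.7 | 0.3 | Iris-setosa |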


## What are your options to work with the missing values?
My options for handling the missing values are:
* Deletion: remove the rows that contain missing values (ignoring those data points), or remove columns that have too many missing values.

* Imputation: replace missing values with substituted values, e.g. the mean, median, or mode of the existing values, or the most plausible value given the other available features (using a predictive ML model to impute values).

* When a missing value is expected (i.e. it is not an error), add a flag or a default value for the respective missing values.
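The three options can be sketched side by side on a small made-up dataframe. The `toy` data and the `pw_missing` column name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing feature value and one missing label.
toy = pd.DataFrame({
    "pw": [0.2, np.nan, 1.3, 2.5],
    "class": ["Iris-setosa", "Iris-setosa", None, "Iris-virginica"],
})

# Option 1: deletion, keep only complete rows.
deleted = toy.dropna()

# Option 2: imputation of the numeric column with a column statistic (median here).
imputed = toy.copy()
imputed["pw"] = imputed["pw"].fillna(imputed["pw"].median())

# Option 3: record where the gap was, then fill with defaults, so the
# information "this value was missing" is not lost.
flagged = toy.copy()
flagged["pw_missing"] = flagged["pw"].isna()
flagged["pw"] = flagged["pw"].fillna(flagged["pw"].median())
flagged["class"] = flagged["class"].fillna("Unknown")

print(deleted, imputed, flagged, sep="\n\n")
```

The indicator column in option 3 lets a later model or analysis tell imputed values apart from observed ones.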

## What is their difference with respect to the features of the dataset and the class associations? 
* **Deletion** can discard observations that might be relevant. It also reduces the dataset size; in the extreme case of many rows with missing values, the dataset can shrink considerably. Both effects can change the overall feature distributions.
* If the missing values are not randomly distributed and certain classes are more affected by missingness, deletion can distort the class associations.
* **Imputation**, on the other hand, retains the dataset size and largely preserves the structure of the feature distributions and class relationships. Filling with a single value such as the median does pull the spread in slightly, though (e.g. the standard deviation of `sl` is 0.831 in the original data and 0.828 after imputation).
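How strongly a class is affected can be checked before choosing a method, and imputation can be made class-aware. The sketch below uses made-up values and shortened class names (both assumptions): it computes the share of missing values per class and compares a global median fill with a per-class median fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing 'pl' value in each of two classes.
toy = pd.DataFrame({
    "pl": [1.4, np.nan, 1.6, 5.1, 5.9, np.nan],
    "class": ["setosa"] * 3 + ["virginica"] * 3,
})

# Share of missing 'pl' values per class: uneven shares would hint that
# deletion hits some classes harder than others.
missing_share = toy["pl"].isna().groupby(toy["class"]).mean()
print(missing_share)

# A global median ignores the class; a per-class median respects it.
global_fill = toy["pl"].fillna(toy["pl"].median())
class_fill = toy["pl"].fillna(toy.groupby("class")["pl"].transform("median"))
print(pd.DataFrame({"global": global_fill, "per_class": class_fill}))
```

In this toy case the global median lies between the two classes, so it fits neither of them well, while the per-class median stays inside each class's range.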


## Implement some of the options for the dataset
* check, how these options change the statistical values of
  * each feature
  * each class
* useful functions in pandas for this step
  * find out, if a value [is null](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html)
  * [remove data that is null](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html?highlight=dropna)
  * [fill null data with other value](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
  * [replace values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html)
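One possible way to compare the statistics per feature and per class across handling methods is sketched below. The `toy` data, the `versions` dictionary, and the shortened class names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing feature value and one missing label.
toy = pd.DataFrame({
    "sl": [5.0, 5.2, np.nan, 6.4, 6.6, 6.8],
    "class": ["setosa", "setosa", "setosa", "virginica", "virginica", None],
})

versions = {
    "original": toy,
    "deleted": toy.dropna(),
    "imputed": toy.fillna({"sl": toy["sl"].median(), "class": "Unknown"}),
}

# Per-feature statistics: one row per handling method.
per_feature = pd.DataFrame(
    {name: v["sl"].agg(["count", "mean", "std"]) for name, v in versions.items()}
).T
print(per_feature.round(3))

# Per-class statistics: mean of 'sl' per class, one column per handling method.
per_class = pd.DataFrame(
    {name: v.groupby("class")["sl"].mean() for name, v in versions.items()}
)
print(per_class.round(3))
```

Note that `groupby` skips rows whose class is missing, so the `Unknown` row only has a value for the imputed version.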

## Implementation

### Deletion
All rows with at least one missing value are dropped.

In [240]:
# Drop every row that contains at least one missing value
df_no_nan = df.dropna().copy()

### Imputation
All missing values in the numeric features are replaced by the median of their respective columns, while missing class labels are flagged as `Unknown`.

In [241]:
# Fill numeric gaps with the column medians, then label missing classes as 'Unknown'
df_imputed = df.fillna(df.median(numeric_only=True)).copy()
df_imputed['class'] = df['class'].fillna('Unknown')

## Statistics & Visualization for Features

**Summary Statistic 1**

Summary statistics of the features for the dataframes produced by the different missing-value handling methods. The mean values are the most relevant part here.

In [242]:
# print statistics for each of the options you realize
print(f"Summary stats for: Initial Dataset\n")
print(df.describe().round(3))


Summary stats for: Initial Dataset

            sl       sw       pl       pw
count  149.000  149.000  149.000  148.000
mean     5.844    3.052    3.774    1.214
std      0.831    0.433    1.761    0.757
min      4.300    2.000    1.000    0.100
25%      5.100    2.800    1.600    0.300
50%      5.800    3.000    4.400    1.300
75%      6.400    3.300    5.100    1.800
max      7.900    4.400    6.900    2.500


In [243]:
print(f"Summary stats for: No Missing Value Dataset\n")
print(df_no_nan.describe().round(3))


Summary stats for: No Missing Value Dataset

            sl       sw       pl       pw
count  145.000  145.000  145.000  145.000
mean     5.870    3.043    3.840    1.234
std      0.822    0.431    1.738    0.752
min      4.400    2.000    1.000    0.100
25%      5.100    2.800    1.600    0.400
50%      5.800    3.000    4.400    1.300
75%      6.400    3.300    5.100    1.800
max      7.900    4.400    6.900    2.500


In [244]:
print(f"Summary stats for: Imputed Dataset\n")
print(df_imputed.describe().round(3))

Summary stats for: Imputed Dataset

            sl       sw       pl       pw
count  150.000  150.000  150.000  150.000
mean     5.844    3.052    3.778    1.215
std      0.828    0.432    1.755    0.752
min      4.300    2.000    1.000    0.100
25%      5.100    2.800    1.600    0.300
50%      5.800    3.000    4.400    1.300
75%      6.400    3.300    5.100    1.800
max      7.900    4.400    6.900    2.500


Concatenate the dataframes from the different handling methods for later visualization.

In [245]:
df['source'] = 'original'
df_imputed['source'] = 'imputed'
df_no_nan['source'] = 'no_nan'
df_combined = pd.concat([df,df_no_nan, df_imputed], ignore_index=True)

In [246]:
import plotly.express as px

**Visualization 1**

Show boxplots of the features across the different missing-value handling methods.

In [247]:
fig = px.box(df_combined[['sl', 'sw', 'pl', 'pw']],
            color=df_combined['source'],
            title= "Comparison of Iris Features' Distribution Across Dataset Versions ")
fig.update_traces(boxmean=True)
fig.show()
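An alternative to the wide-form call is to reshape the data into long form first, which makes the mapping of columns to plot axes explicit. The sketch below only does the reshaping with pandas, on made-up values (the `toy` dataframe and the `feature`/`value` column names are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the combined dataframe.
toy = pd.DataFrame({
    "sl": [5.1, np.nan, 6.3],
    "pw": [0.2, 1.3, np.nan],
    "source": ["original", "original", "imputed"],
})

# One row per (source, feature, value) triple. Rows without a value are
# dropped because a box plot cannot draw them anyway.
long_form = (
    toy.melt(id_vars="source", var_name="feature", value_name="value")
       .dropna(subset=["value"])
)
print(long_form)
```

A dataframe in this shape could then be passed to `px.box(long_form, x="feature", y="value", color="source")`.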

## Statistics & Visualization for Class Association
**Summary Statistic 2**

Showing the summary statistics of the class associations from different missing-value-handling methods.

In [248]:
print(f"Summary stats of the association class - Initial Dataset\n")
print(df['class'].describe())

Summary stats of the association class - Initial Dataset

count                 149
unique                  3
top       Iris-versicolor
freq                   50
Name: class, dtype: object


In [249]:
print(f"Summary stats of the association class - Imputed Dataset\n")
print(df_imputed['class'].describe())

Summary stats of the association class - Imputed Dataset

count                150
unique                 4
top       Iris-virginica
freq                  50
Name: class, dtype: object


In [250]:
print(f"Summary stats of the association class - No Missing Value Dataset\n")
print(df_no_nan['class'].describe())

Summary stats of the association class - No Missing Value Dataset

count                 145
unique                  3
top       Iris-versicolor
freq                   50
Name: class, dtype: object


**Visualization 2**

Showing the class distribution within the different resulting dataframes. The `NaN` values in the original dataset are replaced with the string `'NaN'` so that they appear as their own category in the plot.

In [252]:
df_combined_counts = df_combined.fillna('NaN').groupby(['source','class'], dropna=False).size().reset_index(name='count').sort_index(ascending=False)
fig = px.bar(df_combined_counts,x ='source', y='count', color='class', 
             barmode='group', text='count',
             title='Data Point Counts by Association Class and Missing-Values Handling Method')
fig.show()
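The same counts can also be inspected as a table. The sketch below is one possible variant on made-up labels (the `toy` dataframe and shortened class names are assumptions): it fills only the label column, so numeric columns would keep their dtype, and cross-tabulates source against class.

```python
import pandas as pd

# Hypothetical toy labels for the three dataset versions.
toy = pd.DataFrame({
    "source": ["original"] * 3 + ["imputed"] * 3 + ["no_nan"] * 2,
    "class": ["setosa", "virginica", None,
              "setosa", "virginica", "Unknown",
              "setosa", "virginica"],
})

# Replace missing labels with the string 'NaN' so they form their own category.
labels = toy["class"].fillna("NaN")

# Rows: dataset version, columns: class label, cells: number of data points.
counts = pd.crosstab(toy["source"], labels)
print(counts)
```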

## Conclusion

1. The boxplots show no noticeable effect of the missing-value handling method on the distribution of any feature (see *Visualization 1*).

2. The differences in the mean of each feature between deletion, imputation, and the original dataset are also minimal. This is probably because the dataset contains only a few missing values: 5 of 150 rows are affected (see *Summary Statistic 1*).

3. As the bar chart shows, imputation retains the class structure by adding a new class `Unknown` for the missing class label. Such rows could be used, for example, for testing or validating a model trained on this dataset (see *Visualization 2*).
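The size of the mean shifts can be made explicit with a little arithmetic. The sketch below copies the means from the three `describe()` outputs under *Summary Statistic 1* and computes how far each method moves them from the original, both in absolute terms and in percent.

```python
import pandas as pd

# Means copied from the three describe() outputs (Summary Statistic 1).
means = pd.DataFrame(
    {
        "original": [5.844, 3.052, 3.774, 1.214],
        "deletion": [5.870, 3.043, 3.840, 1.234],
        "imputation": [5.844, 3.052, 3.778, 1.215],
    },
    index=["sl", "sw", "pl", "pw"],
)

# Shift of each method relative to the original, absolute and in percent.
abs_shift = means[["deletion", "imputation"]].sub(means["original"], axis=0)
rel_shift = (abs_shift.div(means["original"], axis=0) * 100).round(2)

print(abs_shift.round(3))
print(rel_shift)
```

The largest absolute shift is 0.066 for `pl` under deletion; median imputation moves every mean by 0.004 or less.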