# Python Machine Learning - Code Examples

# Dealing with missing data 

### Overview
- [1. Dealing with missing data](#ch1)
  - [1.1. Eliminating samples or features with missing values](#ch1_1)
- [2. Imputation of missing values](#ch2)
  - [2.1. Dataframe with numeric features](#ch2_1)
    - [2.1.1. with `SimpleImputer`](#ch2_1_1)
    - [2.1.2. with `df.fillna`](#ch2_1_2)
  - [2.2. Dataframe with numeric and categorical features](#ch2_2)
    - [2.2.1. with `SimpleImputer`](#ch2_2_1)
    - [2.2.2. with `df.fillna`](#ch2_2_2)
- [3. Question](#ch_question)
- [4. Further work](#ch_fw)

In [1]:
# display the result of every top-level expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

##  1. Dealing with missing data  <a name="ch1"></a>

In [2]:
import numpy as np
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [3]:
# flag the missing entries (isnull is an alias of isna), then count them per column

print(df.isna())
print(df.isnull())
df.isna().sum()

       A      B      C      D
0  False  False  False  False
1  False  False   True  False
2  False  False  False   True
       A      B      C      D
0  False  False  False  False
1  False  False   True  False
2  False  False  False   True


A    0
B    0
C    1
D    1
dtype: int64

### 1.1 Eliminating samples or features with missing values <a name="ch1_1"></a>


In [4]:
# drop every row that contains at least one NaN
df.dropna()

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [5]:
# only drop rows where all columns are NaN
df.dropna(how='all')  

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [6]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [7]:
# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
2,10.0,11.0,12.0,


In [8]:
# drop columns (features) that contain at least one NaN
df.dropna(axis=1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


## 2. Imputation of missing values <a name="ch2"></a>
### 2.1 Dataframe with numeric features <a name="ch2_1"></a>

In [9]:
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


#### 2.1.1 with `SimpleImputer` <a name="ch2_1_1"></a>

In [10]:
from sklearn.impute import SimpleImputer

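# choose the imputation strategy: replace NaN with the column mean
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit: compute the mean of each column
imr = imr.fit(df)
# transform: fill the missing entries with the learned means
imputed_data = imr.transform(df)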
imputed_data = pd.DataFrame(imputed_data, columns=df.columns)
imputed_data


Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.5,8.0
2,10.0,11.0,12.0,6.0
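
The split into `fit` and `transform` matters once the imputer is applied to data it was not fitted on. The following is a minimal sketch of that idea, not part of the original notebook: the `new` dataframe is a hypothetical set of rows invented for illustration, and `statistics_` is the attribute in which `SimpleImputer` stores the learned fill values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same toy data as in the cell above
train = pd.DataFrame({
    "A": [1.0, 5.0, 10.0],
    "B": [2.0, 6.0, 11.0],
    "C": [3.0, np.nan, 12.0],
    "D": [4.0, 8.0, np.nan],
})

# Hypothetical rows that arrive after fitting (assumption for illustration)
new = pd.DataFrame({
    "A": [2.0, np.nan],
    "B": [3.0, 4.0],
    "C": [np.nan, 5.0],
    "D": [np.nan, 7.0],
})

imr = SimpleImputer(missing_values=np.nan, strategy="mean")
imr.fit(train)              # learn one mean per column from `train` only
print(imr.statistics_)      # the learned column means

# The means learned from `train` are reused to fill the gaps in `new`
new_imputed = pd.DataFrame(imr.transform(new), columns=new.columns)
print(new_imputed)
```

The gaps in `new` are filled with the means of `train`, not with the means of `new` itself, which is the usual reason for fitting on training data and only transforming later data.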


#### 2.1.2 with `df.fillna` <a name="ch2_1_2"></a>

In [11]:
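# fill each column's NaN with that column's mean, mirroring SimpleImputer above;
# df.mode() returns a DataFrame, which fillna aligns by row index instead of by column
df = df.fillna(df.mean())
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.5,8.0
2,10.0,11.0,12.0,6.0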


### 2.2 Dataframe with numeric and categorical features <a name="ch2_2"></a>

In [12]:
import numpy as np
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,red,4.0
5.0,6.0,,8.0
10.0,11.0,red,
5.0,3.0,blue,8.0'''


df = pd.read_csv(StringIO(csv_data))
df
result = df.dtypes

print("Output:")
print(result)

Unnamed: 0,A,B,C,D
0,1.0,2.0,red,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,red,
3,5.0,3.0,blue,8.0


Output:
A    float64
B    float64
C     object
D    float64
dtype: object


In [13]:
# examine the data types
feat_type = df.dtypes
print(feat_type)

A    float64
B    float64
C     object
D    float64
dtype: object


**create numeric and categorical  dataframes**

In [14]:
# numeric
df_num = df.select_dtypes(include='number')
df_num
# categorical
df_cat = df.select_dtypes(include='object')
df_cat

Unnamed: 0,A,B,D
0,1.0,2.0,4.0
1,5.0,6.0,8.0
2,10.0,11.0,
3,5.0,3.0,8.0


Unnamed: 0,C
0,red
1,
2,red
3,blue


#### 2.2.1 with `SimpleImputer` <a name="ch2_2_1"></a>

In [15]:
from sklearn.impute import SimpleImputer

**impute numeric**

In [16]:
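# choose the imputation strategy: replace NaN with the column mean
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit: compute the mean of each numeric column
imr = imr.fit(df_num)
# transform: fill the missing entries with the learned means
imputed_data = imr.transform(df_num)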
imputed_data = pd.DataFrame(imputed_data, columns=df_num.columns)
imputed_data

Unnamed: 0,A,B,D
0,1.0,2.0,4.0
1,5.0,6.0,8.0
2,10.0,11.0,6.666667
3,5.0,3.0,8.0


**impute categorical**

In [17]:
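# choose the imputation strategy: replace NaN with the most frequent value
imr = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# fit: find the most frequent value of each categorical column
imr = imr.fit(df_cat)
# transform: fill the missing entries with the learned values
imputed_data = imr.transform(df_cat)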
imputed_data = pd.DataFrame(imputed_data, columns=df_cat.columns)
imputed_data

Unnamed: 0,C
0,red
1,red
2,red
3,blue
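
The two cells above both assign to `imputed_data`, so the numeric result is overwritten by the categorical one. A minimal sketch of one way to keep both parts and put them back together in the original column order is shown below. The names `num_imputed`, `cat_imputed` and `df_imputed` are illustrative and do not appear in the original notebook, and the data is rebuilt inline so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same mixed-type data as above; column C is kept as object dtype explicitly
df = pd.DataFrame({
    "A": [1.0, 5.0, 10.0, 5.0],
    "B": [2.0, 6.0, 11.0, 3.0],
    "C": pd.Series(["red", np.nan, "red", "blue"], dtype=object),
    "D": [4.0, 8.0, np.nan, 8.0],
})

df_num = df.select_dtypes(include="number")
df_cat = df.select_dtypes(exclude="number")

# Impute each part with its own strategy and keep the results separate
num_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df_num),
    columns=df_num.columns, index=df.index,
)
cat_imputed = pd.DataFrame(
    SimpleImputer(strategy="most_frequent").fit_transform(df_cat),
    columns=df_cat.columns, index=df.index,
)

# Join the parts side by side and restore the original column order
df_imputed = pd.concat([num_imputed, cat_imputed], axis=1)[df.columns]
print(df_imputed)
```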


#### 2.2.2 with `df.fillna` <a name="ch2_2_2"></a>
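
This subsection has no cell in the notebook. The following is a minimal sketch of how the same imputation could be done with `df.fillna`, assuming the mean for numeric columns and the most frequent value for all other columns, as in 2.2.1. The `fill_values` dictionary is an illustrative helper, not something taken from the original.

```python
import numpy as np
import pandas as pd

# Same mixed-type data as above
df = pd.DataFrame({
    "A": [1.0, 5.0, 10.0, 5.0],
    "B": [2.0, 6.0, 11.0, 3.0],
    "C": pd.Series(["red", np.nan, "red", "blue"], dtype=object),
    "D": [4.0, 8.0, np.nan, 8.0],
})

# Build one fill value per column: mean for numeric, mode for everything else
fill_values = {}
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        fill_values[col] = df[col].mean()
    else:
        # mode() returns a Series (ties are possible); take its first entry
        fill_values[col] = df[col].mode().iloc[0]

# fillna accepts a dict that maps column name -> replacement value
df_filled = df.fillna(value=fill_values)
print(df_filled)
```

Passing a dictionary (or a Series indexed by column name) makes `fillna` work column by column, which avoids the row-index alignment that happens when a whole dataframe is passed as the fill value.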

## 3. Question <a name="ch_question"></a>
Is the strategy `most_frequent` also suitable for performing imputation on numeric features?
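
One way to explore the question is to compare the fill values that both strategies learn on a small numeric array. The sketch below uses hypothetical data invented for illustration: column 0 contains a repeated value, while column 1 is continuous and has no repeats.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 repeats a value, column 1 has only unique values
X = np.array([
    [4.0, 2.1],
    [8.0, 6.3],
    [np.nan, np.nan],
    [8.0, 3.7],
])

# Print the per-column fill value that each strategy learns
for strategy in ("mean", "most_frequent"):
    imr = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imr.fit(X)
    print(strategy, imr.statistics_)

mode_fill = SimpleImputer(strategy="most_frequent").fit(X).statistics_
```

For column 1 every value occurs exactly once. The scikit-learn documentation states that when several values are tied, only the smallest is returned, so it is worth checking whether the chosen fill value says anything about what is typical for that column.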

## 4. Further work <a name="ch_fw"></a>

Download the Wisconsin breast cancer diagnostic data set from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Then perform the necessary operations to
1. eliminate rows with missing values
2. substitute the missing values.
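
A possible starting point, as a sketch under stated assumptions rather than a worked solution: scikit-learn bundles a copy of this data set as `load_breast_cancer`, and that copy contains no missing values. The code below therefore blanks out roughly 5% of the entries at random, purely as an assumption for illustration, before applying the two operations. The names `X_missing`, `X_dropped` and `X_imputed` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer

# Bundled copy of the diagnostic data set: 569 samples, 30 numeric features
X = load_breast_cancer(as_frame=True).data
print(X.isna().sum().sum())   # the bundled copy has no missing values

# Assumption for illustration: blank out about 5% of the entries at random
rng = np.random.default_rng(0)
X_missing = X.mask(rng.random(X.shape) < 0.05)

# 1. eliminate rows with missing values
X_dropped = X_missing.dropna()
print(len(X_missing), "->", len(X_dropped), "rows")

# 2. substitute the missing values with the column mean
X_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(X_missing),
    columns=X.columns,
)
print(X_imputed.isna().sum().sum())
```

Comparing the number of rows before and after `dropna` shows how quickly row-wise deletion shrinks a data set with many features, even when only a small share of the entries is missing.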
