# Missing Values
In machine learning, **missing values** are entries that are absent or unavailable for certain observations or variables in a data set. They can arise for many reasons, such as data collection errors, measurement failures, or information that was simply never recorded.

The presence of missing values can negatively impact the performance of machine learning models, as many algorithms cannot directly handle these values. Therefore, it is important to address missing values appropriately before training a model. Some common strategies for dealing with missing values include imputation, where missing values are replaced with estimates based on other values in the data set, or removing observations or variables with missing values if they are few compared to the size of the data set.

### Most common strategies for dealing with missing values
1. **Deletion of observations or variables:** This strategy involves removing rows or columns that contain missing values. This may be appropriate if the number of observations or variables with missing values is small compared to the total size of the dataset, and if the deletion does not introduce significant bias in the remaining data.
2. **Imputation:** Imputation involves estimating missing values based on the available information in the dataset. Some common imputation techniques include:
    - **Mean or median:** Replace missing values with the mean or median of the corresponding variable.
    - **Most frequent value:** Replace missing values with the most frequent value (mode) of the variable.
    - **Imputation with predictive models:** Use predictive models (such as regression, KNN, or decision trees) to predict missing values based on other variables in the dataset.
3. **Missing value indicators:** Instead of (or in addition to) imputing, the fact that a value is missing is passed to the model as information. A common approach adds a binary indicator column that flags whether the original value was missing. Another assigns a sentinel value (such as -1) to the missing entries. Some algorithms also accept NaN directly and learn how to treat it during training.
4. **Advanced techniques:** There are more advanced techniques for addressing missing values, such as multiple imputation (where multiple imputed values are generated for missing values to account for uncertainty), or methods specific to temporal or time series data.
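
As a rough illustration of strategies 2 and 3 combined, the sketch below flags missing entries in an indicator column and then fills them with the median. The toy DataFrame and the names `age_missing` and `age_imputed` are assumptions made for this example and are not part of the Titanic dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 58.0],
    "fare": [7.25, 71.28, np.nan, 8.05, 26.55],
})

# Flag missingness before imputing so that the information is not lost.
toy["age_missing"] = toy["age"].isna().astype(int)

# Impute with the median of the observed values (NaN is skipped by default).
age_median = toy["age"].median()
toy["age_imputed"] = toy["age"].fillna(age_median)

print(toy)
```

Keeping the indicator next to the imputed column lets a model distinguish a genuinely observed median-like value from one that was filled in.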

<div class="alert alert-block alert-info">
<b>☝🏼🤓:</b> The choice of missing value handling strategy depends on the specific context of the problem, the nature of the data, and the potential impact on the analysis or machine learning model. It is important to carefully evaluate the different options and their implications before making a decision.
</div>

---

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('../datasets/titanic/titanic3.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [19]:
# Get the names of the columns that contain missing values
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]

# Count the missing values in each of those columns
missing_dict = {col: int(data[col].isnull().sum()) for col in cols_with_missing}

print("Columns with missing values:")
for i, (key, value) in enumerate(missing_dict.items(), start=1):
    print(f"{i}- {key}: {value}")

Columns with missing values:
1- age: 263
2- fare: 1
3- cabin: 1014
4- embarked: 2
5- boat: 823
6- body: 1188
7- home.dest: 564
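
Raw counts are easier to judge when set against the number of rows. The following is a minimal sketch of a per-column summary with both the count and the share of missing values. It runs on a hypothetical toy frame (`toy`, with made-up columns `a`, `b`, `c`) rather than on `data`, but the same calls should apply to any DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, "x"],
    "c": [1, 2, 3, 4],
})

missing_count = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()  # fraction of rows that are missing

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})

# Keep only columns that have missing values, worst first.
summary = summary[summary["missing"] > 0].sort_values("share", ascending=False)
print(summary)
```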


## Delete Missing Values

In [7]:
# Drop rows in which every column is missing (returns a new DataFrame; `data` is unchanged)
data.dropna(axis=0, how='all')

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,,,


In [8]:
data2 = data.copy()

# Drop rows that have a missing value in any column
data2.dropna(axis=0, how='any')

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
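
The result above is empty because every row has at least one missing value, so `how='any'` over all columns discards the whole dataset. A less drastic option is to limit the check to selected columns with `subset`, or to require a minimum number of non-missing values with `thresh`. The sketch below shows both on a hypothetical toy frame. The data and the 50% column cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data loosely modelled on the Titanic columns.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.3, 151.5, np.nan, 151.5],
    "body": [np.nan, np.nan, np.nan, 135.0],
    "cabin": ["B5", "C22", None, None],
})

# how='any' over all columns: every row has a gap, so nothing survives.
strict = toy.dropna(axis=0, how="any")

# Only require the columns the analysis actually needs.
by_subset = toy.dropna(axis=0, subset=["age", "fare"])

# Keep rows that have at least 2 non-missing values.
by_thresh = toy.dropna(axis=0, thresh=2)

# Drop columns in which more than half of the values are missing.
dense_cols = toy.loc[:, toy.isna().mean() <= 0.5]

print(len(strict), len(by_subset), len(by_thresh), list(dense_cols.columns))
```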


## Imputing Missing Values
* [`DataFrame.fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html): the function itself is not deprecated, only its `method` argument (since pandas 2.1); use `ffill()` or `bfill()` instead
* [`DataFrame.ffill()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html#pandas.DataFrame.ffill)
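
A minimal sketch of how these functions might be used: `fillna()` with the mean for a numeric column, `fillna()` with the mode for a categorical column, and `ffill()` to carry the last observed value forward. The toy data is hypothetical, and forward filling is shown only to demonstrate the call, since it is usually meaningful only when the rows have a natural order (for example, a time series).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0, np.nan],
    "embarked": ["S", "S", None, "C", "S"],
    "fare": [211.3, np.nan, np.nan, 151.5, 7.2],
})

imputed = toy.copy()

# Numeric column: replace NaN with the mean of the observed values.
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())

# Categorical column: replace missing entries with the most frequent value.
most_frequent = imputed["embarked"].mode()[0]
imputed["embarked"] = imputed["embarked"].fillna(most_frequent)

# Forward fill: propagate the last observed value down the column.
imputed["fare"] = imputed["fare"].ffill()

print(imputed)
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a column selection, avoids the chained-assignment warnings that recent pandas versions raise.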