# 1. Missing values

In the real world, data are rarely clean and homogeneous. Values can be missing for several reasons: data was lost or corrupted in transit from the database, human error, or programming error. Whether missing data should be removed, replaced, or filled in depends on the type of missing data.

`Pandas` uses the floating-point value `NaN` (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. The built-in Python `None` value is also treated as NA in object arrays.
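
As a small illustration of the paragraph above, the sketch below builds two toy Series (the names `numeric` and `text` are made up for this example) and checks how `None` and `np.nan` are reported by `isna()`.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64
numeric = pd.Series([1.5, None, np.nan])

# In a Series of strings, None is still reported as a missing value
text = pd.Series(['a', None, 'c'])

print(numeric.dtype)
print(numeric.isna().tolist())
print(text.isna().tolist())
```

Both `None` and `np.nan` are reported as missing, so the detection functions in the next section do not need to distinguish between them.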

Pandas provides several functions for detecting, removing, replacing, and imputing null values in a DataFrame.

In [None]:
# Run this code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let's look at what missing data looks like in a Series.

In [None]:
# Run this code
our_series = pd.Series([25, 2.5, 150, np.nan, 1.5, 'Python', 147])
print(our_series)

# 2. Detecting missing data

In [None]:
# Load the Titanic dataset
data = pd.read_csv('titanic.csv')
data.head(10)

`isnull().values.any()`

- used if we only want to know if there are any missing values in the dataset

In [None]:
# Check whether there are any missing values
data.isnull().values.any()

`isnull()`
- detects missing values in an array-like object
- returns a boolean object of the same size indicating whether each value is missing
- it is an alias of `isna()`

In [None]:
# Apply isnull() on the dataset 'data'
data.isnull()

`notnull()`

- it is used to detect existing (non-missing) values
- it is an alias of `notna()`

In [None]:
# TASK 1 >>>> Check non-missing values in the dataset using .notnull()


`isnull().sum()`

- we can use function chaining to check the total number of missing values for each column in the DataFrame

In [None]:
# Count the total number of missing values in each column using .sum()
data.isnull().sum()

As we can see, there are 177 missing values in the column Age, 687 missing values in the column Cabin, and 2 missing values in the column Embarked.
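
The same chaining idea can be sketched on a small toy DataFrame. The frame `toy` and its column names are invented for this example and are not part of the Titanic dataset; the sketch shows the count per column, the overall count, and the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the chained calls
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'cabin': [None, None, 'C85', None],
    'fare': [7.25, 71.28, 8.05, 53.10],
})

per_column = toy.isnull().sum()      # missing values in each column
total = toy.isnull().sum().sum()     # missing values in the whole DataFrame
share = toy.isnull().mean() * 100    # percentage of missing values per column

print(per_column)
print(total)
print(share)
```

Taking the mean of the boolean mask works because `True` counts as 1 and `False` as 0, so the mean is the fraction of missing entries.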



# 3. Basic visualization of missing data

In [None]:
# Run this code
plt.style.use('default')
missing_values = data.isnull().sum() / len(data) * 100
plt.xticks(np.arange(len(missing_values)), missing_values.index,rotation='vertical')
plt.ylabel('Percentage of missing values')
ax = plt.bar(np.arange(len(missing_values)), missing_values, color = 'skyblue');

# 4. Removing missing data



In some cases it is appropriate to simply drop the rows with missing data; in other cases, replacing the missing data is the better option.

`dropna()` function

- removes rows or columns that contain missing values from the DataFrame
- by default drops any row that contains a missing value

Arguments:

`axis = 0` to drop rows

`axis = 1` to drop columns

`how = 'all'` to drop if all the values are missing

`how = 'any'` to drop if any missing value is present

`thresh = ` minimum number of non-missing values a row or column must have in order to be kept

`subset = ['column']` to remove rows in which values are missing in the selected column or columns
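
To make the arguments above more concrete, here is a minimal sketch on an invented five-row DataFrame (`toy` and its columns `a`, `b`, `c` are assumptions made for this example). It compares which row labels survive under `how`, `thresh`, and `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the rows have 3, 2, 1, 0 and 2 non-missing values
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan, 5.0],
    'b': [1.0, 2.0, np.nan, np.nan, np.nan],
    'c': [1.0, 2.0, 3.0, np.nan, 5.0],
})

rows_any = toy.dropna(how='any')        # keep rows with no missing value
rows_all = toy.dropna(how='all')        # drop only rows that are entirely missing
rows_thresh = toy.dropna(thresh=2)      # keep rows with at least 2 non-missing values
rows_subset = toy.dropna(subset=['a'])  # look for missing values only in column 'a'

print(list(rows_any.index))
print(list(rows_all.index))
print(list(rows_thresh.index))
print(list(rows_subset.index))
```

Note that `thresh` counts the values that are present, not the ones that are missing, which is easy to get backwards.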

**If we want to make changes in the original dataset** (for example remove a particular column), we have to specify `inplace = True` within the method. Otherwise a copy of the dataset is returned and the original dataset is left unchanged.
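
The difference between the two behaviours can be sketched on a tiny DataFrame (the name `toy` and column `x` are made up for this example):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

# Without inplace: a new DataFrame is returned, the original keeps all rows
returned = toy.dropna()
rows_before = len(toy)

# With inplace=True: the original is modified and the call returns None
result = toy.dropna(inplace=True)
rows_after = len(toy)

print(len(returned), rows_before, rows_after, result)
```

Because the in-place call returns `None`, writing `toy = toy.dropna(inplace=True)` would replace the DataFrame with `None`; use either the assignment or `inplace = True`, not both.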

In [None]:
# Print rows with missing data in the column 'Embarked'
missing_embarked = data[data.Embarked.isnull()]
print(missing_embarked)

In [None]:
# Drop missing values in the column 'Embarked' 
# Specify this column using subset
# Set inplace = True
data.dropna(subset = ['Embarked'], inplace = True)

In [None]:
# Check whether the rows with missing values have been removed
data.Embarked.isna().sum()

In [None]:
# Make a copy of the DataFrame
data_copy = data.copy()

In [None]:
# Drop those rows that contain any missing values
# Set inplace = True
data_copy.dropna(how = 'any', inplace = True)

In [None]:
# Check whether the rows have been removed correctly
data_copy.isna().sum()

In [None]:
# Run this code
# Use a descriptive name so that the built-in dict is not shadowed
product_dict = {'product': ['apple', np.nan, 'cucumber', 'bread', 'milk', 'butter', 'sugar'],
                'product_code': [154, 153, 225, np.nan, 56, 15, np.nan],
                'price': [0.89, 1.50, 0.65, 1.20, 0.85, np.nan, 1.20],
                'expiration_date': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
                }

df = pd.DataFrame(product_dict, columns = ['product', 'product_code', 'price', 'expiration_date'])
print(df)

In [None]:
# Drop the last column, which contains only missing values
# Set inplace = True
df.dropna(how = 'all', axis = 1, inplace = True)

In [None]:
# Display the DataFrame to check the change
df

In [None]:
# Run this code
df_copy = df.copy()
print(df_copy)

In [None]:
# TASK 2 >>>> Drop rows from df_copy that contain any missing values 
#             Set inplace = True


# 5. Filling in missing data

`fillna()` method

- this method fills in missing data (it can be used on a particular column as well)

Arguments:

- we can specify **value** (any number or summary statistics such as mean or median) 

- we can use an **interpolation method**: 

`ffill` : uses the previous valid value to fill the gap

`bfill` : uses the next valid value to fill the gap

`limit` : for ffill and bfill - maximum number of consecutive missing values to fill

`axis` : axis to fill on, default axis = 0 

`inplace = True`
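
The options above can be sketched on a short Series with a gap of three missing values (the Series `s` is invented for this example). The sketch uses the `ffill()` and `bfill()` methods directly; in recent pandas releases the `method` argument of `fillna()` is deprecated in favour of these two methods.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

mean_filled = s.fillna(s.mean())     # fill with a summary statistic (mean of 1 and 5)
forward = s.ffill()                  # propagate the previous valid value forward
forward_limited = s.ffill(limit=1)   # fill at most 1 consecutive missing value
backward_limited = s.bfill(limit=1)  # same, but using the next valid value

print(mean_filled.tolist())
print(forward.tolist())
print(forward_limited.tolist())
print(backward_limited.tolist())
```

With `limit=1`, only the missing value closest to a valid one is filled, and the rest of the gap stays missing.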



In [None]:
# Fill in missing value in 'price' column with value 0
# Assign the result back to the column (inplace = True on a single selected column may not update df)
df['price'] = df['price'].fillna(0)
print(df)

In [None]:
# Fill in missing value in column 'product' with '0'
# Use bracket notation: df.product refers to the DataFrame.product() method, not to the column
df['product'] = df['product'].fillna('0')
print(df)

In [None]:
# Run this code
dictionary = {'column_a': [15, 16, 82, 25],
              'column_b': [np.nan, np.nan, 54, 8],
              'column_c': [np.nan, 15, 15, 25],
              'column_d': [85, 90, np.nan, np.nan]
        }

dataframe_1 = pd.DataFrame (dictionary, columns = ['column_a','column_b','column_c','column_d'])
print(dataframe_1)

In [None]:
# TASK 3 >>>> Fill in missing value in column 'column_c' of dataframe_1 with value 10 
#             Assign the result back to the column
dataframe_1['column_c'] = dataframe_1['column_c'].fillna(10)
print(dataframe_1)

# 6. More Complex Methods

We will go through the theory of these more complex methods later as they relate to Machine Learning. 

In [None]:
# Run this code
# Use a descriptive name so that the built-in dict is not shadowed
column_dict = {'column_1': [15, 16, 82, 25],
               'column_2': [np.nan, np.nan, 54, 8],
               'column_3': [np.nan, 15, 15, 25],
               'column_4': [85, 90, np.nan, np.nan]
               }

our_df = pd.DataFrame(column_dict, columns = ['column_1', 'column_2', 'column_3', 'column_4'])
print(our_df)

In [None]:
# Fill in missing values using bfill(), which stands for 'backward fill'
# (the equivalent fillna(method = 'bfill') is deprecated in recent pandas versions)
# Set inplace = True
our_df.bfill(axis = 0, inplace = True)
print(our_df)

The second option is forward fill (`ffill`), which propagates the previous valid value forward.
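
Forward and backward fill leave different gaps untouched, which the following sketch illustrates. It reuses two of the columns from the example above in a separate DataFrame (the name `gaps` is an assumption made for this example) so that `our_df` is not changed.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({
    'column_2': [np.nan, np.nan, 54, 8],   # missing values at the start
    'column_4': [85, 90, np.nan, np.nan],  # missing values at the end
})

# Forward fill has no previous value for the leading gap in column_2
forward_filled = gaps.ffill()

# Backward fill has no next value for the trailing gap in column_4
backward_filled = gaps.bfill()

print(forward_filled)
print(backward_filled)
```

If both leading and trailing gaps occur, one possible approach is to chain the two calls, for example `gaps.ffill().bfill()`.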

In [None]:
# Fill in missing data of the column 'Age' in the DataFrame 'data' with the average age
# ('data_copy' no longer contains missing values, so we work with 'data')
# Assign the result back to the column
average_age = data.Age.mean()
data['Age'] = data['Age'].fillna(average_age)

In [None]:
# Check whether missing values have been removed from the column 'Age'
data.Age.isnull().sum()

In [None]:
# Convert the datatype of the column Age from the DataFrame 'data' to integer data type
# (this works only once the column contains no missing values)
data['Age'] = data['Age'].astype('int')

# 7. Duplicate data


In [None]:
# Run this code
actors = [('Michone', 30, 'USA'),
            ('Bob', 28, 'New York'),
            ('Rick', 30, 'New York'),
            ('Carol', 40, 'Paris'),
            ('Daryl', 35, 'London'),
            ('Daryl', 35, 'London'),
            ('Michone', 45, 'London'),
            ('Morgan', 38, 'Sweden')
            ]
df_2 = pd.DataFrame(actors, columns=['first_name', 'age', 'city'])

In [None]:
# Find duplicated values using .duplicated() method
df_2.duplicated().sum()

In [None]:
# Remove duplicate rows
# Set inplace = True
df_2.drop_duplicates(inplace=True)
print(df_2)
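
By default, `duplicated()` and `drop_duplicates()` compare whole rows and treat the first occurrence as the original. The sketch below, on an invented DataFrame (`people` and its values are assumptions made for this example), shows how the `subset` and `keep` parameters change that behaviour.

```python
import pandas as pd

# Hypothetical data: 'Ann' appears three times, two of them with the same age
people = pd.DataFrame({
    'first_name': ['Ann', 'Ben', 'Ann', 'Ann'],
    'age': [30, 28, 30, 45],
})

# Whole-row comparison: only the second ('Ann', 30) is flagged
full_row_dupes = people.duplicated()

# Compare on 'first_name' only and flag every occurrence of a repeated name
name_dupes_all = people.duplicated(subset=['first_name'], keep=False)

# Keep only the last row for each first name
deduplicated = people.drop_duplicates(subset=['first_name'], keep='last')

print(full_row_dupes.tolist())
print(name_dupes_all.tolist())
print(deduplicated)
```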

In [None]:
# BONUS TASK >>> What movie series does the author of this notebook like? :)


# Appendix

Data source: https://www.kaggle.com/hesh97/titanicdataset-traincsv

License: CC0: Public Domain