# 5: Data types and missing values

In this week's tutorial, we will go over some common data types that you will see in pandas as well as learn how to deal with missing values.

We will be using the Kaggle house prices dataset, which you can download [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

We aim to investigate how the different features of a house affect its final sale price. Each row of the dataset represents a single house and its many characteristics. The target (response) variable is the sale price.

## Import pandas, numpy and seaborn

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
#from pandas.core.computation.check import NUMEXPR_INSTALLED

## Load data

In [None]:
data = pd.read_csv("train_house-prices-advanced-regression-techniques.csv")
data

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.describe()


In [None]:
data["LotFrontage"].describe()

## Data types

We can use the pandas [dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) attribute to grab the data type of every column in a data frame.

In [None]:
data.dtypes

Alternatively, if we only want to consider a particular column, we can do this.

In [None]:
# Check the data type of the SalePrice column

data['SalePrice'].dtype

What are the most common data types that you will see in pandas?

- int64 (integer)
- float64 (floating point number)
- object (usually strings, or a mix of types)
- datetime64[ns] (dates and times)
- bool (true or false)
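
To see these types side by side, here is a minimal sketch on a made-up frame (the column names and values are hypothetical, not taken from the house prices data). Note that the exact name reported for text columns depends on the pandas version: pandas 1.x and 2.x report `object`, while newer releases may report a dedicated string type.

```python
import pandas as pd

# Hypothetical toy frame with one column per common dtype
toy = pd.DataFrame({
    "rooms": [3, 4, 2],                    # whole numbers -> int64
    "area": [120.5, 98.0, 75.25],          # decimals -> float64
    "street": ["Pave", "Grvl", "Pave"],    # text -> object (or a string dtype)
    "sold_on": pd.to_datetime(["2008-02-01", "2007-05-01", "2008-09-01"]),
    "has_pool": [False, True, False],      # True/False -> bool
})

# Map each column name to the name of its dtype
dtype_names = toy.dtypes.astype(str).to_dict()
print(dtype_names)
```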


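We can convert a column from one type to another using the [astype](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html) method. Note that astype returns a new Series and leaves the original column unchanged unless we assign the result back, which is why the last cell below still reports int64.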

In [None]:
# Convert the SalePrice column into float64 data type
data['SalePrice'].astype('float64')

In [None]:
data['SalePrice'].astype('bool')

In [None]:
data["SalePrice"].dtype
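
The cells above only display the converted values. As a minimal sketch of making the conversion stick, assuming a small made-up frame named `prices` (hypothetical, used here so the example runs on its own), we assign the result of astype back to the column:

```python
import pandas as pd

# Hypothetical stand-in for the SalePrice column
prices = pd.DataFrame({"SalePrice": [208500, 181500, 223500]})

# astype returns a new Series; the original column is untouched
converted = prices["SalePrice"].astype("float64")
dtype_before = prices["SalePrice"].dtype
print(dtype_before)       # the column keeps its integer dtype
print(converted.dtype)    # the new Series is float64

# Assign the result back to change the column in the dataframe
prices["SalePrice"] = prices["SalePrice"].astype("float64")
dtype_after = prices["SalePrice"].dtype
print(dtype_after)
```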

## Locating missing values

First, let's recall how we can find out how many null values there are in our dataframe.

In [None]:
# How many null values are there in our dataframe?

data.isnull().sum()

This output is sometimes difficult to read when we have many columns. One of my favourite ways to visualise null values is the missingno.matrix function.

In [None]:
# Import missingno library
#!pip install missingno
import missingno

# Visualise null values
missingno.matrix(data, labels=True, fontsize=10)

In [None]:
sns.heatmap(data.isnull())

In [None]:
data.iloc[:, 6].isnull().sum()

In [None]:
x = data.iloc[:,-9:-6]

In [None]:
x.isnull().sum()

In [None]:
data["MasVnrType"].isnull().sum()

In [None]:
data.iloc[:, 25]

In [None]:
data.iloc[:,6]

In [None]:
data["Alley"].isnull().sum()

In [None]:
data.iloc[:,-9:-6].isnull().sum()

In [None]:
data1= data.iloc[:,-9:-6]

In [None]:
data1.isnull().sum()

In [None]:
data.Fence.describe()

It is also helpful to compute the percentage of the values in our dataset that are missing.

We can do this by dividing the total number of missing cells by the total number of cells in the dataframe.

In [None]:
# Compute total number of cells in dataframe and cells with missing values
total_cells = np.prod(data.shape)  # 1460 rows * 81 columns = 118260 cells (np.product was removed in NumPy 2.0)
total_missing = data.isnull().sum().sum()
print(total_cells)
print(total_missing)

# Compute percentage
percentage_missing = (total_missing / total_cells)* 100
print(percentage_missing)
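
To check the arithmetic by hand, here is a minimal sketch on a made-up frame small enough to count yourself. The frame `toy` and its values are hypothetical. It also shows the per-column percentage, which is often more useful than the overall figure.

```python
import numpy as np
import pandas as pd

# Hypothetical 4 x 3 frame with 3 missing cells
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, "x", "y"],
    "c": [1, 2, 3, 4],
})

total_cells = toy.size                           # rows * columns
total_missing = int(toy.isnull().sum().sum())    # missing cells over the whole frame
overall_pct = total_missing / total_cells * 100

# isnull().mean() gives the fraction of missing values in each column
per_column_pct = toy.isnull().mean() * 100

print(total_cells, total_missing, overall_pct)
print(per_column_pct)
```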

## Dealing with missing values

There are two main ways to deal with missing data.

1. Drop the rows or columns which contain missing data
2. Replace missing data with substituted values, also known as imputation

Both methods have their own pros and cons. Which of the two you use will depend heavily on your data as well as the nature of the problem you are trying to solve. If you are working on a detailed piece of analysis, this is where you would take the time to really understand each column to figure out the best strategy for handling its missing values.

Generally speaking, dropping data is much easier and more straightforward to implement, but it comes at the expense of removing potentially useful information from our dataset. This can hurt model performance and lead to less accurate predictions.

On the other hand, choosing the best way to impute or replace those missing values requires more time, consideration and experience. I will briefly touch upon the different ways to impute missing values in the later part of this notebook.

## Method 1: Drop rows or columns with missing values

If you are in a hurry or don't have a reason to figure out why your values are missing, one option is to remove rows or columns that contain missing values. However, this is not the best approach in most cases because we might lose potentially useful information in our dataset.

Let's see how we can drop rows and columns with missing values using the [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) function. 

In [None]:
# Drop rows with missing values
data.dropna()

Yikes, it appears that we have dropped all the rows in our dataframe. This is not good. 

Ideally, we would only remove rows if we have a large number of training examples and only a small share of those rows contain missing data. In our example, every row has at least one missing feature, so dropping rows with missing data is not a good strategy.
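
Dropping rows does not have to be all or nothing. As a minimal sketch on a made-up frame (`toy` and its values are hypothetical), the `how` and `thresh` parameters of dropna let us keep rows that are only partly missing:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 has one gap, row 2 is empty
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

rows_any = toy.dropna()             # default: drop rows with any missing value
rows_all = toy.dropna(how="all")    # drop only rows where every value is missing
rows_thresh = toy.dropna(thresh=2)  # keep rows with at least 2 non-missing values

print(len(rows_any), len(rows_all), len(rows_thresh))
```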

Maybe we should remove columns with missing values instead.

In [None]:
# Drop columns with missing values

col_with_na_dropped = data.dropna(axis = 1)
col_with_na_dropped.head()

In [None]:
col_with_na_dropped.info()

In [None]:
# How much data did we lose?

print("Number of columns in original dataset: ", data.shape[1])  # data.shape is (number of rows, number of columns)
print("Number of columns left after dropping: ", col_with_na_dropped.shape[1])
difference = data.shape[1] - col_with_na_dropped.shape[1]
print("We have dropped a total of %d columns." % difference)

We are dropping a substantial number of features from our dataset, almost a quarter!

Features in our example are the characteristics that describe the house. If we remove features that are significant in explaining the sale price of the house, our model will not be able to make accurate predictions. 

In an ideal scenario, it is only safe to drop a column if a significant share of its values is missing at random and we have reason to believe that the column is unimportant in predicting our target variable.
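
One way to act on the first half of that rule is to drop only the columns whose share of missing values exceeds a cut-off. The sketch below uses a made-up frame, and the 50% cut-off (`max_missing_fraction`) is an arbitrary assumption for illustration, not a recommended value. Whether a column matters for the target still needs to be judged separately.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with columns that are 75%, 25% and 0% missing
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})

max_missing_fraction = 0.5  # assumed cut-off, tune for your own data

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Drop only the columns above the cut-off
cols_to_drop = missing_fraction[missing_fraction > max_missing_fraction].index
reduced = toy.drop(columns=cols_to_drop)

print(list(cols_to_drop))
print(list(reduced.columns))
```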

Let's have a closer look at the features that we are dropping. 

In [None]:
data.isnull().any()

In [None]:
print(data.columns[data.isnull().any()])

In [None]:
col_with_na = data.columns[data.isnull().any()]
list(col_with_na)

In [None]:
col_with_na = data.loc[:,['LotFrontage','Alley','MasVnrType','MasVnrArea','BsmtQual','BsmtCond','BsmtExposure',
'BsmtFinType1','BsmtFinType2','Electrical','FireplaceQu','GarageType','GarageYrBlt','GarageFinish','GarageQual',
'GarageCond','PoolQC','Fence','MiscFeature']]

In [None]:
col_with_na.isnull().sum()

In [None]:
col_with_na.isnull().mean() *100

In [None]:
data.drop(columns=['Alley','MasVnrType','FireplaceQu','PoolQC','Fence','MiscFeature'])

In [None]:
data.dropna(subset=['MasVnrArea','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
 'Electrical' ,'GarageYrBlt','GarageFinish','GarageCond'])


In [None]:
data_dropna = data.loc[:, ['MasVnrArea','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
 'Electrical' ,'GarageYrBlt','GarageFinish','GarageCond']]
data_dropna.dropna()

In [None]:
'''
data.drop(columns=['Alley','MasVnrType','MasVnrArea','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
 'Electrical','FireplaceQu','GarageYrBlt','GarageFinish','GarageCond','PoolQC','Fence','MiscFeature'])
'''

To reiterate, only drop rows or columns if a significant amount of their data is missing or if the data is not important in predicting the target variable.

Now let's look at a better approach for dealing with missing data via imputation.

## Method 2: Imputation (filling in missing values)

There are a couple of ways to impute missing data, and the right one depends on the situation.

In this section, I will go through two of the most common techniques for filling in missing data:

1. Using mean or median values (for numerical variables)
2. Using the mode or a constant value (for categorical variables)

Numerical variables take numeric values such as height, age or total sales, whereas categorical variables take one of a limited set of labels such as yes or no, pass or fail, or small, medium or large.
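
The split between numerical and categorical columns can be sketched in code with select_dtypes. This is a minimal illustration on a made-up frame: the column names are borrowed from the house prices data, but the values are hypothetical, and using the median and the mode is just one possible choice for each group.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numerical and one categorical column
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GarageType": ["Attchd", "Detchd", None, "Attchd"],
})

# Separate the columns by type
numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()
for col in numeric_cols:
    # Numerical: fill with the median of the observed values
    filled[col] = filled[col].fillna(filled[col].median())
for col in categorical_cols:
    # Categorical: fill with the most frequent label
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(filled)
```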

### Simple imputation
The main tool to use here is the [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) method.

In [None]:
# Suppose we want to fill missing data in the LotFrontage column 
# First let's examine the data type

data['LotFrontage'].dtype

In [None]:
data['LotFrontage'].head(20)

In [None]:
data['LotFrontage'].describe()

Row number 8 has a missing value.

Suppose we want to fill all missing data in that column with the median.

In [None]:
# Compute median
data['LotFrontage'].median()

In [None]:
# Impute missing data in LotFrontage with median

data['LotFrontage'] = data['LotFrontage'].fillna(data['LotFrontage'].median())
data['LotFrontage'].head(20)

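### Advanced imputation using the K-NN algorithm

The K-nearest neighbours (K-NN) imputer is a machine learning approach that fills each missing value with the average of that feature over the k most similar rows. When it is given a single column, as in the cell below, there are no other features to measure similarity with, so scikit-learn falls back to the column mean.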

In [None]:
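from sklearn.impute import KNNImputer
# scikit-learn (sklearn) is a machine learning library
imputer = KNNImputer(n_neighbors=4)
# fit_transform returns a 2D array, so flatten it before assigning it to one column
data['LotFrontage'] = imputer.fit_transform(data[['LotFrontage']]).ravel()
data['LotFrontage'].head(20)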
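
K-NN imputation becomes more informative when the imputer sees more than one column, because it can then judge which rows are similar. Here is a minimal sketch on a made-up frame: the column names come from the house prices data, but the values are hypothetical, and pairing LotFrontage with LotArea is an assumption made for illustration. In practice the features are usually scaled first, since the distances depend on their units.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical frame: two small lots, two large lots, one unknown frontage
toy = pd.DataFrame({
    "LotFrontage": [60.0, 62.0, np.nan, 100.0, 102.0],
    "LotArea": [8000.0, 8200.0, 8100.0, 15000.0, 15200.0],
})

# Use the 2 nearest rows, measured on the columns that are present
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# The missing frontage is filled from the two rows with the closest LotArea
print(imputed)
```

The row with the missing frontage has a small lot area, so its two nearest neighbours are the two small lots and the filled value is their average rather than the overall column mean.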

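In data['LotFrontage'], row number 8 still holds 69, the median of the column: the median fill in the simple imputation step had already replaced every missing value, so the K-NN imputer had nothing left to fill.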

Now let's look at an example of a categorical variable like GarageType.

In [None]:
# Check data type of GarageType column

data['GarageType'].dtype

In [None]:
data['GarageType'].describe()

In [None]:
# Let's see the value counts in that column, including the null values

data['GarageType'].value_counts(dropna = False)

The most frequent observation is Attchd.

Suppose we want to fill the missing data with this observation.

In [None]:
data['GarageType'].mode()[0]

In [None]:
data['GarageType'].tail(10)

In [None]:
data['GarageType'] = data['GarageType'].fillna(data['GarageType'].mode()[0])
data['GarageType'].tail(10)

The missing values have now been replaced with the mode.

We can also fill the missing data with any number or text that we like. Let's consider the GarageQual feature.

Suppose we want to replace the null values with the word 'Unknown'.

In [None]:
data['GarageQual'].value_counts(dropna = False)

In [None]:
data['GarageQual'] = data['GarageQual'].fillna('Unknown')
data['GarageQual'].value_counts(dropna = False)

Notice how the NaN value has been replaced with the word Unknown.

There are other, more sophisticated methods of imputing missing data, such as using correlated features to help determine an appropriate substitute value. I won't be covering those concepts in detail in this tutorial, but if you are interested, you can check out this [article](https://medium.com/x8-the-ai-community/handling-missing-values-in-data-54e1dc77e24f).
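
As a small taste of that idea, here is a minimal sketch that fills a missing value with the median of its own group instead of the overall median. The frame is made up, and treating Neighborhood as a feature related to LotFrontage is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two neighbourhoods with very different frontages
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 64.0, 100.0, 104.0, np.nan],
})

# Median frontage of each row's own neighbourhood, aligned to the original rows
group_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")

# Fill each gap with the median of its neighbourhood
toy["LotFrontage_filled"] = toy["LotFrontage"].fillna(group_median)

print(toy)
```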

In [None]:
# Use a raw string so the backslashes in the Windows path are not read as escape sequences
data.to_csv(r"E:\Academic\Work\My  private work\Interimediate Python\pandas-tutorial-master\modified_file.csv")