# TW2


## Preprocessing Data: Data Cleaning

### Handling missing data and invalid data

Handling missing data is important as many machine learning algorithms do not support data with missing values. Our main objectives: 

- How to mark invalid or corrupt values as missing in a dataset.

- How to remove rows with missing data from a dataset.

- How to impute missing values with mean values in a dataset.

#### Two examples below will show the data cleaning process. 

- Learn from the examples by going through each cell.

- Apply the tools you learned to preprocess a new dataset.


See more details:

- Working with missing data, in Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

- Imputation of missing values, in Scikit-learn: https://scikit-learn.org/stable/modules/impute.html#impute


*** Notes: It is important to read the Pandas and Scikit-learn documentation for their functions and examples before you use them.



### A simple example: filling in missing data using Pandas

In [None]:
# Library import
import numpy as np
import pandas as pd
from sklearn import preprocessing

In [None]:
# Load the data
# data file is located in folder data
df = pd.read_csv('./data/log.csv')
print(df.head())
print(df.tail())

In [None]:
# use the 'time' column as the index and sort the rows by it
df = df.set_index('time')
df = df.sort_index()
print(df.head())

In [None]:
# reset index
df = df.reset_index()
df = df.set_index(['time', 'user'])
print(df.head())

In [None]:
# replace NaN by forward filling (the previous value is used to fill in)
# fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent
df = df.ffill()
print(df.head())
print(df.tail())
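
Because the contents of log.csv are not shown here, the effect of forward filling can be checked on a small in-memory frame. This is a minimal sketch: the column names and values are made up for illustration and are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical log data; column names and values are assumptions.
toy = pd.DataFrame(
    {
        "time": [1, 2, 3, 4],
        "user": ["a", "a", "b", "b"],
        "value": [10.0, np.nan, np.nan, 40.0],
    }
).set_index("time")

# Forward fill: each NaN takes the most recent non-missing value above it.
filled = toy.ffill()
print(filled)

# A leading NaN has no earlier value to copy, so ffill leaves it as NaN.
leading = pd.Series([np.nan, 1.0, np.nan]).ffill()
print(leading)
```

Note that forward filling copies values down the rows regardless of which user they belong to, so it may be worth checking whether that is appropriate for your data.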

## Data Cleaning Exercise

1. Pima Indians Diabetes Dataset: where we look at a dataset that has known missing values.
2. Mark Missing Values: where we learn how to mark missing values in a dataset.
3. Missing Values Cause Problems: where we see how a machine learning algorithm can fail when the data contains missing values.
4. Remove Rows With Missing Values: where we see how to remove rows that contain missing values.
5. Impute Missing Values: where we replace missing values with sensible values.
6. Algorithms that Support Missing Values: where we learn about algorithms that support missing values.

### 1. Dataset
#### Working with Pima indians diabetes dataset



#### In TW2, you can find the following files in the data folder:

- pima-indians-diabetes.csv

- pima-indians-diabetes.names

Both can be opened in Jupyter Notebook.

#### Open the data file and look at the data. Also read readme.txt for the data description.

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.

A sample of the first 5 rows is listed below.

![image.png](attachment:image.png)

This dataset is known to have missing values. Specifically, there are missing observations for some columns that are marked as a zero value.

We can corroborate this by the definition of those columns and the domain knowledge that a zero value is invalid for those measures, e.g. a zero for body mass index or blood pressure is invalid.

### 2. Mark Missing Values

We will look at how we can identify and mark values as missing.

We can use plots and summary statistics to help identify missing or corrupt data.

We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.

In [None]:
df = pd.read_csv('./data/pima-indians-diabetes.csv', header=None)

print(df.describe())
            

We can see that there are columns that have a minimum value of zero (0). On some columns, a value of zero does not make sense and indicates an invalid or missing value.

Specifically, the following columns have an invalid zero minimum value:

1: Plasma glucose concentration

2: Diastolic blood pressure

3: Triceps skinfold thickness

4: 2-Hour serum insulin

5: Body mass index

In [None]:
print(df.head(20))

We can get a count of the number of missing values in each of these columns. We can do this by marking all of the zero values in the subset of the DataFrame we are interested in as True. We can then count the number of True values in each column.

In [None]:
print((df[[1,2,3,4,5]] == 0).sum())

We can see that columns 1,2 and 5 have just a few zero values, whereas columns 3 and 4 show a lot more, nearly half of the rows. This highlights that different “missing value” strategies may be needed for different columns, e.g. to ensure that there are still a sufficient number of records left to train a predictive model.
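
To compare columns directly, it can help to look at the share of zeros rather than the raw count. The sketch below uses a tiny made-up frame (not the real dataset) and relies on the fact that the mean of a boolean mask is the fraction of True entries.

```python
import pandas as pd

# Toy stand-in for the dataset; the values are made up for illustration.
toy = pd.DataFrame(
    {
        1: [148, 85, 0, 89],  # plasma glucose
        3: [35, 0, 0, 23],    # triceps skinfold thickness
        4: [0, 0, 0, 94],     # 2-hour serum insulin
    }
)

# Mean of a boolean mask = fraction of zero entries per column.
zero_fraction = (toy == 0).mean()
print(zero_fraction)
```

A column with a small fraction of zeros may be a candidate for row removal, while a column with a large fraction may be better handled by imputation.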

In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN. Pandas operations such as sum and count skip NaN values by default.
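
As a small illustration of how NaN interacts with common reductions, the sketch below uses a three-element series. It also shows that plain NumPy reductions propagate NaN unless a NaN-aware variant such as `np.nansum` is used.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.sum())    # 4.0: the NaN entry is skipped
print(s.count())  # 2: only non-missing entries are counted
print(s.mean())   # 2.0: mean of the two observed values

# Plain NumPy reductions return NaN if any entry is NaN.
arr = s.to_numpy()
print(np.sum(arr))     # nan
print(np.nansum(arr))  # 4.0
```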

We can mark values as NaN easily with the Pandas DataFrame by using the replace() function on a subset of the columns we are interested in.

After we have marked the missing values, we can use the isnull() function to mark all of the NaN values in the dataset as True and get a count of the missing values for each column.

In [None]:
# mark zero values as missing or NaN
df[[1,2,3,4,5]] = df[[1,2,3,4,5]].replace(0, np.nan)

# count the number of NaN values in each column
print(df.isnull().sum())

Running the example prints the number of missing values in each column. We can see that columns 1 to 5 have the same number of missing values as zero values identified above. This is a sign that we have marked the identified missing values correctly.

In [None]:
print(df.head(20))

Running the example, we can clearly see NaN values in the columns 2, 3, 4 and 5. There are only 5 missing values in column 1, so it is not surprising we did not see an example in the first 20 rows.

It is clear from the raw data that marking the missing values had the intended effect.

### 3. Missing Values Cause Problems

Before we look at handling missing values, let’s first demonstrate that having missing values in a dataset can cause problems.

Having missing values in a dataset can cause errors with some machine learning algorithms. We will try to evaluate the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values. This is an algorithm that does not work when there are missing values in the dataset.

The example below marks the missing values in the dataset, as we did in the previous section (changing 0 to NaN), then attempts to evaluate LDA using 3-fold cross validation and print the mean accuracy.

*** Notes: The LDA algorithm and 3-fold cross validation will be discussed in class later. 

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

df = pd.read_csv('./data/pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
df[[1,2,3,4,5]] = df[[1,2,3,4,5]].replace(0, np.nan)

# split dataset into inputs and outputs
values = df.values
X = values[:,0:8]
y = values[:,8]

# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
# random_state only takes effect when shuffle=True
kfold = KFold(n_splits=3, shuffle=True, random_state=7)

result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())

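Running the example fails because LDA cannot be fit on input that contains NaN values. Depending on the scikit-learn version, you will see either a ValueError or fit-failure warnings with a NaN score.

This is as we expect.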

We are prevented from evaluating an LDA algorithm (and other algorithms) on the dataset with missing values.

Now, we can look at methods to handle the missing values.


### 4. Remove Rows With Missing Values

The simplest strategy for handling missing data is to remove records that contain a missing value.

We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.

Pandas provides the dropna() function that can be used to drop either columns or rows with missing data. We can use dropna() to remove all rows with missing data, as follows:

In [None]:
df = pd.read_csv('./data/pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
df[[1,2,3,4,5]] = df[[1,2,3,4,5]].replace(0, np.nan)

# drop rows with missing values
df.dropna(inplace=True)
# summarize the number of rows and columns in the dataset
print(df.shape)

Running this example, we can see that the number of rows has been aggressively cut from 768 in the original dataset to 392 with all rows containing a NaN removed.
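
Dropping every row that has any NaN is only one of the options that `dropna()` offers. The sketch below, on a small made-up frame, shows the `subset`, `axis` and `thresh` parameters, which can retain more data when only some columns matter.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [np.nan, np.nan, 6.0, 7.0],
        "c": [9.0, 10.0, 11.0, 12.0],
    }
)

rows_any = toy.dropna()                 # drop rows with any NaN
rows_subset = toy.dropna(subset=["a"])  # only look at column "a"
cols_any = toy.dropna(axis=1)           # drop columns with any NaN
rows_thresh = toy.dropna(thresh=2)      # keep rows with at least 2 non-NaN values

print(rows_any.shape, rows_subset.shape, cols_any.shape, rows_thresh.shape)
```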

We now have a dataset that we could use to evaluate an algorithm sensitive to missing values like LDA.

In [None]:
df = pd.read_csv('./data/pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
df[[1,2,3,4,5]] = df[[1,2,3,4,5]].replace(0, np.nan)
# drop rows with missing values
df.dropna(inplace=True)
# split dataset into inputs and outputs
values = df.values
X = values[:,0:8]
y = values[:,8]
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
# random_state only takes effect when shuffle=True
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())

The example runs successfully and prints the accuracy of the model.

Removing rows with missing values can be too limiting on some predictive modeling problems. An alternative is to impute missing values.

### 5. Impute Missing Values

Imputing refers to using a model to replace missing values.

There are many options we could consider when replacing a missing value, for example:

- A constant value that has meaning within the domain, such as 0, distinct from all other values.

- A value from another randomly selected record.

- A mean, median or mode value for the column.

- A value estimated by another predictive model.

Any imputing performed on the training dataset will have to be performed on new data in the future when predictions are needed from the finalized model. This needs to be taken into consideration when choosing how to impute the missing values.

- For example, if you choose to impute with mean column values, these mean column values will need to be stored to file for later use on new data that has missing values.
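
One way to keep the training means available for new data is to fit an imputer once and reuse it. The sketch below uses scikit-learn's `SimpleImputer` on made-up arrays; the fitted column means are exposed in its `statistics_` attribute, and the fitted object could be saved to file (for example with pickle) for later use.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training data with missing entries.
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
# Made-up new data arriving at prediction time.
X_new = np.array([[np.nan, np.nan], [5.0, 30.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)        # learns the column means from the training data only
print(imputer.statistics_)  # stored means: [ 2. 15.]

# New data is filled with the training means, not with its own means.
X_new_filled = imputer.transform(X_new)
print(X_new_filled)
```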

Pandas provides the fillna() function for replacing missing values with a specific value.

- For example, we can use fillna() to replace missing values with the mean value for each column, as follows:

### Using Pandas

In [None]:
df = pd.read_csv('./data/pima-indians-diabetes.csv', header=None)

# mark zero values as missing or NaN
df[[1,2,3,4,5]] = df[[1,2,3,4,5]].replace(0, np.nan)
print(df.head(5))

# fill missing values with mean column values
df.fillna(df.mean(), inplace=True)
print(df.head(5))

# count the number of NaN values in each column
print(df.isnull().sum())

### Using Scikit-learn

#### SimpleImputer

The scikit-learn library provides the SimpleImputer() class that can be used to replace missing values.

It is a flexible class that allows you to specify the value to replace (it can be something other than NaN) and the technique used to replace it (such as mean, median, or mode). SimpleImputer accepts a DataFrame or a NumPy array and, by default, returns a NumPy array.

The example below uses the SimpleImputer class to replace missing values with the mean of each column, then prints the number of NaN values in the transformed matrix.

In [None]:
from sklearn.impute import SimpleImputer

df = pd.read_csv('./data/pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
df[[1,2,3,4,5]] = df[[1,2,3,4,5]].replace(0, np.nan)

# fill missing values with mean column values
values = df.values
imputer = SimpleImputer()
transformed_values = imputer.fit_transform(values)

# count the total number of NaN values remaining in the array
print(np.isnan(transformed_values).sum())

Running the example shows that all NaN values were imputed successfully.
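
The mean is only the default strategy. The sketch below, on small made-up arrays, shows two other `SimpleImputer` settings: the `strategy` parameter for median imputation and the `missing_values` parameter for treating 0 (rather than NaN) as the missing marker.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up array that uses NaN as the missing marker.
X_nan = np.array([[1.0, 2.0], [np.nan, 4.0], [9.0, 6.0]])
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X_nan)
print(X_median)  # the NaN becomes 5.0, the median of 1.0 and 9.0

# Made-up array that uses 0 as the missing marker.
X_zero = np.array([[0.0, 2.0], [3.0, 0.0], [5.0, 7.0]])
zero_imputer = SimpleImputer(missing_values=0, strategy="mean")
X_mean = zero_imputer.fit_transform(X_zero)
print(X_mean)  # zeros become the mean of the non-zero values in each column
```

Setting `missing_values=0` applies to every column passed in, so it is only appropriate for columns where zero is not a valid measurement.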

In either case, we can train algorithms sensitive to NaN values, such as LDA, on the transformed dataset.

The example below shows the LDA algorithm trained on the SimpleImputer-transformed dataset.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

df = pd.read_csv('./data/pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
df[[1,2,3,4,5]] = df[[1,2,3,4,5]].replace(0, np.nan)

# split dataset into inputs and outputs
values = df.values
X = values[:,0:8]
y = values[:,8]

# fill missing values with mean column values
imputer = SimpleImputer()
transformed_X = imputer.fit_transform(X)

# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
# random_state only takes effect when shuffle=True
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
result = cross_val_score(model, transformed_X, y, cv=kfold, scoring='accuracy')
print(result.mean())

Running the example prints the accuracy of LDA on the transformed dataset.

Try replacing the missing values with other values and see if you can lift the performance of the model.

Maybe missing values have meaning in the data.
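
One possible way to try this is sketched below. It runs on synthetic data rather than the Pima dataset, so the scores say nothing about the real problem. The imputer is placed inside a pipeline so that it is refit on each training fold, and `add_indicator=True` appends 0/1 columns that flag which values were missing, which lets the model use missingness itself as a signal.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data; roughly 10% of the entries are blanked out at random.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=7
)
rng = np.random.default_rng(7)
X[rng.random(X.shape) < 0.1] = np.nan

kfold = KFold(n_splits=3, shuffle=True, random_state=7)
scores = {}
for strategy in ["mean", "median", "most_frequent"]:
    # The imputer is fit inside each fold, so test rows never influence it.
    pipeline = make_pipeline(
        SimpleImputer(strategy=strategy, add_indicator=True),
        LinearDiscriminantAnalysis(),
    )
    result = cross_val_score(pipeline, X, y, cv=kfold, scoring="accuracy")
    scores[strategy] = result.mean()

for strategy, score in scores.items():
    print(f"{strategy}: {score:.3f}")
```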

Next we will look at using algorithms that treat missing values as just another value when modeling.

### 6. Algorithms that Support Missing Values

Not all algorithms fail when there is missing data.

There are algorithms that can be made robust to missing data, such as k-Nearest Neighbors, which can ignore a column in a distance measure when a value is missing. There are also algorithms that can use the missing value as a unique and different value when building the predictive model, such as classification and regression trees. The scikit-learn implementation of k-Nearest Neighbors is not robust to missing values, and its decision trees were not either in older releases. Native support for missing values in decision trees was added in scikit-learn 1.3.

Nevertheless, this remains as an option if you consider using another algorithm implementation (such as xgboost) or developing your own implementation.

### More details

- See examples of handling missing data at Pandas: 
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

- See also examples of imputing missing data at Scikit-learn:
https://scikit-learn.org/stable/modules/impute.html#impute

Reference:
- The data cleaning example with the Pima Indians Diabetes dataset was created by Jason Brownlee and modified by Wan Bae.

## TW2

### Part 1

Dataset: ./data/daily-temperatures.csv

Daily minimum and maximum temperatures (in Celsius) in Melbourne, Australia, 1981-1990

Source: Time Series Data Library (citing: Australian Bureau of Meteorology)


This dataset has known missing values and also incorrect values:

- Missing values: no value was reported, so some cells are empty.

- Invalid temperature values: some temperature values are >= 200 or <= -800, which are invalid.
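
As with the zero values in the example above, out-of-range values can be marked as NaN before they are handled. The sketch below is only a starting point: the column names, the readings, and the plausible-range bounds are all assumptions made for illustration, not taken from the real file.

```python
import numpy as np
import pandas as pd

# Made-up readings; the column names are hypothetical.
temps = pd.DataFrame(
    {
        "min_temp": [12.1, -999.0, 9.8, np.nan],
        "max_temp": [21.4, 19.0, 250.0, 18.2],
    }
)

# Keep values inside an assumed plausible range; everything else becomes NaN.
# The bounds (-50, 60) are illustrative and should be chosen from domain knowledge.
valid = (temps > -50) & (temps < 60)
marked = temps.where(valid)

print(marked.isnull().sum())
```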

(1) First, you may want to plot the max temperature, the min temperature, or both to check the ranges of the data. You may find something incorrect in the plot(s).

(2) Discuss how you would handle these values: missing values and invalid values

(3) Use tools (in Pandas and Scikit-learn) we talked about in the above examples to process data.

(4) Visualize the data. 

### Part 2
Write a summary of what your team has learned from this process. 