# <span style="color:#0b486b">SIT307 - Data Mining and Machine Learning</span>

---
Lecturer:   Richard Dazeley     | richard.dazeley@deakin.edu.au<br />
Assistant:  Adam Bignold | abignold@gmail.com

School of Information Technology, <br />
Deakin University, VIC 3216, Australia.


---


## <span style="color:#0b486b">Practical Session 3: Data Cleaning and Preparation</span>

**Prerequisite**
You should already have done, or be confident with the content of: 
1. Week 2 material

**The purpose of this session is:**

1. To learn basic data cleaning skills with pandas.

**Instructions** 

1. After you download this notebook, save it as another copy and rename it to `"[yourstudentID]_Week_3_Data_cleaning.ipynb"`
2. Fill in the indicated code cells with your own solution. You can discuss approaches with other students, but you must submit only your own original solution.

## <span style="color:#0b486b">Background</span>
During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst’s time. Sometimes the way that data is stored in files or databases is not in the right format for a particular task. Many researchers choose to do ad hoc processing of data from one form to another using a general-purpose programming language, like Python, Perl, R, or Java, or Unix text-processing tools like sed or awk. Fortunately, pandas, along with the built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form. In this practical, we will learn tools for missing data, duplicate data, string manipulation, and some other analytical data transformations.

### <span style="color:#0b486b">Handling Missing Data   </span>
#### <span style="color:#0b486b">Checking for Missing Data   </span>
Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default. The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value that can be easily detected by calling `isnull()` on a pandas Series.

Create a pandas `Series` object with `pd.Series` from the following list of words: `'If', 'my', 'answers', np.nan, 'you', 'then', 'you', 'should', 'cease', 'asking', 'scary', 'questions.'`. You will need to import pandas and numpy first. Then print the Series.

In [2]:
import pandas as pd
import numpy as np
sentence = pd.Series(['If', 'my', 'answers', np.nan, 'you', 'then', 'you', 'should', 'cease', 'asking', 'scary', 'questions.'])
print(sentence)

Now call `isnull` to check which items are null and which are not.

In [4]:
sentence.isnull()

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
dtype: bool

pandas adopts a convention from the R programming language and refers to missing data as NA, which stands for "not available". In statistics applications, NA data may either be data that does not exist or data that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to analyse the missing data itself to identify data collection problems or potential biases caused by the missing values.

The built-in Python `None` value is also treated as NA in object arrays. Set the eighth element (index 7) to `None`, then print the Series and check again which items are null.

In [7]:
sentence[7] = None
print(sentence)
print(sentence.isnull())

0             If
1             my
2        answers
3            NaN
4            you
5           then
6            you
7           None
8          cease
9         asking
10         scary
11    questions.
dtype: object
0     False
1     False
2     False
3      True
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11    False
dtype: bool
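
The claim that descriptive statistics skip missing values by default can be checked with a small sketch. The Series `values` below is a hypothetical example, not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric Series with two missing entries (np.nan and None)
values = pd.Series([1.0, np.nan, 3.0, None, 5.0])

# isnull() gives a boolean Series; summing it counts the missing entries
n_missing = values.isnull().sum()

# Descriptive statistics skip NA values by default
mean_default = values.mean()

# With skipna=False the missing values propagate into the result
mean_strict = values.mean(skipna=False)

print(n_missing, mean_default, mean_strict)
```

Counting missing values this way is often a useful first step before deciding whether to drop or fill them.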


#### <span style="color:#0b486b">Filtering Out Missing Data   </span>
With `DataFrame` objects, things are a bit more complex. You may want to drop rows or columns that are all `NA` or only those containing any `NA`s. `dropna` by default drops any row containing a missing value.

In the following cell, import `nan` as `NA` from numpy, then create a `DataFrame` from the following two-dimensional data: `[[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]`. Store it in a variable called `data`. Now use `dropna` to remove all rows containing `NaN` values and store the result in a variable called `cleaned`.

In [14]:
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()

Now run the following two cells to check that your `data` and `cleaned` variables are correct.

In [15]:
data

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


In [16]:
cleaned

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |


Alternatively, passing `how='all'` drops only the rows in which every value is NA. Run the cell and observe the result.

In [17]:
cleaned = data.dropna(how="all")
cleaned

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


For more information run `data.dropna?`.
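
As a minimal sketch of dropping columns rather than rows, the example below rebuilds the same data under the hypothetical name `frame` and adds an all-NA column purely for illustration. It also shows the `thresh` argument, which keeps only rows with at least a given number of non-NA values.

```python
import pandas as pd
from numpy import nan as NA

frame = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
frame[3] = NA  # illustrative extra column in which every value is missing

# axis=1 applies the rule to columns: drop only columns that are entirely NA
no_empty_cols = frame.dropna(axis=1, how="all")

# thresh=2 keeps only the rows that have at least two non-NA values
at_least_two = frame.dropna(thresh=2)

print(no_empty_cols)
print(at_least_two)
```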

#### <span style="color:#0b486b">Imputation: Filling Missing Data</span>
Rather than filtering out missing data, you may want to fill in the "holes" in any number of ways. For most purposes, the `fillna` method is the workhorse function to use. Calling `fillna` with a constant replaces missing values with that value. Given the following DataFrame: 

In [18]:
import numpy as np
import pandas as pd
from numpy import nan as NA
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

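|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -2.101571 | NaN | NaN |
| 1 | -0.061307 | NaN | NaN |
| 2 | 0.589542 | NaN | -0.346053 |
| 3 | 0.770829 | NaN | -0.396377 |
| 4 | -0.262307 | 1.883463 | 1.211058 |
| 5 | 1.107094 | 0.777886 | 2.282160 |
| 6 | -0.227241 | -0.503371 | 1.290862 |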


Fill all the `NaN` values with 0 using `fillna`.

In [19]:
df.fillna(value=0)

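|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -2.101571 | 0.000000 | 0.000000 |
| 1 | -0.061307 | 0.000000 | 0.000000 |
| 2 | 0.589542 | 0.000000 | -0.346053 |
| 3 | 0.770829 | 0.000000 | -0.396377 |
| 4 | -0.262307 | 1.883463 | 1.211058 |
| 5 | 1.107094 | 0.777886 | 2.282160 |
| 6 | -0.227241 | -0.503371 | 1.290862 |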


Try to work out how to fill each column with a different value by passing a dict that maps column labels to fill values.

In [20]:
df.fillna({1:1, 2:2})

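|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -2.101571 | 1.000000 | 2.000000 |
| 1 | -0.061307 | 1.000000 | 2.000000 |
| 2 | 0.589542 | 1.000000 | -0.346053 |
| 3 | 0.770829 | 1.000000 | -0.396377 |
| 4 | -0.262307 | 1.883463 | 1.211058 |
| 5 | 1.107094 | 0.777886 | 2.282160 |
| 6 | -0.227241 | -0.503371 | 1.290862 |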


Run the following to see more details on the `fillna` method:

In [23]:
df.fillna?
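
Constants are not the only option for imputation. The sketch below, on a small hypothetical DataFrame called `scores`, fills each column with its own mean and, separately, carries the last valid value forward. Whether either strategy is appropriate depends on the data, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd
from numpy import nan as NA

# Hypothetical data with one gap in each column
scores = pd.DataFrame({"a": [1.0, NA, 3.0], "b": [NA, 4.0, 8.0]})

# scores.mean() is a Series of column means; fillna aligns it by column label
filled_mean = scores.fillna(scores.mean())

# Forward fill: each NA takes the last valid value above it in the same column
filled_ffill = scores.ffill()

print(filled_mean)
print(filled_ffill)
```

Note that forward filling leaves a leading NA untouched, because there is no earlier value to copy.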

### <span style="color:#0b486b">Data Transformation </span>
#### <span style="color:#0b486b">Removing Duplicates </span>
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example: 

In [24]:
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame({'k1':['one', 'two'] * 3 + ['two'], 'k2':[1, 1, 2, 3, 3, 4, 4]})
data

|   | k1 | k2 |
|---|---|---|
| 0 | one | 1 |
| 1 | two | 1 |
| 2 | one | 2 |
| 3 | two | 3 |
| 4 | one | 3 |
| 5 | two | 4 |
| 6 | two | 4 |


The `DataFrame` method `duplicated` returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not. Run the following cell and observe the results.

In [25]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, `drop_duplicates` returns a `DataFrame` containing only the rows for which `duplicated` is `False`.

In [26]:
data.drop_duplicates()

|   | k1 | k2 |
|---|---|---|
| 0 | one | 1 |
| 1 | two | 1 |
| 2 | one | 2 |
| 3 | two | 3 |
| 4 | one | 3 |
| 5 | two | 4 |


Both of these methods by default consider all of the columns. Alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the `k1` column: 

In [27]:
data['v1']=range(7)
data

|   | k1 | k2 | v1 |
|---|---|---|---|
| 0 | one | 1 | 0 |
| 1 | two | 1 | 1 |
| 2 | one | 2 | 2 |
| 3 | two | 3 | 3 |
| 4 | one | 3 | 4 |
| 5 | two | 4 | 5 |
| 6 | two | 4 | 6 |


In [28]:
data.drop_duplicates(['k1'])

|   | k1 | k2 | v1 |
|---|---|---|---|
| 0 | one | 1 | 0 |
| 1 | two | 1 | 1 |
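
Both `duplicated` and `drop_duplicates` keep the first occurrence by default. As a brief sketch on the same data (rebuilt here under the hypothetical name `frame`), the `keep` argument changes which occurrence survives.

```python
import pandas as pd

frame = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                      'k2': [1, 1, 2, 3, 3, 4, 4],
                      'v1': range(7)})

# keep='last' retains the final row seen for each k1 value
last_per_key = frame.drop_duplicates(['k1'], keep='last')

# keep=False flags every member of a duplicated group, not just the repeats
all_dupes = frame.duplicated(['k1', 'k2'], keep=False)

print(last_per_key)
print(all_dupes)
```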


#### <span style="color:#0b486b">Transforming Data Using a Function or Mapping </span>
Consider the following hypothetical data collected about various kinds of film characters: 

In [29]:
import pandas as pd
from numpy import nan as NA
data = pd.DataFrame({'Characters':['Mr pink','Zoe Bell', 'Vincenzo Coccotti', 'Calvin Candie', 'Ordell Robbie', 'richie gecko', 'Shosanna', 'Oswaldo Mobray', 'Captain Koons', 'Mallory Knox'], 
                     'Years':[1992, 2007, 1993, 2012, 1997, 1996, 2009, 2015, 1994, 1994]})
data

|   | Characters | Years |
|---|---|---|
| 0 | Mr pink | 1992 |
| 1 | Zoe Bell | 2007 |
| 2 | Vincenzo Coccotti | 1993 |
| 3 | Calvin Candie | 2012 |
| 4 | Ordell Robbie | 1997 |
| 5 | richie gecko | 1996 |
| 6 | Shosanna | 2009 |
| 7 | Oswaldo Mobray | 2015 |
| 8 | Captain Koons | 1994 |
| 9 | Mallory Knox | 1994 |


Suppose you wanted to add a column indicating the film each character appeared in. Let’s write down a mapping from each character to their film:

In [30]:
films = {'mr pink': 'Reservoir Dogs', 'zoe bell': 'Death Proof', 'Vincenzo Coccotti':'True Romance', 'Calvin Candie':'Django Unchained', 'Ordell Robbie':'Jackie Brown', 'Richie Gecko':'From Dusk Till Dawn', 'Shosanna':'Inglourious Basterds', 'Oswaldo Mobray':'The Hateful Eight', 'Captain Koons':'Pulp Fiction', 'Mallory Knox':'Natural Born Killers'}
films

{'Calvin Candie': 'Django Unchained',
 'Captain Koons': 'Pulp Fiction',
 'Mallory Knox': 'Natural Born Killers',
 'Ordell Robbie': 'Jackie Brown',
 'Oswaldo Mobray': 'The Hateful Eight',
 'Richie Gecko': 'From Dusk Till Dawn',
 'Shosanna': 'Inglourious Basterds',
 'Vincenzo Coccotti': 'True Romance',
 'mr pink': 'Reservoir Dogs',
 'zoe bell': 'Death Proof'}

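The `map` method on a Series accepts a function or a dict-like object containing a mapping. Here we have a small problem: the capitalization of some names in the `Characters` column does not match the dict keys (for example `'Mr pink'` versus `'mr pink'`), so those rows are left without a film.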

In [31]:
data['Films'] = data['Characters'].map(films)
data

|   | Characters | Years | Films |
|---|---|---|---|
| 0 | Mr pink | 1992 | NaN |
| 1 | Zoe Bell | 2007 | NaN |
| 2 | Vincenzo Coccotti | 1993 | True Romance |
| 3 | Calvin Candie | 2012 | Django Unchained |
| 4 | Ordell Robbie | 1997 | Jackie Brown |
| 5 | richie gecko | 1996 | NaN |
| 6 | Shosanna | 2009 | Inglourious Basterds |
| 7 | Oswaldo Mobray | 2015 | The Hateful Eight |
| 8 | Captain Koons | 1994 | Pulp Fiction |
| 9 | Mallory Knox | 1994 | Natural Born Killers |


Thus, we need to convert each value from the character list to lowercase using the `str.lower()` Series method.

In [32]:
lowercased = data['Characters'].str.lower()
data['Films'] = lowercased.map(films)
data

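|   | Characters | Years | Films |
|---|---|---|---|
| 0 | Mr pink | 1992 | Reservoir Dogs |
| 1 | Zoe Bell | 2007 | Death Proof |
| 2 | Vincenzo Coccotti | 1993 | NaN |
| 3 | Calvin Candie | 2012 | NaN |
| 4 | Ordell Robbie | 1997 | NaN |
| 5 | richie gecko | 1996 | NaN |
| 6 | Shosanna | 2009 | NaN |
| 7 | Oswaldo Mobray | 2015 | NaN |
| 8 | Captain Koons | 1994 | NaN |
| 9 | Mallory Knox | 1994 | NaN |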


This time, however, only the two characters whose dict keys were already lowercase (`'mr pink'` and `'zoe bell'`) were matched. A dict has no built-in method for lowercasing its keys, so instead we write a function that iterates over the dict and builds a new dict with all keys converted to lowercase.

In [33]:
def lower_dict(d):
    # Build a new dict with every key lowercased; values are unchanged
    return {k.lower(): v for k, v in d.items()}

films_lower = lower_dict(films)
lowercased = data['Characters'].str.lower()
data['Films'] = lowercased.map(films_lower)
data

|   | Characters | Years | Films |
|---|---|---|---|
| 0 | Mr pink | 1992 | Reservoir Dogs |
| 1 | Zoe Bell | 2007 | Death Proof |
| 2 | Vincenzo Coccotti | 1993 | True Romance |
| 3 | Calvin Candie | 2012 | Django Unchained |
| 4 | Ordell Robbie | 1997 | Jackie Brown |
| 5 | richie gecko | 1996 | From Dusk Till Dawn |
| 6 | Shosanna | 2009 | Inglourious Basterds |
| 7 | Oswaldo Mobray | 2015 | The Hateful Eight |
| 8 | Captain Koons | 1994 | Pulp Fiction |
| 9 | Mallory Knox | 1994 | Natural Born Killers |


Alternatively, we could also have passed a function that does all the work for us. 

In [34]:
films_lower = lower_dict(films)  # build the lowercase-key dict once, not once per row
data['Films'] = data['Characters'].map(lambda x: films_lower[x.lower()])
data

|   | Characters | Years | Films |
|---|---|---|---|
| 0 | Mr pink | 1992 | Reservoir Dogs |
| 1 | Zoe Bell | 2007 | Death Proof |
| 2 | Vincenzo Coccotti | 1993 | True Romance |
| 3 | Calvin Candie | 2012 | Django Unchained |
| 4 | Ordell Robbie | 1997 | Jackie Brown |
| 5 | richie gecko | 1996 | From Dusk Till Dawn |
| 6 | Shosanna | 2009 | Inglourious Basterds |
| 7 | Oswaldo Mobray | 2015 | The Hateful Eight |
| 8 | Captain Koons | 1994 | Pulp Fiction |
| 9 | Mallory Knox | 1994 | Natural Born Killers |
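
The dict and function forms of `map` differ when a value has no matching key: mapping with a dict yields `NaN`, whereas a function that indexes the dict with `[]` raises `KeyError`. The sketch below uses a shortened dict and a made-up name, `'Unknown Character'`, to show one way of supplying a default with `dict.get`. The names `films_lower`, `names`, and the `'unknown'` placeholder are illustrative assumptions.

```python
import pandas as pd

# Shortened, already-lowercased mapping (illustrative subset)
films_lower = {'mr pink': 'Reservoir Dogs', 'zoe bell': 'Death Proof'}

# The last name is hypothetical and has no entry in the dict
names = pd.Series(['Mr pink', 'Zoe Bell', 'Unknown Character'])

# Mapping with a dict: values without a matching key become NaN
via_dict = names.str.lower().map(films_lower)

# Mapping with a function: dict.get supplies a default instead of raising KeyError
via_func = names.map(lambda name: films_lower.get(name.lower(), 'unknown'))

print(via_dict)
print(via_func)
```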


#### <span style="color:#0b486b">Replacing Values</span>
Filling in missing data with the `fillna` method is a special case of more general value replacement. As you’ve seen, `map` can be used to modify a subset of values in an object, but `replace` provides a simpler and more flexible way to do so. Let’s consider this Series:

In [35]:
import pandas as pd
from numpy import nan as NA
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use `replace`, which produces a new Series (unless you pass `inplace=True`):

In [36]:
data.replace(-999, NA)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, pass a list of the values to replace, followed by the substitute value:

In [37]:
data.replace([-999, -1000], NA)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes: 

In [38]:
data.replace([-999, -1000], [NA, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict: 

In [39]:
data.replace({-999: NA, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
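
Because `replace` returns a new object by default, the original Series is left untouched. The short sketch below, which uses the hypothetical names `readings` and `cleaned`, checks this and shows that the sentinels no longer distort a summary statistic once they are converted to NA.

```python
import pandas as pd
from numpy import nan as NA

readings = pd.Series([1., -999., 2., -999., -1000., 3.])

# replace returns a new Series; readings itself is not modified
cleaned = readings.replace([-999, -1000], NA)

raw_mean = readings.mean()      # dominated by the sentinel values
clean_mean = cleaned.mean()     # NA values are skipped, leaving 1, 2 and 3

print(raw_mean, clean_mean)
```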

#### <span style="color:#0b486b">Renaming Axis Indexes</span>
Like the values in a Series, axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure. Here’s a simple example:

In [40]:
import numpy as np
import pandas as pd
from numpy import nan as NA
data=pd.DataFrame(np.arange(12).reshape((3, 4)),index=['Ohio', 'Colorado', 'New York'], 
                  columns=['one', 'two', 'three', 'four'])
data


|   | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| New York | 8 | 9 | 10 | 11 |


Like a Series, the axis indexes have a map method: 

In [41]:
transform=lambda x: x[:4].upper()
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

You can assign to `index`, modifying the DataFrame in-place:

In [42]:
data.index = data.index.map(transform)
data

|   | one | two | three | four |
|---|---|---|---|---|
| OHIO | 0 | 1 | 2 | 3 |
| COLO | 4 | 5 | 6 | 7 |
| NEW | 8 | 9 | 10 | 11 |
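
To produce a relabeled copy without modifying the original, one option is the `rename` method. The sketch below rebuilds the example data under the hypothetical name `frame`; the replacement labels `'INDIANA'` and `'third'` are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=['Ohio', 'Colorado', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

# Pass functions to transform every label; a new DataFrame is returned
renamed = frame.rename(index=str.upper, columns=str.title)

# Pass dicts to change only selected labels
relabelled = frame.rename(index={'Ohio': 'INDIANA'}, columns={'three': 'third'})

print(renamed)
print(relabelled)
print(frame.index)  # the original labels are unchanged
```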
