# Week 6 Lab: Data Wrangling

This week we conclude our work on data preprocessing by considering data wrangling: the process of collecting raw data and transforming it into a form suitable for subsequent analysis and decision-making. Recall that last week we saw many ways to 'fix up' our data, e.g. by removing or replacing missing or duplicate data, or by performing discretisation or PCA. This week we add to this by exploring hierarchical indexing, and the merging and reshaping of data. We then conclude with some exercises, which should give us plenty of practice in using pandas to work with data. Next week, we will start to extract useful information from data by looking at linear regression.

As ever, more detail on today's topics can be found in McKinney's _Python for Data Analysis_, _Introduction to Data Mining_, or by looking for notebooks by Jose Portilla.

To execute the code, click on the corresponding cell and press the SHIFT-ENTER keys simultaneously.

## 6.1 Hierarchical Indexing

Hierarchical indexing allows us to use multiple index levels on an axis. This provides a way for us to work with higher dimensional data in a lower dimensional form. It is also known as multiple indexing.

In [1]:
import pandas as pd
import numpy as np

outside = ['a','a','a','b','b','b','c','c']
inside = [1,2,3,1,2,3,1,2]
data = pd.Series(np.random.randn(8),index=[outside,inside])

print(data)

a  1    0.303163
   2   -0.790910
   3   -0.113731
b  1    0.072670
   2    1.329320
   3    0.407109
c  1    0.485772
   2   -1.127561
dtype: float64


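This is a Series with a `MultiIndex` as its index. Each row is labelled by a pair of values, one from the outer level and one from the inner level. A gap in the outer level of the printed output means 'use the label directly above'.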

In [2]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('c', 1),
            ('c', 2)],
           )

So, a `MultiIndex` is an index object that stores several levels of labels for a Series or DataFrame. Our dataset has two levels. As expected, we can obtain subsets of the data by indexing on the outer level: we can select a single label, slice a range of labels, or pass a list of labels.

In [3]:
print(data['a'],"\n")

print(data['b':'c'],"\n")

print(data.loc[['a','c']])

1    0.303163
2   -0.790910
3   -0.113731
dtype: float64 

b  1    0.072670
   2    1.329320
   3    0.407109
c  1    0.485772
   2   -1.127561
dtype: float64 

a  1    0.303163
   2   -0.790910
   3   -0.113731
c  1    0.485772
   2   -1.127561
dtype: float64


We can also select values from the inner index.

In [4]:
data.loc[:,1]

a    0.303163
b    0.072670
c    0.485772
dtype: float64
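
The same selections can be written more explicitly. The sketch below is illustrative only: it rebuilds a small Series with fixed values (the name `demo` and its numbers are assumptions, chosen so that the results are reproducible), selects a single element with a tuple, and takes a cross-section on the inner level with `xs`.

```python
import pandas as pd

# Hypothetical Series with the same two-level index as `data`, but fixed values.
outside = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']
inside = [1, 2, 3, 1, 2, 3, 1, 2]
demo = pd.Series(range(8), index=[outside, inside], dtype=float)

# A tuple picks out one (outer, inner) pair.
single = demo.loc[('b', 2)]

# xs takes a cross-section: every row whose inner label (level 1) equals 3.
inner_three = demo.xs(3, level=1)

print(single)
print(inner_three)
```

Only 'a' and 'b' appear in the cross-section, because there is no ('c', 3) entry.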

### 6.1.1 Unstack

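Hierarchical indexing plays an important role in reshaping data and in group-based operations such as forming a pivot table. For instance, the `stack` method moves column labels into the innermost index level, and the `unstack` method moves the innermost index level into the column labels. Here there is no entry for ('c', 3), so that cell is missing in the unstacked table.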

In [5]:
data.unstack()

          1         2         3
a  0.303163 -0.790910 -0.113731
b  0.072670  1.329320  0.407109
c  0.485772 -1.127561       NaN


To restore the dataset, we can use the stack method.

In [6]:
data.unstack().stack()

a  1    0.303163
   2   -0.790910
   3   -0.113731
b  1    0.072670
   2    1.329320
   3    0.407109
c  1    0.485772
   2   -1.127561
dtype: float64
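
We are not restricted to unstacking the inner level. The following sketch, again using a hypothetical fixed-value Series called `demo`, unstacks the outer level instead and then round-trips back to long form. Whether `stack` drops the missing ('c', 3) entry by itself appears to depend on the pandas version, so `dropna()` is called explicitly here as a precaution.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed-value Series with the same index shape as `data`.
outside = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']
inside = [1, 2, 3, 1, 2, 3, 1, 2]
demo = pd.Series(np.arange(8.0), index=[outside, inside])

# Move the outer level (level 0) into the columns instead of the inner one.
wide = demo.unstack(level=0)

# The ('c', 3) combination does not exist, so that cell is NaN.
missing_cell = wide.loc[3, 'c']

# Stack back to long form; dropna() removes the missing entry if it was kept.
restored = demo.unstack().stack().dropna()

print(wide)
print(restored)
```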

### 6.1.2 Indexing

With a DataFrame, either axis can have a hierarchical index.

In [7]:
frame=pd.DataFrame(np.arange(12).reshape((4,3)),
                  index=[['a','a','b','b'],[1,2,1,2]],
                  columns=[['Theropod','Theropod','Sauropoda'],
                          ['Feathered','Unfeathered','Unfeathered']])
print(frame)

     Theropod               Sauropoda
    Feathered Unfeathered Unfeathered
a 1         0           1           2
  2         3           4           5
b 1         6           7           8
  2         9          10          11


Notice that both the rows and the columns have hierarchical indexes. We can also name the hierarchical levels.

In [8]:
frame.index.names = ['Genus', 'Species']

frame.columns.names=['Clade','Type']

print(frame)

Clade          Theropod               Sauropoda
Type          Feathered Unfeathered Unfeathered
Genus Species                                  
a     1               0           1           2
      2               3           4           5
b     1               6           7           8
      2               9          10          11


We can then select subgroups of data.

In [9]:
frame['Theropod']

Type           Feathered  Unfeathered
Genus Species                        
a     1                0            1
      2                3            4
b     1                6            7
      2                9           10
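
Selection works through both levels and on both axes. As a rough sketch (the name `dino` is an assumption; the construction simply repeats the one used for `frame`), a tuple reaches a single column, an outer row label returns the rows beneath it, and a pair of tuples reaches one cell.

```python
import numpy as np
import pandas as pd

# Same construction as `frame` above, under the hypothetical name `dino`.
dino = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Theropod', 'Theropod', 'Sauropoda'],
             ['Feathered', 'Unfeathered', 'Unfeathered']],
)

# A tuple selects a single column through both column levels.
feathered = dino[('Theropod', 'Feathered')]

# .loc with an outer row label returns the rows under that label.
genus_a = dino.loc['a']

# A row tuple plus a column tuple gives one cell.
cell = dino.loc[('b', 2), ('Sauropoda', 'Unfeathered')]

print(feathered)
print(genus_a)
print(cell)
```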


### 6.1.3 Swaplevel and Sorting

Sometimes we may want to swap the levels of an index. This is where `swaplevel` comes in. This method takes two level numbers or names and returns a new object with those levels interchanged; the data itself is unchanged. For example, we can swap the Genus and Species levels in the dataset.

In [10]:
frame.swaplevel('Genus','Species')

Clade          Theropod               Sauropoda
Type          Feathered Unfeathered Unfeathered
Species Genus                                  
1       a             0           1           2
2       a             3           4           5
1       b             6           7           8
2       b             9          10          11


To sort the index by level, we can use the `sort_index` method. For instance, we can sort the dataset by level 1, which is the inner level (Species), since levels are numbered from 0.

In [11]:
frame.sort_index(level=1)

Clade          Theropod               Sauropoda
Type          Feathered Unfeathered Unfeathered
Genus Species                                  
a     1               0           1           2
b     1               6           7           8
a     2               3           4           5
b     2               9          10          11
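
The two methods are often chained: swapping levels leaves the rows in their original order, so a sort usually follows. A minimal sketch, using a hypothetical frame `dino` with plain column names `x`, `y`, `z` to keep the output small:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a named two-level row index.
rows = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['Genus', 'Species']
)
dino = pd.DataFrame(np.arange(12).reshape((4, 3)), index=rows,
                    columns=['x', 'y', 'z'])

# Swap the levels, then sort so that rows sharing a Species sit together.
swapped_sorted = dino.swaplevel('Genus', 'Species').sort_index(level=0)

print(swapped_sorted)
```

After the sort, Species is the outer level and its labels appear in increasing order.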


### 6.1.4 Summary Statistics

We can compute descriptive and summary statistics at a chosen level of a hierarchical index by grouping on that level with `groupby(level=...)`. On the frame above, we can aggregate by level on either the rows or the columns. Here we sum over the rows that share the same Species.

In [12]:
frame.groupby(level=1).sum()

Clade    Theropod               Sauropoda
Type    Feathered Unfeathered Unfeathered
Species                                  
1               6           8          10
2              12          14          16


We can also sum across the columns that share the same Clade.

In [13]:
frame.groupby(level='Clade',axis=1).sum()

Clade          Sauropoda  Theropod
Genus Species                     
a     1                2         1
      2                5         7
b     1                8        13
      2               11        19
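
In recent pandas releases the `axis=1` argument to `groupby` is deprecated, so the cell above may emit a warning or fail depending on the installed version. One possible workaround, sketched below with a hypothetical copy of the frame called `dino`, is to transpose, group on the rows, and transpose back.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of `frame` with named row and column levels.
dino = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['Genus', 'Species']),
    columns=pd.MultiIndex.from_arrays(
        [['Theropod', 'Theropod', 'Sauropoda'],
         ['Feathered', 'Unfeathered', 'Unfeathered']], names=['Clade', 'Type']),
)

# Transpose so that Clade becomes a row level, sum within each Clade,
# then transpose back to the original orientation.
clade_totals = dino.T.groupby(level='Clade').sum().T

print(clade_totals)
```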


## 6.2 Reshaping, Filtering and Grouping Data

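For this section, create a .csv file with 6 rows and 5 columns, like the one shown below. You can tailor the contents in any way that you want, but the code that follows uses columns named _Forename_, _First Appearance_, _Site_ and _Gender_.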

In [14]:
frame2 = pd.read_csv('Week6Data.csv')
print(frame2)

    Forename     Surname     First Appearance Site Gender
0     Granny  Weatherwax          Equal Ritea    B      F
1   Eskarina       Smith          Equal Rites    B      F
2        Sam       Vimes      Guards! Guards!    A      M
3  Rincewind         NaN  The Colour of Magic    A      M
4     Horace   Worblehat  The Light Fantastic    B      M
5     Magrat     Garlick         Wyrd Sisters    B      F


We can recode a column by replacing its strings with numbers.

In [15]:
gender = {'M': 0,'F': 1}
frame2['Gender'] = [gender[item] for item in frame2['Gender']]
frame2

    Forename     Surname     First Appearance Site  Gender
0     Granny  Weatherwax          Equal Ritea    B       1
1   Eskarina       Smith          Equal Rites    B       1
2        Sam       Vimes      Guards! Guards!    A       0
3  Rincewind         NaN  The Colour of Magic    A       0
4     Horace   Worblehat  The Light Fantastic    B       0
5     Magrat     Garlick         Wyrd Sisters    B       1
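
An alternative to the list comprehension is the Series method `map`, which applies the dictionary element by element. The sketch below uses a hypothetical stand-in frame called `people`, since the contents of your own .csv file may differ. One difference worth knowing: a value that is absent from the dictionary becomes NaN with `map`, whereas the list comprehension would raise a `KeyError`.

```python
import pandas as pd

# Hypothetical stand-in for the relevant columns of frame2.
people = pd.DataFrame({
    'Forename': ['Granny', 'Eskarina', 'Sam', 'Rincewind', 'Horace', 'Magrat'],
    'Gender': ['F', 'F', 'M', 'M', 'M', 'F'],
})
gender = {'M': 0, 'F': 1}

# map looks each value up in the dictionary.
people['GenderCode'] = people['Gender'].map(gender)

# A value with no dictionary entry ('X' here) is mapped to NaN.
unknown = pd.Series(['M', 'X']).map(gender)

print(people)
print(unknown)
```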


The `groupby()` method returns a `DataFrameGroupBy` object. Selecting a column from it and calling a method such as `value_counts()` gives the number of occurrences of each unique value of that column within each group.

In [16]:
frame2.groupby('Site').Gender.value_counts()

Site  Gender
A     0         2
B     1         3
      0         1
Name: Gender, dtype: int64

This shows how the values of _Gender_ are distributed within each _Site_. We can also find the unique values of a column.

In [17]:
frame2.Site.unique()

array(['B', 'A'], dtype=object)
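
The grouped counts from `value_counts()` can also be laid out as a two-way table. A minimal sketch using `pd.crosstab`, on a hypothetical stand-in frame `people` that mirrors the _Site_ and _Gender_ columns shown earlier:

```python
import pandas as pd

# Hypothetical stand-in for the Site and (recoded) Gender columns of frame2.
people = pd.DataFrame({
    'Site': ['B', 'B', 'A', 'A', 'B', 'B'],
    'Gender': [1, 1, 0, 0, 0, 1],
})

# Rows are sites, columns are gender codes, cells are counts.
table = pd.crosstab(people['Site'], people['Gender'])

print(table)
```

Unlike the grouped counts, the table includes an explicit zero for combinations that never occur, such as Site A with Gender 1.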

We can filter the data by specifying conditions on a column, using comparison operators such as `>`, `<` or `==`. Each condition produces a Boolean Series, which is then used to select the matching rows.

In [18]:
frame2[frame2['Site']=='A']

    Forename Surname     First Appearance Site  Gender
2        Sam   Vimes      Guards! Guards!    A       0
3  Rincewind     NaN  The Colour of Magic    A       0
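
Conditions can be combined. The sketch below, on a hypothetical stand-in frame `people`, joins two conditions with `&` (each wrapped in parentheses, because `&` binds more tightly than `==`), tests membership with `isin`, and finds missing entries with `isna`.

```python
import pandas as pd

# Hypothetical stand-in for frame2 after recoding Gender.
people = pd.DataFrame({
    'Forename': ['Granny', 'Eskarina', 'Sam', 'Rincewind', 'Horace', 'Magrat'],
    'Surname': ['Weatherwax', 'Smith', 'Vimes', None, 'Worblehat', 'Garlick'],
    'Site': ['B', 'B', 'A', 'A', 'B', 'B'],
    'Gender': [1, 1, 0, 0, 0, 1],
})

# Both conditions must hold: Site is B and Gender is 1.
site_b_female = people[(people['Site'] == 'B') & (people['Gender'] == 1)]

# Keep rows whose Forename is one of the listed values.
chosen = people[people['Forename'].isin(['Sam', 'Horace'])]

# Keep rows where the Surname is missing.
no_surname = people[people['Surname'].isna()]

print(site_b_female)
print(chosen)
print(no_surname)
```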


## 6.3 Merging and Pivoting Data

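We can merge two data sets on a shared column using the `merge()` function. This uses similar logic to joining SQL tables. By default `merge()` performs an inner join, keeping only the rows whose key appears in both frames. First create another data frame.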

In [19]:
frame3=pd.DataFrame(
        {'Forename':['Granny','Eskarina','Sam','Rincewind','Horace','Margrat'],
        'Multiple Appearances':['Y','Y','Y','Y','Y','Y']})

print(frame3)

print("\nNow we can merge this with our earlier data\n",pd.merge(frame2,frame3,on='Forename'))

    Forename Multiple Appearances
0     Granny                    Y
1   Eskarina                    Y
2        Sam                    Y
3  Rincewind                    Y
4     Horace                    Y
5    Margrat                    Y

Now we can merge this with our earlier data
     Forename     Surname     First Appearance Site  Gender  \
0     Granny  Weatherwax          Equal Ritea    B       1   
1   Eskarina       Smith          Equal Rites    B       1   
2        Sam       Vimes      Guards! Guards!    A       0   
3  Rincewind         NaN  The Colour of Magic    A       0   
4     Horace   Worblehat  The Light Fantastic    B       0   

  Multiple Appearances  
0                    Y  
1                    Y  
2                    Y  
3                    Y  
4                    Y  
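
Notice that the merged result has only five rows: 'Magrat' in the first frame does not match 'Margrat' in the second, so an inner join drops it. One way to diagnose this kind of mismatch, sketched here with hypothetical stand-in frames `left_people` and `right_people`, is an outer merge with `indicator=True`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical stand-ins for frame2 and frame3 (note the differing spelling).
left_people = pd.DataFrame({
    'Forename': ['Granny', 'Eskarina', 'Sam', 'Rincewind', 'Horace', 'Magrat'],
    'Site': ['B', 'B', 'A', 'A', 'B', 'B'],
})
right_people = pd.DataFrame({
    'Forename': ['Granny', 'Eskarina', 'Sam', 'Rincewind', 'Horace', 'Margrat'],
    'Multiple Appearances': ['Y'] * 6,
})

# An outer join keeps every key; _merge says 'both', 'left_only' or 'right_only'.
check = pd.merge(left_people, right_people, on='Forename',
                 how='outer', indicator=True)

# Rows that failed to find a partner in the other frame.
unmatched = check[check['_merge'] != 'both']

print(unmatched)
```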


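We can also pivot the table, which spreads the values of one column across new columns labelled by another. Here each row is a _First Appearance_, each column is a _Site_, and the cells hold _Gender_; combinations that do not occur are filled with NaN.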

In [20]:
frame2.pivot(index='First Appearance',columns='Site',values='Gender')

Site                   A    B
First Appearance             
Equal Ritea          NaN  1.0
Equal Rites          NaN  1.0
Guards! Guards!      0.0  NaN
The Colour of Magic  0.0  NaN
The Light Fantastic  NaN  0.0
Wyrd Sisters         NaN  1.0
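
`pivot` only works when every (index, columns) pair is unique. If pairs repeat, `pivot_table` can be used instead, since it aggregates the duplicates. The following is a sketch on a hypothetical stand-in frame `people`; the choice of `aggfunc='mean'` is purely illustrative.

```python
import pandas as pd

# Hypothetical stand-in for frame2 after recoding Gender.
people = pd.DataFrame({
    'Forename': ['Granny', 'Eskarina', 'Sam', 'Rincewind', 'Horace', 'Magrat'],
    'Site': ['B', 'B', 'A', 'A', 'B', 'B'],
    'Gender': [1, 1, 0, 0, 0, 1],
})

# pivot fails here because the pair (Site A, Gender 0) occurs twice.
try:
    people.pivot(index='Site', columns='Gender', values='Forename')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates instead: the mean Gender code at each Site.
mean_gender = people.pivot_table(index='Site', values='Gender', aggfunc='mean')

print(pivot_failed)
print(mean_gender)
```

Because Gender is coded 0/1, the mean at each Site is the fraction of rows coded 1.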


## 6.4 Concatenating the DataFrames

We now create another data frame and finally concatenate. Note that when concatenating along the columns (`axis=1`), rows are aligned by index label, so the two frames should share the same index; labels present in only one frame produce missing values.

In [21]:
df=pd.DataFrame(
    {'Appearances':[12, 2, 10, 10, 16, 7],
    'Misc':[1,11,6,9,'NIL',4]})
print(df)

print("\n Now we concatenate:\n",pd.concat([frame2,df],axis=1))

   Appearances Misc
0           12    1
1            2   11
2           10    6
3           10    9
4           16  NIL
5            7    4

 Now we concatenate:
     Forename     Surname     First Appearance Site  Gender  Appearances Misc
0     Granny  Weatherwax          Equal Ritea    B       1           12    1
1   Eskarina       Smith          Equal Rites    B       1            2   11
2        Sam       Vimes      Guards! Guards!    A       0           10    6
3  Rincewind         NaN  The Colour of Magic    A       0           10    9
4     Horace   Worblehat  The Light Fantastic    B       0           16  NIL
5     Magrat     Garlick         Wyrd Sisters    B       1            7    4
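
To see what happens when the indexes do not line up, consider the following sketch. The tiny frames `top`, `bottom` and `side` are hypothetical and exist only to make the alignment visible.

```python
import pandas as pd

# Hypothetical frames; `side` shares only the label 1 with `top`.
top = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
bottom = pd.DataFrame({'x': [3, 4]}, index=[0, 1])
side = pd.DataFrame({'y': [10, 20]}, index=[1, 2])

# Along the rows (the default), ignore_index=True renumbers the result 0..3.
stacked = pd.concat([top, bottom], ignore_index=True)

# Along the columns, the default outer join keeps every label and fills gaps with NaN.
outer = pd.concat([top, side], axis=1)

# join='inner' keeps only the labels common to both frames.
inner = pd.concat([top, side], axis=1, join='inner')

print(stacked)
print(outer)
print(inner)
```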


## 6.5 Exercises

### Exercise 1

Consider the following three DataFrames:

In [22]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

In [23]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 

In [24]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

Use concatenation to glue these DataFrames together. First, concatenate along rows, then columns.

### Exercise 2

Consider the following two data frames:

In [25]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})  

Merge these two DataFrames on the keys 'key1' and 'key2'. First merge with an inner join, then with an outer join. Next, merge using only the keys from the right frame, then only those from the left. Discuss how the results differ.

### Exercise 3

Consider the following two DataFrames:

In [26]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

Explore and use the `.join()` method to combine these DataFrames into a single DataFrame. Again, explore how 'outer' changes things.

### Exercise 4

Load the SF Salaries Dataset from Kaggle (you'll find it on Canvas). Then do the following:
1. Check the head of the DataFrame and use the `.info()` method to find out how many entries there are.
2. What is the average base pay?
3. What is the highest amount of overtime pay in the dataset?
4. What is JOE DRISCOLL's job title? What happens if we do not use all caps?
5. How much does JOE DRISCOLL make (including benefits)?
6. What is the name of the highest paid person (including benefits)?
7. What is the name of the lowest paid person (including benefits)? What is unusual here?
8. What is the average base pay of all employees per year?
9. How many unique job titles are there?
10. What are the top 5 most common jobs?
11. How many job titles were represented by only one person in the year 2014?
12. How many people have the word 'Chief' in their job title?
13. Is there a correlation between length of the Job Title string and Salary?

## 6.6 Summary

We should now be comfortable with the idea of hierarchical indexing. We should also be able to merge, reshape and concatenate data. This concludes the preprocessing part of the course. Given a data set, we should now have a rough idea of how to tidy up the data and prepare it for future exploration. Next week, we will start that exploration by looking at linear regression.