# "SpreadSheet Munging Strategies in Python - Small Multiples"
> "Extract data from multiple tables in a spreadsheet"

- toc: true
- branch: master
- badges: true
- hide_binder_badge: True
- hide_colab_badge: True
- comments: true
- author: Samuel Oranyeli
- categories: [Spreadsheet, Python, Pandas]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true
- metadata_key1: "spreadsheet"
- metadata_key2: "python"

## **Small Multiples**

This is part of a series of blog posts about extracting data from spreadsheets using Python. It is based on the [book](https://nacnudus.github.io/spreadsheet-munging-strategies/index.html) by [Duncan Garmonsway](https://twitter.com/nacnudus?lang=en), which was written primarily for R users. Links to the other posts are on the [homepage](https://samukweku.github.io/data-wrangling-blog/).

Small multiples are mini tables embedded in a single spreadsheet, or spread across multiple spreadsheets. Ideally, these tables should be combined into one dataframe for meaningful analysis. The examples below show different scenarios and how we can reshape the data.

### __Case 1 : Small Multiples with all Headers Present for Each Multiple__

![small-multiples.png](Images/small-multiples.png)

In this spreadsheet, each table is a separate subject. It would be better to aggregate all the subjects and underlying data into one table.

In [1]:
import pandas as pd
import numpy as np

In [2]:
#we'll use this filename for all the examples
filename = "Data_files/worked-examples.xlsx"

In [3]:
sheet = "small-multiples"
df = pd.read_excel(filename, sheet_name=sheet, header=None)
df

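| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | Classics | NaN | NaN | NaN | History | NaN | NaN |
| 1 | Name | Score | Grade | NaN | Name | Score | Grade |
| 2 | Matilda | 1 | F | NaN | Matilda | 3 | D |
| 3 | Olivia | 2 | D | NaN | Olivia | 4 | C |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | Music | NaN | NaN | NaN | Drama | NaN | NaN |
| 6 | Name | Score | Grade | NaN | Name | Score | Grade |
| 7 | Matilda | 5 | B | NaN | Matilda | 7 | A |
| 8 | Olivia | 6 | B | NaN | Olivia | 8 | A |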


**Observations:**
1. A completely empty column and a completely empty row separate the tables. We'll use their positions when reshaping the data.
2. For each table, the subject sits directly above the header row. The cells next to the subject are empty, and we'll use that as the criterion for creating a subject column.
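
The empty row and column from the first observation can also be located programmatically rather than by eye. The snippet below is a minimal sketch of that idea; the `toy` frame is a hand-built stand-in for the sheet (an assumption, so the code runs without the Excel file), and the names `empty_cols` and `empty_rows` are only illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the sheet (an assumption, so no Excel file is needed)
toy = pd.DataFrame([
    ["Classics", np.nan, np.nan, np.nan, "History", np.nan, np.nan],
    ["Name", "Score", "Grade", np.nan, "Name", "Score", "Grade"],
    ["Matilda", 1, "F", np.nan, "Matilda", 3, "D"],
    ["Olivia", 2, "D", np.nan, "Olivia", 4, "C"],
    [np.nan] * 7,
    ["Music", np.nan, np.nan, np.nan, "Drama", np.nan, np.nan],
    ["Name", "Score", "Grade", np.nan, "Name", "Score", "Grade"],
    ["Matilda", 5, "B", np.nan, "Matilda", 7, "A"],
    ["Olivia", 6, "B", np.nan, "Olivia", 8, "A"],
])

# Labels of the columns and rows in which every cell is missing
empty_cols = toy.columns[toy.isna().all(axis=0).to_numpy()].tolist()
empty_rows = toy.index[toy.isna().all(axis=1).to_numpy()].tolist()

print(empty_cols)  # [3]
print(empty_rows)  # [4]
```

With these positions in hand, slices such as `df.iloc[:, :3]` and `df.iloc[:, 4:]` no longer need to be hard-coded.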

In [4]:
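res = (pd.concat((df.iloc[:, :3],  #the three columns before the completely null column
                  df.iloc[:, 4:]   #the three columns after the completely null column
                  .set_axis([0, 1, 2], axis=1)))  #relabel so both halves line up
        .set_axis(['Name', 'Score', 'Grade'], axis=1)
        .query('Name != "Name"')    #drop the repeated header rows
        .dropna(subset=['Name'])    #drop the completely empty row
        #a row with no Score is a title row, so its Name cell holds the subject
        .assign(subject = lambda x: np.where(x.Score.isna(), x.Name, np.nan))
        #spread each subject over the student rows beneath it
        .assign(subject = lambda x: x.subject.ffill())
        #drop the title rows, which still lack Score and Grade
        .dropna()
        .sort_values(['subject', 'Name'], ignore_index=True)
        )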

res

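| | Name | Score | Grade | subject |
|---|---|---|---|---|
| 0 | Matilda | 1 | F | Classics |
| 1 | Olivia | 2 | D | Classics |
| 2 | Matilda | 7 | A | Drama |
| 3 | Olivia | 8 | A | Drama |
| 4 | Matilda | 3 | D | History |
| 5 | Olivia | 4 | C | History |
| 6 | Matilda | 5 | B | Music |
| 7 | Olivia | 6 | B | Music |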


The image below illustrates the main concepts of the above code.

!["solution visual for case1"](Images/case1.jpg)
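
The subject column rests on two steps: flag the title rows (those with no score), then forward-fill the subject onto the student rows beneath. Here is a small sketch of just those steps on toy data. The `stacked` frame is an assumption standing in for the concatenated tables, and the sketch swaps in `Series.where` for `np.where`, which is one possible way to get the same effect.

```python
import numpy as np
import pandas as pd

# Toy rows in the stacked layout (an assumption): a title row, then its students
stacked = pd.DataFrame({
    "Name": ["Classics", "Matilda", "Olivia", "Music", "Matilda", "Olivia"],
    "Score": [np.nan, 1, 2, np.nan, 5, 6],
    "Grade": [np.nan, "F", "D", np.nan, "B", "B"],
})

# Keep Name only on rows with no Score (the title rows), then fill downwards
stacked["subject"] = stacked["Name"].where(stacked["Score"].isna()).ffill()

# The title rows still lack Score and Grade, so dropna removes them
tidy = stacked.dropna().reset_index(drop=True)
print(tidy)
```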

### __Case 2 : Same table in several worksheets/files (using the sheet/file name)__

![humanities.png](Images/humanities.png)

![performance.png](Images/performance.png)

For this case, our data is in different worksheets. We'll create a function, apply it to each worksheet and combine the tables into one.

In [5]:
def extract_data(filename,sheet):
    #the student names are the header row
    #the subjects are the index
    #the numbers are the scores
    df = (pd.read_excel(filename, 
                        sheet_name=sheet,
                        index_col=0)
          #we are assigning the final column names here
           .rename_axis(columns='student',index='subject')
           .stack()
           .reset_index(name = 'scores')
          )
    return df

The image below illustrates the core concepts of the above function for one of the sheets:

!["function description for case 2"](Images/case2.jpg)
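
To see the reshaping without opening the Excel file, here is a minimal sketch on an in-memory frame. The `wide` frame and its values are assumptions chosen to resemble one worksheet; only the three chained calls come from the function above.

```python
import pandas as pd

# In-memory stand-in for one worksheet (values are assumptions for illustration)
wide = pd.DataFrame(
    {"Matilda": [1, 3], "Nicholas": [3, 5]},
    index=["Classics", "History"],
)

long = (
    wide
    .rename_axis(index="subject", columns="student")  # name both axes
    .stack()                      # column labels become an inner index level
    .reset_index(name="scores")   # index levels become ordinary columns
)
print(long)
```

Naming the axes first means `reset_index` produces the final column names directly, with no separate rename step.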

Let's apply our function to each sheet and combine the results into one dataframe:

In [6]:
sheets = ("humanities", "performance")
extract = (extract_data(filename, sheet) for sheet in sheets)
#combine into one
res = pd.concat(extract, ignore_index=True)
res

| | subject | student | scores |
|---|---|---|---|
| 0 | Classics | Matilda | 1 |
| 1 | Classics | Nicholas | 3 |
| 2 | History | Matilda | 3 |
| 3 | History | Nicholas | 5 |
| 4 | Music | Matilda | 5 |
| 5 | Music | Nicholas | 9 |
| 6 | Drama | Matilda | 7 |
| 7 | Drama | Nicholas | 12 |
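
The combined table above does not record which sheet each row came from. If that is needed, one option is to pass a dict of frames to `pd.concat`, which turns the dict keys into an index level. The sketch below assumes in-memory frames standing in for the result of `pd.read_excel(filename, sheet_name=None)` (which returns a dict keyed by sheet name); the helper `tidy` and the column name `field` are hypothetical choices, not part of the original code.

```python
import pandas as pd

# Stand-in for pd.read_excel(filename, sheet_name=None), which returns a dict
# mapping each sheet name to its DataFrame. Values here are illustrative.
frames = {
    "humanities": pd.DataFrame({"Matilda": [1, 3], "Nicholas": [3, 5]},
                               index=["Classics", "History"]),
    "performance": pd.DataFrame({"Matilda": [5, 7], "Nicholas": [9, 12]},
                                index=["Music", "Drama"]),
}

def tidy(frame):
    """Reshape one wide sheet into subject/student/scores rows (hypothetical helper)."""
    return (frame
            .rename_axis(index="subject", columns="student")
            .stack()
            .reset_index(name="scores"))

# The dict keys become an outer index level, named "field" here
combined = (
    pd.concat({name: tidy(frame) for name, frame in frames.items()},
              names=["field"])
    .reset_index(level="field")   # move the sheet name into a column
    .reset_index(drop=True)       # discard the leftover per-sheet row numbers
)
print(combined)
```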


### __Case 3 : Same table in several worksheets/files but in different positions__

![female.png](Images/female.png)  

![male.png](Images/male.png)

This is similar to Case 2, with the core data being the same, but the table starts at a different position in each sheet. Our function must be robust enough to exclude the irrelevant data.

In [7]:
def extract_data(filename,sheet):
    #the student names are the header row
    #the subjects are the index
    #the numbers are the scores
    df = (pd.read_excel(filename, 
                        sheet_name=sheet,
                        header = None,#our header is not the first row and varies per sheet
                        index_col = 0)#set the first column as the index of the dataframe
           .loc['Subject':] #picks data from the Subject index downwards, excluding the irrelevant data
          )
    
    #set columns equal to the 'Subject' index 
    df.columns = df.loc['Subject']
    df = (df
           .drop('Subject')
           .rename_axis(index='subject',columns='student')
           .stack()
           .reset_index(name='scores')
           .assign(sex = sheet)
          )
    
    return df

The image below explains the main concepts of the function:

!["visual explanation of function for case3"](Images/case3.jpg)
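
The key move in the function is slicing by label instead of by position, so it does not matter how many rows sit above the table. Below is a minimal sketch on an in-memory frame; the `raw` layout, including the "Table of scores" note row, is an assumption made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for a sheet read with header=None (layout and note row are assumptions)
raw = pd.DataFrame([
    ["Table of scores", np.nan, np.nan],
    ["Subject", "Matilda", "Olivia"],
    ["Classics", 1, 2],
    ["History", 3, 4],
]).set_index(0)

# Label-based slicing keeps the 'Subject' row and everything below it
body = raw.loc["Subject":].copy()

# Promote the 'Subject' row to column labels, then drop it from the data
body.columns = body.loc["Subject"]
body = body.drop("Subject")
print(body)
```

This relies on the `Subject` label appearing exactly once in the first column; a sheet that repeats it would need a different approach.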

As in Case 2, we'll apply the function to each sheet:

In [8]:
sheets = ("female", "male")
extract = (extract_data(filename, sheet) for sheet in sheets)
#combine into one
res = pd.concat(extract, ignore_index=True)
res

| | subject | student | scores | sex |
|---|---|---|---|---|
| 0 | Classics | Matilda | 1 | female |
| 1 | Classics | Olivia | 2 | female |
| 2 | History | Matilda | 3 | female |
| 3 | History | Olivia | 4 | female |
| 4 | Classics | Nicholas | 3 | male |
| 5 | Classics | Paul | 0 | male |
| 6 | History | Nicholas | 5 | male |
| 7 | History | Paul | 1 | male |


### __Case 4 : Implied multiples__

![implied-multiples.png](Images/implied-multiples.png)

In this case, the fields are at the top, followed by a row holding the subjects, each with its own Grade column. The student names are in the very first column.<br>
The goal is to get the subjects, grades and scores per field and per student, and combine them into one table.

In [9]:
sheet = "implied-multiples"
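df = (pd.read_excel(filename,
                    sheet_name=sheet,
                    header = None,
                    )
       #spread each field label across the columns it spans
       .ffill(axis=1)
       #label the top-left cell, which is still empty, so the field row gets a name
       .replace({np.nan : 'field'})
       .set_index(0)
       #after transposing, 'field' and 'Name' are columns, as is each student
       .T
       .melt(id_vars = ['field','Name'],
             var_name = 'student',
             value_name = 'scores'
            )
       #copy the grade values into their own column; the other rows get NaN
       .assign(grade = lambda x: x.loc[x.Name == "Grade", 'scores'])
       #scores are above grades per student
       #hence the bfill
       .bfill()
       #the grade rows are no longer needed
       .query('Name != "Grade"')
       .rename(columns={'Name':'subject'})
       .sort_values(['field','subject'], ignore_index=True)
      )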
df

| | field | subject | student | scores | grade |
|---|---|---|---|---|---|
| 0 | Humanities | Classics | Matilda | 1 | F |
| 1 | Humanities | Classics | Olivia | 2 | D |
| 2 | Humanities | History | Matilda | 3 | D |
| 3 | Humanities | History | Olivia | 4 | C |
| 4 | Performance | Drama | Matilda | 7 | A |
| 5 | Performance | Drama | Olivia | 8 | A |
| 6 | Performance | Music | Matilda | 5 | B |
| 7 | Performance | Music | Olivia | 6 | B |


And a visual illustration of the steps is shown below: 

!["visual explanation of code for case4"](Images/case4.jpg)
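
Two fills do most of the work in this case: a forward fill across the header row, and a backward fill that pairs each score with the grade below it. The sketch below isolates both on toy data. The `raw` and `long` frames are assumptions shaped like the sheet and the melted intermediate, not the actual file contents.

```python
import numpy as np
import pandas as pd

# Stand-in for the two header rows plus one student row (layout is an assumption)
raw = pd.DataFrame([
    [np.nan, "Humanities", np.nan, "Performance", np.nan],
    ["Name", "Classics", "Grade", "Music", "Grade"],
    ["Matilda", 1, "F", 5, "B"],
])

# Forward fill copies each field label rightwards over the blank cells it spans
fields = raw.iloc[0].ffill()
print(fields.tolist())

# After melting, each score row is immediately followed by its grade row
long = pd.DataFrame({
    "Name": ["Classics", "Grade", "Music", "Grade"],
    "scores": [1, "F", 5, "B"],
})

# Index alignment places the grades on the grade rows and NaN everywhere else
long["grade"] = long.loc[long["Name"] == "Grade", "scores"]

# Backward fill pulls each grade up onto the score row directly above it
long["grade"] = long["grade"].bfill()
paired = long.query('Name != "Grade"').reset_index(drop=True)
print(paired)
```

The backward fill only pairs values correctly while every score row is directly followed by its grade row, which is the ordering `melt` produces for this layout.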