# Cleaning up age demographic by Wohnviertel data using pandas

The idea here is to use pandas to import the spreadsheet, investigate its structure, clean it up and re-export it as a CSV file in a format similar to that used by the rest of the data files.

In [7]:
# import necessary libraries
import pandas as pd
import os

In [31]:
# set some global variables
# probably a more elegant way of doing this....
file_path = os.path.join('..', 'data', 't01-1-22_Alter und Wohnviertel.xlsx')
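
One possibly tidier alternative, sketched here on the assumption that the notebook still runs from a directory that sits next to `data`, is to build the path with `pathlib` from the standard library. The name `file_path_alt` is used only for this illustration.

```python
from pathlib import Path

# Same relative location as above, built with the / operator instead of os.path.join
file_path_alt = Path('..') / 'data' / 't01-1-22_Alter und Wohnviertel.xlsx'

# pandas accepts Path objects wherever it accepts a path string
print(file_path_alt.name)
```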

In [32]:
# import the excel file 
data = pd.ExcelFile(file_path)

In [33]:
# take a look at the tabs in the spreadsheet
data.sheet_names

['Steckbrief',
 'Altstadt Grossbasel',
 'Vorstädte',
 'Am Ring',
 'Breite',
 'St. Alban',
 'Gundeldingen',
 'Bruderholz',
 'Bachletten',
 'Gotthelf',
 'Iselin',
 'St. Johann',
 'Altstadt Kleinbasel',
 'Clara',
 'Wettstein',
 'Hirzbrunnen',
 'Rosental',
 'Matthäus',
 'Klybeck',
 'Kleinhüningen',
 'Riehen',
 'Bettingen']

In [20]:
# I've already browsed the data in Excel and know the first sheet just contains metadata that isn't of interest,
# while the rest contain data for each Wohnviertel.
# Save a list of just the relevant tabs for later use, using list slicing

wohnviertel_sheets = data.sheet_names[1:]

In [35]:
# take a look at the data for my favourite Wohnviertel
# the below returns the first 15 rows of a pandas dataframe based on the Breite sheet
data.parse('Breite').head(15)

Unnamed: 0,Präsidialdepartement des Kantons Basel-Stadt,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
,Statistisches Amt,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,
,t01.1.22,,Wohnbevölkerung nach Staatsangehörigkeit und A...,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,
,,,Schweiz,,,,Ausland,,,,Gesamtbevölkerung,,,,,,,
,Alter1,,Männlich,Weiblich,Total,,Männlich,Weiblich,Total,,Männlich,Weiblich,Total,,,,,
,,,,,,,,,,,,,,,,,,
,0-4,,111,127,238,,79,85,164,,190,212,402,,,,,


In [37]:
# let's also take a look at the last 15 rows of the sheet
data.parse('Breite').tail(15)

Unnamed: 0,Präsidialdepartement des Kantons Basel-Stadt,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
,55-59,,188.0,233.0,421.0,,78.0,62.0,140.0,,266.0,295.0,561.0,,,,,
,60-64,,160.0,198.0,358.0,,75.0,54.0,129.0,,235.0,252.0,487.0,,,,,
,65-69,,135.0,205.0,340.0,,39.0,37.0,76.0,,174.0,242.0,416.0,,,,,
,70-74,,145.0,215.0,360.0,,29.0,29.0,58.0,,174.0,244.0,418.0,,,,,
,75-79,,129.0,192.0,321.0,,25.0,26.0,51.0,,154.0,218.0,372.0,,,,,
,80-84,,103.0,214.0,317.0,,18.0,15.0,33.0,,121.0,229.0,350.0,,,,,
,85-89,,63.0,162.0,225.0,,5.0,6.0,11.0,,68.0,168.0,236.0,,,,,
,90-94,,29.0,85.0,114.0,,0.0,5.0,5.0,,29.0,90.0,119.0,,,,,
,95-99,,6.0,14.0,20.0,,1.0,1.0,2.0,,7.0,15.0,22.0,,,,,
,100-104,,1.0,5.0,6.0,,0.0,0.0,0.0,,1.0,5.0,6.0,,,,,


In [38]:
# The above returns a pandas dataframe
# Note a couple of things:
    # The first six rows don't include anything particularly useful, nor do the last three
    # The index doesn't contain anything useful (i.e. NaN)
    # The column names aren't useful
    # The first column seems to contain the age bands
    # Columns three and four contain the numbers of Swiss men and women respectively, while the seventh and eighth contain the numbers of non-Swiss men and women
    # All other data columns simply contain totals that can be derived from the other data
    # The assumption is that all other tabs contain data in the same format

# So that's what the data looks like, now to clean it up

Let's try to drop the irrelevant rows and columns for a single sheet first, and then write some code that can be used to iterate over all tabs and build a single dataframe that can then be re-exported as CSV.


In [40]:
# first pull the entire Breite sheet into a dataframe variable
df = data.parse('Breite')

In [55]:
# we next want to drop the columns we don't need and rename the ones we want to retain
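
A minimal sketch of this step, assuming the column positions noted earlier hold (age bands first, then the Swiss and non-Swiss male/female counts). The frame `raw` is a small stand-in for `data.parse('Breite')` with values copied from the output above, and the new column names are hypothetical rather than the ones used by the other data files.

```python
import pandas as pd

# Stand-in for data.parse('Breite'): one header row and two data rows,
# trimmed to the first nine columns of the real layout
raw = pd.DataFrame([
    ["Alter1", None, "Männlich", "Weiblich", "Total", None, "Männlich", "Weiblich", "Total"],
    ["0-4", None, 111, 127, 238, None, 79, 85, 164],
    ["55-59", None, 188, 233, 421, None, 78, 62, 140],
])

# 0-based positions of the columns to keep, mapped to hypothetical new names
keep = {0: "age_band", 2: "swiss_male", 3: "swiss_female", 6: "foreign_male", 7: "foreign_female"}

# Select by position so the unhelpful original column names don't matter, then rename
df = raw.iloc[:, list(keep)].copy()
df.columns = list(keep.values())
print(df)
```

Selecting by position avoids depending on labels such as `Unnamed: 2`, but it does rely on every sheet sharing the same layout.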


In [54]:
# next drop the rows we don't need and create an index using the age bands column
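
One way this could look, as a sketch: rather than hard-coding how many rows to drop at the top and bottom, keep only the rows whose first column looks like an age band such as `0-4`. That pattern is an assumption based on the bands visible above, and the column names are illustrative. The stand-in frame reuses values from the Breite output.

```python
import pandas as pd

# Stand-in for the Breite frame after column selection (column names are illustrative)
df = pd.DataFrame(
    [
        ["Statistisches Amt", None, None, None, None],
        ["Alter1", "Männlich", "Weiblich", "Männlich", "Weiblich"],
        [None, None, None, None, None],
        ["0-4", 111, 127, 79, 85],
        ["55-59", 188, 233, 78, 62],
    ],
    columns=["age_band", "swiss_male", "swiss_female", "foreign_male", "foreign_female"],
)

# True only for rows whose first column matches "<digits>-<digits>" (assumed age band format)
is_band = df["age_band"].astype(str).str.match(r"^\d+-\d+$")

# Keep those rows, index them by age band and store the counts as integers
clean = df[is_band].set_index("age_band").astype(int)
print(clean)
```

If a sheet contains bands in a different format, the pattern would need widening, otherwise those rows would be dropped silently.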


In [56]:
# now we need to think about adding a new column (or columns) to tell us which Wohnviertel we're talking about (following the standard of the other data files)
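
A small sketch of adding the identifier, assuming the sheet name itself is a suitable label. The column name `wohnviertel` is a placeholder, since the convention used by the other data files isn't shown here.

```python
import pandas as pd

sheet_name = "Breite"

# Stand-in for the cleaned Breite frame, indexed by age band (values from the output above)
df = pd.DataFrame(
    {"swiss_male": [111, 188], "swiss_female": [127, 233]},
    index=pd.Index(["0-4", "55-59"], name="age_band"),
)

# Move the age band back into a regular column, then put the Wohnviertel name first
df = df.reset_index()
df.insert(0, "wohnviertel", sheet_name)
print(df)
```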

In [57]:
# now we might want to think about unpivoting the different measures (i.e. Swiss men, Swiss women etc.) from separate columns into rows, giving us just a single data column (again following the standard in the other files)
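
In pandas this kind of reshaping can be done with `DataFrame.melt`. The sketch below assumes a frame with one identifier column per Wohnviertel and age band. The names `measure` and `count` are hypothetical and would need to match whatever the other files use.

```python
import pandas as pd

# Stand-in wide frame: one column per measure (values from the Breite output above)
wide = pd.DataFrame({
    "wohnviertel": ["Breite", "Breite"],
    "age_band": ["0-4", "55-59"],
    "swiss_male": [111, 188],
    "swiss_female": [127, 233],
    "foreign_male": [79, 78],
    "foreign_female": [85, 62],
})

# Turn the four measure columns into rows: one row per Wohnviertel, age band and measure
long_df = wide.melt(
    id_vars=["wohnviertel", "age_band"],
    var_name="measure",
    value_name="count",
)
print(long_df)
```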

In [58]:
# once this code works, create a simple loop over the relevant sheets to create a single dataframe for all Wohnviertel
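
A hedged sketch of how the loop might be put together. The helper `tidy_sheet`, the column names and the age band pattern are all assumptions. The dictionary `raw_sheets` stands in for the real workbook, where each frame would instead come from `data.parse(name)` for each `name` in `wohnviertel_sheets`.

```python
import pandas as pd

# 0-based positions of the columns to keep, mapped to hypothetical names
KEEP = {0: "age_band", 2: "swiss_male", 3: "swiss_female", 6: "foreign_male", 7: "foreign_female"}

def tidy_sheet(raw, name):
    """Hypothetical helper: reduce one raw sheet to long-format rows."""
    df = raw.iloc[:, list(KEEP)].copy()
    df.columns = list(KEEP.values())
    # Keep only rows that look like age bands, e.g. "0-4" (assumed pattern)
    is_band = df["age_band"].astype(str).str.match(r"^\d+-\d+$")
    df = df[is_band].copy()
    df.insert(0, "wohnviertel", name)
    return df.melt(id_vars=["wohnviertel", "age_band"], var_name="measure", value_name="count")

header = ["Alter1", None, "Männlich", "Weiblich", "Total", None, "Männlich", "Weiblich", "Total"]
raw_sheets = {
    "Breite": pd.DataFrame([header, ["0-4", None, 111, 127, 238, None, 79, 85, 164]]),
    # Placeholder sheet: reuses a Breite row only so that the loop has two inputs
    "Placeholder": pd.DataFrame([header, ["55-59", None, 188, 233, 421, None, 78, 62, 140]]),
}

# Tidy each sheet and stack the results into one dataframe
combined = pd.concat(
    [tidy_sheet(raw, name) for name, raw in raw_sheets.items()],
    ignore_index=True,
)
print(combined)
```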


In [59]:
# then save the resulting dataframe as a csv using the same structure, delimiters etc as the other files
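
The export itself could be as short as the sketch below. The file name, the comma delimiter and the UTF-8 encoding are assumptions and should be changed to match the existing files. A temporary directory is used here only so the example doesn't touch the real `data` folder.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the combined long-format dataframe (values from the Breite output above)
combined = pd.DataFrame({
    "wohnviertel": ["Breite", "Breite"],
    "age_band": ["0-4", "0-4"],
    "measure": ["swiss_male", "swiss_female"],
    "count": [111, 127],
})

# Hypothetical output location and file name
out_path = os.path.join(tempfile.mkdtemp(), "alter_wohnviertel.csv")

# index=False leaves out the meaningless row numbers; sep and encoding are assumed
combined.to_csv(out_path, sep=",", index=False, encoding="utf-8")

# Read the file back as a quick sanity check
check = pd.read_csv(out_path, sep=",", encoding="utf-8")
print(check)
```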

In [60]:
# next steps would be to amend the existing code in mapping to pull this new csv file in much the same way the current files are loaded
