In [None]:
import pandas as pd

To begin, we read the Excel file to list all of its sheets and identify which ones are relevant to answering the business questions.

In [None]:
# Print all sheet names
all_sheets = pd.ExcelFile('data/business-demographics.xlsx')
print(all_sheets.sheet_names)
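
One way to shortlist the relevant sheets is to filter the sheet names by their suffix. The sketch below is illustrative only: the list is a hypothetical stand-in for `all_sheets.sheet_names`, and the `'Notes'` entry is invented for the example.

```python
# Hypothetical stand-in for all_sheets.sheet_names ('Notes' is an invented entry)
sheet_names = [
    'Active Enterprises by year',
    '2002 Survival Rates',
    '2003 Survival Rates',
    'Notes',
]

# Keep only the sheets whose name ends with ' Survival Rates'
survival_sheets = [name for name in sheet_names if name.endswith(' Survival Rates')]
print(survival_sheets)
```

With the real file, the same comprehension could be run over `all_sheets.sheet_names` instead of the hard-coded list.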

To answer our first business question we have to analyse past survival rates of businesses to find the borough in London with the highest 5-year survival rate. Thus, we will first analyse the sheets with the survival rates of businesses.

In [None]:
# Read the sheets on survival rates in a dictionary
survivalrates = {}
    
for i in range(2002, 2019):
    survivalrates[str(i)] = pd.read_excel('data/business-demographics.xlsx', sheet_name= str(i) + ' Survival Rates')

# Print 2002 Survival Rates
print(survivalrates['2002'].head(5))

# Print the number of rows and columns in the dataframe
print(survivalrates['2002'].shape)


At first glance, we identify several problems with the dataset and we can assume that these problems exist in all other sheets on survival rates.  

Firstly, most of the column names are 'Unnamed', which tells us that the first row of the sheet is mostly empty and was read as the header. Thus we should re-read the Excel file and skip the first row.


In [None]:
for i in range(2002, 2019):
    survivalrates[str(i)] = pd.read_excel('data/business-demographics.xlsx', sheet_name= str(i) + ' Survival Rates', skiprows=1)

# Print 2002 Survival Rates
print(survivalrates['2002'].head(5))
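
The effect of `skiprows=1` can be reproduced on a small scale. This is a minimal sketch that uses an in-memory CSV as a stand-in for the Excel sheet; the borough row and its values are made up for illustration and `read_csv` is used here only because it needs no file on disk.

```python
import io

import pandas as pd

# Toy data: an empty first row, then the real headers, then one made-up record
raw = ",,\nArea,Births,Per cent\nCamden,100,90.5\n"

# Without skipping, the empty first row is taken as the header
without_skip = pd.read_csv(io.StringIO(raw))

# Skipping the first row lets pandas pick up the real header row
with_skip = pd.read_csv(io.StringIO(raw), skiprows=1)

print(list(without_skip.columns))
print(list(with_skip.columns))
```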

Secondly, we can drop the columns that give the survival figures as absolute numbers, since the percentage columns already contain sufficient information about the survival rates for each borough. We can also drop the 'Births' column.

In [None]:
# Removing 'Births' and the columns with survival rates in numbers 
for i in range(2002, 2019):
    survivalrates[str(i)].drop(survivalrates[str(i)].columns[[2, 3, 5, 7, 9, 11]], axis=1, inplace=True)

# Print 2002 Survival Rates
print(survivalrates['2002'].head(5))



We can also rename the remaining columns, as we know from viewing the dataframe previously that the 'Per cent' columns are the survival rates in percentage for 1, 2, 3, 4 and 5 years, in that order.

In [None]:
# Rename columns
for i in range(2002, 2019):
    survivalrates[str(i)].rename(columns={'Per cent': '1 Year Survival in %', 'Per cent.1': '2 Year Survival in %', 'Per cent.2': '3 Year Survival in %','Per cent.3': '4 Year Survival in %','Per cent.4': '5 Year Survival in %',}, inplace=True)
    

print(survivalrates['2002'].head(5))
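
The `.1`, `.2`, ... suffixes in the old column names come from pandas, which makes repeated header names unique when reading a file. A minimal sketch with an in-memory CSV and made-up values (not the real sheet) shows the suffixes and a rename that maps them to descriptive names:

```python
import io

import pandas as pd

# Toy data with a repeated 'Per cent' header; the values are invented
raw = "Area,Per cent,Per cent,Per cent\nCamden,90.5,75.0,60.2\n"
toy = pd.read_csv(io.StringIO(raw))

# pandas appends .1, .2, ... to make the duplicate names unique
print(list(toy.columns))

# Build the rename mapping once, then apply it
new_names = {
    'Per cent': '1 Year Survival in %',
    'Per cent.1': '2 Year Survival in %',
    'Per cent.2': '3 Year Survival in %',
}
toy = toy.rename(columns=new_names)
print(list(toy.columns))
```

Defining the mapping once outside the loop would also keep the rename line in the cell above shorter.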


Thirdly, we observe that there is an empty row at the start of the data, so we want to test if there are any other empty rows or cells in the dataset.

In [None]:
# Check for missing values
print(survivalrates['2002'].isnull().sum())
print(survivalrates['2002'].isna().sum())

missing_rows_na = survivalrates['2002'][survivalrates['2002'].isna().any(axis=1)]
print(missing_rows_na)
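
Note that `isnull()` is an alias of `isna()`, so the two counts above are always identical. The sketch below, on a made-up dataframe, shows the difference between rows with any missing value and rows that are entirely empty:

```python
import numpy as np
import pandas as pd

# Toy dataframe with one fully empty row and one partially empty row (invented values)
toy = pd.DataFrame({
    'Area': ['Camden', np.nan, 'Hackney'],
    '1 Year Survival in %': [90.5, np.nan, np.nan],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()

# Rows with at least one missing value versus rows where every value is missing
rows_with_any_missing = toy[toy.isna().any(axis=1)]
rows_fully_empty = toy[toy.isna().all(axis=1)]

print(missing_per_column)
print(rows_with_any_missing.index.tolist())
print(rows_fully_empty.index.tolist())
```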

We observe that rows 0, 34, 37, 47 and 52 are empty, so we can drop these rows. To better understand the data and why there are empty rows, we can view all the data.

In [None]:
# Print the whole sheet
print(survivalrates['2002'])

After row 33, the data covers different regions in London. As this information is already contained in the rows above, which detail the survival rates in the individual boroughs of London, we can remove the rows below 33, together with the empty row 0.

In [None]:
# Remove rows with irrelevant information
for i in range(2002, 2019):
    survivalrates[str(i)] = survivalrates[str(i)].iloc[1:34]

# Print 2002 Survival Rates
print(survivalrates['2002'])


As the data in this dataset only goes up to 2019, the sheets for the most recent years cannot contain survival rates for the longer time horizons. We need to check the later sheets as well.

In [None]:
# Print the 2018 Survival Rates sheet
print(survivalrates['2018'].head(5))

The survival rates that are not yet available are marked with ':', so we need to remove the columns that contain ':'.

In [None]:
# Remove columns that contain ':'
for i in range(2002, 2019):
    survivalrates[str(i)].drop(columns = survivalrates[str(i)].columns[(survivalrates[str(i)] == ':').any()], inplace = True)

print(survivalrates['2014'].head(5))
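
To see how the column mask works, here is a minimal sketch on a made-up dataframe. It builds the same per-column boolean mask and then keeps the columns where the mask is False, which is an equivalent way of dropping the flagged columns:

```python
import pandas as pd

# Toy dataframe where the 2-year rate is not yet available (invented values)
toy = pd.DataFrame({
    'Area': ['Camden', 'Hackney'],
    '1 Year Survival in %': [90.5, 88.1],
    '2 Year Survival in %': [':', ':'],
})

# True for every column that contains at least one ':' placeholder
has_placeholder = (toy == ':').any()
print(has_placeholder)

# Keep only the columns without a placeholder
cleaned = toy.loc[:, ~has_placeholder]
print(list(cleaned.columns))
```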

As we are dealing with a lot of numbers, it is also important to check that each column is stored as a numeric data type and not as strings.

In [None]:
# Check the data type of each column (info() prints its summary directly and returns None)
survivalrates['2002'].info(verbose=True)
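
If a rate column did turn out to be stored as strings, one possible fix is `pd.to_numeric`. The following is only a sketch on made-up data, under the assumption that such a column exists; it is not a step the analysis above is known to require.

```python
import pandas as pd

# Toy dataframe where the rates were read as strings (invented values)
toy = pd.DataFrame({
    'Area': ['Camden', 'Hackney'],
    '1 Year Survival in %': ['90.5', '88.1'],
})

# Convert every survival-rate column; values that cannot be parsed become NaN
rate_columns = [col for col in toy.columns if col.endswith('Survival in %')]
for col in rate_columns:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

print(toy.dtypes)
```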

Another sheet in the Excel file that would be relevant to answering the business question is 'Active Enterprises by year'. 

In [None]:
activeenterprises = pd.read_excel('data/business-demographics.xlsx', sheet_name= 'Active Enterprises by year')

print(activeenterprises)

As the structure is similar to that of the previous sheets with survival rates, except that the column names are already correct, we will perform the same data cleaning process to remove empty rows and create a new dataframe with only the relevant data. To confirm, we will also check for empty values.

In [None]:
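# Keep only the borough rows; copy() avoids modifying a view of the original dataframe later on
activeenterprises = activeenterprises.iloc[1:34].copy()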

print(activeenterprises)

In [None]:
print(activeenterprises.isnull().sum())
print(activeenterprises.isna().sum())

In [None]:
# Convert the headers from integers to strings
activeenterprises.columns = activeenterprises.columns.map(str)

# info() prints its summary directly and returns None
activeenterprises.info(verbose=True)


As the enterprise counts are stored as floats, we can convert them to integers, which makes more sense given the context of the data.

In [None]:
for i in range(2002, 2020):
    activeenterprises[str(i)] = activeenterprises[str(i)].astype(int)

print(activeenterprises.head(5))
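
The conversion above works because the null check found no missing values. As a hedged aside, the sketch below uses a made-up series to show that `astype(int)` fails when a NaN is present, and that the nullable `'Int64'` dtype is one possible alternative in that case:

```python
import numpy as np
import pandas as pd

# Toy counts with one missing value (invented numbers)
counts = pd.Series([1200.0, 3450.0, np.nan])

# Plain integer conversion raises an error because NaN has no integer representation
try:
    counts.astype(int)
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# The nullable integer dtype keeps the missing value as <NA>
nullable = counts.astype('Int64')

print(plain_int_failed)
print(nullable)
```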

Now that the data is prepared, we will save the cleaned sheets to a new Excel file.

In [None]:
# Save only the relevant sheets (maybe try diff file for each set of sheets)
with pd.ExcelWriter('data/prepared_data.xlsx') as writer:  
    activeenterprises.to_excel(writer, sheet_name='Active Enterprises by Year')
    for i in range(2002, 2019):
        survivalrates[str(i)].to_excel(writer, sheet_name= str(i) + ' Survival Rates')

