# Business Funding Data: Cleaning & Preprocessing

Author: Taofeekat Balogun
Purpose: Clean, explore and prepare the Business Funding Data for analysis.




##Import libraries and read the file with an explicit latin1 encoding, because CSV exports are often not valid UTF-8


In [11]:
import pandas as pd
import numpy as np
df = pd.read_csv("Business Funding Data.csv", encoding="latin1")
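The cell above reads the file with a single fixed encoding. If a true fallback is wanted, one possible sketch is shown here. The helper name `read_csv_with_fallback` and the order of encodings are assumptions for illustration, not part of the original notebook. Note that `latin1` maps every byte to a character, so it never raises a decode error and should be tried last.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin1")):
    """Try each encoding in turn and return (DataFrame, encoding_used).

    Hypothetical helper: the encoding order is an assumption, with the
    strictest codec first and the most permissive (latin1) last.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            # This encoding cannot decode the file; remember why and move on.
            last_error = err
    if last_error is None:
        raise ValueError("no encodings were supplied")
    raise last_error


# Example call (assumes the file sits next to the notebook):
# df, used_encoding = read_csv_with_fallback("Business Funding Data.csv")
```

Returning the encoding that succeeded makes it easy to record which one was actually used, which helps when symbols such as £ appear garbled later on.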


#Quick data inspection
## To understand the rows, columns and types

In [14]:
df.head()


Unnamed: 0,Website Domain,Effective date,Found At,Financing Type,Financing Type Normalized,Categories,Investors,Investors Count,Amount,Amount Normalized,Source Urls
0,trafigura.com,,2024-03-14T01:00:00+01:00,,,[],,,$1.9b,1900000000,https://www.tradefinanceglobal.com/posts/trafi...
1,zenobe.com,,2024-05-31T02:00:00+02:00,,,[],"avivainvestors.com, lloydsbankinggroup.com, sa...",9.0,$522.7 million,522700000,https://realassets.ipe.com/news/aviva-among-le...
2,zenobe.com,,2024-07-24T02:00:00+02:00,,,"[""private_equity""]",,,£41.7m,53671000,https://www.innovationnewsnetwork.com/zenobe-a...
3,canva.com,,2024-05-01T02:00:00+02:00,,,[],stackcapitalgroup.com,1.0,US$8 million,8000000,https://www.globenewswire.com/news-release/202...
4,fidelity.com,,2024-04-11T02:00:00+02:00,,,[],chevychasetrust.com,1.0,$1.96 million,1960000,https://www.defenseworld.net/2024/04/11/chevy-...


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Website Domain             26 non-null     object 
 1   Effective date             6 non-null      object 
 2   Found At                   26 non-null     object 
 3   Financing Type             8 non-null      object 
 4   Financing Type Normalized  8 non-null      object 
 5   Categories                 26 non-null     object 
 6   Investors                  13 non-null     object 
 7   Investors Count            13 non-null     float64
 8   Amount                     26 non-null     object 
 9   Amount Normalized          26 non-null     int64  
 10  Source Urls                26 non-null     object 
dtypes: float64(1), int64(1), object(9)
memory usage: 2.4+ KB


In [16]:
df.shape

(26, 11)

#Standardize column names

##Makes column names easier to type and reduces the risk of key errors.

In [18]:
df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
df.head(2)

Unnamed: 0,website_domain,effective_date,found_at,financing_type,financing_type_normalized,categories,investors,investors_count,amount,amount_normalized,source_urls
0,trafigura.com,,2024-03-14T01:00:00+01:00,,,[],,,$1.9b,1900000000,https://www.tradefinanceglobal.com/posts/trafi...
1,zenobe.com,,2024-05-31T02:00:00+02:00,,,[],"avivainvestors.com, lloydsbankinggroup.com, sa...",9.0,$522.7 million,522700000,https://realassets.ipe.com/news/aviva-among-le...
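The one-liner above works for names separated by single spaces. A slightly more defensive variant is sketched here as an assumption, not as the notebook's method: the hypothetical helper `standardize_columns` also collapses repeated spaces and other punctuation into one underscore.

```python
import re

import pandas as pd


def standardize_columns(columns):
    """Lowercase, trim, and collapse runs of non-alphanumerics into '_'.

    Hypothetical helper for illustration only.
    """
    return [
        re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
        for name in columns
    ]


# Toy frame with deliberately messy headers (not the real file).
sample = pd.DataFrame(columns=["Website Domain", " Found At ", "Amount  Normalized"])
sample.columns = standardize_columns(sample.columns)
print(list(sample.columns))
# ['website_domain', 'found_at', 'amount_normalized']
```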


#Remove exact duplicates

In [19]:
# Count exact duplicate rows
print("duplicates:", df.duplicated().sum())

# Remove duplicates and reset the index
df = df.drop_duplicates().reset_index(drop=True)


duplicates: 0


#Check where the missing values are

In [20]:
# Count missing values in each column
print(df.isnull().sum())


website_domain                0
effective_date               20
found_at                      0
financing_type               18
financing_type_normalized    18
categories                    0
investors                    13
investors_count              13
amount                        0
amount_normalized             0
source_urls                   0
dtype: int64


#See percentage of missing values

In [21]:
# Percentage of missing values
print((df.isnull().mean() * 100).round(2))


website_domain                0.00
effective_date               76.92
found_at                      0.00
financing_type               69.23
financing_type_normalized    69.23
categories                    0.00
investors                    50.00
investors_count              50.00
amount                        0.00
amount_normalized             0.00
source_urls                   0.00
dtype: float64


#Drop columns with too many missing values (>60%)

In [22]:
df = df.drop(columns=['effective_date', 'financing_type', 'financing_type_normalized'])
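The drop above lists the three columns by hand. As a sketch, the same selection could be derived from the missing-value share. The helper name `drop_sparse_columns` and the 0.60 default are assumptions: 0.60 is chosen because it catches all three dropped columns (76.92%, 69.23%, 69.23%) while keeping `investors` at 50%, whereas a 0.70 cutoff would keep `financing_type`.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, threshold=0.60):
    """Drop columns whose share of missing values exceeds `threshold`.

    Returns the reduced frame and the list of dropped column names.
    Hypothetical helper; the default threshold is an assumption.
    """
    missing_share = frame.isnull().mean()
    sparse = missing_share[missing_share > threshold].index.tolist()
    return frame.drop(columns=sparse), sparse


# Toy data, not the real file.
toy = pd.DataFrame({
    "website_domain": ["a.com", "b.com", "c.com", "d.com"],
    "effective_date": [np.nan, np.nan, np.nan, "2024-05-01"],  # 75% missing
    "investors": ["x.com", np.nan, "y.com", np.nan],           # 50% missing
})
cleaned, dropped = drop_sparse_columns(toy)
print(dropped)
# ['effective_date']
```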


In [24]:
#Fill missing values in investors and investors_count
df['investors'] = df['investors'].fillna("Unknown")
df['investors_count'] = df['investors_count'].fillna(0)

print(df.isnull().sum())



website_domain       0
found_at             0
categories           0
investors            0
investors_count      0
amount               0
amount_normalized    0
source_urls          0
dtype: int64
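Filling `investors_count` with 0 makes an unknown count look the same as a genuine zero. One possible refinement, shown as a sketch on toy data, is to record which rows were originally missing before filling. The `investors_known` column name is an assumption and does not exist in the original notebook.

```python
import numpy as np
import pandas as pd

# Toy rows mirroring one known and one missing investor entry.
toy = pd.DataFrame({
    "investors": ["stackcapitalgroup.com", np.nan],
    "investors_count": [1.0, np.nan],
})

# Flag rows where investor data was present BEFORE filling (hypothetical column).
toy["investors_known"] = toy["investors"].notna()

# Same fills as in the cell above, plus a cast back to integer counts.
toy["investors"] = toy["investors"].fillna("Unknown")
toy["investors_count"] = toy["investors_count"].fillna(0).astype(int)

print(toy)
```

With the flag in place, later averages of `investors_count` can be restricted to rows where `investors_known` is true.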


#Observations from exploring the data

1. The dataset contains information on business funding in Nigeria (e.g., company website, funding amount, financing type, investors, etc.).

2. Some columns had a high percentage of missing values: effective_date (76.92%), financing_type and financing_type_normalized (69.23% each), and investors and investors_count (50% each).

3. No exact duplicate rows were found (df.duplicated().sum() returned 0).

4. The amount_normalized column was already in a standardized format, useful for analysis.

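5. The categories column can be used for grouping and comparing funding patterns, although most values in the previewed rows are empty lists ([]). The financing_type column would serve the same purpose, but it was 69.23% missing and was dropped.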
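To group by category, the `categories` strings first need to become real lists. The sketch below assumes the values are JSON-encoded lists, as the previewed rows (`[]` and `["private_equity"]`) suggest. The `categories_list` column name is hypothetical, and the two toy rows reuse values from the preview above.

```python
import json

import pandas as pd

# Two rows taken from the df.head() preview.
sample = pd.DataFrame({
    "website_domain": ["trafigura.com", "zenobe.com"],
    "categories": ["[]", '["private_equity"]'],
    "amount_normalized": [1900000000, 53671000],
})

# Parse the JSON text into Python lists (assumes every value is valid JSON).
sample["categories_list"] = sample["categories"].apply(json.loads)

# One row per category; rows with an empty list become NaN and are dropped.
exploded = sample.explode("categories_list").dropna(subset=["categories_list"])

# Total funding per category.
totals = exploded.groupby("categories_list")["amount_normalized"].sum()
print(totals)
```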

#Steps I took to clean, preprocess, and transform the data

1. Loaded the CSV file and inspected its structure with .head(), .info() and .shape.

2. Standardized column names (lowercase, underscores) for easier handling.

3. Checked for duplicate rows with .duplicated() (none were found) and applied .drop_duplicates() as a safeguard.

4. Handled missing values: dropped the columns with more than 60% missing data (effective_date, financing_type, financing_type_normalized), then filled missing values in investors with "Unknown" and in investors_count with 0.

5. Confirmed that no missing values remained with .isnull().sum().
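The final confirmation step can also be made automatic. The sketch below is a hypothetical helper, `check_clean`, that raises an error if any missing values or exact duplicate rows remain. It is an illustration under those assumptions, not part of the original notebook.

```python
import pandas as pd


def check_clean(frame):
    """Return True if `frame` has no missing values and no duplicate rows.

    Raises ValueError describing the first problem found (hypothetical helper).
    """
    missing = frame.isnull().sum()
    problems = missing[missing > 0]
    if not problems.empty:
        raise ValueError(f"columns with missing values: {problems.to_dict()}")

    duplicate_rows = int(frame.duplicated().sum())
    if duplicate_rows:
        raise ValueError(f"{duplicate_rows} duplicate row(s) remain")

    return True


# Toy frame standing in for the cleaned data.
toy = pd.DataFrame({"website_domain": ["a.com", "b.com"], "investors_count": [1, 0]})
print(check_clean(toy))
# True
```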

#Reflections on the importance of preprocessing in real-world data analysis

1. Real-world data is rarely clean; preprocessing is essential to remove errors, inconsistencies, and noise.

2. Good preprocessing ensures accurate insights, avoids misleading results, and improves model performance.

3. Decisions on handling missing values, duplicates, and formatting directly affect the reliability of analysis.

4. Without preprocessing, downstream analysis or predictive modeling could lead to wrong business conclusions.