In [1]:
# import relevant modules and libraries
import pandas as pd
import numpy as np
import re

The data to be cleaned is real estate data on properties for sale in Lagos. The data and its features are described in the 'data_dictionary.pdf' file, found in the files folder on the master branch of the project repository. The data is in CSV format. Get started by loading it with pandas' 'read_csv' function.

In [2]:
# load data
lag = pd.read_csv('lag_houses.csv')

Once the dataset has been loaded, it is best to take a high-level view of the data. Start with its dimensions using pandas' 'shape' attribute, then use the 'head' method to look at the first few rows.

In [3]:
# peek at data
print(lag.shape)
lag.head()

(24773, 10)


Unnamed: 0,Description,Title,Location,Beds,Baths,Toilets,Is_new,Is_furnished,Is_serviced,Price
0,4 Bedroom Semi Detached Duplex,4 BEDROOM HOUSE FOR SALE,Chevron Lekki Lagos,4.0,5.0,5.0,1,1,1,85000000
1,5 Bedroom Fully Detached Duplex,5 BEDROOM HOUSE FOR SALE,2nd Toll Gate Oral Estate Lekki Lagos,5.0,6.0,6.0,1,1,0,160000000
2,4 Bedroom Semi Detached Duplex+ Jacuzzi,4 BEDROOM HOUSE FOR SALE,2nd Toll Gate Lekki Lagos,4.0,5.0,5.0,1,1,0,68000000
3,5 Bedroom Duplex,5 BEDROOM HOUSE FOR SALE,Ikate Lekki Lagos,5.0,6.0,6.0,1,0,0,290000000
4,3 Bedroom Apartment,3 BEDROOM HOUSE FOR SALE,Lekki Lagos,3.0,4.0,4.0,1,1,0,150000000


As seen above, the dataset has 24,773 rows and 10 columns. From this view we can infer what types of data are present in the dataset. To be certain about the data types, we use pandas' 'info' method.

In [4]:
# try to understand data types
lag.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24773 entries, 0 to 24772
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Description   24773 non-null  object 
 1   Title         24773 non-null  object 
 2   Location      24773 non-null  object 
 3   Beds          24270 non-null  float64
 4   Baths         24170 non-null  float64
 5   Toilets       24154 non-null  float64
 6   Is_new        24773 non-null  int64  
 7   Is_furnished  24773 non-null  int64  
 8   Is_serviced   24773 non-null  int64  
 9   Price         24773 non-null  object 
dtypes: float64(3), int64(3), object(4)
memory usage: 1.9+ MB


Most columns have appropriate data types, but the Price column has type object. This means that not all of its contents are numbers, as they should be, since price is a continuous variable. To correct this, we take a closer look at the Price column and list the entries that contain anything other than digits and decimal points.

In [5]:
# take closer look at price column
lag['Price'][lag['Price'].str.contains(r'[^\d.]', regex=True, na=False)]

105             $440000
134      130000000/year
963            $1100000
1024           $1500000
1112     300000000/year
              ...      
23869          $1400000
23897          $5000000
23899       1500000/sqm
24338          $2000000
24484          $1500000
Name: Price, Length: 408, dtype: object

As the output above shows, there are 408 dirty entries in the Price column. Some prices are quoted in dollars, and some carry suffixes such as '/year' or '/sqm', which suggests rental or per-square-metre pricing. This should not be the case, as the data is supposed to contain only properties for sale. Now to clean the Price column and change its data type to float. We use a list comprehension and the regular expressions 're' module to keep only the digits, multiplying dollar prices by 775, the conversion factor used in the code.

In [6]:
# clean price column: keep the first run of digits; dollar prices are multiplied by 775
lag["Price"] = [float(re.findall(r"\d+", val.split("$")[1])[0]) * 775 if "$" in val else float(re.findall(r"\d+", val)[0]) for val in lag['Price']]
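The same cleaning step can also be written with pandas' vectorized string methods. The following is only a sketch on a small hypothetical sample (the `prices` Series and the `USD_TO_NGN` name are assumptions for illustration, not part of the notebook); it mirrors the logic of the list comprehension rather than replacing it.

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of entries seen in the Price column
prices = pd.Series(["85000000", "$440000", "130000000/year", "1500000/sqm"])

USD_TO_NGN = 775  # assumed name; same conversion factor as the cell above

# First run of digits in each entry, converted to float
amount = prices.str.extract(r"(\d+)", expand=False).astype(float)

# Flag dollar-denominated entries (literal "$", not a regex anchor)
is_usd = prices.str.contains("$", regex=False)

# Keep the amount as is where the entry is not in dollars, otherwise convert it
cleaned = amount.where(~is_usd, amount * USD_TO_NGN)

print(cleaned.tolist())
# [85000000.0, 341000000.0, 130000000.0, 1500000.0]
```

Like the list comprehension, this keeps the '/year' and '/sqm' entries and only strips their suffixes; whether such rows belong in a for-sale dataset is a separate decision.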

If the code previously used to check the Price column for dirty data is run again, an AttributeError is raised, because the column no longer holds strings and the '.str' accessor only works on string values. The conversion to float succeeded because all non-numeric symbols were removed.
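A quick way to confirm both points is sketched below on a hypothetical stand-in for the cleaned column (the `price` Series is an assumption for illustration): the dtype is checked directly, and the '.str' accessor is shown to fail on numeric data.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned Price column
price = pd.Series([85000000.0, 341000000.0])

# Direct dtype check, without relying on an exception
print(pd.api.types.is_float_dtype(price))

# The .str accessor is only available on string data
try:
    price.str.contains(r"[^\d.]")
    raised = False
except AttributeError as err:
    raised = True
    print(type(err).__name__, err)
```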

Next up, check for missing values.

In [7]:
# check for missing values
print(lag.isna().any()) # to check if any columns have missing values
lag.isna().sum() # to get a sum of missing values per column

Description     False
Title           False
Location        False
Beds             True
Baths            True
Toilets          True
Is_new          False
Is_furnished    False
Is_serviced     False
Price           False
dtype: bool


Description       0
Title             0
Location          0
Beds            503
Baths           603
Toilets         619
Is_new            0
Is_furnished      0
Is_serviced       0
Price             0
dtype: int64

Note that while the 'any' method tells you whether a column has any null values, chaining 'sum' onto 'isna' counts the missing values in each column.

The above output shows that only 3 columns have missing values. Next, find out what percentage of each column those missing values make up.

In [8]:
# to check percentage of missing values per column
print(lag.isna().sum() / len(lag) * 100)

Description     0.000000
Title           0.000000
Location        0.000000
Beds            2.030436
Baths           2.434102
Toilets         2.498688
Is_new          0.000000
Is_furnished    0.000000
Is_serviced     0.000000
Price           0.000000
dtype: float64
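Since 'isna' returns booleans, the same percentages can be obtained by taking the mean of each column. This is a small sketch on a hypothetical toy frame (`toy` is an assumption for illustration) showing that both forms agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in Beds
toy = pd.DataFrame({
    "Beds": [4.0, np.nan, 3.0, 5.0],
    "Price": [85e6, 60e6, 150e6, 290e6],
})

# Form used in the cell above: count of missing values over row count
pct_sum = toy.isna().sum() / len(toy) * 100

# Equivalent form: the mean of a boolean column is the fraction of True values
pct_mean = toy.isna().mean() * 100

print(pct_mean)
```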


The percentage of missing values in those 3 columns does not exceed 3% of each column, so it is safe to conclude that either dropping the affected rows or imputing the values will have minimal impact on the outcome of any predictive analysis to be carried out. To drop the rows with missing values, use the 'dropna' method and set the appropriate parameters (here 'axis=0' for rows and 'inplace=True' to modify the data in place).

In [9]:
# drop rows with missing values
lag.dropna(axis=0, inplace=True)
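When only some columns are expected to contain gaps, 'dropna' also accepts a 'subset' parameter that restricts the check to those columns. The sketch below uses a hypothetical toy frame (`toy` is an assumption for illustration) and counts how many rows are removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in Beds and Baths
toy = pd.DataFrame({
    "Beds": [4.0, np.nan, 3.0, 5.0],
    "Baths": [5.0, 4.0, np.nan, 6.0],
    "Toilets": [5.0, 4.0, 4.0, 6.0],
    "Price": [85e6, 60e6, 150e6, 290e6],
})

before = len(toy)

# Drop any row that has a missing value in one of the listed columns
toy = toy.dropna(axis=0, subset=["Beds", "Baths", "Toilets"])

dropped = before - len(toy)
print(dropped)  # 2
```

Recording the row count before and after makes it easy to report how much data the step removed.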

Next, check for duplicate data. Use the 'duplicated' method with the 'sum' method chained to it to get the number of duplicate rows, as below:

In [21]:
# check for duplicates
lag.duplicated().sum()

0

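The check originally reported 2150 duplicate rows in the data. The 0 shown above appears to come from re-running the cell after the duplicates had been removed, as the cell numbers In [21] and In [20] suggest. The duplicate rows need to be dropped.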

In [20]:
# remove duplicates
lag.drop_duplicates(inplace=True)
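The before-and-after behaviour of the duplicate check can be sketched on a hypothetical toy frame (`toy` is an assumption for illustration): 'duplicated' flags every repeat of an earlier row, and the count falls to zero once 'drop_duplicates' has run.

```python
import pandas as pd

# Hypothetical toy frame in which the second row repeats the first
toy = pd.DataFrame({
    "Title": [
        "4 BEDROOM HOUSE FOR SALE",
        "4 BEDROOM HOUSE FOR SALE",
        "3 BEDROOM HOUSE FOR SALE",
    ],
    "Price": [85e6, 85e6, 150e6],
})

# Rows that exactly repeat an earlier row are flagged True
n_before = int(toy.duplicated().sum())

# Keep the first occurrence of each row and drop the repeats
toy = toy.drop_duplicates()

n_after = int(toy.duplicated().sum())
print(n_before, n_after)  # 1 0
```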

Leading and trailing whitespace can cause all sorts of errors in analysis, so it must be stripped from every column that holds string values.

In [16]:
# remove whitespace from string columns
for col in lag.columns:
    if lag[col].dtype == "object":
        lag[col] = lag[col].str.strip()
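A variant of the same loop is sketched below on a hypothetical toy frame (`toy` is an assumption for illustration). It uses `pd.api.types.is_string_dtype` instead of comparing against "object", which is one way to keep the check working if string columns are stored with a dedicated string dtype.

```python
import pandas as pd

# Hypothetical toy frame with padded strings and one numeric column
toy = pd.DataFrame({
    "Location": ["  Lekki Lagos", "Ikate Lekki Lagos  "],
    "Beds": [4.0, 5.0],
})

for col in toy.columns:
    # Only string columns expose the .str accessor
    if pd.api.types.is_string_dtype(toy[col]):
        toy[col] = toy[col].str.strip()

print(toy["Location"].tolist())
# ['Lekki Lagos', 'Ikate Lekki Lagos']
```

Note that 'strip' only removes whitespace at the start and end of each value; spaces inside a value are left untouched.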

Now one last look at the data before it is written to a new CSV file titled 'cleaned_houses.csv', found in the data folder of the master branch.

In [22]:
lag.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21937 entries, 0 to 24772
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Description   21937 non-null  object 
 1   Title         21937 non-null  object 
 2   Location      21937 non-null  object 
 3   Beds          21937 non-null  float64
 4   Baths         21937 non-null  float64
 5   Toilets       21937 non-null  float64
 6   Is_new        21937 non-null  int64  
 7   Is_furnished  21937 non-null  int64  
 8   Is_serviced   21937 non-null  int64  
 9   Price         21937 non-null  float64
dtypes: float64(4), int64(3), object(3)
memory usage: 1.8+ MB


In [23]:
# write clean data to csv file
lag.to_csv("cleaned_houses.csv", index=False)
