<a href="https://colab.research.google.com/github/qsquentinsmith/minnesota_real_estate_analysis/blob/main/real_estate_description_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

%matplotlib inline

from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input

warnings.filterwarnings("ignore", category=FutureWarning)

## Get Seller Description Data from Real Estate Website for Sentiment Analysis

In [None]:
# Load the scraped property listings from the CSV file in the forked scraper repo
# Source: https://github.com/JacobSampson/mls_scraper, forked to https://github.com/qsquentinsmith/mls_scraper to add more features
df_properties = pd.read_csv('https://raw.githubusercontent.com/qsquentinsmith/mls_scraper/main/output/properties.csv')

In [None]:
# copy data frame for description analysis
df_properties_description = df_properties.copy()
df_properties_description = df_properties_description[['pid', 'description']]

In [None]:
# no truncation
pd.set_option('display.max_colwidth', None)

In [None]:
df_properties_description.head(5)

Unnamed: 0,pid,description
0,pid,
1,5708235,Description: Simplygrandresidence-unforgettablesetting-spectacularviewsallaround!Spaciousandopenmain-floorlivingshowcasesrichfinishes:finearchitecturaldetailsandbeautifulupdates.LuxuriousOwner'squarterswithspabath:hugewalk-inclosetandadjoiningoffice(ornursery!)Formallivinganddiningroomsopento2-storygreatroom:oversizedgourmetkitchenandfabuloussunroomwithtreetopviews!Walkouttoextensivemultileveldeckswithcustompergolasurroundedintotalprivacy!Accessupperlevelfromgrandcurvedstaircaseorsecondstairwellto3bedrooms:2baths:loft-opentothemainlevel.Lowerlevelfeaturesexecutive-styleoffice:hugefamily/amusementandgamerooms:exerciseroom:5thbedroom:bathandstorageroom!Walkouttodeck:paveredpatioandyardoverlookingthepondandwoodswithbeautifullandscapingandnatureallaround!Trulyasanctuarysetting-.83acreatquietendofculdesac:CharlesCuddbuilt.Pleaseseesupplement.
2,5618755,Description: Youwillfallinlovewiththisstunninglygorgeoushomeandallofitsmanyfeatures!Beautifulsparklingpool:multiplefireplaces:stainlesssteelappliances:hugemaintenancefreedeck:patio:gorgeouswoodfloors:basementwithheatedfloors:andameticulouslykeptgaragewithabeautifulapartmentaboveit.Apartmenthasaseparateentrance:heatedfloors:steamshower:separatefurnaceandAC:maintenancefreedeck:fireplace:lotsofnaturallight:andstainlesssteelappliances.Thishomehasitall:youwillnotbedisappointed!
3,5501312,Description: Builtforfamily:friendsandgraciouseaseofliving.Highlightsofthe6000finsqfthomeincludesachef'skitchen:5bedrooms:hobbyroom:billiardsroom:saunaandprivateoffice.
4,Themainfloormasterismoreofasanctuarythanabedroomwithagasfireplace.Thehugewalk-inshowerandsoakingtubareahavenofcomfortandrelaxation.Theoutdoorsincludesabelowgroundswimmingpool:ahottubandanoutdoorkitchenareawithpatio.Awelldesignedstudioin-lawsuitewithakitchen:privatebath:laundryroomandseparateentrance.Outbuildingsincludeafitnesscenterwithadancestudioandrockclimbingwall.Bringyourhorses..horsereadyfacilities.,


In [None]:
# Drop the first row, which holds a repeated header ('pid') rather than a listing
df_properties_description = df_properties_description.iloc[1:]

In [None]:
# Check data types
df_properties_description.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138867 entries, 1 to 138867
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   pid          138867 non-null  object
 1   description  100509 non-null  object
dtypes: object(2)
memory usage: 2.1+ MB


In [None]:
# Cast both columns to strings for cleaning (missing descriptions become the string 'nan')
df_properties_description = df_properties_description.astype('str')
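One side effect of the cast above is that a missing description turns into the text 'nan', which can later be glued onto real text when fragments are merged back. The sketch below shows one possible alternative on made-up data: fill missing values with an empty string before casting. The frame `toy` and its values are illustrative assumptions, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing description
toy = pd.DataFrame({
    "pid": ["5708235", "5618755"],
    "description": ["Description: Simplygrandresidence", np.nan],
})

# Replace missing values with an empty string, then cast to str
toy["description"] = toy["description"].fillna("").astype(str)

print(toy["description"].tolist())
```

With this variant, the later "drop rows without a description" step would compare against `''` instead of `'nan'`.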

In [None]:
# Add a column with the length of pid. A valid pid has 7 characters; other lengths mark description text that spilled into the pid column
df_properties_description['pid_length'] = df_properties_description['pid'].apply(len)

In [None]:
df_properties_description.head(5)

Unnamed: 0,pid,description,pid_length
1,5708235,Description: Simplygrandresidence-unforgettablesetting-spectacularviewsallaround!Spaciousandopenmain-floorlivingshowcasesrichfinishes:finearchitecturaldetailsandbeautifulupdates.LuxuriousOwner'squarterswithspabath:hugewalk-inclosetandadjoiningoffice(ornursery!)Formallivinganddiningroomsopento2-storygreatroom:oversizedgourmetkitchenandfabuloussunroomwithtreetopviews!Walkouttoextensivemultileveldeckswithcustompergolasurroundedintotalprivacy!Accessupperlevelfromgrandcurvedstaircaseorsecondstairwellto3bedrooms:2baths:loft-opentothemainlevel.Lowerlevelfeaturesexecutive-styleoffice:hugefamily/amusementandgamerooms:exerciseroom:5thbedroom:bathandstorageroom!Walkouttodeck:paveredpatioandyardoverlookingthepondandwoodswithbeautifullandscapingandnatureallaround!Trulyasanctuarysetting-.83acreatquietendofculdesac:CharlesCuddbuilt.Pleaseseesupplement.,7
2,5618755,Description: Youwillfallinlovewiththisstunninglygorgeoushomeandallofitsmanyfeatures!Beautifulsparklingpool:multiplefireplaces:stainlesssteelappliances:hugemaintenancefreedeck:patio:gorgeouswoodfloors:basementwithheatedfloors:andameticulouslykeptgaragewithabeautifulapartmentaboveit.Apartmenthasaseparateentrance:heatedfloors:steamshower:separatefurnaceandAC:maintenancefreedeck:fireplace:lotsofnaturallight:andstainlesssteelappliances.Thishomehasitall:youwillnotbedisappointed!,7
3,5501312,Description: Builtforfamily:friendsandgraciouseaseofliving.Highlightsofthe6000finsqfthomeincludesachef'skitchen:5bedrooms:hobbyroom:billiardsroom:saunaandprivateoffice.,7
4,Themainfloormasterismoreofasanctuarythanabedroomwithagasfireplace.Thehugewalk-inshowerandsoakingtubareahavenofcomfortandrelaxation.Theoutdoorsincludesabelowgroundswimmingpool:ahottubandanoutdoorkitchenareawithpatio.Awelldesignedstudioin-lawsuitewithakitchen:privatebath:laundryroomandseparateentrance.Outbuildingsincludeafitnesscenterwithadancestudioandrockclimbingwall.Bringyourhorses..horsereadyfacilities.,,408
5,5628081,Description: AmazingVictorianinspiredhomelocatedinBlaineon11.11acres;homehasplentyofroomforentertaining.Theindoorpool(3ftto8ftdeep)withhottubandsaunaandoutdoorlivingareaswillbringyearsofenjoymenttoafamilyandtheirfriends.Allfivebedroomsarelocatedupstairsonthe,7


In [None]:
# Main cleaning loop. Walk the rows backwards so that description text that spilled
# into the pid column is moved up, row by row, to the listing it belongs to.
# Column positions: 0 = pid, 1 = description, 2 = pid_length (7 for a valid pid)
for i in range(len(df_properties_description)-1, 0, -1):
  if df_properties_description.iloc[i, 2] != 7:
    if df_properties_description.iloc[i-1, 2] != 7:
      # Previous row is also a fragment: chain the fragments together in its pid column
      df_properties_description.iloc[i-1, 0] = df_properties_description.iloc[i-1, 0] + df_properties_description.iloc[i, 0]
    else:
      # Previous row is a real listing: append the fragment(s) to its description
      df_properties_description.iloc[i-1, 1] = df_properties_description.iloc[i-1, 1] + df_properties_description.iloc[i, 0]

In [None]:
# The spilled text now lives in the description column, so we can drop every row whose pid is not the standard 7 characters long
df_properties_description = df_properties_description[df_properties_description['pid_length'] == 7]

In [None]:
# Also drop rows without a description (missing values became the string 'nan' in the cast above)
df_properties_description = df_properties_description[df_properties_description['description'] != 'nan']

In [None]:
# wordninja splits concatenated text back into separate words
!pip3 install wordninja



In [None]:
import wordninja

description = wordninja.split(df_properties_description.iloc[0, 1])

In [None]:
print(description)

['Description', 'Simply', 'grand', 'residence', 'unforgettable', 'setting', 'spectacular', 'views', 'all', 'around', 'Spacious', 'and', 'open', 'main', 'floor', 'living', 'showcases', 'rich', 'finishes', 'fine', 'architectural', 'details', 'and', 'beautiful', 'updates', 'Luxurious', "Owner's", 'quarters', 'with', 'spa', 'bath', 'huge', 'walk', 'in', 'closet', 'and', 'adjoining', 'office', 'or', 'nursery', 'Formal', 'living', 'and', 'dining', 'rooms', 'open', 'to', '2', 'story', 'great', 'room', 'oversized', 'gourmet', 'kitchen', 'and', 'fabulous', 'sunroom', 'with', 'treetop', 'views', 'Walkout', 'to', 'extensive', 'multi', 'level', 'decks', 'with', 'custom', 'pergola', 'surrounded', 'in', 'total', 'privacy', 'Access', 'upper', 'level', 'from', 'grand', 'curved', 'staircase', 'or', 'second', 'stairwell', 'to', '3', 'bedrooms', '2', 'baths', 'loft', 'open', 'to', 'the', 'main', 'level', 'Lower', 'level', 'features', 'executive', 'style', 'office', 'huge', 'family', 'amusement', 'and', '
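To give a feel for how a dictionary-based splitter can recover word boundaries, here is a toy sketch using dynamic programming over a tiny hand-made vocabulary. It is not wordninja's actual implementation. The vocabulary, the cost formula and the function name `segment` are assumptions chosen for illustration only.

```python
from math import log

# Hypothetical toy vocabulary, ordered from most to least frequent
VOCAB = ["and", "with", "pool", "hot", "tub", "sauna", "indoor"]

# Zipf-style cost: words later in the list (rarer) cost more
COST = {word: log((rank + 1) * log(len(VOCAB) + 1)) for rank, word in enumerate(VOCAB)}
MAX_LEN = max(len(word) for word in VOCAB)
UNKNOWN_COST = 1e9  # large penalty for pieces that are not in the vocabulary


def segment(text):
    """Split lowercase text into the cheapest sequence of vocabulary words."""
    # best[i] = (total cost, start index of the last word) for text[:i]
    best = [(0.0, 0)]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - MAX_LEN), end):
            piece = text[start:end]
            candidates.append((best[start][0] + COST.get(piece, UNKNOWN_COST), start))
        best.append(min(candidates))

    # Walk backwards through the table to recover the chosen words
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return list(reversed(words))


print(segment("indoorpoolwithhottubandsauna"))
```

A real splitter relies on a much larger frequency-ranked word list, which is why rare names and abbreviations in the listings (for example "Cud d" or "fins q ft" in the cleaned output) can still be split incorrectly.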

In [None]:
# Convert the list of words back into a single space-separated string
def listToString(s):
    return " ".join(s)

In [None]:
list_description_clean = []

In [None]:
# Split each description into words, rejoin them with spaces, and collect the results in a list
for i in range(0, len(df_properties_description)):
  list_description_clean.append(listToString(wordninja.split(df_properties_description.iloc[i, 1])))
  

In [None]:
print(df_properties_description.shape)
print(len(list_description_clean))

(100530, 3)
100530


In [None]:
# Creates new column and copies the list into it
df_properties_description['description_clean'] = list_description_clean

In [None]:
# Since we have a cleaned description column we no longer need description
del df_properties_description['description']

In [None]:
df_properties_description.head(5)

Unnamed: 0,pid,pid_length,description_clean
1,5708235,7,Description Simply grand residence unforgettable setting spectacular views all around Spacious and open main floor living showcases rich finishes fine architectural details and beautiful updates Luxurious Owner's quarters with spa bath huge walk in closet and adjoining office or nursery Formal living and dining rooms open to 2 story great room oversized gourmet kitchen and fabulous sunroom with treetop views Walkout to extensive multi level decks with custom pergola surrounded in total privacy Access upper level from grand curved staircase or second stairwell to 3 bedrooms 2 baths loft open to the main level Lower level features executive style office huge family amusement and game rooms exercise room 5 th bedroom bath and storage room Walkout to deck pave red patio and yard overlooking the pond and woods with beautiful landscaping and nature all around Truly a sanctuary setting 83 acre at quiet end of cul de sac Charles Cud d built Please see supplement
2,5618755,7,Description You will fallin love with this stunningly gorgeous home and all of its many features Beautiful sparkling pool multiple fireplaces stainless steel appliances huge maintenance free deck patio gorgeous wood floors basement with heated floors and a meticulously kept garage with a beautiful apartment above it Apartment has a separate entrance heated floors steam shower separate furnace and AC maintenance free deck fireplace lots of natural light and stainless steel appliances This home has it all you will not be disappointed
3,5501312,7,Description Built for family friends and gracious ease of living Highlights of the 6000 fins q ft home includes a chef's kitchen 5 bedrooms hobby room billiards room sauna and private office The main floor master is more of a sanctuary than a bedroom with a gas fireplace The huge walk in shower and soaking tub area haven of comfort and relaxation The outdoors includes a below ground swimming pool a hot tub and an outdoor kitchen area with patio A well designed studio in law suite with a kitchen private bath laundry room and separate entrance Outbuildings include a fitness center with a dance studio and rock climbing wall Bring your horses horse ready facilities
5,5628081,7,Description Amazing Victorian inspired home located in Blaine on 11 11 acres home has plenty of room for entertaining The indoor pool 3 ft to 8 ft deep with hot tub and sauna and outdoor living areas will bring years of enjoyment to a family and their friends All five bedrooms are located upstairs on the same level Upper floor has 9 ft ceilings Main level includes gourmet kitchen with custom antique white cabinets Wolf range and sub zero refrigerator This is a must see property Nature seekers outdoor seekers or seeking privacy this maybe your perfect property that awaits you
9,5502025,7,Description Rustic with a modern flare Welcome home to the Reggie Award winning Brooke model by One Ten Ten Homes We strive for functionality and this home has it all Gourmet kitchen with a 9 foot island a wrapped kitchen with 52 inch cabinets a farmhouse sink Beams barn doors and so much more This home has a junior suite jack and jill bath and a master bath you could only dream of Come today you won't regret it
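Every cleaned description in the output above still begins with the scraper's "Description" label. If that label is unwanted for sentiment analysis, it could be removed with a small helper like the one sketched here. This step is not part of the original notebook, and the name `strip_label` is a hypothetical choice.

```python
import re

# Matches a leading "Description" label plus any colon or whitespace after it
_LABEL = re.compile(r"^Description\b[:\s]*")


def strip_label(text):
    """Remove a leading 'Description' label and surrounding whitespace."""
    return _LABEL.sub("", text.strip())


print(strip_label("Description Simply grand residence "))
print(strip_label("Rustic with a modern flare"))
```

On a data frame, the helper could be applied with `df['description_clean'].map(strip_label)`.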


In [None]:
df_properties_description.dtypes

pid                  object
pid_length            int64
description_clean    object
dtype: object

In [None]:
# copy data frame to change type
df = df_properties_description.copy()
df = df[['pid', 'description_clean']]

In [None]:
df.dtypes

pid                  object
description_clean    object
dtype: object

In [None]:
# Keep only the rows whose pid consists solely of digits
df = df[df['pid'].str.isnumeric()]

In [None]:
# Every remaining pid is numeric, so we can cast it to int for merging in the analysis notebook
df['pid'] = df['pid'].astype(int)
df['description_clean'] = df['description_clean'].astype(str)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('drive')

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [None]:
# Write the cleaned data to a CSV file and copy it to Google Drive
df.to_csv('cleaned_housing_description_data_v1.csv')
!cp cleaned_housing_description_data_v1.csv "drive/My Drive/"
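By default `to_csv` also writes the data frame index as an extra unnamed column, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted in the analysis notebook, passing `index=False` avoids it. The sketch below demonstrates this on made-up rows with an in-memory buffer instead of a real file; the names `toy`, `buffer` and `restored` are illustrative.

```python
import io

import pandas as pd

# Toy frame (hypothetical values) with a non-default index, as left behind by row filtering
toy = pd.DataFrame(
    {
        "pid": [5708235, 5618755],
        "description_clean": ["Description Simply grand", "Description You will"],
    },
    index=[1, 2],
)

# Write without the index, then read the CSV text back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
```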

The description data is now cleaned! In the analysis notebook, we will merge it with the image data and the cleaned metadata data frame.