<a href="https://colab.research.google.com/github/skannah/realestatedata/blob/main/part_1_nlp_real_estate_description_training_dataset_pynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1 | NLP Real Estate Description Training Dataset
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/16fgew9rSX1pZYtrqZMu83F0_uUCrDEYG?usp=sharing)

## Overview
| Detail Tag            | Information                                                                                        |
|-----------------------|----------------------------------------------------------------------------------------------------|
| Originally Created By | Ariel Herrera arielherrera@analyticsariel.com |
| External References   | Open AI API |
| Input Datasets        | Source name |
| Output Datasets       | Source name |
| Input Data Source     | Pandas DataFrame |
| Output Data Source    | String |

## History
| Date         | Developed By  | Reason                                                |
|--------------|---------------|-------------------------------------------------------|
| 13th May 2023 | Ariel Herrera | Create notebook. |

## Getting Started
1. Copy this notebook -> File -> Save a Copy in Drive
2. Directions

## Useful Resources
- [Google Colab Cheat Sheet](https://towardsdatascience.com/cheat-sheet-for-google-colab-63853778c093)
- [Stemming vs Lemmatization](https://www.analyticsvidhya.com/blog/2022/06/stemming-vs-lemmatization-in-nlp-must-know-differences/#:~:text=Stemming%20is%20a%20process%20that,'%20would%20return%20'Car'.)

## <font color="blue">Install Packages</font>

## <font color="blue">Imports</font>

In [None]:
from google.colab import files
from getpass import getpass
import io
import pandas as pd
import plotly.express as px
import string
import nltk # NLTK is a leading platform for building Python programs to work with human language data
from nltk.tokenize import word_tokenize # used for parsing a large amount of textual data into parts to perform an analysis of the character of the text
from nltk.stem import WordNetLemmatizer # Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item
from collections import Counter

wordnet_lemmatizer = WordNetLemmatizer()
pd.set_option("display.max_columns", None)

In [None]:
nltk.download('punkt') # tokenizer models that divide a text into a list of sentences using an unsupervised algorithm
nltk.download('stopwords') # common words that are not useful for describing the topic of the content
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## <font color="blue">Functions</font>

In [None]:
def tokens_to_sentence(text):
    return ' '.join(text)

def remove_punctuation(text):
    no_punct=[w for w in text if w not in string.punctuation]
    words_wo_punct=''.join(no_punct)
    return words_wo_punct

def tokenize_text(text):
    return word_tokenize(text)

def lower_case(text):
    return [w.lower() for w in text]

def remove_stopwords(text):
    stopword = nltk.corpus.stopwords.words('english')
    return [w for w in text if w not in stopword]

def lemmatize_text(text):
    return [wordnet_lemmatizer.lemmatize(w, pos="v") for w in text]
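A side effect of `remove_punctuation` is worth seeing in isolation: because punctuation is deleted rather than replaced with a space, hyphenated or slash-separated words fuse into one token (which is why tokens such as `asis` and `flipbuy` show up in the normalized text). The sketch below reproduces that behavior with the standard library only. `replace_punctuation_with_space` and the sample sentence are hypothetical, added purely for comparison, and are not part of the notebook.

```python
import string

def remove_punctuation(text):
    # Same logic as the notebook's function: drop every punctuation character.
    return ''.join(ch for ch in text if ch not in string.punctuation)

def replace_punctuation_with_space(text):
    # Hypothetical alternative (not in the notebook): turn punctuation into
    # spaces so joined words stay separate, then collapse repeated whitespace.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())

sample = "Sold as-is. Fix & flip/buy and hold!"  # made-up listing text

print(remove_punctuation(sample).split())
# ['Sold', 'asis', 'Fix', 'flipbuy', 'and', 'hold']
print(replace_punctuation_with_space(sample).split())
# ['Sold', 'as', 'is', 'Fix', 'flip', 'buy', 'and', 'hold']
```

Any keyword list used downstream has to match whichever convention is chosen; with the notebook's approach, a keyword such as `asis` must be written without the hyphen.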

In [None]:
def filter_nlp_prop_detail(df, remove_nulls=True, remove_duplicates=True, min_words=1):
    # expects df to already contain a 'word_count' column

    if remove_nulls:
        # remove properties without addresses (likely new build)
        df = df.loc[df['Street'].notnull()]
        # remove properties without descriptions (likely new construction)
        df = df.loc[df['Description'].notnull()]

    if remove_duplicates:
        df = df.drop_duplicates()

    # filter based on min words
    df = df.loc[df['word_count'] >= min_words]

    return df

In [None]:
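def hard_code_labels(df, distressed_keywords, remodeled_keywords):
    # work on a copy so a filtered slice passed in by the caller is not modified
    df = df.copy()
    distressed_label_list = []
    for d in df['norm_desc'].tolist():
        distressed_opt = "none"
        # distressed keywords are checked first and take precedence (substring match)
        for w in distressed_keywords:
            if w in d:
                distressed_opt = "distressed"
                break

        # only look for remodeled keywords when no distressed keyword was found
        if distressed_opt == "none":
            for w in remodeled_keywords:
                if w in d:
                    distressed_opt = "not-distressed"
                    break
        distressed_label_list.append(distressed_opt)

    df['label'] = distressed_label_list
    return df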
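The precedence rule inside `hard_code_labels` is easier to see on a single string. The following is a minimal sketch, not the notebook's implementation: `label_description` is a hypothetical helper, the keyword lists are short subsets of the lists used in this notebook, and the example descriptions are made up.

```python
def label_description(desc, distressed_keywords, remodeled_keywords):
    # Distressed keywords are checked first; any hit wins outright.
    if any(k in desc for k in distressed_keywords):
        return "distressed"
    # Remodeled keywords only matter when no distressed keyword matched.
    if any(k in desc for k in remodeled_keywords):
        return "not-distressed"
    return "none"

# Illustrative subsets of the notebook's keyword lists
distressed = ['tlc', 'asis', 'potential']
remodeled = ['new', 'granite', 'charm']

examples = [
    "charm home new roof sell asis",  # hits both lists, distressed wins
    "charm home granite counter",     # hits only the remodeled list
    "tax deed property sale",         # hits neither list
]
for e in examples:
    print(label_description(e, distressed, remodeled))
# distressed
# not-distressed
# none
```

Because a single distressed keyword overrides any number of remodeled keywords, the resulting labels lean toward `distressed`, which is worth keeping in mind during manual review.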

## <font color="blue">Data</font>


In [None]:
# upload file
uploaded = files.upload()

Saving Birmingham-Properties-Data.csv to Birmingham-Properties-Data (2).csv


In [None]:
# transform file into pandas dataframe
_df = pd.read_csv(io.BytesIO(uploaded[list(uploaded.keys())[0]]))
print('Number of rows:', len(_df))
print('Number of columns:', len(_df.columns))
_df.head()

Number of rows: 329
Number of columns: 33


Unnamed: 0,Street,City,State,ZIP Code,Image,Price,House Type,Year Built,Sq. Footage,Lot Size,Bedrooms,Bathrooms,Description,Down Payment %,Interest Rate %,Loan Years,Loan Amount,Payment Months,Closing Costs %,Total Renovation Cost,Rent,Other Revenue,Mortgage Payment,HOA Fees,Insurance,Property Taxes,Vacancy Rate Allocation,Management Fee,Maintenance,Other Costs,Initial Costs,Monthly Profit,Cash on Cash
0,220 6th St,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/64d6e6e7193...,8500,Single Family,1935,1049,2613,3,1,"charming 3-bedroom, 1-bathroom home boasts 1,0...",0.2,0.07,30,6800.0,360,0.03,0,950,0,45.24,0,2,4.321,79,87,17,0,1955.0,715.44,439.14
1,2136 47th St W,Birmingham,AL,35208,https://photos.zillowstatic.com/fp/176be8318d5...,10000,Single Family,1971,892,7405,2,1,NEW PRICE! Incredible Investment Opportunity! ...,0.2,0.07,30,8000.0,360,0.03,0,844,0,53.22,0,3,5.083,70,77,15,0,2300.0,620.44,323.71
2,811 4th St,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/8553ae74cc5...,12000,Single Family,1930,1271,10454,3,1,"3 bedroom, 1 bathroom tax deed property for sale.",0.2,0.07,30,9600.0,360,0.03,0,944,0,63.87,0,4,6.1,79,87,17,0,2760.0,687.03,298.71
3,812 Avenue H,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/72bde8e1ceb...,12000,Single Family,1925,1288,6534,3,1,** 14k **. INVESTORS DON'T LET THIS ONE GET ...,0.2,0.07,30,9600.0,360,0.03,0,938,0,63.87,0,4,6.5,78,86,17,0,2760.0,682.63,296.8
4,516 5th Way,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/e8053444165...,14000,Single Family,1935,1332,6969,3,1,Calling all Builders!! This home caught fire o...,0.2,0.07,30,11200.0,360,0.03,0,925,0,74.51,0,4,7.583,77,85,17,0,3220.0,659.9,245.93


In [None]:
# make a copy of the original dataset
df = _df.copy()

# apply normalization functions on text dataset
df['norm_desc'] = df['Description'].apply(lambda x: remove_punctuation(str(x)))
df['norm_desc'] = df['norm_desc'].apply(lambda x: tokenize_text(x))
df['norm_desc'] = df['norm_desc'].apply(lambda x: lower_case(x))
df['norm_desc'] = df['norm_desc'].apply(lambda x: remove_stopwords(x))
df['norm_desc'] = df['norm_desc'].apply(lambda x: lemmatize_text(x))
df['norm_desc'] = df['norm_desc'].apply(lambda x: tokens_to_sentence(x))
df.head()

Unnamed: 0,Street,City,State,ZIP Code,Image,Price,House Type,Year Built,Sq. Footage,Lot Size,Bedrooms,Bathrooms,Description,Down Payment %,Interest Rate %,Loan Years,Loan Amount,Payment Months,Closing Costs %,Total Renovation Cost,Rent,Other Revenue,Mortgage Payment,HOA Fees,Insurance,Property Taxes,Vacancy Rate Allocation,Management Fee,Maintenance,Other Costs,Initial Costs,Monthly Profit,Cash on Cash,norm_desc
0,220 6th St,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/64d6e6e7193...,8500,Single Family,1935,1049,2613,3,1,"charming 3-bedroom, 1-bathroom home boasts 1,0...",0.2,0.07,30,6800.0,360,0.03,0,950,0,45.24,0,2,4.321,79,87,17,0,1955.0,715.44,439.14,charm 3bedroom 1bathroom home boast 1043 squar...
1,2136 47th St W,Birmingham,AL,35208,https://photos.zillowstatic.com/fp/176be8318d5...,10000,Single Family,1971,892,7405,2,1,NEW PRICE! Incredible Investment Opportunity! ...,0.2,0.07,30,8000.0,360,0.03,0,844,0,53.22,0,3,5.083,70,77,15,0,2300.0,620.44,323.71,new price incredible investment opportunity wa...
2,811 4th St,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/8553ae74cc5...,12000,Single Family,1930,1271,10454,3,1,"3 bedroom, 1 bathroom tax deed property for sale.",0.2,0.07,30,9600.0,360,0.03,0,944,0,63.87,0,4,6.1,79,87,17,0,2760.0,687.03,298.71,3 bedroom 1 bathroom tax deed property sale
3,812 Avenue H,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/72bde8e1ceb...,12000,Single Family,1925,1288,6534,3,1,** 14k **. INVESTORS DON'T LET THIS ONE GET ...,0.2,0.07,30,9600.0,360,0.03,0,938,0,63.87,0,4,6.5,78,86,17,0,2760.0,682.63,296.8,14k investors dont let one get fix flipbuy hol...
4,516 5th Way,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/e8053444165...,14000,Single Family,1935,1332,6969,3,1,Calling all Builders!! This home caught fire o...,0.2,0.07,30,11200.0,360,0.03,0,925,0,74.51,0,4,7.583,77,85,17,0,3220.0,659.9,245.93,call builders home catch fire 7th february com...


In [None]:
# add a word count per prop desc
df['word_count'] = df.apply(lambda x: len(x['norm_desc'].split(' ')), axis=1)
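One detail of counting with `split(' ')`: an empty string still yields one element, so a description that normalizes to nothing is counted as one word. The sketch below compares it with `split()` (no argument) on plain strings. Both helper names are hypothetical and only illustrate the difference.

```python
def word_count_split_space(text):
    # Mirrors the expression used in the notebook: len(text.split(' '))
    return len(text.split(' '))

def word_count_split_any(text):
    # split() with no argument ignores empty strings and runs of whitespace
    return len(text.split())

for text in ["", "fix flip", "fix  flip"]:
    print(repr(text), word_count_split_space(text), word_count_split_any(text))
# '' 1 0
# 'fix flip' 2 2
# 'fix  flip' 3 2
```

With a threshold such as `min_words = 10` the difference is harmless, but with the default `min_words = 1` an empty normalized description would pass the filter.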

In [None]:
px.histogram(df, x='word_count')

In [None]:
# view top words
full_text = ' '.join(s for s in df['norm_desc'].to_list())
full_text_split = full_text.split(' ')
word_counts = Counter(full_text_split)
df_word_counts = pd.DataFrame.from_dict(word_counts, orient='index').reset_index()
df_word_counts.columns = ['word', 'count']
df_word_counts.sort_values(by=['count'], ascending=False).head(50)

Unnamed: 0,word,count
3,home,418
28,new,187
15,property,172
12,room,152
108,bath,146
100,great,139
31,opportunity,136
220,floor,119
292,portfolio,113
454,tenant,113
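For a quick look at the top words without building a DataFrame, `Counter.most_common` returns the same ranking directly. This is a small sketch on made-up descriptions, not the notebook's data.

```python
from collections import Counter

# Made-up normalized descriptions, for illustration only
descriptions = [
    "new home great room",
    "home need tlc",
    "great home new roof",
]

# Count every whitespace-separated token across all descriptions
word_counts = Counter(word for d in descriptions for word in d.split())

print(word_counts.most_common(1))
# [('home', 3)]
print(word_counts['new'], word_counts['tlc'])
# 2 1
```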


In [None]:
# bottom 10
df_word_counts.sort_values(by=['count'], ascending=False).tail(10)

Unnamed: 0,word,count
1171,centre,1
1172,upcoming,1
1175,fun,1
1176,skate,1
1177,rink,1
1178,jump,1
1179,esports,1
1180,bowl,1
175,wealth,1
1953,frameless,1


### <font color="blue">Filter on Dataset</font>

In [None]:
# criteria
remove_nulls = True
remove_duplicates = True
min_words = 10

In [None]:
df_filter = filter_nlp_prop_detail(df=df,
                       remove_nulls=remove_nulls,
                       remove_duplicates=remove_duplicates,
                       min_words=min_words)

In [None]:
print('Count of records post filter: {0}\n Prct of records retained: {1} %'\
      .format(len(df_filter), round((len(df_filter) / len(df)) * 100, 2)))
df_filter.head(1)

Count of records post filter: 297
 Prct of records retained: 90.27 %


Unnamed: 0,Street,City,State,ZIP Code,Image,Price,House Type,Year Built,Sq. Footage,Lot Size,Bedrooms,Bathrooms,Description,Down Payment %,Interest Rate %,Loan Years,Loan Amount,Payment Months,Closing Costs %,Total Renovation Cost,Rent,Other Revenue,Mortgage Payment,HOA Fees,Insurance,Property Taxes,Vacancy Rate Allocation,Management Fee,Maintenance,Other Costs,Initial Costs,Monthly Profit,Cash on Cash,norm_desc,word_count
0,220 6th St,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/64d6e6e7193...,8500,Single Family,1935,1049,2613,3,1,"charming 3-bedroom, 1-bathroom home boasts 1,0...",0.2,0.07,30,6800.0,360,0.03,0,950,0,45.24,0,2,4.321,79,87,17,0,1955.0,715.44,439.14,charm 3bedroom 1bathroom home boast 1043 squar...,31
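As a sanity check on the retention figure, the arithmetic can be reproduced from the two row counts printed in this notebook (329 rows loaded, 297 rows after filtering). The variable names are illustrative.

```python
total_rows = 329     # rows in the uploaded dataset
retained_rows = 297  # rows left after filtering

removed_rows = total_rows - retained_rows
pct_retained = round(retained_rows / total_rows * 100, 2)

print(f"Removed: {removed_rows} rows")
# Removed: 32 rows
print(f"Retained: {pct_retained} %")
# Retained: 90.27 %
```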


## Preliminary Labels
Assign preliminary, keyword-based labels to each description to speed up manual review. Distressed keywords take precedence over remodeled keywords.

In [None]:
distressed_keywords = ['tlc', 'asis', 'investor', 'investment', 'repairs', 'opportunity',
                       'potential', 'cash', 'fixer upper', 'handyman', 'unfinished', 'vrbo',
                       'airbnb', 'cosmetic', 'fix flip', 'tenant', 'rehab', 'income produce',
                       'rental', 'rent', 'repair', 'must sell', 'quick sell'
                       ]
remodeled_keywords = ['update', 'new', 'upgrade', 'beautiful', 'love', 'reno', 'movein ready',
                      'gorgeous', 'nice', 'maintained', 'clean', 'adorable', 'remarkable',
                      'granite', 'quartz', 'well maintain', 'moveinready', 'redone', 'remodel',
                      'home ready', 'well keep', 'stainless', 'island', 'hardwood', 'move ready',
                      'entertain', 'charm']
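These keywords are matched with Python's `in` operator, which is a substring test: `rent` also matches inside `current`, and `new` inside `renew`. The sketch below contrasts that with a whole-word match built on `re`. `contains_whole_word` is a hypothetical helper and the sample text is made up. It is shown as one possible alternative, not as what the notebook does.

```python
import re

def contains_whole_word(keyword, text):
    # \b anchors the match at word boundaries; re.escape guards against
    # regex metacharacters inside the keyword.
    return re.search(r'\b' + re.escape(keyword) + r'\b', text) is not None

text = "current owner renew lease"  # made-up normalized description

print('rent' in text, contains_whole_word('rent', text))
# True False
print('new' in text, contains_whole_word('new', text))
# True False

# Multi-word keywords still work with word boundaries
print(contains_whole_word('fix flip', "rental fix flip use caution"))
# True

# Trade-off: whole-word matching no longer catches derived forms
print('love' in "lovely home", contains_whole_word('love', "lovely home"))
# True False
```

Substring matching is more forgiving of word variants such as `lovely` or `charmer`, at the cost of occasional false hits. Whole-word matching reverses that trade-off.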

In [None]:
df_label = hard_code_labels(df_filter, distressed_keywords, remodeled_keywords)
df_label.head(5)

Unnamed: 0,Street,City,State,ZIP Code,Image,Price,House Type,Year Built,Sq. Footage,Lot Size,Bedrooms,Bathrooms,Description,Down Payment %,Interest Rate %,Loan Years,Loan Amount,Payment Months,Closing Costs %,Total Renovation Cost,Rent,Other Revenue,Mortgage Payment,HOA Fees,Insurance,Property Taxes,Vacancy Rate Allocation,Management Fee,Maintenance,Other Costs,Initial Costs,Monthly Profit,Cash on Cash,norm_desc,word_count,label
0,220 6th St,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/64d6e6e7193...,8500,Single Family,1935,1049,2613,3,1,"charming 3-bedroom, 1-bathroom home boasts 1,0...",0.2,0.07,30,6800.0,360,0.03,0,950,0,45.24,0,2,4.321,79,87,17,0,1955.0,715.44,439.14,charm 3bedroom 1bathroom home boast 1043 squar...,31,distressed
1,2136 47th St W,Birmingham,AL,35208,https://photos.zillowstatic.com/fp/176be8318d5...,10000,Single Family,1971,892,7405,2,1,NEW PRICE! Incredible Investment Opportunity! ...,0.2,0.07,30,8000.0,360,0.03,0,844,0,53.22,0,3,5.083,70,77,15,0,2300.0,620.44,323.71,new price incredible investment opportunity wa...,50,distressed
3,812 Avenue H,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/72bde8e1ceb...,12000,Single Family,1925,1288,6534,3,1,** 14k **. INVESTORS DON'T LET THIS ONE GET ...,0.2,0.07,30,9600.0,360,0.03,0,938,0,63.87,0,4,6.5,78,86,17,0,2760.0,682.63,296.8,14k investors dont let one get fix flipbuy hol...,10,distressed
4,516 5th Way,Birmingham,AL,35214,https://photos.zillowstatic.com/fp/e8053444165...,14000,Single Family,1935,1332,6969,3,1,Calling all Builders!! This home caught fire o...,0.2,0.07,30,11200.0,360,0.03,0,925,0,74.51,0,4,7.583,77,85,17,0,3220.0,659.9,245.93,call builders home catch fire 7th february com...,15,none
5,3669 42nd Ave N,Birmingham,AL,35207,https://photos.zillowstatic.com/fp/901efc70ddb...,15000,Single Family,1948,876,5227,3,1,This one has potential with fenced large yard....,0.2,0.07,30,12000.0,360,0.03,0,874,0,79.84,0,4,7.625,73,80,16,0,3450.0,613.54,213.4,one potential fence large yard would make grea...,11,distressed


#### <font color="purple">Distressed</font>

In [None]:
# iterate through those labeled distressed
i = 1
for d in df_label.loc[df_label['label'] == "distressed"]['norm_desc'].tolist()[:5]:
    print('{0}) {1}\n'.format(i, d))
    i += 1

1) charm 3bedroom 1bathroom home boast 1043 square feet live space provide plenty room make mark property need renovations offer tremendous upside potential elbow grease home transform perfect dream home investment property

2) new price incredible investment opportunity walk distance new birmingham crossplex potential airbnb rental fix flip use caution stairs inside go past caution tape seller would like sell 3 properties together properties vacant sell asis 420 grant street mls 1344881 2136 47th street mls 1344880 9321 9th ave n mls 1344883

3) 14k investors dont let one get fix flipbuy hold cash

4) one potential fence large yard would make great rental starter home

5) possibilities endless 3 bed 1 bath home offer 992 square feet customizable live space sell asis make perfect opportunity investors homebuyers alike



#### <font color="purple">Non-Distressed</font>

In [None]:
# iterate through those labeled not-distressed
i = 1
for d in df_label.loc[df_label['label'] == "not-distressed"]['norm_desc'].tolist()[:5]:
    print('{0}) {1}\n'.format(i, d))
    i += 1

1) charm home locate desirable south highlands neighborhood birmingham alabama boast 2 bedrooms 1 bathroom driveway offstreet park outside backyard fence provide private outdoor space relaxation recreation

2) lovely home level yard home orginal hardwood floor plenty space whole family home last long hurry go

3) appear good bone price sell come see attractive 3 bedroom 1 bathroom home west end highland neighborhood home feature fence yard new windows five year old hvac system hardwood floor look level could easily shine den back house nice bonus add live space room spread full bathroom tile pedestal sink look functional right redemption may apply buyer agent confirm list data tax estimate prior owner tax exempt

4) lovely 1930s charmer need new owner home full character original millwork staircase primary bedroom main level three bedrooms bright daylight den upstairs home need proud owner restore original grandeur know owners good community deserve new owner proud service community su

#### <font color="purple">None</font>
Descriptions that matched no keyword in either list, so no label was assigned.

In [None]:
# iterate through those with no label
i = 1
for d in df_label.loc[df_label['label'] == "none"]['norm_desc'].tolist()[:10]:
    print('{0}) {1}\n'.format(i, d))
    i += 1

1) call builders home catch fire 7th february come view property see bring back pristine form

2) 2 bedroom 1 bathroom tax deed property great location neighborhood

3) tax deed property access inside property view property outside right buyer take place seller quiet titleejectment action perform

4) nvestors delight fire damage brick house great lot great location 3 br 1 5 bath den screen porch great cover 2 car detach carport sell use precaution enter risk

5) small house perfect look comfortable affordable home locate quiet neighborhood offer peaceful retreat hustle bustle city life 2 bed 1 bath

6) house 3 bedrooms 1 bathroom cozy functional home perfect small family group individuals

7) fixerupper south titusville home blank slate ready make 2 bedrooms 1 full bath woodburning fireplace cover front porch minutes uab downtown birmingham schedule show today

8) quiet serene neighborhood 3br1ba diamond rough spruce make great home conveniently locate near shop malls interstate

9) bu

# End Notebook