## Airbnb DC Hosting Helper ##

## 3_text_processing ##

### Summary ###

In this notebook, I turn the raw text attached to each listing into numerical features. I work with five host-provided fields: the listing name, its description, the neighborhood overview, the "about the host" text, and the amenities.

Import libraries and read in data

In [60]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk import FreqDist
import nltk

In [61]:
np.random.seed(100)

In [62]:
pd.set_option('display.max_columns', 300)

In [63]:
pd.set_option('display.max_rows', 300)

In [64]:
df = pd.read_csv('../data/cleaned_numerical_df.csv').drop(columns=['Unnamed: 0'])

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3652 entries, 0 to 3651
Data columns (total 47 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              3652 non-null   int64  
 1   name                            3652 non-null   object 
 2   description                     3609 non-null   object 
 3   neighborhood_overview           2796 non-null   object 
 4   host_id                         3652 non-null   int64  
 5   host_about                      2538 non-null   object 
 6   host_response_time              3652 non-null   object 
 7   host_response_rate              3652 non-null   float64
 8   host_acceptance_rate            3652 non-null   float64
 9   host_is_superhost               3652 non-null   int64  
 10  host_has_profile_pic            3652 non-null   int64  
 11  host_identity_verified          3652 non-null   int64  
 12  neighbourhood_cleansed          36

Below, I analyze the length and sentiment of the text provided for the listing name, description, neighborhood overview, and about the host.

Instantiate a tokenizer and run it on each column to split the strings into individual words and strip out punctuation. Also instantiate the sentiment analyzer.

A note about the sentiment scoring: the negative, neutral, and positive scores are straightforward. The compound score is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.
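To make the effect of the `\w+` pattern concrete, here is a minimal sketch of the lowercase, tokenize, and rejoin step. It uses `re.findall` from the standard library as a stand-in for `RegexpTokenizer(r'\w+')` so that it runs without NLTK; the helper name `normalize_text` is hypothetical and not part of this notebook.

```python
import re

def normalize_text(text):
    """Lowercase the text, keep only runs of word characters, rejoin with spaces."""
    # Assumption: re.findall(r"\w+", ...) is used here in place of
    # RegexpTokenizer(r'\w+').tokenize(...) to avoid the NLTK dependency.
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(tokens)

cleaned = normalize_text("Vita's Hideaway!")
print(cleaned)               # vita s hideaway
print(len(cleaned.split()))  # 3
```

Note that the apostrophe splits "vita's" into two tokens, so possessives and contractions add one to the word count.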
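As a rough illustration of what "normalized" means here: to my understanding, VADER's reference implementation squashes the summed, rule-adjusted word valences into the range -1 to 1 with x / sqrt(x² + α), using α = 15. The sketch below reproduces only that final step; the function name `normalize_compound` is hypothetical, and the valence sums passed in are made-up inputs rather than values from this dataset.

```python
import math

def normalize_compound(valence_sum, alpha=15):
    """Map an unbounded valence sum into [-1, 1] (assumed VADER-style normalization)."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clamp to guard against floating-point overshoot
    return max(-1.0, min(1.0, score))

for total in (0.0, 1.0, 4.0, -4.0):
    print(total, round(normalize_compound(total), 4))
```

Because the curve flattens as the sum grows, long and uniformly positive texts tend to land close to 1.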

In [66]:
tokenizer = RegexpTokenizer(r'\w+')

In [67]:
#nltk.download('vader_lexicon')

sentiment = SentimentIntensityAnalyzer()

In [68]:
df['name']= df['name'].fillna('')

df['name']= df['name'].str.lower()

df['name'] = df.apply(lambda row: ' '.join(tokenizer.tokenize(row['name'])), axis=1)

df['name_word_count'] = [len(i.split() ) for i in df['name'].tolist() ]

df['name_neutral_sentiment'] = [sentiment.polarity_scores(i)['neu'] for i in df['name']]

df['name_negative_sentiment'] = [sentiment.polarity_scores(i)['neg'] for i in df['name']]

df['name_positive_sentiment'] = [sentiment.polarity_scores(i)['pos'] for i in df['name']]

df['name_compound_sentiment'] = [sentiment.polarity_scores(i)['compound'] for i in df['name']]

In [69]:
df['description']= df['description'].fillna('')

df['description']= df['description'].str.lower()

df['description'] = df.apply(lambda row: ' '.join(tokenizer.tokenize(str(row['description']))), axis=1)

df['description_word_count'] = [len(i.split() ) for i in df['description'].tolist() ]

df['description_neutral_sentiment'] = [sentiment.polarity_scores(i)['neu'] for i in df['description']]

df['description_negative_sentiment'] = [sentiment.polarity_scores(i)['neg'] for i in df['description']]

df['description_positive_sentiment'] = [sentiment.polarity_scores(i)['pos'] for i in df['description']]

df['description_compound_sentiment'] = [sentiment.polarity_scores(i)['compound'] for i in df['description']]

In [70]:
df['neighborhood_overview']= df['neighborhood_overview'].fillna('')

df['neighborhood_overview']= df['neighborhood_overview'].str.lower()

df['neighborhood_overview'] = df.apply(lambda row: ' '.join(tokenizer.tokenize((str(row['neighborhood_overview'])))), axis=1)

df['neighborhood_overview_word_count'] = [len(i.split() ) for i in df['neighborhood_overview'].tolist() ]

df['neighborhood_overview_neutral_sentiment'] = [sentiment.polarity_scores(i)['neu'] for i in df['neighborhood_overview']]

df['neighborhood_overview_negative_sentiment'] = [sentiment.polarity_scores(i)['neg'] for i in df['neighborhood_overview']]

df['neighborhood_overview_positive_sentiment'] = [sentiment.polarity_scores(i)['pos'] for i in df['neighborhood_overview']]

df['neighborhood_overview_compound_sentiment'] = [sentiment.polarity_scores(i)['compound'] for i in df['neighborhood_overview']]

In [71]:
df['host_about']= df['host_about'].fillna('')

df['host_about']= df['host_about'].str.lower()

df['host_about'] = df.apply(lambda row: ' '.join(tokenizer.tokenize(str(row['host_about']))), axis=1)

df['host_about_word_count'] = [len(i.split() ) for i in df['host_about'].tolist() ]

df['host_about_neutral_sentiment'] = [sentiment.polarity_scores(i)['neu'] for i in df['host_about']]

df['host_about_negative_sentiment'] = [sentiment.polarity_scores(i)['neg'] for i in df['host_about']]

df['host_about_positive_sentiment'] = [sentiment.polarity_scores(i)['pos'] for i in df['host_about']]

df['host_about_compound_sentiment'] = [sentiment.polarity_scores(i)['compound'] for i in df['host_about']]
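The four cells above repeat the same steps and score every string four times. One way to tighten this, sketched below, is a single helper that cleans the text, calls the scorer once, and returns all five features. The names `text_features` and `fake_scorer` are hypothetical; `fake_scorer` is only a stand-in so the sketch runs without the VADER lexicon, and in this notebook `sentiment.polarity_scores` would be passed as `scorer` instead.

```python
import re

def text_features(text, scorer, prefix):
    """Return the word count and four sentiment scores for one text field."""
    raw = text if isinstance(text, str) else ""  # treat missing values as empty
    cleaned = " ".join(re.findall(r"\w+", raw.lower()))
    scores = scorer(cleaned)  # scored once, reused for all four columns
    return {
        f"{prefix}_word_count": len(cleaned.split()),
        f"{prefix}_neutral_sentiment": scores["neu"],
        f"{prefix}_negative_sentiment": scores["neg"],
        f"{prefix}_positive_sentiment": scores["pos"],
        f"{prefix}_compound_sentiment": scores["compound"],
    }

def fake_scorer(text):
    # Hypothetical stand-in that returns fixed scores for any input
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

features = text_features("Vita's Hideaway", fake_scorer, "name")
print(features)
```

Looping this helper over the four column names would produce the same 20 feature columns from a single definition.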

The first rows of the DataFrame, with all the added text columns, are printed below.

In [72]:
df.head()

Unnamed: 0,id,name,description,neighborhood_overview,host_id,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude_x,longitude_x,room_type,accommodates,bathrooms,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,availability_30,availability_60,availability_90,availability_365,instant_bookable,calculated_host_listings_count,historic site,museum,metro,music venue,perfomring arts venue,college and university,food,nightlife spot,outdoors and recreation,government building,clothing store,popular,days_being_host,days_since_first_review,days_since_last_review,name_word_count,name_neutral_sentiment,name_negative_sentiment,name_positive_sentiment,name_compound_sentiment,description_word_count,description_neutral_sentiment,description_negative_sentiment,description_positive_sentiment,description_compound_sentiment,neighborhood_overview_word_count,neighborhood_overview_neutral_sentiment,neighborhood_overview_negative_sentiment,neighborhood_overview_positive_sentiment,neighborhood_overview_compound_sentiment,host_about_word_count,host_about_neutral_sentiment,host_about_negative_sentiment,host_about_positive_sentiment,host_about_compound_sentiment
0,3686,vita s hideaway,important notes br carefully read and be sure ...,we love that our neighborhood is up and coming...,4645,i am a literary scholar teacher poet vegan che...,within a day,0.8,0.75,0,1,1,Historic Anacostia,38.86339,-76.98889,Private room,1,1.0,1.0,1.0,"[""First aid kit"", ""Long term stays allowed"", ""...",55.0,2,365,2.0,365.0,1,31,61,336,0,2,1,2,1,0,3,10,25,5,28,21,5,0,4610,2576,180,3,1.0,0.0,0.0,0.0,164,0.867,0.016,0.117,0.923,25,0.846,0.0,0.154,0.6369,180,0.818,0.0,0.182,0.9828
1,7103,best of washington great neighborhood parking,private guest suite with cathedral ceiling sur...,ideal and idyllic location br quiet safe stree...,17633,my business is luxbnb we offer short stay furn...,within an hour,1.0,0.93,0,1,1,"Spring Valley, Palisades, Wesley Heights, Foxh...",38.91999,-77.09774,Entire home/apt,2,1.0,1.0,2.0,"[""Cooking basics"", ""First aid kit"", ""Lockbox"",...",97.0,7,200,7.0,1125.0,9,20,50,140,0,43,2,1,0,0,1,40,18,4,29,6,0,1,4437,1912,16,6,0.325,0.0,0.675,0.8519,169,0.828,0.0,0.172,0.9806,157,0.823,0.029,0.148,0.9657,210,0.801,0.006,0.192,0.9911
2,9641,sophisticated logan circle loft,stay in or go out either way you ll enjoy this...,logan circle is a historic residential neighbo...,32067,a former technology executive and entrepreneur...,within a day,1.0,0.35,0,1,1,"Dupont Circle, Connecticut Avenue/K Street",38.90927,-77.03471,Entire home/apt,4,1.0,1.0,2.0,"[""Cooking basics"", ""Lockbox"", ""Dedicated works...",185.0,2,180,2.0,180.0,17,47,76,76,0,2,36,28,2,15,46,44,50,50,45,48,47,0,4346,2223,19,4,0.455,0.0,0.545,0.5574,174,0.884,0.0,0.116,0.9686,117,0.914,0.0,0.086,0.775,53,0.748,0.0,0.252,0.926
3,11785,sanctuary near cathedral,b the space b br an english basement like no o...,our neighborhood is informally known as cathed...,32015,i am a somewhat gregarious middle aged phd jd ...,within an hour,1.0,1.0,0,1,1,"Cathedral Heights, McLean Gardens, Glover Park",38.92622,-77.07591,Entire home/apt,4,1.0,1.0,3.0,"[""Cooking basics"", ""Dedicated workspace"", ""Lon...",125.0,1,365,1.0,1125.0,12,42,72,347,1,4,5,0,0,1,4,17,48,27,44,24,2,0,4347,3929,23,3,1.0,0.0,0.0,0.0,193,0.88,0.067,0.053,-0.5788,176,0.837,0.008,0.155,0.9826,141,0.836,0.0,0.164,0.9689
4,12442,peaches cream near cathedral,b the space b br life as it was in days gone b...,there is so much to love in cathedral heights ...,32015,i am a somewhat gregarious middle aged phd jd ...,within an hour,1.0,1.0,0,1,1,"Cathedral Heights, McLean Gardens, Glover Park",38.92756,-77.07667,Private room,2,1.5,1.0,1.0,"[""Cooking basics"", ""Dedicated workspace"", ""Lon...",61.0,1,365,1.0,1125.0,19,49,79,354,1,4,5,0,0,1,4,17,48,27,44,24,2,0,4347,2121,28,4,1.0,0.0,0.0,0.0,185,0.775,0.049,0.177,0.98,81,0.74,0.04,0.22,0.9435,141,0.836,0.0,0.164,0.9689


In [73]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3652 entries, 0 to 3651
Data columns (total 67 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   id                                        3652 non-null   int64  
 1   name                                      3652 non-null   object 
 2   description                               3652 non-null   object 
 3   neighborhood_overview                     3652 non-null   object 
 4   host_id                                   3652 non-null   int64  
 5   host_about                                3652 non-null   object 
 6   host_response_time                        3652 non-null   object 
 7   host_response_rate                        3652 non-null   float64
 8   host_acceptance_rate                      3652 non-null   float64
 9   host_is_superhost                         3652 non-null   int64  
 10  host_has_profile_pic                

Drop the original text columns from df.

In [74]:
df.drop(columns=['name', 'description', 'neighborhood_overview', 'host_about'], inplace=True)

In [75]:
df.shape

(3652, 63)

Clean the amenities column in df.

In [76]:
df['amenities'] = df.apply(lambda row: row['amenities'].strip('[]').replace('"','').split(', '), axis=1)

In [77]:
df['amenities'][0]

['First aid kit',
 'Long term stays allowed',
 'Patio or balcony',
 'Extra pillows and blankets',
 'Smoke alarm',
 'Shampoo',
 'Refrigerator',
 'Heating',
 'Indoor fireplace',
 'Hot water',
 'Stove',
 'Hangers',
 'Washer',
 'Bed linens',
 'Oven',
 'Kitchen',
 'Carbon monoxide alarm',
 'Essentials',
 'Coffee maker',
 'Dishwasher',
 'Dishes and silverware',
 'Free street parking',
 'Microwave',
 'Wifi',
 'Cooking basics',
 'Backyard',
 'Dryer']
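The raw amenities strings look like JSON arrays, so an alternative to stripping brackets and splitting on `', '` is to parse them with the standard `json` module. The sketch below assumes every cell is a valid JSON array of strings, which I have not verified for the whole dataset, so it falls back to the string-splitting approach when parsing fails. The helper name `parse_amenities` and the "Shampoo, conditioner" amenity are made up for illustration.

```python
import json

def parse_amenities(raw):
    """Parse an amenities cell such as '["Wifi", "Kitchen"]' into a list of names."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the string-based cleaning used in the cell above
        return raw.strip("[]").replace('"', "").split(", ")

print(parse_amenities('["First aid kit", "Wifi"]'))
print(parse_amenities('["Shampoo, conditioner", "Wifi"]'))  # comma inside a name survives
print(parse_amenities("[]"))  # empty list rather than ['']
```

The two approaches differ on edge cases: splitting on `', '` breaks apart any amenity whose name contains a comma, and it turns an empty list into `['']`.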

Create list of unique amenity options to loop through.

In [78]:
amenities_list = []

for i in df['amenities']:
    for j in i:
        amenities_list.append(j)

In [79]:
import collections

frequency = collections.Counter(amenities_list)

frequency.most_common()

[('Wifi', 3568),
 ('Smoke alarm', 3557),
 ('Essentials', 3537),
 ('Heating', 3478),
 ('Air conditioning', 3467),
 ('Hangers', 3356),
 ('Iron', 3343),
 ('Kitchen', 3235),
 ('Long term stays allowed', 3230),
 ('Hair dryer', 3210),
 ('Carbon monoxide alarm', 3136),
 ('Hot water', 3115),
 ('Shampoo', 3088),
 ('Dedicated workspace', 2914),
 ('Dishes and silverware', 2905),
 ('Microwave', 2870),
 ('Washer', 2837),
 ('Dryer', 2832),
 ('Fire extinguisher', 2830),
 ('Refrigerator', 2811),
 ('Coffee maker', 2791),
 ('Cooking basics', 2640),
 ('Private entrance', 2447),
 ('Bed linens', 2362),
 ('Stove', 2279),
 ('Oven', 2211),
 ('Free street parking', 2092),
 ('Dishwasher', 1977),
 ('First aid kit', 1939),
 ('Extra pillows and blankets', 1859),
 ('TV', 1802),
 ('Patio or balcony', 1304),
 ('Cable TV', 1301),
 ('Free parking on premises', 1261),
 ('Luggage dropoff allowed', 1257),
 ('TV with standard cable', 1248),
 ('Security cameras on property', 1079),
 ('Lockbox', 1065),
 ('Keypad', 1061),
 ('

In [80]:
most_common_amens = frequency.most_common(32)

most_common_amenities = [i for i,j in most_common_amens]

most_common_amenities

['Wifi',
 'Smoke alarm',
 'Essentials',
 'Heating',
 'Air conditioning',
 'Hangers',
 'Iron',
 'Kitchen',
 'Long term stays allowed',
 'Hair dryer',
 'Carbon monoxide alarm',
 'Hot water',
 'Shampoo',
 'Dedicated workspace',
 'Dishes and silverware',
 'Microwave',
 'Washer',
 'Dryer',
 'Fire extinguisher',
 'Refrigerator',
 'Coffee maker',
 'Cooking basics',
 'Private entrance',
 'Bed linens',
 'Stove',
 'Oven',
 'Free street parking',
 'Dishwasher',
 'First aid kit',
 'Extra pillows and blankets',
 'TV',
 'Patio or balcony']

The top 32 amenities across all listings are shown above.
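For reference, the flatten, count, and select steps can also be written without the explicit nested loop. This is a small sketch on made-up listings, not the notebook's data; `listings` and `top_two` are illustrative names.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for df['amenities']: one list of amenity names per listing
listings = [
    ["Wifi", "Kitchen"],
    ["Wifi", "Kitchen"],
    ["Wifi", "Heating"],
]

# chain.from_iterable flattens the list of lists lazily; Counter tallies each name
frequency = Counter(chain.from_iterable(listings))

# Keep only the names of the n most frequent amenities
top_two = [name for name, _count in frequency.most_common(2)]
print(top_two)  # ['Wifi', 'Kitchen']
```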

In [81]:
if 'Wifi' in df['amenities'][0]:
    print('True')
else:
    print('False')

True


In [82]:
for i in most_common_amenities:
    df[i.lower()] = ['']*df.shape[0]

In [83]:
df['amenities'][0]

['First aid kit',
 'Long term stays allowed',
 'Patio or balcony',
 'Extra pillows and blankets',
 'Smoke alarm',
 'Shampoo',
 'Refrigerator',
 'Heating',
 'Indoor fireplace',
 'Hot water',
 'Stove',
 'Hangers',
 'Washer',
 'Bed linens',
 'Oven',
 'Kitchen',
 'Carbon monoxide alarm',
 'Essentials',
 'Coffee maker',
 'Dishwasher',
 'Dishes and silverware',
 'Free street parking',
 'Microwave',
 'Wifi',
 'Cooking basics',
 'Backyard',
 'Dryer']

Loop through each listing's amenities and mark, in one column per amenity, whether the listing has each of the top 32 amenities (1) or not (0).

In [84]:
for i in range(0, len(df)):

    for j in most_common_amenities:

        # Write through df.loc so the value lands in df itself; chained
        # indexing (df[col][i] = ...) triggers SettingWithCopyWarning.
        if j in df['amenities'][i]:
            df.loc[i, j.lower()] = 1
        else:
            df.loc[i, j.lower()] = 0
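The same indicator columns can be built one whole column at a time, which avoids element-wise assignment altogether. Below is a minimal sketch on a toy frame; `toy` and `top_amenities` are stand-ins for `df` and `most_common_amenities`, and the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the notebook's df
toy = pd.DataFrame({
    "amenities": [["Wifi", "Kitchen"], ["Wifi"], ["Heating", "Kitchen"]],
})
top_amenities = ["Wifi", "Kitchen", "Heating"]  # stand-in for most_common_amenities

for amenity in top_amenities:
    # 1 if the listing offers the amenity, else 0; binding a=amenity fixes the
    # value for this iteration of the loop
    toy[amenity.lower()] = toy["amenities"].apply(
        lambda items, a=amenity: int(a in items)
    )

print(toy[["wifi", "kitchen", "heating"]])
```

Assigning a full column also gives the new columns an integer dtype directly, rather than starting from a column of empty strings.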


In [85]:
df.head()

Unnamed: 0,id,host_id,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude_x,longitude_x,room_type,accommodates,bathrooms,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,availability_30,availability_60,availability_90,availability_365,instant_bookable,calculated_host_listings_count,historic site,museum,metro,music venue,perfomring arts venue,college and university,food,nightlife spot,outdoors and recreation,government building,clothing store,popular,days_being_host,days_since_first_review,days_since_last_review,name_word_count,name_neutral_sentiment,name_negative_sentiment,name_positive_sentiment,name_compound_sentiment,description_word_count,description_neutral_sentiment,description_negative_sentiment,description_positive_sentiment,description_compound_sentiment,neighborhood_overview_word_count,neighborhood_overview_neutral_sentiment,neighborhood_overview_negative_sentiment,neighborhood_overview_positive_sentiment,neighborhood_overview_compound_sentiment,host_about_word_count,host_about_neutral_sentiment,host_about_negative_sentiment,host_about_positive_sentiment,host_about_compound_sentiment,wifi,smoke alarm,essentials,heating,air conditioning,hangers,iron,kitchen,long term stays allowed,hair dryer,carbon monoxide alarm,hot water,shampoo,dedicated workspace,dishes and silverware,microwave,washer,dryer,fire extinguisher,refrigerator,coffee maker,cooking basics,private entrance,bed linens,stove,oven,free street parking,dishwasher,first aid kit,extra pillows and blankets,tv,patio or balcony
0,3686,4645,within a day,0.8,0.75,0,1,1,Historic Anacostia,38.86339,-76.98889,Private room,1,1.0,1.0,1.0,"[First aid kit, Long term stays allowed, Patio...",55.0,2,365,2.0,365.0,1,31,61,336,0,2,1,2,1,0,3,10,25,5,28,21,5,0,4610,2576,180,3,1.0,0.0,0.0,0.0,164,0.867,0.016,0.117,0.923,25,0.846,0.0,0.154,0.6369,180,0.818,0.0,0.182,0.9828,1,1,1,1,0,1,0,1,1,0,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,0,1
1,7103,17633,within an hour,1.0,0.93,0,1,1,"Spring Valley, Palisades, Wesley Heights, Foxh...",38.91999,-77.09774,Entire home/apt,2,1.0,1.0,2.0,"[Cooking basics, First aid kit, Lockbox, Dedic...",97.0,7,200,7.0,1125.0,9,20,50,140,0,43,2,1,0,0,1,40,18,4,29,6,0,1,4437,1912,16,6,0.325,0.0,0.675,0.8519,169,0.828,0.0,0.172,0.9806,157,0.823,0.029,0.148,0.9657,210,0.801,0.006,0.192,0.9911,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,1,1,0,0
2,9641,32067,within a day,1.0,0.35,0,1,1,"Dupont Circle, Connecticut Avenue/K Street",38.90927,-77.03471,Entire home/apt,4,1.0,1.0,2.0,"[Cooking basics, Lockbox, Dedicated workspace,...",185.0,2,180,2.0,180.0,17,47,76,76,0,2,36,28,2,15,46,44,50,50,45,48,47,0,4346,2223,19,4,0.455,0.0,0.545,0.5574,174,0.884,0.0,0.116,0.9686,117,0.914,0.0,0.086,0.775,53,0.748,0.0,0.252,0.926,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,1,0,0,1,1,1,1,0,0,0,1
3,11785,32015,within an hour,1.0,1.0,0,1,1,"Cathedral Heights, McLean Gardens, Glover Park",38.92622,-77.07591,Entire home/apt,4,1.0,1.0,3.0,"[Cooking basics, Dedicated workspace, Long ter...",125.0,1,365,1.0,1125.0,12,42,72,347,1,4,5,0,0,1,4,17,48,27,44,24,2,0,4347,3929,23,3,1.0,0.0,0.0,0.0,193,0.88,0.067,0.053,-0.5788,176,0.837,0.008,0.155,0.9826,141,0.836,0.0,0.164,0.9689,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,1,0,1
4,12442,32015,within an hour,1.0,1.0,0,1,1,"Cathedral Heights, McLean Gardens, Glover Park",38.92756,-77.07667,Private room,2,1.5,1.0,1.0,"[Cooking basics, Dedicated workspace, Long ter...",61.0,1,365,1.0,1125.0,19,49,79,354,1,4,5,0,0,1,4,17,48,27,44,24,2,0,4347,2121,28,4,1.0,0.0,0.0,0.0,185,0.775,0.049,0.177,0.98,81,0.74,0.04,0.22,0.9435,141,0.836,0.0,0.164,0.9689,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,0,1,0,1,0,1


In [89]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3652 entries, 0 to 3651
Data columns (total 95 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   id                                        3652 non-null   int64  
 1   host_id                                   3652 non-null   int64  
 2   host_response_time                        3652 non-null   object 
 3   host_response_rate                        3652 non-null   float64
 4   host_acceptance_rate                      3652 non-null   float64
 5   host_is_superhost                         3652 non-null   int64  
 6   host_has_profile_pic                      3652 non-null   int64  
 7   host_identity_verified                    3652 non-null   int64  
 8   neighbourhood_cleansed                    3652 non-null   object 
 9   latitude_x                                3652 non-null   float64
 10  longitude_x                         

This confirms there are no more nulls and the final data is ready to be exported. 

In [90]:
df.to_csv('../data/final_df.csv')