#### 2.2.1.f Treatment for columns containing unstructured data

Now we need to revisit what to do with the columns containing unstructured data. 

After reading around, I decided that a simple text processing approach could be used to transform the unstructured text into feature vectors that a machine learning model can process. In other words, I needed to do basic feature extraction on those columns, turning the text into vectors that serve as input to my model.
I referred to [this article](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/) for extensive guidance on how to do this.

First, I extract just the relevant columns and determine which columns would be useful for treatment. Referring to the data dictionary above, we had earlier determined that name, summary, space, description, neighborhood_overview, and transit contain unstructured text data. 

In [2]:
listings_sample.loc[['name','summary', 'space', 'description', 'neighborhood_overview','transit']]

NameError: name 'listings_sample' is not defined

Looking at the sample values above, name may not contain much useful information for prediction, so to keep things simple I decided to drop it. 

The space, summary and description columns also appear to overlap. I decided to keep description rather than summary because it shows more variation (i.e. more unique values), and to drop summary and space from the analysis.
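
One way to check a variation claim like this is to compare the number of distinct values per column. The following is a minimal sketch on made-up toy data; the `toy` frame and its values are hypothetical and are not taken from the listings dataset.

```python
import pandas as pd

# Hypothetical toy listings (assumption: the real data lives in `listings`)
toy = pd.DataFrame({
    "summary": ["Cozy studio", "Cozy studio", None, "Cozy studio"],
    "description": [
        "Cozy studio near the lake",
        "Bright loft",
        "Garden flat",
        "Cozy studio with a view",
    ],
})

# Distinct non-null values per column, plus the share of missing values
variation = pd.DataFrame({
    "n_unique": toy.nunique(),
    "pct_missing": toy.isna().mean().round(2),
})
print(variation)
```

A column with few distinct values or many missing entries is a weaker candidate for text features than one where almost every row differs.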

neighborhood_overview may also overlap with the neighborhood column, since it contains descriptions of the neighbourhood. Regardless, I decided to keep it as it may contain information that affects guests' ratings.

In [None]:
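#extract subset of columns for simple text processing from dataset
#.copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
listings_txt=listings[['description','neighborhood_overview','transit']].copy()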
listings_txt.head(5)

##### 2.2.1.f (i) Preprocessing of text

In [None]:
#1. Lower case: transform all columns to lowercase
listings_txt=listings_txt.apply(lambda x: x.str.lower())

In [None]:
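#2. Remove punctuation (regex=True is needed because recent pandas versions treat the pattern literally by default)
listings_txt=listings_txt.apply(lambda x: x.str.replace(r'[^\w\s]','',regex=True))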
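
To see what the pattern `[^\w\s]` does in practice, here is a small sketch on a hypothetical toy Series (the sentences are invented for illustration). It passes `regex=True` explicitly so that the pattern is interpreted as a regular expression rather than a literal string.

```python
import pandas as pd

# Hypothetical example texts, including a missing value
toy = pd.Series(["Walk to Pike Place, 5 min!", "Quiet & cozy (sleeps 2).", None])

# Lowercase first, then strip every character that is neither a word character nor whitespace
cleaned = toy.str.lower().str.replace(r"[^\w\s]", "", regex=True)
print(cleaned.tolist())
```

Note that digits count as word characters, so they survive this step, and that a removed symbol standing between two spaces leaves a double space behind. Splitting on whitespace later takes care of that.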

#3. Removing stopwords and numbers from the columns

We should remove all stopwords (commonly occurring words) and numbers from the text, as they do not carry useful information. I used a predefined stopword list (NLTK's English list) for the removal.

In [None]:
#3. Removal of Stop Words
from nltk.corpus import stopwords  #requires nltk.download('stopwords') once

#add stopwords to corpus to exclude (a set makes membership checks fast)
stop = set(stopwords.words('english'))

#create function for re-use
def no_stop(col_no_stop,col):
    listings_txt[col_no_stop] = listings_txt[col].apply(lambda x: ' '.join(word for word in str(x).split() if word not in stop))

#apply function to every original column, writing the result to <column>_no_stop
for col in list(listings_txt.columns):
    no_stop(col+'_no_stop',col)

listings_txt.head(5)
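
The text above also calls for removing numbers, which the stopword cell does not do by itself. A possible sketch follows, assuming that "numbers" means purely numeric tokens; the `drop_numbers` helper and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def drop_numbers(text: str) -> str:
    """Remove tokens made up only of digits, keeping mixed tokens such as '2br'."""
    return " ".join(tok for tok in str(text).split() if not tok.isdigit())

# Hypothetical example texts
toy = pd.Series(["bus 49 stops 2 blocks away", "2br apartment sleeps 4"])
no_numbers = toy.apply(drop_numbers)
print(no_numbers.tolist())
```

Whether to keep mixed tokens like `2br` is a judgment call; a stricter variant could drop any token that contains a digit.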

#4.Common word removal

Check most frequent common words in the data and make a call to retain or to drop.

In [None]:
#create function for re-use
def freq(col):
    freq = pd.Series(' '.join(listings_txt[col]).split()).value_counts()[:30]
    return col,freq
#check most frequent words
[freq(col) for col in [col for col in listings_txt if col.endswith('stop')]]

We can see that across all three columns, words like guest, home, house, seattle and downtown are generic and not useful, so they can be dropped.

For the description column, words like neighborhood, room, kitchen, bedroom, home, bed, space and apartment are all very generic and so should be dropped.

For the neighborhood_overview column, the word neighborhood is obviously generic and can be dropped. The others look okay to keep, however.

For the transit column, words like away, minutes and street are very generic, so should be dropped along with minute.

In [None]:
#drop selected words

desc_freq = ['neighborhood','guest','home','house','seattle','downtown','room', 'kitchen', 'bathroom','bedroom', 'bed', 'space', 'apartment' ] 
neigh_freq=['guest','home','house','seattle','neighborhood','downtown','nan']
transit_freq=['guest','home','house','seattle','away','minutes','minute','street','nan','downtown']

#pair each *_no_stop column with its drop list (order: description, neighborhood_overview, transit)
for col,drop_words in zip([col for col in listings_txt if col.endswith('stop')],[desc_freq,neigh_freq,transit_freq]):
    listings_txt[col] = listings_txt[col].apply(lambda text: " ".join(word for word in text.split() if word not in drop_words))

In [None]:
#5. Rare words removal

#create function for re-use
def rare(col):
    rare = pd.Series(' '.join(listings_txt[col]).split()).value_counts()[-10:]
    return rare

#check rare words
[rare(col) for col in [col for col in listings_txt if col.endswith('stop')]]

#add rare words to lists
desc_rare=[]
neigh_rare=[]
transit_rare=[]

for x,y in zip([col for col in listings_txt if col.endswith('stop')]
               ,[desc_rare,neigh_rare,transit_rare]):
    y.extend(rare(x).index)  #extend in place; rebinding y would leave the lists empty

#remove rare words
for col,rare_words in zip([col for col in listings_txt if col.endswith('stop')]
               ,[desc_rare,neigh_rare,transit_rare]):
    listings_txt[col] = listings_txt[col].apply(lambda text: " ".join(word for word in text.split() if word not in rare_words))
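
Taking the last ten entries of `value_counts()` works, but when many words are tied at the lowest count it only picks up a small slice of them. An alternative sketch, which is not what the notebook does, drops every word below a minimum count. The names `remove_rare` and `min_count` and the toy data are hypothetical.

```python
from collections import Counter

import pandas as pd

def remove_rare(series: pd.Series, min_count: int = 2) -> pd.Series:
    """Drop words that occur fewer than `min_count` times across the whole column."""
    counts = Counter(" ".join(series).split())
    rare = {word for word, n in counts.items() if n < min_count}
    return series.apply(lambda text: " ".join(w for w in text.split() if w not in rare))

# Hypothetical example texts; 'zipcar' appears only once
toy = pd.Series(["light rail nearby", "bus nearby", "light rail bus zipcar"])
trimmed = remove_rare(toy)
print(trimmed.tolist())
```

The threshold is a tuning choice: a higher `min_count` shrinks the vocabulary further but may discard words that matter for a handful of listings.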

In [None]:
#Lemmatization (Word comes from the textblob package)
from textblob import Word
for col in [col for col in listings_txt if col.endswith('stop')]:
    listings_txt[col] = listings_txt[col].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

listings_txt=listings_txt[[col for col in listings_txt if col.endswith('stop')]]

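#Bag of words: CountVectorizer expects an iterable of strings, so fit it on one text column
#(passing the whole DataFrame would iterate over the column names instead of the rows)
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = "word")
train_bow = bow.fit_transform(listings_txt['description_no_stop'])
train_bow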

In [3]:
# #convert all 'nan' into np.nan
# listings_txt=listings_txt.replace('nan', np.nan).head(5)

In [None]:
import collections  #needed for Counter; counts the 1,000 most common words in description
lst1=collections.Counter(" ".join(listings_test["description"].dropna()).split()).most_common(1000)