# [3920] Homework #3 - Text Mining
Data file:
* https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/Seattle_hotels.csv



## Homework Submission Rules (for all homework assignments)
* Homework is due by 2:30 PM on the due date
  * No late submissions will be accepted
* Verify that you are submitting the correct homework file
* Homework file naming convention
  * LastName_FirstName_HwX.ipynb  [Replace X with the homework #]
    * 1 point deducted for submitting homework that does not comply with the naming convention
* Before submission, execute "Kernel -> Restart Kernel and Run All Cells"
  * 1 point deducted for not submitting a cleanly executed notebook

## Homework #3 Requirements
* Load data and examine data
* Clean data: 1) remove punctuation, 2) lowercase, 3) stem or lemmatize
* Vectorize cleaned data
* Generate similarities matrix
* Generate hotel recommendations for the 3 listed hotels
  * Motel 6 Seattle Sea-Tac Airport South
  * The Bacon Mansion Bed and Breakfast
  * Holiday Inn Seattle Downtown

In [1]:
from datetime import datetime
print(f'Run time: {datetime.now().strftime("%D %T")}')

Run time: 11/09/23 19:39:34


### Import libraries

In [2]:
import pandas as pd
import re
import string
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /Users/timsmac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/timsmac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Load data

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/Seattle_hotels.csv')
df.head()

Unnamed: 0,name,address,desc
0,Hilton Garden Seattle Downtown,"1821 Boren Avenue, Seattle Washington 98101 USA","Located on the southern tip of Lake Union, the..."
1,Sheraton Grand Seattle,"1400 6th Avenue, Seattle, Washington 98101 USA","Located in the city's vibrant core, the Sherat..."
2,Crowne Plaza Seattle Downtown,"1113 6th Ave, Seattle, WA 98101","Located in the heart of downtown Seattle, the ..."
3,Kimpton Hotel Monaco Seattle,"1101 4th Ave, Seattle, WA98101",What?s near our hotel downtown Seattle locatio...
4,The Westin Seattle,"1900 5th Avenue,�Seattle,�Washington�98101�USA",Situated amid incredible shopping and iconic a...


### Examine data

In [5]:
df.shape

(152, 3)

In [6]:
df.head()

Unnamed: 0,name,address,desc
0,Hilton Garden Seattle Downtown,"1821 Boren Avenue, Seattle Washington 98101 USA","Located on the southern tip of Lake Union, the..."
1,Sheraton Grand Seattle,"1400 6th Avenue, Seattle, Washington 98101 USA","Located in the city's vibrant core, the Sherat..."
2,Crowne Plaza Seattle Downtown,"1113 6th Ave, Seattle, WA 98101","Located in the heart of downtown Seattle, the ..."
3,Kimpton Hotel Monaco Seattle,"1101 4th Ave, Seattle, WA98101",What?s near our hotel downtown Seattle locatio...
4,The Westin Seattle,"1900 5th Avenue,�Seattle,�Washington�98101�USA",Situated amid incredible shopping and iconic a...


### Prepare data

In [7]:
# Drop column address
df = df.drop(columns=["address"])
df.head()

Unnamed: 0,name,desc
0,Hilton Garden Seattle Downtown,"Located on the southern tip of Lake Union, the..."
1,Sheraton Grand Seattle,"Located in the city's vibrant core, the Sherat..."
2,Crowne Plaza Seattle Downtown,"Located in the heart of downtown Seattle, the ..."
3,Kimpton Hotel Monaco Seattle,What?s near our hotel downtown Seattle locatio...
4,The Westin Seattle,Situated amid incredible shopping and iconic a...


#### Clean column hotel descriptions
1) remove punctuation
2) lowercase text
3) either stem or lemmatize text

In [8]:
punct = string.punctuation

# Create function to remove punctuation
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in punct])
    return text_nopunct

# Remove punctuation from data
df['desc_clean'] = df['desc'].apply(lambda x: remove_punct(x))
df.head()

Unnamed: 0,name,desc,desc_clean
0,Hilton Garden Seattle Downtown,"Located on the southern tip of Lake Union, the...",Located on the southern tip of Lake Union the ...
1,Sheraton Grand Seattle,"Located in the city's vibrant core, the Sherat...",Located in the citys vibrant core the Sheraton...
2,Crowne Plaza Seattle Downtown,"Located in the heart of downtown Seattle, the ...",Located in the heart of downtown Seattle the a...
3,Kimpton Hotel Monaco Seattle,What?s near our hotel downtown Seattle locatio...,Whats near our hotel downtown Seattle location...
4,The Westin Seattle,Situated amid incredible shopping and iconic a...,Situated amid incredible shopping and iconic a...
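
A side note on the punctuation step: `string.punctuation` covers ASCII punctuation only, so typographic quotes or other non-ASCII symbols in a description pass through unchanged. The sketch below is a hypothetical variant (`remove_punct_fast` is not part of the notebook) that does the same job with `str.translate` and shows which characters survive. The sample strings are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punct_fast(text):
    """Hypothetical alternative to remove_punct: one pass with str.translate."""
    return text.translate(_PUNCT_TABLE)

sample = "What?s near our hotel: the city's core, award-winning!"
print(remove_punct_fast(sample))
# Whats near our hotel the citys core awardwinning

# A typographic apostrophe (U+2019) is not in string.punctuation, so it is kept.
curly = "city\u2019s"
print(remove_punct_fast(curly) == curly)
# True
```

If non-ASCII symbols matter for the recommendations, they would need to be handled in a separate step; whether that is worthwhile depends on how often they occur in the descriptions.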


In [9]:
# Lowercase data 
df['desc_clean_lower'] = df['desc_clean'].apply(lambda x: x.lower())
df.head()

Unnamed: 0,name,desc,desc_clean,desc_clean_lower
0,Hilton Garden Seattle Downtown,"Located on the southern tip of Lake Union, the...",Located on the southern tip of Lake Union the ...,located on the southern tip of lake union the ...
1,Sheraton Grand Seattle,"Located in the city's vibrant core, the Sherat...",Located in the citys vibrant core the Sheraton...,located in the citys vibrant core the sheraton...
2,Crowne Plaza Seattle Downtown,"Located in the heart of downtown Seattle, the ...",Located in the heart of downtown Seattle the a...,located in the heart of downtown seattle the a...
3,Kimpton Hotel Monaco Seattle,What?s near our hotel downtown Seattle locatio...,Whats near our hotel downtown Seattle location...,whats near our hotel downtown seattle location...
4,The Westin Seattle,Situated amid incredible shopping and iconic a...,Situated amid incredible shopping and iconic a...,situated amid incredible shopping and iconic a...


In [10]:
# Create function to tokenize data
def tokenize(text):
    # \W+ matches one or more non-word characters (anything other than
    # letters, digits, and underscore), so the text is split on those runs.
    tokens = re.split(r'\W+', text)
    return tokens

# Tokenize data (the text was already lowercased in the previous cell)
df['desc_tokenized'] = df['desc_clean_lower'].apply(tokenize)
df.head()

Unnamed: 0,name,desc,desc_clean,desc_clean_lower,desc_tokenized
0,Hilton Garden Seattle Downtown,"Located on the southern tip of Lake Union, the...",Located on the southern tip of Lake Union the ...,located on the southern tip of lake union the ...,"[located, on, the, southern, tip, of, lake, un..."
1,Sheraton Grand Seattle,"Located in the city's vibrant core, the Sherat...",Located in the citys vibrant core the Sheraton...,located in the citys vibrant core the sheraton...,"[located, in, the, citys, vibrant, core, the, ..."
2,Crowne Plaza Seattle Downtown,"Located in the heart of downtown Seattle, the ...",Located in the heart of downtown Seattle the a...,located in the heart of downtown seattle the a...,"[located, in, the, heart, of, downtown, seattl..."
3,Kimpton Hotel Monaco Seattle,What?s near our hotel downtown Seattle locatio...,Whats near our hotel downtown Seattle location...,whats near our hotel downtown seattle location...,"[whats, near, our, hotel, downtown, seattle, l..."
4,The Westin Seattle,Situated amid incredible shopping and iconic a...,Situated amid incredible shopping and iconic a...,situated amid incredible shopping and iconic a...,"[situated, amid, incredible, shopping, and, ic..."
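
To make the splitting behavior concrete, here is a minimal sketch on a made-up string. It shows that `re.split` with `\W+` produces an empty token whenever the text starts or ends with a separator, which is one plausible source of the `''` entry that appears in the TF-IDF vocabulary later in the notebook. The `re.findall` variant is an alternative for comparison, not what the notebook uses.

```python
import re

text = "located on the southern tip of lake union "

# \W+ is the separator pattern: runs of non-word characters.
# A separator at the very end of the string leaves an empty trailing token.
split_tokens = re.split(r'\W+', text)
print(split_tokens)
# ['located', 'on', 'the', 'southern', 'tip', 'of', 'lake', 'union', '']

# Matching the words themselves (\w+) never yields empty tokens.
found_tokens = re.findall(r'\w+', text)
print(found_tokens)
# ['located', 'on', 'the', 'southern', 'tip', 'of', 'lake', 'union']
```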


In [11]:
# Use a set so each membership check is a hash lookup rather than a list scan
stopwords = set(nltk.corpus.stopwords.words('english'))

# Create function to remove stopwords
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopwords]
    return text

# Remove stop words from data
df['desc_nostop'] = df['desc_tokenized'].apply(lambda x: remove_stopwords(x))
df.head()

Unnamed: 0,name,desc,desc_clean,desc_clean_lower,desc_tokenized,desc_nostop
0,Hilton Garden Seattle Downtown,"Located on the southern tip of Lake Union, the...",Located on the southern tip of Lake Union the ...,located on the southern tip of lake union the ...,"[located, on, the, southern, tip, of, lake, un...","[located, southern, tip, lake, union, hilton, ..."
1,Sheraton Grand Seattle,"Located in the city's vibrant core, the Sherat...",Located in the citys vibrant core the Sheraton...,located in the citys vibrant core the sheraton...,"[located, in, the, citys, vibrant, core, the, ...","[located, citys, vibrant, core, sheraton, gran..."
2,Crowne Plaza Seattle Downtown,"Located in the heart of downtown Seattle, the ...",Located in the heart of downtown Seattle the a...,located in the heart of downtown seattle the a...,"[located, in, the, heart, of, downtown, seattl...","[located, heart, downtown, seattle, awardwinni..."
3,Kimpton Hotel Monaco Seattle,What?s near our hotel downtown Seattle locatio...,Whats near our hotel downtown Seattle location...,whats near our hotel downtown seattle location...,"[whats, near, our, hotel, downtown, seattle, l...","[whats, near, hotel, downtown, seattle, locati..."
4,The Westin Seattle,Situated amid incredible shopping and iconic a...,Situated amid incredible shopping and iconic a...,situated amid incredible shopping and iconic a...,"[situated, amid, incredible, shopping, and, ic...","[situated, amid, incredible, shopping, iconic,..."
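
The order of the cleaning steps affects what the stopword filter can catch. Because punctuation is removed first, a contraction such as "you'll" becomes "youll" before the filter runs, and a stopword list that stores the form with the apostrophe will not match it. The sketch below illustrates this with a tiny hand-written stopword set; `STOPWORDS` and `strip_punct` are assumptions for the example, not the NLTK list or the notebook's functions.

```python
import string

# Tiny illustrative stopword set (an assumption, not the full NLTK list).
STOPWORDS = {"you", "you'll", "the", "a"}

_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punct(word):
    """Remove ASCII punctuation from a single token."""
    return word.translate(_PUNCT_TABLE)

tokens = "you'll love the view".split()

# Order used in the notebook: punctuation first, stopwords second.
punct_first = [w for w in map(strip_punct, tokens) if w not in STOPWORDS]
print(punct_first)
# ['youll', 'love', 'view']

# Alternative order: stopwords first, punctuation second.
stop_first = [strip_punct(w) for w in tokens if w not in STOPWORDS]
print(stop_first)
# ['love', 'view']
```

Either order satisfies the assignment; the difference only shows up as a few extra low-information tokens in the vocabulary.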


In [12]:
ps = nltk.PorterStemmer()

# Create function to apply stemmer
def stemming(tokenized_text):
    text = [ps.stem(word) for word in tokenized_text]
    return text

# Apply Porter Stemmer
df['desc_clean_stemmed'] = df['desc_nostop'].apply(lambda x: stemming(x))
df.head()

Unnamed: 0,name,desc,desc_clean,desc_clean_lower,desc_tokenized,desc_nostop,desc_clean_stemmed
0,Hilton Garden Seattle Downtown,"Located on the southern tip of Lake Union, the...",Located on the southern tip of Lake Union the ...,located on the southern tip of lake union the ...,"[located, on, the, southern, tip, of, lake, un...","[located, southern, tip, lake, union, hilton, ...","[locat, southern, tip, lake, union, hilton, ga..."
1,Sheraton Grand Seattle,"Located in the city's vibrant core, the Sherat...",Located in the citys vibrant core the Sheraton...,located in the citys vibrant core the sheraton...,"[located, in, the, citys, vibrant, core, the, ...","[located, citys, vibrant, core, sheraton, gran...","[locat, citi, vibrant, core, sheraton, grand, ..."
2,Crowne Plaza Seattle Downtown,"Located in the heart of downtown Seattle, the ...",Located in the heart of downtown Seattle the a...,located in the heart of downtown seattle the a...,"[located, in, the, heart, of, downtown, seattl...","[located, heart, downtown, seattle, awardwinni...","[locat, heart, downtown, seattl, awardwin, cro..."
3,Kimpton Hotel Monaco Seattle,What?s near our hotel downtown Seattle locatio...,Whats near our hotel downtown Seattle location...,whats near our hotel downtown seattle location...,"[whats, near, our, hotel, downtown, seattle, l...","[whats, near, hotel, downtown, seattle, locati...","[what, near, hotel, downtown, seattl, locat, b..."
4,The Westin Seattle,Situated amid incredible shopping and iconic a...,Situated amid incredible shopping and iconic a...,situated amid incredible shopping and iconic a...,"[situated, amid, incredible, shopping, and, ic...","[situated, amid, incredible, shopping, iconic,...","[situat, amid, incred, shop, icon, attract, we..."


#### Display updated dataframe

In [13]:
df.head()

Unnamed: 0,name,desc,desc_clean,desc_clean_lower,desc_tokenized,desc_nostop,desc_clean_stemmed
0,Hilton Garden Seattle Downtown,"Located on the southern tip of Lake Union, the...",Located on the southern tip of Lake Union the ...,located on the southern tip of lake union the ...,"[located, on, the, southern, tip, of, lake, un...","[located, southern, tip, lake, union, hilton, ...","[locat, southern, tip, lake, union, hilton, ga..."
1,Sheraton Grand Seattle,"Located in the city's vibrant core, the Sherat...",Located in the citys vibrant core the Sheraton...,located in the citys vibrant core the sheraton...,"[located, in, the, citys, vibrant, core, the, ...","[located, citys, vibrant, core, sheraton, gran...","[locat, citi, vibrant, core, sheraton, grand, ..."
2,Crowne Plaza Seattle Downtown,"Located in the heart of downtown Seattle, the ...",Located in the heart of downtown Seattle the a...,located in the heart of downtown seattle the a...,"[located, in, the, heart, of, downtown, seattl...","[located, heart, downtown, seattle, awardwinni...","[locat, heart, downtown, seattl, awardwin, cro..."
3,Kimpton Hotel Monaco Seattle,What?s near our hotel downtown Seattle locatio...,Whats near our hotel downtown Seattle location...,whats near our hotel downtown seattle location...,"[whats, near, our, hotel, downtown, seattle, l...","[whats, near, hotel, downtown, seattle, locati...","[what, near, hotel, downtown, seattl, locat, b..."
4,The Westin Seattle,Situated amid incredible shopping and iconic a...,Situated amid incredible shopping and iconic a...,situated amid incredible shopping and iconic a...,"[situated, amid, incredible, shopping, and, ic...","[situated, amid, incredible, shopping, iconic,...","[situat, amid, incred, shop, icon, attract, we..."


### Vectorize cleaned hotel descriptions

In [14]:
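# Apply TfidfVectorizer
# Note: desc_clean_stemmed already holds stemmed token lists, so analyzer=stemming
# runs the Porter stemmer on each token a second time. An identity analyzer
# (lambda tokens: tokens) would use the tokens exactly as stored.
tfidf_vect = TfidfVectorizer(analyzer=stemming)
tfidf_counts = tfidf_vect.fit_transform(df['desc_clean_stemmed'])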
print(tfidf_counts.shape)
print()
print(tfidf_vect.get_feature_names_out())

(152, 2608)

['' '1' '10' ... 'zipcar' 'zone' 'zoo']
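
For intuition about what the vectorizer stores in each cell, the following is a pure-Python sketch of smoothed TF-IDF with L2-normalized rows. It follows the formula scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`), but it is an illustration on made-up token lists, not a replacement for `TfidfVectorizer`; the function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2-normalized rows for a list of token lists."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Made-up documents: "hotel" appears everywhere, "view" only once.
docs = [["hotel", "lake", "view"], ["hotel", "airport"], ["hotel", "lake"]]
vocab, rows = tfidf_rows(docs)
print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first row, the rare term `view` receives a larger weight than `hotel`, which occurs in every document. That is the property the recommender relies on: shared rare words contribute more to similarity than shared common ones.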


In [15]:
tfidf_counts_df = pd.DataFrame(tfidf_counts.toarray(), columns=tfidf_vect.get_feature_names_out())
tfidf_counts_df.head()

Unnamed: 0,Unnamed: 1,1,10,100,1000,10000,103000,109,109room,10best,...,youd,youll,your,youth,yummi,zagat,zephyr,zipcar,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.082273,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Generate similarities matrix on cleaned hotel descriptions

In [16]:
similarity_matrix = cosine_similarity(tfidf_counts)

In [17]:
similarity_matrix

array([[1.        , 0.07345895, 0.12272613, ..., 0.06892185, 0.01821178,
        0.04149018],
       [0.07345895, 1.        , 0.08040042, ..., 0.07400006, 0.02550196,
        0.03744059],
       [0.12272613, 0.08040042, 1.        , ..., 0.12874819, 0.03469514,
        0.03530516],
       ...,
       [0.06892185, 0.07400006, 0.12874819, ..., 1.        , 0.05369064,
        0.03366706],
       [0.01821178, 0.02550196, 0.03469514, ..., 0.05369064, 1.        ,
        0.01266814],
       [0.04149018, 0.03744059, 0.03530516, ..., 0.03366706, 0.01266814,
        1.        ]])
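
Each entry of the matrix above is the cosine of the angle between two TF-IDF row vectors: the dot product divided by the product of the vector lengths. The diagonal is 1 because every description is identical to itself, and the matrix is symmetric. A minimal pure-Python sketch on made-up vectors (the `cosine` helper is hypothetical, and returning 0.0 for a zero vector is a convention chosen for this sketch):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: no direction, no similarity
    return dot / (norm_u * norm_v)

# Made-up vectors: the second is a scaled copy of the first,
# the third shares no non-zero component with either.
vectors = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 0.0, 3.0]]
matrix = [[cosine(u, v) for v in vectors] for u in vectors]
for row in matrix:
    print([round(x, 2) for x in row])
```

Scaling a vector does not change its cosine similarity, so a long description and a short one can still score highly if they use the same vocabulary in similar proportions.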

### Create hotel recommender

In [18]:
def hotel_recommender(hotel_name, df, similarity_matrix, num_recommendations=5):
    # Find the row position of the hotel (positions line up with matrix rows)
    matches = (df['name'] == hotel_name).to_numpy().nonzero()[0]
    if matches.size == 0:
        raise ValueError(f"Hotel not found: {hotel_name!r}")
    hotel_pos = matches[0]

    # Get the similarity scores for the given hotel
    hotel_similarities = similarity_matrix[hotel_pos]

    # Pair each row position with its score, excluding the hotel itself by position
    similar_hotels = [(pos, score) for pos, score in enumerate(hotel_similarities)
                      if pos != hotel_pos]

    # Sort the remaining hotels by similarity score, highest first
    similar_hotels = sorted(similar_hotels, key=lambda x: x[1], reverse=True)

    # Get the top recommended hotels
    top_recommendations = similar_hotels[:num_recommendations]

    recommended_hotels = []
    for (pos, similarity) in top_recommendations:
        recommended_hotels.append((df['name'].iloc[pos], similarity))

    return recommended_hotels
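
The ranking logic can be checked on a toy example before trusting it on the full dataset. The sketch below is a hypothetical pure-Python counterpart of `hotel_recommender` (the name `top_similar`, the hotel names, and the scores are all made up). It excludes the query hotel by position and rejects unknown names explicitly.

```python
import heapq

def top_similar(names, similarity_matrix, name, k=5):
    """Return the k most similar entries to `name`, excluding `name` itself."""
    if name not in names:
        raise ValueError(f"Unknown hotel: {name!r}")
    pos = names.index(name)
    # Pair every other row position with its similarity to the query row.
    candidates = [(score, other)
                  for other, score in enumerate(similarity_matrix[pos])
                  if other != pos]
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return [(names[other], score) for score, other in best]

# Made-up names and a made-up symmetric similarity matrix.
names = ["Hotel A", "Hotel B", "Hotel C", "Hotel D"]
sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.7, 0.3, 1.0, 0.5],
    [0.1, 0.4, 0.5, 1.0],
]
print(top_similar(names, sim, "Hotel A", k=2))
# [('Hotel C', 0.7), ('Hotel B', 0.2)]
```

Excluding the query by position rather than by dropping the first sorted entry keeps the result correct even if another hotel happens to have an identical description and therefore a similarity of 1.0.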

### Make hotel recommendations for the following hotel names:
* Motel 6 Seattle Sea-Tac Airport South
* The Bacon Mansion Bed and Breakfast
* Holiday Inn Seattle Downtown

In [19]:
hotel_name = "Motel 6 Seattle Sea-Tac Airport South"
recommendations = hotel_recommender(hotel_name, df, similarity_matrix, num_recommendations=5)

print(f"Recommendations for {hotel_name}:")
for i, (recommended_hotel, similarity) in enumerate(recommendations):
    print(f"{i+1}. {recommended_hotel} (Similarity: {similarity:.2f})")

Recommendations for Motel 6 Seattle Sea-Tac Airport South:
1. Ramada by Wyndham SeaTac Airport (Similarity: 0.29)
2. Four Points by Sheraton Seattle Airport South (Similarity: 0.25)
3. Crown Inn Motel (Similarity: 0.24)
4. Red Roof Inn Seattle Airport - SEATAC (Similarity: 0.24)
5. Emerald Motel (Similarity: 0.22)


In [20]:
hotel_name = "The Bacon Mansion Bed and Breakfast"
recommendations = hotel_recommender(hotel_name, df, similarity_matrix, num_recommendations=5)

print(f"Recommendations for {hotel_name}:")
for i, (recommended_hotel, similarity) in enumerate(recommendations):
    print(f"{i+1}. {recommended_hotel} (Similarity: {similarity:.2f})")

Recommendations for The Bacon Mansion Bed and Breakfast:
1. 11th Avenue Inn Bed and Breakfast (Similarity: 0.26)
2. Shafer Baillie Mansion Bed & Breakfast (Similarity: 0.24)
3. Silver Cloud Hotel - Seattle Broadway (Similarity: 0.18)
4. Quality Inn & Suites Seattle Center (Similarity: 0.16)
5. Gaslight Inn (Similarity: 0.16)


In [21]:
hotel_name = "Holiday Inn Seattle Downtown"
recommendations = hotel_recommender(hotel_name, df, similarity_matrix, num_recommendations=5)

print(f"Recommendations for {hotel_name}:")
for i, (recommended_hotel, similarity) in enumerate(recommendations):
    print(f"{i+1}. {recommended_hotel} (Similarity: {similarity:.2f})")

Recommendations for Holiday Inn Seattle Downtown:
1. Holiday Inn Express & Suites Seattle-City Center (Similarity: 0.39)
2. Silver Cloud Hotel - Seattle Stadium (Similarity: 0.33)
3. Holiday Inn Express & Suites North Seattle - Shoreline (Similarity: 0.32)
4. Best Western Plus Pioneer Square Hotel Downtown (Similarity: 0.30)
5. Inn at Queen Anne (Similarity: 0.28)
