<a href="https://colab.research.google.com/github/sajaldebnath/topic_sentiment_analysis/blob/main/TopicModeling_SentimentAnalysis_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **IISc CCE AI/ML Capstone Project - Topic modeling and sentiment analysis based on user review data.**

### **Purpose:**
This project aims to take user reviews as inputs and figure out the following:

*   Figure out topics discussed
*   Model the topics
*   Topic clustering
*   Unsupervised sentiment analysis on each topic cluster providing Top N positive, negative, and neutral sentiment topics
*   Figuring out the Top N impactful topic for overall rating improvement

### **Approach:**

We are taking a phased approach to achieve the goal. The following phases will be carried out:

#### **Milestone-1**

*  Cleaning and preparing data
*  Topic modeling
*  Unsupervised topic clustering
*  End of Milestone-1 - Top N trending topic

#### **Milestone-2**

* Unsupervised sentiment analysis on each topic cluster
* End of Milestone-2 - Top N positive, negative, and neutral sentiment topics

#### **Milestone-3**

* End of Milestone-3 - Top N impactful topic for overall rating improvement

### **Milestone-1**



1.   Import all the necessary libraries

In this dataset from Kaggle https://www.kaggle.com/datasets/rhonarosecortez/new-york-airbnb-open-data/data we have 3 files:


*   calendar.csv
*   listings.csv
*   reviewes.csv

We will be working only with reviews.csv as we want to do a topic modelling on the reviews.



In [2]:
# Importing all necessary libraries

# utilities
import re
import pickle
import numpy as np
import pandas as pd

# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt


# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report


2. Import and view the dataset

In [3]:
# Importing the dataset
#COLUMNS  = ["listing_id", "id", "date", "reviewer_id", "reviewer_name", "comments"]
# ENCODING = "ISO-8859-1"
url = 'https://raw.githubusercontent.com/sajaldebnath/topic_sentiment_analysis/refs/heads/main/data/NY-Airbnb-Open-Data-Reviews.csv'
df = pd.read_csv(url, engine='python', on_bad_lines='skip' )

df.head(10) # view the first 10 rows of the dataframe

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2992450,15066586,2014-07-01,16827297,Kristen,Large apartment; nice kitchen and bathroom. Ke...
1,2992450,21810844,2014-10-24,22648856,Christopher,"This may be a little late, but just to say Ken..."
2,2992450,27434334,2015-03-04,45406,Altay,The apartment was very clean and convenient to...
3,2992450,28524578,2015-03-25,5485362,John,Kenneth was ready when I got there and arrange...
4,2992450,35913434,2015-06-23,15772025,Jennifer,We were pleased to see how 2nd Street and the ...
5,2992450,38893053,2015-07-19,11614467,Stephanie,"The flat is not in a good area, while we were ..."
6,2992450,57989144,2015-12-31,28580637,Betty,The apartment was centrally located near all t...
7,2992450,457366954464901293,2021-09-22,413779309,Carolina,The place is clean and the host is very nice
8,2992450,695544085190177036,2022-08-17,19928494,Kyle,It was much dirtier in person and half the fur...
9,3820211,17665203,2014-08-15,11024290,Abigail,We had a marvelous time staying at Terra's bea...


In [4]:
# Removing the unnecessary columns since we will only be working with reviews
# later on we want to co-relate reviews with listings and perhaps with the reviewers
data = df[['listing_id','reviewer_id', 'comments']]


data.head(10)

Unnamed: 0,listing_id,reviewer_id,comments
0,2992450,16827297,Large apartment; nice kitchen and bathroom. Ke...
1,2992450,22648856,"This may be a little late, but just to say Ken..."
2,2992450,45406,The apartment was very clean and convenient to...
3,2992450,5485362,Kenneth was ready when I got there and arrange...
4,2992450,15772025,We were pleased to see how 2nd Street and the ...
5,2992450,11614467,"The flat is not in a good area, while we were ..."
6,2992450,28580637,The apartment was centrally located near all t...
7,2992450,413779309,The place is clean and the host is very nice
8,2992450,19928494,It was much dirtier in person and half the fur...
9,3820211,11024290,We had a marvelous time staying at Terra's bea...


In [5]:
# Next let's find out the details of the dataset
data.info() #provides a summary of the data frame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24752 entries, 0 to 24751
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   listing_id   24752 non-null  int64 
 1   reviewer_id  24752 non-null  int64 
 2   comments     24745 non-null  object
dtypes: int64(2), object(1)
memory usage: 580.2+ KB


From the above information we can see that there are 24752 rows of data and without any null values. So, we don't have to worry about null data in this dataset.
Still let's handle that situation for any future datasets.

In [6]:
# Explicitly check for missing values
if (data.isnull().sum() > 0).any():
    print("Missing values found in the dataset.")
    # The best approach is to just drop them.
    data = data.dropna(subset=['comments'])
else:
    print("No missing values found in the dataset.")

data.info()

Missing values found in the dataset.
<class 'pandas.core.frame.DataFrame'>
Index: 24745 entries, 0 to 24751
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   listing_id   24745 non-null  int64 
 1   reviewer_id  24745 non-null  int64 
 2   comments     24745 non-null  object
dtypes: int64(2), object(1)
memory usage: 773.3+ KB


3. In this section we will handle the pre-processing of the date which includes the following:

* **Lower Casing**: Each text is converted to lowercase.

* **Replacing URLs**: Links starting with "http" or "https" or "www" are replaced by "URL".

* **Replacing Emojis**: Replace emojis by using a pre-defined dictionary containing emojis along with their meaning. (eg: ":)" to "EMOJIsmile")

* **Replacing Usernames**: Replace @Usernames with word "USER". (eg: "@Kaggle" to "USER")

* **Removing Non-Alphabets**: Replacing characters except Digits and Alphabets with a space.

* **Removing Consecutive letters**: 3 or more consecutive letters are replaced by 2 letters. (eg: "Heyyyy" to "Heyy")

* **Removing Short Words**: Words with length less than 2 are removed.

* **Removing Stopwords**: Stopwords are the English words which does not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. (eg: "the", "he", "have")

* **Lemmatizing**: Lemmatization is the process of converting a word to its base form. (e.g: “Great” to “Good”)


In [7]:
# Defining dictionary containing all emojis with their meanings.
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad',
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed',
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink',
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}

In [8]:
def preprocess_comments(comments):

    # Defining regex patterns.
    urlPattern        = r'http[s]?://\S+' # Removes any website link starting with http or https
    userPattern       = '@[^\s]+'         # Removes any users names starting with @ character
    alphaPattern      = "[^a-zA-Z0-9]"    # Removes text except alpha numeric patterns
    sequencePattern   = r"(.)\1\1+"       # Finds out 3 or more repeating letters likes "aaa"
    seqReplacePattern = r"\1\1"           # Replace the repeating letter sequence with two letters
    symbolPattern = r'Ã[^\x80-\xBF]+'     # Finds out the special characters like Ã
    spacePattern = r'\s+'                 # Finds out extra spaces betweem words


    # Replace all URls with ' '
    comments = re.sub(urlPattern, '', comments, flags=re.MULTILINE)
    # Replace @USERNAME to ' '.
    comments = re.sub(userPattern,' ', comments)
    # Replace all non alphabets.
    comments = re.sub(alphaPattern, " ", comments)
    # Replace 3 or more consecutive letters by 2 letters.
    comments = re.sub(sequencePattern, seqReplacePattern, comments)
    # Replace special encoded symbols that appear as characters like Ã, Å, Ë
    comments = re.sub(symbolPattern, ' ', comments)

    # Replace all emojis.
    for emoji in emojis.keys():
      comments = comments.replace(emoji, " " + emojis[emoji])

    # Replace multiple spaces with a single space
    comments = re.sub(spacePattern, ' ', comments)
    # Remove all leading or trailing spaces
    comments = comments.strip()

    return comments.lower()  # Convert all to lowercase

In [9]:
import time
t = time.time()
data['processed_comments'] = data['comments'].apply(preprocess_comments) #applying the preprocess_comments function to comments column

print(f'Text Preprocessing complete.')
print(f'Time Taken: {round(time.time()-t)} seconds')

Text Preprocessing complete.
Time Taken: 1 seconds


In [10]:
#VIEW THE CLEANED Comments IN "comments" column again
pd.set_option('display.max_colwidth', None) #to view text fully and so that
#pandas doesn't truncate the text

data.head(10) # Printing the data to see the updated result

Unnamed: 0,listing_id,reviewer_id,comments,processed_comments
0,2992450,16827297,"Large apartment; nice kitchen and bathroom. Kenneth left drinks for us which was so nice. His cousin lives upstairs, and she was very nice and helpful, too. The internet only worked about half the time, and everything on the street can be heard from the main bedroom. Wonderful stay- it was exactly what we needed. It is not in the best area of town, but I never felt unsafe.",large apartment nice kitchen and bathroom kenneth left drinks for us which was so nice his cousin lives upstairs and she was very nice and helpful too the internet only worked about half the time and everything on the street can be heard from the main bedroom wonderful stay it was exactly what we needed it is not in the best area of town but i never felt unsafe
1,2992450,22648856,"This may be a little late, but just to say Kenneth was quick to respond to our request. He left some supplies in the refrigerator and kindly showed us round downtown Albany. We were without a car and he arranged for his cousin Kristina to take us to supermarket.\r<br/>The apartment is large, clean with many original features. It is situated in an historical area that is fairly central, although like the rest of Albany you have to climb a steep hill to get to it if you are on foot. \r<br/>",this may be a little late but just to say kenneth was quick to respond to our request he left some supplies in the refrigerator and kindly showed us round downtown albany we were without a car and he arranged for his cousin kristina to take us to supermarket br the apartment is large clean with many original features it is situated in an historical area that is fairly central although like the rest of albany you have to climb a steep hill to get to it if you are on foot br
2,2992450,45406,"The apartment was very clean and convenient to downtown. One thing to keep in mind: one bedroom is huge and the other is tiny, so if you're traveling with a friend you might have to flip a coin. :)",the apartment was very clean and convenient to downtown one thing to keep in mind one bedroom is huge and the other is tiny so if you re traveling with a friend you might have to flip a coin
3,2992450,5485362,"Kenneth was ready when I got there and arranged for the upstairs neighbor to meet me at the door with the keys. Shortly after that I was left on my own in the privacy of this large 2 BR apartment - just like I like it. I really like my privacy and there was no time at which anyone came to visit me or bother me unannounced. At one point Kenneth had to pic up a refrigerator stored in the apartment, but the neighbor lady gave me plenty of notice and I was on my way out before he got there. When I returned the extra refrigerator was gone and I had the nice quiet apartment to myself. At one point I went to take a shower and found some body wash in a drawer. That was nice since all I brought with me was shampoo. The apartment is a city street apartment meaning that like many two family homes the BR window is feet away from the public sidewalk and you can hear people all hours of the night walking by and conversing, but its not terribly disruptive. Overall the apartment was just what I needed and I had an enjoyable time while there. The heat worked perfectly and the apartment was picked up. The bed was a king size and very comfortable with sensible pillows and coverings. The water pressure on the shower was awesome.",kenneth was ready when i got there and arranged for the upstairs neighbor to meet me at the door with the keys shortly after that i was left on my own in the privacy of this large 2 br apartment just like i like it i really like my privacy and there was no time at which anyone came to visit me or bother me unannounced at one point kenneth had to pic up a refrigerator stored in the apartment but the neighbor lady gave me plenty of notice and i was on my way out before he got there when i returned the extra refrigerator was gone and i had the nice quiet apartment to myself at one point i went to take a shower and found some body wash in a drawer that was nice since all i brought with me was shampoo the apartment is a city street apartment meaning that like many two family homes the br window is feet away from the public sidewalk and you can hear people all hours of the night walking by and conversing but its not terribly disruptive overall the apartment was just what i needed and i had an enjoyable time while there the heat worked perfectly and the apartment was picked up the bed was a king size and very comfortable with sensible pillows and coverings the water pressure on the shower was awesome
4,2992450,15772025,"We were pleased to see how 2nd Street and the Ten Broeck neighborhood in Albany have come along. Its great to see all of the gorgeous brownstones, including the one we stayed in, being renovated and cared for.<br/>In terms of quality, I was pleased by the courtesy of the host's friend who lived nearby. He was there to let us in and make sure I knew to call with any problems. With a full kitchen and two bedrooms, my daughter and I felt at home with plenty of space. As a native of Albany I'm used to thin windows and occasionally noisy neighbors. If you don't sleep well, don't take the master bedroom and be aware that loud music may be playing during the evening and weekend. But there is a fan/ac unit to drown out the din from the street.<br/>This space has been renovated and a vast improvement I'm sure from where it was, and I'm hoping progress will continue here as it seemed a few projects were incomplete. I would have loved to enjoy more sunlight in the house but there was a ton of tree debris in the back yard so there was no way to enjoy opening the windows.<br/>On my own, this place was fine, but if traveling with my child again I will probably pick a quieter location where I'd be able to open windows or put her down for a nap without worrying about excessive noise.",we were pleased to see how 2nd street and the ten broeck neighborhood in albany have come along its great to see all of the gorgeous brownstones including the one we stayed in being renovated and cared for br in terms of quality i was pleased by the courtesy of the host s friend who lived nearby he was there to let us in and make sure i knew to call with any problems with a full kitchen and two bedrooms my daughter and i felt at home with plenty of space as a native of albany i m used to thin windows and occasionally noisy neighbors if you don t sleep well don t take the master bedroom and be aware that loud music may be playing during the evening and weekend but there is a fan ac unit to drown out the din from the street br this space has been renovated and a vast improvement i m sure from where it was and i m hoping progress will continue here as it seemed a few projects were incomplete i would have loved to enjoy more sunlight in the house but there was a ton of tree debris in the back yard so there was no way to enjoy opening the windows br on my own this place was fine but if traveling with my child again i will probably pick a quieter location where i d be able to open windows or put her down for a nap without worrying about excessive noise
5,2992450,11614467,"The flat is not in a good area, while we were fine you do have to be careful (even our cab driver told us that) I wouldn't stay again. The flat also does not have air con- the large bedroom had an obnoxiously loud fan which kept the one room cool, the rest of the house is very hot. So if you are going to stay here bear these things in mind.",the flat is not in a good area while we were fine you do have to be careful even our cab driver told us that i wouldn t stay again the flat also does not have air con the large bedroom had an obnoxiously loud fan which kept the one room cool the rest of the house is very hot so if you are going to stay here bear these things in mind
6,2992450,28580637,"The apartment was centrally located near all the night life in Albany, ideally suited to the age profile of my 2 sons who stayed their. ... We had initial difficulty in contacting the host...until we used my son who lives in Albany to be the contact, as we were relying on internet connection where ever we found free wifi..as roaming charges for international tourists is prohibitive. He went to the apartment a few times on the day we were yo arrive into JFK bit could not gain access...he was trying to get key so that he could leave basic provisions for us there. He eventually got an answer to the door bell..late evening. The apartment was adequate for them as they spent most of their time with their brother who lives in Albany. They did say that the apartment was way too warm, and they could not see where to set the heating. One of the windows was broken but had plastic/perspex over it. The furniture was of a good enough standard but the couch was a little shabby and their was a musty smell from it. Other than those facts they said the place was ideal for them and the contact person was very obliging considering they locked themselves out once and needed access .",the apartment was centrally located near all the night life in albany ideally suited to the age profile of my 2 sons who stayed their we had initial difficulty in contacting the host until we used my son who lives in albany to be the contact as we were relying on internet connection where ever we found free wifi as roaming charges for international tourists is prohibitive he went to the apartment a few times on the day we were yo arrive into jfk bit could not gain access he was trying to get key so that he could leave basic provisions for us there he eventually got an answer to the door bell late evening the apartment was adequate for them as they spent most of their time with their brother who lives in albany they did say that the apartment was way too warm and they could not see where to set the heating one of the windows was broken but had plastic perspex over it the furniture was of a good enough standard but the couch was a little shabby and their was a musty smell from it other than those facts they said the place was ideal for them and the contact person was very obliging considering they locked themselves out once and needed access
7,2992450,413779309,The place is clean and the host is very nice,the place is clean and the host is very nice
8,2992450,19928494,It was much dirtier in person and half the furniture was broken. Hopefully the host can fix these issues because the apartment could be nice if it is cleaned up correctly and furnished with items that weren’t all broken such as the Tv being propped up on a table with two screws.,it was much dirtier in person and half the furniture was broken hopefully the host can fix these issues because the apartment could be nice if it is cleaned up correctly and furnished with items that weren t all broken such as the tv being propped up on a table with two screws
9,3820211,11024290,We had a marvelous time staying at Terra's beautiful apartment in downtown Albany. The location was perfect and the apartment was just splendid. Terra was very responsive to our inquiries and flexible. Could not have been happier.,we had a marvelous time staying at terra s beautiful apartment in downtown albany the location was perfect and the apartment was just splendid terra was very responsive to our inquiries and flexible could not have been happier


In [11]:
# Tokenize using nltk library
# pip install nltk # If not already present then install the module
#import nltk
#from nltk.tokenize import word_tokenize
#nltk.download('punkt')
#nltk.download('punkt_tab')

#def tokenize_comment(comment):
#    tokens = word_tokenize(comment) # Break up and tokenize the comment
#    return tokens

# Calling the tokenize_comment function on the comment to tokenize it
# Also adding another column called "Tokens" to store the tokenized words
#data['Tokens'] = data['comments'].apply(tokenize_comment)

#print(f'Tokenization complete.')
#print(f'Time Taken: {round(time.time()-t)} seconds')

In [12]:
# Let's check the tokenized data
#data.head(10)

**Removing the Stopwords**:

In this section we will remove all the defined stopwords

In [13]:
import string
import nltk
# Download the stopwords list
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

#Let's define a list of custom_stop_words which does not add value to the topic
custom_stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from',
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves', 'us']


# Define the function to remove the stopword from the tokens
stops = stopwords.words("english")
updated_stops = stops+custom_stopwordlist

def remove_stops(comment):

    #removes all stop words, including custom words
    words = comment.split()
    final = []
    for word in words:
        if word not in updated_stops:
            # Lemmatizing the words
            #word = lemmatizer.lemmatize(word)
            word = lemmatizer.lemmatize(word, pos='v')
            final.append(word)

    #reassembles the text without stop words
    final = " ".join(final)

    #removes all punctuation
    final = final.translate(str.maketrans("", "", string.punctuation))

    #removes all numbers
    final = "".join([i for i in final if not i.isdigit()])

    #eliminates double white spaces
    while "  " in final:
        final = final.replace("  ", " ")
    return (final)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [14]:
# Remove stopwords from the original text and lemmatize them
data['Filtered_comments'] = data['processed_comments'].apply(remove_stops)

print(f'Removing stopwords from processed_comments and lemmatizing them complete.')
print(f'Time Taken: {round(time.time()-t)} seconds')


Removing stopwords from processed_comments and lemmatizing them complete.
Time Taken: 24 seconds


In [15]:
# Let's check the updated data again
data.head(10)

Unnamed: 0,listing_id,reviewer_id,comments,processed_comments,Filtered_comments
0,2992450,16827297,"Large apartment; nice kitchen and bathroom. Kenneth left drinks for us which was so nice. His cousin lives upstairs, and she was very nice and helpful, too. The internet only worked about half the time, and everything on the street can be heard from the main bedroom. Wonderful stay- it was exactly what we needed. It is not in the best area of town, but I never felt unsafe.",large apartment nice kitchen and bathroom kenneth left drinks for us which was so nice his cousin lives upstairs and she was very nice and helpful too the internet only worked about half the time and everything on the street can be heard from the main bedroom wonderful stay it was exactly what we needed it is not in the best area of town but i never felt unsafe,large apartment nice kitchen bathroom kenneth leave drink nice cousin live upstairs nice helpful internet work half time everything street hear main bedroom wonderful stay exactly need best area town never felt unsafe
1,2992450,22648856,"This may be a little late, but just to say Kenneth was quick to respond to our request. He left some supplies in the refrigerator and kindly showed us round downtown Albany. We were without a car and he arranged for his cousin Kristina to take us to supermarket.\r<br/>The apartment is large, clean with many original features. It is situated in an historical area that is fairly central, although like the rest of Albany you have to climb a steep hill to get to it if you are on foot. \r<br/>",this may be a little late but just to say kenneth was quick to respond to our request he left some supplies in the refrigerator and kindly showed us round downtown albany we were without a car and he arranged for his cousin kristina to take us to supermarket br the apartment is large clean with many original features it is situated in an historical area that is fairly central although like the rest of albany you have to climb a steep hill to get to it if you are on foot br,may little late say kenneth quick respond request leave supply refrigerator kindly show round downtown albany without car arrange cousin kristina take supermarket br apartment large clean many original feature situate historical area fairly central although like rest albany climb steep hill get foot br
2,2992450,45406,"The apartment was very clean and convenient to downtown. One thing to keep in mind: one bedroom is huge and the other is tiny, so if you're traveling with a friend you might have to flip a coin. :)",the apartment was very clean and convenient to downtown one thing to keep in mind one bedroom is huge and the other is tiny so if you re traveling with a friend you might have to flip a coin,apartment clean convenient downtown one thing keep mind one bedroom huge tiny travel friend might flip coin
3,2992450,5485362,"Kenneth was ready when I got there and arranged for the upstairs neighbor to meet me at the door with the keys. Shortly after that I was left on my own in the privacy of this large 2 BR apartment - just like I like it. I really like my privacy and there was no time at which anyone came to visit me or bother me unannounced. At one point Kenneth had to pic up a refrigerator stored in the apartment, but the neighbor lady gave me plenty of notice and I was on my way out before he got there. When I returned the extra refrigerator was gone and I had the nice quiet apartment to myself. At one point I went to take a shower and found some body wash in a drawer. That was nice since all I brought with me was shampoo. The apartment is a city street apartment meaning that like many two family homes the BR window is feet away from the public sidewalk and you can hear people all hours of the night walking by and conversing, but its not terribly disruptive. Overall the apartment was just what I needed and I had an enjoyable time while there. The heat worked perfectly and the apartment was picked up. The bed was a king size and very comfortable with sensible pillows and coverings. The water pressure on the shower was awesome.",kenneth was ready when i got there and arranged for the upstairs neighbor to meet me at the door with the keys shortly after that i was left on my own in the privacy of this large 2 br apartment just like i like it i really like my privacy and there was no time at which anyone came to visit me or bother me unannounced at one point kenneth had to pic up a refrigerator stored in the apartment but the neighbor lady gave me plenty of notice and i was on my way out before he got there when i returned the extra refrigerator was gone and i had the nice quiet apartment to myself at one point i went to take a shower and found some body wash in a drawer that was nice since all i brought with me was shampoo the apartment is a city street apartment meaning that like many two family homes the br window is feet away from the public sidewalk and you can hear people all hours of the night walking by and conversing but its not terribly disruptive overall the apartment was just what i needed and i had an enjoyable time while there the heat worked perfectly and the apartment was picked up the bed was a king size and very comfortable with sensible pillows and coverings the water pressure on the shower was awesome,kenneth ready get arrange upstairs neighbor meet door key shortly leave privacy large br apartment like like really like privacy time anyone come visit bother unannounced one point kenneth pic refrigerator store apartment neighbor lady give plenty notice way get return extra refrigerator go nice quiet apartment one point go take shower find body wash drawer nice since bring shampoo apartment city street apartment mean like many two family home br window feet away public sidewalk hear people hours night walk converse terribly disruptive overall apartment need enjoyable time heat work perfectly apartment pick bed king size comfortable sensible pillow coverings water pressure shower awesome
4,2992450,15772025,"We were pleased to see how 2nd Street and the Ten Broeck neighborhood in Albany have come along. Its great to see all of the gorgeous brownstones, including the one we stayed in, being renovated and cared for.<br/>In terms of quality, I was pleased by the courtesy of the host's friend who lived nearby. He was there to let us in and make sure I knew to call with any problems. With a full kitchen and two bedrooms, my daughter and I felt at home with plenty of space. As a native of Albany I'm used to thin windows and occasionally noisy neighbors. If you don't sleep well, don't take the master bedroom and be aware that loud music may be playing during the evening and weekend. But there is a fan/ac unit to drown out the din from the street.<br/>This space has been renovated and a vast improvement I'm sure from where it was, and I'm hoping progress will continue here as it seemed a few projects were incomplete. I would have loved to enjoy more sunlight in the house but there was a ton of tree debris in the back yard so there was no way to enjoy opening the windows.<br/>On my own, this place was fine, but if traveling with my child again I will probably pick a quieter location where I'd be able to open windows or put her down for a nap without worrying about excessive noise.",we were pleased to see how 2nd street and the ten broeck neighborhood in albany have come along its great to see all of the gorgeous brownstones including the one we stayed in being renovated and cared for br in terms of quality i was pleased by the courtesy of the host s friend who lived nearby he was there to let us in and make sure i knew to call with any problems with a full kitchen and two bedrooms my daughter and i felt at home with plenty of space as a native of albany i m used to thin windows and occasionally noisy neighbors if you don t sleep well don t take the master bedroom and be aware that loud music may be playing during the evening and weekend but there is a fan ac unit to drown out the din from the street br this space has been renovated and a vast improvement i m sure from where it was and i m hoping progress will continue here as it seemed a few projects were incomplete i would have loved to enjoy more sunlight in the house but there was a ton of tree debris in the back yard so there was no way to enjoy opening the windows br on my own this place was fine but if traveling with my child again i will probably pick a quieter location where i d be able to open windows or put her down for a nap without worrying about excessive noise,please see nd street ten broeck neighborhood albany come along great see gorgeous brownstones include one stay renovate care br term quality please courtesy host friend live nearby let make sure know call problems full kitchen two bedrooms daughter felt home plenty space native albany use thin windows occasionally noisy neighbor sleep well take master bedroom aware loud music may play even weekend fan ac unit drown din street br space renovate vast improvement sure hop progress continue seem project incomplete would love enjoy sunlight house ton tree debris back yard way enjoy open windows br place fine travel child probably pick quieter location able open windows put nap without worry excessive noise
5,2992450,11614467,"The flat is not in a good area, while we were fine you do have to be careful (even our cab driver told us that) I wouldn't stay again. The flat also does not have air con- the large bedroom had an obnoxiously loud fan which kept the one room cool, the rest of the house is very hot. So if you are going to stay here bear these things in mind.",the flat is not in a good area while we were fine you do have to be careful even our cab driver told us that i wouldn t stay again the flat also does not have air con the large bedroom had an obnoxiously loud fan which kept the one room cool the rest of the house is very hot so if you are going to stay here bear these things in mind,flat good area fine careful even cab driver tell stay flat also air con large bedroom obnoxiously loud fan keep one room cool rest house hot go stay bear things mind
6,2992450,28580637,"The apartment was centrally located near all the night life in Albany, ideally suited to the age profile of my 2 sons who stayed their. ... We had initial difficulty in contacting the host...until we used my son who lives in Albany to be the contact, as we were relying on internet connection where ever we found free wifi..as roaming charges for international tourists is prohibitive. He went to the apartment a few times on the day we were yo arrive into JFK bit could not gain access...he was trying to get key so that he could leave basic provisions for us there. He eventually got an answer to the door bell..late evening. The apartment was adequate for them as they spent most of their time with their brother who lives in Albany. They did say that the apartment was way too warm, and they could not see where to set the heating. One of the windows was broken but had plastic/perspex over it. The furniture was of a good enough standard but the couch was a little shabby and their was a musty smell from it. Other than those facts they said the place was ideal for them and the contact person was very obliging considering they locked themselves out once and needed access .",the apartment was centrally located near all the night life in albany ideally suited to the age profile of my 2 sons who stayed their we had initial difficulty in contacting the host until we used my son who lives in albany to be the contact as we were relying on internet connection where ever we found free wifi as roaming charges for international tourists is prohibitive he went to the apartment a few times on the day we were yo arrive into jfk bit could not gain access he was trying to get key so that he could leave basic provisions for us there he eventually got an answer to the door bell late evening the apartment was adequate for them as they spent most of their time with their brother who lives in albany they did say that the apartment was way too warm and they could not see where to set the heating one of the windows was broken but had plastic perspex over it the furniture was of a good enough standard but the couch was a little shabby and their was a musty smell from it other than those facts they said the place was ideal for them and the contact person was very obliging considering they locked themselves out once and needed access,apartment centrally locate near night life albany ideally suit age profile sons stay initial difficulty contact host use son live albany contact rely internet connection ever find free wifi roam charge international tourists prohibitive go apartment time day yo arrive jfk bite could gain access try get key could leave basic provision eventually get answer door bell late even apartment adequate spend time brother live albany say apartment way warm could see set heat one windows break plastic perspex furniture good enough standard couch little shabby musty smell facts say place ideal contact person oblige consider lock need access
7,2992450,413779309,The place is clean and the host is very nice,the place is clean and the host is very nice,place clean host nice
8,2992450,19928494,It was much dirtier in person and half the furniture was broken. Hopefully the host can fix these issues because the apartment could be nice if it is cleaned up correctly and furnished with items that weren’t all broken such as the Tv being propped up on a table with two screws.,it was much dirtier in person and half the furniture was broken hopefully the host can fix these issues because the apartment could be nice if it is cleaned up correctly and furnished with items that weren t all broken such as the tv being propped up on a table with two screws,much dirtier person half furniture break hopefully host fix issue apartment could nice clean correctly furnish items break tv prop table two screw
9,3820211,11024290,We had a marvelous time staying at Terra's beautiful apartment in downtown Albany. The location was perfect and the apartment was just splendid. Terra was very responsive to our inquiries and flexible. Could not have been happier.,we had a marvelous time staying at terra s beautiful apartment in downtown albany the location was perfect and the apartment was just splendid terra was very responsive to our inquiries and flexible could not have been happier,marvelous time stay terra beautiful apartment downtown albany location perfect apartment splendid terra responsive inquiries flexible could happier


#### **4. STEMMING**

Stemming reduces words to their root form (or stem). Note: may not always be a valid word in the language.

**For example**:
* "playing" might be stemmed to "play"

* "studying" might be stemmed to "study"

* "running" might be stemmed to "run"

* "ring" might be stemmed to "r" (ineffiency)

#### **5. LEMMATIZATION**

Lemmatization reduces words to their base or dictionary form (lemma) by considering their dictionary meaning hence providing more accuracy than stemming.

For example:

* "running" becomes "run,"

* "better" becomes "good."

* "ring" remains "ring."

Hence I won't be performing Stemming, I'll be going forward with Lemmatization.

In [None]:
# Let's see the updated result
# data.head(10)

### FEATURE EXTRACTION using TF-IDF

TF-IDF is a natural language processing (NLP) technique that's used to evaluate the importance of different words in a sentence.

It's simple, intuitive technique which utilizes a fixed size input. It captures the importance of words (hence the semantics) while reducing the influence of stopwords alongside.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

#performing TF-IDF vectorization

tfidf_vectorizer = TfidfVectorizer(
                                lowercase=True,        # lowercase => allows us to lowercase our data.
                                max_features=10000,    # max_features => this is how many words out of the entire corpus we want to take into account. Here, we will use 100.
                                max_df=0.8,            # max_df => this is a float that tells the algorithm to ignore any word that occurs in 80% of the documents.
                                min_df=5,              # min_df => this is an integer that tells the algorithm to ignore words that do not occur more than 5 times in the entire corpus
                                ngram_range = (1,3),   # ngram_range => this is a tuple with index 0 being the smallest number of words to consider with the second index being the largest,
                                                       # i.e. 2 for bigram and 3 for trigrams. (1, 3) means the algorithm should consider anything from a unigram (one word) to trigram
                                                       # (three words) as a concept.
                                stop_words = "english" # stop_words => this is the language you are working with and you are telling the algorithm to ignore the predefined set of stopwords.
                                                       #   This is a bit excessive since we already did this with NLTK.
                            )


#def apply_tfidf(data):
    #data['TFIDF_comments'] = data['Filtered_comments'].apply(lambda x: ' '.join(x)) # Creating a column TFIDF_comments from the Lammatized_Tokens column
    # tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
    #tfidf_matrix = tfidf_vectorizer.fit_transform(data['TFIDF_comments'])
    #feature_names = tfidf_vectorizer.get_feature_names_out()
    #return tfidf_matrix, feature_names

# Applying TF-IDF on our Dataframe
#tfidf_matrix, feature_names = apply_tfidf(data)

vectors = tfidf_vectorizer.fit_transform(data['Filtered_comments'])

print(f'Feature extraction using TF-IDF complete.')
print(f'Time Taken: {round(time.time()-t)} seconds')

Feature extraction using TF-IDF complete.
Time Taken: 37 seconds


In [17]:
# Printing the updated data
# data.head(10)

In [18]:
print (vectors[0])

  (0, 4907)	0.13670743727843854
  (0, 458)	0.0807778161994712
  (0, 6054)	0.23075425680850536
  (0, 4844)	0.10656740514569052
  (0, 851)	0.11615606055594532
  (0, 4952)	0.13186291920007917
  (0, 2805)	0.15903764896424855
  (0, 5076)	0.12582604017629814
  (0, 9548)	0.15417345595683257
  (0, 4167)	0.10376554747886883
  (0, 4690)	0.18381417915325826
  (0, 9932)	0.11158225479092704
  (0, 4101)	0.1848806927849427
  (0, 9280)	0.09158360514586154
  (0, 8925)	0.09974481942098329
  (0, 4136)	0.15280210932577157
  (0, 5539)	0.15636956818554246
  (0, 1011)	0.1304552238898346
  (0, 9896)	0.1039682720579427
  (0, 8552)	0.047142178228897325
  (0, 3139)	0.11795160453361851
  (0, 5851)	0.08031658217120272
  (0, 1043)	0.12205071461215197
  (0, 633)	0.09307861098007787
  (0, 9392)	0.12705171111781485
  (0, 3368)	0.11114522126394046
  (0, 9537)	0.1909485487764692
  (0, 551)	0.18147718001292296
  (0, 6105)	0.20197420410506395
  (0, 4850)	0.18637107188952576
  (0, 5089)	0.22689054237537246
  (0, 6099)	0.21

In [19]:
feature_names = tfidf_vectorizer.get_feature_names_out()
print (feature_names[5])


able accommodate


In [20]:
dense = vectors.todense()
print (dense[0])


[[0. 0. 0. ... 0. 0. 0.]]


In [22]:
denselist = dense.tolist()
print (denselist[0])


[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

In [4]:
true_k = 5

model = KMeans(n_clusters=true_k, init="k-means++", max_iter=100, n_init=1)

model.fit(vectors)

order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names_out()



NameError: name 'KMeans' is not defined

In [23]:
print (order_centroids)

[[1177 8552 6684 ... 4559 4556    0]
 [3721 5164 3847 ... 6242 6243 9999]
 [3900 6684 3721 ... 6143 6144 4999]
 [8552 6684 1617 ... 4709 1294 1229]
 [3966 3721 8552 ... 6652 6651    0]]


In [24]:
print (terms[92])


accommodate recommend


In [25]:
i = 0
for cluster in order_centroids:
    print (f"Cluster {i}")
    for keyword in cluster[0:10]:
        print (terms[keyword])
    print ("")
    i=i+1

Cluster 0
br
stay
place
great
host
clean
apartment
park
need
room

Cluster 1
great
location
great location
host
great host
stay
great stay
place
clean
place great

Cluster 2
great place
place
great
great place stay
place stay
stay
nice place
nice
great place great
host

Cluster 3
stay
place
clean
nice
comfortable
host
albany
apartment
recommend
need

Cluster 4
great stay
great
stay
stay stay
usual
cozy great
great location great
house
location great
defiantly



In [3]:
#A lot of this section was obtained from https://stackoverflow.com/questions/27494202/how-do-i-visualize-data-points-of-tf-idf-vectors-for-kmeans-clustering

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [2]:
kmean_indices = model.fit_predict(vectors)

pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(vectors.toarray())

NameError: name 'model' is not defined

In [1]:
colors = ["r", "b", "m", "y", "c"]

x_axis = [o[0] for o in scatter_plot_points]
y_axis = [o[1] for o in scatter_plot_points]

NameError: name 'scatter_plot_points' is not defined

In [None]:
fig, ax = plt.subplots(figsize=(50, 50))
ax.scatter(x_axis, y_axis, c=[colors[d] for d in kmean_indices])

In [None]:
fig, ax = plt.subplots(figsize=(50, 50))
ax.scatter(x_axis, y_axis, c=[colors[d] for d in kmean_indices])
for i, txt in enumerate(names):
    ax.annotate(txt[0:5], (x_axis[i], y_axis[i]))