Data Analysis

Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Loading the data set into a pandas DataFrame

In [2]:
data = pd.read_csv("/content/Data Analyst - Test Data - US.csv")

In [3]:
# first five rows
data.head()

Unnamed: 0,Review,date,Location
0,I was very impressed with the resort.\n Great ...,2019/08/20,Sebastian
1,The rooms were nice the outside needs work als...,2019/08/20,Los Angeles
2,Great location! I have stayed at this hotel on...,2019/08/20,Georgia
3,The hotel was adequate for my stay. The strips...,2019/08/20,
4,"Great location, room was large and spacious. P...",2019/08/19,Palm Harbor


In [4]:
# Rows and columns of the data set
data.shape

(6448, 3)

In [5]:
# checking the data types
data.dtypes

Review      object
date        object
Location    object
dtype: object

All the columns are of object data type

Checking for missing values in the data set

In [6]:
data.isnull().sum()

Review        55
date           0
Location    4737
dtype: int64

The Review column has 55 null values and the Location column has 4737

Dropping every row that has a null value in any column

In [7]:
data.dropna(inplace=True)

In [8]:
data.isnull().sum()

Review      0
date        0
Location    0
dtype: int64

No null values remain; 1705 of the original 6448 rows are left
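
Since Location is missing in 4737 rows, calling dropna() with no arguments removes most of the data set, even though Location is discarded in the next step. The sketch below is an alternative, not what this notebook does: it uses a made-up toy frame to contrast the default behaviour with restricting the null check to Review through the subset parameter.

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up for illustration).
toy = pd.DataFrame({
    "Review": ["Great stay", None, "Clean rooms", "Noisy at night"],
    "date": ["2019/08/20", "2019/08/20", "2019/08/19", "2019/08/19"],
    "Location": ["Georgia", "Sebastian", None, None],
})

# dropna() with no arguments removes a row if ANY column is null.
dropped_any = toy.dropna()

# Restricting the check to Review keeps rows whose only gap is Location.
dropped_review_only = toy.dropna(subset=["Review"])

print(len(dropped_any), len(dropped_review_only))
```

Which variant is appropriate depends on whether Location matters for the analysis; the remaining cells here use only the Review text.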

Dropping the date and Location columns

In [11]:
data.drop(columns=["date", "Location"], inplace=True)

In [12]:
data.head()

Unnamed: 0,Review
0,I was very impressed with the resort.\n Great ...
1,The rooms were nice the outside needs work als...
2,Great location! I have stayed at this hotel on...
4,"Great location, room was large and spacious. P..."
10,Very clean and friendly and I love the locatio...


Converting the Review column to strings

In [15]:
data["Review"] = data["Review"].astype(str)

word_count: the number of whitespace-separated words in each review

In [16]:
data["word_count"] = data["Review"].apply(lambda x: len(x.split()))

In [17]:
data.head()

Unnamed: 0,Review,word_count
0,I was very impressed with the resort.\n Great ...,33
1,The rooms were nice the outside needs work als...,25
2,Great location! I have stayed at this hotel on...,20
4,"Great location, room was large and spacious. P...",20
10,Very clean and friendly and I love the locatio...,40


char_count: the number of characters in each review, including spaces and punctuation

In [18]:
data["char_count"] = data["Review"].apply(len)

In [19]:
data.head()

Unnamed: 0,Review,word_count,char_count
0,I was very impressed with the resort.\n Great ...,33,196
1,The rooms were nice the outside needs work als...,25,134
2,Great location! I have stayed at this hotel on...,20,106
4,"Great location, room was large and spacious. P...",20,126
10,Very clean and friendly and I love the locatio...,40,238
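
As a small check of how the word_count and char_count columns are defined, the sketch below recomputes both on a toy frame with pandas' vectorized string methods instead of apply. The toy reviews are illustrative; only "·the bed" comes from the data shown in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"Review": ["Great location!", "The rooms were nice", "·the bed"]})

# Whitespace-separated tokens per review; punctuation stays attached to words.
toy["word_count"] = toy["Review"].str.split().str.len()

# Characters per review, including spaces and punctuation.
toy["char_count"] = toy["Review"].str.len()

print(toy)
```

Note that the leading "·" counts as a character and is part of the first word, so punctuation and bullet marks inflate both measures slightly.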


Finding the average word length (characters per word) in each review

In [20]:
def average_words(x):      # average number of characters per word in x
    words = x.split()
    return sum(len(word) for word in words) / len(words)

In [21]:
data["average_wordcount"] = data["Review"].apply(average_words)

In [22]:
data.head()

Unnamed: 0,Review,word_count,char_count,average_wordcount
0,I was very impressed with the resort.\n Great ...,33,196,4.69697
1,The rooms were nice the outside needs work als...,25,134,4.12
2,Great location! I have stayed at this hotel on...,20,106,4.0
4,"Great location, room was large and spacious. P...",20,126,5.0
10,Very clean and friendly and I love the locatio...,40,238,4.8
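
The average_wordcount column holds the mean number of characters per word, not a count of words. A minimal worked example follows, assuming a hypothetical helper named average_word_length; the guard for empty text is an addition that the notebook's function does not have, and returning 0.0 in that case is an arbitrary choice.

```python
def average_word_length(text):
    """Mean number of characters per whitespace-separated word."""
    words = text.split()
    if not words:          # guard: empty or whitespace-only text has no words
        return 0.0
    return sum(len(word) for word in words) / len(words)

# "·the" has 4 characters and "bed" has 3, so the mean is (4 + 3) / 2.
print(average_word_length("·the bed"))
```

Without the guard, a whitespace-only review would raise ZeroDivisionError; in this data set the minimum word_count is 1, so the original function does not hit that case.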


Loading the English stop words and counting them in each review

In [23]:
stop_words=stopwords.words("english")
len(stop_words)

179

In [24]:
data["stopword_count"] = data['Review'].apply(lambda x:len([word for word in x.split() if word.lower() in stop_words]))

In [25]:
data.head()

Unnamed: 0,Review,word_count,char_count,average_wordcount,stopword_count
0,I was very impressed with the resort.\n Great ...,33,196,4.69697,14
1,The rooms were nice the outside needs work als...,25,134,4.12,10
2,Great location! I have stayed at this hotel on...,20,106,4.0,8
4,"Great location, room was large and spacious. P...",20,126,5.0,7
10,Very clean and friendly and I love the locatio...,40,238,4.8,19
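
The count above compares whole whitespace-separated tokens against the stop word list, so a stop word with punctuation attached (for example "it.") is not counted. The sketch below illustrates this with a hypothetical count_stopwords helper and a small hand-picked stop word set; the notebook itself uses the full NLTK list, and the strip_punct option is an assumption added for illustration.

```python
import string

# Small hand-picked subset of English stop words, for illustration only;
# the notebook uses the full NLTK list.
STOP_WORDS = {"i", "was", "very", "with", "the", "it", "is", "to"}

def count_stopwords(text, stop_words=STOP_WORDS, strip_punct=False):
    """Count whitespace-separated tokens that are stop words."""
    count = 0
    for word in text.split():
        token = word.lower()
        if strip_punct:
            # Remove leading/trailing punctuation such as "." or "!".
            token = token.strip(string.punctuation)
        if token in stop_words:
            count += 1
    return count

sample = "I was very impressed with it."
print(count_stopwords(sample))                    # "it." is not matched
print(count_stopwords(sample, strip_punct=True))  # "it." is matched as "it"
```

Holding the stop words in a set also makes each membership test constant time, whereas a list is scanned element by element.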


stopword_rate: the fraction of all words in a review that are stop words (stopword_count / word_count)

In [27]:
data['stopword_rate'] = data["stopword_count"]/data["word_count"]

In [28]:
data.head()

Unnamed: 0,Review,word_count,char_count,average_wordcount,stopword_count,stopword_rate
0,I was very impressed with the resort.\n Great ...,33,196,4.69697,14,0.424242
1,The rooms were nice the outside needs work als...,25,134,4.12,10,0.4
2,Great location! I have stayed at this hotel on...,20,106,4.0,8,0.4
4,"Great location, room was large and spacious. P...",20,126,5.0,7,0.35
10,Very clean and friendly and I love the locatio...,40,238,4.8,19,0.475
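
As a quick arithmetic check, the rates in the table can be reproduced from the two count columns. The pairs below are copied from the five rows shown above.

```python
# (stopword_count, word_count) pairs copied from the first five rows above.
rows = [(14, 33), (10, 25), (8, 20), (7, 20), (19, 40)]

# Rate = stop words divided by all words, rounded to six decimals for display.
rates = [round(stop / total, 6) for stop, total in rows]
print(rates)
```

The values are fractions between 0 and 1; multiplying by 100 would give percentages.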


Reviews sorted by stop word rate, highest first

In [30]:
data.sort_values(by="stopword_rate",ascending=False)

Unnamed: 0,Review,word_count,char_count,average_wordcount,stopword_count,stopword_rate
6267,·it was so close to what we were there for!\n ...,11,61,3.818182,7,0.636364
1084,I like the location. It is close to the air po...,32,159,3.750000,19,0.593750
4553,property was partially under construction duri...,34,185,4.264706,20,0.588235
2404,The only thing we liked about the property was...,102,534,4.176471,60,0.588235
3517,"There was no elevator, I have a mother who is ...",63,325,4.063492,37,0.587302
...,...,...,...,...,...,...
6345,·housekeeping staff\n \n \n \n \n ·restaurant com,4,44,8.000000,0,0.000000
6365,·the bed,2,8,3.500000,0,0.000000
6379,"·staff, cleanliness, price",3,26,8.000000,0,0.000000
6385,·no microwave,2,13,6.000000,0,0.000000
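
sort_values returns a new, sorted frame and leaves data itself unchanged. If only the reviews with the highest rates are of interest, the sorted result can be truncated with head. A minimal sketch on a toy frame follows; the index labels and rates are borrowed from the output above, except the 0.40 row, which is made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {"stopword_rate": [0.40, 0.636364, 0.0, 0.59375]},
    index=[1, 6267, 6365, 1084],
)

# Sort descending, then keep only the top rows instead of the whole frame.
top2 = toy.sort_values(by="stopword_rate", ascending=False).head(2)
print(top2.index.tolist())

# The original frame keeps its original row order.
print(toy.index.tolist())
```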


Summary statistics of the numeric columns

In [31]:
data.describe()

Unnamed: 0,word_count,char_count,average_wordcount,stopword_count,stopword_rate
count,1705.0,1705.0,1705.0,1705.0,1705.0
mean,44.531378,250.558944,4.772229,18.478592,0.369202
std,44.401263,234.884032,2.547305,22.331574,0.131761
min,1.0,8.0,2.947368,0.0,0.0
25%,18.0,106.0,4.228571,5.0,0.3
50%,28.0,165.0,4.534247,11.0,0.4
75%,55.0,303.0,4.916667,24.0,0.466667
max,311.0,1531.0,65.2,174.0,0.636364
