# Exploratory Data Analysis

## Introduction

After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

When working with numerical data, exploratory data analysis (EDA) techniques include finding the average of the data set, the distribution of the data, the most common values, and so on. The idea is the same when working with text data: we look for the more obvious patterns with EDA before identifying hidden patterns with machine learning (ML) techniques. We are going to look at the following for each resort:

1. **Most common feedback** - find these and create word clouds
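
To make the parallel between numerical and text EDA concrete, here is a minimal sketch on made-up data. The `ratings` list and the `review` string are illustrative assumptions and are not taken from the resort data set.

```python
from collections import Counter
from statistics import mean, mode

# Numerical EDA: summary statistics on a made-up list of ratings
ratings = [4, 5, 3, 5, 4, 5]
print(round(mean(ratings), 2))  # average rating
print(mode(ratings))            # most common rating

# Text EDA: the analogous step is counting how often each word occurs
review = "great view great staff clean room great location"
word_counts = Counter(review.split())
print(word_counts.most_common(3))
```

The word counts play the same role for text that the average and mode play for numbers: they give a quick first picture before any modelling.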


## Most Common Words

### Analysis

In [None]:
# Read in the document-term matrix
import pandas as pd

data = pd.read_pickle('../pickle/dtm.pkl')
data = data.transpose()
data.head()

In [None]:
# Find the top 30 words said for each resort
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c]= list(zip(top.index, top.values))
top_dict

In [None]:
# Print the top 15 words said for each resort
for resort, top_words in top_dict.items():
    print(resort)
    print(', '.join(word for word, count in top_words[:15]))
    print('---')

**NOTE:** At this point, we could go on and create word clouds. However, by looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.



In [None]:
# Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each resort
words = [word for resort in data.columns for word, count in top_dict[resort]]
words

In [None]:
# Let's aggregate this list and identify the most common words along with how many resorts they occur in
Counter(words).most_common()

In [None]:
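# Automatically flag any word that is a top word for more than 6 resorts.
# With only six resorts this cutoff never triggers, so the hand-picked words
# below make up the whole list; lower the cutoff (e.g. count > 3 for "more
# than half of the resorts") to add shared words automatically.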
add_stop_words = [word for word, count in Counter(words).most_common() if count > 6]
add_stop_words_additional = ['hotel','resort','munnar','place','room','rooms','swiss','county','tea','good','great','stay','nice']
add_stop_words.extend(add_stop_words_additional)
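
The "more than half of the resorts" rule mentioned in the comment above can be sketched on toy data as follows. The resort names and word lists in `toy_top_words` are hypothetical, chosen only to show how the cutoff behaves.

```python
from collections import Counter

# Hypothetical top-word lists for three made-up resorts (not the real data)
toy_top_words = {
    "resort_a": ["hotel", "staff", "view"],
    "resort_b": ["hotel", "food", "view"],
    "resort_c": ["hotel", "pool", "staff"],
}

# Count in how many resorts each word appears as a top word
doc_freq = Counter(word for words in toy_top_words.values() for word in set(words))

# Keep the words that are top words for more than half of the resorts
cutoff = len(toy_top_words) / 2
shared_words = sorted(word for word, n in doc_freq.items() if n > cutoff)
print(shared_words)
```

Deriving the cutoff from the number of resorts, rather than hard-coding it, keeps the rule meaningful if resorts are added or removed.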

In [None]:
# Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
data_clean = pd.read_pickle('../pickle/data_clean.pkl')

# Add new stop words
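# Recent scikit-learn versions expect a list here rather than a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Recreate document-term matrix
cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(data_clean.review)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())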
data_stop.index = data_clean.index

# Pickle it for later use
import pickle
with open("../pickle/cv_stop.pkl", "wb") as f:
    pickle.dump(cv, f)
data_stop.to_pickle("../pickle/dtm_stop.pkl")
#data_stop.to_pickle("../pickle/dtm_stop_tm.pkl") # going to use for topic modeling
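
To see what rebuilding the document-term matrix with extra stop words does, here is a small standard-library sketch. It only mimics the idea behind `CountVectorizer` and is not how scikit-learn implements it; `toy_reviews` and `toy_stop_words` are made-up values.

```python
from collections import Counter

# Made-up reviews and stop words, for illustration only
toy_reviews = {
    "resort_a": "the hotel has a great view",
    "resort_b": "the hotel food was great",
}
toy_stop_words = {"the", "a", "has", "was", "hotel", "great"}

# Count the remaining words in each review after dropping stop words
toy_counts = {
    name: Counter(w for w in review.split() if w not in toy_stop_words)
    for name, review in toy_reviews.items()
}

# One row per review, one column per remaining term (sorted alphabetically)
vocabulary = sorted(set().union(*toy_counts.values()))
toy_dtm = {name: [counts[term] for term in vocabulary] for name, counts in toy_counts.items()}
print(vocabulary)
print(toy_dtm)
```

Once the generic words are removed, the columns that remain are the terms that actually distinguish one review from another.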

In [None]:
# Let's make some word clouds!
# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)


In [None]:
# Reset the output dimensions
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [16, 6]
resort_names = ['KTDC Tea County', 'Misty Mountain', 'Munnar Tea Country', 'Rivulet Resort', 'Swiss County', 'Tea Valley']

# Create subplots for each resort
for index, resort in enumerate(data.columns):
    wc.generate(data_clean.review[resort])
    
    plt.subplot(3, 4, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(resort_names[index])
    
plt.show()

### Findings

* Judging by the word clouds, most of the resorts appear to be in a good location. Let's dig into that later.
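
One possible way to follow up on a finding like this is to count how often a single term appears for each resort. The sketch below uses made-up reviews in `toy_reviews`; with the real data, the same idea would apply to the cleaned review text.

```python
from collections import Counter

# Made-up reviews for illustration; these are not the real resort reviews
toy_reviews = {
    "resort_a": "great location and friendly staff location near town",
    "resort_b": "lovely view but poor location",
    "resort_c": "clean rooms and tasty food",
}

# Count occurrences of one term of interest in each review
term = "location"
term_counts = {name: Counter(review.split())[term] for name, review in toy_reviews.items()}
print(term_counts)
```

A raw count does not say whether the mention is positive or negative ("poor location" still counts), so it is a starting point rather than a conclusion.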