# Thinkful Data Science Event:

## Planning a Vacation with NLP

The purpose of this event is to become familiar with natural language processing(NLP). 

NLP entails taking unstructured data such as text and transforming it into a useful format from which we can extract insights. 

In today's event, we will use this methodology in order to help us in planning and ideal vacation. The Kaggle website contains a file of hotel reviews which can be downloaded [here](https://www.kaggle.com/datafiniti/hotel-reviews/data). Our task is to use these reviews to create a random forest that will help plan our ideal location for a holiday.  

## Getting Started-

The data for tonights event contains reviews for 1000 hotels collected by Datafinity. We are going to use these reviews to help us plan a vacation. Before we can do that, however, we will have to pre-process our data. This will include transforming the shape and structure of the data to make it readable by our algorithm, and also the removal of unnecessary information that cannot be interpreted by the computer.

### NLTK

In order to complete our project, we will rely on the `nltk` package which provides an excellent platform in python for carrying out natural language processing. To install this package and set it up:

1. Open up a terminal window or command line:
    * Run `pip install --upgrade pip`
    * Then type `pip install nltk`
    
2. Create a new instance of python
    * Run the following line: `import nltk`
    * Then you must type in `nltk.download()`

When you run the last line, it will bring out a pop-up window that looks like this:

![Image of NLTK](https://likegeeks.com/wp-content/uploads/2017/09/01-NLP-Tutorial.png)

The packages are all small, so we can go ahead and install them all. Then simply close the pop-up and we can continue with our project in python.

In [6]:
# Import the required libraries
import os
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import preprocessing, model_selection, metrics, naive_bayes
from sklearn.ensemble import RandomForestClassifier

In [7]:
# Check the working directory
# print(os.getcwd())
# path = 'path/to/files'
# os.chdir(path)

In [8]:
# Load the data
data = pd.read_csv('https://github.com/Thinkful-Ed/data-201-resources/raw/master/hotel-reviews.csv')

## Pre-processing-

This is often the most time consuming component of any data science project. Indeed, it can comprise up to 90% of the effort in building a model. 

In order to analyze our data, we must first process it through several tasks:
* Remove casing
* Separate words
* Remove unnecessary information
* Word "stemming"
* TF-IDF
* Encode the target variable
* Split the data

In [9]:
# Convert to lower-case
data['reviews.text'] = data['reviews.text'].str.lower()

In [10]:
# Remove punctuation
data["words"] = data['reviews.text'].str.replace('[^\w\s]','')

The next step is referred to as "tokenization". This separates each word in the document or "corpus". 

Tokenization can take place at two different levels. 

First of all, we can separate sentences. More importantly, we can separate individual words. For our purposes, we will need to separate the words. 

For this reason, we can inherently ignore the sentence tokenization. Plus, we've already removed all the punctuation from our dataframe anyways.

In [11]:
# Tokenize words
data['words'] = data['reviews.text'].apply(str)

data['words'] = data['words'].apply(lambda row: word_tokenize(row))

Now that we have separated all of our words, the data looks like this- for each row, the `reviews.text` column has been transformed from sentences into a simple list of words. 

We still need to further process these results. 

First, we must remove the "stop words". These are articles like "the, is, are, etc." that don't really add any important information to the text. 

`NLTK` has lists of common stop words in many different languages. We can use these lists to automatically filter the stop words out of our data.  

In [12]:
# Filter out stop words
stop_words = set(stopwords.words('english'))

data['words'] = data['words'].apply(lambda x: [item for item in x if item not in stop_words])

The next step in processing our data is called "stemming". 

Basically, all this does is strip individual words down to their roots. For example, running would become run. 

The `NLTK` library also provides a convenient function for doing this as well. There are even several options that allow you to stem the words into different formats. 

In [13]:
# 'Stemming' the words
ps = PorterStemmer()

data['stemmed'] = data['words'].apply(lambda x: [ps.stem(y) for y in x])

It was mentioned earlier that our pre-processing left us with a column in which the data for each row was a list. 

In python(pandas), those lists are seen as objects. These objects will cause problems as we procede. 

Before continuing any further, we will go ahead and rejoin the lists. This will leave us with a single `string` of words separated by spaces for each row of the data frame. 

In [14]:
# Rejoin the words
data['stemmed'] = data['stemmed'].apply(lambda x: ' '.join(x))

In [15]:
# Separate features and target
df = pd.DataFrame()

df['x'] = data['stemmed']

df['y'] = data['name']

Sentences, whether they are in the form of lists or paragraphs, cannot really be understood by machine learning algorithms. We still need to convert our reviews into a binary format in which words are represented by 0's and 1's. 

Basically, our columns will become words and our rows will be sentences. Then, for each sentence- the words contained will be represented as a 1 in the corresponding column.

In [16]:
# Create count vectors for each review
cv = CountVectorizer(analyzer = 'word', token_pattern = r'\w{1,}')

In [17]:
# Apply counter to data
x = cv.fit_transform(df['x'])

x = pd.DataFrame(x.toarray(), columns = cv.get_feature_names())

The next step in the process is called tf-idf. This stands for "term frequency- inverse document frequency". According to Wikipedia:

> In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.[1] It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Tf-idf is one of the most popular term-weighting schemes today; 83% of text-based recommend-er systems in digital libraries use tf-idf.[2]

The point of this calculation is to describe the relative importance of each word in a collection of documents based on its frequency of occurence. 

In [18]:
# TF-IDF
tf_idf = TfidfVectorizer(analyzer = 'word', token_pattern = r'\w{1,}', max_features = 5000)

x_tf_idf = tf_idf.fit_transform(df['x'])

x_tf_idf = pd.DataFrame(x_tf_idf.toarray(), columns = tf_idf.get_feature_names())

Thats it, our data has been processed and is ready to be fed into a model. The only remaining task is to re-label the hotels in terms of numbers rather than string values.

In [19]:
# Convert the hotel names to numbers for model
encoder = preprocessing.LabelEncoder()

df['y1'] = encoder.fit_transform(df['y'])

## Machine Learning-

Now that the hard part is over, its time to get to the fun stuff- building models. The truth is that since we have pre-processed our data: we can now feed it into virtually any machine learning algorithm you can think of. 

In [20]:
# Split the data into training and testing sets
train_x, test_x, train_y, test_y = model_selection.train_test_split(x, df['y1'])

tf_train_x, tf_test_x, tf_train_y, tf_test_y = model_selection.train_test_split(x_tf_idf, df['y1'])

For our purposes, we will limit the possibilities to a Random Forest and a Naive Bayes Classifier. 

However, we will first create a function that will take our data and model as inputs and return the accuracy. Thus, we could theoretically run any/all potential models. We can also reuse this code in other projects for the same purpose.  

In [21]:
# Function to build the model
def build_model(model, x_training, y_training, x_testing, y_testing):
    
    # Fit the model on the training data
    model.fit(x_training, y_training)
    
    # Predictions
    preds = model.predict(x_testing)
    
    # Output 
    return(metrics.accuracy_score(preds, y_testing))

In [22]:
# Run the function to create a Random Forest
rf_count_acc = build_model(RandomForestClassifier(), train_x, train_y, test_x, test_y)

rf_tf_acc = build_model(RandomForestClassifier(), tf_train_x, tf_train_y, tf_test_x, tf_test_y)

In [23]:
# Run the function to create a Naive Bayes 
nb_count_acc = build_model(naive_bayes.MultinomialNB(), train_x, train_y, test_x, test_y)

nb_tf_acc = build_model(naive_bayes.MultinomialNB(), tf_train_x, tf_train_y, tf_test_x, tf_test_y)

## Conclusion- 

Congratulations! 

We have successfully completed our first forray into the complicated, confusing world of natural language processing. 

The skills we have learned today taught us many components of general programming in the python language, more importantly- we also learned the many steps required in order to translate and interpret textual information using machine learning and data science.

In [24]:
# Results
print("The random forest model accuracy for the vectorized reviews is:", rf_count_acc)

print("The naive bayes model accuracy for the vectorized reviews is:", rf_tf_acc)

print("The random forest model accuracy for the tf-idf reviews is:", nb_count_acc)

print("The naive bayes model accuracy for the tf-idf reviews is:", nb_count_acc)

The random forest model accuracy for the vectorized reviews is: 0.1932501670750724
The naive bayes model accuracy for the vectorized reviews is: 0.21753174426375585
The random forest model accuracy for the tf-idf reviews is: 0.0797505012252172
The naive bayes model accuracy for the tf-idf reviews is: 0.0797505012252172
