<a href="https://colab.research.google.com/github/tommparekh/NLP_Session/blob/master/Attendee_Intro_to_Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Natural Language Processing
- Throughout this notebook we will be exploring foundational concepts regarding NLP and applying them in a miniature project where we analyze sentiment from hotel reviews.
- Anytime you see a line surrounded by triple asterisks, `***LIKE THIS***`, that is a line of code that you will need to replace or edit.
- Have fun and good luck coding!

> To execute a line or block of code, simply click the "Play" button on the left side or use the keyboard shortcut "Shift + Enter".
> Once a code block has run, the empty brackets next to it will show a number.

In [0]:
x = ***EDIT THIS CODE***
print(x)

### Importing the Packages That We'll Need
One of the things that makes Python **great** for data science is all of the different libraries that exist so we don't have to code them from scratch. Tonight we'll be taking advantage of:
- [Pandas](https://pandas.pydata.org/) for data wrangling and analysis
- [Scikit-learn](https://scikit-learn.org/stable/) for machine learning
- [re](https://docs.python.org/3/library/re.html), Python's built-in module for regular expressions and text parsing
- [Matplotlib](https://matplotlib.org/) for visualizing and plotting our data

In [0]:
import pandas as pd
import sklearn
import re
import matplotlib.pyplot as plt
%matplotlib inline

## Import the Data Set
Pandas can work with information from all kinds of data sources. Below, we'll import the data we need from a GitHub URL and read it into a Pandas DataFrame.

In [0]:
data = pd.read_csv('https://github.com/Thinkful-Ed/data-201-resources/raw/master/hotel-reviews.csv')
data.head(3)

In [0]:
# Checking the size of our data (rows, columns)
data.shape

We don't need to worry about using all of our different columns for this project. Instead, we'll focus on just the hotel name, rating, and review text columns.

In [0]:
data[['name', 'reviews.rating', 'reviews.text']]
***OVERWRITE OUR ORIGINAL DATA VARIABLE WITH THIS***

Pandas also has some cool functionality that lets us quickly edit aspects of our DataFrame object like the column names. **Don't forget to make sure those changes persist by using the `inplace` parameter or by overwriting the variable like we did above!**

In [0]:
data.rename(columns={'name':'hotel_name', 'reviews.rating': 'review_rating', 'reviews.text':'review_text'}, inplace=***MAKE THIS TRUE***)

In [0]:
# Check to make sure our changes took effect
data.head(3)
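
To see why persistence matters, here is a small sketch on a toy DataFrame. The hotel names and ratings are made up for illustration; only the column names mirror our data set.

```python
import pandas as pd

# Toy frame with made-up values, only for illustrating persistence.
toy = pd.DataFrame({'name': ['Hotel A', 'Hotel B'], 'reviews.rating': [4.0, 2.0]})

# Without inplace=True (and without reassignment) the original is untouched.
toy.rename(columns={'name': 'hotel_name'})
print(list(toy.columns))

# Option 1: store the returned copy in a variable.
renamed = toy.rename(columns={'name': 'hotel_name'})
print(list(renamed.columns))

# Option 2: modify the existing object in place (the call itself returns None).
toy.rename(columns={'reviews.rating': 'review_rating'}, inplace=True)
print(list(toy.columns))
```

Either option works; just avoid combining them (`toy = toy.rename(..., inplace=True)`), since that would overwrite `toy` with `None`.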

## Data Wrangling / Processing
A major part of every data science project is the data wrangling and processing phase, and this is especially true in NLP. During this section of the notebook, we'll cover:
- How to clean up the clutter from our initial text.
- How to work with variable assignment and Pandas syntax to make sure those changes persist.
- How to use Pandas' `map` functionality to assign more relevant labels to our reviews.

In [0]:
# Make everything lower case — but this doesn't actually change our dataframe!
data['review_text'].str.lower()

In [0]:
# Make sure that change actually "sticks"
data['review_text'] = data['review_text'].str.lower()

In [0]:
# Remove punctuation in a similar fashion (regex=True so the pattern is treated as a regular expression).
data['review_text'] = data['review_text'].str.replace(r'\.|\!|\?|\'|,|-|\(|\)', "", regex=True)

In [0]:
# Fill in missing reviews with '' rather than leaving them null (which would give us errors later).
data['review_text'] = data['review_text'].fillna('')

In [0]:
data.head(3)
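
If you'd like to see exactly what that pattern strips, here is a rough sketch using Python's built-in `re` module on a single string. The `clean_review` helper and the sample sentence are hypothetical and not part of the workshop code.

```python
import re

# Same pattern as the cleaning cell above: . ! ? ' , - ( )
PUNCTUATION = r"\.|\!|\?|\'|,|-|\(|\)"

def clean_review(text):
    """Lower-case a review and strip the listed punctuation marks."""
    if text is None:  # stand-in for a missing review
        return ''
    return re.sub(PUNCTUATION, '', text.lower())

sample = "Great stay! Didn't love the check-in (too slow), though."
print(clean_review(sample))
# great stay didnt love the checkin too slow though
```

Notice that removing the hyphen merges "check-in" into "checkin". Replacing punctuation with a space instead would be another reasonable choice.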

Since we're working on a sentiment analysis project, let's have some fun and change the ratings from numbers to more human, emotional sentiment labels!

In [0]:
# Taking our hotel ratings and "translating" them to sentiment labels
data['review_sentiment'] = data['review_rating'].map({1.0:***CHANGE THIS CODE TO A STRING***,
                                                      2.0:***CHANGE THIS CODE TO A STRING***,
                                                      3.0:***CHANGE THIS CODE TO A STRING***,
                                                      4.0:***CHANGE THIS CODE TO A STRING***,
                                                      5.0:***CHANGE THIS CODE TO A STRING***})

After we've done that, we can get rid of the old rating column and drop any rows that ended up without a sentiment label (their rating was missing or didn't match one of the values we mapped).

In [0]:
# Dropping that original rating column
data.drop(columns='review_rating', inplace=True)

# Dropping all rows where there is a null value in the sentiment column
data.dropna(subset=['review_sentiment'], inplace=True)

In [0]:
data.head(3)
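
Here is a minimal sketch of how `map` and `dropna` interact, using made-up ratings and hypothetical labels (pick your own for the exercise above). Any value without a key in the dictionary comes back as null.

```python
import pandas as pd

# Made-up ratings; 4.5 and the missing value have no key in the mapping.
ratings = pd.Series([5.0, 1.0, 4.5, None])

# Hypothetical labels, chosen only for this illustration.
labels = ratings.map({1.0: 'negative', 5.0: 'positive'})
print(labels.tolist())

# dropna() then removes the rows that could not be labelled.
print(labels.dropna().tolist())
```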

In [0]:
# Quick look at the distribution of our label
data['review_sentiment'].value_counts().plot(kind='bar')
plt.title('Label Counts in the Data Set')
plt.xlabel('***EDIT THIS STRING***')
plt.ylabel('***EDIT THIS STRING***');

## Creating a Bag of Words
- In this step, we'll take all of that text that we cleaned up and encode it so our model can understand it.
- Scikit-learn has a number of [different ways](https://scikit-learn.org/stable/modules/feature_extraction.html) to do this. Today we'll stick with the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [0]:
# Import and initiate a vectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# max_features caps the vocabulary: only the 5,000 most frequent words get their own column.
vectorizer = CountVectorizer(max_features=5000)

In [0]:
# Vectorize our reviews to transform sentences into columns.
X = vectorizer.fit_transform(data['review_text'])

# And then put all of that in a table (use get_feature_names() on scikit-learn versions older than 1.0).
bag_of_words = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

In [0]:
bag_of_words.head(3)

## Modeling
Now that we've got everything cleaned up and restructured, it's time to model!
- We'll use our `bag_of_words` as our features to predict our label of `review_sentiment`.
- `X` is a common convention for designating our feature matrix; the same is true for using `y` for the target series.
- Once we've defined those, we can have our [Multinomial Naive Bayes model](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) learn from them just like we would in many other machine learning problems.

In [0]:
# X is our features or attributes.
X = bag_of_words

# y is our review sentiment column (the outcome we care about).
y = data['review_sentiment']

In [0]:
# Import and instantiate our model
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

# Train our model on the review text and associated labels
trained_model = model.fit(X, y)

## Making Predictions From Our Model
After our model has learned from our historical data, we can introduce new or unseen reviews to it and ask it to predict the sentiment of those.
- Go ahead and play around with the text in your review and see how the output changes!

In [0]:
# Write your own hotel review here...
test_review = ['''
    ***EDIT THIS STRING***.
    ''']

In [0]:
# Convert the test review just like we did earlier.
X_test = vectorizer.transform(test_review).toarray()

In [0]:
# Use our model to predict a label for it
prediction = trained_model.predict(X_test)
print(prediction)

In [0]:
# Alternatively, we can predict a probability for each sentiment
probas = trained_model.predict_proba(X_test)[0]

# And convert it to a more readable output, keyed by label so equal percentages don't overwrite each other.
probabilities = [str(int(x*100))+'%' for x in probas]
labels = list(trained_model.classes_)
dict(zip(labels, probabilities))
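
To see the whole train-and-predict loop at a glance, here is a self-contained sketch on four made-up reviews. The reviews, the `happy`/`unhappy` labels, and the `toy_` names are all hypothetical. The point is that the columns of `predict_proba` follow the order of `classes_`, so pairing the two gives one probability per label.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training reviews and hypothetical labels, for illustration only.
toy_reviews = ['great clean room', 'great friendly staff',
               'dirty noisy room', 'rude staff dirty lobby']
toy_labels = ['happy', 'happy', 'unhappy', 'unhappy']

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_reviews), toy_labels)

# predict_proba returns one column per class, in the order of classes_.
toy_probas = toy_model.predict_proba(toy_vectorizer.transform(['great staff']))[0]

# Keying by label keeps every class, even when two probabilities are equal.
by_label = {str(label): round(float(p), 2)
            for label, p in zip(toy_model.classes_, toy_probas)}
print(by_label)
```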

# Take Home Challenge
Now that you've been introduced to some of the foundational concepts within the NLP space, we want you to apply what you've learned outside of the workshop environment. We went over the concepts of using "stop words" and "n-grams" to (hopefully) improve your analysis and the accuracy of your model, but didn't actually apply them yet in this workbook.

To do so:
- Try removing stop words from your bag of words structure before feeding it into the Multinomial NB classifier. This can be as simple as including an argument for the `stop_words` parameter in the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) class.
- Additionally, try using bigrams to create some additional context for the text that you're analyzing. Again, scikit-learn makes implementing that change relatively easy by including an `ngram_range` parameter within the same [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) class.
- As you make these changes (and maybe others!), be sure to note how your predictions change accordingly. Do they behave more or less in the ways that you expect?

**Bonus Advanced Challenge:**
Looking to take your modeling powers to the next level? You're going to want a way to see if the predictions that your model is making are accurate or not. In a practical setting, one of the most common ways to do this is by using [train-test-split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
 - You will take a large portion of your data (75% by default using `sklearn`) and train your model or have it learn from all of that information.
 - The other smaller portion (25% in this case) will be held out — your model will **not** learn from these data points. Instead, they can be used to simulate real-world or unseen data points.
 - You will ask your model to make predictions on this smaller set and then compare the predictions made against the actual ground truths to find out how well your model performs.
 - Some common metrics for classification performance are [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score) and inspecting the [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).
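
The steps above could be sketched roughly as follows. The reviews and labels here are made up so the example runs on its own; for the real challenge you would pass in `data['review_text']` and `data['review_sentiment']` instead. The `random_state` and `stratify` arguments are optional choices, not requirements.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up reviews and hypothetical labels, for illustration only.
reviews = ['great clean room', 'friendly helpful staff', 'lovely quiet stay',
           'great breakfast', 'dirty noisy room', 'rude unhelpful staff',
           'awful smelly lobby', 'terrible breakfast']
labels = ['happy'] * 4 + ['unhappy'] * 4

# Hold out 25% of the rows (the default); random_state makes the split repeatable.
train_text, test_text, y_train, y_test = train_test_split(
    reviews, labels, random_state=42, stratify=labels)

# Fit the vectorizer on the training text only, then reuse it on the test text.
eval_vectorizer = CountVectorizer()
eval_model = MultinomialNB().fit(eval_vectorizer.fit_transform(train_text), y_train)
y_pred = eval_model.predict(eval_vectorizer.transform(test_text))

accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred, labels=['happy', 'unhappy'])
print(accuracy)
print(matrix)
```

With only eight toy reviews the scores mean very little, but the same pattern applies unchanged to the full hotel data set.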

# Keep Learning with Thinkful
If you enjoyed today's session and want to take a deeper dive into many of the topics that we covered today like Pandas, SQL, predictive modeling, visualizing your data, and so much more, we'd love to have you join us again!
- Check out more of our webinars at [Thinkful Webinars](https://www.thinkful.com/webinars/)
- Get a taste of what the actual program would be like with our [Two-Week Free Trial](https://www.thinkful.com/join/sign-up/?signup_key=virtual-data)
  - Flexible Online Course
  - Start with Python and Statistics
  - Access to a Personal Program Manager
  - Attend Unlimited Q&A Sessions 
  - Participate in the Student Slack Community