# NLP

This tutorial builds a sentiment analysis model for restaurant reviews using Natural Language Processing (NLP). Starting from raw text, you will clean the reviews and transform them into a usable format, extract features with techniques such as Bag of Words and TF-IDF, and train a machine learning model to classify each review as positive or negative. Finally, you will evaluate the model with metrics such as accuracy and the confusion matrix. By the end, you will have built a complete NLP pipeline that you can adapt to similar text classification problems.

In [None]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

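# Importing the dataset
# delimiter = '\t' reads the file as tab-separated values.
# quoting = 3 (csv.QUOTE_NONE) treats double quotes inside reviews as ordinary characters.
dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)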

dataset

Mounted at /content/drive


| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


The goal is to train a Naive Bayes classifier that predicts the Liked label (1 for a positive review, 0 for a negative one) from the text in the Review column.

In [None]:
# Cleaning the texts
# re is the regular expressions library in Python, used for pattern matching and string manipulation.
# nltk is the Natural Language Toolkit, a comprehensive library for working with human language data.

import re
import nltk

# This line downloads the list of stopwords from the NLTK library. Stopwords are common words like "and," "the," "in," etc.,
# which are typically removed from text data because they don’t carry much meaningful information.

nltk.download('stopwords')

# stopwords is imported from nltk.corpus, which will be used to filter out the stopwords from the reviews.
# PorterStemmer is a stemming algorithm that reduces words to their root form (e.g., "running" to "run").

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# corpus is an empty list that will store the cleaned and processed reviews.

corpus = []

# Create the Porter Stemmer and the stopword set once, before the loop,
# so they are not rebuilt for every review and every word.

ps = PorterStemmer()
all_stopwords = set(stopwords.words('english'))

# This loop iterates over every review in the dataset (1000 rows here).
# re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]) replaces every character that is not a letter (a-z or A-Z) with a space ' '.
# This removes numbers, punctuation, and special characters, leaving only letters.

for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

    # Convert the review to lowercase so that "This" and "this" are treated the same.
    # Split the review into a list of words. For example, "loved this place" becomes ['loved', 'this', 'place'].

    review = review.lower()
    review = review.split()

    # The list comprehension keeps each word that is not a stopword (e.g., "and", "the")
    # and applies stemming to reduce it to its root form.
    # The stemmed words are then joined back into a single space-separated string,
    # and the cleaned review is appended to the corpus list.

    review = [ps.stem(word) for word in review if word not in all_stopwords]
    review = ' '.join(review)
    corpus.append(review)

# After running this loop, corpus contains one cleaned review per row of the dataset,
# each represented as a string of stemmed words with stopwords removed.
# This corpus is the input for the feature extraction step that follows.
# Example:
# Original Review: "Wow... Loved this place."
# After Cleaning: "wow love place"

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
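
One side effect of the stopword filter is worth checking: NLTK's English stopword list includes negation words such as "not", so a review like "Crust is not good." loses the word that carries its sentiment. The sketch below illustrates this with a small hand-written stopword set (a stand-in for stopwords.words('english'), used so the example runs without any download) and a hypothetical helper named clean_review that skips stemming. Keeping negations is one possible adjustment, not part of the pipeline above.

```python
import re

# Stand-in for set(stopwords.words('english')). The real NLTK list is much
# longer, but it also contains "not", which is what matters here.
STOPWORDS = {"is", "this", "the", "and", "was", "not"}

# Negation words we may want to keep because they flip the sentiment.
NEGATIONS = {"not"}


def clean_review(text, stop_words):
    """Keep letters only, lowercase, split, and drop stopwords (no stemming)."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    words = letters_only.lower().split()
    return ' '.join(word for word in words if word not in stop_words)


review = "Crust is not good."

# Default filtering drops "not", so the cleaned text looks positive.
print(clean_review(review, STOPWORDS))              # crust good

# Removing negations from the stopword set preserves the meaning.
print(clean_review(review, STOPWORDS - NEGATIONS))  # crust not good
```

Whether this helps a Bag of Words model is an empirical question, since single-word counts still lose the link between "not" and the word it negates.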


In [None]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
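# max_features = 1500 keeps only the 1500 most frequent words in the vocabulary.
cv = CountVectorizer(max_features = 1500)
# X is the matrix of word counts (one row per review, one column per word).
X = cv.fit_transform(corpus).toarray()
# y holds the labels; selecting the column by name avoids relying on its position.
y = dataset['Liked'].values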
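
To make the Bag of Words step concrete, here is a minimal pure-Python sketch of a count matrix for a three-review toy corpus, followed by a textbook TF-IDF reweighting (term count times log(N / document frequency)). The toy corpus and the helper names bag_of_words and tf_idf are illustrative assumptions. scikit-learn's CountVectorizer and TfidfVectorizer differ in details (for example, TfidfVectorizer smooths the IDF term and L2-normalizes each row by default), so their numbers will not match this sketch exactly.

```python
import math
from collections import Counter

# Toy corpus in the same form as the cleaned reviews (illustrative only).
toy_corpus = ["wow love place", "crust good", "love crust love"]


def bag_of_words(corpus):
    """Return the sorted vocabulary and one row of word counts per document."""
    vocabulary = sorted({word for doc in corpus for word in doc.split()})
    rows = []
    for doc in corpus:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


def tf_idf(rows):
    """Reweight counts by log(N / df), the textbook (unsmoothed) IDF."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in rows]


vocabulary, X_counts = bag_of_words(toy_corpus)
print(vocabulary)   # ['crust', 'good', 'love', 'place', 'wow']
print(X_counts)     # [[0, 0, 1, 1, 1], [1, 1, 0, 0, 0], [1, 0, 2, 0, 0]]

# Words that appear in fewer documents receive a larger weight.
X_tfidf = tf_idf(X_counts)
print(X_tfidf[0])
```

In the first review, "place" and "wow" appear in only one document, so they get a higher TF-IDF weight than "love", which appears in two.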

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [None]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
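
For intuition about what a Naive Bayes classifier computes, the following is a small pure-Python sketch of the multinomial variant with Laplace (add-one) smoothing on a toy training set. This is not the GaussianNB model used above: GaussianNB models each feature with a normal distribution, whereas the multinomial variant works directly with word counts and is a common choice for Bag of Words features. The toy data and the function names train_nb and predict_nb are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(tokens, model):
    """Return the class with the highest log prior + sum of log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)           # log P(class)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:                            # ignore unseen words
                # Add-one smoothing avoids log(0) for words unseen in a class.
                score += math.log((word_counts[label][tok] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set: 1 = liked, 0 = not liked (illustrative only).
train_docs = [["love", "place"], ["great", "food"],
              ["crust", "bad"], ["bad", "service"]]
train_labels = [1, 1, 0, 0]

model = train_nb(train_docs, train_labels)
print(predict_nb(["love", "food"], model))   # 1
print(predict_nb(["bad", "crust"], model))   # 0
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log probabilities are simply added.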

In [None]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 1])

In [None]:
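# Making the Confusion Matrix
# Rows are the true labels and columns are the predicted labels,
# so the layout is [[TN, FP], [FN, TP]] for the labels 0 and 1.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix : \n",cm)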

Confusion Matrix : 
 [[55 42]
 [12 91]]
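
The confusion matrix can be turned into summary metrics such as accuracy. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the matrix above reads as 55 true negatives, 42 false positives, 12 false negatives, and 91 true positives. A short sketch of the arithmetic, using the values printed above:

```python
# Values copied from the confusion matrix printed above.
cm = [[55, 42],
      [12, 91]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total       # share of all reviews classified correctly
precision = tp / (tp + fp)         # of reviews predicted positive, share truly positive
recall = tp / (tp + fn)            # of truly positive reviews, share found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.2f}")    # 0.73
print(f"Precision: {precision:.2f}")   # 0.68
print(f"Recall   : {recall:.2f}")      # 0.88
print(f"F1 score : {f1:.2f}")          # 0.77
```

The same accuracy can be obtained directly with sklearn.metrics.accuracy_score(y_test, y_pred). The 42 false positives show that the model errs mostly by labelling negative reviews as positive.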
