# Natural Language Processing (NLP)

- Natural Language Processing (NLP) is the application of Machine Learning models to text and language.

- NLP can be used to:
    1. Judge whether a piece of text (such as a review) is good or bad.
    1. Predict the category or genre of articles and books.
    1. Perform speech recognition and translation.

- Most NLP algorithms are classification algorithms. They include Logistic Regression, Naive Bayes, CART (decision trees), and Markov models.

- A very well-known model in NLP is **Bag of Words**.
    - It is used to preprocess the texts into numeric features before a classification algorithm is fitted on the observations containing those texts.

### Classical vs Deep Learning Models
<img src='./nlp_photos/types_nlp.png' height='200px'>

**Examples**
1. Classical NLP (green)
    - If-Else Rules (Chatbot)
    - Audio frequency components analysis (Speech Recognition)
    - Bag-of-words model (Classification)

1. Deep NLP, or DNLP (purple)
    - CNN for text recognition (Classification)

## Bag of Words
- A classification algorithm cannot take raw text as input. Hence, the Bag of Words model is used to preprocess the text.

- This is achieved by converting the text into a bag of words, which keeps a count of the occurrences of the most frequently used words and stores each count in an array at the index corresponding to that word.

<img src='https://aiml.com/wp-content/uploads/2023/02/disadvantage-bow-1024x650.png' height='400px'>

- We feed the arrays to the algorithm, which learns patterns in them that are associated with specific results.
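
The counting step described above can be sketched in a few lines of plain Python. This is only a minimal illustration: the three reviews are made up, and the helper name `to_vector` is hypothetical rather than part of any library.

```python
from collections import Counter

# Hypothetical, already-cleaned reviews used only for illustration
reviews = ["good food good place", "food not good", "love place"]

# The vocabulary is every unique word, in a fixed (sorted) order,
# so that each word always maps to the same array index
vocabulary = sorted({word for review in reviews for word in review.split()})

def to_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

vectors = [to_vector(review, vocabulary) for review in reviews]

print(vocabulary)
for review, vector in zip(reviews, vectors):
    print(vector, "<-", review)
```

Each review becomes a row of counts with one column per vocabulary word, and rows of this kind are what the classifier is trained on.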

## Implementing NLP
### 1. Sentiment analysis

### Pre-processing

In [33]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
# quoting = 3 to ignore double quotes
# tsv is a tab separated value, delimiter = '\t',   default is ','
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

### Cleaning the texts

In [34]:
import re
import nltk

nltk.download('stopwords') # stop words such as "a", "an", "the" have no significance for the result (non-relevant words)
from nltk.corpus import stopwords

# Reduces words to their root form, e.g. "Loved" -> "love",  "wonderful" -> "wonder"
from nltk.stem.porter import PorterStemmer 
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not') # keep "not", because it reverses the sentiment of a review

# Cleaning the text
corpus = []
for i in range(0, dataset.shape[0]): # one iteration per review (row) in the dataset
    
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]) # replace every non-letter character in the Review column with a space
    review = review.lower() # convert to lower case
    review = review.split() # split into individual words
    
    # remove stop words and apply ps.stem(word) (stemming) to every remaining word
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    review = ' '.join(review) # join the words with spaces to form a sentence again
    corpus.append(review)

print('\n \t few sentences in corpus \n')
for i in range(0, 5):
    print(corpus[i])    


 	 few sentences in corpus 

wow love place
crust not good
not tasti textur nasti
stop late may bank holiday rick steve recommend love
select menu great price


[nltk_data] Downloading package stopwords to /home/dk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
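
To see the regular-expression and stop-word steps in isolation, here is a small sketch that needs no NLTK download. The stop-word set is a tiny hypothetical stand-in for NLTK's much longer English list, the helper name `clean` is invented for this example, and stemming is left out on purpose.

```python
import re

# Hypothetical, very small stop-word set for illustration only.
# "not" is deliberately absent, mirroring all_stopwords.remove('not').
STOP_WORDS = {"the", "is", "a", "an", "was", "this"}

def clean(text):
    """Keep letters only, lowercase, and drop stop words (no stemming)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in STOP_WORDS)

print(clean("The crust is NOT good!!"))
print(clean("Wow... 5 stars"))
```

Because "not" is kept, "crust not good" stays distinguishable from "crust good", which matters for sentiment.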


### Creating the Bag-of-words model

- `CountVectorizer()` is used here with a single parameter, **max_features = 1500**, which keeps only the 1500 most frequent words in order to reduce dimensionality.

- The focus is only on words that are used frequently and are therefore more likely to be significant. Rare words such as names are dropped, and numbers were already removed during cleaning.

**Note :**<br>
- The first time, use the object without any parameters (no `max_features`). This helps you find the total number of unique words in the corpus.
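
The idea behind limiting the vocabulary can be sketched without scikit-learn. This is an approximation under stated assumptions: the sample corpus is made up, `max_features = 3` is an arbitrary illustrative value, and how `CountVectorizer` breaks ties between equally frequent words may differ from this sketch.

```python
from collections import Counter

# Hypothetical cleaned corpus used only for illustration
corpus_sample = ["good food good place", "food not good", "love place"]

# Count every word across the whole corpus
word_counts = Counter(word for doc in corpus_sample for word in doc.split())

# Step 1: without a limit, the vocabulary size is the number of unique words
total_unique = len(word_counts)

# Step 2: keep only the most frequent words (illustrative limit)
max_features = 3
top_words = sorted(word for word, _ in word_counts.most_common(max_features))

print(total_unique)
print(top_words)
```

Words that appear only once ("not" and "love" here) fall out of the vocabulary, which is the trade-off of a small `max_features`: fewer columns, but some informative rare words are lost.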

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

# Finding the total number of unique words
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
print(len(X[0])) # length of the first row = number of unique words (columns)

1566


In [36]:
# As we can see, the corpus has 1566 unique words,
# so we can use 1500 features in our model

cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray() # Independent variable
y = dataset.iloc[:, -1].values # Dependent variable


# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

### Training the Naive Bayes model

- Any classifier can be used to categorize the reviews into 'Yes' (positive) and 'No' (negative) groups.

In [37]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)

print('\t Accuracy is', accuracy_score(y_test, y_pred))


[[55 42]
 [12 91]]
	 Accuracy is 0.73


### Predicting Single (Custom) reviews

In [38]:
new_review= ['the temperature inside is Hot, AC not working',
            'The food is not good.',
            'I liked the pasta.',
            "the staff was very friendly, it was good",
            "Nice place to visit",
            "I hate the food"]

new_corpus = []
for review in new_review:
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    review = ' '.join(review)
    new_corpus.append(review)

In [39]:
results = classifier.predict(cv.transform(new_corpus).toarray())
for i in range(0, len(new_corpus)):
    if results[i] == 1:
        print(f"{new_corpus[i]}, --> Positive")
    else:
        print(f"{new_corpus[i]}, --> Negative")

temperatur insid hot ac not work, --> Positive
food not good, --> Positive
like pasta, --> Positive
staff friendli good, --> Positive
nice place visit, --> Positive
hate food, --> Negative


### Training Kernel SVM (RBF)

In [40]:
from sklearn.svm import SVC
# Using the SVC class from the scikit-learn library
classifier2 = SVC(kernel="rbf", random_state=0)
classifier2.fit(X_train,y_train)

# Predicting results
y_pred2 = classifier2.predict(X_test)

# Creating the Confusion matrix
cm2 = confusion_matrix(y_test,y_pred2)
print(cm2)
print(f' Accuracy is {accuracy_score(y_test, y_pred2)}\n')

[[89  8]
 [36 67]]
 Accuracy is 0.78



In [41]:
results = classifier2.predict(cv.transform(new_corpus).toarray())
for i in range(0, len(new_corpus)):
    if results[i] == 1:
        print(f"{new_corpus[i]}, --> Positive")
    else:
        print(f"{new_corpus[i]}, --> Negative")

temperatur insid hot ac not work, --> Negative
food not good, --> Negative
like pasta, --> Negative
staff friendli good, --> Positive
nice place visit, --> Positive
hate food, --> Negative


- It seems that the SVM (non-linear kernel) performed better.
    - Its accuracy is higher (0.78), but its Type 2 error is also high (36 false negatives), which results in more incorrect negative predictions, as in the custom example above where "like pasta" was classified as Negative.

- The Naive Bayes model performed worse: its accuracy is lower (0.73) and its Type 1 error is higher (42 false positives), resulting in more incorrect positive predictions.

- Fine-tuning the regular expression, improving the cleaning methods, and choosing the optimal classification algorithm can help improve the accuracy and prediction quality.
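
The comparison above can be checked by hand from the two confusion matrices printed earlier. The sketch below assumes scikit-learn's layout for binary labels 0 and 1, that is `[[TN, FP], [FN, TP]]` with 1 meaning a positive review. The helper name `summarize` is hypothetical.

```python
def summarize(cm):
    """Derive basic metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "false_positives": fp,   # Type 1 errors
        "false_negatives": fn,   # Type 2 errors
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Confusion matrices copied from the outputs above
naive_bayes = summarize([[55, 42], [12, 91]])
kernel_svm = summarize([[89, 8], [36, 67]])

for name, metrics in [("Naive Bayes", naive_bayes), ("Kernel SVM", kernel_svm)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Looking at precision and recall next to accuracy makes the trade-off explicit: Naive Bayes rarely misses a positive review but raises many false positives, while the kernel SVM is more precise but misses more positive reviews.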