<a href="https://colab.research.google.com/github/yoseforaz0990/ML-templates/blob/main/NLP/natural_language_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

| Step                                          | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|-----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Importing the libraries                      | Import the required libraries for text preprocessing and machine learning. This includes the `re` library for regular expressions, `nltk` for natural language processing tasks, `stopwords` from NLTK to remove common words, `PorterStemmer` for word stemming, `CountVectorizer` from scikit-learn for Bag of Words representation, and `GaussianNB` from scikit-learn for the Naive Bayes classifier.                                                                                                                                     |
| Importing the dataset                        | Load the dataset containing the text reviews and their corresponding sentiment labels. This dataset will be used to train and test the Naive Bayes classifier.                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Cleaning the texts                           | Preprocess each review: replace every character that is not a letter with a space, convert the text to lowercase, split it into words, remove common English stopwords (keeping `not`, which carries sentiment), and stem the remaining words to their root form. The cleaned reviews are stored in the `corpus` list. |
| Creating the Bag of Words model              | Create a Bag of Words representation of the preprocessed reviews using `CountVectorizer`. The model converts the text into a numeric matrix in which each row is a review and each column holds the count of one word in that review. The `max_features` parameter caps the vocabulary size, keeping only the most frequent words across the corpus. |
| Splitting the dataset into Training and Test | Split the dataset into the training set and the test set. The training set will be used to train the Naive Bayes classifier, and the test set will be used to evaluate its performance.                                                                                                                                                                                                                                                                                                                                                                                                         |
| Training the Naive Bayes model               | Train the Naive Bayes classifier on the training set, using the Bag of Words representation of the reviews and their sentiment labels. The `GaussianNB` classifier estimates, for each sentiment label, a normal distribution over each word's count, then uses Bayes' rule to assign a new review to the most probable label. |
| Predicting the Test set results              | Use the trained Naive Bayes classifier to predict the sentiment labels for the test set reviews. The predicted results are stored in the `y_pred` variable.                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Making the Confusion Matrix                  | Compute the confusion matrix to evaluate the classifier. It counts the true positives, true negatives, false positives, and false negatives, which shows how accurate the predictions are and whether the classifier is biased toward one class. The code for this step is not included in the notebook. |
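
To make the cleaning and Bag of Words steps concrete, here is a minimal standard-library sketch. It is not the notebook's implementation: the three reviews, the tiny `stop_words` set, and the helper names `clean` and `bag_of_words` are hypothetical, stemming is left out, and ties between equally frequent words are broken alphabetically, which is an assumption of this sketch rather than documented `CountVectorizer` behavior.

```python
import re
from collections import Counter

# Hypothetical toy data and stopword list, for illustration only.
# The notebook uses NLTK's English stopword list and also stems each word.
reviews = [
    "Wow... Loved this place!",
    "The crust is NOT good.",
    "Loved the food, loved the staff.",
]
stop_words = {"the", "this", "is"}  # "not" is deliberately kept


def clean(text):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return [w for w in words if w not in stop_words]


def bag_of_words(docs, max_features):
    """Return (vocabulary, count matrix) limited to the most frequent words."""
    totals = Counter(w for doc in docs for w in doc)
    # Rank by corpus frequency (ties broken alphabetically) and keep the top words.
    top = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
    vocab = sorted(top)
    # One row per review, one column per vocabulary word.
    matrix = [[Counter(doc)[w] for w in vocab] for doc in docs]
    return vocab, matrix


tokens = [clean(r) for r in reviews]
vocab, X_toy = bag_of_words(tokens, max_features=5)

print(tokens[1])  # ['crust', 'not', 'good']
print(vocab)      # ['crust', 'food', 'good', 'loved', 'not']
for row in X_toy:
    print(row)
```

Each row of `X_toy` corresponds to one review and each column to one vocabulary word, so the third review has a 2 in the `loved` column. Lowering `max_features` drops the least frequent words first.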


In [None]:
# Importing the libraries
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Importing the dataset (not included here: `dataset` is assumed to be an already loaded
# pandas DataFrame with a 'Review' text column and the sentiment label in the last column)

# Cleaning the texts
corpus = []
ps = PorterStemmer()
# Build the stopword set once; keep 'not' because it carries sentiment
all_stopwords = set(stopwords.words('english'))
all_stopwords.discard('not')
for i in range(len(dataset)):
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower().split()
    # Drop stopwords and stem the remaining words
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    corpus.append(' '.join(review))

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values

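# Splitting the dataset into the Training set and Test set
# (test_size and random_state are assumed values; the original code does not specify the split)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)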

# Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# Making the Confusion Matrix (assuming y_pred and y_test are available)
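
The cell above stops before the confusion matrix. As a rough sketch of what that step computes, the following standard-library example counts the four outcomes by hand. The label lists `y_true_toy` and `y_pred_toy` and the helper `confusion_counts` are hypothetical and not taken from the notebook's data; 1 is assumed to mean a positive review and 0 a negative one.

```python
# Hypothetical labels for illustration: 1 = positive review, 0 = negative review.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0, 1, 0]


def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return [[tn, fp], [fn, tp]]


cm_toy = confusion_counts(y_true_toy, y_pred_toy)
# Accuracy is the share of predictions on the diagonal (TN + TP).
accuracy_toy = (cm_toy[0][0] + cm_toy[1][1]) / len(y_true_toy)

print(cm_toy)        # [[3, 1], [1, 3]]
print(accuracy_toy)  # 0.75
```

With scikit-learn, `confusion_matrix(y_test, y_pred)` and `accuracy_score(y_test, y_pred)` from `sklearn.metrics` produce the same layout and the same accuracy figure when applied to the notebook's `y_test` and `y_pred`.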
