In [2]:
import pandas as pd
import numpy as np
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Flatten
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Load the CSV file
data = pd.read_csv(r"C:\Users\Deepti Wandhekar\Downloads\imdb_dataset.csv")

# Split the data into input (reviews) and output (sentiment) columns
reviews = data["review"]
sentiments = data["sentiment"]

# Tokenize the text:
# - Tokenizer() creates a tokenizer instance.
# - fit_on_texts(reviews) builds the vocabulary from the reviews and assigns each word a unique integer index (more frequent words get lower indices).
# - texts_to_sequences(reviews) replaces each word with its index, so sequences is a list of integer lists, one per review.

tokenizer = Tokenizer()
tokenizer.fit_on_texts(reviews)
sequences = tokenizer.texts_to_sequences(reviews)

# Pad sequences to a fixed length so that every input has the same shape, which the model requires.
# pad_sequences pads shorter sequences with zeros and truncates longer ones to max_length (by default at the start of the sequence); X holds the result.

max_length = 250
X = pad_sequences(sequences, maxlen=max_length)

# Convert sentiments to binary labels in a NumPy array y: "positive" becomes 1, and any other value becomes 0.
y = np.array([1 if sentiment == 'positive' else 0 for sentiment in sentiments])

# Split the data into training and testing sets. random_state fixes the seed for the random shuffle, so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [3]:
# Build the model
model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=128, input_length=max_length))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, batch_size=64, epochs=7, validation_data=(X_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print("Test loss:", loss)
print("Test accuracy:", accuracy)

# Make predictions
predictions = model.predict(X_test[:5])


Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Test loss: 0.4675050377845764
Test accuracy: 0.8866999745368958


In [4]:
# Display sample predictions and actual sentiments
print("Sample Predictions:\n")
for i in range(5):
    # predictions[i] is a one-element array holding the sigmoid output
    predicted_sentiment = "Positive" if predictions[i][0] >= 0.5 else "Negative"
    actual_sentiment = "Positive" if y_test[i] == 1 else "Negative"
    print("Predicted Sentiment:", predicted_sentiment)
    print("Actual Sentiment:", actual_sentiment)
    # train_test_split shuffles the rows, so reviews.iloc[i] would not match X_test[i];
    # rebuild the (tokenized, padded) review text from the sequence itself instead
    review_text = tokenizer.sequences_to_texts([X_test[i]])[0]
    print("Review Text:\n", review_text)
    print()

In [None]:
##Dataset:
IMDB Dataset: The IMDB dataset is a large collection of movie reviews collected from the IMDB
website, which is a popular source of user-generated movie ratings and reviews. The dataset consists of
50,000 movie reviews and is commonly distributed as 25,000 reviews for training and 25,000 for testing.
The code above instead loads the reviews from a single CSV file and makes its own 80/20 split with train_test_split.

In the CSV, each review is raw text with a sentiment label such as "positive". After tokenization, each
review is represented as a sequence of integers, one per word, where a word's index is based on its
frequency in the dataset. The labels are converted to binary values, with 0 indicating a negative
review and 1 indicating a positive review.
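
The integer encoding and the fixed-length padding can be illustrated without Keras. The following is a minimal sketch in plain Python, not the Tokenizer's actual implementation: the helper names (build_word_index, to_sequence, pad) and the toy reviews are hypothetical, and the real Tokenizer also strips punctuation, which this sketch skips.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused so it can serve as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_reviews = ["a great great film", "a dull film"]  # hypothetical data
word_index = build_word_index(toy_reviews)
padded = [pad(to_sequence(r, word_index), maxlen=5) for r in toy_reviews]
print(word_index)
print(padded)
```

Padding and truncating at the start mirrors the default behaviour of pad_sequences, so the end of a long review is what the model sees.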

##Classification:
Classification is a supervised learning technique that identifies the category of new observations on the basis of training data. A program learns from a labeled dataset and then assigns each new observation to one of a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, or cat or dog.
Classes are also called targets, labels, or categories.

##Binary Classification:
Binary classification is a fundamental task in machine learning where the goal is to classify data instances into one of two classes or categories. The target variable or label can take only two possible values, often referred to as positive and negative, 1 and 0, or true and false.

##Applications:
Spam Detection
Fraud Detection
Disease Diagnosis
Sentiment Analysis
Customer Churn Prediction
Credit Scoring
Intrusion Detection
Fault Diagnosis
Image Classification
Face Recognition

##Build model:


Sequential() initializes a sequential model, which is a linear stack of layers.
model.add(Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=128, input_length=max_length)) adds an embedding layer to the model. This layer converts each word index in the input sequence into a dense vector of size 128. input_dim is the vocabulary size plus one because index 0 is reserved for padding.
model.add(Flatten()) flattens each review's 250 x 128 embedded output into a single vector of 32,000 values.
model.add(Dense(64, activation='relu')) adds a fully connected layer with 64 units and ReLU activation.
model.add(Dense(1, activation='sigmoid')) adds the output layer with a single unit and sigmoid activation, so the output is a value between 0 and 1 that can be read as the probability of a positive review.
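
The per-review output shape and parameter count of each layer follow from simple arithmetic. The sketch below works them out in plain Python. max_length and the embedding size come from the notebook, while vocab_size is a hypothetical stand-in for len(tokenizer.word_index) + 1, which depends on the data.

```python
max_length = 250      # from the notebook
embedding_dim = 128   # from the notebook
vocab_size = 10_000   # hypothetical; the notebook uses len(tokenizer.word_index) + 1

def dense_params(n_in, n_out):
    """A dense layer has one weight per input/output pair plus one bias per output."""
    return n_in * n_out + n_out

flat_size = max_length * embedding_dim  # size of the vector produced by Flatten

layers = [
    ("Embedding", (max_length, embedding_dim), vocab_size * embedding_dim),
    ("Flatten", (flat_size,), 0),
    ("Dense(64)", (64,), dense_params(flat_size, 64)),
    ("Dense(1)", (1,), dense_params(64, 1)),
]
for name, shape, params in layers:
    print(f"{name:10s} output shape per review: {shape}, parameters: {params:,}")
```

Under these assumptions most of the weights outside the embedding sit in the first dense layer, because Flatten hands it 32,000 inputs.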

##Compile:
model.compile() configures the Keras model for training. It specifies the loss function, the optimizer, and any optional metrics to report during training and evaluation.

Here's an explanation of the parameters used in model.compile():

loss='binary_crossentropy': This parameter specifies the loss function to be used for binary classification problems. Binary cross-entropy is a common loss function used when dealing with binary classification tasks. It measures the difference between the predicted and actual class probabilities.

optimizer='adam': This parameter specifies the optimizer algorithm to be used during the training process. In this case, the 'adam' optimizer is used. Adam is a popular optimization algorithm that combines the advantages of both Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp). It adapts the learning rate during training and efficiently handles sparse gradients.

metrics=['accuracy']: This parameter specifies the metric(s) to be used to evaluate the model's performance. In this case, the 'accuracy' metric is used. Accuracy measures the percentage of correctly predicted instances out of the total number of instances. It is a commonly used metric for classification problems.
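
To make the loss and the metric concrete, here is a minimal sketch of both in plain Python. It is an illustration of the formulas, not Keras's implementation: the function names, the clipping constant eps, and the sample probabilities are assumptions chosen for the example, and the 0.5 threshold mirrors the one used in the prediction loop of this notebook.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]  # hypothetical sigmoid outputs

print("loss:", binary_crossentropy(y_true, y_prob))
print("accuracy:", accuracy(y_true, y_prob))
```

The loss depends on how confident each prediction is, while accuracy only counts which side of the threshold it falls on, so the two can move differently during training.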

    
##Train:
model.fit() trains the model on the training data, updating the model's parameters according to the configuration set in model.compile().
batch_size=64: This parameter specifies the number of samples in each training batch. The training data is divided into batches, and the model's parameters are updated after each batch is processed.
epochs=7: This parameter specifies how many times the entire training dataset is iterated over during training. Each full pass over the dataset is called an epoch, so the model is trained for 7 epochs.
validation_data=(X_test, y_test): This parameter specifies the data on which the model is evaluated after each epoch. Validation results help monitor generalization and detect overfitting. Because the test set is used as validation data here, the final test accuracy is not a fully independent estimate; a separate validation split would avoid that.
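
The number of parameter updates implied by these settings can be worked out directly. The sketch below assumes the CSV holds the full 50,000 reviews mentioned in the dataset notes; the actual row count of the file is not shown in this notebook.

```python
import math

n_rows = 50_000   # assumption: the CSV contains all 50,000 reviews
batch_size = 64   # from model.fit
epochs = 7        # from model.fit

n_test = n_rows // 5                               # test_size=0.2 keeps one fifth for testing
n_train = n_rows - n_test                          # rows left for training
steps_per_epoch = math.ceil(n_train / batch_size)  # batches (updates) per epoch
total_updates = steps_per_epoch * epochs

print("training rows:", n_train)
print("batches per epoch:", steps_per_epoch)
print("total parameter updates:", total_updates)
```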
    
##Make prediction:
model.predict() takes input data and returns the trained model's predicted outputs.
X_test[:5]: This slice selects the first five examples from the testing data (X_test), so predictions holds five sigmoid outputs, one per review.
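
Each prediction is a probability that still has to be turned into a label. The sketch below shows that step on made-up values: the nested list stands in for the array returned by model.predict (one row per review, one probability per row), and to_label is a hypothetical helper.

```python
# Hypothetical sigmoid outputs laid out like model.predict(X_test[:5]):
# one row per review, one probability per row.
predictions = [[0.93], [0.07], [0.50], [0.49], [0.81]]

def to_label(probability, threshold=0.5):
    """Probabilities at or above the threshold count as positive."""
    return "Positive" if probability >= threshold else "Negative"

labels = [to_label(row[0]) for row in predictions]
print(labels)
```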
    
    

##Imports:
pandas (as pd): Used for data manipulation and analysis.
numpy (as np): Used for numerical operations and array manipulation.
keras (from tensorflow): The high-level Keras API that ships with TensorFlow, a popular deep learning framework.
Sequential: A class from the tensorflow.keras.models module that lets you create a model by stacking layers in order.
Embedding: A layer from the tensorflow.keras.layers module that performs word embedding, commonly used for text data processing.
Dense: A layer from the tensorflow.keras.layers module that represents a fully connected layer in a neural network.
Flatten: A layer from the tensorflow.keras.layers module that flattens each sample's input tensor into a 1-dimensional vector.
Tokenizer: A class from the tensorflow.keras.preprocessing.text module that converts text data into sequences of integers.
pad_sequences: A function from the tensorflow.keras.preprocessing.sequence module that pads sequences to a fixed length.
train_test_split: A function from the sklearn.model_selection module that splits the dataset into training and testing sets.
    