### Text Classification
Text classification is the process of assigning a label or category to a given piece of text. For example, we can classify emails as spam or not spam, tweets as positive or negative, and articles as relevant or not relevant to a given topic.

In [1]:
import pandas as pd

# Load dataset
data = pd.read_csv("https://raw.githubusercontent.com/suyashi29/Generative-AI-for-NLP/main/movie_reviews.csv")

# Display the first few rows of the dataset
print(data.head())


             Movie                                Review  Sentiment
0     Pulp Fiction  The special effects were incredible.          0
1  The Dark Knight          The film lacked originality.          1
2     Forrest Gump    The screenplay was poorly written.          1
3       Fight Club                     Predictable plot.          0
4  The Dark Knight                 Disappointing ending.          1


### Data Preprocessing
Preprocess the text data to clean it and prepare it for modeling.

- Lowercasing: Convert all text to lowercase.
- Removing Punctuation: Remove punctuation marks.
- Tokenization: Split the text into words.
- Removing Stopwords: Remove common words that do not carry significant meaning.
- Stemming/Lemmatization: Reduce words to their base or root form.

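Lemmatization examples (verb and adjective forms only reduce this way when the lemmatizer is given the matching part of speech):

- was: be
- sings: sing
- worst: bad
- danced: dance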
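The first four steps in the list above can be traced on a single sentence with the standard library alone. This is only a sketch: the whitespace split stands in for a real tokenizer, and `STOP_WORDS` is a small hand-picked set assumed for illustration (NLTK's English list is far longer, and it also contains "not").

```python
import string

# Assumption: a tiny illustrative stopword set, not NLTK's full English list.
STOP_WORDS = {"i", "did", "not", "the", "it", "was"}

def preprocess_steps(text):
    """Return the intermediate result of each preprocessing stage."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()  # whitespace split stands in for a real tokenizer
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return {
        "lowercase": lowered,
        "no_punctuation": no_punct,
        "tokens": tokens,
        "no_stopwords": kept,
    }

steps = preprocess_steps("I did not like the film, it was boring.")
for name, value in steps.items():
    print(f"{name}: {value}")
```

The last stage keeps only `['like', 'film', 'boring']`. Dropping "not" removes the negation, which is worth keeping in mind when stopword removal is applied to sentiment data.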

In [2]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data files
# (newer NLTK releases may also need nltk.download('punkt_tab') for word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the lemmatizer and build the stopword set once,
# instead of reloading the list for every token
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define a preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Lemmatize (default part of speech is noun, so verb forms such as "lacked" stay unchanged)
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

# Apply preprocessing to the Review column
data['Review'] = data['Review'].apply(preprocess_text)

# Display the preprocessed data
print(data.head())

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\suyashi144893\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\suyashi144893\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\suyashi144893\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


             Movie                     Review  Sentiment
0     Pulp Fiction  special effect incredible          0
1  The Dark Knight    film lacked originality          1
2     Forrest Gump  screenplay poorly written          1
3       Fight Club           predictable plot          0
4  The Dark Knight       disappointing ending          1


## Splitting the Dataset
Split the dataset into training and testing sets.

In [3]:
from sklearn.model_selection import train_test_split

# Features and labels
X = data['Review']
y = data['Sentiment']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the splits
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


(3200,) (800,) (3200,) (800,)
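The 80/20 split can be sanity-checked on toy data. The sketch below also passes `stratify`, which the cell above does not use; it is shown here as an optional extra that keeps the class proportions the same in both splits. The `texts` and `labels` lists are made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 reviews with perfectly balanced labels.
texts = [f"review {i}" for i in range(10)]
labels = [0, 1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.2,     # 20% of the rows go to the test split
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels,   # optional: preserve the 50/50 class balance in each split
)

print(len(X_tr), len(X_te))
print(sorted(y_te))
```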


## Vectorization
Convert the text data into numerical features using a TF-IDF vectorizer.

The difference between vectorizer.fit_transform(X_train) and vectorizer.transform(X_test) lies in which steps are performed: the training data is used to fit the vectorizer and is then transformed, while the test data is only transformed.

#### vectorizer.fit_transform(X_train)
- fit: The vectorizer.fit(X_train) step analyzes the training data X_train to learn the vocabulary and, for TF-IDF, the IDF (inverse document frequency) weights.
- transform: The vectorizer.transform(X_train) step converts the training data into a matrix of TF-IDF features based on the learned vocabulary.
- Combining these into vectorizer.fit_transform(X_train) fits the vectorizer and transforms the training data in a single step. The result is a sparse matrix that scikit-learn estimators accept directly.

#### vectorizer.transform(X_test)
- transform: The vectorizer.transform(X_test) step converts the test data X_test into the same feature space as the training data. It uses the vocabulary and idf statistics learned from X_train during the fit step.
- This ensures that the test data is represented exactly like the training data and that no information from the test set leaks into the features, so the evaluation of the model trained on X_train stays fair.

### Summary
- vectorizer.fit_transform(X_train): Fits the vectorizer to the training data and transforms the training data into the desired feature representation.
- vectorizer.transform(X_test): Transforms the test data into the same feature representation using the vocabulary and statistics learned from the training data.
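A small sketch of the behaviour described above, using two made-up training documents and one made-up test document. The word "soundtrack" appears only in the test document, so it is not part of the learned vocabulary and contributes no feature.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, already preprocessed.
train_docs = ["predictable plot", "special effect incredible"]
test_docs = ["incredible soundtrack"]  # "soundtrack" never appears in train_docs

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)  # same number of columns in both
print(test_matrix.nnz)                        # count of non-zero entries in the test row
```

Both matrices have one column per training-vocabulary term, and the test row has a single non-zero entry, for "incredible".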

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=5000)

# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_tfidf = vectorizer.transform(X_test)

# Display the shape of the transformed data
print(X_train_tfidf.shape, X_test_tfidf.shape)


(3200, 54) (800, 54)
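The output has 54 columns even though max_features=5000 was requested. That is expected: max_features is an upper bound on the vocabulary size, not a target. A minimal sketch with made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with four distinct terms.
docs = ["predictable plot", "disappointing ending", "predictable ending"]

capped = TfidfVectorizer(max_features=5000).fit(docs)  # cap far above the vocabulary size
tight = TfidfVectorizer(max_features=3).fit(docs)      # cap below the vocabulary size

print(len(capped.vocabulary_))  # all distinct terms are kept
print(len(tight.vocabulary_))   # only the most frequent terms are kept
```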


## Hyperparameter Tuning and Model Training

### Hyperparameter Tuning
- Hyperparameter tuning is the process of finding the optimal hyperparameters for a machine learning model. Hyperparameters are the parameters that are set before the learning process begins and cannot be learned directly from the training data. Examples include the learning rate in neural networks, the number of trees in a random forest, and the regularization parameter in logistic regression.

### Grid Search
Grid search is one of the most commonly used methods for hyperparameter tuning. It exhaustively evaluates every combination in a manually specified subset of the hyperparameter space. The steps are as follows:

- Define the Hyperparameter Space: Specify the hyperparameters to be tuned and their possible values. This is often done by creating a grid of parameter values.

- Perform Cross-Validation: For each combination of hyperparameters, the model is trained and evaluated using cross-validation. Cross-validation helps in assessing the model's performance on different subsets of the training data.

- Evaluate Performance: The performance metric (e.g., accuracy, F1 score, etc.) is calculated for each combination of hyperparameters.

- Select the Best Hyperparameters: The combination of hyperparameters with the best mean cross-validation score is selected as the optimal set.
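The size of the search follows directly from the grid. The sketch below enumerates the candidates by hand with itertools.product, using the same grid and fold count that this notebook passes to GridSearchCV; it only counts combinations and does not train anything.

```python
from itertools import product

# Same grid and fold count as the GridSearchCV call in this notebook.
param_grid = {"C": [0.1, 1, 10, 100], "solver": ["liblinear", "lbfgs"]}
cv_folds = 5

# Build one dict per combination of hyperparameter values.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))             # 4 values of C x 2 solvers = 8 candidates
print(len(candidates) * cv_folds)  # 8 candidates x 5 folds = 40 model fits
print(candidates[0])
```

The cost grows multiplicatively with every hyperparameter added to the grid, which is the main practical limit of exhaustive search.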



Here we use GridSearchCV to find the best hyperparameters for the Logistic Regression model.

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs']
}

# Initialize the Logistic Regression model
lr = LogisticRegression()

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=lr, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)

# Fit GridSearchCV
grid_search.fit(X_train_tfidf, y_train)

# Get the best parameters and the best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print(f"Best Parameters: {best_params}")


Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best Parameters: {'C': 0.1, 'solver': 'liblinear'}


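## Model Training
No separate training step is needed here: GridSearchCV refits the best parameter combination on the full training set by default (refit=True), so best_model is already a trained Logistic Regression model.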

## Model Evaluation
Evaluate the model on the test data.

In [8]:
from sklearn.metrics import accuracy_score, classification_report

# Predict on the test data
y_pred = best_model.predict(X_test_tfidf)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy}")

# Display classification report
print(classification_report(y_test, y_pred, target_names=['Class 1', 'Class 2']))


Test Accuracy: 0.49875
              precision    recall  f1-score   support

     Class 1       0.51      0.43      0.47       408
     Class 2       0.49      0.57      0.53       392

    accuracy                           0.50       800
   macro avg       0.50      0.50      0.50       800
weighted avg       0.50      0.50      0.50       800
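One way to read the report above is to compare it with a classifier that always predicts the larger class. The sketch below does that arithmetic using only the support counts and accuracy printed in the output; nothing else is assumed.

```python
# Values copied from the classification report and accuracy printed above.
support = {"Class 1": 408, "Class 2": 392}
model_accuracy = 0.49875

total = sum(support.values())
baseline_accuracy = max(support.values()) / total  # always predict the larger class

print(f"Majority-class baseline: {baseline_accuracy:.4f}")          # 0.5100
print(f"Model beats baseline: {model_accuracy > baseline_accuracy}")  # False
```

An accuracy this close to 0.5 on two nearly balanced classes is about what random guessing would achieve, so the model has not learned a usable signal from these features.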



## Save the Model and Vectorizer
Save the trained model and the vectorizer for future use.

In [9]:
import joblib

# Save the model
joblib.dump(best_model, 'logistic_regression_best_model.pkl')

# Save the vectorizer
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')


['tfidf_vectorizer.pkl']
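An alternative to saving two files is to wrap the vectorizer and the classifier in a scikit-learn Pipeline and save that single object. This is not what the notebook does; it is a sketch of one option, with made-up training data and a hypothetical file name, that avoids loading a model with a mismatched vectorizer.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, already preprocessed.
docs = ["special effect incredible", "film lacked originality",
        "great acting", "disappointing ending"]
labels = [0, 1, 0, 1]

# The pipeline vectorizes raw text and classifies it in one call.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
pipeline.fit(docs, labels)

# Save and reload from a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

same = list(restored.predict(docs)) == list(pipeline.predict(docs))
print(same)
```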

## Load and Use the Model for Predictions

In [14]:
# Load the model and vectorizer
best_model = joblib.load('logistic_regression_best_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')

# Sample new reviews
new_reviews = [
    "The movie was fantastic and excellent!",
    "I did not like the film, it was boring.",
    "i am happy with movie",
]

# Preprocess and vectorize the new reviews
new_reviews_preprocessed = [preprocess_text(review) for review in new_reviews]
new_reviews_tfidf = vectorizer.transform(new_reviews_preprocessed)

# Predict sentiments
predictions = best_model.predict(new_reviews_tfidf)

# Map predictions to labels
label_mapping = {0: 'positive', 1: 'negative'}
labeled_predictions = [label_mapping[pred] for pred in predictions]

print(labeled_predictions)  # one label per input review


['positive', 'negative', 'positive']


In [13]:
predictions

array([0, 1, 0], dtype=int64)