In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# Load the preprocessed data
preprocessed_data = pd.read_csv('/content/preprocessed_data.csv')

In [3]:
# Split the data into features and target variable
X = preprocessed_data['review']
y = preprocessed_data['sentiment']

In [4]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [5]:
# Convert text data into numerical features using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

In [6]:
# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
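
The warning above reports that the solver stopped at its iteration limit before converging. One option the message itself suggests is raising `max_iter`. The cell below is a minimal sketch of that change on a tiny stand-in corpus; the toy texts, the variable names, and the value `1000` are assumptions for illustration, and the accuracy reported later in this notebook was obtained without this change.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for the review data
toy_texts = ["great movie", "terrible plot", "loved it",
             "hated it", "great acting", "terrible acting"]
toy_labels = ["positive", "negative", "positive",
              "negative", "positive", "negative"]

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_texts)

# max_iter defaults to 100; a larger budget gives the solver room to converge
model_more_iter = LogisticRegression(max_iter=1000)
model_more_iter.fit(X_toy, toy_labels)

# n_iter_ holds the number of iterations the solver actually used
print("Iterations used:", model_more_iter.n_iter_)
```

If `n_iter_` comes back below `max_iter`, the solver stopped on its own tolerance rather than on the iteration cap.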


In [7]:
# Predict the sentiment labels for the testing set
y_pred = model.predict(X_test_vectorized)

In [8]:
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.87976


# Improving the model
The updated code replaces the CountVectorizer and LogisticRegression baseline with a TfidfVectorizer and LinearSVC pipeline, and adds the following enhancements:

**1. Grid Search for Hyperparameter Tuning**:

* We use GridSearchCV from scikit-learn to search over hyperparameter combinations with 5-fold cross-validation.
* The param_grid variable specifies the combinations to evaluate: two n-gram ranges for the vectorizer and three values of the regularization parameter C for the classifier.
* The grid search is performed on the whole pipeline, so both the feature extraction step (TfidfVectorizer) and the classification step (LinearSVC) are refit for every combination and fold.
* The best model is obtained from grid_search.best_estimator_, which is the pipeline refit on the full training set with the best-scoring combination of hyperparameters.

**2. Improved Model Selection**:

* The grid search selects hyperparameters by cross-validated score rather than by hand, which can improve performance over the default settings.
* By comparing different combinations of n-gram ranges and C values, we can identify the best settings among those in the grid for the given dataset. Values outside the grid are not explored.

The best model obtained from the grid search is then used to predict sentiment labels on the testing set, and its performance is evaluated using the accuracy score and the classification report.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

In [16]:
# Define the pipeline for feature extraction and model training
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC())
])

In [17]:
# Define the parameter grid for hyperparameter tuning
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],  # Unigrams only, or unigrams plus bigrams
    'clf__C': [0.1, 1, 10]  # Inverse regularization strength (smaller C = stronger regularization)
}

In [18]:
# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

In [19]:
# Get the best model
best_model = grid_search.best_estimator_
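
To see which combination the search picked, the fitted GridSearchCV object exposes `best_params_`, `best_score_`, and `cv_results_`. The cell below is a small self-contained sketch on a toy corpus; the texts, the `toy_` names, and `cv=2` are assumptions chosen so that it runs quickly, and its output says nothing about the real review data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the review data
toy_texts = ["great movie", "loved it", "great acting", "wonderful story",
             "terrible plot", "hated it", "terrible acting", "awful story"]
toy_labels = ["positive"] * 4 + ["negative"] * 4

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
toy_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1, 10]}

# cv=2 only because the toy corpus is tiny; the notebook itself uses cv=5
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_search.fit(toy_texts, toy_labels)

# Best combination and its mean cross-validated accuracy
print("Best parameters:", toy_search.best_params_)
print("Best CV accuracy:", toy_search.best_score_)

# Mean cross-validated accuracy for every combination in the grid
results = toy_search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, score)
```

On the real data, the same three attributes are available on `grid_search` after `grid_search.fit(X_train, y_train)`.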

In [20]:
# Predict the sentiment labels for the testing set
y_pred = best_model.predict(X_test)

In [21]:
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.90984


In [22]:
# Report per-class precision, recall, and F1-score
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

    negative       0.92      0.90      0.91      6157
    positive       0.90      0.92      0.91      6343

    accuracy                           0.91     12500
   macro avg       0.91      0.91      0.91     12500
weighted avg       0.91      0.91      0.91     12500



## Comparison Analysis:

**1. Accuracy**:

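* The tuned LinearSVC model achieved an accuracy of approximately 91% (0.90984) on the testing set, up from about 88% (0.87976) for the logistic regression baseline.
* In other words, the tuned model correctly predicted the sentiment label for roughly 91 out of every 100 test instances.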
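
As a quick arithmetic check on the two accuracy figures printed in this notebook, the sketch below converts them into counts of correct predictions on the 12,500-instance testing set. The variable names are illustrative; the inputs are copied from the outputs above.

```python
# Values copied from the notebook outputs above
test_size = 12500
baseline_accuracy = 0.87976   # CountVectorizer + LogisticRegression
tuned_accuracy = 0.90984      # TfidfVectorizer + LinearSVC after grid search

# Number of correctly classified test instances for each model
baseline_correct = round(baseline_accuracy * test_size)
tuned_correct = round(tuned_accuracy * test_size)

# Relative reduction in the error rate (1 - accuracy)
baseline_error = 1 - baseline_accuracy
tuned_error = 1 - tuned_accuracy
relative_error_reduction = (baseline_error - tuned_error) / baseline_error

print("Baseline correct:", baseline_correct)
print("Tuned correct:", tuned_correct)
print("Additional correct predictions:", tuned_correct - baseline_correct)
print("Relative error reduction:", round(relative_error_reduction, 2))
```

The three-point gain in accuracy corresponds to a few hundred more correctly labeled reviews, or about a quarter fewer errors than the baseline.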

**2. Precision and Recall**:

* Precision is 0.92 for the negative class and 0.90 for the positive class.
* Precision represents the proportion of correctly predicted instances for a given class out of all instances predicted as that class.
* Recall is 0.90 for the negative class and 0.92 for the positive class, indicating that the model captured a high percentage of instances belonging to each class.
* Recall represents the proportion of correctly predicted instances for a given class out of all instances that actually belong to that class.
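
The two definitions above can be checked by hand on a small example. The sketch below uses ten made-up labels (an assumption, not data from this notebook), counts true positives, false positives, and false negatives for the positive class, and compares the hand-computed ratios with scikit-learn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels: 4 negative and 6 positive instances
y_true_toy = ["negative"] * 4 + ["positive"] * 6
y_pred_toy = ["negative", "negative", "negative", "positive",
              "positive", "positive", "positive", "positive",
              "negative", "negative"]

# With labels ordered [negative, positive], the flattened matrix is tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(
    y_true_toy, y_pred_toy, labels=["negative", "positive"]
).ravel()

# Precision: correct positive predictions out of all positive predictions
precision_by_hand = tp / (tp + fp)
# Recall: correct positive predictions out of all actual positives
recall_by_hand = tp / (tp + fn)

print("Precision:", precision_by_hand,
      precision_score(y_true_toy, y_pred_toy, pos_label="positive"))
print("Recall:", recall_by_hand,
      recall_score(y_true_toy, y_pred_toy, pos_label="positive"))
```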

**3. F1-Score**:

* The F1-score, the harmonic mean of precision and recall, is approximately 0.91 for both sentiment classes.
* Because it is a harmonic mean, the F1-score is high only when both precision and recall are high.
* The nearly identical F1-scores for the two classes indicate that the model does not favor one class over the other.
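
The F1-scores in the report can be reproduced from the precision and recall columns. The sketch below applies the harmonic-mean formula to the rounded values printed above; the helper name `f1_from` is illustrative, and because the inputs are already rounded to two decimals the result is only approximate.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded per-class values copied from the classification report
f1_negative = f1_from(precision=0.92, recall=0.90)
f1_positive = f1_from(precision=0.90, recall=0.92)

print("F1 (negative):", round(f1_negative, 2))
print("F1 (positive):", round(f1_positive, 2))
```

Both values round to 0.91, matching the f1-score column of the report.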

**4. Support**:

* The support values indicate the number of instances of each class in the testing set.
* The testing set contains 6,157 instances labeled as negative sentiment and 6,343 instances labeled as positive sentiment, 12,500 in total, so the two classes are nearly balanced.

Overall, the tuned model performs well, with an accuracy of 91% on the testing set compared with about 88% for the baseline. Precision, recall, and F1-scores all fall between 0.90 and 0.92 for both classes, so performance is balanced across negative and positive sentiment. Because these figures come from held-out test data, they give a reasonable estimate of how the model would perform on new reviews drawn from a similar distribution.