In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# Load the preprocessed data
preprocessed_data = pd.read_csv('/content/preprocessed_data.csv')

In [3]:
# Split the data into features and target variable
X = preprocessed_data['review']
y = preprocessed_data['sentiment']

In [4]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [5]:
# Convert text data into numerical features using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

In [6]:
# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
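
The warning above means the solver hit its iteration cap (`max_iter` defaults to 100) before converging. One way to address it is to raise `max_iter` and then inspect `n_iter_` to confirm the solver stopped on its own. The sketch below does this on a small made-up corpus; the `toy_*` names and the value `max_iter=1000` are illustrative assumptions, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data for illustration only (not the notebook's dataset)
toy_reviews = [
    "a wonderful and moving film",
    "great acting and a great story",
    "boring plot and terrible acting",
    "a dull and terrible film",
]
toy_labels = ["positive", "positive", "negative", "negative"]

toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_reviews)

# Raise the iteration cap above the default of 100 (1000 is an arbitrary choice)
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_features, toy_labels)

# n_iter_ holds the iterations actually used; a value below max_iter
# means the solver stopped because it converged, not because it ran out
iterations_used = int(toy_model.n_iter_[0])
print("Iterations used:", iterations_used, "of", toy_model.max_iter)
```

On the full review dataset the required number of iterations may be much larger than on this toy example, so the cap may need tuning.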


In [7]:
# Predict the sentiment labels for the testing set
y_pred = model.predict(X_test_vectorized)

In [8]:
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.87976


# Improving the model
The code below makes the following changes to the basic model:

**1. Used TfidfVectorizer**: Instead of CountVectorizer, we use TfidfVectorizer for feature extraction. TF-IDF weights each term by how often it occurs in a document and discounts terms that occur in many documents across the corpus, so common words carry less weight than distinctive ones.

**2. Implemented Pipeline**: We have used scikit-learn's Pipeline to create a unified workflow that includes both feature extraction and model training. This simplifies the code and makes it easier to maintain and modify.

**3. Updated the model**: We have switched to using LinearSVC (Linear Support Vector Classifier) as the machine learning algorithm. LinearSVC is known for its efficiency and good performance on text classification tasks.

**4. Included classification report**: We have calculated the classification report, which provides more detailed performance metrics such as precision, recall, and F1-score for each class. This helps evaluate the model's performance on both positive and negative sentiments.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

In [10]:
# Define the pipeline for feature extraction and model training
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC())
])

In [11]:
# Train the model
pipeline.fit(X_train, y_train)

In [12]:
# Predict the sentiment labels for the testing set
y_pred = pipeline.predict(X_test)

In [13]:
# Evaluate the model's performance
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

    negative       0.90      0.88      0.89      6157
    positive       0.89      0.90      0.90      6343

    accuracy                           0.89     12500
   macro avg       0.89      0.89      0.89     12500
weighted avg       0.89      0.89      0.89     12500
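
For readers unfamiliar with the columns in this report, the sketch below derives precision, recall, and F1 for one class directly from a confusion matrix and checks them against scikit-learn's own computation. The `toy_true` and `toy_pred` labels are made up for illustration and are unrelated to the results above.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy labels for illustration only (not the notebook's predictions)
toy_true = ["negative", "negative", "negative", "negative",
            "positive", "positive", "positive", "positive"]
toy_pred = ["negative", "negative", "negative", "positive",
            "positive", "positive", "positive", "negative"]
label_order = ["negative", "positive"]

# With this label order, rows are true classes and columns are predictions,
# so ravel() yields: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=label_order).ravel()

# Metrics for the "positive" class
manual_precision = tp / (tp + fp)   # of the reviews predicted positive, how many were positive
manual_recall = tp / (tp + fn)      # of the truly positive reviews, how many were found
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

# scikit-learn's per-class values, in the same label order
sk_precision, sk_recall, sk_f1, sk_support = precision_recall_fscore_support(
    toy_true, toy_pred, labels=label_order
)

print("manual :", manual_precision, manual_recall, manual_f1)
print("sklearn:", sk_precision[1], sk_recall[1], sk_f1[1])
```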



In [14]:
# Compute overall accuracy for comparison with the basic model
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.8936



**1. Basic Model:**
   - Model Accuracy: 0.87976
   - This model uses CountVectorizer for feature extraction and logistic regression for classification.

   Analysis:
   - The basic model reached an accuracy of approximately 87.98% on the testing set.
   - CountVectorizer converted the preprocessed text into raw word-count features.
   - Logistic regression served as the classifier for sentiment prediction.
   - Training stopped at the solver's iteration limit (see the warning after cell 6), so this accuracy comes from a model that had not fully converged; raising `max_iter` might change the result.
   - No per-class metrics were computed for this model, so accuracy is the only basis for comparison.

**2. Improved Model:**
   - Model Accuracy: 0.8936
   - This model uses TfidfVectorizer for feature extraction and LinearSVC for classification.

   Analysis:
   - The improved model reached an accuracy of approximately 89.36% on the same testing set.
   - TfidfVectorizer converted the preprocessed text into TF-IDF-weighted features.
   - LinearSVC, a linear support vector classifier, served as the classifier.
   - The classification report shows balanced performance: precision and recall fall between 0.88 and 0.90 for both the negative and positive classes.
   - Because the vectorizer and the classifier were changed together, the gain cannot be attributed to either change alone. TF-IDF weighting likely contributes, but confirming that would require changing one component at a time.

Overall, the improved model outperforms the basic model in accuracy by about 1.4 percentage points (0.8936 versus 0.87976), which corresponds to 173 more of the 12,500 test reviews classified correctly. The combination of LinearSVC and TfidfVectorizer, which discounts words that are common across the corpus, accounts for this gain, although the experiment does not separate the two effects. The classification report was computed only for the improved model; computing precision, recall, and F1-score for the basic model as well would give a more complete comparison.
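
One way to separate the effect of the vectorizer from the effect of the classifier is to evaluate all four pairings with the same metrics. The sketch below is a minimal version of that idea, not part of the original notebook: the helper `compare_combinations`, the toy data, and the choice of `max_iter=1000` are all illustrative assumptions.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def compare_combinations(X_train, X_test, y_train, y_test):
    """Fit every vectorizer/classifier pairing (hypothetical helper).

    Returns a dict mapping (vectorizer_name, classifier_name)
    to a tuple (accuracy, macro_f1) measured on the test data.
    """
    vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
    classifiers = {
        # max_iter=1000 is an assumed value chosen to avoid early stopping
        "logreg": lambda: LogisticRegression(max_iter=1000),
        "linearsvc": LinearSVC,
    }
    results = {}
    for (vec_name, make_vec), (clf_name, make_clf) in product(
        vectorizers.items(), classifiers.items()
    ):
        candidate = Pipeline([("vec", make_vec()), ("clf", make_clf())])
        candidate.fit(X_train, y_train)
        predictions = candidate.predict(X_test)
        results[(vec_name, clf_name)] = (
            accuracy_score(y_test, predictions),
            f1_score(y_test, predictions, average="macro"),
        )
    return results


# Toy data for illustration only (not the notebook's dataset)
toy_train = [
    "a wonderful and moving film", "great acting and a great story",
    "loved every minute of it", "boring plot and terrible acting",
    "a dull and terrible film", "hated the slow boring story",
]
toy_train_labels = ["positive"] * 3 + ["negative"] * 3
toy_test = ["a great and wonderful story", "terrible and boring film"]
toy_test_labels = ["positive", "negative"]

toy_results = compare_combinations(toy_train, toy_test, toy_train_labels, toy_test_labels)
for (vec_name, clf_name), (acc, macro_f1) in toy_results.items():
    print(f"{vec_name:6s} + {clf_name:10s} accuracy={acc:.3f} macro_f1={macro_f1:.3f}")
```

In the notebook itself, the same helper could be called with `X_train`, `X_test`, `y_train`, and `y_test` from cell 4, so that all four pairings are scored on the same 12,500 test reviews.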