# Sentiment Analysis with Multiple Models

This Jupyter notebook demonstrates sentiment analysis with four machine learning models: Support Vector Machine (SVM), Logistic Regression, Random Forest, and XGBoost. It covers text preprocessing with spaCy, TF-IDF feature extraction, grid-search hyperparameter tuning for the SVM, model training, and evaluation.

In [1]:
!pip install scikit-learn pandas xgboost joblib
!pip install spacy
!python -m spacy download en_core_web_sm

In [2]:
# Step 1: Import necessary libraries
import spacy
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import joblib

In [3]:
# Load the spaCy English model
nlp = spacy.load('en_core_web_sm')

In [4]:
# Step 2: Define the dataset (or import a dataset file instead)
data = {
    'text': [
        'I love this product!', 'This is the worst thing I have ever bought.', 'I am so happy with my purchase!',
        'Not worth the money.', 'Great value for the price.', 'Amazing quality!', 'Will definitely buy again.',
        'Very disappointed.', 'Highly recommend this!', 'I will never buy this again.', 'Fantastic experience!',
        'The worst customer service ever.', 'Absolutely love it!', 'It broke after a week.', 'Superb quality!',
        'Not happy with the purchase.', 'Perfect for my needs!', 'This was a complete waste of money.',
        'Very satisfied with the result.', 'I would not recommend this to anyone.', 'So easy to use!', 'Terrible quality.',
        'I am really impressed.', 'Could be better.', 'I’m so pleased with this product!', 'Really bad product.',
        'Very useful and practical.', 'Completely unsatisfied.', 'Exceeded my expectations!', 'The quality is awful.',
        'Good product but overpriced.', 'It’s amazing!', 'Waste of time and money.', 'I am in love with this!',
        'Don’t waste your money.', 'Will purchase again.', 'One of the worst things I’ve ever bought.',
        'So convenient and easy to use.', 'Really great value.', 'This is my favorite product!'
    ],
    'sentiment': [
        'positive', 'negative', 'positive', 'negative', 'positive', 'positive', 'positive', 'negative', 'positive', 'negative',
        'positive', 'negative', 'positive', 'negative', 'positive', 'positive', 'negative', 'positive', 'negative', 'positive',
        'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative',
        'positive', 'positive', 'negative', 'positive', 'negative', 'positive', 'positive', 'negative', 'positive', 'negative',
        'positive', 'negative', 'positive', 'negative', 'positive', 'positive', 'negative', 'positive', 'negative', 'positive',
        'positive', 'negative', 'positive', 'positive', 'positive', 'negative', 'positive', 'positive', 'negative', 'positive',
        'positive', 'negative', 'positive', 'negative', 'positive', 'positive', 'negative', 'positive', 'negative', 'positive',
        'positive', 'negative', 'positive', 'negative', 'positive', 'positive', 'negative', 'positive', 'negative', 'positive',
        'positive', 'negative', 'positive', 'negative'
    ]
}

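# Trim both lists to a common length: the label list above has more entries
# than the text list, and pd.DataFrame needs columns of equal length
n_samples = min(len(data['text']), len(data['sentiment']))
data['text'] = data['text'][:n_samples]
data['sentiment'] = data['sentiment'][:n_samples]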
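
pandas raises a `ValueError` when the columns of a dict have different lengths, so it can help to verify the two lists before building the DataFrame. The sketch below is one way to do that; `check_aligned` is a hypothetical helper, not part of this notebook or of any library, and the data is a toy example.

```python
from collections import Counter

def check_aligned(texts, labels):
    """Return the label counts, or raise if the lists differ in length."""
    if len(texts) != len(labels):
        raise ValueError(
            f"{len(texts)} texts but {len(labels)} labels; "
            "each text needs exactly one label"
        )
    return Counter(labels)

# Toy data for illustration only
counts = check_aligned(
    ['good', 'bad', 'great'],
    ['positive', 'negative', 'positive'],
)
print(counts)
```

The returned counts also give a quick view of class balance before splitting.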

In [5]:
# Convert the 'sentiment' column to numeric labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['sentiment'])

# Create DataFrame
df = pd.DataFrame(data)
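
`LabelEncoder` assigns integer codes to the sorted unique labels, so here `negative` becomes 0 and `positive` becomes 1. The plain-Python sketch below reproduces that mapping for illustration; `encode_labels` is a hypothetical name, and the function is not a substitute for the scikit-learn class.

```python
def encode_labels(labels):
    """Map each label to its index in the sorted list of unique labels."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes

codes, classes = encode_labels(['positive', 'negative', 'positive'])
print(classes)  # ['negative', 'positive']
print(codes)    # [1, 0, 1]
```

Knowing the mapping matters when reading the confusion matrices later: row and column 0 refer to `negative`, row and column 1 to `positive`.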

In [6]:
# Step 3: Split the dataset into train and test sets (80% train, 20% test)
X = df['text']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Step 4: Preprocess the text with spaCy (lowercase, lemmatize, keep alphabetic tokens)
def preprocess_text_spacy(text):
    text = text.lower()
    doc = nlp(text)
    words = [token.lemma_ for token in doc if token.text.isalpha()]
    return ' '.join(words)

# Apply preprocessing
X_train_cleaned = X_train.apply(preprocess_text_spacy)
X_test_cleaned = X_test.apply(preprocess_text_spacy)

In [8]:
# Step 5: Feature Extraction using TF-IDF
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train_cleaned).toarray()
X_test_tfidf = tfidf.transform(X_test_cleaned).toarray()
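
To make the TF-IDF weights less opaque, the sketch below computes them by hand for a two-document toy corpus. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document) and uses plain whitespace tokenization, which is a simplification of the vectorizer's own tokenizer. `tfidf_by_hand` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_by_hand(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

vocab, rows = tfidf_by_hand(['good product', 'bad product'])
for doc_row in rows:
    print({t: round(w, 3) for t, w in zip(vocab, doc_row)})
```

The shared word `product` receives a lower weight than the words that distinguish the two documents, which is the signal the classifiers train on.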

In [9]:
# ------------------------------------------
# 1. Support Vector Machine (SVM) Hyperparameter Tuning
# ------------------------------------------
param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'class_weight': ['balanced', None]
}

svm_grid = GridSearchCV(SVC(), param_grid_svm, cv=3, n_jobs=-1)
svm_grid.fit(X_train_tfidf, y_train)
best_svm_model = svm_grid.best_estimator_

# Evaluate SVM model
y_pred_svm = best_svm_model.predict(X_test_tfidf)
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))
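
Grid search tries every combination of the listed values. The sketch below enumerates the same grid with `itertools.product` to show how many candidates that is; it only counts combinations and does not train anything.

```python
from itertools import product

param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'class_weight': ['balanced', None],
}

# Cartesian product of all value lists, one dict per candidate
names = list(param_grid_svm)
candidates = [
    dict(zip(names, values))
    for values in product(*param_grid_svm.values())
]

n_folds = 3  # matches cv=3 in the cell above
print(len(candidates), "candidates")
print(len(candidates) * n_folds, "cross-validation fits")
```

Each added value multiplies the number of fits, so the grid is worth keeping small on larger datasets.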

In [10]:
# ------------------------------------------
# 2. Logistic Regression
# ------------------------------------------
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_tfidf, y_train)
y_pred_lr = log_reg.predict(X_test_tfidf)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr))
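
For a binary problem, the confusion matrix and the headline metrics can be reproduced with a few lines of plain Python. The sketch below follows scikit-learn's layout (rows are true labels, columns are predicted labels, in sorted label order). `confusion_counts` is a hypothetical helper, and the labels are toy values, not results from this notebook.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Build a confusion matrix: rows are true labels, columns predictions."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[position[t]][position[p]] += 1
    return matrix

# Toy labels: 0 = negative, 1 = positive
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_counts(y_true, y_pred)
tn, fp = cm[0]
fn, tp = cm[1]
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
print(cm, accuracy, precision, recall)
```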

In [11]:
# ------------------------------------------
# 3. Random Forest Classifier
# ------------------------------------------
rf = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducible results
rf.fit(X_train_tfidf, y_train)
y_pred_rf = rf.predict(X_test_tfidf)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

In [12]:
# ------------------------------------------
# 4. XGBoost Classifier
# ------------------------------------------
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train_tfidf, y_train)
y_pred_xgb = xgb_clf.predict(X_test_tfidf)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb))
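
With a small test set, each accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent training label. `majority_baseline_accuracy` is a hypothetical helper, and the label lists are toy values rather than this notebook's split.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for y in y_test if y == majority_label)
    return majority_label, hits / len(y_test)

label, acc = majority_baseline_accuracy([1, 1, 0, 1, 0, 1], [1, 0, 1, 0])
print(label, acc)
```

A classifier that does not clearly beat this number has learned little beyond the class distribution.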

In [13]:
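# Save the trained models with joblib
joblib.dump(best_svm_model, 'svm_model.pkl')
joblib.dump(log_reg, 'logistic_regression_model.pkl')
joblib.dump(rf, 'random_forest_model.pkl')
joblib.dump(xgb_clf, 'xgb_model.pkl')
# Save the fitted vectorizer and label encoder too (file names are arbitrary);
# both are needed to transform new text and decode predictions later
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
joblib.dump(label_encoder, 'label_encoder.pkl')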

# Conclusion

This notebook trained four classifiers for sentiment analysis (SVM, Logistic Regression, Random Forest, and XGBoost) on TF-IDF features, with grid-search hyperparameter tuning for the SVM. Each classifier was evaluated on the held-out test set using accuracy, a confusion matrix, and a classification report, and the trained models were saved with joblib for future use.