
# Sentiment Analysis for Pakistani Traffic
## Task: Find the Best Text Classification Model

In this assignment, we will explore different machine learning models to perform sentiment analysis on traffic-related texts from Pakistan. The goal is to determine the best model for classifying sentiments as either positive (1) or negative (0).


In [None]:

# Importing necessary libraries
import pandas as pd

# Load the dataset
data = pd.read_csv('Pakistani Traffic sentiment Analysis.csv')

# Report missing values per column, then drop any rows that contain them
print(data.isna().sum())
data.dropna(inplace=True)

# Display the first few rows of the dataset to understand its structure
data.head()
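
Because the models are later scored on accuracy, it helps to look at the class balance alongside the missing values. The sketch below uses a hypothetical toy DataFrame, not the real CSV. It assumes the `Text` and `Sentiment` column names used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV (same column names)
toy = pd.DataFrame({
    "Text": ["road is clear", None, "huge jam today", "smooth ride"],
    "Sentiment": [1, 0, 0, 1],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only rows that lack a text or a label
clean = toy.dropna(subset=["Text", "Sentiment"])

# Share of each sentiment class; a heavy imbalance makes accuracy misleading
class_share = clean["Sentiment"].value_counts(normalize=True)
print(class_share)
```

If one class dominates, a model that always predicts it will still look accurate, so precision and recall are worth reporting as well.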


In [None]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

# Separate the features (text) and labels (sentiment)
X = data['Text']
y = data['Sentiment']

# Initialize the vectorizers
count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()
hashing_vectorizer = HashingVectorizer(n_features=5000)

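# Fit and transform the text data with each vectorizer.
# Note: fitting on the full dataset before the train/test split lets test-set
# vocabulary (and IDF statistics) leak into the features. For a stricter
# evaluation, fit the vectorizer on the training split only.
X_count = count_vectorizer.fit_transform(X)
X_tfidf = tfidf_vectorizer.fit_transform(X)
# HashingVectorizer is stateless, so it only needs transform()
X_hash = hashing_vectorizer.transform(X)

# For this exercise, we'll use CountVectorizer (X_count) for demonstration.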
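
One way to keep the vectorizer from seeing test data is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`. The following is a minimal sketch on a hypothetical toy corpus. The sentences and labels are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the 'Text' and 'Sentiment' columns
texts = [
    "traffic is smooth today", "roads are clear and fast",
    "great new flyover opened", "signal timing works well",
    "terrible jam on the highway", "stuck for hours in traffic",
    "awful potholes everywhere", "road blocked again today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the vectorizer never sees the test rows
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training text only
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(texts_train, y_train)

# The vocabulary is built from training text; test text is only transformed
vocab = pipe.named_steps["countvectorizer"].vocabulary_
test_pred = pipe.predict(texts_test)
print(len(vocab), test_pred)
```

The same pipeline object can be passed to `GridSearchCV`, in which case the vectorizer is refit inside each cross-validation fold.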


In [None]:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
import xgboost as xgb

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_count, y, test_size=0.2, random_state=42)

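# Initialize models (only Logistic Regression is tuned in this cell)
log_reg = LogisticRegression(max_iter=1000)  # extra iterations help the solver converge on sparse text features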
knn = KNeighborsClassifier()
rf = RandomForestClassifier()
xgboost = xgb.XGBClassifier()
svm = SVC()
naive_bayes = MultinomialNB()

# Setting up GridSearchCV for Logistic Regression (can be done for other models too)
param_grid_lr = {'C': [0.1, 1, 10], 'penalty': ['l2']}
grid_lr = GridSearchCV(log_reg, param_grid_lr, cv=5, scoring='accuracy')
grid_lr.fit(X_train, y_train)

# Get the best parameters and the mean cross-validated accuracy on the training set
best_params_lr = grid_lr.best_params_
best_score_lr = grid_lr.best_score_
print("Logistic Regression Best Parameters:", best_params_lr)
print("Logistic Regression Best CV Accuracy:", best_score_lr)

# Repeat similar GridSearchCV for other models and compare the results
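
The "repeat for other models" step could be sketched as a loop over estimators and their parameter grids. This example runs on a hypothetical toy corpus so that it is self-contained. The grids are illustrative assumptions, not tuned recommendations. XGBoost is left out to avoid the extra dependency, but it could be added as another dictionary entry.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical toy corpus (10 positive, 10 negative) standing in for the training split
places = ["Lahore", "Karachi"]
positive = ["smooth", "clear", "fast", "safe", "great"]
negative = ["jammed", "blocked", "slow", "awful", "chaotic"]
texts = [f"traffic was {w} near {p}" for w in positive + negative for p in places]
labels = [1] * 10 + [0] * 10

X_toy = CountVectorizer().fit_transform(texts)

# Each entry maps a display name to (estimator, parameter grid); grids are illustrative
search_space = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "Naive Bayes": (MultinomialNB(), {"alpha": [0.1, 1.0]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]}),
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "Random Forest": (RandomForestClassifier(random_state=42), {"n_estimators": [50, 100]}),
}

# Run the same 5-fold grid search for every model and keep the best result of each
results = {}
for name, (model, grid) in search_space.items():
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_toy, labels)
    results[name] = (search.best_score_, search.best_params_)

# Print models from highest to lowest mean cross-validated accuracy
for name, (score, params) in sorted(results.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name}: CV accuracy={score:.3f}, params={params}")
```

In the notebook itself, `X_toy` and `labels` would be replaced by `X_train` and `y_train`, and only the best model would then be scored once on the held-out test set.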


In [None]:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict on test data
y_pred = grid_lr.predict(X_test)

# Evaluate model performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

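To show how the confusion matrix relates to the figures in the classification report, the sketch below derives accuracy, precision, and recall for the positive class by hand. The label vectors are made up for illustration and are not results from this dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_hat = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# For binary labels {0, 1}, scikit-learn lays the matrix out as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_hat).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the texts predicted positive, how many are positive
recall = tp / (tp + fn)     # of the truly positive texts, how many were found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)

# The hand-computed values agree with scikit-learn's own metrics
assert abs(precision - precision_score(y_true, y_hat)) < 1e-12
assert abs(recall - recall_score(y_true, y_hat)) < 1e-12
```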


## Conclusion
In this notebook, we trained and evaluated a Logistic Regression model, tuned with GridSearchCV, to classify sentiment in texts about Pakistani traffic conditions. The K-Nearest Neighbors, Random Forest, SVM, Naive Bayes, and XGBoost models were initialized but have not yet been tuned or evaluated.

Logistic Regression with GridSearchCV for hyperparameter tuning showed [insert best results here]; the other models can be tuned and evaluated in the same way.

Feel free to extend this notebook by trying out the other models and comparing their performance. Good luck with your sentiment analysis!
