Classifying text using a custom text classifier 

# 🧠 Classifying Text Using a Custom Text Classifier

In this notebook, we build a **custom sentiment classifier** using classical machine-learning techniques.

The goal is to train a model that can **learn from labeled text data** and then **classify new sentences** as **positive** or **negative**.

---

## 📌 Problem Overview

- Input: text sentences  
- Output: sentiment label (`positive` or `negative`)  
- Task type: **Supervised text classification**

---

## 📊 Algorithms Used

We compare three common classification algorithms:

1. **Logistic Regression**
2. **Naive Bayes**
3. **Linear Support Vector Machine (SVM)**

---

## 🧮 Regression vs Classification (Important Distinction)

- **Linear Regression** → predicts continuous values  
- **Logistic Regression** → predicts discrete classes (classification)

Despite the name, **Logistic Regression is a classification algorithm**.
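A minimal sketch of the distinction, assuming a made-up one-feature dataset (the numbers are illustrative and not part of this notebook's data). Both models are fit on the same binary target; the linear model returns an unbounded real number, while the logistic model returns one of the two class labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature, binary target (0 or 1).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

x_new = np.array([[10.0]])

# Linear regression extrapolates a continuous value, which can fall outside [0, 1].
lin_pred = lin.predict(x_new)[0]

# Logistic regression always answers with one of the known classes.
log_label = log.predict(x_new)[0]

print("linear output:", lin_pred)
print("logistic label:", log_label)
```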

---

## 📦 Imports


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## 🗂️ Dataset

We use a small labeled dataset containing **positive** and **negative** sentences.

Each row consists of:
- `text` → the sentence
- `sentiment` → the label


In [None]:
data = pd.DataFrame([("i love spending time with my friends and family", "positive"),
("that was the best meal i've ever had in my life", "positive"),
("i feel so grateful for everything i have in my life", "positive"),
("i received a promotion at work and i couldn't be happier", "positive"),
("watching a beautiful sunset always fills me with joy", "positive"),
("my partner surprised me with a thoughtful gift and it made my day", "positive"),
("i am so proud of my daughter for graduating with honors", "positive"),
("listening to my favorite music always puts me in a good mood", "positive"),
("i love the feeling of accomplishment after completing a challenging task", "positive"),
("i am excited to go on vacation next week", "positive"),
("i feel so overwhelmed with work and responsibilities", "negative"),
("the traffic during my commute is always so frustrating", "negative"),
("i received a parking ticket and it ruined my day", "negative"),
("i got into an argument with my partner and we're not speaking", "negative"),
("i have a headache and i feel terrible", "negative"),
("i received a rejection letter for the job i really wanted", "negative"),
("my car broke down and it's going to be expensive to fix", "negative"),
("i'm feeling sad because i miss my friends who live far away", "negative"),
("i'm frustrated because i can't seem to make progress on my project", "negative"),
("i'm disappointed because my team lost the game", "negative")],columns=["text", "sentiment"])
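Before modeling, it can help to confirm how many rows carry each label. The sketch below uses a hypothetical three-row frame (`sample`, not the notebook's `data`) with the same two columns; the same `value_counts()` call should work on the full dataset.

```python
import pandas as pd

# Hypothetical three-row frame with the same columns as the notebook's dataset.
sample = pd.DataFrame(
    [
        ("i love this", "positive"),
        ("i hate this", "negative"),
        ("this is great", "positive"),
    ],
    columns=["text", "sentiment"],
)

# Count the rows per label to check class balance.
label_counts = sample["sentiment"].value_counts()
print(label_counts)
```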

## 🔀 Shuffling the Dataset

The rows above are ordered with all positive sentences first, followed by all negative ones. We shuffle them so the two classes are mixed in random order.


In [None]:
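# frac=1 samples 100% of the rows, which shuffles them.
# random_state=42 is an added seed (an assumption, any fixed integer works) so the order is repeatable.
# reset_index(drop=True) renumbers the rows and discards the old index.
data = data.sample(frac=1, random_state=42).reset_index(drop=True)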
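`DataFrame.sample` also accepts a `random_state` seed. A minimal sketch on a made-up five-row frame (`df` is hypothetical) suggests two properties worth checking: a fixed seed gives the same order on every call, and shuffling neither loses nor duplicates rows.

```python
import pandas as pd

# Hypothetical frame used only to illustrate shuffling.
df = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d", "e"],
        "sentiment": ["positive", "negative", "positive", "negative", "positive"],
    }
)

# The same seed produces the same shuffled order each time.
shuffled_1 = df.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled_2 = df.sample(frac=1, random_state=0).reset_index(drop=True)
same_order = shuffled_1.equals(shuffled_2)

# Shuffling reorders rows but keeps every row exactly once.
same_rows = sorted(shuffled_1["text"]) == sorted(df["text"])

print(same_order, same_rows)
```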


In [None]:
X = data["text"]       # input sentences
y = data["sentiment"]  # target labels

## 🔢 Converting Text into Numbers (Bag of Words)

Machine-learning models require numerical input.

We use **CountVectorizer** to:
- convert text into tokens
- build a vocabulary
- represent each sentence as word counts


In [None]:
# CountVectorizer turns each word into a column and each sentence into a row;
# each entry is the number of times that word occurs in that sentence.
countVec = CountVectorizer()
countVec_fit = countVec.fit_transform(X)


In [None]:
bag_of_words = pd.DataFrame(countVec_fit.toarray(), columns=countVec.get_feature_names_out())

In [None]:
print(bag_of_words)

## ✂️ Train / Test Split

- **Training set** → used to train the model  
- **Test set** → used to evaluate performance


In [None]:
X_train, X_test, y_train, y_test = train_test_split(bag_of_words, y, test_size=0.3, random_state=7)
# random_state fixes the random split, so the same row positions go to train and test on every run
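One variation worth considering: the cell above splits a matrix whose vocabulary was learned from every sentence, including those that end up in the test set. A sketch of an alternative, assuming a hypothetical eight-sentence corpus, splits the raw text first and lets a scikit-learn pipeline fit `CountVectorizer` on the training sentences only. The `stratify` argument is also an addition here; it keeps the class ratio the same in both parts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus standing in for the notebook's dataset.
texts = [
    "i love this movie", "what a wonderful day",
    "the food was great", "i am so happy today",
    "i hate this traffic", "what a terrible day",
    "the food was awful", "i am so sad today",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw sentences first; stratify preserves the 50/50 class ratio.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=7, stratify=labels
)

# The pipeline fits the vectorizer on the training sentences only,
# then reuses that vocabulary when transforming the test sentences.
model = make_pipeline(CountVectorizer(), LogisticRegression(random_state=1))
model.fit(train_texts, train_labels)

test_predictions = model.predict(test_texts)
print(list(zip(test_texts, test_predictions)))
```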

## 🤖 Model 1: Logistic Regression

Logistic Regression predicts the **probability of a class** and assigns the most likely label.
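To make "probability, then most likely label" concrete, here is a minimal sketch on hypothetical two-column count features (counts of the words "good" and "bad"; this is not the notebook's data). `predict_proba` returns one probability per class in the order given by `classes_`, and picking the largest one reproduces what `predict` returns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: column 0 counts "good", column 1 counts "bad".
X = np.array([[2, 0], [1, 0], [3, 1], [0, 2], [0, 1], [1, 3]])
y = np.array(["positive", "positive", "positive",
              "negative", "negative", "negative"])

clf = LogisticRegression(random_state=1).fit(X, y)

new_counts = np.array([[2, 1]])

# One probability per class, ordered like clf.classes_.
proba = clf.predict_proba(new_counts)[0]

# The predicted label is the class with the highest probability.
label_from_proba = clf.classes_[np.argmax(proba)]

print(dict(zip(clf.classes_, proba.round(3))))
print(label_from_proba)
```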

In [None]:
# fit() trains the model on the training set
lr = LogisticRegression(random_state=1).fit(X_train, y_train)

In [None]:
y_pred_lr = lr.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred_lr)

In [None]:
print(classification_report(y_test, y_pred_lr, zero_division=0))
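The imports also bring in `confusion_matrix`, which shows where the errors fall rather than only how many there are. A small sketch with made-up label lists (`y_true` and `y_hat` are hypothetical, not the notebook's results): rows are the true classes and columns are the predicted classes, in the order passed to `labels`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six sentences.
y_true = ["positive", "negative", "negative", "positive", "negative", "positive"]
y_hat = ["positive", "negative", "positive", "positive", "negative", "negative"]

# Rows: true class, columns: predicted class, both ordered negative then positive.
cm = confusion_matrix(y_true, y_hat, labels=["negative", "positive"])

print(cm)
# [[2 1]
#  [1 2]]
```

Here two negatives and two positives are classified correctly, one negative is mistaken for positive, and one positive is mistaken for negative.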

## 🤖 Model 2: Naive Bayes

Naive Bayes:
- is probabilistic
- assumes words are conditionally independent given the class
- is fast to train and often a strong baseline on text data
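A minimal sketch of what the model learns, assuming a hypothetical two-word vocabulary ("good", "bad") and one training document per class. With the default smoothing `alpha=1.0`, each per-class word probability is (count + 1) / (total count + vocabulary size), so the positive class gets (2 + 1) / (2 + 2) = 0.75 for "good".

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts: column 0 is "good", column 1 is "bad".
X = np.array([[2, 0],   # a positive document
              [0, 2]])  # a negative document
y = np.array(["positive", "negative"])

nb = MultinomialNB(alpha=1.0).fit(X, y)

# Per-class word probabilities; rows follow nb.classes_ (negative, positive).
word_probs = np.exp(nb.feature_log_prob_)
print(nb.classes_)
print(word_probs)  # [[0.25 0.75], [0.75 0.25]]

# A new document containing "good" once is assigned to the positive class.
prediction = nb.predict(np.array([[1, 0]]))[0]
print(prediction)  # positive
```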


In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
# Create the classifier and train it on the training set
nb = MultinomialNB().fit(X_train, y_train)

In [None]:
y_pred_nb = nb.predict(X_test)  # making predictions on the test set

In [None]:
accuracy_score(y_test, y_pred_nb)

In [None]:
print(classification_report(y_test, y_pred_nb, zero_division=0))

## 🤖 Model 3: Linear Support Vector Machine (SVM)

A linear SVM:
- finds the decision boundary with the largest margin between the classes
- works well in high-dimensional spaces (like text)
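A rough sketch of the idea, assuming four hypothetical, well-separated 2-D points. `SGDClassifier` with `loss="hinge"` (its default) trains a linear SVM by stochastic gradient descent. Each sample gets a signed score from `decision_function`, and the hinge loss `max(0, 1 - t * score)` is zero only for samples that are on the correct side with a margin of at least 1.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical, well-separated points (not the notebook's data).
X = np.array([[-2.0, -2.0], [-3.0, -1.0], [2.0, 2.0], [3.0, 1.0]])
y = np.array(["negative", "negative", "positive", "positive"])

svm = SGDClassifier(loss="hinge", random_state=0).fit(X, y)

# Signed score w.x + b per sample: positive maps to classes_[1], otherwise classes_[0].
scores = svm.decision_function(X)
labels_from_scores = np.where(scores > 0, svm.classes_[1], svm.classes_[0])

# Hinge loss per sample, with targets t = +1 for classes_[1] and -1 for classes_[0].
t = np.where(y == svm.classes_[1], 1.0, -1.0)
hinge = np.maximum(0.0, 1.0 - t * scores)

print(scores)
print(labels_from_scores)
print(hinge)
```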

In [None]:
# SGDClassifier with its default hinge loss trains a linear SVM using stochastic gradient descent
from sklearn.linear_model import SGDClassifier

In [None]:
# random_state=1 is an added seed so the SGD run gives the same result each time
svm = SGDClassifier(random_state=1).fit(X_train, y_train)

In [None]:
y_pred_svm = svm.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred_svm)  # true labels first, predictions second
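With so few test sentences, a single accuracy number per model can swing a lot from one split to another. One way to compare the three models side by side, sketched here on a hypothetical eight-sentence corpus, is cross-validation: each model is wrapped in a pipeline with its own `CountVectorizer` and scored on several folds. The corpus, the fold count, and the seeds are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus standing in for the notebook's dataset.
texts = [
    "i love this movie", "what a wonderful day",
    "the food was great", "i am so happy today",
    "i hate this traffic", "what a terrible day",
    "the food was awful", "i am so sad today",
]
labels = ["positive"] * 4 + ["negative"] * 4

models = {
    "logistic_regression": LogisticRegression(random_state=1),
    "naive_bayes": MultinomialNB(),
    "linear_svm": SGDClassifier(random_state=1),
}

# Four folds, each holding out one positive and one negative sentence.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

mean_accuracy = {}
for name, estimator in models.items():
    pipeline = make_pipeline(CountVectorizer(), estimator)
    scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
    mean_accuracy[name] = scores.mean()

for name, acc in mean_accuracy.items():
    print(f"{name}: {acc:.2f}")
```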

## ✅ Final Takeaways

- Text must be converted into numerical form before modeling
- Bag-of-Words is a strong baseline for text classification
- Logistic Regression, Naive Bayes, and Linear SVM are excellent starting models
- Comparing multiple models helps identify strengths and weaknesses
- This pipeline forms the foundation for more advanced NLP systems
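The stated goal is to classify new sentences, so here is a minimal sketch of that last step, assuming a hypothetical six-sentence training set and two made-up inputs. The key detail is calling `transform` (not `fit_transform`) on new text, so it is mapped onto the vocabulary learned during training; words the vectorizer has never seen are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data, not the notebook's dataset.
train_texts = [
    "i love sunny days", "what a wonderful meal", "i feel great today",
    "this is terrible", "i hate traffic", "what an awful day",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_texts)  # learn vocabulary here only
clf = MultinomialNB().fit(X_train_counts, train_labels)

new_sentences = ["i love this wonderful day", "this is a terrible awful mess"]

# Reuse the fitted vocabulary so the columns line up with the training matrix.
X_new_counts = vectorizer.transform(new_sentences)
predictions = clf.predict(X_new_counts)

for sentence, label in zip(new_sentences, predictions):
    print(f"{sentence!r} -> {label}")
```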