# Logistic Regression

Sentiment analysis, also known as opinion mining, is the process of computationally identifying and categorizing opinions expressed in a piece of text. Essentially, it helps us understand if the writer's attitude towards a particular topic, product, or entity is positive, negative, or neutral.

This example demonstrates how to implement a simple sentiment classifier using logistic regression. For a relatively simple model, it performs surprisingly well on this class of tasks.

Articles used:
- https://www.getfocal.co/post/top-7-metrics-to-evaluate-sentiment-analysis-models
- https://towardsdatascience.com/basics-of-countvectorizer-e26677900f9c/

## Data Preparation

Let's start by importing a sentiment training dataset from Kaggle. It is kind of weird (for example, the first record ends up as the header row in the output below), but it's enough to start *learning* something - we could clean it later.

In [3]:
%env PYTHONUNBUFFERED=1
import kagglehub
df = kagglehub.dataset_load(
    kagglehub.KaggleDatasetAdapter.PANDAS,
    'jp797498e/twitter-entity-sentiment-analysis',
    'twitter_training.csv',
    pandas_kwargs={'encoding': 'ISO-8859-1'},
)
display(df)

env: PYTHONUNBUFFERED=1


Unnamed: 0,2401,Borderlands,Positive,"im getting on borderlands and i will murder you all ,"
0,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
1,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
2,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
3,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
4,2401,Borderlands,Positive,im getting into borderlands and i can murder y...
...,...,...,...,...
74676,9200,Nvidia,Positive,Just realized that the Windows partition of my...
74677,9200,Nvidia,Positive,Just realized that my Mac window partition is ...
74678,9200,Nvidia,Positive,Just realized the windows partition of my Mac ...
74679,9200,Nvidia,Positive,Just realized between the windows partition of...


Now we can prepare a cleaning function. It's pretty simple: it converts the text to lowercase and removes common words (stopwords) that usually carry little signal for sentiment analysis.

In [4]:
from gensim.parsing.preprocessing import remove_stopwords
def clean(text):
    return remove_stopwords(text.lower())
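
To make the idea concrete without the gensim dependency, here is a minimal sketch of the same kind of cleaning. The `STOPWORDS` set and the `clean_sketch` name are hypothetical stand-ins for illustration; gensim's built-in stop-word list is much longer, so its output on real text will differ.

```python
# Hypothetical mini stop-word list, for illustration only;
# gensim's built-in list is much longer.
STOPWORDS = {"i", "am", "to", "the", "and", "will", "you"}

def clean_sketch(text: str) -> str:
    """Lowercase the text and drop whitespace-separated tokens found in STOPWORDS."""
    tokens = text.lower().split()
    return " ".join(token for token in tokens if token not in STOPWORDS)

cleaned = clean_sketch("I am coming to the borders and I will kill you all,")
print(cleaned)  # coming borders kill all,
```

Note that the sketch splits on whitespace only, so punctuation stays attached to the neighbouring word (`all,`).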

Let's clean our dataset and take a look at it again.

In [5]:
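# Keep only the sentiment and text columns (positions 2 and 3) and name them
df = df[df.columns[[2, 3]]].copy()  # explicit copy, so the column assignments below act on our own frame
df.columns = ['sentiment', 'text']

# Pin column data types
# Caveat: depending on the pandas version, astype(str) may turn missing values into
# the literal string 'nan', in which case the dropna() below removes nothing.
df['text'] = df['text'].astype(str)
df['sentiment'] = df['sentiment'].astype(str)
df = df.dropna()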

# Clean data
df = df.loc[df['sentiment'] != 'Irrelevant']
df['text'] = df['text'].map(clean)

# Check data
display(df)

Unnamed: 0,sentiment,text
0,Positive,"coming borders kill all,"
1,Positive,"im getting borderlands kill all,"
2,Positive,"im coming borderlands murder all,"
3,Positive,"im getting borderlands 2 murder all,"
4,Positive,"im getting borderlands murder all,"
...,...,...
74676,Positive,realized windows partition mac like 6 years nv...
74677,Positive,realized mac window partition 6 years nvidia d...
74678,Positive,realized windows partition mac 6 years nvidia ...
74679,Positive,realized windows partition mac like 6 years nv...


In [6]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
x_train = train['text']
y_train = train['sentiment']
x_test = test['text']
y_test = test['sentiment']

## Building and Training the Model

Alright, we are done with our data. Now let's build the classification pipeline.

We need to vectorize our data first - essentially, to turn strings into numbers so that we can manipulate them mathematically. The easiest way to do so is to use a tool called a **count vectorizer**.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

First, it scans all the text to build a dictionary (vocabulary) of all unique words it finds. Then, for each sentence, it creates a numerical list (vector) where each number represents how many times a specific word from that dictionary appears in that sentence.

In [8]:
example_vector = vectorizer.fit_transform(['Hello World! Hello!', 'Bye World']).toarray()
display(vectorizer.get_feature_names_out(), example_vector)

array(['bye', 'hello', 'world'], dtype=object)

array([[0, 2, 1],
       [1, 0, 1]])
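
The same two steps (build a vocabulary, then count) can be sketched in plain Python. This is only an approximation of the default `CountVectorizer` behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the `count_vectorize` helper is a hypothetical name, and the real class returns a sparse matrix rather than lists.

```python
import re
from collections import Counter

def count_vectorize(docs):
    """Return (sorted vocabulary, one count vector per document)."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # The vocabulary is every unique token, in alphabetical order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = count_vectorize(["Hello World! Hello!", "Bye World"])
print(vocabulary)  # ['bye', 'hello', 'world']
print(vectors)     # [[0, 2, 1], [1, 0, 1]]
```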

Sounds easy, right? Next, we will define a thing called a **scaler**.

It rescales numerical features (like word counts) so they have a similar range, typically a mean of 0 and a standard deviation of 1. Here we pass `with_mean=False`, so each feature is only divided by its standard deviation and not centered, because centering is not supported for the sparse matrices that `CountVectorizer` produces. Scaling helps logistic regression learn because it prevents features with naturally larger values from dominating the learning process, so all features contribute more equally.

In [9]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
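
As a rough illustration of what scaling without centering does, the sketch below divides every column by its population standard deviation, using the count matrix from the earlier example. The `scale_columns` helper is a hypothetical name, and the handling of constant columns (dividing by 1) mirrors how `StandardScaler` avoids division by zero.

```python
from statistics import pstdev

def scale_columns(rows):
    """Divide every column by its population standard deviation, without centering."""
    columns = list(zip(*rows))
    # A constant column has zero deviation; divide by 1 instead to avoid dividing by zero.
    scales = [pstdev(column) or 1.0 for column in columns]
    return [[value / scale for value, scale in zip(row, scales)] for row in rows]

scaled = scale_columns([[0, 2, 1], [1, 0, 1]])
print(scaled)  # [[0.0, 2.0, 1.0], [2.0, 0.0, 1.0]]
```

Zeros stay zeros, which is why this variant works on sparse count matrices.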

Finally, let's define a classifier (no fancy configuration here *yet*) and stick everything into an elegant pipeline. That will be our final **model architecture**.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='lbfgs', max_iter=1500)
pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('scaler', scaler),
    ('classifier', classifier),
])

We could start training our model right away, but...

It would not be ideal in terms of its *hyperparameters* - the values that define *how* our pipeline works. These are settings we choose before training, like the regularization parameter (`C`) or how our text vectorizer processes words (`min_df`, `max_df`). They significantly control how the model learns and how well it ultimately performs.

Manually trying every possible combination of these hyperparameters would be incredibly tedious. Instead, we can use automated hyperparameter tuning techniques. For example, `GridSearchCV` systematically tries out every combination of hyperparameters from a grid we define (`RandomizedSearchCV` samples a fixed number of combinations instead).

Crucially, to evaluate how good each combination is without peeking at our final test set, it uses cross-validation. This means for each hyperparameter set, it splits the training data into several "folds", trains the model on some folds, and tests it on a remaining fold, repeating this process so every fold gets to be a test set. By averaging the performance across these folds, we get a more reliable score for that hyperparameter combination, helping us pick the best ones for our final model.
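
The fold rotation can be sketched in a few lines of plain Python. This is a simplified, unshuffled k-fold split with a hypothetical `k_fold_indices` helper; for classifiers, `GridSearchCV` uses stratified folds by default, which additionally keep the class proportions similar in every fold.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) pairs for plain k-fold splitting."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=7, n_folds=3))
for train, validation in folds:
    print(train, validation)
# [3, 4, 5, 6] [0, 1, 2]
# [0, 1, 2, 5, 6] [3, 4]
# [0, 1, 2, 3, 4] [5, 6]
```

Every sample lands in exactly one validation fold, so each candidate is scored on data it was not trained on.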

In [11]:
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'vectorizer__ngram_range': [(1, 1), (1, 2)], 
    'vectorizer__max_df': [0.85, 0.90, 0.95, 1.0],
    'vectorizer__min_df': [1, 2, 3, 5],
}
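
To get a feel for how much work such a grid implies, the sketch below enumerates every combination with `itertools.product`. The fold count of 3 is an assumption made for this illustration; the total number of fits is simply candidates times folds.

```python
from itertools import product

param_grid = {
    'classifier__C': [0.1, 1, 10],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__max_df': [0.85, 0.90, 0.95, 1.0],
    'vectorizer__min_df': [1, 2, 3, 5],
}

# One candidate per element of the Cartesian product of all value lists.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 3  # assumed number of cross-validation folds
print(len(candidates))            # 96
print(len(candidates) * n_folds)  # 288
```

Every extra value in any list multiplies the total, so grids grow quickly.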

Let's dissect those hyperparameters:

- Classifier `C`: Inverse of the regularization strength of the `LogisticRegression` classifier. Smaller values mean stronger regularization (less prone to overfitting), bigger values mean weaker regularization (able to capture more nuances in the data).
- Vectorizer `ngram_range`: This is crucial for capturing context! Instead of just looking at individual words (unigrams), n-grams allow us to consider sequences of words as single features; `(1, 2)` means unigrams plus bigrams. Using n-grams beyond unigrams often improves performance in text tasks by providing more contextual information to the model, but it also increases the vocabulary size.
- Vectorizer `max_df`: Maximum document frequency - ignore terms that appear in more than a `max_df` fraction of documents (0.85 means 85%). Smaller values exclude more common terms (good for noise reduction), but values that are too small may drop important common signals (underfitting).
- Vectorizer `min_df`: Minimum document frequency - ignore terms that appear in fewer than `min_df` documents. Smaller values may lead to huge noisy vocabularies, bigger values may lose rare but specific signals.
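
The vectorizer settings are easier to picture on a toy example. The sketch below uses hypothetical helpers (`word_ngrams`, `filter_by_document_frequency`) and made-up documents; it approximates how n-grams are generated and how an integer `min_df` and a fractional `max_df` prune the vocabulary, not the exact scikit-learn implementation.

```python
from collections import Counter

def word_ngrams(tokens, ngram_range):
    """Return all word n-grams for every n in the inclusive range."""
    low, high = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)]

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms found in at least min_df documents and at most a max_df share of them."""
    # Count each term once per document, however often it repeats inside it.
    doc_counts = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    return sorted(term for term, count in doc_counts.items()
                  if count >= min_df and count / n_docs <= max_df)

tokens = "not good at all".split()
bigram_features = word_ngrams(tokens, (1, 2))
print(bigram_features)
# ['not', 'good', 'at', 'all', 'not good', 'good at', 'at all']

docs = [["good", "game"], ["bad", "game"], ["good", "fun", "game"]]
kept = filter_by_document_frequency(docs, min_df=2, max_df=0.85)
print(kept)  # ['good']
```

With bigrams, "not good" becomes a feature of its own, which a unigram model cannot distinguish from "good". In the second example, "game" is dropped for being in every document, while "bad" and "fun" are dropped for being too rare.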

Here comes the moment of truth - we could finally start actually *training* our model, with cross-validation running over the top.

In [12]:
%env PYTHONUNBUFFERED=1
from joblib import parallel_backend
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipeline, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1, verbose=2)
with parallel_backend('multiprocessing'):
    grid.fit(x_train, y_train)

env: PYTHONUNBUFFERED=1
Fitting 3 folds for each of 96 candidates, totalling 288 fits
[CV] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1); total time=   1.7s
[CV] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1); total time=   1.8s
[CV] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1); total time=   1.9s
[CV] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1); total time=   3.7s
[CV] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1); total time=   3.8s
[CV] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1); total time=   4.2s
[CV] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1); total time=   3.3s
[CV] END classifier__C=0.1, vect

## Result

In [13]:
from sklearn.metrics import classification_report
prediction = grid.best_estimator_.predict(x_test)
print(classification_report(y_test, prediction))
print(f'Best parameters: {grid.best_params_}')

              precision    recall  f1-score   support

    Negative       0.93      0.91      0.92      4453
     Neutral       0.92      0.89      0.91      3701
    Positive       0.88      0.93      0.90      4185

    accuracy                           0.91     12339
   macro avg       0.91      0.91      0.91     12339
weighted avg       0.91      0.91      0.91     12339

Best parameters: {'classifier__C': 0.1, 'vectorizer__max_df': 0.85, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)}
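
For readers unfamiliar with the columns of the report, here is a minimal sketch of how precision, recall and F1 are computed for a single class. The labels below are toy data invented for the illustration, and `per_class_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, invented for this example.
y_true = ["Positive", "Positive", "Negative", "Neutral", "Negative", "Positive"]
y_pred = ["Positive", "Negative", "Negative", "Neutral", "Negative", "Positive"]

for label in ["Negative", "Neutral", "Positive"]:
    precision, recall, f1 = per_class_metrics(y_true, y_pred, label)
    print(f"{label:>8} {precision:.2f} {recall:.2f} {f1:.2f}")
# Negative 0.67 1.00 0.80
#  Neutral 1.00 1.00 1.00
# Positive 1.00 0.67 0.80
```

Precision answers "how many predictions of this class were right", recall answers "how many items of this class were found", and F1 is their harmonic mean.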


## Conclusion

By systematically preprocessing the data, constructing a scikit-learn pipeline (`CountVectorizer` -> `StandardScaler` -> `LogisticRegression`), and performing hyperparameter optimization via `GridSearchCV`, this sentiment analysis model achieved a final accuracy of 91% on the held-out test set.

Key factors contributing to this performance included the use of unigrams and bigrams (`ngram_range=(1, 2)`) and careful tuning of regularization and vectorizer frequency cutoffs. This demonstrates the effectiveness of classical machine learning techniques for this type of text classification task.