# Baseline with Logistic Regression

We'll run our baseline with a simple logistic regression on bag-of-words (BOW) token counts produced by scikit-learn's `CountVectorizer`. This improves performance over 

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

df = pd.read_csv('sentiment140_cleaned.csv')

In [2]:
texts = df['clean_text'].astype(str)
labels = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=734)
X_train.head()

9824        allergeys the dogwood trees are out to get me
6587    wtf it sounds like there is a bug in my ipod i...
8718               lets get tea station i need a pickmeup
1643    mom and grandmom just left apartments empty again
1902    using internet explorer is so as i was on my w...
Name: clean_text, dtype: object

Let's use a bag-of-words (BOW) vectorizer to turn each tweet into a feature vector of token counts. BOW ignores token order and context and treats each tweet as an unordered collection of words, which makes it a simple but much less robust NLP method.

In [3]:
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

Now we can train the classifier and evaluate!

In [4]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)

y_pred = clf.predict(X_test_vec)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.69      0.73      1015
           1       0.71      0.79      0.75       985

    accuracy                           0.74      2000
   macro avg       0.74      0.74      0.74      2000
weighted avg       0.74      0.74      0.74      2000



The simple baseline does a modest job classifying tweets with an overall accuracy of 74%.

For class 0 (negative), precision is 0.77: 77% of the tweets we predicted as negative were actually negative. Recall is lower at 0.69, so we only recovered 69% of all true negatives.

For class 1 (positive), precision is 0.71: 71% of the tweets we predicted as positive were actually positive. Recall is 0.79, so we captured 79% of all true positives.
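
Per-class precision and recall both come from the confusion matrix. The sketch below recomputes them by hand on a small set of made-up labels (`toy_true` and `toy_pred` are hypothetical and are not the notebook's predictions), then checks the results against scikit-learn's own metrics.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only (0 = negative, 1 = positive)
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

# Precision: of everything predicted as a class, how much was right
precision_pos = tp / (tp + fp)
precision_neg = tn / (tn + fn)

# Recall: of everything truly in a class, how much was found
recall_pos = tp / (tp + fn)
recall_neg = tn / (tn + fp)

print(f"class 0: precision={precision_neg:.2f} recall={recall_neg:.2f}")
print(f"class 1: precision={precision_pos:.2f} recall={recall_pos:.2f}")

# scikit-learn's metrics for the same labels, for comparison
sk_precision_pos = precision_score(toy_true, toy_pred)
sk_recall_neg = recall_score(toy_true, toy_pred, pos_label=0)
```

In the notebook itself, calling `confusion_matrix(y_test, y_pred)` would give the actual counts behind the report above.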

The lower recall on class 0 indicates that our model leans towards overpredicting positives: roughly 31% of true negatives are mislabeled as positive. Can we do better?