# Logistic Regression: From Application to Theory

## 1 Introduction

Imagine you're faced with a decision-making problem where you need to classify something into one of two categories. It might be determining if a bank loan should be approved or denied, or predicting if a student will pass or fail a test based on their study hours. You collect the necessary data, and your task is to find a way to make sense of it to arrive at a clear decision.

Logistic regression is the tool that can help you with this. Unlike linear regression, which is used to predict continuous values (values that can fall within a range, such as temperatures or prices), logistic regression is designed for binary decisions: yes or no, true or false, 1 or 0.

At its core, logistic regression works by taking your input data and transforming it into a probability, a chance that the answer is 'yes' or 'no'. It doesn't just draw a straight line through the data points but shapes a curve called the sigmoid function that best represents the relationship between the variables.
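
To make the curve concrete, here is a minimal sketch of the sigmoid function, $\sigma(z) = 1 / (1 + e^{-z})$, in plain Python. The function name `sigmoid` and the sample inputs are illustrative choices and are not part of the application built later.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    # Fine for moderate z; a very large negative z would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, zero maps to 0.5, large positive inputs approach 1.
for z in (-4, 0, 4):
    print(f"sigmoid({z}) = {sigmoid(z):.3f}")
```

The output can be read as the probability of the 'yes' class, which is why values near 0.5 signal low confidence.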

Once you have found this mathematical expression, you can use it to make predictions with confidence. You'll know not just whether the answer is 'yes' or 'no', but also how confident you can be in that classification.

Logistic regression is a powerful yet accessible tool in statistics and machine learning. It's not just about crunching numbers; it's about understanding probabilities and making decisions that are informed, precise, and reflective of the complexity of real-world situations.

## 2 Application

### 2.1 Logistic Regression Application Introduction - Sentiment Analysis

Sentiment Analysis is a powerful technique that analyzes the emotions and opinions within text. In the context of movies, it can be applied to understand how audiences feel about a particular film based on their reviews. By applying Logistic Regression to this problem, we can categorize reviews as positive or negative, providing key insights for filmmakers, critics, and audiences.

### 2.2 Data Collection

For our application, we can obtain data from datasets specifically created for sentiment analysis. The dataset we'll be working with is the Large Movie Review Dataset, also known as the IMDb dataset. This is a collection of 50,000 movie reviews, evenly split between positive and negative sentiments. It's publicly available and can be downloaded from the [Stanford AI Lab](http://ai.stanford.edu/~amaas/data/sentiment/). It has already been downloaded for this application and is available in the [aclImdb](./aclImdb/README) directory.

Now let's load this data into our application:

In [15]:
import os
TRAIN_PATH = './aclImdb/train'
TEST_PATH = './aclImdb/test'
# Load the movie review data
def load_movie_review_data(path):
    reviews = []
    labels = []
    for sentiment in ['neg', 'pos']:
        folder = os.path.join(path, sentiment)
        for filename in os.listdir(folder):
            # Read as UTF-8 explicitly so the result doesn't depend on the platform default
            with open(os.path.join(folder, filename), 'r', encoding='utf-8') as file:
                reviews.append(file.read())
                labels.append(sentiment == 'pos')
    return reviews, labels

reviews, labels = load_movie_review_data(TRAIN_PATH)

In the code above, we created a function that walks through the `neg` and `pos` folders of the downloaded movie reviews and reads the contents of each review, recording `True` as the label for positive reviews and `False` for negative ones.

### 2.3 Preprocessing the Data

Now that we've loaded the data, we need to preprocess it before training our model.

#### 2.3.1 Tokenization

Tokenization is the process of breaking down text into individual units, commonly known as tokens. In most cases, these tokens are words. Tokenization helps in analyzing the frequency and importance of individual words in the text.

For example:

In [16]:
from nltk import download as nltk_download
from nltk.tokenize import word_tokenize
# Download the necessary NLTK data for tokenization
nltk_download('punkt')
text = "I love movies."
tokens = word_tokenize(text)
print(tokens)

['I', 'love', 'movies', '.']


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/santiagogomez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Let's break down this example to understand what just happened.  

##### 2.3.1.1 NLTK

NLTK (the Natural Language Toolkit) is a package for building Python programs that work with human language data. `nltk.download` is a function that makes it easy to download the resources, models, and corpora that different NLTK features rely on. These resources are stored online, and the download function fetches them so they can be used locally.

##### 2.3.1.2 punkt

`punkt` is the resource for the Punkt sentence tokenizer, whose models are trained with an unsupervised algorithm. The pre-trained models know how to split text into sentences for several European languages. We downloaded them when we called:
```python
nltk_download('punkt')
```

##### 2.3.1.3 word_tokenize
`word_tokenize` is a function that breaks input text into words, which is a common task in Natural Language Processing (NLP). It first uses the pre-trained Punkt model to split the text into sentences and then splits each sentence into words and punctuation marks. Without the Punkt resource, `word_tokenize` cannot perform that first step and raises an error.

##### 2.3.1.4 Example Summary
In our code example above, we downloaded the necessary pre-trained tokenizer model called Punkt which is used by NLTK, the package often used when working with human language data. We then called the `word_tokenize` function with the text `'I love movies.'` so we can break down that text into tokens. Lastly we printed out the resulting tokens `['I', 'love', 'movies', '.']`.

#### 2.3.2 Removing Stop Words
After tokenization, the next step is to remove stop words. Stop words are very common words such as 'the', 'is', and 'in', which are generally ignored in text analysis. They usually contribute little to the sentiment or meaning of the text, so they are removed to reduce the dimensionality of the data and focus on more informative words.

For example:

In [None]:
from nltk.corpus import stopwords
nltk_download('stopwords')

# create a set of English stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(stop_words)
print(filtered_tokens)

In this example, we've downloaded a list of precompiled stop words for various languages from the NLTK servers when we ran the line:
```python
nltk_download('stopwords')
``` 

With this precompiled list of words downloaded, we filtered out all of the tokens that were in this list of stop words so that we would only be left with informative words. In this case we filtered out the stop word `'I'` from the list of tokens `['I', 'love', 'movies', '.']`. The result after filtering out the stop words is: `['love', 'movies', '.']`.
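
One caveat is worth a quick sketch: stop-word lists often contain negations (the NLTK English list includes 'not', for example), and removing them can erase the difference between a positive and a negative sentence. The example below uses a tiny hand-written stop-word set, `SAMPLE_STOP_WORDS`, which is a hypothetical stand-in for the much longer NLTK list, and a helper `remove_stop_words` that is introduced here only for illustration.

```python
# Hypothetical, hand-written stop-word set used only for illustration;
# it stands in for the much longer NLTK list.
SAMPLE_STOP_WORDS = {"i", "the", "is", "this", "was", "not"}

def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

positive = remove_stop_words(["This", "movie", "was", "good"])
negative = remove_stop_words(["This", "movie", "was", "not", "good"])

print(positive)  # ['movie', 'good']
print(negative)  # ['movie', 'good']
```

Both sentences end up with the same tokens, so any model built on them cannot tell the two apart. Whether that trade-off is acceptable depends on the task.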

#### 2.3.3 Vectorization

In machine learning, text data can't be fed directly to models since the models require numerical values. Vectorization is the process of turning words and sentences into numerical form, making them digestible for the models. Think of it as converting each document into a list of numbers, where each position corresponds to a word or phrase. Simple schemes such as word counts record which words appear and how often, but not their order; richer representations also try to capture meaning and the relationships between words.

##### 2.3.3.1 CountVectorizer
One common method for vectorizing text data is to use a technique that counts the occurrence of words. We can use `CountVectorizer` from the sklearn library for this purpose. `CountVectorizer` is a feature extraction technique used in Natural Language Processing (NLP) and text mining specifically designed to convert text data into a numerical format by counting word occurrences.

For example, let's imagine we have three simple movie reviews:

1. "I love this movie."
2. "I hate this movie."
3. "This movie is great. I love it."

Using `CountVectorizer`, we can represent these reviews in the following matrix:

|            | "I" | "love" | "this" | "movie" | "hate" | "is" | "great" | "it" |
|------------|:---:|:------:|:------:|:-------:|:------:|:----:|:-------:|:----:|
| Review 1:  |  1  |   1    |   1    |    1    |   0    |  0   |    0    |  0   |
| Review 2:  |  1  |   0    |   1    |    1    |   1    |  0   |    0    |  0   |
| Review 3:  |  1  |   1    |   1    |    1    |   0    |  1   |    1    |  1   |

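This matrix is now a numerical representation of the text, and it can be used as input for various machine learning models, allowing them to analyze patterns and relationships within the text data. (The table is a simplified illustration: with its default settings, `CountVectorizer` lowercases the text and ignores single-character tokens, so "I" would not actually get a column.) If we choose to remove the stop words "I", "this", "is", and "it", we'll get a simpler matrix of reduced dimensionality (fewer columns):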

|            | "love" | "movie" | "hate" | "great" |
|------------|:------:|:-------:|:------:|:-------:|
| Review 1:  |   1    |    1    |   0    |    0    |
| Review 2:  |   0    |    1    |   1    |    0    |
| Review 3:  |   1    |    1    |   0    |    1    |

Here's how `CountVectorizer` works:

1. **Tokenization**: `CountVectorizer` starts by tokenizing the text, breaking it down into individual tokens. By default it lowercases the text and treats each run of two or more alphanumeric characters as a token, so punctuation and single-character words are dropped.
   
2. **Building a Vocabulary**: `CountVectorizer` then builds a vocabulary by identifying unique tokens across all the documents. Each unique token becomes a feature in the resulting matrix.

3. **Counting Occurrences**: For each document, `CountVectorizer` counts the occurrences of each token in the vocabulary. These counts form the values of the matrix.

4. **Removing Stop Words (Optional)**: If specified through the `stop_words` parameter, common words (stop words) that may not contribute to the analysis are dropped from the tokens, so they never enter the vocabulary.

5. **Creating the Matrix**: The final result is a sparse matrix, where each row corresponds to a document (review) and each column corresponds to a unique token from the vocabulary. The value in each cell is the number of times that token occurs in that document. The matrix is sparse (mostly zeros) because most documents contain only a small fraction of the entire vocabulary.
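
The steps above can be sketched in plain Python on the three sample reviews. This is an illustration of the idea and not how scikit-learn is implemented: the names `reviews_sample` and `tokenize` are made up for this sketch, the regular expression mimics `CountVectorizer`'s default of lowercasing and keeping tokens of two or more characters, and the result is a dense list of lists instead of a sparse matrix.

```python
import re
from collections import Counter

reviews_sample = [
    "I love this movie.",
    "I hate this movie.",
    "This movie is great. I love it.",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Steps 1 and 2: tokenize every document and collect the sorted vocabulary.
tokenized = [tokenize(doc) for doc in reviews_sample]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Steps 3 and 5: count each vocabulary word per document, one row per document.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['great', 'hate', 'is', 'it', 'love', 'movie', 'this']
for row in matrix:
    print(row)
# [0, 0, 0, 0, 1, 1, 1]
# [0, 1, 0, 0, 0, 1, 1]
# [1, 0, 1, 1, 1, 1, 1]
```

Note that "I" never becomes a column because it is a single character, and that the columns come out in alphabetical order.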

By using `CountVectorizer` we can combine the previous two steps, tokenization and stop-word removal, and obtain a numerical representation of our text data that can be used directly in our model. Let's continue building our sentiment analysis of movie reviews using logistic regression by leveraging `CountVectorizer`:

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(reviews)

In the code above we've created an instance of the `CountVectorizer` class and specified that the English stop words should be filtered out. We then called the `fit_transform` method on this instance, `vectorizer`, which does two things to the `reviews`:
 - `fit`: It learns the vocabulary of all the tokens in the `reviews`.
 - `transform`: It transforms the `reviews` into a matrix that contains the counts of each token's occurrence in the `reviews`.

The result is a matrix of word counts we called `X`. The `X` matrix holds our numeric features. We'll now split this data into training and testing sets so we can train and later validate our model:

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

We took our feature data `X` (a matrix of how often each word appears in each movie review) and our target data `labels` (whether each review is positive or negative) and split both into training and testing sets. With `test_size=0.2`, 80% of the reviews are used for training and the remaining 20% are held out for testing.
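
One possible variation, shown here as a sketch on made-up data, is to split the raw texts first and fit the vectorizer on the training portion only, so that the vocabulary is learned without looking at the test reviews. Passing `random_state` makes the split repeatable, and `stratify` keeps the class balance the same in both parts. The names `toy_reviews`, `toy_labels`, and the other `toy`-suffixed variables are hypothetical stand-ins for `reviews` and `labels`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for `reviews` and `labels`.
toy_reviews = [
    "a wonderful and touching film",
    "excellent acting and a perfect ending",
    "funny, charming and beautifully shot",
    "one of the finest movies this year",
    "a boring waste of time",
    "awful script and terrible acting",
    "poorly written and badly directed",
    "the worst disappointment in years",
]
toy_labels = [True, True, True, True, False, False, False, False]

# Split the raw texts; random_state fixes the shuffle, stratify balances the classes.
train_texts, test_texts, y_train_toy, y_test_toy = train_test_split(
    toy_reviews, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# Learn the vocabulary from the training texts only, then reuse it for the test texts.
toy_vectorizer = CountVectorizer(stop_words='english')
X_train_toy = toy_vectorizer.fit_transform(train_texts)
X_test_toy = toy_vectorizer.transform(test_texts)

print(X_train_toy.shape, X_test_toy.shape)
```

Words that appear only in the test texts are simply ignored by `transform`, which mirrors what happens when the model later sees a brand-new review.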

### 2.4 Building the Logistic Regression Model

Now that our data has been preprocessed and we've obtained our training and testing sets, it's time to build our Logistic Regression model:

In [20]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

By calling `model.fit`, we're training our model on the provided training data. It iteratively learns the relationship between the words in the reviews (`X_train`) and their corresponding sentiments (`y_train`). `max_iter=10000` raises the maximum number of iterations the solver may run (the default is 100), giving the optimization enough room to converge on this large, sparse dataset.

We can then use the trained model to make predictions on our testing set:

In [21]:
y_pred = model.predict(X_test)

Our model is now ready, and we can proceed to analyze and evaluate its performance.
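
`predict` returns hard labels only. When the confidence behind a label matters, `predict_proba` returns the estimated probability of each class instead. The sketch below uses a tiny, made-up one-feature dataset (`X_toy`, `y_toy`) and a separate `toy_model` so it runs on its own; the same two calls apply to the `model` trained above.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature toy data: larger values belong to the positive class.
X_toy = [[0], [1], [2], [3]]
y_toy = [False, False, True, True]

toy_model = LogisticRegression().fit(X_toy, y_toy)

# The columns of predict_proba follow the order of classes_.
print(toy_model.classes_)
print(toy_model.predict([[0], [3]]))
print(toy_model.predict_proba([[0], [3]]))
```

Each row of the `predict_proba` output sums to 1, and `predict` picks the class with the larger probability.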

### 2.5 Analysis of the Logistic Regression Model

#### 2.5.1 Predicting Sentiment for New Reviews

Having trained the model using movie reviews and their corresponding sentiments, we can now apply this model to predict the sentiment of new reviews.

Suppose you have a new review and you want to know whether it's positive or negative. Here's how you can predict the sentiment using our trained Logistic Regression model:

In [22]:
def predictSentiment(text, sentiment_model = model, vectorize = vectorizer):
    # Transforming the text so that it matches the training data
    text_vectorized = vectorize.transform([text])
    # Predicting the sentiment
    predicted_sentiment = sentiment_model.predict(text_vectorized)
    # Decoding the predicted sentiment
    if predicted_sentiment[0] == 1:
        print(f"Text: {text}\nPredicted sentiment: Positive\n")
    else:
        print(f"Text: {text}\nPredicted sentiment: Negative\n")

predictSentiment("This movie was fantastic!")
predictSentiment("This movie was awful!")
predictSentiment("I didn't think this movie was amazing as many reviews have stated. In fact, I thought it was mediocre at best!")

Text: This movie was fantastic!
Predicted sentiment: Positive

Text: This movie was awful!
Predicted sentiment: Negative

Text: I didn't think this movie was amazing as many reviews have stated. In fact, I thought it was mediocre at best!
Predicted sentiment: Positive



As we have shown with the code above, the trained model can be applied to any new reviews, transforming them into the same vectorized form using the `CountVectorizer` and then classifying them with the `LogisticRegression` model.

This process ensures that new reviews are treated consistently with the training data.

However, the predictions are not always what you would expect. The third review above is critical of the movie, yet the model labels it Positive. Let's analyze our logistic regression model to further understand how it's working with our movie reviews data.

#### 2.5.2 Understanding the Coefficients

In Logistic Regression, the relationship between the features (words in the reviews) and the target (sentiment) is quantified using coefficients. The coefficients of our model represent the weight or importance assigned to each word or feature in determining the sentiment of a review.

In [23]:
coefficients = model.coef_[0]
print(coefficients)

[-2.38859414e-01  1.88130377e-01  0.00000000e+00 ...  1.32824756e-05
  7.53057485e-03 -3.32844842e-02]


By examining the coefficients, we can understand how each word in the reviews contributes to the predicted sentiment. Here's how to print the top coefficients:

1. We first need to get the actual words. To achieve this we use the `vectorizer.get_feature_names_out()` function. This function retrieves the feature names, which are the words in the vocabulary that were vectorized during the preprocessing phase.

In [24]:
feature_names = vectorizer.get_feature_names_out()

2. We then need to pair up the coefficients with their respective feature names. We use `zip` for this purpose, as it's a function that generates tuples. In the code below, `zip` combines the coefficients with the corresponding feature names. Since `coefficients` and `feature_names` are arrays that correspond to each other (meaning the $n^\text{th}$ element in `coefficients` corresponds to the $n^\text{th}$ word in `feature_names`), using `zip` creates pairs that maintain this relationship. For example, if you had `coefficients = [0.5, -0.3, 1.2]` and `feature_names = ['happy', 'sad', 'great']`, using `zip(coefficients, feature_names)` would return the pairs:
`(0.5, 'happy')`, `(-0.3, 'sad')`, and `(1.2, 'great')`.

In [25]:
feature_name_coefficient_pairs = zip(coefficients, feature_names)

3. Now that we have the feature names with their respective coefficients, we can sort them by coefficient to see which words have the greatest influence:

In [26]:
# function to sort the coefficients
def sort_key(coefficient_word_tuple):
    coefficient, _ = coefficient_word_tuple
    return coefficient

# sort the coefficients from the largest to the smallest (reverse=True)
sorted_coefficients = sorted(feature_name_coefficient_pairs, key=sort_key, reverse=True)
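
The pairing and sorting steps can be tried on the small example from the text. The sketch also shows one detail that is easy to trip over: `zip` returns a single-use iterator, so sorting it a second time (for instance by re-running the sort cell without re-running the `zip` cell) yields an empty list. The `toy_` names are used here only to avoid clashing with the notebook's variables.

```python
toy_coefficients = [0.5, -0.3, 1.2]
toy_feature_names = ['happy', 'sad', 'great']

pairs = zip(toy_coefficients, toy_feature_names)

# Tuples sort by their first element, so this orders by coefficient.
first_pass = sorted(pairs, reverse=True)
# The iterator was consumed by the first sort, so nothing is left.
second_pass = sorted(pairs, reverse=True)

print(first_pass)   # [(1.2, 'great'), (0.5, 'happy'), (-0.3, 'sad')]
print(second_pass)  # []
```

Wrapping the result in `list(zip(...))` is one way to keep the pairs around for repeated use.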

Let's now see the top positive coefficients and the top negative coefficients to see which top words the model has learned to associate with a positive sentiment (in the case of a positive coefficient) and a negative sentiment (in the case of a negative coefficient).

In [27]:
print("Top Positive Coefficients:")
for coef, word in sorted_coefficients[:10]:
    print(f"{word}: {coef}")

print("\nTop Negative Coefficients:")
for coef, word in sorted_coefficients[-10:]:
    print(f"{word}: {coef}")

Top Positive Coefficients:
wonderfully: 1.328571350909191
perfect: 1.3007026200487972
excellent: 1.282701882392441
funniest: 1.2737573405623213
rare: 1.2669243006536708
surprisingly: 1.230629278284308
perfectly: 1.2222360392311848
erotic: 1.1934035648995072
finest: 1.1907394698705622
jolie: 1.1778307043962923

Top Negative Coefficients:
badly: -1.3633704619772233
horrible: -1.3918862413007114
worse: -1.396960038007195
disappointing: -1.4357959426869529
awful: -1.6746433115917152
lacks: -1.6939809662107006
poorly: -1.8569035814037598
waste: -2.2241164925242907
worst: -2.23799666913952
disappointment: -2.242382523387833


The model has learned the words with the positive coefficients are the ones that usually appear in positive reviews, which means they contribute to a prediction of positive sentiment. Similarly, the model has learned that the words with negative coefficients are the ones that usually appear in a negative review. Since these are the top coefficients, it means that these words have a strong association with their respective sentiment. These are the words that the model considers most influential in determining the sentiment of a review.
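
To see how coefficients turn into a prediction, here is a simplified sketch: add up the coefficient of every word in the review, add the intercept, and pass the total through the sigmoid function. The coefficient values are rounded from the output above. The intercept is assumed to be 0 for illustration (the fitted value is available as `model.intercept_[0]`), words outside the small dictionary are treated as having weight 0, and the names `sample_coefficients` and `positive_probability` are made up for this example.

```python
import math

# Coefficients rounded from the output above; every other word is treated as
# weight 0 here, which is a simplification (the real model has one coefficient
# per vocabulary word).
sample_coefficients = {"excellent": 1.28, "perfect": 1.30, "awful": -1.67, "waste": -2.22}
ASSUMED_INTERCEPT = 0.0  # assumption: the fitted value is model.intercept_[0]

def positive_probability(tokens):
    """Sum the weights of the tokens and squash the total into a probability."""
    z = ASSUMED_INTERCEPT + sum(sample_coefficients.get(token, 0.0) for token in tokens)
    return 1.0 / (1.0 + math.exp(-z))

print(round(positive_probability(["excellent", "movie"]), 3))
print(round(positive_probability(["not", "excellent", "movie"]), 3))
print(round(positive_probability(["awful", "waste"]), 3))
```

The first two calls give the same probability because each word is scored on its own, regardless of the words around it. This helps explain why the third review in section 2.5.1, which negates the word "amazing", was still classified as Positive.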

#### 2.5.3 Checking for Overfitting or Underfitting

We want our model to perform well not just on the training data but also on unseen data. This requires avoiding overfitting (too complex) and underfitting (too simple).

As discussed in [Linear Regression - Deeper Dive sections 2.3 and 2.4](../linear-regression/Linear%20Regression%20-%20Deeper%20Dive.ipynb), overfitting is when a model performs exceptionally well on training data but poorly on unseen data, and underfitting is when the model performs poorly on both.

We can check for overfitting and underfitting by comparing the accuracy on the training and testing datasets. Below, we compare how well the model predicts the sentiments of the reviews it was trained on, `train_accuracy`, with how well it predicts the sentiments of reviews it hasn't seen, `test_accuracy`. The accuracy on the test data provides a measure of how well the model generalizes to new, unseen data.

In [28]:
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy}")
print(f"Testing Accuracy: {test_accuracy}")

Training Accuracy: 0.9987
Testing Accuracy: 0.8778


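A significant difference between the training and testing accuracies might indicate overfitting. Here the training accuracy (0.9987) is well above the testing accuracy (0.8778), which suggests the model is overfitting to some degree, even though it still generalizes reasonably well. Similarly, poor performance on both may signal underfitting. Cross-validation, as mentioned in [Linear Regression - Deeper Dive section 2.6.1](../linear-regression/Linear%20Regression%20-%20Deeper%20Dive.ipynb), can further assist in detecting these issues.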
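
As a rough sketch of what cross-validation could look like here, the vectorizer and the model can be chained with `make_pipeline` and scored with `cross_val_score`, so that each fold learns its vocabulary from its own training portion. The data below is a made-up toy set (`toy_reviews`, `toy_labels`) and `cv=2` is chosen only because the set is tiny; on the real reviews a larger number of folds would be more typical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for `reviews` and `labels`.
toy_reviews = [
    "a wonderful and touching film",
    "excellent acting and a perfect ending",
    "funny, charming and beautifully shot",
    "one of the finest movies this year",
    "a boring waste of time",
    "awful script and terrible acting",
    "poorly written and badly directed",
    "the worst disappointment in years",
]
toy_labels = [True, True, True, True, False, False, False, False]

# Vectorizing inside the pipeline keeps each fold's vocabulary separate.
pipeline = make_pipeline(
    CountVectorizer(stop_words='english'),
    LogisticRegression(max_iter=10000),
)

# One accuracy score per fold.
scores = cross_val_score(pipeline, toy_reviews, toy_labels, cv=2)
print(scores, scores.mean())
```

A large spread between the fold scores, or a mean well below the training accuracy, would point in the same direction as the train/test gap discussed above.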

### 2.6 Evaluating the Model

Evaluating the model involves using several metrics to understand how well it is performing. Here we compute the following:

- Confusion Matrix
- Accuracy
- Precision
- Recall
- F1 Score
- ROC Curve and AUC

In [29]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve

# Metrics
print('Confusion Matrix:', confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

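# ROC and AUC
# Note: y_pred holds hard 0/1 labels, so the ROC curve below has a single
# interior point. Passing scores such as model.predict_proba(X_test)[:, 1]
# instead would trace the full curve.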
fpr, tpr, _ = roc_curve(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)
print('fpr', fpr)
print('tpr', tpr)
print('auc', auc)

Confusion Matrix: [[2216  322]
 [ 289 2173]]
Accuracy: 0.8778
Precision: 0.8709418837675351
Recall: 0.8826157595450853
F1 Score: 0.8767399636877143
fpr [0.         0.12687155 1.        ]
tpr [0.         0.88261576 1.        ]
auc 0.877872103570809
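
Each of these numbers can be recomputed by hand from the confusion matrix, which is a useful way to check what the metrics mean. The sketch below copies the four counts from the output above and assumes scikit-learn's layout of `[[TN, FP], [FN, TP]]` for the labels `False` and `True`; the variable names are chosen for this example only.

```python
# Counts copied from the confusion matrix printed above,
# laid out by scikit-learn as [[TN, FP], [FN, TP]].
tn, fp = 2216, 322
fn, tp = 289, 2173

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all reviews classified correctly
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives found (the TPR)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)

# With hard 0/1 predictions the ROC curve has one interior point, so the area
# under it reduces to the average of recall and (1 - false positive rate).
auc_from_labels = (recall + (1 - false_positive_rate)) / 2

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"FPR:       {false_positive_rate:.4f}")
print(f"AUC:       {auc_from_labels:.4f}")
```

The values agree with the scikit-learn output above, including the AUC, which confirms that the reported AUC reflects a single decision threshold and not the model's full range of predicted probabilities.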


### 2.7 Improving the Model

### 2.8 Application Conclusion

## 3 Theory

### 3.1 Introduction to the Theory

### 3.2 Understanding the Sigmoid Function

### 3.3 Coefficients and Odds Ratios

### 3.4 Finding Parameters
- Maximum Likelihood Estimation (MLE)

### 3.5 Cost Function
- Log Loss

### 3.6 Gradient Descent

#### 3.6.1 Equations
- Equations for updating parameters

#### 3.6.2 Example
- Step by Step walk-through

### 3.7 Decision Boundary
- Understanding how decisions are made

## 4 Advanced Concepts

### 4.1 Mathematical Foundations

### 4.2 Regularization Techniques

### 4.3 Multiclass Classification

## 5 Conclusion

### 5.1 Next Steps