# Lesson 6: Mastering Logistic Regression for Text Classification

---

## **Lesson Overview**  
Today, we will explore **Logistic Regression**—a powerful and efficient machine learning algorithm for **binary classification tasks**, especially in **text classification**. Our goal is to understand the principles of Logistic Regression, build a model to classify text messages, and evaluate its performance. Let’s get started!  

---

## **1. Introduction to Logistic Regression**  
**Logistic Regression** is a statistical method used for **binary classification problems**. Unlike **Linear Regression**, which predicts a continuous output, Logistic Regression **predicts the probability that an input belongs to a class**, a value between **0 and 1**. That probability is then compared with a threshold (0.5 by default in Scikit-learn) to pick the class.  
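
As a rough illustration of how the output ends up between 0 and 1: logistic regression passes a weighted sum of the features through the sigmoid (logistic) function. The sketch below uses only the standard library; the weights, bias, and feature values are made up for illustration and do not come from the lesson's dataset.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias for a two-feature model (illustrative values only)
weights = [1.5, -0.8]
bias = -0.2
features = [2.0, 1.0]

# Linear score: w . x + b
z = sum(w * x for w, x in zip(weights, features)) + bias
probability = sigmoid(z)

print(f"linear score z = {z:.2f}")
print(f"P(class = 1)   = {probability:.3f}")
```

A score of zero maps to a probability of exactly 0.5, which is why the default decision threshold corresponds to the sign of the linear score.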

### **Why Use Logistic Regression?**  
✅ Efficient and requires minimal computational resources  
✅ Easy to implement and highly **interpretable**  
✅ Well-suited for **binary classification** (e.g., detecting spam emails)  

### **Limitations of Logistic Regression**  
❌ Struggles with **non-linear** problems unless you engineer extra features, because its **decision boundary is linear** in the input features  
❌ Underperforms when the classes are separated by **multiple or complex decision boundaries**  
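
To make the linear-boundary limitation concrete, here is a minimal sketch using the XOR pattern, a standard toy example that no single straight line can separate. The four data points are made up for illustration and are unrelated to the SMS dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR: the label is 1 only when exactly one of the two inputs is 1
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

model = LogisticRegression()
model.fit(X_xor, y_xor)

# Accuracy on the same four points; a linear boundary gets at most 3 of 4 right
xor_accuracy = model.score(X_xor, y_xor)
print(f"Accuracy on XOR: {xor_accuracy:.2f}")
```

One common workaround is to add engineered features (for XOR, the product of the two inputs), which makes the classes linearly separable in the expanded feature space.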

---

## **2. Loading and Preprocessing the Dataset**  
We will use the **SMS Spam Collection Dataset**. The preprocessing steps include:  

1. **Loading the dataset**  
2. **Preparing the data** for the model  
3. **Splitting the dataset** into **training** and **testing sets**  
4. **Converting text into a numerical format**  

### **Python Code for Loading and Preprocessing Data**  
```python
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Preprocess the data
X = spam_dataset["message"]
Y = spam_dataset["label"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

# Convert text into numerical format
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)
```  

---

## **3. Training the Logistic Regression Model**  
Once the data is processed, we build a Logistic Regression model using **Scikit-learn**. The model is trained with the **fit()** method, which learns one weight per feature (here, per vocabulary word) plus an intercept, so that the predicted probabilities match the training labels as closely as possible.  

### **Python Code for Training the Model**  
```python
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
logistic_regression_model = LogisticRegression(random_state=42)

# Train the model using the training set
logistic_regression_model.fit(X_train_count, Y_train)
```  

---

## **4. Making Predictions and Evaluating Model Accuracy**  
After training the model, we can use it to **predict labels for new messages**. We evaluate it by comparing its predictions on the test set with the actual labels, using **accuracy** (the fraction of correct predictions) as the evaluation metric.  

### **Python Code for Evaluating the Model**  
```python
from sklearn import metrics

# Make predictions
y_pred = logistic_regression_model.predict(X_test_count)

# Calculate model accuracy
accuracy = metrics.accuracy_score(Y_test, y_pred)
print(f"Accuracy of Logistic Regression Classifier: {accuracy:.2f}")
```  
**Expected Output:**  
```
Accuracy of Logistic Regression Classifier: 0.98
```
👉 This result shows that the model classifies about **98% of the test messages correctly** as **spam or not spam**.  
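
Accuracy on its own can look better than it is when one class is much rarer than the other, which is often the case with spam data. The sketch below uses made-up labels (9 ham, 1 spam) and a deliberately useless "classifier" that always answers ham, to show why a metric such as recall is worth checking as well.

```python
from sklearn import metrics

# Hypothetical labels: 9 ham and 1 spam (made up to illustrate class imbalance)
y_true = ["ham"] * 9 + ["spam"]
# A useless "classifier" that always predicts ham
y_always_ham = ["ham"] * 10

baseline_accuracy = metrics.accuracy_score(y_true, y_always_ham)
spam_recall = metrics.recall_score(y_true, y_always_ham, pos_label="spam")

print(f"Accuracy: {baseline_accuracy:.2f}")
print(f"Spam recall: {spam_recall:.2f}")
```

Here the accuracy is high even though no spam message is caught, so comparing a model against this kind of baseline gives a fairer picture of its quality.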

---

## **5. Conclusion and Next Steps**  
Today, we covered:  
✅ **The fundamentals of Logistic Regression**  
✅ **Dataset preparation and text preprocessing**  
✅ **Building and training a Logistic Regression model**  
✅ **Evaluating model accuracy in text classification**  

💡 **What’s next?**  
Now, it's time for **hands-on practice!** Try implementing **Logistic Regression** with different datasets to further strengthen your understanding. Happy learning! 🚀

## Adjusting Regularization in Logistic Regression Model

Great work so far, Stellar Navigator! Now, let's further explore the logistic regression model. I'd like you to adjust the regularization parameter 'C' from 0.2 to 0.5 in the Logistic Regression model initialization. After adjusting your code, please retrain the model and recalculate the accuracy score. Observe how this alteration impacts the model's performance.

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Preprocess the data
X = spam_dataset["message"]
Y = spam_dataset["label"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

# Initialize the Logistic Regression model with C=0.2
logistic_regression_model = LogisticRegression(C=0.2, random_state=42)

# Train the model
logistic_regression_model.fit(X_train_count, Y_train)

# Make predictions
y_pred = logistic_regression_model.predict(X_test_count)

# Calculate and print the accuracy
accuracy = metrics.accuracy_score(Y_test, y_pred)
print(f"Accuracy of Logistic Regression Classifier: {accuracy:.2f}")
```

Here’s the updated Python code with the **regularization parameter 'C' adjusted from 0.2 to 0.5**. After modifying the model, we retrain it and recalculate the accuracy score to observe any performance changes.

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Preprocess the data
X = spam_dataset["message"]
Y = spam_dataset["label"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

# Initialize the Logistic Regression model with C=0.5 (Adjusted from 0.2)
logistic_regression_model = LogisticRegression(C=0.5, random_state=42)

# Train the model
logistic_regression_model.fit(X_train_count, Y_train)

# Make predictions
y_pred = logistic_regression_model.predict(X_test_count)

# Calculate and print the accuracy
accuracy = metrics.accuracy_score(Y_test, y_pred)
print(f"Accuracy of Logistic Regression Classifier with C=0.5: {accuracy:.2f}")
```

### 🔍 **Impact of Increasing C from 0.2 to 0.5**
- **Higher C (weaker regularization)**: In Scikit-learn, `C` is the inverse of the regularization strength, so a larger value penalizes large coefficients less. The model **fits the training data more closely** and may capture more detail, which can improve accuracy but also raises the risk of overfitting.
- **Expected result**: If the extra flexibility captures real patterns, test accuracy **may increase** slightly. If it mostly captures noise, the model **might overfit**, leaving accuracy similar or slightly lower.

👉 **Run this updated code and compare the new accuracy with the previous one to analyze the effect!** 🚀
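
The effect of `C` on the learned weights can be sketched on a small synthetic dataset. Everything below (the random data, the list of `C` values) is an assumption chosen for illustration; it is not the lesson's dataset, and the exact numbers will differ on real text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset (made up): two features, two overlapping classes
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
                   rng.normal(1.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

weight_norms = {}
for c_value in [0.01, 0.2, 0.5, 10.0]:
    model = LogisticRegression(C=c_value, random_state=42)
    model.fit(X_toy, y_toy)
    # Smaller C = stronger L2 penalty = coefficients pulled closer to zero
    weight_norms[c_value] = float(np.linalg.norm(model.coef_))
    print(f"C={c_value:<5} |w| = {weight_norms[c_value]:.3f}  "
          f"train accuracy = {model.score(X_toy, y_toy):.2f}")
```

The coefficient norm grows as `C` increases, which is the mechanism behind both the potential accuracy gain and the overfitting risk described above.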

## Initialize and Train Logistic Regression Model

Good progress so far, Celestial Traveler!

Your next task is more challenging. Fill in the missing blanks (____) to initialize the Logistic Regression model and train it. After filling in the blanks, verify the model's performance by calculating the accuracy. Let's see if your model excels in classifying 'spam' from 'ham'.

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Preprocess the data
X = spam_dataset["message"]
Y = spam_dataset["label"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

# Initialize the Logistic Regression model with C=0.5
logistic_regression_model = ____(C=0.5, random_state=42)

# Train the model
____.fit(X_train_count, Y_train)

# Make predictions
y_pred = logistic_regression_model.predict(X_test_count)

# Calculate and print the accuracy
accuracy = metrics.accuracy_score(Y_test, y_pred)
print(f"Accuracy of Logistic Regression Classifier: {accuracy:.2f}")
```

Here’s the corrected code with the missing blanks **filled in** to correctly initialize and train the Logistic Regression model:

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Preprocess the data
X = spam_dataset["message"]
Y = spam_dataset["label"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

# Initialize the Logistic Regression model with C=0.5
logistic_regression_model = LogisticRegression(C=0.5, random_state=42)

# Train the model
logistic_regression_model.fit(X_train_count, Y_train)

# Make predictions
y_pred = logistic_regression_model.predict(X_test_count)

# Calculate and print the accuracy
accuracy = metrics.accuracy_score(Y_test, y_pred)
print(f"Accuracy of Logistic Regression Classifier: {accuracy:.2f}")
```

### 🔍 **Explanation of the Fixes**
1. **Replaced `____` in model initialization** → `logistic_regression_model = LogisticRegression(C=0.5, random_state=42)`
2. **Replaced `____` in training step** → `logistic_regression_model.fit(X_train_count, Y_train)`

✅ Now, the Logistic Regression model is properly initialized, trained, and evaluated! Run the code and observe the **accuracy** to see how well it classifies spam vs. ham. 🚀


## Prediction and Evaluation of Logistic Regression Model

Well done so far, Stellar Navigator!

Now, it's time for an essential step — predicting new, unseen messages and evaluating your model. Can you fill in the blanks (____) to complete this task? Your code should calculate the model's accuracy score. Let's see how your model performs!

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Preprocess the data
X = spam_dataset["message"]
Y = spam_dataset["label"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

# Initialize the Logistic Regression model with C=0.5
logistic_regression_model = LogisticRegression(C=0.5, random_state=42)

# Train the model
logistic_regression_model.fit(X_train_count, Y_train)

# Make predictions
y_pred = logistic_regression_model.____(X_test_count)

# Calculate and print the accuracy
accuracy = metrics.____(Y_test, y_pred)
print(f"Accuracy of Logistic Regression Classifier: {accuracy:.2f}")
```

Here's the **corrected version** of your code with the missing blanks **filled in**:

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Preprocess the data
X = spam_dataset["message"]
Y = spam_dataset["label"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

# Initialize the Logistic Regression model with C=0.5
logistic_regression_model = LogisticRegression(C=0.5, random_state=42)

# Train the model
logistic_regression_model.fit(X_train_count, Y_train)

# Make predictions
y_pred = logistic_regression_model.predict(X_test_count)

# Calculate and print the accuracy
accuracy = metrics.accuracy_score(Y_test, y_pred)
print(f"Accuracy of Logistic Regression Classifier: {accuracy:.2f}")
```

---

### 🔍 **Explanation of the Fixes**
1. **Replaced `____` in prediction step** → `y_pred = logistic_regression_model.predict(X_test_count)`
   - The `.predict()` function is used to classify new, unseen messages based on the trained model.

2. **Replaced `____` in accuracy calculation** → `accuracy = metrics.accuracy_score(Y_test, y_pred)`
   - The `accuracy_score()` function from Scikit-learn evaluates the model's performance by comparing predictions (`y_pred`) with actual labels (`Y_test`).

✅ **Now, your Logistic Regression model is complete!** 🎉 Run the code and check its accuracy in classifying spam vs. ham. 🚀
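
Besides `.predict()`, a fitted `LogisticRegression` also offers `.predict_proba()`, which returns the class probabilities behind each prediction. The sketch below uses a made-up four-message training set and two made-up new messages, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training messages, for illustration only
toy_texts = ["free prize now", "claim free cash", "lunch at noon", "see you tomorrow"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression(C=0.5, random_state=42)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# New messages must go through transform(), not fit_transform()
new_messages = ["free cash now", "lunch tomorrow"]
new_features = toy_vectorizer.transform(new_messages)

predicted_labels = toy_model.predict(new_features)
probabilities = toy_model.predict_proba(new_features)  # columns follow classes_

for message, label, probs in zip(new_messages, predicted_labels, probabilities):
    print(f"{message!r}: {label} (P(ham)={probs[0]:.2f}, P(spam)={probs[1]:.2f})")
```

Each row of probabilities sums to 1, and the predicted label is the class with the larger probability.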

## Improving Logistic Regression Model with Regularization

Impressive work, Celestial Traveler!

Let's further enhance our logistic regression journey. Your task is to initialize a Logistic Regression model with a regularization parameter C set to 0.5 and train it using the training dataset. Evaluate your model by running the provided visualization code to understand your model's predictions. Don't forget, your aim is to make your logistic regression model as optimal as possible.

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import datasets
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# Preprocess the data
X = spam_dataset["message"]
Y = spam_dataset["label"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

# TODO: Initialize the Logistic Regression model with C=0.5

# TODO: Train the model

# Make predictions
y_pred = logistic_regression_model.predict(X_test_count)

# Calculate and print the accuracy
accuracy = metrics.accuracy_score(Y_test, y_pred)
print(f"Accuracy of Logistic Regression Classifier: {accuracy:.2f}")

# Code to visualize the confusion matrix
labels = ['Ham', 'Spam']
cm = confusion_matrix(Y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

```

Here's your **complete Logistic Regression classifier** with **data preprocessing, model training, predictions, and accuracy evaluation**:  

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# ✅ Preprocess the data by converting 'message' text to numerical format
X = spam_dataset["message"]  # Features (text messages)
Y = spam_dataset["label"]  # Labels ('ham' or 'spam')

# Split dataset into training and test sets (80-20 split, stratified)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

# Convert text messages into numerical features using CountVectorizer
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

# ✅ Initialize the Logistic Regression model with C=0.5
logistic_regression_model = LogisticRegression(C=0.5, random_state=42)

# ✅ Train the model using the training dataset
logistic_regression_model.fit(X_train_count, Y_train)

# ✅ Make predictions on the test dataset
y_pred = logistic_regression_model.predict(X_test_count)

# ✅ Calculate and print the model's accuracy
accuracy = metrics.accuracy_score(Y_test, y_pred)
print(f"Accuracy of Logistic Regression Classifier: {accuracy:.2f}")
```

---

### 🎯 **What This Code Does:**
1. **Loads the dataset** 📂  
   - Converts the dataset into a Pandas DataFrame.
   
2. **Preprocesses the data** 🛠️  
   - Extracts text messages (`X`) and labels (`Y`).
   - Splits the dataset into **training (80%) and test (20%) sets**.
   - Converts text into **numerical vectors** using `CountVectorizer()`.  

3. **Initializes and trains the Logistic Regression model** 🏋️  
   - Uses `C=0.5` for **regularization** (controls model complexity).  
   
4. **Makes predictions** 🔍  
   - Predicts whether test messages are **ham or spam**.  

5. **Evaluates performance** 📊  
   - Computes **accuracy score** to measure how well the model performs.  
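
The steps above can also be chained into a single estimator. This is an optional sketch using Scikit-learn's `make_pipeline`, with made-up messages standing in for the real dataset; it is an alternative arrangement, not the approach used in the lesson's code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up messages standing in for the real dataset
toy_texts = ["free prize now", "claim free cash", "lunch at noon", "see you tomorrow"]
toy_labels = ["spam", "spam", "ham", "ham"]

# Vectorizer and classifier chained into a single estimator
spam_pipeline = make_pipeline(
    CountVectorizer(),
    LogisticRegression(C=0.5, random_state=42),
)
spam_pipeline.fit(toy_texts, toy_labels)

# Raw strings go in; the pipeline vectorizes them before predicting
pipeline_predictions = spam_pipeline.predict(["free cash now", "lunch tomorrow"])
print(pipeline_predictions)
```

A pipeline keeps the vectorizer and the model together, which makes it harder to accidentally fit the vectorizer on test data.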

---

### 🚀 **Next Steps:**
- Try **changing `C` values** (`0.1`, `1.0`, `10`) and compare accuracy.
- Test with **TF-IDF** (`TfidfVectorizer`) instead of `CountVectorizer`; it down-weights words that appear in most messages, which may give a more informative text representation.
- Evaluate with a **confusion matrix** to see detailed classification errors.
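
As a starting point for the confusion-matrix suggestion, here is a minimal text-only sketch. The true and predicted labels are made up for illustration; in practice you would pass `Y_test` and `y_pred` instead.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, made up for illustration
y_true = ["ham", "ham", "spam", "spam", "ham"]
y_hat = ["ham", "spam", "spam", "spam", "ham"]

# Rows are actual classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_hat, labels=["ham", "spam"])
true_ham, false_spam = cm[0]
missed_spam, true_spam = cm[1]

print(cm)
print(f"Ham kept: {true_ham}, ham flagged as spam: {false_spam}")
print(f"Spam missed: {missed_spam}, spam caught: {true_spam}")
```

Unlike a single accuracy number, the matrix separates the two kinds of mistakes: legitimate messages flagged as spam and spam messages that slip through.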

Your **spam classifier is ready!** 🎯

Great work, Space Voyager! Now, it's time to apply everything you've learned. Write the code blocks to preprocess data, initialize and train a Logistic Regression model, predict labels for the test data, and calculate the model's accuracy. Let's see how accurately your logistic regression classifier can predict!

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import datasets

# Load the dataset
spam_dataset = datasets.load_dataset('codesignal/sms-spam-collection', split='train')
spam_dataset = pd.DataFrame(spam_dataset)

# TODO: Preprocess the data by converting 'message' text to numerical format

# TODO: Initialize the Logistic Regression model with C=0.5

# TODO: Train the model using the training dataset

# TODO: Make predictions on the test dataset

# TODO: Calculate and print the model's accuracy
```

