<img src="https://rhyme.com/assets/img/logo-dark.png" align="center"> <h2 align="center">Logistic Regression: A Sentiment Analysis Case Study</h2>

### Introduction
___

- IMDB movie reviews dataset
- http://ai.stanford.edu/~amaas/data/sentiment
- Contains 25000 positive and 25000 negative reviews
<img src="https://i.imgur.com/lQNnqgi.png" align="center">
- Contains at most 30 reviews per movie
- At least 7 stars out of 10 $\rightarrow$ positive (label = 1)
- At most 4 stars out of 10 $\rightarrow$ negative (label = 0)
- Distributed with a 50/50 train/test split
- Evaluation metric: accuracy

<b>Features: bag of 1-grams with TF-IDF values</b>
- Extremely sparse feature matrix: close to 97% of the entries are zeros

<b>Model: Logistic regression</b>
- $p(y = 1|x) = \sigma(w^{T}x)$
- Linear classification model
- Can handle sparse data
- Fast to train
- Weights can be interpreted
<img src="https://i.imgur.com/VieM41f.png" align="center" width=500 height=500>
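To make the model equation concrete, here is a minimal sketch of how $\sigma(w^{T}x)$ turns a sparse feature vector into a probability. The weights and feature values are made up for illustration, and `positive_probability` is a hypothetical helper, not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def positive_probability(weights, features):
    """p(y = 1 | x) = sigmoid(w^T x) for a sparse x stored as {term: value}."""
    # Only non-zero features contribute, which is why sparse data is cheap to handle.
    z = sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)

# Hypothetical weights: positive values push towards label 1, negative towards label 0.
weights = {"great": 2.0, "boring": -2.5}

p_good = positive_probability(weights, {"great": 0.6, "movie": 0.4})
p_bad = positive_probability(weights, {"boring": 0.7, "movie": 0.3})
print(p_good > 0.5, p_bad < 0.5)
```

The sign of each weight is what makes the model interpretable: a term with a large positive weight raises the predicted probability of a positive review, and a term with a large negative weight lowers it.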

### Step 1: Loading the dataset
---

This step loads the dataset with the `pd.read_csv` function from pandas into a DataFrame named `df`, then displays its first five rows with `head()`.

In [1]:
import pandas as pd

# Load data
df = pd.read_csv(r'F:\Folder\Coursera\Perform Sentiment Analysis with scikit-learn\movie_data.csv')

# Show 5 rows
print(df.head())

                                              review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0
3  hi for all the people who have seen this wonde...          1
4  I recently bought the DVD, forgetting just how...          0


### Step 2: Preprocessing Text
---

This section cleans the review text. `preprocessor()` removes HTML tags, lowercases the text, replaces runs of non-word characters with a single space, and appends any emoticons it found (with the `-` "nose" removed). `tokenizer()` splits text on whitespace, and `tokenizer_porter()` additionally reduces each word to its Porter stem. The cleaned text is stored in a new column named `clean_review` in the DataFrame.

In [2]:
import re
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

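def preprocessor(text):
    # Remove HTML tags such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters into spaces, then append the
    # emoticons (without the "-" nose), separated from the last word by a space
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')
    return text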

# Clean review text
df['clean_review'] = df['review'].apply(preprocessor)
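
The individual cleaning steps are easier to follow on a single string. The sketch below applies the same three regular expressions one at a time; the sample text is invented for illustration.

```python
import re

sample = "<p>This movie is GREAT!!! :-)</p>"

# Step 1: strip HTML tags
no_html = re.sub(r'<[^>]*>', '', sample)

# Step 2: find emoticons while the punctuation is still present
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# Step 3: lowercase and replace every run of non-word characters with a space
words = re.sub(r'[\W]+', ' ', no_html.lower()).split()

# Emoticons are normalised by dropping the "-" nose, so :-) and :) become one token
normalised = [e.replace('-', '') for e in emoticons]

print(no_html)      # This movie is GREAT!!! :-)
print(words)        # ['this', 'movie', 'is', 'great']
print(normalised)   # [':)']
```

Emoticons have to be extracted before step 3, because the `\W` pattern would otherwise erase them along with the rest of the punctuation.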


### Step 3: Train-Test Split
---

The dataset is split into training and testing sets using the `train_test_split` function from scikit-learn. The feature data (X) consists of the `clean_review` column, while the target data (y) consists of the `sentiment` column. 80% of the rows go to training and 20% to testing (`test_size=0.2`), and `random_state=42` fixes the shuffle so that the split is reproducible.

In [3]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['clean_review'], df['sentiment'], test_size=0.2, random_state=42)
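
As a rough illustration of why a fixed random state makes the split reproducible, the sketch below shuffles indices with a seeded generator and cuts off 20% for testing. `shuffle_split` is a hypothetical helper written for this example; it is not how scikit-learn implements `train_test_split`, and it will not reproduce the same rows.

```python
import random

def shuffle_split(items, test_size=0.2, seed=42):
    """Hypothetical helper: seeded shuffle followed by a train/test cut."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)   # same seed -> same order every run
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

reviews = [f"review {i}" for i in range(10)]
train_a, test_a = shuffle_split(reviews)
train_b, test_b = shuffle_split(reviews)

print(len(train_a), len(test_a))   # 8 2
print(test_a == test_b)            # True
```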


In [4]:
# Print the length of training and testing data
print("Training data length:", len(X_train))
print("Testing data length:", len(X_test))

# Print first few examples of training data
print("\nTraining data examples:")
for i in range(5):
    print("Example", i+1, ":", X_train.iloc[i])

# Print first few examples of testing data
print("\nTesting data examples:")
for i in range(5):
    print("Example", i+1, ":", X_test.iloc[i])


Training data length: 40000
Testing data length: 10000

Training data examples:
Example 1 : when anti bush jokes get really easy to do a show like this had better make sure it has something extra when that something extra is kid versions of political figures making jokes about the future they don t have yet it s just plain nonsense dick cheney and george bush are done well but dick cheney mutters mostly there s also condoleeza rice who has a crush on bush for some reason and donald rumsfeld who isn t really that similar to donald rumsfeld at all the democratic characters rarely give their names so it s a mystery as to who could be who aside from barack obama and hilary clinton the episodes have coherent stories but that s not nearly enough to keep this from sinking 
Example 2 : minor spoilerunderrated little stephen king shocker it s not perfect by any stretch of the imagination even if the limp performances of dale midkiff and denise crosby were better there d still be the mismanaged 

### Step 4: Feature Extraction with TF-IDF Vectorization
---

In this part, the text data is transformed into TF-IDF (term frequency-inverse document frequency) vectors using the `TfidfVectorizer` from scikit-learn. TF-IDF vectorization converts text documents into numerical vectors in which a word's weight grows with its frequency in the document and shrinks with the number of documents in the corpus that contain it.

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $\text{tf}(t,d)$ is the number of times the term $t$ occurs in document $d$, $n_d$ is the total number of documents, and $\text{df}(d, t)$ is the number of documents $d$ that contain the term $t$.
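
The definition above can be worked through by hand on a toy corpus. The three documents below are invented for illustration, tokens are split on whitespace, and the natural logarithm is assumed.

```python
import math

docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining the weather is sweet and one and one is two",
]
tokenized = [d.split() for d in docs]
n_d = len(tokenized)

def df(term):
    """Number of documents that contain the term."""
    return sum(term in doc for doc in tokenized)

def idf(term):
    """Textbook idf: log(n_d / (1 + df))."""
    return math.log(n_d / (1 + df(term)))

def tf_idf(term, doc):
    """Raw term count in the document times the idf."""
    return doc.count(term) * idf(term)

for term in ("the", "sun", "and"):
    print(term, df(term), round(idf(term), 3))
# the 3 -0.288
# sun 2 0.0
# and 1 0.405
```

On a corpus this small, a term that appears in every document gets a negative idf and a term that appears in all but one gets zero, so only the rarer terms carry positive weight.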

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define TF-IDF vectorizer
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        tokenizer=tokenizer_porter,
                        use_idf=True,
                        norm='l2',
                        smooth_idf=True)

# Transform text data into TF-IDF vectors
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)




The idf and tf-idf equations implemented in scikit-learn (with `smooth_idf=True`, as set above) differ slightly from the textbook definitions:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

With `norm='l2'`, each document's tf-idf vector is then scaled to unit Euclidean length.
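
A minimal pure-Python sketch of the scikit-learn variant follows, including the L2 normalization requested by `norm='l2'`. It mirrors the formulas rather than scikit-learn's own code; the toy corpus, whitespace tokenization, and helper names are assumptions made for this example.

```python
import math
from collections import Counter

docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining the weather is sweet and one and one is two",
]
tokenized = [d.split() for d in docs]
n_d = len(tokenized)

def df(term):
    """Number of documents that contain the term."""
    return sum(term in doc for doc in tokenized)

def idf(term):
    """Smoothed idf: log((1 + n_d) / (1 + df))."""
    return math.log((1 + n_d) / (1 + df(term)))

def tfidf_vector(doc):
    """tf * (idf + 1) for each term, scaled to unit Euclidean length."""
    counts = Counter(doc)
    raw = {term: count * (idf(term) + 1) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

vec = tfidf_vector(tokenized[2])
for term in sorted(vec):
    print(term, round(vec[term], 2))
```

Because of the added 1, a term that occurs in every document (such as "is" here) keeps a positive weight instead of dropping to zero.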

### Step 5: Training Logistic Regression Model
---

A logistic regression model is trained using the training data (X_train_tfidf and y_train) with the LogisticRegression class from scikit-learn. Logistic regression is a popular classification algorithm suitable for binary classification tasks like sentiment analysis.

In [9]:
from sklearn.linear_model import LogisticRegression

# Train Logistic Regression model
lr = LogisticRegression()
lr.fit(X_train_tfidf, y_train)


### Step 6: Evaluation
---


In this section, the code focuses on evaluating the performance of the sentiment analysis model using various metrics. Here's a breakdown of what each part of the code does:

1. **Importing Metrics Functions**:
   The code imports necessary functions from scikit-learn's metrics module, including `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `classification_report`. These functions will be used to calculate different evaluation metrics for the model.

2. **Calculate Predictions**:
   The model's predictions are calculated on the test data (`X_test_tfidf`) using the `predict` method of the logistic regression model (`lr`). This step generates predicted sentiment labels based on the trained model.

3. **Evaluate Using Metrics**:
   The predictions generated in the previous step are then evaluated using several metrics:
   - **Accuracy**: It measures the proportion of correctly predicted instances among all instances in the test set.
   - **Precision**: It indicates the proportion of true positive predictions out of all positive predictions made by the model.
   - **Recall**: It measures the proportion of true positive predictions out of all actual positive instances in the test set.
   - **F1-score**: It is the harmonic mean of precision and recall, providing a balanced measure between the two.

4. **Display Evaluation Metrics**:
   The calculated metrics—accuracy, precision, recall, and F1-score—are printed out to the console, providing insights into the model's performance across different aspects of classification.

5. **Display Classification Report**:
   Finally, a detailed classification report is printed using the `classification_report` function. This report provides a comprehensive summary of precision, recall, F1-score, and support for each class (in this case, sentiment classes) in the test set. It's particularly useful for understanding the model's performance on individual classes, especially in the case of imbalanced datasets.

By evaluating the model using these metrics and providing a classification report, we gain a comprehensive understanding of its strengths and weaknesses in classifying sentiment in text data.
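
The four definitions listed above can be checked by hand on a small example. The sketch below counts true positives, false positives, and false negatives directly; the labels are invented, `binary_metrics` is a hypothetical helper, and it assumes at least one predicted and one actual positive so that no division by zero occurs.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```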


In [13]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Calculate predictions on the test data
y_pred = lr.predict(X_test_tfidf)

# Evaluate using metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Display evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Accuracy: 0.8956
Precision: 0.8821122369446609
Recall: 0.9115988723318567
F1-score: 0.8966131907308378

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.88      0.89      5034
           1       0.88      0.91      0.90      4966

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000



### Analysis of Model Evaluation Results:

- **Accuracy:** The accuracy of the sentiment analysis model is 89.56%, meaning that 8,956 of the 10,000 predictions made on the test dataset were correct.

- **Precision:** The precision score measures the proportion of true positive predictions out of all positive predictions made by the model. The single value printed above (88.21%) is for class 1 (positive sentiment), the default positive label. For class 0 (negative sentiment), the classification report shows a precision of approximately 91%.

- **Recall:** The recall score measures the proportion of true positive predictions out of all actual positive instances in the test set. For class 1 (positive sentiment), the recall is approximately 91.16%; for class 0 (negative sentiment), it is approximately 88%.

- **F1-score:** The F1-score is the harmonic mean of precision and recall, providing a balanced measure between the two. The F1-score for class 1 (positive sentiment) is approximately 89.66%, and for class 0 (negative sentiment) it is approximately 89%.

- **Classification Report:** The classification report provides a comprehensive summary of precision, recall, F1-score, and support for each class (negative and positive sentiment) in the test set. It shows that the model performs well for both classes, with high precision, recall, and F1-score values. Additionally, the macro avg and weighted avg values indicate the overall performance of the model across both classes, considering their respective support (number of instances). The accuracy, precision, recall, and F1-score values indicate that the model is effective in classifying sentiment in the given text data.
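
The per-class rows of the report can be reproduced from a confusion matrix by treating each class in turn as the "positive" one. The notebook does not print the confusion matrix, so the four counts below are an assumption: they were inferred from the reported precision, recall, and support values.

```python
# Counts inferred from the reported metrics (assumption, not notebook output)
tn, fp, fn, tp = 4429, 605, 439, 4527

def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall and F1 for whichever class is treated as positive."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1 as positive: the usual reading of the confusion matrix
class_1 = precision_recall_f1(tp, fp, fn)
# Class 0 as positive: true negatives become the hits, and fp/fn swap roles
class_0 = precision_recall_f1(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print("class 0:", [round(v, 2) for v in class_0])   # class 0: [0.91, 0.88, 0.89]
print("class 1:", [round(v, 2) for v in class_1])   # class 1: [0.88, 0.91, 0.9]
print("accuracy:", accuracy)
```

Precision and recall trade places between the two classes because a false positive for class 1 is a false negative for class 0.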


### Step 7: Sentiment Analysis Example (Test Sentiment)
---

An example of sentiment analysis is demonstrated in this section. A new text ("This movie is great!") is preprocessed using the same preprocessing function (preprocessor()), transformed into a TF-IDF vector, and then fed into the trained logistic regression model for sentiment prediction. The predicted sentiment (positive=1 or negative=0) is printed out as the output.

In [11]:
# Example sentiment analysis
new_text = "This movie is great!"
new_text_cleaned = preprocessor(new_text)
new_text_tfidf = tfidf.transform([new_text_cleaned])
prediction = lr.predict(new_text_tfidf)
print("Sentiment prediction:", prediction[0])


Sentiment prediction: 1


### Step 8: Predict Sentiment for All Data
---

In this section, the code predicts the sentiment (positive or negative) for every review in the DataFrame using the trained logistic regression model. Because 40,000 of these 50,000 reviews were also used for training, the output is a labelling of the full dataset rather than an additional evaluation. Here's a summary of what each part of the code does:

- **Vectorize All Data**: The TF-IDF vectorizer (`tfidf`) transforms all the cleaned review text data (`df['clean_review']`) into TF-IDF vectors (`all_data_tfidf`).

- **Make Predictions**: The logistic regression model (`lr`) predicts the sentiment labels for all the TF-IDF vectors, storing the predicted sentiment labels in the `all_data_predictions` array.

- **Add Predictions to DataFrame**: The predicted sentiment labels are added as a new column named 'predicted_sentiment' to the original DataFrame (`df`).

- **Display Results**: Finally, a subset of the DataFrame containing the original review text and the predicted sentiment labels is printed to the console using the `print()` function.


In [12]:
# Predict sentiment for all data
all_data_tfidf = tfidf.transform(df['clean_review'])
all_data_predictions = lr.predict(all_data_tfidf)

# Add predictions to DataFrame
df['predicted_sentiment'] = all_data_predictions

# Display results
print(df[['review', 'predicted_sentiment']])


                                                  review  predicted_sentiment
0      In 1974, the teenager Martha Moxley (Maggie Gr...                    1
1      OK... so... I really like Kris Kristofferson a...                    0
2      ***SPOILER*** Do not read this, if you think a...                    0
3      hi for all the people who have seen this wonde...                    1
4      I recently bought the DVD, forgetting just how...                    1
...                                                  ...                  ...
49995  OK, lets start with the best. the building. al...                    0
49996  The British 'heritage film' industry is out of...                    0
49997  I don't even know where to begin on this one. ...                    0
49998  Richard Tyler is a little boy who is scared of...                    0
49999  I waited long to watch this movie. Also becaus...                    1

[50000 rows x 2 columns]


### Step 9: Save to CSV
---

In [14]:
# Define the file path
file_path = r'F:\Folder\Coursera\Perform Sentiment Analysis with scikit-learn\movie_reviews.csv'

# Save DataFrame to CSV
df.to_csv(file_path, index=False)

print("DataFrame has been successfully saved to:", file_path)


DataFrame has been successfully saved to: F:\Folder\Coursera\Perform Sentiment Analysis with scikit-learn\movie_reviews.csv
