# Problem statement: Classification model to analyze Amazon product reviews

The objective is to create a classification model that will analyze Amazon product reviews to classify sentiments as positive or negative. Here's a breakdown of the steps involved in this workflow:

- Step 1: Load the Dataset
- Step 2: Data Pre-processing
- Step 3: Feature Selection
- Step 4: Model Selection
- Step 5: Training the Model
- Step 6: Model Evaluation
- Step 7: Hyperparameter Tuning
- Step 8: Cross Validation

The notebook contains 7 exercises in total:

* [Exercise 1](#ex_1)
* [Exercise 2](#ex_2)
* [Exercise 3](#ex_3)
* [Exercise 4](#ex_4)
* [Exercise 5](#ex_5)
* [Exercise 6](#ex_6)
* [Exercise 7](#ex_7)

## Step 1: Load the dataset
First, let's load the dataset. If you are working in Google Colab, upload the CSV file first (uncomment the upload cell below); then read the file into a pandas DataFrame.

In [None]:
#from google.colab import files
#uploaded = files.upload()

In [None]:
%pip install -q pandas numpy matplotlib seaborn wordcloud scikit-learn joblib

In [4]:
# Import necessary libraries
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('Datasets/amazon-product-review-data.csv')

# Display the first few rows to check if the data is loaded correctly
df.head()



Unnamed: 0,market_place,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,sentiments
0,"""US""","""42521656""","""R26MV8D0KG6QI6""","""B000SAQCWC""","""159713740""","""The Cravings Place Chocolate Chunk Cookie Mix...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Using these for years - love them.""","""As a family allergic to wheat, dairy, eggs, n...",2015-08-31,positive
1,"""US""","""12049833""","""R1OF8GP57AQ1A0""","""B00509LVIQ""","""138680402""","""Mauna Loa Macadamias, 11 Ounce Packages""","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Wonderful""","""My favorite nut. Creamy, crunchy, salty, and ...",2015-08-31,positive
2,"""US""","""107642""","""R3VDC1QB6MC4ZZ""","""B00KHXESLC""","""252021703""","""Organic Matcha Green Tea Powder - 100% Pure M...","""Grocery""",1,0,0,0 \t(N),0 \t(N),"""Five Stars""","""This green tea tastes so good! My girlfriend ...",2015-08-31,positive
3,"""US""","""6042304""","""R12FA3DCF8F9ER""","""B000F8JIIC""","""752728342""","""15oz Raspberry Lyons Designer Dessert Syrup S...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Five Stars""","""I love Melissa's brand but this is a great se...",2015-08-31,positive
4,"""US""","""18123821""","""RTWHVNV6X4CNJ""","""B004ZWR9RQ""","""552138758""","""Stride Spark Kinetic Fruit Sugar Free Gum, 14...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Five Stars""","""good""",2015-08-31,positive


## Step 2: Data Pre-processing





In [5]:
# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column as integers; LabelEncoder sorts classes alphabetically, so negative -> 0 and positive -> 1
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (400, 3466)
X_test shape: (100, 3466)
y_train shape: (400,)
y_test shape: (100,)
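
One caveat about the cell above: the vectorizer is fitted on every review before the split, so the vocabulary and IDF weights are influenced by the test reviews. A minimal sketch of the alternative, fitting the vectorizer on the training texts only, is shown below. The toy reviews and labels are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy reviews and labels (1 = positive, 0 = negative)
texts = [
    "love this tea, great taste",
    "wonderful cookies, will buy again",
    "great snack, very fresh",
    "tasty and crunchy, love it",
    "stale and bland, very disappointed",
    "terrible taste, waste of money",
    "arrived broken, bad packaging",
    "awful flavor, would not recommend",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first, so the test reviews stay unseen
train_texts, test_texts, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Learn the vocabulary and IDF weights from the training texts only
vectorizer = TfidfVectorizer(stop_words="english")
X_tr = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary on the test texts (no refitting)
X_te = vectorizer.transform(test_texts)

print("Train matrix:", X_tr.shape)
print("Test matrix:", X_te.shape)
```

Both matrices share the same columns because the test texts are mapped onto the vocabulary learned from the training texts; words that appear only in the test reviews are ignored.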


<a name="ex_1"></a>
## Exercise 1

- Use the train_test_split function and change the test_size to 0.3

This way, the training set (X and y) should hold 70% of the data and the testing set (X and y) should hold 30%.

In [6]:
#Write your code here
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (350, 3466)
X_test shape: (150, 3466)
y_train shape: (350,)
y_test shape: (150,)


## Step 3: Feature Selection

In this step, we'll perform feature selection to reduce the dimensionality of the TF-IDF vectorized data and potentially improve the model's performance. We'll score features with the chi-squared (chi2) test and keep the highest-scoring ones; mutual information is an alternative scoring function.

In [7]:
from sklearn.feature_selection import SelectKBest, chi2

# Apply feature selection using chi-squared (chi2) test
# You can adjust the number of features (k) as needed
k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)

X_train_selected shape: (350, 1000)
X_test_selected shape: (150, 1000)
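
To see what `SelectKBest(chi2, ...)` actually ranks, here is a small sketch on a made-up matrix with three non-negative features. Only the first feature differs between the two classes, so it should receive the only non-zero chi-squared score. The arrays `X_toy` and `y_toy` are illustrative assumptions, not part of the review data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: feature 0 is large only for class 1,
# feature 1 is constant, feature 2 is unrelated to the class
X_toy = np.array([
    [3, 1, 0],
    [4, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
], dtype=float)
y_toy = np.array([1, 1, 0, 0])

# Keep the single best feature according to the chi-squared score
selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

scores = selector.scores_
selected = selector.get_support(indices=True)

print("Chi-squared scores:", scores)
print("Selected feature indices:", selected)
```

The same `scores_` and `get_support` attributes are available on the `selector` fitted above, which makes it possible to inspect which TF-IDF terms were kept.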


<a name="ex_2"></a>
## Exercise 2

- Compare the X_train_selected shape and X_test_selected shape with the new test_size=0.3

In [8]:
#Write your code here
# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)

X_train_selected shape: (350, 1000)
X_test_selected shape: (150, 1000)


We have successfully performed feature selection, reducing the dimensionality of the data while retaining the most important features.


## Step 4: Model Selection
For sentiment analysis, you can use various machine learning algorithms like Logistic Regression, Naive Bayes, Support Vector Machines, or even deep learning models like LSTM or BERT. Since you're a beginner, let's start with a simple model like Logistic Regression.

In [9]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42)


<a name="ex_3"></a>
## Exercise 3

What does the random_state (parameter of the LogisticRegression) represent?

**Answer**: Write your answer here

The `random_state` parameter in the `LogisticRegression` model (and other models in scikit-learn) is used to control the randomness involved in the algorithm. It ensures that the results are reproducible.

Here's how it works:

- **Reproducibility**: By setting a specific integer value for `random_state`, you ensure that the same sequence of random numbers is generated each time you run the code. This means that the model will produce the same results every time you fit it with the same data and parameters.

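- **Random Processes**: In scikit-learn's `LogisticRegression`, `random_state` only matters for solvers that shuffle the data (`'sag'`, `'saga'`, and `'liblinear'`). With the default `'lbfgs'` solver the fit is deterministic, so the parameter has no effect. Data splitting and cross-validation shuffling are controlled by the `random_state` of `train_test_split` and of the cross-validation splitter, not by the model.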

In summary, setting `random_state` to a fixed integer allows you to reproduce your results, which is crucial for debugging and sharing your work with others.
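
As a quick check of how much `random_state` matters for the model used here, the sketch below fits two `LogisticRegression` models with different seeds on the same synthetic data. With the default `lbfgs` solver the optimization is deterministic, so the coefficients should match. The data from `make_classification` is an assumption for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration (not the review dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Same data, same hyperparameters, different seeds
model_a = LogisticRegression(random_state=1).fit(X_demo, y_demo)
model_b = LogisticRegression(random_state=999).fit(X_demo, y_demo)

# The default solver does not use the seed, so the fits agree
same_coefficients = np.allclose(model_a.coef_, model_b.coef_)
print("Identical coefficients:", same_coefficients)
```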

## Step 5: Training the Model

Now that we have initialized our Logistic Regression model, it's time to train it on the selected features from the training dataset.



In [10]:

# Train the Logistic Regression model on the selected features
model.fit(X_train_selected, y_train)

# We can now proceed to Step 6: Model Evaluation

## Step 6: Model Evaluation

In this step, we'll evaluate the performance of the trained Logistic Regression model using the testing data.

- We import necessary metrics from `sklearn.metrics` such as `accuracy_score`, `classification_report`, and `confusion_matrix`.
- We use the trained model to predict sentiment labels (`y_pred`) for the test data (`X_test_selected`).
- We calculate the accuracy of the model by comparing the predicted labels to the true labels.
- We display a classification report that includes precision, recall, F1-score, and support for both positive and negative sentiment classes.
- We display a confusion matrix to visualize the true positive, true negative, false positive, and false negative predictions.



In [12]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict sentiment labels for the test data
y_pred = model.predict(X_test_selected)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display a classification report with zero_division parameter set to handle undefined metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=0))

# Display a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8466666666666667

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        23
           1       0.85      1.00      0.92       127

    accuracy                           0.85       150
   macro avg       0.42      0.50      0.46       150
weighted avg       0.72      0.85      0.78       150


Confusion Matrix:
[[  0  23]
 [  0 127]]
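
The confusion matrix above has no predictions in the first column, which means the model labels every test review as class 1. The sketch below rebuilds that situation from the counts in the report (23 negative and 127 positive reviews) and shows that an "always positive" rule reaches the same accuracy, while balanced accuracy exposes the problem. The arrays are reconstructed from the reported counts, not from the actual predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Labels reconstructed from the support column: 23 negatives, 127 positives
y_true = np.array([0] * 23 + [1] * 127)

# A baseline that predicts the positive class for every review
y_all_positive = np.ones_like(y_true)

acc = accuracy_score(y_true, y_all_positive)
bal_acc = balanced_accuracy_score(y_true, y_all_positive)
recall_negative = recall_score(y_true, y_all_positive, pos_label=0)

print(f"Accuracy: {acc:.4f}")
print(f"Balanced accuracy: {bal_acc:.4f}")
print(f"Recall for the negative class: {recall_negative:.4f}")
```

Balanced accuracy averages the recall of both classes, so it is a more informative headline number than plain accuracy when one class dominates.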


<a name="ex_4"></a>
## Exercise 4

- Compare the results from the new data split (70/30) with the results from the original split (80/20).

In [14]:
# Write your code here
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Original split (80% train, 20% test)
# NOTE: placeholder values, not measured results; replace them with the metrics from the 80/20 run
original_accuracy = 0.86  # Example value
original_precision = 0.85  # Example value
original_recall = 0.87  # Example value
original_f1_score = 0.86  # Example value

# New split (70% train, 30% test): computed from the current y_test and y_pred
new_accuracy = accuracy_score(y_test, y_pred)  # Use the calculated accuracy
new_precision = precision_score(y_test, y_pred, zero_division=0)  # Calculate precision
new_recall = recall_score(y_test, y_pred, zero_division=0)  # Calculate recall
new_f1_score = f1_score(y_test, y_pred, zero_division=0)  # Calculate F1-score

# Print comparison
print("Comparison of Model Performance:")
print(f"Original Split Accuracy: {original_accuracy}, New Split Accuracy: {new_accuracy}")
print(f"Original Split Precision: {original_precision}, New Split Precision: {new_precision}")
print(f"Original Split Recall: {original_recall}, New Split Recall: {new_recall}")
print(f"Original Split F1 Score: {original_f1_score}, New Split F1 Score: {new_f1_score}")

Comparison of Model Performance:
Original Split Accuracy: 0.86, New Split Accuracy: 0.8466666666666667
Original Split Precision: 0.85, New Split Precision: 0.8466666666666667
Original Split Recall: 0.87, New Split Recall: 1.0
Original Split F1 Score: 0.86, New Split F1 Score: 0.9169675090252708


<a name="ex_5"></a>
## Exercise 5

Do different training and testing sizes impact the model's learning and response to new data?

**Answer**: Write your answer here

Based on the results, we can analyze how the different training and testing sizes affect the model's learning and its response to new data. Note that the 80/20 figures in the code above are example values rather than measured results, so the comparison is illustrative.

1. **Accuracy**: Accuracy goes from 0.86 to approximately 0.847 with the new split. The confusion matrix shows why 0.847 is not a sign of a good classifier: the model predicts the positive class for all 150 test reviews, so its accuracy equals the share of positive reviews in the test set (127/150).

2. **Precision**: Precision is also approximately 0.847, for the same reason. Every prediction is positive, so the proportion of correct positive predictions is again 127/150.

3. **Recall**: Recall rises from 0.87 to 1.0. The model finds every positive review only because it never predicts the negative class; recall for the negative class is 0.

4. **F1 Score**: The F1 score rises from 0.86 to approximately 0.917. Because it is computed for the positive class only, it rewards the perfect recall and hides the 23 misclassified negative reviews.

### Conclusion:

- **Impact on Learning**: Changing the split changes the metrics, but here the higher recall and F1 score do not mean that the model learned more. They reflect a model that has collapsed to predicting the majority class.


- **Generalization**: The model is underfitting the negative class rather than overfitting. With 350 training reviews, most of them positive, it likely does not see enough negative examples to ever predict that class. Per-class metrics, or the macro-averaged F1 score of 0.46 from the classification report, describe this better than accuracy.


- **Trade-offs**: The results highlight the trade-offs between having more data for training versus testing. A larger test set can provide a more robust evaluation of the model's performance but may reduce the model's learning capacity if the training set becomes too small.

Overall, the choice of training and testing sizes should be guided by the specific goals of the analysis and the characteristics of the dataset.

## Step 7: Hyperparameter Tuning

In this step, we'll perform hyperparameter tuning to optimize the Logistic Regression model's performance. We can search for the best hyperparameters using techniques like Grid Search or Random Search.

- We import `GridSearchCV` from `sklearn.model_selection`.
- We define a grid of hyperparameters to search, including `C` (the inverse of the regularization strength; smaller values mean stronger regularization) and `max_iter` (maximum number of solver iterations).
- We initialize Grid Search with cross-validation (5-fold) to find the best hyperparameters.
- The best hyperparameters are extracted using `grid_search.best_params_`.
- We fit the tuned model with the best hyperparameters to the training data.
- Finally, we evaluate the tuned model's accuracy on the test data.

In [15]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters to search
param_grid = {
    'C': [0.1, 1, 10, 100],  # Inverse regularization strength (smaller = stronger regularization)
    'max_iter': [100, 200, 300]  # Maximum number of iterations
}

# Initialize Grid Search with cross-validation (5-fold)
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit the Grid Search to the data
grid_search.fit(X_train_selected, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Evaluate the model with the best hyperparameters
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_selected)

# Calculate the accuracy of the tuned model
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print("Tuned Model Accuracy:", accuracy_tuned)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Hyperparameters: {'C': 100, 'max_iter': 100}
Tuned Model Accuracy: 0.84
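
`GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers, unless a `scoring` argument is passed. On imbalanced data that can favor models that ignore the minority class. The sketch below is one possible variation, run on synthetic imbalanced data (an assumption, not the review dataset), that selects `C` by balanced accuracy instead. `max_iter` is fixed here because it only caps the number of solver iterations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data with roughly 20% / 80% class balance (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, weights=[0.2, 0.8], random_state=0
)

# Search over the inverse regularization strength only
param_grid = {"C": [0.1, 1, 10]}

# Rank candidates by balanced accuracy rather than plain accuracy
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="balanced_accuracy",
)
search.fit(X_demo, y_demo)

print("Best C:", search.best_params_["C"])
print(f"Best mean balanced accuracy: {search.best_score_:.3f}")
```

Other built-in scorer names such as `"f1_macro"` can be used in the same way when both classes matter equally.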


<a name="ex_6"></a>
## Exercise 6

- What is GridSearchCV used for?
- What are hyperparameters?
- Does the model give better results after hyperparameter tuning?

**Answer**: Write your answer here

   GridSearchCV is a tool in scikit-learn used for hyperparameter tuning. It systematically works through multiple combinations of parameter values, cross-validating as it goes to determine which combination provides the best performance. The "grid" in GridSearchCV refers to the grid of parameters that you define, and the "CV" stands for cross-validation, which is used to ensure that the model's performance is robust and not just a result of overfitting to a particular dataset.

   Hyperparameters are the parameters of a machine learning model that are set before the learning process begins. They are not learned from the data but are instead specified by the user. Examples of hyperparameters include the learning rate in gradient descent, the number of trees in a random forest, or the regularization strength in logistic regression. Hyperparameters can significantly affect the performance of a model, and finding the optimal set of hyperparameters is crucial for building an effective model.

   Whether a model gives better results after hyperparameter tuning depends on the specific dataset and the initial choice of hyperparameters. In many cases, hyperparameter tuning can lead to improved model performance by finding a more optimal configuration of parameters that better captures the underlying patterns in the data. However, it's also possible that the improvement might be marginal or that the model's performance could even degrade if the tuning process overfits the model to the training data. It's important to evaluate the tuned model on a separate validation or test set to ensure that the improvements are genuine and not just due to overfitting.

   In the context of this exercise, **hyperparameter tuning did not improve the model's accuracy: the tuned model scores 0.84 on the test set, slightly below the 0.847 of the untuned model. If other metrics like precision, recall, and F1-score also show little to no improvement, it suggests that the hyperparameter tuning did not lead to a better-performing model for this particular dataset and model configuration.**

   It's important to note that the effectiveness of hyperparameter tuning can vary depending on the dataset and the initial choice of hyperparameters. In some cases, the default parameters might already be close to optimal, or the dataset might not benefit significantly from tuning.

## Step 8: Cross Validation

We'll use cross-validation to estimate how well the model will perform on unseen data and check if the model's performance is consistent across different folds of the data.

- We import `cross_val_score` from `sklearn.model_selection`.
- We perform 5-fold cross-validation on the tuned model (`best_model`) using the training data (`X_train_selected` and `y_train`).
- We calculate the mean cross-validation accuracy to get a more robust estimate of the model's performance.

In [16]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation on the tuned model
cv_scores = cross_val_score(best_model, X_train_selected, y_train, cv=5)

# Calculate and display the mean cross-validation accuracy
mean_cv_accuracy = np.mean(cv_scores)
print("Mean Cross-Validation Accuracy:", mean_cv_accuracy)

Mean Cross-Validation Accuracy: 0.7942857142857143


<a name="ex_7"></a>
## Exercise 7

- What is Cross Validation used for?
- Compare the cross-validation score with the accuracy from the new training and testing split
- What do you conclude?

**Answer**: Write your answer here


Cross-validation is a technique used in machine learning to assess how the results of a statistical analysis will generalize to an independent dataset. It is primarily used for:

1. **Model Evaluation**: Cross-validation provides a more reliable estimate of a model's performance by using multiple subsets of the data for training and testing. This helps in understanding how the model will perform on unseen data.

2. **Detecting Overfitting**: By training and testing the model on different subsets of the data, cross-validation reveals whether the model is just memorizing the training data or is learning to generalize to new data.


3. **Hyperparameter Tuning**: Cross-validation is often used in conjunction with techniques like GridSearchCV to find the best hyperparameters for a model. It ensures that the chosen hyperparameters perform well across different subsets of the data.

4. **Model Selection**: It allows for the comparison of different models or algorithms to determine which one performs best on a given dataset.

### How It Works:

- **K-Fold Cross-Validation**: The most common form of cross-validation, where the dataset is divided into \( k \) equally sized folds. The model is trained on \( k-1 \) folds and tested on the remaining fold. This process is repeated \( k \) times, with each fold used exactly once as the test set.

- **Leave-One-Out Cross-Validation (LOOCV)**: A special case of k-fold cross-validation where \( k \) is equal to the number of data points. Each data point is used once as a test set while the rest are used for training.

By using cross-validation, you can obtain a more accurate and robust estimate of a model's performance, which is crucial for building reliable machine learning systems.
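
The k-fold procedure described above can be sketched directly with `KFold`. The example uses 20 placeholder sample indices (an assumption for illustration) and verifies that every sample lands in the test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 placeholder samples, identified by their index
n_samples = 20
indices = np.arange(n_samples)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how often each sample is used for testing
test_counts = np.zeros(n_samples, dtype=int)
fold_sizes = []

for fold, (train_idx, test_idx) in enumerate(kf.split(indices), start=1):
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"Fold {fold}: train on {len(train_idx)} samples, test on {len(test_idx)}")

print("Each sample tested exactly once:", bool((test_counts == 1).all()))
```

`cross_val_score` runs this same loop internally, fitting a fresh copy of the model on each training part and scoring it on the held-out fold.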



In [17]:
#Comparing the cross-validation score with the new train-test split score 
# to see if the model's performance is consistent.

from sklearn.model_selection import cross_val_score
import numpy as np

# Perform 5-fold cross-validation on the model
cv_scores = cross_val_score(best_model, X_train_selected, y_train, cv=5)

# Calculate and display the mean cross-validation accuracy
mean_cv_accuracy = np.mean(cv_scores)
print("Mean Cross-Validation Accuracy:", mean_cv_accuracy)


# Accuracy on the 70/30 test split (y_test and y_pred come from the untuned model in Step 6)
new_accuracy = accuracy_score(y_test, y_pred)  # Use the calculated accuracy

# Print comparison
print("Comparison of Model Performance:")
print(f"Mean Cross-Validation Accuracy: {mean_cv_accuracy}")
print(f"New Split Accuracy: {new_accuracy}")


Mean Cross-Validation Accuracy: 0.7942857142857143
Comparison of Model Performance:
Mean Cross-Validation Accuracy: 0.7942857142857143
New Split Accuracy: 0.8466666666666667


Based on the results:
- **Mean Cross-Validation Accuracy**: 0.794
- **New Split Accuracy (70% Train, 30% Test)**: 0.847

1. **Higher New Split Accuracy**: The accuracy from the new train-test split (0.847) is higher than the mean cross-validation accuracy (0.794). The model scores better on this particular test set than it does on average across the cross-validation folds. Note that the two numbers also come from different models: the split accuracy uses the untuned model's `y_pred`, while cross-validation uses `best_model`.

2. **Sampling Variation, Not Test-Set Overfitting**: The model is never fitted on the test set, so it cannot overfit to it. A more likely explanation is that this test split happens to be easier: 127 of its 150 reviews (84.7%) are positive, and a model that predicts positive almost everywhere scores close to the positive share of whatever data it is evaluated on.

3. **Model Generalization**: Cross-validation averages over five held-out folds, so it gives a more stable estimate of generalization than a single split. The lower cross-validation accuracy suggests that performance on other unseen data may be lower than on this specific test set.

4. **Actionable Insights**:

   - **Further Validation**: Consider using additional validation techniques or datasets to confirm the model's performance.

   - **Model Tuning**: Explore further hyperparameter tuning or model adjustments to improve generalization.
   
Overall, while the new split accuracy is higher, the cross-validation results provide a more reliable indication of how the model might perform in real-world scenarios. It's important to ensure that the model is not overly tailored to a specific subset of the data.

### Further validation examples

In [19]:
### 1. Use a Validation Set
from sklearn.model_selection import train_test_split

# Split data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
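
The cell above creates the three sets but does not use the validation set yet. One way it could be used is sketched below: pick `C` on the validation set, then report a single final score on the test set. The synthetic data and the candidate values for `C` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# 60% train, 20% validation, 20% test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_demo, y_demo, test_size=0.4, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Compare candidate values of C on the validation set only
candidates = [0.1, 1, 10]
val_scores = {}
for C in candidates:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    val_scores[C] = accuracy_score(y_va, clf.predict(X_va))

best_C = max(val_scores, key=val_scores.get)

# Touch the test set once, with the chosen setting
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_tr, y_tr)
test_accuracy = accuracy_score(y_te, final_model.predict(X_te))

print("Validation accuracy per C:", val_scores)
print(f"Chosen C: {best_C}, test accuracy: {test_accuracy:.3f}")
```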


In [20]:
### 2. Nested Cross-Validation

from sklearn.model_selection import GridSearchCV, cross_val_score

# Define parameter grid
param_grid = {'C': [0.1, 1, 10], 'max_iter': [100, 200]}

# Outer cross-validation loop
outer_cv_scores = cross_val_score(GridSearchCV(LogisticRegression(), param_grid, cv=5), X, y, cv=5)
print("Nested CV Score:", outer_cv_scores.mean())



Nested CV Score: 0.796


In [30]:

### 3. Stratified K-Fold Cross-Validation
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

# Initialize a list to store evaluation metrics
accuracy_list = []
precision_list = []
recall_list = []
f1_list = []

# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=5)

# Perform Stratified K-Fold Cross-Validation
for train_index, test_index in skf.split(X, y):
    # Split the data into training and testing sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Initialize the model
    model = LogisticRegression(random_state=42)
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    accuracy_list.append(accuracy_score(y_test, y_pred))
    precision_list.append(precision_score(y_test, y_pred, zero_division=0))
    recall_list.append(recall_score(y_test, y_pred, zero_division=0))
    f1_list.append(f1_score(y_test, y_pred, zero_division=0))

# Calculate the mean and standard deviation of the evaluation metrics
mean_accuracy = np.mean(accuracy_list)
std_accuracy = np.std(accuracy_list)
mean_precision = np.mean(precision_list)
std_precision = np.std(precision_list)
mean_recall = np.mean(recall_list)
std_recall = np.std(recall_list)
mean_f1 = np.mean(f1_list)
std_f1 = np.std(f1_list)

# Print the results
print(f"Stratified K-Fold Mean Accuracy: {mean_accuracy:.4f} ± {std_accuracy:.4f}")
print(f"Stratified K-Fold Mean Precision: {mean_precision:.4f} ± {std_precision:.4f}")
print(f"Stratified K-Fold Mean Recall: {mean_recall:.4f} ± {std_recall:.4f}")
print(f"Stratified K-Fold Mean F1 Score: {mean_f1:.4f} ± {std_f1:.4f}")

Stratified K-Fold Mean Accuracy: 0.7960 ± 0.0049
Stratified K-Fold Mean Precision: 0.7960 ± 0.0049
Stratified K-Fold Mean Recall: 1.0000 ± 0.0000
Stratified K-Fold Mean F1 Score: 0.8864 ± 0.0030


In [None]:
### 6. External Validation
# Assuming you have an external dataset
#X_external, y_external = load_external_data()
# Evaluate your model on this external dataset



In [29]:

### 7. Bootstrapping


### Bootstrapping with Model Training and Evaluation

from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

# Initialize a list to store evaluation metrics
accuracy_list = []
precision_list = []
recall_list = []
f1_list = []

# Number of bootstrap iterations
n_iterations = 100

# Perform bootstrapping
for i in range(n_iterations):
    # Resample the data with replacement
    X_resampled, y_resampled = resample(X, y, random_state=i)
    
    # Split the resampled data into training and testing sets
    # Caveat: duplicated rows can land in both sets, which makes the scores optimistic
    X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(
        X_resampled, y_resampled, test_size=0.3, random_state=42)
    
    # Initialize the model
    model = LogisticRegression(random_state=42)
    
    # Train the model
    model.fit(X_train_resampled, y_train_resampled)
    
    # Make predictions
    y_pred_resampled = model.predict(X_test_resampled)
    
    # Evaluate the model
    accuracy_list.append(accuracy_score(y_test_resampled, y_pred_resampled))
    precision_list.append(precision_score(y_test_resampled, y_pred_resampled, zero_division=0))
    recall_list.append(recall_score(y_test_resampled, y_pred_resampled, zero_division=0))
    f1_list.append(f1_score(y_test_resampled, y_pred_resampled, zero_division=0))

# Calculate the mean and standard deviation of the evaluation metrics
mean_accuracy = np.mean(accuracy_list)
std_accuracy = np.std(accuracy_list)
mean_precision = np.mean(precision_list)
std_precision = np.std(precision_list)
mean_recall = np.mean(recall_list)
std_recall = np.std(recall_list)
mean_f1 = np.mean(f1_list)
std_f1 = np.std(f1_list)

# Print the results
print(f"Bootstrap Mean Accuracy: {mean_accuracy:.4f} ± {std_accuracy:.4f}")
print(f"Bootstrap Mean Precision: {mean_precision:.4f} ± {std_precision:.4f}")
print(f"Bootstrap Mean Recall: {mean_recall:.4f} ± {std_recall:.4f}")
print(f"Bootstrap Mean F1 Score: {mean_f1:.4f} ± {std_f1:.4f}")




Bootstrap Mean Accuracy: 0.8110 ± 0.0333
Bootstrap Mean Precision: 0.8089 ± 0.0329
Bootstrap Mean Recall: 0.9997 ± 0.0019
Bootstrap Mean F1 Score: 0.8939 ± 0.0204
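
Because the bootstrap cell above splits the data after resampling, copies of the same review can end up in both the training and the testing part. A common alternative is to train on the bootstrap sample and evaluate on the rows that were not drawn (the out-of-bag rows). The sketch below shows that variant on synthetic data; the data and the number of iterations are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the feature matrix and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

rng = np.random.default_rng(42)
n = len(y_demo)
oob_scores = []
overlaps = []

for _ in range(20):
    # Draw n row indices with replacement (the bootstrap sample)
    boot_idx = rng.integers(0, n, size=n)

    # Rows never drawn form the out-of-bag evaluation set
    oob_idx = np.setdiff1d(np.arange(n), boot_idx)

    clf = LogisticRegression(max_iter=1000).fit(X_demo[boot_idx], y_demo[boot_idx])
    oob_scores.append(accuracy_score(y_demo[oob_idx], clf.predict(X_demo[oob_idx])))

    # Sanity check: training and evaluation rows never overlap
    overlaps.append(len(np.intersect1d(boot_idx, oob_idx)))

print(f"Out-of-bag accuracy: {np.mean(oob_scores):.3f} ± {np.std(oob_scores):.3f}")
```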


In [27]:

### 8. Ensemble Methods

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Define multiple models
model1 = LogisticRegression()
model2 = RandomForestClassifier()
model3 = SVC()

# Create an ensemble of models
ensemble = VotingClassifier(estimators=[('lr', model1), ('rf', model2), ('svc', model3)], voting='hard')
ensemble.fit(X_train, y_train)

In [28]:

### Example Code for Evaluation:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Evaluate the ensemble model
y_pred_ensemble = ensemble.predict(X_test)
ensemble_accuracy = accuracy_score(y_test, y_pred_ensemble)
ensemble_precision = precision_score(y_test, y_pred_ensemble, average='weighted')
ensemble_recall = recall_score(y_test, y_pred_ensemble, average='weighted')
ensemble_f1 = f1_score(y_test, y_pred_ensemble, average='weighted')

print("Ensemble Model Performance:")
print(f"Accuracy: {ensemble_accuracy}")
print(f"Precision: {ensemble_precision}")
print(f"Recall: {ensemble_recall}")
print(f"F1 Score: {ensemble_f1}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_ensemble))



Ensemble Model Performance:
Accuracy: 0.78
Precision: 0.6224242424242424
Recall: 0.78
F1 Score: 0.6923595505617978
Confusion Matrix:
[[ 0 21]
 [ 1 78]]


In [26]:

### 9. Sensitivity Analysis

from sklearn.inspection import permutation_importance

# Train your model
model = LogisticRegression().fit(X_train, y_train)

# Perform sensitivity analysis
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
print("Feature importances:", result.importances_mean)


Feature importances: [0. 0. 0. ... 0. 0. 0.]


To compare the different validation techniques (excluding external validation), we need to consider how each method evaluates the model's performance and what insights it provides. Leave-one-out and time series cross-validation were not run above and are included for reference. Here's a comparison of the techniques and an explanation of their results:

### Validation Techniques Compared:

1. **Validation Set Approach**:
   - **Description**: Splits the data into training, validation, and test sets. The validation set is used for tuning and model selection, while the test set is reserved for final evaluation.
   - **Pros**: Simple to implement and understand.
   - **Cons**: The model's performance can be sensitive to how the data is split, and it may not use all available data for training.

2. **Nested Cross-Validation**:
   - **Description**: Involves an inner loop for hyperparameter tuning and an outer loop for model evaluation. Provides an unbiased estimate of model performance.
   - **Pros**: Offers a robust evaluation by separating hyperparameter tuning from model evaluation.
   - **Cons**: Computationally expensive due to multiple rounds of cross-validation.

3. **Stratified K-Fold Cross-Validation**:
   - **Description**: Divides the data into \( k \) folds, ensuring each fold has the same class distribution as the entire dataset.
   - **Pros**: Reduces bias and variance in performance estimates, especially useful for imbalanced datasets.
   - **Cons**: More computationally intensive than a simple train-test split.

4. **Leave-One-Out Cross-Validation (LOOCV)**:
   - **Description**: Uses each data point once as a test set while the rest are used for training.
   - **Pros**: Uses the maximum amount of data for training, providing a thorough evaluation.
   - **Cons**: Very computationally expensive, especially for large datasets.

5. **Time Series Cross-Validation**:
   - **Description**: Respects the temporal order of data, using past data to predict future data.
   - **Pros**: Suitable for time-dependent data, maintaining the sequence of events.
   - **Cons**: Not applicable to non-time series data.

6. **Bootstrapping**:
   - **Description**: Resamples the dataset with replacement to create multiple training sets.
   - **Pros**: Provides estimates of model performance variability and robustness.
   - **Cons**: Can be computationally intensive and may not always reflect the original data distribution.
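The validation set approach in item 1 can be sketched as below. This is a minimal illustration rather than the notebook's own pipeline: the data is a synthetic stand-in from `make_classification`, and the names `X_demo`, `y_demo`, `X_rest`, `X_tr`, `X_val_demo`, and `X_test_demo`, along with the candidate `C` values, are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the review features (assumption, not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.2, 0.8], random_state=42)

# Hold out 20% as the final test set, then split the rest into train and validation
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)
X_tr, X_val_demo, y_tr, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)  # 0.25 of 80% = 20%

# Use the validation set only for choosing the regularization strength C
best_C, best_val_acc = None, -1.0
for C in [0.1, 1, 10]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    val_acc = clf.score(X_val_demo, y_val_demo)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Refit on train + validation and touch the test set exactly once
final_clf = LogisticRegression(C=best_C, max_iter=1000).fit(X_rest, y_rest)
test_acc = final_clf.score(X_test_demo, y_test_demo)
print(f"Chosen C: {best_C}, validation accuracy: {best_val_acc:.4f}, test accuracy: {test_acc:.4f}")
```

Keeping the test set out of the tuning loop is what lets the final score serve as an estimate for unseen data.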

### Comparison and Explanation:

- **Robustness and Generalization**: Nested cross-validation and stratified k-fold cross-validation provide robust estimates of model performance by using multiple data splits. They are particularly useful for ensuring that the model generalizes well to unseen data.

- **Computational Cost**: LOOCV and nested cross-validation are the most computationally expensive methods. They are thorough but may not be practical for very large datasets.

- **Handling Imbalanced Data**: Stratified k-fold cross-validation is particularly effective for imbalanced datasets, ensuring that each fold is representative of the overall class distribution.

- **Time-Dependent Data**: Time series cross-validation is essential for datasets where the temporal order matters, such as stock prices or weather data.

- **Variability and Stability**: Bootstrapping provides insights into the variability and stability of model performance across different samples, which can be valuable for understanding model reliability.

### Conclusion:

Each validation technique has its strengths and weaknesses, and the choice of method should be guided by the specific characteristics of the dataset and the goals of the analysis. For general purposes, stratified k-fold cross-validation is often a good balance between robustness and computational efficiency. However, for time series data, time series cross-validation is more appropriate, and for hyperparameter tuning, nested cross-validation provides a more unbiased evaluation.

In [31]:
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, LeaveOneOut, TimeSeriesSplit, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.utils import resample

# Assuming X and y are your features and labels
# X, y = load_your_data()

# Initialize a dictionary to store results
results = {}

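# 1. Validation Set Approach
# Note: this is a single 80/20 train/validation split with no separate test set,
# and it overwrites the X_train and y_train variables defined earlier.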
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
results['Validation Set'] = {
    'Accuracy': accuracy_score(y_val, y_pred),
    'Precision': precision_score(y_val, y_pred, zero_division=0),
    'Recall': recall_score(y_val, y_pred, zero_division=0),
    'F1 Score': f1_score(y_val, y_pred, zero_division=0)
}

# 2. Nested Cross-Validation
param_grid = {'C': [0.1, 1, 10], 'max_iter': [100, 200]}
nested_cv = cross_val_score(GridSearchCV(LogisticRegression(), param_grid, cv=5), X, y, cv=5)
results['Nested CV'] = {
    'Mean Accuracy': nested_cv.mean(),
    'Std Dev': nested_cv.std()
}

# 3. Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5)
skf_scores = cross_val_score(LogisticRegression(random_state=42), X, y, cv=skf)
results['Stratified K-Fold'] = {
    'Mean Accuracy': skf_scores.mean(),
    'Std Dev': skf_scores.std()
}

# 4. Leave-One-Out Cross-Validation (LOOCV)
loo = LeaveOneOut()
loo_scores = cross_val_score(LogisticRegression(random_state=42), X, y, cv=loo)
results['LOOCV'] = {
    'Mean Accuracy': loo_scores.mean(),
    'Std Dev': loo_scores.std()
}

# 5. Time Series Cross-Validation
# Assuming X is sorted by time
tscv = TimeSeriesSplit(n_splits=5)
tscv_scores = cross_val_score(LogisticRegression(random_state=42), X, y, cv=tscv)
results['Time Series CV'] = {
    'Mean Accuracy': tscv_scores.mean(),
    'Std Dev': tscv_scores.std()
}

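# 6. Bootstrapping
# Note: the train/test split happens after resampling with replacement, so copies
# of the same row can end up in both the train and test portions of an iteration.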
n_iterations = 100
bootstrap_accuracies = []
for i in range(n_iterations):
    X_resampled, y_resampled = resample(X, y, random_state=i)
    X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(
        X_resampled, y_resampled, test_size=0.3, random_state=42)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred_resampled = model.predict(X_test_resampled)
    bootstrap_accuracies.append(accuracy_score(y_test_resampled, y_pred_resampled))
results['Bootstrapping'] = {
    'Mean Accuracy': np.mean(bootstrap_accuracies),
    'Std Dev': np.std(bootstrap_accuracies)
}

# Print results
for method, metrics in results.items():
    print(f"{method} Results:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")
    print()

# Interpretation
print("Interpretation:")
print("The results show the mean accuracy and standard deviation for each validation technique.")
print("Higher mean accuracy and lower standard deviation indicate better and more stable model performance.")
print("Nested CV and Stratified K-Fold are generally more robust, while LOOCV provides a thorough evaluation but is computationally expensive.")
print("Bootstrapping provides insights into model variability, and Time Series CV is essential for time-dependent data.")

Validation Set Results:
  Accuracy: 0.8600
  Precision: 0.8600
  Recall: 1.0000
  F1 Score: 0.9247

Nested CV Results:
  Mean Accuracy: 0.7960
  Std Dev: 0.0049

Stratified K-Fold Results:
  Mean Accuracy: 0.7960
  Std Dev: 0.0049

LOOCV Results:
  Mean Accuracy: 0.7960
  Std Dev: 0.4030

Time Series CV Results:
  Mean Accuracy: 0.7880
  Std Dev: 0.0467

Bootstrapping Results:
  Mean Accuracy: 0.8110
  Std Dev: 0.0333

Interpretation:
The results show the mean accuracy and standard deviation for each validation technique.
Higher mean accuracy and lower standard deviation indicate better and more stable model performance.
Nested CV and Stratified K-Fold are generally more robust, while LOOCV provides a thorough evaluation but is computationally expensive.
Bootstrapping provides insights into model variability, and Time Series CV is essential for time-dependent data.


| Validation Technique     | Accuracy | Precision | Recall | F1 Score | Mean Accuracy | Std Dev |
|--------------------------|----------|-----------|--------|----------|---------------|---------|
| Validation Set           | 0.8600   | 0.8600    | 1.0000 | 0.9247   | N/A           | N/A     |

### Result comparison

| Validation Technique     | Mean Accuracy | Std Dev |
|--------------------------|---------------|---------|
| Nested CV                | 0.7960        | 0.0049  |
| Stratified K-Fold        | 0.7960        | 0.0049  |
| LOOCV                    | 0.7960        | 0.4030  |
| Time Series CV           | 0.7880        | 0.0467  |
| Bootstrapping            | 0.8110        | 0.0333  |

### Explanation:
- **Validation Set**: Provides specific metrics like accuracy, precision, recall, and F1 score.
- **Nested CV, Stratified K-Fold, LOOCV, Time Series CV, Bootstrapping**: These techniques provide mean accuracy and standard deviation, which are useful for understanding the model's performance stability across different data splits or resamples.

These tables allow for a quick comparison of the different validation techniques and their results.

### Conclusion
The results show the mean accuracy and standard deviation for each validation technique.

Higher mean accuracy and lower standard deviation indicate better and more stable model performance.

Nested CV and Stratified K-Fold have the lowest standard deviations (0.0049), so their per-fold scores are the most consistent across data splits, and they are generally the more robust choices. LOOCV has a much higher standard deviation (0.4030), despite providing a thorough evaluation that is computationally expensive. This is most likely due to the size of the test set in each iteration: every LOOCV fold contains a single sample, so each fold's accuracy is either 0 or 1, and the spread of those scores is large by construction rather than a sign of an unstable model.
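The LOOCV figure can be checked with a little arithmetic. When every fold score is 0 or 1, the standard deviation of the scores depends only on the mean accuracy p and equals sqrt(p(1 - p)). The sketch below rebuilds a set of 0/1 scores whose mean is 0.7960. The sample count of 500 is a hypothetical value chosen for illustration, not something stated in the notebook, and `loo_like_scores` is a made-up name.

```python
import numpy as np

# Hypothetical sample count (assumption); any size with the same mean gives the same result
n_samples = 500
n_correct = 398  # 398 / 500 = 0.796, matching the reported LOOCV mean accuracy

# Each LOOCV fold tests one sample, so its accuracy is 1.0 (correct) or 0.0 (wrong)
loo_like_scores = np.array([1.0] * n_correct + [0.0] * (n_samples - n_correct))

p = loo_like_scores.mean()
print(f"Mean accuracy:      {p:.4f}")                       # → 0.7960
print(f"Std of fold scores: {loo_like_scores.std():.4f}")   # → 0.4030
print(f"sqrt(p * (1 - p)):  {np.sqrt(p * (1 - p)):.4f}")    # → 0.4030
```

The same formula gives 0.2109 for a mean accuracy of 143/150 (about 0.9533), so a large LOOCV standard deviation mostly restates the accuracy itself.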


- **Highest Estimate**: **Bootstrapping** reports the highest mean accuracy (0.8110). Every technique here evaluates the same logistic regression model, so a higher number is a more optimistic estimate, not a better model. Because the code splits the data after resampling with replacement, copies of the same review can land in both the training and test portions, which may inflate this estimate.

- **Consistency**: **Nested CV** and **Stratified K-Fold** offer the most consistent results, which is crucial for ensuring that the model's performance is reliable across different data samples.

- **Considerations**: The choice of the best technique also depends on the specific context and requirements. For example, if computational resources are limited, the computational cost of LOOCV might be prohibitive despite its thoroughness.

In summary, Bootstrapping gives the highest accuracy estimate, but that estimate may be optimistic. If you value consistency and robustness, Nested CV or Stratified K-Fold would be preferable.
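One common way to keep duplicated rows out of the test portion of a bootstrap estimate is to train on the bootstrap sample and evaluate on the out-of-bag rows, meaning the rows that were not drawn. The following is a minimal sketch under stated assumptions: the data is synthetic (`make_classification`), 50 iterations is an arbitrary choice, and the names `X_demo`, `y_demo`, and `oob_accuracies` are hypothetical. It is not the procedure used in the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the review features (assumption, not the notebook's data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

rng = np.random.default_rng(42)
n = len(y_demo)
oob_accuracies = []

for _ in range(50):
    # Draw n row indices with replacement: this is the bootstrap training sample
    boot_idx = rng.integers(0, n, size=n)

    # Out-of-bag rows are those never drawn, so they share no rows with training
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False

    # Skip degenerate draws (no out-of-bag rows, or only one class in the sample)
    if oob_mask.sum() == 0 or len(np.unique(y_demo[boot_idx])) < 2:
        continue

    clf = LogisticRegression(max_iter=1000).fit(X_demo[boot_idx], y_demo[boot_idx])
    oob_accuracies.append(accuracy_score(y_demo[oob_mask], clf.predict(X_demo[oob_mask])))

print(f"Out-of-bag mean accuracy: {np.mean(oob_accuracies):.4f}")
print(f"Out-of-bag std dev:       {np.std(oob_accuracies):.4f}")
```

With this setup the spread across iterations reflects variation in the training sample, without the same row appearing on both sides of the split.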


### Stress testing LOOCV to see if we can improve STDEV

One idea for stress testing Leave-One-Out Cross-Validation (LOOCV) is to vary the random seed used for splitting the data. However, LOOCV is deterministic: it uses each data point as the test set exactly once, so there are no random splits to vary. Instead, we can compare LOOCV with other cross-validation techniques suited to small samples and see which provides the lowest standard deviation.

The comparison below uses the Iris dataset as a small sample dataset, not the review data, and covers the following techniques:

- **LOOCV**: Uses each data point as a test set once, providing a thorough evaluation but can be computationally expensive.

- **K-Fold (k=5)**: Splits the data into 5 folds, providing a balance between thoroughness and computational efficiency.

- **ShuffleSplit**: Randomly splits the data into training and test sets multiple times, providing a robust estimate of model performance.

- **Stratified K-Fold**: Ensures each fold has a representative distribution of the target classes, useful for imbalanced datasets.
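For the splitters in this list that do rely on random splits, the effect of the seed can be examined by repeating the whole k-fold procedure several times with different shuffles. The sketch below uses scikit-learn's `RepeatedStratifiedKFold` on the Iris data for this. Three repeats is an arbitrary choice, the scaler is wrapped in a pipeline so it is refit inside each training fold, and the names `X_iris`, `y_iris`, `pipe`, and `per_repeat_means` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Scaling inside the pipeline means each fold is scaled using its training part only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 folds repeated 3 times, each repeat with a different shuffle of the data
n_splits, n_repeats = 5, 3
rskf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
scores = cross_val_score(pipe, X_iris, y_iris, cv=rskf)

# Scores are returned repeat by repeat, so each row below holds one repeat's folds
per_repeat_means = scores.reshape(n_repeats, n_splits).mean(axis=1)

print(f"Per-fold std dev across all {len(scores)} folds: {scores.std():.4f}")
print(f"Mean accuracy of each repeat: {np.round(per_repeat_means, 4)}")
print(f"Std dev of the repeat means:  {per_repeat_means.std():.4f}")
```

The spread of the repeat means shows how much the overall estimate moves with the seed, which is a different question from how much individual folds differ from each other.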


In [33]:
import numpy as np
from sklearn.model_selection import train_test_split, LeaveOneOut, KFold, ShuffleSplit, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Load a sample dataset (Iris); note that this overwrites the review X and y
X, y = load_iris(return_X_y=True)

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize a dictionary to store results
results = {}

# 1. Leave-One-Out Cross-Validation (LOOCV)
loo = LeaveOneOut()
loo_scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=42), X_scaled, y, cv=loo)
results['LOOCV'] = {
    'Mean Accuracy': loo_scores.mean(),
    'Std Dev': loo_scores.std()
}

# 2. K-Fold Cross-Validation with small k
kf = KFold(n_splits=5, random_state=42, shuffle=True)
kf_scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=42), X_scaled, y, cv=kf)
results['K-Fold (k=5)'] = {
    'Mean Accuracy': kf_scores.mean(),
    'Std Dev': kf_scores.std()
}

# 3. ShuffleSplit Cross-Validation
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
ss_scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=42), X_scaled, y, cv=ss)
results['ShuffleSplit'] = {
    'Mean Accuracy': ss_scores.mean(),
    'Std Dev': ss_scores.std()
}

# 4. Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
skf_scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=42), X_scaled, y, cv=skf)
results['Stratified K-Fold'] = {
    'Mean Accuracy': skf_scores.mean(),
    'Std Dev': skf_scores.std()
}

# Print results
for method, metrics in results.items():
    print(f"{method} Results:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")
    print()

# Interpretation
print("Interpretation:")
print("The results show the mean accuracy and standard deviation for each cross-validation technique.")
print("Lower standard deviation indicates more consistent performance across different splits.")

LOOCV Results:
  Mean Accuracy: 0.9533
  Std Dev: 0.2109

K-Fold (k=5) Results:
  Mean Accuracy: 0.9600
  Std Dev: 0.0249

ShuffleSplit Results:
  Mean Accuracy: 0.9600
  Std Dev: 0.0249

Stratified K-Fold Results:
  Mean Accuracy: 0.9533
  Std Dev: 0.0452

Interpretation:
The results show the mean accuracy and standard deviation for each cross-validation technique.
Lower standard deviation indicates more consistent performance across different splits.
