### 1. Reading Data Without Preprocessing

This is the simplest case: we load the dataset as-is, without modifying the text.

In [1]:
import pandas as pd

def read_data_without_preprocessing(file_path):
    # Load the dataset
    data = pd.read_csv(file_path)
    return data

# Usage
file_path = 'dataset.csv'
data = read_data_without_preprocessing(file_path)
print(data.head())

                                             comment sentiment
0  Oh my god, it just doesn't get any worse than ...  negative
1  If you're a layman interested in quantum theor...  negative
2  It's amazing that this no talent actor Chapa g...  negative
3  This must be one of the most overrated Spanish...  negative
4  Some critics have compared Chop Shop with the ...  positive


### 2. Basic Text Preprocessing

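This method converts all letters to lower case, removes digits, and replaces every run of non-word characters (punctuation, extra whitespace) with a single space, so that words are separated by exactly one space. Because apostrophes count as non-word characters, contractions are split: "doesn't" becomes "doesn t".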

In [2]:
import re

def basic_text_preprocessing(data):
    # Convert to lower case
    data['comment'] = data['comment'].str.lower()
    # Remove numbers and extra characters
    data['comment'] = data['comment'].apply(lambda x: re.sub(r'\d+', '', x))  # Remove digits
    data['comment'] = data['comment'].apply(lambda x: re.sub(r'\W+', ' ', x))  # Remove non-word characters
    return data

# Apply basic preprocessing
data_basic_preprocessed = basic_text_preprocessing(data.copy())
print(data_basic_preprocessed.head())

                                             comment sentiment
0  oh my god it just doesn t get any worse than t...  negative
1  if you re a layman interested in quantum theor...  negative
2  it s amazing that this no talent actor chapa g...  negative
3  this must be one of the most overrated spanish...  negative
4  some critics have compared chop shop with the ...  positive
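To see what the two regular expressions do to a single review, the sketch below applies the same three steps to one hand-written string. The helper name `clean_text` and the sample sentence are illustrative and not part of the dataset.

```python
import re

def clean_text(text):
    """Apply the same three steps as basic_text_preprocessing to one string."""
    text = text.lower()                # lower-case
    text = re.sub(r'\d+', '', text)    # drop digits
    text = re.sub(r'\W+', ' ', text)   # collapse non-word runs into one space
    return text

sample = "It's 2 GOOD!!"
cleaned = clean_text(sample)
print(repr(cleaned))  # 'it s good '
```

Note the split contraction and the trailing space left by the final punctuation; calling `.strip()` on the result would remove the latter if it matters downstream.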


### 3. High-level Text Preprocessing

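This method applies all steps from the basic preprocessing, then removes stopwords and applies stemming followed by lemmatization, all using `nltk`. Because the lemmatizer is applied to stems, many of which (e.g. "wors", "silli") are not dictionary words, it leaves most tokens unchanged, so the result is dominated by the stemmer. Other preprocessing techniques from online references can be explored as well.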

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
# Note: recent NLTK releases may additionally require nltk.download('punkt_tab')

def high_level_text_preprocessing(data):
    # Initialize NLTK tools
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    
    # Remove stopwords, then stem and lemmatize the words
    def preprocess_text(text):
        words = nltk.word_tokenize(text)
        filtered_words = [word for word in words if word not in stop_words]
        stemmed_words = [stemmer.stem(word) for word in filtered_words]
        lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]
        return ' '.join(lemmatized_words)
    
    data['comment'] = data['comment'].apply(preprocess_text)
    return data

# Apply high-level preprocessing
data_high_level_preprocessed = high_level_text_preprocessing(data_basic_preprocessed.copy())
print(data_high_level_preprocessed.head())

                                             comment sentiment
0  oh god get wors alway love silli littl sci fi ...  negative
1  layman interest quantum theori string theori r...  negative
2  amaz talent actor chapa got well known star ap...  negative
3  must one overr spanish film histori lack subtl...  negative
4  critic compar chop shop theatric releas citi g...  positive



### Reason for Using These Methods

Each preprocessing step is chosen based on common practices in NLP to improve the quality of input data for machine learning models:

- **Lowercasing**: Treats words like "Hello", "HELLO", and "hello" as the same token.
- **Removing numbers and special characters**: Numbers and special characters often carry little information for sentiment analysis and can introduce noise into the data.
- **Stopword removal**: Stopwords are common words that usually carry little meaning on their own; removing them shrinks the vocabulary and speeds up processing. Note that NLTK's English list includes negations such as "not" and "no", which can matter for sentiment.
- **Stemming and Lemmatization**: Stemming strips suffixes with heuristic rules, while lemmatization maps a word to its dictionary form; both help the model treat different forms of a word as the same term.

When implementing these preprocessing steps, it's important to remember that the effectiveness of each method can vary depending on the dataset and the specific task at hand. It's always a good idea to experiment with different preprocessing techniques and evaluate their impact on model performance.
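One case where this choice matters for sentiment analysis: NLTK's English stopword list contains negations such as "not" and "no", so stopword removal can erase the difference between a positive and a negative review. A minimal sketch of the effect, using a small hand-picked stopword subset (`TOY_STOPWORDS` and the `keep` parameter are illustrative, not the full NLTK list or an NLTK API):

```python
# A tiny, hand-picked subset of common English stopwords (illustrative only).
TOY_STOPWORDS = {"this", "is", "a", "the", "not", "no"}

def remove_stopwords(text, stop_words, keep=frozenset()):
    """Drop stopwords from a whitespace-tokenized string, except those in `keep`."""
    return " ".join(
        word for word in text.split()
        if word not in stop_words or word in keep
    )

positive = "this movie is good"
negative = "this movie is not good"

# With the plain list, both reviews collapse to the same text.
print(remove_stopwords(positive, TOY_STOPWORDS))   # movie good
print(remove_stopwords(negative, TOY_STOPWORDS))   # movie good

# Keeping negations preserves the distinction.
print(remove_stopwords(negative, TOY_STOPWORDS, keep={"not", "no"}))  # movie not good
```

Whether keeping negations actually helps on this dataset is an empirical question; it is one of the variations worth evaluating.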

To compare the modeling quality using different text vectorization methods along with Logistic Regression and K-Nearest Neighbors (KNN) algorithms, we'll proceed through several steps:

1. **High-level text preprocessing** of the dataset using the third method described previously.
2. **Vectorization**: We'll use two vectorization methods: Bag of Words (BoW) and Word2Vec. (The task mentions "vec2Word"; this is assumed to mean Word2Vec, a common word embedding technique.)
3. **Model Training and Hyperparameter Tuning**: For Logistic Regression and K-Nearest Neighbors (KNN), we'll adjust hyperparameters to find the best performing model.
4. **Evaluation and Reporting**: We'll evaluate the models using appropriate metrics (like accuracy, precision, recall, and F1-score) and report the best settings.

### Preparing the Environment

First, ensure you have all necessary libraries installed and imported. We'll need `scikit-learn` (imported as `sklearn`), `gensim` for Word2Vec, `nltk` for text preprocessing, `numpy` for array operations, and `pandas` for data manipulation.

### High-level Text Preprocessing

We'll preprocess the dataset using the high-level text preprocessing method as described previously. This step is assumed to be completed, and the preprocessed dataset is ready for vectorization.

### Vectorization Methods Implementation

#### Bag of Words (BoW)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

def vectorize_bow(texts):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    return X, vectorizer
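For intuition about what the count matrix contains, here is a standard-library sketch of the same idea on a three-document toy corpus. It is a simplification: `CountVectorizer` additionally lower-cases text, by default ignores single-character tokens, and returns a sparse matrix rather than nested lists. The corpus and the helper name `bag_of_words` are made up for illustration.

```python
from collections import Counter

def bag_of_words(texts):
    """Return (vocabulary, count_rows) for whitespace-tokenized texts."""
    tokenized = [text.split() for text in texts]
    # A sorted vocabulary gives every word a fixed column index.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

toy_corpus = ["good movi", "bad movi", "good good plot"]
vocabulary, rows = bag_of_words(toy_corpus)
print(vocabulary)  # ['bad', 'good', 'movi', 'plot']
for row in rows:
    print(row)
# [0, 1, 1, 0]
# [1, 0, 1, 0]
# [0, 2, 0, 1]
```

Each row is one document and each column one vocabulary word, so word order within a document is lost.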

#### Word2Vec

For Word2Vec, we first need to train a model on our corpus or use a pre-trained model. Given the constraint of not using external data, we'll train a simple Word2Vec model on the preprocessed comments.

In [7]:
from gensim.models import Word2Vec
import numpy as np

def train_word2vec(sentences):
    # Train a Word2Vec model
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    return model

def vectorize_word2vec(model, sentences):
    # Vectorize sentences using the average of word vectors
    vectorized = []
    for sentence in sentences:
        vectors = [model.wv[word] for word in sentence if word in model.wv]
        if vectors:
            vectorized.append(np.mean(vectors, axis=0))
        else:
            vectorized.append(np.zeros(model.vector_size))
    return np.array(vectorized)
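The averaging step can be checked by hand on a toy example. The sketch below replaces the trained model with a plain dictionary of made-up 3-dimensional vectors (`toy_vectors` is hypothetical), so it runs without `gensim`; the logic mirrors `vectorize_word2vec`, including the zero-vector fallback for sentences with no known words.

```python
import numpy as np

# Made-up 3-dimensional "embeddings" standing in for model.wv.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movi": np.array([3.0, 2.0, 0.0]),
}
VECTOR_SIZE = 3

def average_vectors(tokens, vectors, size):
    """Mean of the known tokens' vectors, or zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if known:
        return np.mean(known, axis=0)
    return np.zeros(size)

known_avg = average_vectors(["good", "movi", "unknownword"], toy_vectors, VECTOR_SIZE)
empty_avg = average_vectors(["unknownword"], toy_vectors, VECTOR_SIZE)
print(known_avg)  # [2. 1. 1.]
print(empty_avg)  # [0. 0. 0.]
```

Averaging discards word order, so two sentences with the same words in a different order map to the same vector.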

### Model Training and Hyperparameter Tuning

We'll split the dataset into training and test sets, then train and tune both Logistic Regression and KNN models for each vectorization method.

#### Splitting the Dataset

In [8]:
from sklearn.model_selection import train_test_split

# Assuming `data_high_level_preprocessed` is the dataset after high-level preprocessing
X = data_high_level_preprocessed['comment']
y = data_high_level_preprocessed['sentiment']

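# For BoW
# Note: the vectorizer is fit on the full dataset before the split, so
# test-set vocabulary influences the features (a mild form of leakage).
X_bow, vectorizer_bow = vectorize_bow(X)
# Splitting for BoW
X_train_bow, X_test_bow, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)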

# For Word2Vec
# Tokenize the sentences for Word2Vec
sentences = [nltk.word_tokenize(comment) for comment in X]
w2v_model = train_word2vec(sentences)
X_w2v = vectorize_word2vec(w2v_model, sentences)
# Splitting for Word2Vec
X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(X_w2v, y, test_size=0.2, random_state=42)

#### Training and Hyperparameter Tuning

Due to the complexity and the computational cost of hyperparameter tuning, I'll outline the approach for one model and vectorization method combination. You can replicate this approach for the other combinations.

##### Logistic Regression with BoW

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

# Initialize the classifier
log_reg = LogisticRegression(max_iter=1000)

# Initialize GridSearchCV
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')

# Fit on the training data
grid_search.fit(X_train_bow, y_train)

# Best parameters and score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters: {best_params}")
print(f"Best score: {best_score}")

Best parameters: {'C': 0.1}
Best score: 0.8846944444444444


##### K-Nearest Neighbors with BoW

You would follow a similar approach for KNN, adjusting parameters like `n_neighbors` and potentially the metric (e.g., Euclidean, Manhattan).
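A minimal sketch of such a KNN search, under a few assumptions: the eight-review corpus is invented so the block runs on its own, `cv=2` is used only because the toy data is tiny, and the vectorizer is wrapped in a `Pipeline` (a choice made here, not in the cells above) so that it is re-fit on the training part of every fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Invented, already-preprocessed toy reviews (stand-in for the real dataset).
toy_texts = [
    "great film love act", "wonder stori great cast",
    "love film wonder music", "great fun love stori",
    "bad film wast time", "terribl act bore plot",
    "bore stori bad cast", "wast time terribl film",
]
toy_labels = ["positive"] * 4 + ["negative"] * 4

knn_pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("knn", KNeighborsClassifier()),
])

# Step-prefixed names route each parameter to the right pipeline stage.
param_grid_knn = {
    "knn__n_neighbors": [1, 3],
    "knn__metric": ["euclidean", "manhattan"],
}

grid_knn_toy = GridSearchCV(knn_pipeline, param_grid_knn, cv=2, scoring="accuracy")
grid_knn_toy.fit(toy_texts, toy_labels)

print(grid_knn_toy.best_params_)
print(grid_knn_toy.predict(["great stori love cast"]))
```

On the real dataset the grid would use larger `n_neighbors` values and `cv=5`, as in the Logistic Regression search.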

### Evaluation and Reporting

After training and tuning both models with both vectorization methods, evaluate each model on the test set to determine the best performing combination. Use metrics such as accuracy, precision, recall, and F1-score for a comprehensive evaluation.
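As a reminder of what those metrics measure, the sketch below computes them from scratch for the `positive` class on a made-up set of ten predictions (the labels are invented for illustration). In practice, scikit-learn's `sklearn.metrics` module provides `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `classification_report` for the same purpose.

```python
def binary_metrics(y_true, y_pred, positive_label="positive"):
    """Accuracy, precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Invented labels: 5 positive and 5 negative reviews.
y_true = ["positive"] * 5 + ["negative"] * 5
y_pred = ["positive"] * 4 + ["negative"] * 1 + ["positive"] * 2 + ["negative"] * 3

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Here 4 of the 6 predicted positives are correct (precision 2/3) and 4 of the 5 true positives are found (recall 0.8), which illustrates why accuracy alone (0.7) can hide how the errors are distributed.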

### Final Steps

Repeat the process for each combination of models and vectorization techniques, adjusting hyperparameters as needed. Report the best models for Logistic Regression and KNN along with their hyperparameters and the vectorization method that led to the highest performance.

Because hyperparameter tuning and model training are computationally expensive, you might want to run these experiments on your local machine or a cloud computing platform to manage computational resources effectively.

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import pickle

# Assuming data_high_level_preprocessed is loaded and contains 'comment' and 'sentiment' columns
comments = data_high_level_preprocessed['comment']
sentiments = data_high_level_preprocessed['sentiment']

# Vectorize with BoW
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(comments)
y = sentiments

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)

# Logistic Regression with GridSearchCV
param_grid_lr = {'C': [0.01, 0.1, 1, 10, 100]}
grid_lr = GridSearchCV(LogisticRegression(max_iter=1000), param_grid_lr, cv=5, scoring='accuracy')
grid_lr.fit(X_train, y_train)

# KNN with GridSearchCV
param_grid_knn = {'n_neighbors': [3, 5, 7, 9]}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, scoring='accuracy')
grid_knn.fit(X_train, y_train)

Compare the two models by their cross-validation score and save the better one:

In [18]:
import os

# Define the directory
dir_path = 'mnt/data/'

# Create the directory if it does not already exist
os.makedirs(dir_path, exist_ok=True)

# Pick the model with the higher mean cross-validation accuracy
if grid_lr.best_score_ > grid_knn.best_score_:
    best_model = grid_lr.best_estimator_
    best_score = grid_lr.best_score_
    best_params = grid_lr.best_params_
    model_type = "Logistic Regression"
    save_path = os.path.join(dir_path, 'pkl_LR.pkl')
else:
    best_model = grid_knn.best_estimator_
    best_score = grid_knn.best_score_
    best_params = grid_knn.best_params_
    model_type = "KNN"
    save_path = os.path.join(dir_path, 'pkl_kNN.pkl')

# Save the best model
with open(save_path, 'wb') as file:
    pickle.dump(best_model, file)

print(f"The best model is {model_type} with a score of {best_score}.")
print(f"Best parameters: {best_params}")
print(f"Model saved to: {save_path}")

The best model is Logistic Regression with a score of 0.8846944444444444.
Best parameters: {'C': 0.1}
Model saved to: mnt/data/pkl_LR.pkl


### Implementation of the points section
For modeling with a Multilayer Perceptron (MLP) algorithm using scikit-learn's `MLPClassifier`, let's proceed through the necessary steps: preprocessing, vectorization, model training, and saving the best model. We'll explore different vectorization techniques and preprocessing methods to optimize performance.

### Preprocessing

Given that there are no restrictions on data preprocessing, we will use high-level text preprocessing, including:
- **Lowercasing**: Convert all text to lower case to ensure uniformity.
- **Removing special characters and numbers**: Simplifies the text data.
- **Tokenization**: Splits text into individual words or tokens.
- **Removing stopwords**: Eliminates common words that may not contribute to the sentiment.
- **Lemmatization**: Reduces words to their base or dictionary form.

### Vectorization: TF-IDF

In addition to Bag of Words (BoW) and Word2Vec, we'll use TF-IDF (Term Frequency-Inverse Document Frequency) for vectorization. TF-IDF up-weights terms that are frequent within a document but rare across the corpus, so distinctive words count for more than ubiquitous ones.
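To make the weighting concrete, here is a small standard-library sketch on an invented three-document corpus. It uses the smoothed inverse document frequency that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and omits the L2 row normalization that `TfidfVectorizer` also performs, so the numbers are unnormalized weights. The helper name `tfidf_weights` is illustrative.

```python
import math
from collections import Counter

def tfidf_weights(tokenized_docs):
    """Unnormalized tf-idf per document, with smoothed idf."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    weights = []
    for doc in tokenized_docs:
        term_counts = Counter(doc)
        weights.append({term: count * idf[term] for term, count in term_counts.items()})
    return idf, weights

toy_docs = [
    ["good", "film"],
    ["bad", "film"],
    ["good", "good", "film", "plot"],
]
idf, weights = tfidf_weights(toy_docs)

print(round(idf["film"], 3))         # 1.0   (appears in every document)
print(round(idf["good"], 3))         # 1.288 (appears in two of three)
print(round(idf["plot"], 3))         # 1.693 (appears in one of three)
print(round(weights[2]["good"], 3))  # 2.575 (term count 2 times idf)
```

The word "film" occurs in every document and gets the minimum idf of 1, while "plot", which occurs in only one, gets the largest.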

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Limit to 1000 features to manage model complexity
X_tfidf = tfidf_vectorizer.fit_transform(comments).toarray()

### MLP Model Training

For the MLPClassifier, we need to define several parameters:
- **Number of layers and neurons**: A common starting point might be one hidden layer with a number of neurons between the size of the input layer and the output layer. However, for text data, more complex models might be necessary. We'll start with two hidden layers.
- **Activation function**: `relu` (Rectified Linear Unit) is a common choice for hidden layers due to its efficiency and simplicity.
- **Solver for weight optimization**: `adam` works well on large datasets with thousands of training samples.
- **Regularization (`alpha`)**: Helps prevent overfitting by penalizing large weights.

In [20]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Assuming X_tfidf and y are already defined
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

mlp = MLPClassifier(hidden_layer_sizes=(512, 256), activation='relu', solver='adam', alpha=0.0001, max_iter=500, random_state=42)
mlp.fit(X_train, y_train)

# Predict and evaluate
predictions = mlp.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

# Save the model as the "best" one. This is an assumption: confirm it by
# comparing all candidates on the same test set before relying on it.
import pickle
with open('mnt/data/pkl_best.pkl', 'wb') as file:
    pickle.dump(mlp, file)

print("Model saved as pkl_best.pkl")

Accuracy: 0.865
Model saved as pkl_best.pkl
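One practical caveat: the pickled `MLPClassifier` only accepts TF-IDF feature vectors, so the fitted vectorizer has to be saved as well if the model is to score raw text later. One way to do that, sketched here as an assumption rather than as part of the workflow above, is to put both steps in a scikit-learn `Pipeline` and pickle the pipeline. The toy reviews, the small network size, and the in-memory round trip with `pickle.dumps`/`pickle.loads` are chosen only to keep the example self-contained.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Invented, already-preprocessed toy reviews (stand-in for the real dataset).
toy_texts = [
    "great film love act", "wonder stori great cast",
    "bad film wast time", "terribl act bore plot",
]
toy_labels = ["positive", "positive", "negative", "negative"]

text_mlp = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=1000)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=42)),
])
text_mlp.fit(toy_texts, toy_labels)

# Serialize vectorizer and classifier together, then restore them.
restored = pickle.loads(pickle.dumps(text_mlp))

new_reviews = ["great cast love stori", "bore plot wast time"]
print(restored.predict(new_reviews))
```

The restored pipeline takes preprocessed strings directly, so no separate vectorizer file needs to be tracked. As with any pickle, only load files from a source you trust.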



### Explanation and References

- **TF-IDF Vectorization**: TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Reference: Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
- **MLPClassifier**: A Multilayer Perceptron (MLP) is a class of feedforward artificial neural network (ANN). An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Reference: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.