The steps below outline a methodical approach to sentiment classification on the AmazonReview.csv dataset. The dataset contains review text and a corresponding integer sentiment label; the classification report at the end shows five classes, 1 through 5.

In [1]:
# Load the dataset and inspect its structure:
import pandas as pd

# Load dataset
df = pd.read_csv('AmazonReview.csv')

# Check data
print(df.head())
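# Note: info() prints its summary directly and returns None,
# so the outer print() adds the trailing "None" to the output
print(df.info())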


                                              Review  Sentiment
0  Fast shipping but this product is very cheaply...          1
1  This case takes so long to ship and it's not e...          1
2  Good for not droids. Not good for iPhones. You...          1
3  The cable was not compatible between my macboo...          1
4  The case is nice but did not have a glow light...          1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Review     24999 non-null  object
 1   Sentiment  25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB
None
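
The info() summary above reports one missing Review value and an integer Sentiment column. A quick way to confirm both before preprocessing is to count missing values and label frequencies. The sketch below runs on a small made-up frame (`toy` is hypothetical and only stands in for `df`):

```python
import pandas as pd

# Hypothetical stand-in for df; the real data comes from AmazonReview.csv
toy = pd.DataFrame({
    "Review": ["Great case", None, "Broke in a week", "Works fine"],
    "Sentiment": [5, 3, 1, 5],
})

# Count missing values per column
missing = toy.isna().sum()
print(missing)

# Count how many rows carry each label, ordered by label value
label_counts = toy["Sentiment"].value_counts().sort_index()
print(label_counts)
```

Applied to `df`, the same two calls would show whether the five classes are roughly balanced, which matters when reading the accuracy figure later on.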


In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

# Download necessary resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Handle cases where text is NaN (float)
    if isinstance(text, float):
        return ""
        
    # Tokenization
    tokens = word_tokenize(text.lower())
    
    # Remove stopwords and punctuation tokens
    # (contraction fragments such as 's and 'm are kept, as the output shows)
    tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation]
    
    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return " ".join(tokens)

# Fill missing values in the Review column with empty strings
df['Review'] = df['Review'].fillna("")

# Apply preprocessing to the review text
df['cleaned_review'] = df['Review'].apply(preprocess_text)

# Check the result
print(df.head())

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\venka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\venka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\venka\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


                                              Review  Sentiment  \
0  Fast shipping but this product is very cheaply...          1   
1  This case takes so long to ship and it's not e...          1   
2  Good for not droids. Not good for iPhones. You...          1   
3  The cable was not compatible between my macboo...          1   
4  The case is nice but did not have a glow light...          1   

                                      cleaned_review  
0  fast shipping product cheaply made brought gra...  
1         case take long ship 's even worth dont buy  
2  good droids good iphones use feature watch iph...  
3  cable compatible macbook iphone also connector...  
4  case nice glow light 'm disappointed product n...  


In [5]:
# Transform the text data into numerical vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=5000)

# Fit and transform the data
X = tfidf.fit_transform(df['cleaned_review']).toarray()

# Target variable
y = df['Sentiment']  # Integer sentiment labels (int64 per the info() output)
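
A note on `.toarray()`: `fit_transform` returns a SciPy sparse matrix, and converting 25,000 rows by up to 5,000 features to a dense float64 array takes roughly 1 GB. scikit-learn's `LogisticRegression` accepts sparse input, so the conversion is optional. A minimal sketch on made-up sentences (the `docs` list and the `tfidf_demo` name are hypothetical):

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned reviews, standing in for df['cleaned_review']
docs = [
    "fast shipping cheap product",
    "case nice glow light",
    "cable compatible macbook iphone",
]

tfidf_demo = TfidfVectorizer(max_features=5000)

# Keep the result sparse: only non-zero TF-IDF weights are stored
X_sparse = tfidf_demo.fit_transform(docs)

print(sparse.issparse(X_sparse))  # whether the matrix is sparse
print(X_sparse.shape)             # (number of documents, vocabulary size)
print(X_sparse.nnz)               # number of stored non-zero entries
```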


In [6]:
# Example using Logistic Regression:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression model
model = LogisticRegression()

# Fit model
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: 0.4708
              precision    recall  f1-score   support

           1       0.56      0.62      0.59      1021
           2       0.39      0.37      0.38      1000
           3       0.38      0.36      0.37       985
           4       0.42      0.39      0.40       989
           5       0.57      0.62      0.59      1005

    accuracy                           0.47      5000
   macro avg       0.46      0.47      0.47      5000
weighted avg       0.46      0.47      0.47      5000
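
The warning printed before the report is the solver hitting its iteration limit (the default `max_iter` for `LogisticRegression` is 100), and the message itself suggests raising `max_iter`. The sketch below does that and also wraps the vectorizer and classifier in a pipeline, so TF-IDF is fit on the training split only rather than on the full dataset. The toy texts and labels are hypothetical, and this is not a claim about how the accuracy above would change:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned reviews with two of the five labels
texts = [
    "cheaply made broke quickly", "terrible case waste money",
    "cable stopped working returned", "poor quality disappointed",
    "great product love case", "excellent cable works perfectly",
    "fast shipping nice quality", "perfect fit highly recommend",
]
labels = [1, 1, 1, 1, 5, 5, 5, 5]

# Split the raw text first so the vectorizer never sees the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Raise max_iter above the default of 100 to give the solver room to converge
clf = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_tr, y_tr)

preds = clf.predict(X_te)
print(preds)
```

With the real data, `texts` would be `df['cleaned_review']` and `labels` would be `df['Sentiment']`.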

