## Analysis of an E-commerce Dataset Part 3 (s2 2023)


In this Portfolio task, you will continue working with the dataset you used in Portfolio 2. The difference is that the ratings have been converted to like (score 1) and dislike (score 0). Your task is to train classification models such as KNN to predict whether a user likes or dislikes an item.


The header of the csv file is shown below.

| userId | timestamp | review | item | helpfulness | gender | category | item_id | item_price | user_city | rating |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
    
Your high-level goal in this notebook is to build and evaluate predictive models for 'rating' from the other available features, that is, to predict like (rating 1) or dislike (rating 0) from some of the other fields. More specifically, you need to complete the following major steps:
1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.
2) Convert object features into numeric features by using an encoder.
3) Study the correlation between these features.
4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K in an ad hoc manner in this step. Evaluate the accuracy of your model.
6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance.

Note 1: We did not provide any description of each step in the notebook. You should learn how to properly comment your notebook by yourself to make your notebook file readable.

Note 2: You are not being evaluated on the ___accuracy___ of the model but on the ___process___ that you use to generate it. Please use both a ___Logistic Regression model___ and a ___KNN model___ to solve this classification problem, and discuss the performance of the two methods.
    

In [6]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

In [7]:

# Load the dataset from a CSV file
data = pd.read_csv('portfolio_3.csv')


Step 1
Data Exploration and Cleaning
Check for missing values and remove abnormal instances

In [8]:
# Step 1: Data cleaning
data = data.dropna()  # Remove rows with missing values
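
Before dropping rows, it can help to see how many values are missing and whether any rows look abnormal. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from `portfolio_3.csv`). Treating ratings outside {0, 1} as abnormal is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "helpfulness": [4.0, np.nan, 3.0, 2.0],
    "gender": ["M", "F", None, "F"],
    "rating": [1, 0, 1, 5],  # 5 is not a valid like/dislike label
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isnull().sum()

# Drop rows with missing values, then keep only valid binary ratings
# (assumption: any rating other than 0 or 1 is an abnormal instance).
cleaned = toy.dropna()
cleaned = cleaned[cleaned["rating"].isin([0, 1])]

print(missing_per_column)
print(len(toy), "->", len(cleaned), "rows")
```

Printing the row count before and after cleaning makes it easy to report how much data the cleaning step removed.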

Step 2
Encoding categorical features

In [9]:
# Step 2: Encode the categorical (object) columns as integer codes
encoder = LabelEncoder()
categorical_columns = ['gender', 'category', 'item']
for column in categorical_columns:
    data[column] = encoder.fit_transform(data[column])
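
Because the loop above reuses a single `LabelEncoder`, only the mapping for the last column remains available afterwards. A possible variant, sketched here on made-up data, keeps one fitted encoder per column so that integer codes can be mapped back to their labels. The dictionary name `encoders` and the toy values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "gender": ["M", "F", "F"],
    "category": ["Books", "Movies", "Books"],
})

# Fit a separate encoder for each column and keep it for later use.
encoders = {}
for column in ["gender", "category"]:
    column_encoder = LabelEncoder()
    toy[column] = column_encoder.fit_transform(toy[column])
    encoders[column] = column_encoder

# Map the integer codes back to the original labels.
original_gender = encoders["gender"].inverse_transform(toy["gender"])

print(toy)
print(list(original_gender))
```

Note that label encoding assigns codes in sorted label order, so the numeric order carries no real meaning. This matters for distance-based models such as KNN.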


Step 3: Feature selection
Only the features considered relevant are chosen

In [10]:
# Step 3: Feature Selection
# Choose relevant features
X = data[['helpfulness', 'gender', 'category', 'item_price', 'user_city', 'item_id']]


In [11]:
# Target variable
y = data['rating']
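
The task description also asks for a study of the correlation between features, which this notebook does not show. One way to approach it, sketched here on made-up numbers rather than the real dataset, is to compute the Pearson correlation of each numeric feature with `rating` using `DataFrame.corr()` and rank the features by absolute value.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "helpfulness": [1, 2, 3, 4, 5, 6],
    "item_price": [30, 10, 25, 5, 20, 15],
    "rating": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target column.
correlations = toy.corr()["rating"].drop("rating")

# Rank features by the strength (absolute value) of the correlation.
ranked = correlations.abs().sort_values(ascending=False)

print(ranked)
```

On the real data, the same ranking could be used to justify which columns go into `X`.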


Step 4: Logistic regression model


In [12]:

# Step 4: Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [13]:
# Train a logistic regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)


In [14]:

# Evaluate the logistic regression model on the test set
logistic_predictions = logistic_model.predict(X_test)
logistic_accuracy = accuracy_score(y_test, logistic_predictions)
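
The features used here sit on very different scales (for example `item_id` versus `helpfulness`), and with unscaled inputs the default solver may need more iterations to converge. A possible refinement, shown as a sketch on synthetic data from `make_classification` rather than the real dataset, standardises the features inside a pipeline before fitting. The `max_iter=1000` setting and all variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six selected features (not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, scaled_logistic.predict(X_te))
print("Accuracy on synthetic test split:", demo_accuracy)
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.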


Step 5: K-Nearest Neighbors Model

In [15]:
# Split the dataset again for KNN (same random_state, so the split is identical to Step 4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [16]:
# Defining a range of K values to search for the best value of K
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}

In [17]:
# Use GridSearchCV to find the best K
knn_grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_grid.fit(X_train, y_train)

In [18]:

# Refit a KNN model with the best K found by the grid search
best_k = knn_grid.best_params_['n_neighbors']
best_knn_model = KNeighborsClassifier(n_neighbors=best_k)
best_knn_model.fit(X_train, y_train)

In [21]:
# Evaluate the best KNN model
best_knn_predictions = best_knn_model.predict(X_test)
best_knn_accuracy = accuracy_score(y_test, best_knn_predictions)
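
KNN is distance-based, so features with large numeric ranges can dominate the neighbour search. The following sketch, run on synthetic data from `make_classification` instead of the real dataset, tunes K with the scaler placed inside the pipeline so that each cross-validation fold is scaled independently. The parameter key `kneighborsclassifier__n_neighbors` follows the step name that `make_pipeline` generates; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six selected features (not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scale, then classify; GridSearchCV refits the whole pipeline per fold.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11]}
search = GridSearchCV(knn_pipeline, grid, cv=5)
search.fit(X_tr, y_tr)

demo_best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
# With the default refit=True, score() uses the best pipeline refitted on X_tr.
demo_test_accuracy = search.score(X_te, y_te)

print("Best K on synthetic data:", demo_best_k)
print("Test accuracy on synthetic data:", demo_test_accuracy)
```

Since `GridSearchCV` refits the best configuration on the full training set by default, `best_estimator_` can be used directly instead of building and fitting a new classifier.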


Step 6: Discussion

In [20]:
# Step 6: Discussion
# Compare the test-set accuracy of both models
print("Best K:", best_k)
print("Logistic Regression Accuracy:", logistic_accuracy)
print("KNN Accuracy (Best K):", best_knn_accuracy)
# Discussion of model performance:
# KNN with the tuned K performs better than logistic regression on this test split
# (about 0.708 vs. 0.648 accuracy).

Best K: 11
Logistic Regression Accuracy: 0.6480446927374302
KNN Accuracy (Best K): 0.707635009310987
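
To see how K influences performance, rather than only which value wins, the mean cross-validated accuracy for every candidate K can be read from the `cv_results_` attribute of a fitted `GridSearchCV`. The sketch below uses synthetic data from `make_classification`, so its numbers say nothing about the real dataset; the table name `k_table` is an illustrative assumption.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the six selected features (not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

k_values = [3, 5, 7, 9, 11]
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": k_values}, cv=5)
search.fit(X_demo, y_demo)

# One row per candidate K with its mean accuracy across the 5 folds.
k_table = pd.DataFrame({
    "k": [params["n_neighbors"] for params in search.cv_results_["params"]],
    "mean_cv_accuracy": search.cv_results_["mean_test_score"],
})

print(k_table)
```

A table like this (or a line plot of it) supports the discussion of whether accuracy is still rising at the largest K tried, in which case a wider range may be worth searching.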
