## Motivation: 

The goal of this method is to solve two distinct tasks: sentiment analysis of reviews (task 1) and category classification based on combined data (task 2). Both tasks aim to build machine learning models that reliably predict the target variable.

Task 1 assesses the sentiment of textual reviews, whilst task 2 predicts each entry's category using features drawn from both textual and numerical data.


### Data Preprocessing:
The data for both tasks comprises textual and numerical elements. Task 1 uses only the review text, whereas task 2 uses combined features. The textual data is preprocessed by tokenization, stopword removal, lemmatization, and vectorization with 'CountVectorizer'. Numerical features are converted to strings and concatenated with the textual data before preprocessing.
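
To make the preprocessing steps concrete, here is a minimal, dependency-free sketch of the same idea: strip non-letters, lowercase, drop stopwords, and count the remaining tokens. The stopword set `TOY_STOPWORDS` and the helper `toy_bag_of_words` are hypothetical stand-ins for illustration only. The notebook itself relies on NLTK's stopword list, 'WordNetLemmatizer', and 'CountVectorizer', and lemmatization is omitted here.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword set for illustration only (NLTK's list is much larger).
TOY_STOPWORDS = {"the", "a", "is", "was", "and", "to", "of"}

def toy_bag_of_words(text):
    """Return token counts after cleaning, lowercasing and stopword removal."""
    cleaned = re.sub(r"[^a-zA-Z\s]", "", text).lower()  # keep letters and whitespace only
    tokens = cleaned.split()                            # simplified whitespace tokenization
    kept = [t for t in tokens if t not in TOY_STOPWORDS]
    return Counter(kept)

counts = toy_bag_of_words("The coffee was great, and the staff was great too!")
print(counts)
```

Each review becomes a vector of such counts, one position per vocabulary word, which is the representation the Naive Bayes classifiers are trained on.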

I also tested both 'CountVectorizer' and 'TfidfVectorizer' for vectorizing the input data. On this dataset, 'CountVectorizer' outperformed 'TfidfVectorizer' with both Naive Bayes classifiers (Complement, CNB, and Multinomial, MNB), so 'CountVectorizer' is used for data preprocessing. The table reports test-set accuracy:

| |TfidfVectorizer | CountVectorizer|
|--|---------------|----------------|
|CNB|0.8450704225352113|0.9084507042253521|
|MNB|0.7288732394366197|0.8926056338028169|
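
To illustrate how the two vectorizers differ, the sketch below computes raw counts and tf-idf weights by hand for a hypothetical toy corpus. It assumes the smoothed weighting that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The helper names and the corpus are made up for this example and are not part of the notebook.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    ["great", "coffee", "great", "staff"],
    ["bad", "coffee"],
    ["great", "food"],
]

n_docs = len(docs)
# Document frequency: the number of documents each term appears in.
df = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def count_vector(doc):
    # Raw term counts, as a CountVectorizer-style representation.
    return dict(Counter(doc))

def tfidf_vector(doc):
    # Count times idf, then scale the vector to unit L2 length.
    weights = {t: c * smoothed_idf(t) for t, c in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

print(count_vector(docs[0]))
print(tfidf_vector(docs[0]))
```

Counts keep the raw frequency of each word, while tf-idf rescales those frequencies and normalizes each document. That rescaling may be one reason the two representations score differently with Naive Bayes here, although the notebook does not investigate the cause.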

### Task 1: Sentiment Analysis
For sentiment analysis (task 1), the preprocessing steps are applied to the reviews, and the preprocessed data is divided into training and test sets using an 80-20 split (a commonly used ratio). The training data is used to train two Naive Bayes classifiers: Complement Naive Bayes and Multinomial Naive Bayes. Each classifier's accuracy is measured on the test data with 'accuracy_score'. Complement Naive Bayes outperforms Multinomial Naive Bayes on this task, so I use it for the next task (category classification on combined features).

|Complement Naive Bayes|Multinomial Naive Bayes|
|----------------------|------------------------|
|0.9084507042253521|0.8926056338028169|
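
The comparison above can also be sketched with a scikit-learn pipeline. This is an alternative arrangement, not the notebook's code: the notebook vectorizes all reviews before splitting, whereas here `make_pipeline` fits the vectorizer on the training split only. The toy texts and labels are hypothetical, and the `stratify` argument is an added assumption to keep both classes in the tiny test set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the review column and its labels.
texts = [
    "great coffee friendly staff", "lovely espresso and cake",
    "cozy cafe good latte", "best flat white in town",
    "fast oil change fair price", "mechanic fixed my brakes",
    "honest garage quick service", "tyres replaced same day",
    "great cake and coffee", "garage repaired the engine",
]
labels = ["cafe"] * 4 + ["garage"] * 4 + ["cafe", "garage"]

# 80-20 split with a fixed seed, mirroring the notebook's settings.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

scores = {}
for clf in (ComplementNB(), MultinomialNB()):
    # The vectorizer is fitted inside the pipeline, i.e. on the training split only.
    pipe = make_pipeline(CountVectorizer(), clf)
    pipe.fit(X_train, y_train)
    scores[type(clf).__name__] = accuracy_score(y_test, pipe.predict(X_test))

print(scores)
```

With only two test examples the scores on this toy data are not meaningful. The point is the structure: one loop, one split, and identical preprocessing for both classifiers.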

### Task 2: Category Classification
For category classification (task 2), the combined features 'review' and 'name' were chosen, deliberately excluding the purely numerical columns 'mean_checkin_time', 'ID', 'latitude', and 'longitude'. The 'name' field adds context that the review text alone lacks: 'name' and 'review' together reach 0.9225 accuracy, compared with 0.9085 for the review combined with numerical data. Adding the numerical columns to 'name' and 'review' does not change the score. This is likely because the preprocessing step removes every non-letter character, so numbers appended as strings are stripped before vectorization. The predictions for test.csv therefore use the combined 'name' and 'review' features, which give the best accuracy among the combinations tested with the fewest inputs.





|name & review|name & numerical data|review & numerical data|all features|
|--------------|-----------------|---------------|------------|
|0.9225352112676056| 0.6795774647887324|0.9084507042253521|0.9225352112676056|
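
The identical scores for 'name & review' and 'all features' are easier to understand by looking at what the cleaning regex in `data_preprocessing` does to concatenated columns. The sketch below applies the same pattern to hypothetical row values. It shows two effects: numeric strings disappear entirely, and columns joined without a separator fuse into a single token.

```python
import re

# Hypothetical row values for illustration.
review = "Nice place for brunch"
name = "Sunny Cafe"
mean_checkin_time = "13.0"
latitude = "-33.8688"

def strip_non_letters(text):
    # Same pattern as in data_preprocessing: keep letters and whitespace only.
    return re.sub(r"[^a-zA-Z\s]", "", text)

joined_no_space = review + name + mean_checkin_time + latitude
joined_with_space = " ".join([review, name, mean_checkin_time, latitude])

print(strip_non_letters(joined_no_space).split())
# ['Nice', 'place', 'for', 'brunchSunny', 'Cafe']
print(strip_non_letters(joined_with_space).split())
# ['Nice', 'place', 'for', 'brunch', 'Sunny', 'Cafe']
```

If the numerical columns are meant to influence the classifier, they would need a representation that survives this cleaning step, for example binned values encoded as words. Joining columns with a space would also keep the last word of one column separate from the first word of the next.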


In [12]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB, MultinomialNB

# Resources used by stopword removal and lemmatization.
# word_tokenize additionally needs the 'punkt' tokenizer data if it is not already installed.
nltk.download('stopwords')
nltk.download('wordnet')

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vicmon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/vicmon/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
def data_preprocessing(sentences, vectorizer=None):
    """Clean, tokenize and vectorize an iterable of strings.

    If no vectorizer is given, a new CountVectorizer is fitted and returned;
    otherwise the supplied (already fitted) vectorizer is reused.
    """
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    processed_texts = []

    for sentence in sentences:
        # Keep letters and whitespace only (this also strips digits), then lowercase.
        sentence = re.sub(r'[^a-zA-Z\s]', '', sentence).lower()
        tokens = word_tokenize(sentence)
        filtered_tokens = [word for word in tokens if word not in stop_words]
        lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
        processed_texts.append(' '.join(lemmatized_tokens))

    if vectorizer is None:
        # Training data: learn the vocabulary (swap in TfidfVectorizer() to compare).
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(processed_texts)
    else:
        # Test data: reuse the vocabulary learned during training.
        X = vectorizer.transform(processed_texts)

    return X, vectorizer
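
The two branches on `vectorizer` follow the usual fit-then-transform pattern. As a small illustration with hypothetical texts (not the notebook's data), the vocabulary is learned once from the training texts and then reused, so training and test matrices share the same columns and words unseen during training are ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["great coffee", "slow service"]        # hypothetical examples
test_texts = ["great service", "unknown words here"]  # second text has no known words

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary
X_test = vectorizer.transform(test_texts)        # reuses it; unseen words are dropped

print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_test.shape)
print(X_test.toarray())
```

Calling `fit_transform` on the test texts instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.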

In [4]:
reviews = train_df.review
y = train_df.category
X, vectorizer = data_preprocessing(reviews)

# 80-20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = ComplementNB()
model1 = MultinomialNB()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Complement Naive Bayes classification accuracy for task 1:", accuracy)

model1.fit(X_train, y_train)
y_pred = model1.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Multinomial Naive Bayes classification accuracy for task 1:", accuracy)


Complement Naive Bayes classification accuracy for task 1: 0.9084507042253521
Multinomial Naive Bayes classification accuracy for task 1: 0.8926056338028169


In [5]:
# Output for the Kaggle test set (task 1)

# Extract the reviews from test.csv
review = test_df.review

# Preprocess the test reviews with the vectorizer fitted on the training data
X, _ = data_preprocessing(review, vectorizer)

# Predict with the Complement Naive Bayes model trained above
prediction = model.predict(X)

# Pair each ID with its predicted label
results_df = pd.DataFrame({'ID': test_df['ID'], 'Category': prediction})

# Write the DataFrame to a new CSV file
results_df.to_csv('predictions_task1.csv', index=False)

In [6]:

labels = train_df.category
# Combine only review and name (the columns are joined without a separator)
combine = train_df['review'] + train_df['name']  # + train_df['mean_checkin_time'].round().astype('str')
X, vectorizer = data_preprocessing(combine)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
model = ComplementNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy for task2:", accuracy)

Accuracy for task2: 0.9225352112676056


In [7]:

# All features
combine = (
    train_df['review']
    + train_df['name']
    + train_df['mean_checkin_time'].round().astype('str')
    + train_df['latitude'].astype('str')
    + train_df['longitude'].astype('str')
    + train_df['ID'].astype('str')
)
X, vectorizer = data_preprocessing(combine)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
model = ComplementNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy for task2:", accuracy)

Accuracy for task2: 0.9225352112676056


In [9]:

# Review & numerical data
combine = (
    train_df['review']
    + train_df['mean_checkin_time'].round().astype('str')
    + train_df['latitude'].astype('str')
    + train_df['longitude'].astype('str')
    + train_df['ID'].astype('str')
)
X, vectorizer = data_preprocessing(combine)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
model = ComplementNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy for task2:", accuracy)

Accuracy for task2: 0.9084507042253521


In [10]:

# Name & numerical data
combine = (
    train_df['name']
    + train_df['mean_checkin_time'].round().astype('str')
    + train_df['latitude'].astype('str')
    + train_df['longitude'].astype('str')
    + train_df['ID'].astype('str')
)
X, vectorizer = data_preprocessing(combine)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
model = ComplementNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy for task2:", accuracy)

Accuracy for task2: 0.6795774647887324


In [11]:
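# Output for the Kaggle test set (task 2)
# Refit on review + name first: otherwise `model` and `vectorizer` still hold the
# state of whichever experiment cell ran last (here, name & numerical data).
combine = train_df['review'] + train_df['name']
X, vectorizer = data_preprocessing(combine)
X_train, _, y_train, _ = train_test_split(X, train_df.category, test_size=0.2, random_state=42)
model = ComplementNB()
model.fit(X_train, y_train)

# Build the test features in the same column order as the training features
combine = test_df['review'] + test_df['name']
X, _ = data_preprocessing(combine, vectorizer)

prediction = model.predict(X)
results_df = pd.DataFrame({'ID': test_df['ID'], 'Category': prediction})

# Write the DataFrame to a new CSV file
results_df.to_csv('predictions_task2.csv', index=False)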