In [None]:
**Assignment 2 **

**The report section - Report on Predictive Modelling for Point of Interest Categorization**

**1. Introduction**
The problem involves assigning a relevant category label to a given point of interest, which can range from restaurants and shopping centres to nightlife venues. In this report, we explore the task of predicting the category of points of interest, focusing on three main categories: "Restaurants," "Shopping," and "Nightlife." We investigate the application of the Naive Bayes algorithm and an enhanced version of it to address this task.

**2. Data Representation and Pre-processing**
The dataset used is a Yelp data from New Orleans city. The data is divided into two excel sheets each for training and test set of data for modelling. It consists of seven columns – ID, mean_checkin_time, longitude, latitude, review and category. The category column is the class label that is to be predicted while remaining columns can be used as features for the model.
The data set consists of textual values in the column review and float values in the remaining columns. The column category has three categories - Restaurants, Shopping, and Nightlife which are to be predicted. Note that we do not have this column for test set as it is to be predicted using Naïve Bayes Classifier.
Various methods of data pre-processing are used on the data prior to the modelling. Since the data set has text and numeric data, each type of data is to be pre-processed before applying Naïve Bayes Classifier. 
The textual columns contain of stop words, punctuation, capital alphabets and short forms of the words. To remove these from the values of data we have a simple process called as Bag of Words. In this process, we convert the sentences to a vector of numbers assigned to each word that are relevant for modelling. This process can be done using Count Vectorizer from sklearn package. The method not only removes stop words but also the punctuations from the sentences, tokenizes the sentences and convert capital letters to smaller ones.
Next for the numeric features in the dataset, we have used the binning method. It is a discretization method, where we turn continuous numerical features into discrete categories. These categories are called as bins. In our case we have created 5 bins for features mean_checkin_time, longitude, and latitude. Here, we have created the binned features and stored them in sparse matrices.
After that we have combined the pre-processed features of text and numeric values together.
Lastly, we have applied SMOTE technique as there is an imbalance in the classes of Yelp dataset. It is a Synthetic Minority Over-sampling Technique that creates synthetic samples of the minority class so that the number of samples for each class is similar. In our code, we have used imblearn package to apply SMOTE on the data.

**3. Task 1: Naïve Bayes Algorithm**
There are three types of models that can be implemented for Naïve Bayes Algorithm - Gaussian Naive Bayes, Multinomial Naive Bayes and Bernoulli Naive Bayes. Since our dataset requires text classification, we have used Multinomial Naïve Bayes model. This model handles discrete features and more efficient for textual data. We have added the parameter alpha to the model in this task. Also, the feature considered here is review only. Based on this feature, we can achieve the accuracy of 0.8899 as per the Kaggle evaluation of the model.

**4. Task 2: Enhanced Naive Bayes Algorithm for Point of Interest Categorization**
For the second task, we have considered all the features available in the dataset to improve the performance of the previous model. We have also considered the parameter alpha that is used for smoothening of data for the model. 
To use an optimal value for the alpha parameter, we have used cross validation process. In this process, we have considered three values for the alpha – 0.1, 0.5 and 1.0. With the cross-validation scores for these values, we came to the found that having value 0.1 for alpha will provide us with the best model.

**5. Evaluation Procedure**
For the evaluation of the models, we have considered cross-validation scores of the models that can be applied to the data. Cross-validation score is a performance metric used for cross validation method in data modelling. 
In cross-validation process, we divide the training data into different subsets called as folds. The model is trained on combination of these folds and evaluated using cross validation scores. We can also create a validation set that is a subset of training set and use to test the model in the cross-validation process.
Other than cross-validation scores, we have also considered the precision, recall and F1-score metrics for evaluating the performances of the model. Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. Recall measures the proportion of correctly predicted positive instances out of all true positive instances and F1-score is the harmonic mean of precision and recall.
When precision is higher it can be said that there is a lower false positive rate. When there is high recall value, we can say that there is low false negative rate. Since there is a trade off between precision and recall, it is important to strike a balance between them. If F1-score is high, it indicates overall good performance of the model.

**6. Training/Validation Results**
After pre-processing the data and using cross-validation on the data, we get a model with cross-validation scores in between 0.94 to 0.96. We also get the precision, recall and F1-score values as 0.946, 0.945 and 0.9453 respectively. Considering these values there is a possibility that the model will have good performance on the test data set.

**7. Conclusion**
While the above-mentioned metrics are good indicators, it is also important to consider other factors like model complexity and computation costs of the model. Even if the performance metrics values are good, the model may show different results on test dataset. 


In [185]:
#Import statements
import pandas as p
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import KBinsDiscretizer
import scipy.sparse as sp
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score

In [137]:
# Task 1 Code

# Loading training data and test data
train_excel_data = 'C:/Users/tejas/Downloads/cs762-point-of-interest-categorization/train.csv'
train_data=p.read_csv(train_excel_data)

test_excel_data = 'C:/Users/tejas/Downloads/cs762-point-of-interest-categorization/test.csv'
test_data=p.read_csv(test_excel_data)

X_train = train_data['review']
y_train = train_data['category']
X_test = test_data['review']

#Preprocessing with count vectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train).toarray()
X_test_vectorized = vectorizer.transform(X_test).toarray()

#build model
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)

y_pred = classifier.predict(X_test_vectorized)

# Create a DataFrame to store the predicted values
predictions_df = p.DataFrame({'category': y_pred})

#Store in file
output_file = 'C:/Users/tejas/OneDrive/Desktop/Assignments/Compsci 762_Foundations of ML/Assignment 2/predictions.csv'
predictions_df.to_csv(output_file, index=False)

In [187]:
# Task 2 Code

X_train = train_data.drop(columns=['category'])
y_train = train_data['category']
X_test = test_data

#Preprocess the review feature
vectorizer = CountVectorizer()
X_train_text = vectorizer.fit_transform(X_train['review'])
X_test_text = vectorizer.transform(X_test['review'])

#Preprocess the other features - mean_checkin_time, longitude and latitude

#Binning the data
n_bins = 5 # number of bins
binning = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')
X_train_binned = binning.fit_transform(X_train[['mean_checkin_time', 'longitude', 'latitude']])
X_test_binned = binning.transform(X_test[['mean_checkin_time', 'longitude', 'latitude']])

# Convert the binned features to sparse matrices
X_train_binned_sparse = sp.csr_matrix(X_train_binned)
X_test_binned_sparse = sp.csr_matrix(X_test_binned)

#Combining preprocessed features
X_train_final = sp.hstack((X_train_binned_sparse, X_train_vectorized))
X_test_final = sp.hstack((X_test_binned_sparse, X_test_vectorized))

# Applying SMOTE
smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_final, y_train)

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB

classifier = MultinomialNB()
param_grid = {
    'alpha': [0.1, 0.5, 1.0] #smoothening parameter
}

grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid, cv=10,error_score='raise')
cross_val_scores = cross_val_score(grid_search, X_train_resampled, y_train_resampled, cv=5)

# Calculate precision, recall, and F1-score during cross-validation
y_pred = cross_val_predict(classifier, X_train_resampled, y_train_resampled, cv=5)
precision = precision_score(y_train_resampled, y_pred, average='macro')
recall = recall_score(y_train_resampled, y_pred, average='macro')
f1 = f1_score(y_train_resampled, y_pred, average='macro')

# Print the cross-validation scores and evaluation metrics
print("Cross-validation scores:", cv_scores)
print("Mean cross-validation score:", np.mean(cv_scores))
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

print("Cross-validation scores:", cross_val_scores)
print("Mean cross-validation score:", cross_val_scores.mean())

grid_search.fit(X_train_resampled, y_train_resampled)

print("Best estimator:", grid_search.best_estimator_)
print("Best parameters:", grid_search.best_params_)


Cross-validation scores: [0.96091486 0.95033812 0.94467382 0.94032992]
Mean cross-validation score: 0.9490641830438349
Precision: 0.9467801359038402
Recall: 0.9457979225684608
F1-score: 0.9453842177715125
Cross-validation scores: [0.94428706 0.95372993 0.9631728  0.96411709 0.96883853]
Mean cross-validation score: 0.9588290840415485
Best estimator: MultinomialNB(alpha=0.1)
Best parameters: {'alpha': 0.1}
