**ASSIGNMENT 2 - Classification Empirical Study: Naïve Bayes vs Logistic Regression**

--------------------------------------------------------------------------

**1. Group Description**

Group Number: 10

Member Name 1: Jake Wang

Member Student Number 1: ***REMOVED***

Member Name 2: Victor Li

Member Student Number 2: ***REMOVED***

--------------------------------------------------------------------------

**Import Libraries**

In [None]:
import pandas
import numpy
import itertools

--------------------------------------------------------------------------

**2. Dataset**

Chosen dataset: Car dataset

The dataset comes from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/) [1].
It can be retrieved from [this link](https://archive.ics.uci.edu/dataset/19/car+evaluation).

**Read Dataset**

In [None]:
url = "https://raw.githubusercontent.com/uOttawa-Collabs/CSI4106-Fall-2023/master/Assignment%202/Car/car.data"
dataframe = pandas.read_csv(url)
dataframe.columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
dataframe
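
A point worth verifying here: the original UCI `car.data` file has no header row, and `pandas.read_csv` treats the first line of a file as column names by default. If the mirrored file at the URL above is also headerless (an assumption, not checked here), the call above would silently consume the first record. The sketch below reproduces the effect on a three-line in-memory sample (the sample rows are illustrative only) and shows the `header=None` / `names=` alternative.

```python
import io
import pandas

# Three illustrative records in the same layout as car.data (no header row)
sample = (
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
    "low,low,5more,more,big,high,vgood\n"
)
column_names = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# Default behaviour: the first record is interpreted as the header and is lost
default_read = pandas.read_csv(io.StringIO(sample))
default_read.columns = column_names

# Declaring that there is no header keeps every record
explicit_read = pandas.read_csv(io.StringIO(sample), header=None, names=column_names)

print(len(default_read), len(explicit_read))
```

On this sample the default call keeps 2 rows and the explicit call keeps all 3, so comparing `len(dataframe)` with the expected sample count is a quick way to tell which case applies.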

**3. Familiarize with the classification task and the dataset**

***Objective***

The objective of this study is to develop a classification model that evaluates cars based on their configurations and parameters. The dataset comprises six discrete (categorical) feature variables: buying price, maintenance price, number of doors, passenger capacity, luggage boot size, and estimated safety. The target variable is the evaluation of the car, categorized into four classes: unacceptable, acceptable, good, and very good.

***Applications***

1. It may help customers make informed purchasing decisions based on their preferences and constraints.
2. It may help car dealerships match potential buyers with cars that fit their requirements, thereby enhancing customer satisfaction and loyalty.

***Dataset Characteristics Analysis***

****Features****

The dataset used for this study contains information on the following six feature variables:
* Buying Price (`buying`): Indicates the price range at which the car was purchased.
    * Possible values: `vhigh`, `high`, `med`, `low`.
* Maintenance Price (`maint`): Represents the maintenance cost of the car.
    * Possible values: `vhigh`, `high`, `med`, `low`.
* Number of Doors (`doors`): Denotes the number of doors the car has.
    * Possible values: `2`, `3`, `4`, `5more`.
* Capacity (`persons`): Indicates the maximum number of persons the car can accommodate.
    * Possible values: `2`, `4`, `more`.
* Luggage Boot Size (`lug_boot`): Represents the size of the luggage boot in the car.
    * Possible values: `small`, `med`, `big`.
* Estimated Safety (`safety`): Provides an estimation of the safety level of the car.
    * Possible values: `low`, `med`, `high`.

****Classes****

The target variable is Car Evaluation (`class`), which classifies cars into four categories:
unacceptable (`unacc`), acceptable (`acc`), good (`good`), and very good (`vgood`).
These classes serve as the basis for evaluating the overall desirability of a car.

****Training Examples****

The dataset contains 1728 training samples in total. No samples have missing data.
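
The two statements above can be checked directly on the loaded dataframe. A minimal sketch, where `summarize` is a hypothetical helper (not part of the notebook) and the three-row sample only stands in for the real data:

```python
import io
import pandas

def summarize(dataframe: pandas.DataFrame) -> dict:
    """Return the row count, the number of missing cells and the class counts."""
    return {
        "rows": len(dataframe),
        "missing_cells": int(dataframe.isna().sum().sum()),
        "class_counts": dataframe["class"].value_counts().to_dict(),
    }

# Illustrative stand-in for the real dataframe
sample = pandas.read_csv(
    io.StringIO(
        "vhigh,vhigh,2,2,small,low,unacc\n"
        "vhigh,vhigh,2,2,small,med,unacc\n"
        "low,low,5more,more,big,high,vgood\n"
    ),
    header=None,
    names=["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"],
)

summary = summarize(sample)
print(summary)
```

Calling `summarize(dataframe)` on the dataframe loaded earlier would report its actual row count and number of missing cells, together with how many samples fall into each class.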

--------------------------------------------------------------------------

**4. Brainstorm about the attributes**

* In the automotive sales market, many factors significantly influence purchasing decisions. Beyond the attributes provided in the dataset, **features**, **second-hand price**, **power**, **brand**, **color**, and **after-sale service** are important as well.

  Among the attributes provided in the dataset, the **number of doors** appears to be less useful, given its real-world correlation with a car's **capacity**. Typically, a capacity of 2 requires at least 2 doors, while a capacity of 4 usually comes with 4 doors and at least 2. In general, larger capacities correspond to a higher number of doors. Although unconventional door counts, such as odd numbers, might intrigue some customers, this single feature is unlikely to influence purchase decisions significantly.

* Based on the diagrams below, the values of each feature are uniformly distributed (every value occurs about equally often), which reduces the risk of bias during model training. Because all features in this dataset are discrete (for instance, the number of persons cannot be fractional), attribute normalization is not appropriate for this dataset.

In [None]:
from matplotlib import pyplot as plot

buying_dict = {"low": 0, "med": 0, "high": 0, "vhigh": 0}
maint_dict = {"low": 0, "med": 0, "high": 0, "vhigh": 0}
door_dict = {"2": 0, "3": 0, "4": 0, "5more": 0}
persons_dict = {"2": 0, "4": 0, "more": 0}
lug_dict = {"small": 0, "med": 0, "big": 0}
safety_dict = {"low": 0, "med": 0, "high": 0}

def analyze_data(dataframe, prop, dic):
    # Count how often each value of the feature `prop` occurs
    for value in dataframe[prop]:
        dic[value] += 1
    generate_diagram(dic, prop)

def generate_diagram(dic: dict, name):
    # Print the raw counts, then draw them as a bar chart
    print(dic)
    x_list = list(dic.keys())
    y_list = list(dic.values())
    plot.bar(x_list, y_list)
    plot.title("Analysis of Feature " + name)
    plot.xlabel("value")
    plot.ylabel("count")
    plot.show()


analyze_data(dataframe, "buying", buying_dict)
analyze_data(dataframe, "maint", maint_dict)
analyze_data(dataframe, "doors", door_dict)
analyze_data(dataframe, "persons", persons_dict)
analyze_data(dataframe, "lug_boot", lug_dict)
analyze_data(dataframe, "safety", safety_dict)
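
The uniform counts have a simple explanation if the dataset contains every combination of feature values exactly once: 4 × 4 × 4 × 3 × 3 × 3 = 1728, which matches the sample count stated earlier. The sketch below builds such a full grid as a hypothetical stand-in for the real data (the `grid` frame is constructed, not downloaded) and uses `value_counts` and `pandas.crosstab` to inspect the per-value counts and the relationship between `doors` and `persons`.

```python
import itertools
import pandas

# Feature values as listed in the dataset description
feature_values = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Hypothetical stand-in for the real data: one row per combination of values
grid = pandas.DataFrame(
    list(itertools.product(*feature_values.values())),
    columns=list(feature_values.keys()),
)

# Per-value counts of one feature, equivalent to the manual counting loop
door_counts = grid["doors"].value_counts()

# Contingency table of doors against persons
doors_by_persons = pandas.crosstab(grid["doors"], grid["persons"])

print(len(grid))
print(door_counts.to_dict())
print(doors_by_persons)
```

Under that assumption every cell of the doors-by-persons table holds the same count, so the relationship between doors and capacity discussed above would come from real-world knowledge rather than from a pattern in this data. Running the same `crosstab` call on the loaded `dataframe` shows whether this holds for the actual file.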

--------------------------------------------------------------------------

**5. Encode the features**

* All features in the chosen dataset are discrete, so the data is a natural fit for the Categorical Naive Bayes model (`CategoricalNB`).
* For Logistic Regression, one-hot encoding converts each categorical feature into binary indicator variables that the model can treat as numeric inputs.

In [None]:
# Perform one-hot encoding on the feature columns only; `class` stays a single
# label column because scikit-learn classifiers expect a 1-D target
encoded_dataframe = pandas.get_dummies(
    dataframe,
    columns=["buying", "maint", "doors", "persons", "lug_boot", "safety"],
    dtype=int
)
encoded_dataframe
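
scikit-learn's `CategoricalNB` expects each feature to be encoded as non-negative integers (its documentation suggests `OrdinalEncoder`), so the string values still need a mapping step before training. A minimal sketch, assuming an explicit category order per column and using only two of the six features on a three-row illustrative sample:

```python
import pandas
from sklearn.preprocessing import OrdinalEncoder

# Illustrative sample with two of the six features
sample = pandas.DataFrame(
    {
        "buying": ["vhigh", "low", "med"],
        "safety": ["low", "high", "med"],
    }
)

# Explicit category order per column, so that the integer codes are stable
categories = [
    ["low", "med", "high", "vhigh"],  # buying
    ["low", "med", "high"],           # safety
]

# Map each string value to its position in the corresponding category list
encoder = OrdinalEncoder(categories=categories, dtype=int)
encoded = encoder.fit_transform(sample)

print(encoded.tolist())
```

Passing the categories explicitly, rather than letting the encoder infer them, keeps the codes identical across cross-validation folds even if a fold happens to miss a value.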

--------------------------------------------------------------------------

**6. Define 2 models using default parameters**


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import CategoricalNB

naive_bayes_classifier = CategoricalNB()
logistic_regression_classifier = LogisticRegression()

--------------------------------------------------------------------------

**7. Train/test/evaluate the 2 models in cross-validation**
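
One possible shape for this step is sketched below with `cross_val_score` and 5 folds. Everything in it is an assumption made for illustration: the data is a small synthetic grid over two features, the label rule is made up, and the scores it prints say nothing about the real car dataset.

```python
import itertools
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in data: every combination of two features, repeated,
# with a made-up label rule (NOT the real car evaluation labels)
values = {"safety": ["low", "med", "high"], "persons": ["2", "4", "more"]}
rows = list(itertools.product(*values.values())) * 20
features = pandas.DataFrame(rows, columns=list(values.keys()))
labels = ((features["safety"] != "low") & (features["persons"] != "2")).map(
    {True: "acc", False: "unacc"}
)

# Integer codes for CategoricalNB, one-hot indicators for LogisticRegression
ordinal_features = OrdinalEncoder(
    categories=list(values.values()), dtype=int
).fit_transform(features)
one_hot_features = pandas.get_dummies(features, dtype=int)

# 5-fold cross-validated accuracy for each model with default parameters
nb_scores = cross_val_score(CategoricalNB(), ordinal_features, labels, cv=5)
lr_scores = cross_val_score(LogisticRegression(), one_hot_features, labels, cv=5)

print(nb_scores.mean(), lr_scores.mean())
```

For classifiers, an integer `cv` makes `cross_val_score` use stratified folds, which keeps the class proportions similar across folds; that matters for a target whose classes are not equally frequent.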


--------------------------------------------------------------------------

**8. Train/test/evaluate the 2 models in cross-validation with modified parameters (#1)**


--------------------------------------------------------------------------

**9. Train/test/evaluate the 2 models in cross-validation with modified parameters (#2)**


--------------------------------------------------------------------------

**10. Analyze the obtained results**


--------------------------------------------------------------------------

**11. Conclusion**


--------------------------------------------------------------------------

**12. References**

[1] M. Bohanec, 'Car Evaluation'. UCI Machine Learning Repository, 1988. doi:10.24432/C5JP48