# DataSift: Optimizing Classification Methods for Enhanced Predictive Analysis

## Objective:

This project aims to determine the most suitable classification method for a given dataset. The selected classification algorithms for evaluation include:

* **K-Nearest Neighbors (KNN)**

* **Naive Bayes**

* **Logistic Regression**

* **Decision Trees**

* **Random Forest**

* **Bagging**


## Methodology:

**Dataset**: Use a representative dataset for the classification task.

**Algorithms**:

* Implement KNN for classification.
* Apply the Naive Bayes algorithm.
* Employ Logistic Regression.
* Use a Decision Tree.
* Apply Random Forest.
* Employ Bagging.

**Evaluation**:

* Compare the predictions of each algorithm with the actual (ground truth) values.
* Measure and analyze the performance metrics for each method.
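
The comparison of predictions against ground truth can be sketched in plain Python. The helper name `binary_report` and the toy labels are illustrative assumptions, not part of the project code; the notebook itself relies on scikit-learn's `classification_report`.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return accuracy plus precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels, made up for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
report = binary_report(y_true, y_pred)
print(report)
```

Precision answers "how many predicted positives were correct", while recall answers "how many actual positives were found"; both matter when the classes are not equally frequent.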

**Outcome**:

* Identify the classification algorithm that demonstrates the highest accuracy and reliability.
* Present a comparative analysis of the performance of KNN, Naive Bayes, Logistic Regression, Decision Tree, Random Forest and Bagging.


# Data Preprocessing


**Dataset Source**: [www.kaggle.com/datasets/michau96/classification-syntetic-data-for-practice/](https://www.kaggle.com/datasets/michau96/classification-syntetic-data-for-practice/)

Note: The original file has 32 columns, the first of which is an unlabeled index column. The file was edited to label this column 'Index' so that it can be removed with the drop() function.

**Dataset Characteristics**:

* Size: 1 million rows

* Columns: 31


The dataset contains exactly 1 million rows and 31 columns. The first 30 columns are explanatory variables: the first 10 were generated using distributions from the numpy.random package, the next 10 using np.random.randint, and the last 10 using np.random.choice with various probabilities. The last column, "Class", is the target variable and takes the value 0 or 1.
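
To make this description concrete, the following minimal sketch builds a small table with the same three groups of columns. The specific distributions, value ranges, probabilities, and the independently drawn target column are assumptions chosen for illustration; the actual generating parameters of the Kaggle dataset are not stated here.

```python
import numpy as np

np.random.seed(0)
n_rows = 1000  # small stand-in for the 1 million rows of the real file

# Columns 1-10: draws from continuous distributions in numpy.random (choices are assumed).
continuous = np.column_stack([
    np.random.normal(0.0, 1.0, n_rows) if i % 2 == 0 else np.random.uniform(0.0, 1.0, n_rows)
    for i in range(10)
])

# Columns 11-20: integers from np.random.randint (ranges are assumed).
integers = np.column_stack([np.random.randint(0, 10 * (i + 1), n_rows) for i in range(10)])

# Columns 21-30: categorical codes from np.random.choice with unequal probabilities (assumed).
categorical = np.column_stack([
    np.random.choice([0, 1, 2], size=n_rows, p=[0.6, 0.3, 0.1]) for _ in range(10)
])

# Column 31: binary target, drawn here independently of the features (an assumption).
target = np.random.choice([0, 1], size=n_rows, p=[0.6, 0.4])

data = np.column_stack([continuous, integers, categorical, target])
print(data.shape)  # (1000, 31)
```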


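We import the necessary packages for data manipulation, visualization, and preprocessing: pandas for data handling, matplotlib.pyplot for plotting, numpy for numerical operations, StandardScaler (scikit-learn) for feature scaling, and RandomOverSampler (imbalanced-learn) for oversampling.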

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

We load the dataset from the CSV file 'BinaryClass_1m.csv' and drop the 'Index' column.

In [3]:
df = pd.read_csv("BinaryClass_1m.csv").drop("Index",axis=1)
df.head()

| | Feature_1 | Feature_2 | Feature_3 | Feature_4 | Feature_5 | Feature_6 | Feature_7 | Feature_8 | Feature_9 | Feature_10 | ... | Feature_22 | Feature_23 | Feature_24 | Feature_25 | Feature_26 | Feature_27 | Feature_28 | Feature_29 | Feature_30 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.265905 | -1.779512 | -3.280885 | 0.164727 | 1.090527 | 0.690742 | 0.424352 | 7.056372 | 168.0 | 9.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.965041 | 0.278549 | 3.743072 | 0.006583 | 1.439838 | 0.623385 | 0.679332 | 4.650126 | 151.0 | 5.0 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 2 | 0.648668 | -1.14508 | -2.332163 | 0.111092 | 3.867226 | 0.39773 | 0.98295 | 6.539507 | 167.0 | 5.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 0.46448 | -0.706006 | -0.021605 | 0.525589 | 1.039123 | 0.46269 | 0.011771 | 12.082375 | 153.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 1.0 |
| 4 | 0.570902 | 0.36563 | -5.236326 | 0.559546 | 2.952963 | 0.803728 | 0.281474 | 13.872611 | 131.0 | 7.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |


We define a function **get_xy** that standardizes the features and, optionally, oversamples the data. It returns the combined array **data** along with the scaled features **X** and the corresponding labels **y**.


In [10]:
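def get_xy(dataframe, overSample=False):
  # Features are all columns except the last; the label is the last column.
  X = dataframe[dataframe.columns[:-1]].values
  y = dataframe[dataframe.columns[-1]].values

  # Standardize each feature to zero mean and unit variance.
  # Note: the scaler is fitted on whichever split is passed in.
  scaler = StandardScaler()
  X = scaler.fit_transform(X)

  # Optionally balance the classes by randomly duplicating minority-class rows.
  if overSample:
    ros = RandomOverSampler()
    X, y = ros.fit_resample(X, y)

  # Stack features and labels back into a single array.
  data = np.hstack((X, np.reshape(y, (-1, 1))))

  return data, X, y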
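
For intuition about the oversampling step, the sketch below adds randomly chosen duplicates of minority-class rows until both classes have the same count. It is a simplified stand-in written with numpy, not imbalanced-learn's implementation; the function name `random_oversample` and the toy arrays are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Add random duplicates to each smaller class until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Keep all original rows and draw extra row indices with replacement.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Toy data: 7 rows of class 0 and 3 rows of class 1.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))  # [7 7]
```

Because the added rows are exact copies, oversampling changes the class balance seen during training but adds no new information about the minority class.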

We shuffle the dataset and partition it into three segments at the 80% and 90% row marks: training (80%), validation (10%), and testing (10%). Each split is then scaled, and only the **train** split is oversampled so that both classes are equally represented during training.

# Train, Validation, Test Datasets

In [11]:
train,val,test = np.split(df.sample(frac=1), [int(0.8*len(df)),int(0.9*len(df))])
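
The two cut points passed to `np.split` determine the size of each segment. The small sketch below, using a hypothetical helper `split_sizes` on a shuffled index array rather than the real DataFrame, shows how fractional cut points translate into row counts.

```python
import numpy as np

def split_sizes(n_rows, cuts=(0.8, 0.9)):
    """Return the sizes of the chunks np.split produces for fractional cut points."""
    indices = np.random.default_rng(0).permutation(n_rows)  # shuffled row positions
    chunks = np.split(indices, [int(c * n_rows) for c in cuts])
    return [len(chunk) for chunk in chunks]

print(split_sizes(1000))              # [800, 100, 100]
print(split_sizes(1000, (0.6, 0.8)))  # [600, 200, 200]
```

Cut points of 0.8 and 0.9 therefore give an 80/10/10 split, while 0.6 and 0.8 would give 60/20/20.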

In [12]:
_, X_train, y_train = get_xy(train, overSample=True)
_, X_val, y_val = get_xy(val, overSample=False)
_, X_test, y_test = get_xy(test, overSample=False)
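
Note that `get_xy` fits a new `StandardScaler` on each split, so the validation and test data are standardized with their own means and standard deviations. A common alternative, sketched here with numpy as an assumption rather than as the project's method, is to compute the statistics on the training split only and reuse them for the other splits. The helper names and the demo arrays are hypothetical.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-column mean and standard deviation from the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std

def apply_standardizer(X, mean, std):
    """Standardize any split with the statistics learned from the training data."""
    return (X - mean) / std

# Demo arrays standing in for the real train and test features.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(5.0, 2.0, size=(200, 3))
X_test_demo = rng.normal(5.0, 2.0, size=(50, 3))

mean, std = fit_standardizer(X_train_demo)
X_train_scaled = apply_standardizer(X_train_demo, mean, std)
X_test_scaled = apply_standardizer(X_test_demo, mean, std)
print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

With this approach the test features are transformed exactly as the training features were, which keeps the model's inputs on a consistent scale at prediction time.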

# K-Nearest Neighbors (KNN)

In [13]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

In [14]:
knn_model = KNeighborsClassifier(n_neighbors=7)
knn_model.fit(X_train, y_train)

In [15]:
y_pred = knn_model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.61      0.53      0.57     35380
         1.0       0.39      0.47      0.42     22444

    accuracy                           0.50     57824
   macro avg       0.50      0.50      0.49     57824
weighted avg       0.52      0.50      0.51     57824



# Naive Bayes

In [16]:
from sklearn.naive_bayes import GaussianNB
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

In [17]:
y_pred = nb_model.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         0.0       0.61      0.05      0.09     35380
         1.0       0.39      0.95      0.55     22444

    accuracy                           0.40     57824
   macro avg       0.50      0.50      0.32     57824
weighted avg       0.52      0.40      0.27     57824



# Logistic Regression

In [18]:
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()
lr_model.fit(X_train,y_train)

In [19]:
y_pred = lr_model.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         0.0       0.61      0.54      0.57     35380
         1.0       0.39      0.45      0.42     22444

    accuracy                           0.51     57824
   macro avg       0.50      0.50      0.49     57824
weighted avg       0.52      0.51      0.51     57824



# Decision Trees

In [20]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

In [21]:
y_pred = dt_model.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         0.0       0.61      0.61      0.61     35380
         1.0       0.39      0.39      0.39     22444

    accuracy                           0.53     57824
   macro avg       0.50      0.50      0.50     57824
weighted avg       0.53      0.53      0.53     57824



# Random Forest

In [22]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

In [23]:
y_pred = rf_model.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         0.0       0.61      0.93      0.74     35380
         1.0       0.38      0.07      0.11     22444

    accuracy                           0.60     57824
   macro avg       0.49      0.50      0.43     57824
weighted avg       0.52      0.60      0.50     57824



# Bagging

In [24]:
from sklearn.ensemble import BaggingClassifier
b_model = BaggingClassifier()
b_model.fit(X_train, y_train)

In [25]:
y_pred = b_model.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         0.0       0.61      0.80      0.69     35380
         1.0       0.39      0.21      0.27     22444

    accuracy                           0.57     57824
   macro avg       0.50      0.50      0.48     57824
weighted avg       0.53      0.57      0.53     57824



# Conclusion

We evaluated six classification algorithms: K-Nearest Neighbors (KNN), Naive Bayes, Logistic Regression, Decision Trees, Random Forest, and Bagging. Of these, **Random Forest** achieved the highest test accuracy (0.60), followed by Bagging (0.57) and Decision Trees (0.53). On accuracy alone, Random Forest is the algorithm of choice for this synthetic practice dataset. The result should be read with care, however: class 0 makes up about 61% of the test set (35,380 of 57,824 rows), every model has a macro-averaged recall of 0.50, and Random Forest reaches its accuracy largely by predicting class 0 (recall of 0.93 for class 0 versus 0.07 for class 1). This outcome underscores the importance of selecting a classification method, and an evaluation metric, suited to the characteristics of the data.

Random Forest's lead in accuracy makes it a reasonable starting point for future refinements on this dataset. Because the gaps between models are small and the classes are imbalanced, its robustness and reliability should be confirmed in later work, for example by comparing per-class recall and F1 in addition to accuracy.
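
As a point of reference for the accuracies reported above, it can help to compare them with a majority-class baseline, that is, a trivial model that always predicts the most frequent class. The sketch below only redoes arithmetic on the supports and accuracies printed in the classification reports; the variable names are illustrative.

```python
# Class supports and accuracies copied from the classification reports above.
support = {0: 35380, 1: 22444}
accuracy = {
    "KNN": 0.50,
    "Naive Bayes": 0.40,
    "Logistic Regression": 0.51,
    "Decision Tree": 0.53,
    "Random Forest": 0.60,
    "Bagging": 0.57,
}

# Accuracy of a trivial model that always predicts the most frequent class.
baseline = max(support.values()) / sum(support.values())
print(f"majority-class baseline: {baseline:.3f}")  # 0.612

# List the models from highest to lowest accuracy with their gap to the baseline.
for name, acc in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {acc:.2f}  ({acc - baseline:+.3f} vs baseline)")
```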