## Step 1: Import Required Libraries

We import all necessary libraries for:

- Data manipulation (`pandas`, `numpy`)
- Visualization (`matplotlib`, `seaborn`)
- Data preprocessing and model training (`scikit-learn`)


In [30]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, f1_score


## Step 2: Load the Data

We load the Raisin dataset and clean the column names:
- Strip leading/trailing spaces
- Convert to lowercase
- Replace spaces with underscores


In [None]:
df = pd.read_csv('Raisin_Dataset.csv')
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.head()


## Step 3: Clean the Dataset

We convert the categorical class labels to binary numeric values:
- `'Kecimen'` → 0
- `'Besni'` → 1

This prepares the target column for binary classification.


In [None]:
df['class'] = df['class'].replace({'Kecimen': '0', 'Besni': '1'})
df['class'] = pd.to_numeric(df['class'], errors='coerce')


## Step 4: Prepare Feature Matrix and Target Vector

We select the numeric features to use as input (`X`) and define the target variable (`y`).
Then, we split the dataset into training and test sets using an 80/20 split.


In [32]:
X = df.drop(columns=['area', 'majoraxislength', 'minoraxislength', 'eccentricity', 'convexarea', 'extent', 'perimeter']).select_dtypes(include=np.number)
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Step 5: Normalize Features

Logistic Regression assumes input features are on a similar scale.  
We apply standardization using `StandardScaler`, which transforms the data to have:

- Mean = 0  
- Standard Deviation = 1


In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


## Step 6: Train the Logistic Regression Model

We initialize and train a logistic regression model on the normalized training data.


In [34]:
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)


## Step 7: Predict and Evaluate the Model

We make predictions on the test set and evaluate performance using:

- **Classification report** (precision, recall, F1)
- **Accuracy score**
- **F1 score (macro & weighted)** for deeper insights, especially useful for imbalanced datasets


In [None]:
y_pred = logreg.predict(X_test_scaled)

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:", accuracy_score(y_test, y_pred))

f1_macro = f1_score(y_test, y_pred, average='macro')
f1_weighted = f1_score(y_test, y_pred, average='weighted')

print("F1 Score (macro):", f1_macro)
print("F1 Score (weighted):", f1_weighted)


## Step 8: Visualize the Confusion Matrix

The confusion matrix helps us understand how many observations were correctly and incorrectly classified for each class.


In [None]:
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=np.unique(y), yticklabels=np.unique(y))
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


## Step 9: Predict Class Probabilities (Optional)

Logistic Regression can also output predicted probabilities for each class.  
This is useful for confidence analysis or threshold tuning.


In [None]:
y_proba = logreg.predict_proba(X_test_scaled)
print("First 5 predicted probabilities:")
print(y_proba[:5])


## 🔍 Model Performance Interpretation

The model achieved **perfect scores** across all evaluation metrics:

- **Accuracy**: 1.00  
- **Precision, Recall, F1-score (both classes)**: 1.00  
- **F1 Score (macro & weighted)**: 1.00  
- **Confusion Matrix**: Zero misclassifications

### What this means:
- The logistic regression model has **perfectly separated** the two classes in this dataset.
- All predictions were **correct** on the test set.
- There are **no false positives or false negatives**.

### Important Notes:
- Such perfect results are rare in real-world applications.
- Check for **data leakage**, overly simple data, or too few samples before assuming the model will generalize well to unseen data.
