## Iris Dataset Tutorial

This tutorial walks through a complete classification workflow on the Iris dataset:

- **Load the dataset:** We'll start by loading the Iris dataset.

- **Exploratory Data Analysis (EDA):** We'll explore the dataset to gain insights into the data and understand its structure.
 
- **Data Preprocessing:** We'll check for missing values, split the data into training and testing sets, and scale the features.

- **Model Selection:** We'll choose an appropriate machine learning model for the classification task.

- **Model Training:** We'll train the selected model on the training data.

- **Model Evaluation:** We'll evaluate the model's performance on the testing data.

In [1]:
# Step 1: Load the dataset
from sklearn.datasets import load_iris
import pandas as pd

In [2]:
iris = load_iris()
data = iris.data
target = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

In [3]:
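# Build a DataFrame with one column per feature
df = pd.DataFrame(data, columns=feature_names)
# Map the integer class labels (0, 1, 2) to their species names
df['species'] = target_names[target]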

In [4]:
# Step 2: Exploratory Data Analysis (EDA)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
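
The non-null counts above suggest there is nothing to impute. As a minimal sketch, the same check can be made explicit with `isna()`, along with a count of duplicated rows. The variable names `missing_per_column` and `n_duplicates` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

# Count missing values per column; all zeros means no imputation is needed
missing_per_column = df.isna().sum()
print(missing_per_column)

# Count fully duplicated rows as another quick data-quality check
n_duplicates = int(df.duplicated().sum())
print(f'Duplicate rows: {n_duplicates}')
```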


In [6]:
df.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [7]:
df['species'].value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64
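
The classes are perfectly balanced, so plain accuracy is a reasonable metric later on. To see how the features differ between species, one possible extra EDA step (not in the original notebook) is a per-species mean; `species_means` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

# Average each measurement within each species
species_means = df.groupby('species').mean()
print(species_means.round(3))
```

Features whose means differ strongly between species, such as petal length, are likely to be the most useful for classification.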

In [8]:
# Step 3: Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [9]:
# Features: every column except the label
X = df.drop('species', axis=1)
# Target: the species name
y = df['species']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
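
The split above is purely random, so the class counts in the test set can differ slightly between species. As an alternative sketch (an assumption, not what the notebook does), passing `stratify=y` keeps the class proportions equal in both subsets. The `_strat` suffix marks these variables as illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Rebuild features and target so this snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]
X = df.drop('species', axis=1)
y = df['species']

# stratify=y preserves the 50/50/50 class balance in both subsets
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each species contributes the same number of test samples
print(y_test_strat.value_counts())
```

Note that switching to a stratified split changes which samples land in the test set, so the scores reported later in this tutorial would not be reproduced exactly.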

In [11]:
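scaler = StandardScaler()
# Fit the scaler on the training data only, so no test-set statistics leak into training
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training mean and standard deviation to transform the test data
X_test_scaled = scaler.transform(X_test)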
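
A quick sanity check, sketched here as an optional addition, is to confirm that the scaled training features have mean 0 and standard deviation 1. The test features are transformed with the training statistics, so their means are generally close to, but not exactly, zero.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the split and scaling so this snippet runs on its own
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Per-feature statistics of the scaled training data
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print('Train means:', train_means.round(3))
print('Train stds: ', train_stds.round(3))

# The test set uses the training statistics, so its means are not exactly zero
print('Test means: ', X_test_scaled.mean(axis=0).round(3))
```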

In [12]:
# Step 4: Model Selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [13]:
classifiers = {
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(),
    'Random Forest': RandomForestClassifier(random_state=42)
}

In [14]:
# Compare the candidates with 5-fold cross-validation on the training set
for name, classifier in classifiers.items():
    scores = cross_val_score(classifier, X_train_scaled, y_train, cv=5)
    print(f'{name} accuracy: {scores.mean():.2f}')

KNN accuracy: 0.93
SVM accuracy: 0.95
Random Forest accuracy: 0.95
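
One caveat: the cross-validation above runs on data that was scaled before being split into folds, so each validation fold has already influenced the scaler. On Iris the effect is probably negligible, but a common way to avoid it is to wrap the scaler and the model in a pipeline, so scaling is refit inside every fold. The sketch below uses `SVC` as an example; the name `pipeline` is illustrative and not from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Recreate the training split so this snippet runs on its own
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is refit on the training part of each fold
pipeline = make_pipeline(StandardScaler(), SVC())

# Pass the unscaled training data; the pipeline handles scaling
pipeline_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f'SVM (pipeline) accuracy: {pipeline_scores.mean():.2f}')
```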


In [15]:
# Step 5: Model Training
# Random Forest and SVM tied at 0.95 in cross-validation; Random Forest is used here
classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train_scaled, y_train)
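
Once a random forest is fitted, its `feature_importances_` attribute gives a rough view of which measurements drive the predictions. This is an optional sketch rather than part of the original workflow; `importances` is an illustrative name. On Iris, the petal measurements typically rank highest.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the training data so this snippet runs on its own
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target_names[iris.target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train_scaled, y_train)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(
    classifier.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
print(importances.round(3))
```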

In [16]:
# Step 6: Model Evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions on the test set
y_pred = classifier.predict(X_test_scaled)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy:.2f}')

Test Accuracy: 1.00


In [17]:
# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



In [18]:
# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
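
The raw matrix is easier to read with labels: rows are the true species and columns are the predicted species. The following sketch wraps the matrix in a DataFrame; `cm_df` and the `true`/`pred` prefixes are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the trained model and predictions so this snippet runs on its own
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target_names[iris.target], name='species')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)

# Fix the label order explicitly so rows and columns are unambiguous
labels = list(iris.target_names)
cm = confusion_matrix(y_test, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f'true {label}' for label in labels],
    columns=[f'pred {label}' for label in labels],
)
print(cm_df)
```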


In [19]:
# Optional: export the dataset to a CSV file
import pandas as pd
from sklearn.datasets import load_iris

# Reload the Iris dataset so this cell can run on its own
iris = load_iris()

# Build the DataFrame without overwriting X and y from the earlier steps
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

# Save the DataFrame as a CSV file, without the index column
df.to_csv('iris_dataset.csv', index=False)
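
To confirm the export worked, the file can be read back with `pd.read_csv` and compared with the original DataFrame. This sketch writes to a temporary directory instead of the working directory, which is an assumption made only to keep the example free of side effects.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

# Write to a temporary location, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'iris_dataset.csv')
    df.to_csv(csv_path, index=False)
    df_loaded = pd.read_csv(csv_path)

# The reloaded data should have the same shape and column names
print(df_loaded.shape)
print(list(df_loaded.columns) == list(df.columns))
```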