In this project we classify Iris flowers into one of three species: **Setosa**, **Virginica**, or **Versicolour**. The dataset comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris). It has 150 rows and 5 columns: 4 feature columns and 1 target column.

The four numeric feature columns are:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm 

and the one target column is:
1. species
    - Iris Setosa
    - Iris Versicolour
    - Iris Virginica



# Import libraries

In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [9]:
data = pd.read_csv('../data/raw/iris.data')
data.head()

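|   | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
|---|-----|-----|-----|-----|-------------|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |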


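The dataset file doesn't have a header row, so pandas used the first data row (5.1, 3.5, 1.4, 0.2, Iris-setosa) as the column names. As a result, the DataFrame holds 149 rows instead of the 150 in the file. We have to give the columns proper names. As discussed above, there are 4 feature columns, **sepal-length, sepal-width, petal-length, petal-width**, and 1 target column, **species**.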

In [10]:
data.columns = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
data.head()

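|   | sepal-length | sepal-width | petal-length | petal-width | species     |
|---|--------------|-------------|--------------|-------------|-------------|
| 0 | 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa |
| 1 | 4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa |
| 2 | 4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa |
| 3 | 5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa |
| 4 | 5.4          | 3.9         | 1.7          | 0.4         | Iris-setosa |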
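
Renaming the columns after loading does not bring back the first row that pandas consumed as a header. One way to avoid losing it, shown here only as a sketch, is to pass `header=None` and `names=` to `pd.read_csv`. The in-memory `raw` buffer and the variable names `sample` and `column_names` are stand-ins chosen for illustration; with the real file you would pass the path instead.

```python
import io
import pandas as pd

# Three sample rows in the same header-less format as iris.data
# (an in-memory stand-in so the sketch runs without the file).
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

column_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']

# header=None tells pandas there is no header row in the file;
# names= assigns the column labels in the same call.
sample = pd.read_csv(raw, header=None, names=column_names)

print(sample.shape)                    # every row is kept as data
print(sample.iloc[0]['sepal-length'])  # first data row is no longer a header
```

Applied to the full file, this approach should load all 150 rows, so the counts in the outputs that follow would differ by one.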


In [12]:
data.describe()

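|       | sepal-length | sepal-width | petal-length | petal-width |
|-------|--------------|-------------|--------------|-------------|
| count | 149.0        | 149.0       | 149.0        | 149.0       |
| mean  | 5.848322     | 3.051007    | 3.774497     | 1.205369    |
| std   | 0.828594     | 0.433499    | 1.759651     | 0.761292    |
| min   | 4.3          | 2.0         | 1.0          | 0.1         |
| 25%   | 5.1          | 2.8         | 1.6          | 0.3         |
| 50%   | 5.8          | 3.0         | 4.4          | 1.3         |
| 75%   | 6.4          | 3.3         | 5.1          | 1.8         |
| max   | 7.9          | 4.4         | 6.9          | 2.5         |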


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal-length  149 non-null    float64
 1   sepal-width   149 non-null    float64
 2   petal-length  149 non-null    float64
 3   petal-width   149 non-null    float64
 4   species       149 non-null    object 
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


# Data Preparation

Check whether the data contains any null values.

In [16]:
data.isnull().any().mean()

0.0
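
The expression `data.isnull().any().mean()` returns the fraction of columns that contain at least one null, so `0.0` means no column has a gap. As a small sketch of an alternative, `isnull().sum()` reports the count per column, which is easier to read when something is missing. The `toy` frame below is a hypothetical example with one deliberately missing value, not part of the Iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only.
toy = pd.DataFrame({
    'sepal-length': [5.1, np.nan, 4.7],
    'sepal-width': [3.5, 3.0, 3.2],
})

# Fraction of columns that contain at least one null
# (the same computation used on the Iris data above).
fraction_of_columns_with_nulls = toy.isnull().any().mean()

# Number of nulls in each column, which pinpoints where the gaps are.
nulls_per_column = toy.isnull().sum()

print(fraction_of_columns_with_nulls)
print(nulls_per_column)
```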

There are no null values in the data.
Now let's check the data types of the columns.

In [18]:
data.dtypes

sepal-length    float64
sepal-width     float64
petal-length    float64
petal-width     float64
species          object
dtype: object

All four feature columns are of **float** type, but the target column is of **object** type. We will convert the target column from object to a more specific type, either int or category.

Let's change the data type of the species column to **category**.

In [19]:
data['species'] = data['species'].astype('category')

In [20]:
data.dtypes

sepal-length     float64
sepal-width      float64
petal-length     float64
petal-width      float64
species         category
dtype: object
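
A category column stores each label as an integer code behind the scenes, so the int representation mentioned above is available from the same column. The sketch below uses a short hypothetical `species` series rather than the full dataset and shows how the `.cat` accessor exposes both the categories and their codes.

```python
import pandas as pd

# Hypothetical short series with the three species labels.
species = pd.Series(
    ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
).astype('category')

# String categories are sorted by default; each code is an index into them.
categories = list(species.cat.categories)
codes = list(species.cat.codes)

print(categories)
print(codes)
```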

Split the data into features **X** and class **y**.

In [22]:
X = data.drop('species', axis=1)
y = data['species']

In [23]:
X.head()

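|   | sepal-length | sepal-width | petal-length | petal-width |
|---|--------------|-------------|--------------|-------------|
| 0 | 4.9          | 3.0         | 1.4          | 0.2         |
| 1 | 4.7          | 3.2         | 1.3          | 0.2         |
| 2 | 4.6          | 3.1         | 1.5          | 0.2         |
| 3 | 5.0          | 3.6         | 1.4          | 0.2         |
| 4 | 5.4          | 3.9         | 1.7          | 0.4         |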


In [24]:
y.head()

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: species, dtype: category
Categories (3, object): ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

### Split data into training and testing data

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
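
The call above holds out 40% of the rows for testing and fixes the shuffle with `random_state=42`. As an optional variation that the original notebook does not use, `train_test_split` also accepts a `stratify` argument that keeps class proportions equal in both subsets. The sketch below demonstrates it on hypothetical balanced toy data (`toy_X`, `toy_y`), not on the Iris DataFrame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 rows per class, one feature.
labels = ['Iris-setosa'] * 10 + ['Iris-versicolor'] * 10 + ['Iris-virginica'] * 10
toy_X = pd.DataFrame({'petal-length': [float(i) for i in range(30)]})
toy_y = pd.Series(labels, name='species')

# stratify=toy_y (an addition for illustration) keeps the class
# proportions the same in the training and testing subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.4, random_state=42, stratify=toy_y
)

test_counts = y_te.value_counts().to_dict()

print(X_tr.shape, X_te.shape)
print(test_counts)
```

Whether stratification matters here depends on how balanced the classes are; with three equally sized classes a plain random split is usually close to balanced already.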