# Introduction

#### In this Jupyter Notebook, I will be analysing the Iris dataset using pandas, NumPy, and Matplotlib.
#### The following are tasks that I will complete in my analysis:
1. Import Libraries
2. Load the Dataset
    - I will load the Iris dataset using Pandas. 
3. Describe the Dataset
    - I will describe the Iris dataset and identify the feature names and species names.
4. Summarise the Features
    - I will write a script that outputs the summary of each variable into a single text file.
5. Explore the Dataset
    - I will create a histogram for each feature.
    - I will create a boxplot for each feature.
    - I will create a scatterplot for each pair of features.
    - I will create a heatmap to display the mean feature value per species.

# 1. Import Libraries

In [5]:
# Dataframes
import pandas as pd

# Numpy
import numpy as np

# scikit-learn: machine learning library that includes sample datasets
import sklearn as skl 
from sklearn import datasets

# Plots
import matplotlib
from matplotlib import pyplot as plt

# 2. Load the Dataset

#### I downloaded the Iris dataset from the UC Irvine Machine Learning Repository (See: https://archive.ics.uci.edu/dataset/53/iris).
#### I added the dataset (iris.csv) to my repository (zoeharlowe/pands-project).
#### I researched the Pandas pd.read_csv() documentation to find out how to set column names (See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In [None]:
# Set column names
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Read in the dataset 
df = pd.read_csv("iris.csv", names = column_names)

# Display first 5 rows of the dataset
df.head(5)


,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
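
As a minimal sketch of how `names` behaves in `pd.read_csv()`: when column names are passed explicitly, pandas treats the file as having no header row, which is the same as passing `header=None`. The example below uses a small in-memory stand-in for `iris.csv` (the variable `sample_csv` and its three rows are assumptions for illustration, not the full dataset), so it runs without the file.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris.csv: three header-less rows in the same layout
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None states explicitly that the first line is data, not column labels;
# pandas already assumes this when names is given without header
sample_df = pd.read_csv(sample_csv, header=None, names=column_names)

print(sample_df.shape)
print(sample_df.dtypes)
```

Writing `header=None` alongside `names` is optional here, but it makes the intent visible to a reader who does not know the default.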


#### I also displayed the last 5 rows of the dataset.

In [31]:
# Display the last 5 rows of the dataset
df.tail(5)

,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


# 3. Describe the Dataset

#### I identified the feature names.

In [58]:
# Display feature names
column_names = df.columns.values
print(column_names)

['sepal_length' 'sepal_width' 'petal_length' 'petal_width' 'species']


#### I identified the three class types.

In [57]:
# Find unique values in the species column.
unique_values = np.unique(df["species"])

print(unique_values)

['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
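
The same lookup can be sketched with pandas alone. `Series.unique()` returns the distinct labels in order of first appearance (whereas `np.unique()` returns them sorted), and `Series.value_counts()` also reports how many rows carry each label. The frame `toy` below is a hypothetical four-row example, not the real dataset.

```python
import pandas as pd

# Small illustrative frame; the rows are placeholders, not the real dataset
toy = pd.DataFrame(
    {"species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"]}
)

# Distinct labels, in order of first appearance
species_names = toy["species"].unique()

# Number of rows per label
species_counts = toy["species"].value_counts()

print(species_names)
print(species_counts)
```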


#### I used the describe() method to find the count, mean, standard deviation, minimum, maximum, and quartiles (25%, 50%, 75%) of each numeric feature.

In [32]:
df.describe()

,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
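
`describe()` reports the quartiles, not the interquartile range itself. The interquartile range is the 75% value minus the 25% value; for `sepal_length` in the table above that is 6.4 - 5.1 = 1.3. A minimal sketch of deriving it from the `describe()` output, using a hypothetical five-row frame `toy` in place of the full dataset:

```python
import pandas as pd

# Illustrative values only; not the full Iris dataset
toy = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3],
        "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0],
    }
)

summary = toy.describe()

# Interquartile range per feature: third quartile minus first quartile
iqr = summary.loc["75%"] - summary.loc["25%"]

print(iqr)
```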


# 4. Summarise the Features

#### I created a text file (summary.txt) and used a for loop to write information about each feature into the text file.

#### I used this article on GeeksForGeeks to help me create the text file: https://www.geeksforgeeks.org/create-a-new-text-file-in-python/

#### Below you can see that I used a lambda function to find the number of features by counting the float values in the first row. I did this with the help of ChatGPT (See: https://chatgpt.com/share/68023853-726c-8000-901f-72d720dfc9bf)

In [None]:
# Create file
FILENAME = "summary.txt"

with open(FILENAME, 'w') as f:

    # Title
    f.write("Iris Dataset Summary\n")
    f.write("====================================\n\n")

    # Overall summary
    f.write("OVERALL SUMMARY\n")
    f.write(f"Shape of dataset: \t {df.shape} \n") # shape
    f.write(f"Number of species: \t {len(unique_values)} \n") # number of species

    # Number of features - the lambda counts the float values in each row; the first row's count is used
    float_count = df.apply(lambda row: sum(isinstance(x, float) for x in row), axis=1).iloc[0]
    f.write(f"Number of features:\t {float_count} \n") # number of features
    
    f.write(f"Species names:\t\t {unique_values} \n") # species names
    f.write(f"Feature names:\t\t {column_names} \n") # variable names
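
An alternative to counting float values row by row is to count the columns that have a numeric dtype, and then loop over those columns to write one line per feature. The sketch below assumes that every numeric column is a feature; the frame `toy` is a hypothetical two-row example, and an in-memory `io.StringIO` buffer stands in for the open `summary.txt` file so that nothing on disk is changed.

```python
import io

import pandas as pd

# Illustrative rows only; not the full Iris dataset
toy = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0],
        "sepal_width": [3.5, 3.2],
        "petal_length": [1.4, 4.7],
        "petal_width": [0.2, 1.4],
        "species": ["Iris-setosa", "Iris-versicolor"],
    }
)

# Assumption: columns with a numeric dtype are the features
numeric_columns = toy.select_dtypes(include="number").columns
feature_count = len(numeric_columns)

# io.StringIO stands in for the file handle f in this sketch
buffer = io.StringIO()
buffer.write(f"Number of features: {feature_count}\n")

# One summary line per feature
for name in numeric_columns:
    column = toy[name]
    buffer.write(f"{name}: min={column.min()}, max={column.max()}\n")

print(buffer.getvalue())
```

Selecting by dtype does not depend on the first row, so it still works if a value is missing or stored as an integer.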

#### Now that I have a general summary of the dataset, I want to talk about the variables in this dataset and give a summary of each one.
#### I used code to show the number of samples of each species recorded in the dataset. I created a simple count function that counts how many times a species name appears in the species column.

In [None]:
# Function count_function(): count the rows that belong to a given species
def count_function(species_name):
    return (df["species"] == species_name).sum()

In [None]:
# Setosa summary
FILENAME = "summary.txt"

# Open file in append mode to avoid overwriting the previous content
with open(FILENAME, 'a') as f:
    f.write("\nSETOSA SUMMARY\n")

    f.write(f"Number of samples: {count_function('Iris-setosa')} \n")
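
The per-species count could also be produced for every species in one pass with `Series.value_counts()`, not one call per species name. This is a sketch under stated assumptions: `toy` is a hypothetical four-row frame, and the output goes to a hypothetical file `species_counts.txt` in a temporary directory so that `summary.txt` is left untouched.

```python
import os
import tempfile

import pandas as pd

# Illustrative labels only; not the full Iris dataset
toy = pd.DataFrame(
    {"species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"]}
)

# Hypothetical output path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "species_counts.txt")

# Append mode mirrors the approach above: earlier content would be preserved
with open(path, "a") as f:
    f.write("SAMPLES PER SPECIES\n")
    for name, count in toy["species"].value_counts().items():
        f.write(f"{name}: {count}\n")

# Read the file back to check what was written
with open(path) as f:
    written = f.read()

print(written)
```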