# **Principles of Data Analytics - Tasks**

### Authored by: Stephen Kerr

#### **Assessment Links**

- The task descriptions are outlined in the following link: [Assessment Description][def1]
- The marking scheme is outlined in the following link: [Assessment Marking Scheme][def2]


[def1]: https://github.com/ianmcloughlin/principles_of_data_analytics/blob/main/assessment/tasks.md
[def2]: https://github.com/ianmcloughlin/principles_of_data_analytics/blob/main/assessment/instructions.md

## **Task 1: Source the Data Set**


### **Task 1 Description**

Import the Iris data set from the sklearn.datasets module.  
Explain, in your own words, what the load_iris() function returns.

### **Task 1 Submission:**

The **load_iris()** function loads the Iris dataset, which is a classic multi-class classification dataset.  
The dataset is returned as a *'Bunch'*, which is a dictionary-like object with the following attributes:  
- **'data'** which is the data matrix.
- **'target'** which is the classification target.
- **'feature_names'** which is a list of the dataset columns.
- **'target_names'** which is a list of the target classes.
- **'DESCR'** which is a string containing a full description of the dataset.
- **'filename'** which is a string showing the path to the location of the data.

The Iris data was loaded with the parameter ***'as_frame'*** set to *True*, resulting in:
- the **'data'** attribute being a pandas DataFrame.
- the **'target'** attribute being a pandas Series.

There is also an *additional attribute*, called **'frame'**, when load_iris() is called with ***'as_frame'*** = *True*. It is a pandas DataFrame that combines the data and the target.
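
To make "dictionary-like object" concrete, the sketch below defines a small stand-in class, hypothetically named `MiniBunch`, that behaves the way the Bunch is described above: every entry can be read either with a key or as an attribute. It is an illustration only and is not scikit-learn's own implementation; the sample values are copied from the outputs in this notebook.

```python
class MiniBunch(dict):
    """Hypothetical stand-in for a Bunch: a dict whose keys are also attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Sample values copied from the notebook output, not the full dataset.
mini_iris = MiniBunch(
    target_names=['setosa', 'versicolor', 'virginica'],
    feature_names=['sepal length (cm)', 'sepal width (cm)',
                   'petal length (cm)', 'petal width (cm)'],
)

# Key access and attribute access return the same object.
print(mini_iris['target_names'])
print(mini_iris.target_names)
print(list(mini_iris.keys()))
```

This is why both `iris['frame']` and `iris.frame` style access can be used interchangeably on the loaded data.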

---

## References: 

1. [scikit-learn documentation for load_iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html)
2. [Markdown Syntax Cheatsheet](https://www.markdown-cheatsheet.com/)

In [9]:
# Import the loader from the sklearn.datasets module.
# (Reaching 'datasets' through a bare 'import sklearn' can fail on some scikit-learn versions.)
from sklearn.datasets import load_iris

# Load the Iris data set as the Bunch 'iris'.
# Note: the parameter as_frame=True returns the data as pandas objects.
iris = load_iris(as_frame=True)

# Print the attributes / keys of the 'iris' Bunch.
print('The following are the Attributes of the Iris Data \'Bunch\':')
for key in iris.keys():
    if key == 'frame':
        print(f'\t{key} - This is the main source of the data.')
    else:
        print(f'\t{key}')

# Print out the target names.
print(iris['target_names'])

# In the Iris Bunch, the 'frame' DataFrame is the main store of data.
# To complete the DataFrame we add a 'species' column that assigns each row
# to the appropriate species class ['setosa' 'versicolor' 'virginica'].
# The map() method translates each integer in the 'target' column using a
# dictionary whose values follow the order of the 'target_names' list.
iris['frame']['species'] = iris['frame']['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Assign iris['frame'] to 'iris_dataframe'.
iris_dataframe = iris['frame']

The following are the Attributes of the Iris Data 'Bunch':
	data
	target
	frame - This is the main source of the data.
	target_names
	DESCR
	feature_names
	filename
	data_module
['setosa' 'versicolor' 'virginica']
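
The lookup from integer target codes to species names does not have to be typed by hand; it can be built from the names themselves. A minimal sketch using plain Python lists in place of the pandas columns (the variable names here are illustrative and do not appear in the notebook):

```python
# Class names in the order printed above; the position of each name is its target code.
target_names = ['setosa', 'versicolor', 'virginica']

# enumerate() pairs each name with its index, giving a code -> name dictionary.
code_to_species = dict(enumerate(target_names))

# A few example target codes, standing in for the 'target' column.
targets = [0, 0, 1, 2, 2]
species = [code_to_species[code] for code in targets]

print(code_to_species)
print(species)
```

A dictionary built this way can be passed to the pandas `.map()` method in the same way as the hand-written one in the cell above.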


In [10]:


iris_dataframe.head().style

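| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | species |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |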


## **Task 2: Explore the Data Structure**

### **Task 2 Description:** 

Print and explain the shape of the data set, the first and last 5 rows of the data, the feature names, and the target classes.

### **Task 2 Submission**



The target classes were assigned to the ***'iris_dataframe'*** in the previous section (Task 1).

In [33]:
# Shape of the Iris dataset. Note: the 6 columns are the 4 features, 'target', and the 'species' column added in Task 1.
print(f'The shape of the Iris Dataset (\'iris_dataframe\') is: '
      f'\n \tRows (instances) = {iris_dataframe.shape[0]},'
      f'\n \tColumns (features) = {iris_dataframe.shape[1]},\n')

The shape of the Iris Dataset ('iris_dataframe') is: 
 	Rows (instances) = 150,
 	Columns (features) = 6,



In [34]:
# The first 5 rows of the Iris data set, using the pandas .head() method and the .style property:
# Reference: https://www.delftstack.com/howto/python-pandas/pandas-display-dataframe-in-a-table-style/
iris_dataframe.head().style

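| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | species |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |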


In [13]:
# The last 5 rows of the Iris data set, using the pandas .tail() method and the .style property:
iris_dataframe.tail().style

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | species |
|---|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | virginica |


In [50]:
# The feature names and target columns are:
print('The following are all the feature names and target classes for the Iris Dataset:')
for i, key in enumerate(iris_dataframe.columns, start=1):
    if key in ('target', 'species'):
        print(f'\t {i}. {key} is a target class.')
    else:
        print(f'\t {i}. {key} is a feature name.')


# Note: in Task 1 we added a new column that assigns each row to the appropriate target class.
# The 'target' column could be replaced by 'species' to cut down on data.

The following are all the feature names and target classes for the Iris Dataset:
	 1. sepal length (cm) is a feature name.
	 2. sepal width (cm) is a feature name.
	 3. petal length (cm) is a feature name.
	 4. petal width (cm) is a feature name.
	 5. target is a target class.
	 6. species is a target class.
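
The loop above walks the DataFrame columns. The feature names and the three target classes themselves can also be read directly from the Bunch returned by `load_iris()`. A short sketch, assuming scikit-learn is installed; the variable names are illustrative:

```python
from sklearn.datasets import load_iris

iris_bunch = load_iris(as_frame=True)

# The four measured features (these are the first four DataFrame columns).
feature_names = list(iris_bunch.feature_names)

# The three target classes; the position of each name is its integer code.
target_classes = [str(name) for name in iris_bunch.target_names]

print(f'Feature names: {feature_names}')
print(f'Target classes: {target_classes}')
```

Reading the names this way keeps the distinction clear: the feature names are column labels, while the target classes are the values that the 'species' column can take.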


In [69]:
# Subsetting the 'iris_dataframe' by 'species' (reference: DataCamp)

# creating the 'setosa_Dataframe'
setosa_Dataframe = iris_dataframe[iris_dataframe['species'] == 'setosa']
# displaying the total number of 'setosa' in the dataset
print(f'The number of setosa in the dataset = {setosa_Dataframe.shape[0]}'
      f'\nWhich is a percentage of {(setosa_Dataframe.shape[0]/iris_dataframe.shape[0])*100}%')

# creating the 'versicolor_Dataframe'
versicolor_Dataframe = iris_dataframe[iris_dataframe['species'] == 'versicolor']
# displaying the total number of 'versicolor' in the dataset
print(f'The number of versicolor in the dataset = {versicolor_Dataframe.shape[0]}'
      f'\nWhich is a percentage of {(versicolor_Dataframe.shape[0]/iris_dataframe.shape[0])*100}%')

# creating the 'virginica_Dataframe'
virginica_Dataframe = iris_dataframe[iris_dataframe['species'] == 'virginica']
# displaying the total number of 'virginica' in the dataset
print(f'The number of virginica in the dataset = {virginica_Dataframe.shape[0]}'
      f'\nWhich is a percentage of {(virginica_Dataframe.shape[0]/iris_dataframe.shape[0])*100}%')



The number of setosa in the dataset = 50
Which is a percentage of 33.33333333333333%
The number of versicolor in the dataset = 50
Which is a percentage of 33.33333333333333%
The number of virginica in the dataset = 50
Which is a percentage of 33.33333333333333%
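
The per-species counts above can also be obtained in a single step with `value_counts()`. The sketch below assumes pandas and rebuilds only the 'species' column from the counts reported above (50 rows per class) so that it runs on its own; in the notebook the column would come from `iris_dataframe['species']`.

```python
import pandas as pd

# Stand-in for iris_dataframe['species'], rebuilt from the counts printed above.
species = pd.Series(['setosa'] * 50 + ['versicolor'] * 50 + ['virginica'] * 50,
                    name='species')

# Number of rows in each class.
counts = species.value_counts()

# Share of each class as a percentage, rounded for readability.
percentages = (species.value_counts(normalize=True) * 100).round(2)

for name in counts.index:
    print(f'{name}: {counts[name]} rows ({percentages[name]}%)')
```

Equal counts across the three classes show that the dataset is balanced, which is worth knowing before comparing per-species statistics.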


## **Task 3: Summarize the Data**

### **Task 3 Description:** 

For each feature in the dataset, calculate and display:

- mean
- minimum
- maximum
- standard deviation
- median

### **Task 3 Submission**

#### **Feature Name - Sepal Length (cm)**

The code below calculates the mean, minimum, maximum, standard deviation and median (the 50% row) of the ***'Sepal Length'*** across the whole Iris dataset, followed by the mean within each species subset.

In [80]:
# Column under analysis. Keeping it in a variable avoids nesting the same quote
# character inside the f-strings, which is a SyntaxError before Python 3.12.
feature = 'sepal length (cm)'

# summary statistics for the 'sepal length (cm)' feature across the whole data set
# (describe() reports count, mean, std, min, quartiles and max; the 50% row is the median)
print(f'Summary statistics for "Sepal Length (cm)" across the whole dataset:\n'
      f'{iris_dataframe[feature].describe()}')

# mean for the 'sepal length (cm)' feature in the setosa subset
print(f'The Mean "Sepal Length (cm)" in the setosa subset is: '
      f'{setosa_Dataframe[feature].mean()}')

# mean for the 'sepal length (cm)' feature in the versicolor subset
print(f'The Mean "Sepal Length (cm)" in the versicolor subset is: '
      f'{versicolor_Dataframe[feature].mean()}')

# mean for the 'sepal length (cm)' feature in the virginica subset
print(f'The Mean "Sepal Length (cm)" in the virginica subset is: '
      f'{virginica_Dataframe[feature].mean()}')

Summary statistics for "Sepal Length (cm)" across the whole dataset:
count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sepal length (cm), dtype: float64
The Mean "Sepal Length (cm)" in the setosa subset is: 5.006
The Mean "Sepal Length (cm)" in the versicolor subset is: 5.936
The Mean "Sepal Length (cm)" in the virginica subset is: 6.587999999999998
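
The same five statistics can be gathered for every feature in one call. The sketch below assumes pandas and uses only the five rows shown by `.head()` earlier as sample data, so its numbers differ from the full-dataset results above. `DataFrame.agg()` with a list of function names returns one row per statistic and one column per feature.

```python
import pandas as pd

# Sample data: the first five rows of the Iris DataFrame as displayed earlier.
sample = pd.DataFrame({
    'sepal length (cm)': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal width (cm)': [3.5, 3.0, 3.2, 3.1, 3.6],
    'petal length (cm)': [1.4, 1.4, 1.3, 1.5, 1.4],
    'petal width (cm)': [0.2, 0.2, 0.2, 0.2, 0.2],
})

# One row per statistic, one column per feature.
# Note: pandas' 'std' is the sample standard deviation (ddof=1) by default.
summary = sample.agg(['mean', 'min', 'max', 'std', 'median'])

print(summary.round(3))
```

In the notebook, `sample` would be replaced by the four feature columns of `iris_dataframe`, and grouping by 'species' before aggregating could repeat the calculation for each subset.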


## Task 4: Visualize Features

### Task Description: 

XYZ

## Task 5: Investigate Relationships

### Task Description: 

XYZ

## Task 6: Analyze Relationships

### Task Description: 

XYZ

## Task 7: Analyze Class Distributions

### Task Description: 

XYZ

## Task 8: Compute Correlations

### Task Description: 

XYZ

## Task 9: Fit a Simple Linear Regression

### Task Description: 

XYZ

## Task 10: Too Many Features 

### Task Description: 

XYZ

# End