# **Principles of Data Analytics - Tasks**

### Authored by: Stephen Kerr

#### **Assessment Links**

- The Task Descriptions are outlined in the following link: [Assessment Description][def1]
- The Marking Scheme is outlined in the following link: [Assessment Marking Scheme][def2]


[def1]: https://github.com/ianmcloughlin/principles_of_data_analytics/blob/main/assessment/tasks.md
[def2]: https://github.com/ianmcloughlin/principles_of_data_analytics/blob/main/assessment/instructions.md

## **Task 1: Source the Data Set**


### **Task 1 Description**

Import the Iris data set from the sklearn.datasets module.  
Explain, in your own words, what the load_iris() function returns.

### **Task 1 Submission:**

The **load_iris()** function loads the Iris dataset, which is a classic multi-class classification dataset.  
The dataset is returned as a *'Bunch'*, which is a dictionary-like object with the following attributes:  
- **'data'** which is the data matrix.
- **'target'** which is the classification target.
- **'feature_names'** which is a list of the dataset columns.
- **'target_names'** which is a list of the target classes.
- **'DESCR'** which is a string containing a full description of the dataset.
- **'filename'** which is a string showing the path to the location of the data.

The iris data was loaded with the parameter ***'as_frame'*** set to *True*, resulting in:
- The **'data'** attribute being a pandas DataFrame.
- The **'target'** attribute being a pandas Series.

When load_iris() is called with ***'as_frame'*** = *True*, there is also an *additional attribute* called **'frame'**, which is a pandas DataFrame combining data and target.
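
As a minimal sketch of the points above, the cell below loads the data set twice, with and without `as_frame=True`, and checks what each attribute holds. The names `bunch_arrays`, `bunch_frames` and `same_object` are illustrative only and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd

# Default call: 'data' and 'target' are NumPy arrays.
bunch_arrays = load_iris()
# as_frame=True: 'data' is a DataFrame, 'target' is a Series,
# and 'frame' combines both in one DataFrame.
bunch_frames = load_iris(as_frame=True)

# A Bunch supports both dictionary-style and attribute-style access.
same_object = bunch_frames['data'] is bunch_frames.data

print(type(bunch_arrays.data).__name__)
print(type(bunch_frames.data).__name__)
print(type(bunch_frames.target).__name__)
# .get() is used because Bunch is a dict subclass; without as_frame
# there is no combined DataFrame to return.
print(bunch_arrays.get('frame') is None)
print(list(bunch_frames.frame.columns))
```

Because a Bunch behaves like a dictionary, `iris['frame']` and `iris.frame` refer to the same object, so either style can be used in the cells that follow.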

---

## References: 

1. [scikit-learn documentation: load_iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html)
2. [Markdown Syntax Cheatsheet](https://www.markdown-cheatsheet.com/)

In [58]:
# Import the loader directly from the sklearn.datasets module.
# Note: 'import sklearn' on its own does not guarantee that the
# 'datasets' submodule is available as an attribute.
from sklearn.datasets import load_iris

# Load the iris data set as a Bunch called 'iris'.
# Note: as_frame=True returns the data as pandas objects.
iris = load_iris(as_frame=True)

# Print the attributes / keys of the 'iris' Bunch.
print(iris.keys())

# Print the target names.
print(iris['target_names'])

# The 'frame' DataFrame is the main store of the data.
# Add a 'species' column so that each row is labelled with the
# appropriate species class ['setosa' 'versicolor' 'virginica'].
# map() translates each integer in the 'target' column to a species name;
# the dictionary follows the order of the 'target_names' list.
iris['frame']['species'] = iris['frame']['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print(iris['frame'])

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
['setosa' 'versicolor' 'virginica']
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  5.1               3.5                1.4               0.2   
1                  4.9               3.0                1.4               0.2   
2                  4.7               3.2                1.3               0.2   
3                  4.6               3.1                1.5               0.2   
4                  5.0               3.6                1.4               0.2   
..                 ...               ...                ...               ...   
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4      

## **Task 2: Explore the Data Structure**

### **Task 2 Description:** 

Print and explain the shape of the data set, the first and last 5 rows of the data, the feature names, and the target classes.

### **Task 2 Submission**



In [60]:
# Shape of the Iris data set.
# Unpack the (rows, columns) tuple once so that the f-strings below avoid
# nested quotes, which only parse on Python 3.12 and later.
n_rows, n_cols = iris['frame'].shape
n_features = len(iris['feature_names'])

print(f'The shape of the Iris Data Set is: '
      f'\n \tRows (instances) = {n_rows},'
      f'\n \tColumns = {n_cols},\n')

print(f'This means the Iris Data Set has {n_rows} Rows and'
      f' {n_cols} Columns:'
      f' {n_features} features plus the target and species columns.'
      f'\nTotaling {n_rows * n_cols} Data points.\n')

The shape of the Iris Data Set is: 
 	Rows (instances) = 150,
 	Columns = 6,

This means the Iris Data Set has 150 Rows and 6 Columns: 4 features plus the target and species columns.
Totaling 900 Data points.
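
The frame measured above holds more than the four measurements. As a hedged sketch, the cell below reloads the data set into a separate variable (`fresh`, an illustrative name) and compares the shape of the feature matrix with the shape of the combined frame before any extra column is added.

```python
from sklearn.datasets import load_iris

fresh = load_iris(as_frame=True)

# 'data' holds only the four measurements (the features).
data_shape = fresh.data.shape
# 'frame' appends the integer 'target' column to those features.
frame_shape = fresh.frame.shape

print(f'data shape:  {data_shape}')
print(f'frame shape: {frame_shape}')
```

Adding the `species` column, as was done in Task 1, brings the frame to six columns, of which four are features.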



In [61]:
# The first 5 rows of the Iris Data Set using the .head() Pandas method: 
print(iris['frame'].head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target species  
0       0  setosa  
1       0  setosa  
2       0  setosa  
3       0  setosa  
4       0  setosa  


In [62]:
# The last 5 rows of the Iris Data Set using the .tail() Pandas method:
print(iris['frame'].tail())

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8   

     target    species  
145       2  virginica  
146       2  virginica  
147       2  virginica  
148       2  virginica  
149       2  virginica  


In [73]:
# The feature names are:
print('The following are the feature names of the Iris Data Set:')
for n, i in enumerate(iris['feature_names'], start=1):
    print(f'\t {n}. {i}')

The following are the feature names of the Iris Data Set:
	 1. sepal length (cm)
	 2. sepal width (cm)
	 3. petal length (cm)
	 4. petal width (cm)


In [74]:
# The target classes

print('The following are the target names of the Iris Data Set:')
for n, i in enumerate(iris['target_names'], start=1):
    print(f'\t {n}. {i}')

# Note: the 'species' column added in Task 1 assigns each row to the appropriate target class.


The following are the target names of the Iris Data Set:
	 1. setosa
	 2. versicolor
	 3. virginica
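
To connect these names to the integer codes stored in the `target` column, a small sketch follows. It reloads the data into `fresh` (an illustrative name) so that it runs on its own, then counts the rows belonging to each class.

```python
from sklearn.datasets import load_iris

fresh = load_iris(as_frame=True)

# The position of each name in 'target_names' is the code used in 'target'.
code_to_name = {code: str(name) for code, name in enumerate(fresh.target_names)}

# Count the rows per code, then relabel the index with the class names.
class_counts = fresh.target.value_counts().sort_index().rename(index=code_to_name)

print(code_to_name)
print(class_counts)
```

The three classes are balanced, which is worth knowing before comparing them in later tasks.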


## **Task 3: Summarize the Data**

### **Task 3 Description:**

For each feature in the dataset, calculate and display:

- mean
- minimum
- maximum
- standard deviation
- median

In [72]:
print(iris['feature_names'])

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
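
One way to produce the five requested statistics is pandas' `DataFrame.agg`, sketched below. This is an assumption about how the task could be approached rather than the submission itself; `fresh`, `features` and `summary` are illustrative names.

```python
from sklearn.datasets import load_iris

fresh = load_iris(as_frame=True)

# Restrict to the four feature columns so 'target' is not summarised.
features = fresh.frame[fresh.feature_names]

# One row per statistic, one column per feature.
# Note: pandas' 'std' is the sample standard deviation (ddof=1).
summary = features.agg(['mean', 'min', 'max', 'std', 'median'])

print(summary.round(2))
```

`features.describe()` would report most of the same figures, with the median listed as the `50%` row; `agg` is used here because it returns exactly the five statistics the task names.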


## **Task 4: Visualize Features**

### **Task 4 Description:**

XYZ

## **Task 5: Investigate Relationships**

### **Task 5 Description:**

XYZ

## **Task 6: Analyze Relationships**

### **Task 6 Description:**

XYZ

## **Task 7: Analyze Class Distributions**

### **Task 7 Description:**

XYZ

## **Task 8: Compute Correlations**

### **Task 8 Description:**

XYZ

## **Task 9: Fit a Simple Linear Regression**

### **Task 9 Description:**

XYZ

## **Task 10: Too Many Features**

### **Task 10 Description:**

XYZ

# End