<a href="https://colab.research.google.com/github/w4bo/handsOnDataPipelines/blob/main/01-MachineLearning/03-Iris.solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The IRIS challenge

### Goal

Your job is to predict the species of each iris plant, i.e., the value of the `Species` variable.

### Metric

Submissions are evaluated using the accuracy score. When splitting train and test datasets, the test dataset should contain 40% of the data.
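
As a minimal illustration of the metric: accuracy is the fraction of predictions that match the true labels. The labels below are made up for the example and are not taken from the dataset.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, for illustration only
y_true = ['setosa', 'versicolor', 'virginica', 'virginica']
y_hat = ['setosa', 'versicolor', 'versicolor', 'virginica']

# 3 of the 4 predictions match the true labels, so the accuracy is 3 / 4
acc = accuracy_score(y_true, y_hat)
print(acc)
```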

### Requirements

You are allowed to use the `numpy`, `pandas`, `matplotlib`, `seaborn`, and `scikit-learn` Python libraries. You can import any model from `scikit-learn`.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_palette('husl')
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# df = pd.read_csv('datasets/iris.csv')
df = pd.read_csv('https://raw.githubusercontent.com/w4bo/handsOnDataPipelines/main/01-MachineLearning/datasets/iris.csv')

## Data understanding

Hints
- There are 150 observations with 4 features each (sepal length, sepal width, petal length, petal width).
- Each observation is labelled with a `Species`

Take a first look at `df`:
- Do we consider all features?
- Are there null values?
- What are the attribute types?
- What are the attribute ranges?
- How many labels are there?
- Are the classes unbalanced?

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df['Species'].value_counts()

### Summing up

| Question | Answer | Do we need action? |
| --- | --- | --- |
| Are there null values? | No | No imputation |
| What are the attribute types? | All features are numeric (only the `Species` label is categorical) | No encoding |
| What are the attribute ranges? | Attribute ranges are similar | No normalization |
| How many labels are there? | 3 | - |
| Are the classes unbalanced? | No, classes are equally distributed | No rebalancing |

IRIS is a simple dataset: it is useful for this lab, but it is not really representative of real-world ML tasks.

### Data visualization

Check the value distribution

In [None]:
tmp = df.drop('Id', axis=1)
tmp.hist(bins=50, figsize=(20,15))
plt.show()

Check variable relationships

In [None]:
g = sns.pairplot(tmp, hue='Species', markers='+')
plt.show()

In [None]:
g = sns.violinplot(y='Species', x='SepalLengthCm', data=df, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='SepalWidthCm', data=df, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='PetalLengthCm', data=df, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='PetalWidthCm', data=df, inner='quartile')
plt.show()

In [None]:
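from scipy.stats import pearsonr
# Correlations are only defined for numeric columns, so drop the string label first
num = tmp.drop('Species', axis=1)
rho = num.corr(method='pearson')
# p-value of each pairwise correlation; pandas sets the diagonal to 1, so subtract the identity
pval = num.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
# One star per significance threshold (0.01, 0.05, 0.1) that the p-value does not exceed
p = pval.apply(lambda col: col.map(lambda x: ''.join('*' for t in [0.01, 0.05, 0.1] if x <= t)))
rho.round(2).astype(str) + p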

In [None]:
min_corr = 0.3
kot = rho[(abs(rho) >= min_corr) & (rho < 1)]
plt.figure(figsize=(8, 6))
sns.heatmap(kot, cmap=sns.color_palette("coolwarm", as_cmap=True)) 

#### Summing up

- The pair plot shows that the relationships between pairs of features for iris-setosa (in pink) are distinctly different from those of the other two species.
- There is some overlap in the pairwise relationships of the other two species, iris-versicolor (brown) and iris-virginica (green).
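
The separation seen in the plots can also be checked numerically. The sketch below is only an illustration: it assumes scikit-learn's bundled copy of Iris (`load_iris`) contains the same measurements as the CSV used in this notebook, and it looks at petal length only.

```python
from sklearn.datasets import load_iris

# Assumption: the bundled dataset stands in for the CSV loaded above
iris = load_iris()
petal_length = iris.data[:, 2]  # column 2 is the petal length in cm

# Range of petal length for each species
for label, name in enumerate(iris.target_names):
    values = petal_length[iris.target == label]
    print(f'{name}: {values.min():.1f} to {values.max():.1f} cm')

# Label 0 is setosa; compare its largest value with the smallest value of the other species
setosa_max = petal_length[iris.target == 0].max()
others_min = petal_length[iris.target != 0].min()
print(setosa_max < others_min)
```

If the last line prints `True`, a single threshold on petal length is enough to isolate setosa, while the ranges of the other two species overlap.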

## Modeling with scikit-learn

Preparing the dataset for the ML pipeline.
- X: the dataset
- y: the labels

In [None]:
X = df.drop(['Id', 'Species'], axis=1)
y = df['Species']
# print(X.head())
print(X.shape)
# print(y.head())
print(y.shape)

In [None]:
X

In [None]:
y

## Train and test on the same dataset

- Pick a classifier from scikit-learn (e.g., logistic regression, decision tree, random forest, k-NN classifier) and train your model on the entire dataset.

In [None]:
# experimenting with different values of k (the number of neighbors)
k_range = list(range(1,26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    y_pred = knn.predict(X)
    scores.append(metrics.accuracy_score(y, y_pred))
    
plt.plot(k_range, scores)
plt.xticks(k_range)
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.show()

In [None]:
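# max_iter is raised from its default of 100 to avoid a possible ConvergenceWarning on unscaled features
logreg = LogisticRegression(max_iter=1000)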
logreg.fit(X, y)
y_pred = logreg.predict(X)
print(metrics.accuracy_score(y, y_pred))

### Summing up

- Evaluating a model on the same data it was trained on *is not* recommended, since the end goal is to predict iris species for observations the model has not seen before.
- The resulting accuracy is optimistic and hides the *high* risk of overfitting the training data.
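
A minimal sketch of the overfitting risk, assuming scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for the CSV. The variable names (`X_demo`, `knn_full`, ...) are hypothetical and chosen so they do not clash with the notebook's own. A 1-nearest-neighbor model memorizes its training data, so its accuracy on that same data says little about unseen observations.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Train and evaluate on the same data: each point is its own nearest neighbor
knn_full = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)
train_acc = accuracy_score(y_demo, knn_full.predict(X_demo))

# Train on 60% of the data and evaluate on the held-out 40%
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)
knn_split = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
test_acc = accuracy_score(y_te, knn_split.predict(X_te))

print('accuracy on the training data:', train_acc)
print('accuracy on held-out data:', test_acc)
```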

## Split the dataset into a training set and a testing set

### Advantages
- By splitting the dataset pseudo-randomly into two separate sets, we can train on one set and test on the other.
- This ensures that no observation is used for both training and testing.
- It is more flexible and faster than creating a model using the whole dataset for training.

### Disadvantages
- The accuracy scores for the testing set can vary depending on what observations are in the set. 
- This disadvantage can be countered using k-fold cross-validation.
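
A minimal sketch of k-fold cross-validation with scikit-learn's `cross_val_score`, again assuming the bundled `load_iris` copy of the data. The choice of 5 folds and `n_neighbors=5` is arbitrary and only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_cv, y_cv = load_iris(return_X_y=True)

# With cv=5 the data is split into 5 folds; each fold is used once for testing
# while the model is trained on the remaining 4 folds
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_cv, y_cv,
                            cv=5, scoring='accuracy')

print(cv_scores)         # one accuracy score per fold
print(cv_scores.mean())  # the average is less sensitive to a single lucky or unlucky split
```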

### Notes
- The accuracy score of the models depends on the observations in the testing set, which is determined by the seed of the pseudo-random number generator (the `random_state` parameter).
- As a model's complexity increases, the training accuracy (accuracy you get when you train and test the model on the same data) increases.
- If a model is too complex or not complex enough, the testing accuracy is lower.
- For KNN models, the value of k determines the level of complexity. A lower value of k means that the model is more complex.
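
The dependence on `random_state` can be observed directly. The sketch below assumes the bundled `load_iris` copy of the data and an arbitrary `n_neighbors=5`; it repeats the same 60/40 split with different seeds and records the test accuracy of each run.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_seed, y_seed = load_iris(return_X_y=True)

seed_scores = {}
for seed in range(5):
    # Same model and same split ratio; only the seed of the split changes
    X_tr, X_te, y_tr, y_te = train_test_split(X_seed, y_seed, test_size=0.4, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    seed_scores[seed] = accuracy_score(y_te, model.predict(X_te))

for seed, score in seed_scores.items():
    print(f'random_state={seed}: accuracy={score:.3f}')
```

Any differences between the printed scores come only from which observations ended up in the test set.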

Split the data into training and test sets such that the test set contains 40% of the data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Fit your model and try it with several parameters

In [None]:
# experimenting with different values of k (the number of neighbors)
k_range = list(range(1,26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))
    
plt.plot(k_range, scores)
plt.xticks(k_range)
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.show()

You can also try different models (check https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)

In [None]:
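# max_iter is raised from its default of 100 to avoid a possible ConvergenceWarning on unscaled features
logreg = LogisticRegression(max_iter=1000)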
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
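
Several models can be compared on the same split with a short loop. This is only a sketch: it assumes the bundled `load_iris` copy of the data, and the three candidate models and their parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_cmp, y_cmp = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_cmp, y_cmp, test_size=0.4, random_state=5)

# Hypothetical set of candidates; any scikit-learn classifier can be added here
candidates = {
    'k-NN (k=12)': KNeighborsClassifier(n_neighbors=12),
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)  # train on the training set only
    results[name] = accuracy_score(y_te, model.predict(X_te))  # score on the test set
    print(f'{name}: {results[name]:.3f}')
```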

Predict the outcome of an unseen observation (hint: use the `.predict()` method).

In [None]:
# Retrain the chosen model on the entire dataset before predicting new observations
knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X, y)
y_pred = knn.predict(X)
metrics.accuracy_score(y, y_pred)

In [None]:
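# make a prediction for an out-of-sample observation;
# reusing the training column names avoids scikit-learn's feature-name warning
new_obs = pd.DataFrame([[6, 3, 4, 2]], columns=X.columns)
knn.predict(new_obs)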

In [None]:
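# same observation, wrapped in a DataFrame so that the feature names match the training data
logreg.predict(pd.DataFrame([[6, 3, 4, 2]], columns=X.columns))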