<center> <h2> Part 1 </h2> </center>

In this assignment, you will work with a popular ML dataset: **The Iris Dataset**. The dataset contains the width and length measurements of the two main parts of iris plants, sepal and petal. The goal is to use these four measurements to classify iris plants into one of the three classes included in the dataset. The classes represent the three types of iris plant, as shown below. 

The dataset is in fact bundled with scikit-learn, but you will **not** import the data directly from the library. Instead, in this part, you will first scrape it from a webpage (and thus practice your web scraping skills!) and then prepare it for classification. :)

### Question 1 
    
You will see that the table includes five columns corresponding to the feature and target variables from this dataset. Your goal is to scrape the data for these 150 samples from this table. You should refrain from using pandas capabilities to import a table; that's not the goal here. Instead, you should use the web scraping techniques we covered earlier in the semester to scrape the values from this page and store them in a dataframe. 

With this in mind, write a function, scrape_data(), that uses BeautifulSoup to scrape data from the link above. Your function should first scrape the required data based on the sample output below and then store them in a dataframe, which it should finally return. The returned dataframe should include 150 rows and 5 columns.

**Hint 1:**
HTML tables consist of rows marked with `<tr>`. A row can contain two kinds of cells: header cells, used for labels and marked with `<th>`, and data cells, used for values and marked with `<td>`. You should start by examining the source code of the table for one of the rows.
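
To make the tag structure concrete, here is a minimal sketch that parses a tiny, made-up HTML table (not the assignment page) with BeautifulSoup. The table contents and the variable names `headers` and `rows` are assumptions chosen for illustration only.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical table used only for illustration
html = """
<table>
  <thead><tr><th>name</th><th>score</th></tr></thead>
  <tbody>
    <tr><td>a</td><td>1.5</td></tr>
    <tr><td>b</td><td>2.0</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Header cells (<th>) hold the column labels
headers = [th.get_text() for th in soup.find_all("th")]

# Each body row (<tr>) holds one data cell (<td>) per column
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in soup.find("tbody").find_all("tr")]

print(headers)
print(rows)
```

Note that every scraped value, including the numbers, comes back as a string.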

**Hint 2:**
Sometimes you may need to explicitly specify the data type when storing scraped data in dataframes, as they will be read as string values. For this purpose, you will find it useful to use the astype() method of the dataframe object in pandas:
    
    * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
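
As a rough illustration of the hint above, the sketch below converts string columns to numeric types with astype(). The dataframe `scraped` and its column names are hypothetical and are not taken from the assignment page.

```python
import pandas as pd

# Scraped values arrive as strings, so pandas stores them as text, not numbers
scraped = pd.DataFrame({"length": ["5.1", "4.9"], "target": ["0", "1"]})

# astype() accepts a single dtype or a per-column mapping of dtypes
converted = scraped.astype({"length": "float64", "target": "int64"})

print(converted.dtypes)
```

Passing a single dtype, as in `scraped.astype("float64")`, would instead convert every column to that type.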

In [None]:
from bs4 import BeautifulSoup
import urllib.request as urllib
from urllib.error import HTTPError
import pandas as pd

In [None]:
def scrape_data():
    
    url = 'https://yildirimcaglar.github.io/ds3000/iris/index.html'
    try:
        html = urllib.urlopen(url)
        # name the parser explicitly so the result does not depend on which parsers are installed
        soup = BeautifulSoup(html.read(), 'html.parser')
        html.close()
    except HTTPError:
        return "request failed"
    table = soup.find('table', {'class': 'iris-data'})
    
    # column names come from the non-empty header cells
    columns = [th_tag.get_text() for th_tag in table.find('thead').find_all('th') if th_tag.get_text() != '']
    data_dict = {key:[] for key in columns}
    
    # append each data cell to the list for its column
    for row in table.find('tbody').find_all('tr'):
        for i, col in enumerate(row.find_all('td')):
            data_dict[columns[i]].append(col.get_text())

    # scraped values are strings, so convert them to floats
    return pd.DataFrame(data_dict).astype('float64')

In [None]:
# your function should return a dataset containing the scraped data
df = scrape_data()
df

Unnamed: 0,sepal length,sepal width,petal length,petal width,target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2.0
146,6.3,2.5,5.0,1.9,2.0
147,6.5,3.0,5.2,2.0,2.0
148,6.2,3.4,5.4,2.3,2.0


In [None]:
# here is the descriptive stats table for the columns
# for your reference
df.describe()

Unnamed: 0,sepal length,sepal width,petal length,petal width,target
count,150.0,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333,1.0
std,0.828066,0.435866,1.765298,0.762238,0.819232
min,4.3,2.0,1.0,0.1,0.0
25%,5.1,2.8,1.6,0.3,0.0
50%,5.8,3.0,4.35,1.3,1.0
75%,6.4,3.3,5.1,1.8,2.0
max,7.9,4.4,6.9,2.5,2.0


In [None]:
# remember to save your df as a csv file
# and submit it as part of the assignment
df.to_csv('iris.csv')

### Question 2 

As you can see, each value in the target variable corresponds to a type of iris plant:

    * 0: 'Iris-setosa' 
    * 1: 'Iris-versicolor'
    * 2: 'Iris-virginica'
    
For your convenience, a dictionary is already defined below. 

Write a helper function, lookup_label, that returns the label associated with a target value from the target_dict below. This function will later be used to look up the label associated with a target value.




In [None]:
target_dict = {0:"Iris-setosa", 1: "Iris-versicolor", 2: "Iris-virginica"}

In [None]:
def lookup_label(target_value):
    return target_dict[target_value]

In [None]:
lookup_label(1)

'Iris-versicolor'

### Question 3 
Write a function that returns the number of instances of each class in the dataset. 

Because the target column stores numeric codes, your function should return labels (Iris-setosa / Iris-versicolor / Iris-virginica) instead of 0 / 1 / 2. You will need to use one of the data transformation methods we have covered earlier. Refer to the sample output below.
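
One possible way to combine a transformation with a count is sketched below on a toy Series. The codes and the `labels` dictionary are made up for illustration and are unrelated to the Iris data.

```python
import pandas as pd

# Hypothetical numeric codes and their labels (not the Iris data)
codes = pd.Series([0, 1, 1, 2, 1, 0])
labels = {0: "low", 1: "medium", 2: "high"}

# map() translates each code to its label; value_counts() tallies the labels
counts = codes.map(labels).value_counts()

print(counts)
```

map() also accepts a function in place of a dictionary, which is convenient when a lookup helper already exists.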

In [None]:
def case_distribution(df):
    return df['target'].map(lookup_label).value_counts()

In [None]:
case_distribution(df)

Iris-versicolor    50
Iris-virginica     50
Iris-setosa        50
Name: target, dtype: int64

### Question 4 
Write a function that extracts and returns a tuple of features and target variables from the df dataframe above. Features are all columns from columns 0 to 3, and the target is the target column.

    * The features variable should be a DataFrame object with 150 rows and 4 columns
    * The target variable should be a Series object with 150 values
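
The positional slicing described above can be sketched on a small, hypothetical dataframe. The names `toy`, `X`, and `y` are assumptions used only in this example.

```python
import pandas as pd

# Hypothetical frame: two feature columns followed by a target column
toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0],
                    "f2": [4.0, 5.0, 6.0],
                    "target": [0.0, 1.0, 1.0]})

X = toy.iloc[:, :-1]                  # every column except the last -> DataFrame
y = toy.iloc[:, -1].astype("int64")   # the last column only -> Series

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```

Selecting a single column with iloc yields a Series, while a column slice yields a DataFrame, which is why the two results differ in type.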

In [None]:
def features_and_target(df):
    return (df.iloc[:, :-1], df.iloc[:, -1].astype('int32'))

In [None]:
features, target = features_and_target(df)

In [None]:
features

Unnamed: 0,sepal length,sepal width,petal length,petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [None]:
target

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: target, Length: 150, dtype: int32

<hr />

<center>Splendid! Your dataset is ready. </center>
<hr />

<center><h1> Part 2 </h1></center>

In this part, you will apply and evaluate the four classifiers (kNN, LinearSVC, Naive Bayes, and Decision Tree) introduced so far to the Iris dataset. You will do so using both of the training/testing approaches we have covered.

### Question 5 
You will find it useful to define a dictionary, estimators, containing these four classifiers so you can apply them in an iteration statement. For this question, import the relevant classifiers from scikit-learn.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

estimators = {
    'k-Nearest Neighbor': KNeighborsClassifier(), 
    'Support Vector Machine': LinearSVC(max_iter=100000),
    'Gaussian Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier()}

In [None]:
estimators.values()

dict_values([KNeighborsClassifier(), LinearSVC(max_iter=100000), GaussianNB(), DecisionTreeClassifier()])

### Question 6 

Write a function that fits these four classifiers using a percentage-split approach. You will first need to split the dataset into training and testing sets (use variables X_train, X_test, y_train, y_test). Then train and evaluate the model using these four algorithms. Refer to the sample output.

You should use an iteration statement to apply these classifiers. Use random_state=3000 when splitting your dataset. Your function should import the necessary functions from sklearn.
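
For intuition about what a percentage split does, here is a plain-Python sketch: shuffle the row indices with a fixed seed, then hold out a fraction of them for testing. The helper `percentage_split` is hypothetical and is not scikit-learn's implementation, so it will not reproduce the exact rows that train_test_split selects. The 25% test fraction mirrors the default of train_test_split.

```python
import math
import random

def percentage_split(n_rows, test_fraction=0.25, seed=3000):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed -> reproducible shuffle
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = percentage_split(150)
print(len(train_idx), len(test_idx))
```

With 150 rows this holds out 38 rows for testing and keeps 112 for training, and no row appears in both sets.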

In [None]:
from sklearn.model_selection import train_test_split

def classifiers_percentage_split():
    # split once so that every classifier is evaluated on the same held-out test set
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)

    for estimator_name, estimator_object in estimators.items():
        estimator_object.fit(X_train, y_train)
        # score() returns a single accuracy value, so no averaging is needed
        accuracy = estimator_object.score(X_test, y_test)

        print(estimator_name + ": \n\t" + f'Classification accuracy on the test data={accuracy:.2%}' +"\n")

In [None]:
classifiers_percentage_split()

k-Nearest Neighbor: 
	Classification accuracy on the test data=94.74%

Support Vector Machine: 
	Classification accuracy on the test data=89.47%

Gaussian Naive Bayes: 
	Classification accuracy on the test data=89.47%

Decision Tree: 
	Classification accuracy on the test data=84.21%



### Question 7 

Write a function that fits these four classifiers using a **cross-validation** approach. You should use an iteration statement to apply these classifiers. Use random_state=3000 when splitting your dataset. Your function should import the necessary functions from sklearn. Refer to the sample output for the required values.
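
The idea behind k-fold cross-validation can be sketched in plain Python: shuffle the row indices once, cut them into k equal folds, and let each fold serve as the test set exactly once. The generator `kfold_indices` is a hypothetical helper, not scikit-learn's KFold, and it assumes the number of rows divides evenly by the number of folds.

```python
import random

def kfold_indices(n_rows, n_splits=10, seed=3000):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    fold_size = n_rows // n_splits            # assumes n_rows divides evenly
    for fold in range(n_splits):
        start = fold * fold_size
        test = indices[start:start + fold_size]
        train = indices[:start] + indices[start + fold_size:]
        yield train, test

folds = list(kfold_indices(150))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Every row is used for testing exactly once and for training in the remaining folds, which is why the mean and standard deviation across folds are reported.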

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

def classifiers_cross_validation():
    for estimator_name, estimator_object in estimators.items():
        kfold = KFold(n_splits=10, random_state=3000, shuffle=True)

        scores = cross_val_score(estimator=estimator_object, X=features, y=target, cv=kfold)

        print(estimator_name + ": \n\t" + f'mean accuracy={scores.mean():.2%}, ' 
              + f'standard deviation={scores.std():.2%}' +"\n")

In [None]:
classifiers_cross_validation()

k-Nearest Neighbor: 
	mean accuracy=96.67%, standard deviation=4.47%

Support Vector Machine: 
	mean accuracy=94.67%, standard deviation=5.81%

Gaussian Naive Bayes: 
	mean accuracy=95.33%, standard deviation=6.00%

Decision Tree: 
	mean accuracy=94.67%, standard deviation=7.18%



### Question 8 
Based on the results from the previous two questions, which **classifier** and **training/testing** approach would you choose? Why?

Type your answer below:

#### The k-Nearest Neighbor classifier with the cross-validation approach, because it achieves the highest mean accuracy (96.67%) with the lowest standard deviation (4.47%) of the four classifiers.

<hr />

<center>Part 2 is done too! </center>
<hr />

<center><h1> Part 3 </h1></center>  


In this part, you will implement the kNN algorithm from scratch. You're strongly encouraged to review the section on the kNN algorithm before attempting this part of the assignment.

To begin with, once again, split the data, features and target from Q4, into training and test splits to be used in upcoming questions. Use X_train, X_test, y_train, y_test as variable names for the splits. **Use random_state=3000** when splitting the dataset.

In [None]:
# ungraded
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)

In [None]:
# sample output
X_train.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width
143,6.8,3.2,5.9,2.3
82,5.8,2.7,3.9,1.2
12,4.8,3.0,1.4,0.1
137,6.4,3.1,5.5,1.8
100,6.3,3.3,6.0,2.5


In [None]:
X_test.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width
59,5.2,2.7,3.9,1.4
19,5.1,3.8,1.5,0.3
98,5.1,2.5,3.0,1.1
67,5.8,2.7,4.1,1.0
24,4.8,3.4,1.9,0.2


### Question 9 
Write your own function called 'euclidean_distance' that calculates the Euclidean distance between two vectors. 
- The function accepts row1 and row2 as parameters (you can assume they are of the same size)
- It returns the Euclidean distance between them

**Hint:** The math module has a function that returns the square root of a number (look it up!).
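
For reference, the Euclidean distance between two vectors $\mathbf{a}$ and $\mathbf{b}$ of length $n$ is usually written as follows (the symbols are notation chosen here, not names from the assignment):

$$
d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

As a worked case, for $\mathbf{a} = (5, 6)$ and $\mathbf{b} = (2, 10)$:

$$
d(\mathbf{a}, \mathbf{b}) = \sqrt{(5 - 2)^2 + (6 - 10)^2} = \sqrt{9 + 16} = \sqrt{25} = 5
$$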

Refer to the sample output below.

In [None]:
import math
def euclidean_distance(row1, row2):
    d_squared = [(a - b)**2 for a, b in zip(row1, row2)]
    return math.sqrt(sum(d_squared))

In [None]:
euclidean_distance(row1 = [5,6], row2 = [2,10])

5.0

In [None]:
euclidean_distance(row1 = [1,2,3], row2 = [4,5,6])

5.196152422706632

### Question 10 
Write a function 'calculate_distances' to find the distance between each row in the training set and a given test row instance. Please note that this will make little sense if you have not reviewed the idea behind the kNN algorithm.

* The function should take **test_row** and **X_train** as arguments.
* It should return a list of distances whose length equals the number of rows in **X_train**.
* The elements of the list should be the distances between the rows of X_train and the given test row, in the same order as the rows of X_train.

**Note:** Use the 'euclidean_distance' function defined in the previous question to compute the distances.
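
The shape of the expected result can be illustrated on plain lists. This sketch uses the standard library's math.dist only to stay self-contained; it stands in for the assignment's 'euclidean_distance' function, and the rows are made-up values.

```python
import math

# Hypothetical training rows and test row (two features each)
train_rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
test_point = [0.0, 0.0]

# One distance per training row, in the same order as train_rows
toy_distances = [math.dist(test_point, row) for row in train_rows]

print(toy_distances)
```

The list has one entry per training row, so position i always refers to training row i.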

Refer to the sample output below. Use the sample test row defined below for testing purposes.

In [None]:
# Sample test row 
test_row = X_test.iloc[0,:]

# Sample test target
test_row_target = y_test.iloc[0]

In [None]:
#here are the features
test_row

sepal length    5.2
sepal width     2.7
petal length    3.9
petal width     1.4
Name: 59, dtype: float64

In [None]:
#here is the target
test_row_target

1

In [None]:
#here is the label associated with the target
lookup_label(test_row_target)

'Iris-versicolor'

In [None]:
def calculate_distances(test_row, X_train):
    return X_train.apply(lambda row: euclidean_distance(test_row, row), axis=1).values.tolist()

In [None]:
distances = calculate_distances(test_row, X_train)

In [None]:
#here are the distances between the test row and the first 5 samples in the training set
distances[:5]

[2.760434748368452,
 0.6324555320336755,
 2.8618176042508368,
 2.078460969082653,
 2.681417535558384]

In [None]:
#this is the distance between the sample in the first row of X_train and test_row defined above
distances[0] 

2.760434748368452

### So far...

Run the following script to add X_train to a new dataframe along with the distances calculated above and the actual target values from the original dataset. When successfully implemented, your dataframe, called df_distances, should look like the sample dataframe given below.

In [None]:
# assign() method adds a new column "distance", assigns distances to this column, and returns a new df
df_distances = X_train.assign(distance = distances) 

#let's add the actual labels as well
df_distances["target"] = y_train

In [None]:
#this is what the new df looks like
df_distances.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,distance,target
143,6.8,3.2,5.9,2.3,2.760435,2
82,5.8,2.7,3.9,1.2,0.632456,1
12,4.8,3.0,1.4,0.1,2.861818,0
137,6.4,3.1,5.5,1.8,2.078461,2
100,6.3,3.3,6.0,2.5,2.681418,2


In [None]:
#as a reminder, this is what X_train looks like
X_train.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width
143,6.8,3.2,5.9,2.3
82,5.8,2.7,3.9,1.2
12,4.8,3.0,1.4,0.1
137,6.4,3.1,5.5,1.8
100,6.3,3.3,6.0,2.5


### Question 11 

Write a function, sort_by_distance, that sorts the above df_distances by the distance column in ascending order and returns it. Your function should take the dataframe and the name of the column to sort by as arguments, as shown in the sample function call below:

In [None]:
def sort_by_distance(df_distances, column_name):
    return df_distances.sort_values(by=column_name)

In [None]:
df_distances = sort_by_distance(df_distances, "distance")
df_distances

Unnamed: 0,sepal length,sepal width,petal length,petal width,distance,target
89,5.5,2.5,4.0,1.3,0.387298,1
94,5.6,2.7,4.2,1.3,0.509902,1
53,5.5,2.3,4.0,1.3,0.519615,1
80,5.5,2.4,3.8,1.1,0.529150,1
69,5.6,2.5,3.9,1.1,0.538516,1
...,...,...,...,...,...,...
105,7.6,3.0,6.6,2.1,3.691883,2
122,7.7,2.8,6.7,2.0,3.802631,2
131,7.9,3.8,6.4,2.0,3.887158,2
117,7.7,3.8,6.7,2.2,3.992493,2


### Question 12 
Write a function 'find_neighbors' to find the k-closest samples in the training set to a given test row instance.

- The function takes df_distances and k as parameters
- It should return a subset of the df_distances dataframe containing the k nearest points in the training set to test_row. Because df_distances is sorted by distance, its first k rows are the k nearest neighbors.
- The length of the returned dataframe should be equal to k
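
Outside of pandas, the same selection can be sketched with the standard library. The (distance, label) pairs and the helper `k_nearest` are hypothetical and serve only to show the idea of keeping the k smallest distances.

```python
import heapq

# Hypothetical (distance, label) pairs for five training points
pairs = [(2.7, 2), (0.6, 1), (2.9, 0), (0.4, 1), (1.1, 2)]

def k_nearest(pairs, k):
    """Return the k pairs with the smallest distance, closest first."""
    return heapq.nsmallest(k, pairs, key=lambda pair: pair[0])

print(k_nearest(pairs, 3))
```

Sorting the whole list and slicing the first k items gives the same result; heapq.nsmallest simply avoids a full sort when k is small.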

Refer to the sample output below.

In [None]:
def find_neighbors(df_distances, k):
    return sort_by_distance(df_distances, 'distance').iloc[:k, :]

In [None]:
#sample function call with k=3
df_knn = find_neighbors(df_distances, 3)
df_knn

Unnamed: 0,sepal length,sepal width,petal length,petal width,distance,target
89,5.5,2.5,4.0,1.3,0.387298,1
94,5.6,2.7,4.2,1.3,0.509902,1
53,5.5,2.3,4.0,1.3,0.519615,1


In [None]:
#another sample function call with a different k
find_neighbors(df_distances, 5)

Unnamed: 0,sepal length,sepal width,petal length,petal width,distance,target
89,5.5,2.5,4.0,1.3,0.387298,1
94,5.6,2.7,4.2,1.3,0.509902,1
53,5.5,2.3,4.0,1.3,0.519615,1
80,5.5,2.4,3.8,1.1,0.52915,1
69,5.6,2.5,3.9,1.1,0.538516,1


### Question 13 
Write a function majority_vote that performs a majority vote to determine the predicted label associated with a test sample.

- The function takes df_knn and the name of the column on which the majority vote is to be performed. For a classification problem, the majority vote simply involves determining the mode of the labels of the k nearest neighbors included in the df_knn dataframe.
- The function should return a prediction in the form of an int, which represents the most frequently occurring label in the df_knn dataframe
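
A majority vote over plain Python lists might look like the sketch below. The helper `vote` is hypothetical and uses collections.Counter rather than a pandas method.

```python
from collections import Counter

def vote(labels):
    """Return the most frequent label among the neighbors."""
    # most_common(1) returns a list holding one (label, count) pair
    return Counter(labels).most_common(1)[0][0]

print(vote([1, 1, 2]))
print(vote([2, 0, 2, 1, 2]))
```

When two labels are equally frequent, Counter returns the one encountered first, so how ties are broken is a design choice worth deciding deliberately, especially for even values of k.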

Refer to the sample output below.

In [None]:
def majority_vote(df_knn, column_name):
    return df_knn[column_name].mode().values[0]

In [None]:
prediction = majority_vote(df_knn, "target")
prediction

1

In [None]:
#let's look up the label using the function you defined earlier
label = lookup_label(prediction)
label

'Iris-versicolor'

In [None]:
#as a reminder, here is the actual target value
test_row_target

1

In [None]:
#let's compare your own knn's prediction to that of sklearn's
model = KNeighborsClassifier().fit(X=X_train, y=y_train)
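# pass the test row as a one-row dataframe so its feature names match those seen during fit
predicted = model.predict(test_row.to_frame().T)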
predicted[0]

1