<a href="https://colab.research.google.com/github/qingdao81/uplimit-python-ml/blob/main/Lars_Bachmann_Week_3_Project_%5BAug23%5D_P4ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> 1. DUPLICATE THIS COLAB DOCUMENT TO START WORKING ON IT: On the top-left corner of this page, go to File > Save a copy to drive.
> 2. SHARE SETTINGS: In the new notebook, set the sharing settings to "Anyone with the link" by clicking "Share" on the top-right corner.

<center>
  <img src=https://www.freevector.com/uploads/vector/preview/31087/07Januari2021-06_generated.jpg width="500" align="center" />
</center>
<br/>


# Week 3: (Un)Supervised Predictions for Airbnb listings!

This is the third and last week's project 😢 of *Intermediate Python for Data Science*. Here we are going to put everything we've learned over the last three weeks together and create yet another exciting algorithm that we can then use for creating a Streamlit App!

We'll first start using unsupervised learning, and with that information we are going to make a supervised model. All by using Pipelines, ColumnTransformers, Plotly and some other parts! Let's do it 💪💪!

## Downloading the Dataset

You'll need to install some prerequisite Python packages in order to run the code below. Let's install them!

In [1]:
%%capture
!pip install numpy streamlit gdown pandas==1.5.2 scikit-learn==1.2.0

We will download the datasets from Google Drive just like we did the previous weeks using the [Pickle](https://pythonnumericalmethods.berkeley.edu/notebooks/chapter11.03-Pickle-Files.html) format.

In [2]:
import os
import shutil

import gdown
import pandas as pd

# Download file from Google Drive
# This file is based on data from: http://insideairbnb.com/get-the-data/
file_id = "1KTF77Sj0kWyft9gNT3_6k84gauPA95rG"
downloaded_file = "listings.pkl"
# Download the files from Google Drive
gdown.download(id=file_id, output=downloaded_file)

# Show all columns (instead of cascading columns in the middle)
pd.set_option("display.max_columns", None)
# Don't show numbers in scientific notation
pd.set_option("display.float_format", "{:.2f}".format)

Downloading...
From: https://drive.google.com/uc?id=1KTF77Sj0kWyft9gNT3_6k84gauPA95rG
To: /content/listings.pkl
100%|██████████| 494k/494k [00:00<00:00, 67.5MB/s]


In [3]:
# Setting seed allows us to generate a random dataset split and
# algorithms that are the same on every computer. Otherwise,
# every time you run the split, you'd get a different dataset split.
SEED = 42
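
To see what a seed does, here is a minimal sketch using NumPy's random generator (an illustration only; Scikit-learn handles its `random_state` internally, and the variable names are made up for this example). Two generators created with the same seed shuffle the numbers 0 to 9 in exactly the same way:

```python
import numpy as np

SEED = 42  # same value as in the notebook

# Two generators created with the same seed produce the same shuffle.
first = np.random.default_rng(SEED).permutation(10)
second = np.random.default_rng(SEED).permutation(10)

print(first)
print(np.array_equal(first, second))  # True
```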

## Preprocessing the Dataset
Please load the downloaded file as a DataFrame (df). The method for loading these datasets is the same as what we did on the Uplimit platform.

#### Task 1: Read Pickle

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/loading-inspect-dataset#corise_cladmotdl002n3b6prslqdc79)

Read the Python Pickle file we've just downloaded as `df_list`.

In [105]:
# Read a Python Pickle file
df_list = pd.read_pickle(downloaded_file)

<details>
<summary>Show Solution</summary>

```python
df_list = pd.read_pickle("listings.pkl")
```
</details>

Now let's have a look at the **Listings DataFrame** and see what kinds of datapoints there are in the dataset. Show the first 2 rows.

In [106]:
# Show first two rows of the dataset
df_list.head(2)

| | id | host_acceptance_rate | neighbourhood | room_type | price_in_dollar | amenities | accommodates | host_is_superhost | has_availability | review_scores_rating | instant_bookable | number_of_reviews_l30d | discount_per_5_days_booked | discount_per_10_days_booked | discount_per_30_and_more_days_booked | host_reported_average_tip | service_cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23726716 | 0.95 | De Pijp - Rivierenbuurt | Private room | 127.0 | 15 | 7 | False | True | 4.61 | False | 3 | 8.0 | 15.0 | 16.0 | 1.03 | \$4.99 |
| 1 | 35815046 | 1.0 | De Baarsjes - Oud-West | Shared room | 62.0 | 13 | 2 | False | True | 4.38 | False | 6 | 4.0 | 10.0 | 16.0 | 1.26 | \$2.99 |


<details>
<summary>Show Solution</summary>

```python
df_list.head(2)
```
</details>

Awesome, just like last week, our next step is to get an overview of the columns that are in this particular DataFrame 🧐.

#### Task 2: Print column names, types, and non-null values

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/loading-inspect-dataset#corise_clcqs8s8c00002a6lgh770s18)

Let's try and get an overview of the **Listings DataFrame**, called `df_list` with the [`info()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)-command. This should show us some details about the columns in the DataFrame, like the column names, their data types, and the number of non-null values.

In [107]:
df_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4817 entries, 0 to 6172
Data columns (total 17 columns):
 #   Column                                Non-Null Count  Dtype   
---  ------                                --------------  -----   
 0   id                                    4817 non-null   int64   
 1   host_acceptance_rate                  4817 non-null   float64 
 2   neighbourhood                         4817 non-null   category
 3   room_type                             4817 non-null   category
 4   price_in_dollar                       4817 non-null   float64 
 5   amenities                             4817 non-null   int64   
 6   accommodates                          4817 non-null   int64   
 7   host_is_superhost                     4817 non-null   bool    
 8   has_availability                      4817 non-null   bool    
 9   review_scores_rating                  4817 non-null   float64 
 10  instant_bookable                      4817 non-null   bool    
 11  numb

<details>
<summary>Show Solution</summary>

```python
df_list.info()
```
</details>

Just like last week, we have a lot of attributes, a few of which we will drop. Some require processing while others don't. For example, columns with the dtype **category** or **bool** require some processing before they can be used by Scikit-learn algorithms, since most of these algorithms work best with numerical values.
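
As a small sketch of how you could list the columns that still need encoding, the snippet below uses `select_dtypes()` on a made-up three-row DataFrame (the `toy` frame and its values are hypothetical, only the column names mirror the listings data):

```python
import pandas as pd

# Hypothetical three-row frame that mimics a few columns of the listings data.
toy = pd.DataFrame({
    "room_type": pd.Categorical(["Private room", "Shared room", "Private room"]),
    "host_is_superhost": [False, True, False],
    "accommodates": [7, 2, 4],
})

# Columns that need processing before Scikit-learn can use them.
needs_encoding = list(toy.select_dtypes(include=["category", "bool"]).columns)
# Columns that are already numerical.
already_numeric = list(toy.select_dtypes(include=["number"]).columns)

print(needs_encoding)   # ['room_type', 'host_is_superhost']
print(already_numeric)  # ['accommodates']
```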

#### Task 3: Make a selection

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/loading-inspect-dataset#corise_cladmotdl002n3b6prslqdc79)

There are 17 columns in total. For readability, we are going to drop some of them early. In practice, you often drop columns at a later stage, once you've determined their value. This is something we'll look into next week. For now, we want you to drop these columns:

- id (it has no meaning for the ML algorithm)
- discount_per_5_days_booked
- discount_per_10_days_booked
- discount_per_30_and_more_days_booked

In [109]:
df_list = df_list.drop(columns=["id", "discount_per_5_days_booked", "discount_per_10_days_booked", "discount_per_30_and_more_days_booked"])

<details>
<summary>Show Solution</summary>

```python
# Assign back to df_list so that the following cells see the reduced DataFrame
df_list = df_list.drop(columns=['id', 'discount_per_5_days_booked',
                                'discount_per_10_days_booked', 'discount_per_30_and_more_days_booked'])

```
</details>

Take a look at the leftover information contained in the dataframe again.

In [110]:
df_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4817 entries, 0 to 6172
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   host_acceptance_rate       4817 non-null   float64 
 1   neighbourhood              4817 non-null   category
 2   room_type                  4817 non-null   category
 3   price_in_dollar            4817 non-null   float64 
 4   amenities                  4817 non-null   int64   
 5   accommodates               4817 non-null   int64   
 6   host_is_superhost          4817 non-null   bool    
 7   has_availability           4817 non-null   bool    
 8   review_scores_rating       4817 non-null   float64 
 9   instant_bookable           4817 non-null   bool    
 10  number_of_reviews_l30d     4817 non-null   int64   
 11  host_reported_average_tip  4817 non-null   float64 
 12  service_cost               4817 non-null   category
dtypes: bool(3), category(3), float64(

<details>
<summary>Show Solution</summary>

```python
df_list.info()

```
</details>

We are left with 3 boolean columns and 3 categorical columns, all of which need to be encoded.

Next, we need a target variable. There are **no** labels for it yet: we will derive it from the combination of "price_in_dollar" and "host_reported_average_tip".

Just like in this week's content, we first cluster these two numerical values into a few groups. This creates a new categorical variable called **listing_tipping_group**, which will be the target for the algorithm we are going to build.



#### Task 4: Visualize the two numerical variables

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/one-and-more-var#corise_clctcb6ll00062a76ap82hkt8)

Create a scatter plot displaying the two numerical variables: "price_in_dollar" and "host_reported_average_tip".

In [111]:
import plotly.express as px

fig = px.scatter(df_list, x="price_in_dollar", y="host_reported_average_tip")
fig.show()

<details>
<summary>Show Solution</summary>

```python
import plotly.express as px

fig = px.scatter(df_list, x="price_in_dollar", y="host_reported_average_tip")
fig.show()

```
</details>

#### Task 5: Define Three clusters via K-Means

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/unsupervised-learning#corise_cld3n6qss000s2a6ls9js7l7g)

Just like in the unsupervised section on Uplimit, we are going to use the columns **price_in_dollar** and **host_reported_average_tip** to create a cluster variable, which we'll use later on! Let's start by importing `KMeans`, `Pipeline`, and `MinMaxScaler`.

In [112]:
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

<details>
<summary>Show Solution</summary>

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

```
</details>

Next, we want to create a pipeline that combines the scaler and the KMeans algorithm. Then we use `fit_predict` to create the KMeans labels.

*Make sure to set a seed for your algorithm/model.*

In [113]:
pipeline = Pipeline([
    ("scaler", MinMaxScaler()) , # YOUR CODE HERE
    ("model", KMeans(n_clusters=3, random_state=SEED)) # YOUR CODE HERE
    ])

Kmean_labels = pipeline.fit_predict(
    df_list[["price_in_dollar", "host_reported_average_tip"]]
)





<details>
<summary>Show Solution</summary>

```python
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", KMeans(n_clusters=3, random_state=SEED))
    ])

Kmean_labels = pipeline.fit_predict(
    df_list[["price_in_dollar", "host_reported_average_tip"]]
)

```
</details>
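
If you'd like to see the same scaler-plus-KMeans pipeline in isolation, here is a minimal sketch on made-up data. The three blobs of (price, tip) pairs, the `toy` names, and the explicit `n_init=10` are assumptions for this illustration and are not part of the project data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

SEED = 42

# Hypothetical data: three well-separated blobs of (price, tip) pairs.
rng = np.random.default_rng(SEED)
centers = np.array([[50.0, 0.5], [150.0, 1.5], [300.0, 3.0]])
points = np.vstack(
    [c + rng.normal(scale=[5.0, 0.05], size=(20, 2)) for c in centers]
)
toy = pd.DataFrame(points, columns=["price_in_dollar", "host_reported_average_tip"])

# Scale both columns to [0, 1] first, so price does not dominate the distances.
toy_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", KMeans(n_clusters=3, n_init=10, random_state=SEED)),
])
toy_labels = toy_pipeline.fit_predict(toy)

print(toy_labels.shape)       # one label per row
print(np.unique(toy_labels))  # the cluster ids that were assigned
```

Which blob ends up with which cluster id is arbitrary, so always inspect the clusters (for example with a scatter plot) before giving them names.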

Because these labels are integers, they will be treated as a numerical/continuous variable when visualizing. However, the labels indicate three separate groups, so we want a categorical variable instead. For now, convert the numerical labels into strings.

In [114]:
# Change the labels from numerical into categorical.
#Kmean_labels = ... # YOUR CODE HERE
Kmean_labels = [str(x) for x in Kmean_labels]

In [115]:
#Kmean_labels

<details>
<summary>Show Solution</summary>

```python
Kmean_labels = [str(x) for x in Kmean_labels]

```
</details>

Now, as a last check, let's confirm the labels were assigned as expected, by visualizing the scatterplot with the labels!

In [116]:
fig = px.scatter(
    df_list, x="price_in_dollar", y="host_reported_average_tip", color=Kmean_labels
)
fig.show()

<details>
<summary>Show Solution</summary>

```python
fig = px.scatter(
    df_list, x="price_in_dollar", y="host_reported_average_tip", color=Kmean_labels
)
fig.show()

```
</details>

#### Task 6: Reassign as a new column to the Dataset

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/unsupervised-learning#corise_clc1wb7gf00023b6oaqcg6tn1)

Let's add this list as a column to our dataset **df_list** using the name **listing_tipping_group**.

In [117]:
df_list["listing_tipping_group"] = Kmean_labels

In [118]:
#Kmean_labels

<details>
<summary>Show Solution</summary>

```python
df_list["listing_tipping_group"] = Kmean_labels

```
</details>

#### Task 7: Manually split up Three into Four clusters

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/unsupervised-learning#corise_clc1wb7gf00023b6oaqcg6tn1)

As you might've noticed, there are also listings that receive no tips at all. We want to recognize these listings as a separate group. Let's use Pandas to overwrite the label of every listing that has received no tips with the string "3".

In [119]:
df_list.loc[df_list["host_reported_average_tip"] == 0.00, "listing_tipping_group"] = "3"
#df_list[df_list["listing_tipping_group"] == 3]

<details>
<summary>Show Solution</summary>

```python
df_list.loc[df_list["host_reported_average_tip"] == 0.00, "listing_tipping_group"] = "3"

```
</details>
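
The pattern used here is a boolean mask combined with `.loc`. A minimal sketch on a made-up four-row DataFrame (the values are hypothetical) shows what happens:

```python
import pandas as pd

# Hypothetical rows: two listings with tips, two without.
toy = pd.DataFrame({
    "host_reported_average_tip": [1.03, 0.00, 1.26, 0.00],
    "listing_tipping_group": ["0", "2", "1", "2"],
})

# True for every row where no tip was reported.
no_tip_mask = toy["host_reported_average_tip"] == 0.00

# Overwrite the label only in the rows selected by the mask.
toy.loc[no_tip_mask, "listing_tipping_group"] = "3"

print(toy["listing_tipping_group"].tolist())  # ['0', '3', '1', '3']
```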

To make our labels a bit more "expressive", let's use a [`replace()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) operation to map the values as follows:

- "0" to "average tip"
- "1" to "high tip"
- "2" to "low tip"
- "3" to "no tip"


In [120]:
df_list["listing_tipping_group"] = df_list["listing_tipping_group"].replace({"0": "average tip", "1": "high tip", "2": "low tip", "3": "no tip"})


<details>
<summary>Show Solution</summary>

```python
df_list["listing_tipping_group"] = df_list["listing_tipping_group"].replace({"0": "average tip", "1": "high tip", "2": "low tip", "3": "no tip"})

```
</details>

Awesome, now let's encode these labels back into numbers, haha! You might wonder why we renamed them in the first place. Well, the Scikit-learn encoder can easily transform these labels back and forth between numbers and label names. So when we need the labels, we just convert them back using a Scikit-learn function, which is easier than constantly having to replace values like we did above (or remembering what the values meant).

#### Task 8: Encode our Label

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/unsupervised-learning#corise_clc2384f100073b6ozta1negh)

For the model/algorithm that we are building, we use the **listing_tipping_group** as our target ($y$) label. Let's assign this column as our *y* variable and encode it with a [`LabelEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [121]:
y = df_list["listing_tipping_group"]

<details>
<summary>Show Solution</summary>

```python
y = df_list["listing_tipping_group"]

```
</details>

Now, let's encode it! Make sure to use ravel.

In [122]:
import numpy as np
from numpy import ravel  # change column-vector to 1d-array
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
y = label_enc.fit_transform(ravel(y))

<details>
<summary>Show Solution</summary>

```python
import numpy as np
from numpy import ravel  # change column-vector to 1d-array
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
y = label_enc.fit_transform(ravel(y))

```
</details>

Let's try changing back some of the encoded labels by using [`inverse_transform`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder.inverse_transform). Input values between 0 and 3 and see what comes back.

*You might notice that the numbers don't coincide with the original numbers from when we generated the three clusters. This is okay and shouldn't influence our results.*

In [123]:
label_enc.inverse_transform(np.array([0, 1 ,3]))

array(['average tip', 'high tip', 'no tip'], dtype=object)
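
The reason the numbers can differ is that `LabelEncoder` assigns integers in the sorted order of the class names, independent of the original cluster ids. A minimal round-trip sketch (the `labels` array is made up for this example):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, deliberately not in alphabetical order.
labels = np.array(["no tip", "low tip", "high tip", "average tip", "low tip"])

enc = LabelEncoder()
encoded = enc.fit_transform(labels)

print(list(enc.classes_))  # ['average tip', 'high tip', 'low tip', 'no tip']
print(encoded.tolist())    # [3, 2, 1, 0, 2]

# inverse_transform maps the integers back to the original names.
print(enc.inverse_transform(encoded).tolist() == labels.tolist())  # True
```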

#### Task 9: Set Booleans to Int

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/unsupervised-learning#corise_clc2384f100073b6ozta1negh)

Taking a closer look at our dataset reveals that we have three boolean columns, which we want to convert to integers.

In [124]:
list(df_list.select_dtypes(include=["bool"]).columns)

['host_is_superhost', 'has_availability', 'instant_bookable']

Let's go ahead and convert it into int!

In [125]:
df_list["host_is_superhost"] = df_list["host_is_superhost"].astype(int)
df_list["has_availability"] = df_list["has_availability"].astype(int)
df_list["instant_bookable"] = df_list["instant_bookable"].astype(int)

<details>
<summary>Show Solution</summary>

```python
df_list["host_is_superhost"] = df_list["host_is_superhost"].astype(int)
df_list["has_availability"] = df_list["has_availability"].astype(int)
df_list["instant_bookable"] = df_list["instant_bookable"].astype(int)

```
</details>
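
If there were many boolean columns, converting them one by one would get tedious. As an alternative sketch (not the course solution), you could select all boolean columns and convert them in one go; the `toy` DataFrame below is made up for illustration:

```python
import pandas as pd

# Hypothetical frame with two boolean columns and one integer column.
toy = pd.DataFrame({
    "host_is_superhost": [False, True],
    "has_availability": [True, True],
    "accommodates": [7, 2],
})

# Find every boolean column, then cast them all at once with a dict.
bool_cols = toy.select_dtypes(include=["bool"]).columns
toy = toy.astype({col: int for col in bool_cols})

print(toy["host_is_superhost"].tolist())  # [0, 1]
print(toy["has_availability"].tolist())   # [1, 1]
```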

#### Task 10: Split our Dataset

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/load-inspect-split#corise_clcut65qu00072a73x6glgvcr)

Before we split our dataset we drop **price_in_dollar** and **host_reported_average_tip** since these were used to create **listing_tipping_group**.

In [126]:
X, y = (
    df_list[["host_acceptance_rate", "host_is_superhost",
             "has_availability", "number_of_reviews_l30d",
             "neighbourhood", "room_type",
             "accommodates", "review_scores_rating",
             "instant_bookable", "service_cost"]] , # YOUR CODE HERE
    df_list["listing_tipping_group"] # YOUR CODE HERE
)

<details>
<summary>Show Solution</summary>

```python
X, y = (
    df_list[["host_acceptance_rate", "host_is_superhost",
             "has_availability", "number_of_reviews_l30d",
             "neighbourhood", "room_type",
             "accommodates", "review_scores_rating",
             "instant_bookable", "service_cost"]],
        df_list[["listing_tipping_group"]]
        )

```
</details>

Now that we've split the dataset into X and y, let's also split it into a training, validation, and test set.

In [127]:
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split

def train_validation_test_split(
    X, y, train_ratio: float, validation_ratio: float, test_ratio: float
):
    # Split up dataset into train and test, of which we split up the test.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=(1 - train_ratio), random_state=SEED
    )

    # Split up test into two (validation and test).
    X_val, X_test, y_val, y_test = train_test_split(
        X_test,
        y_test,
        test_size=(test_ratio / (test_ratio + validation_ratio)),
        random_state=SEED,
    )

    # Return the splits
    return X_train, X_val, X_test, y_train, y_val, y_test


# Splits according to ratio of 75/15/10
X_train, X_val, X_test, y_train, y_val, y_test = train_validation_test_split(
    X, y, 0.75, 0.15, 0.1
)
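
To check the arithmetic behind the two-step split: the first split holds out 1 - 0.75 = 0.25 of the rows, and the second split sends 0.10 / (0.10 + 0.15) = 0.4 of that held-out part to the test set, which is 0.25 × 0.4 = 0.10 of all rows. The sketch below repeats the same two calls on 1000 made-up rows (the `X_toy` and `y_toy` arrays are hypothetical stand-ins) and prints the resulting fractions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
train_ratio, validation_ratio, test_ratio = 0.75, 0.15, 0.10

# Hypothetical stand-in for the real features and labels: 1000 rows.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.arange(1000)

# First split: hold out everything that is not training data.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=(1 - train_ratio), random_state=SEED
)

# Second split: divide the held-out part into validation and test.
X_va, X_te, y_va, y_te = train_test_split(
    X_rest,
    y_rest,
    test_size=test_ratio / (test_ratio + validation_ratio),
    random_state=SEED,
)

n = len(X_toy)
print(len(X_tr) / n, len(X_va) / n, len(X_te) / n)
```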

#### Task 11: Encode the Numerical Variables

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/load-inspect-split#corise_clcut70vz00082a73zc30553d)

Now the really interesting part starts: we bring the different ranges of the numerical variables onto the same scale (from 0 to 1) by using a [`MinMaxScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html). This way, algorithms that are sensitive to scale will not regard some features as more important than others purely because their original scale was much bigger.

So let's prepare a pipeline for the numerical columns!

In [136]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Select the numerical columns
numerical_cols_X = X_train.select_dtypes(include=["int64", "float64"]).columns

# Numerical pipeline
num_pipeline = Pipeline([
    ("scaler", MinMaxScaler())
])

<details>
<summary>Show Solution</summary>

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Select the numerical columns
numerical_cols_X = X_train.select_dtypes(include=["int64", "float64"]).columns

# Numerical pipeline
num_pipeline = Pipeline([
    ('scaler', MinMaxScaler())
])

```
</details>
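
With its default settings, `MinMaxScaler` applies $x' = (x - x_{min}) / (x_{max} - x_{min})$ to every column. The sketch below compares a manual calculation with the scaler on a made-up column (the `values` array is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column of values, e.g. how many guests a listing accommodates.
values = np.array([[2.0], [4.0], [7.0], [12.0]])

# Manual min-max scaling: (x - min) / (max - min)
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same operation through Scikit-learn.
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())               # [0.  0.2 0.5 1. ]
print(np.allclose(manual, scaled))  # True
```

Inside a pipeline, the minimum and maximum are learned from the data passed to `fit()`, and those same values are then reused when transforming the validation set.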

#### Task 12: Encode the Categorical Variables

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/load-inspect-split#corise_clcut70vz00082a73zc30553d)

Now, categorical variables often don't need scaling, but they do need proper encoding. We'll use [`OneHotEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for that.

Make such a pipeline!

In [137]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Select the categorical columns
categorical_cols_X = X_train.select_dtypes(include=["category"]).columns

# Categorical pipeline: containing only one encoder
cat_pipeline = Pipeline([
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse=False))
])

<details>
<summary>Show Solution</summary>

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Select the categorical columns
categorical_cols_X = X_train.select_dtypes(include=["category"]).columns

# Categorical pipeline: containing only one encoder
cat_pipeline = Pipeline([
    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

```
</details>
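
To see what `handle_unknown="ignore"` does, here is a minimal sketch with made-up data (the `train` and `new` DataFrames are hypothetical). A category that was not present during fitting is encoded as a row of zeros instead of raising an error:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"room_type": ["Private room", "Shared room", "Private room"]})
# "Hotel room" never appeared in the training data.
new = pd.DataFrame({"room_type": ["Shared room", "Hotel room"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# The default output is a sparse matrix; .toarray() makes it dense for printing.
encoded = enc.transform(new).toarray()

print(enc.categories_[0].tolist())  # ['Private room', 'Shared room']
print(encoded.tolist())             # [[0.0, 1.0], [0.0, 0.0]]
```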

#### Task 13: Combine the Pipelines in ColumnTransformer

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/load-inspect-split#corise_clcut70vz00082a73zc30553d)

Let's put the ColumnTransformer to good use and combine our two pipelines into one variable called *preprocessor*.

In [138]:
from sklearn.compose import ColumnTransformer

# Combine the two pipelines into one ColumnTransformer
preprocessor = ColumnTransformer([
    ('cat', cat_pipeline, categorical_cols_X),  # Notify Transformer which cols to use.
    ('num', num_pipeline, numerical_cols_X)  # Notify Transformer which cols to use.
])

<details>
<summary>Show Solution</summary>

```python
from sklearn.compose import ColumnTransformer

# Combine the two pipelines into one ColumnTransformer
preprocessor = ColumnTransformer([
    ('cat', cat_pipeline, categorical_cols_X),  # Notify Transformer which cols to use.
    ('num', num_pipeline, numerical_cols_X)  # Notify Transformer which cols to use.
])

```
</details>
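
As a sketch of what the `ColumnTransformer` produces, the example below runs one on a made-up three-row DataFrame (the `toy` data is hypothetical, and `sparse_threshold=0` is only set here to force a dense array that is easy to inspect). The one-hot columns come first, followed by the scaled numerical columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

toy = pd.DataFrame({
    "room_type": ["Private room", "Shared room", "Hotel room"],
    "accommodates": [7, 2, 4],
    "review_scores_rating": [4.61, 4.38, 5.00],
})

toy_preprocessor = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
        ("num", MinMaxScaler(), ["accommodates", "review_scores_rating"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = toy_preprocessor.fit_transform(toy)

# 3 one-hot columns (one per room type) + 2 scaled numerical columns
print(transformed.shape)  # (3, 5)
print(transformed)
```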

#### Task 14: Select our Model

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/scikit-advanced#corise_clbtolj4u00033b6ontw6x38q)

In this week's project we are going to use an [SVM classifier](https://youtu.be/_YPScrckx28) to make predictions! The implementation is called [`SVC()`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), and with pipelines it is just as simple to plug in as other models, like we see in [this example](https://corise.com/course/intermediate-python-for-data-science/v2/module/unsupervised-learning#corise_clc2384f100073b6ozta1negh).

*Make sure to set the seed of the SVM/SVC algorithm/model.*

In [139]:
from sklearn.svm import SVC

# Combine the preprocessor with the algorithm/model
pipeline = Pipeline([
    ("pre", preprocessor),  # YOUR CODE HERE
    ("model", SVC(random_state=SEED)),  # YOUR CODE HERE
])



<details>
<summary>Show Solution</summary>

```python
from sklearn.svm import SVC

# Combine the preprocessor with the algorithm/model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', SVC(random_state=SEED)),
])

```
</details>

#### Task 15: Train the Model

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/unsupervised-learning#corise_cld3n6qss000s2a6ls9js7l7g)

In [140]:
# Train the final pipeline on the training set.
pipeline.fit(X_train, y_train)


`sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.



<details>
<summary>Show Solution</summary>

```python
pipeline.fit(X_train, y_train)

```
</details>

#### Task 16: Measure our Performance

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/first-algo#corise_clcuzcwbf00052a732gxqjbkd)

During the course we've mentioned a few times that there are more metrics. Let's use another such metric here, [F1](https://towardsdatascience.com/the-f1-score-bec2bbc38aa6)! It is commonly used for classification, so let's try to implement [`f1_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)!

*Set the parameter `average="macro"` since for the `f1_score()` we have more than two classes that we are trying to predict (4).*

In [143]:
from sklearn.metrics import f1_score

# Do a "trial exam" using the validation set.
y_predict = pipeline.predict(X_val)

# Compare algorithms' "trial exam" vs. expected.
f1_score(y_val, y_predict, average="macro").round(4) # YOUR CODE HERE

0.7946

<details>
<summary>Show Solution</summary>

```python
from sklearn.metrics import f1_score

# Do a "trial exam" using the validation set.
y_predict = pipeline.predict(X_val)

# Compare algorithms' "trial exam" vs. expected.
f1_score(y_val, y_predict, average="macro").round(4)

```
</details>
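
With `average="macro"`, the F1 score is calculated for each class separately and the per-class scores are then averaged without weighting. The sketch below does this by hand on six made-up predictions and compares the result with `f1_score()` (the labels and the helper `f1_for_class` are hypothetical, for illustration only):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = ["no tip", "no tip", "low tip", "low tip", "high tip", "high tip"]
y_pred = ["no tip", "low tip", "low tip", "low tip", "high tip", "no tip"]


def f1_for_class(label):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)


classes = sorted(set(y_true))
# Macro average: the plain mean of the per-class F1 scores.
manual_macro = np.mean([f1_for_class(c) for c in classes])

print(round(manual_macro, 4))  # 0.6556
print(np.isclose(manual_macro, f1_score(y_true, y_pred, average="macro")))  # True
```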

---

#### (Optional) Task 17: Confusion Matrix

The confusion matrix is a great chart to visually capture how your model is performing. It shows you which labels were expected, and how they were actually predicted. The confusion matrix of the previous model reveals that:

- The diagonal holds the correctly predicted labels, which covers the majority of listings
- 7 average tips were incorrectly identified as high tip
- 21 average tips were incorrectly identified as low tip
- 59 high tips were incorrectly identified as average tip
- 24 low tips were incorrectly identified as average tip
- 6 no tips were incorrectly identified as average tip
- 1 no tip was incorrectly identified as high tip
- 4 no tips were incorrectly identified as low tip
- Summing a row (horizontally) gives you the **actual** total of that class
- Summing a column (vertically) gives you the **predicted** total of that class

[This video](https://youtu.be/Kdsp6soqA7o?t=24) also clearly explains how to interpret it.
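
To make the row and column sums concrete, here is a minimal sketch on six made-up predictions (the labels below are hypothetical and unrelated to the numbers listed above). Rows are the true labels and columns are the predicted labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["high tip", "low tip", "no tip"]
y_true = ["no tip", "no tip", "low tip", "low tip", "high tip", "high tip"]
y_pred = ["no tip", "low tip", "low tip", "low tip", "high tip", "no tip"]

conf = confusion_matrix(y_true, y_pred, labels=labels)
print(conf)

print(conf.sum(axis=1))  # actual count per class: [2 2 2]
print(conf.sum(axis=0))  # predicted count per class: [1 3 2]
print(np.trace(conf))    # correctly predicted samples: 4
```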

In [None]:
from sklearn.metrics import confusion_matrix
import plotly.express as px

conf_mat = confusion_matrix(y_val, y_predict, labels=pipeline["model"].classes_)

fig = px.imshow(conf_mat,
                labels=dict(x="Predicted Label", y="True Label"),
                x=pipeline["model"].classes_,
                y=pipeline["model"].classes_,
                text_auto=True)
fig.show()

#### (Optional) Task 18: Example prediction

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/first-algo#corise_clcuzcwbf00052a732gxqjbkdr)

Now, just like last week, let's retrain our model using only a few features. In this case we'll go with 4 features. We'll use this model to predict which tip bracket a listing falls into, and to make another Streamlit app!

Make a selection of these features for X_train and X_val:
- "review_scores_rating"
- "room_type"
- "service_cost"
- "instant_bookable"

In [None]:
from numpy import ravel  # change column-vector to 1d-array
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

X_train = ... # YOUR CODE HERE
X_val = ... # YOUR CODE HERE

# Select the categorical columns
categorical_cols_X = X_train.select_dtypes(include=["category"]).columns
numerical_cols_X = X_train.select_dtypes(include=["int64", "float64"]).columns

# Categorical pipeline: containing only one encoder
cat_pipeline = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])
# Numerical pipeline: containing only one scaler
num_pipeline = Pipeline([
    ('scaler', MinMaxScaler())
])

# Combine the two pipelines into one ColumnTransformer
preprocessor = ColumnTransformer([
    ('cat', cat_pipeline, categorical_cols_X),
    ('num', num_pipeline, numerical_cols_X)
])

# Combine the preprocessor with the algorithm/model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', SVC(random_state=SEED))
])

# Train the final pipeline on the training set.
pipeline.fit(X_train, ravel(y_train))

<details>
<summary>Show Solution</summary>

```python
from numpy import ravel  # change column-vector to 1d-array
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

X_train = X_train[["review_scores_rating", "room_type", "service_cost", "instant_bookable"]]
X_val = X_val[["review_scores_rating", "room_type", "service_cost", "instant_bookable"]]

# Select the categorical columns
categorical_cols_X = X_train.select_dtypes(include=["category"]).columns
numerical_cols_X = X_train.select_dtypes(include=["int64", "float64"]).columns

# Categorical pipeline: containing only one encoder
cat_pipeline = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])
# Numerical pipeline: containing only one scaler
num_pipeline = Pipeline([
    ('scaler', MinMaxScaler())
])

# Combine the two pipelines into one ColumnTransformer
preprocessor = ColumnTransformer([
    ('cat', cat_pipeline, categorical_cols_X),
    ('num', num_pipeline, numerical_cols_X)
])

# Combine the preprocessor with the algorithm/model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', SVC(random_state=SEED))
])

# Train the final pipeline on the training set.
pipeline.fit(X_train, ravel(y_train))

```
</details>

Now, let's run it fully, using the pipeline and observing the f1_score.

In [None]:
# Do a "trial exam" using the validation set.
y_predict = ... # YOUR CODE HERE

# Compare algorithms' "trial exam" vs. expected.
f1_score( ... ).round(4) # YOUR CODE HERE

<details>
<summary>Show Solution</summary>

```python
# Do a "trial exam" using the validation set.
y_predict = pipeline.predict(X_val)

# Compare algorithms' "trial exam" vs. expected.
f1_score(y_val, y_predict, average="macro").round(4)

```
</details>

While the metric performance has decreased, we were able to reduce the number of features to only 4. Now let's see if we can predict a new listing's expected tipping group.

In [None]:
# review_scores_rating: 0 to 5
# room_type: ['Shared room', 'Private room', 'Hotel room', 'Entire home/apt']
# service_cost: ['$0.99', '$4.99', '$2.99', '$10.99']
# instant_bookable: 0, 1
example = pd.DataFrame({
    "review_scores_rating": [0.50],
    "room_type": ["Shared room"],
    "service_cost": ["$0.99"],
    "instant_bookable": [0]
    })

pipeline.predict(example)[0]

Awesome! Today you've made a Machine Learning project, using Numpy, Pandas, Plotly and Scikit-Learn to create predictions from the Airbnb dataset! This is a simplified version of a professional workflow and is often how companies start: by first doing the bare minimum and then gradually expanding the complexity of the model. That way you can have something working quickly, and improve steadily with a very quick feedback loop!

Now as an extra we'll explore how to deploy this model as a Streamlit app!

### (Optional) Make an App for Your Portfolio!

<center>
  <img src=https://griddb-pro.azureedge.net/en/wp-content/uploads/2021/08/streamlit-1160x650.png width="500" align="center" />
</center>
<br/>

**Participants such as yourselves often want to use the weekly Uplimit projects for their portfolios. To facilitate that, we've created this section. It might seem like a lot, but it's actually just following instructions and copy-pasting. Reach out on Slack if you get stuck!**

You will make an app that wraps the model you just created in a neat Streamlit interface, where you can provide input through sliders and radio buttons!

<center>
  <img src=https://corise-ugc.com/static/course/intermediate-python-for-data-science/assets/clgawo8po03cx12d029tbha5c/Screenshot%202023-04-10%20150104.png width="500" align="center" />
</center>
<br/>

To visualize this, we will again use a library called [Streamlit](https://streamlit.io/). For now you are not expected to know how Streamlit works, but you are expected to be able to copy-paste and follow instructions if you want to share this project as part of your portfolio!

We are going to use [Streamlit Share](https://share.streamlit.io/) to host your projects. It's a website that allows us to host our interactive projects for free online! Again, we don't expect you to understand how to use and/or modify the code we will show below. We do expect you to read the instructions and copy-paste our code to the Streamlit Share platform. Feel free to change it any way you like. Some great starting points are [here](https://python.plainenglish.io/how-to-build-web-app-using-streamlit-pandas-numpy-5e134f0cf552), [here](https://docs.streamlit.io/library/get-started/create-an-app), [here](https://streamlit.io/components), and [here](https://streamlit.io/gallery)!

In [None]:
import pickle

# Give our pipeline a shorter name
model = pipeline

# Dump our model (the `with` block closes the file for us)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

In [None]:
from google.colab import files

# Download the file locally
files.download('model.pkl')
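
Before uploading a pickled model, it can be reassuring to check that it loads back and predicts the same values. The sketch below does this with a tiny toy pipeline and a temporary file named `toy_model.pkl` (both made up for illustration) rather than the notebook's real `pipeline`. The same round trip should work for `model.pkl`, assuming the same scikit-learn version is used for saving and loading.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Toy stand-in for the notebook's pipeline
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
toy_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", SVC(random_state=0)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_model.pkl"

    # Save the fitted pipeline to disk
    with open(path, "wb") as f:
        pickle.dump(toy_pipeline, f)

    # Load it back, as the Streamlit app will do
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should give identical predictions
same = np.array_equal(toy_pipeline.predict(X), restored.predict(X))
print(same)
```

Pickled scikit-learn models are not guaranteed to load under a different library version, which is one reason to pin versions in `requirements.txt`.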

In [None]:
%%writefile streamlit_app.py
import streamlit as st
import pickle
import pandas as pd

# Load the trained pipeline (the `with` block closes the file for us)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

st.title("Week 3: The Airbnb dataset of Amsterdam")
st.markdown(
    "The dataset contains modifications with regards to the original for illustrative & learning purposes"
)

st.text("This widget can be used by hosts to check their expected tips per listing.")

# review_scores_rating: 0 to 5
review_scores_rating = st.slider('What rating is this listing?', 0.00, 5.00, 4.50)
# room_type: ['Shared room', 'Private room', 'Hotel room', 'Entire home/apt']
room_type = st.radio(
    "What room type do you have?",
    ('Shared room', 'Private room', 'Hotel room', 'Entire home/apt'))
# service_cost: ['$0.99', '$4.99', '$2.99', '$10.99']
service_cost = st.radio(
    "What service cost does the listing have?",
    ('$0.99', '$4.99', '$2.99', '$10.99'))
# instant_bookable: 0, 1
instant_bookable = st.radio(
    "Is the listing instantly bookable?",
    ("True", "False"))
instant_bookable = 1 if instant_bookable == "True" else 0

example = pd.DataFrame({
    "review_scores_rating": [review_scores_rating],
    "room_type": [room_type],
    "service_cost": [service_cost],
    "instant_bookable": [instant_bookable]
    })

if st.button('Predict?'):
    st.write("The model predicts that the tipping category for this listing is:", model.predict(example)[0])

The **%%writefile [FILE_NAME].[FILE_EXTENSION]** command lets us save the code written in a cell as a file in your Google Colab instance. Once it is saved, we can download it as a file, as seen below:

In [None]:
from google.colab import files

# Download the file locally
files.download('streamlit_app.py')

In [None]:
%%writefile requirements.txt
streamlit
pandas==1.5.2
scikit-learn==1.2.0

In [None]:
from google.colab import files

# Download the file locally
files.download('requirements.txt')

Please verify that you've downloaded three files:
- `model.pkl`
- `streamlit_app.py`
- `requirements.txt`
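
If you prefer to check this with code rather than by eye, a minimal sketch could look like the following. The helper `missing_files` is hypothetical (it is not part of the course material), and you would point it at whichever folder your browser saved the downloads to.

```python
from pathlib import Path

# The three files the Streamlit app needs
EXPECTED_FILES = ("model.pkl", "streamlit_app.py", "requirements.txt")


def missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


# Check the current folder; an empty list means all three files are there
print(missing_files("."))
```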

Now let's head over to GitHub and [create an account](https://github.com/signup).

Then, once you are logged in, [go to GitHub.com](https://github.com), click on the **+** icon at the top-right corner, and select **New repository**.

<center>
  <img src=https://i.ibb.co/4gkPBCp/Screen-Shot-2022-11-28-at-1-51-02-PM.png width="300" align="center" />
</center>
<br/>

Here you provide:
- **Repository name**: Up to you
- **License**: Up to you. We recommend **apache-2.0**.

- **Public or private?** Public, otherwise you can't host it on [Streamlit Share](https://share.streamlit.io)!

<center>
  <img src=https://i.ibb.co/0B533dw/Screen-Shot-2022-11-28-at-1-55-14-PM.png width="450" align="center" />
</center>
<br/>

Then upload the three files at the URL below. ***Please modify it before copy-pasting it***:

```https://github.com/[YOUR_ACCOUNT_NAME]/[YOUR_REPOSITORY_NAME]/upload/main```

<center>
  <img src=https://i.ibb.co/jTsrgJw/Screen-Shot-2022-11-28-at-1-58-31-PM.png width="500" align="center" />
</center>
<br/>

Commit directly to the `main` branch, then click **Commit changes**.

Next, you have to create an account on [Streamlit Share](https://share.streamlit.io/signup).

<center>
  <img src=https://i.ibb.co/znFngJc/Screen-Shot-2022-11-28-at-1-59-47-PM.png width="500" align="center" />
</center>
<br/>

It's recommended to click **Continue with GitHub**.

Then, select **New app** **>** **Deploy a new app...** **>** **From existing repo**.

<center>
  <img src=https://i.ibb.co/VQPQzt3/Screen-Shot-2022-11-28-at-2-05-04-PM.png width="500" align="center" />
</center>

Followed by providing your:

```[GITHUB_ACCOUNT_NAME]/[GITHUB_REPOSITORY]```

<center>
  <img src=https://i.ibb.co/PDSQccD/Screen-Shot-2022-11-28-at-2-10-47-PM.png width="500" align="center" />
</center>

You will have to wait around 1-5 minutes, after which a link to your new website is generated automatically. It follows this pattern:

```https://[GITHUB_ACCOUNT_NAME]-[GITHUB_REPOSITORY]-[RANDOM_6_LETTER_STRING].streamlit.app/```

***Please modify the link before copy-pasting it.***

---

# 🎉 CONGRATULATIONS!!!

Awesome!! You've finished all the weekly assignments! Finishing Weeks 1, 2 and 3 is an amazing feat and requires a lot of hard work and dedication. Please take time to enjoy this!

If you have any lingering questions, post them on Slack! As you know, we're always here to help.

And if you want any additional challenge questions, check out the bonus extensions below.

---

## Extensions (Optional)

<center>
  <img src=https://upload.wikimedia.org/wikipedia/commons/c/c6/Celebration_fireworks.jpg width="500" align="center" />
</center>
<br/>

🎉🎉 Amazing 🎉🎉 You completed this week's project! Have you thought about extending this project by trying some extensions like:

- Using [PCA](https://corise.com/course/intermediate-python-for-data-science/v2/module/unsupervised-learning#corise_clc1r07oc00083b6oy094wn04) on some of the available attributes so that you can simplify the model?
- [Creating features based on the features you have available](https://corise.com/course/intermediate-python-for-data-science/v2/module/professionalize-moar#corise_clc262ibo00253b6o1omtabr1) and what can be found on the internet?
- Performing [hyperparameter tuning](https://corise.com/course/intermediate-python-for-data-science/v2/module/professionalize-moar#corise_clc24tq5a00103b6okrxvknkz) to find the best parameters for your algorithm?
- Checking whether it makes sense to use [other metrics](https://corise.com/course/intermediate-python-for-data-science/v2/module/professionalize-moar#corise_clc262ckm00233b6o546sk268)?
- Training the model on [different kinds of Classification algorithms](https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/) from the Scikit-learn toolkit?
- ...

The possibilities are endless!
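
As a starting point for the hyperparameter tuning idea, here is a minimal sketch using `GridSearchCV` on a pipeline with an `SVC`. It runs on synthetic data from `make_classification` rather than the Airbnb listings, and the grid values for `C` and `kernel` are arbitrary assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 4 numeric features, 3 classes
X, y = make_classification(
    n_samples=200,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", SVC(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "model__C": [0.1, 1, 10],
    "model__kernel": ["linear", "rbf"],
}

# Try every combination with 3-fold cross-validation, scored by macro F1
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

With the real project you would pass your own `pipeline`, `X_train` and `y_train` instead, and keep the validation set aside for the final "trial exam".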

# Next Up?
This was the course's last week. We hope you've enjoyed it as much as we did! Thank you for taking this course and working on these projects!