# LAB | Unsupervised learning

## Import libraries here

In [None]:
# Your code here
import pandas as pd
from sklearn.datasets import load_wine
import sklearn




# Challenge 1
## Import the Dataset
- In this challenge, we will work with the wine dataset from scikit-learn and use a clustering algorithm to explore what the library can do.
- We start off by loading the dataset using the `load_wine` function ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html)). In the cell below, we will import the function from scikit-learn.
- Run the code below:
```python
# import the dataset
from sklearn.datasets import load_wine
wine = load_wine()
# Dataset info
print(wine['DESCR'])
# Creating a dataframe
wine_df = pd.DataFrame(wine['data'],columns=wine['feature_names'])
```

In [None]:
# Your code here


## Clustering the Dataset
- In this portion of the lab, we will cluster the data to find common traits between the different wines. 
- We will use the K-Means clustering algorithm and `StandardScaler` to achieve this goal.

### Standardize
- Standardize `wine_df` and save the result in a variable called `wine_stand_arr`.
- You will need to use `StandardScaler` to standardize the features.
```python
from sklearn.preprocessing import StandardScaler
```
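
As a minimal sketch of what standardizing does, the example below applies `StandardScaler` to a small made-up array rather than to `wine_df`. The array `toy` and the name `toy_stand` are hypothetical and only serve as illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 samples, 2 features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation,
# then rescales the column; the result is a NumPy array
toy_stand = scaler.fit_transform(toy)

print(toy_stand.mean(axis=0))  # approximately 0 for each column
print(toy_stand.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns contribute on a comparable scale, which matters for distance-based methods such as K-Means.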

In [None]:
# Your code here

### Create K-Means clustering
- Import `KMeans` from scikit-learn and create 4 clusters.
- Use the dataset that you standardized.
```python
from sklearn.cluster import KMeans
```
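
The sketch below shows the general fit pattern for `KMeans` on made-up points. It uses 2 clusters because that suits the toy data; the lab itself asks for 4. The values of `n_init` and `random_state` are assumptions chosen only to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of points
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# n_init and random_state are assumptions for reproducibility
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_kmeans.fit(toy)

print(toy_kmeans.labels_)           # one integer label per row
print(toy_kmeans.cluster_centers_)  # one centroid per cluster
```

The integer assigned to each cluster is arbitrary, so only the grouping is meaningful, not the label values themselves.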

In [None]:
# Your code here

- Print the cluster labels using `.labels_`

In [None]:
# Your code here

- Compute the size of each cluster. 
- This can be done by counting the number of occurrences of each unique label.
- Which is the largest cluster of the 4?
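
One possible way to count label occurrences is sketched below on a made-up label array (`toy_labels` is hypothetical, not the output of your model).

```python
import numpy as np
import pandas as pd

# Hypothetical labels, standing in for the `.labels_` attribute of a fitted model
toy_labels = np.array([0, 2, 1, 2, 2, 3, 0, 2])

# value_counts returns the number of rows per label, largest first
sizes = pd.Series(toy_labels).value_counts()
print(sizes)

# idxmax gives the label of the largest cluster
print(sizes.idxmax())

# NumPy alternative: unique labels together with their counts
unique_labels, counts = np.unique(toy_labels, return_counts=True)
print(dict(zip(unique_labels, counts)))
```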

In [None]:
# Your code here

- Store the labels as a new column in your `wine_df` dataframe

In [None]:
# Your code here

- Group the dataset by cluster and aggregate (for example with the mean) so that you can see how the clusters differ.
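
A minimal sketch of the group-and-compare step, using a made-up dataframe. The column name `labels` and the values are assumptions for illustration; use whatever column name you chose for the cluster labels.

```python
import pandas as pd

# Hypothetical data: two features plus a cluster label per row
toy_df = pd.DataFrame({
    "alcohol": [12.0, 14.0, 12.5, 13.5],
    "hue": [1.0, 0.5, 1.2, 0.7],
    "labels": [0, 1, 0, 1],
})

# One row per cluster, holding the mean of every feature
cluster_means = toy_df.groupby("labels").mean()
print(cluster_means)
```

Comparing the rows of `cluster_means` shows which features separate the clusters most clearly.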

In [None]:
# Your code here

# Challenge 2
## Import the Dataset
- In this challenge, we will work with the patient dataset.
- Read the `patient-admission.csv` dataset and store it in a variable called `patients`.

In [None]:
# Your code here

- Transform the `patient_dob` and `appointment_date` columns to datetime using the `pd.to_datetime` function.
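
The conversion can be sketched as follows on a made-up dataframe. The date strings and their format are assumptions; the real file may store dates differently.

```python
import pandas as pd

# Hypothetical rows; the actual date format in the CSV may differ
toy = pd.DataFrame({
    "patient_dob": ["1980-05-17", "1992-11-03"],
    "appointment_date": ["2018-01-20", "2018-02-14"],
})

# Convert each string column to datetime64
for col in ["patient_dob", "appointment_date"]:
    toy[col] = pd.to_datetime(toy[col])

print(toy.dtypes)
```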

In [None]:
# Your code here

- Next, drop the `id`, `patient_name`, `patient_email`, `patient_nhs_number`, and `doctor_phone` columns. These are not quantitative columns and will not contribute to our analysis.

In [None]:
# Your code here

### Missing data
- Now we work on the missing data. Most ML algorithms will not perform as intended if there is missing data.
- Count how many values are missing in each column.
- You should see that three columns contain missing data:
>- `doctor_name`: 58 missing values
>- `prescribed_medicines`: 488 missing values
>- `diagnosis`: 488 missing values
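
A minimal sketch of counting missing values per column, on a made-up dataframe (the rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows with a few missing entries
toy = pd.DataFrame({
    "doctor_name": ["Dr. A", None, "Dr. B"],
    "diagnosis": [None, None, "flu"],
    "patient_show": [True, False, True],
})

# isna marks each missing cell as True; sum counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```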

In [None]:
# Your code here

- The main issues are found in the `prescribed_medicines` and `diagnosis` columns. Can we simply drop these rows?
- Not yet. Missing data in these columns does not mean the records are broken. It means that no medication was prescribed and no diagnosis was recorded, so filling in the missing values is enough to make these columns usable. We will revisit these columns later and decide whether to drop them once we have looked at how many unique values they contain.
- For the `prescribed_medicines` column, fill the missing values with the value `'no prescription'`. 
- For the `diagnosis` column, fill the missing values with `'no diagnosis'`.
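
The fill step can be sketched on a made-up dataframe like this (the rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows: the second visit has no prescription and no diagnosis
toy = pd.DataFrame({
    "prescribed_medicines": ["aspirin", None],
    "diagnosis": ["headache", None],
})

# Replace missing values with an explicit placeholder category
toy["prescribed_medicines"] = toy["prescribed_medicines"].fillna("no prescription")
toy["diagnosis"] = toy["diagnosis"].fillna("no diagnosis")

print(toy)
```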

In [None]:
# Your code here

- How about `doctor_name`? 
- Since a doctor visit without a doctor name might not be meaningful, we will drop these rows.
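
A minimal sketch of dropping only the rows where one specific column is missing, on made-up data:

```python
import pandas as pd

# Hypothetical rows: the second visit has no doctor name
toy = pd.DataFrame({
    "doctor_name": ["Dr. A", None, "Dr. B"],
    "patient_show": [True, False, True],
})

# subset restricts the missing-value check to the listed columns
toy = toy.dropna(subset=["doctor_name"])
print(len(toy))
```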

In [None]:
# Your code here

### Label encoding
- Another preprocessing step is label encoding.
- We have 4 columns of `bool` type that we would like to convert to integer columns containing either zero or one.
- We can do this using [scikit-learn's label encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
- In the cell below, import the label encoder and encode the 4 boolean columns `patient_diabetic`, `patient_allergic`, `patient_show`, and `is_regular_visit` with `0` and `1`.

```python
from sklearn.preprocessing import LabelEncoder
```
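
As a rough sketch, `LabelEncoder` can be applied column by column. The dataframe below is made up and contains only two of the four columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows with two boolean columns
toy = pd.DataFrame({
    "patient_diabetic": [True, False, True],
    "patient_show": [False, False, True],
})

encoder = LabelEncoder()
for col in ["patient_diabetic", "patient_show"]:
    # Classes are sorted, so False maps to 0 and True maps to 1
    toy[col] = encoder.fit_transform(toy[col])

print(toy)
print(toy.dtypes)
```

For boolean columns, `toy[col].astype(int)` would produce the same values; the encoder is used here because the lab asks for it.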

In [None]:
# Your code here

- Print the dtypes to confirm that those four `bool` columns were converted to `int64`.

In [None]:
# Your code here

#### Object data
- The last step is to handle the `object` data.
- There are 4 `object` columns now: `patient_gender`, `doctor_name`, `prescribed_medicines`, and `diagnosis`. 
- In the next cell, check the unique values of each of the `object` columns using `value_counts()`.

In [None]:
# Your code here

- The number of unique values is large for every `object` column except `patient_gender`.
- We will handle these columns differently:


- For `diagnosis`, there are too many unique values, which makes ML difficult. However, we can re-encode the column as a binary flag: with or without a diagnosis. Remember that at an earlier step we filled in the missing values of this column with *no diagnosis*. We can re-encode *no diagnosis* to `0` and all other values to `1`, which simplifies this column considerably.


- For `prescribed_medicines`, we can drop this column because it is perfectly correlated with `diagnosis`. Whenever there is no diagnosis, there is no prescribed medicine. So we don't need to keep this duplicated data.


- How about `doctor_name`? It has fewer unique values than the others, but still quite a few (19). We could keep it, but that would make the analysis more complicated, so to keep this lab short let's drop it.

- How about `patient_gender`? This one is easy. Just like re-encoding the boolean values, we can re-encode gender to `0` and `1` because there are only 2 unique values.

In the next cells, do the following:

1. Create a new column called `diagnosis_int` that has `0` and `1` based on the values in `diagnosis`.

1. Create a new column called `patient_gender_int` that has `0` and `1` based on the values in `patient_gender`.

1. Drop the following columns: `doctor_name`, `diagnosis`, `prescribed_medicines`, `patient_gender`, `patient_dob`, and `appointment_date`.
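
The re-encoding steps above can be sketched on a made-up dataframe. The gender values `"Female"` and `"Male"` are assumptions; check the actual values with `value_counts()` before encoding.

```python
import pandas as pd

# Hypothetical rows; the gender strings are assumed, not taken from the CSV
toy = pd.DataFrame({
    "diagnosis": ["no diagnosis", "flu", "asthma"],
    "patient_gender": ["Female", "Male", "Female"],
    "doctor_name": ["Dr. A", "Dr. B", "Dr. A"],
})

# 0 when no diagnosis was recorded, 1 otherwise
toy["diagnosis_int"] = (toy["diagnosis"] != "no diagnosis").astype(int)

# Which gender maps to 1 is arbitrary; here "Male" becomes 1
toy["patient_gender_int"] = (toy["patient_gender"] == "Male").astype(int)

# Remove the original columns once they are encoded
toy = toy.drop(columns=["diagnosis", "patient_gender", "doctor_name"])
print(toy)
```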

In [None]:
# Your code here

- Let's look at the head again to ensure the re-encoding and dropping are successful:

In [None]:
# Your code here

### Standardize
- Now it is time to standardize the data. Use `StandardScaler` to do it.
- Save the result in a variable called `patients_stand`.

In [None]:
# Your code here

### Clustering
- Our data is now ready for clustering. Let's use k-means again.
- Use `n_clusters = 4` 
- We start by initializing and fitting a model in the cell below
- Call this model `patients_cluster`

In [None]:
# Your code here

- Attach the labels to the dataframe. 
- Do this by accessing the `labels_` attribute of the `patients_cluster` model and assigning it to a new column in `patients` called `labels`.
- Group the dataset by cluster and aggregate (for example with the mean) so that you can see how the clusters differ.

In [None]:
# Your code here

## Bonus visualization
### Visualize K-Means Clusters

- How did k-means cluster the data? You can obtain an intuitive view with a scatter plot. 
- Generate a 2-d cluster plot below using `matplotlib`. 
- The standardized data has more than two dimensions, so first apply PCA to reduce it to a few components that can be plotted.
- Color the results by the labels of your k-means.


### PCA - 2 dimensions
- Apply PCA with 2 components to `patients_stand`.
```python
from sklearn.decomposition import PCA
```
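
A minimal sketch of the PCA call pattern, using random numbers in place of `patients_stand`. The array `toy_stand` and the seed are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for a standardized array: 20 rows, 5 features
rng = np.random.default_rng(0)
toy_stand = rng.normal(size=(20, 5))

pca = PCA(n_components=2)
# Each row is projected onto the first two principal components
toy_pca = pca.fit_transform(toy_stand)

print(toy_pca.shape)                  # (20, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The columns `toy_pca[:, 0]` and `toy_pca[:, 1]` correspond to component 1 and component 2.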

In [None]:
# Your code here

### Plot PCA
- Use a scatter plot to visualize your data.
- Use component 1 and component 2 from the results of PCA
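
One way to draw such a plot with `matplotlib` is sketched below. The points and labels are random placeholders (`toy_pca` and `toy_labels` are hypothetical), and the colormap is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the PCA output and the k-means labels
rng = np.random.default_rng(0)
toy_pca = rng.normal(size=(30, 2))
toy_labels = rng.integers(0, 4, size=30)

fig, ax = plt.subplots(figsize=(6, 6))
# c colors each point by its cluster label
scatter = ax.scatter(toy_pca[:, 0], toy_pca[:, 1], c=toy_labels, cmap="viridis")
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
# legend_elements builds one legend entry per distinct label
ax.legend(*scatter.legend_elements(), title="Cluster")
plt.show()
```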

In [None]:
# Your code here

### PCA - 3 dimensions
- Apply PCA with 3 components to `patients_stand`.

In [None]:
# Your code here

### Plot PCA 3D
- Additionally, you can visualize the clusters in a 3-D scatter plot.
- Use the code below, where `pc1`, `pc2`, and `pc3` are the three components from the previous step:
```python
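import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed to register the 3-D projection on older matplotlib versions

# pc1, pc2, pc3: the three principal components from the 3-component PCA
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
# Color each point by the k-means label stored in the `labels` column
ax.scatter(pc1, pc2, pc3, depthshade=True, c=patients['labels'], alpha=1)
plt.show()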
```

In [None]:
# Your code here