<p><font size="6" color='grey'> <b>
Machine Learning
</b></font> </br></p>
<p><font size="5" color='grey'> <b>
Unsupervised Learning - KMeans & DBSCAN - Location
</b></font> </br></p>

---


In [None]:
#@title 🔧 Colab environment { display-mode: "form" }
!uv pip install --system -q git+https://github.com/ralf-42/Python_Modules
from ml_lib.utilities import get_ipinfo
import sys
print()
print(f"Python Version: {sys.version}")
print()
get_ipinfo()

# 0 | Install & Import
***

In [None]:
# Install

In [None]:
# Import
from pandas import read_csv, DataFrame, concat
import numpy as np

from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, silhouette_samples

import plotly.express as px

from yellowbrick.cluster import SilhouetteVisualizer, intercluster_distance

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# 1 | Understand

---

<p><font color='black' size="5">📋 Checklist</font></p>

✅ Understand the task</br>
✅ Collect data</br>
✅ Statistical analysis (min, max, mean, correlation, ...)</br>
✅ Data visualization (scatter plot, box plot, ...)</br>
✅ Define the Prepare steps</br>

<p><font color='black' size="5">
Use case
</font></p>

Locations can be grouped into clusters based on their geodata (latitude and longitude).
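
When working with coordinates, it helps to keep in mind that latitude and longitude are angles in degrees, not kilometres. The following is a minimal sketch of the haversine formula for the great-circle distance between two points. The function name `haversine_km`, the mean Earth radius of 6371 km, and the two city coordinates are illustrative assumptions and are not taken from the data set.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate city centres, used only as an illustration
los_angeles = (34.05, -118.24)
san_francisco = (37.77, -122.42)

distance_km = haversine_km(*los_angeles, *san_francisco)
print(f"Los Angeles -> San Francisco: {distance_km:.0f} km")
```

One degree of longitude covers fewer kilometres the further a point lies from the equator, so a Euclidean distance on raw degrees is only an approximation of the real distance.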




<p><font color='black' size="5">
Load data
</font></p>



In [None]:
filename = "https://raw.githubusercontent.com/ralf-42/ML_Intro/main/02%20data/location_data.csv"
df = read_csv(filename, encoding="ISO-8859-1")

In [None]:
# Keep only the California rows and renumber the index from 0
data = df[df.region == "California"].reset_index(drop=True)

In [None]:
data = data[["Lat", "Long"]].copy()  # independent copy, so columns can be added later

<p><font color='black' size="5">
EDA (Exploratory Data Analysis) with pandas
</font></p>

In [None]:
data.info()

In [None]:
data.describe().T

# 2 | Prepare

---

<p><font color='black' size="5">📋 Checklist</font></p>

✅ Drop features that are not needed</br>
✅ Determine/change data types</br>
✅ Find/remove duplicates</br>
✅ Handle missing values</br>
✅ Handle outliers</br>
✅ Encode categorical features</br>
✅ Scale numerical features</br>
✅ Feature engineering (create new features)</br>
✅ Reduce dimensionality</br>
✅ Resampling (over-/undersampling)</br>
✅ Create/configure pipeline</br>
✅ Perform train-test split</br>

# 3 | Modeling
---

<p><font color='black' size="5">📋 Checklist</font></p>

✅ Select a model</br>
✅ Extend/configure pipeline</br>
✅ Run training</br>
✅ Hyperparameter tuning</br>
✅ Cross-validation</br>
✅ Bootstrapping</br>
✅ Regularization</br>


<p><font color='black' size="5">
Model selection & training
</font></p>

In [None]:
model_kmeans = KMeans(n_clusters=3)
model_kmeans.fit(data)
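
The cell above fixes `n_clusters=3`. One way to sanity-check such a choice is to compare the silhouette score for several values of k. The sketch below does this on synthetic data rather than on the location data. The `make_blobs` centres, the range 2 to 7, `n_init=10`, and `random_state=42` are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the Lat/Long columns: four well-separated groups
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
    cluster_std=0.4,
    random_state=42,
)

# Fit one model per candidate k and record its silhouette score
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```

On real data the maximum is often less pronounced, so the score is best read together with a map of the resulting clusters.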

In [None]:
eps = 0.5
min_samples = 5
model_dbscan = DBSCAN(eps=eps, min_samples=min_samples)
model_dbscan.fit(data)
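
The value `eps = 0.5` is measured in the units of the input, here degrees. A common heuristic for choosing `eps` is to sort the distance of every point to its k-th nearest neighbour and look for the "knee" of that curve. The sketch below computes these k-distances on synthetic data. The blob centres and the percentiles printed at the end are illustrative assumptions, not a rule.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the Lat/Long columns
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [5, 5]], cluster_std=0.3, random_state=0
)

min_samples = 5
# Querying the training data returns each point as its own nearest neighbour
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = neighbors.kneighbors(X)

# Distance to the farthest of the min_samples neighbours, sorted ascending
k_distances = np.sort(distances[:, -1])
print(f"Median k-distance: {np.median(k_distances):.3f}")
print(f"95th percentile:   {np.percentile(k_distances, 95):.3f}")
```

Plotting `k_distances` and picking an `eps` near the point where the curve bends upward is a reasonable starting value, which can then be checked on the map.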

# 4 | Evaluate
---

<p><font color='black' size="5">📋 Checklist</font></p>

✅ Create predictions (train, test)</br>
✅ Check model quality</br>
✅ Perform residual analysis</br>
✅ Check feature importance/selection</br>
✅ Perform robustness tests</br>
✅ Create model interpretation</br>
✅ Perform sensitivity analysis</br>
✅ Communication (key takeaways)</br>

## 4.1 | KMeans
---

<p><font color='black' size="5">
Silhouette Coefficient KMeans
</font></p>

In [None]:
s_score_kmeans = silhouette_score(
    data[["Lat", "Long"]], model_kmeans.labels_, metric="euclidean"
)
print(f"Silhouette coefficient KMeans: {s_score_kmeans:0.2f}")

In [None]:
silhouette_vals = silhouette_samples(data, model_kmeans.labels_)
silhouette_vals[:10]
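
For a single point, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below recomputes this by hand for one point and compares it with `silhouette_samples`. The five points and their labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made example with two clusters (hypothetical points)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1])

i = 0  # index of the point to inspect
dist_to_all = np.linalg.norm(X - X[i], axis=1)

# a: mean distance to the other members of the point's own cluster
same = labels == labels[i]
same[i] = False  # the point itself is excluded
a = dist_to_all[same].mean()

# b: smallest mean distance to the members of any other cluster
b = min(
    dist_to_all[labels == c].mean() for c in np.unique(labels) if c != labels[i]
)

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X, labels)[i]
print(f"a={a:.3f}, b={b:.3f}, manual={s_manual:.3f}, sklearn={s_sklearn:.3f}")
```

Values close to 1 indicate a point that sits well inside its cluster, values near 0 a point on the border between two clusters, and negative values a point that is closer to another cluster.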

In [None]:
unique_clusters = np.unique(model_kmeans.labels_)
for cluster in unique_clusters:
    # KMeans assigns every point to a cluster, so there is no noise label (-1) to skip
    cluster_avg_silhouette = np.mean(
        silhouette_vals[model_kmeans.labels_ == cluster]
    )
    print(
        f"Cluster {cluster}: mean silhouette coefficient = {cluster_avg_silhouette:.3f}"
    )

In [None]:
visualizer = SilhouetteVisualizer(model_kmeans, colors="yellowbrick")
visualizer.fit(data)
visualizer.show()

In [None]:
visualizer = intercluster_distance(model_kmeans, data, random_state=42)

<p><font color='black' size="5">
Centroids
</font></p>

In [None]:
print("Centroids:")
DataFrame(model_kmeans.cluster_centers_, columns=data.columns)
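
The centroids are what a fitted KMeans model uses to classify new locations: `predict` returns the index of the nearest centroid. The sketch below shows this on a handful of made-up coordinates and checks the result by hand. The points, `n_init=10`, and `random_state=42` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates forming two groups (not taken from the data set)
X = np.array(
    [
        [34.0, -118.2], [34.1, -118.3], [33.9, -118.1],
        [37.7, -122.4], [37.8, -122.5], [37.6, -122.3],
    ]
)
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

new_point = np.array([[37.5, -122.0]])
predicted = model.predict(new_point)[0]

# Same result by hand: index of the centroid with the smallest Euclidean distance
nearest = np.argmin(np.linalg.norm(model.cluster_centers_ - new_point, axis=1))
print(f"predict -> {predicted}, nearest centroid -> {nearest}")
```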

## 4.2 | DBSCAN
---

<p><font color='black' size="5">
Silhouette Coefficient DBSCAN
</font></p>

In [None]:
s_score_dbscan = silhouette_score(
    data[["Lat", "Long"]], model_dbscan.labels_, metric="euclidean"
)
print(f"Silhouette coefficient DBSCAN: {s_score_dbscan:0.2f}")

In [None]:
silhouette_vals = silhouette_samples(data, model_dbscan.labels_)
silhouette_vals[:10]

In [None]:
unique_clusters = np.unique(model_dbscan.labels_)
for cluster in unique_clusters:
    if cluster != -1:  # skip noise points (label -1)
        cluster_avg_silhouette = np.mean(
            silhouette_vals[model_dbscan.labels_ == cluster]
        )
        print(
            f"Cluster {cluster}: mean silhouette coefficient = {cluster_avg_silhouette:.3f}"
        )
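
DBSCAN marks points that belong to no cluster with the label `-1`. If these points are passed to `silhouette_score`, they are treated as one additional cluster, which can distort the overall score. One option, sketched below on synthetic data, is to count the noise points and compute the score on the clustered points only. The blob centres and the two added outliers are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [5, 5]], cluster_std=0.3, random_state=0
)
# Two far-away points that have no neighbours within eps
X = np.vstack([X, [[20.0, 20.0], [-20.0, 20.0]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

noise_mask = labels == -1
n_clusters = len(set(labels[~noise_mask]))
print(f"Clusters: {n_clusters}, noise points: {noise_mask.sum()}")

# The silhouette score needs at least two clusters among the remaining points
if n_clusters >= 2:
    score = silhouette_score(X[~noise_mask], labels[~noise_mask])
    print(f"Silhouette without noise: {score:.2f}")
```

Reporting the number of noise points next to the score keeps the comparison with KMeans fair, because KMeans always assigns every point to a cluster.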

## 4.3 | Visualization
---

<p><font color='black' size="5">
Assemble the data for visualization
</font></p>

In [None]:
# Store the labels as strings so that Plotly treats them as discrete categories
data["KMeans"] = model_kmeans.labels_.astype(str)
data["DBSCAN"] = model_dbscan.labels_.astype(str)

<p><font color='black' size="5">
Visualization KMeans
</font></p>

In [None]:
fig = px.scatter_mapbox(
    data, lat="Lat", lon="Long", color="KMeans", zoom=5, width=1200, height=600
)

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(title="KMeans Clustering")
fig.show()

<p><font color='black' size="5">
Visualization DBSCAN
</font></p>

In [None]:
fig = px.scatter_mapbox(
    data, lat="Lat", lon="Long", color="DBSCAN", zoom=5, width=1200, height=600
)

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(title="DBSCAN Clustering")
fig.show()

# 5 | Deploy
---

<p><font color='black' size="5">📋 Checklist</font></p>

✅ Model export and storage</br>
✅ Dependencies and environment</br>
✅ Security and data privacy</br>
✅ Integrate into production</br>
✅ Tests and validation</br>
✅ Documentation & maintenance</br>
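
For the first checklist item, a fitted scikit-learn model can be written to disk and loaded again with `joblib`. The following is a minimal sketch under assumptions: the file name `kmeans_location.joblib`, the temporary directory, and the four sample points are hypothetical and only serve the illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates (not taken from the data set)
X = np.array([[34.0, -118.2], [34.1, -118.3], [37.7, -122.4], [37.8, -122.5]])
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "kmeans_location.joblib"  # hypothetical file name
    joblib.dump(model, model_path)      # export
    restored = joblib.load(model_path)  # import, e.g. in a separate service

same_centers = np.allclose(model.cluster_centers_, restored.cluster_centers_)
print(f"Centroids identical after reload: {same_centers}")
```

A restored KMeans model can label new locations with `predict`. scikit-learn's DBSCAN has no `predict` method, so new points require refitting or a separate assignment rule. Loading a file like this is safest with the same scikit-learn version that wrote it, and only from a trusted source.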