![Alt text](https://imgur.com/orZWHly.png=80)
source: @allison_horst https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as `penguins.csv`.

**Origin of this data**: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

**The dataset consists of 5 columns.**

Column | Description
--- | ---
culmen_length_mm | culmen length (mm)
culmen_depth_mm | culmen depth (mm)
flipper_length_mm | flipper length (mm)
body_mass_g | body mass (g)
sex | penguin sex

Unfortunately, they have not been able to record the species of penguin, but they know that there are **at least three** species that are native to the region: **Adelie**, **Chinstrap**, and **Gentoo**.  Your task is to apply your data science skills to help them identify groups in the dataset!
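
Since the researchers only know that there are *at least* three species, it can help to check how the clustering quality changes with the number of clusters before settling on a value. The sketch below is one common way to do that (the "elbow" heuristic on KMeans inertia). It is only an illustration: it runs on synthetic stand-in data generated with NumPy, not on `penguins.csv`, and the group centers, the sample size and the range of `k` are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the penguin measurements): three made-up,
# well-separated groups with four numeric features each, so the sketch
# can run on its own without penguins.csv.
rng = np.random.default_rng(0)
centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 0.0],
    [0.0, 5.0, 5.0, 5.0],
])
X = np.vstack([c + rng.normal(scale=1.0, size=(100, 4)) for c in centers])

# Standardize so that no single feature dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Fit KMeans for a range of k and record the inertia
# (within-cluster sum of squared distances to the nearest centroid).
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_

# Print the values; the "elbow" is where adding clusters stops paying off.
for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

To try this on the real data, `X_scaled` would be replaced by the standardized penguin features. The elbow is a heuristic rather than a definitive answer, so the chosen number of clusters should still be checked against what the researchers know about the species.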

# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the dataset
penguins_df = pd.read_csv("penguins.csv")

# Check each column for missing values
print(penguins_df.isna().sum())

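# Convert the categorical feature 'sex' into 0/1 indicator columns
# (dtype=int keeps the 0/1 output; newer pandas versions default to booleans)
sex_dummies = pd.get_dummies(penguins_df['sex'], dtype=int)
print(sex_dummies.head())

# Append the indicator columns to the original data
df_new = pd.concat([penguins_df, sex_dummies], axis=1)
print(df_new.head())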

# Select the features for clustering
features = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', 'FEMALE', 'MALE']
X = df_new[features]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize KMeans with 3 clusters
# (n_init is set explicitly because its default differs between scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Fit the model
kmeans.fit(X_scaled)

# Get cluster assignments
df_new['cluster'] = kmeans.labels_
print(df_new.head())

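# Mean of each numeric feature per cluster
# (a list of columns is required here; selecting with a bare tuple fails in pandas 2.x)
numeric_features = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
stat_penguins = df_new.groupby('cluster')[numeric_features].mean().reset_index()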

print(stat_penguins.head())


culmen_length_mm     0
culmen_depth_mm      0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
   FEMALE  MALE
0       0     1
1       1     0
2       1     0
3       1     0
4       0     1
   culmen_length_mm  culmen_depth_mm  flipper_length_mm  ...     sex FEMALE  MALE
0              39.1             18.7              181.0  ...    MALE      0     1
1              39.5             17.4              186.0  ...  FEMALE      1     0
2              40.3             18.0              195.0  ...  FEMALE      1     0
3              36.7             19.3              193.0  ...  FEMALE      1     0
4              39.3             20.6              190.0  ...    MALE      0     1

[5 rows x 7 columns]
   culmen_length_mm  culmen_depth_mm  flipper_length_mm  ...  FEMALE MALE  cluster
0              39.1             18.7              181.0  ...       0    1        2
1              39.5             17.4              186.0  ...       1    0        0
2            