![Alt text](https://imgur.com/orZWHly.png=80)
source: @allison_horst https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as `penguins.csv`.

**Origin of this data** : Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

**The dataset consists of 5 columns.**

Column | Description
--- | ---
culmen_length_mm | culmen length (mm)
culmen_depth_mm | culmen depth (mm)
flipper_length_mm | flipper length (mm)
body_mass_g | body mass (g)
sex | penguin sex

Unfortunately, they have not been able to record the species of each penguin, but they know that **at least three** species are native to the region: **Adelie**, **Chinstrap**, and **Gentoo**. Your task is to apply your data science skills to help them identify groups in the dataset!

In [91]:
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
penguins_df.head()

,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,39.1,18.7,181.0,3750.0,MALE
1,39.5,17.4,186.0,3800.0,FEMALE
2,40.3,18.0,195.0,3250.0,FEMALE
3,36.7,19.3,193.0,3450.0,FEMALE
4,39.3,20.6,190.0,3650.0,MALE


In [92]:
penguins_df.isnull().sum()

culmen_length_mm     0
culmen_depth_mm      0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

In [93]:
# Encode sex numerically so it can be scaled and clustered with the other features
penguins_df['sex'] = penguins_df['sex'].map({'MALE': 0, 'FEMALE': 1})

In [94]:
penguins_df.isnull().sum()

culmen_length_mm     0
culmen_depth_mm      0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
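
`Series.map` with a dict returns `NaN` for any value that is not a key of the dict, so the second null check above is what confirms that every `sex` entry was either `MALE` or `FEMALE`. A minimal sketch of that behaviour on a small made-up frame follows; the values, including the stray `"."`, are illustrative assumptions and not rows from `penguins.csv`.

```python
import pandas as pd

# Hypothetical toy data; the "." entry stands in for an unexpected category.
toy = pd.DataFrame({"sex": ["MALE", "FEMALE", ".", "FEMALE"]})

# Values missing from the mapping dict become NaN.
encoded = toy["sex"].map({"MALE": 0, "FEMALE": 1})

# Count the rows that the mapping could not encode.
n_unmapped = int(encoded.isnull().sum())
print(n_unmapped)
```

A non-zero count here would mean some rows need cleaning before clustering, since `KMeans` does not accept missing values.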

In [95]:
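# Standardize the features so that body mass (in grams) does not dominate the distances
scaler = StandardScaler()
# Three clusters, matching the minimum number of species expected in the region
kmeans = KMeans(n_clusters=3)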

In [96]:
pipeline = make_pipeline(scaler, kmeans)

In [97]:
pipeline.fit(penguins_df)
labels = pipeline.predict(penguins_df)
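
Since the researchers only know that there are at least three species, `n_clusters=3` is a starting point rather than a given. One common sanity check is to compare the inertia (within-cluster sum of squares) across several values of k and look for a bend in the curve. The sketch below makes several assumptions: it uses synthetic data from `make_blobs` in place of `penguins.csv`, and the feature names, the range of k, `n_init`, and `random_state` are arbitrary illustrative choices.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the penguin measurements (not the real data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
features = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])

inertias = {}
for k in range(1, 7):
    # Same scaler + KMeans pipeline as in the notebook, refit for each candidate k.
    pipe = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=k, n_init=10, random_state=42),
    )
    pipe.fit(features)
    # Inertia of the fitted KMeans step, measured in the scaled feature space.
    inertias[k] = pipe.named_steps["kmeans"].inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On real measurements the bend may be less clear than on synthetic blobs, so this is best treated as one piece of evidence alongside domain knowledge.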

In [98]:
penguins_df['labels']=labels

In [99]:
penguins_df

,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,labels
0,39.1,18.7,181.0,3750.0,0,2
1,39.5,17.4,186.0,3800.0,1,0
2,40.3,18.0,195.0,3250.0,1,0
3,36.7,19.3,193.0,3450.0,1,0
4,39.3,20.6,190.0,3650.0,0,2
...,...,...,...,...,...,...
327,47.2,13.7,214.0,4925.0,1,1
328,46.8,14.3,215.0,4850.0,1,1
329,50.4,15.7,222.0,5750.0,0,1
330,45.2,14.8,212.0,5200.0,1,1


In [100]:
# Select the columns with a list (double brackets); pandas 2.x rejects the tuple form
stat_penguins = penguins_df.groupby('labels')[['body_mass_g', 'flipper_length_mm', 'culmen_depth_mm', 'culmen_length_mm']].mean()

In [101]:
stat_penguins

labels,body_mass_g,flipper_length_mm,culmen_depth_mm,culmen_length_mm
0,3419.158879,189.046729,17.611215,40.217757
1,5092.436975,217.235294,14.996639,47.568067
2,4006.603774,194.764151,19.111321,43.878302
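
Cluster means are easier to interpret next to the cluster sizes. A minimal sketch, assuming a small made-up frame with the same column layout as the labelled data; the values and the `n_penguins` column name are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of the labelled frame.
toy = pd.DataFrame({
    "body_mass_g": [3400.0, 3500.0, 5000.0, 5200.0, 4000.0],
    "flipper_length_mm": [188.0, 190.0, 216.0, 218.0, 195.0],
    "labels": [0, 0, 1, 1, 2],
})

numeric_cols = ["body_mass_g", "flipper_length_mm"]

# A list of column names selects several columns from the groupby at once.
summary = toy.groupby("labels")[numeric_cols].mean()
# Add the number of rows per cluster for context.
summary["n_penguins"] = toy.groupby("labels").size()
print(summary)
```

A very small cluster next to two large ones can hint that k is too high, or that one feature is driving the split.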
